Collect user activity in Kafka?

I want to provide a fast way to look up a user's availability status.
Reads from storage need to be as fast as possible, so I chose Redis to store the availability status of each user.
Besides that, I need to store more extended information about available users, such as region, time of login, etc.
For this purpose I use Kafka, where this data is stored.
The question is: how do I synchronise Kafka and Redis?
What should the sequence be: first store the online-user event in Kafka and then sink it to Redis,
or first store it in Redis and then write it to Kafka asynchronously?
I am worried about the latency of the sink operation between Kafka and Redis.

As I understood from the question, you want to store only the user and their status in Redis, and the complete profile in Kafka.
I am not sure about the reason for choosing Kafka as your primary source of all data, or how you are planning to use the data stored there.
If storing the data in Kafka is really important to you, then I'd suggest updating your primary store first (Kafka or anything else) and then updating the cache.
In this case, you need to do a synchronous send on the Kafka producer and, once it succeeds, update your cache.
As your read operations come only from Redis, read performance will not be impacted.
Opting for a synchronous producer does add a little overhead compared to an asynchronous one, because it waits for the acknowledgement.
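A minimal sketch of that write path, assuming String-serialized messages, a hypothetical user-activity topic, a user:status:<id> Redis key scheme, and the Jedis client for Redis: the producer send is made synchronous with get(), and the cache is only updated after Kafka acknowledges the write.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import redis.clients.jedis.Jedis;

public class UserStatusPublisher {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for a replicated acknowledgement

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             Jedis redis = new Jedis("localhost", 6379)) {
            String userId = "user-42"; // hypothetical user id
            String event = "{\"status\":\"online\",\"region\":\"eu-west\",\"loginAt\":1700000000}";

            // 1. Write the full event to Kafka synchronously (get() blocks until the broker acks).
            producer.send(new ProducerRecord<>("user-activity", userId, event)).get();

            // 2. Only after Kafka confirms the write, update the Redis key used for fast reads.
            redis.set("user:status:" + userId, "online");
        }
    }
}

Reads keep hitting only Redis, so the extra acknowledgement latency is paid on the write path alone.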

Related

Direct Kafka Topic to Database table

Is there a way to automatically tell Kafka to send all events of a specific topic to a specific table of a database, so that I can avoid creating a new consumer that reads from that topic and performs the copy explicitly?
You have two options here:
Kafka Connect - this is the standard way to connect your Kafka to a database. There are a lot of connectors. In order to choose one:
The best bet is to use the specific connector for your database that is maintained by Confluent.
If you don't have a specific one, the second-best option is to use the JDBC connector (a sample sink configuration is sketched after this answer).
Direct ingestion into the database, if your database supports it (for instance, ClickHouse and MemSQL are able to load data coming from a Kafka topic). The difference between this and Kafka Connect is that this way the integration is fully supported and tested by the database vendor, and you have fewer pieces of infrastructure to maintain.
Which one is better? It depends on:
your data volume
how much you can (and need to!) parallelize the load
and how much you can tolerate downtime or latencies.
Direct ingestion into the database usually runs as a single node (consumer) pulling from Kafka.
It is good for low-to-mid volume data traffic. If it fails (or throttles), you might have latency issues.
Kafka Connect allows you to insert data into the database in parallel using several workers. If one of the workers fails, the load is redistributed among the others. If you have a lot of data, this is probably the best way to load it into the database, but you'll need to take care of the Kafka Connect infrastructure unless you're using a managed cloud offering.
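As a concrete illustration of the JDBC sink option mentioned above, a minimal standalone-mode connector configuration could look like the following; the connector name, topic, and connection details are hypothetical placeholders, and the exact set of options depends on your database and connector version.

name=jdbc-sink-example
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=2
topics=my_topic
connection.url=jdbc:postgresql://localhost:5432/mydb
connection.user=my_user
connection.password=my_password
auto.create=true
insert.mode=upsert
pk.mode=record_key
pk.fields=id

With a config along these lines, records from my_topic are written into a table of the same name, and tasks.max controls how many workers load data in parallel.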

Streaming on-demand data on to Kafka topics based on consumer requests

We are a source system, and we have a couple of downstream systems that require our data for their needs. Currently we publish events onto Kafka topics whenever there is a change in the source system, for the downstream systems to consume and apply to their tables (all delta updates).
Apart from subscribing to the Kafka topics, our downstream systems currently access our database directly once in a while to do a complete on-demand refresh of their tables and make sure the data is in sync; as you know, a full data refresh is sometimes needed when we feel the data is out of sync for some reason.
We are planning to stop giving direct access to our database. How can we achieve this? Is there a way for consumers to signal their data needs to us, for example by sending us a request that triggers us to publish the corresponding stream of data for them to consume, so they can sync their tables or pull the bulk data into memory to perform their tasks?
We currently have RESTful APIs that provide data based on requests, but they expose only small data volumes; I think APIs are only suitable for sending smaller amounts of data, and they won't work when we want to send millions of records to consumers. I believe the only way is to stream the data over Kafka, but with Kafka, how can we respond to a consumer's request and pump only that specific data onto Kafka topics for them to consume?
You have the option of setting the retention policy on any topic to keep messages forever with:
retention.ms: -1
see the docs
In that case you could store the entire change log in the same manner that you currently are. Then if a consumer needs to re-materialize the entire history, they can start with the first offset and go from there without you having to produce a specialized dataset.
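A rough sketch of what such a full replay could look like on the consumer side, assuming a hypothetical source-change-log topic with String keys and values: the consumer seeks back to the first offset of its assigned partitions and re-applies every change in order.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class FullRefreshConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "downstream-full-refresh");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"); // start from offset 0 if no position exists

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("source-change-log"));
            consumer.poll(Duration.ofMillis(100));            // join the group and receive partition assignments
            consumer.seekToBeginning(consumer.assignment());  // force a replay even if offsets were committed before

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Re-apply each change to the downstream table in order.
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}

The important part is that the replay is entirely under the consumer's control; the producer does not have to publish anything special for an on-demand refresh.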

Kafka streams state store for what?

As I understood from the book, a Kafka Streams state store is an in-memory key/value store used to hold data on its way to Kafka or after filtering.
I am confused by some theoretical questions.
What is the difference between Kafka Streams state and other in-memory storage like Redis, etc.?
What is a real use case for a state store in Kafka Streams?
Why is a topic not an alternative to a state store?
Why is a topic not an alternative to a state store?
A topic contains messages in a sequential order that typically represents a log.
Sometimes we want to aggregate these messages, group them, perform an operation such as a sum, and store the result in a place from which we can retrieve it later using a key. In this case, an ideal solution is a key-value store rather than a topic, which is a log structure.
What is a real use case for a state store in Kafka Streams?
A simple use case would be a word count, where we have a word and a counter of how many times it has occurred. You can see more examples in kafka-streams-examples on GitHub.
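A minimal sketch of that word-count topology, assuming default String serdes and hypothetical topic names text-lines and word-counts; the running counts are kept in a named state store:

import java.util.Arrays;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.state.KeyValueStore;

public class WordCountTopology {
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> lines = builder.stream("text-lines"); // hypothetical input topic

        KTable<String, Long> counts = lines
                .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
                .groupBy((key, word) -> word)
                // The running counts live in a local state store ("word-counts-store",
                // RocksDB by default), backed by an internal changelog topic.
                .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("word-counts-store"));

        counts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));
        return builder;
    }
}

The named store can later be queried by key, which is exactly what a plain log-structured topic cannot give you efficiently.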
What is the difference between Kafka Streams state and other in-memory storage like Redis, etc.?
State can be considered a savepoint from which you can resume your data processing, or it might also contain some useful information needed for further processing (like the previous word count, which we need to increment), so it can be stored using Redis, RocksDB, Postgres, etc.
Redis can be plugged in as the Kafka Streams state store; however, the default persistent state store for Kafka Streams is RocksDB.
Therefore, Redis is not an alternative to Kafka Streams state, but an alternative to Kafka Streams' default RocksDB.
Why is a topic not an alternative to a state store?
A topic is the final state store storage under the hood (everything is a topic in Kafka).
If you create a microservice named "myStream" with a state store named "MyState", you'll see a topic called myStream-MyState-changelog appear, which holds the history of all changes to the state store.
RocksDB is only a local cache to improve performance, with a first layer of backup on the local disk, but in the end the real high availability and exactly-once processing guarantees are provided by the underlying changelog topic.
What is the difference between Kafka Streams state and other in-memory storage like Redis, etc.?
What is a real use case for a state store in Kafka Streams?
It is not general-purpose storage; it is just local, efficient, guaranteed state used to handle some business case in a fully streamed way.
As an example:
for each incoming order (Topic1), I want to find any previous order (Topic2) to the same location in the last 6 hours.
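A rough sketch of that example as a windowed stream-stream join, assuming both topics are keyed by location and carry String values (the topic names Topic1 and Topic2 come from the example above); the join buffers recent records from both sides in windowed state stores:

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.JoinWindows;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.StreamJoined;

public class OrderMatchTopology {
    public static StreamsBuilder build() {
        StreamsBuilder builder = new StreamsBuilder();
        // Both streams are assumed to be keyed by location so related orders meet in the join.
        KStream<String, String> incomingOrders = builder.stream("Topic1");
        KStream<String, String> previousOrders = builder.stream("Topic2");

        KStream<String, String> matches = incomingOrders.join(
                previousOrders,
                (incoming, previous) -> incoming + " | previous: " + previous,
                // Pair records whose timestamps are within 6 hours of each other
                // (the window is symmetric; it can be narrowed with before()/after()).
                JoinWindows.ofTimeDifferenceWithNoGrace(Duration.ofHours(6)),
                StreamJoined.with(Serdes.String(), Serdes.String(), Serdes.String()));

        matches.to("orders-with-previous"); // hypothetical output topic
        return builder;
    }
}

The windowed join state is exactly the kind of thing the state store holds locally while the changelog topic keeps it recoverable.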

Which messages should be published to a Kafka topic, and when?

I have a few services, like Catalog Service, Customer Service, Recommendations Service, Order Taking Service and so on ...; each service has its own keyspace in a Cassandra database.
I have two questions:
1 - For a change in a service: should I first publish the change event (or record) to Kafka and then consume it from that same service in order to update its database, or should I update its database first and then publish the record to Kafka?
2 - How do I choose which changes to publish to Kafka? Should I publish all updates to Kafka, even those of no interest to other services, like "attribute X updated to Y for product Z"?
1) I would suggest you always try to read your writes. Which operation is more likely to succeed? A replicated ack from Kafka, or a durable Cassandra upsert? If you think Kafka is more durable, then you'd write it there first and then use a tool like Kafka Connect to write it down to Cassandra (assuming you really need Cassandra rather than a GlobalKTable; that's up for debate).
2) There's no straightforward answer. If you think the data might ever be consumed in ways that are relevant, then produce it. Think of it as an audit log of any and all events. If you want to build an idempotent system that always knows the latest state of any product and all the changes that happened, then you can either store the whole object each time as (id, product) pairs, where you holistically update the entire product, or you can store each delta of what changed and rebuild the state from that.
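A small sketch of the first variant in point 2, assuming a hypothetical products topic configured with cleanup.policy=compact so Kafka keeps at least the latest record per key: each update publishes the whole product keyed by its id, and consumers can rebuild the current state of every product from the topic alone.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProductStatePublisher {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key = product id, value = the complete product document after the change.
            String productId = "product-Z";
            String fullProduct = "{\"id\":\"product-Z\",\"attributeX\":\"Y\",\"price\":42}";
            producer.send(new ProducerRecord<>("products", productId, fullProduct)).get();
        }
    }
}

The delta approach would instead publish only the changed attribute and require consumers to fold the deltas into their own copy of the state.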

How to archive, not discard, old data in Apache Kafka?

I'm currently assessing Apache Kafka for use in our technology stack. One thing which may become critical is a contractual or legal requirement to be able to audit the system's behaviour, retaining this audit information for as much as a year.
Given the volume of data we process, we will most likely need to cold-store this rather than simply partitioning the data and setting a long retention period. Cold storage here means Amazon S3 or multiple locally held TB HDDs.
We could of course set up a logger against every topic. Yes.
But this feels like it should be a solved problem to which I just can't find a documented solution.
What's the best way of archiving old data from Apache Kafka rather than simply discarding it?
You could use the S3 sink connector to stream the data to S3, and then set the retention period on your topics as required to age-out the data.
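For reference, a minimal sketch of what that S3 sink connector configuration could look like; the connector name, topic, bucket, region, and flush size are hypothetical placeholders, and the available options depend on the connector version.

name=s3-audit-archive
connector.class=io.confluent.connect.s3.S3SinkConnector
tasks.max=1
topics=audit-events
s3.bucket.name=my-kafka-archive
s3.region=eu-west-1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat
flush.size=10000

The connector copies each topic's records into S3 objects as they arrive, so the topics themselves can keep a short retention period while the archive accumulates in cold storage.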