How to archive, not discard, old data in Apache Kafka? - apache-kafka

I'm currently assessing Apache Kafka for use in our technology stack. One thing which may become critical is a contractual or legal requirement to be able to audit the system's behaviour, retaining this audit information for as much as a year.
Given the volume of data we process, we will most likely need to cold-store it rather than simply partitioning the data and setting a long retention period. Cold storage here means Amazon S3 or multiple locally held TB HDDs.
We could, of course, set up a logger against every topic, but this feels like it should be a solved problem for which I just can't find a documented solution.
What's the best way of archiving old data from Apache Kafka rather than simply discarding it?

You could use the S3 sink connector to stream the data to S3, and then set the retention period on your topics as required to age-out the data.
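For example, a minimal sketch of a Confluent S3 sink connector configuration, assuming standalone-style properties (the topic and bucket names below are placeholders to adapt):

    # Hypothetical S3 sink connector properties
    name=s3-archive
    connector.class=io.confluent.connect.s3.S3SinkConnector
    tasks.max=1
    topics=audit-events
    s3.bucket.name=my-archive-bucket
    s3.region=eu-west-1
    storage.class=io.confluent.connect.s3.storage.S3Storage
    format.class=io.confluent.connect.s3.format.json.JsonFormat
    # Records per S3 object
    flush.size=1000

Once the connector has been streaming the topics out for longer than the broker retention window, data that has aged off the brokers is only an S3 read away.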

Apache Kafka messages got archived - is it possible to retrieve the messages

We are using Apache Kafka and we process more than 30 million messages per day. We have a retention policy of 30 days. However, our messages got archived before the 30 days elapsed.
Is there a way we could retrieve the deleted messages?
Is it possible to reset the "start index" to an older index to retrieve the data through a query?
What other options do we have?
If we have a "disk backup", could we use that for retrieving the data?
Thank You
I'm assuming your messages got deleted by the Kafka cluster here.
In general, no: if the records got deleted due to duration- or size-related retention policies, then they have been removed.
Theoretically, if you have access to backups, you might move the Kafka data-log files into a broker's log directory, but the behaviour is undefined. Trying that with a fresh cluster with infinite size/time retention policies (so nothing gets purged immediately) might work and let you consume the data again.
In my experience, until the general availability of Tiered Storage, there is no free/easy way to recover data (via the Kafka Consumer protocol).
For example, you could use a Kafka Connect sink connector to write to some external, more persistent storage ahead of time. Then, would you want to write a job that scrapes that data back out? Sure, you could have a SQL table of STRING topic, INT timestamp, BLOB key, BLOB value, and perhaps track "consumer offsets" separately from that, but with that design Kafka doesn't really seem useful: you'd be reimplementing various parts of it when you could have just added more storage to the Kafka cluster.
Is it possible to reset the "start index" to an older index to retrieve the data through a query?
That is what auto.offset.reset=earliest does for a consumer group with no committed offsets, or you can use kafka-consumer-groups --reset-offsets --to-earliest for an existing group.
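For example, a sketch of rewinding an existing group with the CLI tool mentioned above (the group and topic names are placeholders, the group's consumers must be stopped while you reset, and you can use --dry-run first to preview the change):

    # Rewind "my-group" to the earliest retained offsets of "my-topic"
    kafka-consumer-groups --bootstrap-server localhost:9092 \
      --group my-group --topic my-topic \
      --reset-offsets --to-earliest --execute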
If we have a "disk backup", could we use that for retrieving the data?
With caution, maybe. For example, you can copy old broker log segments into a broker's log directory, but there aren't any tools I know of that will retroactively discover the new "low watermark" of each topic (maybe the broker finds this upon restart; I haven't tested). You'd need to copy this data for each broker manually, I believe, since the replicas wouldn't know about the old segments (again, maybe they would after a full cluster restart).
Plus, the consumer group offsets would already point way past that data, unless you stop all consumers and reset them.
I'm also not sure what happens if you have gaps in the segment files, e.g. your current oldest segment is N and you copy in N-2 but not N-1. You might then run into an error, or the consumer will simply apply its auto.offset.reset policy and seek to the next available offset or to the very end of the topic.

Collect user activity in Kafka?

I want to provide a fast way to get a user's availability status.
Reads from storage must be as fast as possible.
So I chose Redis for storing each user's availability status.
Besides that, I need to store more extended information about available users, such as region, login time, etc.
For this purpose I use Kafka, where this data is stored.
The question is: how do I synchronise Kafka and Redis?
Which sequence should it be: first store the online-user event in Kafka, then sink it to Redis?
Or store it in Redis first and write to Kafka asynchronously?
I'm worried about the latency of the sink operation between Kafka and Redis.
As I understand from the question, you want to store only the user and their status in Redis, and the complete profile in Kafka.
I am not sure about the reason for choosing Kafka as the primary store of all your data, or how you are planning to use the data stored there.
If storing the data in Kafka is really important to you, then I'd suggest updating your primary database first (Kafka or anything else) and then updating the cache.
In this case, you need to do a synchronous produce to Kafka and, once it succeeds, update your cache.
As your read operations only hit Redis, read performance will not be impacted.
But opting for a synchronous producer does add a little overhead compared to async, because you wait for the acknowledgement.
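A minimal sketch of that "Kafka first, then cache" sequence, assuming the plain Java producer and the Jedis client (the topic, key, and status payload below are made up for illustration):

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;
    import redis.clients.jedis.Jedis;
    import java.util.Properties;

    public class UserStatusWriter {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("acks", "all"); // wait until the write is fully acknowledged
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
                 Jedis redis = new Jedis("localhost", 6379)) {
                // 1) Store the full event in Kafka first, blocking until it is confirmed.
                ProducerRecord<String, String> event = new ProducerRecord<>(
                        "user-activity", "user-42", "{\"status\":\"online\",\"region\":\"eu\"}");
                RecordMetadata meta = producer.send(event).get(); // throws if the write failed
                System.out.println("stored at offset " + meta.offset());

                // 2) Only after Kafka has acknowledged the event, update the Redis cache.
                redis.set("user:42:status", "online");
            }
        }
    }

Blocking on the send's future is what makes the produce synchronous; if it throws, the cache is never updated, so Redis can only lag behind Kafka, never contradict it.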

Classical Architecture to Kafka, how do you realize the following?

We are trying to move away from our classical J2EE application server / relational database architecture to Kafka. I have a use case where I am not sure exactly how to proceed...
Our application currently exports from the relational database with a scheduler; in the future we are planning not to put the information into the relational database at all, but to export directly from the data in the Kafka topic(s).
What I am not sure about is the best solution: configure a consumer that polls the topic(s) on the same schedule as the scheduler and exports from there?
Or create a Kafka Streams job at the schedule's trigger point to collect this information from a stream?
What do you think?
The approach you want to adopt is technically feasible. A few possible solutions:
1) A continuously running Kafka consumer with duration = <export schedule time>
2) A cron-triggered Kafka streaming consumer with a batch duration equal to the schedule, committing offsets back to Kafka.
3) A cron-triggered Kafka consumer that programmatically handles offsets and pulls records from those offsets on your schedule (a sketch follows at the end of this answer).
Important considerations:
Increase retention.ms to much more than your scheduled batch job interval (see the command below).
Increase disk space to accommodate the data volume, since you are going to hold the data for a longer duration.
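For example, a sketch of raising a topic's retention with the kafka-configs tool (the topic name and retention value are placeholders; older broker releases take --zookeeper instead of --bootstrap-server):

    # Keep records on "export-topic" for 7 days (604800000 ms)
    kafka-configs --bootstrap-server localhost:9092 --alter \
      --entity-type topics --entity-name export-topic \
      --add-config retention.ms=604800000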
Risks & Issues:
If the job does not run for a stretch (e.g. over a weekend), records could exceed retention and be lost before they are exported.
If another application mistakenly uses the same group.id, the committed offsets can be thrown off.
No aggregation/math function can be applied before retrieval.
Your application cannot filter/extract records based on any parameter.
Unless offsets are managed externally, the application cannot re-read records.
Records will not be formatted; they will mostly be JSON strings or some other raw format.
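A sketch of option 3: a plain Java consumer that a cron job could launch on each schedule, draining whatever has arrived since the last run and committing the offsets back to Kafka (the topic name, group.id, and export step are placeholders):

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class ScheduledExporter {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "export-job");      // keep this group.id reserved for the export
            props.put("enable.auto.commit", "false"); // commit only after the export has succeeded
            props.put("auto.offset.reset", "earliest");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("export-topic"));
                ConsumerRecords<String, String> records;
                // Drain everything published since the last scheduled run.
                // (The first poll can come back empty while partitions are still being
                // assigned; a real job would poll until a deadline instead.)
                while (!(records = consumer.poll(Duration.ofSeconds(5))).isEmpty()) {
                    for (ConsumerRecord<String, String> record : records) {
                        exportRecord(record.value()); // placeholder for the real export step
                    }
                    consumer.commitSync();            // checkpoint progress in Kafka
                }
            }
        }

        private static void exportRecord(String value) {
            System.out.println(value);
        }
    }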

How does Kafka know when source data has changed?

I can't find a definitive answer, so I figured I would ask the experts. How does Kafka observe and detect what data in a given source has changed? For instance, in a Relational Database?
Polling comes to mind, but wouldn't it then have to maintain a data set of all primary keys per available table, and then check whether new primary keys have appeared? Where would this be stored, since memory is probably not durable enough?
This is a very general question, so you can imagine the answer is "it depends". Kafka isn't tracking this per se; it's done by whatever Kafka client implementation you have. For example, if you implement a Kafka Connect source connector, then you can store offsets in Kafka itself to checkpoint what data has been read. If you are just writing a producer, it's a different story. A pretty general example can be found in the Confluent JDBC source connector. It has multiple modes for loading that can give you an idea of the flexibility: https://docs.confluent.io/current/connect/connect-jdbc/docs/source_connector.html#features
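For instance, a sketch of that connector in timestamp+incrementing mode (the connection URL, column names, and topic prefix below are placeholders): change detection is driven by a strictly increasing ID column plus a last-modified timestamp column, and the connector checkpoints how far it has read as offsets in Kafka.

    # Hypothetical JDBC source connector properties
    name=jdbc-source
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    tasks.max=1
    connection.url=jdbc:postgresql://db-host:5432/appdb
    mode=timestamp+incrementing
    incrementing.column.name=id
    timestamp.column.name=updated_at
    topic.prefix=jdbc-
    poll.interval.ms=5000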

Reliable usage of DirectKafkaAPI

I am planning to develop a reliable streaming application based on the direct Kafka API. I will have one producer and one consumer. I wanted to know what the best approach is to achieve reliability in my consumer. I can employ two solutions:
Increasing the retention time of messages in Kafka
Using write-ahead logs
I am a bit confused regarding the use of write-ahead logs with the direct Kafka API, as there is no receiver, but the documentation indicates:
"Exactly-once semantics: The first approach uses Kafka’s high level API to store consumed offsets in Zookeeper. This is traditionally the way to consume data from Kafka. While this approach (in combination with write ahead logs) can ensure zero data loss (i.e. at-least once semantics), there is a small chance some records may get consumed twice under some failures. "
So I wanted to know what the best approach is: does it suffice to increase the TTL of messages in Kafka, or do I also have to enable write-ahead logs?
I guess it would be good practice if I could avoid one of the above, since the backup data (retained messages, checkpoint files) can be lost, and then recovery could fail.
The direct approach eliminates the data-duplication problem, as there is no receiver and hence no need for write-ahead logs. As long as you have sufficient Kafka retention, messages can be recovered from Kafka.
Also, the direct approach does not use ZooKeeper; offsets are tracked by Spark Streaming within its checkpoints, which is what enables its exactly-once semantics (provided your output operation is idempotent or transactional).
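For illustration only, a minimal direct-stream job sketched against the spark-streaming-kafka-0-10 integration (the topic, group.id, and checkpoint path are placeholders); because there is no receiver there is no write-ahead log, and offsets are recovered from the checkpoint directory on restart:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.Map;

    public class DirectStreamJob {
        public static void main(String[] args) throws InterruptedException {
            SparkConf conf = new SparkConf()
                    .setAppName("direct-kafka-job")
                    .setIfMissing("spark.master", "local[2]"); // local run for testing
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));
            jssc.checkpoint("/tmp/direct-kafka-checkpoints");  // offsets/state recorded here

            Map<String, Object> kafkaParams = new HashMap<>();
            kafkaParams.put("bootstrap.servers", "localhost:9092");
            kafkaParams.put("key.deserializer", StringDeserializer.class);
            kafkaParams.put("value.deserializer", StringDeserializer.class);
            kafkaParams.put("group.id", "direct-stream-app");
            kafkaParams.put("auto.offset.reset", "earliest");
            kafkaParams.put("enable.auto.commit", false);      // Spark manages the offsets

            // No receiver, no write-ahead log: records are read directly from the brokers.
            JavaInputDStream<ConsumerRecord<String, String>> stream =
                    KafkaUtils.createDirectStream(
                            jssc,
                            LocationStrategies.PreferConsistent(),
                            ConsumerStrategies.<String, String>Subscribe(
                                    Arrays.asList("events"), kafkaParams));

            stream.map(ConsumerRecord::value).print();

            jssc.start();
            jssc.awaitTermination();
        }
    }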