Use Kafka topics to store data for many years - apache-kafka

I am looking for a way to collect metrics data from multiple devices. The data should be aggregated by multiple "group by"-like functions. The list of aggregation functions is not complete; new aggregations will be added later, and they will need to aggregate all the data collected since the first day.
Is it fine to create a Kafka topic with a 100-year retention period and use it as a datastore for this purpose? New aggregations would then be able to read from the start of the topic, while existing aggregations continue from their offsets.

In principle, yes, you can use Kafka for long-term storage, exactly for the reason you outline - reprocessing of the source data to derive additional aggregates/calculations.
A couple of references:
https://www.confluent.io/blog/okay-store-data-apache-kafka/
https://www.confluent.io/blog/publishing-apache-kafka-new-york-times/

Yes, if you want to keep the data you can just increase the retention time to a large value.
I'd still recommend also having a size-based retention policy, to ensure you don't run out of disk space.
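Both settings can be applied per topic; a sketch using the stock kafka-configs.sh tool (the topic name and the size cap are placeholders for your setup):

```shell
# retention.ms=-1 disables time-based deletion entirely;
# retention.bytes caps each *partition* (not the whole topic),
# here at ~100 GiB. "device-metrics" is a hypothetical topic name.
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name device-metrics \
  --add-config retention.ms=-1,retention.bytes=107374182400
```

Note that retention.bytes is enforced per partition, so the total disk budget is roughly that value times the partition count.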

Related

kafka consumer to store history of events in a data store

We are working with Kafka as an event streaming platform. So far, there is one producer of data and 3 consumers, each of them subscribed to one or several topics in Kafka. This is working perfectly fine. FYI, the Kafka retention period is set to 5s since we don't need to persist the events longer than that.
Right now, we have a new use case coming up: persisting all the events of the latest 20 minutes (in another data store) for post-analysis (mainly for training purposes). So this new Kafka consumer should subscribe to all existing topics. We only want to persist the history of the latest 20 minutes of events in the data store, not all the events of a session (which can span several hours or days). The targeted throughput is 170 kB/s, which over 20 minutes is almost 1M messages to be persisted.
We are wondering which architecture pattern is suited to this situation. This is not a nominal use case compared to the current ones, so we don't want to reduce the performance of the system in order to handle it. Our idea is to drain the topics as fast as we can, push the data into a queue, and have another app, running at a different rate, in charge of reading the data from the queue and persisting it into the store.
We would greatly appreciate any experience or feedback on managing such a use case, especially regarding the expiration/purge mechanism to use. For sure we need something highly available and scalable.
Regards
You could use Kafka Connect with topics.regex=.* to consume everything and write to one location, but you'll end up with a really high total lag, especially if you keep adding new topics.
If you have retention.ms=5000, then I don't know whether Kafka is the proper tool for your use case; perhaps you could ingest into Splunk, Elasticsearch, or another time-series system where you can properly slice by 20-minute windows.
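For the Connect route, a minimal sink-connector sketch: the file sink here is only a stand-in (you'd use whatever connector matches your actual data store), and the connector name is made up:

```json
{
  "name": "event-archiver",
  "config": {
    "connector.class": "org.apache.kafka.connect.file.FileStreamSinkConnector",
    "topics.regex": ".*",
    "file": "/tmp/event-archive.txt",
    "tasks.max": "1"
  }
}
```

In practice you would narrow the regex (e.g. to a common topic-name prefix) so that Connect's own internal topics are not swept up as well.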

Starting new Kafka Streams microservice, when there is data retention period on input topics

Let's assume I have a (somewhat) high-velocity input topic - for example sensor.temperature - and it has a retention period of 1 day.
Multiple microservices are already consuming data from it. I am also backing up events in a historical event store.
Now (as a simplified example) I have a new requirement: calculating the all-time maximum temperature per sensor.
This fits very well with Kafka Streams, so I have prepared a new microservice that creates a KTable aggregating temperature (with max) grouped per sensor.
Simply deploying this microservice would be enough if the input topic had infinite retention, but as things stand the maximum would not be all-time, as our requirement demands.
I feel this must be a common scenario, but somehow I was not able to find a satisfying solution on the internet.
Maybe I am missing something, but my ideas for how to make it work do not feel great:
Replay all past events into the input topic sensor.temperature. This is a large amount of data, and it would cause all subscribing microservices to run excessive computation, which is most likely not acceptable.
Create a duplicate of the input topic for my microservice: sensor.temperature.local, into which I would always copy all events and then further process (aggregate) them from this local topic.
This way I can freely replay historical events into the local topic without affecting other microservices.
However, such a local duplicate would be required for every Kafka Streams microservice, and if the input topic is high velocity this could be too much duplication.
Maybe there is some way to modify KTables more directly, so one could query the historical event store for the max value per sensor and put it into the KTable once?
But what if the streams topology is more complex? It would require orchestrating consistent state across all microservices' KTables, rather than simply replaying events.
How to design the solution?
Thanks in advance for your help!
In this case I would create a topic that periodically stores the max (so that it won't fall off the topic because of a cleanup). Then you could make your service report the max of the max topic and the max of the measurement topic.
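The pattern in this answer can be sketched without any Kafka dependency; below, the two topics are stand-ins (plain lists), and the point is only the bookkeeping: periodically checkpoint the running max to a long-lived topic, and on a restart seed from the last checkpoint before folding in whatever measurements are still within retention:

```python
def all_time_max(checkpoints, retained_measurements):
    """Combine the latest checkpointed max with the max of the
    measurements still inside the retention window."""
    seed = checkpoints[-1] if checkpoints else float("-inf")
    live = max(retained_measurements, default=float("-inf"))
    return max(seed, live)

# Day 1: full history still retained; we checkpoint the max.
checkpoints = []
day1 = [21.5, 30.2, 25.0]
checkpoints.append(max(day1))           # periodic checkpoint -> "max topic"

# Day 2: day-1 events have expired from the input topic, but the
# checkpoint survives (the max topic is kept, e.g. via compaction).
day2 = [19.0, 27.3]
print(all_time_max(checkpoints, day2))  # 30.2
```

The checkpoint interval just has to be comfortably shorter than the input topic's retention period, so no measurement can expire before it has been folded into a checkpoint.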

Combining data coming from multiple kafka to single kafka

I have N Kafka topics, each with data and a timestamp, and I need to combine them into a single topic in sorted timestamp order, where the data is sorted inside the partition. I have found one way to do that:
Combine all the Kafka topic data in Cassandra (because of its fast writes) with clustering order DESCENDING. This will combine them all, but the limitation is that if data arrives late, after the timed window of accumulation, it won't be sorted.
Is there any other appropriate way to do this? If not, is there any room for improvement in my solution?
Thanks
It's not clear why you need Kafka to sort on timestamps. Typically this is done only at consumption time, for each batch of messages.
For example, create Kafka Streams process that reads from all topics. Create a Global KTable and enable Interactive Querying.
When you query, then you sort the data on the client side, regardless of how it is ordered in the topic.
This way, you are not limited to a single, ordered partition.
Alternatively, I would write to something other than Cassandra (due to my lack of deep knowledge of it) - for example, Couchbase or CockroachDB.
Then when you query those later, run an ORDER BY.
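Client-side merging of N timestamp-ordered sources is a standard k-way merge; a stdlib-only sketch (the per-topic batches are made-up data, standing in for messages fetched from each topic):

```python
import heapq

# Each "topic" yields (timestamp, payload) pairs already ordered by
# timestamp, which per-partition ordering gives you when producer
# timestamps are monotonic within each topic.
topic_a = [(1, "a1"), (4, "a2"), (9, "a3")]
topic_b = [(2, "b1"), (3, "b2"), (8, "b3")]
topic_c = [(5, "c1"), (7, "c2")]

# heapq.merge lazily merges already-sorted inputs into one sorted stream.
merged = list(heapq.merge(topic_a, topic_b, topic_c))
print([ts for ts, _ in merged])  # [1, 2, 3, 4, 5, 7, 8, 9]
```

Because the merge is lazy, this scales to large batches: only one head element per source is held at a time, rather than sorting everything up front.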

Classical Architecture to Kafka, how do you realize the following?

we are trying to move away from our classical architecture - J2EE application server / relational database - to Kafka. I have a use case that I am not sure how exactly to proceed with...
Our application currently exports from the relational database via a scheduler. In the future, we plan not to place the information in the relational database at all, but to realize the export directly from the information in the Kafka topic(s).
What I am not sure about is which would be the better solution: to configure a consumer that polls the topic(s) on the same schedule as the scheduler and exports the data, or to create a Kafka Streams application, triggered on the schedule, that collects this information from a stream.
What do you think?
The approach you want to adopt is technically feasible; a few possible solutions:
1) A continuously running Kafka consumer with Duration=<export schedule time>
2) A cron-triggered Kafka Streams consumer with a batch duration equal to the schedule, committing offsets to Kafka.
3) A cron-triggered Kafka consumer that handles offsets programmatically and pulls records based on those offsets, as per your schedule.
Important considerations:
Increase retention.ms to much more than your scheduled batch job interval.
Increase disk space to accommodate the data volume spike, since you are going to hold data for a longer duration.
Risks & Issues:
Retention could elapse over a weekend (e.g. when the job does not run) and data could be missed.
If another application mistakenly uses the same group.id, it can mislead the offsets.
No aggregation/math function can be applied before retrieval.
Your application cannot filter/extract records based on arbitrary parameters.
Unless offsets are managed externally, the application cannot re-read records.
Records will not be formatted, i.e. they will mostly be JSON strings or possibly some other format.
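Option 3's externally managed offsets amount to simple bookkeeping; a sketch with an in-memory list standing in for the topic (no Kafka client involved - the function and store names are made up for illustration):

```python
def scheduled_export(log, offset_store, key="export"):
    """Read everything appended since the last run, then persist
    the new position so the next run resumes from there."""
    start = offset_store.get(key, 0)       # last committed offset, 0 if none
    batch = log[start:]
    offset_store[key] = len(log)           # commit only after the batch is taken
    return batch

log = ["r1", "r2", "r3"]
offsets = {}
print(scheduled_export(log, offsets))  # first run: ['r1', 'r2', 'r3']
log += ["r4", "r5"]
print(scheduled_export(log, offsets))  # next run: ['r4', 'r5']
```

Keeping the offset store outside Kafka (a small table, a file, etc.) is also what makes re-reads possible: resetting the stored offset replays history, which a group.id-based commit alone does not easily allow.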

Apache Kafka streaming KTable changelog

I'm using Apache Kafka Streams to do aggregation on data consumed from a Kafka topic. The aggregation is then serialized to another topic, itself consumed, with the results stored in a DB. Pretty classic use case, I suppose.
The result of the aggregate call is creating a KTable backed up by a Kafka changelog "topic".
This is more complex than that in practice, but let's say it stores the count and sum of events for a given key (to compute an average):
KTable<String, Record> countAndSum = groupedByKeyStream.aggregate(...)
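The count-and-sum trick itself is independent of Streams: the per-key state holds (count, sum), and the average falls out at read time. A plain-Python sketch of that state update (the sensor keys and values are made-up data):

```python
from collections import defaultdict

state = defaultdict(lambda: (0, 0.0))  # key -> (count, sum)

def aggregate(key, value):
    """Fold one event into the running (count, sum) for its key."""
    count, total = state[key]
    state[key] = (count + 1, total + value)

def average(key):
    """Derive the average on demand from the stored state."""
    count, total = state[key]
    return total / count if count else None

for k, v in [("s1", 10.0), ("s1", 20.0), ("s2", 5.0)]:
    aggregate(k, v)
print(average("s1"))  # 15.0
```

Storing (count, sum) rather than the average itself is what makes the aggregation incrementally updatable: averages don't compose, but counts and sums do.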
That changelog "topic" does not seem to have a retention period set (I don't see it "expire", unlike my other topics, per my global retention setting).
This is actually good/necessary, because it avoids losing my aggregation state when a future event comes in with the same key.
However, in the long run this means the changelog will grow forever (as more keys come in)? And I potentially have a lot of keys (and my aggregations are not as small as count/sum).
As I have a means of knowing that I won't get any more events for a particular key (some events are marked as "final"), is there a way for me to strip the aggregation state for these particular keys from the changelog, to avoid having it grow forever as I won't need it anymore, possibly with a slight delay just in case?
Or maybe there is a way to do this entirely differently with Kafka Streams to avoid this "issue"?
Yes: changelog topics are configured with log compaction and not with a retention time. If you receive the "final" record, your aggregation can just return null as the aggregation result. This will delete it from the local RocksDB store as well as from the underlying changelog topic.
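The effect of that null ("tombstone") under log compaction can be illustrated with a dict standing in for the compacted changelog: the latest value per key wins, and a None value removes the key entirely. A stdlib sketch (the sensor keys and records are made-up data):

```python
def compact(changelog):
    """Latest record per key wins; a None value is a tombstone
    that deletes the key, mirroring compacted-topic semantics."""
    state = {}
    for key, value in changelog:
        if value is None:
            state.pop(key, None)   # tombstone: drop the aggregation state
        else:
            state[key] = value
    return state

changelog = [
    ("sensor-1", {"count": 1, "sum": 4.2}),
    ("sensor-1", {"count": 2, "sum": 9.1}),
    ("sensor-2", {"count": 1, "sum": 3.0}),
    ("sensor-1", None),            # "final" event seen -> emit tombstone
]
print(compact(changelog))  # {'sensor-2': {'count': 1, 'sum': 3.0}}
```

This is why the changelog does not grow forever once finished keys are tombstoned: compaction eventually retains only the live keys, not the full event history.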