kafka consumer to store history of events in a data store - apache-kafka

We are working with Kafka as an event streaming platform. So far, there is one producer of data and 3 consumers, each of them subscribed to one or several topics in Kafka. This is working perfectly fine. FYI, the Kafka retention period is set to 5s since we don't need to persist the events longer than that.
Right now we have a new use case coming up: persist all the events from the latest 20 minutes (in another data store) for post-analysis (mainly for training purposes). So this new Kafka consumer should subscribe to all existing topics. We only want to persist the history of the latest 20 minutes of events in the data store, not all the events for a session (which can represent several hours or days). The targeted throughput is 170 kB/s, which over 20 minutes amounts to almost 1M messages to be persisted.
We are wondering which architecture pattern is suited to such a situation. This is not a nominal use case compared to the current ones, so we don't want to degrade the performance of the system just to handle it. Our idea is to drain the topics as fast as we can, push the data into a queue, and have another app, running at a different rate, in charge of reading the data from the queue and persisting it into the store.
We would greatly appreciate any experience or feedback on managing such a use case, especially regarding the expiration/purge mechanism to be used. For sure we need something highly available and scalable.
Regards

You could use Kafka Connect with topics.regex=.* to consume everything and write to one location, but you'll end up with a really high total lag, especially if you keep adding new topics.
If you have retention.ms=5000, then I don't know if Kafka is a proper tool for your use case, but perhaps you could ingest into Splunk or Elasticsearch or another time-series system where you can properly slice by 20-minute windows.
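If Kafka Connect turns out to be a poor fit, a plain consumer in its own consumer group can also drain everything by subscribing with a regex pattern and handing records off to the persistence stage. A minimal sketch, assuming a hypothetical writeToStore handoff and leaving the 20-minute pruning to the data store:

```java
import java.time.Duration;
import java.util.Properties;
import java.util.regex.Pattern;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class HistorySink {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Dedicated group so this sink does not interfere with the existing consumers.
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "history-sink");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            // Subscribe to every existing (and future) topic via a regex pattern.
            consumer.subscribe(Pattern.compile(".*"));
            while (true) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    // Placeholder: batch these writes and push them to the queue / data store
                    // asynchronously so the consumer keeps draining the topics quickly.
                    writeToStore(record.topic(), record.timestamp(), record.value());
                }
            }
        }
    }

    private static void writeToStore(String topic, long timestamp, byte[] value) {
        // Hypothetical: persist, then let the store expire rows older than 20 minutes (e.g. a TTL index).
    }
}
```

Whether this beats a Connect sink depends mostly on how much batching and back-pressure you need between the drain stage and the store.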

Related

ksqlDB for finding average last hour, and store results back to a kafka topic?

We have a Redpanda (Kafka-compatible) source with sensor data. Can we do the following:
Every hour, find the average sensor value over the last hour for each sensor
Store the results back to a topic
You want to create a materialized view over the stream of events that can be queried by other applications. Your source publishes the individual events to Kafka/Redpanda, and another process observes the events and makes them available as queryable "tables" for other applications. Elaborating a few options:
ksqlDB is likely the default choice as it comes "native" in the Kafka/Confluent stack. Be careful with running it against your production Kafka cluster; it can have a heavy impact on cluster performance. See the basic tutorial or the advanced tutorial.
Use an out-of-the-box solution for materialized views such as Materialize. It's the easiest to set up and use and doesn't stress the Kafka broker. However, it is single-node only as of now (06/2022). See the tutorial.
Another popular option is using a stream processor and storing the hourly aggregates in an attached database (for example Flink storing data in Redis). This is a do-it-yourself approach. Have a look at Hazelcast: it is one process running both the stream processing services and a queryable store. A sketch of this style of aggregation is shown below.
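To illustrate the do-it-yourself option with Kafka Streams (rather than Flink or Hazelcast), here is a minimal sketch of an hourly average per sensor. The topic names, the String/Double serdes and the "sum,count" string accumulator are all assumptions to keep the example self-contained; a real implementation would use a proper POJO and serde for the accumulator:

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class HourlySensorAverage {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "hourly-sensor-average");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Assumed input: key = sensor id, value = reading as a double.
        builder.stream("sensor.readings", Consumed.with(Serdes.String(), Serdes.Double()))
               .groupByKey(Grouped.with(Serdes.String(), Serdes.Double()))
               // Tumbling 1-hour windows, one aggregate per sensor per hour.
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
               // Accumulate "sum,count" as a plain string so we can stick to built-in serdes.
               .aggregate(() -> "0.0,0",
                          (sensorId, reading, acc) -> {
                              String[] parts = acc.split(",");
                              double sum = Double.parseDouble(parts[0]) + reading;
                              long count = Long.parseLong(parts[1]) + 1;
                              return sum + "," + count;
                          },
                          Materialized.with(Serdes.String(), Serdes.String()))
               .toStream()
               // Turn the accumulator into the average and drop the window from the key.
               .mapValues(acc -> {
                   String[] parts = acc.split(",");
                   return Double.parseDouble(parts[0]) / Long.parseLong(parts[1]);
               })
               .selectKey((windowedKey, avg) -> windowedKey.key())
               .to("sensor.readings.hourly-avg", Produced.with(Serdes.String(), Serdes.Double()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```

Note that this emits updated averages as readings arrive; if you only want one final value per hour, add a suppression step (Suppressed.untilWindowCloses) before writing to the output topic.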

How to make sure two microservices are in sync

I have a kubernetes solution with lots of microservices. Each microservice has its own database and to send data between the services we use kafka.
I have one microservice that generates lots of orders and order lines.
These are saved to the order service's own database, and every change should be pushed to Kafka using a Kafka connector setup.
Another microservice handles items and prices. All changes are saved to tables in this service's database, and the changes are pushed to their own topic using the Kafka connector.
Now I have a third microservice (the calculater) that calculates something based on the data from the previously mentioned services. Right now it just consumes changes from the order, order line, item and price topics, and when it's time it calculates.
The calculater microservice is scheduled to do the calculation at a certain time each day. But before doing the calculation I'd like to know if the service is up to date with data from the other two microservices.
Is there some kind of best practice on how to do this?
What this should make sure is that I haven't lost a change. Let's say an order line's quantity was changed from 100 to 1; then I want to make sure I have received that change before I start calculating.
If you want to know whether the orders and items microservices have all their data published to Kafka prior to having the calculater execute its logic, that is quite application-specific, and it may be hard to come up with a good answer without more details. Is the Kafka connector that sends the orders, order lines and so on from the database to Kafka some kind of CDC connector (i.e. it listens to DB table changes and publishes them to Kafka)? If so, most likely you will need some way to compare the latest message in Kafka with the latest row updated to know whether the connector has sent all DB updates to Kafka. There may be connectors that expose that information somehow, or you may have to implement something yourself.
On the other hand, if what you want is to know whether the calculater has read all the messages that have been published to Kafka by the other services, that is easier. You just need to get the high watermarks (the latest offset in each topic) and check that the calculater consumer has actually consumed them (so there is no lag). I guess, though, that the topics are continuously updated, so most likely there will always be some lag, but there is nothing you can do about that.
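A minimal sketch of that lag check using the Kafka AdminClient; the consumer group name "calculater" is an assumption taken from the question:

```java
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class LagCheck {
    // Sums (high watermark - committed offset) over all partitions the group has committed offsets for.
    static long totalLag(AdminClient admin, String groupId) throws Exception {
        Map<TopicPartition, OffsetAndMetadata> committed =
                admin.listConsumerGroupOffsets(groupId).partitionsToOffsetAndMetadata().get();

        Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
        Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                admin.listOffsets(latestSpec).all().get();

        long lag = 0;
        for (TopicPartition tp : committed.keySet()) {
            lag += latest.get(tp).offset() - committed.get(tp).offset();
        }
        return lag;
    }

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            long lag = totalLag(admin, "calculater"); // assumed consumer group id
            System.out.println(lag == 0 ? "Up to date, safe to calculate" : "Still behind by " + lag + " records");
        }
    }
}
```

The calculater could run this check right before its scheduled calculation and postpone it (or keep polling) until the lag drops to an acceptable level.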

Starting new Kafka Streams microservice, when there is data retention period on input topics

Let's assume I have a (somewhat) high-velocity input topic, for example sensor.temperature, and it has a retention period of 1 day.
Multiple microservices are already consuming data from it. I am also backing up events in a historical event store.
Now (as a simplified example) I have a new requirement: calculating the all-time maximum temperature per sensor.
This fits very well with Kafka Streams, so I have prepared a new microservice that creates a KTable aggregating temperature (with max) grouped per sensor.
Simply deploying this microservice would be enough if the input topic had infinite retention, but as it is, the maximum would not be all-time, as our requirement demands.
I feel this could be a common scenario, but somehow I was not able to find a satisfying solution on the internet.
Maybe I am missing something, but my ideas for how to make it work do not feel great:
Replay all past events into the input topic sensor.temperature. This is a large amount of data, and it would cause all subscribing microservices to run excessive computation, which is most likely not acceptable.
Create a duplicate of the input topic for my microservice: sensor.temperature.local, where I would always copy all events and then further process (aggregate) them from this local topic.
This way I can freely replay historical events into the local topic without affecting other microservices.
However, this local duplicate would be required for every Kafka Streams microservice, and if the input topic is high velocity this could be too much duplication.
Maybe there is some way to modify KTables more directly, so one could query the historical event store for the max value per sensor and put it in the KTable once?
But what if the streams topology is more complex? It would require orchestrating consistent state in all microservices' KTables, rather than simply replaying events.
How to design the solution?
Thanks in advance for your help!
In this case I would create a topic that stores the max periodically (so that it won't fall off the topic because of a cleanup). Then you could make your service report the max of the max-topic and the max of the measurement topic.
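A minimal Kafka Streams sketch of that "max of both topics" idea; sensor.temperature comes from the question, while sensor.temperature.max (the periodic checkpoint topic) and the output topic name are assumptions, and the job that periodically snapshots the current maxima into the checkpoint topic is left out:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class AllTimeMaxTemperature {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "all-time-max-temperature");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Live measurements, keyed by sensor id (1-day retention).
        KStream<String, Double> live =
                builder.stream("sensor.temperature", Consumed.with(Serdes.String(), Serdes.Double()));
        // Periodically checkpointed maxima, so history survives the input topic's retention.
        KStream<String, Double> checkpointed =
                builder.stream("sensor.temperature.max", Consumed.with(Serdes.String(), Serdes.Double()));

        // The all-time max per sensor is simply the max over both sources;
        // the backing state store's changelog topic is log-compacted, so the value is not lost.
        KTable<String, Double> allTimeMax = live.merge(checkpointed)
                .groupByKey(Grouped.with(Serdes.String(), Serdes.Double()))
                .reduce(Math::max, Materialized.with(Serdes.String(), Serdes.Double()));

        allTimeMax.toStream()
                  .to("sensor.temperature.alltime-max", Produced.with(Serdes.String(), Serdes.Double()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```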

Streaming on-demand data on to Kafka topics based on consumer requests

We are a source system, and we have a couple of downstream systems which require our data for their needs. Currently we publish events onto Kafka topics as and when there is a change in the source system, for them to consume and apply to their tables (all delta updates).
Apart from subscribing to the Kafka topics, our downstream systems currently access our database directly once in a while to do a complete refresh of their tables on demand, to make sure the data is in sync; as you know, a data refresh is sometimes needed when the data is felt to be out of sync for some reason.
We are planning to stop giving direct access to our database; how can we achieve this? Is there a way for consumers to request the data they need, for example by sending a request to us as a trigger, so that we can publish the stream of data for them to consume on their end, either to sync their tables or to load the bulk data into memory to perform tasks based on their needs?
We have written RESTful APIs to provide data based on requests, but they only expose small data volumes; that won't work when we need to send millions of records to consumers. I believe the only way is to stream the data over Kafka, but with Kafka how can we respond to a request from consumers and only pump that specific data onto Kafka topics for them to consume?
You have the option of setting the retention policy on any topic to keep messages forever with:
retention.ms: -1
see the docs
In that case you could store the entire change log in the same manner that you currently are. Then if a consumer needs to re-materialize the entire history, they can start with the first offset and go from there without you having to produce a specialized dataset.
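On the consumer side, a full refresh then amounts to rewinding to the first offset and replaying the change log. A minimal sketch; the topic name "source.changelog" and the applyToLocalTable handoff are assumptions:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class FullRefresh {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "downstream-refresh");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assumed topic holding the full change log (retention.ms: -1).
            String topic = "source.changelog";
            List<TopicPartition> partitions = consumer.partitionsFor(topic).stream()
                    .map(p -> new TopicPartition(topic, p.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);
            // Rewind to offset 0 on every partition to re-materialize the entire history.
            consumer.seekToBeginning(partitions);

            ConsumerRecords<String, String> records;
            // Simplification: stop at the first empty poll; a production job would track end offsets instead.
            while (!(records = consumer.poll(Duration.ofSeconds(1))).isEmpty()) {
                for (ConsumerRecord<String, String> record : records) {
                    applyToLocalTable(record.key(), record.value());
                }
            }
        }
    }

    private static void applyToLocalTable(String key, String value) {
        // Hypothetical: upsert (or delete on tombstone) into the downstream system's own table.
    }
}
```

That keeps the refresh entirely on the consumer side, so no trigger/request API is needed on the source system for this case.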

Is it ok to use Apache Kafka "infinite retention policy" as a base for an Event sourced system with CQRS?

I'm currently evaluating options for designing/implementing an Event Sourcing + CQRS architectural approach to system design. Since we want to use Apache Kafka for other aspects (normal pub-sub messaging + stream processing), the next logical question would be: "Can we use the Apache Kafka store as an event store for CQRS?" Or, more importantly, would that be a smart decision?
Right now I'm unsure about this.
This source seems to support it: https://www.confluent.io/blog/okay-store-data-apache-kafka/
This other source recommends against that: https://medium.com/serialized-io/apache-kafka-is-not-for-event-sourcing-81735c3cf5c
In my current tests/experiments, I'm having problems similar to those described by the 2nd source, namely:
Recomposing an entity: Kafka doesn't seem to support fast retrieval/searching of specific events within a topic (for example, all commands related to an order's history). Reconstructing an entity's instance seems to require scanning all the topic's events and filtering only those matching some entity instance identifier, which is a no-go. [This other person seems to have arrived at a similar conclusion: Query Kafka topic for specific record -- that is, it is just not possible (without relying on some hacky trick)]
Write consistency: Kafka doesn't support transactional atomicity on its store, so it seems a common practice is to put a DB with some locking approach (usually optimistic locking) in front, before asynchronously exporting the events to the Kafka queue (I can live with this, though; the first problem is much more crucial to me).
The partition problem: the Kafka documentation mentions that the order guarantee exists only within a topic's partition. At the same time it also says that the partition is the basic unit of parallelism; in other words, if you want to parallelize work, spread the messages across partitions (and brokers, of course). But this is a problem, because an "event store" in an event-sourced system needs the order guarantee, so this means I'm forced to use only 1 partition for this use case if I absolutely need the order guarantee. Is this correct?
Even though this question is a bit open-ended, it really comes down to this: have you used Kafka as your main event store in an event-sourced system? How have you dealt with the problem of recomposing entity instances out of their command history (given that the topic has millions of entries, scanning the whole set is not an option)? Did you use only 1 partition, sacrificing potential concurrent consumers (given that the order guarantee is restricted to a specific topic partition)?
Any specific or general feedback would be greatly appreciated, as this is a complex topic with several considerations.
Thanks in advance.
EDIT
There was a similar discussion 6 years ago here:
Using Kafka as a (CQRS) Eventstore. Good idea?
Consensus back then was also divided, and a lot of the people who suggest this approach is convenient mention how Kafka deals natively with huge amounts of real-time data. Nevertheless, the problem (for me at least) isn't related to that, but to how inconvenient Kafka's capabilities are for rebuilding an entity's state: either by modeling topics as entity instances (where the explosion in the number of topics is undesired), or by modeling topics as entity types (where the number of events within the topic makes reconstruction very slow/impractical).
Your understanding is mostly correct:
Kafka has no search, and definitely not by key. There's a seek-to-timestamp, but it's imperfect and not good for what you're trying to do.
Kafka actually supports a limited form of transactions (see exactly-once) these days, although if you interact with any other system outside of Kafka they will be of no use.
The unit of anything in Kafka (event ordering, availability, replication) is a partition. There are no guarantees across partitions of the same topic.
All of this doesn't stop applications from using Kafka as the source of truth for their state, so long as:
your problem can be "sharded" into topic partitions so you don't care about the order of events across partitions;
you're willing to "replay" an entire partition if/when you lose your local state, as a bootstrap;
you use log-compacted topics to try and keep a bound on their size (because you will need to replay them to bootstrap, see the point above).
Both Samza and (IIUC) Kafka Streams back their state stores with log-compacted Kafka topics. Internally to Kafka, offset and consumer-group management is stored as a log-compacted topic, with brokers holding a "materialized view" in memory: when ownership of a partition of __consumer_offsets moves between brokers, the new leader replays the partition to rebuild this view.
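If you go that route, creating such a log-compacted topic is straightforward with the AdminClient. A minimal sketch; the topic name, partition count and replication factor are placeholder assumptions:

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            NewTopic topic = new NewTopic("order-events", 12, (short) 3)
                    .configs(Map.of(
                            // Keep only the latest event per key so a bootstrap replay stays bounded.
                            TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT,
                            // How aggressively compaction kicks in (tunable).
                            TopicConfig.MIN_CLEANABLE_DIRTY_RATIO_CONFIG, "0.1"));
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```

Keep in mind that compaction retains only the latest record per key, so it bounds replay time at the cost of per-key history; whether that trade-off is acceptable depends on how you shard entities across keys.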
I have been involved in several projects that use Kafka as long-term storage. Kafka has no problem with it, especially with the latest versions, which introduced something called tiered storage; in a cloud environment this gives you the possibility to move older data to slower/cheaper storage.
And you should not worry that much about transactions; in today's IT there are other concepts to deal with it, like Event Sourcing and Bounded Context. Yes, you have to think differently when you are designing your applications; how? That is explained in this video.
But you are right, your options for querying this data will be limited. The easiest way is to use Kafka Streams and a KTable, but this will be a key/value store, so you can only ask questions about your data by primary key (see the sketch after this answer).
Your next best choice is to implement the query part of CQRS with the help of frameworks like Akka Projection. I wrote a blog about how you can use Akka Projection with Elasticsearch, which you can find here and here.
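To make the "key/value lookup only" point concrete, here is a minimal Kafka Streams sketch that materializes a topic as a KTable and queries it by primary key via an interactive query; the topic name, store name and order id are assumptions:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class OrderStateLookup {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-state-lookup");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        // Assumed topic: key = order id, value = serialized latest state of the order.
        builder.table("order-events",
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("order-state-store"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        // In a real service you would wait for the RUNNING state before serving queries.

        ReadOnlyKeyValueStore<String, String> store = streams.store(
                StoreQueryParameters.fromNameAndType("order-state-store",
                        QueryableStoreTypes.keyValueStore()));
        System.out.println(store.get("order-42")); // lookup strictly by primary key
    }
}
```

Anything richer than that (ranges over non-key fields, full-text search, joined read models) is where a projection into a dedicated query store, as suggested above, comes in.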