I am working on a data integration project where we need to consume a Kafka stream of business events and produce daily and monthly reports. We require some sort of state store for the stream.
The approaches we have brainstormed so far are:
Use a KTable to store events and let (one-to-many) consumers query the data for further ETL processing.
Or use a key-value store (like DynamoDB) to dump events and let consumers read from it.
We certainly don't want to own the events, and the storage should go away after reporting is done. I am a little concerned about the volume of data being stored for monthly processing, because when I looked at the Kafka topic, a week's worth of events was already in the range of GBs.
I am relatively new to this problem space, so I would need help with regard to efficiency and scalability. I am also looking for something that is not going to be an anti-pattern for future use cases.
I'm new to the Kafka Streams world. I'm wondering when to use a Kafka Streams GlobalKTable (with a compacted topic under the hood) instead of a regular database for persisting data, and what the advantages and disadvantages of both solutions are. I guess both ensure data persistence to the same degree.
Let's say there is a simple e-commerce app where users register and update their data. There are two microservices: the first one (service-users) is responsible for registering users and the second one (service-orders) is responsible for placing orders. Now there are two options:
When a new user registers, service-users accepts the request, saves the newly registered user's data in its own database (SQL or NoSQL, doesn't matter) and then sends an event to Kafka to propagate it to other services. service-orders receives such an event and stores the necessary user data in its own database. This is probably the most common pattern (from my experience).
And now the second approach, with a GlobalKTable:
When a new user registers or updates their data, service-users accepts the request and sends an event with a snapshot of the user data to Kafka. Both service-users and service-orders use a GlobalKTable to read information about users.
When should I use which solution? Which solution is better in which cases? What are the advantages and disadvantages of both approaches? Doesn't the second approach break the rule 'each microservice should maintain its own data in its own database'?
I hope I explained my considerations well and that they make sense.
In general, the advantages of a GlobalKTable are:
You can do a foreign-key join against a GlobalKTable (see the sketch after this list).
The application has the full data set in memory; the data set is automatically loaded during application startup and all data modifications are automatically synchronized across all instances. Compared to an architecture with an external database, you don't need to communicate (over the network) with any other resource (like a relational database) during message processing, so processing is much faster and, as a result, you can process a large amount of data quickly. If you wanted to achieve similar processing performance with an external database, you would need to implement some kind of in-memory cache yourself (like Guava) and then solve all the issues connected with proper cache management: warming, refreshing, evicting.
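For illustration, here is a minimal sketch of such a join: a KStream of orders looked up against a GlobalKTable of user snapshots by extracting the user id from the order value. The topic names, the string serdes and the "userId|payload" value format are assumptions made up for this example, not anything prescribed by Kafka Streams.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.GlobalKTable;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class OrderEnricher {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Compacted topic with user snapshots, keyed by userId (hypothetical name).
        GlobalKTable<String, String> users =
                builder.globalTable("users-snapshot", Consumed.with(Serdes.String(), Serdes.String()));

        // Orders keyed by orderId; the value is assumed to start with "userId|".
        KStream<String, String> orders =
                builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()));

        // Foreign-key style lookup: derive the userId from the order value and
        // join it against the GlobalKTable without any repartitioning.
        orders.join(users,
                    (orderId, orderValue) -> orderValue.split("\\|", 2)[0],
                    (orderValue, userSnapshot) -> orderValue + " | user=" + userSnapshot)
              .to("orders-enriched", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "service-orders");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}
```

The lookup happens against the locally replicated copy of the users topic, which is exactly the "no network call during processing" advantage described above.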
And the main disadvantages are:
The application has the full data set in memory. That is an advantage, but it can also be a very big issue; it all depends on how big your data set is and how you model your data. Referring to your example, storing all user orders in a GlobalKTable sounds like a very bad idea: the data set will grow very fast, and its size keeps growing with time, so after a few months/years of running the application in production the data set can reach gigabytes and will continue to grow. If we still want to store orders in a GlobalKTable for efficient processing, we need to design our data model differently. Our entities (Orders, Documents etc.) probably have some life cycle, like: new, paid, closed etc. A few of these states are terminating, meaning there will be no further processing on an entity with a given id (for example a closed Order). If there is no further processing, there is no need to keep that data in memory; we can forward it to some other storage, like Elasticsearch, and remove it from the GlobalKTable. We can call the data set with orders under processing the hot storage and the data set with terminated orders the cold storage. Long story short: keeping only active/hot Orders in the GlobalKTable could be a good idea (a sketch of this hot/cold split follows this list).
Querying a GlobalKTable is limited to iterating over the whole data set or a subset, or to getting data by record key, or by a key composed with a timestamp.
Processing based on state in an external database has been broadly used for many years, so many developers know how to evolve and maintain that kind of application. We cannot say the same about storing state in Kafka compacted topics.
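A minimal sketch of that hot/cold idea, under made-up assumptions: a raw "order-events" topic, a compacted "orders-snapshot" topic backing the GlobalKTable, and a simple "CLOSED" marker in the value. Closed orders are copied to a cold topic (which e.g. an Elasticsearch sink connector could read) and a tombstone is written to the compacted topic so that log compaction eventually drops them from the GlobalKTable.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class HotColdOrderSplit {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orderEvents =
                builder.stream("order-events", Consumed.with(Serdes.String(), Serdes.String()));

        // Active orders keep flowing into the compacted snapshot topic backing the GlobalKTable.
        orderEvents.filter((orderId, value) -> !value.contains("CLOSED"))
                   .to("orders-snapshot", Produced.with(Serdes.String(), Serdes.String()));

        KStream<String, String> closedOrders =
                orderEvents.filter((orderId, value) -> value.contains("CLOSED"));

        // Terminated orders go to cold storage (e.g. an Elasticsearch sink connector reads this topic)...
        closedOrders.to("orders-cold", Produced.with(Serdes.String(), Serdes.String()));

        // ...and a tombstone removes them from the compacted snapshot topic, so the
        // GlobalKTable stops holding them in memory once compaction has run.
        closedOrders.mapValues(value -> (String) null)
                    .to("orders-snapshot", Produced.with(Serdes.String(), Serdes.String()));

        return builder.build();
    }
}
```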
The goal is to process raw readings (15-minute and 1-hour intervals) from external remote meters (assets) in real time.
The process is defined using a simple Apache Kafka producer/consumer and multiple Spring Boot microservices that deduplicate messages, transform (map) readings to our system (insert internal IDs instead of external codes and similar stuff) and insert them into TimescaleDB (an extension of PostgreSQL).
Everything seems fine, but there is a requirement to perform real-time prediction/estimation of missing intervals.
A simple example for one meter and 15-minute readings:
On day 1 we got all readings. We process them and have them ingested into our DB.
On day 2 we are missing all readings - so the process is not even started for this meter.
On day 3 we again got all readings - but only for day 3. Now we need to detect that the whole of day 2 is missing, create empty readings and then estimate them with some algorithm (that is not that important now).
My question here: is there any way or idea how to do this without querying the existing database in one of the microservices and checking whether something is missing?
Is it possible to check previous messages in Kafka topics and do the prediction/estimation based on that (Kafka Streams? - I don't get them at all), and is that even a smart thing to do, or is there another way/idea to do it?
Personal opinion disclaimer
It is not reasonably possible to check previous messages in Kafka Streams. If you are hellbent on doing it, you could probably try to seek back and re-consume messages, but Kafka will fight you every step of the way. The mental model is that you are transforming or aggregating data that comes in in real time. If you need to query something about previous data, you ought to have collected that information when the data was coming through.
What could work (rather well even) is to separate the prediction of missing data from the transformation.
Create two consumers for the stream.
Have one topology (or whatever currently does your transformations) transform the data and load it back into Kafka, and from there into TimescaleDB.
Have one topology (or another microservice) that does what is needed to predict missing data. Your use case of backfilling a missing day could be handled by something like a count over daily windows (see the sketch after this list).
Make that trigger your backfilling, either as part of that topology or as a subsequent microservice, and load that data into TimescaleDB as well.
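As an illustration of that last point, a sketch of a daily count per meter in Kafka Streams (topic names and the string serdes are assumptions; a full day of 15-minute readings should yield 96 records). Note the caveat: a day with no readings at all produces no window, so completely silent meters still need to be detected downstream, e.g. by comparing against the previously seen day.

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.Suppressed;
import org.apache.kafka.streams.kstream.TimeWindows;

public class MissingIntervalDetector {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        builder.stream("meter-readings", Consumed.with(Serdes.String(), Serdes.String()))
               .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
               // One tumbling window per meter per day, based on event time.
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofDays(1)))
               .count()
               // Emit only the final count once the daily window has closed.
               .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
               .toStream()
               // Fewer than 96 readings for a 15-minute meter means there are gaps to backfill.
               .filter((windowedMeterId, count) -> count < 96)
               .map((windowedMeterId, count) -> KeyValue.pair(
                       windowedMeterId.key() + "@" + windowedMeterId.window().startTime(), count))
               .to("incomplete-days", Produced.with(Serdes.String(), Serdes.Long()));

        return builder.build();
    }
}
```

A separate consumer (or microservice) could then read "incomplete-days" and trigger the estimation.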
Are you already using Kafka Streams for the transformations? That would be a classical use case.
The recognition of missing data, not so much:
As far as I understand it, this does not require high throughput. Rather the opposite: you want to know when there is no data.
As far as I understand it, latency is not a (main) concern.
Kafka Streams could be useful if you need to take automated action within seconds after data stops coming in. But even then, you could just emit throughput metrics and trigger alerts on them.
Other than that, it is a very stateful problem, and stream processing is at its best when you can treat every message separately or reduce them in a "standard" manner like sums or counts.
I got the impression that a delay of a few hours / a day is not that tragic and that currently the backfilling might be done manually. In that case the cost of Kafka Streams would outweigh the benefits.
I'm contemplating whether to use MongoDB or Kafka for a time series dataset.
At first sight it obviously makes sense to use Kafka, since that's what it's built for. But I would also like some flexibility in querying, etc.
Which brought me to the question: "Why not just use MongoDB to store the timestamped data and index it by timestamp?"
Naively, this feels like it has a similar benefit to Kafka (in that it's indexed by time offset) but more flexibility. Then again, I'm sure there are plenty of reasons why people use Kafka instead of MongoDB for this type of use case.
Could someone explain some of the reasons why one may want to use Kafka instead of MongoDB in this case?
I'll take this question as meaning that you're trying to collect metrics over time.
Yes, Kafka topics have configurable time retention, and I doubt you're using topic compaction because your messages would likely be in the form of (time, value), so the time could not be repeated anyway.
Kafka also provides stream processing libraries so that you can compute averages, min/max, outliers & anomalies, top-K, etc. over windows of time (a sketch follows below).
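For example, a sketch of a per-metric 5-minute maximum with the Kafka Streams DSL (topic names and serdes are assumptions made up here; min, average or top-K would follow the same pattern):

```java
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class MetricsAggregation {
    public static Topology build() {
        StreamsBuilder builder = new StreamsBuilder();

        builder.stream("metrics", Consumed.with(Serdes.String(), Serdes.Double()))
               .groupByKey(Grouped.with(Serdes.String(), Serdes.Double()))
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
               // Running maximum per metric name per 5-minute window.
               .aggregate(() -> Double.NEGATIVE_INFINITY,
                          (metric, value, currentMax) -> Math.max(currentMax, value),
                          Materialized.with(Serdes.String(), Serdes.Double()))
               .toStream((windowedMetric, max) ->
                       windowedMetric.key() + "@" + windowedMetric.window().startTime())
               .to("metrics-5min-max", Produced.with(Serdes.String(), Serdes.Double()));

        return builder.build();
    }
}
```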
However, while processing all that data is great and useful, your consumers would be stuck doing linear scans of it, not easily able to query slices of it for any given time range. That's where time indexes (not just a start index, but also an end) would help.
So, sure, you can use Kafka to create a backlog of queued metrics and process/filter them over time, but I would suggest consuming that data into a proper database, because I assume you'll want to query it more easily and potentially create some visualizations over it.
With that architecture, you could have your highly available Kafka cluster holding onto data for some amount of time, while your downstream systems don't necessarily have to be online all the time in order to receive events. Once they are, they consume from the last committed offset and pick up where they were before (a sink sketch follows below).
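A sketch of such a downstream sink, assuming a plain KafkaConsumer, a (time, value) topic named "metrics" and a hypothetical metrics(ts, name, value) table reachable over JDBC; in practice Kafka Connect would do the same job with less code:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Timestamp;
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class MetricsSink {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "metrics-sink");        // offsets are tracked per group, so a
        props.put("enable.auto.commit", "false");     // restarted sink resumes where it left off
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.DoubleDeserializer");

        try (KafkaConsumer<String, Double> consumer = new KafkaConsumer<>(props);
             Connection db = DriverManager.getConnection("jdbc:postgresql://localhost:5432/metrics")) {
            consumer.subscribe(List.of("metrics"));
            PreparedStatement insert =
                    db.prepareStatement("INSERT INTO metrics(ts, name, value) VALUES (?, ?, ?)");
            while (true) {
                ConsumerRecords<String, Double> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, Double> record : records) {
                    insert.setTimestamp(1, new Timestamp(record.timestamp()));
                    insert.setString(2, record.key());
                    insert.setDouble(3, record.value());
                    insert.addBatch();
                }
                insert.executeBatch();
                consumer.commitSync();   // commit only after the batch is safely in the database
            }
        }
    }
}
```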
Like the answers in the comments above - neither Kafka nor MongoDB is well suited as a time-series DB with flexible query capabilities, for the reasons that @Alex Blex explained well.
Depending on the requirements for processing speed vs. query flexibility vs. data size, I would make the following choices:
Cassandra [best processing speed, best/good data size limits, worst query flexibility]
TimescaleDB on top of PostgreSQL [good processing speed, good/OK data size limits, good query flexibility]
Elasticsearch [good processing speed, worst data size limits, best query flexibility + visualization]
P.S. By "processing" here I mean ingestion, partitioning and roll-ups where needed.
P.P.S. I picked the options that are, in my opinion, most widely used now, but there are dozens and dozens of other options and combinations, and many more selection criteria to use - I would be interested to hear about other engineers' experiences!
Event sourcing means a 180-degree shift in the way many of us have been architecting and developing web applications, with lots of advantages but also many challenges.
Apache Kafka is an awesome platform that, through its Kafka Streams API, is advertised as a tool that allows us to implement this paradigm thanks to its many features (decoupling, fault tolerance, scalability...): https://www.confluent.io/blog/event-sourcing-cqrs-stream-processing-apache-kafka-whats-connection/
On the other hand, there are some articles discouraging us from using it for event sourcing: https://medium.com/serialized-io/apache-kafka-is-not-for-event-sourcing-81735c3cf5c
These are my questions regarding the suitability of Kafka Streams as an event sourcing platform:
The article above comes from Jesper Hammarbäck (who works for serialized.io, an event sourcing platform). I would like to get an answer to the main problems he brings up:
Loading current state. In my view with log compaction and state stores it's not a problem. Am I right?
Consistent writes.
When moving certain pieces of functionality into Kafka Streams, I'm not sure they fit naturally:
Authentication & security: imagine your customers are stored in a state store generated from a customer topic. Should we keep their passwords in the topic/store? It doesn't sound safe enough, does it? Then how are we supposed to manage having customers in a state store and their passwords somewhere else? Any recommended good practice?
Queries: interactive queries are a nice tool to generate queryable views of our data (by key). That's OK for getting an entity by id, but what about complex queries (joins)? Do we need to generate a state store per query? For instance, one store for customers by id, another one for customers by state, another store for customers who purchased a product last year... It doesn't sound manageable. Another point is the lack of pagination: how can we handle big result sets when querying the state stores? One more point: we can't do dynamic queries (like the JPA Criteria API) anymore. Maybe this leads to CQRS? Complexity keeps growing this way...
Data growth: with databases we are used to having thousands and thousands of rows per table. Kafka Streams applications keep a local state store that will grow and grow over time. How scalable is that? How is that local storage kept (local disk/RAM)? If it's disk, we should provision applications with enough space; if it's RAM, with enough memory.
Loading current state: the mechanism described in the blog, re-creating current state ad hoc for a single entity, would indeed be costly with Kafka. However, Kafka Streams follows the philosophy of keeping the current state for all objects in a KTable (that is distributed/sharded). Thus, it's never required to do this -- of course, it comes with certain memory costs.
Kafka Streams parallelizes based on different events: all interactions for a single event (processing, state updates) are performed by a single thread. Thus, I don't see why there should be inconsistent writes.
I am not sure what the exact requirement would be. In the current implementation, Kafka Streams does not offer any store-specific authentication or security features. There are several things one could do for security though: (a) encrypt the local disk: this might be the simplest way to protect data; (b) encrypt messages within the business logic, before you put them into the store.
Interactive Queries offers limited support for many reasons (I don't want to go into details) and it was never designed with the goal of supporting complex queries. The idea is the eager computation of results that can then be retrieved with simple lookups (roughly like the sketch below). As you pointed out, this is not very scalable (cost intensive) if you have a lot of different queries. To tackle this, it would make sense to load the data into a database and let the DB do what it is built for. Kafka Streams alone is not the right tool for this at the moment -- however, there is no reason not to combine both.
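For reference, this is roughly what those simple lookups look like; "streams" is a running KafkaStreams instance and "customers-store" is a hypothetical materialized store name, both assumed for the example:

```java
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class CustomerLookups {
    public static void query(KafkaStreams streams) {
        ReadOnlyKeyValueStore<String, String> store =
                streams.store(StoreQueryParameters.fromNameAndType(
                        "customers-store", QueryableStoreTypes.keyValueStore()));

        // Point lookup by key -- the kind of query Interactive Queries is designed for.
        String customer = store.get("customer-42");
        System.out.println(customer);

        // Key-range scan; anything more complex (joins, predicates on values, sorting)
        // is better served by exporting the data to a real database.
        try (KeyValueIterator<String, String> range = store.range("customer-0", "customer-9")) {
            range.forEachRemaining(kv -> System.out.println(kv.key + " -> " + kv.value));
        }
    }
}
```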
By default Kafka Streams uses RocksDB to keep local state (you can switch to in-memory stores, too; see the sketch below). Thus, it's possible to write to disk and to use very large state. Of course, you need to provision your instances accordingly (cf. https://docs.confluent.io/current/streams/sizing.html). Besides this, Kafka Streams scales horizontally and is fully elastic. Thus, you can add new instances at any point in time, allowing you to hold terabytes of state if you have large disks and enough instances. Note that the number of input topic partitions limits the number of instances you can use (internally, Kafka Streams is a consumer group, and you cannot have more instances than partitions). If this is a concern, it's recommended to over-partition the input topics in the first place.
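A sketch of how the store backing a table can be chosen explicitly; the topic and store names here are made up, and without an explicit store supplier RocksDB is used anyway:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.Stores;

public class StoreConfiguration {
    public static void define(StreamsBuilder builder) {
        // Persistent RocksDB-backed store on local disk (the default behaviour).
        KTable<String, String> customersOnDisk = builder.table(
                "customers",
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.<String, String>as(Stores.persistentKeyValueStore("customers-rocksdb"))
                        .withKeySerde(Serdes.String())
                        .withValueSerde(Serdes.String()));

        // The same kind of table kept fully in memory instead.
        KTable<String, String> auditLogInMemory = builder.table(
                "audit-log",
                Consumed.with(Serdes.String(), Serdes.String()),
                Materialized.<String, String>as(Stores.inMemoryKeyValueStore("audit-log-inmem"))
                        .withKeySerde(Serdes.String())
                        .withValueSerde(Serdes.String()));
    }
}
```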
I'm designing an event-sourced architecture based around Kafka, and using Flink for stream processing.
One use case will be querying (filtering and sorting) historical trade data that has passed through the Kafka topic over time, e.g. "Give me all trades in the last 5 years with these attributes, sorted by xx". The total trade history will be around 10m trades, increasing by say 1m/year.
Is Flink itself the right tool for such historical queries, and can it do so with reasonable performance (a few seconds)? Or am I better off feeding the events from Kafka into an indexable/queryable data store like MongoDB/an RDBMS, and using that for historical queries?
Doing the former feels like it adheres more closely to a Kappa architecture, whereas resorting to a historical DB feels like moving away from that, back towards a Lambda architecture.
Flink is well suited to processing historic data from a Kafka topic (or any other data source) due to its support for event-time processing, i.e., time-based processing based on timestamps in the records, not on the clock of the processing machine (aka processing time).
If you only want to perform analytics, you might want to have a look at Flink's SQL support.
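For illustration, a small sketch using Flink's Table/SQL API; the topic name, schema and the fixed date literal are assumptions for this example, and the Kafka connector and JSON format dependencies need to be on the classpath:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class TradeHistoryQuery {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Expose the Kafka topic of trade events as a SQL table, read from the earliest offset.
        tEnv.executeSql(
            "CREATE TABLE trades (" +
            "  trade_id   STRING," +
            "  instrument STRING," +
            "  price      DOUBLE," +
            "  trade_time TIMESTAMP(3)," +
            "  WATERMARK FOR trade_time AS trade_time - INTERVAL '5' SECOND" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'trades'," +
            "  'properties.bootstrap.servers' = 'localhost:9092'," +
            "  'scan.startup.mode' = 'earliest-offset'," +
            "  'format' = 'json'" +
            ")");

        // Event-time based filter over the full history kept in the topic. Sorting on
        // non-time attributes needs bounded (batch) execution or an external store.
        tEnv.executeSql(
            "SELECT trade_id, instrument, price, trade_time " +
            "FROM trades " +
            "WHERE instrument = 'XYZ' AND trade_time >= TIMESTAMP '2019-01-01 00:00:00'")
            .print();
    }
}
```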