Kafka Streams RocksDB as an EventStore? - apache-kafka

There are a lot of articles about Kafka and event sourcing. Most of them argue that Kafka is not really useful for event sourcing, because you cannot query the events for just a given aggregate id.
If you store the events in a topic, then yes, this is true: we would need to read all the events and skip those that are not relevant.
But what about storing the events in RocksDB? Now we can actually query all the events just for the given aggregate by using the aggregate id as a prefix and doing a range query in RocksDB.
Is this a good approach? I know that this will use large state and can be problematic when a rebalance occurs. But then again, maybe static membership in Kafka will help with this.

Kafka Streams' default disk-based StateStore is RocksDB, so yes, this is a perfectly valid approach.
You'd query the store via the Kafka Streams APIs, not RocksDB directly, however.
"Now, we can actually query all the events just for the given aggregate by using the aggregate id as prefix"
It's unclear what you mean by prefix. The stores are keyed exclusively by Kafka record keys, not by prefixed values. However, as linked in the comments, the store does support prefix scanning, but I assume that would be a prefix of the Kafka record keys.
"this will use large state and can be problematic"
You can refer to the Memory Management page for how to handle state and what to tune to control its size.
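To make the prefix idea concrete, here is a minimal sketch in plain Python of how a prefix scan behaves over a store whose keys are kept in sorted order. It is not the Kafka Streams API: the class `SortedEventStore`, the `aggregateId|sequence` key layout, and the `|` separator are all assumptions made for illustration.

```python
import bisect


class SortedEventStore:
    """Toy stand-in for an ordered key-value store (hypothetical, not Kafka Streams)."""

    def __init__(self):
        self._keys = []    # composite keys, kept in sorted order
        self._values = {}  # composite key -> event payload

    def put(self, aggregate_id, sequence, event):
        # Zero-padding keeps lexicographic order equal to numeric order.
        key = f"{aggregate_id}|{sequence:010d}"
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = event

    def prefix_scan(self, aggregate_id):
        # Jump to the first key with this prefix, then read until the prefix ends.
        prefix = f"{aggregate_id}|"
        start = bisect.bisect_left(self._keys, prefix)
        events = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            events.append(self._values[key])
        return events


store = SortedEventStore()
store.put("order-1", 2, "Paid")
store.put("order-10", 1, "Created (other aggregate)")
store.put("order-1", 1, "Created")
print(store.prefix_scan("order-1"))
```

The separator matters: without it, a scan for `order-1` would also pick up `order-10`. In a real topology the composite key would have to be the Kafka record key, as the answer above points out.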

Kafka Streams with RocksDB is a really good solution for getting started quickly and understanding the concepts, but I am not sure it is a good idea for production in the long term.
My personal experience with RocksDB in production was not that brilliant. If you plan to use a key/value database in production, Apache Cassandra seems to be a much better solution.
But you are also right that querying only by primary key is not that flexible, so implementing CQRS makes much more sense and gives you much more flexibility on the query side.
As someone who has walked the path you plan to walk, I describe my proposed solution on my blog :)

Related

Is Kafka cluster a database?

What does cluster mean?
The doc says: "The Kafka cluster *stores* streams of records in categories called topics."
If it stores, then is it a database?
Cluster means multiple machines that "share the load" among them. This is deliberately vague, as there are many ways of achieving this.
What is the question here?
(Disclaimer: my opinion, and generally subjective.) It's a database in the broadest sense: it stores data, and you can get the data back out. However, since it lacks any real "fetch by PK" or query facilities, it makes for a very bad choice of primary storage for most use cases and is usually used as an intermediate bus rather than a source of truth.

Sharing partitioning logic across polyglot producers with Kafka

We are building an event sourced system at my company, relying on Kafka.
In order to be GDPR compliant, we need to be able to update the events.
Our idea is to use the compaction and tombstone capabilities.
This means that we cannot use the default partitioning strategy, as we want each message to have a unique key (in order to overwrite a specific message), but we still want events occurring on the same aggregate to end up on the same partition.
Which brings us to the creation of a custom partitioner (basically copying the "hash modulo" logic of the default partitioner, but using a different value than the message key to compute the hash).
The issue is that we're working in a polyglot environment (we have PHP, Python and Java/Kotlin services publishing and consuming events).
We want to ensure that all these services will produce messages to the same partition given a specific partition key (in case different services publish events to the same topic).
Our main idea was to use a common hashing algorithm, but we find it hard to find one with both a strong distribution guarantee and good stability (not just part of an experimental lib).
PHP natively supports a wide range of hashing algorithms, but we find it hard to find the same support in the other languages.
As Kafka's default partitioner relies on murmur2, we started looking in that direction as well. Unfortunately, it is not natively supported by PHP (although some implementations exist). Furthermore, this algorithm uses a seed, which means that we will need to use the exact same seed for all our publisher services, which is starting to make the approach look quite complex.
However, we could be looking at the design from the wrong angle. Sharing event store write capabilities across polyglot services might not be a good idea, and each service could have its own partitioning logic as long as it ensures the "one partition per aggregate" requirement. The thing is that we have to think ahead, because no technical safeguard will prevent a service from publishing to a "shared" event stream in the future (and not using the exact same partitioning logic will have a huge impact when that happens).
Does anyone have experience with building an event store with Kafka in a polyglot environment, and could you enlighten us on this specific topic, please?
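As a rough illustration of the "hash modulo" approach described in the question, the sketch below ports 32-bit murmur2 to Python and derives a partition from a separate partition key. It follows my understanding of the Java client's murmur2, where, to my knowledge, the seed is a fixed constant rather than something each service configures. Treat that constant, and the helper name `partition_for`, as assumptions, and compare the output against a Java producer before relying on it.

```python
MASK32 = 0xFFFFFFFF


def murmur2(data: bytes) -> int:
    """32-bit MurmurHash2, returned as a signed int (as Java would)."""
    seed = 0x9747B28C  # assumed to match the Java client's fixed seed
    m = 0x5BD1E995
    r = 24

    length = len(data)
    h = (seed ^ length) & MASK32

    # Mix the input four bytes at a time, read as little-endian words.
    for i in range(0, length - length % 4, 4):
        k = int.from_bytes(data[i:i + 4], "little")
        k = (k * m) & MASK32
        k ^= k >> r
        k = (k * m) & MASK32
        h = (h * m) & MASK32
        h ^= k

    # Fold in the remaining one to three bytes, if any.
    tail = length & ~3
    remaining = length % 4
    if remaining == 3:
        h ^= data[tail + 2] << 16
    if remaining >= 2:
        h ^= data[tail + 1] << 8
    if remaining >= 1:
        h ^= data[tail]
        h = (h * m) & MASK32

    h ^= h >> 13
    h = (h * m) & MASK32
    h ^= h >> 15

    # Reinterpret the unsigned result as a signed 32-bit integer.
    return h - (1 << 32) if h & 0x80000000 else h


def partition_for(partition_key: str, num_partitions: int) -> int:
    """Hypothetical helper: hash the aggregate id, not the unique message key."""
    return (murmur2(partition_key.encode("utf-8")) & 0x7FFFFFFF) % num_partitions


print(partition_for("aggregate-42", 12))
```

Each language would need an equivalent port, plus a shared set of test vectors generated from one reference implementation, so that any drift between services is caught in CI rather than in production.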

Kafka vs. MongoDB for time series data

I'm contemplating whether to use MongoDB or Kafka for a time series dataset.
At first sight obviously it makes sense to use Kafka since that's what it's built for. But I would also like some flexibility in querying, etc.
Which brought me to question: "Why not just use MongoDB to store the timestamped data and index them by timestamp?"
Naively, this feels like it has a similar benefit to Kafka (in that it's indexed by time offset) but with more flexibility. But then again, I'm sure there are plenty of reasons why people use Kafka instead of MongoDB for this type of use case.
Could someone explain some of the reasons why one may want to use Kafka instead of MongoDB in this case?
I'll take this question to mean that you're trying to collect metrics over time.
Yes, Kafka topics have configurable time-based retention, and I doubt you're using topic compaction because your messages would likely be in the form of (time, value), so the time could not be repeated anyway.
Kafka also provides stream processing libraries so that you can compute averages, min/max, outliers and anomalies, top K, etc. over windows of time.
However, while processing all that data is great and useful, your consumers would be stuck doing linear scans of this data, not easily able to query slices of it for any given time range. And that's where time indexes (not just a start index, but also an end) would help.
So, sure, you can use Kafka to create a backlog of queued metrics and process/filter them over time, but I would suggest consuming that data into a proper database, because I assume you'll want to be able to query it more easily and potentially create some visualizations over that data.
With that architecture, you could have your highly available Kafka cluster holding onto data for some amount of time, while your downstream systems don't necessarily have to be online all the time in order to receive events. But once they are, they'd consume from the last available offset and pick up where they left off.
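The difference between a linear scan and a time-indexed range query, mentioned in the answer above, can be sketched in a few lines of Python. This is only an in-memory model under assumed names (`linear_scan`, `TimeIndex`); it does not represent Kafka's or any database's actual implementation.

```python
import bisect


def linear_scan(log, start_ts, end_ts):
    """Read every record in order and filter, as a plain topic consumer would."""
    return [value for ts, value in log if start_ts <= ts < end_ts]


class TimeIndex:
    """Toy index over (timestamp, value) records, sorted by timestamp."""

    def __init__(self, records):
        self._records = sorted(records, key=lambda record: record[0])
        self._timestamps = [ts for ts, _ in self._records]

    def range(self, start_ts, end_ts):
        # Binary search for both ends of the slice: O(log n) to locate it.
        lo = bisect.bisect_left(self._timestamps, start_ts)
        hi = bisect.bisect_left(self._timestamps, end_ts)
        return [value for _, value in self._records[lo:hi]]


metrics = [(100, 1.0), (110, 2.0), (120, 3.0), (130, 4.0)]
index = TimeIndex(metrics)
print(linear_scan(metrics, 110, 130))
print(index.range(110, 130))
```

Both calls return the same slice; the indexed version simply avoids touching records outside the requested window, which is what a database with a timestamp index gives you.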
Like the answers in the comments above, neither Kafka nor MongoDB is well suited as a time-series DB with flexible query capabilities, for the reasons that @Alex Blex explained well.
Depending on the requirements for processing speed vs. query flexibility vs. data size, I would make one of the following choices:
Cassandra [best processing speed, best/good data size limits, worst query flexibility]
TimescaleDB on top of PostgreSQL [good processing speed, good/OK data size limits, good query flexibility]
Elasticsearch [good processing speed, worst data size limits, best query flexibility + visualization]
P.S. By "processing" here I mean ingestion, partitioning, and roll-ups where needed.
P.P.S. I picked those options that are most widely used now, in my opinion, but there are dozens and dozens of other options and combinations, and many more selection criteria to use - would be interested to hear about other engineers' experiences!

Questions about using Apache Kafka Streams to implement event sourcing microservices

Event sourcing means a 180 degree shift in the way many of us have been architecting and developing web applications, with lots of advantages but also many challenges.
Apache Kafka is an awesome platform that, through its Kafka Streams API, is advertised as a tool that allows us to implement this paradigm thanks to its many features (decoupling, fault tolerance, scalability...): https://www.confluent.io/blog/event-sourcing-cqrs-stream-processing-apache-kafka-whats-connection/
On the other hand there are some articles discouraging us from using it for event sourcing: https://medium.com/serialized-io/apache-kafka-is-not-for-event-sourcing-81735c3cf5c
These are my questions regarding Kafka Streams' suitability as an event sourcing platform:
The article above comes from Jesper Hammarbäck (who works for serialized.io, an event sourcing platform). I would like to get an answer to the main problems he brings up:
Loading current state. In my view with log compaction and state stores it's not a problem. Am I right?
Consistent writes.
When moving certain pieces of functionality into Kafka Streams I'm not sure if they do fit naturally:
Authentication & Security: Imagine your customers are stored in a state store generated from a customer-topic. Should we keep their passwords in the topic/store? It doesn't sound safe enough, does it? Then how are we supposed to manage this aspect of having customers on a state store and their passwords somewhere else? Any recommended good practice?
Queries: Interactive queries are a nice tool to generate queryable views of our data (by key). That's OK to get an entity by id, but what about complex queries (joins)? Do we need to generate state stores per query? For instance, one store for customers by id, another one for customers by state, another store for customers who purchased a product last year... It doesn't sound manageable. Another point is the lack of pagination: how can we handle big sets of data when querying the state stores? One more point: we can't do dynamic queries (like the JPA Criteria API) anymore. This leads to CQRS, maybe? Complexity keeps growing this way...
Data growth: With databases we are used to having thousands and thousands of rows per table. Kafka Streams applications keep a local state store that will grow and grow over time. How scalable is that? How is that local storage kept (local disk/RAM)? If it's disk, we should provision applications with enough space; if it's RAM, with enough memory.
Loading Current State: The mechanism described in the blog, re-creating the current state ad hoc for a single entity, would indeed be costly with Kafka. However, Kafka Streams follows the philosophy of keeping the current state for all objects in a KTable (that is distributed/sharded). Thus, it's never required to do this -- of course, it comes with certain memory costs.
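The idea of keeping current state for every entity, instead of replaying one entity's events on demand, can be modeled as a fold over the event stream. The following is a minimal Python sketch with a made-up account domain (`apply_event`, `build_table`, and the event names are hypothetical); it only mimics the concept and is not how a KTable is implemented.

```python
def apply_event(balance, event):
    """Apply one event to the current state of a single aggregate."""
    kind, amount = event
    if kind == "deposited":
        return balance + amount
    if kind == "withdrawn":
        return balance - amount
    raise ValueError(f"unknown event type: {kind}")


def build_table(events):
    """Fold (aggregate_id, event) pairs into the latest state per aggregate."""
    table = {}
    for aggregate_id, event in events:
        table[aggregate_id] = apply_event(table.get(aggregate_id, 0), event)
    return table


stream = [
    ("acc-1", ("deposited", 100)),
    ("acc-2", ("deposited", 50)),
    ("acc-1", ("withdrawn", 30)),
]
table = build_table(stream)
print(table["acc-1"])  # a lookup by key, no replay needed
```

The trade-off named in the answer shows up directly: reads become simple key lookups, but the table holds state for all aggregates at once.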
Consistent writes: Kafka Streams parallelizes work by input partition. Thus, all interactions for a single event (processing, state updates) are performed by a single thread, and I don't see why there should be inconsistent writes.
Authentication & Security: I am not sure what the exact requirement would be. In the current implementation, Kafka Streams does not offer any store-specific authentication or security features. There are several things one could do for security though: (a) encrypt the local disk, which might be the simplest way to protect data; (b) encrypt messages within the business logic, before you put them into the store.
Queries: Interactive Queries offers limited support for many reasons (I don't want to go into details), and it was never designed with the goal of supporting complex queries. The idea is eager computation of results that can be retrieved with simple lookups. As you pointed out, this is not very scalable (cost intensive) if you have a lot of different queries. To tackle this, it would make sense to load the data into a database and let the DB do what it is built for. Kafka Streams alone is not the right tool for this atm -- however, there is no reason not to combine both.
Data growth: By default, Kafka Streams uses RocksDB to keep local state (you can switch to in-memory stores, too). Thus, it's possible to write to disk and to use very large state. Of course, you need to provision your instances accordingly (cf: https://docs.confluent.io/current/streams/sizing.html). Besides this, Kafka Streams scales horizontally and is fully elastic. Thus, you can add new instances at any point in time, allowing you to hold terabytes of state if you have large disks and enough instances. Note that the number of input topic partitions limits the number of instances you can use (internally, Kafka Streams is a consumer group, and you cannot have more active instances than partitions). If this is a concern, it's recommended to over-partition the input topics in the first place.
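The partition limit on scaling can be illustrated with a deliberately simplified model. The round-robin function below is an assumption for illustration only (`assign_partitions` is hypothetical and is not Kafka's actual assignor), but it shows why instances beyond the partition count end up with nothing to do.

```python
def assign_partitions(num_partitions, num_instances):
    """Spread partitions over instances round-robin (simplified model)."""
    assignment = {instance: [] for instance in range(num_instances)}
    for partition in range(num_partitions):
        assignment[partition % num_instances].append(partition)
    return assignment


def idle_instances(assignment):
    """Instances that received no partition and therefore hold no state."""
    return [instance for instance, parts in assignment.items() if not parts]


# Four partitions, six instances: two instances stay idle.
print(idle_instances(assign_partitions(4, 6)))
```

With 4 partitions and 6 instances, instances 4 and 5 receive no work, which is why over-partitioning the input topics up front leaves room to scale out later.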

Akka Distributed Pub/Sub and number of named topics

I would like to create a named topic per online user in my system using Akka clustering. Does having a few tens of thousands of named topics at a time impact performance negatively?
I would not recommend it. Topic information is represented by a service key in the Receptionist. Between 10k and 100k is probably OK; above that you will most likely run into performance issues.
Depending on what you need, using cluster sharding might be a better fit.