Is Kafka cluster a database? - apache-kafka

What does cluster mean?
The docs say: "The Kafka cluster *stores* streams of records in categories called topics."
If it stores, then is it a database?

A cluster means multiple machines that "share the load" among them. This is deliberately vague, as there are many ways of achieving this.
What is the question here?
(Disclaimer: my opinion, and generally subjective.) It's a database in the broadest sense: it stores data, and you can get the data back out. However, since it lacks any real "fetch by PK" or query facilities, it makes for a very bad choice of primary storage for most use cases and is usually used as an intermediate bus rather than a source of truth.

Related

Kafka streams RocksDB as an EventStore?

There are a lot of articles about Kafka and event sourcing. Most of them argue that Kafka is not really useful for event sourcing, because you cannot query the events for just a given aggregate id.
If you store the events in a topic, then yes, this is true. This is because we need to read all the events and skip those that are not relevant.
But what about storing the events in RocksDB? Then we can actually query all the events for just the given aggregate, by using the aggregate id as a prefix and doing a range query in RocksDB.
Is this a good approach? I know that this will use large state and can be problematic when a rebalance occurs. But again, maybe static membership in kafka will help with this.
Kafka Streams's default disk-based StateStore is RocksDB, so yes, this is a perfectly valid approach.
You'd query the store via Kafka Streams APIs, not RocksDB directly, however.
Now, we can actually query all the events just for the given aggregate by using the aggregate id as prefix
It's unclear what you mean by prefix. The stores are built exclusively from Kafka record keys, not from prefixed values. However, as linked in the comments, the store does support prefix scanning, though I assume that would be over a prefix of the Kafka record keys.
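To make the prefix idea concrete, here is a minimal sketch in plain Python of a prefix scan over a sorted key space. It is not the Kafka Streams or RocksDB API. The composite key layout `<aggregateId>:<sequence>` and the names `prefix_scan`, `store` and `sorted_keys` are assumptions made up for illustration.

```python
import bisect

def prefix_scan(sorted_keys, store, prefix):
    """Return (key, value) pairs whose key starts with prefix, in key order."""
    # Jump to the first key that is >= prefix instead of scanning everything.
    start = bisect.bisect_left(sorted_keys, prefix)
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break  # keys are sorted, so no later key can match
        result.append((key, store[key]))
    return result

# Hypothetical composite record keys: "<aggregateId>:<sequence>"
store = {
    "order-1:0001": "OrderCreated",
    "order-1:0002": "ItemAdded",
    "order-10:0001": "OrderCreated",
    "order-2:0001": "OrderCreated",
}
sorted_keys = sorted(store)

events = prefix_scan(sorted_keys, store, "order-1:")
for key, value in events:
    print(key, value)
```

Note that the delimiter is part of the prefix: scanning for "order-1" without the trailing ":" would also return the events of "order-10".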
this will use large state and can be problematic
You can refer to the Memory Management page for how to handle state and what to tune to manage its size.
Kafka Streams with RocksDB is a really good solution for getting started quickly and understanding the concepts, but I am not sure it is a good idea for production in the long term.
My personal experience with RocksDB in production was not that brilliant; if you plan to use a key/value database in production, Apache Cassandra seems to be a much better solution.
But you are also right that querying only by primary key is not that flexible, so implementing CQRS makes much more sense and gives you much more flexibility on the query side.
As a person who has walked the path you plan to walk, I can point you to my proposed solution in my blog :)

Limitations of Kafka as a Distributed DB

I have an application which requires an interesting orchestration between states of instances distributed across geographic regions, in combination with the need for a scalable distributed database.
At the moment I think that Kafka with log compaction will fit my needs for state maintenance and message exchange between instances, and Cassandra will fit my needs for high volume distributed reads and writes of persisted data.
However, quite a lot of data is duplicated that way: much of the data exchanged via Kafka would also need to be stored in Cassandra for distributed data access. Using Kafka for both messaging and distributed data querying and persistence seems tempting.
Therefore, I'm interested to figure out the real-world pros and cons to be expected when using e.g. the pull queries feature of Kafka to use it as a distributed database [1].
Though, I'm a bit suspicious about what to expect of that in terms of performance and scalability, especially when compared to Cassandra, as well as unknown pitfalls.
What are the tradeoffs when using Kafka as a distributed DB, and how would it compare performance-wise to "native" distributed systems like Cassandra?
[1] https://www.confluent.io/de-de/blog/pull-queries-in-preview-confluent-cloud-ksqdb/
pure KV lookups
Then Kafka StateStores / Interactive Queries can work, but with the caveat that if you use containers and an orchestrator, you need to maintain the state of those stores somewhere on persistent volumes. Otherwise, when the containers move to a fresh host, the streams changelog topic needs to be read from the very beginning, giving you a "cold-start" problem, and you will be unable to query.
Using any database (with persistent storage) will not have this problem, and will always be able to query immediately.
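The cold-start cost described above can be pictured with a small plain-Python sketch. It only simulates replaying a changelog into a local store and is not Kafka Streams code. The function `restore_store`, the checkpoint offset, and the convention that a `None` value means "delete" are assumptions for this illustration.

```python
def restore_store(changelog, checkpoint_offset=0, store=None):
    """Replay changelog records from checkpoint_offset into a copy of store.

    Returns (restored_store, number_of_records_replayed).
    A value of None is treated as a delete marker in this sketch.
    """
    store = {} if store is None else dict(store)
    replayed = 0
    for key, value in changelog[checkpoint_offset:]:
        if value is None:
            store.pop(key, None)
        else:
            store[key] = value
        replayed += 1
    return store, replayed

changelog = [("a", 1), ("b", 2), ("a", 3), ("b", None), ("c", 4)]

# Fresh host without local state: the whole changelog has to be replayed.
cold_store, cold_replayed = restore_store(changelog)

# Persistent volume kept the state as of offset 3: only the tail is replayed.
warm_store, warm_replayed = restore_store(changelog, 3, {"a": 3, "b": 2})

print(cold_replayed, warm_replayed)
```

Both paths end with the same store contents; the difference is how many records must be read before the store can answer queries.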
I'm not sure I would suggest Cassandra for strictly KV data, though.

What's the recommended number of Kafka connectors for a large database? (Debezium)

I'm trying to set up Debezium for data change monitoring in this huge database.
The documentation says "Debezium can monitor any number of databases. The number of connectors that can be deployed to a single cluster of Kafka Connect services depends upon the volume and rate of events. However, Debezium supports multiple Kafka Connect service clusters and, if needed, multiple Kafka clusters as well."
However, there's no mention about how many connectors is a good practice.
From reading Medium posts and some use cases, it seems like one connector for a whole database is a suitable option. But if we have a lot of tables and a lot of change events in a short time, it should become a bottleneck. Or not? I've seen people working with one connector per table too. That would mean a LOT of connectors in this case. If you have a use case involving heavy databases along with Debezium, could you share your experiences with connectors?
(The source databases, in this case, are mostly Postgres.)
Sorry if it's a dumb question. Thank you in advance.

Redis Streams for implementing a Messaging System (chat) app versus traditional approaches

I'm implementing a chat app, which will support both one-on-one conversation and Group conversations.
So far the direction was to use Redis Pub/Sub with PostgreSQL as the cold storage, and WebSocket being the transport.
Every user will fetch the history from postgresql upon launch (up until the timestamp of the WebSocket+redis connection), and then subscribe to channels that go by their own user_id.
However, having a roundtrip to a DBMS with each new message sounds a bit strange, while definitely doable and legit.
So I decided to examine other approaches. One possible approach was to use Kafka and eliminate the need for a DBMS altogether.
It sounds viable and comes with its own set of advantages.
But turns out there's a new kid on the block - Redis Streams.
From what I gather, it is actually quite similar to Kafka in this specific scenario (chat).
It has many nice features that sound very convenient for implementing a chat system.
And now I am trying to understand whether Streams + disk persistence is the wise way to go, versus Kafka, versus PostgreSQL + Redis Pub/Sub.
The main aspects in consideration are:
Performance. Postgres and Kafka both operate on disk, meaning they are slower than the in-memory operations of Redis. On the other hand, the messages obviously must be persisted and available at all times, so Redis will be persisted to disk too. Wouldn't that negate the whole in-memory performance gain?
And even if not, would the performance gain under peak load and with a big database be noticeable?
Memory / Costs. With Redis these two are closely tied together. As a small startup, our efforts are focused on being ready to cope with sudden scale peaks (up to a million users), but at the same time the costs should be minimized.
Is storing millions of messages in Streams going to be too memory-costly, which in turn will translate to financially costly?
Recovery, Reliability & Availability, Persistency. With Postgres, even a single instance can handle a big traffic load, but it can also offer master-slave setups and consistency. Can Redis match that? Also, with a DBMS I can be assured that the data is there to stay. Can I know that with Redis?
Scaling.

Questions about using Apache Kafka Streams to implement event sourcing microservices

Event sourcing means a 180 degree shift in the way many of us have been architecting and developing web applications, with lots of advantages but also many challenges.
Apache Kafka is an awesome platform that, through its Kafka Streams API, is advertised as a tool that allows us to implement this paradigm through its many features (decoupling, fault tolerance, scalability...): https://www.confluent.io/blog/event-sourcing-cqrs-stream-processing-apache-kafka-whats-connection/
On the other hand there are some articles discouraging us from using it for event sourcing: https://medium.com/serialized-io/apache-kafka-is-not-for-event-sourcing-81735c3cf5c
These are my questions regarding Kafka Streams' suitability as an event sourcing platform:
The article above comes from Jesper Hammarbäck (who works for serialized.io, an event sourcing platform). I would like to get an answer to the main problems he brings up:
Loading current state. In my view with log compaction and state stores it's not a problem. Am I right?
Consistent writes.
When moving certain pieces of functionality into Kafka Streams I'm not sure if they do fit naturally:
Authentication & Security: Imagine your customers are stored in a state store generated from a customer-topic. Should we keep their passwords in the topic/store? It doesn't sound safe enough, does it? Then how are we supposed to manage this aspect of having customers on a state store and their passwords somewhere else? Any recommended good practice?
Queries: Interactive queries are a nice tool to generate queryable views of our data (by key). That's OK for getting an entity by id, but what about complex queries (joins)? Do we need to generate state stores per query? For instance one store for customers by id, another one for customers by state, another store for customers who purchased a product last year... It doesn't sound manageable. Another point is the lack of pagination: how can we handle big sets of data when querying the state stores? One more point: we can’t do dynamic queries (like the JPA Criteria API) anymore. This leads to CQRS maybe? Complexity keeps growing this way...
Data growth: with databases we are used to having thousands and thousands of rows per table. Kafka Streams applications keep a local state store that will grow and grow over time. How scalable is that? How is that local storage kept (local disk/RAM)? If it's disk, we should provision applications with enough space; if it's RAM, with enough memory.
Loading Current State: The mechanism described in the blog, re-creating current state ad hoc for a single entity, would indeed be costly with Kafka. However, Kafka Streams follows the philosophy of keeping the current state for all objects in a KTable (that is distributed/sharded). Thus, it's never required to do this. Of course, it comes with certain memory costs.
Kafka Streams parallelizes based on different events. Thus, all interactions for a single event (processing, state updates) are performed by a single thread. Thus, I don't see why there should be inconsistent writes.
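One way to picture the single-thread argument above is the following plain-Python sketch. It assumes that records are routed to partitions by hashing the record key and that each partition is processed by exactly one thread. The CRC32 hash is a stand-in chosen for this sketch, not Kafka's partitioner, and `partition_for` and `route` are hypothetical helper names.

```python
import zlib
from collections import defaultdict

def partition_for(key, num_partitions):
    """Map a record key to a partition deterministically.

    CRC32 is a stand-in for this sketch; it is not Kafka's partitioner hash.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def route(events, num_partitions):
    """Group (key, payload) events by partition, preserving arrival order."""
    partitions = defaultdict(list)
    for key, payload in events:
        partitions[partition_for(key, num_partitions)].append((key, payload))
    return partitions

events = [
    ("customer-1", "created"),
    ("customer-2", "created"),
    ("customer-1", "address-changed"),
    ("customer-1", "deleted"),
]
partitions = route(events, num_partitions=3)

# Every event for a given key lands in exactly one partition, in arrival order.
for key in ("customer-1", "customer-2"):
    owners = [p for p, records in partitions.items()
              if any(k == key for k, _ in records)]
    print(key, "is handled by partition(s)", owners)
```

Because a key never spans two partitions, updates to the state of one entity are applied sequentially by whichever thread owns that partition.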
I am not sure what the exact requirement would be. In the current implementation, Kafka Streams does not offer any store-specific authentication or security features. There are several things one could do for security, though: (a) encrypt the local disk: this might be the simplest thing to do to protect data; (b) encrypt messages within the business logic, before you put them into the store.
Interactive Queries offers limited support for many reasons (I don't want to go into details), and it was never designed with the goal of supporting complex queries. The idea is eager computation of results that can be retrieved with simple lookups. As you pointed out, this is not very scalable (it is cost-intensive) if you have a lot of different queries. To tackle this, it would make sense to load the data into a database and let the DB do what it is built for. Kafka Streams alone is not the right tool for this at the moment. However, there is no reason not to combine both.
By default, Kafka Streams uses RocksDB to keep local state (you can switch to in-memory stores, too). Thus, it's possible to write to disk and to use very large state. Of course, you need to provision your instances accordingly (cf. https://docs.confluent.io/current/streams/sizing.html). Besides this, Kafka Streams scales horizontally and is fully elastic. Thus, you can add new instances at any point in time, allowing you to hold terabytes of state if you have large disks and enough instances. Note that the number of input topic partitions limits the number of instances you can use (internally, Kafka Streams is a consumer group, and you cannot have more instances than partitions). If this is a concern, it's recommended to over-partition the input topics in the first place.
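The partition limit mentioned above can be illustrated with a small plain-Python sketch. It uses a simple round-robin assignment, which is an assumption for illustration and not the actual Kafka Streams assignor; `assign_partitions` is a hypothetical helper.

```python
def assign_partitions(num_partitions, num_instances):
    """Spread partitions over instances round-robin (illustrative only)."""
    assignment = {instance: [] for instance in range(num_instances)}
    for partition in range(num_partitions):
        assignment[partition % num_instances].append(partition)
    return assignment

# 4 partitions, 6 instances: two instances receive nothing to do.
small = assign_partitions(num_partitions=4, num_instances=6)
idle = [instance for instance, parts in small.items() if not parts]
print("idle instances:", idle)

# Over-partitioned topic: 12 partitions keep all 6 instances busy.
large = assign_partitions(num_partitions=12, num_instances=6)
print("partitions per instance:", {i: len(p) for i, p in large.items()})
```

With more instances than partitions, the extra instances hold no state and process no records, which is why over-partitioning up front leaves room to scale out later.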