Kafka: at what data volume does it become worth using?

I work on a log centralization project.
I'm working with ELK to collect/aggregate/store/visualize my data. I see that Kafka can be useful for large volumes of data, but
I can't find information on the volume of data at which it becomes worth using.
10 GB of logs per day? Less? More?
Thanks for your help.

Let's approach this in two ways.
What volumes of data is Kafka suitable for? Kafka is used at large scale (Netflix, Uber, PayPal, Twitter, etc.) and small.
You can start with a cluster of three brokers handling a few MB if you want, and scale out from there as required. 10 GB of data a day would be perfectly reasonable in Kafka, but so would ten times less or ten times more.
What is Kafka suitable for? In the context of your question, Kafka serves as an event-driven integration point between systems. It can be a "dumb" pipeline, but since it persists data, that data can be re-consumed elsewhere. It also offers native stream-processing capabilities and integration with other systems.
If all you are doing is getting logs into Elasticsearch, then Kafka may be overkill. But if you want to use that log data in another place as well (e.g. HDFS, S3, etc.), or process it for patterns, or filter it on conditions to route elsewhere, then Kafka is a sensible option to route it through. This talk explores some of these concepts.
In terms of ELK and Kafka specifically, Logstash and Beats can write to Kafka as an output, and there's a Kafka Connect connector for Elasticsearch.
Disclaimer: I work for Confluent.

Related

How to do stream processing with Redpanda?

Redpanda seems easy to work with, but how would one process streams in real-time?
We have a few thousand IoT devices that send us data every second. We would like to get the running average of the data from the last hour for each of the devices. Can the built-in WebAssembly stuff be used for this, or do we need something like Materialize?
Given that it is marketed as "Kafka Compatible," any Kafka library should work with RedPanda, including Kafka Streams, KSQL, Apache Spark, Flink, Storm, etc.
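To make that concrete, here is a minimal Kafka Streams sketch of the hourly running average per device, pointed at a Redpanda broker. The topic names, the broker address, and the assumption that readings arrive keyed by device id with double-encoded values are placeholders for this example, not anything Redpanda-specific:

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class DeviceHourlyAverage {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "device-hourly-average");
        // Redpanda speaks the Kafka protocol, so only the broker address changes.
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "redpanda:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Readings keyed by device id, values serialized as doubles (an assumption).
        builder.stream("device-readings", Consumed.with(Serdes.String(), Serdes.Double()))
            .groupByKey()
            // One-hour window advancing every five minutes, i.e. a "running" hourly average.
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)).advanceBy(Duration.ofMinutes(5)))
            // Track "sum:count" as a string so this sketch needs no custom serde.
            .aggregate(
                () -> "0.0:0",
                (deviceId, value, agg) -> {
                    String[] parts = agg.split(":");
                    double sum = Double.parseDouble(parts[0]) + value;
                    long count = Long.parseLong(parts[1]) + 1;
                    return sum + ":" + count;
                },
                Materialized.with(Serdes.String(), Serdes.String()))
            .toStream()
            // Unwrap the windowed key back to the device id and compute the average.
            .map((windowedKey, agg) -> {
                String[] parts = agg.split(":");
                double avg = Double.parseDouble(parts[0]) / Long.parseLong(parts[1]);
                return KeyValue.pair(windowedKey.key(), avg);
            })
            .to("device-hourly-averages", Produced.with(Serdes.String(), Serdes.Double()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The only Redpanda-specific part is the bootstrap.servers address; everything else is ordinary Kafka Streams, which is the point of the compatibility claim above.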
Thanks, folks. Since it hasn't been mentioned, I'm going to add my own answer here as well.
We ended up using Bytewax.
It worked great with our existing Kubernetes setup. It supports stateful operations and scales horizontally into multiple pods if needed. It's pretty performant (1), and since it's basically just a Python program it can be customized to read from and write to whatever you want.
(1) The Bytewax pod actually uses less CPU than our KafkaJS pod, which just stores all messages to a DB.
Here's more information about stream processors that work with Redpanda.
https://redpanda.com/blog/kafka-stream-processors

Limitations of Kafka as a Distributed DB

I have an application which requires an interesting orchestration between states of instances distributed across geographic regions, in combination with the need for a scalable distributed database.
At the moment I think that Kafka with log compaction will fit my needs for state maintenance and message exchange between instances, and Cassandra will fit my needs for high volume distributed reads and writes of persisted data.
However, there is quite a lot of data duplicated that way: much of the data exchanged via Kafka would also need to be stored in Cassandra for distributed data access. So using Kafka for both messaging and distributed data querying and persistence seems tempting.
Therefore, I'm interested in the real-world pros and cons to expect when using, for example, the pull queries feature of ksqlDB to use Kafka as a distributed database [1].
However, I'm a bit unsure what to expect in terms of performance and scalability, especially compared to Cassandra, and wary of unknown pitfalls.
What are the tradeoffs of using Kafka as a distributed DB, and how does it compare performance-wise to "native" distributed systems like Cassandra?
[1] https://www.confluent.io/de-de/blog/pull-queries-in-preview-confluent-cloud-ksqdb/
For pure KV lookups, Kafka StateStores / Interactive Queries can work, but with the caveat that if you use containers and an orchestrator, you need to keep the state of those stores on persistent volumes. Otherwise, when a container moves to a fresh host, the Streams changelog topic has to be read from the very beginning, giving you a "cold start" problem during which you are unable to query.
Any database (with persistent storage) will not have this problem and will always be able to serve queries immediately.
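As a rough illustration of the StateStores approach, here is a minimal Kafka Streams sketch. The topic name, store name, broker address, and state-directory path are assumptions; the idea is to materialize a compacted topic into a local RocksDB store whose directory sits on a persistent volume, and to serve pure KV lookups through Interactive Queries:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class KvLookupService {

    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kv-lookup-service");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
        // Point the RocksDB state directory at a persistent volume mount so a
        // rescheduled container does not have to rebuild the store from the changelog.
        props.put(StreamsConfig.STATE_DIR_CONFIG, "/var/lib/kafka-streams");

        StreamsBuilder builder = new StreamsBuilder();
        // Materialize a (log-compacted) topic as a local, queryable key-value store.
        builder.table("entity-state",
            Consumed.with(Serdes.String(), Serdes.String()),
            Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("entity-store"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Stores are not queryable while the instance is still restoring state.
        while (streams.state() != KafkaStreams.State.RUNNING) {
            Thread.sleep(100);
        }

        // Interactive Query: a pure KV lookup against the local store.
        ReadOnlyKeyValueStore<String, String> store = streams.store(
            StoreQueryParameters.fromNameAndType("entity-store", QueryableStoreTypes.keyValueStore()));
        System.out.println(store.get("some-key"));
    }
}
```

The state.dir setting is what ties the store to the persistent volume; without it (or with an ephemeral volume), every rescheduled container pays the cold-start restore from the changelog topic described above.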
I'm not sure I would suggest Cassandra for strictly KV data, though.

What's the recommended number of Kafka connectors for a large database? (Debezium)

I'm trying to set up Debezium for data change monitoring in this huge database.
Documentation says "Debezium can monitor any number of databases. The number of connectors that can be deployed to a single cluster of Kafka Connect services depends upon the volume and rate of events. However, Debezium supports multiple Kafka Connect service clusters and, if needed, multiple Kafka clusters as well."
However, there's no mention of how many connectors is good practice.
Reading Medium posts and some use cases, it seems like one connector for a whole database is a suitable option. But if we have a lot of tables and a lot of change events in a short span of time, wouldn't that become a bottleneck? I've seen people working with one connector per table too, which would mean a LOT of connectors in this case. If you have a use case involving heavy databases along with Debezium, could you share your experiences regarding connectors?
(The source databases, in this case, are mostly Postgres.)
Sorry if it's a dumb question. Thank you in advance.

What is the best way to compute statistical information over the events in Kafka?

I have a project where I need to provide statistical information via an API to external services. In this service I use only Kafka as "storage". When the application starts, it reads the last week of events from the cluster and counts some values, then actively listens for new events to keep the information up to date. For example: "how many times item x was sold", etc.
Startup of the application takes a lot of time and brings other problems with it. It is a Kubernetes service, and the readiness probe fails from time to time because reading the last week of events takes so long.
Two alternatives came to my mind to replace the entire logic:
Kafka Streams or KSQL (I'm not sure if I would need the same amount of memory and compute here)
Cache database
I'm wondering which idea would be better here, or is there a better idea than either of them?
First, I hope the topic you are reading is compacted; otherwise your "x times" counts will be misleading, as data gets deleted from the topic.
Any option you choose will require reading from the beginning of the topic, so the solution will come down to starting a persistent consumer that stores the data in one of the following (see the sketch below):
On disk, e.g. in a Kafka Streams or KSQL KTable backed by RocksDB
Some other database of your choice. Redis would be a good option, but so would Couchbase if you want to use Memcached.
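For the Kafka Streams option, a minimal sketch of the "how many times item x was sold" count could look like the following. The topic name, store name, and broker address are assumptions, and sale events are assumed to be keyed by item id:

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Materialized;

public class ItemSalesCounter {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "item-sales-counter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Sale events; the key is assumed to already be the item id.
        builder.stream("item-sales", Consumed.with(Serdes.String(), Serdes.String()))
            .groupByKey()
            // "How many times was item x sold?" kept in a local RocksDB store,
            // with a changelog topic in Kafka so the count survives restarts.
            .count(Materialized.as("item-sale-counts"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The count lives in a local RocksDB store with a changelog topic in Kafka, so on restart the application restores the store (or reuses it directly, if the state directory is on a persistent disk) and resumes from its committed offsets rather than re-reading a week of raw events; your API can then serve lookups from that store via Interactive Queries, which should also help with the readiness-probe problem.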

Akka Distributed Pub/Sub and number of named topics

I would like to create a named topic per online user in my system using Akka clustering. Would having a few tens of thousands of named topics at a time impact performance negatively?
I would not recommend it. Topic information is represented by a service key in the Receptionist. Between 10k and 100k is probably OK; above that you will most likely see some performance issues.
Depending on what you need, using cluster sharding might be a better fit.
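To give a sense of what that could look like, here is a minimal Akka Cluster Sharding sketch (Typed, Java DSL) that uses one sharded entity per online user instead of one pub/sub topic per user. The UserCommand protocol, entity type name, and behavior are assumptions, and the usual cluster configuration (cluster actor provider, joining/seed nodes) is omitted:

```java
import akka.actor.typed.ActorSystem;
import akka.actor.typed.Behavior;
import akka.actor.typed.javadsl.Behaviors;
import akka.cluster.sharding.typed.javadsl.ClusterSharding;
import akka.cluster.sharding.typed.javadsl.Entity;
import akka.cluster.sharding.typed.javadsl.EntityRef;
import akka.cluster.sharding.typed.javadsl.EntityTypeKey;

public class OnlineUsers {

    // Minimal per-user message protocol (an assumption for this sketch).
    public interface UserCommand {}
    public static final class Notify implements UserCommand {
        public final String text;
        public Notify(String text) { this.text = text; }
    }

    // One entity type for all online users; the entity id is the user id.
    public static final EntityTypeKey<UserCommand> TYPE_KEY =
        EntityTypeKey.create(UserCommand.class, "OnlineUser");

    public static Behavior<UserCommand> userBehavior(String userId) {
        return Behaviors.receive(UserCommand.class)
            .onMessage(Notify.class, msg -> {
                // Handle the message for this specific user.
                System.out.println("user " + userId + " got: " + msg.text);
                return Behaviors.same();
            })
            .build();
    }

    public static void main(String[] args) {
        // Assumes the system is configured with the cluster actor provider and joins a cluster.
        ActorSystem<Void> system = ActorSystem.create(Behaviors.empty(), "cluster");

        // Register the sharded entity type once; entities are created on demand and
        // spread over the cluster, so there is no per-user service key in the Receptionist.
        ClusterSharding sharding = ClusterSharding.get(system);
        sharding.init(Entity.of(TYPE_KEY, ctx -> userBehavior(ctx.getEntityId())));

        // Send to a user wherever their entity happens to live.
        EntityRef<UserCommand> user = sharding.entityRefFor(TYPE_KEY, "user-42");
        user.tell(new Notify("hello"));
    }
}
```

Entities are created on demand and distributed across the cluster, which avoids the Receptionist service-key scaling concern described above.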