How to do stream processing with Redpanda?

Redpanda seems easy to work with, but how would one process streams in real-time?
We have a few thousand IoT devices that send us data every second. We would like to get the running average of the data from the last hour for each of the devices. Can the built-in WebAssembly stuff be used for this, or do we need something like Materialize?

Given that it is marketed as "Kafka Compatible," any Kafka library should work with Redpanda, including Kafka Streams, KSQL, Apache Spark, Flink, Storm, etc.
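For example, an existing confluent_kafka producer only needs its bootstrap.servers pointed at a Redpanda broker; the address and topic below are placeholders, not from the question:

from confluent_kafka import Producer

# Placeholder broker address: Redpanda speaks the Kafka protocol, so the
# client config is the same as for a Kafka cluster.
producer = Producer({"bootstrap.servers": "redpanda-0:9092"})
producer.produce("iot-readings", b'{"device_id": "dev-1", "value": 42.0}')
producer.flush()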

Thanks, folks. Since it hasn't been mentioned I'm going to add my own answer as well here.
We ended up using Bytewax.
It worked great with our existing Kubernetes setup. It supports stateful operations and scales horizontally into multiple pods if needed. It's pretty performant (1), and since it's basically just a Python program, it can be customized to read from and write to whatever you want.
(1) The Bytewax pod actually uses less CPU than our KafkaJS pod, which just stores all messages to a DB.
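For illustration, here is what the core computation (a per-device one-hour sliding average) looks like with a plain Kafka consumer; Bytewax, Kafka Streams, etc. essentially manage this kind of state and windowing for you. Broker address, topic name, and message layout are assumptions, not taken from the question:

import json
import time
from collections import defaultdict, deque
from confluent_kafka import Consumer

# Assumed message layout: {"device_id": ..., "value": ..., "ts": ...}
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # placeholder broker
    "group.id": "hourly-averages",
    "auto.offset.reset": "latest",
})
consumer.subscribe(["iot-readings"])          # placeholder topic

WINDOW = 3600                                 # one hour, in seconds
windows = defaultdict(deque)                  # device_id -> deque of (ts, value)
sums = defaultdict(float)                     # device_id -> sum of values in the window

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    dev, val = event["device_id"], event["value"]
    ts = event.get("ts", time.time())

    # Add the new reading, then evict anything older than one hour.
    windows[dev].append((ts, val))
    sums[dev] += val
    while windows[dev] and windows[dev][0][0] < ts - WINDOW:
        _, old_val = windows[dev].popleft()
        sums[dev] -= old_val

    running_avg = sums[dev] / len(windows[dev])
    # ... publish running_avg to an output topic, database, etc.

With a few thousand devices reporting every second, this keeps roughly 3,600 readings per device in memory; a dedicated engine such as Bytewax or Kafka Streams also handles recovery of that state after a restart.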

Here's more information about stream processors that work with Redpanda.
https://redpanda.com/blog/kafka-stream-processors

Related

Queuing full with confluent kafka

I'm using websockets as data sources for Kafka producers (via the confluent_kafka library), with the topic data ending up in a PostgreSQL database.
I have 4 parallel websockets running in different scripts, connected to different topics which output to different tables in the database.
It turns out that one of those websockets is quite demanding and can return 300 entries within a second or, at worst, 10,000 entries within a few seconds. After a while, I get this error:
ERROR: Local: Queue Full
I've tried adding linger.ms=100 to confluent-7.3.1/etc/kafka/producer.properties but I still get the same issue.
What would be a good approach to solving this problem? Should I raise the linger value to even higher numbers or would that incur some sort of downside to my pipeline? Are there any other parameters I should consider?
I'm using a local confluent set-up (for now) and I'm using JDBC connectors to sink the topic data to the database. Is this problem also just an issue with local set-ups and maybe just migrating to a more production-level set-up would solve it?
I'll gladly display specific code or any parameters if necessary. Since there are so many things to tweak I'm not really sure what would be helpful.
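As a point of reference, with the confluent_kafka Python client the producer settings (linger.ms, queue.buffering.max.messages, etc.) are passed in the config dict given to Producer() rather than read from etc/kafka/producer.properties, and the usual way to absorb bursts is to keep calling poll() so delivery callbacks are served and to back off on BufferError. A minimal sketch with placeholder broker and topic names, not an exact fix for this setup:

from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",     # placeholder broker
    "linger.ms": 100,                          # batch for up to 100 ms
    "queue.buffering.max.messages": 500000,    # enlarge the local queue (librdkafka default is 100000)
})

def send(topic, value):
    try:
        producer.produce(topic, value)
    except BufferError:
        # Local queue is full: serve delivery callbacks to drain it, then retry once.
        producer.poll(1)
        producer.produce(topic, value)
    producer.poll(0)                           # serve delivery reports on every send

# ... produce loop over websocket messages ...
producer.flush()                               # drain everything before shutting down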

What is a better way to compute statistical information from the events in Kafka?

I have a project where I need to provide statistical information via an API to external services. In this service I use only Kafka as "storage". When the application starts, it reads one week of events from the cluster and counts some values, then actively listens for new events to keep the information up to date. For example, the information is "how many times item x was sold", etc.
Startup of the application takes a lot of time and brings other problems with it. It is a Kubernetes service, and the readiness probe fails from time to time because reading the last week's events takes so long.
Two alternatives came to my mind to replace the entire logic:
Kafka Streams or KSQL (I'm not sure whether I would need the same amount of memory and compute here)
Cache Database
I'm wondering which idea would be better here, or is there a better option than these?
First, I hope this is a compacted topic that you are reading; otherwise your "x times" counts will be misleading, as data is deleted from the topic.
Any option you choose will require reading from the beginning of the topic, so the solution comes down to starting a persistent consumer that either:
stores data on disk, such as a Kafka Streams or KSQL KTable (backed by RocksDB), or
writes to some other database of your choice. Redis would be a good option, but so would Couchbase if you want to use Memcached.
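For illustration, a minimal sketch of the second option: a plain consumer that keeps the per-item counts in Redis instead of application memory, so a restart does not need to re-read a week of events. Broker address, topic name, and message layout are assumptions, not from the question:

import json
import redis
from confluent_kafka import Consumer

r = redis.Redis(host="localhost", port=6379)   # placeholder Redis instance
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",     # placeholder broker
    "group.id": "sales-counter",
    "auto.offset.reset": "earliest",           # rebuild counts from the start of the (compacted) topic
})
consumer.subscribe(["sales"])                  # placeholder topic

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    # Durable running count per item; the API can read these keys directly.
    r.incr(f"sold:{event['item_id']}")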

How to maintain Alpakka/Akka Streams source state across application restarts?

I am new to Alpakka and am considering using it for system integration. What would be the ideal way to maintain the state of Akka Streams sources across application restarts?
For example: let's assume I'm using something as follows to continuously read some input data and dump it somewhere. What if it runs for like 4h, then the full JVM crashes and restarts (e.g. k8s restarts my pod or so):
someSource
.via(someTransformation)
.via(someOtherTransformation)
.toMat(...)
.run()
I understand that if someSource is a Kafka source, a Kinesis source, or some other stateful source, it can keep track of its offset or checkpoint and restart more or less where it left off.
However, many other sources have no such concept, e.g. the Cassandra source, the File source or the RDBMS source. For example, if I shut down and restart the code provided in the RDBMS example, it will restart from the top each time.
Am I understanding correctly that there is no mechanism to address this out of the box, such that we have to handle it manually? I would have imagined that this feature would be desired so commonly that it would be handled somehow. If not, how do people typically address it? Do you use Akka Persistence to store some cursors in a few actors? Or do you store the origin offset together with the output data and re-read it on startup?
Or am I looking at all this the wrong way?
It is a feature that is extremely commonly desired, for the reason you suggest.
However, the only generic, reliable way to implement this would be to use Akka Persistence, which is probably the single heaviest dependency in the Akka ecosystem (e.g. it requires choosing a database). Beyond that, it's going to be somewhat source specific. Some sources (e.g. Kafka, Kinesis) have a means of doing this that will fit the bill in nearly every scenario, but for the others, the details of how to store the state of consumption are something on which there will be a lot of differences of opinion. Akka and Alpakka in general tend to shy away from opinionation.
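For what it's worth, the "store the origin offset together with the output data" idea from the question is roughly what this looks like when hand-rolled for a source with no native offsets. A language-agnostic sketch (shown here in Python with made-up table and column names, not Alpakka-specific):

import sqlite3

# Illustrative only: remember a cursor for a source that has no offset
# concept of its own (e.g. polling rows out of an RDBMS by ascending id).
db = sqlite3.connect("pipeline.db")
db.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, payload TEXT)")
db.execute("CREATE TABLE IF NOT EXISTS progress (source TEXT PRIMARY KEY, last_id INTEGER)")

def last_id(source):
    row = db.execute("SELECT last_id FROM progress WHERE source = ?", (source,)).fetchone()
    return row[0] if row else 0

def run_once(source="orders"):
    rows = db.execute(
        "SELECT id, payload FROM orders WHERE id > ? ORDER BY id LIMIT 100",
        (last_id(source),),
    ).fetchall()
    if not rows:
        return
    for _, payload in rows:
        pass  # transform and write each record downstream
    # Persist the new cursor, ideally in the same transaction as the output,
    # so a crash between "write output" and "save cursor" is not silent.
    db.execute("INSERT OR REPLACE INTO progress VALUES (?, ?)", (source, rows[-1][0]))
    db.commit()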

Kafka: at what data volume does it become worth using?

I work on a log centralization project.
I'm working with ELK to collect/aggregate/store/visualize my data. I see that Kafka can be useful for large volumes of data, but
I cannot find information on the volume of data from which it becomes interesting to use it.
10 GB of logs per day? Less, more?
Thanks for your help.
Let's approach this in two ways.
What volumes of data is Kafka suitable for? Kafka is used at large scale (Netflix, Uber, PayPal, Twitter, etc.) and small.
You can start with a cluster of three brokers handling a few MB if you want, and scale out from there as required. 10 GB of data a day would be perfectly reasonable in Kafka, but so would ten times less or ten times more.
What is Kafka suitable for? In the context of your question, Kafka serves as an event-driven integration point between systems. It can be a "dumb" pipeline, but because it persists data, that data can be re-consumed elsewhere. It also offers native stream processing capabilities and integration with other systems.
If all you are doing is getting logs into Elasticsearch then Kafka may be overkill. But if you wanted to use that log data in another place (e.g. HDFS, S3, etc), or process it for patterns, or filter it for conditions to route elsewhere—then Kafka would be a sensible option to route it through. This talk explores some of these concepts.
In terms of ELK and Kafka specifically, Logstash and Beats can write to Kafka as an output, and there's a Kafka Connect connector for Elasticsearch.
Disclaimer: I work for Confluent.

Microservices & Kafka: To couple or not to couple

I'm having trouble wrapping my mind around a probably normal setup of microservices and Kafka that we are currently putting in place.
We have one topic in Kafka and multiple consumers reading from this topic via separate consumer groups.
But somehow I think this could lead to coupling between microservices, as we have two consumers reading the exact same data from the same topic. Additionally, we do not have any retention limit on the messages, so I'm effectively treating Kafka as a kind of data store. So I would think we should rather replicate the messages into their own topic for the other service/consumer.
We have different opinions on whether this is coupling or decoupling, and I'd like to hear your opinions on what I'm getting wrong, because I feel like I am. Thank you for your support!
In my opinion, using a Kafka topic for multiple services or apps to consume is the right approach, as long as your services don't rely on it repeatedly. Meaning a service should read the queue once, translate the data into whatever it requires, and store it itself if required. This way the topic doesn't become a permanent data store but rather a decoupled way to input data (as if you were to call the service directly with that raw data, but in a more decoupled fashion, by allowing the service to read the topic whenever it is ready, at whatever frequency is required). This increases the resilience of your overall system.
And there is a coupling: the raw data. But from my perspective it is totally OK for multiple services to understand the same data format (that of the topic), as long as the format is mostly stable. The assumption here is that this is raw data that each service has to transform into a form that is useful for itself. You just have to make sure the raw data format is versioned correctly whenever changes are necessary. And to allow services to continue to work, you will potentially have to deliver multiple versions concurrently until all services support the latest version. This architectural style is used by many large systems and works, as long as you don't have a scenario where the raw data format needs to change very frequently in ways that are incompatible with your service designs. (If that were the case, you'd probably need another layer: a stable meta-model below that can describe the dynamic raw data.)