Improve performance by using Flink instead of Kafka Streams when Source and Sink are in Kafka? - apache-kafka

Assuming I have input data coming in via Kafka topics, and output data to be sent to Kafka topics as well, under what circumstances would Flink be able to process data faster than Kafka Streams? At least when it comes to the time spent consuming and producing, I would not expect Flink to be any faster than Kafka Streams.

Both Flink and Kafka Streams are built on top of the same Producer and Consumer API, so they'll act similarly, up to a point. Once you get into each framework's specific API/DSL, the call stack gets deeper.
Outside of record serialization, Flink can do more, such as Flink SQL compared to KSQL in the Kafka ecosystem, but in those cases you're managing an external cluster.
Personally, I find Kafka Streams faster to develop and maintain because the application itself is the deployable unit, not something you submit to a pool of resources that might be preempted by some scheduler. But if you want to use a non-JVM language, you will need to venture into Flink or even Beam. Those other languages will be slower, because the code then has to interface with the native Java libraries.

Related

Kafka Streams vs Flink

I wrote an application that reads 100,000 Avro records per second from a Kafka topic, aggregates by key, uses tumbling windows of 5 different sizes, calculates the highest, lowest, initial and end value, and writes the result back to another Kafka topic.
This application already exists in Flink, but the source is RSocket in CSV format and the sink is Cassandra. The problem is that the new application is using a lot more CPU and memory. I checked this article and noticed performance is not mentioned.
Am I correct to assume the difference is mostly because of Avro serialisation / deserialisation, or is Flink supposed to be faster for this use case? If the difference is small, I'd prefer Kafka Streams to avoid needing to manage the cluster.
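For readers unfamiliar with the workload, here is a minimal sketch of the aggregation described above: per-key tumbling windows of several sizes, each tracking the initial, highest, lowest and end value. It is plain Python with no Kafka, Flink or Avro involved; the function name `tumbling_ohlc`, the record layout `(key, timestamp_ms, value)` and the key `sensor-1` are assumptions made for illustration, not the asker's code.

```python
def tumbling_ohlc(records, window_sizes_ms):
    """Aggregate (key, timestamp_ms, value) records into tumbling windows.

    Returns {(key, window_size_ms, window_start_ms): aggregate}, where each
    aggregate holds the initial, highest, lowest and end value of the window.
    Assumption: records arrive in timestamp order for each key.
    """
    result = {}
    for key, ts, value in records:
        for size in window_sizes_ms:
            window_start = ts - (ts % size)  # align to the window boundary
            slot = (key, size, window_start)
            agg = result.get(slot)
            if agg is None:
                # First record of this window sets all four values.
                result[slot] = {"open": value, "high": value, "low": value, "close": value}
            else:
                agg["high"] = max(agg["high"], value)
                agg["low"] = min(agg["low"], value)
                agg["close"] = value
    return result


# Hypothetical sample data: one key, four readings, two window sizes.
records = [
    ("sensor-1", 0, 1.10),
    ("sensor-1", 400, 1.30),
    ("sensor-1", 900, 1.05),
    ("sensor-1", 1200, 1.20),
]
windows = tumbling_ohlc(records, window_sizes_ms=[1000, 5000])
```

Every record updates one window per configured size, so with 5 sizes each incoming record costs 5 state updates, regardless of which framework runs the job.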
I don't think this question can be answered generally. Both Flink and Kafka Streams can be tuned to the workload, and small changes in parameters can make a large difference in performance. Generally, there is no fundamental reason why Flink should be a lot faster than Kafka Streams for such a use case. One exception may be repartitioning, which always needs to go through the Kafka cluster for Kafka Streams and can stay within the Flink cluster for Flink, but as I understand it, you are not repartitioning in your use case.
Serialization format may play a large role, however. Some benchmarks that I remember for Protobuf (Avro is similar) showed that the size in (Java) memory is 100x larger than the serialized data on the wire. Again, this depends on many things, in particular how nested/complex your schema is. If Avro is deserialized to a complex object model, this will cause a significant CPU / memory overhead compared to passing strings around.
However, the only way to tell for certain what is slowing down your use case is profiling it and seeing where the additional resources are spent.
Without benchmarks on your own hardware, or JVM profiling your code, it's hard to say which will be faster.
Flink does invoke more JVM function calls than Kafka Streams, from what I've seen.
Kafka Streams doesn't work well (or at all) with external systems such as RSocket or Cassandra. Therefore, you would still need Flink or some other ETL tool like Kafka Connect (i.e., manage a cluster) to get data into a Kafka topic to then process, regardless of framework.
Serialization format shouldn't matter. Flink or Kafka Streams will use the exact same JVM methods from Avro (or any other format) SDK.

KafkaStream Vs Flink

I have used Flink to send data from a source to a sink.
My Flink app consumes data from Kafka and sends it to the destination.
The destination is also a Kafka topic, with a different topic name.
Flink is used only for delivery, without any business logic.
In this case, I think that switching from Flink to Kafka Streams would increase throughput, because Flink contributes nothing except delivering data from source to sink. Also, since both my source and sink are Kafka, I think Kafka Streams would be faster at delivering the data.
I would appreciate your opinion on this question.
Thanks.
There's no guarantee one will be faster than the other. You still need to do JVM and network tuning.
Either will work, but the limitation of Kafka Streams is that the data must remain in the same Kafka cluster. Flink has no such limitation.
Or you can simply use MirrorMaker for moving data between Kafka topics of different clusters.

Does it make sense to use kafka-connect to transform kafka messages?

We have the Confluent Platform in our infrastructure. At the core, we use Kafka brokers to distribute events. Dozens of devices produce events to Kafka topics (there is a Kafka topic for each type of event), and the events are serialized with Google's Protobuf. We use Confluent's Schema Registry to keep track of the Protobuf schemas.
For several events, we need to apply some transformation and then publish the output to another Kafka topic. Of course Kafka Streams is one way to accomplish that, like in this example. However, we don't want a Java application for each transformation (which increases the complexity of the project and the development/deployment effort), and it doesn't feel right to put all streams in one application (modifying one would require stopping and restarting all of them).
At this point, we thought that Confluent's Kafka Connect might be a better approach. We can have several workers, and we can deploy them into one Kafka Connect instance or cluster. The question is:
Does it make sense to use Kafka Connect to get messages from one Kafka topic and send them to another Kafka topic? All the use cases and examples aim to get data from outside (database, file, etc.) into Kafka, or from Kafka to outside.
To clarify, Kafka Connect is not "Confluent's", it's part of Apache Kafka.
While you could use MirrorMaker2/Confluent Replicator with transforms, it honestly wouldn't be much different than extracting the transformation logic into a shared library, then bundling a deployable Kafka Streams application that accepts configuration parameters for input and output topics with the transformation in-between.
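As a rough illustration of that "shared library plus one configurable application" idea, here is a sketch in plain Python. The in-memory `topics` dict stands in for a Kafka cluster, and the names `TRANSFORMS`, `run_pipeline`, the config keys and the topic names are all hypothetical; a real deployment would use a Kafka Streams topology instead of the loop below.

```python
# Hypothetical shared library: named, side-effect-free transformations.
TRANSFORMS = {
    "uppercase": lambda value: value.upper(),
    "reverse": lambda value: value[::-1],
}


def run_pipeline(config, topics):
    """Read every record from the input topic, transform it, write it to the output topic.

    `topics` is an in-memory stand-in for a Kafka cluster: {topic_name: [values]}.
    `config` carries the parameters that would differ per deployment.
    """
    transform = TRANSFORMS[config["transform"]]
    # Copy the input so the loop is safe even if input and output are the same topic.
    inputs = list(topics.get(config["input_topic"], []))
    output = topics.setdefault(config["output_topic"], [])
    for value in inputs:
        output.append(transform(value))
    return output


# The same deployable, started twice with different configuration.
topics = {"events.raw": ["a1", "b2"]}
run_pipeline({"input_topic": "events.raw", "output_topic": "events.upper", "transform": "uppercase"}, topics)
run_pipeline({"input_topic": "events.raw", "output_topic": "events.reversed", "transform": "reverse"}, topics)
```

Each transformation then runs as its own instance of one artifact, so changing one of them does not require restarting the others.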
You make a good point about a single point of administration, but that's also a single point of failure. If you use Connect, changing your transform plugin will also require you to stop and restart the Connect server. And if all topics are part of the same connector, then any task failure would stop some percentage of the topic transformations.
Kafka Streams (or KSQL) is preferred for intra-cluster transformations, anyway.
You could also look at solutions like Apache Nifi for more complex event management and routing

Kafka stream vs kafka consumer how to make decision on what to use

I have worked on some Kafka Streams applications and Kafka consumer applications. In the end, Kafka Streams is nothing but a consumer which consumes real-time events from Kafka. So I am not able to figure out when to use Kafka Streams, or why we should use Kafka Streams, as we can perform all transformations on the consumer end.
I want to understand the main differences between Kafka Streams and the Kafka consumer, implementation-wise, and how to decide which to use in different use cases.
Thanks in advance for answers.
It's a question about "ease of use" (or simplicity) and "flexibility". The two "killer features" of Kafka Streams, compared to plain consumer/producer, are:
built-in state handling, and
exactly-once processing semantics.
Building a stateful, fault-tolerant application or using Kafka transactions with plain consumers/producers is quite difficult to get right. Furthermore, the higher level DSL provides a lot of built-in operators that are hard to build from scratch, especially:
windowing and
joins (stream-stream, stream-table, table-table)
Another nice feature is punctuations.
However, even if you build a simple stateless application, using Kafka Streams can help you significantly reduce your code base (i.e., avoid boilerplate code). Hence, the recommendation is to use Kafka Streams when possible and only fall back to consumer/producer if Kafka Streams is not flexible enough for your use case.
They're different ways to do the same thing, with different levels of abstraction and functionality.
Here's a side-by-side comparison of doing the same thing (splitting a string into two separate fields) with the plain Kafka clients vs. with Kafka Streams (for good measure it shows doing it in ksqlDB too).

Akka Stream Kafka vs Kafka Streams

I am currently working with Akka Stream Kafka to interact with Kafka, and I was wondering what the differences are with Kafka Streams.
I know that the Akka-based approach implements the reactive specifications and handles back-pressure, functionality that Kafka Streams seems to be lacking.
What would be the advantage of using Kafka Streams over Akka Stream Kafka?
Your question is very general, so I'll give a general answer from my point of view.
First, I've got two usage scenarios:
cases where I'm reading data from Kafka, processing it and writing some output back to Kafka; for these I'm using Kafka Streams exclusively.
cases where either the data source or sink is not Kafka; for those I'm using Akka Streams.
This already allows me to answer the part about back-pressure: for the 1st scenario above, there is a back-pressure mechanism in Kafka Streams.
Let's now focus only on the first scenario described above. Let's see what I would lose if I decided to stop using Kafka Streams:
some of my stream processor stages need a persistent (distributed) state store; Kafka Streams provides it for me. It is something that Akka Streams doesn't provide.
scaling: Kafka Streams automatically balances the load as soon as a new instance of a stream processor is started, or as soon as one gets killed. This works inside the same JVM, as well as on other nodes: scaling up and out. This is not provided by Akka Streams.
Those are the biggest differences that matter to me, I'm hoping that it makes sense to you!
The big advantage of Akka Streams over Kafka Streams is the possibility to implement very complex processing graphs that can be cyclic, with fan-in/fan-out and feedback loops. Kafka Streams only allows acyclic graphs, if I am not wrong. It would be very complicated to implement a cyclic processing graph on top of Kafka Streams.
Found this article to give a good summary of distributed design concerns that Kafka Streams provides (complements Akka Streams).
https://www.beyondthelines.net/computing/kafka-streams/
message ordering: Kafka maintains a sort of append-only log where it stores all the messages. Each message has a sequence id, also known as its offset. The offset indicates the position of a message in the log. Kafka Streams uses these message offsets to maintain ordering.
partitioning: Kafka splits a topic into partitions and each partition is replicated among different brokers. Partitioning spreads the load, and replication makes the application fault-tolerant (if a broker is down, the data is still available). That's good for data partitioning, but we also need to distribute the processes in a similar way. Kafka Streams uses the processor topology, which relies on Kafka group management. This is the same group management that is used by the Kafka consumer to distribute load evenly among consumer instances (this work is mainly coordinated by the brokers).
Fault tolerance: data replication ensures data fault tolerance. Group management has fault tolerance built in, as it redistributes the workload among the remaining live instances.
State management: Kafka Streams provides local storage backed by a Kafka change-log topic which uses log compaction (keeps only the latest value for a given key).
Reprocessing: when starting a new version of the app, we can reprocess the logs from the start to compute new state, then redirect the traffic to the new instance and shut down the old application.
Time management: “Stream data is never complete and can always arrive out-of-order”, therefore one must distinguish event time from processing time and handle it correctly.
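The partitioning and fault-tolerance points in the list above can be sketched with a toy assignment function. This is a simplified round-robin in plain Python, not Kafka's actual assignor logic (the real range, sticky and cooperative strategies behave differently); the function name `assign_partitions` and the instance names `app-1` to `app-3` are made up for illustration.

```python
def assign_partitions(partitions, instances):
    """Spread partitions over the live instances, round-robin.

    Simplified stand-in for consumer group management; requires at least
    one live instance.
    """
    ordered = sorted(instances)
    assignment = {instance: [] for instance in ordered}
    for i, partition in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(partition)
    return assignment


partitions = range(6)

# Three instances share six partitions.
before = assign_partitions(partitions, ["app-1", "app-2", "app-3"])

# app-2 dies: its partitions are redistributed among the survivors.
after = assign_partitions(partitions, ["app-1", "app-3"])
```

Recomputing the assignment whenever group membership changes is what lets new instances pick up load and lets survivors take over from a failed one.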
Author also says "Using this change-log topic Kafka Stream is able to maintain a “table view” of the application state."
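To make the "table view" idea concrete, here is a small sketch in plain Python of how replaying a change-log yields the current state, and why compaction does not change the result. It is a simplified model, not Kafka's implementation: the helper names `compact` and `restore_table` are hypothetical, and `None` is used as the tombstone (delete) marker.

```python
def compact(changelog):
    """Keep only the latest record per key, ordered by the offset of that record."""
    latest = {}
    for offset, (key, value) in enumerate(changelog):
        latest[key] = (offset, value)
    by_offset = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_offset, value) in by_offset]


def restore_table(changelog):
    """Replay a change-log into a key/value table; a None value deletes the key."""
    table = {}
    for key, value in changelog:
        if value is None:
            table.pop(key, None)
        else:
            table[key] = value
    return table


changelog = [("a", 1), ("b", 1), ("a", 2), ("b", None), ("c", 5), ("a", 3)]
compacted = compact(changelog)
```

Replaying the full log and replaying the compacted log produce the same table; the compacted one is simply shorter, which is what keeps state restoration affordable after a restart.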
My take is that this applies mostly to an enterprise application where the "application state" is ... small.
For a data science application working with "big data", the "application state" produced by a combination of data munging, machine learning models and business logic to orchestrate all of this will likely not be managed well with Kafka Streams.
Also, I am thinking that using a "pure functional event sourcing runtime" like https://github.com/notxcain/aecor will help make the mutations explicit and separate the application logic from the technology used to manage the persistent form of the state, through the principled management of state mutation and IO "effects" (functional programming).
In other words the business logic does not become tangled with the Kafka apis.
Akka Streams emerged as a dataflow-centric abstraction for the Akka Actors model.
It is a high-performance library built for the JVM and designed especially for general-purpose microservices.
Kafka Streams, on the other hand, is a client library used to process unbounded data: it reads data from Kafka topics, processes it, and writes the results to new topics.
Well, I have used both, and I have a pretty good idea of their strengths and weaknesses.
If you are focused solely on Kafka and don't have much experience with stream processing, Kafka Streams is a good out-of-the-box solution that helps you understand the streaming concepts. Its Achilles heel, in my opinion, is its datastore, RocksDB, which supports stateful scenarios with KTable or internal state stores.
If you use the Kafka Streams library, RocksDB installs itself transparently in the background, which is great for a beginner but troublesome for an experienced developer. RocksDB is a key/value database like Cassandra; it has most of Cassandra's strengths but also its weaknesses, a major one being that you can only query by primary key, which is a huge limitation for most real-life scenarios. There are ways to implement your own datastore, but they are not that well documented and can be a great challenge. Also, RocksDB is really good at loading single values, but if you have to iterate over things, performance degrades significantly beyond a dataset size of 100,000.
Unfortunately, because RocksDB is embedded so deeply in Kafka Streams, it is also not that easy to implement a CQRS solution with it.
And as mentioned above, it has no concept of back-pressure, while the Kafka consumer gives records one by one; in a scenario where you have to scale out, that can be a real bottleneck. And be really careful about the statement that Kafka Streams does not need a back-pressure mechanism; as this Netflix blog points out, the lack of one can cause really unpleasant effects.
"By the following morning, alerts were received regarding high memory consumption and GC latencies, to the point where the service was unresponsive to HTTP requests. An investigation of the JVM memory dump revealed an internal Kafka message concurrent queue whose size had grown uncontrollably to over 1.3 million elements.
The cause for this abnormal queue growth is due to Spring KafkaListener’s lack of native back-pressure support."
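The failure mode in that quote, an in-memory queue growing without limit because records are fetched faster than they are processed, can be reproduced with a small deterministic simulation. This is plain Python with made-up rates and no Kafka or Spring code; the function name `simulate` and its parameters are assumptions for illustration. Setting `max_buffer` models back-pressure: the fetching side stops while the buffer is full.

```python
from collections import deque


def simulate(ticks, produce_per_tick, consume_per_tick, max_buffer=None):
    """Return the peak buffer size over `ticks` rounds of fetch-then-process.

    With max_buffer=None the buffer is unbounded. With max_buffer set, fetching
    pauses whenever the buffer is full (a simple form of back-pressure).
    """
    buffer = deque()
    peak = 0
    for _ in range(ticks):
        for _ in range(produce_per_tick):
            if max_buffer is not None and len(buffer) >= max_buffer:
                break  # back-pressure: stop fetching until the consumer catches up
            buffer.append(None)
        peak = max(peak, len(buffer))
        for _ in range(min(consume_per_tick, len(buffer))):
            buffer.popleft()
    return peak


# Fetching 10 records per tick while only 3 get processed.
unbounded_peak = simulate(ticks=1000, produce_per_tick=10, consume_per_tick=3)
bounded_peak = simulate(ticks=1000, produce_per_tick=10, consume_per_tick=3, max_buffer=50)
```

Without a bound the buffer grows by 7 records every tick, so memory use depends only on how long the imbalance lasts; with a bound it never exceeds the configured limit, at the cost of the fetching side having to wait.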
So what are the advantages and disadvantages of Akka Streams compared to Kafka Streams? First of all, Akka is not much of an out-of-the-box framework: you have to understand the concepts much better. It is not coupled to a single persistence option; you can choose whatever you want. It has direct support for the CQRS pattern (Akka Projection), so you are not bound to querying your data only by primary key. The Akka developers have thought a lot about scaling out and back-pressure, and have committed a lot of code to the Kafka code base to improve performance.
So if you are only working with Kafka and are new to stream processing, you can use Kafka Streams, but be prepared that at some point you may hit a wall and switch to Akka Streams.
If you want to see working details/examples, I have two blogs about it that you can check: blog1 blog2