I am currently working with Akka Stream Kafka to interact with Kafka, and I was wondering what the differences are compared to Kafka Streams.
I know that the Akka-based approach implements the Reactive Streams specification and handles back-pressure, functionality that Kafka Streams seems to be lacking.
What would be the advantage of using Kafka Streams over Akka Streams Kafka?
Your question is very general, so I'll give a general answer from my point of view.
First, I've got two usage scenarios:
cases where I'm reading data from Kafka, processing it, and writing some output back to Kafka; for these I'm using Kafka Streams exclusively.
cases where either the data source or the sink is not Kafka; for those I'm using Akka Streams.
This already allows me to answer the part about back-pressure: for the first scenario above, there is a back-pressure mechanism in Kafka Streams.
Let's now focus only on the first scenario described above, and see what I would lose if I decided to stop using Kafka Streams:
some of my stream processor stages need a persistent (distributed) state store, and Kafka Streams provides it for me; Akka Streams doesn't (see the sketch after this list).
scaling: Kafka Streams automatically rebalances the load as soon as a new instance of a stream processor is started, or as soon as one gets killed. This works inside the same JVM as well as across nodes, i.e. scaling up and out. Akka Streams does not provide this.
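To make the first scenario concrete, here is a minimal Kafka Streams sketch (topic names and the store name are invented for illustration): the count lives in a persistent, changelog-backed state store, and every additional instance started with the same application.id automatically takes over a share of the partitions.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.state.KeyValueStore;

public class OrderCountApp {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Every instance sharing this application.id joins the same group and splits the partitions.
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-count-app");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> orders = builder.stream("orders");              // read from Kafka
    orders.groupByKey()                                                     // key assumed to be a customer id
        .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as(
            "orders-per-customer"))                                         // persistent, changelog-backed store
        .toStream()
        .to("order-counts", Produced.with(Serdes.String(), Serdes.Long())); // write back to Kafka

    KafkaStreams streams = new KafkaStreams(builder.build(), props);
    streams.start();
    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
  }
}
```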
Those are the biggest differences that matter to me. I hope it makes sense to you!
The big advantage of Akka Streams over Kafka Streams is the possibility to implement very complex processing graphs that can be cyclic, with fan-in/fan-out and feedback loops. Kafka Streams only allows acyclic graphs, if I am not wrong, and it would be very complicated to implement a cyclic processing graph on top of it.
I found this article to give a good summary of the distributed design concerns that Kafka Streams addresses (and that complement Akka Streams).
https://www.beyondthelines.net/computing/kafka-streams/
Message ordering: Kafka maintains a sort of append-only log where it stores all the messages. Each message has a sequence id, also known as its offset, which indicates the position of the message in the log. Kafka Streams uses these message offsets to maintain ordering.
Partitioning: Kafka splits a topic into partitions, and each partition is replicated among different brokers. Partitioning spreads the load, and replication makes the application fault-tolerant (if a broker is down, the data is still available). That's good for data partitioning, but we also need to distribute the processing in a similar way. Kafka Streams uses the processor topology, which relies on Kafka group management, the same group management used by Kafka consumers to distribute the load evenly among consumer instances (this work is mainly managed by the brokers).
Fault tolerance: data replication ensures data fault tolerance. Group management has fault tolerance built in, as it redistributes the workload among the remaining live consumer instances.
State management: Kafka Streams provides local storage backed up by a Kafka changelog topic, which uses log compaction (keeps only the latest value for a given key).
Reprocessing: when starting a new version of the app, we can reprocess the logs from the start to compute a new state, then redirect the traffic to the new instance and shut down the old application.
Time management: “stream data is never complete and can always arrive out-of-order”, therefore one must distinguish between event time and processing time and handle it correctly.
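As an illustration of the event-time point, Kafka Streams lets you plug in a custom TimestampExtractor so that windows are driven by when an event happened rather than when it was processed. The sketch below is minimal and assumes an invented payload format (a CSV value whose last field is the event time in epoch millis):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.streams.processor.TimestampExtractor;

// Assumes (for illustration only) that the record value is a CSV string whose last
// field is the event time in epoch millis, e.g. "order-42,99.90,1614870000000".
public class CsvEventTimeExtractor implements TimestampExtractor {
  @Override
  public long extract(ConsumerRecord<Object, Object> record, long partitionTime) {
    try {
      String[] fields = ((String) record.value()).split(",");
      return Long.parseLong(fields[fields.length - 1]);   // event time, not processing time
    } catch (Exception e) {
      // Fall back to the timestamp the producer/broker attached to the record.
      return record.timestamp();
    }
  }
}

// Registered with:
// props.put(StreamsConfig.DEFAULT_TIMESTAMP_EXTRACTOR_CLASS_CONFIG, CsvEventTimeExtractor.class);
```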
The author also says that, using this changelog topic, Kafka Streams is able to maintain a "table view" of the application state.
My take is that this applies mostly to an enterprise application where the "application state" is ... small.
For a data science application working with "big data", the "application state" produced by a combination of data munging, machine learning models and business logic to orchestrate all of this will likely not be managed well with Kafka Streams.
Also, I am thinking that using a "pure functional event sourcing runtime" like https://github.com/notxcain/aecor will help make the mutations explicit and separate the application logic from the technology used to manage the persistent form of the state, through principled management of state mutation and IO "effects" (functional programming).
In other words, the business logic does not become tangled with the Kafka APIs.
Akka Streams emerged as a dataflow-centric abstraction for the Akka Actors model.
It is a high-performance library built for the JVM and specially designed for general-purpose microservices.
As far as Kafka Streams is concerned, it is a client library used to process unbounded data: it reads data from Kafka topics, processes it, and writes the results to new topics.
Well, I have used both of those, and I have a pretty good idea about their strengths and weaknesses.
If you are solely concentrated on Kafka and you don't have too much experience with stream processing, Kafka Streams is a good out-of-the-box solution that helps you understand the streaming concepts. Its Achilles heel, in my opinion, is its datastore: RocksDB, which backs stateful scenarios with KTables or internal state stores.
If you use the Kafka Streams library, RocksDB installs itself in the background transparently, which is great for a beginner but troublesome for an experienced developer. RocksDB is a key/value database like Cassandra; it shares most of Cassandra's strengths, but also its weaknesses, one major one being that you can only query things by primary key, which is a huge limitation for most real-life scenarios. There are some means to implement your own datastore, but they are not that well documented and can be a great challenge. Also, RocksDB is really great at loading single values, but if you have to iterate over things, the performance degrades significantly after a dataset size of about 100 000.
Unfortunately, because RocksDB is embedded so deeply in Kafka Streams, it is also not that easy to implement a CQRS solution with it.
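To make the key-only query limitation concrete, here is a small hedged sketch using Kafka Streams interactive queries (the store name is invented, and the API shown is the StoreQueryParameters variant available in newer Kafka versions): fetching a value for a given key from the RocksDB-backed store is easy, but anything else means scanning and filtering yourself.

```java
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class StoreLookup {
  // 'streams' is a running KafkaStreams instance whose topology materialized a
  // key/value store named "orders-per-customer" (a made-up name for this sketch).
  static Long countForCustomer(KafkaStreams streams, String customerId) {
    ReadOnlyKeyValueStore<String, Long> store =
        streams.store(StoreQueryParameters.fromNameAndType(
            "orders-per-customer", QueryableStoreTypes.keyValueStore()));
    // A point lookup by key works fine; a query like "all customers with more than
    // 100 orders" would mean iterating store.all() and filtering in your own code.
    return store.get(customerId);
  }
}
```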
And as mentioned above, it has no concept of back-pressure: the Kafka consumer hands over records one by one, and in a scenario where you have to scale out that can become a real bottleneck. Be really careful about the statement that Kafka Streams does not need a back-pressure mechanism; as this Netflix blog post points out, the lack of one can cause really unpleasant effects.
"By the following morning, alerts were received regarding high memory consumption and GC latencies, to the point where the service was unresponsive to HTTP requests. An investigation of the JVM memory dump revealed an internal Kafka message concurrent queue whose size had grown uncontrollably to over 1.3 million elements.
The cause for this abnormal queue growth is due to Spring KafkaListener’s lack of native back-pressure support."
So what are the advantages and disadvantages of Akka Streams compared to Kafka Streams? First of all, Akka is not that much of an out-of-the-box framework: you have to understand the concepts much better. On the other hand, it is not coupled to a single persistence option; you can choose whatever you want. It has direct support for the CQRS pattern (Akka Projection), so you are not bound to querying your data only by primary key. The Akka developers thought a lot about scaling out and back-pressure, and committed a lot of code to the Kafka code base to improve performance.
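For contrast, here is a minimal Alpakka Kafka (akka-stream-kafka) sketch of a back-pressured, committing consumer (topic/group names and the saveToDatabase stand-in are invented; the APIs assume Akka 2.6 and Alpakka Kafka 2.x): demand flows backwards from the slow stage, so the source only pulls more records from Kafka as fast as downstream can handle them, instead of queueing them in memory.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import akka.actor.ActorSystem;
import akka.kafka.CommitterSettings;
import akka.kafka.ConsumerMessage;
import akka.kafka.ConsumerSettings;
import akka.kafka.Subscriptions;
import akka.kafka.javadsl.Committer;
import akka.kafka.javadsl.Consumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class BackpressuredConsumer {
  public static void main(String[] args) {
    ActorSystem system = ActorSystem.create("consumer");

    ConsumerSettings<String, String> consumerSettings =
        ConsumerSettings.create(system, new StringDeserializer(), new StringDeserializer())
            .withBootstrapServers("localhost:9092")
            .withGroupId("orders-consumer");

    Consumer.committableSource(consumerSettings, Subscriptions.topics("orders"))
        // At most 4 records are processed concurrently; back-pressure propagates to the
        // Kafka source, so no unbounded in-memory queue can build up.
        .mapAsync(4, msg ->
            saveToDatabase(msg.record().value())
                .<ConsumerMessage.Committable>thenApply(done -> msg.committableOffset()))
        .runWith(Committer.sink(CommitterSettings.create(system)), system);
  }

  // Stand-in for a slow external call (e.g. a SQL insert).
  private static CompletionStage<String> saveToDatabase(String value) {
    return CompletableFuture.supplyAsync(() -> value);
  }
}
```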
So if you are only working with Kafka and are new to stream processing, you can use Kafka Streams, but be prepared that at some point you may hit a wall and have to switch to Akka Streams.
If you want to see working details/examples, I have two blog posts about it; you can check those out: blog1 blog2
I have worked on some Kafka Streams applications and Kafka consumer applications. In the end, a Kafka Streams application is nothing but a consumer that consumes real-time events from Kafka. So I am not able to figure out when to use Kafka Streams, or why we should use it, since we can perform all the transformations on the consumer end.
I want to understand the main differences between Kafka Streams and the Kafka consumer, implementation-wise, and how to decide what to use in different use cases.
Thanks in advance for answers.
It's a question about "ease of use" (or simplicity) versus "flexibility". The two "killer features" of Kafka Streams, compared to a plain consumer/producer, are:
built-in state handling, and
exactly-once processing semantics.
Building a stateful, fault-tolerant application or using Kafka transactions with plain consumers/producers is quite difficult to get right. Furthermore, the higher level DSL provides a lot of built-in operators that are hard to build from scratch, especially:
windowing and
joins (stream-stream, stream-table, table-table)
Another nice feature is punctuations.
However, even if you build a simple stateless application, using Kafka Streams can help you significantly reduce your code base (i.e., avoid boilerplate code). Hence, the recommendation is to use Kafka Streams when possible, and only fall back to consumer/producer if Kafka Streams is not flexible enough for your use case.
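To make those built-in operators concrete, here is a rough sketch (topic names are invented, and the exact config constants vary a bit by Kafka version; this one assumes a recent 3.x client): a tumbling 5-minute windowed count, with exactly-once processing enabled by a single config value, both of which would take considerable code to replicate with a plain consumer/producer.

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class WindowedClicks {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "windowed-clicks");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    // Exactly-once processing semantics, enabled with one config value.
    props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> clicks = builder.stream("clicks");
    clicks.groupByKey()
        // Tumbling 5-minute windows; the windowed state store and its fault
        // tolerance come for free with the DSL.
        .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
        .count()
        .toStream()
        .foreach((windowedKey, count) -> System.out.println(windowedKey + " -> " + count));

    KafkaStreams streams = new KafkaStreams(builder.build(), props);
    streams.start();
    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
  }
}
```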
It's different ways to do the same thing, with different levels of abstraction and functionality.
Here's a side-by-side comparison of doing the same thing (splitting a string into two separate fields) in Kafka vs in Kafka Streams (for good measure it shows doing it in ksqlDB too)
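In the same spirit, here is a rough sketch of only the Kafka Streams side of such a split (topic names and the splitting logic are invented; the linked post has the real side-by-side):

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;

public class SplitNameTopology {
  // Reads "firstName lastName" strings and writes them back as "firstName,lastName";
  // a plain consumer/producer version needs an explicit poll/send loop, error handling,
  // and offset management for the same few lines of logic.
  static Topology build() {
    StreamsBuilder builder = new StreamsBuilder();
    KStream<String, String> names =
        builder.stream("customers-raw", Consumed.with(Serdes.String(), Serdes.String()));
    names.mapValues(full -> {
          String[] parts = full.split(" ", 2);
          return parts[0] + "," + (parts.length > 1 ? parts[1] : "");
        })
        .to("customers-split", Produced.with(Serdes.String(), Serdes.String()));
    return builder.build();
  }
}
```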
I'm currently evaluating options for designing/implementing an Event Sourcing + CQRS architectural approach to system design. Since we want to use Apache Kafka for other aspects (normal pub-sub messaging + stream processing), the next logical question would be: "Can we use Apache Kafka as the event store for CQRS?", or more importantly, would that be a smart decision?
Right now I'm unsure about this.
This source seems to support it: https://www.confluent.io/blog/okay-store-data-apache-kafka/
This other source recommends against that: https://medium.com/serialized-io/apache-kafka-is-not-for-event-sourcing-81735c3cf5c
In my current tests/experiments, I'm having problems similar to those described by the 2nd source, those are:
Recomposing an entity: Kafka doesn't seem to support fast retrieval/searching of specific events within a topic. For example, fetching all commands related to an order's history (necessary for reconstructing the entity instance) seems to require scanning all of the topic's events and filtering only those matching some entity instance identifier, which is a no-go. (This other person seems to have arrived at a similar conclusion: Query Kafka topic for specific record; that is, it is just not possible without relying on some hacky trick.)
Write consistency: Kafka doesn't support transactional atomicity on its store, so it seems a common practice to just put a DB with some locking approach (usually optimistic locking) in front, before asynchronously exporting the events to the Kafka queue (I can live with this though; the first problem is much more crucial to me).
The partition problem: in the Kafka documentation, it is mentioned that the "order guarantee" exists only within a topic's partition. At the same time, they also say that the partition is the basic unit of parallelism; in other words, if you want to parallelize work, spread the messages across partitions (and brokers, of course). But this is a problem, because an "event store" in an event-sourced system needs the order guarantee, so it seems I'm forced to use only one partition for this use case if I absolutely need the order guarantee. Is this correct?
Even though this question is a bit open-ended, it really comes down to this: have you used Kafka as your main event store in an event-sourced system? How have you dealt with the problem of recomposing entity instances out of their command history (given that the topic has millions of entries, scanning the whole set is not an option)? Did you use only one partition, sacrificing potential concurrent consumers (given that the order guarantee is restricted to a specific topic partition)?
Any specific or general feedback would be greatly appreciated, as this is a complex topic with several considerations.
Thanks in advance.
EDIT
There was a similar discussion 6 years ago here:
Using Kafka as a (CQRS) Eventstore. Good idea?
Consensus back then was also divided, and a lot of the people who suggest this approach is convenient mention how Kafka deals natively with huge amounts of real-time data. Nevertheless, the problem (for me at least) isn't related to that, but to how inconvenient Kafka's capabilities are for rebuilding an entity's state: either by modeling topics as entity instances (where the explosion in the number of topics is undesirable), or by modeling topics as entity types (where the number of events within the topic makes reconstruction very slow/impractical).
Your understanding is mostly correct:
Kafka has no search, definitely not by key. There's a seek-to-timestamp, but it's imperfect and not good for what you're trying to do.
Kafka actually supports a limited form of transactions these days (see exactly-once), although if you interact with any other system outside of Kafka they will be of no use.
The unit of anything in Kafka (event ordering, availability, replication) is a partition. There are no guarantees across partitions of the same topic.
All of this doesn't stop applications from using Kafka as the source of truth for their state, so long as:
your problem can be "sharded" into topic partitions, so you don't care about the order of events across partitions
you're willing to "replay" an entire partition to bootstrap if/when you lose your local state
you use log-compacted topics to try to keep a bound on their size (because you will need to replay them to bootstrap, see the point above); a short sketch of this follows below
Both Samza and (IIUC) Kafka Streams back their state stores with log-compacted Kafka topics. Internally to Kafka, offset and consumer group management is stored as a log-compacted topic, with brokers holding a "materialized view" in memory: when ownership of a partition of __consumer_offsets moves between brokers, the new leader replays the partition to rebuild this view.
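Here is a small hedged sketch of the two points above, using the standard Java clients (topic name, counts, and the event payload are invented): create a log-compacted topic for the state, and key every event by its entity id so that all events for one entity stay ordered within a single partition.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderEventStore {
  public static void main(String[] args) throws Exception {
    // 1. A log-compacted topic keeps at least the latest record per key, which bounds
    //    how much has to be replayed to bootstrap (partition/replication counts are illustrative).
    Properties adminProps = new Properties();
    adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    try (Admin admin = Admin.create(adminProps)) {
      NewTopic topic = new NewTopic("order-state", 12, (short) 3)
          .configs(Map.of("cleanup.policy", "compact"));
      admin.createTopics(List.of(topic)).all().get();
    }

    // 2. Keying by the entity id means all events for one order hash to the same partition,
    //    so ordering is preserved where it matters without collapsing everything into one partition.
    Properties producerProps = new Properties();
    producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
    try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
      producer.send(new ProducerRecord<>("order-state", "order-42", "OrderShipped"));
    }
  }
}
```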
I have been on several projects that use Kafka as long-term storage, and Kafka has no problem with it, especially with the latest versions: they introduced something called tiered storage, which gives you the possibility, in a cloud environment, to transfer the older data to slower/cheaper storage.
And you should not worry that much about transactions; in today's IT there are other concepts to deal with it, like Event Sourcing and Bounded Contexts. Yes, you have to design your applications differently; how to do that is explained in this video.
But you are right, your options for querying this data will be limited. The easiest way is to use Kafka Streams and KTables, but this is a key/value store, so you can only ask questions about your data by primary key.
Your next best choice is to implement the query part of CQRS with the help of frameworks like Akka Projection. I wrote a blog post about how you can use Akka Projection with Elasticsearch, which you can find here and here.
Lately I've been looking into real-time data processing using Storm, Flink, etc.
All the architectures I have come across use Kafka as a layer between the data sources and the stream processor. Why should this layer exist?
I think there are three main reasons why to use Apache Kafka for real-time processing:
Distribution
Performance
Reliability
In real-time processing, there is a requirement for fast and reliable delivery of data from the data sources to the stream processor. If you are not doing it well, it can easily become a bottleneck of your real-time processing system. This is where Kafka can help.
Previously, traditional messaging systems such as Apache ActiveMQ and RabbitMQ were not particularly good at handling huge amounts of data in real time. For that reason, LinkedIn engineers developed their own messaging system, Apache Kafka, to be able to cope with this issue.
Distribution: Kafka is natively distributed, which fits the distributed nature of stream processing. Kafka divides incoming data into partitions ordered by offset, which are physically distributed over the cluster. These partitions can then feed the stream processor in a distributed manner.
Performance:
Kafka was designed to be simple, sacrificing advanced features for the sake of performance. Kafka outperforms traditional messaging systems by a big margin, which can also be seen in this paper. The main reasons are mentioned below:
The Kafka producer does not wait for acknowledgments from the broker and sends data as fast as the broker can handle.
Kafka has a more efficient storage format with less metadata.
The Kafka broker is stateless; it does not need to keep track of the state of consumers.
Kafka exploits the UNIX sendfile API to efficiently deliver data from a broker to a consumer by reducing the number of data copies and system calls.
Reliability: Kafka serves as a buffer between the data sources and the stream processor to handle big loads of data. Kafka simply stores all the incoming data, and the consumers are responsible for deciding how much and how fast they want to process the data. This ensures a reliable form of load-balancing, so the stream processor will not be overwhelmed by too much data.
Kafka's retention policy also allows easy recovery from failures during processing (Kafka retains all the data for 7 days by default). Each consumer keeps track of the offset of its last processed message. For this reason, if some consumer fails, it is easy to roll back to the point right before the failure and start processing again, without losing information or needing to reprocess the whole stream from the beginning.
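As a hedged illustration of that recovery model with the plain Java consumer (topic and group names are invented): offsets are committed only after processing succeeds, so after a crash the group simply resumes from the last committed offset and re-reads whatever was in flight.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayableConsumer {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "metrics-processor");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // we decide when an offset counts as done
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(List.of("sensor-events"));
      while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
        for (ConsumerRecord<String, String> record : records) {
          process(record); // consume at whatever pace this process can sustain
        }
        // Commit only after processing; after a crash the group resumes from the last
        // committed offset and re-reads the uncommitted tail of the log.
        consumer.commitSync();
      }
    }
  }

  private static void process(ConsumerRecord<String, String> record) {
    System.out.println(record.offset() + ": " + record.value());
  }
}
```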
I have some basic Kafka Streams code that reads records from one topic, does some processing, and outputs records to another topic.
How does Kafka streaming handle concurrency? Is everything run in a single thread? I don't see this mentioned in the documentation.
If it's single threaded, I would like options for multi-threaded processing to handle high volumes of data.
If it's multi-threaded, I need to understand how this works and how resources like SQL database connections should be shared between different processing threads.
Is Kafka's built-in streaming API not recommended for high volume scenarios relative to other options (Spark, Akka, Samza, Storm, etc)?
Update Oct 2020: I wrote a four-part blog series on Kafka fundamentals that I'd recommend to read for questions like these. For this question in particular, take a look at part 3 on processing fundamentals.
To your question:
How does Kafka streaming handle concurrency? Is everything run in a single thread? I don't see this mentioned in the documentation.
This is documented in detail at http://docs.confluent.io/current/streams/architecture.html#parallelism-model. I don't want to copy-paste this here verbatim, but I want to highlight that IMHO the key element to understand is that of partitions (cf. Kafka's topic partitions, which in Kafka Streams is generalized to "stream partitions" as not all data streams that are being processed will be going through Kafka) because a partition is currently what determines the parallelism of both Kafka (the broker/server side) and of stream processing applications that use the Kafka Streams API (the client side).
If it's single threaded, I would like options for multi-threaded processing to handle high volumes of data.
Processing a partition will always be done by a single "thread" only, which ensures you are not running into concurrency issues. But, fortunately, ...
If it's multi-threaded, I need to understand how this works and how resources like SQL database connections should be shared between different processing threads.
...because Kafka allows a topic to have many partitions, you still get parallel processing. For example, if a topic has 100 partitions, then up to 100 stream tasks (or, somewhat over-simplified: up to 100 different machines each running an instance of your application) may process that topic in parallel. Again, every stream task would get exclusive access to 1 partition, which it would then process.
Is Kafka's built-in streaming API not recommended for high volume scenarios relative to other options (Spark, Akka, Samza, Storm, etc)?
Kafka's stream processing engine is definitely recommended and also actually being used in practice for high-volume scenarios. Work on comparative benchmarking is still being done, but in many cases a Kafka Streams based application turns out to be faster. See LINE engineer's blog: Applying Kafka Streams for internal message delivery pipeline for an article by LINE Corp, one of the largest social platforms in Asia (220M+ users), where they describe how they are using Kafka and the Kafka Streams API in production to process millions of events per second.
The Kafka Streams config num.stream.threads allows you to override the default of 1 thread. However, it may be preferable to simply run multiple instances of your streaming app, all of them in the same consumer group. That way you can spin up as many instances as you need to get optimal partitioning.
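A hedged sketch of both knobs (application and topic names are invented): num.stream.threads raises in-process parallelism, while starting the same application on more machines with the same application.id spreads partitions across instances; either way, total parallelism is capped by the number of input partitions, and any shared resource such as a SQL connection pool must be safe to use from several stream threads.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class ParallelStreamsApp {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Every instance started with this application.id joins the same group and
    // gets an exclusive share of the input partitions.
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "high-volume-app");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
    // Four processing threads in this JVM; each thread runs one or more stream tasks,
    // and each task owns its partitions, so per-partition processing stays single-threaded.
    props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);

    StreamsBuilder builder = new StreamsBuilder();
    builder.stream("input-topic").to("output-topic"); // trivial stateless pass-through

    KafkaStreams streams = new KafkaStreams(builder.build(), props);
    streams.start();
    Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
  }
}
```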
I am building a data processing pipeline using Kafka.
The pipeline is linear with 4 stages.
The data volume is medium (will need more than one machine but not hundreds or thousands; data volume is a few tens of gigabytes)
My question: can I use only Kafka, having a pipeline stage consume from a topic and produce on another topic? Should I be using Spark or Storm and why? Of course, I prefer the simplest possible architecture. If I can do it all with Kafka, I'd prefer that. In the future I may need some additional machine learning stages and that may affect the answer. I have no strong once-only semantics, I can accept some message loss and some duplication with no problem.
My question: can I use only Kafka, having a pipeline stage consume from a topic and produce on another topic? Should I be using Spark or Storm and why?
Technically, yes you can, if you are ready to handle the whole distributed architecture on your own: writing your own multi-threaded producers, managing those consumers, and so on. You also need to think in terms of scalability, performance, durability, etc. And here comes the beauty of using a computation engine like Storm, Spark, etc.: you can simply concentrate on the core logic and leave the infrastructure to be maintained by them.
For example, using a combination of Kafka and Storm in your architecture, you can store terabytes of data in Kafka and feed them to Storm for processing. If you are familiar with Storm, then a sample topology can be something like this:
(kafka-spout consuming messages from a topic) --> (Bolt-A processing the data received through the spout and feeding it to Bolt-B) --> (Bolt-B pushing the processed data back into another Kafka topic)
Using such an architecture offers a great deal of scalability, throughput, performance, etc. By making some easy configuration changes, you will be able to tune your application based on your requirements.
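For illustration, here is a rough sketch of that topology (assuming Storm 2.x and the storm-kafka-client spout; topic names, parallelism hints, and the two bolt classes are invented placeholders):

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class PipelineTopology {

  // Bolt-A placeholder: "processes" the value emitted by the Kafka spout.
  public static class ProcessingBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
      String value = input.getStringByField("value");
      collector.emit(new Values(value.toUpperCase()));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
      declarer.declare(new Fields("processed"));
    }
  }

  // Bolt-B placeholder: in a real topology this would produce to another Kafka topic
  // (e.g. via storm-kafka-client's KafkaBolt); here it just logs.
  public static class PublishBolt extends BaseBasicBolt {
    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
      System.out.println("would publish: " + input.getStringByField("processed"));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {}
  }

  public static void main(String[] args) throws Exception {
    // Spout reads raw events from the input topic.
    KafkaSpoutConfig<String, String> spoutConfig =
        KafkaSpoutConfig.builder("localhost:9092", "raw-events").build();

    TopologyBuilder builder = new TopologyBuilder();
    builder.setSpout("kafka-spout", new KafkaSpout<>(spoutConfig), 2);
    builder.setBolt("bolt-a", new ProcessingBolt(), 4).shuffleGrouping("kafka-spout");
    builder.setBolt("bolt-b", new PublishBolt(), 4).shuffleGrouping("bolt-a");

    try (LocalCluster cluster = new LocalCluster()) {
      cluster.submitTopology("pipeline", new Config(), builder.createTopology());
      Thread.sleep(60_000);
    }
  }
}
```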