Storm's performance tuning documentation states that, for the absolute best performance, running multiple parallel topologies can yield better results than simply scaling workers.
I am trying to benchmark this claim against scaling workers.
However, using version 1.2.1, the Storm Kafka spout is not behaving as I would have expected across multiple topologies.
Even with a common client.id and group.id set for the Kafka spout consumer across all topologies on a single topic, each topology still subscribes to all available partitions and emits duplicate tuples, with errors being thrown as already-committed tuples are recommitted.
I am surprised by this behaviour as I assumed that the consumer API would support this fairly simple use case.
I would be really grateful if somebody could explain:
what is the implementation logic behind this behaviour of the Kafka spout?
is there any way around this problem?
The default behavior for the spout is to assign all partitions for a topic to workers in the topology, using the KafkaConsumer.assign API. This is the behavior you are seeing. With this behavior, you shouldn't be sharing group ids between topologies.
If you want finer control over which partitions are assigned to which workers or topologies, you can implement the TopicFilter interface, and pass it to your KafkaSpoutConfig. This should let you do what you want.
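To make the idea concrete without reproducing the actual TopicFilter interface (check the storm-kafka-client source for its exact methods), here is a sketch, with hypothetical names, of the partition-splitting logic such a filter could apply so that topology i of n only takes every n-th partition:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: PartitionSplitter and partitionsFor are
// hypothetical names, not part of the Storm or Kafka APIs. A real
// TopicFilter implementation would apply this logic to the set of
// TopicPartitions returned by the consumer.
public class PartitionSplitter {

    // Return the partition ids that topology `topologyIndex` (0-based, out of
    // `topologyCount` topologies) should subscribe to.
    public static List<Integer> partitionsFor(int topologyIndex, int topologyCount, int totalPartitions) {
        List<Integer> assigned = new ArrayList<>();
        for (int p = 0; p < totalPartitions; p++) {
            if (p % topologyCount == topologyIndex) {
                assigned.add(p);
            }
        }
        return assigned;
    }

    public static void main(String[] args) {
        // Two topologies splitting a six-partition topic between them.
        System.out.println(partitionsFor(0, 2, 6)); // [0, 2, 4]
        System.out.println(partitionsFor(1, 2, 6)); // [1, 3, 5]
    }
}
```

With this kind of split, each topology keeps its own group.id and never sees another topology's partitions, so no offsets are committed twice.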
Regarding running multiple topologies being faster, I'm assuming you're referring to this section from the docs: "In multiworker mode, messages often cross worker process boundaries. For performance sensitive cases, if it is possible to configure a topology to run as many single-worker instances [...] it may yield significantly better throughput and latency."

The objective here is to avoid sending messages between workers, and instead keep each partition's processing internal to one worker. If you want to avoid running many topologies, you could look at customizing the Storm scheduler to make it allocate e.g. one full copy of your pipeline in each worker. That way, if you use localOrShuffleGrouping, there will always be a local bolt to send to, so you don't have to go over the network to another worker.
Related
I have built a microservice platform based on Kubernetes, with Kafka used as the MQ between services. Now a very confusing question has arisen: Kubernetes is designed to make scaling microservices easy, but when a service scales beyond the number of Kafka partitions, the extra instances cannot consume. What should I do?
This is a Kafka limitation and has nothing to do with your service scheduler.
Kafka consumer groups simply cannot scale beyond the partition count. So, if you have a single-partition topic because you care about strict event ordering, then only one replica of your service can be active and consuming from the topic, and you'd need to handle failover in ways that are outside the scope of Kafka itself.
If your concern is the k8s autoscaler, then you can look into the KEDA autoscaler for Kafka services.
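The limitation is easy to see in a toy sketch (hypothetical names; this is not Kafka's actual assignor code): each partition goes to exactly one group member, so once every partition is taken, surplus consumers receive nothing.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative round-robin assignment of partitions over group members.
// Consumers beyond the partition count end up with an empty assignment,
// which is why scaling replicas past the partition count gains nothing.
public class GroupAssignmentSketch {

    // Returns one partition list per consumer.
    public static List<List<Integer>> assign(int partitions, int consumers) {
        List<List<Integer>> result = new ArrayList<>();
        for (int c = 0; c < consumers; c++) {
            result.add(new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            result.get(p % consumers).add(p);
        }
        return result;
    }

    public static void main(String[] args) {
        // 3 partitions, 5 consumers: consumers 3 and 4 sit idle.
        System.out.println(assign(3, 5)); // [[0], [1], [2], [], []]
    }
}
```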
Kafka, as OneCricketeer notes, bounds the parallelism of consumption by the number of partitions.
If you couple processing with consumption, this limits the number of instances which will be performing work at any given time to the number of partitions to be consumed. Because the Kafka consumer group protocol includes support for reassigning partitions consumed by a crashed (or non-responsive...) consumer to a different consumer in the group, running more instances of the service than there are partitions at least allows for the other instances to be hot spares for fast failover.
It's possible to decouple processing from consumption. The broad outline could be to have every instance of your service join the consumer group. At most as many instances as there are partitions will actually consume from the topic. Each consuming instance can then make a load-balanced network request to another (or the same) instance, based on the message it consumes, to do the processing. If you allow the consumer to have multiple requests in flight, this expands your scaling horizon to max-in-flight-requests * number-of-partitions.
If it happens that the messages in a partition don't need to be processed in order, simple round-robin load-balancing of the requests is sufficient.
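That decoupling can be sketched with a local worker pool standing in for the load-balanced network hop (all names here are illustrative; a real service would submit requests over the network instead of to a local executor):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// One "consumer" loop hands each message off to a pool of workers, so the
// processing parallelism is no longer bounded by the partition count.
public class DecoupledDispatch {

    public static List<String> dispatch(List<String> messages, int workerCount) {
        ExecutorService workers = Executors.newFixedThreadPool(workerCount);
        List<Future<String>> futures = new ArrayList<>();
        for (String msg : messages) {
            // Stand-in for "process this message"; runs concurrently.
            futures.add(workers.submit(() -> "processed:" + msg));
        }
        List<String> results = new ArrayList<>();
        try {
            for (Future<String> f : futures) {
                results.add(f.get()); // collected in submit order
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            workers.shutdown();
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(dispatch(List.of("m0", "m1", "m2"), 2));
        // [processed:m0, processed:m1, processed:m2]
    }
}
```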
Conversely, if it's the case that there are effectively multiple logical streams of messages multiplexed into a given partition (e.g. if messages are keyed by equipment ID; the second message for ID A needs to be processed after the first message, but could be processed in any order relative to messages from ID B), you can still do this, but it needs some care around ensuring ordering. Additionally, given the amount of throughput you should be able to get from a consumer of a single partition, needing to scale out to the point where you have more processing instances than partitions suggests that you'll want to investigate load-balancing approaches where if request B needs to be processed after request A (presumably because request A could affect the result of request B), A and B get routed to the same instance so they can leverage local in-memory state rather than do a read-from-db then write-to-db pas de deux.
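One common way to get that care around ordering is key-affine routing, similar in spirit to Kafka's own default partitioner (this sketch uses hypothetical names and is not Kafka's implementation): messages with the same key always go to the same processing instance, preserving per-key order while different keys spread across instances.

```java
// Route a message key onto one of `instanceCount` processing instances.
// Equal keys always map to the same instance, so per-key ordering and
// local in-memory state both work without cross-instance coordination.
public class KeyRouter {

    public static int route(String key, int instanceCount) {
        // floorMod guards against negative hashCode values.
        return Math.floorMod(key.hashCode(), instanceCount);
    }

    public static void main(String[] args) {
        // Both messages for equipment "A" land on the same instance.
        int first = route("equipment-A", 8);
        int second = route("equipment-A", 8);
        System.out.println(first == second); // true
    }
}
```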
This sort of architecture can be implemented in any language, though maintaining a reasonable level of availability and consistency is going to be difficult. There are frameworks and toolkits which can deliver a lot of this functionality: Akka (JVM), Akka.Net, and Protoactor all implement useful primitives in this area (disclaimer: I'm employed by Lightbend, which maintains and provides commercial support for one of those, though I'd have (and actually have) made the same recommendations prior to my employment there).
When consuming messages from Kafka in this style of architecture, you will definitely have to make the choice between at-most-once and at-least-once delivery guarantees and that will drive decisions around when you commit offsets. Note particularly that you need to be careful, if doing at-least-once, to not commit until every message up to that offset has been processed (or discarded), lest you end up with "at-least-zero-times", which isn't a useful guarantee. If doing at-least-once, you may also want to try for effectively-once: at-least-once with idempotent processing.
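The commit rule for at-least-once can be sketched as an offset tracker that only advances over a contiguous prefix of processed offsets (illustrative code, not tied to any consumer API): out-of-order completions are buffered, and the committable offset never jumps past an unprocessed message.

```java
import java.util.TreeSet;

// Track per-partition processing completion and expose the highest offset
// that is safe to commit under at-least-once semantics.
public class OffsetTracker {
    private long committable;                            // next safely committable offset
    private final TreeSet<Long> done = new TreeSet<>();  // completed but not yet contiguous

    public OffsetTracker(long startOffset) {
        this.committable = startOffset;
    }

    public void markProcessed(long offset) {
        done.add(offset);
        // Advance only while the completed prefix is contiguous.
        while (!done.isEmpty() && done.first() == committable) {
            done.remove(done.first());
            committable++;
        }
    }

    public long committableOffset() {
        return committable;
    }

    public static void main(String[] args) {
        OffsetTracker t = new OffsetTracker(0);
        t.markProcessed(1);                        // offset 0 still pending
        System.out.println(t.committableOffset()); // 0 -- committing 2 here would lose offset 0
        t.markProcessed(0);                        // gap closed
        System.out.println(t.committableOffset()); // 2
    }
}
```

Committing `committableOffset()` rather than the latest consumed offset is what prevents the "at-least-zero-times" failure mode described above.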
I have a Kafka Streams DSL application, and we have a requirement for exactly-once processing, for which I have added the configuration
streamConfig.put("processing.guarantee", "exactly_once");
I am using Kafka 2.7.
I have two queries:
What's the difference between exactly_once and exactly_once_beta?
How do I test this functionality to be sure my messages are getting processed only once?
Thanks!
exactly_once_beta is an improvement over exactly_once. While exactly_once uses a transactional producer for each stream task (a combination of sub-topology and input partition), exactly_once_beta uses a transactional producer for each stream thread of a Kafka Streams client.
Every producer comes with separate memory buffers, a separate thread, separate network connections which might limit scaling the number of input partitions (i.e. number of tasks). A high number of producers might also cause more load on the brokers. Hence, exactly_once_beta has better scaling characteristics. You can find more details in KIP-447.
Note that exactly_once will be deprecated and exactly_once_beta will be renamed to exactly_once_v2 in Apache Kafka 3.0. See KIP-732 for more details.
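As a config fragment (using the StreamsConfig constants from the Kafka Streams API; on 2.7 the constant resolves to the string "exactly_once_beta"):

```java
Properties streamConfig = new Properties();
// On Kafka 2.6-2.8 use EXACTLY_ONCE_BETA; from 3.0 onwards, EXACTLY_ONCE_V2.
streamConfig.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_BETA);
```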
For tests you can get inspiration from the tests in the Apache Kafka repo:
https://github.com/apache/kafka/blob/trunk/streams/src/test/java/org/apache/kafka/streams/integration/EosIntegrationTest.java
https://github.com/apache/kafka/blob/trunk/streams/src/test/java/org/apache/kafka/streams/integration/EOSUncleanShutdownIntegrationTest.java
https://github.com/apache/kafka/blob/trunk/tests/kafkatest/tests/streams/streams_eos_test.py
Basically, you need to create a failover scenario and verify that messages are not produced multiple times to the output topics. Note that messages may be processed multiple times, but the results in the output topics must appear as if they were only processed once. You can find a pretty good talk about exactly-once semantics that also explains the failover scenarios here: https://www.confluent.io/kafka-summit-london18/dont-repeat-yourself-introducing-exactly-once-semantics-in-apache-kafka/
The essence of distributed computation is to co-locate execution with data, or in other words, to send your code to your data, not your data to your code. That is the core design of Hadoop, Spark etc.
Does Kafka / Kafka Streams allow such setup? If yes, how? If no is there something planned, maybe as a subproject e.g. using Kubernetes or similar?
I know that we can define consumer groups for a topic but I don't understand how partitions are allocated to consuming application instances and if this allocation can be made to favour co-located instances.
Please let me know if there is a better term to search for as "kafka consumer co-location" didn't please the google gods :/
The Kafka model is different. The Kafka cluster itself only stores data streams. Computation happens outside the Kafka cluster. Thus, there is only limited notion of co-location, ie, data will always be sent over the network to consumer/streams applications that do the processing.
For Kafka Streams there is one nuance though: if you do joins, for example, the data sub-streams (based on Kafka partitions) of both input streams for the join will be co-located within a single Kafka Streams instance in order to compute the correct result.
Note that data stream processing is a different model, and thus "shipping code to data" is not as important as it is for batch processing.
Why would we like to have that?
To minimize network traffic? Reduce delays?
The wish is to try to give each partition to a local consumer, if possible. Any of the following condition makes that impossible or undesirable:
The broker's host does not run any consumers
Local consumers don't subscribe to the broker's topics
A local consumer is overloaded, compared to some external consumers
Even in case of a relatively simple StickyAssignor, this problem turns out to be a multi-objective optimization:
Optimize for evenly-distributed consumer load
Optimize for preserving previously-assigned partitions
All of this in a situation where topic distribution and consumer membership change dynamically!
The next step would be to introduce some numerical measure of locality. Ideal assignment would try to connect broker and consumer on the same host, rack, data center, or continent. For example, you might wish to use ping time as a measure of distance between processes; or a number of hops.
Another dimension is variation of host capabilities and load. How many more partitions consumer's host can handle?
There must be a way to aggregate all the requirements into a single number: how good is the assignment of Topic X to a consumer Y.
In the end, you might get a n * m matrix of assignment scores: for each consumer-broker pair you might compute an assignment penalty. By solving that assignment problem in O(n^3) time you'll get the best possible assignment, which favors all aspects, important for your application:
closeness to broker
closeness to end-user
consumer's cache state
CPU load and free disk space of consumer nodes
maybe some other criteria like: regulation requirements, scheduled maintenance, cost of running a node
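As a toy illustration of the scoring idea (hypothetical names; this brute force is exponential and only usable for tiny matrices, whereas a real implementation would use the Hungarian algorithm to get the O(n^3) bound mentioned above):

```java
import java.util.Arrays;

// Given a penalty matrix cost[consumer][broker], find the one-to-one pairing
// with minimum total penalty by trying every permutation.
public class AssignmentSketch {

    public static int[] bestAssignment(int[][] cost) {
        int n = cost.length;
        int[] perm = new int[n];
        for (int i = 0; i < n; i++) perm[i] = i;
        int[] best = perm.clone();
        int[] bestCost = { Integer.MAX_VALUE };
        permute(cost, perm, 0, best, bestCost);
        return best; // best[i] = broker assigned to consumer i
    }

    private static void permute(int[][] cost, int[] perm, int k, int[] best, int[] bestCost) {
        int n = perm.length;
        if (k == n) {
            int total = 0;
            for (int i = 0; i < n; i++) total += cost[i][perm[i]];
            if (total < bestCost[0]) {
                bestCost[0] = total;
                System.arraycopy(perm, 0, best, 0, n);
            }
            return;
        }
        for (int i = k; i < n; i++) {
            swap(perm, k, i);
            permute(cost, perm, k + 1, best, bestCost);
            swap(perm, k, i);
        }
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }

    public static void main(String[] args) {
        // Penalties for 3 consumers x 3 brokers (lower = closer / less loaded).
        int[][] cost = {
            { 1, 5, 9 },
            { 4, 2, 8 },
            { 7, 6, 3 },
        };
        // The diagonal pairing has total penalty 1 + 2 + 3 = 6, the minimum.
        System.out.println(Arrays.toString(bestAssignment(cost))); // [0, 1, 2]
    }
}
```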
Kafka has a PartitionAssignor interface, which controls the relation between topics and consumers. The default is a very simple algorithm, but there are more sophisticated implementations like StickyAssignor, which tries to preserve consumers' caches. It is a pluggable interface, open for experimentation.
Kafka's philosophy favors robustness and universality. Maybe that's why such fragile and multi-faceted optimizations are not a part of a standard distribution.
I am aware of the parallelism advantages that kafka streams offer which are leveraged if your parallelism needs are aligned with the partitioning of the topics.
I am considering having an application subscribe many consumers to different consumer groups, so that each consumer consumes a full copy of the whole topic.
Specifically I am thinking of having multiple threads consume the same topic to provide different results even though I know that I can express all my computation needs using the "chaining" computation paradigm that KStreams offer.
The reason why I am considering different threads is because I want multiple dynamically created KTable instances of the stream. Each one working on the same stream (not subset) and aggregating different results. Since it's dynamic it can create really heavy load that could be alleviated by adding thread parallelism. I believe the idea that each thread can work on its own streams instance (and consumer group) is valid.
Of course I can also add thread parallelism by having multiple threads consuming smaller subsets of the data and individually doing all the computations (e.g. each one maintaining subsets of all the different KTables) which will still provide concurrency.
So, two main points in my question
Are KafkaStreams not generally suited for thread parallelism, meaning is the library not intended to be used that way?
In the case where threads are being used to consume a topic would it be a better idea to make threads follow the general kafka parallelism concept of working on different subsets of the data, therefore making thread parallelism an application-level analogous to scaling up using more instances?
But I am wondering: would it be okay to have an application that subscribes many consumers to different consumer groups, so that each consumer consumes a full copy of the whole topic?
What you could consider is running multiple KafkaStreams instances inside the same Java application. Each instance has its own StreamsConfig and thus its own application.id and consumer group id.
That said, depending on what your use case is, you might want to take a look at GlobalKTable (http://docs.confluent.io/current/streams/concepts.html#globalktable), which (slightly simplified) ensures that the data it reads from a Kafka topic is available in all instances of your Kafka Streams application. That is, this would allow you to "replicate the data globally" without having to run multiple KafkaStreams instances or the more complicated setup you asked about above.
Specifically I am considering having multiple threads consume the same topic to provide different kinds of results. Can I somehow define the consumer group that each KafkaStream consumer is listening to?
Hmm, perhaps you're looking at something else then.
You are aware that you can build multiple "chains" of computation from the same KStream and KTable instance?
KStream<String, Long> input = ...;
KTable<String, Long> firstChain = input.filter((key, value) -> value > 0).groupByKey().count();
KStream<String, Long> secondChain = input.mapValues(value -> value * 2);
This would allow you to read a Kafka topic once but then compute different outcomes based on that topic.
Is this considered a bad idea in general?
If I understand you correctly I think there's a better and much simpler approach, see above. If you need something different, you may need to update/clarify your question.
Hope this helps!
I have some basic Kafka Streaming code that reads records from one topic, does some processing, and outputs records to another topic.
How does Kafka streaming handle concurrency? Is everything run in a single thread? I don't see this mentioned in the documentation.
If it's single threaded, I would like options for multi-threaded processing to handle high volumes of data.
If it's multi-threaded, I need to understand how this works and how resources, like SQL database connections, should be shared across the processing threads.
Is Kafka's built-in streaming API not recommended for high volume scenarios relative to other options (Spark, Akka, Samza, Storm, etc)?
Update Oct 2020: I wrote a four-part blog series on Kafka fundamentals that I'd recommend to read for questions like these. For this question in particular, take a look at part 3 on processing fundamentals.
To your question:
How does Kafka streaming handle concurrency? Is everything run in a single thread? I don't see this mentioned in the documentation.
This is documented in detail at http://docs.confluent.io/current/streams/architecture.html#parallelism-model. I don't want to copy-paste this here verbatim, but IMHO the key element to understand is the partition (cf. Kafka's topic partitions, which Kafka Streams generalizes to "stream partitions", since not all data streams being processed go through Kafka). A partition is currently what determines the parallelism of both Kafka itself (the broker/server side) and of stream processing applications that use the Kafka Streams API (the client side).
If it's single threaded, I would like options for multi-threaded processing to handle high volumes of data.
Processing a partition will always be done by a single "thread" only, which ensures you are not running into concurrency issues. But, fortunately, ...
If it's multi-threaded, I need to understand how this works and how to handle resources, like SQL database connections should be shared in different processing threads.
...because Kafka allows a topic to have many partitions, you still get parallel processing. For example, if a topic has 100 partitions, then up to 100 stream tasks (or, somewhat over-simplified: up to 100 different machines each running an instance of your application) may process that topic in parallel. Again, every stream task would get exclusive access to 1 partition, which it would then process.
Is Kafka's built-in streaming API not recommended for high volume scenarios relative to other options (Spark, Akka, Samza, Storm, etc)?
Kafka's stream processing engine is definitely recommended and also actually being used in practice for high-volume scenarios. Work on comparative benchmarking is still being done, but in many cases a Kafka Streams based application turns out to be faster. See LINE engineer's blog: Applying Kafka Streams for internal message delivery pipeline for an article by LINE Corp, one of the largest social platforms in Asia (220M+ users), where they describe how they are using Kafka and the Kafka Streams API in production to process millions of events per second.
The Kafka Streams config num.stream.threads allows you to raise the number of threads from the default of 1. However, it may be preferable to simply run multiple instances of your streaming app, all in the same consumer group. That way you can spin up as many instances as you need to reach optimal partition utilization.
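For reference, the thread count can be set like this (a minimal config fragment using the StreamsConfig constant for "num.stream.threads"):

```java
Properties props = new Properties();
// Run 4 stream threads per application instance; the default is 1.
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);
```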