I am aware of the parallelism advantages that Kafka Streams offers, which are leveraged when your parallelism needs align with the partitioning of the topics.
I am considering having an application subscribe many consumers, each to a different consumer group, so that each consumer consumes a full replica of the whole topic.
Specifically, I am thinking of having multiple threads consume the same topic to produce different results, even though I know that I could express all my computation needs using the "chaining" computation paradigm that KStreams offer.
The reason I am considering different threads is that I want multiple dynamically created KTable instances of the stream, each one working on the same stream (not a subset) and aggregating different results. Since this is dynamic, it can create a really heavy load that could be alleviated by adding thread parallelism. I believe the idea that each thread can work on its own streams instance (and consumer group) is valid.
Of course, I could also add thread parallelism by having multiple threads consume smaller subsets of the data and individually do all the computations (e.g. each one maintaining subsets of all the different KTables), which would still provide concurrency.
So, there are two main points to my question:
Are KafkaStreams not generally suited for thread parallelism, meaning is the library not intended to be used that way?
In the case where threads are used to consume a topic, would it be a better idea to have the threads follow the general Kafka parallelism concept of working on different subsets of the data, thereby making thread parallelism an application-level analogue of scaling out with more instances?
But I am wondering whether it would be okay to have an application that subscribes many consumers to different consumer groups, so that each consumer consumes a full replica of the whole topic.
What you could consider is running multiple KafkaStreams instances inside the same Java application. Each instance has its own StreamsConfig and thus its own application.id and consumer group id.
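As a hedged sketch of that idea (the application ids, topic contents, and topology-building code are placeholders, not anything from your setup), running two KafkaStreams instances side by side in one JVM could look roughly like this:

import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

// First instance: its own application.id, hence its own consumer group
Properties config1 = new Properties();
config1.put(StreamsConfig.APPLICATION_ID_CONFIG, "aggregation-app-1"); // placeholder id
config1.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

// Second instance: a different application.id, hence a different consumer group
Properties config2 = new Properties();
config2.put(StreamsConfig.APPLICATION_ID_CONFIG, "aggregation-app-2"); // placeholder id
config2.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

StreamsBuilder builder1 = new StreamsBuilder(); // build the first topology here (e.g. one set of KTable aggregations)
StreamsBuilder builder2 = new StreamsBuilder(); // build the second topology here

// Because the consumer groups differ, each instance independently reads the full topic
KafkaStreams streams1 = new KafkaStreams(builder1.build(), config1);
KafkaStreams streams2 = new KafkaStreams(builder2.build(), config2);
streams1.start();
streams2.start();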
That said, depending on what your use case is, you might want to take a look at GlobalKTable (http://docs.confluent.io/current/streams/concepts.html#globalktable), which (slightly simplified) ensures that the data it reads from a Kafka topic is available in all instances of your Kafka Streams application. That is, this would allow you to "replicate the data globally" without having to run multiple KafkaStreams instances or the more complicated setup you asked about above.
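For illustration, a minimal sketch of that (the topic name and types are made up, and default serdes are assumed) would materialize a topic as a GlobalKTable so that every instance of the application has all of its data locally:

import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.GlobalKTable;

StreamsBuilder builder = new StreamsBuilder();
// Every instance of this application gets a full copy of "reference-topic"
GlobalKTable<String, String> global = builder.globalTable("reference-topic");
// A KStream can then be joined against it without co-partitioning, e.g.:
// stream.join(global, (key, value) -> key, (value, refValue) -> value + "/" + refValue);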
Specifically I am considering having multiple threads consume the same topic to provide different kinds of results. Can I somehow define the consumer group that each KafkaStream consumer is listening to?
Hmm, perhaps you're looking at something else then.
You are aware that you can build multiple "chains" of computation from the same KStream and KTable instance?
KStream<String, Long> input = ...;
KTable<..., ...> firstChain = input.filter(...).groupByKey().count(...);
KStream<..., ...> secondChain = input.mapValues(...);
This would allow you to read a Kafka topic once but then compute different outcomes based on that topic.
Is this considered a bad idea in general?
If I understand you correctly I think there's a better and much simpler approach, see above. If you need something different, you may need to update/clarify your question.
Hope this helps!
Related
I am currently trying to understand how Kafka Streams achieves parallelism. My main concern boils down to three questions:
Can multiple sub-topologies read from the same partition?
How can you parallelise a complex operation (making up a sub-topology) that uses the processor API and requires reading the entire topic?
Can multiple sub-topologies read from the same topic (such that independent and expensive operations on the same topic can be run in different sub-topologies)?
As developers, we don't have direct control over how topologies are divided into sub-topologies. Kafka Streams divides the Topology into multiple sub-topologies using Topics as a "bridge" where possible. Additionally, multiple stream tasks are created that each read a subset of data from the input topic, divided by partition. The documentation reads:
Slightly simplified, the maximum parallelism at which your application may run is bounded by the maximum number of stream tasks, which itself is determined by maximum number of partitions of the input topic(s) the application is reading from.
Assume there was a sub-topology that reads multiple input topics whose numbers of partitions are not identical. If the above excerpt of the documentation is to be believed, then one or more partitions of the topic that has fewer partitions would need to be assigned to multiple stream tasks (if both topics need to be read for the logic to work). However, this should not be possible, because, as I understand it, multiple instances of the streams application (each sharing the same application id) act as one consumer group, where each partition is only assigned once. In such a case, the number of tasks being created for a sub-topology should actually be limited by the minimum number of partitions of its input topics, i.e. a single partition is only assigned to one task.
I am not sure if the initial problem, i.e. a non-co-partitioned sub-topology, would actually occur. If there is an operation that requires reading both input topics, the data would probably need to be co-partitioned (as with joins).
Say there was an expensive operation between two topics (possibly built from multiple custom processors) that requires the data of one topic to always be available in its entirety. You would want to parallelise this operation into multiple tasks.
If the topic had just one partition, and a partition could be read multiple times, this would not be a problem. However, as discussed earlier, I don't believe this to work.
Then there are GlobalKTables. However, there is no way to use GlobalKTables with custom processors (toStream is not available).
Another idea would be to broadcast the data into multiple partitions, essentially duplicating it by the partition count. This way, multiple stream tasks could be created for the topology to read the same data. To do this, a custom partitioner could be specified in the Produced instance given to KStream#to. If this data duplication can be accepted, this seems to be the only way to achieve what I have in mind.
Regarding question number three, because the Streams application is one Consumer group, I would also expect this to not be possible. With my current understanding, this would require to write the data into multiple identical topics (again essentially duplicating the data), such that independent sub-topologies can be created. An alternative would be to run separate streaming applications (such that a different consumer group is used).
Without seeing your topology definition, this is a somewhat vague question. You can have repartition and changelog topics. These are duplicated data from the original input topic.
But stateless operators like map, filter, etc. pass data through from the same (assigned) partitions for each thread.
A "sub topology" is still part of only one application.id, thus one consumer group, so no, it cannot read the same topic partitions more than once. For that, you'd need independent streams/tables via branching operations within the whole topology, for example, filtering numbers by even and odd only consumes the topic once; you don't need to "broadcast" records to all partitions, and I'm not sure that's even possible out of the box (to sends one-to-one, and Produced defines serialization, not multiple partitions). If you need to cross reference different operators, then you can use join / statestores / KTables.
None of this is really related to parallelism. You have num.stream.threads, or you can run multiple instances of the same JVM process to scale.
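For example (values are illustrative), the thread count is just a configuration property of the Streams instance:

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "my-app");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4); // run up to 4 stream threads in this instance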
I have a usecase where I want to have thousands of producers writing messages which will be consumed by thousands of corresponding consumers. Each producer's message is meant for exactly one consumer.
Going through the core concepts here and here, it seems like each consumer-producer pair should have its own topic. Is this the correct understanding? I also looked into consumer groups, but it seems they are more for parallelizing consumption.
Right now I have multiple producer-consumer pairs sharing very few topics, but because of that (I think) I have to read a lot of messages in the consumer and filter by key for the specific producer's messages. As my system scales, this might take a lot of time. Also, in the event I have to delete the checkpoint, this will be even more problematic as consumption starts again from the very beginning.
Is creating thousands of topics the solution for this? Or is there any other way to use concepts like partitions, consumer groups etc? Both producers and consumers are spark streaming/batch applications. Thanks.
Each producer's message is meant for exactly one consumer
Assuming you commit the offsets, and don't allow retries, this is the expected behavior of all Kafka consumers (or rather, consumer groups)
seems like each consumer-producer pair should have its own topic
Not really. As you said, you have many-to-many relationship of clients. You do not need to have a known pair ahead of time; a producer could send data with no expected consumer, then any consumer application(s) in the future should be able to subscribe to that topic for the data they are interested in.
sharing very few topics, but because of that (i think) I am having to read a lot of messages in the consumer and filter them out for the specific producer's messages by the key. As my system scales this might take a lot of time
The consumption would take linearly more time on a higher production rate, yes, and partitions are the way to solve for that. Beyond that, you need faster network and processing. You still need to consume and deserialize in order to filter, so the filter is not the bottleneck here.
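If you do go the partition route, one hedged sketch (the partition number, topic name, and the producer/consumer client objects are assumptions for illustration): route each producer's messages to a fixed partition and have the matching consumer read only that partition with assign(), instead of filtering the whole topic:

import java.util.Collections;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;

int pairPartition = 7; // hypothetical partition agreed between this producer/consumer pair

// Producer side: pin the record to that partition explicitly
producer.send(new ProducerRecord<>("shared-topic", pairPartition, "pair-7", "payload"));

// Consumer side: read only that partition (no group subscription / rebalancing involved)
consumer.assign(Collections.singletonList(new TopicPartition("shared-topic", pairPartition)));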
Is creating thousands of topics the solution for this?
Ultimately depends on your data, but I'm guessing not.
Is creating thousands of topics the solution for this? Or is there any other way to use concepts like partitions, consumer groups etc? Both producers and consumers are spark streaming/batch applications.
What's the reason you want to have thousands of consumers, or want an explicit 1-to-1 relationship? As mentioned earlier, only one consumer within a consumer group will process a given message. This is normal.
If, however, you are trying to make your record processing extremely concurrent, then instead of using very high partition counts or very large consumer groups, you should use something like Parallel Consumer (PC).
By using PC, you can process all your keys in parallel, regardless of how long processing takes, and you can be as concurrent as you wish.
PC directly solves for this, by sub partitioning the input partitions by key and processing each key in parallel.
It also tracks per record acknowledgement. Check out Parallel Consumer on GitHub (it's open source BTW, and I'm the author).
The performance tuning documentation provided by Storm states that, for the absolute best performance, scaling out to multiple parallel topologies can yield better results than simply scaling workers.
I am trying to benchmark this theory against scaling workers.
However, using version 1.2.1, the Storm Kafka spout is not behaving as I would have expected across multiple different topologies.
When setting a common client.id and group.id for the Kafka spout consumer across all topologies for a single topic, each topology still subscribes to all available partitions and processes duplicate tuples, with errors being thrown as already-committed tuples are recommitted.
I am surprised by this behaviour as I assumed that the consumer API would support this fairly simple use case.
I would be really grateful if somebody would explain
What is the implementation logic behind this behaviour of the Kafka spout?
Is there any way around this problem?
The default behavior for the spout is to assign all partitions for a topic to workers in the topology, using the KafkaConsumer.assign API. This is the behavior you are seeing. With this behavior, you shouldn't be sharing group ids between topologies.
If you want finer control over which partitions are assigned to which workers or topologies, you can implement the TopicFilter interface, and pass it to your KafkaSpoutConfig. This should let you do what you want.
Regarding running multiple topologies being faster, I'm assuming you're referring to this section from the docs: In multiworker mode, messages often cross worker process boundaries. For performance sensitive cases, if it is possible to configure a topology to run as many single-worker instances [...] it may yield significantly better throughput and latency. The objective here is to avoid sending messages between workers, and instead keep each partition's processing internal in one worker. If you want to avoid running many topologies, you could look at customizing the Storm scheduler to make it allocate e.g. one full copy of your pipeline in each worker. That way, if you use localOrShuffleGrouping, there will always be a local bolt to send to, so you don't have to go over the network to another worker.
I need data from Kafka brokers, but for fast access I am using multiple consumers with the same group id (known as a consumer group). But after each consumer reads its share, how can we combine the data from the multiple consumers? Is there any logic for this?
By design, different consumers in the same consumer group process data independently from each other. (This behavior is what allows applications to scale well.)
But after reading by each consumer, how can we combine data from multiple consumers? Is there any logic?
The short but slightly simplified answer, assuming you use Kafka's "Consumer API" (also called the "consumer client" library), which I think is what you are using based on the wording of your question: if you need to combine data from multiple consumers, the easiest option is to make this (new) input data available in another Kafka topic, where you do the combining in a subsequent processing step. A trivial example: the second Kafka topic would be set up with just 1 partition, so any subsequent processing step would see all the data that needs to be combined.
If this sounds a bit too complicated, I'd suggest using Kafka's Streams API, which makes it much easier to define such processing flows (e.g. joins or aggregations, like in your question). In other words, Kafka Streams gives you a lot of the desired built-in "logic" that you are looking for: https://kafka.apache.org/documentation/streams/
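A minimal sketch of that idea (topic names, types, and the summing logic are made up for illustration, and default serdes are assumed): a Streams aggregation combines the per-key data for you, regardless of how many consumers/tasks read the input:

StreamsBuilder builder = new StreamsBuilder();
KStream<String, Long> events = builder.stream("input-topic");
// Combine all values per key into one running total
KTable<String, Long> combined = events.groupByKey().reduce(Long::sum);
combined.toStream().to("combined-output");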
The aim of Kafka is to provide you with a scalable, performant and fault-tolerant framework. Having a group of consumers read the data from different partitions asynchronously allows you to achieve the first two goals. Grouping the data is a bit outside the scope of the standard Kafka flow; you could use a single partition with a single consumer in the simplest case, but I'm sure that is not what you want.
For things such as aggregating a single state from different consumers, I would recommend applying a solution designed specifically for that kind of goal. If you are working within the Hadoop ecosystem, you can use a Storm Trident bolt, which allows you to aggregate the data from your Kafka spouts. Or you can use Spark Streaming, which lets you do the same in a slightly different fashion. Or, as an option, you can always implement a custom component with such logic using the standard Kafka libraries.
Considering a stream of different events, what would be the recommended way:
one big topic containing all events
multiple topics for different types of events
Which option would be better?
I understand that for messages not in the same partition of a topic there is no ordering guarantee, but are there any other factors to be considered when making this decision?
A topic is a logical abstraction and should contain messages of the same type. Let's say you monitor a website and capture click stream events, and on the other hand you have a database that populates its changes into a changelog topic. You should have two different topics because click stream events are not related to your database changelog.
This has multiple advantages:
your data will have different formats and you will need different (de)serializers to write and read the data (with a single topic you would need a hybrid serializer and you would not get type safety when reading data)
you will have different consumer applications: one application might be interested in click stream events only, while a second application is only interested in the database changelog and a third application is interested in both. If you have multiple topics, applications one and two only subscribe to the topics they are interested in -- if you have a single topic, applications one and two need to read everything and filter out the stuff they are not interested in, increasing broker, network, and client load
As Matthias J. Sax said before, there is no silver bullet here. But there are different considerations to take into account.
The deciding factor: ordered delivery
If your application needs guaranteed ordered delivery, you need to work with only one topic, plus the same key for those messages that need to preserve their order.
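A tiny illustrative sketch (the topic, key, values, and producer object are hypothetical): giving related messages the same key sends them to the same partition, so a consumer sees them in order:

// Same key => same partition => ordered relative to each other
producer.send(new ProducerRecord<>("orders", "customer-42", "order-created"));
producer.send(new ProducerRecord<>("orders", "customer-42", "order-paid"));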
If ordering is not mandatory, the game starts...
Is the schema the same for all messages?
Would consumers be interested in the same types of events, or in different ones?
What is going to happen on the consumer side? Are we reducing or increasing complexity in terms of implementation, maintainability, error handling...?
Is horizontal scalability important for us? More topics often means more partitions available, which means more horizontal scalability capacity. It also allows more precise scalability configuration on the broker side, because we can choose how many partitions to add per event type, and on the consumer side, how many consumers to stand up per event type.
Does it make sense to parallelise consumption per message type?
...
Technically speaking, if we allow consumers to fine-tune which types of events they consume, we potentially reduce the network bandwidth spent sending undesired messages from the broker to the consumer, plus the number of deserialisations for all of them (CPU used, which over time means more free resources, reduced energy cost...).
It is also worth remembering that splitting different types of messages into different topics doesn't mean you have to consume them with different Kafka consumers, because a single consumer can consume from several topics at the same time.
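For instance (the topic names and the consumer object are made up), a single plain consumer can subscribe to several topics at once:

// One consumer, several topics
consumer.subscribe(Arrays.asList("click-events", "db-changelog"));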
Well, there is no clear answer to this question, but my feeling is that with Kafka, given the features above, if ordered delivery is not needed we should split our messages per type into different topics.