I am developing a DataStream-based Flink application for a high-volume streaming use case (tens of millions of events per second). The data is consumed from a Kafka topic and is already sharded according to a certain key. My intention is to create key-specific state on the Flink side to run custom analytics. The main problem I can't wrap my head around is how to create the keyed states without the reshuffling of incoming data that keyBy() imposes.
I can guarantee that the maximum parallelism of the Flink job will be less than or equal to the number of partitions in the source Kafka topic, so logically the shuffling is not necessary. The answer to this StackOverflow question suggests that it may be possible to write the data into Kafka in a way that is compatible with the expectations of Flink and then use reinterpretAsKeyedStream(). I would be happy to do it for this application. Would someone be able to share the necessary steps?
Thank you in advance.
What you need to do is to ensure that each event is written to the Kafka partition that will be read by the same task slot to which the key for that event will be assigned.
Here's what you need to know to make that work:
(1) Kafka partitions are assigned in round-robin fashion to task slots: partition 0 goes to slot 0, partition 1 to slot 1, etc., wrapping back around to slot 0 if there are more partitions than slots.
(2) Keys are mapped to key groups, and key groups are assigned to slots. The number of key groups is determined by the maximum parallelism (which is a configuration parameter; the default is 128).
The key group for a key is computed via
keygroupId = MathUtils.murmurHash(key.hashCode()) % maxParallelism
and then the slot is assigned according to
slotIndex = keygroupId * actualParallelism / maxParallelism
(3) Then you'll need to use DataStreamUtils.reinterpretAsKeyedStream to get Flink to treat the pre-partitioned streams as keyed streams.
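For example, assuming source is the DataStream coming out of your Kafka source, and Event / getKey() are hypothetical placeholders for your record type and key accessor, the reinterpretation is a one-liner:

```java
import org.apache.flink.streaming.api.datastream.DataStreamUtils;
import org.apache.flink.streaming.api.datastream.KeyedStream;

// source: the DataStream<Event> produced by your Kafka source.
// Event and getKey() are placeholders for your record type and key accessor.
KeyedStream<Event, String> keyed =
        DataStreamUtils.reinterpretAsKeyedStream(source, Event::getKey);

// Keyed state and timers now work as usual, without a network shuffle.
// MyKeyedAnalyticsFunction is likewise a stand-in for your own KeyedProcessFunction.
keyed.process(new MyKeyedAnalyticsFunction());
```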
One effect of adopting this approach is that it will be painful if you ever need to change the parallelism.
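Putting (1) and (2) together on the producer side, here is a minimal sketch of a Kafka Partitioner that mirrors Flink's assignment. It assumes the producer can take a dependency on flink-runtime (otherwise, copy the two formulas above); the config keys shown are hypothetical, not standard Kafka settings:

```java
import java.util.Map;
import org.apache.flink.runtime.state.KeyGroupRangeAssignment;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class FlinkFriendlyPartitioner implements Partitioner {

    private int maxParallelism; // must equal the Flink job's maxParallelism
    private int parallelism;    // must equal the job's actual parallelism
                                // (and the topic's partition count)

    @Override
    public void configure(Map<String, ?> configs) {
        // Hypothetical config keys -- wire these to your job's real settings.
        maxParallelism = Integer.parseInt(configs.get("flink.max.parallelism").toString());
        parallelism = Integer.parseInt(configs.get("flink.parallelism").toString());
    }

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        // Exactly the keygroupId / slotIndex computation shown above,
        // delegated to Flink's own implementation.
        return KeyGroupRangeAssignment.assignKeyToParallelOperator(
                key, maxParallelism, parallelism);
    }

    @Override
    public void close() {}
}
```

Two caveats: the key object passed to the producer must have the same hashCode() as the key the Flink job extracts, and the topic's partition count must equal the job parallelism so that partition i is read by subtask i, as described in (1).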
Related
As far as I understand, both the Kafka producer and consumer have to use a single thread per topic partition if we want to write/read records in order. Am I right, or do they use multiple threads in such situations?
So ordering can be achieved in Kafka in both single-threaded and multithreaded environments:
Single broker / single partition -> single-threaded consumer model
Message ordering in Kafka works well for a single partition. But with a single partition, parallelism and load balancing are difficult to achieve. Note that in this case only one thread accesses the topic partition, so ordering is always guaranteed.
Multiple brokers / multiple partitions -> multithreaded consumer model (consumer groups holding more than one consumer)
In this case, we assume that the topic has multiple partitions and that each partition is handled by a single consumer (precisely, a single thread) in each consumer group, which is fairly called multithreading.
There are three methods by which we can retain the order of messages within partitions in Kafka. Each method has its own pros and cons.
Method 1: Round Robin or Spraying
Method 2: Hashing Key Partition
Method 3: Custom Partitioner
Round Robin or Spraying (Default)
In this method, the partitioner sends messages to all the partitions in round-robin fashion, ensuring a balanced server load; no partition gets overloaded. This achieves parallelism and load balancing, but it fails to maintain the overall order, although the order within each partition is maintained. This is the default method, and it is not suitable for some business scenarios.
In order to overcome that limitation and maintain message ordering, let's try another approach.
Hashing Key Partition
In this method, we create a ProducerRecord that specifies a message key with each message passed to the topic, to ensure per-partition ordering.
The default partitioner uses the hash of the key to ensure that all messages for the same key go to the same partition. This is the easiest and most common approach. It is the same method used for Hive bucketing as well, and it uses a modulo operation for hashing:
Hash(Key) % Number of partitions -> Partition number
We can say that the key helps define the partition to which the producer always sends messages for a given entity, which maintains their order. The drawback of this method is that hashing can route a disproportionate amount of data to a single partition, overloading it.
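To illustrate the keyed approach, here is a sketch of a producer sending two records with the same (invented) key, so they land in the same partition and stay in order:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
// Both records share the key "user-42", so the default partitioner
// hashes them to the same partition and their relative order is kept.
producer.send(new ProducerRecord<>("orders", "user-42", "order created"));
producer.send(new ProducerRecord<>("orders", "user-42", "order paid"));
producer.close();
```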
Custom Partitioner
We can write our own business logic to decide which message needs to be sent to which partition. With this approach, we can order messages as per our business logic and achieve parallelism at the same time.
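As a sketch of such a partitioner (the VIP routing rule is an invented example, not a recommendation):

```java
import java.util.Map;
import org.apache.kafka.clients.producer.Partitioner;
import org.apache.kafka.common.Cluster;

public class VipPartitioner implements Partitioner {

    @Override
    public void configure(Map<String, ?> configs) {}

    @Override
    public int partition(String topic, Object key, byte[] keyBytes,
                         Object value, byte[] valueBytes, Cluster cluster) {
        int numPartitions = cluster.partitionCountForTopic(topic);
        // Invented business rule: VIP customers get a dedicated partition 0,
        // everyone else is hashed over the remaining partitions.
        if (((String) key).startsWith("vip-")) {
            return 0;
        }
        return 1 + Math.abs(key.hashCode() % (numPartitions - 1));
    }

    @Override
    public void close() {}
}
```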
For more details, you can check the link below:
https://medium.com/latentview-data-services/how-to-use-apache-kafka-to-guarantee-message-ordering-ac2d00da6c22
Also, please note that this information covers partition-level parallelism.
There is also a newer strategy called consumer-level parallelism. I have not given it a thorough read, but you can find details at the Confluent link below:
https://www.confluent.io/blog/introducing-confluent-parallel-message-processing-client/
As usual, it's a bit confusing to see the benefits of one splitting method over the others.
I can't see the difference/pros-cons between having
Topic1 -> P0 and Topic2 -> P0
over Topic1 -> P0, P1
with a consumer pulling from 2 topics versus a single topic with 2 partitions, while P0 and P1 hold different event types or entities.
The only benefit I can see is that if another consumer needs Topic 2's data, it's easy to consume.
Regarding topic auto-creation, are there any benefits to doing it that way, or will it get out of hand after some time?
Thanks
I would say this decision depends on multiple factors:
Logic/Separation of Concerns: You can decide whether to use multiple topics or multiple partitions based on the logic you are trying to implement. Normally, you need distinct topics for distinct entities. For example, say you want to stream users and companies. It doesn't make much sense to create a single topic with two partitions where the first partition holds users and the second one holds companies. Also, a single topic whose partitions separate entities won't allow you to implement, e.g., message ordering for users, which can only be achieved using keyed messages (messages with the same key are placed in the same partition).
Host storage capabilities: A partition must fit in the storage of the host machine, while a topic can be distributed across the whole Kafka cluster by splitting it into multiple partitions. The Kafka docs shed some more light on this:
The partitions in the log serve several purposes. First, they allow the log to scale beyond a size that will fit on a single server. Each individual partition must fit on the servers that host it, but a topic may have many partitions so it can handle an arbitrary amount of data. Second they act as the unit of parallelism—more on that in a bit.
Throughput: If you have high throughput, it makes more sense to create different topics per entity and split them into multiple partitions so that multiple consumers can join the consumer group. Don't forget that the level of parallelism in Kafka is defined by the number of partitions (and obviously active consumers).
Retention Policy: Message retention in Kafka works on partition/segment level and you need to make sure that the partitioning you've made in conjunction with the desired retention policy you've picked will support your use case.
Coming to your second question now, I am not sure what your requirement is or how it relates to the first question. When a producer attempts to write a message to a Kafka topic that does not exist, the topic will be created automatically if auto.create.topics.enable is set to true. Otherwise, the topic won't get created and your producer will fail.
auto.create.topics.enable: Enable auto creation of topic on the server
Again, this decision should be dependent on your requirements and the desired behaviour. Normally, auto.create.topics.enable should be set to false in production environments in order to mitigate any risks.
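With auto-creation disabled, topics are created explicitly, for example with the AdminClient (topic name, partition count, and replication factor below are example values only):

```java
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");

try (AdminClient admin = AdminClient.create(props)) {
    // "users", 6 partitions, replication factor 3: example values only.
    NewTopic topic = new NewTopic("users", 6, (short) 3);
    admin.createTopics(List.of(topic)).all().get(); // throws checked exceptions
}
```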
Just adding some things on top of Giorgos' answer:
By choosing the second approach over the first one, you would lose a lot of the features that Kafka offers, such as data balancing across brokers, removing topics, consumer groups, ACLs, and joins with Kafka Streams.
I think that this flag can be easily compared with automatically creating tables in your database. It's handy to do it in your dev environments but you never want it to happen in production.
In a Kafka deployment, custom topic-partitioner logic is used to route all messages that belong to the same root entity (for example, all messages for a particular user) to the same partition.
Can anyone recommend a strategy on how to deal with partitioning logic change in such live system?
One example that affects the partitioning is the obvious change of the partitioner implementation. The other example would be change of the number of partitions for a given topic.
In both cases, we would end up in a situation where some of the messages for user A that entered Kafka before the change will be in partition 1, while after the change in partitioning logic or partition count, messages for that same user A will go to partition 2.
This can lead to messages for user A being processed out of order: the consumer reading from partition 2 could process messages before the consumer reading from partition 1.
Has anyone faced this issue in a live system? How did you, or would you, solve it?
This seems like a very common scenario, but I was not able to find anything about it.
Thanks
If by partitioning logic you mean the partitioning algorithm, I do not understand how that would just change like that. As for increasing partitions, it is in theory not possible to increase them while guaranteeing the order of messages -- there is a KIP for that, but its status is still "under discussion".
What I usually do when increasing partitions is accept a small downtime.
The playbook is like this:
Stop the producer
Monitor the lag for the consumer group (see the lag-check sketch below)
Once lag is zero, shut down the consumers
Increase the number of partitions
Start the consumers
Start the producers
This way, you can be sure that there are no message losses and no out-of-order consumption.
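For the lag check in step 2 above, here is a programmatic sketch using the AdminClient (the group name and broker address are invented); the group has caught up when every committed offset equals the partition's end offset:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");

try (AdminClient admin = AdminClient.create(props)) {
    // Offsets the group has committed so far ("my-group" is invented).
    Map<TopicPartition, OffsetAndMetadata> committed =
            admin.listConsumerGroupOffsets("my-group")
                 .partitionsToOffsetAndMetadata().get(); // throws checked exceptions

    // End offsets for the same partitions.
    Map<TopicPartition, OffsetSpec> latestSpec = new HashMap<>();
    committed.keySet().forEach(tp -> latestSpec.put(tp, OffsetSpec.latest()));
    Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> endOffsets =
            admin.listOffsets(latestSpec).all().get();

    // Lag per partition; zero everywhere means the group has caught up.
    committed.forEach((tp, meta) ->
            System.out.println(tp + " lag=" + (endOffsets.get(tp).offset() - meta.offset())));
}
```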
If you want to avoid downtime, you may have to rely on an external system that can temporarily hold the data per partition, in order, and publish it later; but that solution depends on a few things.
The best way to change how records are partitioned is to use the default Apache Kafka® partitioner, and change the record keys. If all records from a user need to go to the same topic then make sure they all have the same key.
If you'd like to change the keys for a whole set, you can use KSQL to re-key the data (republish to a new topic with new keys) using the PARTITION BY clause.
We are working with Confluent Platform and are still getting to know its internals, but we have implemented some generic use cases and are now trying to optimize our cluster.
For my use case, I need to increase the number of partitions of a topic. What is the impact of doing so? Can you please share some details?
Sure, you can increase partitions.
However,
Increasing partitions does not move existing data. If you are using Confluent Enterprise, you could use confluent-rebalancer; if not, the kafka-reassign-partitions CLI tool. Either way, you'll definitely want to rebalance the topic to "optimize" the cluster.
During the retention period of the topic (read: for the existing data), if you previously had a producer sending data to partition N and now have N+1 partitions, you lose the ordering of those messages that solely existed in partition N; new messages could be spread across partitions as the producer recalculates assignments with the DefaultPartitioner. If you don't send keys with your messages, this isn't a problem.
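For reference, increasing the partition count itself is a one-liner with the AdminClient (topic name and target count are invented):

```java
import java.util.Map;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;

// admin: an AdminClient created as usual; "events" and 12 are example values.
admin.createPartitions(Map.of("events", NewPartitions.increaseTo(12))).all().get();
```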
I need to understand something about Kafka:
When I have a single Kafka broker on a single host, is there any sense in having more than one partition for a topic? I mean, even if my data can be distinguished by some key (say, tenant id), what is the benefit of doing this on a single Kafka broker? Does it give any parallelism, and if so, how?
When a key is used, does this mean that each key is mapped to a given partition? Must the number of partitions for a topic equal the number of possible values of the key I specified, or is it just a hash, so the numbers don't have to be equal?
From what I read, topics are created according to the types of messages to be placed in Kafka. But in my case, I have created 2 topics because I have 2 types of consumption: one for reading messages one by one, and a second for when a bulk of messages comes into the queue (for application reasons), in which case they are entered into the second topic. Is that a good design even though the message type is the same? Is there another practice for such a scenario?
Yes, it definitely makes sense to have more than one partition for a topic even when you have a single Kafka broker. A scenario where you can benefit from this is pretty simple:
you need to guarantee in-order processing by tenant id
processing logic for each message is rather complex and takes some time. Especially the case when the Kafka message itself is simple, but the logic behind processing this message takes time (simple example - message is an URL, and the processing logic is downloading the file from there and doing some processing)
Given these 2 conditions you may get into a situation where one consumer is not able to keep up processing all the messages if all the data goes to a single partition. Remember, you can process one partition with exactly one consumer (well, you can use 2 consumers if using different consumer groups, but that's not your case), so you'll start getting behind over time. But if you have more than one partition you'll either be able to use one consumer and process data in parallel (this could help to speed things up in some cases) or just add more consumers.
By default, Kafka uses hash-based partitioning. This is configurable by providing a custom Partitioner; for example, you can use random partitioning if you don't care which partition your message ends up in.
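Plugging in a custom partitioner is just producer configuration, for example:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
// RandomPartitioner is hypothetical; any class implementing the
// org.apache.kafka.clients.producer.Partitioner interface works here.
props.put(ProducerConfig.PARTITIONER_CLASS_CONFIG, RandomPartitioner.class.getName());
```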
It's totally up to you what purposes you have topics for
UPD, answers to questions in the comment:
Adding more consumers is usually done for adding more computing power, not for achieving desired parallelism. To add parallelism add partitions. Most consumer implementations process different partitions on different threads, so if you have enough computing power, you might just have a single consumer processing multiple partitions in parallel. Then, if you start bumping into situations where one consumer is not enough, you just add more consumers.
When you create a topic, you just specify the number of partitions (and the replication factor for the topic, but that's a different thing). The key, and the partition to send to, are completely up to the producer. In fact, you could configure your producer to use a random partitioner and it won't even care about keys, it will just pick a partition randomly. There's no direct key -> partition relation enforced by Kafka; it's just convenient to benefit from setting things up like this.
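To illustrate how fully the producer is in control: ProducerRecord even has a constructor that takes an explicit partition, in which case the key is stored with the record but ignored for routing (values below are invented):

```java
import org.apache.kafka.clients.producer.ProducerRecord;

// producer: a KafkaProducer<String, String> created as usual.
// Partition 3 is chosen explicitly; the key "user-42" is not used for routing.
producer.send(new ProducerRecord<>("orders", 3, "user-42", "payload"));
```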
Can you elaborate on this one? I'm not sure I understand it, but I guess your question is whether you can send just a value and have Kafka infer a key somehow by itself. If so, the answer is no: Kafka does not apply any transformation to messages and stores them as is, so if you want your message to contain a key, the producer must explicitly send one.