I've run into a specific requirement and would like to hear people's views rather than re-invent the wheel.
I've got 2 Kafka topics - A and B.
A and B would be filled with messages at different ingest rates.
For example: A could be filled with 10K messages first, followed by B. Or, in some cases, A and B would be filled with messages at the same time. The ingest process is something we have no control over; it's effectively a 3rd-party upstream system for us.
I need to pick up the messages from these 2 topics and mix them in equal proportion.
For example: if the configured size is 50, then I should pick up 50 from A and 50 from B (or wait until I have them) and then send them off to another Kafka topic as 100 (with equal proportions of A and B).
I was wondering what's the best way to solve this? Although I was looking at the join semantics of KStreams and KTables, I'm not quite convinced that this is a valid use case for a join (because there's no key in the message that joins these 2 streams or tables).
Can this be done without Kafka Streams, with a vanilla Kafka consumer (perhaps with some batching)? Thoughts?
With Spring, create 2 @KafkaListeners, one for A, one for B; set the container ack mode to MANUAL and add the Acknowledgment to the method signature.
In each listener, accumulate records until you get 50 then pause the listener container (so that Kafka won't send any more, but the consumer stays alive).
You might need to set the max.poll.records to 1 to better control consumption.
When you have 50 in each, combine and send.
Commit the offsets by calling acknowledge() on the last Acknowledgment received in A and B.
Resume the containers.
Repeat.
Deferring the offset commits will avoid record loss in the event of a server crash while you are in the accumulating stage.
When you have lots of messages in both topics, you can skip the pause/resume part.
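The accumulate, combine and send steps above can be sketched without any Kafka dependency. This is only an illustration of the buffering logic, not the Spring code itself: the class name EqualMixer and its methods are hypothetical, and the pause/resume and acknowledge() calls are represented only by comments.

```python
from collections import deque


class EqualMixer:
    """Buffers records from topics A and B and releases them in equal proportion."""

    def __init__(self, size):
        self.size = size
        self.buffers = {"A": deque(), "B": deque()}

    def is_full(self, topic):
        # A real listener would pause its container once this returns True.
        return len(self.buffers[topic]) >= self.size

    def add(self, topic, record):
        """Buffer one record; return a mixed batch once both sides are full."""
        self.buffers[topic].append(record)
        if all(self.is_full(t) for t in self.buffers):
            return self._drain()
        return None

    def _drain(self):
        batch = []
        for topic in ("A", "B"):
            batch.extend(self.buffers[topic].popleft() for _ in range(self.size))
        # After the batch is sent, the real code would commit the deferred
        # offsets for A and B and resume both containers.
        return batch
```

With size 50, add() returns None until both buffers hold 50 records, then returns a list of 100 records (50 from A followed by 50 from B).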
Setting the stage..
Here's a diagram to help explain my problem better:
Now, keep in mind the following points:
I have a producer sending messages to 8 partitions of My topic.
On the other side, I have 8 consumers, one for each partition.
The legacy system has limited resources, and can process at most 8 simultaneous requests.
To make sure I don't overwhelm the legacy system, a consumer will only send one request at a time. Any new message will wait for the current message to finish processing.
Explaining the problem..
Since messages are blocked until the previous message is processed, I want to minimize the time a message will wait before it's processed. To do that I need messages to be distributed equally over the partitions. A message must not be consumed by a busy consumer when another is free.
For example, if 8 messages are produced simultaneously, each message should be sent to one partition. Therefore, each message will be consumed by one consumer, ensuring the messages are processed concurrently without any lag.
What I tried so far
Since the partitions are assigned correctly to the consumers, I had to assume the producer wasn't evenly delivering messages to the partitions. Which turned out to be the case. Here's what I tried so far to resolve the issue...
Using null keys
The most intuitive solution was to produce records without keys, which should basically make the DefaultPartitioner behave like the RoundRobinPartitioner. Unfortunately, this solution did not work.
Using null keys and batch.size=0
Since using null keys didn't work, it made sense that messages were being sent in batches, breaking the even distribution. Setting the batch size to 0 should've caused the producer to send messages one by one. That didn't work either.
Using RoundRobinPartitioner
This one was weird. The RoundRobinPartitioner distributed messages evenly, but it only used 4 out of the 8 partitions.
Using RoundRobinPartitioner and batch.size=0
This made no difference.
Finally, my question:
I need the producer to send messages in Round Robin fashion one by one without batching. How can I do that?
TL;DR
I need the producer to send messages in Round Robin fashion without batching. How can I do that?
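For illustration only, here is what strict one-by-one round-robin assignment looks like as plain logic. Kafka's ProducerRecord accepts an explicit partition number, so a producer can apply a counter like this itself instead of relying on a partitioner; whether that resolves the uneven distribution described above is an assumption, not something confirmed here. The class name RoundRobinAssigner is hypothetical, and the sketch is not thread-safe.

```python
from itertools import count


class RoundRobinAssigner:
    """Hands out partition numbers 0..n-1 in strict rotation, one per message."""

    def __init__(self, num_partitions):
        self.num_partitions = num_partitions
        self._counter = count()

    def next_partition(self):
        # Each call advances the counter by exactly one message.
        return next(self._counter) % self.num_partitions


assigner = RoundRobinAssigner(8)
first_ten = [assigner.next_partition() for _ in range(10)]
```

Here first_ten is [0, 1, 2, 3, 4, 5, 6, 7, 0, 1]: eight simultaneous messages each get their own partition.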
I have a stream A which publishes to a Kafka server, and a stream B which consumes from that Kafka server, processes the records and then publishes to multiple Kafka topics.
Stream A produces a record roughly every 50 ms (publishing to Kafka included), while stream B takes about 500 ms to process and publish each record (so, 10 times slower).
Because of this, even records that stream A has just published can sometimes take up to 5 minutes to be processed by stream B under high load (e.g. 50k records to be processed at once), which is close to unacceptable.
My question is: what are the best practices for this scenario, in general, and what could be a quick approach to handle this? These streams are part of the same app.
I know that maybe I only gave the big picture, but I'm looking for a starting point, any ideas are welcome.
There is no back-pressure mechanism in Kafka. If a downstream consumer is slower, the lag will grow.
The way to deal with this is to spin more instances of the consumers or make your consumer beefier (more CPU probably, but depends on what is the bottleneck).
It sounds like you have both the upstream producer and the downstream consumer in the same deployable. This is a bit questionable: why not just let B consume directly from the source of A?
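As a rough worked example of why the lag grows, using the question's figures (one record produced every 50 ms, 500 ms to process one): the function names below are hypothetical, and the model assumes steady rates and perfectly parallel consumers, which real deployments only approximate.

```python
import math


def consumers_needed(produce_interval_ms, process_time_ms):
    """Smallest number of parallel consumers that keeps up with the producer."""
    return math.ceil(process_time_ms / produce_interval_ms)


def lag_growth_per_second(produce_interval_ms, process_time_ms, consumers):
    """Records per second by which the backlog grows (0 or less means it keeps up)."""
    arrivals = 1000 / produce_interval_ms
    completions = consumers * 1000 / process_time_ms
    return arrivals - completions


needed = consumers_needed(50, 500)               # 10
growth_one = lag_growth_per_second(50, 500, 1)   # 18.0 records/s
growth_ten = lag_growth_per_second(50, 500, 10)  # 0.0 records/s
```

With a single consumer the backlog grows by 18 records every second; it only stops growing at around 10 consumers.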
OK, it sounds to me like you are trying to solve with technology a problem that needs application logic.
If you can group the events you produce under the same key (for example, you have a Customer with Id: 111 and you send all CREATE, UPDATE, DELETE events with the same key <-> Id: 111), then you can use a topic with multiple partitions.
This way, all the events produced with the same key will land in the same partition and are guaranteed to be processed in order. You can then parallelise the consuming and processing, so with 10 partitions you can be as fast as the producers.
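A small sketch of the keying idea, assuming a stand-in hash: zlib.crc32 is used here only so the example runs without Kafka (the Java client's default partitioner hashes the serialized key with murmur2 instead), and the helper names are hypothetical.

```python
import zlib
from collections import defaultdict


def partition_for(key, num_partitions):
    # Stand-in for Kafka's key hashing; deterministic for a given partition count.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def route(events, num_partitions):
    """Group (key, event) pairs by target partition, preserving arrival order."""
    partitions = defaultdict(list)
    for key, event in events:
        partitions[partition_for(key, num_partitions)].append((key, event))
    return partitions


events = [("111", "CREATE"), ("222", "CREATE"), ("111", "UPDATE"), ("111", "DELETE")]
routed = route(events, 10)
customer_111 = [e for k, e in routed[partition_for("111", 10)] if k == "111"]
```

All events for customer 111 end up in one partition in their original order (CREATE, UPDATE, DELETE), while other customers can be processed in parallel on other partitions.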
If this is not possible, you have to use the backpressure mechanisms of Alpakka Kafka streams, and maybe an Akka state machine for the application logic part; I explain how this can be done in the following blog.
I want to share a problem and a solution I used, as I think it may be beneficial for others, if people have any other solutions please share.
I have a table with 1,000,000 rows, which I want to send to kafka, and spread the data between 20 partitions.
I want to notify the consumer when producer reached end of data, I don't want to have direct connection between producer and consumer.
I know kafka is designed as logical endless stream of data, but I still need to mark the end of the specific table.
There was a suggestion to count the number of items per logical section, and send this data (to a metadata topic), so the consumer will be able to count items, and know when the logical section ended.
There are several disadvantages for this approach:
As data is spread between partitions, I can tell there are total x items at my logical section, however if there are multiple consumers (one per partition), they'll need to share a counter of consumed messages per logical section. I want to avoid this complexity. Also when consumer is stopped and resumed, it will need to know how many items were already consumed and keep context.
Regular producer session guarantees at-least-once delivery, which means I may have duplicated messages. Counting the messages will need to take this into account (and avoid counting duplicated messages).
There is also the case where I don't know in advance the number of items per logical section (I'm also a kind of consumer, consuming a stream of events and signaled when the data has ended). In this case, the producer will also need to have a counter, keep it when stopped and resumed, etc. Several producers would need to share the counter, and so on. So it adds a lot of complexity to the process.
Solution 1:
I actually want the last message at each partition indicate it is the last message.
I can do some work in advance, create some random message keys, send messages partitioned by key, and test to which partition each message is directed. As partitioning by keys is deterministic (for given number of partitions), I want to prepare a map of keys and the target partition. For example key: ‘xyz’ is directed to partition #0, key ‘hjk’ is directed to partition #1 etc, and finally have the reversed map, so for partition 0, use key ‘xyz’, for partition 1, use key ‘hjk’ etc.
Now I can send the entire table (except for the last 20 rows) with a random partition strategy, so the data is spread between partitions for almost the entire table.
When I come to the last 20 rows, I’ll send them using the partition-by-key strategy, giving each message a key that hashes to a different partition. This way, each one of the 20 partitions will get one of the last 20 messages. For each one of the last 20 messages, I'll set a relevant header which will state it is the last one.
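The partition-to-key map described in Solution 1 can be built by brute force. The sketch below assumes a stand-in hash (zlib.crc32) so that it runs without Kafka; the Java client's default partitioner hashes the serialized key with murmur2, so real keys would have to be probed against the real partitioner. The key format "end-N" and the function names are hypothetical.

```python
import zlib
from itertools import count


def partition_for(key, num_partitions):
    # Stand-in for the producer's key hashing (NOT Kafka's murmur2).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def keys_for_all_partitions(num_partitions):
    """Try candidate keys until every partition has one key that maps to it."""
    found = {}
    for i in count():
        key = f"end-{i}"
        # Keep only the first key discovered for each partition.
        found.setdefault(partition_for(key, num_partitions), key)
        if len(found) == num_partitions:
            return found


key_by_partition = keys_for_all_partitions(20)
```

The result maps each of the 20 partition numbers to one key; the map is only valid for that partition count and has to be rebuilt if the topic is repartitioned.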
Solution 2:
Similar to solution 1, but send the entire table spread to random partitions. Now send 20 metadata messages, which I’ll direct to the 20 partitions using the partition by key strategy (by setting appropriate keys).
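On the consumer side, Solutions 1 and 2 both come down to noticing an end marker per partition. A minimal sketch, assuming the marker is carried in a header named "last" with value "true" (both hypothetical) and that headers are available as a dict:

```python
class EndOfTableTracker:
    """Tracks which assigned partitions have not yet delivered their end marker."""

    def __init__(self, assigned_partitions):
        self.pending = set(assigned_partitions)

    def on_record(self, partition, headers):
        """Process one record's headers; return True once every partition is done."""
        if headers.get("last") == "true":
            # discard() is idempotent, so a redelivered end marker
            # (at-least-once delivery) does no harm.
            self.pending.discard(partition)
        return not self.pending


tracker = EndOfTableTracker([0, 1])
done_after_data = tracker.on_record(0, {})
done_after_first = tracker.on_record(0, {"last": "true"})
done_after_duplicate = tracker.on_record(0, {"last": "true"})
done_after_second = tracker.on_record(1, {"last": "true"})
```

Because each consumer only tracks its own partitions, no counter has to be shared between consumers, and duplicate markers are harmless.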
Solution 3:
Have additional control topic. After the table was sent entirely to the data topic, send a message to the control topic saying table is completed. The consumer will need to test the control topic from time to time, when it gets the 'end of data' message, it will know that if it reached the end of the partition, it actually reached the end of the data for that partition. This solution is less flexible and less recommended, but I wrote it as well.
Another solution is to use an open-source analog of S3 (e.g. minio.io). Producers can upload the data and send a message with a link to the object storage. Consumers will remove the data from object storage after collecting it.
I have Avro-encoded messages on a single Kafka topic with a single partition. Each of these messages is to be consumed by a specific consumer only. For example, with messages a1, a2, b1 and c1 on this topic, there are 3 consumers named A, B and C. Each consumer would get all the messages, but ultimately A would consume a1 and a2, B would consume b1 and C would consume c1.
I want to know how typically this is solved when using avro on Kafka:
1. Leave it to the consumers to deserialize the message, then use some application logic to decide whether to consume or drop it.
2. Use partition logic to make each message go to a particular partition, then set up each consumer to listen to only a single partition.
3. Set up another 3 topics and a tiny Kafka Streams application that does the filtering + routing from the main topic to these 3 specific topics.
4. Use Kafka headers to inject an identifier that downstream consumers can filter on.
It looks like each of the options has its pros and cons. I want to know if there is a convention that people follow, or whether there are other ways of solving this.
It depends...
If you only have a single partitioned topic, the only option is to let each consumer read all data and filter client side which data the consumer is interested in. For this case, each consumer would need to use a different group.id to isolate the consumers from each other.
Option 2 is certainly possible, if you can control the input topic you are reading from. You might still have different group.ids for each consumer, as it seems that the consumers represent different applications that should be isolated from each other. The question is still whether this is a good model, because the idea of partitions is to provide horizontal scale-out and data-parallel processing; if each application reads from only one partition, it does not seem to align with this model. You also need to know which data goes into which partition, on both the producer side and the consumer side, to get the mapping right. Hence, it implies a "coordination" between producer and consumer, which seems undesirable.
Option 3 seems to indicate that you cannot control the input topic and thus want to branch the data into multiple topics? This seems to be a good approach in general, as topics are a logical categorization of data. However, it would be even better to have 3 topics for the different data to begin with! If you cannot have 3 input topics from the beginning, Option 3 still provides a good conceptual setup; however, it won't provide much of a performance benefit, because the Kafka Streams application is required to read and write each record once. The saving you gain is that each application would only consume from one topic and thus redundant data reads are avoided -- if you had, let's say, 100 applications (each interested in only 1/100 of the data), you would be able to cut down the load significantly, from a 99x read overhead to a 1x read and 1x write overhead. For your case you don't really cut down much, as you go from a 2x read overhead to a 1x read + 1x write overhead. Additionally, you need to manage the Kafka Streams application itself.
Option 4 seems to be orthogonal, because it answers the question of how the filtering works; headers can be used with Option 1 and Option 3 to do the actual filtering/branching.
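To make the client-side filtering of Option 1 combined with Option 4 concrete, here is a minimal sketch. Records are modelled as plain dicts, and the header name "target" is a hypothetical convention the producer would have to follow; a real consumer would read the header from the record before (or instead of) Avro-deserializing the value.

```python
def is_for(record, consumer_name, header="target"):
    """True if the record's routing header names this consumer."""
    return record["headers"].get(header) == consumer_name


def consume(records, consumer_name):
    # Every consumer sees every record; it keeps only the ones addressed to it.
    return [r["value"] for r in records if is_for(r, consumer_name)]


records = [
    {"value": "a1", "headers": {"target": "A"}},
    {"value": "a2", "headers": {"target": "A"}},
    {"value": "b1", "headers": {"target": "B"}},
    {"value": "c1", "headers": {"target": "C"}},
]
```

Filtering on a header lets a consumer drop unwanted records without paying the cost of deserializing their payloads.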
The data in the topic is just bytes, Avro shouldn't matter.
Since you only have one partition, only one consumer of a group can be actively reading the data.
If you only want to process certain offsets, you must either seek to them manually or skip over messages in your poll loop and commit those offsets.
Is it okay to batch 100 messages into a single object and send those objects to Kafka, or should I split those 100 messages into individual messages and then put them in Kafka?
Say, for example, I have an object that contains a List. I can put 100 strings in that list and send the object to Kafka. Is it better to do it that way, or should I split the list of strings and send individual strings to Kafka instead?
What are some pros and cons of the above approaches?
Batching is always good when async processing, until you need to partially process the batch in case of errors.
If you are processing an order and the list of 100 contains its items, send them together, as they will be processed together. If you are sending 100 orders and will process them independently, send them one by one, as an error in one order should not block the others.
As for message sizes, kafka has some message size limits, but these are configurable.
You definitely need to improve your question.
You want to send a huge message that is larger than the max.message.bytes configuration of your Kafka broker (let's assume you can't change it). You break it down and put it back together on the consumer side.
This would require working around the limitations of your Kafka deployment as it stands. For example:
Should your consumer process all these 100 strings as if they were one batch? When should your consumer decide to commit the offsets for these messages? Is your consumer's processing idempotent? Do you have one consumer or multiple consumer instances? What if the 100 strings were split across 5 partitions? Which consumer gets which subset of these 100 strings?
One approach is to create 100 messages, all with the same batch id, like so
(batch1:message1, batch1:message2, batch1:message3)
On the consumer side collect all these messages with the same key
(batch1: (message1, message2, message3))
But how would you know when the batch ends? Does the sequence message1, message2, message3 matter?
So you do something like this
(batch1:message1of3, batch1:message2of3, batch1:message3of3)
Now what if you received message1of3 and message2of3 but not message3of3? How long do you wait for it?
As you can see, at each step there are multiple ways to go about this, and you will have to make the choices that are right for your problem. Perhaps you will use timeouts; perhaps in your case batches are interleaved like this
(batch1:message1of3, batch2:message2of5, batch1:message2of3...)
Expect to make some compromises. With Kafka, your consumer group is guaranteed to receive all messages, and while it's running, each consumer is assigned one or more partitions (meaning a single partition is not assigned to more than one consumer at the same time). Kafka will also assign messages with the same key to the same partition. With these two properties in mind, you can design a system that consumes messages in batches, with some obvious trade-offs and limitations.
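One possible shape for the consumer-side reassembly discussed above, with the "NofM" numbering and a timeout, is sketched here. It is one set of choices among many: the class name BatchAssembler, the 1-based index and the injected clock are all assumptions made for the example, and incomplete batches are simply dropped on timeout.

```python
import time


class BatchAssembler:
    """Reassembles 'N of M' messages per batch id; drops batches that stall."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock
        self.pending = {}  # batch_id -> (first_seen, {index: payload})

    def add(self, batch_id, index, total, payload):
        """Store one part (index is 1-based); return the full batch when complete."""
        _, parts = self.pending.setdefault(batch_id, (self.clock(), {}))
        parts[index] = payload  # a redelivered part just overwrites itself
        if len(parts) == total:
            del self.pending[batch_id]
            return [parts[i] for i in range(1, total + 1)]
        return None

    def expire(self):
        """Drop batches older than the timeout and return their ids."""
        now = self.clock()
        stale = [b for b, (seen, _) in self.pending.items()
                 if now - seen > self.timeout_s]
        for batch_id in stale:
            del self.pending[batch_id]
        return stale
```

Interleaved batches are handled because parts are kept per batch id, and out-of-order arrival is handled because the parts are emitted by index rather than by arrival order.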