I'm currently running some tests with Apache Kafka. In the Kafka Producer configuration, the parameters batch.size and linger.ms control the batching strategy. Is it possible to change these parameters dynamically while producing? For example, if the data ingestion rate rises quickly, we may want to increase batch.size to accumulate more messages per batch. I could not find any example of dynamic batching with the Kafka Producer. Is it possible to implement?
It's possible, but you would have to close the current Producer instance and open a new one with the updated configuration at runtime, while making sure that no events are dropped during the switch.
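A minimal sketch of that close-and-reopen step, assuming a hypothetical helper class SwappableResource (not part of the Kafka client) that wraps whatever producer type you use. Sends take a read lock and the swap takes the write lock, so no send can hit an instance that is already closed:

```java
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Consumer;
import java.util.function.Function;

/** Hypothetical helper: holds one live resource and swaps it for a reconfigured one. */
final class SwappableResource<T extends AutoCloseable> {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private final Function<Map<String, Object>, T> factory;
    private T current;

    SwappableResource(Function<Map<String, Object>, T> factory,
                      Map<String, Object> initialConfig) {
        this.factory = factory;
        this.current = factory.apply(initialConfig);
    }

    /** Runs an action (for example p -> p.send(record)) against the live instance. */
    void use(Consumer<T> action) {
        lock.readLock().lock();
        try {
            action.accept(current);
        } finally {
            lock.readLock().unlock();
        }
    }

    /** Builds a replacement, swaps it in, then closes the old instance. */
    void reconfigure(Map<String, Object> newConfig) {
        T replacement = factory.apply(newConfig); // built before taking the lock
        T old;
        lock.writeLock().lock(); // waits until in-flight use() calls have returned
        try {
            old = current;
            current = replacement;
        } finally {
            lock.writeLock().unlock();
        }
        try {
            old.close(); // a KafkaProducer blocks here until buffered records are sent
        } catch (Exception e) {
            throw new IllegalStateException("closing the old instance failed", e);
        }
    }
}
```

With the Kafka client on the classpath, the factory could be props -> new KafkaProducer<String, String>(props), and reconfigure() would be called with a copy of the config map containing the new batch.size and linger.ms. Whether the pause while the write lock is held is acceptable depends on your throughput.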
Related
I am looking for clarification on which properties can be used to avoid producer timeouts, caused either by too much time passing since batch creation (for example behind a blocked batch) or by a timeout while reading metadata. I am unsure whether I should increase max.block.ms or delivery.timeout.ms. Do we also need to set buffer.memory along with these timeouts to avoid blocking caused by memory pressure?
I am using the Spring KafkaTemplate send method to produce messages, with the producer properties defined in a bean.
Background
We have a Kafka topic with a steady stream of data. To process it we have a stateless Flink pipeline that consumes that topic and writes to another topic.
From time to time we have bursts of information that our Flink pipeline is not configured to handle. We don't want to size our Flink pipeline and cluster to always support the maximum load we can have; we want to scale dynamically according to the load (budget reasons $$$).
Solutions we thought of
One way to do so is to add/remove nodes to the Flink cluster and change the parallelism of the Flink pipeline operators. This will require stopping the Flink job with a snapshot, reconfiguring the parallelism and restarting with new parallelism.
This would be great but we cannot allow ourselves the downtime it produces. We have to scale up/down without downtime.
If we used regular Kafka consumers, it would be as simple as adding a consumer (assuming we have enough Kafka partitions), and Kafka would redistribute the topic partitions among all the consumers.
The Flink Kafka consumer manages the partition assignment and the offset on its own which allows exactly-once semantics (we don't need it). The drawback is that a single Flink job always uses all the topic partitions.
We thought we could create another instance of Flink that would subscribe to the same topic with the same group and let Kafka distribute the partitions between them. But for that we would need the Kafka Flink consumer to let Kafka manage which partitions are assigned to which consumer.
What we are looking for
We couldn't find a library that contains such a consumer or a configuration in the existing consumer. We could write it on our own (not so difficult) but if there is an existing solution we'd rather use it.
Are we missing something? Are we misunderstanding something? Is there a better solution?
Thanks!
The most straightforward approach, since you said that at worst you'll need double the capacity, would be to modify your topology to be able to write Kafka messages you can't process quickly enough to a second overflow Kafka topic. Both input and output Kafka topic names would be configurable. Maybe you would have a threshold backlog delay that automatically triggers this writing or maybe you would have a flag in the topology that you can externally set while the topology is running. That's a design detail you can work through that has operational implications.
This gives you a Flink topology that can handle some maximum number of messages in a timely fashion while writing the remaining messages, which it can't handle, to a second Kafka topic. You can then run a second instance of the same Flink topology that reads from that secondary topic and writes, if necessary, to a third topic. If the write to the overflow topic happens very early in the topology processing, you could chain several of these instances together via Kafka with minimal latency and without having to reconfigure and restart any topologies.
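The routing decision described above could be sketched roughly as follows. OverflowRouter, its method names, and the use of the record timestamp as the backlog measure are all assumptions made for illustration; nothing here is a Flink or Kafka API:

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Hypothetical helper: decides whether a record is processed or diverted. */
final class OverflowRouter {
    private final String outputTopic;
    private final String overflowTopic;
    private final long maxBacklogDelayMs;
    private final AtomicBoolean forceOverflow = new AtomicBoolean(false);

    OverflowRouter(String outputTopic, String overflowTopic, long maxBacklogDelayMs) {
        this.outputTopic = outputTopic;
        this.overflowTopic = overflowTopic;
        this.maxBacklogDelayMs = maxBacklogDelayMs;
    }

    /** External switch, for example flipped by an operator while the job runs. */
    void setForceOverflow(boolean value) {
        forceOverflow.set(value);
    }

    /** True if the record should be written unprocessed to the overflow topic. */
    boolean shouldOverflow(long recordTimestampMs, long nowMs) {
        return forceOverflow.get() || (nowMs - recordTimestampMs) > maxBacklogDelayMs;
    }

    /** Topic this record should be written to, given how far behind we are. */
    String targetTopic(long recordTimestampMs, long nowMs) {
        return shouldOverflow(recordTimestampMs, nowMs) ? overflowTopic : outputTopic;
    }
}
```

The check would sit as early as possible in the topology, so that diverted records skip the expensive processing entirely.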
I have a situation in Kafka where the producer publishes messages at a much higher rate than the consumer can consume them. I have to implement back pressure in Kafka for further consumption and processing.
Please let me know how I can implement this in Spark and also with the plain Java API.
Kafka acts as the regulator here. You produce at whatever rate you want to into Kafka, scaling the brokers out to accommodate the ingest rate. You then consume as you want to; Kafka persists the data and tracks the offset of the consumers as they work their way through the data they read.
You can disable auto-commit by setting enable.auto.commit=false on the consumer and committing offsets only after the consumer has finished processing. The consumer may then be slow, but Kafka still knows how many messages it has processed. In addition, use max.poll.records to limit the number of messages returned by each poll and max.poll.interval.ms to allow enough time between polls for processing them.
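As a rough illustration, the three settings named above could be collected like this. The class name and the concrete values (100 records, 5 minutes) are placeholders to tune for your workload, and the deserializer settings a real consumer needs are omitted:

```java
import java.util.Properties;

/** Hypothetical helper: consumer settings for slow, manually committed processing. */
final class SlowConsumerConfig {
    static Properties build(String bootstrapServers, String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("group.id", groupId);
        // Offsets are committed by the application, not on a timer.
        props.put("enable.auto.commit", "false");
        // Cap on the number of records a single poll() may return (placeholder value).
        props.put("max.poll.records", "100");
        // Maximum time allowed between two poll() calls before the consumer
        // is considered failed and its partitions are reassigned (placeholder value).
        props.put("max.poll.interval.ms", "300000");
        return props;
    }
}
```

The poll loop would then process each returned batch completely and call commitSync() on the consumer afterwards, so that a crash mid-batch leads to reprocessing rather than data loss.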
We're running Apache Kafka 0.10.0.x and Spring 3.x and cannot use Spring Kafka, as it is only supported with Spring Framework 4.x.
Therefore, we are using the native Kafka Producer API to produce messages.
My concern is the performance of my producer. I believe a call to producer.send() is what actually makes the connection to the Kafka broker, puts the message onto the buffer, attempts to send it, and then possibly calls the callback provided to producer.send().
Now the KafkaProducer documentation says that it uses a buffer and another I/O thread to perform the send and that they should be closed appropriately so that there is no leakage of resources.
From what I understand, this means that if I have hundreds of messages to send, every invocation of producer.send() attempts to connect to the broker, which is an expensive I/O operation.
Can you please correct my understanding if I am wrong, or suggest a better way to use the KafkaProducer?
The two important configuration parameters of the Kafka producer are 'batch.size' and 'linger.ms'. So you basically have a choice: the producer sends a batch either when it is full or when the linger time expires, whichever comes first.
batch.size: the upper limit, specified in bytes, on how much data the Kafka Producer will attempt to batch before sending.
linger.ms: how long the producer will wait before sending, in order to allow more messages to accumulate in the same batch.
It depends on your use case, but I would suggest taking a closer look at these parameters.
Your understanding is partially right.
As #leshkin pointed out there are configuration parameters to tune how the KafkaProducer will handle buffering of messages to be sent.
However independently from the buffering strategy, the producer will take care of caching established connections to topic-leader brokers.
Indeed you can tune for how long the producer will keep such connection around using the connections.max.idle.ms parameter (defaults to 9 minutes).
So to respond to your original question, the I/O cost of establishing a connection to the broker happens only on the first send invocation and will be amortised over time as long as you have data to send.
Under the following conditions, you need to configure the batch.size, linger.ms and compression.type properties of your Kafka producer to improve performance:
1) Records are arriving faster than the Kafka producer can send them.
2) You have a huge amount of data going to the topic, which is a real burden on your Kafka producer.
3) You have a bottleneck.
batch.size = 16_384 * 4
linger.ms = 200
compression.type = "snappy"
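import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
Properties props = new Properties();
// Allow batches of up to 64 KB (four times the 16 KB default).
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16_384 * 4);
// Wait up to 200 ms for more records before sending a batch.
props.put(ProducerConfig.LINGER_MS_CONFIG, 200);
// Compress each batch with Snappy.
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");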
What is the best way to write an Apache Kafka producer with a steady but adjustable output?
Example: The producer should send a constant 1000 messages/sec to the broker. At runtime the output should be adjustable to, for example, 10 or 10000 messages/sec.
One approach would be to set up a scheduler which runs each second and batch sends the predefined amount of messages.
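A minimal sketch of that scheduler approach using only the JDK. ScheduledBatchSender and the sendOne hook are hypothetical names; in a real producer the hook would wrap a call such as producer.send(record):

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper: sends a configurable number of messages once per second. */
final class ScheduledBatchSender implements AutoCloseable {
    private final AtomicInteger messagesPerSecond;
    private final Runnable sendOne; // assumed hook, e.g. () -> producer.send(record)
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    ScheduledBatchSender(int initialRate, Runnable sendOne) {
        this.messagesPerSecond = new AtomicInteger(initialRate);
        this.sendOne = sendOne;
    }

    /** Changes the rate; takes effect from the next tick. */
    void setRate(int newRate) {
        messagesPerSecond.set(newRate);
    }

    /** Sends one second's worth of messages at the current rate. */
    void tick() {
        int n = messagesPerSecond.get();
        for (int i = 0; i < n; i++) {
            sendOne.run();
        }
    }

    /** Starts calling tick() once per second on a background thread. */
    void start() {
        scheduler.scheduleAtFixedRate(this::tick, 0, 1, TimeUnit.SECONDS);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

Note that this sends each second's messages in one burst rather than spreading them evenly, and that scheduleAtFixedRate stops rescheduling if tick() throws, so the hook should handle its own errors.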
Addition: Since this producer should be part of a performance testing framework the amount of messages that need to be sent is quite high. How would someone handle very high loads? Would it be beneficial to use Akka for that?
Target language is Scala, but example code in any language is very welcome.
In Java this can be implemented by using Guava's RateLimiter in your producer code, which lets you define the rate at which the producer sends messages to the Kafka broker. The rate can be changed at runtime with setRate().
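To show the idea without a Guava dependency, here is a simplified JDK-only stand-in. It is not Guava's implementation, and SimplePacer with its reserve method is a hypothetical name; it just spaces permits evenly and lets the rate be changed while running:

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical, simplified stand-in for a rate limiter: evenly spaced permits. */
final class SimplePacer {
    private long intervalNanos;
    private long nextFreeNanos = Long.MIN_VALUE; // first permit is free immediately

    SimplePacer(double permitsPerSecond) {
        setRate(permitsPerSecond);
    }

    /** Changes the rate; applies to permits reserved after this call. */
    synchronized void setRate(double permitsPerSecond) {
        if (permitsPerSecond <= 0) {
            throw new IllegalArgumentException("rate must be positive");
        }
        this.intervalNanos = (long) (1_000_000_000L / permitsPerSecond);
    }

    /** Reserves the next send slot and returns how long to wait for it, in nanoseconds. */
    synchronized long reserve(long nowNanos) {
        long slot = Math.max(nextFreeNanos, nowNanos);
        nextFreeNanos = slot + intervalNanos;
        return slot - nowNanos;
    }

    /** Blocks until the next slot; call once before each send. */
    void acquire() throws InterruptedException {
        long waitNanos = reserve(System.nanoTime());
        if (waitNanos > 0) {
            TimeUnit.NANOSECONDS.sleep(waitNanos);
        }
    }
}
```

A producer loop would call acquire() before every send, while another thread calls setRate(10) or setRate(10000) to adjust the output. With Guava itself the equivalent calls are RateLimiter.create(1000.0), acquire() and setRate(...).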