I have a Kafka topic with cleanup.policy=compact, and a producer writing to it with compression.type=snappy plus batch.size and linger.ms tuned for higher throughput. As I understand it, message batches are compressed on the producer side before being sent to the broker, and the broker stores the compressed batches as-is. When a consumer reads the topic, the compressed batches are delivered to the client and decompression happens there. There could also be multiple producers writing to the same topic with different compression types.
When the compaction thread runs, the messages would presumably have to be decompressed on the broker for correct compaction, and after compaction they would have to be compressed again for efficient delivery to the client. But doing so might give a very uneven distribution of compressed batches depending on which messages are retained, and it is unclear what the compression type would be if different batches used different compression types. I could not find an explanation of how exactly compaction works when compression is enabled on the producer. Can someone help me understand the process?
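For reference, a minimal sketch of the producer setup described above; the topic name, bootstrap server, serializers, and exact batch/linger values are assumptions:
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
// Compress each batch with snappy on the producer before it is sent to the broker.
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
// Batch/latency trade-off for higher throughput.
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65_536);
props.put(ProducerConfig.LINGER_MS_CONFIG, 50);

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
// Compacted topics keep the latest value per key, so every record needs a key.
producer.send(new ProducerRecord<>("compacted-topic", "device-42", "latest-state"));
producer.close();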
Thanks in advance.
Related
I tried to create message chunks in a Kafka producer in Node.js using kafka-node but didn't find a solution, so now I am creating a Kafka producer using Java and I need to send large messages that are above 1 MB in size. How can I create chunks of a message in the Kafka producer and consume the same messages?
Kafka has a maximum payload size.
If you need to send larger payloads, but their size is still bounded, you can increase that limit in the broker and producer configuration (message.max.bytes in broker configs and max.request.size in producer configs). 10 MB should still be a reasonable limit.
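As a rough sketch of those settings (the 10 MB value and bootstrap server are assumptions; the broker-side change goes in server.properties, or per topic via max.message.bytes):
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
// Producer side: allow requests of up to ~10 MB (default is ~1 MB).
props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, 10 * 1024 * 1024);
// Broker side (server.properties), set to match: message.max.bytes=10485760
// Per-topic alternative: max.message.bytes=10485760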
LinkedIn maintains (Java) Kafka clients (https://github.com/linkedin/li-apache-kafka-clients) that are capable of fragmenting large messages on the producer and reassembling them on the consumer, but the solution is imperfect:
- it does not work properly with log-compacted Kafka topics
- it has memory overhead on the consumer for reassembly and storage of fragments.
I have enabled snappy compression on the producer side with a batch size of 64 KB, I am processing messages of 1 KB each, and I have set the linger time to infinity. Does this mean that until I have produced 64 messages, the producer won't send them to the output Kafka topic?
In other words, will the producer send each message to Kafka individually, or wait for 64 messages and send them as a single batch?
I ask because the offsets are increasing one by one rather than in multiples of 64.
Edit: I am using the Flink Kafka connectors.
Messages are batched by the producer so that network usage is minimized, not so that they are written "as a batch" into Kafka's commit log. What you are seeing is Kafka working correctly: each message needs to be accounted for individually, i.e. its key/partition relationship is identified, it is appended to the commit log, and then the offset is incremented. The offset is not incremented until the first two steps are done.
There is also data replication to take care of, depending on your configuration, and message-tracking metadata is updated for each message received (to support the lag APIs).
Also note that the batch.size parameter is measured against the ready-to-ship size of the records, i.e. after they have been (1) serialized by your favourite serializer and (2) compressed.
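A quick way to see this is to read the topic back and print the offsets: each record gets its own consecutive offset even when several records travelled in one compressed batch. A rough sketch, assuming a topic named "out-topic" and a local broker:
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "offset-check");
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("out-topic"));
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
    for (ConsumerRecord<String, String> record : records) {
        // Offsets increase by one per record, even for records that arrived in one compressed batch.
        System.out.printf("partition=%d offset=%d%n", record.partition(), record.offset());
    }
}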
If we do not specify the compression type on the Kafka producer, but we do specify it on the broker side, how is performance impacted, and on what batch sizes does topic-side compression operate?
Compression on the wire happens only if you enable it on the producer side. On the broker, the topic-level compression.type defaults to "producer", which means the broker keeps whatever format the producer sent, so uncompressed producer data is stored uncompressed on disk. If you set compression.type on the topic or broker to a specific codec, the broker recompresses incoming batches with that codec before writing them, at the cost of extra broker CPU.
Compression increases I/O throughput in exchange for some compression and decompression cost on the client side. It also saves disk space, since the data is stored in compressed form on the Kafka brokers.
You can keep the batch size up to the maximum message size allowed by the Kafka broker, which is about 1 MB by default (message.max.bytes).
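If you want the broker to own compression, you can set compression.type at the topic level. A rough sketch using the Java AdminClient (the topic name and bootstrap server are assumptions, and incrementalAlterConfigs requires a reasonably recent broker and client):
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetTopicCompression {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
            // "producer" (the default) keeps whatever codec the producer used;
            // a concrete codec such as "gzip" makes the broker recompress new batches.
            AlterConfigOp setCompression = new AlterConfigOp(
                    new ConfigEntry("compression.type", "gzip"), AlterConfigOp.OpType.SET);
            admin.incrementalAlterConfigs(
                    Collections.singletonMap(topic, Collections.singletonList(setCompression)))
                 .all().get();
        }
    }
}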
After enabling gzip compression, will messages stored earlier also get compressed? And when messages are sent to the consumer, is the message content changed, or does Kafka internally decompress it?
If you turn on broker-side compression, existing messages are unchanged; compression applies only to new messages. When consumers fetch the data, it is automatically decompressed, so you don't have to handle it on the consumer side. Just remember that this type of compression potentially comes with a CPU and latency cost.
We're running on Apache Kafka 0.10.0.x and Spring 3.x, and cannot use Spring Kafka as it requires Spring Framework 4.x.
Therefore, we are using the native Kafka Producer API to produce messages.
Now my concern is the performance of my producer. I believe a call to producer.send() is what actually establishes the connection to the Kafka broker, then puts the message into the buffer, then attempts to send it, and then possibly calls the callback method provided to producer.send().
The KafkaProducer documentation says that it uses a buffer and a separate I/O thread to perform the send, and that these should be closed appropriately so that there is no resource leak.
From what I understand, this means that if I have hundreds of messages being sent, then every time I invoke producer.send() it attempts to connect to the broker, which is an expensive I/O operation.
Can you please correct my understanding if I am wrong, or suggest a better way to use the KafkaProducer?
The two important configuration parameters of the Kafka producer here are batch.size and linger.ms. You basically have a choice: a batch is sent either when it is full or when the linger timeout expires, whichever comes first.
batch.size – an upper limit on how much data the Kafka producer will attempt to batch before sending, specified in bytes.
linger.ms – how long the producer will wait before sending, in order to allow more messages to accumulate in the same batch.
It depends on your use case, but I would suggest taking a closer look at these parameters.
Your understanding is partially right.
As @leshkin pointed out, there are configuration parameters to tune how the KafkaProducer handles buffering of the messages to be sent.
However, independently of the buffering strategy, the producer takes care of caching established connections to the topic-leader brokers.
You can tune how long the producer keeps such a connection around using the connections.max.idle.ms parameter (it defaults to 9 minutes).
So to respond to your original question, the I/O cost of establishing a connection to the broker happens only on the first send invocation and will be amortised over time as long as you have data to send.
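To illustrate the amortisation, a rough sketch (broker address, topic, and record values are assumptions): create the KafkaProducer once, reuse it for every send(), and close it only on shutdown; the connections opened on the first sends are cached and reused by the background I/O thread.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
// Keep idle connections around for 9 minutes (the default) before closing them.
props.put(ProducerConfig.CONNECTIONS_MAX_IDLE_MS_CONFIG, 9 * 60 * 1000);

// Create the producer once and reuse it for every send; each send() only enqueues
// the record into the buffer, and the I/O thread sends it over cached connections.
KafkaProducer<String, String> producer = new KafkaProducer<>(props);
for (int i = 0; i < 1000; i++) {
    producer.send(new ProducerRecord<>("my-topic", "key-" + i, "value-" + i),
            (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                }
            });
}
// Close on shutdown so buffered records are flushed and resources are released.
producer.close();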
You should tune the batch.size, linger.ms and compression.type properties of your Kafka producer to increase performance in situations such as the following:
1) records are arriving faster than the producer can send them;
2) you are writing a huge amount of data to the topic, which puts a real burden on the producer;
3) you have throughput bottlenecks.
batch.size = 65536 (16384 * 4)
linger.ms = 200
compression.type = snappy
// Batch up to 64 KB of records per partition before sending.
props.put(ProducerConfig.BATCH_SIZE_CONFIG, 16_384 * 4);
// Wait a little to allow more records to accumulate in the same batch.
props.put(ProducerConfig.LINGER_MS_CONFIG, 200);
// Use snappy compression for batch compression.
props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
Further reading: Kafka performance tuning (DZone).