What will be the effect of adding compression in kafka broker - apache-kafka

After gzip compression is enabled, will messages stored earlier also get compressed? And when messages are sent to a consumer, is the message content changed, or does Kafka internally decompress it?

If you turn on broker-side compression, existing messages are unchanged; compression applies only to new messages. When consumers fetch the data, it is automatically decompressed, so you don't have to handle it on the consumer side. Just remember that this type of compression can carry a CPU and latency cost.
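
For illustration, broker-side compression is commonly enabled per topic via the compression.type config. A minimal sketch using the Java AdminClient, assuming a local broker and a topic named my-topic (both are assumptions, not from the question):

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.AlterConfigOp;
    import org.apache.kafka.clients.admin.ConfigEntry;
    import org.apache.kafka.common.config.ConfigResource;

    import java.util.Collections;
    import java.util.Properties;

    public class EnableTopicCompression {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

            try (AdminClient admin = AdminClient.create(props)) {
                // compression.type=gzip makes the broker write new batches with gzip;
                // batches already on disk are left untouched.
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
                AlterConfigOp setGzip = new AlterConfigOp(
                        new ConfigEntry("compression.type", "gzip"), AlterConfigOp.OpType.SET);
                admin.incrementalAlterConfigs(
                        Collections.singletonMap(topic, Collections.singletonList(setGzip)))
                     .all().get();
            }
        }
    }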

Related

kafka compaction with compression process

I have a Kafka topic with cleanup.policy=compact, and a producer is producing data with compression type snappy, with batch size and linger ms settings tuned for higher throughput. From what I understand, the message batches are compressed on the producer side before being sent to the broker, and the broker receives and stores the compressed batches. When a consumer reads the topic, the compressed batches are delivered to the client and the decompression happens on the client. There could also be multiple producers to the same topic using different compression types.
When the compaction thread runs, for correct compaction the messages would have to be decompressed on the brokers, and after compaction they would have to be compressed again for efficient delivery to the client. But doing so might give a very uneven distribution of compressed batches depending on the messages received, and what would the compression type be if different batches had different compression types? I could not find an explanation of how exactly compaction works with compression enabled on the producer. Can someone help me understand the process?
Thanks in advance.
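
For reference, the producer setup described above roughly corresponds to a sketch like the following (topic name, broker address, and the exact batch/linger values are assumptions, not taken from the question):

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    import java.util.Properties;

    public class CompactedTopicProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");            // assumed broker address
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());
            props.put("compression.type", "snappy"); // batches are compressed on the producer
            props.put("batch.size", 65536);          // example batch size for throughput
            props.put("linger.ms", 50);              // wait up to 50 ms to fill a batch

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Keys matter on a compacted topic: compaction keeps the latest value per key.
                producer.send(new ProducerRecord<>("compacted-topic", "user-42", "latest state"));
            }
        }
    }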

Remove and add compression in kafka topic. What will happen to the existing data in the topic?

If a topic is set up without compression, and some data already exists in the topic.
Now if the topic is set to use compression, will the existing data be compressed?
In the other direction: if a topic is set with compression, and some data already exists in the topic, will the existing data be decompressed?
This question comes from worries on the data consumer side. When the topic has some data that is compressed and some that is not, is this very messy, or do the brokers know which events in the topic are compressed and which are not, and deliver the right data?
If the existing data does not correspond to the compression setup, I will remove it by configuring a very low retention time. Once the topic is completely clean and has no data, I will then ingest data to ensure every event is either compressed or not compressed.
Both compressed and uncompressed records can coexist in a single topic. The corresponding compression type is stored in each record (each record batch, actually), so the consumer knows how to handle the message.
On the broker side, the broker normally does not care whether a record batch is compressed. Assuming no down-conversion for old-format records occurs, the broker always saves the batch as it is.
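
To illustrate that point, a plain consumer needs no compression-related configuration at all; the codec recorded in each record batch is enough for the client library to decompress transparently. A minimal sketch, assuming a local broker, a group demo-group, and a topic named my-topic:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.serialization.StringDeserializer;

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class MixedCompressionConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
            props.put("group.id", "demo-group");
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());
            // Note: no compression-related setting here. Compressed and uncompressed
            // batches in the same topic are both handled by the client library.

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-topic"));
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println(record.offset() + ": " + record.value());
                }
            }
        }
    }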

Producer side compression in apache kafka

I have enabled snappy compression on the producer side with a batch size of 64 KB, I am processing messages of 1 KB each, and I set the linger time to infinity. Does this mean that until I process 64 messages, the producer won't send the messages to the Kafka output topic?
In other words, will the producer send each message to Kafka, or wait for 64 messages and send them in a single batch?
Because the offsets are increasing one by one rather than in multiples of 64.
Edit - using flink-kafka connectors
Messages are batched by the producer so that network usage is minimized, not so that they are written "as a batch" into Kafka's commit log. What you are seeing is correct behavior by Kafka: each message needs to be accounted for individually, i.e., its key/partition relationship is identified, it is appended to the commit log, and then the offset is incremented. Unless the first two steps are done, the offset is not incremented.
There is also data replication to be taken care of, depending on configuration, and message tracking systems are updated for each message received (to support lag APIs).
Also note that the batch.size parameter considers the ready-to-ship message size, i.e., after the message has been serialized by your serializer of choice and compressed.
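
A small sketch to illustrate the point about offsets (topic name, broker address, and the exact sizes are assumptions): even when records leave the producer in one compressed batch, the returned metadata shows one offset per record.

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.clients.producer.RecordMetadata;
    import org.apache.kafka.common.serialization.StringSerializer;

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Properties;
    import java.util.concurrent.Future;

    public class BatchingOffsetDemo {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");            // assumed broker address
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());
            props.put("compression.type", "snappy");
            props.put("batch.size", 65536);   // 64 KB batches
            props.put("linger.ms", 60000);    // stand-in for "linger = inf": wait up to 60 s

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                List<Future<RecordMetadata>> futures = new ArrayList<>();
                for (int i = 0; i < 64; i++) {
                    futures.add(producer.send(new ProducerRecord<>("demo-topic", "msg-" + i)));
                }
                producer.flush(); // force the (single) batch out before linger.ms expires

                // All 64 records travel in one compressed batch, yet each record
                // still gets its own offset in the log.
                for (Future<RecordMetadata> f : futures) {
                    System.out.println("offset = " + f.get().offset());
                }
            }
        }
    }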

Topic side compression in apache kafka

If we do not specify the compression type on the Kafka producer but specify it on the broker side, how is performance impacted, and on what batch sizes does topic-side compression work?
With the default setting compression.type=producer on the topic/broker, compression happens only if you specify it on the producer side; otherwise data will be stored in uncompressed format on disk. If the topic or broker compression.type is set to a concrete codec (e.g. gzip), the broker will recompress incoming batches with that codec.
Compression increases the effective I/O throughput at the cost of some compression and decompression work on the client side. It also saves disk space, since data is stored in compressed format on the Kafka brokers.
You can keep the batch size up to the maximum message size allowed by the Kafka broker, which is 1 MB by default.
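
If you want to check where compression is (or isn't) configured for a topic and what message size limit applies, a minimal sketch with the Java AdminClient (broker address and topic name are assumptions):

    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.Config;
    import org.apache.kafka.common.config.ConfigResource;

    import java.util.Collections;
    import java.util.Properties;

    public class InspectTopicCompression {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092"); // assumed broker address

            try (AdminClient admin = AdminClient.create(props)) {
                ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
                Config config = admin.describeConfigs(Collections.singleton(topic))
                                     .all().get().get(topic);
                // "producer" means the broker keeps whatever codec the producer used;
                // a concrete codec (gzip, snappy, lz4, zstd) means the broker recompresses.
                System.out.println("compression.type  = " + config.get("compression.type").value());
                System.out.println("max.message.bytes = " + config.get("max.message.bytes").value());
            }
        }
    }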

What does "offsets.topic.compression.codec" in kafka actually do?

Very recently we started to get MessageSizeTooLargeException on the metadata, so we set offsets.topic.compression.codec=1 to enable gzip compression, but the overall bytes-in rate/messages-in rate to the broker hasn't changed. Am I missing something? Is there some other property that needs to be changed?
How does this codec work?
Do we need to add some property on the consumers and producers as well? I have only enabled this on the broker.
offsets.topic.compression.codec applies only to the internal offsets topic (namely __consumer_offsets). It has no effect on user-level topics.
See Kafka: Sending a 15MB message on how to avoid MessageTooLargeException/RecordTooLargeException.