If we do not set the compression type on the Kafka producer but set it on the broker side, how is performance affected, and what batch sizes does topic-side compression operate on?
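Compression is not limited to the producer. The broker and topic setting compression.type defaults to producer, which keeps whatever codec the producer used, so with an uncompressed producer and default broker settings the data is stored uncompressed on disk. If you set a specific codec (for example gzip) on the broker or topic, the broker compresses incoming batches with that codec itself, at the cost of extra broker CPU.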
Compression increases I/O throughput in exchange for some compression and decompression cost on the client side. It also saves disk space, because data is stored in compressed form on the Kafka brokers.
You can raise the batch size up to the broker's maximum allowed message size, which is about 1 MB by default (message.max.bytes).
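To see why batch size matters for compression, here is a minimal sketch using only the Python standard library. It is not Kafka code: zlib stands in for the producer codec, and the JSON events are made up for illustration. It only shows the general effect that compressing many similar messages together gives a much better ratio than compressing each one alone, which is why Kafka compresses whole batches.

```python
import json
import zlib


def make_messages(count):
    # Hypothetical JSON events with repeated field names, as in many real payloads.
    return [
        json.dumps(
            {"user_id": i, "event": "page_view", "path": "/products/%d" % (i % 10)}
        ).encode("utf-8")
        for i in range(count)
    ]


def compressed_one_by_one(messages):
    # Total bytes when every message is compressed on its own.
    return sum(len(zlib.compress(m)) for m in messages)


def compressed_as_batch(messages):
    # Total bytes when all messages are compressed together as one batch.
    return len(zlib.compress(b"".join(messages)))


messages = make_messages(500)
raw_size = sum(len(m) for m in messages)
single_size = compressed_one_by_one(messages)
batch_size = compressed_as_batch(messages)

print("raw bytes:              ", raw_size)
print("compressed individually:", single_size)
print("compressed as one batch:", batch_size)
```

The exact numbers depend on the payload and codec, so measure with your own data before tuning batch.size and linger.ms.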
Related
When doing stateful processing in Kafka Streams we can hold large state. We can provision more disk space for the client as the data grows. But what about the changelog topic? The local state is backed up in this compacted topic. Are there any limits on how much data we can store in this topic?
We have not encountered any issues yet, but I see that some cloud services do limit the size of a compacted topic. Is this a Kafka limitation? And if yes, do these limits also apply to non-compacted topics?
Infinite retention of any topic log segments can be achieved by setting
log.retention.bytes = -1
log.retention.hours = -1
This option has been available since version 0.9.0.0, which indicates a mature feature in Kafka.
However, many argue that Kafka was not designed to be used as permanent storage. As the amount of data stored in Kafka grows, users eventually hit a “retention cliff,” at which point it becomes significantly more expensive to store, manage, and retrieve data. Infrastructure costs also rise, because the longer the retention period, the more hardware is required.
Having said that, people do use Kafka for persistent storage. For example, The New York Times uses Kafka as a source of truth, storing 160 years of journalism going back to the 1850s.
I would suggest using a small message size if you decide to use Kafka as a System Of Record (SOR) and to hold the state of an entity.
Kafka makes it very clear that its performance is greatly based on the event/message size, so there is a size limit on them.
Kafka has a default limit of 1MB per message in the topic. This is because very large messages are considered inefficient and an anti-pattern in Apache Kafka.
more for handling larger messages here.
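In the setup described here, each Kafka topic partition log starts at a minimum size of 20MB and grows to a maximum size of 100MB on disk before a new log file is created. These figures depend on the deployment: in stock Apache Kafka the segment size is controlled by log.segment.bytes, which defaults to 1 GiB. It's possible to have multiple log files in a partition at any one time.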
I have a Kafka topic with cleanup.policy=compact, and a producer is writing to it with compression type snappy and with batch.size and linger.ms tuned for higher throughput. From what I understand, the message batches are compressed on the producer side before being sent to the broker, and the broker receives and stores the compressed messages. When a consumer reads the topic, the compressed batches are delivered to the client and decompression happens at the client. There could also be multiple producers writing to the same topic with different compression types.
When the compaction thread runs, the messages would have to be decompressed on the brokers for compaction to be correct, and after compaction they would have to be compressed again for efficient delivery to the client. But doing so might give a very uneven distribution of compressed batches depending on the messages received. And which compression type would be used if different batches had different compression types? I could not find an explanation of how exactly compaction works with compression enabled on the producer. Can someone help me understand the process?
Thanks in advance.
I have enabled snappy compression on the producer side with a batch size of 64 KB. I am processing messages of 1 KB each and have set the linger time to infinity. Does this mean that until I have processed 64 messages, the producer won't send them to the Kafka output topic?
In other words, will the producer send each message to Kafka individually, or wait for 64 messages and send them in a single batch?
I ask because the offsets are increasing one by one rather than in multiples of 64.
Edit - using flink-kafka connectors
The producer batches messages to minimize network usage; batching does not change how offsets are assigned. What you are seeing is correct, because every message is accounted for individually: its key/partition relationship is identified, it is appended to the commit log, and only then is the offset incremented. Offsets therefore advance one per message rather than one per batch.
There is also data replication to take care of, depending on configuration, and message-tracking state is updated for each message received (to support the lag APIs).
Also note that batch.size is a size in bytes, not a message count, and it is measured against the ready-to-ship data, that is, after your serializer has run and compression has been applied.
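The relationship between batches and offsets can be sketched with a small simulation. This is a toy model, not the Kafka producer: it assumes batches close only when the byte limit is reached (as with a very large linger time), and it ignores per-record overhead and compression, so a real producer will not fit exactly 64 one-kilobyte messages into 64 KB. The function names are made up for illustration.

```python
def simulate_batches(record_sizes, batch_size_bytes):
    """Group records into batches by accumulated bytes (size-only trigger)."""
    batches, current, current_bytes = [], [], 0
    for size in record_sizes:
        # Close the current batch if adding this record would exceed the limit.
        if current and current_bytes + size > batch_size_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        batches.append(current)
    return batches


def assign_offsets(batches, start_offset=0):
    """Give every record its own offset, regardless of which batch carried it."""
    offsets, next_offset = [], start_offset
    for batch in batches:
        for _ in batch:
            offsets.append(next_offset)
            next_offset += 1
    return offsets


records = [1024] * 200                      # 200 messages of 1 KB each
batches = simulate_batches(records, 64 * 1024)
offsets = assign_offsets(batches)

print("batches sent:", len(batches))
print("records in first batch:", len(batches[0]))
print("first offsets:", offsets[:5])
```

In this model the 200 messages travel in 4 network batches, yet the offsets still run 0, 1, 2, ... one per message.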
I have a question about Kafka's disk.
Kafka will fail when its disk becomes full.
So I want to reduce the disk usage to less than x% by discarding the old data stored on the Kafka disk (or discarding a copy of the data) when the Kafka disk usage reaches x%. Do I need to modify the Kafka source code to do this?
You can configure retention.bytes for your topics.
This configuration controls the maximum size a partition (which consists of log segments) can grow to before we will discard old log segments to free up space if we are using the "delete" retention policy. By default there is no size limit only a time limit. Since this limit is enforced at the partition level, multiply it by the number of partitions to compute the topic retention in bytes.
See https://kafka.apache.org/documentation/#topicconfigs
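Because retention.bytes is per partition, a bit of arithmetic is needed to map it to a disk budget. The sketch below is only that arithmetic, with hypothetical helper names. It also multiplies by the replication factor, which is my assumption about how you would size the whole cluster and is not part of the quoted documentation. Cleanup removes whole log segments, so actual usage can overshoot the limit by roughly one segment per partition; leave some headroom.

```python
GIB = 1024 ** 3


def topic_retention_bytes(retention_bytes_per_partition, partitions):
    # retention.bytes applies to each partition, so a topic keeps up to
    # this many bytes per replica set.
    return retention_bytes_per_partition * partitions


def cluster_disk_bytes(retention_bytes_per_partition, partitions, replication_factor):
    # Assumption: every replica stores its own full copy of each partition.
    return topic_retention_bytes(retention_bytes_per_partition, partitions) * replication_factor


def retention_bytes_for_budget(disk_budget_bytes, partitions, replication_factor):
    # Largest per-partition retention.bytes that keeps the topic within the budget.
    return disk_budget_bytes // (partitions * replication_factor)


topic_bytes = topic_retention_bytes(1 * GIB, partitions=12)
cluster_bytes = cluster_disk_bytes(1 * GIB, partitions=12, replication_factor=3)
per_partition = retention_bytes_for_budget(90 * GIB, partitions=12, replication_factor=3)

print("topic retention (GiB):", topic_bytes / GIB)
print("cluster disk needed (GiB):", cluster_bytes / GIB)
print("retention.bytes for a 90 GiB budget:", per_partition)
```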
After enabling gzip compression, will messages stored earlier also get compressed? And when messages are sent to a consumer, is the message content changed, or does Kafka decompress it internally?
If you turn on broker-side compression, existing messages are unchanged; compression applies only to new messages. When consumers fetch the data, it is decompressed automatically, so you don't have to handle it on the consumer side. Just remember that this type of compression can add CPU and latency cost.
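A toy model of this behaviour, using the Python standard library rather than Kafka itself: each stored batch records which codec it was written with, so old uncompressed data and new gzip data can sit side by side, and the reader decompresses each batch according to its own codec. The log structure and function names here are invented for illustration.

```python
import gzip


def store(log, payload, codec):
    """Append one batch to the log, compressing it if a codec is set."""
    data = gzip.compress(payload) if codec == "gzip" else payload
    log.append({"codec": codec, "data": data})


def fetch(log):
    """Return the original payloads, decompressing each batch by its own codec."""
    return [
        gzip.decompress(entry["data"]) if entry["codec"] == "gzip" else entry["data"]
        for entry in log
    ]


log = []
old_payload = b"order-created " * 50
new_payload = b"order-shipped " * 50

store(log, old_payload, "none")   # written before compression was enabled
store(log, new_payload, "gzip")   # written after compression was enabled

payloads = fetch(log)
print("old batch stored as-is:", log[0]["data"] == old_payload)
print("new batch stored smaller:", len(log[1]["data"]) < len(new_payload))
print("consumer sees original content:", payloads == [old_payload, new_payload])
```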