Producer-side compression in Apache Kafka

I have enabled Snappy compression on the producer side with a batch size of 64 KB. Messages are 1 KB each and the linger time is set to infinity. Does this mean the producer won't send messages to the Kafka output topic until I have processed 64 messages?
In other words, will the producer send each message to Kafka individually, or wait for 64 messages and send them in a single batch?
I ask because the offsets are increasing one by one rather than in multiples of 64.
Edit: I am using the Flink Kafka connectors.

The producer batches messages to minimize network usage, not so that they are written "as a batch" into Kafka's commit log. What you are seeing is correct behavior: each message has to be accounted for individually, i.e. its key/partition relationship is identified, it is appended to the commit log, and then the offset is incremented. The offset is not incremented until the first two steps are done.
There is also data replication to take care of, depending on configuration, and the message tracking systems are updated for each message received (to support the lag APIs).
Also note that batch.size is measured against the size of the ready-to-ship messages, i.e. after they have been (1) serialized by your serializer of choice and (2) compressed.
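To make the distinction concrete, here is a toy model in plain Python. It is not the Kafka client, and ToyPartitionLog and append_batch are made-up names for illustration only. It just shows that one request can carry 64 records while the log still hands out one offset per record.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ToyPartitionLog:
    """Toy stand-in for one partition's log; not a Kafka API."""
    records: List[bytes] = field(default_factory=list)
    requests_received: int = 0

    def append_batch(self, batch: List[bytes]) -> List[int]:
        """Append every record of one request and return the assigned offsets."""
        self.requests_received += 1
        offsets = []
        for record in batch:
            offsets.append(len(self.records))  # next offset = current log length
            self.records.append(record)
        return offsets


log = ToyPartitionLog()
batch = [b"x" * 1024 for _ in range(64)]  # 64 messages of 1 KB each
offsets = log.append_batch(batch)

print("requests on the wire:", log.requests_received)
print("first and last offset:", offsets[0], offsets[-1])
```

So a consumer watching offsets cannot tell from the numbering alone whether the records arrived in one request or in 64.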

Related

kafka compaction with compression process

I have a Kafka topic with cleanup.policy=compact, and a producer writes to it with compression type snappy, using batch size and linger.ms settings tuned for higher throughput. From what I understand, message batches are compressed on the producer side before being sent to the broker, and the broker receives and stores the compressed messages. When a consumer reads the topic, the compressed batches are delivered to the client and decompression happens there. There could also be multiple producers writing to the same topic with different compression types.
When the compaction thread runs, the messages would have to be decompressed on the brokers for compaction to be correct, and afterwards they would have to be compressed again for efficient delivery to the client. But doing so might give a very uneven distribution of compressed batches depending on the messages received, and it is unclear what the compression type would be if different batches used different compression types. I could not find an explanation of how exactly compaction works when compression is enabled on the producer. Can someone help me understand the process?
Thanks in advance.

What atomicity guarantees - if any - does Kafka have regarding batch writes?

We're now moving one of our services from pushing data through legacy communication tech to Apache Kafka.
The current logic is to send a message to IBM MQ and retry if errors occur. I want to do the same with Kafka, but I don't know what guarantees the broker provides in that scenario.
Let's say I send 100 messages in a batch via the producer in the Java client library. Assuming it reaches the cluster, is there a possibility that only part of it is accepted (e.g. a disk is full, or some partitions I touch in my write are under-replicated)? Can I detect that problem from my producer and retry only those messages that weren't accepted?
I searched for "kafka atomicity guarantee" but came up empty; maybe there's a well-known term for it.
When you say you send 100 messages in one batch, do you mean that you want to control this number of messages yourself, or are you OK with letting the producer batch a certain number of messages and then send the batch?
I am not sure you can control the number of messages in one producer batch: the API will queue them and batch them for you, but without a guarantee of batching them all together (I'll check that, though).
If you're OK with letting the API batch messages for you, here are some clues about how they are acknowledged.
On the producer side, Kafka offers some reliability guarantees for writes (including "batch writes").
As stated in this SlideShare deck:
https://www.slideshare.net/miguno/apache-kafka-08-basic-training-verisign (83)
The original list of messages is partitioned (randomly if the default partitioner is used) based on their destination partitions/topics, i.e. split into smaller batches.
Each post-split batch is sent to the respective leader broker/ISR (the individual send()’s happen sequentially), and each is acked by its respective leader broker according to request.required.acks
So, regarding atomicity: given the behavior above, I am not sure the whole batch will be treated as atomic. Maybe you can send your batch of messages using the same key for each message, so that they all go to the same partition and thus maybe become atomic.
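As a rough sketch of the splitting described in the slides, here is a toy model in plain Python (not the Kafka client). partition_for, split_by_partition, send_all and flaky_send are hypothetical names, and the partitioner is a simple stand-in rather than Kafka's real hash. It shows how one logical batch becomes several per-partition batches that succeed or fail independently, so only the failed ones need a retry.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Message = Tuple[str, str]  # (key, value)


def partition_for(key: str, num_partitions: int) -> int:
    """Hypothetical partitioner for illustration; Kafka's default uses a different hash."""
    return sum(key.encode("utf-8")) % num_partitions


def split_by_partition(messages: List[Message],
                       num_partitions: int) -> Dict[int, List[Message]]:
    """Group messages into one batch per destination partition."""
    batches: Dict[int, List[Message]] = defaultdict(list)
    for key, value in messages:
        batches[partition_for(key, num_partitions)].append((key, value))
    return dict(batches)


def send_all(messages: List[Message], num_partitions: int,
             send_batch: Callable[[int, List[Message]], bool]) -> List[Message]:
    """Send each per-partition batch; return the messages that were not acked."""
    failed: List[Message] = []
    for partition, batch in sorted(split_by_partition(messages, num_partitions).items()):
        if not send_batch(partition, batch):
            failed.extend(batch)
    return failed


def flaky_send(partition: int, batch: List[Message]) -> bool:
    """Pretend the leader for partition 1 rejects writes (assumption for the demo)."""
    return partition != 1


messages = [("a", "m1"), ("b", "m2"), ("c", "m3"), ("a", "m4")]
to_retry = send_all(messages, num_partitions=2, send_batch=flaky_send)
print(to_retry)
```

With the real Java client you do not get a per-batch result like this; as far as I know each send() call reports success or failure for its own record through the returned Future or the callback, which is what you would hook your retry logic into.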
If you need more clarity about the acknowledgment rules when producing, here is how it works, as stated at https://docs.confluent.io/current/clients/producer.html :
You can control the durability of messages written to Kafka through the acks setting.
The default value of "1" requires an explicit acknowledgement from the partition leader that the write succeeded.
The strongest guarantee that Kafka provides is with "acks=all", which guarantees that not only did the partition leader accept the write, but it was successfully replicated to all of the in-sync replicas.
You can also look into the producer's enable.idempotence setting if you aim to avoid duplicates while producing.
Yannick

When will the producer trigger sending a request?

If I send just one record on the producer side and wait, when will the producer send the record to the broker?
In the Kafka docs, I found the config called "linger.ms", and it says:
once we get batch.size worth of records for a partition it will be
sent immediately regardless of this setting, however if we have fewer
than this many bytes accumulated for this partition we will 'linger'
for the specified time waiting for more records to show up.
Based on the docs above, I have two questions.
If the producer receives data whose size reaches batch.size, will it immediately trigger a request to the broker that contains only one batch? As we know, one request can contain many batches, so how does that happen?
Does it mean that even if the received data has not reached batch.size, the producer will still send a request to the broker after waiting linger.ms?
In Kafka, the smallest unit of sending is a record (a key-value pair).
The Kafka producer attempts to send records in batches in order to optimize data transmission. So a single push from the producer to the cluster (to the partition leader's broker, to be precise) could contain multiple records.
Moreover, batching always applies only to a given partition. Records produced to different partitions cannot be batched together, though they could form multiple batches.
There are a few parameters which influence the batching behaviour, as described in the documentation:
buffer.memory -
The total bytes of memory the producer can use to buffer records
waiting to be sent to the server. If records are sent faster than they
can be delivered to the server the producer will block for
max.block.ms after which it will throw an exception.
batch.size -
The producer will attempt to batch records together into fewer
requests whenever multiple records are being sent to the same
partition. This helps performance on both the client and the server.
This configuration controls the default batch size in bytes. No
attempt will be made to batch records larger than this size.
Requests sent to brokers will contain multiple batches, one for each
partition with data available to be sent.
linger.ms -
The producer groups together any records that arrive in between
request transmissions into a single batched request. Normally this
occurs only under load when records arrive faster than they can be
sent out. However in some circumstances the client may want to reduce
the number of requests even under moderate load. This setting
accomplishes this by adding a small amount of artificial delay—that
is, rather than immediately sending out a record the producer will
wait for up to the given delay to allow other records to be sent so
that the sends can be batched together. This can be thought of as
analogous to Nagle's algorithm in TCP. This setting gives the upper
bound on the delay for batching: once we get batch.size worth of
records for a partition it will be sent immediately regardless of this
setting, however if we have fewer than this many bytes accumulated for
this partition we will 'linger' for the specified time waiting for
more records to show up. This setting defaults to 0 (i.e. no delay).
Setting linger.ms=5, for example, would have the effect of reducing
the number of requests sent but would add up to 5ms of latency to
records sent in the absence of load.
So, from the documentation above: linger.ms is an artificial delay that the producer waits when there are not yet enough bytes to transmit, but if the producer accumulates batch.size bytes before linger.ms has elapsed, the request is sent anyway.
On top of that, batching is also influenced by max.request.size:
max.request.size -
The maximum size of a request in bytes. This setting will limit the
number of record batches the producer will send in a single request to
avoid sending huge requests. This is also effectively a cap on the
maximum record batch size. Note that the server has its own cap on
record batch size which may be different from this.
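To tie the settings together, here is a simplified model in plain Python of the decision described above. It is not the producer's actual code: PendingBatch, is_ready and build_request are made-up names, and it assumes all partitions live on the same broker. A batch is ready when it is full or has lingered long enough, and one request collects the ready batches (one per partition) up to max.request.size.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PendingBatch:
    """Bytes accumulated so far for one partition (toy model, not a Kafka class)."""
    partition: int
    size_bytes: int
    waited_ms: float


def is_ready(batch: PendingBatch, batch_size: int, linger_ms: float) -> bool:
    # Ready when the batch is full OR it has lingered long enough.
    return batch.size_bytes >= batch_size or batch.waited_ms >= linger_ms


def build_request(batches: List[PendingBatch], batch_size: int,
                  linger_ms: float, max_request_size: int) -> List[int]:
    """Return the partitions whose batches would go into the next request."""
    chosen: List[int] = []
    total = 0
    for batch in batches:
        if not is_ready(batch, batch_size, linger_ms):
            continue
        # Stop adding batches once the request would exceed max.request.size,
        # but always let the first ready batch through.
        if chosen and total + batch.size_bytes > max_request_size:
            break
        chosen.append(batch.partition)
        total += batch.size_bytes
    return chosen


pending = [
    PendingBatch(partition=0, size_bytes=16384, waited_ms=0.0),  # full
    PendingBatch(partition=1, size_bytes=100, waited_ms=2.0),    # still lingering
    PendingBatch(partition=2, size_bytes=100, waited_ms=6.0),    # linger expired
]
request_partitions = build_request(pending, batch_size=16384, linger_ms=5,
                                   max_request_size=1_048_576)
print(request_partitions)
```

This also covers the single-record case from the question: with linger.ms at its default of 0, a lone record counts as ready immediately, so it is sent without waiting for the batch to fill.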

How to set Kafka Producer message rate per second?

I am reading a CSV file and giving the rows of this input to my Kafka producer. Now I want my Kafka producer to produce messages at a rate of 100 messages per second.
Take a look at linger.ms and batch.size properties of Kafka Producer.
You have to adjust these properties correspondingly to get desired rate.
The producer groups together any records that arrive in between request transmissions into a single batched request. Normally this occurs only under load when records arrive faster than they can be sent out. However in some circumstances the client may want to reduce the number of requests even under moderate load. This setting accomplishes this by adding a small amount of artificial delay; that is, rather than immediately sending out a record the producer will wait for up to the given delay to allow other records to be sent so that the sends can be batched together. This can be thought of as analogous to Nagle's algorithm in TCP. This setting gives the upper bound on the delay for batching: once we get batch.size worth of records for a partition it will be sent immediately regardless of this setting, however if we have fewer than this many bytes accumulated for this partition we will 'linger' for the specified time waiting for more records to show up. This setting defaults to 0 (i.e. no delay). Setting linger.ms=5, for example, would have the effect of reducing the number of requests sent but would add up to 5ms of latency to records sent in the absence of load.
If you like stream processing then akka-streams has nice support for throttling: http://doc.akka.io/docs/akka/current/java/stream/stream-quickstart.html#time-based-processing
Then the akka-stream-kafka (aka reactive-kafka) library allows you to connect the two together: http://doc.akka.io/docs/akka-stream-kafka/current/home.html
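If you don't want to pull in a streaming library, the same throttling idea can be sketched by hand on the application side. This is plain Python with the standard library only; send_throttled is a made-up helper, and the send argument stands in for whatever call hands a row to your producer (an assumption, not a Kafka API). It paces calls so that at most rate_per_sec rows are handed over per second.

```python
import time
from typing import Callable, Iterable


def send_throttled(rows: Iterable, send: Callable[[object], None],
                   rate_per_sec: float,
                   clock: Callable[[], float] = time.monotonic,
                   sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send(row) for each row, spacing calls 1/rate_per_sec seconds apart."""
    interval = 1.0 / rate_per_sec
    next_slot = clock()  # earliest time the next row may be sent
    sent = 0
    for row in rows:
        delay = next_slot - clock()
        if delay > 0:
            sleep(delay)  # wait until this row's slot
        send(row)
        sent += 1
        next_slot += interval
    return sent


if __name__ == "__main__":
    # Demo: 5 rows at 100 rows/sec, using print in place of a real producer.
    send_throttled(["row-%d" % i for i in range(5)], print, rate_per_sec=100)
```

Note this only controls how fast rows are handed to the producer; the producer may still group them into fewer requests depending on linger.ms and batch.size.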
In the Kafka JVM producer, throughput depends on multiple factors, and it is most commonly measured in MB/sec rather than messages/sec. In your example, if each row of your CSV were 1 MB in size, you would need to tune your producer configs to achieve 100 MB/sec in order to reach your target of 100 messages/sec. While tuning, consider your batch.size value (measured in bytes). If it is set too low, the producer will send requests more often, each carrying fewer messages, and wait for replies from the server. This can lower latency, but it tends to reduce throughput. If you are using the asynchronous, callback-based producer, your overall throughput is also limited by how many requests the producer can send before waiting for a reply from the server, which is determined by max.in.flight.requests.per.connection.
If you set batch.size too high, throughput can also be affected, since after waiting for the linger.ms period the Kafka producer sends all the messages in the batch for that particular partition to the broker at once. A bigger batch.size also calls for a bigger buffer.memory, which might put pressure on the GC.

kafka "stops working" after a large message is enqueued

I'm running kafka_2.11-0.9.0.0 and a java-based producer/consumer. With messages ~70 KB everything works fine. However, after the producer enqueues a larger, 70 MB message, kafka appears to stop delivering the messages to the consumer. I.e. not only is the large message not delivered but also subsequent smaller messages. I know the producer succeeds because I used kafka callback for the confirmation and I can see the messages in the kafka message log.
kafka config custom changes:
message.max.bytes=200000000
replica.fetch.max.bytes=200000000
consumer config:
props.put("fetch.message.max.bytes", "200000000");
props.put("max.partition.fetch.bytes", "200000000");
You need to increase the size of the messages the consumer can consume so it doesn't get stuck trying to read a message that is too big.
max.partition.fetch.bytes (default value is 1048576 bytes)
The maximum amount of data per-partition the server will return. The
maximum total memory used for a request will be #partitions *
max.partition.fetch.bytes. This size must be at least as large as the
maximum message size the server allows or else it is possible for the
producer to send messages larger than the consumer can fetch. If that
happens, the consumer can get stuck trying to fetch a large message on
a certain partition.
What helped was upping the java heap size.
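A back-of-the-envelope check in plain Python may explain why the heap mattered. check_size_limits and worst_case_fetch_bytes are made-up helpers, the config values are the ones from the question, and the partition count of 4 is an assumption purely for illustration. The first function checks the size relationships from the quoted docs; the second applies the "#partitions * max.partition.fetch.bytes" formula.

```python
from typing import Dict, List


def check_size_limits(broker: Dict[str, int], consumer: Dict[str, int]) -> List[str]:
    """Return a list of size settings that are smaller than message.max.bytes."""
    problems: List[str] = []
    limit = broker["message.max.bytes"]
    if broker["replica.fetch.max.bytes"] < limit:
        problems.append("replica.fetch.max.bytes < message.max.bytes")
    if consumer["max.partition.fetch.bytes"] < limit:
        problems.append("max.partition.fetch.bytes < message.max.bytes")
    return problems


def worst_case_fetch_bytes(num_partitions: int, max_partition_fetch_bytes: int) -> int:
    """Upper bound on memory for one fetch, per the quoted documentation."""
    return num_partitions * max_partition_fetch_bytes


broker_cfg = {"message.max.bytes": 200_000_000, "replica.fetch.max.bytes": 200_000_000}
consumer_cfg = {"max.partition.fetch.bytes": 200_000_000}

problems = check_size_limits(broker_cfg, consumer_cfg)
# Assumption: the consumer reads 4 partitions; the question does not say how many.
needed = worst_case_fetch_bytes(4, consumer_cfg["max.partition.fetch.bytes"])

print("size problems:", problems)
print("worst-case fetch memory (bytes):", needed)
```

With the settings from the question the size limits are consistent, but the worst-case fetch can need hundreds of megabytes, which fits with the observation that a bigger heap was what fixed it.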