Kafka keeps data in a buffer as per buffer.memory (32 MB in my case). Does Kafka write records to the topic only once the 32 MB limit is reached, or is there a time component involved as well?
From the Kafka docs, buffer.memory only specifies the limit on the buffer a producer can use. Setting this property does not make the producer wait until the buffer is full before sending records to the server.
buffer.memory
The total bytes of memory the producer can use to buffer records waiting to be sent to the server. If records are sent faster than they can be delivered to the server the producer will block for max.block.ms after which it will throw an exception.
This setting should correspond roughly to the total memory the producer will use, but is not a hard bound since not all memory the producer uses is used for buffering. Some additional memory will be used for compression (if compression is enabled) as well as for maintaining in-flight requests.
If you want the producer to wait so that it can batch some records in the buffer, you can use the linger.ms property to make it wait. But as far as I know, there is no strict way to make the producer send records only when the buffer is full.
KafkaProducer
By default a buffer is available to send immediately even if there is additional unused space in the buffer. However, if you want to reduce the number of requests you can set linger.ms to something greater than 0. This will instruct the producer to wait up to that number of milliseconds before sending a request in hope that more records will arrive to fill up the same batch.
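For illustration, here is a minimal sketch of such a producer (the broker address, topic name, and the 10 ms linger value are assumptions, not anything from the question):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class LingerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, "33554432"); // 32 MB buffer (the default)
        props.put(ProducerConfig.LINGER_MS_CONFIG, "10");           // wait up to 10 ms for more records

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records sent within the same ~10 ms window to the same partition can be
            // grouped into one batch; the producer never waits for the whole 32 MB
            // buffer to fill before sending.
            producer.send(new ProducerRecord<>("my-topic", "key", "value"));
        }
    }
}
```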
Related
If I send just one record on the producer side and wait, when will the producer send that record to the broker?
In the Kafka docs, I found the config called "linger.ms", and it says:
once we get batch.size worth of records for a partition it will be sent immediately regardless of this setting, however if we have fewer than this many bytes accumulated for this partition we will 'linger' for the specified time waiting for more records to show up.
According to the above docs, I have two questions.
If the producer receives data whose size reaches batch.size, will it immediately send a request that contains only one batch to the broker? But as we know, one request can contain many batches, so how does that happen?
Does it mean that even if the received data does not add up to batch.size, a request will still be sent to the broker after waiting linger.ms?
In Kafka, the lowest unit of sending is a record (a KV pair).
The Kafka producer attempts to send records in batches in order to optimize data transmission, so a single push from the producer to the cluster -- to the leader broker, to be precise -- can contain multiple records.
Moreover, batching always applies only to a given partition. Records produced to different partitions cannot be batched together, though they could form multiple batches.
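As a hedged illustration of the per-partition batching (the topic name, keys, and the pre-configured producer below are hypothetical):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PartitionBatching {
    // Assumes a configured producer and a multi-partition topic named "orders" (made up).
    static void sendTwo(KafkaProducer<String, String> producer) {
        producer.send(new ProducerRecord<>("orders", "customer-1", "order-a"));
        producer.send(new ProducerRecord<>("orders", "customer-2", "order-b"));
        // If the two keys hash to different partitions, the records land in two
        // different batches, even though both batches may still ride in one
        // request to a broker that leads both partitions.
    }
}
```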
There are a few parameters which influence the batching behaviour, as described in the documentation:
buffer.memory -
The total bytes of memory the producer can use to buffer records
waiting to be sent to the server. If records are sent faster than they
can be delivered to the server the producer will block for
max.block.ms after which it will throw an exception.
batch.size -
The producer will attempt to batch records together into fewer
requests whenever multiple records are being sent to the same
partition. This helps performance on both the client and the server.
This configuration controls the default batch size in bytes. No
attempt will be made to batch records larger than this size.
Requests sent to brokers will contain multiple batches, one for each
partition with data available to be sent.
linger.ms -
The producer groups together any records that arrive in between
request transmissions into a single batched request. Normally this
occurs only under load when records arrive faster than they can be
sent out. However in some circumstances the client may want to reduce
the number of requests even under moderate load. This setting
accomplishes this by adding a small amount of artificial delay—that
is, rather than immediately sending out a record the producer will
wait for up to the given delay to allow other records to be sent so
that the sends can be batched together. This can be thought of as
analogous to Nagle's algorithm in TCP. This setting gives the upper
bound on the delay for batching: once we get batch.size worth of
records for a partition it will be sent immediately regardless of this
setting, however if we have fewer than this many bytes accumulated for
this partition we will 'linger' for the specified time waiting for
more records to show up. This setting defaults to 0 (i.e. no delay).
Setting linger.ms=5, for example, would have the effect of reducing
the number of requests sent but would add up to 5ms of latency to
records sent in the absence of load.
So, from the above documentation you can see that linger.ms is an artificial delay applied when there are not enough bytes to transmit; if the producer accumulates enough bytes before linger.ms has elapsed, the request is sent anyway.
On top of that, batching is also influenced by max.request.size
max.request.size -
The maximum size of a request in bytes. This setting will limit the
number of record batches the producer will send in a single request to
avoid sending huge requests. This is also effectively a cap on the
maximum record batch size. Note that the server has its own cap on
record batch size which may be different from this.
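Putting those settings together, here is a rough sketch of a producer configuration (the broker address is an assumption; the values shown are simply the defaults plus a small linger):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class BatchingConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "16384");         // per-partition batch limit (default)
        props.put(ProducerConfig.LINGER_MS_CONFIG, "5");              // wait up to 5 ms to fill a batch
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, "33554432");   // total buffering memory (default)
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, "1048576"); // cap on one request/batch (default)

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            // A batch is sent as soon as it reaches 16 KB for a partition, or after
            // 5 ms, whichever comes first; one request to a broker can carry several
            // such batches, up to roughly 1 MB in total.
        }
    }
}
```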
The average message size is small, but sizes vary.
Average message size: 1KBytes
1 MB messages arrive at an arbitrary rate, so the producer's max.request.size = 1 MB
broker's message.max.bytes = 2 MB
My questions.
To avoid produce-size errors, does the user have to set batch.size to at most 2 MB?
Or does the producer library decide the batch size automatically to avoid the error (even if the user sets a large batch.size)?
Thanks.
Below are the definitions of the related configs in question.
Producer config
batch.size : The producer will attempt to batch records until the batch reaches batch.size before sending it to Kafka (assuming the batch fills up before linger.ms expires). Default - 16384 bytes
max.request.size : The maximum size of a request in bytes. This setting will limit the number of record batches the producer will send in a single request to avoid sending huge requests. This is also effectively a cap on the maximum record batch size. Default - 1048576 bytes
Broker config
message.max.bytes : The largest record batch size allowed by Kafka. Default - 1000012 bytes
replica.fetch.max.bytes : The number of bytes the broker's replica fetchers attempt to fetch per partition when replicating messages within the cluster; it must be large enough for the biggest accepted batches so that messages are replicated correctly.
To answer your questions
To avoid producer send errors, you don't need to set batch.size to 2 MB, as that would delay the transmission of your small messages. You can keep batch.size in line with the average message size and with how much you want to batch.
If you don't specify batch.size, it takes the default value, which is 16384 bytes.
So basically you will have to configure the producer's 'max.request.size' >= 2 MB, and the broker's 'message.max.bytes' and 'replica.fetch.max.bytes' >= 2 MB.
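A hedged sketch of what that could look like for the 2 MB case (the broker address and serializers are assumptions; the broker-side properties are shown as comments):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class LargeMessageProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        props.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, String.valueOf(2 * 1024 * 1024)); // >= largest message
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "16384"); // sized for the ~1 KB average, not the 1 MB outliers

        // Broker side (server.properties), to match:
        //   message.max.bytes=2097152
        //   replica.fetch.max.bytes=2097152
        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props)) {
            // With these settings, a ~2 MB record should no longer be rejected for its size.
        }
    }
}
```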
This query arises because there are various settings available around batching. Let me attempt to make them clear:
Kafka Setting: message.max.bytes and fetch.max.bytes
The Kafka broker limits the maximum size (total size of messages in a batch, if messages are published in batches) of a message that can be produced, configured by the cluster-wide property message.max.bytes (defaults to 1 MB). A producer that tries to send a message larger than this will receive an error back from the broker, and the message will not be accepted. As with all byte sizes specified on the broker, this configuration deals with compressed message size, which means that producers can send messages that are much larger than this value uncompressed, provided they compress it under the configured message.max.bytes size.
Note: This setting can be overridden by a specific topic (but with name max.message.bytes).
The maximum message size, message.max.bytes, configured on the Kafka broker must be coordinated with the fetch size configured on consumer clients (fetch.max.bytes, and the per-partition limit max.partition.fetch.bytes, which defaults to 1 MB). These configure the maximum number of bytes of messages to attempt to fetch for a request. If this value is smaller than message.max.bytes, then consumers that encounter larger messages will fail to fetch those messages, resulting in a situation where the consumer gets stuck and cannot proceed.
The configuration setting replica.fetch.max.bytes (defaults to 1MB) determines the rough amount of memory you will need for each partition on a broker.
Producer Setting: max.request.size
This setting controls the size of a produce request sent by the producer. It caps both the size of the largest message that can be sent and the number of messages that the producer can send in one request. For example, with a default maximum request size of 1 MB, the largest message you can send is 1MB or the producer can batch 1000 messages of size 1k each into one request.
In addition, the broker has its own limit on the size of the largest message it will accept (message.max.bytes). It is usually a good idea to have these configurations match, so the producer will not attempt to send messages of a size that will be rejected by the broker.
Note that message.max.bytes (broker level) and max.request.size (producer level) put a cap on the maximum size of a request/batch, but batch.size (which should be lower than the previous two) and linger.ms are the settings that actually govern the size of a batch.
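As a purely illustrative check of that ordering (the numbers below are example values, not defaults):

```java
public class SizeLimitCheck {
    public static void main(String[] args) {
        long batchSize       = 16 * 1024;       // producer batch.size (example)
        long maxRequestSize  = 1024 * 1024;     // producer max.request.size (example)
        long messageMaxBytes = 2 * 1024 * 1024; // broker message.max.bytes (example)

        // batch.size governs how records are grouped; the other two only cap
        // what can be sent in, and accepted as, a single request/batch.
        boolean consistent = batchSize <= maxRequestSize && maxRequestSize <= messageMaxBytes;
        System.out.println("Sizing consistent: " + consistent);
    }
}
```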
Producer Setting: batch.size and linger.ms
When multiple records are sent to the same partition, the producer will batch them together. The parameter batch.size controls the maximum amount of memory in bytes (not the number of messages!) that will be used for each batch. When a batch is full, all the messages in it are sent. This helps throughput on both the client and the server.
A small batch size will make batching less common and may reduce throughput. A very large size may use memory a bit more wastefully as we will always allocate a buffer of the specified batch size in anticipation of additional messages.
The linger.ms (defaults to 0) setting controls the amount of time to wait for additional messages before sending the current batch.
By default, the producer will send messages as soon as there is a sender thread available to send them, even if there's just one message in the batch (note that batch.size only specifies the maximum limit on the size of a batch). By setting linger.ms higher than 0, we instruct the producer to wait a few milliseconds to add additional messages to the batch before sending it to the brokers, even if a sender thread is available. This increases latency but also increases throughput (because we send more messages at once, there is less overhead per message).
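If you ever need to push out a partially filled batch without waiting for linger.ms, the client also offers flush(). A small sketch (broker address and topic are made up):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PartialBatch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.LINGER_MS_CONFIG, "50"); // give records 50 ms to accumulate

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("my-topic", "only-one-record"));
            // Even though the batch holds a single small record (far below batch.size),
            // flush() makes it immediately available to send instead of waiting out linger.ms.
            producer.flush();
        }
    }
}
```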
I'm trying to figure out the difference between the settings batch.size and buffer.memory in Kafka Producer.
As I understand it, batch.size is the maximum size of a batch that can be sent.
The documentation describes buffer.memory as: the bytes of memory the Producer can use to buffer records waiting to be sent.
I don't understand the difference between these two. Can someone explain?
Thanks
In my opinion,
batch.size: The maximum amount of data that can be collected into a single batch (one batch per partition). If batch.size is (32*1024), that means up to 32 KB of records can be grouped into one batch before it is sent out.
buffer.memory: if the Kafka producer is not able to send messages (batches) to the Kafka broker (say, the broker is down), it starts accumulating the message batches in the buffer memory (default 32 MB). Once the buffer is full, it will wait for "max.block.ms" (default 60,000 ms) for the buffer to be cleared out, and then it throws an exception.
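A hedged sketch of what the calling code sees in that situation (broker address and topic are assumptions; depending on where the timeout happens, the error surfaces from send() itself or from the returned Future):

```java
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.TimeoutException;
import org.apache.kafka.common.serialization.StringSerializer;

public class BufferFullExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.BUFFER_MEMORY_CONFIG, "33554432"); // 32 MB (default)
        props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "60000");     // block up to 60 s (default)

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            try {
                // If the 32 MB buffer stays full (e.g. the broker is unreachable),
                // the send path blocks for up to max.block.ms and then fails.
                producer.send(new ProducerRecord<>("my-topic", "payload")).get();
            } catch (TimeoutException | ExecutionException | InterruptedException e) {
                System.err.println("Could not hand the record to the producer: " + e);
            }
        }
    }
}
```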
Kafka producers and consumers have many configurations that help with performance tuning, such as achieving low latency and high throughput. buffer.memory and batch.size are two of those, and they are specific to the Kafka producer. Let's see more details on these configurations.
buffer.memory
This sets the amount of memory the producer will use to buffer messages waiting to be sent to the broker. If messages are sent by the application faster than they can be delivered to the server, the producer may run out of space, and additional send() calls will either block or throw an exception, depending on the max.block.ms configuration, which allows blocking for a certain time before the exception is thrown. Another case is when all brokers are down for some reason: the producer cannot send messages and has to keep them in the memory allocated by buffer.memory, which fills up quickly. If the brokers do not return to a normal state, the max.block.ms timeout again determines how long to wait for space to be freed before failing.
Default value for max.block.ms is 60,000 ms
Default value for buffer.memory is 32 MB (33554432)
batch.size
When multiple records are sent to the same partition, the producer will put them in a batch. This configuration controls the amount of memory in bytes (not messages) that will be used for each batch. When the batch is full, all the messages in the batch will be sent. However, this does not mean that the producer will wait for the batch to become full; it will send half-full batches and even batches with just a single message in them. Therefore, setting the batch size too large will not cause delays in sending messages, it will just use more memory for the batches. Setting the batch size too small will add extra overhead, because the producer will need to send requests more frequently.
Default batch size is 16384.
batch.size also works together with linger.ms, which controls the amount of time to wait for additional messages before sending the current batch. The Kafka producer sends a batch of messages either when the current batch is full or when the linger.ms time is reached. By default, the producer will send messages as soon as there is a sender thread available to send them, even if there is just one message in the batch.
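If you want to see how well the batching is working in practice, the producer exposes metrics. A small sketch (assuming an already-built producer) that reads the standard "batch-size-avg" producer metric:

```java
import org.apache.kafka.clients.producer.KafkaProducer;

public class BatchSizeMetric {
    // Prints the average batch size the producer is actually achieving, which
    // shows how effective the batch.size / linger.ms tuning is.
    static void printAvgBatchSize(KafkaProducer<?, ?> producer) {
        producer.metrics().forEach((name, metric) -> {
            if ("batch-size-avg".equals(name.name())) {
                System.out.println(name.group() + " / batch-size-avg = " + metric.metricValue());
            }
        });
    }
}
```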
Both of these producer configurations are described on the Confluent documentation page as following:
batch.size
Kafka producers attempt to collect sent messages into batches to improve throughput. With the Java client, you can use batch.size to control the maximum size in bytes of each message batch.
buffer.memory
Use buffer.memory to limit the total memory that is available to the Java client for collecting unsent messages. When this limit is hit, the producer will block on additional sends for as long as max.block.ms before raising an exception.
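A rough, unofficial rule of thumb that ties the two together: buffer.memory should comfortably cover at least one in-progress batch per partition the producer actively writes to, plus headroom. A sketch with made-up numbers:

```java
public class BufferSizing {
    public static void main(String[] args) {
        long batchSize        = 16 * 1024;         // batch.size
        int  activePartitions = 200;               // example value, not a default
        long bufferMemory     = 32L * 1024 * 1024; // buffer.memory

        // Back-of-the-envelope check only: real usage also includes in-flight
        // requests and compression buffers.
        long roughMinimum = batchSize * activePartitions;
        System.out.printf("need roughly >= %d bytes, have %d bytes%n", roughMinimum, bufferMemory);
    }
}
```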
It's unclear to me (and I haven't managed to find any documentation that makes it perfectly clear) how compression affects Kafka configurations that deal with bytes.
Take a hypothetical message that is exactly 100 bytes, a producer with a batch size of 1000 bytes, and a consumer with a fetch size of 1000 bytes.
With no compression it seems pretty clear that my producer would batch 10 messages at a time and my consumer would poll 10 messages at a time.
Now assume a compression (specified at the producer -- not on the broker) that (for simplicity) compresses to exactly 10% of the uncompressed size.
With that same config, would my producer still batch 10 messages at a time, or would it start batching 100 messages at a time? I.e. is the batch size pre- or post-compression? The docs do say this:
Compression is of full batches of data
...which I take to mean that it would compress 1000 bytes (the batch size) down to 100 bytes and send that. Is that correct?
Same question for the consumer fetch. Given a 1K fetch size, would it poll just 10 messages at a time (because the uncompressed size is 1K) or would it poll 100 messages (because the compressed size is 1K)? I believe that the fetch size will cover the compressed batch, in which case the consumer would be fetching 10 batches as-produced-by-the-producer at a time. Is this correct?
It seems confusing to me that, if I understand correctly, the producer is dealing with pre-compression sizes and the consumer is dealing with post-compression sizes.
It's both simpler and more complicated ;-)
It's simpler in that the producer compresses and the consumer decompresses the very same batches that travel in the Kafka protocol Produce and Fetch requests, and the broker just stores them with zero copy in their native wire format. Kafka does not compress individual messages before they are sent. It waits until a batch of messages (all going to the same partition) is ready to send, then compresses the entire batch and sends it as one Produce Request.
It's more complicated because you also have to factor in the linger time, which can trigger the send of a batch of messages before the batch is full. You also have to consider that messages may have different keys, or for other reasons be going to different topic partitions on different brokers, so it's not true to say that 10 records compressed to 100 bytes each all go as one batch to one broker in a single produce request of 1000 bytes (unless all the messages are being sent to a topic with a single partition).
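For completeness, a minimal sketch of enabling batch compression on the producer (broker address and values are assumptions):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class CompressedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip"); // whole batches are compressed
        props.put(ProducerConfig.LINGER_MS_CONFIG, "10");          // bigger batches tend to compress better

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Each per-partition batch is compressed as a unit and stored by the
            // broker in that compressed form; the consumer decompresses it.
        }
    }
}
```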
From https://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/producer/KafkaProducer.html
The producer maintains buffers of unsent records for each partition.
These buffers are of a size specified by the batch.size config. Making
this larger can result in more batching, but requires more memory
(since we will generally have one of these buffers for each active
partition).
By default a buffer is available to send immediately even if there is
additional unused space in the buffer. However if you want to reduce
the number of requests you can set linger.ms to something greater than
0. This will instruct the producer to wait up to that number of milliseconds before sending a request in hope that more records will
arrive to fill up the same batch. This is analogous to Nagle's
algorithm in TCP. For example, in the code snippet above, likely all
100 records would be sent in a single request since we set our linger
time to 1 millisecond. However this setting would add 1 millisecond of
latency to our request waiting for more records to arrive if we didn't
fill up the buffer. Note that records that arrive close together in
time will generally batch together even with linger.ms=0 so under
heavy load batching will occur regardless of the linger configuration;
however setting this to something larger than 0 can lead to fewer,
more efficient requests when not under maximal load at the cost of a
small amount of latency.
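The "code snippet above" that this javadoc refers to is not reproduced in the quote, but it is essentially a loop like the following sketch (broker address and topic are assumptions):

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class LingerLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.LINGER_MS_CONFIG, "1"); // linger for 1 ms, as in the javadoc example

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            // With a 1 ms linger, these 100 sends will very likely be collapsed
            // into a single request per partition rather than 100 requests.
            for (int i = 0; i < 100; i++) {
                producer.send(new ProducerRecord<>("my-topic", Integer.toString(i), Integer.toString(i)));
            }
        }
    }
}
```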
I am reading a CSV file and giving its rows to my Kafka producer. Now I want my Kafka producer to produce messages at a rate of 100 messages per second.
Take a look at linger.ms and batch.size properties of Kafka Producer.
You have to adjust these properties accordingly to get the desired rate.
The producer groups together any records that arrive in between request transmissions into a single batched request. Normally this occurs only under load when records arrive faster than they can be sent out. However in some circumstances the client may want to reduce the number of requests even under moderate load. This setting accomplishes this by adding a small amount of artificial delay—that is, rather than immediately sending out a record the producer will wait for up to the given delay to allow other records to be sent so that the sends can be batched together. This can be thought of as analogous to Nagle's algorithm in TCP. This setting gives the upper bound on the delay for batching: once we get batch.size worth of records for a partition it will be sent immediately regardless of this setting, however if we have fewer than this many bytes accumulated for this partition we will 'linger' for the specified time waiting for more records to show up. This setting defaults to 0 (i.e. no delay). Setting linger.ms=5, for example, would have the effect of reducing the number of requests sent but would add up to 5ms of latency to records sent in the absence of load.
If you like stream processing then akka-streams has nice support for throttling: http://doc.akka.io/docs/akka/current/java/stream/stream-quickstart.html#time-based-processing
Then the akka-stream-kafka (aka reactive-kafka) library allows you to connect the two together: http://doc.akka.io/docs/akka-stream-kafka/current/home.html
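If you prefer to stay with the plain Java producer, the crudest approach is to throttle on the client side. A sketch (the file name, topic, and broker address are made up); note this gives an approximate rate rather than an exact one:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ThrottledCsvProducer {
    public static void main(String[] args) throws IOException, InterruptedException {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             BufferedReader reader = Files.newBufferedReader(Paths.get("input.csv"))) { // hypothetical file
            String line;
            while ((line = reader.readLine()) != null) {
                producer.send(new ProducerRecord<>("csv-topic", line)); // hypothetical topic
                Thread.sleep(10); // roughly 100 records per second (crude client-side throttle)
            }
        }
    }
}
```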
With the Kafka JVM producer, throughput depends on multiple factors, and it is most commonly measured in MB/sec rather than msg/sec. In your example, if each row of your CSV is, let's say, 1 MB in size, then you need to tune your producer configs to achieve 100 MB/sec in order to reach your target throughput of 100 msg/sec. While tuning producer configs, you have to take into consideration your batch.size (measured in bytes). If it is set too low, the producer will send requests more often and spend more time waiting for replies from the server, which hurts throughput (though it can reduce latency). If you are using an async, callback-based producer, your overall throughput will also be limited by how many requests the producer can have outstanding before waiting for a reply from the server, which is determined by max.in.flight.requests.per.connection.
If you keep batch.size too high, the batches rarely fill up, so the producer ends up waiting for the linger.ms period and then sending everything accumulated for that particular partition to the broker at once. And a bigger batch.size goes hand in hand with a bigger buffer.memory, which might put pressure on GC.