Kafka large message configuration support for Spring Boot application producer/consumer - apache-kafka

When I try to publish through a Kafka producer
in a Spring Boot application, I get a RecordTooLargeException.
The error is:
org.apache.kafka.common.errors.RecordTooLargeException: The message is 1235934 bytes when serialized which is larger than the maximum request size you have configured with the max.request.size configuration.
I read other discussions about this problem but did not find a suitable answer, as I have to both publish and consume these messages on the client side.
Please help me with brief configuration steps for doing this.

A nice thing about Kafka is that it has great exception messages that are pretty much self-explanatory. This one is basically saying that your message is too large (which you had concluded yourself, I believe).
If you check the docs for producer config and search for max.request.size in the table, the explanation says:
The maximum size of a request in bytes. This setting will limit the
number of record batches the producer will send in a single request to
avoid sending huge requests. This is also effectively a cap on the
maximum record batch size. Note that the server has its own cap on
record batch size which may be different from this.
You can configure this value in your producer configuration, like so:
properties.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, "value-in-bytes");
However, the default is good enough for 90% of use cases. Avoid sending such large messages if you can, or try compressing the messages (this works wonders for throughput), like so:
properties.setProperty(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");
There are other compression types (gzip and lz4, with zstd added in newer versions), but this one comes from Google and is quite efficient. Along with compression, you can tweak two other values to get much better performance (batch.size and linger.ms), but you would have to test for your use case.
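Since the question also asks about the consumer side: the limit has to be raised on every hop, not just in the producer. Here is a minimal sketch assuming a ~5 MB cap (the number and property values are examples only; the usual org.apache.kafka.clients imports are assumed):

// Producer side: allow requests up to ~5 MB, and compress
Properties producerProps = new Properties();
producerProps.put(ProducerConfig.MAX_REQUEST_SIZE_CONFIG, "5242880");
producerProps.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "snappy");

// Consumer side: raise the per-partition fetch limit to match
Properties consumerProps = new Properties();
consumerProps.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, "5242880");

// Broker/topic side (server.properties, or a per-topic override):
//   message.max.bytes=5242880          broker-wide cap on a record batch
//   replica.fetch.max.bytes=5242880    so followers can still replicate the batch
//   max.message.bytes=5242880          the per-topic version of the cap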

Related

Kafka and Event Streaming On Client Side?

I need to consume messages from an event source (represented as a single Kafka topic) producing about 50k to 250k events per second. It only provides a single partition, and the ping is quite high (90-100 ms).
As far as I have learned by reading the Kafka client code, a fetch request is issued during polling; once the response is fully read, the events/messages are parsed and deserialized, and once enough of them are available, consumer.poll() provides the subset to the calling application.
In my case this makes the whole thing not worthwhile. The best throughput I achieve is with a duration of about 2 s per fetch request (about 2.5 MB fetch.max.bytes). Smaller fetch durations increase the idle time (time in which the consumer does not receive any bytes) spent between the last byte of the previous response, parsing, deserialization, sending the next request, and waiting for the first byte of the next response.
Using a fetch duration of about 2 s results in a maximum latency of 2 s, which is highly undesirable. What I would like instead is for transmitted messages to become available to the consumer as soon as each individual message is fully transmitted, while the fetch response is still being received.
Since every message has an individual id, the messages are sent in a particular order, and only a single consumer (+thread) is active for the single partition, it is not a problem to suppress retransmitted messages in case a fetch response is aborted/fails and its messages were partially processed and later retransmitted.
So the big question is whether the Kafka client provides a way to consume messages from a not-yet-completed fetch response.
That is a pretty large number of messages coming in through a single partition. Since you can't control anything on the Kafka server, the best you can do is configure your client to be as efficient as possible, assuming you have access to the Kafka client configuration parameters. You didn't mention anything about needing to consume the messages as fast as they're generated, so I'm assuming you don't need that. I also didn't see any info about average message size or how much message sizes vary, but unless those are extreme values, the suggestions below should help.
The first thing you need to do is set max.poll.records on the client side to a smallish number, say, start with 10000, and see how much throughput that gets you. Make sure to consume without doing anything with the messages, just dump them on the floor, and then call poll() again. This is just to benchmark how much performance you can get with your fixed server setup. Then, increase or decrease that number depending on if you need better throughput or latency. You should be able to get a best scenario after playing with this for a while.
After having done the above, the next step is to change your code so it dumps all received messages into an internal in-memory queue and then calls poll() again. This is especially important if processing each message requires DB access, hitting external APIs, etc. If you take even 100 ms to process 1K messages, that can cut your throughput in half in your case (100 ms to poll/receive, then another 100 ms to process the received messages before you start the next poll()).
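A rough sketch of that hand-off, assuming an already-subscribed KafkaConsumer<String, byte[]> named consumer and a placeholder process() method (both are assumptions, not part of the question):

// while benchmarking, cap the batch size as described above:
// props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "10000");

BlockingQueue<ConsumerRecord<String, byte[]>> queue = new ArrayBlockingQueue<>(100_000);

// worker thread: does the slow work (DB access, external APIs, ...)
new Thread(() -> {
    while (true) {
        try {
            process(queue.take());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
    }
}).start();

// polling thread: only drains Kafka and refills the queue
// (assume the enclosing method declares throws InterruptedException)
while (true) {
    for (ConsumerRecord<String, byte[]> record : consumer.poll(100)) {
        queue.put(record); // blocks only if the worker falls far behind
    }
}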
Without having access to Kafka configuration parameters on the server side, I believe the above should get you pretty close to an optimal throughput for your configuration.
Feel free to post more details in your question, and I'd be happy to update my answer if that doesn't help.
To deal with such high throughput, these are the community recommendations for the number of partitions on a source topic, and it is worth considering all of these factors when choosing the partition count:
• What is the throughput you expect to achieve for the topic?
• What is the maximum throughput you expect to achieve when consuming from a single partition?
• If you are sending messages to partitions based on keys, adding partitions later can be very challenging, so calculate throughput based on your expected future usage, not the current usage.
• Consider the number of partitions you will place on each broker and the available disk space and network bandwidth per broker.
So if you want to be able to write and read 1 GB/sec from a topic, and each consumer can only process 50 MB/s, then you need at least 20 partitions. This way, you can have 20 consumers reading from the topic and achieve 1 GB/sec.
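In formula form (a rule of thumb rather than a law):

partitions >= max(target throughput / producer throughput per partition,
                  target throughput / consumer throughput per partition)

With the numbers above: 1000 MB/s divided by 50 MB/s per consumer = 20 partitions.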
Also, regarding fetch.max.bytes, I am sure you have already had a glance at this one: Kafka fetch max bytes doesn't work as expected.

Streaming audio streams through MQ (scalability)

My question is rather specific, so I will be OK with a general answer that points me in the right direction.
Description of the problem:
I want to deliver task-specific data from multiple producers to a particular consumer working on the task (both are Docker containers run in k8s). The relation is many-to-many: any producer can create a data packet for any consumer. Each consumer processes ~10 streams of data at any given moment, while each data stream consists of 100 messages of 160 B per second (from different producers).
Current solution:
In our current solution, each producer keeps a cache of task: (IP:PORT) pairs for the consumers and uses UDP data packets to send the data directly. It is nicely scalable but rather messy in deployment.
Question:
Could this be realized in the form of a message queue of sorts (Kafka, Redis, RabbitMQ...)? E.g., having a channel for each task where producers send data while the consumer, well, consumes it? How many streams would be feasible for the MQ to handle (I know it would differ; suggest your best estimate)?
Edit: Would 1000 streams, which equal 100 000 messages per second, be feasible? (The throughput for 1000 streams is 16 MB/s.)
Edit 2: Fixed packet size to 160 B (typo).
Unless you need disk persistence, do not even look in the message broker direction; you would just be adding one problem on top of another. Direct network code is the proper way to solve audio broadcast. If your current code is messy and you want a simpler programming model, a good alternative to raw sockets is the ZeroMQ library. It gives you the message-broker functionality you actually care about, namely (a) discrete messaging instead of streams and (b) client discoverability, without going overboard with another software layer.
When it comes to "feasible": 100 000 messages per second at 160 kb each is a lot of data; it comes to roughly 16 Gb/sec even without any messaging protocol on top of it. In general, Kafka shines at throughput of small messages, as it batches messages on many layers. Because of that, the sustained performance of Kafka is usually constrained by disk speed, since Kafka is intentionally written this way (the slowest component is the disk). However, your messages are very large and you need to both write and read them at the same time, so I don't see it happening without a large cluster installation, as your problem is actual data throughput, not the number of messages.
Because you are data-limited, even classic MQ software like ActiveMQ, IBM MQ, etc. can actually cope very well with your situation. In general, classic brokers are much more "chatty" than Kafka and cannot match Kafka's message throughput on small messages. But as long as you use large non-persistent messages (and a proper broker configuration), you can expect decent MB/sec performance from them too. With proper configuration, classic brokers will directly connect a producer's socket to a consumer's socket without hitting disk; Kafka, in contrast, always persists to disk first. So they even have some latency advantages over Kafka.
However, this direct socket-to-socket "optimisation" just turns full circle back to the start of this answer. Unless you need audio stream persistence, all you are doing with a broker in the middle is finding an indirect way of binding producing sockets to consuming ones and then sending discrete messages over that connection. If that is all you need, ZeroMQ is made for this.
There is also a messaging protocol called MQTT, which may be of interest to you if you choose to pursue a broker solution, as it is meant to be an extremely scalable solution with low overhead.
A basic approach
From a Kafka perspective, each stream in your problem can map to one topic, and therefore there is one producer-consumer pair per topic.
Con: if you have lots of streams, you will end up with a lot of topics, and IMO the solution can get messier here too as you increase the number of topics.
An alternative approach
Alternatively, a better way is to map multiple streams to one topic, where each stream is separated by a key (like the IP:Port combination you use now), and then have multiple consumers, each subscribing to a specific set of partitions as determined by the key. Partitions are the unit of scalability in Kafka (a minimal sketch follows below).
Con: though you can increase the number of partitions, you cannot decrease them.
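A minimal sketch of the keyed production side, assuming a KafkaProducer<String, byte[]> named producer and a made-up topic name; records sharing a key always hash to the same partition, so each stream stays ordered:

// all packets of one stream share a key and therefore one partition
String streamKey = "10.0.0.7:5004"; // hypothetical IP:Port identifying the stream
byte[] packet = new byte[160];      // stand-in for the 160 B payload

producer.send(new ProducerRecord<>("task-streams", streamKey, packet));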
Type of data matters
If your streams are heterogeneous, in the sense that it would not be apt for all of them to share a common topic, you can create more topics.
Usually, topics are determined by the data they host and/or what their consumers do with the data in the topic. If all of your consumers do the same thing i.e. have the same processing logic, it is reasonable to go for one topic with multiple partitions.
Some points to consider:
Unlike in your current solution (I suppose), a message does not get lost once it is received and processed; rather, it stays in the topic until the configured retention period expires.
Take proper care in determining the keying strategy, i.e. which messages land in which partitions. As said earlier, if all of your consumers do the same thing, all of them can be in one consumer group to share the workload.
Consumers belonging to the same group perform a common task and will subscribe to a set of partitions determined by the partition assignor. Each consumer then gets a set of keys, in other words a set of streams, or, in terms of your current solution, a set of one or more IP:Port pairs.
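And a sketch of the matching consumer-group setup (group and topic names are placeholders I made up):

Properties props = new Properties();
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ConsumerConfig.GROUP_ID_CONFIG, "task-workers"); // same group => shared workload
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());

KafkaConsumer<String, byte[]> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Collections.singletonList("task-streams"));
// the partition assignor now hands each group member its share of partitions, i.e. streams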

What defines the scope of a Kafka topic

I'm looking to try out Kafka for an existing system, to replace an older message protocol. Currently we have a number of types of messages (hundreds) used to communicate among ~40 applications. Some are asynchronous at high rates and some are triggered by user requests/events.
Now, looking at Kafka, it breaks things out into topics, partitions, etc. But I'm a bit confused as to what constitutes a topic. Does every type of message my applications produce get its own topic, allowing hundreds of topics, or do I cluster related message types together? If the latter, is it bad practice for an application to read a message and drop it when its contents are not what it's looking for?
I'm also in a dilemma: there will be upwards of 10 copies of a single application (a display), all of which receive a very large amount of data (in the form of a lightweight video stream of sorts) and would send out user commands from each particular node. Would Kafka be a sufficient form of communication for this? Assume at most 10 copies, though sometimes these particular applications may not want the video stream at all.
A third and final question: I read a bit about replayability of messages. Is this only within a single topic, or can replayability span a slew of different topics?
Kafka itself doesn't care about "types" of message. The only type it knows about is bytes, meaning that you are completely flexible in how you serialize your datasets. Note, however, that the default max message size is just 1 MB, so streaming video/images/media is arguably the wrong use case for Kafka alone; a protocol like RTMP would probably make more sense.
Kafka consumer groups scale horizontally, not in response to load. Consumers poll data at the rate at which they can process it. If they don't need data, they can be stopped; if they need to reprocess data, they can seek independently.
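For instance, rewinding a consumer for reprocessing is only a couple of calls (the topic name and offset below are made up):

TopicPartition tp = new TopicPartition("display-frames", 0); // hypothetical topic
consumer.assign(Collections.singletonList(tp));
consumer.seek(tp, 12345L); // jump back to an absolute offset
// or consumer.seekToBeginning(Collections.singletonList(tp)) to replay everything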

Is there a way to prioritize messages in Apache Kafka 2.0?

EDIT
In case anyone else is in this particular situation, I got something akin to what I was looking for after tweaking the consumer configurations. I created a producer that sent the priority messages to three separate topics (for high/med/low priorities), and then I created 3 separate consumers, one per topic. I polled the higher-priority topics frequently and didn't poll the lower priorities unless the higher ones were empty:
// create each consumer once, outside the loop
final KafkaConsumer<String, String> highPriConsumer = createConsumer(TOPIC1);
final KafkaConsumer<String, String> medPriConsumer = createConsumer(TOPIC2);
final KafkaConsumer<String, String> lowPriConsumer = createConsumer(TOPIC3);
while (true) {
    final ConsumerRecords<String, String> consumerRecordsHigh = highPriConsumer.poll(100);
    if (!consumerRecordsHigh.isEmpty()) {
        // process high-pri records
    } else {
        final ConsumerRecords<String, String> consumerRecordsMed = medPriConsumer.poll(100);
        if (!consumerRecordsMed.isEmpty()) {
            // process med-pri records
        } else {
            final ConsumerRecords<String, String> consumerRecordsLow = lowPriConsumer.poll(100);
            // process low-pri records
        }
    }
}
The poll timeout (the argument to the .poll() method) determines how long to wait when there are no records to poll. I set this to a very short time for each topic, but you can set it even lower for the lower priorities to make sure they don't consume valuable cycles waiting while high-pri messages are available.
The max.poll.records config obviously determines the maximum number of records to grab in a single poll. This could be set higher for the higher priorities as well.
The max.poll.interval.ms config determines the maximum allowed time between polls, i.e. how long it may take to process max.poll.records messages before the consumer is considered failed. Clarification here.
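Putting those knobs together, the per-priority tuning might look like this sketch (all numbers are illustrative, not recommendations):

// high-priority consumer: grab big batches, allow generous processing time
Properties highProps = new Properties();
highProps.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "1000");
highProps.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000");

// low-priority consumer: small batches, and poll with a tiny timeout
Properties lowProps = new Properties();
lowProps.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");
final ConsumerRecords<String, String> lowRecords = lowPriConsumer.poll(10);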
Also, I believe pausing/resuming an entire consumer/topic can be implemented like this:
kafkaConsumer.pause(kafkaConsumer.assignment());
if (kafkaConsumer.paused().containsAll(kafkaConsumer.assignment())) {
    kafkaConsumer.resume(kafkaConsumer.assignment());
}
I'm not sure if this is the best way, but I couldn't find a good example elsewhere.
I agree with senseiwu below that this is not really the correct use for Kafka. This is single-threaded processing, with each topic having a dedicated consumer, but I will work on improving this process from here.
Background
We are trying to improve our application and hope to use Apache Kafka for messaging between decoupled components. Our system frequently has low bandwidth (although there are cases where bandwidth can be high for a time) and has small, high-priority messages that must be processed while larger files wait or are processed slowly to consume less bandwidth. We would like to have topics with different priorities.
I am new to Kafka, but have tried looking into both the Processor API and Kafka Streams with no success, although certain posts on forums seem to be saying this is doable.
Processor API
When I tried the Processor API, I tried to determine whether the high-priority KafkaConsumer was currently processing anything by checking if poll() came back empty, and then hoped to poll() with the medium-priority consumer, but the second topic's poll also returned empty. There also didn't seem to be an easy way to get all TopicPartitions of a topic in order to call kafkaConsumer.pause(partitions).
Kafka Streams
When I tried Kafka Streams, I set up a stream to consume from each of my "priority" topics, but there was no way to check whether the KStream or KafkaStreams instance connected to the higher-priority topic was currently idle or processing.
I based my code on this file
Other
I also tried the code here: priority-kafka-client, but it didn't work as expected; running the downloaded test file produced mixed priorities.
I found this thread, where one of the developers says (addressing adding priorities for topics): "...a user could implement this behavior with pause and resume". But I was unable to find out how he meant this could work.
I found this StackOverflow article, but they seem to be using a very old version, and I was unclear on how their mapping function was supposed to work.
Conclusion
I would be very grateful if someone would tell me if they think this is something worth pursuing. If this isn't how Apache Kafka is supposed to work, because it disrupts the benefit gained from the automatic topic/partition handling, that's fine, and I will look elsewhere. However, there were so many instances where people seemed to have success with it, that I wanted to try. Thank you.
This sounds like a design issue in your application. Kafka was originally designed as a commit log, where each message is written to the broker with an offset and the various consumers consume them in the order in which they were committed, with very low latency and high throughput. Given that partitions, not topics, are the fundamental unit of work distribution in Kafka, topic-level priorities would be difficult to achieve natively.
I'd recommend adapting your design to use other architectural components alongside Kafka instead of trying to cut your feet to fit the shoes. One thing you could do already is have your producer upload the file to proper file storage and send the link via Kafka, including metadata. Then, depending on the bandwidth status, your consumer could decide, based on the metadata of the large file, whether it is sensible to download it or not. This way you are more likely to end up with a robust design rather than using Kafka the wrong way.
If you do want to stick to Kafka alone, one solution would be to send large files to a fixed number of hardcoded partitions, and have consumers consume from those partitions only when bandwidth is good.
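A sketch of that idea with made-up names; the producer pins large payloads to an explicit partition, and the consumer attaches to that partition only when it chooses to (fileId and fileBytes are hypothetical variables):

// producer: write large files to a dedicated topic, explicitly to partition 0
producer.send(new ProducerRecord<>("large-files", 0, fileId, fileBytes));

// consumer: manually assign that partition only while bandwidth is good
TopicPartition largeFiles = new TopicPartition("large-files", 0);
consumer.assign(Collections.singletonList(largeFiles));
// poll() while bandwidth permits, then consumer.unsubscribe() to detach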

Kafka producer resilience config: Fail but never block

I am currently learning some Kafka best practices from Netflix (https://www.slideshare.net/wangxia5/netflix-kafka). It is a very good slide deck. However, I really don't understand one of the slides (slide 18) about producer resilience configuration, and I hope someone on Stack Overflow is kind enough to give me some insight into it (I can't find the video or reach the author...).
The slide mentioned: Fail but never block in producer resilience configuration.
Block.on.buffer.full=false
Even though this is a deprecated configuration, I guess the idea is to let the producer fail right away rather than block and wait. With the latest Kafka configuration, I can use a small value for max.block.ms to make the producer fail to send a message rather than block.
Question 1: Why do we want it to fail right away? Does that mean retrying later rather than blocking?
Handle Potential Block for first meta data request
Question 2: I can understand metadata on the consumer side, i.e. registering the consumer group and that sort of thing, but what is a metadata request from the producer's point of view? And can it potentially block? Is there any Kafka documentation describing this?
Periodically check whether Kafka producer was open successfully
Question 3: Is there a way to perform that check, and what are the benefits of doing it?
Thanks in advance :)
You have to keep in mind how a Kafka producer works:
From the API-Documentation:
The producer consists of a pool of buffer space that holds records
that haven't yet been transmitted to the server as well as a
background I/O thread that is responsible for turning these records
into requests and transmitting them to the cluster.
If you call the send method to send a record to the broker, the message is added to an internal buffer (the size of this buffer can be configured using the buffer.memory configuration property). Now different things can happen:
Happy path: the messages in the buffer are converted into requests to the broker by the background I/O thread; the broker ACKs these messages and everything is fine.
The messages cannot be sent to the Kafka broker (the connection to the broker is broken, you are producing messages faster than they can be sent out, etc.). In this case it is up to you to decide what to do. With max.block.ms (the replacement for block.on.buffer.full) set to a positive value, the send call will block for at most this amount of time(1) and throw a TimeoutException afterwards.
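A sketch of that fail-fast setup (the 50 ms value and the topic name are made up; payload is a placeholder):

props.put(ProducerConfig.MAX_BLOCK_MS_CONFIG, "50"); // give up quickly instead of blocking

try {
    producer.send(new ProducerRecord<>("events", payload), (metadata, exception) -> {
        if (exception != null) {
            // asynchronous failure after hand-off: drop, log or re-route here
        }
    });
} catch (org.apache.kafka.common.errors.TimeoutException e) {
    // buffer full or metadata unavailable for longer than max.block.ms: message dropped
}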
Regarding your questions:
(1) If I got the slides right, Netflix explicitly wants to throw away the messages which they can't send to the broker (instead of blocking, retrying, failing ...). This of course highly depends on your application and the kind of messages you are dealing with. If it is "just log messages", it might be no big deal. If it comes to financial transactions, you may want to block or retry rather than drop them.
(2) The producer needs some metadata about the cluster, e.g. it needs to know which key goes to which partition. There is a good blog post by Hortonworks on how the producer works internally. I think it is worth reading: https://community.hortonworks.com/articles/72429/how-kafka-producer-work-internally.html
Furthermore, the statement:
Handle Potential Block for first meta data request
points to an issue which, as far as I know, is still around. The very first call of send will make a synchronous metadata request to the broker and may therefore take longer.
(3) Connections from producers are closed by the broker if the producer is idle for some time (see connections.max.idle.ms). I am not aware of a standard way to keep your producer's connection alive, or even to check whether the connection is still alive. What you could do is periodically send a metadata request to the broker (producer.partitionsFor(anyTopic)). But again, maybe this is not an issue for your application.
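A sketch of that periodic check (the topic name and interval are arbitrary choices of mine):

ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleAtFixedRate(
        () -> producer.partitionsFor("any-topic"), // lightweight metadata request keeps the connection warm
        0, 30, TimeUnit.SECONDS);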
(1) When it comes to the details of what is taken into account when calculating the time passed, it gets a bit tricky. For max.block.ms it is actually:
metadata fetch time
buffer full block time
serialization time (customized serializer)
partitioning time (customized partitioner)