We have a Camel Kafka consumer (camel-kafka:2.23.1) reading messages from Kafka. The default settings work well during the day; however, nightly there is a bulk load of around 5,000 messages per second and we are seeing high CPU usage. Are there any settings in the Kafka consumer / Camel Kafka to reduce the CPU usage? Do others face a similar issue?
Thanks!
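For reference, this kind of tuning usually ends up as options on the Camel endpoint URI. A rough sketch follows; the option names (fetchMinBytes, fetchWaitMaxMs, maxPollRecords, consumersCount) and the values are my assumptions about how camel-kafka maps the underlying consumer properties, so verify them against the 2.23.x docs:

```java
import org.apache.camel.builder.RouteBuilder;

// Sketch only: fetch less aggressively when idle and cap the batch size per poll.
// Endpoint option names and values are assumptions to check against camel-kafka 2.23.x.
public class NightlyKafkaRoute extends RouteBuilder {
    @Override
    public void configure() throws Exception {
        from("kafka:nightly-topic?brokers=localhost:9092"
                + "&groupId=bulk-loader"
                + "&fetchMinBytes=65536"   // ask the broker to batch up data before answering
                + "&fetchWaitMaxMs=500"    // ...but wait at most 500 ms for it
                + "&maxPollRecords=500"    // cap the records returned per poll
                + "&consumersCount=3")     // number of consumer threads on this endpoint
            .to("log:nightly-consumer");
    }
}
```

The idea is that larger, less frequent fetches mean fewer poll cycles and less per-request overhead, which is usually where the idle and bulk-load CPU goes.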
Related
We built a Java application with a Kafka reactor consumer. We try to fetch messages from the Kafka broker as quickly as possible in each consumer, so we set fetch.max.wait.ms to a low value like 2 or 4. We see frequent GC due to high memory usage per consumer thread.
Once we plugged in VisualVM, we found each consumer has a high allocated bytes/sec. We tried stopping the producer, but the memory usage doesn't stop and looks very consistent across the different consumer threads (some of the partitions assigned to the consumers should have zero new messages). Increasing fetch.max.wait.ms reduces the usage, but it doesn't explain the high usage when there are no new messages since the last fetch. Is it possible that the Kafka consumer poll also carries some historical buffer? (Reducing the VisualVM sampling frequency doesn't change the result.)
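For illustration, the two consumer properties in play are shown below; the numbers are just examples of the trade-off. With fetch.max.wait.ms this low, an idle consumer issues fetch requests nearly back to back, and even empty responses allocate request/response objects, which is what shows up as allocated bytes/sec:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

// Illustrative values only, not a recommendation.
public class FetchTuning {
    // The setup described above: the broker answers fetches almost immediately.
    public static Properties lowLatency() {
        Properties p = new Properties();
        p.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 2);
        p.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1);     // reply as soon as any data exists
        return p;
    }

    // A calmer variant: idle partitions produce far fewer (empty) fetch responses.
    public static Properties lowerOverhead() {
        Properties p = new Properties();
        p.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, 500); // let the broker wait for data
        p.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, 1024);  // ...or for at least 1 KB
        return p;
    }
}
```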
Rate limiting: since Kafka can deliver messages at a much higher rate than MQ can consume, is there some configuration we can set on the Kafka consumer to enable a rate-limited transfer and protect the stability of MQ?
Exactly-once semantics: I understand that Kafka supports exactly-once semantics, which would stop the re-transfer of messages that have already been consumed. Can someone guide me on how to set up this configuration?
We are using the Confluent Kafka enterprise version in our organization.
Rate limiting: Kafka is pull based, so your consumer can read messages at its own pace and transfer them into MQ (but if the second system is constantly slower, the backlog of unprocessed messages in Kafka will keep growing over time).
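A rough sketch of what such a throttled bridge could look like with the plain Java consumer; sendToMq(...) is a placeholder for whatever MQ client is in use, and the sleep-based throttle and the 200 msg/s limit are just one simple way to cap the transfer rate:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Sketch only: caps how many records per second get forwarded to MQ.
public class ThrottledBridge {
    private static final int MAX_RECORDS_PER_SECOND = 200; // assumed rate MQ can absorb; tune to your MQ

    public static void run(Properties consumerProps, String topic) {
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(Collections.singletonList(topic));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.isEmpty()) {
                    continue;
                }
                for (ConsumerRecord<String, String> record : records) {
                    sendToMq(record.value());   // placeholder for your MQ client
                }
                consumer.commitSync();          // commit only after the batch reached MQ
                try {
                    // crude throttle: sleep long enough that this batch averages
                    // out to at most MAX_RECORDS_PER_SECOND
                    Thread.sleep(1000L * records.count() / MAX_RECORDS_PER_SECOND);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }
    }

    private static void sendToMq(String payload) { /* placeholder */ }
}
```

Because offsets are committed only after the batch has reached MQ, anything not yet transferred simply stays in Kafka.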
Exactly-once semantics: to ensure this on the consumer side you need to commit the read offset manually, once the message has been successfully processed (the default behavior is an automatic commit of the read offset after a small timeout, which can lead to a lost message if a failure happens after the offset commit but before the processing of the message has finished).
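A minimal sketch of that manual-commit pattern (strictly speaking, committing after processing gives at-least-once delivery: a crash between processing and commit means the record is re-read and processed again rather than lost; processRecord(...) is a placeholder for your own logic):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Sketch: disable auto-commit and commit only after processing succeeded,
// so a crash mid-processing means the record is re-read rather than lost.
public class ManualCommitConsumer {
    public static void run(String bootstrapServers, String topic) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "mq-bridge");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList(topic));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    processRecord(record);   // placeholder for your business logic
                }
                if (!records.isEmpty()) {
                    consumer.commitSync();   // commit only after the whole batch is processed
                }
            }
        }
    }

    private static void processRecord(ConsumerRecord<String, String> record) { /* placeholder */ }
}
```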
We have 25 million records written to the Kafka topic.
The topic has 24 partitions and 24 consumers.
Each message is 1 KB, and the messages are wrapped with Avro for serialization and deserialization.
The replication factor is 2.
The fetch size is 50,000 and the poll interval is 50 ms.
Right now, during the load test, it takes 40 minutes on average to consume and process 1 million records, but we want to process the 25 million records in less than 20 to 30 minutes.
Broker configs:
background.threads = 10
num.network.threads = 7
num.io.threads = 8
Set replica.lag.time.max.ms = 500
Set replica.lag.max.messages = 4
Set log.flush.interval.ms to default value as per logs
Used G1 collector instead of MarkSweepGC
Changed -Xms and -Xmx to 4G
Our setup has 8 brokers, each with 3 disks, on 10 Gbps Ethernet with a simplex network.
Consumer configs:
We are using the Java consumer API to consume the messages. We set swappiness to 1 and use 200 threads to process the data within the consumer. Inside the consumer we pick up each message and hit Redis and MaprDB to perform some business logic. Once the logic is completed, we commit the message using Kafka commitSync.
Each consumer is running with -Xms4G and -Xmx4G. What other aspects do we need to consider in order to increase the read throughput?
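As a quick back-of-the-envelope on the target (assuming the middle of the 20-30 minute range): 25 million records in 25 minutes is roughly 16-17k messages/s overall, about 700 messages/s per partition, and around 16 MB/s of payload at 1 KB per message. Illustrative arithmetic only:

```java
// Back-of-the-envelope target rate for the numbers above (25M records,
// 24 partitions, 1 KB each, 25-minute budget); purely illustrative.
public class TargetRate {
    public static void main(String[] args) {
        long records = 25_000_000L;
        int partitions = 24;
        int budgetSeconds = 25 * 60;

        double recordsPerSecond = (double) records / budgetSeconds;          // ~16,700 msg/s overall
        double perPartition = recordsPerSecond / partitions;                 // ~700 msg/s per consumer
        double megabytesPerSecond = recordsPerSecond * 1024 / (1024 * 1024); // ~16 MB/s at 1 KB/msg

        System.out.printf("overall: %.0f msg/s, per partition: %.0f msg/s, ~%.1f MB/s%n",
                recordsPerSecond, perPartition, megabytesPerSecond);
    }
}
```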
I won't give you an exact answer to your problem, but rather a roadmap and some methodological help.
10 minutes for 1 million messages is indeed slow IF everything works fine AND the consumer's task is light.
The first thing you need to know is what your bottleneck is.
It could be:
the Kafka cluster itself: messages take a long time to be pulled out of the cluster. To test that, check with a simple consumer (the one provided with the Kafka CLI, for example) running directly on a machine where you have a broker (or close to it), to avoid network latency. How fast is that?
the network between the brokers and the consumer
the consumer: what does it do? Maybe the processing is really long, in which case the optimisation should happen there. Can you monitor the resources (CPU, RAM) required by your consumer? One good test you could do is create a test consumer in which you load 10k messages in memory, then run your business logic and time it (see the sketch after this list). How long does it take? This will tell you the max throughput of your consumer, irrespective of Kafka's speed.
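A rough harness for that test, assuming applyBusinessLogic(...) stands in for the Redis/MaprDB processing step:

```java
import java.util.ArrayList;
import java.util.List;

// Load N messages into memory first, then time the business logic alone,
// so Kafka and the network are out of the picture.
public class ProcessingBenchmark {
    public static void main(String[] args) {
        List<String> messages = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            messages.add("message-" + i);        // or pre-load real payloads from a file
        }

        long start = System.nanoTime();
        for (String message : messages) {
            applyBusinessLogic(message);
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;

        System.out.printf("Processed %d messages in %d ms (%.0f msg/s)%n",
                messages.size(), elapsedMs, messages.size() * 1000.0 / Math.max(1, elapsedMs));
    }

    private static void applyBusinessLogic(String message) { /* placeholder */ }
}
```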
I have a situation in Kafka where the producer publishes messages at a much higher rate than the consumer can consume them. I have to implement back pressure in Kafka for further consumption and processing.
Please let me know how I can implement this in Spark and also with the plain Java API.
Kafka acts as the regulator here. You produce at whatever rate you want into Kafka, scaling the brokers out to accommodate the ingest rate. You then consume as you want to; Kafka persists the data and tracks the offsets of the consumers as they work their way through the data they read.
You can disable auto-commit with enable.auto.commit=false on the consumer and commit only when the consumer's operation is finished. That way the consumer will be slower, but Kafka knows exactly how many messages the consumer has processed. If you also configure the poll interval with max.poll.interval.ms and the number of messages consumed in each poll with max.poll.records, you should be good.
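In consumer-property terms, that advice boils down to something like the following (the values are illustrative, not recommendations):

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;

// Illustrative values only: smaller batches per poll, a generous processing window
// per poll, and no auto-commit so offsets advance only after your work is done.
public class BackPressureProps {
    public static Properties build() {
        Properties props = new Properties();
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, 100);        // fewer records per poll
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, 300000); // time allowed between polls (5 min)
        return props;
    }
}
```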
I was reading the Kafka docs, where it is mentioned that:
Consumers pull data from the broker by requesting it from an offset.
Producers push messages to the broker.
Making Kafka consumers pull based makes sense: the consumers can drive the pace, and the broker can store the data for a really long time.
However, with producers being push based, how does Kafka make sure that a speed mismatch between the producer and Kafka won't happen? Also, producers don't have persistence by design. This seems to be a bigger problem when producers and brokers are separated by a high-latency network (the internet).
As a distributed commit log, Kafka solves exactly this (impedance mismatch). You produce your events at the rate at which they occur into Kafka, and then you consume them at the rate at which your application can. The data is persisted in Kafka regardless. If your application needs to consume at a greater rate, you scale it out and partition your topic and consume in parallel. Because the data is persisted the only factor is how fast you want to consume the data.