How to scale max.incremental.fetch.session.cache.slots - apache-kafka

I'm running a somewhat large Kafka cluster, but I'm currently stuck on setting max.incremental.fetch.session.cache.slots properly and would need some guidance. The documentation about this is not clear either: https://cwiki.apache.org/confluence/display/KAFKA/KIP-227%3A+Introduce+Incremental+FetchRequests+to+Increase+Partition+Scalability
By scale I mean: 3 nodes, ~400 topics, 4500 partitions, 300 consumer groups, 500 consumers.
For a while now, I've been seeing FETCH_SESSION_ID_NOT_FOUND errors in the logs and want to address them.
So I tried increasing the value in the config and restarted all brokers, but the pool quickly filled up again to its maximum capacity. This reduced the occurrence of the errors, but they are not completely gone. At first I set the value to 2000 and it was instantly full. I then raised it in several steps up to 100,000, and the pool was filled within ~40 minutes.
From the documentation I expected the pool to cap out after 2 minutes, once min.incremental.fetch.session.eviction.ms kicks in, but this does not seem to be the case.
Which metrics should I use to gauge the appropriate size of the cache? Are the errors I'm still seeing something I can fix on the brokers, or do I need to hunt down misconfigured consumers? If so, what should I look out for?

Such a high usage of Fetch Sessions is most likely caused by a bad client.
Sarama, a Golang client, had an issue that caused a new Fetch Session to be allocated on every Fetch request between versions 1.26.0 and 1.26.2, see https://github.com/Shopify/sarama/pull/1644.
I'd recommend checking if you have users running this client and ensure they update to the latest release.
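Independent of the client-side cause, it's worth verifying what each broker actually has configured after a restart. A minimal sketch using the Java AdminClient (the bootstrap address and broker ID "0" are placeholders, and the setting has to be checked per broker):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class CheckFetchSessionSlots {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-1:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // The setting applies per broker, so query each broker ID you have (here: 0).
            ConfigResource broker = new ConfigResource(ConfigResource.Type.BROKER, "0");
            Config config = admin.describeConfigs(Collections.singleton(broker))
                                 .all().get().get(broker);
            ConfigEntry slots = config.get("max.incremental.fetch.session.cache.slots");
            System.out.println(slots.name() + " = " + slots.value()
                + " (source: " + slots.source() + ")");
        }
    }
}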

Related

Pub/Sub pull request count drastically decreases on Kubernetes pods on GCP

I have ~5M messages (~7 GB total) backlogged on my GCP Pub/Sub subscription and want to pull as many of them as possible. I am using synchronous pull with the settings below, waiting for 3 minutes to pile up messages before sending them to another database.
defaultSettings := &pubsub.ReceiveSettings{
    MaxExtension:           10 * time.Minute,
    MaxOutstandingMessages: 100000,
    MaxOutstandingBytes:    128e6, // 128 MB
    NumGoroutines:          1,
    Synchronous:            true,
}
The problem is that with around 5 pods on my Kubernetes cluster, the pods are able to pull nearly ~90k messages in almost every round (3-minute period). However, when I increase the number of pods to 20, each pod still retrieves ~90k messages in the first or second round, but after a while the pull request count drops drastically and each pod receives only ~1k-5k messages per round. I have investigated the Go library's synchronous pull mechanism and know that without successfully acking messages you cannot request new ones, so the pull request count may drop to avoid exceeding MaxOutstandingMessages. But I am scaling my pods down to zero and starting fresh pods while there are still millions of unacked messages in the subscription, and they still get a very low number of messages in 3 minutes, whether with 5 or 20 pods. After around 20-30 minutes they again receive ~90k messages each, and then the count drops to very low levels again after a while (checking from the metrics page). Another interesting thing is that while my fresh pods receive very low numbers of messages, my local computer connected to the same subscription gets ~90k messages in each round.
I have read the quotas and limits page of Pub/Sub; the bandwidth quotas are extremely high (240,000,000 kB per minute (4 GB/s) in large regions). I have tried a lot of things but couldn't understand why the pull request count drops massively when I start fresh pods. Is there some connection or bandwidth limitation for Kubernetes cluster nodes on GCP or on the Pub/Sub side? Receiving messages at high volume is critical for my task.
If you are using synchronous pull, I suggest switching to StreamingPull at your scale of Pub/Sub usage.
Note that to achieve low message delivery latency with synchronous pull, it is important to have many simultaneously outstanding pull requests. As the throughput of the topic increases, more pull requests are necessary. In general, asynchronous pull is preferable for latency-sensitive applications.
It is expected that, for a high throughput scenario and synchronous pull, there should always be many idle requests.
A synchronous pull request establishes a connection to one specific server (process). A high throughput topic is handled by many servers. Messages coming in will go to only a few servers, from 3 to 5. Those servers should have an idle process already connected, to be able to quickly forward messages.
This conflicts with CPU-based scaling: idle connections don't cause CPU load. At the very least, there should be many more threads per pod than 10 to make CPU-based scaling work.
Also, you can use a Horizontal Pod Autoscaler (HPA) configured for the Pub/Sub-consuming GKE pods. With the HPA, you can scale on CPU usage.
My last recommendation would be to consider Dataflow for your workload, consuming from Pub/Sub.
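As a side note, the Go client used in the question uses StreamingPull when Synchronous is set to false. For illustration, here is a minimal Java sketch of a StreamingPull subscriber (project, subscription, flow-control limits and stream count are placeholders):

import com.google.api.gax.batching.FlowControlSettings;
import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;

public class StreamingPullExample {
    public static void main(String[] args) {
        ProjectSubscriptionName subscription =
            ProjectSubscriptionName.of("my-project", "my-subscription"); // placeholders

        MessageReceiver receiver = (PubsubMessage message, AckReplyConsumer consumer) -> {
            // Write the message to the target database here, then ack it.
            consumer.ack();
        };

        Subscriber subscriber = Subscriber.newBuilder(subscription, receiver)
            .setParallelPullCount(4) // number of concurrent streaming pull streams per pod
            .setFlowControlSettings(FlowControlSettings.newBuilder()
                .setMaxOutstandingElementCount(100_000L)
                .setMaxOutstandingRequestBytes(128_000_000L) // 128 MB
                .build())
            .build();

        subscriber.startAsync().awaitRunning();
        subscriber.awaitTerminated(); // blocks; call stopAsync() to shut down
    }
}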

How to check the actual number of incremental fetch session cache slots used in Kafka cluster?

I am reading the question Kafka: Continuously getting FETCH_SESSION_ID_NOT_FOUND and am trying to apply the solution suggested by Hrishikesh Mishra, as we face a similar issue. So I increased the broker setting max.incremental.fetch.session.cache.slots to 2000 (the default was 1000). But now I wonder how I can monitor the actual number of used incremental fetch session cache slots. In Prometheus I see the kafka_server_fetchsessioncache_numincrementalfetchpartitionscached metric, and the PromQL query shows on each of the three brokers a number that is now significantly over 2000 (2703, 2655 and 2054), so I am not sure I am looking at the right metric. There is also kafka_server_fetchsessioncache_incrementalfetchsessionevictions_total, which shows zero on all brokers.
OK, there is also kafka_server_fetchsessioncache_numincrementalfetchsessions, which shows about 500 on each of the three brokers, roughly 1500 in total, which is between 1000 and 2000. So maybe that is the metric that is controlled by max.incremental.fetch.session.cache.slots?
Actually, as of now, there are already more than 700 incremental fetch sessions on each broker, more than 2100 in total, so obviously the limit of 2000 applies per broker, and the number in the whole cluster can go as high as 6000. The reason the number was below 1000 on each broker is that the brokers were restarted after the configuration change.
And the question is how this allocation can be checked at the level of individual consumers. A query such as:
count by (__name__) ({__name__=~".*fetchsession.*"})
returns only this table:
Element Value
kafka_server_fetchsessioncache_incrementalfetchsessionevictions_total{} 3
kafka_server_fetchsessioncache_numincrementalfetchpartitionscached{} 3
kafka_server_fetchsessioncache_numincrementalfetchsessions{} 3
The metric named kafka.server:type=FetchSessionCache,name=NumIncrementalFetchSessions is the correct way to monitor the number of FetchSessions.
The size is configurable via max.incremental.fetch.session.cache.slots. Note that this setting is applied per-broker, so each broker can cache up to max.incremental.fetch.session.cache.slots sessions.
The other metric you saw, kafka.server:type=FetchSessionCache,name=NumIncrementalFetchPartitionsCached, is the total number of partitions used across all FetchSessions. Many FetchSessions will use several partitions, so it's expected to see a larger number of them.
As you said, the low number of FetchSessions you saw was likely due to the restart.
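If you want to read the values straight off a broker rather than through Prometheus, here is a minimal JMX sketch. It assumes the broker exposes remote JMX (e.g. started with JMX_PORT=9999); the host and port are placeholders, and each broker has to be queried separately since the cache is per broker:

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class FetchSessionCacheProbe {
    public static void main(String[] args) throws Exception {
        String url = "service:jmx:rmi:///jndi/rmi://broker-1:9999/jmxrmi"; // placeholder host/port
        JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(url));
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            Object sessions = conn.getAttribute(new ObjectName(
                "kafka.server:type=FetchSessionCache,name=NumIncrementalFetchSessions"), "Value");
            Object partitions = conn.getAttribute(new ObjectName(
                "kafka.server:type=FetchSessionCache,name=NumIncrementalFetchPartitionsCached"), "Value");
            System.out.println("fetch sessions cached: " + sessions
                + ", partitions cached: " + partitions);
        } finally {
            connector.close();
        }
    }
}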

Cluster Becomes Unresponsive When One Node Goes OOM

We have created a three-node cluster using Hazelcast 3.4.2 and I'm having the following issue.
If one node goes OOM, the other nodes become unresponsive. Sometimes those nodes (except the one that went OOM) manage to recover; however, the recovery time is not predictable.
Also, we added the following two Hazelcast properties as JVM parameters. However, the issue still persists in the cluster.
hazelcast.client.heartbeat.timeout
hazelcast.max.no.heartbeat.seconds
Please note that the cluster was started several times with a few different values for the above two Hazelcast properties.
So I would like to know whether this is a known issue or not. Also, if the above scenario is a known issue, is there a workaround for it?
Thanks
Do your members have enough headroom? When one member goes down, the same amount of data has to be distributed among fewer members, which can cause memory pressure on them. I'd recommend enabling verbose GC logging and testing your scenario.
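As a side note, the two properties mentioned in the question can also be set programmatically rather than as JVM parameters; a minimal sketch (the values are arbitrary examples, not recommendations):

import com.hazelcast.client.HazelcastClient;
import com.hazelcast.client.config.ClientConfig;
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class HeartbeatProperties {
    public static void main(String[] args) {
        // Member side: seconds a member may miss heartbeats before it is considered dead.
        Config memberConfig = new Config();
        memberConfig.setProperty("hazelcast.max.no.heartbeat.seconds", "120"); // example value
        HazelcastInstance member = Hazelcast.newHazelcastInstance(memberConfig);

        // Client side: heartbeat timeout in milliseconds.
        ClientConfig clientConfig = new ClientConfig();
        clientConfig.setProperty("hazelcast.client.heartbeat.timeout", "60000"); // example value
        HazelcastInstance client = HazelcastClient.newHazelcastClient(clientConfig);
    }
}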

JBoss ActiveMQ 6.1.0 queue message processing slows down after 10000 messages

Below is the configuration:
2 JBoss application nodes
5 listeners on each application node with 50 threads each; the listeners support clustering and are set up as active-active, so they run on both app nodes
The listener simply gets the message and logs the information into the database
50000 messages are posted into ActiveMQ using JMeter.
Here is the observation on first execution:
Total 50000 messages are consumed in approx 22 mins.
first 0-10000 messages consumed in 1 min approx
10000-20000 messages consumed in 2 mins approx
20000-30000 messages consumed in 4 mins approx
30000-40000 messages consumed in 6 mins approx
40000-50000 messages consumed in 8 mins
So we see the message consumption time is increasing with increasing number of messages.
Second execution without restarting any of the servers:
50000 messages consumed in 53 mins approx!
But after deleting the data folder of ActiveMQ and restarting ActiveMQ, performance again improves, then degrades as more data enters the queue!
I tried multiple configurations in activemq.xml, but with no success.
Has anybody faced a similar issue and found a solution? Let me know. Thanks.
I've seen similar slowdowns in our production systems when pending message counts go high. If you're flooding the queues then the MQ process can't keep all the pending messages in memory, and has to go to disk to serve a message. Performance can fall off a cliff in these circumstances. Increase the memory given to the MQ server process.
It also looks as though the disk storage layout is not particularly efficient - perhaps each message is stored as a file in a single directory? This can make access time rise, as traversing the disk directory takes longer.
50000 messages in > 20 mins seems very low performance.
The following configuration works well for me (these are just pointers; you may already have tried some of them, but see if they work for you):
1) Server and queue/topic policy entry
// server
server.setDedicatedTaskRunner(false);
// queue policy entry
policyEntry.setMemoryLimit(queueMemoryLimit); // 32mb
policyEntry.setOptimizedDispatch(true);
policyEntry.setLazyDispatch(true);
policyEntry.setReduceMemoryFootprint(true);
policyEntry.setProducerFlowControl(true);
policyEntry.setPendingQueuePolicy(new StorePendingQueueMessageStoragePolicy());
2) If you are using KahaDB for persistence, use a per-destination adapter (MultiKahaDBPersistenceAdapter). This keeps the storage folders separate for each destination and reduces synchronization effort. Also, if you are not worried about abrupt server restarts (for whatever technical reason), you can reduce the disk sync effort with:
kahaDBPersistenceAdapter.setEnableJournalDiskSyncs(false);
3) Try increasing the memory, temp and store disk usage limits at the server level.
4) If possible, increase prefetchSize in the prefetch policy. This will improve performance but also increases the memory footprint of consumers.
5) If possible, use transactions in consumers. This will help to reduce the message acknowledgement handling and disk sync efforts by the server.
Point 5 mentioned by #hemant1900 solved the problem :) Thanks.
5) If possible, use transactions in consumers. This will help to reduce the message acknowledgement handling and disk sync efforts by the server.
The problem was in my code. I had not used a transaction to persist the data in the consumer, which is bad programming anyway.. I know :(
But I didn't expect that it could have caused this issue.
Now 50000 messages are getting processed in less than 2 mins.
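For anyone hitting the same issue, here is a rough sketch of a transacted JMS consumer (not the original code; the broker URL, queue name and commit batch size are placeholders):

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Queue;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;

public class TransactedConsumer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616"); // placeholder
        Connection connection = factory.createConnection();
        connection.start();

        // Transacted session: acknowledgements are batched and only sent on commit,
        // which cuts down per-message ack handling and disk syncs on the broker.
        Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
        Queue queue = session.createQueue("TEST.QUEUE"); // placeholder queue name
        MessageConsumer consumer = session.createConsumer(queue);

        int received = 0;
        Message message;
        while ((message = consumer.receive(1000)) != null) {
            // Persist the message contents to the database here.
            if (++received % 100 == 0) {
                session.commit(); // commit in batches of 100 (example batch size)
            }
        }
        session.commit(); // commit any remaining messages
        connection.close();
    }
}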

How to minimize the latency involved in kafka messaging framework?

Scenario: I have a low-volume topic (~150 msgs/sec) for which we would like to have a low propagation delay from producer to consumer.
I added a timestamp at the producer and read it at the consumer to record the propagation delay. With default configurations, a message (of 20 bytes) showed a propagation delay of 1960 ms to 1230 ms. No network delay is involved, since I tried this with 1 producer and 1 simple consumer on the same machine.
When I tried adjusting the topic flush interval to 20 ms, it dropped to 1100 ms to 980 ms. Then I tried adjusting the consumer's "fetcher.backoff.ms" to 10 ms, and it dropped to 1070 ms - 860 ms.
Issue: For a 20-byte message, I would like the propagation delay to be as low as possible, and ~950 ms is a high figure.
Question: Is there anything I am missing in the configuration?
I welcome comments on the minimum delay you have achieved.
Assumption: The Kafka system involves disk I/O before the consumer gets the message from the producer, and this depends on the hard disk RPM and so on.
Update:
I tried tuning the log flush policy for durability and latency. The following is the configuration:
# The number of messages to accept before forcing a flush of data to disk
log.flush.interval=10
# The maximum amount of time a message can sit in a log before we force a flush
log.default.flush.interval.ms=100
# The interval (in ms) at which logs are checked to see if they need to be
# flushed to disk.
log.default.flush.scheduler.interval.ms=100
For the same 20-byte message, the delay was 740 ms - 880 ms.
The following statements are made clear in the configuration itself.
There are a few important trade-offs:
Durability: Unflushed data is at greater risk of loss in the event of a crash.
Latency: Data is not made available to consumers until it is flushed (which adds latency).
Throughput: The flush is generally the most expensive operation.
So I believe there is no way to get down to the 150 ms - 250 ms mark (without a hardware upgrade).
I am not trying to dodge the question but I think that kafka is a poor choice for this use case. While I think Kafka is great (I have been a huge proponent of its use at my workplace), its strength is not low-latency. Its strengths are high producer throughput and support for both fast and slow consumers. While it does provide durability and fault tolerance, so do more general purpose systems like rabbitMQ. RabbitMQ also supports a variety of different clients including node.js. Where rabbitMQ falls short when compared to Kafka is when you are dealing with extremely high volumes (say 150K msg/s). At that point, Rabbit's approach to durability starts to fall apart and Kafka really stands out. The durability and fault tolerance capabilities of rabbit are more than capable at 20K msg/s (in my experience).
Also, to achieve such high throughput, Kafka deals with messages in batches. While the batches are small and their size is configurable, you can't make them too small without incurring a lot of overhead. Unfortunately, message batching makes low-latency very difficult. While you can tune various settings in Kafka, I wouldn't use Kafka for anything where latency needed to be consistently less than 1-2 seconds.
Also, Kafka 0.7.2 is not a good choice if you are launching a new application. All of the focus is on 0.8 now, so you will be on your own if you run into problems, and I definitely wouldn't expect any new features. For future stable releases, follow the stable Kafka release link here.
Again, I think Kafka is great for some very specific, though popular, use cases. At my workplace we use both Rabbit and Kafka. While that may seem gratuitous, they really are complementary.
I know it's been over a year since this question was asked, but I've just built up a Kafka cluster for dev purposes, and we're seeing <1ms latency from producer to consumer. My cluster consists of three VM nodes running on a cloud VM service (Skytap) with SAN storage, so it's far from ideal hardware. I'm using Kafka 0.9.0.0, which is new enough that I'm confident the asker was using something older. I have no experience with older versions, so you might get this performance increase simply from an upgrade.
I'm measuring latency by running a Java producer and consumer I wrote. Both run on the same machine, on a fourth VM in the same Skytap environment (to minimize network latency). The producer records the current time (System.nanoTime()), uses that value as the payload in an Avro message, and sends (acks=1). The consumer is configured to poll continuously with a 1ms timeout. When it receives a batch of messages, it records the current time (System.nanoTime() again), then subtracts the receive time from the send time to compute latency. When it has 100 messages, it computes the average of all 100 latencies and prints to stdout. Note that it's important to run the producer and consumer on the same machine so that there is no clock sync issue with the latency computation.
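A rough reconstruction of that measurement loop might look like the sketch below (this is not the original code: it uses the current Java client and a plain long payload instead of Avro, and the topic name and bootstrap address are placeholders):

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LatencyProbe {
    public static void main(String[] args) {
        String topic = "latency-test"; // placeholder topic name

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092"); // placeholder
        producerProps.put("acks", "1");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.LongSerializer");

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092"); // placeholder
        consumerProps.put("group.id", "latency-probe");
        consumerProps.put("auto.offset.reset", "latest");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.LongDeserializer");

        try (KafkaProducer<String, Long> producer = new KafkaProducer<>(producerProps);
             KafkaConsumer<String, Long> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(Collections.singletonList(topic));

            long totalNanos = 0;
            int count = 0;
            while (count < 100) {
                // The payload is the send time; producer and consumer share a clock
                // because they run in the same process, so there is no clock-sync issue.
                producer.send(new ProducerRecord<>(topic, System.nanoTime()));
                ConsumerRecords<String, Long> records = consumer.poll(Duration.ofMillis(1));
                for (ConsumerRecord<String, Long> record : records) {
                    totalNanos += System.nanoTime() - record.value();
                    count++;
                }
            }
            System.out.printf("average latency over %d messages: %.3f ms%n",
                count, totalNanos / (count * 1_000_000.0));
        }
    }
}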
I've played quite a bit with the volume of messages generated by the producer. There is definitely a point where there are too many and latency starts to increase, but it's substantially higher than 150/sec. The occasional message takes as much as 20ms to deliver, but the vast majority are between 0.5ms and 1.5ms.
All of this was accomplished with Kafka 0.9's default configurations. I didn't have to do any tweaking. I used batch-size=1 for my initial tests, but I found later that it had no effect at low volume and imposed a significant limit on the peak volume before latencies started to increase.
It's important to note that when I run my producer and consumer on my local machine, the exact same setup reports message latencies in the 100ms range -- the exact same latencies reported if I simply ping my Kafka brokers.
I'll edit this message later with sample code from my producer and consumer along with other details, but I wanted to post something before I forget.
EDIT, four years later:
I just got an upvote on this, which led me to come back and re-read. Unfortunately (but actually fortunately), I no longer work for that company, and no longer have access to the code I promised I'd share. Kafka has also matured several versions since 0.9.
Another thing I've learned in the ensuing time is that Kafka latencies increase when there is not much traffic. This is due to the way the clients use batching and threading to aggregate messages. It's very fast when you have a continuous stream of messages, but any time there is a moment of "silence", the next message will have to pay the cost to get the stream moving again.
It's been some years since I was deep in Kafka tuning. Looking at the latest version (2.5 -- producer configuration docs here), I can see that they've decreased linger.ms (the amount of time a producer will wait before sending a message, in hopes of batching up more than just the one) to zero by default, meaning that the aforementioned cost to get moving again should not be a thing. As I recall, in 0.9 it did not default to zero, and there was some tradeoff to setting it to such a low value. I'd presume that the producer code has been modified to eliminate or at least minimize that tradeoff.
Modern versions of Kafka seem to have pretty minimal latency as the results from here show:
2 ms (median)
3 ms (99th percentile)
14 ms (99.9th percentile)
Kafka can achieve around millisecond latency by using synchronous messaging. With synchronous messaging, the producer does not collect messages into a batch before sending.
bin/kafka-console-producer.sh --broker-list my_broker_host:9092 --topic test --sync
The following has the same effect:
--batch-size 1
If you are using librdkafka as Kafka client library, you must also set socket.nagle.disable=True
See https://aivarsk.com/2021/11/01/low-latency-kafka-producers/ for some ideas on how to see what is taking those milliseconds.
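With the Java client, roughly the same effect comes from disabling lingering and blocking on each send; a minimal sketch (broker address and topic name are placeholders):

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LowLatencyProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.LINGER_MS_CONFIG, "0"); // do not wait to fill a batch
        props.put(ProducerConfig.ACKS_CONFIG, "1");      // leader-only ack for lower latency
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
            "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
            "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Blocking on the returned future makes each send effectively synchronous,
            // similar to the console producer's --sync flag.
            producer.send(new ProducerRecord<>("test", "hello")).get();
        }
    }
}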