Storm large window size causing executor to be killed by Nimbus - apache-zookeeper

I have a Java Spring application that submits topologies to a Storm (1.1.2) Nimbus based on a DTO that describes the structure of the topology.
This is working great except for very large windows. I am testing it with several different sliding and tumbling windows. None give me any trouble except a 24-hour sliding window that advances every 15 minutes. The topology receives ~250 messages/s from Kafka and simply windows them using a simple timestamp extractor with a 3-second lag (much like all the other topologies I am testing).
I have experimented extensively with the number of workers and the memory allowances to try to figure this out, but my default configuration is 1 worker with a 2048 MB heap. I've also tried reducing the lag, which had minimal effect.
I think it's possible the window is getting too large and the worker is running out of memory, which delays the heartbeats or the Zookeeper check-in, which in turn causes Nimbus to kill the worker.
What happens is that every so often (roughly every 11 window advances) the Nimbus logs report that the executor for that topology is "not alive", and the worker logs for that topology show either a KeeperException where the topology can't communicate with Zookeeper, or a java.lang.ExceptionInInitializerError: null with a nested PrivilegedActionException.
When the topology is assigned a new worker, the aggregation I was doing is lost. I assume this is happening because the window is holding at least 250*60*15*11 messages (messagesPerSecond * secondsPerMinute * 15 minutes * window advances before the crash), i.e. roughly 2.5 million messages at around 84 bytes each, or about 200 MB of raw payload. A complete window would hold 250*60*15*97 messages (the 96 15-minute increments in 24 hours plus one expired window), i.e. roughly 21.8 million messages, which is about 1.8 GB if my math is right. So I feel like the worker memory should cover the full window, or at least more than 11 window advances' worth.
I could increase the memory slightly, but not by much. I could also decrease the memory per worker and increase the number of workers per topology, but I was wondering if there is something I'm missing. Could I just increase the worker heartbeat timeout so that the executor has more time to check in before being killed, or would that be bad for some reason? If I changed the heartbeat it would be in the Config map for the topology. Thanks!

This was caused by the workers running out of memory. From looking at the Storm code, it appears Storm keeps every message in a window around as a Tuple (which is a fairly large object). With a high message rate and a 24-hour window, that's a lot of memory.
I fixed this by adding a preliminary bucketing bolt that aggregates all the tuples in an initial 1-minute window, which reduced the load on the main window significantly because it now receives one tuple per minute. The bucketing window doesn't run out of memory since it only holds one minute of tuples at a time.
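A minimal sketch of that two-stage layout, assuming Storm 1.x's windowed-bolt API; the field names ("value", "ts"), component ids, and the count/sum aggregate are placeholders, not the original code:

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseWindowedBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;
import org.apache.storm.windowing.TupleWindow;

public class BucketedWindowTopology {

    // Stage 1: collapse one minute of raw tuples into a single summary tuple.
    public static class MinuteBucketBolt extends BaseWindowedBolt {
        private OutputCollector collector;

        @Override
        public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void execute(TupleWindow window) {
            long count = 0;
            double sum = 0;
            for (Tuple t : window.get()) {
                count++;
                sum += t.getDoubleByField("value"); // hypothetical payload field
            }
            collector.emit(new Values(System.currentTimeMillis(), count, sum));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("ts", "count", "sum"));
        }
    }

    // Stage 2: the 24h/15min sliding window now sees ~1 tuple per minute instead of ~250/s.
    public static class DailyAggregateBolt extends BaseWindowedBolt {
        @Override
        public void execute(TupleWindow window) {
            long total = 0;
            for (Tuple t : window.get()) {
                total += t.getLongByField("count");
            }
            // ... emit or store the 24-hour aggregate here ...
        }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        // Kafka spout omitted; "kafka-spout" is a placeholder component id.
        builder.setBolt("minute-bucket", new MinuteBucketBolt()
                .withTumblingWindow(BaseWindowedBolt.Duration.minutes(1))
                .withTimestampField("ts") // placeholder timestamp field
                .withLag(BaseWindowedBolt.Duration.seconds(3)), 1)
            .shuffleGrouping("kafka-spout");
        builder.setBolt("daily-window", new DailyAggregateBolt()
                .withWindow(BaseWindowedBolt.Duration.hours(24), BaseWindowedBolt.Duration.minutes(15)), 1)
            .shuffleGrouping("minute-bucket");
        // builder.createTopology() is then submitted as usual.
    }
}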

Related

Druid Cluster going into Restricted Mode

We have a Druid Cluster with the following specs
3X Coordinators & Overlords - m5.2xlarge
6X Middle Managers (ingest nodes with 5 slots) - m5d.4xlarge
8X Historical - i3.4xlarge
2X Router & Broker - m5.2xlarge
The cluster often goes into restricted mode:
All calls to the cluster get rejected with a 502 error.
Even with 30 available slots for index_parallel tasks, the cluster only runs 10 at a time and the other tasks go into a waiting state.
Loader task submission time has been increasing monotonically from 1s, 2s, .., 6s, .., 10s (we submit a job to load data from S3); after recycling the cluster the submission time decreases, then increases again over a period of time.
We submit around 100 jobs per minute but need to scale to 300 to catch up with our current incoming load.
Could someone help us with these questions:
How should we tune the specs of the cluster?
What parameters should be optimized to run the maximum number of tasks in parallel without increasing the load on the master nodes?
Why is the loader task submission time increasing, and what parameters should be monitored here?
At 100 jobs per minute, the Overlord is probably being overloaded.
The Overlord initiates a job by communicating with the MiddleManagers across the cluster. It defines the tasks that each MiddleManager will need to complete and it monitors task progress until completion. The startup of each job has some overhead, so that many jobs will likely keep the Overlord busy and prevent it from processing the other jobs you are submitting. This might explain why job submission time increases over time. You could increase the resources on the Overlord, but this sounds like there may be a better way to ingest the data.
The recommendation would be to use far fewer jobs and have each job do more work.
If the flow of data is as continuous as you describe, a Kafka topic may be the best target, followed by a Druid Kafka ingestion job, which is fully scalable.
If you need to do batch ingestion, a single index_parallel job that reads many files would be much more efficient than many small jobs.
Also consider that each task in an ingestion job creates a set of segments. By running a lot of very small jobs, you create a lot of very small segments, which is not ideal for query performance. Here is some info on how to think about segment size optimization which I think might help.

How to fix: java.lang.OutOfMemoryError: Direct buffer memory in flink kafka consumer

We are running a 5-node Flink cluster (1.6.3) on Kubernetes, with a 5-partition Kafka topic as the source.
5 jobs read from that topic (each with a different consumer group), each with parallelism = 5.
Each task manager runs with 10 GB of RAM and the task manager heap size is limited to 2 GB.
The ingestion load is rather small (100-200 msgs per second) and the average message size is ~4-8 KB.
All jobs run fine for a few hours, then we suddenly see one or more jobs failing on:
java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:666)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
at sun.nio.ch.Util.getTemporaryDirectBuffer(Util.java:241)
at sun.nio.ch.IOUtil.read(IOUtil.java:195)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at org.apache.kafka.common.network.PlaintextTransportLayer.read(PlaintextTransportLayer.java:110)
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:97)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:71)
at org.apache.kafka.common.network.KafkaChannel.receive(KafkaChannel.java:169)
at org.apache.kafka.common.network.KafkaChannel.read(KafkaChannel.java:150)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:355)
at org.apache.kafka.common.network.Selector.poll(Selector.java:303)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:349)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:226)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1047)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:995)
at org.apache.flink.streaming.connectors.kafka.internal.KafkaConsumerThread.run(KafkaConsumerThread.java:257)
Flink restarts the job, but it keeps failing with that exception.
We've tried reducing the record poll size as suggested here:
Kafka Consumers throwing java.lang.OutOfMemoryError: Direct buffer memory
We also tried increasing the Kafka heap size as suggested here:
Flink + Kafka, java.lang.OutOfMemoryError when parallelism > 1, although I fail to understand how failing to allocate memory in the Flink process has anything to do with the JVM memory of the Kafka broker process, and I don't see anything indicating an OOM in the broker logs.
What might be the cause of this failure? What else should we check?
Thanks!
One thing you may have underestimated is that a parallelism of 5 means there are 5+4+3+2+1 = 15 pairs of combinations. Compared to the linked thread, there were likely 3+2+1 = 6 combinations there.
In the linked thread the problem was resolved by setting the max poll records to 250, so my first thought would be to set it to 80 here (or even to 10) and see if that resolves the problem.
(I am not sure the requirements are shaped this way, but the only noticeable difference is the parallelism going from 3 to 5, so that seems like a good place to compensate.)
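A minimal sketch of where that setting goes, assuming the Flink 1.6 Kafka 0.11 connector; the broker address, topic, group id, and the extra fetch-size cap are placeholders. The properties are passed straight through to the underlying KafkaConsumer:

import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;

public class LowDirectMemoryJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder
        props.setProperty("group.id", "job-1-group");         // placeholder
        // Fewer records per poll() means smaller direct buffers per fetch.
        props.setProperty("max.poll.records", "80");
        // Optionally also cap the per-partition fetch size (256 KB here, an assumption),
        // which bounds the direct-buffer allocations more directly.
        props.setProperty("max.partition.fetch.bytes", "262144");

        env.addSource(new FlinkKafkaConsumer011<>("my-topic", new SimpleStringSchema(), props))
           .print();

        env.execute("low-direct-memory-job");
    }
}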

Spark over Yarn some tasks are extremely slower

I am using a cluster of 12 virtual machines, each of which has 16 GB of memory and 6 cores (except the master node, which has only 2 cores). Each worker node was assigned 12 GB of memory and 4 cores.
When I submit a Spark application to YARN, I set the number of executors to 10 (leaving 1 node for the master and 1 for the application master), and to maximize the parallelism of my application, most of my RDDs have 40 partitions, the same as the total number of executor cores.
The problem I encountered is this: in some random stages, some tasks take far longer than others, which results in poor parallelism. As the first screenshot shows, executor 9 spent over 30 s on its tasks while other tasks finished within 1 s. Furthermore, the reason for the extra time is also random: sometimes it is just computation, but sometimes scheduler delay, deserialization, or shuffle read. As we can see, the reason in the second screenshot differs from the first.
My guess is that once a task is assigned to a specific slot, there are not enough resources on the corresponding machine, so the JVM is waiting for CPUs. Is my guess correct? And how should I configure my cluster to avoid this situation?
(Screenshots: one stage where the slow task is dominated by computing time, another where it is dominated by scheduler delay and deserialization.)
To get a specific answer you need to share more about what you're doing, but most likely the partitions in one or more of your stages are unbalanced, i.e. some are much bigger than others. The result is a slowdown, since each partition is handled by a single task. One way to solve it is to increase the number of partitions or change the partitioning logic, as in the sketch below.
When a big task finishes, shipping its data to the downstream tasks also takes longer, which is why other tasks may take long as well.
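A minimal sketch of those two remedies using Spark's Java API; the input path, the key extraction, and the partition counts are placeholders:

import org.apache.spark.HashPartitioner;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class RebalanceExample {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("rebalance-example"));

        // Placeholder job: count occurrences of each line in an input file.
        JavaPairRDD<String, Long> pairs = sc.textFile("hdfs:///input/path") // placeholder path
                .mapToPair(line -> new Tuple2<String, Long>(line, 1L));
        JavaPairRDD<String, Long> counts = pairs.reduceByKey(Long::sum);

        // Remedy 1: use more partitions so each task handles less data.
        JavaPairRDD<String, Long> morePartitions = counts.repartition(120);

        // Remedy 2: control how keys map to partitions, e.g. a HashPartitioner with more
        // buckets, or a custom Partitioner that splits known hot keys.
        JavaPairRDD<String, Long> rebalanced = counts.partitionBy(new HashPartitioner(120));

        morePartitions.count();
        rebalanced.count();
        sc.stop();
    }
}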

JBoss ActiveMQ 6.1.0 queue message processing slows down after 10000 messages

Below is the configuration:
2 JBoss application nodes
5 listeners on each application node, with 50 threads each; the listeners support clustering and are set up as active-active, so they run on both app nodes
The listener simply gets the message and logs the information into a database
50000 messages are posted into ActiveMQ using JMeter.
Here is the observation on first execution:
Total 50000 messages are consumed in approx 22 mins.
first 0-10000 messages consumed in 1 min approx
10000-20000 messages consumed in 2 mins approx
20000-30000 messages consumed in 4 mins approx
30000-40000 messages consumed in 6 mins approx
40000-50000 messages consumed in 8 mins
So the message consumption time increases as more messages are processed.
Second execution without restarting any of the servers:
50000 messages consumed in approx 53 mins!
But after deleting the ActiveMQ data folder and restarting ActiveMQ, performance improves again, then degrades as more data enters the queue.
I tried multiple configurations in activemq.xml, but with no success...
Has anybody faced a similar issue and found a solution? Let me know. Thanks.
I've seen similar slowdowns in our production systems when pending message counts get high. If you're flooding the queues, the MQ process can't keep all the pending messages in memory and has to go to disk to serve a message. Performance can fall off a cliff in these circumstances. Increase the memory given to the MQ server process.
It also looks as though the disk storage layout is not particularly efficient; perhaps each message is stored as a file in a single directory? This can make access times rise, as traversing the disk directory takes longer.
50000 messages in over 20 minutes seems like very low performance.
The following configuration works well for me (these are just pointers; you may already have tried some of them, but see if they work for you):
1) Server and queue/topic policy entry
// server
server.setDedicatedTaskRunner(false);
// queue policy entry
policyEntry.setMemoryLimit(queueMemoryLimit); // 32mb
policyEntry.setOptimizedDispatch(true);
policyEntry.setLazyDispatch(true);
policyEntry.setReduceMemoryFootprint(true);
policyEntry.setProducerFlowControl(true);
policyEntry.setPendingQueuePolicy(new StorePendingQueueMessageStoragePolicy());
2) If you are using KahaDB for persistence, use a per-destination adapter (MultiKahaDBPersistenceAdapter). This keeps the storage folders separate for each destination and reduces synchronization effort. Also, if you are not worried about abrupt server restarts (for whatever technical reason), you can reduce the disk sync effort with
kahaDBPersistenceAdapter.setEnableJournalDiskSyncs(false);
3) Try increasing the memory usage, temp usage, and store usage limits at the server level.
4) If possible, increase prefetchSize in the prefetch policy. This will improve performance but also increase the memory footprint of consumers.
5) If possible, use transactions in consumers, as sketched below. This will help reduce the message acknowledgement handling and disk sync effort on the server.
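A minimal sketch of point 5, assuming a plain JMS consumer built with the ActiveMQ client; the broker URL, queue name, and batch size are placeholders. Acknowledgements are folded into periodic commits instead of being handled per message:

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.JMSException;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;

public class TransactedConsumer {
    public static void main(String[] args) throws JMSException {
        ConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616"); // placeholder
        Connection connection = factory.createConnection();
        connection.start();

        // true = transacted session; the acknowledge-mode argument is ignored in that case.
        Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
        MessageConsumer consumer = session.createConsumer(session.createQueue("TEST.QUEUE")); // placeholder

        int inBatch = 0;
        while (true) {
            Message message = consumer.receive(1000);
            if (message != null) {
                // ... log the information into the database here ...
                inBatch++;
            }
            // Commit (and thereby acknowledge) in batches of 100, or when the queue goes quiet.
            if (inBatch > 0 && (inBatch >= 100 || message == null)) {
                session.commit();
                inBatch = 0;
            }
        }
    }
}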
Point 5 mentioned by #hemant1900 solved the problem :) Thanks.
5) If possible, use transactions in consumers. This will help reduce the message acknowledgement handling and disk sync effort on the server.
The problem was in my code. I had not used a transaction to persist the data in the consumer, which is bad programming anyway, I know :(
But I didn't expect that it could have caused this issue.
Now the 50000 messages are processed in less than 2 mins.

How to minimize the latency involved in kafka messaging framework?

Scenario: I have a low-volume topic (~150 msgs/sec) for which we would like a low propagation delay from producer to consumer.
I add a timestamp at the producer and read it at the consumer to record the propagation delay. With default configurations, a msg (of 20 bytes) showed a propagation delay of 1960 ms to 1230 ms. No network delay is involved, since I tried this with 1 producer and 1 simple consumer on the same machine.
When I adjusted the topic flush interval to 20 ms, the delay dropped to 1100 ms to 980 ms. Then I adjusted the consumer's "fetcher.backoff.ms" to 10 ms, and it dropped to 1070 ms - 860 ms.
Issue: For a 20-byte msg, I would like the propagation delay to be as low as possible, and ~950 ms is a high figure.
Question: Is there anything I am missing in the configuration?
Comments are welcome on the minimum delay you have achieved.
Assumption: The Kafka system involves disk I/O before the consumer gets the msg from the producer, and this depends on the hard disk RPM and so on.
Update:
Tried tuning the log flush policy for durability & latency. Following is the configuration:
# The number of messages to accept before forcing a flush of data to disk
log.flush.interval=10
# The maximum amount of time a message can sit in a log before we force a flush
log.default.flush.interval.ms=100
# The interval (in ms) at which logs are checked to see if they need to be
# flushed to disk.
log.default.flush.scheduler.interval.ms=100
For the same 20-byte msg, the delay was 740 ms - 880 ms.
The following statements are made clear in the configuration itself.
There are a few important trade-offs:
Durability: Unflushed data is at greater risk of loss in the event of a crash.
Latency: Data is not made available to consumers until it is flushed (which adds latency).
Throughput: The flush is generally the most expensive operation.
So I believe there is no way to get down to the 150 ms - 250 ms mark (without a hardware upgrade).
I am not trying to dodge the question, but I think Kafka is a poor choice for this use case. While I think Kafka is great (I have been a huge proponent of its use at my workplace), its strength is not low latency. Its strengths are high producer throughput and support for both fast and slow consumers. While it does provide durability and fault tolerance, so do more general-purpose systems like RabbitMQ. RabbitMQ also supports a variety of different clients, including node.js. Where RabbitMQ falls short compared to Kafka is when you are dealing with extremely high volumes (say 150K msg/s). At that point, Rabbit's approach to durability starts to fall apart and Kafka really stands out. The durability and fault tolerance capabilities of Rabbit are more than capable at 20K msg/s (in my experience).
Also, to achieve such high throughput, Kafka deals with messages in batches. While the batches are small and their size is configurable, you can't make them too small without incurring a lot of overhead. Unfortunately, message batching makes low-latency very difficult. While you can tune various settings in Kafka, I wouldn't use Kafka for anything where latency needed to be consistently less than 1-2 seconds.
Also, Kafka 0.7.2 is not a good choice if you are launching a new application. All of the focus is on 0.8 now, so you will be on your own if you run into problems, and I definitely wouldn't expect any new features. For future stable releases, follow the stable Kafka release link here.
Again, I think Kafka is great for some very specific, though popular, use cases. At my workplace we use both Rabbit and Kafka. While that may seem gratuitous, they really are complementary.
I know it's been over a year since this question was asked, but I've just built up a Kafka cluster for dev purposes, and we're seeing <1ms latency from producer to consumer. My cluster consists of three VM nodes running on a cloud VM service (Skytap) with SAN storage, so it's far from ideal hardware. I'm using Kafka 0.9.0.0, which is new enough that I'm confident the asker was using something older. I have no experience with older versions, so you might get this performance increase simply from an upgrade.
I'm measuring latency by running a Java producer and consumer I wrote. Both run on the same machine, on a fourth VM in the same Skytap environment (to minimize network latency). The producer records the current time (System.nanoTime()), uses that value as the payload in an Avro message, and sends it (acks=1). The consumer is configured to poll continuously with a 1 ms timeout. When it receives a batch of messages, it records the current time (System.nanoTime() again), then subtracts the send time from the receive time to compute latency. When it has 100 messages, it computes the average of all 100 latencies and prints it to stdout. Note that it's important to run the producer and consumer on the same machine so that there is no clock-sync issue in the latency computation.
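A minimal sketch of that kind of measurement, not the author's code (which is no longer available): it uses the plain Java clients with a raw Long payload instead of Avro, and the broker address and topic name are placeholders:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class LatencyProbe {
    static final String TOPIC = "latency-test";     // placeholder
    static final String BROKERS = "localhost:9092"; // placeholder

    public static void main(String[] args) {
        new Thread(LatencyProbe::produce).start();
        consume();
    }

    static void produce() {
        Properties p = new Properties();
        p.put("bootstrap.servers", BROKERS);
        p.put("acks", "1");
        p.put("key.serializer", "org.apache.kafka.common.serialization.LongSerializer");
        p.put("value.serializer", "org.apache.kafka.common.serialization.LongSerializer");
        try (KafkaProducer<Long, Long> producer = new KafkaProducer<>(p)) {
            while (true) {
                // The payload is the send time in nanoseconds.
                producer.send(new ProducerRecord<>(TOPIC, System.nanoTime()));
                try { Thread.sleep(7); } catch (InterruptedException e) { return; } // ~150 msg/s
            }
        }
    }

    static void consume() {
        Properties p = new Properties();
        p.put("bootstrap.servers", BROKERS);
        p.put("group.id", "latency-probe");
        p.put("key.deserializer", "org.apache.kafka.common.serialization.LongDeserializer");
        p.put("value.deserializer", "org.apache.kafka.common.serialization.LongDeserializer");
        try (KafkaConsumer<Long, Long> consumer = new KafkaConsumer<>(p)) {
            consumer.subscribe(Collections.singletonList(TOPIC));
            long sum = 0;
            int n = 0;
            while (true) {
                ConsumerRecords<Long, Long> records = consumer.poll(Duration.ofMillis(1));
                long now = System.nanoTime();
                for (ConsumerRecord<Long, Long> r : records) {
                    sum += now - r.value(); // receive time minus send time
                    if (++n == 100) {
                        System.out.printf("avg latency over 100 msgs: %.2f ms%n", sum / 100.0 / 1_000_000);
                        sum = 0;
                        n = 0;
                    }
                }
            }
        }
    }
}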
I've played quite a bit with the volume of messages generated by the producer. There is definitely a point where there are too many and latency starts to increase, but it's substantially higher than 150/sec. The occasional message takes as much as 20ms to deliver, but the vast majority are between 0.5ms and 1.5ms.
All of this was accomplished with Kafka 0.9's default configurations. I didn't have to do any tweaking. I used batch-size=1 for my initial tests, but I found later that it had no effect at low volume and imposed a significant limit on the peak volume before latencies started to increase.
It's important to note that when I run my producer and consumer on my local machine, the exact same setup reports message latencies in the 100ms range -- the exact same latencies reported if I simply ping my Kafka brokers.
I'll edit this message later with sample code from my producer and consumer along with other details, but I wanted to post something before I forget.
EDIT, four years later:
I just got an upvote on this, which led me to come back and re-read. Unfortunately (but actually fortunately), I no longer work for that company, and no longer have access to the code I promised I'd share. Kafka has also matured several versions since 0.9.
Another thing I've learned in the ensuing time is that Kafka latencies increase when there is not much traffic. This is due to the way the clients use batching and threading to aggregate messages. It's very fast when you have a continuous stream of messages, but any time there is a moment of "silence", the next message will have to pay the cost to get the stream moving again.
It's been some years since I was deep in Kafka tuning. Looking at the latest version (2.5 -- producer configuration docs here), I can see that they've decreased linger.ms (the amount of time a producer will wait before sending a message, in hopes of batching up more than just the one) to zero by default, meaning that the aforementioned cost to get moving again should not be a thing. As I recall, in 0.9 it did not default to zero, and there was some tradeoff to setting it to such a low value. I'd presume that the producer code has been modified to eliminate or at least minimize that tradeoff.
Modern versions of Kafka seem to have pretty minimal latency as the results from here show:
2 ms (median)
3 ms (99th percentile)
14 ms (99.9th percentile)
Kafka can achieve around millisecond latency by using synchronous messaging. With synchronous messaging, the producer does not collect messages into a batch before sending.
bin/kafka-console-producer.sh --broker-list my_broker_host:9092 --topic test --sync
The following has the same effect:
--batch-size 1
If you are using librdkafka as Kafka client library, you must also set socket.nagle.disable=True
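A rough Java-client equivalent of the console flags above, assuming the standard producer configuration properties; the broker, topic, and payload are placeholders. linger.ms=0 and a tiny batch.size (which is in bytes) make the producer send each record immediately instead of waiting to fill a batch, and blocking on the returned future mirrors the --sync behaviour:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class SyncLikeProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "my_broker_host:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("linger.ms", "0");  // do not wait to accumulate a batch
        props.put("batch.size", "1"); // a tiny byte limit effectively disables batching
        props.put("acks", "1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // get() blocks until the broker acknowledges the record, like --sync.
            producer.send(new ProducerRecord<>("test", "hello")).get();
        }
    }
}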
See https://aivarsk.com/2021/11/01/low-latency-kafka-producers/ for some ideas on how to see what is taking those milliseconds.