Kafka Streams "Suppressed" feature causes OOM / heavy GC - apache-kafka

I use Kafka Streams 2.1 and created the following stream, using the Suppressed feature, to process the aggregation of each whole minute:
originStream
    .windowedBy(TimeWindows.of(Duration.ofSeconds(60)).grace(Duration.ofMillis(500)))
    .aggregate(factory::createAggregation,
               (k, v, a) -> a.aggregate(v),
               materialized.withLoggingDisabled())
    .suppress(untilWindowCloses(Suppressed.BufferConfig.unbounded()))
    .toStream();
The rate of messages I receive is about 200 per second.
After a short time I see the GC starting to work very hard, and sometimes OOM errors.
Since I use a heap of 2GB and a record will not take more than 1KB, it is clear to me that something is wrong - there shouldn't be enough messages in a 1-minute window to blow up a 2GB heap.
So... I took a heap dump, in which I see 5 InMemoryTimeOrderedKeyValueBuffer objects taking more than 300MB each (total >1.5GB).
I dug some more into one of those, and saw that the lowest/highest timestamps in its sortedMap were 1,575,458,160,000/1,575,481,800,000. This means that the buffer holds messages spanning a period of 23,640,000 ms = 394 minutes.
To my understanding the buffer was supposed to be flushed, so that only the last minute consumes memory - all other windows should have been evicted.
Am I doing something wrong?
Any help would be appreciated.

The problem should not be suppress() but the aggregation state store. By default, it has a retention time of 1 day. You can reduce the retention time by passing in Materialized.withRetention(...) into aggregate().
I am surprised that your heap dump shows InMemoryTimeOrderedKeyValueBuffer though, because this is the store used by suppress(). Hence, I am not 100% sure if reducing the retention time will fix the issue.
Btw: there are a few bugs in suppress() in the 2.1 version that are only fixed in the 2.3 release, and thus it's highly recommended to upgrade to 2.3 if you use suppress().
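For example, the Materialized passed into aggregate() could be built with a smaller retention along these lines (a sketch only; the key/aggregate types, the store name, and the 2-hour value are illustrative, and retention must be at least window size plus grace period):

// Sketch: "agg-store", String and MyAggregation are placeholders for your own
// store name and types; retention just needs to cover window size + grace.
Materialized<String, MyAggregation, WindowStore<Bytes, byte[]>> materialized =
    Materialized.<String, MyAggregation, WindowStore<Bytes, byte[]>>as("agg-store")
        .withRetention(Duration.ofHours(2))   // default retention is 1 day
        .withLoggingDisabled();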

I've changed the BufferConfig to use a max-bytes bound:
Suppressed.BufferConfig.unbounded().withMaxBytes(10_000_000)
and that seemed to solve the problem. I looked at the code and don't understand why, because as I read it, this should now have thrown an exception, but it doesn't.
So, I still don't understand something here, but the problem is solved for now.
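For reference, if the goal is a hard memory cap while still using untilWindowCloses, a strict config seems to be what the API expects (a sketch; the 10 MB limit is just the value I used above):

// maxBytes(...).shutDownWhenFull() yields a StrictBufferConfig, which is the type
// untilWindowCloses() is declared to take, unlike unbounded().withMaxBytes(...)
.suppress(untilWindowCloses(
        Suppressed.BufferConfig.maxBytes(10_000_000L).shutDownWhenFull()))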
After that I applied Matthias J. Sax's suggestions too, just to be even safer (thanks).
Edit:
It happened again twice today. This means that what I did did not fix the problem (although it may have changed its frequency).
Right now, I have no solution for this problem.

Related

How can I avoid "Data too large" in ELK / elasticsearch bulk inserts?

I'm sending data daily to my elk-stack via https://metacpan.org/pod/Search::Elasticsearch::Client::7_0::Bulk
Sometimes it happens, more often recently, that I receive a "Data too large" error. The first part of my data was received, but after this error my sending script stops and I end up with incomplete data.
As far as I understood, correct me if I'm wrong, this happens when my stack is experiencing memory issues while processing the data it already received. I assume that, after some time, I could send the rest of the data, because the next day, the same issue occurs: The first bunch of my data is processed, the rest rejected with "Data too large".
I saw that I can add an "on-error" callback, but I have no clue what I can do in it. My idea would be to implement a delay and retry after some time.
Can anyone give me a hint how to achieve this?
Are there any ideas how to avoid the issue in the first place? I already increased heap space some time ago, but after 2 months the issue reoccurred.
You'd need to check your Elasticsearch logs and the full response that Elasticsearch sends back (e.g. was it a 429?). However, heap pressure can cause this, and you'd probably need to dig into why you are experiencing that.
The other option is to reduce the size of the requests that you are sending.
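One way to combine both ideas (delay-and-retry plus smaller requests) is to split the payload into chunks and back off on rejection. A rough Java sketch of the logic, since the question uses the Perl client - sendBulk() is a hypothetical stand-in for whatever bulk call your library provides, and the chunk size and retry limit are arbitrary:

// Sketch only: sendBulk() is hypothetical; smaller bulk requests reduce heap pressure
// on the cluster, and the backoff gives it time to recover before the next attempt.
static void sendAllWithBackoff(List<List<String>> chunks) throws InterruptedException {
    for (List<String> chunk : chunks) {
        long delayMs = 1_000;
        for (int attempt = 1; attempt <= 5; attempt++) {
            try {
                sendBulk(chunk);                 // hypothetical bulk call
                break;                           // chunk accepted, move on to the next one
            } catch (RuntimeException e) {       // e.g. a 429 or "Data too large" rejection
                if (attempt == 5) throw e;       // give up after a few attempts
                Thread.sleep(delayMs);           // wait before retrying...
                delayMs *= 2;                    // ...and back off exponentially
            }
        }
    }
}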
Update: Remembering my "experience" with Java, I simply did a restart of my ELK stack and the next import went through smoothly.
So despite the fact that 512m of memory seems a bit low, it worked after a restart. Will check again today and over the next days.
Increase memory
Schedule a nightly restart

KStreamWindowAggregate 2.0.1 vs 2.5.0: skipping records instead of processing

I've recently upgraded my kafka streams from 2.0.1 to 2.5.0. As a result I'm seeing a lot of warnings like the following:
org.apache.kafka.streams.kstream.internals.KStreamWindowAggregate$KStreamWindowAggregateProcessor Skipping record for expired window. key=[325233] topic=[MY_TOPIC] partition=[20] offset=[661798621] timestamp=[1600041596350] window=[1600041570000,1600041600000) expiration=[1600059629913] streamTime=[1600145999913]
There seems to be new logic in the KStreamWindowAggregate class that checks if a window has closed. If it has closed, the messages are skipped. In 2.0.1 these messages were still processed.
Question
Is there a way to get the same behavior as before? I'm seeing lots of gaps in my data with this upgrade and I'm not sure how to solve this, as previously these gaps were not seen.
The aggregate function that I'm using already deals with windowing and as a result with expired windows. How does this new logic relate to these expiring windows?
Update
While exploring further, I indeed see it to be related to the grace period (in ms). In my custom timestamp extractor (which has the logic to use the timestamp from the payload instead of the normal timestamp), I can see that, for the expired-window warnings, the incoming timestamp is indeed more than 24 hours ahead of the event time from the payload.
I assume this is caused by consumer lags of over 24 hours.
The timestamp extractor's extract method has a partitionTime parameter which, according to the docs, is:
partitionTime - the highest extracted valid timestamp of the current record's partition (could be -1 if unknown)
So is this the create time of the record on the topic? And is there a way to influence this so that my records are no longer skipped?
In 2.0.1 these messages were still processed.
That is a little bit surprising (even if I would need to double-check the code), at least for the default config. By default, store retention time is set to 24h, and thus in 2.0.1 messages older than 24h should also not be processed, as the corresponding state was already purged. If you did change the store retention time (via Materialized#withRetention) to a larger value, you would also need to increase the window grace period via the TimeWindows#grace() method accordingly.
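For example, increasing both together might look roughly like this (a sketch with made-up names; the 30-second window mirrors the one in your warning, and the 25-hour grace just stands for "larger than the default 24h"):

// Keep grace and store retention in sync: retention must cover window size + grace.
Duration grace = Duration.ofHours(25);
Duration windowSize = Duration.ofSeconds(30);

stream
    .groupByKey()
    .windowedBy(TimeWindows.of(windowSize).grace(grace))
    .aggregate(MyAggregate::new,
               (key, value, agg) -> agg.add(value),
               Materialized.<String, MyAggregate, WindowStore<Bytes, byte[]>>as("agg-store")
                   .withRetention(windowSize.plus(grace)));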
The aggregate function that I'm using already deals with windowing and as a result with expired windows. How does this new logic relate to these expiring windows?
Not sure what you mean by this or how you actually do this? The old and new logic are similar with regard to how long a window is stored (retention time config). The new part is the grace period, which you can increase to the same value as the retention time if you wish.
About "partition time": it is computed based on whatever the TimestampExtractor returns. For your case, it's the max of whatever you extracted from the message payload.

Kafka streams reduce after groupby to stream sends partial reduce output on commit [duplicate]

This question already has answers here:
How to send final kafka-streams aggregation result of a time windowed KTable?
(3 answers)
Closed 4 years ago.
We're having an issue where upon doing a groupby --> reduce --> toStream, partial reduce values are being sent downstream when a commit happens during the reduce. So if there are 65 keys to be reduced, and say a commit happens halfway through, the output will be two messages: one partially reduced, the other with all the values reduced.
So here is our case in more detail:
msg --> leftJoin
leftJoin --> flatMap //break msg into parts so we can join again downstream
flatMap --> leftJoin
leftJoin --> groupByKey
groupByKey --> reduce
reduce --> toStream
toStream --> to
Currently, we've come up with a very ugly fix for this, which has to do with adding "index" and "out of" values to each message created during the flatMap phase... we filter out any message emitted by the reduce where index != out of. My feeling is we're not doing something right here or looking at it the wrong way. Please advise on the correct way of doing this.
Thanks.
So if there are 65 keys to be reduced, and say a commit happens halfway through, the output will be two messages: one partially reduced, the other with all the values reduced.
If I understand your description correctly, this is actually intended behavior. For one, it's a tradeoff between processing latency (where you want to see update records as soon as you have a new piece of input data) vs. coalescing multiple update records into fewer or even just a single update record.
The default behavior of Kafka Streams is to favor lower processing latency. That is, it will not wait for "all input data to have arrived" before sending downstream updates. Rather, it will send updates once new data has arrived. Some background information is described at https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/.
Today, you have two main knobs to change/tune this default behavior, which is controlled by (1) Kafka Streams record caches (for the DSL) and (2) the configured commit interval (you already mentioned this).
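To illustrate the two knobs (the values below are arbitrary examples, not recommendations):

// Larger record cache = more coalescing of updates per key before they are forwarded;
// longer commit interval = the cache is flushed downstream less frequently.
Properties props = new Properties();
props.put(StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG, 50 * 1024 * 1024L);  // 50 MB cache
props.put(StreamsConfig.COMMIT_INTERVAL_MS_CONFIG, 30_000);                    // commit every 30 s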
Moving forward, the Kafka community has also been working on a new feature that will allow you to specify that you just want a single, final update record to be sent (rather than what you described as "partial" updates). This new feature, in case you are interested, is described in the Kafka Improvement Proposal KIP-328: Ability to suppress updates for KTables. This is actively being worked on, but it is unlikely to be finished in time for the upcoming Kafka v2.1 release in October.
Currently, we've come up with a very ugly fix for this, which has to do with adding "index" and "out of" values to each message created during the flatMap phase... we filter out any message emitted by the reduce where index != out of. My feeling is we're not doing something right here or looking at it the wrong way. Please advise on the correct way of doing this.
In short, in stream processing you should embrace the nature of how streaming works. In general, you will only have partial/incomplete knowledge of the world, so to speak, or rather: you only know what you observed thus far. So, at any given point in time, you must deal with the situation that more, additional data may arrive that you still have to deal with.
A typical situation is having to deal with late-arriving data, where your application logic must decide whether you want to still integrate and process this data (quite likely) or discard it (sometimes that's the way it needs to be).
Going back to your example:
So if there are 65 keys to be reduced [...]
How would one know it's 65, and not 100 or 28, and so on? One can only tell that: "Thus far, at this point in time, I have received 65. So, what do I do? Do I reduce those 65 because I believe that's all the input? Or do I wait some seconds/minutes/hours longer because there might be 35 more to arrive, but this will mean that I will not send an update/answer downstream until this waiting time has elapsed (which results in higher processing latency)?"
In your situation, I would ask: Why do you consider the streaming behavior of how/when updates are being sent a problem? Perhaps it's because you have a downstream system or application that doesn't know how to handle such streaming updates?
Does that make any sense? Again, the above is based on my understanding of what you described as being the issue.

Kafka Consumer Lag monitoring with linkedIn/Burrow "jumps" intermittently

We're using the latest master build (at the time of this writing: https://github.com/linkedin/Burrow/commit/12e681a3a8a61f84f17677996dc3e6a2b79fac41).
Our Kafka-Brokers are running 1.1.0
We switched recently from https://github.com/Morningstar/kafka-offset-monitor to Burrow, because we're adding authorization to our Clusters.
Now, most of our consumer-lags are 0 most of the time (according to Burrow, whereas on kafka-offset-monitor they were around 1K - 100K most of the time - both are OK from our point of view).
For reasons unknown to us, the consumer lag "jumps" e.g. from 0 to 1.4 Billion(!) from one minute to the next, and back again after another minute. We have about 20 consumers on our main topic, and all of their lags jump - but by different amounts. Some "only" jump from 1k to 1M, others from 0 to the billions described above.
Is anybody else seeing this?
Is there a known reason or do we have to adjust our config? - We didn't change anything about the default config for the evaluation or notifications...
We use https://github.com/rgannu/burrow-graphite to report to graphite, and our alarming system is based on those metrics...
Any help is appreciated

Spark Streaming mapWithState seems to rebuild complete state periodically

I am working on a Scala (2.11) / Spark (1.6.1) streaming project and using mapWithState() to keep track of seen data from previous batches.
The state is distributed in 20 partitions on multiple nodes, created with StateSpec.function(trackStateFunc _).numPartitions(20). In this state we have only a few keys (~100) mapped to Sets with up to ~160,000 entries, which grow throughout the application. The entire state is up to 3GB, which can be handled by each node in the cluster. In each batch, some data is added to the state but not deleted until the very end of the process, i.e. ~15 minutes.
While following the application UI, every 10th batch's processing time is very high compared to the other batches. See images:
The yellow fields represent the high processing time.
A more detailed Job view shows that in these batches the spikes occur at a certain point, exactly when all 20 partitions are "skipped". Or at least this is what the UI says.
My understanding of skipped is that each state partition is one possible task which isn't executed, as it doesn't need to be recomputed. However, I don't understand why the amount of skips varies in each Job and why the last Job requires so much processing. The higher processing time occurs regardless of the state's size, it just impacts the duration.
Is this a bug in the mapWithState() functionality or is this intended behaviour? Does the underlying data structure require some kind of reshuffling, does the Set in the state need to copy data? Or is it more likely to be a flaw in my application?
Is this a bug in the mapWithState() functionality or is this intended behaviour?
This is intended behavior. The spikes you're seeing are because your data is getting checkpointed at the end of that given batch. If you look at the time on the longer batches, you'll notice that it happens persistently every 100 seconds. That's because the checkpoint interval is constant and is calculated from your batchDuration (how often you read a batch from your data source) multiplied by some constant, unless you explicitly set the DStream.checkpoint interval.
Here is the relevant piece of code from MapWithStateDStream:
override def initialize(time: Time): Unit = {
  if (checkpointDuration == null) {
    checkpointDuration = slideDuration * DEFAULT_CHECKPOINT_DURATION_MULTIPLIER
  }
  super.initialize(time)
}
Where DEFAULT_CHECKPOINT_DURATION_MULTIPLIER is:
private[streaming] object InternalMapWithStateDStream {
  private val DEFAULT_CHECKPOINT_DURATION_MULTIPLIER = 10
}
Which lines up exactly with the behavior you're seeing, since your read batch duration is every 10 seconds => 10 * 10 = 100 seconds.
This is normal, and that is the cost of persisting state with Spark. An optimization on your side could be to think about how you can minimize the size of the state you have to keep in memory, so that this serialization is as quick as possible. Additionally, make sure that the data is spread out across enough executors, so that state is distributed uniformly between all nodes. Also, I hope you've turned on Kryo serialization instead of the default Java serialization; that can give you a meaningful performance boost.
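As a sketch of two of these tweaks - enabling Kryo, plus explicitly setting the checkpoint interval mentioned earlier (Java API shown for brevity, the Scala calls are analogous; stateStream and the values are illustrative placeholders):

// Enable Kryo instead of the default Java serialization.
SparkConf conf = new SparkConf()
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");

// ... build the streaming context and the mapWithState stream as usual ...

// Optionally control how often state is checkpointed instead of relying on the
// default slideDuration * 10 shown above (here: every 5 batches of 10 seconds).
stateStream.checkpoint(Durations.seconds(50));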
In addition to the accepted answer, pointing out the price of serialization related to checkpointing, there's another, lesser-known issue which might contribute to the spiky behaviour: eviction of deleted states.
Specifically, 'deleted' or 'timed out' states are not removed immediately from the map, but are marked for deletion and actually removed only in the process of serialization [in Spark 1.6.1, see writeObjectInternal()].
This has two performance implications, which occur only once per 10 batches:
The traversal and deletion process has its price
If you process the stream of timed-out/deleted events, e.g. persist it to external storage, the associated cost for all 10 batches will be paid only at this point (and not, as one might have expected, on each RDD)