Kafka Consumer Lag monitoring with linkedin/Burrow "jumps" intermittently

We're using the latest master build (at the time of this writing: https://github.com/linkedin/Burrow/commit/12e681a3a8a61f84f17677996dc3e6a2b79fac41).
Our Kafka brokers are running 1.1.0.
We recently switched from https://github.com/Morningstar/kafka-offset-monitor to Burrow, because we're adding authorization to our clusters.
Now, according to Burrow, most of our consumer lags are 0 most of the time (on kafka-offset-monitor they were usually around 1K-100K; both values are OK from our point of view).
For reasons unknown to us, the consumer lag "jumps", e.g. from 0 to 1.4 billion (!), from one minute to the next, and back again after another minute. We have about 20 consumers on our main topic, and all of their lags jump, but by different amounts. Some "only" jump from 1K to 1M, others from 0 to the billions described above.
Is anybody else seeing this?
Is there a known reason, or do we have to adjust our config? We didn't change anything about the default config for the evaluation or notifications...
We use https://github.com/rgannu/burrow-graphite to report to Graphite, and our alerting system is based on those metrics...
Any help is appreciated.

Related

Existing consumer group does not receive messages anymore

We have been chasing a strange behavior for a while now; we are asking for your help because we have run out of ideas. Maybe someone has a few good ideas/pointers?
Components we use:
A Kafka 2.2.1 (iirc) cluster in a 3-node setup
Topics with 6 partitions
We are working in Java and using org.springframework.kafka:spring-kafka 2.6.3 (also tested with the latest 2.7.7 - that did not help)
The problem:
In our use case we have a few topics for which we do not run a continuous consumer; instead we fire one up (only one at a time, consuming all partitions) periodically / on demand. We use MANUAL AckMode because sometimes we just want to count the messages and not really process them.
The consumer is created from Java code programmatically - done by this snippet:
@Autowired
KafkaListenerContainerFactory<?> kafkaListenerContainerFactory;

MessageListenerContainer listenerContainer = kafkaListenerContainerFactory.createContainer(topicName);
listenerContainer.setupMessageListener(this); // our class implements AcknowledgingMessageListener<Object, String>
listenerContainer.getContainerProperties().setGroupId("MyService");
listenerContainer.getContainerProperties().setAckMode(ContainerProperties.AckMode.MANUAL);
listenerContainer.start();
// ... we wait until no more inbound messages arrive (for X seconds, in a timeout fashion)
listenerContainer.stop();
The strange thing is that although everything works as expected,
sometimes / somehow we get into a "state" in which we do not receive any messages anymore when we start the consumer. However, we know for sure there are messages, because the "MyService" consumer group has an offset lag there (we have Prometheus+Grafana metrics from Kafka for that, so we can see it).
What makes the above even more weird:
This "state" can self-heal after a while without doing anything, so everything returns to normal. However, how long this state exists varies a lot! Sometimes just 1-2 hours other times might take days (and sometimes weeks).
We have DEV, STAGE, PROD environment - PROD is more powerful regarding Kafka side. The above weird behavior we experience in DEV and STAGE, but PROD not (just happened once during 6 months, for 2 hours then gone).
What we can also add to the above:
The retention period does not play a role here. This state can happen even when there are only a few hours' worth of messages on the topics.
When this weird state happens, restarting the JVM (i.e., the Java app) does not help.
We can break out of this state with a small trick: while the consumer is active, we send a message to the topic. Kafka then somehow recovers from this state and starts to deliver all the messages not yet processed by the "MyService" consumer group.
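For illustration, here is a minimal sketch of that "nudge" trick as a Spring component; the TopicNudger class name and the "NUDGE" payload are made up for this example, and consumers would need to ignore such marker messages:

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

@Component
public class TopicNudger {

    @Autowired
    private KafkaTemplate<Object, String> kafkaTemplate;

    // Send a dummy record so the stuck consumer group starts receiving its backlog again.
    public void nudge(String topicName) {
        kafkaTemplate.send(topicName, "NUDGE");
    }
}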
Any ideas appreciated!

KStreamWindowAggregate 2.0.1 vs 2.5.0: skipping records instead of processing

I've recently upgraded my Kafka Streams application from 2.0.1 to 2.5.0. As a result I'm seeing a lot of warnings like the following:
org.apache.kafka.streams.kstream.internals.KStreamWindowAggregate$KStreamWindowAggregateProcessor Skipping record for expired window. key=[325233] topic=[MY_TOPIC] partition=[20] offset=[661798621] timestamp=[1600041596350] window=[1600041570000,1600041600000) expiration=[1600059629913] streamTime=[1600145999913]
There seems to be new logic in the KStreamWindowAggregate class that checks whether a window has closed; if it has, the messages are skipped. In 2.0.1, by comparison, these messages were still processed.
Question
Is there a way to get the same behavior as before? I'm seeing lots of gaps in my data with this upgrade and I'm not sure how to solve this, as previously these gaps were not seen.
The aggregate function that I'm using already deals with windowing, and as a result with expired windows. How does this new logic relate to these expired windows?
Update
While exploring further, I see that it is indeed related to the grace period (in ms). In my custom TimestampExtractor (which has the logic to use the timestamp from the payload instead of the normal record timestamp), I can see that the incoming timestamp for the expired-window warnings is indeed more than 24 hours ahead of the event time from the payload.
I assume this is caused by consumer lags of over 24 hours.
The timestamp extractor's extract method has a partitionTime parameter which, according to the docs, is:
partitionTime - the highest extracted valid timestamp of the current record's partition (could be -1 if unknown)
So is this the creation time of the record on the topic? And is there a way to influence this so that my records are no longer skipped?
In 2.0.1, by comparison, these messages were still processed.
That is a little bit surprising (even if I would need to double-check the code), at least for the default config. By default, the store retention time is set to 24h, and thus in 2.0.1 messages older than 24h should also not have been processed, as the corresponding state had already been purged. If you did change the store retention time (via Materialized#withRetention) to a larger value, you would also need to increase the window grace period via the TimeWindows#grace() method accordingly.
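For illustration, a minimal sketch of aligning the two settings on a windowed aggregation; the 30s window size, the 48h grace period, the store name, the MyAggregate type, and the groupedStream (an assumed KGroupedStream with String keys) are not taken from the question:

import java.time.Duration;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.state.WindowStore;

Duration windowSize = Duration.ofSeconds(30);
Duration grace = Duration.ofHours(48);          // how late records may arrive and still be processed

groupedStream
    .windowedBy(TimeWindows.of(windowSize).grace(grace))
    .aggregate(MyAggregate::new,
               (key, value, agg) -> agg.add(value),
               Materialized.<String, MyAggregate, WindowStore<Bytes, byte[]>>as("my-agg-store")
                           // the store retention must be at least window size + grace period
                           .withRetention(windowSize.plus(grace)));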
The aggregate function that I'm using already deals with windowing, and as a result with expired windows. How does this new logic relate to these expired windows?
Not sure what you mean by this or how you actually do this. The old and new logic are similar with regard to how long a window is stored (the retention time config). The new part is the grace period, which you can increase to the same value as the retention time if you wish.
About "partition time": it is computed based on whatever the TimestampExtractor returns. For your case, it's the max of whatever you extracted from the message payloads.

Stream kinesis Analytics ETL Flink - skip records before and after a delay

EDITED:
I have a requirement to skip records that are created earlier than 10s and later than 20s after a gap in the incoming data occurs.
(A gap is said to occur when event-time1 - event-time2 > 3 seconds.)
The resulting data is used to calculate the average or median in a time window.
Is this possible with Kinesis Analytics, Dataflow, the Flink API, or some other solution that works?
If I understand correctly, you want to find the median and average of records that are created between 10 and 20 seconds after a gap of at least 3 seconds.
Using Flink (or Kinesis Analytics, which is a managed Flink service), you could do that with session windows, or with a ProcessFunction. Process functions are more flexible, and are capable of handling pretty much anything you might need. However, in this case, session windows are probably simpler, especially if you are willing to wait until a session ends (i.e., until the next gap) to get the results. You could avoid this delay by implementing a custom window Trigger.
window tutorial
process function tutorial
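As a rough sketch of the session-window approach using Flink's Java DataStream API; the SensorReading type, its fields, and the keying are assumptions, and watermark/timestamp assignment is omitted:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.EventTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// SensorReading(String key, double value, long eventTime) is an assumed POJO.
static DataStream<Double> averageOf10To20sAfterGap(DataStream<SensorReading> input) {
    return input
        .keyBy(r -> r.key)
        // a new session starts after a gap of more than 3 seconds in event time
        .window(EventTimeSessionWindows.withGap(Time.seconds(3)))
        .process(new ProcessWindowFunction<SensorReading, Double, String, TimeWindow>() {
            @Override
            public void process(String key, Context ctx,
                                Iterable<SensorReading> elements, Collector<Double> out) {
                long sessionStart = ctx.window().getStart();    // first event after the gap
                double sum = 0;
                int count = 0;
                for (SensorReading r : elements) {
                    long offset = r.eventTime - sessionStart;
                    if (offset >= 10_000 && offset <= 20_000) { // keep records created 10-20s after the gap
                        sum += r.value;
                        count++;
                    }
                }
                if (count > 0) {
                    out.collect(sum / count);                   // average; a median would require sorting
                }
            }
        });
}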

Kafka Streams "Suppressed" feature causes OOM / heavy GC

I use Kafka Streams 2.1 and created the following stream, using the Suppressed feature to process the aggregation of each whole minute:
originStream
    .windowedBy(TimeWindows.of(Duration.ofSeconds(60)).grace(Duration.ofMillis(500)))
    .aggregate(factory::createAggregation,
               (k, v, a) -> a.aggregate(v),
               materialized.withLoggingDisabled())
    .suppress(untilWindowCloses(Suppressed.BufferConfig.unbounded()))
    .toStream();
The rate of messages I receive is about 200 per second.
After a short time I see the GC starting to work very hard, and sometimes I get OOM errors.
Since I use a 2GB heap and a record will not take more than 1KB, it is clear to me that something is wrong: there shouldn't be enough messages in a 1-minute window to blow up a 2GB heap (at ~200 messages per second and ~1KB per record, one window is roughly 12MB).
So... I took a heap dump, in which I see 5 InMemoryTimeOrderedKeyValueBuffer objects taking more than 300MB each (>1.5GB in total).
I dug some more into one of them and saw that the smallest/largest timestamps in its sortedMap were 1,575,458,160,000 / 1,575,481,800,000. This means that the buffer holds messages spanning a period of 23,640,000 ms = 394 minutes.
To my understanding the buffer was supposed to be flushed so that only the last minute consumes memory; all other windows should have been evicted.
Am I doing something wrong?
Any help would be appreciated.
The problem should not be suppress() but the aggregation state store. By default, it has a retention time of 1 day. You can reduce the retention time by passing in Materialized.withRetention(...) into aggregate().
I am surprised that your heap dump shows InMemoryTimeOrderedKeyValueBuffer though, because this is the store used by suppress(). Hence, I am not 100% sure if reducing the retention time will fix the issue.
Btw: there are a few bugs in suppress() in the 2.1 version that are only fixed in the 2.3 release, and thus it's highly recommended to upgrade to 2.3 if you use suppress().
I've changed the BufferConfig to use a max-bytes bound:
Suppressed.BufferConfig.unbounded().withMaxBytes(10_000_000)
and that seemed to solve the problem. I looked at the code and don't understand why, because from what I can see it should now have thrown an exception, but it doesn't.
So, I still don't understand something here, but the problem is solved for now.
After that I applied Matthias J. Sax's suggestions too, just to be even safer (thanks).
Edit:
It happened again twice today. This means that what I did did not fix the problem (although it may have changed its frequency).
Right now, I have no solution for this problem.

Kafka Connect fetch.max.wait.ms & fetch.min.bytes combined not honored?

I'm creating a custom SinkConnector using Kafka Connect (2.3.0) that needs to be optimized for throughput rather than latency. Ideally, what I want is:
Batches of ~20 megabytes or 100k records, whichever comes first; but if the message rate is low, process at least once every minute (avoid small batches, but have MySinkTask.put() called at least once per minute).
This is what I set for consumer settings in an attempt to accomplish it:
consumer.max.poll.records=100000
consumer.fetch.max.bytes=20971520
consumer.fetch.max.wait.ms=60000
consumer.max.poll.interval.ms=120000
consumer.fetch.min.bytes=1048576
I need this fetch.min.bytes setting, or else MySinkTask.put() is called multiple times per second despite the other settings...?
Now, what I observe in a low-rate situation is that MySinkTask.put() is called with 0 records multiple times and several minutes pass by, until fetch.min.bytes is reached, and then I get them all at once.
I fail to understand so far:
Why is fetch.max.wait.ms=60000 not propagating from the consumer down to the put() call of my connector? Shouldn't it take precedence over fetch.min.bytes?
What setting controls the ~2x-per-second calls to MySinkTask.put() when fetch.min.bytes=1 (the default)? I don't understand why it does that; even the verbose output of the Connect runtime settings doesn't show any interval below multiples of seconds.
I've double-checked the log output, and the INFO o.a.k.c.consumer.ConsumerConfig - ConsumerConfig values: lines printed by the Connect runtime show the expected values that I pass via the consumer.-prefixed settings.
The "process at least once per interval" part seems not to be possible, as the fetch.min.bytes consumer setting takes precedence and Connect does not allow you to adjust the ConsumerConfig dynamically while the Task is running. :-(
The work-around for now is batching manually in the Task: set fetch.min.bytes to 1 (yikes), buffer records in the Task on put() calls, and flush when necessary. This is not ideal, as it incurs some overhead in the Connector that I had hoped to avoid.
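A minimal sketch of that manual batching, using a hypothetical BufferingSinkTask; the thresholds, the deliverBatch() helper, and all error/offset handling are simplified for illustration:

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class BufferingSinkTask extends SinkTask {

    private static final int MAX_BATCH_RECORDS = 100_000;   // assumed thresholds
    private static final long MAX_BATCH_AGE_MS = 60_000L;

    private final List<SinkRecord> buffer = new ArrayList<>();
    private long oldestBufferedAtMs = -1L;

    @Override
    public void put(Collection<SinkRecord> records) {
        if (buffer.isEmpty() && !records.isEmpty()) {
            oldestBufferedAtMs = System.currentTimeMillis();
        }
        buffer.addAll(records);

        boolean bigEnough = buffer.size() >= MAX_BATCH_RECORDS;
        boolean oldEnough = oldestBufferedAtMs > 0
                && System.currentTimeMillis() - oldestBufferedAtMs >= MAX_BATCH_AGE_MS;
        if (bigEnough || oldEnough) {
            deliverBatch(buffer);   // hypothetical: write the batch to the target system
            buffer.clear();
            oldestBufferedAtMs = -1L;
        }
    }

    @Override
    public void flush(Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
        // Connect calls this before committing offsets; push out whatever is buffered.
        if (!buffer.isEmpty()) {
            deliverBatch(buffer);
            buffer.clear();
            oldestBufferedAtMs = -1L;
        }
    }

    private void deliverBatch(List<SinkRecord> batch) {
        // target-specific write, omitted here
    }

    @Override public void start(Map<String, String> props) { }
    @Override public void stop() { }
    @Override public String version() { return "0.1"; }
}

Note that the time-based branch only fires when put() is invoked, which matches the observation above that Connect keeps calling put() a couple of times per second when fetch.min.bytes=1.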
The logic by which Connect does ~2x-per-second batching from its consumer's poll() to SinkTask.put() remains a mystery to me, but it's better than being called for every message.