I have two Kafka topics, A and B.
On occasion both topics can be idle; however, when topic A or B receives new data after an idle period, it can take the Flink application a few minutes to process it.
The application is configured to use event time, using forMonotonousTimestamps.
The job is structured like so:
KafkaSource
ProcessFunction
KeyBy
connect the two streams
CoProcessFunction
The ProcessFunction immediately picks up data from the Kafka topic (even after it has been idle), and likewise for the KeyBy.
However, neither processElement1 nor processElement2 of the CoProcessFunction is triggered quickly; I see a delay of around 3 minutes before they are. Why is this?
Performance is otherwise very good when both topics have data continuously pushed to them.
I have also tried implementing my own WatermarkGenerator, like so:

import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;

public class ProcessingTimeWatermarkGenerator<T> implements WatermarkGenerator<T> {
    @Override
    public void onEvent(T event, long eventTimestamp, WatermarkOutput output) {
        // don't need to do anything because we work on processing time
    }

    @Override
    public void onPeriodicEmit(WatermarkOutput output) {
        // emit the current wall-clock time as the watermark
        output.emitWatermark(new Watermark(System.currentTimeMillis()));
    }
}
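For reference, this is roughly how I wire the generator in (the class name above and the MyEvent type are just placeholders I'm using here):

import org.apache.flink.api.common.eventtime.WatermarkStrategy;

WatermarkStrategy<MyEvent> strategy =
        WatermarkStrategy.<MyEvent>forGenerator(ctx -> new ProcessingTimeWatermarkGenerator<>());
// this strategy is then passed to env.fromSource(...) together with the KafkaSource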
I can see that onPeriodicEmit does emit every 5 seconds, but this doesn't solve my problem, and looking at the Flink web UI, the watermark does not progress.
Using Flink 1.14
I would suspect that there's an issue with the idleness detection:
The Kafka source does not automatically go into an idle state if the parallelism is higher than the number of partitions. You will either need to lower the parallelism or add an idle timeout to the watermark strategy.

If no records flow in a partition of a stream for that amount of time, then that partition is considered "idle" and will not hold back the progress of watermarks in downstream operators.
See https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/connectors/datastream/kafka/#idleness for more details and how to assign a WatermarkStrategy#withIdleness to resolve this.
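For illustration, a minimal sketch of attaching an idleness timeout to the strategy (the MyEvent type, the kafkaSource and env variables, and the 30-second timeout are placeholders, not from the original job):

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

WatermarkStrategy<MyEvent> strategy =
        WatermarkStrategy.<MyEvent>forMonotonousTimestamps()
                .withIdleness(Duration.ofSeconds(30)); // partitions with no records for 30s are marked idle

DataStream<MyEvent> stream =
        env.fromSource(kafkaSource, strategy, "kafka-source");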
Related
I have a Kafka Streams application that performs windowing (using original event time, not wall-clock time) via stream joins of e.g. 1 day.
If I bring up this topology and reprocess the data from the start (as in a lambda-style architecture), will this window keep that old data?
For example: if today is 2022-01-09, and I'm receiving data from 2021-03-01, will this old data enter the table, or will it be rejected from the start?
In that case, what strategies can be used to reprocess this data?
UPDATE: Using Kafka Streams 2.5.0
Updated answer for Kafka Streams version 2.5:
When using event time, Kafka Streams behaves independently of the wall-clock time, as long as no events carry wall-clock timestamps. In particular, you should not have configured a WallclockTimestampExtractor as your timestamp extractor.
Kafka Streams assigns your input topic partitions to stream tasks, which consume the partitions one event at a time. On any given topic, at most one partition will be assigned to each stream task. Time-windowed aggregations are carried out for each stream task separately. Kafka Streams uses an internal timestamp called "observedStreamTime" for each aggregation to keep track of the maximum timestamp seen so far. Incoming records are checked against the observedStreamTime: if they are older than the retention plus grace period of the configured time window store, they are dropped; otherwise they are aggregated according to the configuration. The implementation can be found at https://github.com/apache/kafka/blob/d5b53ad132d1c1bfcd563ce5015884b6da831777/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KStreamWindowAggregate.java#L108-L175
This processing will always yield the same result if the Kafka Streams application is reset; it is independent of the execution (wall-clock) time of the processing. If events are dropped, the corresponding metrics are updated.
There is one caveat with this approach when multiple topics are consumed. The observedStreamTime will reflect the highest timestamp across all partitions read by the stream task. If you have two topics (maybe because you want to join them) and one contains considerably younger data than the other (maybe because the latter received no new data), the observedStreamTime will be dominated by the younger topic. Events from the older topic may be dropped if the time window configuration does not have a sufficient retention or grace period. See the JavaDoc of TimeWindows for the configuration options: https://github.com/apache/kafka/blob/d5b53ad132d1c1bfcd563ce5015884b6da831777/streams/src/main/java/org/apache/kafka/streams/kstream/TimeWindows.java
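For concreteness, a small sketch of configuring a window with an explicit grace period (the one-day size and seven-day grace are made-up values; the API shown is the TimeWindows builder from Kafka Streams 2.5):

import java.time.Duration;
import org.apache.kafka.streams.kstream.TimeWindows;

TimeWindows windows = TimeWindows
        .of(Duration.ofDays(1))     // window size
        .grace(Duration.ofDays(7)); // how far behind stream time a record may arrive and still be aggregated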
In your example the old data will be accepted, as long as the stream time has not progressed too far. Reprocessing the whole data set should work, since it will progress linearly through your topic. If old data arrives for a time window whose end (plus grace period) lies too far behind the observed stream time, Kafka Streams will reject the record. In that case Kafka Streams will also issue an error message and adjust its metrics accordingly, so this behaviour should be easy to spot.
I suggest trying out this reprocessing if feasible and watching the logs and metrics.
I'm a beginner Kafka and Flink enthusiast.
I noticed something troubling. When I increase the parallelism of a Kafka job to anything more than 1, none of my windows execute their processes. I wish to use parallelism to increase analysis speed.
Look at the image examples from the Apache Flink Web Dashboard, which visualize the issue.
This is the exact same code and the exact same ingested data set; the only difference is the parallelism. In the first example the ingested data flows through the window functions, but when the parallelism is increased the data just piles up in the first window function, which never executes. It stays like this forever and never produces any error.
The source used in the code is KafkaSource; FlinkKafkaConsumer seems to work fine with the same setup, but it is deprecated, so I'd rather not use it.
Thanks for any ideas!
The issue is almost certainly that the Kafka topic being consumed has fewer partitions than the configured parallelism. The new KafkaSource handles this situation differently than FlinkKafkaConsumer did.
An event-time window waits for the arrival of a watermark indicating that the stream is now complete up through the end-time of the window. When your KafkaSource operator has 10 instances, some of which aren't receiving any data, those idle instances are holding back the watermark. Basically, Flink doesn't know that those instances aren't expected to ever produce data -- instead it's waiting for them to be assigned work to do.
You can fix this by doing one of the following:
Reduce Flink's parallelism to be less than or equal to the number of Kafka partitions.
Configure your WatermarkStrategy to use withIdleness(duration) so that the idle instances will recognize that they aren't doing anything and (temporarily) remove themselves from watermarking. (If those instances are ever assigned splits/partitions to consume, they'll resume participating in watermarking.) A sketch of both options follows this list.
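A minimal sketch of both options (env, kafkaSource, the MyEvent type, and the concrete numbers are placeholders):

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;

// Option 1: cap the source parallelism at the topic's partition count (say, 4)
DataStream<MyEvent> stream = env
        .fromSource(kafkaSource, WatermarkStrategy.<MyEvent>forMonotonousTimestamps(), "kafka")
        .setParallelism(4);

// Option 2: let idle source instances withdraw from watermarking after a timeout
WatermarkStrategy<MyEvent> strategy = WatermarkStrategy
        .<MyEvent>forMonotonousTimestamps()
        .withIdleness(Duration.ofMinutes(1));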
In my Kafka Streams application I have a single processor that is scheduled to produce output messages every 60 seconds. The output message is built from messages that come from a single input topic. Sometimes it happens that the output message is bigger than the configured limit on the broker (1MB by default). An exception is thrown and the application shuts down. The commit interval is set to the default (60s).
In such a case I would expect that, on the next run, all messages consumed during the 60s preceding the crash would be re-consumed. But in reality the offsets of those messages are committed, and the messages are not processed again on the next run.
Reading answers to similar questions, it seems to me that the offset should not be committed. When I increase the commit interval to 120s (the processor still punctuates every 60s), it works as expected and the offset is not committed.
I am using the default processing guarantee, but I have also tried exactly_once; both give the same result. Calling context.commit() from the processor seems to have no effect on the issue.
Am I doing something wrong here?
The contract of a Processor in Kafka Streams is that you have fully processed an input record and forward()ed all corresponding output messages before process() returns. This contract implies that Kafka Streams is allowed to commit the corresponding offset after process() returns.
It seems you "buffer" messages within process() in memory, to emit them later. This violates that contract. If you want to "buffer" messages, you should attach a state store to the Processor and put all those messages into the store (cf. https://kafka.apache.org/25/documentation/streams/developer-guide/processor-api.html#state-stores). The store is managed by Kafka Streams for you, and it's fault-tolerant. This way, after an error the state will be recovered and you won't lose any data (even if the input messages are not reprocessed).
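As a rough sketch of that pattern (the store name "buffer", the String types, and the 60-second wall-clock punctuation are assumptions, not from your application):

import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class BufferingProcessor implements Processor<String, String> {
    private ProcessorContext context;
    private KeyValueStore<String, String> buffer;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.buffer = (KeyValueStore<String, String>) context.getStateStore("buffer");
        // every 60s, emit everything buffered so far and clear the store
        context.schedule(Duration.ofSeconds(60), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            try (KeyValueIterator<String, String> it = buffer.all()) {
                while (it.hasNext()) {
                    KeyValue<String, String> entry = it.next();
                    context.forward(entry.key, entry.value);
                    buffer.delete(entry.key);
                }
            }
        });
    }

    @Override
    public void process(String key, String value) {
        // durable "buffering": the store is backed by a changelog topic
        buffer.put(key, value);
    }

    @Override
    public void close() {}
}

The store itself still has to be created and attached to this processor when building the topology.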
I doubt that setting the commit interval to 120 seconds actually works as expected for all cases, because there is no alignment between when a commit happens and when punctuation is called.
Some of this will depend on the client you are using and whether it's based on librdkafka.
Some of the answer will also depend on how you are "looping" over the "poll" method. A typical example will look like the code under "Automatic Offset Committing" at https://kafka.apache.org/23/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html
But this assumes quite a rapid poll loop (100ms plus processing time) and an auto.commit.interval.ms of 1000ms (the default is 5000ms).
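For reference, that javadoc example looks roughly like this (topic name and bootstrap address are illustrative):

import java.time.Duration;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");
props.put("group.id", "test");
props.put("enable.auto.commit", "true");
props.put("auto.commit.interval.ms", "1000");
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.subscribe(Arrays.asList("my-topic"));
while (true) {
    // offsets are committed in the background, at most every auto.commit.interval.ms
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
    for (ConsumerRecord<String, String> record : records)
        System.out.printf("offset = %d, value = %s%n", record.offset(), record.value());
}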
If I read your question correctly, you seem to be consuming messages once per 60 seconds?
Something to be aware of is that the behavior of the Kafka client is quite tied to how frequently poll is called (some libraries will wrap poll inside something like a "Consume" method). Calling poll frequently is important in order to appear "alive" to the broker. You will get other exceptions if you do not poll at least every max.poll.interval.ms (default 5min); it can lead to clients being kicked out of their consumer groups.
Anyway, to the point: auto.commit.interval.ms is just a maximum. If a message has been accepted/acknowledged or StoreOffset has been used, then, on poll, the client can decide to update the offset on the broker, maybe due to a client-side buffer size being hit or some other semantic.
Another thing to look at (especially if using a librdkafka-based client; others have something similar) is enable.auto.offset.store (default true). This will "automatically store the offset of the last message provided to the application", so every time you poll/consume a message from the client, it will StoreOffset. If you also use auto-commit, then your offset may move in ways you might not expect.
See https://github.com/edenhill/librdkafka/blob/master/CONFIGURATION.md for the full set of config for librdkafka.
There are many, many ways of consuming/acknowledging. I think for your case, the comment for max.poll.interval.ms on the config page might be relevant:
"
Note: It is recommended to set enable.auto.offset.store=false for long-time processing applications and then explicitly store offsets (using offsets_store()) after message processing
"
Sorry that this "answer" is a bit long-winded. I hope there are some threads for you to pull on.
EDIT
In case anyone else is in this particular situation, I got something akin to what I was looking for after tweaking the consumer configurations. I created a producer that sent the priority messages to three separate topics (for high/med/low priorities), and then I created three separate consumers to consume from each. I polled the higher-priority topics frequently, and didn't poll the lower priorities unless the high was empty:
// create the consumers once, outside the loop
final KafkaConsumer<String, String> highPriConsumer = createConsumer(TOPIC1);
final KafkaConsumer<String, String> medPriConsumer = createConsumer(TOPIC2);

while (true) {
    final ConsumerRecords<String, String> consumerRecordsHigh = highPriConsumer.poll(100);
    if (!consumerRecordsHigh.isEmpty()) {
        // process high-priority records
    } else {
        final ConsumerRecords<String, String> consumerRecordsMed = medPriConsumer.poll(100);
        if (!consumerRecordsMed.isEmpty()) {
            // process medium-priority records
        }
        // (the low-priority consumer is polled here, following the same pattern)
    }
}
The poll timeout (the argument to the poll() method) determines how long the call blocks when there are no records to return. I set this to a very short time for each topic, but you can set it lower for the lower priorities, to make sure they aren't consuming valuable cycles waiting while high-priority messages are present.
The max.poll.records config determines the maximum number of records to grab in a single poll. This could be set higher for the higher priorities as well.
The max.poll.interval.ms config sets the maximum allowed time between polls: processing of max.poll.records messages must complete within it, or the consumer is considered failed.
Also, I believe pausing/resuming an entire consumer/topic can be implemented like this:
// pause every partition currently assigned to this consumer
kafkaConsumer.pause(kafkaConsumer.assignment());

// later: once everything is paused, resume all assigned partitions
if (kafkaConsumer.paused().containsAll(kafkaConsumer.assignment())) {
    kafkaConsumer.resume(kafkaConsumer.assignment());
}
I'm not sure if this is the best way, but I couldn't find a good example elsewhere.
I agree with senseiwu below that this is not really the correct use for Kafka. This is single-threaded processing, with each topic having a dedicated consumer, but I will work on improving this process from here.
Background
We are trying to improve our application and are hoping to use Apache Kafka for messaging between decoupled components. Our system frequently has low bandwidth (although there are cases where bandwidth can be high for a time), and we have small, high-priority messages that must be processed while larger files wait, or are processed slowly to consume less bandwidth. We would like to have topics with different priorities.
I am new to Kafka, but have tried looking into both the Processor API and Kafka Streams with no success, although certain posts on forums seem to be saying this is doable.
Processor API
When I tried the Processor API, I tried to determine whether the high-priority KafkaConsumer was currently processing anything by checking if poll() returned empty, and then hoped to poll() with the medium-priority consumer, but the second topic's poll returned empty. There also didn't seem to be an easy way to get all TopicPartitions of a topic in order to call kafkaConsumer.pause(partitions).
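(For what it's worth, one way I later found to build that partition list is via partitionsFor; a sketch, with a made-up topic name and an assumed consumer variable:)

import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

// collect every TopicPartition of a topic, so they can be passed to pause()
List<TopicPartition> partitions = new ArrayList<>();
for (PartitionInfo info : consumer.partitionsFor("high-priority-topic")) {
    partitions.add(new TopicPartition(info.topic(), info.partition()));
}
consumer.pause(partitions);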
Kafka Streams
When I tried Kafka Streams, I set up a stream to consume from each of my "priority" topics, but there was no way to check whether the KStream or KafkaStreams instance connected to the higher-priority topic was currently idle or processing.
I based my code on this file
Other
I also tried the code here: priority-kafka-client, but it didn't work as expected, as running the downloaded test file yielded mixed priorities.
I found this thread, where one of the developers says (addressing adding priorities for topics): "...a user could implement this behavior with pause and resume". But I was unable to find out how he meant this could work.
I found this StackOverflow article, but they seem to be using a very old version, and I was unclear on how their mapping function was supposed to work.
Conclusion
I would be very grateful if someone would tell me if they think this is something worth pursuing. If this isn't how Apache Kafka is supposed to work, because it disrupts the benefit gained from the automatic topic/partition handling, that's fine, and I will look elsewhere. However, there were so many instances where people seemed to have success with it, that I wanted to try. Thank you.
This sounds like a design issue in your application. Kafka was originally designed as a commit log, where each message is written to the broker with an offset and various consumers consume them in the order in which they were committed, with very low latency and high throughput. Given that partitions, not topics, are the fundamental unit of work distribution in Kafka, topic-level priorities would be difficult to achieve natively.
I'd recommend adapting your design to use other architectural components instead of trying to cut your feet to fit the shoes. One thing you could do already is have your producer upload the file to proper file storage and send the link via Kafka, along with metadata. Then, depending on the bandwidth status, your consumer could decide, based on the metadata of the large file, whether it is sensible to download it or not. This way you are more likely to end up with a robust design, rather than using Kafka the wrong way.
If you indeed want to stick to only Kafka, one solution would be to send large files to a fixed number of hardcoded partitions, and have consumers consume from those partitions only when bandwidth is good (see the sketch below).
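A minimal sketch of pinning large-file messages to a fixed partition (the topic name, partition number, producer variable, and payload values are illustrative):

import org.apache.kafka.clients.producer.ProducerRecord;

String fileId = "file-123";                                        // illustrative key
String fileMetadataJson = "{\"size\": 1048576, \"url\": \"...\"}"; // illustrative metadata

// write large-file metadata to partition 0 only; a dedicated consumer can then
// read that partition when bandwidth allows
ProducerRecord<String, String> record =
        new ProducerRecord<>("files-topic", 0, fileId, fileMetadataJson);
producer.send(record);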
Good Day,
I would like to find out if a Kafka queue can hold data for a few seconds and then release it.
I receive a message from a Kafka topic. After parsing the data, I hold it in memory for some time (10 seconds); this builds up as unique messages come through, with each message having its own timer. I want Kafka to tell me that a message has expired (after 10 seconds) so that I can continue with other tasks.
But since Flink/Kafka is event-driven, I was hoping Kafka has some sort of timing wheel that can reproduce the key of a message to the consumer after 10 seconds.
Any idea how I can achieve this using Flink windowing or Kafka features?
Regards
Regarding your initial problem:
I would like to find out if a Kafka queue can hold data for a few seconds and then release it
You can set log.cleanup.policy to delete (this is the default) and change retention.ms from the default 604800000 (1 week) to 10000 (10 seconds).
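For example, a sketch of changing a topic's retention with the Java AdminClient (topic name and bootstrap address are placeholders):

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

Properties props = new Properties();
props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
try (AdminClient admin = AdminClient.create(props)) {
    ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "my-topic");
    // keep records for only 10 seconds before they become eligible for deletion
    Config retention = new Config(Collections.singletonList(new ConfigEntry("retention.ms", "10000")));
    admin.alterConfigs(Collections.singletonMap(topic, retention)).all().get();
}

Note that deletion is not immediate: the broker removes data lazily, segment by segment, so records may survive somewhat longer than retention.ms.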
Can you explain again what else you want to check, and what you meant by the part after "Regards"?
You could look closer at the Kafka Streams library: https://kafka.apache.org/21/documentation/streams/developer-guide/dsl-api.html, https://kafka.apache.org/21/documentation/streams/developer-guide/processor-api.html.
Using Kafka Streams you can do a lot of complex event-processing work. The Processor API is a lower-level API and gives you more flexibility: for example, each processed message can be put into a state store (a Kafka Streams abstraction that is replicated to a changelog topic), and then a Punctuator can periodically check whether a message has expired.
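A rough sketch of that idea (the store name "pending", the String keys, the 10-second expiry, and the 1-second punctuation interval are all assumptions):

import java.time.Duration;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class ExpiryProcessor implements Processor<String, String> {
    private ProcessorContext context;
    private KeyValueStore<String, Long> pending; // key -> insertion timestamp

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.pending = (KeyValueStore<String, Long>) context.getStateStore("pending");
        // every second, forward and remove keys older than 10 seconds
        context.schedule(Duration.ofSeconds(1), PunctuationType.WALL_CLOCK_TIME, now -> {
            try (KeyValueIterator<String, Long> it = pending.all()) {
                while (it.hasNext()) {
                    KeyValue<String, Long> entry = it.next();
                    if (now - entry.value >= 10_000L) {
                        context.forward(entry.key, entry.value); // signal expiry downstream
                        pending.delete(entry.key);
                    }
                }
            }
        });
    }

    @Override
    public void process(String key, String value) {
        pending.put(key, System.currentTimeMillis()); // start the 10s clock for this key
    }

    @Override
    public void close() {}
}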