Flink Kafka connector - commit offset without checkpointing - apache-kafka

I have a question regarding Flink Kafka Consumer (FlinkKafkaConsumer09).
I've been using this version of the connector:
flink-connector-kafka-0.9_2.11-1.1.2 (connector version 0.9, Scala version 2.11, Flink version 1.1.2)
I gather communication data from Kafka in 5-minute tumbling windows. From what I've seen, the windows are aligned with system time (for example, windows end at 12:45, 12:50, 12:55, 13:00, etc.).
After a window is closed, its records are processed/aggregated and sent to the database via a sink operator.
Simplified version of my program:
env.addSource(new FlinkKafkaConsumer09<>(topicName,jsonMapper, properties))
.keyBy("srcIp", "dstIp", "dstPort")
.window(TumblingEventTimeWindows.of(Time.of(5, TimeUnit.MINUTES)))
.apply(new CounterSum<>())
.addSink(new DbSink(...));
However, I need to commit offsets to Kafka. From what I've read, the only way to do that with FlinkKafkaConsumer09 is to turn on checkpointing. I do it like this:
env.enableCheckpointing(300000); // 5 minutes
Checkpointing stores the state of all operators. After a checkpoint completes, the offsets are committed to Kafka.
My checkpoints are stored via FsStateBackend in the TaskManager's file system (the first problem: older checkpoint data is not deleted; I saw some bugs reported about this).
The second problem is when the checkpoint is triggered. If it is triggered at the beginning of a window, the resulting checkpoint file is small; on the other hand, when it is triggered just before a window closes, the resulting state is large (for example 50 MB), because there are already many communication records in the window. The checkpoint process usually takes less than 1-2 s, but when the checkpoint is triggered right after a window closes, while the aggregations and DB sink writes are in progress, it takes 45 s.
But the whole point is that I don't need state checkpointing at all. All I need is to commit the offset to Kafka after a window is closed and processed and its results are written to the DB (or at the beginning of the next window). If a failover occurred, Flink would fetch the last offset from Kafka and read the data from the last 5-minute interval again. Because the last, failed result was never sent to the DB, there would be no duplicate data in the DB, and re-reading the last 5-minute interval is negligible overhead.
So basically I have 2 questions:
1. Is there any way to turn checkpointing off and only commit offsets as described above?
2. If not, is there any way to align checkpointing with the start of the window? I read the Flink documentation - there is a feature called savepoints (i.e. manual checkpoints), but it is meant to be used from the command line. I would need to trigger a savepoint from code at window start - the state would be small and the checkpoint process would be quick.

In order to commit offsets to Kafka, set the property enable.auto.commit=true and a commit interval via auto.commit.interval.ms=300000 in the consumer configuration. With FlinkKafkaConsumer09 these go into the Properties object passed to the constructor (newer connector versions accept the same settings through the KafkaSource builder's setProperty):
properties.setProperty("enable.auto.commit", "true");
properties.setProperty("auto.commit.interval.ms", "300000");
This will only commit your offsets and will not interfere with checkpointing.
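For context, a minimal sketch of how this could look with the 0.9 connector from the question, with checkpointing left disabled so the consumer falls back to Kafka's own auto-commit (the broker address, group id, and SimpleStringSchema are assumptions for illustration, not taken from the original job):
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092"); // assumed broker address
properties.setProperty("group.id", "flink-window-job");        // assumed consumer group
properties.setProperty("enable.auto.commit", "true");          // let the Kafka client commit offsets itself
properties.setProperty("auto.commit.interval.ms", "300000");   // commit every 5 minutes

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Note: env.enableCheckpointing(...) is intentionally not called here.
env.addSource(new FlinkKafkaConsumer09<>("topicName", new SimpleStringSchema(), properties))
   .print();
env.execute("kafka-auto-commit-sketch");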

Related

How to skip kafka history data in flink job if certain lag is encountered?

Sometimes we encounter lag in the Kafka consumer due to external issues.
The Flink job will always consume the Kafka history (delayed data) with exactly-once semantics, but here's the scenario:
We want to skip the delayed data when the consumer lag is too large, so that our downstream service gets the latest data in time.
I am thinking of using a window period to do it. How should I code this?
I'd say your least painful option is to always read all the messages, but discard the ones you want to skip as early as possible instead of processing them. Just reading and discarding without any further processing is really fast.
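For illustration, here is a rough sketch of that read-and-discard idea inside a Flink job, dropping anything whose event timestamp lags wall-clock time by more than a fixed threshold (kafkaStream stands for the stream read from your Kafka source; the Event type, its timestampMillis field, and the 10-minute threshold are placeholders, not taken from your job):
long maxLagMillis = Duration.ofMinutes(10).toMillis();          // assumed definition of "too much lag"

DataStream<Event> fresh = kafkaStream
        // discard delayed records as early as possible, before any keying, windowing or aggregation
        .filter(event -> System.currentTimeMillis() - event.timestampMillis <= maxLagMillis);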
You could stop the Flink job and use the kafka-consumer-groups CLI that ships with Kafka to seek the consumer group forward (assuming Flink is using one, rather than maintaining offsets itself).
When the job restarts, it'll start from the new offset location

Kafka Streams: reprocessing old data when windowing

I have a Kafka Streams application that performs windowing (using original event time, not wall-clock time) via stream joins of e.g. 1 day.
If I bring up this topology and reprocess the data from the start (as in a lambda-style architecture), will the window keep that old data there?
For example: if today is 2022-01-09, and I'm receiving data from 2021-03-01, will this old data enter the table, or will it be rejected from the start?
In that case, what strategies can be used to reprocess this data?
UPDATE: Using Kafka Streams 2.5.0
Updated answer for the OP's Kafka Streams version 2.5:
When using event time, Kafka Streams will behave independently of wall-clock time, as long as no events carry the wall-clock time. You should not have configured a WallclockTimestampExtractor as your timestamp extractor.
Kafka Streams will assign your input topic partitions to stream tasks, which will consume the partitions one event at a time. For any given topic, at most one of its partitions will be assigned to a stream task. Time-windowed aggregations are carried out for each stream task separately. Kafka Streams uses an internal timestamp called "observedStreamTime" for each aggregation to keep track of the maximum timestamp seen so far. Incoming records are checked against the observedStreamTime: if they are older than the retention + grace period of the configured time window store, they are dropped; otherwise, they are aggregated according to the configuration. The implementation can be found at https://github.com/apache/kafka/blob/d5b53ad132d1c1bfcd563ce5015884b6da831777/streams/src/main/java/org/apache/kafka/streams/kstream/internals/KStreamWindowAggregate.java#L108-L175
This processing will always yield the same result if the Kafka Streams application is reset; it is independent of the wall-clock time at which the processing runs. If events are dropped, the corresponding metrics are updated.
There is one caveat with this approach when multiple topics are consumed. The observedStreamTime will reflect the highest timestamp of all partitions read by the stream task. If you have two topics (maybe because you want to join them) and one contains considerably younger data than the other (maybe because the latter received no new data), the observedStreamTime will be dominated by the younger topic. Events of the older topic might be dropped if the time window configuration does not have a long enough retention or grace period. See the JavaDoc of TimeWindows for the configuration options: https://github.com/apache/kafka/blob/d5b53ad132d1c1bfcd563ce5015884b6da831777/streams/src/main/java/org/apache/kafka/streams/kstream/TimeWindows.java
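As an illustration of those configuration options (not your actual topology), here is a windowed count in the 2.5 API where the grace period and store retention are sized generously so that old records survive reprocessing; the topic name, store name, and durations are made up:
StreamsBuilder builder = new StreamsBuilder();
builder.<String, String>stream("events")                          // assumed input topic
       .groupByKey()
       .windowedBy(TimeWindows.of(Duration.ofDays(1))             // 1-day windows, as in the question
                              .grace(Duration.ofDays(365)))       // accept records up to a year late (illustrative)
       .count(Materialized.<String, Long, WindowStore<Bytes, byte[]>>as("counts")
                          .withRetention(Duration.ofDays(366)));  // retention must cover window size + grace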
In your example, the old data will be accepted as long as the stream time has not progressed too far. Reprocessing the whole data set should work, since it will progress linearly through your topic. If old data arrives for a time window after the window size + grace period has been exceeded, Kafka Streams will reject the record. In that case Kafka Streams will also issue an error message and adjust its metrics accordingly, so this behaviour should be easy to spot.
I suggest trying out this reprocessing, if feasible, and watching the logs and metrics.

Kafka data reads and offset management with sink

What happens when the consumer reads data from Kafka but fails to write it to the sink? Let's say I read the data from Kafka, apply some transformations, and finally store the result in a database. If everything works perfectly, my final result is stored in the database. But let's say that for some reason my database isn't available. What happens to the data that I read from Kafka? When I restart my application, can I read the same data again, since I failed to store it in the sink? Or will Kafka mark this data as read and not allow me to read it again?
Can you also tell me what this property is used for: enable.auto.commit=true?
There's a part of the metadata in Kafka called consumer offsets. Each message within a partition has a unique offset - an integer value that increases by one for each message.
So, in the scenario you've described:
If you've committed the offset BEFORE writing to the database, then you will not be able to read those messages again.
But if you commit the offset AFTER writing to the database, then you will be able to re-read those messages.
enable.auto.commit=true, as the name suggests, will automatically commit consumer offsets after a certain time interval defined by the auto.commit.interval.ms parameter - which by default is 5000 ms (5 seconds). So, as you can probably imagine, if these default values are used, the offsets will be committed every 5 seconds regardless of whether the messages have landed in the destination or not.
So, you would basically need to control this through your code and set enable.auto.commit to false if you'd like to ensure guaranteed delivery.
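As a rough sketch of that manual pattern with the plain Java consumer (the topic name, group id, and writeToDatabase call are placeholders), the offset is committed only after the batch has safely landed in the database:
Properties props = new Properties();
props.setProperty("bootstrap.servers", "localhost:9092");
props.setProperty("group.id", "db-writer");
props.setProperty("enable.auto.commit", "false");                 // we decide when offsets are committed
props.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.subscribe(Collections.singletonList("events"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            writeToDatabase(record.value());                      // placeholder sink; throws if the DB is unavailable
        }
        consumer.commitSync();                                    // only reached if every record in the batch was written
    }
}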
Hope this helps!

Pentaho Data Integration - Kafka Consumer

I am using the Kafka Consumer Plugin for Pentaho CE and would appreciate your help with its usage. I would like to know if any of you have been in a situation where Pentaho failed and you lost messages (based on the official docs there's no way to read a message twice, am I wrong?). If this situation occurs, how do you capture these messages so you can reprocess them?
reference:
http://wiki.pentaho.com/display/EAI/Apache+Kafka+Consumer
Kafka retains messages for the configured retention period whether they've been consumed or not, so it allows consumers to go back to an offset they previously processed and pick up there again.
I haven't used the Kafka plugin myself, but it looks like you can disable auto-commit and manage offsets yourself. You'll probably need the Kafka system tools from Apache and some command-line steps in the job. You'd have to fetch the current offset at the start, get the last offset from the messages you consume, and, if the job/batch reaches the end, commit that last offset to the cluster.
It could be that you can also provide the starting offset as a field (message key?) to the plugin, but I can't find any documentation on what that does. In that scenario, you could store the offset with your destination data and go back to the last offset there at the start of each run. A failed run wouldn't update the destination offset, so no messages would be lost.
If you go the second route, pay attention to the auto.offset.reset setting and behavior, as it may happen that the last offset in your destination has already disappeared from the cluster if it's been longer than the retention period.
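To illustrate that second route outside of Pentaho, in plain Java: the offset lives next to the destination data, a failed run never advances it, and the next run seeks back to it (loadLastOffsetFromDestination, writeToDestination, and saveOffsetToDestination are hypothetical helpers; props is a consumer configuration with auto-commit disabled):
TopicPartition partition = new TopicPartition("events", 0);       // assumed topic/partition
try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    consumer.assign(Collections.singletonList(partition));
    consumer.seek(partition, loadLastOffsetFromDestination());    // start where the last successful run ended
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
    long lastOffset = -1L;
    for (ConsumerRecord<String, String> record : records) {
        writeToDestination(record.value());                       // hypothetical destination write
        lastOffset = record.offset();
    }
    if (lastOffset >= 0) {
        saveOffsetToDestination(lastOffset + 1);                  // hypothetical; store the next offset to read
    }
}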

Submitting offsets to kafka after storm batch

What would be the correct way to submit only the highest offset of every partition when a batch bolt finishes processing a batch? My main concern is machines dying while processing batches, as the whole shebang is going to run on AWS spot instances.
I am new to Storm development and can't seem to find an answer to what is, IMO, a pretty straightforward usage of Kafka and Storm.
Scenario:
Based on the Guaranteeing Message Processing guide, let's assume that I have a stream (Kafka topic) of ("word", count) tuples and a batch bolt that processes X tuples, does some aggregation, creates a CSV file, uploads the file to HDFS/DB, and acks.
In a non-Storm "naive" implementation, I would read X messages (or read for Y seconds), aggregate, write to HDFS, and, once the upload is completed, commit the latest (highest) offset for every partition to Kafka. If the machine or process dies before the commit, the next iteration would start from the previous place.
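Roughly, that naive loop could commit the highest offset per partition explicitly with the plain Java consumer (aggregate and uploadCsvToHdfs are placeholders for the actual work):
ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(30));
Map<TopicPartition, OffsetAndMetadata> toCommit = new HashMap<>();
for (ConsumerRecord<String, String> record : batch) {
    aggregate(record);                                            // placeholder aggregation
    toCommit.put(new TopicPartition(record.topic(), record.partition()),
                 new OffsetAndMetadata(record.offset() + 1));     // +1: the next offset to read
}
uploadCsvToHdfs();                                                // placeholder upload
consumer.commitSync(toCommit);                                    // commit only after the upload succeeded
Because records within a partition arrive in order, the map ends up holding the highest offset for each partition.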
In Storm I can create a batch bolt that anchors all of the batch tuples and acks them at once; however, I can't find a way to commit the highest offset of every partition to Kafka, because the spouts are unaware of the batch. Once the batch bolt acks the tuples, every spout instance will ack its tuples one by one. So, the way I see it, I can:
1. Submit the offset of the acked message on every ack in the spout. This will cause many submits (every batch can be a few thousand tuples), probably out of order, and if the spout worker dies while submitting the offsets, I will end up partially replaying some of the events.
2. Same as 1, but add some local offset management for the highest offset committed (fixing out-of-order offset commits) and submit the highest offset seen every few seconds (reducing the high number of submits); see the sketch after this list. I can still end up with partially committed offsets if the spout dies.
3. Move the offset submission logic to the bolts - I can add the partition and offset of every message to the data sent to the batch bolt and submit the highest processed offset of every partition as part of the batch (emitting to an "offset submitter" bolt at the end of the batch). This solves the offset tracking, multiple submits, and partial replay issues, but it adds Kafka-specific logic to the bolts, coupling the bolt code to Kafka, and generally speaking it seems to me like reinventing the wheel.
4. Go even further with wheel reinvention and manually manage the highest processed partition/offset combination in ZooKeeper, reading this value when I initialize the spout.
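A minimal sketch of what the local offset management in option 2 could look like - tracking out-of-order acks per partition and exposing only the highest contiguous offset that is safe to commit (the class and method names are illustrative, not Storm or Kafka APIs):
class PartitionOffsetTracker {
    private final TreeSet<Long> pendingAcks = new TreeSet<>();
    private long highestContiguous = -1L;               // every offset <= this value has been acked

    synchronized void ack(long offset) {
        pendingAcks.add(offset);
        // advance while the next expected offset has already been acked (absorbs out-of-order acks)
        while (pendingAcks.contains(highestContiguous + 1)) {
            pendingAcks.remove(highestContiguous + 1);
            highestContiguous++;
        }
    }

    synchronized long committableOffset() {
        return highestContiguous + 1;                    // Kafka convention: commit the next offset to read
    }
}
A periodic timer could then read committableOffset() for each partition and submit it, keeping the number of commits low and the committed value consistent even with out-of-order acks.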
There is quite a lot in your question, so I'm not sure if this addresses it fully, but if you are concerned about the number of confirmations sent to Kafka (e.g. after every message), you should be able to set a batch size for consumption, for example 1000, to reduce this a lot.