Migrating from FlinkKafkaConsumer to KafkaSource, no windows executed

I am a Kafka and Flink beginner.
I have implemented a FlinkKafkaConsumer to consume messages from a Kafka topic. The only custom setting other than "group" and "topic" is (ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest"), which enables re-reading the same messages several times. It works out of the box for both consuming and the downstream logic.
Now FlinkKafkaConsumer is deprecated, and I wanted to change to its successor, KafkaSource.
Initializing KafkaSource with the same parameters as I use for FlinkKafkaConsumer reads the topic as expected; I can verify this by printing the stream. Deserialization and timestamps seem to work fine. However, the windows are never executed, and so no results are produced.
I assume some default setting(s) in KafkaSource differ from those of FlinkKafkaConsumer, but I have no idea what they might be.
KafkaSource - Not working
KafkaSource<TestData> source = KafkaSource.<TestData>builder()
.setDeserializer(new CustomDeserializer())
DataStream<TestData> testDataStreamSource = env.fromSource(
"Kafka Source"
Kafka consumer - Working (Properties contains group.id, bootstrap.servers and zookeeper.connect)
propertiesForKafka.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
FlinkKafkaConsumer<TestData> flinkKafkaConsumer = new FlinkKafkaConsumer(TOPIC, new CustomDeserializer(), propertiesForKafka);
DataStreamSource<TestData> testDataStreamSource = env.addSource(flinkKafkaConsumer)
Both streams use the same pipeline, which looks like this:
withTimestampAssigner((event, timestamp) -> event.getTimestamp()))
.window(SlidingEventTimeWindows.of(Time.hours(3), Time.hours(1)))
.process(new ProcessWindowFunction<TestData, TestDataOutput, String, TimeWindow>() {
public void process(
Things tried:
I've experimented with committing offsets, but it has not improved the situation.
I've also tried setting timestamps already in the source.

Update: The answer is that the KafkaSource behaves differently than FlinkKafkaConsumer in the case where the number of Kafka partitions is smaller than the parallelism of Flink's kafka source operator. See https://stackoverflow.com/a/70101290/2000823 for details.
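One way to picture why spare source subtasks matter is the small model below. It is a plain-Java sketch, not Flink code: the class and method names are hypothetical, and it assumes that a downstream operator's watermark is the minimum of the watermarks of its non-idle inputs, and that a source subtask without any partition never emits a watermark unless it is marked idle. Flink's WatermarkStrategy offers withIdleness(Duration) for marking inputs idle; whether that is the appropriate fix here is covered in the linked answer.

```java
// Plain-Java model (not Flink API) of how watermarks from parallel source
// subtasks are combined by a downstream operator.
class WatermarkMerge {
    static final long NO_WATERMARK = Long.MIN_VALUE;

    // inputs[i] is the latest watermark reported by source subtask i;
    // idle[i] says whether that subtask has been marked idle.
    static long combined(long[] inputs, boolean[] idle) {
        long min = Long.MAX_VALUE;
        boolean anyActive = false;
        for (int i = 0; i < inputs.length; i++) {
            if (idle[i]) {
                continue; // idle inputs do not hold the watermark back
            }
            anyActive = true;
            min = Math.min(min, inputs[i]);
        }
        return anyActive ? min : NO_WATERMARK;
    }
}
```

With one partition and a source parallelism of two, the second subtask stays at NO_WATERMARK. In this model the combined watermark then never advances, and event-time windows never fire, until that subtask is treated as idle.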
Original answer:
The problem is almost certainly something related to the timestamps and watermarks.
To verify that timestamps and watermarks are the problem, you could do a quick experiment where you replace the 3-hour-long event time sliding windows with short processing time tumbling windows.
In general it is preferred (but not required) to have the KafkaSource do the watermarking. Using forMonotonousTimestamps in a watermark generator applied after the source, as you are doing now, is a risky move. This will only work correctly if the timestamps in all of the partitions being consumed by each parallel instance of the source are processed in order. If more than one Kafka partition is assigned to any of the KafkaSource tasks, this isn't going to happen. On the other hand, if you supply the forMonotonousTimestamps watermarking strategy in the fromSource call (rather than noWatermarks), then all that will be required is that the timestamps be in order on a per-partition basis, which I imagine is the case.
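The difference between per-partition order and order after merging can be shown with a few numbers. The sketch below is plain Java with hypothetical names and made-up timestamps; it only counts how many events arrive behind the largest timestamp seen so far, which is what a monotonous-timestamps watermark generator would treat as late.

```java
import java.util.List;

// Plain-Java illustration (not Flink API): counts events whose timestamp is
// lower than the largest timestamp seen before them.
class MonotonousDemo {
    static int lateCount(List<Long> timestamps) {
        long maxSeen = Long.MIN_VALUE;
        int late = 0;
        for (long ts : timestamps) {
            if (ts < maxSeen) {
                late++; // this event is behind the watermark implied by maxSeen
            }
            maxSeen = Math.max(maxSeen, ts);
        }
        return late;
    }
}
```

For example, partitions with timestamps [1, 4, 7] and [2, 3, 9] are each in order, so lateCount is 0 for both. If one task happens to read them interleaved as [1, 2, 4, 3, 7, 9], the merged stream has one late event, even though neither partition was out of order.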
As troubling as that is, it's probably not enough to explain why the windows don't produce any results. Another possible root cause is that the test data set doesn't include any events with timestamps after the first window, so that window never closes.
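To check this hypothesis against a test data set, the window arithmetic can be worked out by hand. The following plain-Java sketch uses hypothetical names and follows my understanding of how sliding event-time windows with zero offset are assigned and fired. It assumes non-negative timestamps in milliseconds and is not taken from the Flink sources.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of sliding event-time window arithmetic (not Flink API).
class SlidingWindowDemo {
    static final long HOUR = 3_600_000L;

    // Start times of all sliding windows [start, start + size) containing ts.
    static List<Long> windowStarts(long ts, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = ts - (ts % slide);
        for (long start = lastStart; start > ts - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    // A window can fire once the watermark has reached the last timestamp
    // it includes, which is start + size - 1.
    static boolean canFire(long start, long size, long watermark) {
        return watermark >= start + size - 1;
    }
}
```

With 3-hour windows sliding by 1 hour, an event at 05:30 falls into the windows starting at 05:00, 04:00 and 03:00. The earliest of these ends at 06:00, so in this model nothing fires unless some later event pushes the watermark to at least 05:59:59.999.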
Do you have a sink? If not, that would explain things.
You can use the Flink dashboard to help debug this. Look to see if the watermarks are advancing in the window tasks. Turn on checkpointing, and then look to see how much state the window task has -- it should have some non-zero amount of state.


Kafka - different configuration settings

I am going through the documentation, and there seem to be a lot of moving parts with respect to message processing, such as exactly-once and at-least-once processing, and the settings are scattered here and there. There doesn't seem to be a single place that documents the properties that need to be configured, roughly, for exactly-once and at-least-once processing.
I know there are many moving parts involved and that it always depends. However, as I mentioned before, what are the minimum settings to configure for exactly-once, at-most-once and at-least-once processing?
You might be interested in the first part of Kafka FAQ that describes some approaches on how to avoid duplication on data production (i.e. on producer side):
Exactly once semantics has two parts: avoiding duplication during data
production and avoiding duplicates during data consumption.
There are two approaches to getting exactly once semantics during data
production:
1. Use a single-writer per partition and every time you get a network
error check the last message in that partition to see if your last
write succeeded
2. Include a primary key (UUID or something) in the
message and deduplicate on the consumer.
If you do one of these things, the log that Kafka hosts will be
duplicate-free. However, reading without duplicates depends on some
co-operation from the consumer too. If the consumer is periodically
checkpointing its position then if it fails and restarts it will
restart from the checkpointed position. Thus if the data output and
the checkpoint are not written atomically it will be possible to get
duplicates here as well. This problem is particular to your storage
system. For example, if you are using a database you could commit
these together in a transaction. The HDFS loader Camus that LinkedIn
wrote does something like this for Hadoop loads. The other alternative
that doesn't require a transaction is to store the offset with the
data loaded and deduplicate using the topic/partition/offset
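
The last alternative in the quote, storing the offset with the data and deduplicating on topic/partition/offset, might look roughly like the sketch below. It is plain Java with hypothetical names and an in-memory map standing in for the real storage system. It also assumes that records of one partition are delivered in offset order, so remembering the highest stored offset per partition is enough.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory stand-in for a sink that stores each row together with its offset.
class OffsetDedupSink {
    // Highest offset already stored, keyed by "topic-partition".
    private final Map<String, Long> lastStored = new HashMap<>();
    final List<String> rows = new ArrayList<>();

    // Returns true if the record was written, false if it was a replay.
    boolean write(String topic, int partition, long offset, String value) {
        String tp = topic + "-" + partition;
        Long last = lastStored.get(tp);
        if (last != null && offset <= last) {
            return false; // already stored before the restart, skip it
        }
        // A real store would persist the row and its offset in one write.
        rows.add(value);
        lastStored.put(tp, offset);
        return true;
    }
}
```

If a consumer stores offsets 0 to 2, crashes, and restarts from an older checkpoint at offset 1, the replayed offsets 1 and 2 are skipped and only offset 3 adds a new row.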

Acknowledgement Kafka Producer Apache Beam

How do I get the records for which an acknowledgement was received in Apache Beam's KafkaIO?
Basically, I want all the records for which I didn't get an acknowledgement to go to a BigQuery table so that I can retry them later. I used the following code snippet from the docs:
.apply(KafkaIO.<Long, String>read()
.withTopic("my_topic") // use withTopics(List<String>) to read from multiple topics.
// Above four are required configuration. returns PCollection<KafkaRecord<Long, String>>
// Rest of the settings are optional :
// you can further customize KafkaConsumer used to read the records by adding more
// settings for ConsumerConfig. e.g :
.updateConsumerProperties(ImmutableMap.of("group.id", "my_beam_app_1"))
// set event times and watermark based on LogAppendTime. To provide a custom
// policy see withTimestampPolicyFactory(). withProcessingTime() is the default.
// restrict reader to committed messages on Kafka (see method documentation).
// offset consumed by the pipeline can be committed back.
// finally, if you don't need Kafka metadata, you can drop it.
.withoutMetadata() // PCollection<KV<Long, String>>
.apply(Values.<String>create()) // PCollection<String>
By default, Beam IOs are designed to keep attempting to write/read/process elements until they succeed. (Batch pipelines will fail after repeated errors.)
What you are referring to is usually called a dead letter queue (DLQ): the failed records are collected and added to a PCollection, Pub/Sub topic, queuing service, etc. This is often desirable because it allows a streaming pipeline to make progress (not block) when errors are encountered writing some records, while the ones that succeed are still written.
Unfortunately, unless I am mistaken, there is no dead letter queue implemented in KafkaIO. It may be possible to modify KafkaIO to support this. There was some discussion on the Beam mailing list in which a few approaches to implementing it were proposed.
I suspect it may be possible to add this to KafkaWriter, catching the records that failed and outputting them to another PCollection. If you choose to implement this, please also contact the Beam community mailing list if you would like help merging it into master. They can help make sure the change covers the necessary requirements so that it can be merged and makes sense as a whole for Beam.
Your pipeline can then write those elsewhere (i.e. a different source). Of course, if that secondary source simultaneously has an outage/issue, you would need another DLQ.
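The general shape of the pattern, independent of Beam, is sketched below in plain Java. All names are hypothetical and nothing here is KafkaIO API: the writer is any callback that throws when a record is not acknowledged, and failed records are collected into a second list instead of failing the whole batch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Plain-Java sketch of the dead letter pattern (not Beam or KafkaIO API).
class DeadLetterDemo {
    static final class Result {
        final List<String> written = new ArrayList<>();
        final List<String> deadLetters = new ArrayList<>();
    }

    // Tries to write every record; failures are collected, not rethrown.
    static Result writeAll(List<String> records, Consumer<String> writer) {
        Result result = new Result();
        for (String record : records) {
            try {
                writer.accept(record);
                result.written.add(record);
            } catch (RuntimeException e) {
                result.deadLetters.add(record); // keep it for a later retry
            }
        }
        return result;
    }
}
```

The deadLetters list plays the role of the second output that would be written to BigQuery for later retries.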

Kafka Streams topology with windowing doesn't trigger state changes

I am building the following Kafka Streams topology (pseudo code):
gK = builder.stream().groupByKey();
g1 = gK.windowedBy(TimeWindows.of("PT1H")).reduce().mapValues().toStream().mapValues().selectKey();
g2 = gK.reduce().mapValues();
As you may notice, this is a rhombus-like topology that starts at a single input topic and ends in a single output topic, with messages flowing through two parallel flows that get joined together at the end. One flow applies (tumbling?) windowing, the other does not. Both parts of the flow work on the same key (apart from the WindowedKey introduced intermediately by the windowing).
The timestamp for my messages is event-time. That is, the timestamps get picked from the message body by my custom TimestampExtractor implementation. The actual timestamps in my messages are several years in the past.
That all works well at first sight in my unit tests with a couple of input/output messages and in the runtime environment (with real Kafka).
The problem seems to come when the number of messages starts being significant (e.g. 40K).
My failing scenario is the following:
1) ~40K records with the same key get uploaded into the input topic first
2) ~40K updates come out of the output topic, as expected
3) another ~40K records, all sharing one key that differs from the key in step 1), get uploaded into the input topic
4) only ~100 updates come out of the output topic, instead of the expected ~40K new updates. There is nothing special about those ~100 updates; their contents seem to be right, but they cover only certain time windows. For the other time windows there are no updates, even though the flow logic and input data should definitely generate 40K records. In fact, when I swap the datasets in steps 1) and 3), I get exactly the same situation, with ~40K updates coming from the second dataset and the same ~100 from the first.
I can easily reproduce this issue in the unit tests using TopologyTestDriver locally (but only on bigger numbers of input records).
In my tests, I've tried disabling caching with StreamsConfig.CACHE_MAX_BYTES_BUFFERING_CONFIG. Unfortunately, that didn't make any difference.
I tried both, reduce() calls and aggregate() calls instead. The issue persists in both cases.
What else I'm noticing is that, both with StreamsConfig.TOPOLOGY_OPTIMIZATION set to StreamsConfig.OPTIMIZE and without it, the mapValues() handler gets called in the debugger before the preceding reduce() (or aggregate()) handlers, at least the first time. I didn't expect that.
I tried both join() and leftJoin(); unfortunately, the result is the same.
In the debugger, the second portion of the data doesn't trigger the reduce() handler in the "left" flow at all, but does trigger the reduce() handler in the "right" flow.
With my configuration, if the number of records is 100 in each dataset, the problem doesn't manifest itself: I get 200 output messages, as I expect. When I raise the number to 200 in each dataset, I get fewer than the expected 400 messages out.
So it seems at the moment that something like "old" windows get dropped, and the new records for those old windows get ignored by the stream.
There is a window retention setting that can be set, but with the default value that I use, I was expecting windows to retain their state and stay active for at least 12 hours (which exceeds the run time of my unit tests significantly).
Tried to amend the left reducer with the following Window storage config:
Duration.ofDays(5 * 365),
Duration.ofHours(1), false)
still no difference in results.
The same issue persists even with only the single "left" flow, without the "right" flow and without join(). It seems that the problem is in the window retention settings of my setup. The timestamps (event-time) of my input records span 2 years. The second dataset starts from the beginning of those 2 years again. This place in Kafka Streams makes sure that the second dataset's records get ignored:
Kafka Streams Version is 2.4.0. Also using Confluent dependencies version 5.4.0.
My questions are
What could be the reason for such behaviour?
Did I miss anything in my stream topology?
Is such topology expected to work at all?
After some debugging time I found the reason for my problem.
My input datasets contain records with timestamps that span 2 years. I load the first dataset, and with that the "observed" time of my stream gets set to the maximum timestamp in that input dataset.
The second dataset starts with records whose timestamps are 2 years before the new observed time, so uploading it causes the stream internals to drop those messages. This can be seen if you set the Kafka logging to TRACE level.
So, to fix my problem I had to configure the retention and grace period for my windows:
instead of
I have to specify
.windowedBy(TimeWindows.of(windowSize).grace(Duration.ofDays(5 * 365)))
Also, I had to explicitly configure reducer storage settings as:
Duration.ofDays(5 * 365),
windowSize, false)
That's it, the output is as expected.
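The dropping behaviour and the effect of the grace period can be reproduced with a simplified model. The sketch below is plain Java with hypothetical names, not the Kafka Streams source. It assumes tumbling windows, non-negative timestamps, and that a record is dropped when the end of its window is no later than the observed stream time minus the grace period.

```java
// Simplified model of expired-window handling (not Kafka Streams code).
class GraceDemo {
    private final long windowSizeMs;
    private final long graceMs;
    private long streamTime = Long.MIN_VALUE; // highest timestamp seen so far
    int accepted = 0;
    int dropped = 0;

    GraceDemo(long windowSizeMs, long graceMs) {
        this.windowSizeMs = windowSizeMs;
        this.graceMs = graceMs;
    }

    void process(long timestamp) {
        streamTime = Math.max(streamTime, timestamp);
        long windowStart = timestamp - (timestamp % windowSizeMs);
        long windowEnd = windowStart + windowSizeMs;
        long closeTime = streamTime - graceMs;
        if (windowEnd > closeTime) {
            accepted++; // window is still open, record is aggregated
        } else {
            dropped++;  // window already closed, record is skipped
        }
    }
}
```

Feeding timestamps at day 0, day 365 and day 730 and then day 0 again drops the last record when the grace period is 23 hours, and accepts it when the grace period is 5 * 365 days, which matches the fix described above.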

Kafka Stream: KTable materialization

How to identify when the KTable materialization to a topic has completed?
For example, assume the KTable has a few million rows. Pseudo code below:
KTable<String, String> kt = kgroupedStream.groupByKey(..).reduce(..); //Assume this produces few million rows
At some point in time, I want to schedule a thread to invoke the following, which writes to the topic:
I want to ensure that all the data is written as part of the above invocation. Also, once the above "to" method is invoked, can it be invoked again in the next schedule, or will the first invocation always stay active?
Follow-up Question:
1) OK, I see that the KStream and the KTable are unbounded/infinite once the KafkaStreams instance is kicked off. However, wouldn't KTable materialization (to a compacted topic) send multiple entries for the same key within a specified period?
So, unless the compaction process attempts to clean these and retain only the latest one, the downstream application will consume all available entries for the same key querying from the topic, causing duplicates. Even if the compaction process does some level of cleanup, it is always possible that, at a given point in time, some keys have more than one entry because the compaction process is still catching up.
I assume KTable will only have one record for a given key in the RocksDB. If we have a way to schedule the materialization, that will help to avoid the duplicates. Also, reduce the amount of data being persisted in topic (increasing the storage), increase in the network traffic, additional overhead to the compaction process to clean it up.
2) Perhaps a ReadOnlyKeyValueStore would allow a controlled retrieval from the store, but it still lacks the way to schedule the retrieval of key, value and write to a topic, which requires additional coding.
Can the API be improved to allow a controlled materialization?
A KTable materialization never finishes and you cannot "invoke" a to() either.
When you use the Streams API, you "plug together" a DAG of operators. The actual method calls don't trigger any computation; they only modify the DAG of operators.
Data is processed only after you start the computation via KafkaStreams#start(). Note that all operators that you specified will run continuously and concurrently after the computation gets started.
There is no "end of a computation" because the input is expected to be unbounded/infinite, as upstream applications can write new data into the input topics at any time. Thus, your program never terminates by itself. If required, you can stop the computation via KafkaStreams#close(), though.
During execution, you cannot change the DAG. If you want to change it, you need to stop the computation and create a new KafkaStreams instance that takes the modified DAG as input.
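The split between building the DAG and running it can be mimicked in a few lines of plain Java. The class below is a hypothetical toy, not the Streams API: mapValues only records an operator, nothing runs before start(), and the operator list is fixed afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Toy model of "declare first, run later" (not the Kafka Streams API).
class LazyTopology {
    private final List<Function<String, String>> operators = new ArrayList<>();
    private boolean started = false;

    // Only records the operator; no data is processed here.
    LazyTopology mapValues(Function<String, String> op) {
        if (started) {
            throw new IllegalStateException("topology is fixed once started");
        }
        operators.add(op);
        return this;
    }

    void start() {
        started = true;
    }

    // After start(), each incoming record flows through all operators.
    String onRecord(String value) {
        if (!started) {
            throw new IllegalStateException("not started");
        }
        String current = value;
        for (Function<String, String> op : operators) {
            current = op.apply(current);
        }
        return current;
    }
}
```

Calling a method such as to() a second time from a scheduled thread would, in this picture, correspond to calling mapValues after start(): it changes the declaration, it does not trigger another write.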
Follow up:
Yes. You have to think of a KTable as a "versioned table" that evolves over time as entries are updated. Thus, all updates are written to the changelog topic and sent downstream as change-records (note that KTables do some caching, too, to "de-duplicate" consecutive updates to the same key: cf. https://docs.confluent.io/current/streams/developer-guide/memory-mgmt.html).
will consume all available entries for the same key querying from the topic, causing duplicates.
I would not consider those as "duplicates" but as updates. And yes, the application needs to be able to handle those updates correctly.
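Handling updates correctly usually means applying each record as an upsert rather than appending it. A minimal plain-Java sketch with hypothetical names follows. Each change record is a key/value pair, a later record for the same key replaces the earlier one, and a null value is treated as a delete, as with tombstones in a compacted topic.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch of a downstream reader applying change records as upserts.
class ChangelogReader {
    // Each record is {key, value}; records are applied in log order.
    static Map<String, String> materialize(List<String[]> changelog) {
        Map<String, String> table = new LinkedHashMap<>();
        for (String[] record : changelog) {
            String key = record[0];
            String value = record[1];
            if (value == null) {
                table.remove(key);     // tombstone: the key was deleted
            } else {
                table.put(key, value); // newer value replaces the older one
            }
        }
        return table;
    }
}
```

Reading (a, 1), (b, 1), (a, 2) this way yields a table with a = 2 and b = 1, whether or not compaction has already removed the older entry for a.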
if we have a way to schedule the materialization, that will help to avoid the duplicates.
Materialization is a continuous process and the KTable is updated whenever new input records are available in the input topic and processed. Thus, at any point in time there might be an update for a specific key. Thus, even if you have full control when to send updates to the changelog topic and/or downstream, there might be a new update later on. That is the nature of stream processing.
Also, reduce the amount of data being persisted in topic (increasing the storage), increase in the network traffic, additional overhead to the compaction process to clean it up.
As mentioned above, caching is used to save resources.
Can the API be improved to allow a controlled materialization?
If the provided KTable semantics don't meet your requirement, you can always write a custom operator as a Processor or Transformer, attach a key-value store to it, and implement whatever you need.

Apache Flink: changing state parameters at runtime from outside

I'm currently working on a streaming ML pipeline and need exactly-once event processing. I'm interested in Flink, but I'm wondering if there is any way to alter/update the execution state from outside.
The ML algorithm state is kept by Flink, and that's OK, but I'd like to change some execution parameters at runtime and cannot find a viable solution. Basically, an external webapp (in Go) is used to tune the parameters, and changes should be reflected in Flink for the subsequent events.
I thought about:
a shared Redis with pub/sub (as polling for each event would kill throughput)
writing a custom solution in Go :D
The state would be kept by key, related to the source of one of the multiple event streams coming in from Kafka.
You could use a CoMapFunction/CoFlatMapFunction to achieve what you described. One of the inputs is the normal data input, and on the other input you receive state-changing commands. These could most easily be ingested via a dedicated Kafka topic.
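The idea behind the two-input operator is sketched below in plain Java. This is not Flink code: the class, its methods and the threshold parameter are hypothetical stand-ins. In Flink, the two methods would correspond to flatMap1 and flatMap2 of a CoFlatMapFunction on two connected streams, and the map would be keyed state, assuming both streams are keyed the same way.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of an operator with a data input and a control input.
class TwoInputOperator {
    // Per-key parameter, changed only through the control input.
    private final Map<String, Double> thresholdByKey = new HashMap<>();
    private final double defaultThreshold;
    final List<String> output = new ArrayList<>();

    TwoInputOperator(double defaultThreshold) {
        this.defaultThreshold = defaultThreshold;
    }

    // Input 1: regular events, evaluated with the current parameter.
    void onEvent(String key, double value) {
        double threshold = thresholdByKey.getOrDefault(key, defaultThreshold);
        if (value > threshold) {
            output.add(key + ":" + value);
        }
    }

    // Input 2: commands from the external webapp, e.g. read from a
    // dedicated Kafka topic.
    void onControl(String key, double newThreshold) {
        thresholdByKey.put(key, newThreshold);
    }
}
```

A control message only affects events processed after it, which matches the requirement that changes apply to subsequent events.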