How does kafka streams compute watermarks? - apache-kafka

Does Kafka Streams internally compute watermarks? Is it possible to observe the result of a window (only) when it completes (i.e. when watermark passes end of window)?

Kafka Streams does not use watermarks internally, but a new feature in 2.1.0 lets you observe the result of a window when it closed. It's called Suppressed, and you can read about it in the docs: Window Final Results:
KGroupedStream<UserId, Event> grouped = ...;
grouped
.windowedBy(TimeWindows.of(Duration.ofHours(1)).grace(ofMinutes(10)))
.count()
.suppress(Suppressed.untilWindowCloses(unbounded()))

Does Kafka Streams internally compute watermarks?
No. Kafka Streams follows a continuous update processing model that does not require watermarks. You can find more details about this online:
https://www.confluent.io/blog/watermarks-tables-event-time-dataflow-model/
https://www.confluent.io/resources/streams-tables-two-sides-same-coin
Is it possible to observe the result of a window (only) when it completes (i.e. when watermark passes end of window)?
You can observe of result of a window at any point in time. Either via subscribing to the result changelog stream via for example KTable#toStream()#foreach() (ie, a push based approach), or via Interactive Queries that let you query the result window actively (ie, a pull based approach).
As mentioned by #Dmitry, for the push based approach, you can also use suppress() operator if you are only interested in a window's final result.

Related

Migrating from FlinkKafkaConsumer to KafkaSource, no windows executed

I am a kafka and flink beginner.
I have implemented FlinkKafkaConsumer to consume messages from a kafka-topic. The only custom setting other than "group" and "topic" is (ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest") to enable re-reading the same messages several times. It works out of the box for consuming and logic.
Now FlinkKafkaConsumer is deprecated, and i wanted to change to the successor KafkaSource.
Initializing KafkaSource with the same parameters as i do FlinkKafkaConsumer produces a read of the topic as expected, i can verify this by printing the stream. De-serialization and timestamps seem to work fine. However execution of windows are not done, and as such no results are produced.
I assume some default setting(s) in KafkaSource are different from that of FlinkKafkaConsumer, but i have no idea what they might be.
KafkaSource - Not working
KafkaSource<TestData> source = KafkaSource.<TestData>builder()
.setBootstrapServers(propertiesForKafka.getProperty("bootstrap.servers"))
.setTopics(TOPIC)
.setDeserializer(new CustomDeserializer())
.setGroupId(GROUP_ID)
.setStartingOffsets(OffsetsInitializer.earliest())
.build();
DataStream<TestData> testDataStreamSource = env.fromSource(
source,
WatermarkStrategy.
<TestData>noWatermarks(),
"Kafka Source"
);
Kafka consumer - Working (Properties contains group.id,bootstrap.servers and zookeeper.connect)
propertiesForKafka.setProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
FlinkKafkaConsumer<TestData> flinkKafkaConsumer = new FlinkKafkaConsumer(TOPIC, new CustomDeserializer(), propertiesForKafka);
DataStreamSource<TestData> testDataStreamSource = env.addSource(flinkKafkaConsumer)
Both streams use the same pipeline that looks like this
testDataStreamSource
.assignTimestampsAndWatermarks(WatermarkStrategy.
<TestData>forMonotonousTimestamps().
withTimestampAssigner((event, timestamp) -> event.getTimestamp()))
.keyBy(TestData::getKey)
.window(SlidingEventTimeWindows.of(Time.hours(3), Time.hours(1)))
.process(new ProcessWindowFunction<TestData, TestDataOutput, String, TimeWindow>() {
#Override
public void process(
....
});
Things tried
I've tried to experiment with setting committing of offsets, but it
has not improved the situation.
Setting timestamps already in the source.
Update: The answer is that the KafkaSource behaves differently than FlinkKafkaConsumer in the case where the number of Kafka partitions is smaller than the parallelism of Flink's kafka source operator. See https://stackoverflow.com/a/70101290/2000823 for details.
Original answer:
The problem is almost certainly something related to the timestamps and watermarks.
To verify that timestamps and watermarks are the problem, you could do a quick experiment where you replace the 3-hour-long event time sliding windows with short processing time tumbling windows.
In general it is preferred (but not required) to have the KafkaSource do the watermarking. Using forMonotonousTimestamps in a watermark generator applied after the source, as you are doing now, is a risky move. This will only work correctly if the timestamps in all of the partitions being consumed by each parallel instance of the source are processed in order. If more than one Kafka partition is assigned to any of the KafkaSource tasks, this isn't going to happen. On the other hand, if you supply the forMonotonousTimestamps watermarking strategy in the fromSource call (rather than noWatermarks), then all that will be required is that the timestamps be in order on a per-partition basis, which I imagine is the case.
As troubling as that is, it's probably not enough to explain why the windows don't produce any results. Another possible root cause is that the test data set doesn't include any events with timestamps after the first window, so that window never closes.
Do you have a sink? If not, that would explain things.
You can use the Flink dashboard to help debug this. Look to see if the watermarks are advancing in the window tasks. Turn on checkpointing, and then look to see how much state the window task has -- it should have some non-zero amount of state.

Process item in a window with Kafka streams

I'm trying to process some events in a sliding window with kafka stream but I think i don't understand some details of kafka streams so I'm not able to do what I want.
What I have :
input topic of events with key/value like (Int, Person)
What I want :
read these events within a sliding window of 10 minutes
process each element in the sliding window
filter and count some element, fire some event to an other kafka
topic (like if a wrong value is detected)
To be simple: get all the events in a sliding window of 10 minutes, do a foreach on them, compute some stats/events in the context of the window, go to the next window...
What I tried :
I tried to mix the Stream and the processor API like :
val streamBuilder = new StreamsBuilder()
streamBuilder.stream[Int, Person](topic)
.groupBy((_, value) => PersonWrapper(value.id, value.name))
.windowedBy(TimeWindows.of(10 * 60 * 1000L).advanceBy(1 * 60 * 1000L))
// now I have a window of (PersonWrapper, Person) right ?
streamBuilder.build().addProcessor(....)
And now I'd add a processor to this topology to process each events of the sliding window.
I don't understand what is TimeWindowStream and why we should have a KGroupedStream to apply a Window on events. If someone can enlight me about Kafka stream and what I'm trying to do.
Did you read the documentation: https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#windowing
Windowing is a special form of grouping (grouping based on time)
Grouping is always require to compute an aggregation in Kafka Streams
After you have a grouped and windowed stream you call aggregate() for the actually processing (not need to attach a Processor manually; the call to aggregate() will implicitly add a Processor for you).
Btw: Kafka Streams does not really support "sliding windows" for aggregation. The window you define is called a hopping window.
KGroupedStream and TimeWindowedKStreams are basically just helper classes and an intermediate representation that allows for a fluent API design.
The tutorial is also a good way to get started: https://docs.confluent.io/current/streams/quickstart.html
You should also check out the examples: https://github.com/confluentinc/kafka-streams-examples

Category projections using kafka and cassandra for event-sourcing

I'm using Cassandra and Kafka for event-sourcing, and it works quite well. But I've just recently discovered a potentially major flaw in the design/set-up. A brief intro to how it is done:
The aggregate command handler is basically a kafka consumer, which consumes messages of interest on a topic:
1.1 When it receives a command, it loads all events for the aggregate, and replays the aggregate event handler for each event to get the aggregate up to current state.
1.2 Based on the command and businiss logic it then applies one or more events to the event store. This involves inserting the new event(s) to the event store table in cassandra. The events are stamped with a version number for the aggregate - starting at version 0 for a new aggregate, making projections possible. In addition it sends the event to another topic (for projection purposes).
1.3 A kafka consumer will listen on the topic upon these events are published. This consumer will act as a projector. When it receives an event of interest, it loads the current read model for the aggregate. It checks that the version of the event it has received is the expected version, and then updates the read model.
This seems to work very well. The problem is when I want to have what EventStore calls category projections. Let's take Order aggregate as an example. I can easily project one or more read models pr Order. But if I want to for example have a projection which contains a customers 30 last orders, then I would need a category projection.
I'm just scratching my head how to accomplish this. I'm curious to know if any other are using Cassandra and Kafka for event sourcing. I've read a couple of places that some people discourage it. Maybe this is the reason.
I know EventStore has support for this built in. Maybe using Kafka as event store would be a better solution.
With this kind of architecture, you have to choose between:
Global event stream per type - simple
Partitioned event stream per type - scalable
Unless your system is fairly high throughput (say at least 10s or 100s of events per second for sustained periods to the stream type in question), the global stream is the simpler approach. Some systems (such as Event Store) give you the best of both worlds, by having very fine-grained streams (such as per aggregate instance) but with the ability to combine them into larger streams (per stream type/category/partition, per multiple stream types, etc.) in a performant and predictable way out of the box, while still being simple by only requiring you to keep track of a single global event position.
If you go partitioned with Kafka:
Your projection code will need to handle concurrent consumer groups accessing the same read models when processing events for different partitions that need to go into the same models. Depending on your target store for the projection, there are lots of ways to handle this (transactions, optimistic concurrency, atomic operations, etc.) but it would be a problem for some target stores
Your projection code will need to keep track of the stream position of each partition, not just a single position. If your projection reads from multiple streams, it has to keep track of lots of positions.
Using a global stream removes both of those concerns - performance is usually likely to be good enough.
In either case, you'll likely also want to get the stream position into the long term event storage (i.e. Cassandra) - you could do this by having a dedicated process reading from the event stream (partitioned or global) and just updating the events in Cassandra with the global or partition position of each event. (I have a similar thing with MongoDB - I have a process reading the 'oplog' and copying oplog timestamps into events, since oplog timestamps are totally ordered).
Another option is to drop Cassandra from the initial command processing and use Kafka Streams instead:
Partitioned command stream is processed by joining with a partitioned KTable of aggregates
Command result and events are computed
Atomically, KTable is updated with changed aggregate, events are written to event stream and command response is written to command response stream.
You would then have a downstream event processor that copies the events into Cassandra for easier querying etc. (and which can add the Kafka stream position to each event as it does it to give the category ordering). This can help with catch up subscriptions, etc. if you don't want to use Kafka for long term event storage. (To catch up, you'd just read as far as you can from Cassandra and then switch to streaming from Kafka from the position of the last Cassandra event). On the other hand, Kafka itself can store events for ever, so this isn't always necessary.
I hope this helps a bit with understanding the tradeoffs and problems you might encounter.

Synchronize Data From Multiple Data Sources

Our team is trying to build a predictive maintenance system whose task is to look at a set of events and predict whether these events depict a set of known anomalies or not.
We are at the design phase and the current system design is as follows:
The events may occur on multiple sources of an IoT system (such as cloud platform, edge devices or any intermediate platforms)
The events are pushed by the data sources into a message queueing system (currently we have chosen Apache Kafka).
Each data source has its own queue (Kafka Topic).
From the queues, the data is consumed by multiple inference engines (which are actually neural networks).
Depending upon the feature set, an inference engine will subscribe to
multiple Kafka topics and stream data from those topics to continuously output the inference.
The overall architecture follows the single-responsibility principle meaning that every component will be separate from each other and run inside a separate Docker container.
Problem:
In order to classify a set of events as an anomaly, the events have to occur in the same time window. e.g. say there are three data sources pushing their respective events into Kafka topics, but due to some reason, the data is not synchronized.
So one of the inference engines pulls the latest entries from each of the kafka topics, but the corresponding events in the pulled data do not belong to the same time window (say 1 hour). That will result in invalid predictions due to out-of-sync data.
Question
We need to figure out how can we make sure that the data from all three sources are pushed in-order so that when an inference engine requests entries (say the last 100 entries) from multiple kakfa topics, the corresponding entries in each topic belong to the same time window?
I would suggest KSQL, which is a streaming SQL engine that enables real-time data processing against Apache Kafka. It also provides nice functionality for Windowed Aggregation etc.
There are 3 ways to define Windows in KSQL:
hopping windows, tumbling windows, and session windows. Hopping and
tumbling windows are time windows, because they're defined by fixed
durations they you specify. Session windows are dynamically sized
based on incoming data and defined by periods of activity separated by
gaps of inactivity.
In your context, you can use KSQL to query and aggregate the topics of interest using Windowed Joins. For example,
SELECT t1.id, ...
FROM topic_1 t1
INNER JOIN topic_2 t2
WITHIN 1 HOURS
ON t1.id = t2.id;
Some suggestions -
Handle delay at the producer end -
Ensure all three producers always send data in sync to Kafka topics by using batch.size and linger.ms.
eg. if linger.ms is set to 1000, all messages would be sent to Kafka within 1 second.
Handle delay at the consumer end -
Considering any streaming engine at the consumer side (be it Kafka-stream, spark-stream, Flink), provides windows functionality to join/aggregate stream data based on keys while considering delayed window function.
Check this - Flink windows for reference how to choose right window type link
To handle this scenario, data sources must provide some mechanism for the consumer to realize that all relevant data has arrived. The simplest solution is to publish a batch from data source with a batch Id (Guid) of some form. Consumers can then wait until the next batch id shows up marking the end of the previous batch. This approach assumes sources will not skip a batch, otherwise they will get permanently mis-aligned. There is no algorithm to detect this but you might have some fields in the data that show discontinuity and allow you to realign the data.
A weaker version of this approach is to either just wait x-seconds and assume all sources succeed in this much time or look at some form of time stamps (logical or wall clock) to detect that a source has moved on to the next time window implicitly showing completion of the last window.
The following recommendations should maximize success of event synchronization for the anomaly detection problem using timeseries data.
Use a network time synchronizer on all producer/consumer nodes
Use a heartbeat message from producers every x units of time with a fixed start time. For eg: the messages are sent every two minutes at the start of the minute.
Build predictors for producer message delay. use the heartbeat messages to compute this.
With these primitives, we should be able to align the timeseries events, accounting for time drifts due to network delays.
At the inference engine side, expand your windows at a per producer level to synch up events across producers.

Writing directly to a kafka state store

We've started experimenting with Kafka to see if it can be used to aggregate our application data. I think our use case is a match for Kafka streams, but we aren't sure if we are using the tool correctly. The proof of concept we've built seems to be working as designed, I'm not sure that we are using the APIs appropriately.
Our proof of concept is to use kafka streams to keep a running tally of information about a program in an output topic, e.g.
{
"numberActive": 0,
"numberInactive": 0,
"lastLogin": "01-01-1970T00:00:00Z"
}
Computing the tally is easy, it is essentially executing a compare and swap (CAS) operation based on the input topic & output field.
The local state contains the most recent program for a given key. We join an input stream against the state store and run the CAS operation using a TransformSupplier, which explictly writes the data to the state store using
context.put(...)
context.commit();
Is this an appropriate use of the local state store? Is there another another approach to keeping a stateful running tally in a topic?
Your design sounds right to me (I presume you are using PAPI not the Streams DSL), that you are reading in one stream, calling transform() on the stream in which an state store is associated with the operator. Since your update logic seems to be only key-dependent and hence can be embarrassingly parallelizable via Streams library based on key partitioning.
One thing to note that, it seems you are calling "context.commit()" after every single put call, which is not a recommended pattern. This is because commit() operation is a pretty heavy call that will involves flushing the state store, sending commit offset request to the Kafka broker etc, calling it on every single call would result in very low throughput. It is recommended to only call commit() only after a bunch of records are processed, or you can just rely on the Streams config "commit.interval.ms" to rely on Streams library to only call commit() internally after every time interval. Note that this will not affect your processing semantics upon graceful shutting down, since upon shutdown Streams will always enforce a commit() call.