Apache Flink - Event time - streaming

I want to create an event time clock for my events in Apache flink. I am doing it in following way
import java.util.Date;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

public class TimeStampAssigner implements AssignerWithPeriodicWatermarks<Tuple2<String, String>> {

    private final long maxOutOfOrderness = 0; // 3.5
    private long currentMaxTimestamp;

    @Override
    public long extractTimestamp(Tuple2<String, String> element, long previousElementTimestamp) {
        currentMaxTimestamp = new Date().getTime();
        return currentMaxTimestamp;
    }

    @Override
    public Watermark getCurrentWatermark() {
        return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
    }
}
Please check the above code and tell me if I am doing it correctly. After the event time and watermark assignment, I want to process the stream in a ProcessFunction in which I will be collecting the stream data for 10 minutes for different keys.

No, this is not an appropriate implementation. An event-time timestamp should be deterministic (i.e., reproducible), and it should be based on data in the event stream. If instead you use new Date().getTime(), then you are more or less using processing time.
Typically when doing event time processing your events will have a timestamp field, and the timestamp extractor will return the value of this field.
The implementation you've shown will lose most of the benefits that come from working with event time, such as the ability to reprocess historic data in order to reproduce historic results.
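For example, here is a minimal sketch of such an extractor, assuming (hypothetically) that the second field of your Tuple2 carries an epoch-millisecond timestamp written by whatever produced the event:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

public class EventFieldTimestampAssigner implements AssignerWithPeriodicWatermarks<Tuple2<String, String>> {

    private final long maxOutOfOrderness = 3500; // tolerate 3.5 seconds of out-of-orderness
    private long currentMaxTimestamp = Long.MIN_VALUE + maxOutOfOrderness;

    @Override
    public long extractTimestamp(Tuple2<String, String> element, long previousElementTimestamp) {
        // Hypothetical layout: f1 holds the event's own timestamp in epoch milliseconds.
        long timestamp = Long.parseLong(element.f1);
        currentMaxTimestamp = Math.max(currentMaxTimestamp, timestamp);
        return timestamp;
    }

    @Override
    public Watermark getCurrentWatermark() {
        // The watermark trails the largest timestamp seen so far by the out-of-orderness bound.
        return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
    }
}

With an extractor like this, re-running the job over the same input produces the same windows, which is exactly what the new Date().getTime() version cannot guarantee.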

Your implementation implements ingestion time into the Flink system, not event time. If you consume from Kafka, for example, previousElementTimestamp should normally point to the time when the event was produced to Kafka (unless the Kafka producer set something else), which would make your stream processing reproducible.
If you want to implement event-time processing in Flink, you should rather use timestamps associated with your elements, which could be either inside the element itself (which makes sense for time series) or stored in Kafka and available via previousElementTimestamp.
Regarding maxOutOfOrderness, you probably also want to consider Flink's side-output feature, which makes it possible to get the elements that arrive late, after the window has fired, and update your Flink job's output.
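As a rough sketch of the side-output idea (imports omitted; Event, Result, MyWindowFunction and the key selector are placeholders, and the stream is assumed to already have timestamps and watermarks assigned):

final OutputTag<Event> lateTag = new OutputTag<Event>("late-events") {};

SingleOutputStreamOperator<Result> windowed = events
    .keyBy(Event::getKey)
    .window(TumblingEventTimeWindows.of(Time.minutes(10)))
    .sideOutputLateData(lateTag)   // late elements go to the side output instead of being dropped silently
    .process(new MyWindowFunction());

// The late elements can then be used to amend or correct the job's earlier output.
DataStream<Event> lateEvents = windowed.getSideOutput(lateTag);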
If you consume from Kafka and just want a simple event-time processing implementation that tolerates some data loss, go with AscendingTimestampExtractor.
There are some potential problems with an AscendingTimestampExtractor which can appear if your data are not ordered within the partition, or if you apply the extractor after an operator rather than directly on the KafkaSource.
For a robust industrial use case you should rather implement watermark ingestion into the persistent log storage, as described in the Google Dataflow model.

Related

How does Google Dataflow determine the watermark for various sources?

I was just reviewing the documentation to understand how Google Dataflow handles watermarks, and it just mentions the very vague:
The data source determines the watermark
It seems you can add more flexibility through withAllowedLateness but what will happen if we do not configure this?
Thoughts so far
I found something indicating that if your source is Google PubSub it already has a watermark which will get taken, but what if the source is something else? For example a Kafka topic (which I believe does not inherently have a watermark, so I don't see how something like this would apply).
Is it always 10 seconds, or just 0? Is it looking at the last few minutes to determine the max lag and if so how many (surely not since forever as that would get distorted by the initial start of processing which might see giant lag)? I could not find anything on the topic.
I also searched outside the context of Google DataFlow for Apache Beam documentation but did not find anything explaining this either.
When using Apache Kafka as a data source, each Kafka partition may have a simple event time pattern (ascending timestamps or bounded out-of-orderness). However, when consuming streams from Kafka, multiple partitions often get consumed in parallel, interleaving the events from the partitions and destroying the per-partition patterns (this is inherent in how Kafka’s consumer clients work).
In that case, you can use Flink’s Kafka-partition-aware watermark generation. Using that feature, watermarks are generated inside the Kafka consumer, per Kafka partition, and the per-partition watermarks are merged in the same way as watermarks are merged on stream shuffles.
For example, if event timestamps are strictly ascending per Kafka partition, generating per-partition watermarks with the ascending-timestamps watermark generator will result in perfect overall watermarks. Note that if no TimestampAssigner is provided, the timestamps of the Kafka records themselves will be used instead.
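A minimal sketch of that setup with the same legacy API used elsewhere in this thread (MyType, MyTypeSchema, the topic name and getEventTimestamp() are placeholders; env and properties are the usual execution environment and Kafka properties). Attaching the extractor to the consumer itself is what enables per-partition watermarking:

FlinkKafkaConsumer<MyType> kafkaSource =
    new FlinkKafkaConsumer<>("my-topic", new MyTypeSchema(), properties);

// Watermarks are generated inside the Kafka consumer, per partition,
// and merged the same way as watermarks are merged on stream shuffles.
kafkaSource.assignTimestampsAndWatermarks(
    new AscendingTimestampExtractor<MyType>() {
        @Override
        public long extractAscendingTimestamp(MyType element) {
            return element.getEventTimestamp(); // hypothetical event-time field
        }
    });

DataStream<MyType> stream = env.addSource(kafkaSource);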
In any data processing system, there is a certain amount of lag between the time a data event occurs (the “event time”, determined by the timestamp on the data element itself) and the time the actual data element gets processed at any stage in your pipeline (the “processing time”, determined by the clock on the system processing the element). In addition, there are no guarantees that data events will appear in your pipeline in the same order that they were generated.
For example, let’s say we have a PCollection that’s using fixed-time windowing, with windows that are five minutes long. For each window, Beam must collect all the data with an event time timestamp in the given window range (between 0:00 and 4:59 in the first window, for instance). Data with timestamps outside that range (data from 5:00 or later) belong to a different window.
However, data isn’t always guaranteed to arrive in a pipeline in time order, or to always arrive at predictable intervals. Beam tracks a watermark, which is the system’s notion of when all data in a certain window can be expected to have arrived in the pipeline. Once the watermark progresses past the end of a window, any further element that arrives with a timestamp in that window is considered late data.
From our example, suppose we have a simple watermark that assumes approximately 30s of lag time between the data timestamps (the event time) and the time the data appears in the pipeline (the processing time), then Beam would close the first window at 5:30. If a data record arrives at 5:34, but with a timestamp that would put it in the 0:00-4:59 window (say, 3:38), then that record is late data.
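In the Beam Java SDK, a windowing configuration along those lines might look roughly like this sketch (the PCollection and the 30-second lateness value are just illustrative):

PCollection<String> items = ...; // elements already carry event-time timestamps

PCollection<String> fixedWindowedItems = items.apply(
    Window.<String>into(FixedWindows.of(Duration.standardMinutes(5)))
          // Without allowed lateness, data arriving after the watermark passes
          // the end of its window is dropped.
          .withAllowedLateness(Duration.standardSeconds(30)));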

Flink Missing Events With Windowed Processor(Event Time Windows) and Kafka Source

We have a Streaming Job that has 20 separate pipelines, with each pipeline having one/many Kafka topic sources and with some pipelines having Windowed Processor and others being a Non-Windowed Processor.
We are noticing data loss for Windowed Processor pipelines when the job goes down and takes some time to recover/when the job needs to be restarted.
I have set a UID for all of the operators, and I can see in the logs that offsets are being restored from the savepoint for the Kafka consumer operator.
We are using BoundedOutOfOrdernessTimestampExtractor to assign watermarks based on event time.
public class KafkaEventTimestampExtractor extends BoundedOutOfOrdernessTimestampExtractor<Event> implements Serializable {

    public KafkaEventTimestampExtractor(Time maxOutOfOrderness) {
        super(maxOutOfOrderness);
    }

    @Override
    public long extractTimestamp(Event element) {
        try {
            log.info("event to be processed, event:{}", new ObjectMapper().writeValueAsString(element));
        } catch (JsonProcessingException e) {
            e.printStackTrace();
        }
        Long ts = null;
        ts = Double.valueOf(Double.parseDouble(element.getTs())).longValue();
        // Normalize second-resolution timestamps to milliseconds.
        ts = ts.toString().length() < 13 ? ts * 1000 : ts;
        return ts;
    }
}
Pipeline config looks something like this:
NON-WINDOWED
SourceUtil
    .getEventDataStream(env, kafkaSourceSet)
    .process(new S3EventProcessor()).uid(“…..**)
    .addSink();
WINDOWED
SourceUtil
    .getEventDataStream(env, kafkaSourceSet)
    .assignTimestampsAndWatermarks(
        new KafkaEventTimestampExtractor(Time.seconds(4)))
    .windowAll(TumblingEventTimeWindows.of(
        Time.milliseconds(kafkaSourceSet.bufferWindowSize)))
    .process(new S3EventProcessor()).uid(“…..**)
    .addSink();
Let's say the job is down for 30 minutes. In that case, the pipelines where we do not use a window processor do not miss any data, but partial data is missed by the windowed processor for those 30 minutes.
When we increase the out-of-order delay for the time windows, i.e. from 4 seconds to 30 minutes, then events are not missed as long as the application is back up within 30 minutes. But this gets us nowhere near a solution, since a delay of more than 1 minute is infeasible for us, and there would also be too many live windows, which would mean a huge infra change for us.
The only scenario I can imagine that might explain this is if the event timestamps are affected by the outage. A 30-minute outage would then cause a 30-minute gap in the timestamps, and with out-of-order ingestion, a 4-second bounded-out-of-orderness strategy will yield some late events that are dropped by the window.
This was happening due to a mistake in my pipeline: instead of attaching the timestamp assigner to the FlinkKafkaConsumer, it was applied to the data stream generated from the FlinkKafkaConsumer.
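In code, the fix was roughly the following sketch (the consumer construction inside SourceUtil is not shown in the question, so the names here are assumptions):

// Before: partitions are already interleaved when the extractor runs.
SourceUtil.getEventDataStream(env, kafkaSourceSet)
    .assignTimestampsAndWatermarks(new KafkaEventTimestampExtractor(Time.seconds(4)));

// After: the extractor sits on the consumer itself, so watermarks are
// generated per Kafka partition before the partitions are merged.
FlinkKafkaConsumer<Event> consumer = ...; // as built inside SourceUtil
consumer.assignTimestampsAndWatermarks(new KafkaEventTimestampExtractor(Time.seconds(4)));
DataStream<Event> events = env.addSource(consumer);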
This change fixed the issue on my end for automatic recovery, but in the case of a manual restart after changes to the pipeline, some data is still being missed for the last window that was open when the job stopped.
Note: we are using checkpoints for manual recovery.
As per the docs, checkpoints are ideal for automatic recovery in case of job failures.
Any note on this would help: do we need to create a savepoint whenever we make changes to the pipeline and restart it manually, or can we make a complete recovery with the checkpoint?
Our only concern with using savepoints is the reprocessing of the same events that might happen, which is not ideal for us in a few cases.

Flink: assign watermark to FlinkKafkaConsumer

I have a FlinkKafkaConsumer defined as follows: FlinkKafkaConsumer[String]("topic", new SimpleStringSchema(), properties), and I'm working with event time by using setStreamTimeCharacteristic(TimeCharacteristic.EventTime).
Now I want to assign a periodic watermark with the function assignTimestampsAndWatermarks, but I don't know what I should pass to that function, since the example in the documentation receives an element of type MyType with a getCreationTime(), while my consumer is of type String.
Is it possible to assign event time in this situation?
EDIT: The time I would want to use as event time is the time each record was stored in Kafka.
The notion of event time is, at least by definition, strictly connected with the time at which events are created rather than received. So, if the events you are consuming from Kafka have some kind of timestamp (for example, if you are consuming JSON as String and then parsing it), then you can use this timestamp inside the assignTimestampsAndWatermarks function.
If you are consuming plain String objects, then the best thing you could do is to use a custom KafkaDeserializationSchema to extract the Kafka timestamp of each record and use that.
Technically, you could even use a counter that artificially increases the timestamp for each record (for example by incrementing it by 1), but this doesn't really make sense in terms of event-time processing.
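A minimal sketch of the KafkaDeserializationSchema approach (in Java; the Tuple2<String, Long> layout is just one possible choice), which carries the Kafka record timestamp along with the payload so it can later be returned from assignTimestampsAndWatermarks:

import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
import org.apache.kafka.clients.consumer.ConsumerRecord;

import java.nio.charset.StandardCharsets;

public class TimestampedStringSchema implements KafkaDeserializationSchema<Tuple2<String, Long>> {

    @Override
    public boolean isEndOfStream(Tuple2<String, Long> nextElement) {
        return false;
    }

    @Override
    public Tuple2<String, Long> deserialize(ConsumerRecord<byte[], byte[]> record) {
        // record.timestamp() is the timestamp stored with the record in Kafka
        // (producer or log-append time, depending on the topic configuration).
        String value = new String(record.value(), StandardCharsets.UTF_8);
        return Tuple2.of(value, record.timestamp());
    }

    @Override
    public TypeInformation<Tuple2<String, Long>> getProducedType() {
        return TypeInformation.of(new TypeHint<Tuple2<String, Long>>() {});
    }
}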

Category projections using kafka and cassandra for event-sourcing

I'm using Cassandra and Kafka for event-sourcing, and it works quite well. But I've just recently discovered a potentially major flaw in the design/set-up. A brief intro to how it is done:
The aggregate command handler is basically a kafka consumer, which consumes messages of interest on a topic:
1.1 When it receives a command, it loads all events for the aggregate, and replays the aggregate event handler for each event to get the aggregate up to current state.
1.2 Based on the command and business logic, it then applies one or more events to the event store. This involves inserting the new event(s) into the event store table in Cassandra. The events are stamped with a version number for the aggregate - starting at version 0 for a new aggregate, making projections possible. In addition it publishes the event to another topic (for projection purposes).
1.3 A Kafka consumer listens on the topic to which these events are published. This consumer acts as a projector. When it receives an event of interest, it loads the current read model for the aggregate, checks that the version of the event it has received is the expected version, and then updates the read model.
This seems to work very well. The problem is when I want to have what EventStore calls category projections. Let's take an Order aggregate as an example. I can easily project one or more read models per Order. But if I want to, for example, have a projection which contains a customer's 30 most recent orders, then I would need a category projection.
I'm scratching my head over how to accomplish this. I'm curious to know whether anyone else is using Cassandra and Kafka for event sourcing. I've read in a couple of places that some people discourage it. Maybe this is the reason.
I know EventStore has support for this built in. Maybe using Kafka as event store would be a better solution.
With this kind of architecture, you have to choose between:
Global event stream per type - simple
Partitioned event stream per type - scalable
Unless your system is fairly high throughput (say at least 10s or 100s of events per second for sustained periods to the stream type in question), the global stream is the simpler approach. Some systems (such as Event Store) give you the best of both worlds, by having very fine-grained streams (such as per aggregate instance) but with the ability to combine them into larger streams (per stream type/category/partition, per multiple stream types, etc.) in a performant and predictable way out of the box, while still being simple by only requiring you to keep track of a single global event position.
If you go partitioned with Kafka:
Your projection code will need to handle concurrent consumer groups accessing the same read models when processing events for different partitions that need to go into the same models. Depending on your target store for the projection, there are lots of ways to handle this (transactions, optimistic concurrency, atomic operations, etc.), but it would be a problem for some target stores.
Your projection code will need to keep track of the stream position of each partition, not just a single position. If your projection reads from multiple streams, it has to keep track of lots of positions.
Using a global stream removes both of those concerns - performance is usually likely to be good enough.
In either case, you'll likely also want to get the stream position into the long term event storage (i.e. Cassandra) - you could do this by having a dedicated process reading from the event stream (partitioned or global) and just updating the events in Cassandra with the global or partition position of each event. (I have a similar thing with MongoDB - I have a process reading the 'oplog' and copying oplog timestamps into events, since oplog timestamps are totally ordered).
Another option is to drop Cassandra from the initial command processing and use Kafka Streams instead (a rough sketch follows the list below):
The partitioned command stream is processed by joining it with a partitioned KTable of aggregates
The command result and events are computed
Atomically, the KTable is updated with the changed aggregate, the events are written to the event stream, and the command response is written to the command response stream.
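As a rough sketch of that topology in the Kafka Streams DSL (all topic names, the Command, Aggregate and CommandResult types, and the handle() function are hypothetical; serdes are assumed to be configured as defaults, and exactly-once processing would be enabled via processing.guarantee so the writes below commit atomically):

StreamsBuilder builder = new StreamsBuilder();

// Aggregate state as a partitioned KTable keyed by aggregate id; the commands topic
// must be co-partitioned with it (same key, same partition count).
KTable<String, Aggregate> aggregates = builder.table("aggregate-state");
KStream<String, Command> commands = builder.stream("commands");

// Join each command with the current aggregate and run the decision logic.
KStream<String, CommandResult> results =
    commands.join(aggregates, (command, aggregate) -> handle(command, aggregate));

// Fan the result out: new events, the command response, and the updated aggregate
// (written back to the topic that backs the KTable).
results.flatMapValues(result -> result.getEvents()).to("events");
results.mapValues(result -> result.getResponse()).to("command-responses");
results.mapValues(result -> result.getUpdatedAggregate()).to("aggregate-state");

A production version would more likely use a processor with a local state store, so the aggregate is read and updated in one step rather than round-tripping through the table's topic, but the shape of the dataflow is the same.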
You would then have a downstream event processor that copies the events into Cassandra for easier querying etc. (and which can add the Kafka stream position to each event as it does it to give the category ordering). This can help with catch up subscriptions, etc. if you don't want to use Kafka for long term event storage. (To catch up, you'd just read as far as you can from Cassandra and then switch to streaming from Kafka from the position of the last Cassandra event). On the other hand, Kafka itself can store events for ever, so this isn't always necessary.
I hope this helps a bit with understanding the tradeoffs and problems you might encounter.

How to schedule periodical task based on number of processed messages?

I want to use Kafka Processor API to process messages from Kafka.
I would like to call some function periodically, something like:
context.schedule(intervalMs, punctuationType, somePunctuator), where somePunctuator performs some periodic job, but instead of using a time interval as the trigger I would like to invoke that task after processing some number of messages.
Is it possible to do such triggering in Kafka Streams?
Yes, it's possible using a Kafka Streams state store.
The logic depends on what exactly you need to do when the number of processed messages is reached.
If you need to propagate data to the next processor or sink node, you need to store the aggregated values as a list of objects inside a key-value state store. Inside Processor.process(..) you put data into the key-value store and then check whether the number of items has reached the limit, and if so perform the required logic (like processorContext.forward(..)). Please take a look at a similar example here.
If you only need to run some logic after reaching the number and don't need the values themselves, you could store just a counter and increment it inside Processor.process(..).
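A minimal sketch of the counter variant (the store name "message-counter" and the threshold are placeholders; the key-value store has to be created with Stores.keyValueStoreBuilder and connected to this processor when building the topology):

import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

public class CountTriggeredProcessor extends AbstractProcessor<String, String> {

    private static final long TRIGGER_EVERY_N_MESSAGES = 1000L;
    private KeyValueStore<String, Long> counterStore;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        super.init(context);
        counterStore = (KeyValueStore<String, Long>) context.getStateStore("message-counter");
    }

    @Override
    public void process(String key, String value) {
        // Increment the number of messages seen so far, kept in the state store
        // so it survives restarts and rebalances.
        Long stored = counterStore.get("count");
        long count = (stored == null ? 0L : stored) + 1;

        if (count >= TRIGGER_EVERY_N_MESSAGES) {
            // Place the "periodic" work here, e.g. forward something downstream.
            context().forward(key, value);
            counterStore.put("count", 0L);
        } else {
            counterStore.put("count", count);
        }
    }
}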