Flink: assign watermark to FlinkKafkaConsumer - apache-kafka

I have a FlinkKafkaConsumer defined as follows: FlinkKafkaConsumer[String]("topic", new SimpleStringSchema(), properties), and I'm working with event time by using setStreamTimeCharacteristic(TimeCharacteristic.EventTime).
Now I want to assign a periodic watermark with the function assignTimestampsAndWatermarks, but I don't know what I should pass to that function, since in the documentation the example for this function receives an element of type MyType with a getCreationTime() method, while my consumer is of type String.
Is it possible to assign event time in this situation?
EDIT: The time I would want to use as event time is the time each record was stored in Kafka.

The notion of EventTime is, at least by definition, strictly connected with the time at which events are created rather than received. So, if the events you are consuming from Kafka carry some kind of timestamp (for example, if you are consuming JSON as String and then parsing it), you can use that timestamp inside the assignTimestampsAndWatermarks function.
If you are consuming plain String objects, the best thing you can do is use a custom KafkaDeserializationSchema to extract the Kafka timestamp of each record and use that.
Technically, you could even use a counter that artificially increases the timestamp for each record (for example by incrementing it by 1), but this doesn't make much sense in terms of EventTime processing.
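As a rough sketch of the KafkaDeserializationSchema approach (assuming Flink 1.8+ with the universal Kafka connector; the class name is illustrative), you can wrap each String value together with the Kafka record timestamp and use that field for watermarking:

import java.nio.charset.StandardCharsets;

import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
import org.apache.kafka.clients.consumer.ConsumerRecord;

// Deserializes the String payload and keeps the Kafka record timestamp next to it.
public class TimestampedStringSchema implements KafkaDeserializationSchema<Tuple2<String, Long>> {

    @Override
    public boolean isEndOfStream(Tuple2<String, Long> nextElement) {
        return false;
    }

    @Override
    public Tuple2<String, Long> deserialize(ConsumerRecord<byte[], byte[]> record) {
        String value = new String(record.value(), StandardCharsets.UTF_8);
        return Tuple2.of(value, record.timestamp()); // time the record was stored in Kafka
    }

    @Override
    public TypeInformation<Tuple2<String, Long>> getProducedType() {
        return TypeInformation.of(new TypeHint<Tuple2<String, Long>>() {});
    }
}

The consumer can then be built with this schema and the extracted timestamp used in the assigner (env and properties are your existing StreamExecutionEnvironment and consumer Properties):

DataStream<Tuple2<String, Long>> stream = env
        .addSource(new FlinkKafkaConsumer<>("topic", new TimestampedStringSchema(), properties))
        .assignTimestampsAndWatermarks(
                new BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, Long>>(Time.seconds(5)) {
                    @Override
                    public long extractTimestamp(Tuple2<String, Long> element) {
                        return element.f1; // the Kafka record timestamp extracted above
                    }
                });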

Related

How does Google Dataflow determine the watermark for various sources?

I was just reviewing the documentation to understand how Google Dataflow handles watermarks, and it just mentions the very vague:
The data source determines the watermark
It seems you can add more flexibility through withAllowedLateness but what will happen if we do not configure this?
Thoughts so far
I found something indicating that if your source is Google Pub/Sub it already has a watermark which will be used, but what if the source is something else? For example a Kafka topic (which I believe does not inherently have a watermark, so I don't see how something like this would apply).
Is it always 10 seconds, or just 0? Is it looking at the last few minutes to determine the max lag, and if so, how many (surely not since forever, as that would be distorted by the initial start of processing, which might see giant lag)? I could not find anything on the topic.
I also searched for Apache Beam documentation outside the context of Google Dataflow but did not find anything explaining this either.
When using Apache Kafka as a data source, each Kafka partition may have a simple event time pattern (ascending timestamps or bounded out-of-orderness). However, when consuming streams from Kafka, multiple partitions often get consumed in parallel, interleaving the events from the partitions and destroying the per-partition patterns (this is inherent in how Kafka’s consumer clients work).
In that case, you can use Flink’s Kafka-partition-aware watermark generation. Using that feature, watermarks are generated inside the Kafka consumer, per Kafka partition, and the per-partition watermarks are merged in the same way as watermarks are merged on stream shuffles.
For example, if event timestamps are strictly ascending per Kafka partition, generating per-partition watermarks with the ascending timestamps watermark generator will result in perfect overall watermarks. Note that no TimestampAssigner is provided in the example; the timestamps of the Kafka records themselves are used instead.
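In code this corresponds to calling assignTimestampsAndWatermarks on the consumer itself rather than on the resulting stream. A rough sketch with the Flink 1.11+ WatermarkStrategy API (MyEvent, schema, props and env are placeholders for your own event type, DeserializationSchema, consumer Properties and StreamExecutionEnvironment):

FlinkKafkaConsumer<MyEvent> kafkaSource = new FlinkKafkaConsumer<>("my-topic", schema, props);

// Attaching the strategy to the consumer (not to the DataStream returned by addSource)
// enables per-partition watermark generation. With no TimestampAssigner supplied,
// the timestamps of the Kafka records themselves are used.
kafkaSource.assignTimestampsAndWatermarks(
        WatermarkStrategy.<MyEvent>forMonotonousTimestamps());

DataStream<MyEvent> stream = env.addSource(kafkaSource);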
In any data processing system, there is a certain amount of lag between the time a data event occurs (the “event time”, determined by the timestamp on the data element itself) and the time the actual data element gets processed at any stage in your pipeline (the “processing time”, determined by the clock on the system processing the element). In addition, there are no guarantees that data events will appear in your pipeline in the same order that they were generated.
For example, let’s say we have a PCollection that’s using fixed-time windowing, with windows that are five minutes long. For each window, Beam must collect all the data with an event time timestamp in the given window range (between 0:00 and 4:59 in the first window, for instance). Data with timestamps outside that range (data from 5:00 or later) belong to a different window.
However, data isn’t always guaranteed to arrive in a pipeline in time order, or to always arrive at predictable intervals. Beam tracks a watermark, which is the system’s notion of when all data in a certain window can be expected to have arrived in the pipeline. Once the watermark progresses past the end of a window, any further element that arrives with a timestamp in that window is considered late data.
From our example, suppose we have a simple watermark that assumes approximately 30s of lag time between the data timestamps (the event time) and the time the data appears in the pipeline (the processing time), then Beam would close the first window at 5:30. If a data record arrives at 5:34, but with a timestamp that would put it in the 0:00-4:59 window (say, 3:38), then that record is late data.
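For reference, this is roughly what such a windowing configuration looks like in Beam's Java SDK. This is only a sketch: "input" is assumed to be an existing PCollection<String> with event timestamps already attached, and the 30-second allowed lateness mirrors the example above.

import org.apache.beam.sdk.transforms.windowing.AfterWatermark;
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

PCollection<String> windowed = input.apply(
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(5)))
                .triggering(AfterWatermark.pastEndOfWindow())      // fire when the watermark passes the window end
                .withAllowedLateness(Duration.standardSeconds(30)) // still accept data up to 30s behind the watermark
                .discardingFiredPanes());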

How to schedule periodical task based on number of processed messages?

I want to use the Kafka Processor API to process messages from Kafka.
I would like to call some function periodically - something like:
context.schedule(intervalMs, punctuationType, somePunctuator), where somePunctuator performs some periodic job, but instead of using an interval of time as the trigger I would like to invoke that task after processing a certain number of messages.
Is it possible to do such triggering in Kafka Streams?
Yes, it's possible using a Kafka Streams state store.
The logic depends on what exactly you need to do when the number of processed messages is reached.
If you need to propagate data to the next processor or sink node, store the aggregated values as a list of objects inside a key-value state store. Inside Processor.process(..) you put data into the key-value store, then check whether the number of items has reached the limit, and perform the required logic (like processorContext.forward(..)). Please take a look at a similar example here.
If you only need to run some logic after reaching that number and don't need the values themselves, you can store just a counter and increment it inside Processor.process(..), as in the sketch below.
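A minimal sketch of that second variant, using the classic Processor API (Kafka Streams 2.x). The store name "counter-store" and the threshold are hypothetical, and the store must be registered with the topology and connected to this processor:

import org.apache.kafka.streams.processor.AbstractProcessor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

public class CountTriggerProcessor extends AbstractProcessor<String, String> {

    private static final long TRIGGER_EVERY_N = 1000L;   // hypothetical threshold
    private KeyValueStore<String, Long> counterStore;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        super.init(context);
        // "counter-store" must be added to the topology via Stores.keyValueStoreBuilder(...)
        counterStore = (KeyValueStore<String, Long>) context.getStateStore("counter-store");
    }

    @Override
    public void process(String key, String value) {
        Long count = counterStore.get("count");
        long newCount = (count == null ? 0L : count) + 1;
        counterStore.put("count", newCount);

        if (newCount % TRIGGER_EVERY_N == 0) {
            // place the periodic job here, e.g. forward something downstream
            context().forward(key, value);
        }
    }
}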

Set timestamp in output with Kafka Streams

I'm getting CSVs in a Kafka topic "raw-data"; the goal is to transform them by sending each line to another topic "data" with the right timestamp (different for each line).
Currently, I have 2 streamers:
one to split the lines in "raw-data", sending them to an "internal" topic (no timestamp)
one with a TimestampExtractor that consumes "internal" and sends them to "data".
I'd like to remove the use of this "internal" topic by setting the timestamp directly, but I couldn't find a way (timestamp extractors are only used at consumption time).
I've stumbled upon this line in the documentation:
Note, that the described default behavior can be changed in the Processor API by assigning timestamps to output records explicitly when calling #forward().
but I couldn't find any signature with a timestamp. What do they mean?
How would you do it?
Edit:
To be clear, I have a Kafka topic with one message containing the event time and some value, such as:
2018-01-01,hello
2018-01-02,world
(this is ONE message, not two)
I'd like to get two messages in another topic with the Kafka record timestamp set to their event time (2018-01-01 and 2018-01-02) without the need of an intermediate topic.
Setting the timestamp for the output requires Kafka Streams 2.0 and is only supported in the Processor API. If you use the DSL, you can use transform() to get access to those APIs.
As you pointed out, you would use context.forward(). The call would be:
stream.transform(new TransformerSupplier() {
    public Transformer get() {
        return new Transformer() {
            // omit other methods for brevity
            // you need to get the `context` from `init()`
            public KeyValue transform(K key, V value) {
                // some business logic

                // you can call #forward() as often as you want
                context.forward(newKey, newValue, To.all().withTimestamp(newTimestamp));
                return null; // only return data via context#forward()
            }
        };
    }
});

Apache Flink - Event time

I want to create an event time clock for my events in Apache Flink. I am doing it in the following way:
public class TimeStampAssigner implements AssignerWithPeriodicWatermarks<Tuple2<String, String>> {

    private final long maxOutOfOrderness = 0; // 3.5
    private long currentMaxTimestamp;

    @Override
    public long extractTimestamp(Tuple2<String, String> element, long previousElementTimestamp) {
        currentMaxTimestamp = new Date().getTime();
        return currentMaxTimestamp;
    }

    @Override
    public Watermark getCurrentWatermark() {
        return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
    }
}
Please check the above code and tell me if I am doing it correctly. After the event time and watermark assignment I want to process the stream in a process function in which I will be collecting the stream data for 10 minutes for different keys.
No, this is not an appropriate implementation. An event time timestamp should be deterministic (i.e., reproducible), and it should be based on data in the event stream. If you use new Date().getTime() instead, then you are more or less using processing time.
Typically when doing event time processing your events will have a timestamp field, and the timestamp extractor will return the value of this field.
The implementation you've shown will lose most of the benefits that come from working with event time, such as the ability to reprocess historic data in order to reproduce historic results.
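For illustration, here is a sketch of a deterministic assigner. It assumes (hypothetically) that the first field of your Tuple2 carries an epoch-millis event timestamp; adapt the extraction to wherever your events actually store their time.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventFieldTimestampAssigner
        extends BoundedOutOfOrdernessTimestampExtractor<Tuple2<String, String>> {

    public EventFieldTimestampAssigner() {
        super(Time.seconds(10)); // tolerate up to 10 seconds of out-of-orderness
    }

    @Override
    public long extractTimestamp(Tuple2<String, String> element) {
        // deterministic: the timestamp is taken from the event itself, not the wall clock
        return Long.parseLong(element.f0);
    }
}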
Your implementation implements ingestion time into the Flink system, not event time. If you consume from Kafka, for example, previousElementTimestamp should normally point to the time the event was produced to Kafka (unless the Kafka producer says otherwise), which would make your stream processing reproducible.
If you want to implement event time processing in Flink you should rather use timestamps associated with your elements, which could either be inside the element itself (which makes sense for time series) or stored in Kafka and available via previousElementTimestamp, as in the sketch below.
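A sketch of that second option, reusing the timestamp the Flink Kafka connector already attaches to each record (exposed as previousElementTimestamp; this assumes a FlinkKafkaConsumer010 or newer source, and 3.5 seconds is just an example bound):

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

public class KafkaRecordTimestampAssigner
        implements AssignerWithPeriodicWatermarks<Tuple2<String, String>> {

    private final long maxOutOfOrderness = 3500; // 3.5 seconds
    private long currentMaxTimestamp = Long.MIN_VALUE + maxOutOfOrderness;

    @Override
    public long extractTimestamp(Tuple2<String, String> element, long previousElementTimestamp) {
        // previousElementTimestamp carries the timestamp stored in the Kafka record
        currentMaxTimestamp = Math.max(currentMaxTimestamp, previousElementTimestamp);
        return previousElementTimestamp;
    }

    @Override
    public Watermark getCurrentWatermark() {
        return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
    }
}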
Regarding maxOutOfOrderness, you probably also want to consider Flink's side output feature, which makes it possible to get the late elements after the window has been created and update your Flink job's output.
If you consume from Kafka and just want a simple event time processing implementation (accepting some data loss), go with AscendingTimestampExtractor.
There are some potential problems with an AscendingTimestampExtractor which can appear if your data are not ordered within the partition, or if you apply the extractor after an operator and not directly after the KafkaSource.
For a robust industrial use case you should rather implement watermark ingestion into the persistent log storage, as mentioned in the Google Dataflow model.

Apache Storm aggregation rules for missing expected events in rolling time-period

My use case is to identify, in real time rather than with batch jobs, entities from which expected events have not been received after X amount of time. For example:
If we have received a PaymentInitiated event at time T but didn't receive any of PaymentFailed / PaymentAborted / PaymentSucceeded by T+X, then raise a trigger saying PaymentStuck along with the details of the PaymentInitiated event.
How can I model such use cases in Apache Storm, given that it is a rolling time period X relative to each event rather than a fixed time interval?
Thanks,
Harish
For Storm, you would need to put all your logic into your UDF code using the low-level Java API (I doubt that Trident is helpful here). I never worked with Samza and cannot provide any help for it (or judge which system would be the better fit for your problem).
In Storm, for example, you could assign a timestamp to each tuple in Spout.nextTuple(), and buffer all tuples of incomplete payments within a Bolt in ascending order of timestamp. Each time Bolt.execute() is called, you can compare the timestamp of the new tuple with the head (i.e., the oldest tuple) of your queue. If the input tuple has a timestamp larger than the head's timestamp plus X, you know that the head tuple has timed out and you can raise your trigger for it.
Of course, you need to use fieldsGrouping() to ensure that all tuples belonging to the same payment are processed by the same Bolt instance. You might also need to order the incoming bolt tuples by timestamp, or use more advanced time-out logic to deal with out-of-order tuples (with regard to increasing timestamps).
Depending on your latency requirement and input stream rate, you might also use "tick tuples" to trigger the comparison of the head tuple against such a dummy tick tuple. Or, as an even stricter implementation, do all of this logic directly in Spout.nextTuple() (if you know that all tuples of a payment go through the same Spout instance). A rough sketch of the bolt-side logic follows below.
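This sketch assumes Storm 2.x imports; the field names "paymentId", "eventType", "timestamp" and the timeout X are hypothetical, and it relies on fieldsGrouping on the payment id plus roughly ascending timestamps per bolt instance:

import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class PaymentTimeoutBolt extends BaseRichBolt {

    private static final long TIMEOUT_MS = 60_000L; // "X" from the question
    private OutputCollector collector;
    // open payments in (roughly) arrival order: paymentId -> PaymentInitiated timestamp
    private LinkedHashMap<String, Long> openPayments;

    @Override
    public void prepare(Map<String, Object> topoConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.openPayments = new LinkedHashMap<>();
    }

    @Override
    public void execute(Tuple input) {
        String paymentId = input.getStringByField("paymentId");
        String eventType = input.getStringByField("eventType");
        long timestamp = input.getLongByField("timestamp");

        if ("PaymentInitiated".equals(eventType)) {
            openPayments.put(paymentId, timestamp);
        } else {
            // PaymentFailed / PaymentAborted / PaymentSucceeded closes the payment
            openPayments.remove(paymentId);
        }

        // compare the oldest open payments against the newest timestamp seen so far
        Iterator<Map.Entry<String, Long>> it = openPayments.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> oldest = it.next();
            if (timestamp - oldest.getValue() > TIMEOUT_MS) {
                collector.emit(new Values(oldest.getKey(), "PaymentStuck", oldest.getValue()));
                it.remove();
            } else {
                break; // later entries were inserted later and are (roughly) newer
            }
        }
        collector.ack(input);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("paymentId", "eventType", "timestamp"));
    }
}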