Flink tumbling event time window not triggering - Scala

I have a pipeline where I am taking sensor events from each player of a football match (x/y coordinates) and calculating the distance each player has run over the entire game.
I want to keep a running sum of the distance they have each run and emit the stats for all players, as a single event, every second.
I have successfully calculated the running distance of each player and that works fine. If I run the following it outputs results as expected:
val playerRunningStats = playerAverages.keyBy(_.entity.map(_.name))
  .window(GlobalWindows.create())
  .trigger(ContinuousEventTimeTrigger.of[GlobalWindow](Time.seconds(1)))
  .aggregate(new RunningDistanceAggregator)
  .print
Results (subset for some players in 1st second):
4> PlayerDistanceEvent(player 1,6.0)
11> PlayerDistanceEvent(player 2,3.414213562373095)
1> PlayerDistanceEvent(player 3,4.414213562373095)
9> PlayerDistanceEvent(player 4,3.0)
5> PlayerDistanceEvent(player 5,3.0)
...
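For reference, RunningDistanceAggregator isn't shown above; here is a simplified sketch of what it does (SensorEvent and the accumulator below are stand-ins, not the real types; PlayerDistanceEvent follows the output format above):

import org.apache.flink.api.common.functions.AggregateFunction

// Sketch only: SensorEvent and DistanceAcc are assumed stand-in types.
case class SensorEvent(player: String, x: Double, y: Double)
case class PlayerDistanceEvent(player: String, distance: Double)
case class DistanceAcc(player: String, total: Double, last: Option[(Double, Double)])

class RunningDistanceAggregator
    extends AggregateFunction[SensorEvent, DistanceAcc, PlayerDistanceEvent] {

  override def createAccumulator(): DistanceAcc = DistanceAcc("", 0.0, None)

  override def add(e: SensorEvent, acc: DistanceAcc): DistanceAcc = {
    // Add the Euclidean distance from the previous coordinate, if there is one.
    val step = acc.last
      .map { case (px, py) => math.hypot(e.x - px, e.y - py) }
      .getOrElse(0.0)
    DistanceAcc(e.player, acc.total + step, Some((e.x, e.y)))
  }

  override def getResult(acc: DistanceAcc): PlayerDistanceEvent =
    PlayerDistanceEvent(acc.player, acc.total)

  override def merge(a: DistanceAcc, b: DistanceAcc): DistanceAcc =
    DistanceAcc(if (a.player.nonEmpty) a.player else b.player, a.total + b.total, b.last)
}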
Now I want to aggregate these results into one single event and send them out to a Kafka topic, but this is where I run into my issue. When I try to create a new windowAll aggregation and print it, nothing gets printed:
val playerRunningStats = playerAverages.keyBy(_.entity.map(_.name))
  .window(GlobalWindows.create())
  .trigger(ContinuousEventTimeTrigger.of[GlobalWindow](Time.seconds(1)))
  .aggregate(new RunningDistanceAggregator)
  // New code below to combine results into a single event
  .windowAll(TumblingEventTimeWindows.of(Time.seconds(1)))
  .aggregate(new RunningDistanceSnapshotAggregator)
  .print
I know that my RunningDistanceSnapshotAggregator code works, because when I change the windowAll to a TumblingProcessingTimeWindows the code works fine. But I don't want to use processing-time windows; I want to use event-time windows, right?
Code with processing time window that seems to work:
val playerRunningStats = playerAverages.keyBy(_.entity.map(_.name))
  .window(GlobalWindows.create())
  .trigger(ContinuousEventTimeTrigger.of[GlobalWindow](Time.seconds(1)))
  .aggregate(new RunningDistanceAggregator)
  // Change to processing time window and it works fine
  .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(1)))
  .aggregate(new RunningDistanceSnapshotAggregator)
  .print
Results:
1> PlayerDistanceSnapshot(List(PlayerDistanceEvent(player 1,3.414213562373095), PlayerDistanceEvent(player 2,4.0), PlayerDistanceEvent(player 3,3.0), PlayerDistanceEvent(player 4,2.0), PlayerDistanceEvent(player 5,6.0)))
Why can't I use an event-time window for the second windowAll, and why doesn't it produce any results when I do?

Related

How or when does a time window in Kafka Streams expire?

I use Kafka Streams in my application, and I have a question about the time window in the aggregate function.
KTable<Windowed<String>, PredictReq> windowedKtable = views
    .map(new ValueMapper())
    .groupByKey()
    .windowedBy(TimeWindows.of(TimeUnit.MINUTES.toMillis(1)))
    .aggregate(new ADInitializer(), new ADAggregator(), Materialized.with(Serdes.String(), ReqJsonSerde));
KStream<Windowed<String>, Req> filtered = windowedKtable.toStream().transform(new ADTransformerFilter());
KStream<String, String> result = filtered.transform(new ADTransformerTrans());
I aggregate data in a 1-minute window, then transform it to get the final aggregate result, then do a second transform.
Here is some sample data:
msg1 arrives at 10:00:00, msg2 arrives at 10:00:20, msg3 arrives at 10:01:10
The window runs from 10:00:00 to 10:01:00, for example.
I found that the window does not expire until msg3 arrives! (Because the following transform is not executed until msg3 arrives.)
This is not what I want.
Is there something wrong with my testing? If this is really how it works, how can I change it?
I see...
Kafka Streams doesn't have a window-expiry concept, so I check the window embedded in each message to see whether the window has changed, which means I must wait for a message from the next window.
If no message from the next window arrives, I have no way of knowing that the window has finished.
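For reference, here is a rough Scala sketch of that window-change check. The real ADTransformerFilter isn't shown, so this is only an assumption of how it might work; it keeps the question's PredictReq value type and, for brevity, tracks one window at a time rather than one per key:

import org.apache.kafka.streams.KeyValue
import org.apache.kafka.streams.kstream.{Transformer, Windowed}
import org.apache.kafka.streams.processor.ProcessorContext

// Sketch: forward a window's last aggregate only once a record
// from a newer window arrives.
class WindowChangeFilter
    extends Transformer[Windowed[String], PredictReq, KeyValue[Windowed[String], PredictReq]] {

  private var currentStart: Long = Long.MinValue
  private var pending: KeyValue[Windowed[String], PredictReq] = _

  override def init(ctx: ProcessorContext): Unit = ()

  override def transform(key: Windowed[String], value: PredictReq): KeyValue[Windowed[String], PredictReq] = {
    val start = key.window().start()
    if (start > currentStart && pending != null) {
      // A record from a newer window arrived, so the previous window is
      // complete: forward its final aggregate downstream.
      val finished = pending
      currentStart = start
      pending = KeyValue.pair(key, value)
      finished
    } else {
      // Same window (or very first record): remember the latest aggregate
      // and emit nothing yet. Returning null forwards no record.
      currentStart = start
      pending = KeyValue.pair(key, value)
      null
    }
  }

  override def close(): Unit = ()
}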

How to calculate the execution time taken for a command cell in a Databricks notebook

I need to know the time taken to execute the list of command cells in each notebook. Databricks displays "command took x seconds" after execution. Similar to that displayed time, I need to capture the time taken to execute all the commands in a notebook.
I've been using the following:
val startTime = System.nanoTime
// your code goes here
val endTime = System.nanoTime
val elapsedSeconds = (endTime - startTime) / 1e9d
Since you mentioned that you are a newbie, don't forget that Spark uses lazy execution, so the time to execute a cell that contains only transformations is not the true execution time. Be sure to include an action to measure the true execution time.
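For example, a minimal sketch (df here is an assumed DataFrame and the column name is made up):

import org.apache.spark.sql.functions.col

val startTime = System.nanoTime
val filtered = df.filter(col("value") > 0) // lazy: only builds the query plan
val rows = filtered.count()                // action: forces actual execution
val elapsedSeconds = (System.nanoTime - startTime) / 1e9d
println(s"count=$rows took $elapsedSeconds seconds")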

Flink streaming window trigger

I have a Flink stream and I am calculating a few things over a time window, say 30 seconds.
What happens is that it gives me a result that aggregates previous windows as well.
Say for the first 30 seconds I get the result 10.
For the next thirty seconds I want a fresh result; instead I get the last window's result + the new one.
And so on.
So my question is: how do I get a fresh result for each window?
You need to use a purging trigger. What you want is FIRE_AND_PURGE (emit and remove the window content); what the default Flink trigger does is FIRE (emit and keep the window content).
input
  .keyBy(...)
  .timeWindow(Time.seconds(30))
  // The important part: replace the default non-purging ProcessingTimeTrigger
  .trigger(PurgingTrigger.of(ProcessingTimeTrigger.create()))
  .reduce(...)
For a more in-depth explanation, have a look at Triggers and FIRE vs FIRE_AND_PURGE.
A Trigger determines when a window (as formed by the window assigner) is ready to be processed by the window function. Each WindowAssigner comes with a default Trigger. If the default trigger does not fit your needs, you can specify a custom trigger using trigger(...).
When a trigger fires, it can either FIRE or FIRE_AND_PURGE. While FIRE keeps the contents of the window, FIRE_AND_PURGE removes its content. By default, the pre-implemented triggers simply FIRE without purging the window state.
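To make that concrete, here is a minimal sketch of a custom trigger that fires and purges at the end of a processing-time window. It mirrors what PurgingTrigger.of(...) already gives you and is shown only to illustrate the FIRE vs FIRE_AND_PURGE distinction; the class name is made up:

import org.apache.flink.streaming.api.windowing.triggers.{Trigger, TriggerResult}
import org.apache.flink.streaming.api.windowing.windows.TimeWindow

class FireAndPurgeAtEndTrigger[T] extends Trigger[T, TimeWindow] {

  override def onElement(element: T, timestamp: Long, window: TimeWindow,
                         ctx: Trigger.TriggerContext): TriggerResult = {
    // Fire once the window's end has passed in processing time.
    ctx.registerProcessingTimeTimer(window.maxTimestamp())
    TriggerResult.CONTINUE
  }

  override def onProcessingTime(time: Long, window: TimeWindow,
                                ctx: Trigger.TriggerContext): TriggerResult =
    TriggerResult.FIRE_AND_PURGE // emit the result AND drop the window state

  override def onEventTime(time: Long, window: TimeWindow,
                           ctx: Trigger.TriggerContext): TriggerResult =
    TriggerResult.CONTINUE

  override def clear(window: TimeWindow, ctx: Trigger.TriggerContext): Unit =
    ctx.deleteProcessingTimeTimer(window.maxTimestamp())
}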
The functionality you describe can be found in Tumbling Windows: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/windows.html#tumbling-windows
A bit more detail and/or code would help :)
I'm a little late to this question, but I encountered the same issue as the OP. What I found out later was a bug in my own code. FYI, my mistake could be a good reference for your problem.
// Old code (modified to be an example):
val tenSecondGrouping: DataStream[MyCustomGrouping] = userIdsStream
  .keyBy(_.somePartitionedKey)
  .window(TumblingProcessingTimeWindows.of(Time.of(10, TimeUnit.SECONDS)))
  .trigger(ProcessingTimeTrigger.create())
  .aggregate(new MyCustomAggregateFunc(new MyCustomGrouping()))
The bug happened at new MyCustomGrouping: I unintentionally created a singleton MyCustomGrouping object and reused it in MyCustomAggregateFunc. As more tumbling windows were created, the later aggregation results grew out of control! The fix was to create a new MyCustomGrouping each time MyCustomAggregateFunc is triggered. So:
// New code, problem solved
...
.aggregate(new MyCustomAggregateFunc(() => new MyCustomGrouping()))
// passing in a func to create new object per trigger
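For illustration, a minimal sketch of how the aggregate function might consume that factory. MyCustomAggregateFunc isn't shown in the answer, so the input type (UserEvent) and the add/merge bodies here are assumptions:

import org.apache.flink.api.common.functions.AggregateFunction

class MyCustomAggregateFunc(newGrouping: () => MyCustomGrouping)
    extends AggregateFunction[UserEvent, MyCustomGrouping, MyCustomGrouping] {

  // A fresh accumulator per window: this is what fixes the singleton bug.
  override def createAccumulator(): MyCustomGrouping = newGrouping()

  override def add(event: UserEvent, acc: MyCustomGrouping): MyCustomGrouping =
    acc.add(event) // assumed: MyCustomGrouping exposes add/merge

  override def getResult(acc: MyCustomGrouping): MyCustomGrouping = acc

  override def merge(a: MyCustomGrouping, b: MyCustomGrouping): MyCustomGrouping =
    a.merge(b)
}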

How can I record only part of a live source with gstreamer?

Application consists of two pipelines:
Sending pipeline
filesrc ! decodebin ! encoder ! payloader ! udpsink
Receiving pipeline
udpsrc ! rtpbin ! depayloader ! decoder ! encoder ! filesink
The wanted behavior is that the sending pipeline plays a file, and when that has finished, another file plays and recording starts.
The actual behavior varies. With some approaches the recording starts from the same time that the first playback starts. I believe this is because the pipelines share the same GSocket (which was necessary to get it to work at all), so data arriving at the socket must be getting buffered somewhere.
Other approaches result in a few frames from before the recording should start, followed by a jump to after the recording begins, resulting in a messy picture (inter-frames without a keyframe).
I've tried a couple of different approaches to try to get the recording to start at the right time:
Start the receiving pipeline when the second file starts playing
Start both pipelines at the same time and have a valve element dropping everything until the second file starts playing
Start both pipelines at the same time and Seek to the time where the second file starts playing
Start both pipelines at the same time and have the receiving pipeline connected to a fakesink, switching to the real filter chain when the second file starts playing
Set an offset on the receiving pipeline
I would be very grateful for any help with this!
Start both pipelines at the same time and have a valve element dropping everything until the second file starts playing
This actually works. The problem I had was that no picture fast update was sent, and it took a while for the next keyframe to arrive by itself.

not accurate setTimeout and it works only on mousedown

I have a problem with setTimeout.
It works fine in all major browsers, but not in IE...
I'm creating a Facebook app, a puzzle. When the player presses the Start button, a timer starts counting their time for one game.
At the beginning I used setInterval to increase the timer, but combined with the Facebook scripts it was delayed by about 2 seconds by the end of a game. Then I found a trick on Stack Overflow to increase the timer's accuracy: setInterval timing slowly drifts away from staying accurate
And again: without Facebook it worked fine, no delays. With Facebook it still has delays.
To condense the info that might interest you:
When the user clicks Start, I create a new Date as startTime.
When the user ends the game, the script creates a finalTime new Date, then subtracts finalTime - startTime.
In code there is setTimeout:
(...)
f : function() {
  var sec_time = Math.floor((puzzle.nextAt - puzzle.startTime)/1000);
  $('.timer').html(sec_time);
  if (!puzzle.startTime) {
    puzzle.startTime = new Date().getTime();
    puzzle.nextAt = puzzle.startTime;
    $('.timer').html('0');
  }
  puzzle.nextAt += 100;
  puzzle.to = setTimeout(puzzle.f, puzzle.nextAt - new Date().getTime());
}
(...)
When the user places the last puzzle piece in the correct spot, I call clearTimeout(puzzle.to);
I now have 2 issues:
inaccurate time: in IE it can be off by as much as 7 seconds!
in IE, during the game, the timer only advances while the user holds the mouse down... :/
To drag puzzles I use the jQuery drag & drop plugin.
At the very least, it would be helpful to know how to achieve an accurate timer.
You should put your scripts in jQuery's ready function rather than at the bottom of the page, as the Facebook SDK is loaded asynchronously and may affect timed executions that are initiated at the bottom of the page.
As for timing, you're going to see an inaccuracy of between 15 ms and 45 ms in IE7, depending on what other JS is executing on the page. Your 100 ms timeout will drift badly because of this. It's better to record a start time and build a timer that polls more often than needed, comparing the start time to 'now' on each cycle to decide what to do next.