Understanding late firings - apache-beam

Given these examples:
1:
Window.<KV<String, DeviceData>>into(FixedWindows.of(Duration.standardSeconds(options.getWindowSize())))
    .triggering(
        AfterWatermark.pastEndOfWindow()
            .withLateFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(5))))
    .withAllowedLateness(Duration.standardHours(3))
    .accumulatingFiredPanes();
2:
Window.<KV<String, DeviceData>>into(FixedWindows.of(Duration.standardSeconds(options.getWindowSize())))
    .triggering(
        AfterWatermark.pastEndOfWindow()
            .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(1)))
            .withLateFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardHours(1))))
    .withAllowedLateness(Duration.standardHours(6))
    .accumulatingFiredPanes();
Do I understand correctly that:
1. will trigger once after the window closes and then each time late data arrives, with a maximum lateness of 3 hours?
2. will trigger 1 minute after the first element arrives in the pane, once when the window closes, and then each time late data arrives, up to a maximum lateness of 6 hours?

Triggers determine when to emit the aggregated results of each window; they allow Beam to emit early results and also to process late data. With withAllowedLateness() you can process data that arrives late. If withAllowedLateness is set, then the default trigger will emit new results immediately whenever late data arrives. As mentioned in the documentation:
Override the amount of lateness allowed for data elements in the output PCollection and downstream PCollections until explicitly set again. Any elements that arrive later than this, as decided by the system-maintained watermark, will be dropped. The value here also determines how long state will be kept around for old windows; once no more elements will be added to a window (because this duration has passed), any state associated with the window will be cleaned up.
For example,
PCollection<String> items = ...;
PCollection<String> windowed_items = items.apply(
    Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
        .triggering(
            AfterWatermark.pastEndOfWindow()
                .withLateFirings(AfterProcessingTime
                    .pastFirstElementInPane().plusDelayOf(Duration.standardHours(1))))
        .withAllowedLateness(Duration.standardDays(1)));
PCollection<KV<String, Long>> windowed_counts = windowed_items.apply(
    Count.<String>perElement());
The above code emits the values that were available when the watermark passed the end of the window, and then outputs the late-arriving data once per hour until 24 hours (the allowed lateness) have passed and the window's state is discarded.
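If you want to verify which firings produce which panes (early, on-time, or late), you can inspect the PaneInfo of each element. A minimal sketch, assuming the windowed_counts collection from the example above:
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.windowing.PaneInfo;
import org.apache.beam.sdk.values.KV;

// Logs whether each emitted pane was EARLY (a speculative firing), ON_TIME
// (the watermark firing) or LATE (a firing caused by late data).
windowed_counts.apply(ParDo.of(new DoFn<KV<String, Long>, KV<String, Long>>() {
  @ProcessElement
  public void processElement(ProcessContext c) {
    PaneInfo pane = c.pane();
    System.out.println(c.element().getKey() + " -> " + c.element().getValue()
        + " [timing=" + pane.getTiming() + ", index=" + pane.getIndex() + "]");
    c.output(c.element());
  }
}));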

Related

How can I add the time that a message arrives in a tickerplant from a feedhandler in KDB

I’m wondering what the best way is to add the time that data arrives in a KDB tickerplant from a feedhandler. As the data arrives in list form, would it be correct to just append .z.z to the end of the list using the join operator?
Are you using the standard tick library from KxSystems? I think it automatically appends the time that data reaches the tickerplant for you.
https://github.com/KxSystems/kdb-tick/blob/master/tick.q
On line 38:
if[not -16=type first first x;
if[d<"d"$a:.z.P;.z.ts[]];
a:"n"$a;x:$[0>type first x;a,x;(enlist(count first x)#a),x]
];
Here x is the table/list that is being sent to the ticker plant and a is the timespan that has been added.

Why does join use rows that were sent after watermark of 20 seconds?

I’m using watermark to join two streams as you can see below:
val order_wm = order_details.withWatermark("tstamp_trans", "20 seconds")
val invoice_wm = invoice_details.withWatermark("tstamp_trans", "20 seconds")
val join_df = order_wm
.join(invoice_wm, order_wm.col("s_order_id") === invoice_wm.col("order_id"))
My understanding of the above code is that it will keep each stream for 20 seconds after it arrives. But when I send one stream now and the other after 20 seconds, they still get joined. It seems like even after the watermark has expired, Spark is holding the data in memory. I even tried with 45 seconds and that was getting joined too.
This is creating confusion in my mind about the watermark.
But when I send one stream now and the other after 20 seconds, they still get joined.
That's possible since the time measured is not the time of events as they arrive, but the time that is inside the watermarked field, i.e. tstamp_trans. You have to make sure that the last time in tstamp_trans is 20 seconds after the rows that will participate in the join.
Quoting the doc from: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#inner-joins-with-optional-watermarking
In other words, you will have to do the following additional steps in the join.
Define watermark delays on both inputs such that the engine knows how delayed the input can be (similar to streaming aggregations)
Define a constraint on event-time across the two inputs such that the engine can figure out when old rows of one input is not going to be required (i.e. will not satisfy the time constraint) for matches with the other input. This constraint can be defined in one of the two ways.
Time range join conditions (e.g. ...JOIN ON leftTime BETWEEN rightTime AND rightTime + INTERVAL 1 HOUR),
Join on event-time windows (e.g. ...JOIN ON leftTimeWindow = rightTimeWindow).
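As a minimal sketch of what that time-range constraint looks like, here is the same join in the Spark Java API (the question's snippet is Scala, and the variable names and the 20-second range below are assumptions for illustration):
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.expr;

Dataset<Row> orderWm   = orderDetails.withWatermark("tstamp_trans", "20 seconds");
Dataset<Row> invoiceWm = invoiceDetails.withWatermark("tstamp_trans", "20 seconds");

// The equality condition alone (as in the question) gives the engine no event-time
// constraint, so it keeps all past rows in state "just in case". Adding a time-range
// condition lets the watermark bound that state and drop old rows.
Dataset<Row> joined = orderWm.join(
    invoiceWm,
    orderWm.col("s_order_id").equalTo(invoiceWm.col("order_id"))
        .and(invoiceWm.col("tstamp_trans").geq(orderWm.col("tstamp_trans")))
        .and(invoiceWm.col("tstamp_trans").leq(
            orderWm.col("tstamp_trans").plus(expr("INTERVAL 20 SECONDS")))));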

KSQL Hopping Window : accessing only oldest subwindow

I am tracking the rolling sum of a particular field by using a query which looks something like this :
SELECT id, SUM(quantity) AS quantity from stream \
WINDOW HOPPING (SIZE 1 MINUTE, ADVANCE BY 10 SECONDS) \
GROUP BY id;
Now, for every input tick, it seems to return 6 different aggregated values, which I guess are for the following time periods:
[start, start+60] seconds
[start+10, start+60] seconds
[start+20, start+60] seconds
[start+30, start+60] seconds
[start+40, start+60] seconds
[start+50, start+60] seconds
What if I am interested in only getting the [start, start+60] seconds result for every tick that comes in? Is there any way to get ONLY that?
Because you specify a hopping window, each record falls into multiple windows and all windows need to be updated when processing a record. Updating only one window would be incorrect and the result would be wrong.
Compare the Kafka Streams docs about hopping windows (Kafka Streams is KSQL's internal runtime engine): https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#hopping-time-windows
Update
Kafka Streams is adding proper sliding window support via KIP-450 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-450%3A+Sliding+Window+Aggregations+in+the+DSL). This should make it possible to add sliding windows to ksqlDB later, too.
I was in a similar situation, and creating a user-defined function that keeps only the window where collect_list(column).size() equals the window duration appears to be a promising approach.
In the UDF, use a List type to receive the list of values collected for one of your aggregation base columns. Then check whether the size of that list equals the number of periods in the hopping window, and return null otherwise.
From this, create a table selecting the data and transforming it with the UDF.
Then create a table from this latest table and filter out the null values on the transformed column.
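A rough sketch of what such a UDF could look like in Java (the function name, the expectedPeriods parameter, and the BIGINT value type are assumptions, not a tested implementation):
import io.confluent.ksql.function.udf.Udf;
import io.confluent.ksql.function.udf.UdfDescription;
import java.util.List;

@UdfDescription(name = "full_window_sum", description = "Sum only for the oldest (full) sub-window")
public class FullWindowSum {

  @Udf(description = "Returns the sum when the collected list covers every hop, else null")
  public Long fullWindowSum(final List<Long> values, final int expectedPeriods) {
    // Not the full window yet: return null so these rows can be filtered out later.
    if (values == null || values.size() < expectedPeriods) {
      return null;
    }
    long sum = 0;
    for (Long v : values) {
      sum += v == null ? 0 : v;
    }
    return sum;
  }
}
In the query you would then pass something like COLLECT_LIST(quantity) and the number of hops per window (6 for SIZE 1 MINUTE, ADVANCE BY 10 SECONDS) to the UDF, and filter out the NULL rows in a downstream query.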

How to check if the stream of rows has ended

Is there a way for me to know if the stream of rows has ended? That is, if the job is on the last row?
What I'm trying to do is perform an action for every 10 rows; my problem is the last rows: for example, with 115 rows, the last 5 won't be processed, but I need them to be.
There is no built-in functionality in Talend which tells you if you're on the last row. You can work around this using one of the following:
1. Get the row count beforehand. For instance, if you have a file, you can use tFileRowCount to count the number of rows; then, when you process your file, you keep a variable with your current row number, so you can tell when you've reached the last row. If your data come from a database, you could either issue a query that returns the total number of rows beforehand, or modify your main query to return the total number of rows in an additional column and use that (using ranking functions).
2. Do some processing after the subjob has ended. There may be situations where you need special processing for the last row; you can achieve this by getting the last row processed by the previous subjob (which you have already saved, for instance, by putting a tSetGlobalVar after your target, so that when your subjob is done your variable contains the last written value).
Edit
For your use case, what you could do is first store the result of the API call in memory using tHashOutput, then read it with a tHashInput in order to process it; you'll then know how many rows you have retrieved using tHashOutput's global variable tHashOutput_X_NB_LINE.
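A rough sketch of that idea in Java, e.g. in a tJavaRow placed after the tHashInput (the component name tHashOutput_1 and the use of the Numeric.sequence routine are assumptions; adapt them to your job):
// Total number of rows written by the hash component (the key depends on its name).
int totalRows  = (Integer) globalMap.get("tHashOutput_1_NB_LINE");
// Running row number for the current flow.
int currentRow = Numeric.sequence("rowCounter", 1, 1);

boolean batchOfTen = (currentRow % 10 == 0);
boolean lastRow    = (currentRow == totalRows);

if (batchOfTen || lastRow) {
    // do the "every 10 rows" work here; the lastRow check also catches the
    // final partial batch (e.g. rows 111-115 out of 115)
}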

tJavaFlex behaviour when changing loop position

I'm having some problems in a job, and I suspect they are due to a lack of understanding of tJavaFlex. I am generating 10 rows in this test job, and am generating a loop inside a tJavaFlex:
So there are 10 rows coming in, and a loop in the Start and End section. I was expecting that for each row coming in, it would generate 10 identical rows coming out. And that I would see iterations 0,1,2,3....9 for each row.
What I got was this. This looks to me like the entire job is running 10 times, and so I have 100 random values coming through the flow from the tRowGenerator.
If I move the for loop into the Main Code section, I get close to the behaviour I was expecting. I am expecting each row when it comes in to be repeated 10 times, and for 1 row coming in to produce 10 output rows. What I get is this.
But even then my tLogRow is only generating one row for each 10 iterations, it seems (look at the tLogRow output after iteration 9 above: why not 10 items?). I had thought I would be getting 10 rows for each single row coming in and that I would see this in the tLogRow.
What I need to do is take a value from a field coming in, do some reg exp parsing and split into an array, and then for each item in the array create lines in the output flow. i.e. 1 row coming in can be turned into x number of rows coming out using a string.split() method.
Can someone explain the behaviour above, and also advise on the best approach to get one value coming in, do some java manipulation and then generate multiple rows coming out?
Any advice appreciated.
Yes, you are not using it correctly.
The Start code is for initializing variables (executed once, before the first row).
The Main code is the per-row body (executed once for each row).
The End code is where you can, for example, store results in global variables (executed once, after the last row).
The Main code is executed for each row in a tJavaFlex, so don't put a for loop inside it; you can do it like the example in the screenshot.
Your tJavaFlex behaviour is normal: you have ten rows, so for each row the for loop is executed 10 times (i < 10).
You can use it like this:
You don't need to create your own loop.
By putting the for loop in the Start code, your main code is triggered both by the loop and by the incoming rows, so it is executed n*r times.
The behaviour of a subjob that contains a tJavaFlex shows that the component before the tJavaFlex is included in its starting code and the component after it is included in its ending code, but that may depend on many conditions such as data propagation and trigger type.
Start code:
System.out.print("tJavaFlex is starting...");
int i = 0;
Main code:
i++;
System.out.print("tJavaFlex inside Main Code...iteration:"+i);
row8.ITEM_NAME = row7.ITEM_NAME;
row8.ITEM_COUNT = row7.ITEM_COUNT;
End code:
System.out.print("tJavaFlex is ending...");
System.out.print(row7.ITEM_NAME);
Instead of a main flow in row5, try using an iterate flow to connect the tJavaFlex.
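For the original requirement (turning one incoming value into several output rows with a split), the core Java logic is just a split and a loop; the delimiter below is an assumption, and how the rows are actually emitted depends on the wiring (an iterate link, or a component such as tNormalize):
// Sketch of the split-one-row-into-many idea, independent of the Talend wiring.
String incoming = row7.ITEM_NAME;            // e.g. "A;B;C"
String[] parts  = incoming.split(";");       // regex split into an array

for (String part : parts) {
    // each iteration here corresponds to one row you want to emit downstream,
    // e.g. by assigning it to the output schema inside the loop
    System.out.println("would emit ITEM_NAME=" + part);
}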