KSQL Hopping Window: accessing only the oldest subwindow

I am tracking the rolling sum of a particular field using a query that looks something like this:
SELECT id, SUM(quantity) AS quantity from stream \
WINDOW HOPPING (SIZE 1 MINUTE, ADVANCE BY 10 SECONDS) \
GROUP BY id;
Now, for every input tick, it seems to return 6 different aggregated values, which I guess correspond to the following time periods:
[start, start+60] seconds
[start+10, start+60] seconds
[start+20, start+60] seconds
[start+30, start+60] seconds
[start+40, start+60] seconds
[start+50, start+60] seconds
What if I am interested in getting only the [start, start+60] seconds result for every tick that comes in? Is there any way to get ONLY that?

Because you specified a hopping window, each record falls into multiple windows, and all of those windows must be updated when the record is processed. Updating only one window would produce an incorrect result.
Compare the Kafka Streams docs about hopping windows (Kafka Streams is KSQL's internal runtime engine): https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#hopping-time-windows
Update
Kafka Streams is adding proper sliding window support via KIP-450 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-450%3A+Sliding+Window+Aggregations+in+the+DSL). This should make it possible to add sliding windows to ksqlDB later, too.
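Once that lands, a sliding-window aggregation in the Kafka Streams DSL looks roughly like this (a minimal sketch, assuming a stream keyed by id with Long quantity values; the topic and variable names are illustrative, and the SlidingWindows factory method is the one from recent Kafka Streams releases):
import java.time.Duration;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

StreamsBuilder builder = new StreamsBuilder();
KStream<String, Long> quantities = builder.stream("quantities");
// one window per record, always covering the preceding 60 seconds
KTable<Windowed<String>, Long> rollingSum = quantities
    .groupByKey()
    .windowedBy(SlidingWindows.ofTimeDifferenceAndGrace(
        Duration.ofMinutes(1), Duration.ZERO))
    .reduce(Long::sum);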

I was in a similar situation, and creating a user-defined function that keeps only the window where collect_list(column).size() equals the number of periods in the window appears to be a promising approach.
In the UDF, use a List parameter to receive the collected values of one of your aggregated columns. Then check whether the list size equals the number of periods in the hopping window; return null otherwise.
From this, create a table that selects the data and transforms it with the UDF.
Then create a second table from that one, filtering out the null values in the transformed column (see the sketch below).
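A rough ksqlDB sketch of that idea, where KEEP_FULL_WINDOW is a hypothetical UDF and exactly one tick is assumed per 10-second advance (so a full 1-minute window collects 6 values):
-- KEEP_FULL_WINDOW(list, n) is a hypothetical UDF: it returns the sum of list
-- only when list has n elements (i.e. the window is fully covered), else NULL
CREATE TABLE windowed_agg AS
  SELECT id, KEEP_FULL_WINDOW(COLLECT_LIST(quantity), 6) AS quantity
  FROM stream
  WINDOW HOPPING (SIZE 1 MINUTE, ADVANCE BY 10 SECONDS)
  GROUP BY id;

CREATE TABLE oldest_window_only AS
  SELECT * FROM windowed_agg
  WHERE quantity IS NOT NULL;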

Related

Understanding late firings

Given these examples:
1:
Window.<KV<String, DeviceData>>into(FixedWindows.of(Duration.standardSeconds(options.getWindowSize())))
.triggering(
AfterWatermark.pastEndOfWindow()
.withLateFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(5)))
)
.withAllowedLateness(Duration.standardHours(3))
.accumulatingFiredPanes();
2:
Window.<KV<String, DeviceData>>into(FixedWindows.of(Duration.standardSeconds(options.getWindowSize())))
.triggering(
AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardMinutes(1)))
.withLateFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardHours(1))))
.withAllowedLateness(Duration.standardHours(6))
.accumulatingFiredPanes();
Do I understand correctly that:
1: will trigger once after the pane closes, and then each time late data arrives, with a maximum lateness of 3 hours?
2: will trigger when the first data arrives in the pane after a 1-minute delay, once when the pane closes, and then each time late data arrives, up to a maximum lateness of 6 hours?
Triggers determine when to emit the aggregated results of each window. They allow Beam to emit early results and also to process late data. With withAllowedLateness() you can process data that arrives late. If withAllowedLateness is set, the default trigger will emit new results immediately whenever late data arrives. As mentioned in this documentation,
Override the amount of lateness allowed for data elements in the
output PCollection and downstream PCollections until explicitly set
again. Any elements that arrive later than this, as decided by the
system-maintained watermark will be dropped. The value in this also
determines how long the state will be kept around for old windows and
once no elements will be added to a window (because this duration has
passed) any state associated with the window will be cleaned up.
For example,
PCollection<String> items = ...;
PCollection<String> windowed_items = items.apply(
Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
.triggering(
AfterWatermark.pastEndOfWindow()
.withLateFirings(AfterProcessingTime
.pastFirstElementInPane().plusDelayOf(Duration.standardHours(1))))
.withAllowedLateness(Duration.standardDays(1)));
PCollection<KV<String, Long>> windowed_counts = windowed_items.apply(
Count.<String>perElement());
In the above code, the values that were available when the watermark passed the end of the window are emitted first, and data that arrives late is then output once per hour, until the allowed lateness of 24 hours has passed.

Why does join use rows that were sent after watermark of 20 seconds?

I'm using watermarks to join two streams, as you can see below:
val order_wm = order_details.withWatermark("tstamp_trans", "20 seconds")
val invoice_wm = invoice_details.withWatermark("tstamp_trans", "20 seconds")
val join_df = order_wm
.join(invoice_wm, order_wm.col("s_order_id") === invoice_wm.col("order_id"))
My understanding of the above code is that each stream's rows will be kept for 20 seconds after they arrive. But when I feed one stream now and the other after 20 seconds, the rows still get joined. It seems that even after the watermark has expired, Spark is holding the data in memory. I even tried with 45 seconds, and the rows were joined too.
This is creating confusion in my mind regarding watermark.
But when I feed one stream now and the other after 20 seconds, the rows still get joined.
That's possible because the time measured is not the wall-clock time at which events arrive, but the event time inside the watermarked field, i.e. tstamp_trans. A row is only dropped once the maximum tstamp_trans seen so far is more than 20 seconds past it.
Quoting the doc from: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#inner-joins-with-optional-watermarking
In other words, you will have to do the following additional steps in the join.
Define watermark delays on both inputs such that the engine knows how delayed the input can be (similar to streaming aggregations)
Define a constraint on event-time across the two inputs such that the engine can figure out when old rows of one input is not going to be required (i.e. will not satisfy the time constraint) for matches with the other input. This constraint can be defined in one of the two ways.
Time range join conditions (e.g. ...JOIN ON leftTime BETWEEN rightTime AND rightTime + INTERVAL 1 HOUR),
Join on event-time windows (e.g. ...JOIN ON leftTimeWindow = rightTimeWindow).
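Applied to the example above, a minimal sketch with a time-range condition (it renames the invoice timestamp to avoid ambiguity between the two streams; the 20-second bound is illustrative):
import org.apache.spark.sql.functions.expr

val order_wm = order_details.withWatermark("tstamp_trans", "20 seconds")
val invoice_wm = invoice_details
  .withColumnRenamed("tstamp_trans", "invoice_tstamp")
  .withWatermark("invoice_tstamp", "20 seconds")

// the time-range condition lets Spark know when buffered rows can no longer match
val join_df = order_wm.join(
  invoice_wm,
  expr("""
    s_order_id = order_id AND
    invoice_tstamp BETWEEN tstamp_trans AND tstamp_trans + interval 20 seconds
  """))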

Tableau Future and Current References

Tough problem I am working on here.
I have a table of CustomerIDs and CallDates. I want to measure whether there is a 'repeat call' within a certain period of time (up to 30 days).
I plan on creating a parameter called RepeatTime which is a range from 0 - 30 days, so the user can slide a scale to see the number/percentage of total repeats.
In Excel, I have this working. I sort CustomerID in order and then sort CallDate from earliest to latest. I then have formulas like:
=IF(AND(CurrentCustomerID = FutureCustomerID, FutureCallDate - CurrentCallDate <= RepeatTime), 1,0)
CurrentCustomerID = the current row, and the FutureCustomerID = the following row (so it is saying if the customer ID is the same).
FutureCallDate = the following row and the CurrentCallDate = the current row. It is subtracting the future call time from the first call time to measure the time in between.
The goal is to be able to see, dynamically, how many customers called in for a specific reason within maybe 4 hours or 1 day or 5 days, etc. All of the way up until 30 days (this is our actual metric but it is good to see the calls which are repeats within a shorter time frame so we can investigate).
I had a similar problem, see here for detailed version Array calculation in Tableau, maxif routine
Your case is basically the same as mine, so you could apply that solution, but I find the one I'm about to give easier to understand. I would do:
1) Create a calculated field called RepeatTime:
DATEDIFF('day', LOOKUP(MAX(CallDates), -1), MAX(CallDates))
This calculates how many days have passed between the previous call and the current one (note the argument order: DATEDIFF returns the second date minus the first). You can wrap it in IFNULL to avoid Null values for the first entry.
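For example, a null-safe version might be (a sketch; 0 as the fallback is an arbitrary choice):
IFNULL(DATEDIFF('day', LOOKUP(MAX(CallDates), -1), MAX(CallDates)), 0)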
2) Drag CustomersID, CallDates and RepeatTime to the worksheet (they can be on the Marks tab; they don't need to be on Rows or Columns).
3) Configure the table calculation of RepeatTime: Compute using Advanced..., partitioning by CustomersID, addressing by CallDates.
Also sort by the field CallDates, Maximum, Ascending.
This guarantees the table calculation works properly.
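To get the repeat flag from the question, a second calculated field might look like this (a sketch; [RepeatTime Param] stands for the 0-30 day parameter, named here to avoid clashing with the field above):
IF NOT ISNULL([RepeatTime]) AND [RepeatTime] <= [RepeatTime Param] THEN 1 ELSE 0 END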
4) Now you have a base that you can use for what you need. You can either export it to csv or mdb and connect to it.
The best approach, actually, is to have this RepeatTime field calculated outside Tableau, on your database, so it's already there when you connect to it. But this is a way to use Tableau to do the calculation for you.
Unfortunately there's no way to do this directly with your database.

Default range for date range filter in tableau

I want to set the default range on a date filter to show the last 10 days: basically, look at the lastDate (max date) in the data and, by default, filter to only the last 10 days (maxDate - 10).
How it looks now:
I would still want to see the entire range bar on the dashboard and give the user the ability to modify the selected range if he wants to. The maxDate changes after every data refresh, so it has to be some sort of condition that is applied to the filter.
How I want it to look (by default after every refresh of data - new dates coming in):
Any suggestions on how this can be done? I know I can use the relative date and show the data for last 10 days but that would modify the filter and create a drop down which I don't want.
Any suggestions are welcome!
One simple approach that does most of what you want is the following:
Create an integer-valued parameter with a range from 1 to some max you choose, say 100. Call it num_days.
Show the parameter control on your dashboard as a slider, and give it a nice title like "Number of days to display".
Create a boolean calculated field called Within_Day_Range defined as:
datediff('minute', [My_Date_Field], now()) < [num_days] * 24 * 60
Put Within_Day_Range on the filter shelf and select the value true.
This lets the user easily select how many days in the past to include, and works to the granularity of minutes (i.e. the last two days really means the last 48 hours, not starting at midnight yesterday). Adjust the calculated field if you want different behavior.
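For example, to count whole calendar days instead of minutes, a variant might be (a sketch; today() is evaluated against the current date):
datediff('day', [My_Date_Field], today()) < [num_days]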
The main drawback of this approach as described so far is that it doesn't display the earliest date available in the database, because that is filtered out. On the other hand, quick filters issue an initial query to get the bounds, which has a performance cost, so the approach described here avoids that query and thus loads faster.
If you really need that information on your dashboard, you could create a different worksheet to get just the min([My_Date_Field]) and display that near your parameter control.

Sorting result set according to variable in iReport

I have a resultset with columns:
interval_start(timestamp) , camp , queue , other columns
2012-09-10 11:10 c1 q1
2012-09-10 11:20 c1 q2
interval_start has values at 10-minute intervals, like:
2012-09-10 11:10,
2012-09-10 11:20,
2012-09-10 11:30 ....
Using the Joda Time library and the interval_start field, I have created a variable that builds a string: if the minutes of interval_start lie between 00 and 30, the minutes are set to 30; otherwise they are set to 00.
I want to group the data as:
camp as group 1
the created variable as group 2
queue as group 3
and I have done some aggregations.
But in my report output, I am getting the same queue many times within the same interval.
I have used order by camp, interval_start, queue, but the problem still exists.
Attaching screenshot for your reference:
Is there any way to sort the resultset according to created variable?
My best guess is that this is an issue with your actual SQL query. You say the same queue is being repeated, but from looking at your image it is not actually repeated; it is a different row.
Your query is going to be tough to pull off, because you really want it to have an order by of camp, (rounded) interval_start, queue. Without that, it orders by the camp column, then the non-rounded interval_start, then queue, which means the data is not in the correct order for JasperReports to group it the way you want. The real kicker is that JasperReports does not have the functionality to sort the data once it gets it; it is up to the developer to supply sorted data.
So you have a couple options:
Update your SQL query to round the time. Depending on your database this is done in different ways, and it may require some type of stored procedure or function to handle it (see this TSQL function for example); a sketch follows after this list.
Instead of having the SQL query in the report, move it outside the report and process the data, doing the rounding and sorting on the Java side. Then pass it in as the REPORT_DATA_SOURCE parameter.
Add a column to your table to store the rounded time. You may be able to create a trigger to handle this entirely in the database, without having to change any other code in your application.
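For option 1, a sketch of the rounding in SQL (assuming MySQL-style functions and a hypothetical table name; it floors interval_start to 30-minute buckets, so adapt it to match your variable's logic):
SELECT camp,
       FROM_UNIXTIME(FLOOR(UNIX_TIMESTAMP(interval_start) / 1800) * 1800) AS interval_rounded,
       queue
       -- plus your aggregate columns
FROM call_stats
ORDER BY camp, interval_rounded, queue;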
Honestly, none of these options is ideal, and I hope someone comes along and provides an answer that proves me wrong. But I do not think there is currently a better way.