Why does join use rows that were sent after watermark of 20 seconds? - scala

I’m using watermark to join two streams as you can see below:
val order_wm = order_details.withWatermark("tstamp_trans", "20 seconds")
val invoice_wm = invoice_details.withWatermark("tstamp_trans", "20 seconds")
val join_df = order_wm
.join(invoice_wm, order_wm.col("s_order_id") === invoice_wm.col("order_id"))
My understanding with the above code, it will keep each of the stream for 20 secs. After it comes but, when I’m giving one stream now and the another after 20secs then also both are getting joined. It seems like even after watermark got finished Spark is holding the data in memory. I even tried after 45 seconds and that was getting joined too.
This is creating confusion in my mind regarding watermark.

After it comes but, when I’m giving one stream now and the another after 20secs then also both are getting joined.
That's possible since the time measured is not the time of events as they arrive, but the time that is inside the watermarked field, i.e. tstamp_trans. You have to make sure that the last time in tstamp_trans is 20 seconds after the rows that will participate in the join.

Quoting the doc from: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#inner-joins-with-optional-watermarking
In other words, you will have to do the following additional steps in the join.
Define watermark delays on both inputs such that the engine knows how delayed the input can be (similar to streaming aggregations)
Define a constraint on event-time across the two inputs such that the engine can figure out when old rows of one input is not going to be required (i.e. will not satisfy the time constraint) for matches with the other input. This constraint can be defined in one of the two ways.
Time range join conditions (e.g. ...JOIN ON leftTime BETWEEN rightTime AND rightTime + INTERVAL 1 HOUR),
Join on event-time windows (e.g. ...JOIN ON leftTimeWindow = rightTimeWindow).

Related

Efficient filtering of a large dataset in Spark based on other dataset

I have an issue with filtering a large dataset, which I believe is because of an inefficient join. What I'm trying to do is the following:
Dataset info contains a lot of user data, identified by a user ID and a timestamp.
Dataset filter_dates contains, for every user ID, the most recent timestamp processed.
The processing job then needs to determine, from a large input source, what data in info is new and to be processed.
What I tried doing was the following:
info
.join(
broadcast(filter_dates),
info("userid") === filter_dates("userid"),
"left"
)
.filter('info_date >= 'latest_date || 'latest_date.isNull)
But this is incredibly slow (taking many hours), and in the Spark UI I see it processing an absurd amount of data as Input.
Then, just for fun, I tried it with this:
val startDate = // Calculate the minimum startDate from filter_dates
info
.filter('info_date >= startDate)
And this was blazingly fast (Taking some minutes), but of course this is not optimal because it ends up re-processing some data. This leads me to believe there's something fundamentally wrong with how I joined the datasets.
Does anyone know how I can improve the joins?

Siddhi query for event prior to another within time limit

I am trying to write a Siddhi query to detect if an event didn't happen prior to another within a time limit. The query I have to detect if 'X' didn't ever happen prior to 'Y' in the entire life of the siddhi application is:
from stream[value == 'Y']
and not stream[value == 'X']
I assumed adding a time constraint would work:
from stream[value == 'Y']
and not stream[value == 'X'] for 5 min
However, the 'for' statement never has any effect that I can see. This query is still triggered whether 'X' was 4 minutes ago or 6 minutes ago. I understand that a similar effect can be achieved by checking if 'Y' comes after 'X' within a time limit, but for my purposes I need to know the other way around.
Is this possible with Siddhi? If so can someone please provide a sample query that could achieve this?
Based on your comment I am composing below answer.
Sequence construct in Siddhi will guarantee that no one can enter the state flow at a random point. As an example let's take below sample sequence query.
from every e1=InputStream[state='X'], e2=InputStream[state'Y']
select e1.state as initialState, e2.state as finalState
insert into NextStream;
So in here we are mandating X followed by a consecutive Y. If Y occurs before X sequence construct will handle that and discard Y. So whatever goes into NextStream is guaranteed to satisfy x -> Y state transition.
More relaxed construct of sequence is pattern. It is same as sequence while relaxing the consecutive arrival requirement. Hope this helps!!

KSQL Hopping Window : accessing only oldest subwindow

I am tracking the rolling sum of a particular field by using a query which looks something like this :
SELECT id, SUM(quantity) AS quantity from stream \
WINDOW HOPPING (SIZE 1 MINUTE, ADVANCE BY 10 SECONDS) \
GROUP BY id;
Now, for every input tick, it seems to return me 6 different aggregated values I guess which are for the following time periods :
[start, start+60] seconds
[start+10, start+60] seconds
[start+20, start+60] seconds
[start+30, start+60] seconds
[start+40, start+60] seconds
[start+50, start+60] seconds
What if I am interested is only getting the [start, start+60] seconds result for every tick that comes in. Is there anyway to get ONLY that?
Because you specify a hopping window, each record falls into multiple windows and all windows need to be updated when processing a record. Updating only one window would be incorrect and the result would be wrong.
Compare the Kafka Streams docs about hopping windows (Kafka Streams is KSQL's internal runtime engine): https://docs.confluent.io/current/streams/developer-guide/dsl-api.html#hopping-time-windows
Update
Kafka Streams is adding proper sliding window support via KIP-450 (https://cwiki.apache.org/confluence/display/KAFKA/KIP-450%3A+Sliding+Window+Aggregations+in+the+DSL). This should allow to add sliding window to ksqlDB later, too.
I was in a similar situation and creating a user defined function to access only the window with collect_list(column).size() = window duration appears to be a promising track.
In the udf use List type to get one of your aggregate base column list of values. Then assess is the formed list size is equal to the hopping window number of period, return null otherwise.
From this create a table selecting data and transforming it with the udf.
Create a table from this latest table and filter out null values on the transformed column.

SparkSQL PostgresQL Dataframe partitions

I have a very simple setup of SparkSQL connecting to a Postgres DB and I'm trying to get a DataFrame from a table, the Dataframe with a number X of partitions (lets say 2). The code would be the following:
Map<String, String> options = new HashMap<String, String>();
options.put("url", DB_URL);
options.put("driver", POSTGRES_DRIVER);
options.put("dbtable", "select ID, OTHER from TABLE limit 1000");
options.put("partitionColumn", "ID");
options.put("lowerBound", "100");
options.put("upperBound", "500");
options.put("numPartitions","2");
DataFrame housingDataFrame = sqlContext.read().format("jdbc").options(options).load();
For some reason, one partition of the DataFrame contains almost all rows.
For what I can understand lowerBound/upperBound are the parameters used to finetune this. In SparkSQL's documentation (Spark 1.4.0 - spark-sql_2.11) it says they are used to define the stride, not to filter/range the partition column. But that raises several questions:
The stride is the frequency (number of elements returned each query) with which Spark will query the DB for each executor (partition)?
If not, what is the purpose of this parameters, what do they depend on and how can I balance my DataFrame partitions in a stable way (not asking all partitions contain the same number of elements, just that there is an equilibrium - for example 2 partitions 100 elements 55/45 , 60/40 or even 65/35 would do)
Can't seem to find a clear answer to these questions around and was wondering if maybe some of you could clear this points for me, because right now is affecting my cluster performance when processing X million rows and all the heavy lifting goes to one single executor.
Cheers and thanks for your time.
Essentially the lower and upper bound and the number of partitions are used to calculate the increment or split for each parallel task.
Let's say the table has partition column "year", and has data from 2006 to 2016.
If you define the number of partitions as 10, with lower bound 2006 and higher bound 2016, you will have each task fetching data for its own year - the ideal case.
Even if you incorrectly specify the lower and / or upper bound, e.g. set lower = 0 and upper = 2016, there will be a skew in data transfer, but, you will not "lose" or fail to retrieve any data, because:
The first task will fetch data for year < 0.
The second task will fetch data for year between 0 and 2016/10.
The third task will fetch data for year between 2016/10 and 2*2016/10.
...
And the last task will have a where condition with year->2016.
T.
Lower bound are indeed used against the partitioning column; refer to this code (current version at the moment of writing this):
https://github.com/apache/spark/blob/40ed2af587cedadc6e5249031857a922b3b234ca/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
Function columnPartition contains the code for the partitioning logic and the use of lower / upper bound.
lowerbound and upperbound have been currently identified to do what they do in the previous answers. A followup to this would be how to balance the data across partitions without looking at the min max values or if your data is heavily skewed.
If your database supports "hash" function, it could do the trick.
partitionColumn = "hash(column_name)%num_partitions"
numPartitions = 10 // whatever you want
lowerBound = 0
upperBound = numPartitions
This will work as long as the modulus operation returns a uniform distribution over [0,numPartitions)

Tableau Future and Current References

Tough problem I am working on here.
I have a table of CustomerIDs and CallDates. I want to measure whether there is a 'repeat call' within a certain period of time (up to 30 days).
I plan on creating a parameter called RepeatTime which is a range from 0 - 30 days, so the user can slide a scale to see the number/percentage of total repeats.
In Excel, I have this working. I sort CustomerID in order and then sort CallDate from earliest to latest. I then have formulas like:
=IF(AND(CurrentCustomerID = FutureCustomerID, FutureCallDate - CurrentCallDate <= RepeatTime), 1,0)
CurrentCustomerID = the current row, and the FutureCustomerID = the following row (so it is saying if the customer ID is the same).
FutureCallDate = the following row and the CurrentCallDate = the current row. It is subtracting the future call time from the first call time to measure the time in between.
The goal is to be able to see, dynamically, how many customers called in for a specific reason within maybe 4 hours or 1 day or 5 days, etc. All of the way up until 30 days (this is our actual metric but it is good to see the calls which are repeats within a shorter time frame so we can investigate).
I had a similar problem, see here for detailed version Array calculation in Tableau, maxif routine
In your case, that is basically the same thing as mine, so you could apply that solution, but I find it easier to understand the one I'm about to give, I would do:
1) Create a calculated field called RepeatTime:
DATEDIFF('day',MAX(CallDates),LOOKUP(MAX(CallDates),-1))
This will calculated how many days have passed since the last call to the current. You can add a IFNULL not to get Null values for the first entry.
2) Drag CustomersID, CallDates and RepeatTime to the worksheet (can be on the marks tab, don't need to be on rows or column).
3) Configure the table calculation of RepeatTIme, Compute using Advanced..., partitioning CustomersID, Adressing CallDates
Also Sort by Field CallDates, Maximum, Ascending.
This will guarantee the table calculation works properly
4) Now you have a base that you can use for what you need. You can either export it to csv or mdb and connect to it.
The best approach, actually, is to have this RepeatTime field calculated outside Tableau, on your database, so it's already there when you connect to it. But this is a way to use Tableau to do the calculation for you.
Unfortunately there's no direct way to do this directly with your database.