Inconsistent count after window lead function and filter - Scala

Edit 2:
I've reported this as an issue to the Spark developers; I will post the status here when I get a response.
I have a problem that has been bothering me for quite some time now.
Imagine you have a dataframe with several million records, with these columns:
df1:
start(timestamp)
user_id(int)
type(string)
I need to compute the duration between two consecutive rows and filter on that duration and the type.
I used the window lead function to get the next event's start time (which defines the end of the current event), so every row now has start and stop times.
If it is NULL (the last row, for example), the next midnight is added as the stop.
The data is stored in ORC files (I also tried the Parquet format, no difference).
This only happens on a cluster with multiple executor nodes, for example an AWS EMR cluster or a local Docker cluster setup.
If I run it on a single instance (locally on my laptop), I get consistent results every time.
The Spark version is 3.0.1, both on AWS and in the local Docker setup.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("user_id").orderBy("start")
// next row's start (or start + 1 day for the last row in a partition) becomes this row's end
val ts_lead = coalesce(lead("start", 1).over(w), date_add(col("start"), 1))

val df2 = df1.
  withColumn("end", ts_lead).
  withColumn("duration", col("end").cast("long") - col("start").cast("long"))

df2.where("type='B' and duration>4").count()
Every time I run this last count, I get different results.
For example:
run 1: 19359949
run 2: 19359964
If I run each filter separately, everything is OK and I get consistent results.
But if I combine them, the counts are inconsistent.
I tried filtering into a separate dataframe, first by duration and then by type and vice versa; no joy there either.
I know that I can cache or checkpoint the dataframe, but it's a very large dataset and I run similar calculations multiple times, so I can't really spare the time and disk space for checkpoints and caching.
Is this a bug in Spark, or am I missing something?
Edit:
I have created sample code with dummy random data so anyone can try to reproduce it.
Since this sample uses random numbers, it's necessary to write the dataset after generation and re-read it.
I used a for loop to generate the set because when I tried to generate 25,000,000 rows in one pass, I got an out of memory error.
I saved it to an S3 bucket; here it is masked as your-bucket.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val getRandomUser = udf(() => {
  val users = Seq("John", "Eve", "Anna", "Martin", "Joe", "Steve", "Katy")
  users(scala.util.Random.nextInt(7))
})
val getRandomType = udf(() => {
  val types = Seq("TypeA", "TypeB", "TypeC", "TypeD", "TypeE")
  types(scala.util.Random.nextInt(5))
})
val getRandomStart = udf((x: Int) => {
  x + scala.util.Random.nextInt(47)
})

for (a <- 0 to 23) {
  // use the loop index a to continue with the next million ids (1,000,000 rows per iteration)
  val x = Range(a * 1000000, (a * 1000000) + 1000000).toDF("id").
    withColumn("start", getRandomStart(col("id"))).
    withColumn("user", getRandomUser()).
    withColumn("type", getRandomType()).
    drop("id")
  x.write.mode("append").orc("s3://your-bucket/random.orc")
}

val w = Window.partitionBy("user").orderBy("start")
val ts_lead = coalesce(lead("start", 1).over(w), lit(30000000))

val fox2 = spark.read.orc("s3://your-bucket/random.orc").
  withColumn("end", ts_lead).
  withColumn("duration", col("end") - col("start"))

// repeated executions of this line return different results for the count
fox2.where("type='TypeA' and duration>4").count()
My results for three consecutive runs of the last line were:
run 1: 2551259
run 2: 2550756
run 3: 2551279
Every run gives a different count.

I have reproduced your issue locally.
As far as I understand, the issue is that you are filtering by duration in this statement:
fox2.where("type='TypeA' and duration>4").count()
and duration is derived from the randomly generated start values. Even if you used a seed, once the generation is parallelised you do not know which random value will be added to each id.
For example, if 4 generated numbers were 21, 14, 5, 17, and the ids were 1, 2, 3, 4, the start column sometimes could be:
1 + 21
2 + 14
3 + 5
4 + 17
and sometimes could be:
1 + 21
4 + 14
3 + 5
2 + 17
This leads to different start values, and hence different duration values, ultimately changing the result of the final filter and count, because row order in dataframes is not guaranteed when running in parallel.
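A minimal sketch of one way around it, assuming the goal is simply reproducible test data: derive start (and, if needed, user and type) deterministically from id instead of from scala.util.Random, so that re-evaluating the generator on any executor yields the same rows. hash and pmod are standard Spark SQL functions; the modulus 47 mirrors the range in the original getRandomStart UDF (one chunk shown, but the same expression works inside the original loop).
import org.apache.spark.sql.functions._
import spark.implicits._

// deterministic: the same id always yields the same start, no matter which
// executor or task (re)computes it
val deterministicStart = col("id") + pmod(hash(col("id")), lit(47))

val x = Range(0, 1000000).toDF("id").
  withColumn("start", deterministicStart)
// user and type can be built the same way, e.g. by indexing a literal array:
//   element_at(typedLit(users), (pmod(hash(col("id"), lit(1)), lit(7)) + 1).cast("int"))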

Related

Find duplicates but keep one with a twist in datediff pyspark

I have a sample spark dataframe:
df = [("1","5563","John","Smith","2020-12-15","M"),
("1","5563","John","Smith","2020-12-18","M"),
("2","5568","Jane","King","2020-12-15","F"),
("3","5574","Ernest","Goldberg","2021-10-12","M"),
("5","31","Joe","Hanson","2022-03-16","M"),
("1","5563","John","Smith","2021-01-02","M"),
("2","5568","Jane","King","2021-01-25","F")
]
columns = ['bldg_num','person_id','first_name','last_name','intake_date','gender']
What I would like to happen is to check the difference between intake_dates for the same person_id. If the datediff is less than 5 days, keep only the later record. But if the datediff between records of the same id is 5 or more days, keep both records.
I initially ran the following using SQL:
SELECT *
FROM df
WHERE EXISTS (
  SELECT *
  FROM df df_v2
  WHERE df.person_id = df_v2.person_id
    AND DATEDIFF(df.intake_date, df_v2.intake_date) >= 5 )
However, the above code only keeps rows where the datediff is 5 or more. There are ids that may have only one intake_date, and those still need to be kept.
There are links here that partially meet what I need to do, but so far none is close enough: "remove duplicates in list, but keep one copy", "Finding partial and exact duplicate from a SQL table". I also thought of using the concept of returning/recurring customers ("Calculate recurring customer"), but I am lost in writing the code. From the df above, I am expecting to get the following output:
df2 = [
("1","5563","John","Smith","2020-12-18","M"),
("2","5568","Jane","King","2020-12-15","F"),
("3","5574","Ernest","Goldberg","2021-10-12","M"),
("5","31","Joe","Hanson","2022-03-16","M"),
("1","5563","John","Smith","2021-01-02","M"),
("2","5568","Jane","King","2021-01-25","F")
]
The first row will be dropped because it is less than 5 days before the next intake_date, while the id=5 and id=3 rows are still kept. I am thinking the only way to do it is to write a udf, but I am still new to that concept. A SQL alternative is OK too, if it is possible.
Let me know if my question is a bit confusing; I will try to rephrase it to make it clearer. Thank you.
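A hedged sketch of the window/lead approach those links hint at (my sketch, not from the thread; written in Scala for consistency with the rest of this page, but lead, datediff and Window work the same way in PySpark, and the column names are taken from the question):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// for each person, look at the next intake_date; drop a row only when the
// next intake for the same person_id is less than 5 days later
val w = Window.partitionBy("person_id").orderBy("intake_date")

val kept = df.
  withColumn("next_intake", lead("intake_date", 1).over(w)).
  where(col("next_intake").isNull || datediff(col("next_intake"), col("intake_date")) >= 5).
  drop("next_intake")
On the sample data this keeps exactly the rows listed in the expected df2 above: the 2020-12-15 John Smith row is dropped because his next intake is only 3 days later, while single-intake ids are kept because next_intake is null for them.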

Why does join use rows that were sent after watermark of 20 seconds?

I'm using a watermark to join two streams, as you can see below:
val order_wm = order_details.withWatermark("tstamp_trans", "20 seconds")
val invoice_wm = invoice_details.withWatermark("tstamp_trans", "20 seconds")
val join_df = order_wm
.join(invoice_wm, order_wm.col("s_order_id") === invoice_wm.col("order_id"))
My understanding of the above code is that each stream's rows will be kept for 20 seconds. But when I send one stream now and the other 20 seconds later, both still get joined. It seems like even after the watermark has passed, Spark is holding the data in memory. I even tried after 45 seconds, and that was getting joined too.
This is creating confusion in my mind regarding watermarks.
"When I send one stream now and the other 20 seconds later, both still get joined."
That's possible, since the time measured is not the wall-clock time at which events arrive, but the event time carried in the watermarked field, i.e. tstamp_trans. You have to make sure that the latest time seen in tstamp_trans is more than 20 seconds after the rows that should no longer participate in the join.
Quoting the doc from: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#inner-joins-with-optional-watermarking
In other words, you will have to do the following additional steps in the join.
Define watermark delays on both inputs such that the engine knows how delayed the input can be (similar to streaming aggregations)
Define a constraint on event-time across the two inputs such that the engine can figure out when old rows of one input is not going to be required (i.e. will not satisfy the time constraint) for matches with the other input. This constraint can be defined in one of the two ways.
Time range join conditions (e.g. ...JOIN ON leftTime BETWEEN rightTime AND rightTime + INTERVAL 1 HOUR),
Join on event-time windows (e.g. ...JOIN ON leftTimeWindow = rightTimeWindow).
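For the two streams above, the first option (a time range join condition) could look roughly like this - a sketch, not part of the original answer, and the 20-second range is an assumption you would tune to how late invoices may arrive relative to orders:
import org.apache.spark.sql.functions.expr

// equality condition plus an event-time range constraint, so Spark knows when
// buffered rows on either side can be dropped from the join state
val join_df = order_wm.join(
  invoice_wm,
  order_wm.col("s_order_id") === invoice_wm.col("order_id") &&
    invoice_wm.col("tstamp_trans") >= order_wm.col("tstamp_trans") &&
    invoice_wm.col("tstamp_trans") <= order_wm.col("tstamp_trans") + expr("interval 20 seconds"))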

Spark window functions: how to implement complex logic with good performance and without looping

I have a data set that lends itself to window functions: 3M+ rows that, once ranked, can be partitioned into groups of ~20 or fewer rows. Here is a simplified example:
id date1 date2 type rank
171 20090601 20090601 attempt 1
171 20090701 20100331 trial_fail 2
171 20090901 20091101 attempt 3
171 20091101 20100201 attempt 4
171 20091201 20100401 attempt 5
171 20090601 20090601 fail 6
188 20100701 20100715 trial_fail 1
188 20100716 20100730 trial_success 2
188 20100731 20100814 trial_fail 3
188 20100901 20100901 attempt 4
188 20101001 20101001 success 5
The data is ranked by id and date1, and the window created with:
Window.partitionBy("id").orderBy("rank")
In this example the data has already been ranked by (id, date1). I could also work on the unranked data and rank it within Spark.
I need to implement some logic on these rows, for example, within a window:
1) Identify all rows that end during a failed trial (i.e. a row's date2 is between date1 and date2 of any previous row within the same window of type "trial_fail").
2) Identify all trials after a failed trial (i.e. any row with type "trial_fail" or "trial_success" after a row within the same window of type "trial_fail").
3) Identify all attempts before a successful attempt (i.e. any row with type "attempt" with date1 earlier than date1 of another later row of type "success").
The exact logic of these conditions is not important to my question (and there will be other, different conditions); what's important is that the logic depends on values from many rows in the window at once. This can't be handled by the simple Spark SQL functions like first, last, lag, lead, etc., and isn't as simple as the typical example of finding the largest/smallest 1 or n rows in the window.
What's also important is that the partitions don't depend on one another, so this seems like a great candidate for Spark to do in parallel: 3 million rows with 150,000 partitions of ~20 rows each. In fact, I wonder if this is too many partitions.
I can implement this with a loop something like (in pseudocode):
for i in 1..20:
  for j in 1..20:
    // compare window[j]'s type and dates to window[i]'s, etc.
    // add a Y/N flag to the DF to identify target rows
This would require 400+ iterations (the choice of 20 for the max i and j is an educated guess based on the data set and could actually be larger), which seems needlessly brute force.
However, I am at a loss for a better way to implement it. I think this approach would essentially collect() everything in the driver, which I suppose might be OK if it is not much data. I thought of trying to implement the logic as sub-queries, or by creating a series of sub-DFs, each with a subset or reduction of the data.
If anyone is aware of any APIs or techniques that I am missing, any info would be appreciated.
Edit: This is somewhat related:
Spark SQL window function with complex condition
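One pattern that avoids the explicit double loop (a sketch under my own assumptions, not an answer from the thread) is to give every row its whole group as an array of structs via collect_list over an unbounded window, and then evaluate the cross-row conditions with higher-order functions such as exists. The Column-based exists shown here needs Spark 3.x; on 2.4 the same condition can be written inside expr("exists(group_rows, r -> ...)"). Column names follow the example table above; condition 1 is shown, and the other conditions follow the same shape.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// every row sees all rows of its id group as an array of structs
val wAll = Window.partitionBy("id").orderBy("rank").
  rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

val withGroup = df.withColumn(
  "group_rows",
  collect_list(struct(col("rank"), col("type"), col("date1"), col("date2"))).over(wAll))

// condition 1: this row's date2 falls inside a previous trial_fail row's [date1, date2]
val flagged = withGroup.withColumn(
  "ends_during_failed_trial",
  exists(col("group_rows"), r =>
    r("type") === "trial_fail" &&
      r("rank") < col("rank") &&
      col("date2") >= r("date1") && col("date2") <= r("date2")))
Because each group has at most ~20 rows, the collected array stays tiny, and the work remains distributed across partitions rather than collected to the driver.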

How to fit a kernel density estimate on a pyspark dataframe column and use it for creating a new column with the estimates

My use case is the following. Consider that I have a pyspark dataframe with the following format:
df.columns:
1. hh: Contains the hour of the day (type int)
2. userId : some unique identifier.
What I want to do is figure out the list of userIds which have anomalous hits on the page. So I first do a groupby like so:
df = df.groupby("hh", "userId").count().withColumnRenamed("count", "LoginCounts")
Now the format of the dataframe would be:
1. hh
2. userId
3. LoginCounts: Number of times a specific user logs in at a particular hour.
I want to use the pyspark kde function as follows:
from pyspark.mllib.stat import KernelDensity

kd = KernelDensity()
kd.setSample(df.select("LoginCounts").rdd)
kd.estimate([13.0, 14.0])
I get the error:
Py4JJavaError: An error occurred while calling o647.estimateKernelDensity.
: org.apache.spark.SparkException: Job aborted due to stage failure
Now my end goal is to fit a KDE on, say, one day's hour-based data and then use the next day's data to get the probability estimates for each login count.
E.g. I would like to achieve something of this nature:
df.withColumn("kdeProbs", kd.estimate(col("LoginCounts")))
So the column kdeProbs will contain P(LoginCount=x | estimated kde).
I have tried searching for an example of the same but am always redirected to the standard kde example on the spark.apache.org page, which does not solve my case.
It's not enough to just select one column and convert it to an RDD; that gives you an RDD of Row objects, and you need to extract the actual values from those rows for setSample to work. Try this:
from pyspark.mllib.stat import KernelDensity

dat_rdd = df.select("LoginCounts").rdd

# extract the numeric value from each Row object
dat_rdd_data = dat_rdd.map(lambda x: x[0])

kd = KernelDensity()
kd.setSample(dat_rdd_data)
kd.estimate([13.0, 14.0])

SparkSQL PostgreSQL DataFrame partitions

I have a very simple setup of SparkSQL connecting to a Postgres DB, and I'm trying to get a DataFrame from a table with a given number X of partitions (let's say 2). The code would be the following:
Map<String, String> options = new HashMap<String, String>();
options.put("url", DB_URL);
options.put("driver", POSTGRES_DRIVER);
options.put("dbtable", "select ID, OTHER from TABLE limit 1000");
options.put("partitionColumn", "ID");
options.put("lowerBound", "100");
options.put("upperBound", "500");
options.put("numPartitions","2");
DataFrame housingDataFrame = sqlContext.read().format("jdbc").options(options).load();
For some reason, one partition of the DataFrame contains almost all rows.
From what I can understand, lowerBound/upperBound are the parameters used to fine-tune this. SparkSQL's documentation (Spark 1.4.0 - spark-sql_2.11) says they are used to define the stride, not to filter/range the partition column. But that raises several questions:
Is the stride the frequency (number of elements returned per query) with which Spark will query the DB for each executor (partition)?
If not, what is the purpose of these parameters, what do they depend on, and how can I balance my DataFrame partitions in a stable way (I'm not asking that all partitions contain the same number of elements, just that there is an equilibrium - for example, with 2 partitions and 100 elements, 55/45, 60/40 or even 65/35 would do)?
I can't seem to find a clear answer to these questions and was wondering if some of you could clear these points up for me, because right now this is affecting my cluster performance when processing X million rows: all the heavy lifting goes to one single executor.
Cheers and thanks for your time.
Essentially the lower and upper bound and the number of partitions are used to calculate the increment or split for each parallel task.
Let's say the table has a partition column "year" and holds data from 2006 to 2016.
If you define the number of partitions as 10, with lower bound 2006 and upper bound 2016, each task will fetch data for its own year - the ideal case.
Even if you incorrectly specify the lower and/or upper bound, e.g. set lower = 0 and upper = 2016, there will be a skew in data transfer, but you will not "lose" or fail to retrieve any data, because:
The first task will fetch data for year < 2016/10 (the first stride boundary above the lower bound of 0).
The second task will fetch data for year between 2016/10 and 2*2016/10.
The third task will fetch data for year between 2*2016/10 and 3*2016/10.
...
And the last task will have a where condition with year >= 9*2016/10, which is where all of the 2006-2016 rows actually land - hence the skew.
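As a concrete sketch of the well-bounded case (Scala and a placeholder table name; the question's code is Java, and with the 1.4-era API you would go through sqlContext.read() with the same options):
import java.util.Properties

val props = new Properties()
props.setProperty("driver", POSTGRES_DRIVER)

val byYear = spark.read.jdbc(
  DB_URL,   // JDBC url from the question
  "facts",  // placeholder table name
  "year",   // partitionColumn
  2006L,    // lowerBound
  2016L,    // upperBound
  10,       // numPartitions
  props)

// Spark derives a stride of (2016 - 2006) / 10 = 1 and issues one query per
// partition, roughly:
//   partition 0: WHERE year < 2007
//   partition 1: WHERE year >= 2007 AND year < 2008
//   ...
//   partition 9: WHERE year >= 2015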
The lower and upper bounds are indeed used against the partitioning column; refer to this code (the current version at the time of writing):
https://github.com/apache/spark/blob/40ed2af587cedadc6e5249031857a922b3b234ca/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala
The function columnPartition contains the partitioning logic and the use of the lower / upper bounds.
lowerBound and upperBound do what the previous answers describe; a follow-up question is how to balance the data across partitions without looking at the min/max values, or when your data is heavily skewed.
If your database supports a "hash" function, it could do the trick:
partitionColumn = "hash(column_name)%num_partitions"
numPartitions = 10 // whatever you want
lowerBound = 0
upperBound = numPartitions
This will work as long as the modulus operation returns a uniform distribution over [0, numPartitions).
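A sketch of how that could be wired up (my wiring, not part of the answer; written against the current DataFrameReader API). Exposing the hash as a real column through a dbtable subquery keeps it working on Spark versions that require partitionColumn to be an actual column of the relation; hashtext and the table/column names are Postgres-flavoured assumptions.
val numPartitions = 10

val balanced = spark.read.format("jdbc").
  option("url", DB_URL).
  option("driver", POSTGRES_DRIVER).
  option("dbtable",
    s"(select t.*, abs(hashtext(id::text)) % $numPartitions as part_key from my_table t) as sub").
  option("partitionColumn", "part_key").
  option("lowerBound", "0").
  option("upperBound", numPartitions.toString).
  option("numPartitions", numPartitions.toString).
  load()

// rows are spread by hash of the key, so partition sizes stay balanced even
// when the natural distribution of the partition column is heavily skewed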