Getting X rows before each occurrence of a value in Spark - pyspark

I'm pretty new to Spark, and I've hit a bit of a conceptual roadblock. I'm looking for general thoughts on how to approach this problem:
I have some log data of this form -
+-------------+--------------------+----+----------+
|serial_number| timestamp|code|fault_type|
+-------------+--------------------+----+----------+
| 633878|2017-12-11 01:45:...| 1| STATE|
| 633833|2017-12-11 01:45:...| 3| STATE|
| 633745|2017-12-11 01:45:...| 306| STATE|
| 633747|2017-12-11 01:46:...| 1| STATE|
| 634039|2017-12-11 01:46:...| 4| STATE|
| 633833|2017-12-11 01:46:...| 1| STATE|
| 637480|2017-12-11 01:46:...| 1| STATE|
| 634029|2017-12-11 01:46:...| 3| STATE|
| 634046|2017-12-11 01:46:...| 3| STATE|
| 634039|2017-12-11 01:46:...| 1| STATE|
Sometimes fault_type will equal QUIT rather than STATE. I am looking for a way in Spark to select the X records preceding each QUIT fault and create a dataframe of these selected blocks of data, where each row could be a list of the X codes preceding the QUIT.
Thanks for the help!

I would join your data frame to itself.
On the left you select all events with fault type QUIT; on the right you select all preceding events (timestamp less than the QUIT's), ordered by time with a limit.
Then you can group by the records on the left and perform collect_list over the records on the right.
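A rough PySpark sketch of that idea, assuming the column names from the question, that "preceding" means earlier events on the same serial_number, and an illustrative X = 5; the "limit" is emulated with a row_number per QUIT:
from pyspark.sql import functions as F, Window

X = 5  # how many records to keep before each QUIT (an assumed value)

quits = df.filter(F.col("fault_type") == "QUIT") \
          .select(F.col("serial_number").alias("q_serial"),
                  F.col("timestamp").alias("q_ts"))
events = df.select("serial_number", "timestamp", "code")

# left side: QUIT events; right side: all earlier events of the same device
pairs = quits.join(
    events,
    (events["serial_number"] == quits["q_serial"]) & (events["timestamp"] < quits["q_ts"]),
    "inner",
)

# keep only the X most recent events before each QUIT
w = Window.partitionBy("q_serial", "q_ts").orderBy(F.col("timestamp").desc())
preceding = pairs.withColumn("rn", F.row_number().over(w)).filter(F.col("rn") <= X)

# one row per QUIT with the list of its preceding codes
# (collect_list does not guarantee order; collect structs and sort if order matters)
result = preceding.groupBy("q_serial", "q_ts") \
                  .agg(F.collect_list("code").alias("preceding_codes"))
If the preceding events should instead be taken across all devices, drop serial_number from the join condition and from the window partition.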

Get last n items in pyspark

For a dataset like -
+---+------+----------+
| id| item| timestamp|
+---+------+----------+
| 1| apple|2022-08-15|
| 1| peach|2022-08-15|
| 1| apple|2022-08-15|
| 1|banana|2022-08-14|
| 2| apple|2022-08-15|
| 2|banana|2022-08-14|
| 2|banana|2022-08-14|
| 2| water|2022-08-14|
| 3| water|2022-08-15|
| 3| water|2022-08-14|
+---+------+----------+
Can I use PySpark functions directly to get the last three items each user purchased in the past 5 days? I know a UDF can do that, but I am wondering if any existing function can achieve this.
My expected output is like below; anything similar is okay too.
id last_three_item
1 [apple, peach, apple]
2 [water, banana, apple]
3 [water, water]
Thanks!
You can use a pandas_udf for this.
from pyspark.sql import functions as f
from pyspark.sql.types import ArrayType, StringType

@f.pandas_udf(returnType=ArrayType(StringType()), functionType=f.PandasUDFType.GROUPED_AGG)
def pudf_get_top_3(x):
    # x is the pandas Series of items within one group; keep the first three
    return x.head(3).to_list()

sdf \
    .orderBy("timestamp") \
    .groupBy("id") \
    .agg(pudf_get_top_3("item").alias("last_three_item")) \
    .show()
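If you'd rather stick to built-in functions (the question's original ask), here is a sketch that avoids a UDF, assuming Spark 2.4+ for the slice/transform SQL functions and that "the past 5 days" is counted back from the current date:
from pyspark.sql import functions as F

result = (
    sdf
    # keep only purchases from the past 5 days (assumed cutoff)
    .filter(F.col("timestamp") >= F.date_sub(F.current_date(), 5))
    .groupBy("id")
    # collect (timestamp, item) pairs and sort them newest-first;
    # struct comparison orders by the first field, i.e. timestamp
    .agg(F.sort_array(F.collect_list(F.struct("timestamp", "item")), asc=False).alias("items"))
    .select(
        "id",
        F.expr("transform(slice(items, 1, 3), x -> x.item)").alias("last_three_item"),
    )
)
result.show(truncate=False)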

How do I coalesce rows in pyspark?

In PySpark, there's the concept of coalesce(colA, colB, ...) which will, per row, take the first non-null value it encounters from those columns. However, I want coalesce(rowA, rowB, ...) i.e. the ability to, per column, take the first non-null value it encounters from those rows. I want to coalesce all rows within a group or window of rows.
For example, given the following dataset, I want to coalesce rows per category and ordered ascending by date.
+---------+-----------+------+------+
| category| date| val1| val2|
+---------+-----------+------+------+
| A| 2020-05-01| null| 1|
| A| 2020-05-02| 2| null|
| A| 2020-05-03| 3| null|
| B| 2020-05-01| null| null|
| B| 2020-05-02| 4| null|
| C| 2020-05-01| 5| 2|
| C| 2020-05-02| null| 3|
| D| 2020-05-01| null| 4|
+---------+-----------+------+------+
What I should get as the output is...
+---------+-----------+------+------+
| category| date| val1| val2|
+---------+-----------+------+------+
| A| 2020-05-01| 2| 1|
| B| 2020-05-01| 4| null|
| C| 2020-05-01| 5| 2|
| D| 2020-05-01| null| 4|
+---------+-----------+------+------+
First, I'll give the answer. Then, I'll point out the important bits.
from pyspark.sql import Window
from pyspark.sql.functions import col, dense_rank, first

df = ...  # dataframe from question description

window = (
    Window
    .partitionBy("category")
    .orderBy(col("date").asc())
)
window_unbounded = (
    window
    .rangeBetween(Window.unboundedPreceding, Window.unboundedFollowing)
)

cols_to_merge = [c for c in df.columns if c not in ["category", "date"]]
merged_cols = [first(c, True).over(window_unbounded).alias(c) for c in cols_to_merge]

df_merged = (
    df
    .select([col("category"), col("date")] + merged_cols)
    .withColumn("rank_col", dense_rank().over(window))
    .filter(col("rank_col") == 1)
    .drop("rank_col")
)
The row-wise analogue to coalesce is the aggregation function first. Specifically, we use first with ignorenulls = True so that we find the first non-null value.
When we use first, we have to be careful about the ordering of the rows it's applied to. Because groupBy doesn't allow us to maintain order within the groups, we use a Window.
The window itself must be unbounded on both ends rather than the default (unbounded preceding to current row); otherwise the first aggregation could run on only a subset of each group.
After we aggregate over the window, we alias the column back to its original name to keep the column names consistent.
We use a single select statement of cols rather than a for loop with df.withColumn(col, ...) because the select statement greatly reduces the query plan depth. Should you use the looped withColumn, you might hit a stack overflow error if you have too many columns.
Finally, we run a dense_rank over our window (this time using the window with the default range) and filter to only the first-ranked rows. We use dense_rank here, but any ranking function that fits our needs would work.

Imputing null values in spark dataframe, based on the row category, by fetching the values from another dataframe in Scala

So I have a dataframe, shown below, that has been stored as a temporary view named mean_value_gn5 so that I can query it using sql() whenever I need to fetch the data.
+-------+----+
|Species|Avgs|
+-------+----+
| NO2| 43|
| NOX| 90|
| NO| 31|
+-------+----+
This dataframe stores the categorical average of the 'Species' rounded off to the nearest whole number using the ceil() function. I need to use these categorical averages to impute the missing values of column Value in my dataframe of interest clean_gn5. I created a new column Value_imp which would hold my final column with the imputed values.
I made an attempt to do so as follows:
clean_gn5 = clean_gn5.withColumn("Value_imp",
  when($"Value".isNull, sql("Select Avgs from mean_value_gn5 where Species = " + $"Species").select("Avgs").head().getLong(0).toInt)
    .otherwise($"Value"))
The above-mentioned code runs, but the values are getting imputed incorrectly, i.e. for the row with Species NO the imputed value is 43 instead of 31.
Prior to doing this, I first checked whether I was able to fetch the values correctly by executing the following:
println(sql("Select Avgs from mean_value_gn5 where Species = 'NO'").select("Avgs").head().getLong(0))
I am able to fetch the value correctly after hardcoding the Species, and as per my understanding $"Species" should fetch the value corresponding to the Species column for that particular row.
I also thought that perhaps I was missing the single quotes around the hardcoded Species value, i.e. 'NO'. So I tried the following:
clean_gn5 = clean_gn5.withColumn("Value_imp",
  when($"Value".isNull, sql("Select Avgs from mean_value_gn5 where Species = '" + $"Species" + "'").select("Avgs").head().getLong(0).toInt)
    .otherwise($"Value"))
But that resulted in the following exception.
Exception in thread "main" java.util.NoSuchElementException: next on empty iterator
I am fairly new to Spark and Scala.
The sql(...) string in your when expression is built once on the driver, so $"Species" is not substituted with each row's value; a per-row lookup like this is done with a join instead. Let's assume clean_gn5 contains the data
+-------+-----+
|Species|Value|
+-------+-----+
| NO2| 2.3|
| NOX| 1.1|
| NO| null|
| ABC| 4.0|
| DEF| null|
| NOX| null|
+-------+-----+
Joining clean_gn5 with mean_value_gn5 using a left join will result in
+-------+-----+----+
|Species|Value|Avgs|
+-------+-----+----+
| NO2| 2.3| 43|
| NOX| 1.1| 90|
| NO| null| 31|
| ABC| 4.0|null|
| DEF| null|null|
| NOX| null| 90|
+-------+-----+----+
On this dataframe you can apply, per row, the logic you have already given in your question, and the result (after dropping the Avgs column) is:
+-------+-----+---------+
|Species|Value|Value_imp|
+-------+-----+---------+
| NO2| 2.3| 2.3|
| NOX| 1.1| 1.1|
| NO| null| 31.0|
| ABC| 4.0| 4.0|
| DEF| null| null|
| NOX| null| 90.0|
+-------+-----+---------+
The code:
clean_gn5.join(mean_value_gn5, Seq("Species"), "left")
  .withColumn("Value_imp", when('value.isNull, 'Avgs).otherwise('value))
  .drop("Avgs")
  .show()

How do I filter bad or corrupted rows from a Spark data frame after casting

df1
+-------+-------+-----+
| ID | Score| hits|
+-------+-------+-----+
| 01| 100| Null|
| 02| Null| 80|
| 03| spark| 1|
| 04| 300| 1|
+-------+-------+-----+
After casting Score to int and hits to float, I get the below dataframe:
df2
+-------+-------+-----+
| ID | Score| hits|
+-------+-------+-----+
| 01| 100| Null|
| 02| Null| 80.0|
| 03| Null| 1.0|
| 04| 300| 1.0|
+-------+-------+-----+
Now I want to extract only the bad records; bad records are those where a null was produced by the casting.
I want to do the operations only on the existing dataframe. Please help me out if there is any built-in way to get the bad records after casting.
Please also consider that this is a sample dataframe. The solution should work for any number of columns and any scenario.
I tried separating the null records from both dataframes and comparing them. I have also thought of adding another column with the number of nulls and then comparing the two dataframes: if the number of nulls is greater in df2 than in df1, those are the bad ones. But I think these solutions are pretty old school.
I would like to know a better way to resolve it.
You can use a custom function/UDF to convert the string to an integer and map non-integer values to a specific number, e.g. -999999999.
Later you can filter on -999999999 to identify the originally non-integer records.
def udfInt(value):
    # map None to None, digit strings to int, everything else to a sentinel
    if value is None:
        return None
    elif value.isdigit():
        return int(value)
    else:
        return -999999999

spark.udf.register('udfInt', udfInt)

df.selectExpr("*", "udfInt(Score) AS new_Score").show()
#+---+-----+----+----------+
#| ID|Score|hits| new_Score|
#+---+-----+----+----------+
#| 01| 100|null| 100|
#| 02| null| 80| null|
#| 03|spark| 1|-999999999|
#| 04| 300| 1| 300|
#+---+-----+----+----------+
Filter on -999999999 to identify the non-integer (bad) records:
df.selectExpr("*","udfInt(Score) AS new_Score").filter("new_score == -999999999").show()
#+---+-----+----+----------+
#| ID|Score|hits| new_Score|
#+---+-----+----+----------+
#| 03|spark| 1|-999999999|
#+---+-----+----+----------+
In the same way you can have a customized UDF for the float conversion.
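If you would rather skip the UDF altogether, a generic sketch (the cast_types mapping below is an assumption for this sample) is to cast each column and flag rows where a value was non-null before the cast but null afterwards:
from functools import reduce
from pyspark.sql import functions as F

# target type per column -- adjust this mapping to your real schema (assumption)
cast_types = {"Score": "int", "hits": "float"}

# a row is "bad" if any value was non-null originally but becomes null after casting
bad_condition = reduce(
    lambda a, b: a | b,
    [F.col(c).isNotNull() & F.col(c).cast(t).isNull() for c, t in cast_types.items()],
)

df1.filter(bad_condition).show()
This works for any number of columns, since the condition is built from the column/type mapping rather than hard-coded per column.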

Spark (Scala): Insert missing rows to complete sequence in a specific column

I have some data source that comes in with certain rows missing when they are invalid.
In the following DataFrame, each user should have 3 rows, from index 0 to index 2.
val df = sc.parallelize(List(
  ("user1", 0, true),
  ("user1", 1, true),
  ("user1", 2, true),
  ("user2", 1, true),
  ("user3", 0, true),
  ("user3", 2, true)
)).toDF("user_id", "index", "is_active")
For instance:
user1 has all the necessary indexes.
user2 is missing index 0 and index 2.
user3 is missing index 1.
Like this:
+-------+-----+---------+
|user_id|index|is_active|
+-------+-----+---------+
| user1| 0| true|
| user1| 1| true|
| user1| 2| true|
| user2| 1| true|
| user3| 0| true|
| user3| 2| true|
+-------+-----+---------+
I'd like to insert the default rows and make them into the following DataFrame.
+-------+-----+---------+
|user_id|index|is_active|
+-------+-----+---------+
| user1| 0| true|
| user1| 1| true|
| user1| 2| true|
| user2| 0| false|
| user2| 1| true|
| user2| 2| false|
| user3| 0| true|
| user3| 1| false|
| user3| 2| true|
+-------+-----+---------+
I've seen a separate, similar question where the answer was to pivot the table so that each user has 3 columns. But first, that relies on indices 0 to 2 existing for at least some users; in my real case the index spans a very large range, so I cannot guarantee that after the pivot all columns would be complete. It also seems a pretty expensive operation to pivot and then un-pivot to get the second DataFrame.
I also tried to create a new DataFrame like this:
val indexDF = df.select("user_id").distinct.join(sc.parallelize(Seq.range(0, 3)).toDF("index"))
val result = indexDF.join(df, Seq("user_id", "index"), "left").na.fill(0)
This actually works, but when I ran it with real data (millions of records and hundreds of values in the index) it took very long, so I suspect it could be done more efficiently.
Thanks in advance!