Join two data frames as input for Machine Learning with Spark - pyspark

I have two DataFrames in Apache Spark, each with a column named "JoinValue". JoinValue is numeric and has the same meaning in both DataFrames.
I need the combination of both DataFrames as input (training and test set) for a machine learning algorithm. Is it correct that I first need to combine both DataFrames into a single DataFrame before using it in an ML Pipeline?
Example:
df1.show()
+---------+---------+
| a|JoinValue|
+---------+---------+
|A value 0| 0|
|A value 1| 5|
|A value 2| 10|
|A value 3| 15|
|A value 4| 20|
|A value 5| 25|
|A value 6| 30|
+---------+---------+
and
df2.show()
+---------+---------+
| b|JoinValue|
+---------+---------+
|B value 0| 0|
|B value 1| 7|
|B value 2| 14|
|B value 3| 21|
|B value 4| 28|
+---------+---------+
An outer join followed by an orderBy yields the following results:
df1.join(df2, 'JoinValue', 'outer').orderBy('JoinValue').show()
+---------+---------+---------+
|JoinValue| a| b|
+---------+---------+---------+
| 0|A value 0|B value 0|
| 5|A value 1| null|
| 7| null|B value 1|
| 10|A value 2| null|
| 14| null|B value 2|
| 15|A value 3| null|
| 20|A value 4| null|
| 21| null|B value 3|
| 25|A value 5| null|
| 28| null|B value 4|
| 30|A value 6| null|
+---------+---------+---------+
What I actually want is this, without nulls:
+---------+---------+---------+
|JoinValue| a| b|
+---------+---------+---------+
| 0|A value 0|B value 0|
| 5|A value 1|B value 0|
| 7|A value 1|B value 1|
| 10|A value 2|B value 1|
| 14|A value 2|B value 2|
| 15|A value 3|B value 2|
| 20|A value 4|B value 2|
| 21|A value 4|B value 3|
| 25|A value 5|B value 3|
| 28|A value 5|B value 4|
| 30|A value 6|B value 4|
+---------+---------+---------+
What is the best way to use JoinValue, a and b, coming from multiple DataFrames, as features and labels in a machine learning algorithm?
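One possible way to produce the null-free table above (a minimal sketch, not a verified answer; it assumes the goal is to forward-fill each column with the most recent non-null value when ordered by JoinValue) is an outer join followed by last(..., ignorenulls=True) over a running window:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# running window from the first row up to the current row, ordered by JoinValue
w = Window.orderBy('JoinValue').rowsBetween(Window.unboundedPreceding, Window.currentRow)

combined = (df1.join(df2, 'JoinValue', 'outer')
               # forward-fill a and b with the most recent non-null value
               .withColumn('a', F.last('a', ignorenulls=True).over(w))
               .withColumn('b', F.last('b', ignorenulls=True).over(w))
               .orderBy('JoinValue'))
combined.show()
Once the nulls are filled, the combined DataFrame can be used as a single input to an ML Pipeline, for example by encoding the string columns and assembling the numeric ones into a features vector with pyspark.ml.feature.VectorAssembler.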

Related

How to count change in row values in pyspark

Logic to count the change in the row values of a given column
Input
df22 = spark.createDataFrame(
[(1, 1.0), (1,22.0), (1,22.0), (1,21.0), (1,20.0), (2, 3.0), (2,3.0),
(2, 5.0), (2, 10.0), (2,3.0), (3,11.0), (4, 11.0), (4,15.0), (1,22.0)],
("id", "v"))
+---+----+
| id| v|
+---+----+
| 1| 1.0|
| 1|22.0|
| 1|22.0|
| 1|21.0|
| 1|20.0|
| 2| 3.0|
| 2| 3.0|
| 2| 5.0|
| 2|10.0|
| 2| 3.0|
| 3|11.0|
| 4|11.0|
| 4|15.0|
+---+----+
Expected output
+---+----+---+
| id| v| c|
+---+----+---+
| 1| 1.0| 0|
| 1|22.0| 1|
| 1|22.0| 1|
| 1|21.0| 2|
| 1|20.0| 3|
| 2| 3.0| 0|
| 2| 3.0| 0|
| 2| 5.0| 1|
| 2|10.0| 2|
| 2| 3.0| 3|
| 3|11.0| 0|
| 4|11.0| 0|
| 4|15.0| 1|
+---+----+---+
Any help on this will be greatly appreciated
Thanks in advance
Ramabadran
Before adding an answer, I would like to ask: what have you tried? Please try something on your end first and then ask for support on this platform. Also, your question is not clear: you have not said whether you are looking for a delta-capture count per 'id' or across the whole DataFrame. Just giving an expected output does not make the question clear.
Now, to your question: if I understood the sample input and output correctly, you need a delta-capture count per 'id'. One way to achieve it is as below.
# Capture the incremented count using lag() and sum() over the window below
import pyspark.sql.functions as F
from pyspark.sql.window import Window

winSpec = Window.partitionBy('id').orderBy('v')  # window for capturing the incremented count

(df22
    .withColumn('prev', F.coalesce(F.lag('v').over(winSpec), F.col('v')))
    .withColumn('c', F.sum(F.expr("case when v - prev <> 0 then 1 else 0 end")).over(winSpec))
    .drop('prev')
    .orderBy('id', 'v')
    .show())
+---+----+---+
| id| v| c|
+---+----+---+
| 1| 1.0| 0|
| 1|20.0| 1|
| 1|21.0| 2|
| 1|22.0| 3|
| 1|22.0| 3|
| 1|22.0| 3|
| 2| 3.0| 0|
| 2| 3.0| 0|
| 2| 3.0| 0|
| 2| 5.0| 1|
| 2|10.0| 2|
| 3|11.0| 0|
| 4|11.0| 0|
| 4|15.0| 1|
+---+----+---+
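Note that the window above orders each 'id' group by 'v', which is why the counts differ from the expected output in the question (which counts changes in the original row order). A variant of the same idea that preserves the input order is sketched below; it is an assumption on my part (not part of the original answer) and relies on monotonically_increasing_id() reflecting the input order:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# hypothetical variant: count changes in the original row order instead of ordering by 'v'
df_idx = df22.withColumn('idx', F.monotonically_increasing_id())
winSpecOrdered = Window.partitionBy('id').orderBy('idx')

(df_idx
    .withColumn('prev', F.coalesce(F.lag('v').over(winSpecOrdered), F.col('v')))
    .withColumn('c', F.sum(F.when(F.col('v') != F.col('prev'), 1).otherwise(0)).over(winSpecOrdered))
    .orderBy('idx')
    .drop('prev', 'idx')
    .show())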

Find next different value from lag in pyspark

I have a pyspark dataframe like this,
+-----+----------+
|value|val_joined|
+-----+----------+
| 3| 3|
| 4| 3+4|
| 5| 3+4+5|
| 5| 3+4+5|
| 5| 3+4+5|
| 2| 3+4+5+2|
+-----+----------+
From this, I have to create another column that goes like this,
+-----+----------+------+
|value|val_joined|result|
+-----+----------+------+
| 3| 3| 4.0|
| 4| 3+4| 5.0|
| 5| 3+4+5| 2.0|
| 5| 3+4+5| 2.0|
| 5| 3+4+5| 2.0|
| 2| 3+4+5+2| NaN|
+-----+----------+------+
The result column is to be made like this: for an item in the value column, find the next different item coming in order. So for value 3 it will be 4, and for value 4 it will be 5.
But when there are duplicates, like the value 5 that repeats 3 times, a simple lag/lead won't work, as the next row for the first 5 is again 5. I basically want to keep taking the next value until value != next value or the next value is null.
How can I do this in pyspark without udf and joins?
We can use two windows: assign a monotonically_increasing_id to preserve the row order, take the lead of value over a window ordered by that id, and then take the last of that lead over a second window partitioned by value, like below:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

w = Window.orderBy('idx')
w1 = Window.partitionBy('value')

(df.withColumn('idx', F.monotonically_increasing_id())
   .withColumn("result", F.last(F.lead("value").over(w)).over(w1))
   .orderBy('idx')
   .drop('idx')).show()
+-----+----------+------+
|value|val_joined|result|
+-----+----------+------+
| 3| 3| 4|
| 4| 3+4| 5|
| 5| 3+4+5| 2|
| 5| 3+4+5| 2|
| 5| 3+4+5| 2|
| 2| 3+4+5+2| null|
+-----+----------+------+
If numbers in value can repeat later, as in the example below:
+-----+----------+
|value|val_joined|
+-----+----------+
|3 |3 |
|4 |3+4 |
|5 |3+4+5 |
|5 |3+4+5 |
|5 |3+4+5 |
|2 |3+4+5+2 |
|5 |3+4+5+2+5 | <- this value is repeated later
+-----+----------+
Then we will have to create a separate group column and partition the second window by it:
w = Window.orderBy('idx')
w1 = Window.partitionBy('group')

(df.withColumn('idx', F.monotonically_increasing_id())
   .withColumn("lag", F.when(F.lag("value").over(w) != F.col("value"), F.lit(1))
                       .otherwise(F.lit(0)))
   .withColumn("group", F.sum("lag").over(w) + 1).drop("lag")
   .withColumn("result", F.last(F.lead("value").over(w)).over(w1))
   .orderBy('idx')
   .drop('idx', "group")).show()
+-----+----------+------+
|value|val_joined|result|
+-----+----------+------+
| 3| 3| 4|
| 4| 3+4| 5|
| 5| 3+4+5| 2|
| 5| 3+4+5| 2|
| 5| 3+4+5| 2|
| 2| 3+4+5+2| 5|
| 5| 3+4+5+2+5| null|
+-----+----------+------+

How to aggregate contiguous rows in pyspark

I have an immense amount of user data (billions of rows) where I need to summarize the amount of time spent in a specific state by each user.
Let's say it's historical web data, and I want to sum the amount of time each user has spent on the site. The data only says if the user is present.
df = spark.createDataFrame([("A", 1), ("A", 2), ("A", 3),("B", 4 ),("B", 5 ),("A", 6 ),("A", 7 ),("A", 8 )], ["user","timestamp"])
+----+---------+
|user|timestamp|
+----+---------+
| A| 1|
| A| 2|
| A| 3|
| B| 4|
| B| 5|
| A| 6|
| A| 7|
| A| 8|
+----+---------+
The correct answer would be this since I'm summing the total per contiguous segment.
+----+---------+
|user| ttl |
+----+---------+
| A| 4|
| B| 1|
+----+---------+
I tried doing a max()-min() and groupby but that resulted in segment A being 8-1 and gave the wrong answer.
In sqlite I was able to get the answer by creating a partition number and then finding the difference and summing. I created the partition with this...
SELECT
COUNT(*) FILTER (WHERE a.user <>
( SELECT b.user
FROM foobar AS b
WHERE a.timestamp > b.timestamp
ORDER BY b.timestamp DESC
LIMIT 1
))
OVER (ORDER BY timestamp) c,
user,
timestamp
FROM foobar a;
which gave me...
+----+---------+---+
|user|timestamp| c |
+----+---------+---+
| A| 1| 1 |
| A| 2| 1 |
| A| 3| 1 |
| B| 4| 2 |
| B| 5| 2 |
| A| 6| 3 |
| A| 7| 3 |
| A| 8| 3 |
+----+---------+---+
Then the LAST() - FIRST() functions in sql made that easy to finish.
Any ideas on how to scale this and do it in pyspark? I can't seem to find adequate substitutes for the "count(*) where(...)" sqlite offered
We can do this:
Create the DataFrame
from pyspark.sql.window import Window
from pyspark.sql.functions import max, min
from pyspark.sql import functions as F
df = spark.createDataFrame([("A", 1), ("A", 2), ("A", 3),("B", 4 ),("B", 5 ),("A", 6 ),("A", 7 ),("A", 8 )], ["user","timestamp"])
df.show()
+----+---------+
|user|timestamp|
+----+---------+
| A| 1|
| A| 2|
| A| 3|
| B| 4|
| B| 5|
| A| 6|
| A| 7|
| A| 8|
+----+---------+
Assign a row_number to each row, ordered by timestamp. The dummy column is used so that we can apply the row_number window function across the whole DataFrame.
df = df.withColumn('dummy', F.lit(1))
w1 = Window.partitionBy('dummy').orderBy('timestamp')
df = df.withColumn('row_number', F.row_number().over(w1))
df.show()
+----+---------+-----+----------+
|user|timestamp|dummy|row_number|
+----+---------+-----+----------+
| A| 1| 1| 1|
| A| 2| 1| 2|
| A| 3| 1| 3|
| B| 4| 1| 4|
| B| 5| 1| 5|
| A| 6| 1| 6|
| A| 7| 1| 7|
| A| 8| 1| 8|
+----+---------+-----+----------+
We want to create a sub-group within each user group here.
(1) For each user group, compute the difference between the current row's row_number and the previous row's row_number. Any difference larger than 1 indicates a new contiguous group. This yields the column diff; note that the first row in each group gets the value -1.
(2) We then assign null to every row with diff == 1. This yields the column diff2.
(3) Next, we use the last function to fill the rows where diff2 is null with the last non-null value of diff2. This yields subgroupid.
This is the sub-group we want to create for each user group.
w2 = Window.partitionBy('user').orderBy('timestamp')
df = df.withColumn('diff', df['row_number'] - F.lag('row_number').over(w2)).fillna(-1)
df = df.withColumn('diff2', F.when(df['diff']==1, None).otherwise(F.abs(df['diff'])))
df = df.withColumn('subgroupid', F.last(F.col('diff2'), True).over(w2))
df.show()
+----+---------+-----+----------+----+-----+----------+
|user|timestamp|dummy|row_number|diff|diff2|subgroupid|
+----+---------+-----+----------+----+-----+----------+
| B| 4| 1| 4| -1| 1| 1|
| B| 5| 1| 5| 1| null| 1|
| A| 1| 1| 1| -1| 1| 1|
| A| 2| 1| 2| 1| null| 1|
| A| 3| 1| 3| 1| null| 1|
| A| 6| 1| 6| 3| 3| 3|
| A| 7| 1| 7| 1| null| 3|
| A| 8| 1| 8| 1| null| 3|
+----+---------+-----+----------+----+-----+----------+
We now group by both user and subgroupid to compute the time each user spent on each contiguous time interval.
Lastly, we group by user only to sum up the total time spent by each user.
s = "(max('timestamp') - min('timestamp'))"
df = df.groupBy(['user', 'subgroupid']).agg(eval(s))
s = s.replace("'","")
df = df.groupBy('user').sum(s).select('user', F.col("sum(" + s + ")").alias('total_time'))
df.show()
+----+----------+
|user|total_time|
+----+----------+
| B| 1|
| A| 4|
+----+----------+
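A more compact alternative (a sketch of my own, not part of the answer above; it assumes the original two-column user/timestamp DataFrame): flag the rows where the user changes relative to the previous row and turn the flags into a running segment id with a cumulative sum, which is essentially the PySpark analogue of the COUNT(*) FILTER trick from the question.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.orderBy('timestamp')

(df
    # 1 whenever the user differs from the previous row, else 0
    .withColumn('changed', F.when(F.lag('user').over(w) != F.col('user'), 1).otherwise(0))
    # cumulative sum of the flags gives a contiguous-segment id
    .withColumn('segment', F.sum('changed').over(w))
    # time spent in each contiguous segment
    .groupBy('user', 'segment')
    .agg((F.max('timestamp') - F.min('timestamp')).alias('duration'))
    # total time per user
    .groupBy('user')
    .agg(F.sum('duration').alias('total_time'))
    .show())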

PySpark : Dataframe : Numeric + Null column values resulting in NULL instead of numeric value

I am facing a problem with a PySpark DataFrame loaded from a CSV file, where my numeric columns have empty values, like below:
+-------------+------------+-----------+-----------+
| Player_Name|Test_Matches|ODI_Matches|T20_Matches|
+-------------+------------+-----------+-----------+
| Aaron, V R| 9| 9| |
| Abid Ali, S| 29| 5| |
|Adhikari, H R| 21| | |
| Agarkar, A B| 26| 191| 4|
+-------------+------------+-----------+-----------+
When I cast those columns to integer, all the empty values become null:
df_data_csv_casted = df_data_csv.select(
    df_data_csv['Country'], df_data_csv['Player_Name'],
    df_data_csv['Test_Matches'].cast(IntegerType()).alias("Test_Matches"),
    df_data_csv['ODI_Matches'].cast(IntegerType()).alias("ODI_Matches"),
    df_data_csv['T20_Matches'].cast(IntegerType()).alias("T20_Matches"))
+-------------+------------+-----------+-----------+
| Player_Name|Test_Matches|ODI_Matches|T20_Matches|
+-------------+------------+-----------+-----------+
| Aaron, V R| 9| 9| null|
| Abid Ali, S| 29| 5| null|
|Adhikari, H R| 21| null| null|
| Agarkar, A B| 26| 191| 4|
+-------------+------------+-----------+-----------+
Then I compute a total, but if one of the values is null, the result also comes out as null. How can I solve this?
df_data_csv_withTotalCol = df_data_csv_casted.withColumn(
    'Total_Matches',
    df_data_csv_casted['Test_Matches'] + df_data_csv_casted['ODI_Matches']
    + df_data_csv_casted['T20_Matches'])
+-------------+------------+-----------+-----------+-------------+
|Player_Name |Test_Matches|ODI_Matches|T20_Matches|Total_Matches|
+-------------+------------+-----------+-----------+-------------+
| Aaron, V R | 9| 9| null| null|
|Abid Ali, S | 29| 5| null| null|
|Adhikari, H R| 21| null| null| null|
|Agarkar, A B | 26| 191| 4| 221|
+-------------+------------+-----------+-----------+-------------+
You can fix this by using the coalesce function. For example, let's create some sample data:
from pyspark.sql.functions import coalesce,lit
cDf = spark.createDataFrame([(None, None), (1, None), (None, 2)], ("a", "b"))
cDf.show()
+----+----+
| a| b|
+----+----+
|null|null|
| 1|null|
|null| 2|
+----+----+
When I do a simple sum as you did:
cDf.withColumn('Total',cDf.a+cDf.b).show()
I get the total as null, the same as you described:
+----+----+-----+
| a| b|Total|
+----+----+-----+
|null|null| null|
| 1|null| null|
|null| 2| null|
+----+----+-----+
To fix it, use coalesce along with the lit function, which replaces null values with zeroes:
cDf.withColumn('Total',coalesce(cDf.a,lit(0)) +coalesce(cDf.b,lit(0))).show()
This gives me the correct results:
+----+----+-----+
| a| b|Total|
+----+----+-----+
|null|null| 0|
| 1|null| 1|
|null| 2| 2|
+----+----+-----+
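An alternative worth noting (my own sketch, not part of the answer above): DataFrame.fillna can replace the nulls up front, after which a plain sum works. Using the casted DataFrame and column names from the question:
# fill the null match counts with 0, then a straight sum gives the intended total
filled = df_data_csv_casted.fillna(0, subset=['Test_Matches', 'ODI_Matches', 'T20_Matches'])
filled.withColumn(
    'Total_Matches',
    filled['Test_Matches'] + filled['ODI_Matches'] + filled['T20_Matches']).show()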

spark sql conditional maximum

I have a tall table which contains up to 10 values per group. How can I transform this table into a wide format, i.e. add two columns that hold the value smaller than or equal to a threshold?
I want to find the maximum per group, but it needs to be smaller than a specified value, like:
min(max('value1), lit(5)).over(Window.partitionBy('grouping))
However, min() will only work on a column and not on the Scala value which is returned from the inner function.
The problem can be described as:
Seq(Seq(1,2,3,4).max,5).min
Where Seq(1,2,3,4) is returned by the window.
How can I formulate this in spark sql?
edit
E.g.
+--------+-----+---------+
|grouping|value|something|
+--------+-----+---------+
| 1| 1| first|
| 1| 2| second|
| 1| 3| third|
| 1| 4| fourth|
| 1| 7| 7|
| 1| 10| 10|
| 21| 1| first|
| 21| 2| second|
| 21| 3| third|
+--------+-----+---------+
created by
case class MyThing(grouping: Int, value:Int, something:String)
val df = Seq(MyThing(1,1, "first"), MyThing(1,2, "second"), MyThing(1,3, "third"),MyThing(1,4, "fourth"),MyThing(1,7, "7"), MyThing(1,10, "10"),
MyThing(21,1, "first"), MyThing(21,2, "second"), MyThing(21,3, "third")).toDS
Where
df
.withColumn("somethingAtLeast5AndMaximum5", max('value).over(Window.partitionBy('grouping)))
.withColumn("somethingAtLeast6OupToThereshold2", max('value).over(Window.partitionBy('grouping)))
.show
returns
+--------+-----+---------+----------------------------+-------------------------+
|grouping|value|something|somethingAtLeast5AndMaximum5| somethingAtLeast6OupToThereshold2 |
+--------+-----+---------+----------------------------+-------------------------+
| 1| 1| first| 10| 10|
| 1| 2| second| 10| 10|
| 1| 3| third| 10| 10|
| 1| 4| fourth| 10| 10|
| 1| 7| 7| 10| 10|
| 1| 10| 10| 10| 10|
| 21| 1| first| 3| 3|
| 21| 2| second| 3| 3|
| 21| 3| third| 3| 3|
+--------+-----+---------+----------------------------+-------------------------+
Instead, I would rather formulate:
lit(Seq(max('value).asInstanceOf[java.lang.Integer], new java.lang.Integer(2)).min).over(Window.partitionBy('grouping))
But that does not work as max('value) is not a scalar value.
Expected output should look like
+--------+-----+---------+----------------------------+-------------------------+
|grouping|value|something|somethingAtLeast5AndMaximum5|somethingAtLeast6OupToThereshold2|
+--------+-----+---------+----------------------------+-------------------------+
| 1| 4| fourth| 4| 7|
| 21| 1| first| 3| NULL|
+--------+-----+---------+----------------------------+-------------------------+
edit2
When trying a pivot
df.groupBy("grouping").pivot("value").agg(first('something)).show
+--------+-----+------+-----+------+----+----+
|grouping| 1| 2| 3| 4| 7| 10|
+--------+-----+------+-----+------+----+----+
| 1|first|second|third|fourth| 7| 10|
| 21|first|second|third| null|null|null|
+--------+-----+------+-----+------+----+----+
The second part of the problem remains that some columns might not exist or be null.
When aggregating to arrays:
df.groupBy("grouping").agg(collect_list('value).alias("value"), collect_list('something).alias("something"))
+--------+-------------------+--------------------+
|grouping| value| something|
+--------+-------------------+--------------------+
| 1|[1, 2, 3, 4, 7, 10]|[first, second, t...|
| 21| [1, 2, 3]|[first, second, t...|
+--------+-------------------+--------------------+
The values are already next to each other, but the right values need to be selected. This is probably still more efficient than a join or window function.
It would be easier to do this in two separate steps: calculate the max over a Window, and then use when...otherwise on the result to produce min(x, 5):
df.withColumn("tmp", max('value1).over(Window.partitionBy('grouping)))
.withColumn("result", when('tmp > lit(5), 5).otherwise('tmp))
EDIT: some example data to clarify this:
val df = Seq((1, 1),(1, 2),(1, 3),(1, 4),(2, 7),(2, 8))
.toDF("grouping", "value1")
df.withColumn("result", max('value1).over(Window.partitionBy('grouping)))
.withColumn("result", when('result > lit(5), 5).otherwise('result))
.show()
// +--------+------+------+
// |grouping|value1|result|
// +--------+------+------+
// | 1| 1| 4| // 4, because Seq(Seq(1,2,3,4).max,5).min = 4
// | 1| 2| 4|
// | 1| 3| 4|
// | 1| 4| 4|
// | 2| 7| 5| // 5, because Seq(Seq(7,8).max,5).min = 5
// | 2| 8| 5|
// +--------+------+------+
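For completeness, a possible PySpark equivalent of the same min(max, threshold) pattern (my own sketch, assuming Spark's least function, which compares column expressions rather than Scala scalars):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('grouping')
# least() caps the window maximum at 5 directly, without a separate when/otherwise step
df.withColumn('result', F.least(F.max('value1').over(w), F.lit(5))).show()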