Scala Spark - Map function referencing another dataframe - scala

I have two dataframes:
df1:
+---+------+----+
| id|weight|time|
+---+------+----+
| A| 0.1| 1|
| A| 0.2| 2|
| A| 0.3| 4|
| A| 0.4| 5|
| B| 0.5| 1|
| B| 0.7| 3|
| B| 0.8| 6|
| B| 0.9| 7|
| B| 1.0| 8|
+---+------+----+
df2:
+---+---+-------+-----+
| id| t|t_start|t_end|
+---+---+-------+-----+
| A| t1| 0| 3|
| A| t2| 4| 6|
| A| t3| 7| 9|
| B| t1| 0| 2|
| B| t2| 3| 6|
| B| t3| 7| 9|
+---+---+-------+-----+
My desired output should identify the 't' for each timestamp in df1, where the ranges for each 't' are defined in df2.
df_output:
+---+------+----+---+
| id|weight|time| t |
+---+------+----+---+
| A| 0.1| 1| t1|
| A| 0.2| 2| t1|
| A| 0.3| 4| t2|
| A| 0.4| 5| t2|
| B| 0.5| 1| t1|
| B| 0.7| 3| t2|
| B| 0.8| 6| t2|
| B| 0.9| 7| t3|
| B| 1.0| 8| t3|
+---+------+----+---+
My understanding so far is that I must create a UDF that takes the columns 'id' and 'time' as inputs, map over each row by referring to df2.filter(df2.id == df1.id, df1.time >= df2.t_start, df1.time <= df2.t_end), and get the corresponding df2.t.
I'm very new to Scala and Spark, so I am wondering if this solution is even possible?

You cannot use a UDF for that, but all you have to do is reuse the filter condition you already defined to join the two frames:
df1.join(
df2,
df2("id") === df1("id") && df1("time").between(df2("t_start"), df2("t_end"))
)
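For reference, here is a minimal end-to-end sketch of this approach; the SparkSession setup, the app name, and the column types are assumptions, since they are not shown in the question:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("range-join").master("local[*]").getOrCreate()
import spark.implicits._

// Sample data from the question (weight assumed Double, time bounds assumed Int)
val df1 = Seq(
  ("A", 0.1, 1), ("A", 0.2, 2), ("A", 0.3, 4), ("A", 0.4, 5),
  ("B", 0.5, 1), ("B", 0.7, 3), ("B", 0.8, 6), ("B", 0.9, 7), ("B", 1.0, 8)
).toDF("id", "weight", "time")

val df2 = Seq(
  ("A", "t1", 0, 3), ("A", "t2", 4, 6), ("A", "t3", 7, 9),
  ("B", "t1", 0, 2), ("B", "t2", 3, 6), ("B", "t3", 7, 9)
).toDF("id", "t", "t_start", "t_end")

// Join on matching id with time falling inside [t_start, t_end],
// then keep only the columns of the desired output
val dfOutput = df1.join(
  df2,
  df2("id") === df1("id") && df1("time").between(df2("t_start"), df2("t_end"))
).select(df1("id"), df1("weight"), df1("time"), df2("t"))

dfOutput.show()
Because this is an inner join by default, any df1 row whose time falls into no df2 range is dropped; use a "left" join type instead if such rows should be kept with a null t.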

Related

create a new column to increment value when value resets to 1 in another column in pyspark

In a PySpark DataFrame, consider a column like [1,2,3,4,1,2,1,1,2,3,1,2,1,1,2].
I want to create a new column that increments every time the value resets to 1.
The expected output is [1,1,1,1,2,2,3,4,4,4,5,5,6,7,7].
I am a bit new to PySpark, so any help would be greatly appreciated.
I have written the logic like below:
def sequence(row_num):
    results = [1, ]
    flag = 1
    for col in range(0, len(row_num) - 1):
        if row_num[col][0] >= row_num[col + 1][0]:
            flag += 1
        results.append(flag)
    return results
but I am not able to pass a column through a UDF. Please help me with this.
Your Dataframe:
df = spark.createDataFrame(
[
('1','a'),
('2','b'),
('3','c'),
('4','d'),
('1','e'),
('2','f'),
('1','g'),
('1','h'),
('2','i'),
('3','j'),
('1','k'),
('2','l'),
('1','m'),
('1','n'),
('2','o')
], ['group','label']
)
+-----+-----+
|group|label|
+-----+-----+
| 1| a|
| 2| b|
| 3| c|
| 4| d|
| 1| e|
| 2| f|
| 1| g|
| 1| h|
| 2| i|
| 3| j|
| 1| k|
| 2| l|
| 1| m|
| 1| n|
| 2| o|
+-----+-----+
You can create a flag and use a window function to calculate the cumulative sum. There is no need to use a UDF:
from pyspark.sql import Window as W
from pyspark.sql import functions as F
w = W.partitionBy().orderBy('label').rowsBetween(W.unboundedPreceding, 0)
df\
.withColumn('Flag', F.when(F.col('group') == 1, 1).otherwise(0))\
.withColumn('Output', F.sum('Flag').over(w))\
.show()
+-----+-----+----+------+
|group|label|Flag|Output|
+-----+-----+----+------+
| 1| a| 1| 1|
| 2| b| 0| 1|
| 3| c| 0| 1|
| 4| d| 0| 1|
| 1| e| 1| 2|
| 2| f| 0| 2|
| 1| g| 1| 3|
| 1| h| 1| 4|
| 2| i| 0| 4|
| 3| j| 0| 4|
| 1| k| 1| 5|
| 2| l| 0| 5|
| 1| m| 1| 6|
| 1| n| 1| 7|
| 2| o| 0| 7|
+-----+-----+----+------+
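For comparison, the same flag-plus-running-sum idea written as a Scala sketch, assuming a dataframe df with the same group and label columns:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum, when}

// Flag rows where the group value resets to 1, then take a running sum of the flag
val w = Window.orderBy("label").rowsBetween(Window.unboundedPreceding, Window.currentRow)

df.withColumn("Flag", when(col("group") === 1, 1).otherwise(0))
  .withColumn("Output", sum("Flag").over(w))
  .show()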

Perform merge/insert on two spark dataframes with different schemas?

I have Spark dataframes DF and DF1, each with a different schema.
DF:-
val DF = Seq(("1","acv","34","a","1"),("2","fbg","56","b","3"),("3","rty","78","c","5")).toDF("id","name","age","DBName","test")
+---+----+---+------+----+
| id|name|age|DBName|test|
+---+----+---+------+----+
| 1| acv| 34| a| 1|
| 2| fbg| 56| b| 3|
| 3| rty| 78| c| 5|
+---+----+---+------+----+
DF1:-
val DF1 = Seq(("1","gbj","67","a","5"),("2","gbj","67","a","7"),("2","jku","88","b","8"),("4","jku","88","b","7"),("5","uuu","12","c","9")).toDF("id","name","age","DBName","col1")
+---+----+---+------+----+
| id|name|age|DBName|col1|
+---+----+---+------+----+
| 1| gbj| 67| a| 5|
| 2| gbj| 67| a| 7|
| 2| jku| 88| b| 8|
| 4| jku| 88| b| 7|
| 5| uuu| 12| c| 9|
+---+----+---+------+----+
I want to merge DF1 with DF based on the values of id and DBName. If the id and DBName combination already exists in DF then that record should be updated, and if it doesn't exist then a new record should be added. The resulting data frame should look like this:
+---+----+---+------+----+----+
| id|name|age|DBName|Test|col |
+---+----+---+------+----+----+
| 5| uuu| 12| c|NULL|9 |
| 2| jku| 88| b|NULL|8 |
| 4| jku| 88| b|NULL|7 |
| 1| gbj| 67| a|NULL|5 |
| 3| rty| 78| c|5 |NULL|
| 2| gbj| 67| a|NULL|7 |
+---+----+---+------+----+----+
I have tried so far
val updatedDF = DF.as("a").join(DF1.as("b"), $"a.id" === $"b.id" && $"a.DBName" === $"b.DBName", "outer").select(DF.columns.map(c => coalesce($"b.$c", $"b.$c") as c): _*)
Error:-
org.apache.spark.sql.AnalysisException: cannot resolve '`b.test`' given input columns: [b.DBName, a.DBName, a.name, b.age, a.id, a.age, b.id, a.test, b.name];;
You're selecting non-existent columns, and also there is a typo in the coalesce. You can follow the example below to fix your issue:
val updatedDF = DF.as("a").join(
DF1.as("b"),
$"a.id" === $"b.id" && $"a.DBName" === $"b.DBName",
"outer"
).select(
DF.columns.dropRight(1).map(c => coalesce($"b.$c", $"a.$c") as c)
:+ col(DF.columns.last)
:+ col(DF1.columns.last)
:_*
)
updatedDF.show
+---+----+---+------+----+----+
| id|name|age|DBName|test|col1|
+---+----+---+------+----+----+
| 5| uuu| 12| c|null| 9|
| 2| jku| 88| b| 3| 8|
| 4| jku| 88| b|null| 7|
| 1| gbj| 67| a| 1| 5|
| 3| rty| 78| c| 5|null|
| 2| gbj| 67| a|null| 7|
+---+----+---+------+----+----+
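Note that dropRight(1) only works here because each frame has exactly one extra column and it is the last one. If the schemas diverge by more than that, a more general sketch (same join keys assumed; the val name merged is arbitrary) is to derive the shared and side-specific columns from the schemas:
import org.apache.spark.sql.functions.{coalesce, col}

val common    = DF.columns.intersect(DF1.columns)   // shared columns: prefer DF1's ("b") value
val onlyInDF  = DF.columns.diff(DF1.columns)        // here just "test"
val onlyInDF1 = DF1.columns.diff(DF.columns)        // here just "col1"

val merged = DF.as("a").join(
  DF1.as("b"),
  $"a.id" === $"b.id" && $"a.DBName" === $"b.DBName",
  "outer"
).select(
  common.map(c => coalesce($"b.$c", $"a.$c") as c)
    ++ onlyInDF.map(c => col(s"a.$c"))
    ++ onlyInDF1.map(c => col(s"b.$c")): _*
)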

How to take row_number() based on a condition in spark with scala

I have the below data frame -
+----+-----+---+
| val|count| id|
+----+-----+---+
| a| 10| m1|
| b| 20| m1|
|null| 30| m1|
| b| 30| m2|
| c| 40| m2|
|null| 50| m2|
+----+-----+---+
created by -
val df1=Seq(
("a","10","m1"),
("b","20","m1"),
(null,"30","m1"),
("b","30","m2"),
("c","40","m2"),
(null,"50","m2")
).toDF("val","count","id")
I am trying to create a rank with the help of row_number() and a window function as below.
df1.withColumn("rannk_num", row_number() over Window.partitionBy("id").orderBy("count")).show
+----+-----+---+---------+
| val|count| id|rannk_num|
+----+-----+---+---------+
| a| 10| m1| 1|
| b| 20| m1| 2|
|null| 30| m1| 3|
| b| 30| m2| 1|
| c| 40| m2| 2|
|null| 50| m2| 3|
+----+-----+---+---------+
But I need the records with null values in column 'val' to be excluded from the ranking.
Expected output --
+----+-----+---+---------+
| val|count| id|rannk_num|
+----+-----+---+---------+
| a| 10| m1| 1|
| b| 20| m1| 2|
|null| 30| m1| NULL|
| b| 30| m2| 1|
| c| 40| m2| 2|
|null| 50| m2| NULL|
+----+-----+---+---------+
I'm wondering if this is possible with minimal changes. Also, there can be any number of distinct values in the columns val and count.
Filter those rows with null val, assign them a null row number, and union back to the original dataframe.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}

val df1 = Seq(
("a","10","m1"),
("b","20","m1"),
(null,"30","m1"),
("b","30","m2"),
("c","40","m2"),
(null,"50","m2")
).toDF("val","count","id")
df1.filter("val is not null").withColumn(
"rannk_num", row_number() over Window.partitionBy("id").orderBy("count")
).union(
df1.filter("val is null").withColumn("rannk_num", lit(null))
).show
+----+-----+---+---------+
| val|count| id|rannk_num|
+----+-----+---+---------+
| a| 10| m1| 1|
| b| 20| m1| 2|
| b| 30| m2| 1|
| c| 40| m2| 2|
|null| 30| m1| null|
|null| 50| m2| null|
+----+-----+---+---------+
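An alternative sketch (same column names assumed) that avoids the union is to partition the window by whether val is null, so the non-null rows are numbered among themselves, and then blank out the number for the null rows:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, when}

// Null and non-null 'val' rows fall into separate window partitions per id
val w = Window.partitionBy(col("id"), col("val").isNull).orderBy("count")

df1.withColumn(
  "rannk_num",
  when(col("val").isNotNull, row_number().over(w))  // stays null when 'val' is null
).show
This keeps everything in a single pass and still numbers the non-null rows of each id as 1..n.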

How to combine dataframes with no common columns?

I have 2 data frames
val df1 = Seq(("1","2","3"),("4","5","6")).toDF("A","B","C")
df1.show
+---+---+---+
| A| B| C|
+---+---+---+
| 1| 2| 3|
| 4| 5| 6|
+---+---+---+
and
val df2 = Seq(("11","22","33"),("44","55","66")).toDF("D","E","F")
df2.show
+---+---+---+
| D| E| F|
+---+---+---+
| 11| 22| 33|
| 44| 55| 66|
+---+---+---+
I need to combine the ones above to get
val df3 = Seq(("1","2","3","","",""),("4","5","6","","",""),("","","","11","22","33"),("","","","44","55","66"))
.toDF("A","B","C","D","E","F")
df3.show
+---+---+---+---+---+---+
| A| B| C| D| E| F|
+---+---+---+---+---+---+
| 1| 2| 3| | | |
| 4| 5| 6| | | |
| | | | 11| 22| 33|
| | | | 44| 55| 66|
+---+---+---+---+---+---+
Right now I'm creating the missing columns for all dataframes manually to get to a common structure and then using a union. This code is specific to these dataframes and does not scale.
I'm looking for a solution that will work with x dataframes with y columns each.
You can manually create missing columns in the two data frames and then union them:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

val allCols = df1.columns.toSet.union(df2.columns.toSet).toArray
val createMissingCols = (df: DataFrame, allCols: Array[String]) => allCols.foldLeft(df)(
(_df, _col) => if (_df.columns.contains(_col)) _df else _df.withColumn(_col, lit(""))
).select(allCols.head, allCols.tail: _*)
// select is needed to make sure the two data frames have the same order of columns
createMissingCols(df1, allCols).union(createMissingCols(df2, allCols)).show
+---+---+---+---+---+---+
| E| F| A| B| C| D|
+---+---+---+---+---+---+
| | | 1| 2| 3| |
| | | 4| 5| 6| |
| 22| 33| | | | 11|
| 55| 66| | | | 44|
+---+---+---+---+---+---+
A much simpler way of doing this is creating a full outer join and setting the join expression/condition to false:
val df1 = Seq(("1","2","3"),("4","5","6")).toDF("A","B","C")
val df2 = Seq(("11","22","33"),("44","55","66")).toDF("D","E","F")
val joined = df1.join(df2, lit(false), "full")
joined.show()
+----+----+----+----+----+----+
| A| B| C| D| E| F|
+----+----+----+----+----+----+
| 1| 2| 3|null|null|null|
| 4| 5| 6|null|null|null|
|null|null|null| 11| 22| 33|
|null|null|null| 44| 55| 66|
+----+----+----+----+----+----+
if you then want to actually set the null values to empty string you can just add:
val withEmptyString = joined.na.fill("")
withEmptyString.show()
+---+---+---+---+---+---+
| A| B| C| D| E| F|
+---+---+---+---+---+---+
| 1| 2| 3| | | |
| 4| 5| 6| | | |
| | | | 11| 22| 33|
| | | | 44| 55| 66|
+---+---+---+---+---+---+
so in summary df1.join(df2, lit(false), "full").na.fill("") should do the trick.
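Since the question asks for something that works with x dataframes, the same trick can be folded over a whole sequence of frames. A sketch, assuming all frames have disjoint columns (the commented df3 is just a placeholder):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// All frames to stack "diagonally"
val frames: Seq[DataFrame] = Seq(df1, df2 /*, df3, ... */)

val combined = frames
  .reduce((left, right) => left.join(right, lit(false), "full"))
  .na.fill("")

combined.show()
Each join adds the next frame's columns as nulls for the existing rows, and na.fill("") turns those nulls into empty strings as before.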

Pyspark cumsum over same values in orderBy column

I have the following dataframe:
+----+----+-----+
|col1|col2|value|
+----+----+-----+
| 11| a| 1|
| 11| a| 2|
| 11| b| 3|
| 11| a| 4|
| 11| b| 5|
| 22| a| 6|
| 22| b| 7|
+----+----+-----+
I want to calculate the cumulative sum of the 'value' column, partitioned by 'col1' and ordered by 'col2'.
This is the desired output:
+----+----+-----+------+
|col1|col2|value|cumsum|
+----+----+-----+------+
| 11| a| 1| 1|
| 11| a| 2| 3|
| 11| a| 4| 7|
| 11| b| 3| 10|
| 11| b| 5| 15|
| 22| a| 6| 6|
| 22| b| 7| 13|
+----+----+-----+------+
I have used this code, which gives me the dataframe shown below. It is not what I wanted. Can someone please help me?
df.withColumn("cumsum", F.sum("value").over(Window.partitionBy("col1").orderBy("col2").rangeBetween(Window.unboundedPreceding, 0)))
+----+----+-----+------+
|col1|col2|value|cumsum|
+----+----+-----+------+
| 11| a| 2| 7|
| 11| a| 1| 7|
| 11| a| 4| 7|
| 11| b| 3| 15|
| 11| b| 5| 15|
| 22| a| 6| 6|
| 22| b| 7| 13|
+----+----+-----+------+
You have to use .rowsBetween instead of .rangeBetween in your window clause.
rowsBetween vs. rangeBetween
Example:
df.withColumn("cumsum", sum("value").over(Window.partitionBy("col1").orderBy("col2").rowsBetween(Window.unboundedPreceding, 0))).show()
#+----+----+-----+------+
#|col1|col2|value|cumsum|
#+----+----+-----+------+
#| 11| a| 1| 1|
#| 11| a| 2| 3|
#| 11| a| 4| 7|
#| 11| b| 3| 10|
#| 11| b| 5| 15|
#| 22| a| 6| 6|
#| 22| b| 7| 13|
#+----+----+-----+------+
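The difference is that rangeBetween defines the frame by the orderBy value, so every row sharing the same col2 value falls into one frame and gets the same running total (hence the repeated 7 and 15 in the question's output), while rowsBetween counts physical rows. A small Scala sketch (data recreated from the question) comparing the two:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val df = Seq(
  (11, "a", 1), (11, "a", 2), (11, "b", 3), (11, "a", 4), (11, "b", 5),
  (22, "a", 6), (22, "b", 7)
).toDF("col1", "col2", "value")

val byRows = Window.partitionBy("col1").orderBy("col2")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
val byRange = Window.partitionBy("col1").orderBy("col2")
  .rangeBetween(Window.unboundedPreceding, Window.currentRow)

// cumsum_rows advances row by row; cumsum_range is identical for tied col2 values
df.withColumn("cumsum_rows", sum("value").over(byRows))
  .withColumn("cumsum_range", sum("value").over(byRange))
  .show()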