Scala Spark Join Dataframe in loop - scala

I am trying to join DataFrames on the fly in loop. I am using a properties file to get the column details to use in the final data frame.
Properties file -
a01=status:single,perm_id:multi
a02=status:single,actv_id:multi
a03=status:single,perm_id:multi,actv_id:multi
............................
............................
For each row in the properties file, I need to create a DataFrame and save it in a file. Loading the properties file using PropertiesReader. if the mode is single then I need to get only the column value from the table. But if multi, then I need to get the list of values.
val propertyColumn = properties.get("a01") //a01 value we are getting as an argument. This might be a01,a02 or a0n
val columns = propertyColumn.toString.split(",").map(_.toString)
act_det table -
+-------+--------+-----------+-----------+-----------+------------+
|id |act_id |status |perm_id |actv_id | debt_id |
+-------+--------+-----------+-----------+-----------+------------+
| 1 |1 | 4 | 1 | 10 | 1 |
+-------+--------+-----------+-----------+-----------+------------+
| 2 |1 | 4 | 2 | 20 | 2 |
+-------+--------+-----------+-----------+-----------+------------+
| 3 |1 | 4 | 3 | 30 | 1 |
+-------+--------+-----------+-----------+-----------+------------+
| 4 |2 | 4 | 5 | 10 | 3 |
+-------+--------+-----------+-----------+-----------+------------+
| 5 |2 | 4 | 6 | 20 | 1 |
+-------+--------+-----------+-----------+-----------+------------+
| 6 |2 | 4 | 7 | 30 | 1 |
+-------+--------+-----------+-----------+-----------+------------+
| 7 |3 | 4 | 1 | 10 | 3 |
+-------+--------+-----------+-----------+-----------+------------+
| 8 |3 | 4 | 5 | 20 | 1 |
+-------+--------+-----------+-----------+-----------+------------+
| 9 |3 | 4 | 2 | 30 | 3 |
+-------+--------+-----------+-----------+------------+-----------+
Main DataFrame -
val data = sqlContext.sql("select * from act_det")
I want the following output -
For a01 -
+-------+--------+-----------+
|act_id |status |perm_id |
+-------+--------+-----------+
| 1 | 4 | [1,2,3] |
+-------+--------+-----------+
| 2 | 4 | [5,6,7] |
+-------+--------+-----------+
| 3 | 4 | [1,5,2] |
+-------+--------+-----------+
For a02 -
+-------+--------+-----------+
|act_id |status |actv_id |
+-------+--------+-----------+
| 1 | 4 | [10,20,30]|
+-------+--------+-----------+
| 2 | 4 | [10,20,30]|
+-------+--------+-----------+
| 3 | 4 | [10,20,30]|
+-------+--------+-----------+
For a03 -
+-------+--------+-----------+-----------+
|act_id |status |perm_id |actv_id |
+-------+--------+-----------+-----------+
| 1 | 4 | [1,2,3] |[10,20,30] |
+-------+--------+-----------+-----------+
| 2 | 4 | [5,6,7] |[10,20,30] |
+-------+--------+-----------+-----------+
| 3 | 4 | [1,5,2] |[10,20,30] |
+-------+--------+-----------+-----------+
But the data frame creation process should be dynamic.
I have tried below code but I am not able to implement the join logic for the DataFrames in loop.
val finalDF:DataFrame = ??? //empty dataframe
for {
column <- columns
} yeild {
val eachColumn = column.toString.split(":").map(_.toString)
val columnName = eachColumn(0)
val mode = eachColumn(1)
if(mode.equalsIgnoreCase("single")) {
data.select($"act_id", $"status").distinct
//I want to join finalDF with data.select($"act_id", $"status").distinct
} else if(mode.equalsIgnoreCase("multi")) {
data.groupBy($"act_id").agg(collect_list($"perm_id").as("perm_id"))
//I want to join finalDF with data.groupBy($"act_id").agg(collect_list($"perm_id").as("perm_id"))
}
}
Any advice or guidance would be greatly appreciated.

Check below code.
scala> df.show(false)
+---+------+------+-------+-------+-------+
|id |act_id|status|perm_id|actv_id|debt_id|
+---+------+------+-------+-------+-------+
|1 |1 |4 |1 |10 |1 |
|2 |1 |4 |2 |20 |2 |
|3 |1 |4 |3 |30 |1 |
|4 |2 |4 |5 |10 |3 |
|5 |2 |4 |6 |20 |1 |
|6 |2 |4 |7 |30 |1 |
|7 |3 |4 |1 |10 |3 |
|8 |3 |4 |5 |20 |1 |
|9 |3 |4 |2 |30 |3 |
+---+------+------+-------+-------+-------+
Defining primary keys
scala> val primary_key = Seq("act_id").map(col(_))
primary_key: Seq[org.apache.spark.sql.Column] = List(act_id)
Configs
scala> configs.foreach(println)
/*
(a01,status:single,perm_id:multi)
(a02,status:single,actv_id:multi)
(a03,status:single,perm_id:multi,actv_id:multi)
*/
Constructing Expression.
scala>
val columns = configs
.map(c => {
c._2
.split(",")
.map(c => {
val cc = c.split(":");
if(cc.tail.contains("single"))
first(col(cc.head)).as(cc.head)
else
collect_list(col(cc.head)).as(cc.head)
}
)
})
/*
columns: scala.collection.immutable.Iterable[Array[org.apache.spark.sql.Column]] = List(
Array(first(status, false) AS `status`, collect_list(perm_id) AS `perm_id`),
Array(first(status, false) AS `status`, collect_list(actv_id) AS `actv_id`),
Array(first(status, false) AS `status`, collect_list(perm_id) AS `perm_id`, collect_list(actv_id) AS `actv_id`)
)
*/
Final Result
scala> columns.map(c => df.groupBy(primary_key:_*).agg(c.head,c.tail:_*)).map(_.show(false))
+------+------+---------+
|act_id|status|perm_id |
+------+------+---------+
|3 |4 |[1, 5, 2]|
|1 |4 |[1, 2, 3]|
|2 |4 |[5, 6, 7]|
+------+------+---------+
+------+------+------------+
|act_id|status|actv_id |
+------+------+------------+
|3 |4 |[10, 20, 30]|
|1 |4 |[10, 20, 30]|
|2 |4 |[10, 20, 30]|
+------+------+------------+
+------+------+---------+------------+
|act_id|status|perm_id |actv_id |
+------+------+---------+------------+
|3 |4 |[1, 5, 2]|[10, 20, 30]|
|1 |4 |[1, 2, 3]|[10, 20, 30]|
|2 |4 |[5, 6, 7]|[10, 20, 30]|
+------+------+---------+------------+

Related

How to add some values in a dataframe in Scala Spark?

Here is the dataframe I have for now, suppose there are totally 4 days{1,2,3,4}:
+-------------+----------+------+
| key | Time | Value|
+-------------+----------+------+
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 4 | 3 |
| 2 | 2 | 4 |
| 2 | 3 | 5 |
+-------------+----------+------+
And what I want is
+-------------+----------+------+
| key | Time | Value|
+-------------+----------+------+
| 1 | 1 | 1 |
| 1 | 2 | 2 |
| 1 | 3 | null |
| 1 | 4 | 3 |
| 2 | 1 | null |
| 2 | 2 | 4 |
| 2 | 3 | 5 |
| 2 | 4 | null |
+-------------+----------+------+
If there is some ways that can help me get this?
Say df1 is our main table:
+---+----+-----+
|key|Time|Value|
+---+----+-----+
|1 |1 |1 |
|1 |2 |2 |
|1 |4 |3 |
|2 |2 |4 |
|2 |3 |5 |
+---+----+-----+
We can use the following transformations:
val data = df1
// we first group by and aggregate the values to a sequence between 1 and 4 (your number)
.groupBy("key")
.agg(sequence(lit(1), lit(4)).as("Time"))
// we explode the sequence, thus creating all 'Time' per 'key'
.withColumn("Time", explode(col("Time")))
// finally, we join with our main table on 'key' and 'Time'
.join(df1, Seq("key", "Time"), "left")
To get this output:
+---+----+-----+
|key|Time|Value|
+---+----+-----+
|1 |1 |1 |
|1 |2 |2 |
|1 |3 |null |
|1 |4 |3 |
|2 |1 |null |
|2 |2 |4 |
|2 |3 |5 |
|2 |4 |null |
+---+----+-----+
Which should be what you are looking for, good luck!

How to re-assign session_id to items when we want to create another session after every null value in items?

I have a pyspark dataframe-
df1 = spark.createDataFrame([
("s1", "i1", 0),
("s1", "i2", 1),
("s1", "i3", 2),
("s1", None, 3),
("s1", "i5", 4),
],
["session_id", "item_id", "pos"])
df1.show(truncate=False)
pos is the position or rank of the item in the session.
Now I want to create new sessions without any null values in them. I want to do this by starting a new session after every null item. Basically I want to break existing sessions into multiple sessions, removing the null item_id in the process.
The expected output would like something like-
+----------+-------+---+--------------+
|session_id|item_id|pos|new_session_id|
+----------+-------+---+--------------+
|s1 |i1 |0 | s1_0|
|s1 |i2 |1 | s1_0|
|s1 |i3 |2 | s1_0|
|s1 |null |3 | None|
|s1 |i5 |4 | s1_4|
+----------+-------+---+--------------+
How do I achieve this?
Not sure about the configs of your spark job, but to prevent to use
collect action to build the reference of your "new" session in Python built-in data structure, I would use built-in spark sql function to build the new session reference. Based on your example, assuming you have already sorted the data frame:
from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.window import Window
from pyspark.sql.types import *
df = spark.createDataFrame(
[("s1", "i1", 0), ("s1", "i2", 1), ("s1", "i3", 2), ("s1", None, 3), ("s1", None, 4), ("s1", "i6", 5), ("s2", "i7", 6), ("s2", None, 7), ("s2", "i9", 8), ("s2", "i10", 9), ("s2", "i11", 10)],
["session_id", "item_id", "pos"]
)
df.show(20, False)
+----------+-------+---+
|session_id|item_id|pos|
+----------+-------+---+
|s1 |i1 |0 |
|s1 |i2 |1 |
|s1 |i3 |2 |
|s1 |null |3 |
|s1 |null |4 |
|s1 |i6 |5 |
|s2 |i7 |6 |
|s2 |null |7 |
|s2 |i9 |8 |
|s2 |i10 |9 |
|s2 |i11 |10 |
+----------+-------+---+
Step 1: As the data is already sorted, we can use a lag function to shift the data to the next record:
df2 = df\
.withColumn('lag_item', func.lag('item_id', 1).over(Window.partitionBy('session_id').orderBy('pos')))
df2.show(20, False)
+----------+-------+---+--------+
|session_id|item_id|pos|lag_item|
+----------+-------+---+--------+
|s1 |i1 |0 |null |
|s1 |i2 |1 |i1 |
|s1 |i3 |2 |i2 |
|s1 |null |3 |i3 |
|s1 |null |4 |null |
|s1 |i6 |5 |null |
|s2 |i7 |6 |null |
|s2 |null |7 |i7 |
|s2 |i9 |8 |null |
|s2 |i10 |9 |i9 |
|s2 |i11 |10 |i10 |
+----------+-------+---+--------+
Step 2: After using the lag function we can see if the item_id in previous record is NULL or not. Therefore , we can know the boundaries of each new session by doing the filtering and build the reference:
reference = df2\
.filter((func.col('item_id').isNotNull())&(func.col('lag_item').isNull()))\
.groupby('session_id')\
.agg(func.collect_set('pos').alias('session_id_set'))
reference.show(100, False)
+----------+--------------+
|session_id|session_id_set|
+----------+--------------+
|s1 |[0, 5] |
|s2 |[6, 8] |
+----------+--------------+
Step 3: Join the reference back to the data and write a simple UDF to find which new session should be in:
#func.udf(returnType=IntegerType())
def udf_find_session(item_id, pos, session_id_set):
r_val = None
if item_id != None:
for item in session_id_set:
if pos >= item:
r_val = item
else:
break
return r_val
df3 = df2.select('session_id', 'item_id', 'pos')\
.join(reference, on='session_id', how='inner')
df4 = df3.withColumn('new_session_id', udf_find_session(func.col('item_id'), func.col('pos'), func.col('session_id_set')))
df4.show(20, False)
+----------+-------+---+--------------+
|session_id|item_id|pos|new_session_id|
+----------+-------+---+--------------+
|s1 |i1 |0 |0 |
|s1 |i2 |1 |0 |
|s1 |i3 |2 |0 |
|s1 |null |3 |null |
|s1 |null |4 |null |
|s1 |i6 |5 |5 |
|s2 |i7 |6 |6 |
|s2 |null |7 |null |
|s2 |i9 |8 |8 |
|s2 |i10 |9 |8 |
|s2 |i11 |10 |8 |
+----------+-------+---+--------------+
The last step just concat the string you want to show in new session id.

how to rename the Columns Produced by count() function in Scala

I have the below df:
+------+-------+--------+
|student| vars|observed|
+------+-------+--------+
| 1| ABC | 19|
| 1| ABC | 1|
| 2| CDB | 1|
| 1| ABC | 8|
| 3| XYZ | 3|
| 1| ABC | 389|
| 2| CDB | 946|
| 1| ABC | 342|
|+------+-------+--------+
I wanted to add a new frequency column groupBy two columns "student", "vars" in SCALA.
val frequency = df.groupBy($"student", $"vars").count()
This code generates a "count" column with the frequencies BUT losing observed column from the df.
I would like to create a new df as follows without losing "observed" column
+------+-------+--------+------------+
|student| vars|observed|total_count |
+------+-------+--------+------------+
| 1| ABC | 9|22
| 1| ABC | 1|22
| 2| CDB | 1|7
| 1| ABC | 2|22
| 3| XYZ | 3|3
| 1| ABC | 8|22
| 2| CDB | 6|7
| 1| ABC | 2|22
|+------+-------+-------+--------------+
You cannot do this directly but there are couple of ways,
You can join original df with count df. check here
You collect the observed column while doing aggregation and explode it again
With explode:
val frequency = df.groupBy("student", "vars").agg(collect_list("observed").as("observed_list"),count("*").as("total_count")).select($"student", $"vars",explode($"observed_list").alias("observed"), $"total_count")
scala> frequency.show(false)
+-------+----+--------+-----------+
|student|vars|observed|total_count|
+-------+----+--------+-----------+
|3 |XYZ |3 |1 |
|2 |CDB |1 |2 |
|2 |CDB |946 |2 |
|1 |ABC |389 |5 |
|1 |ABC |342 |5 |
|1 |ABC |19 |5 |
|1 |ABC |1 |5 |
|1 |ABC |8 |5 |
+-------+----+--------+-----------+
We can use Window functions as well
val windowSpec = Window.partitionBy("student","vars")
val frequency = df.withColumn("total_count", count(col("student")) over windowSpec)
.show
+-------+----+--------+-----------+
|student|vars|observed|total_count|
+-------+----+--------+-----------+
|3 |XYZ |3 |1 |
|2 |CDB |1 |2 |
|2 |CDB |946 |2 |
|1 |ABC |389 |5 |
|1 |ABC |342 |5 |
|1 |ABC |19 |5 |
|1 |ABC |1 |5 |
|1 |ABC |8 |5 |
+-------+----+--------+-----------+

dynamic values in narrative text type values for a column in spark dataframe

I am trying to add columns values in a narrative text but able to add only one value for every row
var hashColDf = rowmaxDF.select("min", "max", "Total")
val peopleArray = hashColDf.collect.map(r => Map(hashColDf.columns.zip(r.toSeq): _*))
val comstr = "shyam has max and min not Total"
var mapArrayStr = List[String]()
for(eachrow <- peopleArray){
mapArrayStr = mapArrayStr :+ eachrow.foldLeft(comstr)((a, b) => a.replaceAllLiterally(b._1, b._2.toString()))
}
for(eachCol <- mapArrayStr){
rowmaxDF = rowmaxDF.withColumn("compCols", lit(eachCol))
}
}
Source Dataframe :
|max|min|TOTAL|
|3 |1 |4 |
|5 |2 |7 |
|7 |3 |10 |
|8 |4 |12 |
|10 |5 |15 |
|10 |5 |15 |
Actual Result:
|max|min|TOTAL|compCols |
|3 |1 |4 |shyam has 10 and 5 not 15|
|5 |2 |7 |shyam has 10 and 5 not 15|
|7 |3 |10 |shyam has 10 and 5 not 15|
|8 |4 |12 |shyam has 10 and 5 not 15|
|10 |5 |15 |shyam has 10 and 5 not 15|
|10 |5 |15 |shyam has 10 and 5 not 15|
Expected Result :
|max|min|TOTAL|compCols |
|3 |1 |4 |shyam has 3 and 1 not 4 |
|5 |2 |7 |shyam has 5 and 2 not 7 |
|7 |3 |10 |shyam has 7 and 3 not 10 |
|8 |4 |12 |shyam has 8 and 4 not 12 |
|10 |5 |15 |shyam has 10 and 5 not 15|
|10 |5 |15 |shyam has 10 and 5 not 15|

How to find the next occurring item from current row in a data frame using Spark Windowing?

I have the following Dataframe:
+------+----------+-------------+--------------------+---------+-----+----------+
|ID |MEM_ID | BFS | SVC_DT |TYP |SEQ |BFS_SEQ |
+------+----------+----------------------------------+---------+-----+----------+
|105771|29378668 | BRIMONIDINE | 2019-02-04 00:00:00|PD |1 |1 |
|105772|29378668 | BRIMONIDINE | 2019-04-04 00:00:00|PD |2 |2 |
|105773|29378668 | BRIMONIDINE | 2019-04-17 00:00:00|RV |3 |3 |
|105774|29378668 | TIMOLOL | 2019-04-17 00:00:00|RV |4 |1 |
|105775|29378668 | BRIMONIDINE | 2019-04-22 00:00:00|PD |5 |4 |
|105776|29378668 | TIMOLOL | 2019-04-22 00:00:00|PD |6 |2 |
+------+----------+----------------------------------+---------+-----+----------+
For every row, I have to find the occurrence of next 'PD' Typ at BFS level from the current row and populate its associated ID as a new column named 'NEXT_PD_TYP_ID'
The output I am expecting is:
+------+---------+-------------+--------------------+----+-----+---------+---------------+
|ID |MEM_ID | BFS | SVC_DT |TYP |SEQ |BFS_SEQ |NEXT_PD_TYP_ID |
+------+---------+----------------------------------+----+-----+---------+---------------+
|105771|29378668 | BRIMONIDINE | 2019-02-04 00:00:00|PD |1 |1 |105772 |
|105772|29378668 | BRIMONIDINE | 2019-04-04 00:00:00|PD |2 |2 |105775 |
|105773|29378668 | BRIMONIDINE | 2019-04-17 00:00:00|RV |3 |3 |105775 |
|105774|29378668 | TIMOLOL | 2019-04-17 00:00:00|RV |4 |1 |105776 |
|105775|29378668 | BRIMONIDINE | 2019-04-22 00:00:00|PD |5 |4 |null |
|105776|29378668 | TIMOLOL | 2019-04-22 00:00:00|PD |6 |2 |null |
+------+---------+----------------------------------+----+-----+---------+---------------+
Need help.
I have tried using the conditional aggregation: max(when), however since it has more than one 'PD' the max is returning only one value for all the rows.
No error messages
I hope this helps.
I created a new column with ID's of TYP === PD. I called it TYPPDID.
Then I used Window frame ranging from next row to unbounded following row and got the first not-null TYPPDID
orderBy("ID") in the end is only to show records in order.
import org.apache.spark.sql.functions._
val df = Seq(
("105771", "BRIMONIDINE", "PD"),
("105772", "BRIMONIDINE", "PD"),
("105773", "BRIMONIDINE","RV"),
("105774", "TIMOLOL", "RV"),
("105775", "BRIMONIDINE", "PD"),
("105776", "TIMOLOL", "PD")
).toDF("ID", "BFS", "TYP").withColumn("TYPPDID", when($"TYP" === "PD", $"ID"))
df: org.apache.spark.sql.DataFrame = [ID: string, BFS: string ... 2 more fields]
scala> df.show
+------+-----------+---+-------+
| ID| BFS|TYP|TYPPDID|
+------+-----------+---+-------+
|105771|BRIMONIDINE| PD| 105771|
|105772|BRIMONIDINE| PD| 105772|
|105773|BRIMONIDINE| RV| null|
|105774| TIMOLOL| RV| null|
|105775|BRIMONIDINE| PD| 105775|
|105776| TIMOLOL| PD| 105776|
+------+-----------+---+-------+
scala> val overColumns = Window.partitionBy("BFS").orderBy("ID").rowsBetween(1, Window.unboundedFollowing)
overColumns: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec#eb923ef
scala> df.withColumn("NEXT_PD_TYP_ID",first("TYPPDID", true).over(overColumns)).orderBy("ID").show(false)
+------+-----------+---+-------+-------+
|ID |BFS |TYP|TYPPDID|NEXT_PD_TYP_ID|
+------+-----------+---+-------+-------+
|105771|BRIMONIDINE|PD |105771 |105772 |
|105772|BRIMONIDINE|PD |105772 |105775 |
|105773|BRIMONIDINE|RV |null |105775 |
|105774|TIMOLOL |RV |null |105776 |
|105775|BRIMONIDINE|PD |105775 |null |
|105776|TIMOLOL |PD |105776 |null |
+------+-----------+---+-------+-------+