Get the values from a nested structure dataframe in Spark using Scala

I have a dataframe with a nested structure (an array of structs),
StructField("Games", ArrayType(StructType(Array(
StructField("Team", StringType, true),
StructField("Amount", StringType, true),
StructField("Game", StringType, true)))), true),
For this I get values like the rows below (the fields are in the order Team, Amount, Game):
[[A,160,Chess], [B,100,Hockey], [C,1200,Football], [D,900,Cricket]]
[[E,700,Cricket], [F,1000,Chess]]
[[G,1900,Basketball], [I,1000,Cricket], [H,9000,Football]]
Now I have to get the values from this dataframe so that, for the first row,
if Game === 'Football' then TeamFootball = C and the amount = 1200,
if Game === 'Cricket' then TeamCricket = D and the amount = 900.
I tried this:
.withColumn("TeamFootball", when($"Games.Game".getItem(2)==="Football",$"Games.Team".getItem(0).cast(StringType)).otherwise(lit("NA")))
.withColumn("TeamCricket", when($"Games.Game".getItem(2)==="Cricket", $"Games.Team".getItem(0).cast(StringType)).otherwise(lit("NA")))
.withColumn("TeamFootballAmount", when($"Games.Game".getItem(2)==="Football",$"Games.Amount".getItem(1).cast(StringType)).otherwise(lit("NA")))
.withColumn("TeamCricketAmount", when($"Games.Game".getItem(2)==="Cricket",$"Games.Amount".getItem(1).cast(StringType)).otherwise(lit("NA")))
I need all these columns in the same row, which is why I am not using explode.
I am unable to handle the array index here. Could you please help?

"Explode" and then "pivot" can help, please check "result" in output:
import org.apache.spark.sql.functions._
import spark.implicits._

val data = List(
(1, "A", 160, "Chess"), (1, "B", 100, "Hockey"), (1, "C", 1200, "Football"), (1, "D", 900, "Cricket"),
(2, "E", 700, "Cricket"), (2, "F", 1000, "Chess"),
(3, "G", 1900, "Basketball"), (3, "I", 1000, "Cricket"), (3, "H", 9000, "Football")
)
val unstructured = data.toDF("id", "Team", "Amount", "Game")
unstructured.show(false)
val original = unstructured.groupBy("id").agg(collect_list(struct($"Team", $"Amount", $"Game")).alias("Games"))
println("--- Original ----")
original.printSchema()
original.show(false)
val exploded = original.withColumn("Games", explode($"Games")).select("id", "Games.*")
println("--- Exploded ----")
exploded.show(false)
println("--- Result ----")
exploded.groupBy("id").pivot("Game").agg(max($"Amount").alias("Amount"), max("Team").alias("Team")).orderBy("id").show(false)
Output is:
+---+----+------+----------+
|id |Team|Amount|Game |
+---+----+------+----------+
|1 |A |160 |Chess |
|1 |B |100 |Hockey |
|1 |C |1200 |Football |
|1 |D |900 |Cricket |
|2 |E |700 |Cricket |
|2 |F |1000 |Chess |
|3 |G |1900 |Basketball|
|3 |I |1000 |Cricket |
|3 |H |9000 |Football |
+---+----+------+----------+
--- Original ----
root
|-- id: integer (nullable = false)
|-- Games: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Team: string (nullable = true)
| | |-- Amount: integer (nullable = false)
| | |-- Game: string (nullable = true)
+---+-------------------------------------------------------------------+
|id |Games |
+---+-------------------------------------------------------------------+
|3 |[[G,1900,Basketball], [I,1000,Cricket], [H,9000,Football]] |
|1 |[[A,160,Chess], [B,100,Hockey], [C,1200,Football], [D,900,Cricket]]|
|2 |[[E,700,Cricket], [F,1000,Chess]] |
+---+-------------------------------------------------------------------+
--- Exploded ----
+---+----+------+----------+
|id |Team|Amount|Game |
+---+----+------+----------+
|3 |G |1900 |Basketball|
|3 |I |1000 |Cricket |
|3 |H |9000 |Football |
|1 |A |160 |Chess |
|1 |B |100 |Hockey |
|1 |C |1200 |Football |
|1 |D |900 |Cricket |
|2 |E |700 |Cricket |
|2 |F |1000 |Chess |
+---+----+------+----------+
--- Result ----
+---+-----------------+---------------+------------+----------+--------------+------------+---------------+-------------+-------------+-----------+
|id |Basketball_Amount|Basketball_Team|Chess_Amount|Chess_Team|Cricket_Amount|Cricket_Team|Football_Amount|Football_Team|Hockey_Amount|Hockey_Team|
+---+-----------------+---------------+------------+----------+--------------+------------+---------------+-------------+-------------+-----------+
|1 |null |null |160 |A |900 |D |1200 |C |100 |B |
|2 |null |null |1000 |F |700 |E |null |null |null |null |
|3 |1900 |G |null |null |1000 |I |9000 |H |null |null |
+---+-----------------+---------------+------------+----------+--------------+------------+---------------+-------------+-------------+-----------+
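As a side note: if explode really has to be avoided, on Spark 2.4+ the array can be searched in place with the filter higher-order function. Below is a minimal sketch against the original dataframe built above (column names follow the question; this is an alternative sketch, not part of the explode/pivot answer):
import org.apache.spark.sql.functions._

// filter(...) keeps only the structs whose Game matches; getItem(0) takes the
// first match (null when there is none) and coalesce supplies the "NA" default.
val football = expr("filter(Games, g -> g.Game = 'Football')").getItem(0)
val withFootball = original
  .withColumn("TeamFootball", coalesce(football.getField("Team"), lit("NA")))
  .withColumn("TeamFootballAmount", coalesce(football.getField("Amount").cast("string"), lit("NA")))
The same pattern, repeated per game, gives the Cricket columns without ever exploding the array.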

Related

Spark dataframe join aggregating by ID

I have a problem joining 2 dataframes grouped by ID.
val df1 = Seq(
(1, 1,100),
(1, 3,20),
(2, 5,5),
(2, 2,10)).toDF("id", "index","value")
val df2 = Seq(
(1, 0),
(2, 0),
(3, 0),
(4, 0),
(5,0)).toDF("index", "value")
df1 should be joined with df2 on the index column for every id.
The expected result is:
+---+-----+-----+
|id |index|value|
+---+-----+-----+
|1  |1    |100  |
|1  |2    |0    |
|1  |3    |20   |
|1  |4    |0    |
|1  |5    |0    |
|2  |1    |0    |
|2  |2    |10   |
|2  |3    |0    |
|2  |4    |0    |
|2  |5    |5    |
+---+-----+-----+
please help me on this
First of all, I would replace your df2 table with this:
import org.apache.spark.sql.functions.{col, explode, when}
import spark.implicits._

var df2 = Seq(
(Array(1, 2), Array(1, 2, 3, 4, 5))
).toDF("id", "index")
This allows us to use explode and auto-generate a table which can be of help to us:
df2 = df2
.withColumn("id", explode(col("id")))
.withColumn("index", explode(col("index")))
and it gives:
+---+-----+
|id |index|
+---+-----+
|1 |1 |
|1 |2 |
|1 |3 |
|1 |4 |
|1 |5 |
|2 |1 |
|2 |2 |
|2 |3 |
|2 |4 |
|2 |5 |
+---+-----+
Now, all we need to do, is join with your df1 as below:
df2 = df2
.join(df1, Seq("id", "index"), "left")
.withColumn("value", when(col("value").isNull, 0).otherwise(col("value")))
And we get this final output:
+---+-----+-----+
|id |index|value|
+---+-----+-----+
|1 |1 |100 |
|1 |2 |0 |
|1 |3 |20 |
|1 |4 |0 |
|1 |5 |0 |
|2 |1 |0 |
|2 |2 |10 |
|2 |3 |0 |
|2 |4 |0 |
|2 |5 |5 |
+---+-----+-----+
which should be what you want. Good luck!
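A variant that keeps df2 exactly as posted (one row per index with value 0) is to build every (id, index) pair with a cross join and then left-join df1. A minimal sketch, assuming the df1 and df2 from the question:
import org.apache.spark.sql.functions.{coalesce, col, lit}

// pair every id with every index, then pull in df1's values; missing ones default to 0
val allPairs = df1.select("id").distinct().crossJoin(df2.select("index"))
val result = allPairs
  .join(df1, Seq("id", "index"), "left")
  .withColumn("value", coalesce(col("value"), lit(0)))
  .orderBy("id", "index")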

How to re-assign session_id to items when we want to create another session after every null value in items?

I have a pyspark dataframe-
df1 = spark.createDataFrame([
("s1", "i1", 0),
("s1", "i2", 1),
("s1", "i3", 2),
("s1", None, 3),
("s1", "i5", 4),
],
["session_id", "item_id", "pos"])
df1.show(truncate=False)
pos is the position or rank of the item in the session.
Now I want to create new sessions without any null values in them. I want to do this by starting a new session after every null item. Basically I want to break existing sessions into multiple sessions, removing the null item_id in the process.
The expected output would look something like this:
+----------+-------+---+--------------+
|session_id|item_id|pos|new_session_id|
+----------+-------+---+--------------+
|s1 |i1 |0 | s1_0|
|s1 |i2 |1 | s1_0|
|s1 |i3 |2 | s1_0|
|s1 |null |3 | None|
|s1 |i5 |4 | s1_4|
+----------+-------+---+--------------+
How do I achieve this?
I'm not sure about the configuration of your Spark job, but to avoid using a collect action to build the reference for your "new" sessions in a Python built-in data structure, I would use built-in Spark SQL functions to build the new session reference. Based on your example, and assuming you have already sorted the data frame:
from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.window import Window
from pyspark.sql.types import *
df = spark.createDataFrame(
[("s1", "i1", 0), ("s1", "i2", 1), ("s1", "i3", 2), ("s1", None, 3), ("s1", None, 4), ("s1", "i6", 5), ("s2", "i7", 6), ("s2", None, 7), ("s2", "i9", 8), ("s2", "i10", 9), ("s2", "i11", 10)],
["session_id", "item_id", "pos"]
)
df.show(20, False)
+----------+-------+---+
|session_id|item_id|pos|
+----------+-------+---+
|s1 |i1 |0 |
|s1 |i2 |1 |
|s1 |i3 |2 |
|s1 |null |3 |
|s1 |null |4 |
|s1 |i6 |5 |
|s2 |i7 |6 |
|s2 |null |7 |
|s2 |i9 |8 |
|s2 |i10 |9 |
|s2 |i11 |10 |
+----------+-------+---+
Step 1: As the data is already sorted, we can use the lag function to bring each record's previous item_id onto the current record:
df2 = df\
.withColumn('lag_item', func.lag('item_id', 1).over(Window.partitionBy('session_id').orderBy('pos')))
df2.show(20, False)
+----------+-------+---+--------+
|session_id|item_id|pos|lag_item|
+----------+-------+---+--------+
|s1 |i1 |0 |null |
|s1 |i2 |1 |i1 |
|s1 |i3 |2 |i2 |
|s1 |null |3 |i3 |
|s1 |null |4 |null |
|s1 |i6 |5 |null |
|s2 |i7 |6 |null |
|s2 |null |7 |i7 |
|s2 |i9 |8 |null |
|s2 |i10 |9 |i9 |
|s2 |i11 |10 |i10 |
+----------+-------+---+--------+
Step 2: After applying the lag function we can see whether the item_id in the previous record is NULL or not. Therefore, we can find the boundary of each new session by filtering, and build the reference:
reference = df2\
.filter((func.col('item_id').isNotNull())&(func.col('lag_item').isNull()))\
.groupby('session_id')\
.agg(func.collect_set('pos').alias('session_id_set'))
reference.show(100, False)
+----------+--------------+
|session_id|session_id_set|
+----------+--------------+
|s1 |[0, 5] |
|s2 |[6, 8] |
+----------+--------------+
Step 3: Join the reference back to the data and write a simple UDF to find which new session each row should be in:
@func.udf(returnType=IntegerType())
def udf_find_session(item_id, pos, session_id_set):
    # for a non-null item, pick the latest session start position that is <= pos
    r_val = None
    if item_id is not None:
        for item in session_id_set:
            if pos >= item:
                r_val = item
            else:
                break
    return r_val
df3 = df2.select('session_id', 'item_id', 'pos')\
.join(reference, on='session_id', how='inner')
df4 = df3.withColumn('new_session_id', udf_find_session(func.col('item_id'), func.col('pos'), func.col('session_id_set')))
df4.show(20, False)
+----------+-------+---+--------------+
|session_id|item_id|pos|new_session_id|
+----------+-------+---+--------------+
|s1 |i1 |0 |0 |
|s1 |i2 |1 |0 |
|s1 |i3 |2 |0 |
|s1 |null |3 |null |
|s1 |null |4 |null |
|s1 |i6 |5 |5 |
|s2 |i7 |6 |6 |
|s2 |null |7 |null |
|s2 |i9 |8 |8 |
|s2 |i10 |9 |8 |
|s2 |i11 |10 |8 |
+----------+-------+---+--------------+
The last step is just to concatenate the strings you want to show as the new session id (for example session_id, an underscore, and the new_session_id position, giving s1_0).

How to Reverse arrangement DataFrame in Apache Spark

How can I reverse this DataFrame using Scala?
I saw the sort functions, but they require a specific column; I only want to reverse the row order:
+---+--------+-----+
|id | name|note |
+---+--------+-----+
|1 | james |any |
|3 | marry |some |
|2 | john |some |
|5 | tom |any |
+---+--------+-----+
to:
+---+--------+-----+
|id | name|note |
+---+--------+-----+
|5 | tom |any |
|2 | john |some |
|3 | marry |some |
|1 | james |any |
+---+--------+-----+
You can add a column with an increasing id using monotonically_increasing_id()
and sort it in descending order:
import org.apache.spark.sql.functions.monotonically_increasing_id
import spark.implicits._

val dff = Seq(
(1, "james", "any"),
(3, "marry", "some"),
(2, "john", "some"),
(5, "tom", "any")
).toDF("id", "name", "note")
dff.withColumn("index", monotonically_increasing_id())
.sort($"index".desc)
.drop($"index")
.show(false)
Output:
+---+-----+----+
|id |name |note|
+---+-----+----+
|5 |tom |any |
|2 |john |some|
|3 |marry|some|
|1 |james|any |
+---+-----+----+
You could do something like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}

val reverseDf = df.withColumn("row_num", row_number.over(Window.partitionBy(lit(1)).orderBy(lit(1))))
.orderBy($"row_num".desc)
.drop("row_num")
Or use another way of generating the row index instead of row_number; see the sketch below.
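For example, a sketch that attaches a stable 0-based index with zipWithIndex on the underlying RDD and then sorts on it (assuming df is the dataframe from the question):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{LongType, StructField}

// attach a 0-based position to every row, then sort on it in descending order
val indexedRdd = df.rdd.zipWithIndex.map { case (row, idx) => Row.fromSeq(row.toSeq :+ idx) }
val indexedSchema = df.schema.add(StructField("row_num", LongType, nullable = false))
val reversed = spark.createDataFrame(indexedRdd, indexedSchema)
  .orderBy(col("row_num").desc)
  .drop("row_num")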

String cannot be cast to Integer(Scala)

I am writing a Spark project using Scala in which I need to make some calculations from "demo" datasets. I am using the Databricks platform.
I need to pass the 2nd column of my dataframe (trainingCoordDataFrame) into a list. The type of the list must be List[Int].
The dataframe is as shown below:
+---+---+---+---+
|_c0|_c1|_c2|_c3|
+---+---+---+---+
|1  |0  |0  |a  |
|11 |9  |1  |a  |
|12 |2  |7  |c  |
|13 |2  |9  |c  |
|14 |2  |4  |b  |
|15 |1  |3  |c  |
|16 |4  |6  |c  |
|17 |3  |5  |c  |
|18 |5  |3  |a  |
|2  |0  |1  |a  |
|20 |8  |9  |c  |
|3  |1  |0  |b  |
|4  |3  |4  |b  |
|5  |8  |7  |b  |
|6  |4  |9  |b  |
|7  |2  |5  |a  |
|8  |1  |9  |a  |
|9  |3  |6  |a  |
+---+---+---+---+
I am trying to create the list I want using the following command:
val trainingCoordList = trainingCoordDataFrame.select("_c1").collect().map(each => (each.getAs[Int]("_c1"))).toList
The error message I get is this:
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
Note that the procedure is:
1) Upload the dataset from my local PC to Databricks (so no standard data can be used).
val mainDataFrame = spark.read.format("csv").option("header", "false").load("FileStore/tables/First_Spacial_Dataset_ByAris.csv")
2) Create the dataframe. (Step one: split the main dataframe randomly. Step two: remove the unnecessary columns.)
val Array(trainingDataFrame,testingDataFrame) = mainDataFrame.randomSplit(Array(0.8,0.2)) //step one
val trainingCoordDataFrame = trainingDataFrame.drop("_c0", "_c3") //step two
3) Create the list. <- Here is the failing command.
What is the correct way to reach the result I want?
I think there are several ways to deal with this problem.
A) Define a schema for your CSV:
For example:
import org.apache.spark.sql.types._

val customSchema = StructType(Array(
StructField("_c0", IntegerType),
StructField("_c1", IntegerType),
StructField("_c2", IntegerType),
StructField("_c3", StringType)))
When you read the CSV, add the schema option with the StructType we created earlier:
val mainDataFrame = spark.read.format("csv").option("header", "false").schema(customSchema).load("FileStore/tables/First_Spacial_Dataset_ByAris.csv")
Now if we look at the output of the mainDataFrame.printSchema() command we'll see that the columns are typed according to your use case:
root
|-- _c0: integer (nullable = true)
|-- _c1: integer (nullable = true)
|-- _c2: integer (nullable = true)
|-- _c3: string (nullable = true)
This means we can actually run your original command without getting an error.
trainingCoordDataFrame.select("_c2").map(r => r.getInt(0)).collect.toList
B) Cast the entire column to Int
Refer to the column itself instead of the column name and then cast the column to IntegerType. Now that the column type is Int you can again use getInt where it failed earlier:
trainingCoordDataFrame.select($"_c2".cast(IntegerType)).map(r => r.getInt(0)).collect.toList
C) Cast each value individually
Use map to retrieve each individual value as a String and then convert it to Int:
trainingCoordDataFrame.select("_c2").map(r => r.getString(0).toInt).collect.toList
The column's values are of type String, so read the column as a String and use Scala's String.toInt method.
A cast is definitely the wrong thing to do here:
val trainingCoordList = trainingCoordDataFrame.select("_c1").collect().map(each => each.getAs[String]("_c1").toInt).toList
Or use the Dataset API with the custom schema, e.g. with tuples.
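One reading of that last suggestion, as a minimal sketch (it assumes the customSchema and file path from option A above; the toDF rename is only there so the tuple encoder can resolve the fields by name):
import spark.implicits._

val coordDs = spark.read
  .schema(customSchema)
  .csv("FileStore/tables/First_Spacial_Dataset_ByAris.csv")
  .toDF("_1", "_2", "_3", "_4")            // tuple encoder expects fields named _1 .. _4
  .as[(Int, Int, Int, String)]

val trainingCoordList = coordDs.map(_._2).collect().toList   // the 2nd column as List[Int]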

How to iterate grouped rows to produce multiple rows in spark structured streaming?

I have the input data set like:
id operation value
1 null 1
1 discard 0
2 null 1
2 null 2
2 max 0
3 null 1
3 null 1
3 list 0
I want to group the input and produce rows according to the "operation" column:
for group 1, operation="discard", there is no output row;
for group 2, operation="max", the output is:
2 null 2
for group 3, operation="list", the output is:
3 null 1
3 null 1
So finally the output is like:
id operation value
2 null 2
3 null 1
3 null 1
Is there a solution for this?
I know there is a similar question, how-to-iterate-grouped-data-in-spark, but the differences compared to that are:
I want to produce more than one row for each group. Is that possible, and how?
I want my logic to be easily extended when more operations are added in the future. So are user-defined aggregate functions (UDAFs) the only possible solution?
Update 1:
Thanks stack0114106; here are more details following his answer. For example, for id=1 with operation="max", I want to iterate over all the items in that group and find the max value, rather than assign a hard-coded value; that's why I want to iterate over the rows in each group. Below is an updated example.
The input:
scala> val df = Seq((0,null,1),(0,"discard",0),(1,null,1),(1,null,2),(1,"max",0),(2,null,1),(2,null,3),(2,"max",0),(3,null,1),(3,null,1),(3,"list",0)).toDF("id","operation","value")
df: org.apache.spark.sql.DataFrame = [id: int, operation: string ... 1 more field]
scala> df.show(false)
+---+---------+-----+
|id |operation|value|
+---+---------+-----+
|0 |null |1 |
|0 |discard |0 |
|1 |null |1 |
|1 |null |2 |
|1 |max |0 |
|2 |null |1 |
|2 |null |3 |
|2 |max |0 |
|3 |null |1 |
|3 |null |1 |
|3 |list |0 |
+---+---------+-----+
The expected output:
+---+---------+-----+
|id |operation|value|
+---+---------+-----+
|1 |null |2 |
|2 |null |3 |
|3 |null |1 |
|3 |null |1 |
+---+---------+-----+
Group everything, collecting the values, then write the logic for each operation:
import org.apache.spark.sql.functions._
val grouped=df.groupBy($"id").agg(max($"operation").as("op"),collect_list($"value").as("vals"))
val maxs=grouped.filter($"op"==="max").withColumn("val",explode($"vals")).groupBy($"id").agg(max("val").as("value"))
val lists=grouped.filter($"op"==="list").withColumn("value",explode($"vals")).filter($"value" =!= 0).select($"id",$"value")
//we don't collect the "discard"
//and we can add additional subsets for new "operations"
val result=maxs.union(lists)
//if you need the null in "operation" column add it with withColumn
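For example, a small sketch of that last comment, assuming the result dataframe built above:
val withOperation = result
  .withColumn("operation", lit(null).cast("string"))   // put the null operation column back
  .select("id", "operation", "value")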
You can use the flatMap operation on the dataframe and generate the required rows based on the conditions that you mentioned. Check this out:
scala> val df = Seq((1,null,1),(1,"discard",0),(2,null,1),(2,null,2),(2,"max",0),(3,null,1),(3,null,1),(3,"list",0)).toDF("id","operation","value")
df: org.apache.spark.sql.DataFrame = [id: int, operation: string ... 1 more field]
scala> df.show(false)
+---+---------+-----+
|id |operation|value|
+---+---------+-----+
|1 |null |1 |
|1 |discard |0 |
|2 |null |1 |
|2 |null |2 |
|2 |max |0 |
|3 |null |1 |
|3 |null |1 |
|3 |list |0 |
+---+---------+-----+
scala> df.filter("operation is not null").flatMap( r=> { val x=r.getString(1); val s = x match { case "discard" => (0,0) case "max" => (1,2) case "list" => (2,1) } ; (0 until s._1).map( i => (r.getInt(0),null,s._2) ) }).show(false)
+---+----+---+
|_1 |_2 |_3 |
+---+----+---+
|2 |null|2 |
|3 |null|1 |
|3 |null|1 |
+---+----+---+
Spark assigns _1, _2, etc., so you can map them to actual names by assigning them with toDF as below:
scala> val df2 = df.filter("operation is not null").flatMap( r=> { val x=r.getString(1); val s = x match { case "discard" => (0,0) case "max" => (1,2) case "list" => (2,1) } ; (0 until s._1).map( i => (r.getInt(0),null,s._2) ) }).toDF("id","operation","value")
df2: org.apache.spark.sql.DataFrame = [id: int, operation: null ... 1 more field]
scala> df2.show(false)
+---+---------+-----+
|id |operation|value|
+---+---------+-----+
|2 |null |2 |
|3 |null |1 |
|3 |null |1 |
+---+---------+-----+
scala>
EDIT1:
Since you need the max(value) for each id, you can use a window function to get the max value in a new column, then use the same technique to get the results. Check this out:
scala> val df = Seq((0,null,1),(0,"discard",0),(1,null,1),(1,null,2),(1,"max",0),(2,null,1),(2,null,3),(2,"max",0),(3,null,1),(3,null,1),(3,"list",0)).toDF("id","operation","value")
df: org.apache.spark.sql.DataFrame = [id: int, operation: string ... 1 more field]
scala> df.createOrReplaceTempView("michael")
scala> val df2 = spark.sql(""" select *, max(value) over(partition by id) mx from michael """)
df2: org.apache.spark.sql.DataFrame = [id: int, operation: string ... 2 more fields]
scala> df2.show(false)
+---+---------+-----+---+
|id |operation|value|mx |
+---+---------+-----+---+
|1 |null |1 |2 |
|1 |null |2 |2 |
|1 |max |0 |2 |
|3 |null |1 |1 |
|3 |null |1 |1 |
|3 |list |0 |1 |
|2 |null |1 |3 |
|2 |null |3 |3 |
|2 |max |0 |3 |
|0 |null |1 |1 |
|0 |discard |0 |1 |
+---+---------+-----+---+
scala> val df3 = df2.filter("operation is not null").flatMap( r=> { val x=r.getString(1); val s = x match { case "discard" => 0 case "max" => 1 case "list" => 2 } ; (0 until s).map( i => (r.getInt(0),null,r.getInt(3) )) }).toDF("id","operation","value")
df3: org.apache.spark.sql.DataFrame = [id: int, operation: null ... 1 more field]
scala> df3.show(false)
+---+---------+-----+
|id |operation|value|
+---+---------+-----+
|1 |null |2 |
|3 |null |1 |
|3 |null |1 |
|2 |null |3 |
+---+---------+-----+
scala>