How to collect a map after group by in a PySpark dataframe?

I have a pyspark dataframe like this:
| id | time | cat |
-------------------
|  1 |  t1  |  a  |
|  1 |  t2  |  b  |
|  2 |  t3  |  b  |
|  2 |  t4  |  c  |
|  2 |  t5  |  b  |
|  3 |  t6  |  a  |
|  3 |  t7  |  a  |
|  3 |  t8  |  a  |
Now, I want to group them by "id" and aggregate them into a Map like this:
| id | cat            |
-----------------------
| 1  | a -> 1, b -> 1 |
| 2  | b -> 2, c -> 1 |
| 3  | a -> 3         |
I guess we can use pyspark.sql.functions.collect_list to collect them as a list, and then apply some UDF to turn the list into a dict. But is there any other (shorter or more efficient) way to do this?
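For reference, a rough sketch of what I have in mind (the counting UDF below is hypothetical, just to illustrate the idea):
from collections import Counter

import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType, MapType, StringType

# Hypothetical UDF: turn the collected list of categories into a {cat: count} dict.
to_count_map = F.udf(lambda cats: dict(Counter(cats)),
                     MapType(StringType(), IntegerType()))

df_map = df.groupby("id").agg(to_count_map(F.collect_list("cat")).alias("cat"))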

You can use the map_from_entries function from pyspark.sql.functions (available since Spark 2.4).
Assuming your dataframe is df, you can do this:
import pyspark.sql.functions as F
df1 = df.groupby("id", "cat").count()
df2 = df1.groupby("id") \
    .agg(F.map_from_entries(F.collect_list(F.struct("cat", "count"))).alias("cat"))

Similar to yasi's answer:
import pyspark.sql.functions as F
df1 = df.groupby("id", "cat").count()
df2 = df1.groupby("id") \
    .agg(F.map_from_arrays(F.collect_list("cat"), F.collect_list("count")).alias("cat"))

Here is how I did it.
Code
import pyspark.sql.functions as F
from pyspark.sql.types import StringType
df = spark.createDataFrame([(1, 't1', 'a'), (1, 't2', 'b'), (2, 't3', 'b'), (2, 't4', 'c'),
                            (2, 't5', 'b'), (3, 't6', 'a'), (3, 't7', 'a'), (3, 't8', 'a')],
                           ('id', 'time', 'cat'))
(df.groupBy(['id', 'cat'])
   .agg(F.count(F.col('cat')).cast(StringType()).alias('counted'))
   .select(['id', F.concat_ws('->', F.col('cat'), F.col('counted')).alias('arrowed')])
   .groupBy('id')
   .agg(F.collect_list('arrowed'))
   .show()
)
Output
+---+---------------------+
| id|collect_list(arrowed)|
+---+---------------------+
|  1|     [a -> 1, b -> 1]|
|  3|             [a -> 3]|
|  2|     [b -> 2, c -> 1]|
+---+---------------------+
Edit
(df.groupBy(['id', 'cat'])
   .count()
   .select(['id', F.create_map('cat', 'count').alias('map')])
   .groupBy('id')
   .agg(F.collect_list('map').alias('cat'))
   .show()
)
#+---+--------------------+
#| id| cat|
#+---+--------------------+
#| 1|[[a -> 1], [b -> 1]]|
#| 3| [[a -> 3]]|
#| 2|[[b -> 2], [c -> 1]]|
#+---+--------------------+

Related

complex logic on pyspark dataframe including previous row existing value as well as previous row value generated on the fly

I have to apply logic to a Spark dataframe or RDD (preferably a dataframe) that requires generating two extra columns. The first generated column depends on other columns of the same row, and the second generated column depends on the first generated column of the previous row.
Below is a representation of the problem statement in tabular format. Columns A and B are available in the dataframe; columns C and D are to be generated.
A | B | C | D
------------------------------------
1 | 100 | default val | C1-B1
2 | 200 | D1-C1 | C2-B2
3 | 300 | D2-C2 | C3-B3
4 | 400 | D3-C3 | C4-B4
5 | 500 | D4-C4 | C5-B5
Here is the sample data
A | B | C | D
------------------------
1 | 100 | 1000 | 900
2 | 200 | -100 | -300
3 | 300 | -200 | -500
4 | 400 | -300 | -700
5 | 500 | -400 | -900
The only solution I can think of is to coalesce the input dataframe to 1 partition, convert it to an RDD, and then apply a Python function (holding all the calculation logic) through the mapPartitions API.
However, this approach may put all of the load on one executor.
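For completeness, a rough sketch of that coalesce-plus-mapPartitions fallback might look as follows (my own illustration, assuming the columns are named A and B as in the table above, and that 1000 is the default value for the first row as in the sample data):
def fill_c_and_d(rows):
    # Runs inside the single partition; keeps the previous row's C and D.
    prev_c, prev_d = None, None
    for row in rows:
        c = 1000 if prev_c is None else prev_d - prev_c  # default value on the first row
        d = c - row.B
        prev_c, prev_d = c, d
        yield (row.A, row.B, c, d)

result = (df.orderBy("A")
            .coalesce(1)                      # all rows end up in one partition
            .rdd.mapPartitions(fill_c_and_d)
            .toDF(["A", "B", "C", "D"]))
As noted, this serializes the whole computation onto one executor, which is exactly the drawback described above.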
Mathematically speaking, since D1 = C1 - B1, the expression D1 - C1 becomes C1 - B1 - C1 = -B1; in other words, each C value is just the negated B value of the previous row.
In PySpark, the lag window function has a parameter called default; this should simplify your problem. Try this:
import pyspark.sql.functions as F
from pyspark.sql import Window
df = spark.createDataFrame([(1,100),(2,200),(3,300),(4,400),(5,500)],['a','b'])
w = Window.orderBy('a')
df_lag = df.withColumn('c', F.lag(F.col('b') * -1, default=1000).over(w))
df_final = df_lag.withColumn('d', F.col('c') - F.col('b'))
Results:
df_final.show()
+---+---+----+----+
| a| b| c| d|
+---+---+----+----+
| 1|100|1000| 900|
| 2|200|-100|-300|
| 3|300|-200|-500|
| 4|400|-300|-700|
| 5|500|-400|-900|
+---+---+----+----+
If the operation is something more complex than subtraction, the same logic applies: fill column C with your default value, calculate D, then use lag to calculate C and recalculate D.
The lag() function may help you with that:
import pyspark.sql.functions as F
from pyspark.sql.window import Window
w = Window.orderBy("A")
df1 = df1.withColumn("C", F.lit(1000))
df2 = (
    df1
    .withColumn("D", F.col("C") - F.col("B"))
    .withColumn("C",
                F.when(F.lag("C").over(w).isNotNull(),
                       F.lag("D").over(w) - F.lag("C").over(w))
                 .otherwise(F.col("C")))
    .withColumn("D", F.col("C") - F.col("B"))
)

How to display mismatched report with a label in spark 1.6 - scala except function?

Consider there are 2 dataframes df1 and df2.
df1 has below data
A | B
-------
1 | m
2 | n
3 | o
df2 has below data
A | B
-------
1 | m
2 | n
3 | p
df1.except(df2) returns
A | B
-------
3 | o
3 | p
How to display the result as below?
df1: 3 | o
df2: 3 | p
As per the API docs, df1.except(df2) returns a new DataFrame containing rows in this frame but not in the other frame, i.e. it returns rows that are in df1 and not in df2. Thus a custom except function could be written as:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

def except(df1: DataFrame, df2: DataFrame): DataFrame = {
  val edf1 = df1.except(df2).withColumn("df", lit("df1"))
  val edf2 = df2.except(df1).withColumn("df", lit("df2"))
  edf1.union(edf2)
}
//Output
+---+---+---+
| A| B| df|
+---+---+---+
| 3| o|df1|
| 3| p|df2|
+---+---+---+

What is the efficient way to create Spark DataFrame in Scala with array type columns from another DataFrame that does not have an array column?

Suppose I have the following dataframe:
id | col1 | col2
-----------------
x  | p1   | a1
x  | p2   | b1
y  | p2   | b2
y  | p2   | b3
y  | p3   | c1
The distinct values from col1, which are (p1, p2, p3), along with id, will be used as columns for the final dataframe. Here, the id y has two col2 values (b2 and b3) for the same col1 value p2, so p2 will be treated as an array type column.
Therefore, the final dataframe will be
id | p1   | p2       | p3
----------------------------
x  | a1   | [b1]     | null
y  | null | [b2, b3] | c1
How can I achieve the second dataframe efficiently from the first dataframe?
You are basically looking for table pivoting; for your case, groupBy id, pivot col1 as headers, and aggregate col2 as a list using the collect_list function:
df.groupBy("id").pivot("col1").agg(collect_list("col2")).show
+---+----+--------+----+
| id| p1| p2| p3|
+---+----+--------+----+
| x|[a1]| [b1]| []|
| y| []|[b2, b3]|[c1]|
+---+----+--------+----+
If it's guaranteed that there's at most one value in p1 and p3 for each id, you can convert those columns to String type by getting the first item of the array:
df.groupBy("id").pivot("col1").agg(collect_list("col2"))
.withColumn("p1", $"p1"(0)).withColumn("p3", $"p3"(0))
.show
+---+----+--------+----+
| id| p1| p2| p3|
+---+----+--------+----+
| x| a1| [b1]|null|
| y|null|[b2, b3]| c1|
+---+----+--------+----+
If you need to convert the column types dynamically, i.e. only use array-type columns when you have to:
// get array type columns
val arrayColumns = df.groupBy("id", "col1").agg(count("*").as("N"))
  .where($"N" > 1).select("col1").distinct.collect.map(row => row.getString(0))
// arrayColumns: Array[String] = Array(p2)

// aggregate / pivot data frame
val aggDf = df.groupBy("id").pivot("col1").agg(collect_list("col2"))
// aggDf: org.apache.spark.sql.DataFrame = [id: string, p1: array<string> ... 2 more fields]

// get string columns
val stringColumns = aggDf.columns.filter(x => x != "id" && !arrayColumns.contains(x))

// use foldLeft on the string columns to convert them to string type
stringColumns.foldLeft(aggDf)((df, x) => df.withColumn(x, col(x)(0))).show
+---+----+--------+----+
| id| p1| p2| p3|
+---+----+--------+----+
| x| a1| [b1]|null|
| y|null|[b2, b3]| c1|
+---+----+--------+----+

Merging multiple maps with map value as custom case class instance

I want to merge multiple maps using Spark/Scala. The maps have a case class instance as value.
Following is the relevant code:
case class SampleClass(value1: Int, value2: Int)

val sampleDataDs = Seq(
  ("a", 25, Map(1 -> SampleClass(1, 2))),
  ("a", 32, Map(1 -> SampleClass(3, 4), 2 -> SampleClass(1, 2))),
  ("b", 16, Map(1 -> SampleClass(1, 2))),
  ("b", 18, Map(2 -> SampleClass(10, 15)))).toDF("letter", "number", "maps")
Output:
+------+-------+--------------------------+
|letter|number |maps |
+------+-------+--------------------------+
|a | 25 | [1-> [1,2]] |
|a | 32 | [1-> [3,4], 2 -> [1,2]] |
|b | 16 | [1 -> [1,2]] |
|b | 18 | [2 -> [10,15]] |
+------+-------+--------------------------+
I want to group the data based on the "letter" column so that the final dataset should have the below expected final output:
+------+---------------------------------+
|letter| maps |
+------+---------------------------------+
|a | [1-> [4,6], 2 -> [1,2]] |
|b | [1-> [1,2], 2 -> [10,15]] |
+------+---------------------------------+
I tried to group by "letter" and then apply a UDF to aggregate the values in the map. Below is what I tried:
val aggregatedDs = sampleDataDs.groupBy("letter").agg(collect_list(sampleDataDs("maps")).alias("mapList"))
Output:
+------+-----------------------------------------+
|letter|mapList                                  |
+------+-----------------------------------------+
|a     |[[1 -> [1,2]], [1 -> [3,4], 2 -> [1,2]]] |
|b     |[[1 -> [1,2]], [2 -> [10,15]]]           |
+------+-----------------------------------------+
After this I tried to write a UDF to merge the output of collect_list and get the final output:
def mergeMap = udf { valSeq: Seq[Map[Int, SampleClass]] =>
  valSeq.flatten.groupBy(_._1).mapValues(x =>
    (x.map(_._2.value1).reduce(_ + _), x.map(_._2.value2).reduce(_ + _)))
}
val aggMapDs = aggregatedDs.withColumn("aggValues",mergeMap(col("mapList")))
However it fails with the error message:
Failed to execute user defined function
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to SampleClass
My Spark version is 2.3.1. Any ideas how I can get the expected final output?
The problem is due to the UDF not being able to accept the case class as input. Spark dataframes will internally represent your case class as a Row object. The problem can thus be avoided by changing the UDF input type as follows:
val mergeMap = udf((valSeq: Seq[Map[Int, Row]]) => {
  valSeq.flatten
    .groupBy(_._1)
    .mapValues(x =>
      SampleClass(
        x.map(_._2.getAs[Int]("value1")).reduce(_ + _),
        x.map(_._2.getAs[Int]("value2")).reduce(_ + _)
      )
    )
})
Notice above that some minor additional changes are necessary to handle the Row object.
Running this code will result in:
val aggMapDs = aggregatedDs.withColumn("aggValues",mergeMap(col("mapList")))
+------+----------------------------------------------+-----------------------------+
|letter|mapList |aggValues |
+------+----------------------------------------------+-----------------------------+
|b |[Map(1 -> [1,2]), Map(2 -> [10,15])] |Map(2 -> [10,15], 1 -> [1,2])|
|a |[Map(1 -> [1,2]), Map(1 -> [3,4], 2 -> [1,2])]|Map(2 -> [1,2], 1 -> [4,6]) |
+------+----------------------------------------------+-----------------------------+
There is a slight difference between Dataframe and Dataset.
Dataset takes on two distinct APIs characteristics: a strongly-typed API and an untyped API, as shown in the table below. Conceptually, consider DataFrame as an alias for a collection of generic objects Dataset[Row], where a Row is a generic untyped JVM object. Dataset, by contrast, is a collection of strongly-typed JVM objects, dictated by a case class you define in Scala or a class in Java
When you convert your Seq to a DataFrame, type information is lost:
val df: DataFrame = Seq(...).toDF() // <-- here
What you could have done instead is convert the Seq to a Dataset:
val typedDs: Dataset[(String, Int, Map[Int, SampleClass])] = Seq(...).toDS()
+---+---+--------------------+
| _1| _2| _3|
+---+---+--------------------+
| a| 25| [1 -> [1, 2]]|
| a| 32|[1 -> [3, 4], 2 -...|
| b| 16| [1 -> [1, 2]]|
| b| 18| [2 -> [10, 15]]|
+---+---+--------------------+
Because the top-level object in the Seq is a Tuple, Spark generates dummy column names.
Now you should pay attention to the return type; there are operations on a typed Dataset that lose the type information:
val untyped: DataFrame = typedDs
  .groupBy("_1")
  .agg(collect_list(typedDs("_3")).alias("mapList"))
In order to work with the typed API, you should explicitly define the types:
val aggregatedDs = sampleDataDs
  .groupBy("letter")
  .agg(collect_list(sampleDataDs("maps")).alias("mapList"))

val toTypedAgg: Dataset[(String, Array[Map[Int, SampleClass]])] = aggregatedDs
  .as[(String, Array[Map[Int, SampleClass]])] //<- here
Unfortunately, udf won't work as there are a limited number of types for which Spark can infer a schema.
toTypedAgg.withColumn("aggValues", mergeMap1(col("mapList"))).show()
Schema for type org.apache.spark.sql.Dataset[(String, Array[Map[Int,SampleClass]])] is not supported
What you could do instead is map over the Dataset:
val mapped = toTypedAgg.map(v => {
  (v._1, v._2.flatten.groupBy(_._1)
    .mapValues(x => (x.map(_._2.value1).sum, x.map(_._2.value2).sum)))
})
+---+----------------------------+
|_1 |_2 |
+---+----------------------------+
|b |[2 -> [10, 15], 1 -> [1, 2]]|
|a |[2 -> [1, 2], 1 -> [4, 6]] |
+---+----------------------------+

Doing left outer join on multiple data frames in spark scala

I am a newbie in Spark. I am trying to achieve the below use case using Scala.
-DataFrame 1
| col A | col B |
-----------------
| 1 | a |
| 2 | a |
| 3 | a |
-DataFrame 2
| col A | col B |
-----------------
| 1 | b |
| 3 | b |
-DataFrame 3
| col A | col B |
-----------------
| 2 | c |
| 3 | c |
The final output frame should be:
| col A | col B |
-----------------
| 1 | a,b |
| 2 | a,c |
| 3 | a,b,c |
The number of frames is not limited to 3; it can be any number less than 100. So I am using a foreach in which I print each of the data frames.
Can someone please help me create the final data frame, with output in the above format, from N data frames?
I appreciate your help.
I saw this question today. I suggest that you use Python to solve it; it's easier to write than Scala. Here it is:
from pyspark.sql import SQLContext
from pyspark.sql.functions import concat_ws
d1 = sc.parallelize([(1, "a"), (2, "a"), (3, "a")]).toDF().toDF("Col_A", "Col_B")
d2 = sc.parallelize([(1, "b"), (2, "b")]).toDF().toDF("Col_A", "Col_B")
d3 = sc.parallelize([(2, "c"), (3, "c")]).toDF().toDF("Col_A", "Col_B")
d4 = (d1.join(d2, 'Col_A', 'left').join(d3, 'Col_A', 'left')
        .select(d1.Col_A.alias("col A"),
                concat_ws(',', d1.Col_B, d2.Col_B, d3.Col_B).alias("col B")))
d4.show()
+-----+-----+
|col A|col B|
+-----+-----+
|    1|  a,b|
|    2|a,b,c|
|    3|  a,c|
+-----+-----+
You see the result!
You can use foldLeft to iteratively merge the data with outer joins:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

val df1 = Seq((1, "a"), (2, "a"), (3, "a")).toDF("Col A", "Col B")
val df2 = Seq((1, "b"), (2, "b")).toDF("Col A", "Col B")
val df3 = Seq((2, "c"), (3, "c")).toDF("Col A", "Col B")

val dfs = Seq(df2, df3)
val bs = (0 to dfs.size).map(i => s"Col B $i")

dfs.foldLeft(df1)(
  (acc, df) => acc.join(df, Seq("Col A"), "fullouter")
).toDF("Col A" +: bs: _*).select($"Col A", array(bs map col: _*)).map {
  case Row(a: Int, bs: Seq[_]) =>
    // Drop nulls and concat
    (a, bs.filter(_ != null).map(_.toString).mkString(","))
}.toDF("Col A", "Col B").show
// +-----+-----+
// |Col A|Col B|
// +-----+-----+
// | 1| a,b|
// | 3| a,c|
// | 2|a,b,c|
// +-----+-----+
But if you really think that
it can be any number less than 100
then it is just unrealistic: join is one of the most expensive operations in Spark, and even with all the optimizer improvements, chaining that many joins is just not going to work well.
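If every frame shares the same two columns, one way to sidestep the join chain entirely (not from the original answers, just a sketch of the union-then-aggregate idea, using the Col_A / Col_B names from the Python answer above) is:
from functools import reduce

from pyspark.sql import functions as F

# dfs is assumed to be the list of all input dataframes, e.g. [d1, d2, d3]
unioned = reduce(lambda a, b: a.unionByName(b), dfs)

result = (unioned.groupBy("Col_A")
                 .agg(F.concat_ws(",", F.collect_list("Col_B")).alias("Col_B")))
result.show()
This trades the N-1 joins for a single shuffle; the caveat is that collect_list does not guarantee the order of the concatenated values, unlike the left-join chain.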