I want to merge multiple maps using Spark/Scala. The maps have case class instances as values.
Following is the relevant code:
case class SampleClass(value1: Int, value2: Int)

val sampleDataDs = Seq(
  ("a", 25, Map(1 -> SampleClass(1, 2))),
  ("a", 32, Map(1 -> SampleClass(3, 4), 2 -> SampleClass(1, 2))),
  ("b", 16, Map(1 -> SampleClass(1, 2))),
  ("b", 18, Map(2 -> SampleClass(10, 15)))
).toDF("letter", "number", "maps")
Output:
+------+-------+--------------------------+
|letter|number |maps |
+------+-------+--------------------------+
|a | 25 | [1-> [1,2]] |
|a | 32 | [1-> [3,4], 2 -> [1,2]] |
|b | 16 | [1 -> [1,2]] |
|b | 18 | [2 -> [10,15]] |
+------+-------+--------------------------+
I want to group the data based on the "letter" column so that the final dataset should have the below expected final output:
+------+---------------------------------+
|letter| maps |
+------+---------------------------------+
|a | [1-> [4,6], 2 -> [1,2]] |
|b | [1-> [1,2], 2 -> [10,15]] |
+------+---------------------------------+
I tried to group by "letter" and then apply a UDF to aggregate the values in the map. Below is what I tried:
val aggregatedDs = sampleDataDs.groupBy("letter").agg(collect_list(sampleDataDs("maps")).alias("mapList"))
Output:
+------+----------------------------------------+
|letter| mapList |
+------+----------------------------------------+
|a | [[1-> [1,2]],[1-> [3,4], 2 -> [1,2]]] |
|b | [[1-> [1,2]],[2 -> [10,15]]] |
+------+----------------------------------------+
After this I tried to write a UDF to merge the output of collect_list and get the final output:
def mergeMap = udf { valSeq: Seq[Map[Int, SampleClass]] =>
  valSeq.flatten.groupBy(_._1).mapValues(x => (x.map(_._2.value1).reduce(_ + _), x.map(_._2.value2).reduce(_ + _)))
}
val aggMapDs = aggregatedDs.withColumn("aggValues",mergeMap(col("mapList")))
However it fails with the error message:
Failed to execute user defined function
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to SampleClass
My Spark version is 2.3.1. Any ideas how I can get the expected final output?
The problem is that the UDF cannot accept the case class as input. Spark DataFrames internally represent your case class as a Row object, so the problem can be avoided by changing the UDF input type as follows:
import org.apache.spark.sql.Row

val mergeMap = udf((valSeq: Seq[Map[Int, Row]]) => {
  valSeq.flatten
    .groupBy(_._1)
    .mapValues(x =>
      SampleClass(
        x.map(_._2.getAs[Int]("value1")).reduce(_ + _),
        x.map(_._2.getAs[Int]("value2")).reduce(_ + _)
      )
    )
})
Notice above that some minor additional changes are necessary to handle the Row object.
Running this code will result in:
val aggMapDs = aggregatedDs.withColumn("aggValues",mergeMap(col("mapList")))
+------+----------------------------------------------+-----------------------------+
|letter|mapList |aggValues |
+------+----------------------------------------------+-----------------------------+
|b |[Map(1 -> [1,2]), Map(2 -> [10,15])] |Map(2 -> [10,15], 1 -> [1,2])|
|a |[Map(1 -> [1,2]), Map(1 -> [3,4], 2 -> [1,2])]|Map(2 -> [1,2], 1 -> [4,6]) |
+------+----------------------------------------------+-----------------------------+
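If you only need the merged map in the final result, a last select drops the intermediate list and restores the original column name (a small follow-up using the names above):
// Keep only the grouping key and the merged map, renamed back to "maps"
val finalDs = aggMapDs.select(col("letter"), col("aggValues").alias("maps"))
finalDs.show(false)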
There is a slight difference between a DataFrame and a Dataset.
The Dataset API has two distinct characteristics: a strongly-typed API and an untyped API. Conceptually, consider a DataFrame as an alias for a collection of generic objects, Dataset[Row], where a Row is a generic untyped JVM object. A Dataset, by contrast, is a collection of strongly-typed JVM objects, dictated by a case class you define in Scala or a class in Java.
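This is literal in Spark's API: DataFrame is simply a type alias for Dataset[Row], so the two are interchangeable (a minimal illustration, assuming a SparkSession named spark):
import org.apache.spark.sql.{DataFrame, Dataset, Row}

// In the org.apache.spark.sql package object: type DataFrame = Dataset[Row]
val df: DataFrame = spark.emptyDataFrame
val asRows: Dataset[Row] = df // compiles without any conversion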
When you convert your Seq to a DataFrame, type information is lost.
val df: DataFrame = Seq(...).toDF() // <-- here
What you could have done instead is convert the Seq to a Dataset:
val typedDs: Dataset[(String, Int, Map[Int, SampleClass])] = Seq(...).toDS()
+---+---+--------------------+
| _1| _2| _3|
+---+---+--------------------+
| a| 25| [1 -> [1, 2]]|
| a| 32|[1 -> [3, 4], 2 -...|
| b| 16| [1 -> [1, 2]]|
| b| 18| [2 -> [10, 15]]|
+---+---+--------------------+
Because the top-level object in the Seq is a Tuple, Spark generates dummy column names.
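If you want proper column names while staying typed, one option is to wrap each element in a case class before calling toDS (a sketch; the Entry case class is introduced here only for illustration):
case class Entry(letter: String, number: Int, maps: Map[Int, SampleClass])

val namedDs: Dataset[Entry] = Seq(
  Entry("a", 25, Map(1 -> SampleClass(1, 2))),
  Entry("a", 32, Map(1 -> SampleClass(3, 4), 2 -> SampleClass(1, 2))),
  Entry("b", 16, Map(1 -> SampleClass(1, 2))),
  Entry("b", 18, Map(2 -> SampleClass(10, 15)))
).toDS()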
Now you should pay attention to the return type: there are functions on a typed Dataset that lose type information.
val untyped: DataFrame = typedDs
.groupBy("_1")
.agg(collect_list(typedDs("_3")).alias("mapList"))
In order to work with the typed API, you should explicitly define the types:
val aggregatedDs = sampleDataDs
.groupBy("letter")
.agg(collect_list(sampleDataDs("maps")).alias("mapList"))
val toTypedAgg: Dataset[(String, Array[Map[Int, SampleClass]])] = aggregatedDs
.as[(String, Array[Map[Int, SampleClass]])] //<- here
Unfortunately, a UDF won't work here, as there is only a limited set of types for which Spark can infer a schema.
toTypedAgg.withColumn("aggValues", mergeMap1(col("mapList"))).show()
Schema for type org.apache.spark.sql.Dataset[(String, Array[Map[Int,SampleClass]])] is not supported
What you could do instead is map over the Dataset:
val mapped = toTypedAgg.map(v => {
(v._1, v._2.flatten.groupBy(_._1).mapValues(x=>(x.map(_._2.value1).sum,x.map(_._2.value2).sum)))
})
+---+----------------------------+
|_1 |_2 |
+---+----------------------------+
|b |[2 -> [10, 15], 1 -> [1, 2]]|
|a |[2 -> [1, 2], 1 -> [4, 6]] |
+---+----------------------------+
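If you want SampleClass values and named columns back, you can map once more into a small result case class (a sketch; Aggregated is introduced here only for illustration):
case class Aggregated(letter: String, maps: Map[Int, SampleClass])

val typedResult: Dataset[Aggregated] = mapped.map { case (letter, sums) =>
  Aggregated(letter, sums.map { case (k, (v1, v2)) => k -> SampleClass(v1, v2) }.toMap)
}
typedResult.show(false)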
I have two DataFrames in Spark Scala, where the second column of each DataFrame is an array of numbers.
val data22= Seq((1,List(0.693147,0.6931471)),(2,List(0.69314, 0.0)),(3,List(0.0, 0.693147))).toDF("ID","tf_idf")
data22.show(truncate=false)
+---+---------------------+
|ID |tf_idf |
+---+---------------------+
|1 |[0.693, 0.702] |
|2 |[0.69314, 0.0] |
|3 |[0.0, 0.693147] |
+---+---------------------+
val data12= Seq((1,List(0.69314,0.6931471))).toDF("ID","tf_idf")
data12.show(truncate=false)
+---+--------------------+
|ID |tf_idf |
+---+--------------------+
|1 |[0.693, 0.805] |
+---+--------------------+
I need to perform the dot product between rows in these two DataFrames. That is, I need to multiply the tf_idf array in data12 with each row of tf_idf in data22.
(Ex: The first row in dot product should be like this : 0.693*0.693 + 0.702*0.805
Second row : 0.69314*0.693 + 0.0*0.805
Third row : 0.0*0.693 + 0.693147*0.805 )
Basically I want something (like matrix multiplication): data22 * transpose(data12).
I would be grateful if someone can suggest a method to do this in Spark Scala.
Thank you
Spark Version 2.4+: Use the array functions such as zip_with and aggregate, which give you simpler code. To follow your detailed description, I have changed the join into a crossJoin.
import org.apache.spark.sql.functions.expr

val data22 = Seq((1, List(0.693147, 0.6931471)), (2, List(0.69314, 0.0)), (3, List(0.0, 0.693147))).toDF("ID", "tf_idf")
val data12 = Seq((1, List(0.693, 0.805))).toDF("ID2", "tf_idf2")

val df = data22.crossJoin(data12).drop("ID2")
df.withColumn("DotProduct", expr("aggregate(zip_with(tf_idf, tf_idf2, (x, y) -> x * y), 0D, (sum, x) -> sum + x)")).show(false)
Here is the result.
+---+---------------------+--------------+-------------------+
|ID |tf_idf |tf_idf2 |DotProduct |
+---+---------------------+--------------+-------------------+
|1 |[0.693147, 0.6931471]|[0.693, 0.805]|1.0383342865 |
|2 |[0.69314, 0.0] |[0.693, 0.805]|0.48034601999999993|
|3 |[0.0, 0.693147] |[0.693, 0.805]|0.557983335 |
+---+---------------------+--------------+-------------------+
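If you prefer the Scala DSL over an expr() string, Spark 3.0 and later also exposes these as column functions; a sketch with the same logic:
import org.apache.spark.sql.functions.{aggregate, col, lit, zip_with}

val dotProduct = aggregate(
  zip_with(col("tf_idf"), col("tf_idf2"), (x, y) => x * y), // element-wise products
  lit(0.0),                                                 // start value of the accumulator
  (acc, x) => acc + x                                       // sum the products
)
df.withColumn("DotProduct", dotProduct).show(false)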
Another solution, using a UDF, is shown below:
scala> val data22= Seq((1,List(0.693147,0.6931471)),(2,List(0.69314, 0.0)),(3,List(0.0, 0.693147))).toDF("ID","tf_idf")
data22: org.apache.spark.sql.DataFrame = [ID: int, tf_idf: array<double>]
scala> val data12= Seq((1,List(0.69314,0.6931471))).toDF("ID","tf_idf")
data12: org.apache.spark.sql.DataFrame = [ID: int, tf_idf: array<double>]
scala> import scala.collection.mutable.WrappedArray
import scala.collection.mutable.WrappedArray

scala> val arrayDot = data12.take(1).map(row => (row.getAs[Int](0), row.getAs[WrappedArray[Double]](1).toSeq))
arrayDot: Array[(Int, Seq[Double])] = Array((1,WrappedArray(0.69314, 0.6931471)))
scala> val dotColumn = arrayDot(0)._2
dotColumn: Seq[Double] = WrappedArray(0.69314, 0.6931471)
scala> val dotUdf = udf((y: Seq[Double]) => y zip dotColumn map(z => z._1*z._2) reduce(_ + _))
dotUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,DoubleType,Some(List(ArrayType(DoubleType,false))))
scala> data22.withColumn("dotProduct", dotUdf('tf_idf)).show
+---+--------------------+-------------------+
| ID| tf_idf| dotProduct|
+---+--------------------+-------------------+
| 1|[0.693147, 0.6931...| 0.96090081381841|
| 2| [0.69314, 0.0]|0.48044305959999994|
| 3| [0.0, 0.693147]| 0.4804528329237|
+---+--------------------+-------------------+
Note that it multiplies the tf_idf array in data12 with each row of tf_idf in data22.
Let me know if it helps!!
I am trying to speed up and limit the cost of taking several columns and their values and inserting them into a map in the same row. This is a requirement because we have a legacy system that is reading from this job and it isn't yet ready to be refactored. There is also another map with some data that needs to be combined with this.
Currently we have a few solutions, all of which seem to result in about the same run time on the same cluster, with around 1 TB of data stored in Parquet:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.json4s._
import org.json4s.jackson.JsonMethods._
import spark.implicits._
def jsonToMap(s: String, map: Map[String, String]): Map[String, String] = {
implicit val formats = org.json4s.DefaultFormats
val jsonMap = if(!s.isEmpty){
parse(s).extract[Map[String, String]]
} else {
Map[String, String]()
}
if(map != null){
map ++ jsonMap
} else {
jsonMap
}
}
val udfJsonToMap = udf(jsonToMap _)
def addMap(key:String, value:String, map: Map[String,String]): Map[String,String] = {
if(map == null) {
Map(key -> value)
} else {
map + (key -> value)
}
}
val addMapUdf = udf(addMap _)
val output = raw.columns.foldLeft(raw.withColumn("allMap", typedLit(Map.empty[String, String]))) { (memoDF, colName) =>
if(colName.startsWith("columnPrefix/")){
memoDF.withColumn("allMap", when(col(colName).isNotNull, addMapUdf(substring_index(lit(colName), "/", -1), col(colName), col("allTagsMap")) ))
} else if(colName.equals("originalMap")){
memoDF.withColumn("allMap", when(col(colName).isNotNull, udfJsonToMap(col(colName), col("allMap"))))
} else {
memoDF
}
}
This takes about 1 h on 9 m5.xlarge instances.
val resourceTagColumnNames = raw.columns.filter(colName => colName.startsWith("columnPrefix/"))
def structToMap: Row => Map[String,String] = { row =>
row.getValuesMap[String](resourceTagColumnNames)
}
val structToMapUdf = udf(structToMap)
val experiment = raw
.withColumn("allStruct", struct(resourceTagColumnNames.head, resourceTagColumnNames.tail:_*))
.select("allStruct")
.withColumn("allMap", structToMapUdf(col("allStruct")))
.select("allMap")
This also runs in about 1 h on the same cluster.
This code all works, but it isn't fast enough: it takes about 10 times longer than every other transform we have right now, and it is a bottleneck for us.
Is there another way to get this result that is more efficient?
Edit: I have also tried limiting the data by a key; however, because the values in the columns I am merging can change even when the key remains the same, I cannot limit the data size without risking data loss.
TL;DR: using only Spark SQL built-in functions can significantly speed up the computation.
As explained in this answer, Spark SQL native functions are more performant than user-defined functions. So we can try to implement the solution to your problem using only Spark SQL native functions.
I show two main versions of the implementation: one using all the SQL functions existing in the latest version of Spark available at the time I wrote this answer, which is Spark 3.0, and another using only the SQL functions existing in the Spark version in use when the question was asked, Spark 2.3. All the functions used in this second version are also available in Spark 2.2.
Spark 3.0 implementation with SQL functions
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{MapType, StringType}
val mapFromPrefixedColumns = map_filter(
map(raw.columns.filter(_.startsWith("columnPrefix/")).flatMap(c => Seq(lit(c.dropWhile(_ != '/').tail), col(c))): _*),
(_, v) => v.isNotNull
)
val mapFromOriginalMap = when(col("originalMap").isNotNull && col("originalMap").notEqual(""),
from_json(col("originalMap"), MapType(StringType, StringType))
).otherwise(
map()
)
val comprehensiveMapExpr = map_concat(mapFromPrefixedColumns, mapFromOriginalMap)
raw.withColumn("allMap", comprehensiveMapExpr)
Spark 2.2 implementation with SQL functions
In Spark 2.2, we don't have the functions map_concat (available since Spark 2.4) and map_filter (available since Spark 3.0).
I replace them with user-defined functions:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{MapType, StringType}
def filterNull(map: Map[String, String]): Map[String, String] = map.toSeq.filter(_._2 != null).toMap
val filter_null_udf = udf(filterNull _)
def mapConcat(map1: Map[String, String], map2: Map[String, String]): Map[String, String] = map1 ++ map2
val map_concat_udf = udf(mapConcat _)
val mapFromPrefixedColumns = filter_null_udf(
map(raw.columns.filter(_.startsWith("columnPrefix/")).flatMap(c => Seq(lit(c.dropWhile(_ != '/').tail), col(c))): _*)
)
val mapFromOriginalMap = when(col("originalMap").isNotNull && col("originalMap").notEqual(""),
from_json(col("originalMap"), MapType(StringType, StringType))
).otherwise(
map()
)
val comprehensiveMapExpr = map_concat_udf(mapFromPrefixedColumns, mapFromOriginalMap)
raw.withColumn("allMap", comprehensiveMapExpr)
Implementation with SQL functions, without JSON mapping
The last part of the question contains a simplified code without mapping of the JSON column and without filtering of null values in the result map. I created the following implementation for this specific case. As I don't use functions that were added between Spark 2.2 and Spark 3.0, I don't need two versions of this implementation:
import org.apache.spark.sql.functions._
val mapFromPrefixedColumns = map(raw.columns.filter(_.startsWith("columnPrefix/")).flatMap(c => Seq(lit(c), col(c))): _*)
raw.withColumn("allMap", mapFromPrefixedColumns)
Run
For the following dataframe as input:
+--------------------+--------------------+--------------------+----------------+
|columnPrefix/column1|columnPrefix/column2|columnPrefix/column3|originalMap |
+--------------------+--------------------+--------------------+----------------+
|a |1 |x |{"column4": "k"}|
|b |null |null |null |
|c |null |null |{} |
|null |null |null |null |
|d |2 |null | |
+--------------------+--------------------+--------------------+----------------+
I obtain the following allMap column:
+--------------------------------------------------------+
|allMap |
+--------------------------------------------------------+
|[column1 -> a, column2 -> 1, column3 -> x, column4 -> k]|
|[column1 -> b] |
|[column1 -> c] |
|[] |
|[column1 -> d, column2 -> 2] |
+--------------------------------------------------------+
And for the mapping without json column:
+---------------------------------------------------------------------------------+
|allMap |
+---------------------------------------------------------------------------------+
|[columnPrefix/column1 -> a, columnPrefix/column2 -> 1, columnPrefix/column3 -> x]|
|[columnPrefix/column1 -> b, columnPrefix/column2 ->, columnPrefix/column3 ->] |
|[columnPrefix/column1 -> c, columnPrefix/column2 ->, columnPrefix/column3 ->] |
|[columnPrefix/column1 ->, columnPrefix/column2 ->, columnPrefix/column3 ->] |
|[columnPrefix/column1 -> d, columnPrefix/column2 -> 2, columnPrefix/column3 ->] |
+---------------------------------------------------------------------------------+
Benchmark
I generated a CSV file of 10 million lines, uncompressed (about 800 MB), containing one column without the column prefix, nine columns with the column prefix, and one column containing JSON as a string:
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------+
|id |columnPrefix/column1|columnPrefix/column2|columnPrefix/column3|columnPrefix/column4|columnPrefix/column5|columnPrefix/column6|columnPrefix/column7|columnPrefix/column8|columnPrefix/column9|originalMap |
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------+
|1 |iwajedhor |ijoefzi |der |ob |galsu |ril |le |zaahuz |fuzi |{"column10":"true"}|
|2 |ofo |davfiwir |lebfim |roapej |lus |roum |te |javes |karutare |{"column10":"true"}|
|3 |jais |epciel |uv |piubnak |saajo |doke |ber |pi |igzici |{"column10":"true"}|
|4 |agami |zuhepuk |er |pizfe |lafudbo |zan |hoho |terbauv |ma |{"column10":"true"}|
...
The benchmark is to read this CSV file, create the column allMap, and write this column to Parquet. I ran this on my local machine and obtained the following results:
+--------------------------+--------------------+-------------------------+-------------------------+
| implementations | current (with udf) | sql functions spark 3.0 | sql functions spark 2.2 |
+--------------------------+--------------------+-------------------------+-------------------------+
| execution time | 138 seconds | 48 seconds | 82 seconds |
| improvement from current | 0 % faster | 64 % faster | 40 % faster |
+--------------------------+--------------------+-------------------------+-------------------------+
I also ran the benchmark against the second implementation in the question, which drops the mapping of the JSON column and the filtering of null values in the map.
+--------------------------+-----------------------+------------------------------------+
| implementations | current (with struct) | sql functions without json mapping |
+--------------------------+-----------------------+------------------------------------+
| execution time | 46 seconds | 35 seconds |
| improvement from current | 0 % | 23 % faster |
+--------------------------+-----------------------+------------------------------------+
Of course, the benchmark is rather basic, but we can see an improvement compared to the implementations that use user-defined functions.
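For reference, a minimal timing harness along the lines described above could look like this (the paths, the header option and the label are illustrative assumptions, not the exact benchmark code; comprehensiveMapExpr is the column defined in the Spark 3.0 section, built from raw.columns):
def timed[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(s"$label: ${(System.nanoTime() - start) / 1e9} s")
  result
}

val raw = spark.read.option("header", "true").csv("/tmp/benchmark_input.csv") // hypothetical path

timed("sql functions spark 3.0") {
  raw.withColumn("allMap", comprehensiveMapExpr)
    .select("allMap")
    .write.mode("overwrite").parquet("/tmp/benchmark_output") // hypothetical path
}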
Conclusion
When you have a performance issue and you use user-defined functions, it can be a good idea to try to replace those user-defined functions by
spark sql functions
I have a dataframe with 3 columns: Number (Integer), Name (String), Color (String). Below is the result of df.show with the repartition option.
val df = sparkSession.read.format("csv").option("header", "true").option("inferschema", "true").option("delimiter", ",").option("decoding", "utf8").load(fileName).repartition(5).toDF()
+------+------+------+
|Number| Name| Color|
+------+------+------+
| 4|Orange|Orange|
| 3| Apple| Green|
| 1| Apple| Red|
| 2|Banana|Yellow|
| 5| Apple| Red|
+------+------+------+
My objective is to create a list of strings corresponding to each row, by replacing the tokens in a common dynamic string (which I am passing as a parameter to the method) with the column values.
For example: commonDynamicString = Column.Name with Column.Color color
In this string, my tokens are Column.Name and Column.Color. I need to replace these values for all the rows with the respective values in that column. Note: this string can change dynamically, hence hardcoding won't work.
I don't want to use RDD unless no other option is available with dataframe.
Below are the approaches I tried but couldn't achieve my objective.
Option 1:
val a = df.foreach(t => {
finalValue = commonString.replace("Column.Number", t.getAs[Any]("Number").toString())
.replace("DF.Name", t.getAs("Name"))
.replace("DF.Color", t.getAs("Color"))
println ("finalValue: " +finalValue)
})
With this approach, the finalValue prints as expected. However, I cannot create a ListBuffer or pass the final string from here as a list to another function, as foreach returns Unit and Spark throws an error.
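For what it is worth, if the strings are only needed on the driver, a map over the collected rows sidesteps the Unit return (a rough sketch of the Option 1 idea only; it collects everything to the driver, so it is not suitable for very large data):
// Collects all rows to the driver, then fills the template per row
val finalStrings: Seq[String] = df.collect().toSeq.map { t =>
  commonString
    .replace("Column.Number", t.getAs[Any]("Number").toString)
    .replace("Column.Name", t.getAs[String]("Name"))
    .replace("Column.Color", t.getAs[String]("Color"))
}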
Option 2: I am thinking about this option, but would need some guidance to understand whether foldLeft, a window, or any other Spark function can be used to create a 4th column called "Final" using withColumn, together with a UDF where I can extract all the tokens using regex pattern matching ("Column.\w+") and perform the replace operation for the tokens? (A rough sketch of this idea follows the expected output below.)
+------+------+------+--------------------------+
|Number| Name| Color| Final |
+------+------+------+--------------------------+
| 4|Orange|Orange|Orange with orange color |
| 3| Apple| Green|Apple with Green color |
| 1| Apple| Red|Apple with Red color |
| 2|Banana|Yellow|Banana with Yellow color |
| 5| Apple| Red|Apple with Red color |
+------+------+------+--------------------------+
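As a rough, illustrative sketch of that UDF idea (not the approach taken in the answer below; it assumes the template only references String columns such as Name and Color):
import org.apache.spark.sql.functions._

val template = "Column.Name with Column.Color color"
// Extract the referenced column names from the template, e.g. List("Name", "Color")
val tokens = """Column\.(\w+)""".r.findAllMatchIn(template).map(_.group(1)).toList

// The UDF receives the token column values in the same order as `tokens`
val fillTemplate = udf { values: Seq[String] =>
  tokens.zip(values).foldLeft(template) { case (acc, (t, v)) => acc.replace(s"Column.$t", v) }
}

df.withColumn("Final", fillTemplate(array(tokens.map(col): _*))).show(false)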
Can someone help me with this problem and also let me know if I am thinking in the right direction in using Spark for handling large datasets?
Thanks!
If I understand your requirement correctly, you can create a column method, say parseStatement, which takes a String-type statement and returns a Column with the following steps:
1. Parse the input statement to count the number of tokens
2. Generate a Regex pattern in the form of ^(.*?)(token1)(.*?)(token2) ... (.*?)$
3. Apply pattern matching to assemble a colList consisting of lit(g1), col(g2), lit(g3), col(g4), ..., where the g?s are the extracted Regex groups
4. Concatenate the Column-type items
Here's the sample code:
import spark.implicits._
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
def parseStatement(stmt: String): Column = {
val token = "Column."
val tokenPattern = """Column\.(\w+)"""
val literalPattern = "(.*?)"
val colCount = stmt.sliding(token.length).count(_ == token)
val pattern = (0 to colCount * 2).map{
case i if (i % 2 == 0) => literalPattern
case _ => tokenPattern
}.mkString
val colList = ("^" + pattern + "$").r.findAllIn(stmt).
matchData.toList.flatMap(_.subgroups).
zipWithIndex.map{
case (g, i) if (i % 2 == 0) => lit(g)
case (g, i) => col(g)
}
concat(colList: _*)
}
val df = Seq(
(4, "Orange", "Orange"),
(3, "Apple", "Green"),
(1, "Apple", "Red"),
(2, "Banana", "Yellow"),
(5, "Apple", "Red")
).toDF("Number", "Name", "Color")
val statement = "Column.Name with Column.Color color"
df.withColumn("Final", parseStatement(statement)).
show(false)
// +------+------+------+------------------------+
// |Number|Name |Color |Final |
// +------+------+------+------------------------+
// |4 |Orange|Orange|Orange with Orange color|
// |3 |Apple |Green |Apple with Green color |
// |1 |Apple |Red |Apple with Red color |
// |2 |Banana|Yellow|Banana with Yellow color|
// |5 |Apple |Red |Apple with Red color |
// +------+------+------+------------------------+
Note that concat takes Column-type parameters, hence the need for col() for column values and lit() for literals.
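For the sample statement, the assembled colList is effectively lit(""), col("Name"), lit(" with "), col("Color"), lit(" color"), so the generated column is equivalent to writing by hand:
df.withColumn("Final", concat(col("Name"), lit(" with "), col("Color"), lit(" color"))).show(false)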
I have a dataframe with columns A & B of type String. Let's assume the below dataframe
+----+----+
| A  | B  |
+----+----+
| 1a | 1b |
| 2a | 2b |
+----+----+
I want to add a third column that creates a map of the A & B columns:
+----+----+----------------+
| A  | B  | C              |
+----+----+----------------+
| 1a | 1b | {A->1a, B->1b} |
| 2a | 2b | {A->2a, B->2b} |
+----+----+----------------+
I'm attempting to do it the following way. I have a UDF which takes in a DataFrame and returns a map:
val test = udf((dataFrame: DataFrame) => {
val result = new mutable.HashMap[String, String]
dataFrame.columns.foreach(col => {
result.put(col, dataFrame(col).asInstanceOf[String])
})
result
})
I'm calling this UDF in the following way, which throws a RuntimeException as I'm trying to pass a Dataset as a literal:
df.withColumn("C", Helper.test(lit(df.select(df.columns.head, df.columns.tail: _*)))
I don't want to pass df('A') and df('B') to my helper UDF, as I want them to be a generic list of columns that I could select.
Any pointers?
Map way
You can just use the built-in map function as follows:
import org.apache.spark.sql.functions._
val columns = df.columns
df.withColumn("C", map(columns.flatMap(x => Array(lit(x), col(x))): _*)).show(false)
which should give you
+---+---+---------------------+
|A |B |C |
+---+---+---------------------+
|1a |1b |Map(A -> 1a, B -> 1b)|
|2a |2b |Map(A -> 2a, B -> 2b)|
+---+---+---------------------+
UDF way
Or you can define your UDF as follows:
//collecting column names to be used in the udf
val columns = df.columns
//defining udf function
import org.apache.spark.sql.functions._
def createMapUdf = udf((names: Seq[String], values: Seq[String])=> names.zip(values).toMap)
//calling udf function
df.withColumn("C", createMapUdf(array(columns.map(x => lit(x)): _*), array(col("A"), col("B")))).show(false)
I hope the answer is helpful
@Ramesh Maharjan - Your answers are already great; my answer just makes your UDF answer dynamic as well, using string interpolation.
Column D gives that in a dynamic way.
df.withColumn("C", createMapUdf(array(columns.map(x => lit(x)): _*),
array(col("A"), col("B"))))
.withColumn("D", createMapUdf(array(columns.map(x => lit(x)): _*),
array(columns.map(x => col(s"$x") ): _* ))).show()
I have the following data structure representing movie ids (first column) and ratings from different users for that movie in the rest of the columns, something like this:
+-------+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
|movieId| 1| 2| 3| 4| 5| 6| 7| 8| 9| 10| 11| 12| 13| 14| 15|
+-------+----+----+----+----+----+----+----+----+----+----+----+----+----+----+----+
| 1580|null|null| 3.5| 5.0|null|null|null|null|null|null|null|null|null|null|null|
| 3175|null|null|null|null|null|null|null|null|null|null|null|null|null| 5.0|null|
| 3794|null|null|null|null|null|null|null|null|null|null|null| 3.0|null|null|null|
| 2659|null|null|null| 3.0|null|null|null|null|null|null|null|null|null|null|null|
I want to convert this DataFrame to a Dataset of
final case class MovieRatings(movie_id: Long, ratings: Map[Long, Double])
So that it would be something like
[1580, [1 -> null, 2 -> null, 3 -> 3.5, 4 -> 5.0, 5 -> null, 6 -> null, 7 -> null,...]]
Etc.
How can this be done?
The thing here is that the number of users is arbitrary, and I want to zip those into a single column, leaving the first column untouched.
First, you have to transform your DataFrame into one with a schema matching your case class; then you can use .as[MovieRatings] to convert the DataFrame into a Dataset[MovieRatings]:
import org.apache.spark.sql.functions._
import spark.implicits._
// define a new MapType column using `functions.map`, passing a flattened-list of
// column name (as a Long column) and column value
val mapColumn: Column = map(df.columns.tail.flatMap(name => Seq(lit(name.toLong), $"$name")): _*)
// select movie id and map column with names matching the case class, and convert to Dataset:
df.select($"movieId" as "movie_id", mapColumn as "ratings")
.as[MovieRatings]
.show(false)
You can use spark.sql.functions.map to create a map from arbitrary columns. It expects a sequence alternating between keys and values, which can be Column types or Strings. Here is an example:
import spark.implicits._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions
case class Input(movieId: Int, a: Option[Double], b: Option[Double], c: Option[Double])
val data = Input(1, None, Option(3.5), Option(1.4)) ::
Input(2, Option(4.2), Option(1.34), None) ::
Input(3, Option(1.11), None, Option(3.32)) :: Nil
val df = sc.parallelize(data).toDF
// Exclude the PK column from the map
val mapKeys = df.columns.filterNot(_ == "movieId")
// Build the sequence of key, value, key, value, ..
val pairs = mapKeys.map(k => Seq(lit(k), col(k))).flatten
val mapped = df.select($"movieId", functions.map(pairs:_*) as "map")
mapped.show(false)
Produces this output:
+-------+------------------------------------+
|movieId|map |
+-------+------------------------------------+
|1 |Map(a -> null, b -> 3.5, c -> 1.4) |
|2 |Map(a -> 4.2, b -> 1.34, c -> null) |
|3 |Map(a -> 1.11, b -> null, c -> 3.32)|
+-------+------------------------------------+