convert each row of dataframe to a map - scala

I have a dataframe with columns A & B of type String. Let's assume the below dataframe
+---+---+
|A  |B  |
+---+---+
|1a |1b |
|2a |2b |
+---+---+
I want to add a third column that creates a map of the A & B columns:
+---+---+--------------+
|A  |B  |C             |
+---+---+--------------+
|1a |1b |{A->1a, B->1b}|
|2a |2b |{A->2a, B->2b}|
+---+---+--------------+
I'm attempting to do it the following way. I have a udf which takes in a dataframe and returns a map:
val test = udf((dataFrame: DataFrame) => {
  val result = new mutable.HashMap[String, String]
  dataFrame.columns.foreach(col => {
    result.put(col, dataFrame(col).asInstanceOf[String])
  })
  result
})
I'm calling this udf in the following way, which throws a RuntimeException because I'm trying to pass a Dataset as a literal:
df.withColumn("C", Helper.test(lit(df.select(df.columns.head, df.columns.tail: _*))))
I don't want to pass df("A") and df("B") to my helper udf explicitly; I want it to take a generic list of columns that I could select.
Any pointers?

map way
You can just use the built-in map function:
import org.apache.spark.sql.functions._
val columns = df.columns
df.withColumn("C", map(columns.flatMap(x => Array(lit(x), col(x))): _*)).show(false)
which should give you
+---+---+---------------------+
|A |B |C |
+---+---+---------------------+
|1a |1b |Map(A -> 1a, B -> 1b)|
|2a |2b |Map(A -> 2a, B -> 2b)|
+---+---+---------------------+
Udf way
Or you can define your udf as:
//collecting column names to be used in the udf
val columns = df.columns
//defining the udf function
import org.apache.spark.sql.functions._
def createMapUdf = udf((names: Seq[String], values: Seq[String]) => names.zip(values).toMap)
//calling udf function
df.withColumn("C", createMapUdf(array(columns.map(x => lit(x)): _*), array(col("A"), col("B")))).show(false)
I hope the answer is helpful

@Ramesh Maharjan - Your answer is already great; mine just makes your UDF version fully dynamic as well, using string interpolation.
Column D shows the dynamic version.
df.withColumn("C", createMapUdf(array(columns.map(x => lit(x)): _*),
array(col("A"), col("B"))))
.withColumn("D", createMapUdf(array(columns.map(x => lit(x)): _*),
array(columns.map(x => col(s"$x") ): _* ))).show()
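If you are on Spark 2.4 or later, a UDF-free sketch of the same dynamic map using the built-in map_from_arrays function (column E is just a hypothetical name for illustration):
import org.apache.spark.sql.functions.{array, col, lit, map_from_arrays}

val columns = df.columns
// keys are the column names as literals, values are the corresponding column values
df.withColumn("E", map_from_arrays(
  array(columns.map(x => lit(x)): _*),
  array(columns.map(x => col(x)): _*)
)).show(false)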

Related

Pass arguments to a udf from columns present in a list of strings

I have a list of strings which represent column names inside a dataframe.
I want to pass the arguments from these columns to a udf. How can I do this in Spark Scala?
val actualDF = Seq(
  ("beatles", "help|hey jude", "sad", 4),
  ("romeo", "eres mia", "old school", 56)
).toDF("name", "hit_songs", "genre", "xyz")
val column_list: List[String] = List("hit_songs","name","genre")
// example udf
val testudf = org.apache.spark.sql.functions.udf((s1: String, s2: String) => {
  // let's say I want to concat all values
  s1 + s2
})
val finalDF = actualDF.withColumn("test_res",testudf(col(column_list(0))))
From the above example, I want to pass my list column_list to a udf. I am not sure how to pass a complete list of strings representing column names, though for a single element I can do it with col(column_list(0)). Please advise.
Replace
testudf(col(column_list(0)))
with
testudf(column_list.map(col): _*)
This maps each name to a Column and expands the list into individual input arguments (the udf's arity has to match the length of the list).
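For a complete, runnable sketch under that assumption (the three-argument UDF and the "|" separator are hypothetical, chosen only so the arity matches the three names in column_list):
import org.apache.spark.sql.functions.{col, udf}

// hypothetical UDF with one String parameter per entry in column_list
val concatUdf = udf((s1: String, s2: String, s3: String) => Seq(s1, s2, s3).mkString("|"))

// map each column name to a Column and expand the list into individual arguments
val finalDF = actualDF.withColumn("test_res", concatUdf(column_list.map(col): _*))
finalDF.show(false)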
hit_songs is of type Seq[String], so you need to change the first parameter of your udf to Seq[String].
scala> singersDF.show(false)
+-------+-------------+----------+
|name |hit_songs |genre |
+-------+-------------+----------+
|beatles|help|hey jude|sad |
|romeo |eres mia |old school|
+-------+-------------+----------+
scala> actualDF.show(false)
+-------+----------------+----------+
|name |hit_songs |genre |
+-------+----------------+----------+
|beatles|[help, hey jude]|sad |
|romeo |[eres mia] |old school|
+-------+----------------+----------+
scala> column_list
res27: List[String] = List(hit_songs, name)
Change your UDF like below.
// s1 is of type Seq[String]
val testudf = udf((s1: Seq[String], s2: String) => {
  s1.mkString.concat(s2)
})
Applying UDF
scala> actualDF
.withColumn("test_res",testudf(col(column_list.head),col(column_list.last)))
.show(false)
+-------+----------------+----------+-------------------+
|name |hit_songs |genre |test_res |
+-------+----------------+----------+-------------------+
|beatles|[help, hey jude]|sad |helphey judebeatles|
|romeo |[eres mia] |old school|eres miaromeo |
+-------+----------------+----------+-------------------+
Without UDF
scala> actualDF.withColumn("test_res",concat_ws("",$"name",$"hit_songs")).show(false) // Without UDF.
+-------+----------------+----------+-------------------+
|name |hit_songs |genre |test_res |
+-------+----------------+----------+-------------------+
|beatles|[help, hey jude]|sad |beatleshelphey jude|
|romeo |[eres mia] |old school|romeoeres mia |
+-------+----------------+----------+-------------------+
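If you want the UDF-free version to be driven by the column-name list rather than hard-coded columns, a small sketch (concat_ws flattens array columns such as hit_songs on its own, as the example above shows):
import org.apache.spark.sql.functions.{col, concat_ws}

// same concatenation, but driven entirely by the list of column names
actualDF.withColumn("test_res", concat_ws("", column_list.map(col): _*)).show(false)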

Merging rows into a single struct column in spark scala has efficiency problems, how do we do it better?

I am trying to speed up and limit the cost of taking several columns and their values and inserting them into a map in the same row. This is a requirement because we have a legacy system that is reading from this job and it isn't yet ready to be refactored. There is also another map with some data that needs to be combined with this.
Currently we have a few solutions, all of which seem to result in about the same run time on the same cluster, with around 1TB of data stored in Parquet:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.json4s._
import org.json4s.jackson.JsonMethods._
import spark.implicits._
def jsonToMap(s: String, map: Map[String, String]): Map[String, String] = {
  implicit val formats = org.json4s.DefaultFormats
  val jsonMap = if (!s.isEmpty) {
    parse(s).extract[Map[String, String]]
  } else {
    Map[String, String]()
  }
  if (map != null) {
    map ++ jsonMap
  } else {
    jsonMap
  }
}
val udfJsonToMap = udf(jsonToMap _)

def addMap(key: String, value: String, map: Map[String, String]): Map[String, String] = {
  if (map == null) {
    Map(key -> value)
  } else {
    map + (key -> value)
  }
}
val addMapUdf = udf(addMap _)

val output = raw.columns.foldLeft(raw.withColumn("allMap", typedLit(Map.empty[String, String]))) { (memoDF, colName) =>
  if (colName.startsWith("columnPrefix/")) {
    memoDF.withColumn("allMap", when(col(colName).isNotNull, addMapUdf(substring_index(lit(colName), "/", -1), col(colName), col("allMap"))))
  } else if (colName.equals("originalMap")) {
    memoDF.withColumn("allMap", when(col(colName).isNotNull, udfJsonToMap(col(colName), col("allMap"))))
  } else {
    memoDF
  }
}
This takes about 1 hour on 9 m5.xlarge instances.
val resourceTagColumnNames = raw.columns.filter(colName => colName.startsWith("columnPrefix/"))

def structToMap: Row => Map[String, String] = { row =>
  row.getValuesMap[String](resourceTagColumnNames)
}
val structToMapUdf = udf(structToMap)

val experiment = raw
  .withColumn("allStruct", struct(resourceTagColumnNames.head, resourceTagColumnNames.tail: _*))
  .select("allStruct")
  .withColumn("allMap", structToMapUdf(col("allStruct")))
  .select("allMap")
This also runs in about 1 hour on the same cluster.
This code all works, but it isn't fast enough: it takes about 10 times longer than every other transform we have right now, and it is a bottleneck for us.
Is there another way to get this result that is more efficient?
Edit: I have also tried limiting the data by a key; however, because the values in the columns I am merging can change even when the key stays the same, I cannot limit the data size without risking data loss.
TL;DR: using only Spark SQL built-in functions can significantly speed up the computation.
As explained in this answer, Spark SQL native functions are more performant than user-defined functions, so we can try to implement a solution to your problem using only Spark SQL native functions.
I show two main implementations. One uses all the SQL functions available in the latest version of Spark at the time of writing this answer, which is Spark 3.0. The other uses only SQL functions that existed in the Spark version current when the question was asked, i.e. Spark 2.3; all the functions used in that version are also available in Spark 2.2.
Spark 3.0 implementation with sql functions
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{MapType, StringType}

val mapFromPrefixedColumns = map_filter(
  map(raw.columns.filter(_.startsWith("columnPrefix/")).flatMap(c => Seq(lit(c.dropWhile(_ != '/').tail), col(c))): _*),
  (_, v) => v.isNotNull
)
val mapFromOriginalMap = when(col("originalMap").isNotNull && col("originalMap").notEqual(""),
  from_json(col("originalMap"), MapType(StringType, StringType))
).otherwise(
  map()
)
val comprehensiveMapExpr = map_concat(mapFromPrefixedColumns, mapFromOriginalMap)

raw.withColumn("allMap", comprehensiveMapExpr)
Spark 2.2 implementation with sql functions
In Spark 2.2, we don't have the functions map_concat (available in Spark 2.4) and map_filter (available in Spark 3.0), so I replace them with user-defined functions:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{MapType, StringType}

def filterNull(map: Map[String, String]): Map[String, String] = map.toSeq.filter(_._2 != null).toMap
val filter_null_udf = udf(filterNull _)

def mapConcat(map1: Map[String, String], map2: Map[String, String]): Map[String, String] = map1 ++ map2
val map_concat_udf = udf(mapConcat _)

val mapFromPrefixedColumns = filter_null_udf(
  map(raw.columns.filter(_.startsWith("columnPrefix/")).flatMap(c => Seq(lit(c.dropWhile(_ != '/').tail), col(c))): _*)
)
val mapFromOriginalMap = when(col("originalMap").isNotNull && col("originalMap").notEqual(""),
  from_json(col("originalMap"), MapType(StringType, StringType))
).otherwise(
  map()
)
val comprehensiveMapExpr = map_concat_udf(mapFromPrefixedColumns, mapFromOriginalMap)

raw.withColumn("allMap", comprehensiveMapExpr)
Implementation with sql functions without json mapping
The last part of the question contains simplified code without the mapping of the JSON column and without filtering of null values in the result map. I created the following implementation for this specific case. As it doesn't use any function added between Spark 2.2 and Spark 3.0, I don't need two versions of it:
import org.apache.spark.sql.functions._
val mapFromPrefixedColumns = map(raw.columns.filter(_.startsWith("columnPrefix/")).flatMap(c => Seq(lit(c), col(c))): _*)
raw.withColumn("allMap", mapFromPrefixedColumns)
Run
For the following dataframe as input:
+--------------------+--------------------+--------------------+----------------+
|columnPrefix/column1|columnPrefix/column2|columnPrefix/column3|originalMap |
+--------------------+--------------------+--------------------+----------------+
|a |1 |x |{"column4": "k"}|
|b |null |null |null |
|c |null |null |{} |
|null |null |null |null |
|d |2 |null | |
+--------------------+--------------------+--------------------+----------------+
I obtain the following allMap column:
+--------------------------------------------------------+
|allMap |
+--------------------------------------------------------+
|[column1 -> a, column2 -> 1, column3 -> x, column4 -> k]|
|[column1 -> b] |
|[column1 -> c] |
|[] |
|[column1 -> d, column2 -> 2] |
+--------------------------------------------------------+
And for the version without the JSON mapping:
+---------------------------------------------------------------------------------+
|allMap |
+---------------------------------------------------------------------------------+
|[columnPrefix/column1 -> a, columnPrefix/column2 -> 1, columnPrefix/column3 -> x]|
|[columnPrefix/column1 -> b, columnPrefix/column2 ->, columnPrefix/column3 ->] |
|[columnPrefix/column1 -> c, columnPrefix/column2 ->, columnPrefix/column3 ->] |
|[columnPrefix/column1 ->, columnPrefix/column2 ->, columnPrefix/column3 ->] |
|[columnPrefix/column1 -> d, columnPrefix/column2 -> 2, columnPrefix/column3 ->] |
+---------------------------------------------------------------------------------+
Benchmark
I generated a CSV file of 10 million lines, uncompressed (about 800 MB), containing one column without the column prefix, nine columns with the column prefix, and one column containing JSON as a string:
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------+
|id |columnPrefix/column1|columnPrefix/column2|columnPrefix/column3|columnPrefix/column4|columnPrefix/column5|columnPrefix/column6|columnPrefix/column7|columnPrefix/column8|columnPrefix/column9|originalMap |
+---+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+-------------------+
|1 |iwajedhor |ijoefzi |der |ob |galsu |ril |le |zaahuz |fuzi |{"column10":"true"}|
|2 |ofo |davfiwir |lebfim |roapej |lus |roum |te |javes |karutare |{"column10":"true"}|
|3 |jais |epciel |uv |piubnak |saajo |doke |ber |pi |igzici |{"column10":"true"}|
|4 |agami |zuhepuk |er |pizfe |lafudbo |zan |hoho |terbauv |ma |{"column10":"true"}|
...
The benchmark is to read this CSV file, create the allMap column, and write this column to Parquet. I ran this on my local machine and obtained the following results:
+--------------------------+--------------------+-------------------------+-------------------------+
| implementations | current (with udf) | sql functions spark 3.0 | sql functions spark 2.2 |
+--------------------------+--------------------+-------------------------+-------------------------+
| execution time | 138 seconds | 48 seconds | 82 seconds |
| improvement from current | 0 % faster | 64 % faster | 40 % faster |
+--------------------------+--------------------+-------------------------+-------------------------+
I also ran the benchmark against the second implementation in the question, which drops the mapping of the JSON column and the filtering of null values in the map.
+--------------------------+-----------------------+------------------------------------+
| implementations | current (with struct) | sql functions without json mapping |
+--------------------------+-----------------------+------------------------------------+
| execution time | 46 seconds | 35 seconds |
| improvement from current | 0 % | 23 % faster |
+--------------------------+-----------------------+------------------------------------+
Of course, the benchmark is rather basic, but we can see a clear improvement compared to the implementations that use user-defined functions.
Conclusion
When you have a performance issue and you use user-defined functions, it can be a good idea to try replacing those user-defined functions with Spark SQL functions.
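If you want to check where UDFs still appear in your own pipeline, a quick sketch (assuming the raw DataFrame and comprehensiveMapExpr from above) is to look at the physical plan:
// UDF-based columns typically show up as UDF(...) / ScalaUDF nodes in the plan,
// while native functions compile down to Catalyst expressions that benefit from codegen
raw.withColumn("allMap", comprehensiveMapExpr).explain(true)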

Scala Spark - split vector column into separate columns in a Spark DataFrame

I have a Spark DataFrame where I have a column with Vector values. The vector values are all n-dimensional, aka with the same length. I also have a list of column names Array("f1", "f2", "f3", ..., "fn"), each corresponds to one element in the vector.
some_columns... | Features
... | [0,1,0,..., 0]
to
some_columns... | f1 | f2 | f3 | ... | fn
... | 0 | 1 | 0 | ... | 0
What is the best way to achieve this? I thought of one way, which is to create a new DataFrame with createDataFrame(Row(Features), featureNameList) and then join with the old one, but that requires a SparkContext to use createDataFrame; I only want to transform the existing data frame. I also know about .withColumn("fi", value), but what do I do if n is large?
I'm new to Scala and Spark and couldn't find any good examples for this. I think this can be a common task. My particular case is that I used the CountVectorizer and wanted to recover each column individually for better readability instead of only having the vector result.
One way could be to convert the vector column to an array<double> and then use getItem to extract the individual elements.
import org.apache.spark.sql.functions._
import org.apache.spark.ml._
val df = Seq( (1 , linalg.Vectors.dense(1,0,1,1,0) ) ).toDF("id", "features")
//df: org.apache.spark.sql.DataFrame = [id: int, features: vector]
df.show
//+---+---------------------+
//|id |features |
//+---+---------------------+
//|1 |[1.0,0.0,1.0,1.0,0.0]|
//+---+---------------------+
// A UDF to convert VectorUDT to ArrayType
val vecToArray = udf( (xs: linalg.Vector) => xs.toArray )
// Add a ArrayType Column
val dfArr = df.withColumn("featuresArr" , vecToArray($"features") )
// Array of element names that need to be fetched
// ArrayIndexOutOfBounds is not checked.
// sizeof `elements` should be equal to the number of entries in column `features`
val elements = Array("f1", "f2", "f3", "f4", "f5")
// Create a SQL-like expression using the array
val sqlExpr = elements.zipWithIndex.map{ case (alias, idx) => col("featuresArr").getItem(idx).as(alias) }
// Extract Elements from dfArr
dfArr.select(sqlExpr : _*).show
//+---+---+---+---+---+
//| f1| f2| f3| f4| f5|
//+---+---+---+---+---+
//|1.0|0.0|1.0|1.0|0.0|
//+---+---+---+---+---+
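If you happen to be on Spark 3.0 or later, a sketch that avoids the UDF by using the built-in vector_to_array (same df and elements as above):
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.sql.functions.col

// native vector-to-array conversion instead of the vecToArray udf
val dfArr2 = df.withColumn("featuresArr", vector_to_array(col("features")))
dfArr2.select(elements.zipWithIndex.map { case (alias, idx) =>
  col("featuresArr").getItem(idx).as(alias)
}: _*).show(false)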

How to fetch the value and type of each column of each row in a dataframe?

How can I convert a dataframe to a tuple that includes the datatype for each column?
I have a number of dataframes with varying sizes and types. I need to be able to determine the type and value of each column and row of a given dataframe so I can perform some actions that are type-dependent.
So for example say I have a dataframe that looks like:
+-------+-------+
| foo | bar |
+-------+-------+
| 12345 | fnord |
| 42 | baz |
+-------+-------+
I need to get
Seq(
(("12345", "Integer"), ("fnord", "String")),
(("42", "Integer"), ("baz", "String"))
)
or something similarly simple to iterate over and work with programmatically.
Thanks in advance and sorry for what is, I'm sure, a very noobish question.
If I understand your question correctly, then the following should be your solution.
val df = Seq(
  (12345, "fnord"),
  (42, "baz")
).toDF("foo", "bar")
This creates the dataframe you already have.
+-----+-----+
| foo| bar|
+-----+-----+
|12345|fnord|
| 42| baz|
+-----+-----+
The next step is to extract the dataType of each column from the schema of the dataframe into a list:
val fieldTypesList = df.schema.map(struct => struct.dataType)
The next step is to convert each dataframe row into a list of values and pair each value with the dataType from the list created above:
val dfList = df.rdd.map(row => row.toString().replace("[","").replace("]","").split(",").toList)
val tuples = dfList.map(list => list.map(value => (value, fieldTypesList(list.indexOf(value)))))
Now if we print it
tuples.foreach(println)
It would give
List((12345,IntegerType), (fnord,StringType))
List((42,IntegerType), (baz,StringType))
Which you can iterate over and work with programmatically
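Note that parsing row.toString breaks as soon as a value contains a comma or a bracket; a more defensive sketch of the same idea reads the values and types directly from the schema:
// pair each row value with the corresponding field's dataType from the DataFrame schema
val fields = df.schema.fields
val typedRows = df.rdd.map { row =>
  fields.indices.map(i => (s"${row.get(i)}", fields(i).dataType)).toList
}
typedRows.collect().foreach(println)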

How to merge two columns of a `Dataframe` in Spark into one 2-Tuple?

I have a Spark DataFrame df with five columns. I want to add another column whose values are the tuple of the first and second columns. When using the withColumn() method, I get a type mismatch error, because the input is not of type Column but (Column, Column). Is there a solution besides running a for loop over the rows in this case?
var dfCol=(col1:Column,col2:Column)=>(col1,col2)
val vv = df.withColumn( "NewColumn", dfCol( df(df.schema.fieldNames(1)) , df(df.schema.fieldNames(2)) ) )
You can use the struct function, which creates a tuple of the provided columns:
import org.apache.spark.sql.functions.struct
val df = Seq((1,2), (3,4), (5,3)).toDF("a", "b")
df.withColumn("NewColumn", struct(df("a"), df("b")).show(false)
+---+---+---------+
|a |b |NewColumn|
+---+---+---------+
|1 |2 |[1,2] |
|3 |4 |[3,4] |
|5 |3 |[5,3] |
+---+---+---------+
You can use a User-defined function udf to achieve what you want.
UDF definition
object TupleUDFs {
  import org.apache.spark.sql.functions.udf
  // type tag is required, as we have a generic udf
  import scala.reflect.runtime.universe.{TypeTag, typeTag}

  def toTuple2[S: TypeTag, T: TypeTag] =
    udf[(S, T), S, T]((x: S, y: T) => (x, y))
}
Usage
df.withColumn(
"tuple_col", TupleUDFs.toTuple2[Int, Int].apply(df("a"), df("b"))
)
assuming "a" and "b" are the columns of type Int you want to put in a tuple.
You can merge multiple dataframe columns into one using array.
// $"*" will capture all existing columns
df.select($"*", array($"col1", $"col2").as("newCol"))
If you want to merge two dataframe columns into one column, just:
import org.apache.spark.sql.functions.array
df.withColumn("NewColumn", array("columnA", "columnB"))