Spark - Drop null values from map column - scala

I'm using Spark to read a CSV file and then gather all the fields to create a map. Some of the fields are empty and I'd like to remove them from the map.
So for a CSV that looks like this:
"animal", "colour", "age"
"cat" , "black" ,
"dog" , , "3"
I'd like to get a dataset with the following maps:
Map("animal" -> "cat", "colour" -> "black")
Map("animal" -> "dog", "age" -> "3")
This is what I have so far:
val csv_cols_n_vals: Array[Column] = csv.columns.flatMap { c => Array(lit(c), col(c)) }
sparkSession.read
.option("header", "true")
.csv(csvLocation)
.withColumn("allFieldsMap", map(csv_cols_n_vals: _*))
I've tried a few variations, but I can't seem to find the correct solution.

There is most certainly a better and more efficient way using the Dataframe API, but here is a map/flatmap solution:
val df = Seq(("cat", "black", null), ("dog", null, "3")).toDF("animal", "colour", "age")
val cols = df.columns
df.map(r => {
cols.flatMap( c => {
val v = r.getAs[String](c)
if (v != null) {
Some(Map(c -> v))
} else {
None
}
}).reduce(_ ++ _)
}).toDF("map").show(false)
Which produces:
+--------------------------------+
|map |
+--------------------------------+
|[animal -> cat, colour -> black]|
|[animal -> dog, age -> 3] |
+--------------------------------+

scala> df.show(false)
+------+------+----+
|animal|colour|age |
+------+------+----+
|cat |black |null|
|dog |null |3 |
+------+------+----+
Building Expressions
val colExpr = df
.columns // getting list of columns from dataframe.
.map{ columnName =>
when(
col(columnName).isNotNull, // checking if column is not null
map(
lit(columnName),
col(columnName)
) // Adding column name and its value inside map
)
.otherwise(map())
}
.reduce(map_concat(_,_))
// finally using map_concat function to concat map values.
Above code will create below expressions.
map_concat(
map_concat(
CASE WHEN (animal IS NOT NULL) THEN map(animal, animal) ELSE map() END,
CASE WHEN (colour IS NOT NULL) THEN map(colour, colour) ELSE map() END
),
CASE WHEN (age IS NOT NULL) THEN map(age, age) ELSE map() END
)
Applying colExpr on DataFrame.
scala>
df
.withColumn("allFieldsMap",colExpr)
.show(false)
+------+------+----+--------------------------------+
|animal|colour|age |allFieldsMap |
+------+------+----+--------------------------------+
|cat |black |null|[animal -> cat, colour -> black]|
|dog |null |3 |[animal -> dog, age -> 3] |
+------+------+----+--------------------------------+

Spark-sql solution:
val df = Seq(("cat", "black", null), ("dog", null, "3")).toDF("animal", "colour", "age")
df.show(false)
+------+------+----+
|animal|colour|age |
+------+------+----+
|cat |black |null|
|dog |null |3 |
+------+------+----+
df.createOrReplaceTempView("a_vw")
val cols_str = df.columns.flatMap( x => Array("\"".concat(x).concat("\""),x)).mkString(",")
spark.sql(s"""
select collect_list(m2) res from (
select id, key, value, map(key,value) m2 from (
select id, explode(m) as (key,value) from
( select monotonically_increasing_id() id, map(${cols_str}) m from a_vw )
)
where value is not null
) group by id
""")
.show(false)
+------------------------------------+
|res |
+------------------------------------+
|[[animal -> cat], [colour -> black]]|
|[[animal -> dog], [age -> 3]] |
+------------------------------------+
Or much shorter
spark.sql(s"""
select collect_list(case when value is not null then map(key,value) end ) res from (
select id, explode(m) as (key,value) from
( select monotonically_increasing_id() id, map(${cols_str}) m from a_vw )
) group by id
""")
.show(false)
+------------------------------------+
|res |
+------------------------------------+
|[[animal -> cat], [colour -> black]]|
|[[animal -> dog], [age -> 3]] |
+------------------------------------+

Related

Scala - How to convert Spark DataFrame to Map

How to conver Spark DataFrame to Map like below : I want to convert into Map and then Json. Pivot didn't worked to reshape the cplumn so
Any help will be appreciated to convert as a Map like below.
Input DataFrame :
+-----+-----+-------+--------------------+
|col1 |col2 |object | values |
+-------------------+--------------------+
|one | two | main |[101 -> A, 202 -> B]|
+-------------------+--------------------+
Expected Output DataFrame :
+-----+-----+-------+--------------------+------------------------------------------------------------------------+
|col1 |col2 |object | values | newMap |
+-----+-----+-------+--------------------+------------------------------------------------------------------------+
|one | two |main |[101 -> A, 202 -> B]|[col1 -> one, col2 -> two, object -> main, main -> [101 -> A, 202 -> B]]|
+-----+-----+-------+--------------------+------------------------------------------------------------------------+
tried like below, but no success :
val toMap = udf((col1: String, col2: String, object: String, values: Map[String, String])) => {
col1.zip(values).toMap // need help for logic
// col1 -> col1_value, col2 -> col2_values, object -> object_value, object_value -> [values_of_Col_Values].toMap
})
df.withColumn("newMap", toMap($"col1", $"col2", $"object", $"values"))
I am stuck to format the code properly and get the output, please help either in Scala or Spark.
It's quit straight forward. Apparently the precondition is, you must have all the columns with same type otherwise you will get spark error.
import spark.implicits._
import org.apache.spark.sql.functions._
val df = Seq(("Foo", "L", "10"), ("Boo", "XL", "20"))
.toDF("brand", "size", "sales")
//Prepare your map columns.Bit of nasty iteration work is required
var preCol: Column = null
var counter = 1
val size = df.schema.fields.length
val mapColumns = df.schema.flatMap { field =>
val res = if (counter == size)
Seq(preCol, col(field.name))
else
Seq(lit(field.name), col(field.name))
//assign the current field name for tracking and increment the counter by 1
preCol = col(field.name)
counter += 1
res
}
df.withColumn("new", map(mapColumns: _*)).show(false)
Result
+-----+----+-----+---------------------------------------+
|brand|size|sales|new |
+-----+----+-----+---------------------------------------+
|Foo |L |10 |Map(brand -> Foo, size -> L, L -> 10) |
|Boo |XL |20 |Map(brand -> Boo, size -> XL, XL -> 20)|
+-----+----+-----+---------------------------------------+

Conditional Spark map() function based on input columns

What I'm trying to achieve here is sending to Spark SQL map function conditionally generated columns depending on if they have null, 0 or any other value I may want.
Take for example this initial DF.
val initialDF = Seq(
("a", "b", 1),
("a", "b", null),
("a", null, 0)
).toDF("field1", "field2", "field3")
From that initial DataFrame I want to generate yet another column which will be a map, like this.
initialDF.withColumn("thisMap", MY_FUNCTION)
My current approach to this is basically take a Seq[String] in a method a flatMap the key-value pairs that the Spark SQL method receives, like this.
def toMap(columns: String*): Column = {
map(
columns.flatMap(column => List(lit(column), col(column))): _*
)
}
But then, filtering becomes a Scala thing and is quite a mess.
What I would like to obtain after the processing would be, for each of those rows, the next DataFrame.
val initialDF = Seq(
("a", "b", 1, Map("field1" -> "a", "field2" -> "b", "field3" -> 1)),
("a", "b", null, Map("field1" -> "a", "field2" -> "b")),
("a", null, 0, Map("field1" -> "a"))
)
.toDF("field1", "field2", "field3", "thisMap")
I was wondering if this can be achieved using the Column API which is way more intuitive with .isNull or .equalTo?
Here's a small improvement on Lamanus' answer above which only loops over df.columns once:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
case class Record(field1: String, field2: String, field3: java.lang.Integer)
val df = Seq(
Record("a", "b", 1),
Record("a", "b", null),
Record("a", null, 0)
).toDS
df.show
// +------+------+------+
// |field1|field2|field3|
// +------+------+------+
// | a| b| 1|
// | a| b| null|
// | a| null| 0|
// +------+------+------+
df.withColumn("thisMap", map_concat(
df.columns.map { colName =>
when(col(colName).isNull or col(colName) === 0, map())
.otherwise(map(lit(colName), col(colName)))
}: _*
)).show(false)
// +------+------+------+---------------------------------------+
// |field1|field2|field3|thisMap |
// +------+------+------+---------------------------------------+
// |a |b |1 |[field1 -> a, field2 -> b, field3 -> 1]|
// |a |b |null |[field1 -> a, field2 -> b] |
// |a |null |0 |[field1 -> a] |
// +------+------+------+---------------------------------------+
UPDATE
I found a way to achieve the expected result but it is a bit dirty.
val df2 = df.columns.foldLeft(df) { (df, n) => df.withColumn(n + "_map", map(lit(n), col(n))) }
val col_cond = df.columns.map(n => when(not(col(n + "_map").getItem(n).isNull || col(n + "_map").getItem(n) === lit("0")), col(n + "_map")).otherwise(map()))
df2.withColumn("map", map_concat(col_cond: _*))
.show(false)
ORIGINAL
Here is my try with the function map_from_arrays that is possible to use in spark 2.4+.
df.withColumn("array", array(df.columns.map(col): _*))
.withColumn("map", map_from_arrays(lit(df.columns), $"array")).show(false)
Then, the result is:
+------+------+------+---------+---------------------------------------+
|field1|field2|field3|array |map |
+------+------+------+---------+---------------------------------------+
|a |b |1 |[a, b, 1]|[field1 -> a, field2 -> b, field3 -> 1]|
|a |b |null |[a, b,] |[field1 -> a, field2 -> b, field3 ->] |
|a |null |0 |[a,, 0] |[field1 -> a, field2 ->, field3 -> 0] |
+------+------+------+---------+---------------------------------------+

How to use Except function with spark Dataframe

I would like to get differences between two dataframe but returning the row with the different fields only. For example, I have 2 dataframes as follow:
val DF1 = Seq(
(3,"Chennai", "rahman",9846, 45000,"SanRamon"),
(1,"Hyderabad","ram",9847, 50000,"SF")
).toDF("emp_id","emp_city","emp_name","emp_phone","emp_sal","emp_site")
val DF2 = Seq(
(3,"Chennai", "rahman",9846, 45000,"SanRamon"),
(1,"Sydney","ram",9847, 48000,"SF")
).toDF("emp_id","emp_city","emp_name","emp_phone","emp_sal","emp_site")
The only difference between these two dataframe is emp_city and emp_sal for the second row.
Now, I am using the except function which gives me the entire row as follow:
DF1.except(DF2)
+------+---------+--------+---------+-------+--------+
|emp_id| emp_city|emp_name|emp_phone|emp_sal|emp_site|
+------+---------+--------+---------+-------+--------+
| 1|Hyderabad| ram| 9847| 50000| SF|
+------+---------+--------+---------+-------+--------+
However, I need the output to be like this:
+---------+--------+-----+
|emp_id| emp_city|emp_sal|
+------+---------+-------+
| 1|Hyderabad| 50000|
+------+---------+-------+
Which shows the different cells as well as emp_id.
Edit :
if there is change in column then it should appear if there is no change then it should be hidden or Null
The following should give you the result you are looking for.
DF1.except(DF2).select("emp_id","emp_city","emp_sal")
You should consider the comment from #user238607 as we cannot predict which columns are going to differ,
Still you can try this workaround.
I'm assuming emp_id is unique,
scala> val diff = udf((col: String, c1: String, c2: String) => if (c1 == c2) "" else col )
scala> DF1.join(DF2, DF1("emp_id") === DF2("emp_id"))
res15: org.apache.spark.sql.DataFrame = [emp_id: int, emp_city: string ... 10 more fields]
scala> res15.withColumn("diffcolumn", split(concat_ws(",",DF1.columns.map(x => diff(lit(x), DF1(x), DF2(x))):_*),","))
res16: org.apache.spark.sql.DataFrame = [emp_id: int, emp_city: string ... 11 more fields]
scala> res16.show(false)
+------+---------+--------+---------+-------+--------+------+--------+--------+---------+-------+--------+---------------------------+
|emp_id|emp_city |emp_name|emp_phone|emp_sal|emp_site|emp_id|emp_city|emp_name|emp_phone|emp_sal|emp_site|diffcolumn |
+------+---------+--------+---------+-------+--------+------+--------+--------+---------+-------+--------+---------------------------+
|3 |Chennai |rahman |9846 |45000 |SanRamon|3 |Chennai |rahman |9846 |45000 |SanRamon|[, , , , , ] |
|1 |Hyderabad|ram |9847 |50000 |SF |1 |Sydney |ram |9847 |48000 |SF |[, emp_city, , , emp_sal, ]|
+------+---------+--------+---------+-------+--------+------+--------+--------+---------+-------+--------+---------------------------+
scala> val diff_cols = res16.select(explode($"diffcolumn")).filter("col != ''").distinct.collect.map(a=>col(a(0).toString))
scala> val exceptOpr = DF1.except(DF2)
scala> exceptOpr.select(diff_cols:_*).show
+-------+---------+
|emp_sal| emp_city|
+-------+---------+
| 50000|Hyderabad|
+-------+---------+
I found this solution which seems to be working fine :
val cols = DF1.columns.filter(_ != "emp_id").toList
val DF3 = DF1.except(DF2)
def mapDiffs(name: String) = when($"l.$name" === $"r.$name", null ).otherwise(array($"l.$name", $"r.$name")).as(name)
val result = DF2.as("l").join(DF3.as("r"), "emp_id").select($"emp_id" :: cols.map(mapDiffs): _*)
It generates the output as follow :
+------+-------------------+--------+---------+--------------+--------+
|emp_id| emp_city|emp_name|emp_phone| emp_sal|emp_site|
+------+-------------------+--------+---------+--------------+--------+
| 1|[Sydney, Hyderabad]| null| null|[48000, 50000]| null|
|
+------+-------------------+--------+---------+--------------+--------+

Scala: Printing the ouput in proper data format using scala

I would like to display the data in proper format, I have the below code
val maplist=List(Map("id" -> "1", "Name" -> "divya"),
Map("id" -> "2", "Name" -> "gaya")
)
val header=maplist.flatMap(_.keys).distinct
val data=maplist.map(_.values)
println(header.mkString(" "))
data.foreach(x => println(x.mkString(" ")))
this is showing as
id Name
1 divya
2 gaya
but I would like to show like, I may have to use df.show() function
+----+-----+
|Id |Name |
+----+-----+
|1 |Divya|
|2 |gaya |
+----+-----+
If you want the separators you should use mkString method with more parameters, you can check in the API
mkString(start: String, sep: String, end: String): String
Displays all elements of this traversable or iterator in a string
using start, end, and separator strings.
val separatorLine = "+----+-----+"
val separator = "|"
val maplist=List(Map("id" -> "1", "Name" -> "divya"),
Map("id" -> "2", "Name" -> "gaya")
)
val header=maplist.flatMap(_.keys).distinct
val data=maplist.map(_.values)
println(separatorLine)
println(header.mkString("|", " |", "|"))
println(separatorLine)
data.foreach(x => println(x.mkString("|", " |", "|")))
println(separatorLine)
Result:
+----+-----+
|id |Name|
+----+-----+
|1 |divya|
|2 |gaya|
+----+-----+
Update: If you want to have the same length in every String (for instance 5) you can do an auxiliar method yo append blanks when needed:
#tailrec
private def appendElem(original : String, desiredLength: Int, c: Char): String = {
if (original.length < desiredLength)
appendElem(original + c, desiredLength, c)
else {
original
}
}
val separator = "|"
val maplist=List(Map("id" -> "1", "Name" -> "divya"),
Map("id" -> "2", "Name" -> "gaya")
)
val separatorLine = List.fill(maplist.size)( "+").map(appendElem(_, 6,'-')).mkString+ "+"
val header=maplist.flatMap(_.keys.map(key => appendElem(key, 5, ' '))).distinct
val data=maplist.map(_.values)
println(separatorLine)
println(header.mkString("|", "|", "|"))
println(separatorLine)
data.map(x => x.map(y => appendElem(y, 5, ' '))).foreach(x => println(x.mkString("|", "|", "|")))
println(separatorLine)
With this second version the result is as follows
+-----+-----+
|id |Name |
+-----+-----+
|1 |divya|
|2 |gaya |
+-----+-----+

How to filter a map<String, Int> in a data frame : Spark / Scala

I am trying to get the count individual column to publish metrics. I have a I have a df [customerId : string, totalRent : bigint, totalPurchase: bigint, itemTypeCounts: map<string, int> ]
Right now I am doing :
val totalCustomers = df.count
val totalPurchaseCount = df.filter("totalPurchase > 0").count
val totalRentCount = df.filter("totalRent > 0").count
publishMetrics("Total Customer", totalCustomers )
publishMetrics("Total Purchase", totalPurchaseCount )
publishMetrics("Total Rent", totalRentCount )
publishMetrics("Percentage of Rent", percentage(totalRentCount, totalCustomers) )
publishMetrics("Percentage of Purchase", percentage(totalPurchaseCount, totalCustomers) )
private def percentageCalc(num: Long, denom: Long): Double = {
val numD: Long = num
val denomD: Long = denom
return if (denomD == 0.0) 0.0
else (numD / denomD) * 100
}
But I am not sure how do I do this for itemTypeCounts which is a map. I want count and percentage based on each key entry. The issue is the key value is dynamic , I mean there is no way I know the key value before hand. Can some one tell me how do get count for each key values. I am new to scala/spark, any other efficient approaches to get the counts of each columns are much appreciated.
Sample data :
customerId : 1
totalPurchase : 17
totalRent : 0
itemTypeCounts : {"TV" : 4, "Blender" : 2}
customerId : 2
totalPurchase : 1
totalRent : 1
itemTypeCounts : {"Cloths" : 4}
customerId : 3
totalPurchase : 0
totalRent : 10
itemTypeCounts : {"TV" : 4}
So the output is :
totalCustomer : 3
totalPurchaseCount : 2 (2 customers with totalPurchase > 0)
totalRent : 2 (2 customers with totalRent > 0)
itemTypeCounts_TV : 2
itemTypeCounts_Cloths : 1
itemTypeCounts_Blender : 1
You can accomplish this in Spark SQL, I show two examples of this below (one where the keys are known and can be enumerated in code, one where the keys are unknown). Note that by using Spark SQL, you take advantage of the catalyst optimizer, and this will run very efficiently:
val data = List((1,17,0,Map("TV" -> 4, "Blender" -> 2)),(2,1,1,Map("Cloths" -> 4)),(3,0,10,Map("TV" -> 4)))
val df = data.toDF("customerId","totalPurchase","totalRent","itemTypeCounts")
//Only good if you can enumerate the keys
def countMapKey(name:String) = {
count(when($"itemTypeCounts".getItem(name).isNotNull,lit(1))).as(s"itemTypeCounts_$name")
}
val keysToCount = List("TV","Blender","Cloths").map(key => countMapKey(key))
df.select(keysToCount :_*).show
+-----------------+----------------------+---------------------+
|itemTypeCounts_TV|itemTypeCounts_Blender|itemTypeCounts_Cloths|
+-----------------+----------------------+---------------------+
| 2| 1| 1|
+-----------------+----------------------+---------------------+
//More generic
val pivotData = df.select(explode(col("itemTypeCounts"))).groupBy(lit(1).as("tmp")).pivot("key").count.drop("tmp")
val renameStatement = pivotData.columns.map(name => col(name).as(s"itemTypeCounts_$name"))
pivotData.select(renameStatement :_*).show
+----------------------+---------------------+-----------------+
|itemTypeCounts_Blender|itemTypeCounts_Cloths|itemTypeCounts_TV|
+----------------------+---------------------+-----------------+
| 1| 1| 2|
+----------------------+---------------------+-----------------+
I'm a spark newbie myself, so there is probably a better way to do this. But one thing you could try is transforming the itemTypeCounts into a data structure in scala that you could work with. I converted each row to a List of (Name, Count) pairs e.g. List((Blender,2), (TV,4)).
With this you can have a List of such list of pairs, one list of pairs for each row. In your example, this will be a List of 3 elements:
List(
List((Blender,2), (TV,4)),
List((Cloths,4)),
List((TV,4))
)
Once you have this structure, transforming it to a desired output is standard scala.
Worked example is below:
val itemTypeCounts = df.select("itemTypeCounts")
//Build List of List of Pairs as suggested above
val itemsList = itemTypeCounts.collect().map {
row =>
val values = row.getStruct(0).mkString("",",","").split(",")
val fields = row.schema.head.dataType.asInstanceOf[StructType].map(s => s.name).toList
fields.zip(values).filter(p => p._2 != "null")
}.toList
// Build a summary map for the list constructed above
def itemTypeCountsSummary(frames: List[List[(String, String)]], summary: Map[String, Int]) : Map[String, Int] = frames match {
case Nil => summary
case _ => itemTypeCountsSummary(frames.tail, merge(frames.head, summary))
}
//helper method for the summary map.
def merge(head: List[(String, String)], summary: Map[String, Int]): Map[String, Int] = {
val headMap = head.toMap.map(e => ("itemTypeCounts_" + e._1, 1))
val updatedSummary = summary.map{e => if(headMap.contains(e._1)) (e._1, e._2 + 1) else e}
updatedSummary ++ headMap.filter(e => !updatedSummary.contains(e._1))
}
val summaryMap = itemTypeCountsSummary(itemsList, Map())
summaryMap.foreach(e => println(e._1 + ": " + e._2 ))
Output:
itemTypeCounts_Blender: 1
itemTypeCounts_TV: 2
itemTypeCounts_Cloths: 1
Borrowing the input from Nick and using spark-sql pivot solution:
val data = List((1,17,0,Map("TV" -> 4, "Blender" -> 2)),(2,1,1,Map("Cloths" -> 4)),(3,0,10,Map("TV" -> 4)))
val df = data.toDF("customerId","totalPurchase","totalRent","itemTypeCounts")
df.show(false)
df.createOrReplaceTempView("df")
+----------+-------------+---------+-----------------------+
|customerId|totalPurchase|totalRent|itemTypeCounts |
+----------+-------------+---------+-----------------------+
|1 |17 |0 |[TV -> 4, Blender -> 2]|
|2 |1 |1 |[Cloths -> 4] |
|3 |0 |10 |[TV -> 4] |
+----------+-------------+---------+-----------------------+
Assuming that we know the distinct itemType beforehand, we can use
val dfr = spark.sql("""
select * from (
select explode(itemTypeCounts) itemTypeCounts from (
select flatten(collect_list(map_keys(itemTypeCounts))) itemTypeCounts from df
) ) t
pivot ( count(itemTypeCounts) as c3
for itemTypeCounts in ('TV' ,'Blender' ,'Cloths') )
""")
dfr.show(false)
+---+-------+------+
|TV |Blender|Cloths|
+---+-------+------+
|2 |1 |1 |
+---+-------+------+
For renaming columns,
dfr.select(dfr.columns.map( x => col(x).alias("itemTypeCounts_" + x )):_* ).show(false)
+-----------------+----------------------+---------------------+
|itemTypeCounts_TV|itemTypeCounts_Blender|itemTypeCounts_Cloths|
+-----------------+----------------------+---------------------+
|2 |1 |1 |
+-----------------+----------------------+---------------------+
To get the distinct itemType dynamically and pass it to pivot
val item_count_arr = spark.sql(""" select array_distinct(flatten(collect_list(map_keys(itemTypeCounts)))) itemTypeCounts from df """).as[Array[String]].first
item_count_arr: Array[String] = Array(TV, Blender, Cloths)
spark.sql(s"""
select * from (
select explode(itemTypeCounts) itemTypeCounts from (
select flatten(collect_list(map_keys(itemTypeCounts))) itemTypeCounts from df
) ) t
pivot ( count(itemTypeCounts) as c3
for itemTypeCounts in (${item_count_arr.map(c => "'"+c+"'").mkString(",")}) )
""").show(false)
+---+-------+------+
|TV |Blender|Cloths|
+---+-------+------+
|2 |1 |1 |
+---+-------+------+