How to filter a map<String, Int> in a data frame : Spark / Scala

I am trying to count individual columns in order to publish metrics. I have a df [customerId: string, totalRent: bigint, totalPurchase: bigint, itemTypeCounts: map<string, int>].
Right now I am doing:
val totalCustomers = df.count
val totalPurchaseCount = df.filter("totalPurchase > 0").count
val totalRentCount = df.filter("totalRent > 0").count
publishMetrics("Total Customer", totalCustomers )
publishMetrics("Total Purchase", totalPurchaseCount )
publishMetrics("Total Rent", totalRentCount )
publishMetrics("Percentage of Rent", percentage(totalRentCount, totalCustomers) )
publishMetrics("Percentage of Purchase", percentage(totalPurchaseCount, totalCustomers) )
private def percentage(num: Long, denom: Long): Double = {
  // Divide as Double: Long / Long would truncate the ratio to a whole number before scaling
  if (denom == 0L) 0.0
  else (num.toDouble / denom) * 100
}
But I am not sure how to do this for itemTypeCounts, which is a map. I want the count and percentage for each key. The problem is that the keys are dynamic: I do not know them beforehand. Can someone tell me how to get the count for each key? I am new to Scala/Spark, so any other efficient approaches to getting the counts for each column are much appreciated.
Sample data :
customerId : 1
totalPurchase : 17
totalRent : 0
itemTypeCounts : {"TV" : 4, "Blender" : 2}
customerId : 2
totalPurchase : 1
totalRent : 1
itemTypeCounts : {"Cloths" : 4}
customerId : 3
totalPurchase : 0
totalRent : 10
itemTypeCounts : {"TV" : 4}
So the expected output is:
totalCustomer : 3
totalPurchaseCount : 2 (2 customers with totalPurchase > 0)
totalRent : 2 (2 customers with totalRent > 0)
itemTypeCounts_TV : 2
itemTypeCounts_Cloths : 1
itemTypeCounts_Blender : 1

You can accomplish this with Spark SQL functions. I show two examples below: one where the keys are known and can be enumerated in code, and one where the keys are unknown. Because both use built-in Spark SQL functions, they benefit from the Catalyst optimizer and should run efficiently:
val data = List((1,17,0,Map("TV" -> 4, "Blender" -> 2)),(2,1,1,Map("Cloths" -> 4)),(3,0,10,Map("TV" -> 4)))
val df = data.toDF("customerId","totalPurchase","totalRent","itemTypeCounts")
//Only good if you can enumerate the keys
def countMapKey(name:String) = {
count(when($"itemTypeCounts".getItem(name).isNotNull,lit(1))).as(s"itemTypeCounts_$name")
}
val keysToCount = List("TV","Blender","Cloths").map(key => countMapKey(key))
df.select(keysToCount :_*).show
+-----------------+----------------------+---------------------+
|itemTypeCounts_TV|itemTypeCounts_Blender|itemTypeCounts_Cloths|
+-----------------+----------------------+---------------------+
| 2| 1| 1|
+-----------------+----------------------+---------------------+
//More generic
val pivotData = df.select(explode(col("itemTypeCounts"))).groupBy(lit(1).as("tmp")).pivot("key").count.drop("tmp")
val renameStatement = pivotData.columns.map(name => col(name).as(s"itemTypeCounts_$name"))
pivotData.select(renameStatement :_*).show
+----------------------+---------------------+-----------------+
|itemTypeCounts_Blender|itemTypeCounts_Cloths|itemTypeCounts_TV|
+----------------------+---------------------+-----------------+
| 1| 1| 2|
+----------------------+---------------------+-----------------+
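To make clear what the two queries above count, here is a minimal sketch of the same logic on plain Scala collections, with no Spark involved. The `rows` value and the `percentage` helper are illustrative assumptions that mirror the sample data in the question. Each key is counted once per customer whose map contains it, which is what exploding the map and counting per key produces; the percentage per key is then a simple division by the number of customers.

```scala
// Sample data from the question: one map of item counts per customer.
val rows: List[Map[String, Int]] = List(
  Map("TV" -> 4, "Blender" -> 2),
  Map("Cloths" -> 4),
  Map("TV" -> 4)
)

// Hypothetical helper: percentage with floating-point division.
def percentage(num: Long, denom: Long): Double =
  if (denom == 0L) 0.0 else num.toDouble / denom * 100

// "Explode" every map into its keys, then count the occurrences of each key.
val keyCounts: Map[String, Int] =
  rows.flatMap(_.keys)
    .groupBy(identity)
    .map { case (key, hits) => s"itemTypeCounts_$key" -> hits.size }

// Share of customers that have each key.
val keyPercentages: Map[String, Double] =
  keyCounts.map { case (name, n) => name -> percentage(n, rows.size) }

keyCounts.toSeq.sortBy(_._1).foreach { case (name, n) => println(s"$name: $n") }
```

On a real DataFrame the counting itself should stay in Spark, as in the answer above; this sketch is only meant to show the shape of the result.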

I'm a Spark newbie myself, so there is probably a better way to do this. But one thing you could try is transforming itemTypeCounts into a Scala data structure that you can work with. I converted each row to a List of (Name, Count) pairs, e.g. List((Blender,2), (TV,4)).
With this you get a List of such lists of pairs, one list per row. In your example, this will be a List of 3 elements:
List(
List((Blender,2), (TV,4)),
List((Cloths,4)),
List((TV,4))
)
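Once you have this structure, transforming it into the desired output is standard Scala.
A worked example is below. Note that it reads the column with row.getStruct, so it assumes itemTypeCounts has a struct type (as it would if the schema were inferred from JSON); for a true map column, row.getMap(0) would be needed instead: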
val itemTypeCounts = df.select("itemTypeCounts")
//Build List of List of Pairs as suggested above
val itemsList = itemTypeCounts.collect().map {
row =>
val values = row.getStruct(0).mkString("",",","").split(",")
val fields = row.schema.head.dataType.asInstanceOf[StructType].map(s => s.name).toList
fields.zip(values).filter(p => p._2 != "null")
}.toList
// Build a summary map for the list constructed above
def itemTypeCountsSummary(frames: List[List[(String, String)]], summary: Map[String, Int]) : Map[String, Int] = frames match {
case Nil => summary
case _ => itemTypeCountsSummary(frames.tail, merge(frames.head, summary))
}
//helper method for the summary map.
def merge(head: List[(String, String)], summary: Map[String, Int]): Map[String, Int] = {
val headMap = head.toMap.map(e => ("itemTypeCounts_" + e._1, 1))
val updatedSummary = summary.map{e => if(headMap.contains(e._1)) (e._1, e._2 + 1) else e}
updatedSummary ++ headMap.filter(e => !updatedSummary.contains(e._1))
}
val summaryMap = itemTypeCountsSummary(itemsList, Map())
summaryMap.foreach(e => println(e._1 + ": " + e._2 ))
Output:
itemTypeCounts_Blender: 1
itemTypeCounts_TV: 2
itemTypeCounts_Cloths: 1
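The recursive summary above can also be written as a fold. The following is a minimal sketch, assuming a hard-coded `itemsList` that stands in for the rows collected from the DataFrame; the name `summary` is illustrative.

```scala
// Stand-in for the collected rows: one list of (name, count) pairs per row.
val itemsList: List[List[(String, String)]] = List(
  List(("Blender", "2"), ("TV", "4")),
  List(("Cloths", "4")),
  List(("TV", "4"))
)

// For every row, bump the counter of each distinct item name by one.
val summary: Map[String, Int] =
  itemsList.foldLeft(Map.empty[String, Int]) { (acc, row) =>
    row.map(_._1).distinct.foldLeft(acc) { (m, name) =>
      val key = "itemTypeCounts_" + name
      m.updated(key, m.getOrElse(key, 0) + 1)
    }
  }

summary.foreach { case (k, v) => println(k + ": " + v) }
```

Like the original, this runs on the driver after collect(), so it is only suitable when the collected data fits in memory.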

Borrowing the input from Nick and using the Spark SQL pivot solution:
val data = List((1,17,0,Map("TV" -> 4, "Blender" -> 2)),(2,1,1,Map("Cloths" -> 4)),(3,0,10,Map("TV" -> 4)))
val df = data.toDF("customerId","totalPurchase","totalRent","itemTypeCounts")
df.show(false)
df.createOrReplaceTempView("df")
+----------+-------------+---------+-----------------------+
|customerId|totalPurchase|totalRent|itemTypeCounts |
+----------+-------------+---------+-----------------------+
|1 |17 |0 |[TV -> 4, Blender -> 2]|
|2 |1 |1 |[Cloths -> 4] |
|3 |0 |10 |[TV -> 4] |
+----------+-------------+---------+-----------------------+
Assuming that we know the distinct itemType values beforehand, we can use:
val dfr = spark.sql("""
select * from (
select explode(itemTypeCounts) itemTypeCounts from (
select flatten(collect_list(map_keys(itemTypeCounts))) itemTypeCounts from df
) ) t
pivot ( count(itemTypeCounts) as c3
for itemTypeCounts in ('TV' ,'Blender' ,'Cloths') )
""")
dfr.show(false)
+---+-------+------+
|TV |Blender|Cloths|
+---+-------+------+
|2 |1 |1 |
+---+-------+------+
To rename the columns:
dfr.select(dfr.columns.map( x => col(x).alias("itemTypeCounts_" + x )):_* ).show(false)
+-----------------+----------------------+---------------------+
|itemTypeCounts_TV|itemTypeCounts_Blender|itemTypeCounts_Cloths|
+-----------------+----------------------+---------------------+
|2 |1 |1 |
+-----------------+----------------------+---------------------+
To get the distinct itemType values dynamically and pass them to pivot:
val item_count_arr = spark.sql(""" select array_distinct(flatten(collect_list(map_keys(itemTypeCounts)))) itemTypeCounts from df """).as[Array[String]].first
item_count_arr: Array[String] = Array(TV, Blender, Cloths)
spark.sql(s"""
select * from (
select explode(itemTypeCounts) itemTypeCounts from (
select flatten(collect_list(map_keys(itemTypeCounts))) itemTypeCounts from df
) ) t
pivot ( count(itemTypeCounts) as c3
for itemTypeCounts in (${item_count_arr.map(c => "'"+c+"'").mkString(",")}) )
""").show(false)
+---+-------+------+
|TV |Blender|Cloths|
+---+-------+------+
|2 |1 |1 |
+---+-------+------+
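For reference, this small sketch shows the fragment that the interpolation inside the pivot clause expands to. The array is hard-coded here as an assumption; in the answer it comes from the array_distinct query.

```scala
// Distinct keys as they would come back from the array_distinct query.
val itemCountArr: Array[String] = Array("TV", "Blender", "Cloths")

// Wrap each key in single quotes and join with commas.
val inList: String = itemCountArr.map(c => "'" + c + "'").mkString(",")

println(inList) // prints 'TV','Blender','Cloths'
```

Because the keys are pasted into the SQL text as literals, a key that itself contains a single quote would produce an invalid literal and would need to be escaped first.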

Related

Spark - Drop null values from map column

I'm using Spark to read a CSV file and then gather all the fields to create a map. Some of the fields are empty and I'd like to remove them from the map.
So for a CSV that looks like this:
"animal", "colour", "age"
"cat" , "black" ,
"dog" , , "3"
I'd like to get a dataset with the following maps:
Map("animal" -> "cat", "colour" -> "black")
Map("animal" -> "dog", "age" -> "3")
This is what I have so far:
val csv_cols_n_vals: Array[Column] = csv.columns.flatMap { c => Array(lit(c), col(c)) }
sparkSession.read
.option("header", "true")
.csv(csvLocation)
.withColumn("allFieldsMap", map(csv_cols_n_vals: _*))
I've tried a few variations, but I can't seem to find the correct solution.

There is most certainly a better and more efficient way using the DataFrame API, but here is a map/flatMap solution:
val df = Seq(("cat", "black", null), ("dog", null, "3")).toDF("animal", "colour", "age")
val cols = df.columns
df.map(r => {
cols.flatMap( c => {
val v = r.getAs[String](c)
if (v != null) {
Some(Map(c -> v))
} else {
None
}
}).reduce(_ ++ _)
}).toDF("map").show(false)
Which produces:
+--------------------------------+
|map |
+--------------------------------+
|[animal -> cat, colour -> black]|
|[animal -> dog, age -> 3] |
+--------------------------------+
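The per-row logic of the solution above can be sketched on plain Scala values as follows. The names `cols` and `nonNullMap` are illustrative, and the row is modelled as a plain sequence of strings rather than a Spark Row.

```scala
// Column names of the example DataFrame.
val cols: Seq[String] = Seq("animal", "colour", "age")

// Pair each column name with its value and keep only the non-null entries.
def nonNullMap(values: Seq[String]): Map[String, String] =
  cols.zip(values).collect { case (c, v) if v != null => c -> v }.toMap

println(nonNullMap(Seq("cat", "black", null)))
println(nonNullMap(Seq("dog", null, "3")))
```

One difference from the reduce-based version is that a row whose fields are all null yields an empty map here, whereas calling reduce on an empty collection throws an exception.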

scala> df.show(false)
+------+------+----+
|animal|colour|age |
+------+------+----+
|cat |black |null|
|dog |null |3 |
+------+------+----+
Building the expression:
val colExpr = df
.columns // getting list of columns from dataframe.
.map{ columnName =>
when(
col(columnName).isNotNull, // checking if column is not null
map(
lit(columnName),
col(columnName)
) // Adding column name and its value inside map
)
.otherwise(map())
}
.reduce(map_concat(_,_))
// finally using map_concat function to concat map values.
The code above builds the following expression:
map_concat(
map_concat(
CASE WHEN (animal IS NOT NULL) THEN map(animal, animal) ELSE map() END,
CASE WHEN (colour IS NOT NULL) THEN map(colour, colour) ELSE map() END
),
CASE WHEN (age IS NOT NULL) THEN map(age, age) ELSE map() END
)
Applying colExpr to the DataFrame:
scala>
df
.withColumn("allFieldsMap",colExpr)
.show(false)
+------+------+----+--------------------------------+
|animal|colour|age |allFieldsMap |
+------+------+----+--------------------------------+
|cat |black |null|[animal -> cat, colour -> black]|
|dog |null |3 |[animal -> dog, age -> 3] |
+------+------+----+--------------------------------+

Spark SQL solution:
val df = Seq(("cat", "black", null), ("dog", null, "3")).toDF("animal", "colour", "age")
df.show(false)
+------+------+----+
|animal|colour|age |
+------+------+----+
|cat |black |null|
|dog |null |3 |
+------+------+----+
df.createOrReplaceTempView("a_vw")
val cols_str = df.columns.flatMap( x => Array("\"".concat(x).concat("\""),x)).mkString(",")
spark.sql(s"""
select collect_list(m2) res from (
select id, key, value, map(key,value) m2 from (
select id, explode(m) as (key,value) from
( select monotonically_increasing_id() id, map(${cols_str}) m from a_vw )
)
where value is not null
) group by id
""")
.show(false)
+------------------------------------+
|res |
+------------------------------------+
|[[animal -> cat], [colour -> black]]|
|[[animal -> dog], [age -> 3]] |
+------------------------------------+
Or much shorter
spark.sql(s"""
select collect_list(case when value is not null then map(key,value) end ) res from (
select id, explode(m) as (key,value) from
( select monotonically_increasing_id() id, map(${cols_str}) m from a_vw )
) group by id
""")
.show(false)
+------------------------------------+
|res |
+------------------------------------+
|[[animal -> cat], [colour -> black]]|
|[[animal -> dog], [age -> 3]] |
+------------------------------------+
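Note that both queries above return an array of single-entry maps per row rather than one map per row. If a single map is wanted, the entries still have to be merged. A minimal sketch of that merge on plain Scala values, with `res` hard-coded as an assumption that mirrors the output shown:

```scala
// The query result, modelled as one sequence of single-entry maps per row.
val res: Seq[Seq[Map[String, String]]] = Seq(
  Seq(Map("animal" -> "cat"), Map("colour" -> "black")),
  Seq(Map("animal" -> "dog"), Map("age" -> "3"))
)

// Merge the single-entry maps of each row into one map.
val merged: Seq[Map[String, String]] =
  res.map(_.foldLeft(Map.empty[String, String])(_ ++ _))

merged.foreach(println)
```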

Scala - How to convert Spark DataFrame to Map

How can I convert a Spark DataFrame to a Map like the one below? I want to convert it into a Map and then to JSON. Pivot didn't work for reshaping the column, so
any help converting it to a Map like the one below would be appreciated.
Input DataFrame :
+-----+-----+-------+--------------------+
|col1 |col2 |object |values              |
+-----+-----+-------+--------------------+
|one  |two  |main   |[101 -> A, 202 -> B]|
+-----+-----+-------+--------------------+
Expected Output DataFrame :
+-----+-----+-------+--------------------+------------------------------------------------------------------------+
|col1 |col2 |object | values | newMap |
+-----+-----+-------+--------------------+------------------------------------------------------------------------+
|one | two |main |[101 -> A, 202 -> B]|[col1 -> one, col2 -> two, object -> main, main -> [101 -> A, 202 -> B]]|
+-----+-----+-------+--------------------+------------------------------------------------------------------------+
I tried the following, without success:
val toMap = udf((col1: String, col2: String, object: String, values: Map[String, String])) => {
col1.zip(values).toMap // need help for logic
// col1 -> col1_value, col2 -> col2_values, object -> object_value, object_value -> [values_of_Col_Values].toMap
})
df.withColumn("newMap", toMap($"col1", $"col2", $"object", $"values"))
I am stuck on formatting the code properly and getting the output. Please help, in either Scala or Spark.

It's quite straightforward. The precondition is that all the columns must have the same type; otherwise you will get a Spark error.
import spark.implicits._
import org.apache.spark.sql.functions._
val df = Seq(("Foo", "L", "10"), ("Boo", "XL", "20"))
.toDF("brand", "size", "sales")
// Prepare the map columns. A bit of nasty iteration work is required
var preCol: Column = null
var counter = 1
val size = df.schema.fields.length
val mapColumns = df.schema.flatMap { field =>
val res = if (counter == size)
Seq(preCol, col(field.name))
else
Seq(lit(field.name), col(field.name))
//assign the current field name for tracking and increment the counter by 1
preCol = col(field.name)
counter += 1
res
}
df.withColumn("new", map(mapColumns: _*)).show(false)
Result
+-----+----+-----+---------------------------------------+
|brand|size|sales|new |
+-----+----+-----+---------------------------------------+
|Foo |L |10 |Map(brand -> Foo, size -> L, L -> 10) |
|Boo |XL |20 |Map(brand -> Boo, size -> XL, XL -> 20)|
+-----+----+-----+---------------------------------------+
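The pairing rule used above (every column is keyed by its own name, except the last one, which is keyed by the value of the column before it) can be sketched without mutable state. This is a plain Scala illustration on one row of values, not Spark code; `rowToMap` is a hypothetical helper name.

```scala
// Build the map for one row: name -> value for all but the last column,
// and (value of the second-to-last column) -> (value of the last column).
def rowToMap(names: Seq[String], values: Seq[String]): Map[String, String] = {
  require(names.length == values.length && names.length >= 2, "need at least two columns")
  val pairs = names.zip(values)
  val named = pairs.init.toMap
  val lastEntry = values(values.length - 2) -> values.last
  named + lastEntry
}

println(rowToMap(Seq("brand", "size", "sales"), Seq("Foo", "L", "10")))
```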

Spark higher order functions to compute top N products from a comma separated list

I am using Spark 2.4 and I have a spark dataframe that has 2 columns - id and product_list. The data consists of list of products that every id has interacted with.
Here is the sample code:
scala> spark.version
res3: String = 2.4.3
val df = Seq(
("1", "p1,p1,p1,p1,p1,p3,p3,p2,p2,p2,p2"),
("2", "p2,p2,p2,p2,p2,p4,p4,p4,p1,p3")
).toDF("id", "product_list")
df.createOrReplaceTempView("df")
+---+--------------------------------+
|id |product_list |
+---+--------------------------------+
|1 |p1,p1,p1,p1,p1,p3,p3,p2,p2,p2,p2|
|2 |p2,p2,p2,p2,p2,p4,p4,p4,p1,p3 |
+---+--------------------------------+
I would like to return the top 2 products that each id has interacted with. For instance, id = 1 has viewed product p1 5 times and p2 4 times, so I would like to return p1,p2 for id = 1. Similarly, p2,p4 for id = 2.
My final output should look like
id, most_seen_products
1, p1,p2
2, p2,p4
Since I am using Spark 2.4, I was wondering if there is a higher-order function to first convert this list to an array and then return the top 2 viewed products. In general, the code should handle the top N products.

Here is my approach
val df = Seq(
("1", "p1,p1,p1,p1,p1,p3,p3,p2,p2,p2,p2"),
("2", "p2,p2,p2,p2,p2,p4,p4,p4,p1,p3")
).toDF("id", "product_list")
def getMetrics(value: Row, n: Int): (String, String) = {
val split = value.getAs[String]("product_list").split(",")
val sortedRecords = split.groupBy(x => x).map(data => (data._1, data._2.size)).toList.sortWith(_._2 > _._2)
(value.getAs[String]("id"), sortedRecords.take(n).map(_._1).mkString(","))
}
df.map(value =>
getMetrics(value, 2)
).withColumnRenamed("_1", "id").withColumnRenamed("_2", "most_seen_products").show(false)
Result
+---+------------------+
|id |most_seen_products|
+---+------------------+
|1 |p1,p2 |
|2 |p2,p4 |
+---+------------------+

Looking at your data format, you can just use a .map() or, in the case of SQL, a UDF that converts each row. The function will be:
productList => {
// list of products = split productList by comma
// add all items to a String/Count map
// sort the map, get first 2 elements
// return string.join of those 2 elements
}
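The steps outlined above can be sketched as a plain Scala function, which could then be wrapped in a UDF or called from a .map(). The name `topN` and the tie-breaking rule (alphabetical order among products with equal counts) are assumptions made for this example.

```scala
// Return the n most frequent products of a comma-separated list, joined by commas.
def topN(productList: String, n: Int): String =
  productList.split(",").toList
    .groupBy(identity)                                       // product -> all its occurrences
    .map { case (product, hits) => product -> hits.size }    // product -> count
    .toList
    .sortBy { case (product, count) => (-count, product) }   // highest count first, ties by name
    .take(n)
    .map(_._1)
    .mkString(",")

println(topN("p1,p1,p1,p1,p1,p3,p3,p2,p2,p2,p2", 2)) // prints p1,p2
println(topN("p2,p2,p2,p2,p2,p4,p4,p4,p1,p3", 2))    // prints p2,p4
```

Using take(n) means a list with fewer than n distinct products simply returns what it has.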

scala> import org.apache.spark.sql.expressions.UserDefinedFunction
scala> import scala.collection.immutable.ListMap
scala> def max_products:UserDefinedFunction = udf((product:String) => {
val productList = product.split(",").toList
val finalList = ListMap(productList.groupBy(i=>i).mapValues(_.size).toSeq.sortWith(_._2 > _._2):_*).keys.toList
finalList.take(2).mkString(",") // take(2) avoids an exception when there are fewer than two distinct products
})
scala> df.withColumn("most_seen_products", max_products(col("product_list"))).show(false)
+---+--------------------------------+------------------+
|id |product_list |most_seen_products|
+---+--------------------------------+------------------+
|1 |p1,p1,p1,p1,p1,p3,p3,p2,p2,p2,p2|p1,p2 |
|2 |p2,p2,p2,p2,p2,p4,p4,p4,p1,p3 |p2,p4 |
+---+--------------------------------+------------------+

Split text and find the common words in a Spark Dataframe

I am working on Scala with Spark and I have a dataframe including two columns with text.
Those columns have the format "term1, term2, term3,..." and I want to create a third column with the common terms of the two of them.
For example
Col1
orange, apple, melon
party, clouds, beach
Col2
apple, apricot, watermelon
black, yellow, white
The result would be
Col3
1
0
What I have done so far is create a UDF that splits the text and gets the intersection of the two columns.
val common_terms = udf((a: String, b: String) => if (a.isEmpty || b.isEmpty) {
0
} else {
split(a, ",").intersect(split(b, ",")).length
})
And then on my DataFrame:
val results = termsDF.withColumn("col3", common_terms(col("col1"), col("col2")))
But I get the following error:
Error:(96, 13) type mismatch;
found : String
required: org.apache.spark.sql.Column
split(a, ",").intersect(split(b, ",")).length
I would appreciate any help since I am new to Scala and just trying to learn from online tutorials.
EDIT:
val common_authors = udf((a: String, b: String) => if (a != null || b != null) {
0
} else {
val tempA = a.split( ",")
val tempB = b.split(",")
if ( tempA.isEmpty || tempB.isEmpty ) {
0
} else {
tempA.intersect(tempB).length
}
})
After the edit, termsDF.show() runs. But if I do something like termsDF.orderBy(desc("col3")), I get a java.lang.NullPointerException.

Try
val common_terms = udf((a: String, b: String) => if (a.isEmpty || b.isEmpty) {
0
} else {
val tmp1 = a.split(",")
val tmp2 = b.split(",")
tmp1.intersect(tmp2).length
})
val results = termsDF.withColumn("col3", common_terms($"col1", $"col2"))
results.show
split(a, ",") is a Spark Column function. Inside a UDF you are working with plain Strings, so you need to use String.split(), which is a plain string method.
After the edit: change the null check to use == rather than !=, so that the UDF returns 0 when either input is null.
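Putting both fixes together, the body of the UDF might look like the following plain Scala function. This is a sketch, not the asker's final code: the name `commonTerms` is illustrative, and trimming each term is an added assumption based on the sample data, where a space follows each comma.

```scala
// Count the terms two comma-separated strings have in common; null inputs count as 0.
def commonTerms(a: String, b: String): Int =
  if (a == null || b == null) 0
  else {
    // Split on commas, trim surrounding spaces, drop empty pieces.
    def terms(s: String): Set[String] =
      s.split(",").map(_.trim).filter(_.nonEmpty).toSet
    (terms(a) intersect terms(b)).size
  }

println(commonTerms("orange, apple, melon", "apple, apricot, watermelon")) // prints 1
println(commonTerms("party, clouds, beach", "black, yellow, white"))       // prints 0
println(commonTerms(null, "apple"))                                        // prints 0
```

It could then be registered with udf(commonTerms _) and applied to the two columns as in the answer above.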

In Spark 2.4 SQL, you can get the same results without a UDF. Check this out:
scala> val df = Seq(("orange,apple,melon","apple,apricot,watermelon"),("party,clouds,beach","black,yellow,white"), ("orange,apple,melon","apple,orange,watermelon")).toDF("col1","col2")
df: org.apache.spark.sql.DataFrame = [col1: string, col2: string]
scala> df.createOrReplaceTempView("tasos")
scala> spark.sql(""" select col1,col2, filter(split(col1,','), x -> array_contains(split(col2,','),x) ) a1 from tasos """).show(false)
+------------------+------------------------+---------------+
|col1 |col2 |a1 |
+------------------+------------------------+---------------+
|orange,apple,melon|apple,apricot,watermelon|[apple] |
|party,clouds,beach|black,yellow,white |[] |
|orange,apple,melon|apple,orange,watermelon |[orange, apple]|
+------------------+------------------------+---------------+
If you want the size, then
scala> spark.sql(""" select col1,col2, filter(split(col1,','), x -> array_contains(split(col2,','),x) ) a1 from tasos """).withColumn("a1_size",size('a1)).show(false)
+------------------+------------------------+---------------+-------+
|col1 |col2 |a1 |a1_size|
+------------------+------------------------+---------------+-------+
|orange,apple,melon|apple,apricot,watermelon|[apple] |1 |
|party,clouds,beach|black,yellow,white |[] |0 |
|orange,apple,melon|apple,orange,watermelon |[orange, apple]|2 |
+------------------+------------------------+---------------+-------+

How to use Except function with spark Dataframe

I would like to get the differences between two DataFrames, returning only the fields that differ in each row. For example, I have 2 DataFrames as follows:
val DF1 = Seq(
(3,"Chennai", "rahman",9846, 45000,"SanRamon"),
(1,"Hyderabad","ram",9847, 50000,"SF")
).toDF("emp_id","emp_city","emp_name","emp_phone","emp_sal","emp_site")
val DF2 = Seq(
(3,"Chennai", "rahman",9846, 45000,"SanRamon"),
(1,"Sydney","ram",9847, 48000,"SF")
).toDF("emp_id","emp_city","emp_name","emp_phone","emp_sal","emp_site")
The only differences between these two DataFrames are emp_city and emp_sal in the second row.
Now, I am using the except function, which gives me the entire row as follows:
DF1.except(DF2)
+------+---------+--------+---------+-------+--------+
|emp_id| emp_city|emp_name|emp_phone|emp_sal|emp_site|
+------+---------+--------+---------+-------+--------+
| 1|Hyderabad| ram| 9847| 50000| SF|
+------+---------+--------+---------+-------+--------+
However, I need the output to be like this:
+------+---------+-------+
|emp_id| emp_city|emp_sal|
+------+---------+-------+
|     1|Hyderabad|  50000|
+------+---------+-------+
This shows the cells that differ, along with emp_id.
Edit:
If a column has changed, it should appear; if it has not changed, it should be hidden or null.

The following should give you the result you are looking for.
DF1.except(DF2).select("emp_id","emp_city","emp_sal")

You should consider the comment from @user238607, since we cannot predict which columns are going to differ.
Still, you can try this workaround.
I'm assuming emp_id is unique.
scala> val diff = udf((col: String, c1: String, c2: String) => if (c1 == c2) "" else col )
scala> DF1.join(DF2, DF1("emp_id") === DF2("emp_id"))
res15: org.apache.spark.sql.DataFrame = [emp_id: int, emp_city: string ... 10 more fields]
scala> res15.withColumn("diffcolumn", split(concat_ws(",",DF1.columns.map(x => diff(lit(x), DF1(x), DF2(x))):_*),","))
res16: org.apache.spark.sql.DataFrame = [emp_id: int, emp_city: string ... 11 more fields]
scala> res16.show(false)
+------+---------+--------+---------+-------+--------+------+--------+--------+---------+-------+--------+---------------------------+
|emp_id|emp_city |emp_name|emp_phone|emp_sal|emp_site|emp_id|emp_city|emp_name|emp_phone|emp_sal|emp_site|diffcolumn |
+------+---------+--------+---------+-------+--------+------+--------+--------+---------+-------+--------+---------------------------+
|3 |Chennai |rahman |9846 |45000 |SanRamon|3 |Chennai |rahman |9846 |45000 |SanRamon|[, , , , , ] |
|1 |Hyderabad|ram |9847 |50000 |SF |1 |Sydney |ram |9847 |48000 |SF |[, emp_city, , , emp_sal, ]|
+------+---------+--------+---------+-------+--------+------+--------+--------+---------+-------+--------+---------------------------+
scala> val diff_cols = res16.select(explode($"diffcolumn")).filter("col != ''").distinct.collect.map(a=>col(a(0).toString))
scala> val exceptOpr = DF1.except(DF2)
scala> exceptOpr.select(diff_cols:_*).show
+-------+---------+
|emp_sal| emp_city|
+-------+---------+
| 50000|Hyderabad|
+-------+---------+
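The core of the workaround above is a column-by-column comparison of the two matching rows. A minimal sketch of that comparison on plain Scala values, with the two rows hard-coded from the question and `diffColumns` as a hypothetical helper name:

```scala
// Column names shared by both DataFrames.
val columns: Seq[String] =
  Seq("emp_id", "emp_city", "emp_name", "emp_phone", "emp_sal", "emp_site")

// Names of the columns whose values differ between two rows with the same emp_id.
def diffColumns(left: Seq[Any], right: Seq[Any]): Seq[String] =
  columns.zip(left.zip(right)).collect { case (name, (l, r)) if l != r => name }

val df1Row: Seq[Any] = Seq(1, "Hyderabad", "ram", 9847, 50000, "SF")
val df2Row: Seq[Any] = Seq(1, "Sydney", "ram", 9847, 48000, "SF")

println(diffColumns(df1Row, df2Row).mkString(", ")) // prints emp_city, emp_sal
```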

I found this solution, which seems to work fine:
val cols = DF1.columns.filter(_ != "emp_id").toList
val DF3 = DF1.except(DF2)
def mapDiffs(name: String) = when($"l.$name" === $"r.$name", null ).otherwise(array($"l.$name", $"r.$name")).as(name)
val result = DF2.as("l").join(DF3.as("r"), "emp_id").select($"emp_id" :: cols.map(mapDiffs): _*)
It generates the following output:
+------+-------------------+--------+---------+--------------+--------+
|emp_id|           emp_city|emp_name|emp_phone|       emp_sal|emp_site|
+------+-------------------+--------+---------+--------------+--------+
|     1|[Sydney, Hyderabad]|    null|     null|[48000, 50000]|    null|
+------+-------------------+--------+---------+--------------+--------+