Spark higher order functions to compute top N products from a comma separated list

Spark higher order functions to compute top N products from a comma separated list - scala

I am using Spark 2.4 and I have a spark dataframe that has 2 columns - id and product_list. The data consists of list of products that every id has interacted with.
here is the sample code -
scala> spark.version
res3: String = 2.4.3
val df = Seq(
("1", "p1,p1,p1,p1,p1,p3,p3,p2,p2,p2,p2"),
("2", "p2,p2,p2,p2,p2,p4,p4,p4,p1,p3")
).toDF("id", "product_list")
df.createOrReplaceTempView("df")
+---+--------------------------------+
|id |product_list |
+---+--------------------------------+
|1 |p1,p1,p1,p1,p1,p3,p3,p2,p2,p2,p2|
|2 |p2,p2,p2,p2,p2,p4,p4,p4,p1,p3 |
+---+--------------------------------+
I would like to return those top 2 products that every id has had a interaction with. For instance, id = 1 has viewed products p1 - 5 times and p2 - 4 times, so i would like to return p1,p2 for id = 1. Similarly, p2,p4 for id = 2.
My final output should look like
id, most_seen_products
1, p1,p2
2, p2,p4
Since I am using Spark 2.4, I was wondering if there is a higher order function to first convert this list to array and then return the top 2 viewed products. In general the code should handle top N products.

Here is my approach
val df = Seq(
("1", "p1,p1,p1,p1,p1,p3,p3,p2,p2,p2,p2"),
("2", "p2,p2,p2,p2,p2,p4,p4,p4,p1,p3")
).toDF("id", "product_list")
def getMetrics(value: Row, n: Int): (String, String) = {
val split = value.getAs[String]("product_list").split(",")
val sortedRecords = split.groupBy(x => x).map(data => (data._1, data._2.size)).toList.sortWith(_._2 > _._2)
(value.getAs[String]("id"), sortedRecords.take(n).map(_._1).mkString(","))
}
df.map(value =>
getMetrics(value, 2)
).withColumnRenamed("_1", "id").withColumnRenamed("_2", "most_seen_products") show (false)
Result
+---+------------------+
|id |most_seen_products|
+---+------------------+
|1 |p1,p2 |
|2 |p2,p4 |
+---+------------------+

Looking at your data format, you can just use a .map() or in case of SQL, a UDF, which converts all rows. The function will be:
productList => {
// list of products = split productList by comma
// add all items to a String/Count map
// sort the map, get first 2 elements
// return string.join of those 2 elements
}

scala> import org.apache.spark.sql.expressions.UserDefinedFunction
scala> import scala.collection.immutable.ListMap
scala> def max_products:UserDefinedFunction = udf((product:String) => {
val productList = product.split(",").toList
val finalList = ListMap(productList.groupBy(i=>i).mapValues(_.size).toSeq.sortWith(_._2 > _._2):_*).keys.toList
finalList(0) + "," + finalList(1)
})
scala> df.withColumn("most_seen_products", max_products(col("product_list"))).show(false)
+---+--------------------------------+------------------+
|id |product_list |most_seen_products|
+---+--------------------------------+------------------+
|1 |p1,p1,p1,p1,p1,p3,p3,p2,p2,p2,p2|p1,p2 |
|2 |p2,p2,p2,p2,p2,p4,p4,p4,p1,p3 |p2,p4 |
+---+--------------------------------+------------------+

Related

Is it possible to register a string as a UDF?

In Spark (Scala), after the application jar is submitted to Spark, is it possible for the jar to fetch many strings from a database table, convert each string to a catalyst Expression and then convert that expression to a UDF, and use the UDF to filters rows in another DataFrame, and finally union the result of each UDF?
(The said expression needs some or all columns of the DataFrame, but which columns are needed is unknown at the time of the code of the jar is written, the schema of the DataFrame is known at development time)
An example:
expression 1: "id == 1"
expression 2: "name == \"andy\""
DataFrame:
row 1: id = 1, name = "red", age = null
row 2: id = 2, name = "andy", age = 20
row 3: id = 3, name = "juliet", age = 21
the final result should be the first two rows
Note: it is not acceptable to first concatenate the two expressions with a or, for I needed to track which expression results the result row

Edited: Filter for each argument and union All.
import org.apache.spark.sql.DataFrame
val df = spark.read.option("header","true").option("inferSchema","true").csv("test1.csv")
val args = Array("id == 1", "name == \"andy\"")
val filters = args.zipWithIndex
var dfs = Array[DataFrame]()
filters.foreach {
case (filter, index) =>
val tempDf = df.filter(filter).withColumn("index", lit(index))
dfs = dfs :+ tempDf
}
val resultDF = dfs.reduce(_ unionAll _)
resultDF.show(false)
+---+----+----+-----+
|id |name|age |index|
+---+----+----+-----+
|1 |red |null|0 |
|2 |andy|20 |1 |
+---+----+----+-----+
Original: Why just put the string to the filter?
val df = spark.read.option("header","true").option("inferSchema","true").csv("test.csv")
val condition = "id == 1 or name == \"andy\""
df.filter(condition).show(false)
+---+----+----+
|id |name|age |
+---+----+----+
|1 |red |null|
|2 |andy|20 |
+---+----+----+
Something I have missed?

Spark SQL Split or Extract words from String of Words

I have a spark Dataframe like Below.I'm trying to split the column into 2 more columns:
date time content
28may 11am [ssid][customerid,shopid]
val personDF2 = personDF.withColumn("temp",split(col("content"),"\\[")).select(
col("*") +: (0 until 3).map(i => col("temp").getItem(i).as(s/col$i)): _*)
date time content col1 col2 col3
28may 11 [ssid][customerid,shopid] ssid customerid shopid

Assuming a String to represent an Array of Words. Got your request. You can optimize the number of dataframes as well to reduce load on system. If there are more than 9 cols etc. you may need to use c00, c01, etc. for c10 etc. Or just use integer as name for columns. leave that up to you.
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
// Set up data
val df = spark.sparkContext.parallelize(Seq(
("A", "[foo][customerid,shopid][Donald,Trump,Esq][single]"),
("B", "[foo]")
)).toDF("k", "v")
val df2 = df.withColumn("words_temp", regexp_replace($"v", lit("]"), lit("" )))
val df3 = df2.withColumn("words_temp2", regexp_replace($"words_temp", lit(","), lit("[" ))).drop("words_temp")
val df4 = df3.withColumn("words_temp3", expr("substring(words_temp2, 2, length(words_temp2))")).withColumn("cnt", expr("length(words_temp2)")).drop("words_temp2")
val df5 = df4.withColumn("words",split(col("words_temp3"),"\\[")).drop("words_temp3")
val df6 = df5.withColumn("num_words", size($"words"))
val df7 = df6.withColumn("v2", explode($"words"))
// Convert to Array of sorts via group by
val df8 = df7.groupBy("k")
.agg(collect_list("v2"))
// Convert to rdd Tuple and then find position so as to gen col names! That is the clue so as to be able to use pivot
val rdd = df8.rdd
val rdd2 = rdd.map(row => (row.getAs[String](0), row.getAs[WrappedArray[String]](1).toArray))
val rdd3 = rdd2.map { case (k, list) => (k, list.zipWithIndex) }
val df9 = rdd3.toDF("k", "v")
val df10 = df9.withColumn("vn", explode($"v"))
val df11 = df10.select($"k", $"vn".getField("_1"), concat(lit("c"),$"vn".getField("_2"))).toDF("k", "v", "c")
// Final manipulation
val result = df11.groupBy("k")
.pivot("c")
.agg(expr("coalesce(first(v),null)")) // May never occur in your case, just done for completeness and variable length cols.
result.show(100,false)
returns in this case:
+---+---+----------+------+------+-----+----+------+
|k |c0 |c1 |c2 |c3 |c4 |c5 |c6 |
+---+---+----------+------+------+-----+----+------+
|B |foo|null |null |null |null |null|null |
|A |foo|customerid|shopid|Donald|Trump|Esq |single|
+---+---+----------+------+------+-----+----+------+

Update:
Based on original title stating array of words. See other answer.
If new, then a few things here. Can also be done with dataset and map I assume. Here is a solution using DFs and rdd's. I might well investigate a complete DS in future, but this works for sure and at scale.
// Can amalgamate more steps
import org.apache.spark.sql.functions._
import scala.collection.mutable.WrappedArray
// Set up data
val df = spark.sparkContext.parallelize(Seq(
("A", Array(Array("foo", "bar"), Array("Donald", "Trump","Esq"), Array("single"))),
("B", Array(Array("foo2", "bar2"), Array("single2"))),
("C", Array(Array("foo3", "bar3", "x", "y", "z")))
)).toDF("k", "v")
// flatten via 2x explode, can be done more elegeantly with def or UDF, but keeping it simple here
val df2 = df.withColumn("v2", explode($"v"))
val df3 = df2.withColumn("v3", explode($"v2"))
// Convert to Array of sorts via group by
val df4 = df3.groupBy("k")
.agg(collect_list("v3"))
// Convert to rdd Tuple and then find position so as to gen col names! That is the clue so as to be able to use pivot
val rdd = df4.rdd
val rdd2 = rdd.map(row => (row.getAs[String](0), row.getAs[WrappedArray[String]](1).toArray))
val rdd3 = rdd2.map { case (k, list) => (k, list.zipWithIndex) }
val df5 = rdd3.toDF("k", "v")
val df6 = df5.withColumn("vn", explode($"v"))
val df7 = df6.select($"k", $"vn".getField("_1"), concat(lit("c"),$"vn".getField("_2"))).toDF("k", "v", "c")
// Final manipulation
val result = df7.groupBy("k")
.pivot("c")
.agg(expr("coalesce(first(v),null)")) // May never occur in your case, just done for completeness and variable length cols.
result.show(100,false)
returns in correct col order:
+---+----+----+-------+-----+----+------+
|k |c0 |c1 |c2 |c3 |c4 |c5 |
+---+----+----+-------+-----+----+------+
|B |foo2|bar2|single2|null |null|null |
|C |foo3|bar3|x |y |z |null |
|A |foo |bar |Donald |Trump|Esq |single|
+---+----+----+-------+-----+----+------+

How to use Except function with spark Dataframe

I would like to get differences between two dataframe but returning the row with the different fields only. For example, I have 2 dataframes as follow:
val DF1 = Seq(
(3,"Chennai", "rahman",9846, 45000,"SanRamon"),
(1,"Hyderabad","ram",9847, 50000,"SF")
).toDF("emp_id","emp_city","emp_name","emp_phone","emp_sal","emp_site")
val DF2 = Seq(
(3,"Chennai", "rahman",9846, 45000,"SanRamon"),
(1,"Sydney","ram",9847, 48000,"SF")
).toDF("emp_id","emp_city","emp_name","emp_phone","emp_sal","emp_site")
The only difference between these two dataframe is emp_city and emp_sal for the second row.
Now, I am using the except function which gives me the entire row as follow:
DF1.except(DF2)
+------+---------+--------+---------+-------+--------+
|emp_id| emp_city|emp_name|emp_phone|emp_sal|emp_site|
+------+---------+--------+---------+-------+--------+
| 1|Hyderabad| ram| 9847| 50000| SF|
+------+---------+--------+---------+-------+--------+
However, I need the output to be like this:
+---------+--------+-----+
|emp_id| emp_city|emp_sal|
+------+---------+-------+
| 1|Hyderabad| 50000|
+------+---------+-------+
Which shows the different cells as well as emp_id.
Edit :
if there is change in column then it should appear if there is no change then it should be hidden or Null

The following should give you the result you are looking for.
DF1.except(DF2).select("emp_id","emp_city","emp_sal")

You should consider the comment from #user238607 as we cannot predict which columns are going to differ,
Still you can try this workaround.
I'm assuming emp_id is unique,
scala> val diff = udf((col: String, c1: String, c2: String) => if (c1 == c2) "" else col )
scala> DF1.join(DF2, DF1("emp_id") === DF2("emp_id"))
res15: org.apache.spark.sql.DataFrame = [emp_id: int, emp_city: string ... 10 more fields]
scala> res15.withColumn("diffcolumn", split(concat_ws(",",DF1.columns.map(x => diff(lit(x), DF1(x), DF2(x))):_*),","))
res16: org.apache.spark.sql.DataFrame = [emp_id: int, emp_city: string ... 11 more fields]
scala> res16.show(false)
+------+---------+--------+---------+-------+--------+------+--------+--------+---------+-------+--------+---------------------------+
|emp_id|emp_city |emp_name|emp_phone|emp_sal|emp_site|emp_id|emp_city|emp_name|emp_phone|emp_sal|emp_site|diffcolumn |
+------+---------+--------+---------+-------+--------+------+--------+--------+---------+-------+--------+---------------------------+
|3 |Chennai |rahman |9846 |45000 |SanRamon|3 |Chennai |rahman |9846 |45000 |SanRamon|[, , , , , ] |
|1 |Hyderabad|ram |9847 |50000 |SF |1 |Sydney |ram |9847 |48000 |SF |[, emp_city, , , emp_sal, ]|
+------+---------+--------+---------+-------+--------+------+--------+--------+---------+-------+--------+---------------------------+
scala> val diff_cols = res16.select(explode($"diffcolumn")).filter("col != ''").distinct.collect.map(a=>col(a(0).toString))
scala> val exceptOpr = DF1.except(DF2)
scala> exceptOpr.select(diff_cols:_*).show
+-------+---------+
|emp_sal| emp_city|
+-------+---------+
| 50000|Hyderabad|
+-------+---------+

I found this solution which seems to be working fine :
val cols = DF1.columns.filter(_ != "emp_id").toList
val DF3 = DF1.except(DF2)
def mapDiffs(name: String) = when($"l.$name" === $"r.$name", null ).otherwise(array($"l.$name", $"r.$name")).as(name)
val result = DF2.as("l").join(DF3.as("r"), "emp_id").select($"emp_id" :: cols.map(mapDiffs): _*)
It generates the output as follow :
+------+-------------------+--------+---------+--------------+--------+
|emp_id| emp_city|emp_name|emp_phone| emp_sal|emp_site|
+------+-------------------+--------+---------+--------------+--------+
| 1|[Sydney, Hyderabad]| null| null|[48000, 50000]| null|
|
+------+-------------------+--------+---------+--------------+--------+

Scala: How to add a column with the value of a changed field that was changed between two tables

I have two tables with the same schema (A and B) where every unique ID in table A also exists in table B in a 1 to 1 way. I want to add a column to table B with the name of the column whose value is different between the tables for each row. There is only one difference per row.
For example:
Table A:
{ "id1": 1,"id2": "a","name": "bob","state": "nj"}
{"id1": 2,"id2": "b","name": "sue","state": "ma"}
Table B:
{"id1": 1,"id2": "a","name": "bob","state": "fl"}
{"id1": 2,"id2": "b","name": "susan","state": "ma"}
After comparing them, I want Table B to look like this:
{"id1": 1,"id2": "a","name": "bob","state": "fl", "changed_field": "state"}
{"id1": 2,"id2": "b","name": "susan","state": "ma", "changed_field": "name"}
I can't find any functions that do this in Spark Scala's data frames. Is there something that I missed?
EDIT: I am working with hundreds to thousands of columns

Here's a way to achieve this without having to "spell-out" the columns, and without a UDF (only using built-in functions):
import org.apache.spark.sql.functions._
import spark.implicits._
// list of columns to compare
val comparableColumns = A.columns.tail // without id
// create Column that would result in the name of the first differing column:
val changedFieldCol: Column = comparableColumns.foldLeft(lit("")) {
case (result, col) => when(
result === "", when($"A.$col" =!= $"B.$col", lit(col)).otherwise(lit(""))
).otherwise(result)
}
// join by id1, add changedFieldCol, and then select only B's columns:
val result = A.as("A").join(B.as("B"), "id1")
.withColumn("changed_field", changedFieldCol)
.select("id1", comparableColumns.map(c => s"B.$c") :+ "changed_field": _*)
result.show(false)
// +---+---+-----+-----+-------------+
// |id1|id2|name |state|changed_field|
// +---+---+-----+-----+-------------+
// |1 |a |bob |fl |state |
// |2 |b |susan|ma |name |
// +---+---+-----+-----+-------------+

You can compare the fields in an UDF which generates the appropriate string:
import spark.implicits._
val df_a = Seq(
(1, "a", "bob", "nj"),
(2, "b", "sue", "ma")
).toDF("id1", "id2", "name", "state")
val df_b = Seq(
(1, "a", "bob", "fl"),
(2, "b", "susane", "ma")
).toDF("id1", "id2", "name", "state")
val compareFields = udf((aName:String,aState:String,bName:String,bState:String) => {
val changedState = if (aState != bState) Some("state") else None
val changedName = if (aName != bName) Some("name") else None
Seq(changedName, changedState).flatten.mkString(",")
}
)
df_b.as("b")
.join(
df_a.as("a"), Seq("id1", "id2")
)
.withColumn("changed_fields",compareFields($"a.name",$"a.state",$"b.name",$"b.state"))
.select($"id1",$"id2",$"b.name",$"b.state",$"changed_fields")
.show()
gives
+---+---+------+-----+--------------+
|id1|id2| name|state|changed_fields|
+---+---+------+-----+--------------+
| 1| a| bob| fl| state|
| 2| b|susane| ma| name|
+---+---+------+-----+--------------+
EDIT:
Here a more generic version which compares all fields at once:
val compareFields = udf((a:Row,b:Row) => {
assert(a.schema==b.schema)
a.schema
.indices
.map(i => if(a.get(i)!=b.get(i)) Some(a.schema(i).name) else None)
.flatten
.mkString(",")
}
)
df_b.as("b")
.join(df_a.as("a"), $"a.id1" === $"b.id1" and $"a.id2" === $"b.id2")
.withColumn("changed_fields",compareFields(struct($"a.*"),struct($"b.*")))
.select($"b.id1",$"b.id2",$"b.name",$"b.state",$"changed_fields")
.show()

How to filter a map<String, Int> in a data frame : Spark / Scala

I am trying to get the count individual column to publish metrics. I have a I have a df [customerId : string, totalRent : bigint, totalPurchase: bigint, itemTypeCounts: map<string, int> ]
Right now I am doing :
val totalCustomers = df.count
val totalPurchaseCount = df.filter("totalPurchase > 0").count
val totalRentCount = df.filter("totalRent > 0").count
publishMetrics("Total Customer", totalCustomers )
publishMetrics("Total Purchase", totalPurchaseCount )
publishMetrics("Total Rent", totalRentCount )
publishMetrics("Percentage of Rent", percentage(totalRentCount, totalCustomers) )
publishMetrics("Percentage of Purchase", percentage(totalPurchaseCount, totalCustomers) )
private def percentageCalc(num: Long, denom: Long): Double = {
val numD: Long = num
val denomD: Long = denom
return if (denomD == 0.0) 0.0
else (numD / denomD) * 100
}
But I am not sure how do I do this for itemTypeCounts which is a map. I want count and percentage based on each key entry. The issue is the key value is dynamic , I mean there is no way I know the key value before hand. Can some one tell me how do get count for each key values. I am new to scala/spark, any other efficient approaches to get the counts of each columns are much appreciated.
Sample data :
customerId : 1
totalPurchase : 17
totalRent : 0
itemTypeCounts : {"TV" : 4, "Blender" : 2}
customerId : 2
totalPurchase : 1
totalRent : 1
itemTypeCounts : {"Cloths" : 4}
customerId : 3
totalPurchase : 0
totalRent : 10
itemTypeCounts : {"TV" : 4}
So the output is :
totalCustomer : 3
totalPurchaseCount : 2 (2 customers with totalPurchase > 0)
totalRent : 2 (2 customers with totalRent > 0)
itemTypeCounts_TV : 2
itemTypeCounts_Cloths : 1
itemTypeCounts_Blender : 1

You can accomplish this in Spark SQL, I show two examples of this below (one where the keys are known and can be enumerated in code, one where the keys are unknown). Note that by using Spark SQL, you take advantage of the catalyst optimizer, and this will run very efficiently:
val data = List((1,17,0,Map("TV" -> 4, "Blender" -> 2)),(2,1,1,Map("Cloths" -> 4)),(3,0,10,Map("TV" -> 4)))
val df = data.toDF("customerId","totalPurchase","totalRent","itemTypeCounts")
//Only good if you can enumerate the keys
def countMapKey(name:String) = {
count(when($"itemTypeCounts".getItem(name).isNotNull,lit(1))).as(s"itemTypeCounts_$name")
}
val keysToCount = List("TV","Blender","Cloths").map(key => countMapKey(key))
df.select(keysToCount :_*).show
+-----------------+----------------------+---------------------+
|itemTypeCounts_TV|itemTypeCounts_Blender|itemTypeCounts_Cloths|
+-----------------+----------------------+---------------------+
| 2| 1| 1|
+-----------------+----------------------+---------------------+
//More generic
val pivotData = df.select(explode(col("itemTypeCounts"))).groupBy(lit(1).as("tmp")).pivot("key").count.drop("tmp")
val renameStatement = pivotData.columns.map(name => col(name).as(s"itemTypeCounts_$name"))
pivotData.select(renameStatement :_*).show
+----------------------+---------------------+-----------------+
|itemTypeCounts_Blender|itemTypeCounts_Cloths|itemTypeCounts_TV|
+----------------------+---------------------+-----------------+
| 1| 1| 2|
+----------------------+---------------------+-----------------+

I'm a spark newbie myself, so there is probably a better way to do this. But one thing you could try is transforming the itemTypeCounts into a data structure in scala that you could work with. I converted each row to a List of (Name, Count) pairs e.g. List((Blender,2), (TV,4)).
With this you can have a List of such list of pairs, one list of pairs for each row. In your example, this will be a List of 3 elements:
List(
List((Blender,2), (TV,4)),
List((Cloths,4)),
List((TV,4))
)
Once you have this structure, transforming it to a desired output is standard scala.
Worked example is below:
val itemTypeCounts = df.select("itemTypeCounts")
//Build List of List of Pairs as suggested above
val itemsList = itemTypeCounts.collect().map {
row =>
val values = row.getStruct(0).mkString("",",","").split(",")
val fields = row.schema.head.dataType.asInstanceOf[StructType].map(s => s.name).toList
fields.zip(values).filter(p => p._2 != "null")
}.toList
// Build a summary map for the list constructed above
def itemTypeCountsSummary(frames: List[List[(String, String)]], summary: Map[String, Int]) : Map[String, Int] = frames match {
case Nil => summary
case _ => itemTypeCountsSummary(frames.tail, merge(frames.head, summary))
}
//helper method for the summary map.
def merge(head: List[(String, String)], summary: Map[String, Int]): Map[String, Int] = {
val headMap = head.toMap.map(e => ("itemTypeCounts_" + e._1, 1))
val updatedSummary = summary.map{e => if(headMap.contains(e._1)) (e._1, e._2 + 1) else e}
updatedSummary ++ headMap.filter(e => !updatedSummary.contains(e._1))
}
val summaryMap = itemTypeCountsSummary(itemsList, Map())
summaryMap.foreach(e => println(e._1 + ": " + e._2 ))
Output:
itemTypeCounts_Blender: 1
itemTypeCounts_TV: 2
itemTypeCounts_Cloths: 1

Borrowing the input from Nick and using spark-sql pivot solution:
val data = List((1,17,0,Map("TV" -> 4, "Blender" -> 2)),(2,1,1,Map("Cloths" -> 4)),(3,0,10,Map("TV" -> 4)))
val df = data.toDF("customerId","totalPurchase","totalRent","itemTypeCounts")
df.show(false)
df.createOrReplaceTempView("df")
+----------+-------------+---------+-----------------------+
|customerId|totalPurchase|totalRent|itemTypeCounts |
+----------+-------------+---------+-----------------------+
|1 |17 |0 |[TV -> 4, Blender -> 2]|
|2 |1 |1 |[Cloths -> 4] |
|3 |0 |10 |[TV -> 4] |
+----------+-------------+---------+-----------------------+
Assuming that we know the distinct itemType beforehand, we can use
val dfr = spark.sql("""
select * from (
select explode(itemTypeCounts) itemTypeCounts from (
select flatten(collect_list(map_keys(itemTypeCounts))) itemTypeCounts from df
) ) t
pivot ( count(itemTypeCounts) as c3
for itemTypeCounts in ('TV' ,'Blender' ,'Cloths') )
""")
dfr.show(false)
+---+-------+------+
|TV |Blender|Cloths|
+---+-------+------+
|2 |1 |1 |
+---+-------+------+
For renaming columns,
dfr.select(dfr.columns.map( x => col(x).alias("itemTypeCounts_" + x )):_* ).show(false)
+-----------------+----------------------+---------------------+
|itemTypeCounts_TV|itemTypeCounts_Blender|itemTypeCounts_Cloths|
+-----------------+----------------------+---------------------+
|2 |1 |1 |
+-----------------+----------------------+---------------------+
To get the distinct itemType dynamically and pass it to pivot
val item_count_arr = spark.sql(""" select array_distinct(flatten(collect_list(map_keys(itemTypeCounts)))) itemTypeCounts from df """).as[Array[String]].first
item_count_arr: Array[String] = Array(TV, Blender, Cloths)
spark.sql(s"""
select * from (
select explode(itemTypeCounts) itemTypeCounts from (
select flatten(collect_list(map_keys(itemTypeCounts))) itemTypeCounts from df
) ) t
pivot ( count(itemTypeCounts) as c3
for itemTypeCounts in (${item_count_arr.map(c => "'"+c+"'").mkString(",")}) )
""").show(false)
+---+-------+------+
|TV |Blender|Cloths|
+---+-------+------+
|2 |1 |1 |
+---+-------+------+

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Spark higher order functions to compute top N products from a comma separated list - scala

Related

Is it possible to register a string as a UDF?

Spark SQL Split or Extract words from String of Words

How to use Except function with spark Dataframe

Scala: How to add a column with the value of a changed field that was changed between two tables

How to filter a map<String, Int> in a data frame : Spark / Scala

Categories

Resources