Collapsing column values in spark dataframes

Collapsing column values in spark dataframes - scala

I have 2 DataFrames
case class UserTransactions(id: Long, transactionDate: java.sql.Date, currencyUsed: String, value: Long)
ID, TransactionDate, CurrencyUsed, value
1, 2016-01-05, USD, 100
1, 2016-01-09, GBP, 150
1, 2016-02-01, USD, 50
1, 2016-02-10, JPN, 10
2, 2016-01-10, EURO, 50
2, 2016-01-10, GBP, 100
case class ReportingTime(userId: Long, reportDate: java.sql.Date)
userId, reportDate
1, 2016-01-05
1, 2016-01-31
1, 2016-02-15
2, 2016-01-10
2, 2016-02-01
Now I want to get summary by combining all previously used currencies by userId, reportDate and sum. The results should look like:
userId, reportDate, trasactionSummary
1, 2016-01-05, None
1, 2016-01-31, (USD -> 100)(GBP-> 150) // combined above 2 transactions less than 2016-01-31
1, 2016-02-15, (USD -> 150)(GBP-> 150)(JPN->10) // combined transactions less than 2016-02-15
2, 2016-01-10, None
2, 2016-02-01, (EURO-> 50) (GBP-> 100)
What is the best way to do this to do this? We have over 300 million transactions where each user can have up to 10,000 transactions.

The below snippet would achieve your requirement. Initial joining and aggregation is done via the Dataframe API of pyspark. Then the grouping of data (using reduceByKey) and final dataset preparation is done via RDD api since it is more suitable for such operations.
from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import DateType
from pyspark.sql import functions as F
df1 = spark.createDataFrame([(1,'2016-01-05','USD',100),
(1,'2016-01-09','GBP',150),
(1,'2016-02-01','USD',50),
(1,'2016-02-10','JPN',10),
(2,'2016-01-10','EURO',50),
(2,'2016-01-10','GBP',100)],['id', 'tdate', 'currency', 'value'])
df2 = spark.createDataFrame([(1,'2016-01-05'),
(1,'2016-01-31'),
(1,'2016-02-15'),
(2,'2016-01-10'),
(2,'2016-02-01')],['user_id', 'report_date'])
func = udf (lambda x: datetime.strptime(x, '%Y-%m-%d'), DateType()) ### function to convert string data type to date data type
df2 = df2.withColumn('tdate', func(df2.report_date))
df1 = df1.withColumn('tdate', func(df1.tdate))
result = df2.join(df1, (df1.id == df2.user_id) & (df1.tdate < df2.report_date), 'left_outer').select('user_id', 'report_date', 'currency', 'value').groupBy('user_id', 'report_date', 'currency').agg(F.sum('value').alias('value'))
data = result.rdd.map(lambda x: (x.user_id,x.report_date,x.currency,x.value)).keyBy(lambda x: (x[0],x[1])).mapValues(lambda x: filter(lambda x: bool(x),[(x[2],x[3]) if x[2] else None])).reduceByKey(lambda x,y: x + y).map(lambda x: (x[0][0],x[0][1], x[1]))
The final result generated is as shown below.
>>> spark.createDataFrame([ (x[0],x[1],str(x[2])) for x in data.collect()], ['id', 'date', 'values']).orderBy('id', 'date').show(20, False)
+---+----------+--------------------------------------------+
|id |date |values |
+---+----------+--------------------------------------------+
|1 |2016-01-05|[] |
|1 |2016-01-31|[(u'USD', 100), (u'GBP', 150)] |
|1 |2016-02-15|[(u'USD', 150), (u'GBP', 150), (u'JPN', 10)]|
|2 |2016-01-10|[] |
|2 |2016-02-01|[(u'EURO', 50), (u'GBP', 100)] |
+---+----------+--------------------------------------------+

In case some one needs in Scala
case class Transaction(id: String, date: java.sql.Date, currency:Option[String], value: Option[Long])
case class Report(id:String, date:java.sql.Date)
def toDate(date: String): java.sql.Date = {
val sf = new SimpleDateFormat("yyyy-MM-dd")
new java.sql.Date(sf.parse(date).getTime)
}
val allTransactions = Seq(
Transaction("1", toDate("2016-01-05"),Some("USD"),Some(100L)),
Transaction("1", toDate("2016-01-09"),Some("GBP"),Some(150L)),
Transaction("1",toDate("2016-02-01"),Some("USD"),Some(50L)),
Transaction("1",toDate("2016-02-10"),Some("JPN"),Some(10L)),
Transaction("2",toDate("2016-01-10"),Some("EURO"),Some(50L)),
Transaction("2",toDate("2016-01-10"),Some("GBP"),Some(100L))
)
val allReports = Seq(
Report("1",toDate("2016-01-05")),
Report("1",toDate("2016-01-31")),
Report("1",toDate("2016-02-15")),
Report("2",toDate("2016-01-10")),
Report("2",toDate("2016-02-01"))
)
val transections:Dataset[Transaction] = spark.createDataFrame(allTransactions).as[Transaction]
val reports: Dataset[Report] = spark.createDataFrame(allReports).as[Report]
val result = reports.alias("rp").join(transections.alias("tx"), (col("tx.id") === col("rp.id")) && (col("tx.date") < col("rp.date")), "left_outer")
.select("rp.id", "rp.date", "currency", "value")
.groupBy("rp.id", "rp.date", "currency").agg(sum("value"))
.toDF("id", "date", "currency", "value")
.as[Transaction]
val data = result.rdd.keyBy(x => (x.id , x.date))
.mapValues(x => if (x.currency.isDefined) collection.Map[String, Long](x.currency.get -> x.value.get) else collection.Map[String, Long]())
.reduceByKey((x,y) => x ++ y).map(x => (x._1._1, x._1._2, x._2))
.toDF("id", "date", "map")
.orderBy("id", "date")
Console output
+---+----------+--------------------------------------+
|id |date |map |
+---+----------+--------------------------------------+
|1 |2016-01-05|Map() |
|1 |2016-01-31|Map(GBP -> 150, USD -> 100) |
|1 |2016-02-15|Map(USD -> 150, GBP -> 150, JPN -> 10)|
|2 |2016-01-10|Map() |
|2 |2016-02-01|Map(GBP -> 100, EURO -> 50) |
+---+----------+--------------------------------------+

Related

Spark higher order functions to compute top N products from a comma separated list

I am using Spark 2.4 and I have a spark dataframe that has 2 columns - id and product_list. The data consists of list of products that every id has interacted with.
here is the sample code -
scala> spark.version
res3: String = 2.4.3
val df = Seq(
("1", "p1,p1,p1,p1,p1,p3,p3,p2,p2,p2,p2"),
("2", "p2,p2,p2,p2,p2,p4,p4,p4,p1,p3")
).toDF("id", "product_list")
df.createOrReplaceTempView("df")
+---+--------------------------------+
|id |product_list |
+---+--------------------------------+
|1 |p1,p1,p1,p1,p1,p3,p3,p2,p2,p2,p2|
|2 |p2,p2,p2,p2,p2,p4,p4,p4,p1,p3 |
+---+--------------------------------+
I would like to return those top 2 products that every id has had a interaction with. For instance, id = 1 has viewed products p1 - 5 times and p2 - 4 times, so i would like to return p1,p2 for id = 1. Similarly, p2,p4 for id = 2.
My final output should look like
id, most_seen_products
1, p1,p2
2, p2,p4
Since I am using Spark 2.4, I was wondering if there is a higher order function to first convert this list to array and then return the top 2 viewed products. In general the code should handle top N products.

Here is my approach
val df = Seq(
("1", "p1,p1,p1,p1,p1,p3,p3,p2,p2,p2,p2"),
("2", "p2,p2,p2,p2,p2,p4,p4,p4,p1,p3")
).toDF("id", "product_list")
def getMetrics(value: Row, n: Int): (String, String) = {
val split = value.getAs[String]("product_list").split(",")
val sortedRecords = split.groupBy(x => x).map(data => (data._1, data._2.size)).toList.sortWith(_._2 > _._2)
(value.getAs[String]("id"), sortedRecords.take(n).map(_._1).mkString(","))
}
df.map(value =>
getMetrics(value, 2)
).withColumnRenamed("_1", "id").withColumnRenamed("_2", "most_seen_products") show (false)
Result
+---+------------------+
|id |most_seen_products|
+---+------------------+
|1 |p1,p2 |
|2 |p2,p4 |
+---+------------------+

Looking at your data format, you can just use a .map() or in case of SQL, a UDF, which converts all rows. The function will be:
productList => {
// list of products = split productList by comma
// add all items to a String/Count map
// sort the map, get first 2 elements
// return string.join of those 2 elements
}

scala> import org.apache.spark.sql.expressions.UserDefinedFunction
scala> import scala.collection.immutable.ListMap
scala> def max_products:UserDefinedFunction = udf((product:String) => {
val productList = product.split(",").toList
val finalList = ListMap(productList.groupBy(i=>i).mapValues(_.size).toSeq.sortWith(_._2 > _._2):_*).keys.toList
finalList(0) + "," + finalList(1)
})
scala> df.withColumn("most_seen_products", max_products(col("product_list"))).show(false)
+---+--------------------------------+------------------+
|id |product_list |most_seen_products|
+---+--------------------------------+------------------+
|1 |p1,p1,p1,p1,p1,p3,p3,p2,p2,p2,p2|p1,p2 |
|2 |p2,p2,p2,p2,p2,p4,p4,p4,p1,p3 |p2,p4 |
+---+--------------------------------+------------------+

conditional operator with groupby in spark rdd level - scala

I am using Spark 1.60 and Scala 2.10.5
I have a dataframe like this,
+------------------+
|id | needed |
+------------------+
|1 | 2 |
|1 | 0 |
|1 | 3 |
|2 | 0 |
|2 | 0 |
|3 | 1 |
|3 | 2 |
+------------------+
From this df I created an rdd like this,
val dfRDD = df.rdd
from my rdd, I want to group by id and count of needed is > 0.
((1, 2), (2,0), (3,2))
So, I tried like this,
val groupedDF = dfRDD.map(x =>(x(0), x(1) > 0)).count.redueByKey(_+_)
In this case, I am getting an error:
error: value > is not a member of any
I need that in rdd level. Any help to get my desired output would be great.

The problem is that in your map you're calling the apply method of Row, and as you can see in its scaladoc, that method returns Any - and as you can see for the error and from the scaladoc there is not such method < in Any
You can fix it using the getAs[T] method.
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
val spark =
SparkSession
.builder
.master("local[*]")
.getOrCreate()
import spark.implicits._
val df =
List(
(1, 2),
(1, 0),
(1, 3),
(2, 0),
(2, 0),
(3, 1),
(3, 2)
).toDF("id", "needed")
val rdd: RDD[(Int, Int)] = df.rdd.map(row => (row.getAs[Int](fieldName = "id"), row.getAs[Int](fieldName = "needed")))
From there you can continue with the aggregation, you have a few mistakes in your logic.
First, you don't need the count call.
And second, if you need to count the amount of times "needed" was greater than one you can't do _ + _, because that is the sum of needed values.
val grouped: RDD[(Int, Int)] = rdd.reduceByKey { (acc, v) => if (v > 0) acc + 1 else acc }
val result: Array[(Int, Int)] = grouped.collect()
// Array((1,3), (2,0), (3,2))
PS: You should tell your professor to upgrade to Spark 2 and Scala 2.11 ;)
Edit
Using case classes in the above example.
final case class Data(id: Int, needed: Int)
val rdd: RDD[Data] = df.as[Data].rdd
val grouped: RDD[(Int, Int)] = rdd.map(d => d.id -> d.needed).reduceByKey { (acc, v) => if (v > 0) acc + 1 else acc }
val result: Array[(Int, Int)] = grouped.collect()
// Array((1,3), (2,0), (3,2))

There's no need to do the calculation at the rdd level. Aggregation with the data frame should work:
df.groupBy("id").agg(sum(($"needed" > 0).cast("int")).as("positiveCount")).show
+---+-------------+
| id|positiveCount|
+---+-------------+
| 1| 2|
| 3| 2|
| 2| 0|
+---+-------------+
If you have to work with RDD, use row.getInt or as #Luis' answer row.getAs[Int] to get the value with explicit type, and then do the comparison and reduceByKey:
df.rdd.map(r => (r.getInt(0), if (r.getInt(1) > 0) 1 else 0)).reduceByKey(_ + _).collect
// res18: Array[(Int, Int)] = Array((1,2), (2,0), (3,2))

Convert arbitrary number of columns to Vector

How to convert a group of arbitrary columns to a Mllib Vector?
Basically, I have first column of my DataFrame with a fixed name and then a number of arbitrary named columns each having Double values inside.
like so:
name | a | b | c |
val1 | 0.0 | 1.0 | 1.0 |
val2 | 2.0 | 1.0 | 5.0 |
Could be any number of columns. I need to get a DataSet of the following:
final case class ValuesRow(name: String, values: Vector)

This can be done in a simple way using VectorAssembler. The columns that are to be merged into a Vector are used as input, in this case all columns except the first.
val df = spark.createDataFrame(Seq(("val1", 0, 1, 1), ("val2", 2, 1, 5)))
.toDF("name", "a", "b", "c")
val columnNames = df.columns.drop(1) // drop the name column
val assembler = new VectorAssembler()
.setInputCols(columnNames)
.setOutputCol("values")
val df2 = assembler.transform(df).select("name", "values").as[ValuesRow]
The result will be a dataset containing the name and values columns:
+----+-------------+
|name| values|
+----+-------------+
|val1|[0.0,1.0,1.0]|
|val2|[2.0,1.0,5.0]|
+----+-------------+

Here's one way to do it:
import org.apache.spark.sql.functions._
import org.apache.spark.mllib.linalg.DenseVector
val ds = Seq(
("val1", 0.0, 1.0, 1.0),
("val2", 2.0, 1.0, 5.0)
).toDF("name", "a", "b", "c").
as[(String, Double, Double, Double)]
val colList = ds.columns
val keyCol = colList(0)
val valCols = colList.drop(1)
def arrToVec = udf(
(s: Seq[Double]) => new DenseVector(s.toArray)
)
ds.select(
col(keyCol), arrToVec( array(valCols.map(x => col(x)): _*) ).as("values")
).show
// +----+-------------+
// |name| values|
// +----+-------------+
// |val1|[0.0,1.0,1.0]|
// |val2|[2.0,1.0,5.0]|
// +----+-------------+

How to pivot a table into a timeseries table in Scala

I have the following table:
index 0 1 2 id
1 9.69 1.18 0.59 62
2 7.38 2.18 0.87 62
3 10.02 1.16 0.29 62
That I'm trying to pivot into a time series like table.
Expected Output:
data id
[9.69, 7.38, 10.02] 62
[1.18, 2.18, 1.16] 62
[0.59, 0.87, 0.29] 62
I tried the following code
val table = df.groupBy(df.col("id")).pivot("index").sum("0").cache()
val tablets = table.map(x => new transform(1.until(x.length).map(x.getDouble(_)).toList, x.getString(0)))
case class transform(data:List[Double], start:String)
But it's given only this output
[9.69, 7.38, 10.02] 62
How can I iterate through all columns and get the desired output table as above?
class pivot (df: DataFrame) {
val col1Names = df.drop("id").columns.tail
val kv = explode(array(df.select(col1Names.map(col): _*).columns.map {
c => struct(lit(c).alias("k"), col(c).alias("v"))
}: _*))
val tempdf = df.withColumn("kv", kv)
.select("index", "kv.k", "kv.v", "id")
.groupBy("id", "k")
.pivot("index")
.agg(first("v"))
.drop("k")
val col2Names = tempdf.columns.tail
val finaldf = tempdf.withColumn("data", array(col2Names.map(col): _*)).drop(col2Names: _*)
}

In your solution you have used groupBy and sum which will generate aggregated one row for each group. Thats why you were getting one result in your result.
The solution to your problem is a bit complex. I have used combination of withColumn, explode, array, struct, pivot, groupBy, agg, drop, col, select and alias. Following is the solution
val df = Seq((1, 9.69, 1.18, 0.59, 62),
(2, 7.38, 2.18, 0.87, 62),
(3, 10.02, 1.16, 0.29, 62)).toDF("index", "0", "1", "2", "id")
As defined in your question you must already have dataframe as below by reading input as above
+-----+-----+----+----+---+
|index|0 |1 |2 |id |
+-----+-----+----+----+---+
|1 |9.69 |1.18|0.59|62 |
|2 |7.38 |2.18|0.87|62 |
|3 |10.02|1.16|0.29|62 |
+-----+-----+----+----+---+
If yes then following solution should work.
val col1Names = df.drop("id").columns.tail
val kv = explode(array(df.select(col1Names.map(col): _*).columns.map {
c => struct(lit(c).alias("k"), col(c).alias("v"))
}: _*))
val tempdf = df.withColumn("kv", kv)
.select("index", "kv.k", "kv.v", "id")
.groupBy("id", "k")
.pivot("index")
.agg(first("v"))
.orderBy("k")
.drop("k")
val col2Names = tempdf.columns.tail
val finaldf = tempdf.withColumn("data", array(col2Names.map(col): _*)).drop(col2Names: _*).sort($"data".desc)
You should be getting following output
+---+-------------------+
|id |data |
+---+-------------------+
|62 |[9.69, 7.38, 10.02]|
|62 |[1.18, 2.18, 1.16] |
|62 |[0.59, 0.87, 0.29] |
+---+-------------------+

How to filter a map<String, Int> in a data frame : Spark / Scala

I am trying to get the count individual column to publish metrics. I have a I have a df [customerId : string, totalRent : bigint, totalPurchase: bigint, itemTypeCounts: map<string, int> ]
Right now I am doing :
val totalCustomers = df.count
val totalPurchaseCount = df.filter("totalPurchase > 0").count
val totalRentCount = df.filter("totalRent > 0").count
publishMetrics("Total Customer", totalCustomers )
publishMetrics("Total Purchase", totalPurchaseCount )
publishMetrics("Total Rent", totalRentCount )
publishMetrics("Percentage of Rent", percentage(totalRentCount, totalCustomers) )
publishMetrics("Percentage of Purchase", percentage(totalPurchaseCount, totalCustomers) )
private def percentageCalc(num: Long, denom: Long): Double = {
val numD: Long = num
val denomD: Long = denom
return if (denomD == 0.0) 0.0
else (numD / denomD) * 100
}
But I am not sure how do I do this for itemTypeCounts which is a map. I want count and percentage based on each key entry. The issue is the key value is dynamic , I mean there is no way I know the key value before hand. Can some one tell me how do get count for each key values. I am new to scala/spark, any other efficient approaches to get the counts of each columns are much appreciated.
Sample data :
customerId : 1
totalPurchase : 17
totalRent : 0
itemTypeCounts : {"TV" : 4, "Blender" : 2}
customerId : 2
totalPurchase : 1
totalRent : 1
itemTypeCounts : {"Cloths" : 4}
customerId : 3
totalPurchase : 0
totalRent : 10
itemTypeCounts : {"TV" : 4}
So the output is :
totalCustomer : 3
totalPurchaseCount : 2 (2 customers with totalPurchase > 0)
totalRent : 2 (2 customers with totalRent > 0)
itemTypeCounts_TV : 2
itemTypeCounts_Cloths : 1
itemTypeCounts_Blender : 1

You can accomplish this in Spark SQL, I show two examples of this below (one where the keys are known and can be enumerated in code, one where the keys are unknown). Note that by using Spark SQL, you take advantage of the catalyst optimizer, and this will run very efficiently:
val data = List((1,17,0,Map("TV" -> 4, "Blender" -> 2)),(2,1,1,Map("Cloths" -> 4)),(3,0,10,Map("TV" -> 4)))
val df = data.toDF("customerId","totalPurchase","totalRent","itemTypeCounts")
//Only good if you can enumerate the keys
def countMapKey(name:String) = {
count(when($"itemTypeCounts".getItem(name).isNotNull,lit(1))).as(s"itemTypeCounts_$name")
}
val keysToCount = List("TV","Blender","Cloths").map(key => countMapKey(key))
df.select(keysToCount :_*).show
+-----------------+----------------------+---------------------+
|itemTypeCounts_TV|itemTypeCounts_Blender|itemTypeCounts_Cloths|
+-----------------+----------------------+---------------------+
| 2| 1| 1|
+-----------------+----------------------+---------------------+
//More generic
val pivotData = df.select(explode(col("itemTypeCounts"))).groupBy(lit(1).as("tmp")).pivot("key").count.drop("tmp")
val renameStatement = pivotData.columns.map(name => col(name).as(s"itemTypeCounts_$name"))
pivotData.select(renameStatement :_*).show
+----------------------+---------------------+-----------------+
|itemTypeCounts_Blender|itemTypeCounts_Cloths|itemTypeCounts_TV|
+----------------------+---------------------+-----------------+
| 1| 1| 2|
+----------------------+---------------------+-----------------+

I'm a spark newbie myself, so there is probably a better way to do this. But one thing you could try is transforming the itemTypeCounts into a data structure in scala that you could work with. I converted each row to a List of (Name, Count) pairs e.g. List((Blender,2), (TV,4)).
With this you can have a List of such list of pairs, one list of pairs for each row. In your example, this will be a List of 3 elements:
List(
List((Blender,2), (TV,4)),
List((Cloths,4)),
List((TV,4))
)
Once you have this structure, transforming it to a desired output is standard scala.
Worked example is below:
val itemTypeCounts = df.select("itemTypeCounts")
//Build List of List of Pairs as suggested above
val itemsList = itemTypeCounts.collect().map {
row =>
val values = row.getStruct(0).mkString("",",","").split(",")
val fields = row.schema.head.dataType.asInstanceOf[StructType].map(s => s.name).toList
fields.zip(values).filter(p => p._2 != "null")
}.toList
// Build a summary map for the list constructed above
def itemTypeCountsSummary(frames: List[List[(String, String)]], summary: Map[String, Int]) : Map[String, Int] = frames match {
case Nil => summary
case _ => itemTypeCountsSummary(frames.tail, merge(frames.head, summary))
}
//helper method for the summary map.
def merge(head: List[(String, String)], summary: Map[String, Int]): Map[String, Int] = {
val headMap = head.toMap.map(e => ("itemTypeCounts_" + e._1, 1))
val updatedSummary = summary.map{e => if(headMap.contains(e._1)) (e._1, e._2 + 1) else e}
updatedSummary ++ headMap.filter(e => !updatedSummary.contains(e._1))
}
val summaryMap = itemTypeCountsSummary(itemsList, Map())
summaryMap.foreach(e => println(e._1 + ": " + e._2 ))
Output:
itemTypeCounts_Blender: 1
itemTypeCounts_TV: 2
itemTypeCounts_Cloths: 1

Borrowing the input from Nick and using spark-sql pivot solution:
val data = List((1,17,0,Map("TV" -> 4, "Blender" -> 2)),(2,1,1,Map("Cloths" -> 4)),(3,0,10,Map("TV" -> 4)))
val df = data.toDF("customerId","totalPurchase","totalRent","itemTypeCounts")
df.show(false)
df.createOrReplaceTempView("df")
+----------+-------------+---------+-----------------------+
|customerId|totalPurchase|totalRent|itemTypeCounts |
+----------+-------------+---------+-----------------------+
|1 |17 |0 |[TV -> 4, Blender -> 2]|
|2 |1 |1 |[Cloths -> 4] |
|3 |0 |10 |[TV -> 4] |
+----------+-------------+---------+-----------------------+
Assuming that we know the distinct itemType beforehand, we can use
val dfr = spark.sql("""
select * from (
select explode(itemTypeCounts) itemTypeCounts from (
select flatten(collect_list(map_keys(itemTypeCounts))) itemTypeCounts from df
) ) t
pivot ( count(itemTypeCounts) as c3
for itemTypeCounts in ('TV' ,'Blender' ,'Cloths') )
""")
dfr.show(false)
+---+-------+------+
|TV |Blender|Cloths|
+---+-------+------+
|2 |1 |1 |
+---+-------+------+
For renaming columns,
dfr.select(dfr.columns.map( x => col(x).alias("itemTypeCounts_" + x )):_* ).show(false)
+-----------------+----------------------+---------------------+
|itemTypeCounts_TV|itemTypeCounts_Blender|itemTypeCounts_Cloths|
+-----------------+----------------------+---------------------+
|2 |1 |1 |
+-----------------+----------------------+---------------------+
To get the distinct itemType dynamically and pass it to pivot
val item_count_arr = spark.sql(""" select array_distinct(flatten(collect_list(map_keys(itemTypeCounts)))) itemTypeCounts from df """).as[Array[String]].first
item_count_arr: Array[String] = Array(TV, Blender, Cloths)
spark.sql(s"""
select * from (
select explode(itemTypeCounts) itemTypeCounts from (
select flatten(collect_list(map_keys(itemTypeCounts))) itemTypeCounts from df
) ) t
pivot ( count(itemTypeCounts) as c3
for itemTypeCounts in (${item_count_arr.map(c => "'"+c+"'").mkString(",")}) )
""").show(false)
+---+-------+------+
|TV |Blender|Cloths|
+---+-------+------+
|2 |1 |1 |
+---+-------+------+