Convert MapPartitionsRDD to DataFrame and group data by 2 keys - scala

I have a dataframe which looks like this:
country | user | count
----------------------
Germany | Sarah| 2
China | Paul | 1
Germany | Alan | 3
Germany | Paul | 1
...
What I am trying to do is to convert this dataframe to another which looks like this:
dimension | value
--------------------------------------------
Country | [Germany -> 6, China -> 1]
--------------------------------------------
User | [Sarah -> 2, Paul -> 2, Alan -> 3]
...
At first I tried to do it by this:
var newDF = Seq.empty[(String, Map[String, Long])].toDF("dimension", "value")

df.collect().foreach { row =>
  Array(0, 1).map { pos =>
    newDF = newDF.union(
      Seq((df.columns.toSeq(pos).toString,
           Map(row.mkString(",").split(",")(pos) -> row.mkString(",").split(",")(2).toLong))).toDF()
    )
  }
}

val newDF2 = newDF.groupBy("dimension")
  .agg(collect_list("value"))
  .as[(String, Seq[Map[String, Long]])]
  .map { case (id, list) => (id, list.reduce(_ |+| _)) }
  .toDF("dimension", "value")
But the collect() was killing my driver. Therefore, I have tried to do it like this:
class DimItem[T](val dimension: String, val value: String, val metric: T)

val items: RDD[DimItem[Long]] = df.rdd.flatMap { row =>
  dims.zipWithIndex.map { case (dim, i) =>
    new DimItem(dim, row(i).toString, row(13).asInstanceOf[Long])
  }
}
// with the format [DimItem(Country, Germany, 2), DimItem(User, Sarah, 2)], ...

val itemsGrouped: RDD[((String, String), Iterable[DimItem[Long]])] =
  items.groupBy(x => (x.dimension, x.value))

val aggregatedItems: RDD[DimItem[Long]] = itemsGrouped.map { case (key, items) =>
  new DimItem(key._1, key._2, items.reduce((a, b) => a.metric + b.metric))
}
The idea is to save in an RDD objects like (Country, China, 1), (Country, Germany, 3), (Country, Germany, 1), ... and then group it by the 2 first keys (Country, China), (Country, Germany), ... Once grouped, sum the count they have. Ex: having (Country, Germany, 3), (Country, Germany, 1) will become (Country, Germany, 4).
But once I get here, it tells me that in items.reduce() there is a mismatch: it expects a DimItem[Long] but gets a Long.
The next step will be to group it by the key "dimension", create the Map[String, Int] format in the column "value", and convert it to a DF.
I have 2 questions.
First: is this last code correct?
Second: How can I convert this MapPartitionsRDD into a DF?
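As a side note on the RDD attempt: the mismatch in items.reduce() arises because reduce on an Iterable[DimItem[Long]] must return a DimItem[Long], while the lambda returns a Long. A minimal sketch of one way around it, reusing the names from the question, is to reduce over the metrics and rebuild the DimItem:
// sum the metrics only, then wrap the result back into a DimItem
val aggregatedItems: RDD[DimItem[Long]] = itemsGrouped.map { case ((dim, value), group) =>
  new DimItem(dim, value, group.map(_.metric).sum)
}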

Here is one solution based on the DataFrame API:
import org.apache.spark.sql.functions.{lit, sum, map_from_arrays, collect_list}

def transform(df: DataFrame, colName: String): DataFrame =
  df.groupBy(colName)
    .agg(sum("count").as("sum"))
    .agg(
      map_from_arrays(
        collect_list(colName),
        collect_list("sum")
      ).as("value")
    )
    .select(lit(colName).as("dimension"), $"value")
val countryDf = transform(df, "country")
val userDf = transform(df, "user")
countryDf.unionByName(userDf).show(false)
// +---------+----------------------------------+
// |dimension|value                             |
// +---------+----------------------------------+
// |country  |[Germany -> 6, China -> 1]        |
// |user     |[Sarah -> 2, Alan -> 3, Paul -> 2]|
// +---------+----------------------------------+
Analysis: first we get the sum per country and per user by grouping by country and user respectively. Next we add one more aggregation to the pipeline, which collects the previous results into a map. The map is populated via the map_from_arrays function introduced in Spark 2.4.0; its keys and values are gathered with collect_list. Finally we union the two dataframes to produce the final result.
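If there are more than two dimension columns, the same transform can be folded over a list of column names and the per-column results combined with unionByName; a small sketch reusing the transform and df defined above:
// apply transform to every dimension column and union the results
val dimensions = Seq("country", "user")
val result = dimensions
  .map(c => transform(df, c))
  .reduce(_ unionByName _)
result.show(false)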

Related

How to create map fields in scala from dataframes?

I have two data frames as below:
df1:
language (column name)
tamil
telugu
hindi
df2:
id keywords
101 ["tamildiary", "tamilkeyboard", "telugumovie"]
102 ["tamilmovie"]
103 ["hindirhymes", "hindimovie"]
Using these two data frames, I need to create something that looks like the below. (Basically, I need to find how many language-specific keywords each id has and store them in a map.)
id keywords
101 {"tamil":2, "telugu":1}
102 {"tamil":1}
103 {"hindi":2}
Could anyone please help me in doing this. Thanks a lot!
Consider joining the two DataFrames using regexp_replace to regex match for strings with a leading language-keyword, followed by a couple of groupBys to aggregate an array of (language, count), which then gets converted to a map via map_from_entries:
import org.apache.spark.sql.functions._
import spark.implicits._

val df1 = Seq("tamil", "telugu", "hindi").toDF("language")

val df2 = Seq(
  (101, Seq("tamildiary", "tamilkeyboard", "telugumovie")),
  (102, Seq("tamilmovie")),
  (103, Seq("hindirhymes", "hindimovie"))
).toDF("id", "keywords")

val pattern = concat(lit("^"), df1("language"), lit(".*"))

df2.
  withColumn("keyword", explode($"keywords")).as("df2").
  join(df1.as("df1"), regexp_replace($"df2.keyword", pattern, lit("")) =!= $"df2.keyword").
  groupBy("id", "language").agg(size(collect_list($"language")).as("count")).
  groupBy("id").agg(map_from_entries(collect_list(struct($"language", $"count"))).as("keywords")).
  show(false)
// +---+-------------------------+
// |id |keywords |
// +---+-------------------------+
// |101|[tamil -> 2, telugu -> 1]|
// |103|[hindi -> 2] |
// |102|[tamil -> 1] |
// +---+-------------------------+
For Spark 2.0 - 2.3, create a UDF to mimic the functionality of map_from_entries, which is available only in Spark 2.4+:
import org.apache.spark.sql.Row

val arrayToMap = udf { (arr: Seq[Row]) =>
  arr.map { case Row(k: String, v: Int) => (k, v) }.toMap
}

df2.
  withColumn("keyword", explode($"keywords")).as("df2").
  join(df1.as("df1"), regexp_replace($"df2.keyword", pattern, lit("")) =!= $"df2.keyword").
  groupBy("id", "language").agg(size(collect_list($"language")).as("count")).
  groupBy("id").agg(arrayToMap(collect_list(struct($"language", $"count"))).as("keywords")).
  show(false)
As a side note, below are a couple of alternatives for the regex-matching condition via a SQL expression:
Using regexp_extract:
expr("regexp_extract(df2.keyword, concat('^(', df1.language, ').*'), 1) = df1.language")
Using rlike:
expr("df2.keyword rlike concat('^', df1.language, '.*')")

Spark higher order functions to compute top N products from a comma separated list

I am using Spark 2.4 and I have a spark dataframe that has 2 columns - id and product_list. The data consists of list of products that every id has interacted with.
here is the sample code -
scala> spark.version
res3: String = 2.4.3
val df = Seq(
("1", "p1,p1,p1,p1,p1,p3,p3,p2,p2,p2,p2"),
("2", "p2,p2,p2,p2,p2,p4,p4,p4,p1,p3")
).toDF("id", "product_list")
df.createOrReplaceTempView("df")
+---+--------------------------------+
|id |product_list |
+---+--------------------------------+
|1 |p1,p1,p1,p1,p1,p3,p3,p2,p2,p2,p2|
|2 |p2,p2,p2,p2,p2,p4,p4,p4,p1,p3 |
+---+--------------------------------+
I would like to return the top 2 products that every id has had an interaction with. For instance, id = 1 has viewed product p1 5 times and p2 4 times, so I would like to return p1,p2 for id = 1. Similarly, p2,p4 for id = 2.
My final output should look like
id, most_seen_products
1, p1,p2
2, p2,p4
Since I am using Spark 2.4, I was wondering if there is a higher order function to first convert this list to array and then return the top 2 viewed products. In general the code should handle top N products.
Here is my approach
val df = Seq(
  ("1", "p1,p1,p1,p1,p1,p3,p3,p2,p2,p2,p2"),
  ("2", "p2,p2,p2,p2,p2,p4,p4,p4,p1,p3")
).toDF("id", "product_list")

def getMetrics(value: Row, n: Int): (String, String) = {
  val split = value.getAs[String]("product_list").split(",")
  val sortedRecords = split.groupBy(x => x).map(data => (data._1, data._2.size)).toList.sortWith(_._2 > _._2)
  (value.getAs[String]("id"), sortedRecords.take(n).map(_._1).mkString(","))
}

df.map(value => getMetrics(value, 2))
  .withColumnRenamed("_1", "id")
  .withColumnRenamed("_2", "most_seen_products")
  .show(false)
Result
+---+------------------+
|id |most_seen_products|
+---+------------------+
|1 |p1,p2 |
|2 |p2,p4 |
+---+------------------+
Looking at your data format, you can just use a .map() or, in the case of SQL, a UDF that converts all rows. The function will be:
productList => {
// list of products = split productList by comma
// add all items to a String/Count map
// sort the map, get first 2 elements
// return string.join of those 2 elements
}
scala> import org.apache.spark.sql.expressions.UserDefinedFunction
scala> import scala.collection.immutable.ListMap
scala> def max_products: UserDefinedFunction = udf((product: String) => {
         val productList = product.split(",").toList
         val finalList = ListMap(productList.groupBy(i => i).mapValues(_.size).toSeq.sortWith(_._2 > _._2): _*).keys.toList
         finalList(0) + "," + finalList(1)
       })
scala> df.withColumn("most_seen_products", max_products(col("product_list"))).show(false)
+---+--------------------------------+------------------+
|id |product_list |most_seen_products|
+---+--------------------------------+------------------+
|1 |p1,p1,p1,p1,p1,p3,p3,p2,p2,p2,p2|p1,p2 |
|2 |p2,p2,p2,p2,p2,p4,p4,p4,p1,p3 |p2,p4 |
+---+--------------------------------+------------------+
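Since the question asks for a general top N, the same UDF logic can be parameterized by n instead of hard-coding the first two elements; a minimal sketch (the name top_n_products is chosen here just for illustration):
scala> def top_n_products(n: Int): UserDefinedFunction = udf((products: String) => {
         // count occurrences per product, sort by frequency descending, keep the first n
         val counts = products.split(",").toList
           .groupBy(identity).mapValues(_.size).toSeq
           .sortWith(_._2 > _._2)
         counts.take(n).map(_._1).mkString(",")
       })
scala> df.withColumn("most_seen_products", top_n_products(2)(col("product_list"))).show(false)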

How can I join RDD[Rating] with scala.collection.Map[Int, Double] by key column?

I have two tables:
table1: RDD[Rating] (rdd1, rdd2, rdd3)
and
table2: scala.collection.Map[Int, Double] (m1, m2)
I spent a lot of time and effort trying to build a joined table like
(key (key = rdd2 = m1), rdd3, m2)
But I always get a type mismatch.
Could you give me advice on how to deal with it? I also tried converting both tables to one type, but I'm not sure that's the right way...
Since you have an RDD and a Map, you can do it directly by iterating over your RDD.
Assuming that Rating has 3 fields (rdd1, rdd2, rdd3), let's rename them as field1, field2 and field3 to make the example clear and avoid confusion.
Given this sample input source:
case class Rating(field1: String, field2: Int, field3: String) // custom case class

val yourRDD = spark.sparkContext.parallelize(
  Seq(
    Rating("rating1", 1, "str1"), // item 1
    Rating("rating2", 2, "str2"), // item 2
    Rating("rating3", 3, "str3")  // item 3
  )
)

yourRDD.toDF.show() // to visualize
This will output your data source, which will look like:
+-------+------+------+
| field1|field2|field3|
+-------+------+------+
|rating1| 1| str1|
|rating2| 2| str2|
|rating3| 3| str3|
+-------+------+------+
Similarly, you have this sample data for your map:
val yourMap = Map(
1 -> 1.111,
2 -> 2.222,
3 -> 3.333
)
println(yourMap)
Data on your map:
yourMap: scala.collection.immutable.Map[Int,Double] = Map(1 -> 1.111, 2 -> 2.222, 3 -> 3.333)
Then, to "merge", you only need to iterate over your RDD, get the value that you are going to use as a key (in this case field2), and use it as the key for the map. Something like this:
yourRDD
  .map { rating =>                      // iterate each item in your RDD
    val key = rating.field2             // get the value from the current item
    // look up the value in the map using field2 as the key
    // (you need to handle missing keys if the map does not cover every key)
    val valueFromMap = yourMap(key)
    (key, rating.field3, valueFromMap)  // tuple for the new RDD built from this
  }
  .toDF.show(truncate = false)          // visualize the output
The above code will output:
+---+----+-----+
|_1 |_2 |_3 |
+---+----+-----+
|1 |str1|1.111|
|2 |str2|2.222|
|3 |str3|3.333|
+---+----+-----+
Hope this helps.
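As a side note, if the lookup map is large it is usually better to broadcast it once rather than capture it in every task's closure; a sketch assuming the same yourRDD and yourMap:
// broadcast the map once; every task reads the shared read-only copy
val mapBroadcast = spark.sparkContext.broadcast(yourMap)
yourRDD
  .map { rating =>
    val key = rating.field2
    // getOrElse guards against keys that are missing from the map
    val valueFromMap = mapBroadcast.value.getOrElse(key, Double.NaN)
    (key, rating.field3, valueFromMap)
  }
  .toDF.show(truncate = false)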

Data frame creation from MAP of N elements with schemaDetails of N elements

How do I convert the input5 data format into DataFrame, using the schema details mentioned in schemanames?
The conversion should be dynamic without using Row(r(0),r(1)) - the number of columns can increase or decrease in input and schema, hence the code should be dynamic.
case class Entry(schemaName: String, updType: String, ts: Long, row: Map[String, String])
val input5 = List(Entry("a", "b", 0, Map("col1" -> "0000555", "ref" -> "2017-08-12 12:12:12.266528")))
val schemanames= "col1,ref"
The target dataframe should be built only from the Map in input5 (i.e. col1 and ref). There can be many other columns (col2, col3, ...); if there are more columns in the Map, the same columns would be mentioned in the schema names.
The schema-names variable should be used to create the structure, and input5.row (the Map) should be the data source, as the number of columns in the schema names can be in the hundreds; the same applies to the data in input5.row.
This would work for any number of columns, as long as they're all Strings, and each Entry contains a map with values for all of these columns:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructField, StructType, StringType}

// split to column names:
val columns = schemanames.split(",")

// create the DataFrame schema with these columns (in that order)
val schema = StructType(columns.map(StructField(_, StringType)))

// convert input5 to Seq[Row], selecting the values from the "row" Map in the same order as columns
val rows = input5.map(_.row)
  .map(valueMap => columns.map(valueMap.apply).toSeq)
  .map(Row.fromSeq)

// finally - create the dataframe
val dataframe = spark.createDataFrame(sc.parallelize(rows), schema)
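If some entries might not contain every column in their Map, a guarded lookup avoids a NoSuchElementException; a small variation of the above, substituting null for missing values:
// same as above, but tolerate missing keys by emitting null
val rowsSafe = input5.map(_.row)
  .map(valueMap => columns.map(c => valueMap.getOrElse(c, null)).toSeq)
  .map(Row.fromSeq)
val dataframeSafe = spark.createDataFrame(sc.parallelize(rowsSafe), schema)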
You can go through entries in schemanames (which are presumably selected keys in the Map based on your description) along with a UDF for Map manipulation to assemble the dataframe as shown below:
case class Entry(schemaName: String, updType: String, ts: Long, row: Map[String, String])

val input5 = List(
  Entry("a", "b", 0, Map("col1" -> "0000555", "ref" -> "2017-08-12 12:12:12.266528")),
  Entry("c", "b", 1, Map("col1" -> "0000444", "col2" -> "0000444", "ref" -> "2017-08-14 14:14:14.0")),
  Entry("a", "d", 0, Map("col2" -> "0000666", "ref" -> "2017-08-16 16:16:16.0")),
  Entry("e", "f", 0, Map("col1" -> "0000777", "ref" -> "2017-08-17 17:17:17.0", "others" -> "x"))
)

val schemanames = "col1, ref"
val cols = schemanames.split(",").map(_.trim)

// Create dataframe from input5
val df = input5.toDF

// A UDF to get a column value from the Map (falling back to "n/a" for missing keys)
def getColVal(c: String) = udf(
  (m: Map[String, String]) => m.getOrElse(c, "n/a")
)

// Add columns based on entries in schemanames
val df2 = cols.foldLeft(df)(
  (acc, c) => acc.withColumn(c, getColVal(c)(df("row")))
)

val df3 = df2.select(cols.map(c => col(c)): _*)

df3.show(truncate = false)
+-------+--------------------------+
|col1 |ref |
+-------+--------------------------+
|0000555|2017-08-12 12:12:12.266528|
|0000444|2017-08-14 14:14:14.0 |
|n/a |2017-08-16 16:16:16.0 |
|0000777|2017-08-17 17:17:17.0 |
+-------+--------------------------+
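As a design note, the UDF is not strictly needed here: getItem works directly on a MapType column and missing keys simply yield null. A sketch assuming the same df and schemanames as above:
// pull each requested key straight out of the "row" map column
val selectedCols = schemanames.split(",").map(_.trim)
  .map(c => coalesce(df("row").getItem(c), lit("n/a")).as(c))
df.select(selectedCols: _*).show(truncate = false)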

How to filter a map<String, Int> in a data frame : Spark / Scala

I am trying to get counts of individual columns to publish metrics. I have a df [customerId: string, totalRent: bigint, totalPurchase: bigint, itemTypeCounts: map<string, int>]
Right now I am doing :
val totalCustomers = df.count
val totalPurchaseCount = df.filter("totalPurchase > 0").count
val totalRentCount = df.filter("totalRent > 0").count
publishMetrics("Total Customer", totalCustomers )
publishMetrics("Total Purchase", totalPurchaseCount )
publishMetrics("Total Rent", totalRentCount )
publishMetrics("Percentage of Rent", percentage(totalRentCount, totalCustomers) )
publishMetrics("Percentage of Purchase", percentage(totalPurchaseCount, totalCustomers) )
private def percentage(num: Long, denom: Long): Double = {
  val numD: Double = num.toDouble
  val denomD: Double = denom.toDouble
  if (denomD == 0.0) 0.0
  else (numD / denomD) * 100
}
But I am not sure how to do this for itemTypeCounts, which is a map. I want the count and percentage based on each key entry. The issue is that the keys are dynamic; there is no way I know the key values beforehand. Can someone tell me how to get the count for each key? I am new to Scala/Spark; any other efficient approaches to get the counts of each column are much appreciated.
Sample data :
customerId : 1
totalPurchase : 17
totalRent : 0
itemTypeCounts : {"TV" : 4, "Blender" : 2}
customerId : 2
totalPurchase : 1
totalRent : 1
itemTypeCounts : {"Cloths" : 4}
customerId : 3
totalPurchase : 0
totalRent : 10
itemTypeCounts : {"TV" : 4}
So the output is :
totalCustomer : 3
totalPurchaseCount : 2 (2 customers with totalPurchase > 0)
totalRent : 2 (2 customers with totalRent > 0)
itemTypeCounts_TV : 2
itemTypeCounts_Cloths : 1
itemTypeCounts_Blender : 1
You can accomplish this in Spark SQL; I show two examples of this below (one where the keys are known and can be enumerated in code, one where the keys are unknown). Note that by using Spark SQL you take advantage of the Catalyst optimizer, so this will run very efficiently:
val data = List(
  (1, 17, 0, Map("TV" -> 4, "Blender" -> 2)),
  (2, 1, 1, Map("Cloths" -> 4)),
  (3, 0, 10, Map("TV" -> 4))
)
val df = data.toDF("customerId", "totalPurchase", "totalRent", "itemTypeCounts")

// Only good if you can enumerate the keys
def countMapKey(name: String) = {
  count(when($"itemTypeCounts".getItem(name).isNotNull, lit(1))).as(s"itemTypeCounts_$name")
}

val keysToCount = List("TV", "Blender", "Cloths").map(key => countMapKey(key))

df.select(keysToCount: _*).show
+-----------------+----------------------+---------------------+
|itemTypeCounts_TV|itemTypeCounts_Blender|itemTypeCounts_Cloths|
+-----------------+----------------------+---------------------+
| 2| 1| 1|
+-----------------+----------------------+---------------------+
// More generic
val pivotData = df.select(explode(col("itemTypeCounts")))
  .groupBy(lit(1).as("tmp"))
  .pivot("key").count
  .drop("tmp")

val renameStatement = pivotData.columns.map(name => col(name).as(s"itemTypeCounts_$name"))
pivotData.select(renameStatement: _*).show
+----------------------+---------------------+-----------------+
|itemTypeCounts_Blender|itemTypeCounts_Cloths|itemTypeCounts_TV|
+----------------------+---------------------+-----------------+
| 1| 1| 2|
+----------------------+---------------------+-----------------+
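The question also asks for a percentage per key; dividing each of these counts by the total number of customers gives that directly. A short sketch reusing pivotData and df from above (the pct_ prefix is just for illustration):
// turn the per-key counts into percentages of all customers
val totalCustomerCount = df.count
val percentCols = pivotData.columns.map(name =>
  round(col(name) / lit(totalCustomerCount) * 100, 2).as(s"pct_$name"))
pivotData.select(percentCols: _*).show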
I'm a Spark newbie myself, so there is probably a better way to do this. But one thing you could try is transforming itemTypeCounts into a data structure in Scala that you can work with. I converted each row to a List of (Name, Count) pairs, e.g. List((Blender,2), (TV,4)).
With this you get a List of such lists of pairs, one list of pairs per row. In your example, this will be a List of 3 elements:
List(
List((Blender,2), (TV,4)),
List((Cloths,4)),
List((TV,4))
)
Once you have this structure, transforming it to a desired output is standard scala.
Worked example is below:
val itemTypeCounts = df.select("itemTypeCounts")

// Build List of List of Pairs as suggested above
val itemsList = itemTypeCounts.collect().map { row =>
  val values = row.getStruct(0).mkString("", ",", "").split(",")
  val fields = row.schema.head.dataType.asInstanceOf[StructType].map(s => s.name).toList
  fields.zip(values).filter(p => p._2 != "null")
}.toList

// Build a summary map from the list constructed above
def itemTypeCountsSummary(frames: List[List[(String, String)]], summary: Map[String, Int]): Map[String, Int] = frames match {
  case Nil => summary
  case _   => itemTypeCountsSummary(frames.tail, merge(frames.head, summary))
}

// Helper method for the summary map
def merge(head: List[(String, String)], summary: Map[String, Int]): Map[String, Int] = {
  val headMap = head.toMap.map(e => ("itemTypeCounts_" + e._1, 1))
  val updatedSummary = summary.map { e => if (headMap.contains(e._1)) (e._1, e._2 + 1) else e }
  updatedSummary ++ headMap.filter(e => !updatedSummary.contains(e._1))
}

val summaryMap = itemTypeCountsSummary(itemsList, Map())
summaryMap.foreach(e => println(e._1 + ": " + e._2))
Output:
itemTypeCounts_Blender: 1
itemTypeCounts_TV: 2
itemTypeCounts_Cloths: 1
Borrowing the input from Nick and using a Spark SQL pivot solution:
val data = List((1,17,0,Map("TV" -> 4, "Blender" -> 2)),(2,1,1,Map("Cloths" -> 4)),(3,0,10,Map("TV" -> 4)))
val df = data.toDF("customerId","totalPurchase","totalRent","itemTypeCounts")
df.show(false)
df.createOrReplaceTempView("df")
+----------+-------------+---------+-----------------------+
|customerId|totalPurchase|totalRent|itemTypeCounts |
+----------+-------------+---------+-----------------------+
|1 |17 |0 |[TV -> 4, Blender -> 2]|
|2 |1 |1 |[Cloths -> 4] |
|3 |0 |10 |[TV -> 4] |
+----------+-------------+---------+-----------------------+
Assuming that we know the distinct itemType values beforehand, we can use:
val dfr = spark.sql("""
select * from (
select explode(itemTypeCounts) itemTypeCounts from (
select flatten(collect_list(map_keys(itemTypeCounts))) itemTypeCounts from df
) ) t
pivot ( count(itemTypeCounts) as c3
for itemTypeCounts in ('TV' ,'Blender' ,'Cloths') )
""")
dfr.show(false)
+---+-------+------+
|TV |Blender|Cloths|
+---+-------+------+
|2 |1 |1 |
+---+-------+------+
For renaming columns,
dfr.select(dfr.columns.map( x => col(x).alias("itemTypeCounts_" + x )):_* ).show(false)
+-----------------+----------------------+---------------------+
|itemTypeCounts_TV|itemTypeCounts_Blender|itemTypeCounts_Cloths|
+-----------------+----------------------+---------------------+
|2 |1 |1 |
+-----------------+----------------------+---------------------+
To get the distinct itemType values dynamically and pass them to pivot:
val item_count_arr = spark.sql(""" select array_distinct(flatten(collect_list(map_keys(itemTypeCounts)))) itemTypeCounts from df """).as[Array[String]].first
item_count_arr: Array[String] = Array(TV, Blender, Cloths)
spark.sql(s"""
select * from (
select explode(itemTypeCounts) itemTypeCounts from (
select flatten(collect_list(map_keys(itemTypeCounts))) itemTypeCounts from df
) ) t
pivot ( count(itemTypeCounts) as c3
for itemTypeCounts in (${item_count_arr.map(c => "'"+c+"'").mkString(",")}) )
""").show(false)
+---+-------+------+
|TV |Blender|Cloths|
+---+-------+------+
|2 |1 |1 |
+---+-------+------+