Spark SQL add column/update-accumulate value - scala

I have the following DataFrame:
name,email,phone,country
------------------------------------------------
[Mike,mike#example.com,+91-9999999999,Italy]
[Alex,alex#example.com,+91-9999999998,France]
[John,john#example.com,+1-1111111111,United States]
[Donald,donald#example.com,+1-2222222222,United States]
[Dan,dan#example.com,+91-9999444999,Poland]
[Scott,scott#example.com,+91-9111999998,Spain]
[Rob,rob#example.com,+91-9114444998,Italy]
It is exposed as the temp table tagged_users:
resultDf.createOrReplaceTempView("tagged_users")
I need to add an additional column tag to this DataFrame and assign calculated tags based on different SQL conditions, which are described in the following map (key: tag name, value: condition for the WHERE clause):
val tags = Map(
  "big" -> "country IN (SELECT * FROM big_countries)",
  "medium" -> "country IN (SELECT * FROM medium_countries)",
  // 2000 other different tags and conditions
  "sometag" -> "name = 'Donald' AND email = 'donald#example.com' AND phone = '+1-2222222222'"
)
I have the following DataFrames (as data dictionaries) so that I can use them in the SQL queries:
Seq("Italy", "France", "United States", "Spain").toDF("country").createOrReplaceTempView("big_countries")
Seq("Poland", "Hungary", "Spain").toDF("country").createOrReplaceTempView("medium_countries")
I want to test each row in my tagged_users table and assign it the appropriate tags. I tried to implement the following logic to achieve this:
tags.foreach {
  case (tag, tagCondition) => {
    resultDf = spark.sql(buildTagQuery(tag, tagCondition, "tagged_users"))
      .withColumn("tag", lit(tag).cast(StringType))
  }
}

def buildTagQuery(tag: String, tagCondition: String, table: String): String = {
  f"SELECT * FROM $table WHERE $tagCondition"
}
but I don't know how to accumulate the tags rather than override them. Right now the result is the following DataFrame:
name,email,phone,country,tag
Dan,dan#example.com,+91-9999444999,Poland,medium
Scott,scott#example.com,+91-9111999998,Spain,medium
but I need something like:
name,email,phone,country,tag
Mike,mike#example.com,+91-9999999999,Italy,big
Alex,alex#example.com,+91-9999999998,France,big
John,john#example.com,+1-1111111111,United States,big
Donald,donald#example.com,+1-2222222222,United States,(big|sometag)
Dan,dan#example.com,+91-9999444999,Poland,medium
Scott,scott#example.com,+91-9111999998,Spain,(big|medium)
Rob,rob#example.com,+91-9114444998,Italy,big
Please note that Donald should have 2 tags (big|sometag) and Scott should have 2 tags (big|medium).
Please show how to implement this.
UPDATED
val spark = SparkSession
  .builder()
  .appName("Java Spark SQL basic example")
  .config("spark.master", "local")
  .getOrCreate()

import spark.implicits._
import spark.sql

Seq("Italy", "France", "United States", "Spain").toDF("country").createOrReplaceTempView("big_countries")
Seq("Poland", "Hungary", "Spain").toDF("country").createOrReplaceTempView("medium_countries")

val df = Seq(
  ("Mike", "mike#example.com", "+91-9999999999", "Italy"),
  ("Alex", "alex#example.com", "+91-9999999998", "France"),
  ("John", "john#example.com", "+1-1111111111", "United States"),
  ("Donald", "donald#example.com", "+1-2222222222", "United States"),
  ("Dan", "dan#example.com", "+91-9999444999", "Poland"),
  ("Scott", "scott#example.com", "+91-9111999998", "Spain"),
  ("Rob", "rob#example.com", "+91-9114444998", "Italy")).toDF("name", "email", "phone", "country")

df.collect.foreach(println)
df.createOrReplaceTempView("tagged_users")

val tags = Map(
  "big" -> "country IN (SELECT * FROM big_countries)",
  "medium" -> "country IN (SELECT * FROM medium_countries)",
  "sometag" -> "name = 'Donald' AND email = 'donald#example.com' AND phone = '+1-2222222222'")

val sep_tag = tags.map((x) => { s"when array_contains(" + x._1 + ", country) then '" + x._1 + "' " }).mkString
val combine_sel_tag1 = tags.map((x) => { s" array_contains(" + x._1 + ",country) " }).mkString(" and ")
val combine_sel_tag2 = tags.map((x) => x._1).mkString(" '(", "|", ")' ")
val combine_sel_all = " case when " + combine_sel_tag1 + " then " + combine_sel_tag2 + sep_tag + " end as tags "
val crosqry = tags.map((x) => { s" cross join ( select collect_list(country) as " + x._1 + " from " + x._1 + "_countries) " + x._1 + " " }).mkString
val qry = " select name,email,phone,country, " + combine_sel_all + " from tagged_users " + crosqry

spark.sql(qry).show

spark.stop()
fails with the following exception:
Caused by: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'sometag_countries' not found in database 'default';
at org.apache.spark.sql.catalyst.catalog.ExternalCatalog$class.requireTableExists(ExternalCatalog.scala:48)
at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.requireTableExists(InMemoryCatalog.scala:45)
at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.getTable(InMemoryCatalog.scala:326)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.getTable(ExternalCatalogWithListener.scala:138)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupRelation(SessionCatalog.scala:701)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:730)
... 74 more

Check out this DF solution:
scala> val df = Seq(("Mike","mike#example.com","+91-9999999999","Italy"),
| ("Alex","alex#example.com","+91-9999999998","France"),
| ("John","john#example.com","+1-1111111111","United States"),
| ("Donald","donald#example.com","+1-2222222222","United States"),
| ("Dan","dan#example.com","+91-9999444999","Poland"),
| ("Scott","scott#example.com","+91-9111999998","Spain"),
| ("Rob","rob#example.com","+91-9114444998","Italy")
| ).toDF("name","email","phone","country")
df: org.apache.spark.sql.DataFrame = [name: string, email: string ... 2 more fields]
scala> val dfbc=Seq("Italy", "France", "United States", "Spain").toDF("country")
dfbc: org.apache.spark.sql.DataFrame = [country: string]
scala> val dfmc=Seq("Poland", "Hungary", "Spain").toDF("country")
dfmc: org.apache.spark.sql.DataFrame = [country: string]
scala> val dfbc2=dfbc.agg(collect_list('country).as("bcountry"))
dfbc2: org.apache.spark.sql.DataFrame = [bcountry: array<string>]
scala> val dfmc2=dfmc.agg(collect_list('country).as("mcountry"))
dfmc2: org.apache.spark.sql.DataFrame = [mcountry: array<string>]
scala> val df2=df.crossJoin(dfbc2).crossJoin(dfmc2)
df2: org.apache.spark.sql.DataFrame = [name: string, email: string ... 4 more fields]
scala> df2.selectExpr("*","case when array_contains(bcountry,country) and array_contains(mcountry,country) then '(big|medium)' when array_contains(bcountry,country) then 'big' when array_contains(mcountry,country) then 'medium' else 'none' end as `tags`").select("name","email","phone","country","tags").show(false)
+------+------------------+--------------+-------------+------------+
|name |email |phone |country |tags |
+------+------------------+--------------+-------------+------------+
|Mike |mike#example.com |+91-9999999999|Italy |big |
|Alex |alex#example.com |+91-9999999998|France |big |
|John |john#example.com |+1-1111111111 |United States|big |
|Donald|donald#example.com|+1-2222222222 |United States|big |
|Dan |dan#example.com |+91-9999444999|Poland |medium |
|Scott |scott#example.com |+91-9111999998|Spain |(big|medium)|
|Rob |rob#example.com |+91-9114444998|Italy |big |
+------+------------------+--------------+-------------+------------+
scala>
SQL approach
scala> Seq(("Mike","mike#example.com","+91-9999999999","Italy"),
| ("Alex","alex#example.com","+91-9999999998","France"),
| ("John","john#example.com","+1-1111111111","United States"),
| ("Donald","donald#example.com","+1-2222222222","United States"),
| ("Dan","dan#example.com","+91-9999444999","Poland"),
| ("Scott","scott#example.com","+91-9111999998","Spain"),
| ("Rob","rob#example.com","+91-9114444998","Italy")
| ).toDF("name","email","phone","country").createOrReplaceTempView("tagged_users")
scala> Seq("Italy", "France", "United States", "Spain").toDF("country").createOrReplaceTempView("big_countries")
scala> Seq("Poland", "Hungary", "Spain").toDF("country").createOrReplaceTempView("medium_countries")
scala> spark.sql(""" select name,email,phone,country,case when array_contains(bc,country) and array_contains(mc,country) then '(big|medium)' when array_contains(bc,country) then 'big' when array_contains(mc,country) then 'medium' else 'none' end as tags from tagged_users cross join ( select collect_list(country) as bc from big_countries ) b cross join ( select collect_list(country) as mc from medium_countries ) c """).show(false)
+------+------------------+--------------+-------------+------------+
|name |email |phone |country |tags |
+------+------------------+--------------+-------------+------------+
|Mike |mike#example.com |+91-9999999999|Italy |big |
|Alex |alex#example.com |+91-9999999998|France |big |
|John |john#example.com |+1-1111111111 |United States|big |
|Donald|donald#example.com|+1-2222222222 |United States|big |
|Dan |dan#example.com |+91-9999444999|Poland |medium |
|Scott |scott#example.com |+91-9111999998|Spain |(big|medium)|
|Rob |rob#example.com |+91-9114444998|Italy |big |
+------+------------------+--------------+-------------+------------+
scala>
Iterating through the tags
scala> val tags = Map(
| "big" -> "country IN (SELECT * FROM big_countries)",
| "medium" -> "country IN (SELECT * FROM medium_countries)")
tags: scala.collection.immutable.Map[String,String] = Map(big -> country IN (SELECT * FROM big_countries), medium -> country IN (SELECT * FROM medium_countries))
scala> val sep_tag = tags.map( (x) => { s"when array_contains("+x._1+", country) then '" + x._1 + "' " } ).mkString
sep_tag: String = "when array_contains(big, country) then 'big' when array_contains(medium, country) then 'medium' "
scala> val combine_sel_tag1 = tags.map( (x) => { s" array_contains("+x._1+",country) " } ).mkString(" and ")
combine_sel_tag1: String = " array_contains(big,country) and array_contains(medium,country) "
scala> val combine_sel_tag2 = tags.map( (x) => x._1 ).mkString(" '(","|", ")' ")
combine_sel_tag2: String = " '(big|medium)' "
scala> val combine_sel_all = " case when " + combine_sel_tag1 + " then " + combine_sel_tag2 + sep_tag + " end as tags "
combine_sel_all: String = " case when array_contains(big,country) and array_contains(medium,country) then '(big|medium)' when array_contains(big, country) then 'big' when array_contains(medium, country) then 'medium' end as tags "
scala> val crosqry = tags.map( (x) => { s" cross join ( select collect_list(country) as "+x._1+" from "+x._1+"_countries) "+ x._1 + " " } ).mkString
crosqry: String = " cross join ( select collect_list(country) as big from big_countries) big cross join ( select collect_list(country) as medium from medium_countries) medium "
scala> val qry = " select name,email,phone,country, " + combine_sel_all + " from tagged_users " + crosqry
qry: String = " select name,email,phone,country, case when array_contains(big,country) and array_contains(medium,country) then '(big|medium)' when array_contains(big, country) then 'big' when array_contains(medium, country) then 'medium' end as tags from tagged_users cross join ( select collect_list(country) as big from big_countries) big cross join ( select collect_list(country) as medium from medium_countries) medium "
scala> spark.sql(qry).show
+------+------------------+--------------+-------------+------------+
| name| email| phone| country| tags|
+------+------------------+--------------+-------------+------------+
| Mike| mike#example.com|+91-9999999999| Italy| big|
| Alex| alex#example.com|+91-9999999998| France| big|
| John| john#example.com| +1-1111111111|United States| big|
|Donald|donald#example.com| +1-2222222222|United States| big|
| Dan| dan#example.com|+91-9999444999| Poland| medium|
| Scott| scott#example.com|+91-9111999998| Spain|(big|medium)|
| Rob| rob#example.com|+91-9114444998| Italy| big|
+------+------------------+--------------+-------------+------------+
scala>
UPDATE2:
scala> Seq(("Mike","mike#example.com","+91-9999999999","Italy"),
| ("Alex","alex#example.com","+91-9999999998","France"),
| ("John","john#example.com","+1-1111111111","United States"),
| ("Donald","donald#example.com","+1-2222222222","United States"),
| ("Dan","dan#example.com","+91-9999444999","Poland"),
| ("Scott","scott#example.com","+91-9111999998","Spain"),
| ("Rob","rob#example.com","+91-9114444998","Italy")
| ).toDF("name","email","phone","country").createOrReplaceTempView("tagged_users")
scala> Seq("Italy", "France", "United States", "Spain").toDF("country").createOrReplaceTempView("big_countries")
scala> Seq("Poland", "Hungary", "Spain").toDF("country").createOrReplaceTempView("medium_countries")
scala> val tags = Map(
| "big" -> "country IN (SELECT * FROM big_countries)",
| "medium" -> "country IN (SELECT * FROM medium_countries)",
| "sometag" -> "name = 'Donald' AND email = 'donald#example.com' AND phone = '+1-2222222222'")
tags: scala.collection.immutable.Map[String,String] = Map(big -> country IN (SELECT * FROM big_countries), medium -> country IN (SELECT * FROM medium_countries), sometag -> name = 'Donald' AND email = 'donald#example.com' AND phone = '+1-2222222222')
scala> val sql_tags = tags.map( x => { val p = x._2.trim.toUpperCase.split(" ");
| val qry = if(p.contains("IN") && p.contains("FROM"))
| s" case when array_contains((select collect_list("+p.head +") from " + p.last.replaceAll("[)]","")+ " ), " +p.head + " ) then '" + x._1 + " ' else '' end " + x._1 + " "
| else
| " case when " + x._2 + " then '" + x._1 + " ' else '' end " + x._1 + " ";
| qry } ).mkString(",")
sql_tags: String = " case when array_contains((select collect_list(COUNTRY) from BIG_COUNTRIES ), COUNTRY ) then 'big ' else '' end big , case when array_contains((select collect_list(COUNTRY) from MEDIUM_COUNTRIES ), COUNTRY ) then 'medium ' else '' end medium , case when name = 'Donald' AND email = 'donald#example.com' AND phone = '+1-2222222222' then 'sometag ' else '' end sometag "
scala> val outer_query = tags.map( x=> x._1).mkString(" regexp_replace(trim(concat(", ",", " )),' ','|') tags ")
outer_query: String = " regexp_replace(trim(concat(big,medium,sometag )),' ','|') tags "
scala> spark.sql(" select name,email, country, " + outer_query + " from ( select name,email, country ," + sql_tags + " from tagged_users ) " ).show
+------+------------------+-------------+-----------+
| name| email| country| tags|
+------+------------------+-------------+-----------+
| Mike| mike#example.com| Italy| big|
| Alex| alex#example.com| France| big|
| John| john#example.com|United States| big|
|Donald|donald#example.com|United States|big|sometag|
| Dan| dan#example.com| Poland| medium|
| Scott| scott#example.com| Spain| big|medium|
| Rob| rob#example.com| Italy| big|
+------+------------------+-------------+-----------+
scala>
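As a side note, the same accumulation can be sketched with the plain DataFrame API: one when flag per tag, concatenated with concat_ws, which skips nulls. This is only a minimal sketch, assuming the temp views registered above and a SparkSession named spark with its implicits available, and that the country dictionaries are small enough to collect to the driver:
import org.apache.spark.sql.functions._
import spark.implicits._

// assumption: the dictionary tables are small, so collecting them is cheap
val big = spark.table("big_countries").as[String].collect()
val medium = spark.table("medium_countries").as[String].collect()

// when(cond, tag) is null when the condition is false, and concat_ws skips nulls,
// so matching tags accumulate as "big", "medium", "big|sometag", ...
spark.table("tagged_users")
  .withColumn("tags", concat_ws("|",
    when($"country".isin(big: _*), "big"),
    when($"country".isin(medium: _*), "medium"),
    when($"name" === "Donald" && $"email" === "donald#example.com" && $"phone" === "+1-2222222222", "sometag")))
  .show(false)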

If you need to aggregate the results, and not just execute each query, perhaps use map instead of foreach and then union the results:
val o = tags.map {
  case (tag, tagCondition) => {
    val resultDf = spark.sql(buildTagQuery(tag, tagCondition, "tagged_users"))
      .withColumn("tag", lit(tag))
    resultDf
  }
}

o.tail.foldLeft(o.head) {
  case (acc, df) => acc.union(df)
}
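With this approach a user that matches several tags appears once per tag. To get a single row per user with the tags accumulated, you could group on the key columns afterwards; a minimal sketch, reusing the collection o built above:
import org.apache.spark.sql.functions.{collect_set, concat_ws}

// equivalent to the fold above
val unioned = o.reduce(_ union _)

// one row per user, with all matched tags joined by "|"
val accumulated = unioned
  .groupBy("name", "email", "phone", "country")
  .agg(concat_ws("|", collect_set("tag")).as("tag"))

accumulated.show(false)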

I would define multiple tag tables with columns value, tag.
Then your tags definition would be a collection, say Seq[(String, String)], where the first tuple element is the column on which the tag is calculated.
Let's say
Seq(
  "country" -> "bigCountries", // Columns [country, bigCountry]
  "country" -> "mediumCountries", // Columns [country, mediumCountry]
  "email" -> "hotmailLosers" // Columns [email, hotmailLoser]
)
Then iterate through this list, left joining each table on the relevant column.
After joining each table, simply set your tags column to the current value plus the joined column if it is not null (see the sketch below).
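A minimal sketch of this left-join approach, using hypothetical example data (usersDf and the lookup tables are made up here; each lookup is assumed to have the join column plus a tag column), and assuming a SparkSession named spark with its implicits imported:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import spark.implicits._

// hypothetical data for illustration only
val usersDf = Seq(
  ("Donald", "donald#example.com", "United States"),
  ("Dan", "dan#example.com", "Poland")).toDF("name", "email", "country")
val tagTables: Seq[(String, DataFrame)] = Seq(
  "country" -> Seq(("United States", "big"), ("Italy", "big")).toDF("country", "tag"),
  "country" -> Seq(("Poland", "medium"), ("Spain", "medium")).toDF("country", "tag"),
  "email"   -> Seq(("donald#example.com", "hotmailLoser")).toDF("email", "tag"))

// left join each lookup on its column and append its tag when present;
// concat_ws skips nulls, so unmatched joins contribute nothing
val tagged = tagTables.zipWithIndex.foldLeft(usersDf.withColumn("tags", lit(""))) {
  case (acc, ((joinCol, lookup), i)) =>
    acc.join(lookup.withColumnRenamed("tag", s"tag_$i"), Seq(joinCol), "left")
      .withColumn("tags", concat_ws("|",
        when($"tags" =!= "", $"tags"), col(s"tag_$i")))
      .drop(s"tag_$i")
}
tagged.show(false)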

Related

Check count of a column from a dataframe and add column and count as Map

I am a Scala beginner. I am trying to find the count of null values in a column of a table and add the column name and count as a key-value pair in a Map. The code below doesn't work as expected. Please guide me on how I can modify it to make it work.
def nullCheck(databaseName: String, tableName: String) = {
  var map = scala.collection.mutable.Map[String, Int]()
  validationColumn = Array(col1, col2)
  for (i <- 0 to validationColumn.length) {
    val nullVal = spark.sql(s"select count(*) from $databaseName.$tableName where validationColumn(i) is NULL")
    if (nullval == 0)
      map(validationColumn(i)) = nullVal
    map
  }
}
The function should return ((col1, count), (col2, count)) as a Map.
This can be done by creating a dynamic SQL string and then mapping the result; your approach reads the same data multiple times.
Here is the solution. I used an "example" DataFrame.
scala> val inputDf = Seq((Some("Sam"),None,200),(None,Some(31),30),(Some("John"),Some(25),25),(Some("Harry"),None,100)).toDF("name","age","not_imp_column")
scala> inputDf.show(false)
+-----+----+--------------+
|name |age |not_imp_column|
+-----+----+--------------+
|Sam |null|200 |
|null |31 |30 |
|John |25 |25 |
|Harry|null|100 |
+-----+----+--------------+
Our validation columns are name and age, in which we shall count nulls.
We put them in a List:
scala> val validationColumns = List("name","age")
And we create a SQL string that will drive the whole calculation:
scala> val sqlStr = "select " + validationColumns.map(x => "sum(" + x + "_count) AS " + x + "_sum" ).mkString(",") + " from (select " + validationColumns.map(x => "case when " + x + " = '$$' then 1 else 0 end AS " + x + "_count").mkString(",") + " from " +" (select" + validationColumns.map(x => " nvl( " + x +",'$$') as " + x).mkString(",") + " from example_table where " + validationColumns.map(x => x + " is null ").mkString("or ") + " ) layer1 ) layer2 "
It resolves to ==>
"select sum(name_count) AS name_sum,sum(age_count) AS age_sum from (select case when name = '$$' then 1 else 0 end AS name_count,case when age = '$$' then 1 else 0 end AS age_count from (select nvl( name,'$$') as name, nvl( age,'$$') as age from example_table where name is null or age is null ) layer1 ) layer2 "
Now we create a temporary view of our DataFrame:
inputDf.createOrReplaceTempView("example_table")
The only thing left to do is execute the SQL and create the Map, which is done by:
validationColumns zip spark.sql(sqlStr).collect.map(_.toSeq).flatten.toList toMap
and the result:
Map(name -> 1, age -> 2) // obviously you can make it type safe
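As a side note, the same null counts can also be computed in a single pass with the DataFrame API, without building a SQL string; a minimal sketch over the same inputDf and validationColumns:
import org.apache.spark.sql.functions.{col, count, when}

// one aggregate per validation column, all computed in a single job
val row = inputDf
  .select(validationColumns.map(c => count(when(col(c).isNull, 1)).as(c)): _*)
  .first()

val nullCounts = validationColumns.zipWithIndex
  .map { case (c, i) => c -> row.getLong(i) }
  .toMap
// nullCounts: Map(name -> 1, age -> 2)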

spark scala reading text file with line delimiter

I have a text file with the following format:
id##name##subjects$$$
1##a##science
english$$$
2##b##social
mathematics$$$
I want to create a DataFrame like
id | name | subject
1 | a | science
| | english
When I do this in Scala I get only an RDD[String]. How can I convert the RDD[String] into a DataFrame?
val rdd = sc.textFile(fileLocation)
val a = rdd.reduce((a, b) => a + " " + b).split("\\$\\$\\$").map(f => f.replaceAll("##", ""))
Given the text file you provided, and assuming you want all of your example file converted to the following (put the example text into a file example.txt):
+---+----+-----------+
| id|name| subjects|
+---+----+-----------+
| 1| a| science|
| | | english|
| 2| b| social|
| | |mathematics|
+---+----+-----------+
you can run the code below (Spark 2.3.2):
val fileLocation = "example.txt"
val rdd = sc.textFile(fileLocation)

def format(x: (String, String, String)): String = {
  val a = if ("".equals(x._1)) "| " else x._1 + " | "
  val b = if ("".equals(x._2)) "| " else x._2 + " | "
  val c = if ("".equals(x._3)) "" else x._3
  a + b + c
}

var rdd2 = rdd.filter(x => x.length != 0).map(s => s.split("##")).map(a => {
  a match {
    case Array(x) =>
      ("", "", x.split("\\$\\$\\$")(0))
    case Array(x, y, z) =>
      (x, y, z.split("\\$\\$\\$")(0))
  }
})
rdd2.foreach(x => println(format(x)))

val header = rdd2.first()
val df = rdd2.filter(row => row != header).toDF(header._1, header._2, header._3)
df.show

val ds = rdd2.filter(row => row != header).toDS
  .withColumnRenamed("_1", header._1)
  .withColumnRenamed("_2", header._2)
  .withColumnRenamed("_3", header._3)
ds.show
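As a side note, instead of reducing the whole file into one string, the records can be split at the source by setting a custom record delimiter on Hadoop's TextInputFormat. A sketch, assuming the same fileLocation and the SparkContext sc:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// each "$$$"-terminated block becomes one record
val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "$$$")

val records = sc
  .newAPIHadoopFile(fileLocation, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
  .map { case (_, text) => text.toString.trim }

records.collect.foreach(println)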

Scala: Printing the output in a proper data format using Scala

I would like to display the data in a proper format. I have the below code:
val maplist=List(Map("id" -> "1", "Name" -> "divya"),
Map("id" -> "2", "Name" -> "gaya")
)
val header=maplist.flatMap(_.keys).distinct
val data=maplist.map(_.values)
println(header.mkString(" "))
data.foreach(x => println(x.mkString(" ")))
This shows as:
id Name
1 divya
2 gaya
but I would like to show it like below; I may have to use the df.show() function:
+----+-----+
|Id |Name |
+----+-----+
|1 |Divya|
|2 |gaya |
+----+-----+
If you want the separators, you should use the mkString method with more parameters; you can check the API:
mkString(start: String, sep: String, end: String): String
Displays all elements of this traversable or iterator in a string
using start, end, and separator strings.
val separatorLine = "+----+-----+"
val separator = "|"
val maplist=List(Map("id" -> "1", "Name" -> "divya"),
Map("id" -> "2", "Name" -> "gaya")
)
val header=maplist.flatMap(_.keys).distinct
val data=maplist.map(_.values)
println(separatorLine)
println(header.mkString("|", " |", "|"))
println(separatorLine)
data.foreach(x => println(x.mkString("|", " |", "|")))
println(separatorLine)
Result:
+----+-----+
|id |Name|
+----+-----+
|1 |divya|
|2 |gaya|
+----+-----+
Update: If you want every String to have the same length (for instance 5), you can write an auxiliary method to append blanks when needed:
import scala.annotation.tailrec

@tailrec
private def appendElem(original: String, desiredLength: Int, c: Char): String = {
if (original.length < desiredLength)
appendElem(original + c, desiredLength, c)
else {
original
}
}
val separator = "|"
val maplist=List(Map("id" -> "1", "Name" -> "divya"),
Map("id" -> "2", "Name" -> "gaya")
)
val separatorLine = List.fill(maplist.size)( "+").map(appendElem(_, 6,'-')).mkString+ "+"
val header=maplist.flatMap(_.keys.map(key => appendElem(key, 5, ' '))).distinct
val data=maplist.map(_.values)
println(separatorLine)
println(header.mkString("|", "|", "|"))
println(separatorLine)
data.map(x => x.map(y => appendElem(y, 5, ' '))).foreach(x => println(x.mkString("|", "|", "|")))
println(separatorLine)
With this second version the result is as follows
+-----+-----+
|id |Name |
+-----+-----+
|1 |divya|
|2 |gaya |
+-----+-----+
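Since the question mentions df.show(), another option (assuming a SparkSession named spark is available) is to hand the formatting to Spark by turning the list of Maps into a DataFrame:
import spark.implicits._

// keys become column names; each Map becomes one row
val df = maplist.map(m => (m("id"), m("Name"))).toDF("Id", "Name")
df.show()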

How to filter a map<String, Int> in a data frame : Spark / Scala

I am trying to get the count of individual columns to publish metrics. I have a df [customerId: string, totalRent: bigint, totalPurchase: bigint, itemTypeCounts: map<string, int>]
Right now I am doing:
val totalCustomers = df.count
val totalPurchaseCount = df.filter("totalPurchase > 0").count
val totalRentCount = df.filter("totalRent > 0").count
publishMetrics("Total Customer", totalCustomers )
publishMetrics("Total Purchase", totalPurchaseCount )
publishMetrics("Total Rent", totalRentCount )
publishMetrics("Percentage of Rent", percentage(totalRentCount, totalCustomers) )
publishMetrics("Percentage of Purchase", percentage(totalPurchaseCount, totalCustomers) )
private def percentage(num: Long, denom: Long): Double = {
  val numD: Double = num
  val denomD: Double = denom
  if (denomD == 0.0) 0.0
  else (numD / denomD) * 100
}
But I am not sure how to do this for itemTypeCounts, which is a map. I want the count and percentage based on each key entry. The issue is that the keys are dynamic; there is no way I know the key values beforehand. Can someone tell me how to get the count for each key value? I am new to Scala/Spark; any other efficient approaches to get the counts of each column are much appreciated.
Sample data :
customerId : 1
totalPurchase : 17
totalRent : 0
itemTypeCounts : {"TV" : 4, "Blender" : 2}
customerId : 2
totalPurchase : 1
totalRent : 1
itemTypeCounts : {"Cloths" : 4}
customerId : 3
totalPurchase : 0
totalRent : 10
itemTypeCounts : {"TV" : 4}
So the output is :
totalCustomer : 3
totalPurchaseCount : 2 (2 customers with totalPurchase > 0)
totalRent : 2 (2 customers with totalRent > 0)
itemTypeCounts_TV : 2
itemTypeCounts_Cloths : 1
itemTypeCounts_Blender : 1
You can accomplish this in Spark SQL; I show two examples below (one where the keys are known and can be enumerated in code, and one where the keys are unknown). Note that by using Spark SQL you take advantage of the Catalyst optimizer, so this will run very efficiently:
val data = List((1,17,0,Map("TV" -> 4, "Blender" -> 2)),(2,1,1,Map("Cloths" -> 4)),(3,0,10,Map("TV" -> 4)))
val df = data.toDF("customerId","totalPurchase","totalRent","itemTypeCounts")
//Only good if you can enumerate the keys
def countMapKey(name:String) = {
count(when($"itemTypeCounts".getItem(name).isNotNull,lit(1))).as(s"itemTypeCounts_$name")
}
val keysToCount = List("TV","Blender","Cloths").map(key => countMapKey(key))
df.select(keysToCount :_*).show
+-----------------+----------------------+---------------------+
|itemTypeCounts_TV|itemTypeCounts_Blender|itemTypeCounts_Cloths|
+-----------------+----------------------+---------------------+
| 2| 1| 1|
+-----------------+----------------------+---------------------+
//More generic
val pivotData = df.select(explode(col("itemTypeCounts"))).groupBy(lit(1).as("tmp")).pivot("key").count.drop("tmp")
val renameStatement = pivotData.columns.map(name => col(name).as(s"itemTypeCounts_$name"))
pivotData.select(renameStatement :_*).show
+----------------------+---------------------+-----------------+
|itemTypeCounts_Blender|itemTypeCounts_Cloths|itemTypeCounts_TV|
+----------------------+---------------------+-----------------+
| 1| 1| 2|
+----------------------+---------------------+-----------------+
I'm a Spark newbie myself, so there is probably a better way to do this. But one thing you could try is transforming the itemTypeCounts into a data structure in Scala that you can work with. I converted each row to a List of (Name, Count) pairs, e.g. List((Blender,2), (TV,4)).
With this you can have a List of such list of pairs, one list of pairs for each row. In your example, this will be a List of 3 elements:
List(
List((Blender,2), (TV,4)),
List((Cloths,4)),
List((TV,4))
)
Once you have this structure, transforming it to a desired output is standard scala.
Worked example is below:
val itemTypeCounts = df.select("itemTypeCounts")
//Build List of List of Pairs as suggested above
val itemsList = itemTypeCounts.collect().map {
row =>
val values = row.getStruct(0).mkString("",",","").split(",")
val fields = row.schema.head.dataType.asInstanceOf[StructType].map(s => s.name).toList
fields.zip(values).filter(p => p._2 != "null")
}.toList
// Build a summary map for the list constructed above
def itemTypeCountsSummary(frames: List[List[(String, String)]], summary: Map[String, Int]) : Map[String, Int] = frames match {
case Nil => summary
case _ => itemTypeCountsSummary(frames.tail, merge(frames.head, summary))
}
//helper method for the summary map.
def merge(head: List[(String, String)], summary: Map[String, Int]): Map[String, Int] = {
val headMap = head.toMap.map(e => ("itemTypeCounts_" + e._1, 1))
val updatedSummary = summary.map{e => if(headMap.contains(e._1)) (e._1, e._2 + 1) else e}
updatedSummary ++ headMap.filter(e => !updatedSummary.contains(e._1))
}
val summaryMap = itemTypeCountsSummary(itemsList, Map())
summaryMap.foreach(e => println(e._1 + ": " + e._2 ))
Output:
itemTypeCounts_Blender: 1
itemTypeCounts_TV: 2
itemTypeCounts_Cloths: 1
Borrowing the input from Nick and using a Spark SQL pivot solution:
val data = List((1,17,0,Map("TV" -> 4, "Blender" -> 2)),(2,1,1,Map("Cloths" -> 4)),(3,0,10,Map("TV" -> 4)))
val df = data.toDF("customerId","totalPurchase","totalRent","itemTypeCounts")
df.show(false)
df.createOrReplaceTempView("df")
+----------+-------------+---------+-----------------------+
|customerId|totalPurchase|totalRent|itemTypeCounts |
+----------+-------------+---------+-----------------------+
|1 |17 |0 |[TV -> 4, Blender -> 2]|
|2 |1 |1 |[Cloths -> 4] |
|3 |0 |10 |[TV -> 4] |
+----------+-------------+---------+-----------------------+
Assuming that we know the distinct itemType values beforehand, we can use:
val dfr = spark.sql("""
select * from (
select explode(itemTypeCounts) itemTypeCounts from (
select flatten(collect_list(map_keys(itemTypeCounts))) itemTypeCounts from df
) ) t
pivot ( count(itemTypeCounts) as c3
for itemTypeCounts in ('TV' ,'Blender' ,'Cloths') )
""")
dfr.show(false)
+---+-------+------+
|TV |Blender|Cloths|
+---+-------+------+
|2 |1 |1 |
+---+-------+------+
For renaming columns,
dfr.select(dfr.columns.map( x => col(x).alias("itemTypeCounts_" + x )):_* ).show(false)
+-----------------+----------------------+---------------------+
|itemTypeCounts_TV|itemTypeCounts_Blender|itemTypeCounts_Cloths|
+-----------------+----------------------+---------------------+
|2 |1 |1 |
+-----------------+----------------------+---------------------+
To get the distinct itemType values dynamically and pass them to pivot:
val item_count_arr = spark.sql(""" select array_distinct(flatten(collect_list(map_keys(itemTypeCounts)))) itemTypeCounts from df """).as[Array[String]].first
item_count_arr: Array[String] = Array(TV, Blender, Cloths)
spark.sql(s"""
select * from (
select explode(itemTypeCounts) itemTypeCounts from (
select flatten(collect_list(map_keys(itemTypeCounts))) itemTypeCounts from df
) ) t
pivot ( count(itemTypeCounts) as c3
for itemTypeCounts in (${item_count_arr.map(c => "'"+c+"'").mkString(",")}) )
""").show(false)
+---+-------+------+
|TV |Blender|Cloths|
+---+-------+------+
|2 |1 |1 |
+---+-------+------+

How to filter a dataframe based on column values(multiple values through a arraybuffer) in scala

In Scala/Spark code I have a DataFrame which contains some rows:
col1 col2
Abc someValue1
xyz someValue2
lmn someValue3
zmn someValue4
pqr someValue5
cda someValue6
And I have a variable of type ArrayBuffer[String] which contains [xyz, pqr, abc].
I want to filter the given DataFrame based on the given values in the ArrayBuffer, matching on col1.
In SQL it would be like:
select * from tableXyz where col1 in("xyz","pqr","abc");
Assuming you have your dataframe:
val df = sc.parallelize(Seq(("abc","someValue1"),
("xyz","someValue2"),
("lmn","someValue3"),
("zmn","someValue4"),
("pqr","someValue5"),
("cda","someValue6")))
.toDF("col1","col2")
+----+----------+
|col1| col2|
+----+----------+
| abc|someValue1|
| xyz|someValue2|
| lmn|someValue3|
| zmn|someValue4|
| pqr|someValue5|
| cda|someValue6|
+----+----------+
Then you can define a UDF to filter the DataFrame based on the array's values:
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.functions.{col, udf}

val array = ArrayBuffer[String]("xyz","pqr","abc")
val function: (String => Boolean) = (arg: String) => array.contains(arg)
val udfFiltering = udf(function)
val filtered = df.filter(udfFiltering(col("col1")))
filtered.show()
+----+----------+
|col1| col2|
+----+----------+
| abc|someValue1|
| xyz|someValue2|
| pqr|someValue5|
+----+----------+
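As a side note, a built-in alternative to the UDF is Column.isin (available since Spark 1.5), which takes varargs, so the buffer can be expanded with :_*:
import org.apache.spark.sql.functions.col

// same result as the UDF-based filter above
val filteredIsin = df.filter(col("col1").isin(array: _*))
filteredIsin.show()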
Alternatively, you can register your DataFrame as a temp table and query it with SQL via the SQLContext (note that registerTempTable was deprecated in Spark 2.0 in favor of createOrReplaceTempView):
val elements = array.map(el => "\"" + el + "\"").mkString(",")
val query = "select * from tableXyz where col1 in(" + elements + ")"
df.registerTempTable("tableXyz")
val filtered = sqlContext.sql(query)
filtered.show()
+----+----------+
|col1| col2|
+----+----------+
| abc|someValue1|
| xyz|someValue2|
| pqr|someValue5|
+----+----------+