Scala Slick: Getting the number of fields with a specific value in a group by query

I have a table like this:
imgId, pageId, isAnnotated
1, 1, true
2, 1, false
3, 2, true
4, 1, false
5, 3, false
6, 2, true
7, 3, true
8, 3, true
I want the result as:
pageId, imageCount, noOfAnnotatedImages
1, 3, 1
2, 2, 2
3, 3, 2
That is, for each page I want the total number of images and the number of images whose isAnnotated field is true.
The Slick code I tried, which throws an exception:
def get = {
  val q = (for {
    c <- WebImage.webimage
  } yield (c.pageUrl, c.lastAccess, c.isAnnotated)).groupBy(a => (a._1, a._3)).map {
    case (a, b) => (a._1, b.map(_._2).max, b.filter(_._3 === true).length, b.length)
  }
  db.run(q.result)
}
Exception:
[SlickTreeException: Cannot convert node to SQL Comprehension
| Path s6._2 : Vector[t2<{s3: String', s4: Long', s5: Boolean'}>]
]
Note: the "Count the total records containing specific values" thread clearly shows that what I need is possible in plain SQL:
SELECT
Type
,sum(case Authorization when 'Accepted' then 1 else 0 end) Accepted
,sum(case Authorization when 'Denied' then 1 else 0 end) Denied
from MyTable
where Type = 'RAID'
group by Type
I changed the code, but I still get an exception:
Execution exception
[SlickException: No type for symbol s2 found for Ref s2]
In /home/ravinder/IdeaProjects/structurer/app/scrapper/Datastore.scala:60
56    def get = {
57      val q = (for {
58        c <- WebImage.webimage
59      } yield (c.pageUrl, c.lastAccess, c.isAnnotated)).groupBy(a => (a._1, a._3)).map {
[60]      case (a, b) => (a._1, b.map(_._2).max, b.map(a => if (a._3.result == true) 1 else 0).sum, b.length)
61      }
62      db.run(q.result)
63    }

Given your requirement, you should group by pageUrl only, so that the aggregation runs over all rows for the same page. You can aggregate lastAccess using max, and count annotated images by summing a conditional Case.If-Then-Else over isAnnotated. The Slick query should look something like the following:
val q = (for {
  c <- WebImage.webimage
} yield (c.pageUrl, c.lastAccess, c.isAnnotated)).
  groupBy( _._1 ).map{ case (url, grp) =>
    val lastAcc = grp.map( _._2 ).max
    val annoCnt = grp.map( _._3 ).map(
      anno => Case.If(anno === true).Then(1).Else(0)
    ).sum
    (url, lastAcc, annoCnt, grp.length)
  }
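If it helps, here is a hypothetical sketch of consuming the result (types follow the question's schema: pageUrl is a String, lastAccess a Long, isAnnotated a Boolean; db and q are as above):

import scala.concurrent.ExecutionContext.Implicits.global

// max and sum are optional in Slick (None for an empty group), hence getOrElse.
db.run(q.result).foreach { rows =>
  rows.foreach { case (url, lastAccess, annotatedCount, imageCount) =>
    println(s"$url: $imageCount images, ${annotatedCount.getOrElse(0)} annotated, last access ${lastAccess.getOrElse(0L)}")
  }
}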

Related

Scala spark + encoder issues

I am working on a problem where I need to add a new column that holds the total number of characters across all columns in each row.
My sample data set:
ItemNumber, StoreNumber, SaleAmount, Quantity, Date
2231, 1, 400, 2, 19/01/2020
2145, 3, 500, 10, 14/01/2020
The expected per-row lengths are 19 and 20. The ideal output I am expecting to build has a new Length column added to the data frame:
ItemNumber, StoreNumber, SaleAmount, Quantity, Date, Length
2231, 1, 400, 2, 19/01/2020, 19
2145, 3, 500, 10, 14/01/2020, 20
My code:
val spark = SparkSession.builder()
  .appName("SimpleNewIntColumn").master("local").enableHiveSupport().getOrCreate()
val df = spark.read.option("header", "true").csv("./data/sales.csv")

var schema = new StructType
df.schema.toList.map { each => schema = schema.add(each) }
val encoder = RowEncoder(schema)

val charLength = (row: Row) => {
  var len: Int = 0
  row.toSeq.map(x => {
    x match {
      case a: Int    => len = len + a.toString.length
      case a: String => len = len + a.length
    }
  })
  len
}

df.map(row => charLength(row))(encoder) // ERROR - Required Encoder[Int] Found EncoderExpression[Row]
df.withColumn("Length", ?)
I have two issues
1) How do I solve the error "Required Encoder[Int] Found EncoderExpression[Row]"?
2) How do I add the output of the charLength function as a new column value? - df.withColumn("Length", ?)
Thank you.
Gurupraveen
If you are just trying to add a column with the total character length of each row, you can simply concat all the columns cast to String and use the length function:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
val concatCol = concat(df.columns.map(col(_).cast(StringType)):_*)
df.withColumn("Length", length(concatCol))
Output:
+----------+-----------+----------+--------+----------+------+
|ItemNumber|StoreNumber|SaleAmount|Quantity| Date|length|
+----------+-----------+----------+--------+----------+------+
| 2231| 1| 400| 2|19/01/2020| 19|
| 2145| 3| 500| 10|14/01/2020| 20|
+----------+-----------+----------+--------+----------+------+
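If you also want to keep the row-based charLength logic from the question, one option (a hedged sketch, not tested against your exact setup) is to wrap it in a UDF that takes a struct of all columns; withColumn then attaches the result directly, which sidesteps the Encoder problem:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, struct, udf}

// Hypothetical alternative: pass every column as one struct so the function
// receives a Row, then attach the computed length with withColumn.
val charLengthUdf = udf { row: Row =>
  row.toSeq.map {
    case null      => 0
    case s: String => s.length
    case other     => other.toString.length
  }.sum
}

df.withColumn("Length", charLengthUdf(struct(df.columns.map(col): _*)))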

array[array["string"]] with explode option dropping null rows in spark/scala [duplicate]

I have a Dataframe that I am trying to flatten. As part of the process, I want to explode it, so if I have a column of arrays, each value of the array will be used to create a separate row. For instance,
id | name | likes
_______________________________
1 | Luke | [baseball, soccer]
should become
id | name | likes
_______________________________
1 | Luke | baseball
1 | Luke | soccer
This is my code
private DataFrame explodeDataFrame(DataFrame df) {
    DataFrame resultDf = df;
    for (StructField field : df.schema().fields()) {
        if (field.dataType() instanceof ArrayType) {
            resultDf = resultDf.withColumn(field.name(), org.apache.spark.sql.functions.explode(resultDf.col(field.name())));
            resultDf.show();
        }
    }
    return resultDf;
}
The problem is that in my data, some of the array columns have nulls. In that case, the entire row is deleted. So this dataframe:
id | name | likes
_______________________________
1 | Luke | [baseball, soccer]
2 | Lucy | null
becomes
id | name | likes
_______________________________
1 | Luke | baseball
1 | Luke | soccer
instead of
id | name | likes
_______________________________
1 | Luke | baseball
1 | Luke | soccer
2 | Lucy | null
How can I explode my arrays so that I don't lose the null rows?
I am using Spark 1.5.2 and Java 8
Spark 2.2+
You can use explode_outer function:
import org.apache.spark.sql.functions.explode_outer
df.withColumn("likes", explode_outer($"likes")).show
// +---+----+--------+
// | id|name| likes|
// +---+----+--------+
// | 1|Luke|baseball|
// | 1|Luke| soccer|
// | 2|Lucy| null|
// +---+----+--------+
Spark <= 2.1
This is Scala, but the Java equivalent should be almost identical (to import individual functions use import static).
import org.apache.spark.sql.functions.{array, col, explode, lit, when}

val df = Seq(
  (1, "Luke", Some(Array("baseball", "soccer"))),
  (2, "Lucy", None)
).toDF("id", "name", "likes")

df.withColumn("likes", explode(
  when(col("likes").isNotNull, col("likes"))
    // If null explode an array<string> with a single null
    .otherwise(array(lit(null).cast("string")))))
The idea here is basically to replace NULL with an array(NULL) of the desired type. For complex types (i.e. structs) you have to provide the full schema:
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val dfStruct = Seq((1L, Some(Array((1, "a")))), (2L, None)).toDF("x", "y")

val st = StructType(Seq(
  StructField("_1", IntegerType, false),
  StructField("_2", StringType, true)
))

dfStruct.withColumn("y", explode(
  when(col("y").isNotNull, col("y"))
    .otherwise(array(lit(null).cast(st)))))
or
dfStruct.withColumn("y", explode(
  when(col("y").isNotNull, col("y"))
    .otherwise(array(lit(null).cast("struct<_1:int,_2:string>")))))
Note: if the array column has been created with containsNull set to false, you should change this first (tested with Spark 2.1):
df.withColumn("array_column", $"array_column".cast(ArrayType(SomeType, true)))
You can use explode_outer() function.
Following up on the accepted answer: when the array elements are a complex type, it can be difficult to define the schema by hand (e.g. with large structs).
To do it automatically I wrote the following helper method:
import org.apache.spark.sql.{Dataset, Row}
import org.apache.spark.sql.functions.{array, col, explode, lit, size, when}
import org.apache.spark.sql.types.ArrayType

def explodeOuter(df: Dataset[Row], columnsToExplode: List[String]) = {
  // collect the element type of every array column
  val arrayFields = df.schema.fields
    .map(field => field.name -> field.dataType)
    .collect { case (name, arrayType: ArrayType) => name -> arrayType }
    .toMap
  columnsToExplode.foldLeft(df) { (dataFrame, arrayCol) =>
    dataFrame.withColumn(arrayCol, explode(
      when(size(col(arrayCol)) =!= 0, col(arrayCol))
        .otherwise(array(lit(null).cast(arrayFields(arrayCol).elementType)))))
  }
}
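For example, a hypothetical call on a likes-style frame where one row has an empty array (assuming spark.implicits._ is in scope):

val df = Seq(
  (1, "Luke", Array("baseball", "soccer")),
  (2, "Lucy", Array.empty[String])
).toDF("id", "name", "likes")

// The empty-array row survives with a null in the likes column.
explodeOuter(df, List("likes")).show()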
Edit: it seems that spark 2.2 and newer have this built in.
To handle an empty map-type column (for Spark <= 2.1):
val df = List(
  (1, Array(2, 3, 4), Map(1 -> "a")),
  (2, Array(5, 6, 7), Map(2 -> "b")),
  (3, Array[Int](), Map[Int, String]())).toDF("col1", "col2", "col3")
df.show()

df.select('col1, explode(when(size(map_keys('col3)) === 0, map(lit("null"), lit("null")))
  .otherwise('col3))).show()
from pyspark.sql.functions import *

def flatten_df(nested_df):
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
    flat_df = nested_df.select(flat_cols +
                               [col(nc + '.' + c).alias(nc + '_' + c)
                                for nc in nested_cols
                                for c in nested_df.select(nc + '.*').columns])
    print("flatten_df_count :", flat_df.count())
    return flat_df

def explode_df(nested_df):
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct' and c[1][:5] != 'array']
    array_cols = [c[0] for c in nested_df.dtypes if c[1][:5] == 'array']
    for array_col in array_cols:
        schema = nested_df.select(array_col).dtypes[0][1]
        nested_df = nested_df.withColumn(array_col, when(col(array_col).isNotNull(), col(array_col)).otherwise(array(lit(None)).cast(schema)))
    nested_df = nested_df.withColumn("tmp", arrays_zip(*array_cols)).withColumn("tmp", explode("tmp")).select([col("tmp." + c).alias(c) for c in array_cols] + flat_cols)
    print("explode_dfs_count :", nested_df.count())
    return nested_df

new_df = flatten_df(myDf)
while True:
    array_cols = [c[0] for c in new_df.dtypes if c[1][:5] == 'array']
    if len(array_cols):
        new_df = flatten_df(explode_df(new_df))
    else:
        break

new_df.printSchema()
Used arrays_zip and explode to do it faster and address the null issue.

I have a DataFrame as below and want to add remarks based on the column values, using Scala

Below is my Input
id val visits date
111 2 1 20160122
111 2 1 20170122
112 4 2 20160122
112 5 4 20150122
113 6 1 20100120
114 8 2 20150122
114 8 2 20150122
Expected Output:
id val visits date remarks
111 2 1 20160122 oldDate
111 2 1 20170122 recentdate
112 4 2 20160122 less
112 5 4 20150122 more
113 6 1 20100120 one
114 8 2 20150122 Random
114 8 2 20150122 Random
Remarks should be:
Random - the id has two records with the same value, visits, and date
One Visit - the id has only one record, with any number of visits
Less Visits - the id has two records and this one has fewer visits than the other
More Visits - the id has more than one record with different value and visits, and this one has more visits
recentdate - the id has multiple records with the same value and visits but different dates, and this one has the max date
oldDate - the id has multiple records with the same value and visits but different dates, and this one has the min date
code:
val grouped = df.groupBy("id").agg(
  max($"val").as("maxVal"), max($"visits").as("maxVisits"),
  min($"val").as("minVal"), min($"visits").as("minVisits"),
  count($"id").as("count"))

val remarks = functions.udf((value: Int, visits: Int, maxValue: Int, maxVisits: Int, minValue: Int, minVisits: Int, count: Int) =>
  if (count == 1) {
    "One Visit"
  } else if (value == maxValue && value == minValue && visits == maxVisits && visits == minVisits) {
    "Random"
  } else {
    if (visits < maxVisits) {
      "Less Visits"
    } else {
      "More Visits"
    }
  }
)

df.join(grouped, Seq("id"))
  .withColumn("remarks", remarks($"val", $"visits", $"maxVal", $"maxVisits", $"minVal", $"minVisits", $"count"))
  .drop("maxVal", "maxVisits", "minVal", "minVisits", "count")
The following code should work for you (but it's not efficient, as there are many if-elses):
import org.apache.spark.sql.functions._

def remarkUdf = udf((column: Seq[Row]) => {
  if (column.size == 1) Seq(remarks(column(0).getAs(0), column(0).getAs(1), column(0).getAs(2), "one"))
  else if (column.size == 2) {
    if (column(0) == column(1)) column.map(x => remarks(x.getAs(0), x.getAs(1), x.getAs(2), "Random"))
    else {
      if (column(0).getAs(0) == column(1).getAs(0) && column(0).getAs(1) == column(1).getAs(1)) {
        if (column(0).getAs[Int](2) < column(1).getAs[Int](2)) Seq(remarks(column(0).getAs(0), column(0).getAs(1), column(0).getAs(2), "oldDate"), remarks(column(1).getAs(0), column(1).getAs(1), column(1).getAs(2), "recentdate"))
        else Seq(remarks(column(0).getAs(0), column(0).getAs(1), column(0).getAs(2), "recentdate"), remarks(column(1).getAs(0), column(1).getAs(1), column(1).getAs(2), "oldDate"))
      }
      else {
        if (column(0).getAs[Int](0) < column(1).getAs[Int](0) && column(0).getAs[Int](1) < column(1).getAs[Int](1)) {
          Seq(remarks(column(0).getAs(0), column(0).getAs(1), column(0).getAs(2), "less"), remarks(column(1).getAs(0), column(1).getAs(1), column(1).getAs(2), "more"))
        }
        else Seq(remarks(column(0).getAs(0), column(0).getAs(1), column(0).getAs(2), "more"), remarks(column(1).getAs(0), column(1).getAs(1), column(1).getAs(2), "less"))
      }
    }
  }
  else {
    column.map(x => remarks(x.getAs(0), x.getAs(1), x.getAs(2), "not defined"))
  }
})
df.groupBy("id").agg(collect_list(struct("val", "visits", "date")).as("value"))
.withColumn("value", explode(remarkUdf(col("value"))))
.select(col("id"), col("value.*"))
.show(false)
It should give you:
+---+-----+------+--------+----------+
|id |value|Visits|date |Remarks |
+---+-----+------+--------+----------+
|111|2 |1 |20160122|oldDate |
|111|2 |1 |20170122|recentdate|
|112|4 |2 |20160122|less |
|112|5 |4 |20150122|more |
|114|8 |2 |20150122|Random |
|114|8 |2 |20150122|Random |
|113|6 |1 |20100120|one |
+---+-----+------+--------+----------+
And you need the following case class
case class remarks(value: Int, Visits: Int, date: Int, Remarks: String)

How to collect and process column-wise data in Spark

I have a dataframe that contains 7 days of hourly data (24 hours per day), so it has 168 hourly columns:
id d1h1 d1h2 d1h3 ..... d7h24
aaa 21 24 8 ..... 14
bbb 16 12 2 ..... 4
ccc 21 2 7 ..... 6
What I want to do is to find the top 3 values for each day:
id d1 d2 d3 .... d7
aaa [22,2,2] [17,2,2] [21,8,3] [32,11,2]
bbb [32,22,12] [47,22,2] [31,14,3] [32,11,2]
ccc [12,7,4] [28,14,7] [11,2,1] [19,14,7]
import org.apache.spark.sql.functions._

var df = ...
val first3 = udf((list: Seq[Double]) => list.slice(0, 3))
for (i <- 1 to 7) { // inclusive: days d1..d7
  val columns = (1 to 24).map(x => "d" + i + "h" + x) // hours h1..h24
  df = df
    .withColumn("d" + i, first3(sort_array(array(columns.head, columns.tail: _*), false)))
    .drop(columns: _*)
}
This should give you what you want: for each day I aggregate the 24 hourly columns into an array column, sort it in descending order, and select the first 3 elements.
Define pattern:
val p = "^(d[1-7])h[0-9]{1,2}$".r
Group columns:
import org.apache.spark.sql.functions._

val cols = df.columns.tail
  .groupBy { case p(d) => d }
  .map { case (c, cs) =>
    val sorted = sort_array(array(cs map col: _*), false)
    array(sorted(0), sorted(1), sorted(2)).as(c)
  }
And select:
df.select($"id" +: cols.toSeq: _*)

How to filter a map<String, Int> in a data frame : Spark / Scala

I am trying to get per-column counts to publish metrics. I have a df [customerId: string, totalRent: bigint, totalPurchase: bigint, itemTypeCounts: map<string, int>].
Right now I am doing:
val totalCustomers = df.count
val totalPurchaseCount = df.filter("totalPurchase > 0").count
val totalRentCount = df.filter("totalRent > 0").count

publishMetrics("Total Customer", totalCustomers)
publishMetrics("Total Purchase", totalPurchaseCount)
publishMetrics("Total Rent", totalRentCount)
publishMetrics("Percentage of Rent", percentage(totalRentCount, totalCustomers))
publishMetrics("Percentage of Purchase", percentage(totalPurchaseCount, totalCustomers))

private def percentage(num: Long, denom: Long): Double = {
  // convert to Double before dividing to avoid integer division
  if (denom == 0L) 0.0
  else (num.toDouble / denom) * 100
}
But I am not sure how to do this for itemTypeCounts, which is a map. I want a count and a percentage for each key entry. The issue is that the keys are dynamic; there is no way I know them beforehand. Can someone tell me how to get a count for each key? I am new to Scala/Spark, and any other efficient approaches to get the counts of each column are much appreciated.
Sample data :
customerId : 1
totalPurchase : 17
totalRent : 0
itemTypeCounts : {"TV" : 4, "Blender" : 2}
customerId : 2
totalPurchase : 1
totalRent : 1
itemTypeCounts : {"Cloths" : 4}
customerId : 3
totalPurchase : 0
totalRent : 10
itemTypeCounts : {"TV" : 4}
So the output is :
totalCustomer : 3
totalPurchaseCount : 2 (2 customers with totalPurchase > 0)
totalRent : 2 (2 customers with totalRent > 0)
itemTypeCounts_TV : 2
itemTypeCounts_Cloths : 1
itemTypeCounts_Blender : 1
You can accomplish this in Spark SQL; I show two examples below (one where the keys are known and can be enumerated in code, one where the keys are unknown). Note that by using Spark SQL you take advantage of the Catalyst optimizer, so this will run very efficiently:
val data = List(
  (1, 17, 0, Map("TV" -> 4, "Blender" -> 2)),
  (2, 1, 1, Map("Cloths" -> 4)),
  (3, 0, 10, Map("TV" -> 4)))
val df = data.toDF("customerId", "totalPurchase", "totalRent", "itemTypeCounts")

// Only good if you can enumerate the keys
def countMapKey(name: String) = {
  count(when($"itemTypeCounts".getItem(name).isNotNull, lit(1))).as(s"itemTypeCounts_$name")
}
val keysToCount = List("TV", "Blender", "Cloths").map(key => countMapKey(key))
df.select(keysToCount: _*).show
+-----------------+----------------------+---------------------+
|itemTypeCounts_TV|itemTypeCounts_Blender|itemTypeCounts_Cloths|
+-----------------+----------------------+---------------------+
| 2| 1| 1|
+-----------------+----------------------+---------------------+
//More generic
val pivotData = df.select(explode(col("itemTypeCounts"))).groupBy(lit(1).as("tmp")).pivot("key").count.drop("tmp")
val renameStatement = pivotData.columns.map(name => col(name).as(s"itemTypeCounts_$name"))
pivotData.select(renameStatement :_*).show
+----------------------+---------------------+-----------------+
|itemTypeCounts_Blender|itemTypeCounts_Cloths|itemTypeCounts_TV|
+----------------------+---------------------+-----------------+
| 1| 1| 2|
+----------------------+---------------------+-----------------+
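As a hypothetical follow-up (not part of this answer), the pivoted counts can be turned into the percentages the question asks for, using the total row count as the denominator:

import org.apache.spark.sql.functions.col

val totalCustomers = df.count()   // same quantity as totalCustomers in the question
val pctCols = pivotData.columns.map { name =>
  (col(name) * 100.0 / totalCustomers).as(s"pct_$name")
}
pivotData.select(pctCols: _*).show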
I'm a Spark newbie myself, so there is probably a better way to do this. But one thing you could try is transforming itemTypeCounts into a data structure in Scala that you can work with. I converted each row to a List of (Name, Count) pairs, e.g. List((Blender,2), (TV,4)).
With this you can have a List of such list of pairs, one list of pairs for each row. In your example, this will be a List of 3 elements:
List(
List((Blender,2), (TV,4)),
List((Cloths,4)),
List((TV,4))
)
Once you have this structure, transforming it to a desired output is standard scala.
Worked example is below:
val itemTypeCounts = df.select("itemTypeCounts")

// Build the List of Lists of pairs suggested above
val itemsList = itemTypeCounts.collect().map { row =>
  val values = row.getStruct(0).mkString("", ",", "").split(",")
  val fields = row.schema.head.dataType.asInstanceOf[StructType].map(s => s.name).toList
  fields.zip(values).filter(p => p._2 != "null")
}.toList

// Build a summary map from the list constructed above
def itemTypeCountsSummary(frames: List[List[(String, String)]], summary: Map[String, Int]): Map[String, Int] = frames match {
  case Nil => summary
  case _   => itemTypeCountsSummary(frames.tail, merge(frames.head, summary))
}

// Helper method for the summary map
def merge(head: List[(String, String)], summary: Map[String, Int]): Map[String, Int] = {
  val headMap = head.toMap.map(e => ("itemTypeCounts_" + e._1, 1))
  val updatedSummary = summary.map { e => if (headMap.contains(e._1)) (e._1, e._2 + 1) else e }
  updatedSummary ++ headMap.filter(e => !updatedSummary.contains(e._1))
}

val summaryMap = itemTypeCountsSummary(itemsList, Map())
summaryMap.foreach(e => println(e._1 + ": " + e._2))
Output:
itemTypeCounts_Blender: 1
itemTypeCounts_TV: 2
itemTypeCounts_Cloths: 1
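A hedged alternative sketch (not from this answer): instead of collecting rows to the driver, let Spark do the per-key counting with explode and groupBy on the df built above:

import org.apache.spark.sql.functions.{col, explode}

val keyCounts = df
  .select(explode(col("itemTypeCounts")))   // exploding a map yields two columns: key, value
  .groupBy("key")
  .count()

keyCounts.collect().foreach { row =>
  println(s"itemTypeCounts_${row.getString(0)}: ${row.getLong(1)}")
}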
Borrowing the input from Nick and using a Spark SQL pivot solution:
val data = List((1,17,0,Map("TV" -> 4, "Blender" -> 2)),(2,1,1,Map("Cloths" -> 4)),(3,0,10,Map("TV" -> 4)))
val df = data.toDF("customerId","totalPurchase","totalRent","itemTypeCounts")
df.show(false)
df.createOrReplaceTempView("df")
+----------+-------------+---------+-----------------------+
|customerId|totalPurchase|totalRent|itemTypeCounts |
+----------+-------------+---------+-----------------------+
|1 |17 |0 |[TV -> 4, Blender -> 2]|
|2 |1 |1 |[Cloths -> 4] |
|3 |0 |10 |[TV -> 4] |
+----------+-------------+---------+-----------------------+
Assuming that we know the distinct itemType beforehand, we can use
val dfr = spark.sql("""
select * from (
select explode(itemTypeCounts) itemTypeCounts from (
select flatten(collect_list(map_keys(itemTypeCounts))) itemTypeCounts from df
) ) t
pivot ( count(itemTypeCounts) as c3
for itemTypeCounts in ('TV' ,'Blender' ,'Cloths') )
""")
dfr.show(false)
+---+-------+------+
|TV |Blender|Cloths|
+---+-------+------+
|2 |1 |1 |
+---+-------+------+
For renaming columns,
dfr.select(dfr.columns.map( x => col(x).alias("itemTypeCounts_" + x )):_* ).show(false)
+-----------------+----------------------+---------------------+
|itemTypeCounts_TV|itemTypeCounts_Blender|itemTypeCounts_Cloths|
+-----------------+----------------------+---------------------+
|2 |1 |1 |
+-----------------+----------------------+---------------------+
To get the distinct itemType values dynamically and pass them to pivot:
val item_count_arr = spark.sql(""" select array_distinct(flatten(collect_list(map_keys(itemTypeCounts)))) itemTypeCounts from df """).as[Array[String]].first
item_count_arr: Array[String] = Array(TV, Blender, Cloths)
spark.sql(s"""
select * from (
select explode(itemTypeCounts) itemTypeCounts from (
select flatten(collect_list(map_keys(itemTypeCounts))) itemTypeCounts from df
) ) t
pivot ( count(itemTypeCounts) as c3
for itemTypeCounts in (${item_count_arr.map(c => "'"+c+"'").mkString(",")}) )
""").show(false)
+---+-------+------+
|TV |Blender|Cloths|
+---+-------+------+
|2 |1 |1 |
+---+-------+------+