Scala spark + encoder issues - scala

I am working on a problem where I need to add a new column that holds the total number of characters across all columns of each row.
My sample data set :
ItemNumber,StoreNumber,SaleAmount,Quantity, Date
2231 , 1 , 400 , 2 , 19/01/2020
2145 , 3 , 500 , 10 , 14/01/2020
The expected lengths for these two rows are
19 20
The output I want to build has a new column, Length, added to the data frame
ItemNumber,StoreNumber,SaleAmount,Quantity, Date , Length
2231 , 1 , 400 , 2 , 19/01/2020, 19
2145 , 3 , 500 , 10 , 14/01/2020, 20
My code
val spark = SparkSession.builder()
.appName("SimpleNewIntColumn").master("local").enableHiveSupport().getOrCreate()
val df = spark.read.option("header","true").csv("./data/sales.csv")
var schema = new StructType
df.schema.toList.map{
each => schema = schema.add(each)
}
val encoder = RowEncoder(schema)
val charLength = (row :Row) => {
var len :Int = 0
row.toSeq.map(x => {
x match {
case a : Int => len = len + a.toString.length
case a : String => len = len + a.length
}
})
len
}
df.map(row => charLength(row))(encoder) // ERROR - Required Encoder[Int] Found EncoderExpression[Row]
df.withColumn("Length", ?)
I have two issues
1) How to solve the error "Required Encoder[Int] Found EncoderExpression[Row]"?
2) How do I add the output of charLength function as new column value? - df.withColumn("Length", ?)
Thank you.
Gurupraveen

If you just want to add a column holding the total length of each row, you can concatenate all the columns, cast to String, and apply the length function. Note that concat returns null if any of its input columns is null.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StringType
val concatCol = concat(df.columns.map(col(_).cast(StringType)):_*)
df.withColumn("Length", length(concatCol))
Output:
+----------+-----------+----------+--------+----------+------+
|ItemNumber|StoreNumber|SaleAmount|Quantity|      Date|Length|
+----------+-----------+----------+--------+----------+------+
|      2231|          1|       400|       2|19/01/2020|    19|
|      2145|          3|       500|      10|14/01/2020|    20|
+----------+-----------+----------+--------+----------+------+
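The per-row arithmetic behind that Length column can be checked without Spark. The following is a minimal plain-Scala sketch, not the answer's code: RowLength and charLength are hypothetical names, a row is modelled as a Seq[Any], and cell values are assumed to carry no surrounding whitespace. Unlike the match in the question, it does not fail on types other than Int and String, and it skips nulls.

```scala
object RowLength {
  // Total number of characters across all non-null cells of one row.
  // Every value is rendered with toString, so Int, Long, Double and String are all handled.
  def charLength(row: Seq[Any]): Int =
    row.collect { case v if v != null => v.toString.length }.sum
}
```

For the two sample rows this gives 19 and 20, matching the expected output in the question.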

Related

getting the values of a column with keys - spark scala

I have a Map[String, Int] like this
val map1 = Map( "S" -> 1 , "T" -> 2, "U" -> 3)
and a DataFrame with a column called mappedcol (type array[string]). The first and second rows of the column are [S,U] and [U,U]. I would like to map every element to the value for its key, so that I get [1,3] instead of [S,U] and [3,3] instead of [U,U]. How can I do this efficiently?
Thanks
The map can be transformed into an SQL expression based on transform and when (transform requires Spark 2.4+; the array column is named value here):
var ex = "transform(value, v -> case ";
for ((k,v) <- map1) ex += s"when v = '${k}' then ${v} "
ex += "else 99 end)"
ex now contains the string
transform(value, v -> case when v = 'S' then 1 when v = 'T' then 2 when v = 'U' then 3 else 99 end)
This expression can now be used to calculate a new column:
import org.apache.spark.sql.functions._
df.withColumn("result", expr(ex)).show();
Output:
+---+------+------+
| id| value|result|
+---+------+------+
|  1|[S, U]|[1, 3]|
|  2|[U, U]|[3, 3]|
+---+------+------+
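The string-building step can also be written without a mutable var. This is a small sketch under assumptions: TransformExpr and build are hypothetical names, the mapping is passed as an ordered Seq of pairs (for example map1.toSeq) so the branch order is predictable, and keys are assumed to contain no single quotes, since nothing here escapes them.

```scala
object TransformExpr {
  // Builds a Spark SQL `transform` expression that maps each array element
  // through a CASE WHEN chain. Keys are NOT escaped: assumed quote-free.
  def build(column: String, mapping: Seq[(String, Int)], default: Int): String = {
    val branches = mapping
      .map { case (k, v) => s"when v = '$k' then $v" }
      .mkString(" ")
    s"transform($column, v -> case $branches else $default end)"
  }
}
```

Calling TransformExpr.build("value", Seq("S" -> 1, "T" -> 2, "U" -> 3), 99) yields the same expression string as the loop above, which can then be passed to expr.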

Need to transform a dataframe using dataframe API instead RDD

I have a dataframe with the following data
loans,MTG,111
loans,MTG,102
loans,CRDS,103
loans,PCL,104
loans,PCL,105
I want to get result something like this
loans , MTG:111:102, PCL:104:105 , CRDS:103
I am able to achieve it using the RDD transformations
var data = Seq(("loans","MTG",111),("loans","MTG" ,102),("loans","CRDS",103),("loans","PCL",104),("loans","PCL",105))
var fd1 = sc.parallelize(data)
var fd2 = fd1.map(x => ( (x._1,x._2) , x._3 ) )
var fd3 = fd2.reduceByKey( (a,b) => a.toString + ":" + b.toString )
var fd4 = fd3.map( x=> (x._1._1,(x._1._2 + ":"+ x._2)))
var fd5 = fd4.groupByKey()
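I want to use the DataFrame/Dataset API, or maybe Spark SQL, to achieve the same result. Could you please help?
Use the built-in groupBy, collect_list (or collect_set) and concat_ws functions from the DataFrame API. Note that collect_set drops duplicate values, and neither function guarantees the order of the collected elements.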
Example:
//sample dataframe (toDF needs spark.implicits._, which spark-shell imports automatically)
import org.apache.spark.sql.functions._
val data = Seq(("loans","MTG",111),("loans","MTG" ,102),("loans","CRDS",103),("loans","PCL",104),("loans","PCL",105)).toDF("col1","col2","col3")
data.show()
//+-----+----+----+
//| col1|col2|col3|
//+-----+----+----+
//|loans| MTG| 111|
//|loans| MTG| 102|
//|loans|CRDS| 103|
//|loans| PCL| 104|
//|loans| PCL| 105|
//+-----+----+----+
data.groupBy("col1","col2").
agg(concat_ws(":",collect_set("col3")).alias("col3")).
selectExpr("col1","""concat_ws(":",col2,col3) as col2""").
groupBy("col1").
agg(concat_ws(",",collect_list("col2")).alias("col2")).
show(false)
//+-----+--------------------------------+
//|col1 |col2                            |
//+-----+--------------------------------+
//|loans|MTG:102:111,CRDS:103,PCL:104:105|
//+-----+--------------------------------+
//collect
data.groupBy("col1","col2").agg(concat_ws(":",collect_set("col3")).alias("col3")).selectExpr("col1","""concat_ws(":",col2,col3) as col2""").groupBy("col1").agg(concat_ws(",",collect_list("col2")).alias("col2")).collect()
//res22: Array[org.apache.spark.sql.Row] = Array([loans,MTG:102:111,CRDS:103,PCL:104:105])
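The two-level grouping is easier to see on an ordinary Scala collection. This is only an illustrative sketch, not Spark code: LoanGrouping and summarize are hypothetical names. Values keep their input order within each group, and the per-type segments are sorted so that the result is deterministic.

```scala
object LoanGrouping {
  // First group by (col1, col2) and join the values with ":",
  // then group by col1 and join the per-type segments with ",".
  def summarize(rows: Seq[(String, String, Int)]): Map[String, String] = {
    val perType: Map[(String, String), String] =
      rows.groupBy { case (c1, c2, _) => (c1, c2) }
        .map { case ((c1, c2), group) =>
          (c1, c2) -> (c2 +: group.map(_._3.toString)).mkString(":")
        }
    perType.toSeq
      .groupBy { case ((c1, _), _) => c1 }
      .map { case (c1, segments) => c1 -> segments.map(_._2).sorted.mkString(",") }
  }
}
```

For the sample rows, summarize returns a single entry, loans -> CRDS:103,MTG:111:102,PCL:104:105.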

Filter only particular format of date in Scala

I have a dataframe where some of the fields hold values in the formats D.HH:mm:ss, D.HH:mm:ss.SSSSSSS and HH:mm:ss.SSSSSSS. I need to keep only the values of the form HH:mm:ss.SSSSSSS and convert them to seconds (integer).
I've written the Scala code below, which converts such a value to seconds. I need help filtering for that one format (HH:mm:ss.SSSSSSS) and skipping the other formats in the dataframe. Any help would be appreciated.
def hoursToSeconds(a: Any): Int = {
val sec = a.toString.split('.')
val fields = sec(0).split(':')
val creationSeconds = fields(0).toInt*3600 + fields(1).toInt*60 + fields(2).toInt
return creationSeconds
}
The task can be split up into two parts:
1. Filter the required rows with the help of rlike
2. Calculate the seconds in a udf
Create some test data:
val df = Seq(
("one", "1.09:39:26"),
("two", "1.09:39:26.1234567"),
("three", "09:39:26.1234567")
).toDF("info", "time")
Definition of regexp and udf:
val pattern = "\\A(\\d{1,2}):(\\d{2}):(\\d{2})\\.\\d{7}\\z".r
val toSeconds = udf{in: String => {
val pattern(hour, minute, second) = in
hour.toInt * 60 * 60 + minute.toInt * 60 + second.toInt
}}
The actual code:
df
.filter('time rlike pattern.regex)
.select('info, 'time, toSeconds('time).as("seconds"))
.show
prints
+-----+----------------+-------+
| info|            time|seconds|
+-----+----------------+-------+
|three|09:39:26.1234567|  34766|
+-----+----------------+-------+
If the rows that do not have the correct format should be kept, remove the filter and change the udf slightly so that it returns 0 for them:
val pattern = "\\A(\\d{1,2}):(\\d{2}):(\\d{2})\\.\\d{7}\\z".r
val toSeconds = udf{in: String => {
in match {
case pattern(hour, minute, second)=> hour.toInt * 60 * 60 + minute.toInt * 60 + second.toInt
case _ => 0
}
}}
df
.select('info, 'time, toSeconds('time).as("seconds"))
.show
prints
+-----+------------------+-------+
| info|              time|seconds|
+-----+------------------+-------+
|  one|        1.09:39:26|      0|
|  two|1.09:39:26.1234567|      0|
|three|  09:39:26.1234567|  34766|
+-----+------------------+-------+
You can try matching using a regex with extractors like so:
val dateRegex = """(\d{2}):(\d{2}):(\d{2})\.(\d{7})""".r
val D_HH_mm_ss = "1.12:12:12"
val D_HH_mm_ss_SSSSSSS = "1.12:12:12.1234567"
val HH_mm_ss_SSSSSSS = "12:12:12.1234567"
val dates = List(HH_mm_ss_SSSSSSS, D_HH_mm_ss_SSSSSSS, D_HH_mm_ss)
// A Regex used as a pattern must match the whole string, so the D. prefixed values fall through
dates.foreach {
case dateRegex(hh, mm, ss, _) => println(s"Yay! $hh-$mm-$ss")
case _ => println("Nay :(")
}
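Both answers rely on the same idea: a Scala Regex used as an extractor only matches when the entire string fits the pattern. Here is a minimal sketch of that parsing step on its own, outside Spark. DurationParser and toSeconds are hypothetical names, and returning Option[Int] rather than 0 for non-matching input is a design choice of this sketch, not something taken from the answers.

```scala
object DurationParser {
  // HH:mm:ss.SSSSSSS only; the fraction is matched but not captured.
  private val HhMmSsFraction = """(\d{1,2}):(\d{2}):(\d{2})\.\d{7}""".r

  // Some(seconds) for HH:mm:ss.SSSSSSS, None for any other format.
  def toSeconds(s: String): Option[Int] = s match {
    case HhMmSsFraction(h, m, sec) => Some(h.toInt * 3600 + m.toInt * 60 + sec.toInt)
    case _                         => None
  }
}
```

Returning an Option keeps "not in the expected format" distinguishable from a genuine value of 0 seconds such as 00:00:00.0000000.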

array[array["string"]] with explode option dropping null rows in spark/scala [duplicate]

I have a Dataframe that I am trying to flatten. As part of the process, I want to explode it, so if I have a column of arrays, each value of the array will be used to create a separate row. For instance,
id | name | likes
_______________________________
1 | Luke | [baseball, soccer]
should become
id | name | likes
_______________________________
1 | Luke | baseball
1 | Luke | soccer
This is my code
private DataFrame explodeDataFrame(DataFrame df) {
DataFrame resultDf = df;
for (StructField field : df.schema().fields()) {
if (field.dataType() instanceof ArrayType) {
resultDf = resultDf.withColumn(field.name(), org.apache.spark.sql.functions.explode(resultDf.col(field.name())));
resultDf.show();
}
}
return resultDf;
}
The problem is that in my data, some of the array columns have nulls. In that case, the entire row is deleted. So this dataframe:
id | name | likes
_______________________________
1 | Luke | [baseball, soccer]
2 | Lucy | null
becomes
id | name | likes
_______________________________
1 | Luke | baseball
1 | Luke | soccer
instead of
id | name | likes
_______________________________
1 | Luke | baseball
1 | Luke | soccer
2 | Lucy | null
How can I explode my arrays so that I don't lose the null rows?
I am using Spark 1.5.2 and Java 8
Spark 2.2+
You can use explode_outer function:
import org.apache.spark.sql.functions.explode_outer
df.withColumn("likes", explode_outer($"likes")).show
// +---+----+--------+
// | id|name|   likes|
// +---+----+--------+
// |  1|Luke|baseball|
// |  1|Luke|  soccer|
// |  2|Lucy|    null|
// +---+----+--------+
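The difference between the two behaviours can be illustrated with plain Scala collections. This is an analogy only, not Spark code: ExplodeSketch, Person, explode and explodeOuter here are hypothetical names, a null array is modelled as None, and a null output cell is also modelled as None.

```scala
object ExplodeSketch {
  final case class Person(id: Int, name: String, likes: Option[Seq[String]])

  // Inner behaviour: a row with a missing or empty array produces no output rows.
  def explode(rows: Seq[Person]): Seq[(Int, String, Option[String])] =
    for {
      p    <- rows
      like <- p.likes.getOrElse(Seq.empty)
    } yield (p.id, p.name, Some(like))

  // Outer behaviour: a missing or empty array produces one row with an empty value.
  def explodeOuter(rows: Seq[Person]): Seq[(Int, String, Option[String])] =
    rows.flatMap { p =>
      val likes: Seq[Option[String]] =
        p.likes.filter(_.nonEmpty).map(_.map(Option(_))).getOrElse(Seq(None))
      likes.map(like => (p.id, p.name, like))
    }
}
```

With Luke and Lucy from the question as input, explode returns two rows and explodeOuter returns three, the last one being (2, Lucy, None).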
Spark <= 2.1
This is in Scala, but the Java equivalent should be almost identical (to import individual functions, use import static).
import org.apache.spark.sql.functions.{array, col, explode, lit, when}
val df = Seq(
(1, "Luke", Some(Array("baseball", "soccer"))),
(2, "Lucy", None)
).toDF("id", "name", "likes")
df.withColumn("likes", explode(
when(col("likes").isNotNull, col("likes"))
// If null explode an array<string> with a single null
.otherwise(array(lit(null).cast("string")))))
The idea here is to replace NULL with an array(NULL) of the desired type. For complex types (i.e. structs) you have to provide the full schema:
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
val dfStruct = Seq((1L, Some(Array((1, "a")))), (2L, None)).toDF("x", "y")
val st = StructType(Seq(
StructField("_1", IntegerType, false), StructField("_2", StringType, true)
))
dfStruct.withColumn("y", explode(
when(col("y").isNotNull, col("y"))
.otherwise(array(lit(null).cast(st)))))
or
dfStruct.withColumn("y", explode(
when(col("y").isNotNull, col("y"))
.otherwise(array(lit(null).cast("struct<_1:int,_2:string>")))))
Note:
If array Column has been created with containsNull set to false you should change this first (tested with Spark 2.1):
df.withColumn("array_column", $"array_column".cast(ArrayType(SomeType, true)))
You can use explode_outer() function.
Following up on the accepted answer: when the array elements are of a complex type (e.g. large structs), it can be difficult to define the schema by hand.
To do it automatically I wrote the following helper method:
def explodeOuter(df: Dataset[Row], columnsToExplode: List[String]): Dataset[Row] = {
// Map each array column name to its ArrayType so the element type can be looked up
val arrayFields = df.schema.fields
.map(field => field.name -> field.dataType)
.collect { case (name, arrayType: ArrayType) => name -> arrayType }
.toMap
columnsToExplode.foldLeft(df) { (dataFrame, arrayCol) =>
// size(...) > 0 is not true for empty or null arrays, so both fall through to array(null)
dataFrame.withColumn(arrayCol, explode(when(size(col(arrayCol)) > 0, col(arrayCol))
.otherwise(array(lit(null).cast(arrayFields(arrayCol).elementType)))))
}
}
Edit: Spark 2.2 and newer have this built in as explode_outer.
To handle an empty map type column, for Spark <= 2.1:
val df = List((1, Array(2, 3, 4), Map(1 -> "a")),
(2, Array(5, 6, 7), Map(2 -> "b")),
(3, Array[Int](), Map[Int, String]())).toDF("col1", "col2", "col3")
df.show()
// size works on maps directly; map_keys is only available from Spark 2.3
df.select('col1, explode(when(size('col3) === 0, map(lit("null"), lit("null"))).
otherwise('col3))).show()
from pyspark.sql.functions import *
def flatten_df(nested_df):
    # Promote every struct field to a top-level column named <struct>_<field>
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
    flat_df = nested_df.select(flat_cols +
                               [col(nc + '.' + c).alias(nc + '_' + c)
                                for nc in nested_cols
                                for c in nested_df.select(nc + '.*').columns])
    print("flatten_df_count :", flat_df.count())
    return flat_df
def explode_df(nested_df):
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct' and c[1][:5] != 'array']
    array_cols = [c[0] for c in nested_df.dtypes if c[1][:5] == 'array']
    for array_col in array_cols:
        # Look up the array type on the frame being processed, not on the global new_df
        schema = nested_df.select(array_col).dtypes[0][1]
        # Replace a null array with a one-element array holding null so the row survives explode
        nested_df = nested_df.withColumn(array_col, when(col(array_col).isNotNull(), col(array_col)).otherwise(array(lit(None)).cast(schema)))
    # Zip all array columns together and explode them in one pass
    nested_df = nested_df.withColumn("tmp", arrays_zip(*array_cols)).withColumn("tmp", explode("tmp")).select([col("tmp."+c).alias(c) for c in array_cols] + flat_cols)
    print("explode_dfs_count :", nested_df.count())
    return nested_df
new_df = flatten_df(myDf)
# Alternate exploding and flattening until no array columns remain
while True:
    array_cols = [c[0] for c in new_df.dtypes if c[1][:5] == 'array']
    if len(array_cols):
        new_df = flatten_df(explode_df(new_df))
    else:
        break
new_df.printSchema()
Used arrays_zip and explode to do it faster and address the null issue.

How to filter a map<String, Int> in a data frame : Spark / Scala

I am trying to get counts for individual columns to publish metrics. I have a df [customerId : string, totalRent : bigint, totalPurchase: bigint, itemTypeCounts: map<string, int> ]
Right now I am doing :
val totalCustomers = df.count
val totalPurchaseCount = df.filter("totalPurchase > 0").count
val totalRentCount = df.filter("totalRent > 0").count
publishMetrics("Total Customer", totalCustomers )
publishMetrics("Total Purchase", totalPurchaseCount )
publishMetrics("Total Rent", totalRentCount )
publishMetrics("Percentage of Rent", percentageCalc(totalRentCount, totalCustomers) )
publishMetrics("Percentage of Purchase", percentageCalc(totalPurchaseCount, totalCustomers) )
private def percentageCalc(num: Long, denom: Long): Double = {
// Convert to Double before dividing: Long division would truncate the ratio
if (denom == 0) 0.0
else (num.toDouble / denom) * 100
}
But I am not sure how to do this for itemTypeCounts, which is a map. I want the count and percentage for each key. The issue is that the keys are dynamic: I cannot know them beforehand. Can someone tell me how to get the count for each key? I am new to Scala/Spark, so any other efficient approaches to get the counts for each column are much appreciated.
Sample data :
customerId : 1
totalPurchase : 17
totalRent : 0
itemTypeCounts : {"TV" : 4, "Blender" : 2}
customerId : 2
totalPurchase : 1
totalRent : 1
itemTypeCounts : {"Cloths" : 4}
customerId : 3
totalPurchase : 0
totalRent : 10
itemTypeCounts : {"TV" : 4}
So the output is :
totalCustomer : 3
totalPurchaseCount : 2 (2 customers with totalPurchase > 0)
totalRent : 2 (2 customers with totalRent > 0)
itemTypeCounts_TV : 2
itemTypeCounts_Cloths : 1
itemTypeCounts_Blender : 1
You can accomplish this in Spark SQL. I show two examples below: one where the keys are known and can be enumerated in code, and one where the keys are unknown. Because both use only built-in Spark SQL functions, they benefit from the Catalyst optimizer:
val data = List((1,17,0,Map("TV" -> 4, "Blender" -> 2)),(2,1,1,Map("Cloths" -> 4)),(3,0,10,Map("TV" -> 4)))
val df = data.toDF("customerId","totalPurchase","totalRent","itemTypeCounts")
//Only good if you can enumerate the keys
def countMapKey(name:String) = {
count(when($"itemTypeCounts".getItem(name).isNotNull,lit(1))).as(s"itemTypeCounts_$name")
}
val keysToCount = List("TV","Blender","Cloths").map(key => countMapKey(key))
df.select(keysToCount :_*).show
+-----------------+----------------------+---------------------+
|itemTypeCounts_TV|itemTypeCounts_Blender|itemTypeCounts_Cloths|
+-----------------+----------------------+---------------------+
|                2|                     1|                    1|
+-----------------+----------------------+---------------------+
//More generic
val pivotData = df.select(explode(col("itemTypeCounts"))).groupBy(lit(1).as("tmp")).pivot("key").count.drop("tmp")
val renameStatement = pivotData.columns.map(name => col(name).as(s"itemTypeCounts_$name"))
pivotData.select(renameStatement :_*).show
+----------------------+---------------------+-----------------+
|itemTypeCounts_Blender|itemTypeCounts_Cloths|itemTypeCounts_TV|
+----------------------+---------------------+-----------------+
|                     1|                    1|                2|
+----------------------+---------------------+-----------------+
I'm a spark newbie myself, so there is probably a better way to do this. But one thing you could try is transforming the itemTypeCounts into a data structure in scala that you could work with. I converted each row to a List of (Name, Count) pairs e.g. List((Blender,2), (TV,4)).
With this you can have a List of such list of pairs, one list of pairs for each row. In your example, this will be a List of 3 elements:
List(
List((Blender,2), (TV,4)),
List((Cloths,4)),
List((TV,4))
)
Once you have this structure, transforming it to a desired output is standard scala.
Worked example is below:
val itemTypeCounts = df.select("itemTypeCounts")
//Build List of List of Pairs as suggested above
val itemsList = itemTypeCounts.collect().map {
row =>
val values = row.getStruct(0).mkString("",",","").split(",")
val fields = row.schema.head.dataType.asInstanceOf[StructType].map(s => s.name).toList
fields.zip(values).filter(p => p._2 != "null")
}.toList
// Build a summary map for the list constructed above
def itemTypeCountsSummary(frames: List[List[(String, String)]], summary: Map[String, Int]) : Map[String, Int] = frames match {
case Nil => summary
case _ => itemTypeCountsSummary(frames.tail, merge(frames.head, summary))
}
//helper method for the summary map.
def merge(head: List[(String, String)], summary: Map[String, Int]): Map[String, Int] = {
val headMap = head.toMap.map(e => ("itemTypeCounts_" + e._1, 1))
val updatedSummary = summary.map{e => if(headMap.contains(e._1)) (e._1, e._2 + 1) else e}
updatedSummary ++ headMap.filter(e => !updatedSummary.contains(e._1))
}
val summaryMap = itemTypeCountsSummary(itemsList, Map())
summaryMap.foreach(e => println(e._1 + ": " + e._2 ))
Output:
itemTypeCounts_Blender: 1
itemTypeCounts_TV: 2
itemTypeCounts_Cloths: 1
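The same per-key count, plus the percentage the question asks for, can be sketched more compactly on plain Scala maps. This is an illustration under assumptions rather than the answer's code: KeyCounts, countKeys, percentage and percentages are hypothetical names, and each row's itemTypeCounts is assumed to be already available on the driver as a Map[String, Int].

```scala
object KeyCounts {
  // Number of rows in which each key appears, with the itemTypeCounts_ prefix applied.
  def countKeys(rows: Seq[Map[String, Int]]): Map[String, Int] =
    rows.flatMap(_.keys)
      .groupBy(identity)
      .map { case (key, occurrences) => s"itemTypeCounts_$key" -> occurrences.size }

  // Percentage as a Double; converting first avoids truncating Long division.
  def percentage(num: Long, denom: Long): Double =
    if (denom == 0) 0.0 else num.toDouble / denom * 100

  // Share of rows that contain each key.
  def percentages(rows: Seq[Map[String, Int]]): Map[String, Double] =
    countKeys(rows).map { case (key, n) => key -> percentage(n, rows.size) }
}
```

For the three sample customers, countKeys gives TV 2, Blender 1 and Cloths 1, the same counts as the output above. Collecting every row to the driver only suits small data; the Spark SQL answers keep the work distributed.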
Borrowing the input from Nick and using spark-sql pivot solution:
val data = List((1,17,0,Map("TV" -> 4, "Blender" -> 2)),(2,1,1,Map("Cloths" -> 4)),(3,0,10,Map("TV" -> 4)))
val df = data.toDF("customerId","totalPurchase","totalRent","itemTypeCounts")
df.show(false)
df.createOrReplaceTempView("df")
+----------+-------------+---------+-----------------------+
|customerId|totalPurchase|totalRent|itemTypeCounts         |
+----------+-------------+---------+-----------------------+
|1         |17           |0        |[TV -> 4, Blender -> 2]|
|2         |1            |1        |[Cloths -> 4]          |
|3         |0            |10       |[TV -> 4]              |
+----------+-------------+---------+-----------------------+
Assuming that we know the distinct itemType beforehand, we can use
val dfr = spark.sql("""
select * from (
select explode(itemTypeCounts) itemTypeCounts from (
select flatten(collect_list(map_keys(itemTypeCounts))) itemTypeCounts from df
) ) t
pivot ( count(itemTypeCounts) as c3
for itemTypeCounts in ('TV' ,'Blender' ,'Cloths') )
""")
dfr.show(false)
+---+-------+------+
|TV |Blender|Cloths|
+---+-------+------+
|2  |1      |1     |
+---+-------+------+
For renaming columns,
dfr.select(dfr.columns.map( x => col(x).alias("itemTypeCounts_" + x )):_* ).show(false)
+-----------------+----------------------+---------------------+
|itemTypeCounts_TV|itemTypeCounts_Blender|itemTypeCounts_Cloths|
+-----------------+----------------------+---------------------+
|2                |1                     |1                    |
+-----------------+----------------------+---------------------+
To get the distinct itemType values dynamically and pass them to pivot:
val item_count_arr = spark.sql(""" select array_distinct(flatten(collect_list(map_keys(itemTypeCounts)))) itemTypeCounts from df """).as[Array[String]].first
item_count_arr: Array[String] = Array(TV, Blender, Cloths)
spark.sql(s"""
select * from (
select explode(itemTypeCounts) itemTypeCounts from (
select flatten(collect_list(map_keys(itemTypeCounts))) itemTypeCounts from df
) ) t
pivot ( count(itemTypeCounts) as c3
for itemTypeCounts in (${item_count_arr.map(c => "'"+c+"'").mkString(",")}) )
""").show(false)
+---+-------+------+
|TV |Blender|Cloths|
+---+-------+------+
|2  |1      |1     |
+---+-------+------+