Scala Spark - empty map on DataFrame column for map(String, Int) - scala

I am joining two DataFrames, where there are columns of a type Map[String, Int]
I want the merged DF to have an empty map [] and not null on the Map type columns.
val df = dfmerged.
.select("id"),
coalesce(col("map_1"), lit(null).cast(MapType(StringType, IntType))).alias("map_1"),
coalesce(col("map_2"), lit(Map.empty[String, Int])).alias("map_2")
for a map_1 column, a null will be inserted, but I'd like to have an empty map
map_2 is giving me an error:
java.lang.RuntimeException: Unsupported literal type class
scala.collection.immutable.Map$EmptyMap$ Map()
I've also tried with an udf function like:
case class myStructMap(x:Map[String, Int])
val emptyMap = udf(() => myStructMap(Map.empty[String, Int]))
also did not work.
when I try something like:
.select( coalesce(col("myMapCol"), lit(map())).alias("brand_viewed_count")...
or
.select(coalesce(col("myMapCol"), lit(map().cast(MapType(LongType, LongType)))).alias("brand_viewed_count")...
I get the error:
cannot resolve 'map()' due to data type mismatch: cannot cast
MapType(NullType,NullType,false) to MapType(LongType,IntType,true);

In Spark 2.2
import org.apache.spark.sql.functions.typedLit
val df = Seq((1L, null), (2L, Map("foo" -> "bar"))).toDF("id", "map")
df.withColumn("map", coalesce($"map", typedLit(Map[String, Int]()))).show
// +---+-----------------+
// | id| map|
// +---+-----------------+
// | 1| Map()|
// | 2|Map(foobar -> 42)|
// +---+-----------------+
Before
df.withColumn("map", coalesce($"map", map().cast("map<string,int>"))).show
// +---+-----------------+
// | id| map|
// +---+-----------------+
// | 1| Map()|
// | 2|Map(foobar -> 42)|
// +---+-----------------+

Related

Is it possible to apply when.otherwise functions within agg after a groupBy?

Been recently trying to apply a default function to aggregated values that were being calculated so that I didn't have to reprocess them afterwards. As far as I see I'm getting the following error.
Caused by: java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
From the following function.
val defaultUDF: UserDefinedFunction = udf[Column, Column, Any](defaultFunction)
def defaultFunction(col: Column, value: Any): Column = {
when(col.equalTo(value), null).otherwise(col)
}
And applying it the following way.
val initialDF = Seq(
("a", "b", 1),
("a", "b", null),
("a", null, 0)
).toDF("field1", "field2", "field3")
initialDF
.groupBy("field1", "field2")
.agg(
defaultUDF(functions.count("field3"), lit(0)).as("counter") // exception thrown here
)
Am I trying to do black magic in here or is it something that I may be missing?
The issue is in the implementation of your UserDefinedFunction:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
def defaultFunction(col: Column, value: Any): Column = {
when(col.equalTo(value), null).otherwise(col)
}
val defaultUDF: UserDefinedFunction = udf[Column, Column, Any](defaultFunction)
// java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Column is not supported
// at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:789)
// at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$schemaFor$1.apply(ScalaReflection.scala:724)
// at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:56)
// at org.apache.spark.sql.catalyst.ScalaReflection$class.cleanUpReflectionObjects(ScalaReflection.scala:906)
// at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:46)
// at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:723)
// at org.apache.spark.sql.catalyst.ScalaReflection$.schemaFor(ScalaReflection.scala:720)
// at org.apache.spark.sql.functions$.udf(functions.scala:3914)
// ... 65 elided
The error you're getting is basically because Spark cannot map the return type (i.e. Column) of your UserDefinedFunction defaultFunction to a Spark DataType.
Your defaultFunction has to accept and return Scala types that correspond with a Spark DataType. You can find the list of supported Scala types here: https://spark.apache.org/docs/latest/sql-reference.html#data-types
In any case, you don't need a UserDefinedFunction if your function takes Columns and returns a Column. For your use-case, the following code will work:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
case class Record(field1: String, field2: String, field3: java.lang.Integer)
val df = Seq(
Record("a", "b", 1),
Record("a", "b", null),
Record("a", null, 0)
).toDS
df.show
// +------+------+------+
// |field1|field2|field3|
// +------+------+------+
// | a| b| 1|
// | a| b| null|
// | a| null| 0|
// +------+------+------+
def defaultFunction(col: Column, value: Any): Column = {
when(col.equalTo(value), null).otherwise(col)
}
df
.groupBy("field1", "field2")
.agg(defaultFunction(count("field3"), lit(0)).as("counter"))
.show
// +------+------+-------+
// |field1|field2|counter|
// +------+------+-------+
// | a| b| 1|
// | a| null| 1|
// +------+------+-------+

Pass Spark SQL function name as parameter in Scala

I am trying to pass a Spark SQL function name to my defined function in Scala.
I am trying to get same functionality as:
myDf.agg(max($"myColumn"))
my attempt doesn't work:
def myFunc(myDf: DataFrame, myParameter: String): Dataframe = {
myDf.agg(myParameter($"myColumn"))
}
Obviously it shouldn't work as I'm providing a string type I am unable to find a way to make it work.
Is it even possible?
Edit:
I have to provide sql function name (and it can be other aggregate function) as parameter when calling my function.
myFunc(anyDf, max) or myFunc(anyDf, "max")
agg also takes a Map[String,String] which allows to do what you want:
def myFunc(myDf: DataFrame, myParameter: String): DataFrame = {
myDf.agg(Map("myColumn"->myParameter))
}
example:
val df = Seq(1.0,2.0,3.0).toDF("myColumn")
myFunc(df,"avg")
.show()
gives:
+-------------+
|avg(myColumn)|
+-------------+
| 2.0|
+-------------+
Try this:
import org.apache.spark.sql.{Column, DataFrame}
val df = Seq((1, 2, 12),(2, 1, 21),(1, 5, 10),(5, 3, 9),(2, 5, 4)).toDF("a","b","c")
def myFunc(df: DataFrame, f: Column): DataFrame = {
df.agg(f)
}
myFunc(df, max(col("a"))).show
+------+
|max(a)|
+------+
| 5|
+------+
Hope it helps!

Conditional aggregation in scala based on data type

How do you aggregate dynamically in scala spark based on data types?
For example:
SELECT ID, SUM(when DOUBLE type)
, APPEND(when STRING), MAX(when BOOLEAN)
from tbl
GROUP BY ID
Sample data
You can do this by getting the runtime schema matching on the datatype, example :
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._
val df =Seq(
(1, 1.0, true, "a"),
(1, 2.0, false, "b")
).toDF("id","d","b","s")
val dataTypes: Map[String, DataType] = df.schema.map(sf => (sf.name,sf.dataType)).toMap
def genericAgg(c:String) = {
dataTypes(c) match {
case DoubleType => sum(col(c))
case StringType => concat_ws(",",collect_list(col(c))) // "append"
case BooleanType => max(col(c))
}
}
val aggExprs: Seq[Column] = df.columns.filterNot(_=="id") // use all
.map(c => genericAgg(c))
df
.groupBy("id")
.agg(
aggExprs.head,aggExprs.tail:_*
)
.show()
gives
+---+------+------+-----------------------------+
| id|sum(d)|max(b)|concat_ws(,, collect_list(s))|
+---+------+------+-----------------------------+
| 1| 3.0| true| a,b|
+---+------+------+-----------------------------+

RDD[Array[String]] to Dataframe

I am new to Spark and Hive and my goal is to load a delimited(lets say csv) to Hive table. After a bit of reading I found out that the path to load the data into Hive is csv->dataframe->Hive.(Please correct me if I am wrong).
CSV:
1,Alex,70000,Columbus
2,Ryan,80000,New York
3,Johny,90000,Banglore
4,Cook, 65000,Glasgow
5,Starc, 70000,Aus
I read the csv file be using below command:
val csv =sc.textFile("employee_data.txt").map(line => line.split(",").map(elem => elem.trim))
csv: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[29] at map at <console>:39
Now I am trying to convert this RDD to Dataframe and using below code:
scala> val df = csv.map { case Array(s0, s1, s2, s3) => employee(s0, s1, s2, s3) }.toDF()
df: org.apache.spark.sql.DataFrame = [eid: string, name: string, salary: string, destination: string]
employee is a case class and I am using it as a schema definition.
case class employee(eid: String, name: String, salary: String, destination: String)
However when I do df.show I am getting below error:
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 10.0 failed 4 times, most recent failure: Lost task
0.3 in stage 10.0 (TID 22, user.hostname): scala.MatchError: [Ljava.lang.String;#88ba3cb (of class
[Ljava.lang.String;)
I was expecting a dataframe as a output. I know why I might be getting this error because the values in RDD are stored in Ljava.lang.String;#88ba3cb format and I need to use mkString to get the actual values but I am not able to find how to do it. I appreciate your time.
If you fix your case class then it should work:
scala> case class employee(eid: String, name: String, salary: String, destination: String)
defined class employee
scala> val txtRDD = sc.textFile("data.txt").map(line => line.split(",").map(_.trim))
txtRDD: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[30] at map at <console>:24
scala> txtRDD.map{case Array(s0, s1, s2, s3) => employee(s0, s1, s2, s3)}.toDF.show
+---+-----+------+-----------+
|eid| name|salary|destination|
+---+-----+------+-----------+
| 1| Alex| 70000| Columbus|
| 2| Ryan| 80000| New York|
| 3|Johny| 90000| Banglore|
| 4| Cook| 65000| Glasgow|
| 5|Starc| 70000| Aus|
+---+-----+------+-----------+
Otherwise you could convert the String to an Int:
scala> case class employee(eid: Int, name: String, salary: String, destination: String)
defined class employee
scala> val df = txtRDD.map{case Array(s0, s1, s2, s3) => employee(s0.toInt, s1, s2, s3)}.toDF
df: org.apache.spark.sql.DataFrame = [eid: int, name: string ... 2 more fields]
scala> df.show
+---+-----+------+-----------+
|eid| name|salary|destination|
+---+-----+------+-----------+
| 1| Alex| 70000| Columbus|
| 2| Ryan| 80000| New York|
| 3|Johny| 90000| Banglore|
| 4| Cook| 65000| Glasgow|
| 5|Starc| 70000| Aus|
+---+-----+------+-----------+
However the best solution would be to use spark-csv (which would treat the salary as an Int as well).
Also note that the error was thrown when you ran df.show because everything was being lazily evaluated up until that point. df.show is an action which will cause all of the queued transformations to execute (see this article for more).
Use map on array elements, not on array:
val csv = sc.textFile("employee_data.txt")
.map(line => line
.split(",")
.map(e => e.map(_.trim))
)
val df = csv.map { case Array(s0, s1, s2, s3) => employee(s0, s1, s2, s3) }.toDF()
But, why you are reading CSV and then converting RDD to DF? Spark 1.5 already can read CSV via spark-csv package:
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true")
.option("inferSchema", "true")
.option("delimiter", ";")
.load("employee_data.txt")
As you said in your comment, your case class employee, which should be named Employee, receives an Int as first argument of its constructor, but you are passing a String. Thus, you should convert it to an Int before instantiating or modify your case defining eid as a String.

How to fill the null value in dataframe to uuid?

There is a dataframe with null values in one column(not all being null), it need to fill the null value with uuid, is there a way?
cala> val df = Seq(("stuff2",null,null), ("stuff2",null,Array("value1","value2")),("stuff3","stuff3",null)).toDF("field","field2","values")
df: org.apache.spark.sql.DataFrame = [field: string, field2: string, values: array<string>]
scala> df.show
+------+------+----------------+
| field|field2| values|
+------+------+----------------+
|stuff2| null| null|
|stuff2| null|[value1, value2]|
|stuff3|stuff3| null|
+------+------+----------------+
I tried this way, but each row of the "field2" has the same uuid.
scala> val fillDF = df.na.fill(java.util.UUID.randomUUID().toString(), Seq("field2"))
fillDF: org.apache.spark.sql.DataFrame = [field: string, field2: string, values: array<string>]
scala> fillDF.show
+------+--------------------+----------------+
| field| field2| values|
+------+--------------------+----------------+
|stuff2|d007ffae-9134-4ac...| null|
|stuff2|d007ffae-9134-4ac...|[value1, value2]|
|stuff3| stuff3| null|
+------+--------------------+----------------+
How to make it? in case there is more than 1,000,000 rows
You can do it using UDF and coalesce like below.
import org.apache.spark.sql.functions.udf
val arr = udf(() => java.util.UUID.randomUUID().toString())
val df2 = df.withColumn("field2", coalesce(df("field2"), arr()))
df2.show()
You will get different UUID like below.
+------+--------------------+----------------+
| field| field2| values|
+------+--------------------+----------------+
|stuff2|fda6bc42-1265-407...| null|
|stuff2|3fa74767-abd7-405...|[value1, value2]|
|stuff3| stuff3| null|
+------+--------------------+----------------+
You can easily do this by using UDF , it can be something like this :
def generateUUID(value: String):String = {
import java.util.UUID
if (Option(value).isDefined) {
value
}
else {
UUID.randomUUID().toString
}
val funcUDF = generateUUID _
val generateUUID = udf(funcUDF)
Now pass the fillDF accrodingly:
fillDF.withColumns("field2",generateUUID(fillDF("field2"))).show
P.S: The code is not tested but it should work !