Nulls in spark.createDataset - scala

I'm writing a unit test in which the test data needs to have some null values.
I tried putting the nulls straight into the tuples and I also tried using Options, but neither worked.
Here is my code:
import sparkSession.implicits._
// Data set with null for even values
val sampleData = sparkSession.createDataset(Seq(
  (1, Some("Yes"), None),
  (2, None, None),
  (3, Some("Okay"), None),
  (4, None, None)))
  .toDF("id", "title", "value")
Stack trace:
None.type (of class scala.reflect.internal.Types$UniqueSingleType)
scala.MatchError: None.type (of class scala.reflect.internal.Types$UniqueSingleType)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:472)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:596)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$9.apply(ScalaReflection.scala:587)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:252)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:252)
at scala.collection.immutable.List.flatMap(List.scala:344)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$serializerFor(ScalaReflection.scala:587)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerFor(ScalaReflection.scala:425)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:71)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:49)

The problem is that the third tuple element is inferred as the singleton type None.type, for which Spark cannot derive an encoder. You should use None: Option[String] instead of a bare None:
scala> val maybeString = None: Option[String]
maybeString: Option[String] = None
scala> val sampleData = spark.createDataset(Seq(
| (1, Some("Yes"), maybeString),
| (2, maybeString, maybeString),
| (3, Some("Okay"), maybeString),
| (4, maybeString, maybeString))).toDF("id", "title", "value")
sampleData: org.apache.spark.sql.DataFrame = [id: int, title: string ... 1 more field]
scala> sampleData.show
+---+-----+-----+
| id|title|value|
+---+-----+-----+
| 1| Yes| null|
| 2| null| null|
| 3| Okay| null|
| 4| null| null|
+---+-----+-----+
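Equivalently, you can pin down the element type of the Seq itself so that every bare None is typed as Option[String] (a minimal sketch of the same fix, not from the original answer):
// Explicit element type lets None be inferred as Option[String]
val sampleData = sparkSession.createDataset(Seq[(Int, Option[String], Option[String])](
  (1, Some("Yes"), None),
  (2, None, None),
  (3, Some("Okay"), None),
  (4, None, None)))
  .toDF("id", "title", "value")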

Alternatively, if you're just dealing with Strings, you can use null.asInstanceOf[String]:
scala> val df1 = sc.parallelize(Seq((1, "Yes", null.asInstanceOf[String]),
| (2, null.asInstanceOf[String], null.asInstanceOf[String]),
| (3, "Okay", null.asInstanceOf[String]),
| (4, null.asInstanceOf[String], null.asInstanceOf[String]))).toDF("id", "title", "value")
df1: org.apache.spark.sql.DataFrame = [id: int, title: string, value: string]
scala> df1.show
+---+-----+-----+
| id|title|value|
+---+-----+-----+
| 1| Yes| null|
| 2| null| null|
| 3| Okay| null|
| 4| null| null|
+---+-----+-----+
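Another option worth mentioning (a sketch, not part of the original answers): define a case class with Option fields. The encoder is then derived from the declared field types, so bare None values are fine and the column names come from the case class:
case class Sample(id: Int, title: Option[String], value: Option[String])

val sampleData = sparkSession.createDataset(Seq(
  Sample(1, Some("Yes"), None),
  Sample(2, None, None),
  Sample(3, Some("Okay"), None),
  Sample(4, None, None)))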

Related

Dynamically apply aggregate function in SPARK data frame

How do I parameterize the Spark function below? The groupBy and pivot values are constant; it is the aggregations and their aliases that I need to parameterize:
var final_df_transpose=df_transpose.groupBy("_id").pivot("Type").agg(first("Value").alias("Value"),first("OType").alias("OType"),first("DateTime").alias("DateTime"))
I am unable to pass the aliases dynamically in the above scenario. This is what I have tried, building a map of column -> aggregate function:
var agg_Map = collection.mutable.Map[String, String]()
for (aggDataCol <- fin_agg_col) {
  agg_Map += (aggDataCol -> "first")
}
// agg_Map: scala.collection.mutable.Map[String,String] = Map(OType -> first, Type -> first, Value -> first, DateTime -> first)
df_transpose.groupBy("_id").pivot("Type").agg(agg_Map.toMap).show
I can think of two ways, but I'm not happy with either.
In the first, define your aggregates as a list of Columns. The annoying thing here is that to satisfy the method signature, you need to add a dummy column and then remove it after the aggregation:
scala> val in = spark.read.option("header", true).csv("""_id,Type,Value,OType,DateTime
| 0,a,b,c,d
| 1,aaa,bbb,ccc,ddd""".split("\n").toSeq.toDS)
in: org.apache.spark.sql.DataFrame = [_id: string, Type: string ... 3 more fields]
scala> in.show
+---+----+-----+-----+--------+
|_id|Type|Value|OType|DateTime|
+---+----+-----+-----+--------+
| 0| a| b| c| d|
| 1| aaa| bbb| ccc| ddd|
+---+----+-----+-----+--------+
scala> val aggColumns = Seq("Value", "OType", "DateTime").map{c => first(c).alias(c)}
aggColumns: Seq[org.apache.spark.sql.Column] = List(first(Value, false) AS `Value`, first(OType, false) AS `OType`, first(DateTime, false) AS `DateTime`)
scala> val df_intermediate = in.groupBy("_id").pivot("Type").agg(lit("dummy"), aggColumns : _*)
df_intermediate: org.apache.spark.sql.DataFrame = [_id: string, a_dummy: string ... 7 more fields]
scala> df_intermediate.show
+---+-------+-------+-------+----------+---------+---------+---------+------------+
|_id|a_dummy|a_Value|a_OType|a_DateTime|aaa_dummy|aaa_Value|aaa_OType|aaa_DateTime|
+---+-------+-------+-------+----------+---------+---------+---------+------------+
| 0| dummy| b| c| d| dummy| null| null| null|
| 1| dummy| null| null| null| dummy| bbb| ccc| ddd|
+---+-------+-------+-------+----------+---------+---------+---------+------------+
scala> val df_final = df_intermediate.drop(df_intermediate.schema.collect{case c if c.name.endsWith("_dummy") => c.name} : _*)
df_final: org.apache.spark.sql.DataFrame = [_id: string, a_Value: string ... 5 more fields]
scala> df_final.show
+---+-------+-------+----------+---------+---------+------------+
|_id|a_Value|a_OType|a_DateTime|aaa_Value|aaa_OType|aaa_DateTime|
+---+-------+-------+----------+---------+---------+------------+
| 0| b| c| d| null| null| null|
| 1| null| null| null| bbb| ccc| ddd|
+---+-------+-------+----------+---------+---------+------------+
In the second, continue using a Map of agg expressions, then use a regex to find the renamed columns and change them back:
scala> val aggExprs = Map(("OType" -> "first"), ("Value" -> "first"), "DateTime" -> "first")
aggExprs: scala.collection.immutable.Map[String,String] = Map(OType -> first, Value -> first, DateTime -> first)
scala> val df_intermediate = in.groupBy("_id").pivot("Type").agg(aggExprs)
df_intermediate: org.apache.spark.sql.DataFrame = [_id: string, a_first(OType, false): string ... 5 more fields]
scala> df_intermediate.show
+---+---------------------+---------------------+------------------------+-----------------------+-----------------------+--------------------------+
|_id|a_first(OType, false)|a_first(Value, false)|a_first(DateTime, false)|aaa_first(OType, false)|aaa_first(Value, false)|aaa_first(DateTime, false)|
+---+---------------------+---------------------+------------------------+-----------------------+-----------------------+--------------------------+
| 0| c| b| d| null| null| null|
| 1| null| null| null| ccc| bbb| ddd|
+---+---------------------+---------------------+------------------------+-----------------------+-----------------------+--------------------------+
scala> val regex = "^(.*)_first\\((.*), false\\)$".r
regex: scala.util.matching.Regex = ^(.*)_first\((.*), false\)$
scala> val replacements = df_intermediate.schema.collect{ case c if regex.findFirstMatchIn(c.name).isDefined =>
| val regex(pivotVal, colName) = c.name
| c.name -> s"${pivotVal}_$colName"
| }.toMap
replacements: scala.collection.immutable.Map[String,String] = Map(a_first(DateTime, false) -> a_DateTime, aaa_first(DateTime, false) -> aaa_DateTime, aaa_first(OType, false) -> aaa_OType, a_first(Value, false) -> a_Value, a_first(OType, false) -> a_OType, aaa_first(Value, false) -> aaa_Value)
scala> val df_final = replacements.foldLeft(df_intermediate){(df, c) => df.withColumnRenamed(c._1, c._2)}
df_final: org.apache.spark.sql.DataFrame = [_id: string, a_OType: string ... 5 more fields]
scala> df_final.show
+---+-------+-------+----------+---------+---------+------------+
|_id|a_OType|a_Value|a_DateTime|aaa_OType|aaa_Value|aaa_DateTime|
+---+-------+-------+----------+---------+---------+------------+
| 0| c| b| d| null| null| null|
| 1| null| null| null| ccc| bbb| ddd|
+---+-------+-------+----------+---------+---------+------------+
Take your pick, but both involve some unneeded steps.
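A third possibility, sketched here as a suggestion rather than something from the answer above: build the aliased aggregate Columns directly from the asker's agg_Map with expr, which avoids both the dummy column and the rename pass (assuming agg_Map is non-empty):
import org.apache.spark.sql.functions.expr

// Turn each (column -> function) entry into e.g. first(`Value`) AS `Value`
val aggColumns = agg_Map.toSeq.map { case (colName, fn) => expr(s"$fn(`$colName`)").alias(colName) }
val final_df_transpose = df_transpose.groupBy("_id").pivot("Type")
  .agg(aggColumns.head, aggColumns.tail: _*)
After the pivot the columns come out as <pivotValue>_<alias>, e.g. a_Value, just as in the first approach.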

Spark Scala row-wise average by handling null

I have a dataframe with a high volume of data and "n" columns.
df_avg_calc: org.apache.spark.sql.DataFrame = [col1: double, col2: double ... 4 more fields]
+------------------+-----------------+------------------+-----------------+-----+-----+
| col1| col2| col3| col4| col5| col6|
+------------------+-----------------+------------------+-----------------+-----+-----+
| null| null| null| null| null| null|
| 14.0| 5.0| 73.0| null| null| null|
| null| null| 28.25| null| null| null|
| null| null| null| null| null| null|
|33.723333333333336|59.78999999999999|39.474999999999994|82.09666666666666|101.0|53.43|
| 26.25| null| null| 2.0| null| null|
| null| null| null| null| null| null|
| 54.46| 89.475| null| null| null| null|
| null| 12.39| null| null| null| null|
| null| 58.0| 19.45| 1.0| 1.33|158.0|
+------------------+-----------------+------------------+-----------------+-----+-----+
I need to compute a row-wise average, keeping in mind that cells containing "null" must not be counted in the average.
This needs to be implemented in Spark / Scala.
What I have tried so far:
By referring to: Calculate row mean, ignoring NAs in Spark Scala
val df = df_raw.schema.fieldNames.filter(f => f.contains("colname"))
val rowMeans = df_raw.select(df.map(f => col(f)).reduce(_ + _) / lit(df.length) as "row_mean")
Here df_raw contains the columns that need to be aggregated (row-wise, of course). There are more than 80 columns, containing data and nulls arbitrarily, and the null count needs to be excluded from the denominator when calculating the average. This works fine when all the columns contain data, but even a single null in a column makes the result null.
Edit:
I've tried to adapt this answer by Terry Dactyl:
def average(l: Seq[Double]): Option[Double] = {
  val nonNull = l.flatMap(i => Option(i))
  if (nonNull.isEmpty) None else Some(nonNull.reduce(_ + _).toDouble / nonNull.size.toDouble)
}
val avgUdf = udf(average(_: Seq[Double]))
val rowAvgDF = df_avg_calc.select(avgUdf(array($"col1",$"col2",$"col3",$"col4",$"col5",$"col6")).as("row_avg"))
rowAvgDF: org.apache.spark.sql.DataFrame = [row_avg: double]
rowAvgDF.show(10,false)
+------------------+
|row_avg |
+------------------+
|0.0 |
|15.333333333333334|
|4.708333333333333 |
|0.0 |
|61.58583333333333 |
|4.708333333333333 |
|0.0 |
|23.989166666666666|
|2.065 |
|39.63 |
+------------------+
Spark >= 2.4
It is possible to use aggregate:
val row_mean = expr("""aggregate(
  CAST(array(_1, _2, _3) AS array<double>),
  -- Initial value
  -- Note that aggregate is picky about types
  CAST((0.0 as sum, 0.0 as n) AS struct<sum: double, n: double>),
  -- Merge function
  (acc, x) -> (
    acc.sum + coalesce(x, 0.0),
    acc.n + CASE WHEN x IS NULL THEN 0.0 ELSE 1.0 END),
  -- Finalize function
  acc -> acc.sum / acc.n)""")
Usage:
df.withColumn("row_mean", row_mean).show
Result:
+----+----+----+--------+
| _1| _2| _3|row_mean|
+----+----+----+--------+
|null|null|null| null|
| 2.0|null|null| 2.0|
|50.0|34.0|null| 42.0|
| 1.0| 2.0| 3.0| 2.0|
+----+----+----+--------+
Version independent
Compute sum and count of NOT NULL columns and divide one over another:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._
def row_mean(cols: Column*) = {
  // Sum of values ignoring nulls
  val sum = cols
    .map(c => coalesce(c, lit(0)))
    .foldLeft(lit(0))(_ + _)
  // Count of not null values
  val cnt = cols
    .map(c => when(c.isNull, 0).otherwise(1))
    .foldLeft(lit(0))(_ + _)
  sum / cnt
}
Example data:
val df = Seq(
  (None, None, None),
  (Some(2.0), None, None),
  (Some(50.0), Some(34.0), None),
  (Some(1.0), Some(2.0), Some(3.0))
).toDF
Result:
df.withColumn("row_mean", row_mean($"_1", $"_2", $"_3")).show
+----+----+----+--------+
| _1| _2| _3|row_mean|
+----+----+----+--------+
|null|null|null| null|
| 2.0|null|null| 2.0|
|50.0|34.0|null| 42.0|
| 1.0| 2.0| 3.0| 2.0|
+----+----+----+--------+
def average(l: Seq[Integer]): Option[Double] = {
  val nonNull = l.flatMap(i => Option(i))
  if (nonNull.isEmpty) None else Some(nonNull.reduce(_ + _).toDouble / nonNull.size.toDouble)
}
val avgUdf = udf(average(_: Seq[Integer]))
val df = List((Some(1),Some(2)), (Some(1), None), (None, None)).toDF("a", "b")
val avgDf = df.select(avgUdf(array(df.schema.map(c => col(c.name)): _*)).as("average"))
avgDf.collect
res0: Array[org.apache.spark.sql.Row] = Array([1.5], [1.0], [null])
Testing on the data you supplied gives the correct result:
val df = List(
  (Some(10), Some(5), Some(5), None, None),
  (None, Some(5), Some(5), None, Some(5)),
  (Some(2), Some(8), Some(5), Some(1), Some(2)),
  (None, None, None, None, None)
).toDF("col1", "col2", "col3", "col4", "col5")
Array[org.apache.spark.sql.Row] = Array([6.666666666666667], [5.0], [3.6], [null])
Note if you have columns you do not want included make sure they are filtered when populating the array passed to the UDF.
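For example, a minimal sketch of that filtering (the excluded column names here are purely hypothetical):
import org.apache.spark.sql.functions.{array, col}

// Hypothetical: skip identifier-style columns that should not be averaged
val excluded = Set("id", "label")
val inputCols = df.schema.fieldNames.filterNot(excluded.contains).map(col)
val avgDf = df.select(avgUdf(array(inputCols: _*)).as("average"))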
Finally:
val df = List(
(Some(14), Some(5), Some(73), None.asInstanceOf[Option[Integer]], None.asInstanceOf[Option[Integer]], None.asInstanceOf[Option[Integer]])
).toDF("col1", "col2", "col3", "col4", "col5", "col6")
Array[org.apache.spark.sql.Row] = Array([30.666666666666668])
Which again is the correct result.
If you want to use Doubles, the UDF parameter needs to be Seq[java.lang.Double] rather than Seq[Double]: with the primitive Scala Double, SQL nulls are converted to 0.0 before the UDF sees them, which is why the earlier attempt returned 0.0 for the all-null rows.
def average(l: Seq[java.lang.Double]): Option[java.lang.Double] = {
  val nonNull = l.flatMap(i => Option(i))
  if (nonNull.isEmpty) None else Some(nonNull.reduce(_ + _) / nonNull.size.toDouble)
}
val avgUdf = udf(average(_: Seq[java.lang.Double]))
val df = List(
  (Some(14.0), Some(5.0), Some(73.0), None.asInstanceOf[Option[java.lang.Double]], None.asInstanceOf[Option[java.lang.Double]], None.asInstanceOf[Option[java.lang.Double]])
).toDF("col1", "col2", "col3", "col4", "col5", "col6")
val avgDf = df.select(avgUdf(array(df.schema.map(c => col(c.name)): _*)).as("average"))
avgDf.collect
Array[org.apache.spark.sql.Row] = Array([30.666666666666668])

drop duplicate words in long string using scala

I am curious to learn how to drop duplicate words within strings contained in a dataframe column. I would like to accomplish this using Scala.
By way of example, below you can find a dataframe I would like to transform.
dataframe:
val dataset1 = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
+----+-------+---+
|KEY1| KEY2| ID|
+----+-------+---+
| 66|a,b,c,a| 4|
| 67|a,f,g,t| 0|
| 70|b,b,b,d| 4|
+----+-------+---+
result:
+----+----------+---+
|KEY1| KEY2| ID|
+----+----------+---+
| 66| a, b, c| 4|
| 67|a, f, g, t| 0|
| 70| b, d| 4|
+----+----------+---+
Using PySpark I have used the following code to get the above result, but I could not rewrite it in Scala. Do you have any suggestions? Thanking you in advance, I wish you a nice day.
PySpark code:
# dataframe
l = [("66", "a,b,c,a", "4"),("67", "a,f,g,t", "0"),("70", "b,b,b,d", "4")]
#spark.createDataFrame(l).show()
df1 = spark.createDataFrame(l, ['KEY1', 'KEY2','ID'])
# function
import re
import numpy as np
# drop duplicates in a row
def drop_duplicates(row):
    # split string by ',', drop duplicates and join back
    words = re.split(',', row)
    return ', '.join(np.unique(words))
# drop duplicates
from pyspark.sql.functions import udf
drop_duplicates_udf = udf(drop_duplicates)
dataset2 = df1.withColumn('KEY2', drop_duplicates_udf(df1.KEY2))
dataset2.show()
Dataframe solution
scala> val df = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
df: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> val distinct :String => String = _.split(",").toSet.mkString(",")
distinct: String => String = <function1>
scala> val distinct_id = udf (distinct)
distinct_id: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(StringType)))
scala> df.select('key1,distinct_id('key2).as("distinct"),'id).show
+----+--------+---+
|key1|distinct| id|
+----+--------+---+
| 66| a,b,c| 4|
| 67| a,f,g,t| 0|
| 70| b,d| 4|
+----+--------+---+
scala>
There could be a more optimized solution but this could help you.
val rdd2 = dataset1.rdd.map(x => x(1).toString.split(",").distinct.mkString(", "))
// and then transform it to dataset
// or
val distinctUDF = spark.udf.register("distinctUDF", (s: String) => s.split(",").distinct.mkString(", "))
dataset1.createTempView("dataset1")
spark.sql("Select KEY1, distinctUDF(KEY2), ID from dataset1").show
import org.apache.spark.sql._
val dfUpdated = dataset1.rdd.map{
case Row(x: String, y: String,z:String) => (x,y.split(",").distinct.mkString(", "),z)
}.toDF(dataset1.columns:_*)
In spark-shell:
scala> val dataset1 = Seq(("66", "a,b,c,a", "4"), ("67", "a,f,g,t", "0"), ("70", "b,b,b,d", "4")).toDF("KEY1", "KEY2", "ID")
dataset1: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> dataset1.show
+----+-------+---+
|KEY1| KEY2| ID|
+----+-------+---+
| 66|a,b,c,a| 4|
| 67|a,f,g,t| 0|
| 70|b,b,b,d| 4|
+----+-------+---+
scala> val dfUpdated = dataset1.rdd.map{
case Row(x: String, y: String,z:String) => (x,y.split(",").distinct.mkString(", "),z)
}.toDF(dataset1.columns:_*)
dfUpdated: org.apache.spark.sql.DataFrame = [KEY1: string, KEY2: string ... 1 more field]
scala> dfUpdated.show
+----+----------+---+
|KEY1| KEY2| ID|
+----+----------+---+
| 66| a, b, c| 4|
| 67|a, f, g, t| 0|
| 70| b, d| 4|
+----+----------+---+
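On Spark 2.4+ the same result can also be obtained without any UDF, using only built-in functions (a sketch, with the column names from the question; in spark-shell the needed implicits are already in scope):
import org.apache.spark.sql.functions.{array_distinct, array_join, split}

// Split on commas, drop duplicate tokens, and join back with ", "
val result = dataset1.withColumn("KEY2", array_join(array_distinct(split($"KEY2", ",")), ", "))
result.show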

Select column by name with multiple aggregate columns after pivot with Spark Scala

I am trying to aggregate multiple columns after a pivot in Scala Spark 2.0.1:
scala> val df = List((1, 2, 3, None), (1, 3, 4, Some(1))).toDF("a", "b", "c", "d")
df: org.apache.spark.sql.DataFrame = [a: int, b: int ... 2 more fields]
scala> df.show
+---+---+---+----+
| a| b| c| d|
+---+---+---+----+
| 1| 2| 3|null|
| 1| 3| 4| 1|
+---+---+---+----+
scala> val pivoted = df.groupBy("a").pivot("b").agg(max("c"), max("d"))
pivoted: org.apache.spark.sql.DataFrame = [a: int, 2_max(`c`): int ... 3 more fields]
scala> pivoted.show
+---+----------+----------+----------+----------+
| a|2_max(`c`)|2_max(`d`)|3_max(`c`)|3_max(`d`)|
+---+----------+----------+----------+----------+
| 1| 3| null| 4| 1|
+---+----------+----------+----------+----------+
I am unable to select or rename those columns so far:
scala> pivoted.select("3_max(`d`)")
org.apache.spark.sql.AnalysisException: syntax error in attribute name: 3_max(`d`);
scala> pivoted.select("`3_max(`d`)`")
org.apache.spark.sql.AnalysisException: syntax error in attribute name: `3_max(`d`)`;
scala> pivoted.select("`3_max(d)`")
org.apache.spark.sql.AnalysisException: cannot resolve '`3_max(d)`' given input columns: [2_max(`c`), 3_max(`d`), a, 2_max(`d`), 3_max(`c`)];
There must be a simple trick here, any ideas? Thanks.
This seems like a bug; the backticks are causing the problem. One fix is to remove the backticks from the column names:
val pivotedNewName = pivoted.columns.foldLeft(pivoted)((df, col) =>
  df.withColumnRenamed(col, col.replace("`", "")))
Now you can select by column names as normal:
pivotedNewName.select("2_max(c)").show
+--------+
|2_max(c)|
+--------+
| 3|
+--------+
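A related workaround, not part of the answer above but worth noting as a sketch: alias each aggregate when calling agg, so the generated column names contain no backticks or parentheses in the first place and can be selected directly:
val pivoted = df.groupBy("a").pivot("b").agg(max("c").alias("c"), max("d").alias("d"))
// Columns are now named 2_c, 2_d, 3_c, 3_d
pivoted.select("3_d").show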

Spark - Csv data split with scala

test.csv
name,key1,key2
A,1,2
B,1,3
C,4,3
I want to change this data into the following (as a Dataset or RDD):
whatIwant.csv
name,key,newkeyname
A,1,KEYA
A,2,KEYB
B,1,KEYA
B,3,KEYB
C,4,KEYA
C,3,KEYB
I loaded the data with the read method:
val df = spark.read
.option("header", true)
.option("charset", "euc-kr")
.csv(csvFilePath)
I can load each dataset as (name, key1) or (name, key2) and combine them with union, but I want to do this within a single Spark session.
Any ideas?
These attempts are not working:
val df2 = df.select( df("TAG_NO"), df.map { x => (x.getAs[String]("MK_VNDRNM"), x.getAs[String]("WK_ORD_DT")) })
val df2 = df.select( df("TAG_NO"), Seq(df("TAG_NO"), df("WK_ORD_DT")))
This can be accomplished with explode and a udf:
scala> var df = Seq(("A", 1, 2), ("B", 1, 3), ("C", 4, 3)).toDF("name", "key1", "key2")
df: org.apache.spark.sql.DataFrame = [name: string, key1: int ... 1 more field]
scala> df.show
+----+----+----+
|name|key1|key2|
+----+----+----+
| A| 1| 2|
| B| 1| 3|
| C| 4| 3|
+----+----+----+
scala> val explodeUDF = udf((v1: String, v2: String) => Vector((v1, "Key1"), (v2, "Key2")))
explodeUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,ArrayType(StructType(StructField(_1,StringType,true), StructField(_2,StringType,true)),true),Some(List(StringType, StringType)))
scala> df = df.withColumn("TMP", explode(explodeUDF($"key1", $"key2"))).drop("key1", "key2")
df: org.apache.spark.sql.DataFrame = [name: string, TMP: struct<_1: string, _2: string>]
scala> df = df.withColumn("key", $"TMP".apply("_1")).withColumn("new key name", $"TMP".apply("_2"))
df: org.apache.spark.sql.DataFrame = [name: string, TMP: struct<_1: string, _2: string> ... 2 more fields]
scala> df = df.drop("TMP")
df: org.apache.spark.sql.DataFrame = [name: string, key: string ... 1 more field]
scala> df.show
+----+---+------------+
|name|key|new key name|
+----+---+------------+
| A| 1| Key1|
| A| 2| Key2|
| B| 1| Key1|
| B| 3| Key2|
| C| 4| Key1|
| C| 3| Key2|
+----+---+------------+
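If you'd rather avoid a UDF, here is a sketch of the same reshaping using only built-in functions, producing the KEYA/KEYB labels from the question (assuming a DataFrame df with the original name, key1, key2 columns and spark-shell implicits in scope):
import org.apache.spark.sql.functions.{array, explode, lit, struct}

// Build one struct per key column, explode into rows, then flatten the struct
val result = df.select($"name", explode(array(
    struct($"key1".as("key"), lit("KEYA").as("newkeyname")),
    struct($"key2".as("key"), lit("KEYB").as("newkeyname"))
  )).as("tmp"))
  .select($"name", $"tmp.key", $"tmp.newkeyname")
result.show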