How to create an array of structs in Spark Scala

I'm trying to create an array of struct(col, col) in a Spark DataFrame, but I'm getting an error. I'm using sample data here to reproduce the same error.
DataFrame:
val df = Seq((1, "One", "uno", true), (2, "Two", "Dos", true), (3, "Three", "Tres", false)).toDF("number", "English", "Spanish", "include_spanish")
scala> df.show
+------+-------+-------+---------------+
|number|English|Spanish|include_spanish|
+------+-------+-------+---------------+
|     1|    One|    uno|           true|
|     2|    Two|    Dos|           true|
|     3|  Three|   Tres|          false|
+------+-------+-------+---------------+
Now, here is where I try to create a struct out of the existing columns and then build an array from those structs.
val df1 = df.withColumn("numberToEnglish", struct(col("number"), col("English"))).withColumn("numberToSpanish", struct("number", "Spanish")).withColumn("numberToLanguage", when(col("include_spanish") === true, array("numberToEnglish", "numberToSpanish")).otherwise(array("numberToEnglish")))
I'm getting the error below:
org.apache.spark.sql.AnalysisException: cannot resolve 'array(`numberToEnglish`, `numberToSpanish`)' due to data type mismatch: input to function array should all be the same type, but it's [struct<number:int,English:string>, struct<number:int,Spanish:string>];;
'Project [number#200, English#201, Spanish#202, include_spanish#203, numberToEnglish#253, numberToSpanish#259, CASE WHEN (include_spanish#203 = true) THEN array(numberToEnglish#253, numberToSpanish#259) ELSE array(numberToEnglish#253) END AS numberToLanguage#266]
What would be the best way to achieve this?

In order for the array method to view struct($"number", $"English") and struct($"number", $"Spanish") as the same data type, you'll need to name the struct elements, as shown below:
val df = Seq(
  (1, "One", "uno", true), (2, "Two", "Dos", true), (3, "Three", "Tres", false)
).toDF("number", "English", "Spanish", "include_spanish")

df.
  withColumn("numberToEnglish", struct($"number".as("num"), $"English".as("lang"))).
  withColumn("numberToSpanish", struct($"number".as("num"), $"Spanish".as("lang"))).
  withColumn("numberToLanguage",
    when($"include_spanish", array($"numberToEnglish", $"numberToSpanish")).
      otherwise(array($"numberToEnglish"))
  ).
  show
// +------+-------+-------+---------------+---------------+---------------+--------------------+
// |number|English|Spanish|include_spanish|numberToEnglish|numberToSpanish|    numberToLanguage|
// +------+-------+-------+---------------+---------------+---------------+--------------------+
// |     1|    One|    uno|           true|       [1, One]|       [1, uno]|[[1, One], [1, uno]]|
// |     2|    Two|    Dos|           true|       [2, Two]|       [2, Dos]|[[2, Two], [2, Dos]]|
// |     3|  Three|   Tres|          false|     [3, Three]|      [3, Tres]|        [[3, Three]]|
// +------+-------+-------+---------------+---------------+---------------+--------------------+

Related

Spark Dataframe with pivot and different aggregation, based on the column value (measure_type) - Scala

I have a spark dataframe of this type:
scala> val data = Seq((1, "k1", "measureA", 2), (1, "k1", "measureA", 4), (1, "k1", "measureB", 5), (1, "k1", "measureB", 7), (1, "k1", "measureC", 7), (1, "k1", "measureC", 1), (2, "k1", "measureB", 8), (2, "k1", "measureC", 9), (2, "k2", "measureA", 5), (2, "k2", "measureC", 5), (2, "k2", "measureC", 8))
data: Seq[(Int, String, String, Int)] = List((1,k1,measureA,2), (1,k1,measureA,4), (1,k1,measureB,5), (1,k1,measureB,7), (1,k1,measureC,7), (1,k1,measureC,1), (2,k1,measureB,8), (2,k1,measureC,9), (2,k2,measureA,5), (2,k2,measureC,5), (2,k2,measureC,8))
scala> val rdd = spark.sparkContext.parallelize(data)
rdd: org.apache.spark.rdd.RDD[(Int, String, String, Int)] = ParallelCollectionRDD[22] at parallelize at <console>:27
scala> val df = rdd.toDF("ts","key","measure_type","value")
df: org.apache.spark.sql.DataFrame = [ts: int, key: string ... 2 more fields]
scala> df.show
+---+---+------------+-----+
| ts|key|measure_type|value|
+---+---+------------+-----+
|  1| k1|    measureA|    2|
|  1| k1|    measureA|    4|
|  1| k1|    measureB|    5|
|  1| k1|    measureB|    7|
|  1| k1|    measureC|    7|
|  1| k1|    measureC|    1|
|  2| k1|    measureB|    8|
|  2| k1|    measureC|    9|
|  2| k2|    measureA|    5|
|  2| k2|    measureC|    5|
|  2| k2|    measureC|    8|
+---+---+------------+-----+
I want to pivot on measure_type and apply different aggregation types to the value, depending on measure_type:
measureA -> sum
measureB -> avg
measureC -> max
Then, get the following output dataframe:
+---+---+--------+--------+--------+
| ts|key|measureA|measureB|measureC|
+---+---+--------+--------+--------+
|  1| k1|       6|       6|       7|
|  2| k1|    null|       8|       9|
|  2| k2|       5|    null|       8|
+---+---+--------+--------+--------+
Thanks a lot.
val ddf = df.groupBy("ts", "key").agg(
  sum(when(col("measure_type") === "measureA", col("value"))).as("measureA"),
  avg(when(col("measure_type") === "measureB", col("value"))).as("measureB"),
  max(when(col("measure_type") === "measureC", col("value"))).as("measureC"))
And the results are:
scala> ddf.show(false)
+---+---+--------+--------+--------+
|ts |key|measureA|measureB|measureC|
+---+---+--------+--------+--------+
|2  |k2 |5       |null    |8       |
|2  |k1 |null    |8.0     |9       |
|1  |k1 |6       |6.0     |7       |
+---+---+--------+--------+--------+
I think it's tedious to do with the traditional pivot function, as it limits you to one particular aggregate function.
Here is what I would do: map a pre-defined list of the aggregate functions I need onto the dataframe, which gives me three extra columns (one per aggregate function), then create another column whose value depends on the measure_type as you mentioned, and finally drop the three columns created in the previous step.
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import spark.implicits._

val df = Seq((1, "k1", "measureA", 2), (1, "k1", "measureA", 4), (1, "k1", "measureB", 5), (1, "k1", "measureB", 7), (1, "k1", "measureC", 7), (1, "k1", "measureC", 1), (2, "k1", "measureB", 8), (2, "k1", "measureC", 9), (2, "k2", "measureA", 5), (2, "k2", "measureC", 5), (2, "k2", "measureC", 8)).toDF("ts", "key", "measure_type", "value")

val mapping: Map[String, Column => Column] = Map(
  "sum" -> sum, "avg" -> avg, "max" -> max)

val groupBy = Seq("ts", "key", "measure_type")
val aggregate = Seq("value")
val operations = Seq("sum", "avg", "max")
val exprs = aggregate.flatMap(c => operations.map(f => mapping(f)(col(c))))

val df2 = df.groupBy(groupBy.map(col): _*).agg(exprs.head, exprs.tail: _*)

val df3 = df2.withColumn("new_column",
    when($"measure_type" === "measureA", $"sum(value)")
      .when($"measure_type" === "measureB", $"avg(value)")
      .otherwise($"max(value)"))
  .drop("sum(value)")
  .drop("avg(value)")
  .drop("max(value)")
df3 is the dataframe that you need.

Filter a dataframe using a list of tuples in spark scala

I am trying to filter a dataframe in Scala by comparing two of its columns (subject and stream in this case) against a list of tuples. If the column values match one of the tuples, the row should be kept.
val df = Seq(
  (0, "Mark", "Maths", "Science"),
  (1, "Tyson", "History", "Commerce"),
  (2, "Gerald", "Maths", "Science"),
  (3, "Katie", "Maths", "Commerce"),
  (4, "Linda", "History", "Science")).toDF("id", "name", "subject", "stream")
Sample input:
+---+------+-------+--------+
| id|  name|subject|  stream|
+---+------+-------+--------+
|  0|  Mark|  Maths| Science|
|  1| Tyson|History|Commerce|
|  2|Gerald|  Maths| Science|
|  3| Katie|  Maths|Commerce|
|  4| Linda|History| Science|
+---+------+-------+--------+
The list of tuples against which the above df needs to be filtered:
val listOfTuples = List[(String, String)](
  ("Maths", "Science"),
  ("History", "Commerce")
)
Expected result:
+---+------+-------+--------+
| id|  name|subject|  stream|
+---+------+-------+--------+
|  0|  Mark|  Maths| Science|
|  1| Tyson|History|Commerce|
|  2|Gerald|  Maths| Science|
+---+------+-------+--------+
You can do it either with isin on structs (needs Spark 2.2+):
val df_filtered = df
  .where(struct($"subject", $"stream").isin(listOfTuples.map(typedLit(_)): _*))
or with a left-semi join:
val df_filtered = df
  .join(listOfTuples.toDF("subject", "stream"), Seq("subject", "stream"), "leftsemi")
You can simply filter as follows:
val resultDF = df.filter(row => {
  List(
    ("Maths", "Science"),
    ("History", "Commerce")
  ).contains(
    (row.getAs[String]("subject"), row.getAs[String]("stream")))
})
Hope this helps!

Scala Spark collect_list() vs array()

What is the difference between collect_list() and array() in Spark using Scala?
I see them used all over the place, and the use cases are not clear enough for me to determine the difference.
Even though both array and collect_list return an ArrayType column, the two methods are very different.
The array method combines a number of columns "column-wise" into an array, whereas collect_list aggregates a single column "row-wise", typically per group (or Window partition), into an array, as shown below:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (1, "a", "b"),
  (1, "c", "d"),
  (2, "e", "f")
).toDF("c1", "c2", "c3")

df.
  withColumn("arr", array("c2", "c3")).
  show
// +---+---+---+------+
// | c1| c2| c3|   arr|
// +---+---+---+------+
// |  1|  a|  b|[a, b]|
// |  1|  c|  d|[c, d]|
// |  2|  e|  f|[e, f]|
// +---+---+---+------+
df.
  groupBy("c1").agg(collect_list("c2")).
  show
// +---+----------------+
// | c1|collect_list(c2)|
// +---+----------------+
// |  1|          [a, c]|
// |  2|             [e]|
// +---+----------------+
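Since collect_list can also aggregate over a Window partition, here is a minimal sketch of that variant; the window spec byC1 and the column name c2_list are illustrative assumptions, not part of the original answer:
import org.apache.spark.sql.expressions.Window

// Sketch: collect_list over a Window partition keeps one output row per input
// row, attaching the full per-partition list to each of them.
val byC1 = Window.partitionBy("c1")

df.
  withColumn("c2_list", collect_list("c2").over(byC1)).
  show
// Every c1 = 1 row gets the full list for its partition (a and c);
// the c1 = 2 row gets [e].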

Cumulative product in Spark

I'm trying to implement a cumulative product in Spark Scala, but I really don't know how to do it. I have the following dataframe:
Input data:
+--+--+--------+---+
|A |B |    date|val|
+--+--+--------+---+
|rr|gg|20171103|  2|
|hh|jj|20171103|  3|
|rr|gg|20171104|  4|
|hh|jj|20171104|  5|
|rr|gg|20171105|  6|
|hh|jj|20171105|  7|
+--+--+--------+---+
And I would like to have the following output:
Output data:
+--+--+--------+---+
|A |B |    date|val|
+--+--+--------+---+
|rr|gg|20171105| 48| // 2 * 4 * 6
|hh|jj|20171105|105| // 3 * 5 * 7
+--+--+--------+---+
As long as the numbers are strictly positive (0 can be handled as well, if present, using coalesce), as in your example, the simplest solution is to compute the sum of logarithms and take the exponential:
import org.apache.spark.sql.functions.{exp, log, max, round, sum}

val df = Seq(
  ("rr", "gg", "20171103", 2), ("hh", "jj", "20171103", 3),
  ("rr", "gg", "20171104", 4), ("hh", "jj", "20171104", 5),
  ("rr", "gg", "20171105", 6), ("hh", "jj", "20171105", 7)
).toDF("A", "B", "date", "val")

val result = df
  .groupBy("A", "B")
  .agg(
    max($"date").as("date"),
    exp(sum(log($"val"))).as("val"))
Since this uses FP arithmetic the result won't be exact:
result.show
+---+---+--------+------------------+
|  A|  B|    date|               val|
+---+---+--------+------------------+
| hh| jj|20171105|104.99999999999997|
| rr| gg|20171105|47.999999999999986|
+---+---+--------+------------------+
but after rounding it should be good enough for the majority of applications:
result.withColumn("val", round($"val")).show
+---+---+--------+-----+
|  A|  B|    date|  val|
+---+---+--------+-----+
| hh| jj|20171105|105.0|
| rr| gg|20171105| 48.0|
+---+---+--------+-----+
If that's not enough, you can define a UserDefinedAggregateFunction or Aggregator (see How to define and use a User-Defined Aggregate Function in Spark SQL?) or use the functional API with reduceGroups:
import scala.math.Ordering

case class Record(A: String, B: String, date: String, value: Long)

df.withColumnRenamed("val", "value").as[Record]
  .groupByKey(x => (x.A, x.B))
  .reduceGroups((x, y) => x.copy(
    date = Ordering[String].max(x.date, y.date),
    value = x.value * y.value))
  .toDF("key", "value")
  .select($"value.*")
  .show
+---+---+--------+-----+
|  A|  B|    date|value|
+---+---+--------+-----+
| hh| jj|20171105|  105|
| rr| gg|20171105|   48|
+---+---+--------+-----+
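And here is a minimal sketch of the Aggregator route mentioned above, wrapped with functions.udaf (which assumes Spark 3.0+); the object name ProductAgg and the cast of val to long are illustrative assumptions:
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.{max, udaf}
import org.apache.spark.sql.{Encoder, Encoders}

// Product Aggregator: multiplies all values of a group together.
object ProductAgg extends Aggregator[Long, Long, Long] {
  def zero: Long = 1L                             // neutral element of multiplication
  def reduce(acc: Long, x: Long): Long = acc * x  // fold one input value into the buffer
  def merge(a: Long, b: Long): Long = a * b       // combine partial (local) aggregations
  def finish(acc: Long): Long = acc
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

val productUdaf = udaf(ProductAgg)

df.groupBy("A", "B")
  .agg(max($"date").as("date"), productUdaf($"val".cast("long")).as("val"))
  .show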
You can solve this using either collect_list + a UDF or a UDAF. A UDAF may be more efficient, but it is harder to implement because of the local aggregation.
If you have a dataframe like this:
+---+---+
|key|val|
+---+---+
|  a|  1|
|  a|  2|
|  a|  3|
|  b|  4|
|  b|  5|
+---+---+
You can invoke a UDF:
val prod = udf((vals: Seq[Int]) => vals.reduce(_ * _))

df
  .groupBy($"key")
  .agg(prod(collect_list($"val")).as("val"))
  .show()
+---+---+
|key|val|
+---+---+
|  b| 20|
|  a|  6|
+---+---+
Since Spark 2.4, you can also compute this using the higher-order function aggregate:
import org.apache.spark.sql.functions.{expr, max}

val df = Seq(
  ("rr", "gg", "20171103", 2),
  ("hh", "jj", "20171103", 3),
  ("rr", "gg", "20171104", 4),
  ("hh", "jj", "20171104", 5),
  ("rr", "gg", "20171105", 6),
  ("hh", "jj", "20171105", 7)
).toDF("A", "B", "date", "val")

val result = df
  .groupBy("A", "B")
  .agg(
    max($"date").as("date"),
    expr("""
      aggregate(
        collect_list(val),
        cast(1 as bigint),
        (acc, x) -> acc * x)""").alias("val")
  )
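Since aggregate multiplies the collected integer values directly, the result is exact here, unlike the log/exp approach above:
result.show
// Expected: val = 48 for (rr, gg) and 105 for (hh, jj), as exact bigints
// rather than floating-point approximations.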
Spark 3.2+
product(e: Column): Column
Aggregate function: returns the product of all numerical elements in a group.
Scala
import org.apache.spark.sql.functions.{max, product}
import spark.implicits._

var df = Seq(
  ("rr", "gg", 20171103, 2),
  ("hh", "jj", 20171103, 3),
  ("rr", "gg", 20171104, 4),
  ("hh", "jj", 20171104, 5),
  ("rr", "gg", 20171105, 6),
  ("hh", "jj", 20171105, 7)
).toDF("A", "B", "date", "val")

df = df.groupBy("A", "B").agg(max($"date").as("date"), product($"val").as("val"))
df.show(false)
// +---+---+--------+-----+
// |A  |B  |date    |val  |
// +---+---+--------+-----+
// |hh |jj |20171105|105.0|
// |rr |gg |20171105|48.0 |
// +---+---+--------+-----+
PySpark
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
data = [('rr', 'gg', 20171103, 2),
        ('hh', 'jj', 20171103, 3),
        ('rr', 'gg', 20171104, 4),
        ('hh', 'jj', 20171104, 5),
        ('rr', 'gg', 20171105, 6),
        ('hh', 'jj', 20171105, 7)]
df = spark.createDataFrame(data, ['A', 'B', 'date', 'val'])
df = df.groupBy('A', 'B').agg(F.max('date').alias('date'), F.product('val').alias('val'))
df.show()
#+---+---+--------+-----+
#|  A|  B|    date|  val|
#+---+---+--------+-----+
#| hh| jj|20171105|105.0|
#| rr| gg|20171105| 48.0|
#+---+---+--------+-----+

Create DataFrame with null value for few column

I am trying to create a DataFrame from an RDD.
First, I create the RDD using the code below:
val account = sc.parallelize(Seq(
  (1, null, 2, "F"),
  (2, 2, 4, "F"),
  (3, 3, 6, "N"),
  (4, null, 8, "F")))
This works fine:
account: org.apache.spark.rdd.RDD[(Int, Any, Int, String)] = ParallelCollectionRDD[0] at parallelize at <console>:27
but when I try to create a DataFrame from the RDD using the code below,
account.toDF("ACCT_ID", "M_CD", "C_CD", "IND")
I get the following error:
java.lang.UnsupportedOperationException: Schema for type Any is not supported
I found that I only get the error when I put a null value in the Seq.
Is there any way to include null values?
Alternative way without using RDDs:
import spark.implicits._

val df = spark.createDataFrame(Seq(
  (1, None, 2, "F"),
  (2, Some(2), 4, "F"),
  (3, Some(3), 6, "N"),
  (4, None, 8, "F")
)).toDF("ACCT_ID", "M_CD", "C_CD", "IND")

df.show
+-------+----+----+---+
|ACCT_ID|M_CD|C_CD|IND|
+-------+----+----+---+
|      1|null|   2|  F|
|      2|   2|   4|  F|
|      3|   3|   6|  N|
|      4|null|   8|  F|
+-------+----+----+---+
df.printSchema
root
|-- ACCT_ID: integer (nullable = false)
|-- M_CD: integer (nullable = true)
|-- C_CD: integer (nullable = false)
|-- IND: string (nullable = true)
The problem is that Any is too general a type and Spark just has no idea how to serialize it. You should explicitly provide a specific type, in your case Integer. Since null can't be assigned to primitive types in Scala, you can use java.lang.Integer instead. So try this:
val rdd = sc.parallelize(Seq(
  (1, null.asInstanceOf[Integer], 2, "F"),
  (2, new Integer(2), 4, "F"),
  (3, new Integer(3), 6, "N"),
  (4, null.asInstanceOf[Integer], 8, "F")))
Here is the output:
rdd: org.apache.spark.rdd.RDD[(Int, Integer, Int, String)] = ParallelCollectionRDD[0] at parallelize at <console>:24
And the corresponding DataFrame:
scala> val df = rdd.toDF("ACCT_ID", "M_CD", "C_CD","IND")
df: org.apache.spark.sql.DataFrame = [ACCT_ID: int, M_CD: int ... 2 more fields]
scala> df.show
+-------+----+----+---+
|ACCT_ID|M_CD|C_CD|IND|
+-------+----+----+---+
|      1|null|   2|  F|
|      2|   2|   4|  F|
|      3|   3|   6|  N|
|      4|null|   8|  F|
+-------+----+----+---+
You can also consider a cleaner way to declare the null Integer value, like:
object Constants {
  val NullInteger: java.lang.Integer = null
}
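For example, here is a sketch of how that constant could replace the asInstanceOf casts above (using the same sample data):
// The named constant stands in for null.asInstanceOf[Integer]
val rdd = sc.parallelize(Seq(
  (1, Constants.NullInteger, 2, "F"),
  (2, new Integer(2), 4, "F"),
  (3, new Integer(3), 6, "N"),
  (4, Constants.NullInteger, 8, "F")))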