Spark with Scala: using multiple heterogeneous columns in a UDF

Say I have a dataframe with multiple columns of possibly various types. I need to write a UDF that takes inputs from multiple columns, does a fairly complicated computation, and returns a result (say a String).
val dataframe = Seq(
  (1.0, Array(0, 2, 1), Array(0, 2, 3), 23.0, 21.0),
  (1.0, Array(0, 7, 1), Array(1, 2, 3), 42.0, 41.0)
).toDF("c", "a1", "a2", "t1", "t2")
E.g.: ("c" * sum("a1") + sum("a2")).toString + "t1".toString
In actuality, the computation is lengthy and the arrays have about a million elements. I am fairly new to Spark and would be grateful for sample code or a pointer to a resource with Scala examples.
TIA

Here is an example UDF:
val udf_doComputation = udf((c: Double, a1: Seq[Int], a2: Seq[Int], t1: Double) => {
  // your complex computation goes here
  (c * a1.sum + a2.sum).toString() + t1.toString()
})

dataframe
  .withColumn("result", udf_doComputation($"c", $"a1", $"a2", $"t1"))
  .show()
gives:
+---+---------+---------+----+----+--------+
| c| a1| a2| t1| t2| result|
+---+---------+---------+----+----+--------+
|1.0|[0, 2, 1]|[0, 2, 3]|23.0|21.0| 8.023.0|
|1.0|[0, 7, 1]|[1, 2, 3]|42.0|41.0|14.042.0|
+---+---------+---------+----+----+--------+
Note that the variable names of the UDF don't need to match the column names, but the types must match:
Primitives of type A map directly to A. There are several valid mappings, e.g. a double column in the dataframe maps to either Double or java.lang.Double, etc. But you cannot map to Option[A]! So if your input may be null, you need to use the corresponding types from java.lang.* (see the sketch right after these notes).
Arrays of primitives of type A map to Seq[A], e.g. array<int> maps to Seq[Int]. The concrete type will be WrappedArray, so mapping to this or to IndexedSeq would also work. The important thing to know is that the runtime type is indexed.
struct maps to Row.
array<struct> maps to Seq[Row].
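To illustrate the null point above, here is a minimal sketch (not part of the original answer) of a null-tolerant variant, using java.lang.Double for columns that may be null; the computation itself is just a placeholder:
val udf_nullSafe = udf((c: java.lang.Double, t1: java.lang.Double) =>
  // propagate null instead of throwing a NullPointerException
  if (c == null || t1 == null) null
  else (c * 2.0).toString + t1.toString
)
dataframe.withColumn("result", udf_nullSafe($"c", $"t1")).show()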

Related

Element-wise sum of arrays across multiple columns of a data frame in Spark / Scala?

I have a Dataframe that can have multiple columns of Array type like "Array1", "Array2" ... etc. These array columns would have the same number of elements. I need to compute a new column of Array type which will be the element-wise sum of the arrays. How can I do it?
Spark version = 2.3
For Ex:
Input:
|Column1| ... |ArrayColumn2|ArrayColumn3|
|-------| --- |------------|------------|
|T1     | ... |[1, 2, 3]   |[2, 5, 7]   |
Output:
|Column1| ... |AggregatedColumn|
|-------| --- |----------------|
|T1     | ... |[3, 7, 10]      |
The number of Array columns is not fixed, so I need a generalized solution. I would have a list of the columns that need to be aggregated.
Thanks!
Consider using inline and the higher-order function aggregate (available in Spark 2.4+) to compute element-wise sums from the Array-typed columns, followed by a groupBy/agg to group the element-wise sums back into Arrays:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (101, Seq(1, 2), Seq(3, 4), Seq(5, 6)),
  (202, Seq(7, 8), Seq(9, 10), Seq(11, 12))
).toDF("id", "arr1", "arr2", "arr3")

val arrCols = df.columns.filter(_.startsWith("arr")).map(col)
For Spark 3.0+
df.
  withColumn("arr_structs", arrays_zip(arrCols: _*)).
  select($"id", expr("inline(arr_structs)")).
  select($"id", aggregate(array(arrCols: _*), lit(0), (acc, x) => acc + x).as("pos_elem_sum")).
  groupBy("id").agg(collect_list($"pos_elem_sum").as("arr_elem_sum")).
  show
// +---+------------+
// | id|arr_elem_sum|
// +---+------------+
// |101| [9, 12]|
// |202| [27, 30]|
// +---+------------+
For Spark 2.4+
df.
  withColumn("arr_structs", arrays_zip(arrCols: _*)).
  select($"id", expr("inline(arr_structs)")).
  select($"id", array(arrCols: _*).as("arr_pos_elems")).
  select($"id", expr("aggregate(arr_pos_elems, 0, (acc, x) -> acc + x)").as("pos_elem_sum")).
  groupBy("id").agg(collect_list($"pos_elem_sum").as("arr_elem_sum")).
  show
For Spark 2.3 or below (note that arrays_zip itself was only added in Spark 2.4, so on 2.3 the zipping step would also need a different approach, e.g. a UDF)
val sumArrElems = udf{ (arr: Seq[Int]) => arr.sum }
df.
  withColumn("arr_structs", arrays_zip(arrCols: _*)).
  select($"id", expr("inline(arr_structs)")).
  select($"id", sumArrElems(array(arrCols: _*)).as("pos_elem_sum")).
  groupBy("id").agg(collect_list($"pos_elem_sum").as("arr_elem_sum")).
  show
An SQL expression like array(ArrayColumn2[0]+ArrayColumn3[0], ArrayColumn2[1]+...) can be used to calculate the expected result.
import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.types.ArrayType

val df = ...

//get all array columns
val arrayCols = df.schema.fields.filter(_.dataType.isInstanceOf[ArrayType]).map(_.name)

//get the size of the first array of the first row
val firstArray = arrayCols(0)
val arraySize = df.selectExpr(s"size($firstArray)").first().getAs[Int](0)

//generate the sql expression for the sums
val sums = (for (i <- 0 to arraySize - 1)
  yield arrayCols.map(c => s"$c[$i]").mkString("+")).mkString(",")
//sums = ArrayColumn2[0]+ArrayColumn3[0],ArrayColumn2[1]+ArrayColumn3[1],ArrayColumn2[2]+ArrayColumn3[2]

//create a new column using sums
df.withColumn("AggregatedColumn", expr(s"array($sums)")).show()
Output:
+-------+------------+------------+----------------+
|Column1|ArrayColumn2|ArrayColumn3|AggregatedColumn|
+-------+------------+------------+----------------+
| T1| [1, 2, 3]| [2, 5, 7]| [3, 7, 10]|
+-------+------------+------------+----------------+
Using this single (long) SQL expression will avoid shuffling data over the network and thus improve performance.

Spark Dataframe from all combinations of Array column

Assume I have a Spark DataFrame d1 with two columns, elements_1 and elements_2, that contain sets of integers of size k, and value_1, value_2 that contain an integer value. For example, with k = 3:
d1 =
+------------+------------+
| elements_1 | elements_2 |
+------------+------------+
| (1, 4, 3)  | (3, 4, 5)  |
| (2, 1, 3)  | (1, 0, 2)  |
| (4, 3, 1)  | (3, 5, 6)  |
+------------+------------+
I need to create a new column combinations that contains, for each pair of sets in elements_1 and elements_2, a list of the sets from all possible combinations of their elements. These sets must have the following properties:
Their size must be k+1
They must contain either the set in elements_1 or the set in elements_2
For example, from (1, 2, 3) and (3, 4, 5) we obtain [(1, 2, 3, 4), (1, 2, 3, 5), (3, 4, 5, 1) and (3, 4, 5, 2)]. The list does not contain (1, 2, 5) because it is not of length 3+1, and it does not contain (1, 2, 4, 5) because it contains neither of the original sets.
You need to create a custom function to perform the transformation, create a Spark-compatible UserDefinedFunction from it, then apply it using withColumn. So really, there are two questions here: (1) how to do the set transformation you described, and (2) how to create a new column in a DataFrame using a user-defined function.
Here's a first shot at the set logic, let me know if it does what you're looking for:
def combo[A](a: Set[A], b: Set[A]): Set[Set[A]] =
  a.diff(b).map(b + _) ++ b.diff(a).map(a + _)
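Worked out by hand for the example pair in the question, this gives:
// combo(Set(1, 2, 3), Set(3, 4, 5))
// a.diff(b) = {1, 2} -> add each to b: {3, 4, 5, 1}, {3, 4, 5, 2}
// b.diff(a) = {4, 5} -> add each to a: {1, 2, 3, 4}, {1, 2, 3, 5}
// result: Set(Set(3,4,5,1), Set(3,4,5,2), Set(1,2,3,4), Set(1,2,3,5))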
Now create the UDF wrapper. Note that under the hood these sets are all represented by WrappedArrays, so we need to handle this. There's probably a more elegant way to deal with this by defining some implicit conversions, but this should work:
import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.{col, udf}

val comboWrap: (WrappedArray[Int], WrappedArray[Int]) => Array[Array[Int]] =
  (x, y) => combo(x.toSet, y.toSet).map(_.toArray).toArray

val comboUDF = udf(comboWrap)
Finally, apply it to the DataFrame by creating a new column:
val data = Seq((Set(1,2,3),Set(3,4,5))).toDF("elements_1","elements_2")
val result = data.withColumn("result",
comboUDF(col("elements_1"),col("elements_2")))
result.show

How to create DataFrame with nulls using toDF?

How do you create a dataframe containing nulls from a sequence using .toDF?
This works:
val df = Seq((1,"a"),(2,"b")).toDF("number","letter")
but I'd like to do something along the lines of:
val df = Seq((1, NULL),(2,"b")).toDF("number","letter")
In addition to Ramesh's answer, it's worth noting that since toDF uses reflection to infer the schema, it's important for the provided sequence to have the correct type. And if Scala's type inference isn't enough, you need to specify the type explicitly.
For example, if you want the 2nd column to be a nullable integer, then neither of the following works:
Seq((1, null)) has inferred type Seq[(Int, Null)]
Seq((1, null), (2, 2)) has inferred type Seq[(Int, Any)]
In this case you need to explicitly specify the type for the 2nd column. There are at least two ways to do it. You can explicitly specify the generic type for the sequence:
Seq[(Int, Integer)]((1, null)).toDF
or create a case class for the row:
case class MyRow(x: Int, y: Integer)
Seq(MyRow(1, null)).toDF
Note that I used Integer instead of Int, as the latter, being a primitive type, cannot accommodate nulls.
NULL is not defined anywhere in the APIs, but null is, so you can define it like:
val df2 = Seq((1, null), (2, "b")).toDF("number","letter")
And you should have output as
+------+------+
|number|letter|
+------+------+
|1 |null |
|2 |b |
+------+------+
The trick is to have at least one non-null value in the column containing nulls, so Spark SQL can infer the type it should use.
The following then won't work:
val df = Seq((1, null)).toDF("number","letter")
Spark has no way of knowing what the type of letter is in this case.
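If you really do have only the single row, a minimal sketch (mine, not from the original answers) that combines this with the explicit-type approach above; the printed schema is what I'd expect, not copied output:
val df = Seq[(Int, String)]((1, null)).toDF("number", "letter")
df.printSchema()
// root
//  |-- number: integer (nullable = false)
//  |-- letter: string (nullable = true)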

How to convert a SQL query output (dataframe) into an array list of key value pairs in Spark Scala?

I created a dataframe in the Spark Scala shell for SFPD incidents. I queried the data for the Category count and the result is a dataframe. I want to plot this data into a graph using Wisp. Here is my dataframe:
+--------------+--------+
| Category|catcount|
+--------------+--------+
| LARCENY/THEFT| 362266|
|OTHER OFFENSES| 257197|
| NON-CRIMINAL| 189857|
| ASSAULT| 157529|
| VEHICLE THEFT| 109733|
| DRUG/NARCOTIC| 108712|
| VANDALISM| 91782|
| WARRANTS| 85837|
| BURGLARY| 75398|
|SUSPICIOUS OCC| 64452|
+--------------+--------+
I want to convert this dataframe into an array list of key-value pairs. So I want the result to look like this, with (String, Int) type:
(LARCENY/THEFT,362266)
(OTHER OFFENSES,257197)
(NON-CRIMINAL,189857)
(ASSAULT,157529)
(VEHICLE THEFT,109733)
(DRUG/NARCOTIC,108712)
(VANDALISM,91782)
(WARRANTS,85837)
(BURGLARY,75398)
(SUSPICIOUS OCC,64452)
I tried converting this dataframe (t) into an RDD with val rddt = t.rdd and then used flatMapValues:
rddt.flatMapValues(x=>x).collect()
but still couldn't get the required result.
Or is there a way to directly give the dataframe output into Wisp?
In pyspark it'd be as below. Scala will be quite similar.
Creating test data
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,1), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])
Mapping the test data, reformatting from an RDD of Rows to an RDD of tuples. Then, using collect to extract all the tuples as a list.
df.rdd.map(lambda x: (x[0], x[1])).collect()
[(0, 1), (0, 1), (0, 2), (1, 2), (1, 1), (1, 20), (3, 18), (3, 18), (3, 18)]
Here's the Scala Spark Row documentation that should help you convert this to Scala Spark code
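For completeness, a rough Scala sketch of the same idea, assuming t is the Category/catcount DataFrame shown above and that the count column is a Long (adjust the getter if it is an Int); in spark-shell the needed implicits are already in scope:
// Row-based version: pull (Category, catcount) out of each Row and collect.
val pairs: Array[(String, Int)] =
  t.rdd
   .map(row => (row.getString(0), row.getLong(1).toInt))
   .collect()

// Or stay in the typed API and let Spark convert rows to tuples directly.
val pairs2: Array[(String, Long)] =
  t.select($"Category", $"catcount").as[(String, Long)].collect()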

Efficient PairRDD operations on DataFrame with Spark SQL GROUP BY

This question is about the duality between DataFrame and RDD when it comes to aggregation operations. In Spark SQL one can use table generating UDFs for custom aggregations but creating one of those is typically noticeably less user-friendly than using the aggregation functions available for RDDs, especially if table output is not required.
Is there an efficient way to apply pair RDD operations such as aggregateByKey to a DataFrame which has been grouped using GROUP BY or ordered using ORDER BY?
Normally, one would need an explicit map step to create key-value tuples, e.g. dataFrame.rdd.map(row => (row.getString(row.fieldIndex("category")), row)).aggregateByKey(...). Can this be avoided?
Not really. While DataFrames can be converted to RDDs and vice versa, this is a relatively complex operation, and methods like DataFrame.groupBy don't have the same semantics as their counterparts on RDDs.
The closest thing you can get is the new Dataset API introduced in Spark 1.6.0. It provides much closer integration with DataFrames and a GroupedDataset class with its own set of methods, including reduce, cogroup and mapGroups:
case class Record(id: Long, key: String, value: Double)

val df = sc.parallelize(Seq(
  (1L, "foo", 3.0), (2L, "bar", 5.6),
  (3L, "foo", -1.0), (4L, "bar", 10.0)
)).toDF("id", "key", "value")

val ds = df.as[Record]

ds.groupBy($"key").reduce((x, y) => if (x.id < y.id) x else y).show
// +-----+-----------+
// | _1| _2|
// +-----+-----------+
// |[bar]|[2,bar,5.6]|
// |[foo]|[1,foo,3.0]|
// +-----+-----------+
In some specific cases it is possible to leverage Orderable semantics to group and process data using structs or arrays. You'll find an example in "SPARK DataFrame: select the first row of each group".
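A rough sketch of that struct trick on the df defined above (my illustration, not the linked answer's code): take, per key, the minimum over a struct whose first field is the ordering column, which effectively selects the row with the smallest id:
import org.apache.spark.sql.functions.{min, struct}

df.groupBy($"key")
  .agg(min(struct($"id", $"value")).as("first"))
  .select($"key", $"first.id", $"first.value")
  .show
// e.g. bar -> (2, 5.6), foo -> (1, 3.0)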