Spark Dataframe from all combinations of Array column - scala

Assume I have a Spark DataFrame d1 with two columns, elements_1 and elements_2, that contain sets of integers of size k, and two further columns, value_1 and value_2, that each contain an integer value. For example, with k = 3:
d1 =
+------------+------------+
| elements_1 | elements_2 |
+------------+------------+
| (1, 4, 3)  | (3, 4, 5)  |
| (2, 1, 3)  | (1, 0, 2)  |
| (4, 3, 1)  | (3, 5, 6)  |
+------------+------------+
I need to create a new column, combinations, that contains, for each pair of sets elements_1 and elements_2, a list of the sets obtained from all possible combinations of their elements. These sets must have the following properties:
Their size must be k+1
They must contain either the set in elements_1 or the set in elements_2
For example, from (1, 2, 3) and (3, 4, 5) we obtain [(1, 2, 3, 4), (1, 2, 3, 5), (3, 4, 5, 1) and (3, 4, 5, 2)]. The list does not contain (1, 2, 5) because it is not of length 3+1, and it does not contain (1, 2, 4, 5) because it contains neither of the original sets.

You need to create a custom function to perform the transformation, wrap it in a Spark-compatible UserDefinedFunction, then apply it using withColumn. So really, there are two questions here: (1) how to do the set transformation you described, and (2) how to create a new column in a DataFrame using a user-defined function.
Here's a first shot at the set logic, let me know if it does what you're looking for:
def combo[A](a: Set[A], b: Set[A]): Set[Set[A]] =
  a.diff(b).map(b + _) ++ b.diff(a).map(a + _)
Now create the UDF wrapper. Note that under the hood these sets are all represented by WrappedArrays, so we need to handle this. There's probably a more elegant way to deal with this by defining some implicit conversions, but this should work:
import org.apache.spark.sql.functions.{col, udf}
import scala.collection.mutable.WrappedArray

val comboWrap: (WrappedArray[Int], WrappedArray[Int]) => Array[Array[Int]] =
  (x, y) => combo(x.toSet, y.toSet).map(_.toArray).toArray
val comboUDF = udf(comboWrap)
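As an aside, since Spark hands array<int> columns to a UDF as Seq[Int], a slightly leaner wrapper is also possible. This is only a sketch of that alternative (the name comboSeqUDF is mine, not part of the answer above):
// Alternative sketch: accept Seq[Int] directly; Spark passes array<int> as a Seq
// (concretely a WrappedArray), so no explicit WrappedArray plumbing is needed.
val comboSeqUDF = udf { (x: Seq[Int], y: Seq[Int]) =>
  combo(x.toSet, y.toSet).map(_.toSeq).toSeq
}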
Finally, apply it to the DataFrame by creating a new column:
val data = Seq((Set(1,2,3),Set(3,4,5))).toDF("elements_1","elements_2")
val result = data.withColumn("result",
  comboUDF(col("elements_1"), col("elements_2")))
result.show

Related

Element-wise sum of arrays across multiple columns of a data frame in Spark / Scala?

I have a Dataframe that can have multiple columns of Array type like "Array1", "Array2", etc. These array columns would have the same number of elements. I need to compute a new column of Array type which will be the element-wise sum of the arrays. How can I do it?
Spark version = 2.3
For example:
Input:
|Column1| ... |ArrayColumn2|ArrayColumn3|
|-------| --- |------------|------------|
|T1     | ... |[1, 2, 3]   |[2, 5, 7]   |
Output:
|Column1| ... |AggregatedColumn|
|-------| --- |----------------|
|T1     | ... |[3, 7, 10]      |
The number of Array columns is not fixed, so I need a generalized solution. I would have a list of the columns that need to be aggregated.
Thanks!
Consider using inline and higher-order function aggregate (available in Spark 2.4+) to compute element-wise sums from the Array-typed columns, followed by a groupBy/agg to group the element-wise sums back into Arrays:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  (101, Seq(1, 2), Seq(3, 4), Seq(5, 6)),
  (202, Seq(7, 8), Seq(9, 10), Seq(11, 12))
).toDF("id", "arr1", "arr2", "arr3")
val arrCols = df.columns.filter(_.startsWith("arr")).map(col)
For Spark 3.0+
df.
  withColumn("arr_structs", arrays_zip(arrCols: _*)).
  select($"id", expr("inline(arr_structs)")).
  select($"id", aggregate(array(arrCols: _*), lit(0), (acc, x) => acc + x).as("pos_elem_sum")).
  groupBy("id").agg(collect_list($"pos_elem_sum").as("arr_elem_sum")).
  show
// +---+------------+
// | id|arr_elem_sum|
// +---+------------+
// |101| [9, 12]|
// |202| [27, 30]|
// +---+------------+
For Spark 2.4+
df.
  withColumn("arr_structs", arrays_zip(arrCols: _*)).
  select($"id", expr("inline(arr_structs)")).
  select($"id", array(arrCols: _*).as("arr_pos_elems")).
  select($"id", expr("aggregate(arr_pos_elems, 0, (acc, x) -> acc + x)").as("pos_elem_sum")).
  groupBy("id").agg(collect_list($"pos_elem_sum").as("arr_elem_sum")).
  show
For Spark 2.3 or below (note that arrays_zip itself requires Spark 2.4; an alternative sketch for 2.3 follows after this block)
val sumArrElems = udf{ (arr: Seq[Int]) => arr.sum }
df.
  withColumn("arr_structs", arrays_zip(arrCols: _*)).
  select($"id", expr("inline(arr_structs)")).
  select($"id", sumArrElems(array(arrCols: _*)).as("pos_elem_sum")).
  groupBy("id").agg(collect_list($"pos_elem_sum").as("arr_elem_sum")).
  show
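As flagged above, arrays_zip is not available before Spark 2.4, so on a genuine 2.3 cluster the zipping has to happen somewhere else. A minimal sketch, assuming the same df and arrCols as above, is to let a UDF do both the zipping (via transpose) and the summing:
// Spark 2.3 sketch: pass all array columns as one array<array<int>> and let the UDF
// transpose and sum, avoiding arrays_zip entirely.
val sumElementWise = udf { (arrs: Seq[Seq[Int]]) => arrs.transpose.map(_.sum) }

df.
  withColumn("arr_elem_sum", sumElementWise(array(arrCols: _*))).
  select("id", "arr_elem_sum").
  show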
An SQL expression like array(ArrayColumn2[0]+ArrayColumn3[0], ArrayColumn2[1]+...) can be used to calculate the expected result.
import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.types.ArrayType

val df = ...
// get all array columns
val arrayCols = df.schema.fields.filter(_.dataType.isInstanceOf[ArrayType]).map(_.name)
// get the size of the first array of the first row
val firstArray = arrayCols(0)
val arraySize = df.selectExpr(s"size($firstArray)").first().getAs[Int](0)
// generate the SQL expression for the sums
val sums = (for (i <- 0 to arraySize - 1)
  yield arrayCols.map(c => s"$c[$i]").mkString("+")).mkString(",")
// sums = ArrayColumn2[0]+ArrayColumn3[0],ArrayColumn2[1]+ArrayColumn3[1],ArrayColumn2[2]+ArrayColumn3[2]
// create a new column using sums
df.withColumn("AggregatedColumn", expr(s"array($sums)")).show()
Output:
+-------+------------+------------+----------------+
|Column1|ArrayColumn2|ArrayColumn3|AggregatedColumn|
+-------+------------+------------+----------------+
| T1| [1, 2, 3]| [2, 5, 7]| [3, 7, 10]|
+-------+------------+------------+----------------+
Using this single (long) SQL expression will avoid shuffling data over the network and thus improve performance.
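If you prefer to stay in the Column API rather than assembling a SQL string, a roughly equivalent sketch (reusing arrayCols and arraySize from above) would be:
import org.apache.spark.sql.functions.{array, col}

// Build one Column per array position: ArrayColumn2[i] + ArrayColumn3[i] + ...
val sumCols = (0 until arraySize).map { i =>
  arrayCols.map(c => col(c)(i)).reduce(_ + _)
}
df.withColumn("AggregatedColumn", array(sumCols: _*)).show()
Either way the work stays in a single projection, so no shuffle is introduced.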

Spark w/ Scala way of using multiple heterogeneous columns in a UDF

Say I have a dataframe with multiple columns of possibly various types. I need to write a UDF that takes inputs from multiple columns, does a fairly complicated computation and returns result (say a string).
val dataframe = Seq( (1.0, Array(0, 2, 1), Array(0, 2, 3), 23.0, 21.0),
(1.0, Array(0, 7, 1), Array(1, 2, 3), 42.0, 41.0)).toDF(
"c", "a1", "a2", "t1", "t2")
Eg: ("c" * sum("a1") + sum("a2")).toString + "t1".toString
In actuality, the computation is lengthy and the arrays have about a million elements. I am fairly new to Spark and would be grateful for sample code or a pointer to a resource with Scala examples.
TIA
Here is an example UDF:
val udf_doComputation = udf((c: Double, a1: Seq[Int], a2: Seq[Int], t1: Double) => {
  // your complex computation goes here
  (c * a1.sum + a2.sum).toString() + t1.toString()
})

dataframe
  .withColumn("result", udf_doComputation($"c", $"a1", $"a2", $"t1"))
  .show()
gives:
+---+---------+---------+----+----+--------+
| c| a1| a2| t1| t2| result|
+---+---------+---------+----+----+--------+
|1.0|[0, 2, 1]|[0, 2, 3]|23.0|21.0| 8.023.0|
|1.0|[0, 7, 1]|[1, 2, 3]|42.0|41.0|14.042.0|
+---+---------+---------+----+----+--------+
Note that the variable names of the UDF don't need to match the column names, but the types must match:
Primitives of type A map directly to A. There are several valid mappings, e.g. a double column maps to either Double or java.lang.Double, etc. But you cannot map to Option[A]! So if your input may be null, you need to use the corresponding types from java.lang.* (see the sketch after this list).
Arrays of primitives of type A map to Seq[A], e.g. array<int> maps to Seq[Int]. The concrete type will be WrappedArray, so mapping to that or to IndexedSeq would also work. The important thing to know is that the runtime type is indexed.
struct maps to Row
array<struct> maps to Seq[Row]
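To illustrate the null point above, here is a small sketch (the column names and the safeRatio name are made up for illustration, not from the question): with boxed java.lang.Double parameters a null input reaches the UDF body, whereas a Scala Double parameter would fail on null.
import org.apache.spark.sql.functions.udf

// Sketch: boxed java.lang.Double parameters let null inputs reach the UDF body.
val safeRatio = udf { (num: java.lang.Double, den: java.lang.Double) =>
  if (num == null || den == null || den == 0.0) null
  else java.lang.Double.valueOf(num / den)
}

// Hypothetical usage: dataframe.withColumn("ratio", safeRatio($"t1", $"t2"))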

Sample a different number of random rows for every group in a dataframe in spark scala

The goal is to sample (without replacement) a different number of rows in a dataframe for every group. The number of rows to sample for a specific group is in another dataframe.
Example: idDF is the dataframe to sample from. The groups are denoted by the ID column. The dataframe, planDF specifies the number of rows to sample for each group where "datesToUse" denotes the number of rows, and "ID" denotes the group. "totalDates" is the total number of rows for that group and may or may not be useful.
The final result should have 3 rows sampled from the first group (ID 1), 2 rows sampled from the second group (ID 2) and 1 row sampled from the third group (ID 3).
val idDF = Seq(
(1, "2017-10-03"),
(1, "2017-10-22"),
(1, "2017-11-01"),
(1, "2017-10-02"),
(1, "2017-10-09"),
(1, "2017-12-24"),
(1, "2017-10-20"),
(2, "2017-11-17"),
(2, "2017-11-12"),
(2, "2017-12-02"),
(2, "2017-10-03"),
(3, "2017-12-18"),
(3, "2017-11-21"),
(3, "2017-12-13"),
(3, "2017-10-08"),
(3, "2017-10-16"),
(3, "2017-12-04")
).toDF("ID", "date")
val planDF = Seq(
(1, 3, 7),
(2, 2, 4),
(3, 1, 6)
).toDF("ID", "datesToUse", "totalDates")
This is an example of what the resultant dataframe should look like:
+---+----------+
| ID| date|
+---+----------+
| 1|2017-10-22|
| 1|2017-11-01|
| 1|2017-10-20|
| 2|2017-11-12|
| 2|2017-10-03|
| 3|2017-10-16|
+---+----------+
So far, I tried to use the sample method for DataFrame: https://spark.apache.org/docs/1.5.0/api/java/org/apache/spark/sql/DataFrame.html
Here is an example that would work for an entire data frame.
def sampleDF(DF: DataFrame, datesToUse: Int, totalDates: Int): DataFrame = {
  val fraction = datesToUse / totalDates.toFloat.toDouble
  DF.sample(false, fraction)
}
I can't figure out how to use something like this for each group. I tried joining the planDF table to the idDF table and using a window partition.
Another idea I had was to somehow make a new column with randomly labeled True / false and then filter on that column.
Another option staying entirely in Dataframes would be to compute probabilities using your planDF, join with idDF, append a column of random numbers and then filter. Helpfully, sql.functions has a rand function.
import org.apache.spark.sql.functions._
import spark.implicits._
val probabilities = planDF.withColumn("prob", $"datesToUse" / $"totalDates")
val dfWithProbs = idDF.join(probabilities, Seq("ID"))
  .withColumn("rand", rand())
  .where($"rand" < $"prob")
(You'll want to double check that that isn't integer division.)
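If you want to be explicit about it rather than rely on Spark's division semantics, a cast settles the question; this is just a variation on the line above:
// Explicitly cast before dividing so the probability is computed as a double.
val probabilities = planDF.withColumn(
  "prob", $"datesToUse".cast("double") / $"totalDates")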
With the assumption that your planDF is small enough to be collected, you can use Scala's foldLeft to traverse the id list and accumulate the sample Dataframes per id:
import org.apache.spark.sql.{DataFrame, Row}

def sampleByIdDF(DF: DataFrame, id: Int, datesToUse: Int, totalDates: Int): DataFrame = {
  val fraction = datesToUse.toDouble / totalDates
  DF.where($"id" === id).sample(false, fraction)
}
val emptyDF = Seq.empty[(Int, String)].toDF("ID", "date")
val planList = planDF.rdd.collect.map{ case Row(x: Int, y: Int, z: Int) => (x, y, z) }
// planList: Array[(Int, Int, Int)] = Array((1,3,7), (2,2,4), (3,1,6))
planList.foldLeft(emptyDF) {
  case (accDF: DataFrame, (id: Int, num: Int, total: Int)) =>
    accDF union sampleByIdDF(idDF, id, num, total)
}
// res1: org.apache.spark.sql.DataFrame = [ID: int, date: string]
// res1.show
// +---+----------+
// | ID| date|
// +---+----------+
// | 1|2017-10-03|
// | 1|2017-11-01|
// | 1|2017-10-02|
// | 1|2017-12-24|
// | 1|2017-10-20|
// | 2|2017-11-17|
// | 2|2017-11-12|
// | 2|2017-12-02|
// | 3|2017-11-21|
// | 3|2017-12-13|
// +---+----------+
Note that method sample() does not necessarily generate the exact number of samples specified in the method arguments. Here's a relevant SO Q&A.
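If an exact number of rows per group is required rather than the approximate count sample() gives, one common workaround (a sketch, not part of the answer above) is to rank rows randomly within each group and keep the first N:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rand, row_number}

// Random ordering within each ID, then keep exactly datesToUse rows per group.
val w = Window.partitionBy("ID").orderBy(rand())

idDF.join(planDF, Seq("ID"))
  .withColumn("rn", row_number().over(w))
  .where($"rn" <= $"datesToUse")
  .drop("rn", "datesToUse", "totalDates")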
If your planDF is large, you might have to consider using RDD's aggregate, which has the following signature (skipping the implicit argument):
def aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U): U
It works somewhat like foldLeft, except that it has one accumulation operator within a partition and an additional one to combine results from different partitions.
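As a toy illustration of that signature, unrelated to the sampling problem itself, here is a sum-and-count computed in one pass; the first function folds elements within a partition and the second merges the per-partition results:
// zeroValue = (0, 0); seqOp folds each element into the partition-local (sum, count);
// combOp merges the (sum, count) pairs coming from different partitions.
val nums = spark.sparkContext.parallelize(1 to 100)
val (total, count) = nums.aggregate((0, 0))(
  (acc, x) => (acc._1 + x, acc._2 + 1),
  (a, b) => (a._1 + b._1, a._2 + b._2)
)
// total = 5050, count = 100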

How to join and reduce two datasets with arrays?

I need an idea for how to join two datasets with millions of arrays. Each dataset will have Longs numbered 1-10,000,000, but with different groupings in each one, e.g. [1, 2], [3, 4] in one and [1], [2, 3], [4] in the other; the output should be [1, 2, 3, 4].
I need some way to join these sets efficiently.
I have tried an approach where I explode and group by multiple times, finally sorting and distincting the arrays. This works on small sets but is very inefficient for large sets because it explodes the number of rows many times over.
Any ideas on how to use another approach, like a reducer or aggregation, to solve this problem more efficiently?
The following is a Scala code example; however, I would need an approach that works in Java as well.
val rdd1 = spark.sparkContext.makeRDD(Array("""{"groupings":[1,2,3]}""", """{"groupings":[4,5,6]}""", """{"groupings":[7,8,9]}""", """{"groupings":[10]}""", """{"groupings":[11]}"""))
val rdd2 = spark.sparkContext.makeRDD(Array("""{"groupings":[1]}""", """{"groupings":[2,3,4]}""", """{"groupings":[7,8]}""", """{"groupings":[9]}""", """{"groupings":[10,11]}"""))
val srdd1 = spark.read.json(rdd1)
val srdd2 = spark.read.json(rdd2)
Dataset 1:
+---------+
|groupings|
+---------+
|[1, 2, 3]|
|[4, 5, 6]|
|[7, 8, 9]|
| [10]|
| [11]|
+---------+
Dataset 2:
+---------+
|groupings|
+---------+
| [1]|
|[2, 3, 4]|
| [7, 8]|
| [9]|
| [10, 11]|
+---------+
Output should be
+------------------+
| groupings|
+------------------+
|[1, 2, 3, 4, 5, 6]|
| [7, 8, 9]|
| [10, 11]|
+------------------+
Update:
This was my original code, which I had problems running. @AyanGuha had me thinking that perhaps it would be simpler to just use a series of joins instead; I am testing that now and will post a solution if it works out.
srdd1.union(srdd2).withColumn("temp", explode(col("groupings")))
.groupBy("temp")
.agg(collect_list("groupings").alias("groupings"))
.withColumn("groupings", callUDF("distinctLongArray", callUDF("flattenDistinctLongArray", col("groupings"))))
.withColumn("temp", explode(col("groupings")))
.groupBy("temp")
.agg(collect_list("groupings").alias("groupings"))
.withColumn("groupings", callUDF("distinctLongArray", callUDF("flattenDistinctLongArray", col("groupings"))))
.withColumn("temp", explode(col("groupings")))
.groupBy("temp")
.agg(collect_list("groupings").alias("groupings"))
.withColumn("groupings", callUDF("distinctLongArray", callUDF("flattenDistinctLongArray", col("groupings"))))
.select(callUDF("sortLongArray", col("groupings")).alias("groupings"))
.distinct()
What this code showed was that after 3 iterations the data coalesced, ideally then 3 joins would do the same.
Update 2:
Looks like I have a new working version, still seems inefficient but I think this will be handled better by spark.
val ardd1 = spark.sparkContext.makeRDD(Array("""{"groupings":[1,2,3]}""", """{"groupings":[4,5,6]}""", """{"groupings":[7,8,9]}""", """{"groupings":[10]}""", """{"groupings":[11,12]}""", """{"groupings":[13,14]}"""))
val ardd2 = spark.sparkContext.makeRDD(Array("""{"groupings":[1]}""", """{"groupings":[2,3,4]}""", """{"groupings":[7,8]}""", """{"groupings":[9]}""", """{"groupings":[10,11]}""", """{"groupings":[12,13]}""", """{"groupings":[14, 15]}"""))
var srdd1 = spark.read.json(ardd1)
var srdd2 = spark.read.json(ardd2)
val addUDF = udf((x: Seq[Long], y: Seq[Long]) => if(y == null) x else (x ++ y).distinct.sorted)
val encompassUDF = udf((x: Seq[Long], y: Seq[Long]) => if(x.size == y.size) false else (x diff y).size == 0)
val arrayContainsAndDiffUDF = udf((x: Seq[Long], y: Seq[Long]) => (x.intersect(y).size > 0) && (y diff x).size > 0)
var rdd1 = srdd1
var rdd2 = srdd2.withColumnRenamed("groupings", "groupings2")
for (i <- 1 to 3) {
  rdd1 = rdd1.join(rdd2, arrayContainsAndDiffUDF(col("groupings"), col("groupings2")), "left")
    .select(addUDF(col("groupings"), col("groupings2")).alias("groupings"))
    .distinct
    .alias("rdd1")
  rdd2 = rdd1.select(col("groupings").alias("groupings2")).alias("rdd2")
}
rdd1.join(rdd2, encompassUDF(col("groupings"), col("groupings2")), "leftanti")
  .show(10, false)
Outputs:
+------------------------+
|groupings |
+------------------------+
|[10, 11, 12, 13, 14, 15]|
|[1, 2, 3, 4, 5, 6] |
|[7, 8, 9] |
+------------------------+
I will try this at scale and see what I get.
This works on small sets but is very inefficient for large sets because it explodes the number of rows many times over.
I don't think you have other options than exploding the arrays, joining, and then taking distinct. Spark is fairly good at such computations and tries to do as much of them as possible using internal binary rows. The datasets are compressed and comparisons are often done at byte level (outside the JVM).
It's then just a matter of having enough memory to hold all the elements, which may not be that big a deal.
I'd recommend giving your solution a try and check out the physical plan and the stats. It could in the end turn out to be the only available solution.
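For what it's worth, here is a minimal sketch of that explode-plus-join shape, using Spark 2.4+ array built-ins in place of the asker's custom UDFs. As the question itself observed, one pass is usually not enough, so this would have to be repeated until the groupings stop changing:
import org.apache.spark.sql.functions._

// One merge pass: explode every grouping to (element, grouping), join groupings that
// share an element, and collapse each join group back into one sorted, distinct array.
val exploded = srdd1.union(srdd2).withColumn("elem", explode(col("groupings")))

val onePass = exploded.alias("a")
  .join(exploded.alias("b"), col("a.elem") === col("b.elem"))
  .groupBy(col("a.groupings"))
  .agg(array_sort(array_distinct(flatten(collect_list(col("b.groupings"))))).as("merged"))
  .select(col("merged").as("groupings"))
  .distinct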
Here is an alternate solution using ARRAY data type that is supported as part of HiveQL. This will at least make your coding simple [i.e. building out the logic]. Code below assumes that the raw data is in a text file.
Step 1. Create table
create table array_table1(array_col1 array<int>) ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY ','
LINES TERMINATED BY '\n' STORED AS TEXTFILE;
Step 2: Load data into both tables
LOAD DATA INPATH '/path/to/file' OVERWRITE INTO TABLE array_table1;
Step 3: Apply sql functions to get results
select distinct(explode(array_col1)) from array_table1 union
select distinct(explode(array_col2)) from array_table2
I am not clear from the example what final output you are looking for. Is it just a union of all distinct numbers, or are they supposed to keep a grouping? Either way, with the tables created you can use a combination of distinct, explode(), left anti join and union to get the expected results.
You may want to optimize this code to filter the final data set again for duplicates.
Hope that helps!
OK I finally figured it out.
First of all with my array joins I was doing something very wrong, which I overlooked initially.
When joining two arrays on an equivalence, e.g. does [1,2,3] equal [1,2,3]?, the arrays can be hashed. I was instead doing an intersection match using a UDF: given x in [1,2,3], is any x in [1, 2, 3, 4, 5]? This cannot be hashed and therefore requires a plan which checks every row against every row.
So to do this you have to explode both arrays first, then join them.
You can then apply other criteria. For example, I saved time by only joining arrays which were not equal and whose sum was less than the other's.
Example with a self join:
rdd2 = rdd2.withColumn("single", explode(col("grouping"))) // Explode the grouping
temp = rdd2.withColumnRenamed("grouping", "grouping2").alias("temp") // Alias for self join
rdd2 = rdd2.join(temp, rdd2.col("single").equalTo(temp.col("single")) // Compare singles which will be hashed
.and(col("grouping").notEqual(col("grouping2"))) // Apply further conditions
.and(callUDF("lessThanArray", col("grouping"), col("grouping2"))) // Make it so only [1,2,3] [4,5,6] is joined and not the duplicate [4,5,6] [1,2,3], for efficiency
, "left") // Left so that the efficiency criteria do not drop rows
I then grouped by the grouping that was joined against, and aggregated the groupings from the self join.
rdd2.groupBy("grouping")
.agg(callUDF("list_agg",col('grouping2')).alias('grouping2')) // List agg is a UserDefinedAggregateFunction which aggregates lists into a distinct list
.select(callUDF("addArray", col("grouping"), col("grouping2")).alias("grouping")) // AddArray is a udf which concats and distincts 2 arrays
grouping   grouping2
[1,2,3]    [3,4,5]
[1,2,3]    [2,6,7]
[1,2,3]    [2,8,9]
becomes just
[1,2,3]    [2,3,4,5,6,7,8,9]
and after addArray:
[1,2,3,4,5,6,7,8,9]
I then iterated that code 3 times, which seems to make everything coalesce and threw in a distinct for good measure.
Notice that in the original question I had two datasets; for my specific problem I discovered some assumptions about the first and second set. The first set I could assume had no duplicates, as it was a master list; the second set had duplicates, hence I only needed to apply the above code to the second set and then join it with the first. I would assume that if both sets had duplicates they could be unioned together first.

How to convert a SQL query output (dataframe) into an array list of key value pairs in Spark Scala?

I created a dataframe in spark scala shell for SFPD incidents. I queried the data for Category count and the result is a datafame. I want to plot this data into a graph using Wisp. Here is my dataframe,
+--------------+--------+
| Category|catcount|
+--------------+--------+
| LARCENY/THEFT| 362266|
|OTHER OFFENSES| 257197|
| NON-CRIMINAL| 189857|
| ASSAULT| 157529|
| VEHICLE THEFT| 109733|
| DRUG/NARCOTIC| 108712|
| VANDALISM| 91782|
| WARRANTS| 85837|
| BURGLARY| 75398|
|SUSPICIOUS OCC| 64452|
+--------------+--------+
I want to convert this dataframe into an arraylist of key value pairs. So I want the result to look like this, with (String, Int) type:
(LARCENY/THEFT,362266)
(OTHER OFFENSES,257197)
(NON-CRIMINAL,189857)
(ASSAULT,157529)
(VEHICLE THEFT,109733)
(DRUG/NARCOTIC,108712)
(VANDALISM,91782)
(WARRANTS,85837)
(BURGLARY,75398)
(SUSPICIOUS OCC,64452)
I tried converting this dataframe (t) into an RDD with val rddt = t.rdd, and then used flatMapValues:
rddt.flatMapValues(x=>x).collect()
but still couldn't get the required result.
Or is there a way to directly give the dataframe output into Wisp?
In pyspark it'd be as below. Scala will be quite similar.
Creating test data
rdd = sc.parallelize([(0,1), (0,1), (0,2), (1,2), (1,1), (1,20), (3,18), (3,18), (3,18)])
df = sqlContext.createDataFrame(rdd, ["id", "score"])
Mapping the test data, reformatting from an RDD of Rows to an RDD of tuples. Then, using collect to extract all the tuples as a list.
df.rdd.map(lambda x: (x[0], x[1])).collect()
[(0, 1), (0, 1), (0, 2), (1, 2), (1, 1), (1, 20), (3, 18), (3, 18), (3, 18)]
Here's the Scala Spark Row documentation that should help you convert this to Scala Spark code
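For completeness, a direct Scala equivalent of the pyspark snippet above would look roughly like this, assuming Category is a String and catcount a Long (as a count would produce):
// Collect the dataframe t as an Array[(String, Long)] of (Category, catcount) pairs.
val pairs: Array[(String, Long)] =
  t.rdd.map(row => (row.getString(0), row.getLong(1))).collect()

// Or, with spark.implicits._ in scope, via the Dataset API:
// val pairs = t.as[(String, Long)].collect()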