How to extract subArray from Array[Array[Int]] column DataFrame - scala

I have a Dataframe like this:
+---------------------------------------------------------------------+
|ARRAY |
+---------------------------------------------------------------------+
|[WrappedArray(1, 2, 3), WrappedArray(4, 5, 6), WrappedArray(7, 8, 9)]|
+---------------------------------------------------------------------+
I use this code to create it:
case class MySchema(arr: Array[Array[Int]])
val df = sc.parallelize(Seq(
Array(Array(1,2,3),
Array(4,5,6),
Array(7,8,9))))
.map(x => MySchema(x))
.toDF("ARRAY")
I would like to get a result like this:
+-----------+
|ARRAY     |
+-----------+
|[1, 2, 3] |
|[4, 5, 6] |
|[7, 8, 9] |
+-----------+
Do you have any idea?
I already tried calling a UDF that does flatMap(x => x) on my array column, but I get an incorrect result:
+---------------------------+
|ARRAY |
+---------------------------+
|[1, 2, 3, 4, 5, 6, 7, 8, 9]|
+---------------------------+
Thank you for your help

You can explode:
import org.apache.spark.sql.functions.{col, explode}
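// explode emits one row per inner array; the alias keeps the original column name
df.select(explode(col("ARRAY")).as("ARRAY")).show()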
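The difference between the two results can be sketched with plain Scala collections, without Spark. This is only a model of the semantics, and the object and method names are made up for illustration: the flatMap UDF merges the inner arrays of one row into a single array, while explode turns each inner array into a row of its own.

```scala
// Plain-collections model (no Spark); all names here are hypothetical.
object ExplodeModel {
  // One "row" stands for the Array[Array[Int]] value of the ARRAY column.
  type Row = Seq[Seq[Int]]

  // What the flatMap(x => x) UDF computes per row: one merged array.
  def flattenRow(row: Row): Seq[Int] = row.flatMap(x => x)

  // What explode does to the column: one output row per inner array.
  def explodeRows(rows: Seq[Row]): Seq[Seq[Int]] = rows.flatMap(row => row)
}
```

Calling ExplodeModel.explodeRows on a single row holding the three inner arrays returns those three arrays as separate elements, which is the shape asked for in the question.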

Related

Looking to get counts of items within ArrayType column without using Explode

NOTE: I'm working with Spark 2.4
Here is my dataset:
df
col
[1,3,1,4]
[1,1,1,2]
I'd like to essentially get a value_counts of the values in the array. The resulting df would look like this:
df_upd
col
[{1:2},{3:1},{4:1}]
[{1:3},{2:1}]
I know I can do this by exploding df and then taking a group by but I'm wondering if I can do this without exploding.
Here's a solution using a udf that outputs the result as a MapType. It expects integer values in your arrays (easily changed) and returns integer counts.
from pyspark.sql import functions as F
from pyspark.sql import types as T
df = sc.parallelize([([1, 2, 3, 3, 1],),([4, 5, 6, 4, 5],),([2, 2, 2],),([3, 3],)]).toDF(['arrays'])
df.show()
+---------------+
| arrays|
+---------------+
|[1, 2, 3, 3, 1]|
|[4, 5, 6, 4, 5]|
| [2, 2, 2]|
| [3, 3]|
+---------------+
from collections import Counter
@F.udf(returnType=T.MapType(T.IntegerType(), T.IntegerType(), valueContainsNull=False))
def count_elements(array):
    # Counter maps each distinct value to its number of occurrences
    return dict(Counter(array))
df.withColumn('counts', count_elements(F.col('arrays'))).show(truncate=False)
+---------------+------------------------+
|arrays |counts |
+---------------+------------------------+
|[1, 2, 3, 3, 1]|[1 -> 2, 2 -> 1, 3 -> 2]|
|[4, 5, 6, 4, 5]|[4 -> 2, 5 -> 2, 6 -> 1]|
|[2, 2, 2] |[2 -> 3] |
|[3, 3] |[3 -> 2] |
+---------------+------------------------+

How to find sum of arrays in a column which is grouped by another column values in a spark dataframe using scala

I have a dataframe like below
c1 Value
A Array[47,97,33,94,6]
A Array[59,98,24,83,3]
A Array[77,63,93,86,62]
B Array[86,71,72,23,27]
B Array[74,69,72,93,7]
B Array[58,99,90,93,41]
C Array[40,13,85,75,90]
C Array[39,13,33,29,14]
C Array[99,88,57,69,49]
I need an output as below.
c1 Value
A Array[183,258,150,263,71]
B Array[218,239,234,209,75]
C Array[178,114,175,173,153]
That is, group by column c1 and sum the arrays in column Value element by element.
Please help; I couldn't find a way to do this by searching on Google.
It is not very complicated. As you mention, you can simply group by "c1" and aggregate the values of the arrays index by index.
Let's first generate some data:
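import org.apache.spark.sql.functions._
import spark.implicits._ // enables the 'id symbol syntax for columns
val df = spark.range(6)
  .select('id % 3 as "c1",
    array((1 to 5).map(_ => floor(rand * 10)) : _*) as "Value")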
df.show()
+---+---------------+
| c1| Value|
+---+---------------+
| 0|[7, 4, 7, 4, 0]|
| 1|[3, 3, 2, 8, 5]|
| 2|[2, 1, 0, 4, 4]|
| 0|[0, 4, 2, 1, 8]|
| 1|[1, 5, 7, 4, 3]|
| 2|[2, 5, 0, 2, 2]|
+---+---------------+
Then we need to iterate over the values of the array so as to aggregate them. It is very similar to the way we created them:
val n = 5 // if you know the size of the array
// If you do not, read it from the first row instead:
// val n = df.select(size('Value)).first.getAs[Int](0)
df
.groupBy("c1")
.agg(array((0 until n).map(i => sum(col("Value").getItem(i))) :_* ) as "Value")
.show()
+---+------------------+
| c1| Value|
+---+------------------+
| 0|[11, 18, 15, 8, 9]|
| 1| [2, 10, 5, 7, 4]|
| 2|[7, 14, 15, 10, 4]|
+---+------------------+
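The index-by-index aggregation can be checked against the numbers in the question with plain Scala collections. This is a sketch without Spark, and GroupedArraySum and sumByKey are hypothetical names; it assumes every array under a given key has the same length.

```scala
// Plain-collections model (no Spark); all names here are hypothetical.
object GroupedArraySum {
  // Sums the arrays of each key position by position.
  // Assumes all arrays under one key have the same length
  // (transpose throws an exception otherwise).
  def sumByKey(rows: Seq[(String, Seq[Int])]): Map[String, Seq[Int]] =
    rows
      .groupBy { case (key, _) => key }
      .map { case (key, group) =>
        // transpose regroups the values into one sequence per index
        key -> group.map(_._2).transpose.map(_.sum)
      }
}
```

Applied to the three "A" rows of the question, sumByKey gives 183, 258, 150, 263, 71 for key "A", matching the expected output.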

Spark Dataframe - Get all lists of pairs (Scala)

I have the following situation:
I have a dataframe with a single array column. For each array, I want to get all the pairs of its elements and save them in a new dataframe. For example:
This is the original dataframe:
+---------------+
| candidateList|
+---------------+
| [1, 2]|
| [2, 3, 4]|
| [1, 3, 5]|
|[1, 2, 3, 4, 5]|
|[1, 2, 3, 4, 5]|
+---------------+
And this is how it has to look after the computation:
+---------------+
| candidates |
+---------------+
| [1, 2]|
| [2, 3]|
| [2, 4]|
| [3, 4]|
| [1, 3]|
| [1, 5]|
| [3, 5]|
|and so on... |
+---------------+
I really don't know how this is possible in spark, maybe someone has a tip for me.
Kind regards
You'll need to create a UDF (User Defined Function) and use it with explode function. The UDF itself is simple thanks to Scala collection's combinations method:
import scala.collection.mutable
import org.apache.spark.sql.functions._
import spark.implicits._
val pairsUdf = udf((arr: mutable.Seq[Int]) => arr.combinations(2).toArray)
val result = df.select(explode(pairsUdf($"candidateList")) as "candidates")
result.show(numRows = 8)
// +----------+
// |candidates|
// +----------+
// | [1, 2]|
// | [2, 3]|
// | [2, 4]|
// | [3, 4]|
// | [1, 3]|
// | [1, 5]|
// | [3, 5]|
// | [1, 2]|
// +----------+
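What the UDF body produces for one array can be tried outside Spark. The sketch below uses only the standard library; PairsModel and pairs are hypothetical names introduced for illustration.

```scala
// Plain-collections model (no Spark); all names here are hypothetical.
object PairsModel {
  // All 2-element combinations of a list, in the order
  // that the standard combinations method yields them.
  def pairs(candidates: Seq[Int]): List[Seq[Int]] =
    candidates.combinations(2).toList
}
```

For Seq(2, 3, 4) this returns the pairs (2, 3), (2, 4) and (3, 4), and a list with fewer than two elements yields no pairs at all. Note that combinations works on values: if an array contains the same number more than once, the repeated value does not produce repeated pairs.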

How can I compute the average vector in a Spark Dataset w. Scala? [duplicate]

This question already has answers here:
How to find mean of grouped Vector columns in Spark SQL?
(2 answers)
Closed 4 years ago.
Let's say that I have a dataset in Apache Spark as follows:
+---+--------------------+
| id| vec|
+---+--------------------+
| 0|[1, 2, 3, 4] |
| 0|[2, 3, 4, 5] |
| 0|[6, 7, 8, 9] |
| 1|[1, 2, 3, 4] |
| 1|[5, 6, 7, 8] |
+---+--------------------+
And the vec is a List of Doubles.
How can I create a dataset from this that contains the ids and the average of the vectors associated with that id, like so:
+---+--------------------+
| id| vec|
+---+--------------------+
| 0|[3, 4, 5, 6] |
| 1|[3, 4, 5, 6] |
+---+--------------------+
Thanks in advance!
Create a case class that matches the input schema of the Dataset.
Then group the Dataset by id and use foldLeft to accumulate, for each group, the average at each index of its vectors.
scala> case class Test(id: Int, vec: List[Double])
defined class Test
scala> val inputList = List(
| Test(0, List(1, 2, 3, 4)),
| Test(0, List(2, 3, 4, 5)),
| Test(0, List(6, 7, 8, 9)),
| Test(1, List(1, 2, 3, 4)),
| Test(1, List(5, 6, 7, 8)))
inputList: List[Test] = List(Test(0,List(1.0, 2.0, 3.0, 4.0)), Test(0,List(2.0, 3.0, 4.0, 5.0)), Test(0,List(6.0, 7.0, 8.0, 9.0)), Test(1,List(1.0, 2.0, 3.0, 4.0)), Test(1,List(5.0, 6.0, 7.0, 8.0)))
scala>
scala> import spark.implicits._
import spark.implicits._
scala> val ds = inputList.toDF.as[Test]
ds: org.apache.spark.sql.Dataset[Test] = [id: int, vec: array<double>]
scala> ds.show(false)
+---+--------------------+
|id |vec |
+---+--------------------+
|0 |[1.0, 2.0, 3.0, 4.0]|
|0 |[2.0, 3.0, 4.0, 5.0]|
|0 |[6.0, 7.0, 8.0, 9.0]|
|1 |[1.0, 2.0, 3.0, 4.0]|
|1 |[5.0, 6.0, 7.0, 8.0]|
+---+--------------------+
scala>
scala> val outputDS = ds.groupByKey(_.id).mapGroups {
| case (key, valuePairs) =>
| val vectors = valuePairs.map(_.vec).toArray
| // number of vectors in this group (the divisor for the average)
| val len = vectors.length
| // get average for each index in vectors
| val avg = vectors.head.indices.foldLeft(List[Double]()) {
| case (acc, idx) =>
| val sumOfIdx = vectors.map(_ (idx)).sum
| acc :+ (sumOfIdx / len)
| }
| Test(key, avg)
| }
outputDS: org.apache.spark.sql.Dataset[Test] = [id: int, vec: array<double>]
scala> outputDS.show(false)
+---+--------------------+
|id |vec |
+---+--------------------+
|1 |[3.0, 4.0, 5.0, 6.0]|
|0 |[3.0, 4.0, 5.0, 6.0]|
+---+--------------------+
Hope this helps!
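The per-group computation can also be written with transpose instead of foldLeft. This is a sketch with plain Scala collections, and VectorMean and mean are hypothetical names; it assumes all vectors in a group have the same length.

```scala
// Plain-collections model (no Spark); all names here are hypothetical.
object VectorMean {
  // Element-wise mean of a group of vectors.
  // Assumes all vectors have the same length
  // (transpose throws an exception otherwise).
  def mean(vectors: Seq[List[Double]]): List[Double] =
    // transpose regroups the values by index; each group is then averaged
    vectors.toList.transpose.map(_.sum / vectors.length)
}
```

Inside mapGroups, a call such as VectorMean.mean(valuePairs.map(_.vec).toList) could stand in for the foldLeft; for the three vectors of id 0 it gives 3.0, 4.0, 5.0, 6.0.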

Spark Dataframe Arraytype columns

I would like to create a new column on a dataframe, which is the result of applying a function to an arraytype column.
Something like this:
df = df.withColumn("max_$colname", max(col(colname)))
where each row of the column holds an array of values?
The functions in org.apache.spark.sql.functions appear to work on a column basis only.
You can apply user defined functions on the array column.
1.DataFrame
+------------------+
| arr|
+------------------+
| [1, 2, 3, 4, 5]|
|[4, 5, 6, 7, 8, 9]|
+------------------+
2.Creating UDF
import org.apache.spark.sql.functions._
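// named arrayMax so that it does not shadow functions.max imported above
def arrayMax(arr: Seq[Int]): Int = arr.max
// Spark passes an array column to a Scala UDF as a Seq
val maxUDF = udf(arrayMax _)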
3.Applying UDF in query
df.withColumn("arrMax",maxUDF(df("arr"))).show
4.Result
+------------------+------+
| arr|arrMax|
+------------------+------+
| [1, 2, 3, 4, 5]| 5|
|[4, 5, 6, 7, 8, 9]| 9|
+------------------+------+