Add a column value to an array in another column - scala

Source JSON data:
{"ID": "ABC", "Amt": 23077, "col": [{"Seq": 1, "Pct": 1.5, "Sh": 1},{"Seq": 2, "Pct": 1.2, "Sh": 2.5}]}
With the following structure:
ID: string
Amt: long
col: array
  element: struct
    Seq: int
    Pct: double
    Sh: double
I have a DataFrame with the following output:
+----+-------+-----------------------------+
|ID |Amt |col |
+----+-------+-----------------------------+
|ABC |23077 |[[1, 1.5, 1], [2, 1.2, 2.5]] |
+----+-------+-----------------------------+
I need to append the Amt value to the end of each struct element in the col array:
+----+-------+-------------------------------------------+
|ID |Amt |col1 |
+----+-------+-------------------------------------------+
|ABC |23077 |[[1, 1.5, 1, 23077], [2, 1.2, 2.5, 23077]] |
+----+-------+-------------------------------------------+

If your Spark version is 2.4 or later, you can use transform to rebuild each struct with Amt appended:
val df2 = df.selectExpr(
  "Amt",
  "ID",
  "transform(col, x -> struct(x.Seq as Seq, x.Pct as Pct, x.Sh as Sh, Amt)) as col1"
)
df2.show(false)
+-----+---+--------------------------------------------+
|Amt |ID |col1 |
+-----+---+--------------------------------------------+
|23077|ABC|[[1, 1.5, 1.0, 23077], [2, 1.2, 2.5, 23077]]|
+-----+---+--------------------------------------------+
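For older Spark versions, you can explode the array of structs and reconstruct them (note that collect_list does not guarantee element order after the shuffle):
import org.apache.spark.sql.functions.{col, collect_list, struct}
// inline() expands each struct into one row with columns Seq, Pct, Sh
val df2 = df.selectExpr("Amt", "ID", "inline(col)")
  .groupBy("ID", "Amt")
  .agg(collect_list(struct(col("Seq"), col("Pct"), col("Sh"), col("Amt"))).as("col1"))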
df2.show(false)
+---+-----+--------------------------------------------+
|ID |Amt |col1 |
+---+-----+--------------------------------------------+
|ABC|23077|[[1, 1.5, 1.0, 23077], [2, 1.2, 2.5, 23077]]|
+---+-----+--------------------------------------------+
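To make the per-element rewrite concrete, here is a minimal plain-Scala sketch (no Spark) of what the transform expression does to one row. The case classes Elem, ElemWithAmt and Record and the helper appendAmt are hypothetical names used only for illustration; they are not part of the question or of Spark.

```scala
// Hypothetical plain-Scala model of one row; not Spark types.
case class Elem(seq: Int, pct: Double, sh: Double)
case class ElemWithAmt(seq: Int, pct: Double, sh: Double, amt: Long)
case class Record(id: String, amt: Long, col: Seq[Elem])

// Mirrors transform(col, x -> struct(x.Seq, x.Pct, x.Sh, Amt)):
// every element is rebuilt with the row-level amt appended.
def appendAmt(r: Record): Seq[ElemWithAmt] =
  r.col.map(e => ElemWithAmt(e.seq, e.pct, e.sh, r.amt))

val sample = Record("ABC", 23077L, Seq(Elem(1, 1.5, 1.0), Elem(2, 1.2, 2.5)))
val appended = appendAmt(sample)
```

The lambda can see both the element (x) and the enclosing row's Amt, which is why no join or explode is needed in the Spark 2.4+ version.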

Related

How to get all combinations of an array column in Spark?

Suppose I have an array column group_ids:
+-------+----------+
|user_id|group_ids |
+-------+----------+
|1 |[5, 8] |
|3 |[1, 2, 3] |
|2 |[1, 4] |
+-------+----------+
Schema:
root
|-- user_id: integer (nullable = false)
|-- group_ids: array (nullable = false)
| |-- element: integer (containsNull = false)
I want to get all combinations of pairs:
+-------+------------------------+
|user_id|group_ids |
+-------+------------------------+
|1 |[[5, 8]] |
|3 |[[1, 2], [1, 3], [2, 3]]|
|2 |[[1, 4]] |
+-------+------------------------+
So far, the simplest solution I have found uses a UDF:
spark.udf.register("permutate", udf((xs: Seq[Int]) => xs.combinations(2).toSeq))
dataset.withColumn("group_ids", expr("permutate(group_ids)"))
What I'm looking for is something implemented with Spark built-in functions. Is there a way to do the same thing without a UDF?
Some higher-order functions can do the trick. This requires Spark >= 2.4.
val df2 = df.withColumn(
  "group_ids",
  expr("""
    filter(
      transform(
        flatten(
          transform(
            group_ids,
            x -> arrays_zip(
              array_repeat(x, size(group_ids)),
              group_ids
            )
          )
        ),
        x -> array(x['0'], x['group_ids'])
      ),
      x -> x[0] < x[1]
    )
  """)
)
df2.show(false)
+-------+------------------------+
|user_id|group_ids |
+-------+------------------------+
|1 |[[5, 8]] |
|3 |[[1, 2], [1, 3], [2, 3]]|
|2 |[[1, 4]] |
+-------+------------------------+
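The nested expression above is easier to follow as plain Scala. The sketch below (no Spark; pairsByValue is a hypothetical helper name) pairs every element with every element, as the inner transform, arrays_zip and flatten do, and then keeps only pairs whose first value is smaller than the second, as the final filter does.

```scala
// Cartesian product of the array with itself, then keep pairs with a < b.
def pairsByValue(xs: Seq[Int]): List[List[Int]] =
  (for {
    a <- xs
    b <- xs
    if a < b
  } yield List(a, b)).toList

val fromProduct = pairsByValue(Seq(1, 2, 3))
val fromCombinations = Seq(1, 2, 3).combinations(2).map(_.toList).toList
```

For distinct values in ascending order the two results agree. Because the filter compares values rather than positions, they can differ otherwise: an unsorted input yields pairs ordered by value, and duplicate values never form a pair with each other.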
You can get the maximum size of the group_ids column, then use combinations over the index range 1 to maxSize with when expressions to build the pair sub-arrays from the original array, and finally filter out the null elements from the resulting array:
val maxSize = df.select(max(size($"group_ids"))).first.getAs[Int](0)
val newCol = (1 to maxSize).combinations(2)
  .map(c =>
    when(
      size($"group_ids") >= c(1),
      array(element_at($"group_ids", c(0)), element_at($"group_ids", c(1)))
    )
  ).toSeq
df.withColumn("group_ids", array(newCol: _*))
  .withColumn("group_ids", expr("filter(group_ids, x -> x is not null)"))
  .show(false)
//+-------+------------------------+
//|user_id|group_ids |
//+-------+------------------------+
//|1 |[[5, 8]] |
//|3 |[[1, 2], [1, 3], [2, 3]]|
//|2 |[[1, 4]] |
//+-------+------------------------+
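The same index-based idea can be sketched in plain Scala (no Spark). The helper pairsByIndex is a hypothetical name; it enumerates 1-based index pairs up to maxSize and skips any pair whose larger index falls outside the array, which is the role played by the when guard and the null filter above.

```scala
// Build pairs from 1-based index combinations, skipping out-of-range indexes.
def pairsByIndex(xs: Seq[Int], maxSize: Int): List[List[Int]] =
  (1 to maxSize).combinations(2).flatMap { c =>
    // c(0) < c(1) always holds, so checking c(1) is enough
    if (xs.size >= c(1)) Some(List(xs(c(0) - 1), xs(c(1) - 1))) else None
  }.toList

val shortRow = pairsByIndex(Seq(5, 8), maxSize = 3)
val fullRow = pairsByIndex(Seq(1, 2, 3), maxSize = 3)
```

Unlike the value-based filter, this variant pairs elements by position, so it keeps the original order and also pairs duplicate values.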
A solution based on explode and joins:
val exploded = df.select(col("user_id"), explode(col("group_ids")).as("e"))
// to have combinations
val joined1 = exploded.as("t1")
  .join(exploded.as("t2"), Seq("user_id"), "outer")
  .select(col("user_id"), col("t1.e").as("e1"), col("t2.e").as("e2"))
// to filter out redundant combinations
val joined2 = joined1.as("t1")
  .join(joined1.as("t2"), $"t1.user_id" === $"t2.user_id" && $"t1.e1" === $"t2.e2" && $"t1.e2" === $"t2.e1")
  .where("t1.e1 < t2.e1")
  .select("t1.*")
// group into array
val result = joined2.groupBy("user_id")
  .agg(collect_set(struct("e1", "e2")).as("group_ids"))

Scala - Calculate using last elements for the arrays

I have a DataFrame with the following structure:
ID: string
Amt: long
col: array
  element: struct
    Seq: int
    Pct: double
    Sh: double
DataFrame output:
+----+-------+------------------------------------------+
|ID |Amt |col |
+----+-------+------------------------------------------+
|ABC |23077 |[[1, 1.5, 1, 10000], [2, 1.2, 2.5, 40000]]|
+----+-------+------------------------------------------+
I need to do the following calculation:
The last value of the first array element stays the same (10000).
For each subsequent element, I need to subtract the previous element's last value (40000 - 10000), which gives 30000.
Expected output
+----+-------+-------------------------------------------+
|ID |Amt |col1 |
+----+-------+-------------------------------------------+
|ABC |23077 |[[1, 1.5, 1, 10000], [2, 1.2, 2.5, 30000]] |
+----+-------+-------------------------------------------+
How would I achieve this?
You can use transform with an index argument and subtract the Amt of the previous entry:
val df2 = df.withColumn(
  "col",
  expr("""
    transform(
      col,
      (x, i) -> struct(
        x.Seq as Seq, x.Pct as Pct, x.Sh as Sh,
        case when i=0 then x.Amt else x.Amt - col[i-1].Amt end as Amt
      )
    )
  """)
)
df2.show(false)
+-----+---+--------------------------------------------+
|Amt |ID |col |
+-----+---+--------------------------------------------+
|23077|ABC|[[1, 1.5, 1.0, 10000], [2, 1.2, 2.5, 30000]]|
+-----+---+--------------------------------------------+
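As a rough plain-Scala illustration (no Spark) of the indexed lambda, the sketch below applies the same rule to a sequence of amounts. The helper name deltas is hypothetical. The subtraction always uses the original previous value, just as col[i-1].Amt reads from the untransformed array.

```scala
// First value is kept; every later value has the ORIGINAL previous value subtracted.
def deltas(amts: Seq[Long]): Seq[Long] =
  amts.zipWithIndex.map { case (amt, i) =>
    if (i == 0) amt else amt - amts(i - 1)
  }

val twoElements = deltas(Seq(10000L, 40000L))
val threeElements = deltas(Seq(10000L, 40000L, 45000L))
```

With three elements the last result is 45000 - 40000, not 45000 - 30000, because the lookup goes back to the input rather than to the already-adjusted output.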

How to find sum of arrays in a column which is grouped by another column values in a spark dataframe using scala

I have a dataframe like below
c1 Value
A Array[47,97,33,94,6]
A Array[59,98,24,83,3]
A Array[77,63,93,86,62]
B Array[86,71,72,23,27]
B Array[74,69,72,93,7]
B Array[58,99,90,93,41]
C Array[40,13,85,75,90]
C Array[39,13,33,29,14]
C Array[99,88,57,69,49]
I need an output as below.
c1 Value
A Array[183,258,150,263,71]
B Array[218,239,234,209,75]
C Array[178,114,175,173,153]
In other words, group by column c1 and sum the arrays in column Value element by element (position by position).
Please help; I couldn't find a way of doing this by searching.
It is not very complicated. As you mention, you can simply group by "c1" and aggregate the values of the array index by index.
Let's first generate some data:
import org.apache.spark.sql.functions._
import spark.implicits._
val df = spark.range(6)
  .select('id % 3 as "c1",
    array((1 to 5).map(_ => floor(rand() * 10)): _*) as "Value")
df.show()
+---+---------------+
| c1| Value|
+---+---------------+
| 0|[7, 4, 7, 4, 0]|
| 1|[3, 3, 2, 8, 5]|
| 2|[2, 1, 0, 4, 4]|
| 0|[0, 4, 2, 1, 8]|
| 1|[1, 5, 7, 4, 3]|
| 2|[2, 5, 0, 2, 2]|
+---+---------------+
Then we need to iterate over the values of the array so as to aggregate them. It is very similar to the way we created them:
val n = 5 // if you know the size of the array
// val n = df.select(size('Value)).first.getAs[Int](0) // if you do not: read it from the first row
df
  .groupBy("c1")
  .agg(array((0 until n).map(i => sum(col("Value").getItem(i))): _*) as "Value")
  .show()
+---+------------------+
| c1| Value|
+---+------------------+
| 0|[11, 18, 15, 8, 9]|
| 1| [2, 10, 5, 7, 4]|
| 2|[7, 14, 15, 10, 4]|
+---+------------------+
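To check the index-by-index logic against the data in the question, here is a small plain-Scala sketch (no Spark). The helper sumByKey is a hypothetical name; it groups rows by key and uses transpose to line up the values at each position before summing. It assumes all arrays within a group have the same length, otherwise transpose throws.

```scala
// Sample rows copied from the question.
val rows: Seq[(String, Seq[Int])] = Seq(
  "A" -> Seq(47, 97, 33, 94, 6),
  "A" -> Seq(59, 98, 24, 83, 3),
  "A" -> Seq(77, 63, 93, 86, 62),
  "B" -> Seq(86, 71, 72, 23, 27),
  "B" -> Seq(74, 69, 72, 93, 7),
  "B" -> Seq(58, 99, 90, 93, 41),
  "C" -> Seq(40, 13, 85, 75, 90),
  "C" -> Seq(39, 13, 33, 29, 14),
  "C" -> Seq(99, 88, 57, 69, 49)
)

// Group by key, then sum position by position.
def sumByKey(data: Seq[(String, Seq[Int])]): Map[String, Seq[Int]] =
  data.groupBy(_._1).map { case (key, group) =>
    key -> group.map(_._2).transpose.map(_.sum)
  }

val sums = sumByKey(rows)
```

This reproduces the expected output listed in the question, which is a quick way to confirm that one sum per array index is the aggregation being asked for.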

How can I compute the average vector in a Spark Dataset w. Scala? [duplicate]

This question already has answers here:
How to find mean of grouped Vector columns in Spark SQL?
(2 answers)
Closed 4 years ago.
Let's say that I have a dataset in Apache Spark as follows:
+---+--------------------+
| id| vec|
+---+--------------------+
| 0|[1, 2, 3, 4] |
| 0|[2, 3, 4, 5] |
| 0|[6, 7, 8, 9] |
| 1|[1, 2, 3, 4] |
| 1|[5, 6, 7, 8] |
+---+--------------------+
Here vec is a List of Doubles.
How can I create a dataset from this that contains the ids and the average of the vectors associated with that id, like so:
+---+--------------------+
| id| vec|
+---+--------------------+
| 0|[3, 4, 5, 6] |
| 1|[3, 4, 5, 6] |
+---+--------------------+
Thanks in advance!
First, create a case class that matches the input schema of the Dataset.
Then group the Dataset by id and, within each group, use foldLeft over the vector indices to accumulate the average at each index.
scala> case class Test(id: Int, vec: List[Double])
defined class Test
scala> val inputList = List(
| Test(0, List(1, 2, 3, 4)),
| Test(0, List(2, 3, 4, 5)),
| Test(0, List(6, 7, 8, 9)),
| Test(1, List(1, 2, 3, 4)),
| Test(1, List(5, 6, 7, 8)))
inputList: List[Test] = List(Test(0,List(1.0, 2.0, 3.0, 4.0)), Test(0,List(2.0, 3.0, 4.0, 5.0)), Test(0,List(6.0, 7.0, 8.0, 9.0)), Test(1,List(1.0, 2.0, 3.0, 4.0)), Test(1,List(5.0, 6.0, 7.0, 8.0)))
scala>
scala> import spark.implicits._
import spark.implicits._
scala> val ds = inputList.toDF.as[Test]
ds: org.apache.spark.sql.Dataset[Test] = [id: int, vec: array<double>]
scala> ds.show(false)
+---+--------------------+
|id |vec |
+---+--------------------+
|0 |[1.0, 2.0, 3.0, 4.0]|
|0 |[2.0, 3.0, 4.0, 5.0]|
|0 |[6.0, 7.0, 8.0, 9.0]|
|1 |[1.0, 2.0, 3.0, 4.0]|
|1 |[5.0, 6.0, 7.0, 8.0]|
+---+--------------------+
scala>
scala> val outputDS = ds.groupByKey(_.id).mapGroups {
| case (key, valuePairs) =>
| val vectors = valuePairs.map(_.vec).toArray
| // number of vectors in this group (the divisor for the average)
| val len = vectors.length
| // get average for each index in vectors
| val avg = vectors.head.indices.foldLeft(List[Double]()) {
| case (acc, idx) =>
| val sumOfIdx = vectors.map(_ (idx)).sum
| acc :+ (sumOfIdx / len)
| }
| Test(key, avg)
| }
outputDS: org.apache.spark.sql.Dataset[Test] = [id: int, vec: array<double>]
scala> outputDS.show(false)
+---+--------------------+
|id |vec |
+---+--------------------+
|1 |[3.0, 4.0, 5.0, 6.0]|
|0 |[3.0, 4.0, 5.0, 6.0]|
+---+--------------------+
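The per-group computation inside mapGroups can also be written with transpose instead of foldLeft. The following is a plain-Scala sketch only (no Spark), with averageVector as a hypothetical helper name; it assumes every vector in the group has the same length and that the group is non-empty.

```scala
// Average a group of equal-length vectors position by position.
def averageVector(vectors: Seq[Seq[Double]]): Seq[Double] =
  vectors.transpose.map(column => column.sum / column.size)

val group0 = averageVector(Seq(
  Seq(1.0, 2.0, 3.0, 4.0),
  Seq(2.0, 3.0, 4.0, 5.0),
  Seq(6.0, 7.0, 8.0, 9.0)
))
val group1 = averageVector(Seq(
  Seq(1.0, 2.0, 3.0, 4.0),
  Seq(5.0, 6.0, 7.0, 8.0)
))
```

Either form collects each group's vectors into memory, so both are best suited to groups of modest size.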
Hope this helps!

How to extract subArray from Array[Array[Int]] column DataFrame

I have a Dataframe like this:
+---------------------------------------------------------------------+
|ARRAY |
+---------------------------------------------------------------------+
|[WrappedArray(1, 2, 3), WrappedArray(4, 5, 6), WrappedArray(7, 8, 9)]|
+---------------------------------------------------------------------+
I use this code to create it:
case class MySchema(arr: Array[Array[Int]])
val df = sc.parallelize(Seq(
    Array(Array(1, 2, 3),
          Array(4, 5, 6),
          Array(7, 8, 9))))
  .map(x => MySchema(x))
  .toDF("ARRAY")
I would like to get a result like this:
+-----------+
|ARRAY |
+-----------+
|[1, 2, 3] |
|[4, 5, 6] |
|[7, 8, 9] |
+-----------+
Do you have any idea?
I already tried calling a UDF that does flatMap(x => x) on my array column, but I get an incorrect result:
+---------------------------+
|ARRAY |
+---------------------------+
|[1, 2, 3, 4, 5, 6, 7, 8, 9]|
+---------------------------+
Thank you for your help
You can use explode, which produces one output row per element of the outer array:
import org.apache.spark.sql.functions.{col, explode}
df.select(explode(col("array")))
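The difference between the UDF attempt and explode can be modelled in plain Scala (no Spark). In this sketch the table value is a hypothetical stand-in for the DataFrame: a sequence of rows, each holding one nested array. Flattening inside the cell keeps a single row, while flattening at the row level yields one row per inner array.

```scala
// One table row whose single cell holds a nested array, as in the question.
val table: Seq[Seq[Seq[Int]]] = Seq(
  Seq(Seq(1, 2, 3), Seq(4, 5, 6), Seq(7, 8, 9))
)

// What the flatMap UDF does: flatten inside the cell, still one row.
val udfResult: Seq[Seq[Int]] = table.map(cell => cell.flatten)

// What explode does: one output row per inner array.
val explodeResult: Seq[Seq[Int]] = table.flatMap(cell => cell)
```

So the UDF was applied one level too deep: the flattening needs to happen across rows, not within the array value.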