Suppose I have an array column group_ids:
+-------+----------+
|user_id|group_ids |
+-------+----------+
|1 |[5, 8] |
|3 |[1, 2, 3] |
|2 |[1, 4] |
+-------+----------+
Schema:
root
|-- user_id: integer (nullable = false)
|-- group_ids: array (nullable = false)
| |-- element: integer (containsNull = false)
I want to get all combinations of pairs:
+-------+------------------------+
|user_id|group_ids |
+-------+------------------------+
|1 |[[5, 8]] |
|3 |[[1, 2], [1, 3], [2, 3]]|
|2 |[[1, 4]] |
+-------+------------------------+
So far I created the easiest solution with UDF:
spark.udf.register("permutate", udf((xs: Seq[Int]) => xs.combinations(2).toSeq))
dataset.withColumn("group_ids", expr("permutate(group_ids)"))
What I'm looking for is something implemented with Spark built-in functions. Is there a way to implement the same logic without a UDF?
Some higher order functions can do the trick. Requires Spark >= 2.4.
val df2 = df.withColumn(
  "group_ids",
  expr("""
    filter(
      transform(
        flatten(
          transform(
            group_ids,
            x -> arrays_zip(
              array_repeat(x, size(group_ids)),
              group_ids
            )
          )
        ),
        x -> array(x['0'], x['group_ids'])
      ),
      x -> x[0] < x[1]
    )
  """)
)
df2.show(false)
+-------+------------------------+
|user_id|group_ids |
+-------+------------------------+
|1 |[[5, 8]] |
|3 |[[1, 2], [1, 3], [2, 3]]|
|2 |[[1, 4]] |
+-------+------------------------+
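For intuition, here is a minimal sketch of the same steps on a plain Scala Seq, with no Spark involved. The names CrossProductPairs and pairs are made up for illustration. One caveat, as far as I can tell: because the last step keeps only pairs where the first element is smaller than the second, the result matches xs.combinations(2) only when the array is sorted and has no duplicates.

```scala
object CrossProductPairs {
  // Mirrors the SQL expression step by step on a plain Seq (hypothetical helper).
  def pairs(groupIds: Seq[Int]): Seq[Seq[Int]] =
    groupIds
      // transform + arrays_zip + flatten: build every (x, y) combination
      .flatMap(x => groupIds.map(y => Seq(x, y)))
      // filter: keep a single ordering of each pair and drop (x, x)
      .filter(p => p(0) < p(1))
}
```

For example, CrossProductPairs.pairs(Seq(3, 1)) gives Seq(Seq(1, 3)), while Seq(3, 1).combinations(2) would keep the original order.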
You can get the max size of the column group_ids. Then use combinations on the range (1 to maxSize) with when expressions to build the pair sub-arrays from the original array, and finally filter the null elements out of the resulting array:
val maxSize = df.select(max(size($"group_ids"))).first.getAs[Int](0)
val newCol = (1 to maxSize).combinations(2)
.map(c =>
when(
size($"group_ids") >= c(1),
array(element_at($"group_ids", c(0)), element_at($"group_ids", c(1)))
)
).toSeq
df.withColumn("group_ids", array(newCol: _*))
.withColumn("group_ids", expr("filter(group_ids, x -> x is not null)"))
.show(false)
//+-------+------------------------+
//|user_id|group_ids |
//+-------+------------------------+
//|1 |[[5, 8]] |
//|3 |[[1, 2], [1, 3], [2, 3]]|
//|2 |[[1, 4]] |
//+-------+------------------------+
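The index trick above can be sketched on a plain Scala Seq to see what each when branch contributes. This is only an illustration under my own assumptions: IndexPairs and pairs are hypothetical names, and None stands in for the SQL null produced when the array is too short.

```scala
object IndexPairs {
  // Each entry of (1 to maxSize).combinations(2) is a pair of 1-based positions.
  def pairs(groupIds: Seq[Int], maxSize: Int): Seq[Seq[Int]] =
    (1 to maxSize).combinations(2).toList
      .map { c =>
        // Like when(size >= c(1), array(element_at(c(0)), element_at(c(1))))
        if (groupIds.size >= c(1)) Some(Seq(groupIds(c(0) - 1), groupIds(c(1) - 1)))
        else None
      }
      .flatten // like filter(group_ids, x -> x is not null)
}
```

Since the pairs are picked by position rather than by value, the element order of the original array is preserved.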
A solution based on explode and joins:
val exploded = df.select(col("user_id"), explode(col("group_ids")).as("e"))
// to have combinations
val joined1 = exploded.as("t1")
.join(exploded.as("t2"), Seq("user_id"), "outer")
.select(col("user_id"), col("t1.e").as("e1"), col("t2.e").as("e2"))
// to filter out redundant combinations
val joined2 = joined1.as("t1")
.join(joined1.as("t2"), $"t1.user_id" === $"t2.user_id" && $"t1.e1" === $"t2.e2" && $"t1.e2"=== $"t2.e1")
.where("t1.e1 < t2.e1")
.select("t1.*")
// group into array
val result = joined2.groupBy("user_id")
.agg(collect_set(struct("e1", "e2")).as("group_ids"))
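To make the data flow easier to follow, here is a rough plain-Scala sketch of the explode, join and group steps. It is a simplification on my part: the two joins are collapsed into one self-join with an e1 < e2 filter, which as far as I can tell keeps the same rows. ExplodeJoinPairs and pairs are hypothetical names.

```scala
object ExplodeJoinPairs {
  def pairs(rows: Seq[(Int, Seq[Int])]): Map[Int, Set[(Int, Int)]] = {
    // explode: one (user_id, e) row per array element
    val exploded = for ((id, gs) <- rows; e <- gs) yield (id, e)
    // self-join on user_id, keeping only one ordering of each pair
    val joined = for {
      (id1, e1) <- exploded
      (id2, e2) <- exploded
      if id1 == id2 && e1 < e2
    } yield (id1, (e1, e2))
    // groupBy + collect_set: gather the distinct pairs per user
    joined.groupBy(_._1).map { case (id, ps) => id -> ps.map(_._2).toSet }
  }
}
```

Note that a user whose array has fewer than two elements produces no pairs and therefore drops out of the result.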
Source JSON data:
{"ID": "ABC", "Amt": 23077, "col": [{"Seq": 1, "Pct": 1.5, "Sh": 1},{"Seq": 2, "Pct": 1.2, "Sh": 2.5}]}
with the structure below:
ID: string
Amt: long
col: array
  element: struct
    Seq: int
    Pct: double
    Sh: double
I have a DataFrame with the output below:
+----+-------+-----------------------------+
|ID  |Amt    |col                          |
+----+-------+-----------------------------+
|ABC |23077  |[[1, 1.5, 1], [2, 1.2, 2.5]] |
+----+-------+-----------------------------+
I need to append the Amt column to the end of each element in the col array:
+----+-------+-------------------------------------------+
|ID  |Amt    |col1                                       |
+----+-------+-------------------------------------------+
|ABC |23077  |[[1, 1.5, 1, 23077], [2, 1.2, 2.5, 23077]] |
+----+-------+-------------------------------------------+
If your Spark version >= 2.4, you can use transform to add elements to the struct:
val df2 = df.selectExpr(
"Amt",
"ID",
"transform(col, x -> struct(x.Seq as Seq, x.Pct as Pct, x.Sh as Sh, Amt)) as col1"
)
df2.show(false)
+-----+---+--------------------------------------------+
|Amt |ID |col1 |
+-----+---+--------------------------------------------+
|23077|ABC|[[1, 1.5, 1.0, 23077], [2, 1.2, 2.5, 23077]]|
+-----+---+--------------------------------------------+
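The idea behind transform here is just a map that rebuilds each element with one extra field. A minimal sketch on plain case classes, assuming hypothetical names (AppendAmt, Item, ItemWithAmt, Record) that mirror the schema in the question:

```scala
object AppendAmt {
  case class Item(seq: Int, pct: Double, sh: Double)
  case class ItemWithAmt(seq: Int, pct: Double, sh: Double, amt: Long)
  case class Record(id: String, amt: Long, col: Seq[Item])

  // Same shape as transform(col, x -> struct(x.Seq, x.Pct, x.Sh, Amt)):
  // every element is rebuilt with the row-level amt appended.
  def withAmt(r: Record): Seq[ItemWithAmt] =
    r.col.map(x => ItemWithAmt(x.seq, x.pct, x.sh, r.amt))
}
```

The lambda can refer to r.amt for the same reason the SQL lambda can refer to Amt: the outer row is in scope while each element is processed.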
For older Spark versions, you can explode the array of structs and reconstruct them:
val df2 = df.selectExpr("Amt","ID","inline(col)")
.groupBy("ID","Amt")
.agg(collect_list(struct(col("Seq"),col("Pct"),col("Sh"),col("Amt"))).as("col1"))
df2.show(false)
+---+-----+--------------------------------------------+
|ID |Amt |col1 |
+---+-----+--------------------------------------------+
|ABC|23077|[[1, 1.5, 1.0, 23077], [2, 1.2, 2.5, 23077]]|
+---+-----+--------------------------------------------+
I have a dataframe like below
c1 Value
A Array[47,97,33,94,6]
A Array[59,98,24,83,3]
A Array[77,63,93,86,62]
B Array[86,71,72,23,27]
B Array[74,69,72,93,7]
B Array[58,99,90,93,41]
C Array[40,13,85,75,90]
C Array[39,13,33,29,14]
C Array[99,88,57,69,49]
I need an output as below.
c1 Value
A Array[183,258,150,263,71]
B Array[218,239,234,209,75]
C Array[178,114,175,173,153]
That is, group by column c1 and sum the arrays in column Value element by element (index by index).
Please help, I couldn't find a way of doing this on Google.
It is not very complicated. As you mention, you can simply group by "c1" and aggregate the values of the array index by index.
Let's first generate some data:
val df = spark.range(6)
.select('id % 3 as "c1",
array((1 to 5).map(_ => floor(rand * 10)) : _*) as "Value")
df.show()
+---+---------------+
| c1| Value|
+---+---------------+
| 0|[7, 4, 7, 4, 0]|
| 1|[3, 3, 2, 8, 5]|
| 2|[2, 1, 0, 4, 4]|
| 0|[0, 4, 2, 1, 8]|
| 1|[1, 5, 7, 4, 3]|
| 2|[2, 5, 0, 2, 2]|
+---+---------------+
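Then we need to iterate over the indices of the array to aggregate the values. This is very similar to the way we created them. Note that the result printed below does not line up with the sample above; it appears to come from a different random draw.
val n = 5 // if you know the size of the array
// If you do not, read it from the first row instead:
// val n = df.select(size('Value)).first.getAs[Int](0)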
df
.groupBy("c1")
.agg(array((0 until n).map(i => sum(col("Value").getItem(i))) :_* ) as "Value")
.show()
+---+------------------+
| c1| Value|
+---+------------------+
| 0|[11, 18, 15, 8, 9]|
| 1| [2, 10, 5, 7, 4]|
| 2|[7, 14, 15, 10, 4]|
+---+------------------+
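As a sanity check of the index-by-index idea, the same aggregation can be sketched on plain Scala collections, where transpose turns the rows of a group into per-index columns. SumByIndex and sumArrays are hypothetical names, and the sketch assumes every array in a group has the same length.

```scala
object SumByIndex {
  def sumArrays(rows: Seq[(String, Seq[Int])]): Map[String, Seq[Int]] =
    rows.groupBy(_._1).map { case (key, group) =>
      // transpose: one inner Seq per array index; then sum each of them
      key -> group.map(_._2).transpose.map(_.sum)
    }
}
```

Applied to the rows for key A in the question, this gives [183, 258, 150, 263, 71], which matches the expected output there.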
Let's say that I have a dataset in Apache Spark as follows:
+---+--------------------+
| id| vec|
+---+--------------------+
| 0|[1, 2, 3, 4] |
| 0|[2, 3, 4, 5] |
| 0|[6, 7, 8, 9] |
| 1|[1, 2, 3, 4] |
| 1|[5, 6, 7, 8] |
+---+--------------------+
And the vec is a List of Doubles.
How can I create a dataset from this that contains the ids and the average of the vectors associated with that id, like so:
+---+--------------------+
| id| vec|
+---+--------------------+
| 0|[3, 4, 5, 6] |
| 1|[3, 4, 5, 6] |
+---+--------------------+
Thanks in advance!
I created a case class to match the input schema of the Dataset, then grouped the Dataset by id and used foldLeft to accumulate the average at each index of the vectors in a group.
scala> case class Test(id: Int, vec: List[Double])
defined class Test
scala> val inputList = List(
| Test(0, List(1, 2, 3, 4)),
| Test(0, List(2, 3, 4, 5)),
| Test(0, List(6, 7, 8, 9)),
| Test(1, List(1, 2, 3, 4)),
| Test(1, List(5, 6, 7, 8)))
inputList: List[Test] = List(Test(0,List(1.0, 2.0, 3.0, 4.0)), Test(0,List(2.0, 3.0, 4.0, 5.0)), Test(0,List(6.0, 7.0, 8.0, 9.0)), Test(1,
List(1.0, 2.0, 3.0, 4.0)), Test(1,List(5.0, 6.0, 7.0, 8.0)))
scala>
scala> import spark.implicits._
import spark.implicits._
scala> val ds = inputList.toDF.as[Test]
ds: org.apache.spark.sql.Dataset[Test] = [id: int, vec: array<double>]
scala> ds.show(false)
+---+--------------------+
|id |vec |
+---+--------------------+
|0 |[1.0, 2.0, 3.0, 4.0]|
|0 |[2.0, 3.0, 4.0, 5.0]|
|0 |[6.0, 7.0, 8.0, 9.0]|
|1 |[1.0, 2.0, 3.0, 4.0]|
|1 |[5.0, 6.0, 7.0, 8.0]|
+---+--------------------+
scala>
scala> val outputDS = ds.groupByKey(_.id).mapGroups {
| case (key, valuePairs) =>
| val vectors = valuePairs.map(_.vec).toArray
| // number of vectors in this group (the divisor for the average)
| val len = vectors.length
| // get average for each index in vectors
| val avg = vectors.head.indices.foldLeft(List[Double]()) {
| case (acc, idx) =>
| val sumOfIdx = vectors.map(_ (idx)).sum
| acc :+ (sumOfIdx / len)
| }
| Test(key, avg)
| }
outputDS: org.apache.spark.sql.Dataset[Test] = [id: int, vec: array<double>]
scala> outputDS.show(false)
+---+--------------------+
|id |vec |
+---+--------------------+
|1 |[3.0, 4.0, 5.0, 6.0]|
|0 |[3.0, 4.0, 5.0, 6.0]|
+---+--------------------+
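The per-index average inside mapGroups can also be written with transpose instead of foldLeft. This is just a sketch of an alternative, not what the answer above uses; MeanByIndex and mean are hypothetical names, and it assumes a non-empty group of equal-length vectors.

```scala
object MeanByIndex {
  // transpose groups the values by index; each column is then averaged.
  def mean(vectors: Seq[Seq[Double]]): Seq[Double] =
    vectors.transpose.map(_.sum / vectors.size)
}
```

Inside the mapGroups body this would be used as Test(key, MeanByIndex.mean(vectors.toSeq).toList).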
Hope this helps!
I have the following DataFrame in Spark:
nodeFrom nodeTo value date
1 2 11 2016-10-12T12:10:00.000Z
1 2 12 2016-10-12T12:11:00.000Z
1 2 11 2016-10-12T12:09:00.000Z
4 2 34 2016-10-12T14:00:00.000Z
4 2 34 2016-10-12T14:00:00.000Z
5 3 11 2016-10-12T14:00:00.000Z
I need to delete duplicated pairs of nodeFrom and nodeTo, while taking the earliest and latest date and the average of corresponding value values.
The expected output is the following one:
nodeFrom nodeTo value date
1 2 11.5 [2016-10-12T12:09:00.000Z,2016-10-12T12:11:00.000Z]
4 2 34 [2016-10-12T14:00:00.000Z]
5 3 11 [2016-10-12T14:00:00.000Z]
Using the struct function with min and max, only a single groupBy and agg step is necessary.
Assuming that this is your data:
val data = Seq(
(1, 2, 11, "2016-10-12T12:10:00.000Z"),
(1, 2, 12, "2016-10-12T12:11:00.000Z"),
(1, 2, 11, "2016-10-12T12:09:00.000Z"),
(4, 2, 34, "2016-10-12T14:00:00.000Z"),
(4, 2, 34, "2016-10-12T14:00:00.000Z"),
(5, 3, 11, "2016-10-12T14:00:00.000Z")
).toDF("nodeFrom", "nodeTo", "value", "date")
data.show()
You can get the average and the array with earliest/latest date as follows:
import org.apache.spark.sql.functions._
data
.groupBy('nodeFrom, 'nodeTo).agg(
min(struct('date, 'value)) as 'date1,
max(struct('date, 'value)) as 'date2
)
.select(
'nodeFrom, 'nodeTo,
($"date1.value" + $"date2.value") / 2.0d as 'value,
array($"date1.date", $"date2.date") as 'date
)
.show(60, false)
This gives you almost what you want, with the minor difference that every array of dates has size 2:
+--------+------+-----+----------------------------------------------------+
|nodeFrom|nodeTo|value|date |
+--------+------+-----+----------------------------------------------------+
|1 |2 |11.5 |[2016-10-12T12:09:00.000Z, 2016-10-12T12:11:00.000Z]|
|5 |3 |11.0 |[2016-10-12T14:00:00.000Z, 2016-10-12T14:00:00.000Z]|
|4 |2 |34.0 |[2016-10-12T14:00:00.000Z, 2016-10-12T14:00:00.000Z]|
+--------+------+-----+----------------------------------------------------+
If you really (really?) want to eliminate the duplicates from the array column, one way is a custom udf (on Spark 2.4 and later, the built-in array_distinct function should do the same job):
val elimDuplicates = udf((_: collection.mutable.WrappedArray[String]).distinct)
data
.groupBy('nodeFrom, 'nodeTo).agg(
min(struct('date, 'value)) as 'date1,
max(struct('date, 'value)) as 'date2
)
.select(
'nodeFrom, 'nodeTo,
($"date1.value" + $"date2.value") / 2.0d as 'value,
elimDuplicates(array($"date1.date", $"date2.date")) as 'date
)
.show(60, false)
This will produce:
+--------+------+-----+----------------------------------------------------+
|nodeFrom|nodeTo|value|date |
+--------+------+-----+----------------------------------------------------+
|1 |2 |11.5 |[2016-10-12T12:09:00.000Z, 2016-10-12T12:11:00.000Z]|
|5 |3 |11.0 |[2016-10-12T14:00:00.000Z] |
|4 |2 |34.0 |[2016-10-12T14:00:00.000Z] |
+--------+------+-----+----------------------------------------------------+
Brief explanation:
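min(struct('date, 'value)) as date1 selects the earliest date together with the corresponding value (structs are compared field by field, so date decides first)
max(struct('date, 'value)) as date2 does the same for the latest date
The value is computed directly from these two tuples by summing and dividing by 2, so it is the average of the values at the earliest and latest dates, not of every value in the group
The two dates are written to an array column
(optional) the array is de-duplicated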
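The steps above can be sketched on plain Scala tuples, which are ordered field by field in the same way. EndpointAverage and summarize are hypothetical names; the input is the list of (date, value) rows of a single nodeFrom/nodeTo group, assumed non-empty.

```scala
object EndpointAverage {
  // (date, value) tuples compare by date first, then by value,
  // so min/max pick the rows with the earliest and latest date.
  def summarize(rows: Seq[(String, Int)]): (Double, Seq[String]) = {
    val first = rows.min
    val last = rows.max
    // average of the two endpoint values, plus the de-duplicated dates
    ((first._2 + last._2) / 2.0, Seq(first._1, last._1).distinct)
  }
}
```

ISO-8601 timestamps in the same format sort correctly as strings, which is what makes the string comparison safe here.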
Hope that helps.
You could do a normal groupBy and then use a udf to build the date column as desired, like below:
val df = Seq(
(1, 2, 11, "2016-10-12T12:10:00.000Z"),
(1, 2, 12, "2016-10-12T12:11:00.000Z"),
(1, 2, 11, "2016-10-12T12:09:00.000Z"),
(4, 2, 34, "2016-10-12T14:00:00.000Z"),
(4, 2, 34, "2016-10-12T14:00:00.000Z"),
(5, 3, 11, "2016-10-12T14:00:00.000Z")
).toDF("nodeFrom", "nodeTo", "value", "date")
def zipDates = udf((date1: String, date2: String) => {
if (date1 == date2)
Seq(date1)
else
Seq(date1, date2)
})
val dfT = df
.groupBy('nodeFrom, 'nodeTo)
.agg(avg('value) as "value", min('date) as "minDate", max('date) as "maxDate")
.select('nodeFrom, 'nodeTo, 'value, zipDates('minDate, 'maxDate) as "date")
dfT.show(10, false)
// +--------+------+------------------+----------------------------------------------------+
// |nodeFrom|nodeTo|value |date |
// +--------+------+------------------+----------------------------------------------------+
// |1 |2 |11.333333333333334|[2016-10-12T12:09:00.000Z, 2016-10-12T12:11:00.000Z]|
// |5 |3 |11.0 |[2016-10-12T14:00:00.000Z] |
// |4 |2 |34.0 |[2016-10-12T14:00:00.000Z] |
// +--------+------+------------------+----------------------------------------------------+
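For comparison, here is a small plain-Scala sketch of what this version computes for one group. Unlike the min/max-of-struct approach, avg runs over every value in the group, which is why the (1, 2) pair comes out as 11.33 rather than the 11.5 shown in the question. GroupAverage and summarize are hypothetical names, and the group is assumed non-empty.

```scala
object GroupAverage {
  def summarize(rows: Seq[(String, Int)]): (Double, Seq[String]) = {
    val dates = rows.map(_._1)
    val values = rows.map(_._2)
    // avg('value) over the whole group, and zipDates(min, max) via distinct
    (values.sum.toDouble / values.size, Seq(dates.min, dates.max).distinct)
  }
}
```

Which of the two averages is wanted depends on how the question's "average of corresponding value values" is read.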