Calculate rolling sum of array in PySpark and save as dict?

Calculate rolling sum of array in PySpark and save as dict? - pyspark

Given an input like this:
timestamp vars
2 [1,2,3]
2 [1,2,4]
3 [1,2]
4 [1,3]
5 [1,3]
I need to keep a rolling count of each of the indices. The tried expanding the array into a one hot encoding ([1,2,3,5] -> [0,1,1,1,0,1]) and adding but this can get arbitrarily big (> 1 million), so I want to maintain it as a dict. Something like below. Any pointers would be greatly appreciated.
timestamp vars
2 {1:1, 2:1, 3:1}
2 {1:2, 2:2, 3:1, 4:1}
3 {1:3, 2:3, 3:1, 4:1}
4 {1:4, 2:3, 3:2, 4:1}
5 {1:5, 2:3, 3:3, 4:1}
Thanks!

Sample Dataframe :
+---+------------+
| ID| arr|
+---+------------+
| 1| [0]|
| 2| [0, 1]|
| 3| [0, 1, 2]|
| 4|[0, 1, 2, 3]|
| 1| [0]|
| 1| [0]|
| 3| [0, 1, 2]|
| 0| []|
+---+------------+
Using the following function which uses Collection counter:
def arr_operation(arr):
from collections import Counter
return dict(Counter(arr))
Creating UDF for arr_operation function in the following manner :
udf_dist_count = udf(arr_operation,MapType(IntegerType(), IntegerType()))
And calling the to create a new column:
final_df = df.withColumn("Dict",udf_dist_count("arr"))
The results will be like :
+---+------------+--------------------------------+
|ID |arr |Dict |
+---+------------+--------------------------------+
|1 |[0] |[0 -> 1] |
|2 |[0, 1] |[0 -> 1, 1 -> 1] |
|3 |[0, 1, 2] |[0 -> 1, 1 -> 1, 2 -> 1] |
|4 |[0, 1, 2, 3]|[0 -> 1, 1 -> 1, 2 -> 1, 3 -> 1]|
|1 |[0] |[0 -> 1] |
|1 |[0] |[0 -> 1] |
|3 |[0, 1, 2] |[0 -> 1, 1 -> 1, 2 -> 1] |
|0 |[] |[] |
+---+------------+--------------------------------+
The argument about collection Counter being slow in a distributed environment has been explained in a good manner in the answer to this question Why is Collections.counter so slow?

I would suggest Counter from collections:
In [1]: from collections import Counter
In [2]: count = Counter()
In [3]: count.update([1,2,4])
In [4]: count
Out[4]: Counter({1: 1, 2: 1, 4: 1})
In [5]: count.update([1,2,3])
In [6]: count
Out[6]: Counter({1: 2, 2: 2, 4: 1, 3: 1})
In [7]: count.update([2,3,5])
In [8]: count
Out[8]: Counter({1: 2, 2: 3, 4: 1, 3: 2, 5: 1})

Related

How to calculate the ApproxQuanitiles from list of Integers into Spark DataFrame column using scala

I have a spark DataFrame with a column containing several arrays of Integers with varying lengths. I will need to create a new column to find the Quantiles for each of these.
This is the input DataFrame :
+---------+------------------------+
|Comm |List_Nb_total_operations|
+---------+------------------------+
| comm1| [1, 1, 2, 3, 4]|
| comm4| [2, 2]|
| comm3| [2, 2]|
| comm0| [1, 1, 1, 2, 2, 2, 3,3]|
| comm2| [1, 1, 1, 2, 3]|
+---------+------------------------+
This is the desired result :
+---------+------------------------+----+----+
|Comm |List_Nb_total_operations|QT25|QT75|
+---------+------------------------+----+----+
| comm1| [1, 1, 2, 3, 4]| 1| 3|
| comm4| [2, 2]| 2| 2|
| comm3| [2, 2]| 2| 2|
| comm0| [1, 1, 1, 2, 2, 2, 3,3]| 1| 3|
| comm2| [1, 1, 1, 2, 3]| 1| 2|
+---------+------------------------+----+----+

The function you want to use is percentile_approx (since Spark 3.1):
val df = Seq(
("comm1", Seq(1,1,2,3,4)),
("comm4", Seq(2,2)),
("comm3", Seq(2,2)),
("comm0", Seq(1,1,1,2,2,2,3,3)),
("comm2", Seq(1,1,1,2,3))
).toDF("Comm", "ops")
val dfQ = df.select(
col("Comm"),
explode(col("ops")) as "ops")
.groupBy("Comm")
.agg(
percentile_approx($"ops", lit(0.25), lit(100)) as "q25",
percentile_approx($"ops", lit(0.75), lit(100)) as "q75"
)
val dfWithQ = df.join(dfQ, Seq("Comm"))
The documentation has more information regarding tuning the parameters for accuracy.

Thank you for your help. I've found an other solution that works very well in my case:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
def percentile_approxx(col: Column, percentage: Column, accuracy: Column): Column = {
val expr = new ApproximatePercentile(
col.expr, percentage.expr, accuracy.expr
).toAggregateExpression
new Column(expr)
}
val perc_df = df.groupBy("Comm").agg(percentile_approxx(col("ops"), lit(0.75), lit(100)))

How to get all combinations of an array column in Spark?

Suppose I have an array column group_ids
+-------+----------+
|user_id|group_ids |
+-------+----------+
|1 |[5, 8] |
|3 |[1, 2, 3] |
|2 |[1, 4] |
+-------+----------+
Schema:
root
|-- user_id: integer (nullable = false)
|-- group_ids: array (nullable = false)
| |-- element: integer (containsNull = false)
I want to get all combinations of pairs:
+-------+------------------------+
|user_id|group_ids |
+-------+------------------------+
|1 |[[5, 8]] |
|3 |[[1, 2], [1, 3], [2, 3]]|
|2 |[[1, 4]] |
+-------+------------------------+
So far I created the easiest solution with UDF:
spark.udf.register("permutate", udf((xs: Seq[Int]) => xs.combinations(2).toSeq))
dataset.withColumn("group_ids", expr("permutate(group_ids)"))
What I'm looking for is something that implemented via Spark Built-in functions. Is there a way to implement the same code without UDF?

Some higher order functions can do the trick. Requires Spark >= 2.4.
val df2 = df.withColumn(
"group_ids",
expr("""
filter(
transform(
flatten(
transform(
group_ids,
x -> arrays_zip(
array_repeat(x, size(group_ids)),
group_ids
)
)
),
x -> array(x['0'], x['group_ids'])
),
x -> x[0] < x[1]
)
""")
)
df2.show(false)
+-------+------------------------+
|user_id|group_ids |
+-------+------------------------+
|1 |[[5, 8]] |
|3 |[[1, 2], [1, 3], [2, 3]]|
|2 |[[1, 4]] |
+-------+------------------------+

You can get the max size of the column group_ids. Then, using combinations on the range (1 - maxSize) with when expressions to create the sub arrays combinations from the original array, and finaly filter the null elements from the resulting array:
val maxSize = df.select(max(size($"group_ids"))).first.getAs[Int](0)
val newCol = (1 to maxSize).combinations(2)
.map(c =>
when(
size($"group_ids") >= c(1),
array(element_at($"group_ids", c(0)), element_at($"group_ids", c(1)))
)
).toSeq
df.withColumn("group_ids", array(newCol: _*))
.withColumn("group_ids", expr("filter(group_ids, x -> x is not null)"))
.show(false)
//+-------+------------------------+
//|user_id|group_ids |
//+-------+------------------------+
//|1 |[[5, 8]] |
//|3 |[[1, 2], [1, 3], [2, 3]]|
//|2 |[[1, 4]] |
//+-------+------------------------+

Based on explode and joins solution
val exploded = df.select(col("user_id"), explode(col("group_ids")).as("e"))
// to have combinations
val joined1 = exploded.as("t1")
.join(exploded.as("t2"), Seq("user_id"), "outer")
.select(col("user_id"), col("t1.e").as("e1"), col("t2.e").as("e2"))
// to filter out redundant combinations
val joined2 = joined1.as("t1")
.join(joined1.as("t2"), $"t1.user_id" === $"t2.user_id" && $"t1.e1" === $"t2.e2" && $"t1.e2"=== $"t2.e1")
.where("t1.e1 < t2.e1")
.select("t1.*")
// group into array
val result = joined2.groupBy("user_id")
.agg(collect_set(struct("e1", "e2")).as("group_ids"))

How to find sum of arrays in a column which is grouped by another column values in a spark dataframe using scala

I have a dataframe like below
c1 Value
A Array[47,97,33,94,6]
A Array[59,98,24,83,3]
A Array[77,63,93,86,62]
B Array[86,71,72,23,27]
B Array[74,69,72,93,7]
B Array[58,99,90,93,41]
C Array[40,13,85,75,90]
C Array[39,13,33,29,14]
C Array[99,88,57,69,49]
I need an output as below.
c1 Value
A Array[183,258,150,263,71]
B Array[218,239,234,209,75]
C Array[178,114,175,173,153]
Which is nothing but grouping column c1 and find the sum of values in column value in a sequential manner .
Please help, I couldn't find any way of doing this in google .

It is not very complicated. As you mention it, you can simply group by "c1" and aggregate the values of the array index by index.
Let's first generate some data:
val df = spark.range(6)
.select('id % 3 as "c1",
array((1 to 5).map(_ => floor(rand * 10)) : _*) as "Value")
df.show()
+---+---------------+
| c1| Value|
+---+---------------+
| 0|[7, 4, 7, 4, 0]|
| 1|[3, 3, 2, 8, 5]|
| 2|[2, 1, 0, 4, 4]|
| 0|[0, 4, 2, 1, 8]|
| 1|[1, 5, 7, 4, 3]|
| 2|[2, 5, 0, 2, 2]|
+---+---------------+
Then we need to iterate over the values of the array so as to aggregate them. It is very similar to the way we created them:
val n = 5 // if you know the size of the array
val n = df.select(size('Value)).first.getAs[Int](0) // If you do not
df
.groupBy("c1")
.agg(array((0 until n).map(i => sum(col("Value").getItem(i))) :_* ) as "Value")
.show()
+---+------------------+
| c1| Value|
+---+------------------+
| 0|[11, 18, 15, 8, 9]|
| 1| [2, 10, 5, 7, 4]|
| 2|[7, 14, 15, 10, 4]|
+---+------------------+

Spark Dataframe - Get all lists of pairs (Scala)

I have the following situation:
I have a dataframe with an 'array' as the schema. Now I want to get for each array, all lists of pairs and save it again in a dataframe. So for example:
This is the original dataframe:
+---------------+
| candidateList|
+---------------+
| [1, 2]|
| [2, 3, 4]|
| [1, 3, 5]|
|[1, 2, 3, 4, 5]|
|[1, 2, 3, 4, 5]|
+---------------+
And that is how it have to look like after the computation:
+---------------+
| candidates |
+---------------+
| [1, 2]|
| [2, 3]|
| [2, 4]|
| [3, 4]|
| [1, 3]|
| [1, 5]|
| [3, 5]|
|and so on... |
+---------------+
I really don't know how this is possible in spark, maybe someone has a tip for me.
Kind regards

You'll need to create a UDF (User Defined Function) and use it with explode function. The UDF itself is simple thanks to Scala collection's combinations method:
import scala.collection.mutable
import org.apache.spark.sql.functions._
import spark.implicits._
val pairsUdf = udf((arr: mutable.Seq[Int]) => arr.combinations(2).toArray)
val result = df.select(explode(pairsUdf($"candidateList")) as "candidates")
result.show(numRows = 8)
// +----------+
// |candidates|
// +----------+
// | [1, 2]|
// | [2, 3]|
// | [2, 4]|
// | [3, 4]|
// | [1, 3]|
// | [1, 5]|
// | [3, 5]|
// | [1, 2]|
// +----------+

need help to compare two columns in spark scala

I have spark dataframe like this
id1 id2 attrname attr_value attr_valuelist
1 2 test Yes Yes, No
2 1 test1 No Yes, No
3 2 test2 value1 val1, Value1,value2
4 1 test3 3 0, 1, 2
5 3 test4 0 0, 1, 2
11 2 test Yes Yes, No
22 1 test1 No1 Yes, No
33 2 test2 value0 val1, Value1,value2
44 1 test3 11 0, 1, 2
55 3 test4 0 0, 1, 2
val df = sqlContext.sql("select id1, id2, attrname, attr_value, attr_valuelist from dftable)
i want to check attr_value in attr_valuelist if it does not exists then take only those rows
id1 id2 attrname attr_value attr_valuelist
4 1 test3 3 0, 1, 2
22 1 test1 No1 Yes, No
33 2 test2 value0 val1, Value1,value2
44 1 test3 11 0, 1, 2

you can simply do the following with contains in your dataframe
import org.apache.spark.sql.functions._
df.filter(!(col("attr_valuelist").contains(col("attr_value")))).show(false)
you should have following output
+---+---+--------+----------+-------------------+
|id1|id2|attrname|attr_value|attr_valuelist |
+---+---+--------+----------+-------------------+
|3 |2 |test2 |value1 |val1, Value1,value2|
|4 |1 |test3 |3 |0, 1, 2 |
|22 |1 |test1 |No1 |Yes, No |
|33 |2 |test2 |value0 |val1, Value1,value2|
|44 |1 |test3 |11 |0, 1, 2 |
+---+---+--------+----------+-------------------+
If you want to ignore the case letters then you can simply user lower function as
df.filter(!(lower(col("attr_valuelist")).contains(lower(col("attr_value"))))).show(false)
you should have
+---+---+--------+----------+-------------------+
|id1|id2|attrname|attr_value|attr_valuelist |
+---+---+--------+----------+-------------------+
|4 |1 |test3 |3 |0, 1, 2 |
|22 |1 |test1 |No1 |Yes, No |
|33 |2 |test2 |value0 |val1, Value1,value2|
|44 |1 |test3 |11 |0, 1, 2 |
+---+---+--------+----------+-------------------+

You can define a custom function, user defined function in Spark, where you can test if a value from a column is contained in the value of the other column, like this:
def contains = udf((attr: String, attrList: String) => attrList.contains(attr))
def notContains = udf((attr: String, attrList: String) => !attrList.contains(attr))
you can tweak contains function how you want, and then you can select from your dataframe like this
df.where(contains(df("attr_value", df("attr_valuelist")))
df.where(notContains(df("attr_value", df("attr_valuelist")))

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Calculate rolling sum of array in PySpark and save as dict? - pyspark

Related

How to calculate the ApproxQuanitiles from list of Integers into Spark DataFrame column using scala

How to get all combinations of an array column in Spark?

How to find sum of arrays in a column which is grouped by another column values in a spark dataframe using scala

Spark Dataframe - Get all lists of pairs (Scala)

need help to compare two columns in spark scala

Categories

Resources