How to find the sum of arrays in a column, grouped by another column's values, in a Spark DataFrame using Scala

I have a dataframe like below
c1  Value
A   Array[47,97,33,94,6]
A   Array[59,98,24,83,3]
A   Array[77,63,93,86,62]
B   Array[86,71,72,23,27]
B   Array[74,69,72,93,7]
B   Array[58,99,90,93,41]
C   Array[40,13,85,75,90]
C   Array[39,13,33,29,14]
C   Array[99,88,57,69,49]
I need an output as below.
c1  Value
A   Array[183,258,150,263,71]
B   Array[218,239,234,209,75]
C   Array[178,114,175,173,153]
In other words, group by column c1 and sum the values in column Value element-wise (index by index).
Please help, I couldn't find any way of doing this on Google.

It is not very complicated. As you mention, you can simply group by "c1" and aggregate the values of the array index by index.
Let's first generate some data:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.range(6)
  .select('id % 3 as "c1",
          array((1 to 5).map(_ => floor(rand * 10)) : _*) as "Value")
df.show()
+---+---------------+
| c1|          Value|
+---+---------------+
|  0|[7, 4, 7, 4, 0]|
|  1|[3, 3, 2, 8, 5]|
|  2|[2, 1, 0, 4, 4]|
|  0|[0, 4, 2, 1, 8]|
|  1|[1, 5, 7, 4, 3]|
|  2|[2, 5, 0, 2, 2]|
+---+---------------+
Then we need to iterate over the values of the array so as to aggregate them. It is very similar to the way we created them:
val n = 5 // use this if you know the size of the arrays
// val n = df.select(size('Value)).first.getAs[Int](0) // otherwise, compute it from the data
df
  .groupBy("c1")
  .agg(array((0 until n).map(i => sum(col("Value").getItem(i))) : _*) as "Value")
  .show()
+---+------------------+
| c1|             Value|
+---+------------------+
|  0|[11, 18, 15, 8, 9]|
|  1|  [2, 10, 5, 7, 4]|
|  2|[7, 14, 15, 10, 4]|
+---+------------------+
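As a rough alternative sketch, assuming the same "c1" and "Value" columns as above: if the arrays can vary in length, or you would rather not compute n first, you can explode each element together with its position, sum per position, and rebuild the array.
import org.apache.spark.sql.functions._

// Explode each array element with its index, sum per (c1, index),
// then collect the sums back into an array ordered by index.
val summed = df
  .select(col("c1"), posexplode(col("Value")))   // yields columns "pos" and "col"
  .groupBy("c1", "pos")
  .agg(sum("col") as "s")
  .groupBy("c1")
  .agg(sort_array(collect_list(struct(col("pos"), col("s")))) as "tmp")
  .select(col("c1"), col("tmp.s") as "Value")
summed.show()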

Related

How to create a column with the maximum number in each row of another column in PySpark?

I have a PySpark dataframe where each row of the column 'TAGID_LIST' is a set of numbers such as {426,427,428,430,432,433,434,437,439,447,448,450,453,460,469,469,469,469}, but I only want to keep the maximum number in each set, 469 for this row. I tried to create a new column with:
wechat_userinfo.withColumn('TAG', f.when(wechat_userinfo['TAGID_LIST'] != 'null', max(wechat_userinfo['TAGID_LIST'])).otherwise('null'))
but got TypeError: Column is not iterable.
How do I correct it?
If the column for which you want to retrieve the max value is an array, you can use the array_max function:
import pyspark.sql.functions as F
new_df = wechat_userinfo.withColumn("TAG", F.array_max(F.col("TAGID_LIST")))
To illustrate with an example,
df = spark.createDataFrame( [(1, [1, 772, 3, 4]), (2, [5, 6, 44, 8, 9])], ('a','d'))
df2 = df.withColumn("maxd", F.array_max(F.col("d")))
df2.show()
+---+----------------+----+
|  a|               d|maxd|
+---+----------------+----+
|  1|  [1, 772, 3, 4]| 772|
|  2|[5, 6, 44, 8, 9]|  44|
+---+----------------+----+
In your particular case, the column in question is not an array of numbers but a string, formatted as comma-separated numbers surrounded by { and }. What I'd suggest is turning your string into an array and then operating on that array as described above. You can use the regexp_replace function to quickly remove the brackets, and then split() the comma-separated string into an array. It would look like this:
df = spark.createDataFrame([(1, "{1,2,3,4}"), (2, "{5,6,7,8}")], ('a', 'd'))
df2 = (df
       .withColumn("as_str", F.regexp_replace(F.col("d"), r'^\{|\}$', ''))
       .withColumn("as_arr", F.split(F.col("as_str"), ",").cast("array<long>"))
       .withColumn("maxd", F.array_max(F.col("as_arr")))
       .drop("as_str"))
df2.show()
+---+---------+------------+----+
|  a|        d|      as_arr|maxd|
+---+---------+------------+----+
|  1|{1,2,3,4}|[1, 2, 3, 4]|   4|
|  2|{5,6,7,8}|[5, 6, 7, 8]|   8|
+---+---------+------------+----+

Looking to get counts of items within ArrayType column without using Explode

NOTE: I'm working with Spark 2.4
Here is my dataset:
df
col
[1,3,1,4]
[1,1,1,2]
I'd like to essentially get a value_counts of the values in the array. The resulting df would look like this:
df_upd
col
[{1:2},{3:1},{4:1}]
[{1:3},{2:1}]
I know I can do this by exploding df and then taking a group by but I'm wondering if I can do this without exploding.
Here's a solution using a udf that outputs the result as a MapType. It expects integer values in your arrays (easily changed) and returns integer counts.
from pyspark.sql import functions as F
from pyspark.sql import types as T
df = sc.parallelize([([1, 2, 3, 3, 1],),([4, 5, 6, 4, 5],),([2, 2, 2],),([3, 3],)]).toDF(['arrays'])
df.show()
+---------------+
|         arrays|
+---------------+
|[1, 2, 3, 3, 1]|
|[4, 5, 6, 4, 5]|
|      [2, 2, 2]|
|         [3, 3]|
+---------------+
from collections import Counter

@F.udf(returnType=T.MapType(T.IntegerType(), T.IntegerType(), valueContainsNull=False))
def count_elements(array):
    return dict(Counter(array))
df.withColumn('counts', count_elements(F.col('arrays'))).show(truncate=False)
+---------------+------------------------+
|arrays |counts |
+---------------+------------------------+
|[1, 2, 3, 3, 1]|[1 -> 2, 2 -> 1, 3 -> 2]|
|[4, 5, 6, 4, 5]|[4 -> 2, 5 -> 2, 6 -> 1]|
|[2, 2, 2] |[2 -> 3] |
|[3, 3] |[3 -> 2] |
+---------------+------------------------+

How to calculate the ApproxQuantiles from a list of integers in a Spark DataFrame column using Scala

I have a Spark DataFrame with a column containing several arrays of integers of varying lengths. I need to create new columns holding the quantiles for each of these.
This is the input DataFrame :
+---------+------------------------+
|Comm     |List_Nb_total_operations|
+---------+------------------------+
|comm1    |[1, 1, 2, 3, 4]         |
|comm4    |[2, 2]                  |
|comm3    |[2, 2]                  |
|comm0    |[1, 1, 1, 2, 2, 2, 3, 3]|
|comm2    |[1, 1, 1, 2, 3]         |
+---------+------------------------+
This is the desired result :
+---------+------------------------+----+----+
|Comm     |List_Nb_total_operations|QT25|QT75|
+---------+------------------------+----+----+
|comm1    |[1, 1, 2, 3, 4]         |1   |3   |
|comm4    |[2, 2]                  |2   |2   |
|comm3    |[2, 2]                  |2   |2   |
|comm0    |[1, 1, 1, 2, 2, 2, 3, 3]|1   |3   |
|comm2    |[1, 1, 1, 2, 3]         |1   |2   |
+---------+------------------------+----+----+
The function you want to use is percentile_approx (since Spark 3.1):
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("comm1", Seq(1, 1, 2, 3, 4)),
  ("comm4", Seq(2, 2)),
  ("comm3", Seq(2, 2)),
  ("comm0", Seq(1, 1, 1, 2, 2, 2, 3, 3)),
  ("comm2", Seq(1, 1, 1, 2, 3))
).toDF("Comm", "ops")

val dfQ = df
  .select(col("Comm"), explode(col("ops")) as "ops")
  .groupBy("Comm")
  .agg(
    percentile_approx($"ops", lit(0.25), lit(100)) as "q25",
    percentile_approx($"ops", lit(0.75), lit(100)) as "q75"
  )

val dfWithQ = df.join(dfQ, Seq("Comm"))
The documentation has more information regarding tuning the parameters for accuracy.
Thank you for your help. I've found another solution that works very well in my case:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile

def percentile_approxx(col: Column, percentage: Column, accuracy: Column): Column = {
  val expr = new ApproximatePercentile(
    col.expr, percentage.expr, accuracy.expr
  ).toAggregateExpression
  new Column(expr)
}

val perc_df = df.groupBy("Comm").agg(percentile_approxx(col("ops"), lit(0.75), lit(100)))

Spark DataFrame ArrayType columns

I would like to create a new column on a dataframe, which is the result of applying a function to an arraytype column.
Something like this:
df = df.withColumn(s"max_$colname", max(col(colname)))
where each row of the column holds an array of values?
The functions in spark.sql.functions appear to work on a column basis only.
You can apply user defined functions on the array column.
1. DataFrame
+------------------+
|               arr|
+------------------+
|   [1, 2, 3, 4, 5]|
|[4, 5, 6, 7, 8, 9]|
+------------------+
2. Creating UDF
import org.apache.spark.sql.functions._

def max(arr: TraversableOnce[Int]) = arr.toList.max
val maxUDF = udf(max(_: Traversable[Int]))
3. Applying UDF in query
df.withColumn("arrMax", maxUDF(df("arr"))).show
4. Result
+------------------+------+
|               arr|arrMax|
+------------------+------+
|   [1, 2, 3, 4, 5]|     5|
|[4, 5, 6, 7, 8, 9]|     9|
+------------------+------+
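As a side note, if you are on Spark 2.4 or later, the built-in array_max function should give the same result without defining a UDF. A minimal sketch on the same DataFrame:
import org.apache.spark.sql.functions.{array_max, col}

// array_max returns the largest element of each array
df.withColumn("arrMax", array_max(col("arr"))).show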

How to extract subArrays from an Array[Array[Int]] DataFrame column

I have a Dataframe like this:
+---------------------------------------------------------------------+
|ARRAY                                                                |
+---------------------------------------------------------------------+
|[WrappedArray(1, 2, 3), WrappedArray(4, 5, 6), WrappedArray(7, 8, 9)]|
+---------------------------------------------------------------------+
I use this code to create it:
case class MySchema(arr: Array[Array[Int]])
val df = sc.parallelize(Seq(
Array(Array(1,2,3),
Array(4,5,6),
Array(7,8,9))))
.map(x => MySchema(x))
.toDF("ARRAY")
I would like to get a result like this:
+-----------+
|ARRAY      |
+-----------+
|[1, 2, 3]  |
|[4, 5, 6]  |
|[7, 8, 9]  |
+-----------+
Do you have any idea?
I already tried calling a udf to do a flatMap(x => x) on my array column, but I got an incorrect result:
+---------------------------+
|ARRAY                      |
+---------------------------+
|[1, 2, 3, 4, 5, 6, 7, 8, 9]|
+---------------------------+
Thank you for your help
You can explode:
import org.apache.spark.sql.functions.{col, explode}
df.select(explode(col("array")))
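As a small follow-up sketch, assuming the DataFrame built in the question: column resolution is case-insensitive by default, so "array" resolves to the "ARRAY" column, and an alias keeps the original column name in the output.
import org.apache.spark.sql.functions.{col, explode}

// one output row per inner array; the alias keeps the column name
df.select(explode(col("ARRAY")) as "ARRAY").show(false)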