This question already has answers here:
Spark SQL: apply aggregate functions to a list of columns
(4 answers)
Closed 5 years ago.
I have a Spark Dataset with numerous columns:
val df = Seq(
("a", 2, 3, 5, 3, 4, 2, 6, 7, 3),
("a", 1, 1, 2, 4, 5, 7, 3, 5, 2),
("b", 5, 7, 3, 6, 8, 8, 9, 4, 2),
("b", 2, 2, 3, 5, 6, 3, 2, 4, 8),
("b", 2, 5, 5, 4, 3, 6, 7, 8, 8),
("c", 1, 2, 3, 4, 5, 6, 7, 8, 9)
).toDF("id", "p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8", "p9")
Now I'd like to do a groupBy over id and get the sum of each p-column for each id.
Currently I'm doing the following:
val dfg =
  df.groupBy("id")
    .agg(
      sum($"p1").alias("p1"),
      sum($"p2").alias("p2"),
      sum($"p3").alias("p3"),
      sum($"p4").alias("p4"),
      sum($"p5").alias("p5"),
      sum($"p6").alias("p6"),
      sum($"p7").alias("p7"),
      sum($"p8").alias("p8"),
      sum($"p9").alias("p9")
    )
Which produces the (correct) output:
+---+---+---+---+---+---+---+---+---+---+
| id| p1| p2| p3| p4| p5| p6| p7| p8| p9|
+---+---+---+---+---+---+---+---+---+---+
| c| 1| 2| 3| 4| 5| 6| 7| 8| 9|
| b| 9| 14| 11| 15| 17| 17| 18| 16| 18|
| a| 3| 4| 7| 7| 9| 9| 9| 12| 5|
+---+---+---+---+---+---+---+---+---+---+
The question is, in reality I have several dozen p-columns like that, and I'd like to be able to write the aggregation in a more concise way.
Based on the answers to this question, I've tried to do the following:
val pcols = List.range(1, 10)
val ops = pcols.map(k => sum(df(s"p$k")).alias(s"p$k"))
val dfg =
  df.groupBy("id")
    .agg(ops: _*) // does not compile — agg does not accept *-parameters
Unfortunately, unlike select(), agg() does not seem to accept *-parameters this way, so this doesn't work: it fails at compile time with a no ': _*' annotation allowed here error.
agg has this signature: def agg(expr: Column, exprs: Column*): DataFrame
So try this:
df.groupBy("id")
.agg(ops.head,ops.tail:_*)
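Putting it together with the generated ops list from the question, the whole pipeline becomes (a quick sketch; it produces the same result as the manual version above):
import org.apache.spark.sql.functions.sum

val pcols = List.range(1, 10)
val ops = pcols.map(k => sum(df(s"p$k")).alias(s"p$k"))

// agg(expr: Column, exprs: Column*) takes the first column separately,
// so split the generated list into head and tail
val dfg = df.groupBy("id").agg(ops.head, ops.tail: _*)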
Related
I have a Spark DataFrame with a column containing arrays of integers of varying lengths. I need to create new columns with the quantiles for each of these arrays.
This is the input DataFrame :
+---------+------------------------+
|Comm |List_Nb_total_operations|
+---------+------------------------+
| comm1| [1, 1, 2, 3, 4]|
| comm4| [2, 2]|
| comm3| [2, 2]|
| comm0| [1, 1, 1, 2, 2, 2, 3,3]|
| comm2| [1, 1, 1, 2, 3]|
+---------+------------------------+
This is the desired result :
+---------+------------------------+----+----+
|Comm |List_Nb_total_operations|QT25|QT75|
+---------+------------------------+----+----+
| comm1| [1, 1, 2, 3, 4]| 1| 3|
| comm4| [2, 2]| 2| 2|
| comm3| [2, 2]| 2| 2|
| comm0| [1, 1, 1, 2, 2, 2, 3,3]| 1| 3|
| comm2| [1, 1, 1, 2, 3]| 1| 2|
+---------+------------------------+----+----+
The function you want to use is percentile_approx (since Spark 3.1):
val df = Seq(
("comm1", Seq(1,1,2,3,4)),
("comm4", Seq(2,2)),
("comm3", Seq(2,2)),
("comm0", Seq(1,1,1,2,2,2,3,3)),
("comm2", Seq(1,1,1,2,3))
).toDF("Comm", "ops")
val dfQ = df
  .select(col("Comm"), explode(col("ops")) as "ops")
  .groupBy("Comm")
  .agg(
    percentile_approx($"ops", lit(0.25), lit(100)) as "q25",
    percentile_approx($"ops", lit(0.75), lit(100)) as "q75"
  )
val dfWithQ = df.join(dfQ, Seq("Comm"))
The documentation has more information regarding tuning the parameters for accuracy.
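For example, raising the accuracy argument tightens the approximation at the cost of memory (1.0/accuracy is the relative error of the approximation); the value below is only illustrative:
// Illustrative only: the same aggregation with a higher accuracy setting.
val dfQPrecise = df
  .select(col("Comm"), explode(col("ops")) as "ops")
  .groupBy("Comm")
  .agg(
    percentile_approx($"ops", lit(0.25), lit(10000)) as "q25",
    percentile_approx($"ops", lit(0.75), lit(10000)) as "q75"
  )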
Thank you for your help. I've found another solution that works very well in my case:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
def percentile_approxx(col: Column, percentage: Column, accuracy: Column): Column = {
  val expr = new ApproximatePercentile(
    col.expr, percentage.expr, accuracy.expr
  ).toAggregateExpression
  new Column(expr)
}
val perc_df = df.groupBy("Comm").agg(percentile_approxx(col("ops"), lit(0.75), lit(100)))
I need to find the count of occurrences of specific elements in an array column. We could use the array_contains function, but I am looking for another solution that works below Spark 2.2.
Input:
+----+------------------+
|col1| array_col2|
+----+------------------+
| x| [1, 2, 3, 7, 7]|
| z|[3, 2, 8, 9, 4, 9]|
| a| [4, 5, 2, 8]|
+----+------------------+
result1 -> count of occurrences of 1, 2 in the array column array_col2
result2 -> count of occurrences of 3, 7, 9 in the array column array_col2
Expected Output:
+----+------------------+----------+----------+
|col1| array_col2| result1| result2|
+----+------------------+----------+----------+
| x| [1, 2, 3, 7, 7]| 2| 3|
| z|[3, 2, 8, 9, 4, 9]| 1| 3|
| a| [4, 5, 2, 8]| 1| 0|
+----+------------------+----------+----------+
You can use a UDF:
val count_occ = udf((s: Seq[Int], f: Seq[Int]) => s.filter(f.contains(_)).size)
val df1 = df.withColumn(
"result1",
count_occ($"array_col2", array(lit(1), lit(2)))
).withColumn(
"result2",
count_occ($"array_col2", array(lit(3), lit(7), lit(9)))
)
df1.show
//+----+------------------+-------+-------+
//|col1| array_col2|result1|result2|
//+----+------------------+-------+-------+
//| x| [1, 2, 3, 7, 7]| 2| 3|
//| z|[3, 2, 8, 9, 4, 9]| 1| 3|
//| a| [4, 5, 2, 8]| 1| 0|
//+----+------------------+-------+-------+
You can also explode the array, then group by and count:
val df1 = df
  .withColumn("col2", explode($"array_col2"))
  .groupBy("col1", "array_col2")
  .agg(
    count(when($"col2".isin(1, 2), 1)).as("result1"),
    count(when($"col2".isin(3, 7, 9), 1)).as("result2")
  )
import pyspark.sql.functions as F
from datetime import datetime
data = [
    (1, datetime(2017, 3, 12, 3, 19, 58), 'Raising', 2),
    (2, datetime(2017, 3, 12, 3, 21, 30), 'sleeping', 1),
    (3, datetime(2017, 3, 12, 3, 29, 40), 'walking', 3),
    (4, datetime(2017, 3, 12, 3, 31, 23), 'talking', 5),
    (5, datetime(2017, 3, 12, 4, 19, 47), 'eating', 6),
    (6, datetime(2017, 3, 12, 4, 33, 51), 'working', 7),
]
# assuming an active SparkSession named `spark` (the DataFrame creation was missing from the snippet)
df = spark.createDataFrame(data, ["id", "testing_time", "test_name", "shift"])
df.show()
+---+-------------------+---------+-----+
| id|       testing_time|test_name|shift|
+---+-------------------+---------+-----+
|  1|2017-03-12 03:19:58|  Raising|    2|
|  2|2017-03-12 03:21:30| sleeping|    1|
|  3|2017-03-12 03:29:40|  walking|    3|
|  4|2017-03-12 03:31:23|  talking|    5|
|  5|2017-03-12 04:19:47|   eating|    6|
|  6|2017-03-12 04:33:51|  working|    7|
+---+-------------------+---------+-----+
Now I want to add shift (hours) to the testing time. Can anybody help me out with a quick solution?
You can use something like below. You need to convert the shift field to seconds, so multiply it by 3600:
>>> df.withColumn("testing_time", (F.unix_timestamp("testing_time") + F.col("shift")*3600).cast('timestamp')).show()
+---+-------------------+---------+-----+
| id|       testing_time|test_name|shift|
+---+-------------------+---------+-----+
|  1|2017-03-12 05:19:58|  Raising|    2|
|  2|2017-03-12 04:21:30| sleeping|    1|
|  3|2017-03-12 06:29:40|  walking|    3|
|  4|2017-03-12 08:31:23|  talking|    5|
|  5|2017-03-12 10:19:47|   eating|    6|
|  6|2017-03-12 11:33:51|  working|    7|
+---+-------------------+---------+-----+
I have a pyspark dataframe. For example,
d= hiveContext.createDataFrame([("A", 1), ("B", 2), ("D", 3), ("D", 3), ("A", 4), ("D", 3)],["Col1", "Col2"])
+----+----+
|Col1|Col2|
+----+----+
| A| 1|
| B| 2|
| D| 3|
| D| 3|
| A| 4|
| D| 3|
+----+----+
I want to group by Col1 and then create a list of Col2. I need to flatten the groups. I do have a lot of columns.
+----+----------+
|Col1| Col2|
+----+----------+
| A| [1,4] |
| B| [2] |
| D| [3,3,3]|
+----+----------+
You can do a groupBy() and use collect_list() as your aggregate function:
import pyspark.sql.functions as f
d.groupBy('Col1').agg(f.collect_list('Col2').alias('Col2')).show()
#+----+---------+
#|Col1| Col2|
#+----+---------+
#| B| [2]|
#| D|[3, 3, 3]|
#| A| [1, 4]|
#+----+---------+
Update
If you had multiple columns to combine, you could use collect_list() on each and then combine the resulting lists using struct() and udf(). Consider the following example:
Create Dummy Data
from operator import add
import pyspark.sql.functions as f
# create example dataframe
d = sqlcx.createDataFrame(
    [
        ("A", 1, 10),
        ("B", 2, 20),
        ("D", 3, 30),
        ("D", 3, 10),
        ("A", 4, 20),
        ("D", 3, 30)
    ],
    ["Col1", "Col2", "Col3"]
)
Collect Desired Columns into lists
Suppose you had a list of columns you wanted to collect into a list. You could do the following:
cols_to_combine = ['Col2', 'Col3']
d.groupBy('Col1').agg(*[f.collect_list(c).alias(c) for c in cols_to_combine]).show()
#+----+---------+------------+
#|Col1| Col2| Col3|
#+----+---------+------------+
#| B| [2]| [20]|
#| D|[3, 3, 3]|[30, 10, 30]|
#| A| [4, 1]| [20, 10]|
#+----+---------+------------+
Combine Resultant Lists into one Column
Now we want to combine the list columns into one list. If we use struct(), we will get the following:
d.groupBy('Col1').agg(*[f.collect_list(c).alias(c) for c in cols_to_combine])\
.select('Col1', f.struct(*cols_to_combine).alias('Combined'))\
.show(truncate=False)
#+----+------------------------------------------------+
#|Col1|Combined |
#+----+------------------------------------------------+
#|B |[WrappedArray(2),WrappedArray(20)] |
#|D |[WrappedArray(3, 3, 3),WrappedArray(10, 30, 30)]|
#|A |[WrappedArray(1, 4),WrappedArray(10, 20)] |
#+----+------------------------------------------------+
Flatten Wrapped Arrays
Almost there. We just need to combine the WrappedArrays. We can achieve this with a udf():
from functools import reduce  # reduce is not a builtin in Python 3
from pyspark.sql.types import ArrayType, IntegerType

combine_wrapped_arrays = f.udf(lambda val: reduce(add, val), ArrayType(IntegerType()))
d.groupBy('Col1').agg(*[f.collect_list(c).alias(c) for c in cols_to_combine])\
.select('Col1', combine_wrapped_arrays(f.struct(*cols_to_combine)).alias('Combined'))\
.show(truncate=False)
#+----+---------------------+
#|Col1|Combined |
#+----+---------------------+
#|B |[2, 20] |
#|D |[3, 3, 3, 30, 10, 30]|
#|A |[1, 4, 10, 20] |
#+----+---------------------+
References
Pyspark Merge WrappedArrays Within a Dataframe
Update 2
A simpler way, without having to deal with WrappedArrays:
from operator import add
from functools import reduce  # needed on Python 3
from pyspark.sql.types import ArrayType, IntegerType

combine_udf = lambda cols: f.udf(
    lambda *args: reduce(add, args),
    ArrayType(IntegerType())
)
d.groupBy('Col1').agg(*[f.collect_list(c).alias(c) for c in cols_to_combine])\
.select('Col1', combine_udf(cols_to_combine)(*cols_to_combine).alias('Combined'))\
.show(truncate=False)
#+----+---------------------+
#|Col1|Combined |
#+----+---------------------+
#|B |[2, 20] |
#|D |[3, 3, 3, 30, 10, 30]|
#|A |[1, 4, 10, 20] |
#+----+---------------------+
Note: This last step only works if the datatypes for all of the columns are the same. You can not use this function to combine wrapped arrays with mixed types.
From Spark 2.4 you can use pyspark.sql.functions.flatten. Note that flatten expects an array of arrays, so this applies when the column you collect is itself an array column:
import pyspark.sql.functions as f
df.groupBy('Col1').agg(f.flatten(f.collect_list('Col2')).alias('Col2')).show()
I have the following DataFrame df:
val df = Seq(
(1, 0, 1, 0, 0), (1, 4, 1, 0, 4), (2, 2, 1, 2, 2),
(4, 3, 1, 4, 4), (4, 5, 1, 4, 4)
).toDF("from", "to", "attr", "type_from", "type_to")
+-----+-----+----+---------------+---------------+
|from |to |attr|type_from |type_to |
+-----+-----+----+---------------+---------------+
| 1| 0| 1| 0| 0|
| 1| 4| 1| 0| 4|
| 2| 2| 1| 2| 2|
| 4| 3| 1| 4| 4|
| 4| 5| 1| 4| 4|
+-----+-----+----+---------------+---------------+
I want to count the number of incoming and outgoing links for each node, but only when the type of the from node is the same as the type of the to node (i.e. the values of type_from and type_to match).
The cases where to and from are equal should be excluded.
This is how I calculate the number of outgoing links, based on this answer proposed by user8371915:
df
  .where($"type_from" === $"type_to" && $"from" =!= $"to")
  .groupBy($"from" as "nodeId", $"type_from" as "type")
  .agg(count("*") as "numLinks")
  .na.fill(0)
  .show()
Of course, I can repeat the same calculation for the incoming links and then join the results. But is there any shorter solution?
Repeating the same calculation for the incoming links and joining the two results looks like this (the counts are renamed so the joined columns stay distinct):
val outgoing = df
  .where($"type_from" === $"type_to" && $"from" =!= $"to")
  .groupBy($"from" as "nodeId", $"type_from" as "type")
  .agg(count("*") as "outLinks")

val incoming = df
  .where($"type_from" === $"type_to" && $"from" =!= $"to")
  .groupBy($"to" as "nodeId", $"type_to" as "type")
  .agg(count("*") as "inLinks")

val df_result = outgoing.join(incoming, Seq("nodeId", "type"), "rightouter").na.fill(0)
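If you want to avoid running two separate aggregations, one possible shortcut is sketched below. This is not from the answer above: the union/pivot approach and the "in"/"out" direction labels are introduced here purely for illustration.
import org.apache.spark.sql.functions._
// assumes spark.implicits._ is in scope for the $"..." syntax, as in the question

// Keep only links between nodes of the same type, excluding self-links.
val links = df.where($"type_from" === $"type_to" && $"from" =!= $"to")

// Stack the "from" side (outgoing) and the "to" side (incoming),
// then count per node and direction in a single aggregation.
val counts = links
  .select($"from" as "nodeId", $"type_from" as "type", lit("out") as "direction")
  .union(links.select($"to" as "nodeId", $"type_to" as "type", lit("in") as "direction"))
  .groupBy("nodeId", "type")
  .pivot("direction", Seq("in", "out"))
  .count()
  .na.fill(0)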