Pivot and aggregate a PySpark Data Frame with alias - pyspark

I have a PySpark DataFrame similar to this:
df = sc.parallelize([
("c1", "A", 3.4, 0.4, 3.5),
("c1", "B", 9.6, 0.0, 0.0),
("c1", "A", 2.8, 0.4, 0.3),
("c1", "B", 5.4, 0.2, 0.11),
("c2", "A", 0.0, 9.7, 0.3),
("c2", "B", 9.6, 8.6, 0.1),
("c2", "A", 7.3, 9.1, 7.0),
("c2", "B", 0.7, 6.4, 4.3)
]).toDF(["user_id", "type", "d1", 'd2', 'd3'])
df.show()
which gives:
+-------+----+---+---+----+
|user_id|type| d1| d2| d3|
+-------+----+---+---+----+
| c1| A|3.4|0.4| 3.5|
| c1| B|9.6|0.0| 0.0|
| c1| A|2.8|0.4| 0.3|
| c1| B|5.4|0.2|0.11|
| c2| A|0.0|9.7| 0.3|
| c2| B|9.6|8.6| 0.1|
| c2| A|7.3|9.1| 7.0|
| c2| B|0.7|6.4| 4.3|
+-------+----+---+---+----+
And I've pivoted it by type column aggregating the result with a sum():
data_wide = df.groupBy('user_id')\
.pivot('type').sum()
data_wide.show()
which gives:
+-------+-----------------+------------------+-----------+------------------+-----------+------------------+
|user_id| A_sum(`d1`)| A_sum(`d2`)|A_sum(`d3`)| B_sum(`d1`)|B_sum(`d2`)| B_sum(`d3`)|
+-------+-----------------+------------------+-----------+------------------+-----------+------------------+
| c1|6.199999999999999| 0.8| 3.8| 15.0| 0.2| 0.11|
| c2| 7.3|18.799999999999997| 7.3|10.299999999999999| 15.0|4.3999999999999995|
+-------+-----------------+------------------+-----------+------------------+-----------+------------------+
Now, the resulting column names contains the `(tilde) character, and this is a problem to, for example, introduce this new columns in a Vector Assembler because it returns a syntax error in attribute name. For this reason, I need to rename the column names but to call a withColumnRenamed method inside a loop or inside a reduce(lambda...) function takes a lot of time (actually my df has 11.520 columns).
Is there any way to avoid this character in the pivot+aggregation step or recursively assign an alias that depends on the name of the new pivoted column?
Thank you in advance

You can do the renaming within the aggregation for the pivot using alias:
import pyspark.sql.functions as f
data_wide = df.groupBy('user_id')\
.pivot('type')\
.agg(*[f.sum(x).alias(x) for x in df.columns if x not in {"user_id", "type"}])
data_wide.show()
#+-------+-----------------+------------------+----+------------------+----+------------------+
#|user_id| A_d1| A_d2|A_d3| B_d1|B_d2| B_d3|
#+-------+-----------------+------------------+----+------------------+----+------------------+
#| c1|6.199999999999999| 0.8| 3.8| 15.0| 0.2| 0.11|
#| c2| 7.3|18.799999999999997| 7.3|10.299999999999999|15.0|4.3999999999999995|
#+-------+-----------------+------------------+----+------------------+----+------------------+
However, this is really no different than doing the pivot and renaming afterwards. Here is the execution plan for this method:
#== Physical Plan ==
#HashAggregate(keys=[user_id#0], functions=[pivotfirst(type#1, sum(`d1`) AS `d1`#169, A, B, 0, 0), pivotfirst(type#1, sum(`d2`)
#AS `d2`#170, A, B, 0, 0), pivotfirst(type#1, sum(`d3`) AS `d3`#171, A, B, 0, 0)])
#+- Exchange hashpartitioning(user_id#0, 200)
# +- HashAggregate(keys=[user_id#0], functions=[partial_pivotfirst(type#1, sum(`d1`) AS `d1`#169, A, B, 0, 0), partial_pivotfirst(type#1, sum(`d2`) AS `d2`#170, A, B, 0, 0), partial_pivotfirst(type#1, sum(`d3`) AS `d3`#171, A, B, 0, 0)])
# +- *HashAggregate(keys=[user_id#0, type#1], functions=[sum(d1#2), sum(d2#3), sum(d3#4)])
# +- Exchange hashpartitioning(user_id#0, type#1, 200)
# +- *HashAggregate(keys=[user_id#0, type#1], functions=[partial_sum(d1#2), partial_sum(d2#3), partial_sum(d3#4)])
# +- Scan ExistingRDD[user_id#0,type#1,d1#2,d2#3,d3#4]
Compare this with the method in this answer:
import re
def clean_names(df):
p = re.compile("^(\w+?)_([a-z]+)\((\w+)\)(?:\(\))?")
return df.toDF(*[p.sub(r"\1_\3", c) for c in df.columns])
pivoted = df.groupBy('user_id').pivot('type').sum()
clean_names(pivoted).explain()
#== Physical Plan ==
#HashAggregate(keys=[user_id#0], functions=[pivotfirst(type#1, sum(`d1`)#363, A, B, 0, 0), pivotfirst(type#1, sum(`d2`)#364, A, B, 0, 0), pivotfirst(type#1, sum(`d3`)#365, A, B, 0, 0)])
#+- Exchange hashpartitioning(user_id#0, 200)
# +- HashAggregate(keys=[user_id#0], functions=[partial_pivotfirst(type#1, sum(`d1`)#363, A, B, 0, 0), partial_pivotfirst(type#1, sum(`d2`)#364, A, B, 0, 0), partial_pivotfirst(type#1, sum(`d3`)#365, A, B, 0, 0)])
# +- *HashAggregate(keys=[user_id#0, type#1], functions=[sum(d1#2), sum(d2#3), sum(d3#4)])
# +- Exchange hashpartitioning(user_id#0, type#1, 200)
# +- *HashAggregate(keys=[user_id#0, type#1], functions=[partial_sum(d1#2), partial_sum(d2#3), partial_sum(d3#4)])
# +- Scan ExistingRDD[user_id#0,type#1,d1#2,d2#3,d3#4]
You'll see that the two are practically identical. You'll likely have some minuscule speed up by avoiding the regular expression, but it will be negligible compared to the pivot.

Wrote an easy and fast function to rename PySpark pivot tables. Enjoy! :)
# This function efficiently rename pivot tables' urgly names
def rename_pivot_cols(rename_df, remove_agg):
"""change spark pivot table's default ugly column names at ease.
Option 1: remove_agg = True: `2_sum(sum_amt)` --> `sum_amt_2`.
Option 2: remove_agg = False: `2_sum(sum_amt)` --> `sum_sum_amt_2`
"""
for column in rename_df.columns:
if remove_agg == True:
start_index = column.find('(')
end_index = column.find(')')
if (start_index > 0 and end_index > 0):
rename_df = rename_df.withColumnRenamed(column, column[start_index+1:end_index]+'_'+column[:1])
else:
new_column = column.replace('(','_').replace(')','')
rename_df = rename_df.withColumnRenamed(column, new_column[2:]+'_'+new_column[:1])
return rename_df

Related

Calculate value based on value from same column of the previous row in spark

I have an issue where I have to calculate a column using a formula that uses the value from the calculation done in the previous row.
I am unable to figure it out using withColumn API.
I need to calculate a new column, using the formula:
MovingRate = MonthlyRate + (0.7 * MovingRatePrevious)
... where the MovingRatePrevious is the MovingRate of the prior row.
For month 1, I have the value so I do not need to re-calculate that but I need that value to be able to calculate the subsequent rows. I need to partition by Type.
This is my original dataset:
Desired results in MovingRate column:
Altough its possible to do with Widow Functions (See #Leo C's answer), I bet its more performant to aggregate once per Type using a groupBy. Then, explode the results of the UDF to get all rows back:
val df = Seq(
(1, "blue", 0.4, Some(0.33)),
(2, "blue", 0.3, None),
(3, "blue", 0.7, None),
(4, "blue", 0.9, None)
)
.toDF("Month", "Type", "MonthlyRate", "MovingRate")
// this udf produces an Seq of Tuple3 (Month, MonthlyRate, MovingRate)
val calcMovingRate = udf((startRate:Double,rates:Seq[Row]) => rates.tail
.scanLeft((rates.head.getInt(0),startRate,startRate))((acc,curr) => (curr.getInt(0),curr.getDouble(1),acc._3+0.7*curr.getDouble(1)))
)
df
.groupBy($"Type")
.agg(
first($"MovingRate",ignoreNulls=true).as("startRate"),
collect_list(struct($"Month",$"MonthlyRate")).as("rates")
)
.select($"Type",explode(calcMovingRate($"startRate",$"rates")).as("movingRates"))
.select($"Type",$"movingRates._1".as("Month"),$"movingRates._2".as("MonthlyRate"),$"movingRates._3".as("MovingRate"))
.show()
gives:
+----+-----+-----------+------------------+
|Type|Month|MonthlyRate| MovingRate|
+----+-----+-----------+------------------+
|blue| 1| 0.33| 0.33|
|blue| 2| 0.3| 0.54|
|blue| 3| 0.7| 1.03|
|blue| 4| 0.9|1.6600000000000001|
+----+-----+-----------+------------------+
Given the nature of the requirement that each moving rate is recursively computed from the previous rate, the column-oriented DataFrame API won't shine especially if the dataset is huge.
That said, if the dataset isn't large, one approach would be to make Spark recalculate the moving rates row-wise via a UDF, with a Window-partitioned rate list as its input:
import org.apache.spark.sql.expressions.Window
val df = Seq(
(1, "blue", 0.4, Some(0.33)),
(2, "blue", 0.3, None),
(3, "blue", 0.7, None),
(4, "blue", 0.9, None),
(1, "red", 0.5, Some(0.2)),
(2, "red", 0.6, None),
(3, "red", 0.8, None)
).toDF("Month", "Type", "MonthlyRate", "MovingRate")
val win = Window.partitionBy("Type").orderBy("Month").
rowsBetween(Window.unboundedPreceding, 0)
def movingRate(factor: Double) = udf( (initRate: Double, monthlyRates: Seq[Double]) =>
monthlyRates.tail.foldLeft(initRate)( _ * factor + _ )
)
df.
withColumn("MovingRate", when($"Month" === 1, $"MovingRate").otherwise(
movingRate(0.7)(last($"MovingRate", ignoreNulls=true).over(win), collect_list($"MonthlyRate").over(win))
)).
show
// +-----+----+-----------+------------------+
// |Month|Type|MonthlyRate| MovingRate|
// +-----+----+-----------+------------------+
// | 1| red| 0.5| 0.2|
// | 2| red| 0.6| 0.74|
// | 3| red| 0.8| 1.318|
// | 1|blue| 0.4| 0.33|
// | 2|blue| 0.3|0.5309999999999999|
// | 3|blue| 0.7|1.0716999999999999|
// | 4|blue| 0.9|1.6501899999999998|
// +-----+----+-----------+------------------+
What you are trying to do is compute a recursive formula that looks like:
x[i] = y[i] + 0.7 * x[i-1]
where x[i] is your MovingRate at row i and y[i] your MonthlyRate at row i.
The problem is that this is a purely sequential formula. Each row needs the result of the previous one which in turn needs the result of the one before. Spark is a parallel computation engine and it is going to be hard to use it to speed up a calculation that cannot really be parallelized.

pyspark Transpose dataframe

I have a dataframe as given below
ID, Code_Num, Code, Code1, Code2, Code3
10, 1, A1005*B1003, A1005, B1003, null
12, 2, A1007*D1008*C1004, A1007, D1008, C1004
I need help on transposing the above dataset, and output should be displayed as below.
ID, Code_Num, Code, Code_T
10, 1, A1005*B1003, A1005
10, 1, A1005*B1003, B1003
12, 2, A1007*D1008*C1004, A1007
12, 2, A1007*D1008*C1004, D1008
12, 2, A1007*D1008*C1004, C1004
Step 1: Creating the DataFrame.
values = [(10, 'A1005*B1003', 'A1005', 'B1003', None),(12, 'A1007*D1008*C1004', 'A1007', 'D1008', 'C1004')]
df = sqlContext.createDataFrame(values,['ID','Code','Code1','Code2','Code3'])
df.show()
+---+-----------------+-----+-----+-----+
| ID| Code|Code1|Code2|Code3|
+---+-----------------+-----+-----+-----+
| 10| A1005*B1003|A1005|B1003| null|
| 12|A1007*D1008*C1004|A1007|D1008|C1004|
+---+-----------------+-----+-----+-----+
Step 2: Explode the DataFrame -
def to_transpose(df, by):
# Filter dtypes and split into column names and type description
cols, dtypes = zip(*((c, t) for (c, t) in df.dtypes if c not in by))
# Spark SQL supports only homogeneous columns
assert len(set(dtypes)) == 1, "All columns have to be of the same type"
# Create and explode an array of (column_name, column_value) structs
kvs = explode(array([
struct(lit(c).alias("key"), col(c).alias("val")) for c in cols
])).alias("kvs")
return df.select(by + [kvs]).select(by + ["kvs.key", "kvs.val"])
df = to_transpose(df, ["ID","Code"]).drop('key').withColumnRenamed("val","Code_T")
df.show()
+---+-----------------+------+
| ID| Code|Code_T|
+---+-----------------+------+
| 10| A1005*B1003| A1005|
| 10| A1005*B1003| B1003|
| 10| A1005*B1003| null|
| 12|A1007*D1008*C1004| A1007|
| 12|A1007*D1008*C1004| D1008|
| 12|A1007*D1008*C1004| C1004|
+---+-----------------+------+
In case you only want non-Null values in column Code_T, just run the statement below -
df = df.where(col('Code_T').isNotNull())

Scala add new column to dataframe by expression

I am going to add new column to a dataframe with expression.
for example, I have a dataframe of
+-----+----------+----------+-----+
| C1 | C2 | C3 |C4 |
+-----+----------+----------+-----+
|steak|1 |1 | 150|
|steak|2 |2 | 180|
| fish|3 |3 | 100|
+-----+----------+----------+-----+
and I want to create a new column C5 with expression "C2/C3+C4", assuming there are several new columns need to add, and the expressions may be different and come from database.
Is there a good way to do this?
I know that if I have an expression like "2+3*4" I can use scala.tools.reflect.ToolBox to eval it.
And normally I am using df.withColumn to add new column.
Seems I need to create an UDF, but how can I pass the columns value as parameters to UDF? especially there maybe multiple expression need different columns calculate.
This can be done using expr to create a Column from an expression:
val df = Seq((1,2)).toDF("x","y")
val myExpression = "x+y"
import org.apache.spark.sql.functions.expr
df.withColumn("z",expr(myExpression)).show()
+---+---+---+
| x| y| z|
+---+---+---+
| 1| 2| 3|
+---+---+---+
Two approaches:
import spark.implicits._ //so that you could use .toDF
val df = Seq(
("steak", 1, 1, 150),
("steak", 2, 2, 180),
("fish", 3, 3, 100)
).toDF("C1", "C2", "C3", "C4")
import org.apache.spark.sql.functions._
// 1st approach using expr
df.withColumn("C5", expr("C2/(C3 + C4)")).show()
// 2nd approach using selectExpr
df.selectExpr("*", "(C2/(C3 + C4)) as C5").show()
+-----+---+---+---+--------------------+
| C1| C2| C3| C4| C5|
+-----+---+---+---+--------------------+
|steak| 1| 1|150|0.006622516556291391|
|steak| 2| 2|180| 0.01098901098901099|
| fish| 3| 3|100| 0.02912621359223301|
+-----+---+---+---+--------------------+
In Spark 2.x, you can create a new column C5 with expression "C2/C3+C4" using withColumn() and org.apache.spark.sql.functions._,
val currentDf = Seq(
("steak", 1, 1, 150),
("steak", 2, 2, 180),
("fish", 3, 3, 100)
).toDF("C1", "C2", "C3", "C4")
val requiredDf = currentDf
.withColumn("C5", (col("C2")/col("C3")+col("C4")))
Also, you can do the same using org.apache.spark.sql.Column as well.
(But the space complexity is bit higher in this approach than using org.apache.spark.sql.functions._ due to the Column object creation)
val requiredDf = currentDf
.withColumn("C5", (new Column("C2")/new Column("C3")+new Column("C4")))
This worked perfectly for me. I am using Spark 2.0.2.

spark aggregating column into Set efficiently

How can I aggregate a column into an Set (Array of unique elements) in spark efficiently?
case class Foo(a:String, b:String, c:Int, d:Array[String])
val df = Seq(Foo("A", "A", 123, Array("A")),
Foo("A", "A", 123, Array("B")),
Foo("B", "B", 123, Array("C", "A")),
Foo("B", "B", 123, Array("C", "E", "A")),
Foo("B", "B", 123, Array("D"))
).toDS()
Will result in
+---+---+---+---------+
| a| b| c| d|
+---+---+---+---------+
| A| A|123| [A]|
| A| A|123| [B]|
| B| B|123| [C, A]|
| B| B|123|[C, E, A]|
| B| B|123| [D]|
+---+---+---+---------+
what I am Looking for is (ordering of d column is not important):
+---+---+---+------------+
| a| b| c| d |
+---+---+---+------------+
| A| A|123| [A, B]. |
| B| B|123|[C, A, E, D]|
+---+---+---+------------+
this may be a bit similar to How to aggregate values into collection after groupBy? or the example from HighPerformanceSpark of https://github.com/high-performance-spark/high-performance-spark-examples/blob/57a6267fb77fae5a90109bfd034ae9c18d2edf22/src/main/scala/com/high-performance-spark-examples/transformations/SmartAggregations.scala#L33-L43
Using the following code:
import org.apache.spark.sql.functions.udf
val flatten = udf((xs: Seq[Seq[String]]) => xs.flatten.distinct)
val d = flatten(collect_list($"d")).alias("d")
df.groupBy($"a", $"b", $"c").agg(d).show
will produce the desired result, but I wonder if there are any possibilities to improve performance using the RDD API as outlined in the book. And would like to know how to formulate it using data set API.
Details about the execution for this minimal sample follow below:
== Optimized Logical Plan ==
GlobalLimit 21
+- LocalLimit 21
+- Aggregate [a#45, b#46, c#47], [a#45, b#46, c#47, UDF(collect_list(d#48, 0, 0)) AS d#82]
+- LocalRelation [a#45, b#46, c#47, d#48]
== Physical Plan ==
CollectLimit 21
+- SortAggregate(key=[a#45, b#46, c#47], functions=[collect_list(d#48, 0, 0)], output=[a#45, b#46, c#47, d#82])
+- *Sort [a#45 ASC NULLS FIRST, b#46 ASC NULLS FIRST, c#47 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(a#45, b#46, c#47, 200)
+- LocalTableScan [a#45, b#46, c#47, d#48]
edit
The problems of this operation are outlined very well https://github.com/awesome-spark/spark-gotchas/blob/master/04_rdd_actions_and_transformations_by_example.md#be-smart-about-groupbykey
edit2
As you can see the DAG for the dataSet query suggested below is more complicated and instead of 0.4 seem to take 2 seconds.
Try this
df.groupByKey(foo => (foo.a, foo.b, foo.c)).
reduceGroups{
(foo1, foo2) =>
foo1.copy(d = (foo1.d ++ foo2.d).distinct )
}.map(_._2)

How to slice and sum elements of array column?

I would like to sum (or perform other aggregate functions too) on the array column using SparkSQL.
I have a table as
+-------+-------+---------------------------------+
|dept_id|dept_nm| emp_details|
+-------+-------+---------------------------------+
| 10|Finance| [100, 200, 300, 400, 500]|
| 20| IT| [10, 20, 50, 100]|
+-------+-------+---------------------------------+
I would like to sum the values of this emp_details column .
Expected query:
sqlContext.sql("select sum(emp_details) from mytable").show
Expected result
1500
180
Also I should be able to sum on the range elements too like :
sqlContext.sql("select sum(slice(emp_details,0,3)) from mytable").show
result
600
80
when doing sum on the Array type as expected it shows that sum expects argument to be numeric type not array type.
I think we need to create UDF for this . but how ?
Will I be facing any performance hits with UDFs ?
and is there any other solution apart from the UDF one ?
Spark 2.4.0
As of Spark 2.4, Spark SQL supports higher-order functions that are to manipulate complex data structures, including arrays.
The "modern" solution would be as follows:
scala> input.show(false)
+-------+-------+-------------------------+
|dept_id|dept_nm|emp_details |
+-------+-------+-------------------------+
|10 |Finance|[100, 200, 300, 400, 500]|
|20 |IT |[10, 20, 50, 100] |
+-------+-------+-------------------------+
input.createOrReplaceTempView("mytable")
val sqlText = "select dept_id, dept_nm, aggregate(emp_details, 0, (acc, value) -> acc + value) as sum from mytable"
scala> sql(sqlText).show
+-------+-------+----+
|dept_id|dept_nm| sum|
+-------+-------+----+
| 10|Finance|1500|
| 20| IT| 180|
+-------+-------+----+
You can find a good reading on higher-order functions in the following articles and video:
Introducing New Built-in and Higher-Order Functions for Complex Data Types in Apache Spark 2.4
Working with Nested Data Using Higher Order Functions in SQL on Databricks
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell (Databricks)
Spark 2.3.2 and earlier
DISCLAIMER I would not recommend this approach (even though it got the most upvotes) because of the deserialization that Spark SQL does to execute Dataset.map. The query forces Spark to deserialize the data and load it onto JVM (from memory regions that are managed by Spark outside JVM). That will inevitably lead to more frequent GCs and hence make performance worse.
One solution would be to use Dataset solution where the combination of Spark SQL and Scala could show its power.
scala> val inventory = Seq(
| (10, "Finance", Seq(100, 200, 300, 400, 500)),
| (20, "IT", Seq(10, 20, 50, 100))).toDF("dept_id", "dept_nm", "emp_details")
inventory: org.apache.spark.sql.DataFrame = [dept_id: int, dept_nm: string ... 1 more field]
// I'm too lazy today for a case class
scala> inventory.as[(Long, String, Seq[Int])].
map { case (deptId, deptName, details) => (deptId, deptName, details.sum) }.
toDF("dept_id", "dept_nm", "sum").
show
+-------+-------+----+
|dept_id|dept_nm| sum|
+-------+-------+----+
| 10|Finance|1500|
| 20| IT| 180|
+-------+-------+----+
I'm leaving the slice part as an exercise as it's equally simple.
Since Spark 2.4 you can slice with the slice function:
import org.apache.spark.sql.functions.slice
val df = Seq(
(10, "Finance", Seq(100, 200, 300, 400, 500)),
(20, "IT", Seq(10, 20, 50, 100))
).toDF("dept_id", "dept_nm", "emp_details")
val dfSliced = df.withColumn(
"emp_details_sliced",
slice($"emp_details", 1, 3)
)
dfSliced.show(false)
+-------+-------+-------------------------+------------------+
|dept_id|dept_nm|emp_details |emp_details_sliced|
+-------+-------+-------------------------+------------------+
|10 |Finance|[100, 200, 300, 400, 500]|[100, 200, 300] |
|20 |IT |[10, 20, 50, 100] |[10, 20, 50] |
+-------+-------+-------------------------+------------------+
and sum arrays with aggregate:
dfSliced.selectExpr(
"*",
"aggregate(emp_details, 0, (x, y) -> x + y) as details_sum",
"aggregate(emp_details_sliced, 0, (x, y) -> x + y) as details_sliced_sum"
).show
+-------+-------+--------------------+------------------+-----------+------------------+
|dept_id|dept_nm| emp_details|emp_details_sliced|details_sum|details_sliced_sum|
+-------+-------+--------------------+------------------+-----------+------------------+
| 10|Finance|[100, 200, 300, 4...| [100, 200, 300]| 1500| 600|
| 20| IT| [10, 20, 50, 100]| [10, 20, 50]| 180| 80|
+-------+-------+--------------------+------------------+-----------+------------------+
A possible approach it to use explode() on your Array column and consequently aggregate the output by unique key. For example:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
(mytable
.withColumn("emp_sum",
explode($"emp_details"))
.groupBy("dept_nm")
.agg(sum("emp_sum")).show)
+-------+------------+
|dept_nm|sum(emp_sum)|
+-------+------------+
|Finance| 1500|
| IT| 180|
+-------+------------+
To select only specific values in your array, we can work with the answer from the linked question and apply it with a slight modification:
val slice = udf((array : Seq[Int], from : Int, to : Int) => array.slice(from,to))
(mytable
.withColumn("slice",
slice($"emp_details",
lit(0),
lit(3)))
.withColumn("emp_sum",
explode($"slice"))
.groupBy("dept_nm")
.agg(sum("emp_sum")).show)
+-------+------------+
|dept_nm|sum(emp_sum)|
+-------+------------+
|Finance| 600|
| IT| 80|
+-------+------------+
Data:
val data = Seq((10, "Finance", Array(100,200,300,400,500)),
(20, "IT", Array(10,20,50,100)))
val mytable = sc.parallelize(data).toDF("dept_id", "dept_nm","emp_details")
Here is an alternative to mtoto's answer without using a groupBy (I really don't know which one is fastest: UDF, mtoto solution or mine, comments welcome)
You would a performance impact on using a UDF, in general. There is an answer which you might want to read and this resource is a good read on UDF.
Now for your problem, you can avoid the use of a UDF. What I would use is a Column expression generated with Scala logic.
data:
val df = Seq((10, "Finance", Array(100,200,300,400,500)),
(20, "IT", Array(10, 20, 50,100)))
.toDF("dept_id", "dept_nm","emp_details")
You need some trickery to be able to traverse a ArrayType, you can play a bit with the solution to discover various problems (see edit at the bottom for the slice part). Here is my proposal but you might find better. First you take the maximum length
val maxLength = df.select(size('emp_details).as("l")).groupBy().max("l").first.getInt(0)
Then you use it, testing when you have a shorter array
val sumArray = (1 until maxLength)
.map(i => when(size('emp_details) > i,'emp_details(i)).otherwise(lit(0)))
.reduce(_ + _)
.as("sumArray")
val res = df
.select('dept_id,'dept_nm,'emp_details,sumArray)
result:
+-------+-------+--------------------+--------+
|dept_id|dept_nm| emp_details|sumArray|
+-------+-------+--------------------+--------+
| 10|Finance|[100, 200, 300, 4...| 1500|
| 20| IT| [10, 20, 50, 100]| 180|
+-------+-------+--------------------+--------+
I advise you to look at sumArray to understand what it is doing.
Edit: Of course I only read half of the question again... But if you want to changes the items on which to sum, you can see that it becomes obvious with this solution (i.e. you don't need a slice function), just change (0 until maxLength) with the range of index you need:
def sumArray(from: Int, max: Int) = (from until max)
.map(i => when(size('emp_details) > i,'emp_details(i)).otherwise(lit(0)))
.reduce(_ + _)
.as("sumArray")
Building on zero323's awesome answer; in case you have an array of Long integers i.e. BIGINT, you need to change the initial value from 0 to BIGINT(0) as explained in the first paragraph here
so you have
dfSliced.selectExpr(
"*",
"aggregate(emp_details, BIGINT(0), (x, y) -> x + y) as details_sum",
"aggregate(emp_details_sliced, BIGINT(0), (x, y) -> x + y) as details_sliced_sum"
).show
The rdd way is missing, so let me add it.
val df = Seq((10, "Finance", Array(100,200,300,400,500)),(20, "IT", Array(10,20,50,100))).toDF("dept_id", "dept_nm","emp_details")
import scala.collection.mutable._
val rdd1 = df.rdd.map( x=> {val p = x.getAs[mutable.WrappedArray[Int]]("emp_details").toArray; Row.merge(x,Row(p.sum,p.slice(0,2).sum)) })
spark.createDataFrame(rdd1,df.schema.add(StructField("sumArray",IntegerType)).add(StructField("sliceArray",IntegerType))).show(false)
Output:
+-------+-------+-------------------------+--------+----------+
|dept_id|dept_nm|emp_details |sumArray|sliceArray|
+-------+-------+-------------------------+--------+----------+
|10 |Finance|[100, 200, 300, 400, 500]|1500 |300 |
|20 |IT |[10, 20, 50, 100] |180 |30 |
+-------+-------+-------------------------+--------+----------+