Spark - Sum of row values - scala

I have the following DataFrame:
January | February | March
-----------------------------
10 | 10 | 10
20 | 20 | 20
50 | 50 | 50
I'm trying to add a column to this which is the sum of the values of each row.
January | February | March | TOTAL
----------------------------------
10 | 10 | 10 | 30
20 | 20 | 20 | 60
50 | 50 | 50 | 150
As far as I can see, all the built-in aggregate functions seem to be for calculating values in single columns. How do I go about using values across columns on a per-row basis (using Scala)?
I've gotten as far as
val newDf: DataFrame = df.select(colsToSum.map(col):_*).foreach ...

You were very close with this:
val newDf: DataFrame = df.select(colsToSum.map(col):_*).foreach ...
Instead, try this:
val newDf = df.select(colsToSum.map(col).reduce((c1, c2) => c1 + c2) as "sum")
I think this is the best of the answers, because it is as fast as the answer with the hard-coded SQL query, and as convenient as the one that uses the UDF. It's the best of both worlds -- and I didn't even add a full line of code!
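For reference, here is a minimal sketch (assuming colsToSum holds the month column names, e.g. Seq("January", "February", "March")) that keeps the original columns and appends the row total as TOTAL instead of selecting only the sum:
import org.apache.spark.sql.functions.col
// Build a single Column expression that adds all listed columns, then append it.
val totalCol = colsToSum.map(col).reduce(_ + _)
df.withColumn("TOTAL", totalCol).show()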

Alternatively, building on Hugo's approach and example, you can create a UDF that receives any number of columns and sums them all.
from functools import reduce
from pyspark.sql.functions import udf

def superSum(*cols):
    return reduce(lambda a, b: a + b, cols)

add = udf(superSum)
df.withColumn('total', add(*[df[x] for x in df.columns])).show()
+-------+--------+-----+-----+
|January|February|March|total|
+-------+--------+-----+-----+
| 10| 10| 10| 30|
| 20| 20| 20| 60|
+-------+--------+-----+-----+

This code is in Python, but it can be easily translated:
# First we create an RDD in order to create a DataFrame:
rdd = sc.parallelize([(10, 10,10), (20, 20,20)])
df = rdd.toDF(['January', 'February', 'March'])
df.show()
# Here, we create a new column called 'TOTAL' which has results
# from add operation of columns df.January, df.February and df.March
df.withColumn('TOTAL', df.January + df.February + df.March).show()
Output:
+-------+--------+-----+
|January|February|March|
+-------+--------+-----+
| 10| 10| 10|
| 20| 20| 20|
+-------+--------+-----+
+-------+--------+-----+-----+
|January|February|March|TOTAL|
+-------+--------+-----+-----+
| 10| 10| 10| 30|
| 20| 20| 20| 60|
+-------+--------+-----+-----+
You can also create a User Defined Function if you want; here is a link to the Spark documentation:
UserDefinedFunction (udf)
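For the question's Scala setting, a rough translation of that UDF idea could look like the sketch below (sumThree is a made-up name and it assumes exactly three integer columns; the plain column arithmetic shown above is usually preferable because UDFs are opaque to the optimizer):
import org.apache.spark.sql.functions.{col, udf}
// Hypothetical UDF that sums three integer columns into TOTAL.
val sumThree = udf((a: Int, b: Int, c: Int) => a + b + c)
df.withColumn("TOTAL", sumThree(col("January"), col("February"), col("March"))).show()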

Working Scala example with dynamic column selection:
import sqlContext.implicits._
import org.apache.spark.sql.functions.col
val rdd = sc.parallelize(Seq((10, 10, 10), (20, 20, 20)))
val df = rdd.toDF("January", "February", "March")
df.show()
+-------+--------+-----+
|January|February|March|
+-------+--------+-----+
| 10| 10| 10|
| 20| 20| 20|
+-------+--------+-----+
val sumDF = df.withColumn("TOTAL", df.columns.map(c => col(c)).reduce((c1, c2) => c1 + c2))
sumDF.show()
+-------+--------+-----+-----+
|January|February|March|TOTAL|
+-------+--------+-----+-----+
| 10| 10| 10| 30|
| 20| 20| 20| 60|
+-------+--------+-----+-----+

You can use expr() for this. In Scala, use:
df.withColumn("TOTAL", expr("January+February+March"))

Related

Spark monotonically_increasing_id() gives consecutive ids for all the partitions

I have a dataframe df in Spark which looks something like this:
val df = (1 to 10).toList.toDF()
When I check the number of partitions, I see that I have 10 partitions:
df.rdd.getNumPartitions
res0: Int = 10
Now I generate an ID column:
val dfWithID = df.withColumn("id", monotonically_increasing_id())
dfWithID.show()
+-----+---+
|value| id|
+-----+---+
| 1| 0|
| 2| 1|
| 3| 2|
| 4| 3|
| 5| 4|
| 6| 5|
| 7| 6|
| 8| 7|
| 9| 8|
| 10| 9|
+-----+---+
So all the generated ids are consecutive, even though I have 10 partitions. Then I repartition the dataframe:
val dfp = df.repartition(10)
val dfpWithID = dfp.withColumn("id", monotonically_increasing_id())
dfpWithID.show()
+-----+-----------+
|value| id|
+-----+-----------+
| 10| 0|
| 1| 8589934592|
| 7|17179869184|
| 5|25769803776|
| 4|42949672960|
| 9|42949672961|
| 2|51539607552|
| 8|60129542144|
| 6|68719476736|
| 3|77309411328|
+-----+-----------+
Now the ids I get are not consecutive anymore. Based on the Spark documentation, it should put the partition ID in the upper 31 bits, and in both cases I have 10 partitions. Why does it only add the partition ID after calling repartition()?
I assume this is because all the data in your initial dataframe resides in a single partition, the other 9 being empty.
To verify this, use the answers given here: Apache Spark: Get number of records per partition
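For example, one quick way to inspect the per-partition record counts (a sketch along the lines of the answers linked above):
// Count the records held by each partition of the underlying RDD.
df.rdd
  .mapPartitionsWithIndex((idx, rows) => Iterator((idx, rows.size)))
  .collect()
  .foreach { case (idx, n) => println(s"partition $idx: $n records") }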

How to write function in Spark column so a each field in the column increment the value?

It's not about unique IDs, so I don't mean to use the increasing unique number API; I'd rather solve it with a customized query.
Consider a given value like 30. The current dataframe df needs a new column called hop_number, so that each field in the column, from top to bottom, increments by 2 starting from 30, with 2 parameters:
x -> the start number, here 30
y -> the step or offset, here 2
hop_number
---------------
30
32
34
36
38
40
......
I know that with an RDD we can use map to handle the job, but how do I do the same with a DataFrame at minimal cost?
df.column("hop_number", 30 + map(x => x + 2)) // pseudo code
Check the code below.
scala> import org.apache.spark.sql.expressions._
scala> import org.apache.spark.sql.functions._
scala> val x = lit(30)
x: org.apache.spark.sql.Column = 30
scala> val y = lit(2)
y: org.apache.spark.sql.Column = 2
scala> df.withColumn("hop_number",(x + (row_number().over(Window.orderBy(lit(1)))-1) * y)).show(false)
+----------+
|hop_number|
+----------+
|30 |
|32 |
|34 |
|36 |
|38 |
+----------+
Assuming you have a grouping and ordering column, you can use the window function.
import pyspark.sql.functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import *
from pyspark.sql import Window
tst= sqlContext.createDataFrame([(1,1,14),(1,2,4),(1,3,10),(2,1,90),(7,2,30),(2,3,11)],schema=['group','order','value'])
w=Window.partitionBy('group').orderBy('order')
tst_hop= tst.withColumn("temp",F.sum(F.lit(2)).over(w)).withColumn("hop_number",F.col('temp')+28)
The results:
tst_hop.show()
+-----+-----+-----+----+----------+
|group|order|value|temp|hop_number|
+-----+-----+-----+----+----------+
| 1| 1| 14| 2| 30|
| 1| 2| 4| 4| 32|
| 1| 3| 10| 6| 34|
| 2| 1| 90| 2| 30|
| 2| 3| 11| 4| 32|
| 7| 2| 30| 2| 30|
+-----+-----+-----+----+----------+
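For reference, a rough Scala equivalent of the same windowed running-sum idea (a sketch, assuming a Scala DataFrame tst with the same group, order and value columns):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, sum}
val w = Window.partitionBy("group").orderBy("order")
// Running sum of a constant step of 2 per group, shifted so the first row starts at 30.
tst.withColumn("temp", sum(lit(2)).over(w))
   .withColumn("hop_number", col("temp") + 28)
   .show()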
If you need a different approach, please provide sample data for the dataframe.

How can I add a column to a DataFrame which groups rows in chunks of N? Like NTILE, but with a fixed bucket size

Say I have a DataFrame like:
+------------+-----------+-----+
| feed|artist |count|
+------------+-----------+-----+
| y| Kanye West| 9|
| y| Radiohead| 6|
| y| Zero 7| 3|
| y| Puts Marie| 1|
| gwas| Drax| 7|
| gwas| Calibre| 4|
| gwas| Aphex Twin| 1|
| gwas| Jay Z| 1|
| x| DJ Krush| 2|
| x| Titeknots| 1|
+------------+-----------+-----+
I want to add a new column which chunks the rows into buckets of N rows for each partition (feed).
It seems like the inverse of NTILE to me. NTILE lets you choose the # of buckets but I want to choose the bucket-size instead.
Here's the desired result. Notice how each feed is chunked into groups of N = 2, including the x feed which has just one chunk of 2 rows. (Edit: each partition is ordered by count, so group 1 in each partition will be the rows with the highest value for count)
+------------+-----------+-----+-----+
|        feed|artist     |count|group|
+------------+-----------+-----+-----+
|           y| Kanye West|    9|    1|
|           y|  Radiohead|    6|    1|
|           y|     Zero 7|    3|    2|
|           y| Puts Marie|    1|    2|
|        gwas|       Drax|    7|    1|
|        gwas|    Calibre|    4|    1|
|        gwas| Aphex Twin|    1|    2|
|        gwas|      Jay Z|    1|    2|
|           x|   DJ Krush|    2|    1|
|           x|  Titeknots|    1|    1|
+------------+-----------+-----+-----+
As a bonus, I would like each bucket to be a different size. E.g. List(2, 2, 4, 10, 10, -1) would mean that the first bucket has 2 rows, the second has 2 rows, the third has 4 rows, etc., and the final bucket (-1) contains the remainder.
EDIT
(Another useful variation)
While implementing the answers, I realized that there's another variation which I would prefer:
Add a column to a DataFrame which chunks its rows into groups of N, without knowing the size of the DataFrame.
Example:
If N = 100 and the DataFrame has 800 rows, it will chunk it into 8 buckets of 100. If the DataFrame has 950 rows, it will chunk it into 9 buckets of 100, and 1 bucket of 50. It should not require a scan/call to .count().
The example DataFrames are analogous to the ones above.
(meta: should I make a new question for this variation? I feel like "NTILE with a fixed bucket size" is a more elegant problem and probably more common than my original use-case)
If I understand you correctly, this can be handled by using an SQL expression:
import org.apache.spark.sql.functions.{expr,row_number,desc}
import org.apache.spark.sql.expressions.Window
// set up WindowSpec
val w1 = Window.partitionBy("feed").orderBy(desc("count"))
val L = List(2, 2, 4, 10, 10, -1)
// dynamically create SQL expression from the List `L` to map row_number into group-id
var sql_expr = "CASE"
var running_total = 0
for(i <- 1 to L.size) {
  running_total += L(i-1)
  sql_expr += (if(L(i-1) > 0) s" WHEN rn <= $running_total THEN $i " else s" ELSE $i END")
}
println(sql_expr)
//CASE WHEN rn <= 2 THEN 1 WHEN rn <= 4 THEN 2 WHEN rn <= 8 THEN 3 WHEN rn <= 18 THEN 4 WHEN rn <= 28 THEN 5 ELSE 6 END
val df_new = df.withColumn("rn", row_number().over(w1)).withColumn("group", expr(sql_expr)).drop("rn")
df_new.show
+----+----------+-----+-----+
|feed| artist|count|group|
+----+----------+-----+-----+
|gwas| Drax| 7| 1|
|gwas| Calibre| 4| 1|
|gwas|Aphex Twin| 1| 2|
|gwas| Jay Z| 1| 2|
| x| DJ Krush| 2| 1|
| x| Titeknots| 1| 1|
| y|Kanye West| 9| 1|
| y| Radiohead| 6| 1|
| y| Zero 7| 3| 2|
| y|Puts Marie| 1| 2|
+----+----------+-----+-----+
For a fixed N, just cast (row_number-1)/N + 1 to int:
val N = 2
val df_new = df.withColumn("group", ((row_number().over(w1)-1)/N+1).cast("int"))
This could work:
val bucketDef = List(2, 2, 4, 10, 10)
val bucketRunsum = bucketDef.scanLeft(1)( _ + _) // calc running sum
// maps a row-number to a bucket
val indexBucketMapping = bucketRunsum.zip(bucketRunsum.tail)
  .zipWithIndex
  .map{ case ((start, end), index) => ((start, end), index + 1) } // make index start at 1
// gives List(((1,3),1), ((3,5),2), ((5,9),3), ((9,19),4), ((19,29),5))
// udf to assign a bucket to a given row-number
val calcBucket = udf((rnb: Long) => indexBucketMapping
  .find{ case ((start, end), _) => start <= rnb && rnb < end }
  .map(_._2) // get index
  .getOrElse(indexBucketMapping.last._2 + 1) // is in last bucket
)
df
  .withColumn("group", calcBucket(row_number().over(Window.partitionBy($"feed").orderBy($"count"))))
Alternatively (without a UDF), construct a DataFrame which maps a row-number to a bucket and then join:
val bucketSizeDef = List(2, 2, 4, 10, 10)
val bucketDef = (1 +: bucketSizeDef).zipWithIndex.map{ case (bs, index) => (bs, index + 1) }
  .toDF("bucketSize", "group")
  .withColumn("i", sum($"bucketSize").over(Window.orderBy($"group")))
  .withColumn("i_to", coalesce(lead($"i", 1).over(Window.orderBy($"group")), lit(Long.MaxValue)))
  .drop($"bucketSize")
bucketDef.show()
gives:
+-----+---+-------------------+
|group| i| i_to|
+-----+---+-------------------+
| 1| 1| 3|
| 2| 3| 5|
| 3| 5| 9|
| 4| 9| 19|
| 5| 19| 29|
| 6| 29|9223372036854775807|
+-----+---+-------------------+
then join to df:
df
  .withColumn("rnb", row_number().over(Window.partitionBy($"feed").orderBy($"count")))
  .join(broadcast(bucketDef), $"rnb" >= $"i" and $"rnb" < $"i_to")
  .drop("rnb", "i", "i_to")

PySpark DataFrame multiply columns based on values in other columns

Pyspark newbie here. I have a dataframe, say,
+------+----+-----+
|    id|mode|count|
+------+----+-----+
|146360| DOS|   30|
|423541| UNO|    3|
+------+----+-----+
I want a dataframe with a new column aggregate that contains count * 2 when mode is 'DOS' and count * 1 when mode is 'UNO':
+------+----+-----+---------+
|    id|mode|count|aggregate|
+------+----+-----+---------+
|146360| DOS|   30|       60|
|423541| UNO|    3|        3|
+------+----+-----+---------+
Appreciate your inputs and also some pointers to best practices :)
Method 1: using pyspark.sql.functions with when:
from pyspark.sql.functions import when,col
df = df.withColumn('aggregate', when(col('mode')=='DOS', col('count')*2).when(col('mode')=='UNO', col('count')*1).otherwise(col('count')))
Method 2: using SQL CASE expression with selectExpr:
df = df.selectExpr("*","CASE WHEN mode == 'DOS' THEN count*2 WHEN mode == 'UNO' THEN count*1 ELSE count END AS aggregate")
The result:
+------+----+-----+---------+
| id|mode|count|aggregate|
+------+----+-----+---------+
|146360| DOS| 30| 60|
|423541| UNO| 3| 3|
+------+----+-----+---------+

Spark Scala DF: add a new column to a DF based on processing of some rows of the same column

Dears,
I'm new to Spark Scala, and I have a DF with two columns, "UG" and "Counts", and I'd like to obtain the third, as shown in this list.
DF: UG, Counts, CUG (the columns)
UG    Counts  CUG
of        12    4
of        23    4
the      134    3
love      68    2
pain       3    1
the       18    3
love     100    2
of        23    4
the       12    3
of        11    4
I need to add a new column called "CUG", the third one shown, where CUG(i) is the number of times that string(i) in UG appears in the whole column.
I tried the following approach:
With the DF as in the previous table in df, I wrote a SQL UDF function to count the number of times that the string appears in the column "UG", that is:
val NW1 = (w1: String) => {
  df.filter($"UG".like(w1.substring(1, w1.length - 1))).count()
}: Long
val sqlfunc = udf(NW1)
val df2= df.withColumn("CUG",sqlfunc(col("UG")))
But when I tried it, it didn't work. I got a NullPointerException. The UDF worked in isolation, but not within the DF.
What can I do in order to obtain the desired results using the DF?
Thanks in advance.
jm3
What you can do is first count the number of rows grouped by the UG column, which gives the third column you need, and then join with the original data frame. You can rename the column with the withColumnRenamed function if you want.
scala> import org.apache.spark.sql.functions._
scala> myDf.show()
+----+------+
| UG|Counts|
+----+------+
| of| 12|
| of| 23|
| the| 134|
|love| 68|
|pain| 3|
| the| 18|
|love| 100|
| of| 23|
| the| 12|
| of| 11|
+----+------+
scala> myDf.join(myDf.groupBy("UG").count().withColumnRenamed("count", "CUG"), "UG").show()
+----+------+---+
| UG|Counts|CUG|
+----+------+---+
| of| 12| 4|
| of| 23| 4|
| the| 134| 3|
|love| 68| 2|
|pain| 3| 1|
| the| 18| 3|
|love| 100| 2|
| of| 23| 4|
| the| 12| 3|
| of| 11| 4|
+----+------+---+
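An alternative sketch that avoids the explicit join by using a window aggregate (same myDf assumed; this counts the rows sharing each UG value directly):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, lit}
// Count how many rows share each UG value and attach it as CUG.
myDf.withColumn("CUG", count(lit(1)).over(Window.partitionBy("UG"))).show()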