How to use sum on a groupBy result in Spark DataFrames? - scala

Based on the following dataframe:
+---+-----+----+
| ID|Categ|Amnt|
+---+-----+----+
| 1| A| 10|
| 1| A| 5|
| 2| A| 56|
| 2| B| 13|
+---+-----+----+
I would like to obtain the sum of the column Amnt, grouped by ID and Categ.
+---+-----+-----+
| ID|Categ|Count|
+---+-----+-----+
| 1| A| 15 |
| 2| A| 56 |
| 2| B| 13 |
+---+-----+-----+
In SQL I would be doing something like
SELECT ID,
Categ,
SUM (Count)
FROM Table
GROUP BY ID,
Categ;
But how to do this in Scala?
I tried
DF.groupBy($"ID", $"Categ").sum("Count")
But this just changed the Count column name into sum(count) instead of actually giving me the sum of the counts.

Maybe you were summing the wrong column, but your groupBy/sum statement looks syntactically correct to me:
val df = Seq(
(1, "A", 10),
(1, "A", 5),
(2, "A", 56),
(2, "B", 13)
).toDF("ID", "Categ", "Amnt")
df.groupBy("ID", "Categ").sum("Amnt").show
// +---+-----+---------+
// | ID|Categ|sum(Amnt)|
// +---+-----+---------+
// | 1| A| 15|
// | 2| A| 56|
// | 2| B| 13|
// +---+-----+---------+
EDIT:
To alias the sum(Amnt) column, or to compute multiple aggregations, wrap the aggregation expression(s) in agg. For example:
import org.apache.spark.sql.functions.{count, sum}
// Rename `sum(Amnt)` as `Sum`
df.groupBy("ID", "Categ").agg(sum("Amnt").as("Sum"))
// Aggregate `sum(Amnt)` and `count(Categ)`
df.groupBy("ID", "Categ").agg(sum("Amnt"), count("Categ"))
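As a cross-check of the numbers in the output above, here is a minimal plain-Python sketch that groups the same four rows by (ID, Categ) and sums Amnt. It does not use Spark at all; the names rows and totals are illustrative only.

```python
from collections import defaultdict

# Same rows as the example DataFrame: (ID, Categ, Amnt)
rows = [(1, "A", 10), (1, "A", 5), (2, "A", 56), (2, "B", 13)]

# Accumulate Amnt per (ID, Categ) key, mirroring groupBy("ID", "Categ").sum("Amnt")
totals = defaultdict(int)
for id_, categ, amnt in rows:
    totals[(id_, categ)] += amnt

for key in sorted(totals):
    print(key, totals[key])
```

If the summed column is not the one holding the amounts, the totals will differ from the expected 15, 56 and 13, which is the mistake the answer suspects.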

Pyspark sort and get first and last

I used the code below to sort based on one column. How can I get the first and last elements of the sorted dataframe?
group_by_dataframe
.count()
.filter("`count` >= 10")
.sort(desc("count"))
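The max and min functions are aggregates, so they are normally applied to grouped data. One way to apply them to the whole dataframe is to create a dummy column as below, group by it, and then call max and min for the maximum and minimum values. (Calling agg directly on the dataframe, without a groupBy, also aggregates over all rows, so the dummy column is not strictly required.)
If that's all you need, you don't really need the sort here.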
from pyspark.sql import functions as F
df = spark.createDataFrame([("a", 0.694), ("b", -2.669), ("a", 0.245), ("a", 0.1), ("b", 0.3), ("c", 0.3)], ["n", "val"])
df.show()
+---+------+
| n| val|
+---+------+
| a| 0.694|
| b|-2.669|
| a| 0.245|
| a| 0.1|
| b| 0.3|
| c| 0.3|
+---+------+
df = df.groupby('n').count() #.sort(F.desc('count'))
df = df.withColumn('dummy', F.lit(1))
df.show()
+---+-----+-----+
| n|count|dummy|
+---+-----+-----+
| c| 1| 1|
| b| 2| 1|
| a| 3| 1|
+---+-----+-----+
df = df.groupBy('dummy').agg(F.min('count').alias('min'), F.max('count').alias('max')).drop('dummy')
df.show()
+---+---+
|min|max|
+---+---+
| 1| 3|
+---+---+
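To see why the sort can be skipped, here is a small plain-Python sketch (no Spark; the names values, counts and ordered are illustrative) showing that the first and last rows of the descending sort are exactly the rows holding the maximum and minimum counts.

```python
from collections import Counter

# Same (n, val) pairs as the example DataFrame
values = [("a", 0.694), ("b", -2.669), ("a", 0.245),
          ("a", 0.1), ("b", 0.3), ("c", 0.3)]

# Equivalent of groupby('n').count()
counts = Counter(n for n, _ in values)

# Equivalent of sort(desc('count')), then taking the first and last rows
ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
first, last = ordered[0], ordered[-1]

# The same two numbers without sorting at all
smallest, largest = min(counts.values()), max(counts.values())
print(first, last, smallest, largest)
```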

How to efficiently perform this column operation on a Spark Dataframe?

I have a dataframe as follows:
+---+---+---+
| F1| F2| F3|
+---+---+---+
| x| y| 1|
| x| z| 2|
| x| a| 4|
| x| a| 4|
| x| y| 1|
| t| y2| 6|
| t| y3| 4|
| t| y4| 5|
+---+---+---+
I want to add another column whose value is (number of unique (F1, F2) pairs for each unique F3) / (total number of unique (F1, F2) pairs).
For example, for the above table, below is the desired new dataframe:
+---+---+---+----+
| F1| F2| F3| F4|
+---+---+---+----+
| t| y4| 5| 1/6|
| x| y| 1| 1/6|
| x| y| 1| 1/6|
| x| z| 2| 1/6|
| t| y2| 6| 1/6|
| t| y3| 4| 2/6|
| x| a| 4| 2/6|
| x| a| 4| 2/6|
+---+---+---+----+
Note: for F3 = 4, there are only 2 unique (F1, F2) pairs: {(t, y3), (x, a)}. Therefore, for all occurrences of F3 = 4, F4 will be 2 / (total number of unique ordered pairs of F1 and F2). Here there are 6 such pairs.
How to achieve the above transformation in Spark Scala?
While trying to solve your problem, I learnt that you can't use distinct functions when applying a Window over DataFrames.
So what I did was create a temporary DataFrame and join it with the initial one to obtain your desired results:
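import org.apache.spark.sql.functions.{col, countDistinct}
case class Dog(F1: String, F2: String, F3: Int)
val df = Seq(Dog("x", "y", 1), Dog("x", "z", 2), Dog("x", "a", 4), Dog("x", "a", 4), Dog("x", "y", 1), Dog("t", "y2", 6), Dog("t", "y3", 4), Dog("t", "y4", 5)).toDF
// Total number of distinct (F1, F2) pairs in the whole DataFrame
val unique_F1_F2 = df.select("F1", "F2").distinct.count
// Number of distinct (F1, F2) pairs per F3. Counting the pair directly avoids
// the collisions that string concatenation can produce ("ab" + "c" vs "a" + "bc").
val dd = df.groupBy("F3")
  .agg(countDistinct(col("F1"), col("F2")).as("distinct_count"))
// Join the per-F3 counts back onto the original rows and compute the ratio
val final_df = dd.join(df, "F3")
  .withColumn("F4", col("distinct_count") / unique_F1_F2)
  .drop("distinct_count")
final_df.show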
+---+---+---+-------------------+
| F3| F1| F2| F4|
+---+---+---+-------------------+
| 1| x| y|0.16666666666666666|
| 1| x| y|0.16666666666666666|
| 6| t| y2|0.16666666666666666|
| 5| t| y4|0.16666666666666666|
| 4| t| y3| 0.3333333333333333|
| 4| x| a| 0.3333333333333333|
| 4| x| a| 0.3333333333333333|
| 2| x| z|0.16666666666666666|
+---+---+---+-------------------+
I hope this is what you expected !
EDIT : I changed df.count to unique_F1_F2
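The arithmetic behind F4 can be checked with a short plain-Python sketch. It is not Spark code, and the names total_pairs, pairs_by_f3 and f4 are illustrative. It counts the distinct (F1, F2) pairs overall and per F3, then divides.

```python
from collections import defaultdict
from fractions import Fraction

# Same rows as the example DataFrame: (F1, F2, F3)
rows = [("x", "y", 1), ("x", "z", 2), ("x", "a", 4), ("x", "a", 4),
        ("x", "y", 1), ("t", "y2", 6), ("t", "y3", 4), ("t", "y4", 5)]

# Total number of distinct (F1, F2) pairs
total_pairs = len({(f1, f2) for f1, f2, _ in rows})

# Distinct (F1, F2) pairs seen for each F3
pairs_by_f3 = defaultdict(set)
for f1, f2, f3 in rows:
    pairs_by_f3[f3].add((f1, f2))

# F4 per F3, kept as an exact fraction (Fraction reduces 2/6 to 1/3)
f4 = {f3: Fraction(len(pairs), total_pairs) for f3, pairs in pairs_by_f3.items()}
for f3 in sorted(f4):
    print(f3, f4[f3], float(f4[f3]))
```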

create column in pyspark based on conditions [duplicate]

I have a PySpark Dataframe with two columns:
+---+----+
| Id|Rank|
+---+----+
| a| 5|
| b| 7|
| c| 8|
| d| 1|
+---+----+
For each row, I'm looking to replace Id column with "other" if Rank column is larger than 5.
If I use pseudocode to explain:
for row in df:
    if row.Rank > 5:
        replace(row.Id, "other")
The result should look like this:
+-----+----+
| Id|Rank|
+-----+----+
| a| 5|
|other| 7|
|other| 8|
| d| 1|
+-----+----+
Any clue how to achieve this? Thanks!!!
To create this Dataframe:
df = spark.createDataFrame([('a', 5), ('b', 7), ('c', 8), ('d', 1)], ['Id', 'Rank'])
You can use when and otherwise like this:
from pyspark.sql.functions import col, when
df\
    .withColumn('Id_New', when(df.Rank <= 5, df.Id).otherwise('other'))\
    .drop(df.Id)\
    .select(col('Id_New').alias('Id'), col('Rank'))\
    .show()
This gives the output:
+-----+----+
| Id|Rank|
+-----+----+
| a| 5|
|other| 7|
|other| 8|
| d| 1|
+-----+----+
Starting from @Pushkr's solution, couldn't you just use the following?
from pyspark.sql.functions import when
df.withColumn('Id', when(df.Rank <= 5, df.Id).otherwise('other')).show()

Apache Spark - Scala API - Aggregate on sequentially increasing key

I have a data frame that looks something like this:
val df = sc.parallelize(Seq(
(3,1,"A"),(3,2,"B"),(3,3,"C"),
(2,1,"D"),(2,2,"E"),
(3,1,"F"),(3,2,"G"),(3,3,"G"),
(2,1,"X"),(2,2,"X")
)).toDF("TotalN", "N", "String")
+------+---+------+
|TotalN| N|String|
+------+---+------+
| 3| 1| A|
| 3| 2| B|
| 3| 3| C|
| 2| 1| D|
| 2| 2| E|
| 3| 1| F|
| 3| 2| G|
| 3| 3| G|
| 2| 1| X|
| 2| 2| X|
+------+---+------+
I need to aggregate the strings by concatenating them together based on the TotalN and the sequentially increasing ID (N). The problem is there is not a unique ID for each aggregation I can group by. So, I need to do something like "for each row look at the TotalN, loop through the next N rows and concatenate, then reset".
+------+------+
|TotalN|String|
+------+------+
| 3| ABC|
| 2| DE|
| 3| FGG|
| 2| XX|
+------+------+
Any pointers much appreciated.
Using Spark 2.3.1 and the Scala API.
Try this:
import org.apache.spark.sql.functions.collect_list
import spark.implicits._
val df = spark.sparkContext.parallelize(Seq(
  (3, 1, "A"), (3, 2, "B"), (3, 3, "C"),
  (2, 1, "D"), (2, 2, "E"),
  (3, 1, "F"), (3, 2, "G"), (3, 3, "G"),
  (2, 1, "X"), (2, 2, "X")
)).toDF("TotalN", "N", "String")
df.createOrReplaceTempView("data")
// Number the rows, then use N - rowNum as a key that is constant within each run
val sqlDF = spark.sql(
  """
    | SELECT TotalN, N, String, ROW_NUMBER() over (order by TotalN) as rowNum
    | FROM data
  """.stripMargin)
sqlDF.withColumn("key", $"N" - $"rowNum")
  .groupBy("key").agg(collect_list($"String").as("texts")).show()
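The solution is to calculate a grouping variable using the row_number function, which can then be used in a later groupBy. Note that the window is ordered by TotalN only, so the relative order of rows sharing the same TotalN is not guaranteed; the output below relies on Spark keeping them in their original order.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
val w = Window.orderBy("TotalN")
df.withColumn("GeneratedID", $"N" - row_number.over(w)).show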
+------+---+------+-----------+
|TotalN| N|String|GeneratedID|
+------+---+------+-----------+
| 2| 1| D| 0|
| 2| 2| E| 0|
| 2| 1| X| -2|
| 2| 2| X| -2|
| 3| 1| A| -4|
| 3| 2| B| -4|
| 3| 3| C| -4|
| 3| 1| F| -7|
| 3| 2| G| -7|
| 3| 3| G| -7|
+------+---+------+-----------+
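The grouping-key idea can be illustrated in plain Python. This is a sketch, not Spark code, and it makes one assumption that differs from the answers above: each row is numbered by its position in the original input order rather than by TotalN. In Spark that would require an explicit, reliable ordering column, which the question does not provide. Within each run, N and the position both increase by one, so their difference stays constant.

```python
from itertools import groupby

# Same rows as the example DataFrame, in input order: (TotalN, N, String)
rows = [(3, 1, "A"), (3, 2, "B"), (3, 3, "C"),
        (2, 1, "D"), (2, 2, "E"),
        (3, 1, "F"), (3, 2, "G"), (3, 3, "G"),
        (2, 1, "X"), (2, 2, "X")]

# key = N - (1-based row position); constant inside each sequential run
keyed = [(n - pos, total_n, s)
         for pos, (total_n, n, s) in enumerate(rows, start=1)]

# Group consecutive rows sharing a key and concatenate their strings
result = []
for _, group in groupby(keyed, key=lambda r: r[0]):
    group = list(group)
    result.append((group[0][1], "".join(s for _, _, s in group)))

print(result)
```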

spark dataframe groupby multiple times

val df = (Seq((1, "a", "10"),(1,"b", "12"),(1,"c", "13"),(2, "a", "14"),
(2,"c", "11"),(1,"b","12" ),(2, "c", "12"),(3,"r", "11")).
toDF("col1", "col2", "col3"))
So I have a spark dataframe with 3 columns.
+----+----+----+
|col1|col2|col3|
+----+----+----+
| 1| a| 10|
| 1| b| 12|
| 1| c| 13|
| 2| a| 14|
| 2| c| 11|
| 1| b| 12|
| 2| c| 12|
| 3| r| 11|
+----+----+----+
My requirement is to perform two levels of groupBy, as explained below.
Level 1:
If I group by col1 and sum col3, I get the two columns below.
1. col1
2. sum(col3)
I will lose col2 here.
Level 2:
If I then group by col1 and col2 and sum col3, I get the three columns below.
1. col1
2. col2
3. sum(col3)
I need both sums (sum(col3) of level 1 and sum(col3) of level 2) in one final dataframe.
How can I do this? Can anyone explain?
spark : 1.6.2
Scala : 2.10
One option is to do the two sums separately and then join them back:
import org.apache.spark.sql.functions.sum
(df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2")).
  join(df.groupBy("col1").agg(sum($"col3").as("sum_level1")), Seq("col1")).show)
+----+----+----------+----------+
|col1|col2|sum_level2|sum_level1|
+----+----+----------+----------+
| 2| c| 23.0| 37.0|
| 2| a| 14.0| 37.0|
| 1| c| 13.0| 47.0|
| 1| b| 24.0| 47.0|
| 3| r| 11.0| 11.0|
| 1| a| 10.0| 47.0|
+----+----+----------+----------+
Another option is to use window functions, since sum_level1 is simply the sum of sum_level2 within each col1:
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy($"col1")
(df.groupBy("col1", "col2").agg(sum($"col3").as("sum_level2")).
withColumn("sum_level1", sum($"sum_level2").over(w)).show)
+----+----+----------+----------+
|col1|col2|sum_level2|sum_level1|
+----+----+----------+----------+
| 1| c| 13.0| 47.0|
| 1| b| 24.0| 47.0|
| 1| a| 10.0| 47.0|
| 3| r| 11.0| 11.0|
| 2| c| 23.0| 37.0|
| 2| a| 14.0| 37.0|
+----+----+----------+----------+