I have a dataframe and need to aggregate one column based on all of the other columns. I do not want to list all those columns comma-separated in groupBy, as I have about 30 of them. Could somebody tell me how I can do this in a more readable way?
Right now I am doing:
df.groupBy("c1","c2","c3","c4","c5","c6","c7","c8","c9","c10",....).agg(c11)
I want to know if there is a better way.
Thanks,
John
Specifying the columns is the clean way to do it, but you have quite a few options.
One of them is to drop down to Spark SQL and compose the query string programmatically.
Another option is to use varargs expansion (: _*) on a list of column names, like this:
val cols = ...
df.groupBy(cols: _*).agg(...)
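For example, a minimal sketch of that approach, assuming the column being aggregated is called c11 (all names here are illustrative):
import org.apache.spark.sql.functions.{avg, col}

// Group by every column except the one being aggregated ("c11" is illustrative)
val cols = df.columns.filterNot(_ == "c11").map(col).toSeq
df.groupBy(cols: _*).agg(avg("c11"))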
Use the steps below:
get the columns as a list
remove the columns that need to be aggregated from that list
apply groupBy and agg
Example:
import org.apache.spark.sql.functions.avg

val seq = Seq((101, "abc", 24), (102, "cde", 24), (103, "efg", 22), (104, "ghi", 21), (105, "ijk", 20), (106, "klm", 19), (107, "mno", 18), (108, "pqr", 18), (109, "rst", 26), (110, "tuv", 27), (111, "pqr", 18), (112, "rst", 28), (113, "tuv", 29))
val df = sc.parallelize(seq).toDF("id", "name", "age")
val colsList = df.columns.toList
(colsList: List[String] = List(id, name, age))
val groupByColumns = colsList.slice(0, colsList.size-1)
(groupByColumns: List[String] = List(id, name))
val aggColumn = colsList.last
(aggColumn: String = age)
df.groupBy(groupByColumns.head, groupByColumns.tail: _*).agg(avg(aggColumn)).show
+---+----+--------+
| id|name|avg(age)|
+---+----+--------+
|105| ijk| 20.0|
|108| pqr| 18.0|
|112| rst| 28.0|
|104| ghi| 21.0|
|111| pqr| 18.0|
|113| tuv| 29.0|
|106| klm| 19.0|
|102| cde| 24.0|
|107| mno| 18.0|
|101| abc| 24.0|
|103| efg| 22.0|
|110| tuv| 27.0|
|109| rst| 26.0|
+---+----+--------+
I need to dynamically select columns from one of the two tables I am joining. The name of one of the columns to be selected is passed in a variable. Here are the details.
The table names are passed in variables, as are the join_id and the join_type.
//Creating scala variables for each table
var table_name_a = dbutils.widgets.get("table_name_a")
var table_name_b = dbutils.widgets.get("table_name_b")
//Create scala variable for Join Id
var join_id = dbutils.widgets.get("table_name_b") + "Id"
// Define join type
var join_type = dbutils.widgets.get("join_type")
Then I join the tables. I want to select all columns from table A and only two columns from table B: one is always called "Description", regardless of which table B is passed in the parameter above; the other has the same name as table B itself, e.g. if table B's name is Employee, I want to select a column named "Employee" from table B. The code below selects all columns from table A and the Description column from table B (aliased), but I still need to select the column from table B that shares the table's name. I don't know in advance how many columns table B has, their order, or their names, since table B is passed as a parameter.
// Joining Tables
var df_joined_tables = df_a
.join(df_b,
df_a(join_id)===df_b(join_id),
join_type
).select($"df_a.*",$"df_b.Description".alias(table_name_b + " Description"))
My question is: How do I pass the variable table_name_b as a column I am trying to select from table B?
I tried the code below, which is obviously wrong because in $"df_b.table_name_b" the literal text table_name_b is used as the column name rather than the content of the variable.
var df_joined_tables = df_a
.join(df_b,
df_a(join_id)===df_b(join_id),
join_type
).select($"df_a.*",$"df_b.Description".alias(table_name_b + " Description"),$"df_b.table_name_b")
Then I tried the code below and it gives the error: "value table_name_b is not a member of org.apache.spark.sql.DataFrame"
var df_joined_tables = df_a
.join(df_b,
df_a(join_id)===df_b(join_id),
join_type
).select($"df_a.*",$"df_b.Description".alias(table_name_b + " Description"),df_b.table_name_b)
How do I pass the variable table_name_b as a column I need to select from table B?
You can build a List[org.apache.spark.sql.Column] and use it in your select, as in the example below:
import org.apache.spark.sql.functions.col

// sample input:
val df = Seq(
("A", 1, 6, 7),
("B", 2, 7, 6),
("C", 3, 8, 5),
("D", 4, 9, 4),
("E", 5, 8, 3)
).toDF("name", "col1", "col2", "col3")
df.printSchema()
val columnNames = List("col1", "col2") // string column names from your params
val columnsToSelect = columnNames.map(col(_)) // convert the required column names from string to column type
df.select(columnsToSelect: _*).show() // using the list of columns
// output:
+----+----+
|col1|col2|
+----+----+
| 1| 6|
| 2| 7|
| 3| 8|
| 4| 9|
| 5| 8|
+----+----+
The same approach can be applied to joins.
Update
Adding another example:
val aliasTableA = "tableA"
val aliasTableB = "tableB"
val joinField = "name"
val df1 = Seq(
("A", 1, 6, 7),
("B", 2, 7, 6),
("C", 3, 8, 5),
("D", 4, 9, 4),
("E", 5, 8, 3)
).toDF("name", "col1", "col2", "col3")
val df2 = Seq(
("A", 11, 61, 71),
("B", 21, 71, 61),
("C", 31, 81, 51)
).toDF("name", "col_1", "col_2", "col_3")
df1.alias(aliasTableA)
.join(df2.alias(aliasTableB), Seq(joinField))
.selectExpr(s"${aliasTableA}.*", s"${aliasTableB}.col_1", s"${aliasTableB}.col_2").show()
// output:
+----+----+----+----+-----+-----+
|name|col1|col2|col3|col_1|col_2|
+----+----+----+----+-----+-----+
| A| 1| 6| 7| 11| 61|
| B| 2| 7| 6| 21| 71|
| C| 3| 8| 5| 31| 81|
+----+----+----+----+-----+-----+
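Applied to the original question, the key is interpolating the variable into the column reference string. A hedged sketch reusing the question's variable names (assuming df_a and df_b have been aliased as "df_a" and "df_b"):
import org.apache.spark.sql.functions.col

// The variable table_name_b is interpolated into the column name string
var df_joined_tables = df_a.alias("df_a")
  .join(df_b.alias("df_b"), df_a(join_id) === df_b(join_id), join_type)
  .select(
    col("df_a.*"),
    col("df_b.Description").alias(table_name_b + " Description"),
    col(s"df_b.${table_name_b}")
  )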
I have an issue where I have to calculate a column using a formula that uses the value from the calculation done in the previous row.
I am unable to figure it out using withColumn API.
I need to calculate a new column, using the formula:
MovingRate = MonthlyRate + (0.7 * MovingRatePrevious)
... where the MovingRatePrevious is the MovingRate of the prior row.
For month 1 I already have the value, so I do not need to recalculate it, but I do need it to calculate the subsequent rows. I also need to partition by Type.
This is my original dataset:
+-----+----+-----------+----------+
|Month|Type|MonthlyRate|MovingRate|
+-----+----+-----------+----------+
|    1|blue|        0.4|      0.33|
|    2|blue|        0.3|      null|
|    3|blue|        0.7|      null|
|    4|blue|        0.9|      null|
+-----+----+-----------+----------+
Desired results in MovingRate column: 0.33, 0.531, 1.0717, 1.65019 (each row is MonthlyRate + 0.7 × the previous row's MovingRate).
Although it's possible to do this with window functions (see Leo C's answer), I bet it's more performant to aggregate once per Type using a groupBy, then explode the result of a UDF to get all rows back:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._

val df = Seq(
(1, "blue", 0.4, Some(0.33)),
(2, "blue", 0.3, None),
(3, "blue", 0.7, None),
(4, "blue", 0.9, None)
)
.toDF("Month", "Type", "MonthlyRate", "MovingRate")
// This UDF produces a Seq of Tuple3 (Month, MonthlyRate, MovingRate),
// threading the running MovingRate through a scanLeft
val calcMovingRate = udf((startRate: Double, rates: Seq[Row]) =>
  rates.tail.scanLeft((rates.head.getInt(0), rates.head.getDouble(1), startRate)) {
    (acc, curr) => (curr.getInt(0), curr.getDouble(1), acc._3 * 0.7 + curr.getDouble(1))
  }
)
df
.groupBy($"Type")
.agg(
first($"MovingRate",ignoreNulls=true).as("startRate"),
collect_list(struct($"Month",$"MonthlyRate")).as("rates")
)
.select($"Type",explode(calcMovingRate($"startRate",$"rates")).as("movingRates"))
.select($"Type",$"movingRates._1".as("Month"),$"movingRates._2".as("MonthlyRate"),$"movingRates._3".as("MovingRate"))
.show()
gives:
+----+-----+-----------+------------------+
|Type|Month|MonthlyRate|        MovingRate|
+----+-----+-----------+------------------+
|blue|    1|        0.4|              0.33|
|blue|    2|        0.3|0.5309999999999999|
|blue|    3|        0.7|1.0716999999999999|
|blue|    4|        0.9|1.6501899999999998|
+----+-----+-----------+------------------+
Given the nature of the requirement that each moving rate is recursively computed from the previous rate, the column-oriented DataFrame API won't shine, especially if the dataset is huge.
That said, if the dataset isn't large, one approach would be to make Spark recalculate the moving rates row-wise via a UDF, with a Window-partitioned rate list as its input:
import org.apache.spark.sql.expressions.Window
val df = Seq(
(1, "blue", 0.4, Some(0.33)),
(2, "blue", 0.3, None),
(3, "blue", 0.7, None),
(4, "blue", 0.9, None),
(1, "red", 0.5, Some(0.2)),
(2, "red", 0.6, None),
(3, "red", 0.8, None)
).toDF("Month", "Type", "MonthlyRate", "MovingRate")
val win = Window.partitionBy("Type").orderBy("Month").
rowsBetween(Window.unboundedPreceding, 0)
def movingRate(factor: Double) = udf( (initRate: Double, monthlyRates: Seq[Double]) =>
monthlyRates.tail.foldLeft(initRate)( _ * factor + _ )
)
df.
withColumn("MovingRate", when($"Month" === 1, $"MovingRate").otherwise(
movingRate(0.7)(last($"MovingRate", ignoreNulls=true).over(win), collect_list($"MonthlyRate").over(win))
)).
show
// +-----+----+-----------+------------------+
// |Month|Type|MonthlyRate| MovingRate|
// +-----+----+-----------+------------------+
// | 1| red| 0.5| 0.2|
// | 2| red| 0.6| 0.74|
// | 3| red| 0.8| 1.318|
// | 1|blue| 0.4| 0.33|
// | 2|blue| 0.3|0.5309999999999999|
// | 3|blue| 0.7|1.0716999999999999|
// | 4|blue| 0.9|1.6501899999999998|
// +-----+----+-----------+------------------+
What you are trying to do is compute a recursive formula that looks like:
x[i] = y[i] + 0.7 * x[i-1]
where x[i] is your MovingRate at row i and y[i] your MonthlyRate at row i.
The problem is that this is a purely sequential formula. Each row needs the result of the previous one which in turn needs the result of the one before. Spark is a parallel computation engine and it is going to be hard to use it to speed up a calculation that cannot really be parallelized.
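To make the dependency concrete, here is the recurrence in plain Scala (an illustration of the formula only, not Spark code; values taken from the sample data above):
// x(i) = y(i) + 0.7 * x(i-1), computed strictly left to right
val x0 = 0.33                  // known MovingRate for month 1
val y  = Seq(0.3, 0.7, 0.9)    // MonthlyRate for months 2..4
val x  = y.scanLeft(x0)((prev, m) => m + 0.7 * prev)
// x ≈ Seq(0.33, 0.531, 1.0717, 1.65019), up to floating-point noise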
I have two DataFrames with the same schema (but 100+ columns):
Small size: 1000 rows
Bigger size: 90000 rows
How do I check that every row in 1 exists in 2? What is the "Spark way" of doing this? Should I use map and then deal with it at the Row level, or use a join and then some sort of comparison with the small DataFrame?
You can use except, which returns all rows of the first Dataset that are not present in the second:
smaller.except(bigger).isEmpty
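A minimal self-contained sketch of that check (data and column names illustrative):
// except returns the rows of `smaller` absent from `bigger`; empty means full inclusion
val smaller = Seq((1, "a"), (2, "b")).toDF("id", "v")
val bigger  = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "v")
val allIncluded = smaller.except(bigger).isEmpty   // true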
You can inner join the DataFrames and compare counts to check whether there is a difference.
def isIncluded(smallDf: DataFrame, biggerDf: DataFrame): Boolean = {
  val keys = smallDf.columns.toSeq
  val joinedDf = smallDf.join(biggerDf, keys) // you might want to broadcast smallDf for performance
  joinedDf.count == smallDf.count
}
However, I think the except method is clearer. Not sure about the performance (it might just be a join underneath).
I would do it with a join, probably.
A left outer join from the small DataFrame to the large one keeps every row of the small frame; rows that are missing from the large frame come back with nulls on the right side. Then just check whether any such rows exist.
Code:
val seq1 = Seq(
("A", "abc", 0.1, 0.0, 0),
("B", "def", 0.15, 0.5, 0),
("C", "ghi", 0.2, 0.2, 1),
("D", "jkl", 1.1, 0.1, 0),
("E", "mno", 0.1, 0.1, 0)
)
val seq2 = Seq(
("A", "abc", "a", "b", "?"),
("C", "ghi", "a", "c", "?")
)
val df1 = ss.sparkContext.makeRDD(seq1).toDF("cA", "cB", "cC", "cD", "cE")
val df2 = ss.sparkContext.makeRDD(seq2).toDF("cA", "cB", "cH", "cI", "cJ")
df2.join(df1, df1("cA") === df2("cA"), "leftOuter").show
Output:
+---+---+---+---+---+---+---+---+---+---+
| cA| cB| cH| cI| cJ| cA| cB| cC| cD| cE|
+---+---+---+---+---+---+---+---+---+---+
| C|ghi| a| c| ?| C|ghi|0.2|0.2| 1|
| A|abc| a| b| ?| A|abc|0.1|0.0| 0|
+---+---+---+---+---+---+---+---+---+---+
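A more direct variant (a sketch, not part of the original answer): a left_anti join returns exactly the rows of the small frame that have no match in the large one, so an empty result means every row is included.
// left_anti keeps only the rows of df2 that have no match in df1 on the key
val missing = df2.join(df1, df2("cA") === df1("cA"), "left_anti")
val allIncluded = missing.isEmpty   // true when every small-frame row has a match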
I have written the code below to group and aggregate columns:
val gmList = List("gc1","gc2","gc3")
val aList = List("val1","val2","val3","val4","val5")
val cype = "first"
val exprs = aList.map((_ -> cype )).toMap
df.groupBy(gmList.map(col): _*).agg(exprs).show
but this creates columns with the aggregation name wrapped around every column name, as shown below.
So I want to alias first(val1) -> val1, and I want to make this generic as part of exprs:
+---+---+---+-----------+-----------+-----------+-----------+-----------+
|gc1|gc2|gc3|first(val1)|first(val2)|first(val3)|first(val4)|first(val5)|
+---+---+---+-----------+-----------+-----------+-----------+-----------+
One approach would be to alias the aggregated columns to the original column names in a subsequent select. I would also suggest generalizing the single aggregate function (i.e. first) to a list of functions, as shown below:
import org.apache.spark.sql.functions._
val df = Seq(
(1, 10, "a1", "a2", "a3"),
(1, 10, "b1", "b2", "b3"),
(2, 20, "c1", "c2", "c3"),
(2, 30, "d1", "d2", "d3"),
(2, 30, "e1", "e2", "e3")
).toDF("gc1", "gc2", "val1", "val2", "val3")
val gmList = List("gc1", "gc2")
val aList = List("val1", "val2", "val3")
// Populate with different aggregate methods for individual columns if necessary
val fList = List.fill(aList.size)("first")
val afPairs = aList.zip(fList)
// afPairs: List[(String, String)] = List((val1,first), (val2,first), (val3,first))
df.
groupBy(gmList.map(col): _*).agg(afPairs.toMap).
select(gmList.map(col) ::: afPairs.map{ case (v, f) => col(s"$f($v)").as(v) }: _*).
show
// +---+---+----+----+----+
// |gc1|gc2|val1|val2|val3|
// +---+---+----+----+----+
// | 2| 20| c1| c2| c3|
// | 1| 10| a1| a2| a3|
// | 2| 30| d1| d2| d3|
// +---+---+----+----+----+
You can slightly change the way you generate the expressions and use the alias function in there:
import org.apache.spark.sql.functions.{col, first}
val aList = List("val1","val2","val3","val4","val5")
val exprs = aList.map(c => first(col(c)).alias(c))
df.groupBy(gmList.map(col): _*).agg(exprs.head, exprs.tail: _*).show
Here's a more generic version that will work with any aggregate functions and doesn't require naming your aggregate columns up front. Build your grouped df as you normally would, then use:
val colRegex = raw"^.+\((.*?)\)".r
val newCols = df.columns.map(c => col(c).as(colRegex.replaceAllIn(c, m => m.group(1))))
df.select(newCols: _*)
This will extract out only what is inside the parentheses, regardless of what aggregate function is called (e.g. first(val) -> val, sum(val) -> val, count(val) -> val, etc.).
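For instance, a quick illustration of what the regex does to typical aggregate column names (input names illustrative):
val colRegex = raw"^.+\((.*?)\)".r
Seq("first(val1)", "sum(val2)", "count(val3)", "gc1")
  .map(c => colRegex.replaceAllIn(c, m => m.group(1)))
// List(val1, val2, val3, gc1) -- non-aggregate names pass through unchanged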
I want to convert this basic SQL Query in Spark
select Grade, count(*) * 100.0 / sum(count(*)) over()
from StudentGrades
group by Grade
I have tried using window functions in Spark like this:
val windowSpec = Window.rangeBetween(Window.unboundedPreceding,Window.unboundedFollowing)
df1.select($"Arrest")
  .groupBy($"Arrest")
  .agg(sum(count("*")).over(windowSpec), count("*"))
  .show()
+------+-------------------------------------------------------------------------------+--------+
|Arrest|sum(count(1)) OVER (RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)|count(1)|
+------+-------------------------------------------------------------------------------+--------+
|  true|                                                                         665517|  184964|
| false|                                                                         665517|  480553|
+------+-------------------------------------------------------------------------------+--------+
But when I try dividing by count(*), it throws an error:
df1.select($"Arrest")
  .groupBy($"Arrest")
  .agg(count("*") / sum(count("*")).over(windowSpec), count("*"))
  .show()
It is not allowed to use an aggregate function in the argument of another aggregate function. Please use the inner aggregate function in a sub-query.;;
My question is: I'm already using count() inside sum() in the first query and receive no error about using an aggregate function inside another aggregate function, so why do I get that error in the second one?
An example:
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq(
("A", "X", 2, 100), ("A", "X", 7, 100), ("B", "X", 10, 100),
("C", "X", 1, 100), ("D", "X", 50, 100), ("E", "X", 30, 100)
)).toDF("c1", "c2", "Val1", "Val2")
val df2 = df
.groupBy("c1")
.agg(sum("Val1").alias("sum"))
.withColumn("fraction", col("sum") / sum("sum").over())
df2.show
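For reference, the fractions this computes on the sample input, hand-calculated (the Val1 sums are A=9, B=10, C=1, D=50, E=30 out of a total of 100; row order will vary):
+---+---+--------+
| c1|sum|fraction|
+---+---+--------+
|  A|  9|    0.09|
|  B| 10|     0.1|
|  C|  1|    0.01|
|  D| 50|     0.5|
|  E| 30|     0.3|
+---+---+--------+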
You will need to tailor this to your own situation, e.g. count instead of sum, as follows:
val df2 = df
.groupBy("c1")
.agg(count("*"))
.withColumn("fraction", col("count(1)") / sum("count(1)").over())
returning:
+---+--------+-------------------+
| c1|count(1)| fraction|
+---+--------+-------------------+
| E| 1|0.16666666666666666|
| B| 1|0.16666666666666666|
| D| 1|0.16666666666666666|
| C| 1|0.16666666666666666|
| A| 2| 0.3333333333333333|
+---+--------+-------------------+
You can multiply by 100 to get a percentage. I note the alias does not seem to work as it did for the sum, so I worked around this and left the comparison above. Again, you will need to tailor this to your specifics; this is part of my general modules for research and such.
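A hedged variant with the percentage spelled out, aliasing the count first (illustrative; whether the alias cooperates may vary by Spark version):
// Alias the count, then divide by the windowed grand total and scale to percent
val df3 = df
  .groupBy("c1")
  .agg(count("*").alias("cnt"))
  .withColumn("percentage", col("cnt") * 100.0 / sum("cnt").over())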