I used the code below to sort based on one column. How can I get the first and last elements of the sorted DataFrame?
group_by_dataframe
.count()
.filter("`count` >= 10")
.sort(desc("count"))
The max and min functions need a group to work with. To get around that, you can create a dummy column as below, group on it, and then call max and min to get the maximum and minimum values.
If that's all you need, you don't really need sort here.
from pyspark.sql import functions as F
df = spark.createDataFrame([("a", 0.694), ("b", -2.669), ("a", 0.245), ("a", 0.1), ("b", 0.3), ("c", 0.3)], ["n", "val"])
df.show()
+---+------+
| n| val|
+---+------+
| a| 0.694|
| b|-2.669|
| a| 0.245|
| a| 0.1|
| b| 0.3|
| c| 0.3|
+---+------+
df = df.groupby('n').count() #.sort(F.desc('count'))
df = df.withColumn('dummy', F.lit(1))
df.show()
+---+-----+-----+
| n|count|dummy|
+---+-----+-----+
| c| 1| 1|
| b| 2| 1|
| a| 3| 1|
+---+-----+-----+
df = df.groupBy('dummy').agg(F.min('count').alias('min'), F.max('count').alias('max')).drop('dummy')
df.show()
+---+---+
|min|max|
+---+---+
| 1| 3|
+---+---+
Assuming I have the following DataFrame:
+---+--------+---+----+----+
|grp|null_col|ord|col1|col2|
+---+--------+---+----+----+
| 1| null| 3|null| 11|
| 2| null| 2| xxx| 22|
| 1| null| 1| yyy|null|
| 2| null| 7|null| 33|
| 1| null| 12|null|null|
| 2| null| 19|null| 77|
| 1| null| 10| s13|null|
| 2| null| 11| a23|null|
+---+--------+---+----+----+
Here is the same sample DF with comments, sorted by grp and ord:
scala> df.orderBy("grp", "ord").show
+---+--------+---+----+----+
|grp|null_col|ord|col1|col2|
+---+--------+---+----+----+
| 1| null| 1| yyy|null|
| 1| null| 3|null| 11| # grp:1 - last value for `col2` (11)
| 1| null| 10| s13|null| # grp:1 - last value for `col1` (s13)
| 1| null| 12|null|null| # grp:1 - last values for `null_col`, `ord`
| 2| null| 2| xxx| 22|
| 2| null| 7|null| 33|
| 2| null| 11| a23|null| # grp:2 - last value for `col1` (a23)
| 2| null| 19|null| 77| # grp:2 - last values for `null_col`, `ord`, `col2`
+---+--------+---+----+----+
I would like to compress it, i.e. to group it by the column "grp" and, for each group, sort the rows by the "ord" column and take the last non-null value in each column (if there is one).
+---+--------+---+----+----+
|grp|null_col|ord|col1|col2|
+---+--------+---+----+----+
| 1| null| 12| s13| 11|
| 2| null| 19| a23| 77|
+---+--------+---+----+----+
I've seen the following similar questions:
How to select the first row of each group?
How to find first non-null values in groups? (secondary sorting using dataset api)
but my real DataFrame has over 250 columns, so I need a solution where I don't have to specify all the columns explicitly.
I can't wrap my head around it...
MCVE: how to create a sample DataFrame:
Create a local file "/tmp/data.txt" and copy and paste the contents of the DataFrame there (as posted above).
Define the function readSparkOutput().
Parse "/tmp/data.txt" into a DataFrame:
val df = readSparkOutput("file:///tmp/data.txt")
UPDATE: I think it should be similar to the following SQL:
SELECT
grp, ord, null_col, col1, col2
FROM (
SELECT
grp,
ord,
FIRST(null_col) OVER (PARTITION BY grp ORDER BY ord DESC) as null_col,
FIRST(col1) OVER (PARTITION BY grp ORDER BY ord DESC) as col1,
FIRST(col2) OVER (PARTITION BY grp ORDER BY ord DESC) as col2,
ROW_NUMBER() OVER (PARTITION BY grp ORDER BY ord DESC) as rn
FROM table_name) as v
WHERE v.rn = 1;
How can we dynamically generate such a Spark query?
I tried the following simplified approach:
import org.apache.spark.sql.expressions.Window
val win = Window
.partitionBy("grp")
.orderBy($"ord".desc)
val cols = df.columns.map(c => first(c, ignoreNulls=true).over(win).as(c))
which produces:
scala> cols
res23: Array[org.apache.spark.sql.Column] = Array(first(grp, true) OVER (PARTITION BY grp ORDER BY ord DESC NULLS LAST UnspecifiedFrame) AS `grp`, first(null_col, true) OVER (PARTITION BY grp ORDER BY ord DESC NULLS LAST UnspecifiedFrame) AS `null_col`, first(ord, true) OVER (PARTITION BY grp ORDER BY ord DESC NULLS LAST UnspecifiedFrame) AS `ord`, first(col1, true) OVER (PARTITION BY grp ORDER BY ord DESC NULLS LAST UnspecifiedFrame) AS `col1`, first(col2, true) OVER (PARTITION BY grp ORDER BY ord DESC NULLS LAST UnspecifiedFrame) AS `col2`)
but I couldn't pass it to df.select:
scala> df.select(cols.head, cols.tail: _*).show
<console>:34: error: no `: _*' annotation allowed here
(such annotations are only allowed in arguments to *-parameters)
df.select(cols.head, cols.tail: _*).show
Another attempt:
scala> df.select(cols.map(col): _*).show
<console>:34: error: type mismatch;
found : String => org.apache.spark.sql.Column
required: org.apache.spark.sql.Column => ?
df.select(cols.map(col): _*).show
Consider the following approach, which applies the Window function last(c, ignoreNulls=true), ordered by "ord" within each "grp", to each of the selected columns, and then uses a groupBy("grp") to fetch the first agg(colFcnMap) result per group:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df0 = Seq(
(1, 3, None, Some(11)),
(2, 2, Some("aaa"), Some(22)),
(1, 1, Some("s12"), None),
(2, 7, None, Some(33)),
(1, 12, None, None),
(2, 19, None, Some(77)),
(1, 10, Some("s13"), None),
(2, 11, Some("a23"), None)
).toDF("grp", "ord", "col1", "col2")
val df = df0.withColumn("null_col", lit(null))
df.orderBy("grp", "ord").show
// +---+---+----+----+--------+
// |grp|ord|col1|col2|null_col|
// +---+---+----+----+--------+
// | 1| 1| s12|null| null|
// | 1| 3|null| 11| null|
// | 1| 10| s13|null| null|
// | 1| 12|null|null| null|
// | 2| 2| aaa| 22| null|
// | 2| 7|null| 33| null|
// | 2| 11| a23|null| null|
// | 2| 19|null| 77| null|
// +---+---+----+----+--------+
val win = Window.partitionBy("grp").orderBy("ord").
rowsBetween(0, Window.unboundedFollowing)
val nonAggCols = Array("grp")
val cols = df.columns.diff(nonAggCols) // Columns to be aggregated
val colFcnMap = cols.zip(Array.fill(cols.size)("first")).toMap
// colFcnMap: scala.collection.immutable.Map[String,String] =
// Map(ord -> first, col1 -> first, col2 -> first, null_col -> first)
cols.foldLeft(df)((acc, c) =>
acc.withColumn(c, last(c, ignoreNulls=true).over(win))
).
groupBy("grp").agg(colFcnMap).
select(col("grp") :: colFcnMap.toList.map{case (c, f) => col(s"$f($c)").as(c)}: _*).
show
// +---+---+----+----+--------+
// |grp|ord|col1|col2|null_col|
// +---+---+----+----+--------+
// | 1| 12| s13| 11| null|
// | 2| 19| a23| 77| null|
// +---+---+----+----+--------+
Note that the final select is for stripping the function name (in this case first()) from the aggregated column names.
I have worked something out. Here is the code and the output:
import org.apache.spark.sql.functions._
import spark.implicits._
val df0 = Seq(
(1, 3, None, Some(11)),
(2, 2, Some("aaa"), Some(22)),
(1, 1, Some("s12"), None),
(2, 7, None, Some(33)),
(1, 12, None, None),
(2, 19, None, Some(77)),
(1, 10, Some("s13"), None),
(2, 11, Some("a23"), None)
).toDF("grp", "ord", "col1", "col2")
df0.show()
//+---+---+----+----+
//|grp|ord|col1|col2|
//+---+---+----+----+
//| 1| 3|null| 11|
//| 2| 2| aaa| 22|
//| 1| 1| s12|null|
//| 2| 7|null| 33|
//| 1| 12|null|null|
//| 2| 19|null| 77|
//| 1| 10| s13|null|
//| 2| 11| a23|null|
//+---+---+----+----+
Ordering the data on the first 2 columns:
val df1 = df0.select("grp", "ord", "col1", "col2").orderBy("grp", "ord")
df1.show()
//+---+---+----+----+
//|grp|ord|col1|col2|
//+---+---+----+----+
//| 1| 1| s12|null|
//| 1| 3|null| 11|
//| 1| 10| s13|null|
//| 1| 12|null|null|
//| 2| 2| aaa| 22|
//| 2| 7|null| 33|
//| 2| 11| a23|null|
//| 2| 19|null| 77|
//+---+---+----+----+
val df2 = df1.groupBy("grp").agg(max("ord").alias("ord"),collect_set("col1").alias("col1"),collect_set("col2").alias("col2"))
val df3 = df2.withColumn("new_col1",$"col1".apply(size($"col1").minus(1))).withColumn("new_col2",$"col2".apply(size($"col2").minus(1)))
df3.show()
//+---+---+----------+------------+--------+--------+
//|grp|ord| col1| col2|new_col1|new_col2|
//+---+---+----------+------------+--------+--------+
//| 1| 12|[s12, s13]| [11]| s13| 11|
//| 2| 19|[aaa, a23]|[33, 22, 77]| a23| 77|
//+---+---+----------+------------+--------+--------+
You can drop the columns you don't need by using .drop("column_name")
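For example, to get to the compacted shape (minus the all-null column, which this sample df0 does not include), the intermediate array columns can be dropped and the new columns renamed; a small sketch building on df3 above:
val df4 = df3
  .drop("col1", "col2")
  .withColumnRenamed("new_col1", "col1")
  .withColumnRenamed("new_col2", "col2")
df4.show()
//+---+---+----+----+
//|grp|ord|col1|col2|
//+---+---+----+----+
//|  1| 12| s13|  11|
//|  2| 19| a23|  77|
//+---+---+----+----+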
So here we are grouping by a and selecting the max of all other columns in the group:
scala> val df = List((1,2,11), (1,1,1), (2,1,4), (2,3,5)).toDF("a", "b", "c")
df: org.apache.spark.sql.DataFrame = [a: int, b: int ... 1 more field]
scala> val aggCols = df.schema.map(_.name).filter(_ != "a").map(colName => max(col(colName)).alias(s"max_$colName"))
aggCols: Seq[org.apache.spark.sql.Column] = List(max(b) AS `max_b`, max(c) AS `max_c`)
scala> df.groupBy(col("a")).agg(aggCols.head, aggCols.tail: _*)
res0: org.apache.spark.sql.DataFrame = [a: int, max_b: int ... 1 more field]
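Calling show on that result should then give one row per value of a, holding the per-group maxima of the remaining columns, e.g.:
scala> df.groupBy(col("a")).agg(aggCols.head, aggCols.tail: _*).show
+---+-----+-----+
|  a|max_b|max_c|
+---+-----+-----+
|  1|    2|   11|
|  2|    3|    5|
+---+-----+-----+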
I'd go with the same approach as @LeoC, but I believe there is no need to manipulate column names as strings, and I would go with a more Spark SQL-like answer.
import org.apache.spark.sql.Column
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first, last}
val win = Window.partitionBy("grp").orderBy(col("ord")).rowsBetween(0, Window.unboundedFollowing)
// In case there is more than one group column
val nonAggCols = Seq("grp")
// Select columns to aggregate on
val cols: Seq[String] = df.columns.diff(nonAggCols).toSeq
// Map over selection and apply fct
val aggregations: Seq[Column] = cols.map(c => first(col(c), ignoreNulls = true).as(c))
// I'd rather cache the following step as it might get expensive
val step1 = cols.foldLeft(df)((acc, c) => acc.withColumn(c, last(col(c), ignoreNulls = true).over(win))).cache
// Finally we can aggregate our results as followed
val results = step1.groupBy(nonAggCols.head, nonAggCols.tail: _*).agg(aggregations.head, aggregations.tail: _*)
results.show
// +---+--------+---+----+----+
// |grp|null_col|ord|col1|col2|
// +---+--------+---+----+----+
// | 1| null| 12| s13| 11|
// | 2| null| 19| a23| 77|
// +---+--------+---+----+----+
I hope this helps.
EDIT: The reason you are not getting the same results is that the reader you are using isn't correct.
It interprets null from the file as the string "null" rather than an actual null, i.e.:
scala> df.filter('col1.isNotNull).show
// +---+--------+---+----+----+
// |grp|null_col|ord|col1|col2|
// +---+--------+---+----+----+
// | 1| null| 3|null| 11|
// | 2| null| 2| xxx| 22|
// | 1| null| 1| yyy|null|
// | 2| null| 7|null| 33|
// | 1| null| 12|null|null|
// | 2| null| 19|null| 77|
// | 1| null| 10| s13|null|
// | 2| null| 11| a23|null|
// +---+--------+---+----+----+
Here is my version of readSparkOutput:
def readSparkOutput(filePath: String): org.apache.spark.sql.DataFrame = {
val step1 = spark.read
.option("header", "true")
.option("inferSchema", "true")
.option("delimiter", "|")
.option("parserLib", "UNIVOCITY")
.option("ignoreLeadingWhiteSpace", "true")
.option("ignoreTrailingWhiteSpace", "true")
.option("comment", "+")
.csv(filePath)
val step2 = step1.select(step1.columns.filterNot(_.startsWith("_c")).map(step1(_)): _*)
val columns = step2.columns
columns.foldLeft(step2)((acc, c) => acc.withColumn(c, when(col(c) =!= "null" && col(c).isNotNull, col(c)))) // map the literal string "null" back to a real null
}
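With that reader, re-running the earlier check should now keep only the rows that actually have a value in col1 (a quick sketch, using the same /tmp/data.txt file as above):
val df = readSparkOutput("file:///tmp/data.txt")
df.filter('col1.isNotNull).show
// only the rows where col1 is one of yyy, xxx, s13, a23 should remain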
Here is your answer (and hopefully my bounty!!!)
scala> val df = spark.sparkContext.parallelize(List(
| (1,null.asInstanceOf[String],3,null.asInstanceOf[String],new Integer(11)),
| (2,null.asInstanceOf[String],2,new String("xxx"),new Integer(22)),
| (1,null.asInstanceOf[String],1,new String("yyy"),null.asInstanceOf[Integer]),
| (2,null.asInstanceOf[String],7,null.asInstanceOf[String],new Integer(33)),
| (1,null.asInstanceOf[String],12,null.asInstanceOf[String],null.asInstanceOf[Integer]),
| (2,null.asInstanceOf[String],19,null.asInstanceOf[String],new Integer(77)),
| (1,null.asInstanceOf[String],10,new String("s13"),null.asInstanceOf[Integer]),
| (2,null.asInstanceOf[String],11,new String("a23"),null.asInstanceOf[Integer]))).toDF("grp","null_col","ord","col1","col2")
df: org.apache.spark.sql.DataFrame = [grp: int, null_col: string ... 3 more fields]
scala> df.show
+---+--------+---+----+----+
|grp|null_col|ord|col1|col2|
+---+--------+---+----+----+
| 1| null| 3|null| 11|
| 2| null| 2| xxx| 22|
| 1| null| 1| yyy|null|
| 2| null| 7|null| 33|
| 1| null| 12|null|null|
| 2| null| 19|null| 77|
| 1| null| 10| s13|null|
| 2| null| 11| a23|null|
+---+--------+---+----+----+
//Create window specification
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val win = Window.partitionBy("grp").orderBy($"ord".desc)
win: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@71878833
//Use foldLeft to apply first over the window specification to every column, then take distinct
scala> val result = df.columns.foldLeft(df)((df, colName) => df.withColumn(colName, first(colName, ignoreNulls=true).over(win).as(colName))).distinct
result: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [grp: int, null_col: string ... 3 more fields]
scala> result.show
+---+--------+---+----+----+
|grp|null_col|ord|col1|col2|
+---+--------+---+----+----+
| 1| null| 12| s13| 11|
| 2| null| 19| a23| 77|
+---+--------+---+----+----+
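A variant of the same idea uses a single select instead of the foldLeft and pins the window frame explicitly to the whole partition; a sketch against the same df (not tested at 250-column scale):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.first

val winAll = Window
  .partitionBy("grp")
  .orderBy($"ord".desc)
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

val compacted = df
  .select(df.columns.map(c => first(c, ignoreNulls = true).over(winAll).as(c)): _*)
  .distinct

compacted.show
// expected to match the result above: one compacted row per grp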
Hope this helps.
Based on the following dataframe:
+---+-----+----+
| ID|Categ|Amnt|
+---+-----+----+
| 1| A| 10|
| 1| A| 5|
| 2| A| 56|
| 2| B| 13|
+---+-----+----+
I would like to obtain the sum of the column Amnt, grouped by ID and Categ:
+---+-----+-----+
| ID|Categ|Count|
+---+-----+-----+
|  1|    A|   15|
|  2|    A|   56|
|  2|    B|   13|
+---+-----+-----+
In SQL I would be doing something like
SELECT ID,
Categ,
SUM (Count)
FROM Table
GROUP BY ID,
Categ;
But how to do this in Scala?
I tried
DF.groupBy($"ID", $"Categ").sum("Count")
But this just changed the Count column name into sum(count) instead of actually giving me the sum of the counts.
Maybe you were summing the wrong column, but your groupBy/sum statement looks syntactically correct to me:
val df = Seq(
(1, "A", 10),
(1, "A", 5),
(2, "A", 56),
(2, "B", 13)
).toDF("ID", "Categ", "Amnt")
df.groupBy("ID", "Categ").sum("Amnt").show
// +---+-----+---------+
// | ID|Categ|sum(Amnt)|
// +---+-----+---------+
// | 1| A| 15|
// | 2| A| 56|
// | 2| B| 13|
// +---+-----+---------+
EDIT:
To alias the sum(Amnt) column (or, for multiple aggregations), wrap the aggregation expression(s) with agg. For example:
// Rename `sum(Amnt)` as `Sum`
df.groupBy("ID", "Categ").agg(sum("Amnt").as("Sum"))
// Aggregate `sum(Amnt)` and `count(Categ)`
df.groupBy("ID", "Categ").agg(sum("Amnt"), count("Categ"))
I am working on a problem in which I am loading data from a Hive table into a Spark DataFrame, and now I want all the unique accounts in one DataFrame and all the duplicates in another. For example, if I have acct ids 1, 1, 2, 3, 4, I want 2, 3, 4 in one DataFrame and 1, 1 in another. How can I do this?
Depending on the version of Spark you have, you could use window functions on Datasets/SQL as below:
Dataset<Row> New = df.withColumn("Duplicate", count("*").over( Window.partitionBy("id") ) );
Dataset<Row> Dups = New.filter(col("Duplicate").gt(1));
Dataset<Row> Uniques = New.filter(col("Duplicate").equalTo(1));
The above is written in Java; it should be similar in Scala. Read the following on how to do it in Python:
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
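A rough Scala equivalent of the above (assuming a DataFrame df with an id column) could look like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}

val withDupCount = df.withColumn("Duplicate", count("*").over(Window.partitionBy("id")))
val dups    = withDupCount.filter(col("Duplicate") > 1)   // ids that appear more than once
val uniques = withDupCount.filter(col("Duplicate") === 1) // ids that appear exactly once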
df.groupBy($"field1",$"field2"...).count.filter($"count" > 1).show()
val acctDF = List(("1", "Acc1"), ("1", "Acc1"), ("1", "Acc1"), ("2", "Acc2"), ("2", "Acc2"), ("3", "Acc3")).toDF("AcctId", "Details")
scala> acctDF.show()
+------+-------+
|AcctId|Details|
+------+-------+
| 1| Acc1|
| 1| Acc1|
| 1| Acc1|
| 2| Acc2|
| 2| Acc2|
| 3| Acc3|
+------+-------+
// Need to convert the DF to an RDD to apply map and reduceByKey, then back to a DF for further use
val countsDF = acctDF.rdd.map(rec => (rec(0), 1)).reduceByKey(_+_).map(rec=> (rec._1.toString, rec._2)).toDF("AcctId", "AcctCount")
val accJoinedDF = acctDF.join(countsDF, acctDF("AcctId")===countsDF("AcctId"), "left_outer").select(acctDF("AcctId"), acctDF("Details"), countsDF("AcctCount"))
scala> accJoinedDF.show()
+------+-------+---------+
|AcctId|Details|AcctCount|
+------+-------+---------+
| 1| Acc1| 3|
| 1| Acc1| 3|
| 1| Acc1| 3|
| 2| Acc2| 2|
| 2| Acc2| 2|
| 3| Acc3| 1|
+------+-------+---------+
val distAcctDF = accJoinedDF.filter($"AcctCount"===1)
scala> distAcctDF.show()
+------+-------+---------+
|AcctId|Details|AcctCount|
+------+-------+---------+
| 3| Acc3| 1|
+------+-------+---------+
val duplAcctDF = accJoinedDF.filter($"AcctCount">1)
scala> duplAcctDF.show()
+------+-------+---------+
|AcctId|Details|AcctCount|
+------+-------+---------+
| 1| Acc1| 3|
| 1| Acc1| 3|
| 1| Acc1| 3|
| 2| Acc2| 2|
| 2| Acc2| 2|
+------+-------+---------+
(OR scala> duplAcctDF.distinct.show() )
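If the helper AcctCount column is not wanted in the final output, it can simply be dropped before showing, e.g.:
distAcctDF.drop("AcctCount").show()
duplAcctDF.drop("AcctCount").show()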