Group By on a dataframe - scala

I have a dataframe df with columns a, b, c, d, e, f, g.
I also have a Scala list L1, of type List[Any], equal to List(a, b, c).
How can I perform a group-by on df using the columns in L1 and find any duplicates?
Also, how can I find out whether the dataframe has nulls, blanks, or empty values in the columns mentioned in L1?
Something like df.groupBy(l1) needs to be used, as l1 may vary from time to time.
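For reference, a dynamic group-by call could be built from the list roughly like this (a minimal sketch: it assumes the list elements are valid column names of df, and l1 is the question's List[Any]):
import org.apache.spark.sql.functions.col
// turn the untyped List[Any] into Columns so groupBy can take it as varargs
val groupCols = l1.map(x => col(x.toString))
val grouped = df.groupBy(groupCols: _*).count()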

// Nulls
// (assumes a SparkSession named spark and import spark.implicits._ in scope, for toDF)
case class Source(
a: Option[String],
b: Option[String],
c: Option[String],
d: Option[String],
e: Option[String],
f: Option[String],
g: Option[String] )
val l = List("a", "b", "c")
val sourceDF = Seq(
Source(None, Some("b1"), Some("c1"), Some("d1"), Some("e1"), Some("f1"), Some("g1")),
Source(Some("a2"), None, Some("c2"), Some("d2"), Some("e2"), Some("f2"), Some("g2")),
Source(Some("a3"), Some("b3"), None, Some("d3"), Some("e3"), Some("f3"), Some("g3")),
Source(Some("a4"), Some("b4"), Some("c4"), Some("d4"), Some("e4"), Some("f4"), Some("g4"))
).toDF()
sourceDF.show(false)
// +----+----+----+---+---+---+---+
// |a |b |c |d |e |f |g |
// +----+----+----+---+---+---+---+
// |null|b1 |c1 |d1 |e1 |f1 |g1 |
// |a2 |null|c2 |d2 |e2 |f2 |g2 |
// |a3 |b3 |null|d3 |e3 |f3 |g3 |
// |a4 |b4 |c4 |d4 |e4 |f4 |g4 |
// +----+----+----+---+---+---+---+
// build a SQL predicate like "a is null or b is null or c is null" from the list
val f1 = l.map(i => s"$i is null").mkString(" or ")
sourceDF.where(f1).show(false)
// +----+----+----+---+---+---+---+
// |a |b |c |d |e |f |g |
// +----+----+----+---+---+---+---+
// |null|b1 |c1 |d1 |e1 |f1 |g1 |
// |a2 |null|c2 |d2 |e2 |f2 |g2 |
// |a3 |b3 |null|d3 |e3 |f3 |g3 |
// +----+----+----+---+---+---+---+
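// The question also asks about blanks/empty values, not just nulls. A minimal
// extension of the same trick (assumption: "blank" means an empty or
// whitespace-only string; drop trim() if only '' should count):
val f2 = l.map(i => s"$i is null or trim($i) = ''").mkString(" or ")
sourceDF.where(f2).show(false)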
// groupBy
val gbDF = sourceDF.groupBy(l.head, l.tail:_*).count()
gbDF.show(false)
// +----+----+----+-----+
// |a |b |c |count|
// +----+----+----+-----+
// |a2 |null|c2 |1 |
// |a4 |b4 |c4 |1 |
// |a3 |b3 |null|1 |
// |null|b1 |c1 |1 |
// +----+----+----+-----+
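// To surface actual duplicates, keep only the groups that occur more than once
// (with this sample data every (a, b, c) combination is unique, so nothing is returned):
import org.apache.spark.sql.functions.col
val dupDF = gbDF.where(col("count") > 1)
dupDF.show(false)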

Related

How to count the frequency of words with CountVectorizer in spark ML?

The below code gives a count vector for each row in the DataFrame:
import org.apache.spark.ml.feature.{CountVectorizer, CountVectorizerModel}
val df = spark.createDataFrame(Seq(
(0, Array("a", "b", "c")),
(1, Array("a", "b", "b", "c", "a"))
)).toDF("id", "words")
// fit a CountVectorizerModel from the corpus
val cvModel: CountVectorizerModel = new CountVectorizer()
.setInputCol("words")
.setOutputCol("features")
.fit(df)
cvModel.transform(df).show(false)
The result is:
+---+---------------+-------------------------+
|id |words |features |
+---+---------------+-------------------------+
|0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])|
|1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])|
+---+---------------+-------------------------+
How do I get the total count of each word, like:
+---+------+------+
|id |words |counts|
+---+------+------+
|0 |a | 3 |
|1 |b | 3 |
|2 |c | 2 |
+---+------+------+
Shankar's answer only gives you the actual frequencies if the CountVectorizer model keeps every single word in the corpus (i.e. no minDF or vocabSize limitations). In that case you can use Summarizer to sum each Vector directly. Note: Summarizer requires Spark 2.3+.
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.ml.stat.Summarizer.metrics
import spark.implicits._ // for $"features" and .as[(Vector, Vector)]
// You need to request normL1 together with another metric (like mean) because, for some
// reason, Spark won't let you select a single Vector metric on its own (at least in 2.4)
val totalCounts = cvModel.transform(df)
.select(metrics("normL1", "mean").summary($"features").as("summary"))
.select("summary.normL1", "summary.mean")
.as[(Vector, Vector)]
.first()
._1
You'll then have to zip totalCounts with cvModel.vocabulary to get the words themselves.
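For example, a small sketch of that zipping step (totalCounts is the ml Vector obtained above):
// pair each vocabulary term with its summed count across all rows
val wordTotals: Array[(String, Double)] = cvModel.vocabulary.zip(totalCounts.toArray)
wordTotals.foreach { case (word, total) => println(s"$word -> $total") }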
You can simply explode and groupBy to get the count of each word:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, explode, row_number}
cvModel.transform(df).withColumn("words", explode($"words"))
.groupBy($"words")
.agg(count($"words").as("counts"))
.withColumn("id", row_number().over(Window.orderBy("words")) - 1)
.show(false)
Output:
+-----+------+---+
|words|counts|id |
+-----+------+---+
|a |3 |0 |
|b |3 |1 |
|c |2 |2 |
+-----+------+---+

Consider items of the same value when deciding rank

In Spark, I would like to count, for each value, how many values are less than or equal to it. I tried to accomplish this via ranking, but rank produces
[1,2,2,2,3,4] -> [1,2,2,2,5,6]
while what I would like is
[1,2,2,2,3,4] -> [1,4,4,4,5,6]
I can accomplish this by ranking, grouping by the rank, and then adjusting each rank based on how many items are in its group, but that is clunky and inefficient. Is there a better way to do this?
Edit: added a minimal example of what I'm trying to accomplish:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rank
import org.apache.spark.sql.expressions.Window
object Question extends App {
val spark = SparkSession.builder.appName("Question").master("local[*]").getOrCreate()
import spark.implicits._
val win = Window.orderBy($"nums".asc)
Seq(1, 2, 2, 2, 3, 4)
.toDF("nums")
.select($"nums", rank.over(win).alias("rank"))
.as[(Int, Int)]
.groupByKey(_._2)
.mapGroups((rank, nums) => (rank, nums.toList.map(_._1)))
.map(x => (x._1 + x._2.length - 1, x._2))
.flatMap(x => x._2.map(num => (num, x._1)))
.toDF("nums", "rank")
.show(false)
}
Output:
+----+----+
|nums|rank|
+----+----+
|1 |1 |
|2 |4 |
|2 |4 |
|2 |4 |
|3 |5 |
|4 |6 |
+----+----+
Use window functions
scala> val df = Seq(1, 2, 2, 2, 3, 4).toDF("nums")
df: org.apache.spark.sql.DataFrame = [nums: int]
scala> df.createOrReplaceTempView("tbl")
scala> spark.sql(" with tab1(select nums, rank() over(order by nums) rk, count(*) over(partition by nums) cn from tbl) select nums, rk+cn-1 as rk2 from tab1 ").show(false)
18/11/28 02:20:55 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+----+---+
|nums|rk2|
+----+---+
|1 |1 |
|2 |4 |
|2 |4 |
|2 |4 |
|3 |5 |
|4 |6 |
+----+---+
scala>
Note that the window isn't partitioned on any column, so Spark warns about moving all the data to a single partition.
EDIT1:
scala> spark.sql(" select nums, rank() over(order by nums) + count(*) over(partition by nums) -1 as rk2 from tbl ").show
18/11/28 23:20:09 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+----+---+
|nums|rk2|
+----+---+
| 1| 1|
| 2| 4|
| 2| 4|
| 2| 4|
| 3| 5|
| 4| 6|
+----+---+
scala>
EDIT2:
The equivalent df version
scala> val df = Seq(1, 2, 2, 2, 3, 4).toDF("nums")
df: org.apache.spark.sql.DataFrame = [nums: int]
scala> import org.apache.spark.sql.expressions._
import org.apache.spark.sql.expressions._
scala> import org.apache.spark.sql.functions.{rank, count, lit}
import org.apache.spark.sql.functions.{rank, count, lit}
scala> df.withColumn("rk2", rank().over(Window.orderBy('nums)) + count(lit(1)).over(Window.partitionBy('nums)) - 1).show(false)
2018-12-01 11:10:26 WARN WindowExec:66 - No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
+----+---+
|nums|rk2|
+----+---+
|1 |1 |
|2 |4 |
|2 |4 |
|2 |4 |
|3 |5 |
|4 |6 |
+----+---+
scala>
A friend pointed out that I can just calculate the rank in descending order and then, for each rank, compute (max_rank + 1) - current_rank. This is a much more efficient implementation.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rank
import org.apache.spark.sql.expressions.Window
object Question extends App {
val spark = SparkSession.builder.appName("Question").master("local[*]").getOrCreate()
import spark.implicits._
val win = Window.orderBy($"nums".desc)
val rankings = Seq(1, 2, 2, 2, 3, 4)
.toDF("nums")
.select($"nums", rank.over(win).alias("rank"))
.as[(Int, Int)]
val maxElement = rankings.select("rank").as[Int].reduce((a, b) => if (a > b) a else b)
rankings
.map(x => x.copy(_2 = maxElement - x._2 + 1))
.toDF("nums", "rank")
.orderBy("rank")
.show(false)
}
Output
+----+----+
|nums|rank|
+----+----+
|1 |1 |
|2 |4 |
|2 |4 |
|2 |4 |
|3 |5 |
|4 |6 |
+----+----+
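As another option: since the goal is literally "how many values are less than or equal to this one", a running count over a RANGE window frame expresses that directly, with no post-processing of ranks. A minimal sketch (assuming spark.implicits._ is in scope; like the answers above it uses an unpartitioned window, so the same single-partition warning applies):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{count, lit}
// the RANGE frame ending at the current row covers every row whose nums value is
// <= the current row's nums, so the count over it is exactly the desired rank
val cumWin = Window.orderBy($"nums").rangeBetween(Window.unboundedPreceding, Window.currentRow)
Seq(1, 2, 2, 2, 3, 4).toDF("nums")
.withColumn("rank", count(lit(1)).over(cumWin))
.show(false)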

Operate within a group by and populate additional columns

I have a dataframe as below:
+------+------+---+------+
|field1|field2|id |Amount|
+------+------+---+------+
|A |B |002|10.0 |
|A |B |003|12.0 |
|A |B |005|15.0 |
|C |B |002|20.0 |
|C |B |003|22.0 |
|C |B |005|25.0 |
+------+------+---+------+
I need to convert it to:
+------+------+---+-------+---+-------+---+-------+
|field1|field2|002|002_Amt|003|003_Amt|005|005_Amt|
+------+------+---+-------+---+-------+---+-------+
|A |B |002|10.0 |003|12.0 |005|15.0 |
|C |B |002|20.0 |003|22.0 |005|25.0 |
+------+------+---+-------+---+-------+---+-------+
Please advise!
The columns of your final dataframe depend on the id column, so you first need to store the distinct ids in a separate array.
import scala.collection.mutable
import org.apache.spark.sql.functions._
val distinctIds = df.select(collect_list("id")).rdd.first().get(0).asInstanceOf[mutable.WrappedArray[String]].distinct
The next step is to filter the dataframe for each of the distinct ids and join the results:
val first = distinctIds.head
var finalDF = df.filter($"id" === first).withColumnRenamed("id", first).withColumnRenamed("Amount", first+"_Amt")
for(str <- distinctIds.tail){
var tempDF = df.filter($"id" === str).withColumnRenamed("id", str).withColumnRenamed("Amount", str+"_Amt")
finalDF = finalDF.join(tempDF, Seq("field1", "field2"), "left")
}
finalDF.show(false)
You should have your desired output as
+------+------+---+-------+---+-------+---+-------+
|field1|field2|002|002_Amt|003|003_Amt|005|005_Amt|
+------+------+---+-------+---+-------+---+-------+
|A |B |002|10.0 |003|12.0 |005|15.0 |
|C |B |002|20.0 |003|22.0 |005|25.0 |
+------+------+---+-------+---+-------+---+-------+
Using var is discouraged in Scala, so you can instead wrap the same logic in a recursive function, as below:
import org.apache.spark.sql.DataFrame
def getFinalDF(first: Boolean, array: List[String], df: DataFrame, tdf: DataFrame): DataFrame = array match {
case head :: tail => {
if(first) {
getFinalDF(false, tail, df, df.filter($"id" === head).withColumnRenamed("id", head).withColumnRenamed("Amount", head + "_Amt"))
}
else{
val tempDF = df.filter($"id" === head).withColumnRenamed("id", head).withColumnRenamed("Amount", head+"_Amt")
getFinalDF(false, tail, df, tdf.join(tempDF, Seq("field1", "field2"), "left"))
}
}
case Nil => tdf
}
and call the recursive function as
getFinalDF(true, distinctIds.toList, df, df).show(false)
You should have the same output.
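As a side note, a groupBy/pivot version avoids the per-id joins entirely. This is only a sketch: the pivoted columns come out named 002_id/002_Amt (value_alias) rather than 002/002_Amt, so they may need renaming afterwards.
import org.apache.spark.sql.functions.first
// one column group per distinct id value; with multiple aggregations per pivot
// value, the resulting columns are named <pivotValue>_<aggAlias>
val pivoted = df.groupBy("field1", "field2")
.pivot("id")
.agg(first("id").as("id"), first("Amount").as("Amt"))
pivoted.show(false)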

append two dataframes and update data

I want to update an old dataframe based on the pos_id and article_id fields.
If the tuple (pos_id, article_id) exists, I add each column's values to the old row; if it doesn't exist, I append the new row. That part works fine. But I don't know how to handle the case where the old dataframe is initially empty; in that case I just want to add the rows of the second dataframe to the old one. Here is what I did:
val histocaisse = spark.read
.format("csv")
.option("header", "true") //reading the headers
.load("C:/Users/MHT/Desktop/histocaisse_dte1.csv")
import org.apache.spark.sql.types.{LongType, DateType, DoubleType}
val hist1 = histocaisse
.withColumn("pos_id", 'pos_id.cast(LongType))
.withColumn("article_id", 'article_id.cast(LongType))
.withColumn("date", 'date.cast(DateType))
.withColumn("qte", 'qte.cast(DoubleType))
.withColumn("ca", 'ca.cast(DoubleType))
val histocaisse2 = spark.read
.format("csv")
.option("header", "true") //reading the headers
.load("C:/Users/MHT/Desktop/histocaisse_dte2.csv")
val hist2 = histocaisse2.withColumn("pos_id", 'pos_id.cast(LongType))
.withColumn("article_id", 'article_id.cast(LongType))
.withColumn("date", 'date.cast(DateType))
.withColumn("qte", 'qte.cast(DoubleType))
.withColumn("ca", 'ca.cast(DoubleType))
hist1.show(false)
hist2.show(false)
The first two tables below are hist1 and hist2; the third is the result I want:
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-07|2.5 |3.5 |
|2 |2 |2000-01-07|14.7|12.0|
|3 |3 |2000-01-07|3.5 |1.2 |
+------+----------+----------+----+----+
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-08|2.5 |3.5 |
|2 |2 |2000-01-08|14.7|12.0|
|3 |3 |2000-01-08|3.5 |1.2 |
|4 |4 |2000-01-08|3.5 |1.2 |
|5 |5 |2000-01-08|14.5|1.2 |
|6 |6 |2000-01-08|2.0 |1.25|
+------+----------+----------+----+----+
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-08|5.0 |7.0 |
|2 |2 |2000-01-08|39.4|24.0|
|3 |3 |2000-01-08|7.0 |2.4 |
|4 |4 |2000-01-08|3.5 |1.2 |
|5 |5 |2000-01-08|14.5|1.2 |
|6 |6 |2000-01-08|2.0 |1.25|
+------+----------+----------+----+----+
Here is the solution I found:
val df = hist2.join(hist1, Seq("article_id", "pos_id"), "left")
.select($"pos_id", $"article_id",
coalesce(hist2("date"), hist1("date")).alias("date"),
(coalesce(hist2("qte"), lit(0)) + coalesce(hist1("qte"), lit(0))).alias("qte"),
(coalesce(hist2("ca"), lit(0)) + coalesce(hist1("ca"), lit(0))).alias("ca"))
.orderBy("pos_id", "article_id")
This doesn't work when hist1 is empty. Any help please?
Thanks a lot
Not sure if I understood correctly, but if the problem is that one of the dataframes is sometimes empty and that makes the join fail, something you can try is this:
import scala.util.{Try, Success, Failure}
val checkHist1Empty = Try(hist1.first)
val df = checkHist1Empty match {
case Success(df) => {
hist2.join(hist1, Seq("article_id", "pos_id"), "left")
.select($"pos_id", $"article_id",
coalesce(hist2("date"), hist1("date")).alias("date"),
(coalesce(hist2("qte"), lit(0)) + coalesce(hist1("qte"), lit(0))).alias("qte"),
(coalesce(hist2("ca"), lit(0)) + coalesce(hist1("ca"), lit(0))).alias("ca"))
.orderBy("pos_id", "article_id")
}
case Failure(e) => {
hist2.select($"pos_id", $"article_id",
coalesce(hist2("date")).alias("date"),
coalesce(hist2("qte"), lit(0)).alias("qte"),
coalesce(hist2("ca"), lit(0)).alias("ca"))
.orderBy("pos_id", "article_id")
}
}
This basically checks whether hist1 is empty before performing the join. If it is empty, it builds the result with the same logic applied only to hist2; if it contains data, it applies the logic you already had, which you said works.
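As a lighter-weight alternative to wrapping first in a Try, head(1) returns an empty Array when the dataframe has no rows and never throws; a small sketch, using the same hist1:
// true when hist1 has no rows; you could branch on this instead of matching on Success/Failure
val hist1IsEmpty: Boolean = hist1.head(1).isEmpty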
Instead of doing a join, why don't you union the two dataframes and then group by (pos_id, article_id), summing qte and ca and taking the latest date?
import org.apache.spark.sql.functions.{max, sum}
val df3 = hist1.union(hist2)
val df4 = df3.groupBy("pos_id", "article_id").agg(max("date").as("date"), sum("qte").as("qte"), sum("ca").as("ca"))

Spark - named_struct for empty Map

I use Spark 2.0.1 with Scala 2.11, and this question is related to this one.
Below is the setup:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DoubleType, IntegerType, MapType, StructType}
val ss = new StructType().add("x", IntegerType).add("y", MapType(DoubleType, IntegerType))
val s = new StructType()
.add("a", IntegerType)
.add("b", ss)
val d = Seq(Row(1, Row(1,Map(1.0->1, 2.0->2))),
Row(2, Row(2,Map(2.0->2, 3.0->3))),
Row(3, null ),
Row(4, Row(4, Map())))
val rd = sc.parallelize(d)
val df = spark.createDataFrame(rd, s)
df.select($"a", $"b").show(false)
// +---+---------------------------+
// |a |b |
// +---+---------------------------+
// |1 |[1,Map(1.0 -> 1, 2.0 -> 2)]|
// |2 |[2,Map(2.0 -> 2, 3.0 -> 3)]|
// |3 |null |
// |4 |[4,Map()] |
// +---+---------------------------+
//
The statement below works when I provide a default to coalesce (in the pivoted output, the a=3 row shows the default value in column 3):
df.groupBy($"a").pivot("a").
agg(expr("first(coalesce(b, named_struct('x', cast(null as Int), 'y', Map(0.0D, 0) )))" ) )
.show(false)
// +---+---------------------------+---------------------------+--------------------+---------+
// |a |1 |2 |3 |4 |
// +---+---------------------------+---------------------------+--------------------+---------+
// |1 |[1,Map(1.0 -> 1, 2.0 -> 2)]|null |null |null |
// |3 |null |null |[null,Map(0.0 -> 0)]|null |
// |4 |null |null |null |[4,Map()]|
// |2 |null |[2,Map(2.0 -> 2, 3.0 -> 3)]|null |null |
// +---+---------------------------+---------------------------+--------------------+---------+
But how to create an empty Map() (like what's seen in a=4) using named_struct or otherwise?
You can achieve this with a case class and a UDF:
case class MyStruct(x:Option[Int], y:Map[Double,Int])
import org.apache.spark.sql.functions.{udf, first,coalesce}
val emptyStruct = udf(() => MyStruct(None,Map.empty[Double,Int]))
df.groupBy($"a").pivot("a")
.agg(first(coalesce($"b",emptyStruct())))
.show(false)