How to union DataFrames and add only missing rows? - scala

I have a dataframe df1, which contains the data below:

customer_id  product  Val_id  rule_name
1            A        1       rule1
2            B        X       rule1

I have another dataframe df2, which contains the data below:

customer_id  product  Val_id  rule_name
1            A        1       rule2
2            B        X       rule2
3            C        y       rule2
The rule_name value within each dataframe is always fixed.

I want a new unioned dataframe df3. It should have all customers from dataframe df1, plus the customers from dataframe df2 that are not present in df1. So the final df3 should look like:

customer_id  product  Val_id  rule_name
1            A        1       rule1
2            B        X       rule1
3            C        y       rule2

Can anyone please help me achieve this outcome? Any help will be appreciated.

Given the following datasets:
val df1 = Seq(
  (1, "A", "1", "rule1"),
  (2, "B", "X", "rule1")
).toDF("customer_id", "product", "Val_id", "rule_name")

val df2 = Seq(
  (1, "A", "1", "rule2"),
  (2, "B", "X", "rule2"),
  (3, "C", "y", "rule2")
).toDF("customer_id", "product", "Val_id", "rule_name")
And the requirement:
It should have all customers from dataframe df1, plus the customers from dataframe df2 that are not present in df1.
My first solution could be as follows:
// anti-join: keep only the df2 rows whose customer_id has no match in df1
val missingCustomers = df2
  .join(df1, Seq("customer_id"), "leftanti")
  .select($"customer_id", df2("product"), df2("Val_id"), df2("rule_name"))

val all = df1.union(missingCustomers)
scala> all.show
+-----------+-------+------+---------+
|customer_id|product|Val_id|rule_name|
+-----------+-------+------+---------+
| 1| A| 1| rule1|
| 2| B| X| rule1|
| 3| C| y| rule2|
+-----------+-------+------+---------+
Another (and perhaps slower) solution could be as follows:
// find missing ids, i.e. ids in df2 that are not in df1
// BE EXTRA CAREFUL: "Downloading" all missing ids to the driver
val missingIds = df2
  .select("customer_id")
  .except(df1.select("customer_id"))
  .as[Int]
  .collect
// filter ids in df2 that match missing ids
val missingRows = df2.filter($"customer_id" isin (missingIds: _*))
scala> df1.union(missingRows).show
+-----------+-------+------+---------+
|customer_id|product|Val_id|rule_name|
+-----------+-------+------+---------+
| 1| A| 1| rule1|
| 2| B| X| rule1|
| 3| C| y| rule2|
+-----------+-------+------+---------+
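A third variant (a sketch, reusing the df1/df2 above; nothing is collected to the driver) tags each source with a priority, unions the two dataframes, and keeps the highest-priority row per customer_id with a window function:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lit, row_number}

// rows coming from df1 (priority 1) win over rows from df2 (priority 2) for the same customer_id
val ranked = df1.withColumn("prio", lit(1))
  .union(df2.withColumn("prio", lit(2)))
  .withColumn("rn", row_number().over(Window.partitionBy("customer_id").orderBy("prio")))

val df3 = ranked.where(col("rn") === 1).drop("prio", "rn")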

Related

How to write a function that takes a list of column names of a DataFrame, reorders the selected columns to the left, and preserves the unselected columns

I'd like to build a function
def reorderColumns(columnNames: List[String]) = ...
that can be applied to a Spark DataFrame such that the columns specified in columnNames get reordered to the left, and the remaining columns (in any order) stay on the right.
Example:
Given a df with the following 5 columns
| A | B | C | D | E
df.reorderColumns(["D","B","A"]) returns a df with columns ordered like so:
| D | B | A | C | E
Try this one:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def reorderColumns(df: DataFrame, columns: Array[String]): DataFrame = {
  val restColumns: Array[String] = df.columns.filterNot(c => columns.contains(c))
  df.select((columns ++ restColumns).map(col): _*)
}
Usage example:
val spark: SparkSession = SparkSession.builder().appName("test").master("local[*]").getOrCreate()
import spark.implicits._
val df = List((1, 3, 1, 6), (2, 4, 2, 5), (3, 6, 3, 4)).toDF("colA", "colB", "colC", "colD")
reorderColumns(df, Array("colC", "colB")).show
// output:
//+----+----+----+----+
//|colC|colB|colA|colD|
//+----+----+----+----+
//| 1| 3| 1| 6|
//| 2| 4| 2| 5|
//| 3| 6| 3| 4|
//+----+----+----+----+
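If you want the df.reorderColumns(...) call style from the question, you could wrap the function in an implicit class (a minimal sketch; the class name ReorderOps is just for illustration):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// in compiled code, place this inside an object; in the shell it can be defined as-is
implicit class ReorderOps(df: DataFrame) {
  // move the given columns to the left, keep the rest (in their original order) on the right
  def reorderColumns(columns: Seq[String]): DataFrame = {
    val rest = df.columns.filterNot(columns.contains)
    df.select((columns ++ rest).map(col): _*)
  }
}

// usage: df.reorderColumns(List("D", "B", "A"))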

Filter a dataframe using a list of tuples in spark scala

I am trying to filter a dataframe in Scala by comparing two of its columns (subject and stream in this case) to a list of tuples. If the column values match one of the tuples, the row is kept.
val df = Seq(
  (0, "Mark", "Maths", "Science"),
  (1, "Tyson", "History", "Commerce"),
  (2, "Gerald", "Maths", "Science"),
  (3, "Katie", "Maths", "Commerce"),
  (4, "Linda", "History", "Science")
).toDF("id", "name", "subject", "stream")
Sample input:
+---+------+-------+--------+
| id| name|subject| stream|
+---+------+-------+--------+
| 0| Mark| Maths| Science|
| 1| Tyson|History|Commerce|
| 2|Gerald| Maths| Science|
| 3| Katie| Maths|Commerce|
| 4| Linda|History| Science|
+---+------+-------+--------+
The list of tuples against which the above df needs to be filtered:
val listOfTuples = List[(String, String)](
  ("Maths", "Science"),
  ("History", "Commerce")
)
Expected result:
+---+------+-------+--------+
| id| name|subject| stream|
+---+------+-------+--------+
| 0| Mark| Maths| Science|
| 1| Tyson|History|Commerce|
| 2|Gerald| Maths| Science|
+---+------+-------+--------+
You can either do it with isin on structs (needs Spark 2.2+):
import org.apache.spark.sql.functions.{struct, typedLit}

val df_filtered = df
  .where(struct($"subject", $"stream").isin(listOfTuples.map(typedLit(_)): _*))
or with a left-semi join:
val df_filtered = df
  .join(listOfTuples.toDF("subject", "stream"), Seq("subject", "stream"), "leftsemi")
You can simply filter as:
val resultDF = df.filter { row =>
  List(
    ("Maths", "Science"),
    ("History", "Commerce")
  ).contains((row.getAs[String]("subject"), row.getAs[String]("stream")))
}
Hope this helps!

Group by and find count before doing pivot spark

I have a dataframe like the one below:

A    B    C      D
foo  one  small  1
foo  one  large  2
foo  one  large  2
foo  two  small  3
I need to group by columns A and B, pivot on column C, and sum column D.
I am able to do this using
df.groupBy("A", "B").pivot("C").sum("D")
However, I also need to find the count after the groupBy. If I try something like
df.groupBy("A", "B").pivot("C").agg(sum("D"), count("D"))
I get an output with per-pivot-value count columns, like
A  B  large  small  large_count  small_count
Is there a way to get only one count after the groupBy before doing the pivot?
On the output, try
output.withColumn("count", $"large_count" + $"small_count").show
You can drop the two count columns if you want to.
To do it before the pivot, try
df.groupBy("A", "B").agg(count("C"))
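Note that $"large_count" + $"small_count" evaluates to null whenever one of the pivoted columns is null (for example ("foo", "two") has no "large" rows), so you may want to coalesce first. A small sketch, assuming output is the pivoted dataframe with the large_count/small_count columns described in the question:

import org.apache.spark.sql.functions.{coalesce, lit}

// treat a missing pivot bucket as 0 before summing the two counts
val withTotal = output.withColumn(
  "count",
  coalesce($"large_count", lit(0)) + coalesce($"small_count", lit(0))
)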
Is this what you are expecting?
val df = Seq(
  ("foo", "one", "small", 1),
  ("foo", "one", "large", 2),
  ("foo", "one", "large", 2),
  ("foo", "two", "small", 3)
).toDF("A", "B", "C", "D")
scala> df.show
+---+---+-----+---+
| A| B| C| D|
+---+---+-----+---+
|foo|one|small| 1|
|foo|one|large| 2|
|foo|one|large| 2|
|foo|two|small| 3|
+---+---+-----+---+
scala> val df2 = df.groupBy('A,'B).pivot("C").sum("D")
df2: org.apache.spark.sql.DataFrame = [A: string, B: string ... 2 more fields]
scala> val df3 = df.groupBy('A as "A1",'B as "B1").agg(sum('D) as "sumd")
df3: org.apache.spark.sql.DataFrame = [A1: string, B1: string ... 1 more field]
scala> df3.join(df2,'A==='A1 and 'B==='B1,"inner").select("A","B","sumd","large","small").show
+---+---+----+-----+-----+
| A| B|sumd|large|small|
+---+---+----+-----+-----+
|foo|one| 5| 4| 1|
|foo|two| 3| null| 3|
+---+---+----+-----+-----+
This won't require a join. Is this what you are looking for?
val df = Seq(
  ("foo", "one", "small", 1),
  ("foo", "one", "large", 2),
  ("foo", "one", "large", 2),
  ("foo", "two", "small", 3)
).toDF("A", "B", "C", "D")
scala> df.show
+---+---+-----+---+
| A| B| C| D|
+---+---+-----+---+
|foo|one|small| 1|
|foo|one|large| 2|
|foo|one|large| 2|
|foo|two|small| 3|
+---+---+-----+---+
df.createOrReplaceTempView("dummy")
spark.sql("""SELECT * FROM (
    SELECT A, B, C, sum(D) AS D FROM dummy
    GROUP BY A, B, C GROUPING SETS ((A, B, C), (A, B))
    ORDER BY A NULLS LAST, B NULLS LAST, C NULLS LAST
  ) dummy PIVOT (first(D) FOR C IN ('large' large, 'small' small, null total))""").show
+---+---+-----+-----+-----+
| A| B|large|small|total|
+---+---+-----+-----+-----+
|foo|one| 4| 1| 5|
|foo|two| null| 3| 3|
+---+---+-----+-----+-----+

Pyspark: Delete rows on column condition after groupBy

This is my input dataframe:
id val
1 Y
1 N
2 a
2 b
3 N
Result should be:
id val
1 Y
2 a
2 b
3 N
I want to group by the id column: for ids that have both Y and N in val (like id 1), the rows where val is "N" should be removed; ids that only have "N" (like id 3) keep their rows.
Please help me resolve this issue, as I am a beginner to PySpark.
You can first identify the rows with val == "Y" using a filter, then join this dataframe back to the original one. Finally, you can keep the rows where the join produced null as well as the rows you want to keep for the matched ids (val_Y == "Y" and val != "N"). PySpark should be able to handle the self-join even if there are a lot of rows.
The example is shown below:
from pyspark.sql.functions import col

df_new = spark.createDataFrame([
    (1, "Y"), (1, "N"), (1, "X"), (1, "Z"),
    (2, "a"), (2, "b"), (3, "N")
], ("id", "val"))

df_Y = df_new.filter(col("val") == "Y").withColumnRenamed("val", "val_Y").withColumnRenamed("id", "id_Y")
df_new = df_new.join(df_Y, df_new["id"] == df_Y["id_Y"], how="left")
df_new.filter((col("val_Y").isNull()) | ((col("val_Y") == "Y") & ~(col("val") == "N"))).select("id", "val").show()
For this (slightly extended) example the result is:
+---+---+
| id|val|
+---+---+
| 1| X|
| 1| Y|
| 1| Z|
| 3| N|
| 2| a|
| 2| b|
+---+---+

How do I ignore the first element in a groupBy in Scala/Spark?

I am using Spark 2, Zeppelin, and Scala to show the top 10 occurrences of words in a data set.
My code:
z.show(dfFlat.groupBy("value").count().sort(desc("count")), 10)
gives a plot in which 'cat', the most frequent word, appears first.
How do I ignore 'cat' and have the plot start from 'hat', i.e. show the 2nd through last elements?
I tried:
z.show(dfFlat.groupBy("value").count().sort(desc("count")).slice(2,4), 10)
but this gives:
error: value slice is not a member of org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
It's not straightforward to drop the first row of a dataframe (see also Drop first row of Spark DataFrame), but you can do it using window functions:
val df = Seq(
  "cat", "cat", "cat", "hat", "hat", "bat"
).toDF("value")

val dfGrouped = df
  .groupBy($"value").count()
  .sort($"count".desc)
dfGrouped.show()
+-----+-----+
|value|count|
+-----+-----+
| cat| 3|
| hat| 2|
| bat| 1|
+-----+-----+
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.dense_rank

val dfWithoutFirstRow = dfGrouped
  .withColumn("rank", dense_rank().over(Window.partitionBy().orderBy($"count".desc)))
  .where($"rank" =!= 1).drop($"rank") // this filters out "cat"
  .sort($"count".desc)

dfWithoutFirstRow.show()
+-----+-----+
|value|count|
+-----+-----+
| hat| 2|
| bat| 1|
+-----+-----+
The first row can be removed in the following way:
val filteredValue = dfGrouped.first.get(0) // "cat", the most frequent value
val result = dfGrouped.filter(s"value != '$filteredValue'")
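Another option (a sketch, reusing the dfGrouped from above) is to subtract the top row with except; note that except does not preserve ordering, so re-sort afterwards:

// remove the single most frequent row ("cat") by subtracting the top-1 slice
val resultWithoutTop = dfGrouped.except(dfGrouped.limit(1)).sort($"count".desc)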