I have two tables/dataframes: A and B.
A has the following columns: cust_id, purch_date.
B has the following columns: cust_id, col1 (col1 is not needed).
The following sample shows the content of each table:
Table A
cust_id purch_date
34564 2017-08-21
34564 2017-08-02
34564 2017-07-21
23847 2017-09-13
23423 2017-06-19
Table B
cust_id col1
23442 x
12452 x
12464 x
23847 x
24354 x
I want to select the cust_id and the first day of the month of purch_date for those cust_id values that are not present in B.
This can be achieved in SQL with the following command:
select a.cust_id, trunc(purch_date, 'MM') as mon
from a
left join b
on a.cust_id = b.cust_id
where b.cust_id is null
group by a.cust_id, mon;
The following will be the output:
Output
cust_id mon
34564 2017-08-01
34564 2017-07-01
23423 2017-06-01
I tried the following to implement the same in Scala:
import org.apache.spark.sql.functions._
val a = spark.sql("select * from db.a")
val b = spark.sql("select * from db.b")
var out = a.join(b, Seq("cust_id"), "left")
.filter("col1 is null")
.select("cust_id", trunc("purch_date", "month"))
.distinct()
But I am getting various errors, such as:
error: type mismatch; found: StringContext required: ?{def $: ?}
I am stuck here and couldn't find enough documentation/answers on the net.
select should be given Columns instead of Strings here (you cannot mix String and Column arguments):
Input:
df1:
+-------+----------+
|cust_id|purch_date|
+-------+----------+
| 34564|2017-08-21|
| 34564|2017-08-02|
| 34564|2017-07-21|
| 23847|2017-09-13|
| 23423|2017-06-19|
+-------+----------+
df2:
+-------+----+
|cust_id|col1|
+-------+----+
| 23442| X|
| 12452| X|
| 12464| X|
| 23847| X|
| 24354| X|
+-------+----+
Change your query as below:
import spark.implicits._                      // needed for the $ column syntax
import org.apache.spark.sql.functions.trunc

df1.join(df2, Seq("cust_id"), "left").filter("col1 is null")
  .select($"cust_id", trunc($"purch_date", "MM"))
  .distinct()
.show()
Output:
+-------+---------------------+
|cust_id|trunc(purch_date, MM)|
+-------+---------------------+
| 23423| 2017-06-01|
| 34564| 2017-07-01|
| 34564| 2017-08-01|
+-------+---------------------+
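As a side note (not part of the original answer), if col1 is never needed, a left_anti join can express the same thing a bit more directly: it keeps only the rows of df1 whose cust_id has no match in df2, so the null filter becomes unnecessary. A sketch, assuming the same DataFrames and imports as above:
// left_anti keeps only df1 rows with no matching cust_id in df2
df1.join(df2, Seq("cust_id"), "left_anti")
  .select($"cust_id", trunc($"purch_date", "MM").as("mon"))
  .distinct()
  .show()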
I have a Spark dataframe df:
|id | year | month |
-------------------
| 1 | 2020 | 01 |
| 2 | 2019 | 03 |
| 3 | 2020 | 01 |
I have a sequence year_month = Seq((2019, 1), (2020, 1), (2021, 1)).
The year_month sequence is generated dynamically every time the code runs.
I want to filter the dataframe df using the year_month sequence, keeping the rows where ($"year", $"month") matches one of the (year, month) pairs in year_month.
You can achieve this by
Create a dataframe from year_month
Perform an inner join on year_month with your original dataframe on month and year
Choose distinct records
The resulting dataframe will contain the matched rows
Working Example
Setup
import spark.implicits._
val dfData = Seq((1,2020,1),(2,2019,3),(3,2020,1))
val df = dfData.toDF()
  .selectExpr("_1 as id", "_2 as year", "_3 as month")
df.createOrReplaceTempView("original_data")
val year_month = Seq((2019,1),(2020,1),(2021,1))
Step 1
// Create Temporary DataFrame
val yearMonthDf = year_month.toDF()
  .selectExpr("_1 as year", "_2 as month")
yearMonthDf.createOrReplaceTempView("temp_year_month")
Step 2
val dfResult = spark.sql("select o.id, o.year, o.month from original_data o inner join temp_year_month t on o.year = t.year and o.month = t.month")
Step 3
val dfResultDistinct = dfResult.distinct()
Output
dfResultDistinct.show()
+---+----+-----+
| id|year|month|
+---+----+-----+
| 1|2020| 1|
| 3|2020| 1|
+---+----+-----+
NB: If you are interested in finding the matching records irrespective of the id, you could update the Spark SQL to the following (o.id has been removed):
select
o.year,
o.month
from
original_data o
inner join
temp_year_month t on o.year = t.year and
o.month = t.month
which would give the following result:
+----+-----+
|year|month|
+----+-----+
|2020| 1|
+----+-----+
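For reference, the same result can also be obtained with the DataFrame API instead of temp views. A sketch reusing the df and yearMonthDf values from the setup above (not part of the original answer):
// inner join on both key columns, then keep distinct matched rows
val dfResultApi = df
  .join(yearMonthDf, Seq("year", "month"), "inner")
  .select("id", "year", "month")
  .distinct()
dfResultApi.show()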
Consider there are 2 dataframes df1 and df2.
df1 has the following data:
A | B
-------
1 | m
2 | n
3 | o
df2 has the following data:
A | B
-------
1 | m
2 | n
3 | p
df1.except(df2) returns
A | B
-------
3 | o
3 | p
How to display the result as below?
df1: 3 | o
df2: 3 | p
As per the API docs, df1.except(df2) returns a new DataFrame containing rows in this frame but not in another frame, i.e. it will return rows that are in df1 and not in df2. Thus a custom except function could be written as:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

def except(df1: DataFrame, df2: DataFrame): DataFrame = {
  val edf1 = df1.except(df2).withColumn("df", lit("df1"))   // rows only in df1
  val edf2 = df2.except(df1).withColumn("df", lit("df2"))   // rows only in df2
  edf1.union(edf2)
}
//Output
+---+---+---+
| A| B| df|
+---+---+---+
| 3| o|df1|
| 3| p|df2|
+---+---+---+
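A quick usage sketch with the sample data from the question (column A is assumed here to be an integer and B a string, which the question does not state explicitly):
import spark.implicits._
val df1 = Seq((1, "m"), (2, "n"), (3, "o")).toDF("A", "B")
val df2 = Seq((1, "m"), (2, "n"), (3, "p")).toDF("A", "B")
except(df1, df2).show()   // prints the table shown above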
I am not sure if I am asking this correctly, and maybe that is the reason I haven't found the correct answer so far. Anyway, if it turns out to be a duplicate I will delete this question.
I have the following data:
id | last_updated | count
__________________________
1 | 20190101 | 3
1 | 20190201 | 2
1 | 20190301 | 1
I want to group this data by the "id" column, get the max value from "last_updated", and for the "count" column keep the value from the row where "last_updated" has its max value. So in that case the result should look like this:
id | last_updated | count
__________________________
1 | 20190301 | 1
So I imagine it will look something like this:
df
.groupBy("id")
.agg(max("last_updated"), ... ("count"))
Is there any function I can use to get "count" based on the "last_updated" column?
I am using Spark 2.4.0.
Thanks for any help.
You have two options; the first is the better one, in my understanding.
OPTION 1
Use a window function partitioned by the id and create a column holding the max of last_updated over that window. Then keep the rows where last_updated equals that max value, and finally drop the original column and rename the max column as desired.
val w = Window.partitionBy("id")
df.withColumn("max", max("last_updated").over(w))
.where("max = last_updated")
.drop("last_updated")
.withColumnRenamed("max", "last_updated")
OPTION 2
You can perform a join with the original dataframe after grouping
df.groupBy("id")
.agg(max("last_updated").as("last_updated"))
.join(df, Seq("id", "last_updated"))
QUICK EXAMPLE
INPUT
df.show
+---+------------+-----+
| id|last_updated|count|
+---+------------+-----+
| 1| 20190101| 3|
| 1| 20190201| 2|
| 1| 20190301| 1|
+---+------------+-----+
OUTPUT
Option 1
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max
val w = Window.partitionBy("id")
df.withColumn("max", max("last_updated").over(w))
.where("max = last_updated")
.drop("last_updated")
.withColumnRenamed("max", "last_updated")
+---+-----+------------+
| id|count|last_updated|
+---+-----+------------+
| 1| 1| 20190301|
+---+-----+------------+
Option 2
df.groupBy("id")
  .agg(max("last_updated").as("last_updated"))
  .join(df, Seq("id", "last_updated")).show
+---+------------+-----+
| id|last_updated|count|
+---+------------+-----+
|  1|    20190301|    1|
+---+------------+-----+
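A third variant, not in the original answer and just a sketch: aggregate a struct instead of joining back or windowing. max over a struct compares its fields left to right, so max(struct(last_updated, count)) keeps the count that belongs to the latest last_updated:
import spark.implicits._
import org.apache.spark.sql.functions.{max, struct}
df.groupBy("id")
  .agg(max(struct($"last_updated", $"count")).as("latest"))
  .select($"id", $"latest.last_updated", $"latest.count")
  .show()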
Suppose, I have the following the dataframe:
id | col1 | col2
-----------------
x | p1 | a1
-----------------
x | p2 | b1
-----------------
y | p2 | b2
-----------------
y | p2 | b3
-----------------
y | p3 | c1
The distinct values from col1, which are (p1, p2, p3), along with id, will be used as columns for the final dataframe. Here, the id y has two col2 values (b2 and b3) for the same col1 value p2, so p2 will be treated as an array-type column.
Therefore, the final dataframe will be
id | p1 | p2 | p3
--------------------------------
x | a1 | [b1] | null
--------------------------------
y | null |[b2, b3]| c1
How can I achieve the second dataframe efficiently from the first dataframe?
You are basically looking for table pivoting; for your case, group by id, pivot col1 as the headers, and aggregate col2 as a list using the collect_list function:
import spark.implicits._, org.apache.spark.sql.functions._
df.groupBy("id").pivot("col1").agg(collect_list("col2")).show
+---+----+--------+----+
| id| p1| p2| p3|
+---+----+--------+----+
| x|[a1]| [b1]| []|
| y| []|[b2, b3]|[c1]|
+---+----+--------+----+
If it's guaranteed that there's at most one value in p1 and p3 for each id, you can convert those columns to String type by getting the first item of the array:
df.groupBy("id").pivot("col1").agg(collect_list("col2"))
.withColumn("p1", $"p1"(0)).withColumn("p3", $"p3"(0))
.show
+---+----+--------+----+
| id| p1| p2| p3|
+---+----+--------+----+
| x| a1| [b1]|null|
| y|null|[b2, b3]| c1|
+---+----+--------+----+
If you need to convert the column types dynamically, i.e. only use array-type columns when you have to:
// get array Type columns
val arrayColumns = df.groupBy("id", "col1").agg(count("*").as("N"))
.where($"N" > 1).select("col1").distinct.collect.map(row => row.getString(0))
// arrayColumns: Array[String] = Array(p2)
// aggregate / pivot data frame
val aggDf = df.groupBy("id").pivot("col1").agg(collect_list("col2"))
// aggDf: org.apache.spark.sql.DataFrame = [id: string, p1: array<string> ... 2 more fields]
// get string columns
val stringColumns = aggDf.columns.filter(x => x != "id" && !arrayColumns.contains(x))
// use foldLeft on string columns to convert the columns to string type
stringColumns.foldLeft(aggDf)((df, x) => df.withColumn(x, col(x)(0))).show
+---+----+--------+----+
| id| p1| p2| p3|
+---+----+--------+----+
| x| a1| [b1]|null|
| y|null|[b2, b3]| c1|
+---+----+--------+----+
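For completeness, a minimal sketch of how the sample input above could be constructed for testing. It assumes all three columns are strings, which the question does not state:
import spark.implicits._
val df = Seq(
  ("x", "p1", "a1"),
  ("x", "p2", "b1"),
  ("y", "p2", "b2"),
  ("y", "p2", "b3"),
  ("y", "p3", "c1")
).toDF("id", "col1", "col2")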
I have two dataframes in Scala:
df1 =
ID Field1
1 AAA
2 BBB
4 CCC
and
df2 =
PK start_date_time
1 2016-10-11 11:55:23
2 2016-10-12 12:25:00
3 2016-10-12 16:20:00
I also have a variable start_date with the format yyyy-MM-dd equal to 2016-10-11.
I need to create a new column check in df1 based on the following condition: If PK is equal to ID AND the year, month and day of start_date_time are equal to start_date, then check is equal to 1, otherwise 0.
The result should be this one:
df1 =
ID Field1 check
1 AAA 1
2 BBB 0
4 CCC 0
In my previous question I had two dataframes and it was suggested to use joining and filtering. However, in this case that won't work. My initial idea was to use a UDF, but I am not sure how to make it work for this case.
You can combine join and withColumn for this case, i.e. first join with df2 on the ID column and then use the when/otherwise syntax to compute the check column:
import spark.implicits._
import org.apache.spark.sql.functions.{lit, to_date, when}
val df2_date = df2.withColumn("date", to_date(df2("start_date_time"))).withColumn("check", lit(1)).select($"PK".as("ID"), $"date", $"check")
df1.join(df2_date, Seq("ID"), "left").withColumn("check", when($"date" === "2016-10-11", $"check").otherwise(0)).drop("date").show
+---+------+-----+
| ID|Field1|check|
+---+------+-----+
| 1| AAA| 1|
| 2| BBB| 0|
| 4| CCC| 0|
+---+------+-----+
Or, as another option, first filter df2 and then join it back with df1 on the ID column:
val df2_date = (df2.withColumn("date", to_date(df2("start_date_time"))).
filter($"date" === "2016-10-11").
withColumn("check", lit(1)).
select($"PK".as("ID"), $"date", $"check"))
df1.join(df2_date, Seq("ID"), "left").drop("date").na.fill(0).show
+---+------+-----+
| ID|Field1|check|
+---+------+-----+
| 1| AAA| 1|
| 2| BBB| 0|
| 4| CCC| 0|
+---+------+-----+
In case you have a date like 2016-OCT-11, you can convert it to a java.sql.Date for comparison as follows:
val format = new java.text.SimpleDateFormat("yyyy-MMM-dd")
val parsed = format.parse("2016-OCT-11")
val date = new java.sql.Date(parsed.getTime())
// date: java.sql.Date = 2016-10-11
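The parsed java.sql.Date can then be used in the comparison from the first option above in place of the hard-coded string. A sketch reusing the df2_date value defined earlier (not part of the original answer):
// use the parsed java.sql.Date instead of the hard-coded string literal
df1.join(df2_date, Seq("ID"), "left")
  .withColumn("check", when($"date" === lit(date), $"check").otherwise(0))
  .drop("date")
  .show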