I am new to Apache Spark and need some help. Can someone tell me how to correctly join the following two dataframes?
First dataframe:
| DATE_TIME | PHONE_NUMBER |
|---------------------|--------------|
| 2019-01-01 00:00:00 | 7056589658 |
| 2019-02-02 00:00:00 | 7778965896 |
Second dataframe:
| DATE_TIME | IP |
|---------------------|---------------|
| 2019-01-01 01:00:00 | 194.67.45.126 |
| 2019-02-02 00:00:00 | 102.85.62.100 |
| 2019-03-03 03:00:00 | 102.85.62.100 |
The final dataframe I want:
| DATE_TIME | PHONE_NUMBER | IP |
|---------------------|--------------|---------------|
| 2019-01-01 00:00:00 | 7056589658 | |
| 2019-01-01 01:00:00 | | 194.67.45.126 |
| 2019-02-02 00:00:00 | 7778965896 | 102.85.62.100 |
| 2019-03-03 03:00:00 | | 102.85.62.100 |
Here is the code I tried:
import org.apache.spark.sql.Dataset
import spark.implicits._
val df1 = Seq(
("2019-01-01 00:00:00", "7056589658"),
("2019-02-02 00:00:00", "7778965896")
).toDF("DATE_TIME", "PHONE_NUMBER")
df1.show()
val df2 = Seq(
("2019-01-01 01:00:00", "194.67.45.126"),
("2019-02-02 00:00:00", "102.85.62.100"),
("2019-03-03 03:00:00", "102.85.62.100")
).toDF("DATE_TIME", "IP")
df2.show()
val total = df1.join(df2, Seq("DATE_TIME"), "left_outer")
total.show()
Unfortunately, it raises an error:
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:136)
at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:367)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:144)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:140)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:140)
at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:135)
...
You need a full outer join, but otherwise your code is fine. Your issue might be something else; from the stack trace alone it's hard to tell what is going wrong.
val total = df1.join(df2, Seq("DATE_TIME"), "full_outer")
You can also write the join condition explicitly (note that this form keeps both DATE_TIME columns in the result):
val total = df1.join(df2, df1("DATE_TIME") === df2("DATE_TIME"), "full_outer")
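As for the exception itself: the stack trace points at a broadcast exchange failing inside awaitResult, which is often a broadcast timeout rather than a problem with the join logic. A hedged suggestion, not a guaranteed fix (it depends on your environment, which we can't see from here): raise the broadcast timeout or disable automatic broadcast joins and re-run the join.
// Sketch only: both settings are standard Spark SQL configuration options,
// but whether they resolve this particular failure depends on the cluster.
spark.conf.set("spark.sql.broadcastTimeout", "600")          // default is 300 seconds
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1") // disable automatic broadcast joins
val total = df1.join(df2, Seq("DATE_TIME"), "full_outer")
total.show()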
I am trying to join a DF with two others using a condition. I have the following DFs.
DF1 is the DF that I want to join with df_cond1 and df_cond2.
If DF1's InfoNum column is NBC I want to join with df_cond1, and if it is BBC I want to join with df_cond2, but I don't know how to do this.
DF1
+-------------+----------+-------------+
| Date | InfoNum | Sport |
+-------------+----------+-------------+
| 31/11/2020 | NBC | football |
| 11/01/2020 | BBC | tennis |
+-------------+----------+-------------+
df_cond1
+-------------+---------+-------------+
| Periodicity | Info | Description |
+-------------+---------+-------------+
| Monthly | NBC | DATAquality |
+-------------+---------+-------------+
df_cond2
+-------------+---------+-------------+
| Periodicity | Info | Description |
+-------------+---------+-------------+
| Daily | BBC | InfoIndeed |
+-------------+---------+-------------+
final_df
+-------------+----------+-------------+-------------+
| Date | InfoNum | Sport | Description |
+-------------+----------+-------------+-------------+
| 31/11/2020 | NBC | football | DATAquality |
| 11/01/2020 | BBC | tennis | InfoIndeed |
+-------------+----------+-------------+-------------+
I have been searching but haven't found a good solution. Can you help me?
Here is how you can do the join:
val df = Seq(
("31/11/2020", "NBC", "football"),
("1/01/2020", "BBC", "tennis")
).toDF("Date", "InfoNum", "Sport")
val df_cond1 = Seq(
("Monthly", "NBC", "DATAquality")
).toDF("Periodicity", "Info", "Description")
val df_cond2 = Seq(
("Daily", "BBC", "InfoIndeed")
).toDF("Periodicity", "Info", "Description")
df.join(df_cond1.union(df_cond2), $"InfoNum" === $"Info")
.drop("Info", "Periodicity")
.show(false)
Output:
+----------+-------+--------+-----------+
|Date |InfoNum|Sport |Description|
+----------+-------+--------+-----------+
|31/11/2020|NBC |football|DATAquality|
|1/01/2020 |BBC |tennis |InfoIndeed |
+----------+-------+--------+-----------+
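The union works because df_cond1 and df_cond2 share the same schema. If they didn't, one alternative (a rough sketch under that assumption, not part of the original answer) is to left-join each lookup dataframe separately and coalesce the descriptions:
import org.apache.spark.sql.functions.coalesce
// Rename the lookup columns so the two joins don't produce ambiguous names.
val d1 = df_cond1.select($"Info".as("info1"), $"Description".as("desc1"))
val d2 = df_cond2.select($"Info".as("info2"), $"Description".as("desc2"))
df.join(d1, $"InfoNum" === $"info1", "left")
  .join(d2, $"InfoNum" === $"info2", "left")
  .withColumn("Description", coalesce($"desc1", $"desc2"))
  .drop("info1", "desc1", "info2", "desc2")
  .show(false)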
I have a dataframe in Spark (Scala) loaded from a large CSV file.
The dataframe looks something like this:
key | col1 | timestamp             |
----+------+-----------------------+
1   | aa   | 2019-01-01 08:02:05.1 |
1   | aa   | 2019-09-02 08:02:05.2 |
1   | cc   | 2019-12-24 08:02:05.3 |
2   | dd   | 2013-01-22 08:02:05.4 |
I need to add two columns, start_date and end_date, like this:
key | col1 | timestamp             | start_date            | end_date              |
----+------+-----------------------+-----------------------+-----------------------+
1   | aa   | 2019-01-01 08:02:05.1 | 2019-01-01 08:02:05.1 | 2019-09-02 08:02:05.2 |
1   | aa   | 2019-09-02 08:02:05.2 | 2019-09-02 08:02:05.2 | 2019-12-24 08:02:05.3 |
1   | cc   | 2019-12-24 08:02:05.3 | 2019-12-24 08:02:05.3 | NULL                  |
2   | dd   | 2013-01-22 08:02:05.4 | 2013-01-22 08:02:05.4 | NULL                  |
Here, for each "key", end_date is the next timestamp for the same key; the end_date for the latest timestamp should be NULL.
What I tried so far:
I tried to use a window function to calculate a rank for each partition, something like this:
var df = read_csv()
//copy timestamp to start_date
df = df
.withColumn("start_date", df.col("timestamp"))
//add null value to the end_date
df = df.withColumn("end_date", typedLit[Option[String]](None))
val windowSpec = Window.partitionBy("merge_key_column").orderBy("start_date")
df
.withColumn("rank", dense_rank()
.over(windowSpec))
.withColumn("max", max("rank").over(Window.partitionBy("merge_key_column")))
So far, I haven't got the desired output.
Use the window lead function for this case.
Example:
val df=Seq((1,"aa","2019-01-01 08:02:05.1"),(1,"aa","2019-09-02 08:02:05.2"),(1,"cc","2019-12-24 08:02:05.3"),(2,"dd","2013-01-22 08:02:05.4")).toDF("key","col1","timestamp")
import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
import org.apache.spark.sql._
val df1=df.withColumn("start_date",col("timestamp"))
val windowSpec = Window.partitionBy("key").orderBy("start_date")
df1.withColumn("end_date",lead(col("start_date"),1).over(windowSpec)).show(10,false)
//+---+----+---------------------+---------------------+---------------------+
//|key|col1|timestamp |start_date |end_date |
//+---+----+---------------------+---------------------+---------------------+
//|1 |aa |2019-01-01 08:02:05.1|2019-01-01 08:02:05.1|2019-09-02 08:02:05.2|
//|1 |aa |2019-09-02 08:02:05.2|2019-09-02 08:02:05.2|2019-12-24 08:02:05.3|
//|1 |cc |2019-12-24 08:02:05.3|2019-12-24 08:02:05.3|null |
//|2 |dd |2013-01-22 08:02:05.4|2013-01-22 08:02:05.4|null |
//+---+----+---------------------+---------------------+---------------------+
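One small caveat worth adding (my note, not part of the answer above): ordering by the string column only works because the "yyyy-MM-dd HH:mm:ss.S" format happens to sort lexicographically in chronological order. Casting to a real timestamp makes the ordering explicit:
// Same lead-based logic as above (reusing df and the imports), but with
// start_date as an actual timestamp column instead of a string.
val dfTs = df.withColumn("start_date", to_timestamp(col("timestamp")))
val winTs = Window.partitionBy("key").orderBy("start_date")
dfTs.withColumn("end_date", lead(col("start_date"), 1).over(winTs)).show(false)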
I need to get, from the input below, a date on which there has been a consecutive 'complete' status for the past 7 days.
Requirement:
1. Go back 8 days (this is easy).
2. Starting from 20190111 in the data frame below, I need to check day by day from 20190111 back to 20190104 (a 7-day period) and get a date on which the status has been 'complete' for 7 consecutive days. So we should get 20190108.
I need this in spark-scala.
input
+---+--------+--------+
| id| date| status|
+---+--------+--------+
| 1|20190101|complete|
| 2|20190102|complete|
| 3|20190103|complete|
| 4|20190104|complete|
| 5|20190105|complete|
| 6|20190106|complete|
| 7|20190107|complete|
| 8|20190108|complete|
| 9|20190109| pending|
| 10|20190110|complete|
| 11|20190111|complete|
| 12|20190112| pending|
| 13|20190113|complete|
| 14|20190114|complete|
| 15|20190115| pending|
| 16|20190116| pending|
| 17|20190117| pending|
| 18|20190118| pending|
| 19|20190119| pending|
+---+--------+--------+
output
+---+--------+--------+
| id| date| status|
+---+--------+--------+
| 1|20190101|complete|
| 2|20190102|complete|
| 3|20190103|complete|
| 4|20190104|complete|
| 5|20190105|complete|
| 6|20190106|complete|
| 7|20190107|complete|
| 8|20190108|complete|
For Spark >= 2.4:
import org.apache.spark.sql.expressions.Window
val df= Seq((1,"20190101","complete"),(2,"20190102","complete"),
(3,"20190103","complete"),(4,"20190104","complete"), (5,"20190105","complete"),(6,"20190106","complete"),(7,"20190107","complete"),(8,"20190108","complete"),
(9,"20190109", "pending"),(10,"20190110","complete"),(11,"20190111","complete"),(12,"20190112", "pending"),(13,"20190113","complete"),(14,"20190114","complete"),(15,"20190115", "pending") , (16,"20190116", "pending"),(17,"20190117", "pending"),(18,"20190118", "pending"),(19,"20190119", "pending")).toDF("id","date","status")
val df1= df.select($"id", to_date($"date", "yyyyMMdd").as("date"), $"status")
val win = Window.orderBy("id")
Coalesce lag_status and status to remove nulls:
val df2= df1.select($"*", lag($"status",1).over(win).as("lag_status")).withColumn("lag_stat", coalesce($"lag_status", $"status")).drop("lag_status")
Create an integer column to denote whether the status for the current day equals the status for the previous day:
val df3=df2.select($"*", ($"status"===$"lag_stat").cast("integer").as("status_flag"))
val win1= Window.orderBy($"id".desc).rangeBetween(0,7)
val df4= df3.select($"*", sum($"status_flag").over(win1).as("previous_7_sum"))
val df_new = df4.where($"previous_7_sum" === 8)
  .select($"date")
  .select(explode(sequence(date_sub($"date", 7), $"date")).as("date"))
val df5 = df4.join(df_new, Seq("date"), "inner")
  .select($"id", concat_ws("", split($"date".cast("string"), "-")).as("date"), $"status")
+---+--------+--------+
| id| date| status|
+---+--------+--------+
| 1|20190101|complete|
| 2|20190102|complete|
| 3|20190103|complete|
| 4|20190104|complete|
| 5|20190105|complete|
| 6|20190106|complete|
| 7|20190107|complete|
| 8|20190108|complete|
+---+--------+--------+
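For reference, the key step above is explode(sequence(date_sub($"date", 7), $"date")), which expands the qualifying end date into all 8 dates of the window. A standalone illustration with a hypothetical literal date:
import org.apache.spark.sql.functions._
// sequence(start, stop) on date columns produces an array with a 1-day step;
// explode then turns that array into one row per date.
Seq("2019-01-08").toDF("d")
  .select(to_date($"d").as("d"))
  .select(explode(sequence(date_sub($"d", 7), $"d")).as("date"))
  .show()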
For Spark < 2.4, use a UDF instead of the built-in array function "sequence":
val df1= df.select($"id", $"date".cast("integer").as("date"), $"status")
val win = Window.orderBy("id")
Coalesce lag_status and status to remove nulls:
val df2= df1.select($"*", lag($"status",1).over(win).as("lag_status")).withColumn("lag_stat", coalesce($"lag_status", $"status")).drop("lag_status")
Create an integer column to denote whether the status for the current day equals the status for the previous day:
val df3=df2.select($"*", ($"status"===$"lag_stat").cast("integer").as("status_flag"))
val win1= Window.orderBy($"id".desc).rangeBetween(0,7)
val df4= df3.select($"*", sum($"status_flag").over(win1).as("previous_7_sum"))
val ud1 = udf((col1: Int) => ((col1 - 7) to col1).toArray)
val df_new= df4.where($"previous_7_sum"===8)
.withColumn("dt_arr", ud1($"date"))
.select(explode($"dt_arr" ).as("date"))
val df5 = df4.join(df_new, Seq("date"), "inner")
  .select($"id", concat_ws("", split($"date".cast("string"), "-")).as("date"), $"status")
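One caveat I'd add here (not part of the original answer): the integer arithmetic in ud1 only yields valid consecutive dates when the 8-day window stays within a single month. A date-based variant of the same idea, assuming the date column is kept as a yyyyMMdd string rather than cast to an integer, would look roughly like this:
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.udf
// Hypothetical replacement for ud1: expand a yyyyMMdd string into the
// 8 calendar dates ending on it, so month boundaries are handled correctly.
val expandDates = udf((d: String) => {
  val fmt = DateTimeFormatter.ofPattern("yyyyMMdd")
  val end = LocalDate.parse(d, fmt)
  (0 to 7).map(i => end.minusDays(7L - i).format(fmt)).toArray
})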
How can I check the dates of the adjacent rows (preceding and next) in a DataFrame? This should happen at a key level.
I have the following data after sorting on key and dates:
source_Df.show()
+-----+--------+------------+------------+
| key | code | begin_dt | end_dt |
+-----+--------+------------+------------+
| 10 | ABC | 2018-01-01 | 2018-01-08 |
| 10 | BAC | 2018-01-03 | 2018-01-15 |
| 10 | CAS | 2018-01-03 | 2018-01-21 |
| 20 | AAA | 2017-11-12 | 2018-01-03 |
| 20 | DAS | 2018-01-01 | 2018-01-12 |
| 20 | EDS | 2018-02-01 | 2018-02-16 |
+-----+--------+------------+------------+
When the date ranges of these rows overlap (i.e. the current row's begin_dt falls between the begin and end dates of the previous row), I need all such rows to get the lowest begin date and the highest end date.
Here is the output I need:
final_Df.show()
+-----+--------+------------+------------+
| key | code | begin_dt | end_dt |
+-----+--------+------------+------------+
| 10 | ABC | 2018-01-01 | 2018-01-21 |
| 10 | BAC | 2018-01-01 | 2018-01-21 |
| 10 | CAS | 2018-01-01 | 2018-01-21 |
| 20 | AAA | 2017-11-12 | 2018-01-12 |
| 20 | DAS | 2017-11-12 | 2018-01-12 |
| 20 | EDS | 2018-02-01 | 2018-02-16 |
+-----+--------+------------+------------+
Appreciate any ideas to achieve this. Thanks in advance!
Here's one approach:
1. Create a new column group_id that is null if begin_dt falls within the date range of the previous row, and a unique integer otherwise.
2. Backfill the nulls in group_id with the last non-null value.
3. Compute min(begin_dt) and max(end_dt) within each (key, group_id) partition.
Example below:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
val df = Seq(
(10, "ABC", "2018-01-01", "2018-01-08"),
(10, "BAC", "2018-01-03", "2018-01-15"),
(10, "CAS", "2018-01-03", "2018-01-21"),
(20, "AAA", "2017-11-12", "2018-01-03"),
(20, "DAS", "2018-01-01", "2018-01-12"),
(20, "EDS", "2018-02-01", "2018-02-16")
).toDF("key", "code", "begin_dt", "end_dt")
val win1 = Window.partitionBy($"key").orderBy($"begin_dt", $"end_dt")
val win2 = Window.partitionBy($"key", $"group_id")
df.
withColumn("group_id", when(
$"begin_dt".between(lag($"begin_dt", 1).over(win1), lag($"end_dt", 1).over(win1)), null
).otherwise(monotonically_increasing_id)
).
withColumn("group_id", last($"group_id", ignoreNulls=true).
over(win1.rowsBetween(Window.unboundedPreceding, 0))
).
withColumn("begin_dt2", min($"begin_dt").over(win2)).
withColumn("end_dt2", max($"end_dt").over(win2)).
orderBy("key", "begin_dt", "end_dt").
show
// +---+----+----------+----------+-------------+----------+----------+
// |key|code| begin_dt| end_dt| group_id| begin_dt2| end_dt2|
// +---+----+----------+----------+-------------+----------+----------+
// | 10| ABC|2018-01-01|2018-01-08|1047972020224|2018-01-01|2018-01-21|
// | 10| BAC|2018-01-03|2018-01-15|1047972020224|2018-01-01|2018-01-21|
// | 10| CAS|2018-01-03|2018-01-21|1047972020224|2018-01-01|2018-01-21|
// | 20| AAA|2017-11-12|2018-01-03| 455266533376|2017-11-12|2018-01-12|
// | 20| DAS|2018-01-01|2018-01-12| 455266533376|2017-11-12|2018-01-12|
// | 20| EDS|2018-02-01|2018-02-16| 455266533377|2018-02-01|2018-02-16|
// +---+----+----------+----------+-------------+----------+----------+
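To reshape that into exactly the final_Df layout from the question, a small follow-up sketch (assuming the transformed dataframe above is bound to a val, hypothetically named grouped, before the final orderBy/show):
// `grouped` is assumed to be the result of the withColumn chain above.
val final_Df = grouped
  .select($"key", $"code", $"begin_dt2".as("begin_dt"), $"end_dt2".as("end_dt"))
  .orderBy("key", "begin_dt", "end_dt")
final_Df.show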
I have two dataframes with the same schema for a comparison test.
df1
------------------------------------------
year | state | count2 | count3 | count4|
2014 | NJ | 12332 | 54322 | 53422 |
2014 | NJ | 12332 | 53255 | 55324 |
2015 | CO | 12332 | 53255 | 55324 |
2015 | MD | 14463 | 76543 | 66433 |
2016 | CT | 14463 | 76543 | 66433 |
2016 | CT | 55325 | 76543 | 66433 |
------------------------------------------
df2
------------------------------------------
year | state | count2 | count3 | count4|
2014 | NJ | 12332 | 54322 | 53422 |
2014 | NJ | 65333 | 65555 | 125 |
2015 | CO | 12332 | 53255 | 55324 |
2015 | MD | 533 | 75 | 64524 |
2016 | CT | 14463 | 76543 | 66433 |
2016 | CT | 55325 | 76543 | 66433 |
------------------------------------------
I want to compare these two dfs on count2 through count4; if the counts don't match, I want to print out a message saying there is a mismatch.
Here is my try:
val cols = df1.columns.filter(_ != "year").toList
def mapDiffs(name: String) = when($"l.$name" === $"r.$name", null).otherwise(array($"l.$name", $"r.$name")).as(name)
val result = df1.as("l").join(df2.as("r"), "year").select($"year" :: cols.map(mapDiffs): _*)
It ends up comparing rows just because they share the same year, which isn't what I wanted.
------------------------------------------
year | state | count2 | count3 | count4|
2014 | NJ | 12332 | 54322 | 53422 |
2014 | NJ | no | no | no |
2015 | CO | 12332 | 53255 | 55324 |
2015 | MD | no | no | 64524 |
2016 | CT | 14463 | 76543 | 66433 |
2016 | CT | 55325 | 76543 | 66433 |
------------------------------------------
I want the result to come out as shown above; how do I achieve that?
Edit: also, in a different scenario where I want to compare one column to other columns within a single df, how do I do that? For example:
------------------------------------------
year | state | count2 | count3 | count4|
2014 | NJ | 12332 | 54322 | 53422 |
I want to compare the count3 and count4 columns to count2; obviously count3 and count4 do not match count2, so I want the result to be:
-----------------------------------------------
year | state | count2 | count3 | count4 |
2014 | NJ | 12332 | mismatch | mismatch |
Thank you!
The dataframe join on year won't work for your mapDiffs method. You need a row-identifying column in df1 and df2 for the join.
import org.apache.spark.sql.functions._
val df1 = Seq(
("2014", "NJ", "12332", "54322", "53422"),
("2014", "NJ", "12332", "53255", "55324"),
("2015", "CO", "12332", "53255", "55324"),
("2015", "MD", "14463", "76543", "64524"),
("2016", "CT", "14463", "76543", "66433"),
("2016", "CT", "55325", "76543", "66433")
).toDF("year", "state", "count2", "count3", "count4")
val df2 = Seq(
("2014", "NJ", "12332", "54322", "53422"),
("2014", "NJ", "12332", "53255", "125"),
("2015", "CO", "12332", "53255", "55324"),
("2015", "MD", "533", "75", "64524"),
("2016", "CT", "14463", "76543", "66433"),
("2016", "CT", "55325", "76543", "66433")
).toDF("year", "state", "count2", "count3", "count4")
Skip this step if you already have a row-identifying column (say, rowId) in the dataframes for the join:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._
val rdd1 = df1.rdd.zipWithIndex.map{
case (row: Row, id: Long) => Row.fromSeq(row.toSeq :+ id)
}
val df1i = spark.createDataFrame( rdd1,
StructType(df1.schema.fields :+ StructField("rowId", LongType, false))
)
val rdd2 = df2.rdd.zipWithIndex.map{
case (row: Row, id: Long) => Row.fromSeq(row.toSeq :+ id)
}
val df2i = spark.createDataFrame( rdd2,
StructType(df2.schema.fields :+ StructField("rowId", LongType, false))
)
Now, define mapDiffs and apply it to the selected columns after joining the dataframes by rowId:
def mapDiffs(name: String) =
when($"l.$name" === $"r.$name", $"l.$name").otherwise("no").as(name)
val cols = df1i.columns.filter(_.startsWith("count")).toList
val result = df1i.as("l").join(df2i.as("r"), "rowId").
select($"l.rowId" :: $"l.year" :: cols.map(mapDiffs): _*)
// +-----+----+------+------+------+
// |rowId|year|count2|count3|count4|
// +-----+----+------+------+------+
// | 0|2014| 12332| 54322| 53422|
// | 5|2016| 55325| 76543| 66433|
// | 1|2014| 12332| 53255| no|
// | 3|2015| no| no| 64524|
// | 2|2015| 12332| 53255| 55324|
// | 4|2016| 14463| 76543| 66433|
// +-----+----+------+------+------+
Note that there appear to be more discrepancies between df1 and df2 than just the 3 "no" spots in your sample result. I've modified the sample data to make those 3 spots the only difference.
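The edit in the question (comparing count3 and count4 against count2 within a single dataframe) isn't covered above; a minimal sketch of one way to do it with the same when/otherwise idea, applied to the df1 defined earlier:
import org.apache.spark.sql.functions._
// Flag count3/count4 values that differ from count2 as "mismatch".
val singleDfCheck = df1.select(
  $"year", $"state", $"count2",
  when($"count3" === $"count2", $"count3").otherwise(lit("mismatch")).as("count3"),
  when($"count4" === $"count2", $"count4").otherwise(lit("mismatch")).as("count4")
)
singleDfCheck.show(false)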