Compare dates in dataframes - scala

I have two dataframes in Scala:
df1 =
ID start_date_time
1 2016-10-12 11:55:23
2 2016-10-12 12:25:00
3 2016-10-12 16:20:00
and
df2 =
PK start_date
1 2016-10-12
2 2016-10-14
I need to add a new column to df1 that will have the value 1 if the following condition holds, and 0 otherwise:
ID == PK and start_date_time refers to the same year, month and day as start_date.
The result should be this one:
df1 =
ID start_date_time check
1 2016-10-12 11:55:23 1
2 2016-10-12 12:25:00 0
3 2016-10-12 16:20:00 0
How can I do it?
I assume that the logic should be something like this:
val define = udf { (id: String, dateString: String) =>
  val formatter = new SimpleDateFormat("yyyy-MM-dd")
  val date = formatter.parse(dateString)
  val checks = df2.filter(df2("PK") === id).filter(df2("start_date") === date)
  if (checks.collect().length > 0) "1" else "0"
}
df1 = df1.withColumn("check", define(df1("ID"), df1("start_date_time")))
However, I have doubts about how to compare the dates, because df1 and df2 have differently formatted dates. What is a better way to implement this?

You can use Spark's datetime functions to create a date column on both df1 and df2, then do a left join of df1 with df2. Here you create an extra constant column check on df2 to indicate whether there is a match in the result:
import org.apache.spark.sql.functions.{lit, to_date}

val df1_date = df1.withColumn("date", to_date(df1("start_date_time")))
val df2_date = df2.withColumn("date", to_date(df2("start_date")))
  .withColumn("check", lit(1))
  .select($"PK".as("ID"), $"date", $"check")

df1_date.join(df2_date, Seq("ID", "date"), "left").drop($"date").na.fill(0).show
+---+--------------------+-----+
| ID| start_date_time|check|
+---+--------------------+-----+
| 1|2016-10-12 11:55:...| 1|
| 2|2016-10-12 12:25:...| 0|
| 3|2016-10-12 16:20:...| 0|
+---+--------------------+-----+
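For reference, here is a minimal sketch (an assumption: a Spark 2.x session named spark with spark.implicits._ available) that builds the two example frames used in this question, so the answer above can be tried end to end:
import spark.implicits._

// Toy data matching the question; start_date_time and start_date are kept as strings
// and to_date in the answer above parses them into date columns.
val df1 = Seq(
  (1, "2016-10-12 11:55:23"),
  (2, "2016-10-12 12:25:00"),
  (3, "2016-10-12 16:20:00")
).toDF("ID", "start_date_time")

val df2 = Seq(
  (1, "2016-10-12"),
  (2, "2016-10-14")
).toDF("PK", "start_date")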

I don't have the exact logic, but I would do something like this:
val df3 = df2
  .join(df1, df1("ID") === df2("PK"))
  .filter($"start_date_time".isBefore($"start_date")) // isBefore is a Joda-Time method, not a Column one, so this comparison belongs inside a UDF after parsing both values
You will need to convert the two timestamps to Joda time using this: Converting a date string to a DateTime object using Joda Time library
Good luck!
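If you do want the parse-and-compare route this answer hints at, here is a hedged sketch of my own that swaps java.time in for Joda (assuming both columns arrive as strings, which is what the question shows):
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.{udf, when}

// True when the date part of a "yyyy-MM-dd HH:mm:ss" string equals a "yyyy-MM-dd" string.
val sameDay = udf { (dateTime: String, date: String) =>
  if (dateTime == null || date == null) false
  else {
    val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
    LocalDateTime.parse(dateTime, fmt).toLocalDate == LocalDate.parse(date)
  }
}

val checked = df1
  .join(df2, df1("ID") === df2("PK"), "left")
  .withColumn("check", when(sameDay(df1("start_date_time"), df2("start_date")), 1).otherwise(0))
  .drop("PK", "start_date")
That said, the earlier join-based answer is usually the better option, since a UDF keeps Catalyst from optimizing the comparison.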

Related

spark: merge two dataframes, if ID duplicated in two dataframes, the row in df1 overwrites the row in df2

There are two dataframes, df1 and df2, with the same schema. ID is the primary key.
I need to merge df1 and df2. This could be done with a union, except for one special requirement: if there are duplicate rows with the same ID in df1 and df2, I need to keep the one from df1.
df1:
ID col1 col2
1 AA 2019
2 B 2018
df2:
ID col1 col2
1 A 2019
3 C 2017
I need the following output:
df1:
ID col1 col2
1 AA 2019
2 B 2018
3 C 2017
How can I do this? Thanks. I think it is possible to register two temp tables, do a full join and use coalesce, but I would rather not do it that way, because in reality there are about 40 columns instead of the 3 in the example above.
Given that the two DataFrames have the same schema, you could simply union df1 with the left_anti join of df2 & df1:
df1.union(df2.join(df1, Seq("ID"), "left_anti")).show
// +---+----+----+
// | ID|col1|col2|
// +---+----+----+
// |  1|  AA|2019|
// |  2|   B|2018|
// |  3|   C|2017|
// +---+----+----+
One way to do this is to union the dataframes with an identifier column that records which dataframe each row came from, and then use it to prioritize rows from df1 with a function like row_number.
A PySpark SQL solution is shown here.
from pyspark.sql.functions import lit, row_number, when
from pyspark.sql import Window

df1_with_identifier = df1.withColumn('identifier', lit('df1'))
df2_with_identifier = df2.withColumn('identifier', lit('df2'))
merged_df = df1_with_identifier.union(df2_with_identifier)
# Define the Window with the desired ordering: rows from df1 sort first within each ID
w = Window.partitionBy(merged_df['ID']).orderBy(when(merged_df['identifier'] == 'df1', 1).otherwise(2))
result = merged_df.withColumn('rownum', row_number().over(w))
result.filter(result.rownum == 1).drop('rownum', 'identifier').show()
A solution with a left join on df1 could be a lot simpler, except that you have to write multiple coalesces.
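Since the question mentions roughly 40 columns, here is a hedged Scala sketch (to match the rest of the page) of that full-join-plus-coalesce route, generating the coalesce list from the schema instead of writing it by hand; it assumes both frames share the schema and that "ID" is the key:
import org.apache.spark.sql.functions.{col, coalesce}

// One coalesce per non-key column: prefer df1's value, fall back to df2's.
val valueCols = df1.columns.filterNot(_ == "ID")
val merged = df1.as("a")
  .join(df2.as("b"), Seq("ID"), "full_outer")
  .select(col("ID") +: valueCols.map(c => coalesce(col(s"a.$c"), col(s"b.$c")).as(c)): _*)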

Spark Scala: Add 10 days to a date string (not a column)

I have a date and want to add and subtract 10 days from it. start_date and end_date are dynamic variables from one table and will be used to filter another table.
eg.
val start_date = "2018-09-08"
val end_date = "2018-09-15"
I want to use the two dates above in the filter shown below:
myDF.filter($"timestamp".between(date_sub(start_date, 10),date_add(end_date, 10)))
The functions date_add and date_sub only take in columns as an input. How can I add/subtract 10 (this is an arbitrary number) from my dates?
Thanks
Thank you Luis! Your solution worked; for anyone interested, the solution looks like this:
import org.apache.spark.sql.functions.{lit, date_add, date_sub}

val start_date = lit("2018-09-08")
val end_date = lit("2018-09-15")
myDF.filter($"timestamp".between(date_sub(start_date, 10), date_add(end_date, 10)))
Another way: if you can create a temp view, then you can access the vals using string interpolation ($).
Make sure the values use the default date/timestamp formats.
Check this out:
scala> val start_date = "2018-09-08"
start_date: String = 2018-09-08
scala> val end_date = "2018-09-15"
end_date: String = 2018-09-15
scala> val myDF=Seq(("2018-09-08"),("2018-09-15")).toDF("timestamp").withColumn("timestamp",to_timestamp('timestamp))
myDF: org.apache.spark.sql.DataFrame = [timestamp: timestamp]
scala> myDF.show(false)
+-------------------+
|timestamp |
+-------------------+
|2018-09-08 00:00:00|
|2018-09-15 00:00:00|
+-------------------+
scala> myDF.createOrReplaceTempView("ts_table")
scala> spark.sql(s""" select timestamp, date_sub('$start_date',10) as d_sub, date_add('$end_date',10) d_add from ts_table """).show(false)
+-------------------+----------+----------+
|timestamp |d_sub |d_add |
+-------------------+----------+----------+
|2018-09-08 00:00:00|2018-08-29|2018-09-25|
|2018-09-15 00:00:00|2018-08-29|2018-09-25|
+-------------------+----------+----------+
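The same interpolation can then drive the filter the question was actually after; a small sketch against the temp view above:
// Rows whose timestamp falls within [start_date - 10 days, end_date + 10 days]
spark.sql(
  s"""select *
     |from ts_table
     |where timestamp between date_sub('$start_date', 10) and date_add('$end_date', 10)
     |""".stripMargin).show(false)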

Comparing the value of columns in two dataframe

I have two dataframes: one has a unique value per id and the other can have multiple rows for each id.
This is dataframe df1:
id | dt| speed | stats
358899055773504 2018-07-31 18:38:34 0 [9,-1,-1,13,0,1,0]
358899055773505 2018-07-31 18:48:23 4 [8,-1,0,22,1,1,1]
df2:
id | dt| speed | stats
358899055773504 2018-07-31 18:38:34 0 [9,-1,-1,13,0,1,0]
358899055773505 2018-07-31 18:54:23 4 [9,0,0,22,1,1,1]
358899055773504 2018-07-31 18:58:34 0 [9,0,-1,22,0,1,0]
358899055773504 2018-07-31 18:28:34 0 [9,0,-1,22,0,1,0]
358899055773505 2018-07-31 18:38:23 4 [8,-1,0,22,1,1,1]
I aim to compare the second dataframe with the first and update the values in the first dataframe, but only when the dt value for a particular id in df2 is greater than the one in df1; when that condition is satisfied, the other fields should be compared (and updated) as well.
You need to join the two dataframes together to compare any of their columns.
What you can do is first join the dataframes and then perform all the filtering to get a new dataframe with all the rows that should be updated:
val diffDf = df1.as("a").join(df2.as("b"), Seq("id"))
.filter($"b.dt" > $"a.dt")
.filter(...) // Any other filter required
.select($"id", $"b.dt", $"b.speed", $"b.stats")
Note: In some situations it would be required to do a groupBy(id) or use a window function, since there should only be one final row per id in the diffDf dataframe. This can be done as follows (the example here will select the row with the maximum speed, but it depends on the actual requirements):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

val w = Window.partitionBy($"id").orderBy($"speed".desc)
val diffDf2 = diffDf.withColumn("rn", row_number.over(w)).where($"rn" === 1).drop("rn")
More in-depth information about different approaches can be seen here: How to max value and keep all columns (for max records per group)?.
To replace the old rows with the same id in the df1 dataframe, combine the dataframes with an outer join and coalesce:
val df = df1.as("a").join(diffDf.as("b"), Seq("id"), "outer")
.select(
$"id",
coalesce($"b.dt", $"a.dt").as("dt"),
coalesce($"b.speed", $"a.speed").as("speed"),
coalesce($"b.stats", $"a.stats").as("stats")
)
coalesce works by first trying to take the value from the diffDf (b) dataframe. If that value is null it will take the value from df1 (a).
Result when only using the time filter with the provided example input dataframes:
+---------------+-------------------+-----+-----------------+
| id| dt|speed| stats|
+---------------+-------------------+-----+-----------------+
|358899055773504|2018-07-31 18:58:34| 0|[9,0,-1,22,0,1,0]|
|358899055773505|2018-07-31 18:54:23| 4| [9,0,0,22,1,1,1]|
+---------------+-------------------+-----+-----------------+

how to apply partition in spark scala dataframe with multiple columns? [duplicate]

This question already has answers here:
How to select the first row of each group?
(9 answers)
Closed 5 years ago.
I have the following dataframe df in Spark Scala:
id project start_date Change_date designation
1 P1 08/10/2018 01/09/2017 2
1 P1 08/10/2018 02/11/2018 3
1 P1 08/10/2018 01/08/2016 1
I then need to get the designation whose Change_date is closest to start_date while still being earlier than it.
Expected output:
id project start_date designation
1 P1 08/10/2018 2
This is because change date 01/09/2017 is the closest date before start_date.
Can somebody advise how to achieve this?
This is not about selecting the first row, but about selecting the designation corresponding to the Change_date closest to (and before) the start_date.
Parse dates:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark: SparkSession = ???
import spark.implicits._

val df = Seq(
  (1, "P1", "08/10/2018", "01/09/2017", 2),
  (1, "P1", "08/10/2018", "02/11/2018", 3),
  (1, "P1", "08/10/2018", "01/08/2016", 1)
).toDF("id", "project_id", "start_date", "changed_date", "designation")

val parsed = df
  .withColumn("start_date", to_date($"start_date", "dd/MM/yyyy"))
  .withColumn("changed_date", to_date($"changed_date", "dd/MM/yyyy"))
Find the difference:
val diff = parsed
  .withColumn("diff", datediff($"start_date", $"changed_date"))
  .where($"diff" > 0)
Apply solution of your choice from How to select the first row of each group?, for example window functions. If you group by id:
import org.apache.spark.sql.expressions.Window
val w = Window.partitionBy($"id").orderBy($"diff")
diff.withColumn("rn", row_number.over(w)).where($"rn" === 1).drop("rn").show
// +---+----------+----------+------------+-----------+----+
// | id|project_id|start_date|changed_date|designation|diff|
// +---+----------+----------+------------+-----------+----+
// | 1| P1|2018-10-08| 2017-09-01| 2| 402|
// +---+----------+----------+------------+-----------+----+
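As an alternative to the window function (just another option from the linked question, sketched here), you can aggregate the minimum diff per id and join it back:
import org.apache.spark.sql.functions.min

// Keep, per id, the row whose changed_date is closest to (but before) start_date.
// Note: the join-back can return several rows per id if two diffs tie.
val closest = diff.groupBy($"id").agg(min($"diff").as("diff"))
closest.join(diff, Seq("id", "diff")).drop("diff").show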
Reference:
How to select the first row of each group?

How to transpose dataframe in Spark 1.5 (no pivot operator available)?

I want to transpose the following table using Spark Scala, without the pivot function.
I am using Spark 1.5.1 and the pivot function is not supported in 1.5.1. Please suggest a suitable method to transpose the following table:
Customer Day Sales
1 Mon 12
1 Tue 10
1 Thu 15
1 Fri 2
2 Sun 10
2 Wed 5
2 Thu 4
2 Fri 3
Output table :
Customer Sun Mon Tue Wed Thu Fri
1 0 12 10 0 15 2
2 10 0 0 5 4 3
The following code does not work because I am using Spark 1.5.1 and the pivot function is only available from Spark 1.6:
var Trans = Cust_Sales.groupBy("Customer").Pivot("Day").sum("Sales")
I'm not sure how efficient this is, but you can use collect to get all the distinct days, then add a column for each day, and finally use groupBy and sum:
// get distinct days from data (this assumes there are not too many of them):
val days: Array[String] = df.select("Day")
  .distinct()
  .collect()
  .map(_.getAs[String]("Day"))

// add a column for each day with the Sales value if the days match:
val withDayColumns = days.foldLeft(df) {
  case (data, day) => data.selectExpr("*", s"IF(Day = '$day', Sales, 0) AS $day")
}

// wrap it up
val result = withDayColumns
  .drop("Day")
  .drop("Sales")
  .groupBy("Customer")
  .sum(days: _*)

result.show()
Which prints (almost) what you wanted:
+--------+--------+--------+--------+--------+--------+--------+
|Customer|sum(Tue)|sum(Thu)|sum(Sun)|sum(Fri)|sum(Mon)|sum(Wed)|
+--------+--------+--------+--------+--------+--------+--------+
| 1| 10| 15| 0| 2| 12| 0|
| 2| 0| 4| 10| 3| 0| 5|
+--------+--------+--------+--------+--------+--------+--------+
I'll leave it to you to rename / reorder the columns if needed.
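If the generated sum(Day) headers bother you, a small follow-up sketch (my addition, not part of the original answer) renames them back to the plain day names and restores a calendar order:
// Rename "sum(Mon)" -> "Mon", etc., then put the day columns into calendar order.
val renamed = days.foldLeft(result) { (df, day) => df.withColumnRenamed(s"sum($day)", day) }
val dayOrder = Seq("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat").filter(days.contains)
renamed.select("Customer", dayOrder: _*).show()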
If you are working with Python, the code below might help. Let's say you want to transpose the Spark DataFrame df:
pandas_df = df.toPandas().transpose().reset_index()
transposed_df = sqlContext.createDataFrame(pandas_df)
transposed_df.show()
Consider a data frame which has 6 columns, where we want to group by the first 4 columns and pivot on col5 while aggregating col6 (say, summing it). In Spark 1.6+ this could be written with the built-in pivot as:
val pivotedDf = df_to_pivot
  .groupBy(col1, col2, col3, col4)
  .pivot(col5)
  .agg(sum(col6))
If you cannot use Spark 1.6, here is code with the same output, written for Spark 1.5 without the built-in pivot function:
import scala.collection.SortedMap
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StructType, StructField, StringType, DateType, DoubleType}

// Extracting the distinct col5 values to create the new columns
val distinctCol5Values = df_to_pivot
  .select(col(col5))
  .distinct
  .sort(col5) // ensure that the output columns are in a consistent logical order
  .map(_.getString(0))
  .collect()
  .toSeq

// Grouping the data frame to be pivoted on col1-col4
val pivotedAgg = df_to_pivot.rdd
  .groupBy { row =>
    (row.getString(col1Index),
     row.getDate(col2Index),
     row.getDate(col3Index),
     row.getString(col4Index))
  }

// Initializing a list of (String, Double) tuples that will fill the columns to be created
val pivotColListTuple = distinctCol5Values.map(ft => (ft, 0.0))
// Using a SortedMap to ensure the column order is maintained
var distinctCol5ValuesListMap = SortedMap(pivotColListTuple: _*)

// Pivoting the data on col5 by unpacking the grouped data
val pivotedRDD = pivotedAgg.map { groupedRow =>
  // Reset the accumulator map for this group
  distinctCol5ValuesListMap = distinctCol5ValuesListMap.map(ft => (ft._1, 0.0))
  groupedRow._2.foreach { row =>
    // Updating the distinctCol5ValuesListMap values to reflect the changes
    // Change this part according to the aggregation you want
    distinctCol5ValuesListMap = distinctCol5ValuesListMap.updated(row.getString(col5Index),
      distinctCol5ValuesListMap.getOrElse(row.getString(col5Index), 0.0) + row.getDouble(col6Index))
  }
  Row.fromSeq(Seq(groupedRow._1._1, groupedRow._1._2, groupedRow._1._3, groupedRow._1._4) ++
    distinctCol5ValuesListMap.values.toSeq)
}

// Constructing the StructFields for the new columns
val colTypesStruct = distinctCol5ValuesListMap.map(colName => StructField(colName._1, DoubleType))
// Adding the first four columns' StructFields together with the new columns' structs
val opStructType = StructType(Seq(StructField(col1Name, StringType),
  StructField(col2Name, DateType),
  StructField(col3Name, DateType),
  StructField(col4Name, StringType)) ++ colTypesStruct)

// Creating the final data frame
val pivotedDF = sqlContext.createDataFrame(pivotedRDD, opStructType)
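The snippet above leaves several placeholders undefined (col1 … col6, their row indices, and the output column names). Purely to show what it expects, here is a hypothetical set of bindings (all names are assumptions, adjust them to your schema):
// Hypothetical placeholder bindings for the sketch above.
val Seq(col1, col2, col3, col4, col5, col6) =
  Seq("key_a", "key_date_1", "key_date_2", "key_b", "pivot_col", "value_col")
val Seq(col1Index, col2Index, col3Index, col4Index, col5Index, col6Index) =
  Seq(col1, col2, col3, col4, col5, col6).map(df_to_pivot.columns.indexOf(_))
// Output column names for the grouping keys, reused when building the schema.
val (col1Name, col2Name, col3Name, col4Name) = (col1, col2, col3, col4)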