how to find week difference between two dates - scala

I have a DataFrame with two date columns in Unix time, and I want to find the week difference between them. There is a weekofyear function in Spark SQL, but it is only useful when both dates fall in the same year. How can I find the week difference otherwise?
P.S. I'm using Scala Spark.

You can take the approach of creating a custom UDF for this:
scala> val df=sc.parallelize(Seq((1480401142453L,1480399932853L))).toDF("date1","date2")
df: org.apache.spark.sql.DataFrame = [date1: bigint, date2: bigint]
scala> df.show
+-------------+-------------+
|        date1|        date2|
+-------------+-------------+
|1480401142453|1480399932853|
+-------------+-------------+
scala> val udfDateDifference=udf((date1:Long,date2:Long)=>((date1-date2)/(60*60*24*7)).toInt)
udfDateDifference: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,IntegerType,Some(List(LongType, LongType)))
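scala> val resultDF=df.withColumn("dateDifference",udfDateDifference(df("date1"),df("date2")))
resultDF: org.apache.spark.sql.DataFrame = [date1: bigint, date2: bigint ... 1 more field]
scala> resultDF.show
+-------------+-------------+--------------+
|        date1|        date2|dateDifference|
+-------------+-------------+--------------+
|1480401142453|1480399932853|             2|
+-------------+-------------+--------------+
This gives you the week difference. Note that 60*60*24*7 is the number of seconds in a week, so the UDF assumes both columns hold epoch seconds. The sample values above have millisecond magnitude (read as milliseconds they are only about 20 minutes apart), so for millisecond data divide by 1000L*60*60*24*7 instead.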

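Since the dates are already Unix timestamps (in seconds), the week difference is a single expression:
((date1-date2)/(60*60*24*7)).toInt
Edit:
Updating this answer with an example:
// 60*60*24*7 => 604800 seconds per week
// register returns a UserDefinedFunction; keep it in a val to call it from the DataFrame API
val weekdiff = spark.udf.register("weekdiff", (from: Long, to: Long) => ((from - to) / 604800L).toInt)
df.withColumn("weekdiff", weekdiff(df("date1_col_name"), df("date2_col_name")))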
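The arithmetic inside these UDFs depends on the unit of the epoch values, which is easy to get wrong. Below is a minimal plain-Scala sketch of the same division for both units; the object and method names (WeekDiff, weeksFromSeconds, weeksFromMillis) are made up for illustration and are not part of Spark.

```scala
object WeekDiff {
  // Number of seconds in one week: 604800.
  val SecondsPerWeek: Long = 60L * 60 * 24 * 7

  // Whole weeks between two epoch-second values (integer division truncates toward zero).
  def weeksFromSeconds(from: Long, to: Long): Long =
    (from - to) / SecondsPerWeek

  // Whole weeks between two epoch-millisecond values.
  def weeksFromMillis(from: Long, to: Long): Long =
    (from - to) / (SecondsPerWeek * 1000L)
}
```

Either function body can be dropped into udf(...) unchanged. With the sample row from the first answer, WeekDiff.weeksFromMillis(1480401142453L, 1480399932853L) is 0, because the two instants are about 20 minutes apart.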

Related

Spark Scala: Add 10 days to a date string (not a column)

I have a date and want to add and subtract 10 days to it. Start_date and end_date are dynamic variables from one table and will be used to filter another table.
eg.
val start_date = "2018-09-08"
val end_date = "2018-09-15"
I want to use the two dates above in a filter shown below;
myDF.filter($"timestamp".between(date_sub(start_date, 10),date_add(end_date, 10)))
The functions date_add and date_sub only take in columns as an input. How can I add/subtract 10 (this is an arbitrary number) from my dates?
Thanks
Thank you Luis! Your solution worked. For anyone interested, the solution looks like this:
val start_date = lit("2018-09-08")
val end_date = lit("2018-09-15")
myDF.filter($"timestamp".between(date_sub(start_date, 10),date_add(end_date, 10)))
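Because start_date and end_date are plain strings on the driver rather than columns, the shift can also be done before Spark is involved at all. The following is a small sketch using java.time; the object and method names (DateShift, shiftDays) are hypothetical, and it assumes the strings are in ISO yyyy-MM-dd form.

```scala
import java.time.LocalDate

object DateShift {
  // Shifts an ISO yyyy-MM-dd date string by the given number of days.
  // Pass a negative number to move backwards.
  def shiftDays(date: String, days: Long): String =
    LocalDate.parse(date).plusDays(days).toString
}
```

DateShift.shiftDays("2018-09-08", -10) gives "2018-08-29" and DateShift.shiftDays("2018-09-15", 10) gives "2018-09-25". The resulting strings can then be wrapped in lit(...) and passed to between in the filter.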
Another way: if you can create a temp view, you can splice the vals into the SQL text with Scala's s-string interpolation ($start_date).
Make sure the strings are in one of the default date/timestamp formats (for example yyyy-MM-dd).
Check this out:
scala> val start_date = "2018-09-08"
start_date: String = 2018-09-08
scala> val end_date = "2018-09-15"
end_date: String = 2018-09-15
scala> val myDF=Seq(("2018-09-08"),("2018-09-15")).toDF("timestamp").withColumn("timestamp",to_timestamp('timestamp))
myDF: org.apache.spark.sql.DataFrame = [timestamp: timestamp]
scala> myDF.show(false)
+-------------------+
|timestamp          |
+-------------------+
|2018-09-08 00:00:00|
|2018-09-15 00:00:00|
+-------------------+
scala> myDF.createOrReplaceTempView("ts_table")
scala> spark.sql(s""" select timestamp, date_sub('$start_date',10) as d_sub, date_add('$end_date',10) d_add from ts_table """).show(false)
+-------------------+----------+----------+
|timestamp          |d_sub     |d_add     |
+-------------------+----------+----------+
|2018-09-08 00:00:00|2018-08-29|2018-09-25|
|2018-09-15 00:00:00|2018-08-29|2018-09-25|
+-------------------+----------+----------+

How to subtract one Scala Spark DataFrame from another (Normalise to the mean)

I have two Spark DataFrames:
df1 with 80 columns
CO01...CO80
+----+----+
|CO01|CO02|
+----+----+
|2.06|0.56|
|1.96|0.72|
|1.70|0.87|
|1.90|0.64|
+----+----+
and df2 with 80 columns
avg(CO01)...avg(CO80)
which is mean of each column
+------------------+------------------+
|         avg(CO01)|         avg(CO02)|
+------------------+------------------+
|2.6185106382978716|1.0080985915492937|
+------------------+------------------+
How can I subtract df2 from df1 for corresponding values?
I'm looking for a solution that does not require listing all the columns.
P.S
In pandas it could be simply done by:
df2=df1-df1.mean()
Here is what you can do
scala> val df = spark.sparkContext.parallelize(List(
| (2.06,0.56),
| (1.96,0.72),
| (1.70,0.87),
| (1.90,0.64))).toDF("c1","c2")
df: org.apache.spark.sql.DataFrame = [c1: double, c2: double]
scala>
scala> def subMean(mean: Double) = udf[Double, Double]((value: Double) => value - mean)
subMean: (mean: Double)org.apache.spark.sql.expressions.UserDefinedFunction
scala>
scala> val result = df.columns.foldLeft(df)( (df, col) =>
| { val avg = df.select(mean(col)).first().getAs[Double](0);
| df.withColumn(col, subMean(avg)(df(col)))
| })
result: org.apache.spark.sql.DataFrame = [c1: double, c2: double]
scala>
scala> result.show(10, false)
+---------------------+---------------------+
|c1                   |c2                   |
+---------------------+---------------------+
|0.15500000000000025  |-0.13749999999999996 |
|0.05500000000000016  |0.022499999999999964 |
|-0.20499999999999985 |0.1725               |
|-0.004999999999999893|-0.057499999999999996|
+---------------------+---------------------+
Hope this helps!
Note that this works for any number of columns, as long as all columns in the DataFrame are numeric.
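The fold above launches one aggregation per column and routes every value through a UDF. A possible alternative, sketched below under the same assumption that every column is numeric, computes all means in a single aggregation and subtracts them with plain column arithmetic. The names CenterColumns and center are invented for this sketch, and the local SparkSession in main is only there to make it self-contained.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col}

object CenterColumns {
  // Subtracts each column's mean from that column.
  // Assumes every column in df is numeric and df is not empty.
  def center(df: DataFrame): DataFrame = {
    // One aggregation job returns a single Row holding all column means.
    val means = df.select(df.columns.map(c => avg(col(c))): _*).first()
    // Build one expression per column: value minus that column's mean.
    val centered = df.columns.zipWithIndex.map { case (c, i) =>
      (col(c) - means.getDouble(i)).as(c)
    }
    df.select(centered: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("center").getOrCreate()
    import spark.implicits._
    val df = Seq((2.06, 0.56), (1.96, 0.72), (1.70, 0.87), (1.90, 0.64)).toDF("CO01", "CO02")
    center(df).show(false)
    spark.stop()
  }
}
```

This is the closest analogue to the pandas df1 - df1.mean() line: the means are collected to the driver once and then baked into the select as literals, so df2 does not need to be joined back in.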

How to extract week day as a number from a Spark dataframe with the Scala API

I have a date column which is string in dataframe in the 2017-01-01 12:15:43 timestamp format.
Now I want to get weekday number(1 to 7) from that column using dataframe and not spark sql.
Like below
df.select(weekday(col("colname")))
I found one in Python and SQL but not in Scala. Can anybody help me with this?
In sqlContext:
sqlContext.sql("select date_format(to_date('2017-01-01'),'W') as week")
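The same expression works from Scala. Note that in Spark 2.x the pattern letter W means week of the month, not day of the week, so the 1 below is a week number. For a weekday number use the pattern u (1 = Monday, ..., 7 = Sunday) or, from Spark 2.3 on, the dayofweek function (1 = Sunday, ..., 7 = Saturday).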
scala> spark.version
res1: String = 2.3.0
scala> spark.sql("select date_format(to_date('2017-01-01'),'W') as week").show
// +----+
// |week|
// +----+
// |   1|
// +----+
or
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> val df = Seq("2017-01-01").toDF("date")
df: org.apache.spark.sql.DataFrame = [date: string]
scala> df.select(date_format(to_date(col("date")), "W")).show
+-------------------------------+
|date_format(to_date(`date`), W)|
+-------------------------------+
|                              1|
+-------------------------------+
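If the goal is specifically a weekday number from 1 to 7, the numbering convention matters. As a rough sketch, the plain-Scala helper below parses the question's timestamp format with java.time and returns the ISO weekday (1 = Monday, ..., 7 = Sunday). The names Weekday and isoWeekday are hypothetical; the function could be wrapped in udf(...) if the built-in dayofweek function (Spark 2.3+, 1 = Sunday, ..., 7 = Saturday) does not fit.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object Weekday {
  // Matches strings such as 2017-01-01 12:15:43.
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // ISO weekday number: 1 = Monday ... 7 = Sunday.
  def isoWeekday(ts: String): Int =
    LocalDateTime.parse(ts, fmt).getDayOfWeek.getValue
}
```

Weekday.isoWeekday("2017-01-01 12:15:43") is 7, since 1 January 2017 was a Sunday.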

Spark 2.0 Timestamp Difference in Milliseconds using Scala

I am using Spark 2.0 and looking for a way to achieve the following in Scala:
Need the time-stamp difference in milliseconds between two Data-frame column values.
Value_1 = 06/13/2017 16:44:20.044
Value_2 = 06/13/2017 16:44:21.067
Data-types for both is timestamp.
Note: Applying the function unix_timestamp(Column s) to both values and subtracting works, but only to second precision, not to the milliseconds that are required.
Final query would look like this:
Select **timestamp_diff**(Value_2,Value_1) from table1
this should return the following output:
1023 milliseconds
where timestamp_diff is the function that would calculate the difference in milliseconds.
One way is to use Unix epoch time, the number of milliseconds since 1 January 1970. Below is an example using a UDF that takes two timestamps and returns the difference between them in milliseconds.
import java.sql.Timestamp
import org.apache.spark.sql.functions.{col, udf}
// getTime returns epoch milliseconds, so the result is endTime - startTime in ms
val timestamp_diff = udf((endTime: Timestamp, startTime: Timestamp) => {
  endTime.getTime() - startTime.getTime()
})
// df is a DataFrame with two timestamp columns (col1 and col2)
val result = df.withColumn("diff", timestamp_diff(col("col2"), col("col1")))
Alternatively, you can register the function to use with SQL commands:
val timestamp_diff = (endTime: Timestamp, startTime: Timestamp) => {
  endTime.getTime() - startTime.getTime()
}
spark.sqlContext.udf.register("timestamp_diff", timestamp_diff)
df.createOrReplaceTempView("table1")
val df2 = spark.sqlContext.sql("SELECT *, timestamp_diff(col2, col1) as diff from table1")
The same for PySpark:
import datetime
from pyspark.sql.types import LongType
def timestamp_diff(time1: datetime.datetime, time2: datetime.datetime):
    return int((time1-time2).total_seconds()*1000)
int and *1000 are only to output milliseconds
Example usage:
# without an explicit return type, register defaults to StringType
spark.udf.register("timestamp_diff", timestamp_diff, LongType())
df.createOrReplaceTempView("table1")
df2 = spark.sql("SELECT *, timestamp_diff(col2, col1) as diff from table1")
It's not an optimal solution since UDFs are usually slow, so you might run into performance issues.
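The core of the UDF can be checked without a cluster, since it only relies on java.sql.Timestamp.getTime returning epoch milliseconds. Here is a minimal sketch using the two values from the question; the object and method names (TimestampDiff, millisBetween) are made up for illustration.

```scala
import java.sql.Timestamp

object TimestampDiff {
  // Milliseconds elapsed from start to end; getTime is epoch milliseconds.
  def millisBetween(start: Timestamp, end: Timestamp): Long =
    end.getTime - start.getTime

  def main(args: Array[String]): Unit = {
    // Timestamp.valueOf expects the yyyy-MM-dd HH:mm:ss[.fff] form.
    val value1 = Timestamp.valueOf("2017-06-13 16:44:20.044")
    val value2 = Timestamp.valueOf("2017-06-13 16:44:21.067")
    println(millisBetween(value1, value2)) // prints 1023
  }
}
```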
Bit late to the party, but hope it's still useful.
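import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
// Casting a timestamp to double gives epoch seconds with a fractional part; scale to milliseconds.
def getUnixTimestamp(col: Column): Column = (col.cast("double") * 1000).cast("long")
df.withColumn("diff", getUnixTimestamp(col("col2")) - getUnixTimestamp(col("col1")))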
Of course you can define a separate method for the difference:
def timestampDiff(col1: Column, col2: Column): Column = getUnixTimestamp(col2) - getUnixTimestamp(col1)
df.withColumn("diff", timestampDiff(col("col1"), col("col2")))
To make life easier one can define an overloaded method for Strings with a default diff name:
def timestampDiff(col1: String, col2: String): Column = timestampDiff(col(col1), col(col2)).as("diff")
Now in action:
scala> df.show(false)
+-----------------------+-----------------------+
|min_time               |max_time               |
+-----------------------+-----------------------+
|1970-01-01 01:00:02.345|1970-01-01 01:00:04.786|
|1970-01-01 01:00:23.857|1970-01-01 01:00:23.999|
|1970-01-01 01:00:02.325|1970-01-01 01:01:07.688|
|1970-01-01 01:00:34.235|1970-01-01 01:00:34.444|
|1970-01-01 01:00:34.235|1970-01-01 01:00:34.454|
+-----------------------+-----------------------+
scala> df.withColumn("diff", timestampDiff("min_time", "max_time")).show(false)
+-----------------------+-----------------------+-----+
|min_time               |max_time               |diff |
+-----------------------+-----------------------+-----+
|1970-01-01 01:00:02.345|1970-01-01 01:00:04.786|2441 |
|1970-01-01 01:00:23.857|1970-01-01 01:00:23.999|142  |
|1970-01-01 01:00:02.325|1970-01-01 01:01:07.688|65363|
|1970-01-01 01:00:34.235|1970-01-01 01:00:34.444|209  |
|1970-01-01 01:00:34.235|1970-01-01 01:00:34.454|219  |
+-----------------------+-----------------------+-----+
scala> df.select(timestampDiff("min_time", "max_time")).show(false)
+-----+
|diff |
+-----+
|2441 |
|142  |
|65363|
|209  |
|219  |
+-----+

scala spark - matching dataframes based on variable dates

I'm trying to match two dataframes based on a variable date window. I am not simply trying to get an exact match, which my code achieves but to get all likely candidates within a variable day window.
I was able to get exact matches on dates with my code.
But I want to find out if the records are still viable to match since they could be a few days off either side but would still be reasonable enough to join on.
I've tried looking for something similar to python's pd.to_timedelta('1 day') in spark to add to the filter but alas have struck no luck.
Here is my current code which matches the dataframe on the ID column and then runs a filter to ensure that the from_date in the second dataframe is between the start_date and the end_date of the first dataframe.
What I need is not the exact date match but be able to match records if they fall between a day or two (either side) of the actual dates.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
val df1 = spark.read.option("header","true")
.option("inferSchema","true").csv("../data/df1.csv")
val df2 = spark.read.option("header","true")
.option("inferSchema","true")
.csv("../data/df2.csv")
val df = df2.join(df1,
(df1("ID") === df2("ID")) &&
(df2("from_date") >= df1("start_date")) &&
(df2("from_date") <= df1("end_date")),"left")
.select(df1("ID"), df1("start_date"), df1("end_date"),
$"from_date", $"to_date")
df.coalesce(1).write.format("com.databricks.spark.csv")
.option("header", "true").save("../mydata.csv")
Essentially I want to be able to edit this date window to increase or decrease the data actually matching.
Would really appreciate your input. I'm new to spark/scala but gotta say I'm loving it so far ... soo much faster (and cleaner) than python!
cheers
You can apply date_add and date_sub to start_date/end_date in your join condition, as shown below:
import org.apache.spark.sql.functions._
import java.sql.Date
val df1 = Seq(
(1, Date.valueOf("2018-12-01"), Date.valueOf("2018-12-05")),
(2, Date.valueOf("2018-12-01"), Date.valueOf("2018-12-06")),
(3, Date.valueOf("2018-12-01"), Date.valueOf("2018-12-07"))
).toDF("ID", "start_date", "end_date")
val df2 = Seq(
(1, Date.valueOf("2018-11-30")),
(2, Date.valueOf("2018-12-08")),
(3, Date.valueOf("2018-12-08"))
).toDF("ID", "from_date")
val deltaDays = 1
df2.join( df1,
df1("ID") === df2("ID") &&
df2("from_date") >= date_sub(df1("start_date"), deltaDays) &&
df2("from_date") <= date_add(df1("end_date"), deltaDays),
"left_outer"
).show
// +---+----------+----+----------+----------+
// | ID| from_date|  ID|start_date|  end_date|
// +---+----------+----+----------+----------+
// |  1|2018-11-30|   1|2018-12-01|2018-12-05|
// |  2|2018-12-08|null|      null|      null|
// |  3|2018-12-08|   3|2018-12-01|2018-12-07|
// +---+----------+----+----------+----------+
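The join condition boils down to one inclusive range check with the window widened by deltaDays on each side. As a sanity check, the sketch below expresses that predicate in plain Scala with java.time and can be run against the three sample rows; the names DateWindow and inWindow are hypothetical.

```scala
import java.time.LocalDate

object DateWindow {
  // True when from lies in [start - deltaDays, end + deltaDays], both bounds inclusive.
  def inWindow(from: LocalDate, start: LocalDate, end: LocalDate, deltaDays: Long): Boolean =
    !from.isBefore(start.minusDays(deltaDays)) && !from.isAfter(end.plusDays(deltaDays))

  def main(args: Array[String]): Unit = {
    val start = LocalDate.parse("2018-12-01")
    // ID 1: 2018-11-30 is exactly one day before the start date.
    println(inWindow(LocalDate.parse("2018-11-30"), start, LocalDate.parse("2018-12-05"), 1)) // true
    // ID 2: 2018-12-08 is two days after the end date 2018-12-06.
    println(inWindow(LocalDate.parse("2018-12-08"), start, LocalDate.parse("2018-12-06"), 1)) // false
    // ID 3: 2018-12-08 is exactly one day after the end date 2018-12-07.
    println(inWindow(LocalDate.parse("2018-12-08"), start, LocalDate.parse("2018-12-07"), 1)) // true
  }
}
```

Raising deltaDays to 2 makes the ID 2 row match as well, which is the knob the question asks for.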
You can also get the same result with the datediff() function. Check this out:
scala> val df1 = Seq((1, "2018-12-01", "2018-12-05"),(2, "2018-12-01", "2018-12-06"),(3, "2018-12-01", "2018-12-07")).toDF("ID", "start_date", "end_date").withColumn("start_date",'start_date.cast("date")).withColumn("end_date",'end_date.cast("date"))
df1: org.apache.spark.sql.DataFrame = [ID: int, start_date: date ... 1 more field]
scala> val df2 = Seq((1, "2018-11-30"), (2, "2018-12-08"),(3, "2018-12-08")).toDF("ID", "from_date").withColumn("from_date",'from_date.cast("date"))
df2: org.apache.spark.sql.DataFrame = [ID: int, from_date: date]
scala> val delta = 1;
delta: Int = 1
scala> df2.join(df1,df1("ID") === df2("ID") && datediff('from_date,'start_date) >= -delta && datediff('from_date,'end_date)<=delta, "leftOuter").show(false)
+---+----------+----+----------+----------+
|ID |from_date |ID  |start_date|end_date  |
+---+----------+----+----------+----------+
|1  |2018-11-30|1   |2018-12-01|2018-12-05|
|2  |2018-12-08|null|null      |null      |
|3  |2018-12-08|3   |2018-12-01|2018-12-07|
+---+----------+----+----------+----------+