Spark Scala: Add 10 days to a date string (not a column) - scala

I have a date and want to add and subtract 10 days from it. start_date and end_date are dynamic variables from one table and will be used to filter another table.
eg.
val start_date = "2018-09-08"
val end_date = "2018-09-15"
I want to use the two dates above in a filter, as shown below:
myDF.filter($"timestamp".between(date_sub(start_date, 10),date_add(end_date, 10)))
The functions date_add and date_sub only take a column as input. How can I add/subtract 10 days (an arbitrary number) from my dates?
Thanks

Thank you Luis! Your solution worked. For anyone interested, the solution looks like this:
import org.apache.spark.sql.functions.{date_add, date_sub, lit}

val start_date = lit("2018-09-08")
val end_date = lit("2018-09-15")
myDF.filter($"timestamp".between(date_sub(start_date, 10), date_add(end_date, 10)))
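If you would rather keep the bounds as plain strings instead of literal columns, another option (a sketch, not from the original answer; it assumes the default yyyy-MM-dd format) is to shift the dates in Scala with java.time before filtering:
import java.time.LocalDate

// Shift the bounds in plain Scala, then filter with ordinary string values
val shiftedStart = LocalDate.parse("2018-09-08").minusDays(10).toString  // "2018-08-29"
val shiftedEnd   = LocalDate.parse("2018-09-15").plusDays(10).toString   // "2018-09-25"
myDF.filter($"timestamp".between(shiftedStart, shiftedEnd))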

Another way: if you can create a temp view, then you can access the vals through s-string interpolation in a SQL query.
Just make sure the strings use the default date/timestamp format (yyyy-MM-dd).
Check this out:
scala> val start_date = "2018-09-08"
start_date: String = 2018-09-08
scala> val end_date = "2018-09-15"
end_date: String = 2018-09-15
scala> val myDF=Seq(("2018-09-08"),("2018-09-15")).toDF("timestamp").withColumn("timestamp",to_timestamp('timestamp))
myDF: org.apache.spark.sql.DataFrame = [timestamp: timestamp]
scala> myDF.show(false)
+-------------------+
|timestamp |
+-------------------+
|2018-09-08 00:00:00|
|2018-09-15 00:00:00|
+-------------------+
scala> myDF.createOrReplaceTempView("ts_table")
scala> spark.sql(s""" select timestamp, date_sub('$start_date',10) as d_sub, date_add('$end_date',10) d_add from ts_table """).show(false)
+-------------------+----------+----------+
|timestamp |d_sub |d_add |
+-------------------+----------+----------+
|2018-09-08 00:00:00|2018-08-29|2018-09-25|
|2018-09-15 00:00:00|2018-08-29|2018-09-25|
+-------------------+----------+----------+
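To actually filter with the shifted bounds (which is what the original question needs) rather than just computing them, the same interpolation works in a WHERE clause; a small sketch against the ts_table view above:
val filtered = spark.sql(s""" select * from ts_table where timestamp between date_sub('$start_date', 10) and date_add('$end_date', 10) """)
filtered.show(false)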

Related

Filter a dataframe based on the string date input in spark scala

I have a table with a column 'date', and the date format is yyyyMMdd. I need to filter this dataframe and return a dataframe with only the rows whose date is greater than an input. For example: return all the rows where date is greater than "20180715". I did the following.
scala> df.groupBy("date").count.show(50,false)
+--------+----------+
|date |count |
+--------+----------+
|20180707|200 |
|20180715|1429586969|
|20180628|1425490080|
|20180716|1429819708|
+--------+----------+
scala> var con = df.filter(to_date(df("date"),"yyyyMMdd").gt(lit("20180715")))
scala> con.count
res4: Long = 0
scala> var con = df.filter(to_date(df("date"),"yyyyMMdd").gt(lit("20170715")))
scala> con.count
res1: Long = 4284896957
When I input the date as "20170715", it counts all the records, whereas if the date is "20180715", the filter condition does not work. What is the correct way to compare against a string date?
Changing the format of the input string passed to the lit function solved this issue: to_date produces dates in the default yyyy-MM-dd form, so the literal being compared against must be given in that format rather than as yyyyMMdd.
scala> var con = df.filter(to_date(df("date"),"yyyyMMdd").gt(lit("2018-07-15")))
scala> con.count
res6: Long = 1429819708
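A slightly more robust variant (a sketch, not part of the original answer; the two-argument to_date requires Spark 2.2+) is to parse the literal with to_date as well, so both sides of the comparison are real dates and no string format needs to line up:
// Parse the input string with the same pattern as the column, then compare date to date
val con = df.filter(to_date(df("date"), "yyyyMMdd").gt(to_date(lit("20180715"), "yyyyMMdd")))
This should return the same count as the corrected filter above.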

Unexpected incorrect result after unixtime conversion in sparksql

I have a dataframe with content like below:
scala> patDF.show
+---------+-------+-----------+-------------+
|patientID| name|dateOtBirth|lastVisitDate|
+---------+-------+-----------+-------------+
| 1001|Ah Teck| 1991-12-31| 2012-01-20|
| 1002| Kumar| 2011-10-29| 2012-09-20|
| 1003| Ali| 2011-01-30| 2012-10-21|
+---------+-------+-----------+-------------+
All the columns are strings.
I want to get the list of records whose lastVisitDate (in yyyy-mm-dd format) falls in the range between a given date and now, so here is the script:
patDF.registerTempTable("patients")
val results2 = sqlContext.sql("SELECT * FROM patients WHERE from_unixtime(unix_timestamp(lastVisitDate, 'yyyy-mm-dd')) between '2012-09-15' and current_timestamp() order by lastVisitDate")
results2.show()
It returns nothing, although presumably there should be records with patientID 1002 and 1003.
So I modified the query to:
val results3 = sqlContext.sql("SELECT from_unixtime(unix_timestamp(lastVisitDate, 'yyyy-mm-dd')), * FROM patients")
results3.show()
Now I get:
+-------------------+---------+-------+-----------+-------------+
| _c0|patientlD| name|dateOtBirth|lastVisitDate|
+-------------------+---------+-------+-----------+-------------+
|2012-01-20 00:01:00| 1001|Ah Teck| 1991-12-31| 2012-01-20|
|2012-01-20 00:09:00| 1002| Kumar| 2011-10-29| 2012-09-20|
|2012-01-21 00:10:00| 1003| Ali| 2011-01-30| 2012-10-21|
+-------------------+---------+-------+-----------+-------------+
If you look at the first column, you will see that all the months were somehow changed to 01.
What's wrong with the code?
The correct pattern for year-month-day is yyyy-MM-dd: in date patterns, lowercase mm means minutes, so 'yyyy-mm-dd' parses the month field as minutes, which is why the months all became 01 and the real month showed up in the minutes position.
val patDF = Seq(
(1001, "Ah Teck", "1991-12-31", "2012-01-20"),
(1002, "Kumar", "2011-10-29", "2012-09-20"),
(1003, "Ali", "2011-01-30", "2012-10-21")
).toDF("patientID", "name", "dateOtBirth", "lastVisitDate")
patDF.createOrReplaceTempView("patTable")
val result1 = spark.sqlContext.sql("""
select * from patTable where to_timestamp(lastVisitDate, 'yyyy-MM-dd')
between '2012-09-15' and current_timestamp() order by lastVisitDate
""")
result1.show
// +---------+-----+-----------+-------------+
// |patientID| name|dateOtBirth|lastVisitDate|
// +---------+-----+-----------+-------------+
// | 1002|Kumar| 2011-10-29| 2012-09-20|
// | 1003| Ali| 2011-01-30| 2012-10-21|
// +---------+-----+-----------+-------------+
You can also use the DataFrame API, if you prefer:
val result2 = patDF.where(to_timestamp($"lastVisitDate", "yyyy-MM-dd").
  between(to_timestamp(lit("2012-09-15"), "yyyy-MM-dd"), current_timestamp())
).orderBy($"lastVisitDate")
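For completeness, the smallest change that fixes the original query is simply correcting the pattern passed to unix_timestamp from 'yyyy-mm-dd' to 'yyyy-MM-dd'; everything else can stay as it was:
val results2 = sqlContext.sql("SELECT * FROM patients WHERE from_unixtime(unix_timestamp(lastVisitDate, 'yyyy-MM-dd')) between '2012-09-15' and current_timestamp() order by lastVisitDate")
results2.show()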

Spark 2.0 Timestamp Difference in Milliseconds using Scala

I am using Spark 2.0 and looking for a way to achieve the following in Scala:
I need the timestamp difference in milliseconds between two DataFrame column values.
Value_1 = 06/13/2017 16:44:20.044
Value_2 = 06/13/2017 16:44:21.067
The data type of both is timestamp.
Note: applying the function unix_timestamp(Column s) to both values and subtracting works, but not to millisecond precision, which is the requirement.
The final query would look like this:
select timestamp_diff(Value_2, Value_1) from table1
This should return the following output:
1023 milliseconds
where timestamp_diff is the function that calculates the difference in milliseconds.
One way would be to use Unix epoch time, i.e. the number of milliseconds since 1 January 1970. Below is an example using a UDF; it takes two timestamps and returns the difference between them in milliseconds.
import java.sql.Timestamp
import org.apache.spark.sql.functions.{col, udf}

val timestamp_diff = udf((startTime: Timestamp, endTime: Timestamp) => {
  startTime.getTime() - endTime.getTime()
})
val df = // dataframe with two timestamp columns (col1 and col2)
  .withColumn("diff", timestamp_diff(col("col2"), col("col1")))
Alternatively, you can register the function to use with SQL commands:
val timestamp_diff = (startTime: Timestamp, endTime: Timestamp) => {
  startTime.getTime() - endTime.getTime()
}
spark.sqlContext.udf.register("timestamp_diff", timestamp_diff)
df.createOrReplaceTempView("table1")
val df2 = spark.sqlContext.sql("SELECT *, timestamp_diff(col2, col1) as diff from table1")
The same for PySpark:
import datetime
def timestamp_diff(time1: datetime.datetime, time2: datetime.datetime):
    return int((time1 - time2).total_seconds() * 1000)
The int cast and the *1000 factor are only there to output whole milliseconds.
Example usage:
spark.udf.register("timestamp_diff", timestamp_diff)
df.registerTempTable("table1")
df2 = spark.sql("SELECT *, timestamp_diff(col2, col1) as diff from table1")
It's not an optimal solution since UDFs are usually slow, so you might run into performance issues.
Bit late to the party, but hope it's still useful.
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
// cast the timestamp to fractional seconds since the epoch, then scale to whole milliseconds
def getUnixTimestamp(col: Column): Column = (col.cast("double") * 1000).cast("long")
df.withColumn("diff", getUnixTimestamp(col("col2")) - getUnixTimestamp(col("col1")))
Of course you can define a separate method for the difference:
def timestampDiff(col1: Column, col2: Column): Column = getUnixTimestamp(col2) - getUnixTimestamp(col1)
df.withColumn("diff", timestampDiff(col("col1"), col("col2")))
To make life easier, one can define an overloaded method for Strings, with "diff" as the default output column name:
def timestampDiff(col1: String, col2: String): Column = timestampDiff(col(col1), col(col2)).as("diff")
Now in action:
scala> df.show(false)
+-----------------------+-----------------------+
|min_time |max_time |
+-----------------------+-----------------------+
|1970-01-01 01:00:02.345|1970-01-01 01:00:04.786|
|1970-01-01 01:00:23.857|1970-01-01 01:00:23.999|
|1970-01-01 01:00:02.325|1970-01-01 01:01:07.688|
|1970-01-01 01:00:34.235|1970-01-01 01:00:34.444|
|1970-01-01 01:00:34.235|1970-01-01 01:00:34.454|
+-----------------------+-----------------------+
scala> df.withColumn("diff", timestampDiff("min_time", "max_time")).show(false)
+-----------------------+-----------------------+-----+
|min_time |max_time |diff |
+-----------------------+-----------------------+-----+
|1970-01-01 01:00:02.345|1970-01-01 01:00:04.786|2441 |
|1970-01-01 01:00:23.857|1970-01-01 01:00:23.999|142 |
|1970-01-01 01:00:02.325|1970-01-01 01:01:07.688|65363|
|1970-01-01 01:00:34.235|1970-01-01 01:00:34.444|209 |
|1970-01-01 01:00:34.235|1970-01-01 01:00:34.454|219 |
+-----------------------+-----------------------+-----+
scala> df.select(timestampDiff("min_time", "max_time")).show(false)
+-----+
|diff |
+-----+
|2441 |
|142 |
|65363|
|209 |
|219 |
+-----+

Compare dates in dataframes

I have two dataframes in Scala:
df1 =
ID start_date_time
1 2016-10-12 11:55:23
2 2016-10-12 12:25:00
3 2016-10-12 16:20:00
and
df2 =
PK start_date
1 2016-10-12
2 2016-10-14
I need to add a new column to df1 that will have the value 1 if the following condition holds, and 0 if it fails:
ID == PK and start_date_time refers to the same year, month and day as start_date.
The result should be this one:
df1 =
ID start_date_time check
1 2016-10-12-11-55-23 1
2 2016-10-12-12-25-00 0
3 2016-10-12-16-20-00 0
How can I do it?
I assume that the logic should be something like this:
df1 = df.withColumn("check", define(df("ID"), df("start_date")))
val define = udf { (id: String, dateString: String) =>
  val formatter = new SimpleDateFormat("yyyy-MM-dd")
  val date = formatter.format(dateString)
  val checks = df2.filter(df2("PK") === ID).filter(df2("start_date_time") === date)
  if (checks.collect().length > 0) "1" else "0"
}
However, I have doubts regarding how to compare the dates, because df1 and df2 have differently formatted dates. How can I best implement this?
You can use Spark datetime functions to create date columns on both df1 and df2 and then do a left join of df1 with df2. An extra constant column check is created on df2 to indicate whether there is a match; after the left join, unmatched rows have a null check, which na.fill(0) turns into 0:
import org.apache.spark.sql.functions.{lit, to_date}
val df1_date = df1.withColumn("date", to_date(df1("start_date_time")))
val df2_date = (df2.withColumn("date", to_date(df2("start_date"))).
withColumn("check", lit(1)).
select($"PK".as("ID"), $"date", $"check"))
df1_date.join(df2_date, Seq("ID", "date"), "left").drop($"date").na.fill(0).show
+---+--------------------+-----+
| ID| start_date_time|check|
+---+--------------------+-----+
| 1|2016-10-12 11:55:...| 1|
| 2|2016-10-12 12:25:...| 0|
| 3|2016-10-12 16:20:...| 0|
+---+--------------------+-----+
I don't have the exact logic, but I would do something like this:
val df3 = df2.
  join(df1, df1("ID") === df2("PK")).
  filter($"start_date_time".isBefore($"start_date"))
You will need to convert the two timestamps to Joda time first; see: Converting a date string to a DateTime object using Joda Time library.
Good luck !
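If you would rather express the year/month/day comparison directly in the join condition, here is a sketch using only Spark built-ins (no Joda needed); the when/otherwise step plays the role of na.fill(0) in the first answer:
import org.apache.spark.sql.functions.{lit, to_date, when}

val result = df1.join(df2,
    df1("ID") === df2("PK") && to_date(df1("start_date_time")) === to_date(df2("start_date")),
    "left")
  .withColumn("check", when(df2("PK").isNotNull, lit(1)).otherwise(lit(0)))
  .select(df1("ID"), df1("start_date_time"), $"check")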

how to find week difference between two dates

I have a dataframe which has two date columns in unix time, and I want to find the week difference between these two columns. There is a weekofyear function in Spark SQL, but that is only useful when both dates fall in the same year. How can I find the week difference then?
p.s. I'm using Scala Spark.
You can take the approach of creating a custom UDF for this:
scala> val df=sc.parallelize(Seq((1480401142453L,1480399932853L))).toDF("date1","date2")
df: org.apache.spark.sql.DataFrame = [date1: bigint, date2: bigint]
scala> df.show
+-------------+-------------+
| date1| date2|
+-------------+-------------+
|1480401142453|1480399932853|
+-------------+-------------+
scala> val udfDateDifference = udf((date1: Long, date2: Long) => ((date1 - date2) / (60*60*24*7)).toInt)
udfDateDifference: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function2>,IntegerType,Some(List(LongType, LongType)))
scala> val resultDF = df.withColumn("dateDiffernece", udfDateDifference(df("date1"), df("date2")))
resultDF: org.apache.spark.sql.DataFrame = [date1: bigint, date2: bigint ... 1 more field]
scala> resultDF.show
+-------------+-------------+--------------+
| date1| date2|dateDiffernece|
+-------------+-------------+--------------+
|1480401142453|1480399932853| 2|
+-------------+-------------+--------------+
And hence you can get the difference!
As you have dates in unix time format, we can use this expression:
((date1-date2)/(60*60*24*7)).toInt
Edit:
Updating this answer with an example:
spark.udf.register("weekdiff", (from: Long, to: Long) => ((from - to) / 604800).toInt)
// 60*60*24*7 => 604800 seconds in a week

// A UDF registered this way is meant for SQL; from the DataFrame API, call it via callUDF:
import org.apache.spark.sql.functions.callUDF
df.withColumn("weekdiff", callUDF("weekdiff", df("date1_col_name"), df("date2_col_name")))
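Since the inputs are plain numeric columns, the same arithmetic can also be done with built-in column expressions, which avoids UDF serialization overhead entirely (a sketch; it assumes the values are unix time in seconds and uses the same hypothetical column names as above):
import org.apache.spark.sql.functions.col

// Integer week difference computed directly on the columns, no UDF involved
df.withColumn("weekdiff", ((col("date1_col_name") - col("date2_col_name")) / 604800).cast("int"))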