Timestamp comparison in spark-scala dataframe - scala

I have a field in a Spark dataframe of type string, and its value is in the format 2019-07-08 00:00. I have to perform a condition on the field like
df.filter(myfield > 2019-07-08 00:00)

Standard comparison operators for String work here, because the yyyy-MM-dd HH:mm format sorts lexicographically in chronological order (most significant unit first):
import org.apache.spark.sql.functions.col
import spark.implicits._  // assumes a SparkSession named `spark`, as in spark-shell

val df = Seq(
  (1, "2019-07-06 16:00"),
  (2, "2019-07-08 09:00"),
  (3, "2019-07-11 06:30")
).toDF("id", "date")
df.filter(col("date") > "2019-07-08 00:00").show
// +---+----------------+
// | id| date|
// +---+----------------+
// | 2|2019-07-08 09:00|
// | 3|2019-07-11 06:30|
// +---+----------------+
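If you would rather compare real timestamps than rely on string ordering, here is a minimal sketch using to_timestamp (assuming Spark 2.2+, where to_timestamp accepts a format string):
import org.apache.spark.sql.functions.{lit, to_timestamp}

// Parse both sides into TimestampType before comparing; rows that fail to parse
// become null and are dropped by the filter.
df.filter(
  to_timestamp(col("date"), "yyyy-MM-dd HH:mm") >
    to_timestamp(lit("2019-07-08 00:00"), "yyyy-MM-dd HH:mm")
).show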

Related

Spark dataframe filter a timestamp by just the date part

How can I filter a Spark dataframe that has a column of type timestamp by just the date part? I tried the below, but it only matches if the time is 00:00:00.
Basically I want the filter to match all rows with date 2020-01-01 (3 rows).
import java.sql.Timestamp
val df = Seq(
(1, Timestamp.valueOf("2020-01-01 23:00:01")),
(2, Timestamp.valueOf("2020-01-01 00:00:00")),
(3, Timestamp.valueOf("2020-01-01 12:54:00")),
(4, Timestamp.valueOf("2019-12-15 09:54:00")),
(5, Timestamp.valueOf("2019-12-09 10:12:43"))
).toDF("someCol","someTimeStamp")
df.filter(df("someTimeStamp") === "2020-01-01").show
+-------+-------------------+
|someCol| someTimeStamp|
+-------+-------------------+
| 2|2020-01-01 00:00:00| // ONLY MATCHED with time 00:00
+-------+-------------------+
Use the to_date function to extract the date from the timestamp:
scala> df.filter(to_date(df("someTimeStamp")) === "2020-01-01").show
+-------+-------------------+
|someCol| someTimeStamp|
+-------+-------------------+
| 1|2020-01-01 23:00:01|
| 2|2020-01-01 00:00:00|
| 3|2020-01-01 12:54:00|
+-------+-------------------+

How to check that a value is a Unix timestamp in Scala?

In the DataFrame df I have a column datetime that contains timestamp values. The problem is that in some rows these are Unix timestamps, while in other rows they are in yyyyMMddHHmm format.
How can I verify whether each given value is a Unix timestamp and, if it's not, convert it into a timestamp?
df.withColumn("timestamp", unix_timestamp(col("datetime")))
I assume that when...otherwise should be used, but how do I check that a value is a Unix timestamp?
You can use when/otherwise along with the date parsing functions. Here is some example code (PySpark). I differentiated using just the length of the string, but you could also check the result of parsing them; a Scala sketch of that parse-based variant follows the example output.
from pyspark.sql.functions import *
data = [
('201001021011',),
('201101021011',),
('1539721852',),
('1539721853',)
]
df = sc.parallelize(data).toDF(['date'])
df2 = df.withColumn('date',
when(length('date') != 12, from_unixtime('date', 'yyyyMMddHHmm')) \
.otherwise(col('date'))
)
df3 = df2.withColumn('date', to_timestamp('date', 'yyyyMMddHHmm'))
df3.show()
Outputs this:
+-------------------+
| date|
+-------------------+
|2010-01-02 10:11:00|
|2011-01-02 10:11:00|
|2018-10-16 16:30:00|
|2018-10-16 16:30:00|
+-------------------+
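A Scala rendering of the "check the result of parsing" variant mentioned above (a sketch; it assumes Spark 2.2+, a string column date as in the example, and that the parser returns null for values that do not match yyyyMMddHHmm):
import org.apache.spark.sql.functions.{coalesce, col, from_unixtime, to_timestamp}

// Try the fixed-length yyyyMMddHHmm parse first; where it yields null, fall back to
// interpreting the value as Unix seconds via from_unixtime.
val parsed = df.withColumn("date",
  coalesce(
    to_timestamp(col("date"), "yyyyMMddHHmm"),
    to_timestamp(from_unixtime(col("date")))
  )
)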
If column datetime consists of only Unix-timestamp strings or "yyyyMMddHHmm"-formatted strings, you can differentiate the two formats based on their length, since the former has 10 digits or fewer whereas the latter is a fixed 12:
val df = Seq(
(1, "1538384400"),
(2, "1538481600"),
(3, "201809281800"),
(4, "1538548200"),
(5, "201809291530")
).toDF("id", "datetime")
df.withColumn("timestamp",
when(length($"datetime") === 12, unix_timestamp($"datetime", "yyyyMMddHHmm")).
otherwise($"datetime")
)
// +---+------------+----------+
// | id| datetime| timestamp|
// +---+------------+----------+
// | 1| 1538384400|1538384400|
// | 2| 1538481600|1538481600|
// | 3|201809281800|1538182800|
// | 4| 1538548200|1538548200|
// | 5|201809291530|1538260200|
// +---+------------+----------+
In case there are other string formats in column datetime, you can narrow down the condition for a Unix timestamp to a range matching the date-time range of your dataset. For example, a Unix timestamp is a 10-digit number for any date after 2001-09-09 (and for the next 250+ years), and for dates up to now it starts with 10 through 15:
df.withColumn("timestamp",
when(length($"datetime") === 12, unix_timestamp($"datetime", "yyyyMMddHHmm")).
otherwise(when(regexp_extract($"datetime", "^(1[0-5]\\d{8})$", 1) === $"datetime", $"datetime").
otherwise(null) // Or, additional conditions for other cases
))

How to add days (as values of a column) to date?

I have a problem with adding days (numbers) to date-format columns in Spark. I know that there is a function date_add that takes two arguments - a date column and an integer:
date_add(date startdate, tinyint/smallint/int days)
I'd like to pass a column of type integer instead (not a literal integer).
Say I have the following dataframe:
val data = Seq(
(0, "2016-01-1"),
(1, "2016-02-2"),
(2, "2016-03-22"),
(3, "2016-04-25"),
(4, "2016-05-21"),
(5, "2016-06-1"),
(6, "2016-03-21"))
).toDF("id", "date")
I can simply add integers to dates:
val date_add_fun =
data.select(
$"id",
$"date",
date_add($"date", 1)
)
But I cannot use a column expression that contains the values:
val date_add_fun =
data.select(
$"id",
$"date",
date_add($"date", $"id")
)
It gives error:
<console>:60: error: type mismatch;
found : org.apache.spark.sql.ColumnName
required: Int
date_add($"date", $"id")
Does anyone know if it is possible to use a column in the date_add function? Or what is the workaround?
You can use expr:
import org.apache.spark.sql.functions.expr
data.withColumn("future", expr("date_add(date, id)")).show
// +---+----------+----------+
// | id| date| future|
// +---+----------+----------+
// | 0| 2016-01-1|2016-01-01|
// | 1| 2016-02-2|2016-02-03|
// | 2|2016-03-22|2016-03-24|
// | 3|2016-04-25|2016-04-28|
// | 4|2016-05-21|2016-05-25|
// | 5| 2016-06-1|2016-06-06|
// | 6|2016-03-21|2016-03-27|
// +---+----------+----------+
selectExpr could be used in a similar way:
data.selectExpr("*", "date_add(date, id) as future").show
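Also note that newer Spark releases (3.0+, if I recall correctly; worth verifying against your version) add a date_add overload that accepts a Column for the days argument, so the call from the question works directly:
// Spark 3.0+ only (assumption): date_add(Column, Column) overload
// requires import org.apache.spark.sql.functions.date_add
data.select($"id", $"date", date_add($"date", $"id")).show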
The other answers work but aren't a drop-in replacement for the existing date_add function.
I had a case where expr wouldn't work for me, so here is a drop-in replacement:
import org.apache.spark.sql.Column
import org.apache.spark.sql.catalyst.expressions.DateAdd

def date_add(date: Column, days: Column) = {
  new Column(DateAdd(date.expr, days.expr))
}
Basically, all the machinery is there in Spark to do this already; the function signature for date_add just forces the days argument to be a literal.
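With the custom date_add above in scope, usage mirrors the built-in (a sketch against the data frame from the question):
data.withColumn("future", date_add($"date", $"id")).show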
You can use a SQL expression:
data.createOrReplaceTempView("table")
sqlContext.sql("select id, date, date_add(`date`, `id`) as added_date from table").show(false)
which would give you
+---+----------+----------+
|id |date |added_date|
+---+----------+----------+
|0 |2016-01-1 |2016-01-01|
|1 |2016-02-2 |2016-02-03|
|2 |2016-03-22|2016-03-24|
|3 |2016-04-25|2016-04-28|
|4 |2016-05-21|2016-05-25|
|5 |2016-06-1 |2016-06-06|
|6 |2016-03-21|2016-03-27|
+---+----------+----------+
For the Python developers who are here, you can simply add the two columns together using +:
import pyspark.sql.functions as F
new_df = df.withColumn("new_date", F.col("date") + F.col("offset"))
Just make sure that the offset column is int/smallint/tinyint.

Spark dataframe change column value to timestamp [duplicate]

This question already has answers here:
Apache Spark subtract days from timestamp column
I have a jsonl file I've read in, created a temporary view from, and filtered down to the records that I want to amend.
val df = session.read.json("tiny.jsonl")
df.createOrReplaceTempView("tempTable")
val filter = df.select("*").where("field IS NOT NULL")
Now I am at the part where I have been trying various things. I want to replace a column called "time" with the current timestamp before I write it back. Sometimes I will want that timestamp to be, for example, now minus 5 days.
val change = test.withColumn("server_time", date_add(current_timestamp(), -1))
The example above gives me back a date that is one day before today, rather than a timestamp.
Edit:
Sample Dataframe that mocks out my jsonl input:
val df = Seq(
(1, "fn", "2018-02-18T22:18:28.645Z"),
(2, "fu", "2018-02-18T22:18:28.645Z"),
(3, null, "2018-02-18T22:18:28.645Z")
).toDF("id", "field", "time")
Expected output:
+---+------+-------------------------+
| id|field |time |
+---+------+-------------------------+
| 1| fn | 2018-04-09T22:18:28.645Z|
| 2| fu | 2018-04-09T22:18:28.645Z|
+---+------+-------------------------+
If you want to replace the current time column with the current timestamp, you can use the current_timestamp function. To add a number of days you can use a SQL INTERVAL:
val df = Seq(
(1, "fn", "2018-02-18T22:18:28.645Z"),
(2, "fu", "2018-02-18T22:18:28.645Z"),
(3, null, "2018-02-18T22:18:28.645Z")
).toDF("id", "field", "time")
.na.drop()
val ddf = df
.withColumn("time", current_timestamp())
.withColumn("newTime", $"time" + expr("INTERVAL 5 DAYS"))
Output:
+---+-----+-----------------------+-----------------------+
|id |field|time |newTime |
+---+-----+-----------------------+-----------------------+
|1 |fn |2018-04-10 15:14:27.501|2018-04-15 15:14:27.501|
|2 |fu |2018-04-10 15:14:27.501|2018-04-15 15:14:27.501|
+---+-----+-----------------------+-----------------------+
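For the "now minus 5 days" case mentioned in the question, the same INTERVAL expression can simply be subtracted (a sketch under the same setup as above):
val back = ddf.withColumn("fiveDaysAgo", $"time" - expr("INTERVAL 5 DAYS"))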

How to calculate difference between date column and current date?

I am trying to calculate the date difference between a column field and the current date of the system.
Here is my sample code, where I have hard-coded my column field with 20170126.
val currentDate = java.time.LocalDate.now
var datediff = spark.sqlContext.sql("""Select datediff(to_date('$currentDate'),to_date(DATE_FORMAT(CAST(unix_timestamp( cast('20170126' as String), 'yyyyMMdd') AS TIMESTAMP), 'yyyy-MM-dd'))) AS GAP
""")
datediff.show()
Output is like:
+----+
| GAP|
+----+
|null|
+----+
I need to calculate the actual gap between the two dates but I am getting NULL.
You have not defined the type and format of "column field" so I assume it's a string in the (not-very-pleasant) format yyyyMMdd. As a side note, the NULL in your attempt is most likely because $currentDate sits inside a triple-quoted string without the s interpolator, so to_date receives the literal text '$currentDate' and returns null.
val records = Seq((0, "20170126")).toDF("id", "date")
scala> records.show
+---+--------+
| id| date|
+---+--------+
| 0|20170126|
+---+--------+
scala> records
.withColumn("year", substring($"date", 0, 4))
.withColumn("month", substring($"date", 5, 2))
.withColumn("day", substring($"date", 7, 2))
.withColumn("d", concat_ws("-", $"year", $"month", $"day"))
.select($"id", $"d" cast "date")
.withColumn("datediff", datediff(current_date(), $"d"))
.show
+---+----------+--------+
| id| d|datediff|
+---+----------+--------+
| 0|2017-01-26| 83|
+---+----------+--------+
PROTIP: Read up on the functions object.
Caveats
cast
Please note that I could not convince Spark SQL to cast the column "date" to DateType given the rules in DateTimeUtils.stringToDate:
yyyy
yyyy-[m]m
yyyy-[m]m-[d]d
yyyy-[m]m-[d]d  (with a trailing space)
yyyy-[m]m-[d]d *
yyyy-[m]m-[d]dT*
date_format
I could not convince date_format to work either, so I parsed the "date" column myself using the substring and concat_ws functions.
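As an aside, on Spark 2.2+ (an assumption about the environment) to_date accepts a format string directly, which avoids the substring/concat_ws workaround:
import org.apache.spark.sql.functions.{current_date, datediff, to_date}

records
  .withColumn("d", to_date($"date", "yyyyMMdd"))
  .withColumn("datediff", datediff(current_date(), $"d"))
  .show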