I have a UDF that creates a timestamp out of 2 field values, a date and a time. However, the time field holds a count of seconds.
How can I merge these 2 fields, of type date and seconds, into a single value of type Unix timestamp?
My current implementation looks like this:
private val unix_epoch = udf[Long, String, String] { (date, time) =>
  deltaDateFormatter.parseDateTime(s"$date $formatted").getSeconds
}

def transform(inputDf: DataFrame): Unit = {
  inputDf
    .withColumn("event_hour", unix_epoch($"event_date", $"event_time"))
    .withColumn("event_ts", from_unixtime($"event_hour").cast(TimestampType))
}
Input data:
event_date,event_time
20170501,87721
20170501,87728
20170501,87721
20170501,87726
Desired output:
event_tmstp, event_hour
2017-05-01 00:22:01,1493598121
2017-05-01 00:22:08,1493598128
2017-05-01 00:22:01,1493598121
2017-05-01 00:22:06,1493598126
Update: data schema:
event_date: string (nullable = true)
event_time: integer (nullable = true)
Cast event_date to a unix timestamp, add the event_time column to get event_hour, and convert back to normal timestamp event_tmstp.
PS: I'm not sure why event_time has 86400 seconds (1 day) too many; I needed to subtract that to get your expected output.
val df = Seq(
  ("20170501", 87721),
  ("20170501", 87728),
  ("20170501", 87721),
  ("20170501", 87726)
).toDF("event_date", "event_time")

val df2 = df.select(
  unix_timestamp(to_date($"event_date", "yyyyMMdd")) + $"event_time" - 86400
).toDF("event_hour").select(
  $"event_hour".cast("timestamp").as("event_tmstp"),
  $"event_hour"
)
df2.show
+-------------------+----------+
| event_tmstp|event_hour|
+-------------------+----------+
|2017-05-01 00:22:01|1493598121|
|2017-05-01 00:22:08|1493598128|
|2017-05-01 00:22:01|1493598121|
|2017-05-01 00:22:06|1493598126|
+-------------------+----------+
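For what it's worth, the 86400-second offset can be sanity-checked with plain arithmetic (assuming a UTC session time zone; 1493596800 is midnight 2017-05-01 UTC):

val midnightUtc = 1493596800L                    // 2017-05-01 00:00:00 UTC in epoch seconds
val eventHour   = midnightUtc + 87721L - 86400L  // add event_time, drop the extra day
assert(eventHour == 1493598121L)                 // 2017-05-01 00:22:01, matching the expected output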
Check the code below to see if this helps; it does it without a UDF.
val df = Seq(
  (20170501, 87721),
  (20170501, 87728),
  (20170501, 87721),
  (20170501, 87726)
).toDF("date", "time")

df
  .withColumn("date",
    to_date(
      unix_timestamp($"date".cast("string"), "yyyyMMdd").cast("timestamp")
    )
  )
  .withColumn("event_hour",
    unix_timestamp(
      concat_ws(" ", $"date", from_unixtime($"time", "HH:mm:ss.S")).cast("timestamp")
    )
  )
  .withColumn("event_ts", from_unixtime($"event_hour"))
  .show(false)
+----------+-----+----------+-------------------+
|date |time |event_hour|event_ts |
+----------+-----+----------+-------------------+
|2017-05-01|87721|1493598121|2017-05-01 00:22:01|
|2017-05-01|87728|1493598128|2017-05-01 00:22:08|
|2017-05-01|87721|1493598121|2017-05-01 00:22:01|
|2017-05-01|87726|1493598126|2017-05-01 00:22:06|
+----------+-----+----------+-------------------+
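If you are on Spark 3.0 or later, a variant of the same idea without string concatenation could use make_interval; time % 86400 reproduces the wrap-around that the HH:mm:ss formatting performs above. A sketch, not tested against your data:

import org.apache.spark.sql.functions._
import spark.implicits._

val result = df
  .withColumn("event_ts",
    expr("cast(to_date(cast(date as string), 'yyyyMMdd') as timestamp)" +
         " + make_interval(0, 0, 0, 0, 0, 0, time % 86400)"))  // years, months, weeks, days, hours, mins, secs
  .withColumn("event_hour", unix_timestamp($"event_ts"))

result.show(false)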
Could someone please guide me on how to convert a long to a timestamp with milliseconds?
I know how to get yyyy-MM-dd HH:mm:ss,
but I would like to keep the milliseconds: yyyy-MM-dd HH:mm:ss.SSS.
My parquet structure is like this
|-- header: struct (nullable = true)
| |-- time: long (nullable = true)
...
One sample for time is 1600676073054:
Scala
scala> spark.sql("select from_unixtime(word) as ts, word from tmp_1").show(false)
+--------------------+-------------+
|ts |word |
+--------------------+-------------+
|52693-05-28 18:30:54|1600676073054|
+--------------------+-------------+
scala> spark.sql("select from_unixtime(word/1000) as ts, word from tmp_1").show(false)
+-------------------+-------------+
|ts |word |
+-------------------+-------------+
|2020-09-21 16:14:33|1600676073054|
+-------------------+-------------+
scala> spark.sql("select from_unixtime(word) as ts, word from tmp_1").show(false)
+--------------------+-------------+
|ts |word |
+--------------------+-------------+
|52693-05-28 18:30:54|1600676073054|
+--------------------+-------------+
SQL Server
declare @StartDate datetime2(3) = '1970-01-01 00:00:00.000'
  , @milliseconds bigint = 1600676073054
  , @MillisecondsPerDay int = 60 * 60 * 24 * 1000 -- = 86400000
SELECT DATEADD(MILLISECOND, TRY_CAST(@milliseconds % @MillisecondsPerDay AS INT), DATEADD(DAY, TRY_CAST(@milliseconds / @MillisecondsPerDay AS INT), @StartDate));
--2020-09-21 08:14:33.054
I would like to know how to carry the 054 milliseconds through the conversion.
Thanks.
Spark's epoch functions expect seconds rather than milliseconds, so you need to divide the value by 1000; the fractional part then carries the milliseconds when you cast to timestamp.
import org.apache.spark.sql.functions._

val df = spark.createDataFrame(Seq(
  (1, "1600676073054")
)).toDF("id", "long_timestamp")

df.withColumn(
  "timestamp_mili",
  (col("long_timestamp") / 1000).cast("timestamp")
).show(false)
//+---+--------------+-----------------------+
//|id |long_timestamp|timestamp_mili |
//+---+--------------+-----------------------+
//|1 |1600676073054 |2020-09-21 08:14:33.054|
//+---+--------------+-----------------------+
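If you also need the value back as a string that keeps the milliseconds, date_format on the resulting timestamp column should do it; a small follow-up sketch on the same df:

import org.apache.spark.sql.functions._

df.withColumn("timestamp_mili", (col("long_timestamp") / 1000).cast("timestamp"))
  .withColumn("ts_string", date_format(col("timestamp_mili"), "yyyy-MM-dd HH:mm:ss.SSS"))
  .show(false)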
// loading the DF
val df1 = spark.read.option("header", true).option("inferSchema", true).csv("time.csv")

+-------------+
|    date_time|
+-------------+
|1545905416000|
+-------------+
When I use cast to change the column value to DateType, it shows an error: the datatype does not match (date_time: bigint) in df.
df1.withColumn("date_time", df1("date").cast(DateType)).show()
Any solution for solving it?
I tried doing
val a = df1.withColumn("date_time",df1("date").cast(StringType)).drop("date").toDF()
a.withColumn("fomatedDateTime",a("date_time").cast(DateType)).show()
but it does not work.
Welcome to StackOverflow!
You need to convert the timestamp from epoch format to date and then do the computation. You can try this:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.read.option("header", true).option("inferSchema", true).csv("time.csv")

val df1 = df
  .withColumn("dateCreated",
    date_format(
      to_date(substring(from_unixtime($"date_time".divide(1000)), 0, 10), "yyyy-MM-dd"),
      "dd-MM-yyyy"
    )
  )
  .withColumn("timeCreated",
    substring(from_unixtime($"date_time".divide(1000)), 11, 19)
  )
Sample data from my use case:
+---------+-------------+--------+-----------+-----------+
| adId| date_time| price|dateCreated|timeCreated|
+---------+-------------+--------+-----------+-----------+
|230010452|1469178808000| 5950.0| 22-07-2016| 14:43:28|
|230147621|1469456306000| 19490.0| 25-07-2016| 19:48:26|
|229662644|1468546792000| 12777.0| 15-07-2016| 07:09:52|
|229218611|1467815284000| 9996.0| 06-07-2016| 19:58:04|
|229105894|1467656022000| 7700.0| 04-07-2016| 23:43:42|
|230214681|1469559471000| 4600.0| 27-07-2016| 00:27:51|
|230158375|1469469248000| 999.0| 25-07-2016| 23:24:08|
+---------+-------------+--------+-----------+-----------+
You may need to adjust for the time zone: by default from_unixtime uses your session time zone (for me it's GMT+05:30). Hope it helps.
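As a side note, a shorter variant without the substring juggling might be to cast the epoch milliseconds to a timestamp once and format the parts directly; a sketch, with the same time-zone caveat:

import org.apache.spark.sql.functions._
import spark.implicits._

val df2 = df
  .withColumn("ts", ($"date_time" / 1000).cast("timestamp"))   // epoch millis -> timestamp
  .withColumn("dateCreated", date_format($"ts", "dd-MM-yyyy"))
  .withColumn("timeCreated", date_format($"ts", "HH:mm:ss"))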
I have a dataframe with a timestamp column (timestamp type) called "maxTmstmp" and another column of hours, represented as integers, called "WindowHours". I would like to dynamically subtract the integer column from the timestamp column to get a lower timestamp.
My data and desired effect ("minTmstmp" column):
+-----------+-------------------+-------------------+
|WindowHours| maxTmstmp| minTmstmp|
| | |(maxTmstmp - Hours)|
+-----------+-------------------+-------------------+
| 1|2016-01-01 23:00:00|2016-01-01 22:00:00|
| 2|2016-03-01 12:00:00|2016-03-01 10:00:00|
| 8|2016-03-05 20:00:00|2016-03-05 12:00:00|
| 24|2016-04-12 11:00:00|2016-04-11 11:00:00|
+-----------+-------------------+-------------------+
root
|-- WindowHours: integer (nullable = true)
|-- maxTmstmp: timestamp (nullable = true)
I have already found a solution with an hours interval expression, but it isn't dynamic. The code below doesn't work as intended.
standards
  .withColumn("minTmstmp", $"maxTmstmp" - expr("INTERVAL 10 HOURS"))
  .show()
I'm operating on Spark 2.4 and Scala.
One simple way would be to convert maxTmstmp to unix time, subtract the value of WindowHours in seconds from it, and convert the result back to Spark Timestamp, as shown below:
import java.sql.Timestamp
import org.apache.spark.sql.functions._
import spark.implicits._
val df = Seq(
  (1, Timestamp.valueOf("2016-01-01 23:00:00")),
  (2, Timestamp.valueOf("2016-03-01 12:00:00")),
  (8, Timestamp.valueOf("2016-03-05 20:00:00")),
  (24, Timestamp.valueOf("2016-04-12 11:00:00"))
).toDF("WindowHours", "maxTmstmp")

df.withColumn("minTmstmp",
  from_unixtime(unix_timestamp($"maxTmstmp") - ($"WindowHours" * 3600))
).show
// +-----------+-------------------+-------------------+
// |WindowHours| maxTmstmp| minTmstmp|
// +-----------+-------------------+-------------------+
// | 1|2016-01-01 23:00:00|2016-01-01 22:00:00|
// | 2|2016-03-01 12:00:00|2016-03-01 10:00:00|
// | 8|2016-03-05 20:00:00|2016-03-05 12:00:00|
// | 24|2016-04-12 11:00:00|2016-04-11 11:00:00|
// +-----------+-------------------+-------------------+
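If you later move to Spark 3.0+, make_interval keeps the arithmetic in the timestamp domain instead of going through epoch seconds; a sketch (it will not work on Spark 2.4):

import org.apache.spark.sql.functions.expr

val withMin = df.withColumn(
  "minTmstmp",
  expr("maxTmstmp - make_interval(0, 0, 0, 0, WindowHours, 0, 0)")  // subtract WindowHours hours
)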
I have a dataframe with content like below:
scala> patDF.show
+---------+-------+-----------+-------------+
|patientID| name|dateOtBirth|lastVisitDate|
+---------+-------+-----------+-------------+
| 1001|Ah Teck| 1991-12-31| 2012-01-20|
| 1002| Kumar| 2011-10-29| 2012-09-20|
| 1003| Ali| 2011-01-30| 2012-10-21|
+---------+-------+-----------+-------------+
All the columns are strings.
I want to get the list of records with lastVisitDate (format yyyy-mm-dd) falling between a given date and now, so here is the script:
patDF.registerTempTable("patients")
val results2 = sqlContext.sql("SELECT * FROM patients WHERE from_unixtime(unix_timestamp(lastVisitDate, 'yyyy-mm-dd')) between '2012-09-15' and current_timestamp() order by lastVisitDate")
results2.show()
It returns nothing; presumably there should be records with patientID 1002 and 1003.
So I modified the query to:
val results3 = sqlContext.sql("SELECT from_unixtime(unix_timestamp(lastVisitDate, 'yyyy-mm-dd')), * FROM patients")
results3.show()
Now I get:
+-------------------+---------+-------+-----------+-------------+
|                _c0|patientID|   name|dateOtBirth|lastVisitDate|
+-------------------+---------+-------+-----------+-------------+
|2012-01-20 00:01:00| 1001|Ah Teck| 1991-12-31| 2012-01-20|
|2012-01-20 00:09:00| 1002| Kumar| 2011-10-29| 2012-09-20|
|2012-01-21 00:10:00| 1003| Ali| 2011-01-30| 2012-10-21|
+-------------------+---------+-------+-----------+-------------+
If you look at the first column, you will see all the months were somehow changed to 01
What's wrong with the code?
The correct format for year-month-day should be yyyy-MM-dd. Lowercase mm means minutes, so with yyyy-mm-dd the month is never parsed (it defaults to January) and the middle field lands in the minutes, which is exactly what the first column above shows:
val patDF = Seq(
  (1001, "Ah Teck", "1991-12-31", "2012-01-20"),
  (1002, "Kumar", "2011-10-29", "2012-09-20"),
  (1003, "Ali", "2011-01-30", "2012-10-21")
).toDF("patientID", "name", "dateOtBirth", "lastVisitDate")
patDF.createOrReplaceTempView("patTable")
val result1 = spark.sqlContext.sql("""
select * from patTable where to_timestamp(lastVisitDate, 'yyyy-MM-dd')
between '2012-09-15' and current_timestamp() order by lastVisitDate
""")
result1.show
// +---------+-----+-----------+-------------+
// |patientID| name|dateOtBirth|lastVisitDate|
// +---------+-----+-----------+-------------+
// | 1002|Kumar| 2011-10-29| 2012-09-20|
// | 1003| Ali| 2011-01-30| 2012-10-21|
// +---------+-----+-----------+-------------+
You can also use the DataFrame API if you prefer:
val result2 = patDF.where(to_timestamp($"lastVisitDate", "yyyy-MM-dd").
between(to_timestamp(lit("2012-09-15"), "yyyy-MM-dd"), current_timestamp())
).orderBy($"lastVisitDate")
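Since only whole dates are involved here, to_date works just as well as to_timestamp; a sketch of the same filter:

import org.apache.spark.sql.functions._
import spark.implicits._

val result3 = patDF
  .where(to_date($"lastVisitDate", "yyyy-MM-dd")
    .between(to_date(lit("2012-09-15")), current_date()))
  .orderBy($"lastVisitDate")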
I have a String column in a Spark dataframe (with dates in yyyy-MM-dd pattern).
I want to display the column values in MM/dd/yyyy pattern.
My data is
val df = sc.parallelize(Array(
("steak", "1990-01-01", "2000-01-01", 150),
("steak", "2000-01-02", "2001-01-13", 180),
("fish", "1990-01-01", "2001-01-01", 100)
)).toDF("name", "startDate", "endDate", "price")
df.show()
+-----+----------+----------+-----+
| name| startDate| endDate|price|
+-----+----------+----------+-----+
|steak|1990-01-01|2000-01-01| 150|
|steak|2000-01-02|2001-01-13| 180|
| fish|1990-01-01|2001-01-01| 100|
+-----+----------+----------+-----+
root
|-- name: string (nullable = true)
|-- startDate: string (nullable = true)
|-- endDate: string (nullable = true)
|-- price: integer (nullable = false)
I want to show endDate in MM/dd/yyyy pattern. All I am able to do is convert the column to DateType from String
val df2 = df.select($"endDate".cast(DateType).alias("endDate"))
df2.show()
+----------+
| endDate|
+----------+
|2000-01-01|
|2001-01-13|
|2001-01-01|
+----------+
df2.printSchema()
root
|-- endDate: date (nullable = true)
I want to show endDate in MM/dd/yyyy pattern. The only reference I found is this, which doesn't solve the problem.
You can use the date_format function.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Array(
("steak", "1990-01-01", "2000-01-01", 150),
("steak", "2000-01-02", "2001-01-13", 180),
("fish", "1990-01-01", "2001-01-01", 100))).toDF("name", "startDate", "endDate", "price")
df.show()
df.select(date_format(col("endDate"), "MM/dd/yyyy")).show
Output:
+-------------------------------+
|date_format(endDate,MM/dd/yyyy)|
+-------------------------------+
| 01/01/2000|
| 01/13/2001|
| 01/01/2001|
+-------------------------------+
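If the auto-generated column name above is not wanted, an alias keeps the original name (same call as above, just aliased):

df.select(date_format(col("endDate"), "MM/dd/yyyy").alias("endDate")).show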
Use date_format(date, format) (documented as pyspark.sql.functions.date_format; the Scala version takes a Column):
val df2 = df.select(date_format(col("endDate"), "MM/dd/yyyy").alias("endDate"))
We have a Dataframe/Dataset with a string column holding a date value, and we need to change the date format.
For the question asked, the date format can be changed as below:
val df1 = df.withColumn("startDate1", date_format(to_date(col("startDate"),"yyyy-MM-dd"),"MM/dd/yyyy" ))
In Spark, the default date format is "yyyy-MM-dd", hence it can be rewritten as
val df1 = df.withColumn("startDate1", date_format(col("startDate"),"MM/dd/yyyy" ))
(i) By applying to_date, we change the datatype of this column from string to Date.
We also tell to_date that the format in this string column is yyyy-MM-dd, so the column is read accordingly.
(ii) Next, we apply date_format to produce the format we require, which is MM/dd/yyyy.
When a time component is involved, use to_timestamp instead of to_date.
Note that 'MM' represents the month and 'mm' represents minutes.
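To illustrate the last two notes, a minimal sketch (the eventStr column and its value are made up) of the to_timestamp route when a time component is present:

import org.apache.spark.sql.functions._
import spark.implicits._

val withTs = Seq("2000-01-01 13:45:30").toDF("eventStr")
  .withColumn("event_ts", to_timestamp(col("eventStr"), "yyyy-MM-dd HH:mm:ss"))  // HH = hour, mm = minutes
  .withColumn("formatted", date_format(col("event_ts"), "MM/dd/yyyy HH:mm"))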