Convert event time into date and time in PySpark?

I have an event_time column in my data frame.
I would like to convert event_time into a date/time. I used the code below, but the result doesn't come out properly:
import pyspark.sql.functions as f
df = df.withColumn("date", f.from_unixtime("Event_Time", "dd/MM/yyyy HH:MM:SS"))
df.show()
The output I am getting is not correct. Can anyone advise how to do this properly? I am new to PySpark.

It seems that your data is in microseconds (1/1,000,000 of a second), so you have to divide by 1,000,000 before passing it to from_unixtime, which expects seconds. (Note also that in your pattern, minutes and seconds are mm and ss; MM means months and SS fractional seconds.)
df = spark.createDataFrame(
    [
        ('1645904274665267',),
        ('1645973845823770',),
        ('1644134156697560',),
        ('1644722868485010',),
        ('1644805678702121',),
        ('1645071502180365',),
        ('1644220446396240',),
        ('1645736052650785',),
        ('1646006645296010',),
        ('1644544811297016',),
        ('1644614023559317',),
        ('1644291365608571',),
        ('1645643575551339',)
    ], ['Event_Time']
)
import pyspark.sql.functions as f
df = df.withColumn("date", f.from_unixtime(f.col("Event_Time") / 1000000))
df.show(truncate=False)
Output:
+----------------+-------------------+
|Event_Time |date |
+----------------+-------------------+
|1645904274665267|2022-02-26 20:37:54|
|1645973845823770|2022-02-27 15:57:25|
|1644134156697560|2022-02-06 08:55:56|
|1644722868485010|2022-02-13 04:27:48|
|1644805678702121|2022-02-14 03:27:58|
|1645071502180365|2022-02-17 05:18:22|
|1644220446396240|2022-02-07 08:54:06|
|1645736052650785|2022-02-24 21:54:12|
|1646006645296010|2022-02-28 01:04:05|
|1644544811297016|2022-02-11 03:00:11|
|1644614023559317|2022-02-11 22:13:43|
|1644291365608571|2022-02-08 04:36:05|
|1645643575551339|2022-02-23 20:12:55|
+----------------+-------------------+
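If you also want the dd/MM/yyyy layout the question aimed for, pass a corrected pattern as the second argument; a sketch on the same frame:
import pyspark.sql.functions as f

# same microsecond division, plus the intended display pattern
# (HH:mm:ss, not HH:MM:SS)
df = df.withColumn(
    "date",
    f.from_unixtime(f.col("Event_Time") / 1000000, "dd/MM/yyyy HH:mm:ss")
)
df.show(truncate=False)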

How to filter a PySpark dataframe to the last 14 days?

I have a date column in my dataframe, and I want to filter the dataframe down to the last 14 days using that column.
I tried the code below, but it's not working:
last_14 = df.filter((df('Date')> date_add(current_timestamp(), -14)).select("Event_Time","User_ID","Impressions","Clicks","URL", "Date")
Event_Time, User_ID, Impressions, Clicks, and URL are my other columns.
Can anyone advise how to do this?
from pyspark.sql import functions as F
df = spark.createDataFrame(
    [
        ('2022-03-10',),
        ('2022-03-09',),
        ('2022-03-08',),
        ('2022-02-02',),
        ('2022-02-01',)
    ], ['Date']
).withColumn('Date', F.to_date('Date', 'y-M-d'))
df \
    .filter(F.col('Date') > F.date_sub(F.current_date(), 14)) \
    .show()
+----------+
| Date|
+----------+
|2022-03-10|
|2022-03-09|
|2022-03-08|
+----------+
In your code it would be as below (note that df('Date') is Scala syntax; in PySpark use F.col, and your original line was also missing a closing parenthesis):
last_14 = df.filter((F.col('Date') > F.date_sub(F.current_date(), 14))).select("Event_Time","User_ID","Impressions","Clicks","URL", "Date")
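If your frame only has the epoch-microsecond Event_Time from the earlier question, here is a sketch (assuming that column) that derives Date first and then applies the same filter:
from pyspark.sql import functions as F

# assumption: Event_Time holds epoch microseconds, as in the first question
df = df.withColumn('Date', F.to_date(F.from_unixtime(F.col('Event_Time') / 1000000)))
last_14 = df.filter(F.col('Date') > F.date_sub(F.current_date(), 14))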

Convert string (with timestamp) to timestamp in PySpark

I have a dataframe with a string datetime column.
I am converting it to a timestamp, but the values are changing.
Below is my code; can anyone help me convert it without the values changing?
from pyspark.sql.functions import to_timestamp

df = spark.createDataFrame(
    data=[("1", "2020-04-06 15:06:16 +00:00")],
    schema=["id", "input_timestamp"])
df.printSchema()

# timestamp string to TimestampType
df = df.withColumn("timestamp", to_timestamp("input_timestamp"))

# cast the TimestampType back to a string
df.withColumn('timestamp_string',
              to_timestamp('timestamp').cast('string')) \
    .show(truncate=False)
This is the output:
+---+--------------------------+-------------------+-------------------+
|id |input_timestamp |timestamp |timestamp_string |
+---+--------------------------+-------------------+-------------------+
|1 |2020-04-06 15:06:16 +00:00|2020-04-06 08:06:16|2020-04-06 08:06:16|
+---+--------------------------+-------------------+-------------------+
I want to know why the hour is changing from 15 to 8, and how I can prevent it.
I believe to_timestamp is converting the timestamp value to your local time, since you have +00:00 in your data.
Try passing the format to the to_timestamp() function.
Example:
from pyspark.sql.functions import to_timestamp, col
df.withColumn("timestamp", to_timestamp(col("input_timestamp"), "yyyy-MM-dd HH:mm:ss +00:00")).show(10, False)
#+---+--------------------------+-------------------+
#|id |input_timestamp |timestamp |
#+---+--------------------------+-------------------+
#|1 |2020-04-06 15:06:16 +00:00|2020-04-06 15:06:16|
#+---+--------------------------+-------------------+
from pyspark.sql.functions import to_utc_timestamp

df = spark.createDataFrame(
    data=[('1', '2020-04-06 15:06:16 +00:00')],
    schema=['id', 'input_timestamp'])
df.printSchema()

df = df.withColumn('timestamp',
                   to_utc_timestamp('input_timestamp', your_local_timezone))
df.withColumn('timestamp_string', df.timestamp.cast('string')).show(truncate=False)
Replace your_local_timezone with your actual time zone (e.g. 'America/New_York').
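Alternatively, since the data is marked +00:00, you can pin the session time zone so values are displayed in UTC rather than shifted into your local zone. A sketch using the standard spark.sql.session.timeZone config:
from pyspark.sql.functions import to_timestamp

# display timestamps in UTC instead of the driver's local zone
spark.conf.set('spark.sql.session.timeZone', 'UTC')
df.withColumn('timestamp',
              to_timestamp('input_timestamp', 'yyyy-MM-dd HH:mm:ss +00:00')) \
  .show(truncate=False)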

How can I split a timestamp into date and time?

// loading DF
val df1 = spark.read.option("header", true).option("inferSchema", true).csv("time.csv")
// df1 contents:
+-------------+
|    date_time|
+-------------+
|1545905416000|
+-------------+
When I use cast to change the column value to DateType, it shows an error because the datatype does not match (date_time is bigint in the df):
df1.withColumn("date_time", df1("date").cast(DateType)).show()
Any solution for solving it?
I tried:
val a = df1.withColumn("date_time", df1("date").cast(StringType)).drop("date").toDF()
a.withColumn("fomatedDateTime", a("date_time").cast(DateType)).show()
but it does not work.
Welcome to StackOverflow!
You need to convert the timestamp from epoch milliseconds to a date and then format it. You can try this:
import spark.implicits._
import org.apache.spark.sql.functions._

val df = spark.read.option("header", true).option("inferSchema", true).csv("time.csv")
val df1 = df
  .withColumn(
    "dateCreated",
    date_format(
      to_date(
        substring(from_unixtime($"date_time".divide(1000)), 1, 10),
        "yyyy-MM-dd"
      ),
      "dd-MM-yyyy"
    )
  )
  .withColumn(
    "timeCreated",
    substring(from_unixtime($"date_time".divide(1000)), 12, 8)
  )
Sample output from my use case:
+---------+-------------+--------+-----------+-----------+
| adId| date_time| price|dateCreated|timeCreated|
+---------+-------------+--------+-----------+-----------+
|230010452|1469178808000| 5950.0| 22-07-2016| 14:43:28|
|230147621|1469456306000| 19490.0| 25-07-2016| 19:48:26|
|229662644|1468546792000| 12777.0| 15-07-2016| 07:09:52|
|229218611|1467815284000| 9996.0| 06-07-2016| 19:58:04|
|229105894|1467656022000| 7700.0| 04-07-2016| 23:43:42|
|230214681|1469559471000| 4600.0| 27-07-2016| 00:27:51|
|230158375|1469469248000| 999.0| 25-07-2016| 23:24:08|
+---------+-------------+--------+-----------+-----------+
You may need to adjust the time for your time zone; by default it uses the session time zone (for me it's GMT+05:30). Hope it helps.
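For readers on the Python API, a PySpark sketch of the same split that avoids the substring arithmetic (assuming the same epoch-millisecond date_time column):
from pyspark.sql import functions as F

# epoch milliseconds -> timestamp string, then format each part separately
df1 = (df
       .withColumn('ts', F.from_unixtime(F.col('date_time') / 1000))
       .withColumn('dateCreated', F.date_format('ts', 'dd-MM-yyyy'))
       .withColumn('timeCreated', F.date_format('ts', 'HH:mm:ss'))
       .drop('ts'))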

Change column value in a dataframe (Spark Scala)

This is how my dataframe looks at the moment:
+------------+
| DATE |
+------------+
| 19931001|
| 19930404|
| 19930603|
| 19930805|
+------------+
I am trying to reformat this string value to yyyy-mm-dd hh:mm:ss.fff and keep it as a string, not a date type or timestamp.
How would I do that using the withColumn method?
Here is a solution using a UDF and withColumn; I have assumed that you have a string date field in the Dataframe:
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.udf
import spark.implicits._

// UDF: reformat a yyyyMMdd string as "yyyy-MM-dd HH:mm:ss"
val dateToTimeStamp = udf((date: String) => {
  val stringDate = date.substring(0, 4) + "/" + date.substring(4, 6) + "/" + date.substring(6, 8)
  val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  format.format(new SimpleDateFormat("yyyy/MM/dd").parse(stringDate))
})

// Create the dfList dataframe and apply the UDF
val dfList = spark.sparkContext
  .parallelize(Seq("19931001", "19930404", "19930603", "19930805")).toDF("DATE")
dfList.withColumn("DATE", dateToTimeStamp($"DATE")).show()
withClumn("date",
from_unixtime(unix_timestamp($"date", "yyyyMMdd"), "yyyy-MM-dd hh:mm:ss.fff") as "date")
this should work.
Another notice is the that mm gives minutes and MM gives months, hope this help you.
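The same non-UDF one-liner in PySpark, for comparison (a sketch; as above, SSS stands in for the fff in the question):
from pyspark.sql import functions as F

df = df.withColumn(
    'DATE',
    F.from_unixtime(F.unix_timestamp('DATE', 'yyyyMMdd'), 'yyyy-MM-dd HH:mm:ss.SSS')
)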
First, I created this DF:
val df = sc.parallelize(Seq("19931001","19930404","19930603","19930805")).toDF("DATE")
For date handling we are going to use the joda-time library (don't forget to add the joda-time jar to the classpath):
import org.joda.time.format.DateTimeFormat

def func(s: String): String = {
  // months are MM; mm would be minutes in joda-time as well
  val dateFormat = DateTimeFormat.forPattern("yyyyMMdd")
  val resultDate = dateFormat.parseDateTime(s)
  resultDate.toString()
}
Finally, apply the function to the dataframe:
val temp = df.map(l => func(l.get(0).toString()))
val df2 = temp.toDF("DATE")
df2.show()
This answer may still need some work, as I am new to Spark myself, but I think it gets the job done!

Reading a full timestamp into a dataframe

I am trying to learn Spark and I am reading a dataframe with a timestamp column using the unix_timestamp function as below:
val columnName = "TIMESTAMPCOL"
val sequence = Seq("2016-01-20 12:05:06.999")
val dataframe = sequence.toDF(columnName)
val typeDataframe = dataframe.withColumn(columnName, org.apache.spark.sql.functions.unix_timestamp($"TIMESTAMPCOL"))
typeDataframe.show
This produces an output:
+------------+
|TIMESTAMPCOL|
+------------+
| 1453320306|
+------------+
How can I read it so that I don't lose the ms, i.e. the .999 part? I tried using unix_timestamp(col: Col, s: String) where s is the SimpleDateFormat, e.g. "yyyy-MM-dd hh:mm:ss", without any luck.
To retain the milliseconds, use the "yyyy-MM-dd HH:mm:ss.SSS" format. You can use date_format as below (note that it returns a formatted string; unix_timestamp itself only has second precision, which is why the .999 was lost):
val typeDataframe = dataframe.withColumn(columnName, org.apache.spark.sql.functions.date_format($"TIMESTAMPCOL","yyyy-MM-dd HH:mm:ss.SSS"))
typeDataframe.show
This will give you
+-----------------------+
|TIMESTAMPCOL |
+-----------------------+
|2016-01-20 12:05:06.999|
+-----------------------+
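If you want an actual TimestampType that still carries the .999 rather than a formatted string, to_timestamp with an explicit pattern preserves the fraction. A PySpark sketch, assuming the same data loaded into a PySpark dataframe named dataframe:
from pyspark.sql import functions as F

# parse the string into a TimestampType, keeping milliseconds
typed = dataframe.withColumn(
    'TIMESTAMPCOL',
    F.to_timestamp(F.col('TIMESTAMPCOL'), 'yyyy-MM-dd HH:mm:ss.SSS')
)
typed.show(truncate=False)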