Extracting day, week, hour, date, year in PySpark from a string column

I am trying to extract day, week, hour, date, and year in PySpark; however, after using dayofweek it shows null as output.
The DF is something like this:
Mailed Date
Wed, 09/29/10 03:52 PM
Tue, 09/21/10 11:51 PM
Tue, 09/21/10 11:51 PM
Tue, 09/21/10 11:51 PM
I am trying to create separate columns named day, day of week, month, year, and hour of day;
however, after using from pyspark.sql.functions import year, month, dayofweek, the Day output column shows null.
Code I have used:
df01 = emaildf.withColumn('Day', dayofweek('Mailed_Date')).show(5)
Converted into timestamp:
df01 = vdf.withColumn("Mailed_Date",col("Mailed_Date").cast("Timestamp"))
Output: the Day column shows only null values.

Since the string datetime provided is not in the default format, you'd have to convert it to a proper timestamp using to_timestamp(). Also, you'll need to set the timeParserPolicy to LEGACY if you're parsing on Spark 3.0+, due to the presence of the day-of-week text (EEE) in the string.
from pyspark.sql import functions as func

spark.conf.set('spark.sql.legacy.timeParserPolicy', 'LEGACY')  # if spark 3.0+

ts_sdf = spark.sparkContext.parallelize([('Wed, 09/29/10 03:52 PM',)]).toDF(['ts_str']). \
    withColumn('ts', func.to_timestamp('ts_str', 'EEE, MM/dd/yy hh:mm a')). \
    withColumn('year', func.year('ts')). \
    withColumn('month', func.month('ts')). \
    withColumn('dayofweek', func.dayofweek('ts')). \
    withColumn('hour', func.hour('ts')). \
    withColumn('minute', func.minute('ts'))

ts_sdf.show(truncate=False)
# +----------------------+-------------------+----+-----+---------+----+------+
# |ts_str                |ts                 |year|month|dayofweek|hour|minute|
# +----------------------+-------------------+----+-----+---------+----+------+
# |Wed, 09/29/10 03:52 PM|2010-09-29 15:52:00|2010|9    |4        |15  |52    |
# +----------------------+-------------------+----+-----+---------+----+------+
ts_sdf.printSchema()
# root
# |-- ts_str: string (nullable = true)
# |-- ts: timestamp (nullable = true)
# |-- year: integer (nullable = true)
# |-- month: integer (nullable = true)
# |-- dayofweek: integer (nullable = true)
# |-- hour: integer (nullable = true)
# |-- minute: integer (nullable = true)
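For completeness, here is the same parsing applied back to the question's own data — a minimal sketch that assumes the column is named Mailed_Date (as in the question's code) and uses the sample values shown above:

from pyspark.sql import SparkSession, functions as func

spark = SparkSession.builder.getOrCreate()
spark.conf.set('spark.sql.legacy.timeParserPolicy', 'LEGACY')  # if spark 3.0+

# sample rows from the question; the column name 'Mailed_Date' is assumed
emaildf = spark.createDataFrame(
    [('Wed, 09/29/10 03:52 PM',), ('Tue, 09/21/10 11:51 PM',)],
    ['Mailed_Date'])

df01 = emaildf. \
    withColumn('ts', func.to_timestamp('Mailed_Date', 'EEE, MM/dd/yy hh:mm a')). \
    withColumn('Day', func.dayofweek('ts')). \
    withColumn('Month', func.month('ts')). \
    withColumn('Year', func.year('ts')). \
    withColumn('Hour', func.hour('ts'))

df01.show(truncate=False)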

Related

pyspark - why does converting a string (time) to timestamp return None?

I have the below sample pyspark dataframe, and I want to extract the time from the message column and then convert the extracted time to timestamp type.
message,class
"2022-10-28 07:46:59,705 one=1 Two=2 Three=3",classA
"2022-10-27 10:03:59,800 four=4 Five=5 Six=6",classB
I tried the below 2 ways, but neither of them works.
way 1:
sparkDF.withColumn("TIMESTAMP", to_timestamp(regexp_extract(col('message'), r'(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2},\d+)', 1), "MM-dd-yyyy HH:mm:ss.SSSS"))
way 2:
sparkDF.withColumn("TIMESTAMP", regexp_extract(col('message'), r'(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2},\d+)', 1).cast("timestamp"))
Can anyone please help take a look?
You could split the message field by " " (space), keep the first two elements (the date and the time), and join them back together; the joined value can be converted to a timestamp easily.
See the example:
data_sdf. \
    withColumn('ts', func.to_timestamp(func.concat_ws(' ', func.split('message', ' ')[0], func.split('message', ' ')[1]), 'yyyy-MM-dd HH:mm:ss,SSS')). \
    show(truncate=False)
# +-------------------------------------------+------+-----------------------+
# |message                                    |class |ts                     |
# +-------------------------------------------+------+-----------------------+
# |2022-10-28 07:46:59,705 one=1 Two=2 Three=3|classA|2022-10-28 07:46:59.705|
# |2022-10-27 10:03:59,800 four=4 Five=5 Six=6|classB|2022-10-27 10:03:59.8  |
# +-------------------------------------------+------+-----------------------+
# root
# |-- message: string (nullable = true)
# |-- class: string (nullable = true)
# |-- ts: timestamp (nullable = true)
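The regexp_extract approach from the question also works once the timestamp format string matches the extracted text (the original attempt used "MM-dd-yyyy HH:mm:ss.SSSS", which does not match a "yyyy-MM-dd HH:mm:ss,SSS" string). A minimal sketch, assuming the same sample data:

from pyspark.sql import SparkSession
from pyspark.sql import functions as func

spark = SparkSession.builder.getOrCreate()

sparkDF = spark.createDataFrame(
    [('2022-10-28 07:46:59,705 one=1 Two=2 Three=3', 'classA'),
     ('2022-10-27 10:03:59,800 four=4 Five=5 Six=6', 'classB')],
    ['message', 'class'])

# extract the leading timestamp text, then parse it with a matching format
sparkDF.withColumn(
    'TIMESTAMP',
    func.to_timestamp(
        func.regexp_extract('message', r'(\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2},\d+)', 1),
        'yyyy-MM-dd HH:mm:ss,SSS')
).show(truncate=False)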

Convert unix_timestamp column to string using pyspark

I have a column which represents a unix_timestamp and I want to convert it into a string with this format, 'yyyy-MM-dd HH:mm:ss.SSS'.
unix_timestamp | time_string
1578569683753 | 2020-01-09 11:34:43.753
1578569581793 | 2020-01-09 11:33:01.793
1578569581993 | 2020-01-09 11:33:01.993
Is there any builtin function for this, or how would it work? Thanks.
df1 = df1.withColumn('utc_stamp', F.from_unixtime('Timestamp', format="YYYY-MM-dd HH:mm:ss"))
df1.show(truncate=False)
from_unixtime converts only with second precision; for the milliseconds I just have to concat them from the original column onto the new column.
unix_timestamp only supports second precision. Looking at your values, the precision is milliseconds; the last 3 digits are the milliseconds.
from pyspark.sql.functions import substring,unix_timestamp,col,to_timestamp,concat,lit,from_unixtime
df = spark.createDataFrame([('1578569683753',), ('1578569581793',),('1578569581993',)], ['TMS'])
df.show(3,False)
df.printSchema()
Result
+-------------+
|TMS          |
+-------------+
|1578569683753|
|1578569581793|
|1578569581993|
+-------------+
root
|-- TMS: string (nullable = true)
Convert to the human-readable timestamp format
# note: the seconds pattern is "ss"; an extra "s" ("sss") would pad the seconds to three digits (e.g. 11:34:043)
df1 = (df
       .select("TMS"
               ,from_unixtime(substring(col("TMS"),1,10), format="yyyy-MM-dd HH:mm:ss").alias("TMS_WITHOUT_MILLISECONDS")
               ,(substring("TMS",11,3)).alias("MILLISECONDS")
               ,(concat(from_unixtime(substring(col("TMS"),1,10), format="yyyy-MM-dd HH:mm:ss"),lit('.'), substring(df.TMS,11,3))).alias("TMS_StringType")
               ,to_timestamp(concat(from_unixtime(substring(col("TMS"),1,10), format="yyyy-MM-dd HH:mm:ss"),lit('.'), substring(df.TMS,11,3))).alias("TMS_TimestampType")
              )
      )
df1.show(3,False)
df1.printSchema()
Output
+-------------+------------------------+------------+-----------------------+-----------------------+
|TMS          |TMS_WITHOUT_MILLISECONDS|MILLISECONDS|TMS_StringType         |TMS_TimestampType      |
+-------------+------------------------+------------+-----------------------+-----------------------+
|1578569683753|2020-01-09 11:34:43     |753         |2020-01-09 11:34:43.753|2020-01-09 11:34:43.753|
|1578569581793|2020-01-09 11:33:01     |793         |2020-01-09 11:33:01.793|2020-01-09 11:33:01.793|
|1578569581993|2020-01-09 11:33:01     |993         |2020-01-09 11:33:01.993|2020-01-09 11:33:01.993|
+-------------+------------------------+------------+-----------------------+-----------------------+
root
|-- TMS: string (nullable = true)
|-- TMS_WITHOUT_MILLISECONDS: string (nullable = true)
|-- MILLISECONDS: string (nullable = true)
|-- TMS_StringType: string (nullable = true)
|-- TMS_TimestampType: timestamp (nullable = true)
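On Spark 3.1+ there is also a builtin that goes straight from epoch milliseconds to a timestamp, which avoids the substring/concat handling above. A minimal sketch with the same sample values (timestamp_millis is used through expr() here because the Python wrapper may not exist on older PySpark versions):

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr, date_format

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [('1578569683753',), ('1578569581793',), ('1578569581993',)], ['TMS'])

# timestamp_millis builds a timestamp directly from epoch milliseconds (Spark 3.1+)
df2 = (df
       .withColumn('ts', expr('timestamp_millis(CAST(TMS AS LONG))'))
       .withColumn('time_string', date_format('ts', 'yyyy-MM-dd HH:mm:ss.SSS')))

df2.show(truncate=False)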

Timestamp formats and time zones in Spark (scala API)

******* UPDATE ********
As suggested in the comments, I eliminated the irrelevant part of the code:
My requirements:
Unify the number of millisecond digits to 3
Transform string to timestamp and keep the value in UTC
Create dataframe:
val df = Seq("2018-09-02T05:05:03.456Z","2018-09-02T04:08:32.1Z","2018-09-02T05:05:45.65Z").toDF("Timestamp")
Here are the results using the spark shell:
************ END UPDATE *********************************
I am having a nice headache trying to deal with time zones and timestamp formats in Spark using Scala.
This is a simplification of my script to explain my problem:
import org.apache.spark.sql.functions._
val jsonRDD = sc.wholeTextFiles("file:///data/home2/phernandez/vpp/Test_Message.json")
val jsonDF = spark.read.json(jsonRDD.map(f => f._2))
This is the resulting schema:
root
|-- MeasuredValues: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- MeasuredValue: double (nullable = true)
| | |-- Status: long (nullable = true)
| | |-- Timestamp: string (nullable = true)
Then I just select the Timestamp field as follows
jsonDF.select(explode($"MeasuredValues").as("Values")).select($"Values.Timestamp").show(5,false)
The first thing I want to fix is the number of millisecond digits of every timestamp, unifying it to three.
I applied date_format as follows:
jsonDF.select(explode($"MeasuredValues").as("Values")).select(date_format($"Values.Timestamp","yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")).show(5,false)
The milliseconds format was fixed, but the timestamp was converted from UTC to local time.
To tackle this issue, I applied to_utc_timestamp together with my local time zone.
jsonDF.select(explode($"MeasuredValues").as("Values")).select(to_utc_timestamp(date_format($"Values.Timestamp","yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"),"Europe/Berlin").as("Timestamp")).show(5,false)
Even worse, the UTC value is not returned, and the milliseconds format is lost.
Any ideas how to deal with this? I would appreciate it 😊
BR. Paul
The cause of the problem is the time format string used for conversion:
yyyy-MM-dd'T'HH:mm:ss.SSS'Z'
As you may see, Z is inside single quotes, which means that it is not interpreted as the zone offset marker, but only as a character like T in the middle.
So, the format string should be changed to
yyyy-MM-dd'T'HH:mm:ss.SSSX
where X is the Java standard date/time formatter pattern letter for a zone offset (Z being the printed value for a zero offset).
Now, the source data can be converted to UTC timestamps:
val srcDF = Seq(
("2018-04-10T13:30:34.45Z"),
("2018-04-10T13:45:55.4Z"),
("2018-04-10T14:00:00.234Z"),
("2018-04-10T14:15:04.34Z"),
("2018-04-10T14:30:23.45Z")
).toDF("Timestamp")
val convertedDF = srcDF.select(to_utc_timestamp(date_format($"Timestamp", "yyyy-MM-dd'T'HH:mm:ss.SSSX"), "Europe/Berlin").as("converted"))
convertedDF.printSchema()
convertedDF.show(false)
/**
root
|-- converted: timestamp (nullable = true)
+-----------------------+
|converted              |
+-----------------------+
|2018-04-10 13:30:34.45 |
|2018-04-10 13:45:55.4  |
|2018-04-10 14:00:00.234|
|2018-04-10 14:15:04.34 |
|2018-04-10 14:30:23.45 |
+-----------------------+
*/
If you need to convert the timestamps back to strings and normalize the values to three fractional-second digits, there should be another date_format call, similar to what you have already applied in the question.
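To illustrate that last step: date_format with an SSS pattern pads the fractional seconds back to three digits. A minimal PySpark sketch of the idea (the Scala call is the same date_format function; the converted column below merely stands in for the one produced above):

from pyspark.sql import SparkSession
from pyspark.sql.functions import date_format, to_timestamp

spark = SparkSession.builder.getOrCreate()

# stand-in for the 'converted' timestamp column from the answer above
converted = spark.createDataFrame(
    [('2018-04-10 13:30:34.45',), ('2018-04-10 13:45:55.4',)], ['ts_str']
).select(to_timestamp('ts_str').alias('converted'))

# SSS always prints three fractional digits, e.g. .45 becomes .450
converted.select(
    date_format('converted', "yyyy-MM-dd HH:mm:ss.SSS").alias('normalized')
).show(truncate=False)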

Spark dataframe convert integer to timestamp and find date difference

I have this DataFrame org.apache.spark.sql.DataFrame:
|-- timestamp: integer (nullable = true)
|-- checkIn: string (nullable = true)
| timestamp|   checkIn|
+----------+----------+
|1521710892|2018-05-19|
|1521710892|2018-05-19|
Desired result: obtain a new column with the day difference between the checkIn date and the timestamp (2018-03-03 23:59:59 and 2018-03-04 00:00:01 should have a difference of 1).
Thus, I need to:
convert the timestamp to a date (this is where I'm stuck)
subtract one date from the other
use some function to extract the day difference (have not found this function yet)
You can use from_unixtime to convert your timestamp to date and datediff to calculate the difference in days:
val df = Seq(
(1521710892, "2018-05-19"),
(1521730800, "2018-01-01")
).toDF("timestamp", "checkIn")
df.withColumn("tsDate", from_unixtime($"timestamp")).
withColumn("daysDiff", datediff($"tsDate", $"checkIn")).
show
// +----------+----------+-------------------+--------+
// | timestamp|   checkIn|             tsDate|daysDiff|
// +----------+----------+-------------------+--------+
// |1521710892|2018-05-19|2018-03-22 02:28:12|     -58|
// |1521730800|2018-01-01|2018-03-22 08:00:00|      80|
// +----------+----------+-------------------+--------+
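Note that datediff compares only the date part, so the boundary case called out in the question (one second before vs. one second after midnight) does come out as 1. A quick PySpark check of that specific pair (same functions, Python syntax):

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, datediff

spark = SparkSession.builder.getOrCreate()

# the boundary case from the question
df = spark.createDataFrame(
    [('2018-03-03 23:59:59', '2018-03-04 00:00:01')],
    ['before_midnight', 'after_midnight'])

df.select(
    datediff(to_date('after_midnight', 'yyyy-MM-dd HH:mm:ss'),
             to_date('before_midnight', 'yyyy-MM-dd HH:mm:ss')).alias('daysDiff')
).show()
# daysDiff is 1, even though the two instants are only two seconds apart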

How to merge two fields (string type) of a DataFrame's column to generate a Date

I have a DataFrame whose simplified schema has two columns, each a struct with 3 fields:
root
|-- npaDetails: struct (nullable = true)
| |-- additionalInformation: struct (nullable = true)
| |-- npaStatus: struct (nullable = true)
| |-- npaDetails: struct (nullable = true)
|-- npaHeaderData: struct (nullable = true)
| |-- npaNumber: string (nullable = true)
| |-- npaDownloadDate: string (nullable = true)
| |-- npaDownloadTime: string (nullable = true)
Possible values:
npaDownloadDate - "30JAN17"
npaDownloadTime - "19.50.00"
I need to compare two rows in a DataFrame with this schema, to determine which one is "fresher". To do so I need to merge the fields npaDownloadDate and npaDownloadTime to generate a Date that I can compare easily.
Below is the code I have written so far. It works, but I think it takes more steps than necessary, and I'm sure that Scala offers better solutions than my approach.
import java.text.SimpleDateFormat

val parquetFileDF = sqlContext.read.parquet("MyParquet.parquet")
val relevantRows = parquetFileDF.filter($"npaHeaderData.npaNumber" === "123456")
val date = relevantRows.select($"npaHeaderData.npaDownloadDate").head().getString(0)
val time = relevantRows.select($"npaHeaderData.npaDownloadTime").head().getString(0)
val dateTime = new SimpleDateFormat("ddMMMyykk.mm.ss").parse(date + time)
//I would replicate the previous steps to get dateTime2
if(dateTime.before(dateTime2))
println("dateTime is before dateTime2")
So the output of "30JAN17" and "19.50.00" would be Mon Jan 30 19:50:00 GST 2017
Is there another way to generate a Date from two fields of a column, without extracting and merging them as strings? Or even better, is it possible to compare both values (date and time) directly between two different rows in a dataframe, to know which one has an older date?
In Spark 2.2:
df.filter(
  to_date(
    concat(
      $"npaHeaderData.npaDownloadDate",
      $"npaHeaderData.npaDownloadTime"),
    fmt = "[your format here]") < lit(some date))
I'd use
import org.apache.spark.sql.functions._
df.withColumn("some_name", date_format(unix_timestamp(
concat($"npaHeaderData.npaDownloadDate", $"npaHeaderData.npaDownloadTime"),
"ddMMMyykk.mm.ss").cast("timestamp"),
"EEE MMM d HH:mm:ss z yyyy"))