PySpark date_trunc without modifying actual value

Consider the dataframe below:
df:
time
2022-02-21T11:23:54
I have to convert it to
time
2022-02-21T11:23:00
After using the code below
df.withColumn("time_updated", date_trunc("minute", col("time"))).show(truncate = False)
My output
time
2022-02-21 11:23:00
My desired output is
time
2022-02-21T11:23:00
Is there any way I can keep the format the same and just update/truncate the seconds?

You simply have a format issue: the output you see is the default string representation of a timestamp. Check your output format:
from pyspark.sql import functions as F
df = df.withColumn(
    "time_updated",
    F.date_format(F.col("time").cast("timestamp"), "yyyy-MM-dd'T'HH:mm:00"),
)
df.show(truncate=False)
+-------------------+-------------------+
|time |time_updated |
+-------------------+-------------------+
|2022-02-21T11:23:54|2022-02-21T11:23:00|
+-------------------+-------------------+
df.printSchema()
root
|-- time: string (nullable = true)
|-- time_updated: string (nullable = true)
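If you would rather keep a real timestamp column and only produce the ISO-style string where it is needed, a minimal sketch (assuming the input is the string column time shown above) could combine date_trunc with date_format:
from pyspark.sql import functions as F
# Truncate to the minute while keeping a proper timestamp type,
# then render it with the 'T' separator only where a string is needed.
df = (
    df.withColumn("time_ts", F.date_trunc("minute", F.col("time").cast("timestamp")))
      .withColumn("time_updated", F.date_format(F.col("time_ts"), "yyyy-MM-dd'T'HH:mm:ss"))
)
df.printSchema()  # time_ts stays a timestamp; time_updated is the formatted string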

Related

Convert yyyyMM to end of month date using PySpark

I have a column in a dataframe in PySpark with a date in integer format, e.g. 202203 (yyyyMM format). I want to convert that to the end-of-month date, such as 2022-03-31. How do I achieve this?
First cast the column to string, then use to_date to parse the date, and finally apply last_day.
Example:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
data = [{"x": 202203}]
df = spark.createDataFrame(data=data)
df = df.withColumn("date", F.last_day(F.to_date(F.col("x").cast("string"), "yyyyMM")))
df.show(10)
df.printSchema()
Output:
+------+----------+
| x| date|
+------+----------+
|202203|2022-03-31|
+------+----------+
root
|-- x: long (nullable = true)
|-- date: date (nullable = true)

Convert unix_timestamp column to string using pyspark

I have a column which represents a unix_timestamp and want to convert it into a string with this format: 'yyyy-MM-dd HH:mm:ss.SSS'.
unix_timestamp | time_string
1578569683753 | 2020-01-09 11:34:43.753
1578569581793 | 2020-01-09 11:33:01.793
1578569581993 | 2020-01-09 11:33:01.993
Is there any built-in function for this, or how does it work? Thanks.
df1 = df1.withColumn('utc_stamp', F.from_unixtime('Timestamp', format="yyyy-MM-dd HH:mm:ss"))
df1.show(truncate=False)
from_unixtime converts only whole seconds; for the milliseconds I just have to concatenate them from the original column onto the new column.
unix_timestamp only supports second precision. Looking at your values, the precision is milliseconds: the last 3 digits are the millisecond part.
from pyspark.sql.functions import substring,unix_timestamp,col,to_timestamp,concat,lit,from_unixtime
df = spark.createDataFrame([('1578569683753',), ('1578569581793',),('1578569581993',)], ['TMS'])
df.show(3,False)
df.printSchema()
Result
+-------------+
|TMS |
+-------------+
|1578569683753|
|1578569581793|
|1578569581993|
+-------------+
root
|-- TMS: string (nullable = true)
Convert to the human-readable timestamp format
df1 = (df
    .select("TMS"
        , from_unixtime(substring(col("TMS"), 1, 10), format="yyyy-MM-dd HH:mm:ss").alias("TMS_WITHOUT_MILLISECONDS")
        , substring("TMS", 11, 3).alias("MILLISECONDS")
        , concat(from_unixtime(substring(col("TMS"), 1, 10), format="yyyy-MM-dd HH:mm:ss"), lit('.'), substring(df.TMS, 11, 3)).alias("TMS_StringType")
        , to_timestamp(concat(from_unixtime(substring(col("TMS"), 1, 10), format="yyyy-MM-dd HH:mm:ss"), lit('.'), substring(df.TMS, 11, 3))).alias("TMS_TimestampType")
    )
)
df1.show(3,False)
df1.printSchema()
Output
+-------------+------------------------+------------+-----------------------+-----------------------+
|TMS          |TMS_WITHOUT_MILLISECONDS|MILLISECONDS|TMS_StringType         |TMS_TimestampType      |
+-------------+------------------------+------------+-----------------------+-----------------------+
|1578569683753|2020-01-09 11:34:43     |753         |2020-01-09 11:34:43.753|2020-01-09 11:34:43.753|
|1578569581793|2020-01-09 11:33:01     |793         |2020-01-09 11:33:01.793|2020-01-09 11:33:01.793|
|1578569581993|2020-01-09 11:33:01     |993         |2020-01-09 11:33:01.993|2020-01-09 11:33:01.993|
+-------------+------------------------+------------+-----------------------+-----------------------+
root
|-- TMS: string (nullable = true)
|-- TMS_WITHOUT_MILLISECONDS: string (nullable = true)
|-- MILLISECONDS: string (nullable = true)
|-- TMS_StringType: string (nullable = true)
|-- TMS_TimestampType: timestamp (nullable = true)
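As a side note, and not part of the original answer, a shorter route on the same data is to divide the millisecond count by 1000 and cast the result to a timestamp, since Spark interprets a numeric value cast to timestamp as (fractional) seconds since the epoch:
from pyspark.sql.functions import col
# Dividing by 1000 keeps the milliseconds as the fractional part of the seconds value.
df_alt = df.withColumn("TMS_TimestampType", (col("TMS").cast("long") / 1000).cast("timestamp"))
df_alt.show(truncate=False)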

EDIT: Spark Scala built-in function to_timestamp() ignores the millisecond part of the timestamp value

Sample Code:
val sparkSession = SparkUtil.getSparkSession("timestamp_format_test")
import sparkSession.implicits._
val format = "yyyy/MM/dd HH:mm:ss.SSS"
val time = "2018/12/21 08:07:36.927"
val df = sparkSession.sparkContext.parallelize(Seq(time)).toDF("in_timestamp")
val df2 = df.withColumn("out_timestamp", to_timestamp(df.col("in_timestamp"), format))
Output:
df2.show(false)
Please notice: out_timestamp loses the millisecond part of the original value
+-----------------------+-------------------+
|in_timestamp |out_timestamp |
+-----------------------+-------------------+
|2018/12/21 08:07:36.927|2018-12-21 08:07:36|
+-----------------------+-------------------+
df2.printSchema()
root
|-- in_timestamp: string (nullable = true)
|-- out_timestamp: timestamp (nullable = true)
In the above result, in_timestamp is of string type and I would like to convert it to the timestamp data type; it does get converted, but the millisecond part gets lost. Any ideas? Thanks!
Sample code for preserving the milliseconds during the conversion from string to timestamp:
val df2 = df.withColumn("out_timestamp", to_timestamp(df.col("in_timestamp")))
df2.show(false)
+-----------------------+-----------------------+
|in_timestamp |out_timestamp |
+-----------------------+-----------------------+
|2018-12-21 08:07:36.927|2018-12-21 08:07:36.927|
+-----------------------+-----------------------+
scala> df2.printSchema
root
|-- in_timestamp: string (nullable = true)
|-- out_timestamp: timestamp (nullable = true)
You just need to remove the format parameter from to_timestamp (note that the input string here already uses Spark's default yyyy-MM-dd HH:mm:ss.SSS layout). The result is stored as the timestamp data type with the same value as the string, milliseconds included.
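For completeness: on Spark 3.x the new datetime parser keeps the fraction even when an explicit pattern is supplied, so the original slash-separated format should also work. A minimal PySpark sketch of that variant (the data and format string mirror the example above):
from pyspark.sql import SparkSession, functions as F
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("2018/12/21 08:07:36.927",)], ["in_timestamp"])
# With the non-legacy parser (the Spark 3.x default), .SSS parses and preserves the millisecond part.
df.withColumn("out_timestamp", F.to_timestamp("in_timestamp", "yyyy/MM/dd HH:mm:ss.SSS")).show(truncate=False)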

Pyspark from_unixtime (unix_timestamp) does not convert to timestamp

I am using PySpark with Python 2.7. I have a date column as a string (with milliseconds) and would like to convert it to a timestamp.
This is what I have tried so far
df = df.withColumn('end_time', from_unixtime(unix_timestamp(df.end_time, '%Y-%M-%d %H:%m:%S.%f')) )
printSchema() shows
end_time: string (nullable = true)
when I expected timestamp as the type of the column
Try using from_utc_timestamp:
from pyspark.sql.functions import from_utc_timestamp
df = df.withColumn('end_time', from_utc_timestamp(df.end_time, 'PST'))
You'd need to specify a time zone for the function; in this case I chose PST.
If this does not work, please give us an example of a few rows showing df.end_time.
Create a sample dataframe with the timestamp formatted as a string:
import pyspark.sql.functions as F
df = spark.createDataFrame([('22-Jul-2018 04:21:18.792 UTC', ),('23-Jul-2018 04:21:25.888 UTC',)], ['TIME'])
df.show(2,False)
df.printSchema()
Output:
+----------------------------+
|TIME |
+----------------------------+
|22-Jul-2018 04:21:18.792 UTC|
|23-Jul-2018 04:21:25.888 UTC|
+----------------------------+
root
|-- TIME: string (nullable = true)
Convert the string time format (including milliseconds) to a unix_timestamp (double). Since the unix_timestamp() function excludes milliseconds, we need to add them back with a simple hack: extract the milliseconds from the string using substring (start_position = -7, length_of_substring = 3) and add them separately to the unix_timestamp value (casting the substring to float for the addition).
df1 = df.withColumn("unix_timestamp",F.unix_timestamp(df.TIME,'dd-MMM-yyyy HH:mm:ss.SSS z') + F.substring(df.TIME,-7,3).cast('float')/1000)
Convert the unix_timestamp (double) to the timestamp data type in Spark:
df2 = df1.withColumn("TimestampType",F.to_timestamp(df1["unix_timestamp"]))
df2.show(n=2,truncate=False)
This will give you the following output:
+----------------------------+----------------+-----------------------+
|TIME |unix_timestamp |TimestampType |
+----------------------------+----------------+-----------------------+
|22-Jul-2018 04:21:18.792 UTC|1.532233278792E9|2018-07-22 04:21:18.792|
|23-Jul-2018 04:21:25.888 UTC|1.532319685888E9|2018-07-23 04:21:25.888|
+----------------------------+----------------+-----------------------+
Checking the Schema:
df2.printSchema()
root
|-- TIME: string (nullable = true)
|-- unix_timestamp: double (nullable = true)
|-- TimestampType: timestamp (nullable = true)
In current versions of Spark, we do not have to do much with respect to timestamp conversion.
Using the to_timestamp function works pretty well in this case. The only thing we need to take care of is passing the timestamp format that matches the original column.
In my case it was in the format yyyy-MM-dd HH:mm:ss.
Other formats can be like MM/dd/yyyy HH:mm:ss or a combination as such.
from pyspark.sql.functions import to_timestamp
df=df.withColumn('date_time',to_timestamp('event_time','yyyy-MM-dd HH:mm:ss'))
df.show()
The following might help:
from pyspark.sql import functions as F
df = df.withColumn("end_time", F.from_unixtime(F.col("end_time"), 'yyyy-MM-dd HH:mm:ss.SS').cast("timestamp"))

Timestamp formats and time zones in Spark (scala API)

******* UPDATE ********
As suggested in the comments, I eliminated the irrelevant part of the code:
My requirements:
Unify number of milliseconds to 3
Transform string to timestamp and keep the value in UTC
Create dataframe:
val df = Seq("2018-09-02T05:05:03.456Z","2018-09-02T04:08:32.1Z","2018-09-02T05:05:45.65Z").toDF("Timestamp")
Here are the results using the spark-shell:
************ END UPDATE *********************************
I am having a nice headache trying to deal with time zones and timestamp formats in Spark using scala.
This is a simplification of my script to explain my problem:
import org.apache.spark.sql.functions._
val jsonRDD = sc.wholeTextFiles("file:///data/home2/phernandez/vpp/Test_Message.json")
val jsonDF = spark.read.json(jsonRDD.map(f => f._2))
This is the resulting schema:
root
|-- MeasuredValues: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- MeasuredValue: double (nullable = true)
| | |-- Status: long (nullable = true)
| | |-- Timestamp: string (nullable = true)
Then I just select the Timestamp field as follows
jsonDF.select(explode($"MeasuredValues").as("Values")).select($"Values.Timestamp").show(5,false)
First thing I want to fix is the number of milliseconds of every timestamp and unify it to three.
I applied the date_format as follows
jsonDF.select(explode($"MeasuredValues").as("Values")).select(date_format($"Values.Timestamp","yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")).show(5,false)
The milliseconds format was fixed, but the timestamp is converted from UTC to local time.
To tackle this issue, I applied to_utc_timestamp together with my local time zone.
jsonDF.select(explode($"MeasuredValues").as("Values")).select(to_utc_timestamp(date_format($"Values.Timestamp","yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"),"Europe/Berlin").as("Timestamp")).show(5,false)
Even worse, the UTC value is not returned, and the milliseconds format is lost.
Any ideas how to deal with this? I would appreciate it 😊
BR. Paul
The cause of the problem is the time format string used for conversion:
yyyy-MM-dd'T'HH:mm:ss.SSS'Z'
As you may see, Z is inside single quotes, which means that it is not interpreted as the zone offset marker, but only as a character like T in the middle.
So, the format string should be changed to
yyyy-MM-dd'T'HH:mm:ss.SSSX
where X is the Java standard date-time formatter pattern for an ISO 8601 zone offset (Z being printed for a zero offset).
Now, the source data can be converted to UTC timestamps:
val srcDF = Seq(
("2018-04-10T13:30:34.45Z"),
("2018-04-10T13:45:55.4Z"),
("2018-04-10T14:00:00.234Z"),
("2018-04-10T14:15:04.34Z"),
("2018-04-10T14:30:23.45Z")
).toDF("Timestamp")
val convertedDF = srcDF.select(to_utc_timestamp(date_format($"Timestamp", "yyyy-MM-dd'T'HH:mm:ss.SSSX"), "Europe/Berlin").as("converted"))
convertedDF.printSchema()
convertedDF.show(false)
/**
root
|-- converted: timestamp (nullable = true)
+-----------------------+
|converted |
+-----------------------+
|2018-04-10 13:30:34.45 |
|2018-04-10 13:45:55.4 |
|2018-04-10 14:00:00.234|
|2018-04-10 14:15:04.34 |
|2018-04-10 14:30:23.45 |
+-----------------------+
*/
If you need to convert the timestamps back to strings and normalize the values to three millisecond digits, there should be another date_format call, similar to the one you have already applied in the question.
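For illustration, a minimal PySpark sketch of that final formatting step (assuming a dataframe convertedDF with the timestamp column converted from the example above, already in UTC):
from pyspark.sql import functions as F
# SSS pads the fraction to exactly three digits; the quoted 'Z' is emitted as a literal UTC marker.
normalized = convertedDF.withColumn(
    "Timestamp", F.date_format(F.col("converted"), "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")
)
normalized.show(truncate=False)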