Convert unix_timestamp column to string using pyspark

I have a column which represents a unix timestamp and want to convert it into a string with this format: 'yyyy-MM-dd HH:mm:ss.SSS'.
unix_timestamp | time_string
1578569683753  | 2020-01-09 11:34:43.753
1578569581793  | 2020-01-09 11:33:01.793
1578569581993  | 2020-01-09 11:33:01.993
Is there any builtin function for this, or how can it be done? Thanks.

df1 = df1.withColumn('utc_stamp', F.from_unixtime('Timestamp', format="yyyy-MM-dd HH:mm:ss"))
df1.show(truncate=False)
from_unixtime only resolves to seconds; for the milliseconds I just have to concatenate them from the original column onto the new column.

unix_timestamp only supports second precision. Looking at your values, the precision is milliseconds: the last 3 digits are the milliseconds.
from pyspark.sql.functions import substring, unix_timestamp, col, to_timestamp, concat, lit, from_unixtime
df = spark.createDataFrame([('1578569683753',), ('1578569581793',),('1578569581993',)], ['TMS'])
df.show(3,False)
df.printSchema()
Result
+-------------+
|TMS |
+-------------+
|1578569683753|
|1578569581793|
|1578569581993|
+-------------+
root
|-- TMS: string (nullable = true)
Convert to the human-readable timestamp format
df1 = (df
       .select("TMS",
               from_unixtime(substring(col("TMS"), 1, 10), format="yyyy-MM-dd HH:mm:ss").alias("TMS_WITHOUT_MILLISECONDS"),
               substring("TMS", 11, 3).alias("MILLISECONDS"),
               concat(from_unixtime(substring(col("TMS"), 1, 10), format="yyyy-MM-dd HH:mm:ss"), lit('.'), substring(df.TMS, 11, 3)).alias("TMS_StringType"),
               to_timestamp(concat(from_unixtime(substring(col("TMS"), 1, 10), format="yyyy-MM-dd HH:mm:ss"), lit('.'), substring(df.TMS, 11, 3))).alias("TMS_TimestampType")
              )
      )
df1.show(3,False)
df1.printSchema()
Output
+-------------+------------------------+------------+-----------------------+-----------------------+
|TMS          |TMS_WITHOUT_MILLISECONDS|MILLISECONDS|TMS_StringType         |TMS_TimestampType      |
+-------------+------------------------+------------+-----------------------+-----------------------+
|1578569683753|2020-01-09 11:34:43     |753         |2020-01-09 11:34:43.753|2020-01-09 11:34:43.753|
|1578569581793|2020-01-09 11:33:01     |793         |2020-01-09 11:33:01.793|2020-01-09 11:33:01.793|
|1578569581993|2020-01-09 11:33:01     |993         |2020-01-09 11:33:01.993|2020-01-09 11:33:01.993|
+-------------+------------------------+------------+-----------------------+-----------------------+
root
|-- TMS: string (nullable = true)
|-- TMS_WITHOUT_MILLISECONDS: string (nullable = true)
|-- MILLISECONDS: string (nullable = true)
|-- TMS_StringType: string (nullable = true)
|-- TMS_TimestampType: timestamp (nullable = true)
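If you would rather not slice the string by hand, another option is to divide the millisecond epoch by 1000 and cast. A minimal sketch, assuming TMS holds epoch milliseconds as a string, as above (this variant is not from the original answer):
from pyspark.sql.functions import col, date_format

# Epoch milliseconds -> fractional seconds -> timestamp (the .SSS part is kept)
df2 = df.withColumn("TMS_TimestampType",
                    (col("TMS").cast("double") / 1000).cast("timestamp"))

# Render the timestamp as a string in the requested format
df2 = df2.withColumn("TMS_StringType",
                     date_format("TMS_TimestampType", "yyyy-MM-dd HH:mm:ss.SSS"))
df2.show(truncate=False)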

Related

Pyspark date_trunc without modifying actual value

Consider the below dataframe
df:
time
2022-02-21T11:23:54
I have to convert it to
time
2022-02-21T11:23:00
After using the below code
df.withColumn("time_updated", date_trunc("minute", col("time"))).show(truncate = False)
My output
time
2022-02-21 11:23:00
My desired output is
time
2022-02-21T11:23:00
Is there any way I can keep the data the same and just update/truncate the seconds?
You simply have a format issue. The output that you see is the string representation of a timestamp. Check your output format:
from pyspark.sql import functions as F, Window as W, types as T
df = df.withColumn(
"time_updated",
F.date_format(F.col("time").cast("timestamp"), "yyyy-MM-dd'T'HH:mm:00"),
)
df.show(truncate=False)
+-------------------+-------------------+
|time |time_updated |
+-------------------+-------------------+
|2022-02-21T11:23:54|2022-02-21T11:23:00|
+-------------------+-------------------+
df.printSchema()
root
|-- time: string (nullable = true)
|-- time_updated: string (nullable = true)
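If you would rather keep a real timestamp column and only add the 'T' when a string is needed, a sketch combining date_trunc with date_format (same df as above; the intermediate column name time_ts is my own):
from pyspark.sql import functions as F

# Truncate to the minute but keep a timestamp-typed column,
# then render it with the literal 'T' only for display
df = (df
      .withColumn("time_ts", F.date_trunc("minute", F.col("time").cast("timestamp")))
      .withColumn("time_updated", F.date_format("time_ts", "yyyy-MM-dd'T'HH:mm:ss")))
df.printSchema()  # time_ts: timestamp, time_updated: string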

Is there a way to use pyspark.sql.functions.date_add with a col('column_name') as the second parameter instead of a static integer?

During an ETL process I have this one SAS date field that is in a 5-digit integer format, which indicates days since 01-01-1960. In order to make this data column more useful in analysis I would like to convert the column to a date data type field in Redshift.
Currently I am trying to do this in pyspark as follows:
created new column "sas_date" with string literal "1960-01-01"
Using pyspark.sql.functions.date_add I pass the "sas_date" column as the start date parameter and the integer-valued 'arrival_date' column as the second parameter.
When the date_add function runs I get the error "Column not iterable", even though I would think that the arrival_date column, being a series, would be iterable. But it's not. Why?
When I remove the 'arrival_date' column and replace it with a static integer value (say 1) the date_add function will work.
i94 = i94.withColumn('arrival_date', col('arrival_date').cast(Int()))
i94 = i94.withColumn('sas_date', lit("1960-01-01"))
i94 = i94.withColumn('arrival_date', date_add(col('sas_date'), i94['arrival_date']))
I want to be able to pass my column so that the second date_add parameter is dynamic. However, it seems date_add does not accept this. If date_add cannot accomplish this, what other option do I have outside of using a UDF?
UPDATE:
State of data right before the date_add() operation
i94.printSchema()
root
|-- cic_id: double (nullable = true)
|-- visa_id: string (nullable = true)
|-- port_id: string (nullable = true)
|-- airline_id: string (nullable = true)
|-- cit_id: double (nullable = true)
|-- res_id: double (nullable = true)
|-- year: double (nullable = true)
|-- month: double (nullable = true)
|-- age: double (nullable = true)
|-- gender: string (nullable = true)
|-- arrival_date: integer (nullable = true)
|-- depart_date: double (nullable = true)
|-- date_begin: string (nullable = true)
|-- date_end: string (nullable = true)
|-- sas_date: string (nullable = false)
i94.limit(10).toPandas()
toPandas() result
I think you are absolutely right: date_add is designed to take int values only in Spark <3.0.0.
In the Spark Scala implementation I see the lines below.
They indicate that whatever int value we pass to date_add is converted into a Column again with lit.
Spark <3.0.0:
def date_add(start: Column, days: Int): Column = date_add(start, lit(days))
Spark >=3.0.0:
def date_add(start: Column, days: Column): Column = withExpr { DateAdd(start.expr, days.expr) }
Now let's talk about the solution; I can think of two approaches.
Imports and a small sample of your dataset:
import pyspark.sql.functions as f
import pyspark.sql.types as t
from datetime import datetime
from datetime import timedelta
l1 = [(5748517.0,'1960-01-01', 20574), (5748517.0,'1960-01-01', 20574), (5748517.0,'1960-01-01', 20574)]
df = spark.createDataFrame(l1).toDF('cic_id','sas_date','arrival_date')
df.show()
+---------+----------+------------+
| cic_id| sas_date|arrival_date|
+---------+----------+------------+
|5748517.0|1960-01-01| 20574|
|5748517.0|1960-01-01| 20574|
|5748517.0|1960-01-01| 20574|
+---------+----------+------------+
Now, there are two ways to achieve the functionality.
UDF way:
def date_add_(date, days):
    # Type check and convert to a datetime object
    # Format and other things should be handled more delicately
    if type(date) is not datetime:
        date = datetime.strptime('1960-01-01', "%Y-%m-%d")
    return date + timedelta(days)
date_add_udf = f.udf(date_add_, t.DateType())
df.withColumn('actual_arrival_date', date_add_udf(f.to_date('sas_date'), 'arrival_date')).show()
+---------+----------+------------+-------------------+
| cic_id| sas_date|arrival_date|actual_arrival_date|
+---------+----------+------------+-------------------+
|5748517.0|1960-01-01| 20574| 2016-04-30|
|5748517.0|1960-01-01| 20574| 2016-04-30|
|5748517.0|1960-01-01| 20574| 2016-04-30|
+---------+----------+------------+-------------------+
Using expr evaluation:
df.withColumn('new_arrival_date', f.expr("date_add(sas_date, arrival_date)")).show()
+---------+----------+------------+----------------+
| cic_id| sas_date|arrival_date|new_arrival_date|
+---------+----------+------------+----------------+
|5748517.0|1960-01-01| 20574| 2016-04-30|
|5748517.0|1960-01-01| 20574| 2016-04-30|
|5748517.0|1960-01-01| 20574| 2016-04-30|
+---------+----------+------------+----------------+
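Applied to the i94 frame from the question, the expr route might look like this (column names taken from the question; a sketch, not tested against your data):
import pyspark.sql.functions as F

# Spark parses the SQL snippet itself, so the Python wrapper's
# int-only limitation on date_add does not apply here
i94 = (i94
       .withColumn('arrival_date', F.col('arrival_date').cast('int'))
       .withColumn('sas_date', F.lit('1960-01-01'))
       .withColumn('arrival_date', F.expr("date_add(sas_date, arrival_date)")))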

spark scala built-in function to_timestamp() ignores the millisecond part of the timestamp value

Sample Code:
val sparkSession = SparkUtil.getSparkSession("timestamp_format_test")
import sparkSession.implicits._
val format = "yyyy/MM/dd HH:mm:ss.SSS"
val time = "2018/12/21 08:07:36.927"
val df = sparkSession.sparkContext.parallelize(Seq(time)).toDF("in_timestamp")
val df2 = df.withColumn("out_timestamp", to_timestamp(df.col("in_timestamp"), format))
Output:
df2.show(false)
Please notice: out_timestamp loses the millisecond part of the original value.
+-----------------------+-------------------+
|in_timestamp |out_timestamp |
+-----------------------+-------------------+
|2018/12/21 08:07:36.927|2018-12-21 08:07:36|
+-----------------------+-------------------+
df2.printSchema()
root
|-- in_timestamp: string (nullable = true)
|-- out_timestamp: timestamp (nullable = true)
In the above result, in_timestamp is of string type and I would like to convert it to the timestamp data type; it does get converted, but the millisecond part gets lost. Any ideas? Thanks!
Sample code for preserving milliseconds during the conversion from string to timestamp:
val df2 = df.withColumn("out_timestamp", to_timestamp(df.col("in_timestamp")))
df2.show(false)
+-----------------------+-----------------------+
|in_timestamp |out_timestamp |
+-----------------------+-----------------------+
|2018-12-21 08:07:36.927|2018-12-21 08:07:36.927|
+-----------------------+-----------------------+
scala> df2.printSchema
root
|-- in_timestamp: string (nullable = true)
|-- out_timestamp: timestamp (nullable = true)
You just need to remove the format parameter from to_timestamp. The result keeps the timestamp data type and matches the string value, milliseconds included. (Note that the input here is already in Spark's default yyyy-MM-dd HH:mm:ss.SSS layout; the slash-separated string from the question would first have to be rewritten into that layout for the format-free call to parse it.)
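For reference, the same check from PySpark (my own sample row in the default layout, not part of the original post):
from pyspark.sql import functions as F

df = spark.createDataFrame([("2018-12-21 08:07:36.927",)], ["in_timestamp"])

# With no explicit format, to_timestamp keeps the fractional seconds
df2 = df.withColumn("out_timestamp", F.to_timestamp("in_timestamp"))
df2.show(truncate=False)  # out_timestamp: 2018-12-21 08:07:36.927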

In PySpark how to parse an embedded JSON

I am new to PySpark.
I have a JSON file which has the schema below:
df = spark.read.json(input_file)
df.printSchema()
|-- UrlsInfo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- displayUrl: string (nullable = true)
| | |-- type: string (nullable = true)
| | |-- url: string (nullable = true)
|-- type: long (nullable = true)
I want a new result dataframe which should have only two columns: type and UrlsInfo.element.DisplayUrl.
This is my attempt, which doesn't give the expected output:
df.createOrReplaceTempView("the_table")
resultDF = spark.sql("SELECT type, UrlsInfo.element.DisplayUrl FROM the_table")
resultDF.show()
I want resultDF to be something like this:
Type | DisplayUrl
----- ------------
2 | http://example.com
This is related to JSON file parsing in Pyspark, but it doesn't answer my question.
As you can see in your schema, UrlsInfo is an array type, not a struct. The "element" schema item thus refers not to a named property (you're trying to access it by .element) but to an array element (which responds to an index like [0]).
I've reproduced your schema by hand:
from pyspark.sql import Row
df = spark.createDataFrame([Row(UrlsInfo=[Row(displayUri="http://example.com", type="narf", url="poit")], Type=2)])
df.printSchema()
root
|-- Type: long (nullable = true)
|-- UrlsInfo: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- displayUri: string (nullable = true)
| | |-- type: string (nullable = true)
| | |-- url: string (nullable = true)
and I'm able to produce a table like what you seem to be looking for by using an index:
df.createOrReplaceTempView("temp")
resultDF = spark.sql("SELECT type, UrlsInfo[0].DisplayUri FROM temp")
resultDF.show()
+----+----------------------+
|type|UrlsInfo[0].DisplayUri|
+----+----------------------+
| 2| http://example.com|
+----+----------------------+
However, this only gives the first element (if any) of UrlsInfo in the second column.
EDIT: I'd forgotten about the EXPLODE function, which you can use here to treat the UrlsInfo elements like a set of rows:
from pyspark.sql import Row
df = spark.createDataFrame([Row(UrlsInfo=[Row(displayUri="http://example.com", type="narf", url="poit"), Row(displayUri="http://another-example.com", type="narf", url="poit")], Type=2)])
df.createOrReplaceTempView("temp")
resultDF = spark.sql("SELECT type, EXPLODE(UrlsInfo.displayUri) AS displayUri FROM temp")
resultDF.show()
+----+--------------------+
|type| displayUri|
+----+--------------------+
| 2| http://example.com|
| 2|http://another-ex...|
+----+--------------------+
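The same result can also be produced with the DataFrame API instead of SQL, using pyspark.sql.functions.explode (a sketch against the df built above):
from pyspark.sql import functions as F

# UrlsInfo.displayUri pulls the displayUri field out of each struct,
# giving an array of strings; explode turns that array into rows
resultDF = df.select("Type", F.explode("UrlsInfo.displayUri").alias("displayUri"))
resultDF.show()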

Change schema of existing dataframe

I want to change the schema of an existing dataframe, but while changing the schema I'm experiencing an error. Is it possible to change the existing schema of a dataframe?
val customSchema = StructType(
  Array(
    StructField("data_typ", StringType, nullable=false),
    StructField("data_typ", IntegerType, nullable=false),
    StructField("proc_date", IntegerType, nullable=false),
    StructField("cyc_dt", DateType, nullable=false),
  ));
val readDF=
+------------+--------------------+-----------+--------------------+
|DatatypeCode| Description|monthColNam| timeStampColNam|
+------------+--------------------+-----------+--------------------+
| 03099|Volumetric/Expand...| 201867|2018-05-31 18:25:...|
| 03307| Elapsed Day Factor| 201867|2018-05-31 18:25:...|
+------------+--------------------+-----------+--------------------+
val rows= readDF.rdd
val readDF1 = sparkSession.createDataFrame(rows,customSchema)
expected result
val newdf=
+------------+--------------------+-----------+--------------------+
|data_typ_cd | data_typ_desc|proc_dt | cyc_dt |
+------------+--------------------+-----------+--------------------+
| 03099|Volumetric/Expand...| 201867|2018-05-31 18:25:...|
| 03307| Elapsed Day Factor| 201867|2018-05-31 18:25:...|
+------------+--------------------+-----------+--------------------+
Any help will be appreciated.
You can do something like this to change the data type from one to another.
I have created a dataframe similar to yours like below:
import sparkSession.sqlContext.implicits._
import org.apache.spark.sql.types._
var df = Seq(("03099","Volumetric/Expand...", "201867", "2018-05-31 18:25:00"),("03307","Elapsed Day Factor", "201867", "2018-05-31 18:25:00"))
.toDF("DatatypeCode","data_typ", "proc_date", "cyc_dt")
df.printSchema()
df.show()
This gives me the following output:
root
|-- DatatypeCode: string (nullable = true)
|-- data_typ: string (nullable = true)
|-- proc_date: string (nullable = true)
|-- cyc_dt: string (nullable = true)
+------------+--------------------+---------+-------------------+
|DatatypeCode| data_typ|proc_date| cyc_dt|
+------------+--------------------+---------+-------------------+
| 03099|Volumetric/Expand...| 201867|2018-05-31 18:25:00|
| 03307| Elapsed Day Factor| 201867|2018-05-31 18:25:00|
+------------+--------------------+---------+-------------------+
As you can see in the schema above, all the columns are of type String. Now, to change the column proc_date to Integer type and cyc_dt to Date type, I do the following:
df = df.withColumnRenamed("DatatypeCode", "data_type_code")
df = df.withColumn("proc_date_new", df("proc_date").cast(IntegerType)).drop("proc_date")
df = df.withColumn("cyc_dt_new", df("cyc_dt").cast(DateType)).drop("cyc_dt")
and when you check the schema of this dataframe
df.printSchema()
then it gives the following output with the new column names:
root
|-- data_type_code: string (nullable = true)
|-- data_typ: string (nullable = true)
|-- proc_date_new: integer (nullable = true)
|-- cyc_dt_new: date (nullable = true)
You cannot change the schema like this. The schema object passed to createDataFrame has to match the data, not the other way around:
To parse timestamp data, use the corresponding functions, for example as in "Better way to convert a string field into timestamp in Spark".
To change other types, use the cast method, for example as in "how to change a Dataframe column from String type to Double type in pyspark".
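For completeness, the same rename-and-cast pattern written in PySpark (column names taken from the question's expected output; a sketch under that assumption, not the only way to do it):
from pyspark.sql import functions as F

newdf = readDF.select(
    F.col("DatatypeCode").alias("data_typ_cd"),
    F.col("Description").alias("data_typ_desc"),
    F.col("monthColNam").cast("int").alias("proc_dt"),
    F.col("timeStampColNam").cast("timestamp").alias("cyc_dt"),  # or "date", to match customSchema
)
newdf.printSchema()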