How to convert a string column to a Date column in a PySpark DataFrame

I have a string column (DOB) like below:
DOB
1973-Jun-28
1978-May-02
I want to convert it to Date type. I tried the code below, but it produces null values in the output. I also tried other answers from Stack Overflow, to no avail.
df.withColumn('DOB', to_date('DOB', 'yyyy-MMM-dd'))

Your format pattern looks correct. Can you test the code below and verify that it works on your side?
Ref: https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
# Create a sample CSV, then start the pyspark shell
cat <<EOF >test.csv
DOB
1973-Jun-28
1978-May-02
EOF
pyspark
from pyspark.sql.functions import to_date
df = spark.read.csv('test.csv', header=True)
df.withColumn('DOB', to_date('DOB', 'yyyy-MMM-dd')).show()
+----------+
|       DOB|
+----------+
|1973-06-28|
|1978-05-02|
+----------+
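If the same pattern still yields nulls on Spark 3.x, it may be due to the datetime parser change introduced in Spark 3.0. As a fallback you can switch to the legacy parser (spark.sql.legacy.timeParserPolicy is a real Spark config; whether it is actually needed depends on your Spark version and data):
# Only if the pattern above still returns nulls or errors on Spark 3.x
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
df.withColumn('DOB', to_date('DOB', 'yyyy-MMM-dd')).show()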

Related

Convert string (with timestamp) to timestamp in pyspark

I have a dataframe with a string datetime column.
I am converting it to a timestamp, but the values change.
Below is my code; can anyone help me convert it without the values changing?
from pyspark.sql.functions import to_timestamp

df = spark.createDataFrame(
    data=[("1", "2020-04-06 15:06:16 +00:00")],
    schema=["id", "input_timestamp"])
df.printSchema()

# Parse the timestamp string into a TimestampType column
df = df.withColumn("timestamp", to_timestamp("input_timestamp"))

# Cast the TimestampType column to a string
df.withColumn('timestamp_string', \
    to_timestamp('timestamp').cast('string')) \
    .show(truncate=False)
This is the output:
+---+--------------------------+-------------------+-------------------+
|id |input_timestamp           |timestamp          |timestamp_string   |
+---+--------------------------+-------------------+-------------------+
|1  |2020-04-06 15:06:16 +00:00|2020-04-06 08:06:16|2020-04-06 08:06:16|
+---+--------------------------+-------------------+-------------------+
I want to know why the hour is changing from 15 to 8 and how can I prevent it?
I believe to_timestamp is converting the timestamp value to your local (session) time zone, since you have +00:00 in your data.
Try passing the format to the to_timestamp() function.
Example:
from pyspark.sql.functions import col, to_timestamp

df.withColumn("timestamp", to_timestamp(col("input_timestamp"), "yyyy-MM-dd HH:mm:ss +00:00")).show(10, False)
#+---+--------------------------+-------------------+
#|id |input_timestamp           |timestamp          |
#+---+--------------------------+-------------------+
#|1  |2020-04-06 15:06:16 +00:00|2020-04-06 15:06:16|
#+---+--------------------------+-------------------+
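The 15 → 8 shift comes from the session time zone being applied when the offset-aware string is parsed (15:06 UTC is 08:06 in a UTC-7 zone, for example). A quick check, and a way to keep everything in UTC if that is what you want (spark.sql.session.timeZone is a real Spark config):
# Inspect the session time zone that to_timestamp renders into
print(spark.conf.get("spark.sql.session.timeZone"))
# Render timestamps in UTC instead of the local zone
spark.conf.set("spark.sql.session.timeZone", "UTC")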
Alternatively, use to_utc_timestamp to shift the parsed value back to UTC:
from pyspark.sql.functions import to_utc_timestamp

df = spark.createDataFrame(
    data=[('1', '2020-04-06 15:06:16 +00:00')],
    schema=['id', 'input_timestamp'])
df.printSchema()

df = df.withColumn('timestamp',
                   to_utc_timestamp('input_timestamp', your_local_timezone))
df.withColumn('timestamp_string', df.timestamp.cast('string')).show(truncate=False)
Replace your_local_timezone with the actual value.
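For example, assuming the session runs in Pacific time ('America/Los_Angeles' below is only an illustrative value, not something from the question):
df = df.withColumn('timestamp',
                   to_utc_timestamp('input_timestamp', 'America/Los_Angeles'))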

pyspark: change string to timestamp

I have a column in String format; some rows are also null.
I append a dummy time so the values take the following form, which I can then convert to a timestamp.
Before:
date
null
22-04-2020
After:
date
01-01-1990 23:59:59.000
22-04-2020 23:59:59.000
df = df.withColumn('date', F.concat(df.date, F.lit(" 23:59:59.000")))
df = df.withColumn('date', F.when(F.col('date').isNull(), '01-01-1990 23:59:59.000').otherwise(F.col('date')))
df.withColumn("date", F.to_timestamp(F.col("date"), "MM-dd-yyyy HH mm ss SSS")).show(2)
But after this, the date column becomes null.
Can anyone help me solve this?
You can convert the string to a timestamp directly; your format is just wrong. It should start with dd-MM, not MM-dd, and it's missing the colons and dot in the time part. Try the code below:
df.withColumn("date", F.to_timestamp(F.col("date"),"dd-MM-yyyy HH:mm:ss.SSS")).show()
+-------------------+
|               date|
+-------------------+
|1990-01-01 23:59:59|
|2020-04-22 23:59:59|
+-------------------+
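Putting it together, a minimal end-to-end sketch of the corrected pipeline (values taken from the question):
import pyspark.sql.functions as F

df = spark.createDataFrame([(None,), ('22-04-2020',)], ['date'])

# Append a dummy time; nulls stay null through the concat
df = df.withColumn('date', F.concat(df.date, F.lit(' 23:59:59.000')))
# Replace the remaining nulls with a placeholder datetime string
df = df.withColumn('date', F.when(F.col('date').isNull(), '01-01-1990 23:59:59.000').otherwise(F.col('date')))
# Parse with the corrected day-first format
df.withColumn('date', F.to_timestamp(F.col('date'), 'dd-MM-yyyy HH:mm:ss.SSS')).show()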

casting to string of column for pyspark dataframe throws error

I have pyspark dataframe with two columns with datatypes as
[('area', 'int'), ('customer_play_id', 'int')]
+----+----------------+
|area|customer_play_id|
+----+----------------+
| 100|         8606738|
| 110|         8601843|
| 130|         8602984|
+----+----------------+
I want to cast the area column to string using PySpark, but I am getting errors, as below.
I tried the following:
str(df['area']) : didn't change the datatype to str
df.area.astype(str) : gave "TypeError: unexpected type: "
df['area'].cast(str) : same error as above
Any help will be appreciated.
I want the datatype of area to be string, using a PySpark dataframe operation.
You can simply do any of these:
Option1:
df1 = df.select('*',df.area.cast("string"))
select - All the columns you want in df1 should be mentioned in select
Option2:
df1 = df.selectExpr("*","cast(area as string) AS new_area")
selectExpr - All the columns you want in df1 should be mentioned in selectExpr
Option3:
df1 = df.withColumn("new_area", df.area.cast("string"))
withColumn will add new column (additional to existing columns of df)
"*" in select and selectExpr represent all the columns.
Use the withColumn function to change the data type or the values of a field in Spark, e.g. as shown below:
import pyspark.sql.functions as F
df = df.withColumn("area",F.col("area").cast("string"))
You can use a UDF, although a plain cast is enough here:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

# Pass-through UDF declared as FloatType; the lambda itself does no conversion
tofloatfunc = udf(lambda x: x, FloatType())
# The simpler way: cast the column directly
changedTypedf = df.withColumn("Column_name", df["Column_name"].cast(FloatType()))
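For reference, a minimal sketch of a UDF that actually does the conversion, in case custom cleaning is needed first (to_float and raw_value are hypothetical names, not from the question):
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

# Hypothetical cleaning UDF: strip whitespace from a numeric string, then convert
to_float = udf(lambda s: float(s.strip()) if s is not None else None, FloatType())
df2 = df.withColumn('raw_value_float', to_float(df['raw_value']))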

PySpark: String to timestamp transformation

I am working with time data and am trying to convert the strings to timestamp format.
Here is what the 'Time' column looks like
+----------+
|      Time|
+----------+
|1358380800|
|1380672000|
+----------+
Here is what I want
+---------------+
|           Time|
+---------------+
|2013/1/17 8:0:0|
|2013/10/2 8:0:0|
+---------------+
I found some similar questions and answers and tried the code below, but both attempts end in null.
df2 = df.withColumn("Time", df["Time"].cast(TimestampType()))
df2 = df.withColumn('Time', F.unix_timestamp('Time', 'yyyy-MM-dd').cast(TimestampType()))
Well, you are doing it the other way around. The SQL function unix_timestamp converts a string with the given format to a Unix timestamp. To convert a Unix timestamp to a datetime, you have to use the from_unixtime SQL function:
from pyspark.sql import functions as F
from pyspark.sql import types as T
l1 = [('1358380800',),('1380672000',)]
df = spark.createDataFrame(l1,['Time'])
df.withColumn('Time', F.from_unixtime(df.Time).cast(T.TimestampType())).show()
Output:
+-------------------+
|               Time|
+-------------------+
|2013-01-17 01:00:00|
|2013-10-02 02:00:00|
+-------------------+
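If you specifically want the yyyy/M/d H:m:s shape from the question, from_unixtime also accepts a format string (note that it then returns a formatted string, not a timestamp). The expected 8:0:0 values suggest a UTC+8 session time zone; 'Asia/Shanghai' below is an assumption, not something stated in the question:
# Assumed time zone; use the zone your data should be rendered in
spark.conf.set('spark.sql.session.timeZone', 'Asia/Shanghai')
df.withColumn('Time', F.from_unixtime(df.Time, 'yyyy/M/d H:m:s')).show()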

Truncating the data-frame column values in Scala

df is the dataframe, and
df.select($"createdon").show
output is:
+--------------------+
|           createdon|
+--------------------+
|2017-05-11 15:29:...|
|2017-05-11 15:29:...|
|2017-05-11 11:02:...|
|2017-05-11 11:02:...|
|2017-05-11 15:29:...|
+--------------------+
I need the createdon value to be just the date instead of the whole timestamp. How can I get a new dataframe in which createdon contains just the date?
to_date function should do it:
df.withColumn("createdon", to_date($"createdon")).show
+----------+
| createdon|
+----------+
|2017-05-11|
+----------+
You can also use a map transformation (assuming the selected values are of String type):
df.select($"createdon").map(_.getAs[String]("createdon").substring(0, 10)).show()
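For consistency with the rest of this page, a PySpark equivalent of the to_date approach (a sketch, assuming the same createdon column):
from pyspark.sql.functions import to_date

df.withColumn('createdon', to_date('createdon')).show()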