spark dataframe column string to date [duplicate] - pyspark

This question already has answers here:
Convert pyspark string to date format
(6 answers)
Closed 2 years ago.
I would like to convert a spark dataframe string column of 'yyyyMMdd' to date format with spark session (spark) - not spark context.
Since I'm not working with the spark context (sc), I cannot use the following code, although it would precisely do what I'd like it to do:
.withColumn("column1",DF.to_date(F.col("column1"),"yyyyMMdd"))
As I do not want to convert the column to timestamp, I also don't want to use the following code:
.withColumn("column1", unix_timestamp(col("column1"), "yyyyMMdd").cast("timestamp"))
The final goal is to replace the original string column with the date-formatted column.
Many thanks in advance!

The following code works fine (to_date is imported from pyspark.sql.functions):
.withColumn("column1", to_date(DF["column1"], 'yyyyMMdd'))
Thanks for your attention!

Related

How to convert unix time format to timestamp in spark [duplicate]

This question already has answers here:
Convert timestamp to date in Spark dataframe
(7 answers)
How to convert unix timestamp to date in Spark
(7 answers)
How do I convert column of unix epoch to Date in Apache spark DataFrame using Java?
(3 answers)
Closed 8 months ago.
Hi, I wanted to get the timestamp, appId, and name of the Spark job that I'm running and log it in a parquet file. I have these lines of code:
AuditLog.application_id = spark.sparkContext.applicationId
AuditLog.job_name = spark.sparkContext.appName
AuditLog.execution_timestamp = spark.sparkContext.startTime
For the execution_timestamp, the result is in Unix epoch format and I need to convert it to a timestamp with the format yyyy/MM/dd HH:mm:ss.
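One way to do this, sketched here as an assumption rather than taken from the thread: spark.sparkContext.startTime is an epoch value in milliseconds, so it can be formatted on the driver with java.time before being logged (the variable names below are illustrative):
import java.time.Instant
import java.time.ZoneId
import java.time.format.DateTimeFormatter
// Sketch: format the SparkContext start time (epoch milliseconds) as yyyy/MM/dd HH:mm:ss
val fmt = DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss").withZone(ZoneId.systemDefault())
val executionTimestamp = fmt.format(Instant.ofEpochMilli(spark.sparkContext.startTime))
The resulting string could then be assigned to AuditLog.execution_timestamp before writing the audit record.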

How to convert string into a date format in spark

I have passed a string (datestr) to a function (that does ETL on a dataframe in Spark using the Scala API); however, at some point I need to filter the dataframe by a certain date, with something like:
df.filter(col("dt_adpublished_simple") === date_add(datestr, -8))
where datestr is the parameter that I passed to the function.
Unfortunately, the function date_add requires a Column as its first parameter.
Can anyone help me convert the parameter into a Column, or suggest a similar solution that will solve the issue?
You probably just need lit to create a string Column from your input String, and then to_date to create a date Column from it:
df.filter(col("dt_adpublished_simple") === date_add(to_date(lit(datestr), format), -8))
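For reference, a self-contained version of the same idea might look like the sketch below; the pattern "yyyy-MM-dd" and the sample value of datestr are assumptions, not from the question:
import org.apache.spark.sql.functions.{col, date_add, lit, to_date}
// lit turns the plain String into a Column, to_date parses it, and date_add shifts it by -8 days
val datestr = "2019-06-30"  // illustrative value
val filtered = df.filter(col("dt_adpublished_simple") === date_add(to_date(lit(datestr), "yyyy-MM-dd"), -8))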

Scala: How to split the column values?

I am processing the data in Spark shell and have a dataframe with a date column. The format of the column is like "2017-05-01 00:00:00.0", but I want to change all the values to "2017-05-01" without the "00:00:00.0".
Thanks!
Just use String.split():
"2017-05-01 00:00:00.0".split(" ")(0)

Create a Spark Dataframe on Time

Quick question that I did not find an answer to on Google.
What is the best way to create a Spark Dataframe with timestamps?
Let's say I have a start point, an end point, and a 15-minute interval. What would be the best way to do this in Spark?
Try the Spark SQL functions, like this: DF.withColumn("regTime", current_timestamp); it can replace a column.
You can also convert a String to a date or timestamp using DF.withColumn("regTime", to_date(spark_temp_user_day("dateTime"))),
where regTime is a column of DF.
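As a rough sketch of the original question (start point, end point, 15-minute interval), one option is to generate the epoch seconds with spark.range and cast them to timestamps; the bounds below are illustrative, not from the question:
import org.apache.spark.sql.functions.col
// Build one row per 15-minute step between two epoch-second bounds
val start = java.sql.Timestamp.valueOf("2017-05-01 00:00:00").getTime / 1000
val end = java.sql.Timestamp.valueOf("2017-05-02 00:00:00").getTime / 1000
val step = 15 * 60  // 15 minutes in seconds
val timeDF = spark.range(start, end + 1, step).select(col("id").cast("timestamp").as("ts"))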

Spark SQL is not converting timezone correctly [duplicate]

This question already has answers here:
Spark Structured Streaming automatically converts timestamp to local time
(4 answers)
Closed 4 years ago.
Using Scala 2.10.4 and spark 1.5.1 and spark 1.6
sqlContext.sql(
"""
|select id,
|to_date(from_utc_timestamp(from_unixtime(at), 'US/Pacific')),
|from_utc_timestamp(from_unixtime(at), 'US/Pacific'),
|from_unixtime(at),
|to_date(from_unixtime(at)),
| at
|from events
| limit 100
""".stripMargin).collect().foreach(println)
Spark-Submit options:
--driver-java-options '-Duser.timezone=US/Pacific'
result:
[56d2a9573bc4b5c38453eae7,2016-02-28,2016-02-27 16:01:27.0,2016-02-28 08:01:27,2016-02-28,1456646487]
[56d2aa1bfd2460183a571762,2016-02-28,2016-02-27 16:04:43.0,2016-02-28 08:04:43,2016-02-28,1456646683]
[56d2aaa9eb63bbb63456d5b5,2016-02-28,2016-02-27 16:07:05.0,2016-02-28 08:07:05,2016-02-28,1456646825]
[56d2aab15a21fa5f4c4f42a7,2016-02-28,2016-02-27 16:07:13.0,2016-02-28 08:07:13,2016-02-28,1456646833]
[56d2aac8aeeee48b74531af0,2016-02-28,2016-02-27 16:07:36.0,2016-02-28 08:07:36,2016-02-28,1456646856]
[56d2ab1d87fd3f4f72567788,2016-02-28,2016-02-27 16:09:01.0,2016-02-28 08:09:01,2016-02-28,1456646941]
The time in US/Pacific should be 2016-02-28 00:01:27, etc., but somehow it subtracts 8 hours twice.
After reading for some time, these are my conclusions:
Spark SQL doesn't support date-time, nor timezones.
Using timestamp is the only solution.
from_unixtime(at) parses the epoch time correctly; it's just that printing it as a string changes it due to the timezone. It is safe to assume that from_unixtime will convert it correctly (although printing it might show different results).
from_utc_timestamp will shift (not just convert) the timestamp to that timezone; in this case it will subtract 8 hours from the time since the offset is (-08:00).
Printing SQL results messes up the times with respect to the timezone parameter.
For the record, here we convert Long values like that using a UDF.
For our purposes, we are only interested in the date-string representation of the timestamp (in milliseconds since the epoch, in UTC):
val udfToDateUTC = udf((epochMilliUTC: Long) => {
val dateFormatter = java.time.format.DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(java.time.ZoneId.of("UTC"))
dateFormatter.format(java.time.Instant.ofEpochMilli(epochMilliUTC))
})
This way, we control the parsing as well as the rendering of the dates.
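A usage sketch for completeness: applying the UDF to the events data queried above. The multiplication by 1000 assumes the at column holds epoch seconds, as the sample output suggests, since the UDF expects milliseconds; the output column name event_date is illustrative.
import org.apache.spark.sql.functions.col
// Cast to long first so the multiplication by 1000 does not overflow an integer column
val eventsDF = sqlContext.table("events")
val withDate = eventsDF.withColumn("event_date", udfToDateUTC(col("at").cast("long") * 1000))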