Converting pattern of date in spark dataframe - scala

I have a column of String datatype in a Spark DataFrame (with dates in the yyyy-MM-dd pattern).
I want to display the column value in the MM/dd/yyyy pattern.
My data is:
val df = sc.parallelize(Array(
("steak", "1990-01-01", "2000-01-01", 150),
("steak", "2000-01-02", "2001-01-13", 180),
("fish", "1990-01-01", "2001-01-01", 100)
)).toDF("name", "startDate", "endDate", "price")
df.show()
+-----+----------+----------+-----+
| name| startDate| endDate|price|
+-----+----------+----------+-----+
|steak|1990-01-01|2000-01-01| 150|
|steak|2000-01-02|2001-01-13| 180|
| fish|1990-01-01|2001-01-01| 100|
+-----+----------+----------+-----+
root
|-- name: string (nullable = true)
|-- startDate: string (nullable = true)
|-- endDate: string (nullable = true)
|-- price: integer (nullable = false)
I want to show endDate in the MM/dd/yyyy pattern. All I have managed so far is to cast the column from String to DateType:
val df2 = df.select($"endDate".cast(DateType).alias("endDate"))
df2.show()
+----------+
| endDate|
+----------+
|2000-01-01|
|2001-01-13|
|2001-01-01|
+----------+
df2.printSchema()
root
|-- endDate: date (nullable = true)
I still want to show endDate in the MM/dd/yyyy pattern. The only reference I found doesn't solve the problem.

You can use the date_format function.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Array(
("steak", "1990-01-01", "2000-01-01", 150),
("steak", "2000-01-02", "2001-01-13", 180),
("fish", "1990-01-01", "2001-01-01", 100))).toDF("name", "startDate", "endDate", "price")
df.show()
df.select(date_format(col("endDate"), "MM/dd/yyyy")).show
Output :
+-------------------------------+
|date_format(endDate,MM/dd/yyyy)|
+-------------------------------+
| 01/01/2000|
| 01/13/2001|
| 01/01/2001|
+-------------------------------+

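Use date_format(dateExpr, format) from org.apache.spark.sql.functions (PySpark exposes the same function as pyspark.sql.functions.date_format). In Scala the first argument must be a Column, not a String:
val df2 = df.select(date_format(col("endDate"), "MM/dd/yyyy").alias("endDate"))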

We have a DataFrame/Dataset with a string column that holds date values, and we need to change the date format.
For the question asked, the date format can be changed as follows:
val df1 = df.withColumn("startDate1", date_format(to_date(col("startDate"),"yyyy-MM-dd"),"MM/dd/yyyy" ))
In Spark, the default date format is "yyyy-MM-dd", so a string column already in that format is cast implicitly and the to_date call can be dropped:
val df1 = df.withColumn("startDate1", date_format(col("startDate"),"MM/dd/yyyy" ))
(i) By applying to_date, we are changing the datatype of this column (string) to Date datatype.
Also, we are informing to_date that the format in this string column is yyyy-MM-dd so read the column accordingly.
(ii) Next, we are applying date_format to achieve the date format we require which is MM/dd/yyyy.
When a time component is involved, use to_timestamp instead of to_date.
Note that 'MM' represents month and 'mm' represents minutes.
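To illustrate the to_timestamp remark, here is a minimal sketch that reformats a string column carrying a time component. It assumes Spark 2.2 or later (where to_timestamp accepts a format argument) and a local SparkSession; the column names eventTime and eventTimeUs and the sample values are made up for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, date_format, to_timestamp}

// Local session for the sketch; in spark-shell the existing `spark` is reused.
val spark = SparkSession.builder().master("local[1]").appName("reformat-timestamp").getOrCreate()
import spark.implicits._

// Hypothetical sample data: strings in yyyy-MM-dd HH:mm:ss layout.
val events = Seq("2000-01-01 13:45:10", "2001-01-13 08:05:00").toDF("eventTime")

// Step 1: parse the string into a timestamp (HH = hour of day, mm = minutes).
// Step 2: render the timestamp back to a string in the target layout (MM = month).
val formatted = events.withColumn(
  "eventTimeUs",
  date_format(to_timestamp(col("eventTime"), "yyyy-MM-dd HH:mm:ss"), "MM/dd/yyyy HH:mm"))

formatted.show(false)
```

Note that the result of date_format is again a string column, so it is suited to display rather than to further date arithmetic.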

Related

Convert yyyyMM to end of month date using PySpark

I have a column in a dataframe in Pyspark with date in integer format e.g 202203 (yyyyMM format). I want to convert that to end of month date as 2022-03-31. How do I achieve this?
First cast column to String, then use to_date to get the date and then last_day.
Example:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
data = [{"x": 202203}]
df = spark.createDataFrame(data=data)
df = df.withColumn("date", F.last_day(F.to_date(F.col("x").cast("string"), "yyyyMM")))
df.show(10)
df.printSchema()
Output:
+------+----------+
| x| date|
+------+----------+
|202203|2022-03-31|
+------+----------+
root
|-- x: long (nullable = true)
|-- date: date (nullable = true)

Spark fails to convert String to TIMESTAMP

I have a hive table that contains a String column: this is an example:
| DT |
|-------------------------------|
| 2019-05-07 00:03:53.837000000 |
When I try to import the table into a Spark-Scala DataFrame, transforming the String to a timestamp, I only get null values:
val df = spark.sql(s"""select to_timestamp(dt_maj, 'yyyy-MM-dd HH:mm:ss.SSS') from ${use_database}.pz_send_demande_diffusion""").show()
| DT |
|------|
| null |
Doing
val df = spark.sql(s"""select dt from ${use_database}.pz_send_demande_diffusion""").show()
gives a good result (a column with the String values). So Spark is importing the column normally.
I also tried:
val df = spark.sql(s"""select to_timestamp('2005-05-04 11:12:54.297', 'yyyy-MM-dd HH:mm:ss.SSS') from ${use_database}.pz_send_demande_diffusion""").show()
And it worked! It returns a TIMESTAMP column.
What is the problem ?
Trim your extra 0s. Then,
df.withColumn("new", to_timestamp($"date".substr(lit(1),length($"date") - 6), "yyyy-MM-dd HH:mm:ss.SSS")).show(false)
the result is:
+-----------------------------+-------------------+
|date |new |
+-----------------------------+-------------------+
|2019-05-07 00:03:53.837000000|2019-05-07 00:03:53|
+-----------------------------+-------------------+
The schema:
root
|-- date: string (nullable = true)
|-- new: timestamp (nullable = true)
I think you should use the following format, yyyy-MM-dd HH:mm:ss.SSSSSSSSS, for this type of data: 2019-05-07 00:03:53.837000000
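Another option that sidesteps the pattern entirely is a plain cast. As far as I know, casting a string in the yyyy-MM-dd HH:mm:ss[.fraction] layout to timestamp accepts up to nine fractional digits and truncates them to microsecond precision, but treat that as an assumption to verify on your Spark version. The sketch below uses a local SparkSession and a made-up column name dt.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for the sketch; in spark-shell the existing `spark` is reused.
val spark = SparkSession.builder().master("local[1]").appName("cast-nanos").getOrCreate()
import spark.implicits._

// Hypothetical single-row stand-in for the Hive column from the question.
val raw = Seq("2019-05-07 00:03:53.837000000").toDF("dt")

// No format string: the cast parses the standard layout, fraction included.
val parsed = raw.withColumn("ts", col("dt").cast("timestamp"))

parsed.printSchema()
parsed.show(false)
```

Unlike the substr approach, this keeps the milliseconds in the resulting timestamp instead of dropping them.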

Spark Scala -difference between current date and max(day)

I need to calculate the difference between two dates:
Currentdate - max(day_id)
"Currentdate" is in simple date format (yyyyMMdd).
"day_id" is a string whose value is in yyyy-MM-dd format.
I have a dataframe in which I converted the date from string format to date format (yyyy-MM-dd):
df1 = to_date(from_unixtime(unix_timestamp(day_id, 'yyyy-MM-dd')))
Normally, for finding the max(day_id), I would do
def daySince(columnName: String): Column = {
  max(col(columnName))
}
I cannot figure out how to do the difference between
Currentdate - max(day_id)
Given input dataframe with schema as
+---+----------+
|id |day_id |
+---+----------+
|id1|2017-11-21|
|id1|2018-01-21|
|id2|2017-12-21|
+---+----------+
root
|-- id: string (nullable = true)
|-- day_id: string (nullable = true)
You can use the built-in current_date() and datediff() functions to meet your requirement:
import org.apache.spark.sql.functions._
df.withColumn("diff", datediff(current_date(), to_date(col("day_id"), "yyyy-MM-dd")))
which should give you something like the following (the diff values depend on the day the code is run):
+---+----------+----+
|id |day_id |diff|
+---+----------+----+
|id1|2017-11-21|167 |
|id1|2018-01-21|106 |
|id2|2017-12-21|137 |
+---+----------+----+
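The output above is a per-row difference. Since the question asks for Currentdate - max(day_id), a possible sketch of the aggregated variant follows. It assumes a local SparkSession and rebuilds the sample data; daysSince and the days_since_max alias are illustrative names, not part of any API.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, current_date, datediff, max, to_date}

// Local session for the sketch; in spark-shell the existing `spark` is reused.
val spark = SparkSession.builder().master("local[1]").appName("days-since-max").getOrCreate()
import spark.implicits._

val df = Seq(
  ("id1", "2017-11-21"),
  ("id1", "2018-01-21"),
  ("id2", "2017-12-21")
).toDF("id", "day_id")

// Days between today and the latest date found in a yyyy-MM-dd string column.
// The single-argument to_date is enough because the strings use the default layout.
def daysSince(columnName: String): Column =
  datediff(current_date(), max(to_date(col(columnName))))

// One value for the whole DataFrame.
val overall = df.agg(daysSince("day_id").alias("days_since_max"))
// One value per id.
val perId = df.groupBy("id").agg(daysSince("day_id").alias("days_since_max"))

overall.show()
perId.show()
```

Because daysSince wraps an aggregate (max), it has to be used inside agg rather than withColumn.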

Converting String to Date in Spark Dataframe

I have a dataframe (df1) with 2 StringType fields.
Field1 (StringType) Value-X
Field2 (StringType) value-20180101
All I am trying to do is create another dataframe (df2) from df1 with 2 fields-
Field1 (StringType) Value-X
Field2 (Date Type) Value-2018-01-01
I am using the below code-
df2=df1.select(
col("field1").alias("f1"),
unix_timestamp(col("field2"),"yyyyMMdd").alias("f2")
)
df2.show
df2.printSchema
For field2, I tried multiple things (unix_timestamp, from_unixtime, to_date, cast("date")), but nothing worked.
I need the following schema as output:
df2.printSchema
|-- f1: string (nullable = false)
|-- f2: date (nullable = false)
I'm using Spark 2.1
to_date seems to work fine for what you need:
import org.apache.spark.sql.functions._
val df1 = Seq( ("X", "20180101"), ("Y", "20180406") ).toDF("c1", "c2")
val df2 = df1.withColumn("c2", to_date($"c2", "yyyyMMdd"))
df2.show
// +---+----------+
// | c1| c2|
// +---+----------+
// | X|2018-01-01|
// | Y|2018-04-06|
// +---+----------+
df2.printSchema
// root
// |-- c1: string (nullable = true)
// |-- c2: date (nullable = true)
[UPDATE]
For Spark 2.1 or prior, to_date doesn't take format string as a parameter, hence explicit string formatting to the standard yyyy-MM-dd format using, say, regexp_replace is needed:
val df2 = df1.withColumn(
"c2", to_date(regexp_replace($"c2", "(\\d{4})(\\d{2})(\\d{2})", "$1-$2-$3"))
)
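A commonly used alternative to the regexp_replace workaround on those older versions is to go through unix_timestamp, which does accept a format string, and then cast the resulting epoch seconds to timestamp and on to date. The following is a sketch under the assumption of a local SparkSession, reusing the sample data from above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, unix_timestamp}

// Local session for the sketch; in spark-shell the existing `spark` is reused.
val spark = SparkSession.builder().master("local[1]").appName("string-to-date").getOrCreate()
import spark.implicits._

val df1 = Seq(("X", "20180101"), ("Y", "20180406")).toDF("c1", "c2")

// unix_timestamp parses the string with the given pattern into epoch seconds;
// casting to timestamp and then to date drops the (midnight) time part.
val df2 = df1.withColumn(
  "c2",
  unix_timestamp(col("c2"), "yyyyMMdd").cast("timestamp").cast("date"))

df2.show()
df2.printSchema()
```

Parsing and both casts use the same time zone, so the calendar date survives the round trip through epoch seconds.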

Spark dataframe convert integer to timestamp and find date difference

I have this DataFrame org.apache.spark.sql.DataFrame:
|-- timestamp: integer (nullable = true)
|-- checkIn: string (nullable = true)
| timestamp| checkIn|
+----------+----------+
|1521710892|2018-05-19|
|1521710892|2018-05-19|
Desired result: obtain a new column with the difference in days between the checkIn date and the timestamp (2018-03-03 23:59:59 and 2018-03-04 00:00:01 should have a difference of 1).
Thus, I need to:
convert the timestamp to a date (this is where I'm stuck)
subtract one date from the other
use some function to extract the number of days (I have not found this function yet)
You can use from_unixtime to convert your epoch-seconds timestamp to a date-time string and datediff to calculate the difference in days:
val df = Seq(
(1521710892, "2018-05-19"),
(1521730800, "2018-01-01")
).toDF("timestamp", "checkIn")
df.withColumn("tsDate", from_unixtime($"timestamp")).
withColumn("daysDiff", datediff($"tsDate", $"checkIn")).
show
// +----------+----------+-------------------+--------+
// | timestamp| checkIn| tsDate|daysDiff|
// +----------+----------+-------------------+--------+
// |1521710892|2018-05-19|2018-03-22 02:28:12| -58|
// |1521730800|2018-01-01|2018-03-22 08:00:00| 80|
// +----------+----------+-------------------+--------+
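The tsDate values in the output above reflect the time zone of the machine that produced them (1521710892 is 09:28:12 in UTC). For reproducible results, one option is to pin the session time zone, sketched below. It assumes Spark 2.2 or later, where the spark.sql.session.timeZone setting exists, and a local SparkSession; the argument order of datediff is flipped here so that the value is checkIn minus the timestamp's date.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, datediff, from_unixtime, to_date}

// Local session for the sketch; in spark-shell the existing `spark` is reused.
val spark = SparkSession.builder().master("local[1]").appName("epoch-date-diff").getOrCreate()
import spark.implicits._

// Make the epoch-to-date conversion independent of the machine's time zone.
spark.conf.set("spark.sql.session.timeZone", "UTC")

val df = Seq(
  (1521710892, "2018-05-19"),
  (1521730800, "2018-01-01")
).toDF("timestamp", "checkIn")

val result = df
  // from_unixtime yields a yyyy-MM-dd HH:mm:ss string; to_date keeps only the date.
  .withColumn("tsDate", to_date(from_unixtime(col("timestamp"))))
  // datediff(end, start) counts calendar days, ignoring the time of day.
  .withColumn("daysDiff", datediff(col("checkIn"), col("tsDate")))

result.show()
```

Since datediff works on calendar dates, 2018-03-03 23:59:59 and 2018-03-04 00:00:01 differ by 1, which matches the requirement in the question.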