Spark Scala -difference between current date and max(day) - scala

Need to calculate the difference between two dates. The question is
Currentdate - max(day_id)
"Currentdate" is of simple date format - yyyyMMdd
"day_id" is of string format and its value is yyyy-mm-dd.
I have a dataframe which converted the date(string format) to date format (yyyy-mm-dd)
df1 = to_date(from_unixtime(unix_timestamp(day_id, 'yyyy-MM-dd')))
Normally, for finding the max(day_id), I would do
def daySince (columnName: String): Column = {
max(col(columnName))
I cannot figure out how to do the difference between
Currentdate - max(day_id)

Given input dataframe with schema as
+---+----------+
|id |day_id |
+---+----------+
|id1|2017-11-21|
|id1|2018-01-21|
|id2|2017-12-21|
+---+----------+
root
|-- id: string (nullable = true)
|-- day_id: string (nullable = true)
you can use current_date() and datediff() inbuilt functions to meet your requirement as
import org.apache.spark.sql.functions._
df.withColumn("diff", datediff(current_date(), to_date(col("day_id"), "yyyy-MM-dd")))
which should give you
+---+----------+----+
|id |day_id |diff|
+---+----------+----+
|id1|2017-11-21|167 |
|id1|2018-01-21|106 |
|id2|2017-12-21|137 |
+---+----------+----+

Related

Convert yyyyMM to end of month date using PySpark

I have a column in a dataframe in Pyspark with date in integer format e.g 202203 (yyyyMM format). I want to convert that to end of month date as 2022-03-31. How do I achieve this?
First cast column to String, then use to_date to get the date and then last_day.
Example:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = SparkSession.builder.getOrCreate()
data = [{"x": 202203}]
df = spark.createDataFrame(data=data)
df = df.withColumn("date", F.last_day(F.to_date(F.col("x").cast("string"), "yyyyMM")))
df.show(10)
df.printSchema()
Output:
+------+----------+
| x| date|
+------+----------+
|202203|2022-03-31|
+------+----------+
root
|-- x: long (nullable = true)
|-- date: date (nullable = true)

Pyspark date_trunc without modifying actual value

Consider the below dataframe
df:
time
2022-02-21T11:23:54
I have to convert it to
time
2022-02-21T11:23:00
After using the below code
df.withColumn("time_updated", date_trunc("minute", col("time"))).show(truncate = False)
My output
time
2022-02-21 11:23:00
By desired output is
time
2022-02-21T11:23:00
Is there anyway I can keep the data same and just update/truncate the seconds??
you simply have a format issue. the output that you see is the string representation of a timestamp. check your output formats :
from pyspark.sql import functions as F, Window as W, types as T
df = df.withColumn(
"time_updated",
F.date_format(F.col("time").cast("timestamp"), "YYYY-MM-dd'T'HH:mm:00"),
)
df.show(truncate=False)
+-------------------+-------------------+
|time |time_updated |
+-------------------+-------------------+
|2022-02-21T11:23:54|2022-02-21T11:23:00|
+-------------------+-------------------+
df.printSchema()
root
|-- time: string (nullable = true)
|-- time_updated: string (nullable = true)

Spark fails to convert String to TIMESTAMP

I have a hive table that contains a String column: this is an example:
| DT |
|-------------------------------|
| 2019-05-07 00:03:53.837000000 |
when I try to import the table inside a Spark-Scala DF transforming the String to a timestamp I only have null values:
val df = spark.sql(s"""select to_timestamp(dt_maj, 'yyyy-MM-dd HH:mm:ss.SSS') from ${use_database}.pz_send_demande_diffusion""").show()
| DT |
|------|
| null |
Doing
val df = spark.sql(s"""select dt from ${use_database}.pz_send_demande_diffusion""").show()
gives a good result (column with the String values). So Spark is importing te column normally.
I also tried:
val df = spark.sql(s"""select to_timestamp('2005-05-04 11:12:54.297', 'yyyy-MM-dd HH:mm:ss.SSS') from ${use_database}.pz_send_demande_diffusion""").show()
And it worked! It returns a TIMESTAMPs column.
What is the problem ?
Trim your extra 0s. Then,
df.withColumn("new", to_timestamp($"date".substr(lit(1),length($"date") - 6), "yyyy-MM-dd HH:mm:ss.SSS")).show(false)
the result is:
+-----------------------------+-------------------+
|date |new |
+-----------------------------+-------------------+
|2019-05-07 00:03:53.837000000|2019-05-07 00:03:53|
+-----------------------------+-------------------+
The schema:
root
|-- date: string (nullable = true)
|-- new: timestamp (nullable = true)
I think you should use following format yyyy-MM-dd HH:mm:ss.SSSSSSSSS for this type of data 2019-05-07 00:03:53.837000000

Spark dataframe convert integer to timestamp and find date difference

I have this DataFrame org.apache.spark.sql.DataFrame:
|-- timestamp: integer (nullable = true)
|-- checkIn: string (nullable = true)
| timestamp| checkIn|
+----------+----------+
|1521710892|2018-05-19|
|1521710892|2018-05-19|
Desired result: obtain a new column with day difference between date checkIn and timestamp (2018-03-03 23:59:59 and 2018-03-04 00:00:01 should have a difference of 1)
Thus, i need to
convert timestamp to date (This is where i'm stuck)
take out one date from another
use some function to extract day(Have not found this function yet)
You can use from_unixtime to convert your timestamp to date and datediff to calculate the difference in days:
val df = Seq(
(1521710892, "2018-05-19"),
(1521730800, "2018-01-01")
).toDF("timestamp", "checkIn")
df.withColumn("tsDate", from_unixtime($"timestamp")).
withColumn("daysDiff", datediff($"tsDate", $"checkIn")).
show
// +----------+----------+-------------------+--------+
// | timestamp| checkIn| tsDate|daysDiff|
// +----------+----------+-------------------+--------+
// |1521710892|2018-05-19|2018-03-22 02:28:12| -58|
// |1521730800|2018-01-01|2018-03-22 08:00:00| 80|
// +----------+----------+-------------------+--------+

Converting pattern of date in spark dataframe

I have a column in spark dataframe of String datatype (with date in yyyy-MM-dd pattern)
I want to display the column value in MM/dd/yyyy pattern
My data is
val df = sc.parallelize(Array(
("steak", "1990-01-01", "2000-01-01", 150),
("steak", "2000-01-02", "2001-01-13", 180),
("fish", "1990-01-01", "2001-01-01", 100)
)).toDF("name", "startDate", "endDate", "price")
df.show()
+-----+----------+----------+-----+
| name| startDate| endDate|price|
+-----+----------+----------+-----+
|steak|1990-01-01|2000-01-01| 150|
|steak|2000-01-02|2001-01-13| 180|
| fish|1990-01-01|2001-01-01| 100|
+-----+----------+----------+-----+
root
|-- name: string (nullable = true)
|-- startDate: string (nullable = true)
|-- endDate: string (nullable = true)
|-- price: integer (nullable = false)
I want to show endDate in MM/dd/yyyy pattern. All I am able to do is convert the column to DateType from String
val df2 = df.select($"endDate".cast(DateType).alias("endDate"))
df2.show()
+----------+
| endDate|
+----------+
|2000-01-01|
|2001-01-13|
|2001-01-01|
+----------+
df2.printSchema()
root
|-- endDate: date (nullable = true)
I want to show endDate in MM/dd/yyyy pattern. Only reference I found is this which doesn't solve the problem
You can use date_format function.
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val df = sc.parallelize(Array(
("steak", "1990-01-01", "2000-01-01", 150),
("steak", "2000-01-02", "2001-01-13", 180),
("fish", "1990-01-01", "2001-01-01", 100))).toDF("name", "startDate", "endDate", "price")
df.show()
df.select(date_format(col("endDate"), "MM/dd/yyyy")).show
Output :
+-------------------------------+
|date_format(endDate,MM/dd/yyyy)|
+-------------------------------+
| 01/01/2000|
| 01/13/2001|
| 01/01/2001|
+-------------------------------+
Use pyspark.sql.functions.date_format(date, format):
val df2 = df.select(date_format("endDate", "MM/dd/yyyy").alias("endDate"))
Dataframe/Dataset having a string column with date value in it and we need to change the date format.
For the query asked, date format can be changed as below:
val df1 = df.withColumn("startDate1", date_format(to_date(col("startDate"),"yyyy-MM-dd"),"MM/dd/yyyy" ))
In Spark, the default date format is "yyyy-MM-dd" hence it can be re-written as
val df1 = df.withColumn("startDate1", date_format(col("startDate"),"MM/dd/yyyy" ))
(i) By applying to_date, we are changing the datatype of this column (string) to Date datatype.
Also, we are informing to_date that the format in this string column is yyyy-MM-dd so read the column accordingly.
(ii) Next, we are applying date_format to achieve the date format we require which is MM/dd/yyyy.
When time component is involved, use to_timestamp instead of to_date.
Note that 'MM' represents month and 'mm' represents minutes.