Validate and change date formats in PySpark

I have a date column with different date formats. Now I want to validate it against a particular format ('MM-dd-yyyy'), and the values that do not match have to be converted to that desired format.
df = sc.parallelize([['12-21-2006'],
                     ['05/30/2007'],
                     ['01-01-1984'],
                     ['22-12-2017'],
                     ['12222019']]).toDF(["Date"])
df.show()
+----------+
| Date|
+----------+
|12-21-2006|
|05/30/2007|
|01-01-1984|
|22-12-2017|
| 12222019|
+----------+
Now to validate:
import pyspark.sql.functions as F

correct = df.filter(~F.col("Date").isNotNull() |
                    F.to_date(F.col("Date"), 'MM-dd-yyyy').isNotNull())
correct.show()
+----------+
| Date|
+----------+
|12-21-2006|
|01-01-1984|
+----------+
Now I extracted the wrong records, which are as follows:
wrong = df.exceptAll(correct)
wrong.show()
+----------+
| Date|
+----------+
|05/30/2007|
| 12222019|
|22-12-2017|
+----------+
Now these wrong records have to be converted to the desired format, which is 'MM-dd-yyyy'.
If there were a single format I could have converted it by specifying that particular format, but how do I convert several different date formats into one desired format? Is there a solution for this?

You could try out the different date formats in separate columns and then take the first non-null value using coalesce:
df.withColumn("d1", F.to_date(F.col("Date"),'MM-dd-yyyy')) \
.withColumn("d2", F.to_date(F.col("Date"),'MM/dd/yyyy')) \
.withColumn("d3", F.to_date(F.col("Date"),'dd-MM-yyyy')) \
.withColumn("d4", F.to_date(F.col("Date"),'MMddyyyy')) \
.withColumn("result", F.coalesce("d1", "d2", "d3", "d4")) \
.show()
Output:
+----------+----------+----------+----------+----------+----------+
| Date| d1| d2| d3| d4| result|
+----------+----------+----------+----------+----------+----------+
|12-21-2006|2006-12-21| null| null| null|2006-12-21|
|05/30/2007| null|2007-05-30| null| null|2007-05-30|
|01-01-1984|1984-01-01| null|1984-01-01| null|1984-01-01|
|22-12-2017| null| null|2017-12-22| null|2017-12-22|
| 12222019| null| null| null|2019-12-22|2019-12-22|
+----------+----------+----------+----------+----------+----------+
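If you want the result back as a string in the desired 'MM-dd-yyyy' layout rather than as a date column, a minimal follow-up (assuming the same DataFrame and the functions-as-F import from above) is to run the parsed value through date_format:
formats = ['MM-dd-yyyy', 'MM/dd/yyyy', 'dd-MM-yyyy', 'MMddyyyy']
# coalesce picks the first format that parses successfully
parsed = F.coalesce(*[F.to_date(F.col("Date"), f) for f in formats])
# date_format renders the parsed date back as an 'MM-dd-yyyy' string
df.withColumn("Date", F.date_format(parsed, 'MM-dd-yyyy')).show()
Note that an ambiguous value such as 01-01-1984 is resolved by whichever format comes first in the list.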

Related

How to add an incremental date value with respect to the first row's value in a Spark dataframe

Input :
+------+--------+
|Test  |01-12-20|
|Ravi  |    null|
|Son   |    null|
+------+--------+
Expected output :
+------+--------+
|Test  |01-12-20|
|Ravi  |02-12-20|
|Son   |03-12-20|
+------+--------+
I tried with .withColumn(col("dated"),date_add(col("dated"),1));
But this results in NULL for all the column values.
Could you please help me get incremental values in the second (date) column?
This will be a working solution for you
Input
import pyspark.sql.functions as F
from pyspark.sql import Window as W

df = spark.createDataFrame([("Test", "01-12-20"), ("Ravi", None), ("Son", None)], ["col1", "col2"])
df.show()
df = df.withColumn("col2", F.to_date(F.col("col2"), "dd-MM-yy"))
# a dummy col for window function
df = df.withColumn("del_col", F.lit(0))
_w = W.partitionBy(F.col("del_col")).orderBy(F.col("del_col").desc())
df = df.withColumn("rn_no", F.row_number().over(_w)-1)
# copy the first row's date to every row
df = df.withColumn("dated", F.first("col2").over(_w))
df = df.selectExpr('*', 'date_add(dated, rn_no) as next_date')
df.show()
DF
+----+--------+
|col1| col2|
+----+--------+
|Test|01-12-20|
|Ravi| null|
| Son| null|
+----+--------+
Final Output
+----+----------+-------+-----+----------+----------+
|col1| col2|del_col|rn_no| dated| next_date|
+----+----------+-------+-----+----------+----------+
|Test|2020-12-01| 0| 0|2020-12-01|2020-12-01|
|Ravi| null| 0| 1|2020-12-01|2020-12-02|
| Son| null| 0| 2|2020-12-01|2020-12-03|
+----+----------+-------+-----+----------+----------+
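If the result should match the original dd-MM-yy string layout and the helper columns are not needed, a small follow-up on the frame produced above (just a sketch) could tidy it up:
# render next_date back as a dd-MM-yy string and keep only the original columns
result = df.withColumn("col2", F.date_format("next_date", "dd-MM-yy")).select("col1", "col2")
result.show()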

WeekOfYear column getting null in Spark SQL

Here I am writing a SQL statement for spark.sql, but I am not able to get WEEKOFYEAR to return the week of the year; I am getting null in the output.
Below is the expression I am using.
Input Data
InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,01-12-2010 8.26,2.55,17850,United Kingdom
536365,71053,WHITE METAL LANTERN,6,01-12-2010 8.26,3.39,17850,United Kingdom
536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,01-12-2010 8.26,2.75,17850,United Kingdom
536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,01-12-2010 8.26,3.39,17850,United Kingdom
SQL CODE
val summarySQlTest = spark.sql(
"""
|select Country,WEEKOFYEAR(InvoiceDate)as WeekNumber,
|count(distinct(InvoiceNo)) as NumInvoices,
|sum(Quantity) as TotalQuantity,
|round(sum(Quantity*UnitPrice),2) as InvoiceValue
|from sales
|group by Country,WeekNumber
|""".stripMargin
).show()
DESIRED OUTPUT
+--------------+----------+-----------+-------------+------------+
| Country|WeekNumber|NumInvoices|TotalQuantity|InvoiceValue|
+--------------+----------+-----------+-------------+------------+
| Spain| 49| 1| 67| 174.72|
| Germany| 48| 11| 1795| 3309.75|
Output I am getting
+--------------+----------+-----------+-------------+------------+
| Country|WeekNumber|NumInvoices|TotalQuantity|InvoiceValue|
+--------------+----------+-----------+-------------+------------+
| Spain| null| 1| 67| 174.72|
| Germany| null| 11| 1795| 3309.75|
For the desired output I used the code below, but I want to solve the same thing in spark.sql.
It would also be great if someone could explain what is actually happening here:
to_date(col("InvoiceDate"), "dd-MM-yyyy H.mm")
val knowFunc= invoicesDF.withColumn("InvoiceDate",to_date(col("InvoiceDate"),"dd-MM-yyyy H.mm"))
.where("year(InvoiceDate) == 2010")
.withColumn("WeekNumber",weekofyear(col("InvoiceDate")))
.groupBy("Country","WeekNumber")
.agg(sum("Quantity").as("TotalQuantity"),
round(sum(expr("Quantity*UnitPrice")),2).as("InvoiceValue")).show()
You'll need to convert the InvoiceDate column to date type first (using to_date), before you can call weekofyear. I guess this also answers your last question.
val summarySQlTest = spark.sql(
"""
|select Country,WEEKOFYEAR(to_date(InvoiceDate,'dd-MM-yyyy H.mm')) as WeekNumber,
|count(distinct(InvoiceNo)) as NumInvoices,
|sum(Quantity) as TotalQuantity,
|round(sum(Quantity*UnitPrice),2) as InvoiceValue
|from sales
|group by Country,WeekNumber
|""".stripMargin
).show()
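If you also want the year-2010 filter from your DataFrame version, or simply want to avoid repeating the to_date call, one possible variant (sketched here in PySpark syntax, untested against your data) converts the date once in a subquery and reuses it:
summary = spark.sql("""
    select Country, weekofyear(InvoiceDt) as WeekNumber,
           count(distinct InvoiceNo) as NumInvoices,
           sum(Quantity) as TotalQuantity,
           round(sum(Quantity * UnitPrice), 2) as InvoiceValue
    from (select *, to_date(InvoiceDate, 'dd-MM-yyyy H.mm') as InvoiceDt from sales) s
    where year(InvoiceDt) = 2010
    group by Country, WeekNumber
""")
summary.show()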

Pyspark - Convert specific string to date format

I have a PySpark dataframe with a string column in the format Mon-YY, e.g. 'Jan-17', and I am attempting to convert it into a date column.
I've tried to do it like this, but it does not work out:
df.select(to_timestamp(df.t, 'MON-YY HH:mm:ss').alias('dt'))
Is it possible to do it as in SQL, or do I need to write a special function for the conversion?
You should use a valid Java date format pattern; note the lowercase 'yy' for the year, since uppercase 'YY' is the week-based year and gives surprising results. The following will work:
import pyspark.sql.functions as psf
df.select(psf.to_timestamp(psf.col('t'), 'MMM-yy HH:mm:ss').alias('dt'))
Jan-17 will become 2017-01-01 in that case
Example
df = spark.createDataFrame([("Jan-17 00:00:00",'a'),("Apr-19 00:00:00",'b')], ['t','x'])
df.show(2)
+---------------+---+
| t| x|
+---------------+---+
|Jan-17 00:00:00| a|
|Apr-19 00:00:00| b|
+---------------+---+
Conversion to timestamp:
import pyspark.sql.functions as psf
df.select(psf.to_timestamp(psf.col('t'), 'MMM-yy HH:mm:ss').alias('dt')).show(2)
+-------------------+
| dt|
+-------------------+
|2017-01-01 00:00:00|
|2019-04-01 00:00:00|
+-------------------+
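If you only need a date rather than a timestamp, to_date accepts the same pattern (a small sketch on the same df as above):
df.select(psf.to_date(psf.col('t'), 'MMM-yy HH:mm:ss').alias('dt')).show(2)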

How to retrieve the month from a date column's values in a Scala dataframe?

Given:
val df = Seq((1L, "04-04-2015")).toDF("id", "date")
val df2 = df.withColumn("month", from_unixtime(unix_timestamp($"date", "dd/MM/yy"), "MMMMM"))
df2.show()
I got this output:
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015| null|
+---+----------+-----+
However, I want the output to be as below:
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015|April|
+---+----------+-----+
How can I do that in sparkSQL using Scala?
This should do it:
val df2 = df.withColumn("month", date_format(to_date($"date", "dd-MM-yyyy"), "MMMM"))
df2.show
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015|April|
+---+----------+-----+
NOTE:
The first string (to_date) must match the format of your existing date.
Be careful with "dd-MM-yyyy" vs "MM-dd-yyyy".
The second string (date_format) is the format of the output.
Docs: to_date, date_format
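For reference, the equivalent in PySpark (a sketch, assuming a DataFrame with the same id and date columns) would be:
import pyspark.sql.functions as F
# parse the dd-MM-yyyy string to a date, then render the full month name
df2 = df.withColumn("month", F.date_format(F.to_date(F.col("date"), "dd-MM-yyyy"), "MMMM"))
df2.show()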
There is nothing wrong with your code; just make the format string match your date column, i.e. use "dd-MM-yyyy" instead of "dd/MM/yy".
Not exactly related to this question, but for anyone who wants the month as an integer there is a month function (parse the string to a date first):
val df2 = df.withColumn("month", month(to_date($"date", "dd-MM-yyyy")))
df2.show
+---+----------+-----+
| id| date|month|
+---+----------+-----+
| 1|04-04-2015| 4|
+---+----------+-----+
In the same way you can use the year function to get only the year.

Compare dates in scala present in dataframe column

I am trying to compare dates in a filter as below.
The dataframe KIN_PRC_FILE has a column pos_price_expiration_dt that holds values like 9999-12-31.
val formatter = new SimpleDateFormat("yyyy-MM-dd");
val CURRENT_DATE = formatter.format(Calendar.getInstance().getTime());
val FILT_KMART_KIN_DATA= KIN_PRC_FILE.filter(s"(pos_price_expiration_dt)>=$CURRENT_DATE AND pos_price_type_cd").show(10)
but it seems the above query returns null records; can somebody help me understand what is wrong here?
You just need to add single quotes around your CURRENT_DATE variable:
KIN_PRC_FILE.filter(s"pos_price_expiration_dt >= '$CURRENT_DATE'")
Quick example here
INPUT
df.show
+-----------------------+---+
|pos_price_expiration_dt| id|
+-----------------------+---+
| 2018-11-20| a|
| 2018-12-28| b|
| null| c|
+-----------------------+---+
OUTPUT
df.filter(s"pos_price_expiration_dt>='$CURRENT_DATE'").show
+-----------------------+---+
|pos_price_expiration_dt| id|
+-----------------------+---+
| 2018-12-28| b|
+-----------------------+---+
Note that you are comparing strings that happen to hold date values. Because the yyyy-MM-dd format puts the most significant field first, lexicographic string comparison works here, but it is not always safe.
You should consider casting the column to the date type before doing such comparisons.
And for the current date you can always use the built-in current_date. Check this out:
scala> val KIN_PRC_FILE = Seq(("2018-11-01"),("2018-11-15"),("2018-11-30"),("2018-11-28"),(null)).toDF("pos_price_expiration_dt").withColumn("pos_price_expiration_dt",'pos_price_expiration_dt.cast("date"))
KIN_PRC_FILE: org.apache.spark.sql.DataFrame = [pos_price_expiration_dt: date]
scala> KIN_PRC_FILE.printSchema
root
|-- pos_price_expiration_dt: date (nullable = true)
scala> KIN_PRC_FILE.show
+-----------------------+
|pos_price_expiration_dt|
+-----------------------+
| 2018-11-01|
| 2018-11-15|
| 2018-11-30|
| 2018-11-28|
| null|
+-----------------------+
scala> KIN_PRC_FILE.filter(s"pos_price_expiration_dt >= current_date ").show
+-----------------------+
|pos_price_expiration_dt|
+-----------------------+
| 2018-11-30|
| 2018-11-28|
+-----------------------+
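For PySpark users, the same filter against the built-in current date (a sketch, assuming the column has already been cast to date) looks like:
import pyspark.sql.functions as F
KIN_PRC_FILE.filter(F.col("pos_price_expiration_dt") >= F.current_date()).show()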