How to push down a date filter in PySpark if the date column is not stored in yyyy-MM-dd format? - pyspark

I have a parquet in which there is a date column. If the date column in the parquet is in the "yyyy-MM-dd" format then I can apply any filter on the date column and it gets pushed down.
It is achieved the following way:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, DateType

schema_sales = StructType([..., StructField("date", DateType()), ...])
df_sales = spark.read.schema(schema_sales).load(".../sales_parquet")
df_sales.filter(F.col("date") > F.to_date(F.lit("2020-07-09"))).explain("formatted")
The filter gets pushed down, as can be seen in the explain("formatted") output:
== Physical Plan ==
* Filter (3)
+- * ColumnarToRow (2)
   +- Scan parquet (1)

(1) Scan parquet
Output [6]: [order_id#0, product_id#1, seller_id#2, date#3, num_pieces_sold#4, bill_raw_text#5]
Batched: true
Location: InMemoryFileIndex [.../sales_parquet]
PushedFilters: [IsNotNull(date), GreaterThan(date,2020-07-09)]
ReadSchema: struct<order_id:int,product_id:int,seller_id:int,date:date,num_pieces_sold:string,bill_raw_text:int>

(2) ColumnarToRow [codegen id : 1]
Input [6]: [order_id#0, product_id#1, seller_id#2, date#3, num_pieces_sold#4, bill_raw_text#5]

(3) Filter [codegen id : 1]
Input [6]: [order_id#0, product_id#1, seller_id#2, date#3, num_pieces_sold#4, bill_raw_text#5]
Condition : (isnotnull(date#3) AND (date#3 > 18452))
Question:
If my date column is not in the yyyy-MM-dd format, how do I achieve something similar?
I am struggling because:
- I cannot specify the date format in DateType(), since it accepts no such argument. So it seems the date column has to be read in as a string and converted to a date.
- I do not think one can push down cast operations or to_date() transformations.
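A minimal sketch of one workaround, under the assumption that the file stores the date as a string in a lexicographically ordered format such as yyyyMMdd: a comparison against a string literal is still pushed down to the parquet scan, while a filter on to_date() (or a cast) stays in the plan but is not pushed down.

from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType

# Assumption: the parquet stores the date as a string like "20200709"
# (schema truncated to the one column; path reused from above).
schema_sales_str = StructType([StructField("date", StringType())])
df_sales_str = spark.read.schema(schema_sales_str).load(".../sales_parquet")

# String comparison on the raw column: the scan node should list it
# under PushedFilters, e.g. GreaterThan(date,20200709).
df_sales_str.filter(F.col("date") > "20200709").explain("formatted")

# Wrapping the column in to_date() keeps the filter in the plan, but it
# is no longer pushed down to the parquet scan.
df_sales_str.filter(
    F.to_date(F.col("date"), "yyyyMMdd") > F.to_date(F.lit("2020-07-09"))
).explain("formatted")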

Related

Regular Expression for date extraction

I have a file named Yonder_CompetitionEntries_20210928080000, and I want to extract 20210928, basically the year, month and day.
So far I have this, and it's not working. The file has a .csv.gz extension.
date_key = """RIGHT(regexp_replace(regexp_replace(filename,'.gz',''),'.csv',''), 8)"""
What about doing it without regex?
We split on _ and take the last element, then take the first eight characters of the resulting string.
date_key = """substring(element_at(split(filename, '_'), -1), 1, 8)"""

How to convert from decimal to date in scala select?

I have a datetime column declared as decimal(38,0), not timestamp or date, and the input data is in 'yyyyMMdd' format. How do I select data, with that column converted to the 'yyyy-MM-dd' date format, in Spark SQL (Scala), restricted to rows that are one or two days old?
I have tried:
select count(*) from table_name where to_date('column_name', 'yyyy-MM-dd') = date_sub(current_date(), 1);
This gives me a count of 0, even though the table has well over 500,000 records.
I tried:
select count(*) from table_name where from_unixtime(cast(load_dt_id as string), 'yyyy-MM-dd') = date_sub(current_date(), 1);
I got dates in the year 1970 (e.g. 1970-01-31), although no 1970 data exists in the table; even when I select that column with LIKE '1970%', I get "OK" with a bolt sign indicating the query is accelerated with Delta. The data, when selected in order of that column, starts at 20140320.
The format argument for to_date is the format of the input, not the desired output. Assuming you have yyyyMMdd:
Seq(("20200208")).toDF("RawDate").select(col("RawDate"),to_date(col("RawDate"),"yyyyMMdd").as("formatted_date")).show()
+--------+--------------+
| RawDate|formatted_date|
+--------+--------------+
|20200208|    2020-02-08|
+--------+--------------+
Expanding this to filter by the derived date column:
val raw = Seq(("20200208"),("20200209"),("20200210")).toDF("RawDate")
raw: org.apache.spark.sql.DataFrame = [RawDate: string]
raw.select(col("RawDate"),to_date(col("RawDate"),"yyyyMMdd").as("formatted_date")).filter($"formatted_date".geq(date_add(current_date,-1))).show
+--------+--------------+
| RawDate|formatted_date|
+--------+--------------+
|20200209|    2020-02-09|
|20200210|    2020-02-10|
+--------+--------------+
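Since the original column is a decimal(38,0) rather than a string, the same idea in PySpark would cast it first. A sketch only: the column name load_dt_id is taken from the question, everything else is illustrative.

from pyspark.sql import functions as F

# Parse the decimal(38,0) yyyyMMdd value into a proper date column.
parsed = df.withColumn(
    "load_date",
    F.to_date(F.col("load_dt_id").cast("string"), "yyyyMMdd"),
)

# Rows loaded within the last two days.
recent = parsed.filter(F.col("load_date") >= F.date_sub(F.current_date(), 2))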

create a timestamp from month and year string columns in PySpark

I want to create a timestamp column, to be used for a line chart, from two columns containing the month and the year respectively.
The df looks like this:
I know I can concatenate the columns into a string and then convert it to a datetime column:
df.select(
    '*',
    concat(lit('01'), df['month'], df['year']).alias('date')
).withColumn('date', col('date').cast(TimestampType()))
But I wanted a cleaner approach using an inbuilt PySpark functionality that can also help me create other date parts, like week number, quarters, etc. Any suggestions?
You will have to concatenate the string once, make the timestamp type column and then you can easily extract week, quarter etc.
You can use this function (and edit it to create whatever other columns you need as well):
from pyspark.sql import functions as F

def spark_date_parsing(df, date_column, date_format):
    """
    Parses the date column given the date format in a spark dataframe
    NOTE: This is a Pyspark implementation

    Parameters
    ----------
    :param df: Spark dataframe having a date column
    :param date_column: Name of the date column
    :param date_format: Simple Date Format (Java-style) of the dates in the date column

    Returns
    -------
    :return: A spark dataframe with a parsed date column
    """
    df = df.withColumn(date_column, F.to_timestamp(F.col(date_column), date_format))

    # Spark returns null if the parsing fails, so first check the count of null values.
    # If parse_fail_count == 0, return the parsed dataframe, else raise an error.
    parse_fail_count = df.select(
        F.count(F.when(F.col(date_column).isNull(), date_column))
    ).collect()[0][0]

    if parse_fail_count == 0:
        return df
    else:
        raise ValueError(
            f"Incorrect date format '{date_format}' for date column '{date_column}'"
        )
Usage (with whatever is your resultant date format):
df = spark_date_parsing(df, "date", "dd/MM/yyyy")
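Once the column is a proper timestamp, the other date parts mentioned above can be derived with built-in functions. A small illustrative follow-up:

parsed = spark_date_parsing(df, "date", "dd/MM/yyyy")

parsed = (
    parsed
    .withColumn("week_of_year", F.weekofyear("date"))
    .withColumn("quarter", F.quarter("date"))
    .withColumn("year", F.year("date"))
)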

Difference between Date column and other date

I would like to find the difference between a Date column in a Spark Dataset and a date value that is not a column.
If both were columns, I would do:
datediff(col("dateStart"), col("dateEnd"))
Since I want the difference between col("dateStart") and another date that is not a column, I currently do:
val dsWithRunDate = detailedRecordsDs.withColumn(RunDate,
  lit(runDate.toString))
val dsWithTimeDiff = dsWithRunDate.withColumn(DateDiff,
  datediff(to_date(col(RunDate)), col(DateCol)))
Is there a better way to do this than adding one more column and then finding the difference?
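A sketch of a slimmer alternative (shown in PySpark, with illustrative names): the literal date can be passed straight into datediff, so no extra column is needed.

from pyspark.sql import functions as F

run_date = "2020-07-09"  # the non-column date; illustrative value

result = detailed_records_df.withColumn(
    "date_diff",
    F.datediff(F.to_date(F.lit(run_date)), F.col("dateCol")),
)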

How to change the column type from String to Date in DataFrames?

I have a dataframe that has two columns (C, D) defined as string column type, but the data in the columns are actually dates. For example, column C has the date as "01-APR-2015" and column D as "20150401". I want to change these to date column type, but I haven't found a good way of doing that. I looked on Stack Overflow for how to convert a string column type to a Date column type in Spark SQL's DataFrame, where the date format can be "01-APR-2015", and I looked at this post, but it didn't have info related to dates.
Spark >= 2.2
You can use to_date:
import org.apache.spark.sql.functions.{to_date, to_timestamp}
df.select(to_date($"ts", "dd-MMM-yyyy").alias("date"))
or to_timestamp:
df.select(to_timestamp($"ts", "dd-MMM-yyyy").alias("timestamp"))
Neither requires an intermediate unix_timestamp call.
Spark < 2.2
Since Spark 1.5 you can use the unix_timestamp function to parse the string to a long, cast it to a timestamp, and truncate it with to_date:
import org.apache.spark.sql.functions.{unix_timestamp, to_date}
val df = Seq((1L, "01-APR-2015")).toDF("id", "ts")
df.select(to_date(unix_timestamp(
$"ts", "dd-MMM-yyyy"
).cast("timestamp")).alias("timestamp"))
Note:
Depending on your Spark version, this may require some adjustments due to SPARK-11724:
Casting from integer types to timestamp treats the source int as being in millis. Casting from timestamp to integer types creates the result in seconds.
If you use an unpatched version, the unix_timestamp output requires multiplication by 1000.
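For the question's two columns specifically, a quick PySpark sketch of the Spark >= 2.2 approach (the column names C and D are taken from the question; everything else is illustrative):

from pyspark.sql import functions as F

df = (
    df
    .withColumn("C_date", F.to_date("C", "dd-MMM-yyyy"))  # e.g. "01-APR-2015"
    .withColumn("D_date", F.to_date("D", "yyyyMMdd"))     # e.g. "20150401"
)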