Issue with date and inferSchema option in Spark 3.1 - Scala

I have a CSV file with a date column as shown below,
datecol
----------
2021-01-11
2021-02-15
2021-02-10
2021-04-22
If I read this file with inferSchema enabled in Spark 2.4.5, I get the schema below:
root
|-- datecol: timestamp (nullable = true)
But in Spark 3.1 the output is:
root
|-- datecol: string (nullable = true)
I have checked the migration guide in the Spark documentation but didn't find any information about this.
Could anyone please confirm whether this is a bug or whether I need to use some other configuration?

This is an effect of Spark's migration to the new Java 8 date/time API in Spark 3.0. From the migration guide:
Parsing/formatting of timestamp/date strings. This affects CSV/JSON
datasources [...]. The new implementation performs strict checking of its input. For example,
the 2015-07-22 10:00:00 timestamp cannot be parsed if the pattern is
yyyy-MM-dd because the parser does not consume the whole input. Another
example: the 31/01/2015 00:00 input cannot be parsed by the
dd/MM/yyyy hh:mm pattern because hh expects hours in the range 1-12.
In Spark version 2.4 and below, java.text.SimpleDateFormat is used for
timestamp/date string conversions [...].
In fact, inferSchema does not detect DateType, only TimestampType. And since the CSV data source's default timestampFormat is yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX], the column is not converted into a timestamp, for the reason cited above.
You can try adding the option when loading the CSV:
val df = spark.read.option("inferSchema", "true").option("timestampFormat", "yyyy-MM-dd").csv("/path/csv")
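For reference, a rough pyspark equivalent of the same fix (the header option and the /path/csv placeholder are assumptions on my part, not part of the original answer):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read
      .option("header", "true")                 # assuming the file has a header row with "datecol"
      .option("inferSchema", "true")
      .option("timestampFormat", "yyyy-MM-dd")  # tell inference the exact pattern the dates use
      .csv("/path/csv"))
df.printSchema()
# datecol should now be inferred as timestamp rather than string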

to_timestamp() in pyspark giving null values

I am trying the following simple transformation.
data = [["06/15/2020 14:04:04]]
cols = ["date"]
df = spark.createDataFrame(data,cols)
df = df.withColumn("datetime",F.to_timestamp(F.col("date"),'MM/DD/YYYY HH24:MI:SS))
df.show()
But this gives me an error "All week-based patterns are unsupported since Spark 3.0, detected: Y, Please use the SQL function EXTRACT instead"
I want to parse the data in that format and convert it to a timestamp.
You should use this format: MM/dd/yyyy HH:mm:ss
Check Spark's datetime patterns documentation for all datetime-format-related details.
from pyspark.sql.functions import to_timestamp, col
df = df.withColumn("datetime", to_timestamp(col("date"), 'MM/dd/yyyy HH:mm:ss'))
df.show()
+-------------------+-------------------+
| date| datetime|
+-------------------+-------------------+
|06/15/2020 14:04:04|2020-06-15 14:04:04|
+-------------------+-------------------+
The different elements of the timestamp pattern are explained in Spark's documentation. Note that since Spark 3.0 timestamps are parsed with Java DateTimeFormatter-style patterns (java.text.SimpleDateFormat was used in Spark 2.x), which use a somewhat confusing set of format symbols. The symbol for the hour in 24-hour representation is simply H (or HH), not HH24. The minutes are m, not M, which is for the month. The year is matched by y, not Y, which is for the week-based year. Week-based patterns are unsupported since Spark 3.0, hence the message you're getting.
In your case, the proper format should be MM/dd/yyyy HH:mm:ss.

How to read only the latest 7 days of CSV files from an S3 bucket

I am trying to figure out how to read only the latest 7 days of files from a folder in an S3 bucket, using Spark with Scala.
The directory structure we have:
Assume that for today's date (Date_1) we have 2 clients and one CSV file each:
Source/Date_1/Client_1/sample_1.csv
Source/Date_1/Client_2/sample_1.csv
Tomorrow a new folder will be generated and we will have:
Source/Date_2/Client_1/sample_1.csv
Source/Date_2/Client_2/sample_1.csv
Source/Date_2/Client_3/sample_1.csv
Source/Date_2/Client_4/sample_1.csv
NOTE: we expect data for new clients to be added on any date.
Likewise, on the 7th day we can have:
Source/Date_7/Client_1/sample_1.csv
Source/Date_7/Client_2/sample_1.csv
Source/Date_7/Client_3/sample_1.csv
Source/Date_7/Client_4/sample_1.csv
So when the 8th day's data arrives, we need to exclude the Date_1 folder from being read.
How can we do this while reading the CSV files from the S3 bucket with Spark Scala?
I am reading the whole "source/*" folder so that we don't miss any client that gets added on any day.
There are various ways to do it. One of them is shown below:
You can extract the date from the file path and then filter on the last 7 days.
Below is a code snippet for pyspark; the same can be implemented in Scala.
>>> from datetime import datetime, timedelta
>>> from pyspark.sql.functions import *
# Calculate the date 7 days ago, as a yyyyMMdd integer
>>> lastDate = datetime.now() + timedelta(days=-7)
>>> lastDate = int(lastDate.strftime('%Y%m%d'))
# Source Path
>>> srcPath = "s3://<bucket-name>/.../Source/"
>>> df1 = spark.read.option("header", "true").csv(srcPath + "*/*").withColumn("Date", split(regexp_replace(input_file_name(), srcPath, ""),"/")[0].cast("long"))
>>> df2 = df1.filter(col("Date") >= lit(lastDate))
A few things might change in your final implementation: the index value [0] may differ if the path structure is different, and the condition >= can become > depending on the requirement.

pySpark Timestamp as String to DateTime

I read from a CSV where the column time contains a timestamp with milliseconds, e.g. '1414250523582'.
When I use TimestampType in the schema it returns NULL.
The only way it reads my data is to use StringType.
Now I need this value to be a datetime for further processing.
First I got rid of the too-long timestamp with this:
df2 = df.withColumn("date", col("time")[0:10].cast(IntegerType()))
A schema check says it is an integer now.
Now I try to make it a datetime with
df3 = df2.withColumn("date", datetime.fromtimestamp(col("time")))
it returns
TypeError: an integer is required (got type Column)
When I google, people always just use col("x") to read and transform data, so what am I doing wrong here?
The schema check is a bit tricky: the data in that column may be pyspark.sql.types.IntegerType, but that is not equivalent to Python's int type. The col function returns a pyspark.sql.column.Column object, which does not play nicely with vanilla Python functions like datetime.fromtimestamp; that explains the TypeError. Even though the "date" value in the actual rows is an integer, col doesn't let you access it as a plain integer to feed into a Python function. You could compile a udf to apply arbitrary Python code to that value, but in this case pyspark.sql.functions already has a solution for your unix timestamp: from_unixtime, which expects seconds. Try df3 = df2.withColumn("date", from_unixtime(col("date"))) (using the truncated seconds column you created above, not the original millisecond string), and you should see a nice date in 2014 for your example.
Small note: This "date" column will be of StringType.
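If you need an actual TimestampType rather than a string, here is a minimal sketch of one way to do it (assuming the original millisecond column is still called time, as in the question):
from pyspark.sql.functions import col

# Divide the millisecond value by 1000 to get seconds, then cast;
# casting a numeric value to timestamp interprets it as seconds since the epoch
# and keeps the fractional part.
df3 = df2.withColumn("datetime", (col("time").cast("double") / 1000).cast("timestamp"))
df3.printSchema()  # datetime: timestamp (nullable = true)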

Error while writing date from Spark to Elasticsearch due to timestamp length

I get an error when writing data to Elasticsearch from Spark. Most documents are written fine, but then I get this kind of exception:
org.elasticsearch.hadoop.rest.EsHadoopRemoteException: date_time_exception: date_time_exception: Invalid value for Year (valid values -999999999 - 999999999): -6220800000
The field mapping in Elasticsearch is "date".
The field type in pySpark is DateType, not TimestampType, which IMO should make it clear that this is a date without a time. The value shown by Spark is "1969-10-21", so a perfectly reasonable date.
(It was originally a TimestampType, read from another Elasticsearch date field, but I converted it to a DateType hoping to solve this error. I get the exact same error message, with the exact same timestamp value, whether I send Elasticsearch a TimestampType or a DateType.)
My guess is that there are three trailing zeros that shouldn't be in the timestamp sent to Elasticsearch, but I can't find any way to normalize it.
Is there an option for the org.elasticsearch.hadoop connector?
(ELK version is 7.5.2, Spark is 2.4.4)
Obvious workaround: use any type other than TimestampType or DateType,
e.g. with this UDF producing a LongType (which demonstrates that it is indeed a timestamp-length issue):
import datetime
import time

from pyspark.sql import functions as F
from pyspark.sql.types import LongType

def conv_ts(d):
    # Convert a Python datetime to a Unix timestamp in seconds
    return time.mktime(d.timetuple())

ts_udf = F.udf(lambda z: int(conv_ts(z)), LongType())
(Note that in that snippet the Spark input is a TimestampType, not a DateType, so the UDF receives a Python datetime rather than a date, because I tried messing around with time conversions too.)
Or, much more efficiently, avoid the UDF altogether by sending a StringType field containing a formatted date instead of the long timestamp, using the pyspark.sql.functions.date_format function, as sketched below.
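A minimal sketch of that approach (df and the column name my_date are placeholders, not names from the question):
from pyspark.sql import functions as F

# Hypothetical column name; format the date as a plain ISO string, which an
# Elasticsearch "date" field can parse with its default mapping.
df_out = df.withColumn("my_date", F.date_format(F.col("my_date"), "yyyy-MM-dd"))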
It is a solution, but not a really satisfying one; I would rather understand why the connector doesn't deal properly with TimestampType and DateType by adjusting the timestamp length accordingly.

PySpark - Spark SQL: how to convert timestamp with UTC offset to epoch/unixtime?

How can I convert a timestamp in the format 2019-08-22T23:57:57-07:00 into unixtime using Spark SQL or PySpark?
The most similar function I know of is unix_timestamp(), but it doesn't accept the above time format with a UTC offset.
Any suggestion on how I could approach this, preferably using Spark SQL or PySpark?
Thanks
The Java SimpleDateFormat pattern for an ISO 8601 time zone in this case is XXX.
So you need to use yyyy-MM-dd'T'HH:mm:ssXXX as your format string.
SparkSQL
spark.sql(
    """select unix_timestamp("2019-08-22T23:57:57-07:00", "yyyy-MM-dd'T'HH:mm:ssXXX")
       AS epoch"""
).show(truncate=False)
#+----------+
#|epoch |
#+----------+
#|1566543477|
#+----------+
Spark DataFrame
from pyspark.sql.functions import unix_timestamp
df = spark.createDataFrame([("2019-08-22T23:57:57-07:00",)], ["timestamp"])
df.withColumn(
    "unixtime",
    unix_timestamp("timestamp", "yyyy-MM-dd'T'HH:mm:ssXXX")
).show(truncate=False)
#+-------------------------+----------+
#|timestamp |unixtime |
#+-------------------------+----------+
#|2019-08-22T23:57:57-07:00|1566543477|
#+-------------------------+----------+
Note that pyspark is just a wrapper around Spark; generally I've found the Scala/Java docs to be more complete than the Python ones, which may be helpful to know in the future.