Multiple formats in Date Time column in Spark - scala

I am using Spark 3.0.1.
I have the following data as CSV:
348702330256514,37495066290,9084849,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,330148,33946,614677375609919,11-02-2018 00:00:00,GENUINE
348702330256514,37495066290,136052,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,4310362,33946,614677375609919,11-02-2018 00:00:00,GENUINE
348702330256514,37495066290,9097094,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,2291118,33946,614677375609919,11-02-2018 00:00:00,GENUINE
348702330256514,37495066290,4900011,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,633447,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,6259303,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,369067,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,1193207,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,9335696,33946,614677375609919,11-02-2018 0:00:00,GENUINE
As you can see, the second-to-last column holds timestamp data in which the hour appears with either one or two digits, depending on the hour of the day (this is sample data; not all records have all zeros in the time part).
This is the problem, and I tried to solve it as follows:
Read the column as a String and then use a Column method to convert it to TimestampType.
val schema = StructType(
  List(
    StructField("_corrupt_record", StringType),
    StructField("card_id", LongType),
    StructField("member_id", LongType),
    StructField("amount", IntegerType),
    StructField("postcode", IntegerType),
    StructField("pos_id", LongType),
    StructField("transaction_dt", StringType),
    StructField("status", StringType)
  )
)
// format the timestamp column
def format_time_column(timeStampCol: Column,
                       formats: Seq[String] = Seq("dd-MM-yyyy HH:mm:ss", "dd-MM-yyyy H:mm:ss",
                                                  "dd-MM-yyyy HH:m:ss", "dd-MM-yyyy H:m:ss")) = {
  coalesce(
    formats.map(f => to_timestamp(timeStampCol, f)): _*
  )
}
val cardTransaction = spark.read
  .format("csv")
  .option("header", false)
  .schema(schema)
  .option("path", cardTransactionFilePath)
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .load
  .withColumn("transaction_dt", format_time_column(col("transaction_dt")))

cardTransaction.cache()
cardTransaction.show(5)
This code fails with a parsing exception.
*Note:
The failing record is one whose hour part has only one digit.
Whichever format comes first in the list is the only one applied; the remaining formats are never considered.
The problem is that when to_timestamp() encounters a non-matching format it throws an exception instead of producing the null that coalesce() expects.
How can I solve this?

In Spark 3.0, pattern strings are defined in Datetime Patterns for Formatting and Parsing, implemented via DateTimeFormatter under the hood.
In Spark version 2.4 and below, java.text.SimpleDateFormat was used for timestamp/date string conversions, and the supported patterns are described in SimpleDateFormat.
The old behavior can be restored by setting spark.sql.legacy.timeParserPolicy to LEGACY:
sparkConf.set("spark.sql.legacy.timeParserPolicy","LEGACY")
Doc:
sql-migration-guide.html#query-engine
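For reference, a minimal sketch of applying that setting to the pipeline from the question (it reuses the spark session, schema, cardTransactionFilePath and format_time_column defined above):
import org.apache.spark.sql.functions.col

// With the LEGACY policy, to_timestamp() falls back to the pre-3.0 parser,
// which returns null on a pattern mismatch instead of throwing, so the
// coalesce() over several formats behaves as intended again.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

val cardTransaction = spark.read
  .format("csv")
  .option("header", false)
  .schema(schema)
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .load(cardTransactionFilePath)
  .withColumn("transaction_dt", format_time_column(col("transaction_dt")))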

Related

How to format datatype to TimestampType in spark DataFrame- Scala

I'm trying to cast a column to TimestampType; the values are in the format "11/14/2022 4:48:24 PM". However, when I display the results I see the values as null.
Here is the sample code that I'm using to cast the timestamp field.
val messages = df.withColumn("Offset", $"Offset".cast(LongType))
.withColumn("Time(readable)", $"EnqueuedTimeUtc".cast(TimestampType))
.withColumn("Body", $"Body".cast(StringType))
.select("Offset", "Time(readable)", "Body")
display(messages)
Is there any other way I can try to avoid the null values?
Instead of casting to TimestampType, you can use to_timestamp function and provide the time format explicitly, like so:
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import spark.implicits._
val time_df = Seq((62536, "11/14/2022 4:48:24 PM"), (62537, "12/14/2022 4:48:24 PM")).toDF("Offset", "Time")
val messages = time_df
.withColumn("Offset", $"Offset".cast(LongType))
.withColumn("Time(readable)", to_timestamp($"Time", "MM/dd/yyyy h:mm:ss a"))
.select("Offset", "Time(readable)")
messages.show(false)
+------+-------------------+
|Offset|Time(readable) |
+------+-------------------+
|62536 |2022-11-14 16:48:24|
|62537 |2022-12-14 16:48:24|
+------+-------------------+
messages: org.apache.spark.sql.DataFrame = [Offset: bigint, Time(readable): timestamp]
One thing to remember is that you will have to set a Spark configuration to allow the legacy time parser policy:
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

Convert date to another format Scala Spark

I am reading a CSV that contains two types of date:
dd-MMM-yyyy hh:mm:ss -> 13-Dec-2019 17:10:00
dd/MM/yyyy hh:mm -> 11/02/2020 17:33
I am trying to transform all dates of the first type into the second type but I can't find a good solution. I am trying this:
val pr_date = readeve.withColumn("Date", when(to_date(col("Date"),"dd-MMM-yyyy hh:mm:ss").isNotNull,
to_date(col("Date"),"dd/MM/yyyy hh:mm")))
pr_date.show(25)
And I get the entire Date column as null values.
I also tried this function:
def to_date_(col: Column,
             formats: Seq[String] = Seq("dd-MMM-yyyy hh:mm:ss", "dd/MM/yyyy hh:mm")) = {
  coalesce(formats.map(f => to_date(col, f)): _*)
}
val p2 = readeve.withColumn("Date",to_date_(readeve.col(("Date")))).show(125)
And for the first type of date I get nulls too.
What am I doing wrong? (I'm new to Scala Spark.)
Scala version: 2.11.7
Spark version: 2.4.3
Try the code below. Note that hour 17 requires HH, not hh. Also use to_timestamp instead of to_date because you want to keep the time.
val pr_date = readeve.withColumn(
  "Date",
  coalesce(
    date_format(to_timestamp(col("Date"), "dd-MMM-yyyy HH:mm:ss"), "dd/MM/yyyy HH:mm"),
    date_format(to_timestamp(col("Date"), "dd/MM/yyyy HH:mm"), "dd/MM/yyyy HH:mm")
  )
)
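If more input formats show up later, the same idea generalizes into a small helper along the lines of the to_date_ function from the question (the normalize_date name is made up for this sketch; the output pattern is the dd/MM/yyyy HH:mm target):
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, col, date_format, to_timestamp}

// Try each input format in turn and render whichever one matches
// in the desired output format; non-matching formats yield null.
def normalize_date(c: Column,
                   inFormats: Seq[String] = Seq("dd-MMM-yyyy HH:mm:ss", "dd/MM/yyyy HH:mm"),
                   outFormat: String = "dd/MM/yyyy HH:mm"): Column =
  coalesce(inFormats.map(f => date_format(to_timestamp(c, f), outFormat)): _*)

val pr_date = readeve.withColumn("Date", normalize_date(col("Date")))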

Parse dates with microseconds precision with dataframe in Spark

I have a csv file:
Name;Date
A;2018-01-01 10:15:25.123456
B;2018-12-31 10:15:25.123456
I try to parse with Spark Dataframe:
val df = spark.read.format(source = "csv")
  .option("header", true)
  .option("delimiter", ";")
  .option("inferSchema", true)
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSS")
  .load("file.csv") // load the CSV shown above (file name assumed)
But the resulting Dataframe is (wrongly) truncated at the millisecond:
scala> df.show(truncate=false)
+---+-----------------------+
|Nom|Date |
+---+-----------------------+
|A |2018-01-01 10:17:28.456|
|B |2018-12-31 10:17:28.456|
+---+-----------------------+
df.first()(1).asInstanceOf[Timestamp].getNanos()
res51: Int = 456000000
Bonus question: read with nanoseconds precision
SSSSSS means milliseconds, not microseconds:
java.util.Date format SSSSSS: if not microseconds what are the last 3 digits?,
https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html
So if you need microseconds you should parse the date with custom code:
Handling microseconds in Spark Scala
Bonus answer: Spark SQL stores timestamps with microsecond precision internally, so to keep nanoseconds you could store them as a string, in a separate field, or with any other custom solution.
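As one illustration of the custom-code route, a minimal sketch that parses the microsecond-precision strings with java.time instead of relying on timestampFormat (the parseMicros name and the file path are assumptions):
import java.sql.Timestamp
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.{col, udf}

// Parse "yyyy-MM-dd HH:mm:ss.SSSSSS" ourselves; java.sql.Timestamp keeps
// nanosecond precision, so the microseconds survive the conversion.
val parseMicros = udf { s: String =>
  val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS")
  Timestamp.valueOf(LocalDateTime.parse(s, fmt))
}

val dfMicros = spark.read
  .option("header", true)
  .option("delimiter", ";")
  .format("csv")
  .load("file.csv") // same CSV as above (file name assumed)
  .withColumn("Date", parseMicros(col("Date")))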

Scala : how to convert integer to time stamp

I am facing an issue when trying to find the number of months between two dates using the months_between function. When my input date format is dd/MM/yyyy (or similar) the function returns the correct output; however, when the input dates are in yyyyMMdd format I get the error below.
Code:
val df = spark.read.option("header", "true").option("dateFormat", "yyyyMMdd").option("inferSchema", "true").csv("MyFile.csv")
val filteredMemberDF = df.withColumn("monthsBetween", functions.months_between(col("toDate"), col("fromDT")))
error:
cannot resolve 'months_between(toDate, fromDT)' due to data type mismatch: argument 1 requires timestamp type,
however, 'toDate' is of int type. argument 2 requires timestamp type, however, 'fromDT' is of int type.;
When my input is as below,
id fromDT toDate
11 16/06/2008 16/08/2008
12 13/07/2008 13/10/2008
I get the expected output:
id fromDT toDate monthsBetween
11 16/6/2008 16/8/2008 2
12 13/7/2008 13/10/2008 3
When I pass the data below, I get the error described above.
id fromDT toDate
11 20150930 20150930
12 20150930 20150930
You first need to use the to_date function to convert those numbers to dates.
import org.apache.spark.sql.functions._
val df = spark.read
.option("header", "true")
.option("dateFormat", "yyyyMMdd")
.option("inferSchema", "true")
.csv("MyFile.csv")
// concat() turns the integer columns into strings so that to_date can parse them
val dfWithDates = df
  .withColumn("toDateReal", to_date(concat(col("toDate")), "yyyyMMdd"))
  .withColumn("fromDateReal", to_date(concat(col("fromDT")), "yyyyMMdd"))
val filteredMemberDF = dfWithDates
.withColumn("monthsBetween", months_between(col("toDateReal"), col("fromDateReal")))

Timestamp changes format when writing to csv file spark

I am trying to save a dataframe that contains a timestamp to a csv file.
The problem is that this column changes format once it is written to the csv file. Here is the code I used:
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header",true).option("inferSchema", "true").csv("C:/Users/mhattabi/Desktop/dataTest2.csv")
//val df = spark.read.option("header",true).option("inferSchema", "true").csv("C:\\dataSet.csv\\datasetTest.csv")
//convert all columns to numeric values in order to apply aggregation functions
df.columns.map { c => df.withColumn(c, col(c).cast("int")) }
//add a new column including the new timestamp column
val result2=df.withColumn("new_time",((unix_timestamp(col("time"))/300).cast("long") * 300).cast("timestamp")).drop("time")
val finalresult=result2.groupBy("new_time").agg(result2.drop("new_time").columns.map((_ -> "mean")).toMap).sort("new_time") //agg(avg(all columns..)
finalresult.coalesce(1).write.option("header",true).option("inferSchema","true").csv("C:/mydata.csv")
When displayed via df.show it shows the correct format.
But in the csv file the format is different.
Use an option to format the timestamp the way you need:
finalresult.coalesce(1).write.option("header",true).option("inferSchema","true").option("dateFormat", "yyyy-MM-dd HH:mm:ss").csv("C:/mydata.csv")
or
finalresult.coalesce(1).write.format("csv").option("delimiter", "\t").option("header",true).option("inferSchema","true").option("dateFormat", "yyyy-MM-dd HH:mm:ss").option("escape", "\\").save("C:/mydata.csv")
Here is the code snippet that worked for me to modify the CSV output format for timestamps.
I needed a 'T' character in there, and no seconds or microseconds. The timestampFormat option did work for this.
DF.write
  .mode(SaveMode.Overwrite)
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm")
  .csv(outputPath) // outputPath is a placeholder for the destination
Such as 2017-02-20T06:53
If you substitute a space for 'T' then you get this:
DF.write
  .mode(SaveMode.Overwrite)
  .option("timestampFormat", "yyyy-MM-dd HH:mm")
  .csv(outputPath) // outputPath is a placeholder for the destination
Such as 2017-02-20 06:53