Timestamp changes format when writing to csv file spark - scala

I am trying to save a dataframe to a csv file, that contains a timestamp.
The problem that this column changes of format one written in the csv file. Here is the code I used:
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header",true).option("inferSchema", "true").csv("C:/Users/mhattabi/Desktop/dataTest2.csv")
//val df = spark.read.option("header",true).option("inferSchema", "true").csv("C:\\dataSet.csv\\datasetTest.csv")
//convert all column to numeric value in order to apply aggregation function
df.columns.map { c =>df.withColumn(c, col(c).cast("int")) }
//add a new column inluding the new timestamp column
val result2=df.withColumn("new_time",((unix_timestamp(col("time"))/300).cast("long") * 300).cast("timestamp")).drop("time")
val finalresult=result2.groupBy("new_time").agg(result2.drop("new_time").columns.map((_ -> "mean")).toMap).sort("new_time") //agg(avg(all columns..)
finalresult.coalesce(1).write.option("header",true).option("inferSchema","true").csv("C:/mydata.csv")
when display via df.show it shoes the correct format
But in the csv file it shoes this format:

Use option to format timestamp into desired one which you need:
finalresult.coalesce(1).write.option("header",true).option("inferSchema","true").option("dateFormat", "yyyy-MM-dd HH:mm:ss").csv("C:/mydata.csv")
or
finalresult.coalesce(1).write.format("csv").option("delimiter", "\t").option("header",true).option("inferSchema","true").option("dateFormat", "yyyy-MM-dd HH:mm:ss").option("escape", "\\").save("C:/mydata.csv")

Here is the code snippet that worked for me to modify the CSV output format for timestamps.
I needed a 'T' character in there, and no seconds or microseconds. The timestampFormat option did work for this.
DF.write
.mode(SaveMode.Overwrite)
.option("timestampFormat", "yyyy-MM-dd'T'HH:mm")
Such as 2017-02-20T06:53
If you substitute a space for 'T' then you get this:
DF.write
.mode(SaveMode.Overwrite)
.option("timestampFormat", "yyyy-MM-dd HH:mm")
Such as 2017-02-20 06:53

Related

Multiple formats in Date Time column in Spark

I am using Spark3.0.1
I have following data as csv:
348702330256514,37495066290,9084849,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,330148,33946,614677375609919,11-02-2018 00:00:00,GENUINE
348702330256514,37495066290,136052,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,4310362,33946,614677375609919,11-02-2018 00:00:00,GENUINE
348702330256514,37495066290,9097094,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,2291118,33946,614677375609919,11-02-2018 00:00:00,GENUINE
348702330256514,37495066290,4900011,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,633447,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,6259303,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,369067,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,1193207,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,9335696,33946,614677375609919,11-02-2018 0:00:00,GENUINE
As you can see the second last column has Timestamp data where the hour column will have data both in single as well as double digits, depending on the hour of the day (This is sample data and not all records have all zeros for time part).
This is the problem and I tried to solve the problem as below:
Read the column as String and then Use a column Method to format it to TimeStamp type.
val schema = StructType(
List(
StructField("_corrupt_record", StringType)
, StructField("card_id", LongType)
, StructField("member_id", LongType)
, StructField("amount", IntegerType)
, StructField("postcode", IntegerType)
, StructField("pos_id", LongType)
, StructField("transaction_dt", StringType)
, StructField("status", StringType)
)
)
// format the timestamp column
def format_time_column(timeStampCol: Column
, formats: Seq[String] = Seq( "dd-MM-yyyy HH:mm:ss", "dd-MM-yyyy H:mm:ss"
, "dd-MM-yyyy HH:m:ss", "dd-MM-yyyy H:m:ss")) ={
coalesce(
formats.map(f => to_timestamp(timeStampCol, f)):_*
)
}
val cardTransaction = spark.read
.format("csv")
.option("header", false)
.schema(schema)
.option("path", cardTransactionFilePath)
.option("columnNameOfCorruptRecord", "_corrupt_record")
.load
.withColumn("transaction_dt", format_time_column(col("transaction_dt")))
cardTransaction.cache()
cardTransaction.show(5)
This code produces following error:
*Note:
The record highlighted has only 1 digit for hour
Whatever is the first format provided in the list of formats, only that works all the rest formats are not considered.
Problem is that to_timestamp() throws exception instead of producing null as is expected by coalesce(), when wrong format is encounterd.
How to solve it?
In Spark 3.0, we define our own pattern strings in Datetime Patterns for Formatting and Parsing, which is implemented via DateTimeFormatter under the hood.
In Spark version 2.4 and below, java.text.SimpleDateFormat is used for timestamp/date string conversions, and the supported patterns are described in SimpleDateFormat.
The old behavior can be restored by setting spark.sql.legacy.timeParserPolicy to LEGACY.
sparkConf.set("spark.sql.legacy.timeParserPolicy","LEGACY")
Doc:
sql-migration-guide.html#query-engine

convert a string type(MM/dd/YYYY hh:mm:ss AM/PM) to date format in PySpark?

I have a string in format 05/26/2021 11:31:56 AM for mat and I want to convert it to a date format like 05-26-2021 in pyspark.
I have tried below things but its converting the column type to date but making the values null.
df = df.withColumn("columnname", F.to_date(df["columnname"], 'yyyy-MM-dd'))
another one which I have tried is
df = df.withColumn("columnname", df["columnname"].cast(DateType()))
I have also tried the below method
df = df.withColumn(column.lower(), F.to_date(F.col(column.lower())).alias(column).cast("date"))
but in every method I was able to convert the column type to date but it makes the values null.
Any suggestion is appreciated
# Create data frame like below
df = spark.createDataFrame(
[("Test", "05/26/2021 11:31:56 AM")],
("user_name", "login_date"))
# Import functions
from pyspark.sql import functions as f
# Create data framew with new column new_date with data in desired format
df1 = df.withColumn("new_date", f.from_unixtime(f.unix_timestamp("login_date",'MM/dd/yyyy hh:mm:ss a'),'yyyy-MM-dd').cast('date'))
The above answer posted by #User12345 works and the below method is also works
df = df.withColumn(column, F.unix_timestamp(column, "MM/dd/YYYY hh:mm:ss aa").cast("double").cast("timestamp"))
df = df.withColumn(column, F.from_utc_timestamp(column, 'Z').cast(DateType()))
Use this
df=data.withColumn("Date",to_date(to_timestamp("Date","M/d/yyyy")))

Spark Dataframe from a different data format

I've this data set. for which I need to create a sparkdataframe in scala. This data is a column in a csv file. column name is dataheader
dataheader
"{""date_time"":""1999/05/22 03:03:07.011"",""cust_id"":""cust1"",""timestamp"":944248234000,""msgId"":""113"",""activityTimeWindowMilliseconds"":20000,""ec"":""event1"",""name"":""ABC"",""entityId"":""1001"",""et"":""StateChange"",""logType"":""type123,""lastActivityTS"":944248834000,""sc_id"":""abc1d1c9"",""activityDetectedInLastTimeWindow"":true}"
"{""date_time"":""1999/05/23 03:03:07.011"",""cust_id"":""cust1"",""timestamp"":944248234000,""msgId"":""114"",""activityTimeWindowMilliseconds"":20000,""ec"":""event2"",""name"":""ABC"",""entityId"":""1001"",""et"":""StateChange"",""logType"":""type123,""lastActivityTS"":944248834000,""sc_id"":""abc1d1c9"",""activityDetectedInLastTimeWindow"":true}"
I was able to read the csv file -
val df_tmp = spark
.read
.format("com.databricks.spark.csv")
.option("header","true")
.option("quoteMode", "ALL")
.option("delimiter", ",")
.option("escape", "\"")
//.option("inferSchema","true")
.option("multiline", "true")
.load("D:\\dataFile.csv")
I tried to split the data into separate columns in a dataframe but did not succeed.
one thing I noticed in data is both key and value are enclosed by double double quotes ""key1"":""value1""
If you want to get the field inside the data field, you need to parse it and write it into a new CSV file.
It's obviously a string in json format

Scala : how to convert integer to time stamp

I am facing an issue when i am trying to find the number of months between two dates using 'months_between'function. when my input date format is 'dd/mm/yyyy' or any other date format then the function is returning the correct output. however when i am passing the input date format as yyyymmdd then i am getting the below error.
Code:
val df = spark.read.option("header", "true").option("dateFormat", "yyyyMMdd").option("inferSchema", "true").csv("MyFile.csv")
val filteredMemberDF = df.withColumn("monthsBetween", functions.months_between(col("toDate"), col("fromDT")))
error:
cannot resolve 'months_between(toDate, fromDT)' due to data type mismatch: argument 1 requires timestamp type,
however, 'toDate' is of int type. argument 2 requires timestamp type, however, 'fromDT' is of int type.;
When my input is as below,
id fromDT toDate
11 16/06/2008 16/08/2008
12 13/07/2008 13/10/2008
getting expected output,
id fromDT toDate monthsBetween
11 16/6/2008 16/8/2008 2
12 13/7/2008 13/10/2008 3
when i am passing the below data, facing the above said error.
id fromDT toDate
11 20150930 20150930
12 20150930 20150930
You first need to use to_date function to convert those numbers to DateTimes.
import org.apache.spark.sql.functions._
val df = spark.read
.option("header", "true")
.option("dateFormat", "yyyyMMdd")
.option("inferSchema", "true")
.csv("MyFile.csv")
val dfWithDates = df
.withColumn("toDateReal", to_date(concat(col("toDate")), "yyyyMMdd"))
.withColumn("fromDateReal", to_date(concat(col("fromDT")), "yyyyMMdd"))
val filteredMemberDF = dfWithDates
.withColumn("monthsBetween", months_between(col("toDateReal"), col("fromDateReal")))

Read .csv data in european format with Spark

I am currently doing my first attempts with Apache Spark.
I would like to read a .csv File with an SQLContext object, but Spark won't provide the correct results as the File is a european one (comma as decimal separator and semicolon used as value separator).
Is there a way to tell Spark to follow a different .csv syntax?
val conf = new SparkConf()
.setMaster("local[8]")
.setAppName("Foo")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val df = sqlContext.read
.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat")
.option("header","true")
.option("inferSchema","true")
.load("data.csv")
df.show()
A row in the relating .csv looks like this:
04.10.2016;12:51:00;1,1;0,41;0,416
Spark interprets the entire row as a column. df.show() prints:
+--------------------------------+
|Col1;Col2,Col3;Col4;Col5 |
+--------------------------------+
| 04.10.2016;12:51:...|
+--------------------------------+
In previous attempts to get it working df.show() was even printing more row-content where it now says '...' but eventually cutting the row at the comma in the third col.
You can just read as Test and split by ; or set a custom delimiter to the CSV format as in .option("delimiter",";")