I have a DF like this:
df = spark.createDataFrame(
["2003-01-01 02:00:00.0 -8:00"],
"string"
).toDF('ts')
df.collect()
[Row(ts='2003-01-01 02:00:00.0 -8:00')]
And I'm trying to make a timestamp type out of my ts column, but I just can't seem to make it work.
I tried many variants:
from pyspark.sql.functions import to_timestamp
df = df.withColumn('cast', to_timestamp('ts', 'yyyy-MM-dd HH:mm:ss.S Z'))
df.collect()
[Row(ts='2003-01-01 02:00:00.0 -8:00', cast=None)]
df = df.withColumn('cast', to_timestamp('ts', 'yyyy-MM-dd HH:mm:ss.S X'))
df.collect()
[Row(ts='2003-01-01 02:00:00.0 -8:00', cast=None)]
df = df.withColumn('cast', to_timestamp('ts', 'yyyy-MM-dd HH:mm:ss.S x'))
df.collect()
[Row(ts='2003-01-01 02:00:00.0 -8:00', cast=None)]
df = df.withColumn('cast', to_timestamp('ts'))
df.collect()
[Row(ts='2003-01-01 02:00:00.0 -8:00', cast=None)]
But none of them work. It is especially frustrating since just removing the space before the offset works even without specifying a format:
df = spark.createDataFrame(
["2003-01-01 02:00:00.0-8:00"],
"string"
).toDF('ts')
df = df.withColumn('cast', to_timestamp('ts'))
df.collect()
[Row(ts='2003-01-01 02:00:00.0-8:00', cast=datetime.datetime(2003, 1, 1, 11, 0))]
You need to use ZZZZZ instead of Z (the five-letter forms of x and X also work).
The Spark SQL manual says:
Five letters outputs the hour and minute and optional second, with a colon, such as +01:30:15.
The full text, with the other letter counts explained:
One letter outputs just the hour, such as +01, unless the minute is non-zero in which case the minute is also output, such as +0130. Two letters outputs the hour and minute, without a colon, such as +0130. Three letters outputs the hour and minute, with a colon, such as +01:30. Four letters outputs the hour and minute and optional second, without a colon, such as +013015. Five letters outputs the hour and minute and optional second, with a colon, such as +01:30:15. Six or more letters will fail.
Spark SQL example:
SELECT to_timestamp('2003-01-01 02:00:00.1 -08:00', 'yyyy-MM-dd HH:mm:ss.S ZZZZZ')
returns
2003-01-01T10:00:00.100+0000
I am using Spark 3.0.1 and I have the following data as CSV:
348702330256514,37495066290,9084849,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,330148,33946,614677375609919,11-02-2018 00:00:00,GENUINE
348702330256514,37495066290,136052,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,4310362,33946,614677375609919,11-02-2018 00:00:00,GENUINE
348702330256514,37495066290,9097094,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,2291118,33946,614677375609919,11-02-2018 00:00:00,GENUINE
348702330256514,37495066290,4900011,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,633447,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,6259303,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,369067,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,1193207,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,9335696,33946,614677375609919,11-02-2018 0:00:00,GENUINE
As you can see, the second-to-last column holds timestamp data, and the hour appears with either one or two digits depending on the hour of the day. (This is sample data; not all records have all zeros in the time part.)
I tried to solve the problem as follows:
Read the column as String, then use a column method to convert it to TimestampType.
val schema = StructType(
  List(
    StructField("_corrupt_record", StringType),
    StructField("card_id", LongType),
    StructField("member_id", LongType),
    StructField("amount", IntegerType),
    StructField("postcode", IntegerType),
    StructField("pos_id", LongType),
    StructField("transaction_dt", StringType),
    StructField("status", StringType)
  )
)
// format the timestamp column
def format_time_column(
    timeStampCol: Column,
    formats: Seq[String] = Seq(
      "dd-MM-yyyy HH:mm:ss",
      "dd-MM-yyyy H:mm:ss",
      "dd-MM-yyyy HH:m:ss",
      "dd-MM-yyyy H:m:ss"
    )
): Column = {
  coalesce(formats.map(f => to_timestamp(timeStampCol, f)): _*)
}
val cardTransaction = spark.read
.format("csv")
.option("header", false)
.schema(schema)
.option("path", cardTransactionFilePath)
.option("columnNameOfCorruptRecord", "_corrupt_record")
.load
.withColumn("transaction_dt", format_time_column(col("transaction_dt")))
cardTransaction.cache()
cardTransaction.show(5)
This code throws an exception at runtime.
Note:
The failing record has only one digit for the hour.
Whichever format appears first in the list of formats is the only one applied; the remaining formats are never considered.
The problem is that to_timestamp() throws an exception when it encounters a non-matching format, instead of producing the null that coalesce() expects.
How can I solve this?
In Spark 3.0, Spark defines its own pattern strings (see Datetime Patterns for Formatting and Parsing), implemented via DateTimeFormatter under the hood.
In Spark version 2.4 and below, java.text.SimpleDateFormat is used for timestamp/date string conversions, and the supported patterns are described in SimpleDateFormat.
The old behavior can be restored by setting spark.sql.legacy.timeParserPolicy to LEGACY.
sparkConf.set("spark.sql.legacy.timeParserPolicy","LEGACY")
Doc:
sql-migration-guide.html#query-engine
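If the session is already running, the same flag can be flipped at runtime; a minimal sketch (assuming spark is your active SparkSession):
// Switch the session to the legacy (Spark 2.4) time parser.
// With LEGACY in effect, to_timestamp() returns null on a format mismatch
// instead of throwing, so the coalesce() over several formats works again.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
// ...or equivalently via SQL:
spark.sql("SET spark.sql.legacy.timeParserPolicy = LEGACY")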
I have a Spark dataframe and am trying to add Year, Month, and Day columns to it.
The problem is that after adding these columns, the day and month values do not keep their leading zeros.
import org.apache.spark.sql.functions._
import spark.implicits._

val cityDF = Seq(("Delhi","India"),("Kolkata","India"),("Mumbai","India"),("Nairobi","Kenya"),("Colombo","Srilanka"),("Tibet","China")).toDF("City","Country")
val dateString = "2020-01-01"
val dateCol = org.apache.spark.sql.functions.to_date(lit(dateString))
val finaldf = cityDF.select($"*", year(dateCol).alias("Year"), month(dateCol).alias("Month"), dayofmonth(dateCol).alias("Day"))
I want to keep the leading zeros in the Month and Day columns, but I get 1 instead of 01.
I am using the year/month/day columns for Spark partition creation, so I need the leading zeros intact.
So my question is: how do I keep the leading zeros in my dataframe columns?
An integer column can be converted to a string column, where leading zeroes are possible, with the format_string function:
val finaldf =
cityDF
.select($"*",
year(dateCol).alias("Year"),
format_string("%02d", month(dateCol)).alias("Month"),
format_string("%02d", dayofmonth(dateCol)).alias("Day")
)
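If you prefer to avoid printf-style format strings, lpad gives the same zero-padding; a small sketch with the same columns (lpad is also in org.apache.spark.sql.functions):
// Alternative sketch: cast the int to string and left-pad to width 2 with '0'.
val padded = cityDF.select($"*",
  year(dateCol).alias("Year"),
  lpad(month(dateCol).cast("string"), 2, "0").alias("Month"),
  lpad(dayofmonth(dateCol).cast("string"), 2, "0").alias("Day")
)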
Why not simply use date_format for that?
val finaldf = cityDF.select(
$"*",
year(dateCol).alias("Year"),
date_format(dateCol, "MM").alias("Month"),
date_format(dateCol, "dd").alias("Day")
)
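Either way, Month and Day end up as string columns, so the leading zeros also survive into the partition directory names, which is what the question is ultimately after. A sketch of that final step (the output path is a placeholder):
// Hypothetical output path; directories come out as Year=2020/Month=01/Day=01
finaldf.write
  .partitionBy("Year", "Month", "Day")
  .parquet("/tmp/cities_partitioned")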
I am facing an issue when trying to find the number of months between two dates using the months_between function. When my input date format is dd/MM/yyyy or most other date formats, the function returns the correct output. However, when I pass input dates formatted as yyyyMMdd, I get the error below.
Code:
val df = spark.read.option("header", "true").option("dateFormat", "yyyyMMdd").option("inferSchema", "true").csv("MyFile.csv")
val filteredMemberDF = df.withColumn("monthsBetween", functions.months_between(col("toDate"), col("fromDT")))
error:
cannot resolve 'months_between(toDate, fromDT)' due to data type mismatch: argument 1 requires timestamp type,
however, 'toDate' is of int type. argument 2 requires timestamp type, however, 'fromDT' is of int type.;
When my input is as below:
id fromDT toDate
11 16/06/2008 16/08/2008
12 13/07/2008 13/10/2008
I get the expected output:
id fromDT toDate monthsBetween
11 16/6/2008 16/8/2008 2
12 13/7/2008 13/10/2008 3
When I pass the data below, I get the error above.
id fromDT toDate
11 20150930 20150930
12 20150930 20150930
You first need to use the to_date function to convert those numbers to dates.
import org.apache.spark.sql.functions._
val df = spark.read
.option("header", "true")
.option("dateFormat", "yyyyMMdd")
.option("inferSchema", "true")
.csv("MyFile.csv")
val dfWithDates = df
  .withColumn("toDateReal", to_date(concat(col("toDate")), "yyyyMMdd")) // concat() implicitly casts the int column to string
  .withColumn("fromDateReal", to_date(concat(col("fromDT")), "yyyyMMdd"))
val filteredMemberDF = dfWithDates
.withColumn("monthsBetween", months_between(col("toDateReal"), col("fromDateReal")))
I am attempting to group data that fits into a specified window period using Spark Structured Streaming.
val profiles = rawProfiles.select("*")
.groupBy(window($"date", "10 minutes", "5 minutes").alias("date"), $"id", $"name")
.agg(sum("value").alias("value"))
.join(url.value, Seq("url"), "left")
.where("value > 20")
.as[profileRecord]
The format of the date from the rawProfiles is a string like this:
2017-07-20 18:27:45
What is returned for the date column after the window aggregation is something like this:
[0,554c749fb8a00,554c76dbed000]
I'm not really sure what to do with that. Does anyone have any ideas?
You can reformat your date field as follows:
rawProfiles
  .select(<your other fields>, unix_timestamp($"date").cast(DataTypes.TimestampType).as("date")) // cast to timestamp; wrapping in to_date would drop the time of day
  .groupBy(window($"date", "10 minutes", "5 minutes").alias("date"), $"id", $"name")
.agg(sum("value").alias("value"))
.join(url.value, Seq("url"), "left")
.where("value > 20")
.as[profileRecord]
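As for the [0,554c749fb8a00,554c76dbed000] output: window() produces a struct column with start and end timestamp fields, and what you saw is that struct's internal encoding being printed. You can select the bounds explicitly; a sketch against the aggregated frame above:
// The window alias "date" is a struct<start: timestamp, end: timestamp>;
// pull the bounds out as ordinary timestamp columns.
profiles.select($"date.start".as("window_start"), $"date.end".as("window_end"))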
I am trying to save a dataframe that contains a timestamp to a CSV file.
The problem is that this column's format changes once it is written to the CSV file. Here is the code I used:
val spark = SparkSession.builder.master("local").appName("my-spark-app").getOrCreate()
val df = spark.read.option("header",true).option("inferSchema", "true").csv("C:/Users/mhattabi/Desktop/dataTest2.csv")
//val df = spark.read.option("header",true).option("inferSchema", "true").csv("C:\\dataSet.csv\\datasetTest.csv")
// convert all columns to numeric values in order to apply aggregation functions
// (note: map returns new DataFrames without modifying df; a foldLeft over the columns would be needed to actually apply the casts)
df.columns.map { c => df.withColumn(c, col(c).cast("int")) }
// add a new timestamp column bucketed into 5-minute (300-second) intervals
val result2 = df.withColumn("new_time", ((unix_timestamp(col("time")) / 300).cast("long") * 300).cast("timestamp")).drop("time")
val finalresult = result2.groupBy("new_time").agg(result2.drop("new_time").columns.map((_ -> "mean")).toMap).sort("new_time") // mean of every remaining column
finalresult.coalesce(1).write.option("header",true).option("inferSchema","true").csv("C:/mydata.csv")
When displayed via df.show it shows the correct format, but in the CSV file the timestamps are written in a different format.
Use the timestampFormat option to write the timestamps in the format you need (note that inferSchema is a read-side option and does nothing on write):
finalresult.coalesce(1).write.option("header", true).option("timestampFormat", "yyyy-MM-dd HH:mm:ss").csv("C:/mydata.csv")
or
finalresult.coalesce(1).write.format("csv").option("delimiter", "\t").option("header", true).option("timestampFormat", "yyyy-MM-dd HH:mm:ss").option("escape", "\\").save("C:/mydata.csv")
Here is the code snippet that worked for me to modify the CSV output format for timestamps.
I needed a 'T' character in there, and no seconds or microseconds. The timestampFormat option did work for this.
DF.write
  .mode(SaveMode.Overwrite)
  .option("timestampFormat", "yyyy-MM-dd'T'HH:mm")
  .csv(outputPath) // outputPath is a placeholder for your destination
Such as 2017-02-20T06:53
If you substitute a space for 'T' then you get this:
DF.write
  .mode(SaveMode.Overwrite)
  .option("timestampFormat", "yyyy-MM-dd HH:mm")
  .csv(outputPath) // outputPath is a placeholder for your destination
Such as 2017-02-20 06:53
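If you later read the file back and want the column typed as a timestamp again, the reader honors the same option; a sketch reusing the path and format from the first answer:
// Round-trip sketch: hand the reader the format the writer used.
val back = spark.read
  .option("header", true)
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
  .option("inferSchema", true)
  .csv("C:/mydata.csv")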