I have a CSV file:
Name;Date
A;2018-01-01 10:15:25.123456
B;2018-12-31 10:15:25.123456
I try to parse it with a Spark DataFrame:
val df = spark.read.format(source="csv")
.option("header", true)
.option("delimiter", ";")
.option("inferSchema", true)
.option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSSSSS")
But the resulting DataFrame is (wrongly) truncated to the millisecond:
scala> df.show(truncate=false)
+----+-----------------------+
|Name|Date                   |
+----+-----------------------+
|A   |2018-01-01 10:17:28.456|
|B   |2018-12-31 10:17:28.456|
+----+-----------------------+
df.first()(1).asInstanceOf[Timestamp].getNanos()
res51: Int = 456000000
Bonus question: how to read with nanosecond precision?
SSSSSS means milliseconds, not microseconds:
java.util.Date format SSSSSS: if not microseconds, what are the last 3 digits?
https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html
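That is exactly where the shifted values in the question's output come from. A quick sketch (not part of the original answer) reproduces the behaviour with plain java.text.SimpleDateFormat:

import java.text.SimpleDateFormat

// SimpleDateFormat treats the six digits matched by "SSSSSS" as a plain
// millisecond count, so 123456 ms (~2 minutes) is added to the parsed time.
val sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSSSSS")
val parsed = sdf.parse("2018-01-01 10:15:25.123456")

// Prints 2018-01-01 10:17:28.456 -- the same shifted value seen in df.show() above.
println(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS").format(parsed))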
So if you need microseconds, you have to parse the date with custom code:
Handling microseconds in Spark Scala
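A minimal sketch of that custom-code route (an illustration, not the code from the linked answer): read the Date column as a plain string and convert it with a UDF; the microseconds survive because Spark SQL's TimestampType has microsecond precision. The file path and the toMicroTimestamp name are hypothetical.

import java.sql.Timestamp
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import org.apache.spark.sql.functions.{col, udf}

// Sketch: parse "yyyy-MM-dd HH:mm:ss.SSSSSS" strings into java.sql.Timestamp,
// keeping the microsecond part.
val toMicroTimestamp = udf { s: String =>
  // built inside the UDF to avoid closure serialization issues
  val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSSSSS")
  Timestamp.valueOf(LocalDateTime.parse(s, fmt))
}

val withMicros = spark.read
  .option("header", true)
  .option("delimiter", ";")
  .csv("data.csv")                        // hypothetical path to the CSV above
  .withColumn("Date", toMicroTimestamp(col("Date")))

withMicros.show(truncate = false)         // 2018-01-01 10:15:25.123456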
Bonus answer: Spark SQL stores timestamps with microsecond precision internally, so for nanoseconds you could keep the value as a string, store the nanos in a separate field, or use any other custom solution.
I am using Spark 3.0.1.
I have the following data as CSV:
348702330256514,37495066290,9084849,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,330148,33946,614677375609919,11-02-2018 00:00:00,GENUINE
348702330256514,37495066290,136052,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,4310362,33946,614677375609919,11-02-2018 00:00:00,GENUINE
348702330256514,37495066290,9097094,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,2291118,33946,614677375609919,11-02-2018 00:00:00,GENUINE
348702330256514,37495066290,4900011,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,633447,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,6259303,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,369067,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,1193207,33946,614677375609919,11-02-2018 0:00:00,GENUINE
348702330256514,37495066290,9335696,33946,614677375609919,11-02-2018 0:00:00,GENUINE
As you can see, the second-to-last column holds timestamp data in which the hour is written with either one or two digits, depending on the hour of the day (this is sample data; not all records have an all-zero time part).
This is the problem, and I tried to solve it as below:
read the column as a String and then use a column method to convert it to TimestampType.
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{coalesce, col, to_timestamp}
import org.apache.spark.sql.types._

val schema = StructType(
  List(
    StructField("_corrupt_record", StringType)
    , StructField("card_id", LongType)
    , StructField("member_id", LongType)
    , StructField("amount", IntegerType)
    , StructField("postcode", IntegerType)
    , StructField("pos_id", LongType)
    , StructField("transaction_dt", StringType)
    , StructField("status", StringType)
  )
)
// format the timestamp column, keeping the first format that parses
def format_time_column(timeStampCol: Column,
                       formats: Seq[String] = Seq("dd-MM-yyyy HH:mm:ss", "dd-MM-yyyy H:mm:ss",
                                                  "dd-MM-yyyy HH:m:ss", "dd-MM-yyyy H:m:ss")): Column = {
  coalesce(
    formats.map(f => to_timestamp(timeStampCol, f)): _*
  )
}
val cardTransaction = spark.read
.format("csv")
.option("header", false)
.schema(schema)
.option("path", cardTransactionFilePath)
.option("columnNameOfCorruptRecord", "_corrupt_record")
.load
.withColumn("transaction_dt", format_time_column(col("transaction_dt")))
cardTransaction.cache()
cardTransaction.show(5)
This code produces an error (the failing record is the one with only one digit for the hour).
Note: whichever format comes first in the list of formats is the only one that works; all the other formats are not considered.
The problem is that to_timestamp() throws an exception when it encounters a value in the wrong format, instead of producing the null that coalesce() expects.
How can I solve this?
In Spark 3.0, Spark defines its own pattern strings in Datetime Patterns for Formatting and Parsing, which are implemented via DateTimeFormatter under the hood.
In Spark version 2.4 and below, java.text.SimpleDateFormat was used for timestamp/date string conversions, and the supported patterns are described in SimpleDateFormat.
The old behavior can be restored by setting spark.sql.legacy.timeParserPolicy to LEGACY.
sparkConf.set("spark.sql.legacy.timeParserPolicy","LEGACY")
Doc:
sql-migration-guide.html#query-engine
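A minimal sketch of how that fix could be applied to the question's code, assuming the same spark session, schema and format_time_column definition as above; with the legacy parser an unparseable value yields null, which coalesce() can then skip, instead of the whole parse throwing:

// Sketch only: restore the legacy SimpleDateFormat-based parser at runtime,
// then re-run the multi-format parse from the question.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")

val cardTransaction = spark.read
  .format("csv")
  .option("header", false)
  .schema(schema)
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .load(cardTransactionFilePath)
  .withColumn("transaction_dt", format_time_column(col("transaction_dt")))

cardTransaction.show(5, truncate = false)

Alternatively (not verified here), the Spark 3 parser's single-letter hour pattern, e.g. "dd-MM-yyyy H:mm:ss", is meant to accept one or more digits when parsing, so a single format may be enough without the legacy policy.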
I have a DataFrame with a string datetime column.
I am converting it to a timestamp, but the values are changing.
Following is my code; can anyone help me convert it without changing the values?
from pyspark.sql.functions import to_timestamp

df=spark.createDataFrame(
data = [ ("1","2020-04-06 15:06:16 +00:00")],
schema=["id","input_timestamp"])
df.printSchema()
# Convert the timestamp string to TimestampType
df = df.withColumn("timestamp",to_timestamp("input_timestamp"))
# Cast the TimestampType column back to a string
df.withColumn('timestamp_string', \
to_timestamp('timestamp').cast('string')) \
.show(truncate=False)
This is the output:
+---+--------------------------+-------------------+-------------------+
|id |input_timestamp |timestamp |timestamp_string |
+---+--------------------------+-------------------+-------------------+
|1 |2020-04-06 15:06:16 +00:00|2020-04-06 08:06:16|2020-04-06 08:06:16|
+---+--------------------------+-------------------+-------------------+
I want to know why the hour is changing from 15 to 8, and how I can prevent it.
I believe to_timestamp is converting the timestamp value to your local time zone, since you have +00:00 in your data.
Try passing the format to the to_timestamp() function.
Example:
from pyspark.sql.functions import col, to_timestamp
df.withColumn("timestamp", to_timestamp(col("input_timestamp"), "yyyy-MM-dd HH:mm:ss +00:00")).show(10, False)
#+---+--------------------------+-------------------+
#|id |input_timestamp |timestamp |
#+---+--------------------------+-------------------+
#|1 |2020-04-06 15:06:16 +00:00|2020-04-06 15:06:16|
#+---+--------------------------+-------------------+
from pyspark.sql.functions import to_utc_timestamp
df = spark.createDataFrame(
data=[('1', '2020-04-06 15:06:16 +00:00')],
schema=['id', 'input_timestamp'])
df.printSchema()
df = df.withColumn('timestamp', to_utc_timestamp('input_timestamp',
your_local_timezone))
df.withColumn('timestamp_string', df.timestamp.cast('string')).show(truncate=False)
Replace your_local_timezone with your actual time zone (for example, 'America/New_York').
I am using PySpark with Python 2.7. I have a date column as a string (with milliseconds) and would like to convert it to a timestamp.
This is what I have tried so far
df = df.withColumn('end_time', from_unixtime(unix_timestamp(df.end_time, '%Y-%M-%d %H:%m:%S.%f')) )
printSchema() shows
end_time: string (nullable = true)
whereas I expected timestamp as the type of the variable.
Try using from_utc_timestamp:
from pyspark.sql.functions import from_utc_timestamp
df = df.withColumn('end_time', from_utc_timestamp(df.end_time, 'PST'))
You need to specify a time zone for the function; in this case I chose PST.
If this does not work, please give us an example of a few rows showing df.end_time.
Create a sample DataFrame with the timestamp formatted as a string:
import pyspark.sql.functions as F
df = spark.createDataFrame([('22-Jul-2018 04:21:18.792 UTC', ),('23-Jul-2018 04:21:25.888 UTC',)], ['TIME'])
df.show(2,False)
df.printSchema()
Output:
+----------------------------+
|TIME |
+----------------------------+
|22-Jul-2018 04:21:18.792 UTC|
|23-Jul-2018 04:21:25.888 UTC|
+----------------------------+
root
|-- TIME: string (nullable = true)
Convert the string time format (including milliseconds) to unix_timestamp (a double). Since the unix_timestamp() function drops milliseconds, we add them back with a simple hack: extract the milliseconds from the string with the substring method (start_position = -7, length_of_substring = 3), cast the substring to float, and add it separately to the unix_timestamp value.
df1 = df.withColumn("unix_timestamp",F.unix_timestamp(df.TIME,'dd-MMM-yyyy HH:mm:ss.SSS z') + F.substring(df.TIME,-7,3).cast('float')/1000)
Convert the unix_timestamp (double) to the timestamp data type in Spark.
df2 = df1.withColumn("TimestampType",F.to_timestamp(df1["unix_timestamp"]))
df2.show(n=2,truncate=False)
This will give you the following output:
+----------------------------+----------------+-----------------------+
|TIME |unix_timestamp |TimestampType |
+----------------------------+----------------+-----------------------+
|22-Jul-2018 04:21:18.792 UTC|1.532233278792E9|2018-07-22 04:21:18.792|
|23-Jul-2018 04:21:25.888 UTC|1.532319685888E9|2018-07-23 04:21:25.888|
+----------------------------+----------------+-----------------------+
Checking the Schema:
df2.printSchema()
root
|-- TIME: string (nullable = true)
|-- unix_timestamp: double (nullable = true)
|-- TimestampType: timestamp (nullable = true)
In current versions of Spark, we do not have to do much for timestamp conversion.
Using the to_timestamp function works well in this case; the only thing we need to take care of is passing the timestamp format that matches the original column.
In my case it was yyyy-MM-dd HH:mm:ss.
Other formats can be like MM/dd/yyyy HH:mm:ss or a similar combination.
from pyspark.sql.functions import to_timestamp
df=df.withColumn('date_time',to_timestamp('event_time','yyyy-MM-dd HH:mm:ss'))
df.show()
The following might help:
from pyspark.sql import functions as F
df = df.withColumn("end_time", F.from_unixtime(F.col("end_time"), 'yyyy-MM-dd HH:mm:ss.SS').cast("timestamp"))
I've got JSON rows that look like the following:
[{"time":"2017-03-23T12:23:05","user":"randomUser","action":"sleeping"}]
[{"time":"2017-03-23T12:24:05","user":"randomUser","action":"sleeping"}]
[{"time":"2017-03-23T12:33:05","user":"randomUser","action":"sleeping"}]
[{"time":"2017-03-23T15:33:05","user":"randomUser2","action":"eating"}]
[{"time":"2017-03-23T15:33:06","user":"randomUser2","action":"eating"}]
So I have two problems. First of all, the time is stored as a String inside my df; I believe it has to be a date/timestamp for me to aggregate on it?
Second, I need to aggregate that data into 5-minute intervals;
for example, everything that happens from 2017-03-23T12:20:00 to 2017-03-23T12:24:59 needs to be aggregated and counted under the 2017-03-23T12:20:00 timestamp.
The expected output is:
[{"time":"2017-03-23T12:20:00","user":"randomUser","action":"sleeping","count":2}]
[{"time":"2017-03-23T12:30:00","user":"randomUser","action":"sleeping","count":1}]
[{"time":"2017-03-23T15:30:00","user":"randomUser2","action":"eating","count":2}]
thanks
You can convert the StringType column into a TimestampType column using casting; then you can cast the timestamp into IntegerType to make the "rounding" down to the last 5-minute interval easier, and group by that (and all other columns):
// importing SparkSession's implicits and the SQL types used below
import spark.implicits._
import org.apache.spark.sql.types.{IntegerType, TimestampType}
// Use casting to convert String into Timestamp:
val withTime = df.withColumn("time", $"time" cast TimestampType)
// calculate the "most recent 5-minute-round time" and group by all
val result = withTime.withColumn("time", $"time" cast IntegerType)
.withColumn("time", ($"time" - ($"time" mod 60 * 5)) cast TimestampType)
.groupBy("time", "user", "action").count()
result.show(truncate = false)
// +---------------------+-----------+--------+-----+
// |time |user |action |count|
// +---------------------+-----------+--------+-----+
// |2017-03-23 12:20:00.0|randomUser |sleeping|2 |
// |2017-03-23 15:30:00.0|randomUser2|eating |2 |
// |2017-03-23 12:30:00.0|randomUser |sleeping|1 |
// +---------------------+-----------+--------+-----+
I am trying to learn Spark, and I am reading a DataFrame with a timestamp column using the unix_timestamp function as below:
val columnName = "TIMESTAMPCOL"
val sequence = Seq(2016-01-20 12:05:06.999)
val dataframe = {
sequence.toDF(columnName)
}
val typeDataframe = dataframe.withColumn(columnName, org.apache.spark.sql.functions.unix_timestamp($"TIMESTAMPCOL"))
typeDataframe.show
This produces an output:
+------------+
|TIMESTAMPCOL|
+------------+
| 1453320306|
+------------+
How can I read it so that I don't lose the milliseconds, i.e. the .999 part? I tried using unix_timestamp(col: Column, s: String) where s is the SimpleDateFormat pattern, e.g. "yyyy-MM-dd hh:mm:ss", without any luck.
To retain the milliseconds, use the "yyyy-MM-dd HH:mm:ss.SSS" format. You can use date_format like below.
val typeDataframe = dataframe.withColumn(columnName, org.apache.spark.sql.functions.date_format($"TIMESTAMPCOL","yyyy-MM-dd HH:mm:ss.SSS"))
typeDataframe.show
This will give you
+-----------------------+
|TIMESTAMPCOL |
+-----------------------+
|2016-01-20 12:05:06.999|
+-----------------------+