datetime offset issue while saving data into parquet for daylight saving time - scala

While we are writing into a parquet file using Spark/Scala, timestamps that fall within DST (daylight saving time) are being shifted back by one hour, for example 2011-09-20 00:00:00.000 becomes 2011-09-19 23:00:00.000 (see the screenshot comparing source and destination values).
Source (reading data from): SQL Server
Destination (writing into): AWS S3
code:
val ssdf = spark.read.format("jdbc")
  .option("driver", s"${ssDriver}")
  .option("url", s"${ssConnectionString}")
  .option("dbtable", s"${SCHEMANAME}.${RESULTTABLENAME}")
  .option("user", s"${ssUsername}")
  .option("password", s"${ssPassword}")
  .load()

ssdf.write.format("parquet").mode("overwrite").option("header", "true").save("s3://targetS3path/")
###########################################################
The code runs fine, but dates that fall within DST are delayed by one hour; check the screenshot.
I'm expecting the datetime value as per the source: 2011-09-20 00:00:00.000.

Set the JVM timezone; you will need to add extra JVM options for the driver and executors:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .config("spark.driver.extraJavaOptions", "-Duser.timezone=GMT")
  .config("spark.executor.extraJavaOptions", "-Duser.timezone=GMT")
  .config("spark.sql.session.timeZone", "UTC")
  .getOrCreate()
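As a follow-up, here is a minimal sketch of the same idea applied to the JDBC-to-parquet flow from the question (the ss* variables are assumed to be defined as in the question; also note that the driver's -Duser.timezone option typically needs to be supplied at launch, e.g. via spark-submit, rather than after the JVM is already running, whereas spark.sql.session.timeZone can be changed on a running session):

// Sketch only: ssDriver, ssConnectionString, SCHEMANAME, RESULTTABLENAME,
// ssUsername and ssPassword are assumed to be defined as in the question.
spark.conf.set("spark.sql.session.timeZone", "UTC")

val ssdf = spark.read.format("jdbc")
  .option("driver", s"${ssDriver}")
  .option("url", s"${ssConnectionString}")
  .option("dbtable", s"${SCHEMANAME}.${RESULTTABLENAME}")
  .option("user", s"${ssUsername}")
  .option("password", s"${ssPassword}")
  .load()

// The intent: with the JVM and session time zones aligned, DST-affected datetimes
// keep their source values when written out as parquet.
ssdf.write.format("parquet").mode("overwrite").save("s3://targetS3path/")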

Related

pyspark timestamp changing when creating parquet file

I am creating a dataframe and saving it in parquet format.
After creating the dataframe, df.show() displays the correct timestamps, but when I use df.write.parquet("filename") the timestamps change.
Original timestamps from df.show(): 2018-02-27 07:15:00, 2018-07-06 14:23:00
After creating the parquet file and reading from it: 2018-02-27 02:15:00, 2018-07-06 10:23:00
There is either a 5-hour or a 4-hour difference. Any idea what is happening? Thanks for reading.
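No answer is quoted here, but the 5 vs 4 hour gap matches EST vs EDT, which points at a session/JVM time zone difference between where the parquet file is written and where it is read. A rough way to reproduce the effect (a sketch in Scala rather than PySpark; /tmp/ts_demo is just a scratch path):

import org.apache.spark.sql.functions.to_timestamp
import spark.implicits._

// Parse the strings while the session time zone is UTC, then write to parquet.
spark.conf.set("spark.sql.session.timeZone", "UTC")
Seq("2018-02-27 07:15:00", "2018-07-06 14:23:00").toDF("raw")
  .select(to_timestamp($"raw").as("ts"))
  .write.mode("overwrite").parquet("/tmp/ts_demo")

// Read the same instants back with an EST/EDT session time zone: show() now
// renders them 5 hours earlier in winter and 4 hours earlier in summer.
spark.conf.set("spark.sql.session.timeZone", "America/New_York")
spark.read.parquet("/tmp/ts_demo").show(false)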

Spark - How to get the latest hour in S3 path?

I'm using a Databricks notebook with Spark and Scala to read data from S3 into a DataFrame:
myDf = spark.read.parquet(s"s3a://data/metrics/*/*/*/"), where the * wildcards represent year/month/day.
Or I just hardcode it: myDf = spark.read.parquet(s"s3a://data/metrics/2018/05/20/")
Now I want to add an hour parameter right after the day. The idea is to obtain data from S3 for the most recently available hour.
If I do myDf = spark.read.parquet(s"s3a://data/metrics/2018/05/20/*") then I'll get data for all hours of May 20th.
How is it possible to achieve this in a Databricks notebook without hardcoding the hour?
Use the datetime functions:
from datetime import datetime, timedelta
latest_hour = datetime.now() - timedelta(hours=1)
You can also access the year, month, day and hour separately:
latest_hour.year
latest_hour.month
latest_hour.day
latest_hour.hour
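Since the question uses Scala, a rough equivalent with java.time (the s3a://data/metrics layout and the zero padding of month/day/hour are assumptions based on the paths above):

import java.time.LocalDateTime

// Previous clock hour, formatted into the assumed year/month/day/hour layout.
val latestHour = LocalDateTime.now().minusHours(1)
val path = f"s3a://data/metrics/${latestHour.getYear}/${latestHour.getMonthValue}%02d/${latestHour.getDayOfMonth}%02d/${latestHour.getHour}%02d/"
val myDf = spark.read.parquet(path)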

Spark avro predicate pushdown

We are using the Avro data format and the data is partitioned by year, month, day, hour and min.
I see the data stored in HDFS as
/data/year=2018/month=01/day=01/hour=01/min=00/events.avro
And we load the data using
val schema = new Schema.Parser().parse(this.getClass.getResourceAsStream("/schema.txt"))
val df = spark.read.format("com.databricks.spark.avro").option("avroSchema",schema.toString).load("/data")
And then we use predicate pushdown to filter the data -
val x = isInRange(startDate, endDate)($"year", $"month", $"day", $"hour", $"min")
val filtered = df.filter(x)
Can someone explain what is happening behind the scenes?
I want to specifically understand when and where the filtering of input files happens.
Interestingly, when I print the schema, the fields year, month, day and hour are automatically added, i.e. the actual data does not contain these columns. Does Avro add these fields?
I want to understand clearly how the files are filtered and how the partitions are created.
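No answer is quoted here, but one way to check this yourself: with directory names like year=2018/month=01/..., the extra columns come from Spark's partition discovery (not from the Avro schema), and filters on those columns can prune whole directories before any Avro file is read. A small sketch, assuming Spark 2.x semantics and plain column filters instead of the isInRange helper from the question:

// Sketch only: filter directly on the discovered partition columns, then check
// the physical plan for PartitionFilters to see which directories were pruned.
val pruned = df.filter($"year" === 2018 && $"month" === 1 && $"day" === 1 && $"hour" === 1)
pruned.explain(true)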

partition discovery not working for spark avro reader

I'm trying to read AVRO files which are partitioned by year, month and day. For example:
Complete file path
/test/data/source1/year=2018/month=2/day=14/file.avro
Base path
/test/data/source1/
Sample code
val df = sqlContext
  .read
  .format("com.databricks.spark.avro")
  .option("basePath", "/test/data/source1/")
  .option("avroSchema", avroSchema.toString())
  .load("/test/data/source1/year=2018/")
In the output DF, the year column does not show up. What could be the issue?
As per the Spark documentation on Partition Discovery, it should work.
Update:
I'm using Spark 1.6; partition discovery is not working for AVRO, but it is working for Parquet.
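For comparison, the basePath pattern that the update says does work with the built-in Parquet reader would look like this (sketch only, assuming an equivalent Parquet copy of the data under the same layout):

// Sketch only: same basePath trick with the Parquet source; year/month/day
// are expected to show up as partition columns in the schema.
val parquetDf = sqlContext.read
  .option("basePath", "/test/data/source1/")
  .parquet("/test/data/source1/year=2018/")
parquetDf.printSchema()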

Spark Streaming Guarantee Specific Start Window Time

I'm using Spark Streaming to read data from Kinesis with the Structured Streaming framework; my connection is as follows:
val kinesis = spark
.readStream
.format("kinesis")
.option("streams", streamName)
.option("endpointUrl", endpointUrl)
.option("initialPositionInStream", "earliest")
.option("format", "json")
.schema(<my-schema>)
.load
The data comes from several IoT devices, each with a unique id. I need to aggregate the data by this id and by a tumbling window over the timestamp field, as follows:
val aggregateData = kinesis
.groupBy($"uid", window($"timestamp", "15 minute", "15 minute"))
.agg(...)
The problem I'm encountering is that I need to guarantee that every window starts at a round time (such as 00:00:00, 00:15:00 and so on), and I also need a guarantee that only rows containing full 15-minute windows will be output to my sink. What I'm currently doing is:
val query = aggregateData
.writeStream
.foreach(postgreSQLWriter)
.outputMode("update")
.start()
.awaitTermination()
Here postgreSQLWriter is a ForeachWriter I created for inserting each row into a PostgreSQL DBMS. How can I force my windows to be exactly 15 minutes long, with start times at round 15-minute timestamps, for each device's unique id?
Question 1:
To start windows at specific times, there is one more parameter that Spark's window function takes, which is the start offset.
By specifying it, windows start shifted by that offset from the hour.
Example:
dataframe.groupBy($"Column1", window($"TimeStamp", "22 minute", "1 minute", "15 minute"))
So the above syntax groups by Column1 and creates windows of 22 minutes duration, with a sliding interval of 1 minute and an offset of 15 minutes.
For example, it starts from:
window1: 8:15 (8:00 plus the 15-minute offset) to 8:37 (8:15 plus 22 minutes)
window2: 8:16 (previous window start + 1 minute) to 8:38 (22 minutes again)
Question 2:
To push only those windows having the full 15-minute size, create a count column that counts the number of events in that window. Once it reaches 15, push it wherever you want using a filter.
Calculating the count:
dataframe.groupBy($"Column1", window($"TimeStamp", "22 minute", "1 minute", "15 minute")).agg(count($"Column2").as("count"))
A writeStream that keeps only windows whose count is 15:
aggregateddata.filter($"count" === 15).writeStream.format(....).outputMode("complete").start()
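On the original ask of round 15-minute starts: a plain 15-minute tumbling window with no offset is already aligned to :00/:15/:30/:45, since Spark aligns windows to the Unix epoch, and a watermark with append output is the usual way to emit a window only once it can no longer change. A minimal sketch, assuming the kinesis stream from the question:

import org.apache.spark.sql.functions.{count, window}

// Sketch only: epoch-aligned tumbling 15-minute windows (they start on the quarter
// hour) plus a watermark so each window is emitted once it is complete (append mode).
val aggregateData = kinesis
  .withWatermark("timestamp", "15 minutes")
  .groupBy($"uid", window($"timestamp", "15 minutes"))
  .agg(count($"uid").as("events"))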