pyspark timestamp changing when creating parquet file

I am creating a dataframe and saving it in parquet format.
After creating the dataframe, df.show() displays the correct timestamps, but when I use df.write.parquet("filename") and read the file back, the timestamps have changed.
Original timestamps shown by df.show(): 2018-02-27 07:15:00, 2018-07-06 14:23:00
After writing the parquet file and reading from it: 2018-02-27 02:15:00, 2018-07-06 10:23:00
The difference is either 5 hours or 4 hours. Any idea what is happening? Thanks for reading.
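A minimal PySpark sketch (not from the original post) of one way to check and pin the session timezone before the write/read round trip, assuming df is the dataframe described above:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-timestamp-check").getOrCreate()

# See which timezone Spark uses when rendering timestamps in df.show()
print(spark.conf.get("spark.sql.session.timeZone"))

# Pin the session timezone so the write and the later read interpret the
# stored instant the same way (UTC chosen here only for illustration)
spark.conf.set("spark.sql.session.timeZone", "UTC")

df.write.mode("overwrite").parquet("filename")
spark.read.parquet("filename").show(truncate=False)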

Related

datetime offset issue while saving data into parquet for daylight saving time

While we are writing into a parquet file using Spark/Scala, DST (daylight saving time) timestamps are automatically shifted back by one hour, for example 2011-09-20 00:00:00.000 becomes "2011-09-19 23:00:00.000" (see the screenshot from source and destination).
Source (reading data from): SQL Server
Destination (writing into): AWS S3
Code:
val ssdf = spark.read.format("jdbc").
  option("driver", "${ssDriver}").
  option("url", "${ssConnectionString}").
  option("dbtable", "${SCHEMANAME}.${RESULTTABLENAME}").
  option("user", "${ssUsername}").
  option("password", "${ssPassword}").
  load()

ssdf.write.format("parquet").mode("overwrite").option("header", "true").save("s3://targetS3path/")
The code is running fine, but dates that fall within DST are delayed by 1 hour (check the screenshot).
I am expecting the datetime value to match the source, 2011-09-20 00:00:00.000.
To set the JVM timezone, you will need to add extra JVM options for the driver and the executors:
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("SparkByExamples.com")
  .config("spark.driver.extraJavaOptions", "-Duser.timezone=GMT")
  .config("spark.executor.extraJavaOptions", "-Duser.timezone=GMT")
  .config("spark.sql.session.timeZone", "UTC")
  .getOrCreate()
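For the PySpark context of this thread, a rough equivalent of the Scala snippet above (an assumed translation, not part of the original answer) would be:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("local[1]")
    .appName("SparkByExamples.com")
    # Keep the driver, executors, and SQL session on one timezone
    .config("spark.driver.extraJavaOptions", "-Duser.timezone=GMT")
    .config("spark.executor.extraJavaOptions", "-Duser.timezone=GMT")
    .config("spark.sql.session.timeZone", "UTC")
    .getOrCreate()
)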

How to read the last N days from the current date in parquet

I have saved the data in a warehouse in parquet format, partitioned by a date-type column.
I am trying to get the last N days of data from the current date using Scala Spark.
The data is saved under the warehouse path like this:
Testpath/filename/dt=2020-02-01
Testpath/filename/dt=2020-02-02
...........
Testpath/filename/dt=2020-02-28
If I read all of the data, it is a huge amount of data.
As your dataset is correctly partitioned using the parquet format, you just need to read the directory Testpath/filename and let Spark do the partition discovery.
It will add a dt column to your schema with the value taken from the path name: dt=<value>. This value can be used to filter your dataset, and Spark will optimize the read by partition-pruning every directory that does not match your predicate on the dt column.
You could try something like this:
import spark.implicits._
import org.apache.spark.sql.functions._

val df = spark.read.parquet("Testpath/filename/")
  .where($"dt" > date_sub(current_date(), N))
You need to ensure spark.sql.parquet.filterPushdown is set to true (which is the default).
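If you are doing the same from PySpark, a hedged sketch of the equivalent read (N is a placeholder for the number of days, as in the Scala snippet above):

from pyspark.sql import functions as F

N = 7  # example value; substitute the number of days you need

df = (
    spark.read.parquet("Testpath/filename/")
    # Spark can prune the dt= directories that fail this predicate
    .where(F.col("dt") > F.date_sub(F.current_date(), N))
)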

How to generate current_timestamp() without timezone in Pyspark?

I am trying to get current_timestamp() in a column of my dataframe. I am using the code below for that.
df_new = df.withColumn('LOAD_DATE_TIME' , F.current_timestamp())
But this code generates LOAD_DATE_TIME in the format below when exported to a CSV file.
2019-11-19T16:59:44.000+05:30
I don't want the timezone part; I want the datetime in the format below.
2019-11-19 16:59:44
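One possible approach (a sketch, not from the original post) is to store the load time as a string formatted without the offset, so the CSV writer emits exactly that text:

from pyspark.sql import functions as F

# date_format renders the timestamp in the session timezone and drops the offset
df_new = df.withColumn(
    "LOAD_DATE_TIME",
    F.date_format(F.current_timestamp(), "yyyy-MM-dd HH:mm:ss"),
)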

Spark read csv containing nanosecond timestamps

I am dumping a Postgres table using a copy command outputting to CSV.
The CSV contains timestamps formatted as such: 2011-01-01 12:30:10.123456+00.
I'm reading the CSV like
df = spark.read.csv(
    "s3://path/to/csv",
    inferSchema=True,
    timestampFormat="yyyy-MM-dd HH:mm:ss.SSSSSSX",
    ...
)
but this doesn't work (as expected). The timestampFormat uses java.text.SimpleDateFormat which does not have nanosecond support.
I've tried a lot of variations on the timestampFormat, and they all produce either String columns or misformat the timestamp. Seems like the nanoseconds end up overflowing the seconds and adding time to my timestamp.
I can't apply a schema to the CSV because I don't always know it, and I can't cast the columns because I don't always know which will be timestamps. I also can't cast the timestamp on the way out of Postgres, because I'm just doing select * ....
How can I solve this so I can ingest the CSV with the proper timestamp format?
My first thought was that I just had to modify timestampFormat, but that doesn't seem possible. My second thought is to use sed to truncate the timestamp as I'm dumping from Postgres.
I'm using spark 2.3.1.
Thanks for the help!
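A rough sketch of one workaround (not from the original thread): read everything as strings, then truncate the fractional seconds to milliseconds and cast the known timestamp columns. The column name created_at is hypothetical, and this assumes the session timezone matches the +00 offset in the dump:

from pyspark.sql import functions as F

df = spark.read.csv("s3://path/to/csv", header=True)

# Keep only "yyyy-MM-dd HH:mm:ss.SSS" from the Postgres value, then cast it
df = df.withColumn(
    "created_at",
    F.regexp_extract(
        "created_at",
        r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})",
        1,
    ).cast("timestamp"),
)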

partition discovery not working for spark avro reader

I'm trying to read AVRO files that are partitioned by year, month, and day. For example:
Complete file path
/test/data/source1/year=2018/month=2/day=14/file.avro
Base path
/test/data/source1/
Sample code
val df = sqlContext
  .read
  .format("com.databricks.spark.avro")
  .option("basePath", "/test/data/source1/")
  .option("avroSchema", avroSchema.toString())
  .load("/test/data/source1/year=2018/")
In the output DF, the year column does not show up. What could be the issue?
As per the Spark documentation on Partition Discovery, it should work.
Update:
I'm using Spark 1.6; for AVRO it is not working, but for Parquet it is working.
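If upgrading or switching formats is not an option, a hedged PySpark-flavored sketch (not from the original question) of one possible workaround is to derive the partition column from the file path yourself; input_file_name is available from Spark 1.6:

from pyspark.sql import functions as F

df = (
    sqlContext.read
    .format("com.databricks.spark.avro")
    .load("/test/data/source1/year=2018/")
)

# Recover the year partition value from each file's path
df = df.withColumn(
    "year",
    F.regexp_extract(F.input_file_name(), r"year=(\d+)", 1).cast("int"),
)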