partition discovery not working for spark avro reader - scala

I'm trying to read Avro files which are partitioned by year, month and day. For example:
Complete file path
/test/data/source1/year=2018/month=2/day=14/file.avro
Base path
/test/data/source1/
Sample code
val df = sqlContext
.read
.format("com.databricks.spark.avro")
.option("basePath", "/test/data/source1/")
.option("avroSchema", avroSchema.toString())
.load("/test/data/source1/year=2018/")
In the output DF, the year column does not show up. What could be the issue?
As per the Spark documentation on Partition Discovery, it should work.
Update:
I'm using Spark 1.6. It's not working for Avro, but it is working for Parquet.
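For comparison, a minimal sketch of the Parquet read over the same layout (same basePath option and paths as above) where, as noted above, partition discovery does surface the year column:
val parquetDf = sqlContext
  .read
  .option("basePath", "/test/data/source1/")
  .parquet("/test/data/source1/year=2018/")

parquetDf.printSchema() // year (and the other directory-level columns) appear as partition columns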

Related

datetime offset issue while saving data into parquet for day light saving time

[Screenshot: source vs destination values]
While we are writing into a parquet file using Spark/Scala, DST (daylight saving time) datetimes are automatically shifted back by one hour, for example 2011-09-20 00:00:00.000 becomes 2011-09-19 23:00:00.000.
Source (reading data from): SQL Server
Destination (writing into): AWS S3
Code:
val ssdf = spark.read.format("jdbc")
  .option("driver", "${ssDriver}")
  .option("url", "${ssConnectionString}")
  .option("dbtable", "${SCHEMANAME}.${RESULTTABLENAME}")
  .option("user", "${ssUsername}")
  .option("password", "${ssPassword}")
  .load()

ssdf.write.format("parquet").mode("overwrite").option("header", "true").save("s3://targetS3path/")
The code runs fine, but datetimes that fall within DST are delayed by 1 hour; see the screenshot.
I expect the datetime value to match the source: 2011-09-20 00:00:00.000.
Set the JVM time zone. You will need to add extra JVM options for the driver and executors:
val spark = SparkSession.builder()
.master("local[1]")
.appName("SparkByExamples.com")
.config("spark.driver.extraJavaOptions", "-Duser.timezone=GMT")
.config("spark.executor.extraJavaOptions", "-Duser.timezone=GMT")
.config("spark.sql.session.timeZone", "UTC")
.getOrCreate()
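As a quick sanity check (a small sketch; the config keys are the ones set above), you can confirm what the session actually picked up:
// Effective SQL session time zone set above
println(spark.conf.get("spark.sql.session.timeZone"))
// JVM default time zone; should reflect -Duser.timezone when the extra
// JVM options were applied at JVM startup (they may not take effect in
// an already-running local driver)
println(java.util.TimeZone.getDefault.getID)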

pyspark timestamp changing when creating parquet file

I am creating a dataframe and saving it as parquet format.
After creating the dataframe, df.show() displays the correct timestamps, but when I use df.write.parquet("filename") the timestamps change.
Original timestamps from df.show(): 2018-02-27 07:15:00, 2018-07-06 14:23:00
After writing the parquet file and reading it back: 2018-02-27 02:15:00, 2018-07-06 10:23:00
There is either a 5-hour or a 4-hour difference. Any idea what is happening? Thanks for reading.
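Not a definitive answer, but a minimal sketch of one thing to check (shown in Scala, though the same config applies from PySpark; spark.sql.session.timeZone assumes Spark 2.2+): Parquet stores Spark timestamps as instants, and show() renders them in the session/JVM time zone, so a mismatch between the writing and the reading environments would produce exactly this kind of 4/5-hour (EST/EDT-like) shift. Here spark is the active session and "filename" is the path from the question.
// Pin the rendering time zone explicitly before reading the file back,
// so both show() calls are interpreted the same way.
spark.conf.set("spark.sql.session.timeZone", "UTC")
val roundTrip = spark.read.parquet("filename")
roundTrip.show(false)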

How to read the last N days from the current date in parquet

I have saved the data in the warehouse in parquet format, partitioned by a date-type column.
I am trying to get the last N days of data from the current date using Scala Spark.
The data is saved under the warehouse path as below:
Tespath/filename/dt=2020-02-01
Tespath/filename/dt=2020-02-02
...........
Tespath/filename/dt=2020-02-28
If I read all the data, it is a huge amount of data.
As your dataset is correctly partitioned using the parquet format, you just need to read the directory Testpath/filename and let Spark do the partition discovery.
It will add a dt column to your schema with the value from the path name (dt=<value>). This value can be used to filter your dataset, and Spark will optimize the read by partition pruning, skipping every directory that does not match your predicate on the dt column.
You could try something like this:
import spark.implicits._
import org.apache.spark.sql.functions._

// N = number of days to look back from the current date
val df = spark.read.parquet("Testpath/filename/")
  .where($"dt" > date_sub(current_date(), N))
You need to ensure spark.sql.parquet.filterPushdown is set to true (which is the default).
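To confirm that partition pruning actually kicks in, one option is to inspect the physical plan (a sketch; the exact output format varies across Spark versions):
// The FileScan node lists the dt predicate under PartitionFilters and shows
// how many dt=... directories remain after pruning.
df.explain(true)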

Spark - How to get the latest hour in S3 path?

I'm using a Databricks notebook with Spark and Scala to read data from S3 into a DataFrame:
myDf = spark.read.parquet(s"s3a://data/metrics/*/*/*/). where * wildcards represent year/month/day.
Or I just hardcode it: myDf = spark.read.parquet(s"s3a://data/metrics/2018/05/20/)
Now I want to add an hour parameter right after the day. The idea is to obtain data from S3 for the most recently available hour.
If I do myDf = spark.read.parquet(s"s3a://data/metrics/2018/05/20/*") then I'll get data for all hours of May 20th.
How is it possible to achieve this in a Databricks notebook without hardcoding the hour?
Use the datetime module:
from datetime import datetime, timedelta
latest_hour = datetime.now() - timedelta(hours = 1)
You can also get the year, month, day and hour components separately:
latest_hour.year
latest_hour.month
latest_hour.day
latest_hour.hour
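Since the question is about a Scala notebook, here is a hedged Scala sketch of the same idea: build the previous hour's prefix with java.time and read just that path. The bucket layout and zero-padding are assumed from the examples above, and this presumes the previous hour's data has already landed in S3.
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

// Previous clock hour, formatted to match the year/month/day/hour layout
// shown in the question (e.g. 2018/05/20/13).
val latestHour = LocalDateTime.now().minusHours(1)
val hourPath = latestHour.format(DateTimeFormatter.ofPattern("yyyy/MM/dd/HH"))

val myDf = spark.read.parquet(s"s3a://data/metrics/$hourPath/")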

Spark avro predicate pushdown

We are using the Avro data format, and the data is partitioned by year, month, day, hour and min.
I see the data stored in HDFS as
/data/year=2018/month=01/day=01/hour=01/min=00/events.avro
And we load the data using
val schema = new Schema.Parser().parse(this.getClass.getResourceAsStream("/schema.txt"))
val df = spark.read.format("com.databricks.spark.avro")
  .option("avroSchema", schema.toString)
  .load("/data")
And then we use predicate pushdown to filter the data:
val x = isInRange(startDate, endDate)($"year", $"month", $"day", $"hour", $"min")
val filteredDf = df.filter(x)
Can someone explain what is happening behind the scenes?
I specifically want to understand when and where the filtering of the input files happens.
Interestingly, when I print the schema, the fields year, month, day and hour are automatically added, i.e. the actual data does not contain these columns. Does Avro add these fields?
I want to understand clearly how the files are filtered and how the partitions are created.
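Not a full answer, but a sketch of how to see part of this for yourself (assuming the Spark 2.x DataFrame API and the df defined above): the year/month/day/hour/min columns come from the directory names via partition discovery, not from the Avro records, and the physical plan shows which predicates Spark turned into partition filters.
import spark.implicits._

// Partition columns are derived from the directory names
// (year=2018/month=01/day=01/hour=01/min=00), which is why they appear in the
// schema even though the Avro records themselves do not contain them.
df.printSchema()

// Predicates listed under PartitionFilters in the FileScan node are evaluated
// against the directory values before any Avro file is opened, so only the
// matching partitions are actually read.
df.filter($"year" === 2018 && $"month" === 1 && $"day" === 1).explain(true)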