Append new data to partitioned parquet files - scala

I am writing an ETL process where I will need to read hourly log files, partition the data, and save it. I am using Spark (in Databricks).
The log files are CSV so I read them and apply a schema, then perform my transformations.
My problem is: how can I save each hour's data in parquet format while appending to the existing data set? When saving, I need to partition by 4 columns present in the dataframe.
Here is my save line:
data
.filter(validPartnerIds($"partnerID"))
.write
.partitionBy("partnerID","year","month","day")
.parquet(saveDestination)
The problem is that if the destination folder exists, the save throws an error, and if it doesn't exist, I am not appending my files.
I've tried using .mode("append"), but I find that Spark sometimes fails midway through, so I end up losing track of how much of my data has been written and how much I still need to write.
I am using parquet because the partitioning substantially speeds up my future queries. Also, I must write the data to disk in some file format and cannot use a database such as Druid or Cassandra.
Any suggestions for how to partition my dataframe and save the files (either sticking with parquet or using another format) are greatly appreciated.

If you need to append the files, you definitely have to use append mode. I don't know how many partitions you expect it to generate, but I find that if you have many partitions, partitionBy causes a number of problems (memory and IO issues alike).
If you think that your problem is caused by write operations taking too long, I recommend that you try these two things:
1) Use snappy by adding to the configuration:
conf.set("spark.sql.parquet.compression.codec", "snappy")
2) Disable generation of the metadata files in the hadoopConfiguration on the SparkContext like this:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
The metadata files are somewhat time-consuming to generate (see this blog post), but according to this they are not actually important. Personally, I always disable them and have had no issues.
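Putting both suggestions together with append mode, a minimal sketch (reusing data, validPartnerIds and saveDestination from the question; sqlContext and sc are the usual SQLContext and SparkContext) might look like this:
// Minimal sketch: the two settings above, applied before an append-mode write.
sqlContext.setConf("spark.sql.parquet.compression.codec", "snappy")
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")

data
.filter(validPartnerIds($"partnerID"))
.write
.mode("append")
.partitionBy("partnerID", "year", "month", "day")
.parquet(saveDestination)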
If you generate many partitions (> 500), I'm afraid the best I can do is suggest that you look into a solution that does not use append mode; I simply never managed to get partitionBy to work with that many partitions.
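For what it's worth, one shape such an append-free workaround can take (purely a sketch; batchHour is a hypothetical value supplied by your ETL driver, not something from the question) is to write each hourly batch into its own key=value subdirectory, so every run is an idempotent overwrite that can simply be re-executed on failure:
val batchHour = "2016-02-14-09"  // hypothetical: the hour currently being processed
data
.filter(validPartnerIds($"partnerID"))
.write
.mode("overwrite")  // overwriting only this hour's directory is safe to retry
.partitionBy("partnerID", "year", "month", "day")
.parquet(s"$saveDestination/batch_hour=$batchHour")
Because the subdirectory is in key=value form, reading saveDestination later still picks up batch_hour as just another partition column.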

If your data is not organized by the partition columns before writing, it will be spread across all of Spark's in-memory partitions, which means every task generates and writes data to every output file.
Consider repartitioning your data by the partition columns before writing, so that all the data for each output file lives in the same partition:
data
.filter(validPartnerIds($"partnerID"))
.repartition($"partnerID", $"year", $"month", $"day")  // optionally pass a target partition count as the first argument
.write
.partitionBy("partnerID","year","month","day")
.parquet(saveDestination)
See: DataFrame.repartition

Related

DeltaFileNotFoundException: No file found in the directory DataBricks

I would like to request your help.
I have been working with Databricks.
We developed some scripts and they are running as streaming jobs.
Let's suppose we have two jobs running and writing data to one general local dataset (LDS); that is, notebook1 and notebook2 write data to the same LDS.
Each notebook reads data from a different origin and writes it to the same LDS in a standard format. To avoid problems we made use of partitions in the LDS.
In this case the LDS has one partition for notebook1 and another partition for notebook2.
This implementation has been working well for almost 5 months.
However, today we just faced the following error:
com.databricks.sql.transaction.tahoe.DeltaFileNotFoundException: No file found in the directory: dbfs:/mnt/streaming/streaming1/_delta_log.
I have been looking for a way to solve this, and the solutions I found are:
Solution 1, which explains some reasons why this situation can happen and suggests either using a new checkpoint directory or setting the Spark property spark.sql.files.ignoreMissingFiles to true in the cluster's Spark config.
Using a new checkpoint directory is not possible for us because of the requirements we have to satisfy: a new checkpoint would mean reprocessing all of the data that has already been processed.
Why? In summary, we receive updates from a database into a Delta table that contains the raw data and is where we consume from, so creating a new checkpoint or deleting the existing one would mean consuming all of the data again.
That only leaves the option of setting spark.sql.files.ignoreMissingFiles. However, my question is: if we set this property, would we be processing the data from the beginning, or would processing resume from the last checkpoint?
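For reference, the property from Solution 1 is an ordinary Spark configuration; as a sketch (in Scala here, though per the linked answer it would normally go into the cluster's Spark config):
// Skip source files that have gone missing instead of failing the query.
// Whether the stream then resumes from the existing checkpoint is exactly the open question above.
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")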
Solution 2: I found a similar case here, but I didn't fully understand it. They suggest changing the parent directory and adding the directory in the start() option, but we already have something similar and it does not solve our problem.
Our main stream looks like this:
from pyspark.sql.functions import expr

spark.readStream.format("delta") \
    .option("readChangeFeed", "true") \
    .option("maxFilesPerTrigger", 250) \
    .option("maxBytesPerTrigger", 536870912) \
    .option("failOnDataLoss", "true") \
    .load(DATA_PATH) \
    .filter(expr("_change_type not in ('delete', 'update_preimage')")) \
    .writeStream \
    .queryName(streamQueryName) \
    .foreachBatch(MainFunctionstoprocess) \
    .option("checkpointLocation", checkpointLocation) \
    .option("mergeSchema", "true") \
    .trigger(processingTime='1 seconds') \
    .start()
Does anyone have an idea how we could solve this problem without deleting the checkpoints, so we can resume from the last checkpoint where it failed, or some way to go back to an earlier checkpoint so we only have to reprocess part of the data?

scala - how to avoid creating empty Avro files (or control the number of files)

my_data.write
.mode(SaveMode.Overwrite)
.avro(_outputPath)
This usually works fine, but when the amount of data is very small, some of the Avro files end up empty.
The number of files also varies from run to run, and when there are fewer data rows than files, some files are left empty, containing only the column info.
Is there a way to control the number of output Avro files based on the number of data rows? Or to avoid creating an output file when there is no data?
The number of files depends on how many partitions your dataframe has; each partition creates its own file. If you know that there is not much data to write, you can repartition the dataframe before writing it.
my_data.repartition(1)
.write
.mode(SaveMode.Overwrite)
.avro(_outputPath)
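If you want to tie the file count to the row count instead of hard-coding 1, here is a rough sketch (the targetRowsPerFile value is an arbitrary assumption, and .format("avro") assumes Spark 2.4+ built-in Avro; older setups would keep using spark-avro's .avro(...) as above):
import org.apache.spark.sql.SaveMode

// Rough sketch: derive the partition (and therefore file) count from the data size,
// and skip the write entirely when there are no rows, so no empty files are produced.
val targetRowsPerFile = 500000L  // assumption: tune to your data
val rowCount = my_data.count()   // note: this triggers an extra pass over the data
if (rowCount > 0) {
  val numFiles = ((rowCount + targetRowsPerFile - 1) / targetRowsPerFile).toInt
  my_data.repartition(numFiles)
    .write
    .mode(SaveMode.Overwrite)
    .format("avro")
    .save(_outputPath)
}
The extra count() is usually acceptable for outputs small enough to hit this problem in the first place.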

What's the best way to write a big file to S3?

I'm using Zeppelin and Spark, and I'd like to take a 2 TB file from S3, run transformations on it in Spark, and then send it back up to S3 so that I can work with the file in a Jupyter notebook. The transformations are pretty straightforward.
I'm reading the file as a parquet file. I think it's about 2 TB, but I'm not sure how to verify.
It's about 10M rows and 5 columns, so it's pretty big.
I tried my_table.write.parquet(s3path) and I tried my_table.write.option("maxRecordsPerFile", 200000).parquet(s3path). How do I figure out the right way to write a big parquet file?
These are the points you could consider...
1) maxRecordsPerFile setting:
With
my_table.write.parquet(s3path)
Spark writes one file per task.
The number of saved files equals the number of partitions of the RDD/DataFrame being saved, so this can result in very large files (of course you can repartition the data before saving, but repartitioning shuffles it across the network).
To limit the number of records per file:
my_table.write.option("maxRecordsPerFile", numberOfRecordsPerFile).parquet(s3path)  // choose numberOfRecordsPerFile as you wish
This avoids generating huge files.
2) If you are using AWS EMR (EMRFS), this is another point you can consider:
emr-spark-s3-optimized-committer
The EMRFS S3-optimized committer is not used:
when using the S3A file system;
when using an output format other than Parquet, such as ORC or text.
3) Compression, the committer algorithm version, and other Spark configurations:
.config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 2)
.config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored", true)
.config("spark.hadoop.parquet.enable.summary-metadata", false)
.config("spark.sql.parquet.mergeSchema", false)
.config("spark.sql.parquet.filterPushdown", true) // for reading purpose
.config("mapreduce.fileoutputcommitter.algorithm.version", "2")
.config("spark.sql.parquet.compression.codec", "snappy")
.getOrCreate()
4) Fast upload and other properties in case you are using s3a:
.config("spark.hadoop.fs.s3a.fast.upload","true")
.config("spark.hadoop.fs.s3a.fast.upload","true")
.config("spark.hadoop.fs.s3a.connection.timeout","100000")
.config("spark.hadoop.fs.s3a.attempts.maximum","10")
.config("spark.hadoop.fs.s3a.fast.upload","true")
.config("spark.hadoop.fs.s3a.fast.upload.buffer","bytebuffer")
.config("spark.hadoop.fs.s3a.fast.upload.active.blocks","4")
.config("fs.s3a.connection.ssl.enabled", "true")
The s3a connector writes blocks incrementally, but the (obsolete) version shipping with Spark builds based on hadoop-2.7.x doesn't handle it very well. If you can, update all hadoop-* JARs to 2.8.5 or 2.9.x.
The option fs.s3a.multipart.size controls the block size. There is a limit of 10,000 blocks per upload, so the maximum file you can upload is that size * 10,000. For very large files, use a bigger value than the default of 64M.

Should we avoid partitionBy when writing files to S3 in spark?

The parquet location is:
s3://mybucket/ref_id/date/camera_id/parquet-file
Let's say I have 3 ref_id values, 4 dates, and 500 camera_id values. If I write parquet as below (using partitionBy), I will get 3 x 4 x 500 = 6000 files uploaded to S3. This is extremely slow compared to just writing a couple of files to the top-level bucket (no multi-level prefix).
What is the best practice? My colleague argues that partitionBy is a good thing when used together with a Hive metastore/table.
df.write.mode("overwrite")\
.partitionBy('ref_id','date','camera_id')\
.parquet('s3a://mybucket/tmp/test_data')
If your problem is too many files, which seems to be the case, you need to repartition your RDD/DataFrame before you write it. Each RDD/DataFrame partition generates one file per folder.
df.repartition(1)\
.write.mode("overwrite")\
.partitionBy('ref_id','date','camera_id')\
.parquet('s3a://mybucket/tmp/test_data')
As an alternative to repartition you can also use coalesce.
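A minimal sketch of the coalesce variant (same names as above; unlike repartition it only merges existing partitions, so it avoids a full shuffle but cannot increase parallelism):
df.coalesce(1)
.write.mode("overwrite")
.partitionBy("ref_id", "date", "camera_id")
.parquet("s3a://mybucket/tmp/test_data")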
If (after repartitioning to 1) the files are too small, you need to reduce the depth of the directory structure. The Parquet documentation recommends a file size between 500 MB and 1 GB.
https://parquet.apache.org/documentation/latest/
We recommend large row groups (512MB - 1GB). Since an entire row group might need to be read, we want it to completely fit on one HDFS block.
If your files are only a few KB or MB, you have a serious problem; it will seriously hurt performance.

NullPointerException when writing parquet

I am trying to measure how long it takes to read and write parquet files in Amazon S3 (under a specific partition).
To do that, I wrote a script that simply reads the files and then writes them back:
val df = sqlContext.read.parquet(path + "p1.parquet/partitionBy=partition1")
df.write.mode("overwrite").parquet(path + "p1.parquet/partitionBy=partition1")
However, I get a NullPointerException. I tried adding df.count in between, but got the same error.
The reason for the error is that Spark only reads the data when it is actually used (lazy evaluation). As a result, Spark ends up reading from the files at the same time as it is trying to overwrite them, and that fails because data can't be overwritten while it is being read.
Since this is for timing purposes, I'd recommend saving to a temporary location. An alternative is to .cache() the data when reading, perform an action to force the read (and actually cache the data), and then overwrite the files.
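A minimal sketch of the cache-and-force-read alternative (same paths as the question; count() is just one convenient action to materialize the cache):
val df = sqlContext.read.parquet(path + "p1.parquet/partitionBy=partition1").cache()
df.count()  // any action works; it forces the read and populates the cache
df.write.mode("overwrite").parquet(path + "p1.parquet/partitionBy=partition1")
Keep in mind that if the data does not fit entirely in the cache, the missing partitions are recomputed from the source files during the write, so the temporary-location approach is the safer of the two.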