What's the best way to write a big file to S3? - scala

I'm using zeppelin and spark, and I'd like to take a 2TB file from S3 and run transformations on it in Spark, and then send it up to S3 so that I can work with the file in Jupyter notebook. The transformations are pretty straightforward.
I'm reading the file as a parquet file. I think it's about 2TB, but I'm not sure how to verify.
It's about 10M row and 5 columns, so it's pretty big.
I tried to do my_table.write.parquet(s3path) and I tried my_table.write.option("maxRecordsPerFile", 200000).parquet(s3path). How do I come up with the right way to write a big parquet file?

These are the points you could consider...
1) maxRecordsPerFile setting:
With
my_table.write.parquet(s3path)
Spark writes a single file out per task.
The number of saved files is = the number of partitions of the RDD/Dataframe being saved. Thus, this could result in ridiculously large files (of couse you can repartition your data and save repartition means shuffles the data across the networks.).
To limit number of records per file
my_table.write.option("maxRecordsPerFile", numberOfRecordsPerFile..yourwish).parquet(s3path)
It can avoid generating huge files.
2) If you are using AWS Emr (Emrfs) this could be one of the point you can consider.
emr-spark-s3-optimized-committer
When the EMRFS S3-optimized Committer is Not Used :
When using the S3A file system.
When using an output format other than Parquet, such as ORC or text.
3) Using compression techniques , algo version and other spark configurations:
.config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 2)
.config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored", true)
.config("spark.hadoop.parquet.enable.summary-metadata", false)
.config("spark.sql.parquet.mergeSchema", false)
.config("spark.sql.parquet.filterPushdown", true) // for reading purpose
.config("mapreduce.fileoutputcommitter.algorithm.version", "2")
.config("spark.sql.parquet.compression.codec", "snappy")
.getOrCreate()
4) fast upload and other props in case you are using s3a:
.config("spark.hadoop.fs.s3a.fast.upload","true")
.config("spark.hadoop.fs.s3a.fast.upload","true")
.config("spark.hadoop.fs.s3a.connection.timeout","100000")
.config("spark.hadoop.fs.s3a.attempts.maximum","10")
.config("spark.hadoop.fs.s3a.fast.upload","true")
.config("spark.hadoop.fs.s3a.fast.upload.buffer","bytebuffer")
.config("spark.hadoop.fs.s3a.fast.upload.active.blocks","4")
.config("fs.s3a.connection.ssl.enabled", "true")

The S3a connector will incrementally write blocks, but the (obsolete) version shipping with spark in hadoop-2.7.x doesn't handle it very well. IF you can, update all hadoop- Jars to 2.8.5 or 2.9.x.
the option "fs.s3a.multipart.size controls the size of the block. There's a limit of 10K blocks, so the max file you can upload is that size * 10,000. For very large files, use a bigger number than the default of "64M"

Related

scala - to avoid creating empty avro file (or handling the number of files)

my_data.write
.mode(SaveMode.Overwrite)
.avro(_outputPath)
It works fine usually, but when the data is a very small amount, there are some empty Avro files.
All the number of files are quite different per try, when the data row is less than the number of files, some file is in an empty state, only column info are included.
Is there a way to handle the number of output Avro files per the data row number? Or not to create output file if there's not data?
The number of files will depend on how many partitions your dataframe has. Each partition will create its own file. If you know that there is no much data to write, you can re-partition the dataframe before writing it.
my_data.repartition(1)
.write
.mode(SaveMode.Overwrite)
.avro(_outputPath)

Should we avoid partitionBy when writing files to S3 in spark?

The parquet location is:
s3://mybucket/ref_id/date/camera_id/parquet-file
Let's say I have ref_id x3, date x 4, camera_id x 500, if I write parquet like below(use partitionBy), I will get 3x4x500=6000 files uploaded to S3. It is extremely slower than that just wrote a couple of files to the top-level bucket(no multiple level prefix)
What is the best practice? My colleague argues that partitionBy is good thing when used together with Hive metastore/table
df.write.mode("overwrite")\
.partitionBy('ref_id','date','camera_id')\
.parquet('s3a://mybucket/tmp/test_data')
If your problem is too many files, which seems to be the case, you need to repartition your RDD/dataframe before you write it. Each RDD/Dataframe partition will generate 1 file per folder.
df.repartition(1)\
.write.mode("overwrite")\
.partitionBy('ref_id','date','camera_id')\
.parquet('s3a://mybucket/tmp/test_data')
As alternative to repartition you can also use coalesce.
If (after repartition to 1) the files are too small you, need to reduce the directory structure. The parquet documentation recommends file size between 500Mb and 1Gb.
https://parquet.apache.org/documentation/latest/
We recommend large row groups (512MB - 1GB). Since an entire row
group might need to be read, we want it to completely fit on one HDFS
block.
If your files are a few Kb or Mb then you have a serious problem, it will seriously hurt performance.

How to handle large gz file in Spark

I am trying to read large gz file and, then inserting into table. this is taking so long.
sparkSession.read.format("csv").option("header", "true").load("file-about-5gb-size.gz").repartition( 1000).coalesce(1000).write.mode("overwrite").format("orc").insertInto(table)
Is there any way I can optimize this, please help.
Note: I have used random repartition and coalesce
You won't be able to do read optimization if your file is in gzip compression. The gzip compression is not splittable in spark. There's no way to avoid reading the complete file in the spark driver node.
If you want to parallelize, you need to make this file splittable by unzip it and then process it.

Hadoop Config settings through spark-shell seems to have no effect

I'm trying to edit the hadoop block size configuration through spark shell so that the parquet part files generated are of a specific size. I tried setting several variables this way :-
val blocksize:Int = 1024*1024*1024
sc.hadoopConfiguration.setInt("dfs.blocksize", blocksize) //also tried dfs.block.size
sc.hadoopConfiguration.setInt("parquet.block.size", blocksize)
val df = spark.read.csv("/path/to/testfile3.txt")
df.write.parquet("/path/to/output/")
The test file is a large text file of almost 3.5 GB. However, no matter what blocksize I specify or approach I take, the number of part files created and their sizes are the same. It is possible for me to change the number of part files generated using the repartition and coalesce functions, but I have to use and approach that would not shuffle the data in the data frame in any way!
I have also tried specifying
f.write.option("parquet.block.size", 1048576).parquet("/path/to/output")
But with no luck. Can someone please highlight what I am doing wrong? Also is there any other approach I can use that can alter parquet block sizes that are written into hdfs?

Append new data to partitioned parquet files

I am writing an ETL process where I will need to read hourly log files, partition the data, and save it. I am using Spark (in Databricks).
The log files are CSV so I read them and apply a schema, then perform my transformations.
My problem is, how can I save each hour's data as a parquet format but append to the existing data set? When saving, I need to partition by 4 columns present in the dataframe.
Here is my save line:
data
.filter(validPartnerIds($"partnerID"))
.write
.partitionBy("partnerID","year","month","day")
.parquet(saveDestination)
The problem is that if the destination folder exists the save throws an error.
If the destination doesn't exist then I am not appending my files.
I've tried using .mode("append") but I find that Spark sometimes fails midway through so I end up loosing how much of my data is written and how much I still need to write.
I am using parquet because the partitioning substantially increases my querying in the future. As well, I must write the data as some file format on disk and cannot use a database such as Druid or Cassandra.
Any suggestions for how to partition my dataframe and save the files (either sticking to parquet or another format) is greatly appreciated.
If you need to append the files, you definitely have to use the append mode. I don't know how many partitions you expect it to generate, but I find that if you have many partitions, partitionBy will cause a number of problems (memory- and IO-issues alike).
If you think that your problem is caused by write operations taking too long, I recommend that you try these two things:
1) Use snappy by adding to the configuration:
conf.set("spark.sql.parquet.compression.codec", "snappy")
2) Disable generation of the metadata files in the hadoopConfiguration on the SparkContext like this:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
The metadata-files will be somewhat time consuming to generate (see this blog post), but according to this they are not actually important. Personally, I always disable them and have no issues.
If you generate many partitions (> 500), I'm afraid the best I can do is suggest to you that you look into a solution not using append-mode - I simply never managed to get partitionBy to work with that many partitions.
If you're using unsorted partitioning your data is going to be split across all of your partitions. That means every task will generate and write data to each of your output files.
Consider repartitioning your data according to your partition columns before writing to have all the data per output file on the same partitions:
data
.filter(validPartnerIds($"partnerID"))
.repartition([optional integer,] "partnerID","year","month","day")
.write
.partitionBy("partnerID","year","month","day")
.parquet(saveDestination)
See: DataFrame.repartition