Should we avoid partitionBy when writing files to S3 in spark? - scala

The parquet location is:
s3://mybucket/ref_id/date/camera_id/parquet-file
Let's say I have ref_id x 3, date x 4, camera_id x 500. If I write parquet like below (using partitionBy), I will get 3 x 4 x 500 = 6000 files uploaded to S3. That is extremely slow compared to just writing a couple of files to the top-level bucket (no multi-level prefix).
What is the best practice? My colleague argues that partitionBy is a good thing when used together with a Hive metastore/table.
df.write.mode("overwrite")\
.partitionBy('ref_id','date','camera_id')\
.parquet('s3a://mybucket/tmp/test_data')

If your problem is too many files, which seems to be the case, you need to repartition your RDD/dataframe before you write it. Each RDD/Dataframe partition will generate 1 file per folder.
df.repartition(1)\
.write.mode("overwrite")\
.partitionBy('ref_id','date','camera_id')\
.parquet('s3a://mybucket/tmp/test_data')
As an alternative to repartition you can also use coalesce.
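For example, a minimal Scala sketch of the coalesce variant (same columns and path as the question; the target of 4 partitions is an arbitrary assumption):
// coalesce avoids a full shuffle, so it is usually cheaper than repartition
// when you only want to reduce the number of partitions.
df.coalesce(4)                       // hypothetical target: at most 4 files per partition directory
  .write.mode("overwrite")
  .partitionBy("ref_id", "date", "camera_id")
  .parquet("s3a://mybucket/tmp/test_data")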
If (after repartitioning to 1) the files are too small, you need to reduce the depth of the directory structure. The Parquet documentation recommends a file size between 512 MB and 1 GB.
https://parquet.apache.org/documentation/latest/
We recommend large row groups (512MB - 1GB). Since an entire row group might need to be read, we want it to completely fit on one HDFS block.
If your files are only a few KB or MB, you have a serious problem; it will seriously hurt performance.
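As a rough way to pick a partition count that lands near that file size, here is a hedged Scala sketch (the total size and compression behaviour are assumptions you have to estimate for your own data):
// Hypothetical sizing helper: aim for roughly 512 MB per output file.
// totalSizeBytes is an assumed figure; measure your real input and remember
// that Parquet compression will shrink the output further.
val totalSizeBytes  = 50L * 1024 * 1024 * 1024      // assume ~50 GB of data
val targetFileBytes = 512L * 1024 * 1024            // ~512 MB per file
val numPartitions   = math.max(1, (totalSizeBytes / targetFileBytes).toInt)

df.repartition(numPartitions)
  .write.mode("overwrite")
  .parquet("s3a://mybucket/tmp/test_data")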

Related

Is there an optimal way for writing lots of tiny files with PySpark?

I have a job that requires writing a single JSON file to S3 for each row in a Spark dataframe (which then gets picked up by another process).
df.repartition(col("id")).write.mode("overwrite").partitionBy(col("id")).json(
f"s3://bucket/path/to/file"
)
These datasets often consist of 100k rows (sometimes 1m+) and take a very long time to write. I understand that large numbers of small files are not great for read performance, but is this also the case for writes? Or is there something that can be done with partitioning to speed things up?
Please don't do this, you will only suffer pain. S3 was designed as cheap long-term storage optimized for large files. It was designed so that the 'prefix' (directory path) leads to the internal partition that serves the files. If you want to optimize reads and writes, you want several partitions being written to at the same time. That means you want the part of the directory path (prefix) with the most variation to come first, to increase the number of partitions you write to.
Example of multiple files being written behind the same prefix:
s3://mydrive/mystuff/2020-12-31
s3://mydrive/mystuff/2020-12-30
s3://mydrive/mystuff/2020-12-29
This is because they all share the same prefix --> s3://mydrive/mystuff/
What if instead you flipped the part that changes? Now different partitions are used, because each write goes to a different prefix:
s3://2020-12-31/mydrive/mystuff/
s3://2020-12-30/mydrive/mystuff/
s3://2020-12-29/mydrive/mystuff/
This change will help with read/write speed, as different partitions are used. It will not solve the problem that S3 doesn't actually use directories to point you to files: as I said, the prefix is just part of the object key, and S3 still has to look the key up among all the objects you have written. This is why tons of small files make things worse; the lookup takes longer and longer the more files you write. Because this lookup is expensive, it's much faster to write larger files and keep the lookup cost to a minimum.
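A hedged Scala illustration of the idea (the bucket name is made up, and this trades partitionBy for one write per date so that the varying value sits at the front of the key):
import org.apache.spark.sql.functions.col
import spark.implicits._                      // for .as[String]

// Hypothetical: write each date's slice under a key that differs as early as
// possible, instead of sharing one long common prefix.
val dates = df.select("date").distinct().as[String].collect()
dates.foreach { d =>
  df.filter(col("date") === d)
    .write.mode("overwrite")
    .parquet(s"s3a://my-bucket/$d/mydrive/mystuff/")   // prefix varies at the front
}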

What's the best way to write a big file to S3?

I'm using zeppelin and spark, and I'd like to take a 2TB file from S3 and run transformations on it in Spark, and then send it up to S3 so that I can work with the file in Jupyter notebook. The transformations are pretty straightforward.
I'm reading the file as a parquet file. I think it's about 2TB, but I'm not sure how to verify.
It's about 10M rows and 5 columns, so it's pretty big.
I tried to do my_table.write.parquet(s3path) and I tried my_table.write.option("maxRecordsPerFile", 200000).parquet(s3path). How do I come up with the right way to write a big parquet file?
These are the points you could consider...
1) maxRecordsPerFile setting:
With
my_table.write.parquet(s3path)
Spark writes a single file out per task.
The number of saved files is equal to the number of partitions of the RDD/DataFrame being saved. Thus, this could result in ridiculously large files (of course you can repartition your data before saving, but repartitioning shuffles the data across the network).
To limit the number of records per file:
my_table.write.option("maxRecordsPerFile", <numberOfRecordsPerFile>).parquet(s3path)
This avoids generating huge files.
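For instance, a hedged one-liner with an arbitrary cap (pick a number that gets each file close to the 512 MB - 1 GB range mentioned above):
// Hypothetical cap: at most 1,000,000 records per output file.
my_table.write.option("maxRecordsPerFile", 1000000).parquet(s3path)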
2) If you are using AWS EMR (EMRFS), this could be one of the points to consider (see the sketch after the list below).
emr-spark-s3-optimized-committer
When the EMRFS S3-optimized committer is not used:
When using the S3A file system.
When using an output format other than Parquet, such as ORC or text.
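If the committer does apply to your setup (EMRFS with Parquet output), here is a hedged sketch of enabling it explicitly; the property name is my reading of the EMR docs, and recent EMR releases turn it on by default, so verify against your release's documentation:
import org.apache.spark.sql.SparkSession

// Assumption: running on EMR with EMRFS (s3:// paths) and writing Parquet;
// this property does nothing for s3a:// or non-Parquet output.
val spark = SparkSession.builder()
  .appName("parquet-to-s3")
  .config("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")
  .getOrCreate()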
3) Using compression, the file output committer algorithm version, and other Spark configurations:
.config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", 2)
.config("spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored", true)
.config("spark.hadoop.parquet.enable.summary-metadata", false)
.config("spark.sql.parquet.mergeSchema", false)
.config("spark.sql.parquet.filterPushdown", true) // for reading purpose
.config("mapreduce.fileoutputcommitter.algorithm.version", "2")
.config("spark.sql.parquet.compression.codec", "snappy")
.getOrCreate()
4) Fast upload and other properties in case you are using s3a:
.config("spark.hadoop.fs.s3a.fast.upload","true")
.config("spark.hadoop.fs.s3a.fast.upload","true")
.config("spark.hadoop.fs.s3a.connection.timeout","100000")
.config("spark.hadoop.fs.s3a.attempts.maximum","10")
.config("spark.hadoop.fs.s3a.fast.upload","true")
.config("spark.hadoop.fs.s3a.fast.upload.buffer","bytebuffer")
.config("spark.hadoop.fs.s3a.fast.upload.active.blocks","4")
.config("fs.s3a.connection.ssl.enabled", "true")
The s3a connector will incrementally write blocks, but the (obsolete) version shipping with Spark's hadoop-2.7.x builds doesn't handle it very well. If you can, update all hadoop-* JARs to 2.8.5 or 2.9.x.
The option "fs.s3a.multipart.size" controls the size of the block. There's a limit of 10,000 blocks, so the maximum file you can upload is that size * 10,000. For very large files, use a bigger number than the default of "64M".

Spark: partition .txt.gz files and convert to parquet

I need to convert all text files in a folder that are gzipped to parquet. I wonder if I need to gunzip them first or not.
Also, I'd like to partition each file into 100 parts.
This is what I have so far:
sc.textFile("s3://bucket.com/files/*.gz").repartition(100).toDF()
.write.parquet("s3://bucket.com/parquet/")
Is this correct? Am I missing something?
Thanks.
You don't need to uncompress files individually. The only problem with reading gzip files directly is that your reads won't be parallelized. That means, irrespective of the size of the file, you will only get one partition per file because gzip is not a splittable compression codec.
You might face problems if individual files are greater than a certain size (2GB?) because there's an upper limit to Spark's partition size.
Other than that your code looks functionally alright.
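For completeness, a hedged Scala sketch of the same flow with the pieces the snippet leaves implicit (the column name "line" is an assumption; calling toDF() on an RDD needs the implicits import):
import spark.implicits._                     // required for .toDF() on an RDD

// Each .gz file arrives as exactly one partition (gzip is not splittable),
// so the repartition(100) is what actually spreads the work across tasks.
spark.sparkContext
  .textFile("s3://bucket.com/files/*.gz")
  .repartition(100)
  .toDF("line")                              // single string column
  .write
  .parquet("s3://bucket.com/parquet/")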

Slow performance reading parquet files in S3 with scala in Spark

I save a partitioned file in an S3 bucket from a data frame in Scala:
data_frame.write.mode("append").partitionBy("date").parquet("s3n://...")
When I read this partitioned file I'm experiencing very slow performance; I'm just doing a simple group by:
val load_df = sqlContext.read.parquet(s"s3n://...").cache()
I also tried:
load_df.registerTempTable("dataframe")
Any advice? Am I doing something wrong?
It depends on what you mean by "very slow performance".
If you have too many files in your date partition, it will take some time to read them.
Try to reduce the granularity of the partitioning.
You should also use the S3A driver (which may be as simple as changing your URL protocol to s3a://, or you may need some extra JARs on the classpath, namely hadoop-aws and aws-sdk) to get better performance.
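A hedged sketch of the s3a switch (the path is a placeholder; match the hadoop-aws version to the Hadoop build your Spark ships with):
// Assumption: hadoop-aws and the matching aws-java-sdk JARs are on the classpath,
// e.g. submitted via --packages org.apache.hadoop:hadoop-aws:<your Hadoop version>.
val load_df = sqlContext.read
  .parquet("s3a://my-bucket/path/")          // s3a:// instead of s3n://
  .cache()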

Append new data to partitioned parquet files

I am writing an ETL process where I will need to read hourly log files, partition the data, and save it. I am using Spark (in Databricks).
The log files are CSV so I read them and apply a schema, then perform my transformations.
My problem is, how can I save each hour's data as a parquet format but append to the existing data set? When saving, I need to partition by 4 columns present in the dataframe.
Here is my save line:
data
.filter(validPartnerIds($"partnerID"))
.write
.partitionBy("partnerID","year","month","day")
.parquet(saveDestination)
The problem is that if the destination folder exists the save throws an error.
If the destination doesn't exist then I am not appending my files.
I've tried using .mode("append"), but I find that Spark sometimes fails midway through, so I end up losing track of how much of my data has been written and how much I still need to write.
I am using parquet because the partitioning substantially speeds up my querying in the future. As well, I must write the data as some file format on disk and cannot use a database such as Druid or Cassandra.
Any suggestions for how to partition my dataframe and save the files (either sticking to parquet or another format) is greatly appreciated.
If you need to append the files, you definitely have to use the append mode. I don't know how many partitions you expect it to generate, but I find that if you have many partitions, partitionBy will cause a number of problems (memory- and IO-issues alike).
If you think that your problem is caused by write operations taking too long, I recommend that you try these two things:
1) Use snappy by adding to the configuration:
conf.set("spark.sql.parquet.compression.codec", "snappy")
2) Disable generation of the metadata files in the hadoopConfiguration on the SparkContext like this:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
The metadata-files will be somewhat time consuming to generate (see this blog post), but according to this they are not actually important. Personally, I always disable them and have no issues.
If you generate many partitions (> 500), I'm afraid the best I can do is suggest to you that you look into a solution not using append-mode - I simply never managed to get partitionBy to work with that many partitions.
If you're using unsorted partitioning, your data is going to be split across all of your partitions. That means every task will generate and write data to each of your output files.
Consider repartitioning your data according to your partition columns before writing to have all the data per output file on the same partitions:
data
.filter(validPartnerIds($"partnerID"))
.repartition([optional integer,] "partnerID","year","month","day")
.write
.partitionBy("partnerID","year","month","day")
.parquet(saveDestination)
See: DataFrame.repartition