Spark: partition .txt.gz files and convert to parquet - scala

I need to convert all the gzipped text files in a folder to Parquet, and I wonder whether I need to gunzip them first or not.
Also, I'd like to partition each file into 100 parts.
This is what I have so far:
sc.textFile("s3://bucket.com/files/*.gz").repartition(100).toDF()
.write.parquet("s3://bucket.com/parquet/")
Is this correct? Am I missing something?
Thanks.

You don't need to uncompress files individually. The only problem with reading gzip files directly is that your reads won't be parallelized. That means, irrespective of the size of the file, you will only get one partition per file because gzip is not a splittable compression codec.
You might face problems if individual files are greater than a certain size (2GB?) because there's an upper limit to Spark's partition size.
Other than that your code looks functionally alright.
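For reference, here is a slightly fuller Scala sketch of the same pipeline. It assumes a SparkSession named spark; the paths are the ones from your question:
// Sketch only: read the gzipped text files (one partition per .gz file),
// reshuffle into 100 partitions, then write Parquet.
import spark.implicits._   // needed for .toDF on an RDD[String]

val lines = spark.sparkContext
  .textFile("s3://bucket.com/files/*.gz")   // gzip is not splittable: one partition per file
  .repartition(100)                         // shuffle into 100 partitions

lines.toDF("value")
  .write
  .parquet("s3://bucket.com/parquet/")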

Related

scala - to avoid creating empty avro file (or handling the number of files)

my_data.write
.mode(SaveMode.Overwrite)
.avro(_outputPath)
It usually works fine, but when the amount of data is very small, some of the Avro files end up empty.
The number of files differs from run to run, and when there are fewer data rows than files, some files are empty and contain only the column info.
Is there a way to control the number of output Avro files based on the number of rows, or to avoid creating an output file when there is no data?
The number of files will depend on how many partitions your dataframe has; each partition creates its own file. If you know that there is not much data to write, you can repartition the dataframe before writing it.
my_data.repartition(1)
.write
.mode(SaveMode.Overwrite)
.avro(_outputPath)
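If you want to tie the number of files to the number of rows, one option is to derive the partition count from a count of the data. This is only a sketch, reusing the writer from above; rowsPerFile is an illustrative value you would tune yourself:
// Sketch: pick the number of output files based on the row count.
val rowsPerFile = 100000L                        // illustrative target rows per file
val rowCount = my_data.count()
val numFiles = math.max(1L, rowCount / rowsPerFile).toInt

my_data.repartition(numFiles)
  .write
  .mode(SaveMode.Overwrite)
  .avro(_outputPath)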

Should we avoid partitionBy when writing files to S3 in spark?

The parquet location is:
s3://mybucket/ref_id/date/camera_id/parquet-file
Let's say I have ref_id x 3, date x 4, camera_id x 500. If I write Parquet like below (using partitionBy), I get 3 x 4 x 500 = 6000 files uploaded to S3. That is extremely slow compared to just writing a couple of files to the top-level bucket (no multi-level prefix).
What is the best practice? My colleague argues that partitionBy is a good thing when used together with a Hive metastore/table:
df.write.mode("overwrite")\
.partitionBy('ref_id','date','camera_id')\
.parquet('s3a://mybucket/tmp/test_data')
If your problem is too many files, which seems to be the case, you need to repartition your RDD/dataframe before you write it. Each RDD/Dataframe partition will generate 1 file per folder.
df.repartition(1)\
.write.mode("overwrite")\
.partitionBy('ref_id','date','camera_id')\
.parquet('s3a://mybucket/tmp/test_data')
As an alternative to repartition you can also use coalesce.
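For example (a Scala sketch of the same write as above; coalesce avoids a full shuffle but can only decrease the number of partitions):
// Same write as above, using coalesce instead of repartition.
df.coalesce(1)
  .write.mode("overwrite")
  .partitionBy("ref_id", "date", "camera_id")
  .parquet("s3a://mybucket/tmp/test_data")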
If (after repartitioning to 1) the files are too small, you need to reduce the directory structure. The Parquet documentation recommends row group sizes between 512 MB and 1 GB:
https://parquet.apache.org/documentation/latest/
We recommend large row groups (512MB - 1GB). Since an entire row group might need to be read, we want it to completely fit on one HDFS block.
If your files are only a few KB or MB, you have a serious problem; it will seriously hurt performance.

How to handle large gz file in Spark

I am trying to read a large gz file and then insert it into a table. This is taking very long:
sparkSession.read.format("csv").option("header", "true")
  .load("file-about-5gb-size.gz")
  .repartition(1000).coalesce(1000)
  .write.mode("overwrite").format("orc")
  .insertInto(table)
Is there any way I can optimize this, please help.
Note: the repartition and coalesce values were chosen arbitrarily.
You won't be able to optimize the read if your file is gzip-compressed: gzip is not a splittable codec in Spark, so there is no way to avoid reading the complete file in a single task.
If you want to parallelize the processing, you need to make the file splittable by unzipping it first and then processing the uncompressed file.
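A sketch of that approach, in Scala; the .csv path is illustrative and stands in for the file after it has been gunzipped, while the rest mirrors your original write:
// Once the file is uncompressed, the CSV read itself is split across tasks.
val df = sparkSession.read
  .format("csv")
  .option("header", "true")
  .load("file-about-5gb-size.csv")   // illustrative: the gunzipped file

df.repartition(1000)                 // a single shuffle; the extra coalesce adds nothing
  .write
  .mode("overwrite")
  .format("orc")
  .insertInto(table)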

Spark save partitions with custom filename and gzip

I would like to save my generated RDD partitions using a custom filename, like: chunk0.gz, chunk1.gz, etc. Hence, I want them to be gzipped as well.
Using saveAsTextFile would result in a directory being created, with standard filenames part-00000.gz, etc.
fqPart.saveAsTextFile(outputFolder, classOf[GzipCodec])
How do I specify my own filenames? Would I have to iterate through the RDD partitions manually and write to the file, and then compress the resulting file as well?
Thanks in advance.

Append new data to partitioned parquet files

I am writing an ETL process where I will need to read hourly log files, partition the data, and save it. I am using Spark (in Databricks).
The log files are CSV so I read them and apply a schema, then perform my transformations.
My problem is, how can I save each hour's data as a parquet format but append to the existing data set? When saving, I need to partition by 4 columns present in the dataframe.
Here is my save line:
data
.filter(validPartnerIds($"partnerID"))
.write
.partitionBy("partnerID","year","month","day")
.parquet(saveDestination)
The problem is that if the destination folder exists the save throws an error.
If the destination doesn't exist then I am not appending my files.
I've tried using .mode("append"), but I find that Spark sometimes fails midway through, so I end up losing track of how much of my data has been written and how much I still need to write.
I am using Parquet because the partitioning substantially speeds up my future queries. Also, I must write the data to disk in some file format and cannot use a database such as Druid or Cassandra.
Any suggestions on how to partition my dataframe and save the files (whether sticking with Parquet or using another format) are greatly appreciated.
If you need to append the files, you definitely have to use the append mode. I don't know how many partitions you expect it to generate, but I find that if you have many partitions, partitionBy will cause a number of problems (memory- and IO-issues alike).
If you think that your problem is caused by write operations taking too long, I recommend that you try these two things:
1) Use snappy by adding to the configuration:
conf.set("spark.sql.parquet.compression.codec", "snappy")
2) Disable generation of the metadata files in the hadoopConfiguration on the SparkContext like this:
sc.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")
The metadata-files will be somewhat time consuming to generate (see this blog post), but according to this they are not actually important. Personally, I always disable them and have no issues.
If you generate many partitions (> 500), I'm afraid the best I can do is suggest to you that you look into a solution not using append-mode - I simply never managed to get partitionBy to work with that many partitions.
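Putting the two settings together, a sketch assuming a SparkSession named spark (the original snippets refer to a conf object and to sc directly):
// Sketch: the same two settings applied through a SparkSession.
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")
spark.sparkContext.hadoopConfiguration.set("parquet.enable.summary-metadata", "false")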
If you're using unsorted partitioning your data is going to be split across all of your partitions. That means every task will generate and write data to each of your output files.
Consider repartitioning your data by the partition columns before writing, so that all the data for each output file lives on the same partition:
data
.filter(validPartnerIds($"partnerID"))
.repartition([optional integer,] "partnerID","year","month","day")
.write
.partitionBy("partnerID","year","month","day")
.parquet(saveDestination)
See: DataFrame.repartition
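For reference, a concrete Scala form of that call; the partition count of 200 is purely illustrative, and data, validPartnerIds and saveDestination come from the question:
// Sketch: repartition by the same columns used in partitionBy so each task
// writes to as few output directories as possible.
data
  .filter(validPartnerIds($"partnerID"))
  .repartition(200, $"partnerID", $"year", $"month", $"day")
  .write
  .mode("append")
  .partitionBy("partnerID", "year", "month", "day")
  .parquet(saveDestination)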