s3 parquet write - too many partitions, slow writing - scala

I have a Scala Spark job that writes to S3 as Parquet. It's 6 billion records so far, and it will keep growing daily. Per the use case, our API will query the Parquet data by id. To make those queries faster, I am writing the Parquet partitioned by id. However, we have 1,330,360 unique ids, so this creates 1,330,360 Parquet files while writing, and the write step is very slow: it has been writing for the past 9 hours and is still running.
output.write.mode("append").partitionBy("id").parquet("s3a://datalake/db/")
Is there any way I can reduce the number of partitions and still keep the read query fast? Or is there any other, better way to handle this scenario? Thanks.
EDIT: id is an integer column with random numbers.

You can partition by ranges of ids (you didn't say anything about the ids, so I can't suggest something specific) and/or use buckets instead of partitions: https://www.slideshare.net/TejasPatil1/hive-bucketing-in-apache-spark
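For example, a minimal sketch of the range/bucket idea, assuming output is the DataFrame from the question and that a bucket count of 1024 is only an illustrative starting point to tune:

import org.apache.spark.sql.functions._

// Derive a coarse partition key from the raw id instead of partitioning on id itself.
val numBuckets = 1024 // illustrative; tune so each partition holds a healthy amount of data

val bucketed = output.withColumn("id_bucket", pmod(col("id"), lit(numBuckets)))

bucketed.write
  .mode("append")
  .partitionBy("id_bucket")
  .parquet("s3a://datalake/db/")

// On the read side, filter on both columns: the id_bucket predicate prunes
// directories, and the id predicate is applied within the pruned files
// (someId here stands for the id being queried).
// spark.read.parquet("s3a://datalake/db/")
//   .where(col("id_bucket") === lit(someId % numBuckets) && col("id") === lit(someId))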

Related

How to do simple cache file in Flink-Scala?

I am new to Flink. I am really confused about how to do file caching and load the file into a dataset; I can't find a simple example. I am also confused about why we need to create a dataset first in order to call "RichMapFunction". How do I cache a file that has nothing to do with any other dataset? In the sample I found, it performs a join with another dataset. Thank you.
For the case of joining two datasets where one of them is small, use broadcast to avoid a shuffle. Without broadcasting, shuffling a large dataset is painful.
E.g. one dataset has 1 billion records and the other has 100 records. With broadcast, the small dataset is distributed to all task managers processing those 1 billion records, so the 1 billion records are never moved for the join. Without broadcast, the typical behaviour of a join is to shuffle both the 1 billion records and the 100 records so that records with the same key end up on the same machine, which is much more expensive than broadcasting.
The RichMapFunction provides the open() method and access to the RuntimeContext. In open(), the Flink job can get the broadcasted dataset through getRuntimeContext().getBroadcastVariable(). open() is called only once per operator instance, so the broadcasted dataset is initialised once and can then be used for all incoming records. That is the reason to use RichMapFunction instead of MapFunction.
Note: broadcast applies when the dataset to broadcast is small. You need to create a dataset first and then broadcast it to all operators. Please refer to the Flink documentation for the usage of the API.
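A minimal sketch of the broadcast pattern with the DataSet API in Scala (the dataset contents and the broadcast name "smallSet" are illustrative):

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
import scala.collection.JavaConverters._

val env = ExecutionEnvironment.getExecutionEnvironment

val bigSet: DataSet[Long]   = env.fromCollection(1L to 1000000L)
val smallSet: DataSet[Long] = env.fromCollection(Seq(1L, 2L, 3L))

val result = bigSet
  .map(new RichMapFunction[Long, String] {
    private var lookup: Set[Long] = _

    // open() runs once per parallel task, so the broadcast set is materialised once.
    override def open(parameters: Configuration): Unit = {
      lookup = getRuntimeContext
        .getBroadcastVariable[Long]("smallSet")
        .asScala
        .toSet
    }

    override def map(value: Long): String =
      if (lookup.contains(value)) s"$value matched" else s"$value unmatched"
  })
  .withBroadcastSet(smallSet, "smallSet")

result.print()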
Distributed file caching is for the case where an operation (e.g. a map operation) needs to load an external file once and use it inside the operation.
E.g. a trained model is saved on HDFS, and the Flink job needs to load the model and apply it to each record. For this case, the Flink job can use the distributed file cache API. The model file is pulled from HDFS to the local machine, and all tasks running on that machine can share the pulled file locally, which saves network and time.
You do not need to create a dataset for the file to be distributed; use registerCachedFile() instead. For the same reason as with the broadcast dataset, using RichMapFunction allows the Flink job to load/initialise the distributed file once.
Please refer to the Flink documentation for the usage.
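A minimal sketch of the distributed cache pattern (the HDFS path, the "model" name, and the single-weight model format are all placeholders):

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.api.scala._
import org.apache.flink.configuration.Configuration
import scala.io.Source

val env = ExecutionEnvironment.getExecutionEnvironment

// Register the file once; Flink ships it to every task manager.
env.registerCachedFile("hdfs:///models/model.txt", "model")

val records: DataSet[Double] = env.fromCollection(Seq(1.0, 2.0, 3.0))

val scored = records.map(new RichMapFunction[Double, Double] {
  private var weight: Double = _

  // open() runs once per task: read the locally cached copy of the file.
  override def open(parameters: Configuration): Unit = {
    val cached = getRuntimeContext.getDistributedCache.getFile("model")
    weight = Source.fromFile(cached).getLines().next().toDouble
  }

  override def map(value: Double): Double = value * weight
})

scored.print()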

How spark shuffle partitions and partition by tag along with each other

I am reading a set of 10,000 Parquet files of 10 TB cumulative size from HDFS and writing them back to HDFS in a partitioned manner using the following code
spark.read.orc("HDFS_LOC").repartition(col("x")).write.partitionBy("x").orc("HDFS_LOC_1")
I am using
spark.sql.shuffle.partitions=8000
I see that Spark wrote 5000 different partitions of "x" to HDFS (HDFS_LOC_1). How is the shuffle partition setting of 8000 being used in this entire process? I see that only 15,000 files were written across all partitions of "x". Does it mean that Spark tried to create 8000 files in every partition of "x", found at write time that there was not enough data to write 8000 files per partition, and ended up writing fewer files? Can you please help me understand this?
The setting spark.sql.shuffle.partitions=8000 will set the default shuffling partition number of your Spark programs. If you try to execute a join or aggregations just after setting this option, you will see this number taking effect (you can confirm that with df.rdd.getNumPartitions()). Please refer here for more information.
In your case though, you are using this setting together with repartition(col("x")) and partitionBy("x"). Therefore your program will not be affected by it unless a join or an aggregation transformation is used first. The difference between repartition and partitionBy is that the first partitions the data in memory, creating cardinality("x") partitions, while the second writes approximately the same number of partitions to HDFS. Why approximately? Because there are more factors that determine the exact number of output files. Please check the following resources to get a better understanding of this topic:
Difference between df.repartition and DataFrameWriter partitionBy?
pyspark: Efficiently have partitionBy write to same number of total partitions as original table
So the first thing to consider when repartitioning by column with repartition(*cols) or partitionBy(*cols) is the number of unique values (cardinality) of the column (or combination of columns).
That being said, if you want to ensure that you create 8000 partitions, i.e. output files, use repartition(partitionsNum, col("x")), where partitionsNum == 8000 in your case, and then call write.orc("HDFS_LOC_1"). Otherwise, if you want to keep the number of partitions close to the cardinality of x, just call partitionBy("x") on your original df and then write.orc("HDFS_LOC_1") to store the data to HDFS. This will create cardinality(x) folders with your partitioned data.
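As a sketch, the two variants look roughly like this (pick one; the paths mirror the question):

import org.apache.spark.sql.functions.col

val df = spark.read.orc("HDFS_LOC")

// Variant 1: fix the number of in-memory partitions, and hence roughly the
// number of output files, independently of the cardinality of x.
df.repartition(8000, col("x"))
  .write
  .orc("HDFS_LOC_1")

// Variant 2: let the on-disk layout follow the cardinality of x,
// producing one folder per distinct value of x.
df.write
  .partitionBy("x")
  .orc("HDFS_LOC_1")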

Spark How to Specify Number of Resulting Files for DataFrame While/After Writing

I saw several Q&As about writing a single file into HDFS; it seems using coalesce(1) is sufficient.
E.g.:
df.coalesce(1).write.mode("overwrite").format(format).save(location)
But how can I specify the "exact" number of files that will be written after the save operation?
So my questions are:
If I have a dataframe consisting of 100 partitions, will a write operation write 100 files?
If I have a dataframe consisting of 100 partitions, will a write operation after calling repartition(50)/coalesce(50) write 50 files?
Is there a way in Spark that allows specifying the resulting number of files when writing a dataframe into HDFS?
Thanks
The number of output files is in general equal to the number of writing tasks (partitions). Under normal conditions it cannot be smaller (each writer writes its own part and multiple tasks cannot write to the same file), but it can be larger if the format has non-standard behavior or partitionBy is used.
Normally
If I have a dataframe consisting of 100 partitions, will a write operation write 100 files?
Yes
If I have a dataframe consisting of 100 partitions, will a write operation after calling repartition(50)/coalesce(50) write 50 files?
And yes.
Is there a way in Spark that allows specifying the resulting number of files when writing a dataframe into HDFS?
No.
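To illustrate, assuming made-up HDFS paths, the file count simply tracks the number of write tasks:

val df = spark.read.parquet("hdfs:///input") // assume this yields 100 partitions

// 100 in-memory partitions -> 100 output files
df.write.mode("overwrite").parquet("hdfs:///out_100")

// Repartition/coalesce to 50 before writing -> 50 output files
df.coalesce(50).write.mode("overwrite").parquet("hdfs:///out_50")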

How can I control number of rows and/or output file size in Spark streaming when writing to HDFS - hive?

I am using Spark Streaming to read and process messages from Kafka and write them to HDFS - Hive.
Since I wish to avoid creating many small files, which spams the filesystem, I would like to know whether there is a way to ensure a minimum file size, and/or the ability to force a minimum number of output rows per file, with the exception of a timeout.
Thanks.
As far as I know, there is no way to control the number of lines in your output files, but you can control the number of output files.
Controlling that, together with your dataset size, may help with your needs, since you can estimate the size of each file in your output. You can do that with the coalesce and repartition commands:
df.coalesce(2).write.parquet(...)
df.repartition(2).write.parquet(...)
Both of them are used to set the number of partitions to the value given as a parameter. So if you set 2, you should get 2 files in your output.
The difference is that with repartition you can both increase and decrease the number of partitions, while with coalesce you can only decrease it.
Also, keep in mind that repartition performs a full shuffle to distribute the data equally among the partitions, which may be resource- and time-expensive. On the other hand, coalesce does not perform a full shuffle; it combines existing partitions instead.
You can find an awesome explanation in this other answer here
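As a rough sketch of the sizing idea above, assuming you can estimate the bytes per micro-batch (the size estimate, the 256 MB target, and the output path below are assumptions, not values Spark provides here):

// Pick a partition count so that each output file lands near a target size.
val estimatedBytes  = 10L * 1024 * 1024 * 1024   // assumed ~10 GB per micro-batch
val targetFileBytes = 256L * 1024 * 1024         // aim for ~256 MB files

val numFiles = math.max(1, (estimatedBytes / targetFileBytes).toInt)

df.repartition(numFiles)
  .write
  .mode("append")
  .parquet("hdfs:///warehouse/my_table") // placeholder path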

Getting the count of records in a data frame quickly

I have a dataframe with as many as 10 million records. How can I get a count quickly? df.count is taking a very long time.
It's going to take a long time anyway, at least the first time.
One way is to cache the dataframe, so that you will be able to do more with it than just count it.
E.g
df.cache()
df.count()
Subsequent operations don't take much time.
The time it takes to count the records in a DataFrame depends on the power of the cluster and how the data is stored. Performance optimizations can make Spark counts very quick.
It's easier for Spark to perform counts on Parquet files than CSV/JSON files. Parquet files store counts in the file footer, so Spark doesn't need to read all the rows in the file and actually perform the count, it can just grab the footer metadata. CSV / JSON files don't have any such metadata.
If the data is stored in a Postgres database, then the count operation will be performed by Postgres and count execution time will be a function of the database performance.
Bigger clusters generally perform count operations faster (unless the data is skewed in a way that causes one node to do all the work, leaving the other nodes idle).
The snappy compression algorithm is generally faster than gzip because it is splittable by Spark and faster to inflate.
approx_count_distinct, which is powered by HyperLogLog under the hood, will be more performant for distinct counts, at the cost of precision.
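For example, assuming df from the question and an id column (the column name is an assumption):

import org.apache.spark.sql.functions.{approx_count_distinct, col}

// Approximate distinct count; rsd is the allowed relative standard deviation.
df.select(approx_count_distinct(col("id"), rsd = 0.05)).show()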
The other answer suggests caching before counting, which will actually slow down the count operation. Caching is an expensive operation that can take a lot more time than counting. Caching is an important performance optimization at times, but not if you just want a simple count.