Apache Spark dataframe does not repartition while writing to parquet - scala

I am trying to repartition my DataFrame and write it to a parquet file. It seems to me that repartitioning works on the DataFrame in memory but does not affect the parquet partitioning. What is even stranger is that coalesce works. Let's say I have a DataFrame df:
df.rdd.partitions.size
4000
var df_new = df.repartition(20)
df_new.rdd.partitions.size
20
However, when I try to write a parquet file I get the following:
df_new.write.parquet("test.paruqet")
[Stage 0:> (0 + 8) / 4000]
which would create 4000 files. However, if I do this instead, I get the following:
var df_new = df.coalesce(20)
df_new.write.parquet("test.parquet")
[Stage 0:> (0 + 8) / 20]
Coalesce gives me what I want when reducing partitions. The problem is when I need to increase the number of partitions: if I have 8 partitions and try to increase them to 100, it always writes only 8.
Does somebody know how to fix this?

First of all, you should not provide a file path to the parquet() method, but a folder instead. Spark will handle the parquet filenames on its own.
Then, you must be aware that coalesce only reduces the number of partitions (without shuffle) while repartition lets you re-partition (with shuffle) your DataFrame in any number of partitions you need (more or less). Check out this SO question for more details on repartition vs. coalesce.
In your case, you want to increase the number of partitions, so you need to use repartition:
df.repartition(20).write.parquet("/path/to/folder")
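As a quick sanity check, here is a minimal sketch (assuming an existing SparkSession, a DataFrame df, and an illustrative target of 100 partitions) that verifies the in-memory partition count before writing:
// Repartition with a shuffle, confirm the new partition count, then write.
val repartitioned = df.repartition(100)
println(repartitioned.rdd.getNumPartitions)   // 100
repartitioned.write.mode("overwrite").parquet("/path/to/folder")
// Spark writes roughly one part-*.parquet file per partition into the folder.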

Related

spark repartition issue for filesize

I need to merge small parquet files.
I have multiple small parquet files in HDFS.
I would like to combine those parquet files into files of nearly 128 MB each.
So I read all the files using spark.read(), did repartition() on the result, and wrote it to the HDFS location.
My issue is that I have approximately 7.9 GB of data, but after I repartition and save to HDFS it grows to nearly 22 GB.
I have tried repartition, range partitioning, and coalesce, but have not found a solution.
I think it may be connected with your repartition operation. You are using .repartition(10), so Spark is going to use round-robin partitioning to redistribute your data, which probably changes the ordering. The order of the data matters for compression; you can read more in this question.
You may try to add a sort, or repartition your data by an expression instead of only a number of partitions, to optimize the file size, as sketched below.
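For example, a hedged sketch of that idea (someKey is a placeholder column name; pick one your data is naturally grouped or ordered by):
import org.apache.spark.sql.functions.col
// Repartition by a column expression and sort within partitions so that
// similar rows sit next to each other and compress better in parquet.
df.repartition(10, col("someKey"))
  .sortWithinPartitions("someKey")
  .write
  .mode("overwrite")
  .parquet("hdfs:///path/to/merged")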

How to split pyspark dataframe with 3 million records equally in azure databricks

I have a PySpark dataframe which connects to an Oracle database and reads a table which has 3 million records. I need to write this dataframe to Azure Event Hubs.
Below is the sample PySpark dataframe write-to-Event-Hubs code.
df.select("body") \
.write\
.format("eventhubs") \
.options(**ehconf) \
.save()
How do I split my PySpark dataframe into 10 equal parts (300k records per dataframe), so that I can iterate over each of these 10 dataframes and send them to Event Hubs?
You can specify the number of partitions by
df.select('body').coalesce(10).write
OR
df.select('body').repartition(10).write
coalesce only decreases the number of partitions, while repartition can increase or decrease it. repartition does a full shuffle, while coalesce(2) keeps partitions 1 & 2 as they are and moves anything in the other partitions to fit into partitions 1 & 2.
So, if your current number of partitions is higher than 10, use coalesce.
(You can check the number of partitions with df.rdd.getNumPartitions())
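Putting the two together, a small sketch in Scala (the same DataFrame API exists in PySpark) of choosing between them based on the current partition count:
// Use coalesce when shrinking to the target, repartition when growing.
val target = 10
val current = df.rdd.getNumPartitions
val resized = if (current > target) df.coalesce(target) else df.repartition(target)
println(resized.rdd.getNumPartitions)   // 10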

PySpark - Does Coalesce(1) Retain the Order of Range Partitioning?

Looking into the Spark UI and physical plan, I found that orderBy is accomplished by Exchange rangepartitioning(col#0000 ASC NULLS FIRST, 200) and then Sort [col#0000 ASC NULLS FIRST], true, 0.
From what I understand, rangepartitioning defines minimum and maximum values for each partition and places rows whose column values fall between that partition's min and max into it, so as to achieve global ordering.
But now I have 200 partitions and I want to output to a single csv file. If I do a repartition(1), Spark will trigger a shuffle and the ordering will be gone. However, I tried coalesce(1) and it retained the global ordering. Yet I don't know whether that was merely luck, since coalesce is not guaranteed to keep the ordering of partitions while decreasing their number. Does anyone know how to repartition to keep the ordering after rangepartitioning? Thanks a lot.
As you state yourself, maintaining order is not part of the coalesce API contract. You have to choose:
collect the ordered dataframe as a list of Row instances and write it to csv outside Spark (a sketch of this follows below), or
write the partitions to individual CSV files with Spark and concatenate the partitions with some other tool, e.g. "hadoop fs -getmerge" on the command line.
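A rough sketch of the first option (assuming the ordered DataFrame, called orderedDf here, is small enough to collect to the driver, which is the main caveat of this approach):
import java.io.PrintWriter
// Collect the globally ordered rows and write a single CSV by hand.
// Naive CSV formatting: no quoting or escaping of field values.
val rows = orderedDf.collect()
val writer = new PrintWriter("/local/path/output.csv")
try {
  writer.println(orderedDf.columns.mkString(","))          // header row
  rows.foreach(r => writer.println(r.toSeq.mkString(",")))
} finally {
  writer.close()
}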

Does Spark maintain parquet partitioning on read?

I am having a lot of trouble finding the answer to this question. Let's say I write a dataframe to parquet and I use repartition combined with partitionBy to get a nicely partitioned parquet file. See below:
df.repartition(col("DATE")).write.partitionBy("DATE").parquet("/path/to/parquet/file")
Now later on I would like to read the parquet file so I do something like this:
val df = spark.read.parquet("/path/to/parquet/file")
Is the dataframe partitioned by "DATE"? In other words, if a parquet file is partitioned, does Spark maintain that partitioning when reading it into a Spark dataframe, or is it randomly partitioned?
An explanation of the why or why not would also be helpful.
The number of partitions acquired when reading data stored as parquet follows many of the same rules as reading partitioned text:
1. If SparkContext.minPartitions >= partitions count in data, SparkContext.minPartitions will be returned.
2. If partitions count in data >= SparkContext.parallelism, SparkContext.parallelism will be returned, though in some very small partition cases, #3 may be true instead.
3. Finally, if the partitions count in data is somewhere between SparkContext.minPartitions and SparkContext.parallelism, generally you'll see the partitions reflected in the dataset partitioning.
Note that it's rare for a partitioned parquet file to have full data locality for a partition, meaning that, even when the partitions count in data matches the read partition count, there is a strong likelihood that the dataset should be repartitioned in memory if you're trying to achieve partition data locality for performance.
Given your use case above, I'd recommend immediately repartitioning on the "DATE" column if you're planning to leverage partition-local operations on that basis. The above caveats regarding minPartitions and parallelism settings apply here as well.
val df = spark.read.parquet("/path/to/parquet/file")
df.repartition(col("DATE"))
You would get the number of partitions based on the Spark config spark.sql.files.maxPartitionBytes, which defaults to 128 MB, and the data would not be partitioned as per the partition column which was used while writing.
Reference https://spark.apache.org/docs/latest/sql-performance-tuning.html
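For reference, a minimal sketch of inspecting that setting and the resulting read parallelism (the 256 MB value below is just an illustration, not a recommendation):
// Inspect the current split size, optionally adjust it, then check how many
// partitions a plain parquet read produces.
println(spark.conf.get("spark.sql.files.maxPartitionBytes"))   // 128 MB by default
spark.conf.set("spark.sql.files.maxPartitionBytes", 256L * 1024 * 1024)
val df = spark.read.parquet("/path/to/parquet/file")
println(df.rdd.getNumPartitions)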
In your question, there are two ways we could say the data are being "partitioned", which are:
via repartition, which uses a hash partitioner to distribute the data into a specific number of partitions. If, as in your question, you don't specify a number, the value in spark.sql.shuffle.partitions is used, which has default value 200. A call to .repartition will usually trigger a shuffle, which means the partitions are now spread across your pool of executors.
via partitionBy, which is a method specific to a DataFrameWriter that tells it to partition the data on disk according to a key. This means the data written are split across subdirectories named according to your partition column, e.g. /path/to/parquet/file/DATE=<individual DATE value>. In this example, only rows with a particular DATE value are stored in each DATE= subdirectory.
Given these two uses of the term "partitioning," there are subtle aspects to answering your question. Since you used partitionBy and asked whether Spark "maintains the partitioning", I suspect what you're really curious about is whether Spark will do partition pruning, which is a technique used to drastically improve the performance of queries that have filters on a partition column. If Spark knows the values you seek cannot be in specific subdirectories, it won't waste any time reading those files, and hence your query completes much more quickly.
If the way you're reading the data isn't partition aware, you'll get a number of partitions something like what's in bsplosion's answer. Spark won't employ partition pruning, and hence you won't get the benefit of Spark automatically ignoring reading certain files to speed things up.[1]
Fortunately, reading parquet files in Spark that were written with partitionBy is a partition-aware read. Even without a metastore like Hive that tells Spark the files are partitioned on disk, Spark will discover the partitioning automatically. Please see partition discovery in Spark for how this works in parquet.
I recommend testing reading your dataset in spark-shell so that you can easily see the output of .explain, which will let you verify that Spark correctly finds the partitions and can prune out the ones that don't contain data of interest in your query. A nice writeup on this can be found here. In short, if you see PartitionFilters: [], it means that Spark isn't doing any partition pruning. But if you see something like PartitionFilters: [isnotnull(date#3), (date#3 = 2021-01-01)], Spark is only reading in a specific set of DATE partitions, and hence the query execution is usually a lot faster.
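For instance, a sketch of that check (the DATE literal is just an example filter value):
import org.apache.spark.sql.functions.col
// Read the partitioned dataset, filter on the partition column,
// and inspect the physical plan for PartitionFilters.
val df = spark.read.parquet("/path/to/parquet/file")
df.filter(col("DATE") === "2021-01-01").explain()
// Look for: PartitionFilters: [isnotnull(DATE#...), (DATE#... = 2021-01-01)]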
[1] A separate detail is that parquet stores statistics about data in its columns inside of the files themselves. If these statistics can be used to eliminate chunks of data that can't match whatever filtering you're doing, e.g. on DATE, then you'll see some speedup even if the way you read the data isn't partition-aware. This is called predicate pushdown. It works because the files on disk will still contain only specific values of DATE when using .partitionBy. More info can be found here.

Split an RDD into multiple RDDS

I have a pair RDD[String, String] where the key is a string and the value is html. I want to split this RDD into n RDDs based on n keys and store them in HDFS.
htmlRDD = [key1,html
key2,html
key3,html
key4,html
........]
Split this RDD based on keys and store the html from each RDD individually on HDFS. Why do I want to do that? When I try to store the html from the main RDD to HDFS, it takes a lot of time because some tasks are denied committing by the output coordinator.
I'm doing this in Scala.
htmlRDD.saveAsHadoopFile("hdfs:///Path/", classOf[String], classOf[String], classOf[Formatter])
You can also try this instead of breaking up the RDD:
htmlRDD.saveAsTextFile("hdfs://HOST:PORT/path/");
I tried this and it worked for me. I had an RDD[JSONObject] and it wrote the toString() of each JSON object just fine.
Spark saves each RDD partition into one HDFS part file. So to achieve good parallelism, your source RDD should have many partitions (how many actually depends on the size of the whole data set). So I think you don't want to split your RDD into several RDDs, but rather to have one RDD with many partitions.
You can do it with repartition() or coalesce(), for example:
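Here is a minimal sketch (the partition count 200 is arbitrary; tune it to the total data size):
// Increase the parallelism of the source RDD instead of splitting it into
// several RDDs; Spark writes one HDFS part file per partition.
val manyPartitions = htmlRDD.repartition(200)
manyPartitions.saveAsTextFile("hdfs://HOST:PORT/path/")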