Spark: repartition output by key - scala

I'm trying to output records using the following code:
spark.createDataFrame(asRow, struct)
.write
.partitionBy("foo", "bar")
.format("text")
.save("/some/output-path")
I don't have a problem when the data is small. However, when I'm processing ~600 GB of input, I end up writing around 290k files, including many small files per partition. Is there a way to control the number of output files per partition? Right now I'm writing a lot of small files, and that's not good.

Having lots of files is the expected behavior: each in-memory partition (resulting from whatever computation you had before the write) writes its own files into the output partitions you requested.
If you wish to avoid that, you need to repartition before the write:
spark.createDataFrame(asRow, struct)
.repartition("foo","bar")
.write
.partitionBy("foo", "bar")
.format("text")
.save("/some/output-path")

You have multiple files per partition because each task writes its output to its own file. That means the only way to get a single file per partition is to repartition the data before writing. Please note that this can be expensive, because repartitioning causes a shuffle of your data.

Related

How spark shuffle partitions and partition by tag along with each other

I am reading a set of 10,000 parquet files of 10 TB cumulative size from HDFS and writing them back to HDFS in a partitioned manner using the following code:
spark.read.orc("HDFS_LOC").repartition(col("x")).write.partitionBy("x").orc("HDFS_LOC_1")
I am using
spark.sql.shuffle.partitions=8000
I see that Spark wrote 5,000 different partitions of "x" to HDFS (HDFS_LOC_1). How is the shuffle partition setting of 8000 being used in this process? I see that only 15,000 files were written across all partitions of "x". Does that mean Spark tried to create 8000 files for every partition of "x", found at write time that there was not enough data to write 8000 files per partition, and ended up writing fewer? Can you please help me understand this?
The setting spark.sql.shuffle.partitions=8000 sets the default number of shuffle partitions for your Spark program. If you execute a join or an aggregation after setting this option, you will see the number take effect (you can confirm it with df.rdd.getNumPartitions()). Please refer here for more information.
In your case though, you are using this setting together with repartition(col("x")) and partitionBy("x"). Therefore your program will not be affected by it unless a join or an aggregation transformation is involved first. The difference between repartition and partitionBy is that the first partitions the data in memory, creating cardinality("x") partitions, while the second writes approximately the same number of partitions to HDFS. Why approximately? Because there are more factors that determine the exact number of output files. Please check the following resources to get a better understanding of this topic:
Difference between df.repartition and DataFrameWriter partitionBy?
pyspark: Efficiently have partitionBy write to same number of total partitions as original table
So the first thing to consider when repartitioning by column with repartition(*cols) or partitionBy(*cols) is the number of unique values (the cardinality) of the column or combination of columns.
That being said, if you want to ensure that you create 8000 partitions, i.e. output files, use repartition(partitionsNum, col("x")) with partitionsNum == 8000 in your case, then call write.orc("HDFS_LOC_1"). Otherwise, if you want to keep the number of partitions close to the cardinality of x, just call partitionBy("x") on your original df and then write.orc("HDFS_LOC_1") to store the data to HDFS. This will create cardinality(x) folders holding your partitioned data.
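Putting those two options into code (a hedged sketch that just instantiates the suggestions above; the paths and the 8000 figure come from the question):

import org.apache.spark.sql.functions.col

// Option 1: fix the number of shuffle partitions (and hence output files) explicitly
spark.read.orc("HDFS_LOC")
  .repartition(8000, col("x"))
  .write
  .orc("HDFS_LOC_1")

// Option 2: let the output follow the cardinality of "x" (one folder per distinct value)
spark.read.orc("HDFS_LOC")
  .write
  .partitionBy("x")
  .orc("HDFS_LOC_1")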

Read parquet file to multiple partitions [duplicate]

So I have just one parquet file that I'm reading with Spark (using the SQL stuff) and I'd like it to be processed with 100 partitions. I've tried setting spark.default.parallelism to 100; we have also tried changing the compression of the parquet to none (from gzip). No matter what we do, the first stage of the Spark job has only a single partition (once a shuffle occurs it gets repartitioned into 100 and thereafter things are obviously much, much faster).
Now according to a few sources (like below) parquet should be splittable (even if using gzip!), so I'm super confused and would love some advice.
https://www.safaribooksonline.com/library/view/hadoop-application-architectures/9781491910313/ch01.html
I'm using Spark 1.0.0, and apparently the default value for spark.sql.shuffle.partitions is 200, so it can't be that. In fact all the defaults for parallelism are much more than 1, so I don't understand what's going on.
You should write your parquet files with a smaller block size. The default is 128 MB per block, but it's configurable by setting the parquet.block.size configuration in the writer.
The source of ParquetOutputFormat is here, if you want to dig into the details.
The block size is the minimum amount of logically readable data in a parquet file (since parquet is columnar, you can't just split it by lines or something similarly trivial), so you can't have more reading threads than input blocks.
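For example, something along these lines (a hedged sketch; parquet.block.size is picked up from the Hadoop configuration by the Parquet writer, the 16 MB value is arbitrary, and df and the path are hypothetical):

// Hedged sketch: lower the Parquet row-group ("block") size before writing,
// so the resulting file contains more independently readable blocks.
sc.hadoopConfiguration.setInt("parquet.block.size", 16 * 1024 * 1024)
df.write.parquet("/path/to/output-small-blocks")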
The new way of doing it (Spark 2.x) is setting
spark.sql.files.maxPartitionBytes
Source: https://issues.apache.org/jira/browse/SPARK-17998 (the official documentation is not correct yet; it misses the .sql)
In my experience, the Hadoop settings no longer have an effect.
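For instance (a minimal sketch on Spark 2.x; the 32 MB value and the path are illustrative):

// Hedged sketch: cap how many bytes Spark packs into a single read partition.
spark.conf.set("spark.sql.files.maxPartitionBytes", 32L * 1024 * 1024)

val df = spark.read.parquet("/path/to/one-big-file.parquet")  // hypothetical path
println(df.rdd.getNumPartitions)  // roughly fileSize / 32 MB partitions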
Maybe your parquet file only takes up one HDFS block. Create a big parquet file that has many HDFS blocks and load it:
val k = sqlContext.parquetFile("the-big-table.parquet")
k.partitions.length
You'll see the same number of partitions as HDFS blocks. This worked fine for me (Spark 1.1.0).
You have mentioned that you want to control the distribution when writing to parquet. When you create parquet files from RDDs, parquet preserves the partitions of the RDD. So, if you create an RDD with 100 partitions and then write it out as a DataFrame in parquet format, it will write 100 separate parquet files to the filesystem.
For reads, you could specify the spark.sql.shuffle.partitions parameter.
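A minimal sketch of that idea with the DataFrame API (the path and the count of 100 are illustrative):

// Hedged sketch: the number of in-memory partitions at write time drives the
// number of part-files on disk.
val df = spark.read.parquet("/path/to/input.parquet")  // hypothetical path
df.repartition(100)
  .write
  .parquet("/path/to/output")  // produces (up to) 100 part-files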
To achieve that, you should use the SparkContext to set the Hadoop configuration property mapreduce.input.fileinputformat.split.maxsize (via sc.hadoopConfiguration).
By setting this property to a lower value than hdfs.blockSize, you will get as many partitions as there are splits.
For example:
When hdfs.blockSize = 134217728 (128 MB),
and one file is read which contains exactly one full block,
and mapreduce.input.fileinputformat.split.maxsize = 67108864 (64 MB),
then the file is cut into two splits and read into two partitions.
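In code, that recipe looks roughly like this (a hedged sketch; the path is hypothetical and the effect depends on the underlying input format honouring the property):

// Hedged sketch of the example above: 128 MB block, 64 MB max split size -> 2 partitions.
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.split.maxsize", (64 * 1024 * 1024).toString)

val df = sqlContext.read.parquet("/path/to/one-block-file.parquet")  // hypothetical path
println(df.rdd.partitions.length)  // expected: 2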

How can I control number of rows and/or output file size in Spark streaming when writing to HDFS - hive?

I'm using Spark Streaming to read and process messages from Kafka and write them to HDFS (Hive).
Since I wish to avoid creating many small files, which spam the filesystem, I would like to know if there's a way to ensure a minimum file size, and/or a way to force a minimum number of output rows per file, with the exception of a timeout.
Thanks.
As far as I know, there is no way to control the number of lines in your output files. But you can control the number of output files.
Controlling that, together with your dataset size, may cover your needs, since you can then estimate the size of each output file (see the sketch after this answer). You can do that with the coalesce and repartition commands:
df.coalesce(2).write(...)
df.repartition(2).write(...)
Both of them set the number of partitions to the value given as a parameter. So if you pass 2, you should get 2 files in your output.
The difference is that with repartition you can both increase and decrease the number of partitions, while with coalesce you can only decrease it.
Also, keep in mind that repartition performs a full shuffle to distribute the data equally among the partitions, which may be expensive in both resources and time. Coalesce, on the other hand, does not perform a full shuffle; it combines existing partitions instead.
You can find an awesome explanation in this other answer here
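If you are aiming for a rough file size rather than a fixed count, one option (a sketch with hypothetical numbers, not from the answer above) is to derive the partition count from an estimate of the input size:

// Hedged sketch: pick the partition count from an estimated input size and a
// target file size; both figures below are hypothetical.
val estimatedInputBytes = 10L * 1024 * 1024 * 1024  // e.g. ~10 GB, known from the source
val targetFileBytes     = 128L * 1024 * 1024        // aim for roughly 128 MB files
val numFiles = math.max(1, (estimatedInputBytes / targetFileBytes).toInt)

df.repartition(numFiles).write.parquet("/some/output-path")  // hypothetical df and path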

Best practice for writing to hadoop from spark

I was reviewing some code written by a co-worker, and I found a method like this:
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.col

def writeFile(df: DataFrame,
              partitionCols: List[String],
              writePath: String): Unit = {
  val df2 = df.repartition(partitionCols.map(col): _*)
  val dfWriter = df2.write.partitionBy(partitionCols: _*)
  dfWriter
    .format("parquet")
    .mode(SaveMode.Overwrite)
    .option("compression", "snappy")
    .save(writePath)
}
Is it generally good practice to call repartition on a predefined set of columns like this, and then call partitionBy, and then save to disk?
Generally you call repartition with the same columns as partitionBy so that each partition ends up with a single parquet file. That is what is being achieved here. Now you could argue that this could make the parquet files large or, worse, cause a memory overflow.
This problem is generally handled by adding a row_number column to the DataFrame and then specifying the number of documents that each parquet file can have. Something like:
val repartitionExpression = colNames.map(col) :+ floor(col("RowNumber") / docsPerPartition)
// now use this to repartition
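A fuller, hedged sketch of that row_number idea (df, the column names, the ordering column and docsPerPartition are all hypothetical):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, floor, row_number}

// Hedged sketch: cap the rows per output file by bucketing on a row number.
val colNames         = List("foo", "bar")  // hypothetical partition columns
val docsPerPartition = 500000              // hypothetical rows-per-file target

// row_number needs an ordering; "someId" is a hypothetical column
val windowSpec    = Window.partitionBy(colNames.map(col): _*).orderBy(col("someId"))
val withRowNumber = df.withColumn("RowNumber", row_number().over(windowSpec))

val repartitionExpression =
  colNames.map(col) :+ floor(col("RowNumber") / docsPerPartition)

withRowNumber
  .repartition(repartitionExpression: _*)
  .write
  .partitionBy(colNames: _*)
  .format("parquet")
  .save("/some/output-path")  // hypothetical path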
To answer the next part: a persist after partitionBy is not needed here, since after partitioning the data is written directly to disk.
To help you understand the difference between partitionBy() and repartition(): repartition on the DataFrame uses a hash-based partitioner which takes the column(s) as well as a number of partitions, generates a hash value from them, and buckets the data.
By default repartition() creates 200 partitions (the spark.sql.shuffle.partitions default). Because of the possibility of collisions, there is a good chance that records with different keys end up in the same bucket.
On the other hand, partitionBy() takes the column(s), and the partitions are based purely on the unique keys. The number of partitions is proportional to the number of unique keys in the data.
In the repartition case there is a good chance of writing empty files, but in the case of partitionBy there will not be any empty files.
Is your job CPU-bound, memory-bound, network-IO bound, or disk-IO bound?
The first two cases are significant if df2 is sufficiently large, and the other answers correctly address those cases.
If your job is disk-IO bound (and you see yourself writing large files to HDFS frequently in future), many cloud providers will let you pick a faster SSD disk for an extra charge.
Also Sandy Ryza recommends keeping --executor-cores under 5:
I’ve noticed that the HDFS client has trouble with tons of concurrent threads. A rough guess is that at most five tasks per executor can achieve full write throughput, so it’s good to keep the number of cores per executor below that number.

How to handle large text file in spark?

I have a large text file (3 GB); it is a DNA reference. I would like to slice it into parts so that I can handle it.
So I want to know how to slice the file with Spark. I currently have only one node with 4 GB of memory.
Sounds like you want to load your file as multiple partitions. If your file is splittable (text file, snappy, sequence, etc.), you can simply provide the number of partitions by which it will be loaded as sc.textFile(inputPath, numPartitions). If your file is not splittable, it will be loaded as one partition, but you may call .repartition(numPartitions) on the loaded RDD to repartition into multiple partitions.
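A small sketch of both cases (the paths and the count of 100 are illustrative):

// Hedged sketch: splittable input can be given a partition count at load time;
// non-splittable input (e.g. gzip) loads as one partition and can be repartitioned after.
val numPartitions = 100
val splittable   = sc.textFile("/path/to/dna-reference.txt", numPartitions)
val unsplittable = sc.textFile("/path/to/dna-reference.txt.gz").repartition(numPartitions)
println(splittable.partitions.length)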
If you want a specific number of lines in each chunk, you can try this:
val rdd  = sc.textFile(inputPath).zipWithIndex()
val rdd2 = rdd.filter(x => lowest_no_of_line <= x._2 && x._2 <= highest_no_of_line)
              .map(x => x._1)
              .coalesce(1, false)
rdd2.saveAsTextFile(outputpath)
Now your saved text file will contain the lines between lowest_no_of_line and highest_no_of_line.