Best practice for writing to hadoop from spark - scala

I was reviewing some code written by a co-worker, and I found a method like this:
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.col

def writeFile(df: DataFrame,
              partitionCols: List[String],
              writePath: String): Unit = {
  // Shuffle so that each combination of the partition columns lands in one task.
  val df2 = df.repartition(partitionCols.map(col): _*)
  // partitionBy takes column names, so the strings are passed directly.
  val dfWriter = df2.write.partitionBy(partitionCols: _*)
  dfWriter
    .format("parquet")
    .mode(SaveMode.Overwrite)
    .option("compression", "snappy")
    .save(writePath)
}
Is it generally good practice to call repartition on a predefined set of columns like this, and then call partitionBy, and then save to disk?

Generally you call repartition with the same columns as partitionBy so that you get a single parquet file in each output partition. That is what is being achieved here. You could argue, though, that this can make the parquet files very large, or in the worst case cause a memory overflow.
This problem is generally handled by adding a row_number column to the DataFrame and then specifying the number of records that each parquet file may contain. Something like
val repartitionExpression = colNames.map(col) :+ floor(col(RowNumber) / docsPerPartition)
// now use this to repartition
To answer the next part of your question: a persist after partitionBy is not needed here, because after the repartition the data is written directly to disk.
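For illustration, here is a minimal, hedged sketch of that approach; the helper name, the "RowNumber" column name and docsPerPartition are placeholders rather than anything from the original code:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, floor, row_number}

// Hypothetical helper: bound the number of rows that end up in each output file.
def repartitionWithCap(df: DataFrame, colNames: List[String], docsPerPartition: Int): DataFrame = {
  // Number the rows within each partition-key group.
  val window = Window.partitionBy(colNames.map(col): _*).orderBy(colNames.map(col): _*)
  val withRowNumber = df.withColumn("RowNumber", row_number().over(window))

  // Partition keys plus a "chunk" expression derived from the row number.
  val repartitionExpression =
    colNames.map(col) :+ floor(col("RowNumber") / docsPerPartition)

  // One shuffle task per (partition keys, chunk) combination, so no single file grows unbounded.
  withRowNumber.repartition(repartitionExpression: _*).drop("RowNumber")
}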

To help you understand the difference between partitionBy() and repartition(): repartition on a DataFrame uses a hash-based partitioner that takes the column(s) and a number of partitions, computes a hash value for each row and buckets the data accordingly.
By default repartition() creates 200 partitions (the default value of spark.sql.shuffle.partitions). Because of the possibility of hash collisions, there is a good chance that records with different keys end up in the same bucket.
partitionBy(), on the other hand, takes the column(s) and lays the output out purely by the unique key values; the number of output partitions is proportional to the number of unique keys in the data.
With repartition there is a good chance of writing empty files, but with partitionBy there will not be any empty files.
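As a rough sketch of the difference, assuming an existing DataFrame df with a hypothetical "country" column and a placeholder output path:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def contrast(df: DataFrame): Unit = {
  // repartition: shuffles rows into spark.sql.shuffle.partitions buckets (200 by default),
  // hashing the given expression; different keys can collide into the same bucket.
  val shuffled = df.repartition(col("country"))

  // partitionBy: a writer setting that creates one output directory per distinct key on disk,
  // e.g. .../country=US/, .../country=DE/, with no directories for keys that do not exist.
  shuffled.write.partitionBy("country").parquet("/tmp/by_country")
}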

Is your job CPU-bound, memory-bound, network-IO bound, or disk-IO bound?
The first two cases are significant if df2 is sufficiently large, and the other answers correctly address them.
If your job is disk-IO bound (and you see yourself writing large files to HDFS frequently in the future), many cloud providers will let you pick a faster SSD disk for an extra charge.
Also Sandy Ryza recommends keeping --executor-cores under 5:
I’ve noticed that the HDFS client has trouble with tons of concurrent threads. A rough guess is that at most five tasks per executor can achieve full write throughput, so it’s good to keep the number of cores per executor below that number.

Related

Why Spark repartition leads to MemoryOverhead?

So, the question is in the title. I think I don't understand correctly how repartition works. In my mind, when I say somedataset.repartition(600) I expect all the data to be partitioned into equal-sized chunks across the workers (let's say 60 workers).
So, for example, I have a big chunk of data to load from unbalanced files, let's say 400 files, where 20% are 2 GB in size and the other 80% are about 1 MB. I have this code to load the data:
val source = sparkSession.read.format("com.databricks.spark.csv")
.option("header", "false")
.option("delimiter","\t")
.load(mypath)
Then I want to convert the raw data to my intermediate object, filter out irrelevant records, convert to the final object (with additional attributes), and then partition by some columns and write to parquet. In my mind it seems reasonable to balance the data (40,000 partitions) across the workers and then do the work like this:
val ds: Dataset[FinalObject] = source.repartition(600)
.map(parse)
.filter(filter.IsValid(_))
.map(convert)
.persist(StorageLevel.DISK_ONLY)
val count = ds.count
log(count)
val partitionColumns = List("region", "year", "month", "day")
ds.repartition(partitionColumns.map(new org.apache.spark.sql.Column(_)):_*)
.write.partitionBy(partitionColumns:_*)
.format("parquet")
.mode(SaveMode.Append)
.save(destUrl)
But it fails with
ExecutorLostFailure (executor 7 exited caused by one of the running
tasks) Reason: Container killed by YARN for exceeding memory limits.
34.6 GB of 34.3 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
When I do not do the repartition, everything is fine. What do I not understand correctly about repartition?
Your logic is correct for repartition as well as partitionBy, but before using repartition you need to keep in mind this point, made in several sources.
Keep in mind that repartitioning your data is a fairly expensive
operation. Spark also has an optimized version of repartition() called
coalesce() that allows avoiding data movement, but only if you are
decreasing the number of RDD partitions.
If you want the job to complete, increase the driver and executor memory (and the memory overhead that the error message mentions).
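For example, the relevant knobs at submit time look roughly like this; the sizes are placeholders to tune for your cluster, and spark.yarn.executor.memoryOverhead is simply the setting named in the error message:

spark-submit \
  --driver-memory 8g \
  --executor-memory 16g \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  your-job.jar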

Spark: repartition output by key

I'm trying to output records using the following code:
spark.createDataFrame(asRow, struct)
.write
.partitionBy("foo", "bar")
.format("text")
.save("/some/output-path")
I don't have a problem when the data is small. However, when I'm processing ~600 GB of input I end up writing around 290k files, including many small files per partition. Is there a way to control the number of output files per partition? Right now I am writing a lot of small files, and that is not good.
Having lots of files is the expected behavior: each in-memory partition (resulting from whatever computation you had before the write) writes its own file into every output partition it holds data for.
If you wish to avoid that, you need to repartition before the write:
spark.createDataFrame(asRow, struct)
.repartition("foo","bar")
.write
.partitionBy("foo", "bar")
.format("text")
.save("/some/output-path")
You have multiple files per partition because each task writes its output to its own file. That means the only way to get a single file per partition is to re-partition the data before writing. Note that this will be fairly expensive, because the repartition causes a shuffle of your data.
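If you are on Spark 2.2 or later, a hedged alternative sketch: pass an explicit partition count to repartition so you control the shuffle parallelism, and cap the rows per file with the maxRecordsPerFile writer option; the numbers and path below are placeholders:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

def writeWithBoundedFiles(df: DataFrame, outputPath: String): Unit = {
  df.repartition(200, col("foo"), col("bar"))      // rows with the same keys land in the same task
    .write
    .option("maxRecordsPerFile", 1000000L)         // split oversized partitions into bounded files
    .partitionBy("foo", "bar")
    .format("text")
    .save(outputPath)
}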

Understanding Spark partitioning

I'm trying to understand how Spark partitions data. Suppose I have an execution DAG like that in the picture (orange boxes are the stages). The two groupBy and the join operations are supposed to be very heavy if the RDDs are not partitioned.
Is it wise then to use .partitionBy(new HashPartitioner(properValue)) on P1, P2, P3 and P4 to avoid the shuffle? What is the cost of partitioning an existing RDD? When is it not appropriate to partition an existing RDD? Doesn't Spark partition my data automatically if I don't specify a partitioner?
Thank you
tl;dr The answers to your questions respectively: Better to partition at the outset if you can; Probably less than not partitioning; Your RDD is partitioned one way or another anyway; Yes.
This is a pretty broad question. It takes up a good portion of our course! But let's try to address as much about partitioning as possible without writing a novel.
As you know, the primary reason to use a tool like Spark is because you have too much data to analyze on one machine without having the fan sound like a jet engine. The data get distributed among all the cores on all the machines in your cluster, so yes, there is a default partitioning--according to the data. Remember that the data are distributed already at rest (in HDFS, HBase, etc.), so Spark just partitions according to the same strategy by default to keep the data on the machines where they already are--with the default number of partitions equal to the number of cores on the cluster. You can override this default number by configuring spark.default.parallelism, and you generally want it to be 2-3 times the number of cores.
However, typically you want data that belong together (for example, data with the same key, where HashPartitioner would apply) to be in the same partition, regardless of where they are to start, for the sake of your analytics and to minimize shuffle later. Spark also offers a RangePartitioner, or you can roll your own for your needs fairly easily. But you are right that there is an upfront shuffle cost to go from default partitioning to custom partitioning; it's almost always worth it.
It is generally wise to partition at the outset (rather than delay the inevitable shuffle by calling partitionBy later) and then repartition if needed. Later still you may even choose to coalesce to reduce the number of partitions (which, unlike repartition, avoids a full shuffle) and potentially leave some machines and cores idle, because the gain in network IO (after that upfront cost) is greater than the loss of CPU power.
(The only situation I can think of where you don't partition at the outset--because you can't--is when your data source is a compressed file.)
Note also that you can preserve the partitioning during a map-like transformation with mapPartitions and mapPartitionsWithIndex (by passing preservesPartitioning = true).
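To make that concrete, a minimal sketch with hypothetical pair RDDs; the element types, the trivial value transformation and the partition count are placeholders:

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Partition a pair RDD up front so the later join can reuse that partitioning.
def partitionUpFront(left: RDD[(String, Int)], right: RDD[(String, Int)], properValue: Int)
    : RDD[(String, (Int, Int))] = {
  val partitioner = new HashPartitioner(properValue)

  val leftPartitioned = left.partitionBy(partitioner)
    // preservesPartitioning = true keeps the partitioner because the keys are not changed
    .mapPartitions(iter => iter.map { case (k, v) => (k, v + 1) }, preservesPartitioning = true)

  val rightPartitioned = right.partitionBy(partitioner)

  // Both sides share the same partitioner, so the join itself does not trigger another shuffle.
  leftPartitioned.join(rightPartitioned)
}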
Finally, keep in mind that as you experiment with your analytics while you work your way up to scale, there are diagnostic capabilities you can use:
toDebugString to see the lineage of RDDs
getNumPartitions to, shockingly, get the number of partitions
glom to see clearly how your data are partitioned
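If it helps, a short sketch of those three diagnostics, assuming nothing more than an existing SparkContext:

import org.apache.spark.SparkContext

def inspect(sc: SparkContext): Unit = {
  val rdd = sc.parallelize(1 to 100).map(_ * 2)

  println(rdd.toDebugString)     // lineage of the RDD
  println(rdd.getNumPartitions)  // number of partitions
  rdd.glom().collect()           // one Array per partition, showing how the data is laid out
    .zipWithIndex
    .foreach { case (part, i) => println(s"partition $i -> ${part.mkString(",")}") }
}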
And if you pardon the shameless plug, these are the kinds of things we discuss in Analytics with Apache Spark. We hope to have an online version soon.
By applying partitionBy preemptively you don't avoid the shuffle. You just push it to another place. This can be a good idea if the partitioned RDD is reused multiple times, but you gain nothing for a one-off join.
Doesn't Spark partition my data automatically if I don't specify a partitioner?
It will partition (a.k.a. shuffle) your data as part of the join and the subsequent groupBy (unless you keep the same key and use a transformation that preserves partitioning).
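To make "reused multiple times" concrete, a rough sketch with hypothetical pair RDDs; the names, the partition count and the storage level are placeholders:

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def reusePartitioned(events: RDD[(String, Int)], dims: RDD[(String, String)]): Unit = {
  val byKey = events
    .partitionBy(new HashPartitioner(8))    // the shuffle happens here, once
    .persist(StorageLevel.MEMORY_AND_DISK)  // keep it around so later stages can reuse it

  // Both of these reuse the existing partitioning instead of shuffling events again.
  byKey.join(dims).count()
  byKey.reduceByKey(_ + _).count()

  byKey.unpersist()
}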

Spark RDD: multiple reducebykey or just once

I have code like following:
// make an RDD according to an id
def makeRDD(id:Int, data:RDD[(VertexId, Double)]):RDD[(Long, Double)] = { ... }
val data:RDD[(VertexId, Double)] = ... // loading from hdfs
val idList = (1 to 100)
val rst1 = idList.map(id => makeRDD(id, data)).reduce(_ union _).reduceByKey(_+_)
val rst2 = idList.map(id => makeRDD(id, data)).reduce((l,r) => (l union r).reduceByKey(_+_))
rst1 and rst2 give the same result. I thought rst1 requires more memory (100 times) but involves only one reduceByKey transformation; however, rst2 requires less memory but more reduceByKey transformations (99 times). So, is it a time vs. space tradeoff?
My question is: is my analysis above right, or does Spark translate the two versions into the same thing internally?
P.S.: rst1 unions all the sub-RDDs and then calls reduceByKey, so the reduceByKey is outside the reduce. rst2 calls reduceByKey one union at a time, so the reduceByKey is inside the reduce.
Long story short, both solutions are relatively inefficient, but the second one is worse than the first.
Let's start by answering the last question. For the low-level RDD API there are only two types of global automatic optimization:
using explicitly or implicitly cached task results instead of recomputing the complete lineage
combining multiple transformations which don't require a shuffle into a single ShuffleMapStage
Everything else is pretty much a sequence of transformations which defines the DAG. This stands in contrast to the more restrictive, high-level Dataset (DataFrame) API, which makes specific assumptions about transformations and performs global optimizations of the execution plan.
Regarding your code: the biggest problem with the first solution is the lineage that grows as you apply the iterative union. It makes some things, like failure recovery, expensive, and since RDDs are defined recursively, it can fail with a StackOverflowError. A less serious side effect is a growing number of partitions, which doesn't seem to be compensated for in the subsequent reduction*. You'll find a more detailed explanation in my answer to Stackoverflow due to long RDD Lineage, but what you really need here is a single union like this:
sc.union(idList.map(id => makeRDD(id, data))).reduceByKey(_+_)
This is actually an optimal solution, assuming you apply a truly reducing function.
The second solution obviously suffers from the same problem, but it gets worse. While the first approach requires only two stages and a single shuffle, this one requires a shuffle for each RDD. Since the number of partitions is growing and you use the default HashPartitioner, each piece of data has to be written to disk multiple times and most likely shuffled over the network multiple times. Ignoring the low-level details, each record is shuffled O(N) times, where N is the number of RDDs you merge.
Regarding memory usage, it is not obvious without knowing more about the data distribution, but in the worst case the second method can exhibit significantly worse behavior.
If + works in constant space, the only memory the reduction needs is a hash map to store the results of the map-side combine. Since partitions are processed as a stream of data, without reading the complete content into memory, the total memory for each task will be proportional to the number of unique keys, not to the amount of data. Since the second method requires more tasks, overall memory usage will be higher than in the first case. On average it can be slightly better because the data is partially organized, but that is unlikely to compensate for the additional costs.
* If you want to learn how this can affect overall performance, see Spark iteration time increasing exponentially when using join. It is a slightly different problem, but it should give you some idea of why controlling the number of partitions matters.

Spark transformation on last partitions extremely slow

I am running an iterative algorithm in which, during each iteration, each value in a list is assigned a set of keys (1 to N). Over time, the distribution of files over keys becomes skewed. I noticed that after a few iterations, in the coalesce phase, things seem to start running really slowly on the last few partitions of my RDD.
My transformation is as follows:
dataRDD_of_20000_partitions.aggregateByKey(zeroOp)(seqOp, mergeOp)
.mapValues(...)
.coalesce(1000, true)
.collect()
Here, aggregateByKey aggregates on the keys I assigned earlier (1 to N). I coalesce partitions because I know the number of partitions I need, and I set the coalesce shuffle flag to true in order to balance out the partitions.
Could anyone point to some reasons why these transformations may cause the last few partitions of the RDD to process slowly? I am wondering whether part of this has to do with data skew.
I have some observations.
You should have the right number of partitions to avoid data skew. I suspect that you have fewer partitions than the required number. Have a look at this blog, and at the sketch after this list for passing an explicit partition count to aggregateByKey.
The collect() call fetches the entire RDD onto the single driver node. It can sometimes cause an OutOfMemoryError.
Transformations like aggregateByKey() may cause performance issues due to shuffling.
Have a look at this SE question for more details: Spark : Tackle performance intensive commands like collect(), groupByKey(), reduceByKey()
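A minimal sketch of that first observation, with hypothetical element types and a placeholder partition count (not the asker's actual zeroOp, seqOp or mergeOp):

import org.apache.spark.rdd.RDD

// Pass an explicit partition count to aggregateByKey instead of inheriting the parent
// RDD's partitioning; numPartitions is a placeholder to tune against the observed skew.
def aggregateWithExplicitPartitions(
    data: RDD[(Int, Double)],
    numPartitions: Int): RDD[(Int, List[Double])] = {
  data.aggregateByKey(List.empty[Double], numPartitions)(
    (acc, v) => v :: acc, // seqOp: fold a value into the per-partition accumulator
    (a, b) => a ::: b     // combOp: merge accumulators coming from different partitions
  )
}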