Spark RDD: multiple reducebykey or just once - scala

I have code like the following:
// make an RDD according to an id
def makeRDD(id:Int, data:RDD[(VertexId, Double)]):RDD[(Long, Double)] = { ... }
val data:RDD[(VertexId, Double)] = ... // loading from hdfs
val idList = (1 to 100)
val rst1 = idList.map(id => makeRDD(id, data)).reduce(_ union _).reduceByKey(_+_)
val rst2 = idList.map(id => makeRDD(id, data)).reduce((l,r) => (l union r).reduceByKey(_+_))
rst1 and rst2 give the same result. I thought rst1 requires more memory (100 times) but only one reduceByKey transformation, whereas rst2 requires less memory but more reduceByKey transformations (99 of them). So, is it a trade-off between time and space?
My question is: is my analysis above right, or does Spark translate both into the same operations internally?
P.S.: rst1 unions all sub-RDDs and then calls reduceByKey, so reduceByKey is outside reduce. rst2 calls reduceByKey one by one, so reduceByKey is inside reduce.

Long story short, both solutions are relatively inefficient, but the second one is worse than the first.
Let's start by answering the last question. For the low-level RDD API there are only two types of global automatic optimizations:
using explicitly or implicitly cached task results instead of recomputing the complete lineage
combining multiple transformations which don't require a shuffle into a single ShuffleMapStage
Everything else is pretty much a sequence of transformations which defines the DAG. This stands in contrast to the more restrictive, high-level Dataset (DataFrame) API, which makes specific assumptions about transformations and performs global optimizations of the execution plan.
Regarding your code: the biggest problem with the first solution is the growing lineage when you apply union iteratively. It makes some things, like failure recovery, expensive and, since RDDs are defined recursively, can fail with a StackOverflowError. A less serious side effect is a growing number of partitions, which doesn't seem to be compensated for in the subsequent reduction*. You'll find a more detailed explanation in my answer to Stackoverflow due to long RDD Lineage, but what you really need here is a single union like this:
sc.union(idList.map(id => makeRDD(id, data))).reduceByKey(_+_)
This is actually an optimal solution, assuming you apply a truly reducing function.
The second solution obviously suffers from the same problem, but it gets worse. While the first approach requires only two stages with a single shuffle, the second requires a shuffle for each RDD. Since the number of partitions is growing and you use the default HashPartitioner, each piece of data has to be written to disk multiple times and most likely shuffled over the network multiple times. Ignoring low-level details, each record is shuffled O(N) times, where N is the number of RDDs you merge.
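As a rough illustration of the O(N) claim, here is a plain-Scala counting model (no Spark required). It assumes that reduce on a Scala Seq folds from left to right and that every reduceByKey applied after a union triggers a full shuffle; map-side combining is ignored, and the function names are made up for this sketch.

```scala
// Hypothetical helper names; this is a counting model, not Spark code.

// Nested approach: ((r1 union r2).reduceByKey union r3).reduceByKey ...
// Records of the i-th RDD (1-based) take part in every shuffle from the
// step where that RDD is merged in until the last step.
def shufflesNested(i: Int, n: Int): Int =
  if (i <= 2) n - 1 else n - i + 1

// Single union followed by one reduceByKey: every record is shuffled once.
def shufflesSingle(i: Int, n: Int): Int = 1

val n = 100
val totalNested = (1 to n).map(shufflesNested(_, n)).sum
val totalSingle = (1 to n).map(shufflesSingle(_, n)).sum
println(s"nested: $totalNested, single: $totalSingle")
```

Counting in units of "one source RDD shuffled once", the nested version comes to 5049 for 100 RDDs against 100 for the single union, i.e. roughly N/2 shuffles per record on average under these assumptions.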
Regarding memory usage, it is not obvious without knowing more about the data distribution, but in the worst case the second method can exhibit significantly worse behavior.
If + works with constant space, the only requirement for the reduction is a hashmap to store the results of the map-side combine. Since partitions are processed as a stream of data without reading the complete content into memory, the total memory size for each task will be proportional to the number of unique keys and not the amount of data. Since the second method requires more tasks, overall memory usage will be higher than in the first case. On average it can be slightly better due to the data being partially organized, but this is rather unlikely to compensate for the additional costs.
* If you want to learn how it can affect overall performance, you can see Spark iteration time increasing exponentially when using join. This is a slightly different problem, but it should give you some idea why controlling the number of partitions matters.

Related

Spark performance: local faster than cluster (very uneven executor load)

Let me start off by saying that I'm relatively new to Spark, so if I'm saying something that doesn't make sense, please just correct me.
Summarising the problem: no matter what I do, at certain stages one executor does all the computation, which makes cluster execution slower than local, one-processor execution.
Full story:
I've written a Spark 1.6 application which consists of a series of maps, filters, joins and a short GraphX part. The app uses only one data source, a csv file. For the purpose of development I created a mockup dataset consisting of 100 000 rows, 7MB, with all of the fields having random data with uniform distribution (random sorting in the file as well). The joins are inner self-joins on a PairRDD on various fields (the dataset has duplicate keys with ~200 duplicates per key, imitating real data), leading to a cartesian product within each key. Then I perform a number of map and filter operations on the result of the joins, store it as an RDD of some custom-class objects and save everything as a graph at the end.
I developed the code on my laptop and ran it, which took about 5 minutes (Windows machine, local file). To my surprise, when I deployed the jar onto the cluster (master yarn, cluster mode, csv file in HDFS) and submitted it, the code took 8 minutes to execute.
I've run the same experiment with smaller data and the results were 40 seconds locally and 1.1 min on the cluster.
When I looked at the history server I saw that 2 stages are particularly long (almost 4 mins each), and in these stages there is one task that takes >90% of the time. I ran the code multiple times and it was always the same task that took so much time, even though it was deployed on a different data node each time.
To my surprise, when I opened the executors page I saw that one executor does almost all of the work (in terms of time spent) and executes most tasks. In the screenshot provided the second most 'active' executor had 50 tasks, but that's not always the case (in a different submission the second busiest executor had 15 tasks and the leading one 95).
Moreover, I saw that the time of 3.9 mins is used for computation (second screenshot), which is heaviest on the joined data shortly after the map. I thought that the data may not be partitioned equally and one executor has to perform all the computation. Therefore, I tried to partition the pair RDD manually (using .partitionBy(new HashPartitioner(40))) right before the join (similar execution time) or right after the join (execution even slower).
What could be the issue? Any help will be appreciated.
It's hard to tell without seeing your queries and understanding your dataset. I'm guessing you didn't include them either because they are very complex or sensitive? So this is a little bit of a shot in the dark, but it looks a lot like a problem we dealt with on my team at work. My rough guess at what is happening is that during one of your joins you have a key space with high cardinality but a very uneven distribution. In our case we were joining on sources of web traffic: while we have thousands of possible sources of traffic, the overwhelming majority of the traffic comes from just a few. This caused a problem when we joined. The keys would be distributed evenly among the executors, but since maybe 95% of the data shared maybe 3 or 4 keys, a very small number of executors were doing most of the work. When you find a join that suffers from this, the thing to do is to pick the smaller of the two datasets and explicitly perform a broadcast join. (Spark will normally try to do this, but it's not always perfect at telling when it should.)
To do this, let's say you have two DataFrames. One of them has two columns, number and stringRep where number is just one row for all integers from 0-10000 and stringRep is just a string representation of that, so "one", "two", "three", etc. We'll call this numToString
The other DataFrame has some key column to join against number in numToString called kind, some other irrelevant data, and 100,000,000 rows. We'll call this DataFrame ourData. Then let's say that the distribution of the 100,000,000 rows in ourData is 90% have kind == 1, 5% have kind == 2, and the remaining 5% distributed pretty evenly amongst the remaining 99,998 numbers. When you perform the following code:
val numToString: DataFrame = loadNumToString()
val ourData: DataFrame = loadOurCode()
val joined = ourData.join(numToString).where(ourData("kind") === numToString("number"))
...it is very likely that Spark will send 90% of the data (that which has kind == 1) to one executor, 5% of the data (that which has kind == 2) to another executor, and the remaining 5% smeared across the rest, leaving two executors with huge partitions and the rest with very tiny ones.
The way around this as I mentioned before is to explicitly perform a broadcast join. What this does is take one DataFrame and distribute it entirely to each node. So you would do this instead:
import org.apache.spark.sql.functions.broadcast
val joined = ourData.join(broadcast(numToString)).where(ourData("kind") === numToString("number"))
...which would send numToString to each executor. Assuming that ourData was evenly partitioned beforehand, the data should remain evenly partitioned across executors. This might not be your problem, but it does sound a lot like a problem we were having. Hope it helps!
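To get a feel for how lopsided the shuffle-based join becomes, here is a scaled-down, plain-Scala sketch of the 90% / 5% / 5% distribution described above. The row counts, the 200 partitions and the simple key-modulo partitioning rule are assumptions chosen for illustration; Spark SQL uses a different hash function, but the effect is the same because all rows sharing a key end up in one partition.

```scala
// Scaled-down, hypothetical row counts per join key ("kind").
val rowsPerKey: Map[Int, Long] =
  Map(1 -> 900000L, 2 -> 50000L) ++ (3 to 10000).map(k => k -> 5L)

val numPartitions = 200

// Simplified hash partitioning of Int keys: partition = key mod numPartitions.
val rowsPerPartition: Map[Int, Long] =
  rowsPerKey.toSeq
    .groupBy { case (key, _) => key % numPartitions }
    .map { case (p, entries) => p -> entries.map(_._2).sum }

val totalRows = rowsPerKey.values.sum
val (hotPartition, hotRows) = rowsPerPartition.maxBy(_._2)
val hotShare = hotRows.toDouble / totalRows
println(f"partition $hotPartition holds ${hotShare * 100}%.1f%% of the rows")
```

In this model one partition out of 200 holds about 90% of all rows, so the task processing it dominates the stage no matter how many executors are available. A broadcast join sidesteps this because the large side is never repartitioned by key.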
More information on broadcast joins can be found here:
https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-joins-broadcast.html

Handling Skew data in apache spark production scenario

Can anyone explain how skewed data is handled in production for Apache Spark?
Scenario:
We submitted the Spark job using "spark-submit", and in the Spark UI we observed that a few tasks take a long time, which indicates the presence of skew.
Questions:
(1) What steps shall we take (re-partitioning, coalesce, etc.)?
(2) Do we need to kill the job and then include the skew solutions in the jar and re-submit the job?
(3) Can we solve this issue by running commands like coalesce directly from the shell without killing the job?
Data skew is primarily a problem when applying non-reducing by-key (shuffling) operations. The two most common examples are:
Non-reducing groupByKey (RDD.groupByKey, Dataset.groupBy(Key).mapGroups, Dataset.groupBy.agg(collect_list)).
RDD and Dataset joins.
Rarely, the problem is related to the properties of the partitioning key and partitioning function, with no pre-existing issue with the data distribution.
// All keys are unique - no obvious data skew
val rdd = sc.parallelize(Seq(0, 3, 6, 9, 12)).map((_, None))
// Drastic data skew
rdd.partitionBy(new org.apache.spark.HashPartitioner(3)).glom.map(_.size).collect
// Array[Int] = Array(5, 0, 0)
What steps shall we take(re-partitioning,coalesce,etc.)?
Repartitioning (never coalesce) can help you with the latter case by:
Changing the partitioner.
Adjusting the number of partitions to minimize the possible impact of the data (here you can use the same rules as for associative arrays: prime numbers and powers of two should be preferred, although they might not resolve the problem fully, like 3 in the example used above).
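The effect of the partition count on the example keys can be checked without a cluster. The sketch below re-implements the rule HashPartitioner applies (non-negative hashCode modulo the number of partitions) in plain Scala; the helper names partitionOf and partitionSizes are made up for this illustration.

```scala
// Mirrors HashPartitioner's rule: non-negative (hashCode mod numPartitions).
def partitionOf(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}

// Number of keys that land in each partition, including empty partitions.
def partitionSizes(keys: Seq[Any], numPartitions: Int): Seq[Int] = {
  val counts = keys
    .groupBy(partitionOf(_, numPartitions))
    .map { case (p, ks) => p -> ks.size }
  (0 until numPartitions).map(p => counts.getOrElse(p, 0))
}

val keys = Seq(0, 3, 6, 9, 12)
println(partitionSizes(keys, 3)) // all five keys collide in partition 0
println(partitionSizes(keys, 4)) // sizes 2, 1, 1, 1
println(partitionSizes(keys, 5)) // one key per partition
```

Because every key here is a multiple of 3, three partitions is the worst possible choice, while 4 or 5 partitions spread the same keys almost evenly.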
The former cases typically won't benefit much from repartitioning, because the skew is naturally induced by the operation itself. Values with the same key cannot be spread across multiple partitions, and the non-reducing character of the process is minimally affected by the initial data distribution.
These cases have to be handled by adjusting the logic of your application. It could mean a number of things in practice, depending on the data or problem:
Removing operation completely.
Replacing exact result with an approximation.
Using different workarounds (typically with joins), for example frequent-infrequent split, iterative broadcast join or prefiltering with probabilistic filter (like Bloom filter).
Do we need to kill the job and then include the skew solutions in the jar and re-submit the job?
Normally you have to at least resubmit the job with adjusted parameters.
In some cases (mostly RDD batch jobs) you can design your application to monitor task execution and kill and resubmit a particular job in case of possible skew, but it might be hard to implement correctly in practice.
In general, if data skew is possible, you should design your application to be immune to it.
Can we solve this issue by running the commands like (coalesce) directly from shell without killing the job?
I believe this is already answered by the points above, but just to be explicit: there is no such option in Spark. You can of course include these calls in your application.
We can fine-tune the query to reduce its complexity.
We can try the salting mechanism:
Salt the skewed column with a random number to get a better distribution of data across partitions.
Spark 3 provides the Adaptive Query Execution mechanism to avoid such scenarios in production.
Below are a couple of Spark properties which we can tune accordingly.
spark.sql.adaptive.enabled=true
spark.databricks.adaptive.autoBroadcastJoinThreshold=30MB # Databricks-specific size threshold; changes sort merge join to broadcast join dynamically, default size = 30MB
spark.sql.adaptive.coalescePartitions.enabled=true # small shuffle partitions are coalesced dynamically
spark.sql.adaptive.advisoryPartitionSizeInBytes=64MB # default
spark.sql.adaptive.coalescePartitions.minPartitionSize=<size> # takes a byte size (for example 1MB), not a boolean
spark.sql.adaptive.coalescePartitions.minPartitionNum=<integer> # takes an integer, not a boolean; default 2x number of cores
spark.sql.adaptive.skewJoin.enabled=true
spark.sql.adaptive.skewJoin.skewedPartitionFactor=5 # default
spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes=256MB
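The salting idea mentioned above can be sketched with plain Scala collections standing in for the two sides of a join. This is a minimal illustration under assumed data, not Spark code: the number of salt buckets, the sample rows and all value names are hypothetical. In a real job the same two steps (random salt on the skewed side, replication on the other side) would be expressed with DataFrame or RDD operations.

```scala
import scala.util.Random

val saltBuckets = 4 // hypothetical number of salt values per key
val rng = new Random(42)

// Skewed "large" side: key 1 is hot. The small side has one row per key.
val large: Seq[(Int, String)] = Seq.fill(8)(1 -> "hot") :+ (2 -> "cold")
val small: Seq[(Int, String)] = Seq(1 -> "one", 2 -> "two")

// Large side: append a random salt, so one hot key becomes several join keys.
val saltedLarge = large.map { case (k, v) => ((k, rng.nextInt(saltBuckets)), v) }

// Small side: replicate each row once per possible salt value.
val saltedSmall = for {
  (k, v) <- small
  salt <- 0 until saltBuckets
} yield ((k, salt), v)

// Join on the salted key, then drop the salt again.
val lookup = saltedSmall.toMap
val joined = saltedLarge.flatMap { case ((k, salt), v) =>
  lookup.get((k, salt)).map(w => (k, v, w))
}
println(joined.size)
```

The join result is the same as without salting, but the hot key is now spread over up to saltBuckets distinct join keys, at the price of replicating the small side saltBuckets times.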

Best practice for writing to hadoop from spark

I was reviewing some code written by a co-worker, and I found a method like this:
import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.functions.col

def writeFile(df: DataFrame,
              partitionCols: List[String],
              writePath: String): Unit = {
  // repartition takes Columns, partitionBy takes column names
  val df2 = df.repartition(partitionCols.map(col): _*)
  val dfWriter = df2.write.partitionBy(partitionCols: _*)
  dfWriter
    .format("parquet")
    .mode(SaveMode.Overwrite)
    .option("compression", "snappy")
    .save(writePath)
}
Is it generally good practice to call repartition on a predefined set of columns like this, and then call partitionBy, and then save to disk?
Generally you call repartition with the same columns as partitionBy in order to have a single parquet file in each partition. That is what is being achieved here. Now you could argue that this could make the parquet files very large or, worse, cause out-of-memory errors.
This problem is generally handled by adding a row_number to the DataFrame and then specifying the number of documents that each parquet file can have. Something like:
// RowNumber: name of the added row-number column; docsPerPartition: target rows per file
val repartitionExpression = colNames.map(col) :+ floor(col(RowNumber) / docsPerPartition)
// now use this to repartition
To answer the next part: persist after partitionBy is not needed here, because after partitioning the data is written directly to disk.
To help you understand the differences between partitionBy() and repartition(): repartition on the DataFrame uses a hash-based partitioner which takes the columns as well as the number of partitions, generates a hash value from them and buckets the data accordingly.
When called with columns only, repartition() creates spark.sql.shuffle.partitions partitions, which is 200 by default. Because of possible hash collisions there is a good chance that records with different keys end up in the same bucket.
partitionBy(), on the other hand, takes columns and creates partitions based purely on their unique values. The number of partitions is proportional to the number of unique keys in the data.
In the repartition case there is a good chance of writing empty files. In the partitionBy case there will not be any empty files.
Is your job CPU-bound, memory-bound, network-IO bound, or disk-IO bound?
The first two cases are significant if df2 is sufficiently large, and the other answers correctly address those cases.
If your job is disk-IO bound (and you see yourself writing large files to HDFS frequently in future), many cloud providers will let you pick a faster SSD disk for an extra charge.
Also Sandy Ryza recommends keeping --executor-cores under 5:
I’ve noticed that the HDFS client has trouble with tons of concurrent threads. A rough guess is that at most five tasks per executor can achieve full write throughput, so it’s good to keep the number of cores per executor below that number.

How to transform RDD, Dataframe or Dataset straight to a Broadcast variable without collect?

Is there any way (or any plans) to be able to turn Spark distributed collections (RDDs, Dataframe or Datasets) directly into Broadcast variables without the need for a collect? The public API doesn't seem to have anything "out of box", but can something be done at a lower level?
I can imagine there is some 2x speedup potential (or more?) for these kind of operations. To explain what I mean in detail let's work through an example:
val myUberMap: Broadcast[Map[String, String]] =
sc.broadcast(myStringPairRdd.collect().toMap)
someOtherRdd.map(someCodeUsingTheUberMap)
This causes all the data to be collected to the driver, then the data is broadcasted. This means the data is sent over the network essentially twice.
What would be nice is something like this:
val myUberMap: Broadcast[Map[String, String]] =
myStringPairRdd.toBroadcast((a: Array[(String, String)]) => a.toMap)
someOtherRdd.map(someCodeUsingTheUberMap)
Here Spark could bypass collecting the data altogether and just move the data between the nodes.
BONUS
Furthermore, there could be a Monoid-like API (a bit like combineByKey) for situations where the .toMap or whatever operation on Array[T] is expensive, but can possibly be done in parallel. E.g. constructing certain Trie structures can be expensive, this kind of functionality could result in awesome scope for algorithm design. This CPU activity can also be run while the IO is running too - while the current broadcast mechanism is blocking (i.e. all IO, then all CPU, then all IO again).
CLARIFICATION
Joining is not the (main) use case here; it can be assumed that I use the broadcast data structure sparsely. For example, the keys in someOtherRdd by no means cover the keys in myUberMap, but I don't know which keys I need until I traverse someOtherRdd, AND suppose I use myUberMap multiple times.
I know that all sounds a bit vague, but the point is for more general machine learning algorithm design.
While this is an interesting idea and theoretically possible, I will argue that it has very limited practical applications. Obviously I cannot speak for the PMC, so I cannot say whether there are any plans to implement this type of broadcasting mechanism at all.
Possible implementation:
Since Spark already provides a torrent broadcasting mechanism whose behavior is described as follows:
The driver divides the serialized object into small chunks and
stores those chunks in the BlockManager of the driver.
On each executor, the executor first attempts to fetch the object from its BlockManager.
If it does not exist, it then uses remote fetches to fetch the small chunks from the driver and/or
other executors if available.
Once it gets the chunks, it puts the chunks in its own
BlockManager, ready for other executors to fetch from.
it should be possible to reuse the same mechanism for direct node-to-node broadcasting.
It is worth noting that this approach cannot completely eliminate driver communication. Even though blocks could be created locally you still need a single source of truth to advertise a set of blocks to fetch.
Limited applications
One problem with broadcast variables is that they are quite expensive. Even if you can eliminate the driver bottleneck, two problems remain:
Memory required to store deserialized object on each executor.
Cost of transferring broadcasted data to every executor.
The first problem should be relatively obvious. It is not only about direct memory usage but also about GC cost and its effect on overall latency. The second one is rather subtle. I partially covered this in my answer to Why my BroadcastHashJoin is slower than ShuffledHashJoin in Spark, but let's discuss it further.
From a network traffic perspective, broadcasting a whole dataset is pretty much equivalent to creating a Cartesian product. So if the dataset is large enough for the driver to become a bottleneck, it is unlikely to be a good candidate for broadcasting, and a targeted approach like a hash join can be preferable in practice.
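A back-of-the-envelope comparison makes the traffic argument concrete. The following plain-Scala sketch uses a deliberately crude cost model with made-up sizes and executor counts; it ignores compression, data locality and the torrent-style chunk sharing, so treat the numbers as an illustration of the scaling only.

```scala
// Crude cost model; every figure below is an assumption for illustration.
val mb = 1024L * 1024L

// Broadcasting ships a full copy of the dataset to every executor.
def broadcastBytes(datasetBytes: Long, executors: Int): Long =
  datasetBytes * executors

// A shuffle (hash) join moves each record of both sides at most once.
def shuffleJoinBytes(leftBytes: Long, rightBytes: Long): Long =
  leftBytes + rightBytes

val executors = 100
val smallSide = 512 * mb   // candidate for broadcasting
val largeSide = 10240 * mb // the other side of the join

val viaBroadcast = broadcastBytes(smallSide, executors)
val viaShuffle = shuffleJoinBytes(smallSide, largeSide)
println(s"broadcast: ${viaBroadcast / mb} MB, shuffle: ${viaShuffle / mb} MB")
```

With these assumed figures the broadcast moves 51200 MB against 10752 MB for the shuffle, because its cost grows with the number of executors rather than with the size of the other side.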
Alternatives:
There are some methods which can be used to achieve similar results as direct broadcast and address issues enumerated above including:
Passing data via distributed file system.
Using replicated database collocated with worker nodes.
I don't know if we can do it for an RDD, but you can do it for a DataFrame:
import org.apache.spark.sql.functions
val df:DataFrame = your_data_frame
val broadcasted_df = functions.broadcast(df)
Now you can use the variable broadcasted_df and it will be broadcast to the executors.
Make sure the broadcasted_df DataFrame is not too big and can be sent to the executors.
broadcasted_df will be broadcast in operations such as
other_df.join(broadcasted_df)
and in this case the join() operation executes faster, because every executor has its partitions of other_df and the whole of broadcasted_df.
As for your question, I am not sure you can do what you want. You cannot use one RDD inside the #map() method of another RDD, because Spark doesn't allow transformations inside transformations. And in your case you need to call collect() to create a map from your RDD, because you can only use an ordinary map object inside #map(); you cannot use an RDD there.

Spark transformation on last partitions extremely slow

I am running an iterative algorithm in which, during each iteration, each value in a list is assigned a set of keys (1 to N). Over time, the distribution of files over keys becomes skewed. I noticed that after a few iterations, in the coalesce phase, things start running really slowly on the last few partitions of my RDD.
My transformation is as follows:
dataRDD_of_20000_partitions.aggregateByKey(zeroOp)(seqOp, mergeOp)
.mapValues(...)
.coalesce(1000, true)
.collect()
Here, aggregateByKey aggregates on the keys I assigned earlier (1 to N). I coalesce the partitions because I know the number of partitions I need, and I set the coalesce shuffle flag to true in order to balance out the partitions.
Could anyone point to some reasons why these transformations may cause the last few partitions of the RDD to be processed slowly? I am wondering if part of this has to do with data skew.
I have some observations.
You should have the right number of partitions to avoid data skew. I suspect that you have fewer partitions than required. Have a look at this blog.
The collect() call fetches the entire RDD into the single driver node. It may cause an OutOfMemoryError at times.
Transformations like aggregateByKey() may cause performance issues due to shuffling.
Have a look at this SE question for more details: Spark : Tackle performance intensive commands like collect(), groupByKey(), reduceByKey()