Order of task execution in union() operation - scala

I have a list of 80 rdds that I want to process and then ultimately join.
The "process" part consists of doing a map and a reduce by key for each rdd.
Then I'm "joining" them by doing an union.
Here's a sketch of my code:
rdds0.foreach(_.persist()) //rdds0 are persisted
//trigger a map and a shuffle for each rdd
vals rdds = rdds0.map(rdd => rdd.map(f1).reduceByKey(f2))
//action on the union of the rdds
sparkContext.union(rdds).collect()
However, I have issues with the DAG that is generated.
Indeed the DAG that is generated by spark is like so:
80 stages, one for each "map" of each RDD
1 final stage for the union, that starts with 80 reduceByKey in parallel
I have an issue with the part in bold.
AFAIK, this means that for the last task, Spark will schedule in parallel 80 reducebykey, where each of them is taking lots of memory.
It seems more efficient to be able to do the reduceByKey() for each rdd individually as soon as the map stage is done for this RDD.
Instead, no reduceByKey can be performed before all the map stages are done, and then they are all scheduled at the same time.
Is there a way to force Spark somehow to execute the redueByKey() operations ASAP instead of waiting for all the map tasks?
I thought this was a matter of union() creating a PartitionerAwareUnionRDD instead of a UnionRDD() but it seems that both RDD types generate the same DAG.

reduceByKey is a wide transformation - this means it has:
"map-side" component - part of the operation that happens before shuffle - contained in the first stage in your DAG.
"reduce-side" component - part of the operation which happens after the shuffle - contained in the second stage of your DAG.
The results of the "reduce-side" component are piped directly to union. There is really nothing to optimize in this case.

Related

Can we make the different transformation functions for the Spark Streaming running on different servers?

val words = lines.flatMap(_.split(" "))
val pairs = words.map(word => (word, 1))
For the above example, we know there are two transformation functions. Both of them must running at the same process\server, however I want to make the second transformation running on a different server from the first one to achieve scalability, is it possible?
To clear things up: a Spark transformation is not an actual execution. Transformations in Spark are lazy which means nothing gets executed until you call an action (e.g. save, collect). An action is a job in Spark.
So based on the above, you can control jobs but you cannot control transformations. A Spark's job will be distributed on multiple executors by splitting the processed data (RDD) among them. Each executor will apply the job (multiple transformations) on its split and then the results will be collected again. This will significantly reduce network usage.
If you can perform what your asking about, then the intermediate results (which you actually don't care about) should be transformed over the network which in turns will add a great network overhead.

How to process two RDDs serially in Spark?

As I was hitting the resource limit in my Spark program, I want to divide the processing into iterations, and upload results from each iteration to the HDFS, as shown below.
do something using first rdd
upload the output to hdfs
do something using second rdd
upload the output to hdfs
But as far as I know, Spark will try to run those two in parallel. Is there a way to wait for the processing of the first rdd, before processing the second rdd?
I think I understand where you're confused. Within a single RDD, the partitions will run in parallel to each other. However, two RDDs will run sequentially to each other (unless you code otherwise).
Is there a way to wait for the processing of the first rdd, before processing the second rdd
You have the RDD, so why do you need to wait and read from disk again?
Do some transformations on the RDD, write to disk in the first action, and continue with that same RDD to perform a second action.

How to `reduce` only within partitions in Spark Streaming, perhaps using combineByKey?

I have data already sorted by key into my Spark Streaming partitions by virtue of Kafka, i.e. keys found on one node are not found on any other nodes.
I would like to use redis and its incrby (increment by) command as a state engine and to reduce the number of requests sent to redis, I would like to partially reduce my data by doing a word count on each worker node by itself. (The key is tag+timestamp to obtain my functionality from word count).
I would like to avoid shuffling and let redis take care of adding data across worker nodes.
Even when I have checked that data is cleanly split among worker nodes, .reduce(_ + _) (Scala syntax) takes a long time (several seconds vs. sub-second for map tasks), as the HashPartitioner seems to shuffle my data to a random node to add it there.
How can I write a simple word count reduce on each partitioner without triggering the shuffling step in Scala with Spark Streaming?
Note DStream objects lack some RDD methods, which are available only through the transform method.
It seems I might be able to use combineByKey. I would like to skip the mergeCombiners() step and instead leave accumulated tuples where they are.
The book "Learning Spark" enigmatically says:
We can disable map-side aggregation in combineByKey() if we know that our data won’t benefit from it. For example, groupByKey() disables map-side aggregation as the aggregation function (appending to a list) does not save any space. If we want to disable map-side combines, we need to specify the partitioner; for now you can just use the partitioner on the source RDD by passing rdd.partitioner.
https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html
The book then continues to supply no syntax for how to do this, nor have I had any luck with google so far.
What is worse, as far as I know, the partitioner is not set for DStream RDDs in Spark Streaming, so I don't know how to supply a partitioner to combineByKey that doesn't end up shuffling data.
Also, what does "map-side" actually mean and what consequences does mapSideCombine = false have, exactly?
The scala implementation for combineByKey can be found at
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
Look for combineByKeyWithClassTag.
If the solution involves a custom partitioner, please include also a code sample for how to apply that partitioner to the incoming DStream.
This can be done using mapPartitions, which takes a function that maps an iterator of the input RDD on one partition to an iterator over the output RDD.
To implement a word count, I map to _._2 to remove the Kafka key and then perform a fast iterator word count using foldLeft, initializing a mutable.hashMap, which then gets converted to an Iterator to form the output RDD.
val myDstream = messages
.mapPartitions( it =>
it.map(_._2)
.foldLeft(new mutable.HashMap[String, Int])(
(count, key) => count += (key -> (count.getOrElse(key, 0) + 1))
).toIterator
)

Spark RDD's - how do they work

I have a small Scala program that runs fine on a single-node. However, I am scaling it out so it runs on multiple nodes. This is my first such attempt. I am just trying to understand how the RDDs work in Spark so this question is based around theory and may not be 100% correct.
Let's say I create an RDD:
val rdd = sc.textFile(file)
Now once I've done that, does that mean that the file at file is now partitioned across the nodes (assuming all nodes have access to the file path)?
Secondly, I want to count the number of objects in the RDD (simple enough), however, I need to use that number in a calculation which needs to be applied to objects in the RDD - a pseudocode example:
rdd.map(x => x / rdd.size)
Let's say there are 100 objects in rdd, and say there are 10 nodes, thus a count of 10 objects per node (assuming this is how the RDD concept works), now when I call the method is each node going to perform the calculation with rdd.size as 10 or 100? Because, overall, the RDD is size 100 but locally on each node it is only 10. Am I required to make a broadcast variable prior to doing the calculation? This question is linked to the question below.
Finally, if I make a transformation to the RDD, e.g. rdd.map(_.split("-")), and then I wanted the new size of the RDD, do I need to perform an action on the RDD, such as count(), so all the information is sent back to the driver node?
val rdd = sc.textFile(file)
Does that mean that the file is now partitioned across the nodes?
The file remains wherever it was. The elements of the resulting RDD[String] are the lines of the file. The RDD is partitioned to match the natural partitioning of the underlying file system. The number of partitions does not depend on the number of nodes you have.
It is important to understand that when this line is executed it does not read the file(s). The RDD is a lazy object and will only do something when it must. This is great because it avoids unnecessary memory usage.
For example, if you write val errors = rdd.filter(line => line.startsWith("error")), still nothing happens. If you then write val errorCount = errors.count now your sequence of operations will need to be executed because the result of count is an integer. What each worker core (executor thread) will do in parallel then, is read a file (or piece of file), iterate through its lines, and count the lines starting with "error". Buffering and GC aside, only a single line per core will be in memory at a time. This makes it possible to work with very large data without using a lot of memory.
I want to count the number of objects in the RDD, however, I need to use that number in a calculation which needs to be applied to objects in the RDD - a pseudocode example:
rdd.map(x => x / rdd.size)
There is no rdd.size method. There is rdd.count, which counts the number of elements in the RDD. rdd.map(x => x / rdd.count) will not work. The code will try to send the rdd variable to all workers and will fail with a NotSerializableException. What you can do is:
val count = rdd.count
val normalized = rdd.map(x => x / count)
This works, because count is an Int and can be serialized.
If I make a transformation to the RDD, e.g. rdd.map(_.split("-")), and then I wanted the new size of the RDD, do I need to perform an action on the RDD, such as count(), so all the information is sent back to the driver node?
map does not change the number of elements. I don't know what you mean by "size". But yes, you need to perform an action, such as count to get anything out of the RDD. You see, no work at all is performed until you perform an action. (When you perform count, only the per-partition count will be sent back to the driver, of course, not "all the information".)
Usually, the file (or parts of the file, if it's too big) is replicated to N nodes in the cluster (by default N=3 on HDFS). It's not an intention to split every file between all available nodes.
However, for you (i.e. the client) working with file using Spark should be transparent - you should not see any difference in rdd.size, no matter on how many nodes it's split and/or replicated. There are methods (at least, in Hadoop) to find out on which nodes (parts of the) file can be located at the moment. However, in simple cases you most probably won't need to use this functionality.
UPDATE: an article describing RDD internals: https://cs.stanford.edu/~matei/papers/2012/nsdi_spark.pdf

Does groupByKey in Spark preserve the original order?

In Spark, the groupByKey function transforms a (K,V) pair RDD into a (K,Iterable<V>) pair RDD.
Yet, is this function stable? i.e is the order in the iterable preserved from the original order?
For example, if I originally read a file of the form:
K1;V11
K2;V21
K1;V12
May my iterable for K1 be like (V12, V11) (thus not preserving the original order) or can it only be (V11, V12) (thus preserving the original order)?
No, the order is not preserved. Example in spark-shell:
scala> sc.parallelize(Seq(0->1, 0->2), 2).groupByKey.collect
res0: Array[(Int, Iterable[Int])] = Array((0,ArrayBuffer(2, 1)))
The order is timing dependent, so it can vary between runs. (I got the opposite order on my next run.)
What is happening here? groupByKey works by repartitioning the RDD with a HashPartitioner, so that all values for a key end in up in the same partition. Then it performs the aggregation locally on each partition.
The repartitioning is also called a "shuffle", because the lines of the RDD are redistributed between nodes. The shuffle files are pulled from the other nodes in parallel. The new partition is built from these pieces in the order that they arrive. The data from the slowest source will be at the end of the new partition, and at the end of the list in groupByKey.
(Data pulled from the worker itself is of course fastest. Since there is no network transfer involved here, this data is pulled synchronously, and thus arrives in order. (It seems to, at least.) So to replicate my experiment you need at least 2 Spark workers.)
Source: http://apache-spark-user-list.1001560.n3.nabble.com/Is-shuffle-quot-stable-quot-td7628.html
Spark (and other map reduce frameworks) sort data by partitioning , and then merging. Since a merge sort is a stable operation I would guess that the result is stable. After looking more into the source I found that if spark.shuffle.spill is true it uses an external sort , merge sort in this case, which is stable. I'm not 100% sure what it does if it's allowed to spill to disk.
From source:
private val externalSorting = SparkEnv.get.conf.getBoolean("spark.shuffle.spill", true)
Partitioning is also a stable operation because it does no reordering