reduce() vs. fold() in Apache Spark - scala

What is the difference between reduce and fold with respect to their technical implementation?
I understand that they differ in their signatures, as fold accepts an additional parameter (i.e. an initial value) which gets added to each partition's output.
Can someone describe the use cases for these two actions?
Which would perform better in which scenario, assuming 0 is used as the initial value for fold?
Thanks in advance.

There is no practical difference when it comes to performance whatsoever:
The RDD.fold action uses fold on the partition Iterators, which is implemented using foldLeft.
RDD.reduce uses reduceLeft on the partition Iterators.
Both methods keep a mutable accumulator and process partitions sequentially using simple loops, with foldLeft implemented like this:
foreach (x => result = op(result, x))
and reduceLeft like this:
for (x <- self) {
  if (first) {
    ...
  }
  else acc = op(acc, x)
}
The practical difference between these methods in Spark is related only to their behavior on empty collections and to the ability to use a mutable buffer (arguably this is related to performance). You'll find some discussion in Why is the fold action necessary in Spark?
Moreover, there is no difference in the overall processing model:
Each partition is processed sequentially using a single thread.
Partitions are processed in parallel using multiple executors / executor threads.
Final merge is performed sequentially using a single thread on the driver.
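For illustration, a minimal sketch (assuming a spark-shell style SparkContext named sc) showing that the two actions do essentially the same work on non-empty data, and that the only observable difference is on an empty RDD:
val nums = sc.parallelize(1 to 100, numSlices = 4)

nums.reduce(_ + _)   // 5050
nums.fold(0)(_ + _)  // 5050 -- same work; the zero value is just the loop's starting accumulator

val empty = sc.emptyRDD[Int]
empty.fold(0)(_ + _) // 0
// empty.reduce(_ + _) would throw java.lang.UnsupportedOperationException: empty collection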

Related

Flink Custom Partition Function

I am using Scala on Flink with the DataSet API.
I want to re-partition my data across the nodes. Spark has a function that lets the user re-partition the data with a given numberOfPartitions parameter (link), and I believe Flink does not support such a function.
Thus, I wanted to achieve this by implementing a custom partitioning function.
My data is of type DataSet[(Double, SparseVector)].
An example line from the data:
(1.0 SparseVector((2024,1.0), (2025,1.0), (2030,1.0), (2045,1.0), (2046,1.41), (2063,1.0), (2072,1.0), (3031,1.0), (3032,1.0), (4757,1.0), (4790,1.0), (177196,1.0), (177197,0.301), (177199,1.0), (177202,1.0), (1544177,1.0), (1544178,1.0), (1544179,1.0), (1654031,1.0), (1654190,1.0), (1654191,1.0), (1654192,1.0), (1654193,1.0), (1654194,1.0), (1654212,1.0), (1654237,1.0), (1654238,1.0)))
Since my "Double" is binary (1 or -1), I want to partition my data according to the length of the SparseVector.
My custom partitioner is as follows:
class myPartitioner extends Partitioner[SparseVector] {
  override def partition(key: SparseVector, numPartitions: Int): Int = {
    key.size % numPartitions
  }
}
I call this custom partitioner as follows:
data.partitionCustom(new myPartitioner(), 1)
Can somebody please help me understand how to specify the number of partitions (the "numPartitions" argument) when calling myPartitioner in Scala?
Thank you.
In Flink you can set the parallelism for a single operator with setParallelism, or for all operators using environment.setParallelism. I hope this link will help you.
Spark uses the repartition(n: Int) function to redistribute data into n partitions, which will be processed by n tasks. From my perspective, this involves two changes: data redistribution and the number of downstream tasks.
Therefore, in Apache Flink, I think that the Partitioner maps to data redistribution and the parallelism maps to the number of downstream tasks, which means you can use setParallelism to determine the "numPartitions"; see the sketch below.
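For illustration, a minimal sketch under that interpretation (Flink DataSet API with flink-ml's SparseVector; numPartitions is a hypothetical value, the data source is a placeholder, and the partitioner is the asker's, renamed MyPartitioner to follow Scala class-naming convention):
import org.apache.flink.api.common.functions.Partitioner
import org.apache.flink.api.scala._
import org.apache.flink.ml.math.SparseVector

class MyPartitioner extends Partitioner[SparseVector] {
  // Route each record by the length of its SparseVector key.
  override def partition(key: SparseVector, numPartitions: Int): Int =
    key.size % numPartitions
}

val env = ExecutionEnvironment.getExecutionEnvironment
val numPartitions = 8                              // hypothetical target
val data: DataSet[(Double, SparseVector)] = ???    // loaded as in the question

// The parallelism of the operator consuming the custom partitioning is what
// Flink passes to partition() as numPartitions.
val result = data
  .partitionCustom(new MyPartitioner(), 1)         // field 1 = the SparseVector
  .map(identity)
  .setParallelism(numPartitions)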
I'm assuming you're using the length of the SparseVector just to have something that gives you relatively random values to use for partitioning. If that's true, then you can just do a DataSet.rebalance(). If you follow that by any operator (including a Sink) where you set the parallelism to numPartitions, then you should get nicely repartitioned data.
But your description of ...want to re-partition my data across the nodes makes me think that you're trying to apply Spark's concept of RDDs to Flink, which isn't really valid. E.g. assuming you have numPartition parallel operators processing the (re-partitioned) data in your DataSet, then these operators will be running in slots provided by the available TaskManagers, and these slots might or might not be on different physical servers.
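If an even spread (rather than routing by vector length) is all that is needed, a minimal sketch of the rebalance() route mentioned above, reusing the placeholders from the previous sketch:
val evenlySpread = data
  .rebalance()                    // round-robin redistribution, no key required
  .map(identity)                  // any downstream operator (or a sink)...
  .setParallelism(numPartitions)  // ...running with numPartitions parallel instances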

Are two transformations on the same RDD executed in parallel in Apache Spark?

Let's say we have the following Scala program:
val inputRDD = sc.textFile("log.txt")
inputRDD.persist()
val errorsRDD = inputRDD.filter(x => x.contains("error"))
val warningsRDD = inputRDD.filter(x => x.contains("warning"))
println("Errors: " + errorsRDD.count() + ", Warnings: " + warningsRDD.count())
We create a simple RDD, persist it, perform two transformations on the RDD and finally have an action which uses the RDDs.
When the print is called, the transformations are executed; each transformation is of course parallelized, depending on the cluster manager.
My main question is: are the two actions and their transformations executed in parallel or in sequence? Or does errorsRDD.count() execute first and then warningsRDD.count(), in sequence?
I'm also wondering if there is any point in using persist in this example.
All standard RDD methods are blocking (with the exception of AsyncRDDActions), so the actions will be evaluated sequentially. It is possible to execute multiple actions concurrently using non-blocking submission (threads, Futures), with a correct configuration of the in-application scheduler or with explicitly limited resources for each action.
Regarding cache, it is impossible to answer without knowing the context. Depending on the cluster configuration, storage, and data locality it might be cheaper to load the data from disk again, especially when resources are limited and subsequent actions might trigger the cache cleaner.
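For illustration, a minimal sketch of the non-blocking submission mentioned above, using plain Scala Futures around the two counts (it assumes an existing SparkContext sc and, ideally, spark.scheduler.mode=FAIR so the concurrent jobs share resources):
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val inputRDD = sc.textFile("log.txt").persist()

// Each Future submits an independent Spark job from its own driver-side thread.
val errorsF   = Future { inputRDD.filter(_.contains("error")).count() }
val warningsF = Future { inputRDD.filter(_.contains("warning")).count() }

// Alternatively, AsyncRDDActions provides countAsync(), which returns a FutureAction
// without the explicit Future wrapper.
val (errors, warnings) = Await.result(errorsF.zip(warningsF), 10.minutes)
println(s"Errors: $errors, Warnings: $warnings")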
This will execute errorsRDD.count() first and then warningsRDD.count().
The point of using persist here is that when the first count is executed, inputRDD will be cached in memory.
For the second count, Spark won't need to re-read the whole content of the file from storage again, so the execution time of this count should be much faster than the first.

Spark aggregateByKey reduceByKey - aggregation (e.g collection) must be thread safe?

This probably sounds basic. If I do aggregateByKey or reduceByKey, and I aggregate into a specific implementation of a collection, do I need to use a thread-safe collection during this aggregation?
Is this OK?
val sc: SparkContext = ???
val notAggregated = Seq(((1), 100), ((1), 200), ((1), 300), ((2), 100), ((2), 200))
sc.parallelize(notAggregated)
  .aggregateByKey(mutable.HashSet.empty[Int])(
    seqOp = (set, member) => set += member,
    combOp = (set1, set2) => set1 ++= set2)
  .foreach(println(_))
It doesn't have to be thread-safe.
It uses combineByKey in the background, and if you look at the Spark source code, the class PairDStreamFunctions.groupByKeyAndWindow uses an ArrayBuffer as the combiner. Look also at the comment there; it explains why this is safe.
Why?
You are not passing the zero value directly; you write a function that creates the combiner. Spark then creates a combiner in each partition (it clones the value for each partition). One combiner in one partition is used like a normal object, without parallel access from many threads, because objects in one partition are processed sequentially.
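For illustration, a minimal sketch of the same aggregation written against combineByKey directly, to make the per-partition combiner creation explicit (the call shape is the standard PairRDDFunctions.combineByKey; the data is the question's):
import scala.collection.mutable
import org.apache.spark.SparkContext

val sc: SparkContext = ??? // existing context, as in the question
val notAggregated = Seq((1, 100), (1, 200), (1, 300), (2, 100), (2, 200))

sc.parallelize(notAggregated)
  .combineByKey(
    // createCombiner: called for the first value of a key *within each partition*,
    // so every partition gets its own HashSet -- nothing is shared between threads.
    (member: Int) => mutable.HashSet(member),
    // mergeValue: runs sequentially over the remaining values in that partition.
    (set: mutable.HashSet[Int], member: Int) => set += member,
    // mergeCombiners: merges the per-partition sets after the shuffle, again one
    // pair at a time on a single thread.
    (set1: mutable.HashSet[Int], set2: mutable.HashSet[Int]) => set1 ++= set2)
  .foreach(println(_))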

How to `reduce` only within partitions in Spark Streaming, perhaps using combineByKey?

I have data already sorted by key into my Spark Streaming partitions by virtue of Kafka, i.e. keys found on one node are not found on any other nodes.
I would like to use Redis and its incrby (increment by) command as a state engine, and to reduce the number of requests sent to Redis, I would like to partially reduce my data by doing a word count on each worker node by itself. (The key is tag+timestamp to obtain my functionality from word count.)
I would like to avoid shuffling and let Redis take care of adding data across worker nodes.
Even though I have checked that the data is cleanly split among worker nodes, .reduce(_ + _) (Scala syntax) takes a long time (several seconds vs. sub-second for map tasks), as the HashPartitioner seems to shuffle my data to a random node to add it there.
How can I write a simple word-count reduce on each partition, without triggering the shuffle step, in Scala with Spark Streaming?
Note DStream objects lack some RDD methods, which are available only through the transform method.
It seems I might be able to use combineByKey. I would like to skip the mergeCombiners() step and instead leave accumulated tuples where they are.
The book "Learning Spark" enigmatically says:
We can disable map-side aggregation in combineByKey() if we know that our data won’t benefit from it. For example, groupByKey() disables map-side aggregation as the aggregation function (appending to a list) does not save any space. If we want to disable map-side combines, we need to specify the partitioner; for now you can just use the partitioner on the source RDD by passing rdd.partitioner.
https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html
The book then goes on without supplying any syntax for how to do this, nor have I had any luck with Google so far.
What is worse, as far as I know, the partitioner is not set for DStream RDDs in Spark Streaming, so I don't know how to supply a partitioner to combineByKey that doesn't end up shuffling data.
Also, what does "map-side" actually mean and what consequences does mapSideCombine = false have, exactly?
The scala implementation for combineByKey can be found at
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
Look for combineByKeyWithClassTag.
If the solution involves a custom partitioner, please include also a code sample for how to apply that partitioner to the incoming DStream.
This can be done using mapPartitions, which takes a function that maps an iterator of the input RDD on one partition to an iterator over the output RDD.
To implement a word count, I map to _._2 to remove the Kafka key and then perform a fast iterator word count using foldLeft, initializing a mutable.HashMap, which then gets converted to an Iterator to form the output RDD.
val myDstream = messages
  .mapPartitions( it =>
    it.map(_._2)
      .foldLeft(new mutable.HashMap[String, Int])(
        (count, key) => count += (key -> (count.getOrElse(key, 0) + 1))
      ).toIterator
  )

Using Futures within Spark

A Spark job makes a remote web service call for every element in an RDD. A simple implementation might look something like this:
def webServiceCall(url: String) = scala.io.Source.fromURL(url).mkString
val rdd2 = rdd1.map(x => webServiceCall(x.field1))
(The above example has been kept simple and does not handle timeouts).
There is no interdependency between any of the results for different elements of the RDD.
Would the above be improved by using Futures to optimise performance by making parallel calls to the web service for each element of the RDD? Or does Spark itself have that level of optimization built in, so that it will run the operations on each element in the RDD in parallel?
If the above can be optimized by using Futures, does anyone have some code examples showing the correct way to use Futures within a function passed to a Spark RDD.
Thanks
Or does Spark itself have that level of optimization built in, so that it will run the operations on each element in the RDD in parallel?
It doesn't. Spark parallelizes tasks at the partition level but by default every partition is processed sequentially in a single thread.
Would the above be improved by using Futures
It could be an improvement, but it is quite hard to do right. In particular:
every Future has to be completed in the same stage before any reshuffle takes place.
given the lazy nature of the Iterators used to expose partition data, you cannot do it with high-level primitives like map (see for example Spark job with Async HTTP call).
you can build your custom logic using mapPartitions but then you have to deal with all the consequences of non-lazy partition evaluation.
I couldn't find an easy way to achieve this, but after several iterations of retries this is what I did, and it's working for a huge list of queries. Basically we used this to batch a huge query into multiple sub-queries.
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import org.apache.spark.rdd.RDD

// Break down your huge workload into smaller chunks; in this case a huge query string is broken
// down into a small set of subqueries.
// If you need to optimize further, you can provide an explicit partition count when parallelizing.
val queries = sqlContext.sparkContext.parallelize[String](subQueryList.toSeq)

// Then map each one of those to a Spark task; in this case it's a Future that returns a string.
val tasks: RDD[Future[String]] = queries.map(query => {
  val task = makeHttpCall(query) // Method returns the http call response as a Future[String]
  task.recover {
    case ex => logger.error("recover: " + ex.printStackTrace())
  }
  task onFailure {
    case t => logger.error("execution failed: " + t.getMessage)
  }
  task
})

// Note: the http call is still not invoked; you are only including it as part of the lineage.

// Then in each partition you combine all the Futures (there could be several tasks per partition),
// sequence them, and Await the result -- this blocks until every Future in that sequence is resolved.
val contentRdd = tasks.mapPartitions[String] { f: Iterator[Future[String]] =>
  val searchFuture: Future[Iterator[String]] = Future sequence f
  Await.result(searchFuture, threadWaitTime.seconds)
}

// Note: at this point you can apply further transformations to this RDD and they will be appended
// to the lineage. When you perform an action on it, the mapPartitions step is evaluated: the tasks
// fire the subqueries as fully parallel http requests and the results are collected into a single RDD.
I'm reposting it from my original answer here