Why is there no `reduceByValue` in Spark collections?

I am learning Spark and Scala and keep coming across this pattern:
val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
While I understand what it does, I don't understand why it is used instead of having something like:
val lines = sc.textFile("data.txt")
val counts = lines.reduceByValue((v1, v2) => v1 + v2)
Given that Spark is designed to process large amounts of data efficiently, it seems counterintuitive to always have to perform an additional step of converting the records into key/value pairs and then reducing by key, instead of simply being able to reduce by value.

First, this "additional step" doesn't really cost much (see more details at the end) - it doesn't shuffle the data, and it is performed together with other transformations: transformations can be "pipelined" as long as they don't change the partitioning.
Second - the API you suggest seems very specific to counting: although you suggest that reduceByValue would take a binary operator f: (Int, Int) => Int, your suggested API assumes each value is first mapped to the value 1 before that operator is applied to all identical values - an assumption that is hardly useful in any scenario other than counting. Adding such specific APIs would just bloat the interface and would never cover all use cases anyway (what's next - RDD.wordCount?), so it's better to give users minimal building blocks (along with good documentation).
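That said, nothing stops you from building such a shorthand yourself out of those building blocks, e.g. as an implicit class. This is only a sketch - reduceByValue and ReduceByValueOps are names made up here, not part of Spark's API:
import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag
// Hypothetical helper (not part of Spark): expresses the suggested
// reduceByValue in terms of the existing map + reduceByKey primitives.
object ReduceByValueOps {
  implicit class RichRDD[T: ClassTag](rdd: RDD[T]) {
    def reduceByValue(f: (Int, Int) => Int): RDD[(T, Int)] =
      rdd.map(t => (t, 1)).reduceByKey(f)
  }
}
// Usage (assuming an existing SparkContext `sc`):
//   import ReduceByValueOps._
//   val counts = sc.textFile("data.txt").reduceByValue(_ + _)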
Lastly - if you're not happy with such low-level APIs, you can use Spark SQL's DataFrame API to get some higher-level APIs that hide these details - that's one of the reasons DataFrames exist:
val linesDF = sc.textFile("file.txt").toDF("line")
val wordsDF = linesDF.explode("line","word")((line: String) => line.split(" "))
val wordCountDF = wordsDF.groupBy("word").count()
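On Spark 2.x, where DataFrame.explode is deprecated, the same word count can be written with the split and explode column functions - a sketch, assuming a SparkSession named spark:
import org.apache.spark.sql.functions.{col, explode, split}
// Same word count, expressed with column functions instead of DataFrame.explode.
val linesDF = spark.read.text("file.txt").toDF("line")
val wordsDF = linesDF.select(explode(split(col("line"), " ")).as("word"))
val wordCountDF = wordsDF.groupBy("word").count()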
EDIT: as requested - some more details about why the performance impact of this map operation is either small or entirely negligible:
First, I'm assuming you are interested in producing the same result as the map -> reduceByKey code would produce (i.e. a word count), which means the mapping from each record to the value 1 must take place somewhere, otherwise there would be nothing for the summing function (v1, v2) => v1 + v2 to operate on (that function takes Ints, and they must be created somewhere).
To my understanding, you're just wondering why this has to happen as a separate map operation.
So, we're actually interested in the overhead of adding another map operation.
Consider these two functionally-identical Spark transformations:
val rdd: RDD[String] = ???
/*(1)*/ rdd.map(s => s.length * 2).collect()
/*(2)*/ rdd.map(s => s.length).map(_ * 2).collect()
Q: Which one is faster?
A: They perform the same.
Why? Because as long as two consecutive transformations on an RDD do not change the partitioning (and that's the case in your original example too), Spark will group them together, and perform them within the same task. So, per record, the difference between these two will come down to the difference between:
/*(1)*/ s.length * 2
/*(2)*/ val r1 = s.length; r1 * 2
Which is negligible, especially when you're discussing distributed execution on large datasets, where execution time is dominated by things like shuffling, de/serialization and IO.
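You can see this for yourself: both variants produce a single stage with only narrow map transformations in the lineage, which toDebugString makes visible. A quick sketch, assuming an existing SparkContext sc:
// Both lineages show only MapPartitionsRDDs - no shuffle, one stage.
val rdd = sc.parallelize(Seq("spark", "scala"))
println(rdd.map(s => s.length * 2).toDebugString)
println(rdd.map(s => s.length).map(_ * 2).toDebugString)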

Spark RDD aggregate/fold operation business scenario

[Edit] Actually, my question is about the business scenario/requirement for the Spark RDD aggregate operation, especially with regard to zeroValue and RDD partitions, not about how it works in Spark. Sorry for the confusion.
I have been learning all kinds of Spark RDD computations. While looking into the aggregate/fold-related operations, I cannot think of a business scenario for aggregate/fold.
For example, I am going to calculate the sum of the values in an RDD with fold.
val myrdd1 = sc.parallelize(1 to 10, 2)
myrdd1.fold(1)((x,y) => x + y)
It returns 58.
If we change the number of partitions from 2 to 4, it returns 60, but I expected 55.
I understand what Spark does if I don't give a partition number when creating myrdd1: it takes a default partition number which is not known in advance, so the return value would be "unstable".
So I do not understand why Spark has this kind of logic. Is there a real business scenario with this kind of requirement?
fold aggregates the data per partition, starting with the zero value from the first argument list. The per-partition results are then combined at the end, again together with the zero value.
Thus, for 2 partitions you correctly received 58:
(1+1+2+3+4+5)+(1+6+7+8+9+10)+1
Similarly, for 4 partitions the correct result is 60:
(1+1+2+3)+(1+4+5+6)+(1+7+8)+(1+9+10)+1
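If you want to verify how the elements are actually split across partitions, you can inspect them directly - a small sketch, assuming the same myrdd1 as in the question:
// glom() turns each partition into an array, so collect() shows exactly
// which elements live in which partition, e.g. [1,2,3,4,5] and [6,7,8,9,10].
myrdd1.glom().collect().foreach(p => println(p.mkString("[", ",", "]")))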
For real-world scenarios, this kind of divide-and-conquer computation is useful wherever you have an associative and commutative operation, i.e. when the order of execution doesn't matter, as with mathematical addition. With that, Spark only moves partial aggregation results across the network instead of, for instance, shuffling whole blocks.
Your expectation of receiving 55 would be met if, instead of fold, you used treeReduce:
"treeReduce" should "compute sum of numbers" in {
val numbersRdd = sparkContext.parallelize(1 to 20, 10)
val sumComputation = (v1: Int, v2: Int) => v1 + v2
val treeSum = numbersRdd.treeReduce(sumComputation, 2)
treeSum shouldEqual(210)
val reducedSum = numbersRdd.reduce(sumComputation)
reducedSum shouldEqual(treeSum)
}
Some time ago I wrote a small post about tree aggregations in RDD: http://www.waitingforcode.com/apache-spark/tree-aggregations-spark/read
I think the result you are getting is as expected; I will try to explain how it works.
You have an `rdd` with 10 elements in two partitions:
val myrdd1 = sc.parallelize(1 to 10, 2)
Let's suppose the two partitions contain p1 = {1,2,3,4,5} and p2 = {6,7,8,9,10}.
Now, as per the documentation, fold operates on each partition.
You get (the zero value, which is 1 in your case) + 1+2+3+4+5 = 16 and (1 as zero value) + 6+7+8+9+10 = 41.
Finally it folds those partial results together with the zero value once more: 1 + 16 + 41 = 58.
Similarly, if you have 4 partitions, fold operates on the four partitions with a zero value of 1 and combines the four results with another fold, again with a zero value of 1, which results in 60.
From the RDD.fold documentation:
Aggregate the elements of each partition, and then the results for all
the partitions, using a given associative function and a neutral "zero
value". The function op(t1, t2) is allowed to modify t1 and return it
as its result value to avoid object allocation; however, it should not
modify t2.
This behaves somewhat differently from fold operations implemented for
non-distributed collections in functional languages like Scala. This
fold operation may be applied to partitions individually, and then
fold those results into the final result, rather than apply the fold
to each element sequentially in some defined ordering. For functions
that are not commutative, the result may differ from that of a fold
applied to a non-distributed collection.
For a sum, the zero value should be 0, which gives you the expected result of 55.
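A minimal sketch, using the same 1 to 10 data as above:
// With 0 as the neutral zero value the result no longer depends on partitioning.
sc.parallelize(1 to 10, 2).fold(0)((x, y) => x + y) // 55
sc.parallelize(1 to 10, 4).fold(0)((x, y) => x + y) // 55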
Hope this helps!

parallelization level of Tupled RDD data

Suppose I have an RDD with the following type:
RDD[(Long, List[Integer])]
Can I assume that the entire list is located on the same worker? I want to know whether certain operations are acceptable at the RDD level or should be calculated at the driver. For instance:
val data: RDD[(Long, List[Integer])] = someFunction() // creates a list for each timeslot
Please note that the List may be the result of aggregate or any other operation and not necessarily be created as one piece.
val diffFromMax = data.map(item => (item._1, findDiffFromMax(item._2)))
def findDiffFromMax(data: List[Integer]): List[Integer] = {
  val maxItem = data.max
  data.map(item => (maxItem - item))
}
The thing is that if the List is distributed, calculating maxItem may cause a lot of network traffic. This can be handled with an RDD of the following type:
RDD[(Long, Integer /* max item */, List[Integer])]
where the max item is calculated at the driver.
So my questions (actually two) are:
At what point can I assume that the data of an RDD record is located on one worker, if ever? (Answers with references to the docs or personal evaluations would be great.) What happens in the case of a tuple inside a tuple: ((Long, Integer), Double)?
What is the common practice for designing algorithms with tuples? Should I always treat the data as if it may appear on different workers? Should I always break it down to the minimal granularity in the first tuple field? For a case where there is data (Double) for a user (String) in a timeslot (Long), should the data be (Long, (String, Double)), ((Long, String), Double), or maybe (String, (Long, Double))? Or maybe this is not optimal and matrices are better?
The short answer is yes, your list would be located on a single worker.
Your tuple is a single record in the RDD. A single record is ALWAYS on a single partition (which would be on a single worker).
When you run findDiffFromMax you are running it on the worker that holds the record (the function is serialized to all the workers so it can run where the data is).
The thing you should note is that when you generate a tuple of (k, v), this generally means a key/value pair, so you can do key-based operations on the RDD. The order ((Long, (String, Double)) vs. ((Long, String), Double) or any other arrangement) doesn't really matter, as it is all a single record. The only thing that matters is which part is the key, for the purpose of key-based operations, so the question comes down to the logic of your calculation.
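A small sketch to illustrate, reusing data, someFunction and findDiffFromMax exactly as declared in the question: the per-record transformation needs no shuffle, and only a key-based operation such as reduceByKey cares about which tuple element is the key.
// Each (timeslot, values) tuple is one record, so this map runs locally
// on whichever worker holds that record - no network traffic involved.
val diffFromMax = data.mapValues(findDiffFromMax)
// Only a key-based operation shuffles, and only here does the choice of
// the first tuple element as the key matter.
val maxPerSlot = data.mapValues(_.map(_.intValue).max)
                     .reduceByKey((a, b) => math.max(a, b))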

Spark: How to combine 2 sorted RDDs so that order is preserved after union?

I have 2 sorted RDDs:
val rdd_a = some_pair_rdd.sortByKey()
  .zipWithIndex
  .filter(f => f._2 < n)
  .map(f => f._1)

val rdd_b = another_pair_rdd.sortByKey()
  .zipWithIndex
  .filter(f => f._2 < n)
  .map(f => f._1)
val all_rdd = rdd_a.union(rdd_b)
In all_rdd, I see that the order is not necessarily maintained as I'd imagined (that all elements of rdd_a come first, followed by all elements of rdd_b). Is my assumption incorrect (about the contract of union), and if so, what should I use to append multiple sorted RDDs into a single rdd?
I'm fairly new to Spark so I could be wrong, but from what I understand Union is a narrow transformation. That is, each executor joins only its local blocks of RDD a with its local blocks of RDD b and then returns that to the driver.
As an example, let's say that you have 2 executors and 2 RDDs.
RDD_A = ["a","b","c","d","e","f"]
and
RDD_B = ["1","2","3","4","5","6"]
Let Executor 1 contain the first half of both RDDs and Executor 2 the second half. When they perform the union on their local blocks, it would look something like:
Union_executor1 = ["a","b","c","1","2","3"]
and
Union_executor2 = ["d","e","f","4","5","6"]
So when the executors pass their parts back to the driver you would have ["a","b","c","1","2","3","d","e","f","4","5","6"]
Again, I'm new to Spark and I could be wrong. I'm just sharing based on my understanding of how it works with RDDs. Hopefully we can both learn something from this.
You can't. Spark does not have a merge sort, because you can't make assumptions about the way that the RDDs are actually stored on the nodes. If you want things in sort order after you take the union, you need to sort again.
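If what you need is a global order in the final RDD, the simplest fix is to sort again after the union - a sketch, assuming the key type of rdd_a/rdd_b has an Ordering:
// union only concatenates the two RDDs' partitions and guarantees nothing
// about the order you observe afterwards, so re-sort explicitly.
val all_sorted = rdd_a.union(rdd_b).sortByKey()
If instead you want "all of rdd_a first, then all of rdd_b", you can tag each side before the union and sort by (tag, key):
// Tag records with 0 for rdd_a and 1 for rdd_b, sort, then drop the tag.
val tagged = rdd_a.map(p => ((0, p._1), p._2)).union(rdd_b.map(p => ((1, p._1), p._2)))
val ordered = tagged.sortByKey().map { case ((_, k), v) => (k, v) }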

Filter columns in large dataset with Spark

I have a dataset which is 1,000,000 rows by about 390,000 columns. The fields are all binary, either 0 or 1. The data is very sparse.
I've been using Spark to process this data. My current task is to filter the data--I only want data in 1000 columns that have been preselected. This is the current code that I'm using to achieve this task:
val result = bigdata.map(_.zipWithIndex.filter{case (value, index) => selectedColumns.contains(index)})
bigdata is just an RDD[Array[Int]].
However, this code takes quite a while to run. I'm sure there's a more efficient way to filter my dataset that doesn't involve going in and filtering every single row separately. Would loading my data into a DataFrame and manipulating it through the DataFrame API make things faster/easier? Should I be looking into column-store based databases?
You can start with making your filter a little bit more efficient. Please note that:
your RDD contains Array[Int]. It means you can access the nth element of each row in O(1) time
#selectedColumns << #columns
Considering these two facts, it should be obvious that it doesn't make sense to iterate over all elements of each row, not to mention the contains calls. Instead you can simply map over selectedColumns:
// Optional: sort the selected indices if they are not already ordered
val orderedSelectedColumns = selectedColumns.toList.sorted.toArray
rdd.map(row => orderedSelectedColumns.map(i => row(i)))
Comparing time complexity:
zipWithIndex + filter (assuming the best-case scenario where contains is O(1)) - O(#rows * #columns)
map - O(#rows * #selectedColumns)
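Putting it together, a minimal runnable sketch (the sample data and selected indices below are made up for illustration, assuming an existing SparkContext sc):
import org.apache.spark.rdd.RDD
// Tiny made-up example: 3 rows x 6 columns, keeping columns 1, 3 and 4.
val bigdata: RDD[Array[Int]] = sc.parallelize(Seq(
  Array(0, 1, 0, 1, 1, 0),
  Array(1, 0, 0, 0, 1, 1),
  Array(0, 0, 1, 1, 0, 0)))
val orderedSelectedColumns = Array(1, 3, 4)
// For every row, look up only the selected indices: O(#rows * #selectedColumns).
val result = bigdata.map(row => orderedSelectedColumns.map(i => row(i)))
result.collect().foreach(r => println(r.mkString(",")))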
The easiest way to speed up execution is to parallelize it with partitionBy:
bigdata.partitionBy(new HashPartitioner(numPartitions)).foreachPartition(...)
foreachPartition receives an Iterator over which you can map and filter.
numPartitions is a val which you can set to the desired number of parallel partitions.

OutOfBoundsException with ALS - Flink MLlib

I'm building a recommendation system for movies, using the MovieLens datasets available here:
http://grouplens.org/datasets/movielens/
To compute this recommendation system, I use the ML library of Flink in Scala, and particularly the ALS algorithm (org.apache.flink.ml.recommendation.ALS).
I first map the movie ratings into a DataSet[(Int, Int, Double)] and then create a trainingSet and a testSet (see the code below).
My problem is that there is no error when I use the ALS.fit function on the whole dataset (all the ratings), but if I remove just one rating, the fit function doesn't work anymore, and I don't understand why.
Do you have any ideas? :)
Code used:
Rating.scala
case class Rating(userId: Int, movieId: Int, rating: Double)
PreProcessing.scala
object PreProcessing {

  def getRatings(env: ExecutionEnvironment, ratingsPath: String): DataSet[Rating] = {
    env.readCsvFile[(Int, Int, Double)](
      ratingsPath, ignoreFirstLine = true,
      includedFields = Array(0, 1, 2)).map { r => new Rating(r._1, r._2, r._3) }
  }
}
Processing.scala
object Processing {

  private val ratingsPath: String = "Path_to_ratings.csv"

  def main(args: Array[String]) {
    val env = ExecutionEnvironment.getExecutionEnvironment

    val ratings: DataSet[Rating] = PreProcessing.getRatings(env, ratingsPath)

    val trainingSet: DataSet[(Int, Int, Double)] =
      ratings
        .map(r => (r.userId, r.movieId, r.rating))
        .sortPartition(0, Order.ASCENDING)
        .first(ratings.count().toInt)

    val als = ALS()
      .setIterations(10)
      .setNumFactors(10)
      .setBlocks(150)
      .setTemporaryPath("/tmp/tmpALS")

    val parameters = ParameterMap()
      .add(ALS.Lambda, 0.01) // After some tests, this value seems to fit the problem
      .add(ALS.Seed, 42L)

    als.fit(trainingSet, parameters)
  }
}
"But if I just remove only one rating"
val trainingSet: DataSet[(Int, Int, Double)] =
  ratings
    .map(r => (r.userId, r.movieId, r.rating))
    .sortPartition(0, Order.ASCENDING)
    .first((ratings.count() - 1).toInt)
The error:
06/19/2015 15:00:24 CoGroup (CoGroup at org.apache.flink.ml.recommendation.ALS$.updateFactors(ALS.scala:570))(4/4) switched to FAILED
java.lang.ArrayIndexOutOfBoundsException: 5
at org.apache.flink.ml.recommendation.ALS$BlockRating.apply(ALS.scala:358)
at org.apache.flink.ml.recommendation.ALS$$anon$111.coGroup(ALS.scala:635)
at org.apache.flink.runtime.operators.CoGroupDriver.run(CoGroupDriver.java:152)
...
The problem is the first operator in combination with the setTemporaryPath parameter of Flink's ALS implementation. In order to understand the problem, let me quickly explain how the blocking ALS algorithm works.
The blocking implementation of alternating least squares first partitions the given ratings matrix user-wise and item-wise into blocks. For these blocks, routing information is calculated. This routing information says which user/item block receives which input from which item/user block, respectively. Afterwards, the ALS iteration is started.
Since Flink's underlying execution engine is a parallel streaming dataflow engine, it tries to execute as many parts of the dataflow as possible in a pipelined fashion. This requires all operators of the pipeline to be online at the same time. The advantage is that Flink avoids materializing intermediate results, which might be prohibitively large. The disadvantage is that the available memory has to be shared among all running operators. In the case of ALS, where the size of the individual DataSet elements (e.g. user/item blocks) is rather large, this is not desired.
In order to solve this problem, not all operators of the implementation are executed at the same time if you have set a temporaryPath. The path defines where the intermediate results can be stored. Thus, if you've defined a temporary path, ALS first calculates the routing information for the user blocks and writes it to disk, then it calculates the routing information for the item blocks and writes it to disk, and last but not least it starts the ALS iteration, for which it reads the routing information from the temporary path.
The calculations of the routing information for the user blocks and for the item blocks both depend on the given ratings data set. In your case, when you calculate the user routing information, the ratings data set is first read and the first operator is applied to it. The first operator returns n arbitrary elements from the underlying data set. The problem right now is that Flink does not store the result of this first operation for the calculation of the item routing information. Instead, when you start the calculation of the item routing information, Flink will re-execute the dataflow starting from its sources. This means that it reads the ratings data set from disk and applies the first operator on it again. In many cases this will give you a different set of ratings compared to the result of the first invocation of first. Therefore, the generated routing information is inconsistent and ALS fails.
You can circumvent the problem by materializing the result of the first operator and using this result as the input for the ALS algorithm. The object FlinkMLTools contains a method persist which takes a DataSet, writes it to the given path and then returns a new DataSet which reads the just written DataSet. This allows you to break up the resulting dataflow graph.
val firstTrainingSet: DataSet[(Int, Int, Double)] =
  ratings
    .map(r => (r.userId, r.movieId, r.rating))
    .first((ratings.count() - 1).toInt)

val trainingSet = FlinkMLTools.persist(firstTrainingSet, "/tmp/tmpALS/training")

val als = ALS()
  .setIterations(10)
  .setNumFactors(10)
  .setBlocks(150)
  .setTemporaryPath("/tmp/tmpALS/")

val parameters = ParameterMap()
  .add(ALS.Lambda, 0.01) // After some tests, this value seems to fit the problem
  .add(ALS.Seed, 42L)

als.fit(trainingSet, parameters)
Alternatively, you can try to leave the temporaryPath unset. Then all steps (routing information calculation and the ALS iteration) are executed in a pipelined fashion. This means that both the user and item routing information calculations use the same input data set, which results from the first operator.
The Flink community is currently working on keeping intermediate operator results in memory. This will make it possible to pin the result of the first operator so that it won't be calculated twice and, thus, won't give differing results due to its non-deterministic nature.