I have a problem running GraphX
val adjGraph= adjGraph_CC.vertices
.flatMap { case (id, (compID, adjSet)) => (mapMsgGen(id, compID, adjSet)) }
// mapMsgGen will generate a list of msgs each msg has the form K->V
.reduceByKey((fst, snd) =>mapMsgMerg(fst, snd)).collect
// mapMsgMerg will merge each two msgs passed to it
what I was expecting reduceByKey to do is to group the whole output of flatMap by the key (K) and process the list of values (Vs) for each Key (K) using the function provided.
what is happening is the each output of flatMap (using the function mapMsgGen) which is a list of K->V pairs (not the same K usually) is processed immediately using reduceByKey function mapMsgMerg and before the whole flatMap finish.
need some clarification please
I don't undestand what is going wrong or is it that I understand flatMap and reduceByKey wrong??
Regards,
Maher
There's no need to produce the entire output of flatMap before starting reduceByKey. In fact, if you're not using the intermediate output of flatMap it's better not to produce it and possibly save some memory.
If your flatMap outputs a list that contains 'k' -> v1 and 'k' -> v2 there's no reason to wait until the entire list has been produced to pass v1 and v2 to mapMsgMerge. As soon as those two tuples are output v1 and v2 can be combined as mapMsgMerge(v1, v2) and v1 and v2 discarded if the intermediate list isn't used.
I don't know the details of the Spark scheduler well enough to say if this is guaranteed behavior but it does seem like an instance of what the original paper calls 'pipelining' of operations.
Related
I am learning Spark and Scala and keep coming across this pattern:
val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
While I understand what it does, I don't understand why it is used instead of having something like:
val lines = sc.textFile("data.txt")
val counts = lines.reduceByValue((v1, v2) => v1 + v2)
Given that Spark is designed to process large amounts of data efficiently, it seems counter intuitive to always have to perform an additional step of converting a list into a map and then reducing by key, instead of simply being able to reduce by value?
First, this "additional step" doesn't really cost much (see more details at the end) - it doesn't shuffle the data, and it is performed together with other transformations: transformations can be "pipelined" as long as they don't change the partitioning.
Second - the API you suggest seems very specific for counting - although you suggest reduceByValue will take a binary operator f: (Int, Int) => Int, your suggested API assumes each value is mapped to the value 1 before applying this operator for all identical values - an assumption that is hardly useful in any scenario other than counting. Adding such specific APIs would just bloat the interface and is never going to cover all use cases anyway (what's next - RDD.wordCount?), so it's better to give users minimal building blocks (along with good documentation).
Lastly - if you're not happy with such low-level APIs, you can use Spark-SQL's DataFrame API to get some higer-level APIs that will hide these details - that's one of the reasons DataFrames exist:
val linesDF = sc.textFile("file.txt").toDF("line")
val wordsDF = linesDF.explode("line","word")((line: String) => line.split(" "))
val wordCountDF = wordsDF.groupBy("word").count()
EDIT: as requested - some more details about why the performance impact of this map operation is either small or entirely negligibile:
First, I'm assuming you are interested in producing the same result as the map -> reduceByKey code would produce (i.e. word count), which means somewhere the mapping from each record to the value 1 must take place, otherwise there's nothing to perform the summing function (v1, v2) => v1 + v2 on (that function takes Ints, they must be created somewhere).
To my understanding - you're just wondering why this has to happen as a separate map operation
So, we're actually interested in the overhead of adding another map operation
Consider these two functionally-identical Spark transformations:
val rdd: RDD[String] = ???
/*(1)*/ rdd.map(s => s.length * 2).collect()
/*(2)*/ rdd.map(s => s.length).map(_ * 2).collect()
Q: Which one is faster?
A: They perform the same
Why? Because as long as two consecutive transformations on an RDD do not change the partitioning (and that's the case in your original example too), Spark will group them together, and perform them within the same task. So, per record, the difference between these two will come down to the difference between:
/*(1)*/ s.length * 2
/*(2)*/ val r1 = s.length; r1 * 2
Which is negligible, especially when you're discussing distributed execution on large datasets, where execution time is dominated by things like shuffling, de/serialization and IO.
I am new to Spark (using 1.1 version) and Scala .. I am converting my existing Hadoop MapReduce code to spark MR using Scala and bit lost.
I want my mapped RDD to be grouped by Key .. When i read online it's suggested that we should avoid groupByKey and use reducedByKey instead.. but when I apply reduceBykey I am not getting list of values for given key as expected by my code =>Ex.
val rdd = sc.parallelize(List(("k1", "v11"), ("k1", "v21"), ("k2", "v21"), ("k2", "v22"), ("k3", "v31") ))
My "values" for actual task are huge, having 300 plus columns in key-values pair
And when I will do group by on common key it will result in shuffle which i want to avoid.
I want something like this as o/p (key, List OR Array of values) from my mapped RDD =>
rdd.groupByKey()
which gives me following Output
(k3,ArrayBuffer(v31))
(k2,ArrayBuffer(v21, v22))
(k1,ArrayBuffer(v11, v21))
But when I use
rdd.reduceByKey((x,y) => x+y)
I get values connected together like following- If pipe('|') or some other breakable character( (k2,v21|v22) ) would have been there my problem would have been little bit solved but still having list would be great for good coding practice
(k3,v31)
(k2,v21v22)
(k1,v11v21)
Please help
If you refer the spark documentation http://spark.apache.org/docs/latest/programming-guide.html
For groupByKey It says
“When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable) pairs.”
The Iterable keyword is very important over here, when you get the value as (v21, v22) it’s iterable.
Further it says
“Note: If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will yield much better performance.”
So from this what I understand is, if you want the return RDD to have iterable values use groupByKey if and if you want to have a single added up value like SUM then use reducebyKey.
Now in your tuple instead of having (String,String) => (K1,V1), if you had (String,ListBuffer(String)) => (K1,ListBuffer(“V1”)) then maybe you could have done rdd.reduceByKey( (x,y) = > x += y)
In Spark's RDDs and DStreams we have the 'reduce' function for transforming an entire RDD into one element. However the reduce function takes (T,T) => T
However if we want to reduce a List in Scala we can use foldLeft or foldRight which takes type (B)( (B,A) => B) This is very useful because you start folding with a type other then what is in your list.
Is there a way in Spark to do something similar? Where I can start with a value that is of different type then the elements in the RDD itself
Use aggregate instead of reduce. It allows you also to specify a "zero" value of type B and a function like the one you want: (B,A) => B. Do note that you also need to merge separate aggregations done on separate executors, so a (B, B) => B function is also required.
Alternatively, if you want this aggregation as a side effect, an option is to use an accumulator. In particular, the accumulable type allows for the result type to be of a different type than the accumulating type.
Also, if you even need to do the same with a key-value RDD, use aggregateByKey.
My starting point is an RDD[(key,value)] in Scala using Apache Spark. The RDD contains roughly 15 million tuples. Each key has roughly 50+-20 values.
Now I'd like to take one value (doesn't matter which one) for each key. My current approach is the following:
HashPartition the RDD by the key. (There is no significant skew)
Group the tuples by key resulting in RDD[(key, array of values)]]
Take the first of each value array
Basically looks like this:
...
candidates
.groupByKey()
.map(c => (c._1, c._2.head)
...
The grouping is the expensive part. It is still fast because there is no network shuffle and candidates is in memory but can I do it faster?
My idea was to work on the partitions directly, but I'm not sure what I get out of the HashPartition. If I take the first tuple of each partition, I will get every key but maybe multiple tuples for a single key depending on the number of partitions? Or will I miss keys?
Thank you!
How about reduceByKey with a function that returns the first argument? Like this:
candidates.reduceByKey((x, _) => x)
when reduceByKey operation is called, it is receiving list of values of a particular key. My question is:
are the list of values it receives in a sorted order?
is it possible to know how many values it receive?
i'm trying to calculate first quartile of the list of values of a key within reduceByKey. is this possible to do within reduceByKey?
.1. No, that would be totally going against the whole point of a reduce operation - i.e. to parallelalize an operation into an arbitrary tree of suboperations by taking advantage of associativity and commutativity.
.2. You'll need to define a new monoid by composing the integer monoid and whatever it is your doing. Let's assume your operation is op then
.
yourRdd.map(kv => (kv._1, (kv._2, 1)))
.reduceByKey((left, right) => (left._1 op right._1, left._2 + right._2))
will give you an RDD[(KeyType, (ReducedValueType, Int))] where the Int will be the number of values the reduce received for each key.
.3. You'll have to be more specific about what you mean by first quartile. Given that the answer to 1. is no, then you would have to have a bound that defines the first quartile then you won't need the data to be sorted because you could filter the values out by that bound.