groupByKey vs. aggregateByKey - where exactly does the difference come from? - scala

There is some scary language in the docs of groupByKey, warning that it can be "very expensive", and suggesting that aggregateByKey be used instead whenever possible.
I am wondering whether the difference in cost comes from the fact that, for some aggregations, the entire group never needs to be collected and loaded onto the same node, or whether there are other differences in implementation.
Basically, the question is whether rdd.groupByKey() would be equivalent to rdd.aggregateByKey(Nil)(_ :+ _, _ ++ _) or if it would still be more expensive.

If you are reducing each key to a single element rather than to a list (e.g. a word count), then aggregateByKey performs better, because values are combined map-side and far less data goes through the shuffle, as explained in the link on the performance of groupBy vs. aggregateBy.
But in your case you are merging into a list. With aggregateByKey, the values for a key are first merged into one list per partition, and those partial lists are then sent through the shuffle. This creates up to one list per key per partition, so the memory cost can be high.
With groupByKey, the merge happens only on the single node responsible for the key, so only one list per key is ever created.
When the goal really is a list of all values, groupByKey is therefore the better choice in terms of memory.
Also refer to this SO answer by zero323.
I am not sure about your use case, but if you can bound the number of elements in the final list, then aggregateByKey / combineByKey will give a much better result than groupByKey. For example, if you only want the top 10 values for a given key, you can achieve this efficiently with combineByKey and suitable merge and combiner functions, rather than with groupByKey followed by taking 10, as in the sketch below.
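A minimal sketch of that top-10 idea (assuming Int values, where "top 10" means the 10 largest per key):

import org.apache.spark.rdd.RDD

// Keep only the 10 largest values per key, so the combiner that gets
// shuffled for each key never grows beyond 10 elements.
def top10PerKey(rdd: RDD[(String, Int)]): RDD[(String, List[Int])] =
  rdd.combineByKey[List[Int]](
    (v: Int) => List(v),                                           // createCombiner
    (acc: List[Int], v: Int) => (v :: acc).sorted.takeRight(10),   // mergeValue (map-side, per partition)
    (a: List[Int], b: List[Int]) => (a ++ b).sorted.takeRight(10)  // mergeCombiners (after the shuffle)
  )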

Let me illustrate why the groupByKey operation leads to much higher cost.
By the semantics of this operation, each reduce task has to gather all of the values associated with a single key.
In short, have a look at its signature:
def groupByKey(): RDD[(K, Iterable[V])]
Because of the group-by semantics, the values associated with a key that sit on different nodes cannot be pre-merged, so a huge amount of data has to travel over the network, leading to heavy network I/O.
aggregateByKey is not the same. Let me clarify its signature:
def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
The Spark engine implements these semantics as follows:
within each partition it performs a pre-merge (the seqOp), which means that a given reducer only needs to fetch the pre-merged intermediate results from the shuffle map side.
This makes the network I/O significantly lighter.
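As a concrete (hypothetical) illustration of that pre-merge, counting occurrences per key both ways might look like this, assuming an existing SparkContext sc:

import org.apache.spark.SparkContext

def wordCounts(sc: SparkContext) = {
  val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 1, "a" -> 1))

  // groupByKey ships every individual value across the network, then sums:
  val viaGroup = pairs.groupByKey().mapValues(_.sum)

  // aggregateByKey pre-sums inside each partition (seqOp) before the shuffle,
  // so only one partial sum per key and partition crosses the network (combOp):
  val viaAggregate = pairs.aggregateByKey(0)(_ + _, _ + _)

  (viaGroup, viaAggregate)
}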

Related

apache spark - which one encounters less memory bottlenecks - reduceByKey or reduceByKeyLocally?

I have looked at the API and found the following documentation for both:
def reduceByKey(partitioner: Partitioner, func: (V, V) ⇒ V): RDD[(K, V)]
which merges the values for each key using an associative reduce function.
This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.
def reduceByKeyLocally(func: (V, V) ⇒ V): Map[K, V]
which merges the values for each key using an associative reduce function, but return the results immediately to the master as a Map.
This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.
I don't see much difference between the two except that reduceByKeyLocally returns the results back to the master as a map.
The difference is profound.
With reduceByKey, the pairs are represented as an RDD, which means the data remain distributed among the cluster. This is necessary when you are operating at scale.
With reduceByKeyLocally, all the partitions come back to the master to be merged into a single Map on that single machine. Much like the collect action, which brings everything back to the master as an Array, if you are operating at scale all that data will completely overwhelm a single machine and defeat the purpose of using a distributed data abstraction.
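A small sketch of the two return types (assuming an existing SparkContext sc):

import org.apache.spark.SparkContext

def compare(sc: SparkContext): Unit = {
  val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3))

  // Stays distributed: the result is another RDD, merged on the executors.
  val distributed: org.apache.spark.rdd.RDD[(String, Int)] = pairs.reduceByKey(_ + _)

  // Comes back to the driver: the result is a plain scala.collection.Map,
  // so all keys and values must fit in the driver's memory (much like collect()).
  val onDriver: scala.collection.Map[String, Int] = pairs.reduceByKeyLocally(_ + _)
}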

Find RDD[(T, U)] elements that have key in RDD[T]

Given
val as: RDD[(T, U)]
val bs: RDD[T]
I would like to filter as to find the elements whose keys are present in bs.
One approach is
val intermediateAndOtherwiseUnnecessaryPair = bs.map(b => b -> b)
intermediateAndOtherwiseUnnecessaryPair.join(as).values
But the mapping on bs is unfortunate. Is there a more direct method?
You can make the intermediate pairs cheaper by pairing each key with a constant dummy value:
val intermediateAndOtherwiseUnnecessaryPair = bs.map(b => (b, 1))
Also, co-partitioning both sides before joining helps a lot:
val intermediateAndOtherwiseUnnecessaryPair = bs.map(b => (b, 1)).partitionBy(new HashPartitioner(NUM_PARTITIONS))
intermediateAndOtherwiseUnnecessaryPair.join(as.partitionBy(new HashPartitioner(NUM_PARTITIONS))).mapValues(_._2)
Co-partitioned RDDs will not be shuffled at runtime, and thus you'll see a significant performance boost.
Broadcasting may not work if bs is too big (more precisely, if it has a large number of unique values); you may also want to increase spark.driver.maxResultSize.
The only two (or at least the only ones I am aware of) popular and generic ways to filter one RDD using a second RDD are:
1) join, which you are already doing. In this case I wouldn't worry too much about the unnecessary intermediate RDD; map() is a narrow transformation and won't introduce that much overhead. The join() itself will most probably be slow, though, as it's a wide transformation (it requires a shuffle).
2) collecting bs on the driver and turning it into a broadcast variable, which is then used inside as.filter():
val collected = sc.broadcast(bs.collect().toSet)
as.filter { case (key, _) => collected.value.contains(key) }
You need the broadcast because Spark doesn't support referencing one RDD from inside a transformation on another RDD (RDDs cannot be nested).

Spark: groupBy taking a lot of time

When taking performance numbers in my application, groupBy is eating up a lot of time.
My RDD has the following structure:
JavaPairRDD<CustomTuple, Map<String, Double>>
CustomTuple:
This object contains information about the current row in RDD like which week, month, city, etc.
public class CustomTuple implements Serializable{
private Map hierarchyMap = null;
private Map granularMap = null;
private String timePeriod = null;
private String sourceKey = null;
}
Map
This map contains the statistical data about that row like how much investment, how many GRPs, etc.
<"Inv", 20>
<"GRP", 30>
I was executing the DAG below on this RDD:
1. apply a filter on this RDD and scope out relevant rows: Filter
2. apply a filter on this RDD and scope out relevant rows: Filter
3. join the RDDs: Join
4. apply a map phase to compute investment: Map
5. apply a GroupBy phase to group the data according to the desired view: GroupBy
6. apply a map phase to aggregate the data as per the grouping achieved in the above step (say, view data across time periods) and also create the new objects for the desired result set: Map
7. collect the result: Collect
So if a user wants to view investment across time periods, then the list below is returned (this was achieved in step 4 above):
<timeperiod1, value>
When I checked the time taken by each operation, GroupBy was taking 90% of the time needed to execute the whole DAG.
IMO, we could replace the GroupBy and the subsequent Map operations by a single reduce.
But reduce would work on objects of type JavaPairRDD<CustomTuple, Map<String, Double>>.
So my reduce would be like T reduce(T, T, T), where T would be (CustomTuple, Map<String, Double>).
Or maybe after step 3 in the above DAG I could run another map function that returns an RDD keyed on the metric that needs to be aggregated, and then run a reduce.
Also, I am not sure how the aggregate function works and whether it would be able to help me in this case.
Secondly, my application will receive requests on varying keys. With my current RDD design, each request requires me to repartition or re-group my RDD on this key. This means that for each request, grouping/re-partitioning takes 95% of the time to compute the job.
<"market1", 20>
<"market2", 30>
This is very discouraging, as the current performance of the application without Spark is 10 times better than the performance with Spark.
Any insight is appreciated.
[EDIT] We also noticed that the JOIN was taking a lot of time. Maybe that's why groupBy was taking time. [/EDIT]
TIA!
Spark's documentation encourages you to avoid groupBy operations; instead it suggests combineByKey or one of its derived operations (reduceByKey or aggregateByKey). These operations aggregate both before and after the shuffle (in the Map and in the Reduce phase, if we use Hadoop terminology), so your execution times should improve (I don't know whether it will be 10 times better, but it should be better).
If I understand your processing correctly, I think you can use a single combineByKey operation. The following explanation is written for Scala, but you can translate it to Java without too much effort.
combineByKey takes three arguments:
combineByKey[C](createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C): RDD[(K, C)]
createCombiner: In this operation you create a new class in which to combine your data, so you could aggregate your CustomTuple data into a new class, say CustomTupleCombiner (I don't know whether you only want a sum or whether you want to apply some other process to this data, but either can be done in this operation).
mergeValue: In this operation you describe how a CustomTuple is added into a CustomTupleCombiner (again, I am assuming a simple summing operation). For example, if you want to sum the data by key, your CustomTupleCombiner class will hold a Map, and the operation would be something like CustomTupleCombiner.sum(CustomTuple), which updates CustomTupleCombiner.Map(key) to CustomTupleCombiner.Map(key) + CustomTuple.Map(key).
mergeCombiners: In this operation you define how to merge two combiner classes, CustomTupleCombiner in my example. So this will be something like CustomTupleCombiner1.merge(CustomTupleCombiner2), roughly CustomTupleCombiner1.Map.keys.foreach(k -> CustomTupleCombiner1.Map(k) + CustomTupleCombiner2.Map(k)), or something like that.
The pasted code is untested (it may not even compile, since I wrote it in vim), but I think it might work for your scenario.
I hope this is useful.
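As a minimal Scala sketch of these three functions (assuming the goal is simply to sum the Map<String, Double> metrics per key; the key type is left generic and a plain Map stands in for a dedicated combiner class):

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Sum the per-row metric maps for each key with a single combineByKey, so
// partial sums are built map-side before the shuffle instead of grouping
// whole rows and aggregating afterwards.
def aggregateMetrics[K: ClassTag](
    rdd: RDD[(K, Map[String, Double])]): RDD[(K, Map[String, Double])] = {

  def merge(a: Map[String, Double], b: Map[String, Double]): Map[String, Double] =
    b.foldLeft(a) { case (acc, (metric, value)) =>
      acc.updated(metric, acc.getOrElse(metric, 0.0) + value)
    }

  rdd.combineByKey[Map[String, Double]](
    (m: Map[String, Double]) => m, // createCombiner: the first map seeds the combiner
    merge,                         // mergeValue: fold a row into the partition-local combiner
    merge                          // mergeCombiners: merge partial results after the shuffle
  )
}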
Shuffling is triggered by any change in the key of a [K,V] pair, or by a repartition() call. Partitioning is calculated based on the K (key) value; by default it uses the hash of the key, as implemented by the hashCode() method. In your case the key contains two Map instance variables. The default hashCode() implementation has to compute the hashCode() of those maps as well, which means iterating over all of their elements and in turn computing the hashCode() of each of those elements.
The solutions are:
Do not include the Map instances in your Key. This seems highly unusual.
Implement and override your own hashCode() that avoids going through the Map instance variables, as sketched after this list.
Possibly you can avoid using the Map objects completely. If it is something that is shared amongst multiple elements, you might need to consider using broadcast variables in spark. The overhead of serializing your Maps during shuffling might also be a big contributing factor.
Avoid any shuffling by tuning your hashing between two consecutive group-bys.
Keep shuffling node-local by choosing a Partitioner that has an affinity for keeping partitions local across consecutive uses.
Good reading on hashCode(), including a reference to quotes by Josh Bloch, can be found on the wiki.
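A hedged Scala sketch of the "override your own hashCode()" option (the Java class in the question could do the same thing; the field choice is an assumption, namely that timePeriod and sourceKey characterize a row well enough for partitioning):

import java.util.Objects

// Hash only the cheap scalar fields, so partitioning never has to walk the
// two maps. equals (not shown in the question) may still compare all fields;
// the hashCode/equals contract only requires that equal objects share a hash.
class CustomTuple(
    val hierarchyMap: Map[String, String],
    val granularMap: Map[String, String],
    val timePeriod: String,
    val sourceKey: String) extends Serializable {

  override def hashCode(): Int = Objects.hash(timePeriod, sourceKey)
}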

FP or typelevel tools to groupBy on deep leaf in nested data?

I have a deeply nested data structure:
Seq[Seq[(String, Seq[(String, Seq[(String, Try[Boolean])])], Long)]]
Is there a nice functional way to groupBy on Try.isFailure?
With Shapeless it is possible to search arbitrarily nested data structures, as can be seen here. But finding is only one part of my problem. I looked at zippers and lenses; they are nice, but AFAIK they are not the right tool here.
For info, the data represents the results of some test code. The layers are:
permutations of configurations => tested component => mutation on data => testing code. The Strings are descriptions, and the Long is the time it took each component test to finish.
I want to create two lists: one with all the failures, keeping all the information about where and when they happened and keeping the exceptions as info, and a corresponding one for the successes.
Is there a solution out there already?
Note: the most sensible approach for this particular case would be to redesign my test code so that two lists, a failure list and a success list, are created from the start. But still, I'd like to know. This kind of problem doesn't seem uncommon.
It may not be the most creative solution, but you could partition the outermost Seq as follows:
val partitioned = seq.partition { s =>
  val flat = s.map(_._2).flatten.map(_._2).flatten
  flat.find(tup => tup._2.isFailure).isDefined
}
In this example, the first line in the partition body flattens out the nested structure, so you are left with the innermost Seq of (String, Try[Boolean]) pairs. The predicate passed to partition then checks whether that innermost Seq contains at least one Failure. What you are left with is a tuple where the first Seq holds the outermost items that have failures somewhere in their nested structures, and the second holds the ones where no failures occurred.
This is probably not the best-performing solution, but it's succinct as far as lines of code are concerned. In fact, you could even do it in one line, as follows:
val partitioned = seq.partition(_.map(_._2).flatten.map(_._2).flatten.find(_._2.isFailure).isDefined)

Reason for Scala's Map.unzip returning (Iterable, Iterable)

the other day I was wondering why scala.collection.Map defines its unzip method as
def unzip [A1, A2] (implicit asPair: ((A, B)) ⇒ (A1, A2)): (Iterable[A1], Iterable[A2])
Since the method returns "only" a pair of Iterables instead of a pair of Seqs, it is not guaranteed that the key/value pairs of the original map occur at matching indices in the returned collections, since Iterable doesn't guarantee the order of traversal. So if I had a
Map((1,A), (2,B))
, then after calling
Map((1,A), (2,B)) unzip
I might end up with
... = (List(1, 2),List(A, B))
just as well as with
... = (List(2, 1),List(B, A))
While I can imagine storage-related reasons behind this (think of HashMaps, for example), I wonder what you think about this behavior. It might appear to users of the Map.unzip method that the items are returned in matching pair order (and I bet this is almost always the case), yet since there is no guarantee, this might in turn yield hard-to-find bugs in the library user's code.
Maybe that behavior should be expressed more explicitly in the accompanying scaladoc?
EDIT: Please note that I'm not referring to maps as ordered collections. I'm only interested in "matching" sequences after unzip, i.e. for
val (keys, values) = someMap.unzip
it holds for all i that (keys(i), values(i)) is an element of the original mapping.
Actually, the examples you gave will not occur. The Map will always be unzipped in a pair-wise fashion. Your statement that Iterable does not guarantee the ordering is not entirely true. It is more accurate to say that any given Iterable does not have to guarantee the ordering, but this depends on the implementation. In the case of Map.unzip, the ordering of the pairs is not guaranteed, but the items within the pairs will not change the way they are matched up; that matching is a fundamental property of the Map. You can read the source of GenericTraversableTemplate to verify this is the case.
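A quick sanity check (not a proof) of that pair-wise matching:

val m = Map(1 -> "A", 2 -> "B", 3 -> "C")
val (keys, values) = m.unzip

// Whatever order the map happens to iterate in, the i-th key still lines up
// with the i-th value of the same original pair:
assert(keys.zip(values).forall { case (k, v) => m(k) == v })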
If you expand unzip's description, you'll get the answer:
definition classes: GenericTraversableTemplate
In other words, it didn't get specialized for Map.
Your argument is sound, though, and I daresay you might get your wish if you open an enhancement ticket with your reasoning. Especially if you go ahead and produce a patch as well - if nothing else, at least you'll learn a lot more about Scala collections by doing so.
Maps, generally, do not have a natural sequence: they are unordered collections. The fact your keys happen to have a natural order does not change the general case.
(However I am at a loss to explain why Map has a zipWithIndex method. This provides a counter-argument to my point. I guess it is there for consistency with other collections and that, although it provides indices, they are not guaranteed to be the same on subsequent calls.)
If you use a LinkedHashMap or LinkedHashSet, the iterators are supposed to return the pairs in the original insertion order. With other HashMaps, yeah, you have no control. Retaining the original insertion order is quite useful in UI contexts; it allows you to re-sort tables on any column in a web application without changing types, for instance.
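A small illustration of the LinkedHashMap behavior:

import scala.collection.mutable.LinkedHashMap

val m = LinkedHashMap("c" -> 3, "a" -> 1, "b" -> 2)

// Iteration follows insertion order, so unzip does too:
val (ks, vs) = m.unzip
println(ks.toList) // List(c, a, b)
println(vs.toList) // List(3, 1, 2)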