Given
val as: RDD[(T, U)]
val bs: RDD[T]
I would like to filter as to find the elements whose keys are present in bs.
One approach is
val intermediateAndOtherwiseUnnessaryPair = bs.map(b => b -> b)
intermediateAndOtherwiseUnnessaryPair.join(as).values
But the mapping on bs is unfortunate. Is there a more direct method?
You can make the mapping cheaper by pairing each key with a small dummy value:
val intermediateAndOtherwiseUnnessaryPair = bs.map(b => (b, 1))
Also, co-partitioning before joining helps a lot:
val intermediateAndOtherwiseUnnessaryPair = bs.map(b => (b, 1)).partitionBy(new HashPartitioner(NUM_PARTITIONS))
as.partitionBy(new HashPartitioner(NUM_PARTITIONS)).join(intermediateAndOtherwiseUnnessaryPair).mapValues(_._1)
Co-partitioned RDDs are not shuffled again by the join, so you should see a significant performance boost. Note that partitionBy itself shuffles once, so this pays off most when the partitioned RDDs are reused.
Broadcasting may not work if bs is too big (more precisely, if it has a large number of unique values); in that case you may also want to increase spark.driver.maxResultSize.
The only two (or at least the only ones I am aware of) popular and generic ways to filter one RDD using a second RDD are:
1) join, which you are already doing. In this case I wouldn't worry much about the unnecessary intermediate RDD: map() is a narrow transformation and won't introduce much overhead. The join() itself will most probably be slow, though, as it's a wide transformation (it requires a shuffle).
2) collecting bs on the driver and making it a broadcast variable, which is then used in as.filter():
val collected = sc.broadcast(bs.collect().toSet)
as.filter { case (key, _) => collected.value.contains(key) }
You need to do this as Spark doesn't support nesting RDDs inside methods called on RDD.
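To make the two approaches concrete, here is a minimal sketch that runs both on a local Spark session and returns the results side by side. The object name, the sample data and the local[2] master are assumptions made for illustration only; they do not come from the question.

```scala
import org.apache.spark.sql.SparkSession

object FilterByKeysSketch {
  // Returns (result via join, result via broadcast), both sorted for comparison.
  def run(): (List[(String, Int)], List[(String, Int)]) = {
    val spark = SparkSession.builder().master("local[2]").appName("filter-by-keys").getOrCreate()
    val sc = spark.sparkContext
    try {
      // Hypothetical sample data standing in for `as` and `bs`.
      val as = sc.parallelize(Seq("a" -> 1, "b" -> 2, "c" -> 3, "a" -> 4))
      val bs = sc.parallelize(Seq("a", "c", "z"))

      // Approach 1: pair each key with a dummy value, then join (wide, shuffles).
      // distinct() keeps duplicate keys in bs from duplicating rows of `as`.
      val viaJoin = as.join(bs.distinct().map(b => (b, 1))).mapValues(_._1)

      // Approach 2: collect the keys on the driver, broadcast them, then filter.
      val keys = sc.broadcast(bs.collect().toSet)
      val viaBroadcast = as.filter { case (k, _) => keys.value.contains(k) }

      (viaJoin.collect().sorted.toList, viaBroadcast.collect().sorted.toList)
    } finally spark.stop()
  }
}
```

Both versions should return the same rows; the broadcast version avoids shuffling as, but only works while the key set fits comfortably in memory on the driver and on every executor.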
There is some scary language in the docs of groupByKey, warning that it can be "very expensive", and suggesting to use aggregateByKey instead whenever possible.
I am wondering whether the difference in cost comes from the fact that, for some aggregations, the entire group never needs to be collected and loaded onto the same node, or whether there are other differences in implementation.
Basically, the question is whether rdd.groupByKey() would be equivalent to rdd.aggregateByKey(List.empty[V])(_ :+ _, _ ++ _) or whether it would still be more expensive.
If you are reducing to a single element instead of a list (for example, word count), then aggregateByKey performs better because it combines values within each partition before the shuffle, so far less data is shuffled, as explained in the link performance of group by vs aggregate by.
But in your case you are merging into a list. With aggregateByKey, all the values for a key in a partition are first reduced to a single list, and those lists are then sent through the shuffle. This creates as many lists per key as there are partitions, so memory use will be high.
With groupByKey, the merge happens only on the node responsible for the key, so only one list is created per key.
When merging into a list, groupByKey is therefore optimal in terms of memory.
Also Refer: SO Answer by zero323
I am not sure about your use case, but if you can limit the number of elements in the final list, then aggregateByKey / combineByKey will certainly give much better results than groupByKey. For example, if you want only the top 10 values for a given key, you can achieve this more efficiently with combineByKey and suitable merge and combiner functions than with groupByKey followed by take(10).
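The top-N idea can be sketched as follows. This version uses aggregateByKey (which is built on the same combiner machinery as combineByKey) and keeps at most n values per key in every partition, so only bounded lists are shuffled. The object name, the helper topN and the sample data are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object TopNPerKeySketch {
  // Keep only the n largest values, in descending order.
  def topN(n: Int)(values: List[Int]): List[Int] =
    values.sorted(Ordering[Int].reverse).take(n)

  def run(n: Int): Map[String, List[Int]] = {
    val spark = SparkSession.builder().master("local[2]").appName("top-n").getOrCreate()
    try {
      // Hypothetical sample data spread over two partitions.
      val pairs = spark.sparkContext.parallelize(
        Seq("a" -> 5, "a" -> 1, "a" -> 9, "b" -> 2, "a" -> 7, "b" -> 8), numSlices = 2)

      pairs.aggregateByKey(List.empty[Int])(
        (acc, v) => topN(n)(v :: acc), // within a partition: never hold more than n values
        (l, r) => topN(n)(l ::: r)     // across partitions: merge two bounded lists
      ).collect().toMap
    } finally spark.stop()
  }
}
```

With groupByKey every value for a key would be shuffled and held in memory before take(n) could discard most of them; here each partition contributes at most n values per key.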
Let me illustrate why the groupByKey operation is so much more expensive.
By the semantics of this operation, the reduce task has to gather all the values associated with a single unique key.
Let us have a look at its signature:
def groupByKey(): RDD[(K, Iterable[V])]
Because of these "group by" semantics, the values associated with a key that are spread across different nodes cannot be pre-merged. A huge amount of data is transferred over the network, leading to a high network I/O load.
aggregateByKey is different. Here is its signature:
def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
The Spark engine implements these semantics as follows:
Values are pre-merged within each partition, which means that a given reducer only needs to fetch the pre-merged intermediate results from the shuffle map output.
This makes the network I/O significantly lighter.
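As a small illustration of the difference, the sketch below computes a (sum, count) pair per key both ways. The results are identical; what differs is that groupByKey ships every value across the network, while aggregateByKey ships one partial (sum, count) per key and partition. The object name and sample data are assumptions for the example.

```scala
import org.apache.spark.sql.SparkSession

object AggregateVsGroupSketch {
  def run(): (Map[String, (Int, Int)], Map[String, (Int, Int)]) = {
    val spark = SparkSession.builder().master("local[2]").appName("agg-vs-group").getOrCreate()
    try {
      // Hypothetical sample data spread over two partitions.
      val pairs = spark.sparkContext.parallelize(
        Seq("x" -> 1, "y" -> 10, "x" -> 2, "x" -> 3, "y" -> 20), numSlices = 2)

      // groupByKey: all values are shuffled, then summed and counted on the reducer.
      val viaGroup = pairs.groupByKey().mapValues(vs => (vs.sum, vs.size))

      // aggregateByKey: (sum, count) is built per partition, only partials are shuffled.
      val viaAggregate = pairs.aggregateByKey((0, 0))(
        (acc, v) => (acc._1 + v, acc._2 + 1),  // seqOp: fold one value into the partial
        (a, b) => (a._1 + b._1, a._2 + b._2))  // combOp: merge two partials

      (viaGroup.collect().toMap, viaAggregate.collect().toMap)
    } finally spark.stop()
  }
}
```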
I have looked at the API and found the following documentation for both:
def reduceByKey(partitioner: Partitioner, func: (V, V) ⇒ V): RDD[(K, V)]
which merges the values for each key using an associative reduce function.
This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.
def reduceByKeyLocally(func: (V, V) ⇒ V): Map[K, V]
which merges the values for each key using an associative reduce function, but return the results immediately to the master as a Map.
This will also perform the merging locally on each mapper before sending results to a reducer, similarly to a "combiner" in MapReduce.
I don't see much difference between the two except that reduceByKeyLocally returns the results back to the master as a map.
The difference is profound.
With reduceByKey, the pairs are represented as an RDD, which means the data remain distributed among the cluster. This is necessary when you are operating at scale.
With reduceByKeyLocally, all the partitions come back to the master to be merged into a single Map on that single machine. Similar to the collect action, which brings everything back to the master as an Array, if you are operating at scale, all those data will overwhelm a single machine completely and defeat the purpose of using a distributed data abstraction.
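A minimal sketch of the two calls side by side, assuming a local session and a tiny hypothetical word list. The point to notice is the type of each result: one is still an RDD, the other is already a Map living on the driver.

```scala
import org.apache.spark.sql.SparkSession

object ReduceLocallySketch {
  def run(): (Map[String, Int], Map[String, Int]) = {
    val spark = SparkSession.builder().master("local[2]").appName("reduce-locally").getOrCreate()
    try {
      // Hypothetical sample data.
      val words = spark.sparkContext.parallelize(Seq("a", "b", "a", "c", "a")).map(w => (w, 1))

      // reduceByKey: the result is an RDD[(String, Int)] and stays distributed
      // until an action such as collect() is called.
      val distributed = words.reduceByKey(_ + _)

      // reduceByKeyLocally: an action; the merged Map is returned to the driver.
      val onDriver = words.reduceByKeyLocally(_ + _)

      (distributed.collect().toMap, onDriver.toMap)
    } finally spark.stop()
  }
}
```

On data this small the two are interchangeable; at scale only the first keeps the result off the driver.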
I'm currently developing a spark application that involves a series of complex joins between tuples, building out an N sized tuple based on the target data set.
I come from a Java world and have been resolving specific fields from each element of a tuple into a new object. I know there has to be a better way to do this functionally.
Example.
val obj1Obj2: RDD[(Int, (Object1, Object2))] = object1.join(object2)
val obj3Resolve = obj1Obj2
  .map(a => a match { case (k, v) => v })
  .map(a => (a._2.key, new Object3(a._2.key, a._1.foo, a._2.bar)))
What I would like to do is to have a generic trait that I extend for each particular target object, taking in an arbitrary tuple and returning an arbitrary tuple. I've found that the joins themselves are relatively straightforward, it's the intermediate object declarations that bloat the code, as well as the restructuring of the tuples to rekey them for a future join, and I feel this is too "java" like.
Any advice is much appreciated; I'm developing with spark so some Scala solutions may not apply.
Thanks!
In my application, when taking performance numbers, groupBy is eating up a lot of time.
My RDD has the structure below:
JavaPairRDD<CustomTuple, Map<String, Double>>
CustomTuple:
This object contains information about the current row in RDD like which week, month, city, etc.
public class CustomTuple implements Serializable{
private Map hierarchyMap = null;
private Map granularMap = null;
private String timePeriod = null;
private String sourceKey = null;
}
Map
This map contains the statistical data about that row like how much investment, how many GRPs, etc.
<"Inv", 20>
<"GRP", 30>
I was executing below DAG on this RDD
1. apply filter on this RDD and scope out relevant rows: Filter
2. apply filter on this RDD and scope out relevant rows: Filter
3. join the RDDs: Join
4. apply a map phase to compute investment: Map
5. apply a GroupBy phase to group the data according to the desired view: GroupBy
6. apply a map phase to aggregate the data as per the grouping achieved in the above step (say, view data across time periods) and also create new objects based on the result set to be collected: Map
7. collect the result: Collect
So if user wants to view investment across time periods then below List is returned (this was achieved in step 4 above):
<timeperiod1, value>
When I checked time taken in operations, GroupBy was taking 90% of the time taken in executing the whole DAG.
IMO, we can replace the GroupBy and the subsequent Map operation with a single reduce.
But reduce will work on objects of type JavaPairRDD<CustomTuple, Map<String, Double>>.
So my reduce will be like T reduce(T,T,T) where T will be CustomTuple, Map.
Or maybe after step 3 in above DAG I run another map function that returns me an RDD of type for the metric that needs to be aggregated and then run a reduce.
Also, I am not sure how the aggregate function works and whether it would help me in this case.
Secondly, my application will receive requests on varying keys. In my current RDD design, each request would require me to repartition or regroup my RDD on that key. This means that for each request, grouping/repartitioning would take 95% of the time needed to compute the job.
<"market1", 20>
<"market2", 30>
This is very discouraging as the current performance of application without Spark is 10 times better than performance with Spark.
Any insight is appreciated.
[EDIT]We also noticed that JOIN was taking a lot of time. Maybe that's why groupby was taking time.[EDIT]
TIA!
The Spark documentation encourages you to avoid groupBy operations; it suggests combineByKey or one of its derived operations (reduceByKey or aggregateByKey) instead. These operations aggregate both before and after the shuffle (in the Map and in the Reduce phase, in Hadoop terminology), so your execution times will improve (I don't know if it will be 10 times better, but it has to be better).
If I understand your processing correctly, I think you can use a single combineByKey operation. The following explanation is written for Scala code, but you can translate it to Java without too much effort.
combineByKey has three arguments:
combineByKey[C](createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C): RDD[(K, C)]
createCombiner: here you create a new object in which to combine your data, so you could aggregate your CustomTuple data into a new class CustomTupleCombiner (I don't know whether you only want a sum or want to apply some other processing to this data, but either can be done in this operation).
mergeValue: here you describe how a CustomTuple is added to a CustomTupleCombiner (again, I am assuming a simple sum). For example, if you want to sum the data by key, your CustomTupleCombiner class will hold a Map, so the operation would be something like CustomTupleCombiner.sum(CustomTuple), which sets CustomTupleCombiner.Map(key) to CustomTupleCombiner.Map(key) + CustomTuple.Map(key).
mergeCombiners: here you define how to merge two combiner objects, CustomTupleCombiner in my example. This would be something like CustomTupleCombiner1.merge(CustomTupleCombiner2), implemented along the lines of CustomTupleCombiner1.Map.keys.foreach(k -> CustomTupleCombiner1.Map(k) + CustomTupleCombiner2.Map(k)).
The pasted code is not tested (it will not even compile, because I wrote it in vim), but I think the approach might work for your scenario.
I hope this is useful.
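For readers who want something that compiles, here is a minimal Scala sketch of the three combineByKey functions applied to metric maps like the <"Inv", 20> / <"GRP", 30> rows in the question. It is an assumption-laden simplification: the key is a plain String (a time period) rather than CustomTuple, the combiner is simply another Map[String, Double] rather than a CustomTupleCombiner class, and the aggregation is a per-metric sum.

```scala
import org.apache.spark.sql.SparkSession

object CombineMetricsSketch {
  type Metrics = Map[String, Double]

  // Sum two metric maps key by key; metrics missing on one side count as 0.
  def merge(a: Metrics, b: Metrics): Metrics =
    b.foldLeft(a) { case (acc, (k, v)) => acc.updated(k, acc.getOrElse(k, 0.0) + v) }

  def run(): Map[String, Metrics] = {
    val spark = SparkSession.builder().master("local[2]").appName("combine-metrics").getOrCreate()
    try {
      // Hypothetical rows: (time period, metrics for one row).
      val rows = spark.sparkContext.parallelize(Seq(
        "2015-W01" -> Map("Inv" -> 20.0, "GRP" -> 30.0),
        "2015-W01" -> Map("Inv" -> 5.0),
        "2015-W02" -> Map("GRP" -> 12.0)), numSlices = 2)

      rows.combineByKey[Metrics](
        (v: Metrics) => v,                          // createCombiner: first value seen for a key
        (c: Metrics, v: Metrics) => merge(c, v),    // mergeValue: fold a row into the combiner
        (c1: Metrics, c2: Metrics) => merge(c1, c2) // mergeCombiners: merge partition results
      ).collect().toMap
    } finally spark.stop()
  }
}
```

Because the combiner here has the same type as the values, rows.reduceByKey(merge) would do the same job; combineByKey becomes necessary once the combiner type differs from the value type.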
Shuffling is triggered by any change in the key of a [K,V] pair, or by a repartition() call. The partitioning is calculated based on the K (key) value. By default, partitioning uses the hash value of your key, as implemented by its hashCode() method. In your case, your key contains two Map instance variables. The default implementation of hashCode() has to calculate the hashCode() of those maps as well, which iterates over all their elements to calculate the hashCode() of each element in turn.
The solutions are:
Do not include the Map instances in your key; having them there seems highly unusual.
Implement and override your own hashCode() that avoids going through the Map Instance variables.
Possibly you can avoid using the Map objects completely. If it is something that is shared amongst multiple elements, you might need to consider using broadcast variables in spark. The overhead of serializing your Maps during shuffling might also be a big contributing factor.
Avoid any shuffling, by tuning your hashing between two consecutive group-by's.
Keep shuffling Node local, by choosing a Partitioner that will have an affinity of keeping partitions local during consecutive use.
Good reading on hashCode(), including a reference to quotes by Josh Bloch can be found in wiki.
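The second suggestion (a hashCode() that skips the maps) might look like the sketch below. CustomKey is a hypothetical Scala counterpart of the question's CustomTuple; the map value types are assumed to be String because the original declares raw Maps. The important constraint is that equals stays consistent with hashCode: keys that are equal must hash equally, while unequal keys are merely allowed to collide.

```scala
// Hypothetical Scala counterpart of the CustomTuple key from the question.
final class CustomKey(
    val timePeriod: String,
    val sourceKey: String,
    val hierarchyMap: Map[String, String],
    val granularMap: Map[String, String]) extends Serializable {

  // Hash only the two cheap String fields; the maps are deliberately skipped,
  // so hashing a key no longer iterates over every map entry.
  override def hashCode(): Int = 31 * timePeriod.hashCode + sourceKey.hashCode

  // Full comparison, including the maps. Two equal keys always share
  // timePeriod and sourceKey, so they always share a hash code.
  override def equals(other: Any): Boolean = other match {
    case that: CustomKey =>
      timePeriod == that.timePeriod && sourceKey == that.sourceKey &&
        hierarchyMap == that.hierarchyMap && granularMap == that.granularMap
    case _ => false
  }
}
```

The trade-off is more hash collisions between keys that differ only in their maps, so this helps only when timePeriod and sourceKey already discriminate well.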
I am using Spark with Scala. I wanted to know whether a single one-line command is better than separate commands. What are the benefits, if any? Is it more efficient in terms of speed? Why?
for e.g.
var d = data.filter(_(1)==user).map(f => (f(2),f(5).toInt)).groupByKey().map(f=> (f._1,f._2.count(x=>true), f._2.sum))
against
var a = data.filter(_(1)==user)
var b = a.map(f => (f(2),f(5).toInt))
var c = b.groupByKey()
var d = c.map(f=> (f._1,f._2.count(x=>true), f._2.sum))
There is no performance difference between your two examples; the decision to chain RDD transformations or to explicitly represent the intermediate RDDs is just a matter of style. Spark's lazy evaluation means that no actual distributed computation will be performed until you invoke an RDD action like take() or count().
During execution, Spark will pipeline as many transformations as possible. For your example, Spark won't materialize the entire filtered dataset before it maps it: the filter() and map() transformations will be pipelined together and executed in a single stage. The groupByKey() transformation (usually) needs to shuffle data over the network, so it's executed in a separate stage. Spark would materialize the output of filter() only if it had been cache()d.
You might need to use the second style if you want to cache an intermediate RDD and perform further processing on it. For example, if I wanted to perform multiple actions on the output of the groupByKey() transformation, I would write something like
val grouped = data.filter(_(1)==user)
  .map(f => (f(2),f(5).toInt))
  .groupByKey()
  .cache()
val mapped = grouped.map(f=> (f._1,f._2.count(x=>true), f._2.sum))
val counted = grouped.count()
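The lazy evaluation described in this answer can be observed directly with an accumulator. In this sketch (object name, data and accumulator name are all made up for the example), the map function increments a counter; the counter stays at zero after the transformations are declared and only moves once an action runs.

```scala
import org.apache.spark.sql.SparkSession

object LazyEvaluationSketch {
  // Returns the accumulator value (before the action, after the action).
  def run(): (Long, Long) = {
    val spark = SparkSession.builder().master("local[2]").appName("lazy-eval").getOrCreate()
    val sc = spark.sparkContext
    try {
      val seen = sc.longAccumulator("rows seen by map")

      // Only declares the pipeline; nothing is executed yet.
      val mapped = sc.parallelize(1 to 100)
        .filter(_ % 2 == 0)
        .map { x => seen.add(1); x * 2 }

      val before = seen.sum // still 0: no action has run
      mapped.count()        // the action triggers filter and map in one stage
      val after = seen.sum  // one increment per element that passed the filter

      (before, after)
    } finally spark.stop()
  }
}
```

Whether the pipeline is written as one chained expression or as separate vals makes no difference to these numbers.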
There is no difference in terms of execution, but you might want to consider the readability of your code. I would go with your first example but like this:
var d = data.filter(_(1)==user)
  .map(f => (f(2),f(5).toInt))
  .groupByKey()
  .map(f=> (f._1,f._2.count(x=>true), f._2.sum))
Really this is more of a Scala question than Spark though. Still, as you can see from Spark's implementation of word count as shown in their documentation
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
you don't need to worry about those kinds of things. The Scala language (through laziness, etc.) and Spark's RDD implementation handle all that at a higher level of abstraction.
If you find really bad performance, then you should take the time to explore why. As Knuth said, "premature optimization is the root of all evil."