Spark: groupBy taking lot of time - aggregate

In my application when taking perfromance numbers, groupby is eating away lot of time.
My RDD is of below strcuture:
JavaPairRDD<CustomTuple, Map<String, Double>>
CustomTuple:
This object contains information about the current row in RDD like which week, month, city, etc.
public class CustomTuple implements Serializable{
private Map hierarchyMap = null;
private Map granularMap = null;
private String timePeriod = null;
private String sourceKey = null;
}
Map
This map contains the statistical data about that row like how much investment, how many GRPs, etc.
<"Inv", 20>
<"GRP", 30>
I was executing below DAG on this RDD
apply filter on this RDD and scope out relevant rows : Filter
apply filter on this RDD and scope out relevant rows : Filter
Join the RDDs: Join
apply map phase to compute investment: Map
apply GroupBy phase to group the data according to the desired view: GroupBy
apply a map phase to aggregate the data as per the grouping achieved in above step (say view data across timeperiod) and also create new objects based on the resultset desired to be collected: Map
collect the result: Collect
So if user wants to view investment across time periods then below List is returned (this was achieved in step 4 above):
<timeperiod1, value>
When I checked time taken in operations, GroupBy was taking 90% of the time taken in executing the whole DAG.
IMO, we can replace GroupBy and subsequent Map operations by a sing reduce.
But reduce will work on object of type JavaPairRDD>.
So my reduce will be like T reduce(T,T,T) where T will be CustomTuple, Map.
Or maybe after step 3 in above DAG I run another map function that returns me an RDD of type for the metric that needs to be aggregated and then run a reduce.
Also, I am not sure how aggregate function works and will it be able to help me in this case.
Secondly, my application will receive request on varying keys. In my current RDD design each request would require me to repartition or re-group my RDD on this key. This means for each request grouping/re-partitioning would take 95% of my time to compute the job.
<"market1", 20>
<"market2", 30>
This is very discouraging as the current performance of application without Spark is 10 times better than performance with Spark.
Any insight is appreciated.
[EDIT]We also noticed that JOIN was taking a lot of time. Maybe thats why groupby was taking time.[EDIT]
TIA!

The Spark's documentation encourages you to avoid operations groupBy operations instead they suggest combineByKey or some of its derivated operation (reduceByKey or aggregateByKey). You have to use this operation in order to make an aggregation before and after the shuffle (in the Map's and in the Reduce's phase if we use Hadoop terminology) so your execution times will improve (i don't kwown if it will be 10 times better but it has to be better)
If i understand your processing i think that you can use a single combineByKey operation The following code's explanation is made for a scala code but you can translate to Java code without too many effort.
combineByKey have three arguments:
combineByKey[C](createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C): RDD[(K, C)]
createCombiner: In this operation you create a new class in order to combine your data so you could aggregate your CustomTuple data into a new Class CustomTupleCombiner (i don't know if you want only make a sum or maybe you want to apply some process to this data but either option can be made in this operation)
mergeValue: In this operation you have to describe how a CustomTuple is sum to another CustumTupleCombiner(again i am presupposing a simple summarize operation). For example if you want sum the data by key, you will have in your CustumTupleCombiner class a Map so the operation should be something like: CustumTupleCombiner.sum(CustomTuple) that make CustumTupleCombiner.Map(CustomTuple.key)-> CustomTuple.Map(CustomTuple.key) + CustumTupleCombiner.value
mergeCombiners: In this operation you have to define how merge two Combiner class, CustumTupleCombiner in my example. So this will be something like CustumTupleCombiner1.merge(CustumTupleCombiner2) that will be something like CustumTupleCombiner1.Map.keys.foreach( k -> CustumTupleCombiner1.Map(k)+CustumTupleCombiner2.Map(k)) or something like that
The pated code is not proved (this will not even compile because i made it with vim) but i think that might work for your scenario.
I hope this will be usefull

Shuffling is triggered by any change in the key of a [K,V] pair, or by a repartition() call. The partitioning is calculated based on the K (key) value. By default partitioning is calculated using the Hash value of your key, implemented by the hashCode() method. In your case your Key contains two Map instance variables. The default implementation of the hashCode() method will have to calculate the hashCode() of those maps as well, causing an iteration to happen over all it elements to in turn again calculate the hashCode() of those elements.
The solutions are:
Do not include the Map instances in your Key. This seems highly unusual.
Implement and override your own hashCode() that avoids going through the Map Instance variables.
Possibly you can avoid using the Map objects completely. If it is something that is shared amongst multiple elements, you might need to consider using broadcast variables in spark. The overhead of serializing your Maps during shuffling might also be a big contributing factor.
Avoid any shuffling, by tuning your hashing between two consecutive group-by's.
Keep shuffling Node local, by choosing a Partitioner that will have an affinity of keeping partitions local during consecutive use.
Good reading on hashCode(), including a reference to quotes by Josh Bloch can be found in wiki.

Related

groupByKey vs. aggregateByKey - where exactly does the difference come from?

There is some scary language in the docs of groupByKey, warning that it can be "very expensive", and suggesting to use aggregateByKey instead whenever possible.
I am wondering whether the difference in cost comes from the fact, that for some aggregattions, the entire group never never needs to be collected and loaded to the same node, or if there are other differences in implementation.
Basically, the question is whether rdd.groupByKey() would be equivalent to rdd.aggregateByKey(Nil)(_ :+ _, _ ++ _) or if it would still be more expensive.
If you are reducing to single element instead of list.
For eg: like word count then aggregateByKey performs better because it will not cause shuffle as explained in the link performance of group by vs aggregate by.
But in your case you are merging to a list . In the case of aggregateByKey it will first reduce all the values for a key in a partition to a single list and then send the data for shuffle.This will create as many list as partitions and memory for that will be high.
In the case of groupByKey the merge happens only at one node responsible for the key. The number of list created will be only one per key here.
In case of merging to list then groupByKey is optimal in terms of memory.
Also Refer: SO Answer by zero323
I am not sure about your use case. But if you can limit the number of elements in the list in the end result then certainly aggregateByKey / combineByKey will give much better result compared to groupByKey. For eg: If you want to take only top 10 values for a given key. Then you could achieve this efficiently by using combineByKey with proper merge and combiner functions than
groupByKey and take 10.
Let me help to illustrate why groupByKey operation will lead to much more cost
By understanding the semantic of this specific operation, what the reduce task need to do is group the whole values associated with a single unique key.
In a word, let us have a look at it's signature
def groupByKey(): RDD[(K, Iterable[V])]
Because the "groupby" operation, all values associated with this key partitioned on different nodes can not be pre-merged. Huge mount of data transfer through over the network, lead to high network io load.
But aggregateByKey is not the same with it. let me clarify the signature:
def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]
How the spark engine implement this semantic of operation is as follows:
In partition it will have pre-merged operation, mean that "a specific reducer" just need to fetch all the pre-merged intermediate result of the shuffle map.
This will make the network io significantly light.

Spark Dataset aggregation similar to RDD aggregate(zero)(accum, combiner)

RDD has a very useful method aggregate that allows to accumulate with some zero value and combine that across partitions. Is there any way to do that with Dataset[T]. As far as I see the specification via Scala doc, there is actually nothing capable of doing that. Even the reduce method allows to do things only for binary operations with T as both arguments. Any reason why? And if there is anything capable of doing the same?
Thanks a lot!
VK
There are two different classes which can be used to achieve aggregate-like behavior in Dataset API:
UserDefinedAggregateFunction which uses SQL types and takes Columns as an input.
Initial value is defined using initialize method, seqOp with update method and combOp with merge method.
Example implementation: How to define a custom aggregation function to sum a column of Vectors?
Aggregator which uses standard Scala types with Encoders and takes records as an input.
Initial value is defined using zero method, seqOp with reduce method and combOp with merge method.
Example implementation: How to find mean of grouped Vector columns in Spark SQL?
Both provide additional finalization method (evaluate and finish respectively) which is used to generate final results and can be used for both global and by-key aggregations.

How to use Scala to partition data into buckets for further processing

I want to read in a csv log which has as it's first column a timestamp of form hh:mm:ss. I would like to partition the entries into buckets, say hourly. I'm curious what the best approach is that adheres to Scala's semantics, i.e., reading the file as a stream, parsing it (maybe by a match predicate?) and emitting the csv entries as tuples.
It's been a couple of years since I looked at Scala but this problem seems particularly well suited to the language.
log format example:
[time],[string],[int],[int],[int],[int],[string]
The last field in the input could be mapped to an emum in the output tuple but I'm not sure there's value in that.
I'd be happy with a general recipe that I could use, with suggestions for certain built-in functions that are well suited to the problem.
The overall goal is a map-reduce, where I want to count elements in a time window but those elements first need to be preprocessed by a regex replace, before sorting and counting.
I've tried to keep the problem abstract, so the problem can be approached as a pattern to follow.
Thanks.
Perhaps as a first pass, a simple groupBy would do the trick ?
logLines.groupBy(line => line.timestamp.hours)
Using the groupBy idiom, and some filtering, my solution looks like
val lines: Traversable[String] = source.getLines.map(_.trim).toTraversable
val events: List[String] = lines.filter(line => line.matches("[\\d]+:.*")).toList
val buckets: Map[String, List[String]] = events.groupBy { line => line.substring(0, line.indexOf(":")) }
This gives me 24 buckets, one for each hour. Now I can process each bucket, perform the regex replace that I need to de-parameterize the URIs and finally map-reduce those to find the frequency each route has occurred.
Important note. I learned that groupBy doesn't work as desired, without first creating a List from the Traversable stream. Without that step, the end result is a single valued map for each hour. Possibly not the most performant solution, since it requires all events to be loaded in memory before partitioning. Is there a better solution that can partition a stream? Perhaps something that adds events to a mutable Set as the stream is processed?

Spark: Distributed removal/addition of elements in a set?

I am trying to convert a ML algorithm to Spark Scala to take advantage of my cluster's power. The relevant bits of pseudo-code are the following:
initialize set of elements
while(set not empty) {
while(...) { remove a given element from the set }
while(...) { add a given element to the set }
}
Is there any way to parallelize such a thing?
I would intuitively say that this is not implementable in a distributed fashion (the number of iterations being unknown), but I have been reading that Spark allows implementation of iterative ML algorithms.
Here is what I tried so far:
Originally used a mutable Set and removed/added elements during the loops in simple Scala. It runs correctly, but I feel like the whole code will just be executed on the driver which limits the interest of using Spark?
Made the set a RDD, and replaced the var during every iteration by a new RDD with subtracted/added element (which I suppose is super heavy?). No error appears but the variable doesn't actually get updated.
mySetRDD = mySetRDD.subtract(sc.parallelize(Seq(element)))
Looked up Accumulators for a way to keep a set of elements upated on its content (presence/absence of elements) across multiple executors, but they do not seem to allow things other than simple updates of numerical values.
Create PairRDD and then repartitionByKey say x partitions.
After that you can use
PairRdd1.zipPartition() to get the iterator over partition of rdds. Then you can write a function which will work over two iterators to produce third or output iterator.
Since you have repartition the rdd by key you need not keep track of the removals across partitions.
https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/rdd/RDD.html#zipPartitions(org.apache.spark.rdd.RDD, boolean, scala.Function2, scala.reflect.ClassTag, scala.reflect.ClassTag)

Converting a large sequence to a sequence with no repeats in scala really fast

So I have this large sequence, with a lot of repeats as well, and I need to convert it into a sequence with no repeats. What I have been doing so far has been converting the sequence to a set, and then back to the original sequence. Conversion to the set gets rid of the duplicates, and then I convert back into the set. However, this is very slow, as I'm given to understand that when converting to set, every pair of elements is compared, and the makes the complexity O(n^2), which is not acceptable. And since I have access to a computer with thousands of cores (through my university), I was wondering whether making things parallel would help.
Initially I thought I'd use scala Futures to parallelize the code in the following manner. Group the elements of the sequence into smaller subgroups by their hash code. That way, I have a subcollection of the original sequence, such that no element appears in two different subcollections and and every element is covered. Now I convert these smaller subcollections to sets, and back to sequences and concatenate them. This way I'm guaranteed to get a sequence with no repeats.
But I was wondering if applying the toSet method on a parallel sequence already does this. I thought I'd test this out in the scala interpreter, but I got roughly the same time for the conversion to parallel set vs the conversion to the non parallel set.
I was hoping someone could tell me whether conversion to parallel sets works this way or not. I'd be much obliged. Thanks.
EDIT: Is performing a toSet on a parallel collection faster than performing toSet on a non parallel collection?
.distinct with some of the Scala collection types is O(n) (as of Scala 2.11). It uses a hash map to record what has already been seen. With this, it linearly builds up a list:
def distinct: Repr = {
val b = newBuilder
val seen = mutable.HashSet[A]()
for (x <- this) {
if (!seen(x)) {
b += x
seen += x
}
}
b.result()
(newBuilder is like a mutable list.)
Just thinking outside the box, would it be possible that you prevent the creation of these doublons instead of trying to get rid of them afterwards ?