Sharing and updating data with a series of unrelated RDDs - Scala

I have a series of computations that I distribute, where all of the computations rely on data represented in a Map. The thing is that on each step the Map should also be updated.
First the data sets; the RDD series is:
Iterable[(Long, RDD[(String, Iterable[((String, String), Int)])])]
Where the first Long is the signature of the RDD, and the (String, String) tuple is the key of the Map that is needed on all the nodes:
Map[(String, String), Double]
On each step the computation needs the Double value and in turn updates it using the Int value.
I know that accumulators are write-only from a task's point of view and cannot be used in my case for both reading and writing (I did try to read the data using localValue, which didn't work).
The thing is that in my case, since each RDD is processed in turn, I was wondering whether there is still a hack that could help me use the accumulators. Currently I wrote the following accumulator:
val accMap = sc.accumulableCollection(scala.collection.mutable.HashMap[(String, String), Double]())
And I am wondering whether calling accMap.value after each RDD's computation and then distributing the map with a broadcast variable is the best I can get. My problem is that the map is REALLY BIG, so this is not quite feasible, and if that's the case the algorithm should be rethought.
Basically, my question is: for the problem described above, is the best I can do, on each consecutive RDD, to use an accumulator Map for accumulating the scores, collect it with the value function on each iteration, and then broadcast it again using a broadcast variable?
EDIT: Combining all the RDDs into a single RDD isn't feasible for me since the datasets are REALLY huge; this is the reason I tried to divide the work into several unrelated RDDs.
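For reference, here is a minimal sketch of the loop I am describing, assuming the RDD series and a scoring function (rddSeries and updateScore are placeholder names, not my actual code):

import scala.collection.mutable

// rddSeries and updateScore are placeholders for the actual series and scoring logic
var scores = Map.empty[(String, String), Double]        // the big shared map, kept on the driver

for ((signature, rdd) <- rddSeries) {                   // rddSeries: the Iterable of (Long, RDD) pairs
  val scoresBc = sc.broadcast(scores)                   // ship the current map to all nodes
  val accMap = sc.accumulableCollection(mutable.HashMap[(String, String), Double]())

  rdd.foreach { case (_, entries) =>
    entries.foreach { case (key, count) =>
      val current = scoresBc.value.getOrElse(key, 0.0)  // read side: broadcast variable
      accMap += (key -> updateScore(current, count))    // write side: accumulator
    }
  }

  scores = scores ++ accMap.value                       // collect the updates on the driver
  scoresBc.unpersist()                                  // drop the now-stale broadcast
}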

Related

Spark: Which is the proper way to use a Broadcast variable?

I don't know if I'm using a broadcast variable correctly.
I have two RDDs, rdd1 and rdd2. I want to apply rdd2.mapPartitionsWithIndex(...), and for each partition I need to perform some calculation using the whole rdd1. So I think this is a case for using a broadcast variable. First question: Am I thinking it right?
To do so, I did this:
val rdd1Broadcast = sc.broadcast(rdd1.collect())
Second question: Why do I need to put .collect()? I saw examples with and without .collect(), but I didn't realize when I need to use it.
Also, I did this:
val rdd3 = rdd2.mapPartitionsWithIndex( myfunction(_, _, rdd1Broadcast), preservesPartitioning = preserves).cache()
Third question: Which is better: passing rdd1Broadcast or rdd1Broadcast.value?
Am I thinking it right?
There is really not enough information to answer this part. Broadcasting is useful only if the broadcasted object is relatively small, or if local access significantly reduces computational complexity.
Why do I need to put .collect()?
Because RDDs can be accessed only on the driver. Broadcasting an RDD is not meaningful, as you cannot access the data from a task.
Which is better: passing rdd1Broadcast or rdd1Broadcast.value?
The argument should be of type Broadcast[_], so don't pass rdd1Broadcast.value. If the parameter is passed by value, it will be evaluated and substituted locally, and the broadcast will not be used.
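For illustration, a small sketch of the shape such a function could take (the element types and the body here are placeholders, since the real myfunction is not shown):

import org.apache.spark.broadcast.Broadcast

// The function receives the Broadcast handle and calls .value inside the task,
// so Spark's broadcast mechanism is actually used on the executors.
def myfunction(index: Int, iter: Iterator[String], bc: Broadcast[Array[String]]): Iterator[String] = {
  val lookup = bc.value.toSet      // the collected rdd1, fetched once per executor
  iter.filter(lookup.contains)     // placeholder computation using the broadcast data
}

val rdd1Broadcast = sc.broadcast(rdd1.collect())
val rdd3 = rdd2.mapPartitionsWithIndex(myfunction(_, _, rdd1Broadcast), preservesPartitioning = true).cache()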

Split Spark DataFrame based on condition

I need something similar to the randomSplit function:
val Array(df1, df2) = myDataFrame.randomSplit(Array(0.6, 0.4))
However, I need to split myDataFrame based on a boolean condition. Does anything like the following exist?
val Array(df1, df2) = myDataFrame.booleanSplit(col("myColumn") > 100)
I'd prefer not to make two separate .filter calls.
Unfortunately the DataFrame API doesn't have such a method; to split by a condition you'll have to perform two separate filter transformations:
import org.apache.spark.sql.functions.{col, not}

myDataFrame.cache() // recommended, to prevent repeating the computation for both filters
val condition = col("myColumn") > 100
val df1 = myDataFrame.filter(condition)
val df2 = myDataFrame.filter(not(condition))
I understand that caching and filtering twice looks a bit ugly, but please bear in mind that DataFrames are translated to RDDs, which are evaluated lazily, i.e. only when they are directly or indirectly used in an action.
If a method booleanSplit as suggested in the question existed, the result would be translated to two RDDs, each of which would be evaluated lazily. One of the two RDDs would be evaluated first and the other second, strictly after the first. At the point the first RDD is evaluated, the second RDD would not yet have "come into existence". (EDIT: Just noticed that there is a similar question for the RDD API with an answer that gives similar reasoning.)
To actually gain any performance benefit, the second RDD would have to be (partially) persisted during the iteration of the first RDD (or, actually, during the iteration of the parent RDD of both, which is triggered by the iteration of the first RDD). IMO this wouldn't align overly well with the design of the rest of the RDD API. Not sure if the performance gains would justify this.
I think the best you can achieve is to avoid writing the two filter calls directly in your business code, by writing an implicit class with a booleanSplit utility method that does that part in a similar way to Tzach Zohar's answer, maybe using something along the lines of myDataFrame.withColumn("__condition_value", condition).cache() so that the value of the condition is not calculated twice.
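For example, something along these lines (a rough, untested sketch; the class and method names are only suggestions):

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.not

implicit class BooleanSplitOps(df: DataFrame) {
  // Splits into (rows matching the condition, rows not matching it),
  // caching the parent so the source is only computed once.
  def booleanSplit(condition: Column): (DataFrame, DataFrame) = {
    val cached = df.cache()
    (cached.filter(condition), cached.filter(not(condition)))
  }
}

// usage: val (df1, df2) = myDataFrame.booleanSplit(col("myColumn") > 100)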

How to parallelize list iteration and be able to create RDDs in Spark?

I've just started learning Spark and Scala.
From what I understand, it's bad practice to use collect, because it gathers all the data in memory, and it's also bad practice to use for, because the code inside the block is not executed concurrently by more than one node.
Now, I have a List of numbers from 1 to 10:
List(1,2,3,4,5,6,7,8,9,10)
and for each of these values I need to generate an RDD using that value.
In such cases, how can I generate the RDDs?
By doing
sc.parallelize(List(1,2,3,4,5,6,7,8,9,10)).map(number => generate_rdd(number))
I get an error because an RDD cannot be created inside another RDD's transformation.
What is the best workaround to this problem?
Assuming generate_rdd is defined like def generate_rdd(n: Int): RDD[Something], you cannot call it from inside a transformation of another RDD; instead, build the RDDs on the driver from the plain list and concatenate them with SparkContext.union:
val rdds = List(1,2,3,4,5,6,7,8,9,10).map(number => generate_rdd(number))
val combined = sc.union(rdds)
This gives an RDD that is the concatenation of all the RDDs created for the numbers from 1 to 10.
Assuming that the number of RDDs you would like to create is relatively small, and hence that the parallelization itself need not be done through an RDD, we can use Scala's parallel collections instead. For example, I tried to count the number of lines in about 40 HDFS files simultaneously using the following piece of code [ignore the setting of the delimiter; for newline-delimited text, this could just as well have been sc.textFile]:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "~^~")
val parSeq = List("path of file1.xsv", "path of file2.xsv", ...).par
parSeq.map(x => {
  val rdd = sc.newAPIHadoopFile(x, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
  println(rdd.count())
})
Here is part of the output in the Spark UI. As can be seen, most of the RDD count operations started at the same time.

Spark: Distributed removal/addition of elements in a set?

I am trying to convert an ML algorithm to Spark Scala to take advantage of my cluster's power. The relevant bits of pseudo-code are the following:
initialize set of elements
while (set not empty) {
    while (...) { remove a given element from the set }
    while (...) { add a given element to the set }
}
Is there any way to parallelize such a thing?
I would intuitively say that this is not implementable in a distributed fashion (the number of iterations being unknown), but I have been reading that Spark allows the implementation of iterative ML algorithms.
Here is what I tried so far:
Originally used a mutable Set and removed/added elements during the loops in plain Scala. It runs correctly, but I feel like the whole code is just executed on the driver, which limits the interest of using Spark.
Made the set an RDD, and replaced the var on every iteration with a new RDD with the element subtracted/added (which I suppose is super heavy?). No error appears, but the variable doesn't actually get updated.
mySetRDD = mySetRDD.subtract(sc.parallelize(Seq(element)))
Looked up Accumulators for a way to keep a set of elements updated on its content (presence/absence of elements) across multiple executors, but they do not seem to allow anything other than simple updates of numerical values.
Create a PairRDD and then repartition it by key into, say, x partitions.
After that you can use
PairRdd1.zipPartitions() to get iterators over the partitions of the RDDs. Then you can write a function that works over the two iterators to produce a third, output iterator.
Since you have repartitioned the RDDs by key, you need not keep track of removals across partitions.
https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/rdd/RDD.html#zipPartitions(org.apache.spark.rdd.RDD, boolean, scala.Function2, scala.reflect.ClassTag, scala.reflect.ClassTag)
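A rough sketch of the idea in Scala (the data and the removal logic here are illustrative; the key point is that both pair RDDs share the same partitioner):

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

val partitioner = new HashPartitioner(8)
val elements: RDD[(String, Int)] = sc.parallelize(Seq("a" -> 1, "b" -> 2, "c" -> 3)).partitionBy(partitioner)
val removals: RDD[(String, Int)] = sc.parallelize(Seq("b" -> 0)).partitionBy(partitioner)

// Matching keys land in the same partition index, so each pair of partition
// iterators can be processed locally, without tracking removals across partitions.
val remaining = elements.zipPartitions(removals, preservesPartitioning = true) { (elemIter, removeIter) =>
  val toDrop = removeIter.map(_._1).toSet
  elemIter.filterNot { case (k, _) => toDrop.contains(k) }
}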

Spark: groupBy taking a lot of time

In my application, when taking performance numbers, groupBy is eating up a lot of time.
My RDD has the structure below:
JavaPairRDD<CustomTuple, Map<String, Double>>
CustomTuple:
This object contains information about the current row in the RDD, like which week, month, city, etc.
public class CustomTuple implements Serializable {
    private Map hierarchyMap = null;
    private Map granularMap = null;
    private String timePeriod = null;
    private String sourceKey = null;
}
Map
This map contains the statistical data about that row, like how much investment, how many GRPs, etc.
<"Inv", 20>
<"GRP", 30>
I was executing the DAG below on this RDD:
1. Apply a filter on this RDD and scope out the relevant rows: Filter
2. Join the RDDs: Join
3. Apply a map phase to compute investment: Map
4. Apply a GroupBy phase to group the data according to the desired view: GroupBy
5. Apply a map phase to aggregate the data as per the grouping achieved in the step above (say, view data across time periods) and also create new objects based on the result set desired to be collected: Map
6. Collect the result: Collect
So if the user wants to view investment across time periods, then the list below is returned (this was achieved in step 4 above):
<timeperiod1, value>
When I checked the time taken by the operations, GroupBy was taking 90% of the time needed to execute the whole DAG.
IMO, we could replace the GroupBy and the subsequent Map operations by a single reduce.
But reduce will work on objects of type JavaPairRDD<CustomTuple, Map<String, Double>>.
So my reduce will be like T reduce(T, T), where T will be a pair of CustomTuple and Map<String, Double>.
Or maybe after step 3 in the above DAG I run another map function that returns an RDD of the type needed for the metric that has to be aggregated, and then run a reduce.
Also, I am not sure how the aggregate function works and whether it would be able to help me in this case.
Secondly, my application will receive requests on varying keys. With my current RDD design, each request would require me to repartition or re-group my RDD on this key. This means that for each request, grouping/re-partitioning would take 95% of the time to compute the job. For example, for a request keyed on market, the returned list would look like:
<"market1", 20>
<"market2", 30>
This is very discouraging, as the current performance of the application without Spark is 10 times better than its performance with Spark.
Any insight is appreciated.
[EDIT] We also noticed that the JOIN was taking a lot of time; maybe that's why groupBy was taking so long. [/EDIT]
TIA!
Spark's documentation encourages you to avoid groupBy operations; instead it suggests combineByKey or one of its derived operations (reduceByKey or aggregateByKey). These operations perform an aggregation both before and after the shuffle (in the map phase and in the reduce phase, to use Hadoop terminology), so your execution times will improve (I don't know if it will be 10 times better, but it has to be better).
If I understand your processing correctly, I think you can use a single combineByKey operation. The following explanation is written for Scala code, but you can translate it to Java without too much effort.
combineByKey has three arguments:
combineByKey[C](createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C): RDD[(K, C)]
createCombiner: In this operation you create a new class in which to combine your data, so you could aggregate your CustomTuple data into a new class, CustomTupleCombiner (I don't know if you only want a sum, or maybe you want to apply some other process to this data, but either option can be done in this operation).
mergeValue: In this operation you describe how a CustomTuple is added to a CustomTupleCombiner (again, I am presupposing a simple summing operation). For example, if you want to sum the data by key, your CustomTupleCombiner class would contain a Map, and the operation would be something like CustomTupleCombiner.sum(CustomTuple), which sets CustomTupleCombiner.Map(CustomTuple.key) -> CustomTuple.Map(CustomTuple.key) + CustomTupleCombiner.value.
mergeCombiners: In this operation you define how to merge two combiner classes, CustomTupleCombiner in my example. So this would be something like CustomTupleCombiner1.merge(CustomTupleCombiner2), which does something like CustomTupleCombiner1.Map.keys.foreach(k -> CustomTupleCombiner1.Map(k) + CustomTupleCombiner2.Map(k)), or something along those lines.
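A rough Scala sketch of what this could look like, using a plain Map[String, Double] as the combiner instead of a dedicated CustomTupleCombiner class, and simplifying the key to a String (in your case the key would come from CustomTuple; all names here are illustrative):

import org.apache.spark.rdd.RDD

// Adds two metric maps entry by entry, e.g. Map("Inv" -> 20.0) + Map("Inv" -> 5.0) = Map("Inv" -> 25.0)
def sumMaps(a: Map[String, Double], b: Map[String, Double]): Map[String, Double] =
  (a.keySet ++ b.keySet).map(k => k -> (a.getOrElse(k, 0.0) + b.getOrElse(k, 0.0))).toMap

def aggregateByView(rows: RDD[(String, Map[String, Double])]): RDD[(String, Map[String, Double])] =
  rows.combineByKey(
    (v: Map[String, Double]) => v,  // createCombiner: the first value seen becomes the combiner
    sumMaps _,                      // mergeValue: fold one more row into the combiner
    sumMaps _                       // mergeCombiners: merge partial results from different partitions
  )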
I have not tested this (it might not even compile, since I wrote it in vim), but I think the approach might work for your scenario.
I hope this will be useful.
Shuffling is triggered by any change in the key of a [K, V] pair, or by a repartition() call. The partitioning is calculated based on the K (key) value. By default, partitioning is calculated using the hash value of your key, implemented by the hashCode() method. In your case your key contains two Map instance variables. The default implementation of hashCode() has to calculate the hashCode() of those maps as well, causing an iteration over all their elements in order to, in turn, calculate the hashCode() of those elements.
The solutions are:
Do not include the Map instances in your key. This seems highly unusual.
Implement and override your own hashCode() that avoids going through the Map instance variables (see the sketch after this list).
Possibly you can avoid using the Map objects completely. If they hold something that is shared amongst multiple elements, you might need to consider using broadcast variables in Spark. The overhead of serializing your Maps during shuffling might also be a big contributing factor.
Avoid any shuffling by tuning your hashing between two consecutive group-bys.
Keep shuffling node-local by choosing a Partitioner that has an affinity for keeping partitions local during consecutive use.
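For the second point, a sketch in Scala of a key class whose hashCode()/equals() deliberately skip the Map fields (the field names mirror CustomTuple from the question, but this is only illustrative):

class CustomKey(
    val timePeriod: String,
    val sourceKey: String,
    val hierarchyMap: Map[String, String],  // heavy field, excluded from hashing
    val granularMap: Map[String, String]    // heavy field, excluded from hashing
) extends Serializable {

  // Only the cheap String fields take part in partitioning and equality,
  // so the shuffle never has to walk the maps.
  override def hashCode(): Int = (timePeriod, sourceKey).hashCode()

  override def equals(other: Any): Boolean = other match {
    case that: CustomKey => timePeriod == that.timePeriod && sourceKey == that.sourceKey
    case _               => false
  }
}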
Good reading on hashCode(), including a reference to quotes by Josh Bloch, can be found on Wikipedia.