Multiple maps on RDD - Scala

Let's say I have a class:
class DummyClass(elems1: List[FirstElem], elems2: List[SecondElem])
And during some batch computations, I have an RDD[DummyClass]
How can I collect the elements without running multiple maps over this RDD? My solution is very slow because I have to read the RDD many times. We are talking about 4M-5M records.
val dsElem1 = myRdd.map(x => x.elems1)
val dsElem2 = myRdd.map(x => x.elems2)
I want to add that I have 7 maps following the same logic. Calling persist() or cache() on the RDD before applying the maps didn't improve things much.
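For reference, one common single-pass pattern is to extract all the projections in one map that returns a tuple and cache that projected RDD. This is only a minimal sketch under the class definition above, with illustrative names; whether it actually helps depends on how the projections are consumed downstream.

// Minimal sketch: project every field of interest in a single pass over myRdd,
// then derive the per-field RDDs from the cached projection.
val projected = myRdd
  .map(x => (x.elems1, x.elems2))   // extend the tuple with the other projections as needed
  .cache()                          // materialized once; the source is scanned a single time

val dsElem1 = projected.map(_._1)   // RDD[List[FirstElem]]
val dsElem2 = projected.map(_._2)   // RDD[List[SecondElem]]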

Related

Is a filtered RDD still in cache when performed on a cached RDD

I'm wondering what happens if we perform the following instructions:
val rdd = sc.textFile("myfile").zipWithIndex.cache
val size = rdd.count
val filter = rdd.filter(_._2 % 2 == 0)
val sizeF = filter.count
Is the action performed on the filtered RDD executed as if it were in cache or not? Even though we create a second RDD from the first one, the information comes from the same place, so I'm wondering whether it is copied into a new object that needs to be cached, or whether the filtered object is directly linked to its parent, allowing faster actions.
Since filter is a transformation and not an action, and since Spark is lazy, nothing was actually done in the following line:
val filter = rdd.filter(_._2 % 2 == 0)
The following line:
val sizeF = filter.count
will use the cached RDD and will perform the filter transformation followed by the count action.
Hence, there is nothing to cache in the filter transformation.
Spark Guide
Transformations
The following table lists some of the common transformations supported
by Spark. Refer to the RDD API doc (Scala, Java, Python, R) and pair
RDD functions doc (Scala, Java) for details.
filter(func) Return a new dataset formed by selecting those elements
of the source on which func returns true.
Note: if filter were an action and a new RDD were created, it wouldn't be cached; only the RDDs on which the cache() operation was executed are cached.
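A quick way to check this (a minimal sketch, assuming a SparkContext named sc and the file from the question) is to look at each RDD's storage level:

import org.apache.spark.storage.StorageLevel

val rdd = sc.textFile("myfile").zipWithIndex.cache()
val filtered = rdd.filter(_._2 % 2 == 0)

// Only the RDD that cache() was called on reports a storage level;
// the derived RDD is not cached implicitly.
println(rdd.getStorageLevel == StorageLevel.MEMORY_ONLY)   // true
println(filtered.getStorageLevel == StorageLevel.NONE)     // true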
No,
The child RDD will not be cached; the cache will maintain the original RDD in your workers, and the derived data will not be cached.
If you apply another step to this filtered RDD that doesn't change the data, the response will still be fast, because Spark keeps the cached data in the workers until a real change occurs.

How to parallelize list iteration and be able to create RDDs in Spark?

I've just started learning Spark and Scala.
From what I understand, it's bad practice to use collect, because it gathers all the data in memory, and it's also bad practice to use for, because the code inside the block is not executed concurrently on more than one node.
Now, I have a List of numbers from 1 to 10:
List(1,2,3,4,5,6,7,8,9,10)
and for each of these values I need to generate an RDD using that value.
In such cases, how can I generate the RDDs?
By doing
sc.parallelize(List(1,2,3,4,5,6,7,8,9,10)).map(number => generate_rdd(number))
I get an error because RDD cannot be generated inside another RDD.
What is the best workaround to this problem?
Assuming generate_rdd is defined like def generate_rdd(n: Int): RDD[Something], what you need is flatMap instead of map.
sc.parallelize(List(1,2,3,4,5,6,7,8,9,10)).flatMap(number => generate_rdd(number))
This will give an RDD that is the concatenation of all the RDDs created for the numbers from 1 to 10.
Assuming that the number of RDDs you would like to create is fairly small, and hence that the parallelization itself need not be accomplished by an RDD, we can use Scala's parallel collections instead. For example, I tried to count the number of lines in about 40 HDFS files simultaneously using the following piece of code [Ignore the setting of the delimiter. For newline-delimited texts, this could just as well have been replaced by sc.textFile]:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

val conf = new Configuration(sc.hadoopConfiguration)
conf.set("textinputformat.record.delimiter", "~^~")  // custom record delimiter instead of newlines
// .par makes this a parallel collection, so the driver submits the count jobs concurrently
val parSeq = List("path of file1.xsv","path of file2.xsv",...).par
parSeq.map(x => {
  val rdd = sc.newAPIHadoopFile(x, classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
  println(rdd.count())
})
Here is part of the output in the Spark UI. As seen, most of the RDD count operations started at the same time.

Does Spark optimize multiple filters applied to an RDD?

I am running a simple Spark application wherein I apply multiple filters over an RDD and ultimately apply an action.
Does Spark go over the RDD multiple times? Or does it optimize and apply multiple filters at the same time (with && operation)?
Every transformation on an RDD creates a new RDD. Let me explain with 2 simple examples:
RDD -> map -> filter -> print
This goes over the source RDD, then applies the map function, creating an RDD, then applies the filter, creating another RDD and finally does the print action.
RDD -> map (let's call it RDD-m) -> filter (let's call it filter1) -> print
and
RDD-m -> filter (let's call it filter2) -> print
Both these are part of the same job.
Here we create a new RDD (called RDD-m) after the first map function. Now we are branching out, applying two filter functions (filter1 and filter2) on the same RDD-m. Then we finally print the 2 resulting RDDs. So here it might look like RDD-m is being reused across the 2 filter functions, but it is not.
Spark starts with the action and creates a DAG tracing back to the source RDD. So it will create 2 DAGs for the 2 different paths and the RDD-m will be evaluated twice.
The way to avoid this is to use the persist method on RDD-m, which avoids the duplicate evaluation.
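A minimal sketch of that branching scenario, assuming an existing SparkContext sc; the input path and the transformations are only illustrative:

val rddM = sc.textFile("input.txt").map(_.toUpperCase).persist()  // RDD-m from the example

val filter1 = rddM.filter(_.startsWith("A"))
val filter2 = rddM.filter(_.endsWith("Z"))

// Each action below triggers its own job. Without persist(), each job would
// recompute the map from the source file; with persist(), RDD-m is materialized
// the first time and reused by the second branch.
println(filter1.count())
println(filter2.count())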

Sharing and updating data with a series of unrelated RDDs

I have a series of computations that I distribute, where all of the computations rely on data that is represented in a Map. The thing is that at each step the Map should also be updated.
First, the data sets. The RDD series is:
Iterable[(Long, RDD[(String, Iterable[((String, String), Int)])])]
where the first Long is a signature of the RDD and the (String, String) tuple is the key for the Map that is needed on all the nodes:
Map[(String, String), Double]
At each step the computation needs the Double value and in turn updates it using the Int value.
I know that accumulators are write-only and cannot be used in my case for both reading and writing (I did try to read the data using localValue, which didn't work).
The thing is that in my case, since each RDD is processed in turn, I was wondering if there is still a hack that would let me use accumulators. Currently I wrote the following accumulator:
val accMap = sc.accumulableCollection(scala.collection.mutable.HashMap[(String, String), Double]())
And I am wondering whether calling accMap.value after each RDD's calculation and distributing the map using a broadcast variable is the best I can get. My problem is that the map is REALLY BIG, so this is not quite feasible, and if that is the case the algorithm should be rethought.
Basically, my question is: for the problem I described above, is the best I can do, on each consecutive RDD, to use an accumulator Map for accumulating the scores and collect it using the value function on each iteration, just to broadcast it again using a broadcast variable?
EDIT: Combining all the RDDs into a single RDD isn't feasible for me since the datasets are REALLY huge. This is the reason I tried to divide them into several unrelated RDDs.
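For what it's worth, the accumulate-then-broadcast loop described in the question would look roughly like the sketch below. It assumes the (since deprecated) accumulableCollection API used above; rddSeries, the key/weight names and the update rule are hypothetical. Note also that merging HashMaps in an accumulator keeps the last value written for a duplicate key rather than summing, so this only behaves as expected if each key is updated at most once per RDD.

import scala.collection.mutable

// Hypothetical driver-side loop: rddSeries stands in for the
// Iterable[(Long, RDD[(String, Iterable[((String, String), Int)])])] from the question.
var scores = Map.empty[(String, String), Double]

for ((signature, rdd) <- rddSeries) {
  val bcScores = sc.broadcast(scores)                    // read side: current map on every node
  val accMap = sc.accumulableCollection(mutable.HashMap[(String, String), Double]())

  rdd.foreach { case (_, entries) =>
    entries.foreach { case (key, weight) =>
      val current = bcScores.value.getOrElse(key, 0.0)
      accMap += key -> (current + weight)                // write side: collect the updated score
    }
  }

  scores = scores ++ accMap.value                        // merge on the driver before the next RDD
  bcScores.unpersist()
}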

Spark: Single pipelined scala command better than separate commands?

I am using Spark with Scala. I wanted to know whether having a single one-line command is better than separate commands. What are the benefits, if any? Does it gain more efficiency in terms of speed? Why?
For example:
var d = data.filter(_(1)==user).map(f => (f(2),f(5).toInt)).groupByKey().map(f=> (f._1,f._2.count(x=>true), f._2.sum))
against
var a = data.filter(_(1)==user)
var b = a.map(f => (f(2),f(5).toInt))
var c = b.groupByKey()
var d = c.map(f=> (f._1,f._2.count(x=>true), f._2.sum))
There is no performance difference between your two examples; the decision to chain RDD transformations or to explicitly represent the intermediate RDDs is just a matter of style. Spark's lazy evaluation means that no actual distributed computation will be performed until you invoke an RDD action like take() or count().
During execution, Spark will pipeline as many transformations as possible. For your example, Spark won't materialize the entire filtered dataset before it maps it: the filter() and map() transformations will be pipelined together and executed in a single stage. The groupByKey() transformation (usually) needs to shuffle data over the network, so it's executed in a separate stage. Spark would materialize the output of filter() only if it had been cache()d.
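One way to observe this (a small sketch, assuming the data RDD and user value from the question) is to print the lineage with toDebugString: the filter() and map() steps show up together in one stage, while the shuffle introduced by groupByKey() appears as a separate indentation level.

val d = data.filter(_(1) == user)
  .map(f => (f(2), f(5).toInt))
  .groupByKey()
  .map(f => (f._1, f._2.count(x => true), f._2.sum))

// Prints the RDD lineage; the MapPartitionsRDDs created by filter() and map()
// sit in one stage, separated from the ShuffledRDD created by groupByKey().
println(d.toDebugString)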
You might need to use the second style if you want to cache an intermediate RDD and perform further processing on it. For example, if I wanted to perform multiple actions on the output of the groupByKey() transformation, I would write something like
val grouped = data.filter(_(1)==user)
.map(f => (f(2),f(5).toInt))
.groupByKey()
.cache()
val mapped = grouped.map(f=> (f._1,f._2.count(x=>true), f._2.sum))
val counted = grouped.count()
There is no difference in terms of execution, but you might want to consider the readability of your code. I would go with your first example but like this:
var d = data.filter(_(1)==user)
.map(f => (f(2),f(5).toInt))
.groupByKey()
.map(f=> (f._1,f._2.count(x=>true), f._2.sum))
Really, this is more of a Scala question than a Spark question, though. Still, as you can see from Spark's implementation of word count as shown in their documentation,
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
you don't need to worry about those kinds of things. The Scala language (through laziness, etc.) and Spark's RDD implementation handle all that at a higher level of abstraction.
If you find really bad performance, then you should take the time to explore why. As Knuth said, "premature optimization is the root of all evil."