Iterating through an RDD without doing any changes to it - scala

So I have an RDD; this RDD is paired with the indexes of its elements. I would like to simply iterate over it (or use a nice Spark function for this, if one exists) and check adjacent elements by comparing one of their values. If the adjacent elements pass this check, I would like to note down their indexes in a separate non-RDD structure, maybe a ListBuffer.
Is this possible with some kind of special Spark function, or do I have to simply iterate through it manually, and how would I then iterate through it?

One of the main characteristics of an RDD is that it is immutable. Once it is created, you can iterate over it as many times as you want, but you won't be able to make any changes to it.
If you want to make changes, you need to create a new RDD via a transformation.
Additionally, if you want to iterate over an RDD and check adjacent elements, this logic will most likely not work well: an RDD is distributed, so you usually have no guarantee about which records sit next to each other. You can exert some control by specifying a partitioner to group your data, but even then I wouldn't count on adjacency unless you explicitly use a function to group the data.
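As a rough sketch of one way around this, assuming your RDD is already keyed by index as an RDD[(Long, A)] (called indexed below) and that check is your comparison function (both names are illustrative, not from the question): re-key a shifted copy so that each element meets its successor on the same key, then filter and collect the matching indexes.

val shifted = indexed.map { case (idx, value) => (idx - 1, value) }   // element at idx+1, re-keyed to idx
val matchingIndexes = indexed.join(shifted)              // (idx, (elem, nextElem))
  .filter { case (_, (a, b)) => check(a, b) }            // your adjacency condition
  .keys                                                  // indexes whose adjacent pair passed the check
  .collect()                                             // small result brought back to the driver (e.g. into a ListBuffer)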
If you post some sample data it may be easier to help you with your question.

Related

Does it help to persist data between transformations in Scala Spark?

One
First I read the tweets and parse them into a tweet case class by mapping my parsing function parseTweet over them:
val tweets = sc.textFile("/home/gakuo/Documents/bigdata/NintendoTweets").map(parseTweet)
Two
Then I use a function that turns this into a pair RDD of the form (hashtags, likes), via a map inside toPairRdd:
val pairedRDD = toPairRdd(tweets).persist()
Question
After reading in my RDD in (one) above, does it help to persist it, given that what follows in (two) is a transformation? I am thinking that, since both are lazy, persisting is actually a waste of memory.
Three
After computing the pairedRDD, I want to compute scores for each hashtag: toScores uses reduceByKey.
val scores = toScores(pairedRDD).persist()
Question
I use reduceByKey. Does this pairRDD method result in shuffling? I have read a paper that states:
"a shuffle can occur when the resulting RDD depends on other elements from the same RDD or another RDD.
cogroup, groupWith, join, leftOuterJoin, rightOuterJoin,
groupByKey, reduceByKey, combineByKey, distinct, intersection,
repartition, coalesce resulting in shuffling. To avoid shuffles for these kinds of operations make sure the transformation follows the same partition as the original RDD"
The same paper also states that reduceByKey follows the same partition as the original RDD.
It matters to use persist (in memory, on disk, or both) when you have many actions that repeatedly trigger the same chain of transformations, and when recomputing that chain again and again takes too long.
In your case no persist or caching is required, as it is a one-pass process. You need to know that stages are generated by putting as many transformations as possible together before a shuffle; you would have 2 here.
If you were to process some other data requirements using the pairedRDD, then persist would be advisable.
The actions are what matter here, in any event.
If you have multiple actions using the same RDD, then it's advisable to persist. I don't see any action so far in your code, so I don't see any reason to cache the RDD. Persist/cache is also lazily evaluated.
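For contrast, a minimal sketch of the case where persisting does pay off, reusing the pairedRDD from the question (the two actions below are illustrative):

val pairedRDD = toPairRdd(tweets).persist()   // marked for caching; nothing computed yet

val numPairs = pairedRDD.count()              // action 1: materializes the lineage and fills the cache
val sample   = pairedRDD.take(10)             // action 2: served from the cached partitions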
Persist/cache also does not guarantee that the data stays around for the lifetime of the execution: persisting follows an LRU (least recently used) policy, which may evict the least recently used RDDs' data when memory is full. Keep all of this in mind while using persist.
reduceByKey is a wide transformation, since a shuffle may happen. However, it first combines the data per key inside each partition and only then performs the reduce across partitions, so it's less costly. Always avoid groupByKey, which shuffles the data directly without first combining it per key within each partition.
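A small sketch of the difference, assuming pairedRDD holds (hashtag, likes) pairs as in the question:

// Both compute total likes per hashtag, but reduceByKey combines values inside each
// partition before the shuffle, while groupByKey ships every raw record across the network.
val likesPerTag  = pairedRDD.reduceByKey(_ + _)              // preferred: map-side combine
val likesPerTag2 = pairedRDD.groupByKey().mapValues(_.sum)   // avoid: shuffles all values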

Efficient data structure for aggregation in Scala

As in the example below, I'd like to accumulate values by key.
I can use List, ArrayBuffer, Array, mutable.HashSet, etc.
When the number of values per key is large, varied, and unknown in advance, i.e. wide (e.g., 10k - 1M), which data structure is most efficient?
In Java I definitely avoid using List or Vector because of dynamic memory expansion. In Scala, what is best practice performance-wise and/or memory-wise?
Thanks.
val res = data.flatMap { x =>
  if (someCondition) {   // "some condition" left as a placeholder
    Some((x._2._2, ArrayBuffer[(Int, Double)]((x._1, x._2._1))))
  } else {
    None
  }
}
.reduceByKey { (x, y) => x ++ y }
UPDATE:
The subsequent transforms in Spark are as below. I'm creating a feature matrix (using sparse vectors) as data prep.
.map(x => (x._1, x._2.toArray.sortBy(_._1)))
.map { x => (yieldMap.value.get(x._1).get, x._2.map(_._1), x._2.map(_._2)) }
Well, if you accumulate them for quick access, then of course you need something that provides O(1) lookup (such as HashMap). From your example I can see that you want to reduce by key in a later stage, which means you need to traverse it anyway.
List is OK if you only need to add elements at the head of the collection. In that case make a ListBuffer, fill it up incrementally, and then invoke .toList when you're done adding. That will save you some memory.
If you don't add only at the head, take a Vector. It is effectively constant time due to its tree representation (see here) and is generally recommended over lists if performance is an issue.
Here's a performance overview that might be of help.
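A tiny sketch of that ListBuffer pattern (the values are made up):

import scala.collection.mutable.ListBuffer

// Build mutably, then hand back an immutable List once filling is done.
val buf = ListBuffer.empty[(Int, Double)]
buf += ((1, 0.5))
buf += ((2, 1.5))
val result: List[(Int, Double)] = buf.toList   // the buffer's contents become the immutable list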
You seem to be using Spark, so I assume you want to compute this stuff on a cluster somehow. When doing distributed computing, the question of how you distribute the data and how much communication is needed between cluster nodes is the most important one.
The fastest approach would probably be to map each key to a cluster node and then aggregate the results sequentially into a list. Looking at the API, you can achieve the mapping to cluster nodes using a Partitioner and the aggregation using aggregateByKey. aggregateByKey allows you to specify a function that is applied in linear order over the data in one partition, so you can aggregate all values efficiently into a list. You also have to specify an associative combine function, but it does not matter how efficient it is because it will never be called.
If you stick with what you have, without being able to assume anything about the order in which the reduce function is called, a plain Array might actually be the best data structure. Lists might be faster if you are prepending elements, but you cannot ensure that. Vectors, on the other hand, have effectively constant time for appending and prepending an element, but merging two vectors of similar size is linear anyway, and the constants involved with vectors are larger. If you have an efficiency problem with what you are doing now, I would really try to use aggregateByKey together with an optimal partitioning of your data.
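A hedged sketch of that suggestion, assuming the input is an RDD[(K, (Int, Double))] named keyed and that 200 partitions is just an illustrative number:

import org.apache.spark.HashPartitioner
import scala.collection.mutable.ArrayBuffer

// Route each key to a single partition, then fold values into one buffer per key locally.
val perKey = keyed.aggregateByKey(ArrayBuffer.empty[(Int, Double)], new HashPartitioner(200))(
  (buf, v) => buf += v,    // seqOp: applied in order within a partition
  (b1, b2) => b1 ++= b2    // combOp: merges buffers from different partitions (rarely needed with this partitioning)
)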

How to merge data from Enumerators in Scala

I'm using enumerator/iteratee from the Play framework.
I have several enumerators that each provide a sorted sequence of values. I want to write an Iteratee/Enumeratee that merges the values from these enumerators to provide a sorted sequence of all the values.
Is it a good idea to use an Iteratee, or should I implement an Enumeratee directly?
I know that I can zip the values from the enumerators, reconstruct their data streams in memory, and then merge that data.
But I'm wondering if there is a way to implement a "classic" merge sort: "read" the first value from all enumerators, select the minimal value, and then let the enumerator that provided it read another value (while the other enumerators are on hold). As a result I want an enumeratee that provides the resulting sorted sequence without storing all the streams in memory. And I would like to follow a functional style and keep everything immutable.
Thanks for ideas.
You will still need to do some insertion sorting in a standard collection in memory. Imagine this pathological case:
Enumerator(3, 2, 1) and Enumerator(4, -1 , -2, -3)
Here you cannot just take the smallest element and tack it onto the end of your collection; you will have to put values at arbitrary places in the collection as you go. This is part of what makes sorting fundamentally O(n log(n)): you have to know the full universe of what you are sorting in order to do any better than that. (Bucket sort is a linear-time sorting algorithm, assuming you know the distribution of the values you are trying to sort.)
To address your question more specifically:
The enumerator/iteratee library isn't really expressive enough for your use case. If you want to merge enumerators you can use Enumerator.interleave and do some insertion sorting in your Iteratee with whatever elements come in first.
If this mechanic is important to you, you could consider using the recently released Akka Streams, with which you can implement a custom FlexiMerge push/pull stage that will allow you to do what you seek.
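A rough, non-incremental sketch of the Enumerator.interleave route mentioned above, assuming Int values and a plain insertion sort into an immutable Vector (insertSorted, e1 and e2 are illustrative names; the full result is still accumulated in memory):

import play.api.libs.iteratee.{Enumerator, Iteratee}
import scala.concurrent.ExecutionContext.Implicits.global

// Insert x into an already sorted Vector, keeping the result sorted.
def insertSorted(acc: Vector[Int], x: Int): Vector[Int] = {
  val i = acc.indexWhere(_ > x)
  if (i < 0) acc :+ x else acc.patch(i, Vector(x), 0)
}

val e1 = Enumerator(1, 3, 5)
val e2 = Enumerator(2, 4, 6)

// Future[Vector[Int]] holding the merged, sorted values.
val merged = Enumerator.interleave(e1, e2) |>>> Iteratee.fold(Vector.empty[Int])(insertSorted)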

What is the efficient way to update value inside Spark's RDD?

I'm writing a graph-related program in Scala with Spark. The dataset has 4 million nodes and 4 million edges (you can treat it as a tree), but each time (in an iteration) I only edit a portion of it, namely a sub-tree rooted at a given node and the nodes on the path between that node and the root.
The iterations are dependent, which means iteration i+1 needs the result from iteration i, so I need to store the result of each iteration for the next step.
I'm trying to find an efficient way to update the RDD, but have no clue so far. I find that PairRDD has a lookup function which could reduce the computation time from O(N) to O(M), where N denotes the total number of objects in the RDD and M denotes the number of elements in each partition.
So I'm wondering: is there any way I could update an object in the RDD in O(M)? Or, more ideally, in O(1)? (I saw an email on Spark's mailing list saying that lookup can be modified to achieve O(1).)
Another thing is, if I could achieve O(M) for updating the RDD, could I increase the number of partitions to something larger than the number of cores I have and achieve better performance?
As functional data structures, RDDs are immutable and an operation on an RDD generates a new RDD.
Immutability of the structure does not necessarily mean full replication. Persistent data structures are a common functional pattern where operations on immutable structures yield a new structure, but previous versions are maintained and often reused.
GraphX (a 'module' on top of Spark) is a graph API that uses this concept. From the docs:
Changes to the values or structure of the graph are accomplished by producing a new graph with the desired changes. Note that substantial parts of the original graph (i.e., unaffected structure, attributes, and indices) are reused in the new graph, reducing the cost of this inherently functional data structure.
It might be a solution for the problem at hand: http://spark.apache.org/docs/1.0.0/graphx-programming-guide.html
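A minimal, illustrative GraphX sketch of that idea (the Int vertex attributes and the id >= 2L test are stand-ins for the real subtree logic; sc is an existing SparkContext):

import org.apache.spark.graphx.{Edge, Graph, VertexId}

// Tiny toy graph: 1 -> 2 -> 3, with an Int attribute per vertex.
val vertices = sc.parallelize(Seq[(VertexId, Int)]((1L, 0), (2L, 0), (3L, 0)))
val edges    = sc.parallelize(Seq(Edge(1L, 2L, "child"), Edge(2L, 3L, "child")))
val graph    = Graph(vertices, edges)

// "Updating" the subtree rooted at vertex 2 produces a new Graph; GraphX reuses the
// untouched structure and indices of the old one rather than copying everything.
val updated = graph.mapVertices { (id, attr) => if (id >= 2L) attr + 1 else attr }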
An RDD is a distributed dataset, a partition is the unit of RDD storage, and the unit for processing an RDD is an element.
For example, if you read a large file from HDFS as an RDD, then the elements of this RDD are Strings (the lines of that file), and Spark stores this RDD across the cluster by partition. For you, as a Spark user, you only need to care about how to deal with the lines of that file, just as if you were writing a normal program reading a file from the local file system line by line. That's the power of Spark :)
Anyway, you have no idea which elements will be stored in a certain partition, so it doesn't make sense to update a certain partition.
The MapReduce programming model (and FP) doesn't really support updates of single values. Rather one is supposed to define a sequence of transformations.
Now when you have interdependent values, i.e. you cannot perform your transformation with a simple map but need to aggregate multiple values and update based on that value, then what you need to do is think of a way of grouping those values together then transforming each group - or define a monoidal operation so that the operation can be distributed and chopped up into substeps.
Group By Approach
Now I'll try to be a little more specific for your particular case. You say you have subtrees; is it possible to first map each node to a key that indicates the corresponding subtree? If so, you could do something like this:
nodes.map(n => (getSubTreeKey(n), n)).groupByKey().map ...
Monoid
(Strictly speaking you want a commutative monoid.) It's best to read http://en.wikipedia.org/wiki/Monoid#Commutative_monoid
For example, + is a monoidal operation: when one wishes to compute the sum of, say, an RDD of Ints, the underlying framework can chop the data into chunks, sum each chunk, and then sum up the resulting sums (possibly in more than just 2 steps). If you can find a monoid that will ultimately produce the same results you require from single updates, then you have a way to distribute your processing. E.g.
nodes.reduce(_ myMonoid _)
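To make that concrete, a small sketch using + as the monoid (score is a hypothetical per-node function; getSubTreeKey is from the snippet above):

// Global variant: Spark sums each partition, then combines the partial sums.
val total = sc.parallelize(1 to 1000000).reduce(_ + _)

// Per-subtree variant: the same monoid applied within each key.
val perSubtree = nodes.map(n => (getSubTreeKey(n), score(n))).reduceByKey(_ + _)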

Seq for fast random access and fast growth in Scala

What would be the best Scala collection (in 2.8+), mutable or immutable, for the following scenario:
Sequentially ordered, so I can access items by position (a Seq)
Need to insert items frequently, so the collection must be able to grow without too much penalty
Random access, frequently need to remove and insert items at arbitrary indexes in the collection
Currently I seem to be getting good performance with the mutable ArrayBuffer, but is there anything better? Is there an immutable alternative that would do as well? Thanks in advance.
Mutable: ArrayBuffer
Immutable: Vector
If you insert items at random positions more than a log(N)/N fraction of the time that you access them, then you should probably use immutable.TreeSet, since all of its operations are O(log(N)). If you mostly do accesses or add to the (far) end, ArrayBuffer and Vector work well.
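For reference, a small sketch of the arbitrary-index insert/remove operations the question asks about, on both recommendations (the values are made up):

import scala.collection.mutable.ArrayBuffer

// Mutable: in-place insert/remove at an index (elements after the index are shifted).
val ab = ArrayBuffer(1, 2, 3, 4)
ab.insert(2, 99)                            // ArrayBuffer(1, 2, 99, 3, 4)
ab.remove(0)                                // ArrayBuffer(2, 99, 3, 4)

// Immutable: Vector never changes; patch returns a new collection instead.
val v  = Vector(1, 2, 3, 4)
val v2 = v.patch(2, Vector(99), 0)          // insert at index 2 -> Vector(1, 2, 99, 3, 4)
val v3 = v2.patch(0, Vector.empty[Int], 1)  // remove index 0 -> Vector(2, 99, 3, 4)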
Vector. IndSeq from scalaz should be even better.