How to merge data from Enumerators in Scala - scala

I'm using enumerator/iteratee from the Play framework.
I have several enumerators that each provide a sorted sequence of values. I want to write an Iteratee/Enumeratee that merges the values from these enumerators into a single sorted sequence of all values.
Is it a good idea to use an Iteratee, or should I implement an Enumeratee directly?
I know that I can zip the values from the enumerators, reconstruct their data streams in memory, and then merge that data.
But I'm wondering if there is a way to implement a "classic" merge sort: "read" the first value from all enumerators, select the minimal value, and then let the enumerator that provided it read another value (while the other enumerators are on hold). As a result, I want an enumeratee that provides the resulting sorted sequence without storing all the streams in memory. I would also like to follow a functional style and keep everything immutable.
Thanks for ideas.

You will still need to do some insertion sorting in a standard collection in memory. Imagine this pathological case:
Enumerator(3, 2, 1) and Enumerator(4, -1, -2, -3)
Here you cannot just take the smallest element and tack it onto the end of your collection; you will have to put values at arbitrary places in the collection as you go. This is part of what makes sorting fundamentally O(n log n): you have to know the full universe of values you are sorting in order to do any better than that. (Bucket sort is a linear-time sorting algorithm that assumes you know the distribution of the values you are trying to sort.)
To address your question more specifically:
The enumerator/iteratee library isn't really expressive enough for your use case. If you want to merge enumerators you can use Enumerator.interleave and do some insertion sorting in your Iteratee with whatever elements come in first.
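A rough sketch of that approach, assuming Play's play.api.libs.iteratee API (the insertion step is O(n) per element, so this is illustrative rather than optimal):
import play.api.libs.iteratee._
import scala.concurrent.ExecutionContext.Implicits.global

val e1 = Enumerator(1, 4, 7)
val e2 = Enumerator(2, 3, 9)

// Interleave both streams and insert each incoming element at its sorted
// position in an accumulator Vector.
val sortedF = Enumerator.interleave(e1, e2) |>>>
  Iteratee.fold[Int, Vector[Int]](Vector.empty) { (acc, n) =>
    val (smaller, rest) = acc.span(_ <= n)
    (smaller :+ n) ++ rest
  }
// sortedF: Future[Vector[Int]], eventually Vector(1, 2, 3, 4, 7, 9)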
If this mechanic is important to you, consider the recently released Akka Streams, where you can implement a custom FlexiMerge push/pull stage that will let you do exactly what you are after.

Related

Implementation of Spark distinct

I am new to Spark and Scala. I was reading up on Spark's distinct() function, but I could not find any proper details. I have a few doubts which I could not resolve and have written them down.
How is distinct() implemented in Spark?
I am not familiar enough with the Spark source code to identify the whole flow.
When I check the execution plan, I can only see a ShuffleRDD.
What is the time complexity of distinct?
From Google searching I also found that it uses hashing and sorting in some way.
So I wondered whether it uses the same principle as getting unique elements from an array with the help of a HashSet.
If it were a single system, I would have guessed that the time complexity is O(n log n).
But since the data is distributed among many partitions and shuffled, what would the time complexity be?
Is there a way to avoid shuffling in particular cases?
If I make sure to partition my data properly for my use case, can I avoid shuffling?
For example, say exploding an ArrayType column with unique rows in a DataFrame creates new rows with the other columns duplicated, and I then select those other columns. In this way I have made sure the duplicates are unique per partition. Since I know the duplicates are unique per partition, can I avoid the shuffle and just drop the duplicates within each partition?
I also found this: Does spark's distinct() function shuffle only the distinct tuples from each partition.
Thanks for your help. Please correct me if I am wrong anywhere.
How is distinct() implemented in Spark?
By applying a dummy aggregation with a None value. Roughly:
rdd.map((_, None)).reduceByKey((a, b) => a).map(_._1)
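A quick way to see the equivalence, as a hedged sketch (using Unit as the throwaway value and assuming sc is an existing SparkContext; this is not the exact Spark source):
val data = sc.parallelize(Seq(1, 2, 2, 3, 3, 3))
// Built-in distinct
val viaDistinct = data.distinct().collect().sorted // Array(1, 2, 3)
// The dummy-aggregation sketch from above
val viaReduce = data.map(x => (x, ())).reduceByKey((a, b) => a).map(_._1).collect().sorted // Array(1, 2, 3)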
What is the time complexity of distinct?
Given the overall complexity of the process it is hard to estimate. It is at least O(N log N), since a shuffle requires a sort, but given the multiple other operations required to build additional off-core data structures (including associative arrays) and to serialize/deserialize the data, it can be higher. In practice it is dominated by IO operations, not by pure algorithmic complexity.
Is there a way to avoid shuffling in particular cases?
Yes, if potential duplicates are guaranteed to be placed on the same partition.
You can use mapPartitions to deduplicate the data, especially if the data is sorted or otherwise guaranteed to have its duplicates in an isolated neighborhood. Without that guarantee you may be limited by memory requirements, unless you accept approximate results with a probabilistic filter (like a Bloom filter).
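A minimal sketch of that partition-local deduplication, assuming duplicates never cross partition boundaries (the element type String and the name rdd are just placeholders for illustration):
import scala.collection.mutable

// Drop duplicates within each partition; no shuffle is involved.
val deduped = rdd.mapPartitions { iter =>
  val seen = mutable.HashSet.empty[String]
  iter.filter(seen.add) // HashSet.add returns false for elements already seen
}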
In general, though, it is not possible, and an operation like this will be non-local.

Iterating through an RDD without doing any changes to it

So I have an RDD that is paired with the indexes of its elements. I would like to simply iterate over it (or use a nice Spark function for this, if one exists) and check adjacent elements by comparing one of their values. If the adjacent elements fulfill this check, I would like to note down their indexes in a different, non-RDD structure, maybe a ListBuffer.
Is this possible with some special Spark function, or do I have to simply iterate through it manually, and how would I then do that?
One of the main characteristics of an RDD is that it is immutable. Once it is created, you can iterate over it as many times as you want, but you won't be able to make any changes to it.
If you want to make changes, you need to create a new RDD via a transformation.
Additionally, if you want to iterate over an RDD and check adjacent elements, that logic will most likely not work very well: an RDD is distributed, and you usually have no guarantee about which records are next to each other. You could influence this by specifying a partitioner to group your data, but I still wouldn't count on adjacency unless you explicitly use a function to group your data.
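Since your elements already carry indexes, one common pattern is a self-join on the successor index. A hedged sketch, assuming a hypothetical indexed: RDD[(Long, Double)] keyed by index and a result small enough to collect:
val shifted = indexed.map { case (i, v) => (i + 1, v) } // value of the previous element, keyed by i
val adjacent = indexed.join(shifted) // (i, (current, previous))

val matchingIndexes = adjacent
  .filter { case (_, (cur, prev)) => cur == prev } // put your adjacency check here
  .keys
  .collect() // brings the (small) result to the driver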
If you post some sample data it may be easier to help you with your question.

What is the efficient way to update value inside Spark's RDD?

I'm writing a graph-related program in Scala with Spark. The dataset has 4 million nodes and 4 million edges (you can treat it as a tree), but in each iteration I only edit a portion of it, namely a sub-tree rooted at a given node and the nodes on the path between that node and the root.
The iterations are dependent, meaning iteration i+1 needs the result from iteration i, so I need to store the result of each iteration for the next step.
I'm trying to find an efficient way to update the RDD, but have no clue so far. I found that PairRDD has a lookup function which could reduce the computation time from O(N) to O(M), where N is the total number of objects in the RDD and M is the number of elements in each partition.
So I'm wondering: is there any way to update an object in the RDD in O(M), or ideally O(1)? (I saw an email on Spark's mailing list saying that lookup can be modified to achieve O(1).)
Another question: if I could achieve O(M) for updating the RDD, could I increase the number of partitions to something larger than the number of cores I have and achieve better performance?
As functional data structures, RDDs are immutable and an operation on an RDD generates a new RDD.
Immutability of the structure does not necessarily mean full replication. Persistent data structures are a common functional pattern where operations on immutable structures yield a new structure while previous versions are maintained and often reused.
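A tiny illustration of that structural sharing with a plain Scala List (nothing Spark-specific):
val xs = List(2, 3, 4)
val ys = 1 :: xs // a "new" list, but the (2, 3, 4) tail is shared with xs rather than copied
// Both versions remain usable: xs == List(2, 3, 4), ys == List(1, 2, 3, 4)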
GraphX (a 'module' on top of Spark) is a graph API that uses this concept. From the docs:
Changes to the values or structure of the graph are accomplished by producing a new graph with the desired changes. Note that substantial parts of the original graph (i.e., unaffected structure, attributes, and indices) are reused in the new graph, reducing the cost of this inherently functional data structure.
It might be a solution for the problem at hand: http://spark.apache.org/docs/1.0.0/graphx-programming-guide.html
An RDD is a distributed dataset: a partition is the unit of RDD storage, and an element is the unit of processing.
For example, if you read a large file from HDFS as an RDD, the elements of that RDD are Strings (the lines of the file), and Spark stores the RDD across the cluster by partition. As a Spark user, you only need to care about how to deal with the lines of that file, just as if you were writing a normal program that reads a file from the local file system line by line. That's the power of Spark :)
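For instance, a minimal line-by-line sketch (the HDFS path is hypothetical; sc is an existing SparkContext):
val lines = sc.textFile("hdfs:///path/to/large-file.txt") // RDD[String], one element per line
// Work on elements just as if you were reading the file line by line:
val longLines = lines.filter(_.length > 80).count()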
Anyway, you have no idea which elements will be stored in a certain partition, so it doesn't make sense to update a certain partition.
The MapReduce programming model (and FP) doesn't really support updates of single values. Rather one is supposed to define a sequence of transformations.
Now when you have interdependent values, i.e. you cannot perform your transformation with a simple map but need to aggregate multiple values and update based on the result, then you need to think of a way of grouping those values together and transforming each group, or define a monoidal operation so that the work can be distributed and chopped up into substeps.
Group By Approach
Now I'll try to be a little more specific for your particular case. You say you have subtrees; is it possible to first map each node to a key that indicates the corresponding subtree? If so, you could do something like this:
nodes.map(n => (getSubTreeKey(n), n)).groupByKey().map ...
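Fleshed out slightly (getSubTreeKey and updateSubTree are hypothetical helpers: the first maps a node to the key of its subtree, the second rewrites all nodes of one subtree together):
val updated = nodes
  .map(n => (getSubTreeKey(n), n)) // key each node by its subtree
  .groupByKey() // gather each subtree in one place
  .flatMap { case (key, subtree) => updateSubTree(key, subtree) } // transform the whole group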
Monoid
Strictly speaking you want a commutative monoid; it's best to read http://en.wikipedia.org/wiki/Monoid#Commutative_monoid
For example, + is a monoidal operation: when you want to compute the sum of, say, an RDD of Ints, the underlying framework can chop the data into chunks, sum each chunk, and then sum the resulting sums (possibly in more than two steps). If you can find a monoid that ultimately produces the same results you require from single updates, then you have a way to distribute your processing. E.g.
nodes.reduce(_ myMonoid _)
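As a concrete (if trivial) illustration of the idea with + as the monoid, assuming sc is an existing SparkContext:
val numbers = sc.parallelize(1 to 1000)
// reduce sums each partition locally and then combines the partial sums;
// this only works because + is associative and commutative.
val total = numbers.reduce(_ + _) // 500500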

Appropriate collection type for selecting a random element efficiently in Scala

For a project I am working on, I need to keep track of up to several thousand objects. The collection I choose needs to support insertion, selection, and deletion of random elements. My algorithm performs each of these operations several times, so I would like a collection that can do all these in constant time.
Is there such a collection? If not, what are some trade-offs with existing collections? I am using Scala 2.9.1.
EDIT: By "random", I mean mathematically/probabilistically random, i.e., I would like to select elements randomly from the collection using Random or some other appropriate generator.
Define "random". If you mean indexed, then there's no such collection. You can have insertion/deletion in constant time if you give up the "random element" requirement -- ie, you have have non-constant lookup of the element which will be deleted or which will be the point of insertion. Or you can have constant lookup without constant insertion/deletion.
The collection that best approaches that requirement is the Vector, which provides O(log n) for these operations.
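A small sketch of random selection, deletion, and insertion on a Vector (each effectively O(log n), not strictly constant):
import scala.util.Random

val v = Vector(10, 20, 30, 40)
val i = Random.nextInt(v.length)
val picked = v(i) // random access by index
val removed = v.patch(i, Nil, 1) // new Vector with the element at i deleted
val inserted = v.patch(i, Seq(99), 0) // new Vector with 99 inserted at position i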
On the other hand, if you have the element which you'll be looking up or removing, then just pick a HashMap. It's not precisely constant time, but it is a fair approximation. Just make sure you have a good hash function.
As a starting point, take a look at The Scala 2.8 Collections API especially at Performance Characteristics.

Seq for fast random access and fast growth in Scala

What would be the best Scala collection (in 2.8+), mutable or immutable, for the following scenario:
Sequentially ordered, so I can access items by position (a Seq)
Need to insert items frequently, so the collection must be able to grow without too much penalty
Random access, frequently need to remove and insert items at arbitrary indexes in the collection
Currently I seem to be getting good performance with the mutable ArrayBuffer, but is there anything better? Is there an immutable alternative that would do as well? Thanks in advance.
Mutable: ArrayBuffer
Immutable: Vector
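A quick sketch of arbitrary-index insertion and removal with both recommendations (illustrative only):
import scala.collection.mutable.ArrayBuffer

// Mutable: shifts the tail on insert/remove, O(n) worst case but fast in practice.
val buf = ArrayBuffer(1, 2, 3, 4)
buf.insert(2, 99) // ArrayBuffer(1, 2, 99, 3, 4)
buf.remove(2) // back to ArrayBuffer(1, 2, 3, 4)

// Immutable: returns a new Vector, effectively O(log n) per operation.
val vec = Vector(1, 2, 3, 4)
val vec2 = vec.patch(2, Seq(99), 0) // Vector(1, 2, 99, 3, 4)
val vec3 = vec2.patch(2, Nil, 1) // Vector(1, 2, 3, 4)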
If you insert items at random positions more than log(N)/N of the time that you access them, then you should probably use immutable.TreeSet as all operations are O(log(N)). If you mostly do accesses or add to the (far) end, ArrayBuffer and Vector work well.
Vector. IndSeq from scalaz should be even better.