Converting a large sequence to a sequence with no repeats in scala really fast - scala

So I have this large sequence, with a lot of repeats as well, and I need to convert it into a sequence with no repeats. What I have been doing so far is converting the sequence to a set, and then back to the original sequence type. Conversion to the set gets rid of the duplicates, and then I convert back into a sequence. However, this is very slow, as I'm given to understand that when converting to a set, every pair of elements is compared, and this makes the complexity O(n^2), which is not acceptable. And since I have access to a computer with thousands of cores (through my university), I was wondering whether making things parallel would help.
Initially I thought I'd use Scala Futures to parallelize the code in the following manner. Group the elements of the sequence into smaller subgroups by their hash code. That way, I have subcollections of the original sequence, such that no element appears in two different subcollections and every element is covered. Now I convert these smaller subcollections to sets, and back to sequences, and concatenate them. This way I'm guaranteed to get a sequence with no repeats.
But I was wondering if applying the toSet method on a parallel sequence already does this. I thought I'd test this out in the scala interpreter, but I got roughly the same time for the conversion to parallel set vs the conversion to the non parallel set.
I was hoping someone could tell me whether conversion to parallel sets works this way or not. I'd be much obliged. Thanks.
EDIT: Is performing a toSet on a parallel collection faster than performing toSet on a non parallel collection?

.distinct with some of the Scala collection types is O(n) (as of Scala 2.11). It uses a hash set to record what has already been seen. With this, it linearly builds up a list:
def distinct: Repr = {
  val b = newBuilder
  val seen = mutable.HashSet[A]()
  for (x <- this) {
    if (!seen(x)) {
      b += x
      seen += x
    }
  }
  b.result()
}
(newBuilder is like a mutable list.)
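To make the linear-time idea concrete, here is a minimal standalone sketch of the same technique outside the collections library. The object name DistinctDemo and the method name dedupe are made up for illustration; in real code you would most likely just call .distinct.

```scala
import scala.collection.mutable

object DistinctDemo {
  // Order-preserving de-duplication: one pass over the input,
  // with a hash set remembering what has already been emitted.
  // Expected cost is O(n), assuming a reasonable hashCode.
  def dedupe[A](xs: Seq[A]): Seq[A] = {
    val seen = mutable.HashSet.empty[A]
    val out = Seq.newBuilder[A]
    // HashSet.add returns true only when the element was not present yet
    for (x <- xs) if (seen.add(x)) out += x
    out.result()
  }
}
```

Calling DistinctDemo.dedupe(Seq(3, 1, 3, 2, 1)) keeps the first occurrence of each element, which is also what Seq(3, 1, 3, 2, 1).distinct does.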

Just thinking outside the box: would it be possible to prevent these duplicates from being created in the first place, instead of trying to get rid of them afterwards?

Related

What is the fastest way to join two iterables (or sequences) in Scala?

Say I want to create an Iterable by joining multiple Iterables. What would be the fastest way to do so?
Documentation for ListBuffer's ++= says that it
Appends all elements produced by a TraversableOnce to this list buffer.
which sounds like it appends the elements one by one and hence should take Ω(size of iterable to append) time. I want something that joins the two Iterables in O(1) time. Does the ::: method of List run in O(1) time? The documentation simply states that it
Adds the elements of a given list in front of this list.
I also checked out performance characteristics of scala collections but it says nothing about joining two collections.
You can create a Stream:
Stream(a, b, c).flatten
or concatenate iterators (this gives you Stream at runtime anyway)
(a.iterator ++ b.iterator ++ c.iterator).toIterable
(assuming that a, b and c are Iterable[Int])
That will give you something along the lines of O(number of collections to join)
Streams are evaluated on demand, so only little memory allocation will happen for closures until you actually request any elements. Converting to anything else is unavoidably O(n).
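As a small sketch of the iterator variant, assuming the inputs are plain Iterables: the helper below (LazyConcat and concat are hypothetical names) chains any number of collections without copying their elements up front. Elements are only pulled from each input when the resulting iterator is advanced.

```scala
object LazyConcat {
  // Chains the given collections lazily. Setting this up costs
  // O(number of collections); no element is touched until the
  // returned iterator is consumed.
  def concat[A](parts: Iterable[A]*): Iterator[A] =
    parts.iterator.flatMap(_.iterator)
}
```

For example, LazyConcat.concat(List(1, 2), Vector(3), List(4)).toList walks the three inputs in order. Note that the result is an Iterator, so it can be traversed only once; on Scala 2.13 and later, Stream is deprecated in favour of LazyList if you need a reusable lazy sequence.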

Scala: how to copy part of a List to another List

I am new to Scala.
I have a list
origList = List[Double] with thousands of elements.
I need to create another list
outList = List[Double]
and copy to it the elements from origList with indices
start, start+1, ..., start+nCopy-1
that is the output list will have nCopy elements.
This part of the code will be executed many times. What is the most efficient way to do that in Scala?
The way people usually do it in scala is list.slice(start, start+nCopy).
Note that List in Scala is not a random-access container like ArrayList is in Java. It is implemented as a linked list, so, especially if you are going to do this many times, it will help significantly to convert your list to something indexed beforehand: val converted = list.toIndexedSeq or, better, val converted = list.toArray.
.slice on an Array or on IndexedSeq will be much more efficient, especially if start index is high.
Now, if you are really concerned about the efficiency of this one operation, nothing (unfortunately) beats the good old Java approach:
val converted = list.toArray
val copied = java.util.Arrays.copyOfRange(converted, start, start+nCopy)
This can be orders of magnitude faster than converted.slice (let alone list.slice) when copying a large enough number of elements (hundreds).
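Put together, the array-based approach might look like the sketch below. SliceDemo and copyRange are hypothetical names; the one-time toArray conversion is assumed to happen outside the hot loop.

```scala
object SliceDemo {
  // Copies nCopy elements starting at index start.
  // Caveat: unlike slice, copyOfRange pads with 0.0 if the requested
  // range runs past the end of the array, and throws if start is
  // greater than the array length.
  def copyRange(arr: Array[Double], start: Int, nCopy: Int): Array[Double] =
    java.util.Arrays.copyOfRange(arr, start, start + nCopy)
}
```

Usage would be something like val arr = origList.toArray once, followed by SliceDemo.copyRange(arr, start, nCopy) each time a chunk is needed. If you need a List back, the final .toList costs O(nCopy) again, so this pays off mainly when you can keep working with arrays.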

Spark: Distributed removal/addition of elements in a set?

I am trying to convert a ML algorithm to Spark Scala to take advantage of my cluster's power. The relevant bits of pseudo-code are the following:
initialize set of elements
while(set not empty) {
while(...) { remove a given element from the set }
while(...) { add a given element to the set }
}
Is there any way to parallelize such a thing?
I would intuitively say that this is not implementable in a distributed fashion (the number of iterations being unknown), but I have been reading that Spark allows implementation of iterative ML algorithms.
Here is what I tried so far:
Originally used a mutable Set and removed/added elements during the loops in simple Scala. It runs correctly, but I feel like the whole code will just be executed on the driver which limits the interest of using Spark?
Made the set a RDD, and replaced the var during every iteration by a new RDD with subtracted/added element (which I suppose is super heavy?). No error appears but the variable doesn't actually get updated.
mySetRDD = mySetRDD.subtract(sc.parallelize(Seq(element)))
Looked up Accumulators for a way to keep a set of elements updated on its content (presence/absence of elements) across multiple executors, but they do not seem to allow things other than simple updates of numerical values.
Create a PairRDD and repartition it by key (for example with partitionBy) into, say, x partitions.
After that you can use
PairRdd1.zipPartitions() to get iterators over the corresponding partitions of the two RDDs. Then you can write a function that works over the two iterators to produce a third, output iterator.
Since you have repartitioned the RDD by key, you do not need to keep track of removals across partitions.
https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/rdd/RDD.html#zipPartitions(org.apache.spark.rdd.RDD, boolean, scala.Function2, scala.reflect.ClassTag, scala.reflect.ClassTag)

How to groupBy an iterator without converting it to list in scala?

Suppose I want to groupBy on an iterator; the compiler complains that "value groupBy is not a member of Iterator[Int]". One way would be to convert the iterator to a list, which I want to avoid. I want to do the groupBy such that the input is Iterator[A] and the output is Map[B, Iterator[A]], so that each part of the iterator is loaded only when its elements are accessed, without loading the whole list into memory. I also know the possible set of keys, so I can say whether a particular key exists.
def groupBy[A, B](iter: Iterator[A], f: A => B): Map[B, Iterator[A]] = {
.........
}
One possibility is, you can convert Iterator to view and then groupBy as,
iter.toTraversable.view.groupBy(_.whatever)
I don't think this is doable without storing results in memory (and in this case switching to a list would be much easier). Iterator implies that you can make only one pass over the whole collection.
For instance, let's say you have a sequence 1 2 3 4 5 6 and you want to groupBy odd and even numbers:
groupBy(it, v => v % 2 == 0)
Then you could query the result with either true or false to get an iterator. The problem is that if you loop over one of those two iterators till the end, you can't do the same thing for the other one (as you cannot reset an iterator in Scala).
This would be doable if the elements were sorted according to the same rule you're using in groupBy.
As said in other responses, the only way to achieve a lazy groupBy on Iterator is to internally buffer elements. The worst case for the memory will be in O(n). If you know in advance that the keys are well distributed in your iterator, the buffer can be a viable solution.
The solution is relatively complex, but a good start are some methods from the Iterator trait in the Scala source code:
The partition method that uses both the buffered method to keep the head value in memory, and two internal queues (lookahead) for each of the produced iterators.
The span method, which also uses the buffered method, this time with a single queue for the leading iterator.
The duplicate method. Perhaps less interesting, but we can again observe another use of a queue to store the gap between the two produced iterators.
In the groupBy case, we will have a variable number of produced iterators instead of two in the above examples. If requested, I can try to write this method.
Note that you have to know the list of keys in advance. Otherwise, you will need to traverse (and buffer) the entire iterator to collect the different keys to build your Map.
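For what it's worth, here is a rough sketch of the buffering idea, not the library's implementation. It assumes the set of keys is known up front, as the question states; LazyGroupBy and the keys parameter are hypothetical names. Each key gets its own queue, and asking one sub-iterator for its next element pulls from the source until a matching element turns up, parking everything else in the other queues. It is not thread-safe, and elements whose key is not in keys are silently dropped.

```scala
import scala.collection.mutable

object LazyGroupBy {
  def groupBy[A, B](iter: Iterator[A], keys: Set[B])(f: A => B): Map[B, Iterator[A]] = {
    // One buffer per known key; worst case they hold O(n) elements in total.
    val queues: Map[B, mutable.Queue[A]] =
      keys.map(k => k -> mutable.Queue.empty[A]).toMap

    def sub(k: B): Iterator[A] = new Iterator[A] {
      private val mine = queues(k)
      // Advance the shared source until this key's buffer is non-empty
      // or the source is exhausted.
      private def fill(): Unit =
        while (mine.isEmpty && iter.hasNext) {
          val x = iter.next()
          queues.get(f(x)).foreach(_.enqueue(x))
        }
      def hasNext: Boolean = { fill(); mine.nonEmpty }
      def next(): A = { fill(); mine.dequeue() }
    }

    keys.map(k => k -> sub(k)).toMap
  }
}
```

With the odd/even example, val g = LazyGroupBy.groupBy(Iterator(1, 2, 3, 4, 5, 6), Set(true, false))(_ % 2 == 0) lets you drain g(true) first and g(false) afterwards. Draining one group completely forces the whole source to be read, with the other group's elements sitting in memory, which is exactly the O(n) worst case mentioned above.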

Is this scala parallel array code threadsafe?

I want to use parallel arrays for a task, and before I start with the coding, I'd be interested in knowing if this small snippet is threadsafe:
import collection.mutable._
var listBuffer = ListBuffer[String]("one","two","three","four","five","six","seven","eight","nine")
var jSyncList = java.util.Collections.synchronizedList(new java.util.ArrayList[String]())
listBuffer.par.foreach { e =>
println("processed :"+e)
// using sleep here to simulate a random delay
Thread.sleep((scala.math.random * 1000).toLong)
jSyncList.add(e)
}
jSyncList.toArray.foreach(println)
Are there better ways of processing something with parallel collections, and accumulating the results elsewhere?
The code you posted is perfectly safe; I'm not sure about the premise though: why do you need to accumulate the results of a parallel collection in a non-parallel one? One of the whole points of the parallel collections is that they look like other collections.
I think that parallel collections also will provide a seq method to switch to sequential ones. So you should probably use this!
For this pattern to be safe:
listBuffer.par.foreach { e => f(e) }
f has to be able to run concurrently in a safe way. I think the same rules that you need for safe multi-threading apply (access to shared state needs to be thread safe, the order of the f calls for different e won't be deterministic, and you may run into deadlocks as you start synchronizing your statements in f).
Additionally, I'm not clear what guarantees the parallel collections give you about the underlying collection being modified while being processed, so a mutable list buffer which can have elements added/removed is possibly a poor choice. You never know when the next coder will call something like foo(listBuffer) before your foreach and pass that reference to another thread which may mutate the list while it's being processed.
Other than that, I think for any f that will take a long time, can be called concurrently and where e can be processed out of order, this is a fine pattern.
immutCol.par.foreach { e => threadSafeOutOfOrderProcessingOf(e) }
disclaimer: I have not tried parallel collections myself, but I'm looking forward to having SO questions/answers show us what works well.
The synchronisedList should be safe, though the println may give unexpected results - you have no guarantees of the order that items will be printed, or even that your printlns won't be interleaved mid-character.
A synchronised list is also unlikely to be the fastest way you can do this. A safer solution is to map over an immutable collection (Vector is probably your best bet here), then print all the lines (in order) afterwards:
val input = Vector("one","two","three","four","five","six","seven","eight","nine")
val output = input.par.map { e =>
val msg = "processed :" + e
// using sleep here to simulate a random delay
Thread.sleep((math.random * 1000).toLong)
msg
}
println(output mkString "\n")
You'll also note that this code has about as much practical usefulness as your example :)
This code is plain weird -- why add stuff in parallel to something that needs to be synchronized? You'll add contention and gain absolutely nothing in return.
The principle of the thing, accumulating results from parallel processing, is better achieved with stuff like fold, reduce or aggregate.
The code you've posted is safe - there will be no errors due to inconsistent state of your array list, because access to it is synchronized.
However, parallel collections process items concurrently (at the same time), AND out of order. Out of order means that the 54th element may be processed before the 2nd element, so your synchronized array list will contain items in no predefined order.
In general it's better to use map, filter and other functional combinators to transform a collection into another collection - these will ensure that the ordering guarantees are preserved if a collection has some (like Seqs do). For example:
ParArray(1, 2, 3, 4).map(_ + 1)
always returns ParArray(2, 3, 4, 5).
However, if you need a specific thread-safe collection type such as a ConcurrentSkipListMap or a synchronized collection to be passed to some method in some API, modifying it from a parallel foreach is safe.
Finally, a note: parallel collections provide parallel bulk operations on data. Mutable parallel collections are not thread-safe in the sense that you can add elements to them from different threads. Mutable operations like insertion into a map or appending to a buffer still have to be synchronized.