Scala, waiting to execute foreach on parallel collection - scala

I have code like below
val g = new Graph(vertices)

//First part
(1 to vertices).par.foreach(i => g + new Vertex(i))

//Second part
for (i <- 1 to edges) {
  val data = scala.io.StdIn.readLine()
  val d = data.split(" ")
  val v1 = d(0).toInt
  val v2 = d(1).toInt
  val length = d(2).toInt
  g + (v1, v2, length)
}
I want to execute the first and second parts of the code sequentially.
At present, the for loop runs before all the Vertex instances have been added to g.
In the code, + (plus) is defined to add a new Vertex instance to a MutableList.
I'm new to Scala, please help.

You could wrap each part in the following:
new Thread(new Runnable {
  override def run(): Unit = {
    //Code part here:
  }
}).start()
You also need to ensure that your Graph implementation is thread-safe, as you will have two threads modifying it concurrently.
It doesn't look like you're returning anything from either part, but if you were, you could use a Future instead. See here for details.
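For illustration, a hedged sketch of sequencing the two parts with Futures (g, vertices, edges, Graph, and Vertex are taken from the question; .par is used as in the question, assuming pre-2.13 parallel collections, and the Graph must still be thread-safe for the parallel first part):
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

val firstPart: Future[Unit] = Future {
  (1 to vertices).par.foreach(i => g + new Vertex(i))   // parallel insertion, Graph must be thread-safe
}
val secondPart: Future[Unit] = firstPart.map { _ =>     // runs only after firstPart has completed
  for (i <- 1 to edges) {
    val d = scala.io.StdIn.readLine().split(" ")
    g + (d(0).toInt, d(1).toInt, d(2).toInt)
  }
}
Await.result(secondPart, Duration.Inf)                   // block the main thread until both parts are done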

A parallel collection only parallelises the computation over that collection. The second part of the computation, where you add the edges, should already happen after the vertices have been added.
My guess is that the parsing may produce some vertex indices that do not exist in the graph at all, perhaps because of empty spaces or something similar.
I am not sure what happens while adding the vertices to the graph, but foreach on a parallel collection should be used carefully, since the operation is expected to be free of side effects. Please see this for more information. Maybe this is not even relevant.

I found the solution. I read more about adding new elements to a collection in parallel, and it isn't thread-safe. I replaced the MutableList with a fixed-size array, and I add each new element by index.
Some code below:
class Graph(val end: Int) {
  private val vertices: Array[Vertex] = new Array[Vertex](end)

  def +(index: Int, v: Vertex): Unit = {
    vertices(index) = v
  }
  (...)
}

//First part
(1 to vertices).par.foreach(i => g + (i - 1, new Vertex(i))) //add new vertex to array by index

Related

Should I use an object class or broadcast variable

I have a coordinates RDD[(Int, Int)] and I want to create a new RDD[(Int, (Int, Int))]. What is the best practice?
object GlobalVariables {
  private var pointId: Int = 0

  def newPointId(): Long = {
    pointId += 1
    pointId
  }
}

points = coordinates.map(x => (GlobalVariables.newPointId(), x._1, x._2))
Is this code executed on workers, or should I use a combination of broadcast variables and accumulators?
If the code is executed on workers how can I be sure that I will not have any concurrency error?
You can try another solution without the need for a mutable counter. The transformation zipWithIndex provides stable indexing, numbering each element in its original order.
Example:
val myRdd = sc.parallelize(Seq(1, 2, 3))
val zippedWithIndex = myRdd.zipWithIndex // ((1,0),(2,1),(3,2))
After this first transformation you can flip the value and the index:
val result = zippedWithIndex.map { case (value, index) => (index, value) } // ((0,1),(1,2),(2,3))
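Applied to the coordinates RDD from the question, a hedged sketch might look like the following (sc is assumed to be the SparkContext, and the sample coordinates are made up; zipWithIndex produces Long indices, so toInt is used to match the desired RDD[(Int, (Int, Int))]):
import org.apache.spark.rdd.RDD

val coordinates: RDD[(Int, Int)] = sc.parallelize(Seq((10, 20), (30, 40), (50, 60)))
val points: RDD[(Int, (Int, Int))] =
  coordinates.zipWithIndex.map { case (coord, index) => (index.toInt, coord) }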

Scala Spark not returning value outside loop [duplicate]

I am new to Scala and Spark and would like some help in understanding why the below code isn't producing my desired outcome.
I am comparing two tables.
My desired output schema is:
case class DiscrepancyData(fieldKey:String, fieldName:String, val1:String, val2:String, valExpected:String)
When I run the below code step by step manually, I actually end up with my desired outcome, which is a List[DiscrepancyData] completely populated with my desired output. However, I must be missing something in the code below because it returns an empty list (before this code gets called, other code is involved in reading tables from Hive, mapping, grouping, filtering, etc.):
val compareCols = Set(year, nominal, adjusted_for_inflation, average_private_nonsupervisory_wage)
val key = "year"
def compare(table: RDD[(String, Iterable[Row])]): List[DiscrepancyData] = {
  var discs: ListBuffer[DiscrepancyData] = ListBuffer()

  def compareFields(fieldOne: String, fieldTwo: String, colName: String, row1: Row, row2: Row): DiscrepancyData = {
    if (fieldOne != fieldTwo) {
      DiscrepancyData(
        row1.getAs(key).toString,     //fieldKey
        colName,                      //fieldName
        row1.getAs(colName).toString, //table1Value
        row2.getAs(colName).toString, //table2Value
        row2.getAs(colName).toString) //expectedValue
    }
    else null
  }

  def comparison() {
    for (row <- table) {
      var elem1 = row._2.head      //gets the first element in the iterable
      var elem2 = row._2.tail.head //gets the second element in the iterable
      for (col <- compareCols) {
        var value1 = elem1.getAs(col).toString
        var value2 = elem2.getAs(col).toString
        var disc = compareFields(value1, value2, col, elem1, elem2)
        if (disc != null) discs += disc
      }
    }
  }

  comparison()
  discs.toList
}
I'm calling the above function as such:
var outcome = compare(groupedFiltered)
Here is the data in groupedFiltered:
(1991,CompactBuffer([1991,7.14,5.72,39%], [1991,4.14,5.72,39%]))
(1997,CompactBuffer([1997,4.88,5.86,39%], [1997,3.88,5.86,39%]))
(1999,CompactBuffer([1999,5.15,5.96,39%], [1999,5.15,5.97,38%]))
(1947,CompactBuffer([1947,0.9,2.94,35%], [1947,0.4,2.94,35%]))
(1980,CompactBuffer([1980,3.1,6.88,45%], [1980,3.1,6.88,48%]))
(1981,CompactBuffer([1981,3.15,6.8,45%], [1981,3.35,6.8,45%]))
The table schema for groupedFiltered:
(year String,
nominal Double,
adjusted_for_inflation Double,
average_private_nonsupervisory_wage String)
Spark is a distributed computing engine. In addition to the "what is the code doing" of classic single-node computing, with Spark we also need to consider "where is the code running".
Let's inspect a simplified version of the expression above:
val records: RDD[List[String]] = ??? // whatever data
val list: mutable.ListBuffer[String] = mutable.ListBuffer()

for { record <- records
      entry  <- record }
    { list += entry }
The Scala for-comprehension makes this expression look like a natural local computation, but in reality the RDD operations are serialized and "shipped" to executors, where the inner operation will be executed locally. We can rewrite the above like this:
records.foreach { record =>   // RDD.foreach => serializes closure and executes remotely
  record.foreach { entry =>   // record.foreach => local operation on the record collection
    list += entry             // this mutable list object is updated in each executor but never sent back to the driver. All updates are lost
  }
}
Mutable objects are in general a no-go in distributed computing. Imagine that one executor adds a record and another one removes it, what's the correct result? Or that each executor comes to a different value, which is the right one?
To implement the operation above, we need to transform the data into our desired result.
I'd start by applying another best practice: do not use null as a return value. I also moved the row operations into the function. Let's rewrite the comparison operation with this in mind:
def compareFields(colName: String, row1: Row, row2: Row): Option[DiscrepancyData] = {
  val key = "year"
  val v1 = row1.getAs(colName).toString
  val v2 = row2.getAs(colName).toString
  if (v1 != v2) {
    Some(DiscrepancyData(
      row1.getAs(key).toString, //fieldKey
      colName,                  //fieldName
      v1,                       //table1Value
      v2,                       //table2Value
      v2)                       //expectedValue
    )
  } else None
}
Now, we can rewrite the computation of discrepancies as a transformation of the initial table data:
val discrepancies = table.flatMap { case (key, rows) =>
  val (row1, row2) = (rows.head, rows.tail.head)
  compareCols.flatMap { col => compareFields(col, row1, row2) }
}
We can also use the for-comprehension notation, now that we understand where things are running:
val discrepancies = for {
  (key, rows) <- table
  col         <- compareCols
  dis         <- compareFields(col, rows.head, rows.tail.head)
} yield dis
Note that discrepancies is of type RDD[DiscrepancyData]. If we want to get the actual values to the driver we need to:
val materializedDiscrepancies = discrepancies.collect()
Iterating through an RDD and updating a mutable structure defined outside the loop is a Spark anti-pattern.
Imagine this RDD being spread over 200 machines. How can these machines be updating the same Buffer? They cannot. Each JVM will be seeing its own discs: ListBuffer[DiscrepancyData]. At the end, your result will not be what you expect.
To conclude, this is perfectly valid (though not idiomatic) Scala code, but not valid Spark code. If you replace the RDD with an Array it will work as expected.
Try to have a more functional implementation along these lines:
val finalRDD: RDD[DiscrepancyData] = table.map(???).filter(???)
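For illustration, one hedged way to fill in those placeholders, reusing the asker's original five-argument compareFields and picking the two rows per key as in the question (a flatMap fits the one-to-many shape better than a plain map here):
val finalRDD: RDD[DiscrepancyData] =
  table.flatMap { case (_, rows) =>
    val (r1, r2) = (rows.head, rows.tail.head)
    // one candidate discrepancy per compared column, exactly as in the asker's inner loop
    compareCols.map(col => compareFields(r1.getAs(col).toString, r2.getAs(col).toString, col, r1, r2))
  }.filter(_ != null)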

Scala :: PriorityQueue => update priority of specific item

I have implemented a routine to extract the item at the head, update its priority, and put it back into the queue with a non-blocking approach like this (using AtomicReference):
def head(): Entry = {
  def takeAndUpdate(e: Entry, success: Boolean): Entry = {
    if (success) {
      return e
    }
    val oldQueue = queueReference.get()
    val newQueue = oldQueue.clone()
    val item = newQueue.dequeue().increase()
    newQueue += item
    takeAndUpdate(item.e, queueReference.compareAndSet(oldQueue, newQueue))
  }
  takeAndUpdate(null, false)
}
Now I need to find an arbitrary Entry in the queue, change its priority, and put it back into the queue. It seems that PriorityQueue doesn't support this, so which class should I use to accomplish the desired behavior?
It's related to Change priority of items in a priority queue
Use an immutable tree map (immutable.TreeMap) to accomplish this. You have to find the entry you want somehow - the best way to do this is to use the part of the information you search by as the key (call this Key), and the actual entry that you wish to return as the value in the map (call this Entry).
Rename queueReference to treeReference. In the part of the code where you create the collection, use immutable.TreeMap[Key, Entry](<list of elements>).
Then modify the code like this:
def updateKey(k: Key) {
  @annotation.tailrec def update(): (Key, Entry) = {
    val oldTree = treeReference.get()
    val entry = oldTree(k)
    val newPair = modifyHoweverYouWish(k, entry)
    val newTree = (oldTree - k) + newPair
    if (treeReference.compareAndSet(oldTree, newTree)) newPair
    else update()
  }
  update()
}
If the compareAndSet fails, you have to retry, as you already do in your source. It is best to use @tailrec as shown above, to ensure that the function is tail recursive and to prevent a potential stack overflow.
Alternatively, you could use an immutable.TreeSet - if you know the exact reference to your Entry object, you can use it to remove the element via - as above and then add it back after calling increase(). The code is almost the same.
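For illustration, a hedged sketch of the TreeSet variant (Entry, increase(), and the AtomicReference are assumed from the question; increase() is taken to return the re-prioritised entry, and Entry needs an implicit Ordering for the TreeSet):
import java.util.concurrent.atomic.AtomicReference
import scala.collection.immutable.TreeSet

// setReference: AtomicReference[TreeSet[Entry]], assumed to replace queueReference
@annotation.tailrec
def updatePriority(e: Entry): Entry = {
  val oldSet  = setReference.get()
  val updated = e.increase()               // assumed to return the updated entry
  val newSet  = (oldSet - e) + updated     // remove the old entry, add the re-prioritised one back
  if (setReference.compareAndSet(oldSet, newSet)) updated
  else updatePriority(e)                   // retry on contention, just like the TreeMap version
}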

Scala functional way of processing large scale data with lazy collections

I am trying to figure out memory-efficient AND functional ways to process a large scale of data using strings in Scala. I have read many things about lazy collections and have seen quite a few code examples. However, I run into "GC overhead limit exceeded" or "Java heap space" issues again and again.
Often the problem is that I try to construct a lazy collection, but evaluate each new element when I append it to the growing collection (I don't know any other way to do so incrementally). Of course, I could try something like initializing an initial lazy collection first and yielding the collection holding the desired values by applying the resource-critical computations with map or so, but often I simply do not know the exact size of the final collection a priori to initialize that lazy collection.
Maybe you could help me by giving me hints or explanations on how to improve the following code as an example, which splits a FASTA (definition below) formatted file into two separate files according to the rule that odd sequence pairs belong to one file and even ones to another ("separation of strands"). The "most" straightforward way to do so would be in an imperative way by looping through the lines and printing into the corresponding files via open file streams (and this of course works excellently). However, I just don't enjoy the style of reassigning to variables holding header and sequences, thus the following example code uses (tail-)recursion, and I would appreciate having found a way to maintain a similar design without running into resource problems!
The example works perfectly for small files, but already with files of around ~500 MB the code will fail with the standard JVM setup. I do want to process files of "arbitrary" size, say 10-20 GB or so.
val fileName = args(0)
val in = io.Source.fromFile(fileName) getLines
type itType = Iterator[String]
type sType = Stream[(String, String)]
def getFullSeqs(ite: itType) = {
  //val metaChar = ">"
  val HeadPatt = "(^>)(.+)" r
  val SeqPatt = "([\\w\\W]+)" r

  @annotation.tailrec
  def rec(it: itType, out: sType = Stream[(String, String)]()): sType =
    if (it hasNext) it next match {
      case HeadPatt(_, header) =>
        // introduce new header-sequence pair
        rec(it, (header, "") #:: out)
      case SeqPatt(seq) =>
        val oldVal = out head
        // concat subsequences
        val newStream = (oldVal._1, oldVal._2 + seq) #:: out.tail
        rec(it, newStream)
      case _ =>
        println("something went wrong my friend, oh oh oh!"); Stream[(String, String)]()
    } else out

  rec(ite)
}
def printStrands(seqs: sType) {
  import java.io.PrintWriter
  import java.io.File

  def printStrand(seqse: sType, strand: Int) {
    // only use sequences of one strand
    val indices = List.tabulate(seqs.size / 2)(_ * 2 + strand - 1).view
    val p = new PrintWriter(new File(fileName + "." + strand))
    indices foreach { i =>
      p.print(">" + seqse(i)._1 + "\n" + seqse(i)._2 + "\n")
    }; p.close
    println("Done bro!")
  }

  List(1, 2).par foreach (s => printStrand(seqs, s))
}
printStrands(getFullSeqs(in))
Three questions arise for me:
A) Let's assume one needs to maintain a large data structure obtained by processing the initial iterator you get from getLines, as in my getFullSeqs method (note the different size of in and the output of getFullSeqs), because transformations on the whole(!) data are required repeatedly and one does not know which part of the data one will require at any step. My example might not be the best, but how to do so? Is it possible at all?
B) What about when the desired data structure is not inherently lazy, say one would like to store the (header -> sequence) pairs in a Map? Would you wrap it in a lazy collection?
C) My implementation of constructing the stream might reverse the order of the input lines. When calling reverse, all elements will be evaluated (in my code, they already are, so this is the actual problem). Is there any way to post-process "from behind" in a lazy fashion? I know of reverseIterator, but is this already the solution, or will this not actually evaluate all elements first, too (as I would need to call it on a list)? One could construct the stream with newVal #:: rec(...), but I would lose tail recursion then, wouldn't I?
So what I basically need is to add elements to a collection, which are not evaluated by the process of adding. So lazy val elem = "test"; elem :: lazyCollection is not what I am looking for.
EDIT: I have also tried using a by-name parameter for the stream argument in rec.
Thank you so much for your attention and time, I really appreciate any help (again :) ).
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
FASTA is defined as a sequential set of sequences delimited by a single header line. A header is defined as a line starting with ">". Every line below the header is called part of the sequence associated with the header. A sequence ends when a new header is present. Every header is unique. Example:
>HEADER1
abcdefg
>HEADER2
hijklmn
opqrstu
>HEADER3
vwxyz
>HEADER4
zyxwv
Thus, sequence 2 is twice as big as seq 1. My program would split that file into a file A containing
>HEADER1
abcdefg
>HEADER3
vwxyz
and a second file B containing
>HEADER2
hijklmn
opqrstu
>HEADER4
zyxwv
The input file is assumed to consist of an even number of header-sequence pairs.
The key to working with really large data structures is to hold in memory only that which is critical to perform whatever operation you need. So, in your case, that's
Your input file
Your two output files
The current line of text
and that's it. In some cases you may need to store information such as how long a sequence is; in such cases, you build the data structures in a first pass and use them on a second pass. Let's suppose, for example, that you decide that you want to write three files: one for even records, one for odd, and one for entries where the total length is less than 300 nucleotides. You would do something like this (warning--it compiles but I never ran it, so it may not actually work):
final def findSizes(
  data: Iterator[String], sz: Map[String, Long] = Map(),
  currentName: String = "", currentSize: Long = 0
): Map[String, Long] = {
  def currentMap = if (currentName != "") sz + (currentName -> currentSize) else sz
  if (!data.hasNext) currentMap
  else {
    val s = data.next
    if (s(0) == '>') findSizes(data, currentMap, s, 0)
    else findSizes(data, sz, currentName, currentSize + s.length)
  }
}
Then, for processing, you use that map and pass through again:
import java.io._
final def writeFiles(
  source: Iterator[String], targets: Array[PrintWriter],
  sizes: Map[String, Long], count: Int = -1, which: Int = 0
) {
  if (!source.hasNext) targets.foreach(_.close)
  else {
    val s = source.next
    if (s(0) == '>') {
      val w = if (sizes.get(s).exists(_ < 300)) 2 else (count + 1) % 2
      targets(w).println(s)
      writeFiles(source, targets, sizes, count + 1, w)
    }
    else {
      targets(which).println(s)
      writeFiles(source, targets, sizes, count, which)
    }
  }
}
You then use Source.fromFile(f).getLines() twice to create your iterators, and you're all set. Edit: in some sense this is the key step, because this is your "lazy" collection. However, it's not important just because it doesn't read everything into memory immediately ("lazy"), but because it doesn't store any previous strings either!
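For illustration, a hedged sketch of how the two passes above might be wired together (fileName, the three output suffixes, and the sample path are assumptions; PrintWriter and File come from the java.io._ import shown above):
val fileName = "input.fasta"  // assumed input path
// first pass: collect the per-header sequence sizes
val sizes = findSizes(scala.io.Source.fromFile(fileName).getLines())
// second pass: write the even, odd, and short-sequence files
val targets = Array.tabulate(3)(i => new PrintWriter(new File(fileName + "." + i)))
writeFiles(scala.io.Source.fromFile(fileName).getLines(), targets, sizes)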
More generally, Scala can't save you from thinking carefully about what information you need to have in memory and what you can fetch off disk as needed. Lazy evaluation can sometimes help, but there's no magic formula, because you can easily express the requirement to have all your data in memory in a lazy way. Scala can't interpret your commands to access memory as, secretly, instructions to fetch stuff off the disk instead. (Well, not unless you write a library to cache results from disk which does exactly that.)
One could construct the stream with newVal #:: rec(...), but I would
lose tail-recursion then, wouldn't I?
Actually, no.
So, here's the thing... with your present tail recursion, you fill ALL of the Stream with values. Yes, Stream is lazy, but you are computing all of the elements, stripping it of any laziness.
Now say you do newVal #:: rec(...). Would you lose tail recursion? No. Why? Because you are not recursing. How come? Well, Stream is lazy, so it won't evaluate rec(...).
And that's the beauty of it. Once you do it that way, getFullSeqs returns on the first iteration, and the "recursion" is only computed when printStrands asks for it. Unfortunately, that won't work as is...
The problem is that you are constantly modifying the Stream -- that's not how you use a Stream. With Stream, you always append to it. Don't keep "rewriting" the Stream.
Now, there are three other problems I could readily identify with printStrands. First, it calls size on seqs, which will cause the whole Stream to be processed, losing laziness. Never call size on a Stream. Second, you call apply on seqse, accessing it by index. Never call apply on a Stream (or List) -- that's highly inefficient. It's O(n), which makes your inner loop O(n^2) -- yes, quadratic on the number of headers in the input file! Finally, printStrands keeps a reference to seqs throughout the execution of printStrand, preventing processed elements from being garbage collected.
So, here's a first approximation:
def inputStreams(fileName: String): (Stream[String], Stream[String]) = {
  val in = (io.Source fromFile fileName).getLines.toStream
  val SeqPatt = "^[^>]".r
  def demultiplex(s: Stream[String], skip: Boolean): Stream[String] = {
    if (s.isEmpty) Stream.empty
    else if (skip) demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = false)
    else s.head #:: (s.tail takeWhile (SeqPatt findFirstIn _ nonEmpty)) #::: demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = true)
  }
  (demultiplex(in, skip = false), demultiplex(in, skip = true))
}
The problem with the above, and I'm showing that code just to further illustrate the issues of laziness, is that the instant you do this:
val (a, b) = inputStreams(fileName)
You'll keep a reference to the head of both streams, which prevents garbage collecting them. You can't keep a reference to them, so you have to consume them as soon as you get them, without ever storing them in a "val" or "lazy val". A "var" might do, but it would be tricky to handle. So let's try this instead:
def inputStreams(fileName: String): Vector[Stream[String]] = {
  val in = (io.Source fromFile fileName).getLines.toStream
  val SeqPatt = "^[^>]".r
  def demultiplex(s: Stream[String], skip: Boolean): Stream[String] = {
    if (s.isEmpty) Stream.empty
    else if (skip) demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = false)
    else s.head #:: (s.tail takeWhile (SeqPatt findFirstIn _ nonEmpty)) #::: demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = true)
  }
  Vector(demultiplex(in, skip = false), demultiplex(in, skip = true))
}
inputStreams(fileName).zipWithIndex.par.foreach {
  case (stream, strand) =>
    val p = new PrintWriter(new File("FASTA" + "." + strand))
    stream foreach p.println
    p.close
}
That still doesn't work, because stream inside inputStreams works as a reference, keeping the whole stream in memory even while it is being printed.
So, having failed again, what do I recommend? Keep it simple.
def in = (scala.io.Source fromFile fileName).getLines.toStream

def inputStream(in: Stream[String], strand: Int = 1): Stream[(String, Int)] = {
  if (in.isEmpty) Stream.empty
  else if (in.head startsWith ">") (in.head, 1 - strand) #:: inputStream(in.tail, 1 - strand)
  else (in.head, strand) #:: inputStream(in.tail, strand)
}

val printers = Array.tabulate(2)(i => new PrintWriter(new File("FASTA" + "." + i)))

inputStream(in) foreach {
  case (line, strand) => printers(strand) println line
}

printers foreach (_.close)
Now this won't keep any more in memory than necessary. I still think it's too complex, however. This can be done more easily like this:
def in = (scala.io.Source fromFile fileName).getLines
val printers = Array.tabulate(2)(i => new PrintWriter(new File("FASTA" + "." + i)))

def printStrands(in: Iterator[String], strand: Int = 1) {
  if (in.hasNext) {
    val next = in.next
    if (next startsWith ">") {
      printers(1 - strand).println(next)
      printStrands(in, 1 - strand)
    } else {
      printers(strand).println(next)
      printStrands(in, strand)
    }
  }
}

printStrands(in)
printers foreach (_.close)
Or just use a while loop instead of recursion.
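For completeness, a hedged sketch of the same logic as a while loop, reusing in and printers from the snippet above:
val lines = in                     // materialise the iterator once
var strand = 1
while (lines.hasNext) {
  val next = lines.next
  if (next startsWith ">") strand = 1 - strand   // a header line switches the output file
  printers(strand).println(next)
}
printers foreach (_.close)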
Now, to the other questions:
B) It might make sense to do so while reading it, so that you do not have to keep two copies of the data: the Map and a Seq.
C) Don't reverse a Stream -- you'll lose all of its laziness.

Filtering Scala's Parallel Collections with early abort when desired number of results found

Given a very large instance of collection.parallel.mutable.ParHashMap (or any other parallel collection), how can one abort a filtering parallel scan once a given, say 50, number of matches has been found ?
Attempting to accumulate intermediate matches in a thread-safe "external" data structure or keeping an external AtomicInteger with result count seems to be 2 to 3 times slower on 4 cores than using a regular collection.mutable.HashMap and pegging a single core at 100%.
I am aware that find or exists on Par* collections do abort "on the inside". Is there a way to generalize this to find more than one result ?
Here's the code, which still seems to be 2 to 3 times slower on the ParHashMap with ~79,000 entries, and which also has the problem of stuffing more than maxResults results into the results CHM (probably because a thread can be preempted after incrementAndGet but before break, which allows other threads to add more elements). Update: it seems the slowdown is due to worker threads contending on counter.incrementAndGet(), which of course defeats the purpose of the whole parallel scan :-(
def find(filter: Node => Boolean, maxResults: Int): Iterable[Node] = {
  val counter = new AtomicInteger(0)
  val results = new ConcurrentHashMap[Key, Node](maxResults)

  import util.control.Breaks._
  breakable {
    for ((key, node) <- parHashMap if filter(node)) {
      results.put(key, node)
      val total = counter.incrementAndGet()
      if (total > maxResults) break
    }
  }

  results.values.toArray(new Array[Node](results.size))
}
I would first do a parallel scan in which the variable maxResults would be thread-local. This would find up to (maxResults * numberOfThreads) results.
Then I would do a single-threaded scan to reduce it to maxResults.
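For illustration, a rough, untested sketch of that two-phase idea (Key, Node, and the input ParMap are assumed from the question; the per-thread cap relies on the parallel collection running the predicate on a fixed pool of worker threads):
import scala.collection.parallel.ParMap

def findCapped(map: ParMap[Key, Node], filter: Node => Boolean, maxResults: Int): Iterable[Node] = {
  // phase 1: parallel scan, each worker thread stops contributing after maxResults matches
  val localCount = new ThreadLocal[Int] { override def initialValue(): Int = 0 }
  val candidates = map.filter { case (_, node) =>
    localCount.get < maxResults && filter(node) && { localCount.set(localCount.get + 1); true }
  }
  // phase 2: single-threaded reduction of the (at most maxResults * numberOfThreads) candidates
  candidates.values.seq.take(maxResults)
}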
I performed an interesting investigation of your case.
Investigation reasoning
I suspected the problem was the mutability of the input Map, and I will try to explain why: the HashMap implementation organizes the data in different buckets, as one can see on Wikipedia.
The first thread-safe collections in Java, the synchronized collections, were based on synchronizing all the methods around the underlying implementation, and they resulted in poor performance. Further research and thinking led to the more performant concurrent collections, such as ConcurrentHashMap, whose approach was smarter: why not protect each bucket with its own lock?
My feeling was that the performance problem occurs because:
when you run your filter in parallel, some threads will conflict when accessing the same bucket at once and will hit the same lock, because your map is mutable.
you hold a counter to see how many results you have, while you could actually check the size of your result. If you have a thread-safe way to build a collection, you don't need a thread-safe counter too.
Investigation result
I developed a test case and found out I was wrong. The problem is the concurrent nature of the output map. In fact, that is where the collisions occur, when you are putting elements into the map, rather than when you are iterating over it. Additionally, since you only want the resulting values, you don't need the keys, the hashing, or any of the other map features. It might be interesting to test how your version performs if you remove the AtomicInteger and use only the result map to check whether you have collected enough elements.
Please note that the following code is for Scala 2.9.2. I explain in another post why I need two different functions for the parallel and the non-parallel version: Calling map on a parallel collection via a reference to an ancestor type
// Imports assumed for this snippet: the two renamed parallel map types are taken to be
// aliases of the standard parallel hash maps.
import scala.collection.immutable.VectorBuilder
import scala.collection.parallel.ParMap
import scala.collection.parallel.immutable.{ParHashMap => ImmutableParHashMap}
import scala.collection.parallel.mutable.{ParHashMap => MutableParHashMap}

object MapPerformance {

  val size = 100000
  val items = Seq.tabulate(size)(x => (x, x * 2))

  val concurrentParallelMap = ImmutableParHashMap(items: _*)
  val concurrentMutableParallelMap = MutableParHashMap(items: _*)
  val unparallelMap = Map(items: _*)

  class ThreadSafeIndexedSeqBuilder[T](maxSize: Int) {
    val underlyingBuilder = new VectorBuilder[T]()
    var counter = 0

    def sizeHint(hint: Int) { underlyingBuilder.sizeHint(hint) }

    def +=(item: T): Boolean = {
      synchronized {
        if (counter >= maxSize)
          false
        else {
          underlyingBuilder += item
          counter += 1
          true
        }
      }
    }

    def result(): Vector[T] = underlyingBuilder.result()
  }
  def find(map: ParMap[Int, Int], filter: Int => Boolean, maxResults: Int): Iterable[Int] = {
    // we already know the maximum size
    val resultsBuilder = new ThreadSafeIndexedSeqBuilder[Int](maxResults)
    resultsBuilder.sizeHint(maxResults)

    import util.control.Breaks._
    breakable {
      for ((key, node) <- map if filter(node)) {
        val newItemAdded = resultsBuilder += node
        if (!newItemAdded)
          break()
      }
    }
    resultsBuilder.result().seq
  }

  def findUnParallel(map: Map[Int, Int], filter: Int => Boolean, maxResults: Int): Iterable[Int] = {
    // we already know the maximum size
    val resultsBuilder = Array.newBuilder[Int]
    resultsBuilder.sizeHint(maxResults)

    var counter = 0
    for {
      (key, node) <- map if filter(node)
      if counter < maxResults
    } {
      resultsBuilder += node
      counter += 1
    }
    resultsBuilder.result()
  }
  def measureTime[K](f: => K): (Long, K) = {
    val startMutable = System.currentTimeMillis()
    val result = f
    val endMutable = System.currentTimeMillis()
    (endMutable - startMutable, result)
  }

  def main(args: Array[String]) = {
    val maxResultSetting = 10
    (1 to 10).foreach { tryNumber =>
      println("Try number " + tryNumber)
      val (mutableTime, mutableResult) = measureTime(find(concurrentMutableParallelMap, _ % 2 == 0, maxResultSetting))
      val (immutableTime, immutableResult) = measureTime(find(concurrentParallelMap, _ % 2 == 0, maxResultSetting))
      val (unparallelTime, unparallelResult) = measureTime(findUnParallel(unparallelMap, _ % 2 == 0, maxResultSetting))
      assert(mutableResult.size == maxResultSetting)
      assert(immutableResult.size == maxResultSetting)
      assert(unparallelResult.size == maxResultSetting)
      println(" The mutable version has taken " + mutableTime + " milliseconds")
      println(" The immutable version has taken " + immutableTime + " milliseconds")
      println(" The unparallel version has taken " + unparallelTime + " milliseconds")
    }
  }
}
With this code, the parallel version (with both the mutable and the immutable input map) is consistently about 3.5 times faster than the non-parallel one on my machine.
You could try to get an iterator and then create a lazy list (a Stream) where you filter (with your predicate) and take the number of elements you want. Because a Stream is non-strict, this 'taking' of elements is not evaluated eagerly.
Afterwards you can force the execution by adding ".par" to the whole thing and achieve parallelization.
Example code:
A parallelized map with random values (simulating your parallel hash map):
scala> myMap
res14: scala.collection.parallel.immutable.ParMap[Int,Int] = ParMap(66978401 -> -1331298976, 256964068 -> 126442706, 1698061835 -> 1622679396, -1556333580 -> -1737927220, 791194343 -> -591951714, -1907806173 -> 365922424, 1970481797 -> 162004380, -475841243 -> -445098544, -33856724 -> -1418863050, 1851826878 -> 64176692, 1797820893 -> 405915272, -1838192182 -> 1152824098, 1028423518 -> -2124589278, -670924872 -> 1056679706, 1530917115 -> 1265988738, -808655189 -> -1742792788, 873935965 -> 733748120, -1026980400 -> -163182914, 576661388 -> 900607992, -1950678599 -> -731236098)
Get an iterator and create a Stream from the iterator and filter it.
In this case my predicate only accepts pairs whose value member is even.
I want to get 10 even elements, so I take 10 elements which will only get evaluated when I force it to:
scala> val mapIterator = myMap.toIterator
mapIterator: Iterator[(Int, Int)] = HashTrieIterator(20)
scala> val r = Stream.continually(mapIterator.next()).filter(_._2 % 2 == 0).take(10)
r: scala.collection.immutable.Stream[(Int, Int)] = Stream((66978401,-1331298976), ?)
Finally, I force the evaluation which only gets 10 elements as planned
scala> r.force
res16: scala.collection.immutable.Stream[(Int, Int)] = Stream((66978401,-1331298976), (256964068,126442706), (1698061835,1622679396), (-1556333580,-1737927220), (791194343,-591951714), (-1907806173,365922424), (1970481797,162004380), (-475841243,-445098544), (-33856724,-1418863050), (1851826878,64176692))
This way you only get the number of elements you want (without needing to process the remaining elements) and you parallelize the process without locks, atomics or breaks.
Please compare this to your solutions to see if it is any good.