I have a for loop that iterates over an Iterable[String] and puts data into a mutable.Map. Is it possible to process everything in the iterable at once, or a fixed number of elements at a time?
Call .par to convert the collection to a parallel collection, then map or foreach over it.
Alternatively, map each element to a Future.
Don't forget about the thread safety of the map: a plain mutable.Map is not safe for concurrent writes, so use a ConcurrentHashMap.
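A minimal sketch of the Future-based variant, assuming the work per element is just computing the string's length. The names ParallelFill and wordLengths are made up for illustration; only ConcurrentHashMap and the scala.concurrent calls come from the standard libraries.

```scala
import java.util.concurrent.ConcurrentHashMap
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical helper: the per-element work (string length) is an assumption.
object ParallelFill {
  def wordLengths(words: Iterable[String]): ConcurrentHashMap[String, Int] = {
    val result = new ConcurrentHashMap[String, Int]()
    // Start one Future per element; each writes into the thread-safe map.
    val writes = words.toList.map { w =>
      Future { result.put(w, w.length); () }
    }
    // Block until every write has finished before handing the map back.
    Await.result(Future.sequence(writes), 10.seconds)
    result
  }
}
```

To process a fixed number of elements at a time instead, one option would be to split the input with words.grouped(n) and wait for each group before starting the next.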
Related
I am trying to learn Scala through a sample project that I have. In it there is a variable records defined as:
val records: Iterator[Product2[K, V]]
It is passed around between different methods. I explore its contents using:
records.foreach(println)
However, when I try to print the contents using this iterator again, even on the very next line of code, I get no results. It seems as if the iterator has been consumed. How do I prevent this from happening, so that I can explore the contents of the iterator without rendering it useless for the rest of the code?
An Iterator extends TraversableOnce and hence can only be iterated over once, as it represents a mutating pointer into an Iterable. If you want something that can be traversed repeatedly without one traversal affecting other, possibly parallel, accesses, you need to use the Iterable instead, which extends Traversable and creates a new Iterator for each foreach call.
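The difference can be illustrated with a small sketch. IteratorSketch and its two methods are hypothetical names, and materializing with toList assumes the data fits in memory.

```scala
object IteratorSketch {
  // Traverses the same iterator twice and returns what each pass saw.
  // The second pass sees nothing, because the first one exhausted it.
  def twoPasses[A](it: Iterator[A]): (List[A], List[A]) =
    (it.toList, it.toList)

  // Materializes the iterator once; the resulting Iterable hands out
  // a fresh Iterator on every traversal, so it can be reused freely.
  def reusable[A](it: Iterator[A]): Iterable[A] = it.toList
}
```

With this approach, records would be converted once (for example val all = IteratorSketch.reusable(records)) and all would be passed around and printed instead of the original iterator.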
OK, so I am incredibly new to Scala (started yesterday). I have been reading the documentation on concurrency and failed to find how to run larger tasks with Callables in a ForkJoinPool. Below is a sketch, written in Scala, of what I would use in Java.
import java.util.Collection
import java.util.concurrent.{Callable, ForkJoinPool}

private object fjp {
  val fjp: ForkJoinPool = new ForkJoinPool(Runtime.getRuntime.availableProcessors() * 2)
  var w: Int = 0
  def invokeObjects(collection: Collection[Callable[Map[String, Int]]]): Unit = {
    val futures = fjp.invokeAll(collection)
    w = 0
    while (!fjp.isQuiescent && fjp.getActiveThreadCount == 0) {
      w = w + 1
    }
    println("Checked " + w + " times")
    // `until` excludes futures.size(), avoiding an out-of-bounds get
    for (i <- 0 until futures.size()) {
      val mp = futures.get(i).get()
      // add keys to a common list
      // submit count with frequency to sparse matrix
      // avoid a ton of locking
    }
  }
}
How would I turn this code into a ForkJoinPool that I can call repeatedly? If possible, could I use a foreach to get the results without another List? Thank you for the help. It will point me in the right direction with Scala as well.
In general, I would not bother to follow exactly that path. Scala has a very clean paradigm in order to run parallel computations that just feels more idiomatic.
If you are new to asynchronous computation in Scala, I would recommend that you start by reading this.
In particular, you can define and/or reuse several kinds of ExecutionContext in order to get the kind of thread pool you need. Or you can use the default one if you are not blocking its threads (by default it has only one thread per core).
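As a rough sketch of that idea, a ForkJoinPool can be wrapped with ExecutionContext.fromExecutorService and then reused for any number of calls. PoolSketch, countWords and countAll are hypothetical names, and the word-counting task is only a stand-in for the Callables in the question.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object PoolSketch {
  // Hypothetical task: word frequencies for one chunk of text.
  def countWords(chunk: String): Map[String, Int] =
    chunk.split("\\s+").filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, hits) => word -> hits.length }

  // Runs one Future per chunk on whatever ExecutionContext is supplied
  // and returns the results in the same order as the input.
  def countAll(chunks: List[String])(implicit ec: ExecutionContext): List[Map[String, Int]] = {
    val futures = chunks.map(c => Future(countWords(c)))
    Await.result(Future.sequence(futures), 30.seconds)
  }
}
```

A caller would create the pool once, for example with ExecutionContext.fromExecutorService(new java.util.concurrent.ForkJoinPool(n)), pass it to countAll as often as needed, and shut the pool down when finished.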
I was wondering how Futures are evaluated. Suppose we have an imperative-style program that executes from point A to point B, and somewhere in between we create a Future which, when completed, prints its result to the console. How does the program step back in the flow to print it?
Futures are run on an ExecutionContext which is essentially a threadpool. When you create a Future block or use the various composition and callback methods (map, foreach, onComplete etc) on a future there is an implicit execution context passed along where the logic will be executed.
In an imperative program this is roughly the same as submitting a Runnable that prints to the console to a thread pool.
There is a good introduction here: http://docs.scala-lang.org/overviews/core/futures.html
The most useful thing about futures is not using them imperatively though, but instead composing using map and flatMap to create chains of futures.
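A small sketch of such a chain, assuming two made-up asynchronous steps (fetchLength and twice are hypothetical, as is the ComposeSketch object):

```scala
import scala.concurrent.{ExecutionContext, Future}

object ComposeSketch {
  // Hypothetical asynchronous steps, each running on the given context.
  def fetchLength(s: String)(implicit ec: ExecutionContext): Future[Int] =
    Future(s.length)

  def twice(n: Int)(implicit ec: ExecutionContext): Future[Int] =
    Future(n * 2)

  // flatMap chains a step that itself returns a Future;
  // map transforms the eventual value without blocking any thread.
  def describe(s: String)(implicit ec: ExecutionContext): Future[String] =
    fetchLength(s).flatMap(n => twice(n)).map(n => s"result: $n")
}
```

Nothing here waits for a result; the caller receives a Future[String] and can keep composing, or attach foreach/onComplete to print it once it is ready.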
Introduction
Scala's Future (new in 2.10 and now 2.9.3) is an applicative functor, which means that if we have a traversable type F, we can take an F[A] and a function A => Future[B] and turn them into a Future[F[B]].
This operation is available in the standard library as Future.traverse. Scalaz 7 also provides a more general traverse that we can use here if we import the applicative functor instance for Future from the scalaz-contrib library.
These two traverse methods behave differently in the case of streams. The standard library traversal consumes the stream before returning, while Scalaz's returns the future immediately:
import scala.concurrent._
import ExecutionContext.Implicits.global
// Hangs.
val standardRes = Future.traverse(Stream.from(1))(future(_))
// Returns immediately.
val scalazRes = Stream.from(1).traverse(future(_))
There's also another difference, as Leif Warner observes here. The standard library's traverse starts all of the asynchronous operations immediately, while Scalaz's starts the first, waits for it to complete, starts the second, waits for it, and so on.
Different behavior for streams
It's pretty easy to show this second difference by writing a function that will sleep for a few seconds for the first value in the stream:
def howLong(i: Int) = if (i == 1) 10000 else 0
import scalaz._, Scalaz._
import scalaz.contrib.std._
def toFuture(i: Int)(implicit ec: ExecutionContext) = future {
  printf("Starting %d!\n", i)
  Thread.sleep(howLong(i))
  printf("Done %d!\n", i)
  i
}
Now Future.traverse(Stream(1, 2))(toFuture) will print the following:
Starting 1!
Starting 2!
Done 2!
Done 1!
And the Scalaz version (Stream(1, 2).traverse(toFuture)):
Starting 1!
Done 1!
Starting 2!
Done 2!
Which probably isn't what we want here.
And for lists?
Strangely enough the two traversals behave the same in this respect on lists—Scalaz's doesn't wait for one future to complete before starting the next.
Another future
Scalaz also includes its own concurrent package with its own implementation of futures. We can use the same kind of setup as above:
import scalaz.concurrent.{ Future => FutureZ, _ }
def toFutureZ(i: Int) = FutureZ {
  printf("Starting %d!\n", i)
  Thread.sleep(howLong(i))
  printf("Done %d!\n", i)
  i
}
And then we get the behavior of Scalaz on streams for lists as well as streams:
Starting 1!
Done 1!
Starting 2!
Done 2!
Perhaps less surprisingly, traversing an infinite stream still returns immediately.
Question
At this point we really need a table to summarize, but a list will have to do:
Streams with standard library traversal: consume before returning; don't wait for each future.
Streams with Scalaz traversal: return immediately; do wait for each future to complete.
Scalaz futures with streams: return immediately; do wait for each future to complete.
And:
Lists with standard library traversal: don't wait.
Lists with Scalaz traversal: don't wait.
Scalaz futures with lists: do wait for each future to complete.
Does this make any sense? Is there a "correct" behavior for this operation on lists and streams? Is there some reason that the "most asynchronous" behavior—i.e., don't consume the collection before returning, and don't wait for each future to complete before moving on to the next—isn't represented here?
I cannot answer all of it, but I will try to address some parts:
Is there some reason that the "most asynchronous" behavior—i.e., don't
consume the collection before returning, and don't wait for each
future to complete before moving on to the next—isn't represented
here?
If you have dependent calculations and a limited number of threads, you can experience deadlocks. For example, if you have two futures depending on a third one (all three in the list of futures) and only two threads, the first two futures can occupy both threads while the third one never gets executed. (Of course, if your pool size is one, i.e. you execute one calculation after the other, you can get similar situations.)
To solve this, you need one thread per future, without any limitation. This works for small lists of futures, but not for big ones. So if you run everything in parallel, small examples will always run while bigger ones will deadlock. (Example: developer tests run fine, production deadlocks.)
Is there a "correct" behavior for this operation on lists and streams?
I think it is impossible with futures. If you know more about the dependencies, or if you know for sure that the calculations will not block, a more concurrent solution might be possible. But executing lists of futures looks "broken by design" to me. The best solution seems to be one that already fails on small examples when a deadlock is possible (i.e. executing one Future after the other).
Scalaz futures with lists: do wait for each future to complete.
I think Scalaz uses for comprehensions internally for traversal. With for comprehensions, it is not guaranteed that the calculations are independent. So I guess that Scalaz is doing the right thing here with for comprehensions: doing one calculation after the other. In the case of futures, this will always work, given that you have unlimited threads in your operating system.
So in other words: you are just seeing an artifact of how for comprehensions (must) work.
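The effect of for comprehensions on ordering can be sketched with standard library futures. This illustrates the general mechanism only; it is not a claim about how Scalaz implements traverse. The work function and the ForComprehensionSketch object are hypothetical.

```scala
import scala.concurrent.{ExecutionContext, Future}

object ForComprehensionSketch {
  // Hypothetical unit of work.
  def work(i: Int)(implicit ec: ExecutionContext): Future[Int] = Future(i * i)

  // Sequential: work(2) is only created inside the flatMap callback,
  // so it cannot start until work(1) has completed.
  def sequential(implicit ec: ExecutionContext): Future[Int] =
    for { a <- work(1); b <- work(2) } yield a + b

  // Concurrent: both futures are started first; the for comprehension
  // then merely combines two computations that are already running.
  def concurrent(implicit ec: ExecutionContext): Future[Int] = {
    val fa = work(1)
    val fb = work(2)
    for { a <- fa; b <- fb } yield a + b
  }
}
```

Both versions produce the same value; they differ only in when the second computation is allowed to start.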
I hope this makes some sense.
If I understand the question correctly, I think it really comes down to the semantics of streams vs lists.
Traversing a list does what we'd expect from the docs:
Transforms a TraversableOnce[A] into a Future[TraversableOnce[B]] using the provided function A => Future[B]. This is useful for performing a parallel map. For example, to apply a function to all items of a list in parallel:
With streams, it's up to the developer to decide how they want it to work, because that depends on more knowledge of the stream than the compiler has (streams can be infinite, but the type system doesn't know that). If my stream is reading lines from a file, I want to consume it first, since chaining futures line by line wouldn't actually parallelize anything. In this case, I would want the parallel approach.
On the other hand, if my stream is an infinite list generating sequential integers and hunting for the first prime greater than some large number, it would be impossible to consume the stream first in one sweep (the chained Future approach would be required, and we'd probably want to run over batches from the stream).
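One way to run over batches could look like the sketch below. It assumes an Iterator stands in for the infinite stream and that blocking on each batch is acceptable; BatchedSearch, isPrime and firstPrimeAbove are hypothetical names.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BatchedSearch {
  // Simple trial-division primality test, good enough for a sketch.
  def isPrime(n: Int): Boolean =
    n > 1 && (2 to math.sqrt(n.toDouble).toInt).forall(n % _ != 0)

  // Pulls batchSize candidates at a time from an unbounded iterator,
  // tests each batch in parallel, and stops at the first batch that
  // contains a prime. Later batches are never generated.
  def firstPrimeAbove(limit: Int, batchSize: Int)(implicit ec: ExecutionContext): Int =
    Iterator.from(limit + 1)
      .grouped(batchSize)
      .map { batch =>
        val checked = Future.traverse(batch.toList)(n => Future(n -> isPrime(n)))
        Await.result(checked, 30.seconds).collectFirst { case (n, true) => n }
      }
      .collectFirst { case Some(p) => p }
      .get
}
```

Within a batch the checks run concurrently; between batches the traversal is sequential, which keeps the search finite on an infinite source.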
Rather than trying to figure out a canonical way to handle this, I wonder if there are missing types that would help make the different cases more explicit.
Passing messages around with actors is great. But I would like to have even easier code.
Examples (Pseudo-code)
val splicedList:List[List[Int]]=biglist.partition(100)
val sum:Int=ActorPool.numberOfActors(5).getAllResults(splicedList,foldLeft(_+_))
Here spliceIntoParts (written as partition(100) above) turns one big list into 100 small lists.
The numberOfActors part creates a pool which uses 5 actors, each of which receives a new job after finishing its current one.
getAllResults applies a method to the list, with all of the message passing done in the background. There could also be a getFirstResult, which returns the first result and stops all other threads (as when cracking a password).
With Scala Parallel collections that will be included in 2.8.1 you will be able to do things like this:
val spliced = myList.par // obtain a parallel version of your collection (all operations are parallel)
spliced.map(process _) // maps each entry into a corresponding entry using `process`
spliced.find(check _) // searches the collection until it finds an element for which
// `check` returns true, at which point the search stops, and the element is returned
and the code will automatically be done in parallel. Other methods found in the regular collections library are being parallelized as well.
Currently, 2.8.RC2 is very close (this or next week), and 2.8 final will come in a few weeks after, I guess. You will be able to try parallel collections if you use 2.8.1 nightlies.
You can use Scalaz's concurrency features to achieve what you want.
import scalaz._
import Scalaz._
import concurrent.strategy.Executor
import java.util.concurrent.Executors
implicit val s = Executor.strategy[Unit](Executors.newFixedThreadPool(5))
val splicedList = biglist.grouped(100).toList
val sum = splicedList.parMap(_.sum).map(_.sum).get
It would be pretty easy to make this prettier (i.e. write a function mapReduce that does the splitting and folding all in one). Also, parMap over a List is unnecessarily strict. You will want to start folding before the whole list is ready. More like:
val splicedList = biglist.grouped(100).toList
val sum = splicedList.map(promise(_.sum)).toStream.traverse(_.sum).get
You can do this with less overhead than creating actors by using futures:
import scala.actors.Futures._
val nums = (1 to 1000).grouped(100).toList
val parts = nums.map(n => future { n.reduceLeft(_ + _) })
val whole = (0 /: parts)(_ + _())
You have to handle decomposing the problem, writing the "future" block, and recomposing the results into a final answer, but it does make executing a bunch of small code blocks in parallel easy to do.
(Note that the _() in the fold left is the apply function of the future, which means, "Give me the answer you were computing in parallel!", and it blocks until the answer is available.)
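For readers on a Scala version where scala.actors is no longer available, the same decompose/compute/recompose pattern can be sketched with scala.concurrent instead. This is a swapped-in API, not the one used in the answer above, and ChunkedSum is a hypothetical name.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ChunkedSum {
  // Sums each chunk in its own Future, then adds up the partial sums.
  def sum(nums: Seq[Int], chunkSize: Int)(implicit ec: ExecutionContext): Int = {
    val parts: List[Future[Int]] =
      nums.grouped(chunkSize).toList.map(chunk => Future(chunk.sum))
    // Await.result plays the role of the blocking _() call above.
    Await.result(Future.sequence(parts), 30.seconds).sum
  }
}
```

With an implicit ExecutionContext in scope, ChunkedSum.sum(1 to 1000, 100) would compute the same total as the fold over parts.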
A parallel collections library would automatically decompose the problem and recompose the answer for you (as with pmap in Clojure); that's not part of the main API yet.
I'm not going to wait for Scala 2.8.1 or 2.9; I would rather write my own library or use another one. So I did more googling and found this: akka
http://doc.akkasource.org/actors
which has an object futures with methods
awaitAll(futures: List[Future]): Unit
awaitOne(futures: List[Future]): Future
but http://scalablesolutions.se/akka/api/akka-core-0.8.1/
has no documentation at all. That's bad.
But the good part is that Akka's actors are leaner than Scala's native ones.
With all of these libraries (including Scalaz) around, it would be really great if Scala itself could eventually merge them officially.
At Scala Days 2010, there was a very interesting talk by Aleksandar Prokopec (who is working on Scala at EPFL) about Parallel Collections. This will probably be in 2.8.1, but you may have to wait a little longer. I'll see if I can get the presentation itself to link here.
The idea is to have a collections framework which parallelizes the processing of the collections by doing exactly as you suggest, but transparently to the user. All you theoretically have to do is change the import from scala.collections to scala.parallel.collections. You obviously still have to do the work to see if what you're doing can actually be parallelized.