I tested parallel collections on Scala vs simple collection, here is my code:
def parallelParse()
{
val adjs = wn.allSynsets(POS.ADJECTIVE).par
adjs.foreach(adj => {
parse(proc.mkDocument(adj.getGloss))
})
}
def serialParse()
{
val adjs = wn.allSynsets(POS.ADJECTIVE)
adjs.foreach(adj => {
parse(proc.mkDocument(adj.getGloss))
})
}
The parallel collection speed up about 3 times.
What other option do I have in Scala to make it even faster in parallel, I would be happy to test them and put the results here.
You can use futures to start asynchronous computations. You could do:
import scala.concurrent._
import scala.concurrent.duration._
import ExecutionContext.Implicits.global
val futures = wn.allSynsets(POS.ADJECTIVE).map(adj => Future {
parse(proc.mkDocument(adj.getGloss))
})
futures.foreach(f => Await.ready(f, Duration.Inf))
Depending on the amount of work on each element in allSynsets and the number of elements in allSynsets (too many elements -> too many futures -> more overheads), you could get worse results with futures.
To ensure that you are benchmarking correctly, consider using the inline benchmarking feature of ScalaMeter 0.5:
http://scalameter.github.io/home/gettingstarted/0.5/inline/index.html
You could also use actors to achieve this, but it would require a bit more plumbing.
Related
I have a use-case in databricks where an API call has to me made on a dataset of URL's. The dataset has around 100K records.
The max allowed concurrency is 3.
I did the implementation in Scala and ran in databricks notebook. Apart from the one element pending in queue, i feel some thing is missing here.
Is the Blocking Queue and Thread Pool the right way to tackle this problem.
In the code below I have modified and instead of reading from dataset I am sampling on a Seq.
Any help/thought will be much appreciated.
import java.time.LocalDateTime
import java.util.concurrent.{ArrayBlockingQueue,BlockingQueue}
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit;
var inpQueue:BlockingQueue[(Int, String)] = new ArrayBlockingQueue[(Int, String)](1)
val inpDS = Seq((1,"https://google.com/2X6barD"), (2,"https://google.com/3d9vCgW"), (3,"https://google.com/2M02Xz0"), (4,"https://google.com/2XOu2uL"), (5,"https://google.com/2AfBWF0"), (6,"https://google.com/36AEKsw"), (7,"https://google.com/3enBxz7"), (8,"https://google.com/36ABq0x"), (9,"https://google.com/2XBjmiF"), (10,"https://google.com/36Emlen"))
val pool = Executors.newFixedThreadPool(3)
var i = 0
inpDS.foreach{
ix => {
inpQueue.put(ix)
val t = new ConsumerAPIThread()
t.setName("MyThread-"+i+" ")
pool.execute(t)
}
i = i+1
}
println("Final Queue Size = " +inpQueue.size+"\n")
class ConsumerAPIThread() extends Thread
{
var name =""
override def run()
{
val urlDetail = inpQueue.take()
print(this.getName()+" "+ Thread.currentThread().getName() + " popped "+urlDetail+" Queue Size "+inpQueue.size+" \n")
triggerAPI((urlDetail._1, urlDetail._2))
}
def triggerAPI(params:(Int,String)){
try{
val result = scala.io.Source.fromURL(params._2)
println("" +result)
}catch{
case ex:Exception => {
println("Exception caught")
}
}
}
def ConsumerAPIThread(s:String)
{
name = s;
}
}
So, you have two requirements: the functional one is that you want to process asynchronously the items in a list, the non-functional one is that you want to not process more than three items at once.
Regarding the latter, the nice thing is that, as you already have shown in your question, Java natively exposes a nicely packaged Executor that runs task on a thread pool with a fixed size, elegantly allowing you to cap the concurrency level if you work with threads.
Moving to the functional requirement, Scala helps by having something that does precisely that as part of its standard API. In particular it uses scala.concurrent.Future, so in order to use it we'll have to reframe triggerAPI in terms of Future. The content of the function is not particularly relevant, so we'll mostly focus on its (revised) signature for now:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext
def triggerAPI(params: (Int, String))(implicit ec: ExecutionContext): Future[Unit] =
Future {
// some code that takes some time to run...
}
Notice that now triggerAPI returns a Future. A Future can be thought as a read-handle to something that is going to be eventually computed. In particular, this is a Future[Unit], where Unit stands for "we don't particularly care about the output of this function, but mostly about its side effects".
Furthermore, notice that the method now takes an implicit parameter, namely an ExecutionContext. The ExecutionContext is used to provide Futures with some form of environment where the computation happens. Scala has an API to create an ExecutionContext from a java.util.concurrent.ExecutorService, so this will come in handy to run our computation on the fixed thread pool, running no more than three callbacks at any given time.
Before moving forward, if you have questions about Futures, ExecutionContexts and implicit parameters, the Scala documentation is your best source of knowledge (here are a couple of pointers: 1, 2).
Now that we have the new triggerAPI method, we can use Future.traverse (here is the documentation for Scala 2.12 -- the latest version at the time of writing is 2.13 but to the best of my knowledge Spark users are stuck on 2.12 for the time being).
The tl;dr of Future.traverse is that it takes some form of container and a function that takes the items in that container and returns a Future of something else. The function will be applied to each item in the container and the result will be a Future of the container of the results. In your case: the container is a List, the items are (Int, String) and the something else you return is a Unit.
This means that you can simply call it like this:
Future.traverse(inpDS)(triggerAPI)
And triggerAPI will be applied to each item in inpDS.
By making sure that the execution context backed by the thread pool is in the implicit scope when calling Future.traverse, the items will be processed with the desired thread pool.
The result of the call is Future[List[Unit]], which is not very interesting and can simply be discarded (as you are only interested in the side effects).
That was a lot of talk, if you want to play around with the code I described you can do so here on Scastie.
For reference, this is the whole implementation:
import java.util.concurrent.{ExecutorService, Executors}
import scala.concurrent.duration.DurationLong
import scala.concurrent.Future
import scala.concurrent.{ExecutionContext, ExecutionContextExecutorService}
import scala.util.control.NonFatal
import scala.util.{Failure, Success, Try}
val datasets = List(
(1, "https://google.com/2X6barD"),
(2, "https://google.com/3d9vCgW"),
(3, "https://google.com/2M02Xz0"),
(4, "https://google.com/2XOu2uL"),
(5, "https://google.com/2AfBWF0"),
(6, "https://google.com/36AEKsw"),
(7, "https://google.com/3enBxz7"),
(8, "https://google.com/36ABq0x"),
(9, "https://google.com/2XBjmiF")
)
val executor: ExecutorService = Executors.newFixedThreadPool(3)
implicit val executionContext: ExecutionContextExecutorService = ExecutionContext.fromExecutorService(executor)
def triggerAPI(params: (Int, String))(implicit ec: ExecutionContext): Future[Unit] =
Future {
val (index, _) = params
println(s"+ started processing $index")
val start = System.nanoTime() / 1000000
Iterator.from(0).map(_ + 1).drop(100000000).take(1).toList.head // a noticeably slow operation
val end = System.nanoTime() / 1000000
val duration = (end - start).millis
println(s"- finished processing $index after $duration")
}
Future.traverse(datasets)(triggerAPI).onComplete {
case result =>
println("* processing is over, shutting down the executor")
executionContext.shutdown()
}
You need to shutdown the Executor after your job done else It will be waiting.
Try add pool.shutdown() end of your program.
As a first attempt, I tried to use Await.result on the head of the Seq and then use the lazy #:: Stream constructor. However, it seems to not work as good as expected since I haven't found a way to tell the scheduler to prioritize the order of the list nor does the compiler recognize it as #tailrec.
implicit class SeqOfFuture[X](seq: Seq[Future[X]]) {
lazy val stream: Stream[X] =
if (seq.nonEmpty) Await.result(seq.head) #:: seq.tail.stream
else Stream.empty
}
I am attempting this since Future.collect seems to wait until the whole strict Seq is available/ready in order to map/flatmap/transform it further. (And there are other computations I might start with the stream of intermedieate results)
(Proto)Example of usage:
val searches = [SearchParam1, SearchParam2..., SearchParam200]
// big queries that take a some 100ms each for ~20s total wait
val futureDbResult = searches.map(search => (quill)ctx.run { query(search) }).stream
// Stuff that should happen as results become available instead of blocking/waiting ~20 seconds before starting
val processedResults = futureDbResult.map(transform).filter(reduce)
// Log?
processedResults.map(result => log.info/log.trace)
//return lazy processedResults list or Future {processedResults}
???
As others have pointed out, you really should look into a real streaming library like fs2 or monix. I personally think monix is a good fit if you're interfacing with Future and only need it in a small part of your application. It has great APIs and documentation for this use-case.
Here's a small demo for your use-case:
import monix.eval.Task
import monix.execution.Scheduler.Implicits.global
import monix.reactive.Observable
import scala.concurrent.duration._
import scala.util.Random
// requires: libraryDependencies += "io.monix" %% "monix" % "3.0.0"
object Main {
val searchParams = (1 to 200).map(n => s"Search $n")
/**
* Simulates a query. If your library returns a Future, you can wrap it with `Task.deferFuture`
*/
def search(param: String): Task[String] =
Task(s"Result for $param").delayResult(Random.between(25, 250).milliseconds)
val results: Task[List[String]] =
Observable
.fromIterable(searchParams)
.mapParallelUnordered(parallelism = 4)(param => search(param))
.mapEval { result =>
Task(println(result)).map(_ => result) // print intermediate results as feedback
}
.toListL // collect results into List
/**
* If you aren't going all-in on monix, you probably run the stream into a Future with `results.runToFuture`
*/
def main(args: Array[String]): Unit = results.map(_ => ()).runSyncUnsafe()
}
You can think of Task as a lazy and more powerful Future. Observable is a (reactive) stream which will automatically back-pressure if downstream is slow. In this example only 4 queries will run in parallel and the other will wait until a "slot" becomes available to run.
Keep in mind that in those libraries side-effects (like println have to be wrapped in Task (or IO depending on what you use).
You can run this example locally if you provide the monix-dependency and play around with it to get a feel for how it works.
Given the several methods
def methodA(args:Int):Int={
//too long calculation
result
}
def methodB(args:Int):Int={
//too long calculation
result
}
def methodC(args:Int):Int={
//too long calculation
result
}
They have the same set of arguments and return a result of the same type.
It is necessary to calculate the methods in parallel and when one method returns a result I need to interrupt others.
This was an interesting question. I'm not too experienced, so there is probably a more Scala-ish way to do this, for starters with some of the cancellable future implementations. However, I traded brevity for readability.
import java.util.concurrent.{Callable, FutureTask}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
def longRunningTask(foo: Int => String)(arg: Int): FutureTask[String] = {
new FutureTask[String](new Callable[String]() {
def call(): String = {
foo(arg)
}
})
}
def killTasks(tasks: Seq[FutureTask[_]]): Unit = {
tasks.foreach(_.cancel(false))
}
val task1 = longRunningTask(methodA)(3)
val task2 = longRunningTask(methodB)(4)
val task1Future = Future {
task1.run()
task1.get()
}
val task2Future = Future {
task2.run()
task2.get()
}
val firstFuture = Future.firstCompletedOf(List(task1Future, task2Future))
firstFuture.onSuccess({
case result => {
println(result)
killTasks(List(task1, task2))
}
})
So, what I did here? I crated a helper method which creates a FutureTask (from the Java API) which executes an Int => String operation on some Int argument. I also created a helper method which kills all FutureTasks in a collection.
After that I created two long running tasks with two different methods and inputs (that's your part).
The next part is a bit ugly, basically I run the FutureTasks with Future monad and call get to fetch the result.
After that I basically only use the firstCompletedOf method from the Future companion to handle the first finishing task, and when that is processed kill the rest of the tasks.
I'll probably make an effort to make this code a bit better, but start with this.
You should also have a go in checking this with different 'ExecutionContext's, and parallelism levels.
I have a collection of Promises. I was looking for an effective way to compose the resulting Futures. Currently the cleanest way i found to combine them was to use a scalaz Monoid, like so:
val promises = List(Promise[Unit], Promise[Unit], Promise[Unit])
promises.map(_.future).reduce(_ |+| _).onSuccess {
case a => println(a)
}
promises.foreach(_.success(()))
Is there a clean way to do this that doesn't require scalaz?
The number of Futures in the collection will vary, and lots of intermediate collections is undesirable.
You could use Future.traverse:
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.{ Future, Promise }
val promises = List(Promise[Unit], Promise[Unit], Promise[Unit])
Future.traverse(promises)(_.future)
This will give you a Future[List[Unit]], which doesn't exactly qualify as "lots of intermediate collections", but isn't necessarily ideal, either. Future.reduce also works:
Future.reduce(promises.map(_.future))((_, _) => ())
This returns a Future[Unit] that will be satisfied when all of the futures are satisfied.
Future.sequence is the method you want.
I have an iteration vals: Iterable[T] and a long-running function without any relevant side effects: f: (T => Unit). Right now this is applied to vals in the obvious way:
vals.foreach(f)
I would like the calls to f to be done concurrently (within reasonable limits). Is there an obvious function somewhere in the Scala base library? Something like:
Concurrent.foreach(8 /* Number of threads. */)(vals, f)
While f is reasonably long running, it is short enough that I don't want the overhead of invoking a thread for each call, so I am looking for something based on a thread pool.
Many of the answers from 2009 still use the old scala.actors.Futures._, which are no longer in the newer Scala. While Akka is the preferred way, a much more readable way is to just use parallel (.par) collections:
vals.foreach { v => f(v) }
becomes
vals.par.foreach { v => f(v) }
Alternatively, using parMap can appear more succinct though with the caveat that you need to remember to import the usual Scalaz*. As usual, there's more than one way to do the same thing in Scala!
Scalaz has parMap. You would use it as follows:
import scalaz.Scalaz._
import scalaz.concurrent.Strategy.Naive
This will equip every functor (including Iterable) with a parMap method, so you can just do:
vals.parMap(f)
You also get parFlatMap, parZipWith, etc.
I like the Futures answer. However, while it will execute concurrently, it will also return asynchronously, which is probably not what you want. The correct approach would be as follows:
import scala.actors.Futures._
vals map { x => future { f(x) } } foreach { _() }
I had some issues using scala.actors.Futures in Scala 2.8 (it was buggy when I checked). Using java libs directly worked for me, though:
final object Parallel {
val cpus=java.lang.Runtime.getRuntime().availableProcessors
import java.util.{Timer,TimerTask}
def afterDelay(ms: Long)(op: =>Unit) = new Timer().schedule(new TimerTask {override def run = op},ms)
def repeat(n: Int,f: Int=>Unit) = {
import java.util.concurrent._
val e=Executors.newCachedThreadPool //newFixedThreadPool(cpus+1)
(0 until n).foreach(i=>e.execute(new Runnable {def run = f(i)}))
e.shutdown
e.awaitTermination(Math.MAX_LONG, TimeUnit.SECONDS)
}
}
I'd use scala.actors.Futures:
vals.foreach(t => scala.actors.Futures.future(f(t)))
The latest release of Functional Java has some higher-order concurrency features that you can use.
import fjs.F._
import fj.control.parallel.Strategy._
import fj.control.parallel.ParModule._
import java.util.concurrent.Executors._
val pool = newCachedThreadPool
val par = parModule(executorStrategy[Unit](pool))
And then...
par.parMap(vals, f)
Remember to shutdown the pool.
You can use the Parallel Collections from the Scala standard library.
They're just like ordinary collections, but their operations run in parallel. You just need to put a par call before you invoke some collections operation.
import scala.collection._
val array = new Array[String](10000)
for (i <- (0 until 10000).par) array(i) = i.toString