Scala Thread Pool - Invoking API's Concurrently - scala

I have a use-case in databricks where an API call has to me made on a dataset of URL's. The dataset has around 100K records.
The max allowed concurrency is 3.
I did the implementation in Scala and ran in databricks notebook. Apart from the one element pending in queue, i feel some thing is missing here.
Is the Blocking Queue and Thread Pool the right way to tackle this problem.
In the code below I have modified and instead of reading from dataset I am sampling on a Seq.
Any help/thought will be much appreciated.
import java.time.LocalDateTime
import java.util.concurrent.{ArrayBlockingQueue,BlockingQueue}
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit;
var inpQueue:BlockingQueue[(Int, String)] = new ArrayBlockingQueue[(Int, String)](1)
val inpDS = Seq((1,"https://google.com/2X6barD"), (2,"https://google.com/3d9vCgW"), (3,"https://google.com/2M02Xz0"), (4,"https://google.com/2XOu2uL"), (5,"https://google.com/2AfBWF0"), (6,"https://google.com/36AEKsw"), (7,"https://google.com/3enBxz7"), (8,"https://google.com/36ABq0x"), (9,"https://google.com/2XBjmiF"), (10,"https://google.com/36Emlen"))
val pool = Executors.newFixedThreadPool(3)
var i = 0
inpDS.foreach{
ix => {
inpQueue.put(ix)
val t = new ConsumerAPIThread()
t.setName("MyThread-"+i+" ")
pool.execute(t)
}
i = i+1
}
println("Final Queue Size = " +inpQueue.size+"\n")
class ConsumerAPIThread() extends Thread
{
var name =""
override def run()
{
val urlDetail = inpQueue.take()
print(this.getName()+" "+ Thread.currentThread().getName() + " popped "+urlDetail+" Queue Size "+inpQueue.size+" \n")
triggerAPI((urlDetail._1, urlDetail._2))
}
def triggerAPI(params:(Int,String)){
try{
val result = scala.io.Source.fromURL(params._2)
println("" +result)
}catch{
case ex:Exception => {
println("Exception caught")
}
}
}
def ConsumerAPIThread(s:String)
{
name = s;
}
}

So, you have two requirements: the functional one is that you want to process asynchronously the items in a list, the non-functional one is that you want to not process more than three items at once.
Regarding the latter, the nice thing is that, as you already have shown in your question, Java natively exposes a nicely packaged Executor that runs task on a thread pool with a fixed size, elegantly allowing you to cap the concurrency level if you work with threads.
Moving to the functional requirement, Scala helps by having something that does precisely that as part of its standard API. In particular it uses scala.concurrent.Future, so in order to use it we'll have to reframe triggerAPI in terms of Future. The content of the function is not particularly relevant, so we'll mostly focus on its (revised) signature for now:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext
def triggerAPI(params: (Int, String))(implicit ec: ExecutionContext): Future[Unit] =
Future {
// some code that takes some time to run...
}
Notice that now triggerAPI returns a Future. A Future can be thought as a read-handle to something that is going to be eventually computed. In particular, this is a Future[Unit], where Unit stands for "we don't particularly care about the output of this function, but mostly about its side effects".
Furthermore, notice that the method now takes an implicit parameter, namely an ExecutionContext. The ExecutionContext is used to provide Futures with some form of environment where the computation happens. Scala has an API to create an ExecutionContext from a java.util.concurrent.ExecutorService, so this will come in handy to run our computation on the fixed thread pool, running no more than three callbacks at any given time.
Before moving forward, if you have questions about Futures, ExecutionContexts and implicit parameters, the Scala documentation is your best source of knowledge (here are a couple of pointers: 1, 2).
Now that we have the new triggerAPI method, we can use Future.traverse (here is the documentation for Scala 2.12 -- the latest version at the time of writing is 2.13 but to the best of my knowledge Spark users are stuck on 2.12 for the time being).
The tl;dr of Future.traverse is that it takes some form of container and a function that takes the items in that container and returns a Future of something else. The function will be applied to each item in the container and the result will be a Future of the container of the results. In your case: the container is a List, the items are (Int, String) and the something else you return is a Unit.
This means that you can simply call it like this:
Future.traverse(inpDS)(triggerAPI)
And triggerAPI will be applied to each item in inpDS.
By making sure that the execution context backed by the thread pool is in the implicit scope when calling Future.traverse, the items will be processed with the desired thread pool.
The result of the call is Future[List[Unit]], which is not very interesting and can simply be discarded (as you are only interested in the side effects).
That was a lot of talk, if you want to play around with the code I described you can do so here on Scastie.
For reference, this is the whole implementation:
import java.util.concurrent.{ExecutorService, Executors}
import scala.concurrent.duration.DurationLong
import scala.concurrent.Future
import scala.concurrent.{ExecutionContext, ExecutionContextExecutorService}
import scala.util.control.NonFatal
import scala.util.{Failure, Success, Try}
val datasets = List(
(1, "https://google.com/2X6barD"),
(2, "https://google.com/3d9vCgW"),
(3, "https://google.com/2M02Xz0"),
(4, "https://google.com/2XOu2uL"),
(5, "https://google.com/2AfBWF0"),
(6, "https://google.com/36AEKsw"),
(7, "https://google.com/3enBxz7"),
(8, "https://google.com/36ABq0x"),
(9, "https://google.com/2XBjmiF")
)
val executor: ExecutorService = Executors.newFixedThreadPool(3)
implicit val executionContext: ExecutionContextExecutorService = ExecutionContext.fromExecutorService(executor)
def triggerAPI(params: (Int, String))(implicit ec: ExecutionContext): Future[Unit] =
Future {
val (index, _) = params
println(s"+ started processing $index")
val start = System.nanoTime() / 1000000
Iterator.from(0).map(_ + 1).drop(100000000).take(1).toList.head // a noticeably slow operation
val end = System.nanoTime() / 1000000
val duration = (end - start).millis
println(s"- finished processing $index after $duration")
}
Future.traverse(datasets)(triggerAPI).onComplete {
case result =>
println("* processing is over, shutting down the executor")
executionContext.shutdown()
}

You need to shutdown the Executor after your job done else It will be waiting.
Try add pool.shutdown() end of your program.

Related

Is synchronous HTTP request wrapped in a Future considered CPU or IO bound?

Consider the following two snippets where first wraps scalaj-http requests with Future, whilst second uses async-http-client
Sync client wrapped with Future using global EC
object SyncClientWithFuture {
def main(args: Array[String]): Unit = {
import scala.concurrent.ExecutionContext.Implicits.global
import scalaj.http.Http
val delay = "3000"
val slowApi = s"http://slowwly.robertomurray.co.uk/delay/${delay}/url/https://www.google.co.uk"
val nestedF = Future(Http(slowApi).asString).flatMap { _ =>
Future.sequence(List(
Future(Http(slowApi).asString),
Future(Http(slowApi).asString),
Future(Http(slowApi).asString)
))
}
time { Await.result(nestedF, Inf) }
}
}
Async client using global EC
object AsyncClient {
def main(args: Array[String]): Unit = {
import scala.concurrent.ExecutionContext.Implicits.global
import sttp.client._
import sttp.client.asynchttpclient.future.AsyncHttpClientFutureBackend
implicit val sttpBackend = AsyncHttpClientFutureBackend()
val delay = "3000"
val slowApi = uri"http://slowwly.robertomurray.co.uk/delay/${delay}/url/https://www.google.co.uk"
val nestedF = basicRequest.get(slowApi).send().flatMap { _ =>
Future.sequence(List(
basicRequest.get(slowApi).send(),
basicRequest.get(slowApi).send(),
basicRequest.get(slowApi).send()
))
}
time { Await.result(nestedF, Inf) }
}
}
The snippets are using
Slowwly to simulate slow API
scalaj-http
async-http-client sttp backend
time
The former takes 12 seconds whilst the latter takes 6 seconds. It seems the former behaves as if it is CPU bound however I do not see how that is the case since Future#sequence should executes the HTTP requests in parallel? Why does synchronous client wrapped in Future behave differently from proper async client? Is it not the case that async client does the same kind of thing where it wraps calls in Futures under the hood?
Future#sequence should execute the HTTP requests in parallel?
First of all, Future#sequence doesn't execute anything. It just produces a future that completes when all parameters complete.
Evaluation (execution) of constructed futures starts immediately If there is a free thread in the EC. Otherwise, it simply submits it for a sort of queue.
I am sure that in the first case you have single thread execution of futures.
println(scala.concurrent.ExecutionContext.Implicits.global) -> parallelism = 6
Don't know why it is like this, it might that other 5 thread is always busy for some reason. You can experiment with explicitly created new EC with 5-10 threads.
The difference with the Async case that you don't create a future by yourself, it is provided by the library, that internally don't block the thread. It starts the async process, "subscribes" for a result, and returns the future, which completes when the result will come.
Actually, async lib could have another EC internally, but I doubt.
Btw, Futures are not supposed to contain slow/io/blocking evaluations without blocking. Otherwise, you potentially will block the main thread pool (EC) and your app will be completely frozen.

How to stream a `Seq[Future[_]]` into either a `Future[Stream[_]]` or a `Stream[_]` such that it can consumed as it becomes available in order?

As a first attempt, I tried to use Await.result on the head of the Seq and then use the lazy #:: Stream constructor. However, it seems to not work as good as expected since I haven't found a way to tell the scheduler to prioritize the order of the list nor does the compiler recognize it as #tailrec.
implicit class SeqOfFuture[X](seq: Seq[Future[X]]) {
lazy val stream: Stream[X] =
if (seq.nonEmpty) Await.result(seq.head) #:: seq.tail.stream
else Stream.empty
}
I am attempting this since Future.collect seems to wait until the whole strict Seq is available/ready in order to map/flatmap/transform it further. (And there are other computations I might start with the stream of intermedieate results)
(Proto)Example of usage:
val searches = [SearchParam1, SearchParam2..., SearchParam200]
// big queries that take a some 100ms each for ~20s total wait
val futureDbResult = searches.map(search => (quill)ctx.run { query(search) }).stream
// Stuff that should happen as results become available instead of blocking/waiting ~20 seconds before starting
val processedResults = futureDbResult.map(transform).filter(reduce)
// Log?
processedResults.map(result => log.info/log.trace)
//return lazy processedResults list or Future {processedResults}
???
As others have pointed out, you really should look into a real streaming library like fs2 or monix. I personally think monix is a good fit if you're interfacing with Future and only need it in a small part of your application. It has great APIs and documentation for this use-case.
Here's a small demo for your use-case:
import monix.eval.Task
import monix.execution.Scheduler.Implicits.global
import monix.reactive.Observable
import scala.concurrent.duration._
import scala.util.Random
// requires: libraryDependencies += "io.monix" %% "monix" % "3.0.0"
object Main {
val searchParams = (1 to 200).map(n => s"Search $n")
/**
* Simulates a query. If your library returns a Future, you can wrap it with `Task.deferFuture`
*/
def search(param: String): Task[String] =
Task(s"Result for $param").delayResult(Random.between(25, 250).milliseconds)
val results: Task[List[String]] =
Observable
.fromIterable(searchParams)
.mapParallelUnordered(parallelism = 4)(param => search(param))
.mapEval { result =>
Task(println(result)).map(_ => result) // print intermediate results as feedback
}
.toListL // collect results into List
/**
* If you aren't going all-in on monix, you probably run the stream into a Future with `results.runToFuture`
*/
def main(args: Array[String]): Unit = results.map(_ => ()).runSyncUnsafe()
}
You can think of Task as a lazy and more powerful Future. Observable is a (reactive) stream which will automatically back-pressure if downstream is slow. In this example only 4 queries will run in parallel and the other will wait until a "slot" becomes available to run.
Keep in mind that in those libraries side-effects (like println have to be wrapped in Task (or IO depending on what you use).
You can run this example locally if you provide the monix-dependency and play around with it to get a feel for how it works.

When scala.concurrent.Future's execution starts?

Of course one could collect the system time at the first line of the Future's body. But:
Is it possible to know that time without having access to the future's code. (In my case the method returning the future is to be provided by the user of the 'framework'.)
def f: Future[Int] = ...
def magicTimePeak: Long = ???
The Future itself doesn't really know this (nor was it designed to care). It's all up to the executor when the code will actually be executed. This depends on whether or not a thread is immediately available, and if not, when one becomes available.
You could wrap Future, to keep track of it, I suppose. It would involve creating an underlying Future with a closure that changes a mutable var within the wrapped class. Since you just want a Long, it would have to default to zero if the Future hasn't begun executing, though it would be trivial to change this to Option[Date] or something.
class WrappedFuture[A](thunk: => A)(implicit ec: ExecutionContext) {
var started: Long = 0L
val underlying = Future {
started = System.nanoTime / 1000000 // milliseconds
thunk
}
}
To show that it works, create a fixed thread pool with one thread, then feed it a blocking task for say, 5 seconds. Then, create a WrappedFuture, and check it's started value later. Note the difference in the logged times.
import java.util.concurrent.Executors
import scala.concurrent._
val executorService = Executors.newFixedThreadPool(1)
implicit val ec = ExecutionContext.fromExecutorService(executorService)
scala> println("Before blocked: " + System.nanoTime / 1000000)
Before blocked: 13131636
scala> val blocker = Future(Thread.sleep(5000))
blocker: scala.concurrent.Future[Unit] = scala.concurrent.impl.Promise$DefaultPromise#7e5d9a50
scala> val f = new WrappedFuture(1)
f: WrappedFuture[Int] = WrappedFuture#4c4748bf
scala> f.started
res13: Long = 13136779 // note the difference in time of about 5000 ms from earlier
If you don't control the creation of the Future, however, there is nothing you can do to figure out when it started.

Memory consumption of a parallel Scala Stream

I have written a Scala (2.9.1-1) application that needs to process several million rows from a database query. I am converting the ResultSet to a Stream using the technique shown in the answer to one of my previous questions:
class Record(...)
val resultSet = statement.executeQuery(...)
new Iterator[Record] {
def hasNext = resultSet.next()
def next = new Record(resultSet.getString(1), resultSet.getInt(2), ...)
}.toStream.foreach { record => ... }
and this has worked very well.
Since the body of the foreach closure is very CPU intensive, and as a testament to the practicality of functional programming, if I add a .par before the foreach, the closures get run in parallel with no other effort, except to make sure that the body of the closure is thread safe (it is written in a functional style with no mutable data except printing to a thread-safe log).
However, I am worried about memory consumption. Is the .par causing the entire result set to load in RAM, or does the parallel operation load only as many rows as it has active threads? I've allocated 4G to the JVM (64-bit with -Xmx4g) but in the future I will be running it on even more rows and worry that I'll eventually get an out-of-memory.
Is there a better pattern for doing this kind of parallel processing in a functional manner? I've been showing this application to my co-workers as an example of the value of functional programming and multi-core machines.
If you look at the scaladoc of Stream, you will notice that the definition class of par is the Parallelizable trait... and, if you look at the source code of this trait, you will notice that it takes each element from the original collection and put them into a combiner, thus, you will load each row into a ParSeq:
def par: ParRepr = {
val cb = parCombiner
for (x <- seq) cb += x
cb.result
}
/** The default `par` implementation uses the combiner provided by this method
* to create a new parallel collection.
*
* #return a combiner for the parallel collection of type `ParRepr`
*/
protected[this] def parCombiner: Combiner[A, ParRepr]
A possible solution is to explicitly parallelize your computation, thanks to actors for example. You can take a look at this example from the akka documentation for example, that might be helpful in your context.
The new akka stream library is the fix you're looking for:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Source, Sink}
def iterFromQuery() : Iterator[Record] = {
val resultSet = statement.executeQuery(...)
new Iterator[Record] {
def hasNext = resultSet.next()
def next = new Record(...)
}
}
def cpuIntensiveFunction(record : Record) = {
...
}
implicit val actorSystem = ActorSystem()
implicit val materializer = ActorMaterializer()
implicit val execContext = actorSystem.dispatcher
val poolSize = 10 //number of Records in memory at once
val stream =
Source(iterFromQuery).runWith(Sink.foreachParallel(poolSize)(cpuIntensiveFunction))
stream onComplete {_ => actorSystem.shutdown()}

Concurrent map/foreach in scala

I have an iteration vals: Iterable[T] and a long-running function without any relevant side effects: f: (T => Unit). Right now this is applied to vals in the obvious way:
vals.foreach(f)
I would like the calls to f to be done concurrently (within reasonable limits). Is there an obvious function somewhere in the Scala base library? Something like:
Concurrent.foreach(8 /* Number of threads. */)(vals, f)
While f is reasonably long running, it is short enough that I don't want the overhead of invoking a thread for each call, so I am looking for something based on a thread pool.
Many of the answers from 2009 still use the old scala.actors.Futures._, which are no longer in the newer Scala. While Akka is the preferred way, a much more readable way is to just use parallel (.par) collections:
vals.foreach { v => f(v) }
becomes
vals.par.foreach { v => f(v) }
Alternatively, using parMap can appear more succinct though with the caveat that you need to remember to import the usual Scalaz*. As usual, there's more than one way to do the same thing in Scala!
Scalaz has parMap. You would use it as follows:
import scalaz.Scalaz._
import scalaz.concurrent.Strategy.Naive
This will equip every functor (including Iterable) with a parMap method, so you can just do:
vals.parMap(f)
You also get parFlatMap, parZipWith, etc.
I like the Futures answer. However, while it will execute concurrently, it will also return asynchronously, which is probably not what you want. The correct approach would be as follows:
import scala.actors.Futures._
vals map { x => future { f(x) } } foreach { _() }
I had some issues using scala.actors.Futures in Scala 2.8 (it was buggy when I checked). Using java libs directly worked for me, though:
final object Parallel {
val cpus=java.lang.Runtime.getRuntime().availableProcessors
import java.util.{Timer,TimerTask}
def afterDelay(ms: Long)(op: =>Unit) = new Timer().schedule(new TimerTask {override def run = op},ms)
def repeat(n: Int,f: Int=>Unit) = {
import java.util.concurrent._
val e=Executors.newCachedThreadPool //newFixedThreadPool(cpus+1)
(0 until n).foreach(i=>e.execute(new Runnable {def run = f(i)}))
e.shutdown
e.awaitTermination(Math.MAX_LONG, TimeUnit.SECONDS)
}
}
I'd use scala.actors.Futures:
vals.foreach(t => scala.actors.Futures.future(f(t)))
The latest release of Functional Java has some higher-order concurrency features that you can use.
import fjs.F._
import fj.control.parallel.Strategy._
import fj.control.parallel.ParModule._
import java.util.concurrent.Executors._
val pool = newCachedThreadPool
val par = parModule(executorStrategy[Unit](pool))
And then...
par.parMap(vals, f)
Remember to shutdown the pool.
You can use the Parallel Collections from the Scala standard library.
They're just like ordinary collections, but their operations run in parallel. You just need to put a par call before you invoke some collections operation.
import scala.collection._
val array = new Array[String](10000)
for (i <- (0 until 10000).par) array(i) = i.toString