Of course one could collect the system time at the first line of the Future's body. But:
Is it possible to know that time without having access to the Future's code? (In my case the method returning the Future is to be provided by the user of the 'framework'.)
def f: Future[Int] = ...
def magicTimePeak: Long = ???
The Future itself doesn't really know this (nor was it designed to care). It's all up to the executor when the code will actually be executed. This depends on whether or not a thread is immediately available, and if not, when one becomes available.
You could wrap Future to keep track of it, I suppose. It would involve creating an underlying Future with a closure that sets a mutable var within the wrapper class. Since you just want a Long, it would have to default to zero if the Future hasn't begun executing, though it would be trivial to change this to Option[Date] or something.
import scala.concurrent.{ExecutionContext, Future}

class WrappedFuture[A](thunk: => A)(implicit ec: ExecutionContext) {
  // stays 0 until the executor actually starts running the body
  var started: Long = 0L
  val underlying = Future {
    started = System.nanoTime / 1000000 // milliseconds
    thunk
  }
}
To show that it works, create a fixed thread pool with one thread, then feed it a blocking task for, say, 5 seconds. Then create a WrappedFuture and check its started value later. Note the difference in the logged times.
import java.util.concurrent.Executors
import scala.concurrent._
val executorService = Executors.newFixedThreadPool(1)
implicit val ec = ExecutionContext.fromExecutorService(executorService)
scala> println("Before blocked: " + System.nanoTime / 1000000)
Before blocked: 13131636
scala> val blocker = Future(Thread.sleep(5000))
blocker: scala.concurrent.Future[Unit] = scala.concurrent.impl.Promise$DefaultPromise@7e5d9a50
scala> val f = new WrappedFuture(1)
f: WrappedFuture[Int] = WrappedFuture@4c4748bf
scala> f.started
res13: Long = 13136779 // note the difference in time of about 5000 ms from earlier
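As noted above, swapping the zero sentinel for an Option is trivial; here is a minimal sketch of that variant (the @volatile annotation is my addition, so the write from the pool thread is safely visible to readers):

import scala.concurrent.{ExecutionContext, Future}

class WrappedFutureOpt[A](thunk: => A)(implicit ec: ExecutionContext) {
  // None until the executor actually starts running the body
  @volatile var started: Option[Long] = None
  val underlying: Future[A] = Future {
    started = Some(System.nanoTime / 1000000) // milliseconds
    thunk
  }
}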
If you don't control the creation of the Future, however, there is nothing you can do to figure out when it started.
Consider the following two snippets, where the first wraps scalaj-http requests with Future, whilst the second uses async-http-client.
Sync client wrapped with Future using global EC
object SyncClientWithFuture {
  def main(args: Array[String]): Unit = {
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration.Inf
    import scala.concurrent.{Await, Future}
    import scalaj.http.Http

    val delay = "3000"
    val slowApi = s"http://slowwly.robertomurray.co.uk/delay/${delay}/url/https://www.google.co.uk"
    val nestedF = Future(Http(slowApi).asString).flatMap { _ =>
      Future.sequence(List(
        Future(Http(slowApi).asString),
        Future(Http(slowApi).asString),
        Future(Http(slowApi).asString)
      ))
    }
    time { Await.result(nestedF, Inf) }
  }
}
Async client using global EC
object AsyncClient {
  def main(args: Array[String]): Unit = {
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration.Duration.Inf
    import scala.concurrent.{Await, Future}
    import sttp.client._
    import sttp.client.asynchttpclient.future.AsyncHttpClientFutureBackend

    implicit val sttpBackend = AsyncHttpClientFutureBackend()
    val delay = "3000"
    val slowApi = uri"http://slowwly.robertomurray.co.uk/delay/${delay}/url/https://www.google.co.uk"
    val nestedF = basicRequest.get(slowApi).send().flatMap { _ =>
      Future.sequence(List(
        basicRequest.get(slowApi).send(),
        basicRequest.get(slowApi).send(),
        basicRequest.get(slowApi).send()
      ))
    }
    time { Await.result(nestedF, Inf) }
  }
}
The snippets are using:
- Slowwly to simulate a slow API
- scalaj-http
- the async-http-client sttp backend
- time (a small wall-clock timing helper)
The former takes 12 seconds whilst the latter takes 6 seconds. It seems the former behaves as if it is CPU bound, but I do not see how that can be the case, since Future#sequence should execute the HTTP requests in parallel. Why does the synchronous client wrapped in Future behave differently from a proper async client? Is it not the case that the async client does the same kind of thing under the hood, wrapping calls in Futures?
Future#sequence should execute the HTTP requests in parallel?
First of all, Future#sequence doesn't execute anything. It just produces a future that completes when all parameters complete.
Evaluation (execution) of a constructed future starts immediately if there is a free thread in the EC. Otherwise, the task is submitted to a sort of queue until a thread frees up.
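A minimal sketch of that point (assuming the global EC): the futures below are already running, or queued, as soon as they are constructed; sequence only combines the handles:

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Each Future starts (or is queued) at construction time;
// sequence merely folds the three handles into one Future of a List.
val futures = List(Future(1), Future(2), Future(3))
val combined: Future[List[Int]] = Future.sequence(futures)
println(Await.result(combined, Duration.Inf)) // List(1, 2, 3)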
I am fairly sure that in the first case you have single-threaded execution of the futures.
println(scala.concurrent.ExecutionContext.Implicits.global) -> parallelism = 6
I don't know why it is like this; it might be that the other 5 threads are always busy for some reason. You can experiment with an explicitly created EC with 5-10 threads.
The difference in the async case is that you don't create the future yourself; it is provided by the library, which internally doesn't block a thread. It starts the async process, "subscribes" for the result, and returns a future that completes when the result arrives.
Actually, the async lib could have another EC internally, but I doubt it.
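As a sketch of how such a library can hand back a Future without blocking any thread, assuming a hypothetical callback-based client (the AsyncHttp trait below is made up for illustration):

import scala.concurrent.{Future, Promise}

// Hypothetical callback-based client, standing in for the real I/O layer:
trait AsyncHttp {
  def get(url: String)(onSuccess: String => Unit, onFailure: Throwable => Unit): Unit
}

def asFuture(client: AsyncHttp, url: String): Future[String] = {
  val p = Promise[String]()
  // No thread waits here: the client invokes a callback later,
  // which completes the promise and thereby the returned future.
  client.get(url)(s => p.success(s), e => p.failure(e))
  p.future
}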
Btw, Futures are not supposed to contain slow/IO/blocking evaluations unless they are wrapped in blocking { ... }. Otherwise, you can starve the main thread pool (EC) and your app will be completely frozen.
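A minimal sketch of that marker in use: wrapping the slow call in scala.concurrent.blocking lets the global fork-join pool spawn compensating threads instead of starving:

import scala.concurrent.{Future, blocking}
import scala.concurrent.ExecutionContext.Implicits.global

val f = Future {
  blocking {
    Thread.sleep(3000) // stands in for a slow synchronous call, e.g. an HTTP request
    42
  }
}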
I have a use-case in Databricks where an API call has to be made on a dataset of URLs. The dataset has around 100K records.
The max allowed concurrency is 3.
I did the implementation in Scala and ran it in a Databricks notebook. Apart from the one element pending in the queue, I feel something is missing here.
Are the blocking queue and thread pool the right way to tackle this problem?
In the code below I have modified it so that, instead of reading from the dataset, I am sampling from a Seq.
Any help/thoughts will be much appreciated.
import java.time.LocalDateTime
import java.util.concurrent.{ArrayBlockingQueue, BlockingQueue}
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit

var inpQueue: BlockingQueue[(Int, String)] = new ArrayBlockingQueue[(Int, String)](1)
val inpDS = Seq(
  (1, "https://google.com/2X6barD"), (2, "https://google.com/3d9vCgW"),
  (3, "https://google.com/2M02Xz0"), (4, "https://google.com/2XOu2uL"),
  (5, "https://google.com/2AfBWF0"), (6, "https://google.com/36AEKsw"),
  (7, "https://google.com/3enBxz7"), (8, "https://google.com/36ABq0x"),
  (9, "https://google.com/2XBjmiF"), (10, "https://google.com/36Emlen"))
val pool = Executors.newFixedThreadPool(3)

var i = 0
inpDS.foreach { ix =>
  inpQueue.put(ix)
  val t = new ConsumerAPIThread()
  t.setName("MyThread-" + i + " ")
  pool.execute(t)
  i = i + 1
}
println("Final Queue Size = " + inpQueue.size + "\n")

class ConsumerAPIThread() extends Thread {
  var name = ""

  override def run() {
    val urlDetail = inpQueue.take()
    print(this.getName() + " " + Thread.currentThread().getName() + " popped " + urlDetail + " Queue Size " + inpQueue.size + " \n")
    triggerAPI((urlDetail._1, urlDetail._2))
  }

  def triggerAPI(params: (Int, String)) {
    try {
      val result = scala.io.Source.fromURL(params._2)
      println("" + result)
    } catch {
      case ex: Exception => println("Exception caught")
    }
  }

  def ConsumerAPIThread(s: String) {
    name = s
  }
}
So, you have two requirements: the functional one is that you want to process the items in a list asynchronously; the non-functional one is that you want to process no more than three items at once.
Regarding the latter, the nice thing is that, as you have already shown in your question, Java natively exposes a nicely packaged Executor that runs tasks on a fixed-size thread pool, elegantly allowing you to cap the concurrency level if you work with threads.
Moving to the functional requirement, Scala helps by having something that does precisely that as part of its standard API. In particular it uses scala.concurrent.Future, so in order to use it we'll have to reframe triggerAPI in terms of Future. The content of the function is not particularly relevant, so we'll mostly focus on its (revised) signature for now:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext

def triggerAPI(params: (Int, String))(implicit ec: ExecutionContext): Future[Unit] =
  Future {
    // some code that takes some time to run...
  }
Notice that now triggerAPI returns a Future. A Future can be thought of as a read-handle to something that is going to be computed eventually. In particular, this is a Future[Unit], where Unit stands for "we don't particularly care about the output of this function, but mostly about its side effects".
Furthermore, notice that the method now takes an implicit parameter, namely an ExecutionContext. The ExecutionContext is used to provide Futures with some form of environment where the computation happens. Scala has an API to create an ExecutionContext from a java.util.concurrent.ExecutorService, so this will come in handy to run our computation on the fixed thread pool, running no more than three callbacks at any given time.
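For example, a minimal sketch of wiring a fixed pool into an ExecutionContext (the full version appears at the end of this answer):

import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

// Cap concurrency at three: at most three Future bodies run at once.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(3))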
Before moving forward, if you have questions about Futures, ExecutionContexts and implicit parameters, the Scala documentation is your best source of knowledge (here are a couple of pointers: 1, 2).
Now that we have the new triggerAPI method, we can use Future.traverse (here is the documentation for Scala 2.12 -- the latest version at the time of writing is 2.13 but to the best of my knowledge Spark users are stuck on 2.12 for the time being).
The tl;dr of Future.traverse is that it takes some form of container and a function that takes the items in that container and returns a Future of something else. The function will be applied to each item in the container and the result will be a Future of the container of the results. In your case: the container is a List, the items are (Int, String) and the something else you return is a Unit.
This means that you can simply call it like this:
Future.traverse(inpDS)(triggerAPI)
And triggerAPI will be applied to each item in inpDS.
By making sure that the execution context backed by the thread pool is in the implicit scope when calling Future.traverse, the items will be processed with the desired thread pool.
The result of the call is Future[List[Unit]], which is not very interesting and can simply be discarded (as you are only interested in the side effects).
That was a lot of talk; if you want to play around with the code I described, you can do so here on Scastie.
For reference, this is the whole implementation:
import java.util.concurrent.{ExecutorService, Executors}
import scala.concurrent.duration.DurationLong
import scala.concurrent.Future
import scala.concurrent.{ExecutionContext, ExecutionContextExecutorService}
import scala.util.control.NonFatal
import scala.util.{Failure, Success, Try}

val datasets = List(
  (1, "https://google.com/2X6barD"),
  (2, "https://google.com/3d9vCgW"),
  (3, "https://google.com/2M02Xz0"),
  (4, "https://google.com/2XOu2uL"),
  (5, "https://google.com/2AfBWF0"),
  (6, "https://google.com/36AEKsw"),
  (7, "https://google.com/3enBxz7"),
  (8, "https://google.com/36ABq0x"),
  (9, "https://google.com/2XBjmiF")
)

val executor: ExecutorService = Executors.newFixedThreadPool(3)
implicit val executionContext: ExecutionContextExecutorService =
  ExecutionContext.fromExecutorService(executor)

def triggerAPI(params: (Int, String))(implicit ec: ExecutionContext): Future[Unit] =
  Future {
    val (index, _) = params
    println(s"+ started processing $index")
    val start = System.nanoTime() / 1000000
    Iterator.from(0).map(_ + 1).drop(100000000).take(1).toList.head // a noticeably slow operation
    val end = System.nanoTime() / 1000000
    val duration = (end - start).millis
    println(s"- finished processing $index after $duration")
  }

Future.traverse(datasets)(triggerAPI).onComplete { _ =>
  println("* processing is over, shutting down the executor")
  executionContext.shutdown()
}
You need to shut down the executor after your job is done, otherwise it will keep waiting.
Try adding pool.shutdown() at the end of your program.
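A minimal sketch of that pattern, reusing the pool from the question:

import java.util.concurrent.TimeUnit

try {
  // submit all the work to the pool here ...
} finally {
  pool.shutdown()                            // stop accepting new tasks
  pool.awaitTermination(1, TimeUnit.MINUTES) // wait for in-flight tasks to finish
}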
In Akka, I want to send a "status" message to actors in a cluster to ask for their status. These actors may be in various states of health, including dead/unable to respond.
I want to wait up to some time, say 10 seconds, then proceed with whatever results I happened to receive back within that time limit. I don't want to fail the whole thing because 1 or 2 were having issues and didn't respond/timed out at 10 seconds.
I've tried this:
import akka.actor.{ActorRef, ActorSystem}
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Future
import scala.concurrent.duration._
import scala.util.Try

object GetStats {
  def unapply(n: ActorRef)(implicit system: ActorSystem): Option[Future[Any]] = Try {
    implicit val t: Timeout = Timeout(10 seconds)
    n ? "A"
  }.toOption
}
...
val z = List(a, b, c, d) // where a-d are ActorRefs to nodes I want to status
val q = z.collect {
  case GetStats(s) => s
}

// OK, so here 'q' is a List[Future[Any]]
val allInverted = Future.sequence(q) // now we have Future[List[Any]]
val ok = Await.result(allInverted, 10 seconds).asInstanceOf[List[String]]
println(ok)
The problem with this code is that it seems to throw a TimeoutException if 1 or more don't respond. Then I can't get to the responses that did come back.
Assuming you really need to collect at least partial statistics every 10 seconds, the solution is to convert "not responding" into an actual failure.
To achieve this, make the Await timeout a bit longer than the implicit val t: Timeout used for ask. That way the futures themselves (returned from ?) will fail earlier, so you can recover them:
// Works only when AwaitTimeout > AskTimeout
val qfiltered = q.map(_.map(Some(_)).recover { case _ => None }) // better to match TimeoutException here instead of `_`
val allInverted = Future.sequence(qfiltered).map(_.flatten)
Example:
scala> class MyActor extends Actor{ def receive = {case 1 => sender ! 2; case _ =>}}
defined class MyActor
scala> val a = sys.actorOf(Props[MyActor])
a: akka.actor.ActorRef = Actor[akka://1/user/$c#1361310022]
scala> implicit val t: Timeout = Timeout(1 seconds)
t: akka.util.Timeout = Timeout(1 second)
scala> val l = List(a ? 1, a ? 100500).map(_.map(Some(_)).recover{case _ => None})
l: List[scala.concurrent.Future[Option[Any]]] = List(scala.concurrent.impl.Promise$DefaultPromise@7faaa183, scala.concurrent.impl.Promise$DefaultPromise@1b51e0f0)
scala> Await.result(Future.sequence(l).map(_.flatten), 3 seconds)
warning: there were 1 feature warning(s); re-run with -feature for details
res29: List[Any] = List(2)
If you want to know which Future didn't respond, remove the flatten.
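For instance, a sketch that tags each answer with its index so the gaps stay identifiable:

// Pair each answer with its position, so a missing response shows up as (i, None):
val indexed = q.zipWithIndex.map { case (fut, i) =>
  fut.map(r => i -> Some(r)).recover { case _ => i -> None }
}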
Receiving a partial response should be enough for continuously collecting statistics: if some worker actor didn't respond in time, it will respond next time with actual data and without any data loss. But you should correctly manage the worker's lifecycle and not lose (if it matters) any data inside the actor itself.
If the reason for the timeouts is just high pressure on the system, you may consider:
- a separate pool for workers
- backpressure
- caching of input requests (when the system is overloaded)
If the reason for such timeouts is some remote storage, then a partial response is the correct way to handle it, provided the client is ready for that. A WebUI, for example, may warn the user that the shown data may not be complete, using some spinner. But don't forget not to block actors with storage requests (futures may help), or at least move them to a separate thread pool.
If a worker actor didn't respond because of a failure (like an exception), you can still send a notification to the sender from your preRestart hook, so you also receive the reason why there are no statistics from that worker. The only thing here: you should check that the sender is available (it may not be), as sketched below.
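A hedged sketch of that idea (the Worker and computeStats names are made up for illustration):

import akka.actor.{Actor, Status}

class Worker extends Actor {
  def receive = { case "A" => sender() ! computeStats() }

  // preRestart runs while the failing message is still being handled,
  // so sender() still refers to the asker; turn silence into an explicit failure:
  override def preRestart(reason: Throwable, message: Option[Any]): Unit = {
    if (message.isDefined && sender() != context.system.deadLetters)
      sender() ! Status.Failure(reason)
    super.preRestart(reason, message)
  }

  def computeStats(): String = "some stats"
}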
P.S. I hope you don't do Await.result inside some actor; blocking an actor is not recommended, at least for your application's performance. In some cases it may even cause deadlocks or memory leaks. So awaits should be placed somewhere in the facade of your system (if the underlying framework doesn't support futures).
So it may make sense to process your answers asynchronously (you will still need to recover them from failure if some actor doesn't respond):
// actor:
val parent = sender
for (list <- Future.sequence(qfiltered)) {
  parent ! process(list)
}

// in the sender (outside of the actors):
Await.result(actor ? Get, 10 seconds)
I have the following code, and I expected a.success(burncpu(14969)) to return instantly since it's run in a future, but why did it take a long time to run?
import java.util.Date
import scala.concurrent._

val a = Promise[Unit]()
// why does the following take a long time here? shouldn't it be async and return very quickly?
a.success(burncpu(14969))
a.future

def burncpu(a: Int): Int = {
  val point = new Date().getTime()
  while ((new Date()).getTime() - point < a) {
    a
  }
  a
}
You are using the Promise wrong.
The a.success method completes the promise with the given argument; it doesn't run the expression you pass to it asynchronously. The argument is evaluated on the calling thread before success is even invoked.
What you probably want to do is something like this:
val f = Future(burncpu(6000))
Assuming you have an ExecutionContext available (if you don't, you can do import ExecutionContext.Implicits.global), this will construct a Future which will run your function asynchronously.
You can see how it works in Scala REPL (f.value returns None until the method has returned)
scala> val f = Future(burncpu(6000))
f: scala.concurrent.Future[Int] = scala.concurrent.impl.Promise$DefaultPromise@4d4d8fcf
scala> f.value
res27: Option[scala.util.Try[Int]] = None
scala> f.value
res28: Option[scala.util.Try[Int]] = None
scala> f.value
res29: Option[scala.util.Try[Int]] = Some(Success(6000))
Promise.success is not executed asynchronously; basically your code is the equivalent of:
Future.successful(burncpu(14969))
You could try this:
Future {
  burncpu(14969)
}
This will call Future.apply and execute your function asynchronously.
The body of a Future is indeed executed asynchronously. However, nothing in the API suggests that completing the Promise, i.e. calling success in your case, will be executed asynchronously. That is your responsibility to ensure.
It makes sense when you consider what a Promise is: a "marker" handed to another execution path, notifying it that a computation has been completed and a result (or failure) is available.
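A minimal sketch of that split, reusing burncpu from the question: the slow work runs inside a Future on another thread, which completes the promise, so the caller gets the read-side handle back immediately:

import scala.concurrent.{Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global

def asyncBurn(ms: Int): Future[Int] = {
  val p = Promise[Int]()
  Future {
    p.success(burncpu(ms)) // the CPU burn happens off the calling thread
  }
  p.future // returned immediately; completes when the Future's body finishes
}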
I have written a Scala (2.9.1-1) application that needs to process several million rows from a database query. I am converting the ResultSet to a Stream using the technique shown in the answer to one of my previous questions:
class Record(...)

val resultSet = statement.executeQuery(...)

new Iterator[Record] {
  def hasNext = resultSet.next()
  def next = new Record(resultSet.getString(1), resultSet.getInt(2), ...)
}.toStream.foreach { record => ... }
and this has worked very well.
Since the body of the foreach closure is very CPU intensive, and as a testament to the practicality of functional programming, if I add a .par before the foreach, the closures get run in parallel with no other effort, except to make sure that the body of the closure is thread safe (it is written in a functional style with no mutable data except printing to a thread-safe log).
However, I am worried about memory consumption. Does the .par cause the entire result set to load in RAM, or does the parallel operation load only as many rows as it has active threads? I've allocated 4G to the JVM (64-bit with -Xmx4g), but in the future I will be running it on even more rows and worry that I'll eventually get an out-of-memory error.
Is there a better pattern for doing this kind of parallel processing in a functional manner? I've been showing this application to my co-workers as an example of the value of functional programming and multi-core machines.
If you look at the scaladoc of Stream, you will notice that the definition class of par is the Parallelizable trait... and, if you look at the source code of this trait, you will notice that it takes each element from the original collection and puts them into a combiner; thus, you will load each row into a ParSeq:
def par: ParRepr = {
  val cb = parCombiner
  for (x <- seq) cb += x
  cb.result
}

/** The default `par` implementation uses the combiner provided by this method
 *  to create a new parallel collection.
 *
 *  @return a combiner for the parallel collection of type `ParRepr`
 */
protected[this] def parCombiner: Combiner[A, ParRepr]
A possible solution is to parallelize your computation explicitly, using actors for example. You can take a look at this example from the akka documentation, which might be helpful in your context.
The new akka stream library is the fix you're looking for:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Source, Sink}

def iterFromQuery(): Iterator[Record] = {
  val resultSet = statement.executeQuery(...)
  new Iterator[Record] {
    def hasNext = resultSet.next()
    def next = new Record(...)
  }
}

def cpuIntensiveFunction(record: Record) = {
  ...
}

implicit val actorSystem = ActorSystem()
implicit val materializer = ActorMaterializer()
implicit val execContext = actorSystem.dispatcher

val poolSize = 10 // number of Records in memory at once

val stream =
  Source.fromIterator(() => iterFromQuery())
    .runWith(Sink.foreachParallel(poolSize)(cpuIntensiveFunction))

stream onComplete { _ => actorSystem.shutdown() }