Is synchronous HTTP request wrapped in a Future considered CPU or IO bound? - scala

Consider the following two snippets: the first wraps scalaj-http requests in a Future, whilst the second uses async-http-client.
Sync client wrapped with Future using global EC
object SyncClientWithFuture {
  def main(args: Array[String]): Unit = {
    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration.Duration.Inf
    import scala.concurrent.ExecutionContext.Implicits.global
    import scalaj.http.Http
    val delay = "3000"
    val slowApi = s"http://slowwly.robertomurray.co.uk/delay/${delay}/url/https://www.google.co.uk"
    // Each Future(...) runs a synchronous request, so it holds a pool thread until the response arrives
    val nestedF = Future(Http(slowApi).asString).flatMap { _ =>
      Future.sequence(List(
        Future(Http(slowApi).asString),
        Future(Http(slowApi).asString),
        Future(Http(slowApi).asString)
      ))
    }
    time { Await.result(nestedF, Inf) } // `time` is the timing helper listed below
  }
}
Async client using global EC
object AsyncClient {
  def main(args: Array[String]): Unit = {
    import scala.concurrent.{Await, Future}
    import scala.concurrent.duration.Duration.Inf
    import scala.concurrent.ExecutionContext.Implicits.global
    import sttp.client._
    import sttp.client.asynchttpclient.future.AsyncHttpClientFutureBackend
    implicit val sttpBackend = AsyncHttpClientFutureBackend()
    val delay = "3000"
    val slowApi = uri"http://slowwly.robertomurray.co.uk/delay/${delay}/url/https://www.google.co.uk"
    // send() returns a Future supplied by the backend; no Future(...) wrapper is created here
    val nestedF = basicRequest.get(slowApi).send().flatMap { _ =>
      Future.sequence(List(
        basicRequest.get(slowApi).send(),
        basicRequest.get(slowApi).send(),
        basicRequest.get(slowApi).send()
      ))
    }
    time { Await.result(nestedF, Inf) } // `time` is the timing helper listed below
  }
}
The snippets are using
Slowwly to simulate slow API
scalaj-http
async-http-client sttp backend
time
The former takes 12 seconds whilst the latter takes 6 seconds. It seems the former behaves as if it is CPU bound; however, I do not see how that is the case, since Future#sequence should execute the HTTP requests in parallel. Why does a synchronous client wrapped in a Future behave differently from a proper async client? Is it not the case that the async client does the same kind of thing, wrapping calls in Futures under the hood?

Future#sequence should execute the HTTP requests in parallel?
First of all, Future#sequence doesn't execute anything. It just produces a future that completes when all of its input futures complete.
Evaluation (execution) of a constructed future starts immediately if there is a free thread in the EC. Otherwise, the task is simply submitted to a queue.
I am sure that in the first case the futures are effectively executed on a single thread.
println(scala.concurrent.ExecutionContext.Implicits.global) -> parallelism = 6
I don't know why it is like this; it might be that the other 5 threads are always busy for some reason. You can experiment with an explicitly created EC with 5-10 threads.
The difference in the async case is that you don't create the future yourself: it is provided by the library, which internally doesn't block the thread. It starts the asynchronous operation, "subscribes" for the result, and returns a future that completes when the result arrives.
The async library could actually have its own EC internally, but I doubt it.
By the way, Futures are not supposed to contain slow, IO-bound or otherwise blocking evaluations unless they are wrapped in blocking. Otherwise you can potentially block the main thread pool (EC) and your app will be completely frozen.
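To experiment with the suggestion about an explicitly created EC, here is a minimal sketch. It assumes the synchronous HTTP call can be stood in for by Thread.sleep; the names slowCall and runOnPool and the pool sizes are illustrative and do not come from the question.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BlockingPoolSketch {
  // Hypothetical stand-in for a synchronous HTTP call: it simply blocks the calling thread.
  def slowCall(id: Int, millis: Long): Int = { Thread.sleep(millis); id }

  // Runs `n` blocking calls on a fixed pool of `threads` threads.
  // Returns the results together with the elapsed wall-clock time in milliseconds.
  def runOnPool(n: Int, threads: Int, millis: Long): (List[Int], Long) = {
    val pool = Executors.newFixedThreadPool(threads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      val start = System.nanoTime()
      val all = Future.sequence((1 to n).toList.map(i => Future(slowCall(i, millis))))
      val results = Await.result(all, 30.seconds)
      (results, (System.nanoTime() - start) / 1000000)
    } finally pool.shutdown()
  }

  def main(args: Array[String]): Unit = {
    println(runOnPool(n = 3, threads = 1, millis = 300)) // calls queue up behind one thread
    println(runOnPool(n = 3, threads = 3, millis = 300)) // calls overlap
  }
}
```

With one thread the three calls run one after another; with three threads they overlap, so the elapsed time should be close to that of a single call. Timing the question's snippet against such a pool would show whether thread starvation is really the cause.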

Related

Scala Thread Pool - Invoking API's Concurrently

I have a use case in Databricks where an API call has to be made for each record in a dataset of URLs. The dataset has around 100K records.
The maximum allowed concurrency is 3.
I did the implementation in Scala and ran it in a Databricks notebook. Apart from the one element left pending in the queue, I feel something is missing here.
Are a blocking queue and a thread pool the right way to tackle this problem?
In the code below I have simplified things: instead of reading from the dataset I am sampling from a Seq.
Any help or thoughts will be much appreciated.
import java.time.LocalDateTime
import java.util.concurrent.{ArrayBlockingQueue,BlockingQueue}
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit;
var inpQueue:BlockingQueue[(Int, String)] = new ArrayBlockingQueue[(Int, String)](1)
val inpDS = Seq((1,"https://google.com/2X6barD"), (2,"https://google.com/3d9vCgW"), (3,"https://google.com/2M02Xz0"), (4,"https://google.com/2XOu2uL"), (5,"https://google.com/2AfBWF0"), (6,"https://google.com/36AEKsw"), (7,"https://google.com/3enBxz7"), (8,"https://google.com/36ABq0x"), (9,"https://google.com/2XBjmiF"), (10,"https://google.com/36Emlen"))
val pool = Executors.newFixedThreadPool(3)
var i = 0
inpDS.foreach{
ix => {
inpQueue.put(ix)
val t = new ConsumerAPIThread()
t.setName("MyThread-"+i+" ")
pool.execute(t)
}
i = i+1
}
println("Final Queue Size = " +inpQueue.size+"\n")
class ConsumerAPIThread() extends Thread
{
var name =""
override def run()
{
val urlDetail = inpQueue.take()
print(this.getName()+" "+ Thread.currentThread().getName() + " popped "+urlDetail+" Queue Size "+inpQueue.size+" \n")
triggerAPI((urlDetail._1, urlDetail._2))
}
def triggerAPI(params:(Int,String)){
try{
val result = scala.io.Source.fromURL(params._2)
println("" +result)
}catch{
case ex:Exception => {
println("Exception caught")
}
}
}
def ConsumerAPIThread(s:String)
{
name = s;
}
}
So, you have two requirements: the functional one is that you want to process asynchronously the items in a list, the non-functional one is that you want to not process more than three items at once.
Regarding the latter, the nice thing is that, as you have already shown in your question, Java natively exposes a nicely packaged Executor that runs tasks on a thread pool with a fixed size, elegantly allowing you to cap the concurrency level if you work with threads.
Moving to the functional requirement, Scala helps by having something that does precisely that as part of its standard API. In particular it uses scala.concurrent.Future, so in order to use it we'll have to reframe triggerAPI in terms of Future. The content of the function is not particularly relevant, so we'll mostly focus on its (revised) signature for now:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext
def triggerAPI(params: (Int, String))(implicit ec: ExecutionContext): Future[Unit] =
Future {
// some code that takes some time to run...
}
Notice that triggerAPI now returns a Future. A Future can be thought of as a read handle to something that is eventually going to be computed. In particular, this is a Future[Unit], where Unit stands for "we don't particularly care about the output of this function, but mostly about its side effects".
Furthermore, notice that the method now takes an implicit parameter, namely an ExecutionContext. The ExecutionContext is used to provide Futures with some form of environment where the computation happens. Scala has an API to create an ExecutionContext from a java.util.concurrent.ExecutorService, so this will come in handy to run our computation on the fixed thread pool, running no more than three callbacks at any given time.
Before moving forward, if you have questions about Futures, ExecutionContexts and implicit parameters, the Scala documentation is your best source of knowledge (here are a couple of pointers: 1, 2).
Now that we have the new triggerAPI method, we can use Future.traverse (here is the documentation for Scala 2.12 -- the latest version at the time of writing is 2.13 but to the best of my knowledge Spark users are stuck on 2.12 for the time being).
The tl;dr of Future.traverse is that it takes some form of container and a function that takes the items in that container and returns a Future of something else. The function will be applied to each item in the container and the result will be a Future of the container of the results. In your case: the container is a List, the items are (Int, String) and the something else you return is a Unit.
This means that you can simply call it like this:
Future.traverse(inpDS)(triggerAPI)
And triggerAPI will be applied to each item in inpDS.
By making sure that the execution context backed by the thread pool is in the implicit scope when calling Future.traverse, the items will be processed with the desired thread pool.
The result of the call is Future[List[Unit]], which is not very interesting and can simply be discarded (as you are only interested in the side effects).
That was a lot of talk, if you want to play around with the code I described you can do so here on Scastie.
For reference, this is the whole implementation:
import java.util.concurrent.{ExecutorService, Executors}
import scala.concurrent.duration.DurationLong
import scala.concurrent.Future
import scala.concurrent.{ExecutionContext, ExecutionContextExecutorService}
import scala.util.control.NonFatal
import scala.util.{Failure, Success, Try}
val datasets = List(
(1, "https://google.com/2X6barD"),
(2, "https://google.com/3d9vCgW"),
(3, "https://google.com/2M02Xz0"),
(4, "https://google.com/2XOu2uL"),
(5, "https://google.com/2AfBWF0"),
(6, "https://google.com/36AEKsw"),
(7, "https://google.com/3enBxz7"),
(8, "https://google.com/36ABq0x"),
(9, "https://google.com/2XBjmiF")
)
val executor: ExecutorService = Executors.newFixedThreadPool(3)
implicit val executionContext: ExecutionContextExecutorService = ExecutionContext.fromExecutorService(executor)
def triggerAPI(params: (Int, String))(implicit ec: ExecutionContext): Future[Unit] =
Future {
val (index, _) = params
println(s"+ started processing $index")
val start = System.nanoTime() / 1000000
Iterator.from(0).map(_ + 1).drop(100000000).take(1).toList.head // a noticeably slow operation
val end = System.nanoTime() / 1000000
val duration = (end - start).millis
println(s"- finished processing $index after $duration")
}
Future.traverse(datasets)(triggerAPI).onComplete {
case result =>
println("* processing is over, shutting down the executor")
executionContext.shutdown()
}
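To check that the cap of three actually holds, here is a small sketch that counts how many tasks are running at the same time. The names ConcurrencyCapSketch and peakConcurrency, and the 50 ms sleep standing in for the API call, are assumptions made for illustration.

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ConcurrencyCapSketch {
  // Runs `items` tasks through Future.traverse on a pool of `cap` threads and
  // returns the highest number of tasks that were observed running at once.
  def peakConcurrency(items: Int, cap: Int): Int = {
    val pool = Executors.newFixedThreadPool(cap)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    val lock = new Object
    var running = 0 // tasks currently inside the Future body (guarded by lock)
    var peak = 0    // largest value `running` has reached (guarded by lock)

    def task(i: Int): Future[Unit] = Future {
      lock.synchronized { running += 1; peak = math.max(peak, running) }
      Thread.sleep(50) // stand-in for the real API call
      lock.synchronized { running -= 1 }
    }

    try {
      Await.result(Future.traverse((1 to items).toList)(task), 30.seconds)
      lock.synchronized(peak)
    } finally pool.shutdown()
  }

  def main(args: Array[String]): Unit =
    println(s"peak concurrency: ${peakConcurrency(items = 10, cap = 3)}")
}
```

Because the pool never has more than `cap` threads, the observed peak cannot exceed `cap`, however many items are traversed.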
You need to shut down the executor once your job is done; otherwise the program will keep waiting.
Try adding pool.shutdown() at the end of your program.

Is map of Future lazy or not?

Basically I mean:
for(v <- Future(long time operation)) yield v*someOtherValue
This expression returns another Future, but the question is: is the v*someOtherValue operation lazy or not? Will this expression block on getting the value of Future(long time operation)?
Or it is like a chain of callbacks?
A short experiment can test this question.
import concurrent._;
import concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
object TheFuture {
def main(args: Array[String]): Unit = {
val fut = for (v <- Future { Thread.sleep(2000) ; 10 }) yield v * 10;
println("For loop is finished...")
println(Await.ready(fut, Duration.Inf).value.get);
}
}
If we run this, we see For loop is finished... almost immediately, and then two seconds later, we see the result. So the act of performing map or similar operations on a future is not blocking.
A map (or, equivalently, your for comprehension) on a Future is not lazy: it will be executed as soon as possible on another thread. However, since it runs on another thread, it isn't blocking, either.
If you want to do the definition and execution of the Future separately, then you have to use something like a Monix Task.
https://monix.io/api/3.0/monix/eval/Task.html
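Without adding a library, a rough approximation of separating definition from execution is to wrap the Future in a def (or a () => Future[...] value), so that nothing is submitted until it is called. This is only a sketch; the names DeferredFutureSketch, deferred and runNow are made up for illustration, and it does not give you the other features of a Monix Task.

```scala
import java.util.concurrent.atomic.AtomicBoolean
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object DeferredFutureSketch {
  // Flipped to true by the Future body, so we can observe when it actually starts.
  val started = new AtomicBoolean(false)

  // A `def` only describes the computation: each call creates and starts a new Future.
  def deferred(): Future[Int] = Future { started.set(true); 10 }.map(_ * 10)

  // Starts the computation and blocks for the result (for demonstration only).
  def runNow(): Int = Await.result(deferred(), 5.seconds)

  def main(args: Array[String]): Unit = {
    println(s"started before call: ${started.get}")
    val result = runNow()
    println(s"started after call: ${started.get}, result: $result")
  }
}
```

Note that, unlike a val, every call to deferred() runs the computation again.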

Future declaration seems independent from promise

I was reading this article http://danielwestheide.com/blog/2013/01/16/the-neophytes-guide-to-scala-part-9-promises-and-futures-in-practice.html and I was looking at this code:
object Government {
def redeemCampaignPledge(): Future[TaxCut] = {
val p = Promise[TaxCut]()
Future {
println("Starting the new legislative period.")
Thread.sleep(2000)
p.success(TaxCut(20))
println("We reduced the taxes! You must reelect us!!!!1111")
}
p.future
}
}
I've seen this type of code a few times and I'm confused. So we have this Promise:
val p = Promise[TaxCut]()
And this Future:
Future {
println("Starting the new legislative period.")
Thread.sleep(2000)
p.success(TaxCut(20))
println("We reduced the taxes! You must reelect us!!!!1111")
}
I don't see any assignment between them so I don't understand: How are they connected?
I don't see any assignment between them so I don't understand: How are
they connected?
A Promise is one way of creating a Future.
When you use Future { } and import scala.concurrent.ExecutionContext.Implicits.global, you're queuing a function on one of Scala's threadpool threads. But, that isn't the only way to generate a Future. A Future need not necessarily be scheduled on a different thread.
What this example does is:
Creates a Promise[TaxCut] which will be completed sometime in the near future.
Queues a function to be run on a threadpool thread via Future.apply. This function also completes the Promise via the Promise.success method.
Returns the future generated by the promise via Promise.future. When this future is returned, it may not be completed yet, depending on how quickly the function queued via Future.apply runs (the author of the article conveys this with Thread.sleep, which delays the completion of the future).
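The connection can be made more visible with a sketch that completes the Promise from a plain thread, with no Future { } block and no ExecutionContext at all. The names PromiseLinkSketch, delayedAnswer and awaitAnswer are invented for this illustration.

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.duration._

object PromiseLinkSketch {
  // The Promise is the write side; p.future is the read side of the same one-shot cell.
  def delayedAnswer(delayMillis: Long): Future[Int] = {
    val p = Promise[Int]()
    val worker = new Thread(new Runnable {
      def run(): Unit = { Thread.sleep(delayMillis); p.success(42) } // completes p.future
    })
    worker.setDaemon(true)
    worker.start()
    p.future // returned immediately, normally before the worker has completed it
  }

  // Blocks until the promise has been completed (for demonstration only).
  def awaitAnswer(delayMillis: Long): Int =
    Await.result(delayedAnswer(delayMillis), 5.seconds)

  def main(args: Array[String]): Unit =
    println(awaitAnswer(200))
}
```

Whoever holds p can complete it; whoever holds p.future can only observe the outcome. That shared reference is the whole link between the two, which is why no assignment is needed.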

Is there any point in blocking for a future?

Suppose I have an application serving many requests. One of the requests takes a while to complete. I have the following code:
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.concurrent.Await
import scala.concurrent.Future
def longReq(data: String): String = {
  val respFuture: Future[String] = Future {
    // some code that computes resp but takes a long time
    // can my application process other requests during this time?
    val resp: String = ??? // time-consuming step
    resp
  }
  Await.result(respFuture, 2.minutes)
}
If I don't use futures at all, the application will be blocked until resp is computed and no other requests can be served in parallel during that time. However, if I use futures and then block for resp using Await, will the application be able to serve other requests in parallel while resp is being computed?
In your particular example, assuming that longReq is called serially by a request loop, the answer is No, it cannot process anything else. For that longReq would have to return a future instead:
def longReq(data: String): Future[String] = {
  Future {
    // some code that computes resp but takes a long time
    // can my application process other requests during this time?
    val resp: String = ??? // time-consuming step
    resp
  }
}
Of course that just pushes the reason you likely used Await.result further down the line.
The purpose of using Future is to avoid blocking, but it is a turtles-all-the-way-down buy-in. If you want to use a Future, the final recipient has to be able to deal with getting the result asynchronously, i.e. your request loop must have a way to capture the caller so that, when the future is finally completed, the caller can be told about the result.
Let's assume your request loop receives a request object that has a response callback; then you would call longReq like this (assuming the version of longReq that returns a Future):
def asyncCall(request: Request): Unit = {
longReq(request.data).map( result => request.response(result) )
}
The most common scenario where you would use the flow is HTTP or other servers where the synchronous Request => Response cycle has an async equivalent of Request => Future[Response], which pretty much any modern server framework offers (Play, Finatra, Scalatra, etc.)
When to use Await.result
The one scenario where it might be reasonable to use Await.result is if you have a bunch of Futures and are willing to block while they all complete (assuming the version of longReq that returns a Future):
val futures = allData.map(longReq) // List[Future[String]]
val combined = Future.sequence(futures) // Future[List[String]]
val responses = Await.result(combined, 10.seconds) // List[String]
Of course, combined being a Future, it would still be better to map over it and handle the result asynchronously.
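A minimal sketch of that non-blocking variant follows. The longReq here is a hypothetical stand-in that upper-cases its input after a short sleep, and summary and summaryNow are illustrative names.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object CombineSketch {
  // Hypothetical stand-in for the Future-returning longReq.
  def longReq(data: String): Future[String] = Future { Thread.sleep(100); data.toUpperCase }

  // Combines all responses without blocking; the caller receives a Future.
  def summary(allData: List[String]): Future[String] =
    Future.sequence(allData.map(longReq)).map(_.mkString(","))

  // Blocking helper, used only at the very edge of this demo.
  def summaryNow(allData: List[String]): String = Await.result(summary(allData), 5.seconds)

  def main(args: Array[String]): Unit = {
    // The println runs as a continuation once all requests are done.
    val printed = summary(List("a", "b", "c")).map(s => println(s"responses: $s"))
    Await.ready(printed, 5.seconds) // only so that this demo JVM does not exit early
  }
}
```

In a server, the final Await would disappear and the Future would be handed back to the framework instead.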

Do Futures always end up not returning anything?

Given that we must avoid...
1) Modifying state
2) Blocking
...what is a correct end-to-end usage for a Future?
The general practice in using Futures seems to be transforming them into other Futures by using map, flatMap etc. but it's no good creating Futures forever.
Will there always be a call to onComplete somewhere, with methods writing the result of the Future to somewhere external to the application (e.g. web socket; the console; a message broker) or is there a non-blocking way of accessing the result?
All of the information on Futures in the Scaladocs - http://docs.scala-lang.org/overviews/core/futures.html seem to end up writing to the console. onComplete doesn't return anything, so presumably we have to end up doing some "fire-and-forget" IO.
e.g. a call to println
f onComplete {
case Success(number) => println(number)
case Failure(err) => println("An error has occured: " + err.getMessage)
}
But what about in more complex cases where we want to do more with the result of the Future?
As an example, in the Play framework Action.async can return a Future[Result] and the framework handles the rest. Will it eventually have to expect never to get a result from the Future?
We know the user needs to be returned a Result, so how can a framework do this using only a Unit method?
Is there a non-blocking way to retrieve the value of a future and use it elsewhere within the application, or is a call to Await inevitable?
Best practice is to use callbacks such as onComplete, onSuccess and onFailure for side-effecting operations, e.g. logging, monitoring, I/O. (Note that onSuccess and onFailure are deprecated since Scala 2.12 and removed in 2.13; onComplete and foreach cover the same ground.)
If you need to continue with the result of your Future computation, as opposed to performing a side-effecting operation, you should use map to get access to the result of your computation and compose over it.
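As a rough sketch of that split, the pipeline below composes values with map and flatMap and keeps the side effect in a single callback at the edge. The functions fetchNumber, describe and pipeline are hypothetical and exist only for this example.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success}

object ComposeSketch {
  def fetchNumber(): Future[Int] = Future { 21 }
  def describe(n: Int): Future[String] = Future { s"the answer is $n" }

  // map transforms the value; flatMap chains another asynchronous step.
  def pipeline(): Future[String] = fetchNumber().map(_ * 2).flatMap(describe)

  // Blocking helper, used only so the result can be inspected in a demo or test.
  def pipelineNow(): String = Await.result(pipeline(), 5.seconds)

  def main(args: Array[String]): Unit = {
    // andThen runs the side effect and passes the original result through.
    val logged = pipeline().andThen {
      case Success(text) => println(text)
      case Failure(err)  => println("An error has occurred: " + err.getMessage)
    }
    Await.ready(logged, 5.seconds) // only so that this demo JVM does not exit early
  }
}
```

A framework such as Play does essentially the last step for you: it registers its own callback on the Future[Result] you return and writes the response when it completes.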
onComplete returns Unit, yes. That's because it's an asynchronous trigger: you register a callback in order to gather the result.
From your referenced scaladoc (with my comments):
// first assign the future with expected return type to a variable.
val f: Future[List[String]] = Future {
session.getRecentPosts
}
// immediately register the callbacks
f onFailure {
case t => println("An error has occurred: " + t.getMessage)
}
f onSuccess {
case posts => for (post <- posts) println(post)
}
Or instead of println-ing you could do something with the result:
f onSuccess {
case posts: List[String] => someFunction(posts)
}
Try this out:
import scala.concurrent.duration._
import scala.concurrent._
import scala.concurrent.ExecutionContext.Implicits.global
val f: Future[Int] = Future { 43 }
// a zero timeout can throw a TimeoutException if the future has not completed yet
val result: Int = Await.result(f, 1.second)
So what is going on here?
You're defining a computation to be executed on a different thread.
So your Future { 43 } returns immediately.
Then you can wait for it and gather the result (via Await.result) or define computation on it without waiting for it to be completed (via map etc...)
Actually, the kind of Futures you are talking about are used for side effects. The result returned by a Future depends on its type:
val f = Future[Int] { 42 }
For example, I could send the result of Future[Int] to another Future :
val f2 = f.flatMap(integer => Future{ println(integer) }) // may print 42
As you know, a future is a process that happens concurrently. So you can get its result in the future (that is, using methods such as onComplete) OR by explicitly blocking the current thread until it gets a value :
import scala.concurrent.Await
import akka.util.Timeout
import scala.concurrent.duration._
implicit val timeout = Timeout(5 seconds)
val integer = Await.result(Future { 42 }, timeout.duration)
Usually when you start dealing with asynchronous processes, you have to think in terms of reactions which may never occur. Using chained Futures is like declaring a possible chain of events which could be broken at any moment. Therefore, waiting for a Future's value is definitely not a good practice as you may never get it :
val integer = Await.result(Future { throw new RuntimeException() }, timeout.duration) // will throw an uncaught exception
Try to think more in terms of events, than in procedures.
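One way to treat failure as just another event, rather than as an exception thrown out of Await, is recover. The sketch below is illustrative only: risky, safe and safeNow are made-up names, and the fallback value -1 is arbitrary.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object RecoverSketch {
  // A computation that may fail inside the Future.
  def risky(fail: Boolean): Future[Int] =
    Future { if (fail) throw new RuntimeException("boom") else 42 }

  // recover turns a failed Future into a successful one carrying a fallback value.
  def safe(fail: Boolean): Future[Int] =
    risky(fail).recover { case _: RuntimeException => -1 }

  // Blocking helper, used only so the outcome can be inspected in a demo or test.
  def safeNow(fail: Boolean): Int = Await.result(safe(fail), 5.seconds)

  def main(args: Array[String]): Unit = {
    println(safeNow(fail = false))
    println(safeNow(fail = true))
  }
}
```

The failure stays inside the chain of Futures, so downstream steps can keep composing with map and flatMap without a try/catch around a blocking call.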