Spark httpclient job does not parallelize well [duplicate] - scala

I build an RDD from a list of URLs, and then try to fetch data with some async HTTP calls.
I need all the results before doing other calculations.
Ideally, I need to make the HTTP calls on different nodes for scaling considerations.
I did something like this:
//init spark
val sparkContext = new SparkContext(conf)
val datas = Seq[String]("url1", "url2")
//create rdd
val rdd = sparkContext.parallelize[String](datas)
//httpCall return Future[String]
val requests = rdd.map((url: String) => httpCall(url))
//await all results (Future.sequence may be better)
val responses = requests.map(r => Await.result(r, 10.seconds))
//print responses
responses.collect().foreach((s: String) => println(s))
//stop spark
sparkContext.stop()
This works, but the Spark job never finishes!
So I wonder what the best practices are for dealing with Future using Spark (or Future[RDD]).
I think this use case looks pretty common, but I didn't find any answer yet.
Best regards

this use case looks pretty common
Not really, because it simply doesn't work as you (probably) expect. Since each task operates on standard Scala Iterators, these operations will be squashed together. It means that all operations will be blocking in practice. Assuming you have three URLs ["x", "y", "z"], your code will be executed in the following order:
Await.result(httpCall("x", 10.seconds))
Await.result(httpCall("y", 10.seconds))
Await.result(httpCall("z", 10.seconds))
You can easily reproduce the same behavior locally. If you want to execute your code asynchronously you should handle this explicitly using mapPartitions:
rdd.mapPartitions(iter => {
  ??? // Submit requests
  ??? // Wait until all requests completed and return Iterator of results
})
but this is relatively tricky. There is no guarantee all data for a given partition fits into memory so you'll probably need some batching mechanism as well.
All of that being said, I couldn't reproduce the problem you've described, so it may be some configuration issue or a problem with httpCall itself.
On a side note, allowing a single timeout to kill the whole task doesn't look like a good idea.
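Putting both points together, here is a minimal sketch of the mapPartitions approach, assuming httpCall(url): Future[String] from the question; the batch size and timeout are arbitrary, and each call's failure is captured as a Try so a single timeout cannot kill the whole task:
import org.apache.spark.rdd.RDD
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success, Try}

val results: RDD[Try[String]] = rdd.mapPartitions { urls =>
  // process the partition in fixed-size batches so we never hold
  // more than one batch of in-flight requests and responses in memory
  urls.grouped(100).flatMap { batch =>
    val inFlight: Seq[Future[Try[String]]] =
      batch.map(url => httpCall(url).map(Success(_)).recover { case e => Failure(e) })
    Await.result(Future.sequence(inFlight), 10.minutes)
  }
}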

I couldn't find an easy way to achieve this, but after several iterations this is what I did, and it's working for a huge list of queries. Basically we used this to batch a huge query into multiple sub-queries.
// Break down your huge workload into smaller chunks; in this case a huge query string is broken
// down into a small set of subqueries
// If needed, you can optimize further by providing an explicit partition count when parallelizing
val queries = sqlContext.sparkContext.parallelize[String](subQueryList.toSeq)

// Then map each one of those to a Spark task; in this case each task is a Future that returns a string
val tasks: RDD[Future[String]] = queries.map { query =>
  val task = makeHttpCall(query) // method returns the HTTP call response as a Future[String]
  task.onFailure {
    case t => logger.error("execution failed: " + t.getMessage)
  }
  // return the recovered future; if the recovery is not returned, it is silently discarded
  task.recover {
    case ex =>
      logger.error("recover: " + ex.getMessage)
      "" // fallback value so a single failed call doesn't fail the whole job
  }
}
// Note: the HTTP call is still not invoked at this point; you are only including it as part of the lineage
// Then, in each partition, you combine all the futures (there may be several tasks per partition) into one with Future.sequence
// and Await the result; this blocks until every future in that sequence is resolved
val contentRdd = tasks.mapPartitions[String] { f: Iterator[Future[String]] =>
  val searchFuture: Future[Iterator[String]] = Future sequence f
  Await.result(searchFuture, threadWaitTime.seconds)
}
// Note: at this point you can apply any further transformations to this RDD and they will be appended to the lineage.
// Only when you perform an action on that RDD will the mapPartitions step be evaluated,
// firing the full set of sub-query HTTP requests in parallel and collecting the data into a single RDD.
If you don't want to perform any transformation on the content, like parsing the response payload, then you could use foreachPartition instead of mapPartitions to perform all those HTTP calls immediately, as sketched below.
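A minimal sketch of that variant, reusing tasks and threadWaitTime from the code above (and assuming the same implicit ExecutionContext is in scope); since foreachPartition is an action, the calls fire as soon as it runs:
tasks.foreachPartition { futures =>
  // block until every future in this partition has completed;
  // the responses are discarded because no further transformation is needed
  Await.result(Future.sequence(futures), threadWaitTime.seconds)
}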

I finally made it work using scalaj-http instead of Dispatch.
Calls are synchronous, but this matches my use case.
I think the Spark job never finished when using Dispatch because the HTTP connection was not closed properly.
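For example, with scalaj-http the map body becomes a plain blocking call; a sketch, assuming the rdd of URLs from the question:
import scalaj.http.Http

// each task performs a blocking request and returns the response body; no futures involved
val responses = rdd.map(url => Http(url).asString.body)
responses.collect().foreach(println)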
Best Regards

This won't work.
You cannot expect the request objects to be distributed and the responses collected over a cluster by other nodes. If you do, the Spark calls on the futures will never end; the futures will never work in this case.
If your map() makes synchronous (HTTP) requests, then collect the responses within the same action/transformation call and then subject the results (responses) to further map/reduce/other calls.
In your case, rewrite the logic to collect the responses for each call synchronously and remove the notion of futures; then all should be fine.

Related

Alternative to using Future.sequence inside Akka Actors

We have a fairly complex system developed using Akka HTTP and the actor model. Until now, we have extensively used the ask pattern and mixed Futures and Actors.
For example, an actor gets a message, needs to execute 3 operations in parallel, combine a result out of that data, and return it to the sender. What we used is:
declare a new variable in the actor's receive callback to store the sender (since we use Future.map, sender may refer to a different sender by then).
execute all those 3 futures in parallel using Future.sequence (sometimes it's a call to a function that returns a future, and sometimes it's an ask to another actor to get something from it)
combine the results of all 3 futures using map or flatMap on the Future.sequence result
pipe the final result to the sender using pipeTo
Here is a code simplified:
case RetrieveData(userId, `type`, id, lang, paging, timeRange, platform) => {
  val sen = sender
  val result: Future[Seq[Map[String, Any]]] =
    if (paging.getOrElse(Paging(0, 0)) == Paging(0, 0)) Future.successful(Seq.empty)
    else {
      val start = System.currentTimeMillis()
      val profileF = profileActor ? Get(userId)
      Future.sequence(Seq(profileF, getSymbols(`type`, id), getData(paging, timeRange, platform))).map { result =>
        logger.info(s"Got ${result.size} news in ${System.currentTimeMillis() - start} ms")
        result
      }.recover { case ex: Throwable =>
        logger.error(s"Failure on getting data: ${ex.getMessage}", ex)
        Seq.empty
      }
    }
  result.pipeTo(sen)
}
The function getAndProcessData contains the Future.sequence that executes 3 futures in parallel.
Now, as I'm reading more and more on Akka, I see that using ask creates another actor listener. The questions are:
As we extensively use ask, can it lead to too many threads being used in the system, and perhaps thread starvation sometimes?
Using Future.map a lot also often means a different thread. I read about the single-thread actor illusion, which can easily be broken by mixing in Futures.
Also, can this affect performance in a bad way?
Do we need to store the sender in a temp variable sen, since we're using pipeTo? Could we do only pipeTo(sender)? Also, does declaring sen in almost every receive callback waste too many resources? I would expect its reference to be removed once the operation is complete.
Is there a chance to design such a system in a better way, meaning we don't use map or ask so much? I looked at examples where you just pass a replyTo reference to some actor and then use tell instead of ask. Also, sending a message to self and then replying to the original sender can replace working with Future.map in some scenarios. But how can it be designed, keeping in mind that we want to perform 3 async operations in parallel and return formatted data to the sender? We need all those 3 operations completed to be able to format the data.
I tried not to include too many examples; I hope you understand our concerns and problems. Many questions, but I would really love to understand how it works, simply and clearly.
Thanks in advance
If you want to do 3 things in parallel you are going to need to create 3 Future values which will potentially use 3 threads, and that can't be avoided.
I'm not sure what the issue with map is, but there is only one call in this code and that is not necessary.
Here is one way to clean up the code to avoid creating unnecessary Future values (untested!):
case RetrieveData(userId, `type`, id, lang, paging, timeRange, platform) =>
  if (paging.forall(_ == Paging(0, 0))) {
    sender ! Seq.empty
  } else {
    val sen = sender
    val start = System.currentTimeMillis()
    val resF = Seq(
      profileActor ? Get(userId),
      getSymbols(`type`, id),
      getData(paging, timeRange, platform)
    )
    Future.sequence(resF).onComplete {
      case Success(result) =>
        val dur = System.currentTimeMillis() - start
        logger.info(s"Got ${result.size} news in $dur ms")
        sen ! result
      case Failure(ex) =>
        logger.error(s"Failure on getting data: ${ex.getMessage}", ex)
        sen ! Seq.empty
    }
  }
You can avoid ask by creating your own worker thread that collects the different results and then sends the result to the sender, but that is probably more complicated than is needed here.
An actor only consumes a thread in the dispatcher when it is processing a message. Since the number of messages the actor spawned to manage the ask will process is one, it's very unlikely that the ask pattern by itself will cause thread starvation. If you're already very close to thread starvation, an ask might be the straw that breaks the camel's back.
Mixing Futures and actors can break the single-thread illusion, if and only if the code executing in the Future accesses actor state (meaning, basically, vars or mutable objects defined outside of a receive handler).
Request-response and at-least-once (between them, they cover at least most of the motivations for the ask pattern) will in general limit throughput compared to at-most-once tells. Implementing request-response or at-least-once without the ask pattern might in some situations (e.g. using a replyTo ActorRef for the ultimate recipient) be less overhead than piping asks, but probably not significantly. Asks as the main entry-point to the actor system (e.g. in the streams handling HTTP requests or processing messages from some message bus) are generally OK, but asks from one actor to another are a good opportunity to streamline.
Note that, especially if your actor imports context.dispatcher as its implicit ExecutionContext, transformations on Futures are basically identical to single-use actors.
Situations where you want multiple things to happen, especially when you need to manage partial failure (a recover on a Future.sequence is a possible sign of this, particularly if the recover gets nontrivial), are potential candidates for a saga actor that organizes one particular request/response, as sketched below.
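A minimal sketch of such a saga/aggregator actor, using classic Akka actors; the PartialResult message type and the expected-count wiring are hypothetical:
import akka.actor.{Actor, ActorRef, Props}

// hypothetical wrapper for one of the three partial answers
case class PartialResult(value: Any)

// spawned per request; collects the expected number of results, replies, and stops itself
class RetrieveDataSaga(replyTo: ActorRef, expected: Int) extends Actor {
  private var results = Vector.empty[Any]
  def receive: Receive = {
    case PartialResult(value) =>
      results :+= value
      if (results.size == expected) {
        replyTo ! results
        context.stop(self)
      }
  }
}

// in the parent: val saga = context.actorOf(Props(new RetrieveDataSaga(sender(), 3)))
// each of the three workers is then told to send its PartialResult to saga instead of being asked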
Instead of using Future.sequence, I would suggest using a Source from Akka Streams, which will run all the futures in parallel and also lets you set the parallelism.
Here is the sample code:
Source(List(profileF, getSymbols(`type`, id), getData(paging, timeRange, platform)))
  .mapAsync(parallelism = 3)(identity)
  .runWith(Sink.seq)
This will return Future[Seq[Map[String, Any]]]

When to use Scala Futures?

I'm a Spark Scala programmer. I have a Spark job with sub-tasks that together complete the whole job. I wanted to use Future to complete the sub-tasks in parallel. On completion of the whole job I have to return the whole job response.
What I heard about Scala Futures is that once the main thread has executed and stopped, the remaining threads will be killed and you will get an empty response.
I have to use Await.result to collect the results, but all the blogs say that you should avoid Await.result and that it's a bad practice.
Is using Await.result the correct way of doing it or not, in my case?
def computeParallel(): Future[String] = {
  val f1 = Future { "ss" }
  val f2 = Future { "sss" }
  val f3 = Future { "ssss" }
  for {
    r1 <- f1
    r2 <- f2
    r3 <- f3
  } yield (r1 + r2 + r3)
}

computeParallel().map(result => ???)
To my understanding, we have to use Future in webservice-kind applications, where one process is always running and never exits. But in my case, once the logic execution (the Scala program) is complete, it will exit.
Can I use Future for my problem or not?
Using futures in Spark is probably not advisable except in special cases, and simply parallelizing computation isn't one of them. Giving a non-blocking wrapper to blocking I/O, e.g. making requests to an outside service, is quite possibly the only special case.
Note that Future doesn't guarantee parallelism (whether and how futures are executed in parallel depends on the ExecutionContext in which they're run), just asynchrony. Also, in the event that you're spawning computation-performing futures inside a Spark transformation (i.e. on the executor, not the driver), chances are that there won't be any performance improvement, since Spark tends to do a good job of keeping the cores on the executors busy; all spawning those futures does is contend for cores with Spark.
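A small illustration of the first point, with a single-threaded ExecutionContext (the sleeps are only there to make the ordering visible):
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// with only one thread in the pool, the futures are still asynchronous
// but run strictly one after the other, never in parallel
implicit val singleThreaded: ExecutionContext =
  ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())

val f1 = Future { Thread.sleep(500); println("first") }
val f2 = Future { Thread.sleep(500); println("second") } // starts only after f1 completes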
Broadly, be very careful about combining parallelism abstractions like Spark RDDs/DStreams/Dataframes, actors, and futures: there are a lot of potential minefields where such combinations can violate guarantees and/or conventions in the various components.
It's also worth noting that Spark has requirements around serializability of intermediate values and that futures aren't generally serializable, so a Spark stage can't result in a future; this means that you basically have no choice but to Await on the futures spawned in a stage.
If you still want to spawn futures in a Spark stage (e.g. posting them to a web service), it's probably best to use Future.sequence to collapse the futures into one and then Await on that (note that I have not tested this idea: I'm assuming that there's an implicit CanBuildFrom[Iterator[Future[Unit]], Unit, Iterator[Unit]] available):
def postString(s: String): Future[Unit] = ???

def postStringRDD(rdd: RDD[String]): RDD[String] = {
  // foreachPartition is an action, so the requests fire eagerly;
  // a mapPartitions whose result is discarded would never run at all
  rdd.foreachPartition { strings =>
    // since this is only used for combining the futures in the Await,
    // it's probably OK to use the implicit global execution context here
    import scala.concurrent.ExecutionContext.Implicits.global
    Await.result(Future.sequence(strings.map(postString)), Duration.Inf)
  }
  rdd // pass through the original RDD
}

Akka Http - Host Level Client Side API Source.queue pattern

We started to implement the Source.queue[HttpRequest] pattern mentioned in the docs: http://doc.akka.io/docs/akka-http/current/scala/http/client-side/host-level.html#examples
This is the (reduced) example from the documentation
val poolClientFlow = Http()
  .cachedHostConnectionPool[Promise[HttpResponse]]("akka.io")

val queue =
  Source.queue[(HttpRequest, Promise[HttpResponse])](
    QueueSize, OverflowStrategy.dropNew
  )
  .via(poolClientFlow)
  .toMat(Sink.foreach({
    case ((Success(resp), p)) => p.success(resp)
    case ((Failure(e), p)) => p.failure(e)
  }))(Keep.left)
  .run()

def queueRequest(request: HttpRequest): Future[HttpResponse] = {
  val responsePromise = Promise[HttpResponse]()
  queue.offer(request -> responsePromise).flatMap {
    case QueueOfferResult.Enqueued => responsePromise.future
    case QueueOfferResult.Dropped => Future.failed(new RuntimeException("Queue overflowed. Try again later."))
    case QueueOfferResult.Failure(ex) => Future.failed(ex)
    case QueueOfferResult.QueueClosed => Future.failed(new RuntimeException("Queue was closed (pool shut down) while running the request. Try again later."))
  }
}
val responseFuture: Future[HttpResponse] = queueRequest(HttpRequest(uri = "/"))
The docs state that using Source.single(request) is an anti-pattern and should be avoided. However, they don't clarify why, or what the implications of using Source.queue are.
At this place we previously showed an example that used the Source.single(request).via(pool).runWith(Sink.head).
In fact, this is an anti-pattern that doesn’t perform well. Please either supply requests using a queue or in a streamed fashion as shown below.
Advantages of Source.queue
The flow is only materialized once (probably a performance gain?). However, if I understood the akka-http implementation correctly, a new flow is materialized for each connection, so this doesn't seem to be that much of a problem.
Explicit backpressure handling with OverflowStrategy and matching over the QueueOfferResult
Issues with Source.queue
These are the questions that came up, when we started implementing this pattern in our application.
Source.queue is not thread-safe
The queue implementation is not thread-safe. When we use the queue in different routes/actors, we have this scenario:
An enqueued request can override the latest enqueued request, thus leading to an unresolved Future.
UPDATE
This issue has been addressed in akka/akka/issues/23081. The queue is in fact thread-safe.
Filtering?
What happens when requests are being filtered? E.g. when someone changes the implementation:
Source.queue[(HttpRequest, Promise[HttpResponse])](
  QueueSize, OverflowStrategy.dropNew)
  .via(poolClientFlow)
  // only successful responses
  .filter(_._1.isSuccess)
  // failed won't arrive here
  .to(Sink.foreach({
    case ((Success(resp), p)) => p.success(resp)
    case ((Failure(e), p)) => p.failure(e)
  }))
Will the Future not resolve? With a single request flow this is straightforward:
Source.single(request).via(poolClientFlow).runWith(Sink.headOption)
QueueSize vs max-open-requests?
The difference between QueueSize and max-open-requests is not clear. In the end, both are buffers. Our implementation ended up using QueueSize == max-open-requests.
What's the downside for Source.single()?
Until now I have found two reasons for using Source.queue over Source.single
Performance - materializing the flow only once. However according to this answer it shouldn't be an issue
Explicit backpressure configuration and failure-case handling. In my opinion the connection pool has sufficient handling for too much load; one can map over the resulting future and handle the exceptions.
thanks in advance,
Muki
I'll answer each of your questions directly and then give a general indirect answer to the overall problem.
probably a performance gain?
You are correct that there is a Flow materialized for each IncomingConnection, but there is still a performance gain to be had if a connection has multiple requests coming through it.
What happens when request are being filtered?
In general, streams do not have a 1:1 mapping between Source elements and Sink elements. There can be 1:0, as in your example, or 1:many if a single request somehow spawned multiple responses.
QueueSize vs max-open-request?
This ratio would depend on the speed with which elements are being offered to the queue and the speed with which http requests are being processed into responses. There is no pre-defined ideal solution.
GENERAL REDESIGN
In most cases a Source.queue is used because some upstream function is creating input elements dynamically and then offering them to the queue, e.g.
val queue = ??? //as in the example in your question
queue.offer(httpRequest1)
queue.offer(httpRequest2)
queue.offer(httpRequest3)
This is poor design, because the entity or function being used to create each input element could itself be part of the stream Source, e.g.
val allRequests = List(httpRequest1, httpRequest2, httpRequest3)

// no queue necessary
val allResponses: Future[Seq[HttpResponse]] =
  Source(allRequests)
    .via(poolClientFlow)
    .runWith(Sink.seq)
Now there is no need to worry about the queue, max queue size, etc. Everything is bundled into a nice compact stream.
Even if the source of requests is dynamic, you can still usually use a Source. Say we are getting the request paths from the console stdin, this can still be a complete stream:
import scala.io.{Source => ioSource}

val consoleLines: () => Iterator[String] =
  () => ioSource.stdin.getLines()

Source
  .fromIterator(consoleLines)
  .map(consoleLine => HttpRequest(GET, uri = Uri(consoleLine)))
  .via(poolClientFlow)
  .to(Sink.foreach[HttpResponse](println))
  .run()
Now, even if each line is typed into the console at random intervals the stream can still behave reactively without a Queue.
The only instance where I've ever seen a queue, or Source.actorRef, as being absolutely necessary is when you have to create a callback function that gets passed into a third-party API. This callback function will have to offer the incoming elements to the queue.
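A minimal sketch of that one case, assuming an implicit Materializer in scope; thirdPartyClient.onMessage and handleMessage are hypothetical stand-ins for the external API and the downstream processing:
import akka.stream.OverflowStrategy
import akka.stream.scaladsl.{Sink, Source}

// the queue is the only bridge between the callback world and the stream
val queue = Source
  .queue[String](bufferSize = 256, OverflowStrategy.dropNew)
  .to(Sink.foreach(handleMessage)) // downstream processing goes here
  .run()

// the third-party API can only hand elements over by offering them to the queue
thirdPartyClient.onMessage { msg =>
  queue.offer(msg)
}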

Do all futures need to be waited on to guarantee their execution?

We have a Scala Play webapp which does a number of database operations as part of an HTTP request, each of which is a Future. Usually we bubble up the Futures to an async controller action and let Play handle waiting for them.
But I've also noticed that in a number of places we don't bubble up the Future or even wait for it to complete. I think this is bad because it means the HTTP request won't fail if the future fails, but is the future even guaranteed to be executed at all, since nothing is going to wait on its result? Will Play drop un-awaited futures after the HTTP request has been served, or leave them running in the background?
TL;DR
Play will not kill your Futures after sending the HTTP response.
Errors will not be reported if any of your Futures fail.
Long version
Your futures will not be killed when the HTTP response has been sent. You can try that out for yourself like this:
def futuresTest = Action.async { request =>
  println(s"Entered futuresTest at ${LocalDateTime.now}")

  val ignoredFuture = Future {
    var i = 0
    while (i < 10) {
      Thread.sleep(1000)
      println(LocalDateTime.now)
      i += 1
    }
  }

  println(s"Leaving futuresTest at ${LocalDateTime.now}")
  Future.successful(Ok)
}
However, you are right that the request will not fail if any of the futures fail. If this is a problem then you can compose the futures using a for comprehension or flatMaps. Here's an example of what you can do (I'm assuming that your Futures only perform side effects, i.e. Future[Unit]).
To let your futures execute in parallel:
val dbFut1 = dbCall1(...)
val dbFut2 = dbCall2(...)
val wsFut1 = wsCall1(...)

val fut = for (
  _ <- dbFut1;
  _ <- dbFut2;
  _ <- wsFut1
) yield ()

fut.map(_ => Ok)

To have them execute in sequence:

val fut = for (
  _ <- dbCall1(...);
  _ <- dbCall2(...);
  _ <- wsCall2(...)
) yield ()

fut.map(_ => Ok)
does it actually even guarantee the future will be executed at all, since nothing is going to wait on the result of it? Will Play drop un-awaited futures after the HTTP request has been served, or leave them running in the background?
This question actually runs much deeper than Play. You're generally asking, "If I don't synchronously wait on a future, how can I guarantee it will actually complete without being GCed?". To answer that, we need to understand how the GC actually views threads. From the GC's point of view, a thread is what we call a "root". Such a root is the starting point for the heap traversal that determines which objects are eligible for collection. Among roots are also static fields, for example, which are known to live throughout the lifetime of the application.
So, when you view it like that, and think of what a Future actually does, which is to queue a function that runs on a dedicated thread from the pool of threads available via the underlying ExecutorService (which we refer to as ExecutionContext in Scala), you see that even though you're not waiting on the completion, the JVM runtime does guarantee that your Future will run to completion. As for the Future object wrapping the function, it holds a reference to that unfinished function body, so the Future itself isn't collected.
When you think about it from that point of view, it's totally logical, since execution of a Future happens asynchronously, and we usually continue processing it in an asynchronous manner using continuations such as map, flatMap, onComplete, etc.

How to parallelize several apache spark rdds?

I have the following code:
sc.parquetFile("some large parquet file with bc").registerTempTable("bcs")
sc.parquetFile("some large parquet file with imps").registerTempTable("imps")
val bcs = sc.sql("select * from bcs")
val imps = sc.sql("select * from imps")
I want to do:
bcs.map(x => wrapBC(x)).collect
imps.map(x => wrapIMP(x)).collect
but when I do this, it doesn't run asynchronously. I can do it with Future, like this:
val bcsFuture = Future { bcs.map(x => wrapBC(x)).collect }
val impsFuture = Future { imps.map(x => wrapIMP(x)).collect }

val result = for {
  bcs <- bcsFuture
  imps <- impsFuture
} yield (bcs, imps)

Await.result(result, Duration.Inf) // this returns (Array[Bc], Array[Imp])
I want to do this without Future, how can I do it?
Update: this was originally composed before the question was updated. Given those updates, I agree with @stholzm's answer to use cartesian in this case.
There do exist a limited number of actions which will produce a FutureAction[A] for an RDD[A] and be executed in the background. These are available on the AsyncRDDActions class, and as long as you import SparkContext._, any RDD can be implicitly converted to AsyncRDDActions as needed. For your specific code example that would be:
bcs.map(x => wrapBC(x)).collectAsync
imps.map(x => wrapIMP(x)).collectAsync
In addition to evaluating the DAG up to the action in the background, the FutureAction produced has a cancel method to attempt to end processing early.
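Since FutureAction extends Future, the two background collects can be combined and awaited together; a minimal sketch, assuming the wrapBC/wrapIMP functions from the question and the implicit global ExecutionContext for the combinators:
import org.apache.spark.SparkContext._
import scala.concurrent.Await
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

// both actions are submitted immediately and run in the background
val bcsFuture = bcs.map(x => wrapBC(x)).collectAsync
val impsFuture = imps.map(x => wrapIMP(x)).collectAsync

// combine them like ordinary futures and block only once, at the end
val both = for {
  b <- bcsFuture
  i <- impsFuture
} yield (b, i)

val (bcResults, impResults) = Await.result(both, Duration.Inf)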
Caveat
This may not do what you think it does. If the intent is to get data from both sources and then combine them you're more likely to want to join or group the RDDs instead. For that you can look at the functions available in PairRDDFunctions, again available on RDDs through implicit conversion.
If the intention isn't to have the data graphs interact, then in my experience so far, running batches concurrently might only serve to slow both down, though that may be a consequence of how the cluster is configured. If the resource manager is set up to give each execution stage a monopoly on the cluster in FIFO order (the default in standalone and YARN modes, I believe; I'm not sure about Mesos), then each of the asynchronous collects will contend with the other for that monopoly, run its tasks, then contend again for the next execution stage.
Compare this to using a Future to wrap blocking calls to downstream services or database, for example, where either the resources in question are completely separate or generally have enough resource capacity to handle multiple requests in parallel without contention.
Update: I misunderstood the question. The desired result is not the cartesian product Array[(Bc, Imp)].
But I'd argue that it does not matter how long the single map calls take, because as soon as you add other transformations, Spark tries to combine them in an efficient way. As long as you only chain transformations on RDDs, nothing happens on the data. When you finally apply an action, the execution engine will figure out a way to produce the requested data.
So my advice would be to not think so much about the intermediate steps and avoid collect as much as possible because it will fetch all the data to the driver program.
It seems you are building a cartesian product yourself. Try cartesian instead:
val bc = bcs.map(x => wrapBC(x))
val imp = imps.map(x => wrapIMP(x))
val result = bc.cartesian(imp).collect
Note that collect is called on the final RDD and no longer on intermediate results.
You can use union to solve this problem. For example:
val bcsAny = bcs.map(x => wrapBC(x).asInstanceOf[Any])
val impsAny = imps.map(x => wrapIMP(x).asInstanceOf[Any])
val result = (bcsAny union impsAny).collect()

val bcsResult = result collect { case bc: Bc => bc }
val impsResult = result collect { case imp: Imp => imp }
If you want to use sortBy or other operations, you can have the wrapper classes share a common trait or superclass instead of casting to Any, as sketched below.
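A minimal sketch of that idea, assuming wrapBC/wrapIMP return the Bc/Imp case classes (whose fields are hypothetical here); with a common supertype, the union keeps a useful static type and operations like sortBy can key on a shared field:
// hypothetical common supertype and fields
sealed trait Wrapped { def id: Long }
case class Bc(id: Long) extends Wrapped
case class Imp(id: Long) extends Wrapped

val wrapped = bcs.map(x => wrapBC(x): Wrapped) union imps.map(x => wrapIMP(x): Wrapped)
val sorted = wrapped.sortBy(_.id) // possible now that the elements share a type
val result: Array[Wrapped] = sorted.collect()

val bcsResult = result collect { case bc: Bc => bc }
val impsResult = result collect { case imp: Imp => imp }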