Scala - Future "Out of memory"

So I have the following code:
case class ResultEntry(destination: String,
                       from: LocalDate,
                       to: LocalDate,
                       data: Data)

val destinations = Seq(
  "New York",
  "Scranton",
  "Stamford",
  "Jamaica",
)
val dates = 0.to(30).map(v => LocalDate.of(2021, 1, 1).plusDays(v))
Await.result(
  Future
    .sequence(for {
      destination <- destinations
      fromDate    <- dates
      toDate      <- dates if toDate.isAfter(fromDate)
      score = Future(
        blocking(
          SnelDwight.calculateVacationScore(destination, fromDate, toDate)))
    } yield {
      score.map { score =>
        ResultEntry(destination, fromDate, toDate, score)
      }
    })
    .map { results =>
      results.groupBy(_.destination).map {
        case (destination, results) =>
          val min = results.minBy(_.data.score)
          println(s"${destination} : ${min}")
      }
    },
  Duration.Inf
)
This code throws an "Out of memory" exception. I need to solve that problem but keep the computations as concurrent as possible. How can I achieve that?

Using blocking causes a new thread to be created each time, rather than using an existing thread from a pool of threads. This avoids the danger of blocking every thread in the pool and preventing new Futures from running.
In this example you are creating nearly 2,000 threads, which may well cause the memory error.
If the calculation isn't really blocking then just remove blocking.
If the calculation is blocking then you need to batch multiple calculations on a single thread. For example, do all the toDates in one Future (which is probably more efficient anyway).
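For example, a minimal sketch of that batching, reusing the question's destinations, dates, ResultEntry, and SnelDwight.calculateVacationScore: one Future per (destination, fromDate) pair, with the inner toDate loop running sequentially on that Future's thread. That caps the peak at roughly 124 threads instead of ~1,860.
import java.time.LocalDate
import scala.concurrent.{Await, Future, blocking}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global

val batched: Future[Seq[ResultEntry]] =
  Future
    .sequence(for {
      destination <- destinations
      fromDate    <- dates
    } yield Future(blocking {
      // All toDates for this (destination, fromDate) pair run on one thread.
      for {
        toDate <- dates if toDate.isAfter(fromDate)
      } yield ResultEntry(
        destination, fromDate, toDate,
        SnelDwight.calculateVacationScore(destination, fromDate, toDate))
    }))
    .map(_.flatten)

Await.result(batched, Duration.Inf)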

Related

Can Spark ForEachPartitionAsync be async on worker nodes?

I'm writing a custom Spark sink. In my addBatch method I use foreachPartitionAsync which, if I'm not wrong, only makes the driver work asynchronously, returning a future.
val work: FutureAction[Unit] = rdd.foreachPartitionAsync { rows =>
  val sourceInfo: StreamSourceInfo = serializeRowsAsInputStream(schema, rows)
  val ackIngestion = Future {
    ingestRows(sourceInfo)
  } andThen {
    case Success(ingestion) => ackIngestionDone(partitionId, ingestion)
  }
  Await.result(ackIngestion, timeOut) // I would like to remove this line..
}
work onSuccess {
  case _ => // move data from temporary table, report success of all workers
}
work onFailure {
  // delete tmp data
  case t => throw t.getCause
}
I can't find a way to run the worker nodes without blocking on the Await call; if I remove it, success is reported to the work future object although the future didn't really finish.
Is there a way to report to the driver that all the workers finished
their asynchronous jobs?
Note: I looked at the foreachPartitionAsync function and it has only one implementation that expects a function returning Unit (I would've expected another overload returning a Future, or maybe a CountDownLatch..)

Apache Spark: how to cancel job in code and kill running tasks?

I am running a Spark application (version 1.6.0) on a Hadoop cluster with Yarn (version 2.6.0) in client mode. I have a piece of code that runs a long computation, and I want to kill it if it takes too long (and then run some other function instead).
Here is an example:
val conf = new SparkConf().setAppName("TIMEOUT_TEST")
val sc = new SparkContext(conf)
val lst = List(1, 2, 3)

// setting up an infinite action
val future = sc.parallelize(lst).map { x => while (true) {}; x }.collectAsync()

try {
  Await.result(future, Duration(30, TimeUnit.SECONDS))
  println("success!")
} catch {
  case _: Throwable =>
    future.cancel()
    println("timeout")
}

// sleep for 1 hour to allow inspecting the application in yarn
Thread.sleep(60 * 60 * 1000)
sc.stop()
The timeout is set for 30 seconds, but of course the computation is infinite, so awaiting the result of the future throws an exception, which is caught; the future is then canceled and the backup function executes.
This all works perfectly well, except that the canceled job doesn't terminate completely: when looking at the web UI for the application, the job is marked as failed, but I can see there are still running tasks inside.
The same thing happens when I use SparkContext.cancelAllJobs or SparkContext.cancelJobGroup. The problem is that even though I manage to get on with my program, the running tasks of the canceled job are still hogging valuable resources (which will eventually slow me down to a near stop).
To sum things up: How do I kill a Spark job in a way that will also terminate all running tasks of that job? (as opposed to what happens now, which is stopping the job from running new tasks, but letting the currently running tasks finish)
UPDATE:
After a long time ignoring this problem, we found a messy but efficient little workaround. Instead of trying to kill the appropriate Spark job/stage from within the Spark application, we simply logged the stage IDs of all active stages when the timeout occurred, and issued an HTTP GET request to the kill URL presented by the Spark Web UI for said stages.
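For illustration, a sketch of that workaround; note that the exact kill-endpoint path is an assumption here (it varies across Spark versions, so check what the "kill" link in your own Web UI points at):
import scala.io.Source

// Hypothetical helper: kill every currently active stage via the Web UI's
// kill endpoint. The URL pattern below is an assumption; verify it against
// the "kill" link shown in your Spark UI.
def killActiveStages(sc: org.apache.spark.SparkContext, uiBase: String): Unit =
  for (stageId <- sc.statusTracker.getActiveStageIds) {
    val url = s"$uiBase/stages/stage/kill/?id=$stageId&terminate=true"
    Source.fromURL(url).close() // opening the source fires the GET request
  }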
I don't know if this answers your question.
My need was to kill jobs hanging for too long (my jobs extract data from Oracle tables, but for some unknown reason the connection occasionally hangs forever).
After some study, I came to this solution:
val MAX_JOB_SECONDS = 100
val statusTracker = sc.statusTracker

val sparkListener = new SparkListener() {
  override def onJobStart(jobStart: SparkListenerJobStart) {
    val jobId = jobStart.jobId
    val f = Future {
      var c = MAX_JOB_SECONDS
      var mustCancel = false
      var running = true
      while (!mustCancel && running) {
        Thread.sleep(1000)
        c = c - 1
        mustCancel = c <= 0
        // getJobInfo returns an Option, so test isDefined rather than null
        val jobInfo = statusTracker.getJobInfo(jobId)
        if (jobInfo.isDefined) {
          val v = jobInfo.get.status()
          running = v == JobExecutionStatus.RUNNING
        } else {
          running = false
        }
      }
      if (mustCancel) {
        sc.cancelJob(jobId)
      }
    }
  }
}

sc.addSparkListener(sparkListener)
try {
  val df = spark.sql("SELECT * FROM VERY_BIG_TABLE") // just an example of a long-running job
  println(df.count)
} catch {
  case exc: org.apache.spark.SparkException =>
    if (exc.getMessage.contains("cancelled"))
      throw new Exception("Job forcibly cancelled")
    else
      throw exc
  case ex: Throwable =>
    println(s"Another exception: $ex")
} finally {
  sc.removeSparkListener(sparkListener)
}
For the sake of future visitors, Spark introduced the task reaper in 2.0.3, which addresses this scenario (more or less) as a built-in solution.
Note that it can eventually kill an executor if the task is not responsive.
Moreover, some built-in Spark data sources have been refactored to be more responsive to cancellation.
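Enabling the reaper is just configuration; a rough sketch (the tuning values here are illustrative, see the Spark configuration docs for your version):
// Illustrative settings; tune for your workload.
val conf = new org.apache.spark.SparkConf()
  .set("spark.task.reaper.enabled", "true")        // monitor killed/interrupted tasks
  .set("spark.task.reaper.pollingInterval", "10s") // how often to check on them
  .set("spark.task.reaper.killTimeout", "120s")    // kill the executor JVM if a task won't die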
For the 1.6.0 version, Zohar's solution is a "messy but efficient" one.
According to setJobGroup:
"If interruptOnCancel is set to true for the job group, then job cancellation will result in Thread.interrupt() being called on the job's executor threads."
So the anonymous function in your map must be interruptible, like this:
val future = sc.parallelize(lst).map { x => while (!Thread.interrupted()) {}; x }.collectAsync()
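Putting that together, a hedged sketch of the question's timeout pattern with an interruptible job group (the group name and timeout values are made up for illustration):
import java.util.concurrent.TimeUnit
import scala.concurrent.Await
import scala.concurrent.duration.Duration

// Run the action in a job group with interruptOnCancel = true, so that
// cancellation calls Thread.interrupt() on the job's executor threads.
sc.setJobGroup("timeout-test", "cancellable long computation", interruptOnCancel = true)
val future = sc.parallelize(lst).map { x => while (!Thread.interrupted()) {}; x }.collectAsync()
try {
  Await.result(future, Duration(30, TimeUnit.SECONDS))
} catch {
  case _: Throwable => sc.cancelJobGroup("timeout-test")
}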

Akka Future - Parallel versus Concurrent?

From the well-written Akka Concurrency:
As I understand it, the diagram points out that both numSummer and charConcat will run on the same thread.
Is it possible to run each Future in parallel, i.e. on separate threads?
The picture on the left is them running in parallel.
The point of the illustration is that the Future.apply method is what kicks off the execution, so if it doesn't happen until the first future's result is flatMapped (as in the picture on the right), then you don't get parallel execution.
(Note that by "kicked off", I mean the relevant ExecutionContext is told about the job. How it parallelizes is a different question and may depend on things like the size of its thread pool.)
Equivalent code for the left:
val numSummer = Future { ... }  // execution kicked off
val charConcat = Future { ... } // execution kicked off

numSummer.flatMap { numsum =>
  charConcat.map { string =>
    (numsum, string)
  }
}
and for the right:
Future { ... } // execution kicked off
  .flatMap { numsum =>
    Future { ... } // execution kicked off (Note that this does not happen until
                   // the first future's result (`numsum`) is available.)
      .map { string =>
        (numsum, string)
      }
  }
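If you want to observe the difference yourself, here is a self-contained sketch with Thread.sleep standing in for real work (the 500 ms delays and the global execution context are illustrative choices):
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

def timed[A](label: String)(body: => A): A = {
  val start = System.nanoTime()
  val result = body
  println(s"$label took ${(System.nanoTime() - start) / 1000000} ms")
  result
}

// Parallel: both futures start before either is combined (~500 ms total).
timed("parallel") {
  val a = Future { Thread.sleep(500); 1 }
  val b = Future { Thread.sleep(500); "x" }
  Await.result(a.flatMap(n => b.map(s => (n, s))), 2.seconds)
}

// Sequential: the second future isn't created until the first completes (~1000 ms).
timed("sequential") {
  Await.result(
    Future { Thread.sleep(500); 1 }.flatMap { n =>
      Future { Thread.sleep(500); "x" }.map(s => (n, s))
    },
    2.seconds)
}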

Why don't Scala futures run faster even with more threads in the thread pool?

I have the following algorithm in Scala:
1. Do initial call to the db to initialize the cursor
2. Get 1000 entities from the db (returns a Future)
3. For every entity, make one additional request to the database and get the modified entity (returns a Future)
4. Transform the original entity
5. Compare the transformed entity in the Future callback from #3
6. Wait for all Futures
In Scala it's something like:
val client = ...
val size = 1000
val init = client.firstSearch(size) // request over network
val initResult = Await.result(init, 30.seconds)
var cursorId: String = initResult.getCursorId

while (!cursorId.isEmpty) {
  val newCursor = client.grabWithSize(cursorId).flatMap { response =>
    val futures: Seq[Future[Boolean]] = response.getAllResults.map { result =>
      val grabbedOne: Future[Entity] = client.grabOneEntity(result.id) // request over network
      val resultMap: Map[String, Any] = buildMap(result)
      val transformed: Map[String, Any] = transform(resultMap) // no future here
      grabbedOne.map { grabbedOne =>
        buildMap(grabbedOne) == transformed
      }
    }
    Future.sequence(futures).map(_ => response.getNewCursorId)
  }
  cursorId = Await.result(newCursor, 30.seconds)
}

def buildMap(...): Map[String, Any] // sync call
I noticed that if I double size, every iteration of the while loop gets about 1.5× slower. But I don't see my CPU under any more load: it stays near zero, yet the time still increases by ~1.5×. Why? I have set up:
implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(1024))
I think that not all Futures are executed in parallel. But why? And how do I fix it?
I see that in your code, the Futures don't block each other. It's more likely the database that is the bottleneck.
Is it possible to do a SQL join for O(1) rather than O(n) in terms of database calls? (If you're using Slick, have a look under the queries section about joins.)
If the load is low, it's probably that the connection pool is maxed out; you'd need to increase it for the database and the network.
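To illustrate the join suggestion, a minimal Slick 3 sketch; the Entities/Details schema here is a hypothetical stand-in for the question's two network calls:
import slick.jdbc.H2Profile.api._

// Hypothetical schema standing in for the two round trips in the question.
class Entities(tag: Tag) extends Table[(Long, String)](tag, "ENTITIES") {
  def id   = column[Long]("ID", O.PrimaryKey)
  def name = column[String]("NAME")
  def *    = (id, name)
}
class Details(tag: Tag) extends Table[(Long, String)](tag, "DETAILS") {
  def entityId = column[Long]("ENTITY_ID")
  def payload  = column[String]("PAYLOAD")
  def *        = (entityId, payload)
}
val entities = TableQuery[Entities]
val details  = TableQuery[Details]

// One round trip instead of one query per entity.
val joined = for {
  (e, d) <- entities join details on (_.id === _.entityId)
} yield (e.name, d.payload)

val db = Database.forConfig("mydb")  // hypothetical config key
val results = db.run(joined.result) // Future[Seq[(String, String)]]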

How to Merge or skip duplicate messages in a Scala Actor?

Let's say you have a GUI component, and 10 threads all tell it to repaint at roughly the same time, such that they all arrive before a single paint operation takes place. Instead of naively wasting resources repainting 10 times, just merge/ignore all but the last one and repaint once (or more likely, twice: once for the first, and once for the last). My understanding is that the Swing repaint manager does this.
Is there a way to accomplish this same type of behavior in a Scala Actor? Is there a way to look at the queue and merge messages, or ignore all but the last of a certain type or something?
Something like this?:
act = loop {
  react {
    case Repaint(a, b) =>
      if (lastRepaint + minInterval < System.currentTimeMillis) {
        lastRepaint = System.currentTimeMillis
        repaint(a, b)
      }
  }
}
If you want to repaint whenever the actor's thread gets a chance, but no more, then:
(UPDATE: repainting using the last message arguments)
act = loop {
  react {
    case r @ Repaint(_, _) =>
      var lastMsg = r
      def findLast: Unit = {
        reactWithin(0) {
          case r @ Repaint(_, _) =>
            lastMsg = r
            findLast // keep draining queued Repaint messages
          case TIMEOUT =>
            repaint(lastMsg.a, lastMsg.b)
        }
      }
      findLast
  }
}
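As an aside, scala.actors has long been deprecated; the same conflation idea expressed with Akka Streams (assuming Akka 2.6, where an implicit ActorSystem materializes streams, and with Repaint/repaint as hypothetical stand-ins for the types above) might look like this:
import akka.actor.ActorSystem
import akka.stream.OverflowStrategy
import akka.stream.scaladsl.{Keep, Sink, Source}

case class Repaint(a: Int, b: Int)          // hypothetical message type
def repaint(a: Int, b: Int): Unit = ???     // hypothetical paint operation

implicit val system: ActorSystem = ActorSystem("ui")

// Buffer incoming Repaint requests; while the painter is busy,
// conflate collapses everything pending down to the latest one.
val repaintQueue = Source
  .queue[Repaint](64, OverflowStrategy.dropHead)
  .conflate((_, latest) => latest)
  .toMat(Sink.foreach(r => repaint(r.a, r.b)))(Keep.left)
  .run()

// Any thread can push repaint requests through the queue:
repaintQueue.offer(Repaint(0, 0))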