Await.ready is not working: it does not hold the main thread until the futures execute.
val futures = for (stagingEntity <- stagingEntities) yield Future {
  print("123")
}
for (f <- futures) {
  Await.ready(f, Duration.Inf)
}
print("456")
The output is 456; "123" is never printed.
I have tried many times: sometimes it works and sometimes it does not. Could you please help? Much appreciated. I am testing this in a Spark application.
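For reference, here is a minimal self-contained sketch of the same pattern that does block as expected. The imports, the implicit ExecutionContext, and the placeholder stagingEntities list are assumptions added for illustration:
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Hypothetical stand-in for the question's stagingEntities
val stagingEntities = Seq("a", "b", "c")

val futures = for (stagingEntity <- stagingEntities) yield Future {
  print("123")
}
// Await.ready blocks the calling thread until each future completes
for (f <- futures) {
  Await.ready(f, Duration.Inf)
}
print("456") // with the waits above, this prints only after every "123"
One thing worth checking: if stagingEntities is empty, futures is empty, the loop has nothing to wait on, and "456" prints immediately, which looks exactly like Await.ready being skipped.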
I've inherited some code from an ex-coworker who started using futures (in Scala) to process some data in Databricks.
I split it into chunks that complete in a similar time period. However, there is no output, and I know the futures aren't wired to onSuccess or Await or anything.
The thing is, the code finishes running (it doesn't return output), but the block in Databricks keeps executing until the Thread.sleep() part.
I'm new to Scala and futures and am not sure how I can just exit the notebook once all the futures finish running (should I just use dbutils.notebook.exit() after the future blocks?).
The code is below:
import scala.concurrent.{Future, blocking, Await}
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext
import com.databricks.WorkflowException

val numNotebooksInParallel = 15
// If you create too many notebooks in parallel the driver may crash when you submit all of the jobs at once.
// This code limits the number of parallel notebooks.
implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(numNotebooksInParallel))
val ctx = dbutils.notebook.getContext()

// The simplest interface we can have, but it doesn't
// have protection against submitting too many notebooks in parallel at once
println("starting parallel jobs... hang tight")

Future {
  process("pro", "bseg")
  process("prc", "bkpf")
  process("prc", "bseg")
  process("pr4", "bkpf")
  process("pr4", "bseg")
  println("done with future1")
}

Future {
  process("pr5", "bkpf")
  process("pr5", "bseg")
  process("pri", "bkpf")
  process("pri", "bseg")
  process("pr9", "bkpf")
  println("done with future2")
}

Future {
  process("pr9", "bseg")
  process("prl", "bkpf")
  process("prl", "bseg")
  process("pro", "bkpf")
  println("done with future3")
}

println("finished futures - yay! :)")
Thread.sleep(5*60*60*1000)
println("thread timed out after 5 hrs... hope it all finished.")
One would typically save the futures as values:
val futs = Seq(
  Future {
    process("pro", "bseg")
    // and so on
  },
  // then the other futures
)
and then operate on the futures:
import scala.concurrent.Await
import scala.concurrent.duration._
Await.result(Future.sequence(futs), 5.hours)
Future.sequence fails as soon as any of the futures fails, and succeeds once they've all succeeded. If you want to wait for all of them to finish even if one fails, you could do something like
Await.result(
  futs.foldLeft(Future.unit) { (acc, f) =>
    // chain on the accumulator so the combined future completes
    // only after every future in the list has
    acc.flatMap(_ => f.recover { case _ => () })
  },
  5.hours
)
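As for the notebook-exit part of the question: once the Await call above has returned, every future has completed, so it should be safe to call dbutils.notebook.exit at that point. A sketch (the exit value "done" is an arbitrary example):
Await.result(Future.sequence(futs), 5.hours)
dbutils.notebook.exit("done") // all futures have completed by this line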
In the following code, is the BLOCK 2 loop guaranteed to be executed only after all the executor tasks spawned by BLOCK 1 have finished, or is it possible for it to run while some of the executor tasks are still running?
If it is possible for the 2 blocks to run concurrently, what is the best way to prevent this? I need to process the contents of the accumulator, but only once all the executors have finished.
When running with a master URL of local[4] as shown, it looks like BLOCK 2 waits for BLOCK 1 to finish. However, I am seeing errors when running with a URL of yarn which suggest that BLOCK 2 is running concurrently with the executor tasks in BLOCK 1.
object Main {
  def main(args: Array[String]) {
    val sparkContext = SparkSession.builder.master("local[4]").getOrCreate().sparkContext
    val myRecordAccumulator = sparkContext.collectionAccumulator[MyRecord]

    // BLOCK 1
    sparkContext.binaryFiles("/my/files/here").foreach(_ => {
      for (i <- 1 to 100000) {
        val record = buildRecord()
        myRecordAccumulator.add(record)
      }
    })

    // BLOCK 2
    myRecordAccumulator.value.asScala.foreach((record: MyRecord) => {
      processIt(record)
    })
  }
}
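For illustration, a sketch of an alternative that sidesteps the question entirely: foreach is an action, so it blocks until the job finishes, but accumulators can also pick up duplicate updates from retried tasks. Building the records in a transformation and fetching them with collect yields data only after all executor tasks are done (this assumes the records fit in driver memory; buildRecord and processIt are the question's own helpers):
// Derive the records as RDD data instead of accumulator side effects.
// collect() returns only after the whole job has completed, so the
// processing below cannot overlap the executor tasks.
val records = sparkContext
  .binaryFiles("/my/files/here")
  .flatMap(_ => (1 to 100000).map(_ => buildRecord()))
  .collect()

records.foreach(processIt)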
// imports added for compilation
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

def fixture =
  new {
    val xyz = new XYZ(spark)
  }

// mutable list of futures, i.e. MutableList[Future[Dataset[Row]]]
val fList: scala.collection.mutable.MutableList[scala.concurrent.Future[Dataset[Row]]] =
  scala.collection.mutable.MutableList[scala.concurrent.Future[Dataset[Row]]]()

test("test case") {
  val tasks = for (i <- 1 to 10) {
    fList ++ scala.collection.mutable.MutableList[scala.concurrent.Future[Dataset[Row]]](Future {
      println("Executing task " + i)
      val ds = read(fixture.etlSparkLayer, i)
      ds
    })
  }
  Thread.sleep(1000 * 4200)
  val futureOfList = Future.sequence(fList) // list of Future jobs into a Future of a sequence
  println(Await.ready(futureOfList, Duration.Inf))
  val await_result: Seq[Dataset[Row]] = Await.result(futureOfList, Duration.Inf)
  println("Squares: " + await_result)
  futureOfList.onComplete {
    case Success(x) => println("Success!!! " + x)
    case Failure(ex) => println("Failed !!! " + ex)
  }
}
I am executing a test case with a sequence of futures: the list holds a collection of Future values, and I am trying to execute the same function multiple times in parallel using Future in Scala. On my system only 4 jobs start at a time; after those 4 complete, the next 4 start, and so on until all the jobs are done. How can I start more than 4 jobs at a time, and how do I make the main thread wait until all the Future threads complete? I tried Await.result and Await.ready but was not able to control the main thread, so I am using Thread.sleep instead. The program reads from an RDBMS table and writes to Elasticsearch. So how do I control the main thread?
Assuming that you use the scala.concurrent.ExecutionContext.Implicits.global ExecutionContext, you can tune the number of threads as described here:
https://github.com/scala/scala/blob/2.12.x/src/library/scala/concurrent/impl/ExecutionContextImpl.scala#L100
Specifically, the following system properties: scala.concurrent.context.minThreads, scala.concurrent.context.numThreads, scala.concurrent.context.maxThreads, and scala.concurrent.context.maxExtraThreads.
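These are JVM system properties, so they have to be set before the global ExecutionContext is first touched, either as -D flags on the command line or programmatically at the very top of main. A sketch (the value 16 is an arbitrary example):
// Must run before anything initializes ExecutionContext.Implicits.global
System.setProperty("scala.concurrent.context.minThreads", "16")
System.setProperty("scala.concurrent.context.numThreads", "16")
System.setProperty("scala.concurrent.context.maxThreads", "16")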
Otherwise, you can rewrite your code to something like this:
import scala.collection.immutable
import scala.concurrent.duration._
import scala.concurrent._
import java.util.concurrent.Executors
test("test case") {
implicit val ec = ExecutionContext.fromExecutorService(ExecutorService.newFixedThreadPool(NUMBEROFTHREADSYOUWANT))
val aFuture = Future.traverse(1 to 10) {
i => Future {
println("Executing task " + i)
read(fixture.etlSparkLayer,i) // If this is a blocking operation you may want to consider wrapping it in a `blocking {}`-block.
}
}
aFuture.onComplete(_ => ec.shutdownNow()) // Only for this test, and to make sure the pool gets cleaned up
val await_result: immutable.Seq[Dataset[Row]] = Await.result(aFuture, 60.minutes) // Or other timeout
println("Squares: " + await_result)
}
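One caveat on the blocking {} hint in that comment: the blocking marker only helps execution contexts that know how to compensate, such as the default global ForkJoinPool; a fixed thread pool like the one created above simply ignores it. With the global context it would look roughly like this (a sketch, reusing the question's read helper):
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.{Future, blocking}

val f = Future {
  blocking { // lets the ForkJoin pool spawn a compensating thread while this blocks
    read(fixture.etlSparkLayer, 1)
  }
}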
I'm trying to get familiar with Slick 3.0 and Futures (using Scala 2.11.6). I use simple code based on Slick's Multi-DB Cake Pattern example. Why does the following code terminate with an exception and how to fix it?
import scala.concurrent.Await
import scala.concurrent.duration._
import slick.jdbc.JdbcBackend.Database
import scala.concurrent.ExecutionContext.Implicits.global
class Dispatcher(db: Database, dal: DAL) {
  import dal.driver.api._

  def init() = {
    db.run(dal.create)
    try db.run(dal.stuffTable += Stuff(23, "hi"))
    finally db.close

    val x = {
      try db.run(dal.stuffTable.filter(_.serial === 23).result)
      finally db.close
    }
    // This crashes:
    val result = Await.result(x, 2 seconds)
  }
}
Execution fails with:
java.util.concurrent.RejectedExecutionException: Task slick.backend.DatabaseComponent$DatabaseDef$$anon$2@5c73f637 rejected from java.util.concurrent.ThreadPoolExecutor@4129c44c[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 2]
  at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
  at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
  at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
  at scala.concurrent.impl.ExecutionContextImpl$$anon$1.execute(ExecutionContextImpl.scala:136)
  at slick.backend.DatabaseComponent$DatabaseDef$class.runSynchronousDatabaseAction(DatabaseComponent.scala:224)
  at slick.jdbc.JdbcBackend$DatabaseDef.runSynchronousDatabaseAction(JdbcBackend.scala:38)
  at slick.backend.DatabaseComponent$DatabaseDef$class.runInContext(DatabaseComponent.scala:201)
  at slick.jdbc.JdbcBackend$DatabaseDef.runInContext(JdbcBackend.scala:38)
  at slick.backend.DatabaseComponent$DatabaseDef$class.runInternal(DatabaseComponent.scala:75)
  at slick.jdbc.JdbcBackend$DatabaseDef.runInternal(JdbcBackend.scala:38)
  at slick.backend.DatabaseComponent$DatabaseDef$class.run(DatabaseComponent.scala:72)
  at slick.jdbc.JdbcBackend$DatabaseDef.run(JdbcBackend.scala:38)
  at Dispatcher.init(Dispatcher.scala:15)
  at SlickDemo$.main(SlickDemo.scala:16)
  at SlickDemo.main(SlickDemo.scala)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:606)
I think something is not correct in what you are trying to do: Slick's run method doesn't return Unit and doesn't fail with an exception, as it used to in previous versions. run now returns a Future, so if you want to run actions in sequence you need to flatMap the steps or use a for-comprehension:
def init() = {
  val results = for {
    _ <- db.run(dal.create)
    _ <- db.run(dal.stuffTable += Stuff(23, "hi"))
    r <- db.run(dal.stuffTable.filter(_.serial === 23).result)
  } yield r
}
I am not sure that you really need to use db.close that way: in fact, that may be what is causing the error (i.e. the db is closed concurrently with the future that runs the actual queries, so the execution can't happen).
If you want to handle errors, use Future's capabilities, e.g.:
import scala.util.control.NonFatal

results.onFailure {
  case NonFatal(ex) => // do something with the exception
}
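And if the intent of the original db.close calls was to release the database once the work is done, a minimal sketch (assuming results is the future built in the for-comprehension above) is to tie the close to the future's completion instead of running it eagerly:
// close the database only after all chained actions have completed (or failed)
results.onComplete(_ => db.close())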
I am reading the Akka Scala documentation, and there is an example (p. 171, bottom):
// imports added for compilation
import scala.concurrent.{ExecutionContext, Future}
import ExecutionContext.Implicits.global

class Some {
}

object Some {
  def main(args: Array[String]) {
    // Create a sequence of Futures
    val futures = for (i <- 1 to 1000) yield Future(i * 2)
    val futureSum = Future.fold(futures)(0)(_ + _)
    futureSum foreach println
  }
}
I ran it, but nothing happened: there was no console output. What is wrong?
You don't wait for the future to complete, so you create a race between the program exiting on one side and the futures completing and the side effect running on the other. On your machine the future seems to lose the race; on the machines of the commenters who say "it works", the future is winning it.
You can use Await to block on a future and wait for it to complete. This is something you should only be doing "at the ends of the world"; you should very rarely actually be using Await...
// imports added for compilation
import scala.concurrent.{ExecutionContext, Future}
import ExecutionContext.Implicits.global
import scala.concurrent.duration._ // for the "1.second" syntax
import scala.concurrent.Await

class Some {
}

object Some {
  def main(args: Array[String]) {
    // Create a sequence of Futures
    val futures = for (i <- 1 to 1000) yield Future(i * 2)
    val futureSum = Future.fold(futures)(0)(_ + _)
    // we map instead of foreach, to make sure that the side-effect is part of the future
    // and we "await" for the future to complete (for 1 second)
    Await.result(futureSum map println, 1.second)
  }
}
As others have stated, the issue is the race condition where the futures are competing with the program terminating. The JVM has a concept of daemon threads. It waits for non-daemon threads to terminate but not daemon threads. So if you want to wait for threads to complete, use non-daemon threads.
The way threads are created for Scala futures is via an implicit scala.concurrent.ExecutionContext. The one you use (import ExecutionContext.Implicits.global) starts daemon threads. However, it is possible to use non-daemon threads, and if you use an ExecutionContext backed by non-daemon threads, the JVM will wait for them, which in your case is the behaviour you want. Naively:
import scala.concurrent.Future
import scala.concurrent.ExecutionContextExecutor
import scala.concurrent.ExecutionContext

class MyExecutionContext extends ExecutionContext {
  override def execute(runnable: Runnable) = {
    val t = new Thread(runnable)
    t.setDaemon(false)
    t.start()
  }
  override def reportFailure(t: Throwable) = t.printStackTrace
}

object Some {
  implicit lazy val context: ExecutionContext = new MyExecutionContext

  def main(args: Array[String]) {
    // Create a sequence of Futures
    val futures = for (i <- 1 to 1000) yield Future(i * 2)
    val futureSum = Future.fold(futures)(0)(_ + _)
    futureSum foreach println
  }
}
Be careful with using the above ExecutionContext in production, because it doesn't use a thread pool and can create an unbounded number of threads. But the message is: you can control everything about the threads behind Futures through an ExecutionContext. Explore the various Scala and Akka contexts to find what you need, or, if nothing suits, write your own.
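For a production-leaning variant, here is a sketch of a bounded pool whose threads are still non-daemon (the pool size 4 is an arbitrary example; remember to shut the pool down when the work is done, or the idle non-daemon threads will keep the JVM alive):
import java.util.concurrent.{Executors, ThreadFactory}
import scala.concurrent.ExecutionContext

// A fixed-size pool of non-daemon threads: the JVM waits for them,
// but their number stays bounded.
val nonDaemonFactory = new ThreadFactory {
  def newThread(r: Runnable): Thread = {
    val t = new Thread(r)
    t.setDaemon(false) // this is the default; made explicit here
    t
  }
}
implicit val boundedContext: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4, nonDaemonFactory))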
Either of the following statements at the end of the main function would meet your need. As the answers above say, you must allow the future to complete: the main thread is separate from the future's thread, and if main finishes first, the program terminates before the future's thread does.
Thread.sleep(500) // ... simple solution

Await.result(futureSum, Duration(500, MILLISECONDS)) // ... have to import scala.concurrent.duration._ to use the Duration object