I am using the code below, taken from the Databricks documentation, to run notebooks in parallel in Scala: https://docs.databricks.com/notebooks/notebook-workflows.html#run-multiple-notebooks-concurrently. I am trying to add a retry feature: if one of the notebooks in the sequence fails, it should be retried up to the number of times given by a retry value I pass in.
Here is the parallel notebook code from Databricks:
//parallel notebook code
import scala.concurrent.{Future, Await}
import scala.concurrent.duration._
import scala.util.control.NonFatal
case class NotebookData(path: String, timeout: Int, parameters: Map[String, String] = Map.empty[String, String])
def parallelNotebooks(notebooks: Seq[NotebookData]): Future[Seq[String]] = {
import scala.concurrent.{Future, blocking, Await}
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext
import com.databricks.WorkflowException
val numNotebooksInParallel = 5
// If you create too many notebooks in parallel the driver may crash when you submit all of the jobs at once.
// This code limits the number of parallel notebooks.
implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(numNotebooksInParallel))
val ctx = dbutils.notebook.getContext()
Future.sequence(
notebooks.map { notebook =>
Future {
dbutils.notebook.setContext(ctx)
if (notebook.parameters.nonEmpty)
dbutils.notebook.run(notebook.path, notebook.timeout, notebook.parameters)
else
dbutils.notebook.run(notebook.path, notebook.timeout)
}
.recover {
case NonFatal(e) => s"ERROR: ${e.getMessage}"
}
}
)
}
This is an example of how I call the above code to run multiple example notebooks:
import scala.concurrent.Await
import scala.concurrent.duration._
import scala.language.postfixOps
val notebooks = Seq(
NotebookData("Notebook1", 0, Map("client"->client)),
NotebookData("Notebook2", 0, Map("client"->client))
)
val res = parallelNotebooks(notebooks)
Await.result(res, 3000000 seconds) // this is a blocking call.
res.value
Here is one attempt. Since your code does not compile on its own, I inserted a few dummy classes.
Also, you did not fully specify the desired behavior, so I made some assumptions. Each notebook is retried at most five times. If any of the Futures is still failing after five retries, the entire Future fails. Both behaviors can be changed, but since you did not specify, I am not sure what you want.
If you have questions or would like me to alter the program, let me know in the comments.
object TestNotebookData extends App{
//parallel notebook code
import scala.concurrent.{Future, Await}
import scala.concurrent.duration._
import scala.util.control.NonFatal
case class NotebookData(path: String, timeout: Int, parameters: Map[String, String] = Map.empty[String, String])
case class Context()
case class Notebook(){
def getContext(): Context = Context()
def setContext(ctx: Context): Unit = ()
def run(path: String, timeout: Int, parameters: Map[String, String] = Map()): Seq[String] = Seq()
}
case class Dbutils(notebook: Notebook)
val dbutils = Dbutils(Notebook())
def parallelNotebooks(notebooks: Seq[NotebookData]): Future[Seq[Seq[String]]] = {
import scala.concurrent.{Future, blocking, Await}
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext
val numNotebooksInParallel = 5
// This code limits the number of parallel notebooks.
implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(numNotebooksInParallel))
val ctx = dbutils.notebook.getContext()
val isRetryable = true
val retries = 5
def runNotebook(notebook: NotebookData): Future[Seq[String]] = {
def retryWrapper(retry: Boolean, current: Int, max: Int): Future[Seq[String]] = {
val fut = Future { runNotebookInner() }
// Retry only non-fatal failures, and only while attempts remain.
if (retry && current < max) fut.recoverWith { case NonFatal(_) => retryWrapper(retry, current + 1, max) }
else fut
}
def runNotebookInner() = {
dbutils.notebook.setContext(ctx)
if (notebook.parameters.nonEmpty)
dbutils.notebook.run(notebook.path, notebook.timeout, notebook.parameters)
else
dbutils.notebook.run(notebook.path, notebook.timeout)
}
retryWrapper(isRetryable, 0, retries)
}
Future.sequence(
notebooks.map { notebook =>
runNotebook(notebook)
}
)
}
val notebooks = Seq(
NotebookData("Notebook1", 0, Map("client"->"client")),
NotebookData("Notebook2", 0, Map("client"->"client"))
)
val res = parallelNotebooks(notebooks)
Await.result(res, 3000000.seconds) // this is a blocking call.
res.value
}
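The retry logic can also be pulled out of the Databricks-specific code so it can be tried without a cluster. The following is only a minimal sketch under stated assumptions: `retry`, `flakyRun`, and `attempts` are hypothetical names introduced here, and `flakyRun` merely stands in for `dbutils.notebook.run` by failing twice before succeeding.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import scala.util.control.NonFatal

object RetrySketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Runs `task` in a Future. On a non-fatal failure it runs the task again,
  // up to `maxRetries` extra times. `task` is by-name, so every attempt
  // re-evaluates it. When no retries remain, the last failure is kept.
  def retry[T](maxRetries: Int)(task: => T): Future[T] =
    Future(task).recoverWith {
      case NonFatal(_) if maxRetries > 0 => retry(maxRetries - 1)(task)
    }

  // Hypothetical stand-in for dbutils.notebook.run: fails twice, then succeeds.
  val attempts = new AtomicInteger(0)
  def flakyRun(path: String): String = {
    val n = attempts.incrementAndGet()
    if (n < 3) throw new RuntimeException(s"attempt $n failed")
    s"$path finished on attempt $n"
  }

  def main(args: Array[String]): Unit = {
    val result = Await.result(retry(maxRetries = 5)(flakyRun("Notebook1")), 10.seconds)
    println(result) // prints "Notebook1 finished on attempt 3"
  }
}
```

In the notebook code, the body passed to `retry` would presumably be the `setContext` call followed by `dbutils.notebook.run(...)`, with the bounded thread pool as the implicit execution context.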
I found this to work:
import scala.util.{Try, Success, Failure}
def tryNotebookRun (path: String, timeout: Int, parameters: Map[String, String] = Map.empty[String, String]): Try[Any] = {
Try(
if (parameters.nonEmpty){
dbutils.notebook.run(path, timeout, parameters)
}
else{
dbutils.notebook.run(path, timeout)
}
)
}
//parallel notebook code
import scala.concurrent.{Future, Await}
import scala.concurrent.duration._
import scala.util.control.NonFatal
def runWithRetry(path: String, timeout: Int, parameters: Map[String, String] = Map.empty[String, String], maxRetries: Int = 2) = {
var numRetries = 0
while (numRetries < maxRetries){
tryNotebookRun(path, timeout, parameters) match {
case Success(_) => numRetries = maxRetries
case Failure(_) => numRetries = numRetries + 1
}
}
}
case class NotebookData(path: String, timeout: Int, parameters: Map[String, String] = Map.empty[String, String])
def parallelNotebooks(notebooks: Seq[NotebookData]): Future[Seq[Any]] = {
import scala.concurrent.{Future, blocking, Await}
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext
import com.databricks.WorkflowException
val numNotebooksInParallel = 5
// If you create too many notebooks in parallel the driver may crash when you submit all of the jobs at once.
// This code limits the number of parallel notebooks.
implicit val ec = ExecutionContext.fromExecutor(Executors.newFixedThreadPool(numNotebooksInParallel))
val ctx = dbutils.notebook.getContext()
Future.sequence(
notebooks.map { notebook =>
Future {
dbutils.notebook.setContext(ctx)
runWithRetry(notebook.path, notebook.timeout, notebook.parameters)
}
.recover {
case NonFatal(e) => s"ERROR: ${e.getMessage}"
}
}
)
}
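One thing to note about `runWithRetry` above is that it returns `Unit`, so the notebook's return value is discarded and a notebook that fails on every attempt is not reported by the `.recover` block. A possible variant that keeps the outcome is sketched below; it is an illustration only, and the `maxAttempts` parameter and the `run` thunk are names chosen here rather than part of the original code.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object RunWithRetrySketch {
  // Calls `run` up to `maxAttempts` times. Returns the first Success,
  // or the last Failure if every attempt fails.
  @tailrec
  def runWithRetry[T](maxAttempts: Int)(run: () => T): Try[T] =
    Try(run()) match {
      case success @ Success(_)          => success
      case Failure(_) if maxAttempts > 1 => runWithRetry(maxAttempts - 1)(run)
      case failure                       => failure
    }

  def main(args: Array[String]): Unit = {
    var calls = 0
    // Hypothetical stand-in for a notebook run that fails on its first attempt.
    val result = runWithRetry(maxAttempts = 2) { () =>
      calls += 1
      if (calls == 1) throw new IllegalStateException("first attempt failed")
      "notebook result"
    }
    println(result) // prints "Success(notebook result)"
  }
}
```

Inside the `Future { ... }` block, calling `.get` on the returned `Try` would rethrow the last failure, so the existing `.recover` clause could turn it into an `ERROR: ...` string.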
I'd like to measure elapsed time inside an IO container. It's relatively easy to do with plain calls or futures (e.g. something like the code below):
class MonitoringComponentSpec extends FunSuite with Matchers with ScalaFutures {
import scala.concurrent.ExecutionContext.Implicits.global
def meter[T](future: Future[T]): Future[T] = {
val start = System.currentTimeMillis()
future.onComplete(_ => println(s"Elapsed ${System.currentTimeMillis() - start}ms"))
future
}
def call(): Future[String] = Future {
Thread.sleep(500)
"hello"
}
test("metered call") {
whenReady(meter(call()), timeout(Span(550, Milliseconds))) { s =>
s should be("hello")
}
}
}
But I am not sure how to wrap an IO call:
def io_meter[T](effect: IO[T]): IO[T] = {
val start = System.currentTimeMillis()
???
}
// IO { ... } suspends the block; IO.pure would evaluate its argument eagerly.
def io_call(): IO[String] = IO {
Thread.sleep(500)
"hello"
}
test("metered io call") {
whenReady(meter(call()), timeout(Span(550, Milliseconds))) { s =>
s should be("hello")
}
}
Thank you!
Cats Effect has a Clock type class that allows pure time measurement and lets you inject your own implementation for testing when you just want to simulate the passing of time. The example from their documentation (written against the Cats Effect 2 API, where monotonic takes a TimeUnit) is:
def measure[F[_], A](fa: F[A])
(implicit F: Sync[F], clock: Clock[F]): F[(A, Long)] = {
for {
start <- clock.monotonic(MILLISECONDS)
result <- fa
finish <- clock.monotonic(MILLISECONDS)
} yield (result, finish - start)
}
In Cats Effect 3, you can use .timed, for example:
import cats.effect.IO
import cats.effect.unsafe.implicits.global
import cats.implicits._
import scala.concurrent.duration._
val twoSecondsLater = IO.sleep(2.seconds) *> IO.println("Hi")
val elapsedTime = twoSecondsLater.timed.map(_._1)
elapsedTime.unsafeRunSync()
// This prints something like:
// Hi
// res0: FiniteDuration = 2006997667 nanoseconds
I want my stream to fail if the time between finishing the processing of one element and beginning the processing of the next exceeds a specific amount.
None of the current timeout methods seem to deal with this case. How would I do this?
This is the closest I have come to a solution:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}
import concurrent.duration._
import concurrent.Await
import concurrent.ExecutionContext.Implicits.global
import concurrent.Future
implicit class StreamTakeWithinTime[Out, Mat](src: Source[Out, Mat]) {
def takeWithinTime(maxIdleTime: FiniteDuration): Source[Out, Mat] =
src
.map(Option.apply)
.keepAlive(maxIdleTime, () => None)
.takeWhile {
case Some(_) => true
case None => false
}
.collect {
case Some(x) => x
}
}
implicit val actorSystem = ActorSystem("test")
implicit val actorMaterializer = ActorMaterializer()
var delay = 0
def tick = {
delay += 500
Thread.sleep(delay)
"tick"
}
val maxIdleTime = 2.seconds
val pipeline = Source
.fromIterator(() =>
new Iterator[String] {
override def hasNext: Boolean = true
override def next(): String = tick
})
.map { s =>
println("Long processing function...")
Thread.sleep(3000)
s
}
.takeWithinTime(maxIdleTime)
val res = Await.result(pipeline.runForeach(println), 30.seconds)
println("done")
which prints:
Long processing function...
tick
Long processing function...
tick
Long processing function...
tick
Long processing function...
done
I don't know how to describe it exactly, so please see the code:
def callForever(f: Future[Int]): Unit = {
f.onComplete {
case Failure(e) =>
//do something
case Success(c) =>
// do again
val nextConn: Future[Int] = connection()
callForever(nextConn)
}
}
It's ordinary recursion; I use it to listen on a socket and wait for an asynchronous connection.
Because it runs forever, I want to improve it. Can I refactor it to be tail-recursive?
I just thought you might want to look at this way of doing it, which looks a bit better to me:
import scala.concurrent.Future
import scala.util.{Failure, Success, Random}
import scala.concurrent.ExecutionContext.Implicits.global
/**
* Created by Alex on 2/29/2016.
*/
object Test {
def giveMeValue:Future[Int] = Future.successful{Random.nextInt()}
def callForever(f:Future[Int]):Future[Int] = {
println("iteration")
f flatMap(i => {println(i); callForever(giveMeValue)})
}
def main(args: Array[String]): Unit = {
callForever(giveMeValue)
while(true){}
}
}
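On the tail-recursion question itself: `@tailrec` does not apply to this shape, because the recursive call sits inside a callback rather than in tail position of the method. That callback is scheduled on the execution context, so the chain is not ordinary stack recursion. The bounded variant below is a sketch for experimenting with this; `connection` here is a hypothetical stand-in for the real asynchronous call, and `callN` is a name introduced for the example.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object LoopSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Hypothetical stand-in for the asynchronous `connection()` call.
  def connection(): Future[Int] = Future(1)

  // Chains the next call inside flatMap. The loop ends after `remaining`
  // connections, or earlier if one of the Futures fails, in which case the
  // failure is propagated to the returned Future.
  def callN(remaining: Int, handled: Int = 0): Future[Int] =
    if (remaining <= 0) Future.successful(handled)
    else connection().flatMap(c => callN(remaining - 1, handled + c))

  def main(args: Array[String]): Unit = {
    // Sums the values of 10000 chained connections.
    println(Await.result(callN(10000), 30.seconds))
  }
}
```

Returning the `Future` instead of `Unit` also lets the caller observe a failure, which the `onComplete` version handles only inside the callback.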
I have a couple of futures. campaignFuture returns a List[BigInt] and I want to be able to call the second future profileFuture for each of the values in the list returned from the first one. The second future can only be called when the first one is complete. How do I achieve this in Scala?
campaignFuture(1923).flatMap?? (maybe?)
def campaignFuture(advertiserId: Int): Future[List[BigInt]] = Future {
val campaignHttpResponse = getCampaigns(advertiserId.intValue())
parseProfileIds(campaignHttpResponse.entity.asString)
}
def profileFuture(profileId: Int): Future[List[String]] = Future {
val profileHttpResponse = getProfiles(profileId.intValue())
parseSegmentNames(profileHttpResponse.entity.asString)
}
A for comprehension is not applicable here because we have a mix of Lists and Futures.
So, your friends are map and flatMap.
To react to each Future's result as it completes:
import java.util.Date
import scala.concurrent.{Future, Promise, Await}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global
def campaignFuture(advertiserId: Int): Future[List[BigInt]] = Future {
List(1, 2, 3)
}
def profileFuture(profileId: Int): Future[List[String]] = {
// delayed Future
val p = Promise[List[String]]
Future {
val delay: Int = (math.random * 5).toInt
Thread.sleep(delay * 1000)
p.success(List(s"profile-for:$profileId", s"delayed:$delay sec"))
}
p.future
}
// type: Future[List[Future[List[String]]]]
val listOfProfileFuturesFuture = campaignFuture(1).map { campaign =>
campaign.map(id => profileFuture(id.toInt))
}
// React on Futures which are done
listOfProfileFuturesFuture foreach { campaignFutureRes =>
campaignFutureRes.foreach { profileFutureRes =>
profileFutureRes.foreach(profileListEntry => println(s"${new Date} done: $profileListEntry"))
}
}
// !!ONLY FOR TESTING PURPOSE - THIS CODE BLOCKS AND EXITS THE VM WHEN THE FUTURES ARE DONE!!
println(s"${new Date} waiting for futures")
listOfProfileFuturesFuture.foreach{listOfFut =>
Await.ready(Future.sequence(listOfFut), Duration.Inf)
println(s"${new Date} all futures done")
System.exit(0)
}
scala.io.StdIn.readLine()
To get the results of all Futures at once:
import scala.concurrent.{Future, Await}
import scala.concurrent.duration.Duration
import scala.concurrent.ExecutionContext.Implicits.global
def campaignFuture(advertiserId: Int): Future[List[BigInt]] = Future {
List(1, 2, 3)
}
def profileFuture(profileId: Int): Future[List[String]] = Future {
List(s"profile-for:$profileId")
}
// type: Future[List[Future[List[String]]]]
val listOfProfileFutures = campaignFuture(1).map { campaign =>
campaign.map(id => profileFuture(id.toInt))
}
// type: Future[List[List[String]]]
val listOfProfileFuture = listOfProfileFutures.flatMap(s => Future.sequence(s))
// print the result
//listOfProfileFuture.foreach(println)
//scala.io.StdIn.readLine()
// wait for the result (THIS BLOCKS INDEFINITELY!)
Await.result(listOfProfileFuture, Duration.Inf)
We use Future.sequence to convert a List[Future[T]] into a Future[List[T]].
We use flatMap to get a Future[T] from a Future[Future[T]].
If you need to wait for the result (BLOCKING!), use Await.
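The map-then-sequence steps can also be written in one pass with `Future.traverse` from the standard library. The sketch below reuses the hard-coded stand-ins for `campaignFuture` and `profileFuture`; `allProfiles` is a name introduced only for this example.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object TraverseSketch {
  implicit val ec: ExecutionContext = ExecutionContext.global

  // Stand-ins with hard-coded data instead of real HTTP calls.
  def campaignFuture(advertiserId: Int): Future[List[BigInt]] =
    Future(List(BigInt(1), BigInt(2), BigInt(3)))

  def profileFuture(profileId: Int): Future[List[String]] =
    Future(List(s"profile-for:$profileId"))

  // flatMap waits for the campaign ids; Future.traverse then starts one
  // profileFuture per id and collects the results into a single Future.
  def allProfiles(advertiserId: Int): Future[List[List[String]]] =
    campaignFuture(advertiserId).flatMap { ids =>
      Future.traverse(ids)(id => profileFuture(id.toInt))
    }

  def main(args: Array[String]): Unit = {
    // Blocking here is only for the demo.
    println(Await.result(allProfiles(1), 10.seconds))
    // prints "List(List(profile-for:1), List(profile-for:2), List(profile-for:3))"
  }
}
```

Appending `.map(_.flatten)` to `allProfiles` would give a flat `Future[List[String]]` if the grouping per profile is not needed.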