I have a use case in Databricks where an API call has to be made for each record in a dataset of URLs. The dataset has around 100K records.
The maximum allowed concurrency is 3.
I did the implementation in Scala and ran it in a Databricks notebook. Apart from the one element left pending in the queue, I feel something is missing here.
Are the BlockingQueue and thread pool the right way to tackle this problem?
In the code below I have modified things so that, instead of reading from the dataset, I am sampling from a Seq.
Any help/thoughts will be much appreciated.
import java.util.concurrent.{ArrayBlockingQueue, BlockingQueue, Executors}

val inpQueue: BlockingQueue[(Int, String)] = new ArrayBlockingQueue[(Int, String)](1)

val inpDS = Seq(
  (1, "https://google.com/2X6barD"), (2, "https://google.com/3d9vCgW"),
  (3, "https://google.com/2M02Xz0"), (4, "https://google.com/2XOu2uL"),
  (5, "https://google.com/2AfBWF0"), (6, "https://google.com/36AEKsw"),
  (7, "https://google.com/3enBxz7"), (8, "https://google.com/36ABq0x"),
  (9, "https://google.com/2XBjmiF"), (10, "https://google.com/36Emlen"))

val pool = Executors.newFixedThreadPool(3)

var i = 0
inpDS.foreach { ix =>
  inpQueue.put(ix) // blocks while the single slot is occupied
  val t = new ConsumerAPIThread()
  t.setName("MyThread-" + i + " ") // names the Thread object, not the pool thread that actually runs it
  pool.execute(t)
  i = i + 1
}
println("Final Queue Size = " + inpQueue.size + "\n")

class ConsumerAPIThread extends Thread {

  def this(s: String) = { this(); setName(s) } // auxiliary constructor to set the thread name

  override def run(): Unit = {
    val urlDetail = inpQueue.take()
    print(getName + " " + Thread.currentThread().getName + " popped " + urlDetail + " Queue Size " + inpQueue.size + " \n")
    triggerAPI((urlDetail._1, urlDetail._2))
  }

  def triggerAPI(params: (Int, String)): Unit = {
    try {
      val result = scala.io.Source.fromURL(params._2)
      println("" + result)
    } catch {
      case ex: Exception => println("Exception caught: " + ex.getMessage)
    }
  }
}
So, you have two requirements: the functional one is that you want to process the items in a list asynchronously; the non-functional one is that you don't want to process more than three items at once.
Regarding the latter, the nice thing is that, as you already showed in your question, Java natively exposes a nicely packaged Executor that runs tasks on a fixed-size thread pool, elegantly allowing you to cap the concurrency level if you work with threads.
Moving to the functional requirement, Scala helps by having something that does precisely that as part of its standard API: scala.concurrent.Future. In order to use it we'll have to reframe triggerAPI in terms of Future. The content of the function is not particularly relevant, so we'll mostly focus on its (revised) signature for now:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext
def triggerAPI(params: (Int, String))(implicit ec: ExecutionContext): Future[Unit] =
  Future {
    // some code that takes some time to run...
  }
Notice that now triggerAPI returns a Future. A Future can be thought of as a read-handle to something that is going to be computed eventually. In particular, this is a Future[Unit], where Unit stands for "we don't particularly care about the output of this function, but mostly about its side effects".
Furthermore, notice that the method now takes an implicit parameter, namely an ExecutionContext. The ExecutionContext is used to provide Futures with some form of environment where the computation happens. Scala has an API to create an ExecutionContext from a java.util.concurrent.ExecutorService, so this will come in handy to run our computation on the fixed thread pool, running no more than three callbacks at any given time.
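For concreteness, here is a minimal sketch (not spelled out until the full implementation at the end of this answer, which does the same thing) of building such an ExecutionContext from a fixed pool of three threads:

import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

// Any Future constructed with this context in scope runs on the 3-thread pool.
implicit val ec: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(3))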
Before moving forward, if you have questions about Futures, ExecutionContexts and implicit parameters, the Scala documentation is your best source of knowledge (here are a couple of pointers: 1, 2).
Now that we have the new triggerAPI method, we can use Future.traverse (here is the documentation for Scala 2.12 -- the latest version at the time of writing is 2.13 but to the best of my knowledge Spark users are stuck on 2.12 for the time being).
The tl;dr of Future.traverse is that it takes some form of container and a function that takes the items in that container and returns a Future of something else. The function will be applied to each item in the container and the result will be a Future of the container of the results. In your case: the container is a List, the items are (Int, String) and the something else you return is a Unit.
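For reference, a simplified sketch of its signature (the real 2.12 signature is generic over collection types via CanBuildFrom; this version is specialized to List for readability):

// def traverse[A, B](in: List[A])(fn: A => Future[B])(implicit ec: ExecutionContext): Future[List[B]]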
This means that you can simply call it like this:
Future.traverse(inpDS)(triggerAPI)
And triggerAPI will be applied to each item in inpDS.
By making sure that the execution context backed by the thread pool is in the implicit scope when calling Future.traverse, the items will be processed with the desired thread pool.
The result of the call is Future[List[Unit]], which is not very interesting and can simply be discarded (as you are only interested in the side effects).
That was a lot of talk; if you want to play around with the code I described, you can do so here on Scastie.
For reference, this is the whole implementation:
import java.util.concurrent.{ExecutorService, Executors}
import scala.concurrent.duration.DurationLong
import scala.concurrent.{ExecutionContext, ExecutionContextExecutorService, Future}

val datasets = List(
  (1, "https://google.com/2X6barD"),
  (2, "https://google.com/3d9vCgW"),
  (3, "https://google.com/2M02Xz0"),
  (4, "https://google.com/2XOu2uL"),
  (5, "https://google.com/2AfBWF0"),
  (6, "https://google.com/36AEKsw"),
  (7, "https://google.com/3enBxz7"),
  (8, "https://google.com/36ABq0x"),
  (9, "https://google.com/2XBjmiF")
)

val executor: ExecutorService = Executors.newFixedThreadPool(3)

implicit val executionContext: ExecutionContextExecutorService =
  ExecutionContext.fromExecutorService(executor)

def triggerAPI(params: (Int, String))(implicit ec: ExecutionContext): Future[Unit] =
  Future {
    val (index, _) = params
    println(s"+ started processing $index")
    val start = System.nanoTime() / 1000000
    Iterator.from(0).map(_ + 1).drop(100000000).take(1).toList.head // a noticeably slow operation
    val end = System.nanoTime() / 1000000
    val duration = (end - start).millis
    println(s"- finished processing $index after $duration")
  }

Future.traverse(datasets)(triggerAPI).onComplete { _ =>
  println("* processing is over, shutting down the executor")
  executionContext.shutdown()
}
You need to shut down the executor after your job is done, otherwise it will keep waiting for new tasks.
Try adding pool.shutdown() at the end of your program.
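For example, a minimal sketch using the pool variable from your question (the 10-minute timeout is an arbitrary choice):

import java.util.concurrent.TimeUnit

// Stop accepting new tasks, then wait for the in-flight ones to finish.
pool.shutdown()
pool.awaitTermination(10, TimeUnit.MINUTES)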
As a first attempt, I tried to use Await.result on the head of the Seq and then use the lazy #:: Stream constructor. However, it does not seem to work as well as expected: I haven't found a way to tell the scheduler to prioritize the order of the list, nor does the compiler recognize it as @tailrec.
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration

implicit class SeqOfFuture[X](seq: Seq[Future[X]]) {
  lazy val stream: Stream[X] =
    if (seq.nonEmpty) Await.result(seq.head, Duration.Inf) #:: seq.tail.stream
    else Stream.empty
}
I am attempting this because Future.collect seems to wait until the whole strict Seq is available/ready before mapping/flatmapping/transforming it further. (And there are other computations I might start with the stream of intermediate results.)
(Proto)Example of usage:
val searches = Seq(searchParam1, searchParam2, /* ... */ searchParam200)
// big queries that take some 100ms each, for a ~20s total wait
val futureDbResult = searches.map(search => ctx.run(query(search))).stream // ctx is a quill context
// stuff that should happen as results become available, instead of blocking/waiting ~20 seconds before starting
val processedResults = futureDbResult.map(transform).filter(reduce)
// log?
processedResults.map(result => log.info(result)) // or log.trace
// return the lazy processedResults list or Future { processedResults }
???
As others have pointed out, you really should look into a real streaming library like fs2 or monix. I personally think monix is a good fit if you're interfacing with Future and only need it in a small part of your application. It has great APIs and documentation for this use-case.
Here's a small demo for your use-case:
import monix.eval.Task
import monix.execution.Scheduler.Implicits.global
import monix.reactive.Observable

import scala.concurrent.duration._
import scala.util.Random

// requires: libraryDependencies += "io.monix" %% "monix" % "3.0.0"
object Main {

  val searchParams = (1 to 200).map(n => s"Search $n")

  /**
   * Simulates a query. If your library returns a Future, you can wrap it with `Task.deferFuture`.
   */
  def search(param: String): Task[String] =
    Task(s"Result for $param").delayResult(Random.between(25, 250).milliseconds)

  val results: Task[List[String]] =
    Observable
      .fromIterable(searchParams)
      .mapParallelUnordered(parallelism = 4)(param => search(param))
      .mapEval { result =>
        Task(println(result)).map(_ => result) // print intermediate results as feedback
      }
      .toListL // collect results into a List

  /**
   * If you aren't going all-in on monix, you would probably run the stream into a Future with `results.runToFuture`.
   */
  def main(args: Array[String]): Unit = results.map(_ => ()).runSyncUnsafe()
}
You can think of Task as a lazy and more powerful Future. Observable is a (reactive) stream which will automatically apply back-pressure if downstream is slow. In this example only 4 queries run in parallel; the others wait until a "slot" becomes available.
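As a small illustration of that difference (a sketch of mine, not part of the demo above): a Future starts executing as soon as it is created, whereas a Task is only a description of a computation and runs nothing until you explicitly execute it.

import monix.eval.Task
import monix.execution.Scheduler.Implicits.global
import scala.concurrent.Future

val eager = Future(println("future: runs as soon as it is created"))
val described = Task(println("task: runs only when executed"))
described.runToFuture // only now does the Task's println happen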
Keep in mind that in these libraries side effects (like println) have to be wrapped in Task (or IO, depending on what you use).
You can run this example locally if you provide the monix-dependency and play around with it to get a feel for how it works.
A colleague of mine stated the following, about using a Java ReentrantReadWriteLock in some Scala code:
Acquiring the lock here is risky. It's "reentrant", but that internally depends on the thread context. F may run different stages of the same computation in different threads. You can easily cause a deadlock.
F here refers to some effectful monad.
Basically what I'm trying to do is to acquire the same reentrant lock twice, within the same monad.
Could somebody clarify why this could be a problem?
The code is split into two files. The outermost one:
val lock: Resource[F, Unit] = for {
  // some other resource
  _ <- store.writeLock
} yield ()

lock.use { _ =>
  for {
    // stuff
    _ <- EitherT(store.doSomething())
    // other stuff
  } yield ()
}
Then, in the store:
import java.util.concurrent.locks.{Lock, ReentrantReadWriteLock}
import cats.effect.{Resource, Sync}

private def lockAsResource[F[_]](lock: Lock)(implicit F: Sync[F]): Resource[F, Unit] =
  Resource.make {
    F.delay(lock.lock())
  } { _ =>
    F.delay(lock.unlock())
  }

private val lock = new ReentrantReadWriteLock

val writeLock: Resource[F, Unit] = lockAsResource(lock.writeLock())

def doSomething(): F[Either[Throwable, Unit]] = writeLock.use { _ =>
  // etc etc
}
The writeLock in the two pieces of code is the same: a cats.effect.Resource[F, Unit] wrapping a ReentrantReadWriteLock's writeLock. There are some reasons why I wrote the code this way, so I'd rather not dig into that. I would just like to understand why (according to my colleague, at least) this could potentially break stuff.
Also, I'd like to know if there is some alternative in Scala that would allow something like this without the risk of deadlocks.
IIUC your question:
You expect that, for each interaction with the Resource, the lock.lock and lock.unlock actions happen on the same thread.
1) There is no guarantee of that at all, since you are using an arbitrary effect F here.
It's possible to write an implementation of F that executes every action on a new thread.
2) Even if we assume that F is IO, someone could call IO.shift inside the body of doSomething; the subsequent actions, including unlock, would then happen on another thread. It's probably not possible with the current signature of doSomething, but you get the idea.
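To make the thread-affinity point concrete, here is a minimal sketch of mine with plain threads (no effect system involved): the write lock of a ReentrantReadWriteLock is owned by the thread that acquired it, so a different thread can neither re-enter it nor release it.

import java.util.concurrent.locks.ReentrantReadWriteLock

val rw = new ReentrantReadWriteLock
rw.writeLock().lock() // acquired on the main thread

val t = new Thread(() => {
  // "Reentrant" only applies within one thread: from here the lock is
  // simply taken, so a blocking lock() call would hang forever.
  println(s"re-acquired from another thread? ${rw.writeLock().tryLock()}") // false
  // And releasing from the wrong thread fails outright:
  try rw.writeLock().unlock()
  catch { case e: IllegalMonitorStateException => println(s"unlock failed: $e") }
})
t.start()
t.join()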
Also, I'd like to know if there is some alternative in Scala that would allow something like this without the risk for deadlocks.
You can take a look at scalaz ZIO's STM.
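Alternatively (my own addition, not covered above), you can stay within cats-effect, which your code already uses: a one-permit Semaphore has no thread affinity, because acquire and release are plain effects that any thread can run. A minimal sketch against the cats-effect 2 API (assumed from your use of cats.effect.Resource):

import cats.effect.{Concurrent, Resource}
import cats.effect.concurrent.Semaphore

// Non-reentrant, but safe to acquire on one thread and release on another.
def semLock[F[_]: Concurrent](sem: Semaphore[F]): Resource[F, Unit] =
  Resource.make(sem.acquire)(_ => sem.release)

Note that it is still non-reentrant, so nesting semLock inside itself would deadlock just the same; the win is that thread hopping alone can no longer break it.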
I am learning scalaz-stream at the moment, and I am confused about why repeatEval only evaluates Task.async once.
val result = Process
  .repeatEval(Task.async[Unit](t => {
    val result = scala.io.Source.fromURL("http://someUrl").mkString
    println(".......")
    println(result)
  }))

result.runLog.run // only prints once
However, if I change Task.async to Task.delay, it evaluates the function infinitely. I don't know why that is.
val result = Process
  .repeatEval(Task.delay {
    val result = scala.io.Source.fromURL("http://someUrl").mkString
    println(".......")
    println(result)
  })

result.runLog.run // prints infinitely
Many thanks in advance
As I mention in my answer to your recent question about Task, Task.async takes a function that registers callbacks—not some code that should be executed asynchronously. In the case of the other question, you actually want Task.async, since you're interoperating with a callback-based API.
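For illustration, here is a sketch of what a legitimate Task.async use looks like: the callback must be invoked with a Throwable \/ A to complete the task (the function name and inline completion below are mine, not from your code):

import scalaz.{-\/, \/-}
import scalaz.concurrent.Task

def fetchAsync(url: String): Task[String] =
  Task.async { cb =>
    // Normally you'd hand cb to some callback-based API; here we complete it inline.
    try cb(\/-(scala.io.Source.fromURL(url).mkString))
    catch { case e: Throwable => cb(-\/(e)) }
  }

In your snippet the callback t is never invoked, so the first Task.async never completes: the printlns run once during registration, and then runLog blocks forever waiting for a result.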
Here it seems like you probably want Task.apply, not Task.delay. The two look similar, but delay simply suspends the computation—it doesn't use an ExecutorService to run it in a separate thread. You can see this in the following example:
import scalaz._, Scalaz._, concurrent._
val delayTask = Task.delay(Thread.sleep(5000))
val applyTask = Task(Thread.sleep(5000))
Nondeterminism[Task].both(delayTask, delayTask).run
Nondeterminism[Task].both(applyTask, applyTask).run
The delayTask version will take longer.
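Putting it together for your original snippet, a sketch of the repeatEval loop with Task.apply instead (same body as in your question, with its placeholder URL):

import scalaz.concurrent.Task
import scalaz.stream.Process

val result = Process
  .repeatEval(Task {
    val page = scala.io.Source.fromURL("http://someUrl").mkString
    println(".......")
    println(page)
  })

result.runLog.run // each evaluation now runs on the default executor

Like the Task.delay version it repeats forever, so in practice you would bound it, e.g. with .take(n) on the Process.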
I tested Scala's parallel collections against plain collections; here is my code:
def parallelParse(): Unit = {
  val adjs = wn.allSynsets(POS.ADJECTIVE).par
  adjs.foreach(adj => parse(proc.mkDocument(adj.getGloss)))
}

def serialParse(): Unit = {
  val adjs = wn.allSynsets(POS.ADJECTIVE)
  adjs.foreach(adj => parse(proc.mkDocument(adj.getGloss)))
}
The parallel collection sped things up about 3 times.
What other options do I have in Scala to make it even faster in parallel? I would be happy to test them and put the results here.
You can use futures to start asynchronous computations. You could do:
import scala.concurrent._
import scala.concurrent.duration._
import ExecutionContext.Implicits.global
val futures = wn.allSynsets(POS.ADJECTIVE).map(adj => Future {
  parse(proc.mkDocument(adj.getGloss))
})

futures.foreach(f => Await.ready(f, Duration.Inf))
Depending on the amount of work for each element in allSynsets and the number of elements in allSynsets (too many elements -> too many futures -> more overhead), you could get worse results with futures.
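If the per-element work is small, one way to cut that overhead is to batch the elements so each Future processes a chunk. A sketch reusing the imports above (the chunk size 100 is an arbitrary starting point to tune, and this assumes allSynsets yields a Scala collection, as your .par call suggests):

val batched = wn.allSynsets(POS.ADJECTIVE).grouped(100).toSeq.map { batch =>
  Future { batch.foreach(adj => parse(proc.mkDocument(adj.getGloss))) }
}
batched.foreach(f => Await.ready(f, Duration.Inf))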
To ensure that you are benchmarking correctly, consider using the inline benchmarking feature of ScalaMeter 0.5:
http://scalameter.github.io/home/gettingstarted/0.5/inline/index.html
You could also use actors to achieve this, but it would require a bit more plumbing.