I am learning scalaz-stream at the moment, and I am confused about why repeatEval only evaluates Task.async once.
val result = Process
  .repeatEval(Task.async[Unit](t => {
    val result = scala.io.Source.fromURL("http://someUrl").mkString
    println(".......")
    println(result)
  }))
result.runLog.run // only prints once
However, if I change Task.async to Task.delay, it evaluates the function infinitely. I don't know why that is.
val result = Process
  .repeatEval(Task.delay({
    val result = scala.io.Source.fromURL("http://someUrl").mkString
    println(".......")
    println(result)
  }))
result.runLog.run // prints infinitely
Many thanks in advance
As I mention in my answer to your recent question about Task, Task.async takes a function that registers callbacks—not some code that should be executed asynchronously. In the case of the other question, you actually want Task.async, since you're interoperating with a callback-based API.
Here it seems like you probably want Task.apply, not Task.delay. The two look similar, but delay simply suspends the computation—it doesn't use an ExecutorService to run it in a separate thread. You can see this in the following example:
import scalaz._, Scalaz._, concurrent._
val delayTask = Task.delay(Thread.sleep(5000))
val applyTask = Task(Thread.sleep(5000))
Nondeterminism[Task].both(delayTask, delayTask).run
Nondeterminism[Task].both(applyTask, applyTask).run
The delayTask version will take longer.
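To make the original snippet repeat, a minimal sketch along these lines should work (the URL is the placeholder from the question):
import scalaz.{-\/, \/-}
import scalaz.concurrent.Task
import scalaz.stream.Process

// Task.apply runs the body on the default executor each time the stream pulls a value,
// so repeatEval keeps producing elements.
val fetch: Task[String] = Task {
  scala.io.Source.fromURL("http://someUrl").mkString
}

Process.repeatEval(fetch).take(3).runLog.run // evaluates the task three times

// If you really are wrapping a callback-based API, the register function passed to
// Task.async must eventually invoke the callback, otherwise the task never completes:
val asyncFetch: Task[String] = Task.async[String] { cb =>
  try {
    val body = scala.io.Source.fromURL("http://someUrl").mkString
    cb(\/-(body))
  } catch {
    case e: Exception => cb(-\/(e))
  }
}
In your original Task.async version the callback t is never invoked, so the first (and only) evaluation never completes and the stream cannot move on.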
I have a use case in Databricks where an API call has to be made for each URL in a dataset. The dataset has around 100K records.
The maximum allowed concurrency is 3.
I did the implementation in Scala and ran it in a Databricks notebook. Apart from the one element left pending in the queue, I feel something is missing here.
Are the BlockingQueue and thread pool the right way to tackle this problem?
In the code below, instead of reading from the dataset, I am sampling on a Seq.
Any help/thoughts will be much appreciated.
import java.time.LocalDateTime
import java.util.concurrent.{ArrayBlockingQueue,BlockingQueue}
import java.util.concurrent.Executors
import java.util.concurrent.TimeUnit;
var inpQueue:BlockingQueue[(Int, String)] = new ArrayBlockingQueue[(Int, String)](1)
val inpDS = Seq((1,"https://google.com/2X6barD"), (2,"https://google.com/3d9vCgW"), (3,"https://google.com/2M02Xz0"), (4,"https://google.com/2XOu2uL"), (5,"https://google.com/2AfBWF0"), (6,"https://google.com/36AEKsw"), (7,"https://google.com/3enBxz7"), (8,"https://google.com/36ABq0x"), (9,"https://google.com/2XBjmiF"), (10,"https://google.com/36Emlen"))
val pool = Executors.newFixedThreadPool(3)
var i = 0
inpDS.foreach {
  ix => {
    inpQueue.put(ix)
    val t = new ConsumerAPIThread()
    t.setName("MyThread-" + i + " ")
    pool.execute(t)
  }
  i = i + 1
}
println("Final Queue Size = " +inpQueue.size+"\n")
class ConsumerAPIThread() extends Thread {
  var name = ""

  override def run() {
    val urlDetail = inpQueue.take()
    print(this.getName() + " " + Thread.currentThread().getName() + " popped " + urlDetail + " Queue Size " + inpQueue.size + " \n")
    triggerAPI((urlDetail._1, urlDetail._2))
  }

  def triggerAPI(params: (Int, String)) {
    try {
      val result = scala.io.Source.fromURL(params._2)
      println("" + result)
    } catch {
      case ex: Exception => {
        println("Exception caught")
      }
    }
  }

  def ConsumerAPIThread(s: String) {
    name = s;
  }
}
So, you have two requirements: the functional one is that you want to process the items in a list asynchronously; the non-functional one is that you don't want to process more than three items at once.
Regarding the latter, the nice thing is that, as you have already shown in your question, Java natively exposes a nicely packaged Executor that runs tasks on a fixed-size thread pool, elegantly allowing you to cap the concurrency level when working with threads.
Moving to the functional requirement, Scala helps by having something that does precisely that as part of its standard API. In particular it uses scala.concurrent.Future, so in order to use it we'll have to reframe triggerAPI in terms of Future. The content of the function is not particularly relevant, so we'll mostly focus on its (revised) signature for now:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext
def triggerAPI(params: (Int, String))(implicit ec: ExecutionContext): Future[Unit] =
  Future {
    // some code that takes some time to run...
  }
Notice that now triggerAPI returns a Future. A Future can be thought of as a read-handle to something that is going to be computed eventually. In particular, this is a Future[Unit], where Unit stands for "we don't particularly care about the output of this function, but mostly about its side effects".
Furthermore, notice that the method now takes an implicit parameter, namely an ExecutionContext. The ExecutionContext is used to provide Futures with some form of environment where the computation happens. Scala has an API to create an ExecutionContext from a java.util.concurrent.ExecutorService, so this will come in handy to run our computation on the fixed thread pool, running no more than three callbacks at any given time.
Before moving forward, if you have questions about Futures, ExecutionContexts and implicit parameters, the Scala documentation is your best source of knowledge (here are a couple of pointers: 1, 2).
Now that we have the new triggerAPI method, we can use Future.traverse (here is the documentation for Scala 2.12 -- the latest version at the time of writing is 2.13 but to the best of my knowledge Spark users are stuck on 2.12 for the time being).
The tl;dr of Future.traverse is that it takes some form of container and a function that takes the items in that container and returns a Future of something else. The function will be applied to each item in the container and the result will be a Future of the container of the results. In your case: the container is a List, the items are (Int, String) and the something else you return is a Unit.
This means that you can simply call it like this:
Future.traverse(inpDS)(triggerAPI)
And triggerAPI will be applied to each item in inpDS.
By making sure that the execution context backed by the thread pool is in the implicit scope when calling Future.traverse, the items will be processed with the desired thread pool.
The result of the call is Future[List[Unit]], which is not very interesting and can simply be discarded (as you are only interested in the side effects).
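If the notebook needs to block until everything has finished, rather than fire and forget, one possible sketch is to await the aggregated Future:
import scala.concurrent.Await
import scala.concurrent.duration.Duration

// Blocks the calling thread until every triggerAPI Future has completed.
Await.result(Future.traverse(inpDS)(triggerAPI), Duration.Inf)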
That was a lot of talk; if you want to play around with the code I described, you can do so here on Scastie.
For reference, this is the whole implementation:
import java.util.concurrent.{ExecutorService, Executors}
import scala.concurrent.duration.DurationLong
import scala.concurrent.Future
import scala.concurrent.{ExecutionContext, ExecutionContextExecutorService}
import scala.util.control.NonFatal
import scala.util.{Failure, Success, Try}
val datasets = List(
  (1, "https://google.com/2X6barD"),
  (2, "https://google.com/3d9vCgW"),
  (3, "https://google.com/2M02Xz0"),
  (4, "https://google.com/2XOu2uL"),
  (5, "https://google.com/2AfBWF0"),
  (6, "https://google.com/36AEKsw"),
  (7, "https://google.com/3enBxz7"),
  (8, "https://google.com/36ABq0x"),
  (9, "https://google.com/2XBjmiF")
)
val executor: ExecutorService = Executors.newFixedThreadPool(3)
implicit val executionContext: ExecutionContextExecutorService = ExecutionContext.fromExecutorService(executor)
def triggerAPI(params: (Int, String))(implicit ec: ExecutionContext): Future[Unit] =
  Future {
    val (index, _) = params
    println(s"+ started processing $index")
    val start = System.nanoTime() / 1000000
    Iterator.from(0).map(_ + 1).drop(100000000).take(1).toList.head // a noticeably slow operation
    val end = System.nanoTime() / 1000000
    val duration = (end - start).millis
    println(s"- finished processing $index after $duration")
  }

Future.traverse(datasets)(triggerAPI).onComplete {
  case result =>
    println("* processing is over, shutting down the executor")
    executionContext.shutdown()
}
You need to shut down the Executor after your job is done, otherwise it will keep waiting.
Try adding pool.shutdown() at the end of your program.
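For example, a minimal sketch reusing the pool from the question:
import java.util.concurrent.TimeUnit

pool.shutdown()                              // stop accepting new tasks
pool.awaitTermination(10, TimeUnit.MINUTES)  // optionally block until running tasks finish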
I have a function that returns an fs2.Stream of measurements.
import cats.effect._
import fs2._
def apply(sds: SerialPort, interval: Int)(implicit cs: ContextShift[IO]): Stream[IO, SdsMeasurement] =
  for {
    blocker <- Stream.resource(Blocker[IO])
    stream  <- io.readInputStream(IO(sds.getInputStream), 1, blocker)
                 .through(SdsStateMachine.collectMeasurements())
  } yield stream
Normally it is an infinite Stream, unless I pass it a test flag, in which case it should output one value and halt.
val infiniteSource: Stream[IO, SdsMeasurement] = ...
val source = if (isTest) infiniteSource.take(1) else infiniteSource
source.compile.drain
The infinite Stream works fine: it gives me all measurements indefinitely. The test Stream indeed gives me only the first measurement, nothing more. The problem I have is that the Stream does not return after this last measurement; it blocks forever. What am I doing wrong?
Note: I think I abstracted the essential code, but for more context, please take a look at my project: https://github.com/jkransen/fijnstof/blob/ZIO/src/main/scala/nl/kransen/fijnstof/Main.scala
The code you've presented here looks fine; I don't think the issue lies within that code. If it blocks, then presumably one of the underlying APIs blocks; for instance, it might be the close method of the InputStream. What I usually do in such situations is add log statements before and after every function call that might block.
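For instance, a small helper (my own sketch, not part of your project) makes it easy to bracket suspect calls with log output:
import cats.effect.IO

// Wraps an IO action with log lines so you can see exactly where execution stops.
def logged[A](label: String)(fa: IO[A]): IO[A] =
  for {
    _ <- IO(println(s"before $label"))
    a <- fa
    _ <- IO(println(s"after $label"))
  } yield a

// e.g. logged("getInputStream")(IO(sds.getInputStream))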
As a first attempt, I tried to use Await.result on the head of the Seq and then use the lazy #:: Stream constructor. However, it does not seem to work as well as expected: I haven't found a way to tell the scheduler to prioritize the order of the list, nor does the compiler recognize it as @tailrec.
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.Duration

implicit class SeqOfFuture[X](seq: Seq[Future[X]]) {
  lazy val stream: Stream[X] =
    if (seq.nonEmpty) Await.result(seq.head, Duration.Inf) #:: seq.tail.stream
    else Stream.empty
}
I am attempting this since Future.collect seems to wait until the whole strict Seq is available/ready in order to map/flatMap/transform it further. (And there are other computations I might want to start on the stream of intermediate results.)
(Proto)Example of usage:
val searches = [SearchParam1, SearchParam2..., SearchParam200]
// big queries that take a some 100ms each for ~20s total wait
val futureDbResult = searches.map(search => (quill)ctx.run { query(search) }).stream
// Stuff that should happen as results become available instead of blocking/waiting ~20 seconds before starting
val processedResults = futureDbResult.map(transform).filter(reduce)
// Log?
processedResults.map(result => log.info/log.trace)
//return lazy processedResults list or Future {processedResults}
???
As others have pointed out, you really should look into a real streaming library like fs2 or monix. I personally think monix is a good fit if you're interfacing with Future and only need it in a small part of your application. It has great APIs and documentation for this use-case.
Here's a small demo for your use-case:
import monix.eval.Task
import monix.execution.Scheduler.Implicits.global
import monix.reactive.Observable
import scala.concurrent.duration._
import scala.util.Random
// requires: libraryDependencies += "io.monix" %% "monix" % "3.0.0"
object Main {
  val searchParams = (1 to 200).map(n => s"Search $n")

  /**
   * Simulates a query. If your library returns a Future, you can wrap it with `Task.deferFuture`.
   */
  def search(param: String): Task[String] =
    Task(s"Result for $param").delayResult(Random.between(25, 250).milliseconds)

  val results: Task[List[String]] =
    Observable
      .fromIterable(searchParams)
      .mapParallelUnordered(parallelism = 4)(param => search(param))
      .mapEval { result =>
        Task(println(result)).map(_ => result) // print intermediate results as feedback
      }
      .toListL // collect results into List

  /**
   * If you aren't going all-in on monix, you probably run the stream into a Future with `results.runToFuture`.
   */
  def main(args: Array[String]): Unit = results.map(_ => ()).runSyncUnsafe()
}
You can think of Task as a lazy and more powerful Future. Observable is a (reactive) stream which will automatically apply back-pressure if downstream is slow. In this example only 4 queries will run in parallel; the others will wait until a "slot" becomes available.
Keep in mind that in those libraries side effects (like println) have to be wrapped in Task (or IO, depending on what you use).
You can run this example locally if you provide the monix-dependency and play around with it to get a feel for how it works.
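For instance, if your Quill call returns a Future, one way to bridge it into the Observable pipeline could look like the sketch below (runQuery is a hypothetical stand-in for your real call):
import monix.eval.Task
import scala.concurrent.Future

def runQuery(param: String): Future[String] = ??? // hypothetical: your ctx.run(...) call

def search(param: String): Task[String] =
  Task.deferFuture(runQuery(param)) // the Future is not created until the Task actually runs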
Basically I mean:
for(v <- Future(long time operation)) yield v*someOtherValue
This expression returns another Future, but the question is: is the v * someOtherValue operation lazy or not? Will this expression block on getting the value of Future(long time operation)?
Or it is like a chain of callbacks?
A short experiment can test this question.
import concurrent._
import concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object TheFuture {
  def main(args: Array[String]): Unit = {
    val fut = for (v <- Future { Thread.sleep(2000); 10 }) yield v * 10
    println("For loop is finished...")
    println(Await.ready(fut, Duration.Inf).value.get)
  }
}
If we run this, we see For loop is finished... almost immediately, and then two seconds later, we see the result. So the act of performing map or similar operations on a future is not blocking.
A map (or, equivalently, your for comprehension) on a Future is not lazy: it will be executed as soon as possible on another thread. However, since it runs on another thread, it isn't blocking, either.
If you want to do the definition and execution of the Future separately, then you have to use something like a Monix Task.
https://monix.io/api/3.0/monix/eval/Task.html
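A minimal sketch of that separation with monix (assuming monix 3.x on the classpath):
import monix.eval.Task
import monix.execution.Scheduler.Implicits.global

// Definition: nothing runs yet, this is only a description of the computation.
val task: Task[Int] = Task { Thread.sleep(2000); 10 }.map(_ * 10)

// Execution: only now do the sleep and the multiplication actually happen.
val future = task.runToFuture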
I have a collection vals: Iterable[T] and a long-running function without any relevant side effects: f: (T => Unit). Right now this is applied to vals in the obvious way:
vals.foreach(f)
I would like the calls to f to be done concurrently (within reasonable limits). Is there an obvious function somewhere in the Scala base library? Something like:
Concurrent.foreach(8 /* Number of threads. */)(vals, f)
While f is reasonably long running, it is short enough that I don't want the overhead of invoking a thread for each call, so I am looking for something based on a thread pool.
Many of the answers from 2009 still use the old scala.actors.Futures._, which is no longer available in newer versions of Scala. While Akka is the preferred way, a much more readable way is to just use parallel (.par) collections:
vals.foreach { v => f(v) }
becomes
vals.par.foreach { v => f(v) }
Alternatively, parMap can look more succinct, with the caveat that you need to remember the usual Scalaz imports. As usual, there's more than one way to do the same thing in Scala!
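If you also want to cap the number of worker threads (the question mentions 8), the parallel collection's task support can be swapped out; a sketch, assuming the Scala 2.12-era parallel collections API:
import java.util.concurrent.ForkJoinPool
import scala.collection.parallel.ForkJoinTaskSupport

val parVals = vals.par
parVals.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8)) // use at most 8 threads
parVals.foreach(f)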
Scalaz has parMap. You would use it as follows:
import scalaz.Scalaz._
import scalaz.concurrent.Strategy.Naive
This will equip every functor (including Iterable) with a parMap method, so you can just do:
vals.parMap(f)
You also get parFlatMap, parZipWith, etc.
I like the Futures answer. However, while it will execute concurrently, it will also return asynchronously, which is probably not what you want. The correct approach would be as follows:
import scala.actors.Futures._
vals map { x => future { f(x) } } foreach { _() }
I had some issues using scala.actors.Futures in Scala 2.8 (it was buggy when I checked). Using java libs directly worked for me, though:
final object Parallel {
  val cpus = java.lang.Runtime.getRuntime().availableProcessors

  import java.util.{Timer, TimerTask}

  def afterDelay(ms: Long)(op: => Unit) =
    new Timer().schedule(new TimerTask { override def run = op }, ms)

  def repeat(n: Int, f: Int => Unit) = {
    import java.util.concurrent._
    val e = Executors.newCachedThreadPool // newFixedThreadPool(cpus+1)
    (0 until n).foreach(i => e.execute(new Runnable { def run = f(i) }))
    e.shutdown
    e.awaitTermination(Long.MaxValue, TimeUnit.SECONDS)
  }
}
I'd use scala.actors.Futures:
vals.foreach(t => scala.actors.Futures.future(f(t)))
The latest release of Functional Java has some higher-order concurrency features that you can use.
import fjs.F._
import fj.control.parallel.Strategy._
import fj.control.parallel.ParModule._
import java.util.concurrent.Executors._
val pool = newCachedThreadPool
val par = parModule(executorStrategy[Unit](pool))
And then...
par.parMap(vals, f)
Remember to shutdown the pool.
You can use the Parallel Collections from the Scala standard library.
They're just like ordinary collections, but their operations run in parallel. You just need to put a par call before you invoke some collections operation.
import scala.collection._
val array = new Array[String](10000)
for (i <- (0 until 10000).par) array(i) = i.toString