I'm using the Monix Scheduler to execute some tasks periodically. But I don't know how to collect their results into some collection, rather than just executing them...
Let's say I have a scheduled task that returns a random number every time:
import monix.eval.Task
import monix.execution.Scheduler, monix.execution.schedulers.SchedulerService
import scala.concurrent.duration._
val task = Task { Math.random() }
implicit val io: SchedulerService = Scheduler.io()
task.map(_ + 2).map(println).executeOn(io).delayExecution(1.seconds).loopForever.runAsyncAndForget
In theory, I could create a mutable, concurrent list before the task executes, and inside task.map I could append each result to that list. But I've heard that mutable collections shared between threads are bad practice. Is there a nice way to collect all the scheduled Task results? What instrument should I use to achieve this in a proper, idiomatic Scala way, avoiding mutable collections?
The idiomatic way to collect repeated results with Monix is to use an Observable instead of a Task. It has methods such as zipMap to combine results with another Observable, and methods such as foldLeft to combine results with previous results of the same Observable.
Note this generally requires collecting all your Observables in one place instead of the fire-and-forget approach in your example. Ideally, you have exactly one runAsync in your entire program, in your main function.
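For instance, here is a minimal sketch assuming Monix 3.x; the one-second cadence and the take(10) cutoff are illustrative choices, not something from the question:

import monix.eval.Task
import monix.execution.Scheduler
import monix.reactive.Observable
import scala.concurrent.duration._

implicit val io: Scheduler = Scheduler.io()

// Emit a tick every second, compute one value per tick, and gather the
// first ten results into an immutable List, with no shared mutable state.
val results: Task[List[Double]] =
  Observable
    .intervalAtFixedRate(1.second)
    .map(_ => Math.random() + 2)
    .take(10)
    .toListL

// The single run call at the edge of the program:
results.runToFuture.foreach(println)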
This is the context:
There is an input event stream.
There are some methods to apply to the stream; each applies different logic to evaluate an event, labeling it a "good" or "bad" event.
An event can be a real "good" one only if it passes all the methods; otherwise it is a "bad" event.
There is an output event stream that carries the result for each event along with its eventID.
To solve this problem, I have two ideas:
We can apply each method sequentially to each event. But this is a kind of batch processing and doesn't exploit the advantages of stream processing; at the same time it takes Time(M1) + Time(M2) + Time(M3) + ... (where Mi is the i-th method), which may not be suitable for real-time processing.
We can pass the input stream to each method and run the methods in parallel; each method saves the bad events into permanent storage, and the Main method then queries the permanent storage for the result of each event. But this approach has some problems to solve:
how to execute the methods in parallel in the programming language (e.g. Scala), and what the performance implications are (network, CPUs, memory);
how to solve the synchronization problem: the methods need some time to calculate and save their flags into the permanent storage, but the Main method needs much less time to query the flags, so a delay issue occurs;
etc.
This is an open-ended tech and design question; I would like to hear your ideas. If you have new ideas or approaches to solve the problem, I'm looking forward to your opinions.
Using parallel streams, each doing the full set of evaluations sequentially, is the more straightforward solution. But if that introduces too much latency, then you can fan out the evaluations to be done in parallel, and then bring the results back together again to make a decision.
To do the fan-out, look at the split operation on DataStream, or use side outputs. But before doing this n-way fan-out, make sure that each event has a unique ID. If necessary, add a field containing a random number to each event to use as the unique ID. Later we will use this unique ID as a key to gather back together all of the partial results for each event.
Once the event stream is split, each copy of the stream can use a MapFunction to compute one of the evaluation methods.
Gathering all of these separate evaluations of a given event back together is a bit more complex. One reasonable approach is to union all of the result streams together, and then key the unioned stream by the unique ID described above. This will bring together all of the individual results for each event. Then you can use a RichFlatMapFunction (using Flink's keyed, managed state) to gather the results of the separate evaluations in one place. Once the full set of evaluations for a given event has arrived at this stateful flatMap operator, it can compute and emit the final result.
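A rough sketch of that gathering step, using Flink's Scala API; the Evaluation and Verdict case classes and the count of 3 methods are hypothetical names for illustration, not part of any Flink API:

import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

case class Evaluation(eventId: String, good: Boolean) // one method's partial result
case class Verdict(eventId: String, good: Boolean)    // the combined, final result

class Gather(expected: Int) extends RichFlatMapFunction[Evaluation, Verdict] {
  // (results seen so far, all good so far); Flink scopes this per key, i.e. per eventId
  private var partials: ValueState[(Int, Boolean)] = _

  override def open(parameters: Configuration): Unit =
    partials = getRuntimeContext.getState(
      new ValueStateDescriptor("partials", classOf[(Int, Boolean)]))

  override def flatMap(e: Evaluation, out: Collector[Verdict]): Unit = {
    val (seen, ok) = Option(partials.value()).getOrElse((0, true))
    val updated = (seen + 1, ok && e.good)
    if (updated._1 == expected) {
      // all evaluations for this event have arrived: emit and clean up the state
      out.collect(Verdict(e.eventId, updated._2))
      partials.clear()
    } else partials.update(updated)
  }
}

// evals1.union(evals2, evals3).keyBy(_.eventId).flatMap(new Gather(expected = 3))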
Slick has DBIO.seq and DBIO.sequence for running many DBIOActions whereby the results of previous actions aren't required for subsequent actions.
I've looked at the source, and it's not obvious to me if DBIO.seq runs the actions sequentially. On the other hand, DBIO.fold has a very simple implementation which definitely does run the actions sequentially, as it just uses flatMap internally.
My question is: will order be guaranteed when using seq and sequence, like it is with fold?
The documentation states that the actions in DBIO.seq are run sequentially:
The simplest combinator is DBIO.seq which takes a varargs list of actions to run in sequence
Also, in the source code for DBIO.seq, you will see that SynchronousDatabaseAction is called inside a foreach, which means each action is run sequentially and (internally) synchronously, with no parallelism.
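To make that concrete, here is a small sketch (Slick 3.x; the users table and the statements are hypothetical):

import slick.jdbc.H2Profile.api._
import scala.concurrent.ExecutionContext.Implicits.global

// DBIO.seq: each statement starts only after the previous one completes;
// all results are discarded and the combined action yields Unit.
val setup: DBIO[Unit] = DBIO.seq(
  sqlu"INSERT INTO users (name) VALUES ('alice')",
  sqlu"INSERT INTO users (name) VALUES ('bob')"
)

// DBIO.fold: also sequential (it is built on flatMap), but it threads the
// results through; here, summing the per-statement row counts in order.
val totalRows: DBIO[Int] = DBIO.fold(
  Seq(sqlu"UPDATE users SET active = 1 WHERE name = 'alice'",
      sqlu"UPDATE users SET active = 1 WHERE name = 'bob'"),
  0
)(_ + _)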
I'm wondering whether Futures are better used in conjunction with Actors only, rather than in a program that does not use Actors. Said differently, is performing asynchronous computation with a Future something better done within an Actor system?
Here is why I'm saying that:
1 -
You perform a computation whose result would trigger some action that you may do in another thread.
For instance, I have a long operation to determine the price of something. From my main thread, I decide to launch an asynchronous process for it. In the meantime I could be doing other things; then, when the response is ready/available or communicated back to me, I carry on down that path.
I can see that with actors this is handy, because you can pipe a result to an actor. But with a typical threading model, can you do anything other than block?
2 -
Another issue: let's say I need to update the ages of a list of participants by getting some information online. Let's assume I have just one future for that task. Isn't closing over the participant list the wrong thing to do? Multiple threads may be accessing that participant list at the same time, so making the update within the future would simply be wrong. In that case, we would need a Java concurrent collection, wouldn't we?
Maybe I see it the wrong way: futures are not meant to do side effects at all.
But in that case, fair enough: no side effects. Yet we still have the problem of getting a value back to the calling thread, which can only be done by blocking. I mean, let's imagine the result would help the calling thread update some data structure. How can that update be done asynchronously without somehow closing over that data structure?
I believe callbacks such as onComplete can be used for side-effecting (am I right here?)
Still, the callback would have to close over the data structure anyway. Hence I don't see how to avoid using an Actor.
PS: I like actors; I'm just trying to better understand the usage of futures without actors. I read everywhere that one should use actors only when necessary, that is, when state needs to be managed. It seems to me that overall, using futures without actors always involves blocking somewhere down the line, if the result needs to be communicated back at some point to the thread that initiated the asynchronous task.
Actors are good when you are dealing with mutable state, because they encapsulate it and allow only message-based interaction.
You can use a Future to execute work on a different thread. You don't have to block on a Future because Scala's Futures compose, so if you have multiple Futures in your code, you don't have to wait/block for all of them to complete. For example, if your pipeline is completely non-blocking or asynchronous (e.g., Play and Spray), you can return a Future back to the client.
Futures are lightweight compared to actors because you don't need a complete actor system.
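To illustrate composing rather than blocking, here is a sketch; fetchAge and the participant list are hypothetical:

import scala.concurrent.{ExecutionContext, Future}
import ExecutionContext.Implicits.global

def fetchAge(name: String): Future[Int] = Future { 42 /* e.g. a remote call */ }

val participants = List("ann", "bob", "carol")

// No shared mutable list: each lookup is composed into one Future of an
// immutable result, so nothing closes over state that other threads touch.
val updated: Future[List[(String, Int)]] =
  Future.traverse(participants)(p => fetchAge(p).map(p -> _))

// Side effects stay at the edge, in a callback, instead of blocking:
updated.foreach(println)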
Here is a quote from Martin Odersky that I really like.
There is no silver bullet for all concurrency issues; the right solution depends on what one needs to achieve. Do you want to define asynchronous computations that react to events or streams of values? Or have autonomous, isolated entities communicating via messages? Or define transactions over a mutable store? Or, maybe the primary purpose of parallel execution is to increase the performance? For each of these tasks, there is an abstraction that does the job: futures, reactive streams, actors, transactional memory, or parallel collections.
So choose your abstraction based on your use case and needs.
I have an Iterable of "work units" that need to be performed, in no particular order, and can easily run in parallel without interfering with one another.
Unfortunately, running too many of them at a time will exceed my available RAM, so I need to make sure that only a handful is running simultaneously at any given time.
At the most basic, I want a function of this type signature:
parMap[A, B](xs: Iterator[A], f: A => B, chunkSize: Int): Iterator[B]
such that the output Iterator is not necessarily in the same order as the input (if I want to maintain knowledge of where the result came from, I can output a pair with the input or something.) The consumer can then consume the resulting iterator incrementally without eating up all of the machine's memory, while maintaining as much parallelism as is possible for this task.
Furthermore, I want the function to be as efficient as possible. An initial idea I had was for example to do something along the following lines:
xs.iterator.grouped(chunkSize).flatMap(_.toSet.par.map(f).iterator)
where I was hoping the toSet would inform Scala's parallel collection that it could start producing elements from its iterator as soon as they were ready, in any order, and that the grouped call would limit the number of simultaneous workers. Unfortunately, the toSet call doesn't achieve the desired effect (in my experiments, the results are returned in the same order as they would have been without the par call), and the grouped call is suboptimal. For example, with a group size of 100, if 99 of those jobs complete immediately on a dozen cores but one is particularly slow, most of the remaining cores will be idle until we can move on to the next group. It would be much cleaner to have an "adaptive window" that is at most as big as my chunk size but doesn't get held up by slow workers.
I can envision writing something like this myself with a work-stealing (de)queue or something along those lines, but I imagine that a lot of the hard work of dealing with the concurrency primitives is already done for me at some level in Scala's parallel collections library. Does anyone know what parts of it I could reuse to build this piece of functionality, or have other suggestions on how to implement such an operation?
The Parallel collections framework allows you to specify the maximum number of threads to be used for a given task. Using scala-2.10, you'd want to do:
import scala.collection.parallel.ForkJoinTaskSupport

def parMap[A, B](x: Iterable[A], f: A => B, chunkSize: Int) = {
  val px = x.par
  // cap the underlying pool so at most chunkSize tasks run at once
  px.tasksupport = new ForkJoinTaskSupport(new scala.concurrent.forkjoin.ForkJoinPool(chunkSize))
  px map f
}
This will prevent more than chunkSize operations running at any one time. It uses a work-stealing strategy underneath to keep the threads busy, and so doesn't suffer from the same problem as your grouped example above.
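For example, using the parMap above (fetch is a hypothetical stand-in for the real work unit):

def fetch(url: String): String = { Thread.sleep(100); url.toUpperCase } // simulated work
// At most 4 invocations of fetch run at any one time:
val pages = parMap(List("u1", "u2", "u3", "u4", "u5"), fetch, 4)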
Doing it this way won't reorder the results into first-completed order, however. For that, I'd suggest something like turning your operation into an actor and having a small actor pool running the operations and then sending results back to you as they complete.
I want to use parallel arrays for a task, and before I start coding, I'd be interested to know whether this small snippet is thread-safe:
import collection.mutable._
var listBuffer = ListBuffer[String]("one","two","three","four","five","six","seven","eight","nine")
var jSyncList = java.util.Collections.synchronizedList(new java.util.ArrayList[String]())
listBuffer.par.foreach { e =>
println("processed :"+e)
// using sleep here to simulate a random delay
Thread.sleep((scala.math.random * 1000).toLong)
jSyncList.add(e)
}
jSyncList.toArray.foreach(println)
Are there better ways of processing something with parallel collections and accumulating the results elsewhere?
The code you posted is perfectly safe; I'm not sure about the premise though: why do you need to accumulate the results of a parallel collection in a non-parallel one? One of the whole points of the parallel collections is that they look like other collections.
Parallel collections also provide a seq method to switch to a sequential collection, so you should probably use that!
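For example, something along these lines, reusing the listBuffer from the question:

// Do the work in parallel, then switch back to a sequential view for
// ordered, single-threaded consumption; no synchronized list needed.
val processed = listBuffer.par.map("processed :" + _).seq
processed.foreach(println)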
For this pattern to be safe:
listBuffer.par.foreach { e => f(e) }
f has to be able to run concurrently in a safe way. I think the same rules you need for safe multi-threading apply: access to shared state needs to be thread-safe, the order of the f calls for different e won't be deterministic, and you may run into deadlocks as you start synchronizing your statements in f.
Additionally, I'm not clear what guarantees the parallel collections give you about the underlying collection being modified while it is being processed, so a mutable list buffer, which can have elements added or removed, is possibly a poor choice. You never know when the next coder will call something like foo(listBuffer) before your foreach and pass that reference to another thread, which may mutate the list while it's being processed.
Other than that, I think for any f that will take a long time, can be called concurrently and where e can be processed out of order, this is a fine pattern.
immutCol.par.foreach { e => threadSafeOutOfOrderProcessingOf(e) }
disclaimer: I have not tried parallel collections myself, but I'm looking forward to having SO questions/answers show us what works well.
The synchronized list should be safe, though the printlns may give unexpected results - you have no guarantees of the order in which items will be printed, or even that your printlns won't be interleaved mid-character.
A synchronised list is also unlikely to be the fastest way you can do this; a safer solution is to map over an immutable collection (Vector is probably your best bet here), then print all the lines (in order) afterwards:
val input = Vector("one","two","three","four","five","six","seven","eight","nine")
val output = input.par.map { e =>
val msg = "processed :" + e
// using sleep here to simulate a random delay
Thread.sleep((math.random * 1000).toLong)
msg
}
println(output mkString "\n")
You'll also note that this code has about as much practical usefulness as your example :)
This code is plain weird -- why add stuff in parallel to something that needs to be synchronized? You'll add contention and gain absolutely nothing in return.
The principle of the thing -- accumulating results from parallel processing -- is better achieved with combinators like fold, reduce, or aggregate.
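For instance, here is a sketch with aggregate, which folds within each partition and then merges the partial results, so there is no lock to contend on:

val input = Vector("one", "two", "three", "four").par

// seqop folds within a partition; combop merges partitions -- no shared
// mutable state, no synchronization, and the contention disappears.
val processed: List[String] = input.aggregate(List.empty[String])(
  (acc, e) => ("processed :" + e) :: acc,
  (a, b) => a ::: b
)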
The code you've posted is safe - there will be no errors due to inconsistent state of your array list, because access to it is synchronized.
However, parallel collections process items concurrently (at the same time) AND out of order. Out of order means that the 54th element may be processed before the 2nd element, so your synchronized array list will contain items in no predefined order.
In general it's better to use map, filter and other functional combinators to transform a collection into another collection - these will ensure that the ordering guarantees are preserved if a collection has some (like Seqs do). For example:
ParArray(1, 2, 3, 4).map(_ + 1)
always returns ParArray(2, 3, 4, 5).
However, if you need a specific thread-safe collection type such as a ConcurrentSkipListMap or a synchronized collection to be passed to some method in some API, modifying it from a parallel foreach is safe.
Finally, a note: parallel collections provide parallel bulk operations on data. Mutable parallel collections are not thread-safe in the sense that you could add elements to them from different threads; mutating operations like inserting into a map or appending to a buffer still have to be synchronized.
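A short sketch of the distinction; the unsynchronized append below is a genuine data race, while the bulk map is safe:

import scala.collection.mutable.ListBuffer

val unsafe = ListBuffer.empty[Int]
(1 to 10000).par.foreach(unsafe += _) // data race: may drop elements or throw
println(unsafe.size)                  // frequently less than 10000

// A bulk operation keeps the parallelism inside the framework instead:
val safe = (1 to 10000).par.map(_ * 2).seq
println(safe.size)                    // always 10000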