I put together this dummy example to try to understand backpressure a little better:
Flowable.range(1, 100).onBackpressureDrop()
.subscribeOn(Schedulers.io())
.observeOn(AndroidSchedulers.mainThread())
.subscribeWith(object : DisposableSubscriber<Int>() {
override fun onStart() {
request(1)
}
override fun onComplete() {
Log.d(this@MainActivity::class.java.simpleName, "onComplete")
}
override fun onNext(t: Int?) {
Log.d(this@MainActivity::class.java.simpleName, t.toString())
Thread.sleep(1000)
request(1)
}
override fun onError(t: Throwable?) { /* handle error */ }
})
I have an extremely slow Subscriber that consumes data from a very fast Flowable, and I'm instructing the Flowable to onBackpressureDrop(). Despite this, my output looks like this (from 1 to 100):
07-16 23:07:21.097 22389-22389 D: 1
07-16 23:07:22.100 22389-22389 D: 2
07-16 23:07:23.102 22389-22389 D: 3
07-16 23:07:24.104 22389-22389 D: ...
07-16 23:07:24.104 22389-22389 D: ...
07-16 23:07:24.105 22389-22389 D: 99
07-16 23:07:25.105 22389-22389 D: 100
07-16 23:07:25.107 22389-22389 D: onComplete
I was expecting missing elements since the subscriber is extremely slow, but that isn't the case: all numbers from 1 to 100 are printed to the console, one every second.
Next, I tried to request all values at once. I replaced request(1) in onStart with request(Long.MAX_VALUE) and removed request(1) from the onNext call, but it still prints the numbers 1 to 100 with no missing elements.
So how can I simulate a slow subscriber missing events?
How can I make a backpressure exception happen?
Thanks
observeOn has a default internal buffer size of 128, which is why you don't see any elements dropped: it can simply buffer all 100 elements you are generating. You can set the buffer size to 1 via observeOn(mainThread(), false, 1) and then observe drops.
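The effect of the buffer size can be illustrated with a toy model in plain Scala. This is not RxJava code, and emitBurst and its parameters are hypothetical names. It assumes the source emits everything in one burst before the slow consumer requests anything further, and that an item is dropped exactly when there is no outstanding demand.

```scala
object DropDemo {
  // Toy model, not RxJava: `prefetch` stands in for the observeOn buffer size,
  // i.e. the demand that is signalled upstream before the consumer gets going.
  // Returns the items that got through and the number of dropped items.
  def emitBurst(count: Int, prefetch: Int): (Vector[Int], Int) = {
    var demand = prefetch
    var dropped = 0
    val delivered = Vector.newBuilder[Int]
    for (i <- 1 to count) {
      if (demand > 0) {
        delivered += i // there is outstanding demand, so the item passes
        demand -= 1
      } else {
        dropped += 1   // no demand left, so the item is discarded
      }
    }
    (delivered.result(), dropped)
  }
}
```

Under these assumptions, DropDemo.emitBurst(100, 128) lets all 100 items through with nothing dropped, while DropDemo.emitBurst(100, 1) lets only the first item through and drops the other 99.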
Let's say I have three remote calls in order to construct my page. One of them (X) is critical for the page and the other two (A, B) just used to enhance the experience.
Because criticalFutureX is too important to be affected by futureA and futureB, I want the overall latency of all remote calls to be no more than that of X.
That means that once criticalFutureX finishes, I want to discard futureA and futureB.
val criticalFutureX = ...
val futureA = ...
val futureB = ...
// the overall latency of this for-comprehension depends on the longest among X, A and B
for {
x <- criticalFutureX
a <- futureA
b <- futureB
} ...
In the above example, even though they are executed in parallel, the overall latency depends on the longest among X, A and B, which is not what I want.
Latencies:
X: |----------|
A: |---------------|
B: |---|
O: |---------------| (overall latency)
There is firstCompletedOf, but it cannot be used to explicitly say "when criticalFutureX has completed".
Is there something like the following?
val criticalFutureX = ...
val futureA = ...
val futureB = ...
for {
x <- criticalFutureX
a <- futureA // discard when criticalFutureX finished
b <- futureB // discard when criticalFutureX finished
} ...
X: |----------|
A: |-----------... discarded
B: |---|
O: |----------| (overall latency)
You can achieve this with a Promise:
def completeOnMain[A, B](main: Future[A], secondary: Future[B]) = {
val promise = Promise[Option[B]]()
main.onComplete {
case Failure(_) =>
case Success(_) => promise.trySuccess(None)
}
secondary.onComplete {
case Failure(exception) => promise.tryFailure(exception)
case Success(value) => promise.trySuccess(Option(value))
}
promise.future
}
Some testing code
private def runFor(first: Int, second: Int) = {
def run(millis: Int) = Future {
Thread.sleep(millis);
millis
}
val start = System.currentTimeMillis()
val combined = for {
_ <- Future.unit
f1 = run(first)
f2 = completeOnMain(f1, run(second))
r1 <- f1
r2 <- f2
} yield (r1, r2)
val result = Await.result(combined, 10.seconds)
println(s"It took: ${System.currentTimeMillis() - start}: $result")
}
runFor(3000, 4000)
runFor(3000, 1000)
Produces
It took: 3131: (3000,None)
It took: 3001: (3000,Some(1000))
This kind of task is very hard to achieve efficiently, reliably and safely with Scala standard library Futures. There is no way to interrupt a Future that hasn't completed yet, meaning that even if you choose to ignore its result, it will still keep running and waste memory and CPU time. And even if there was a method to interrupt a running Future, there is no way to ensure that resources that were allocated (network connections, open files etc.) will be properly released.
I would like to point out that the implementation given by Ivan Stanislavciuc has a bug: if the main Future fails, then the promise will never be completed, which is unlikely to be what you want.
I would therefore strongly suggest looking into modern concurrent effect systems like ZIO or cats-effect. These are not only safer and faster, but also much easier. Here's an implementation with ZIO that doesn't have this bug:
import zio.{Exit, Task}
import Function.tupled
def completeOnMain[A, B](
main: Task[A], secondary: Task[B]): Task[(A, Exit[Throwable, B])] =
(main.forkManaged zip secondary.forkManaged).use {
tupled(_.join zip _.interrupt)
}
Exit is a type that describes how the secondary task ended, i.e. by successfully returning a B, because of an error (of type Throwable), or due to interruption.
Note that this function can be given a much more sophisticated signature that tells you a lot more about what's going on, but I wanted to keep it simple here.
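For readers staying with standard library Futures, the missing failure case pointed out above can be patched in the promise-based helper. The following is only a sketch of one possible choice: it assumes that a failure of main should fail the combined result too (completing with None instead would be an equally defensible design). It does not address the other concern, since the secondary Future still keeps running after it has been ignored.

```scala
import scala.concurrent.{Await, Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.util.{Failure, Success}

object CompleteOnMainFixed {
  // Variant of the promise-based helper in which the promise is completed
  // in every case, including a failing `main`.
  def completeOnMain[A, B](main: Future[A], secondary: Future[B]): Future[Option[B]] = {
    val promise = Promise[Option[B]]()
    main.onComplete {
      // Assumption: propagate main's failure rather than leave the promise pending
      case Failure(exception) => promise.tryFailure(exception)
      case Success(_)         => promise.trySuccess(None)
    }
    secondary.onComplete {
      case Failure(exception) => promise.tryFailure(exception)
      case Success(value)     => promise.trySuccess(Option(value))
    }
    promise.future
  }
}
```

With this variant, a failed main paired with a secondary that never finishes yields a failed Future instead of one that stays pending forever.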
I am just starting out with Apache Flink using Scala. Can someone please tell me how to create a lagged stream (lagged by k events or k units of time) from a datastream that I already have?
Basically, I want to implement an auto regression model (Linear regression on the stream with the time lagged version of itself) on a data-stream. So, a method is needed something similar to the following pseudo code.
val ds : DataStream = ...
val laggedDS : DataStream = ds.map(lag _)
def lag(ds : DataStream, k : Time) : DataStream = {
}
I expect the sample input and output like this if every event is spaced at 1 second interval and there is a 2 second lag.
Input : 1, 2, 3, 4, 5, 6, 7...
Output: NA, NA, 1, 2, 3, 4, 5...
Assuming I understood your requirements correctly, I would implement this as a FlatMapFunction with a FIFO queue. The queue buffers k events and emits the head whenever a new event arrives. If you need a fault-tolerant streaming application, the queue must be registered as state. Flink will then take care of checkpointing the state (i.e., the queue) and restoring it in case of a failure.
The FlatMapFunction could look like this:
class Lagger(val k: Int)
extends FlatMapFunction[X, X]
with Checkpointed[mutable.Queue[X]]
{
var fifo: mutable.Queue[X] = new mutable.Queue[X]()
override def flatMap(value: X, out: Collector[X]): Unit = {
// add new element to queue
fifo.enqueue(value)
if (fifo.size == k + 1) {
// remove head element and emit
out.collect(fifo.dequeue())
}
}
// restore state
override def restoreState(state: mutable.Queue[X]) = { fifo = state }
// get state to checkpoint
override def snapshotState(cId: Long, cTS: Long): mutable.Queue[X] = fifo
}
Returning elements with a time lag is more involved. This would require timer threads for the emission because the function is only called when a new element arrives.
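The count-based queue logic can be tried outside Flink on a plain collection. This is a minimal sketch with hypothetical names (LagSketch, lag), not part of any Flink API. Unlike the FlatMapFunction above, which emits nothing for the first k events, it emits None for them so that the result lines up with the NA entries in the question's expected output.

```scala
import scala.collection.mutable

object LagSketch {
  // Buffers elements in a FIFO queue and starts emitting the head
  // once the queue holds k + 1 elements.
  def lag[T](input: Seq[T], k: Int): Vector[Option[T]] = {
    val fifo = mutable.Queue.empty[T]
    val out = Vector.newBuilder[Option[T]]
    for (value <- input) {
      fifo.enqueue(value)
      if (fifo.size == k + 1) out += Some(fifo.dequeue()) // lagged element
      else out += None                                    // queue still filling up
    }
    out.result()
  }
}
```

For example, LagSketch.lag(1 to 7, 2) produces None, None, Some(1), Some(2), Some(3), Some(4), Some(5).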
I have some code that is not performance-sensitive and was trying to make stacks easier to follow by using fewer futures. This resulted in some code similar to the following:
val fut = Future {
val r = Future.traverse(ips) { ip =>
val httpResponse: Future[HttpResponse] = asyncHttpClient.exec(req)
httpResponse.andThen {
case x => logger.info(s"received response here: $x")
}
httpResponse.map(r => (ip, r))
}
r.andThen { case x => logger.info(s"final result: $x") }
Await.result(r, 10 seconds)
}
fut.andThen { case x => logger.info(s"finished $x") }
logger.info("here nonblocking")
As expected, internal logging in the http client shows that the response returns immediately, but the callbacks executing logger.info(s"received response here: $x") and logger.info(s"final result: $x") do not execute until after Await.result(r, 10 seconds) times out. Looking at the log output, which includes thread ids, the callbacks are being executed in the same thread (ForkJoinPool-1-worker-3) that is awaiting the result, creating a deadlock. It was my understanding that ExecutionContext.global would create extra threads on demand when it ran out of threads. Is this not the case? There appear to be only two threads from the global fork join pool that produce any output in the logs (1 and 3). Can anyone explain this?
As for fixes, I know perhaps the best way is to separate blocking and nonblocking work into different thread pools, but I was hoping to avoid this extra bookkeeping by using a dynamically sized thread pool. Is there a better solution?
If you want to grow the pool (temporarily) when threads are blocked, use concurrent.blocking. Here, you've used all the threads, doing i/o and then scheduling more work with map and andThen (the result of which you don't use).
More info: your "final result" is expected to execute after the traverse, so that is normal.
Example for blocking, although there must be a SO Q&A for it:
scala> import concurrent._ ; import ExecutionContext.Implicits._
scala> val is = 1 to 100 toList
scala> def db = s"${Thread.currentThread}"
db: String
scala> def f(i: Int) = Future { println(db) ; Thread.sleep(1000L) ; 2 * i }
f: (i: Int)scala.concurrent.Future[Int]
scala> Future.traverse(is)(f _)
Thread[ForkJoinPool-1-worker-13,5,main]
Thread[ForkJoinPool-1-worker-7,5,main]
Thread[ForkJoinPool-1-worker-9,5,main]
Thread[ForkJoinPool-1-worker-3,5,main]
Thread[ForkJoinPool-1-worker-5,5,main]
Thread[ForkJoinPool-1-worker-1,5,main]
Thread[ForkJoinPool-1-worker-15,5,main]
Thread[ForkJoinPool-1-worker-11,5,main]
res0: scala.concurrent.Future[List[Int]] = scala.concurrent.impl.Promise$DefaultPromise@3a4b0e5d
[etc, N at a time]
versus overly parallel:
scala> def f(i: Int) = Future { blocking { println(db) ; Thread.sleep(1000L) ; 2 * i }}
f: (i: Int)scala.concurrent.Future[Int]
scala> Future.traverse(is)(f _)
Thread[ForkJoinPool-1-worker-13,5,main]
Thread[ForkJoinPool-1-worker-3,5,main]
Thread[ForkJoinPool-1-worker-1,5,main]
res1: scala.concurrent.Future[List[Int]] = scala.concurrent.impl.Promise$DefaultPromise@759d81f3
Thread[ForkJoinPool-1-worker-7,5,main]
Thread[ForkJoinPool-1-worker-25,5,main]
Thread[ForkJoinPool-1-worker-29,5,main]
Thread[ForkJoinPool-1-worker-19,5,main]
scala> Thread[ForkJoinPool-1-worker-23,5,main]
Thread[ForkJoinPool-1-worker-27,5,main]
Thread[ForkJoinPool-1-worker-21,5,main]
Thread[ForkJoinPool-1-worker-31,5,main]
Thread[ForkJoinPool-1-worker-17,5,main]
Thread[ForkJoinPool-1-worker-49,5,main]
Thread[ForkJoinPool-1-worker-45,5,main]
Thread[ForkJoinPool-1-worker-59,5,main]
Thread[ForkJoinPool-1-worker-43,5,main]
Thread[ForkJoinPool-1-worker-57,5,main]
Thread[ForkJoinPool-1-worker-37,5,main]
Thread[ForkJoinPool-1-worker-51,5,main]
Thread[ForkJoinPool-1-worker-35,5,main]
Thread[ForkJoinPool-1-worker-53,5,main]
Thread[ForkJoinPool-1-worker-63,5,main]
Thread[ForkJoinPool-1-worker-47,5,main]
I want to know how an actor returns a value to the sender and how to store it in a variable.
For example, consider that we have to find the sum of squares of 2 numbers and print it.
i.e., sum = a^2 + b^2
I have 2 actors. 1 actor computes square of any number passed to it (say, SquareActor). The other actor sends the two numbers (a , b) to the SquareActor and computes their sum (say, SumActor)
/** Actor to find the square of a number */
class SquareActor (x: Int) extends Actor
{
def act()
{
react{
case x : Int => println (x * x)
// how to return the value of x*x to "SumActor" ?
}
}
}
/** Actor to find the sum of squares of a and b */
class SumActor (a: Int, b:Int) extends Actor
{
def act()
{
var a2 = 0
var b2 = 0
val squareActor = new SquareActor (a : Int)
squareActor.start
// call squareActor to get a*a
squareActor ! a
// How to get the value returned by SquareActor and store it in the variable 'a2' ?
// call squareActor to get b*b
squareActor ! b
// How to get the value returned by SquareActor and store it in the variable 'b2' ?
println ("Sum: " + (a2 + b2))
}
}
Pardon me if the above is not possible; I think my basic understanding of actors may itself be wrong.
Use Akka
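Note that since Scala 2.10, the Akka actor library has shipped with the Scala distribution as the default actor library, while the original scala.actors library used in the question has since been deprecated. Akka is generally considered superior to the original actor library, so getting familiar with it would benefit you.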
Use Futures
Also note that what you want to achieve is easier and nicer (composes better) using Futures. A Future[A] represents a possibly concurrent computation, eventually yielding a result of type A.
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
def asyncSquare(x: Int): Future[Int] = Future(x * x)
val sq1 = asyncSquare(2)
val sq2 = asyncSquare(3)
val asyncSum =
for {
a <- sq1
b <- sq2
}
yield (a + b)
Note that the asyncSquare results are queried in advance to start their (independent) computations as soon as possible. Putting the calls inside the for comprehension would have serialized their execution, not using the possible concurrency.
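The difference between starting the Futures up front and creating them inside the for comprehension can be made visible with a rough timing sketch. The names (EagerVsSequential, slowSquare, timed) and the 300 ms sleep are illustrative assumptions, and the measured times will vary by machine.

```scala
import scala.concurrent.{Await, Future, blocking}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object EagerVsSequential {
  // Artificially slow square; `blocking` lets the global pool add a thread if needed.
  def slowSquare(x: Int): Future[Int] =
    Future { blocking { Thread.sleep(300) }; x * x }

  // Both Futures are created, and therefore started, before the for comprehension.
  def eager(): Future[Int] = {
    val sq1 = slowSquare(2)
    val sq2 = slowSquare(3)
    for { a <- sq1; b <- sq2 } yield a + b
  }

  // The second Future is only created once the first one has completed.
  def sequential(): Future[Int] =
    for { a <- slowSquare(2); b <- slowSquare(3) } yield a + b

  // Waits for the result and returns it together with the elapsed milliseconds.
  def timed[A](body: => Future[A]): (A, Long) = {
    val start = System.nanoTime()
    val result = Await.result(body, 5.seconds)
    (result, (System.nanoTime() - start) / 1000000)
  }
}
```

Calling EagerVsSequential.timed(EagerVsSequential.eager()) and the same for sequential() should give the same sum in both cases, with the sequential variant taking at least the two sleeps back to back and the eager one typically finishing in roughly the time of a single sleep.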
You use Future-s in for comprehensions, map, flatMap, zip, sequence them, and in the very end, you can get the computed value using Await, which is a blocking operation, or using registered callbacks.
Use Futures with actors
It is handy that you can ask from actors, which results in a Future:
val futureResult: Future[Int] = (someActor ? 5).mapTo[Int]
Note the need for mapTo, because the message passing interface of actors is not typed (there are, however, typed actors).
Bottom line
If you want to perform stateless computations in parallel, stick to plain Futures. If you need stateful but local computations, you can still use Future and thread the state yourself (or use the scalaz StateT monad transformer with Future as the monad, if you are into that sort of thing). If you need computations that require global state, then isolate that state in an actor and interact with that actor, possibly using Futures.
Remember that actors work by message passing. So to get the response from the SquareActor back to the SumActor, you need to send it as a message from the SquareActor, and add a handler to the SumActor.
Also, your SquareActor constructor doesn't need an integer parameter.
That is, in your SquareActor, instead of just printing x * x, pass it to the SumActor:
class SquareActor extends Actor
{
def act()
{
react{
case x : Int => sender ! (x * x)
}
}
}
(sender causes it to send the message to the actor that sent the message it is reacting to.)
In your SumActor, after you send a and b to the SquareActor, handle the received reply messages:
react {
case a2 : Int => react {
case b2 : Int => println ("Sum: " + (a2+b2))
}
}
In the code below I create 5 threads and have each of them print a message, sleep, and print another message. I start the threads in my main thread and then join all of them as well. I would expect the "all done" message to be printed only after all of the threads have finished. Yet "all done" gets printed before all the threads are done. Can someone help me understand this behavior?
Thanks.
Kent
Here is the code:
def ttest() = {
val threads =
for (i <- 1 to 5)
yield new Thread() {
override def run() {
println("going to sleep")
Thread.sleep(1000)
println("awake now")
}
}
threads.foreach(t => t.start())
threads.foreach(t => t.join())
println("all done")
}
Here is the output:
going to sleep
all done
going to sleep
going to sleep
going to sleep
going to sleep
awake now
awake now
awake now
awake now
awake now
It works if you transform the Range into a List:
def ttest() = {
val threads =
for (i <- (1 to 5).toList)
yield new Thread() {
override def run() {
println("going to sleep")
Thread.sleep(1000)
println("awake now")
}
}
threads.foreach(t => t.start())
threads.foreach(t => t.join())
println("all done")
}
The problem is that "1 to 5" is a Range, and ranges are not "strict", so to speak. In good English, when you call the method map on a Range, it does not compute each value right then. Instead, it produces an object -- a RandomAccessSeq.Projection on Scala 2.7 -- which has a reference to the function passed to map and another to the original range. Thus, when you use an element of the resulting range, the function you passed to map is applied to the corresponding element of the original range. And this will happen each and every time you access any element of the resulting range.
This means that each time you refer to an element of threads, you are calling new Thread() { ... } anew. Since you traverse it twice, and the range has 5 elements, you are creating 10 threads. You start the first 5, and join the second 5.
If this is confusing, look at the example below:
scala> object test {
| val t = for (i <- 1 to 5) yield { println("Called again! "+i); i }
| }
defined module test
scala> test.t
Called again! 1
Called again! 2
Called again! 3
Called again! 4
Called again! 5
res4: scala.collection.generic.VectorView[Int,Vector[_]] = RangeM(1, 2, 3, 4, 5)
scala> test.t
Called again! 1
Called again! 2
Called again! 3
Called again! 4
Called again! 5
res5: scala.collection.generic.VectorView[Int,Vector[_]] = RangeM(1, 2, 3, 4, 5)
Each time I print t (by having Scala REPL print res4 and res5), the yielded expression gets evaluated again. It happens for individual elements too:
scala> test.t(1)
Called again! 2
res6: Int = 2
scala> test.t(1)
Called again! 2
res7: Int = 2
EDIT
As of Scala 2.8, Range will be strict, so the code in the question will work as originally expected.
In your code, threads is deferred - each time you iterate it, the for generator expression is run anew. Thus, you actually create 10 threads there - the first foreach creates 5 and starts them, the second foreach creates 5 more (which are not started) and joins them - since they aren't running, join returns immediately. You should use toList on the result of for to make a stable snapshot.
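Since Range is strict in current Scala versions, the old behaviour no longer shows up with the question's code as written. It can still be reproduced on purpose with a view, which is an assumption of this sketch rather than what the original code did. The names LazyVsStrict and countCreations are hypothetical.

```scala
object LazyVsStrict {
  // Counts how many objects the mapping function creates when the
  // resulting collection is traversed twice (like start, then join).
  def countCreations(strict: Boolean): Int = {
    var created = 0
    // A mapped view re-runs the function on every traversal.
    val view = (1 to 5).view.map { _ => created += 1; new Object }
    // toList evaluates the view once and keeps a stable snapshot.
    val items: Iterable[Object] = if (strict) view.toList else view
    items.foreach(_ => ()) // first traversal
    items.foreach(_ => ()) // second traversal
    created
  }
}
```

Here LazyVsStrict.countCreations(strict = false) reports 10 created objects, mirroring the 10 threads described above, while LazyVsStrict.countCreations(strict = true) reports 5.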