For some reason, my Akka streams always wait for a second message before "emitting"(?) the first.
Here is some example code that demonstrates my problem.
val rx = Source((1 to 100).toStream.map { t =>
  Thread.sleep(1000)
  println(s"doing $t")
  t
})
rx.runForeach(println)
yields output:
doing 1
doing 2
1
doing 3
2
doing 4
3
doing 5
4
doing 6
5
...
What I want:
doing 1
1
doing 2
2
doing 3
3
doing 4
4
doing 5
5
doing 6
6
...
The way your code is set up now, you are completely transforming the Source before it is allowed to start emitting elements downstream. You can clearly see that behavior (as #slouc stated) by removing the toStream on the range of numbers that represents the source. If you do that, you will see the Source get completely transformed first, before it starts responding to downstream demand. If you actually want to run a Source into a Sink and have a transformation step in the middle, then you can structure things like this:
val transform =
  Flow[Int].map { t =>
    Thread.sleep(1000)
    println(s"doing $t")
    t
  }

Source((1 to 100).toStream)
  .via(transform)
  .to(Sink.foreach(println))
  .run()
If you make that change, then you will get the desired effect: an element flowing downstream gets processed all the way through the flow before the next element starts to be processed.
You are using .toStream, which means that the whole collection is lazy. Without it, your output would be a hundred "doing"s first, followed by the numbers from 1 to 100. However, a Stream eagerly evaluates only its first element, which gives the "doing 1" output, and that is where it stops; the next element will be evaluated only when needed.
Now, I couldn't find any details on this in the docs, but I presume that runForeach has an implementation that pulls the next element before invoking the function on the current one. So before calling println on element n, it first examines element n+1 (e.g. checks whether it exists), which results in the "doing n+1" message. Only then does it apply your println function to the current element, which results in the message "n".
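As a side note, you can see the Stream part of this behaviour without Akka at all. A small sketch (plain Scala 2.12 Stream, nothing else assumed):

// no Akka involved: a mapped Stream evaluates its first element eagerly
val s = (1 to 100).toStream.map { t =>
  Thread.sleep(1000)
  println(s"doing $t")
  t
}
// "doing 1" is printed right here, at construction time

s.take(3).foreach(println)
// prints: 1, doing 2, 2, doing 3, 3 (strictly one element at a time)

The interleaving here is strict, which suggests the extra look-ahead in your output comes from the Akka stage demanding the next element ahead of the println, not from the Stream itself.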
Do you really need to map() before you runForeach? I mean, do you need two traversals through the data? I know I'm probably stating the obvious, but if you just process your data in one go, like this:
val rx = Source((1 to 100).toStream)

rx.runForeach { t =>
  Thread.sleep(1000)
  println(s"doing $t")
  // do something with 't', which is now equal to what "doing" says
}
then you don't have a problem of what's evaluated when.
I am trying to use Akka Streams to accumulate data and use it as a batch:
val myFlow: Flow[String, Unit, NotUsed] = Flow[String]
  .collect { case record =>
    println(record)
    Future(record)
  }
  .mapAsync(1)(x => x)
  .groupedWithin(3, 30.seconds)
  .mapAsync(10)(records => someBatchOperation(records))
My expectation from the code above was that it would not perform any operation until 3 records are ready or 30 seconds have passed. But when I send a request with Source.single("test"), it processes this record without waiting for others or for 30 seconds.
How can I use this flow to wait for other records to come, or for 30 seconds of idle time?
Records come in from API requests one by one, and I am trying to accumulate this data in the flow like:
Source.single(apiRecord).via(myFlow).runWith(Sink.ignore)
It actually does that. Let's consider the following:
Source(Stream.from(1))
  .throttle(1, 400.millis)
  .groupedWithin(3, 1.second)
  .runWith(Sink.foreach(i => println(s"Done with ${i} ${System.currentTimeMillis}")))
The output of that line, until I killed the process, was:
Done with Vector(1, 2, 3) 1599495716345
Done with Vector(4, 5) 1599495717348
Done with Vector(6, 7, 8) 1599495718330
Done with Vector(9, 10) 1599495719350
Done with Vector(11, 12, 13) 1599495720330
Done with Vector(14, 15) 1599495721350
Done with Vector(16, 17, 18) 1599495722328
Done with Vector(19, 20) 1599495723348
Done with Vector(21, 22, 23) 1599495724330
As we can see, the gap between emitting a 3-element group and the following 2-element group is a bit more than 1 second. That makes sense: the stage had to wait out the 1-second window, and then it took a little extra time to reach the printing line.
The gap between emitting a 2-element group and the following 3-element group is slightly less than a second, because by then it already had enough elements to go on.
Why didn't it work in your example?
When you use Source.single, the source completes the stream right after its single element. You can see it in the source code of Akka.
In this case, the groupedWithin stage knows that it won't get any more elements, so it can emit the "test" string immediately. To actually test this flow, try creating a bigger stream.
Source(1 to 10) behaves similarly: it also completes the stream once its elements are exhausted. We can see that here.
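If you do want batches to form across many API calls, one option (a rough sketch, not the only way; it assumes you can materialize a single long-lived stream at startup, and a recent Akka version where the ActorSystem provides the materializer) is to push every record into one shared Source.queue instead of creating a new Source.single per request:

import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.OverflowStrategy
import akka.stream.scaladsl.{Flow, Sink, Source}
import scala.concurrent.duration._

implicit val system = ActorSystem("batching")

// stand-in for the question's myFlow / someBatchOperation
val myFlow: Flow[String, Unit, NotUsed] =
  Flow[String]
    .groupedWithin(3, 30.seconds)
    .map(records => println(s"batch of ${records.size}: $records"))

// materialize ONE long-lived stream; all API calls share it
val recordQueue =
  Source.queue[String](1024, OverflowStrategy.backpressure)
    .via(myFlow)
    .to(Sink.ignore)
    .run()

// each incoming record is offered to the shared queue instead of Source.single(...)
recordQueue.offer("test-1")
recordQueue.offer("test-2")
recordQueue.offer("test-3") // only now (or after 30 seconds) does a batch get emitted

Because the queue source never completes on its own, groupedWithin keeps waiting for 3 elements or for the 30-second window, which is the behaviour you expected.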
I want to iterate through all items of a CSV file, and for each item I want to distribute the requests uniformly so that all SearchProduct functions (SearchProduct1, SearchProduct2 and SearchProduct3) are called the same number of times.
val products = csv("products.csv").records

val start = exec(repeat(products.size, "n") {
  feed(products.queue)
    .uniformRandomSwitch(
      exec(searchProduct1),
      exec(searchProduct2),
      exec(searchProduct3)
    )
})
I expect that if I have 9 products, the function SearchProduct1 is called 3 times, the function SearchProduct2 is called 3 times and the function SearchProduct3 is also called 3 times.
But the statistics often show me that SearchProduct3 was called 5 times while SearchProduct2 and SearchProduct1 were called 2 times each. Am I doing anything wrong? Should I do the repeat inside the uniformRandomSwitch?
My understanding of uniformRandomSwitch is that the probability of executing each of these three functions is the same. So it could happen that in 9 iterations SearchProduct1 is executed 8 times, SearchProduct2 once, and SearchProduct3 never. In other words, with uniformRandomSwitch I am not forcing every function to be executed the same number of times. Right?
I think what you want is the roundRobinSwitch directive. This will iterate through each chain, moving on to the next and then repeating at the beginning, as new requests come through.
With uniformRandomSwitch each chain has a 1/N chance of being called on every iteration. Only over many iterations would the number of calls converge towards an even split (3/3/3 in your example).
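For reference, a sketch of how the scenario from the question could look with the round-robin directive (untested; everything except the switch is copied from the question):

val products = csv("products.csv").records

val start = exec(repeat(products.size, "n") {
  feed(products.queue)
    .roundRobinSwitch(
      exec(searchProduct1),
      exec(searchProduct2),
      exec(searchProduct3)
    )
})

With 9 products this deals out exactly 3 calls to each chain, in order.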
I want to offer a string sent in a load request to a queue after some initial delay, say 10 seconds.
If subsequent requests are made with a short delay between them (1 second), everything works fine, but if they are made continuously, e.g. from a script, then there is no delay.
Here is the sample code.
def load(randomStr: String) = Action { implicit request =>
  Source.single(randomStr)
    .delay(10.seconds, DelayOverflowStrategy.backpressure)
    .map { x =>
      println(x)
      queue.offer(x)
    }
    .runWith(Sink.ignore)
  Ok("")
}
}
I am not entirely sure that this is the correct way of doing what you want. There are some things you need to reconsider:
A delayed source has an initial buffer capacity of 16 elements. You can increase this with addAttributes (e.g. Attributes.inputBuffer).
In your case the buffer can never actually become full, because each stream only ever carries a single element.
Who is the caller of the Action? You are defining a DelayOverflowStrategy.backpressure strategy, but is the caller able to handle it?
On every call of the action you are creating a stream consisting of one element. How is backpressure helping here? It is applied to the stream processing, not to the offering to the queue.
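If the intent is simply "pass every incoming string on to the queue 10 seconds after it arrived", one alternative worth sketching (just a sketch: it reuses the queue and Action from the question, assumes a recent Akka version where the ActorSystem provides the materializer, and the buffer sizes are illustrative) is a single long-lived delayed stream materialized at startup, with the action only offering into it:

import akka.actor.ActorSystem
import akka.stream.{Attributes, DelayOverflowStrategy, OverflowStrategy}
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.duration._

implicit val system = ActorSystem("delayed-offers")

// One stream for the whole application: everything offered to `delayedIn`
// sits in the delay buffer for 10 seconds before reaching `queue`
// (the existing queue from the question).
val delayedIn =
  Source.queue[String](256, OverflowStrategy.dropNew)
    .delay(10.seconds, DelayOverflowStrategy.backpressure)
    .addAttributes(Attributes.inputBuffer(initial = 256, max = 256)) // enlarge the 16-element delay buffer (first point above)
    .to(Sink.foreach { x =>
      println(x)
      queue.offer(x)
    })
    .run()

def load(randomStr: String) = Action { implicit request =>
  // Future[QueueOfferResult]; with dropNew, bursts beyond the buffers are
  // rejected instead of failing the stream.
  delayedIn.offer(randomStr)
  Ok("")
}

Because the delay now lives in one stream that sees all requests, a burst from a script gets buffered (up to the configured capacity) and released 10 seconds later, instead of each request getting its own tiny single-element stream.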
I have a lot of data and I have experimented with partition counts in the range [20k, 200k+].
I call it like this:
from pyspark.mllib.clustering import KMeans, KMeansModel
C0 = KMeans.train(first, 8192, initializationMode='random', maxIterations=10, seed=None)
C0 = KMeans.train(second, 8192, initializationMode='random', maxIterations=10, seed=None)
and I see that initRandom() calls takeSample() once.
Then the takeSample() implementation doesn't seem to call itself recursively or anything like that, so I would expect KMeans() to call takeSample() once. So why does the monitor show two takeSample()s per KMeans()?
Note: I execute more KMeans() runs and they all invoke two takeSample()s, regardless of whether the data is .cache()'d or not.
Moreover, the number of partitions doesn't affect the number of times takeSample() is called; it is constant at 2.
I am using Spark 1.6.2 (and I cannot upgrade) and my application is in Python, if that matters!
I brought this to the mailing list of the Spark devs, so I am updating:
Details of the 1st and 2nd takeSample() (screenshots omitted), where one can see that the same code is executed in both.
As suggested by Shivaram Venkataraman on the Spark mailing list:
I think takeSample itself runs multiple jobs if the amount of samples collected in the first pass is not enough. The comment and code path at GitHub should explain when this happens. Also you can confirm this by checking if the logWarning shows up in your logs.
// If the first sample didn't turn out large enough, keep trying to take samples;
// this shouldn't happen often because we use a big multiplier for the initial size
var numIters = 0
while (samples.length < num) {
  logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
  samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
  numIters += 1
}
However, as one can see, the comment says this shouldn't happen often, yet it happens every time for me, so if anyone has another idea, please let me know.
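To follow the suggestion of checking the logs, it should be enough to make WARN-level output visible on the driver and then look for the re-sample message around the KMeans.train call (shown in Scala; the PySpark SparkContext has the same setLogLevel method):

// make warnings visible, then look for
// "Needed to re-sample due to insufficient sample size" in the driver log
sc.setLogLevel("WARN")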
It was also suggested that this was a problem of the UI and takeSample() was actually called only once, but that was just hot air.
// 1 fixed thread
implicit val waitingCtx = scala.concurrent.ExecutionContext.fromExecutor(Executors.newFixedThreadPool(1))

// "map" will use waitingCtx
val ss = (1 to 1000).map { n => // if I change it to 10 000 the program stops at some point, like locking forever
  service1.doServiceStuff(s"service ${n}").map { s =>
    service1.doServiceStuff(s"service2 ${n}")
  }
}
Each doServiceStuff(name: String) takes 5 seconds. doServiceStuff does not take an implicit ExecutionContext parameter; it uses its own execution context internally and wraps the work in Future { blocking { ... } } on it.
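For context, a stand-in for what doServiceStuff presumably looks like, based only on the description above (the real code is in the linked repo; names and details here are assumptions):

import java.util.concurrent.Executors
import scala.concurrent.{blocking, ExecutionContext, Future}

object service1 {
  // its own execution context, as described above
  private val serviceCtx =
    ExecutionContext.fromExecutor(Executors.newCachedThreadPool())

  def doServiceStuff(name: String): Future[String] =
    Future {
      blocking {
        Thread.sleep(5000) // "each doServiceStuff takes 5 seconds"
        println(name)
        name
      }
    }(serviceCtx)
}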
In the end the program prints:
took: 5.775849753 seconds for 1000 x 2 stuffs
If I change 1000 to 10000, adding even more tasks (val ss = (1 to 10000)), then the program stops:
~17,027 lines are printed (out of 20,000). No "ERROR" message is printed, no "took" message is printed, and it does not process anything further.
But if I change the exContext to ExecutionContext.fromExecutor(null: Executor) (the global one), then it ends in about 10 seconds (though not cleanly):
~17249 lines printed
ERROR: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
took: 10.646309398 seconds
That's the question: why does it stop silently with the fixed execution-context pool, but terminate (with an error and the messages) with the global execution context?
And sometimes it is not reproducible.
UPDATE: I do see the "ERROR" and "took" messages if I increase the pool size from 1 to N. It does not matter how high N is; the ERROR still appears.
The code is here: https://github.com/Sergey80/scala-samples/tree/master/src/main/scala/concurrency/apptmpl
and here, doManagerStuff2()
I think I have an idea of what's going on. If you squint enough, you'll see that the work done in map is extremely lightweight: it just fires off a new future (because doServiceStuff returns a Future). I bet the behavior would change if you switched to flatMap, which actually flattens the nested future and thus waits for the second doServiceStuff call to complete.
Since you're not flattening out these futures, all your awaits downstream are awaiting on the wrong thing, and you are not catching it because you're discarding whatever the service returns.
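To make the map/flatMap difference concrete, here is a minimal, self-contained sketch (with a hypothetical doServiceStuff; not the question's actual service):

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

def doServiceStuff(name: String): Future[String] =
  Future { Thread.sleep(200); println(name); name }

// map: the result is a nested Future[Future[String]]; the outer future
// completes as soon as the second call has been *started*, not finished
val nested: Future[Future[String]] =
  doServiceStuff("service 1").map(_ => doServiceStuff("service 2"))

// flatMap: the futures are flattened, so waiting on the result really
// waits for the second call to complete as well
val flat: Future[String] =
  doServiceStuff("service 1").flatMap(_ => doServiceStuff("service 2"))

Await.result(nested, 5.seconds) // can return before "service 2" has run to completion
Await.result(flat, 5.seconds)   // returns only after "service 2" has completed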
Update
Ok, I misinterpreted your question, although I still think that the nested Future is a bug.
When I try your code with both executors and 10000 tasks, I do get an OutOfMemoryError when creating threads in the ForkJoin execution context (i.e. for the service tasks), which I'd expect. Did you use any specific memory settings?
With 1000 tasks they both do complete successfully.