Akka Streams Accumulate by Source Single - scala

I am trying to use akka streams to accumulate data and use as batch:
val myFlow: Flow[String, Unit, NotUsed] = Flow[String].collect {
case record =>
println(record)
Future(record)
}.mapAsync(1)(x => x).groupedWithin(3, 30 seconds)
.mapAsync(10)(records =>
someBatchOperation(records))
)
My expectation from code above was not make any operation until 3 records are ready, or 30 seconds pass. But when I send some request with Source.single("test"), it is processing this record without waiting for others or 30 seconds.
How can I use this flow to wait for other records to came or 30 seconds idle?
Record is coming from an API request one by one and I am trying to accumulate this data in flow like:
Source.single(apiRecord).via(myFlow).runWith(Sink.ignore)

It actually does that. Let's consider the following:
Source(Stream.from(1)).throttle(1, 400 milli).groupedWithin(3, 1 seconds).runWith(Sink.foreach(i => println(s"Done with ${i} ${System.currentTimeMillis}")))
The output of that line, until I killed the process, was:
Done with Vector(1, 2, 3) 1599495716345
Done with Vector(4, 5) 1599495717348
Done with Vector(6, 7, 8) 1599495718330
Done with Vector(9, 10) 1599495719350
Done with Vector(11, 12, 13) 1599495720330
Done with Vector(14, 15) 1599495721350
Done with Vector(16, 17, 18) 1599495722328
Done with Vector(19, 20) 1599495723348
Done with Vector(21, 22, 23) 1599495724330
As we can see, the time differences between every time we emit 2 elements, to 3 elements, is a bit more than 1 second. That makes sense because after the 1 second delay, it took a bit more to get to the printing line.
The difference between every time we emit 2 elements, to 3 elements, is less than a second. Because it had enough elements to go on.
Why didn't it work in your example?
When you are using Source.single, then the source adds a complete stage to itself. You can see it in the source code of akka.
In this case, the groupedWithin flow knows that it won't get any more elements, so it can emit the "test" string. In order to actually test this flow try to create a bigger stream.
When using Source(1 to 10) it actually translates into Source.Single, which completes the stream as well. We can see that here.

Related

Gatling load testing and running scenarios

I am looking to create three scenarios:
The first scenario will run a bunch of GET requests for 30s
The second and third scenarios will run in parallel and wait until the first is finished.
I want the requests from the first scenario to be excluded from the report.
I have the basic outline of what I want to achieve but not seeing expected results:
val myFeeder = csv("somefile.csv")
val scenario1 = scenario("Get stuff")
.feed(myFeeder)
.during(30 seconds) {
exec(
http("getStuff(${csv_colName})").get("/someEndpoint/${csv_colName}")
)
}
val scenario2 = ...
val scenario3 = ...
setUp(
scenario1.inject(
constantUsersPerSec(20) during (30 seconds)
).protocols(firstProtocaol),
scenario2.inject(
nothingFor(30 seconds), //wait 30s
...
).protocols(secondProt)
scenario3.inject(
nothingFor(30 seconds), //wait 30s
...
).protocols(thirdProt)
)
I am seeing the first scenario being run throughout the entire test. It doesn't stop after the 30s?
For the first scenario I would like to cycle through the CSV file and perform a request for each line. Perhaps 5-10 requests per second, how do I achieve that?
I would also like it to stop after the 30s and then run the other two in parallel. Hence the nothingFor in last two scenarios above.
Also how do I exclude from report, is it possible?
You are likely not getting the expected results due to the combination of settings between your injection profile and your "Get Stuff" scenario.
constantUsersPerSec(20) during (30 seconds)
will start 20 users on scenario "Get Stuff" every second for 30 seconds. So even during the 30th second, 20 users will START "Get Stuff". The injection pofile only controls when a user starts, not how long they are active for. So when a user executes the "Get Stuff" scenario, they make the 'get' request repeatedly over the course of 30 seconds due to the .during loop.
So at the very least, you will have users executing "Get Stuff" for 60 seconds - well into the execution of your other scenarios. Depending on the execution time for you getStuff call, it may be even longer.
To avoid this, you could work out exactly how long you want the "Get Stuff" scenario to run, set that in the injection profile and have no looping in the scenario. Alternatively, you could just set your 'nothingFor' values to be >60s.
To exclude the Get Stuff calls from reports, you can add silencing to the protocol definition (assuming it's not shared with your other requests). More details at https://gatling.io/docs/3.2/http/http_protocol/#silencing

Kafka Streams - time window close delay?

I'm new to Kafka Streams.
I use the suppress method of KTable in order to handle only the final result of a window like this:
myStream
.windowedBy(TimeWindows.of(Duration.ofSeconds(10)).grace(Duration.ofMillis(500)))
.aggregate(new Aggregation(),
(k, v, a) -> a, // Disabled the actual aggregation in order to eliminate possiblities of latency
materialized.withLoggingDisabled())
.suppress(untilWindowCloses(Suppressed.BufferConfig.unbounded()))
.toStream().peek((k, v) -> log.info("delay " + (System.currentTimeMillis() - k.window().endTime().toEpochMilli())));
This way I get a log with the delay every 10 seconds with the difference between the window end and the actual time the peek was called.
I would exect a very small number here, since this code practically does nothing...
Nevertheless, I get delay of 4-20 sec for each key/window.
I use a thread per task (5 threads for this topic).
Can someone please point out if I'm doing anything wrong?
Thanks!
Edit:
Using VirtualVM shows that ~99% of the time consumed over sun.nio.ch.SelectorImpl.select(). This means AFAIU, that the process is "idle" most of the time.
Edit:
It seems that changing "commit.interval.ms" (which was by default 30000) reduced the delay drastically.
Still delay has peaks of event 15 seconds, so the problem isn't solved yet...

Offer to queue with some initial display

I want to offer to queue a string sent in load request after some initial delay say 10 seconds.
If the subsequent request is made with some short interval delay(1 second) then everything works fine, but if it is made continuously like from a script then there is no delay.
Here is the sample code.
def load(randomStr :String) = Action { implicit request =>
Source.single(randomStr)
.delay(10 seconds, DelayOverflowStrategy.backpressure)
.map(x =>{
println(x)
queue.offer(x)
})
.runWith(Sink.ignore)
Ok("")
}
I am not entirely sure that this is the correct way of doing what you want. There are some things you need to reconsider:
a delayed source has an initial buffer capacity of 16 elements. You can increase this with addAttributes(initialBuffer)
In your case the buffer cannot actually become full because every time you provide just one element.
Who is the caller of the Action? You are defining a DelayOverflowStrategy.backpressure strategy but is the caller able to handle this?
On every call of the action you are creating a Stream consisting of one element, how is the backpressure here helping? It is applied on the stream processing and not on the offering to the queue

Akka Reactive Streams always one message behind

For some reason, my Akka streams always wait for a second message before "emitting"(?) the first.
Here is some example code that demonstrates my problem.
val rx = Source((1 to 100).toStream.map { t =>
Thread.sleep(1000)
println(s"doing $t")
t
})
rx.runForeach(println)
yields output:
doing 1
doing 2
1
doing 3
2
doing 4
3
doing 5
4
doing 6
5
...
What I want:
doing 1
1
doing 2
2
doing 3
3
doing 4
4
doing 5
5
doing 6
6
...
The way your code is setup now, you are completely transforming the Source before it's allowed to start emitting elements downstream. You can clearly see that behavior (as #slouc stated) by removing the toStream on the range of numbers that represents the source. If you do that, you will see the Source be completely transformed first before it starts responding to downstream demand. If you actually want to run a Source into a Sink and have a transformation step in the middle, then you can try and structure things like this:
val transform =
Flow[Int].map{ t =>
Thread.sleep(1000)
println(s"doing $t")
t
}
Source((1 to 100).toStream).
via(transform ).
to(Sink.foreach(println)).
run
If you make that change, then you will get the desired effect, which is that an element flowing downstream gets processed all the way through the flow before the next element starts to be processed.
You are using .toStream() which means that the whole collection is lazy. Without it, your output would be first a hundred "doing"s followed by numbers from 1 to 100. However, Stream evaluates only the first element, which gives the "doing 1" output, which is where it stops. Next element will be evaluated when needed.
Now, I couldn't find any details on this in the docs, but I presume that runForeach has an implementation that takes the next element before invoking the function on the current one. So before calling println on element n, it first examines element n+1 (e.g. checks if it exists), which results in "doing n+1" message. Then it performs your println function on current element which results in message "n" .
Do you really need to map() before you runForeach? I mean, do you need two travelsals through the data? I know I'm probably stating the obvious, but if you just process your data in one go like this:
val rx = Source((1 to 100).toStream)
rx.runForeach({ t =>
Thread.sleep(1000)
println(s"doing $t")
// do something with 't', which is now equal to what "doing" says
})
then you don't have a problem of what's evaluated when.

Future map's (waiting) execution context. Stops execution with FixedThreadPool

// 1 fixed thread
implicit val waitingCtx = scala.concurrent.ExecutionContext.fromExecutor(Executors.newFixedThreadPool(1))
// "map" will use waitingCtx
val ss = (1 to 1000).map {n => // if I change it to 10 000 program will be stopped at some point, like locking forever
service1.doServiceStuff(s"service ${n}").map{s =>
service1.doServiceStuff(s"service2 ${n}")
}
}
Each doServiceStuff(name:String) takes 5 seconds. doServiceStuff does not have implicit ex:Execution context as parameter, it uses its own ex context inside and does Future {blocking { .. }} on it.
In the end program prints:
took: 5.775849753 seconds for 1000 x 2 stuffs
If I change 1000 to 10000 in, adding even more tasks : val ss = (1 to 10000) then program stops:
~17 027 lines will be printed (out of 20 000). No "ERROR" message
will be printed. No "took" message will be printed
**And will not be processing any futher.
But if I change exContext to ExecutionContext.fromExecutor(null: Executor) (global one) then in ends in about 10 seconds (but not normally).
~17249 lines printed
ERROR: java.util.concurrent.TimeoutException: Futures timed out after [10 seconds]
took: 10.646309398 seconds
That's the question
: Why with fixed ex-context pool it stops without messaging, but with global ex-context it terminates but with error and messaging?
and sometimes.. it is not reproducable.
UPDATE: I do see "ERROR" and "took" if I increase pool from 1 to N. Does not matter how hight N is - it sill will be the ERROR.
The code is here: https://github.com/Sergey80/scala-samples/tree/master/src/main/scala/concurrency/apptmpl
and here, doManagerStuff2()
I think I have an idea of what's going on. If you squint enough, you'll see that map duty is extremely lightweight: just fire off a new future (because doServiceStuff is a Future). I bet the behavior will change if you switch to flatMap, which will actually flatten the nested future and thus will wait for second doServiceStuff call to complete.
Since you're not flattening out these futures, all your awaits downstream are awaiting on a wrong thing, and you are not catching it because here you're discarding whatever Service returns.
Update
Ok, I misinterpreted your question, although I still think that that nested Future is a bug.
When I try your code with both executors with 10000 task I do get OutOfMemory when creating threads in ForkJoin execution context (i.e. for service tasks), which I'd expect. Did you use any specific memory settings?
With 1000 tasks they both do complete successfully.