Akka stream hangs when starting more than 15 external processes using ProcessBuilder - scala

I'm building an app that has the following flow:
There is a source of items to process
Each item should be processed by an external command (it'll be ffmpeg in the end, but for this simple reproducible use case it is just cat, so that data is passed through it unchanged)
In the end, the output of such external command is saved somewhere (again, for the sake of this example it just saves it to a local text file)
So I'm doing the following operations:
Prepare a source with items
Make an Akka graph that uses Broadcast to fan-out the source items into individual flows
Individual flows use ProcessBuilder in conjunction with Flow.fromSinkAndSource to build a flow out of this external process execution
End the individual flows with a sink that saves the data to a file.
Complete code example:
import akka.actor.ActorSystem
import akka.stream.scaladsl.GraphDSL.Implicits._
import akka.stream.scaladsl._
import akka.stream.ClosedShape
import akka.util.ByteString
import java.io.{BufferedInputStream, BufferedOutputStream}
import java.nio.file.Paths
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}
object MyApp extends App {
// When this is changed to something above 15, the graph just stops
val PROCESSES_COUNT = Integer.parseInt(args(0))
println(s"Running with ${PROCESSES_COUNT} processes...")
implicit val system = ActorSystem("MyApp")
implicit val globalContext: ExecutionContext = ExecutionContext.global
def executeCmdOnStream(cmd: String): Flow[ByteString, ByteString, _] = {
val convertProcess = new ProcessBuilder(cmd).start
val pipeIn = new BufferedOutputStream(convertProcess.getOutputStream)
val pipeOut = new BufferedInputStream(convertProcess.getInputStream)
Flow
.fromSinkAndSource(StreamConverters.fromOutputStream(() ⇒ pipeIn), StreamConverters.fromInputStream(() ⇒ pipeOut))
}
val source = Source(1 to 100)
.map(element => {
println(s"--emit: ${element}")
ByteString(element)
})
val sinksList = (1 to PROCESSES_COUNT).map(i => {
Flow[ByteString]
.via(executeCmdOnStream("cat"))
.toMat(FileIO.toPath(Paths.get(s"process-$i.txt")))(Keep.right)
})
val graph = GraphDSL.create(sinksList) { implicit builder => sinks =>
val broadcast = builder.add(Broadcast[ByteString](sinks.size))
source ~> broadcast.in
for (i <- broadcast.outlets.indices) {
broadcast.out(i) ~> sinks(i)
}
ClosedShape
}
Await.result(Future.sequence(RunnableGraph.fromGraph(graph).run()), Duration.Inf)
}
Run this using following command:
sbt "run PROCESSES_COUNT"
e.g.
sbt "run 15"
This all works quite well until I raise the amount of "external processes" (PROCESSES_COUNT in the code). When it's 15 or less, all goes well but when it's 16 or more then the following things happen:
Whole execution just hangs after emitting the first 16 items (this amount of 16 items is Akka's default buffer size AFAIK)
I can see that cat processes are started in the system (all 16 of them)
When I manually kill one of these cat processes in the system, something frees up and processing continues (of course in the result, one file is empty because I killed its processing command)
I checked that this is definitely caused by the external execution (and not, e.g., by a limit of Akka's Broadcast itself).
I recorded a video showing these two situations (first, 15 items working fine and then 16 items hanging and freed up by killing one process) - link to the video
Both the code and video are in this repo
I'd appreciate any help or suggestions on where to look for a solution to this one.

It is an interesting problem, and it looks like the stream is deadlocking. Increasing the number of threads may fix the symptom, but not the underlying problem.
The problem is the following code:
Flow
.fromSinkAndSource(
StreamConverters.fromOutputStream(() => pipeIn),
StreamConverters.fromInputStream(() => pipeOut)
)
Both fromInputStream and fromOutputStream will be using the same default-blocking-io-dispatcher as you correctly noticed. The reason for using a dedicated thread pool is that both perform Java API calls that are blocking the running thread.
Here is a part of a thread stack trace of fromInputStream that shows where blocking is happening.
at java.io.FileInputStream.readBytes(java.base@11.0.13/Native Method)
at java.io.FileInputStream.read(java.base@11.0.13/FileInputStream.java:279)
at java.io.BufferedInputStream.read1(java.base@11.0.13/BufferedInputStream.java:290)
at java.io.BufferedInputStream.read(java.base@11.0.13/BufferedInputStream.java:351)
- locked <merged>(a java.lang.ProcessImpl$ProcessPipeInputStream)
at java.io.BufferedInputStream.read1(java.base@11.0.13/BufferedInputStream.java:290)
at java.io.BufferedInputStream.read(java.base@11.0.13/BufferedInputStream.java:351)
- locked <merged>(a java.io.BufferedInputStream)
at java.io.FilterInputStream.read(java.base@11.0.13/FilterInputStream.java:107)
at akka.stream.impl.io.InputStreamSource$$anon$1.onPull(InputStreamSource.scala:63)
Now, you're running 16 simultaneous Sinks that are connected to a single Source. To support back-pressure, a Source will only produce an element when all Sinks send a pull command.
What happens next is that you have 16 simultaneous calls to FileInputStream.readBytes, and they immediately block every thread of default-blocking-io-dispatcher (its default fixed-pool-size is 16). There are no threads left for fromOutputStream to write any data from the Source or to do any other work. Thus, you have a deadlock.
The problem can be fixed if you increase the threads in the pool. But this just removes the symptom.
The correct solution is to run fromOutputStream and fromInputStream in two separate thread pools. Here is how you can do it.
Flow
.fromSinkAndSource(
StreamConverters.fromOutputStream(() => pipeIn).async("blocking-1"),
StreamConverters.fromInputStream(() => pipeOut).async("blocking-2")
)
with the following config:
blocking-1 {
type = "Dispatcher"
executor = "thread-pool-executor"
throughput = 1
thread-pool-executor {
fixed-pool-size = 2
}
}
blocking-2 {
type = "Dispatcher"
executor = "thread-pool-executor"
throughput = 1
thread-pool-executor {
fixed-pool-size = 2
}
}
Because they don't share the pools anymore, both fromOutputStream and fromInputStream can perform their tasks independently.
Also note that I just assigned 2 threads per pool to show that it's not about the thread count but about the pool separation.
I hope this helps to understand akka streams better.
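To make the starvation concrete without Akka, here is a minimal sketch using plain java.util.concurrent. It is only a model of the situation described above, not Akka's implementation: PoolStarvationDemo and runPipes are made-up names, the latch stands in for a blocking read() on a process pipe, and the pool sizes are picked purely for illustration.

```scala
import java.util.concurrent.{CountDownLatch, ExecutorService, Executors, TimeUnit}

object PoolStarvationDemo {
  /** Returns true if every "reader" finished within one second. */
  def runPipes(pipes: Int, shared: Boolean): Boolean = {
    // Shared: one pool with as many threads as pipes (like 16 streams on a 16-thread pool).
    // Separate: two small pools of 2 threads each.
    val readers: ExecutorService = Executors.newFixedThreadPool(if (shared) pipes else 2)
    val writers: ExecutorService = if (shared) readers else Executors.newFixedThreadPool(2)
    val dataReady = Array.fill(pipes)(new CountDownLatch(1))
    val done = new CountDownLatch(pipes)

    // Readers are submitted first and block until "their" writer has produced data.
    for (i <- 0 until pipes) readers.execute(new Runnable {
      def run(): Unit =
        try { dataReady(i).await(); done.countDown() }
        catch { case _: InterruptedException => () } // interrupted by shutdownNow below
    })
    // Writers need a free thread to run; in the shared case none is left.
    for (i <- 0 until pipes) writers.execute(new Runnable {
      def run(): Unit = dataReady(i).countDown()
    })

    val finished = done.await(1, TimeUnit.SECONDS)
    readers.shutdownNow()
    writers.shutdownNow()
    finished
  }
}
```

With a shared pool the readers occupy every thread and the writers stay queued, so runPipes(4, shared = true) times out. With two separate pools of only 2 threads each, runPipes(4, shared = false) completes, which mirrors the point that pool separation matters more than thread count.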

It turns out this was a limit at the Akka configuration level: the thread pool size of the blocking IO dispatcher.
Changing that value to something bigger than the number of streams fixed the issue:
akka.actor.default-blocking-io-dispatcher.thread-pool-executor.fixed-pool-size = 50

Related

Close GRPC Netty channel in Scala Spark App

I'm trying to cleanly close a set of channels created during the processing of a Spark map function. When I add the "shutdown/awaitTermination" calls after the main processing part (right before returning the "Result"), I get errors in other calls, as if the channel was shut down prematurely (due to how Spark schedules the actual tasks, I guess). Any recommendations? I have the following flow:
val someDF = initialDF.mapPartitions(iterator => {
val caller = createChannel(certificate, URI, port)
val innerDF = iterator.map(row => {
//do stuff with caller created above
result
}).toDF()
If I don't shutdown the channel, it runs fine (except I get some error messages in unit testing). But if I create a new channel during execution after the code as above, I end up with the following error:
ERROR io.grpc.internal.ManagedChannelOrphanWrapper - ~~~ Channel ManagedChannelImpl{logId=41, target=blablabla:443} was not shutdown properly!!! ~~~
How should I shutdown these channels? I'm not too bright with Spark...
Thanks!

Refresh cached values in spark streaming without reboot the batch

Maybe the question is too simple, or at least it looks that way, but I have the following problem:
A. I run a Spark Streaming process with spark-submit.
ccc.foreachRDD(rdd => {
rdd.repartition(20).foreachPartition(p => {
val repo = getReposX
t foreach (g => {
.................
B. getReposX is a function that queries MongoDB and returns a Map of key/value pairs needed in every executor of the process.
C. Inside the foreach, each g uses this "cached" map.
The problem is that when anything changes in the Mongo collection, I don't detect the change, so I keep working with a stale Map. My question is: how can I pick up the change? Yes, I know that if I restart the spark-submit and the driver runs again it is OK, but otherwise I will never see the update in my Map.
Any ideas or suggestion?
Regards.
I finally developed a solution. First, let me explain the question in more detail, because what I really wanted to know was how to implement an object or "cache" that is refreshed every so often, or on some kind of command, without restarting the Spark Streaming process; that is, it should refresh live.
In my case this "cache" is a singleton object that connects to a MongoDB collection to load a HashMap, which is used by each executor and kept in memory, as a good singleton does. The problem was that once the Spark Streaming job was submitted, the object was cached in memory and never refreshed unless the process was restarted. I thought of using a broadcast variable as a kind of counter, refreshing when it reaches 1000, but broadcast variables are read-only and cannot be modified. I also thought of an accumulator, but those can only be read by the driver.
Finally my solution is, within the initialization block of the object that loads the mongo collection and the cache, I implement this:
//Initialization Block
{
val ex = new ScheduledThreadPoolExecutor(1)
val task = new Runnable {
def run() = {
logger.info("Refresh - Inicialization")
initCache
}
}
val f = ex.scheduleAtFixedRate(task, 0, TIME_REFRES, TimeUnit.SECONDS)
}
initCache is nothing more than a function that connects mongo and loads a collection:
var cache = mutable.HashMap[String,Description]()
def initCache():mutable.HashMap[String, Description]={
val serverAddresses = Functions.getMongoServers(SERVER, PORT)
val mongoConnectionFactory = new MongoCollectionFactory(serverAddresses,DATABASE,COLLECTION,USERNAME,PASSWORD)
val collection = mongoConnectionFactory.getMongoCollection()
val docs = collection.find.iterator()
cache.clear()
while (docs.hasNext) {
var doc = docs.next
cache.put(...............
}
cache
}
In this way, once the Spark Streaming job has been started, each executor starts one extra background task that refreshes the singleton's collection every X time (1 or 2 hours in my case), and getCache always returns the currently loaded value:
def getCache():mutable.HashMap[String, Description]={
cache
}
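One thing to watch with the approach above is that initCache clears and refills a mutable map while other threads may be reading it. A possible variation, sketched below with only the standard library, builds a new immutable Map and swaps it in atomically. RefreshingCache and its methods are hypothetical names, and the load function is a stand-in for the Mongo query; the sketch also catches exceptions inside the task, because scheduleAtFixedRate suppresses all later runs once a run throws.

```scala
import java.util.concurrent.{Executors, ThreadFactory, TimeUnit}
import java.util.concurrent.atomic.AtomicReference
import scala.util.control.NonFatal

// Hypothetical helper: keeps an immutable snapshot and replaces it on each refresh.
final class RefreshingCache[K, V](load: () => Map[K, V]) {
  // Readers always see a complete snapshot, never a half-filled map.
  private val snapshot = new AtomicReference[Map[K, V]](load())

  private val scheduler = Executors.newSingleThreadScheduledExecutor(new ThreadFactory {
    def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "cache-refresh")
      t.setDaemon(true) // do not keep the JVM alive just for refreshes
      t
    }
  })

  /** Reloads the data; on failure the previous snapshot stays in place. */
  def refresh(): Unit =
    try snapshot.set(load())
    catch { case NonFatal(_) => () } // swallow so the periodic schedule keeps running

  /** Starts periodic refreshes; the first one runs after one full period. */
  def start(period: Long, unit: TimeUnit): Unit = {
    scheduler.scheduleAtFixedRate(new Runnable { def run(): Unit = refresh() }, period, period, unit)
    ()
  }

  def get: Map[K, V] = snapshot.get()

  def stop(): Unit = { scheduler.shutdownNow(); () }
}
```

Usage would look like `val cache = new RefreshingCache(() => loadFromMongo()); cache.start(1, TimeUnit.HOURS)`, where loadFromMongo is whatever function returns the current collection as a Map.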

Performance issue in play application

I have a play(2.3.0) application that does some database lookups. When there are more than 6 users the application runs into performance problems.
I have narrowed down the problem to a controller with an action that does a sleep of 4 seconds.
A test client calls this action every 500 ms. I can see that the first 6 requests are processed, then it stops for a few seconds (until the 4-second sleep has passed) and reads the next 6.
Also: when I open 7 browser windows the 7th will not load(waits for connection).
Looking at the documentation it looks like my problem is blocking io and using the highly synchronous profile should solve my problem.
Therefore I added this profile to my application.conf but nothing changes.
my application.conf looks like this
application.context=/appname/
# Secret key
# ~~~~~
# The secret key is used to secure cryptographics functions.
# If you deploy your application to several instances be sure to use the same key!
application.secret="xxxxx"
play {
akka {
akka.loggers = ["akka.event.slf4j.Slf4jLogger"]
loglevel = WARNING
actor {
default-dispatcher = {
fork-join-executor {
parallelism-min = 300
parallelism-max = 300
}
}
}
}
}
and the action
def performancetestSleep() = Action{ request => {
Thread.sleep(4000)
Ok("hmmm good sleep")
}}
It seems to me the threadpool configuration is ignored. What am I missing here?
What you need for this is really just one thread which handles the 4 second delay - a scheduler. Spawning that many threads defeats the whole point of the architecture that Play has, IMHO. You could then use the scheduler to create a Future[Result] which you'd feed into an Action.async block.
Now, you don't really need to implement your own scheduler since Play depends on Akka for its concurrency; and Akka has a scheduler which will do the job.
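import scala.concurrent.Promise
import scala.concurrent.duration._
import play.api.mvc.{Action, Result}
import play.libs.Akka
// Meant to live inside a Controller, so that Ok is in scope
val system = Akka.system()
def delayedResponse = Action.async {
  import system.dispatcher
  val promise = Promise[Result]()
  // Complete the promise from Akka's scheduler instead of blocking a request thread
  system.scheduler.scheduleOnce(4000.milliseconds) {
    promise.success(Ok("Sorry for the wait!"))
  }
  promise.future
}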
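For reference, the same idea can be tried outside Play with nothing but the standard library. This is only a sketch: DelayDemo and delayed are made-up names, and a single-threaded ScheduledExecutorService plays the role of Akka's scheduler here.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import scala.concurrent.duration._
import scala.concurrent.{Await, Future, Promise}
import scala.util.Try

object DelayDemo {
  // One timer thread serves every pending delay; no thread sleeps per request.
  private val timer = Executors.newSingleThreadScheduledExecutor()

  /** Returns a Future that completes with `value` after `delay`. */
  def delayed[T](delay: FiniteDuration)(value: => T): Future[T] = {
    val promise = Promise[T]()
    timer.schedule(new Runnable {
      def run(): Unit = promise.complete(Try(value))
    }, delay.toMillis, TimeUnit.MILLISECONDS)
    promise.future
  }

  def shutdown(): Unit = timer.shutdown()
}
```

Twenty calls to DelayDemo.delayed(200.millis)(...) started together should all complete after roughly 200 ms in total, whereas twenty Thread.sleep calls on a pool of 6 threads would need several rounds.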
I used
activator run
to start the server, that does not seem to pick up the threadpool profile. Using
activator start
does, and now the profile seems to be used. I now need to test if this solves my problem. Will also have a look at the async call.

Controlling requests per second and timeout threshold in Gatling

I am working on a Gatling simulation. For the life of me, I cannot get my code to reach 10000 requests per second. I have read the documentation and I keep messing with different methods and whatnot but my requests per second seems capped at 5000 requests per second. I have attached my current iteration of my code. The URL and path information is blurred out. Assume that I have no issue with the HTTP part of my simulation.
package computerdatabase
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._
//import assertions._
class userSimulation extends Simulation {
object Query {
val feeder = csv("firstfileSHUF.txt").random
val query = repeat(2000) {
feed(feeder).
exec(http("user")
.get("/path/path/" + "${userID}" + "?fullData=true"))
}
}
val baseUrl = "http:URL:7777"
val httpConf = http
.baseURL(baseUrl) // Here is the root for all relative URLs
val scn = scenario("user") // A scenario is a chain of requests and pauses
.exec(Query.query)
setUp(scn.inject(rampUsers(1500) over (60 seconds)))
.throttle(reachRps(10000) in (2 minute),
holdFor(3 minutes))
.protocols(httpConf)
}
Additionally, I would like to set the maximum threshold for a timeout to be 100ms. I have tried to do this with assertions and also editing the configuration files but it never seems to show up during the tests or in my reports. How can I set a request to KO if the request took longer than 100ms? Thank you for your help with this matter!
I ended up figuring this out. My code above is correct, and I now understand what Stephane, one of the main contributors to Gatling, was explaining. The server at the time simply could not handle my RPS threshold. It was an upper bound that was unreachable. After making changes to the server, it could handle this sort of load. Additionally, I found a way to time out at 100ms in the configuration file. Specifically, requestTimeout = 100 will cause the timeout behavior I was looking for.
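A rough back-of-the-envelope check shows why the ceiling sits with the server rather than with the simulation. It assumes a closed loop in which all 1500 users are active after the ramp and fire requests back to back with no pauses (as in the scenario above), so Little's law applies: throughput = concurrent users / mean response time. ClosedLoopLoad is a made-up name, and the 300 ms figure is only what the observed 5000 RPS would imply under these assumptions, not a measured value.

```scala
object ClosedLoopLoad {
  /** Highest request rate a fixed number of back-to-back users can generate. */
  def maxRps(concurrentUsers: Int, meanResponseMillis: Double): Double =
    concurrentUsers * 1000.0 / meanResponseMillis

  /** Mean response time the server must stay under to reach a target rate. */
  def responseMillisFor(concurrentUsers: Int, targetRps: Double): Double =
    concurrentUsers * 1000.0 / targetRps
}

// 1500 users can only reach 10000 RPS if responses average 150 ms or less.
val neededMillis = ClosedLoopLoad.responseMillisFor(1500, 10000)
// A cap at 5000 RPS with 1500 users corresponds to roughly 300 ms per response.
val cappedRps = ClosedLoopLoad.maxRps(1500, 300)
```

Since throttle only limits the rate from above, a slower server or too few users both leave the simulation below the reachRps target.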

Control flow of messages in Akka actor

I have an actor using Akka which performs an action that takes some time to complete, because it has to download a file from the network.
def receive = {
case songId: String => {
Future {
val futureFile = downloadFile(songId)
for (file <- futureFile) {
val fileName = doSomenthingWith(file)
otherActor ! fileName
}
}
}
}
I would like to control the flow of messages to this actor. If I try to download too many files simultaneously, I get a network bottleneck. The problem is that I am using a Future inside the actor's receive, so the method exits immediately and the actor is ready to process a new message. If I remove the Future, I will download only one file at a time.
What is the best way to limit the number of messages being processed per unit of time? Is there a better way to design this code?
There is a contrib project for Akka that provides a throttle implementation (http://letitcrash.com/post/28901663062/throttling-messages-in-akka-2). If you put this in front of the actual download actor, you can effectively throttle the rate of messages going into that actor. It's not 100% perfect, in that if downloads take longer than expected you could still end up with more simultaneous downloads than desired, but it's a pretty simple implementation and we use it quite a bit to great effect.
Another option could be to use a pool of download actors, remove the Future, and let the actors perform the blocking call so that each one truly handles only one message at a time. Because you are going to let them block, I would suggest giving them their own Dispatcher (ExecutionContext) so that this blocking does not negatively affect the main Akka Dispatcher. If you do this, then the pool size itself represents your maximum number of simultaneous downloads.
Both of these solutions are pretty much "out-of-the-box" solutions that don't require much custom logic to support your use case.
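The "pool size is the limit" idea from the second option can be tried without actors. The sketch below is a plain-Scala stand-in, not the Akka pool itself: BoundedDownloads and downloadAll are hypothetical names, blockingDownload fakes the download with a short sleep, and a fixed thread pool takes the place of the dedicated dispatcher.

```scala
import java.util.concurrent.Executors
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.duration._
import scala.concurrent.{Await, ExecutionContext, ExecutionContextExecutorService, Future}

object BoundedDownloads {
  /** Runs all downloads with at most maxParallel in flight; returns (file names, observed peak). */
  def downloadAll(songIds: List[String], maxParallel: Int): (List[String], Int) = {
    val inFlight = new AtomicInteger(0)
    val peak = new AtomicInteger(0)

    // Hypothetical stand-in for a blocking download.
    def blockingDownload(songId: String): String = {
      val now = inFlight.incrementAndGet()
      peak.synchronized { if (now > peak.get) peak.set(now) }
      try { Thread.sleep(50); s"$songId.mp3" }
      finally inFlight.decrementAndGet()
    }

    // Dedicated pool: its size is the cap on simultaneous blocking downloads.
    implicit val ec: ExecutionContextExecutorService =
      ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(maxParallel))
    try {
      val all = Future.sequence(songIds.map(id => Future(blockingDownload(id))))
      (Await.result(all, 10.seconds), peak.get)
    } finally ec.shutdown()
  }
}
```

Calling BoundedDownloads.downloadAll(ids, 2) queues every request immediately, but the observed peak never exceeds 2, because excess tasks wait in the pool's queue rather than starting early.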
Edit
I also thought it would be good to mention the Work Pulling Pattern. With this approach you could still use a pool, with a single work distributor in front. Each worker (download actor) could perform the download (still using a Future) and only request new work (pull) from the work distributor when that Future has fully completed, meaning the download is done.
If you have an upper bound on the number of simultaneous downloads you want, you can 'ack' back to the actor to say that a download has completed and free up a slot for another file:
case object AckFileRequest
class ActorExample(otherActor:ActorRef, maxFileRequests:Int = 1) extends Actor {
var fileRequests = 0
def receive = {
case songId: String if fileRequests < maxFileRequests =>
fileRequests += 1
val thisActor = self
Future {
val futureFile = downloadFile(songId)
//not sure if you're returning the downloaded file or a future here,
//but you can move this to wherever the downloaded file is and ack
thisActor ! AckFileRequest
for (file <- futureFile) {
val fileName = doSomenthingWith(file)
otherActor ! fileName
}
}
case songId: String =>
//Do some throttling here
val thisActor = self
context.system.scheduler.scheduleOnce(1 second, thisActor, songId)
case AckFileRequest => fileRequests -= 1
}
}
In this example, if there are too many file requests in flight, we put the songId request on hold and queue it back up for processing 1 second later. You can obviously change this however you see fit: you could send the message straight back to the actor in a tight loop, or do some other throttling, depending on your use case.
There is a contrib implementation of message Throttling, as described here.
The code is very simple:
// A simple actor that prints whatever it receives
class Printer extends Actor {
def receive = {
case x => println(x)
}
}
val printer = system.actorOf(Props[Printer], "printer")
// The throttler for this example, setting the rate
val throttler = system.actorOf(Props(classOf[TimerBasedThrottler], 3 msgsPer 1.second))
// Set the target
throttler ! SetTarget(Some(printer))
// These three messages will be sent to the printer immediately
throttler ! "1"
throttler ! "2"
throttler ! "3"
// These two will wait at least until 1 second has passed
throttler ! "4"
throttler ! "5"