I'm trying to cleanly close a set of channels created during processing of a Spark map function. When I introduce the shutdown/awaitTermination calls after the main processing part (right before returning the Result), I get errors in other calls, as if the channel had been shut down prematurely (due to Spark's scheduling of the actual tasks, I guess). Any recommendation? I have the following flow:
val someDF = initialDF.mapPartitions { iterator =>
  val caller = createChannel(certificate, URI, port)
  iterator.map { row =>
    // do stuff with the caller created above
    result
  }
}.toDF()
If I don't shut down the channel, it runs fine (apart from some error messages in unit testing). But if I create a new channel later in the execution, after code like the above, I end up with the following error:
ERROR io.grpc.internal.ManagedChannelOrphanWrapper - ~~~ Channel ManagedChannelImpl{logId=41, target=blablabla:443} was not shutdown properly!!! ~~~
How should I shut down these channels? I'm not too bright with Spark...
Thanks!
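A hedged sketch of one way to handle this: since mapPartitions returns a lazy iterator, a shutdown placed before the return closes the channel before Spark has consumed any rows. Tying the shutdown to task completion via Spark's TaskContext avoids that (createChannel and result are placeholders from the snippet above, and the exact listener signature varies across Spark versions):

import java.util.concurrent.TimeUnit
import org.apache.spark.TaskContext

val someDF = initialDF.mapPartitions { iterator =>
  val caller = createChannel(certificate, URI, port)
  // Defer the shutdown: the iterator below is consumed lazily by Spark,
  // so the channel must stay open until the task itself finishes.
  TaskContext.get().addTaskCompletionListener[Unit] { _ =>
    caller.shutdown()
    caller.awaitTermination(10, TimeUnit.SECONDS)
  }
  iterator.map { row =>
    // do stuff with the caller created above
    result
  }
}.toDF()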
Related
I'm building an app that has the following flow:
There is a source of items to process
Each item should be processed by an external command (it will be ffmpeg in the end, but for this simple reproducible use case it is just cat, so the data passes straight through)
In the end, the output of the external command is saved somewhere (again, for the sake of this example it is just saved to a local text file)
So I'm doing the following operations:
Prepare a source with items
Make an Akka graph that uses Broadcast to fan-out the source items into individual flows
Each individual flow uses ProcessBuilder in conjunction with Flow.fromSinkAndSource to build a flow around this external process execution
End the individual flows with a sink that saves the data to a file.
Complete code example:
import akka.actor.ActorSystem
import akka.stream.scaladsl.GraphDSL.Implicits._
import akka.stream.scaladsl._
import akka.stream.ClosedShape
import akka.util.ByteString
import java.io.{BufferedInputStream, BufferedOutputStream}
import java.nio.file.Paths
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

object MyApp extends App {
  // When this is changed to something above 15, the graph just stops
  val PROCESSES_COUNT = Integer.parseInt(args(0))
  println(s"Running with ${PROCESSES_COUNT} processes...")

  implicit val system = ActorSystem("MyApp")
  implicit val globalContext: ExecutionContext = ExecutionContext.global

  def executeCmdOnStream(cmd: String): Flow[ByteString, ByteString, _] = {
    val convertProcess = new ProcessBuilder(cmd).start
    val pipeIn = new BufferedOutputStream(convertProcess.getOutputStream)
    val pipeOut = new BufferedInputStream(convertProcess.getInputStream)
    Flow.fromSinkAndSource(
      StreamConverters.fromOutputStream(() => pipeIn),
      StreamConverters.fromInputStream(() => pipeOut)
    )
  }

  val source = Source(1 to 100)
    .map(element => {
      println(s"--emit: ${element}")
      ByteString(element)
    })

  val sinksList = (1 to PROCESSES_COUNT).map(i => {
    Flow[ByteString]
      .via(executeCmdOnStream("cat"))
      .toMat(FileIO.toPath(Paths.get(s"process-$i.txt")))(Keep.right)
  })

  val graph = GraphDSL.create(sinksList) { implicit builder => sinks =>
    val broadcast = builder.add(Broadcast[ByteString](sinks.size))
    source ~> broadcast.in
    for (i <- broadcast.outlets.indices) {
      broadcast.out(i) ~> sinks(i)
    }
    ClosedShape
  }

  Await.result(Future.sequence(RunnableGraph.fromGraph(graph).run()), Duration.Inf)
}
Run this using the following command:
sbt "run PROCESSES_COUNT"
e.g.
sbt "run 15"
This all works quite well until I raise the number of "external processes" (PROCESSES_COUNT in the code). When it's 15 or less, all goes well, but when it's 16 or more the following things happen:
The whole execution just hangs after emitting the first 16 items (this count of 16 is Akka's default buffer size, AFAIK)
I can see that the cat processes are started in the system (all 16 of them)
When I manually kill one of these cat processes, something frees up and processing continues (of course one file in the result is empty, because I killed its processing command)
I checked that this is definitely caused by the external execution (not, for example, a limit of Akka's Broadcast itself).
I recorded a video showing these two situations (first, 15 items working fine and then 16 items hanging and freed up by killing one process) - link to the video
Both the code and video are in this repo
I'd appreciate any help or suggestions on where to look for a solution to this one.
This is an interesting problem, and it looks like the stream is deadlocking. Increasing the thread count may fix the symptom but not the underlying problem.
The problem is the following code:
Flow
  .fromSinkAndSource(
    StreamConverters.fromOutputStream(() => pipeIn),
    StreamConverters.fromInputStream(() => pipeOut)
  )
Both fromInputStream and fromOutputStream will be using the same default-blocking-io-dispatcher, as you correctly noticed. The reason a dedicated thread pool is used is that both perform Java API calls that block the running thread.
Here is part of a thread stack trace of fromInputStream that shows where the blocking happens:
at java.io.FileInputStream.readBytes(java.base@11.0.13/Native Method)
at java.io.FileInputStream.read(java.base@11.0.13/FileInputStream.java:279)
at java.io.BufferedInputStream.read1(java.base@11.0.13/BufferedInputStream.java:290)
at java.io.BufferedInputStream.read(java.base@11.0.13/BufferedInputStream.java:351)
  - locked <merged>(a java.lang.ProcessImpl$ProcessPipeInputStream)
at java.io.BufferedInputStream.read1(java.base@11.0.13/BufferedInputStream.java:290)
at java.io.BufferedInputStream.read(java.base@11.0.13/BufferedInputStream.java:351)
  - locked <merged>(a java.io.BufferedInputStream)
at java.io.FilterInputStream.read(java.base@11.0.13/FilterInputStream.java:107)
at akka.stream.impl.io.InputStreamSource$$anon$1.onPull(InputStreamSource.scala:63)
Now, you're running 16 simultaneous Sinks that are connected to a single Source. To support back-pressure, a Source will only produce an element when all Sinks send a pull command.
What happens next is that you have 16 simultaneous calls to FileInputStream.readBytes, and they immediately block all threads of default-blocking-io-dispatcher. That leaves no threads for fromOutputStream to write any data from the Source, or to perform any other work. Thus, you have a deadlock.
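For context, and to explain the magic number 16: as best I recall, Akka's reference.conf caps the shared blocking IO dispatcher at a fixed pool of 16 threads, which matches the observed behaviour exactly. Check the reference.conf of your Akka version, but it should look roughly like this:

# Akka's default, quoted from memory - verify against your version's reference.conf
akka.actor.default-blocking-io-dispatcher {
  type = "Dispatcher"
  executor = "thread-pool-executor"
  throughput = 1
  thread-pool-executor {
    fixed-pool-size = 16
  }
}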
The problem can be fixed if you increase the threads in the pool. But this just removes the symptom.
The correct solution is to run fromOutputStream and fromInputStream in two separate thread pools. Here is how you can do it.
Flow
  .fromSinkAndSource(
    StreamConverters.fromOutputStream(() => pipeIn).async("blocking-1"),
    StreamConverters.fromInputStream(() => pipeOut).async("blocking-2")
  )
with the following config:
blocking-1 {
  type = "Dispatcher"
  executor = "thread-pool-executor"
  throughput = 1
  thread-pool-executor {
    fixed-pool-size = 2
  }
}

blocking-2 {
  type = "Dispatcher"
  executor = "thread-pool-executor"
  throughput = 1
  thread-pool-executor {
    fixed-pool-size = 2
  }
}
Because they don't share the pools anymore, both fromOutputStream and fromInputStream can perform their tasks independently.
Also note that I assigned just 2 threads per pool to show that it's not about the thread count but about the pool separation.
I hope this helps you understand Akka Streams better.
It turns out this was a limit at the Akka configuration level on the blocking IO dispatcher's thread pool.
Changing that value to something bigger than the number of streams fixed the issue:
akka.actor.default-blocking-io-dispatcher.thread-pool-executor.fixed-pool-size = 50
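For completeness, a sketch of one way to apply this override programmatically rather than via application.conf (the system name here is arbitrary):

import akka.actor.ActorSystem
import com.typesafe.config.ConfigFactory

// Raise the blocking IO dispatcher pool size above the number of streams,
// then fall back to the regular configuration for everything else.
val config = ConfigFactory
  .parseString("akka.actor.default-blocking-io-dispatcher.thread-pool-executor.fixed-pool-size = 50")
  .withFallback(ConfigFactory.load())

implicit val system: ActorSystem = ActorSystem("MyApp", config)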
I am uploading a single file to an SFTP server using Alpakka, but once the file is uploaded and I have gotten the success response, the Sink stays open. How do I drain it?
I started off with this:
val sink = Sftp.toPath(path, settings, false)
val source = Source.single(ByteString(data))
source
  .toMat(sink)(Keep.right).run()
  .map(_.wasSuccessful)
But that ends up never leaving the map step.
I tried to add a killswitch, but that seems to have had no effect (neither with shutdown nor with abort):
val sink = Sftp.toPath(path, settings, false)
val source = Source.single(ByteString(data))
val (killswitch, result) = source
  .viaMat(KillSwitches.single)(Keep.right)
  .toMat(sink)(Keep.both).run()
result.map { r =>
  killswitch.shutdown()
  r.wasSuccessful
}
Am I doing something fundamentally wrong? I just want one result.
EDIT: The settings sent in to toPath:
SftpSettings(InetAddress.getByName(host))
  .withCredentials(FtpCredentials.create(amexUsername, amexPassword))
  .withStrictHostKeyChecking(false)
By asking you to put Await.result(result, Duration.Inf) at the end, I wanted to check the theory expressed by A. Gregoris. Thus it's either that
your app exits before the Future completes, or
(if your app doesn't exit) the function in which you do this discards the result.
If your app doesn't exit, you can try using result.onComplete to do the necessary work.
I cannot see your whole code, but it seems to me that in the snippet you posted, result is a Future that does not complete before your program ends, and that is why the code in the map is never executed.
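To make that concrete, a minimal sketch (assuming the result future from the second snippet in the question):

import scala.concurrent.Await
import scala.concurrent.duration.Duration

// Block the main thread until the upload finishes, so the app cannot
// exit before the Future completes.
val ioResult = Await.result(result, Duration.Inf)
println(s"Upload successful: ${ioResult.wasSuccessful}")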
I have a source that groups elements and a sink that makes a batch request. I'm using a KillSwitch to be able to shut down the graph at some arbitrary point in time. The problem is that the records of the latest incomplete batch the source outputs are lost when switch.shutdown() is called.
val source = Source.tick(10.millis, 10.millis, "tick").grouped(500)

val (switch, _) = source
  .viaMat(KillSwitches.single)(Keep.right)
  .toMat(sink)(Keep.both).run()

Thread.sleep(3000) // wait some arbitrary time
switch.shutdown()
Is there a way to 'flush out' the incomplete batch when shutdown happens?
The behaviour of the kill switch shutdown is positional, as per its docs:
After calling [[UniqueKillSwitch#shutdown()]] the running instance of
the [[Graph]] of [[FlowShape]] that materialized to the
[[UniqueKillSwitch]] will complete its downstream and cancel its
upstream (unless if finished or failed already in which case the
command is ignored).
See also more docs here.
Now, the grouped stage will emit a partially filled group only at completion time, not when cancelled.
This means that the graph below (grouped before the killswitch) will behave as you observed:
val switch =
  Source.tick(10.millis, 175.millis, "tick")
    .grouped(10)
    .viaMat(KillSwitches.single)(Keep.right)
    .toMat(Sink.foreach(println))(Keep.left)
    .run()
whilst the graph below (grouped after the killswitch) will emit partial groups downstream at completion:
val switch =
  Source.tick(10.millis, 175.millis, "tick")
    .viaMat(KillSwitches.single)(Keep.right)
    .grouped(10)
    .toMat(Sink.foreach(println))(Keep.left)
    .run()
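Applied to the snippet from the question, the fix is therefore to move grouped after the kill switch, along these lines (a sketch, reusing the sink from the question):

val (switch, _) = Source.tick(10.millis, 10.millis, "tick")
  .viaMat(KillSwitches.single)(Keep.right)
  .grouped(500) // grouping now happens downstream of the kill switch
  .toMat(sink)(Keep.both)
  .run()

Thread.sleep(3000) // wait some arbitrary time
switch.shutdown()  // the incomplete batch is flushed downstream on completion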
I'm new to Scala and trying to hack my way around sending serialized Java objects over a RabbitMQ queue to a Spark Streaming application.
I can successfully enqueue my objects, which have been serialized with an ObjectOutputStream. To receive them on the Spark end I have downloaded a custom RabbitMQ InputDStream and Receiver implementation from here - https://github.com/Stratio/rabbitmq-receiver
However, as I understand it, that codebase only supports String messages, not binary. So I started hacking on the code to make it read a binary message and store it as a byte array, so that I can deserialize it on the Spark end. That attempt is here - https://github.com/llevar/rabbitmq-receiver
I then have the following code in my Spark driver program:
import org.apache.commons.lang3.SerializationUtils
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.ReceiverInputDStream
// RabbitMQUtils comes from the custom receiver linked above;
// SAMRecord comes from the SAM/BAM library providing that class

val conf = new SparkConf().setMaster("local[6]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

val receiverStream: ReceiverInputDStream[scala.reflect.ClassTag[AnyRef]] =
  RabbitMQUtils.createStreamFromAQueue(ssc,
    "localhost",
    5672,
    "mappingQueue",
    StorageLevel.MEMORY_AND_DISK_SER_2)

val parsedStream = receiverStream.map { m =>
  SerializationUtils.deserialize(m.asInstanceOf[Array[Byte]]).asInstanceOf[SAMRecord]
}

parsedStream.print()

ssc.start()
Unfortunately this does not seem to work. The data is consumed off the queue, and I don't get any errors, but I don't get any of the output I expect either.
This is all I get.
2015-07-24 23:33:38 WARN BlockManager:71 - Block input-0-1437795218845 replicated to only 0 peer(s) instead of 1 peers
2015-07-24 23:33:38 WARN BlockManager:71 - Block input-0-1437795218846 replicated to only 0 peer(s) instead of 1 peers
2015-07-24 23:33:38 WARN BlockManager:71 - Block input-0-1437795218847 replicated to only 0 peer(s) instead of 1 peers
2015-07-24 23:33:38 WARN BlockManager:71 - Block input-0-1437795218848 replicated to only 0 peer(s) instead of 1 peers
I was able to successfully deserialize my objects before the call to the store() method here - https://github.com/llevar/rabbitmq-receiver/blob/master/src/main/scala/com/stratio/receiver/RabbitMQInputDStream.scala#L106 - just by invoking SerializationUtils on the data from the delivery.getBody call, but I don't seem to be able to get the same data back from the DStream in my main program.
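For reference, this is the round trip I'd expect from commons-lang3's SerializationUtils, shown here with a hypothetical payload class rather than the real SAMRecord:

import org.apache.commons.lang3.SerializationUtils

// Hypothetical stand-in for the real message type; must be Serializable.
case class Payload(id: Int, name: String) extends Serializable

val bytes: Array[Byte] = SerializationUtils.serialize(Payload(1, "read-001"))
val restored: Payload = SerializationUtils.deserialize(bytes)
assert(restored == Payload(1, "read-001"))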
Any help is appreciated.
I am using the dispatch reboot library version 0.9.5 (http://dispatch.databinder.net/Dispatch.html) in my project. Via sbt I have the following line:
libraryDependencies += "net.databinder.dispatch" %% "dispatch-core" % "0.9.5"
In the Scala (2.9.2) REPL (started using sbt console to get the appropriate dependency), and independent of my code, I run the following session:
import dispatch._
import java.util.concurrent.TimeUnit._

val spoo = Http.threads(1).waiting(Duration(10, SECONDS))
(I believe the third line sets up my own thread pool with one thread and a timeout of 10 seconds.)
I then run this code repeatedly (in paste mode) to submit a future that fetches a particular URL and then prints the status code (asynchronously):
spoo(url("http://www.evapcool.com/products/commercial/")).either
  .map {
    case Right(r) => println("S: " + r.getStatusCode())
    case Left(e)  => println("E: " + e.toString)
  }
Each time I run this line I wait for the status code to be printed before running it again. For the first twenty to forty calls it works as expected; then it reliably fails to report either a successful page reply or an exception. My assumption was that if this is caused by a timeout, the callback should fire after 10 seconds with the Left side of the Either containing some form of timeout exception. But in my experience that never comes.
Can anyone help tell me what I'm doing wrong?
Update
By the way, I am aware that there is a similar question (with an answer) here, but I am looking for the official way to handle timeouts (i.e. the one intended by the library author), and it appears to me that this is what the waiting method is designed for.
So here is an answer, although one I'm not enormously satisfied with, because it involves ignoring the rather nice-looking waiting method on Http and playing directly with the async-http-client APIs (inspired by the SO post linked in the question update):
val spoo = Http.threads(1).configure { builder =>
  builder.setRequestTimeoutInMs(10000)
  builder.setConnectionTimeoutInMs(10000)
  builder
}
My code now runs as expected. Ho hum...