Akka Sink never closes - scala

I am uploading a single file to an SFTP server using Alpakka but once the file is uploaded and I have gotten the success response the Sink stays open, how do I drain it?
I started off with this:
val sink = Sftp.toPath(path, settings, false)
val source = Source.single(ByteString(data))
​
source
.viaMat(KillSwitches.single)(Keep.right)
.toMat(sink)(Keep.both).run()
.map(_.wasSuccessful)
But that ends up never leaving the map step.
I tried to add a killswitch, but that seems to have had no effect (neither with shutdown or abort):
val sink = Sftp.toPath(path, settings, false)
val source = Source.single(ByteString(data))
​
val (killswitch, result) = source
.viaMat(KillSwitches.single)(Keep.right)
.toMat(sink)(Keep.both).run()
result.map {
killswitch.shutdown()
_.wasSuccessful
}
Am I doing something fundamentally wrong? I just want one result.
EDIT The settings sent in to toPath:
SftpSettings(InetAddress.getByName(host))
.withCredentials(FtpCredentials.create(amexUsername, amexPassword))
.withStrictHostKeyChecking(false)

By asking you to put Await.result(result, Duration.Inf) at the end I wanted to check the theory expressed by A. Gregoris. Thus it's either
your app exits before Future completes or
(if you app does't exit) function in which you do this discards result
If your app doesn't exit you can try using result.onComplete to do necessary work.

I cannot see your whole code but it seems to me that in the snippet you posted that result value is a Future that is not completing before the end of your program execution and that is because the code in the map is not being executed either.

Related

TimeoutException when consuming files from S3 with akka streams

I'm trying to consume a bunch of files from S3 in a streaming manner using akka streams:
S3.listBucket("<bucket>", Some("<common_prefix>"))
.flatMapConcat { r => S3.download("<bucket>", r.key) }
.mapConcat(_.toList)
.flatMapConcat(_._1)
.via(Compression.gunzip())
.via(Framing.delimiter(ByteString("\n"), Int.MaxValue))
.map(_.utf8String)
.runForeach { x => println(x) }
Without increasing akka.http.host-connection-pool.response-entity-subscription-timeout I get
java.util.concurrent.TimeoutException: Response entity was not subscribed after 1 second. Make sure to read the response entity body or call discardBytes() on it. for the second file, just after printing the last line of the first file, when trying to access the first line of the second file.
I understand the nature of that exception. I don't understand why the request to the second file is already in progress, while the first file is still being processed. I guess, there's some buffering involved.
Any ideas how to get rid of that exception without having to increase akka.http.host-connection-pool.response-entity-subscription-timeout?
Instead of merging the processing of downloaded files into one stream with flatMapConcat you could try materializing the stream within the outer stream and fully process it there before emitting your output downstream. Then you shouldn't begin downloading (and fully processing) the next object until you're ready.
Generally you want to avoid having too many stream materializations to reduce overhead, but I suspect that would be negligible for an app performing network I/O like this.
Let me know if something like this works: (warning: untested)
S3.listBucket("<bucket>", Some("<common_prefix>"))
.mapAsync(1) { result =>
val contents = S3.download("<bucket>", r.key)
.via(Compression.gunzip())
.via(Framing.delimiter(ByteString("\n"), Int.MaxValue))
.map(_.utf8String)
.to(Sink.seq)(Keep.right)
.run()
contents
}
.mapConcat(identity)
.runForeach { x => println(x) }

Refresh cached values in spark streaming without reboot the batch

Maybe the question is too simple, at least look like that, but I have the following problem:
A. Execute spark submit in spark streaming process.
ccc.foreachRDD(rdd => {
rdd.repartition(20).foreachPartition(p => {
val repo = getReposX
t foreach (g => {
.................
B. getReposX is a function which one make a query in mongoDB recovering a Map wit a key/value necessary in every executor of the process.
C. Into each g in foreach I manage this map "cached"
The problem or the question is when in the collection of mongo anything change, I don't watch or I don't detect the change, therefore I am managing a Map not updated. My question is: How can I get it? Yes, I know If I reboot the spark-submit and the driver is executed again is OK but otherwise I will never see the update in my Map.
Any ideas or suggestion?
Regards.
Finally I Developed a solution. First I explain the question more in detail because what I really wanted to know is how to implement an object or "cache", which was refreshed every so often, or by some kind of order, without the need to restart the spark streaming process, that is, it would refresh in alive.
In my case this "cache" or refreshed object is an object (Singleton) that connected to a collection of mongoDB to recover a HashMap that was used by each executor and was cached in memory as a good Singleton. The problem with this was that once the spark streaming submit is executed it was cached that object in memory but it was not refreshed unless the process was restarted. Think of a broadcast as a counter mode to refresh when the variable reaches 1000, but these are read only and can not be modified. Think of an counter, but these can only be read by the driver.
Finally my solution is, within the initialization block of the object that loads the mongo collection and the cache, I implement this:
//Initialization Block
{
val ex = new ScheduledThreadPoolExecutor(1)
val task = new Runnable {
def run() = {
logger.info("Refresh - Inicialization")
initCache
}
}
val f = ex.scheduleAtFixedRate(task, 0, TIME_REFRES, TimeUnit.SECONDS)
}
initCache is nothing more than a function that connects mongo and loads a collection:
var cache = mutable.HashMap[String,Description]()
def initCache():mutable.HashMap[String, Description]={
val serverAddresses = Functions.getMongoServers(SERVER, PORT)
val mongoConnectionFactory = new MongoCollectionFactory(serverAddresses,DATABASE,COLLECTION,USERNAME,PASSWORD)
val collection = mongoConnectionFactory.getMongoCollection()
val docs = collection.find.iterator()
cache.clear()
while (docs.hasNext) {
var doc = docs.next
cache.put(...............
}
cache
}
In this way, once the spark streaming submit has been started, each task will open one more task, which will make every X time (1 or 2 hours in my case) refresh the value of the singleton collection, and always recover that value instantiated:
def getCache():mutable.HashMap[String, Description]={
cache
}

Akka streams - shutdown stream with grouping without losing data

I have a source that groups elements and a sink that makes a batch request,
I'm using KillSwitch to be able to shutdown the graph at some arbitrary point in time. The problem that records of the latest incomplete batch that source outputs are getting lost when switch.shutdown() is being called
val source = Source.tick(10.millis, 10.millis, "tick").grouped(500)
val (switch, _) = source.viaMat(KillSwitches.single)(Keep.right)
.toMat(sink)(Keep.both).run()
Thread.sleep(3000) // wait some arbitrary time
switch.shutdown()
Is there a way to 'flush out' the incomplete batch when shutdown happens?
The behaviour of the kill switch shutdown is positional, as per its docs
After calling [[UniqueKillSwitch#shutdown()]] the running instance of
the [[Graph]] of [[FlowShape]] that materialized to the
[[UniqueKillSwitch]] will complete its downstream and cancel its
upstream (unless if finished or failed already in which case the
command is ignored).
See also more docs here.
Now the grouped stage will emit a partially filled group only at completion time, but not when cancelled.
This means that the graph below (grouped before killswitch) will behave like you observed
val switch =
Source.tick(10.millis, 175.millis, "tick")
.grouped(10)
.viaMat(KillSwitches.single)(Keep.right)
.toMat(Sink.foreach(println))(Keep.left)
.run()
whilst the graph below (grouped after killswitch) will emit partial groups downstream at completion
val switch =
Source.tick(10.millis, 175.millis, "tick")
.viaMat(KillSwitches.single)(Keep.right)
.grouped(10)
.toMat(Sink.foreach(println))(Keep.left)
.run()

Proper way to stop Akka Streams on condition

I have been successfully using FileIO to stream the contents of a file, compute some transformations for each line and aggregate/reduce the results.
Now I have a pretty specific use case, where I would like to stop the stream when a condition is reached, so that it is not necessary to read the whole file but the process finishes as soon as possible. What is the recommended way to achieve this?
If the stop condition is "on the outside of the stream"
There is a advanced building-block called KillSwitch that you could use to do this: http://doc.akka.io/japi/akka/2.4.7/akka/stream/KillSwitches.html The stream would get shut down once the kill switch is notified.
It has methods like abort(reason) / shutdown etc, see here for it's API: http://doc.akka.io/japi/akka/2.4.7/akka/stream/SharedKillSwitch.html
Reference documentation is here: http://doc.akka.io/docs/akka/2.4.8/scala/stream/stream-dynamic.html#kill-switch-scala
Example usage would be:
val countingSrc = Source(Stream.from(1)).delay(1.second,
DelayOverflowStrategy.backpressure)
val lastSnk = Sink.last[Int]
val (killSwitch, last) = countingSrc
.viaMat(KillSwitches.single)(Keep.right)
.toMat(lastSnk)(Keep.both)
.run()
doSomethingElse()
killSwitch.shutdown()
Await.result(last, 1.second) shouldBe 2
If the stop condition is inside the stream
You can use takeWhile to express any condition really, though sometimes take or limit may be also enough "take 10 lnes".
If your logic is very advanced, you could build a special stage that handles that special logic using statefulMapConcat that allows to express literally anything - so you could complete the stream whenever you want to "from the inside".

Sending a message to a TestProbe() fails sometimes with ActorInitializationException

I've some sporadic test failures and struggle to figure out why. I have a bunch of actors which to the work which I want to test. At the beginning of the test I pass in an actor reference which I get from a TestProbe(). Later on the group of actors do some work and send the result to the given test probe actor reference. Then I check the result with the TestProbe():
class MyCaseSpec extends Spec with ShouldMatchers{
describe("The Thingy"){
it("should work"){
val eventListener = TestProbe()
val myStuffUnderTest = Actor.actorOf(new ComplexActor(eventListener.ref)).start();
myStuffUnderTest ! "Start"
val eventMessage = eventListener.receiveOne(10.seconds).asInstanceOf[SomeEventMessage]
eventMessage.data should be ("Result")
}
}
}
Now once in a while the test fails. And when I look through the stack trace I see that I got a 'ActorInitializationException' when sending a message to the test probe actor. However at no point in time I stop the TestProbe actor.
Here's the exception:
[akka:event-driven:dispatcher:global-11] [LocalActorRef] Actor has not been started, you need to invoke 'actor.start()' before using it
akka.actor.ActorInitializationException: Actor has not been started, you need to invoke 'actor.start()' before using it
[Gamlor-Laptop_c15fdca0-219e-11e1-9579-001b7744104e]
at akka.actor.ScalaActorRef$class.$bang(ActorRef.scala:1399)
at akka.actor.LocalActorRef.$bang(ActorRef.scala:605)
at akka.mobile.client.RemoteMessaging$RemoteMessagingSupervision$$anonfun$receive$1.apply(RemoteMessaging.scala:125)
at akka.mobile.client.RemoteMessaging$RemoteMessagingSupervision$$anonfun$receive$1.apply(RemoteMessaging.scala:121)
at akka.actor.Actor$class.apply(Actor.scala:545)
....
I'm wondering if I'm missing something obvious or am I making a subtle mistake? Or maybe something is really going wrong inside my code and I can't see it?
I'm on Akka 1.2.
Update for Vitors-Comment. At line 125 I send a message to an actor with the !-operator. Now in the test-setup thats the TestProbe actor-reference. And I can't figure out why sometimes the TestProbe actor seems to be stopped.
protected def receive = {
case msg: MaximumNumberOfRestartsWithinTimeRangeReached => {
val lastException = msg.getLastExceptionCausingRestart
faultHandling ! ConnectionError(lastException, messages.toList, self) // < Line 125. The faultHandling is the TestProbe actor
become({
// Change to failure-state behavior
}
// Snip
Anyway, I'm trying to isolate the problem further for the time being. Thanks for any hint / idea.
You are not starting your actor here. I'm not sure why your test is working some of the time. The code above needs to have the following line modified with an .start()
val myStuffUnderTest = Actor.actorOf(new ComplexActor(eventListener.ref)).start();
Ok, almost certainly found the issue =). TestProbes do have a timeout: When nothing happens after 5 seconds, they stop them self.
Now unfortunately the test takes just a little longer than 5 seconds: In that time the test-probe may stop itself all ready, which then causes the test to fail.
Fixing it is easy, increase the timeout on the TestProbe:
val errorHandler = ignoreConnectionMsgProbe()
errorHandler.setTestActorTimeout(20.seconds)