Why does Source.tick stop after one hundred HttpRequests? - scala

Using Akka Streams and Akka HTTP, I have created a stream which polls an API every 3 seconds, unmarshals the result to a JsValue object and sends this result to an actor, as can be seen in the following code:
// Source which performs an http request every 3 seconds.
val source = Source.tick(0.seconds,
                         3.seconds,
                         HttpRequest(uri = Uri(path = Path("/posts/1"))))

// Processes the result of the http request
val flow = Http().outgoingConnectionHttps("jsonplaceholder.typicode.com").mapAsync(1) {
  // Able to reach the API.
  case HttpResponse(StatusCodes.OK, _, entity, _) =>
    // Unmarshal the json response.
    Unmarshal(entity).to[JsValue]
  // Failed to reach the API.
  case HttpResponse(code, _, entity, _) =>
    entity.discardBytes()
    Future.successful(code.toString())
}

// Run stream
source.via(flow).runWith(Sink.actorRef[Any](processJsonActor, akka.actor.Status.Success("Completed stream")))
This works; however, the stream closes after 100 HttpRequests (ticks).
What is the cause of this behaviour?

This is definitely something to do with outgoingConnectionHttps. It is a low-level DSL, and there could be some misconfigured setting somewhere that is causing this (although I couldn't figure out which one).
Usage of this DSL is actually discouraged by the docs.
Try using a higher-level DSL such as a cached host connection pool:
val flow = Http().cachedHostConnectionPoolHttps[NotUsed]("akka.io").mapAsync(1) {
  // Able to reach the API.
  case (Success(HttpResponse(StatusCodes.OK, _, entity, _)), _) =>
    // Unmarshal the json response.
    Unmarshal(entity).to[String]
  // Failed to reach the API.
  case (Success(HttpResponse(code, _, entity, _)), _) =>
    entity.discardBytes()
    Future.successful(code.toString())
  case (Failure(e), _) =>
    throw e
}

// Run stream
source.map(_ -> NotUsed).via(flow).runWith(...)

A potential issue is that there is no backpressure signal with Sink.actorRef, so the actor's mailbox could be getting full. If the actor, whenever it receives a JsValue object, is doing something that could take a long time, use Sink.actorRefWithAck instead. For example:
val initMessage = "start"
val completeMessage = "done"
val ackMessage = "ack"

source
  .via(flow)
  .runWith(Sink.actorRefWithAck[Any](
    processJsonActor, initMessage, ackMessage, completeMessage))
You would need to change the actor to handle an initMessage and to reply to the stream for every stream element with an ackMessage (with sender() ! ackMessage). More information on Sink.actorRefWithAck can be found in the Akka documentation.
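A minimal sketch of such an actor, assuming the string messages defined above and spray-json's JsValue as the element type (the actor name and processing logic are illustrative):
import akka.actor.{Actor, ActorLogging}
import spray.json.JsValue

class ProcessJsonActor extends Actor with ActorLogging {
  override def receive: Receive = {
    case "start" =>                // initMessage
      sender() ! "ack"             // signal readiness for the first element
    case json: JsValue =>
      // ... process the element ...
      sender() ! "ack"             // acknowledge so the stream emits the next element
    case "done" =>                 // completeMessage
      log.info("stream completed")
  }
}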

Related

How to handle backpressure when Streaming file from s3 with actor interop

I am trying to download a large file from S3 and send its data to another actor that performs an HTTP request, and then to persist the response. I want to limit the number of requests sent by that actor, hence I need to handle backpressure.
I tried doing something like this:
S3.download(bckt, bcktKey).map {
  case Some((file, _)) =>
    file
      .via(CsvParsing.lineScanner())
      .map(_.map(_.utf8String)).drop(1) // drop headers
      .map(p => Foo(p.head, p(1)))
      .mapAsync(30) { p =>
        implicit val askTimeout: Timeout = Timeout(10.seconds)
        (httpClientActor ? p).mapTo[Buzz]
      }
      .mapAsync(1) {
        case b @ Buzz(_, _) =>
          (persistActor ? b).mapTo[Done]
      }
      .runWith(Sink.head)
}
The problem is that I see it reads only 30 lines from the file, which is the limit set for parallelism. I am not sure that this is the correct way to achieve what I'm looking for.
As Johny notes in his comment, the Sink.head is what causes the stream to only process about 30 elements. What happens is approximately:
Sink.head signals demand for 1 element
this demand propagates up through the second mapAsync
when the demand reaches the first mapAsync, since that one has parallelism 30, it signals demand for 30 elements
the CSV parsing stages emit 30 elements
when the response to the ask with the first element from the client actor is received, the response propagates down to the ask of the persist actor
demand is signaled for one more element from the CSV parsing stages
when the persist actor responds, the response goes to the sink
since the sink is Sink.head which cancels the stream once it receives an element, the stream gets torn down
any asks of the client actor which have been sent but are awaiting a response will still get processed
There's a bit of a race between the persist actor's response and the CSV parsing and sending an ask to the client actor: if the latter is faster, 31 lines might get processed by the client actor.
If you just want a Future[Done] after every element has been processed, Sink.last will work very well with this code.
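As a tiny standalone illustration of the difference (assuming an implicit ActorSystem and materializer are in scope):
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

// Sink.head cancels upstream as soon as the first element arrives...
val first: Future[Int] = Source(1 to 100).runWith(Sink.head)
// ...whereas Sink.last keeps demanding until upstream completes, so its Future
// only completes after all 100 elements have flowed through the stream.
val last: Future[Int] = Source(1 to 100).runWith(Sink.last)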
If the reason is not the usage of Sink.head as I mentioned in the comment, you can backpressure the stream using Sink.actorRefWithBackpressure.
Sample code:
class PersistActor extends Actor {
  override def receive: Receive = {
    case "init" =>
      println("Initialized")
      sender() ! Done // ack the init message so the stream starts sending elements
    case "complete" =>
      context.stop(self)
    case message =>
      // Persist Buzz??
      sender() ! Done
  }
}

val sink = Sink
  .actorRefWithBackpressure(persistActor, "init", Done, "complete", PartialFunction.empty)
S3.download(bckt, bcktKey).map {
  case Some((file, _)) =>
    file
      .via(CsvParsing.lineScanner())
      .map(_.map(_.utf8String)).drop(1) // drop headers
      .map(p => Foo(p.head, p(1)))
      // You could backpressure here too...
      .mapAsync(30) { p =>
        implicit val askTimeout: Timeout = Timeout(10.seconds)
        (httpClientActor ? p).mapTo[Buzz]
      }
      .to(sink)
      .run()
}

How to wait for file upload stream to complete in Akka actor

Recently I started using Akka, and I am using it to create a REST API with Akka HTTP to upload a file. The file can have millions of records, and for each record I need to perform some validation and business logic. The way I have modeled my actors is: the root actor receives the file stream, converts the bytes to a String and then splits the records by the line separator. After doing this it sends the stream (record by record) to another actor for processing, which in turn distributes the records to other actors based on some grouping. To send the stream from the main root actor to the processing actor I am using Sink.actorRefWithAck.
This works fine for a small file, but for a large file what I have observed is that I get multiple chunks and only the first chunk gets processed. If I add a Thread.sleep for a few seconds based on the load, then the whole file is processed. I am wondering if there is any way I can know that the stream has been completely consumed by the processing actor, so that I don't have to rely on Thread.sleep. Here is the code snippet that I have used:
val AckMessage = DefaultFileUploadProcessActor.Ack

val receiver = context.system.actorOf(
  Props(new DefaultFileUploadProcessActor(uuid, sourceId)(self, ackWith = AckMessage)))

// sent from stream to actor to indicate start, end or failure of stream:
val InitMessage = DefaultFileUploadProcessActor.StreamInitialized
val OnCompleteMessage = DefaultFileUploadProcessActor.StreamCompleted
val onErrorMessage = (ex: Throwable) => DefaultFileUploadProcessActor.StreamFailure(ex)

val actorSink = Sink.actorRefWithAck(
  receiver,
  onInitMessage = InitMessage,
  ackMessage = AckMessage,
  onCompleteMessage = OnCompleteMessage,
  onFailureMessage = onErrorMessage
)

val processStream =
  fileStream
    .map(byte => byte.utf8String.split(System.lineSeparator()))
    .runWith(actorSink)

Thread.sleep(9000)
log.info(s"completed distribution of data to the actors")
sender() ! ActionPerformed(uuid, "Done")
Any expert advice on the approach I have taken will be highly appreciated.
If you have a Source with only one file, you can await stream completion by awaiting the Future that is returned from the runWith method.
If you have a Source of multiple files, you should write something like:
filesSource
  .mapAsync(1)(data => (receiver ? data).mapTo[ProcessingResult])
  .mapAsync(1)(processingResult => (resultListener ? processingResult).mapTo[ListenerResponse])
  .runWith(Sink.ignore)
Assuming that fileStream is a Source[ByteString, Future[IOResult]], one idea is to retain the materialized value of the source, then fire off the reply to the sender once this materialized value has completed:
val replyTo = sender() // capture the sender before the asynchronous callback

val processStream: Future[IOResult] =
  fileStream
    .map(_.utf8String.split(System.lineSeparator()))
    .to(actorSink)
    .run()

processStream.onComplete {
  case Success(_) =>
    log.info("completed distribution of data to the actors")
    replyTo ! ActionPerformed(uuid, "Done")
  case Failure(t) =>
    // ...
}
The above approach ensures that the entire file is consumed before the sender is notified.
Note that Akka Streams has a Framing object that can parse lines from a ByteString stream:
val processStream: Future[IOResult] =
  fileStream
    .via(Framing.delimiter(
      ByteString(System.lineSeparator()),
      maximumFrameLength = 256,
      allowTruncation = true))
    .map(_.utf8String)
    .to(actorSink) // the actor will have to expect String, not Array[String], messages
    .run()
The receiver actor will receive the OnCompleteMessage or onErrorMessage when the stream completes successfully or with a failure, so you should handle those messages in the receive block of the DefaultFileUploadProcessActor.
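A rough sketch of that actor (the constructor parameters from the original snippet are omitted, and the element type is assumed to be the String lines produced by the Framing variant above):
import akka.actor.{Actor, ActorLogging}

object DefaultFileUploadProcessActor {
  case object Ack
  case object StreamInitialized
  case object StreamCompleted
  final case class StreamFailure(ex: Throwable)
}

class DefaultFileUploadProcessActor extends Actor with ActorLogging {
  import DefaultFileUploadProcessActor._

  override def receive: Receive = {
    case StreamInitialized =>
      sender() ! Ack // signal demand for the first element
    case line: String =>
      // ... validate the record and apply the business logic ...
      sender() ! Ack // request the next element
    case StreamCompleted =>
      log.info("upload stream completed")
    case StreamFailure(ex) =>
      log.error(ex, "upload stream failed")
  }
}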

Sending actorRefWithAck inside stream

I'm using the answer from this thread because I need to treat the first element specially. The problem is, I need to send this data to another actor or persist it locally (which is not possible).
So, my stream looks like this:
val flow: Flow[Message, Message, (Future[Done], Promise[Option[Message]])] =
  Flow.fromSinkAndSourceMat(
    Flow[Message].mapAsync[Trade](1) {
      case TextMessage.Strict(text) =>
        Unmarshal(text).to[Trade]
      case streamed: TextMessage.Streamed =>
        streamed.textStream.runFold("")(_ ++ _).flatMap(Unmarshal(_).to[Trade])
    }.groupBy(pairs.size, _.s).prefixAndTail(1).flatMapConcat {
      case (head, tail) =>
        // sending first element here
        val result = Source(head).to(Sink.actorRefWithAck(
          ref = actor,
          onInitMessage = Init,
          ackMessage = Ack,
          onCompleteMessage = "done"
        )).run()
        // some kind of operation on the result
        Source(head).concat(tail)
    }.mergeSubstreams.toMat(sink)(Keep.right),
    Source.maybe[Message])(Keep.both)
Is this good practice? Will it have unintended consequences? Unfortunately, I cannot call persist inside the stream, so I want to send this data to an external system.
Your current approach doesn't use result in any way, so a simpler alternative would be to fire and forget the first Message to the actor:
groupBy(pairs.size, _.s).prefixAndTail(1).flatMapConcat {
  case (head, tail) =>
    // sending first element here
    actor ! head.head
    Source(head).concat(tail)
}
The actor would then not have to worry about handling Init and sending Ack messages and could be solely concerned with persisting Message instances.
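For illustration only, such an actor could look like the following sketch (Trade is the unmarshalled element type from the snippet above; the actor name and forwarding logic are placeholders):
import akka.actor.{Actor, ActorLogging}

class FirstElementActor extends Actor with ActorLogging {
  override def receive: Receive = {
    case trade: Trade =>
      // send the first element of each substream to the external system here
      log.info(s"received first element: $trade")
  }
}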

Selective request-throttling using akka-http stream

I have one API which calls two other downstream APIs. One downstream API (https://test/foo) is really important and very fast. The other, slower downstream API (https://test/bar) has a limitation: it can only handle a throughput of 50 requests per second.
I would like to make sure the downstream API https://test/foo has higher priority than https://test/bar. For example, if the API thread pool is 75, I only allow 50 parallel incoming connections to go through https://test/bar; the rest of the connections should be used for https://test/foo. That way, https://test/bar would never fail.
I guess I should apply throttle or maybe buffer with OverflowStrategy.dropNew for https://test/bar.
Here is the code snippet.
implicit val actorSystem = ActorSystem("api")
implicit val flowMaterializer = ActorMaterializer()

val httpService = Http()

val serverSource: Source[Http.IncomingConnection, Future[Http.ServerBinding]] =
  httpService.bind(interface = "0.0.0.0", 3000)

val binding: Future[Http.ServerBinding] =
  serverSource
    .to(Sink.foreach { connection =>
      connection.handleWith(
        Flow[HttpRequest]
          .map {
            case HttpRequest(GET, Uri.Path("/priority-1"), _, _, _) =>
              HttpResponse(entity = scala.io.Source.fromURL("https://test/foo").mkString)
            case HttpRequest(GET, Uri.Path("/priority-2"), _, _, _) =>
              HttpResponse(entity = scala.io.Source.fromURL("https://test/bar").mkString)
          }
      )
    }).run()
Question 1: where should I put throttle(50, 1.second, 5000, ThrottleMode.Shaping) so that it enforces only the https://test/bar threshold?
Question 2: do I need to apply buffer and OverflowStrategy.dropNew if I want to prioritise https://test/foo requests? In other words, all unnecessary connections for https://test/bar should be removed.
Question 3: is there a better way to implement this requirement? I am using connection.handleWith[Flow[HttpRequest, HttpResponse]] in the Sink and I am not sure this is the right place.
If there are some code snippet provided, that would be much appreciated and super awesome :)
Thanks in advance

Scala Dispatch library: how to handle connection failure or timeout?

I've been using the Databinder Dispatch library in a client for a simple REST-ish API. I know how to detect if I get an HTTP response with an error status:
Http x (request) {
  case (200, _, _, content) => successResult(content())
  case (404, _, _, _) => notFoundErrorResult
  case (_, _, _, _) => genericErrorResult
}
But how can I distinguish an error response from a failure to get any response at all, because of an invalid domain or failure to connect? And is there any way to implement a timeout while still using synchronous semantics? If there's anything relevant in the API, I've missed it.
There is also a more elegant way to configure the client, using the Http.configure method, which receives a Builder => Builder function as an argument:
val http = Http.configure(_.setAllowPoolingConnection(true).setConnectionTimeoutInMs(5000))
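The configured instance is then used in place of the default Http executor; for example (the URL here is just a placeholder):
val result = http(url("https://example.com/resource") OK as.String)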
The Periodic Table tells us that >! sets up an exception listener and a recent mailing list thread explains how to set a timeout.
All together, then, you might do something like:
val http = new dispatch.Http {
  import org.apache.http.params.CoreConnectionPNames
  client.getParams.setParameter(CoreConnectionPNames.CONNECTION_TIMEOUT, 2000)
  client.getParams.setParameter(CoreConnectionPNames.SO_TIMEOUT, 5000)
}

http(req >! {
  case e => // ...
})
Note that I haven't tested this...
In case you are using Dispatch reboot (with AsyncHttpClient as the underlying library) this is how you'd set the client configuration:
val myHttp = new dispatch.Http {
  import com.ning.http.client._
  val builder = new AsyncHttpClientConfig.Builder()
  builder.setCompressionEnabled(true)
    .setAllowPoolingConnection(true)
    .setRequestTimeoutInMs(5000)
  override lazy val client = new AsyncHttpClient(builder.build())
}
and then just use this new object as you'd otherwise use http:
myHttp((url(baseUrl) <<? args) OK as.xml.Elem).either