Alpakka S3 connector stream won't handle the load, throwing akka.stream.BufferOverflowException - scala

I have an akka-http service and I am trying out the Alpakka S3 connector for uploading files. Previously I was using a temporary file and then uploading with the Amazon SDK. That approach required some adjustments to make the SDK more Scala-like, but it could handle even 1000 requests at once. Throughput wasn't amazing, but all of the requests went through eventually. Here is the code before the change, with no Alpakka:
```
path("uploadfile") {
withRequestTimeout(20.seconds) {
storeUploadedFile("csv", tempDestination) {
case (metadata, file) =>
val uploadFuture = upload(file, file.toPath.getFileName.toString)
onComplete(uploadFuture) {
case Success(_) => complete(StatusCodes.OK)
case Failure(_) => complete(StatusCodes.FailedDependency)
}
}
}
}
}
case class S3UploaderException(msg: String) extends Exception(msg)

def upload(file: File, key: String): Future[String] = {
  val s3Client = AmazonS3ClientBuilder.standard()
    .withCredentials(new DefaultAWSCredentialsProviderChain())
    .withRegion(Regions.EU_WEST_3)
    .build()

  val promise = Promise[String]()

  val listener = new ProgressListener() {
    override def progressChanged(progressEvent: ProgressEvent): Unit = {
      (progressEvent.getEventType: @unchecked) match {
        case ProgressEventType.TRANSFER_FAILED_EVENT =>
          promise.failure(S3UploaderException(s"Uploading a file with a key: $key"))
        case ProgressEventType.TRANSFER_COMPLETED_EVENT |
             ProgressEventType.TRANSFER_CANCELED_EVENT =>
          promise.success(key)
      }
    }
  }

  val request = new PutObjectRequest("S3_BUCKET", key, file)
  request.setGeneralProgressListener(listener)
  s3Client.putObject(request)

  promise.future
}
```
When I changed this to use the Alpakka connector, the code looks much nicer, as we can just connect the ByteSource and the Alpakka Sink together. However, this approach cannot handle such a big load. When I execute 1000 requests at once (10 KB files), less than 10% go through and the rest fail with this exception:
```
akka.stream.alpakka.s3.impl.FailedUpload: Exceeded configured
max-open-requests value of [32]. This means that the request queue of
this pool
(HostConnectionPoolSetup(bargain-test.s3-eu-west-3.amazonaws.com,443,ConnectionPoolSetup(ConnectionPoolSettings(4,0,5,32,1,30 seconds,ClientConnectionSettings(Some(User-Agent: akka-http/10.1.3),10 seconds,1 minute,512,None,WebSocketSettings(,ping,Duration.Inf,akka.http.impl.settings.WebSocketSettingsImpl$$$Lambda$4787/1279590204@4d809f4c),List(),ParserSettings(2048,16,64,64,8192,64,8388608,256,1048576,Strict,RFC6265,true,Set(),Full,Error,Map(If-Range -> 0, If-Modified-Since -> 0, If-Unmodified-Since -> 0, default -> 12, Content-MD5 -> 0, Date -> 0, If-Match -> 0, If-None-Match -> 0, User-Agent -> 32),false,true,akka.util.ConstantFun$$$Lambda$4534/1539966798@69c23cd4,akka.util.ConstantFun$$$Lambda$4534/1539966798@69c23cd4,akka.util.ConstantFun$$$Lambda$4535/297570074@6b426c59),None,TCPTransport),New,1 second),akka.http.scaladsl.HttpsConnectionContext@7e0f3726,akka.event.MarkerLoggingAdapter@74f3a78b)))
has completely filled up because the pool currently does not process
requests fast enough to handle the incoming request load. Please retry
the request later. See
http://doc.akka.io/docs/akka-http/current/scala/http/client-side/pool-overflow.html
for more information.
```
Here is what the summary of a Gatling test looks like:
```
---- Response Time Distribution ----------------------------------------
t < 800 ms                                 0 (  0%)
800 ms < t < 1200 ms                       0 (  0%)
t > 1200 ms                               90 (  9%)
failed                                   910 ( 91%)
```
When I execute 100 simultaneous requests, half of them fail. So it is still nowhere close to satisfying.
This is the new code:
```
path("uploadfile") {
withRequestTimeout(20.seconds) {
extractRequestContext { ctx =>
implicit val materializer = ctx.materializer
extractActorSystem { actorSystem =>
fileUpload("csv") {
case (metadata, byteSource) =>
val uploadFuture = byteSource.runWith(S3Uploader.sink("s3FileKey")(actorSystem, materializer))
onComplete(uploadFuture) {
case Success(_) => complete(StatusCodes.OK)
case Failure(_) => complete(StatusCodes.FailedDependency)
}
}
}
}
}
}
def sink(s3Key: String)(implicit as: ActorSystem, m: Materializer) = {
  val regionProvider = new AwsRegionProvider {
    def getRegion: String = Regions.EU_WEST_3.getName
  }

  val settings = new S3Settings(MemoryBufferType, None, new DefaultAWSCredentialsProviderChain(), regionProvider, false, None, ListBucketVersion2)
  val s3Client = new S3Client(settings)(as, m)

  s3Client.multipartUpload("S3_BUCKET", s3Key)
}
```
The complete code with both endpoints can be seen here.
I have a couple of questions:
1) Is this a feature? Is this what we call backpressure?
2) If I would like this code to behave like the old approach with a temporary file (no failed requests, and all of them finish at some point), what do I have to do? I tried to implement a queue for the stream, but it made no difference. The code can be seen here.
(*DISCLAIMER* I am still a Scala newbie trying to quickly understand Akka Streams and find a workaround for the issue. There is a big chance that something simple is wrong in this code. *DISCLAIMER*)

It's a backpressure feature.
"Exceeded configured max-open-requests value of [32]": in the config, max-open-requests is set to 32 by default.
Streaming is meant for working with large amounts of data, not for handling many requests per second.
The Akka developers had to pick something for max-open-requests, and they chose 32 without any way of knowing what the pool would be used for. Will it be sending 1000 32 KB files at once, or 1000 1 GB files? They can't know. But they still want to make sure that by default (and probably 80% of users stick with the defaults) the app behaves gracefully and safely, so they had to limit the processing power.
You asked it to do 1000 uploads "now", but I am pretty sure the AWS SDK did not send 1000 files simultaneously either; it used some queue internally, which may be a good approach for you too if you have many small files to upload.
But it is perfectly fine to tune it to your case!
If you know that your machine and the target can take care of more simultaneous connections, you can raise the value.
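For example, in application.conf (the values below are only illustrative, not recommendations; note that akka-http requires max-open-requests to be a power of 2):
```
akka.http.host-connection-pool {
  # requests allowed to be in flight or waiting per pool (must be a power of 2)
  max-open-requests = 256
  # parallel connections the pool may open to one host
  max-connections = 8
}
```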
Also, for a lot of HTTP calls, use a cached host connection pool.
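If you would rather keep the failure-free behavior of the old endpoint (your question 2), another option is to push every upload through one long-running, bounded stream, so that only a fixed number of multipart uploads are in flight at once and the rest wait in a queue instead of overflowing the connection pool. Below is a minimal sketch of that idea; it reuses S3Uploader.sink and S3UploaderException from your code, while UploadJob, QueuedUploader, and the buffer/parallelism values are illustrative names and numbers of my own:
```
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Keep, Sink, Source}
import akka.stream.{Materializer, OverflowStrategy, QueueOfferResult}
import akka.util.ByteString
import scala.concurrent.{Future, Promise}

// One upload request: the file's bytes plus a promise completed when S3 confirms.
final case class UploadJob(s3Key: String,
                           bytes: Source[ByteString, Any],
                           done: Promise[String])

class QueuedUploader(parallelism: Int)(implicit as: ActorSystem, m: Materializer) {
  import as.dispatcher

  // A single long-running stream: jobs enter the queue, and at most
  // `parallelism` multipart uploads run at any time; the rest wait.
  private val jobQueue =
    Source.queue[UploadJob](bufferSize = 2048, OverflowStrategy.backpressure)
      .mapAsyncUnordered(parallelism) { job =>
        val result = job.bytes
          .runWith(S3Uploader.sink(job.s3Key)) // the sink from the question
          .map(_ => job.s3Key)
        job.done.completeWith(result)
        result.recover { case _ => job.s3Key } // a failed upload must not kill the stream
      }
      .toMat(Sink.ignore)(Keep.left)
      .run()

  // Called from the route instead of running the sink directly.
  def upload(s3Key: String, bytes: Source[ByteString, Any]): Future[String] = {
    val done = Promise[String]()
    jobQueue.offer(UploadJob(s3Key, bytes, done)).flatMap {
      case QueueOfferResult.Enqueued => done.future
      case other => Future.failed(S3UploaderException(s"Could not enqueue $s3Key: $other"))
    }
  }
}
```
The route would then call uploader.upload("s3FileKey", byteSource) instead of running the sink itself, so every request eventually completes (or fails with a real S3 error) rather than dying on a full pool queue.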

Related

Wrapping Pub-Sub Java API in Akka Streams Custom Graph Stage

I am working with a Java API from a data vendor providing real time streams. I would like to process this stream using Akka streams.
The Java API has a pub sub design and roughly works like this:
```
Subscription sub = createSubscription();
sub.addListener(new Listener() {
  public void eventsReceived(List events) {
    for (Event e : events)
      buffer.enqueue(e);
  }
});
```
I have tried to embed the creation of this subscription and accompanying buffer in a custom graph stage without much success. Can anyone guide me on the best way to interface with this API using Akka? Is Akka Streams the best tool here?
To feed a Source, you don't necessarily need to use a custom graph stage. Source.queue will materialize as a buffered queue to which you can add elements which will then propagate through the stream.
There are a couple of tricky things to be aware of. The first is that there's some subtlety around materializing the Source.queue so you can set up the subscription. Something like this:
```
def bufferSize: Int = ???

Source.fromMaterializer { (mat, att) =>
  val (queue, source) = Source.queue[Event](bufferSize).preMaterialize()(mat)
  val subscription = createSubscription()

  subscription.addListener(
    new Listener() {
      def eventsReceived(events: java.util.List[Event]): Unit = {
        import scala.collection.JavaConverters.iterableAsScalaIterable
        import akka.stream.QueueOfferResult._

        iterableAsScalaIterable(events).foreach { event =>
          queue.offer(event) match {
            case Enqueued    => () // do nothing
            case Dropped     => ??? // handle a dropped pubsub element, might well do nothing
            case QueueClosed => ??? // presumably cancel the subscription...
          }
        }
      }
    }
  )

  source.withAttributes(att)
}
```
Source.fromMaterializer is used to get access at each materialization to the materializer (which is what compiles the stream definition into actors). When we materialize, we use the materializer to preMaterialize the queue source so we have access to the queue. Our subscription adds incoming elements to the queue.
The API for this pubsub doesn't seem to support backpressure if the consumer can't keep up. The queue will drop elements it's been handed if the buffer is full: you'll probably want to do nothing in that case, but I've called it out in the match that you should make an explicit decision here.
Dropping the newest element is the synchronous behavior for this queue (there are other queue implementations available, but those will communicate dropping asynchronously which can be really bad for memory consumption in a burst). If you'd prefer something else, it may make sense to have a very small buffer in the queue and attach the "overall" Source (the one returned by Source.fromMaterializer) to a stage which signals perpetual demand. For example, a buffer(downstreamBufferSize, OverflowStrategy.dropHead) will drop the oldest event not yet processed. Alternatively, it may be possible to combine your Events in some meaningful way, in which case a conflate stage will automatically combine incoming Events if the downstream can't process them quickly.
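For instance, here is a minimal sketch of both variants; pubSubSource stands for the Source built with Source.fromMaterializer above, and downstreamBufferSize and mergeEvents are assumed placeholders, not real names from your code:
```
import akka.stream.OverflowStrategy

val downstreamBufferSize: Int = 100 // assumption: tune to your consumer

// keep an explicit buffer after the queue and drop the oldest unprocessed event
val withDropHead =
  pubSubSource.buffer(downstreamBufferSize, OverflowStrategy.dropHead)

// or, if Events can be merged meaningfully, collapse bursts instead of dropping
val conflated =
  pubSubSource.conflate((older, newer) => mergeEvents(older, newer)) // mergeEvents is hypothetical
```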
Great answer! I built something similar. There are also Kamon metrics to monitor the queue size, etc.
```
class AsyncSubscriber(projectId: String, subscriptionId: String, metricsRegistry: CustomMetricsRegistry, pullParallelism: Int)(implicit val ec: Executor) {
  private val logger = LoggerFactory.getLogger(getClass)

  def bufferSize: Int = 1000

  def source(): Source[(PubsubMessage, AckReplyConsumer), Future[NotUsed]] = {
    Source.fromMaterializer { (mat, attr) =>
      val (queue, source) = Source.queue[(PubsubMessage, AckReplyConsumer)](bufferSize).preMaterialize()(mat)

      val receiver: MessageReceiver = { (message: PubsubMessage, consumer: AckReplyConsumer) =>
        metricsRegistry.inputEventQueueSize.update(queue.size())
        queue.offer((message, consumer)) match {
          case QueueOfferResult.Enqueued =>
            metricsRegistry.inputQueueAddEventCounter.increment()
          case QueueOfferResult.Dropped =>
            metricsRegistry.inputQueueDropEventCounter.increment()
            consumer.nack()
            logger.warn(s"Buffer is full, message nacked. Pubsub should retry don't panic. If this happens too often, we should also tweak the buffer size or the autoscaler.")
          case QueueOfferResult.Failure(ex) =>
            metricsRegistry.inputQueueDropEventCounter.increment()
            consumer.nack()
            logger.error(s"Failed to offer message with id=${message.getMessageId()}", ex)
          case QueueOfferResult.QueueClosed =>
            logger.error("Destination Queue closed. Something went terribly wrong. Shutting down the jvm.")
            consumer.nack()
            mat.shutdown()
            sys.exit(1)
        }
      }

      val subscriptionName = ProjectSubscriptionName.of(projectId, subscriptionId)
      val subscriber = Subscriber.newBuilder(subscriptionName, receiver).setParallelPullCount(pullParallelism).build
      subscriber.startAsync().awaitRunning()

      source.withAttributes(attr)
    }
  }
}
```

NullPointerException in Flink custom SourceFunction

I wanted to create a SourceFunction which reads an HTTP stream.
I used ScalaJ, which does what I want (it splits the incoming text by \n).
Obviously the code works outside Flink, but I get a NullPointerException every time I start it as a Flink job (sometimes immediately, sometimes 1-2 seconds after it transmitted 1-2 elements). It kind of looks like the Http object has some problems.
```
import org.apache.flink.streaming.api.functions.source.SourceFunction
import scala.io.Source.fromInputStream
import scalaj.http._

class HttpSource(url: String) extends SourceFunction[String] {
  @volatile var isRunning = true

  override def cancel(): Unit = isRunning = false

  override def run(ctx: SourceFunction.SourceContext[String]): Unit =
    httpStream(ctx.collect)

  private def httpStream(f: String => Unit) = {
    val request = Http(url)
    request
      .execute { inputStream =>
        fromInputStream(inputStream)
          .getLines()
          .takeWhile(_ => isRunning)
          .foreach(f)
      }
  }
}
```
Here's the exception I usually get (sometimes it's a bit different; for example, when I tried to make the request value transient, it was already null when the code referred to request):
```
Caused by: java.lang.NullPointerException
  at java.io.Reader.<init>(Reader.java:78)
  at java.io.InputStreamReader.<init>(InputStreamReader.java:129)
  at scala.io.BufferedSource.reader(BufferedSource.scala:24)
  at scala.io.BufferedSource.bufferedReader(BufferedSource.scala:25)
  at scala.io.BufferedSource.scala$io$BufferedSource$$charReader$lzycompute(BufferedSource.scala:35)
  at scala.io.BufferedSource.scala$io$BufferedSource$$charReader(BufferedSource.scala:33)
  at scala.io.BufferedSource.scala$io$BufferedSource$$decachedReader(BufferedSource.scala:62)
  at scala.io.BufferedSource$BufferedLineIterator.<init>(BufferedSource.scala:67)
  at scala.io.BufferedSource.getLines(BufferedSource.scala:86)
  at flinkextension.HttpSource$$anonfun$httpStream$1.apply(HttpSource.scala:21)
  at flinkextension.HttpSource$$anonfun$httpStream$1.apply(HttpSource.scala:19)
  at scalaj.http.HttpRequest$$anonfun$execute$1.apply(Http.scala:323)
  at scalaj.http.HttpRequest$$anonfun$execute$1.apply(Http.scala:323)
  at scalaj.http.HttpRequest$$anonfun$toResponse$3.apply(Http.scala:388)
  at scalaj.http.HttpRequest$$anonfun$toResponse$3.apply(Http.scala:380)
  at scala.Option.getOrElse(Option.scala:121)
  at scalaj.http.HttpRequest.toResponse(Http.scala:380)
  at scalaj.http.HttpRequest.scalaj$http$HttpRequest$$doConnection(Http.scala:360)
  at scalaj.http.HttpRequest.exec(Http.scala:335)
  at scalaj.http.HttpRequest.execute(Http.scala:323)
  at flinkextension.HttpSource.httpStream(HttpSource.scala:19)
  at flinkextension.HttpSource.run(HttpSource.scala:14)
  at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:87)
  at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
  at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:95)
  at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:263)
  at org.apache.flink.runtime.taskmanager.Task.run(Task.java:702)
  at java.lang.Thread.run(Thread.java:748)
```
Everything else seems to work fine when I don't use an HTTP request but something else, like a file read with the same InputStream type, a plain while loop with strings, or even single HTTP requests which aren't streaming.
I feel like I'm missing some theoretical background: maybe Flink does something in the background which destroys the Http object or the InputStream, but I didn't find anything in the documentation.
UPDATE #1:
If I put a null check into the lambda, the job usually exits immediately, sometimes processes a few elements, and sometimes times out after hanging for a minute. Here's this version of the httpStream function:
```
private def httpStream(f: String => Unit) = {
  val request = Http(url)
  request
    .execute { inputStream =>
      if (inputStream == null) println("null inputstream")
      else {
        println("not null inputstream")
        fromInputStream(inputStream)
          .getLines()
          .takeWhile(_ => isRunning)
          .foreach(f)
      }
    }
}
```
UPDATE #2:
The code actually works in distributed mode and with StreamExecutionEnvironment.createLocalEnvironment().
I only experience the issue when I use start-local.sh and submit the jar to it.

How to use Flink streaming to process Data stream of Complex Protocols

I'm using Flink streaming to handle data traffic logs in a 3G network (GPRS Tunnelling Protocol), and I'm having trouble synthesizing the information in a user's session.
For example: how to map the start and end of one session. I don't know whether Flink streaming is suited to handle complex protocols like that.
P.S.:
We capture the data exchanged between the SGSN and GGSN in a 3G network (the GTP protocol with GTP-C/U messages). A session starts when the SGSN sends a CreateReq (TEID, Seq, IMSI, TEID_dl, TEID_data_dl) message and the GGSN responds with a CreateRsp (TEID_dl, Seq, TEID_ul, TEID_data_ul) message.
After the session is established, other GTP-C messages (e.g. UpdateReq, DeleteReq) sent from the SGSN to the GGSN use TEID_ul and the response messages use TEID_dl; GTP-U messages use TEID_data_ul (SGSN -> GGSN) and TEID_data_dl (GGSN -> SGSN). GTP-U messages contain information such as the AppID (facebook, twitter, web), url, ...
Finally, I want to process the continuous log data stream and map the GTP-C and GTP-U messages of the same user (IMSI) to make a report.
I've tried this:
```
val sessions = createReqs.connect(createRsps).flatMap(new CoFlatMapFunction[CreateReq, CreateRsp, Session] {
  // holds CreateReqs indexed by (teid_dl, seq)
  private val createReqs = mutable.HashMap.empty[(String, String), CreateReq]
  // holds CreateRsps indexed by (teid, seq)
  private val createRsps = mutable.HashMap.empty[(String, String), CreateRsp]

  override def flatMap1(req: CreateReq, out: Collector[Session]): Unit = {
    val key = (req.teid_dl, req.header.seqNum)
    val oRsp = createRsps.get(key)
    if (!oRsp.isEmpty) {
      val rsp = oRsp.get
      println("OK")
      out.collect(new Session(rsp.header.time, req.imsi, req.teid_dl, req.teid_ddl, rsp.teid_upl, rsp.teid_dupl, req.rat, req.apn))
      createRsps.remove(key)
    } else {
      createReqs.put(key, req)
    }
  }

  override def flatMap2(rsp: CreateRsp, out: Collector[Session]): Unit = {
    val key = (rsp.header.teid, rsp.header.seqNum)
    val oReq = createReqs.get(key)
    if (!oReq.isEmpty) {
      val req = oReq.get
      out.collect(new Session(rsp.header.time, req.imsi, req.teid_dl, req.teid_ddl, rsp.teid_upl, rsp.teid_dupl, req.rat, req.apn))
      createReqs.remove(key)
    } else {
      createRsps.put(key, rsp)
    }
  }
}).print()
```
This code always returns an empty result, even though the input stream contains CreateRsp and CreateReq messages of the same session and they appear very close together (within 1 second). When I debug, oReq.isEmpty == true every time.
What am I doing wrong?
To be honest, it is a bit difficult to see through the telco specifics here, but if I understand correctly you have at least three streams, the first two being the CreateReq and CreateRsp streams.
To detect the establishment of a session, I would use the ConnectedDataStream abstraction to share state between the two aforementioned streams. Check out this example for usage, or the related Flink docs.
Is this what you are trying to achieve?
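One hedged guess worth checking (an assumption on my part, not something stated in the question): if the job runs with parallelism > 1, a plain connect gives each parallel subtask its own pair of HashMaps, so a CreateReq and its CreateRsp can land on different subtasks and never meet. Keying both streams by the shared (TEID, Seq) pair before connecting routes matching messages to the same instance, roughly:
```
// sketch: key both streams by the fields they share, then connect;
// sessionMatcher stands for the CoFlatMapFunction shown in the question
val keyedReqs = createReqs.keyBy(req => (req.teid_dl, req.header.seqNum))
val keyedRsps = createRsps.keyBy(rsp => (rsp.header.teid, rsp.header.seqNum))
keyedReqs.connect(keyedRsps).flatMap(sessionMatcher).print()
```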

scalaz-stream: combining queues based on one queue's size

In my application I have up to N consumers working in parallel and a producer. Consumers grab resources from the producer, do their work, append results to an updateQueue, and ask for more resources. The producer has some resources available initially and can generate more by applying updates from the updateQueue. It is important to apply all available updates before a new resource is emitted to a consumer. I've tried the following generator, requesting updates "in bulk" whenever a consumer makes a request and setting aside new resources (which are not needed by this consumer but may later be requested by other consumers) in a ticketQueue:
```
def updatesOrFresh: Process[Task, Seq[OptimizerResult] \/ Unit] =
  Process.await(updateQueue.size.continuous.take(1).runLast) {
    case Some(size) =>
      println(s"size: $size")
      if (size == 0)
        wye(updateQueue.dequeueAvailable, ticketQueue.dequeue)(wye.either)
      else
        updateQueue.dequeueAvailable.map(_.left[Unit])
  }.take(1) ++ Process.suspend(updatesOrFresh)
```
It doesn't work: the initially available resources are emitted from ticketQueue.dequeue, and then it appears to block on the wye, logging:
```
size: 0
<<got ticket>>
size: 0
<<got ticket>>
size: 0  // it appears the updateQueue did not receive the consumer output yet, but I can live with that; it should grab an update from the wye anyway
<<blocks>>
```
when there were two resources available initially on the ticketQueue. However, if I change it to just
```
val updatesOrFresh = wye(updateQueue.dequeueAvailable, ticketQueue.dequeue)(wye.either)
```
it works as expected (although without the "apply updates before emitting a new resource" guarantee). How can I make it work while ensuring the updates are applied at the right time?
Edit: I've solved it using the following code:
```
val updatesOrFresh: Process[Task, Seq[OptimizerResult] \/ Unit] =
  Process.repeatEval {
    for {
      sizeOpt <- updateQueue.size.continuous.take(1).runLast
      nextOpt <-
        if (sizeOpt.getOrElse(???) == 0)
          wye(updateQueue.dequeueAvailable, ticketQueue.dequeue)(wye.either).take(1).runLast
        else
          updateQueue.dequeueAvailable.map(_.left[Unit]).take(1).runLast
    } yield nextOpt.getOrElse(???)
  }
```
However, the question of why the original def didn't work remains...

How to set a socket read timeout in IOManager

In Akka, IOManager "is the recommended entry point to creating sockets for performing IO." I'm looking at the API and wondering how to set a read timeout. Of course, I could schedule an actor to fire a message in n seconds and close the socket then, but by that time it may have already received the entire read and be busy processing the read-in data, so it's not truly a read timeout. Any ideas how to do this? Or must I somehow introduce some state into my actor?
Ok, Derek Williams gave me a hint on akka-user. Here's the code, just in case anyone else needs to do something similar.
When we accept a new client, we set a timer for 5 seconds to close the connection.
```
def receive = {
  case IO.NewClient(server) =>
    val socket = server.accept()
    val readTimeout = context.system.scheduler.scheduleOnce(5 seconds, self, Timeout(socket))
    state(socket) flatMap (_ => MyServer.processRequest(socket, readTimeout))
  case IO.Read(socket, bytes) =>
    state(socket)(IO Chunk bytes)
  case IO.Closed(socket, cause) =>
    state(socket)(IO EOF None)
    state -= socket
  case Timeout(socket) =>
    socket.close()
}
```
And to cancel the timeout after we've read, we call cancel() on the Cancellable returned by the scheduler.
```
object MyServer {
  def processRequest(socket: IO.SocketHandle, readTimeout: Cancellable): IO.Iteratee[Unit] =
    for {
      request <- readRequest
    } yield {
      readTimeout.cancel()
      request match {
        case req =>
          val response = ... // build the response from req (elided in the original)
          socket write ByteString(response).compact
          socket.close()
      }
    }
}
```