Wrapping Pub-Sub Java API in Akka Streams Custom Graph Stage - scala

I am working with a Java API from a data vendor providing real-time streams. I would like to process this stream using Akka Streams.
The Java API has a pub-sub design and roughly works like this:
Subscription sub = createSubscription();
sub.addListener(new Listener() {
    public void eventsReceived(List<Event> events) {
        for (Event e : events)
            buffer.enqueue(e);
    }
});
I have tried to embed the creation of this subscription and accompanying buffer in a custom graph stage without much success. Can anyone guide me on the best way to interface with this API using Akka? Is Akka Streams the best tool here?

To feed a Source, you don't necessarily need to use a custom graph stage. Source.queue will materialize as a buffered queue to which you can add elements which will then propagate through the stream.
There are a couple of tricky things to be aware of. The first is that there's some subtlety around materializing the Source.queue so you can set up the subscription. Something like this:
def bufferSize: Int = ???

Source.fromMaterializer { (mat, att) =>
  val (queue, source) = Source.queue[Event](bufferSize).preMaterialize()(mat)
  val subscription = createSubscription()

  subscription.addListener(
    new Listener() {
      def eventsReceived(events: java.util.List[Event]): Unit = {
        import scala.collection.JavaConverters.iterableAsScalaIterable
        import akka.stream.QueueOfferResult._

        iterableAsScalaIterable(events).foreach { event =>
          queue.offer(event) match {
            case Enqueued    => () // do nothing
            case Dropped     => ??? // handle a dropped pubsub element, might well do nothing
            case QueueClosed => ??? // presumably cancel the subscription...
          }
        }
      }
    }
  )

  source.withAttributes(att)
}
Source.fromMaterializer is used to get access, at each materialization, to the materializer (which is what compiles the stream definition into actors). When we materialize, we use that materializer to pre-materialize the queue source so that we have access to the queue. Our subscription then adds incoming elements to the queue.
The API for this pubsub doesn't seem to support backpressure if the consumer can't keep up. The queue will drop elements it's handed if the buffer is full: you'll probably want to do nothing in that case, but I've called it out in the match so that you make an explicit decision there.
Dropping the newest element is the synchronous behavior for this queue (there are other queue implementations available, but those communicate dropping asynchronously, which can be really bad for memory consumption in a burst). If you'd prefer something else, it may make sense to have a very small buffer in the queue and attach the "overall" Source (the one returned by Source.fromMaterializer) to a stage which signals perpetual demand. For example, a buffer(downstreamBufferSize, OverflowStrategy.dropHead) will drop the oldest event not yet processed. Alternatively, it may be possible to combine your Events in some meaningful way, in which case a conflate stage will automatically combine incoming Events if the downstream can't process them quickly enough.
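For illustration, either alternative might be attached to the overall source roughly like this (a sketch only: pubSubSource stands for the Source.fromMaterializer source above, and downstreamBufferSize and combineEvents are hypothetical names):
```
import akka.stream.OverflowStrategy

val downstreamBufferSize = 16 // illustrative

// pubSubSource stands for the Source[Event, _] built with Source.fromMaterializer above
val latestEvents =
  pubSubSource.buffer(downstreamBufferSize, OverflowStrategy.dropHead) // keep only the newest events

// or, if Events can be merged meaningfully:
// val mergedEvents = pubSubSource.conflate((older, newer) => combineEvents(older, newer))
```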

Great answer! I built something similar. There are also Kamon metrics to monitor queue size, etc.
class AsyncSubscriber(projectId: String, subscriptionId: String, metricsRegistry: CustomMetricsRegistry, pullParallelism: Int)(implicit val ec: Executor) {

  private val logger = LoggerFactory.getLogger(getClass)

  def bufferSize: Int = 1000

  def source(): Source[(PubsubMessage, AckReplyConsumer), Future[NotUsed]] = {
    Source.fromMaterializer { (mat, attr) =>
      val (queue, source) = Source.queue[(PubsubMessage, AckReplyConsumer)](bufferSize).preMaterialize()(mat)

      val receiver: MessageReceiver = { (message: PubsubMessage, consumer: AckReplyConsumer) =>
        metricsRegistry.inputEventQueueSize.update(queue.size())
        queue.offer((message, consumer)) match {
          case QueueOfferResult.Enqueued =>
            metricsRegistry.inputQueueAddEventCounter.increment()
          case QueueOfferResult.Dropped =>
            metricsRegistry.inputQueueDropEventCounter.increment()
            consumer.nack()
            logger.warn("Buffer is full, message nacked. Pubsub should retry, don't panic. If this happens too often, we should also tweak the buffer size or the autoscaler.")
          case QueueOfferResult.Failure(ex) =>
            metricsRegistry.inputQueueDropEventCounter.increment()
            consumer.nack()
            logger.error(s"Failed to offer message with id=${message.getMessageId()}", ex)
          case QueueOfferResult.QueueClosed =>
            logger.error("Destination queue closed. Something went terribly wrong. Shutting down the JVM.")
            consumer.nack()
            mat.shutdown()
            sys.exit(1)
        }
      }

      val subscriptionName = ProjectSubscriptionName.of(projectId, subscriptionId)
      val subscriber = Subscriber.newBuilder(subscriptionName, receiver).setParallelPullCount(pullParallelism).build
      subscriber.startAsync().awaitRunning()

      source.withAttributes(attr)
    }
  }
}
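For completeness, a hypothetical way of consuming this source, assuming the constructor arguments and an implicit Executor are available from the surrounding context, and handle is a placeholder for your own processing:
```
import akka.Done
import akka.actor.ActorSystem
import akka.stream.scaladsl.Sink
import scala.concurrent.Future

implicit val system: ActorSystem = ActorSystem("pubsub-consumer")

val done: Future[Done] =
  new AsyncSubscriber(projectId, subscriptionId, metricsRegistry, pullParallelism = 4)
    .source()
    .map { case (message, consumer) =>
      handle(message) // placeholder processing step
      consumer.ack()  // ack only after the message has been processed
    }
    .runWith(Sink.ignore)
```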

Related

Alpakka S3 connector stream won't handle the load, throwing akka.stream.BufferOverflowException

I have an akka-http service and I am trying out the Alpakka S3 connector for uploading files. Previously I was using a temporary file and then uploading with the Amazon SDK. That approach required some adjustments to make the Amazon SDK more Scala-like, but it could handle even 1000 requests at once. Throughput wasn't amazing, but all of the requests went through eventually. Here is the code before the changes, with no Alpakka:
```
path("uploadfile") {
withRequestTimeout(20.seconds) {
storeUploadedFile("csv", tempDestination) {
case (metadata, file) =>
val uploadFuture = upload(file, file.toPath.getFileName.toString)
onComplete(uploadFuture) {
case Success(_) => complete(StatusCodes.OK)
case Failure(_) => complete(StatusCodes.FailedDependency)
}
}
}
}
}
case class S3UploaderException(msg: String) extends Exception(msg)

def upload(file: File, key: String): Future[String] = {
  val s3Client = AmazonS3ClientBuilder.standard()
    .withCredentials(new DefaultAWSCredentialsProviderChain())
    .withRegion(Regions.EU_WEST_3)
    .build()

  val promise = Promise[String]()

  val listener = new ProgressListener() {
    override def progressChanged(progressEvent: ProgressEvent): Unit = {
      (progressEvent.getEventType: @unchecked) match {
        case ProgressEventType.TRANSFER_FAILED_EVENT =>
          promise.failure(S3UploaderException(s"Uploading a file with a key: $key"))
        case ProgressEventType.TRANSFER_COMPLETED_EVENT |
             ProgressEventType.TRANSFER_CANCELED_EVENT =>
          promise.success(key)
      }
    }
  }

  val request = new PutObjectRequest("S3_BUCKET", key, file)
  request.setGeneralProgressListener(listener)
  s3Client.putObject(request)

  promise.future
}
```
When I changed this to use the Alpakka connector, the code looks much nicer, as we can just connect the ByteSource and the Alpakka Sink together. However, this approach cannot handle such a big load. When I execute 1000 requests at once (10 kB files), less than 10% go through and the rest fail with this exception:
akka.stream.alpakka.s3.impl.FailedUpload: Exceeded configured
max-open-requests value of [32]. This means that the request queue of
this pool
(HostConnectionPoolSetup(bargain-test.s3-eu-west-3.amazonaws.com,443,ConnectionPoolSetup(ConnectionPoolSettings(4,0,5,32,1,30
seconds,ClientConnectionSettings(Some(User-Agent: akka-http/10.1.3),10
seconds,1
minute,512,None,WebSocketSettings(,ping,Duration.Inf,akka.http.impl.settings.WebSocketSettingsImpl$$$Lambda$4787/1279590204#4d809f4c),List(),ParserSettings(2048,16,64,64,8192,64,8388608,256,1048576,Strict,RFC6265,true,Set(),Full,Error,Map(If-Range
-> 0, If-Modified-Since -> 0, If-Unmodified-Since -> 0, default -> 12, Content-MD5 -> 0, Date -> 0, If-Match -> 0, If-None-Match -> 0,
User-Agent ->
32),false,true,akka.util.ConstantFun$$$Lambda$4534/1539966798#69c23cd4,akka.util.ConstantFun$$$Lambda$4534/1539966798#69c23cd4,akka.util.ConstantFun$$$Lambda$4535/297570074#6b426c59),None,TCPTransport),New,1
second),akka.http.scaladsl.HttpsConnectionContext#7e0f3726,akka.event.MarkerLoggingAdapter#74f3a78b)))
has completely filled up because the pool currently does not process
requests fast enough to handle the incoming request load. Please retry
the request later. See
http://doc.akka.io/docs/akka-http/current/scala/http/client-side/pool-overflow.html
for more information.
Here is how the summary of a Gatling test looks like:
---- Response Time Distribution ----------------------------------------
t < 800 ms 0 ( 0%)
800 ms < t < 1200 ms 0 ( 0%)
t > 1200 ms 90 ( 9%)
failed 910 ( 91%)
When I execute 100 simultaneous requests, half of them fail. So, still far from satisfactory.
This is the new code:
```
path("uploadfile") {
withRequestTimeout(20.seconds) {
extractRequestContext { ctx =>
implicit val materializer = ctx.materializer
extractActorSystem { actorSystem =>
fileUpload("csv") {
case (metadata, byteSource) =>
val uploadFuture = byteSource.runWith(S3Uploader.sink("s3FileKey")(actorSystem, materializer))
onComplete(uploadFuture) {
case Success(_) => complete(StatusCodes.OK)
case Failure(_) => complete(StatusCodes.FailedDependency)
}
}
}
}
}
}
def sink(s3Key: String)(implicit as: ActorSystem, m: Materializer) = {
val regionProvider = new AwsRegionProvider {
def getRegion: String = Regions.EU_WEST_3.getName
}
val settings = new S3Settings(MemoryBufferType, None, new DefaultAWSCredentialsProviderChain(), regionProvider, false, None, ListBucketVersion2)
val s3Client = new S3Client(settings)(as, m)
s3Client.multipartUpload("S3_BUCKET", s3Key)
}
```
The complete code with both endpoints can be seen here
I have a couple of questions.
1) Is this a feature? Is this what we can call backpressure?
2) If I want this code to behave like the old approach with a temporary file (no failed requests, all of them finishing at some point), what do I have to do? I tried implementing a queue for the stream (link to the source below), but this made no difference. The code can be seen here.
(* DISCLAIMER * I am still a Scala newbie trying to quickly understand Akka Streams and find some workaround for the issue. There's a good chance that something simple is wrong in this code. * DISCLAIMER *)
It's a backpressure feature.
"Exceeded configured max-open-requests value of [32]": in the configuration, max-open-requests is set to 32 by default. Streaming is meant for working with large amounts of data, not for handling many requests per second.
The Akka developers had to pick some value for max-open-requests, and they chose 32 without knowing what it would be used for. Would it be sending 1000 32 KB files or 1000 1 GB files at once? They don't know. But they still want to make sure that by default (and probably 80% of people use the defaults) applications behave gracefully and safely, so they had to limit the processing power.
You asked it to do 1000 "now", but I am pretty sure AWS did not send 1000 files simultaneously either; it used some queue, which may be a good approach for you too if you have many small files to upload.
But it is perfectly fine to tune this to your case! If you know your machine and the target can take care of more simultaneous connections, you can raise the value.
Also, for a lot of HTTP calls, use a cached host connection pool.
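If you go that route, a minimal sketch of raising the limits programmatically (the values are illustrative; the equivalent settings live under akka.http.host-connection-pool in application.conf, e.g. max-open-requests and max-connections):
```
import akka.actor.ActorSystem
import akka.http.scaladsl.settings.ConnectionPoolSettings

implicit val system: ActorSystem = ActorSystem("uploader")

// Illustrative values: raise them only if your machine and S3 can handle
// more simultaneous connections.
val poolSettings: ConnectionPoolSettings =
  ConnectionPoolSettings(system)
    .withMaxOpenRequests(256) // must be a power of 2
    .withMaxConnections(64)
```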

Sample most recent element of Akka Stream with trigger signal, using zipWith?

I have a Planning system that computes kind of a global Schedule from customer orders. This schedule changes over time when customers place or revoke orders to this system, or when certain resources used by events within the schedule become unavailable.
Now another system needs to know the status of certain events in the Schedule. The system sends a StatusRequest(EventName) on a message queue to which I must react with a corresponding StatusSignal(EventStatus) on another queue.
The Planning system gives me an akka-streams Source[Schedule] which emits a Schedule whenever the schedule changed, and I also have a Source[StatusRequest] from which I receive StatusRequests and a Sink[StatusSignal] to which I can send StatusSignal responses.
Whenever I receive a StatusRequest I must inspect the current schedule, i.e. the most recent value emitted by Source[Schedule], and send a StatusSignal to the sink.
I came up with the following flow
scheduleSource
  .zipWith(statusRequestSource) { (schedule, statusRequest) =>
    findEventStatus(schedule, statusRequest.eventName)
  }
  .map(eventStatus => makeStatusSignal(eventStatus))
  .runWith(statusSignalSink)
but I am not at all sure when this flow actually emits values and whether it actually implements the requirement stated above.
The zipWith reference says (emphasis mine):
emits when all of the inputs have an element available
What does this mean? When statusRequestSource emits a value does the flow wait until scheduleSource emits, too? Or does it use the last value scheduleSource emitted? Likewise, what happens when scheduleSource emits a value? Does it trigger a status signal with the last element in statusRequestSource?
If the flow doesn't implement what I need, how could I achieve it instead?
To answer your first set of questions regarding the behavior of zipWith, here is a simple test:
val source1 = Source(1 to 5)
val source2 = Source(1 to 3)

source1
  .zipWith(source2) { (s1Elem, s2Elem) => (s1Elem, s2Elem) }
  .runForeach(println)

// prints:
// (1,1)
// (2,2)
// (3,3)
zipWith emits downstream only when both inputs have an element available to be zipped: it pairs elements one-to-one rather than reusing the most recent element from either side, and it completes once the shorter input completes (which is why only three pairs are printed above). In other words, a StatusRequest would have to wait for the next Schedule to be emitted, which is not what you want.
One idea to fulfill your requirement is to decouple scheduleSource and statusRequestSource. Feed scheduleSource to an actor, and have the actor track the most recent element it has received from the stream. Then have statusRequestSource query this actor, which will reply with the most recent element from scheduleSource. This actor could look something like the following:
class LatestElementTracker extends Actor with ActorLogging {
  var latestSchedule: Option[Schedule] = None

  def receive = {
    case schedule: Schedule =>
      latestSchedule = Some(schedule)
    case status: StatusRequest =>
      if (latestSchedule.isEmpty) {
        log.debug("No schedules have been received yet.")
      } else {
        val eventStatus = findEventStatus(latestSchedule.get, status.eventName)
        sender() ! eventStatus
      }
  }
}
To integrate with the above actor:
scheduleSource.runForeach(s => trackerActor ! s)

statusRequestSource
  .ask[EventStatus](parallelism = 1)(trackerActor) // adjust parallelism as needed
  .map(eventStatus => makeStatusSignal(eventStatus))
  .runWith(statusSignalSink)
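Note that the stream ask stage requires an implicit Timeout, and the actor must reply to every StatusRequest (in the sketch above, a request that arrives before any Schedule gets no reply, so the ask would time out and fail the stream). A minimal sketch of the missing implicit, with an illustrative value:
```
import akka.util.Timeout
import scala.concurrent.duration._

// required by the .ask[EventStatus](...) stage above
implicit val askTimeout: Timeout = 3.seconds
```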

What's the simplest way to use SSE with Redis pub/sub and Akka Streams?

I'd like to stream a chunked server sent event for the following scenario:
Subscribe to a Redis key, and if the key changes, stream the new value with Akka Streams. It should only stream if there are new values.
As I understand it, I need a Source. I guess that is the subscription to a channel:
redis.subscriber.subscribe("My Channel") {
  case message @ PubSubMessage.Message(channel, messageBytes) =>
    println(message.readAs[String]())
  case PubSubMessage.Subscribe(channel, subscribedChannelsCount) =>
    println(s"Successfully subscribed to $channel")
}
In my route I need to create a Source from this, but honestly I don't know how to get going:
val route: Route =
  path("stream") {
    get {
      complete {
        val source: Source[ServerSentEvent, NotUsed] =
          Source
            .asSubscriber(??) // or fromPublisher???
            .map(_ => {
              ??
            })
            .map(toServerSentEvent)
            .keepAlive(1.second, () => ServerSentEvent.heartbeat)
            .log("stream")
      }
    }
  }
One approach is to use Source.actorRef and BroadcastHub.sink:
val (sseActor, sseSource) =
  Source.actorRef[String](10, akka.stream.OverflowStrategy.dropTail)
    .map(toServerSentEvent) // converts a String to a ServerSentEvent
    .keepAlive(1.second, () => ServerSentEvent.heartbeat)
    .toMat(BroadcastHub.sink[ServerSentEvent])(Keep.both)
    .run()
Subscribe the materialized ActorRef to your message channel: messages sent to this actor are emitted downstream. If there is no downstream demand, the messages are buffered up to a certain number (in this example, the buffer size is 10) with the specified overflow strategy. Note that there is no backpressure with this approach.
redis.subscriber.subscribe("My Channel") {
  case message @ PubSubMessage.Message(channel, messageBytes) =>
    val strMsg = message.readAs[String]
    println(strMsg)
    sseActor ! strMsg
  case ...
}
Also note that the above example uses a Source.actorRef[String]; adjust the type and the example as you see fit (for example, it could be Source.actorRef[PubSubMessage.Message]).
And you can use the materialized Source in your path:
path("stream") {
get {
complete(sseSource)
}
}
An alternative approach is to create a Source.queue and offer elements to the queue as they are received in the subscriber callback:
val (queue, sseSource) =
  Source
    .queue[String](10, OverflowStrategy.dropHead) // drops the oldest element from the buffer to make space for new elements
    .map(toServerSentEvent) // converts a String to a ServerSentEvent
    .keepAlive(1.second, () => ServerSentEvent.heartbeat)
    .toMat(BroadcastHub.sink[ServerSentEvent])(Keep.both) // materialize both the queue and a reusable SSE source
    .run()
and in the subscriber
redis.subscriber.subscribe("My Channel") {
  case message @ PubSubMessage.Message(channel, messageBytes) =>
    val strMsg = message.readAs[String]
    println(strMsg)
    queue.offer(strMsg)
  case ...
}
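As with the first approach, the materialized sseSource can then be served from the route with complete(sseSource). Note that with this queue variant, queue.offer returns a Future[QueueOfferResult], which you may want to inspect to detect dropped messages.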

Can I safely create a Thread in an Akka Actor?

I have an Akka Actor that I want to send "control" messages to.
This Actor's core mission is to listen on a Kafka queue, which is a polling process inside a loop.
I've found that the following simply locks up the Actor and it won't receive the "stop" (or any other) message:
class Worker() extends Actor {
  private var done = false

  def receive = {
    case "stop" =>
      done = true
      kafkaConsumer.close()
    // other messages here
  }

  // Start digesting messages!
  while (!done) {
    kafkaConsumer.poll(100).iterator.map { cr: ConsumerRecord[Array[Byte], String] =>
      // process the record
    }
  }
}
I could wrap the loop in a Thread started by the Actor, but is it ok/safe to start a Thread from inside an Actor? Is there a better way?
Basically you can, but keep in mind that this actor will be blocking, and the rule of thumb is to never block inside actors. If you still want to do this, make sure the actor runs on a separate thread pool from the default one so that you don't affect the actor system's performance. Another way to do it would be to have the actor send messages to itself to poll new messages:
1) Receive an order to poll a message from Kafka
2) Hand over the message to the relevant actor
3) Send itself a message ordering it to pull a new message
4) Hand it over...
Code-wise:
case object PollMessage

class Worker() extends Actor {
  private var done = false

  def receive = {
    case PollMessage =>
      poll()
      self ! PollMessage
    case "stop" =>
      done = true
      kafkaConsumer.close()
    // other messages here
  }

  // Start digesting messages!
  def poll() = {
    kafkaConsumer.poll(100).iterator.map { cr: ConsumerRecord[Array[Byte], String] =>
      // process the record
    }
  }
}
I am not sure though that you will ever receive the stop message if you continuously block on the actor.
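If you do keep the blocking poll inside the actor, one common precaution is to give it a dedicated dispatcher so it cannot starve the default one. A minimal sketch, assuming a dispatcher named kafka-polling-dispatcher (type = Dispatcher, executor = "thread-pool-executor", fixed pool size of 1) is defined in application.conf:
```
import akka.actor.{ActorSystem, Props}

val system = ActorSystem("kafka-worker-system")

// run the blocking Worker on its own, dedicated dispatcher
val worker = system.actorOf(
  Props(new Worker()).withDispatcher("kafka-polling-dispatcher"),
  "kafka-worker"
)
```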
Adding to @Louis F.'s answer: depending on the configuration of your actors, they will either drop the messages they receive while busy, or put them in a mailbox (a queue) and process them later (usually in FIFO order). However, in this particular case you are flooding the actor with PollMessage, and you have no guarantee that your message will not be dropped, which appears to be happening in your case.

Scala Queue and NoSuchElementException

I am getting an infrequent NoSuchElementException error when operating over my Scala 2.9.2 Queue. I don't understand the exception because the Queue has elements in it. I've tried switching over to a SynchronizedQueue, thinking it was a concurrency issue (my queue is written to and read from different threads), but that didn't solve it.
The reduced code looks like this:
val window = new scala.collection.mutable.Queue[Packet]
...
(thread 1)
window += packet
...
(thread 2)
window.dequeueAll(someFunction)
println(window.size)
window.foreach(println(_))
Which results in
32
java.util.NoSuchElementException
at scala.collection.mutable.LinkedListLike$class.head(LinkedListLike.scala:76)
at scala.collection.mutable.LinkedList.head(LinkedList.scala:78)
at scala.collection.mutable.MutableList.head(MutableList.scala:53)
at scala.collection.LinearSeqOptimized$class.foreach(LinearSeqOptimized.scala:59)
at scala.collection.mutable.MutableList.foreach(MutableList.scala:30)
The docs for LinkedListLike.head() say
Exceptions thrown: `NoSuchElementException` if the linked list is empty.
but how can this exception be thrown if the Queue is not empty?
You should have window (a mutable data structure) accessed from only a single thread; other threads should send messages to that one.
Akka makes this kind of concurrent programming relatively easy.
class MySource(windowHolderRef: ActorRef) extends Actor {
  def receive = {
    case MyEvent(packet: Packet) =>
      windowHolderRef ! packet
  }
}

case object CheckMessages

class WindowHolder extends Actor {
  private val window = new scala.collection.mutable.Queue[Packet]

  def receive = {
    case packet: Packet =>
      window += packet
    case CheckMessages =>
      window.dequeueAll(someFunction)
      println(window.size)
      window.foreach(println(_))
  }
}
To check messages periodically, you can schedule a recurring CheckMessages message to the actor, as in the sketch below.
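A minimal sketch of wiring the periodic check into the actor itself (the one-second interval is illustrative):
```
import akka.actor.{Actor, Cancellable}
import scala.collection.mutable
import scala.concurrent.duration._

class WindowHolder extends Actor {
  import context.dispatcher // execution context for the scheduler

  private val window = new mutable.Queue[Packet]

  // send ourselves CheckMessages once per second
  private val tick: Cancellable =
    context.system.scheduler.schedule(1.second, 1.second, self, CheckMessages)

  override def postStop(): Unit = tick.cancel()

  def receive = {
    case packet: Packet =>
      window += packet
    case CheckMessages =>
      window.dequeueAll(someFunction)
      println(window.size)
      window.foreach(println(_))
  }
}
```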