Akka Stream - Select Sink based on Element in Flow - scala

I'm creating a simple message delivery service using Akka Streams. The service works like mail delivery, where each element from the source includes a destination and content:
case class Message(destination: String, content: String)
and the service should deliver each message to the appropriate sink based on the destination field. I created a DeliverySink class so that each sink has a name:
case class DeliverySink(name: String, sink: Sink[String, Future[Done]])
Now, I instantiated two DeliverySinks, let's call them sinkX and sinkY, and created a map keyed by their names. In practice, I want to provide a list of sink names, and that list should be configurable.
The challenge I'm facing is how to dynamically choose an appropriate sink based on the destination field.
Eventually, I want to map Flow[Message] to a sink. I tried:
val sinkNames: List[String] = List("sinkX", "sinkY")
val sinkMapping: Map[String, DeliverySink] =
  sinkNames.map { name => name -> DeliverySink(name, ???) }.toMap
Flow[Message].map { msg => msg.content }.to(sinkMapping(msg.destination).sink)
but, obviously this doesn't work because we can't reference msg outside of map...
I guess this is not the right approach. I also thought about using filter with broadcast, but if the number of destinations scales to 100, I cannot write out every route by hand. What is the right way to achieve my goal?
[Edit]
Ideally, I would like to make the destinations dynamic, so I cannot statically write out all destinations in filter or routing logic. If a destination's sink has not been connected yet, it should also create a new sink dynamically.

If You Have To Use Multiple Sinks
Sink.combine would directly suit your existing requirements. If you attach an appropriate Flow.filter before each Sink, then each one will only receive the appropriate messages.
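For illustration, here is a minimal sketch of that shape (the two Sink.foreach sinks and the literal destination strings below are placeholders for your DeliverySink instances):
import akka.NotUsed
import akka.stream.scaladsl.{Broadcast, Flow, Sink}
// Placeholder sinks standing in for your DeliverySink instances.
val sinkX: Sink[String, _] = Sink.foreach[String](s => println(s"X: $s"))
val sinkY: Sink[String, _] = Sink.foreach[String](s => println(s"Y: $s"))
// Each branch filters for its own destination before handing the content to its sink.
val toX: Sink[Message, NotUsed] =
  Flow[Message].filter(_.destination == "sinkX").map(_.content).to(sinkX)
val toY: Sink[Message, NotUsed] =
  Flow[Message].filter(_.destination == "sinkY").map(_.content).to(sinkY)
// Sink.combine broadcasts every Message to all branches; the filters do the routing.
val routingSink: Sink[Message, NotUsed] = Sink.combine(toX, toY)(Broadcast[Message](_))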
Don't Use Multiple Sinks
In general I think it is bad design to have the structure and content of streams contain business logic. Your stream should be a thin veneer for back-pressured concurrency on top of business logic that lives in ordinary scala/java code.
In this particular case, I think it would be best to wrap your destination routing inside of a single Sink and the logic should be implemented inside of a separate function. For example:
val routeMessage : (Message) => Unit =
  (message) =>
    if (message.destination equalsIgnoreCase "stdout")
      System.out println message.content
    else if (message.destination equalsIgnoreCase "stderr")
      System.err println message.content
val routeSink : Sink[Message, _] = Sink foreach routeMessage
Note how much easier it is to test routeMessage now that it isn't inside of the stream: I don't need any akka testkit "stuff" to test routeMessage. I can also move the function to a Future or a Thread if my concurrency design were to change.
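For example, exercising it directly with made-up messages, no Materializer or stream required:
routeMessage(Message("stdout", "hello")) // prints "hello" on standard out
routeMessage(Message("stderr", "oops"))  // prints "oops" on standard error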
Many Destinations
If you have many destinations you can use a Map. Suppose, for example, you are sending your messages to AmazonSQS. You could define a function to convert a Queue Name to Queue URL and use that function to maintain a Map of already created names:
type QueueName = String
val nameToRequest : (QueueName) => CreateQueueRequest = ??? //implementation unimportant
type QueueURL = String
val nameToURL : (AmazonSQS) => (QueueName) => QueueURL = {
  val nameToURL = mutable.Map.empty[QueueName, QueueURL]
  (sqs) => (queueName) => nameToURL.get(queueName) match {
    case Some(url) => url
    case None => {
      sqs.createQueue(nameToRequest(queueName))
      val url = sqs.getQueueUrl(queueName).getQueueUrl()
      nameToURL put (queueName, url)
      url
    }
  }
}
Now you can use this non-stream function inside of a singular Sink:
val sendMessage : (AmazonSQS) => (Message) => Unit =
  (sqs) => (message) =>
    sqs sendMessage {
      (new SendMessageRequest())
        .withQueueUrl(nameToURL(sqs)(message.destination))
        .withMessageBody(message.content)
    }
val sqs : AmazonSQS = ???
val messageSink = Sink foreach sendMessage(sqs)
Side Note
For destination you probably want to use something other than String. A coproduct (e.g. a sealed trait hierarchy) is usually better because it can be used in match expressions, and the compiler will warn you if you miss one of the possibilities:
sealed trait Destination
object Out extends Destination
object Err extends Destination
object SomethingElse extends Destination
case class Message(destination: Destination, content: String)
//This match is not exhaustive: the compiler warns (an error under -Xfatal-warnings)
//because SomethingElse doesn't have a case
val routeMessage : (Message) => Unit =
  (message) => message.destination match {
    case Out =>
      System.out.println(message.content)
    case Err =>
      System.err.println(message.content)
  }

Given your requirement, you may want to consider splitting your stream source into substreams using groupBy:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl._
import akka.util.ByteString
import akka.{NotUsed, Done}
import akka.stream.IOResult
import scala.concurrent.Future
import java.nio.file.Paths
import java.nio.file.StandardOpenOption._
implicit val system = ActorSystem("sys")
implicit val materializer = ActorMaterializer()
import system.dispatcher
case class Message(destination: String, content: String)
case class DeliverySink(name: String, sink: Sink[ByteString, Future[IOResult]])
val messageSource: Source[Message, NotUsed] = Source(List(
  Message("a", "uuu"), Message("a", "vvv"),
  Message("b", "xxx"), Message("b", "yyy"), Message("b", "zzz")
))
val sinkA = DeliverySink("sink-a", FileIO.toPath(
  Paths.get("/path/to/sink-a.txt"), options = Set(CREATE, WRITE)
))
val sinkB = DeliverySink("sink-b", FileIO.toPath(
  Paths.get("/path/to/sink-b.txt"), options = Set(CREATE, WRITE)
))
val sinkMapping: Map[String, DeliverySink] = Map("a" -> sinkA, "b" -> sinkB)
val totalDests = 2
messageSource
  .map(m => (m.destination, m))
  .groupBy(totalDests, _._1)
  .fold(("", List.empty[Message])) {
    case ((_, list), (dest, msg)) => (dest, msg :: list)
  }
  .mapAsync(parallelism = totalDests) {
    case (dest: String, msgList: List[Message]) =>
      Source(msgList.reverse).map(_.content).map(ByteString(_))
        .runWith(sinkMapping(dest).sink)
  }
  .mergeSubstreams
  .runWith(Sink.ignore)
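Regarding the edit about destinations that are not known up front: one option, sketched below with an illustrative file-naming scheme, is to fall back to creating a sink on the fly when the map has no entry, and to raise the maxSubstreams argument of groupBy above the number of destinations you expect:
def sinkFor(dest: String): DeliverySink =
  sinkMapping.getOrElse(dest, DeliverySink(s"sink-$dest", FileIO.toPath(
    Paths.get(s"/path/to/sink-$dest.txt"), options = Set(CREATE, WRITE)
  )))
// ...then use sinkFor(dest).sink instead of sinkMapping(dest).sink in mapAsync,
// and replace totalDests in groupBy with an upper bound on the number of destinations.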

Related

Processing an akka stream asynchronously and writing to a file sink

I am trying to write a piece of code that would consume a stream of tickers (stock exchange symbol of a company) and fetch company information from a REST API for each ticker.
I want to fetch information for multiple companies asynchronously.
I would like to save the results to a file in a continuous manner as the entire data set might not fit into memory.
Following the akka streams documentation and the resources I was able to google on this subject, I have come up with the following piece of code (some parts are omitted for brevity):
implicit val actorSystem: ActorSystem = ActorSystem("stock-fetcher-system")
implicit val materializer: ActorMaterializer = ActorMaterializer(None, Some("StockFetcher"))(actorSystem)
implicit val context = system.dispatcher
import CompanyJsonMarshaller._
val parallelism = 10
val connectionPool = Http().cachedHostConnectionPoolHttps[String](s"api.iextrading.com")
val listOfSymbols = symbols.toList
val outputPath = "out.txt"
Source(listOfSymbols)
  .mapAsync(parallelism) {
    stockSymbol => Future(HttpRequest(uri = s"https://api.iextrading.com/1.0/stock/${stockSymbol.symbol}/company"), stockSymbol.symbol)
  }
  .via(connectionPool)
  .map {
    case (Success(response), _) => Unmarshal(response.entity).to[Company]
    case (Failure(ex), symbol) =>
      println(s"Unable to fetch char data for $symbol")
      "x"
  }
  .runWith(FileIO.toPath(new File(outputPath).toPath, Set(StandardOpenOption.APPEND)))
  .onComplete { _ =>
    bufferedSource.close
    actorSystem.terminate()
  }
This is the problematic line:
runWith(FileIO.toPath(new File(outputPath).toPath, Set(StandardOpenOption.APPEND)))
which doesn't compile, and the compiler gives me this mysterious-looking error:
Type mismatch, expected Graph[SinkShape[Any, NotInferedMat2], actual Sink[ByeString, Future[IOResult]]
If I change the sink to Sink.ignore or println(_) it works.
I'd appreciate some more detailed explanation.
As the compiler is indicating, the types don't match. In the call to .map...
.map {
case (Success(response), _) =>
Unmarshal(response.entity).to[Company]
case (Failure(ex), symbol) =>
println(s"Unable to fetch char data for $symbol")
"x"
}
...you're returning either a Future[Company] (the result of Unmarshal) or a String, so the compiler infers the closest common supertype (the "least upper bound") to be Any. The Sink, however, expects input elements of type ByteString, not Any.
One approach is to send the response to the file sink without unmarshalling the response:
Source(listOfSymbols)
  .mapAsync(parallelism) {
    ...
  }
  .via(connectionPool)
  // the pool emits (Try[HttpResponse], String) pairs; keep the successful responses
  .collect { case (Success(response), _) => response.entity.dataBytes } // entity.dataBytes is a Source[ByteString, _]
  .flatMapConcat(identity)
  .runWith(FileIO.toPath(...))
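If you do want the unmarshalled Company values in the file, a sketch of an alternative (assuming Company has a usable toString, or you substitute your own encoder) is to resolve the unmarshalling Future with mapAsync and turn each result into a ByteString before the sink:
Source(listOfSymbols)
  .mapAsync(parallelism) {
    ...
  }
  .via(connectionPool)
  .collect { case (Success(response), _) => response }                 // drop failed requests
  .mapAsync(parallelism)(resp => Unmarshal(resp.entity).to[Company])   // Future[Company] -> Company
  .map(company => ByteString(company.toString + "\n"))                 // element type now matches the sink
  .runWith(FileIO.toPath(new File(outputPath).toPath, Set(StandardOpenOption.APPEND)))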

How to save streaming data using Akka Persistence

I use StreamRefs to establish streaming connections between actors in the cluster. Currently, on the writing node, I save incoming messages to a log file manually, but I wonder whether it is possible to replace this with a persistent Sink for writing and a persistent Source for reading (on actor startup) backed by the Akka Persistence journal. I've been thinking of replacing the log file sink with a persistent actor's persist { evt => ... }, but since it executes asynchronously I'll lose the backpressure. So is it possible to write streaming data with backpressure into the Akka Persistence journal and read this data back in a streaming manner on actor recovery?
Current implementation:
object Writer {
  case class WriteSinkRequest(userId: String)
  case class WriteSinkReady(userId: String, sinkRef: SinkRef[ByteString])
  case class ReadSourceRequest(userId: String)
  case class ReadSourceReady(userId: String, sourceRef: SourceRef[ByteString])
}
class Writer extends Actor {
  // code omitted
  val logsDir = "logs"
  val path = Files.createDirectories(FileSystems.getDefault.getPath(logsDir))
  def logFile(id: String) = {
    path.resolve(id)
  }
  def logFileSink(logId: String): Sink[ByteString, Future[IOResult]] = FileIO.toPath(logFile(logId), Set(CREATE, WRITE, APPEND))
  def logFileSource(logId: String): Source[ByteString, Future[IOResult]] = FileIO.fromPath(logFile(logId))
  override def receive: Receive = {
    case WriteSinkRequest(userId) =>
      // obtain the sink you want to offer:
      val sink = logFileSink(userId)
      // materialize the SinkRef (the remote is like a source of data for us):
      val ref: Future[SinkRef[ByteString]] = StreamRefs.sinkRef[ByteString]().to(sink).run()
      // wrap the SinkRef in some domain message, such that the sender knows what sink it is
      val reply: Future[WriteSinkReady] = ref.map(WriteSinkReady(userId, _))
      // reply to sender
      reply.pipeTo(sender())
    case ReadSourceRequest(userId) =>
      val source = logFileSource(userId)
      val ref: Future[SourceRef[ByteString]] = source.runWith(StreamRefs.sourceRef())
      val reply: Future[ReadSourceReady] = ref.map(ReadSourceReady(userId, _))
      reply pipeTo sender()
  }
}
P.S. Is it possible to create not a "save-to-journal" sink, but flow:
incoming data to write ~> save to persistence journal ~> data that was written?
One idea for streaming data to a persistent actor in a backpressured fashion is to use Sink.actorRefWithAck: have the actor send an acknowledgement message when it has persisted a message. This would look something like the following:
// ...
case class WriteSinkReady(userId: String, sinkRef: SinkRef[MyMsg])
// ...
def receive = {
  case WriteSinkRequest(userId) =>
    val persistentActor: ActorRef = ??? // a persistent actor that handles MyMsg messages
                                        // as well as the messages used in persistentSink
    val persistentSink: Sink[MyMsg, NotUsed] = Sink.actorRefWithAck[MyMsg](
      persistentActor,
      /* additional parameters: see the docs */
    )
    val ref: Future[SinkRef[MyMsg]] = StreamRefs.sinkRef[MyMsg]().to(persistentSink).run()
    val reply: Future[WriteSinkReady] = ref.map(WriteSinkReady(userId, _))
    reply.pipeTo(sender())
  case ReadSourceRequest(userId) =>
    // ...
}
The above example uses a custom case class named MyMsg instead of ByteString.
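A minimal sketch of the receiving side could look like the following; the protocol objects (StreamInit, StreamAck, StreamComplete, StreamFailure) are names you choose yourself and pass to Sink.actorRefWithAck, not part of Akka:
import akka.persistence.PersistentActor
case object StreamInit
case object StreamAck
case object StreamComplete
final case class StreamFailure(ex: Throwable)
class PersistingActor(override val persistenceId: String) extends PersistentActor {
  override def receiveCommand: Receive = {
    case StreamInit =>
      sender() ! StreamAck               // signal readiness for the first element
    case msg: MyMsg =>
      persist(msg) { _ =>
        sender() ! StreamAck             // ack only after the event is persisted: this is the backpressure
      }
    case StreamComplete =>
      context.stop(self)
    case StreamFailure(ex) =>
      context.stop(self)                 // or log and escalate
  }
  override def receiveRecover: Receive = {
    case _: MyMsg => // rebuild state from the journal on startup
  }
}
// wiring, with the names above:
// Sink.actorRefWithAck[MyMsg](persistentActor, StreamInit, StreamAck, StreamComplete, StreamFailure.apply)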
In the sender, assuming it's an actor:
def receive = {
case WriteSinkReady(userId, sinkRef) =>
source.runWith(sinkRef) // source is a Source[MyMsg, _]
// ...
}
The materialized stream in the sender will send the messages to the persistent actor.

Akka Stream return object from Sink

I've got a SourceQueue. When I offer an element to it, I want it to pass through the stream, and when it reaches the Sink I want the output returned to the code that offered this element (similar to how Sink.head returns an element to the RunnableGraph.run() call).
How do I achieve this? A simple example of my problem would be:
val source = Source.queue[String](100, OverflowStrategy.fail)
val flow = Flow[String].map(element => s"Modified $element")
val sink = Sink.ReturnTheStringSomehow
val graph = source.via(flow).to(sink).run()
val x = graph.offer("foo")
println(x) // Output should be "Modified foo"
val y = graph.offer("bar")
println(y) // Output should be "Modified bar"
val z = graph.offer("baz")
println(z) // Output should be "Modified baz"
Edit: For the example I have given in this question, Vladimir Matveev provided the best answer. However, it should be noted that this solution only works if the elements are going into the sink in the same order they were offered to the source. If this cannot be guaranteed, the order of the elements in the sink may differ and the outcome might be different from what is expected.
I believe it is simpler to use the already existing primitive for pulling values from a stream, called Sink.queue. Here is an example:
val source = Source.queue[String](128, OverflowStrategy.fail)
val flow = Flow[String].map(element => s"Modified $element")
val sink = Sink.queue[String]().withAttributes(Attributes.inputBuffer(1, 1))
val (sourceQueue, sinkQueue) = source.via(flow).toMat(sink)(Keep.both).run()
def getNext: String = Await.result(sinkQueue.pull(), 1.second).get
sourceQueue.offer("foo")
println(getNext)
sourceQueue.offer("bar")
println(getNext)
sourceQueue.offer("baz")
println(getNext)
It does exactly what you want.
Note that setting the inputBuffer attribute for the queue sink may or may not be important for your use case - if you don't set it, the buffer will be zero-sized and the data won't flow through the stream until you invoke the pull() method on the sink.
sinkQueue.pull() yields a Future[Option[T]], which will be completed successfully with Some if the sink receives an element or with a failure if the stream fails. If the stream completes normally, it will be completed with None. In this particular example I'm ignoring this by using Option.get but you would probably want to add custom logic to handle this case.
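For instance, a small sketch of such handling, keeping the blocking Await from the example above only for brevity:
def printNext(): Unit =
  Await.result(sinkQueue.pull(), 1.second) match {
    case Some(element) => println(element)
    case None          => println("stream completed")
  }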
Well, you know what the offer() method returns if you take a look at its definition :) What you can do is create a Source.queue[(Promise[String], String)], write a helper function that pushes a pair to the stream via offer, make sure the offer doesn't fail (the queue might be full), then complete the promise inside your stream and use the promise's future to observe completion in external code.
I do this to throttle the rate of calls to an external API that is used from multiple places in my project.
Here is how it looked in my project before Typesafe added Hub sources to akka:
import scala.concurrent.{Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global // any ExecutionContext for the Future combinators below
import java.util.concurrent.ConcurrentLinkedDeque
import akka.stream.scaladsl.{Keep, Sink, Source}
import akka.stream.{OverflowStrategy, QueueOfferResult}
import scala.util.Success
private val queue = Source.queue[(Promise[String], String)](100, OverflowStrategy.backpressure)
  .toMat(Sink.foreach({ case (p, param) =>
    p.complete(Success(param.reverse))
  }))(Keep.left)
  .run
private val futureDeque = new ConcurrentLinkedDeque[Future[String]]()
private def sendQueuedRequest(request: String): Future[String] = {
  val p = Promise[String]
  val offerFuture = queue.offer(p -> request)
  def addToQueue(future: Future[String]): Future[String] = {
    futureDeque.addLast(future)
    future.onComplete(_ => futureDeque.remove(future))
    future
  }
  offerFuture.flatMap {
    case QueueOfferResult.Enqueued =>
      addToQueue(p.future)
  }.recoverWith {
    case ex =>
      val first = futureDeque.pollFirst()
      if (first != null)
        addToQueue(first.flatMap(_ => sendQueuedRequest(request)))
      else
        sendQueuedRequest(request)
  }
}
I realize that the blocking synchronized queue may be a bottleneck and may grow indefinitely, but because the API calls in my project are made only from other akka streams, which are backpressured, I never have more than a dozen items in futureDeque. Your situation may differ.
If you create a MergeHub.source[(Promise[String], String)]() instead, you'll get a reusable sink. Then every time you need to process an item you create a complete graph and run it. In that case you won't need the hacky Java container to queue requests.
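A sketch of that MergeHub variant, using the same Promise-pair element type (the reversing Sink.foreach again stands in for the real work, and an implicit materializer is assumed to be in scope):
import akka.NotUsed
import akka.stream.scaladsl.{MergeHub, Sink, Source}
// Materialize once; the resulting Sink can be attached to from anywhere, any number of times.
val reusableSink: Sink[(Promise[String], String), NotUsed] =
  MergeHub.source[(Promise[String], String)](perProducerBufferSize = 16)
    .to(Sink.foreach { case (p, param) => p.complete(Success(param.reverse)) })
    .run()
// Each request runs a tiny graph into the shared hub and waits on its own promise.
def sendHubRequest(request: String): Future[String] = {
  val p = Promise[String]
  Source.single(p -> request).runWith(reusableSink)
  p.future
}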

Akka-Streams ActorPublisher does not receive any Request messages

I am trying to continuously read the wikipedia IRC channel using this lib: https://github.com/implydata/wikiticker
I created a custom Akka Publisher, which will be used in my system as a Source.
Here are some of my classes:
class IrcPublisher() extends ActorPublisher[String] {
  import scala.collection._
  var queue: mutable.Queue[String] = mutable.Queue()
  override def receive: Actor.Receive = {
    case Publish(s) =>
      println(s"->MSG, isActive = $isActive, totalDemand = $totalDemand")
      queue.enqueue(s)
      publishIfNeeded()
    case Request(cnt) =>
      println("Request: " + cnt)
      publishIfNeeded()
    case Cancel =>
      println("Cancel")
      context.stop(self)
    case _ =>
      println("Hm...")
  }
  def publishIfNeeded(): Unit = {
    while (queue.nonEmpty && isActive && totalDemand > 0) {
      println("onNext")
      onNext(queue.dequeue())
    }
  }
}
object IrcPublisher {
  case class Publish(data: String)
}
I am creating all these objects like so:
def createSource(wikipedias: Seq[String]) {
  val dataPublisherRef = system.actorOf(Props[IrcPublisher])
  val dataPublisher = ActorPublisher[String](dataPublisherRef)
  val listener = new MessageListener {
    override def process(message: Message) = {
      dataPublisherRef ! Publish(Jackson.generate(message.toMap))
    }
  }
  val ticker = new IrcTicker(
    "irc.wikimedia.org",
    "imply",
    wikipedias map (x => s"#$x.wikipedia"),
    Seq(listener)
  )
  ticker.start() // if I comment this...
  Thread.currentThread().join() //... and this I get Request(...)
  Source.fromPublisher(dataPublisher)
}
So the problem I am facing is with this Source object. Although this implementation works well with other sources (for example, reading from a local file), the ActorPublisher doesn't receive Request() messages.
If I comment out the two marked lines, I can see that my actor receives the Request(count) message from my flow. Otherwise all messages are pushed into the queue but never into my flow (so I can see the MSG messages printed).
I think it's something with multithreading/synchronization here.
I am not familiar enough with wikiticker to solve your problem as given. One question I would have is: why is it necessary to join to the current thread?
However, I think you have overcomplicated the usage of Source. It would be easier for you to work with the stream as a whole rather than create a custom ActorPublisher.
You can use Source.actorRef to materialize a stream into an ActorRef and work with that ActorRef. This allows you to let akka code do the enqueueing/dequeueing onto the buffer while you focus on the "business logic".
Say, for example, your entire stream is only to filter lines above a certain length and print them to the console. This could be accomplished with:
def dispatchIRCMessages(actorRef: ActorRef) = {
  val ticker =
    new IrcTicker("irc.wikimedia.org",
                  "imply",
                  wikipedias map (x => s"#$x.wikipedia"),
                  Seq(new MessageListener {
                    override def process(message: Message) =
                      actorRef ! Jackson.generate(message.toMap) // send the raw String: the Source below is typed String
                  }))
  ticker.start()
  Thread.currentThread().join()
}
//these variables control the buffer behavior
val bufferSize = 1024
val overFlowStrategy = akka.stream.OverflowStrategy.dropHead
val minMessageSize = 32
//no need for a custom Publisher/Queue
val streamRef =
  Source.actorRef[String](bufferSize, overFlowStrategy)
    .via(Flow[String].filter(_.size > minMessageSize))
    .to(Sink.foreach[String](println))
    .run()
dispatchIRCMessages(streamRef)
The dispatchIRCMessages function has the added benefit that it works with any ActorRef, so you aren't required to work only with streams/publishers.
Hopefully this solves your underlying problem...
I think the main problem is Thread.currentThread().join(). This line will 'hang' the current thread because the thread ends up waiting for itself to die. Please read https://docs.oracle.com/javase/8/docs/api/java/lang/Thread.html#join-long- .
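If the only purpose of the join was to keep the JVM alive, a sketch of a common alternative is to block on the actor system's termination instead (assuming an ActorSystem named system is in scope):
import scala.concurrent.Await
import scala.concurrent.duration.Duration
ticker.start()
Await.result(system.whenTerminated, Duration.Inf) // returns only once the ActorSystem is shut down elsewhere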

Stream csv file to browser using akka stream and spray

How do I wire up a Source[String, Unit] to a streaming actor?
I think a modified version of StreamingActor from https://gist.github.com/whysoserious/96050c6b4bd5fedb6e33 will work well, but I'm having difficulty connecting the pieces.
Given source: Source[String, Unit] and ctx: RequestContext, I think the modified StreamingActor should wire up with actorRefFactory.actorOf(fromSource(source, ctx)).
For reference, the gist above:
import akka.actor._
import akka.util.ByteString
import spray.http.HttpEntity.Empty
import spray.http.MediaTypes._
import spray.http._
import spray.routing.{HttpService, RequestContext, SimpleRoutingApp}
object StreamingActor {
  // helper methods
  def fromString(iterable: Iterable[String], ctx: RequestContext): Props = {
    fromHttpData(iterable.map(HttpData.apply), ctx)
  }
  def fromStringAndCharset(iterable: Iterable[String], ctx: RequestContext, charset: HttpCharset): Props = {
    fromHttpData(iterable.map(HttpData.apply), ctx)
  }
  def fromByteArray(iterable: Iterable[Array[Byte]], ctx: RequestContext): Props = {
    fromHttpData(iterable.map(HttpData.apply), ctx)
  }
  def fromByteString(iterable: Iterable[ByteString], ctx: RequestContext): Props = {
    fromHttpData(iterable.map(HttpData.apply), ctx)
  }
  def fromHttpData(iterable: Iterable[HttpData], ctx: RequestContext): Props = {
    Props(new StreamingActor(iterable, ctx))
  }
  // initial message sent by StreamingActor to itself
  private case object FirstChunk
  // confirmation that given chunk was sent to client
  private case object ChunkAck
}
class StreamingActor(chunks: Iterable[HttpData], ctx: RequestContext) extends Actor with HttpService with ActorLogging {
  import StreamingActor._
  def actorRefFactory = context
  val chunkIterator: Iterator[HttpData] = chunks.iterator
  self ! FirstChunk
  def receive = {
    // send first chunk to client
    case FirstChunk if chunkIterator.hasNext =>
      val responseStart = HttpResponse(entity = HttpEntity(`text/html`, chunkIterator.next()))
      ctx.responder ! ChunkedResponseStart(responseStart).withAck(ChunkAck)
    // data stream is empty. Respond with Content-Length: 0 and stop
    case FirstChunk =>
      ctx.responder ! HttpResponse(entity = Empty)
      context.stop(self)
    // send next chunk to client
    case ChunkAck if chunkIterator.hasNext =>
      val nextChunk = MessageChunk(chunkIterator.next())
      ctx.responder ! nextChunk.withAck(ChunkAck)
    // all chunks were sent. stop.
    case ChunkAck =>
      ctx.responder ! ChunkedMessageEnd
      context.stop(self)
    //
    case x => unhandled(x)
  }
}
I think your use of a StreamingActor over-complicates the underlying problem you are trying to solve. Further, the StreamingActor in the question sends a separate message to the responder for each chunk of a single HttpRequest. This is less efficient than simply returning one HttpResponse with an HttpEntity.Chunked as the Entity for your data stream source.
General Concurrency Design
Actors are for state, e.g. maintaining a running counter between connections. And even then an Agent covers a lot of ground with the additional benefit of type checking (unlike Actor.receive which turns the dead letter mailbox into your only type checker at runtime).
Concurrent computation, not state, should be handled with (in order):
Futures as a first consideration: composable, compile-time type checked, and the best choice for most cases.
akka Streams: composable, compile-time type checked, and very useful, but there is a lot of overhead resulting from the convenient back-pressure functionality. Streams are also how HttpResponse entities are formed, as demonstrated below.
Streaming CSV Files
Your underlying question is how to stream a csv file to an http client using akka streams. You can begin by creating a data Source and embedding it within an HttpResponse:
def lines() = scala.io.Source.fromFile("DataFile.csv").getLines()
import akka.util.ByteString
import akka.http.model.HttpEntity
def chunkSource : Source[HttpEntity.ChunkStreamPart, Unit] =
  akka.stream.scaladsl.Source(lines)
    .map(ByteString.apply)
    .map(HttpEntity.ChunkStreamPart.apply)
def httpFileResponse =
  HttpResponse(entity = HttpEntity.Chunked(ContentTypes.`text/plain`, chunkSource))
You can then provide this response for any requests:
val fileRequestHandler: HttpRequest => HttpResponse = {
  case HttpRequest(GET, Uri.Path("/csvFile"), _, _, _) => httpFileResponse
}
Then embed the fileRequestHandler into your server routing logic.
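For completeness, a minimal sketch of serving it with the low-level server API follows; this assumes you are on akka-http rather than spray (as the akka.http.model import above suggests), and the interface and port are placeholders:
import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.stream.ActorMaterializer
implicit val system = ActorSystem("csv-server")
implicit val materializer = ActorMaterializer()
Http().bindAndHandleSync(fileRequestHandler, interface = "localhost", port = 8080)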