How to save streaming data using Akka Persistence - scala

I use StreamRefs to establish streaming connections between actors in the cluster. Currently, on the writing node, I save incoming messages to a log file manually, but I wonder whether I can replace that with a persistent Sink for writing and a persistent Source for reading on actor startup, backed by the Akka Persistence journal. I've been thinking of replacing the log-file sink with a persistent actor's persist { evt => ... }, but since it executes asynchronously I would lose backpressure. So is it possible to write streaming data with backpressure into the Akka Persistence journal and read it back in a streaming manner on actor recovery?
Current implementation:
object Writer {
  case class WriteSinkRequest(userId: String)
  case class WriteSinkReady(userId: String, sinkRef: SinkRef[ByteString])
  case class ReadSourceRequest(userId: String)
  case class ReadSourceReady(userId: String, sourceRef: SourceRef[ByteString])
}

class Writer extends Actor {
  // code omitted

  val logsDir = "logs"
  val path = Files.createDirectories(FileSystems.getDefault.getPath(logsDir))

  def logFile(id: String) = {
    path.resolve(id)
  }

  def logFileSink(logId: String): Sink[ByteString, Future[IOResult]] =
    FileIO.toPath(logFile(logId), Set(CREATE, WRITE, APPEND))

  def logFileSource(logId: String): Source[ByteString, Future[IOResult]] =
    FileIO.fromPath(logFile(logId))

  override def receive: Receive = {
    case WriteSinkRequest(userId) =>
      // obtain the sink you want to offer:
      val sink = logFileSink(userId)
      // materialize the SinkRef (the remote side acts as a source of data for us):
      val ref: Future[SinkRef[ByteString]] = StreamRefs.sinkRef[ByteString]().to(sink).run()
      // wrap the SinkRef in a domain message, so the sender knows which sink it is
      val reply: Future[WriteSinkReady] = ref.map(WriteSinkReady(userId, _))
      // reply to sender
      reply.pipeTo(sender())

    case ReadSourceRequest(userId) =>
      val source = logFileSource(userId)
      val ref: Future[SourceRef[ByteString]] = source.runWith(StreamRefs.sourceRef())
      val reply: Future[ReadSourceReady] = ref.map(ReadSourceReady(userId, _))
      reply.pipeTo(sender())
  }
}
P.S. Is it possible to create not a "save-to-journal" sink, but a flow:
incoming data to write ~> save to persistence journal ~> data that was written?

One idea for streaming data to a persistent actor in a backpressured fashion is to use Sink.actorRefWithAck: have the actor send an acknowledgement message when it has persisted a message. This would look something like the following:
// ...
case class WriteSinkReady(userId: String, sinkRef: SinkRef[MyMsg])
// ...

def receive = {
  case WriteSinkRequest(userId) =>
    // a persistent actor that handles MyMsg messages,
    // as well as the messages used in persistentSink
    val persistentActor: ActorRef = ???
    val persistentSink: Sink[MyMsg, NotUsed] = Sink.actorRefWithAck[MyMsg](
      persistentActor,
      /* additional parameters: see the docs */
    )
    val ref: Future[SinkRef[MyMsg]] = StreamRefs.sinkRef[MyMsg]().to(persistentSink).run()
    val reply: Future[WriteSinkReady] = ref.map(WriteSinkReady(userId, _))
    reply.pipeTo(sender())

  case ReadSourceRequest(userId) =>
    // ...
}
The above example uses a custom case class named MyMsg instead of ByteString.
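For reference, here is a minimal sketch of what the acknowledging persistent actor and a fully parameterized sink might look like. The Init/Ack/Complete handshake messages, the MyMsg definition, and the persistence id are assumptions for illustration, not part of the original code:

import akka.NotUsed
import akka.actor.{ActorRef, Status}
import akka.persistence.PersistentActor
import akka.stream.scaladsl.Sink

case class MyMsg(payload: String) // hypothetical stream element

// hypothetical handshake messages for Sink.actorRefWithAck
case object Init
case object Ack
case object Complete

class JournalWriter(userId: String) extends PersistentActor {
  override def persistenceId: String = s"journal-writer-$userId"

  override def receiveCommand: Receive = {
    case Init =>
      sender() ! Ack // tell the stream we're ready for the first element
    case msg: MyMsg =>
      // persist is asynchronous; acknowledging only inside the callback
      // is what preserves backpressure all the way to the remote source
      persist(msg) { _ =>
        sender() ! Ack
      }
    case Complete          => // stream completed normally
    case Status.Failure(_) => // stream failed upstream
  }

  override def receiveRecover: Receive = {
    case _: MyMsg => // rebuild in-memory state from the journal on startup
  }
}

def persistentSink(persistentActor: ActorRef): Sink[MyMsg, NotUsed] =
  Sink.actorRefWithAck[MyMsg](
    persistentActor,
    onInitMessage = Init,
    ackMessage = Ack,
    onCompleteMessage = Complete
  )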
In the sender, assuming it's an actor:
def receive = {
  case WriteSinkReady(userId, sinkRef) =>
    source.runWith(sinkRef) // source is a Source[MyMsg, _]
  // ...
}
The materialized stream in the sender will send the messages to the persistent actor.
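As for the P.S. (a flow that emits elements only once they have been written to the journal), one hedged option is to ask the persistent actor and let it reply from inside its persist callback. A minimal sketch, assuming the actor replies with the persisted message itself:

import akka.NotUsed
import akka.actor.ActorRef
import akka.pattern.ask
import akka.stream.scaladsl.Flow
import akka.util.Timeout
import scala.concurrent.duration._

implicit val timeout: Timeout = 5.seconds

// mapAsync(parallelism = 1) allows at most one outstanding write, so demand
// stays bounded and backpressure is preserved end to end. The persistent
// actor is assumed to reply with the message from within persist(msg) { ... }.
def persistFlow(persistentActor: ActorRef): Flow[MyMsg, MyMsg, NotUsed] =
  Flow[MyMsg].mapAsync(parallelism = 1) { msg =>
    (persistentActor ? msg).mapTo[MyMsg]
  }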

Related

Akka-Streams ActorPublisher does not receive any Request messages

I am trying to continuously read the wikipedia IRC channel using this lib: https://github.com/implydata/wikiticker
I created a custom Akka Publisher, which will be used in my system as a Source.
Here are some of my classes:
class IrcPublisher() extends ActorPublisher[String] {
  import scala.collection._

  var queue: mutable.Queue[String] = mutable.Queue()

  override def receive: Actor.Receive = {
    case Publish(s) =>
      println(s"->MSG, isActive = $isActive, totalDemand = $totalDemand")
      queue.enqueue(s)
      publishIfNeeded()
    case Request(cnt) =>
      println("Request: " + cnt)
      publishIfNeeded()
    case Cancel =>
      println("Cancel")
      context.stop(self)
    case _ =>
      println("Hm...")
  }

  def publishIfNeeded(): Unit = {
    while (queue.nonEmpty && isActive && totalDemand > 0) {
      println("onNext")
      onNext(queue.dequeue())
    }
  }
}

object IrcPublisher {
  case class Publish(data: String)
}
I am creating all these objects like so:
def createSource(wikipedias: Seq[String]) {
  val dataPublisherRef = system.actorOf(Props[IrcPublisher])
  val dataPublisher = ActorPublisher[String](dataPublisherRef)

  val listener = new MessageListener {
    override def process(message: Message) = {
      dataPublisherRef ! Publish(Jackson.generate(message.toMap))
    }
  }

  val ticker = new IrcTicker(
    "irc.wikimedia.org",
    "imply",
    wikipedias map (x => s"#$x.wikipedia"),
    Seq(listener)
  )

  ticker.start() // if I comment this...
  Thread.currentThread().join() //... and this I get Request(...)

  Source.fromPublisher(dataPublisher)
}
So the problem I am facing is with this Source object. Although this implementation works well with other sources (for example, from a local file), the ActorPublisher doesn't receive Request() messages.
If I comment out the two marked lines, I can see that my actor receives the Request(count) message from my flow. Otherwise, all messages are pushed into the queue but never into my flow (so I can see the MSG messages printed).
I think it's something to do with multithreading/synchronization here.
I am not familiar enough with wikiticker to solve your problem as given. One question I would have is: why is it necessary to join on the current thread?
However, I think you have overcomplicated the usage of Source. It would be easier to work with the stream as a whole rather than create a custom ActorPublisher.
You can use Source.actorRef to materialize a stream into an ActorRef and work with that ActorRef. This lets Akka handle the enqueueing/dequeueing onto the buffer while you focus on the business logic.
Say, for example, your entire stream is only meant to filter lines above a certain length and print them to the console. This could be accomplished with:
def dispatchIRCMessages(actorRef: ActorRef) = {
  val ticker =
    new IrcTicker(
      "irc.wikimedia.org",
      "imply",
      wikipedias map (x => s"#$x.wikipedia"),
      Seq(new MessageListener {
        override def process(message: Message) =
          // send the raw String: the materialized Source below is typed String
          actorRef ! Jackson.generate(message.toMap)
      })
    )
  ticker.start()
  Thread.currentThread().join()
}

//these variables control the buffer behavior
val bufferSize = 1024
val overFlowStrategy = akka.stream.OverflowStrategy.dropHead
val minMessageSize = 32

//no need for a custom Publisher/Queue
val streamRef =
  Source.actorRef[String](bufferSize, overFlowStrategy)
    .via(Flow[String].filter(_.size > minMessageSize))
    .to(Sink.foreach[String](println))
    .run()

dispatchIRCMessages(streamRef)
dispatchIRCMessages has the added benefit that it works with any ActorRef, so you aren't required to only work with streams/publishers.
Hopefully this solves your underlying problem...
I think the main problem is Thread.currentThread().join(). This line will hang the current thread, because the thread is waiting for itself to die. Please read https://docs.oracle.com/javase/8/docs/api/java/lang/Thread.html#join-long- .
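A minimal illustration of the deadlock (nothing wikiticker-specific, just plain threads):

// join() waits for the target thread to terminate; joining the current
// thread means waiting for something that cannot happen while we wait.
val current = Thread.currentThread()
current.join() // blocks forever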

Akka-Http Websockets: How to Send consumers the same stream of data

I have a WebSocket that clients can connect to, and I also have a stream of data using akka-streams. How can I make it so that all clients get the same data? At the moment they seem to be racing for the data.
Thanks
One way to do this is to have an actor that extends ActorPublisher and have it subscribe to some message type on the system event stream.
class MyPublisher extends ActorPublisher[MyData] {
  override def preStart = {
    context.system.eventStream.subscribe(self, classOf[MyData])
  }

  override def receive: Receive = {
    case msg: MyData ⇒
      if (isActive && totalDemand > 0) {
        // Pushes the message onto the stream
        onNext(msg)
      }
  }
}

object MyPublisher {
  def props(implicit ctx: ExecutionContext): Props = Props(new MyPublisher())
}

case class MyData(data: String)
You can then use that actor as the source for the stream:
val dataSource = Source.actorPublisher[MyData](MyPublisher.props(someExecutionContext))
You can then create a flow from that data source and apply a transform to convert the data into a WebSocket message:
val myFlow = Flow.fromSinkAndSource(Sink.ignore, dataSource map { d => TextMessage.Strict(d.data) })
Then you can use that flow in your route handling.
path("readings") {
handleWebsocketMessages(myFlow)
}
From the processing of the original stream, you can then publish the data to the event stream, and any instance of that actor will pick it up and put it onto the stream that its websocket is being served from.
val actorSystem = ActorSystem("foo")
val otherSource = Source.fromIterator(() => List(MyData("a"), MyData("b")).iterator)
// publish each element itself to the event stream
otherSource.runForeach { msg ⇒ actorSystem.eventStream.publish(msg) }
Each socket will then have its own instance of the actor to provide it with data all coming from a single source.

Stream csv file to browser using akka stream and spray

How do I wire up a Source[String, Unit] to a streaming actor?
I think a modified version of StreamingActor from https://gist.github.com/whysoserious/96050c6b4bd5fedb6e33 will work well, but I'm having difficulty connecting the pieces.
Given source: Source[String, Unit] and ctx: RequestContext, I think the modified StreamingActor should wire up with actorRefFactory.actorOf(fromSource(source, ctx)).
For reference, the gist above:
import akka.actor._
import akka.util.ByteString
import spray.http.HttpEntity.Empty
import spray.http.MediaTypes._
import spray.http._
import spray.routing.{HttpService, RequestContext, SimpleRoutingApp}

object StreamingActor {

  // helper methods
  def fromString(iterable: Iterable[String], ctx: RequestContext): Props = {
    fromHttpData(iterable.map(HttpData.apply), ctx)
  }

  def fromStringAndCharset(iterable: Iterable[String], ctx: RequestContext, charset: HttpCharset): Props = {
    fromHttpData(iterable.map(HttpData.apply), ctx)
  }

  def fromByteArray(iterable: Iterable[Array[Byte]], ctx: RequestContext): Props = {
    fromHttpData(iterable.map(HttpData.apply), ctx)
  }

  def fromByteString(iterable: Iterable[ByteString], ctx: RequestContext): Props = {
    fromHttpData(iterable.map(HttpData.apply), ctx)
  }

  def fromHttpData(iterable: Iterable[HttpData], ctx: RequestContext): Props = {
    Props(new StreamingActor(iterable, ctx))
  }

  // initial message sent by StreamingActor to itself
  private case object FirstChunk

  // confirmation that given chunk was sent to client
  private case object ChunkAck
}

class StreamingActor(chunks: Iterable[HttpData], ctx: RequestContext) extends Actor with HttpService with ActorLogging {
  import StreamingActor._

  def actorRefFactory = context

  val chunkIterator: Iterator[HttpData] = chunks.iterator

  self ! FirstChunk

  def receive = {
    // send first chunk to client
    case FirstChunk if chunkIterator.hasNext =>
      val responseStart = HttpResponse(entity = HttpEntity(`text/html`, chunkIterator.next()))
      ctx.responder ! ChunkedResponseStart(responseStart).withAck(ChunkAck)

    // data stream is empty. Respond with Content-Length: 0 and stop
    case FirstChunk =>
      ctx.responder ! HttpResponse(entity = Empty)
      context.stop(self)

    // send next chunk to client
    case ChunkAck if chunkIterator.hasNext =>
      val nextChunk = MessageChunk(chunkIterator.next())
      ctx.responder ! nextChunk.withAck(ChunkAck)

    // all chunks were sent. stop.
    case ChunkAck =>
      ctx.responder ! ChunkedMessageEnd
      context.stop(self)

    case x => unhandled(x)
  }
}
I think your use of a StreamingActor over-complicates the underlying problem you are trying to solve. Further, the StreamingActor in the question will produce multiple HttpResponse values, one for each chunk, for a single HttpRequest. This is inefficient, because you can simply return one HttpResponse with an HttpEntity.Chunked as the entity for your data stream source.
General Concurrency Design
Actors are for state, e.g. maintaining a running counter between connections. And even then, an Agent covers a lot of ground with the added benefit of type checking (unlike Actor.receive, which turns the dead-letter mailbox into your only type checker, at runtime); see the Agent sketch after the list below.
Concurrent computation, not state, should be handled with (in order):
Futures as a first consideration: composable, with compile-time type checking, and the best choice for most cases.
Akka Streams: composable, with compile-time type checking, and very useful, but there is a lot of overhead resulting from the convenient back-pressure functionality. Streams are also how HttpResponse entities are formed, as demonstrated below.
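For illustration, a minimal sketch of the Agent alternative mentioned above, using the akka-agent module (the counter is a made-up example):

import akka.agent.Agent
import scala.concurrent.ExecutionContext.Implicits.global

// a running counter shared between connections: updates are asynchronous
// and ordered, reads are synchronous, and the value type is checked at
// compile time (unlike messages flowing through an Actor's receive)
val counter = Agent(0)
counter send (_ + 1)         // enqueue an increment
val current: Int = counter() // read the current value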
Streaming CSV Files
Your underlying question is how to stream a csv file to an http client using Streams. You can begin by creating a data Source and embedding it within an HttpResponse:
def lines() = scala.io.Source.fromFile("DataFile.csv").getLines()

import akka.util.ByteString
import akka.http.model.HttpEntity

def chunkSource: Source[HttpEntity.ChunkStreamPart, Unit] =
  akka.stream.scaladsl.Source(lines)
    .map(ByteString.apply)
    .map(HttpEntity.ChunkStreamPart.apply)

def httpFileResponse =
  HttpResponse(entity = HttpEntity.Chunked(ContentTypes.`text/plain`, chunkSource))
You can then provide this response for any requests:
val fileRequestHandler: PartialFunction[HttpRequest, HttpResponse] = {
  case HttpRequest(GET, Uri.Path("/csvFile"), _, _, _) => httpFileResponse
}
Then embed the fileRequestHandler into your server routing logic.
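This answer predates the stable akka-http server API, but the embedding might look roughly like the following sketch, which assumes the later akka.http.scaladsl packages (the system name and port are made up):

import akka.actor.ActorSystem
import akka.http.scaladsl.Http
import akka.http.scaladsl.model._
import akka.stream.ActorMaterializer

implicit val system = ActorSystem("csv-server")
implicit val materializer = ActorMaterializer()

// answer any request that isn't the csv route with a 404
val handler: HttpRequest => HttpResponse =
  fileRequestHandler.orElse[HttpRequest, HttpResponse] {
    case _ => HttpResponse(StatusCodes.NotFound)
  }

Http().bindAndHandleSync(handler, interface = "localhost", port = 8080)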

Spray route get response from child actor

I am trying to figure out how I can set up a master actor that calls the appropriate children, in support of some spray routes where I am trying to emulate db calls. I am new to akka/spray, so I'm just trying to gain a better understanding of how you would properly set up spray -> actors -> db calls (etc.). I can get the response back from the top-level actor, but when I try to get it back from one actor level below the parent, I can't seem to get anything to work.
Looking at the paths of the actors, it appears that the way I am making the call from my spray route goes through a temp actor. Below is what I have so far for stubbing this out. This has to be just user error/ignorance; I'm just not sure how to proceed. Any suggestions would be appreciated.
The Demo Spray Service and Redis Actor code snippets below show where I am calling the actor from my route, and the actors where I am having the issue (I want my route to get the response from SummaryActor). Thanks!
Boot:
object Boot extends App {
  // we need an ActorSystem to host our application in
  implicit val system = ActorSystem("on-spray-can")

  // create and start our service actor
  val service = system.actorOf(Props[DemoServiceActor], "demo-service")

  implicit val timeout = Timeout(5.seconds)

  // start a new HTTP server on port 8080 with our service actor as the handler
  IO(Http) ? Http.Bind(service, interface = "localhost", port = 8080)
}
Demo Service Actor (For Spray)
class DemoServiceActor extends Actor with Api {
  // the HttpService trait defines only one abstract member, which
  // connects the service's environment to the enclosing actor or test
  def actorRefFactory = context

  // this actor only runs our route, but you could add
  // other things here, like request stream processing
  // or timeout handling
  def receive = handleTimeouts orElse runRoute(route)

  // Used to watch for request timeouts
  // http://spray.io/documentation/1.1.2/spray-routing/key-concepts/timeout-handling/
  def handleTimeouts: Receive = {
    case Timedout(x: HttpRequest) =>
      sender ! HttpResponse(StatusCodes.InternalServerError, "Too late")
  }
}

// Master trait for handling large APIs
// http://stackoverflow.com/questions/14653526/can-spray-io-routes-be-split-into-multiple-controllers
trait Api extends DemoService {
  val route = {
    messageApiRouting
  }
}
Demo Spray Service (Route):
trait DemoService extends HttpService with Actor {
  implicit val timeout = Timeout(5 seconds) // needed for `?` below

  val redisActor = context.actorOf(Props[RedisActor], "redisactor")

  val messageApiRouting =
    path("summary" / Segment / Segment) { (dataset, timeslice) =>
      onComplete(getSummary(redisActor, dataset, timeslice)) {
        case Success(value) => complete(s"The result was $value")
        case Failure(ex)    => complete(s"An error occurred: ${ex.getMessage}")
      }
    }

  def getSummary(redisActor: ActorRef, dataset: String, timeslice: String): Future[String] = Future {
    val dbMessage = DbMessage("summary", dataset + timeslice)
    val future = redisActor ? dbMessage
    val result = Await.result(future, timeout.duration).asInstanceOf[String]
    result
  }
}
Redis Actor (mock; no actual Redis client yet)
class RedisActor extends Actor with ActorLogging {
  // val pool = REDIS
  implicit val timeout = Timeout(5 seconds) // needed for `?` below

  val summaryActor = context.actorOf(Props[SummaryActor], "summaryactor")

  def receive = {
    case msg: DbMessage =>
      msg.query match {
        case "summary" =>
          log.debug("Summary Query Request")
          log.debug(sender.path.toString)
          summaryActor ! msg
      }

    // if no match, log an error
    case _ => log.error("Received unknown message: {} ")
  }
}

class SummaryActor extends Actor with ActorLogging {
  def receive = {
    case msg: DbMessage =>
      log.debug("Summary Actor Received Message")
      // Send back to Spray Route
  }
}
The first problem with your code is that you need to forward from the master actor to the child so that the sender is properly propagated and available for the child to respond to. So change this (in RedisActor):
summaryActor ! msg
To:
summaryActor forward msg
That's the primary issue. Fix that and your code should start working. There is something else that needs attention though. Your getSummary method is currently defined as:
def getSummary(redisActor: ActorRef, dataset: String, timeslice: String): Future[String] =
  Future {
    val dbMessage = DbMessage("summary", dataset + timeslice)
    val future = redisActor ? dbMessage
    val result = Await.result(future, timeout.duration).asInstanceOf[String]
    result
  }
The issue here is that the ask operation (?) already returns a Future, and you are blocking on it to get the result, then wrapping that result in another Future so that you can return a Future for onComplete to work with. You should be able to simplify things by using the Future returned from ask directly, like so:
def getSummary(redisActor: ActorRef, dataset: String, timeslice: String): Future[String] = {
  val dbMessage = DbMessage("summary", dataset + timeslice)
  (redisActor ? dbMessage).mapTo[String]
}
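Note that with the forward fix in place, SummaryActor still has to actually send a reply for the ask to complete. A minimal sketch (the reply payload is a placeholder assumption):

class SummaryActor extends Actor with ActorLogging {
  def receive = {
    case msg: DbMessage =>
      log.debug("Summary Actor Received Message")
      // because RedisActor forwarded the message, sender() here is the
      // temp actor created by the ask in the spray route, so this reply
      // completes that Future
      sender() ! "summary-result" // placeholder payload
  }
}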
Just an important comment on the above approaches: since the getSummary(...) function returns a Future[String] and you call it in the onComplete(...) directive, you need to import:
import ExecutionContext.Implicits.global
That way you will have an ExecutionContext in scope, which Future declares as an implicit parameter. If you don't, you will end up with a type-mismatch error, since onComplete(...) expects an onComplete Future magnet object but you gave it a Future[String].

Scala Akka Consumer/Producer: Return Value

Problem Statement
Assume I have a file with sentences that is processed line by line. In my case, I need to extract Named Entities (Persons, Organizations, ...) from these lines. Unfortunately, the tagger is quite slow, so I decided to parallelize the computation, such that lines can be processed independently of each other and the results collected in a central location.
Current Approach
My current approach uses a single-producer, multiple-consumer concept. I'm relatively new to Akka, but I think my problem description fits well into its capabilities. Let me show you some code:
Producer
The Producer reads the file line by line and sends each line to the Consumer. Once the number of processed lines reaches the total line count, it propagates the result back to WordCount.
class Producer(consumers: ActorRef) extends Actor with ActorLogging {
  var master: Option[ActorRef] = None
  var result = immutable.List[String]()
  var totalLines = 0
  var linesProcessed = 0

  override def receive = {
    case StartProcessing() =>
      master = Some(sender)
      Source.fromFile("sent.txt", "utf-8").getLines.foreach { line =>
        consumers ! Sentence(line)
        totalLines += 1
      }
      context.stop(self)

    case SentenceProcessed(list) =>
      linesProcessed += 1
      result :::= list
      // If we are done, we can propagate the result to the creator
      if (linesProcessed == totalLines) {
        master.map(_ ! result)
      }

    case _ => log.error("message not recognized")
  }
}
Consumer
class Consumer extends Actor with ActorLogging {

  def tokenize(line: String): Seq[String] = {
    line.split(" ").map(_.toLowerCase)
  }

  override def receive = {
    case Sentence(sent) =>
      // Assume: this stands in for the expensive computation
      val tokens = tokenize(sent)
      sender() ! SentenceProcessed(tokens.toList)

    case _ => log.error("message not recognized")
  }
}
WordCount (Master)
class WordCount extends Actor {
  val consumers = context.actorOf(
    Props[Consumer]
      .withRouter(FromConfig())
      .withDispatcher("consumer-dispatcher"),
    "consumers")

  val producer = context.actorOf(Props(new Producer(consumers)), "producer")

  context.watch(consumers)
  context.watch(producer)

  def receive = {
    case Terminated(`producer`)  => consumers ! Broadcast(PoisonPill)
    case Terminated(`consumers`) => context.system.shutdown
  }
}

object WordCount {
  def getActor() = new WordCount

  def getConfig(routerType: String, dispatcherType: String)(numConsumers: Int) = s"""
    akka.actor.deployment {
      /WordCount/consumers {
        router = $routerType
        nr-of-instances = $numConsumers
        dispatcher = consumer-dispatcher
      }
    }
    consumer-dispatcher {
      type = $dispatcherType
      executor = "fork-join-executor"
    }"""
}
The WordCount actor is responsible for creating the other actors. When the Consumers are finished, the Producer sends a message with all tokens. But how do I propagate that message onward, and how do I accept and wait for it? The architecture with the third WordCount actor might be wrong.
Main Routine
case class Run(name: String, actor: () => Actor, config: (Int) => String)

object Main extends App {

  val run = Run(
    "push_implementation",
    WordCount.getActor _,
    WordCount.getConfig("balancing-pool", "Dispatcher") _)

  def execute(run: Run, numConsumers: Int) = {
    val config = ConfigFactory.parseString(run.config(numConsumers))
    val system = ActorSystem("Counting", ConfigFactory.load(config))
    val startTime = System.currentTimeMillis

    system.actorOf(Props(run.actor()), "WordCount")

    /*
      How to get the result here?!
    */

    system.awaitTermination
    System.currentTimeMillis - startTime
  }

  execute(run, 4)
}
Problem
As you can see, the actual problem is propagating the result back to the Main routine. Can you tell me how to do this in a proper way? The question is also how to wait for the result until the consumers are finished. I had a brief look at the Akka Futures documentation section, but the whole system is a little bit overwhelming for beginners. Something like val future = actor ? message seems suitable, but I'm not sure how to do it. Also, using the WordCount actor causes additional complexity. Maybe it is possible to come up with a solution that doesn't need this actor?
Consider using the Akka Aggregator Pattern. That takes care of the low-level primitives (watching actors, poison pill, etc). You can focus on managing state.
Your call to system.actorOf() returns an ActorRef, but you're not using it. You should ask that actor for results. Something like this:
implicit val timeout = Timeout(5 seconds)
val wCount = system.actorOf(Props(run.actor()), "WordCount")
val answer = Await.result(wCount ? "sent.txt", timeout.duration)
This means your WordCount class needs a receive method that accepts a String message. That section of code should aggregate the results and tell the sender(), like this:
class WordCount extends Actor {
  def receive: Receive = {
    case filename: String =>
      // do all of your code here, using filename
      sender() ! results
  }
}
Also, rather than blocking on the results with Await above, you can apply some techniques for handling Futures.
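For example, a non-blocking variant might look like this sketch (the result type and shutdown handling are assumptions):

import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

// register a callback instead of blocking the main thread with Await
(wCount ? "sent.txt").mapTo[List[String]].onComplete {
  case Success(tokens) =>
    println(s"Received ${tokens.size} tokens")
    system.shutdown() // stop the actor system once we have the answer
  case Failure(ex) =>
    ex.printStackTrace()
    system.shutdown()
}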