Streaming a CSV file to the browser in Spray

In one part of my application I have to send a CSV file back to the browser. I have an actor that replies with a Stream[String]; each element of the Stream is one line of the file that should be sent to the user.
This is what I currently have. Note that I'm currently returning MediaType text/plain for debugging purposes; it will be text/csv in the final version:
trait TripService extends HttpService
  with Matchers
  with CSVTripDefinitions {

  implicit def actorRefFactory: ActorRefFactory
  implicit val executionContext = actorRefFactory.dispatcher

  lazy val csvActor = Ridespark.system.actorSelection(s"/user/listeners/csv").resolveOne()(1000)

  implicit val stringStreamMarshaller = Marshaller.of[Stream[String]](MediaTypes.`text/plain`) { (v, ct, ctx) =>
    ctx.marshalTo(HttpEntity(ct, v.mkString("\n")))
  }

  val route = {
    get {
      path("trips" / Segment / DateSegment) { (network, date) =>
        respondWithMediaType(MediaTypes.`text/plain`) {
          complete(askTripActor(date, network))
        }
      }
    }
  }

  def askTripActor(date: LocalDate, network: String)
                  (implicit timeout: Timeout = Timeout(1000, TimeUnit.MILLISECONDS)): Future[Stream[String]] = {
    csvActor flatMap { actor => (actor ? GetTripsForDate(date, Some(network))).mapTo[Stream[String]] }
  }
}

class TripApi(info: AuthInfo)(implicit val actorRefFactory: ActorRefFactory) extends TripService
My problem with this is that it loads the whole CSV into memory (due to the mkString in stringStreamMarshaller). Is there a way to avoid this and stream the lines of the CSV as they are generated?

You don't need the stringStreamMarshaller. Using the built-in stream marshaller should already do what you want, minus the extra line breaks. And adding the line breaks shouldn't be more difficult than streamOfStrings.map(_ + "\n"), or just adding them when rendering the lines in the csv-actor in the first place.
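For reference, a minimal sketch of that change, reusing the route and askTripActor from the question. It assumes Spray's built-in Stream marshaller (provided via spray.httpx.marshalling.MetaMarshallers, mixed into the Marshaller companion) is picked up implicitly; it emits one response chunk per Stream element, so the CSV is never held in memory as one string:

// Sketch only: drop stringStreamMarshaller entirely and map the line
// break onto each element instead of joining with mkString.
val route = {
  get {
    path("trips" / Segment / DateSegment) { (network, date) =>
      respondWithMediaType(MediaTypes.`text/plain`) {
        complete(askTripActor(date, network).map(_.map(_ + "\n")))
      }
    }
  }
}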

Related

Ask Akka actor for a result only when all the messages are processed

I am trying to split a big chunk of text into multiple paragraphs and process it concurrently by calling an external API.
An immutable list is updated each time a response for a paragraph comes back from the API.
Once the paragraphs are processed and the list is updated, I would like to ask the Actor for the final status to be used in the next steps.
The problem with the below approach is that I would never know when all the paragraphs are processed.
I need to get back the targetStore once all the paragraphs are processed and the list is final.
def main(args: Array[String]) {
  val source = Source.fromFile("input.txt")
  val extDelegator = new ExtractionDelegator()
  source.getLines().foreach(line => extDelegator.processParagraph(line))
  extDelegator.getFinalResult()
}
case class Extract(uuid: UUID, text: String)
case class UpdateList(text: String)
case class DelegateLambda(text: String)
case class FinalResult()

class ExtractionDelegator {
  val system = ActorSystem("ExtractionDelegator")
  val extActor = system.actorOf(Props(classOf[ExtractorDelegateActor]).withDispatcher("fixed-thread-pool"))
  implicit val executionContext = system.dispatchers.lookup("fixed-thread-pool")

  def processParagraph(text: String) = {
    extActor ! Extract(UUID.randomUUID(), text)
  }

  def getFinalResult(): java.util.List[String] = {
    implicit val timeout = Timeout(5 seconds)
    val askActor = system.actorOf(Props(classOf[ExtractorDelegateActor]))
    val future = askActor ? FinalResult()
    val result = Await.result(future, timeout.duration).asInstanceOf[java.util.List[String]]
    result
  }

  def shutdown(): Unit = {
    system.terminate()
  }
}

/* Extractor delegator actor */
class ExtractorDelegateActor extends Actor with ActorLogging {
  var targetStore: scala.collection.immutable.List[String] = scala.collection.immutable.List.empty

  def receive = {
    case Extract(uuid, text) =>
      context.actorOf(Props[ExtractProcessor].withDispatcher("fixed-thread-pool")) ! DelegateLambda(text)
    case UpdateList(res) =>
      targetStore = targetStore :+ res
    case FinalResult() =>
      val senderActor = sender()
      senderActor ! targetStore
  }
}

/* Worker actor */
class ExtractProcessor extends Actor with ActorLogging {
  def receive = {
    case DelegateLambda(text) =>
      val res = callLamdaService(text)
      sender ! UpdateList(res)
  }

  def callLamdaService(text: String): String = {
    // This is where the external API is called; stubbed here.
    Thread.sleep(1000)
    "result"
  }
}
Not sure why you want to use actors here; the simplest approach would be:

// because you call an external service, you most probably get an async response back
def callLamdaService(text: String): Future[String]

and to process your text you do:

implicit val ec = scala.concurrent.ExecutionContext.Implicits.global // use your own execution context here

Future.sequence(source.getLines().map(callLamdaService)).map { results =>
  // do what you want with the results
}
If you still want to use actors, you can do it by replacing callLamdaService with processParagraph, which internally asks a worker actor and returns the result (so the signature becomes def processParagraph(text: String): Future[String]). A rough sketch follows.
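This sketch reuses the names from the question and assumes the worker replies to the ask with its String result:

import java.util.UUID
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Future
import scala.concurrent.duration._

implicit val timeout: Timeout = Timeout(5.seconds)

// Sketch: the delegator now returns a Future instead of fire-and-forget,
// so the caller can Future.sequence over all paragraphs as shown above.
def processParagraph(text: String): Future[String] =
  (extActor ? Extract(UUID.randomUUID(), text)).mapTo[String]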
If you still want to start multiple tasks and then ask for the result, you need context.become with a receive(workers: Int) state: increase the worker count on each Extract message and decrease it on each UpdateList message. You will also need to delay the processing of FinalResult while the number of in-flight workers is non-zero, as in the sketch below.
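A sketch of that bookkeeping (my illustration, not tested; names reused from the question). The actor carries the in-flight count and collected results in its become state, and parks the asker until the count drops to zero:

import akka.actor.{Actor, ActorRef, Props}

class ExtractorDelegateActor extends Actor {
  def receive = working(inFlight = 0, results = Nil, waiting = None)

  def working(inFlight: Int, results: List[String], waiting: Option[ActorRef]): Receive = {
    case Extract(uuid, text) =>
      context.actorOf(Props[ExtractProcessor]) ! DelegateLambda(text)
      context.become(working(inFlight + 1, results, waiting))
    case UpdateList(res) =>
      val left = inFlight - 1
      val all = results :+ res
      if (left == 0) waiting.foreach(_ ! all) // answer a parked FinalResult, if any
      context.become(working(left, all, if (left == 0) None else waiting))
    case FinalResult() =>
      if (inFlight == 0) sender() ! results
      else context.become(working(inFlight, results, Some(sender()))) // delay the reply
  }
}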

Implement simple architecture using Akka Graphs

I'm attempting to set up a simple graph structure that processes data by invoking REST services, forwarding the result of each service to an intermediary processing unit before forwarding the final result. Here is a high-level architecture (diagram not shown):
Can this be defined using Akka graph streams? Reading https://doc.akka.io/docs/akka/current/stream/stream-graphs.html I don't understand how to even implement this simple architecture.
I've tried to implement custom code to execute functions within a graph:
package com.graph

class RestG {
  def flow(in: String): String = {
    return in + "extra"
  }
}

object RestG {
  case class Flow(in: String) {
    def out: String = in + "out"
  }

  def main(args: Array[String]): Unit = {
    List(new RestG().flow("test"), new RestG().flow("test2")).foreach(println)
  }
}
I'm unsure how to send data between the functions. So I think I should be using Akka Graphs, but how do I implement the architecture above?
Here's how I would approach the problem. First some types:
type Data = Int
type RestService1Response = String
type RestService2Response = String
type DisplayedResult = Boolean
Then stub functions to asynchronously call the external services:
def callRestService1(data: Data): Future[RestService1Response] = ???
def callRestService2(data: Data): Future[RestService2Response] = ???
def resultCombiner(resp1: RestService1Response, resp2: RestService2Response): DisplayedResult = ???
Now for the Akka Streams (I'm leaving out setting up an ActorSystem etc.)
import akka.Done
import akka.stream.FlowShape
import akka.stream.scaladsl._

type SourceMatVal = Any
val dataSource: Source[Data, SourceMatVal] = ???

def restServiceFlow[Response](callF: Data => Future[Response], maxInflight: Int) =
  Flow[Data].mapAsync(maxInflight)(callF)

// NB: since we're fanning out, there's no reason to have different maxInflights here...
val service1 = restServiceFlow(callRestService1, 4)
val service2 = restServiceFlow(callRestService2, 4)

val downstream = Flow[(RestService1Response, RestService2Response)]
  .map((resultCombiner _).tupled)
val splitAndCombine = GraphDSL.create() { implicit b =>
  import GraphDSL.Implicits._
  val fanOut = b.add(Broadcast[Data](2))
  val fanIn = b.add(Zip[RestService1Response, RestService2Response])
  fanOut.out(0).via(service1) ~> fanIn.in0
  fanOut.out(1).via(service2) ~> fanIn.in1
  FlowShape(fanOut.in, fanIn.out)
}
// This future will complete with a `Done` if/when the stream completes
val future: Future[Done] = dataSource
  .via(splitAndCombine)
  .via(downstream)
  .runForeach { displayableData =>
    ??? // Display the data
  }
It's possible to do all the wiring within the Graph DSL, but I generally prefer to keep my graph stages as simple as possible and only use them to the extent that the standard methods on Source/Flow/Sink can't do what I want.
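For comparison, here is a sketch of the no-DSL version (my addition, not part of the answer above): since the two service calls are independent futures, a single mapAsync stage can fan out per element and zip the futures itself, reusing the stubs defined earlier.

import akka.NotUsed

// Same fan-out/fan-in semantics without GraphDSL: both calls are started
// for each element and their results zipped back into a tuple.
val splitAndCombineNoDsl: Flow[Data, DisplayedResult, NotUsed] =
  Flow[Data]
    .mapAsync(4)(data => callRestService1(data).zip(callRestService2(data)))
    .map((resultCombiner _).tupled)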

Akka-Streams ActorPublisher does not receive any Request messages

I am trying to continuously read the wikipedia IRC channel using this lib: https://github.com/implydata/wikiticker
I created a custom Akka Publisher, which will be used in my system as a Source.
Here are some of my classes:
class IrcPublisher() extends ActorPublisher[String] {
  import scala.collection._

  var queue: mutable.Queue[String] = mutable.Queue()

  override def receive: Actor.Receive = {
    case Publish(s) =>
      println(s"->MSG, isActive = $isActive, totalDemand = $totalDemand")
      queue.enqueue(s)
      publishIfNeeded()
    case Request(cnt) =>
      println("Request: " + cnt)
      publishIfNeeded()
    case Cancel =>
      println("Cancel")
      context.stop(self)
    case _ =>
      println("Hm...")
  }

  def publishIfNeeded(): Unit = {
    while (queue.nonEmpty && isActive && totalDemand > 0) {
      println("onNext")
      onNext(queue.dequeue())
    }
  }
}

object IrcPublisher {
  case class Publish(data: String)
}
I am creating all these objects like so:
def createSource(wikipedias: Seq[String]) {
  val dataPublisherRef = system.actorOf(Props[IrcPublisher])
  val dataPublisher = ActorPublisher[String](dataPublisherRef)

  val listener = new MessageListener {
    override def process(message: Message) = {
      dataPublisherRef ! Publish(Jackson.generate(message.toMap))
    }
  }

  val ticker = new IrcTicker(
    "irc.wikimedia.org",
    "imply",
    wikipedias map (x => s"#$x.wikipedia"),
    Seq(listener)
  )

  ticker.start() // if I comment this...
  Thread.currentThread().join() //... and this I get Request(...)

  Source.fromPublisher(dataPublisher)
}
So the problem I am facing is with this Source object. Although this setup works well with other sources (for example reading from a local file), the ActorPublisher doesn't receive any Request() messages.
If I comment out the two marked lines, I can see that my actor receives the Request(count) message from my flow. Otherwise all messages are pushed into the queue but never into my flow (so I only see the MSG lines printed).
I think it's something to do with multithreading/synchronization here.
I am not familiar enough with wikiticker to solve your problem as given. One question I would have is: why is it necessary to join to the current thread?
However, I think you have overcomplicated the usage of Source. It would be easier for you to work with the stream as a whole rather than create a custom ActorPublisher.
You can use Source.actorRef to materialize a stream into an ActorRef and work with that ActorRef. This allows you to utilize akka code to do the enqueueing/dequeueing onto the buffer while you focus on the "business logic".
Say, for example, your entire stream is only to filter lines above a certain length and print them to the console. This could be accomplished with:
def dispatchIRCMessages(actorRef: ActorRef) = {
  val ticker =
    new IrcTicker("irc.wikimedia.org",
                  "imply",
                  wikipedias map (x => s"#$x.wikipedia"),
                  Seq(new MessageListener {
                    override def process(message: Message) =
                      actorRef ! Jackson.generate(message.toMap) // send the plain String: Source.actorRef[String] expects String elements
                  }))
  ticker.start()
  Thread.currentThread().join()
}

//these variables control the buffer behavior
val bufferSize = 1024
val overFlowStrategy = akka.stream.OverflowStrategy.dropHead
val minMessageSize = 32

//no need for a custom Publisher/Queue
val streamRef =
  Source.actorRef[String](bufferSize, overFlowStrategy)
    .via(Flow[String].filter(_.size > minMessageSize))
    .to(Sink.foreach[String](println))
    .run()

dispatchIRCMessages(streamRef)
The dispatchIRCMessages has the added benefit that it will work with any ActorRef so you aren't required to only work with streams/publishers.
Hopefully this solves your underlying problem...
I think the main problem is Thread.currentThread().join(). This line hangs the current thread, because the thread ends up waiting for itself to die. Please read https://docs.oracle.com/javase/8/docs/api/java/lang/Thread.html#join-long- .
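In other words, a sketch of the fix (under the assumption that the ticker keeps running on its own threads once started): build and return the Source without blocking, and only await the stream's completion elsewhere if the process would otherwise exit.

import akka.NotUsed

// Sketch: same body as createSource above, minus the join, plus a return type
// so the Source is actually returned rather than discarded.
def createSource(wikipedias: Seq[String]): Source[String, NotUsed] = {
  val dataPublisherRef = system.actorOf(Props[IrcPublisher])
  val listener = new MessageListener {
    override def process(message: Message) =
      dataPublisherRef ! Publish(Jackson.generate(message.toMap))
  }
  new IrcTicker("irc.wikimedia.org", "imply",
    wikipedias map (x => s"#$x.wikipedia"), Seq(listener)).start()
  Source.fromPublisher(ActorPublisher[String](dataPublisherRef))
}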

Is it possible to install a callback after request processing is finished in Spray?

I'm trying to serve large temporary files from Spray. I need to delete those files once the HTTP request is complete. I could not find a way to do this so far...
I'm using code similar to this or this:
respondWithMediaType(`text/csv`) {
  path("somepath" / CsvObjectIdSegment) { id =>
    CsvExporter.export(id) { file => // loan pattern to provide a temp file for this request
      encodeResponse(Gzip) {
        getFromFile(file)
      }
    }
  }
}
So essentially it calls getFromFile, which completes the route in a Future. The problem is that even when that Future is complete, the web request is not complete yet. I tried to write a function similar to getFromFile that would call file.delete() in the onComplete of that Future, but it has the same problem: if the file is large enough, the Future completes before the client has finished downloading it.
Here is getFromFile from Spray for reference:
/**
 * Completes GET requests with the content of the given file. The actual I/O operation is
 * running detached in a `Future`, so it doesn't block the current thread (but potentially
 * some other thread!). If the file cannot be found or read the request is rejected.
 */
def getFromFile(file: File)(implicit settings: RoutingSettings,
                            resolver: ContentTypeResolver,
                            refFactory: ActorRefFactory): Route =
  getFromFile(file, resolver(file.getName))
I can't use file.deleteOnExit() because the JVM might not be restarted for a while, and the temp files would be left lying around wasting disk space.
On the other hand, it's a more general question: is there a way to install a callback in Spray so that when request processing is complete, resources can be released, statistics/logs updated, etc.?
Thanks to @VladimirPetrosyan for the pointer. Here is how I implemented it:
The route has this:
trait MyService extends HttpService ... with CustomMarshallers {
  override def routeSettings = implicitly[RoutingSettings]
  ...
  get {
    respondWithMediaType(`text/csv`) {
      path("somepath" / CsvObjectIdSegment) { filterInstanceId => // just an ObjectId
        val tempResultsFile = CsvExporter.saveCsvResultsToTempFile(filterInstanceId)
        respondWithLastModifiedHeader(tempResultsFile.lastModified) {
          encodeResponse(Gzip) {
            complete(tempResultsFile)
          }
        }
      }
    }
  }
  // ...
}
and the trait that I mix in, which does the marshalling and produces the chunked response:
import akka.actor._
import spray.httpx.marshalling.{MarshallingContext, Marshaller}
import spray.http.{MessageChunk, ChunkedMessageEnd, HttpEntity, ContentType}
import spray.can.Http
import spray.http.MediaTypes._
import java.io.{RandomAccessFile, File}
import spray.routing.directives.FileAndResourceDirectives
import spray.routing.RoutingSettings
import math._

trait CustomMarshallers extends FileAndResourceDirectives {
  implicit def actorRefFactory: ActorRefFactory
  implicit def routeSettings: RoutingSettings

  implicit val CsvMarshaller =
    Marshaller.of[File](`text/csv`) {
      (file: File, contentType: ContentType, ctx: MarshallingContext) =>
        actorRefFactory.actorOf {
          Props {
            new Actor with ActorLogging {
              val defaultChunkSize = min(routeSettings.fileChunkingChunkSize, routeSettings.fileChunkingThresholdSize).toInt

              private def getNumberOfChunks(file: File): Int = {
                val randomAccessFile = new RandomAccessFile(file, "r")
                try {
                  ceil(randomAccessFile.length.toDouble / defaultChunkSize).toInt
                } finally {
                  randomAccessFile.close
                }
              }

              private def readChunk(file: File, chunkIndex: Int): String = {
                val randomAccessFile = new RandomAccessFile(file, "r")
                val byteBuffer = new Array[Byte](defaultChunkSize)
                try {
                  val seek = chunkIndex * defaultChunkSize
                  randomAccessFile.seek(seek)
                  val nread = randomAccessFile.read(byteBuffer)
                  if (nread == -1) ""
                  else if (nread < byteBuffer.size) new String(byteBuffer.take(nread))
                  else new String(byteBuffer)
                } finally {
                  randomAccessFile.close
                }
              }

              val chunkNum = getNumberOfChunks(file)
              val responder: ActorRef = ctx.startChunkedMessage(HttpEntity(contentType, ""), Some(Ok(0)))(self)

              sealed case class Ok(seq: Int)

              def stop() = {
                log.debug("Stopped CSV download handler actor.")
                responder ! ChunkedMessageEnd
                file.delete()
                context.stop(self)
              }

              def sendCSV(seq: Int) =
                if (seq < chunkNum)
                  responder ! MessageChunk(readChunk(file, seq)).withAck(Ok(seq + 1))
                else
                  stop()

              def receive = {
                case Ok(seq) =>
                  sendCSV(seq)
                case ev: Http.ConnectionClosed =>
                  log.debug("Stopping response streaming due to {}", ev)
              }
            }
          }
        }
    }
}
The temp file is created and then the actor starts streaming chunks. It sends the next chunk whenever the ack for the previous one comes back from the client. Whenever the client disconnects, the temp file is deleted and the actor is shut down.
This requires you to run your app on spray-can; I don't think it will work if you run it in a servlet container.
Some useful links:
example1, example2, docs

sys.process to wrap a process as a function

I have an external process that I would like to treat as a function from String => String. Given a line of input, it will respond with a single line of output. It seems that I should use scala.sys.process, which is clearly an elegant library that makes many shell operations easily accessible from within Scala. However, I can't figure out how to perform this simple use case.
If I write a single line to the process' stdin, it prints the result in a single line. How can I use sys.process to create a wrapper so I can use the process interactively? For example, if I had an implementation for ProcessWrapper, here is a program and its output:
// abstract definition
abstract class ProcessWrapper(executable: String) {
  def apply(line: String): String
}

// program using an implementation
val process = new ProcessWrapper("cat -b")
println(process("foo"))
println(process("bar"))
println(process("baz"))
Output:
1 foo
2 bar
3 baz
It is important that the process is not reloaded for each call to process because there is a significant initialization step.
So, after my comment, this would be my solution:
import java.io.BufferedReader
import java.io.File
import java.io.InputStream
import java.io.InputStreamReader
import scala.annotation.tailrec

class ProcessWrapper(cmdLine: String, lineListenerOut: String => Unit, lineListenerErr: String => Unit,
                     finishHandler: => Unit,
                     lineMode: Boolean = true, envp: Array[String] = null, dir: File = null) {

  class StreamRunnable(val stream: InputStream, listener: String => Unit) extends Runnable {
    def run() {
      try {
        val in = new BufferedReader(new InputStreamReader(this.stream))

        @tailrec
        def readLines {
          val line = in.readLine
          if (line != null) {
            listener(line)
            readLines
          }
        }
        readLines
      } finally {
        this.stream.close
        finishHandler
      }
    }
  }

  val process = Runtime.getRuntime().exec(cmdLine, envp, dir)
  val outThread = new Thread(new StreamRunnable(process.getInputStream, lineListenerOut), "StreamHandlerOut")
  val errThread = new Thread(new StreamRunnable(process.getErrorStream, lineListenerErr), "StreamHandlerErr")
  val sendToProcess = process.getOutputStream

  outThread.start
  errThread.start

  def apply(txt: String) {
    sendToProcess.write(txt.getBytes)
    if (lineMode)
      sendToProcess.write('\n')
    sendToProcess.flush
  }
}

object ProcessWrapper {
  def main(args: Array[String]) {
    val process = new ProcessWrapper("python -i", txt => println("py> " + txt),
      err => System.err.println("py err> " + err), System.exit(0))
    while (true) {
      process(readLine)
    }
  }
}
The main part is the StreamRunnable, where the process output is read in a thread and each received line is passed on to a "line listener" (a simple String => Unit function).
The main is just a sample implementation - calling python ;)
I'm not sure, but do you want something like this?
case class ProcessWrapper(executable: String) {
  import java.io.ByteArrayOutputStream
  import scala.concurrent.duration.Duration
  import java.util.concurrent.TimeUnit

  lazy val process = sys.runtime.exec(executable)

  def apply(line: String, blockedRead: Boolean = true): String = {
    process.getOutputStream().write(line.getBytes())
    process.getOutputStream().flush()
    val r = new ByteArrayOutputStream
    if (blockedRead) {
      r.write(process.getInputStream().read())
    }
    while (process.getInputStream().available() > 0) {
      r.write(process.getInputStream().read())
    }
    r.toString()
  }

  def close() = process.destroy()
}

val process = ProcessWrapper("cat -b")
println(process("foo\n"))
println(process("bar\n"))
println(process("baz\n"))
println(process("buz\n"))
println(process("puz\n"))
process.close
Result:
1 foo
2 bar
3 baz
4 buz
5 puz
I think that PlayCLI is a better way.
I came across http://blog.greweb.fr/2013/01/playcli-play-iteratees-unix-pipe/ today and it looks exactly like what you want.
How about using an Akka actor? The actor can have state and thus hold a reference to the open program (running in a thread). You can send messages to that actor.
ProcessWrapper might be a typed actor itself, or just something that converts function calls into messages to an actor. If the only method name you need is process, then wrapper ! "message" would be enough.
Having a program open and ready to receive commands sounds exactly like an actor that receives messages.
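A minimal sketch of that actor (my illustration; the names are hypothetical, and the blocking readLine keeps the one-line-in/one-line-out pairing simple):

import java.io.{BufferedReader, InputStreamReader, PrintWriter}
import akka.actor.{Actor, Props}

// The actor owns the long-lived process; each String message is written
// to the process's stdin and the next stdout line goes back to the sender.
class ProcessActor(command: String) extends Actor {
  private val process = Runtime.getRuntime.exec(command)
  private val stdout = new BufferedReader(new InputStreamReader(process.getInputStream))
  private val stdin = new PrintWriter(process.getOutputStream, true) // autoflush

  def receive = {
    case line: String =>
      stdin.println(line)
      sender() ! stdout.readLine()
  }

  override def postStop(): Unit = process.destroy()
}

// hypothetical usage, with akka.pattern.ask and a Timeout in scope:
// (system.actorOf(Props(new ProcessActor("cat -b"))) ? "foo").mapTo[String]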
Edit: Probably I got the requirements wrong. You want to send multiple lines to the same process. That's not possible with the below solution.
One possibility would be to add an extension method to the ProcessBuilder that allows for taking the input from a string:
import java.io.ByteArrayInputStream
import scala.sys.process._

implicit class ProcessBuilderWithStringInput(val builder: ProcessBuilder) extends AnyVal {
  // TODO: could use an implicit for the character set
  def #<<(s: String) = builder.#<(new ByteArrayInputStream(s.getBytes))
}
You can now use the method like this:
scala> ("bc":ProcessBuilder).#<<("3 + 4\n").!!
res9: String =
"7
"
Note that the type annotation is necessary, because we need two conversions (String -> ProcessBuilder -> ProcessBuilderWithStringInput), and Scala will only apply one conversion automatically.
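If the annotation bothers you, starting from an explicit ProcessBuilder sidesteps the double conversion; a small sketch using the same #<< extension:

import scala.sys.process._

// Process("bc") is already a ProcessBuilder, so only the single implicit
// conversion to ProcessBuilderWithStringInput is needed.
val answer: String = Process("bc").#<<("3 + 4\n").!!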