Akka stream - Splitting a stream of ByteString into multiple files - scala

I'm trying to split an incoming Akka stream of bytes (from the body of an http request, but it could also be from a file) into multiple files of a defined size.
For example, if I'm uploading a 10GB file, it would create something like 10 files of 1GB each. The files would have randomly generated names. My issue is that I don't really know where to start, because all the responses and examples I've read either store the whole chunk in memory or use a delimiter based on a string, and I can't really hold "chunks" of 1GB in memory and then just write them to disk.
Is there any easy way to perform that kind of operation? My only idea would be to use something like http://doc.akka.io/docs/akka/2.4/scala/stream/stream-cookbook.html#Chunking_up_a_stream_of_ByteStrings_into_limited_size_ByteStrings but transformed into something like FlowShape[ByteString, File], writing the chunks into a file myself until the max file size is reached, then creating a new file, and so on, and streaming back the created files. Which feels like a poor use of Akka.
Thanks in advance

I often revert to purely functional, non-akka techniques for problems such as this and then "lift" those functions into akka constructs. By this I mean I try to use only scala "stuff" and then wrap that stuff inside of akka later on...
File Creation
Starting with the FileOutputStream creation based on "randomly generated names":
def randomFileNameGenerator : String = ??? //not specified in question
import java.io.FileOutputStream
val randomFileOutGenerator : () => FileOutputStream =
() => new FileOutputStream(randomFileNameGenerator)
State
There needs to be some way of storing the "state" of the current file, e.g. the number of bytes already written:
case class FileState(byteCount : Int = 0,
fileOut : FileOutputStream = randomFileOutGenerator())
File Writing
First we determine if we'd breach the maximum file size threshold with the given ByteString:
import akka.util.ByteString
val isEndOfChunk : (FileState, ByteString, Int) => Boolean =
(state, byteString, maxBytes) =>
state.byteCount + byteString.length > maxBytes
We then have to write the function that creates a new FileState if we've maxed out the capacity of the current one or returns the current state if it is still below capacity:
val closeFileInState : FileState => Unit =
(_ : FileState).fileOut.close()
val getCurrentFileState : (FileState, ByteString, Int) => FileState =
  (state, byteString, maxBytes) =>
    if(isEndOfChunk(state, byteString, maxBytes)) {
      closeFileInState(state)
      FileState()
    }
    else
      state
The only thing left is to write to the FileOutputStream:
val writeToFileAndReturn : (FileState, ByteString) => FileState =
  (fileState, byteString) => {
    fileState.fileOut write byteString.toArray
    fileState copy (byteCount = fileState.byteCount + byteString.size)
  }

//the signature ordering will become useful
def writeToChunkedFile(maxBytes : Int)(fileState : FileState, byteString : ByteString) : FileState =
  writeToFileAndReturn(getCurrentFileState(fileState, byteString, maxBytes), byteString)
Fold On Any GenTraversableOnce
In Scala a GenTraversableOnce is any collection, parallel or not, that has fold operators. These include Iterator, Vector, Array, Seq, Scala Stream, ... The final writeToChunkedFile function matches the accumulator signature of GenTraversableOnce#foldLeft:
val maxBytes : Int = ??? //e.g. 1024 * 1024 * 1024 for ~1GB parts
val anyIterable : Iterable[ByteString] = ???

val finalFileState = anyIterable.foldLeft(FileState())(writeToChunkedFile(maxBytes))
One final loose end: the last FileOutputStream needs to be closed as well. Since the fold only emits the final FileState, we can close that one:
closeFileInState(finalFileState)
Akka Streams
Akka's Flow gets its fold from FlowOps#fold, which happens to match the foldLeft signature we used above. Therefore we can "lift" our regular functions into stream values, similar to the way we used Iterable's foldLeft:
import akka.stream.scaladsl.Flow
def chunkerFlow(maxBytes : Int) : Flow[ByteString, FileState, _] =
Flow[ByteString].fold(FileState())(writeToChunkedFile(maxBytes))
The nice part about handling the problem with regular functions is that they can be used within other asynchronous frameworks beyond streams, e.g. Futures or Actors. You also don't need an akka ActorSystem in unit testing, just regular language data structures.
import akka.stream.scaladsl.Sink
import scala.concurrent.Future
def byteStringSink(maxBytes : Int) : Sink[ByteString, _] =
chunkerFlow(maxBytes) to (Sink foreach closeFileInState)
You can then use this Sink to drain the HttpEntity of an incoming HttpRequest, for example:
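A minimal usage sketch, assuming akka-http is on the classpath and the definitions above are in scope; the function name and the 1GB part size are illustrative:
import akka.actor.ActorSystem
import akka.http.scaladsl.model.HttpRequest
import akka.stream.ActorMaterializer

//drain the request body into the chunking sink defined above
def drainRequest(request: HttpRequest)
                (implicit system: ActorSystem, mat: ActorMaterializer): Unit = {
  val maxBytes = 1024 * 1024 * 1024 //1GB parts, illustrative
  request.entity.dataBytes.runWith(byteStringSink(maxBytes))
}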

You could write a custom graph stage.
Your issue is similar to the one faced in Alpakka during upload to Amazon S3 (see the Alpakka S3 connector).
For some reason the S3 connector's DiskBuffer, however, writes the entire incoming source of ByteStrings to a file before emitting the chunk for further stream processing.
What we want is something that limits a source of byte strings to a specific length. In that example, they limit the incoming Source[ByteString, _] to a source of fixed-size ByteStrings by maintaining a memory buffer. I adapted it to work with files.
The advantage of this is that you can use a dedicated thread pool for this stage to do blocking IO. For a well-behaved reactive stream, you want to keep blocking IO on a separate thread pool in the actor system.
PS: this does not try to make files of an exact size, so if we read 2KB extra in a 100MB file, we write those extra bytes to the current file rather than trying to achieve an exact size.
import java.io.{FileOutputStream, RandomAccessFile}
import java.nio.channels.FileChannel
import java.nio.file.Path
import akka.stream.stage.{GraphStage, GraphStageLogic, InHandler, OutHandler}
import akka.stream._
import akka.util.ByteString
case class MultipartUploadChunk(path: Path, size: Int, partNumber: Int)
//Starts writing the byteStrings received from upstream to a file. Emits a path after writing a partSize number of bytes. Does not attempt to write an exact number of bytes.
class FileChunker(maxSize: Int, tempDir: Path, partSize: Int)
extends GraphStage[FlowShape[ByteString, MultipartUploadChunk]] {
assert(maxSize > partSize, "Max size should be larger than part size. ")
val in: Inlet[ByteString] = Inlet[ByteString]("PartsMaker.in")
val out: Outlet[MultipartUploadChunk] = Outlet[MultipartUploadChunk]("PartsMaker.out")
override val shape: FlowShape[ByteString, MultipartUploadChunk] = FlowShape.of(in, out)
override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
new GraphStageLogic(shape) with OutHandler with InHandler {
var partNumber: Int = 0
var length: Int = 0
var currentBuffer: Option[PartBuffer] = None
override def onPull(): Unit =
if (isClosed(in)) {
emitPart(currentBuffer, length)
} else {
pull(in)
}
override def onPush(): Unit = {
val elem = grab(in)
length += elem.size
val currentPart: PartBuffer = currentBuffer match {
case Some(part) => part
case None =>
val newPart = createPart(partNumber)
currentBuffer = Some(newPart)
newPart
}
currentPart.fileChannel.write(elem.asByteBuffer)
if (length > partSize) {
emitPart(currentBuffer, length)
//3. Increment the part number and reset the length.
partNumber += 1
length = 0
} else {
pull(in)
}
}
override def onUpstreamFinish(): Unit =
if (length > 0) emitPart(currentBuffer, length) // emit part only if something is still left in current buffer.
private def emitPart(maybePart: Option[PartBuffer], size: Int): Unit = maybePart match {
case Some(part) =>
//1. flush the part buffer and truncate the file.
part.fileChannel.force(false)
// not sure why we do this truncate.. but was being done in alpakka. also maybe safe to do.
// val ch = new FileOutputStream(part.path.toFile).getChannel
// try {
// println(s"truncating to size $size")
// ch.truncate(size)
// } finally {
// ch.close()
// }
//2. Emit the part.
val chunk = MultipartUploadChunk(path = part.path, size = size, partNumber = partNumber)
push(out, chunk)
part.fileChannel.close() // TODO: probably close elsewhere.
currentBuffer = None
//complete stage if in is closed.
if (isClosed(in)) completeStage()
case None => if (isClosed(in)) completeStage()
}
private def createPart(partNum: Int): PartBuffer = {
val path: Path = partFile(partNum)
//currentPart.deleteOnExit() //TODO: Enable in prod. requests that the file be deleted when VM dies.
PartBuffer(path, new RandomAccessFile(path.toFile, "rw").getChannel)
}
/**
 * Creates a file in the temp directory with name bmcs-buffer-part-$partNumber
 * @param partNumber the part number in multipart upload.
 * @return
 * TODO: add unique id to the file name. for multiple
 */
private def partFile(partNumber: Int): Path =
tempDir.resolve(s"bmcs-buffer-part-$partNumber.bin")
setHandlers(in, out, this)
}
case class PartBuffer(path: Path, fileChannel: FileChannel) //TODO: see if you need mapped byte buffer. might be ok with just output stream / channel.
}
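For completeness, a sketch of plugging the stage into a stream and running it on a dedicated dispatcher, as suggested above. The dispatcher name and the sizes are assumptions, and an implicit materializer is expected to be in scope:
import java.nio.file.Files
import akka.stream.ActorAttributes
import akka.stream.scaladsl.{Flow, Source}
import akka.util.ByteString

val tempDir = Files.createTempDirectory("chunks")

//run the chunking flow on a dedicated dispatcher configured for blocking IO
val chunks: Source[MultipartUploadChunk, _] =
  Source.single(ByteString("example payload"))
    .via(
      Flow.fromGraph(new FileChunker(maxSize = 100 * 1024 * 1024, tempDir = tempDir, partSize = 5 * 1024 * 1024))
        .withAttributes(ActorAttributes.dispatcher("blocking-io-dispatcher")))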

The idiomatic way to split a ByteString stream to multiple files is to use Alpakka's LogRotatorSink. From the documentation:
This sink takes a function as a parameter which returns a ByteString => Option[Path] function. If the generated function returns a path, the sink will rotate the file output to this new path and the actual ByteString will be written to this new file too. With this approach the user can define a custom stateful file generation implementation.
The following fileSizeRotationFunction is also from the documentation:
val fileSizeRotationFunction = () => {
  val max = 10 * 1024 * 1024
  var size: Long = max
  (element: ByteString) => {
    if (size + element.size > max) {
      val path = Files.createTempFile("out-", ".log")
      size = element.size
      Some(path)
    } else {
      size += element.size
      None
    }
  }
}
An example of its use:
val source: Source[ByteString, _] = ???
source.runWith(LogRotatorSink(fileSizeRotationFunction))
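For reference, a sketch of the setup the example assumes; LogRotatorSink lives in the akka-stream-alpakka-file module, and the implicit values shown are illustrative:
import java.nio.file.Files
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.alpakka.file.scaladsl.LogRotatorSink
import akka.stream.scaladsl.Source
import akka.util.ByteString

implicit val system = ActorSystem()
implicit val materializer = ActorMaterializer()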

Related

How to create an akka-stream Source from a Flow that generates values recursively?

I need to traverse an API that is shaped like a tree. For example, a directory structure or threads of discussion. It can be modeled via the following flow:
type ItemId = Int
type Data = String
case class Item(data: Data, kids: List[ItemId])
def randomData(): Data = scala.util.Random.alphanumeric.take(2).mkString
// 0 => [1, 9]
// 1 => [10, 19]
// 2 => [20, 29]
// ...
// 9 => [90, 99]
// _ => []
// NB. I don't have access to this function, only the itemFlow.
def nested(id: ItemId): List[ItemId] =
if (id == 0) (1 to 9).toList
else if (1 <= id && id <= 9) ((id * 10) to ((id + 1) * 10 - 1)).toList
else Nil
val itemFlow: Flow[ItemId, Item, NotUsed] =
Flow.fromFunction(id => Item(randomData, nested(id)))
How can I traverse this data? I got the following working:
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream._
import akka.stream.scaladsl._
import scala.concurrent.Await
import scala.concurrent.duration.Duration
implicit val system = ActorSystem()
implicit val materializer = ActorMaterializer()
val loop =
GraphDSL.create() { implicit b =>
import GraphDSL.Implicits._
val source = b.add(Flow[Int])
val merge = b.add(Merge[Int](2))
val fetch = b.add(itemFlow)
val bcast = b.add(Broadcast[Item](2))
val kids = b.add(Flow[Item].mapConcat(_.kids))
val data = b.add(Flow[Item].map(_.data))
val buffer = Flow[Int].buffer(100, OverflowStrategy.dropHead)
source ~> merge ~> fetch ~> bcast ~> data
merge <~ buffer <~ kids <~ bcast
FlowShape(source.in, data.out)
}
val flow = Flow.fromGraph(loop)
Await.result(
Source.single(0).via(flow).runWith(Sink.foreach(println)),
Duration.Inf
)
system.terminate()
However, since I'm using a flow with a buffer, the Stream will never complete.
Completes when upstream completes and buffered elements have been drained
Flow.buffer
I read the Graph cycles, liveness, and deadlocks section multiple times and I'm still struggling to find an answer.
This would create a live lock:
import java.util.concurrent.atomic.AtomicInteger
def unfold[S, E](seed: S, flow: Flow[S, E, NotUsed])(loop: E => List[S]): Source[E, NotUsed] = {
// keep track of how many element flows,
val remaining = new AtomicInteger(1) // 1 = seed
// should be > max loop(x)
val bufferSize = 10000
val (ref, publisher) =
Source.actorRef[S](bufferSize, OverflowStrategy.fail)
.toMat(Sink.asPublisher(true))(Keep.both)
.run()
ref ! seed
Source.fromPublisher(publisher)
.via(flow)
.map{x =>
loop(x).foreach{ c =>
remaining.incrementAndGet()
ref ! c
}
x
}
.takeWhile(_ => remaining.decrementAndGet > 0)
}
EDIT: I added a git repo to test your solution https://github.com/MasseGuillaume/source-unfold
Cause of Non-Completion
I don't think the cause of the stream never completing is "using a flow with a buffer". The actual cause, similar to this question, is that Merge with the default parameter eagerComplete = false waits for both the source and the buffer to complete before it (the merge) completes. But the buffer is waiting on the merge to complete. So the merge is waiting on the buffer and the buffer is waiting on the merge.
You could set eagerComplete = true when creating your Merge. But using eager completion may unfortunately result in some children ItemId values never being queried.
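In the graph from the question that is just a different construction of the merge junction, for example:
val merge = b.add(Merge[Int](2, eagerComplete = true))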
Indirect Solution
If you materialize a new stream for each level of the tree then the recursion can be extracted to outside of the stream.
You can construct a query function utilizing the itemFlow:
//assumes an implicit Materializer (for runWith) and ExecutionContext (for the flatMap below) are in scope
val itemQuery : Iterable[ItemId] => Future[Seq[Item]] =
  itemIds => Source(itemIds.toList)
    .via(itemFlow)
    .runWith(Sink.seq[Item])
This query function can now be wrapped inside of a recursive helper function:
val recQuery : (Iterable[ItemId], Iterable[Data]) => Future[Seq[Data]] =
  (itemIds, currentData) => itemQuery(itemIds) flatMap { allNewItems =>
    val allNewKids = allNewItems.flatMap(_.kids).toSet
    val allData    = (currentData ++ allNewItems.map(_.data)).toSeq
    if(allNewKids.isEmpty)
      Future.successful(allData)
    else
      recQuery(allNewKids, allData)
  }
The number of streams created will be equal to the maximum depth of the tree.
Unfortunately, because Futures are involved, this recursive function is not tail-recursive and could result in a "stack overflow" if the tree is too deep.
I solved this problem by writing my own GraphStage.
import akka.NotUsed
import akka.stream._
import akka.stream.scaladsl._
import akka.stream.stage.{GraphStage, GraphStageLogic, OutHandler}
import scala.concurrent.ExecutionContext
import scala.collection.mutable
import scala.util.{Success, Failure, Try}
def unfoldTree[S, E](seeds: List[S],
flow: Flow[S, E, NotUsed],
loop: E => List[S],
bufferSize: Int)(implicit ec: ExecutionContext): Source[E, NotUsed] = {
Source.fromGraph(new UnfoldSource(seeds, flow, loop, bufferSize))
}
object UnfoldSource {
implicit class MutableQueueExtensions[A](private val self: mutable.Queue[A]) extends AnyVal {
def dequeueN(n: Int): List[A] = {
val b = List.newBuilder[A]
var i = 0
while (i < n) {
val e = self.dequeue
b += e
i += 1
}
b.result()
}
}
}
class UnfoldSource[S, E](seeds: List[S],
flow: Flow[S, E, NotUsed],
loop: E => List[S],
bufferSize: Int)(implicit ec: ExecutionContext) extends GraphStage[SourceShape[E]] {
val out: Outlet[E] = Outlet("UnfoldSource.out")
override val shape: SourceShape[E] = SourceShape(out)
override def createLogic(inheritedAttributes: Attributes): GraphStageLogic = new GraphStageLogic(shape) with OutHandler {
// Nodes to expand
val frontier = mutable.Queue[S]()
frontier ++= seeds
// Nodes expanded
val buffer = mutable.Queue[E]()
// Using the flow to fetch more data
var inFlight = false
// Sink pulled but the buffer was empty
var downstreamWaiting = false
def isBufferFull() = buffer.size >= bufferSize
def fillBuffer(): Unit = {
val batchSize = Math.min(bufferSize - buffer.size, frontier.size)
val batch = frontier.dequeueN(batchSize)
inFlight = true
val toProcess =
Source(batch)
.via(flow)
.runWith(Sink.seq)(materializer)
val callback = getAsyncCallback[Try[Seq[E]]]{
case Failure(ex) => {
fail(out, ex)
}
case Success(es) => {
val got = es.size
inFlight = false
es.foreach{ e =>
buffer += e
frontier ++= loop(e)
}
if (downstreamWaiting && buffer.nonEmpty) {
val e = buffer.dequeue
downstreamWaiting = false
sendOne(e)
} else {
checkCompletion()
}
()
}
}
toProcess.onComplete(callback.invoke)
}
override def preStart(): Unit = {
checkCompletion()
}
def checkCompletion(): Unit = {
if (!inFlight && buffer.isEmpty && frontier.isEmpty) {
completeStage()
}
}
def sendOne(e: E): Unit = {
push(out, e)
checkCompletion()
}
def onPull(): Unit = {
if (buffer.nonEmpty) {
sendOne(buffer.dequeue)
} else {
downstreamWaiting = true
}
if (!isBufferFull && frontier.nonEmpty) {
fillBuffer()
}
}
setHandler(out, this)
}
}
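A usage sketch against the question's itemFlow, assuming an implicit ActorSystem, Materializer and ExecutionContext are in scope; the buffer size is illustrative:
import akka.stream.scaladsl.Sink
import scala.concurrent.Future

//traverse the tree starting from item 0 and collect the data of every item
val allData: Future[Seq[Data]] =
  unfoldTree(seeds = List(0), flow = itemFlow, loop = (item: Item) => item.kids, bufferSize = 100)
    .map(_.data)
    .runWith(Sink.seq)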
Ah, the joys of cycles in Akka streams. I had a very similar problem which I solved in a deeply hacky way. Possibly it'll be helpful for you.
Hacky Solution:
import java.util.concurrent.TimeoutException
import scala.concurrent.duration._

// add a graph stage that will complete successfully if it sees no element within 5 seconds
val timedStopper = b.add(
  Flow[Item]
    .idleTimeout(5.seconds)
    .recoverWithRetries(1, {
      case _: TimeoutException => Source.empty[Item]
    }))
source ~> merge ~> fetch ~> timedStopper ~> bcast ~> data
merge <~ buffer <~ kids <~ bcast
What this does is that 5 seconds after the last element passes through the timedStopper stage, that stage completes the stream successfully. This is achieved via the use of idleTimeout, which fails the stream with a TimeoutException, and then using recoverWithRetries to turn that failure into a successful completion. (I did mention it was hacky).
This is obviously not suitable if you might have more than 5 seconds between elements, or if you can't afford a long wait between the stream "actually" completing and Akka picking up on it. Thankfully, neither was a concern for us, and in that case it actually works pretty well!
Non-hacky solution
Unfortunately, the only ways I can think of to do this without cheating via timeouts are very, very complicated.
Basically, you need to be able to track two things:
are there any elements still in the buffer, or in process of being sent to the buffer
is the incoming source open
and complete the stream if and only if the answer to both questions is no. Native Akka building blocks are probably not going to be able to handle this. A custom graph stage might, however. An option might be to write one that takes the place of Merge and give it some way of knowing about the buffer contents, or possibly have it track both the IDs it receives and the IDs the broadcast is sending to the buffer. The problem being that custom graph stages are not particularly pleasant to write at the best of times, let alone when you're mixing logic across stages like this.
Warnings
Akka streams just don't work well with cycles, especially how they calculate completion. As a result, this may not be the only problem you encounter.
For instance, an issue we had with a very similar structure was that a failure in the source was treated as the stream completing successfully, with a succeeded Future being materialised. The problem is that by default, a stage that fails will fail its downstreams but cancel its upstreams (which counts as a successful completion for those stages). With a cycle like the one you have, the result is a race as cancellation propagates down one branch but failure down the other. You also need to check what happens if the sink errors; depending on the cancellation settings for broadcast, it's possible the cancellation will not propagate upwards and the source will happily continue pulling in elements.
One final option: avoid handling the recursive logic with streams at all. On one extreme, if there's any way for you to write a single tail-recursive method that pulls out all the nested items at once and put that into a Flow stage, that will solve your problems. On the other, we're seriously considering going to Kafka queueing for our own system.

Akka streams: dealing with futures within graph stage

Within an akka stream stage FlowShape[A, B] , part of the processing I need to do on the A's is to save/query a datastore with a query built with A data. But that datastore driver query gives me a future, and I am not sure how best to deal with it (my main question here).
case class Obj(a: String, b: Int, c: String)
case class Foo(myobject: Obj, name: String)
case class Bar(st: String)
//
class SaveAndGetId extends GraphStage[FlowShape[Foo, Bar]] {
val dao = new DbDao // some dao with an async driver
override def createLogic(inheritedAttributes: Attributes) = new GraphStageLogic(shape) {
    setHandlers(in, out, new InHandler with OutHandler {
      override def onPush() = {
        val foo = grab(in)
        val add = foo.record.value()
        val result: Future[String] = dao.saveAndGetRecord(add.myobject) //saves and returns id as string

        //the naive approach
        val record = Await.result(result, Duration.Inf)
        push(out, Bar(record)) // ***tests pass every time

        //mapping the future approach
        result.map { x =>
          push(out, Bar(x))
        } //***tests fail every time
The next stage depends on the id of the db record returned from the query, but I want to avoid Await. I am not sure why the mapping approach fails:
"it should work" in {
val source = Source.single(Foo(Obj("hello", 1, "world")))
val probe = source
.via(new SaveAndGetId)
.runWith(TestSink.probe)
probe
.request(1)
.expectBarwithId("one")//say we know this will be
.expectComplete()
}
private implicit class RichTestProbe(probe: Probe[Bar]) {
def expectBarwithId(expected: String): Probe[Bar] =
probe.expectNextChainingPF{
case r @ Bar(str) if str == expected => r
}
}
When run with mapping future, I get failure:
should work ***FAILED***
java.lang.AssertionError: assertion failed: expected: message matching partial function but got unexpected message OnComplete
at scala.Predef$.assert(Predef.scala:170)
at akka.testkit.TestKitBase$class.expectMsgPF(TestKit.scala:406)
at akka.testkit.TestKit.expectMsgPF(TestKit.scala:814)
at akka.stream.testkit.TestSubscriber$ManualProbe.expectEventPF(StreamTestKit.scala:570)
The async side channels example in the docs has the future in the constructor of the stage, as opposed to building the future within the stage, so doesn't seem to apply to my case.
I agree with Ramon. Constructing a new FlowShape is not necessary in this case and is overly complicated. It is much more convenient to use the mapAsync method here. Here is a code snippet using mapAsync:
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

object MapAsyncExample {
  implicit val system = ActorSystem()
  implicit val materializer = ActorMaterializer()

  val numOfParallelism: Int = 10

  def main(args: Array[String]): Unit = {
    Source.repeat(5)
      .mapAsync(numOfParallelism)(x => asyncSquare(x))
      .runWith(Sink.foreach(println))
  }

  //This method returns a Future
  //You can replace this part with your database operations
  def asyncSquare(value: Int): Future[Int] = Future {
    value * value
  }
}
In the snippet above, Source.repeat(5) is a dummy source that emits 5 indefinitely. The sample function asyncSquare takes an integer and calculates its square in a Future. The .mapAsync(numOfParallelism)(x => asyncSquare(x)) line uses that function and emits the output of the Future to the next stage. In this snippet, the next stage is a sink which prints every item.
numOfParallelism is the maximum number of asyncSquare calls that can run concurrently.
I think your GraphStage is unnecessarily overcomplicated. The below Flow performs the same actions without the need to write a custom stage:
val dao = new DbDao
val parallelism = 10 //number of parallel db queries
val SaveAndGetId : Flow[Foo, Bar, _] =
Flow[Foo]
.map(foo => foo.record.value().myobject)
.mapAsync(parallelism)(rec => dao.saveAndGetRecord(rec))
.map(Bar.apply)
I generally try to treat GraphStage as a last resort, there is almost always an idiomatic way of getting the same Flow by using the methods provided by the akka-stream library.

Akka Stream return object from Sink

I've got a SourceQueue. When I offer an element to this I want it to pass through the Stream and when it reaches the Sink have the output returned to the code that offered this element (similar as Sink.head returns an element to the RunnableGraph.run() call).
How do I achieve this? A simple example of my problem would be:
val source = Source.queue[String](100, OverflowStrategy.fail)
val flow = Flow[String].map(element => s"Modified $element")
val sink = Sink.ReturnTheStringSomehow
val graph = source.via(flow).to(sink).run()
val x = graph.offer("foo")
println(x) // Output should be "Modified foo"
val y = graph.offer("bar")
println(y) // Output should be "Modified bar"
val z = graph.offer("baz")
println(z) // Output should be "Modified baz"
Edit: For the example I have given in this question Vladimir Matveev provided the best answer. However, it should be noted that this solution only works if the elements are going into the sink in the same order they were offered to the source. If this cannot be guaranteed the order of the elements in the sink may differ and the outcome might be different from what is expected.
I believe it is simpler to use the already existing primitive for pulling values from a stream, called Sink.queue. Here is an example:
import akka.stream.{Attributes, OverflowStrategy}
import akka.stream.scaladsl.{Flow, Keep, Sink, Source}
import scala.concurrent.Await
import scala.concurrent.duration._

val source = Source.queue[String](128, OverflowStrategy.fail)
val flow = Flow[String].map(element => s"Modified $element")
val sink = Sink.queue[String]().withAttributes(Attributes.inputBuffer(1, 1))

val (sourceQueue, sinkQueue) = source.via(flow).toMat(sink)(Keep.both).run()

def getNext: String = Await.result(sinkQueue.pull(), 1.second).get

sourceQueue.offer("foo")
println(getNext)
sourceQueue.offer("bar")
println(getNext)
sourceQueue.offer("baz")
println(getNext)
It does exactly what you want.
Note that setting the inputBuffer attribute for the queue sink may or may not be important for your use case - if you don't set it, the buffer will be zero-sized and the data won't flow through the stream until you invoke the pull() method on the sink.
sinkQueue.pull() yields a Future[Option[T]], which will be completed successfully with Some if the sink receives an element or with a failure if the stream fails. If the stream completes normally, it will be completed with None. In this particular example I'm ignoring this by using Option.get but you would probably want to add custom logic to handle this case.
Well, you know what the offer() method returns if you take a look at its definition :) What you can do is create a Source.queue[(Promise[String], String)], create a helper function that pushes a pair to the stream via offer, make sure offer doesn't fail because the queue might be full, then complete the promise inside your stream and use the future of the promise to catch the completion event in external code.
I do that to throttle rate to external API used from multiple places of my project.
Here is how it looked in my project before Typesafe added Hub sources to akka:
import scala.concurrent.{Future, Promise}
import scala.concurrent.ExecutionContext.Implicits.global
import java.util.concurrent.ConcurrentLinkedDeque
import akka.stream.scaladsl.{Keep, Sink, Source}
import akka.stream.{OverflowStrategy, QueueOfferResult}
import scala.util.Success
private val queue = Source.queue[(Promise[String], String)](100, OverflowStrategy.backpressure)
.toMat(Sink.foreach({ case (p, param) =>
p.complete(Success(param.reverse))
}))(Keep.left)
.run
private val futureDeque = new ConcurrentLinkedDeque[Future[String]]()
private def sendQueuedRequest(request: String): Future[String] = {
val p = Promise[String]
val offerFuture = queue.offer(p -> request)
def addToQueue(future: Future[String]): Future[String] = {
futureDeque.addLast(future)
future.onComplete(_ => futureDeque.remove(future))
future
}
offerFuture.flatMap {
case QueueOfferResult.Enqueued =>
addToQueue(p.future)
}.recoverWith {
case ex =>
val first = futureDeque.pollFirst()
if (first != null)
addToQueue(first.flatMap(_ => sendQueuedRequest(request)))
else
sendQueuedRequest(request)
}
}
I realize that a blocking synchronized queue may be a bottleneck and may grow indefinitely, but because API calls in my project are made only from other akka streams which are backpressured, I never have more than a dozen items in futureDeque. Your situation may differ.
If you create a MergeHub.source[(Promise[String], String)]() instead, you'll get a reusable sink. Then every time you need to process an item, you create a complete graph and run it. In that case you won't need the hacky Java container to queue requests.
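For illustration, a sketch of that MergeHub variant, keeping the same reverse-the-string processing as above; the helper name and buffer size are assumptions, and an implicit Materializer is expected to be in scope:
import akka.NotUsed
import akka.stream.scaladsl.{MergeHub, Sink, Source}
import scala.concurrent.{Future, Promise}
import scala.util.Success

//materialize the MergeHub once; the resulting Sink can be reused from many producers
val reusableSink: Sink[(Promise[String], String), NotUsed] =
  MergeHub.source[(Promise[String], String)](perProducerBufferSize = 16)
    .to(Sink.foreach { case (p, param) => p.complete(Success(param.reverse)) })
    .run()

//each request materializes a tiny Source.single graph into the shared sink
def sendHubRequest(request: String): Future[String] = {
  val p = Promise[String]()
  Source.single(p -> request).runWith(reusableSink)
  p.future
}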

Stream input to external process in Scala

I have an Iterable[String] and I want to stream that to an external Process and return an Iterable[String] for the output.
I feel like this should work as it compiles
import scala.sys.process._
object PipeUtils {
implicit class IteratorStream(s: TraversableOnce[String]) {
def pipe(cmd: String) = s.toStream.#>(cmd).lines
def run(cmd: String) = s.toStream.#>(cmd).!
}
}
However, Scala tries to execute the contents of s instead of pass them in to standard in. Can anyone please tell me what I'm doing wrong?
UPDATE:
I think that my original problem was that the s.toStream was being implicitly converted to a ProcessBuilder and then executed. This is incorrect as it's the input to the process.
I have come up with the following solution. This feels very hacky and wrong but it seems to work for now. I'm not writing this as an answer because I feel like the answer should be one line and not this gigantic thing.
import java.io.OutputStream
import scala.collection.mutable.ListBuffer
import scala.io.Source
import scala.sys.process._

object PipeUtils {
  /**
   * This class feels wrong. I think that for the pipe command it actually loads all of the output
   * into memory. This could blow up the machine if used wrong, however, I cannot figure out how to get it to
   * work properly. Hopefully http://stackoverflow.com/questions/28095469/stream-input-to-external-process-in-scala
   * will get some good responses.
   * @param s
   */
  implicit class IteratorStream(s: TraversableOnce[String]) {
    val in = (in: OutputStream) => {
      s.foreach(x => in.write((x + "\n").getBytes))
      in.close()
    }
    def pipe(cmd: String) = {
      val output = ListBuffer[String]()
      val io = new ProcessIO(in,
        out => { Source.fromInputStream(out).getLines.foreach(output += _) },
        err => { Source.fromInputStream(err).getLines.foreach(println) })
      cmd.run(io).exitValue
      output.toIterable
    }
    def run(cmd: String) = {
      cmd.run(BasicIO.standard(in)).exitValue
    }
  }
}
EDIT
The motivation for this comes from using Spark's .pipe function on an RDD. I want this exact same functionality on my local code.
Assuming scala 2.11+, you should use lineStream as suggested by @edi. The reason is that you get a streaming response as it becomes available instead of a batched response. Let's say I have a shell script echo-sleep.sh:
#!/usr/bin/env bash
# echo-sleep.sh
while read line; do echo $line; sleep 1; done
and we want to call it from scala using code like the following:
import scala.sys.process._
import scala.language.postfixOps
import java.io.ByteArrayInputStream
implicit class X(in: TraversableOnce[String]) {
// Don't do the BAOS construction in real code. Just for illustration.
def pipe(cmd: String) =
cmd #< new ByteArrayInputStream(in.mkString("\n").getBytes) lineStream
}
Then if we do a final call like:
1 to 10 map (_.toString) pipe "echo-sleep.sh" foreach println
a number in the sequence appears on STDOUT every 1 second. If you buffer, and convert to an Iterable as in your example, you will lose this responsiveness.
Here's a solution demonstrating how to write the process code so that it streams both the input and output. The key is to produce a java.io.PipedInputStream that is passed to the input of the process. This stream is filled from the iterator asynchronously via a java.io.PipedOutputStream. Obviously, feel free to change the input type of the implicit class to an Iterable.
Here's an iterator used to show this works.
/**
* An iterator with pauses used to illustrate data streaming to the process to be run.
*/
class PausingIterator[A](zero: A, until: A, pauseMs: Int)(subsequent: A => A)
extends Iterator[A] {
private[this] var current = zero
def hasNext = current != until
def next(): A = {
if (!hasNext) throw new NoSuchElementException
val r = current
current = subsequent(current)
Thread.sleep(pauseMs)
r
}
}
Here's the actual code you want
import java.io.PipedOutputStream
import java.io.PipedInputStream
import java.io.InputStream
import java.io.PrintWriter
// For process stuff
import scala.sys.process._
import scala.language.postfixOps
// For asynchronous stream writing.
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
/**
* A streaming version of the original class. This does not block to wait for the entire
* input or output to be constructed. This allows the process to get data ASAP and allows
* the process to return information back to the scala environment ASAP.
*
* NOTE: Don't forget about error handling in the final production code.
*/
implicit class X(it: Iterator[String]) {
def pipe(cmd: String) = cmd #< iter2is(it) lineStream
/**
* Convert an iterator to an InputStream for use in the pipe function.
* #param it an iterator to convert
*/
private[this] def iter2is[A](it: Iterator[A]): InputStream = {
// What is written to the output stream will appear in the input stream.
val pos = new PipedOutputStream
val pis = new PipedInputStream(pos)
val w = new PrintWriter(pos, true)
// Scala 2.11 (scala 2.10, use 'future'). Executes asynchronously.
// Fill the stream, then close.
Future {
it foreach w.println
w.close
}
// Return possibly before pis is fully written to.
pis
}
}
The final call will display 0 through 9 and will pause for 3 seconds between the display of each number (2 second pause on the scala side, 1 second pause on the shell script side).
// echo-sleep.sh is the same script as in my previous post
new PausingIterator(0, 10, 2000)(_ + 1)
.map(_.toString)
.pipe("echo-sleep.sh")
.foreach(println)
Output
0 [ pause 3 secs ]
1 [ pause 3 secs ]
...
8 [ pause 3 secs ]
9 [ pause 3 secs ]

Streaming a CSV file to the browser in Spray

In one part of my application I have to send a CSV file back to the browser. I have an actor that replies with a Stream[String], each element of the Stream is one line in the file that should be sent to the user.
This is what I currently have. Note that I'm currently returning MediaType text/plain for debugging purposes, it will be text/csv in the final version:
trait TripService extends HttpService
with Matchers
with CSVTripDefinitions {
implicit def actorRefFactory: ActorRefFactory
implicit val executionContext = actorRefFactory.dispatcher
lazy val csvActor = Ridespark.system.actorSelection(s"/user/listeners/csv").resolveOne()(1000)
implicit val stringStreamMarshaller = Marshaller.of[Stream[String]](MediaTypes.`text/plain`) { (v, ct, ctx) =>
ctx.marshalTo(HttpEntity(ct, v.mkString("\n")))
}
val route = {
get {
path("trips" / Segment / DateSegment) { (network, date) =>
respondWithMediaType(MediaTypes.`text/plain`) {
complete(askTripActor(date, network))
}
}
}
}
def askTripActor(date: LocalDate, network: String)
(implicit timeout: Timeout = Timeout(1000, TimeUnit.MILLISECONDS)): Future[Stream[String]] = {
csvActor flatMap { actor => (actor ? GetTripsForDate(date, Some(network))).mapTo[Stream[String]] }
}
}
class TripApi(info: AuthInfo)(implicit val actorRefFactory: ActorRefFactory) extends TripService
My problem with this is that it will load the whole CSV into memory (due to the mkString in stringStreamMarshaller). Is there a way of not doing this, streaming the lines of the CSV as they are generated?
You don't need the stringStreamMarshaller. Using the built-in stream marshaller should already do what you want without adding the extra line-breaks. However, adding the linebreaks shouldn't be more difficult than using streamOfStrings.map(_ + "\n") or just adding the linebreaks when rendering the lines in the csv-actor in the first place.
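For illustration, a hedged sketch of that suggestion applied to the route above, assuming spray's built-in Stream marshaller (spray.httpx.marshalling.MetaMarshallers) is in implicit scope and using the implicit executionContext already defined to map the Future:
val route = {
  get {
    path("trips" / Segment / DateSegment) { (network, date) =>
      respondWithMediaType(MediaTypes.`text/plain`) {
        // append the newline per line and let the Stream marshaller chunk the response
        complete(askTripActor(date, network).map(_.map(line => line + "\n")))
      }
    }
  }
}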