I need to use java.util.zip.ZipOutputStream to respond with a compressed file archive.
The data is several hundred megabytes uncompressed, so I would like to store as little of it as possible. It is coming from a serialization of SQL results.
I see examples of using an OutputStream to return a chunked result using Enumerator.outputStream:
http://greweb.me/2012/11/play-framework-enumerator-outputstream/
Play/Akka integration with Java OutputStreams
but those seem ill-advised when I read the documentation (emphasis mine)
Create an Enumerator of bytes with an OutputStream.
Note that calls to write will not block, so if the iteratee that is being fed to is slow to consume the input, the
OutputStream will not push back. This means it should not be used with large streams since there is a risk of
running out of memory.
Clearly, I can't use that. Or at least not without modification.
How can I create a response with an OutputStream (in this case, a gzipped archive) while being assured that only portions of it will be stored in memory?
I recognize the difference between InputStreams/OutputStreams and Play's Enumerator/Iteratee paradigm, so I expect there will be a specific way in which I need to generate my source data (serialization of SQL results) so that it doesn't outpace the rate of download. I don't know what it is.
In general you can't safely use any OutputStream with the Enumerator/Iteratee framework because OutputStream doesn't support non-blocking pushback. However, if you can control the writing to the OutputStream you can hack together something like:
val baos = new ByteArrayOutputStream
val zos = new ZipOutputStream(baos)
val enumerator = Enumerator.generateM {
  Future.successful {
    if (moreDataToWrite) {
      // Write data into zos
      val r = Some(baos.toByteArray)
      baos.reset()
      r
    } else None
  }
}
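The buffer-and-drain idea above can be tried without Play at all. The following is a minimal sketch using only the JDK and the Scala standard library; `zipChunks` and its `(name, content)` input are hypothetical names chosen for illustration, not part of any library. Each call to `next()` writes one entry and hands back only the bytes produced since the previous call, so at most one entry's worth of compressed data is buffered at a time.

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets.UTF_8
import java.util.zip.{ZipEntry, ZipOutputStream}

// Hypothetical helper: turns (entryName, content) pairs into zip chunks on demand.
def zipChunks(entries: Iterator[(String, String)]): Iterator[Array[Byte]] = {
  val baos = new ByteArrayOutputStream
  val zos  = new ZipOutputStream(baos)

  // Hand over whatever the zip stream has produced so far and empty the buffer.
  def drain(): Array[Byte] = {
    val bytes = baos.toByteArray
    baos.reset()
    bytes
  }

  // One chunk per entry; nothing is written until the consumer asks for it.
  val body = entries.map { case (name, content) =>
    zos.putNextEntry(new ZipEntry(name))
    zos.write(content.getBytes(UTF_8))
    zos.closeEntry()
    drain()
  }

  // Closing writes the zip central directory; this runs only after `body` is exhausted.
  val trailer = Iterator.single(()).map { _ => zos.close(); drain() }

  body ++ trailer
}
```

In a Play action, the body of `generateM` would presumably do what `next()` does here, returning `None` once the iterator is exhausted. Note the final chunk: the archive is only valid once the stream has been closed and the trailing bytes have been sent.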
If all you need is compression, take a look at the Enumeratee instances provided in play.filters.gzip.Gzip and the play.filters.gzip.GzipFilter filter.
The only backpressure mechanism for OutputStream is blocking the thread. So one way or another, there will have to be a thread that is able to be blocked.
One way is to use piped streams.
import java.io.OutputStream
import java.io.PipedInputStream
import java.io.PipedOutputStream
import play.api.libs.iteratee.Enumerator
import scala.concurrent.{ExecutionContext, Future}
def outputStream2(a: OutputStream => Unit, bufferSize: Int)
                 (implicit ec1: ExecutionContext, ec2: ExecutionContext) = {
  val outputStream = new PipedOutputStream
  // Connect the pipe before the writer starts; writing to an unconnected pipe throws.
  val inputStream = new PipedInputStream(outputStream, bufferSize)
  Future(a(outputStream))(ec1)
  Enumerator.fromStream(inputStream)(ec2)
}
Since the operations are blocking, you must take care to prevent deadlock.
Either use two different thread pools, or use a cached (unbounded) thread pool.
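To see the blocking backpressure of piped streams in isolation, here is a minimal sketch that uses only the JDK and the Scala standard library. `writeArchive` and `pipedArchive` are hypothetical names, and the CSV rows stand in for the serialized SQL results. The writer blocks whenever the pipe's buffer (`bufferSize` bytes) is full, so memory use stays bounded no matter how slowly the reader consumes.

```scala
import java.io.{OutputStream, PipedInputStream, PipedOutputStream}
import java.nio.charset.StandardCharsets.UTF_8
import java.util.zip.{ZipEntry, ZipOutputStream}
import scala.concurrent.{ExecutionContext, Future}

// Hypothetical producer: writes `rows` lines into a single zip entry.
def writeArchive(out: OutputStream, rows: Int): Unit = {
  val zos = new ZipOutputStream(out)
  try {
    zos.putNextEntry(new ZipEntry("results.csv"))
    (1 to rows).foreach(i => zos.write(s"$i,row-$i\n".getBytes(UTF_8)))
    zos.closeEntry()
  } finally zos.close() // also closes the pipe, which signals end-of-stream to the reader
}

// Runs the producer on its own pool and returns the read end of a bounded pipe.
def pipedArchive(rows: Int, bufferSize: Int)(implicit writerEc: ExecutionContext): PipedInputStream = {
  val pipeOut = new PipedOutputStream
  val pipeIn  = new PipedInputStream(pipeOut, bufferSize) // connect before any write happens
  Future(writeArchive(pipeOut, rows))(writerEc)
  pipeIn
}
```

The returned stream can then be read in chunks by whatever serves the response. The execution context passed as `writerEc` should be backed by a pool whose threads are allowed to block, for example one created with `Executors.newCachedThreadPool()`, and it should not be the pool the reader runs on.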
Related
I want to stream a file from S3 to an actor to be parsed and enriched, and write the output to another file.
The number of parser actors should be limited, e.g.
application.conf
akka{
  actor{
    deployment {
      HereClient/router1 {
        router = round-robin-pool
        nr-of-instances = 28
      }
    }
  }
}
code
val writerActor = actorSystem.actorOf(WriterActor.props())
val parser = actorSystem.actorOf(FromConfig.props(ParsingActor.props(writerActor)), "router1")
However, the actor that writes to the file should be limited to one instance (a singleton).
I tried doing something like
val reader: ParquetReader[GenericRecord] = AvroParquetReader.builder[GenericRecord](file).withConf(conf).build()
val source: Source[GenericRecord, NotUsed] = AvroParquetSource(reader)
source.map(record => parser ! record)
but I am not sure that backpressure is handled correctly. Any advice?
Indeed your solution is disregarding backpressure.
The correct way to have a stream interact with an actor while maintaining backpressure is to use the ask pattern support of akka-stream (reference).
From my understanding of your example you have 2 separate actor interaction points:
send records to the parsing actors (via a router)
send parsed records to the singleton write actor
What I would do is something similar to the following:
// ask requires an implicit akka.util.Timeout; 5 seconds is an arbitrary example value
implicit val askTimeout: Timeout = 5.seconds
val writerActor = actorSystem.actorOf(WriterActor.props())
val parserActor = actorSystem.actorOf(FromConfig.props(ParsingActor.props(writerActor)), "router1")
val reader: ParquetReader[GenericRecord] = AvroParquetReader.builder[GenericRecord](file).withConf(conf).build()
val source: Source[GenericRecord, NotUsed] = AvroParquetSource(reader)
source
  .ask[ParsedRecord](28)(parserActor)
  .ask[WriteAck](writerActor)
  .runWith(Sink.ignore)
The idea is that you send all the GenericRecord elements to the parserActor which will reply with a ParsedRecord. Here as an example we specify a parallelism of 28 since that's the number of instances you have configured, however as long as you use a value higher than the actual number of actor instances no actor should suffer from work starvation.
Once the parserActor replies with the parsing result (here represented by the ParsedRecord), we apply the same pattern to interact with the singleton writer actor. Note that here we don't specify the parallelism: we have a single instance, so it doesn't make sense to send more than one message at a time (in reality this happens anyway due to buffering at async boundaries, but that is just a built-in optimization). In this case we expect the writer actor to reply with a WriteAck to inform us that the write succeeded and that we can send the next element.
Using this method you are maintaining backpressure throughout your whole stream.
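For the ask stages to make progress, each actor has to reply to the sender of every message. A minimal sketch of what the writer side might look like with classic Akka actors follows. The message shapes (`ParsedRecord` carrying a line, `WriteAck` carrying a count) and the `Writer` constructor argument are assumptions made for illustration; the question does not show them.

```scala
import akka.actor.{Actor, Props}
import java.io.Writer

// Hypothetical message types; only the names come from the answer above.
final case class ParsedRecord(line: String)
final case class WriteAck(recordsWritten: Long)

// Single writer: appends each record, then acknowledges it to the asking stage.
class WriterActor(out: Writer) extends Actor {
  private var written = 0L

  def receive: Receive = {
    case ParsedRecord(line) =>
      out.write(line)
      out.write('\n')
      written += 1
      // Without this reply the ask stage would fail with a timeout.
      sender() ! WriteAck(written)
  }
}

object WriterActor {
  def props(out: Writer): Props = Props(new WriterActor(out))
}
```

The same applies to the parsing actors: with this pipeline they would reply to `sender()` with a `ParsedRecord` rather than forwarding the result to the writer themselves, since the stream is what passes records on.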
I think you should be using one of the "async" operations.
Perhaps this other Q&A gives you some inspiration: Processing an akka stream asynchronously and writing to a file sink
I'm working with the Akka/Scala/Play stack.
Usually, I use streams to perform certain tasks. For example, I have a stream that wakes up every minute, picks up something from the DB, calls another service to enrich its data using an API, and saves the enrichment to the DB.
Something like this:
class FetcherAndSaveStream @Inject()(fetcherAndSaveGraph: FetcherAndSaveGraph, dbElementsSource: DbElementsSource)
                                    (implicit val mat: Materializer,
                                     implicit val exec: ExecutionContext) extends LazyLogging {
  def graph[M1, M2](source: Source[BDElement, M1],
                    sink: Sink[BDElement, M2],
                    switch: SharedKillSwitch): RunnableGraph[(M1, M2)] = {
    val fetchAndSaveDataFromExternalService: Flow[BDElement, BDElement, NotUsed] =
      fetcherAndSaveGraph.fetchEndSaveEnrichment
    source.viaMat(switch.flow)(Keep.left)
      .via(fetchAndSaveDataFromExternalService)
      .toMat(sink)(Keep.both).withAttributes(supervisionStrategy(resumingDecider))
  }
  def runGraph(switchSharedKill: SharedKillSwitch): (NotUsed, Future[Done]) = {
    logger.info("FetcherAndSaveStream is now running")
    graph(dbElementsSource.dbElements(), Sink.ignore, switchSharedKill).run()
  }
}
I wonder: is this better than just using an actor that ticks every minute and does the same thing? How do actors and streams compare for this kind of task?
I'm still trying to figure out when to choose which approach (streams or actors). Thanks!
You can use either, depending on requirements for your solution that are not listed here. The general point to keep in mind is that actors are lower-level than streams, so they require more code and more debugging.
Basically, streams are good for tasks where you have a relatively large amount of data to process with low memory consumption. With streams, you don't need to start a stream every n seconds; you can set the stream to run for the lifetime of the application. That can make your code more concise by omitting the scheduler logic.
I will omit your DI and architecture details and write the solution in pseudocode:
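val yourConsumer: Sink[YourDBRecord, NotUsed] = ???
// fetchReasonableAmountOfRecordsFromDB is assumed to return a Future of an immutable collection
val recordsSource: Source[YourDBRecord, NotUsed] =
  Source.repeat(())
    .throttle(1, n.seconds)
    .mapAsync(yourParallelism) { _ =>
      fetchReasonableAmountOfRecordsFromDB
    }
    .mapConcat(identity)
val runnableGraph = recordsSource.to(yourConsumer)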
This stream will do the job. You can even enhance it with more sophisticated logic that adapts the polling rate to the workload, using a feedback loop in the Graph DSL. You can also add whatever error-handling strategy you need to resume the stream if it fails.
Moreover, there are Alpakka connectors for databases that can do this; you can see whether they fit your purpose, or check their implementation details.
What you gain by doing so: backpressure, the ability to compose with other streams, and clean, concise code with no timers managed directly by you.
https://doc.akka.io/docs/akka/current/stream/stream-rate.html
You can also create an actor, but then you have to do by hand everything Akka Streams does for you: backpressure (if you want to interoperate with streams), scheduling, chunking, and memory management (so that you don't load 100000 or so entries into memory in one batch), etc.
My Scala application kicks off an external process that writes a file to disk. In a separate thread, I want to read that file and copy its contents to an OutputStream until the process is done and the file is no longer growing.
There are a couple of edge cases to consider:
The file may not exist yet when the thread is ready to start.
The thread may copy faster than the process is writing. In other words, it may reach the end of the file while the file is still growing.
BTW I can pass the thread a processCompletionFuture variable which indicates when the file is done growing.
Is there an elegant and efficient way to do this? Perhaps using Akka Streams or actors? (I've tried using an Akka Stream off of the FileInputStream, but the stream seems to terminate as soon as there are no more bytes in the input stream, which happens in case #2).
Alpakka, a library that is built on Akka Streams, has a FileTailSource utility that mimics the tail -f Unix command. For example:
import akka.NotUsed
import akka.stream._
import akka.stream.scaladsl._
import akka.stream.alpakka.file.scaladsl._
import akka.util.{ ByteString, Timeout }
import java.io.OutputStream
import java.nio.file.Path
import scala.concurrent._
import scala.concurrent.duration._
val path: Path = ???
val maxLineSize = 10000
val tailSource: Source[ByteString, NotUsed] = FileTailSource(
  path = path,
  maxChunkSize = maxLineSize,
  startingPosition = 0,
  pollingInterval = 500.millis
).via(Framing.delimiter(ByteString(System.lineSeparator), maxLineSize, true))
The above tailSource reads an entire file line-by-line and continually reads freshly appended data every 500 milliseconds. To copy the stream contents to an OutputStream, connect the source to a StreamConverters.fromOutputStream sink:
val stream: Future[IOResult] =
  tailSource
    .runWith(StreamConverters.fromOutputStream(() => new OutputStream {
      override def write(i: Int): Unit = ???
      override def write(bytes: Array[Byte]): Unit = ???
    }))
(Note that there is a FileTailSource.lines method that produces a Source[String, NotUsed], but in this scenario it's more felicitous to work with ByteString instead of String. This is why the example uses FileTailSource.apply(), which produces a Source[ByteString, NotUsed].)
The stream will fail if the file doesn't exist at the time of materialization. Therefore, you'll need to confirm the existence of the file before running the stream. This might be overkill, but one idea is to use Alpakka's DirectoryChangesSource for that.
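If DirectoryChangesSource feels like too much machinery, a plain polling loop is a simpler alternative for the existence check. This is a minimal sketch using only the JDK and the Scala standard library; `awaitFile` is a hypothetical helper, not part of Alpakka. It blocks the calling thread, so it belongs on the dedicated thread mentioned in the question rather than on a shared dispatcher.

```scala
import java.nio.file.{Files, Path}
import scala.concurrent.duration._

// Hypothetical helper: polls until `path` exists or `timeout` has elapsed.
// Returns true if the file exists when the method returns.
def awaitFile(path: Path, pollInterval: FiniteDuration, timeout: FiniteDuration): Boolean = {
  val deadline = timeout.fromNow
  while (!Files.exists(path) && deadline.hasTimeLeft()) {
    Thread.sleep(pollInterval.toMillis)
  }
  Files.exists(path)
}
```

A call such as `awaitFile(path, 200.millis, 30.seconds)` would then gate the materialization of `tailSource`; both durations are arbitrary example values.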
Are there some code examples of using org.reactivestreams libraries to process large data streams using Java NIO (for high performance)? I'm aiming at distributed processing, so examples using Akka would be best, but I can figure that out.
It still seems to be the case that most (I hope not all) examples of reading files in scala resort to Source (non-binary) or direct Java NIO (and even things like Files.readAllBytes!)
Perhaps there is an activator template I've missed? (Akka Streams with Scala! is close addressing everything I need except the binary/NIO side)
Do not use scala.collection.immutable.Stream to consume files like this, the reason being that it performs memoization - that is, while yes it is lazy it will keep the entire stream buffered (memoized) in memory!
This is definitely not what you want when you think about "stream processing a file". Scala's Stream works like this because in a functional setting it makes complete sense: thanks to memoization you can, for example, easily avoid calculating Fibonacci numbers again and again. For more details see the ScalaDoc.
Akka Streams provides Reactive Streams implementations and provides a FileIO class that you could use here (it will properly back-pressure and pull the data out of the file only when needed and the rest of the stream is ready to consume it):
import java.nio.file.Paths
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{ FileIO, Sink }
object ExampleApp extends App {
  implicit val sys = ActorSystem()
  implicit val mat = ActorMaterializer()
  // FileIO emits ByteString chunks; print each chunk as text and pass it on
  FileIO.fromPath(Paths.get("/example/file.txt"))
    .map(c ⇒ { print(c.utf8String); c })
    .runWith(Sink.onComplete(_ ⇒ sys.terminate()))
}
Here are more docs about working with IO with Akka Streams
Note that this is for the current-as-of writing version of Akka, so the 2.5.x series.
Hope this helps!
We actually use akka streams to process binary files. It was a little tricky to get things going as there wasn't any documentation around this, but this is what we came up with:
val binFile = new File(filePath)
val inputStream = new BufferedInputStream(new FileInputStream(binFile))
val binStream = Stream.continually(inputStream.read).takeWhile(-1 != _).map(_.toByte)
val binSource = Source(binStream)
Once you have binSource, which is an akka Source[Byte] you can go ahead and start applying whatever stream transformations (map, flatMap, transform, etc...) you want to it. This functionality leverages the Source companion object's apply that takes an Iterable, passing in a scala Stream that should read in the data lazily and make it available to your transforms.
EDIT
As Konrad pointed out in the comments section, a Stream can be an issue with large files due to the fact that it performs memoization of the elements it encounters as it's lazily building out the stream. This can lead to out of memory situations if you are not careful. However, if you look at the docs for Stream there is a tip for avoiding memoization building up in memory:
One must be cautious of memoization; you can very quickly eat up large
amounts of memory if you're not careful. The reason for this is that
the memoization of the Stream creates a structure much like
scala.collection.immutable.List. So long as something is holding on to
the head, the head holds on to the tail, and so it continues
recursively. If, on the other hand, there is nothing holding on to the
head (e.g. we used def to define the Stream) then once it is no longer
being used directly, it disappears.
So taking that into account, you could modify my original example as follows:
val binFile = new File(filePath)
val inputStream = new BufferedInputStream(new FileInputStream(binFile))
val binSource = Source(() => binStream(inputStream).iterator)
def binStream(in:BufferedInputStream) = Stream.continually(in.read).takeWhile(-1 != _).map(_.toByte)
So the idea here is to build the Stream via a def, not assign it to a val, and then immediately get the iterator from it and use that to initialize the Akka Source. Setting things up this way should avoid the issues with memoization. I ran the old code against a big file and was able to produce an OutOfMemory situation by doing a foreach on the Source. When I switched over to the new code I was able to avoid this issue.
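A variation on the same idea, shown here only as a sketch, is to skip Stream entirely and build an Iterator straight from the InputStream. An Iterator never keeps elements it has already produced, so there is no head to hold on to by accident. `byteIterator` is a hypothetical helper name; the code uses only the JDK and the Scala standard library.

```scala
import java.io.InputStream

// Hypothetical helper: reads `in` one byte at a time, on demand.
// read() returns -1 at end of stream, which ends the iteration.
def byteIterator(in: InputStream): Iterator[Byte] =
  Iterator.continually(in.read()).takeWhile(_ != -1).map(_.toByte)
```

With a recent Akka Streams version this could presumably be wrapped as `Source.fromIterator(() => byteIterator(inputStream))`. Emitting single bytes is slow for large files either way, so reading in chunks is usually preferable when the format allows it.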
I have a method save that takes an Iteratee and saves some data to it. Inside the method, the data is available as an enumerator producing byte-array chunks.
def save[E](consumer: Iteratee[Array[Byte], E]): Future[E] = {
  val producer: Enumerator[Array[Byte]] = // ...
  Iteratee.flatten(producer(consumer)).run
}
Wanted: Call save in order to have it write the data to a FileOutputStream.
I tried the following but am not sure whether this is the way to go:
def writeToStream(s: OutputStream) =
  Iteratee.foreach((e: Array[Byte]) => s.write(e)).
    mapDone(r => { s.close(); r })
save(writeToStream(new FileOutputStream(myFile)))
Question: Is this the way it's supposed to be done? I fear that this will not always close the stream (case of exceptions).
I am using the Play Framework Iteratee library from Play Framework 2.1 (which uses Scala futures).
The Scaladoc for Iteratee says that handling resources is the responsibility of the "producer", not the iteratee:
The Iteratee does not do any resource management (such as closing streams); the producer pushing stuff into the Iteratee has that responsibility.
You might be successful using the "onDoneEnumerating" method in Enumerator to clean up resources afterwards.
Scaladoc Iteratee