My Scala application kicks off an external process that writes a file to disk. In a separate thread, I want to read that file and copy its contents to an OutputStream until the process is done and the file is no longer growing.
There are a couple of edge cases to consider:
The file may not exist yet when the thread is ready to start.
The thread may copy faster than the process is writing. In other words, it may reach the end of the file while the file is still growing.
BTW I can pass the thread a processCompletionFuture variable which indicates when the file is done growing.
Is there an elegant and efficient way to do this? Perhaps using Akka Streams or actors? (I've tried using an Akka Stream off of the FileInputStream, but the stream seems to terminate as soon as there are no more bytes in the input stream, which happens in case #2).
Alpakka, a library that is built on Akka Streams, has a FileTailSource utility that mimics the tail -f Unix command. For example:
import akka.NotUsed
import akka.stream._
import akka.stream.scaladsl._
import akka.stream.alpakka.file.scaladsl._
import akka.util.{ ByteString, Timeout }
import java.io.OutputStream
import java.nio.file.Path
import scala.concurrent._
import scala.concurrent.duration._
val path: Path = ???
val maxLineSize = 10000
val tailSource: Source[ByteString, NotUsed] = FileTailSource(
  path = path,
  maxChunkSize = maxLineSize,
  startingPosition = 0,
  pollingInterval = 500.millis
).via(Framing.delimiter(ByteString(System.lineSeparator), maxLineSize, allowTruncation = true))
The above tailSource reads the entire file line by line and then polls for freshly appended data every 500 milliseconds. To copy the stream contents to an OutputStream, connect the source to a StreamConverters.fromOutputStream sink:
val stream: Future[IOResult] =
  tailSource
    .runWith(StreamConverters.fromOutputStream(() => new OutputStream {
      override def write(i: Int): Unit = ???
      override def write(bytes: Array[Byte]): Unit = ???
    }))
(Note that there is a FileTailSource.lines method that produces a Source[String, NotUsed], but in this scenario it's more felicitous to work with ByteString instead of String. This is why the example uses FileTailSource.apply(), which produces a Source[ByteString, NotUsed].)
The stream will fail if the file doesn't exist at the time of materialization. Therefore, you'll need to confirm the existence of the file before running the stream. This might be overkill, but one idea is to use Alpakka's DirectoryChangesSource for that.
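For example, here is one possible sketch of that idea, building on the imports above (waitForFile is an illustrative helper, not part of Alpakka): wait for the file's creation event in its parent directory, then materialize the tail stream.
import akka.stream.alpakka.file.DirectoryChange
import java.nio.file.Files

// Hypothetical helper: completes once `path` exists, watching the parent
// directory for the file's creation if it isn't there yet.
def waitForFile(path: Path)(implicit mat: Materializer): Future[Path] =
  if (Files.exists(path)) Future.successful(path)
  else
    DirectoryChangesSource(path.getParent, 250.millis, 1000)
      .collect { case (p, DirectoryChange.Creation) if p == path => p }
      .runWith(Sink.head)

// Only materialize the tail stream once the file is known to exist, e.g.
// (requires an ExecutionContext in scope; someOutputStream is whatever
// OutputStream you want to copy into):
// waitForFile(path).flatMap { _ =>
//   tailSource.runWith(StreamConverters.fromOutputStream(() => someOutputStream))
// }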
What are the key differences between run and runWith?
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Keep, Sink, Source}
object RunAndRunWith extends App {
  implicit val system: ActorSystem = ActorSystem("Run_RunWith")
  implicit val materializer: ActorMaterializer = ActorMaterializer()

  Source(1 to 10).toMat(Sink.foreach[Int](println))(Keep.right).run()
  Source(1 to 10).runWith(Sink.foreach[Int](println))
}
How to know which one to use?
to(Sink) and toMat(Sink) terminate the source with the sink and produce a RunnableGraph. You can execute it with run(), but you also get the chance to set stream attributes for the whole graph before running it, or to hand it to some other function or method that will run it (or do something else with it entirely).
This form also gives you some control over where the materialized value should come from, if you need that.
Since terminating and running a source with a sink, without any additional attributes and keeping the materialized value of the sink, is so common, runWith(Sink) is a convenient shortcut for exactly that.
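To make the difference concrete, here is a small sketch in the same style as the example above (the object and stream names are illustrative): toMat lets you choose which materialized value to keep and set attributes on the RunnableGraph before running it, while runWith is the shortcut for the common case.
import akka.Done
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, Attributes}
import akka.stream.scaladsl.{Keep, RunnableGraph, Sink, Source}
import scala.concurrent.Future

object RunVsRunWith extends App {
  implicit val system: ActorSystem = ActorSystem("RunVsRunWith")
  implicit val materializer: ActorMaterializer = ActorMaterializer()

  // Explicit form: keep both materialized values and name the graph before running it.
  val graph: RunnableGraph[(NotUsed, Future[Done])] =
    Source(1 to 10)
      .toMat(Sink.foreach[Int](println))(Keep.both)
      .withAttributes(Attributes.name("printing-stream"))

  val (_, done1) = graph.run()

  // Shortcut: equivalent to toMat(Sink.foreach(...))(Keep.right).run()
  val done2: Future[Done] = Source(1 to 10).runWith(Sink.foreach[Int](println))
}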
I want to stream a file from S3 to an actor, where it will be parsed and enriched, and write the output to another file.
The number of parser actors should be limited, e.g.:
application.conf
akka {
  actor {
    deployment {
      HereClient/router1 {
        router = round-robin-pool
        nr-of-instances = 28
      }
    }
  }
}
code
val writerActor = actorSystem.actorOf(WriterActor.props())
val parser = actorSystem.actorOf(FromConfig.props(ParsingActor.props(writerActor)), "router1")
However, the actor that writes to the file should be limited to 1 instance (a singleton).
I tried doing something like:
val reader: ParquetReader[GenericRecord] = AvroParquetReader.builder[GenericRecord](file).withConf(conf).build()
val source: Source[GenericRecord, NotUsed] = AvroParquetSource(reader)
source.map(record => parser ! record)
but I am not sure that the backpressure is handled correctly. Any advice?
Indeed your solution is disregarding backpressure.
The correct way to have a stream interact with an actor while maintaining backpressure is to use the ask pattern support of akka-stream (reference).
From my understanding of your example you have 2 separate actor interaction points:
send records to the parsing actors (via a router)
send parsed records to the singleton write actor
What I would do is something similar to the following:
val writerActor = actorSystem.actorOf(WriterActor.props())
val parserActor = actorSystem.actorOf(FromConfig.props(ParsingActor.props(writerActor)), "router1")
val reader: ParquetReader[GenericRecord] = AvroParquetReader.builder[GenericRecord](file).withConf(conf).build()
val source: Source[GenericRecord, NotUsed] = AvroParquetSource(reader)
source
  .ask[ParsedRecord](28)(parserActor)
  .ask[WriteAck](writerActor)
  .runWith(Sink.ignore)
The idea is that you send all the GenericRecord elements to the parserActor, which will reply with a ParsedRecord. Here, as an example, we specify a parallelism of 28 since that's the number of instances you have configured; as long as you use a value at least as high as the actual number of actor instances, no actor should suffer from work starvation.
Once the parserActor replies with the parsing result (here represented by ParsedRecord), we apply the same pattern to interact with the singleton writer actor. Note that here we don't specify the parallelism, since with a single instance it doesn't make sense to send more than one message at a time (in reality some buffering happens anyway at async boundaries, but that is just a built-in optimization). In this case we expect the writer actor to reply with a WriteAck to confirm that the write succeeded and that we can send the next element.
Using this method you are maintaining backpressure throughout your whole stream.
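For completeness, here is a sketch (not from the original question) of what the actor side could look like, with hypothetical ParsedRecord and WriteAck messages: the key point is that each actor must reply to sender() so the stream's ask stage can emit and pull the next element. Note that the ask stages also require an implicit akka.util.Timeout in scope.
import akka.actor.{Actor, ActorRef, Props}
import org.apache.avro.generic.GenericRecord

// Hypothetical message types used by the ask stages above.
case class ParsedRecord(value: String)
case class WriteAck()

class ParsingActor(writer: ActorRef) extends Actor {
  def receive: Receive = {
    case record: GenericRecord =>
      // Parse/enrich the record, then reply so the stream's ask completes.
      // With the ask pattern the stream (not this actor) passes the result on to the writer.
      sender() ! ParsedRecord(record.toString)
  }
}

object ParsingActor {
  def props(writer: ActorRef): Props = Props(new ParsingActor(writer))
}

class WriterActor extends Actor {
  def receive: Receive = {
    case parsed: ParsedRecord =>
      // Write the parsed record to the output file, then acknowledge.
      sender() ! WriteAck()
  }
}

object WriterActor {
  def props(): Props = Props(new WriterActor)
}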
I think you should be using one of the "async" operations.
Perhaps this other Q&A gives you some inspiration: Processing an akka stream asynchronously and writing to a file sink.
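For instance, a sketch of the "async" approach using mapAsync with the classic ask pattern, reusing the source, parser and writerActor values from the question and the hypothetical ParsedRecord and WriteAck messages from the answer above:
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.duration._

implicit val timeout: Timeout = Timeout(5.seconds)

source
  .mapAsync(parallelism = 28)(record => (parser ? record).mapTo[ParsedRecord])
  .mapAsync(parallelism = 1)(parsed => (writerActor ? parsed).mapTo[WriteAck])
  .runWith(Sink.ignore)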
I'm writing a UCI interpreter as an Akka finite state machine. As per the specification, the interpreter must write its output to stdout and take its input from stdin.
I have a test suite for the actor, and I can test some (message-related) aspects of its behaviour, but I don't know how to capture stdout to make assertions about it, nor how to send it input through stdin. I've explored the ScalaTest API to the best of my abilities, but can't find how to achieve what I need.
This is the current test class:
package org.chess
import akka.actor.ActorSystem
import akka.testkit.{TestKit, TestProbe}
import org.chess.Keyword.Quit
import org.scalatest.wordspec.AnyWordSpecLike
import org.scalatest.{BeforeAndAfterAll, Matchers}
import scala.concurrent.duration._
import scala.language.postfixOps
class UCIInterpreterSpec(_system: ActorSystem)
    extends TestKit(_system)
    with Matchers
    with AnyWordSpecLike
    with BeforeAndAfterAll {

  def this() = this(ActorSystem("UCIInterpreterSpec"))

  override def afterAll(): Unit = {
    super.afterAll()
    shutdown(system)
  }

  "A UCI interpreter" should {
    "be able to quit" in {
      val testProbe = TestProbe()
      val interpreter = system.actorOf(UCIInterpreter.props)
      testProbe watch interpreter
      interpreter ! Command(Quit, Nil)
      testProbe.expectTerminated(interpreter, 3 seconds)
    }
  }
}
Of course, knowing that the interpreter can quit is useful... but not very useful. I need to test, for example, that sending the string isready to the interpreter makes it respond with readyok.
Is it possible that I am overcomplicating the test by using akka.testkit, instead of a simpler framework? I would like to keep using a single testing framework for simplicity, and I will need to test many other actor-related elements of the system, so if this could be solved without leaving the akka-testkit/scalatest domain, it would be fantastic.
Any help will be appreciated. Thanks in advance.
You need to change the design of your Actor.
The Actor should not read stdin or write stdout directly. Instead, give the actor objects in the Props that provide input and accept output. stdin could be something like () => String that is called each time input is needed. stdout could be String => Unit that is called each time output is generated. Or you could use Streams or similar constructs that are designed to be abstract sources and sinks of data.
In production code you pass objects that use stdin and stdout, but for test code you pass objects that read and write memory buffers. You can then check that the appropriate input is consumed by the Actor and that the appropriate output is generated by the Actor.
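A minimal sketch of this idea (the props signature, the ReadLine message, and the way the actor is driven are illustrative assumptions, not the actual project API):
import akka.actor.{Actor, Props}

object UCIInterpreter {
  // Hypothetical message that drives one read from the injected input.
  case object ReadLine

  // Hypothetical props: inject input and output instead of touching the console directly.
  def props(in: () => String, out: String => Unit): Props =
    Props(new UCIInterpreter(in, out))
}

class UCIInterpreter(in: () => String, out: String => Unit) extends Actor {
  import UCIInterpreter.ReadLine

  def receive: Receive = {
    case ReadLine =>
      in() match {
        case "isready" => out("readyok")
        case _         => // handle the other UCI commands here
      }
  }
}

// Production wiring uses the real console:
//   system.actorOf(UCIInterpreter.props(() => scala.io.StdIn.readLine(), println))
//
// In a test, substitute in-memory buffers and assert on them:
//   val inputs  = scala.collection.mutable.Queue("isready")
//   val outputs = scala.collection.mutable.Buffer.empty[String]
//   val interpreter = system.actorOf(UCIInterpreter.props(() => inputs.dequeue(), outputs += _))
//   interpreter ! UCIInterpreter.ReadLine
//   // eventually, assert that outputs contains "readyok"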
I want to compare the read performance of different storage systems using Spark, e.g. HDFS and S3N. I have written a small Scala program for this:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount")
    val sc = new SparkContext(conf)

    val file = sc.textFile("s3n://test/wordtest")
    val splits = file.map(word => word)
    splits.saveAsTextFile("s3n://test/myoutput")
  }
}
My question is, is it possible to run a read-only test with Spark? For the program above, isn't saveAsTextFile() causing some write as well?
I am not sure that is possible at all. In order to run a transformation, a subsequent action is necessary.
From the official Spark documentation:
All transformations in Spark are lazy, in that they do not compute their results right away. Instead, they just remember the transformations applied to some base dataset (e.g. a file). The transformations are only computed when an action requires a result to be returned to the driver program.
Taking this into account, saveAsTextFile is not the lightest of the wide range of actions available. Several lightweight alternatives exist, for example actions like count or first. These leave almost all of the work to the transformation (read) phase, letting you measure the read performance of your solution.
You might want to check the available actions and choose the one that best fits your requirements.
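For example, a sketch of a read-only measurement using count, reusing the input path from the question (the timing code and app name are illustrative):
import org.apache.spark.{SparkConf, SparkContext}

object ReadOnlyTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ReadOnlyTest")
    val sc = new SparkContext(conf)

    val start = System.nanoTime()
    val lines = sc.textFile("s3n://test/wordtest") // same input path as above
    val n = lines.count()                          // forces the read, writes nothing back to storage
    val elapsedMs = (System.nanoTime() - start) / 1e6

    println(s"Read $n lines in $elapsedMs ms")
    sc.stop()
  }
}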
Yes."saveAsTextFile" writes the RDD data to text file using given path.
I need to use java.util.zip.ZipOutputStream to respond with a compressed file archive.
The data is several hundred megabytes uncompressed, so I would like to store as little of it as possible. It is coming from a serialization of SQL results.
I see examples of using an OutputStream to return a chunked result using Enumerator.outputStream:
http://greweb.me/2012/11/play-framework-enumerator-outputstream/
Play/Akka integration with Java OutputStreams
but those seem ill-advised when I read the documentation (emphasis mine):
Create an Enumerator of bytes with an OutputStream.
Note that calls to write will not block, so if the iteratee that is being fed to is slow to consume the input, the OutputStream will not push back. This means it should not be used with large streams since there is a risk of running out of memory.
Clearly, I can't use that. Or at least not without modification.
How can I create a response with an OutputStream (in this case, a gzipped archive) while being assured that only portions of it will be stored in memory?
I recognize the difference between InputStreams/OutputStreams and Play's Enumerator/Iteratee paradigm, so I expect there will be a specific way in which I need to generate my source data (serialization of SQL results) so that it doesn't outpace the rate of download. I don't know what it is.
In general you can't safely use any OutputStream with the Enumerator/Iteratee framework because OutputStream doesn't support non-blocking pushback. However, if you can control the writing to the OutputStream you can hack together something like:
import java.io.ByteArrayOutputStream
import java.util.zip.ZipOutputStream
import play.api.libs.iteratee.Enumerator
import play.api.libs.concurrent.Execution.Implicits.defaultContext
import scala.concurrent.Future

val baos = new ByteArrayOutputStream
val zos = new ZipOutputStream(baos)

val enumerator = Enumerator.generateM {
  Future.successful {
    if (moreDataToWrite) { // placeholder condition: true while there is still data to serialize
      // Write data into zos
      val r = Some(baos.toByteArray)
      baos.reset()
      r
    } else None
  }
}
If all you need is compression, take a look at the Enumeratee instances provided in play.filters.gzip.Gzip and the play.filters.gzip.GzipFilter filter.
The only backpressure mechanism for OutputStream is blocking the thread. So one way or another, there will have to be a thread that is able to be blocked.
One way is to use piped streams.
import java.io.OutputStream
import java.io.PipedInputStream
import java.io.PipedOutputStream
import play.api.libs.iteratee.Enumerator
import scala.concurrent.{ExecutionContext, Future}

def outputStream2(a: OutputStream => Unit, bufferSize: Int)
                 (implicit ec1: ExecutionContext, ec2: ExecutionContext) = {
  val outputStream = new PipedOutputStream
  val inputStream = new PipedInputStream(outputStream, bufferSize) // connect the pipe before writing starts
  Future(a(outputStream))(ec1)            // run the blocking writer on its own execution context
  Enumerator.fromStream(inputStream)(ec2) // consume the pipe on the other context
}
Since the operations are blocking, you must take care to prevent deadlock.
Either use two different thread pools, or use a cached (unbounded) thread pool.
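For example, a hypothetical sketch of wiring outputStream2 up this way: the ZipOutputStream writer runs on a dedicated cached thread pool, while the enumerator is consumed on another context (writeArchive and the controller action are illustrative names):
import java.io.OutputStream
import java.util.concurrent.Executors
import java.util.zip.{ZipEntry, ZipOutputStream}
import scala.concurrent.ExecutionContext

val blockingEc: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newCachedThreadPool())

def writeArchive(os: OutputStream): Unit = {
  val zos = new ZipOutputStream(os)
  try {
    zos.putNextEntry(new ZipEntry("results.csv"))
    // serialize the SQL results into zos here, row by row
    zos.closeEntry()
  } finally zos.close() // closing ends the piped stream, which completes the enumerator
}

// In a Play 2.x (iteratee-era) controller, something like:
// def download = Action {
//   val enumerator = outputStream2(writeArchive, bufferSize = 64 * 1024)(blockingEc, defaultContext)
//   Ok.chunked(enumerator) // defaultContext being Play's default ExecutionContext
// }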