Scala: Read file as infinite/async stream of lines - scala

I have a file /files/somelog.log that is continuously written to by some other process. What I want to do is to read that file as a (never-ending) stream and respond to it whenever there is a new log line available.
val logStream: Stream[String] = readAsStream("/files/somelog.log")
logStream.foreach { line: String => println(line) }
So, immediately when the other process writes another line to the logfile, I expect the println above to go off (and keep going off).
However, that doesn't seem to be how the default Stream-type works. Scala's Stream isn't async, it's just lazy. What I'm probably looking for is something similar to what akka streams offers (I think). But Akka is so huge (150MB!) - I don't want to pull in so many new things that I don't need just to do something this simple/basic. Is there really no other (tiny!) library or technique I can use?

If you simply want to tail the file you could use Tailer from Apache Commons IO:
import java.io.File
import org.apache.commons.io.input.{Tailer, TailerListenerAdapter}

val l = new TailerListenerAdapter() {
  override def handle(line: String): Unit = println(line)
}
val tail = Tailer.create(new File("file.txt"), l)
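A hedged variant, assuming Commons IO 2.x: Tailer.create also accepts a polling delay in milliseconds and a flag to start tailing from the end of the file, and the tailer should be stopped when you are done with it.
// Poll every second, starting from the end of the file
val tailFromEnd = Tailer.create(new File("file.txt"), l, 1000, true)
// ... later, on shutdown:
tailFromEnd.stop()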

I'm not actually recommending this, but you could roll your own.
def nextLine(itr: Iterator[String]): String = {
  while (!itr.hasNext) Thread.sleep(2000)
  itr.next()
}
. . .
val log = io.Source.fromFile("/files/somelog.log").getLines
val logTxt = nextLine( log ) // will block until next line is available
. . .
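Putting the pieces together, a minimal sketch of the polling loop implied by the ". . ." above, printing each new line as it appears:
val log = io.Source.fromFile("/files/somelog.log").getLines
while (true) {
  println(nextLine(log)) // blocks, polling every 2s, until a new line is available
}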

Related

Sink for line-by-line file IO with backpressure

I have a file processing job that currently uses akka actors with manually managed backpressure to handle the processing pipeline, but I've never been able to successfully manage the backpressure at the input file reading stage.
This job takes an input file and groups lines by an ID number present at the start of each line. Once it hits a line with a new ID number, it pushes the grouped lines to a processing actor via a message and then continues with the new ID number, all the way until it reaches the end of the file.
This seems like it would be a good use case for Akka Streams, using the file as a source, but I'm still not sure of three things:
1) How can I read the file line by line?
2) How can I group by the ID present on every line? I currently use very imperative processing for this, and I don't think I'll have the same ability in a stream pipeline.
3) How can I apply backpressure, such that I don't keep reading lines into memory faster than I can process the data downstream?
Akka Streams' groupBy is one approach, but groupBy has a maxSubstreams parameter which would require you to know the maximum number of ID ranges up front. So the solution below uses scan to identify same-ID blocks and splitWhen to split them into substreams:
import java.io.File
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import akka.Done
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{FileIO, Framing}
import akka.util.ByteString

object Main extends App {
  implicit val system = ActorSystem("system")
  implicit val materializer = ActorMaterializer()

  def extractId(s: String) = {
    val a = s.split(",")
    a(0) -> a(1)
  }

  val file = new File("/tmp/example.csv")

  private val lineByLineSource = FileIO.fromFile(file)
    .via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 1024))
    .map(_.utf8String)

  val future: Future[Done] = lineByLineSource
    .map(extractId)
    .scan((false, "", ""))((l, r) => (l._2 != r._1, r._1, r._2))
    .drop(1)
    .splitWhen(_._1)
    .fold(("", Seq[String]()))((l, r) => (r._2, l._2 ++ Seq(r._3)))
    .concatSubstreams
    .runForeach(println)

  private val reply = Await.result(future, 10.seconds)
  println(s"Received $reply")
  Await.ready(system.terminate(), 10.seconds)
}
extractId splits each line into an id -> data tuple. scan prepends each id -> data tuple with a start-of-ID-range flag. The drop discards the primer element fed to scan. splitWhen starts a new substream for each start-of-range. fold concatenates each substream into a list and removes the start-of-ID-range boolean, so that each substream produces a single element. In place of the fold you probably want a custom SubFlow that processes the stream of rows for a single ID and emits some result for the ID range (see the sketch after the example output below). concatSubstreams merges the per-ID-range substreams produced by splitWhen back into a single stream, which is printed by runForeach.
Given this input file:
$ cat /tmp/example.csv
ID1,some input
ID1,some more input
ID1,last of ID1
ID2,one line of ID2
ID3,2nd before eof
ID3,eof
Output is:
(ID1,List(some input, some more input, last of ID1))
(ID2,List(one line of ID2))
(ID3,List(2nd before eof, eof))
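To illustrate the note above about replacing the fold with custom per-ID processing, here is a hedged sketch that reuses lineByLineSource and extractId from the program above, but reduces each ID range to an (id, rowCount) pair instead of collecting the lines:
// Same pipeline as above, with the fold swapped for a per-ID row count
val perIdCounts: Future[Done] = lineByLineSource
  .map(extractId)
  .scan((false, "", ""))((l, r) => (l._2 != r._1, r._1, r._2))
  .drop(1)
  .splitWhen(_._1)
  .fold(("", 0)) { case ((_, n), (_, id, _)) => (id, n + 1) }
  .concatSubstreams
  .runForeach { case (id, n) => println(s"$id -> $n rows") }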
It appears that the easiest way to add back pressure to your system without introducing huge modifications is to change the mailbox type of the actor that consumes the input groups to a BoundedMailbox.
Declare a bounded mailbox with a high mailbox-push-timeout-time and assign it to the actor that consumes your lines:
bounded-mailbox {
  mailbox-type = "akka.dispatch.BoundedDequeBasedMailbox"
  mailbox-capacity = 1
  mailbox-push-timeout-time = 1h
}
val actor = system.actorOf(Props(classOf[InputGroupsConsumingActor]).withMailbox("bounded-mailbox"))
Create an iterator from your file, then a grouped (by id) iterator on top of it. Then just cycle through the data, sending groups to the consuming actor. Note that the send will block in this case when the actor's mailbox gets full.
import scala.io.Source
import akka.actor.ActorRef

def iterGroupBy[A, K](iter: Iterator[A])(keyFun: A => K): Iterator[Seq[A]] = {
  def rec(s: Stream[A]): Stream[Seq[A]] =
    if (s.isEmpty) Stream.empty
    else s.span(keyFun(s.head) == keyFun(_)) match {
      case (prefix, suffix) => prefix.toList #:: rec(suffix)
    }
  rec(iter.toStream).toIterator
}

val lines = Source.fromFile("input.file").getLines()

iterGroupBy(lines) { l => l.headOption }.foreach { lines: Seq[String] =>
  actor.tell(lines, ActorRef.noSender)
}
That's it!
You probably want to move the file reading to a separate thread, as it will block. Also, by adjusting mailbox-capacity you can regulate the amount of memory consumed. But if reading a batch from the file is always faster than processing it, it seems reasonable to keep the capacity small, like 1 or 2.
Update: iterGroupBy is implemented with Stream and has been tested not to produce a StackOverflowError.
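A hedged sketch of the "separate thread" suggestion, reusing the actor and iterGroupBy from above; the single-thread executor is an assumption, not part of the original answer:
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// Dedicated thread for the blocking read-and-tell loop
val readerEc = ExecutionContext.fromExecutor(Executors.newSingleThreadExecutor())

Future {
  val lines = Source.fromFile("input.file").getLines()
  iterGroupBy(lines) { l => l.headOption }.foreach { group =>
    actor.tell(group, ActorRef.noSender) // blocks while the bounded mailbox is full
  }
}(readerEc)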

What is the Scala way to process a file line by line in parallel, with back pressure?

The following code will read a file line by line, create a task for each line and then queue it up to the executor. If the executor's queue is full, the reading from the file stops until there is space again.
I looked at a few suggestions on SO, but they all either require that the entire content of the file be read into memory, or have suboptimal scheduling (e.g. read 100 lines, process them in parallel, and only after that finishes read the next 100 lines). I also don't want to use libraries like Akka for this.
What is the Scala way of achieving this, without those drawbacks?
val exec = executorWithBoundedQueue()
val lines = Source.fromFile(sourceFile, cs).getLines()

lines.map { l =>
  exec.submit(new Callable[String] {
    override def call(): String = doStuff(l)
  })
}.foreach { s =>
  consume(s.get())
}

exec.shutdown()
Illustrative definition of executorWithBoundedQueue
def executorWithBoundedQueue(): ExecutorService = {
  val boundedQueue = new LinkedBlockingQueue[Runnable](1000)
  // CallerRunsPolicy makes submit() run the task on the calling thread when the
  // queue is full, which is what throttles the reading loop above.
  new ThreadPoolExecutor(4, 4, 0L, TimeUnit.MILLISECONDS, boundedQueue,
    new ThreadPoolExecutor.CallerRunsPolicy())
}

Splitting a scalaz-stream process into two child streams

Using scalaz-stream, is it possible to split/fork and then rejoin a stream?
As an example, let's say I have the following code:
val streamOfNumbers: Process[Task,Int] = Process.emitAll(1 to 10)
val sumOfEvenNumbers = streamOfNumbers.filter(isEven).fold(0)(add)
val sumOfOddNumbers  = streamOfNumbers.filter(isOdd).fold(0)(add)

zip(sumOfEvenNumbers, sumOfOddNumbers).to(someEffectfulFunction)
With scalaz-stream, in this example the results would be as you expect - a tuple of numbers from 1 to 10 passed to a sink.
However if we replace streamOfNumbers with something that requires IO, it will actually execute the IO action twice.
Using a Topic I'm able to create a pub/sub process that duplicates elements in the stream correctly, however it does not buffer - it simply consumes the entire source as fast as possible, regardless of the pace at which the sinks consume it.
I can wrap this in a bounded Queue, however the end result feels a lot more complex than it needs to be.
Is there a simpler way of splitting a stream in scalaz-stream without duplicate IO actions from the source?
Also, to clarify: the previous answer deals with the "splitting" requirement. Your specific issue may be solvable without splitting the stream at all:
val streamOfNumbers: Process[Task,Int] = Process.emitAll(1 to 10)

val oddOrEven: Process[Task, Int \/ Int] = streamOfNumbers.map {
  case even if even % 2 == 0 => right(even)
  case odd                   => left(odd)
}

val summed = oddOrEven.pipeW(sump1).pipeO(sump1)

val evenSink: Sink[Task, Int] = ???
val oddSink: Sink[Task, Int] = ???

summed
  .drainW(evenSink)
  .to(oddSink)
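The snippet above assumes a sump1 that sums the elements piped through it; presumably the standard summing process1 from scalaz-stream, something like:
// Assumed definition of the `sump1` used above; process1.sum needs a Numeric instance
val sump1: Process1[Int, Int] = process1.sum[Int]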
You can perhaps still use a Topic; just make sure that the child processes subscribe before you start pushing to the topic.
Note, however, that this solution has no bounds on it, i.e. if you push too fast you may run into an OutOfMemoryError.
def split[A](source: Process[Task,A]): Process[Task, (Process[Task,A], Process[Task,A])] = {
  val topic = async.topic[A]
  val sub1 = topic.subscribe
  val sub2 = topic.subscribe
  merge.mergeN(Process(emit(sub1 -> sub2), (source to topic.publish).drain))
}
I likewise needed this functionality. My situation was quite a bit trickier, which kept me from working around it in this manner.
Thanks to Daniel Spiewak's response in this thread, I was able to get the following to work. I improved on his solution by adding onHalt so my application would exit once the Process completed.
def split[A](p: Process[Task, A], limit: Int = 10): Process[Task, (Process[Task, A], Process[Task, A])] = {
  val left = async.boundedQueue[A](limit)
  val right = async.boundedQueue[A](limit)

  val enqueue = p.observe(left.enqueue).observe(right.enqueue).drain.onHalt { cause =>
    Process.await(Task.gatherUnordered(Seq(left.close, right.close))) { _ => Halt(cause) }
  }
  val dequeue = Process((left.dequeue, right.dequeue))

  enqueue merge dequeue
}

Using Iteratees and Enumerators in Play Scala to Stream Data to S3

I am building a Play Framework application in Scala where I would like to stream an array of bytes to S3. I am using the Play-S3 library to do this. The "Multipart file upload" section of the documentation is what's relevant here:
// Retrieve an upload ticket
val result: Future[BucketFileUploadTicket] =
  bucket initiateMultipartUpload BucketFile(fileName, mimeType)

// Upload the parts and save the tickets
val result: Future[BucketFilePartUploadTicket] =
  bucket uploadPart (uploadTicket, BucketFilePart(partNumber, content))

// Complete the upload using both the upload ticket and the part upload tickets
val result: Future[Unit] =
  bucket completeMultipartUpload (uploadTicket, partUploadTickets)
I am trying to do the same thing in my application but with Iteratees and Enumerators.
The streams and asynchronicity make things a little complicated, but here is what I have so far (Note uploadTicket is defined earlier in the code):
val partNumberStream = Stream.iterate(1)(_ + 1).iterator

val partUploadTicketsIteratee =
  Iteratee.fold[Array[Byte], Future[Vector[BucketFilePartUploadTicket]]](
    Future.successful(Vector.empty[BucketFilePartUploadTicket])) { (partUploadTickets, bytes) =>
    bucket.uploadPart(uploadTicket, BucketFilePart(partNumberStream.next(), bytes))
      .flatMap(partUploadTicket => partUploadTickets.map(_ :+ partUploadTicket))
  }

(body |>>> partUploadTicketsIteratee).andThen {
  case result =>
    result.map(_.map(partUploadTickets =>
      bucket.completeMultipartUpload(uploadTicket, partUploadTickets))) match {
      case Success(x) => x.map(d => println("Success"))
      case Failure(t) => throw t
    }
}
Everything compiles and runs without incident. In fact, "Success" gets printed, but no file ever shows up on S3.
There might be multiple problems with your code. It's a bit hard to read because of the nested map calls. You might have a problem with your future composition. Another problem might be caused by the fact that all chunks (except for the last) should be at least 5MB.
The code below has not been tested, but shows a different approach. The iteratee approach is one where you can create small building blocks and compose them into a pipe of operations.
To make the code compile I added a trait and a few methods
trait BucketFilePartUploadTicket
val uploadPart: (Int, Array[Byte]) => Future[BucketFilePartUploadTicket] = ???
val completeUpload: Seq[BucketFilePartUploadTicket] => Future[Unit] = ???
val body: Enumerator[Array[Byte]] = ???
Here we create a few parts
// Create 5MB chunks
val chunked = {
  val take5MB = Traversable.takeUpTo[Array[Byte]](1024 * 1024 * 5)
  Enumeratee.grouped(take5MB transform Iteratee.consume())
}

// Add a counter, used as part number later on
val zipWithIndex = Enumeratee.scanLeft[Array[Byte]](0 -> Array.empty[Byte]) {
  case ((counter, _), bytes) => (counter + 1) -> bytes
}

// Map the (Int, Array[Byte]) tuple to a BucketFilePartUploadTicket
val uploadPartTickets = Enumeratee.mapM[(Int, Array[Byte])](uploadPart.tupled)

// Construct the pipe to connect to the enumerator.
// The ><> operator is an alias for compose; it is more intuitive because of
// its arrow-like structure.
val pipe = chunked ><> zipWithIndex ><> uploadPartTickets

// Create a consumer that ends by finishing the upload
val consumeAndComplete =
  Iteratee.getChunks[BucketFilePartUploadTicket] mapM completeUpload
Running it is done by simply connecting the parts
// This is the result, a Future[Unit]
val result = body through pipe run consumeAndComplete
Note that I did not test any code and might have made some mistakes in my approach. This however shows a different way of dealing with the problem and should probably help you to find a good solution.
Note that this approach waits for one part to complete uploading before it takes on the next part. If the connection from your server to Amazon is slower than the connection from the browser to your server, this mechanism will slow the input.
You could take another approach where you do not wait for the Future of the part upload to complete. This would result in another step where you use Future.sequence to convert the sequence of upload futures into a single future containing a sequence of the results. The result would be a mechanism sending a part to amazon as soon as you have enough data.
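A hedged sketch of that non-waiting variant, reusing the building blocks above and assuming an implicit ExecutionContext in scope; the names introduced here are not from the original answer:
// Start each part upload as soon as its chunk is ready, without waiting for it
val uploadPartTicketsAsync =
  Enumeratee.map[(Int, Array[Byte])](uploadPart.tupled)

// Collect the in-flight futures and complete the upload once they have all finished
val completeWhenAllPartsDone =
  Iteratee.getChunks[Future[BucketFilePartUploadTicket]] mapM { futures =>
    Future.sequence(futures).flatMap(completeUpload)
  }

val resultAsync =
  body through (chunked ><> zipWithIndex ><> uploadPartTicketsAsync) run completeWhenAllPartsDone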

Play 2.x : Reactive file upload with Iteratees

I will start with the question: how can I use the Scala API's Iteratee to upload a file to cloud storage (Azure Blob Storage in my case, though I don't think that's the most important part right now)?
Background:
I need to chunk the input into blocks of about 1 MB for storing large media files (300 MB+) as Azure BlockBlobs. Unfortunately, my Scala knowledge is still poor (my project is Java based and the only use for Scala in it will be an upload controller).
I tried the code from this question: Why makes calling error or done in a BodyParser's Iteratee the request hang in Play Framework 2.0? (as an input Iteratee) - it works quite well, but each Element I get has a size of 8192 bytes, which is too small for sending files of several hundred megabytes to the cloud.
I must say that this is quite a new approach to me, and most probably I've misunderstood something (I don't want to say that I've misunderstood everything ;>).
I will appreciate any hint or link that helps me with this topic. If there is a sample of similar usage, that would be the best way for me to get the idea.
Basically, what you need first is to rechunk the input into bigger chunks of 1024 * 1024 bytes.
First, let's have an Iteratee that will consume up to 1MB of bytes (it's OK for the last chunk to be smaller):
val consumeAMB =
  Traversable.takeUpTo[Array[Byte]](1024 * 1024) &>> Iteratee.consume()
Using that, we can construct an Enumeratee (adapter) that will regroup chunks, using an API called grouped:
val rechunkAdapter: Enumeratee[Array[Byte], Array[Byte]] =
  Enumeratee.grouped(consumeAMB)
Here grouped uses an Iteratee to determine how much to put in each chunk. It uses our consumeAMB for that, which means the result is an Enumeratee that rechunks the input into Array[Byte] chunks of 1MB.
Now we need to write the BodyParser, which will use the Iteratee.foldM method to send each chunk of bytes:
val writeToStore: Iteratee[Array[Byte], _] =
  Iteratee.foldM(connectionHandle) { (c, bytes) =>
    // write bytes and return the next handle, probably in a Future
    ???
  }
foldM passes a state along and uses it in the function you pass, of type (S, Array[Byte]) => Future[S], to produce a new Future of state. foldM will not call the function again until that Future has completed and a chunk of input is available.
And the body parser will be rechunking input and pushing it into the store:
BodyParser( rh => (rechunkAdapter &>> writeToStore).map(Right(_)))
Returning a Right indicates that you are returning a body by the end of the body parsing (which happens to be the handler here).
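For completeness, a hedged sketch of wiring this body parser into an action; the action name and response are assumptions, and an implicit ExecutionContext may be needed for the map depending on your Play version:
def upload = Action(BodyParser(rh => (rechunkAdapter &>> writeToStore).map(Right(_)))) { request =>
  Ok("upload complete")
}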
If your goal is to stream to S3, here is a helper that I have implemented and tested:
def uploadStream(bucket: String, key: String, enum: Enumerator[Array[Byte]])
                (implicit ec: ExecutionContext): Future[CompleteMultipartUploadResult] = {
  import scala.collection.JavaConversions._

  val initRequest = new InitiateMultipartUploadRequest(bucket, key)
  val initResponse = s3.initiateMultipartUpload(initRequest)
  val uploadId = initResponse.getUploadId

  val rechunker: Enumeratee[Array[Byte], Array[Byte]] = Enumeratee.grouped {
    Traversable.takeUpTo[Array[Byte]](5 * 1024 * 1024) &>> Iteratee.consume()
  }

  val uploader = Iteratee.foldM[Array[Byte], Seq[PartETag]](Seq.empty) { case (etags, bytes) =>
    val uploadRequest = new UploadPartRequest()
      .withBucketName(bucket)
      .withKey(key)
      .withPartNumber(etags.length + 1)
      .withUploadId(uploadId)
      .withInputStream(new ByteArrayInputStream(bytes))
      .withPartSize(bytes.length)

    val etag = Future { s3.uploadPart(uploadRequest).getPartETag }
    etag.map(etags :+ _)
  }

  val futETags = enum &> rechunker |>>> uploader

  futETags.map { etags =>
    val compRequest = new CompleteMultipartUploadRequest(bucket, key, uploadId, etags.toBuffer[PartETag])
    s3.completeMultipartUpload(compRequest)
  }.recoverWith { case e: Exception =>
    s3.abortMultipartUpload(new AbortMultipartUploadRequest(bucket, key, uploadId))
    Future.failed(e)
  }
}
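A hedged usage sketch for the helper above; the bucket name, key, and file path are placeholders, and s3 is the Amazon S3 client the helper already assumes to be in scope:
// Stream a local file to S3 through the multipart helper
val bytes: Enumerator[Array[Byte]] = Enumerator.fromFile(new File("/tmp/bigfile.bin"))

val uploaded: Future[CompleteMultipartUploadResult] =
  uploadStream("my-bucket", "some/key", bytes)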
Add the following to your config file:
play.http.parser.maxMemoryBuffer=256K
For those who are also trying to figure out a solution to this streaming problem, instead of writing a whole new BodyParser you can also use what has already been implemented in parse.multipartFormData.
You can implement something like the code below to override the default handler handleFilePartAsTemporaryFile.
def handleFilePartAsS3FileUpload: PartHandler[FilePart[String]] =
  handleFilePart {
    case FileInfo(partName, filename, contentType) =>
      (rechunkAdapter &>> writeToS3).map { _ =>
        val compRequest = new CompleteMultipartUploadRequest(...)
        amazonS3Client.completeMultipartUpload(compRequest)
        ...
      }
  }

def multipartFormDataS3: BodyParser[MultipartFormData[String]] =
  multipartFormData(handleFilePartAsS3FileUpload)
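A hedged sketch of using that parser in a controller action; the action name and response body are assumptions:
def uploadToS3 = Action(multipartFormDataS3) { request =>
  Ok("Uploaded: " + request.body.files.map(_.filename).mkString(", "))
}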
I was able to make this work, but I'm still not sure whether the whole upload process is streamed. I tried some large files and it seems the S3 upload only starts when the whole file has been sent from the client side.
I looked at the above parser implementation and I think everything is connected using Iteratees, so the file should be streamed.
If someone has insight on this, that would be very helpful.