What is the Scala way to process a file line by line in parallel, with back pressure?

The following code will read a file line by line, create a task for each line and then queue it up to the executor. If the executor's queue is full, the reading from the file stops until there is space again.
I looked at a few suggestions on SO, but they all either require the entire content of the file to be read into memory, or use suboptimal scheduling (e.g. read 100 lines, process them in parallel, and only after that finishes read the next 100 lines). I also don't want to use libraries like Akka for this.
What is the Scala way of achieving this, without those drawbacks?
val exec = executorWithBoundedQueue()
val lines = Source.fromFile(sourceFile, cs).getLines()
lines.map {
l => exec.submit(new Callable[String] {
override def call(): String = doStuff(l)
})
}.foreach {
s => consume(s.get())
}
exec.shutdown()
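Illustrative definition of executorWithBoundedQueue:
def executorWithBoundedQueue(): ExecutorService = {
  val boundedQueue = new LinkedBlockingQueue[Runnable](1000)
  // Pool size and rejection policy are placeholder assumptions; CallerRunsPolicy
  // runs the task on the submitting thread when the queue is full.
  new ThreadPoolExecutor(4, 4, 0L, TimeUnit.MILLISECONDS, boundedQueue,
    new ThreadPoolExecutor.CallerRunsPolicy())
}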
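One possible shape for this without Akka, sketched under stated assumptions: keep a bounded window of in-flight futures and block on the oldest one before pulling the next line. The names `processBounded`, `parallelism` and `window` are hypothetical and not part of the question's code. Note that `getLines()` returns a lazy `Iterator`, so chaining `map` and `foreach` on it submits one task and then immediately waits for its result before submitting the next.

```scala
import java.util.concurrent.{Callable, Executors, Future => JFuture}
import scala.collection.mutable

// Hypothetical helper: runs `work` on up to `parallelism` threads while keeping
// at most `window` submitted-but-unconsumed tasks. Results are consumed in input order.
def processBounded[A, B](items: Iterator[A], parallelism: Int, window: Int)(
    work: A => B)(consume: B => Unit): Unit = {
  val exec = Executors.newFixedThreadPool(parallelism)
  val inFlight = mutable.Queue.empty[JFuture[B]]
  try {
    items.foreach { item =>
      // Back pressure: when the window is full, wait for the oldest task
      // before pulling the next item from the iterator.
      if (inFlight.size >= window) consume(inFlight.dequeue().get())
      inFlight.enqueue(exec.submit(new Callable[B] { def call(): B = work(item) }))
    }
    // Drain whatever is still running.
    while (inFlight.nonEmpty) consume(inFlight.dequeue().get())
  } finally exec.shutdown()
}

// Usage with in-memory data; for a file, pass Source.fromFile(...).getLines() instead.
val doubled = mutable.ArrayBuffer.empty[Int]
processBounded(Iterator.range(1, 11), parallelism = 4, window = 8)(_ * 2)(doubled += _)
```

The window size plays the role of the bounded queue: memory use is capped at `window` pending lines, and a slow task at the head of the window stalls reading until it completes.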

Related

Chunk file into multiple streams in scala

I have a use case in my program where I need to take a file, split it into N equal parts and upload them remotely.
I'd like a function that takes, say, a File and outputs a list of BufferedReader instances, which I can then distribute to another function that uses some API to store them.
I've seen examples where authors utilize the .lines() method of a BufferedReader:
def splitFile: List[java.util.stream.Stream[String]] = {
  val temp = "Test mocked file contents\nTest"
  val is = new ByteArrayInputStream(temp.getBytes)
  val br = new BufferedReader(new InputStreamReader(is))
  // Chunk the file into two sort-of equal parts.
  // Stream 1
  val test = br.lines().skip(1).limit(1)
  // Stream 2
  val test2 = br.lines().skip(2).limit(1)
  List(test, test2)
}
I suppose that the above example works; it's not beautiful, but it works.
My questions:
Is there a way to split a BufferedReader into a list of multiple streams?
I don't control the format of the File, so the file contents could potentially be a single line long. Wouldn't that mean that .lines() just loads all of that into a Stream of one element?
It will be much easier to read the file once, and then split it up. You're not really losing anything either, since the file read is a serial operation anyway. From there, just devise a scheme to slice up the list. In this case, I group everything based on its index modulo the number of desired output lists, then pull out the lists. If there are fewer lines than you ask for, it will just place each line in a separate List.
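import scala.collection.JavaConverters._
// br.lines returns a java.util.stream.Stream, so convert it to a Scala List first
val lines: List[String] = br.lines.iterator.asScala.toList
val outputListQuantity: Int = 2
val data: List[List[String]] = lines.zipWithIndex.groupBy(_._2 % outputListQuantity)
  .values.map(_.map(_._1)).toList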
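A worked example of the index-modulo scheme on in-memory data, as a sketch. `roundRobin` is a hypothetical name used only here. Because `groupBy` returns a `Map` whose iteration order is not specified, this version sorts by bucket index to keep the output order stable.

```scala
// Hypothetical helper: distributes elements over at most `n` lists by index modulo n.
def roundRobin[A](lines: List[A], n: Int): List[List[A]] =
  lines.zipWithIndex
    .groupBy { case (_, i) => i % n }
    .toList
    .sortBy { case (bucket, _) => bucket } // Map order is unspecified, so sort by bucket
    .map { case (_, pairs) => pairs.map(_._1) }

val parts = roundRobin(List("a", "b", "c", "d", "e"), 2)
// parts == List(List("a", "c", "e"), List("b", "d"))
```

With fewer elements than buckets, only the non-empty buckets appear in the result, so `roundRobin(List("x"), 3)` yields a single one-element list.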
Well, if you don't mind reading the whole stream into memory, it's easy enough (assuming, that file contains text - since you are talking about Readers, but it would be the same idea with binary):
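Source.fromFile("filename")
  .mkString
  .getBytes
  .grouped(chunkSize)
  // each chunk is an Array[Byte], so wrap it in a ByteArrayInputStream first
  .map { chunk => new BufferedReader(new InputStreamReader(new ByteArrayInputStream(chunk))) }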
But that seems to sorta defeat the purpose: if the file is small enough to be loaded into memory entirely, why bother splitting it to begin with?
So, a more practical solution is a little bit more involved:
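def splitFile(
    input: InputStream,
    chunkSize: Int
): Iterator[InputStream] = new Iterator[InputStream] {
  // Read one chunk ahead so that hasNext stays accurate at end of input.
  private var pending = readChunk()
  // read() may return fewer bytes than requested, so loop until full or EOF.
  private def readChunk(): Array[Byte] = {
    val buffer = new Array[Byte](chunkSize)
    var filled = 0
    var n = 0
    while (filled < chunkSize && n != -1) {
      n = input.read(buffer, filled, chunkSize - filled)
      if (n > 0) filled += n
    }
    java.util.Arrays.copyOf(buffer, filled)
  }
  def hasNext: Boolean = pending.nonEmpty
  def next(): InputStream = {
    val chunk = pending
    pending = readChunk()
    new ByteArrayInputStream(chunk)
  }
}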

Stop processing large text files in Apache Spark after certain amount of errors

I'm very new to Spark. I'm working with version 1.6.1.
Let's imagine I have a large file that I'm reading into an RDD[String] through textFile.
Then I want to validate each line in some function.
Because the file is huge, I want to stop processing once I reach a certain number of errors, let's say 1000 lines.
Something like:
val rdd = sparkContext.textFile(fileName)
rdd.map(line => myValidator.validate(line))
Here is the validate function:
def validate(line:String) : (String, String) = {
// 1st in Tuple for resulted line, 2nd ,say, for validation error.
}
How can I count errors inside validate, given that it is actually executed in parallel on multiple nodes? Broadcasts? Accumulators?
You can achieve this behavior using Spark's laziness by "splitting" the result of the parsing into successes and failures, calling take(n) on the failures, and only using the success data if there were fewer than n failures.
To achieve this more conveniently, I'd suggest changing the signature of validate to return some type that can easily distinguish success from failure, e.g. scala.util.Try:
def validate(line:String) : Try[String] = {
// returns Success[String] on success,
// Failure (with details in the exception object) otherwise
}
And then, something like:
val maxFailures = 1000
val rdd = sparkContext.textFile(fileName)
val parsed: RDD[Try[String]] = rdd.map(line => myValidator.validate(line)).cache()
val failures: Array[Throwable] = parsed.collect { case Failure(e) => e }.take(maxFailures)
if (failures.size == maxFailures) {
// report failures...
} else {
val success: RDD[String] = parsed.collect { case Success(s) => s }
// continue here...
}
Why would this work?
If there are fewer than 1000 failures, the entire dataset will be parsed when take(maxFailures) is called, and the successful data will be cached and ready to use.
If there are 1000 failures or more, the parsing stops there, as the take operation won't require any more reads.
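The early-stop idea can be illustrated without a cluster. The following is a plain Scala `Iterator` analogue, not Spark code: Spark schedules work per partition, so the exact stopping point there will differ. The integer-parsing `validate` and the `validated` counter are assumptions made only for this illustration.

```scala
import scala.util.{Failure, Success, Try}

var validated = 0 // counts how many lines were actually validated

// Hypothetical validator: a line is valid if it parses as an integer.
def validate(line: String): Try[String] = {
  validated += 1
  Try(line.trim.toInt).map(_ => line)
}

val lines = Iterator("1", "x", "2", "y", "z", "3", "4")
val maxFailures = 2

// Lazy pipeline: nothing is validated until take(...).toList pulls elements,
// and pulling stops as soon as maxFailures failures have been collected.
val failures: List[Throwable] =
  lines.map(validate).collect { case Failure(e) => e }.take(maxFailures).toList
```

Here the second failure is found at the fourth line, so the remaining three lines are never validated.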

Sink for line-by-line file IO with backpressure

I have a file processing job that currently uses akka actors with manually managed backpressure to handle the processing pipeline, but I've never been able to successfully manage the backpressure at the input file reading stage.
This job takes an input file and groups lines by an ID number present at the start of each line, and then once it hits a line with a new ID number, it pushes the grouped lines to a processing actor via message, and then continues with the new ID number, all the way until it reaches the end of the file.
This seems like it would be a good use case for Akka Streams, using the File as a sink, but I'm still not sure of three things:
1) How can I read the file line by line?
2) How can I group by the ID present on every line? I currently use very imperative processing for this, and I don't think I'll have the same ability in a stream pipeline.
3) How can I apply backpressure, such that I don't keep reading lines into memory faster than I can process the data downstream?
Akka Streams' groupBy is one approach. But groupBy has a maxSubstreams parameter, which would require you to know the maximum number of ID ranges up front. So the solution below uses scan to identify same-ID blocks, and splitWhen to split them into substreams:
object Main extends App {
implicit val system = ActorSystem("system")
implicit val materializer = ActorMaterializer()
def extractId(s: String) = {
val a = s.split(",")
a(0) -> a(1)
}
val file = new File("/tmp/example.csv")
private val lineByLineSource = FileIO.fromFile(file)
.via(Framing.delimiter(ByteString("\n"), maximumFrameLength = 1024))
.map(_.utf8String)
val future: Future[Done] = lineByLineSource
.map(extractId)
.scan( (false,"","") )( (l,r) => (l._2 != r._1, r._1, r._2) )
.drop(1)
.splitWhen(_._1)
.fold( ("",Seq[String]()) )( (l,r) => (r._2, l._2 ++ Seq(r._3) ))
.concatSubstreams
.runForeach(println)
private val reply = Await.result(future, 10 seconds)
println(s"Received $reply")
Await.ready(system.terminate(), 10 seconds)
}
extractId splits lines into id -> data tuples. scan prepends id -> data tuples with a start-of-ID-range flag. The drop drops the primer element passed to scan. splitWhen starts a new substream for each start-of-range. fold concatenates substreams to lists and removes the start-of-ID-range boolean, so that each substream produces a single element. In place of the fold you probably want a custom SubFlow which processes a stream of rows for a single ID and emits some result for the ID range. concatSubstreams merges the per-ID-range substreams produced by splitWhen back into a single stream that's printed by runForeach.
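To see what the scan step produces, the same flagging logic can be run on a plain Scala collection. This is only a sketch of the idea using `scanLeft` and `foldLeft` on an in-memory list, not Akka Streams code; the names `flagged` and `grouped` and the use of `split(",", 2)` are choices made for this illustration.

```scala
val lines = List(
  "ID1,some input", "ID1,some more input", "ID1,last of ID1",
  "ID2,one line of ID2", "ID3,2nd before eof", "ID3,eof")

def extractId(s: String): (String, String) = {
  val a = s.split(",", 2)
  a(0) -> a(1)
}

// Same function as in the stream's scan: flag rows whose ID differs from the previous row.
val flagged: List[(Boolean, String, String)] =
  lines.map(extractId)
    .scanLeft((false, "", "")) { (l, r) => (l._2 != r._1, r._1, r._2) }
    .drop(1) // drop the primer element

// Collection analogue of splitWhen + fold: start a new group whenever the flag is set.
val grouped: Vector[(String, Vector[String])] =
  flagged.foldLeft(Vector.empty[(String, Vector[String])]) {
    case (acc, (startsNew, id, data)) =>
      if (startsNew || acc.isEmpty) acc :+ (id -> Vector(data))
      else acc.init :+ (acc.last._1 -> (acc.last._2 :+ data))
  }
// flagged.map(_._1) == List(true, false, false, true, true, false)
```

Unlike the stream version, this holds the whole input in memory, so it is useful for understanding the flags rather than for processing large files.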
Run with:
$ cat /tmp/example.csv
ID1,some input
ID1,some more input
ID1,last of ID1
ID2,one line of ID2
ID3,2nd before eof
ID3,eof
Output is:
(ID1,List(some input, some more input, last of ID1))
(ID2,List(one line of ID2))
(ID3,List(2nd before eof, eof))
It appears that the easiest way to add "back pressure" to your system without introducing huge modifications is to simply change the mailbox type of the input groups consuming Actor to BoundedMailbox.
Change the type of the Actor that consumes your lines to BoundedMailbox with high mailbox-push-timeout-time:
bounded-mailbox {
mailbox-type = "akka.dispatch.BoundedDequeBasedMailbox"
mailbox-capacity = 1
mailbox-push-timeout-time = 1h
}
val actor = system.actorOf(Props(classOf[InputGroupsConsumingActor]).withMailbox("bounded-mailbox"))
Create an iterator from your file, then create a grouped (by ID) iterator from that iterator. Then just cycle through the data, sending groups to the consuming Actor. Note that the send will block in this case when the Actor's mailbox gets full.
def iterGroupBy[A, K](iter: Iterator[A])(keyFun: A => K): Iterator[Seq[A]] = {
def rec(s: Stream[A]): Stream[Seq[A]] =
if (s.isEmpty) Stream.empty else {
s.span(keyFun(s.head) == keyFun(_)) match {
case (prefix, suffix) => prefix.toList #:: rec(suffix)
}
}
rec(iter.toStream).toIterator
}
val lines = Source.fromFile("input.file").getLines()
iterGroupBy(lines){l => l.headOption}.foreach {
lines:Seq[String] =>
actor.tell(lines, ActorRef.noSender)
}
That's it!
You probably want to move the file reading to a separate thread, as it's going to block. Also, by adjusting mailbox-capacity you can regulate the amount of memory consumed. But if reading a batch from the file is always faster than processing it, it seems reasonable to keep the capacity small, like 1 or 2.
Update: iterGroupBy is implemented with Stream and was tested not to produce a StackOverflowError.
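An alternative sketch of the same consecutive grouping without Stream, working directly on a buffered iterator so that only the current group is held in memory. `groupConsecutive` is a hypothetical name; like iterGroupBy, it assumes that elements with the same key are adjacent.

```scala
// Hypothetical helper: groups adjacent elements that share the same key.
def groupConsecutive[A, K](iter: Iterator[A])(key: A => K): Iterator[Seq[A]] = {
  val it = iter.buffered // lets us peek at the next element via `head`
  new Iterator[Seq[A]] {
    def hasNext: Boolean = it.hasNext
    def next(): Seq[A] = {
      val k = key(it.head)
      val group = Vector.newBuilder[A]
      // Consume elements until the key changes or the input ends.
      while (it.hasNext && key(it.head) == k) group += it.next()
      group.result()
    }
  }
}

val groups = groupConsecutive(Iterator("a1", "a2", "b1", "c1", "c2"))(_.head).toList
// groups == List(Vector("a1", "a2"), Vector("b1"), Vector("c1", "c2"))
```

Each call to `next()` reads exactly one group from the underlying iterator, so a blocking consumer downstream also throttles how fast the file is read.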

Scala: Read file as infinite/async stream of lines

I have a file /files/somelog.log that is continuously written to by some other process. What I want to do is to read that file as a (never-ending) stream and respond to it whenever there is a new log line available.
val logStream: Stream[String] = readAsStream("/files/somelog.log")
logStream.foreach { line: String => println(line) }
So, as soon as the other process writes another line to the logfile, I expect the println above to go off (and keep going off).
However, that doesn't seem to be how the default Stream-type works. Scala's Stream isn't async, it's just lazy. What I'm probably looking for is something similar to what akka streams offers (I think). But Akka is so huge (150MB!) - I don't want to pull in so many new things that I don't need just to do something this simple/basic. Is there really no other (tiny!) library or technique I can use?
If you simply want to tail the file you could use Tailer from Apache Commons IO:
val l = new TailerListenerAdapter() {
override def handle(line: String): Unit = println(line)
}
val tail = Tailer.create(new File("file.txt"), l)
I'm not actually recommending this, but you could roll your own.
def nextLine( itr: Iterator[String] ): String = {
while (!itr.hasNext) Thread.sleep(2000)
itr.next
}
. . .
val log = io.Source.fromFile("/files/somelog.log").getLines
val logTxt = nextLine( log ) // will block until next line is available
. . .

Modifying a large file in Scala

I am trying to modify a large PostScript file in Scala (some are as large as 1GB in size). The file is a group of batches, with each batch containing a code that represents the batch number, number of pages, etc.
I need to:
Search the file for the batch codes (which always start with the same line in the file)
Count the number of pages until the next batch code
Modify the batch code to include how many pages are in each batch.
Save the new file in a different location.
My current solution uses two iterators (iterA and iterB), created from Source.fromFile("file.ps").getLines. The first iterator (iterA) traverses in a while loop to the beginning of a batch code (with iterB.next being called each time as well). iterB then continues searching until the next batch code (or the end of the file), counting the number of pages it passes as it goes. Then, it updates the batch code at iterA's position, and the process repeats.
This seems very non-Scala-like and I still haven't designed a good way to save these changes into a new file.
What is a good approach to this problem? Should I ditch iterators entirely? I'd preferably like to do it without having to have the entire input or output into memory at once.
Thanks!
You could probably implement this with Scala's Stream class. I am assuming that you don't mind holding one "batch" in memory at a time.
import scala.annotation.tailrec
import scala.io._
def isBatchLine(line:String):Boolean = ...
def batchLine(size: Int):String = ...
val it = Source.fromFile("in.ps").getLines
// cannot use it.toStream here because of SI-4835
def inLines = Stream.continually(it).takeWhile(_.hasNext).map(_.next)
// Note: using `def` instead of `val` here means we don't hold
// the entire stream in memory
def batchedLinesFrom(stream: Stream[String]):Stream[String] = {
val (batch, remainder) = stream span { !isBatchLine(_) }
if (batch.isEmpty && remainder.isEmpty) {
Stream.empty
} else {
batchLine(batch.size) #:: batch #::: batchedLinesFrom(remainder.drop(1))
}
}
def newLines = batchedLinesFrom(inLines dropWhile isBatchLine)
val ps = new java.io.PrintStream(new java.io.File("out.ps"))
newLines foreach ps.println
ps.close()
If you are not in pursuit of functional Scala enlightenment, I'd recommend a more imperative style using java.util.Scanner#findWithinHorizon. My example is quite naive, iterating through the input twice.
val scanner = new Scanner(inFile)
val writer = new BufferedWriter(...)
def loop(): Unit = { // a recursive method needs an explicit result type
// you might want to limit the horizon to prevent OutOfMemoryError
Option(scanner.findWithinHorizon(".*YOUR-BATCH-MARKER", 0)) match {
case Some(batch) =>
val pageCount = countPages(batch)
writePageCount(writer, pageCount)
writer.write(batch)
loop()
case None =>
}
}
loop()
scanner.close()
writer.close()
Maybe you can use span and duplicate effectively. Assuming the iterator is positioned at the start of a batch, you take the span before the next batch, duplicate it so that you can count the pages, write the modified batch line, then write the pages using the duplicated iterator. Then process the next batch recursively:
def batch(i: Iterator[String]): Unit = {
  if (i.hasNext) {
    assert(i.next() == "batch")
    // everything up to the next "batch" line belongs to the current batch
    val (current, next) = i.span(_ != "batch")
    val (forCounting, forWriting) = current.duplicate
    val count = forCounting.count(_ == "p")
    println("batch " + count)
    forWriting.foreach(println)
    batch(next)
  }
}
Assuming the following input:
val src = Source.fromString("head\nbatch\np\np\nbatch\np\nbatch\np\np\np\n")
You position the iterator at the start of batch and then you process the batches:
val (head, next) = src.getLines.span(_ != "batch")
head.foreach(println)
batch(next)
This prints:
head
batch 2
p
p
batch 1
p
batch 3
p
p
p