Scala writing and reading from a file inside a while loop - scala

I have an application where I need to write values into a file and read them back into the program inside a while loop. This fails because the file is only written when I exit the loop, not on every iteration. Therefore, in subsequent iterations I cannot access values that should have been written into the file in the previous iterations. How can I make every iteration write into the file, instead of all values being written only at the end of the while loop?
My application uses Scalafix. It reads in a test suite Scala file and duplicates its test cases on each iteration. The important details are explained in my series of 8 numbered comments. Is there something about how FileWriter works that makes it wait until the last round of the loop to write back to the file, since it does not write back to the file on every iteration of the loop?
object Printer {
  //1. This is my filePrinter which I call at every iteration to print the new test file with its test cases duplicated.
  def saveFile(filename: String, data: String): Unit = {
    val fileWritter: FileWriter = new FileWriter(filename)
    val bufferWritter: BufferedWriter = new BufferedWriter(fileWritter)
    bufferWritter.write(data)
    bufferWritter.flush()
    bufferWritter.close()
  }
}
object Main extends App {
//2. my loop starts here.
var n = 2
do {
// read in a semanticDocument (function provided below)
val ( sdoc1,base,filename)=SemanticDocumentBuilder.buildSemanticDocument()
implicit val sdoc = sdoc1 //4. P3 is a scalafix "patch" that collects all the test cases of
// test suite and duplicates them. It works just fine, see the next comment.
val p3 =sdoc.tree.collect {
case test @ Term.ApplyInfix(Term.ApplyInfix(_,Term.Name(smc), _,
List(Lit.String(_))), Term.Name("in"), _, params) =>
Patch.addRight(test,"\n" +test.toString())
}.asPatch
//5. I collect the test cases in the next line and print
//out how many they are. At this moment, I have not
// applied the duplicate function, so they are still as
//originally read from the test file.
val staticAnalyzer = new StaticAnalyzer()
val testCases: List[Term.ApplyInfix] =
staticAnalyzer.collectTestCases()
println("Tests cases count: "+ testCases.length)
val r3 = RuleName(List(RuleIdentifier("r3")))
val map:Map[RuleName, Patch] = Map(r3->p3)
val r = PatchInternals(map, v0.RuleCtx(sdoc.tree), None)
//6. After applying the p3 patch in the previous three lines,
//I indeed print out the newly created test suite file
//and it contains each test case duplicated as shown
// by the below println(r._1.getClass).
println(r._1.getClass)
//7. I then call my saveFile function (see this function above - first lines of this code)
Printer.saveFile(base+"src/test/scala/"+filename,r._1)
n-=1
//8. Since I have saved my file with the duplicates,
//I would expect that it overwrites the original file
//(as I have not used "append = true").
//I would then expect the next count of test cases to
//have doubled, but this is never the case.
//The save function with FileWriter only works in the last loop.
//Therefore, no matter the number of loops, it only doubles once!
println("Loop: "+ n)
} while(n>0)
}
**Edit: factored out the reading in of the SemanticDocument.** This function simply returns a SemanticDocument and two strings representing my file path and filename.
object SemanticDocumentBuilder{
def buildSemanticDocument(): (SemanticDocument,String,String) ={
val base = "/Users/soft/Downloads/simpleAkkaProject/"
val local = new File(base)
//val dependenceisSBTCommand = s"sbt -ivy ./.ivy2 -Dsbt.ivy.home=./.ivy2 -Divy.home=./.ivy2
//val sbtCmd = s"sbt -ivy ./ivy2 -Dsbt.ivy.home=./ivy2 -Divy.home=./ivy2 -Dsbt.boot.directo
val result = sys.process.Process(Seq("sbt","semanticdb"), local).!
val jars = FileUtils.listFiles(local, Array("jar"), true).toArray(new Array[File](0))
.toList
.map(f => Classpath(f.getAbsolutePath))
.reduceOption(_ ++ _)
val classes = FileUtils.listFilesAndDirs(local, TrueFileFilter.INSTANCE, DirectoryFileFilte
.toList
.filter(p => p.isDirectory && !p.getAbsolutePath.contains(".sbt") && p.getAbsolutePath.co
.map(f => Classpath(f.getAbsolutePath))
.reduceOption(_ ++ _)
val classPath = ClassLoader.getSystemClassLoader.asInstanceOf[URLClassLoader].getURLs
.map(url => Classpath(url.getFile))
.reduceOption(_ ++ _)
val all = (jars ++ classes ++ classPath).reduceOption(_ ++ _).getOrElse(Classpath(""))
val symbolTable = GlobalSymbolTable(all)
val filename = "AkkaQuickstartSpec.scala"
val root = AbsolutePath(base).resolve("src/test/scala/")
println(root)
val abspath = root.resolve(filename)
println(root)
val relpath = abspath.toRelative(AbsolutePath(base))
println(relpath)
val sourceFile = new File(base+"src/test/scala/"+filename)
val input = Input.File(sourceFile)
println(input)
if (n == firstRound){
doc = SyntacticDocument.fromInput(input)
}
//println(doc.tree.structure(30))
var documents: Map[String, TextDocument] = Map.empty
Locator.apply(local.toPath)((path, db) => db.documents.foreach({
case document @ TextDocument(_, uri, text, md5, _, _, _, _, _) if !md5.isEmpty => { // skip
if (n == firstRound){
ast= sourceFile.parse[Source].getOrElse(Source(List()))
}
documents = documents + (uri -> document)
println(uri)
}
println(local.canWrite)
if (editedSuite != null){
Printer.saveFile(sourceFile,editedSuite)
}
}))
//println(documents)
val impl = new InternalSemanticDoc(doc, documents(relpath.toString()), symbolTable)
implicit val sdoc = new SemanticDocument(impl)
val symbols = sdoc.tree.collect {
case t @ Term.Name("<") => {
println(s"symbol for $t")
println(t.symbol.value)
println(symbolTable.info(t.symbol.value))
}
}
(sdoc,base,filename)
}
}

In saveFile you need to close the fileWriter after closing the bufferedWriter. You don't need to flush because close will do this for you.
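For example, a minimal sketch of saveFile along those lines (based on the code above; on Scala 2.13+ scala.util.Using would handle the closing for you):
import java.io.{BufferedWriter, FileWriter}

def saveFile(filename: String, data: String): Unit = {
  val fileWriter = new FileWriter(filename)
  val bufferedWriter = new BufferedWriter(fileWriter)
  try {
    bufferedWriter.write(data)
  } finally {
    bufferedWriter.close() // flushes the buffer and closes the underlying writer
    fileWriter.close()     // harmless second close, as advised above
  }
}
The try/finally guarantees the handles are released even if write throws.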
You should also close all the other File objects that you create in the loop, because they may be holding on to stale file handles. (e.g. local, ast)
More generally, clean up the code by putting code in functions with meaningful names. There is also a lot of code that can be outside the loop. Doing this will make it easier to see what is going on and allow you to create a Minimal, Complete, and Verifiable example. As it stands, it is really difficult to work out what is going on.

Related

Using variable second times in function not return the same value?

I have just started learning Scala and wrote this code. My question: why does a val, which is supposed to be constant, give a different result when I pass it to the same function a second time? How do I write a pure function in Scala?
Also, any comment on whether the line counting is correct is welcome.
import java.io.FileNotFoundException
import java.io.IOException
import scala.io.BufferedSource
import scala.io.Source.fromFile
object Main {
  def main(args: Array[String]): Unit = {
    val fileName: String = if (args.length == 1) args(0) else ""
    try {
      val file = fromFile(fileName)
      /* In file tekst.txt is 4 lines */
      println(s"In file $fileName is ${countLines(file)} lines")
      /* In file tekst.txt is 0 lines */
      println(s"In file $fileName is ${countLines(file)} lines")
      file.close
    }
    catch {
      case e: FileNotFoundException => println(s"File $fileName not found")
      case _: Throwable => println("Other error")
    }
  }
  def countLines(file: BufferedSource): Long = {
    file.getLines.count(_ => true)
  }
}
val means that you cannot assign a new value to it. If it refers to something immutable - a number, an immutable collection, a tuple or a case class of other immutable things - then your value will not change over its lifetime: if it is a val inside a function, it will stay the same from the moment you assign it until you leave that function; if it is a value in a class, it will stay the same across all calls on that instance; if it is in an object, it will stay the same over the whole life of the program.
But if you are talking about objects which are mutable on their own, then the only immutable part is the reference to the object. If you have a val of mutable.MutableList, you cannot swap it for another mutable.MutableList, but you can modify the contents of the list. Here:
val file = fromFile(fileName)
/* In file tekst.txt is 4 lines */
println(s"In file $fileName is ${countLines(file)} lines")
/* In file tekst.txt is 0 lines */
println(s"In file $fileName is ${countLines(file)} lines")
file.close
file is an immutable reference to a BufferedSource. You cannot replace it with another BufferedSource - but this class has internal state: it tracks how many lines of the file it has already read, so the first time you operate on it you receive the total number of lines in the file, and after that (since the file has already been read) 0.
If you wanted that code to be purer, you should contain the mutability so that it is not observable to the user, e.g.
def countFileLines(fileName: String): Either[String, Long] = try {
  val file = fromFile(fileName)
  try {
    Right(file.getLines.count(_ => true))
  } finally {
    file.close()
  }
} catch {
  case e: FileNotFoundException => Left(s"File $fileName not found")
  case _: Throwable => Left("Other error")
}
println(s"In file $fileName is ${countLines(fileName)} lines")
println(s"In file $fileName is ${countLines(fileName)} lines")
Still, you have side effects there, so ideally it would be written using an IO monad, but for now remember that you should aim for referential transparency: if you could replace each call to countLines(file) with the value from val counted = countLines(file) without changing behavior, it would be RT. As you checked, it isn't. So replace it with something whose behavior would not change if it was called twice. A way to do that is to run the whole computation each time, without any global state preserved between runs (e.g. the internal counter in BufferedSource). IO monads make that easier, so go after them once you feel comfortable with the syntax itself (to avoid learning too many things at once).
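A minimal sketch of that direction (not a real IO monad, just a hypothetical helper built on the countFileLines above): defer the whole computation behind a function, so every call re-runs it from scratch and no mutable state is shared between runs.
def countLinesProgram(fileName: String): () => Either[String, Long] =
  () => countFileLines(fileName) // nothing is read until the returned thunk is invoked

val program = countLinesProgram("tekst.txt")
program() // Right(4)
program() // Right(4) - the same result on every run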
file.getLines returns an Iterator[String], and an iterator is consumable, meaning we can iterate over it only once. For example, consider
val it = Iterator("a", "b", "c")
it.count(_ => true)
// val res0: Int = 3
it.count(_ => true)
// val res1: Int = 0
Looking at the implementation of count
def count(p: A => Boolean): Int = {
  var res = 0
  val it = iterator
  while (it.hasNext) if (p(it.next())) res += 1
  res
}
notice the call to it.next(). This call advances the state of the iterator and if it happens then we cannot go back to previous state.
As an alternative you could try length instead of count
val it = Iterator("a", "b", "c")
it.length
// val res0: Int = 3
it.length
// val res0: Int = 3
Looking at the definition of length which just delegates to size
def size: Int = {
  if (knownSize >= 0) knownSize
  else {
    val it = iterator
    var len = 0
    while (it.hasNext) { len += 1; it.next() }
    len
  }
}
notice the guard
if (knownSize >= 0) knownSize
Some collections know their size without having to compute it by iterating over them. For example,
Array(1,2,3).knownSize // 3: I know my size in advance
List(1,2,3).knownSize // -1: I do not know my size in advance so I have to traverse the whole collection to find it
So if the underlying concrete collection of the Iterator knows its size, then the call to length will short-circuit and it.next() will never execute, which means the iterator will not be consumed. This is the case for the default concrete collection used by the Iterator factory, which is Array:
val it = Iterator("a", "b", "c")
it.getClass
// res6: Class[_ <: Iterator[String]] = class scala.collection.ArrayOps$ArrayIterator
however it is not true for BufferedSource. To work around the issue, consider creating a new iterator each time countLines is called:
def countLines(fileName: String): Long = {
  fromFile(fileName).getLines().length
}
println(s"In file $fileName is ${countLines(fileName)} lines")
println(s"In file $fileName is ${countLines(fileName)} lines")
// In file build.sbt is 22 lines
// In file build.sbt is 22 lines
Final point regarding value definitions and immutability. Consider
object Foo { var x = 42 } // object contains mutable state
val foo = Foo // value definition
foo.x
// val res0: Int = 42
Foo.x = -11 // mutation happening here
foo.x
// val res1: Int = -11
Here the identifier foo is an immutable reference to a mutable object.
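By contrast, a val pointing at a truly immutable value cannot be observed to change through any alias. A small sketch (the Point case class is made up for illustration):
case class Point(x: Int, y: Int) // immutable: no vars inside
val p = Point(1, 2)
val q = p     // another immutable reference to the same immutable value
// p.x = 3    // does not compile: there is no mutable state to modify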

Use of Scala Loan pattern in Success Case

I'm following the tutorial from Alvin Alexander on using the Loan Pattern.
Here is the code that I use:
val year = 2016
val nationalData = {
val source = io.Source.fromFile(s"resources/Babynames/names/yob$year.txt")
// names is iterator of String, split() gives the array
//.toArray & toSeq is a slow process compare to .toSet // .toSeq gives Stream Closed error
val names = source.getLines().filter(_.nonEmpty).map(_.split(",")(0)).toSet
source.close()
names
// println(names.mkString(","))
}
println("Names " + nationalData)
val info = for (stateFile <- new java.io.File("resources/Babynames/namesbystate").list(); if stateFile.endsWith(".TXT")) yield {
val source = io.Source.fromFile("resources/Babynames/namesbystate/" + stateFile)
val names = source.getLines().filter(_.nonEmpty).map(_.split(",")).
filter(a => a(2).toInt == year).map(a => a(3)).toArray // .toSet
source.close()
(stateFile.take(2), names)
}
println(info(0)._2.size + " names from state "+ info(0)._1)
println(info(1)._2.size + " names from state "+ info(1)._1)
for ((state, sname) <- info) {
println("State: " +state + " Coverage of name in "+ year+" "+ sname.count(n => nationalData.contains(n)).toDouble / nationalData.size) // Set doesn't have length method
}
This is how I applied readTextFile and readTextFileWithTry to the above code, to learn/experiment with the Loan Pattern:
def using[A <: { def close(): Unit }, B](resource: A)(f: A => B): B =
  try {
    f(resource)
  } finally {
    resource.close()
  }
def readTextFile(filename: String): Option[List[String]] = {
  try {
    val lines = using(fromFile(filename)) { source =>
      (for (line <- source.getLines) yield line).toList
    }
    Some(lines)
  } catch {
    case e: Exception => None
  }
}
def readTextFileWithTry(filename: String): Try[List[String]] = {
  Try {
    val lines = using(fromFile(filename)) { source =>
      (for (line <- source.getLines) yield line).toList
    }
    lines
  }
}
val year = 2016
val data = readTextFile(s"resources/Babynames/names/yob$year.txt") match {
case Some(lines) =>
val n = lines.filter(_.nonEmpty).map(_.split(",")(0)).toSet
println(n)
case None => println("couldn't read file")
}
val data1 = readTextFileWithTry("resources/Babynames/namesbystate")
data1 match {
case Success(lines) => {
val info = for (stateFile <- data1; if stateFile.endsWith(".TXT")) yield {
val source = fromFile("resources/Babynames/namesbystate/" + stateFile)
val names = source.getLines().filter(_.nonEmpty).map(_.split(",")).
filter(a => a(2).toInt == year).map(a => a(3)).toArray // .toSet
(stateFile.take(2), names)
println(names)
}
}
But in the second case, readTextFileWithTry, I am getting the following error:
Failed, message is: java.io.FileNotFoundException: resources\Babynames\namesbystate (Access is denied)
From what I understand from SO, I guess the reason for the failure is:
I am trying to open the same file on each iteration of the for loop
Apart from that, I have a few concerns regarding how I use this:
Is this a good way to use it? Can someone help me with how to use Try on multiple occasions?
I tried to change the return type of readTextFileWithTry to Option[A], or a Set/Map or another Scala collection, so that I could apply higher-order functions on it later, but I was not able to make it work. I am not sure whether that is good practice or not.
How can I use higher-order functions in the Success case, given that there are multiple operations and the Success block keeps growing? I can't use any field outside of the Success case.
Can someone help me to understand?
I think that your problem has nothing to do with "I am trying to open the same file on each iteration of the for loop" and it is actually the same as in the accepted answer.
Unfortunately you didn't provide a stack trace, so it is not clear on which line this happens. I would guess that the failing call is
val data1 = readTextFileWithTry("resources/Babynames/namesbystate")
And looking at your first code sample:
val info = for (stateFile <- new java.io.File("resources/Babynames/namesbystate").list(); if stateFile.endsWith(".TXT")) yield {
it looks like the path "resources/Babynames/namesbystate" points to a directory. But in your second example you are trying to read it as a file, and this is the reason for the error. It comes from the fact that your readTextFileWithTry is not a valid substitute for the java.io.File.list call. And File.list doesn't need a wrapper, because it doesn't use any intermediate closeable/disposable entity.
P.S. it might make more sense to use File.list(FilenameFilter filter) instead of the if stateFile.endsWith(".TXT") filter.
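A minimal sketch of that suggestion (assuming the same directory layout as in the question): list the directory with a FilenameFilter, and apply readTextFileWithTry to each file rather than to the directory itself.
import java.io.{File, FilenameFilter}

val dir = new File("resources/Babynames/namesbystate")
val txtFiles: Array[String] = dir.list(new FilenameFilter {
  def accept(d: File, name: String): Boolean = name.endsWith(".TXT")
})

// one Try per state file, so a single failure does not hide the others
val perState = txtFiles.map { stateFile =>
  stateFile.take(2) -> readTextFileWithTry(dir.getPath + "/" + stateFile)
}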

Convert into a For Comprehension in Scala

Testing this I can see that it works:
def twoHtmlFutures = Action { request =>
val async1 = as1.index(embed = true)(request) // Future[Result]
val async2 = as2.index(embed = true)(request) // Future[Result]
val async1Html = async1.flatMap(x => Pagelet.readBody(x)) // Future[Html]
val async2Html = async2.flatMap(x => Pagelet.readBody(x)) // Future[Html]
val source1 = Source.fromFuture(async1Html) // Source[Html, NotUsed]
val source2 = Source.fromFuture(async2Html) // Source[Html, NotUsed]
val merged = source1.merge(source2) // Source[Html, NotUsed]
Ok.chunked(merged)
}
But trying to put it into a For Comprehension is not working for me. This is what I tried:
def twoHtmlFutures2 = Action.async { request =>
val async1 = as1.index(embed = true)(request)
val async2 = as2.index(embed = true)(request)
for {
async1Res <- async1 // from Future[Result] to Result
async2Res <- async2 // from Future[Result] to Result
async1Html <- Pagelet.readBody(async1Res) // from Result to Html
async2Html <- Pagelet.readBody(async2Res) // from Result to Html
} yield {
val source1 = single(async1Html) // from Html to Source[Html, NotUsed]
val source2 = single(async2Html) // from Html to Source[Html, NotUsed]
val merged = source1.merge(source2) // Source[Html, NotUsed]
Ok.chunked(merged)
}
}
But this just jumps on-screen all at once rather than arriving at different times (streamed) as the first example does. Any helpers out there to widen my eyelids? Thanks
Monads are a sequencing shape and Futures models this as causal dependence (first-this-future-completes-then-that-future-completes):
val x = Future(something).map(_ => somethingElse) or Future(something).flatMap(_ => Future(somethingElse))
However, there's a little trick one can do in for comprehensions:
def twoHtmlFutures = Action { request =>
Ok.chunked(
Source.fromFutureSource(
for { _ <- Future.unit // For Scala version <= 2.11 use Future.successful(())
async1 = as1.index(embed = true)(request) // Future[Result]
async2 = as2.index(embed = true)(request) // Future[Result]
async1Html = async1.flatMap(x => Pagelet.readBody(x)) // Future[Html]
async2Html = async2.flatMap(x => Pagelet.readBody(x)) // Future[Html]
source1 = Source.fromFuture(async1Html) // Source[Html, NotUsed]
source2 = Source.fromFuture(async2Html) // Source[Html, NotUsed]
} yield source1.merge(source2) // Source[Html, NotUsed]
)
)
}
I describe this technique in greater detail in this blogpost.
An alternate solution to your problem could be:
def twoHtmlFutures = Action { request =>
Ok.chunked(
Source.fromFuture(as1.index(embed = true)(request)).merge(Source.fromFuture(as2.index(embed = true)(request))).mapAsyncUnordered(2)(b => Pagelet.readBody(b))
)
}
A for comprehension and flatMap (which is what it desugars to) are used to sequence things.
In the context of Future, this means that in a for comprehension each statement is started only once the previous one has successfully completed.
In your case, you want the two Futures to run in parallel. This is not what flatMap (or a for comprehension) does.
What your code does is the following:
do the first index call
when that's over, do the second index call
when that's over, do the first readBody
when that's over, do the second readBody
when that's over create two (synchronous) sources with the values from the two previous steps, merge them, and start returning the merged source as chunked response.
What your previous code did was
do the first index call
when that's over, do the first readBody
in the meantime, do the same for the second index and readBody
in the meantime, create a source that will output an element when the first readBody yields a result
in the meantime, do the same for the second
merge these two sources, and start all at once to give the merged output as a chunked response.
So, with your original code, you start your chunked response just after receiving the request (with nothing inside yet, waiting for the Futures to be resolved), while with the for comprehension version you wait at each computation for the previous one to be over, even if you don't need its result to go on.
What you should remember is that you should use flatMap on a Future only if you need the result of a previous computation, or if you want another computation to be over before doing something else. The same goes for the for comprehension, which is just a nice-looking way of chaining flatMaps.
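A minimal sketch of the usual idiom (fa and fb are hypothetical, already-constructed Futures): create the Futures before the for comprehension so they are already running concurrently, and only sequence the awaiting of their results.
import scala.concurrent.{ExecutionContext, Future}
import ExecutionContext.Implicits.global

def sum(fa: Future[Int], fb: Future[Int]): Future[Int] =
  for {
    a <- fa // fa and fb were created by the caller, so they are already running in parallel
    b <- fb // this only waits for fb, it does not start it
  } yield a + b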
Here's what your proposed for comprehension looks like after it's been desugared (courtesy of IntelliJ's "Desugar Scala code ..." menu option):
async1.flatMap((async1Res: Nothing) =>
async2.flatMap((async2Res: Nothing) =>
Pagelet.readBody(async1Res).flatMap((async1Html: Nothing) =>
Pagelet.readBody(async2Res).map((async2Html: Nothing) =>
Ok.chunked(merged)))))
As you can see, the nesting, and the concluding flatMap/map pair, are very different from your original code plan.
As a general rule, every <- in a single for comprehension is turned into a flatMap() except for the final one, which is a map(), and each is nested inside the previous.
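For instance, a two-generator comprehension over hypothetical futures fa and fb desugars as follows:
for { a <- fa; b <- fb } yield a + b
// is equivalent to
fa.flatMap(a => fb.map(b => a + b))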

Scala functional way of processing large scale data with lazy collections

I am trying to figure out memory-efficient and functional ways to process large amounts of string data in Scala. I have read many things about lazy collections and have seen quite a few code examples. However, I run into "GC overhead limit exceeded" or "Java heap space" issues again and again.
Often the problem is that I try to construct a lazy collection, but evaluate each new element when I append it to the growing collection (I don't know any other way to do so incrementally). Of course, I could try something like initializing a lazy collection first and then yielding the collection holding the desired values by applying the resource-critical computations with map or so, but often I simply do not know the exact size of the final collection a priori, which I would need to initialize that lazy collection.
Maybe you could help me by giving me hints or explanations on how to improve the following example code, which splits a FASTA (definition below) formatted file into two separate files according to the rule that odd sequence pairs belong to one file and even ones to another ("separation of strands"). The "most" straightforward way to do so would be imperative, looping through the lines and printing into the corresponding files via open file streams (and this of course works excellently). However, I just don't enjoy the style of reassigning to variables holding the header and sequences, so the following example code uses (tail-)recursion, and I would appreciate finding a way to maintain a similar design without running into resource problems!
The example works perfectly for small files, but already with files at around ~500 MB the code fails with the standard JVM setup. I do want to process files of "arbitrary" size, say 10-20 GB or so.
val fileName = args(0)
val in = io.Source.fromFile(fileName) getLines
type itType = Iterator[String]
type sType = Stream[(String, String)]
def getFullSeqs(ite: itType) = {
//val metaChar = ">"
val HeadPatt = "(^>)(.+)" r
val SeqPatt = "([\\w\\W]+)" r
@annotation.tailrec
def rec(it: itType, out: sType = Stream[(String, String)]()): sType =
if (it hasNext) it next match {
case HeadPatt(_,header) =>
// introduce new header-sequence pair
rec(it, (header, "") #:: out)
case SeqPatt(seq) =>
val oldVal = out head
// concat subsequences
val newStream = (oldVal._1, oldVal._2 + seq) #:: out.tail
rec(it, newStream)
case _ =>
println("something went wrong my friend, oh oh oh!"); Stream[(String, String)]()
} else out
rec(ite)
}
def printStrands(seqs: sType) {
import java.io.PrintWriter
import java.io.File
def printStrand(seqse: sType, strand: Int) {
// only use sequences of one strand
val indices = List.tabulate(seqs.size/2)(_*2 + strand - 1).view
val p = new PrintWriter(new File(fileName + "." + strand))
indices foreach { i =>
p.print(">" + seqse(i)._1 + "\n" + seqse(i)._2 + "\n")
}; p.close
println("Done bro!")
}
List(1,2).par foreach (s => printStrand(seqs, s))
}
printStrands(getFullSeqs(in))
Three questions arise for me:
A) Let's assume one needs to maintain a large data structure obtained by processing the initial iterator you get from getLines, like in my getFullSeqs method (note the different size of in and the output of getFullSeqs), because transformations on the whole(!) data are required repeatedly, since one does not know which part of the data one will require at any step. My example might not be the best, but how would one do that? Is it possible at all?
B) What when the desired data structure is not inherently lazy, say one would like to store the (header -> sequence) pairs into a Map()? Would you wrap it in a lazy collection?
C) My implementation of constructing the stream might reverse the order of the inputted lines. When calling reverse, all elements will be evaluated (in my code, they already are, so this is the actual problem). Is there any way to post-process "from behind" in a lazy fashion? I know of reverseIterator, but is this already the solution, or will this not actually evaluate all elements first, too (as I would need to call it on a list)? One could construct the stream with newVal #:: rec(...), but I would lose tail-recursion then, wouldn't I?
So what I basically need is to add elements to a collection, which are not evaluated by the process of adding. So lazy val elem = "test"; elem :: lazyCollection is not what I am looking for.
EDIT: I have also tried using a by-name parameter for the stream argument in rec.
Thank you so much for your attention and time, I really appreciate any help (again :) ).
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
FASTA is defined as a sequential set of sequences delimited by a single header line. A header is defined as a line starting with ">". Every line below the header is called part of the sequence associated with the header. A sequence ends when a new header is present. Every header is unique. Example:
>HEADER1
abcdefg
>HEADER2
hijklmn
opqrstu
>HEADER3
vwxyz
>HEADER4
zyxwv
Thus, sequence 2 is twice as big as seq 1. My program would split that file into a file A containing
>HEADER1
abcdefg
>HEADER3
vwxyz
and a second file B containing
>HEADER2
hijklmn
opqrstu
>HEADER4
zyxwv
The input file is assumed to consist of an even number of header-sequence pairs.
The key to working with really large data structures is to hold in memory only that which is critical to perform whatever operation you need. So, in your case, that's
Your input file
Your two output files
The current line of text
and that's it. In some cases you may need to store extra information such as how long a sequence is; in that case, you build the data structures in a first pass and use them on a second pass. Let's suppose, for example, that you decide that you want to write three files: one for even records, one for odd, and one for entries where the total length is less than 300 nucleotides. You would do something like this (warning--it compiles but I never ran it, so it may not actually work):
final def findSizes(
data: Iterator[String], sz: Map[String,Long] = Map(),
currentName: String = "", currentSize: Long = 0
): Map[String,Long] = {
def currentMap = if (currentName != "") sz + (currentName->currentSize) else sz
if (!data.hasNext) currentMap
else {
val s = data.next
if (s(0) == '>') findSizes(data, currentMap, s, 0)
else findSizes(data, sz, currentName, currentSize + s.length)
}
}
Then, for processing, you use that map and pass through again:
import java.io._
final def writeFiles(
source: Iterator[String], targets: Array[PrintWriter],
sizes: Map[String,Long], count: Int = -1, which: Int = 0
) {
if (!source.hasNext) targets.foreach(_.close)
else {
val s = source.next
if (s(0) == '>') {
val w = if (sizes.get(s).exists(_ < 300)) 2 else (count+1)%2
targets(w).println(s)
writeFiles(source, targets, sizes, count+1, w)
}
else {
targets(which).println(s)
writeFiles(source, targets, sizes, count, which)
}
}
}
You then use Source.fromFile(f).getLines() twice to create your iterators, and you're all set. Edit: in some sense this is the key step, because this is your "lazy" collection. It's important not just because it doesn't read everything into memory immediately ("lazy"), but because it doesn't hold on to any previous strings either!
More generally, Scala can't save you from thinking carefully about what information you need to have in memory and what you can fetch off disk as needed. Lazy evaluation can sometimes help, but there's no magic formula, because you can easily express the requirement to have all your data in memory in a lazy way. Scala can't interpret your commands to access memory as, secretly, instructions to fetch stuff off the disk instead. (Well, not unless you write a library to cache results from disk which does exactly that.)
One could construct the stream with newVal #:: rec(...), but I would
lose tail-recursion then, wouldn't I?
Actually, no.
So, here's the thing... with your present tail recursion, you fill ALL of the Stream with values. Yes, Stream is lazy, but you are computing all of the elements, stripping it of any laziness.
Now say you do newVal #:: rec(...). Would you lose tail recursion? No. Why? Because you are not recursing. How come? Well, Stream is lazy, so it won't evaluate rec(...).
And that's the beauty of it. Once you do it that way, getFullSeqs returns immediately, and the "recursion" is only computed when printStrands asks for it. Unfortunately, that won't work as is...
The problem is that you are constantly modifying the Stream -- that's not how you use a Stream. With Stream, you always append to it. Don't keep "rewriting" the Stream.
Now, there are three other problems I could readily identify with printStrands. First, it calls size on seqs, which will cause the whole Stream to be processed, losing laziness. Never call size on a Stream. Second, you call apply on seqse, accessing it by index. Never call apply on a Stream (or List) -- that's highly inefficient. It's O(n), which makes your inner loop O(n^2) -- yes, quadratic in the number of headers in the input file! Finally, printStrands keeps a reference to seqs throughout the execution of printStrand, preventing processed elements from being garbage collected.
So, here's a first approximation:
def inputStreams(fileName: String): (Stream[String], Stream[String]) = {
val in = (io.Source fromFile fileName).getLines.toStream
val SeqPatt = "^[^>]".r
def demultiplex(s: Stream[String], skip: Boolean): Stream[String] = {
if (s.isEmpty) Stream.empty
else if (skip) demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = false)
else s.head #:: (s.tail takeWhile (SeqPatt findFirstIn _ nonEmpty)) #::: demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = true)
}
(demultiplex(in, skip = false), demultiplex(in, skip = true))
}
The problem with the above, and I'm showing that code just to further illustrate the issues of laziness, is that the instant you do this:
val (a, b) = inputStreams(fileName)
You'll keep a reference to the head of both streams, which prevents garbage collecting them. You can't keep a reference to them, so you have to consume them as soon as you get them, without ever storing them in a "val" or "lazy val". A "var" might do, but it would be tricky to handle. So let's try this instead:
def inputStreams(fileName: String): Vector[Stream[String]] = {
val in = (io.Source fromFile fileName).getLines.toStream
val SeqPatt = "^[^>]".r
def demultiplex(s: Stream[String], skip: Boolean): Stream[String] = {
if (s.isEmpty) Stream.empty
else if (skip) demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = false)
else s.head #:: (s.tail takeWhile (SeqPatt findFirstIn _ nonEmpty)) #::: demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = true)
}
Vector(demultiplex(in, skip = false), demultiplex(in, skip = true))
}
inputStreams(fileName).zipWithIndex.par.foreach {
case (stream, strand) =>
val p = new PrintWriter(new File("FASTA" + "." + strand))
stream foreach p.println
p.close
}
That still doesn't work, because stream holds a reference to the head of each stream, keeping the whole stream in memory even while it is being printed.
So, having failed again, what do I recommend? Keep it simple.
def in = (scala.io.Source fromFile fileName).getLines.toStream
def inputStream(in: Stream[String], strand: Int = 1): Stream[(String, Int)] = {
if (in.isEmpty) Stream.empty
else if (in.head startsWith ">") (in.head, 1 - strand) #:: inputStream(in.tail, 1 - strand)
else (in.head, strand) #:: inputStream(in.tail, strand)
}
val printers = Array.tabulate(2)(i => new PrintWriter(new File("FASTA" + "." + i)))
inputStream(in) foreach {
case (line, strand) => printers(strand) println line
}
printers foreach (_.close)
Now this won't keep anymore in memory than necessary. I still think it's too complex, however. This can be done more easily like this:
def in = (scala.io.Source fromFile fileName).getLines
val printers = Array.tabulate(2)(i => new PrintWriter(new File("FASTA" + "." + i)))
def printStrands(in: Iterator[String], strand: Int = 1) {
if (in.hasNext) {
val next = in.next
if (next startsWith ">") {
printers(1 - strand).println(next)
printStrands(in, 1 - strand)
} else {
printers(strand).println(next)
printStrands(in, strand)
}
}
}
printStrands(in)
printers foreach (_.close)
Or just use a while loop instead of recursion.
Now, to the other questions:
B) It might make sense to do so while reading it, so that you do not have to keep two copies of the data: the Map and a Seq.
C) Don't reverse a Stream -- you'll lose all of its laziness.
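A tiny illustration of (C), using the pre-2.13 Stream from the code above: reverse has to reach the last element before it can emit the first one, so every element gets forced.
val s = Stream.from(1)    // infinite, lazy
s.take(5).reverse.toList  // List(5, 4, 3, 2, 1): only five elements are forced
// s.reverse              // would never terminate: it must traverse the whole stream first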

Modifying a large file in Scala

I am trying to modify a large PostScript file in Scala (some are as large as 1GB in size). The file is a group of batches, with each batch containing a code that represents the batch number, number of pages, etc.
I need to:
Search the file for the batch codes (which always start with the same line in the file)
Count the number of pages until the next batch code
Modify the batch code to include how many pages are in each batch.
Save the new file in a different location.
My current solution uses two iterators (iterA and iterB), created from Source.fromFile("file.ps").getLines. The first iterator (iterA) traverses in a while loop to the beginning of a batch code (with iterB.next being called each time as well). iterB then continues searching until the next batch code (or the end of the file), counting the number of pages it passes as it goes. Then, it updates the batch code at iterA's position, and the process repeats.
This seems very non-Scala-like and I still haven't designed a good way to save these changes into a new file.
What is a good approach to this problem? Should I ditch iterators entirely? I'd preferably like to do it without having to have the entire input or output into memory at once.
Thanks!
You could probably implement this with Scala's Stream class. I am assuming that you don't mind holding one "batch" in memory at a time.
import scala.annotation.tailrec
import scala.io._
def isBatchLine(line:String):Boolean = ...
def batchLine(size: Int):String = ...
val it = Source.fromFile("in.ps").getLines
// cannot use it.toStream here because of SI-4835
def inLines = Stream.continually(it).takeWhile(_.hasNext).map(_.next)
// Note: using `def` instead of `val` here means we don't hold
// the entire stream in memory
def batchedLinesFrom(stream: Stream[String]):Stream[String] = {
val (batch, remainder) = stream span { !isBatchLine(_) }
if (batch.isEmpty && remainder.isEmpty) {
Stream.empty
} else {
batchLine(batch.size) #:: batch #::: batchedLinesFrom(remainder.drop(1))
}
}
def newLines = batchedLinesFrom(inLines dropWhile isBatchLine)
val ps = new java.io.PrintStream(new java.io.File("out.ps"))
newLines foreach ps.println
ps.close()
If you are not in pursuit of functional Scala enlightenment, I'd recommend a more imperative style using java.util.Scanner#findWithinHorizon. My example is quite naive, iterating through the input twice.
val scanner = new Scanner(inFile)
val writer = new BufferedWriter(...)
def loop(): Unit = {
// you might want to limit the horizon to prevent OutOfMemoryError
Option(scanner.findWithinHorizon(".*YOUR-BATCH-MARKER", 0)) match {
case Some(batch) =>
val pageCount = countPages(batch)
writePageCount(writer, pageCount)
writer.write(batch)
loop()
case None =>
}
}
loop()
scanner.close()
writer.close()
Maybe you can use span and duplicate effectively. Assuming the iterator is positioned at the start of a batch, you take the span before the next batch, duplicate it so that you can count the pages, write the modified batch line, then write the pages using the duplicated iterator. Then process the next batch recursively...
def batch(i: Iterator[String]) {
if (i.hasNext) {
assert(i.next() == "batch")
val (current, next) = i.span(_ != "batch")
val (forCounting, forWriting) = current.duplicate
val count = forCounting.filter(_ == "p").size
println("batch " + count)
forWriting.foreach(println)
batch(next)
}
}
Assuming the following input:
val src = Source.fromString("head\nbatch\np\np\nbatch\np\nbatch\np\np\np\n")
You position the iterator at the start of a batch and then you process the batches:
val (head, next) = src.getLines.span(_ != "batch")
head.foreach(println)
batch(next)
This prints:
head
batch 2
p
p
batch 1
p
batch 3
p
p
p