Scala - Iterate Over Two Arrays - scala

How do you iterate over two arrays of the same size, accessing the same index each iteration The Scala Way™?
for ((aListItem, bListItem) <- (aList, bList)) {
// do something with items
}
The Java way applied to Scala:
for(i <- 0 until aList.length ) {
aList(i)
bList(i)
}
Assume both lists are the same size.

tl;dr: There are trade-offs between speed and convenience; you need to know your use case to pick appropriately.
If you know both arrays are the same length and you don't need to worry how fast it is, the easiest and most canonical is to use zip inside a for-comprehension:
for ((a,b) <- aList zip bList) { ??? }
The zip method creates a new single array, however. To avoid that overhead you can use zipped on a tuple which will present the elements in pairs to methods like foreach and map:
(aList, bList).zipped.foreach{ (a,b) => ??? }
Faster still is to index into the arrays, especially if the arrays contain primitives like Int, since the generic code above has to box them. There is a handy method indices that you can use:
for (i <- aList.indices) { ??? }
Finally, if you need to go as fast as you possibly can, you can fall back to manual while loops or recursion, like so:
// While loop
var i = 0
while (i < aList.length) {
???
i += 1
}
// Recursion
def loop(i: Int) {
if (i < aList.length) {
???
loop(i+1)
}
}
loop(0)
If you are computing some value, rather than having it be a side effect, it's sometimes faster with recursion if you pass it along:
// Recursion with explicit result
def loop(i: Int, acc: Int = 0): Int =
if (i < aList.length) {
val nextAcc = ???
loop(i+1, nextAcc)
}
else acc
Since you can drop a method definition in anywhere, you can use recursion without restriction. You can add an #annotation.tailrec annotation to make sure it can be compiled down to a fast loop with jumps instead of actual recursion that eats stack space.
Taking all these different approaches to calculate a dot product on length 1024 vectors, we can compare these to a reference implementation in Java:
public class DotProd {
public static int dot(int[] a, int[] b) {
int s = 0;
for (int i = 0; i < a.length; i++) s += a[i]*b[i];
return s;
}
}
plus an equivalent version where we take the dot product of the lengths of strings (so we can assess objects vs. primitives)
normalized time
-----------------
primitive object method
--------- ------ ---------------------------------
100% 100% Java indexed for loop (reference)
100% 100% Scala while loop
100% 100% Scala recursion (either way)
185% 135% Scala for comprehension on indices
2100% 130% Scala zipped
3700% 800% Scala zip
This is particularly bad, of course, with primitives! (You get similarly huge jumps in time taken if you try to use ArrayLists of Integer instead of Array of int in Java.) Note in particular that zipped is quite a reasonable choice if you have objects stored.
Do beware of premature optimization, though! There are advantages to in clarity and safety to functional forms like zip. If you always write while loops because you think "every little bit helps", you're probably making a mistake because it takes more time to write and debug, and you could be using that time optimizing some more important part of your program.
But, assuming your arrays are the same length is dangerous. Are you sure? How much effort will you make to be sure? Maybe you shouldn't make that assumption?
If you don't need it to be fast, just correct, then you have to choose what to do if the two arrays are not the same length.
If you want to do something with all the elements up to the length of the shorter, then zip is still what you use:
// The second is just shorthand for the first
(aList zip bList).foreach{ case (a,b) => ??? }
for ((a,b) <- (aList zip bList)) { ??? }
// This avoids an intermediate array
(aList, bList).zipped.foreach{ (a,b) => ??? }
If you instead want to pad the shorter one with a default value, you would
aList.zipAll(bList, aDefault, bDefault).foreach{ case (a,b) => ??? }
for ((a,b) <- aList.zipAll(bList, aDefault, bDefault)) { ??? }
In any of these cases, you can use yield with for or map instead of foreach to generate a collection.
If you need the index for a calculation or it really is an array and you really need it to be fast, you will have to do the calculation manually. Padding missing elements is awkward (I leave that as an exercise to the reader), but the basic form would be:
for (i <- 0 until math.min(aList.length, bList.length)) { ??? }
where you then use i to index into aList and bList.
If you really need maximum speed you would again use (tail) recursion or while loops:
val n = math.min(aList.length, bList.length)
var i = 0
while (i < n) {
???
i += 1
}
def loop(i: Int) {
if (i < aList.length && i < bList.length) {
???
loop(i+1)
}
}
loop(0)

Something like:
for ((aListItem, bListItem) <- (aList zip bList)) {
// do something with items
}
Or with map like:
(aList zip bList).map{ case (alistItem, blistItem) => // do something }
Updated:
For iterating without creating an intermediates, you can try:
for (i <- 0 until xs.length) ... //xs(i) & ys(i) to access element
or simply
for (i <- xs.indices) ...

I would do something like this:
aList.indices foreach { i =>
val (aListItem, bListItem) = (aList(i), bList(i))
// do something with items
}

for {
i <- 0 until Math.min(list1.size, list2.size)
} yield list1(i) + list2(i)
Or something like that which checks bounds etc.

Related

Scala: what is the interest in using Iterators?

I have used Iterators after have worked with Regexes in Scala but I don't really understand the interest.
I know that it has a state and if I call the next() method on it, it will output a different result every time, but I don't see anything I can do with it and that is not possible with an Iterable.
And it doesn't seem to work as Akka Streams (for example) since the following example directly prints all the numbers (without waiting one second as I would expect it):
lazy val a = Iterator({Thread.sleep(1000); 1}, {Thread.sleep(1000); 2}, {Thread.sleep(1000); 3})
while(a.hasNext){ println(a.next()) }
So what is the purpose of using Iterators?
Perhaps, the most useful property of iterators is that they are lazy.
Consider something like this:
(1 to 10000)
.map { x => x * x }
.map { _.toString }
.find { _ == "4" }
This snippet will square 10000 numbers, then generate 10000 strings, and then return the second one.
This on the other hand:
(1 to 10000)
.iterator
.map { x => x * x }
.map { _.toString }
.find { _ == "4" }
... only computes two squares, and generates two strings.
Iterators are also often useful when you need to wrap around some poorly designed (java?) objects in order to be able to handle them in functional style:
val rs: ResultSet = jdbcQuery.executeQuery()
new Iterator {
def next = rs
def hasNext = rs.next
}.map { rs =>
fetchData(rs)
}
Streams are similar to iterators - they are also lazy, and also useful for wrapping:
Stream.continually(rs).takeWhile { _.next }.map(fetchData)
The main difference though is that streams remember the data that gets materialized, so that you can traverse them more than once. This is convenient, but may be costly if the original amount of data is very large, especially, if it gets filtered down to much smaller size:
Source
.fromFile("huge_file.txt")
.getLines
.filter(_ == "")
.toList
This only uses, roughly (ignoring buffering, object overhead, and other implementation specific details), the amount of memory, necessary to keep one line in memory, plus however many empty lines there are in the file.
This on the other hand:
val reader = new FileReader("huge_file.txt")
Stream
.continually(reader.readLine)
.takeWhile(_ != null)
.filter(_ == "")
.toList
... will end up with the entire content of the huge_file.txt in memory.
Finally, if I understand the intent of your example correctly, here is how you could do it with iterators:
val iterator = Seq(1,2,3).iterator.map { n => Thread.sleep(1000); n }
iterator.foreach(println)
// Or while(iterator.hasNext) { println(iterator.next) } as you had it.
There is a good explanation of what iterator is http://www.scala-lang.org/docu/files/collections-api/collections_43.html
An iterator is not a collection, but rather a way to access the
elements of a collection one by one. The two basic operations on an
iterator it are next and hasNext. A call to it.next() will return the
next element of the iterator and advance the state of the iterator.
Calling next again on the same iterator will then yield the element
one beyond the one returned previously. If there are no more elements
to return, a call to next will throw a NoSuchElementException.
First of all you should understand what is wrong with your example:
lazy val a = Iterator({Thread.sleep(1); 1}, {Thread.sleep(1); 2},
{Thread.sleep(2); 3}) while(a.hasNext){ println(a.next()) }
if you look at the apply method of Iterator, you'll see there are no calls by name,so all Thread.sleep are calling at the same time when apply method calls. Also Thread.sleep takes parameter of time to sleep in milliseconds, so if you want to sleep your thread on one second you should pass Thread.sleep(1000).
The companion object has additional methods which allow you do the next:
val a = Iterator.iterate(1)(x => {Thread.sleep(1000); x+1})
Iterator is very useful when you need to work with large data. Also you can implement your own:
val it = new Iterator[Int] {
var i = -1
def hasNext = true
def next(): Int = { i += 1; i }
}
I don't see anything I can do with it and that is not possible with an Iterable
In fact, what most collection can do can also be done with Array, but we don't do that because it's much less convenient
So same reason apply to iterator, if you want to model a mutable state, then iterator makes more sense.
For example, Random is implemented in a way resemble to iterator because it's use case fit more naturally in iterator, rather than iterable.

Why doesn't my recursive function return the max value of a List

I have the following recursive function in Scala that should return the maximum size integer in the List. Is anyone able to tell me why the largest value is not returned?
def max(xs: List[Int]): Int = {
var largest = xs.head
println("largest: " + largest)
if (!xs.tail.isEmpty) {
var next = xs.tail.head
println("next: " + next)
largest = if (largest > next) largest else next
var remaining = List[Int]()
remaining = largest :: xs.tail.tail
println("remaining: " + remaining)
max(remaining)
}
return largest
}
Print out statements show me that I've successfully managed to bring back the largest value in the List as the head (which was what I wanted) but the function still returns back the original head in the list. I'm guessing this is because the reference for xs is still referring to the original xs list, problem is I can't override that because it's a val.
Any ideas what I'm doing wrong?
You should use the return value of the inner call to max and compare that to the local largest value.
Something like the following (removed println just for readability):
def max(xs: List[Int]): Int = {
var largest = xs.head
if (!xs.tail.isEmpty) {
var remaining = List[Int]()
remaining = largest :: xs.tail
var next = max(remaining)
largest = if (largest > next) largest else next
}
return largest
}
Bye.
I have an answer to your question but first...
This is the most minimal recursive implementation of max I've ever been able to think up:
def max(xs: List[Int]): Option[Int] = xs match {
case Nil => None
case List(x: Int) => Some(x)
case x :: y :: rest => max( (if (x > y) x else y) :: rest )
}
(OK, my original version was ever so slightly more minimal but I wrote that in Scheme which doesn't have Option or type safety etc.) It doesn't need an accumulator or a local helper function because it compares the first two items on the list and discards the smaller, a process which - performed recursively - inevitably leaves you with a list of just one element which must be bigger than all the rest.
OK, why your original solution doesn't work... It's quite simple: you do nothing with the return value from the recursive call to max. All you had to do was change the line
max(remaining)
to
largest = max(remaining)
and your function would work. It wouldn't be the prettiest solution, but it would work. As it is, your code looks as if it assumes that changing the value of largest inside the recursive call will change it in the outside context from which it was called. But each new call to max creates a completely new version of largest which only exists inside that new iteration of the function. Your code then throws away the return value from max(remaining) and returns the original value of largest, which hasn't changed.
Another way to solve this would have been to use a local (inner) function after declaring var largest. That would have looked like this:
def max(xs: List[Int]): Int = {
var largest = xs.head
def loop(ys: List[Int]) {
if (!ys.isEmpty) {
var next = ys.head
largest = if (largest > next) largest else next
loop(ys.tail)
}
}
loop(xs.tail)
return largest
}
Generally, though, it is better to have recursive functions be entirely self-contained (that is, not to look at or change external variables but only at their input) and to return a meaningful value.
When writing a recursive solution of this kind, it often helps to think in reverse. Think first about what things are going to look like when you get to the end of the list. What is the exit condition? What will things look like and where will I find the value to return?
If you do this, then the case which you use to exit the recursive function (by returning a simple value rather than making another recursive call) is usually very simple. The other case matches just need to deal with a) invalid input and b) what to do if you are not yet at the end. a) is usually simple and b) can usually be broken down into just a few different situations, each with a simple thing to do before making another recursive call.
If you look at my solution, you'll see that the first case deals with invalid input, the second is my exit condition and the third is "what to do if we're not at the end".
In many other recursive solutions, Nil is the natural end of the recursion.
This is the point at which I (as always) recommend reading The Little Schemer. It teaches you recursion (and basic Scheme) at the same time (both of which are very good things to learn).
It has been pointed out that Scala has some powerful functions which can help you avoid recursion (or hide the messy details of it), but to use them well you really do need to understand how recursion works.
The following is a typical way to solve this sort of problem. It uses an inner tail-recursive function that includes an extra "accumulator" value, which in this case will hold the largest value found so far:
def max(xs: List[Int]): Int = {
def go(xs: List[Int], acc: Int): Int = xs match {
case Nil => acc // We've emptied the list, so just return the final result
case x :: rest => if (acc > x) go(rest, acc) else go(rest, x) // Keep going, with remaining list and updated largest-value-so-far
}
go(xs, Int.MinValue)
}
Nevermind I've resolved the issue...
I finally came up with:
def max(xs: List[Int]): Int = {
var largest = 0
var remaining = List[Int]()
if (!xs.isEmpty) {
largest = xs.head
if (!xs.tail.isEmpty) {
var next = xs.tail.head
largest = if (largest > next) largest else next
remaining = largest :: xs.tail.tail
}
}
if (!remaining.tail.isEmpty) max(remaining) else xs.head
}
Kinda glad we have loops - this is an excessively complicated solution and hard to get your head around in my opinion. I resolved the problem by making sure the recursive call was the last statement in the function either that or xs.head is returned as the result if there isn't a second member in the array.
The most concise but also clear version I have ever seen is this:
def max(xs: List[Int]): Int = {
def maxIter(a: Int, xs: List[Int]): Int = {
if (xs.isEmpty) a
else a max maxIter(xs.head, xs.tail)
}
maxIter(xs.head, xs.tail)
}
This has been adapted from the solutions to a homework on the Scala official Corusera course: https://github.com/purlin/Coursera-Scala/blob/master/src/example/Lists.scala
but here I use the rich operator max to return the largest of its two operands. This saves having to redefine this function within the def max block.
What about this?
def max(xs: List[Int]): Int = {
maxRecursive(xs, 0)
}
def maxRecursive(xs: List[Int], max: Int): Int = {
if(xs.head > max && ! xs.isEmpty)
maxRecursive(xs.tail, xs.head)
else
max
}
What about this one ?
def max(xs: List[Int]): Int = {
var largest = xs.head
if( !xs.tail.isEmpty ) {
if(xs.head < max(xs.tail)) largest = max(xs.tail)
}
largest
}
My answer is using recursion is,
def max(xs: List[Int]): Int =
xs match {
case Nil => throw new NoSuchElementException("empty list is not allowed")
case head :: Nil => head
case head :: tail =>
if (head >= tail.head)
if (tail.length > 1)
max(head :: tail.tail)
else
head
else
max(tail)
}
}

Scala actor kills itself inconsistently

I am a newbie to scala and I am writing scala code to implement pastry protocol. The protocol itself does not matter. There are nodes and each node has a routing table which I want to populate.
Here is the part of the code:
def act () {
def getMatchingNode (initialMatch :String) : Int = {
val len = initialMatch.length
for (i <- 0 to noOfNodes-1) {
var flag : Int = 1
for (j <- 0 to len-1) {
if (list(i).key.charAt(j) == initialMatch(j)) {
continue
}
else {
flag = 0
}
}
if (flag == 1) {
return i
}
}
return -1
}
// iterate over rows
for (ii <- 0 to rows - 1) {
for (jj <- 0 to 15) {
var initialMatch = ""
for (k <- 0 to ii-1) {
initialMatch = initialMatch + key.charAt(k)
}
initialMatch += jj
println("initialMatch",initialMatch)
if (getMatchingNode(initialMatch) != -1) {
Routing(0)(jj) = list(getMatchingNode(initialMatch)).key
}
else {
Routing(0)(jj) = "NULL"
}
}
}
}// act
The problem is when the function call to getMatchingNode takes place then the actor dies suddenly by itself. 'list' is the list of all nodes. (list of node objects)
Also this behaviour is not consistent. The call to getMatchingNode should take place 15 times for each actor (for 10 nodes).
But while debugging the actor kills itself in the getMatchingNode function call after one call or sometimes after 3-4 calls.
The scala library code which gets executed is this :
def run() {
try {
beginExecution()
try {
if (fun eq null)
handler(msg)
else
fun()
} catch {
case _: KillActorControl =>
// do nothing
case e: Exception if reactor.exceptionHandler.isDefinedAt(e) =>
reactor.exceptionHandler(e)
}
reactor.kill()
}
Eclipse shows that this code has been called from the for loop in the getMatchingNode function
def getMatchingNode (initialMatch :String) : Int = {
val len = initialMatch.length
for (i <- 0 to noOfNodes-1)
The strange thing is that sometimes the loop behaves normally and sometimes it goes to the scala code which kills the actor.
Any inputs what wrong with the code??
Any help would be appreciated.
Got the error..
The 'continue' clause in the for loop caused the trouble.
I thought we could use continue in Scala as we do in C++/Java but it does not seem so.
Removing the continue solved the issue.
From the book: "Programming in Scala 2ed" by M.Odersky
You may have noticed that there has been no mention of break or continue.
Scala leaves out these commands because they do not mesh well with function
literals, a feature described in the next chapter. It is clear what continue
means inside a while loop, but what would it mean inside a function literal?
While Scala supports both imperative and functional styles of programming,
in this case it leans slightly towards functional programming in exchange
for simplifying the language. Do not worry, though. There are many ways to
program without break and continue, and if you take advantage of function
literals, those alternatives can often be shorter than the original code.
I really suggest reading the book if you want to learn scala
Your code is based on tons of nested for loops, which can be more often than not be rewritten using the Higher Order Functions available on the most appropriate Collection.
You can rewrite you function like the following [I'm trying to make it approachable for newcomers]:
//works if "list" contains "nodes" with an attribute "node.key: String"
def getMatchingNode (initialMatch :String) : Int = {
//a new list with the corresponding keys
val nodeKeys = list.map(node => node.key)
//zips each key (creates a pair) with the corresponding index in the list and then find a possible match
val matchOption: Option[(String, Int)] = (nodeKeys.zipWithIndex) find {case (key, index) => key == initialMatch}
//we convert an eventual result contained in the Option, with the right projection of the pair (which contains the index)
val idxOption = matchOption map {case (key, index) => index} //now we have an Option[Int] with a possible index
//returns the content of option if it's full (Some) or a default value of "-1" if there was no match (None). See Option[T] for more details
idxOption.getOrElse(-1)
}
The potential to easily transform or operate on the Collection's elements is what makes continues, and for loops in general, less used in Scala
You can convert the row iteration in a similar way, but I would suggest that if you need to work a lot with the collection's indexes, you want to use an IndexedSeq or one of its implementations, like ArrayBuffer.

Scala functional way of processing large scala data with lazy collections

I am trying to figure out memory-efficient AND functional ways to process a large scale of data using strings in scala. I have read many things about lazy collections and have seen quite a bit of code examples. However, I run into "GC overhead exceeded" or "Java heap space" issues again and again.
Often the problem is that I try to construct a lazy collection, but evaluate each new element when I append it to the growing collection (I don't now any other way to do so incrementally). Of course, I could try something like initializing an initial lazy collection first and and yield the collection holding the desired values by applying the ressource-critical computations with map or so, but often I just simply do not know the exact size of the final collection a priori to initial that lazy collection.
Maybe you could help me by giving me hints or explanations on how to improve following code as an example, which splits a FASTA (definition below) formatted file into two separate files according to the rule that odd sequence pairs belong to one file and even ones to aother one ("separation of strands"). The "most" straight-forward way to do so would be in a imperative way by looping through the lines and printing into the corresponding files via open file streams (and this of course works excellent). However, I just don't enjoy the style of reassigning to variables holding header and sequences, thus the following example code uses (tail-)recursion, and I would appreciate to have found a way to maintain a similar design without running into ressource problems!
The example works perfectly for small files, but already with files at around ~500mb the code will fail with the standard JVM setups. I do want to process files of "arbitray" size, say 10-20gb or so.
val fileName = args(0)
val in = io.Source.fromFile(fileName) getLines
type itType = Iterator[String]
type sType = Stream[(String, String)]
def getFullSeqs(ite: itType) = {
//val metaChar = ">"
val HeadPatt = "(^>)(.+)" r
val SeqPatt = "([\\w\\W]+)" r
#annotation.tailrec
def rec(it: itType, out: sType = Stream[(String, String)]()): sType =
if (it hasNext) it next match {
case HeadPatt(_,header) =>
// introduce new header-sequence pair
rec(it, (header, "") #:: out)
case SeqPatt(seq) =>
val oldVal = out head
// concat subsequences
val newStream = (oldVal._1, oldVal._2 + seq) #:: out.tail
rec(it, newStream)
case _ =>
println("something went wrong my friend, oh oh oh!"); Stream[(String, String)]()
} else out
rec(ite)
}
def printStrands(seqs: sType) {
import java.io.PrintWriter
import java.io.File
def printStrand(seqse: sType, strand: Int) {
// only use sequences of one strand
val indices = List.tabulate(seqs.size/2)(_*2 + strand - 1).view
val p = new PrintWriter(new File(fileName + "." + strand))
indices foreach { i =>
p.print(">" + seqse(i)._1 + "\n" + seqse(i)._2 + "\n")
}; p.close
println("Done bro!")
}
List(1,2).par foreach (s => printStrand(seqs, s))
}
printStrands(getFullSeqs(in))
Three questions arise for me:
A) Let's assume one needs to maintain a large data structure obtained by processing the initial iterator you get from getLines like in my getFullSeqs method (note the different size of in and the output of getFullSeqs), because transformations on the whole(!) data is required repeatedly, because one does not know which part of the data one will require at any step. My example might not be the best, but how to do so? Is it possible at all??
B) What when the desired data structure is not inherently lazy, say one would like to store the (header -> sequence) pairs into a Map()? Would you wrap it in a lazy collection?
C) My implementation of constructing the stream might reverse the order of the inputted lines. When calling reverse, all elements will be evaluated (in my code, they already are, so this is the actual problem). Is there any way to post-process "from behind" in a lazy fashion? I know of reverseIterator, but is this already the solution, or will this not actually evaluate all elements first, too (as I would need to call it on a list)? One could construct the stream with newVal #:: rec(...), but I would lose tail-recursion then, wouldn't I?
So what I basically need is to add elements to a collection, which are not evaluated by the process of adding. So lazy val elem = "test"; elem :: lazyCollection is not what I am looking for.
EDIT: I have also tried using by-name parameter for the stream argument in rec .
Thank you so much for your attention and time, I really appreciate any help (again :) ).
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
FASTA is defined as a sequential set of sequences delimited by a single header line. A header is defined as a line starting with ">". Every line below the header is called part of the sequence associated with the header. A sequence ends when a new header is present. Every header is unique. Example:
>HEADER1
abcdefg
>HEADER2
hijklmn
opqrstu
>HEADER3
vwxyz
>HEADER4
zyxwv
Thus, sequence 2 is twice as big as seq 1. My program would split that file into a file A containing
>HEADER1
abcdefg
>HEADER3
vwxyz
and a second file B containing
>HEADER2
hijklmn
opqrstu
>HEADER4
zyxwv
The input file is assumed to consist of an even number of header-sequence pairs.
The key to working with really large data structures is to hold in memory only that which is critical to perform whatever operation you need. So, in your case, that's
Your input file
Your two output files
The current line of text
and that's it. In some cases you can need to store information such as how long a sequence is; in such events, you build the data structures in a first pass and use them on a second pass. Let's suppose, for example, that you decide that you want to write three files: one for even records, one for odd, and one for entries where the total length is less than 300 nucleotides. You would do something like this (warning--it compiles but I never ran it, so it may not actually work):
final def findSizes(
data: Iterator[String], sz: Map[String,Long] = Map(),
currentName: String = "", currentSize: Long = 0
): Map[String,Long] = {
def currentMap = if (currentName != "") sz + (currentName->currentSize) else sz
if (!data.hasNext) currentMap
else {
val s = data.next
if (s(0) == '>') findSizes(data, currentMap, s, 0)
else findSizes(data, sz, currentName, currentSize + s.length)
}
}
Then, for processing, you use that map and pass through again:
import java.io._
final def writeFiles(
source: Iterator[String], targets: Array[PrintWriter],
sizes: Map[String,Long], count: Int = -1, which: Int = 0
) {
if (!source.hasNext) targets.foreach(_.close)
else {
val s = source.next
if (s(0) == '>') {
val w = if (sizes.get(s).exists(_ < 300)) 2 else (count+1)%2
targets(w).println(s)
writeFiles(source, targets, sizes, count+1, w)
}
else {
targets(which).println(s)
writeFiles(source, targets, sizes, count, which)
}
}
}
You then use Source.fromFile(f).getLines() twice to create your iterators, and you're all set. Edit: in some sense this is the key step, because this is your "lazy" collection. However, it's not important just because it doesn't read all memory in immediately ("lazy"), but because it doesn't store any previous strings either!
More generally, Scala can't help you that much from thinking carefully about what information you need to have in memory and what you can fetch off disk as needed. Lazy evaluation can sometimes help, but there's no magic formula because you can easily express the requirement to have all your data in memory in a lazy way. Scala can't interpret your commands to access memory as, secretly, instructions to fetch stuff off the disk instead. (Well, not unless you write a library to cache results from disk which does exactly that.)
One could construct the stream with newVal #:: rec(...), but I would
lose tail-recursion then, wouldn't I?
Actually, no.
So, here's the thing... with your present tail recursion, you fill ALL of the Stream with values. Yes, Stream is lazy, but you are computing all of the elements, stripping it of any laziness.
Now say you do newVal #:: rec(...). Would you lose tail recursion? No. Why? Because you are not recursing. How come? Well, Stream is lazy, so it won't evaluate rec(...).
And that's the beauty of it. Once you do it that way, getFullSeqs returns on the first interaction, and only compute the "recursion" when printStrands asks for it. Unfortunately, that won't work as is...
The problem is that you are constantly modifying the Stream -- that's not how you use a Stream. With Stream, you always append to it. Don't keep "rewriting" the Stream.
Now, there are three other problems I could readily identify with printStrands. First, it calls size on seqs, which will cause the whole Stream to be processed, losing lazyness. Never call size on a Stream. Second, you call apply on seqse, accessing it by index. Never call apply on a Stream (or List) -- that's highly inefficient. It's O(n), which makes your inner loop O(n^2) -- yes, quadratic on the number of headers in the input file! Finally, printStrands keeps a reference to seqs throughout the execution of printStrand, preventing processing elements from being garbage collected.
So, here's a first approximation:
def inputStreams(fileName: String): (Stream[String], Stream[String]) = {
val in = (io.Source fromFile fileName).getLines.toStream
val SeqPatt = "^[^>]".r
def demultiplex(s: Stream[String], skip: Boolean): Stream[String] = {
if (s.isEmpty) Stream.empty
else if (skip) demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = false)
else s.head #:: (s.tail takeWhile (SeqPatt findFirstIn _ nonEmpty)) #::: demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = true)
}
(demultiplex(in, skip = false), demultiplex(in, skip = true))
}
The problem with the above, and I'm showing that code just to further guide in the issues of lazyness, is that the instant you do this:
val (a, b) = inputStreams(fileName)
You'll keep a reference to the head of both streams, which prevents garbage collecting them. You can't keep a reference to them, so you have to consume them as soon as you get them, without ever storing them in a "val" or "lazy val". A "var" might do, but it would be tricky to handle. So let's try this instead:
def inputStreams(fileName: String): Vector[Stream[String]] = {
val in = (io.Source fromFile fileName).getLines.toStream
val SeqPatt = "^[^>]".r
def demultiplex(s: Stream[String], skip: Boolean): Stream[String] = {
if (s.isEmpty) Stream.empty
else if (skip) demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = false)
else s.head #:: (s.tail takeWhile (SeqPatt findFirstIn _ nonEmpty)) #::: demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = true)
}
Vector(demultiplex(in, skip = false), demultiplex(in, skip = true))
}
inputStreams(fileName).zipWithIndex.par.foreach {
case (stream, strand) =>
val p = new PrintWriter(new File("FASTA" + "." + strand))
stream foreach p.println
p.close
}
That still doesn't work, because stream inside inputStreams works as a reference, keeping the whole stream in memory even while they are printed.
So, having failed again, what do I recommend? Keep it simple.
def in = (scala.io.Source fromFile fileName).getLines.toStream
def inputStream(in: Stream[String], strand: Int = 1): Stream[(String, Int)] = {
if (in.isEmpty) Stream.empty
else if (in.head startsWith ">") (in.head, 1 - strand) #:: inputStream(in.tail, 1 - strand)
else (in.head, strand) #:: inputStream(in.tail, strand)
}
val printers = Array.tabulate(2)(i => new PrintWriter(new File("FASTA" + "." + i)))
inputStream(in) foreach {
case (line, strand) => printers(strand) println line
}
printers foreach (_.close)
Now this won't keep anymore in memory than necessary. I still think it's too complex, however. This can be done more easily like this:
def in = (scala.io.Source fromFile fileName).getLines
val printers = Array.tabulate(2)(i => new PrintWriter(new File("FASTA" + "." + i)))
def printStrands(in: Iterator[String], strand: Int = 1) {
if (in.hasNext) {
val next = in.next
if (next startsWith ">") {
printers(1 - strand).println(next)
printStrands(in, 1 - strand)
} else {
printers(strand).println(next)
printStrands(in, strand)
}
}
}
printStrands(in)
printers foreach (_.close)
Or just use a while loop instead of recursion.
Now, to the other questions:
B) It might make sense to do so while reading it, so that you do not have to keep two copies of the data: the Map and a Seq.
C) Don't reverse a Stream -- you'll lose all of its laziness.

Is there a way to handle the last case differently in a Scala for loop?

For example suppose I have
for (line <- myData) {
println("}, {")
}
Is there a way to get the last line to print
println("}")
Can you refactor your code to take advantage of built-in mkString?
scala> List(1, 2, 3).mkString("{", "}, {", "}")
res1: String = {1}, {2}, {3}
Before going any further, I'd recommend you avoid println in a for-comprehension. It can sometimes be useful for tracking down a bug that occurs in the middle of a collection, but otherwise leads to code that's harder to refactor and test.
More generally, life usually becomes easier if you can restrict where any sort of side-effect occurs. So instead of:
for (line <- myData) {
println("}, {")
}
You can write:
val lines = for (line <- myData) yield "}, {"
println(lines mkString "\n")
I'm also going to take a guess here that you wanted the content of each line in the output!
val lines = for (line <- myData) yield (line + "}, {")
println(lines mkString "\n")
Though you'd be better off still if you just used mkString directly - that's what it's for!
val lines = myData.mkString("{", "\n}, {", "}")
println(lines)
Note how we're first producing a String, then printing it in a single operation. This approach can easily be split into separate methods and used to implement toString on your class, or to inspect the generated String in tests.
I agree fully with what has been said before about using mkstring, and distinguishing the first iteration rather than the last one. Would you still need to distinguish on the last, scala collections have an init method, which return all elements but the last.
So you can do
for(x <- coll.init) workOnNonLast(x)
workOnLast(coll.last)
(init and last being sort of the opposite of head and tail, which are the first and and all but first). Note however than depending on the structure, they may be costly. On Vector, all of them are fast. On List, while head and tail are basically free, init and last are both linear in the length of the list. headOption and lastOption may help you when the collection may be empty, replacing workOnlast by
for (x <- coll.lastOption) workOnLast(x)
You may take the addString function of the TraversableOncetrait as an example.
def addString(b: StringBuilder, start: String, sep: String, end: String): StringBuilder = {
var first = true
b append start
for (x <- self) {
if (first) {
b append x
first = false
} else {
b append sep
b append x
}
}
b append end
b
}
In your case, the separator is }, { and the end is }
If you don't want to use built-in mkString function, you can make something like
for (line <- lines)
if (line == lines.last) println("last")
else println(line)
UPDATE: As didierd mentioned in comments, this solution is wrong because last value can occurs several times, he provides better solution in his answer.
It is fine for Vectors, because last function takes "effectively constant time" for them, as for Lists, it takes linear time, so you can use pattern matching
#tailrec
def printLines[A](l: List[A]) {
l match {
case Nil =>
case x :: Nil => println("last")
case x :: xs => println(x); printLines(xs)
}
}
Other answers are rightfully pointed to mkString, and for a normal amount of data I would also use that.
However, mkString builds (accumulates) the end-result in-memory through a StringBuilder. This is not always desirable, depending on the amount of data we have.
In this case, if all we want is to "print" we don't need to build the big-result first (and maybe we even want to avoid this).
Consider the implementation of this helper function:
def forEachIsLast[A](iterator: Iterator[A])(operation: (A, Boolean) => Unit): Unit = {
while(iterator.hasNext) {
val element = iterator.next()
val isLast = !iterator.hasNext // if there is no "next", this is the last one
operation(element, isLast)
}
}
It iterates over all elements and invokes operation passing each element in turn, with a boolean value. The value is true if the element passed is the last one.
In your case it could be used like this:
forEachIsLast(myData) { (line, isLast) =>
if(isLast)
println("}")
else
println("}, {")
}
We have the following advantages here:
It operates on each element, one by one, without necessarily accumulating the result in memory (unless you want to).
Because it does not need to load the whole collection into memory to check its size, it's enough to ask the Iterator if it's exhausted or not. You could read data from a big file, or from the network, etc.