Scala: Compare elements at same position in two arrays - scala

I'm in the process of learning Scala and am trying to write some sort of function that will compare one element in an list against an element in another list at the same index. I know that there has to be a more Scalatic way to do this than two write two for loops and keep track of the current index of each manually.
Let's say that we're comparing URLs, for example. Say that we have the following two Lists that are URLs split by the / character:
val incomingUrl = List("users", "profile", "12345")
and
val urlToCompare = List("users", "profile", ":id")
Say that I want to treat any element that begins with the : character as a match, but any element that does not begin with a : will not be a match.
What is the best and most Scalatic way to go about doing this comparison?
Coming from a OOP background, I would immediately jump to a for loop, but I know that there has to be a good FP way to go about it that will teach me a thing or two about Scala.
EDIT
For completion, I found this outdated question shortly after posting mine that relates to the problem.
EDIT 2
The implementation that I chose for this specific use case:
def doRoutesMatch(incomingURL: List[String], urlToCompare: List[String]): Boolean = {
// if the lengths don't match, return immediately
if (incomingURL.length != urlToCompare.length) return false
// merge the lists into a tuple
urlToCompare.zip(incomingURL)
// iterate over it
.foreach {
// get each path
case (existingPath, pathToCompare) =>
if (
// check if this is some value supplied to the url, such as `:id`
existingPath(0) != ':' &&
// if this isn't a placeholder for a value that the route needs, then check if the strings are equal
p2 != p1
)
// if neither matches, it doesn't match the existing route
return false
}
// return true if a `false` didn't get returned in the above foreach loop
true
}

You can use zip, that invoked on Seq[A] with Seq[B] results in Seq[(A, B)]. In other words it creates a sequence with tuples with elements of both sequences:
incomingUrl.zip(urlToCompare).map { case(incoming, pattern) => f(incoming, pattern) }

There is already another answer to the question, but I am adding another one since there is one corner case to watch out for. If you don't know the lengths of the two Lists, you need zipAll. Since zipAll needs a default value to insert if no corresponding element exists in the List, I am first wrapping every element in a Some, and then performing the zipAll.
object ZipAllTest extends App {
val incomingUrl = List("users", "profile", "12345", "extra")
val urlToCompare = List("users", "profile", ":id")
val list1 = incomingUrl.map(Some(_))
val list2 = urlToCompare.map(Some(_))
val zipped = list1.zipAll(list2, None, None)
println(zipped)
}
One thing that might bother you is that we are making several passes through the data. If that's something you are worried about, you can use lazy collections or else write a custom fold that makes only one pass over the data. That is probably overkill though. If someone wants to, they can add those alternative implementations in another answer.

Since the OP is curious to see how we would use lazy collections or a custom fold to do the same thing, I have included a separate answer with those implementations.
The first implementation uses lazy collections. Note that lazy collections have poor cache properties so that in practice, it often does does not make sense to use lazy collections as a micro-optimization. Although lazy collections will minimize the number of times you traverse the data, as has already been mentioned, the underlying data structure does not have good cache locality. To understand why lazy collections minimize the number of passes you make over the data, read chapter 5 of Functional Programming in Scala.
object LazyZipTest extends App{
val incomingUrl = List("users", "profile", "12345", "extra").view
val urlToCompare = List("users", "profile", ":id").view
val list1 = incomingUrl.map(Some(_))
val list2 = urlToCompare.map(Some(_))
val zipped = list1.zipAll(list2, None, None)
println(zipped)
}
The second implementation uses a custom fold to go over lists only one time. Since we are appending to the rear of our data structure, we want to use IndexedSeq, not List. You should rarely be using List anyway. Otherwise, if you are going to convert from List to IndexedSeq, you are actually making one additional pass over the data, in which case, you might as well not bother and just use the naive implementation I already wrote in the other answer.
Here is the custom fold.
object FoldTest extends App{
val incomingUrl = List("users", "profile", "12345", "extra").toIndexedSeq
val urlToCompare = List("users", "profile", ":id").toIndexedSeq
def onePassZip[T, U](l1: IndexedSeq[T], l2: IndexedSeq[U]): IndexedSeq[(Option[T], Option[U])] = {
val folded = l1.foldLeft((l2, IndexedSeq[(Option[T], Option[U])]())) { (acc, e) =>
acc._1 match {
case x +: xs => (xs, acc._2 :+ (Some(e), Some(x)))
case IndexedSeq() => (IndexedSeq(), acc._2 :+ (Some(e), None))
}
}
folded._2 ++ folded._1.map(x => (None, Some(x)))
}
println(onePassZip(incomingUrl, urlToCompare))
println(onePassZip(urlToCompare, incomingUrl))
}
If you have any questions, I can answer them in the comments section.

Related

How do I convert a List[Option[(A, List[B])]] to the Option[(A,List[B])]? (Basically retrieve Option[X] from List[Option[X]])

I have a List of “rules” tuples as follows:
val rules = List[(A, List[B])], where A and B are two separate case-classes
For the purposes of my use-case, I need to convert this to an Option[(A, List[B])]. The A case class contains a property id which is of type Option[String], based on which the tuple is returned.
I have written a function def findRule(entryId: Option[String]), from which I intend to return the tuple (A, List[B]) whose A.id = entryId as an Option. So far, I have written the following snippet of code:
def findRule(entryId: Option[String]) = {
for {
ruleId <- rules.flatMap(_._1.id) // ruleId is a String
id <- entryId // entryId is an Option[String]
} yield {
rules.find(_ => ruleId.equalsIgnoreCase(id)) // returns List[Option[(A, List[B])]
}
}
This snippet returns a List[Option[(A, List[B])] but I can’t figure out how to retrieve just the Option from it. Using .head() isn’t an option, since the rules list may be empty. Please help as I am new to Scala.
Example (the real examples are too large to post here, so please consider this representative example):
val rules = [(A = {id=1, ….}, B = [{B1}, {B2}, {B3}, …]), (A={id=2, ….}, B = [{B10}, {B11}, {B12}, …]), …. ] (B is not really important here, I need to find the tuple based on the id element of case-class A)
Now, suppose entryId = Some(1)
After the findRule() function, this would currently look like:
[Some((A = {id=1, …}, B = [{B1}, {B2}, {B3}, …]))]
I want to return:

Some((A = {id=1, …}, B = [{B1}, {B2}, {B3}, …])) , ie, the Option within the List returned (currently) from findRule()
From your question, it sounds like you're trying to pick a single item from your list based on some conditional, which means you'll probably want to start with rules.find. The problem then becomes how to express the predicate function that you pass to find.
From my read of your question, the conditional is
The id on the A part of the tuple needs to match the entryId that was passed in elsewhere
def findRule(entryId: Option[String]) =
rules.find { case (a, listOfB) => entryId.contains(a.id) }
The case expression is just nice syntax for dealing with the tuple. I could have also done
rules.find { tup => entryId.contains(tup._1.id) }
The contains method on Option roughly does "if I'm a Some, see if my value equals the argument; if I'm a None, just return false and ignore the argument". It works as a comparison between the Option[String] you have for entryId and the plain String you have for A's id.
Your first attempt didn't work because rules.flatMap got you a List, and using that after the <- in the for-comprehension means another flatMap, which keeps things as a List.
A variant that I personally prefer is
def findRule(entryId: Option[String]): Option[(A, List[B])] = {
entryId.flatMap { id =>
rules.find { case (a, _) => a.id == id }
}
}
In the case where entryId is None, it just immediately returns None. If you start with rules.find, then it will iterate through all of the rules and check each one even when entryId is None. It's unlikely to make any observable performance difference if the list of rules is small, but I also find it more intuitive to understand.

Scala: Find and update one element in a list

I am trying to find an elegant way to do:
val l = List(1,2,3)
val (item, idx) = l.zipWithIndex.find(predicate)
val updatedItem = updating(item)
l.update(idx, updatedItem)
Can I do all in one operation ? Find the item, if it exist replace with updated value and keep it in place.
I could do:
l.map{ i =>
if (predicate(i)) {
updating(i)
} else {
i
}
}
but that's pretty ugly.
The other complexity is the fact that I want to update only the first element which match predicate .
Edit: Attempt:
implicit class UpdateList[A](l: List[A]) {
def filterMap(p: A => Boolean)(update: A => A): List[A] = {
l.map(a => if (p(a)) update(a) else a)
}
def updateFirst(p: A => Boolean)(update: A => A): List[A] = {
val found = l.zipWithIndex.find { case (item, _) => p(item) }
found match {
case Some((item, idx)) => l.updated(idx, update(item))
case None => l
}
}
}
I don't know any way to make this in one pass of the collection without using a mutable variable. With two passes you can do it using foldLeft as in:
def updateFirst[A](list:List[A])(predicate:A => Boolean, newValue:A):List[A] = {
list.foldLeft((List.empty[A], predicate))((acc, it) => {acc match {
case (nl,pr) => if (pr(it)) (newValue::nl, _ => false) else (it::nl, pr)
}})._1.reverse
}
The idea is that foldLeft allows passing additional data through iteration. In this particular implementation I change the predicate to the fixed one that always returns false. Unfortunately you can't build a List from the head in an efficient way so this requires another pass for reverse.
I believe it is obvious how to do it using a combination of map and var
Note: performance of the List.map is the same as of a single pass over the list only because internally the standard library is mutable. Particularly the cons class :: is declared as
final case class ::[B](override val head: B, private[scala] var tl: List[B]) extends List[B] {
so tl is actually a var and this is exploited by the map implementation to build a list from the head in an efficient way. The field is private[scala] so you can't use the same trick from outside of the standard library. Unfortunately I don't see any other API call that allows to use this feature to reduce the complexity of your problem to a single pass.
You can avoid .zipWithIndex() by using .indexWhere().
To improve complexity, use Vector so that l(idx) becomes effectively constant time.
val l = Vector(1,2,3)
val idx = l.indexWhere(predicate)
val updatedItem = updating(l(idx))
l.updated(idx, updatedItem)
Reason for using scala.collection.immutable.Vector rather than List:
Scala's List is a linked list, which means data are access in O(n) time. Scala's Vector is indexed, meaning data can be read from any point in effectively constant time.
You may also consider mutable collections if you're modifying just one element in a very large collection.
https://docs.scala-lang.org/overviews/collections/performance-characteristics.html

How to collect paginated resuls in scala

I am trying to collect paginated results by trying to do the following logic in Scala and failed pathetically:
def python_version():
cursor
books, cursor = fetch_results()
while (cursor!=null) {
new_books = fetch_results(cursor)
books = books + new_books
}
return books
def fetch_results(cursor=None):
#do some fetchings...
return books, next_cursor
Here is an alternative solution using a recursive function, which avoids mutable values:
def fetchResults(c: Option[Cursor]=None): (List[Book], Option[Cursor]) = ...
def fetchAllResults(): List[Book] = {
#tailrec
def loop(cursor: Option[Cursor], res: List[Book]): List[Book] = {
val (books, newCursor) = fetchResults(cursor)
val newBooks = res ::: books
newCursor match {
case Some(_) =>
loop(newCursor, newBooks)
case None =>
newBooks
}
}
loop(None, Nil)
}
This is a fairly standard pattern for recursive functions in Scala where the actual recursion is done in an internal function. The result of the previous iteration is passed down to the next iteration and then returned from the function. This means that loop is a tail-recursive function and can be optimised by the compiler into a while loop. (The #tailrec annotation tells the compiler to warn if this is not actually tail-recursive)
Something like this, perhaps:
Iterator.iterate(fetch_results()) {
case (_, None) => (Nil, None)
case (books, cursor) => fetch_results(cursor)
}.map(_._1).takeWhile(_.nonEmpty).flatten.toList`
.iterate takes the first parameter to be the initial element of the iterator, and the second one is a function, that, given previous element, computes the next one.
So, this creates an iterator of tuples (Seq[Book], Cursor), starting with the initial return of fetch_results, and then keeps fetching more results, and accumulating them, until the nextCursor is None (I used None instead of null, because nulls are evil, and shouldn't really be used in a normal language, like scala, that provides enough facilities to avoid them).
Then .map(_._1) discards the cursors (don't need them any more), so we now have an iterator of pages, .takeWhile truncates the iterator at the first
page that is empty, then .flatten concatenates all inner Seqs together, and finally toList materializes all the elements, and returns the entire list of books.

Scala: what is the interest in using Iterators?

I have used Iterators after have worked with Regexes in Scala but I don't really understand the interest.
I know that it has a state and if I call the next() method on it, it will output a different result every time, but I don't see anything I can do with it and that is not possible with an Iterable.
And it doesn't seem to work as Akka Streams (for example) since the following example directly prints all the numbers (without waiting one second as I would expect it):
lazy val a = Iterator({Thread.sleep(1000); 1}, {Thread.sleep(1000); 2}, {Thread.sleep(1000); 3})
while(a.hasNext){ println(a.next()) }
So what is the purpose of using Iterators?
Perhaps, the most useful property of iterators is that they are lazy.
Consider something like this:
(1 to 10000)
.map { x => x * x }
.map { _.toString }
.find { _ == "4" }
This snippet will square 10000 numbers, then generate 10000 strings, and then return the second one.
This on the other hand:
(1 to 10000)
.iterator
.map { x => x * x }
.map { _.toString }
.find { _ == "4" }
... only computes two squares, and generates two strings.
Iterators are also often useful when you need to wrap around some poorly designed (java?) objects in order to be able to handle them in functional style:
val rs: ResultSet = jdbcQuery.executeQuery()
new Iterator {
def next = rs
def hasNext = rs.next
}.map { rs =>
fetchData(rs)
}
Streams are similar to iterators - they are also lazy, and also useful for wrapping:
Stream.continually(rs).takeWhile { _.next }.map(fetchData)
The main difference though is that streams remember the data that gets materialized, so that you can traverse them more than once. This is convenient, but may be costly if the original amount of data is very large, especially, if it gets filtered down to much smaller size:
Source
.fromFile("huge_file.txt")
.getLines
.filter(_ == "")
.toList
This only uses, roughly (ignoring buffering, object overhead, and other implementation specific details), the amount of memory, necessary to keep one line in memory, plus however many empty lines there are in the file.
This on the other hand:
val reader = new FileReader("huge_file.txt")
Stream
.continually(reader.readLine)
.takeWhile(_ != null)
.filter(_ == "")
.toList
... will end up with the entire content of the huge_file.txt in memory.
Finally, if I understand the intent of your example correctly, here is how you could do it with iterators:
val iterator = Seq(1,2,3).iterator.map { n => Thread.sleep(1000); n }
iterator.foreach(println)
// Or while(iterator.hasNext) { println(iterator.next) } as you had it.
There is a good explanation of what iterator is http://www.scala-lang.org/docu/files/collections-api/collections_43.html
An iterator is not a collection, but rather a way to access the
elements of a collection one by one. The two basic operations on an
iterator it are next and hasNext. A call to it.next() will return the
next element of the iterator and advance the state of the iterator.
Calling next again on the same iterator will then yield the element
one beyond the one returned previously. If there are no more elements
to return, a call to next will throw a NoSuchElementException.
First of all you should understand what is wrong with your example:
lazy val a = Iterator({Thread.sleep(1); 1}, {Thread.sleep(1); 2},
{Thread.sleep(2); 3}) while(a.hasNext){ println(a.next()) }
if you look at the apply method of Iterator, you'll see there are no calls by name,so all Thread.sleep are calling at the same time when apply method calls. Also Thread.sleep takes parameter of time to sleep in milliseconds, so if you want to sleep your thread on one second you should pass Thread.sleep(1000).
The companion object has additional methods which allow you do the next:
val a = Iterator.iterate(1)(x => {Thread.sleep(1000); x+1})
Iterator is very useful when you need to work with large data. Also you can implement your own:
val it = new Iterator[Int] {
var i = -1
def hasNext = true
def next(): Int = { i += 1; i }
}
I don't see anything I can do with it and that is not possible with an Iterable
In fact, what most collection can do can also be done with Array, but we don't do that because it's much less convenient
So same reason apply to iterator, if you want to model a mutable state, then iterator makes more sense.
For example, Random is implemented in a way resemble to iterator because it's use case fit more naturally in iterator, rather than iterable.

Scala functional way of processing large scala data with lazy collections

I am trying to figure out memory-efficient AND functional ways to process a large scale of data using strings in scala. I have read many things about lazy collections and have seen quite a bit of code examples. However, I run into "GC overhead exceeded" or "Java heap space" issues again and again.
Often the problem is that I try to construct a lazy collection, but evaluate each new element when I append it to the growing collection (I don't now any other way to do so incrementally). Of course, I could try something like initializing an initial lazy collection first and and yield the collection holding the desired values by applying the ressource-critical computations with map or so, but often I just simply do not know the exact size of the final collection a priori to initial that lazy collection.
Maybe you could help me by giving me hints or explanations on how to improve following code as an example, which splits a FASTA (definition below) formatted file into two separate files according to the rule that odd sequence pairs belong to one file and even ones to aother one ("separation of strands"). The "most" straight-forward way to do so would be in a imperative way by looping through the lines and printing into the corresponding files via open file streams (and this of course works excellent). However, I just don't enjoy the style of reassigning to variables holding header and sequences, thus the following example code uses (tail-)recursion, and I would appreciate to have found a way to maintain a similar design without running into ressource problems!
The example works perfectly for small files, but already with files at around ~500mb the code will fail with the standard JVM setups. I do want to process files of "arbitray" size, say 10-20gb or so.
val fileName = args(0)
val in = io.Source.fromFile(fileName) getLines
type itType = Iterator[String]
type sType = Stream[(String, String)]
def getFullSeqs(ite: itType) = {
//val metaChar = ">"
val HeadPatt = "(^>)(.+)" r
val SeqPatt = "([\\w\\W]+)" r
#annotation.tailrec
def rec(it: itType, out: sType = Stream[(String, String)]()): sType =
if (it hasNext) it next match {
case HeadPatt(_,header) =>
// introduce new header-sequence pair
rec(it, (header, "") #:: out)
case SeqPatt(seq) =>
val oldVal = out head
// concat subsequences
val newStream = (oldVal._1, oldVal._2 + seq) #:: out.tail
rec(it, newStream)
case _ =>
println("something went wrong my friend, oh oh oh!"); Stream[(String, String)]()
} else out
rec(ite)
}
def printStrands(seqs: sType) {
import java.io.PrintWriter
import java.io.File
def printStrand(seqse: sType, strand: Int) {
// only use sequences of one strand
val indices = List.tabulate(seqs.size/2)(_*2 + strand - 1).view
val p = new PrintWriter(new File(fileName + "." + strand))
indices foreach { i =>
p.print(">" + seqse(i)._1 + "\n" + seqse(i)._2 + "\n")
}; p.close
println("Done bro!")
}
List(1,2).par foreach (s => printStrand(seqs, s))
}
printStrands(getFullSeqs(in))
Three questions arise for me:
A) Let's assume one needs to maintain a large data structure obtained by processing the initial iterator you get from getLines like in my getFullSeqs method (note the different size of in and the output of getFullSeqs), because transformations on the whole(!) data is required repeatedly, because one does not know which part of the data one will require at any step. My example might not be the best, but how to do so? Is it possible at all??
B) What when the desired data structure is not inherently lazy, say one would like to store the (header -> sequence) pairs into a Map()? Would you wrap it in a lazy collection?
C) My implementation of constructing the stream might reverse the order of the inputted lines. When calling reverse, all elements will be evaluated (in my code, they already are, so this is the actual problem). Is there any way to post-process "from behind" in a lazy fashion? I know of reverseIterator, but is this already the solution, or will this not actually evaluate all elements first, too (as I would need to call it on a list)? One could construct the stream with newVal #:: rec(...), but I would lose tail-recursion then, wouldn't I?
So what I basically need is to add elements to a collection, which are not evaluated by the process of adding. So lazy val elem = "test"; elem :: lazyCollection is not what I am looking for.
EDIT: I have also tried using by-name parameter for the stream argument in rec .
Thank you so much for your attention and time, I really appreciate any help (again :) ).
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
FASTA is defined as a sequential set of sequences delimited by a single header line. A header is defined as a line starting with ">". Every line below the header is called part of the sequence associated with the header. A sequence ends when a new header is present. Every header is unique. Example:
>HEADER1
abcdefg
>HEADER2
hijklmn
opqrstu
>HEADER3
vwxyz
>HEADER4
zyxwv
Thus, sequence 2 is twice as big as seq 1. My program would split that file into a file A containing
>HEADER1
abcdefg
>HEADER3
vwxyz
and a second file B containing
>HEADER2
hijklmn
opqrstu
>HEADER4
zyxwv
The input file is assumed to consist of an even number of header-sequence pairs.
The key to working with really large data structures is to hold in memory only that which is critical to perform whatever operation you need. So, in your case, that's
Your input file
Your two output files
The current line of text
and that's it. In some cases you can need to store information such as how long a sequence is; in such events, you build the data structures in a first pass and use them on a second pass. Let's suppose, for example, that you decide that you want to write three files: one for even records, one for odd, and one for entries where the total length is less than 300 nucleotides. You would do something like this (warning--it compiles but I never ran it, so it may not actually work):
final def findSizes(
data: Iterator[String], sz: Map[String,Long] = Map(),
currentName: String = "", currentSize: Long = 0
): Map[String,Long] = {
def currentMap = if (currentName != "") sz + (currentName->currentSize) else sz
if (!data.hasNext) currentMap
else {
val s = data.next
if (s(0) == '>') findSizes(data, currentMap, s, 0)
else findSizes(data, sz, currentName, currentSize + s.length)
}
}
Then, for processing, you use that map and pass through again:
import java.io._
final def writeFiles(
source: Iterator[String], targets: Array[PrintWriter],
sizes: Map[String,Long], count: Int = -1, which: Int = 0
) {
if (!source.hasNext) targets.foreach(_.close)
else {
val s = source.next
if (s(0) == '>') {
val w = if (sizes.get(s).exists(_ < 300)) 2 else (count+1)%2
targets(w).println(s)
writeFiles(source, targets, sizes, count+1, w)
}
else {
targets(which).println(s)
writeFiles(source, targets, sizes, count, which)
}
}
}
You then use Source.fromFile(f).getLines() twice to create your iterators, and you're all set. Edit: in some sense this is the key step, because this is your "lazy" collection. However, it's not important just because it doesn't read all memory in immediately ("lazy"), but because it doesn't store any previous strings either!
More generally, Scala can't help you that much from thinking carefully about what information you need to have in memory and what you can fetch off disk as needed. Lazy evaluation can sometimes help, but there's no magic formula because you can easily express the requirement to have all your data in memory in a lazy way. Scala can't interpret your commands to access memory as, secretly, instructions to fetch stuff off the disk instead. (Well, not unless you write a library to cache results from disk which does exactly that.)
One could construct the stream with newVal #:: rec(...), but I would
lose tail-recursion then, wouldn't I?
Actually, no.
So, here's the thing... with your present tail recursion, you fill ALL of the Stream with values. Yes, Stream is lazy, but you are computing all of the elements, stripping it of any laziness.
Now say you do newVal #:: rec(...). Would you lose tail recursion? No. Why? Because you are not recursing. How come? Well, Stream is lazy, so it won't evaluate rec(...).
And that's the beauty of it. Once you do it that way, getFullSeqs returns on the first interaction, and only compute the "recursion" when printStrands asks for it. Unfortunately, that won't work as is...
The problem is that you are constantly modifying the Stream -- that's not how you use a Stream. With Stream, you always append to it. Don't keep "rewriting" the Stream.
Now, there are three other problems I could readily identify with printStrands. First, it calls size on seqs, which will cause the whole Stream to be processed, losing lazyness. Never call size on a Stream. Second, you call apply on seqse, accessing it by index. Never call apply on a Stream (or List) -- that's highly inefficient. It's O(n), which makes your inner loop O(n^2) -- yes, quadratic on the number of headers in the input file! Finally, printStrands keeps a reference to seqs throughout the execution of printStrand, preventing processing elements from being garbage collected.
So, here's a first approximation:
def inputStreams(fileName: String): (Stream[String], Stream[String]) = {
val in = (io.Source fromFile fileName).getLines.toStream
val SeqPatt = "^[^>]".r
def demultiplex(s: Stream[String], skip: Boolean): Stream[String] = {
if (s.isEmpty) Stream.empty
else if (skip) demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = false)
else s.head #:: (s.tail takeWhile (SeqPatt findFirstIn _ nonEmpty)) #::: demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = true)
}
(demultiplex(in, skip = false), demultiplex(in, skip = true))
}
The problem with the above, and I'm showing that code just to further guide in the issues of lazyness, is that the instant you do this:
val (a, b) = inputStreams(fileName)
You'll keep a reference to the head of both streams, which prevents garbage collecting them. You can't keep a reference to them, so you have to consume them as soon as you get them, without ever storing them in a "val" or "lazy val". A "var" might do, but it would be tricky to handle. So let's try this instead:
def inputStreams(fileName: String): Vector[Stream[String]] = {
val in = (io.Source fromFile fileName).getLines.toStream
val SeqPatt = "^[^>]".r
def demultiplex(s: Stream[String], skip: Boolean): Stream[String] = {
if (s.isEmpty) Stream.empty
else if (skip) demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = false)
else s.head #:: (s.tail takeWhile (SeqPatt findFirstIn _ nonEmpty)) #::: demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = true)
}
Vector(demultiplex(in, skip = false), demultiplex(in, skip = true))
}
inputStreams(fileName).zipWithIndex.par.foreach {
case (stream, strand) =>
val p = new PrintWriter(new File("FASTA" + "." + strand))
stream foreach p.println
p.close
}
That still doesn't work, because stream inside inputStreams works as a reference, keeping the whole stream in memory even while they are printed.
So, having failed again, what do I recommend? Keep it simple.
def in = (scala.io.Source fromFile fileName).getLines.toStream
def inputStream(in: Stream[String], strand: Int = 1): Stream[(String, Int)] = {
if (in.isEmpty) Stream.empty
else if (in.head startsWith ">") (in.head, 1 - strand) #:: inputStream(in.tail, 1 - strand)
else (in.head, strand) #:: inputStream(in.tail, strand)
}
val printers = Array.tabulate(2)(i => new PrintWriter(new File("FASTA" + "." + i)))
inputStream(in) foreach {
case (line, strand) => printers(strand) println line
}
printers foreach (_.close)
Now this won't keep anymore in memory than necessary. I still think it's too complex, however. This can be done more easily like this:
def in = (scala.io.Source fromFile fileName).getLines
val printers = Array.tabulate(2)(i => new PrintWriter(new File("FASTA" + "." + i)))
def printStrands(in: Iterator[String], strand: Int = 1) {
if (in.hasNext) {
val next = in.next
if (next startsWith ">") {
printers(1 - strand).println(next)
printStrands(in, 1 - strand)
} else {
printers(strand).println(next)
printStrands(in, strand)
}
}
}
printStrands(in)
printers foreach (_.close)
Or just use a while loop instead of recursion.
Now, to the other questions:
B) It might make sense to do so while reading it, so that you do not have to keep two copies of the data: the Map and a Seq.
C) Don't reverse a Stream -- you'll lose all of its laziness.