Scala View + Stream combo causing OutOfMemoryError. How do I replace it with a View?

I was looking at solving a very simple problem, Eratosthenes sieve, using idiomatic Scala, for learning purposes.
I've learned that a Stream caches its elements, so it is not very performant for determining the nth element: access is O(n) and the data is memoised, which makes it unsuitable for this situation.
def primes(nums: Stream[Int]): Stream[Int] = {
  Stream.cons(nums.head,
    primes((nums tail) filter (x => x % nums.head != 0)))
}

def ints(n: Int): Stream[Int] = {
  Stream.cons(n, ints(n + 1))
}

def nthPrime(n: Int): Int = {
  val prim = primes(ints(2)).view take n toList
  return prim(n - 1)
}
The integer stream is the problematic one: while the prime-number filtering is being done, the JVM runs out of memory. What is the correct way to achieve the same functionality without using Streams?
Basically, take a view of primes from a view of ints and return the last element, without memoisation?

I have had similar cases where a stream was a good idea, but I did not need to store its values. In order to consume the stream without storing its values I created (what I called) a ThrowAwayIterator:
class ThrowAwayIterator[T](var stream: Stream[T]) extends Iterator[T] {
  def hasNext: Boolean = stream.nonEmpty
  def next(): T = {
    val next = stream.head
    stream = stream.tail
    next
  }
}
Make sure that you do not store a reference to the instance of stream that is passed in.
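For illustration, a minimal usage sketch applying the iterator above to the primes/ints definitions from the question, to fetch the n-th prime without holding on to the stream's head (assuming the caller keeps no other reference to the stream):

// Hypothetical usage, reusing the primes and ints definitions from the question.
// Iterator.drop advances the iterator without retaining the visited elements.
def nthPrime(n: Int): Int = {
  val it = new ThrowAwayIterator(primes(ints(2)))
  it.drop(n - 1).next()
}

nthPrime(5) // 11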

Related

How to iterate and add elements to a list in scala without changing state

I have this
val x = List(1,2,3,4,5,6,7,8,9)
I want to take the items of the list from 3 to the end of the list and create a new list with them without changing state,
and get this:
List(4,5,6,7,8,9)
As stated in @Jubobs' comment and @Yuval Itzchakov's answer, the answer is to use the scala.collection.immutable.List.drop(n: Int): List[A] method of lists, whose implementation you can find in src/library/scala/collection/immutable/List.scala.
However, because of the importance of List in the Scala ecosystem, this method is aggressively optimized for performance and not indicative of good Scala style. In particular, even though it is "externally pure", it does use mutation, side-effects and loops on the inside.
An alternative implementation that doesn't use any mutation, loops or side-effects might look like this:
override def drop(n: Int): List[A] =
  if (n == 0) this else tail drop n-1
This is the trivial recursive implementation. Note: this will throw an exception if you try to drop more items than the list has, while the original one will handle that gracefully by returning the empty list. Re-introducing that behavior is trivial, though:
override def drop(n: Int): List[A] =
  if (n == 0 || isEmpty) this else tail drop n-1
This method is not just recursive, but actually tail-recursive, so it will be as efficient as a while loop. We can have the compiler yell at us if that isn't true by adding the @scala.annotation.tailrec annotation to the method:
@scala.annotation.tailrec override def drop(n: Int): List[A] =
  if (n == 0 || isEmpty) this else tail drop n-1
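For experimenting outside of List itself, here is a standalone sketch of the same tail-recursive idea, written against the public List API (dropRec is a made-up name):

import scala.annotation.tailrec

// Tail-recursive drop over the public List API; returns the list unchanged when n <= 0
// and the empty list when n exceeds the length.
@tailrec
def dropRec[A](xs: List[A], n: Int): List[A] =
  if (n <= 0 || xs.isEmpty) xs else dropRec(xs.tail, n - 1)

dropRec(List(1, 2, 3, 4, 5, 6, 7, 8, 9), 3) // List(4, 5, 6, 7, 8, 9)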
As mentioned in the comments, List.drop does exactly that:
val x = List(1,2,3,4,5,6,7,8,9)
val reduced = x.drop(3)
The implementation looks like this:
override def drop(n: Int): List[A] = {
  var these = this
  var count = n
  while (!these.isEmpty && count > 0) {
    these = these.tail
    count -= 1
  }
  these
}

Does scala have a lazy evaluating wrapper?

I want to return a wrapper/holder for a result that I want to compute only once and only if the result is actually used. Something like:
def getAnswer(question: Question): Lazy[Answer] = ???
println(getAnswer(q).value)
This should be pretty easy to implement using lazy val:
import scala.util.Try

class Lazy[T](f: () => T) {
  private lazy val _result = Try(f())
  def value: T = _result.get
}
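A quick usage sketch of that wrapper (expensiveAnswer is just a stand-in for a real computation):

// The wrapped computation runs at most once, and only on first access.
def expensiveAnswer(): Int = { println("computing"); 42 }

val lazyAnswer = new Lazy(() => expensiveAnswer())
// nothing has been printed yet
println(lazyAnswer.value) // prints "computing", then 42
println(lazyAnswer.value) // prints 42 only; the result is cached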
But I'm wondering if there's already something like this baked into the standard API.
A quick search pointed at Streams and DelayedLazyVal but neither is quite what I'm looking for.
Streams do memoize the stream elements, but it seems like the first element is computed at construction:
def compute(): Int = { println("computing"); 1 }
val s1 = compute() #:: Stream.empty
// computing is printed here, before doing s1.take(1)
In a similar vein, DelayedLazyVal starts computing upon construction, even requires an execution context:
val dlv = new DelayedLazyVal(() => 1, { println("started") })
// immediately prints out "started"
There's scalaz.Need which I think you'd be able to use for this.
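For example (a sketch, assuming scalaz is on the classpath; Need wraps a by-name expression and memoises it on first access):

import scalaz.Need

val answer = Need { println("computing"); 42 }
// nothing has been computed yet
println(answer.value) // prints "computing", then 42
println(answer.value) // prints 42 only; the result is memoised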

Queue In Scala using list in scala

We can implement a queue in Java simply by using an ArrayList, but in Scala Lists are immutable, so how can I implement a queue using a List in Scala? Could somebody give me a hint about it?
This is from Scala's immutable Queue:
Queue is implemented as a pair of Lists, one containing the in elements and the other the out elements. Elements are added to the in list and removed from the out list. When the out list runs dry, the queue is pivoted by replacing the out list by in.reverse, and in by Nil.
So:
object Queue {
  def empty[A]: Queue[A] = new Queue(Nil, Nil)
}

class Queue[A] private (in: List[A], out: List[A]) {
  def isEmpty: Boolean = in.isEmpty && out.isEmpty

  def push(elem: A): Queue[A] = new Queue(elem :: in, out)

  def pop(): (A, Queue[A]) =
    out match {
      case head :: tail => (head, new Queue(in, tail))
      case Nil =>
        val head :: tail = in.reverse // throws exception if empty
        (head, new Queue(Nil, tail))
    }
}
var q = Queue.empty[Int]
(1 to 10).foreach(i => q = q.push(i))
while (!q.isEmpty) { val (i, r) = q.pop(); println(i); q = r }
With immutable Lists, you have to return a new List after any modifying operation. Once you've grasped that, it's straightforward. A minimal (but inefficient) implementation where the Queue is also immutable might be:
class Queue[T](content: List[T]) {
  def pop() = new Queue(content.init)
  def push(element: T) = new Queue(element :: content)
  def peek() = content.last
  override def toString() = "Queue of:" + content.toString
}
val q= new Queue(List(1)) //> q : lists.queue.Queue[Int] = Queue of:List(1)
val r = q.push(2) //> r : lists.queue.Queue[Int] = Queue of:List(2, 1)
val s = r.peek() //> s : Int = 1
val t = r.pop() //> t : lists.queue.Queue[Int] = Queue of:List(2)
If we talk about mutable Lists, they wouldn't be an efficient structure for implementing a Queue for the following reason: Adding elements to the beginning of a list works very well (takes constant time), but popping elements off the end is not efficient at all (takes longer the more elements there are in the list).
You do, however, have Arrays in Scala. Accessing any element in an array takes constant time. Unfortunately, arrays are not dynamically sized, so they wouldn't make good queues: they cannot grow as your queue grows. However, ArrayBuffers do grow as your queue grows, so that would be a great place to start.
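As an illustration of that suggestion, a minimal mutable queue backed by an ArrayBuffer might look like this (a sketch; note that removing the first element still shifts the rest, which the standard mutable.Queue avoids):

import scala.collection.mutable.ArrayBuffer

// Minimal mutable queue backed by a growable ArrayBuffer.
class BufferQueue[T] {
  private val buf = ArrayBuffer.empty[T]
  def push(element: T): Unit = buf += element  // append at the end
  def pop(): T = buf.remove(0)                 // take from the front (shifts elements)
  def isEmpty: Boolean = buf.isEmpty
}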
Also, note that Scala already has a Queue class: scala.collection.mutable.Queue.
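For reference, a quick sketch of both standard-library queues:

// Immutable queue: every operation returns a new queue.
import scala.collection.immutable.Queue
val q0 = Queue(1, 2, 3)
val q1 = q0.enqueue(4)          // Queue(1, 2, 3, 4)
val (front, rest) = q1.dequeue  // (1, Queue(2, 3, 4))

// Mutable queue: operations modify the queue in place.
val mq = scala.collection.mutable.Queue(1, 2, 3)
mq.enqueue(4)
val first = mq.dequeue()        // 1; mq is now Queue(2, 3, 4)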
The only way to implement a Queue with an immutable List would be to use a var. Good luck!

Parallel collection processing of data larger than memory size

Is there a simple way to use scala parallel collections without loading a full collection into memory?
For example I have a large collection and I'd like to perform a particular operation (fold) in parallel only on a small chunk, that fits into memory, than on another chunk and so on, and finally recombine results from all chunks.
I know that actors could be used, but it would be really nice to use parallel collections.
I've written a solution, but it isn't nice:
def split[A](list: Iterable[A], chunkSize: Int): Iterable[Iterable[A]] = {
  new Iterator[Iterable[A]] {
    var rest = list
    def hasNext = !rest.isEmpty
    def next = {
      val chunk = rest.take(chunkSize)
      rest = rest.drop(chunkSize)
      chunk
    }
  }.toIterable
}

def foldPar[A](acc: A)(list: Iterable[A], chunkSize: Int, combine: ((A, A) => A)): A = {
  val chunks: Iterable[Iterable[A]] = split(list, chunkSize)
  def combineChunk: ((A, Iterable[A]) => A) = { case (res, entries) => entries.par.fold(res)(combine) }
  chunks.foldLeft(acc)(combineChunk)
}
val chunkSize = 10000000
val x = 1 to chunkSize * 10

def sum: ((Int, Int) => Int) = { case (acc, n) => acc + n }

foldPar(0)(x, chunkSize, sum)
Your idea is very neat and it's a pity there is no such function available already (AFAIK).
I just rephrased your idea into a bit shorter code. First, I feel that for parallel folding it's useful to use the concept of a monoid - a structure with an associative operation and a zero element. The associativity is important because we don't know the order in which we combine results that are computed in parallel. And the zero element is important so that we can split computations into blocks and start folding each one from zero. There is nothing new about it though, it's just what fold for Scala's collections expects.
// The function defined by Monoid's apply must be associative
// and zero its identity element.
trait Monoid[A] extends Function2[A, A, A] {
  val zero: A
}
Next, Scala's Iterators already have a useful method grouped(size: Int): GroupedIterator[A], which slices the iterator into fixed-size Seq[A] blocks. It's quite similar to your split. This allows us to cut the input into fixed-size blocks and then apply Scala's parallel collection methods on them:
def parFold[A](c: Iterator[A], blockSize: Int)(implicit monoid: Monoid[A]): A =
  c.grouped(blockSize).map(_.par.fold(monoid.zero)(monoid))
    .fold(monoid.zero)(monoid)
We fold each block using the parallel collections framework and then (without any parallelization) combine the intermediate results.
An example:
// Example:
object SumMonoid extends Monoid[Long] {
  override val zero: Long = 0
  override def apply(x: Long, y: Long) = x + y
}

val it = Iterator.range(1, 10000001).map(_.toLong)
println(parFold(it, 100000)(SumMonoid))
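The same parFold works for any associative operation with an identity element; for instance, a hypothetical maximum monoid over non-negative Longs:

// Another Monoid instance: maximum, with 0 as the identity (valid for non-negative values).
object MaxMonoid extends Monoid[Long] {
  override val zero: Long = 0
  override def apply(x: Long, y: Long): Long = x max y
}

val nums = Iterator.range(1, 10000001).map(_.toLong)
println(parFold(nums, 100000)(MaxMonoid)) // prints 10000000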

Scala functional way of processing large-scale data with lazy collections

I am trying to figure out memory-efficient AND functional ways to process large amounts of string data in Scala. I have read many things about lazy collections and have seen quite a few code examples. However, I run into "GC overhead limit exceeded" or "Java heap space" issues again and again.
Often the problem is that I try to construct a lazy collection, but evaluate each new element when I append it to the growing collection (I don't know any other way to do so incrementally). Of course, I could try something like initializing a lazy collection first and then yielding the collection holding the desired values by applying the resource-critical computations with map or so, but often I simply do not know the exact size of the final collection a priori to initialize that lazy collection.
Maybe you could help me by giving me hints or explanations on how to improve the following code as an example, which splits a FASTA (definition below) formatted file into two separate files according to the rule that odd sequence pairs belong to one file and even ones to another ("separation of strands"). The "most" straightforward way to do so would be imperative: loop through the lines and print into the corresponding files via open file streams (and this of course works excellently). However, I just don't enjoy the style of reassigning to variables holding headers and sequences, so the following example code uses (tail-)recursion, and I would appreciate having found a way to maintain a similar design without running into resource problems!
The example works perfectly for small files, but already with files at around ~500 MB the code will fail with the standard JVM setup. I do want to process files of "arbitrary" size, say 10-20 GB or so.
val fileName = args(0)
val in = io.Source.fromFile(fileName) getLines

type itType = Iterator[String]
type sType = Stream[(String, String)]

def getFullSeqs(ite: itType) = {
  //val metaChar = ">"
  val HeadPatt = "(^>)(.+)" r
  val SeqPatt = "([\\w\\W]+)" r
  @annotation.tailrec
  def rec(it: itType, out: sType = Stream[(String, String)]()): sType =
    if (it hasNext) it next match {
      case HeadPatt(_, header) =>
        // introduce new header-sequence pair
        rec(it, (header, "") #:: out)
      case SeqPatt(seq) =>
        val oldVal = out head
        // concat subsequences
        val newStream = (oldVal._1, oldVal._2 + seq) #:: out.tail
        rec(it, newStream)
      case _ =>
        println("something went wrong my friend, oh oh oh!"); Stream[(String, String)]()
    } else out
  rec(ite)
}

def printStrands(seqs: sType) {
  import java.io.PrintWriter
  import java.io.File
  def printStrand(seqse: sType, strand: Int) {
    // only use sequences of one strand
    val indices = List.tabulate(seqs.size / 2)(_ * 2 + strand - 1).view
    val p = new PrintWriter(new File(fileName + "." + strand))
    indices foreach { i =>
      p.print(">" + seqse(i)._1 + "\n" + seqse(i)._2 + "\n")
    }; p.close
    println("Done bro!")
  }
  List(1, 2).par foreach (s => printStrand(seqs, s))
}

printStrands(getFullSeqs(in))
Three questions arise for me:
A) Let's assume one needs to maintain a large data structure obtained by processing the initial iterator you get from getLines, as in my getFullSeqs method (note the different size of in and the output of getFullSeqs), because transformations on the whole(!) data are required repeatedly, and one does not know which part of the data one will require at any step. My example might not be the best, but how does one do that? Is it possible at all?
B) What when the desired data structure is not inherently lazy, say one would like to store the (header -> sequence) pairs into a Map()? Would you wrap it in a lazy collection?
C) My implementation of constructing the stream might reverse the order of the inputted lines. When calling reverse, all elements will be evaluated (in my code, they already are, so this is the actual problem). Is there any way to post-process "from behind" in a lazy fashion? I know of reverseIterator, but is this already the solution, or will this not actually evaluate all elements first, too (as I would need to call it on a list)? One could construct the stream with newVal #:: rec(...), but I would lose tail-recursion then, wouldn't I?
So what I basically need is to add elements to a collection, which are not evaluated by the process of adding. So lazy val elem = "test"; elem :: lazyCollection is not what I am looking for.
EDIT: I have also tried using a by-name parameter for the stream argument in rec.
Thank you so much for your attention and time, I really appreciate any help (again :) ).
FASTA is defined as a sequential set of sequences delimited by a single header line. A header is defined as a line starting with ">". Every line below the header is called part of the sequence associated with the header. A sequence ends when a new header is present. Every header is unique. Example:
>HEADER1
abcdefg
>HEADER2
hijklmn
opqrstu
>HEADER3
vwxyz
>HEADER4
zyxwv
Thus, sequence 2 is twice as big as seq 1. My program would split that file into a file A containing
>HEADER1
abcdefg
>HEADER3
vwxyz
and a second file B containing
>HEADER2
hijklmn
opqrstu
>HEADER4
zyxwv
The input file is assumed to consist of an even number of header-sequence pairs.
The key to working with really large data structures is to hold in memory only that which is critical to perform whatever operation you need. So, in your case, that's
Your input file
Your two output files
The current line of text
and that's it. In some cases you may need to store information such as how long a sequence is; in such events, you build the data structures in a first pass and use them on a second pass. Let's suppose, for example, that you decide that you want to write three files: one for even records, one for odd, and one for entries where the total length is less than 300 nucleotides. You would do something like this (warning--it compiles but I never ran it, so it may not actually work):
final def findSizes(
  data: Iterator[String], sz: Map[String, Long] = Map(),
  currentName: String = "", currentSize: Long = 0
): Map[String, Long] = {
  def currentMap = if (currentName != "") sz + (currentName -> currentSize) else sz
  if (!data.hasNext) currentMap
  else {
    val s = data.next
    if (s(0) == '>') findSizes(data, currentMap, s, 0)
    else findSizes(data, sz, currentName, currentSize + s.length)
  }
}
Then, for processing, you use that map and pass through again:
import java.io._

final def writeFiles(
  source: Iterator[String], targets: Array[PrintWriter],
  sizes: Map[String, Long], count: Int = -1, which: Int = 0
) {
  if (!source.hasNext) targets.foreach(_.close)
  else {
    val s = source.next
    if (s(0) == '>') {
      val w = if (sizes.get(s).exists(_ < 300)) 2 else (count + 1) % 2
      targets(w).println(s)
      writeFiles(source, targets, sizes, count + 1, w)
    } else {
      targets(which).println(s)
      writeFiles(source, targets, sizes, count, which)
    }
  }
}
You then use Source.fromFile(f).getLines() twice to create your iterators, and you're all set. Edit: in some sense this is the key step, because this is your "lazy" collection. However, it's important not just because it doesn't read everything into memory immediately ("lazy"), but because it doesn't store any previous strings either!
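Putting the two passes together might look roughly like this (a sketch; "input.fasta" and the output file names are placeholders):

import java.io._

// First pass: collect per-header sequence sizes.
val sizes = findSizes(io.Source.fromFile("input.fasta").getLines())

// Second pass: route each line to the odd, even, or "short" output file.
val targets = Array(
  new PrintWriter(new File("strand0.fasta")),
  new PrintWriter(new File("strand1.fasta")),
  new PrintWriter(new File("short.fasta"))
)
writeFiles(io.Source.fromFile("input.fasta").getLines(), targets, sizes)
// writeFiles closes the writers once the input is exhausted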
More generally, Scala can't spare you from thinking carefully about what information you need to have in memory and what you can fetch off disk as needed. Lazy evaluation can sometimes help, but there's no magic formula, because you can easily express the requirement to have all your data in memory in a lazy way. Scala can't interpret your commands to access memory as, secretly, instructions to fetch stuff off the disk instead. (Well, not unless you write a library to cache results from disk which does exactly that.)
One could construct the stream with newVal #:: rec(...), but I would lose tail-recursion then, wouldn't I?
Actually, no.
So, here's the thing... with your present tail recursion, you fill ALL of the Stream with values. Yes, Stream is lazy, but you are computing all of the elements, stripping it of any laziness.
Now say you do newVal #:: rec(...). Would you lose tail recursion? No. Why? Because you are not recursing. How come? Well, Stream is lazy, so it won't evaluate rec(...).
And that's the beauty of it. Once you do it that way, getFullSeqs returns immediately, and the "recursion" is only computed when printStrands asks for it. Unfortunately, that won't work as is...
The problem is that you are constantly modifying the Stream -- that's not how you use a Stream. With Stream, you always append to it. Don't keep "rewriting" the Stream.
Now, there are three other problems I could readily identify with printStrands. First, it calls size on seqs, which will cause the whole Stream to be processed, losing laziness. Never call size on a Stream. Second, you call apply on seqse, accessing it by index. Never call apply on a Stream (or List) -- that's highly inefficient. It's O(n), which makes your inner loop O(n^2) -- yes, quadratic on the number of headers in the input file! Finally, printStrands keeps a reference to seqs throughout the execution of printStrand, preventing processed elements from being garbage collected.
So, here's a first approximation:
def inputStreams(fileName: String): (Stream[String], Stream[String]) = {
  val in = (io.Source fromFile fileName).getLines.toStream
  val SeqPatt = "^[^>]".r
  def demultiplex(s: Stream[String], skip: Boolean): Stream[String] = {
    if (s.isEmpty) Stream.empty
    else if (skip) demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = false)
    else s.head #:: (s.tail takeWhile (SeqPatt findFirstIn _ nonEmpty)) #::: demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = true)
  }
  (demultiplex(in, skip = false), demultiplex(in, skip = true))
}
The problem with the above, and I'm showing that code just to further illustrate the issues of laziness, is that the instant you do this:
val (a, b) = inputStreams(fileName)
You'll keep a reference to the head of both streams, which prevents garbage collecting them. You can't keep a reference to them, so you have to consume them as soon as you get them, without ever storing them in a "val" or "lazy val". A "var" might do, but it would be tricky to handle. So let's try this instead:
def inputStreams(fileName: String): Vector[Stream[String]] = {
  val in = (io.Source fromFile fileName).getLines.toStream
  val SeqPatt = "^[^>]".r
  def demultiplex(s: Stream[String], skip: Boolean): Stream[String] = {
    if (s.isEmpty) Stream.empty
    else if (skip) demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = false)
    else s.head #:: (s.tail takeWhile (SeqPatt findFirstIn _ nonEmpty)) #::: demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = true)
  }
  Vector(demultiplex(in, skip = false), demultiplex(in, skip = true))
}

inputStreams(fileName).zipWithIndex.par.foreach {
  case (stream, strand) =>
    val p = new PrintWriter(new File("FASTA" + "." + strand))
    stream foreach p.println
    p.close
}
That still doesn't work, because stream inside the foreach acts as a reference, keeping the whole stream in memory even while it is being printed.
So, having failed again, what do I recommend? Keep it simple.
def in = (scala.io.Source fromFile fileName).getLines.toStream

def inputStream(in: Stream[String], strand: Int = 1): Stream[(String, Int)] = {
  if (in.isEmpty) Stream.empty
  else if (in.head startsWith ">") (in.head, 1 - strand) #:: inputStream(in.tail, 1 - strand)
  else (in.head, strand) #:: inputStream(in.tail, strand)
}

val printers = Array.tabulate(2)(i => new PrintWriter(new File("FASTA" + "." + i)))

inputStream(in) foreach {
  case (line, strand) => printers(strand).println(line)
}

printers foreach (_.close)
Now this won't keep any more in memory than necessary. I still think it's too complex, however. This can be done more easily like this:
def in = (scala.io.Source fromFile fileName).getLines

val printers = Array.tabulate(2)(i => new PrintWriter(new File("FASTA" + "." + i)))

def printStrands(in: Iterator[String], strand: Int = 1) {
  if (in.hasNext) {
    val next = in.next
    if (next startsWith ">") {
      printers(1 - strand).println(next)
      printStrands(in, 1 - strand)
    } else {
      printers(strand).println(next)
      printStrands(in, strand)
    }
  }
}

printStrands(in)
printers foreach (_.close)
Or just use a while loop instead of recursion.
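For completeness, a sketch of that while-loop variant (under the same assumptions as the recursive version above):

import java.io.{File, PrintWriter}

// Same routing logic as the recursive printStrands, written with a while loop.
val lines = (scala.io.Source fromFile fileName).getLines
val printers = Array.tabulate(2)(i => new PrintWriter(new File("FASTA" + "." + i)))
var strand = 1
while (lines.hasNext) {
  val line = lines.next()
  if (line startsWith ">") strand = 1 - strand  // a header switches to the other strand
  printers(strand).println(line)
}
printers foreach (_.close)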
Now, to the other questions:
B) It might make sense to do so while reading it, so that you do not have to keep two copies of the data: the Map and a Seq.
C) Don't reverse a Stream -- you'll lose all of its laziness.