Storing a sequence of Long efficiently in a file in Scala - scala

So I have an association that maps a pair of Ints to a Vector[Long] that can be up to size 10000, and I have anywhere from several hundred thousand to a million such entries. I would like to store this in a single file for later processing in Scala.
Clearly storing this in a plain-text format would take way too much space, so I've been trying to figure out how to do it by writing a Byte stream. However, I'm not sure this will work, since it seems to me that the byteValue() of a Long returns the Byte representation, which is still 4 bytes long, and hence I won't save any space? I do not have much experience working with binary formats.
It seems the Scala standard library had a BytePickle that might have been what I was looking for, but has since been deprecated?

An arbitrary Long is about 19.5 ASCII digits long, but only 8 bytes in binary, so you'll gain roughly a factor of 2 by writing it in binary. Now, it may be that most of the values are not actually taking all 8 bytes, in which case you could define some compression scheme yourself.
In any case, you are probably best off writing block data using java.nio.ByteBuffer and friends. Binary data is most efficiently read in blocks, and you might want your file to be randomly accessible, in which case you want your data to look something like so:
<some unique binary header that lets you check the file type>
<int saying how many records you have>
<offset of the first record>
<offset of the second record>
...
<offset of the last record>
<int><int><length of vector><long><long>...<long>
<int><int><length of vector><long><long>...<long>
...
<int><int><length of vector><long><long>...<long>
This is a particularly convenient format for reading and writing using ByteBuffer because you know in advance how big everything is going to be. So you can
val fos = new FileOutputStream(myFileName)
val fc = fos.getChannel // java.nio.channels.FileChannel
val header = ByteBuffer.allocate(28)
header.put("This is my cool header!!".getBytes)
header.putInt(data.length)
header.flip() // switch the buffer from filling to draining before writing it out
fc.write(header)
val offsets = ByteBuffer.allocate(8*data.length)
data.foldLeft(28L + 8*data.length){ (n, d) =>
  offsets.putLong(n)
  n + 12 + d.vector.length*8 // 12 bytes for the two ints and the length, then the longs
}
offsets.flip()
fc.write(offsets)
...
and on the way back in
val fis = new FileInputStream(myFileName)
val fc = fis.getChannel
val header = ByteBuffer.allocate(28)
fc.read(header)
header.flip() // prepare the freshly filled buffer for reading
val hbytes = new Array[Byte](24)
header.get(hbytes)
if (new String(hbytes) != "This is my cool header!!") ???
val nrec = header.getInt
val offsets = ByteBuffer.allocate(8*nrec)
fc.read(offsets)
offsets.flip()
val offsetArray = offsets.getLongs(nrec) // See below!
...
Some handy methods are absent from ByteBuffer, but you can add them on with implicits (here for Scala 2.10; with 2.9 make it a plain class, drop the extends AnyVal, and supply an implicit conversion from ByteBuffer to RichByteBuffer):
implicit class RichByteBuffer(val b: java.nio.ByteBuffer) extends AnyVal {
  def getBytes(n: Int) = { val a = new Array[Byte](n); b.get(a); a }
  def getShorts(n: Int) = { val a = new Array[Short](n); var i=0; while (i<n) { a(i)=b.getShort(); i+=1 } ; a }
  def getInts(n: Int) = { val a = new Array[Int](n); var i=0; while (i<n) { a(i)=b.getInt(); i+=1 } ; a }
  def getLongs(n: Int) = { val a = new Array[Long](n); var i=0; while (i<n) { a(i)=b.getLong(); i+=1 } ; a }
  def getFloats(n: Int) = { val a = new Array[Float](n); var i=0; while (i<n) { a(i)=b.getFloat(); i+=1 } ; a }
  def getDoubles(n: Int) = { val a = new Array[Double](n); var i=0; while (i<n) { a(i)=b.getDouble(); i+=1 } ; a }
}
Anyway, the reason to do things this way is that you'll end up with decent performance, which is also a consideration when you have tens of gigabytes of data (which it sounds like you have given hundreds of thousands of vectors of length up to ten thousand).
If your problem is actually much smaller, then don't worry so much about it--pack it into XML or use JSON or some custom text solution (or use DataOutputStream and DataInputStream, which don't perform as well and won't give you random access).
If your problem is actually bigger, you can define two lists of longs; first, the ones that will fit in an Int, say, and then the ones that actually need a full Long (with indices so you know where they are). Data compression is a very case-specific task--assuming you don't just want to use java.util.zip--so without a lot more knowledge about what the data looks like, it's hard to know what to recommend beyond just storing it as a weakly hierarchical binary file as I've described above.
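As a rough illustration of that "two lists" idea, here is a hypothetical helper (splitBySize is my own name, not part of the answer): it partitions a vector into values that fit in an Int and values that need the full Long, keeping the original index of each value so the two groups can later be written as 4-byte and 8-byte entries and reassembled in order.
// Hypothetical sketch: partition a Vector[Long] by how many bytes each value needs,
// remembering the original position of every value.
def splitBySize(v: Vector[Long]): (Vector[(Int, Int)], Vector[(Int, Long)]) = {
  val (small, big) = v.zipWithIndex.partition { case (x, _) =>
    x >= Int.MinValue && x <= Int.MaxValue
  }
  (small.map { case (x, i) => (i, x.toInt) },   // (original index, 4-byte value)
   big.map   { case (x, i) => (i, x) })         // (original index, 8-byte value)
}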

See Java's DataOutputStream. It allows easy and efficient writing of primitive types and Strings to byte streams. In particular, you want something like:
val stream = new DataOutputStream(new FileOutputStream("your_file.bin"))
You can then use the equivalent DataInputStream methods to read from that file to variables again.
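As a small sketch of how that could look for the (Int, Int, Vector[Long]) records described in the question (the file name and values here are just placeholders, not from the answer):
import java.io.{BufferedInputStream, BufferedOutputStream, DataInputStream, DataOutputStream, FileInputStream, FileOutputStream}

// Write one record: two ints, the vector length, then the longs.
val out = new DataOutputStream(new BufferedOutputStream(new FileOutputStream("your_file.bin")))
val (a, b, vec) = (1, 2, Vector(10L, 20L, 30L))
out.writeInt(a)
out.writeInt(b)
out.writeInt(vec.length) // store the length so the reader knows how many longs follow
vec.foreach(out.writeLong)
out.close()

// Read it back with the mirrored DataInputStream calls.
val in = new DataInputStream(new BufferedInputStream(new FileInputStream("your_file.bin")))
val (a2, b2) = (in.readInt(), in.readInt())
val vec2 = Vector.fill(in.readInt())(in.readLong())
in.close()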

I used scala-io and scala-arm to write a binary stream of Longs. The libraries themselves are meant to be the Scala way of doing these things, but they are not in the Scala master branch - maybe someone knows why? I use them from time to time.
1) Clone scala-io:
git clone https://github.com/scala-incubator/scala-io.git
Go to scala-io/package and, in Build.scala, change val scalaVersion to your Scala version, then run:
sbt package
2) Clone scala-arm:
git clone https://github.com/jsuereth/scala-arm.git
Go to scala-arm/package and, in build.scala, change scalaVersion := to your Scala version, then run:
sbt package
3) Copy somewhere not too far:
scala-io/core/target/scala-xxx/scala-io-core_xxx-0.5.0-SNAPSHOT.jar
scala-io/file/target/scala-xxx/scala-io-file_xxx-0.5.0-SNAPSHOT.jar
scala-arm/target/scala-xxx/scala-arm_xxx-1.3-SNAPSHOT.jar
4) Start REPL:
scala -classpath "/opt/scala-io/scala-io-core_2.10-0.5.0-SNAPSHOT.jar:
/opt/scala-io/scala-io-file_2.10-0.5.0-SNAPSHOT.jar:
/opt/scala-arm/scala-arm_2.10-1.3-SNAPSHOT.jar"
5) :paste actual code:
import scalax.io._
// create data stream
val EOData = Vector(0xffffffffffffffffL)
val data = List(
  (0, Vector(0L,1L,2L,3L))
, (1, Vector(4L,5L))
, (2, Vector(6L,7L,8L))
, (3, Vector(9L))
)
var it = Iterator[Long]()
for (rec <- data) {
  it = it ++ Vector(rec._1).iterator.map(_.toLong)
  it = it ++ rec._2.iterator
  it = it ++ EOData.iterator
}
// write data at once
val out: Output = Resource.fromFile("/tmp/data")
out.write(it)(OutputConverter.TraversableLongConverter)

Related

Chunk file into multiple streams in scala

I have a use case in my program where I need to take a file, split it into N roughly equal parts, and upload the parts remotely.
I'd like a function that takes, say, a File and outputs a list of BufferedReaders, which I can then distribute and pass to another function that uses some API to store them.
I've seen examples where authors utilize the .lines() method of a BufferedReader:
import java.io.{BufferedReader, ByteArrayInputStream, InputStreamReader}

def splitFile: List[java.util.stream.Stream[String]] = {
  val temp = "Test mocked file contents\nTest"
  val is = new ByteArrayInputStream(temp.getBytes)
  val br = new BufferedReader(new InputStreamReader(is))
  // Chunk the file into two sort-of equal parts.
  // Stream 1
  val test = br.lines().skip(1).limit(1)
  // Stream 2
  val test2 = br.lines().skip(2).limit(1)
  List(test, test2)
}
I suppose that the above example works, it's not beautiful but it works.
My questions:
Is there a way to split a BufferedReader into a list of multiple streams?
I don't control the format of the File, so the file contents could potentially be a single line long. Wouldn't that just mean that .lines() loads all of that into a Stream of one element?
It will be much easier to read the file once, and then split it up. You're not really losing anything either, since the file read is a serial operation anyway. From there, just devise a scheme to slice up the list. In this case, I group everything based on its index modulo the number of desired output lists, then pull out the lists. If there are fewer lines than you ask for, it will just place each line in a separate List.
import scala.collection.JavaConverters._

val lines: List[String] = br.lines.iterator.asScala.toList
val outputListQuantity: Int = 2
val data: List[List[String]] = lines.zipWithIndex
  .groupBy(_._2 % outputListQuantity)
  .values.map(_.map(_._1)).toList
Well, if you don't mind reading the whole stream into memory, it's easy enough (assuming the file contains text, since you are talking about Readers, but it would be the same idea with binary):
Source.fromFile("filename")
.mkString
.getBytes
.grouped(chunkSize)
.map { chunk => new BufferedReader(new InputStreamReader(chunk)) }
But that seems to sorta defeat the purpose: if the file is small enough to be loaded into memory entirely, why bother splitting it to begin with?
So, a more practical solution is a little bit more involved:
import java.io.{ByteArrayInputStream, InputStream}
import scala.collection.AbstractIterator

def splitFile(
  input: InputStream,
  chunkSize: Int
): Iterator[InputStream] = new AbstractIterator[InputStream] {
  var hasNext = true
  def next() = {
    val buffer = new Array[Byte](chunkSize)
    val bytes = input.read(buffer)
    hasNext = bytes == chunkSize
    new ByteArrayInputStream(buffer, 0, bytes max 0)
  }
}
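For example, a hedged usage sketch: read a local file in 1 MiB chunks and hand each chunk to some upload function (uploadChunk and the file name are placeholders, not a real API).
import java.io.{File, FileInputStream, InputStream}

def uploadChunk(part: Int, chunk: InputStream): Unit = ??? // your storage API goes here

val in = new FileInputStream(new File("big-input.dat"))
try {
  splitFile(in, 1024 * 1024).zipWithIndex.foreach { case (chunk, i) =>
    uploadChunk(i, chunk)
  }
} finally in.close()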

Spark: Transforming JSON files to correct format

I have 100+ million records stored in files with the following JSON structure (the real data has far more columns and rows, and is also nested):
{"id":"2-2-3","key":"value"}{"id":"2-2-3","key":"value"}{"id":"2-2-3","key":"value"}{"id":"2-2-3","key":"value"}{"id":"2-2-3","key":"value"}
The sqlContext.read.json function is unable to parse this since the records aren't on multiple lines but on one big line. The solution below solves this problem, but is a big performance killer. What would be the best way, performance wise, to handle this issue in Apache Spark?
val rdd = sc.wholeTextFiles("s3://some-bucket/**/*")
val validJSON = rdd.flatMap(_._2.replace("}{", "}\n{").split("\n"))
val df = sqlContext.read.json(validJSON)
df.count()
df.select("id").show()
This is a riff on Antot's answer, which should handle nested JSON
input.toVector
  .foldLeft((false, Vector.empty[Char], Vector.empty[String])) {
    case ((true, charAccum, strAccum), '{') => (false, Vector('{'), strAccum :+ charAccum.mkString)
    case ((_, charAccum, strAccum), '}')    => (true, charAccum :+ '}', strAccum)
    case ((_, charAccum, strAccum), char)   => (false, charAccum :+ char, strAccum)
  }
  ._3
Basically what it does is turn the data into a Vector[Char] and use foldLeft to aggregate the input into substrings. The trick is to keep track of just enough information about the previous character to figure out if a { marks the start of a new object.
I used this input to test it (basically the OP's sample input, with a nested object thrown in):
val input = """{"id":"2-2-3","key":{ "test": "value"}}{"id":"2-2-3","key":"value"}{"id":"2-2-3","key":"value"}{"id":"2-2-3","key":"value"}{"id":"2-2-3","key":"value"}"""
And got this result, which looks good:
Vector({"id":"2-2-3","key":{ "test": "value"}},
{"id":"2-2-3","key":"value"},
{"id":"2-2-3","key":"value"},
{"id":"2-2-3","key":"value"})
The problem with the original approach is the call _._2.replace("}{", "}\n{"), which creates another huge string from the input one, with newline chars inserted, which is then split once again into an array.
An improvement is possible by minimizing the creation of intermediate strings and retrieving the target ones as soon as possible. For this, we can play a bit with substrings:
import scala.collection.mutable.ListBuffer

val validJson = rdd.flatMap(rawJson => {
  // functions extracted to make it more readable
  def nextObjectStartIndex(fromIndex: Int): Int = rawJson._2.indexOf('{', fromIndex)
  def currObjectEndIndex(fromIndex: Int): Int = rawJson._2.indexOf('}', fromIndex)
  def extractObject(fromIndex: Int, toIndex: Int): String = rawJson._2.substring(fromIndex, toIndex + 1)

  // the resulting strings are put in a local buffer
  val buffer = new ListBuffer[String]()

  // init the scanning of the input string
  var posStartNextObject = nextObjectStartIndex(0)

  // main loop terminates when there are no more '{' chars
  while (posStartNextObject != -1) {
    val posEndObject = currObjectEndIndex(posStartNextObject)
    val extractedObject = extractObject(posStartNextObject, posEndObject)
    posStartNextObject = nextObjectStartIndex(posEndObject)
    buffer += extractedObject
  }
  buffer
})
Please note that this approach works only if the objects in the input JSON are not nested, i.e. if all curly braces delimit objects of the same level.
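If nested objects do have to be handled, one possible variant (my own sketch, not part of the answer) is to track the brace depth and cut an object only when the depth returns to zero; note that this simple version ignores braces that might appear inside JSON string literals.
import scala.collection.mutable.ListBuffer

def splitTopLevelObjects(raw: String): Seq[String] = {
  val buffer = new ListBuffer[String]()
  var depth = 0
  var start = -1
  var i = 0
  while (i < raw.length) {
    raw.charAt(i) match {
      case '{' =>
        if (depth == 0) start = i // a new top-level object begins here
        depth += 1
      case '}' =>
        depth -= 1
        if (depth == 0) buffer += raw.substring(start, i + 1)
      case _ => ()
    }
    i += 1
  }
  buffer.toList
}
// e.g. rdd.flatMap(pair => splitTopLevelObjects(pair._2))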

How to write binary data to a file to implement Huffman compression

I'm trying to implement Huffman compression. After encoding the chars into 0s and 1s, how do I write them to a file so that the file is actually compressed? Obviously, simply writing the characters '0' and '1' will only make the file larger.
Let's say I have a string "0101010" which represents bits.
I wish to write the string to a file, but as binary, i.e. not the char '0' but the bit 0.
I tried using getBytesArray() on the string, but it seems to make no difference compared to simply writing the string.
Although Sarvesh Kumar Singh's code will probably work, it looks so Java-ish to me that I think this question needs one more answer. The way I imagine Huffman coding being implemented in Scala is something like this:
import scala.collection._
import scala.io.Source

type Bit = Boolean
type BitStream = Iterator[Bit]
type BitArray = Array[Bit]
type ByteStream = Iterator[Byte]
type CharStream = Iterator[Char]

case class EncodingMapping(charMapping: Map[Char, BitArray], eofCharMapping: BitArray)

def buildMapping(src: CharStream): EncodingMapping = {
  def accumulateStats(src: CharStream): Map[Char, Int] = ???
  def buildMappingImpl(stats: Map[Char, Int]): EncodingMapping = ???
  val stats = accumulateStats(src)
  buildMappingImpl(stats)
}

def convertBitsToBytes(bits: BitStream): ByteStream = {
  bits.grouped(8).map(bits => {
    val res = bits.foldLeft((0.toByte, 0))((acc, bit) => ((acc._1 * 2 + (if (bit) 1 else 0)).toByte, acc._2 + 1))
    // last byte might be less than 8 bits
    if (res._2 == 8)
      res._1
    else
      (res._1 << (8 - res._2)).toByte
  })
}

def encodeImpl(src: CharStream, mapping: EncodingMapping): ByteStream = {
  val mainData = src.flatMap(ch => mapping.charMapping(ch))
  val fullData = mainData ++ mapping.eofCharMapping
  convertBitsToBytes(fullData)
}

// can be used to encode a String as src, thanks to the StringLike/StringOps extension
def encode(src: Iterable[Char]): (EncodingMapping, ByteStream) = {
  val mapping = buildMapping(src.iterator)
  val encoded = encodeImpl(src.iterator, mapping)
  (mapping, encoded)
}

def wrapClose[A <: java.io.Closeable, B](resource: A)(op: A => B): B = {
  try {
    op(resource)
  }
  finally {
    resource.close()
  }
}

def encodeFile(fileName: String): (EncodingMapping, ByteStream) = {
  // note: in real life you probably want to specify the file encoding as well
  val mapping = wrapClose(Source.fromFile(fileName))(buildMapping)
  val encoded = wrapClose(Source.fromFile(fileName))(file => encodeImpl(file, mapping))
  (mapping, encoded)
}
where in accumulateStats you find out how often each Char is present in src, and in buildMappingImpl (which is the main part of the whole Huffman encoding) you first build a tree from those stats and then use the tree to create a fixed EncodingMapping. eofCharMapping is a mapping for the pseudo-EOF char as mentioned in one of the comments. Note that the high-level encode methods return both the EncodingMapping and the ByteStream, because in any real-life scenario you want to save both.
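As an illustration, one possible accumulateStats (buildMappingImpl, the actual Huffman tree construction, is deliberately left out here, as in the answer) might simply count character frequencies:
// Count how often each Char occurs in the source; a minimal sketch, not the only way.
def accumulateStats(src: Iterator[Char]): Map[Char, Int] =
  src.foldLeft(immutable.Map.empty[Char, Int]) { (counts, ch) =>
    counts + (ch -> (counts.getOrElse(ch, 0) + 1))
  }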
The piece of logic specifically being asked about is located in the convertBitsToBytes method. Note that I use Boolean to represent a single bit rather than Char, and thus Iterator[Bit] (effectively Iterator[Boolean]) rather than String to represent a sequence of bits. The idea of the implementation is based on the grouped method, which converts a BitStream into a stream of Bits grouped into byte-sized groups (except possibly the last one).
IMHO the main advantage of this stream-oriented approach compared to Sarvesh Kumar Singh's answer is that you don't need to load the whole file into memory at once or store the whole encoded file in memory. Note however that in such a case you'll have to read the file twice: the first time to build the EncodingMapping and the second to apply it. Obviously, if the file is small enough you can load it into memory first and then convert the ByteStream to Array[Byte] using a .toArray call. But if your file is big, you can use the stream-based approach and easily save the ByteStream into a file using something like .foreach(b => out.write(b))
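A small sketch of that last point, saving a ByteStream to a file with a buffered stream so the one-byte writes stay cheap (writeByteStream and the file name are illustrative, not part of the answer):
import java.io.{BufferedOutputStream, FileOutputStream}

def writeByteStream(bytes: Iterator[Byte], fileName: String): Unit = {
  val out = new BufferedOutputStream(new FileOutputStream(fileName))
  try bytes.foreach(b => out.write(b)) // OutputStream.write takes an Int; the Byte is widened
  finally out.close()
}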
I don't think this will help you achieve your Huffman compression goal, but in answer to your question:
string-to-value
Converting a String to the value it represents is pretty easy.
val x: Int = "10101010".foldLeft(0)(_*2 + _.asDigit)
Note: You'll have to check for formatting (only ones and zeros) and overflow (strings too long) before conversion.
value-to-file
There are a number of ways to write data to a file. Here's a simple one.
import java.io.{FileOutputStream, File}
val fos = new FileOutputStream(new File("output.dat"))
fos.write(x)
fos.flush()
fos.close()
Note: You'll want to catch any errors thrown.
I will specify all the required imports first,
import java.io.{ File, FileInputStream, FileOutputStream}
import java.nio.file.Paths
import scala.collection.mutable.ArrayBuffer
Now, we are going to need the following smaller units to achieve this whole thing:
1 - We need to be able to convert our binary string (e.g. "01010") to Array[Byte],
def binaryStringToByteArray(binaryString: String) = {
  val byteBuffer = ArrayBuffer.empty[Byte]
  var byteStr = ""
  for (binaryChar <- binaryString) {
    if (byteStr.length < 7) {
      byteStr = byteStr + binaryChar
    }
    else {
      try {
        val byte = java.lang.Byte.parseByte(byteStr + binaryChar, 2)
        byteBuffer += byte
        byteStr = ""
      }
      catch {
        case ex: java.lang.NumberFormatException =>
          val byte = java.lang.Byte.parseByte(byteStr, 2)
          byteBuffer += byte
          byteStr = "" + binaryChar
      }
    }
  }
  if (!byteStr.isEmpty) {
    val byte = java.lang.Byte.parseByte(byteStr, 2)
    byteBuffer += byte
    byteStr = ""
  }
  byteBuffer.toArray
}
2 - We need to be able to open the file to serve in our little play,
def openFile(filePath: String): File = {
  val path = Paths.get(filePath)
  val file = path.toFile
  if (file.exists()) file.delete()
  if (!file.exists()) file.createNewFile()
  file
}
3 - We need to be able to write bytes to a file,
def writeBytesToFile(bytes: Array[Byte], file: File): Unit = {
  val fos = new FileOutputStream(file)
  fos.write(bytes)
  fos.close()
}
4 - We need to be able to read bytes back from the file,
def readBytesFromFile(file: File): Array[Byte] = {
  val fis = new FileInputStream(file)
  val bytes = new Array[Byte](file.length().toInt)
  fis.read(bytes)
  fis.close()
  bytes
}
5 - We need to be able to convert bytes back to our binary string,
def byteArrayToBinaryString(byteArray: Array[Byte]): String = {
  byteArray.map(b => b.toBinaryString).mkString("")
}
Now, we are ready to do everything we want,
// lets say we had this binary string,
scala> val binaryString = "00101110011010101010101010101"
// binaryString: String = 00101110011010101010101010101
// Now, we need to "pad" this with a leading "1" to avoid byte related issues
scala> val paddedBinaryString = "1" + binaryString
// paddedBinaryString: String = 100101110011010101010101010101
// The file which we will use for this,
scala> val file = openFile("/tmp/a_bit")
// file: java.io.File = /tmp/a_bit
// convert our padded binary string to bytes
scala> val bytes = binaryStringToByteArray(paddedBinaryString)
// bytes: Array[Byte] = Array(75, 77, 85, 85)
// write the bytes to our file,
scala> writeBytesToFile(bytes, file)
// read bytes back from file,
scala> val bytesFromFile = readBytesFromFile(file)
// bytesFromFile: Array[Byte] = Array(75, 77, 85, 85)
// so now, we have our padded string back,
scala> val paddedBinaryStringFromFile = byteArrayToBinaryString(bytes)
// paddedBinaryStringFromFile: String = 1001011100110110101011010101
// remove that "1" from the front and we have our binaryString back,
scala> val binaryStringFromFile = paddedBinaryString.tail
// binaryStringFromFile: String = 00101110011010101010101010101
NOTE :: you may have to make a few changes if you want to deal with very large "binary strings" (more than a few million characters long) to improve performance or even be usable. For example - you will need to start using Streams or Iterators instead of Array[Byte].
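As a sketch of that iterator-based direction (my own variant, not from the example above): group the incoming '0'/'1' characters in eights and fold each group into a byte, so neither the whole string nor the whole Array[Byte] has to be materialized at once. Padding and markers would still be handled as shown above.
// Convert a stream of '0'/'1' characters into bytes, 8 bits at a time.
// The last group may be shorter and is simply folded from fewer bits.
def bitsToBytes(bits: Iterator[Char]): Iterator[Byte] =
  bits.grouped(8).map { group =>
    group.foldLeft(0)((acc, c) => (acc << 1) | (c - '0')).toByte
  }
// e.g. bitsToBytes(paddedBinaryString.iterator) yields bytes lazily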

spark scala - replace text if exists in list

I have 2 datasets.
One is a dataframe with a bunch of data; one of its columns contains comments (strings).
The other is a list of words.
If a comment contains a word in the list, I want to replace the word in the comment with ##### and return the comment in full with the replaced words.
Here's some sample data:
CommentSample.txt
1 A badword small town
2 "Love the truck, though rattle is annoying."
3 Love the paint!
4
5 "Like that you added the ""oh badword2"" handle to passenger side."
6 "badword you. specific enough for you, badword3?"
7 This car is a piece if badword2
ProfanitySample.txt
badword
badword2
badword3
Here's my code so far:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Response(UniqueID: Int, Comment: String)
val response = sc.textFile("file:/data/CommentSample.txt").map(_.split("\t")).filter(_.size == 2).map(r => Response(r(0).trim.toInt, r(1).trim.toString, r(10).trim.toInt)).toDF()
var profanity = sc.textFile("file:/data/ProfanitySample.txt").map(x => (x.toLowerCase())).toArray();
def replaceProfanity(s: String): String = {
val l = s.toLowerCase()
val r = "#####"
if(profanity.contains(s))
r
else
s
}
def processComment(s: String): String = {
val commentWords = sc.parallelize(s.split(' '))
commentWords.foreach(replaceProfanity)
commentWords.collect().mkString(" ")
}
response.select(processComment("Comment")).show(100)
It compiles, it runs, but the words are not replaced.
I don't know how to debug in scala.
I'm totally new! This is my first project ever!
Many thanks for any pointers.
-M
First, I think the use case you describe here won't benefit much from the use of DataFrames - it's simpler to implement using RDDs only (DataFrames are mostly convenient when your transformations can easily be described using SQL, which isn't the case here).
So - here's a possible implementation using RDDs. This assumes the list of profanities isn't too large (i.e. up to ~thousands), so we can collect it into non-distributed memory. If that's not the case, a different approach (involving a join) might be needed.
import org.apache.spark.rdd.RDD

case class Response(UniqueID: Int, Comment: String)
val mask = "#####"
val responses: RDD[Response] = sc.textFile("file:/data/CommentSample.txt").map(_.split("\t")).filter(_.size == 2).map(r => Response(r(0).trim.toInt, r(1).trim))
val profanities: Array[String] = sc.textFile("file:/data/ProfanitySample.txt").collect()

val result = responses.map(r => {
  // using foldLeft here means we'll replace profanities one by one,
  // with the result of each replace as the input of the next,
  // starting with the original comment
  profanities.foldLeft(r.Comment)({
    case (updatedComment, profanity) => updatedComment.replaceAll(s"(?i)\\b$profanity\\b", mask)
  })
})
result.take(10).foreach(println) // just printing some examples...
Note that the case-insensitivity and the "words only" limitations are implemented in the regex itself: "(?i)\\bSomeWord\\b".
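For example, applied to one of the sample comments from the question, the regex masks whole words only and ignores case:
val mask = "#####"
val profanities = Array("badword", "badword2", "badword3")
val comment = "badword you. specific enough for you, badword3?"

val masked = profanities.foldLeft(comment) { (c, p) =>
  c.replaceAll(s"(?i)\\b$p\\b", mask)
}
// masked: String = "##### you. specific enough for you, #####?"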

Scala functional way of processing large scala data with lazy collections

I am trying to figure out memory-efficient AND functional ways to process large amounts of string data in Scala. I have read many things about lazy collections and have seen quite a few code examples. However, I run into "GC overhead limit exceeded" or "Java heap space" issues again and again.
Often the problem is that I try to construct a lazy collection, but evaluate each new element when I append it to the growing collection (I don't know any other way to do so incrementally). Of course, I could try something like initializing a lazy collection first and then yielding the collection holding the desired values by applying the resource-critical computations with map or so, but often I simply do not know the exact size of the final collection a priori, which I would need to initialize that lazy collection.
Maybe you could help me by giving me hints or explanations on how to improve the following code as an example, which splits a FASTA (definition below) formatted file into two separate files according to the rule that odd sequence pairs belong to one file and even ones to another ("separation of strands"). The "most" straightforward way to do so would be imperatively, by looping through the lines and printing into the corresponding files via open file streams (and this of course works excellently). However, I just don't enjoy the style of reassigning to variables holding header and sequences, thus the following example code uses (tail-)recursion, and I would appreciate a way to maintain a similar design without running into resource problems!
The example works perfectly for small files, but already with files at around ~500 MB the code will fail with the standard JVM setup. I do want to process files of "arbitrary" size, say 10-20 GB or so.
val fileName = args(0)
val in = io.Source.fromFile(fileName) getLines

type itType = Iterator[String]
type sType = Stream[(String, String)]

def getFullSeqs(ite: itType) = {
  //val metaChar = ">"
  val HeadPatt = "(^>)(.+)" r
  val SeqPatt = "([\\w\\W]+)" r
  @annotation.tailrec
  def rec(it: itType, out: sType = Stream[(String, String)]()): sType =
    if (it hasNext) it next match {
      case HeadPatt(_, header) =>
        // introduce new header-sequence pair
        rec(it, (header, "") #:: out)
      case SeqPatt(seq) =>
        val oldVal = out head
        // concat subsequences
        val newStream = (oldVal._1, oldVal._2 + seq) #:: out.tail
        rec(it, newStream)
      case _ =>
        println("something went wrong my friend, oh oh oh!"); Stream[(String, String)]()
    } else out
  rec(ite)
}

def printStrands(seqs: sType) {
  import java.io.PrintWriter
  import java.io.File
  def printStrand(seqse: sType, strand: Int) {
    // only use sequences of one strand
    val indices = List.tabulate(seqs.size/2)(_*2 + strand - 1).view
    val p = new PrintWriter(new File(fileName + "." + strand))
    indices foreach { i =>
      p.print(">" + seqse(i)._1 + "\n" + seqse(i)._2 + "\n")
    }; p.close
    println("Done bro!")
  }
  List(1,2).par foreach (s => printStrand(seqs, s))
}

printStrands(getFullSeqs(in))
Three questions arise for me:
A) Let's assume one needs to maintain a large data structure obtained by processing the initial iterator you get from getLines, as in my getFullSeqs method (note the different size of in and the output of getFullSeqs), because transformations on the whole(!) data are required repeatedly, because one does not know which part of the data one will require at any step. My example might not be the best, but how to do so? Is it possible at all??
B) What when the desired data structure is not inherently lazy, say one would like to store the (header -> sequence) pairs into a Map()? Would you wrap it in a lazy collection?
C) My implementation of constructing the stream might reverse the order of the inputted lines. When calling reverse, all elements will be evaluated (in my code, they already are, so this is the actual problem). Is there any way to post-process "from behind" in a lazy fashion? I know of reverseIterator, but is this already the solution, or will this not actually evaluate all elements first, too (as I would need to call it on a list)? One could construct the stream with newVal #:: rec(...), but I would lose tail-recursion then, wouldn't I?
So what I basically need is to add elements to a collection, which are not evaluated by the process of adding. So lazy val elem = "test"; elem :: lazyCollection is not what I am looking for.
EDIT: I have also tried using a by-name parameter for the stream argument in rec.
Thank you so much for your attention and time, I really appreciate any help (again :) ).
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
FASTA is defined as a sequential set of sequences delimited by a single header line. A header is defined as a line starting with ">". Every line below the header is called part of the sequence associated with the header. A sequence ends when a new header is present. Every header is unique. Example:
>HEADER1
abcdefg
>HEADER2
hijklmn
opqrstu
>HEADER3
vwxyz
>HEADER4
zyxwv
Thus, sequence 2 is twice as big as seq 1. My program would split that file into a file A containing
>HEADER1
abcdefg
>HEADER3
vwxyz
and a second file B containing
>HEADER2
hijklmn
opqrstu
>HEADER4
zyxwv
The input file is assumed to consist of an even number of header-sequence pairs.
The key to working with really large data structures is to hold in memory only that which is critical to perform whatever operation you need. So, in your case, that's
Your input file
Your two output files
The current line of text
and that's it. In some cases you may need to store information such as how long a sequence is; in such cases, you build the data structures in a first pass and use them on a second pass. Let's suppose, for example, that you decide that you want to write three files: one for even records, one for odd, and one for entries where the total length is less than 300 nucleotides. You would do something like this (warning--it compiles but I never ran it, so it may not actually work):
final def findSizes(
  data: Iterator[String], sz: Map[String,Long] = Map(),
  currentName: String = "", currentSize: Long = 0
): Map[String,Long] = {
  def currentMap = if (currentName != "") sz + (currentName -> currentSize) else sz
  if (!data.hasNext) currentMap
  else {
    val s = data.next
    if (s(0) == '>') findSizes(data, currentMap, s, 0)
    else findSizes(data, sz, currentName, currentSize + s.length)
  }
}
Then, for processing, you use that map and pass through again:
import java.io._

final def writeFiles(
  source: Iterator[String], targets: Array[PrintWriter],
  sizes: Map[String,Long], count: Int = -1, which: Int = 0
) {
  if (!source.hasNext) targets.foreach(_.close)
  else {
    val s = source.next
    if (s(0) == '>') {
      val w = if (sizes.get(s).exists(_ < 300)) 2 else (count+1)%2
      targets(w).println(s)
      writeFiles(source, targets, sizes, count+1, w)
    }
    else {
      targets(which).println(s)
      writeFiles(source, targets, sizes, count, which)
    }
  }
}
You then use Source.fromFile(f).getLines() twice to create your iterators, and you're all set. Edit: in some sense this is the key step, because this is your "lazy" collection. However, it's important not just because it doesn't read everything into memory immediately ("lazy"), but because it doesn't store any previous strings either!
More generally, Scala can't help you that much from thinking carefully about what information you need to have in memory and what you can fetch off disk as needed. Lazy evaluation can sometimes help, but there's no magic formula because you can easily express the requirement to have all your data in memory in a lazy way. Scala can't interpret your commands to access memory as, secretly, instructions to fetch stuff off the disk instead. (Well, not unless you write a library to cache results from disk which does exactly that.)
One could construct the stream with newVal #:: rec(...), but I would
lose tail-recursion then, wouldn't I?
Actually, no.
So, here's the thing... with your present tail recursion, you fill ALL of the Stream with values. Yes, Stream is lazy, but you are computing all of the elements, stripping it of any laziness.
Now say you do newVal #:: rec(...). Would you lose tail recursion? No. Why? Because you are not recursing. How come? Well, Stream is lazy, so it won't evaluate rec(...).
And that's the beauty of it. Once you do it that way, getFullSeqs returns immediately, and the "recursion" is only computed when printStrands asks for it. Unfortunately, that won't work as is...
The problem is that you are constantly modifying the Stream -- that's not how you use a Stream. With Stream, you always append to it. Don't keep "rewriting" the Stream.
Now, there are three other problems I could readily identify with printStrands. First, it calls size on seqs, which will cause the whole Stream to be processed, losing laziness. Never call size on a Stream. Second, you call apply on seqse, accessing it by index. Never call apply on a Stream (or List) -- that's highly inefficient. It's O(n), which makes your inner loop O(n^2) -- yes, quadratic in the number of headers in the input file! Finally, printStrands keeps a reference to seqs throughout the execution of printStrand, preventing processed elements from being garbage collected.
So, here's a first approximation:
def inputStreams(fileName: String): (Stream[String], Stream[String]) = {
  val in = (io.Source fromFile fileName).getLines.toStream
  val SeqPatt = "^[^>]".r
  def demultiplex(s: Stream[String], skip: Boolean): Stream[String] = {
    if (s.isEmpty) Stream.empty
    else if (skip) demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = false)
    else s.head #:: (s.tail takeWhile (SeqPatt findFirstIn _ nonEmpty)) #::: demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = true)
  }
  (demultiplex(in, skip = false), demultiplex(in, skip = true))
}
The problem with the above, and I'm showing that code just to further illustrate the issues of laziness, is that the instant you do this:
val (a, b) = inputStreams(fileName)
You'll keep a reference to the head of both streams, which prevents garbage collecting them. You can't keep a reference to them, so you have to consume them as soon as you get them, without ever storing them in a "val" or "lazy val". A "var" might do, but it would be tricky to handle. So let's try this instead:
def inputStreams(fileName: String): Vector[Stream[String]] = {
  val in = (io.Source fromFile fileName).getLines.toStream
  val SeqPatt = "^[^>]".r
  def demultiplex(s: Stream[String], skip: Boolean): Stream[String] = {
    if (s.isEmpty) Stream.empty
    else if (skip) demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = false)
    else s.head #:: (s.tail takeWhile (SeqPatt findFirstIn _ nonEmpty)) #::: demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = true)
  }
  Vector(demultiplex(in, skip = false), demultiplex(in, skip = true))
}

inputStreams(fileName).zipWithIndex.par.foreach {
  case (stream, strand) =>
    val p = new PrintWriter(new File("FASTA" + "." + strand))
    stream foreach p.println
    p.close
}
That still doesn't work, because stream inside inputStreams works as a reference, keeping the whole stream in memory even while they are printed.
So, having failed again, what do I recommend? Keep it simple.
def in = (scala.io.Source fromFile fileName).getLines.toStream

def inputStream(in: Stream[String], strand: Int = 1): Stream[(String, Int)] = {
  if (in.isEmpty) Stream.empty
  else if (in.head startsWith ">") (in.head, 1 - strand) #:: inputStream(in.tail, 1 - strand)
  else (in.head, strand) #:: inputStream(in.tail, strand)
}

val printers = Array.tabulate(2)(i => new PrintWriter(new File("FASTA" + "." + i)))

inputStream(in) foreach {
  case (line, strand) => printers(strand) println line
}

printers foreach (_.close)
Now this won't keep anymore in memory than necessary. I still think it's too complex, however. This can be done more easily like this:
def in = (scala.io.Source fromFile fileName).getLines

val printers = Array.tabulate(2)(i => new PrintWriter(new File("FASTA" + "." + i)))

def printStrands(in: Iterator[String], strand: Int = 1) {
  if (in.hasNext) {
    val next = in.next
    if (next startsWith ">") {
      printers(1 - strand).println(next)
      printStrands(in, 1 - strand)
    } else {
      printers(strand).println(next)
      printStrands(in, strand)
    }
  }
}

printStrands(in)
printers foreach (_.close)
Or just use a while loop instead of recursion.
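For reference, a minimal while-loop equivalent of the recursive version above (same printers array, same strand-flipping logic; printStrandsLoop is just an illustrative name):
def printStrandsLoop(in: Iterator[String]): Unit = {
  var strand = 1
  while (in.hasNext) {
    val next = in.next()
    if (next startsWith ">") strand = 1 - strand // a header flips the target file
    printers(strand).println(next)
  }
}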
Now, to the other questions:
B) It might make sense to do so while reading it, so that you do not have to keep two copies of the data: the Map and a Seq.
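For instance, a rough sketch (my own illustration, assuming the file starts with a header line) of building the Map directly while reading, without a separate lazy collection:
def buildHeaderMap(lines: Iterator[String]): Map[String, String] = {
  // accumulate (header -> concatenated sequence) in one pass over the lines
  val acc = scala.collection.mutable.Map.empty[String, String]
  var current = ""
  for (line <- lines) {
    if (line startsWith ">") { current = line.tail; acc(current) = "" }
    else acc(current) = acc(current) + line
  }
  acc.toMap
}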
C) Don't reverse a Stream -- you'll lose all of its laziness.