Spark: Transforming JSON files to correct format - scala

I've 100+ million records stored in files with the following JSON structure (real data has way more columns, rows and is also nested)
{"id":"2-2-3","key":"value"}{"id":"2-2-3","key":"value"}{"id":"2-2-3","key":"value"}{"id":"2-2-3","key":"value"}{"id":"2-2-3","key":"value"}
The sqlContext.read.json function is unable to parse this since the records aren't on multiple lines but on one big line. The solution below solves this problem, but is a big performance killer. What would be the best way, performance wise, to handle this issue in Apache Spark?
val rdd = sc.wholeTextFiles("s3://some-bucket/**/*")
val validJSON = rdd.flatMap(_._2.replace("}{", "}\n{").split("\n"))
val df = sqlContext.read.json(validJSON)
df.count()
df.select("id").show()

This is a riff on Antot's answer, which should handle nested JSON
input.toVector
.foldLeft((false, Vector.empty[Char], Vector.empty[String])) {
case ((true, charAccum, strAccum), '{') => (false, Vector('{'), strAccum :+ charAccum.mkString);
case ((_, charAccum, strAccum), '}') => (true, charAccum :+ '}', strAccum);
case ((_, charAccum, strAccum), char) => (false, charAccum :+ char, strAccum)
}
._3
Basically what it does is split the data into a Vector[Char], and uses foldLeft to aggregate the input into substrings. The trick is to keep track of just enough information about the previous character to figure out if a { marks the start of a new object.
I used this input to test it (basically the OP's sample input, with a nested object thrown in):
val input = """{"id":"2-2-3","key":{ "test": "value"}}{"id":"2-2-3","key":"value"}{"id":"2-2-3","key":"value"}{"id":"2-2-3","key":"value"}{"id":"2-2-3","key":"value"}"""
And got this result, which looks good:
Vector({"id":"2-2-3","key":{ "test": "value"}},
{"id":"2-2-3","key":"value"},
{"id":"2-2-3","key":"value"},
{"id":"2-2-3","key":"value"})

The problem with the original approach is the call _._2.replace("}{", "}\n{", which creates another huge string from the input one, with new line chars inserted, which is then split once again into an array.
An improvement is possible by minimizing the creation of intermediate strings and retrieving the target ones as soon as possible. For this, we can play a bit with substrings:
val validJson = rdd.flatMap(rawJson => {
// functions extracted to make it more readable.
def nextObjectStartIndex(fromIndex: Int):Int = rawJson._2.indexOf('{', fromIndex)
def currObjectEndIndex(fromIndex: Int): Int = rawJson._2.indexOf('}', fromIndex)
def extractObject(fromIndex: Int, toIndex: Int): String = rawJson._2.substring(fromIndex, toIndex + 1)
// the resulting strings are put in a local buffer
val buffer = new ListBuffer[String]()
// init the scanning of the input string
var posStartNextObject = nextObjectStartIndex(0)
// main loop terminates when there are no more '{' chars
while (posStartNextObject != -1) {
val posEndObject = currObjectEndIndex(posStartNextObject)
val extractedObject = extractObject(posStartNextObject, posEndObject)
posStartNextObject = nextObjectStartIndex(posEndObject)
buffer += extractedObject
}
buffer
})
Please note that this approach would work only if the objects in the input JSON are not nested, supposing that all curly braces separate objects of same level.

Related

Extracting data from RDD in Scala/Spark

So I have a large dataset that is a sample of a stackoverflow userbase. One line from this dataset is as follows:
<row Id="42" Reputation="11849" CreationDate="2008-08-01T13:00:11.640" DisplayName="Coincoin" LastAccessDate="2014-01-18T20:32:32.443" WebsiteUrl="" Location="Montreal, Canada" AboutMe="A guy with the attention span of a dead goldfish who has been having a blast in the industry for more than 10 years.
Mostly specialized in game and graphics programming, from custom software 3D renderers to accelerated hardware pipeline programming." Views="648" UpVotes="337" DownVotes="40" Age="35" AccountId="33" />
I would like to extract the number from reputation, in this case it is "11849" and the number from age, in this example it is "35" I would like to have them as floats.
The file is located in a HDFS so it comes in the format RDD
val linesWithAge = lines.filter(line => line.contains("Age=")) //This is filtering data which doesnt have age
val repSplit = linesWithAge.flatMap(line => line.split("\"")) //Here I am trying to split the data where there is a "
so when I split it with quotation marks the reputation is in index 3 and age in index 23 but how do I assign these to a map or a variable so I can use them as floats.
Also I need it to do this for every line on the RDD.
EDIT:
val linesWithAge = lines.filter(line => line.contains("Age=")) //transformations from the original input data
val repSplit = linesWithAge.flatMap(line => line.split("\""))
val withIndex = repSplit.zipWithIndex
val indexKey = withIndex.map{case (k,v) => (v,k)}
val b = indexKey.lookup(3)
println(b)
So if added an index to the array and now I've successfully managed to assign it to a variable but I can only do it to one item in the RDD, does anyone know how I could do it to all items?
What we want to do is to transform each element in the original dataset (represented as an RDD) into a tuple containing (Reputation, Age) as numeric values.
One possible approach is to transform each element of the RDD using String operations in order to extract the values of the elements "Age" and "Reputation", like this:
// define a function to extract the value of an element, given the name
def findElement(src: Array[String], name:String):Option[String] = {
for {
entry <- src.find(_.startsWith(name))
value <- entry.split("\"").lift(1)
} yield value
}
We then use that function to extract the interesting values from every record:
val reputationByAge = lines.flatMap{line =>
val elements = line.split(" ")
for {
age <- findElement(elements, "Age")
rep <- findElement(elements, "Reputation")
} yield (rep.toInt, age.toInt)
}
Note how we don't need to filter on "Age" before doing this. If we process a record that does not have "Age" or "Reputation", findElement will return None. Henceforth the result of the for-comprehension will be None and the record will be flattened by the flatMap operation.
A better way to approach this problem is by realizing that we are dealing with structured XML data. Scala provides built-in support for XML, so we can do this:
import scala.xml.XML
import scala.xml.XML._
// help function to map Strings to Option where empty strings become None
def emptyStrToNone(str:String):Option[String] = if (str.isEmpty) None else Some(str)
val xmlReputationByAge = lines.flatMap{line =>
val record = XML.loadString(line)
for {
rep <- emptyStrToNone((record \ "#Reputation").text)
age <- emptyStrToNone((record \ "#Age").text)
} yield (rep.toInt, age.toInt)
}
This method relies on the structure of the XML record to extract the right attributes. As before, we use the combination of Option values and flatMap to remove records that do not contain all the information we require.
First, you need a function which extracts the value for a given key of your line (getValueForKeyAs[T]), then do:
val rdd = linesWithAge.map(line => (getValueForKeyAs[Float](line,"Age"), getValueForKeyAs[Float](line,"Reputation")))
This should give you an rdd of type RDD[(Float,Float)]
getValueForKeyAs could be implemented like this:
def getValueForKeyAs[A](line:String, key:String) : A = {
val res = line.split(key+"=")
if(res.size==1) throw new RuntimeException(s"no value for key $key")
val value = res(1).split("\"")(1)
return value.asInstanceOf[A]
}

spark scala - replace text if exists in list

I have 2 datasets.
One is a dataframe with a bunch of data, one column has comments (a string).
The other is a list of words.
If a comment contains a word in the list, I want to replace the word in the comment with ##### and return the comment in full with the replaced words.
Here's some sample data:
CommentSample.txt
1 A badword small town
2 "Love the truck, though rattle is annoying."
3 Love the paint!
4
5 "Like that you added the ""oh badword2"" handle to passenger side."
6 "badword you. specific enough for you, badword3?"
7 This car is a piece if badword2
ProfanitySample.txt
badword
badword2
badword3
Here's my code so far:
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
case class Response(UniqueID: Int, Comment: String)
val response = sc.textFile("file:/data/CommentSample.txt").map(_.split("\t")).filter(_.size == 2).map(r => Response(r(0).trim.toInt, r(1).trim.toString, r(10).trim.toInt)).toDF()
var profanity = sc.textFile("file:/data/ProfanitySample.txt").map(x => (x.toLowerCase())).toArray();
def replaceProfanity(s: String): String = {
val l = s.toLowerCase()
val r = "#####"
if(profanity.contains(s))
r
else
s
}
def processComment(s: String): String = {
val commentWords = sc.parallelize(s.split(' '))
commentWords.foreach(replaceProfanity)
commentWords.collect().mkString(" ")
}
response.select(processComment("Comment")).show(100)
It compiles, it runs, but the words are not replaced.
I don't know how to debug in scala.
I'm totally new! This is my first project ever!
Many thanks for any pointers.
-M
First, I think the usecase you describe here won't benefit much from the use of DataFrames - it's simpler to implement using RDDs only (DataFrames are mostly convenient when your transformations can easily be described using SQL, which isn't the case here).
So - here's a possible implementation using RDDs. This assumes the list of profanities isn't too large (i.e. up to ~thousands), so we can collect it into non-distributed memory. If that's not the case, a different approach (involving a join) might be needed.
case class Response(UniqueID: Int, Comment: String)
val mask = "#####"
val responses: RDD[Response] = sc.textFile("file:/data/CommentSample.txt").map(_.split("\t")).filter(_.size == 2).map(r => Response(r(0).trim.toInt, r(1).trim))
val profanities: Array[String] = sc.textFile("file:/data/ProfanitySample.txt").collect()
val result = responses.map(r => {
// using foldLeft here means we'll replace profanities one by one,
// with the result of each replace as the input of the next,
// starting with the original comment
profanities.foldLeft(r.Comment)({
case (updatedComment, profanity) => updatedComment.replaceAll(s"(?i)\\b$profanity\\b", mask)
})
})
result.take(10).foreach(println) // just printing some examples...
Note that the case-insensitivity and the "words only" limitations are implemented in the regex itself: "(?i)\\bSomeWord\\b".

How to read input from a file and convert data lines of the file to List[Map[Int,String]] using scala?

My Query is, read input from a file and convert data lines of the file to List[Map[Int,String]] using scala. Here I give a dataset as the input. My code is,
def id3(attrs: Attributes,
examples: List[Example],
label: Symbol
) : Node = {
level = level+1
// if all the examples have the same label, return a new node with that label
if(examples.forall( x => x(label) == examples(0)(label))){
new Leaf(examples(0)(label))
} else {
for(a <- attrs.keySet-label){ //except label, take all attrs
("Information gain for %s is %f".format(a,
informationGain(a,attrs,examples,label)))
}
// find the best splitting attribute - this is an argmax on a function over the list
var bestAttr:Symbol = argmax(attrs.keySet-label, (x:Symbol) =>
informationGain(x,attrs,examples,label))
// now we produce a new branch, which splits on that node, and recurse down the nodes.
var branch = new Branch(bestAttr)
for(v <- attrs(bestAttr)){
val subset = examples.filter(x=> x(bestAttr)==v)
if(subset.size == 0){
// println(levstr+"Tiny subset!")
// zero subset, we replace with a leaf labelled with the most common label in
// the examples
val m = examples.map(_(label))
val mostCommonLabel = m.toSet.map((x:Symbol) => (x,m.count(_==x))).maxBy(_._2)._1
branch.add(v,new Leaf(mostCommonLabel))
}
else {
// println(levstr+"Branch on %s=%s!".format(bestAttr,v))
branch.add(v,id3(attrs,subset,label))
}
}
level = level-1
branch
}
}
}
object samplet {
def main(args: Array[String]){
var attrs: sample.Attributes = Map()
attrs += ('0 -> Set('abc,'nbv,'zxc))
attrs += ('1 -> Set('def,'ftr,'tyh))
attrs += ('2 -> Set('ghi,'azxc))
attrs += ('3 -> Set('jkl,'fds))
attrs += ('4 -> Set('mno,'nbh))
val examples: List[sample.Example] = List(
Map(
'0 -> 'abc,
'1 -> 'def,
'2 -> 'ghi,
'3 'jkl,
'4 -> 'mno
),
........................
)
// obviously we can't use the label as an attribute, that would be silly!
val label = 'play
println(sample.try(attrs,examples,label).getStr(0))
}
}
But How I change this code to - accepting input from a .csv file?
I suggest you use Java's io / nio standard library to read your CSV file. I think there is no relevant drawback in doing so.
But the first question we need to answer is where to read the file in the code? The parsed input seems to replace the value of examples. This fact also hints us what type the parsed CSV input must have, namely List[Map[Symbol, Symbol]]. So let us declare a new class
class InputFromCsvLoader(charset: Charset = Charset.defaultCharset()) {
def getInput(file: Path): List[Map[Symbol, Symbol]] = ???
}
Note that the Charset is only needed if we must distinguish between differently encoded CSV-files.
Okay, so how do we implement the method? It should do the following:
Create an appropriate input reader
Read all lines
Split each line at the comma-separator
Transform each substring into the symbol it represents
Build a map from from the list of symbols, using the attributes as key
Create and return the list of maps
Or expressed in code:
class InputFromCsvLoader(charset: Charset = Charset.defaultCharset()) {
val Attributes = List('outlook, 'temperature, 'humidity, 'wind, 'play)
val Separator = ","
/** Get the desired input from the CSV file. Does not perform any checks, i.e., there are no guarantees on what happens if the input is malformed. */
def getInput(file: Path): List[Map[Symbol, Symbol]] = {
val reader = Files.newBufferedReader(file, charset)
/* Read the whole file and discard the first line */
inputWithHeader(reader).tail
}
/** Reads all lines in the CSV file using [[java.io.BufferedReader]] There are many ways to do this and this is probably not the prettiest. */
private def inputWithHeader(reader: BufferedReader): List[Map[Symbol, Symbol]] = {
(JavaConversions.asScalaIterator(reader.lines().iterator()) foldLeft Nil.asInstanceOf[List[Map[Symbol, Symbol]]]){
(accumulator, nextLine) =>
parseLine(nextLine) :: accumulator
}.reverse
}
/** Parse an entry. Does not verify the input: If there are less attributes than columns or vice versa, zip creates a list of the size of the shorter list */
private def parseLine(line: String): Map[Symbol, Symbol] = (Attributes zip (line split Separator map parseSymbol)).toMap
/** Create a symbol from a String... we could also check whether the string represents a valid symbol */
private def parseSymbol(symbolAsString: String): Symbol = Symbol(symbolAsString)
}
Caveat: Expecting only valid input, we are certain that the individual symbol representations do not contain the comma-separation character. If this cannot be assumed, then the code as is would fail to split certain valid input strings.
To use this new code, we could change the main-method as follows:
def main(args: Array[String]){
val csvInputFile: Option[Path] = args.headOption map (p => Paths get p)
val examples = (csvInputFile map new InputFromCsvLoader().getInput).getOrElse(exampleInput)
// ... your code
Here, examples uses the value exampleInput, which is the current, hardcoded value of examples if no input argument is specified.
Important: In the code, all error handling has been omitted for convenience. In most cases, errors can occur when reading from files and user input must be considered invalid, so sadly, error handling at the boundaries of your program is usally not optional.
Side-notes:
Try not to use null in your code. Returning Option[T] is a better option than returning null, because it makes "nullness" explicit and provides static safety thanks to the type-system.
The return-keyword is not required in Scala, as the last value of a method is always returned. You can still use the keyword if you find the code more readable or if you want to break in the middle of your method (which is usually a bad idea).
Prefer val over var, because immutable values are much easier to understand than mutable values.
The code will fail with the provided CSV string, because it contains the symbols TRUE and FALSE which are not legal according to your programs logic (they should be true and false instead).
Add all information to your error-messages. Your error message only tells me what that a value for the attribute 'wind is bad, but it does not tell me what the actual value is.
Read a csv file ,
val datalines = Source.fromFile(filepath).getLines()
So this datalines contains all the lines from the csv file.
Next, convert each line into a Map[Int,String]
val datamap = datalines.map{ line =>
line.split(",").zipWithIndex.map{ case (word, idx) => idx -> word}.toMap
}
Here, we split each line with ",". Then construct a map with key as column number and value as each word after the split.
Next, If we want List[Map[Int,String]],
val datamap = datalines.map{ line =>
line.split(",").zipWithIndex.map{ case (word, idx) => idx -> word}.toMap
}.toList

How do I iterate RDD's in apache spark (scala)

I use the following command to fill an RDD with a bunch of arrays containing 2 strings ["filename", "content"].
Now I want to iterate over every of those occurrences to do something with every filename and content.
val someRDD = sc.wholeTextFiles("hdfs://localhost:8020/user/cloudera/*")
I can't seem to find any documentation on how to do this however.
So what I want is this:
foreach occurrence-in-the-rdd{
//do stuff with the array found on loccation n of the RDD
}
You call various methods on the RDD that accept functions as parameters.
// set up an example -- an RDD of arrays
val sparkConf = new SparkConf().setMaster("local").setAppName("Example")
val sc = new SparkContext(sparkConf)
val testData = Array(Array(1,2,3), Array(4,5,6,7,8))
val testRDD = sc.parallelize(testData, 2)
// Print the RDD of arrays.
testRDD.collect().foreach(a => println(a.size))
// Use map() to create an RDD with the array sizes.
val countRDD = testRDD.map(a => a.size)
// Print the elements of this new RDD.
countRDD.collect().foreach(a => println(a))
// Use filter() to create an RDD with just the longer arrays.
val bigRDD = testRDD.filter(a => a.size > 3)
// Print each remaining array.
bigRDD.collect().foreach(a => {
a.foreach(e => print(e + " "))
println()
})
}
Notice that the functions you write accept a single RDD element as input, and return data of some uniform type, so you create an RDD of the latter type. For example, countRDD is an RDD[Int], while bigRDD is still an RDD[Array[Int]].
It will probably be tempting at some point to write a foreach that modifies some other data, but you should resist for reasons described in this question and answer.
Edit: Don't try to print large RDDs
Several readers have asked about using collect() and println() to see their results, as in the example above. Of course, this only works if you're running in an interactive mode like the Spark REPL (read-eval-print-loop.) It's best to call collect() on the RDD to get a sequential array for orderly printing. But collect() may bring back too much data and in any case too much may be printed. Here are some alternative ways to get insight into your RDDs if they're large:
RDD.take(): This gives you fine control on the number of elements you get but not where they came from -- defined as the "first" ones which is a concept dealt with by various other questions and answers here.
// take() returns an Array so no need to collect()
myHugeRDD.take(20).foreach(a => println(a))
RDD.sample(): This lets you (roughly) control the fraction of results you get, whether sampling uses replacement, and even optionally the random number seed.
// sample() does return an RDD so you may still want to collect()
myHugeRDD.sample(true, 0.01).collect().foreach(a => println(a))
RDD.takeSample(): This is a hybrid: using random sampling that you can control, but both letting you specify the exact number of results and returning an Array.
// takeSample() returns an Array so no need to collect()
myHugeRDD.takeSample(true, 20).foreach(a => println(a))
RDD.count(): Sometimes the best insight comes from how many elements you ended up with -- I often do this first.
println(myHugeRDD.count())
The fundamental operations in Spark are map and filter.
val txtRDD = someRDD filter { case(id, content) => id.endsWith(".txt") }
the txtRDD will now only contain files that have the extension ".txt"
And if you want to word count those files you can say
//split the documents into words in one long list
val words = txtRDD flatMap { case (id,text) => text.split("\\s+") }
// give each word a count of 1
val wordT = words map (x => (x,1))
//sum up the counts for each word
val wordCount = wordsT reduceByKey((a, b) => a + b)
You want to use mapPartitions when you have some expensive initialization you need to perform -- for example, if you want to do Named Entity Recognition with a library like the Stanford coreNLP tools.
Master map, filter, flatMap, and reduce, and you are well on your way to mastering Spark.
I would try making use of a partition mapping function. The code below shows how an entire RDD dataset can be processed in a loop so that each input goes through the very same function. I am afraid I have no knowledge about Scala, so everything I have to offer is java code. However, it should not be very difficult to translate it into scala.
JavaRDD<String> res = file.mapPartitions(new FlatMapFunction <Iterator<String> ,String>(){
#Override
public Iterable<String> call(Iterator <String> t) throws Exception {
ArrayList<String[]> tmpRes = new ArrayList <>();
String[] fillData = new String[2];
fillData[0] = "filename";
fillData[1] = "content";
while(t.hasNext()){
tmpRes.add(fillData);
}
return Arrays.asList(tmpRes);
}
}).cache();
what the wholeTextFiles return is a Pair RDD:
def wholeTextFiles(path: String, minPartitions: Int): RDD[(String, String)]
Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI. Each file is read as a single record and returned in a key-value pair, where the key is the path of each file, the value is the content of each file.
Here is an example of reading the files at a local path then printing every filename and content.
val conf = new SparkConf().setAppName("scala-test").setMaster("local")
val sc = new SparkContext(conf)
sc.wholeTextFiles("file:///Users/leon/Documents/test/")
.collect
.foreach(t => println(t._1 + ":" + t._2));
the result:
file:/Users/leon/Documents/test/1.txt:{"name":"tom","age":12}
file:/Users/leon/Documents/test/2.txt:{"name":"john","age":22}
file:/Users/leon/Documents/test/3.txt:{"name":"leon","age":18}
or converting the Pair RDD to a RDD first
sc.wholeTextFiles("file:///Users/leon/Documents/test/")
.map(t => t._2)
.collect
.foreach { x => println(x)}
the result:
{"name":"tom","age":12}
{"name":"john","age":22}
{"name":"leon","age":18}
And I think wholeTextFiles is more compliant for small files.
for (element <- YourRDD)
{
// do what you want with element in each iteration, and if you want the index of element, simply use a counter variable in this loop beginning from 0
println (element._1) // this will print all filenames
}

Scala functional way of processing large scala data with lazy collections

I am trying to figure out memory-efficient AND functional ways to process a large scale of data using strings in scala. I have read many things about lazy collections and have seen quite a bit of code examples. However, I run into "GC overhead exceeded" or "Java heap space" issues again and again.
Often the problem is that I try to construct a lazy collection, but evaluate each new element when I append it to the growing collection (I don't now any other way to do so incrementally). Of course, I could try something like initializing an initial lazy collection first and and yield the collection holding the desired values by applying the ressource-critical computations with map or so, but often I just simply do not know the exact size of the final collection a priori to initial that lazy collection.
Maybe you could help me by giving me hints or explanations on how to improve following code as an example, which splits a FASTA (definition below) formatted file into two separate files according to the rule that odd sequence pairs belong to one file and even ones to aother one ("separation of strands"). The "most" straight-forward way to do so would be in a imperative way by looping through the lines and printing into the corresponding files via open file streams (and this of course works excellent). However, I just don't enjoy the style of reassigning to variables holding header and sequences, thus the following example code uses (tail-)recursion, and I would appreciate to have found a way to maintain a similar design without running into ressource problems!
The example works perfectly for small files, but already with files at around ~500mb the code will fail with the standard JVM setups. I do want to process files of "arbitray" size, say 10-20gb or so.
val fileName = args(0)
val in = io.Source.fromFile(fileName) getLines
type itType = Iterator[String]
type sType = Stream[(String, String)]
def getFullSeqs(ite: itType) = {
//val metaChar = ">"
val HeadPatt = "(^>)(.+)" r
val SeqPatt = "([\\w\\W]+)" r
#annotation.tailrec
def rec(it: itType, out: sType = Stream[(String, String)]()): sType =
if (it hasNext) it next match {
case HeadPatt(_,header) =>
// introduce new header-sequence pair
rec(it, (header, "") #:: out)
case SeqPatt(seq) =>
val oldVal = out head
// concat subsequences
val newStream = (oldVal._1, oldVal._2 + seq) #:: out.tail
rec(it, newStream)
case _ =>
println("something went wrong my friend, oh oh oh!"); Stream[(String, String)]()
} else out
rec(ite)
}
def printStrands(seqs: sType) {
import java.io.PrintWriter
import java.io.File
def printStrand(seqse: sType, strand: Int) {
// only use sequences of one strand
val indices = List.tabulate(seqs.size/2)(_*2 + strand - 1).view
val p = new PrintWriter(new File(fileName + "." + strand))
indices foreach { i =>
p.print(">" + seqse(i)._1 + "\n" + seqse(i)._2 + "\n")
}; p.close
println("Done bro!")
}
List(1,2).par foreach (s => printStrand(seqs, s))
}
printStrands(getFullSeqs(in))
Three questions arise for me:
A) Let's assume one needs to maintain a large data structure obtained by processing the initial iterator you get from getLines like in my getFullSeqs method (note the different size of in and the output of getFullSeqs), because transformations on the whole(!) data is required repeatedly, because one does not know which part of the data one will require at any step. My example might not be the best, but how to do so? Is it possible at all??
B) What when the desired data structure is not inherently lazy, say one would like to store the (header -> sequence) pairs into a Map()? Would you wrap it in a lazy collection?
C) My implementation of constructing the stream might reverse the order of the inputted lines. When calling reverse, all elements will be evaluated (in my code, they already are, so this is the actual problem). Is there any way to post-process "from behind" in a lazy fashion? I know of reverseIterator, but is this already the solution, or will this not actually evaluate all elements first, too (as I would need to call it on a list)? One could construct the stream with newVal #:: rec(...), but I would lose tail-recursion then, wouldn't I?
So what I basically need is to add elements to a collection, which are not evaluated by the process of adding. So lazy val elem = "test"; elem :: lazyCollection is not what I am looking for.
EDIT: I have also tried using by-name parameter for the stream argument in rec .
Thank you so much for your attention and time, I really appreciate any help (again :) ).
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
FASTA is defined as a sequential set of sequences delimited by a single header line. A header is defined as a line starting with ">". Every line below the header is called part of the sequence associated with the header. A sequence ends when a new header is present. Every header is unique. Example:
>HEADER1
abcdefg
>HEADER2
hijklmn
opqrstu
>HEADER3
vwxyz
>HEADER4
zyxwv
Thus, sequence 2 is twice as big as seq 1. My program would split that file into a file A containing
>HEADER1
abcdefg
>HEADER3
vwxyz
and a second file B containing
>HEADER2
hijklmn
opqrstu
>HEADER4
zyxwv
The input file is assumed to consist of an even number of header-sequence pairs.
The key to working with really large data structures is to hold in memory only that which is critical to perform whatever operation you need. So, in your case, that's
Your input file
Your two output files
The current line of text
and that's it. In some cases you can need to store information such as how long a sequence is; in such events, you build the data structures in a first pass and use them on a second pass. Let's suppose, for example, that you decide that you want to write three files: one for even records, one for odd, and one for entries where the total length is less than 300 nucleotides. You would do something like this (warning--it compiles but I never ran it, so it may not actually work):
final def findSizes(
data: Iterator[String], sz: Map[String,Long] = Map(),
currentName: String = "", currentSize: Long = 0
): Map[String,Long] = {
def currentMap = if (currentName != "") sz + (currentName->currentSize) else sz
if (!data.hasNext) currentMap
else {
val s = data.next
if (s(0) == '>') findSizes(data, currentMap, s, 0)
else findSizes(data, sz, currentName, currentSize + s.length)
}
}
Then, for processing, you use that map and pass through again:
import java.io._
final def writeFiles(
source: Iterator[String], targets: Array[PrintWriter],
sizes: Map[String,Long], count: Int = -1, which: Int = 0
) {
if (!source.hasNext) targets.foreach(_.close)
else {
val s = source.next
if (s(0) == '>') {
val w = if (sizes.get(s).exists(_ < 300)) 2 else (count+1)%2
targets(w).println(s)
writeFiles(source, targets, sizes, count+1, w)
}
else {
targets(which).println(s)
writeFiles(source, targets, sizes, count, which)
}
}
}
You then use Source.fromFile(f).getLines() twice to create your iterators, and you're all set. Edit: in some sense this is the key step, because this is your "lazy" collection. However, it's not important just because it doesn't read all memory in immediately ("lazy"), but because it doesn't store any previous strings either!
More generally, Scala can't help you that much from thinking carefully about what information you need to have in memory and what you can fetch off disk as needed. Lazy evaluation can sometimes help, but there's no magic formula because you can easily express the requirement to have all your data in memory in a lazy way. Scala can't interpret your commands to access memory as, secretly, instructions to fetch stuff off the disk instead. (Well, not unless you write a library to cache results from disk which does exactly that.)
One could construct the stream with newVal #:: rec(...), but I would
lose tail-recursion then, wouldn't I?
Actually, no.
So, here's the thing... with your present tail recursion, you fill ALL of the Stream with values. Yes, Stream is lazy, but you are computing all of the elements, stripping it of any laziness.
Now say you do newVal #:: rec(...). Would you lose tail recursion? No. Why? Because you are not recursing. How come? Well, Stream is lazy, so it won't evaluate rec(...).
And that's the beauty of it. Once you do it that way, getFullSeqs returns on the first interaction, and only compute the "recursion" when printStrands asks for it. Unfortunately, that won't work as is...
The problem is that you are constantly modifying the Stream -- that's not how you use a Stream. With Stream, you always append to it. Don't keep "rewriting" the Stream.
Now, there are three other problems I could readily identify with printStrands. First, it calls size on seqs, which will cause the whole Stream to be processed, losing lazyness. Never call size on a Stream. Second, you call apply on seqse, accessing it by index. Never call apply on a Stream (or List) -- that's highly inefficient. It's O(n), which makes your inner loop O(n^2) -- yes, quadratic on the number of headers in the input file! Finally, printStrands keeps a reference to seqs throughout the execution of printStrand, preventing processing elements from being garbage collected.
So, here's a first approximation:
def inputStreams(fileName: String): (Stream[String], Stream[String]) = {
val in = (io.Source fromFile fileName).getLines.toStream
val SeqPatt = "^[^>]".r
def demultiplex(s: Stream[String], skip: Boolean): Stream[String] = {
if (s.isEmpty) Stream.empty
else if (skip) demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = false)
else s.head #:: (s.tail takeWhile (SeqPatt findFirstIn _ nonEmpty)) #::: demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = true)
}
(demultiplex(in, skip = false), demultiplex(in, skip = true))
}
The problem with the above, and I'm showing that code just to further guide in the issues of lazyness, is that the instant you do this:
val (a, b) = inputStreams(fileName)
You'll keep a reference to the head of both streams, which prevents garbage collecting them. You can't keep a reference to them, so you have to consume them as soon as you get them, without ever storing them in a "val" or "lazy val". A "var" might do, but it would be tricky to handle. So let's try this instead:
def inputStreams(fileName: String): Vector[Stream[String]] = {
val in = (io.Source fromFile fileName).getLines.toStream
val SeqPatt = "^[^>]".r
def demultiplex(s: Stream[String], skip: Boolean): Stream[String] = {
if (s.isEmpty) Stream.empty
else if (skip) demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = false)
else s.head #:: (s.tail takeWhile (SeqPatt findFirstIn _ nonEmpty)) #::: demultiplex(s.tail dropWhile (SeqPatt findFirstIn _ nonEmpty), skip = true)
}
Vector(demultiplex(in, skip = false), demultiplex(in, skip = true))
}
inputStreams(fileName).zipWithIndex.par.foreach {
case (stream, strand) =>
val p = new PrintWriter(new File("FASTA" + "." + strand))
stream foreach p.println
p.close
}
That still doesn't work, because stream inside inputStreams works as a reference, keeping the whole stream in memory even while they are printed.
So, having failed again, what do I recommend? Keep it simple.
def in = (scala.io.Source fromFile fileName).getLines.toStream
def inputStream(in: Stream[String], strand: Int = 1): Stream[(String, Int)] = {
if (in.isEmpty) Stream.empty
else if (in.head startsWith ">") (in.head, 1 - strand) #:: inputStream(in.tail, 1 - strand)
else (in.head, strand) #:: inputStream(in.tail, strand)
}
val printers = Array.tabulate(2)(i => new PrintWriter(new File("FASTA" + "." + i)))
inputStream(in) foreach {
case (line, strand) => printers(strand) println line
}
printers foreach (_.close)
Now this won't keep anymore in memory than necessary. I still think it's too complex, however. This can be done more easily like this:
def in = (scala.io.Source fromFile fileName).getLines
val printers = Array.tabulate(2)(i => new PrintWriter(new File("FASTA" + "." + i)))
def printStrands(in: Iterator[String], strand: Int = 1) {
if (in.hasNext) {
val next = in.next
if (next startsWith ">") {
printers(1 - strand).println(next)
printStrands(in, 1 - strand)
} else {
printers(strand).println(next)
printStrands(in, strand)
}
}
}
printStrands(in)
printers foreach (_.close)
Or just use a while loop instead of recursion.
Now, to the other questions:
B) It might make sense to do so while reading it, so that you do not have to keep two copies of the data: the Map and a Seq.
C) Don't reverse a Stream -- you'll lose all of its laziness.