Efficient iteration of large text file - scala

So I have the following:
val bufferedSource = io.Source.fromFile("""C:\Users\something\workspace\""" + fileName)
val lines = bufferedSource.getLines
I would like to select at random, a start index and an end index and iterate through lines in this range while printing to a new file. Is there a way to access the elements in lines iterator by index?
My first attempt was to copy over the data to a ListBuffer:
var lineArr = ListBuffer[String]()
for (line <- lines) {
lineArr += line
}
Therafter if I iterate through lineArr in my range, by index, it is really slow.
In what way could I do this efficiently?
Sidenote: If I iterate through lines which contains all elements (which I do not want) it is fast to iterate while writing them to a new file, however I only want a select amount to write.

So instead of iterating through each line I solved this problem by using slicing. I still create a ListBuffer but I slice it on the start and end index:
lineArrTemp = lineArrTemp.slice(start, end)
And thereafter simply iterate through the ListBuffer iterator, it is efficient.

lines
.drop(startIndex)
.take(endIndex - startIndex)
.foreach(writeToFile)

Consider also zipWithIndex on the iterator, hence we can sophisticate the line selection based in the index value; for instance select even indexed lines with
io.Source.fromFile("temp.txt").getLines.zipWithIndex.foreach {
case (line,i) => if (i % 2 == 0) println(line)
}
Here we index one line at a time as we iterate over the file once only.

Related

Filling in desired lines in Scala

I currently have a value of result that is a string which represents cycles in a graph
> scala result
String =
0:0->52->22;
5:5->70->77;
8:8->66->24;8->42->32;
. //
. // trimmed to get by point across
. //
71:71->40->45;
77:77->34->28;77->5->70;
84:84->22->29
However, I want to have the output have the numbers in between be included and up to a certain value included. The example code would have value = 90
0:0->52->22;
1:
2:
3:
4:
5:5->70->77;
6:
7:
8:8->66->24;8->42->32;
. //
. // trimmed
. //
83:
84:84->22->29;
85:
86:
87:
88:
89:
90:
If it helps or makes any difference, this value is changed to a list for later purposes, such like
list_result = result.split("\n").toList
List[String] = List(0:0->52->22;, 5:5->70->77;, 8:8->66->24;8->42->32;, 11:11->26->66;11->17->66;
My initial thought was to insert the missing numbers into the list and then sort it, but I had trouble with the sorting so I instead look here for a better method.
Turn your list_result into a Map with default values. Then walk through the desired number range, exchanging each for its Map value.
val map_result: Map[String,List[String]] =
list_result.groupBy("\\d+:".r.findFirstIn(_).getOrElse("bad"))
.withDefault(List(_))
val full_result: String =
(0 to 90).flatMap(n => map_result(s"$n:")).mkString("\n")
Here's a Scastie session to see it in action.
One option would be to use a Map as an intermediate data structure:
val l: List[String] = List("0:0->52->22;", "5:5->70->77;", "8:8->66->24;8->42->32;", "11:11->26->66;11->17->66;")
val byKey: List[Array[String]] = l.map(_.split(":"))
val stop = 90
val mapOfValues = (1 to stop).map(_->"").toMap
val output = byKey.foldLeft(mapOfValues)((acc, nxt) => acc + (nxt.head.toInt -> nxt.tail.head))
output.toList.sorted.map {case (key, value) => println(s"$key, $value")}
This will give you the output you are after. It breaks your input strings into pseudo key-value pairs, creates a map to hold the results, inserts the elements of byKey into the map, then returns a sorted list of the results.
Note: If you are using this in anything like production code you'd need to properly check that each Array in byKey does have two elements to prevent any nullPointerExceptions with the later calls to head and tail.head.
The provided solutions are fine, but I would like to suggest one that can process the data lazily and doesn't need to keep all data in memory at once.
It uses a nice function called unfold, which allows to "unfold" a collection from a starting state, up to a point where you deem the collection to be over (docs).
It's not perfectly polished but I hope it may help:
def readLines(s: String): Iterator[String] =
util.Using.resource(io.Source.fromString(s))(_.getLines)
def emptyLines(from: Int, until: Int): Iterator[(String)] =
Iterator.range(from, until).map(n => s"$n:")
def indexOf(line: String): Int =
Integer.parseInt(line.substring(0, line.indexOf(':')))
def withDefaults(from: Int, to: Int, it: Iterator[String]): Iterator[String] = {
Iterator.unfold((from, it)) { case (n, lines) =>
if (lines.hasNext) {
val next = lines.next()
val i = indexOf(next)
Some((emptyLines(n, i) ++ Iterator.single(next), (i + 1, lines)))
} else if (n < to) {
Some((emptyLines(n, to + 1), (to, lines)))
} else {
None
}
}.flatten
}
You can see this in action here on Scastie.
What unfold does is start from a state (in this case, the line number from and the iterator with the lines) and at every iteration:
if there are still elements in the iterator it gets the next item, identifies its index and returns:
as the next item an Iterator with empty lines up to the latest line number followed by the actual line
e.g. when 5 is reached the empty lines between 1 and 4 are emitted, terminated by the line starting with 5
as the next state, the index of the line after the last in the emitted item and the iterator itself (which, being stateful, is consumed by the repeated calls to unfold at each iteration)
e.g. after processing 5, the next state is 6 and the iterator
if there are no elements in the iterator anymore but the to index has not been reached, it emits another Iterator with the remaining items to be printed (in your example, those after 84)
if both conditions are false we don't need to emit anything anymore and we can close the "unfolding" collection, signalling this by returning a None instead of Some[(Item, State)]
This returns an Iterator[Iterator[String]] where every nested iterator is a range of values from one line to the next, with the default empty lines "sandwiched" in between. The call to flatten turns it into the desired result.
I used an Iterator to make sure that only the essential state is kept in memory at any time and only when it's actually used.

append element to array in for loop matlab

I have a matrix 10x500 and I want to discard every row which contains in the first 100 elements a value above 6. First I am trying to make an array with all the indexes of the row to discard. Here my code
idx_discard_trials = [];
for i = 1:size(data_matrix,1)
if any(data_matrix(i,1:100)>6)
idx_discard_trials = i;
end
end
However, at the end of the loop I get just the last index, not a list. Does anybody know how to append elements to an array using a for loop?
It's because you keep rewriting a single value, you need to append the values through idx_discard_trials(end+1) = i, for example.
You don't need a loop for this however, try the following:
data_matrix(any(data_matrix(:,1:100) > 6, 2),:) = []

How mimic the function map.getORelse to a CSV file

I have a CSV file that represent a map[String,Int], then I am reading the file as follows:
def convI2N (vkey:Int):String={
val in = new Scanner("dictionaryNV.csv")
loop.breakable{
while (in.hasNext) {
val nodekey = in.next(',')
val value = in.next('\n')
if (value == vkey.toString){
n=nodekey
loop.break()}
}}
in.close
n
}
the function give the String given the Int. The problem here is that I must browse the whole file, and the file is to big, then the procedure is too slow. Someone tell me that this is O(n) complexity time, and recomend me to pass to O(log n). I suppose that the function map.getOrElse is O(log n).
Someone can help me to find a way to get a best performance of this code?
As additional comment, the dictionaryNV file is sorted by the Int values
maybe I can divide the file by lines, or set of lines. The CSV has like 167000 Tuples [String,Int]
or in another way how you make some kind of binary search through the csv in scala?
If you are calling confI2N function many times then definitely the job will be slow because each time you have to scan the big file. So if the function is called many times then it is recommended to store them in temporary variable as properties or hashmap or collection of tuple2 and change the other code that is eating the memory.
You can try following way which should be faster than scanner way
Assuming that your csv file is comma separated as
key1,value1
key2,value2
Using Source.fromFile can be your solution as
def convI2N (vkey:Int):String={
var n = "not found"
val filtered = Source.fromFile("<your path to dictionaryNV.csv>")
.getLines()
.map(line => line.split(","))
.filter(sline => sline(0).equalsIgnoreCase(vkey.toString))
for(str <- filtered){
n = str(0)
}
n
}

Remove first element in RDD without using filter function

I have built an RDD from a file where each element in the RDD is section from the file separated by a delimiter.
val inputRDD1:RDD[(String,Long)] = myUtilities.paragraphFile(spark,path1)
.coalesce(100*spark.defaultParallelism)
.zipWithIndex() //RDD[String, Long]
.filter(f => f._2!=0)
The reason I do the last operation above (filter) is to remove the first index 0.
Is there a better way to remove the first element rather than to check each element for the index value as done above?
Thanks!
One possibility is to use RDD.mapPartitionsWithIndex and to remove the first element from the iterator at index 0:
val inputRDD = myUtilities
.paragraphFile(spark,path1)
.coalesce(100*spark.defaultParallelism)
.mapPartitionsWithIndex(
(index, it) => if (index == 0) it.drop(1) else it,
preservesPartitioning = true
)
This way, you only ever advance a single item on the first iterator, where all others remain untouched. Is this be more efficient? Probably. Anyway, I'd test both versions to see which one performs better.

Scala Performance Issue with mutable List (LinkedList)

I have the following code snippets: The code reads the system (Linux) dictionary(en) file and keeps it in memory List.
Code 1 : (With mutable List)
val word = scala.collection.mutable.LinkedList[String]("init");
for(line <- Source.fromFile("/usr/share/dict/words").getLines()){
val s : String = line.trim()
if( // some checks
){
word append scala.collection.mutable.LinkedList[String](s)
}
}
Code 2 : (With Immutable List)
var word = List[String]()
for(line <- Source.fromFile("/usr/share/dict/words").getLines()){
val s : String = line.trim()
if( // some checks
){
word ::= s
}
}
Code 2 : returns almost immediately , But
Code 1 : Takes for ever .
Can any one help me out , why is it taking so much time for mutable List? . Should we use Mutable at all or Am I doing something wrong?
Scala version used : 2.10.3
Thanks in Advance for your help.
word append scala.collection.mutable.LinkedList[String](s)
Traverse the word list and then at the end append the items from the other list.
word ::= s
Append s at the front of the word list and assign the new list to word variable.
Appending to the end of list is always expensive as compared to add a item to the front.
In the first example, you are adding to the end of a list repeatedly (append). This takes time on the order of the length of the list. In the second example, you are adding to the beginning of a list (::). This takes constant time. So the first example has an execution time that increases with the square of the number of lines in the file, and the second has an execution time that increases linearly with the length of the file.
This is due to the nature of linked lists, which are the data structure underlying both immutable List and mutable LinkedList. linked lists are fast to access at the front and slow to access at the back.