Simple function to read lines from a file in Scala lazily using scala-arm

I want to write a Scala function that reads all of the lines from a file lazily (i.e. returns an Iterator[String]) and also closes the file afterwards. I know about the idiom io.Source.fromFile("something.txt").getLines; however, as noted here, this will not close the file afterwards. Surely there is a simple way to do this?
Currently I'm using this, with the scala-arm library:
import resource.managed
import io.{Source, BufferedSource}
def lines(filename: String): Iterator[String] = {
  val reader = managed(Source.fromFile(filename, "UTF-8"))
  reader.map(_.getLines).toTraversable.toIterator
}
but this seems to read the whole file into memory as far as I can tell.
I come from a Python background, where this is laughably trivial:
def lines(filename):
    with open(filename) as f:
        for line in f:
            yield line
Surely there is a reasonably straightforward Scala equivalent which I just haven't managed to work out yet.
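One possible approach, sketched here without scala-arm, is to wrap the iterator returned by getLines so that the underlying Source is closed as soon as the last line has been consumed. The object name LazyLines is hypothetical, and the sketch assumes the caller iterates to the end: if iteration is abandoned early, this code never closes the file.

```scala
import scala.io.Source

object LazyLines {
  // Returns the lines of `filename` lazily and closes the file
  // the first time hasNext reports that no lines are left.
  def lines(filename: String): Iterator[String] = {
    val source = Source.fromFile(filename, "UTF-8")
    val underlying = source.getLines()
    new Iterator[String] {
      private var closed = false

      def hasNext: Boolean =
        if (closed) false
        else {
          val more = underlying.hasNext
          if (!more) {
            source.close()
            closed = true
          }
          more
        }

      def next(): String =
        if (hasNext) underlying.next()
        else throw new NoSuchElementException("no more lines")
    }
  }
}
```

Usage would look like LazyLines.lines("something.txt").foreach(println); only one line is held in memory at a time.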

Related

Functional Style reading large csv file in Scala

I am new to functional-style programming and Scala, so my question might seem a bit primitive.
Is there a specific way to read a CSV file in Scala using a functional style? Also, how are inner joins performed for combining two CSV files in Scala using a functional style?
I know Spark and generally use data frames, but I have little experience with plain Scala, and I'm finding it tough to search on Google because I don't know much about the topic. Also, if anyone knows of good links on functional-style programming in Scala, that would be a great help.
The question is indeed too broad.
Is there a specific way to read csv file in scala using functional style?
So far I don't know of a royal road to parsing CSVs completely hassle-free.
CSV parsing includes:
going through the input line by line
deciding what to do with the (optional) header
accurately parsing each line according to the CSV specification
turning the line parts into a business object
I recommend that you:
turn your input into Iterator[String]
split each line into parts using a library of your choice (e.g. opencsv)
manually create a desired domain object from the line parts
Here is a simple example (which ignores error handling and a potential header):
import java.io.FileInputStream
import scala.io.Source
import com.opencsv.CSVParserBuilder
case class Person(name: String, street: String)
val lineParser = new CSVParserBuilder().withSeparator(',').build()
val lines: Iterator[String] = Source.fromInputStream(new FileInputStream("file.csv")).getLines()
// Split each line into fields and build a Person from the first two
val parsedObjects: Iterator[Person] = lines.map(line => {
  val parts: Array[String] = lineParser.parseLine(line)
  Person(parts(0), parts(1))
})
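The inner-join part of the question is not covered above. A minimal sketch of one way to do it follows, assuming two hypothetical inputs (people as id,name and addresses as personId,street), a naive split on commas with no quoted fields, and an address side small enough to index in memory. All names here are illustrative.

```scala
object CsvJoinSketch {
  final case class Person(id: String, name: String)
  final case class Address(personId: String, street: String)

  // Naive parsing: assumes no quoted fields and no embedded commas.
  def parsePerson(line: String): Person = {
    val parts = line.split(",", -1)
    Person(parts(0), parts(1))
  }

  def parseAddress(line: String): Address = {
    val parts = line.split(",", -1)
    Address(parts(0), parts(1))
  }

  // Inner join: index the address side by key, then stream the people side
  // and emit one pair per matching address. People without a match are dropped.
  def innerJoin(people: Iterator[Person],
                addresses: Iterator[Address]): Iterator[(Person, Address)] = {
    val byPersonId: Map[String, List[Address]] = addresses.toList.groupBy(_.personId)
    people.flatMap { p =>
      byPersonId.getOrElse(p.id, Nil).iterator.map(a => (p, a))
    }
  }
}
```

Only the indexed side is materialized; the other side stays an Iterator, so it can be as large as the file allows.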

How to override stdin to a String

Basic question: I want to set standard input to a specific string. Currently I am trying this:
import java.nio.charset.StandardCharsets
import java.io.ByteArrayInputStream
// Let's say we are inside a method now
val str = "textinputgoeshere"
System.setIn(new ByteArrayInputStream(str.getBytes(StandardCharsets.UTF_8)))
I did it this way because that's similar to how I'd do it in Java. However, str.getBytes seems to work differently in Scala, as System.in appears to be set to a memory address when I check it with println.
I've looked at the Scala API: http://www.scala-lang.org/api/current/scala/Console$.html#setIn(in:java.io.InputStream):Unit
and I've found
def withIn[T](in: InputStream)(thunk: ⇒ T): T
But this seems to set the input stream only for a specific chunk of code; I'd like this to happen in a setup method in my JUnit tests.
My problem ended up being something related to my code, not this specific concept. The correct way to override Standard In / System In to a String in Scala is the following:
import java.io.{ByteArrayInputStream, InputStream}
import java.nio.charset.StandardCharsets
val str = "your string here"
val in: InputStream = new ByteArrayInputStream(str.getBytes(StandardCharsets.UTF_8))
Console.withIn(in)(yourMethod())
My tests run correctly now.
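For reference, a self-contained sketch of the same idea. The names StdinOverrideSketch, readGreeting, and withStdin are hypothetical; readGreeting stands in for the code under test. Note that Console.withIn only affects code that reads through Console or scala.io.StdIn, not code that reads System.in directly.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets
import scala.io.StdIn

object StdinOverrideSketch {
  // Hypothetical code under test: reads one line from standard input.
  def readGreeting(): String = "Hello, " + StdIn.readLine()

  // Runs `body` with Console's input replaced by the bytes of `text`.
  def withStdin[T](text: String)(body: => T): T = {
    val in = new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8))
    Console.withIn(in)(body)
  }
}
```

In a test, the call StdinOverrideSketch.withStdin("world\n")(StdinOverrideSketch.readGreeting()) would wrap the method being exercised.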

How to interpret a val in Scala that is of type Option[T]

I am trying to analyze Scala code written by someone else, and in doing so, I would like to be able to write unit tests (which, unfortunately, were not written before the code was).
Being a relative newcomer to Scala, especially to Futures, I am trying to understand the following line of code.
val niceAnalysis:Option[(niceReport) => Future[niceReport]] = None
Update:
The above line of code should be:
val niceAnalysis:Option[(NiceReport) => Future[NiceReport]] = None
- Where NiceReport is a case class
-----------Update ends here----------------
Since I am trying to mock up an Actor, I created this new Actor where I introduce my niceAnalysis val as a field.
The first problem I see with this "niceAnalysis" thing is that it looks like an anonymous function.
How do I "initialize" this val, i.e. give it an initial value?
My goal is to create a test in my test class, where I am going to pass in this initialized val value into my test actor's receive method.
My naive approach to accomplish this looked like:
val myActorUnderTestRef = TestActorRef(new MyActorUnderTest("None))
IntelliJ doesn't like it, and my SBT compile and test both fail.
So, I need to understand the "niceAnalysis" declaration first and then understand how to give it an initial value. Please advise.
You are correct that this is a value that might contain a function from NiceReport to Future[NiceReport]. You can pass an anonymous function or a reference to a named method. The method reference might be the easiest to understand, so I will show it first, but the anonymous function will most likely be easier in the long run, so I will show it second:
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
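def strToFuture(x: String) = Future { x } // merely wrap the string in a future
// The type annotation lets the compiler turn the method into a function value
val foo: Option[String => Future[String]] = Option(strToFuture)
Alternatively, the one-liner is as follows:
val foo = Option((x: String) => Future { x })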
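Applied to the type from the question, a sketch might look like the following. The field of NiceReport and the names Analysis, noAnalysis, upperCase, and run are assumptions made for illustration; the original code does not show them.

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object OptionalFunctionSketch {
  // Hypothetical shape of the case class; the real fields are not shown in the question.
  final case class NiceReport(text: String)

  type Analysis = NiceReport => Future[NiceReport]

  // "No analysis configured", exactly like the line in the question.
  val noAnalysis: Option[Analysis] = None

  // An initial value: Some wrapping an anonymous function.
  val upperCase: Option[Analysis] =
    Some((r: NiceReport) => Future(r.copy(text = r.text.toUpperCase)))

  // Apply the function if present, otherwise pass the report through unchanged.
  def run(analysis: Option[Analysis], report: NiceReport): Future[NiceReport] =
    analysis match {
      case Some(f) => f(report)
      case None    => Future.successful(report)
    }
}
```

A test actor could then be constructed with either noAnalysis or upperCase, depending on which branch the test should exercise.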

Lazy eager map evaluation

There are basically two options to evaluate a map in Scala.
Lazy evaluation computes the function that is passed as a parameter when the next value is needed. If the function takes one hour to execute, then there is a one-hour wait when the value is needed (e.g. Stream and Iterator).
Eager evaluation computes the function when the map is defined. It produces a new list (Vector or whatever) and stores the results, keeping the program busy in the meantime.
With a Future we can compute the list (Seq or whatever) in a separate thread, so our thread doesn't block, but the results still have to be stored.
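The difference between the first two options can be made visible by counting how often the mapped function runs. This is a small illustrative sketch; the names LazyVsEagerSketch, demo, and slow are hypothetical.

```scala
object LazyVsEagerSketch {
  // Returns the number of calls to `slow` observed at three points in time.
  def demo(): (Int, Int, Int) = {
    var calls = 0
    def slow(x: Int): Int = { calls += 1; x * 2 }

    // Eager: List.map runs `slow` on every element immediately.
    val eager = List(1, 2, 3).map(slow)
    val afterEager = calls

    // Lazy: Iterator.map only records the function; nothing runs yet.
    val lazyIt = Iterator(1, 2, 3).map(slow)
    val afterLazyDefinition = calls

    // Requesting one element runs `slow` exactly once more.
    lazyIt.next()
    val afterOneNext = calls

    (afterEager, afterLazyDefinition, afterOneNext)
  }
}
```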
So I did something different, please check it here.
This was a while ago so I don't remember whether I tested it. The point is to have a map that applies concurrently (non-blocking) and kind of eagerly to a set of elements, filling a buffer (the size of the number of cores in the computer, and not more). This means that:
The invocation of the map doesn't block the current thread.
Obtaining an element doesn't block the current thread (in case there was time to calculate it before and store the result in the buffer).
Infinite lists can be handled because we only prefetch a few results (about 8, depending on the number of cores).
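To make these three properties concrete, here is a rough sketch of a bounded prefetching map written with Scala Futures. It is not the code linked above; bufferedMap and BufferedMapSketch are hypothetical names, and unlike the description it blocks in next() whenever the requested result is not ready yet.

```scala
import scala.collection.mutable
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object BufferedMapSketch {
  // Maps `f` over `source`, keeping at most `bufferSize` computations in flight.
  // Results are returned in the original order.
  def bufferedMap[A, B](source: Iterator[A], bufferSize: Int)(f: A => B)(
      implicit ec: ExecutionContext): Iterator[B] =
    new Iterator[B] {
      private val pending = mutable.Queue.empty[Future[B]]

      // Start new computations until the buffer is full or the source is exhausted.
      private def fill(): Unit =
        while (pending.size < bufferSize && source.hasNext) {
          val a = source.next()
          pending.enqueue(Future(f(a)))
        }

      fill() // begin prefetching as soon as the iterator is created

      def hasNext: Boolean = pending.nonEmpty || source.hasNext

      def next(): B = {
        fill()
        // Blocks only if the oldest computation has not finished yet.
        val result = Await.result(pending.dequeue(), Duration.Inf)
        fill()
        result
      }
    }
}
```

Because only bufferSize elements are ever requested ahead of the consumer, an infinite source such as Iterator.from(1) can be mapped and then truncated with take.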
So this all sounds very good, and you may be wondering what the problem is. The problem is that this solution is not particularly elegant, IMHO. Assuming the code I shared works in Java and/or Scala, to iterate over the elements in the iterable produced by the map I would only need to write:
new CFMap(whateverFunction).apply(whateverIterable)
However what I would like to write is something like:
whateverIterable.bmap(whateverFunction)
As it is usual in Scala (the 'b' is for buffered), or perhaps something like:
whateverThing.toBuffered.map(whateverFunction)
Either of them works for me. So the question is, how can I do this in an idiomatic way in Scala? Some options:
Monads: create a new collection "Buffered" so that I can use the toBuffered method (that should be added to the previous ones as an implicit) and implement map and everything else for this Buffered thing (sounds like quite some work).
Implicits: create an implicit method that transforms the usual collections or their superclass (I'm not sure which one it should be, Iterable maybe?) into something else to which I can apply the .bmap method and obtain something from it, probably an iterable.
Other: there are probably many options I have not considered so far. It's possible that some library already implements this (I'd actually be surprised otherwise; I can't believe nobody has thought of this before). Using something that has already been done is usually a good idea.
Please let me know if something is unclear.
What you're looking for is the "pimp-my-library" pattern. Check it out:
object CFMapExtensions {
  import sanity.commons.functional.CFMap
  import scala.collection.JavaConversions._
  implicit class IterableExtensions[I](i: Iterable[I]) {
    def bmap[O](f: Function1[I, O]): Iterable[O] = new CFMap(f).apply(asJavaIterable(i))
  }
  implicit class JavaIterableExtensions[I](i: java.lang.Iterable[I]) {
    def bmap[O](f: Function1[I, O]): Iterable[O] = new CFMap(f).apply(i)
  }
  // Add an implicit conversion to a java function.
  import java.util.function.{Function => JFunction}
  implicit def toJFunction[I, O](f: Function1[I, O]): JFunction[I, O] = {
    new JFunction[I, O]() {
      def apply(t: I): O = f(t)
    }
  }
}
object Test extends App {
  import CFMapExtensions._
  List(1,2,3,4).bmap(_ + 5).foreach(println)
}
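The code above needs the CFMap class on the classpath. To try the extension-method pattern on its own, a stripped-down sketch could look like this; bufferedMapStandIn is a hypothetical placeholder that simply maps lazily and would be swapped for the real buffered implementation.

```scala
object BmapSyntaxSketch {
  // Placeholder for the real buffered map; here it is just a lazy Iterator.map.
  def bufferedMapStandIn[A, B](xs: Iterable[A])(f: A => B): Iterator[B] =
    xs.iterator.map(f)

  // The implicit class adds .bmap to every Iterable that is in scope of the import.
  implicit class BufferedOps[A](private val xs: Iterable[A]) extends AnyVal {
    def bmap[B](f: A => B): Iterator[B] = bufferedMapStandIn(xs)(f)
  }
}
```

After import BmapSyntaxSketch._, a call such as List(1, 2, 3, 4).bmap(_ + 5) resolves through the implicit class.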

Scala Iterable Memory Leaks

I recently started playing with Scala and ran across the following. Below are 4 different ways to iterate through the lines of a file, do some stuff, and write the result to another file. Some of these methods work as I would think (though using a lot of memory to do so) and some eat memory to no end.
The idea was to wrap Scala's getLines Iterator as an Iterable. I don't care if it reads the file multiple times - that's what I expect it to do.
Here's my repro code:
class FileIterable(file: java.io.File) extends Iterable[String] {
  override def iterator = io.Source.fromFile(file).getLines
}
// Iterator
// Option 1: Direct iterator - holds at 100MB
def lines = io.Source.fromFile(file).getLines
// Option 2: Get iterator via method - holds at 100MB
def lines = new FileIterable(file).iterator
// Iterable
// Option 3: TraversableOnce wrapper - holds at 2GB
def lines = io.Source.fromFile(file).getLines.toIterable
// Option 4: Iterable wrapper - leaks like a sieve
def lines = new FileIterable(file)
def values = lines
  .drop(1)
  //.map(l => l.split("\t")).map(l => l.reduceLeft(_ + "|" + _))
  //.filter(l => l.startsWith("*"))
val writer = new java.io.PrintWriter(new java.io.File("out.tsv"))
values.foreach(v => writer.println(v))
writer.close()
The file it's reading is ~10GB with 1MB lines.
The first two options iterate the file using a constant amount of memory (~100MB). This is what I would expect. The downside here is that an iterator can only be used once, and it's using Scala's call-by-name convention as a pseudo-iterable. (For reference, the equivalent C# code uses ~14MB.)
The third method calls toIterable, defined in TraversableOnce. This one works, but it uses about 2GB to do the same work. I have no idea where the memory is going, because it can't be caching the entire Iterable.
The fourth is the most alarming - it immediately uses all available memory and throws an OOM exception. Even weirder is that it does this for all of the operations I've tested: drop, map, and filter. Looking at the implementations, none of them seem to maintain much state (though the drop looks a little suspect - why does it not just count the items?). If I do no operations, it works fine.
My guess is that somewhere it's maintaining references to each of the lines read, though I can't imagine how. I've seen the same memory usage when passing Iterables around in Scala. For example if I take case 3 (.toIterable) and pass that to a method that writes an Iterable[String] to a file, I see the same explosion.
Any ideas?
Note how the ScalaDoc of Iterable says:
Implementations of this trait need to provide a concrete method with signature:
def iterator: Iterator[A]
They also need to provide a method newBuilder which creates a builder for collections of the same kind.
Since you don't provide an implementation for newBuilder, you get the default implementation, which uses a ListBuffer and thus tries to fit everything into memory.
You might want to implement Iterable.drop as
def drop(n: Int) = iterator.drop(n).toIterable
but that would break the representation invariance of the collection library (i.e. iterator.toIterable returns a Stream, while you want List.drop to return a List, etc., hence the need for the Builder concept).
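If re-iteration through an Iterable is not essential, one way to sidestep the builder entirely is to stay on the Iterator for the whole pipeline, as options 1 and 2 in the question do. A minimal sketch under that assumption follows; StreamingCopySketch and copyDroppingHeader are hypothetical names.

```scala
import java.io.{File, PrintWriter}
import scala.io.Source

object StreamingCopySketch {
  // Copies all lines except the first from `in` to `out`.
  // drop and foreach run on the Iterator, so only one line is held at a time.
  def copyDroppingHeader(in: File, out: File): Unit = {
    val source = Source.fromFile(in, "UTF-8")
    val writer = new PrintWriter(out, "UTF-8")
    try {
      source.getLines().drop(1).foreach(line => writer.println(line))
    } finally {
      writer.close()
      source.close()
    }
  }
}
```

Calling the method again re-opens the file, which gives the "read it multiple times" behavior the question asks for without going through Iterable.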