Best way to represent a readline loop in Scala? - scala

Coming from a C/C++ background, I'm not very familiar with the functional style of programming so all my code tends to be very imperative, as in most cases I just can't see a better way of doing it.
I'm just wondering if there is a way of making this block of Scala code more "functional"?
var line:String = "";
var lines:String = "";
do {
line = reader.readLine();
lines += line;
} while (line != null)

How about this?
val lines = Iterator.continually(reader.readLine()).takeWhile(_ != null).mkString
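For anyone who wants to try this without a real file, here is a minimal sketch of the same one-liner wrapped in a helper. The names ReadLoopDemo, readAll and sample are made up for illustration, and an in-memory StringReader stands in for whatever reader you actually have. Note that the do/while loop in the question also appends the final null to the result (as the text "null"), which takeWhile avoids.

```scala
import java.io.{BufferedReader, StringReader}

object ReadLoopDemo {
  // Reads every remaining line from the reader and concatenates them.
  // readLine() strips line terminators, so mkString joins the lines with no separator.
  def readAll(reader: BufferedReader): String =
    Iterator
      .continually(reader.readLine()) // endless iterator of readLine() results
      .takeWhile(_ != null)           // readLine() returns null at end of input
      .mkString

  // An in-memory reader stands in for a real file or socket here (assumption).
  val sample: String = readAll(new BufferedReader(new StringReader("ab\ncd\nef")))
}
```

If you want the line breaks back, pass a separator, e.g. mkString("\n").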

Well, in Scala you can actually say:
val lines = scala.io.Source.fromFile("file.txt").mkString
But this is just library sugar. See "Read entire file in Scala?" for other possibilities. What you are actually asking is how to apply the functional paradigm to this problem. Here is a hint:
Source.fromFile("file.txt").getLines().foreach {println}
Do you get the idea behind this? For each line in the file, execute the println function. By the way, don't worry: getLines() returns an iterator, not the whole file. Now something more serious:
lines filter {_.startsWith("ab")} map {_.toUpperCase} foreach {println}
See the idea? Take lines (it can be an array, list, set, iterator, or anything else that can be filtered and whose items have a startsWith method) and filter it, keeping only the items starting with "ab". Now take every item and map it by applying the toUpperCase method. Finally, for each resulting item, print it.
One last thought: you are not limited to a single type. For instance, say you have a file containing integers, one per line. If you want to read that file, parse the numbers and sum them up, simply say:
lines.map(_.toInt).sum
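To make the two pipelines above concrete, here is a small self-contained sketch. The sample data and the names LineOpsDemo, words, shouted, numbers and total are invented for the example; in real code the input would come from something like getLines().

```scala
object LineOpsDemo {
  // Stand-in for the lines of a file; any filterable collection of strings works.
  val words: List[String] = List("abacus", "zebra", "abbey")

  // Keep only the items starting with "ab", then upper-case each of them.
  val shouted: List[String] = words.filter(_.startsWith("ab")).map(_.toUpperCase)

  // Lines holding one integer each: parse every line, then add the numbers up.
  val numbers: Iterator[String] = Iterator("1", "2", "39")
  val total: Int = numbers.map(_.toInt).sum
}
```

The same chain works unchanged on a List, an Iterator or a view; only the point at which the work actually happens differs.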

To add to the above: the same can be achieved with the (no longer so new) NIO file API, which I recommend because it has several advantages:
import java.nio.file.{Files, Path, Paths}
import scala.io.Source
val path: Path = Paths.get("foo.txt")
val lines = Source.fromInputStream(Files.newInputStream(path)).getLines()
// Now we can iterate the file or do anything we want,
// e.g. using functional operations such as map. Or simply concatenate.
val result = lines.mkString
Don't forget to close the stream afterwards.
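One way to make sure the stream really gets closed is scala.util.Using, assuming you are on Scala 2.13 or later. The sketch below is only an illustration: NioReadDemo, readAll and demo are made-up names, and the temp file exists purely so the example can run on its own.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.io.Source
import scala.util.Using

object NioReadDemo {
  // Reads the whole file; the Source (and the stream under it) is closed
  // when the block finishes, even if reading throws.
  def readAll(path: Path): String =
    Using.resource(Source.fromInputStream(Files.newInputStream(path), "UTF-8")) { source =>
      // getLines() is lazy, so build the result before the resource is closed.
      source.getLines().mkString
    }

  // Writes a throwaway temp file so the example is self-contained.
  def demo(): String = {
    val tmp = Files.createTempFile("nio-read-demo", ".txt")
    try {
      Files.write(tmp, "one\ntwo\n".getBytes(StandardCharsets.UTF_8))
      readAll(tmp)
    } finally Files.deleteIfExists(tmp)
  }
}
```

The important detail is that the lazy iterator is fully consumed inside the block; returning getLines() itself would hand back an iterator over an already closed stream.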

I find that Stream is a pretty nice approach: it creates a re-traversable (if needed) sequence:
def loadLines(in: java.io.BufferedReader): Stream[String] = {
  val line = in.readLine()
  if (line == null) Stream.Empty
  else Stream.cons(line, loadLines(in))
}
Each Stream element has a value (a String, line, in this case), and calls a function (loadLines(in), in this example) which will yield the next element, lazily, on demand. This makes for a good memory usage profile, especially with large data sets -- lines aren't read until they're needed, and aren't retained unless something is actually still holding onto them. Yet you can also go back to a previous Stream element and traverse forward again, yielding the exact same result.
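On Scala 2.13 and later, Stream is deprecated in favour of LazyList, so here is a sketch of the same function in that form. LazyLinesDemo and the sample input are assumptions made for the example; the two passes over lines show the "go back and traverse again" behaviour described above.

```scala
import java.io.{BufferedReader, StringReader}

object LazyLinesDemo {
  // Same shape as loadLines above, written against LazyList.
  def loadLines(in: BufferedReader): LazyList[String] = {
    val line = in.readLine()
    if (line == null) LazyList.empty
    else line #:: loadLines(in) // the tail is only evaluated on demand
  }

  // In-memory input standing in for a real file (assumption).
  val lines: LazyList[String] =
    loadLines(new BufferedReader(new StringReader("a\nb\nc")))

  val firstPass: List[String] = lines.toList
  // Elements are memoized, so a second traversal does not touch the reader again.
  val secondPass: List[String] = lines.toList
}
```

Keep in mind that holding on to the head (lines here) keeps every line that has been read so far in memory, which is exactly what makes the second pass possible.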

Related

An efficient and composable foreach in Scala

Goal: Efficiently do something for each element in a list, and then return the original list, so that I can do something else with original list. For example, let lst be a very large list, and suppose we do many operations to it before applying our foreach. What I want to do is something like this:
lst.many_operations().foreach(x => f(x)).something_else()
However, foreach returns Unit. I seek a way to iterate through the list and return the original list supplied, so that I can do something_else() to it. To reduce the memory impact, I need to avoid saving the result of lst.many_operations() to a variable.
An obvious, but imperfect, solution is to replace foreach with map. Then the code looks like:
lst
  .many_operations()
  .map(x => {
    f(x)
    x
  }).something_else()
However, this is not good because map constructs a new list, effectively duplicating the very large list that it iterated through.
What is the right way to do this in Scala?
The simplest way seems to be:
lst.foreach(many_operations)
lst.foreach(something_else)
However: using side effects is really not a good idea. I would urge you to revisit your design to use explicit pure transformations rather than side effects and mutations.
To address your concern about having multiple lists in memory at the same time, you can use view or iterator to emulate streaming processing, and discard intermediate results you do not need to use again:
val newList = lst.iterator
  .map(foo)
  .map(bar)
  .map(baz)
  .toList
(lst will get garbage collected if you do not reference it again).
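If you do need the side effect in the middle of the chain, Scala 2.13 added tapEach, which runs a function on each element and passes the element through unchanged. The sketch below assumes 2.13 or later; TapEachDemo, seen and the arithmetic are placeholders for many_operations(), f and something_else().

```scala
object TapEachDemo {
  // Records what the side-effecting function saw, so the effect is observable.
  val seen = scala.collection.mutable.ListBuffer.empty[Int]

  val result: List[Int] =
    List(1, 2, 3).iterator
      .map(_ * 10)             // stands in for many_operations()
      .tapEach(x => seen += x) // the side effect f(x); elements pass through unchanged
      .filter(_ > 10)          // stands in for something_else()
      .toList
}
```

Because everything runs on an iterator, no intermediate list is built between the steps; only the final toList allocates a collection.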

(Scala) Am I using Options correctly?

I'm currently working on my functional programming; I am fairly new to it. Am I using Options correctly here? I feel pretty insecure about my skills currently. I want my code to be as safe as possible. Can anyone point out what I am doing wrong here, or is it not that bad? My code is pretty straightforward:
def main(args: Array[String]): Unit =
{
  val file = "myFile.txt"
  val myGame = Game(file) //I have my game that returns an Option here
  if(myGame.isDefined) //Check if I indeed passed a .txt file
  {
    val solutions = myGame.get.getAllSolutions() //This returns options as well
    if(solutions.isDefined) //Is it possible to solve the puzzle(crossword)
    {
      for(i <- solutions.get){ //print all solutions to the crossword
        i.solvedCrossword foreach println
      }
    }
  }
}
-Thanks!! ^^
When using Option, it is recommended to use pattern matching (match/case) instead of calling 'isDefined' and 'get'.
Instead of the Java-style for loop, use higher-order functions:
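myGame match {
  case Some(game) =>
    // getAllSolutions returns an Option too, so unwrap it with foreach
    game.getAllSolutions().foreach { solutions =>
      solutions.foreach(_.solvedCrossword.foreach(println))
    }
  case None => // no game could be loaded, nothing to print
}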
As a rule of thumb, you can think of Option as a replacement for Java's null pointer. That is, in cases where you might want to use null in Java, it often makes sense to use Option in Scala.
Your Game() function uses None to represent errors. So you're not really using it as a replacement for null (at least I'd consider it poor practice for an equivalent Java method to return null there instead of throwing an exception), but as a replacement for exceptions. That's not a good use of Option because it loses error information: you can no longer differentiate between the file not existing, the file being in the wrong format or other types of errors.
Instead you should use Either. Either consists of the cases Left and Right where Right is like Option's Some, but Left differs from None in that it also takes an argument. Here that argument can be used to store information about the error. So you can create a case class containing the possible types of errors and use that as an argument to Left. Or, if you never need to handle the errors differently, but just present them to the user, you can use a string with the error message as the argument to Left instead of case classes.
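A rough sketch of what that could look like follows. Everything in it is hypothetical: the LoadError hierarchy, the Game case class, loadGame and its exists parameter are invented for illustration and are not the asker's actual API.

```scala
object EitherDemo {
  // Hypothetical error type: one case class per kind of failure.
  sealed trait LoadError
  final case class FileNotFound(name: String) extends LoadError
  final case class WrongFormat(name: String) extends LoadError

  // Hypothetical stand-in for the asker's Game.
  final case class Game(file: String)

  // Left carries the reason for the failure, Right carries the loaded game.
  // `exists` is passed in so the example does not need a real file system.
  def loadGame(file: String, exists: String => Boolean): Either[LoadError, Game] =
    if (!file.endsWith(".txt")) Left(WrongFormat(file))
    else if (!exists(file)) Left(FileNotFound(file))
    else Right(Game(file))

  // The caller can now tell the failure cases apart.
  def describe(result: Either[LoadError, Game]): String = result match {
    case Right(game)              => s"loaded ${game.file}"
    case Left(FileNotFound(name)) => s"no such file: $name"
    case Left(WrongFormat(name))  => s"not a .txt file: $name"
  }
}
```

Since the trait is sealed, the compiler can warn you if describe forgets one of the error cases.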
In getAllSolutions you're just using None as a replacement for the empty list. That's unnecessary because the empty list needs no replacement. It's perfectly fine to just return an empty list when there are no solutions.
When it comes to interacting with the Options, you're using isDefined + get, which is a bit of an anti-pattern. get can be used as a shortcut if you know that the option you have is never None, but should generally be avoided. isDefined should generally only be used in situations where you need to know whether an option contains a value, but don't need to know the value.
In cases where you need to know both whether there is a value and what that value is, you should either use pattern matching or one of Option's higher-order functions, such as map, flatMap, getOrElse (which is kind of a higher-order function if you squint a bit and consider by-name arguments as kind-of like functions). For cases where you want to do something with the value if there is one and do nothing otherwise, you can use foreach (or equivalently a for loop), but note that you really shouldn't do nothing in the error case here. You should tell the user about the error instead.
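As a small illustration of those alternatives, here is a sketch using a made-up lookup table (OptionOpsDemo, ports and the two describe functions are not from the question). Map#get returns an Option, which makes it a convenient source of Some/None values.

```scala
object OptionOpsDemo {
  // Hypothetical data; get on a Map returns Option[Int].
  val ports: Map[String, Int] = Map("http" -> 80, "https" -> 443)

  // map transforms the value if there is one; getOrElse supplies the fallback,
  // so the "missing" case is handled instead of silently ignored.
  def describe(service: String): String =
    ports.get(service)
      .map(port => s"$service uses port $port")
      .getOrElse(s"unknown service: $service")

  // The same logic written with pattern matching.
  def describeWithMatch(service: String): String = ports.get(service) match {
    case Some(port) => s"$service uses port $port"
    case None       => s"unknown service: $service"
  }
}
```

Neither version ever calls isDefined or get, so there is no way to hit a None by accident.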
If all you need here is to print the result when everything goes well, you can use a for-comprehension, which is considered quite idiomatic Scala:
for {
  myGame <- Game("myFile.txt")
  solutions <- myGame.getAllSolutions()
  solution <- solutions
  crossword <- solution.solvedCrossword
} println(crossword)

Scala's collect inefficient in Spark?

I am currently starting to learn to use Spark with Scala. The problem I am working on needs me to read a file, split each line on a certain character, then filter the lines where one of the columns matches a predicate, and finally remove a column. So the basic, naive implementation is a map, then a filter, then another map.
This meant going through the collection 3 times, and that seemed quite unreasonable to me. So I tried replacing them with one collect (the collect that takes a partial function as an argument). And much to my surprise, this made it run much slower. I tried locally on regular Scala collections; as expected, the latter way is much faster there.
So why is that? My idea is that the map, filter and map are not applied sequentially, but rather fused into one operation; in other words, when an action forces evaluation, every element of the list will be checked and the pending operations will be executed. Is that right? But even so, why does collect perform so badly?
EDIT: a code example to show what I want to do:
The naive way:
sc.textFile(...).map(l => {
  val s = l.split(" ")
  (s(0), s(1))
}).filter(_._2.contains("hello")).map(_._1)
The collect way:
sc.textFile(...).collect {
  case s if(s.split(" ")(0).contains("hello")) => s(0)
}
The answer lies in the implementation of collect:
/**
 * Return an RDD that contains all matching values by applying `f`.
 */
def collect[U: ClassTag](f: PartialFunction[T, U]): RDD[U] = withScope {
  val cleanF = sc.clean(f)
  filter(cleanF.isDefinedAt).map(cleanF)
}
As you can see, it's the same filter -> map sequence, but less efficient in your case.
In Scala, both the isDefinedAt and apply methods of a PartialFunction evaluate the if part (the guard).
So, in your "collect" example, split will be performed twice for each input element that passes the guard.
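You can see the double evaluation without Spark by counting guard calls on a plain List. This is only a sketch: CollectGuardDemo, secondColumnHasHello, measure and the sample data are made up, and filter(pf.isDefinedAt).map(pf) simply mimics the RDD.collect body quoted above.

```scala
object CollectGuardDemo {
  private var guardCalls = 0

  // The guard does the expensive split and counts how often it runs.
  private def secondColumnHasHello(line: String): Boolean = {
    guardCalls += 1
    line.split(" ")(1).contains("hello")
  }

  val pf: PartialFunction[String, String] = {
    case line if secondColumnHasHello(line) => line.split(" ")(0)
  }

  val data: List[String] = List("a hello", "b world")

  // Runs a strategy and returns (result, number of guard evaluations).
  def measure(run: List[String] => List[String]): (List[String], Int) = {
    guardCalls = 0
    val out = run(data)
    (out, guardCalls)
  }

  // Mimics the Spark implementation: isDefinedAt on every element,
  // then apply (which re-checks the guard) on every element that matched.
  val filterThenMap: (List[String], Int) = measure(_.filter(pf.isDefinedAt).map(pf))

  // The standard library's own collect, for comparison.
  val nativeCollect: (List[String], Int) = measure(_.collect(pf))
}
```

With two input lines of which one matches, the filter-then-map variant evaluates the guard three times: twice in filter and once more inside apply. As far as I know, the standard library's collect goes through applyOrElse and therefore checks each element only once, which would explain why the collect version was faster on regular Scala collections.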

Iterator of InputStream: How to close the InputStreams?

I have an Iterator[InputStream] which I map over to retrieve the individual results:
val streams: Iterator[InputStream[CustomType]] = retrieveStreams()
val results: Iterator[MyResultType] = streams flatMap (c => transformToResult(c))
This works as expected, meaning I can retrieve values of type MyResultType from the results iterator. The only problem I have is that the individual InputStreams are never being closed. Is there any way to do this?
There is no magic way to close it, or at least to guarantee that it will get closed. Thus you have to close each stream yourself. Take a look at the Loan Pattern which makes it less error prone: Loaner Pattern in Scala.
In your case you don't have a single resource to release but rather a collection of resources, so adjust your custom loan pattern accordingly.
Since you are dealing with an Iterator, you might have an unlimited supply of InputStreams. In that case, your transformToResult function (or something else working at the element level) would have to close each stream.
It could look something like this:
val streams: Iterator[InputStream[CustomType]] = retrieveStreams()
val results: Iterator[MyResultType] =
  streams flatMap (c => yourLoaner(c)(transformToResult))
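For illustration, a loaner along those lines could be sketched as below. All names are hypothetical (withStream plays the role of yourLoaner, and TrackedStream exists only to make the closing observable), and plain java.io.InputStream is used instead of the asker's parameterised type.

```scala
import java.io.{ByteArrayInputStream, InputStream}

object LoanDemo {
  // Hypothetical loaner: lends the stream to `use`, then always closes it.
  def withStream[A](in: InputStream)(use: InputStream => A): A =
    try use(in) finally in.close()

  // Must return a strict result (a List here), because the stream is
  // already closed by the time the caller sees the value.
  def transformToResult(in: InputStream): List[Int] =
    Iterator.continually(in.read()).takeWhile(_ != -1).toList

  // Test helper that remembers whether close() was called.
  final class TrackedStream(bytes: Array[Byte]) extends ByteArrayInputStream(bytes) {
    var closed = false
    override def close(): Unit = { closed = true; super.close() }
  }

  val streams: List[TrackedStream] =
    List(new TrackedStream(Array[Byte](1, 2)), new TrackedStream(Array[Byte](3)))

  // Each stream is opened, fully consumed and closed before the next one is touched.
  val results: List[Int] =
    streams.iterator.flatMap(s => withStream(s)(transformToResult)).toList
}
```

The one thing to watch is laziness: if transformToResult returned a lazy iterator over the stream, the loaner would close the stream before anything had been read from it.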

How to correctly get the current loop count from an Iterator in Scala

I am looping over the following lines from a CSV file to parse them. I want to identify the first line, since it's the header. What's the best way of doing this instead of keeping a var counter?
var counter = 0
for (line <- lines) {
  println(CsvParser.parse(line, counter))
  counter += 1
}
I know there has got to be a better way to do this; I'm a newbie to Scala.
Try zipWithIndex:
for (line <- lines.zipWithIndex) {
  println(CsvParser.parse(line._1, line._2))
}
@tenshi suggested the following improvement with pattern matching:
for ((line, count) <- lines.zipWithIndex) {
  println(CsvParser.parse(line, count))
}
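Here is a runnable sketch of the same idea on an Iterator, where zipWithIndex is lazy and pairs each line with its index as the lines are pulled through. The parse function below is a made-up stand-in for CsvParser.parse, and the sample lines are invented.

```scala
object HeaderDemo {
  // Hypothetical stand-in for CsvParser.parse(line, index):
  // index 0 is treated as the header row.
  def parse(line: String, index: Int): String =
    if (index == 0) s"header: $line" else s"row $index: $line"

  val lines: Iterator[String] = Iterator("name,age", "ann,31", "bob,27")

  // No mutable counter: the index travels with each line.
  val parsed: List[String] =
    lines.zipWithIndex.map { case (line, i) => parse(line, i) }.toList
}
```

If the header needs completely different handling, another option is to take it off the iterator first with next() and then process the remaining lines.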
I totally agree with the given answer, but I still have to point out something important. I initially planned to put it in a simple comment, but it would be quite long, so let me post it as an alternative answer.
It's perfectly true that the zip* methods are helpful for building tables from lists, but the downside is that they loop over the lists in order to build them.
So a common recommendation is to sequence the required operations on a view, so that they are combined and only applied when a result is required. A result is considered to be produced when the return value isn't an Iterable; foreach is one such case.
Now, regarding the first answer: if lines is the list of lines in a very big file (or even an enumeratee over it), zipWithIndex will go through all of them and produce a table (an Iterable of tuples). The for-comprehension will then go through the same number of items again.
In the end, you have increased the running time by n, where n is the length of lines, and added a memory footprint of m + n*16 (roughly), where m is the footprint of lines.
Proposition
lines.view.zipWithIndex map Function.tupled(CsvParser.parse) foreach println
A few more words (I promise): lines.view will create something like a scala.collection.SeqView that holds all further "mapping" functions producing a new Iterable, such as zipWithIndex and map.
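To see that laziness in action, here is a small sketch that counts how many elements actually pass through map. ViewDemo, visited and the sample data are made up; the take(2) is only there to show that elements nobody asks for are never processed.

```scala
object ViewDemo {
  // Counts how many elements the mapping function has been applied to.
  var visited = 0

  val lines: List[String] = List("h", "a", "b")

  val firstTwo: List[String] =
    lines.view.zipWithIndex
      .map { case (line, i) =>
        visited += 1 // observable side effect, for demonstration only
        s"$i:$line"
      }
      .take(2) // only two elements are ever requested
      .toList  // nothing runs until a strict result is demanded here
}
```

Without .view, zipWithIndex and map would each build a full intermediate list before take(2) got a chance to cut things short.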
Moreover, I think the expression is more elegant because it reads in logical order:
"For lines, create a view that zips each item with its index; the result has to be mapped through the parser, and the parser's output must be printed."
HTH.