Why is `Source.fromFile(...).getLines()` empty after I've iterated over it? - scala

It was quite a surprise to me that for (line <- lines) is so devastating! It completely exhausts the lines iterator, so running the following snippet leaves size = 0:
import scala.io.Source
val lines = Source.fromFile(args(0)).getLines()
var cnt = 0
for (line <- lines) {
  cnt = readLines(line, cnt)
}
val size = lines.size // 0: the for loop above has already consumed the iterator
Is it a normal Scala practice to have well-hidden side-effects like this?

Source.getLines() returns an iterator. For every iterator, if you invoke a bulk operation such as foreach above, or map, take, toList, etc., then the iterator is no longer in a usable state.
That is the contract for Iterators and, more generally, classes that inherit TraversableOnce.
It is of particular importance to note that, unless stated otherwise, one should never use an iterator after calling a method on it. The two most important exceptions are also the sole abstract methods: next and hasNext.
This is not the case for classes that inherit Traversable -- for those you can invoke the bulk traversal operations as many times as you want.
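A minimal sketch of the behaviour described above, and of the usual workaround of materialising the iterator once. The names IteratorOnce, freshLines and countByTraversal are hypothetical, and an in-memory iterator stands in for Source.fromFile(...).getLines() so that the example is self-contained:

```scala
object IteratorOnce {
  // Hypothetical stand-in for Source.fromFile(path).getLines():
  // every call hands back a fresh, unconsumed iterator.
  def freshLines(): Iterator[String] = Iterator("first", "second", "third")

  // Counts the elements by traversing them, which uses the iterator up.
  def countByTraversal(lines: Iterator[String]): Int = {
    var cnt = 0
    for (_ <- lines) cnt += 1
    cnt
  }

  def main(args: Array[String]): Unit = {
    val lines = freshLines()
    println(countByTraversal(lines)) // 3
    println(lines.hasNext)           // false: the traversal consumed it

    // Materialise once into a List, then traverse as often as needed.
    val kept = freshLines().toList
    println(countByTraversal(kept.iterator)) // 3
    println(kept.size)                       // 3
  }
}
```

With a real file, the same idea would be to call toList (or toVector) on getLines() before the first traversal, at the cost of holding all lines in memory.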

Source.getLines() returns an Iterator, and walking through an Iterator mutates it. This is made quite clear in the Scala documentation:
An iterator is mutable: most operations on it change its state. While it is often used to iterate through the elements of a collection, it can also be used without being backed by any collection (see constructors on the companion object).
It is of particular importance to note that, unless stated otherwise, one should never use an iterator after calling a method on it. The two most important exceptions are also the sole abstract methods: next and hasNext.
Using for notation is just syntactic sugar for calling map, flatMap and foreach methods on the Iterator, which again have quite clear documentation stating not to use the iterator:
Reuse: After calling this method, one should discard the iterator it was called on, and use only the iterator that was returned. Using the old iterator is undefined, subject to change, and may result in changes to the new iterator as well.
Scala generally aims to be a 'pragmatic' language - mutation and side effects are allowed for performance and inter-operability reasons, although not encouraged. To call it 'well-hidden' is, however, something of a stretch.
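To illustrate the point about for notation being sugar, here is a small sketch (the object and method names are made up for this example). The two methods should be equivalent, because the compiler rewrites the for ... yield into a call to map:

```scala
object ForSugar {
  // A for ... yield over an Iterator is rewritten into a call to map.
  def doubledWithFor(it: Iterator[Int]): Iterator[Int] =
    for (x <- it) yield x * 2

  // The same thing written out by hand.
  def doubledWithMap(it: Iterator[Int]): Iterator[Int] =
    it.map(_ * 2)

  def main(args: Array[String]): Unit = {
    println(doubledWithFor(Iterator(1, 2, 3)).toList) // List(2, 4, 6)
    println(doubledWithMap(Iterator(1, 2, 3)).toList) // List(2, 4, 6)

    val source = Iterator(1, 2, 3)
    val mapped = doubledWithFor(source) // from here on, use only `mapped`
    println(mapped.toList)              // List(2, 4, 6)
  }
}
```

A for loop without yield is rewritten into foreach in the same way, which is why the loop in the question consumes the iterator.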

Related

Scala Iterator overwritten [duplicate]

I am trying to learn Scala through a sample project that I have. In it there is a variable records defined as:
val records: Iterator[Product2[K, V]]
It is passed around in different methods. I explore its contents using :
records.foreach(println)
However, when I try to print the contents using this iterator again, even in successive lines of code, I get no results. It seems as if the iterator is consumed. How do I prevent this from happening, so that I can explore the contents of the iterator without rendering it useless for the rest of the code?
An Iterator extends TraversableOnce and hence can only be iterated over once, as it represents a mutating pointer into an Iterable. If you want something that can be traversed repeatedly, and without affecting multiple parallel accesses, you need to use the Iterable instead, which extends Traversable and creates a new Iterator for that specific context on each foreach.
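If only an Iterator is available (as in the question), one possible way to inspect it without losing the data is Iterator.duplicate, which returns two iterators over the same elements. The sketch below is an illustration under that assumption, not the approach the answer itself proposes; printAndKeep is a hypothetical helper name, and plain tuples stand in for Product2[K, V]:

```scala
object PeekAtIterator {
  // Prints every record, then returns an iterator that still yields all of them.
  // The iterator passed in must not be used again after this call.
  def printAndKeep[A](records: Iterator[A]): Iterator[A] = {
    val (forPrinting, forLater) = records.duplicate
    forPrinting.foreach(println)
    forLater
  }

  def main(args: Array[String]): Unit = {
    val records: Iterator[(String, Int)] = Iterator("a" -> 1, "b" -> 2)
    val stillUsable = printAndKeep(records)
    println(stillUsable.toList) // List((a,1), (b,2))
  }
}
```

Note that duplicate has to buffer the elements that one copy has seen and the other has not, so for large inputs converting to a Vector or List once may be the simpler choice.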

Declaring one, two, many arguments in Scala

I'm just starting to learn Scala. While browsing through the Scaladocs I saw this method definition in mutable.Map:
def -=(elem1: A, elem2: A, elems: A*): Map.this.type
Removes two or more elements from this shrinkable collection.
elem1 the first element to remove.
elem2 the second element to remove.
elems the remaining elements to remove.
returns the shrinkable collection itself
Why would you need to define elem1 and elem2 explicitly if you're just going to define elems with a * anyway?
Notice that there is already a separate overloaded method with a single parameter:
abstract def -=(key: A): Map.this.type
So the two-or-more-parameters constraint is there to avoid ambiguous calls. The reason for a separate method with a single parameter may be efficiency. Note that calling a variable-argument method involves creating an array in the background, which would be wasted if there is only a single element to remove.
As the single-argument version of the method is abstract while the multiple-argument version is concrete, I would not be surprised if the implementation of the latter actually called the former in (a functional equivalent of) a loop. (Update: a quick code check verified my guess, although the call is indirect via --= .)
I would say because of the two or more condition.
With that signature, you clearly enforce at least two arguments of type A to be passed to the method.
If you used only elems: A*, it would mean removing 0 or more elements, which wouldn't make much sense.
Whether it makes sense to have a dedicated signature for removing 2 or more elements is beyond the scope of this answer. As @Péter Török said, the reason for overloading this method with one parameter and with two or more parameters may be efficiency.
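The overloading pattern discussed in these answers can be sketched on a toy class. Bag, remove and contents are hypothetical names chosen for this illustration and are not part of mutable.Map; the point is only that the two explicit parameters keep the varargs overload from ever competing with the single-argument one:

```scala
// Hypothetical toy collection, used only to illustrate the overloading pattern.
final class Bag[A](initial: Set[A]) {
  private var items: Set[A] = initial

  // Single-element version: no varargs array is allocated for this call.
  def remove(elem: A): this.type = {
    items = items - elem
    this
  }

  // Two-or-more version: elem1 and elem2 are mandatory, so a one-argument
  // call can only ever resolve to the overload above.
  def remove(elem1: A, elem2: A, elems: A*): this.type = {
    remove(elem1)
    remove(elem2)
    elems.foreach(e => remove(e))
    this
  }

  def contents: Set[A] = items
}

object BagDemo {
  def main(args: Array[String]): Unit = {
    val bag = new Bag(Set(1, 2, 3, 4, 5))
    println(bag.remove(1).contents)       // Set(5, 2, 3, 4) or another ordering of 2, 3, 4, 5
    println(bag.remove(2, 3, 4).contents) // Set(5)
  }
}
```

As in the answer above, the multi-argument version here simply delegates to the single-argument one in a loop.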

Scala immutable map, when to go mutable?

My present use case is pretty trivial, either mutable or immutable Map will do the trick.
Have a method that takes an immutable Map, which then calls a 3rd party API method that takes an immutable Map as well
def doFoo(foo: String = "default", params: Map[String, Any] = Map()): Unit = {
  val newMap =
    if (someCondition) params + ("foo" -> foo) else params
  api.doSomething(newMap)
}
The Map in question will generally be quite small, at most there might be an embedded List of case class instances, a few thousand entries max. So, again, assume little impact in going immutable in this case (i.e. having essentially 2 instances of the Map via the newMap val copy).
Still, it nags me a bit, copying the map just to get a new map with a few k->v entries tacked onto it.
I could go mutable and call params.put("bar", bar), etc. for the entries I want to tack on, and then params.toMap to convert to an immutable Map for the API call; that is an option. But then I have to import and pass around mutable maps, which is a bit of a hassle compared to going with Scala's default immutable Map.
So, what are the general guidelines for when it is justified/good practice to use mutable Map over immutable Maps?
Thanks
EDIT
So, it appears that an add operation on an immutable map takes near-constant time, confirming @dhg's and @Nicolas's assertion that a full copy is not made, which solves the problem for the concrete case presented.
Depending on the immutable Map implementation, adding a few entries may not actually copy the entire original Map. This is one of the advantages to the immutable data structure approach: Scala will try to get away with copying as little as possible.
This kind of behavior is easiest to see with a List. If I have a val a = List(1,2,3), then that list is stored in memory. However, if I prepend an additional element like val b = 0 :: a, I do get a new 4-element List back, but Scala did not copy the original list a. Instead, we just created one new link, called it b, and gave it a pointer to the existing List a.
You can envision strategies like this for other kinds of collections as well. For example, if I add one element to a Map, the collection could simply wrap the existing map, falling back to it when needed, all while providing an API as if it were a single Map.
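The sharing described for List can be checked directly with reference equality (eq). The sketch below uses hypothetical names (Sharing, sharesTail) and also shows that adding an entry to an immutable Map leaves the original map untouched:

```scala
object Sharing {
  // True if prepending x to a reuses a itself as the tail, rather than a copy of it.
  def sharesTail[A](a: List[A], x: A): Boolean = (x :: a).tail eq a

  def main(args: Array[String]): Unit = {
    val a = List(1, 2, 3)
    val b = 0 :: a
    println(b)               // List(0, 1, 2, 3)
    println(b.tail eq a)     // true: b points at a, nothing was copied
    println(sharesTail(a, 0)) // true

    val params = Map("x" -> 1)
    val newMap = params + ("foo" -> 2)
    println(params.size) // 1: the original map is unchanged
    println(newMap.size) // 2
  }
}
```

How much an immutable Map shares internally depends on its implementation and size, so the Map part of this sketch only demonstrates that the original is unaffected, not how the sharing is done.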
Using a mutable object is not bad in itself, it becomes bad in a functional programming environment, where you try to avoid side-effects by keeping functions pure and objects immutable.
However, if you create a mutable object inside a function and modify this object, the function is still pure if you don't release a reference to this object outside the function. It is acceptable to have code like:
def buildVector(x: Double, y: Double, z: Double): Vector[Double] = {
  val ary = Array.ofDim[Double](3)
  ary(0) = x
  ary(1) = y
  ary(2) = z
  ary.toVector
}
Now, I think this approach is useful/recommended in two cases: (1) Performance, if creating and modifying an immutable object is a bottleneck of your whole application; (2) Code readability, because sometimes it's easier to modify a complex object in place (rather than resorting to lenses, zippers, etc.)
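Applied to the Map from the question, the same idea might look like the sketch below. withExtras is a hypothetical helper, not something from the question's code base: the mutable map is created, filled and converted inside the function, so callers only ever see immutable Maps.

```scala
import scala.collection.mutable

object LocalMutation {
  // The mutable buffer never escapes this function, so callers
  // only ever deal with immutable Maps.
  def withExtras(params: Map[String, Any], extras: (String, Any)*): Map[String, Any] = {
    val buf = mutable.Map.empty[String, Any]
    buf ++= params
    buf ++= extras
    buf.toMap
  }

  def main(args: Array[String]): Unit = {
    val original: Map[String, Any] = Map("a" -> 1)
    val result = withExtras(original, "foo" -> "default", "bar" -> 2)
    println(result.size)   // 3
    println(original.size) // 1: the input is untouched
  }
}
```

For a handful of entries this is unlikely to beat params + (k -> v); it is meant as an illustration of locally contained mutation rather than as an optimisation.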
In addition to dhg's answer, you can take a look at the performance characteristics of the Scala collections. If an add/remove operation doesn't take linear time, it must be doing something other than simply copying the entire structure. (Note that the converse is not true: just because an operation takes linear time doesn't mean you're copying the whole structure.)
I like to use collection.Map as the declared parameter type (for input or return values) rather than the mutable or immutable Map. collection.Map is a read-only interface that works for both kinds of implementation. A consumer method using a map really doesn't need to know about the map's implementation or how it was constructed. (It's really none of its business anyway.)
If you go with the approach of hiding a map's particular construction (be it mutable or immutable) from the consumers who use it then you're still getting an essentially immutable map downstream. And by using collection.Map as an immutable interface you completely remove all the ".toMap" inefficiency that you would have with consumers written to use immutable.Map typed objects. Having to convert a completely constructed map into another one simply to comply to an interface not supported by the first one really is absolutely unnecessary overhead when you think about it.
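A short sketch of what declaring collection.Map as the parameter type looks like in practice. ReadOnlyView and total are names invented for this example; the same method accepts an immutable and a mutable map without any toMap conversion:

```scala
import scala.collection.mutable

object ReadOnlyView {
  // Declared against the general, read-only collection.Map interface.
  def total(m: scala.collection.Map[String, Int]): Int = m.values.sum

  def main(args: Array[String]): Unit = {
    val fixed = Map("a" -> 1, "b" -> 2) // immutable.Map
    val growing = mutable.Map("a" -> 1) // mutable.Map
    growing += ("b" -> 2)

    println(total(fixed))   // 3
    println(total(growing)) // 3, passed as-is without .toMap
  }
}
```

One caveat worth keeping in mind: a collection.Map parameter may be backed by a mutable map that its owner keeps changing, so the consumer cannot assume the contents stay fixed.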
I suspect in a few years from now we'll look back at the three separate sets of interfaces (mutable maps, immutable maps, and collections maps) and realize that 99% of the time only 2 are really needed (mutable and collections) and that using the (unfortunately) default immutable map interface really adds a lot of unnecessary overhead for the "Scalable Language".

How do I deal with Scala collections generically?

I have realized that my typical way of passing Scala collections around could use some improvement.
def doSomethingCool(theFoos: List[Foo]) = { /* insert cool stuff here */ }
// if I happen to have a List
doSomethingCool(theFoos)
// but elsewhere I may have a Vector, Set, Option, ...
doSomethingCool(theFoos.toList)
I tend to write my library functions to take a List as the parameter type, but I'm certain that there's something more general I can put there to avoid all the occasional .toList calls I have in the application code. This is especially annoying since my doSomethingCool function typically only needs to call map, flatMap and filter, which are defined on all the collection types.
What are my options for that 'something more general'?
Here are more general traits, each of which extends the previous one:
GenTraversableOnce
GenTraversable
GenIterable
GenSeq
The traits above do not specify whether the collection is sequential or parallel. If your code requires that things be executed sequentially (typically, if your code has side effects of any kind), they are too general for it.
The following traits mandate sequential execution:
TraversableOnce
Traversable
Iterable
Seq
LinearSeq
The first one, TraversableOnce only allows you to call one method on the collection. After that, the collection has been "used". In exchange, it is general enough to accept iterators as well as collections.
Traversable is a pretty general collection that has most methods. There are some things it cannot do, however, in which case you need to go to Iterable.
All Iterable implement the iterator method, which allows you to get an Iterator for that collection. This gives it the capability for a few methods not present in Traversable.
A Seq[A] implements the function Int => A, which means you can access any element by its index. This is not guaranteed to be efficient, but it is a guarantee that each element has an index, and that you can make assertions about what that index is going to be. Contrast this with Map and Set, where you cannot tell what the index of an element is.
A LinearSeq is a Seq that provides fast head, tail, isEmpty and prepend. This is as close as you can get to a List without actually using a List explicitly.
Alternatively, you could have an IndexedSeq, which has fast indexed access (something List does not provide).
See also this question and this FAQ based on it.
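As a rough illustration of picking a more general parameter type, here is the doSomethingCool example from the question declared against Iterable, with Int standing in for the question's Foo type. This is only a sketch of one of the options listed above:

```scala
object GenericParam {
  // Only map and filter are needed, so Iterable is general enough here.
  def doSomethingCool(theFoos: Iterable[Int]): Iterable[Int] =
    theFoos.filter(_ % 2 == 0).map(_ * 10)

  def main(args: Array[String]): Unit = {
    println(doSomethingCool(List(1, 2, 3, 4)))   // List(20, 40)
    println(doSomethingCool(Vector(1, 2, 3, 4))) // Vector(20, 40)
    println(doSomethingCool(Set(2)))             // Set(20)
    // An Option is accepted too, through the standard implicit conversion to Iterable.
    println(doSomethingCool(Option(4)))
  }
}
```

The trade-off is that the static result type is also just Iterable, even though the runtime collection keeps its original kind in the List, Vector and Set calls.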
The most obvious one is to use Traversable as the most general trait which will have the goodies you want. However, I think you are generally better sticking to:
Seq
IndexedSeq
Set
Map
A Seq will cover List, Vector, etc., and IndexedSeq will cover Vector and the like. I found myself not using Iterable because I often need (or want) to know the size of the thing I have, and before Scala 2.8 Iterable did not provide access to this, so I kept having to turn things into sequences anyway!
Looks like Traversable and Iterable now have size methods so maybe I should go back to using them! Of course you could start "going mad" with GenTraversableOnce but that is not likely to aid in readability.

Difference between MutableList and ListBuffer

What is the difference between Scala's MutableList and ListBuffer classes in scala.collection.mutable? When would you use one vs the other?
My use case is having a linear sequence where I can efficiently remove the first element, prepend, and append. What's the best structure for this?
A little explanation on how they work.
ListBuffer internally uses Nil and :: to build an immutable List and allows constant-time prepend and append, as well as constant-time removal of the first element. To do so, it keeps a pointer to the first and last elements of the list, and is actually allowed to change the head and tail of the (otherwise immutable) :: class (a nice trick allowed by the private[scala] var members of ::). Its toList method returns the normal immutable List in constant time as well, as it can directly return the structure maintained internally. It is also the default builder for immutable Lists (and thus can indeed be reasonably expected to have constant-time append). If you call toList and then append another element to the buffer, it takes linear time with respect to the current number of elements in the buffer to recreate a new structure, as it must not mutate the exported list any more.
MutableList works internally with LinkedList instead, an (openly, not like ::) mutable linked list implementation which knows of its element and successor (like ::). MutableList also keeps pointers to the first and last element, but toList returns in linear time, as the resulting List is constructed from the LinkedList. Thus, it doesn't need to reinitialize the buffer after a List has been exported.
Given your requirements, I'd say ListBuffer and MutableList are equivalent. If you want to export their internal list at some point, then ask yourself where you want the overhead: at export time, with no overhead if you go on mutating the buffer (then go for MutableList), or only if you mutate the buffer again, with none at export time (then go for ListBuffer).
My guess is that in the 2.8 collection overhaul, MutableList predated ListBuffer and the whole Builder system. Actually, MutableList is predominantly useful from within the collection.mutable package: it has a private[mutable] def toLinkedList method which returns in constant time, and can thus efficiently be used as a delegated builder for all structures that maintain a LinkedList internally.
So I'd also recommend ListBuffer, as it is more likely to get attention and optimization in the future than "purely mutable" structures like MutableList and LinkedList.
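For the use case in the question (prepend, append, remove the first element), a ListBuffer session might look like the sketch below. BufferOps and demo are hypothetical names; the operations used (+=:, +=, remove and toList) are standard ListBuffer methods:

```scala
import scala.collection.mutable.ListBuffer

object BufferOps {
  // Returns (removed first element, list exported before the last append, final contents).
  def demo(): (Int, List[Int], List[Int]) = {
    val buf = ListBuffer(2, 3)
    1 +=: buf                 // prepend
    buf += 4                  // append; buf is now ListBuffer(1, 2, 3, 4)
    val first = buf.remove(0) // remove and return the first element

    val exported = buf.toList // export an immutable List
    buf += 5                  // further mutation must not affect `exported`
    (first, exported, buf.toList)
  }

  def main(args: Array[String]): Unit = {
    val (first, exported, finalContents) = demo()
    println(first)         // 1
    println(exported)      // List(2, 3, 4)
    println(finalContents) // List(2, 3, 4, 5)
  }
}
```

The exported list staying at List(2, 3, 4) after the later append is the guarantee described above: once a List has been handed out, the buffer has to stop mutating that structure.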
This gives you an overview of the performance characteristics: http://www.scala-lang.org/docu/files/collections-api/collections.html ; interestingly, MutableList and ListBuffer do not differ there. The documentation of MutableList says it is used internally as base class for Stack and Queue, so maybe ListBuffer is more the official class from the user perspective?
You want a list (why a list?) that is growable and shrinkable, and you want constant append and prepend. Well, Buffer, a trait, has constant append and prepend, with most other operations linear. I'm guessing that ListBuffer, a class that implements Buffer, has constant time removal of the first element.
So, my own recommendation is for ListBuffer.
First, let's go over some of the relevant types in Scala.
List - An immutable collection with a recursive implementation, i.e. an instance of List has two primary elements, the head and the tail, where the tail references another List.
List[T]
  head: T
  tail: List[T] // recursive
LinkedList - A mutable collection defined as a series of linked nodes, where each node contains a value and a pointer to the next node.
Node[T]
  value: T
  next: Node[T] // sequential
LinkedList[T]
  first: Node[T]
List is a functional (immutable) data structure, whereas LinkedList is more typical of imperative languages.
Now, let's look at:
ListBuffer - A mutable buffer implementation backed by a List.
MutableList - An implementation based on LinkedList (it would have been more self-explanatory if it had been named LinkedListBuffer instead).
They both offer similar complexity bounds on most operations.
However, if you request a List from a MutableList, it has to convert the existing linear representation into the recursive representation, which takes O(n), as @Jean-Philippe Pellet points out. But if you request a Seq from a MutableList, the complexity is O(1).
So, IMO the choice narrows down to the specifics of your code and your preference. Though, I suspect there is a lot more List and ListBuffer out there.
Note that ListBuffer is final/sealed, while you can extend MutableList.
Depending on your application, extensibility may be useful.