Is there any benefit of working with an Iterator over a List - scala

Is there any benefit of manipulating an Iterator over or List ?
I need to know if concatenating 2 iterators is better that concatenating to List ?
In a sense what the fundamental difference between working with iterator over the actual collection.

An Iterator isn't an actual data structure, although it behaves similar to one. It is just a traversal pointer to some actual data structure. Thus, unlike in an actual data structure, an Iterator can't "go back," that is, access old elements. Once you've gone through an Iterator, you're done.
What's cool about Iterator is that you can give it a map, filter, or other transformation elements, and instead of actually modifying any existing data structure, it will instead apply the transformation the next time you ask for an element.
"Concatenating" two Iterators creates a new Iterator that wraps both of them.
On the other hand, Lists are actual collections and can be re-traversed.

Related

How is Scala so efficient with lists?

It's usually considered a bad practice to make unnecessary collections in Java as it consumes some memory and CPU. Scala seems to be pretty efficient with it and encourages to use immutable data structures.
How is Scala so efficient with Lists? What techniques are used to achieve that?
While the comments are correct that the claim that list is particularly efficient is a dubious one, it does much better than doing full copies of the collection for every operation like you would do with Java's standard collections.
The reason for this is List and the other immutable collections are not just mutable collections with mutation methods returning a copy, but are designed differently to with immutability in mind. They Take advantage of something called "structural sharing". If parts of a collection remain the same after a change, then those parts don't need to be copied and the same object can be shared across multiple collections. This works because of immutability, there is no change that they could be altered so it's safe to share.
Imagine the simplest example, prepending to a list.
You have a List(1,2,3) and you want to prepend 0
val original = List(1,2,3)
val updated = 0 :: original
You list would then look something like this
updated original
\ \
0 - - - 1 - - - 2 - - - 3
All that's needed is to create a new node and point it's tail to the head of your original list. Nothing needs to be copied. Similarly the tail and drop operations just need to return a reference to the appropriate node and nothing needs to be copied. This is why List can be quite good with the prepend and tail operations, because it doesn't do any copying even though it creates a "new" List.
Other List operations do require some amount copying, but always as little as possible. As long as part of the tail of a list is unchanged it doesn't need to be copied. For example when concatenating lists, the first list needs to be copied, but then it's tail can just point to the head of the second, so the second list doesn't need to be copied at all. This is why, when concatenating a long and short list it's better to put the shorter list on the "left" as it is the only one that needs to be copied.
Other types of collections are better at different operations. Vector for example can to both prepend and append in amortized constant time, as well as having good random access and update capabilities (though still much worse than a raw mutable array). In most cases it will be more efficient than List while still being immutable. It's implementation is quite complicated. It uses a trie datastructure, with many internal arrays to store data. The unchanged ones can be shared and only the ones that need to be altered by an update operation need to be copied.

How to groupBy an iterator without converting it to list in scala?

Suppose I want to groupBy on a iterator, compiler asks to "value groupBy is not a member of Iterator[Int]". One way would be to convert iterator to list which I want to avoid. I want to do the groupBy such that the input is Iterator[A] and output is Map[B, Iterator[A]]. Such that the part of the iterator is loaded only when that part of element is accessed and not loading the whole list into memory. I also know the possible set of keys, so I can say whether a particular key exists.
def groupBy(iter: Iterator[A], f: fun(A)->B): Map[B, Iterator[A]] = {
.........
}
One possibility is, you can convert Iterator to view and then groupBy as,
iter.toTraversable.view.groupBy(_.whatever)
I don't think this is doable without storing results in memory (and in this case switching to a list would be much easier). Iterator implies that you can make only one pass over the whole collection.
For instance let's say you have a sequence 1 2 3 4 5 6 and you want to groupBy odd an even numbers:
groupBy(it, v => v % 2 == 0)
Then you could query the result with either true and false to get an iterator. The problem should you loop one of those two iterators till the end you couldn't do the same thing for the other one (as you cannot reset an iterator in Scala).
This would be doable should the elements were sorted according to the same rule you're using in groupBy.
As said in other responses, the only way to achieve a lazy groupBy on Iterator is to internally buffer elements. The worst case for the memory will be in O(n). If you know in advance that the keys are well distributed in your iterator, the buffer can be a viable solution.
The solution is relatively complex, but a good start are some methods from the Iterator trait in the Scala source code:
The partition method that uses both the buffered method to keep the head value in memory, and two internal queues (lookahead) for each of the produced iterators.
The span method with also the buffered method and this time a unique queue for the leading iterator.
The duplicate method. Perhaps less interesting, but we can again observe another use of a queue to store the gap between the two produced iterators.
In the groupBy case, we will have a variable number of produced iterators instead of two in the above examples. If requested, I can try to write this method.
Note that you have to know the list of keys in advance. Otherwise, you will need to traverse (and buffer) the entire iterator to collect the different keys to build your Map.

Scala's toList function appears to be slow

I was under the impression that calling seq.toList() on an immutable Seq would be making a new list which is sharing the structural state from the first list. We're finding that this could be really slow and I'm not sure why. It is just sharing the structural state, correct? I can't see why it'd be making an n-time copy of all the elements when it knows they'll never change.
A List in Scala is a particular data structure: instances of :: each containing a value, followed by Nil at the end of the chain.
If you toList a List, it will take O(1) time. If you toList on anything else then it must be converted into a List, which involves O(n) object allocations (all the :: instances).
So you have to ask whether you really want a scala.collection.immutable.List. That's what toList gives you.
Sharing structural state is possible for particular operations on particular data structures.
With the List data structure in Scala, my understanding is that every element refers to the next, starting from the head through the tail, so a singly linked list.
From a structural state sharing perspective, consider the restrictions placed on this from the internal data structure perspective. Adding an element to the head of a List (X) effectively creates a new list (X') with the new element as the head of X' and the old list (X) as the tail. For this particular operation, internal state can be shared completely.
The same operation above can be applied to create a new List (X'), with the new element as the head of X' and any element from X as the tail, as long as you accept that the tail will be the element you choose from X, plus all additional elements it already has in it's data structure.
When you think about it logically, each data structure has an internal structure that allows some operations to be performed with simple shared internal structure and other operations requiring more invasive and costly computations.
The key from my perspective here is having an understanding of the constraints placed on the operations by the internal data structure itself.
For example, consider the same operations above on a doubly linked list data structure and you will see that there are quite different restrictions.
Personally, I find drawing out an understanding of the internal structure can be helpful in understanding the consequences of particular operations.
In the case of the toList operation on an arbitrary sequence, with no knowledge of the arbitrary sequences internal data structure, one has to therefore assume O(n). List.toList has the obvious performance advantage of already being a list.

LinkedList vs MutableList in scala

Below, both descriptions of these data structures: (from Programming in scala book)
Linked lists
Linked lists are mutable sequences that consist of nodes
that are linked with next pointers. In most languages null would be
picked as the empty linked list. That does not work for Scala
collections, because even empty sequences must support all sequence
methods. LinkedList.empty.isEmpty, in par- ticular, should return true
and not throw a NullPointerException. Empty linked lists are encoded
instead in a special way: Their next field points back to the node
itself. Like their immutable cousins, linked lists are best operated
on sequen- tially. In addition, linked lists make it easy to insert an
element or linked list into another linked list.
Mutable lists
A MutableList consists of a single linked list together with a pointer
that refers to the terminal empty node of that list. This makes list
append a con- stant time operation because it avoids having to
traverse the list in search for its terminal node. MutableList is
currently the standard implementation of mutable.LinearSeq in Scala.
Main difference is the addition of the last element's pointer in MutableList type.
Question is: What might be the usage preferring LinkedList rather than MutableList? Isn't MutableList strictly (despite the new pointer) equivalent and even more practical with a tiny addition of used memory (the last element's pointer)?
Since MutableList wraps a LinkedList, most operations involve an extra indirection step. Note that wrapping means, it contains an internal variable to a LinkedList (indeed two, because of the efficient last element lookup). So the linked list is a required building block to realise the mutable list.
If you do not need prepend or look up of the last element, you could thus just go for the LinkedList. Scala offers you a large choice of data structures, so the best is first to make a checklist of all the operations that you require (and their preferred efficiency), then choose the best fit.
Generally, I recommend you to use immutable structures, they are often as efficient as the mutable ones and don't produce problems with concurrency.

Which scala mutable list to use?

This is a followup question to No Scala mutable list
I want to use a mutable list in Scala. I can chose from
scala.collection.mutable.DoubleLinkedList
scala.collection.mutable.LinkedList
scala.collection.mutable.ListBuffer
scala.collection.mutable.MutableList
Which is nice, but what is the "standard", recommended, idiomatic scala way? I just want to use a list that I can add things to on the back.
In my case, I am using a HashMap, where the "lists" (I am meaning it in general sense) will be on value side. Then, I am reading something from a file and for every line, I want to find the right list in the hashmap and append the value to the list.
Depends what you need.
DoubleLinkedList is a linked list which allows you to traverse back-and-forth through the list of nodes. Use its prev and next references to go to the previous or the next node, respectively.
LinkedList is a singly linked list, so there are not prev pointers - if you only traverse to the next element of the list all the time, this is what you need.
EDIT: Note that the two above are meant to be used internally as building blocks for more complicated list structures like MutableLists which support efficient append, and mutable.Queues.
The two collections above both have linear-time append operations.
ListBuffer is a buffer class. Although it is backed by a singly linked list data structure, it does not expose the next pointer to the client, so you can only traverse it using iterators and the foreach.
Its main use is, however, as a buffer and an immutable list builder - you append elements to it via +=, and when you call result, you very efficiently get back a functional immutable.List. Unlike mutable and immutable lists, both append and prepend operations are constant-time - you can append at the end via += very efficiently.
MutableList is used internally, you usually do not use it unless you plan to implement a custom collection class based on the singly linked list data structure. Mutable queues, for example, inherit this class. MutableList class also has an efficient constant-time append operation, because it maintains a reference to the last node in the list.
The documentation's Concrete Mutable Collection Classes page (or the one for 2.12) has an overview of mutable list classes, including explanations on when to use which one.
If you want to append items you shouldn't use a List at all. Lists are good when you want to prepend items. Use ArrayBuffer instead.
I just want to use a list that I can add things to on the back.
Then choose something that implements Growable. I personally suggest one of the Buffer implementations.
I stay away from LinkedList and DoubleLinkedList, as they are present mainly as underlying implementation of other collections, but have quite a few bugs up to Scala 2.9.x. Starting with Scala 2.10.0, I expect the various bug fixes have brought them up to standard. Still, they lack some methods people expect, such as +=, which you'll find on collections based on them.