Scala List.updated

I am curious about List.updated. What is its runtime, and how does it compare to just changing one element in an ArrayBuffer? Under the hood, how does it deal with copying the list? Is this an O(n) procedure? If so, is there an immutable data structure that has an updated-like method without being so slow?
An example is:
val list = List(1, 2, 3)
val list2 = list.updated(2, 5) // list2 = List(1, 2, 5)
var abuf = ArrayBuffer(1, 2, 3)
abuf(2) = 5                    // abuf = ArrayBuffer(1, 2, 5)

The time and memory complexity of the updated(index, value) method is linear in terms of index (not in terms of the size of the list). The first index cells are recreated.
Changing an element in an ArrayBuffer has constant time and memory complexity. The backing array is updated in place, no copying occurs.
This updated method is not slow if you update elements near the beginning of the list. For larger sequences, Vector has a different way to share common parts of the list and will probably do less copying.
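To make the copying concrete, here is a minimal hand-rolled sketch of an updated-style operation on a cons list (illustrative only, not the standard library's exact code). It rebuilds the first index cells and reuses the suffix unchanged:
def updatedList[A](xs: List[A], index: Int, value: A): List[A] = xs match {
  case Nil => throw new IndexOutOfBoundsException(index.toString)
  case _ :: tail if index == 0 => value :: tail // suffix is shared, not copied
  case head :: tail => head :: updatedList(tail, index - 1, value) // cell rebuilt
}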

List.updated is an O(n) operation (linear).
It calls the linear List.splitAt operation to split the list at the index into two lists (before, rest), then builds a new list from the elements of before, the updated element, and the elements of rest.tail.
I'm not sure (this would have to be tested), but it seems that if the updated element is at the start of the list it may be quite efficient, since in theory obtaining rest and reusing rest.tail can be done in constant time.
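That best case is easy to see in isolation; a tiny sketch, assuming a non-empty list:
// Updating index 0 is O(1): one new cell in front of the shared tail.
def updatedHead[A](xs: List[A], value: A): List[A] = value :: xs.tail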

Performance is O(n), since a List doesn't store an index to each element; it is implemented as a chain of links (el1 -> el2 -> el3), so only operations at the head, such as list.head, are O(1).
For index-based updates you should use an IndexedSeq, whose most common implementation is Vector.
Vector doesn't copy all of its data on update; only the small part of the structure around the updated value is rebuilt.
In general, Scala's immutable collections don't copy all of their data when you create a modified instance; they share structure with the original. That is a key difference from Java's collections.
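For example (exact node counts depend on the Vector implementation, but structural sharing is the point):
val v = Vector.tabulate(1000)(identity)
val v2 = v.updated(500, -1) // rebuilds only the short tree path to slot 500
v2(500) // -1
v(500)  // 500: the original is untouched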

Related

Since Scala lists are immutable, are they actually traversed at run-time for operations like length, last, or xs(n)?

Until now I thought a list had to be traversed to count the length of it or get the last element.
Then I thought "since it is immutable, the length or last element, or any element for that matter, are all constant, so maybe some work could be saved by storing those values when the list is created".
If I have a list xs and use xs.length, and later on I use xs.length again, will the list be traversed twice?
Yes, the list is traversed with every call to length.
The thing about List is that there is no "manager" container to store all that information. A reference to a list is actually a reference to the first node of that list, and it only knows about its own data element and the next node in the list. You could come up with a mechanism to cache that information, but it would increase the overhead of List.
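Conceptually, every call does something like the following walk (a sketch, not the library's exact code):
@annotation.tailrec
def listLength[A](xs: List[A], acc: Int = 0): Int = xs match {
  case Nil => acc // reached the end: acc nodes were visited
  case _ :: tl => listLength(tl, acc + 1)
}
If you need the length repeatedly, compute it once and keep it in a val.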
Sometimes. It depends on which implementation of List you are talking about. Most lists are defined as recursive data structures, e.g. head :: (tail: List). I think ListBuffer has constant-time lookup for length.
The docs detail the performance of typical operations.

Scala: how to copy part of a List to another List

I am new to Scala.
I have a list
origList = List[Double] with thousands of elements.
I need to create another list
outList = List[Double]
and copy to it the elements from origList with indices
start, start+1, ..., start+nCopy-1
that is the output list will have nCopy elements.
This part of the code will be executed many times. What is the most efficient way to do that in Scala?
The way people usually do it in Scala is list.slice(start, start + nCopy).
Note that List in Scala is not a random-access container like ArrayList is in Java. It is implemented as a linked list, so, especially if you are going to do this many times, it will help significantly to convert your list to something indexed beforehand: val converted = list.toIndexedSeq or, better, val converted = list.toArray.
.slice on an Array or an IndexedSeq will be much more efficient, especially if the start index is high.
Now, if you are really concerned about the efficiency of this one operation, nothing (unfortunately) beats the good old Java approach:
val converted = list.toArray
val copied = java.util.Arrays.copyOfRange(converted, start, start+nCopy)
This can be orders of magnitude faster than converted.slice (leave alone list.slice) when copying a large enough (hundreds) number of elements.
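Putting it together for the question's scenario (origList, start, and nCopy are the names from the question; copySlice is a hypothetical helper): pay the conversion once, outside the hot path.
val converted: Array[Double] = origList.toArray // O(n), paid once
def copySlice(start: Int, nCopy: Int): Array[Double] =
  java.util.Arrays.copyOfRange(converted, start, start + nCopy) // fast arraycopy
If the result must be a List[Double], call .toList on the returned array, at the cost of rebuilding the list cells.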

Converting a large sequence to a sequence with no repeats in scala really fast

So I have this large sequence, with a lot of repeats, and I need to convert it into a sequence with no repeats. What I have been doing so far is converting the sequence to a set and then back to a sequence; conversion to the set gets rid of the duplicates. However, this is very slow, as I'm given to understand that when converting to a set every pair of elements is compared, and that makes the complexity O(n^2), which is not acceptable. And since I have access to a computer with thousands of cores (through my university), I was wondering whether making things parallel would help.
Initially I thought I'd use Scala Futures to parallelize the code in the following manner: group the elements of the sequence into smaller subgroups by their hash code. That way I have subcollections of the original sequence such that no element appears in two different subcollections and every element is covered. Now I convert these smaller subcollections to sets, back to sequences, and concatenate them. This way I'm guaranteed to get a sequence with no repeats.
But I was wondering if applying the toSet method on a parallel sequence already does this. I thought I'd test this out in the Scala interpreter, but I got roughly the same time for converting to a parallel set vs converting to a non-parallel set.
I was hoping someone could tell me whether conversion to parallel sets works this way or not. I'd be much obliged. Thanks.
EDIT: Is performing a toSet on a parallel collection faster than performing toSet on a non parallel collection?
.distinct with some of the Scala collection types is O(n) (as of Scala 2.11). It uses a hash set to record what has already been seen, and with that it builds up the result in a single linear pass:
def distinct: Repr = {
  val b = newBuilder
  val seen = mutable.HashSet[A]()
  for (x <- this) {
    if (!seen(x)) {
      b += x
      seen += x
    }
  }
  b.result()
}
(newBuilder is like a mutable list.)
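So, assuming you don't specifically need a Set as the result, calling distinct directly avoids the set round-trip entirely:
val xs = Seq(3, 1, 3, 2, 1, 2)
xs.distinct // Seq(3, 1, 2): keeps first occurrences, O(n) expected time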
Just thinking outside the box, would it be possible to prevent the creation of these duplicates in the first place, instead of trying to get rid of them afterwards?
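For instance, a sketch along those lines, assuming you control the producing loop (produceValues is a hypothetical stand-in for whatever generates your elements): accumulate into an insertion-ordered set, so duplicates never materialize.
import scala.collection.mutable
def produceValues(): Iterator[Int] = Iterator(3, 1, 3, 2, 1, 2) // stand-in generator
val seen = mutable.LinkedHashSet.empty[Int]
for (x <- produceValues()) seen += x // repeats are ignored on insert
val result: Seq[Int] = seen.toSeq    // first-insertion order preserved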

Scala which data structure is most efficient for my intended operations?

I am running a dynamic programming function where I carry a list of strings throughout the process.
Over time, I append new strings to the end of this list, and occasionally I may remove the last element. Right now I am using a mutable ListBuffer, doing += for the appends and .trimEnd(1) for the removals.
Once my dynamic programming procedure is done, I need to be able to access each element of that list/sequence efficiently, and in order (the first item I inserted will be accessed first, whereas the last item I inserted will be accessed last).
I've also tried ArrayBuffers, but both seem too slow. I am trying to speed this process up, and I am wondering if I am using a data structure with O(n) operations when there may be something with O(1) operations for what I need.
A simple singly linked list will provide O(1) for the append/discard portions of what you describe. Worst case, if you need to reverse the list at the end before processing, that would be an O(n) operation paid once.
Note that if you go down this road, during the accumulation phase "first" and "last" will be reversed (you will prepend items and drop the first item by getting the tail of the list).
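A sketch of that idiom with a plain immutable List; every step is O(1) except the final reverse:
var acc: List[String] = Nil
acc = "first" :: acc      // O(1) "append" (really a prepend)
acc = "second" :: acc
acc = acc.tail            // O(1) removal of the most recently added element
acc = "third" :: acc
val inOrder = acc.reverse // O(n) once: List("first", "third")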

How to groupBy an iterator without converting it to list in scala?

Suppose I want to groupBy on an iterator; the compiler complains that "value groupBy is not a member of Iterator[Int]". One way would be to convert the iterator to a list, which I want to avoid. I want to do the groupBy such that the input is Iterator[A] and the output is Map[B, Iterator[A]], where each part of the iterator is loaded only when it is accessed, without loading the whole collection into memory. I also know the possible set of keys, so I can say whether a particular key exists.
def groupBy[A, B](iter: Iterator[A], f: A => B): Map[B, Iterator[A]] = {
  ???
}
One possibility: you can convert the Iterator to a view and then groupBy, as in
iter.toTraversable.view.groupBy(_.whatever)
I don't think this is doable without storing results in memory (and in this case switching to a list would be much easier). Iterator implies that you can make only one pass over the whole collection.
For instance, let's say you have a sequence 1 2 3 4 5 6 and you want to groupBy odd and even numbers:
groupBy(it, v => v % 2 == 0)
Then you could query the result with either true or false to get an iterator. The problem is that, should you loop through one of those two iterators to the end, you couldn't do the same for the other one (as you cannot reset an iterator in Scala).
This would be doable if the elements were sorted according to the same rule you're using in groupBy.
As said in the other responses, the only way to achieve a lazy groupBy on an Iterator is to buffer elements internally. Worst-case memory usage is O(n). If you know in advance that the keys are well distributed in your iterator, the buffer can be a viable solution.
The solution is relatively complex, but a good starting point is a few methods from the Iterator trait in the Scala source code:
The partition method, which uses the buffered method to keep the head value in memory, plus two internal queues (lookahead), one for each of the produced iterators.
The span method, which also uses the buffered method, this time with a single queue for the leading iterator.
The duplicate method. Perhaps less interesting, but we can again observe another use of a queue to store the gap between the two produced iterators.
In the groupBy case, we will have a variable number of produced iterators instead of two in the above examples. If requested, I can try to write this method.
Note that you have to know the list of keys in advance. Otherwise, you will need to traverse (and buffer) the entire iterator to collect the different keys to build your Map.
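Since the keys are known up front here, a hedged sketch of such a buffering groupBy, in the spirit of partition but with one internal queue per key (worst-case O(n) memory, as noted above):
import scala.collection.mutable
def groupByIterator[A, B](iter: Iterator[A], keys: Set[B])(f: A => B): Map[B, Iterator[A]] = {
  val buffers: Map[B, mutable.Queue[A]] =
    keys.iterator.map(k => k -> mutable.Queue.empty[A]).toMap
  // Advance the source until the queue for `key` is non-empty (or the
  // source is exhausted), routing every element to its own key's queue.
  def fill(key: B): Boolean = {
    while (buffers(key).isEmpty && iter.hasNext) {
      val a = iter.next()
      buffers(f(a)) += a // assumes f only ever produces keys in `keys`
    }
    buffers(key).nonEmpty
  }
  keys.iterator.map { key =>
    key -> new Iterator[A] {
      def hasNext: Boolean = fill(key)
      def next(): A =
        if (fill(key)) buffers(key).dequeue()
        else throw new NoSuchElementException(s"key $key is exhausted")
    }
  }.toMap
}
For the odd/even example above: groupByIterator(Iterator(1, 2, 3, 4, 5, 6), Set(true, false))(_ % 2 == 0). Each returned iterator pulls from the source lazily, but draining one key to the end parks every other key's elements in its queue.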