Scala: how to update values in an immutable list

I have an immutable list and need a new copy of it with elements replaced at multiple index locations. List.updated is an O(n) operation and can only replace one element at a time. What is an efficient way of doing this? Thanks!

List is not a good fit if you need random element access/update. From the documentation:
This class is optimal for last-in-first-out (LIFO), stack-like access patterns. If you need another access pattern, for example, random access or FIFO, consider using a collection more suited to this than List.
More generally, what you need is an indexed sequence instead of a linear one (such as List). From the documentation of IndexedSeq:
Indexed sequences support constant-time or near constant-time element access and length computation. They are defined in terms of abstract methods apply for indexing and length.
Indexed sequences do not add any new methods to Seq, but promise efficient implementations of random access patterns.
The default concrete implementation of IndexedSeq is Vector, so you may consider using it.
Here's an extract from its documentation:
Vector is a general-purpose, immutable data structure. It provides random access and updates in effectively constant time, as well as very fast append and prepend. Because vectors strike a good balance between fast random selections and fast random functional updates, they are currently the default implementation of immutable indexed sequences.

If you need to stay with a List, another option is to rebuild it in a single O(n) pass instead of calling updated repeatedly:

list
  .iterator
  .zipWithIndex                                          // pairs are (element, index)
  .map { case (element, index) => newElementFor(index) }
  .toList
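If you do switch to Vector, you can apply a whole batch of replacements by folding over them. A minimal sketch, where the updates map and its values are made up for illustration:

val vec     = Vector("a", "x", "c", "x")
val updates = Map(1 -> "b", 3 -> "d")   // hypothetical index -> new value pairs

// updated on a Vector is effectively constant time, so applying k updates
// costs about O(k), instead of O(k * n) as it would on a List
val result = updates.foldLeft(vec) { case (v, (i, x)) => v.updated(i, x) }
// result: Vector("a", "b", "c", "d")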

Related

What is the most effective structure for appending elements to a List-like collection in Scala?

I have to append elements to my collection. Which structure is preferable? Appending to a List costs O(n); what about ListBuffer, ArrayBuffer, Set, Map and other structures?
ListBuffer, according to the docs:
It provides constant time prepend and append.
But it is a mutable structure, so be careful when using it - preferably within a very limited scope (e.g. a function or method).
ArrayBuffer according to the documentation:
Prepends and removes are linear in the buffer size.
Because this structure is built on top of a dynamic array, it sometimes requires an internal array copy for resizing, which on the JVM is almost constant but still not exactly constant time. See the System.arraycopy documentation for more details. It is also a mutable structure.
Set and Map are not what you called List-like at all. A Set is an unordered structure (a list IS ordered) which contains ONLY unique elements. A Map[K, V] stores, as the name suggests, a mapping from keys of type K to values of type V.
So, in conclusion: if you need to append elements I'd suggest going with ListBuffer, but since it is a mutable structure, limit the scope of its usage, and whenever you need to pass it somewhere, convert it to a List.
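A minimal sketch of that pattern (buildSquares is just a made-up example method):

import scala.collection.mutable.ListBuffer

def buildSquares(n: Int): List[Int] = {
  val buf = ListBuffer.empty[Int]   // mutation stays confined to this method
  for (i <- 1 to n) buf += i * i    // constant-time append
  buf.toList                        // callers only ever see an immutable List
}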

What are Builder, Combiner, and Splitter in Scala?

In the parallel programming course from EPFL, four abstractions for data parallelism are mentioned: Iterator, Builder, Combiner, and Splitter.
I am familiar with Iterator, but have never used the other three. I have seen the traits Builder, Combiner, and Splitter under the scala.collection package. However, I have no idea how to use them in real-world development, particularly how to use them in collaboration with other collections like List, Array, ParArray, etc. Could anyone please give me some guidance and examples?
Thanks!
The two traits Iterator and Builder are not specific to parallelism; however, they provide the basis for Combiner and Splitter.
You already know that an Iterator can help you with iterating over sequential collections by providing the methods hasNext and next. A Splitter is a special case of an Iterator and helps to partition a collection into multiple disjoint subsets. The idea is that after the splitting, these subsets can be processed in parallel. You can obtain a Splitter from a parallel collection by invoking .splitter on it. The two important methods of the Splitter trait are as follows:
remaining: Int: returns the number of elements of the current collection, or at least an approximation of that number. This information is important, since it is used to decide whether or not it's worth splitting the collection. If your collection contains only a small number of elements, then you want to process them sequentially instead of splitting the collection into even smaller subsets.
split: Seq[Splitter[A]]: the method that actually splits the current collection. It returns the disjoint subsets (represented as Splitters), which can recursively be split again if it's worth it. If the subsets are small enough, they can finally be processed (e.g. filtered or mapped). A toy sketch of this contract follows below.
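To make the contract concrete, here is a self-contained toy splitter over a Vector. MySplitter and splitterOf are illustrative names, not the actual library traits (whose members are access-restricted in some versions):

trait MySplitter[A] extends Iterator[A] {
  def remaining: Int
  def split: Seq[MySplitter[A]]
}

def splitterOf[A](v: Vector[A]): MySplitter[A] = new MySplitter[A] {
  private val it = v.iterator
  private var left = v.length
  def hasNext = it.hasNext
  def next() = { left -= 1; it.next() }
  def remaining = left
  // for simplicity, assume split is called before any element is consumed
  def split = {
    val (a, b) = v.splitAt(v.length / 2)   // two disjoint halves
    Seq(splitterOf(a), splitterOf(b))
  }
}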
Builders are used internally to create new (sequential) collections. A Combiner is a special case of a Builder and at the same time represents the counterpart to Splitter. While a Splitter splits your collection before it is processed in parallel, a Combiner puts the results back together afterwards. You can obtain a Combiner from a parallel collection (subset) by invoking .newCombiner on it. The combining itself is done via the following method:
combine(that: Combiner[A, B]): Combiner[A, B]: combines your current collection with another collection by "merging" both Combiners. The result is a new Combiner, which either represents the final result or gets combined again with another subset (by the way: the type parameters A and B represent the element type and the type of the resulting collection).
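A minimal sketch, using the combiner factory on the ParVector companion object (on instances, newCombiner may be access-restricted depending on the version; on Scala 2.13+ parallel collections live in the separate scala-parallel-collections module):

import scala.collection.parallel.Combiner
import scala.collection.parallel.immutable.ParVector

// two combiners, as if filled by two independent parallel tasks
val c1: Combiner[Int, ParVector[Int]] = ParVector.newCombiner[Int]
c1 += 1
c1 += 2
val c2 = ParVector.newCombiner[Int]
c2 += 3

val merged = c1.combine(c2)    // "merge" the partial results
val result = merged.result()   // ParVector(1, 2, 3)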
The thing is that you don't need to implement or even use these methods directly unless you are defining a new parallel collection. The idea is that people implementing new parallel collections only need to define splitters and combiners, and they get a whole bunch of other operations for free, because those operations are already implemented in terms of splitters and combiners.
Of course this is only a superficial description of how those things work. For further reading, I recommend reading Architecture of the Parallel Collections Library as well as Creating Custom Parallel Collections.

Why is Buffer not a subclass of IndexedSeq?

In the scala collections library Buffer inherits from Seq:
Buffer[A] extends Seq[A] with GenericTraversableTemplate[A, Buffer] with BufferLike[A, Buffer[A]] with scala.Cloneable
and the Buffer documentation says:
Buffers are used to create sequences of elements incrementally by appending, prepending, or inserting new elements. It is also possible to access and modify elements in a random access fashion via the index of the element in the current sequence.
while the IndexedSeq docs says:
A base trait for indexed sequences.
Indexed sequences support constant-time or near constant-time element access and length computation. They are defined in terms of abstract methods apply for indexing and length.
Indexed sequences do not add any new methods to Seq, but promise efficient implementations of random access patterns.
Since Buffer already extends Seq, and IndexedSeq does not add any methods to Seq, Buffer must already implement the IndexedSeq interface, and according to the documentation it should meet the non-functional requirements of IndexedSeq. So why is Buffer not an IndexedSeq?
Buffer is not an IndexedSeq because it does not guarantee near constant-time element access and length computation. For example, ListBuffer supports neither, as you can see in this description of the performance characteristics of Scala collections.
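You can see the difference between the two concrete Buffers directly:

import scala.collection.mutable.{ArrayBuffer, ListBuffer}

val ab = ArrayBuffer.tabulate(100000)(identity)
val lb = ListBuffer.tabulate(100000)(identity)

ab(99999)   // effectively constant time: indexes into the backing array
lb(99999)   // linear time: walks the internal linked list from the head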

Asymptotic behaviour of Scala methods

Is there somewhere I can find out the expected time and space complexities of operations on collections like HashSet, TreeSet, List and so on?
Is one just expected to know these from the properties of the abstract-data-types themselves?
I know of Performance characteristics for Scala collections, but this only mentions some very basic operations. Perhaps the rest of the operations for these collections are built purely from a small base-set, but then, it seems I am just expected to know that they have implemented them in this way?
The guide for the other methods should be: just think about what an efficient implementation would look like.
Most other bulk operations on collections (operations that process each element in the collection) are O(n), so they are not mentioned there. Examples are filter, map, foreach, indexOf, reverse, find, ...
Methods returning iterators or streams, like combinations and permutations, are usually O(1).
Methods involving 2 collections are usually O(max(n, m)) or O(min(n, m)). These are zip, zipAll, sameElements, corresponds, ...
Methods union, diff, and intersect are O(n + m).
Sort variants are, naturally, O(n log n). groupBy is O(n log n) in the current implementation. indexOfSlice uses the KMP algorithm and is O(m + n), where m and n are the lengths of the two sequences.
Methods such as +:, :+ or patch are generally O(n) as well, unless you are dealing with a specific case of an immutable collection for which the operation in question is more efficient - for example, prepending an element on a functional List or appending an element to a Vector.
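For example:

val xs = List(2, 3, 4)
val ys = 1 +: xs           // O(1): the new list shares xs as its tail
val zs = xs :+ 5           // O(n): appending to a List copies all elements

val v = Vector(1, 2, 3)
val w = v :+ 4             // effectively constant time on Vector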
Methods toX are generally O(n), as they have to iterate over all the elements and create a new collection. An exception is toStream, which builds the collection lazily - so it's O(1). Also, whenever X is the type of the collection, toX just returns this, and is O(1).
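A quick illustration (toStream is deprecated in favour of LazyList on Scala 2.13+):

val big = (1 to 1000000).toList
big.toStream   // O(1): wraps the list lazily, no copy up front
big.toVector   // O(n): iterates all elements into a new Vector
big.toList     // O(1): already a List, simply returns this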
Iterator implementations should have O(1) (amortized) next and hasNext operations. Iterator creation should be O(log n) in the worst case, but O(1) in most cases.
The performance characteristics of the other methods are really difficult to assert. Consider the following:
These methods are all implemented based on foreach or iterator, usually at very high levels in the hierarchy. Vector's map is implemented on collection.TraversableLike, for example.
To add insult to injury, which method implementation is used depends on the linearization of the class inheritance. This also applies to any method called as a helper. It has happened before that changes here caused unforeseen performance problems.
Since foreach and iterator are both O(n), any improved performance depends on specialization of other methods, such as size and slice.
For many of them, there's further dependency on the performance characteristics of the builder that was provided, which depends on the call site instead of the definition site.
So the result is that the place where the method is defined - and documented - does not have nearly enough information to state its performance characteristics, which may depend not only on how other methods are implemented by the inheriting collection, but even on the performance characteristics of an object, the Builder obtained from CanBuildFrom, that is passed at the call site.
At best, any such documentation would be described in terms of other methods. That doesn't mean it isn't worthwhile, but it isn't easily done - and hard tasks on open source projects depend on volunteers, who usually work on what they like, not on what is needed.

What is the preferred way of using the parallel collections in Scala?

At first I assumed that every collection class would receive an additional par method which would convert the collection to a fitting parallel data structure (like map returns the best collection for the element type in Scala 2.8).
Now it seems that some collection classes support a par method (e.g. Array) but others have toParSeq, toParIterable methods (e.g. List). This is a bit weird, since Array isn't used or recommended that often.
What is the reason for that? Wouldn't it be better to just have a par available on all collection classes doing the "right thing"?
If I have data which might be processed in parallel, what types should I use? The traits in scala.collection or the type of the implementation directly?
Or should I prefer Arrays now, because they seem to be cheaper to parallelize?
Lists aren't that well suited for parallel processing. The reason is that to get to the end of the list, you have to walk through every single element. Thus, you may as well treat the list as an iterator, and therefore use something more generic like toParIterable.
Any collection that has a fast index is a good candidate for parallel processing. This includes anything implementing IndexedSeq, plus trees and hash tables. Array has as fast an index as you can get, so it's a fairly natural choice. You can also use things like ArrayBuffer (which has a par method returning a ParArray).
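A short sketch of both cases; the import is only needed on Scala 2.13+, where parallel collections moved to the separate scala-parallel-collections module:

import scala.collection.parallel.CollectionConverters._

val arr = Array.tabulate(1000000)(identity)
val doubled = arr.par.map(_ * 2)   // ParArray: splits the index range cheaply

val list = List(1, 2, 3)
val p = list.par                   // works, but first copies the list into a
                                   // parallel structure by walking every element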