Partially sorting collections in Scala

I am trying to sort a collection of linked-list nodes. The collection contains nodes from more than one linked list; ordering must be maintained within each list, but ordering across lists does not matter.
PartialOrdering[T] seems like the natural choice, but I cannot find any standard functions within Scala that support it (e.g. .sort only takes Ordering[T]).
I've considered wrapping the former type into the latter, but realise this will actually produce erroneous results. A partial ordering cannot be abstracted away like this, as the underlying sort algorithm needs the additional information to produce correct results.
I would like to represent the elements as a SortedSet - is anyone aware of anything that can get me close?
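One possible direction, sketched here under assumptions rather than offered as a definitive answer: if each node can report which list it belongs to and its position within that list, a total Ordering that compares by list first and position second is a linear extension of the partial order, so the standard sort and SortedSet can use it. The Node case class and the names listId, pos, sameList and linearExtension are hypothetical and only serve the illustration.

```scala
import scala.collection.immutable.SortedSet

object PartialOrderSketch {
  // Hypothetical node: the list it belongs to and its position within that list.
  final case class Node(listId: Int, pos: Int)

  // The partial order from the question: nodes are comparable only within one list.
  val sameList: PartialOrdering[Node] = new PartialOrdering[Node] {
    def tryCompare(a: Node, b: Node): Option[Int] =
      if (a.listId == b.listId) Some(Integer.compare(a.pos, b.pos)) else None
    def lteq(a: Node, b: Node): Boolean =
      a.listId == b.listId && a.pos <= b.pos
  }

  // A total order that agrees with sameList wherever sameList is defined.
  // The order across lists (by listId) is an arbitrary choice.
  val linearExtension: Ordering[Node] =
    Ordering.by((n: Node) => (n.listId, n.pos))

  def sortNodes(nodes: Seq[Node]): Seq[Node] =
    nodes.sorted(linearExtension)

  def asSortedSet(nodes: Seq[Node]): SortedSet[Node] =
    SortedSet.empty[Node](linearExtension) ++ nodes
}
```

This only works when list membership and position are cheap to obtain; it does not address the general case of sorting under an arbitrary PartialOrdering.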

Related

What are Builder, Combiner, and Splitter in scala?

In the parallel programming course from EPFL, four abstractions for data parallelism are mentioned: Iterator, Builder, Combiner, and Splitter.
I am familiar with Iterator, but have never used the other three. I have seen the traits Builder, Combiner, and Splitter under the scala.collection package. However, I have no idea how to use them in real-world development, particularly how to use them together with other collections like List, Array, ParArray, etc. Could anyone please give me some guidance and examples?
Thanks!
The two traits Iterator and Builder are not specific to parallelism; however, they provide the basis for Combiner and Splitter.
You already know that an Iterator can help you with iterating over sequential collections by providing the methods hasNext and next. A Splitter is a special case of an Iterator and helps to partition a collection into multiple disjoint subsets. The idea is that after the splitting, these subsets can be processed in parallel. You can obtain a Splitter from a parallel collection by invoking .splitter on it. The two important methods of the Splitter trait are as follows:
remaining: Int: returns the number of elements of the current collection, or at least an approximation of that number. This information is important, since it is used to decide whether or not it's worth it to split the collection. If your collection contains only a small number of elements, then you want to process these elements sequentially instead of splitting the collection into even smaller subsets.
split: Seq[Splitter[A]]: the method that actually splits the current collection. It returns the disjoint subsets (represented as Splitters), which can be split again recursively if it's worth it. Once the subsets are small enough, they can finally be processed (e.g. filtered or mapped).
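To make the role of remaining and split a bit more concrete, here is a minimal sketch that does not use the library's Splitter at all. SimpleSplitter, ArraySplitter and the threshold parameter are hypothetical names invented for this illustration, and the sub-splitters are processed one after another here, whereas the real library would hand them to parallel tasks.

```scala
object SplitterSketch {
  // Hypothetical stand-in for the library's Splitter: an Iterator that can also split.
  trait SimpleSplitter[A] extends Iterator[A] {
    def remaining: Int
    def split: Seq[SimpleSplitter[A]]
  }

  // Iterates over the index range [from, until) of an array.
  final class ArraySplitter[A](arr: Array[A], private var from: Int, until: Int)
      extends SimpleSplitter[A] {
    def hasNext: Boolean = from < until
    def next(): A = { val a = arr(from); from += 1; a }
    def remaining: Int = until - from
    // Two disjoint halves covering the elements not yet consumed.
    def split: Seq[SimpleSplitter[A]] = {
      val mid = from + remaining / 2
      Seq(new ArraySplitter(arr, from, mid), new ArraySplitter(arr, mid, until))
    }
  }

  // Split while a chunk is larger than the threshold, otherwise sum it sequentially.
  // The threshold is clamped to 1 so that splitting always terminates.
  def sum(s: SimpleSplitter[Int], threshold: Int): Int =
    if (s.remaining <= math.max(threshold, 1)) s.foldLeft(0)(_ + _)
    else s.split.map(sum(_, threshold)).sum
}
```

For example, SplitterSketch.sum(new SplitterSketch.ArraySplitter(Array(1, 2, 3, 4), 0, 4), 2) splits once and then sums the two halves.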
Builders are used internally to create new (sequential) collections. A Combiner is a special case of a Builder and at the same time represents the counterpart to Splitter. While a Splitter splits your collection before it is processed in parallel, a Combiner puts together the results afterwards. You can obtain a Combiner from a parallel collection (subset) by invoking .newCombiner on it. The results are put together via the following method:
combine(that: Combiner[A, B]): Combiner[A, B]: combines your current collection with another collection by "merging" both Combiners. The result is a new Combiner, which either represents the final result, or gets combined again with another subset (by the way: the type parameters A and B represent the element type and the type of the resulting collection).
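The merge step can be sketched in the same spirit. SimpleCombiner below is a hypothetical class written for this illustration, not the library's Combiner; it only shows the idea of a builder with an extra combine operation, and the two parts are processed sequentially rather than in parallel.

```scala
import scala.collection.mutable.ArrayBuffer

object CombinerSketch {
  // Hypothetical stand-in for the library's Combiner: a builder that can merge with another.
  final class SimpleCombiner[A] {
    private val buf = ArrayBuffer.empty[A]
    def +=(elem: A): this.type = { buf += elem; this }
    def result(): Vector[A] = buf.toVector
    // Merging keeps the elements of this combiner before those of the other one.
    def combine(that: SimpleCombiner[A]): SimpleCombiner[A] = {
      val merged = new SimpleCombiner[A]
      merged.buf ++= this.buf
      merged.buf ++= that.buf
      merged
    }
  }

  // Filter two parts independently, then merge the partial results.
  def filterEven(left: Seq[Int], right: Seq[Int]): Vector[Int] = {
    def run(part: Seq[Int]): SimpleCombiner[Int] = {
      val c = new SimpleCombiner[Int]
      for (x <- part if x % 2 == 0) c += x
      c
    }
    run(left).combine(run(right)).result()
  }
}
```

This sketch copies elements when merging; as far as I know, the library's combiners try to avoid that cost, so treat it as a conceptual picture only.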
The thing is that you don't need to implement or even use these methods directly if you don't define a new parallel collection. The idea is that people implementing new parallel collections only need to define splitters and combiners and get a whole bunch of other operations for free, because those operations are already implemented and make use of splitters and combiners.
Of course this is only a superficial description of how those things work. For further reading, I recommend Architecture of the Parallel Collections Library as well as Creating Custom Parallel Collections.

LinkedList vs MutableList in scala

Below are the descriptions of both data structures (from the book Programming in Scala):
Linked lists
Linked lists are mutable sequences that consist of nodes
that are linked with next pointers. In most languages null would be
picked as the empty linked list. That does not work for Scala
collections, because even empty sequences must support all sequence
methods. LinkedList.empty.isEmpty, in particular, should return true
and not throw a NullPointerException. Empty linked lists are encoded
instead in a special way: Their next field points back to the node
itself. Like their immutable cousins, linked lists are best operated
on sequentially. In addition, linked lists make it easy to insert an
element or linked list into another linked list.
Mutable lists
A MutableList consists of a single linked list together with a pointer
that refers to the terminal empty node of that list. This makes list
append a constant time operation because it avoids having to
traverse the list in search for its terminal node. MutableList is
currently the standard implementation of mutable.LinearSeq in Scala.
The main difference is the additional pointer to the last element in the MutableList type.
The question is: when would you prefer LinkedList over MutableList? Isn't MutableList strictly equivalent (despite the new pointer) and even more practical, at the cost of a tiny amount of extra memory (the last element's pointer)?
Since MutableList wraps a LinkedList, most operations involve an extra indirection step. Wrapping here means that it holds internal references to a LinkedList (two, in fact, to make lookup of the last element efficient). So the linked list is a required building block for the mutable list.
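The effect of keeping a second reference to the end of the list can be sketched without the library classes. Node and TailTracked below are hypothetical types made up for this illustration; they are not how LinkedList or MutableList are actually implemented, they only contrast walking to the end with following a stored reference.

```scala
object TailPointerSketch {
  // Hypothetical singly linked node; an empty `next` marks the end of the list.
  final class Node[A](val elem: A) { var next: Option[Node[A]] = None }

  // Bare linked list: appending has to walk to the last node first, O(n).
  def appendByWalking[A](head: Node[A], elem: A): Unit = {
    var cur = head
    while (cur.next.isDefined) cur = cur.next.get
    cur.next = Some(new Node(elem))
  }

  // Wrapper that also remembers the last node: appending is O(1),
  // but every access goes through the wrapper first.
  final class TailTracked[A] {
    private var head: Option[Node[A]] = None
    private var last: Option[Node[A]] = None

    def append(elem: A): Unit = {
      val n = new Node(elem)
      last match {
        case Some(l) => l.next = Some(n)
        case None    => head = Some(n)
      }
      last = Some(n)
    }

    def toList: List[A] =
      Iterator.iterate(head)(_.flatMap(_.next))
        .takeWhile(_.isDefined)
        .map(_.get.elem)
        .toList
  }
}
```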
If you do not need prepend or look up of the last element, you could thus just go for the LinkedList. Scala offers you a large choice of data structures, so the best is first to make a checklist of all the operations that you require (and their preferred efficiency), then choose the best fit.
Generally, I recommend using immutable structures; they are often as efficient as the mutable ones and don't cause problems with concurrency.

Which scala mutable list to use?

This is a followup question to No Scala mutable list
I want to use a mutable list in Scala. I can choose from
scala.collection.mutable.DoubleLinkedList
scala.collection.mutable.LinkedList
scala.collection.mutable.ListBuffer
scala.collection.mutable.MutableList
Which is nice, but what is the "standard", recommended, idiomatic scala way? I just want to use a list that I can add things to on the back.
In my case, I am using a HashMap, where the "lists" (I mean this in a general sense) will be on the value side. Then, I am reading something from a file and for every line, I want to find the right list in the hashmap and append the value to the list.
Depends what you need.
DoubleLinkedList is a linked list which allows you to traverse back-and-forth through the list of nodes. Use its prev and next references to go to the previous or the next node, respectively.
LinkedList is a singly linked list, so there are no prev pointers - if you only traverse to the next element of the list all the time, this is what you need.
EDIT: Note that the two above are meant to be used internally as building blocks for more complicated list structures like MutableLists which support efficient append, and mutable.Queues.
The two collections above both have linear-time append operations.
ListBuffer is a buffer class. Although it is backed by a singly linked list data structure, it does not expose the next pointer to the client, so you can only traverse it using iterators and the foreach.
Its main use is, however, as a buffer and an immutable list builder - you append elements to it via +=, and when you call result, you very efficiently get back a functional immutable.List. Unlike mutable and immutable lists, both append and prepend operations are constant-time - you can append at the end via += very efficiently.
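A small sketch of the ListBuffer usage just described, with made-up values: append with +=, prepend with +=:, then call result to get an immutable List. The object and method names are only for illustration.

```scala
import scala.collection.mutable.ListBuffer

object ListBufferSketch {
  def build(): List[Int] = {
    val buf = ListBuffer.empty[Int]
    buf += 2      // append at the end
    buf += 3
    1 +=: buf     // prepend at the front
    buf.result()  // hand back an immutable List
  }
}
```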
MutableList is used internally, you usually do not use it unless you plan to implement a custom collection class based on the singly linked list data structure. Mutable queues, for example, inherit this class. MutableList class also has an efficient constant-time append operation, because it maintains a reference to the last node in the list.
The documentation's Concrete Mutable Collection Classes page (or the one for 2.12) has an overview of mutable list classes, including explanations on when to use which one.
If you want to append items you shouldn't use a List at all. Lists are good when you want to prepend items. Use ArrayBuffer instead.
I just want to use a list that I can add things to on the back.
Then choose something that implements Growable. I personally suggest one of the Buffer implementations.
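For the use case in the question (a HashMap whose values are growable lists, filled line by line), a minimal sketch could look like the following. The "key=value" line format and the names GroupLinesSketch and group are assumptions made for the example, since the question does not say what the file looks like.

```scala
import scala.collection.mutable

object GroupLinesSketch {
  // Assumed input format: one "key=value" pair per line.
  def group(lines: Iterator[String]): mutable.HashMap[String, mutable.ArrayBuffer[String]] = {
    val byKey = mutable.HashMap.empty[String, mutable.ArrayBuffer[String]]
    for (line <- lines) {
      line.split("=", 2) match {
        case Array(key, value) =>
          // Find the buffer for this key, creating it on first use, then append.
          byKey.getOrElseUpdate(key, mutable.ArrayBuffer.empty[String]) += value
        case _ =>
          () // lines without '=' are skipped in this sketch
      }
    }
    byKey
  }
}
```

Swapping ArrayBuffer for ListBuffer should only require changing the type, since both support += for appending.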
I stay away from LinkedList and DoubleLinkedList, as they exist mainly as the underlying implementation of other collections and had quite a few bugs up to Scala 2.9.x. Starting with Scala 2.10.0, I expect the various bug fixes have brought them up to standard. Still, they lack some methods people expect, such as +=, which you'll find on collections based on them.

Scala SeqLike distinct preserves order?

The apidoc of distinct in SeqLike says:
Builds a new sequence from this sequence without any duplicate elements.
Returns: A new sequence which contains the first occurrence of every element of this sequence.
Am I correct in thinking that no ordering guarantee is provided? More generally, do methods of SeqLike provide any process-in-order (and return-in-order) guarantee?
On the contrary: operations on Seqs guarantee the output order (unless the API says otherwise). This is one of the basic properties of sequences, where the order matters, versus sets, where only containment matters.
It depends on the collection you were using in the first place. If you had a list you'll get your order. If on the other hand you had a set, then probably not.
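A quick check of the behaviour described above on a List, with arbitrary values chosen for the illustration: the first occurrence of each element is kept, in its original position.

```scala
object DistinctSketch {
  val xs: List[Int] = List(3, 1, 3, 2, 1)

  // Keeps the first occurrence of each element, in sequence order.
  val firstOccurrences: List[Int] = xs.distinct // List(3, 1, 2)
}
```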

What is the preferred way of using the parallel collections in Scala?

At first I assumed that every collection class would receive an additional par method which would convert the collection to a fitting parallel data structure (like map returns the best collection for the element type in Scala 2.8).
Now it seems that some collection classes support a par method (e.g. Array) but others have toParSeq, toParIterable methods (e.g. List). This is a bit weird, since Array isn't used or recommended that often.
What is the reason for that? Wouldn't it be better to just have a par available on all collection classes doing the "right thing"?
If I have data which might be processed in parallel, what types should I use? The traits in scala.collection or the type of the implementation directly?
Or should I prefer Arrays now, because they seem to be cheaper to parallelize?
Lists aren't that well suited for parallel processing. The reason is that to get to the end of the list, you have to walk through every single element. Thus, you may as well treat the list as an iterator and use something more generic like toParIterable.
Any collection that has a fast index is a good candidate for parallel processing. This includes anything implementing IndexedSeq, plus trees and hash tables. Array has as fast an index as you can get, so it's a fairly natural choice. You can also use things like ArrayBuffer (which has a par method returning a ParArray).