Scala - TrieMap vs Vector - scala

I read that TrieMap in scala is based on has array mapped trie, whike Vector reads bit mapped vector trie.
Are both darastructures backed by the same idea of a hash trie or is there a difference between these?

There are some similarities, but fundamentally they are different data structures:
Vector
There is no hashing involved in Vector. The index directly describes the path into the tree. And of course, the occupied indices of a vector are consecutive.
Disregarding all the trickery with the display pointers in the production implementation of scala.collection.immutable.Vector, every branch node in a vector except for the last one at a level has the same number of children (32 in case of the scala Vector). That allows indexing using simple bit manipulation. The downside is that splicing elements in the middle of a vector is expensive.
HashMap
In a HashTrieMap, the hash code is the path into the tree. That means that the occupied indices are not consecutive, but evenly distributed. This requires a different encoding of the tree branch nodes.
In a HashTrieMap, a branch node has up to 32 children (But if you have a very bad hash code distribution it is entirely possible to have a branch node with only one child). There is an Int bitmap to encode which child corresponds to which position, which means that looking up values in a HashTrieMap requires frequent calls to Integer.bitCount, which fortunately is a CPU intrinsic on modern CPUs.
Here is a fun project that allows you to look at the internals of scala data structures such as Vector and HashMap: https://github.com/stanch/reftree
The images in this answer were generated using this project.

Related

Structural Sharing in Scala Vector

Structural sharing in Scala List is straightforward and easy to understand. But Scala Vector is a more complicated data structure than a list. How is structural sharing achieved in Scala Vector?
Vector is basically a tree (trie) with 32-wide branching at each level. If you have a Vector of, say, 3000 elements and you want to index element 2045, for example, which converts to 100000010101 in binary, it will decompose it into 5-bit blocks to use as indices into the tree: 10 (i.e. 2) in the first branch then 00000 (i.e. 0) in the next, and finally 10101 (i.e. 21) in the terminal branch, and then there's the data.
Given this structure, it's easy to see how to structurally share things: you can share any sub-trees that aren't changed. So if you make a new vector with a different element 2045, you have to change not all 3000 elements but recreate "only" three arrays of size 32: the terminal one is replaced by a copy with its element 21 updated; then its parent has to be replaced by a copy with this new child in index 0; then its parent has to be replaced with the correct subtree in index 2.
Now, this provides quite a lot of structural sharing as long as you have far more than 32 elements in your vector, but it's still a pretty big overhead. Because of this, additions to the end of the vector are special-cased so that you just add to the existing array. The old Vectors still point to that array, but they think the end is earlier (and that part is unchanged) so it works out okay.
There's a more complex but similar scheme to allow addition at the front of a vector in a similar fashion (basically, by leaving space at the front and keeping track of where to point via indices and offsets in addition to the indexing scheme).
The trick as implemented doesn't work to allow alternating addition to both front and back, though, so there you effectively rebuild the trees every addition. Making a version with even better structural sharing would be possible, but it would probably be a bit slower to access.

How does Scala's Vector work?

I read this page about the time complexity of Scala collections. As it says, Vector's complexity is eC for all operations.
It made me wonder what Vector is. I read the document and it says:
Because vectors strike a good balance between fast random selections and fast random functional updates, they are currently the
default implementation of immutable indexed sequences. It is backed by
a little endian bit-mapped vector trie with a branching factor of 32.
Locality is very good, but not contiguous, which is good for very
large sequences.
As with everything else about Scala, it's pretty vague. How actually does Vector work?
The keyword here is Trie.
Vector is implemented as a Trie datastructure.
See http://en.wikipedia.org/wiki/Trie.
More precisely, it is a "bit-mapped vector trie". I've just found a consise enough description of the structure (along with an implementation - apparently in Rust) here:
https://bitbucket.org/astrieanna/bitmapped-vector-trie
The most relevant excerpt is:
A Bitmapped Vector Trie is basically a 32-tree. Level 1 is an array of size 32, of whatever data type. Level 2 is an array of 32 Level 1's. and so on, until: Level 7 is an array of 2 Level 6's.
UPDATE: In reply to Lai Yu-Hsuan's comment about complexity:
I will have to assume you meant "depth" here :-D. The legend for "eC" says "The operation takes effectively constant time, but this might depend on some assumptions such as maximum length of a vector or distribution of hash keys.".
If you are willing to consider the worst case, and given that there is an upper bound to the maximum size of the vector, then yes indeed we can say that the complexity is constant.
Say that we consider the maximum size to be 2^32, then this means that the worst case is 7 operations at most, in any case.
Then again, we can always consider the worst case for any type of collection, find an upper bound and say this is constant complexity, but for a list by example, this would mean a constant of 4 billions, which is not quite practical.
But Vector is the opposite, 7 operations being more than practical, and this is how we can afford to consider its complexity constant in practice.
Another way to look at this: we are not talking about log(2,N), but log(32,N). If you try to plot that you'll see it is practically an horizontal line. So pragmatically speaking you'll never be able to see much increase in processing time as the collection grows.
Yes, that's still not really constant (which is why it is marked as "eC" and not just "C"), and you'll be able to see a difference around short vectors (but again, a very small difference because the number of operations grows so much slowly).
The other answers re 'Trie' are good. But as a close approximation, just for quick understanding:
Vector internally uses a tree structure - not a binary tree, but a 32-ary tree
Each '32-way node' uses Array[32] and can store either 0-32 references to child nodes or 0-32 pieces of data
The tree is structured to be balanced in a certain way - it is "n" levels deep, but levels 1 to n-1 are "index-only levels" (100% child references; no data) and level n contains all the data (100% data; no child references). So if the number of elements of data is "d" then n = log-base-32(d) rounded upwards
Why this? Simple: for performance.
Instead of doing thousands/millions/gazillions of memory allocations for each individual data element, memory is allocated in 32 element chunks. Instead of walking miles deep to find your data, the structure is quite shallow - it's a very wide, short tree. E.g. 5 levels deep can contain 32^5 data elements (for 4 byte elements = 132GB i.e. pretty big) and each data access would lookup & walk through 5 nodes from the root (whereas a big array would use a single data access). The vector does not proactively allocat memory for all of Level n (data), - it allocates in 32 element chunks as needed. It gives read performance somewhat similar to a huge array, whilst having functional characteristics (power & flexibility & memory-efficiency) somewhat similar to a binary tree.
:)
These may be interesting for you:
Ideal Hash Trees by Phil Bagwell.
Implementing Persistent Vectors in Scala - Daniel Spiewak
More Persistent Vectors: Performance Analysis - Daniel Spiewak
Persistent data structures in Scala

Hash operator in Matlab for linear indices of vectors

I am clustering a large set of points. Throughout the iterations, I want to avoid re-computing cluster properties if the assigned points are the same as the previous iteration. Each cluster keeps the IDs of its points. I don't want to compare them element wise, comparing the sum of the ID vector is risky (a small ID can be compensated with a large one), may be I should compare the sum of squares? Is there a hashing method in Matlab which I can use with confidence?
Example data:
a=[2,13,14,18,19,21,23,24,25,27]
b=[6,79,82,85,89,111,113,123,127,129]
c=[3,9,59,91,99,101,110,119,120,682]
d=[11,57,74,83,86,90,92,102,103,104]
So the problem is that if I just check the sum, it could be that cluster d for example, looses points 11,103 and gets 9,105. Then I would mistakenly think that there has been no change in the cluster.
This is one of those (very common) situations where the more we know about your data and application the better we are able to help. In the absence of better information than you provide, and in the spirit of exposing the weakness of answers such as this in that absence, here are a couple of suggestions you might reject.
One appropriate data structure for set operations is a bit-set, that is a set of length equal to the cardinality of the underlying universe of things in which each bit is set on or off according to the things membership of the (sub-set). You could implement this in Matlab in at least two ways:
a) (easy, but possibly consuming too much space): define a matrix with as many columns as there are points in your data, and one row for each cluster. Set the (cluster, point) value to true if point is a member of cluster. Set operations are then defined by vector operations. I don't have a clue about the relative (time) efficiency of setdiff versus rowA==rowB.
b) (more difficult): actually represent the clusters by bit sets. You'll have to use Matlab's bit-twiddling capabilities of course, but the pain might be worth the gain. Suppose that your universe comprises 1024 points, then you'll need an array of 16 uint64 values to represent the bit set for each cluster. The presence of, say, point 563 in a cluster requires that you set, for the bit set representing that cluster, bit 563 (which is probably bit 51 in the 9th element of the set) to 1.
And perhaps I should have started by writing that I don't think that this is a hashing sort of a problem, it's a set sort of a problem. Yeah, you could use a hash but then you'll have to program around the limitations of using a screwdriver on a nail (choose your preferred analogy).
If I understand correctly, to hash the ID's I would recommend using the matlab Java interface to use the Java hashing algorithms
http://docs.oracle.com/javase/1.4.2/docs/api/java/security/MessageDigest.html
You'll do something like:
hash = java.security.MessageDigest.getInstance('SHA');
Hope this helps.
I found the function
DataHash on FEX it is quiet fast for vectors and the strcmp on the keys is a lot faster than I expected.

How is a bitmapped vector trie faster than a plain vector?

It's supposedly faster than a vector, but I don't really understand how locality of reference is supposed to help this (since a vector is by definition the most locally packed data possible -- every element is packed next to the succeeding element, with no extra space between).
Is the benchmark assuming a specific usage pattern or something similar?
How this is possible?
bitmapped vector tries aren't strictly faster than normal vectors, at least not at everything. It depends on what operation you are considering.
Conventional vectors are faster, for example, at accessing a data element at a specific index. It's hard to beat a straight indexed array lookup. And from a cache locality perspective, big arrays are pretty good if all you are doing is looping over them sequentially.
However a bitmapped vector trie will be much faster for other operations (thanks to structural sharing) - for example creating a new copy with a single changed element without affecting the original data structure is O(log32 n) vs. O(n) for a traditional vector. That's a huge win.
Here's an excellent video well worth watching on the topic, which includes a lot of the motivation of why you might want these kind of structures in your language: Persistent Data Structures and Managed References (talk by Rich Hickey).
There is a lot of good stuff in the other answers but nobdy answers your question. The PersistenVectors are only fast for lots of random lookups by index (when the array is big). "How can that be?" you might ask. "A normal flat array only needs to move a pointer, the PersistentVector has to go through multiple steps."
The answer is "Cache Locality".
The cache always gets a range from memory. If you have a big array it does not fit the cache. So if you want to get item x and item y you have to reload the whole cache. That's because the array is always sequential in memory.
Now with the PVector that's diffrent. There are lots of small arrays floating around and the JVM is smart about that and puts them close to each other in memory. So for random accesses this is fast; if you run through it sequentially it's much slower.
I have to say that I'm not an expert on hardware or how the JVM handles cache locality and I have never benchmarked this myself; I am just retelling stuff I've heard from other people :)
Edit: mikera mentions that too.
Edit 2: See this talk about Functional Data-Structures, skip to the last part if you are only intrested in the vector. http://www.infoq.com/presentations/Functional-Data-Structures-in-Scala
A bitmapped vector trie (aka a persistent vector) is a data structure invented by Rich Hickey for Clojure, that has been implementated in Scala since 2010 (v 2.8). It is its clever bitwise indexing strategy that allows for highly efficient access and modification of large data sets.
From Understanding Clojure's Persistent Vectors :
Mutable vectors and ArrayLists are generally just arrays which grows
and shrinks when needed. This works great when you want mutability,
but is a big problem when you want persistence. You get slow
modification operations because you'll have to copy the whole array
all the time, and it will use a lot of memory. It would be ideal to
somehow avoid redundancy as much as possible without losing
performance when looking up values, along with fast operations. That
is exactly what Clojure's persistent vector does, and it is done
through balanced, ordered trees.
The idea is to implement a structure which is similar to a binary
tree. The only difference is that the interior nodes in the tree have
a reference to at most two subnodes, and does not contain any elements
themselves. The leaf nodes contain at most two elements. The elements
are in order, which means that the first element is the first element
in the leftmost leaf, and the last element is the rightmost element in
the rightmost leaf. For now, we require that all leaf nodes are at the
same depth2. As an example, take a look at the tree below: It has
the integers 0 to 8 in it, where 0 is the first element and 8 the
last. The number 9 is the vector size:
If we wanted to add a new element to the end of this vector and we
were in the mutable world, we would insert 9 in the rightmost leaf
node, like this:
But here's the issue: We cannot do that if we want to be persistent.
And this would obviously not work if we wanted to update an element!
We would need to copy the whole structure, or at least parts of it.
To minimize copying while retaining full persistence, we perform path
copying: We copy all nodes on the path down to the value we're about
to update or insert, and replace the value with the new one when we're
at the bottom. A result of multiple insertions is shown below. Here,
the vector with 7 elements share structure with a vector with 10
elements:
The pink coloured nodes are shared between the vectors, whereas the
brown and blue are separate. Other vectors not visualized may also
share nodes with these vectors.
More info
Besides Understanding Clojure's Persistent Vectors, the ideas behind this data structure and its use cases are also explained pretty well in David Nolen's 2014 lecture Immutability, interactivity & JavaScript, from which the screenshot below was taken. Or if you really want to dive deeply into the technical details, see also Phil Bagwell's Ideal Hash Trees, which was the paper upon which Hickey's initial Clojure implementation was based.
What do you mean by "plain vector"? Just a flat array of items? That's great if you never update it, but if you ever change a 1M-element flat-vector you have to do a lot of copying; the tree exists to allow you to share most of the structure.
Short explanation: it uses the fact that the JVM optimizes so hard on read/write/copy array data structures. The key aspect IMO is that if your vector grows to a certain size index management becomes a  bottleneck . Here comes the very clever algorithm from persisted vector into play, on very large collections it outperforms the standard variant. So basically it is a functional data-structure which only performed so well because it is built up on small mutable highly optimizes JVM datastructures.
For further details see here (at the end)
http://topsy.com/vimeo.com/28760673
Judging by the title of the talk, it's talking about Scala vectors, which aren't even close to "the most locally packed data possible": see source at https://lampsvn.epfl.ch/trac/scala/browser/scala/tags/R_2_9_1_final/src/library/scala/collection/immutable/Vector.scala.
Your definition only applies to Lisps (as far as I know).

When should I choose Vector in Scala?

It seems that Vector was late to the Scala collections party, and all the influential blog posts had already left.
In Java ArrayList is the default collection - I might use LinkedList but only when I've thought through an algorithm and care enough to optimise. In Scala should I be using Vector as my default Seq, or trying to work out when List is actually more appropriate?
As a general rule, default to using Vector. It’s faster than List for almost everything and more memory-efficient for larger-than-trivial sized sequences. See this documentation of the relative performance of Vector compared to the other collections. There are some downsides to going with Vector. Specifically:
Updates at the head are slower than List (though not by as much as you might think)
Another downside before Scala 2.10 was that pattern matching support was better for List, but this was rectified in 2.10 with generalized +: and :+ extractors.
There is also a more abstract, algebraic way of approaching this question: what sort of sequence do you conceptually have? Also, what are you conceptually doing with it? If I see a function that returns an Option[A], I know that function has some holes in its domain (and is thus partial). We can apply this same logic to collections.
If I have a sequence of type List[A], I am effectively asserting two things. First, my algorithm (and data) is entirely stack-structured. Second, I am asserting that the only things I’m going to do with this collection are full, O(n) traversals. These two really go hand-in-hand. Conversely, if I have something of type Vector[A], the only thing I am asserting is that my data has a well defined order and a finite length. Thus, the assertions are weaker with Vector, and this leads to its greater flexibility.
Well, a List can be incredibly fast if the algorithm can be implemented solely with ::, head and tail. I had an object lesson of that very recently, when I beat Java's split by generating a List instead of an Array, and couldn't beat that with anything else.
However, List has a fundamental problem: it doesn't work with parallel algorithms. I cannot split a List into multiple segments, or concatenate it back, in an efficient manner.
There are other kinds of collections that can handle parallelism much better -- and Vector is one of them. Vector also has great locality -- which List doesn't -- which can be a real plus for some algorithms.
So, all things considered, Vector is the best choice unless you have specific considerations that make one of the other collections preferable -- for example, you might choose Stream if you want lazy evaluation and caching (Iterator is faster but doesn't cache), or List if the algorithm is naturally implemented with the operations I mentioned.
By the way, it is preferable to use Seq or IndexedSeq unless you want a specific piece of API (such as List's ::), or even GenSeq or GenIndexedSeq if your algorithm can be run in parallel.
Some of the statements here are confusing or even wrong, especially the idea that immutable.Vector in Scala is anything like an ArrayList.
List and Vector are both immutable, persistent (i.e. "cheap to get a modified copy") data structures.
There is no reasonable default choice as their might be for mutable data structures, but it rather depends on what your algorithm is doing.
List is a singly linked list, while Vector is a base-32 integer trie, i.e. it is a kind of search tree with nodes of degree 32.
Using this structure, Vector can provide most common operations reasonably fast, i.e. in O(log_32(n)). That works for prepend, append, update, random access, decomposition in head/tail. Iteration in sequential order is linear.
List on the other hand just provides linear iteration and constant time prepend, decomposition in head/tail. Everything else takes in general linear time.
This might look like as if Vector was a good replacement for List in almost all cases, but prepend, decomposition and iteration are often the crucial operations on sequences in a functional program, and the constants of these operations are (much) higher for vector due to its more complicated structure.
I made a few measurements, so iteration is about twice as fast for list, prepend is about 100 times faster on lists, decomposition in head/tail is about 10 times faster on lists and generation from a traversable is about 2 times faster for vectors. (This is probably, because Vector can allocate arrays of 32 elements at once when you build it up using a builder instead of prepending or appending elements one by one).
Of course all operations that take linear time on lists but effectively constant time on vectors (as random access or append) will be prohibitively slow on large lists.
So which data structure should we use?
Basically, there are four common cases:
We only need to transform sequences by operations like map, filter, fold etc:
basically it does not matter, we should program our algorithm generically and might even benefit from accepting parallel sequences. For sequential operations List is probably a bit faster. But you should benchmark it if you have to optimize.
We need a lot of random access and different updates, so we should use vector, list will be prohibitively slow.
We operate on lists in a classical functional way, building them by prepending and iterating by recursive decomposition: use list, vector will be slower by a factor 10-100 or more.
We have an performance critical algorithm that is basically imperative and does a lot of random access on a list, something like in place quick-sort: use an imperative data structure, e.g. ArrayBuffer, locally and copy your data from and to it.
For immutable collections, if you want a sequence, your main decision is whether to use an IndexedSeq or a LinearSeq, which give different guarantees for performance. An IndexedSeq provides fast random-access of elements and a fast length operation. A LinearSeq provides fast access only to the first element via head, but also has a fast tail operation. (Taken from the Seq documentation.)
For an IndexedSeq you would normally choose a Vector. Ranges and WrappedStrings are also IndexedSeqs.
For a LinearSeq you would normally choose a List or its lazy equivalent Stream. Other examples are Queues and Stacks.
So in Java terms, ArrayList used similarly to Scala's Vector, and LinkedList similarly to Scala's List. But in Scala I would tend to use List more often than Vector, because Scala has much better support for functions that include traversal of the sequence, like mapping, folding, iterating etc. You will tend to use these functions to manipulate the list as a whole, rather than randomly accessing individual elements.
In situations which involve a lot random access and random mutation, a Vector (or – as the docs say – a Seq) seems to be a good compromise. This is also what the performance characteristics suggest.
Also, the Vector class seems to play nicely in distributed environments without much data duplication because there is no need to do a copy-on-write for the complete object. (See: http://akka.io/docs/akka/1.1.3/scala/stm.html#persistent-datastructures)
If you're programming immutably and need random access, Seq is the way to go (unless you want a Set, which you often actually do). Otherwise List works well, except it's operations can't be parallelized.
If you don't need immutable data structures, stick with ArrayBuffer since it's the Scala equivalent to ArrayList.