How are Scala immutable indexed sequences implemented and what is the complexity of their operations?

I vaguely recall reading somewhere that Scala's immutable indexed sequence operations are O(log n), but that the base of the logarithm is large enough so that for all practical purposes the operations are almost like O(1). Is that true?
How is IndexedSeq implemented to achieve this?

The default implementation of immutable.IndexedSeq is Vector. Here's an excerpt from the relevant documentation about its implementation:
Vectors are represented as trees with a high branching factor (The branching factor of a tree or a graph is the number of children at each node). Every tree node contains up to 32 elements of the vector or contains up to 32 other tree nodes. Vectors with up to 32 elements can be represented in a single node. Vectors with up to 32 * 32 = 1024 elements can be represented with a single indirection. Two hops from the root of the tree to the final element node are sufficient for vectors with up to 2^15 elements, three hops for vectors with 2^20, four hops for vectors with 2^25 elements and five hops for vectors with up to 2^30 elements. So for all vectors of reasonable size, an element selection involves up to 5 primitive array selections. This is what we meant when we wrote that element access is “effectively constant time”.
immutable.HashSet and immutable.HashMap are implemented using a similar technique.
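To get a feel for what "effectively constant time" buys you in practice, here's a small, hypothetical demo (plain Scala, nothing beyond the standard library): indexed reads and functional updates both touch only a handful of trie levels, and updated shares almost the entire structure with the original vector.

```scala
object VectorDemo extends App {
  // The default immutable IndexedSeq is backed by Vector.
  val xs: IndexedSeq[Int] = IndexedSeq(1, 2, 3)
  println(xs)                        // Vector(1, 2, 3)

  // ~1M elements: reads and updates walk at most ~4-5 trie levels.
  val big = Vector.tabulate(1 << 20)(identity)
  println(big(123456))               // read by index: 123456

  // `updated` copies only the ~log32(n) nodes on the path to the slot;
  // the rest of the tree is shared with `big`.
  val patched = big.updated(123456, -1)
  println(patched(123456))           // -1
  println(big(123456))               // still 123456: `big` is unchanged
}
```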

The default IndexedSeq is a Vector, which is a tree (a trie, actually) with a fanout of 32. So, not counting memory locality, you never exceed an O(log n) factor of about 6; compare with a binary tree, where it ranges from 1 to ~30.
That said, if you also account for memory locality, you will notice a huge difference between indexing into a 1G-element Vector and a 10-element Vector. (You'll notice a pretty big difference with an Array, too.)

Related

Amortization complexity when resizing arrays by a constant?

I know that when you grow an array by a multiplicative factor (like doubling its length and copying all elements into the new, bigger array), the amortized time complexity is O(1).
But why is it not O(1) as well when you grow it by a constant (say, adding 10 slots each time)?
Edit: https://www.cs.utexas.edu/~slaberge/docs/topics/amortized/dynamic_arrays/ this site seems to explain it, but I am very confused by the math. Where does big $N$ come from? I thought we were dealing with $k$?
If every k-th insertion costs as much as the number of elements already in the array (denote the size by n + N*k, where n is the initial size of the array), then you get sequences of this type:
n O(1) Operations
Expensive operation of O(n)
k O(1) Operations
Expensive operation of O(n+k)
k O(1) Operations
Expensive operation of O(n+2k)
k O(1) Operations
Expensive operation of O(n+3k)
See where this is going? Each expensive insertion happens every k insertions (except the first) and costs as much as the current number of elements.
This means that after, let's simplify, n + A*k insertions, we made A copies of the initial n elements, plus A-1 copies of the first set of k elements, A-2 copies of the second set of k elements, and so on.
This sums up to O(An + A^2 * k) total work. And because we performed n + A*k insertions, we can divide to get the amortized cost.
This gives us (An + A^2 * k) / (n + A*k) = Θ(A).
So the amortized cost depends on A, the NUMBER OF INSERTIONS, which is bad: we cannot claim that this array does a constant amount of work per insertion on average.
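To see this concretely, here is a small simulation (my sketch, not part of the original answer) that counts how many element copies each growth policy performs over n insertions:

```scala
object GrowthCost extends App {
  // Total elements copied while appending n items, where `grow` maps the
  // current capacity to the next capacity when the array is full.
  def totalCopies(n: Int, grow: Int => Int): Long = {
    var capacity = 1
    var size = 0
    var copies = 0L
    for (_ <- 0 until n) {
      if (size == capacity) {   // full: allocate a bigger array, copy everything
        copies += size
        capacity = grow(capacity)
      }
      size += 1
    }
    copies
  }

  val n = 1000000
  // Doubling: total copies ~ 2n, i.e. O(1) amortized per insertion.
  println(totalCopies(n, c => c * 2).toDouble / n)   // ~1
  // +10: total copies ~ n^2/20, i.e. O(n) amortized per insertion.
  println(totalCopies(n, c => c + 10).toDouble / n)  // ~50000
}
```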

Scala - TrieMap vs Vector

I read that TrieMap in Scala is based on a hash array mapped trie, while Vector is based on a bit-mapped vector trie.
Are both data structures backed by the same idea of a hash trie, or is there a difference between them?
There are some similarities, but fundamentally they are different data structures:
Vector
There is no hashing involved in Vector. The index directly describes the path into the tree. And of course, the occupied indices of a vector are consecutive.
Disregarding all the trickery with the display pointers in the production implementation of scala.collection.immutable.Vector, every branch node in a vector except for the last one at a level has the same number of children (32 in case of the scala Vector). That allows indexing using simple bit manipulation. The downside is that splicing elements in the middle of a vector is expensive.
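To make that concrete, here is a simplified sketch (assumed node types, not the actual scala.collection.immutable.Vector source) of how the index alone, 5 bits per level, selects the path through the tree:

```scala
// Each level of the trie consumes 5 bits of the index (2^5 = 32 children).
sealed trait Node[A]
final case class Branch[A](children: Array[Node[A]]) extends Node[A]
final case class Leaf[A](values: Array[A]) extends Node[A]

def get[A](root: Node[A], index: Int, depth: Int): A = {
  @annotation.tailrec
  def go(node: Node[A], level: Int): A = node match {
    // At branch level `level`, the next 5 bits of the index pick the child.
    case Branch(children) => go(children((index >>> (5 * level)) & 0x1f), level - 1)
    // At the leaf, the lowest 5 bits pick the element.
    case Leaf(values)     => values(index & 0x1f)
  }
  go(root, depth) // depth = number of branch levels above the leaves
}
```

Because every node except the last at each level is full, no per-node bookkeeping is needed; the arithmetic on the index is the whole lookup.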
HashMap
In a HashTrieMap, the hash code is the path into the tree. That means that the occupied indices are not consecutive, but evenly distributed. This requires a different encoding of the tree branch nodes.
In a HashTrieMap, a branch node has up to 32 children (But if you have a very bad hash code distribution it is entirely possible to have a branch node with only one child). There is an Int bitmap to encode which child corresponds to which position, which means that looking up values in a HashTrieMap requires frequent calls to Integer.bitCount, which fortunately is a CPU intrinsic on modern CPUs.
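Here's a sketch of that bitmap encoding (a simplified stand-in for the real branch node, with hypothetical names): the branch stores only its occupied children in a compact array, and Integer.bitCount over the low bits of the mask converts a logical slot into a physical position.

```scala
// Simplified hash-trie branch node: `bitmap` has bit i set iff logical
// slot i (out of 32) is occupied; `children` stores only occupied slots.
final class BitmapNode[A](bitmap: Int, children: Array[A]) {
  def lookup(hash: Int, shift: Int): Option[A] = {
    val slot = (hash >>> shift) & 0x1f        // 5 bits of the hash pick a slot
    val bit  = 1 << slot
    if ((bitmap & bit) == 0) None             // slot empty: key not present here
    else {
      // Physical position = number of occupied slots before this one.
      val pos = Integer.bitCount(bitmap & (bit - 1))
      Some(children(pos))
    }
  }
}
```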
Here is a fun project that allows you to look at the internals of scala data structures such as Vector and HashMap: https://github.com/stanch/reftree
The images in this answer were generated using this project.

Which data structure to store binary strings and query with hamming distance

I'm looking for a data structure to handle billions of binary strings, each containing 512 binary values.
My goal is to send queries to the structure and get a result set which contains all data within a given distance.
My first idea was to use a k-d tree. But those trees are very slow in high dimensions.
My second idea is to use an LSH approach (MinHash/SuperBit). But for that I must also have a structure to perform an efficient search.
Any ideas how to handle this much data?
**Update**
Some detailed notes:
for the hamming distance there exists only an upper limit, which is maybe 128; but I don't know the upper limit ahead of time
insertion or deletion would be nice, but I can also rebuild the structure (the database is only updated once a week)
the result set must contain all relevant nodes (I'm not looking for kNN)
Without knowing your intended search parameters, it's hard to optimize too much. That said, I think a good approach would be to build a B-tree or T-tree and then optimize that structure for the binary nature of the data.
Specifically, you have 64 bytes of data as a 512-element bit string. Your estimate is that you will have "billions" of records. That's on the order of 2^32 values, so 1/16th of the space will be full? (Does this agree with your expectations?)
Anyway, try breaking the data into bytes, let each byte be a key level. You can probably compress the level records, if the probability of set bits is uniform. (If not, if say set bits are more likely at the front of the key, then you might want to just allocate 256 next-level pointers and have some be null. It's not always worth it.)
All of your levels will be the same: each represents 8 more bits of the string. So compute a table that maps, for a byte, all the byte values that are within distance S of that byte, for 0 <= S <= 8. Also, compute a table that maps two bytes a and b to the distance between them, hamming(a, b).
To traverse the tree, let your search distance be SD. Set D = SD. Read the top-level block. Find all 8-bit values in the block less than distance min(8, D) from your query. For each value, compute the exact distance hamming(query, value) and recurse into the lower block with D = D - hamming(query, value) for that subtree.
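In code, the two byte-level tables might look like this (my sketch with hypothetical names, treating bytes as unsigned 0..255 ints):

```scala
// dist(a)(b): hamming distance between two byte values.
val dist: Array[Array[Int]] =
  Array.tabulate(256, 256)((a, b) => Integer.bitCount(a ^ b))

// within(a)(s): all byte values at hamming distance <= s from a, 0 <= s <= 8.
val within: Array[Array[Array[Int]]] =
  Array.tabulate(256, 9)((a, s) =>
    (0 until 256).filter(b => dist(a)(b) <= s).toArray)
```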
The biggest design problem I see here is the closure requirement: we need to return all items within distance N of a given vector, for arbitrary N. The data space is sparse: "billions" is on the order of 2^33, but we have 512 bits of information, so there is only 1 entry per 2^(512-33) possibilities. For randomly distributed keys, the expected distance between any two nodes is 256; the expected nearest-neighbour distance is somewhere around 180.
This leads me to expect that your search will hinge on non-random clusters of data, and that your search will be facilitated by recognition of that clustering. This will be a somewhat painful pre-processing step on the initial data, but should be well worthwhile.
My general approach to this is to first identify those clusters in some generally fast way. Start with a hashing function that returns a very general distance metric. For instance, for any vector, compute the distances to each of a set of orthogonal reference vectors. For 16 bits, you might take the following set (listed in hex): 0000, 00FF, 0F0F, 3333, 5555, i.e. successively finer "grains" of alternating bits. Return this hash as a simple tuple of the 4-bit distances, a total of 20 bits (there are actual savings for long vectors, as one of the sizes is 2^(2^N)).
Now, this hash tuple allows you a rough estimate of the hamming distance, such that you can cluster the vectors more easily: vectors that are similar must have similar hash values.
Within each cluster, find a central element, and then characterize each element of the cluster by its distance from that center. For more speed, give each node a list of its closest neighbors with distances, all of them within the cluster. This gives you a graph for each cluster.
Similarly connect all the cluster centers, giving direct edges to the nearer cluster centers. If your data are reasonably amenable to search, then we'll be able to guarantee that, for any two nodes A, B with cluster centers Ac and Bc, we will have d(A, Ac) + d(B, Bc) < d(A, B). Each cluster is a topological neighbourhood.
The query process is now somewhat faster. For a target vector V, find the hash value. Find cluster centers that are close enough to that value that something in their neighbourhoods might match ([actual distance] - [query range] - [cluster radius]). This will allow you to eliminate whole clusters right away, and may give you an entire cluster of "hits". For each marginal cluster (some, but not all, nodes qualify), you'll need to find a node that works by something close to brute force (start in the middle of the range of viable distances from the cluster center), and then do a breadth-first search of each node's neighbors.
I expect that this will give you something comparable to optimal performance. It also adapts decently to additions and deletions, so long as those are not frequent enough to change cluster membership for other nodes.
The set of vectors is straightforward. Write out the bit patterns for the 16-bit case:
0000 0000 0000 0000 16 0s
0000 0000 1111 1111 8 0s, 8 1s
0000 1111 0000 1111 4 0s, 4 1s, repeat
0011 0011 0011 0011 2 0s, 2 1s, repeat
0101 0101 0101 0101 1 0s, 1 1s, repeat
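A sketch of the resulting hash for the 16-bit case (hypothetical helper, using the reference vectors above): because of the triangle inequality, |d(a, r) - d(b, r)| <= d(a, b) for every reference r, so vectors that are close in hamming distance must have close hash tuples.

```scala
// Orthogonal 16-bit reference vectors: successively finer alternating grains.
val refs = Seq(0x0000, 0x00FF, 0x0F0F, 0x3333, 0x5555)

// Hash a 16-bit vector as the tuple of its hamming distances to the refs.
def hashTuple(v: Int): Seq[Int] =
  refs.map(r => Integer.bitCount((v ^ r) & 0xFFFF))
```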

How does Scala's Vector work?

I read this page about the time complexity of Scala collections. As it says, Vector's complexity is eC for all operations.
It made me wonder what Vector is. I read the document and it says:
Because vectors strike a good balance between fast random selections and fast random functional updates, they are currently the
default implementation of immutable indexed sequences. It is backed by
a little endian bit-mapped vector trie with a branching factor of 32.
Locality is very good, but not contiguous, which is good for very
large sequences.
As with everything else about Scala, it's pretty vague. How actually does Vector work?
The keyword here is Trie.
Vector is implemented as a Trie datastructure.
See http://en.wikipedia.org/wiki/Trie.
More precisely, it is a "bit-mapped vector trie". I've just found a concise enough description of the structure (along with an implementation, apparently in Rust) here:
https://bitbucket.org/astrieanna/bitmapped-vector-trie
The most relevant excerpt is:
A Bitmapped Vector Trie is basically a 32-ary tree. Level 1 is an array of size 32, of whatever data type. Level 2 is an array of 32 Level 1's, and so on, until: Level 7 is an array of 2 Level 6's.
UPDATE: In reply to Lai Yu-Hsuan's comment about complexity:
I will have to assume you meant "depth" here :-D. The legend for "eC" says "The operation takes effectively constant time, but this might depend on some assumptions such as maximum length of a vector or distribution of hash keys.".
If you are willing to consider the worst case, and given that there is an upper bound to the maximum size of the vector, then yes indeed we can say that the complexity is constant.
Say that we consider the maximum size to be 2^32, then this means that the worst case is 7 operations at most, in any case.
Then again, we can always consider the worst case for any type of collection, find an upper bound, and call that constant complexity; but for a list, for example, this would mean a constant of 4 billion, which is not quite practical.
But Vector is the opposite: 7 operations is more than practical, and this is how we can afford to consider its complexity constant in practice.
Another way to look at this: we are not talking about log(2,N), but log(32,N). If you try to plot that you'll see it is practically a horizontal line. So pragmatically speaking, you'll never be able to see much increase in processing time as the collection grows.
Yes, that's still not really constant (which is why it is marked as "eC" and not just "C"), and you'll be able to see a difference around short vectors (but again, a very small difference, because the number of operations grows so slowly).
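If you want numbers rather than a plot, here is a quick back-of-the-envelope snippet (mine) comparing the hop counts at fanout 2 and fanout 32:

```scala
// Smallest number of levels h such that fanout^h >= n.
def hops(n: Long, fanout: Int): Int = {
  var h = 0
  var cap = 1L
  while (cap < n) { cap *= fanout; h += 1 }
  h
}

for (exp <- Seq(10, 20, 30)) {
  val n = 1L << exp
  println(s"n = 2^$exp: ${hops(n, 2)} hops at fanout 2, ${hops(n, 32)} at fanout 32")
}
// n = 2^10: 10 hops at fanout 2, 2 at fanout 32
// n = 2^20: 20 hops at fanout 2, 4 at fanout 32
// n = 2^30: 30 hops at fanout 2, 6 at fanout 32
```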
The other answers re 'Trie' are good. But as a close approximation, just for quick understanding:
Vector internally uses a tree structure - not a binary tree, but a 32-ary tree
Each '32-way node' uses Array[32] and can store either 0-32 references to child nodes or 0-32 pieces of data
The tree is structured to be balanced in a certain way - it is "n" levels deep, but levels 1 to n-1 are "index-only levels" (100% child references; no data) and level n contains all the data (100% data; no child references). So if the number of elements of data is "d" then n = log-base-32(d) rounded upwards
Why this? Simple: for performance.
Instead of doing thousands/millions/gazillions of memory allocations for each individual data element, memory is allocated in 32-element chunks. Instead of walking miles deep to find your data, the structure is quite shallow: it's a very wide, short tree. E.g. a tree 5 levels deep can contain 32^5 data elements (for 4-byte elements that's about 128 MB), and each data access looks up and walks through 5 nodes from the root (whereas a big array would use a single data access). The vector does not proactively allocate memory for all of level n (the data); it allocates 32-element chunks as needed. It gives read performance somewhat similar to a huge array, whilst having functional characteristics (power & flexibility & memory efficiency) somewhat similar to a binary tree.
:)
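To make the functional side concrete, here is a minimal path-copying update for such a 32-ary trie (assumed node types, not the production Vector code): only the ~log-base-32(d) nodes along the path to the slot are copied, and the other 31 subtrees at every level are shared with the old tree.

```scala
sealed trait Tree
final case class Branch(children: Array[Tree]) extends Tree
final case class Leaf(values: Array[Int]) extends Tree

// Returns a new tree with element `index` replaced by `elem`; `level` is the
// number of branch levels above the leaves.
def updated(node: Tree, index: Int, level: Int, elem: Int): Tree = node match {
  case Branch(children) =>
    val slot = (index >>> (5 * level)) & 0x1f
    val copy = children.clone()     // copy one 32-slot array on the path...
    copy(slot) = updated(children(slot), index, level - 1, elem)
    Branch(copy)                    // ...and share the other 31 subtrees
  case Leaf(values) =>
    val copy = values.clone()
    copy(index & 0x1f) = elem
    Leaf(copy)
}
```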
These may be interesting for you:
Ideal Hash Trees by Phil Bagwell.
Implementing Persistent Vectors in Scala - Daniel Spiewak
More Persistent Vectors: Performance Analysis - Daniel Spiewak
Persistent data structures in Scala

Why are vectors so shallow?

What is the rationale behind Scala's vectors having a branching factor of 32, and not some other number? Wouldn't smaller branching factors enable more structural sharing? Clojure seems to use the same branching factor. Is there anything magic about the branching factor 32 that I am missing?
It would help if you explained what a branching factor is:
The branching factor of a tree or a graph is the number of children at each node.
So, the answer appears to be largely here:
http://www.scala-lang.org/docu/files/collections-api/collections_15.html
Vectors are represented as trees with a high branching factor. Every
tree node contains up to 32 elements of the vector or contains up to
32 other tree nodes. Vectors with up to 32 elements can be represented
in a single node. Vectors with up to 32 * 32 = 1024 elements can be
represented with a single indirection. Two hops from the root of the
tree to the final element node are sufficient for vectors with up to
2^15 elements, three hops for vectors with 2^20, four hops for vectors
with 2^25 elements and five hops for vectors with up to 2^30 elements.
So for all vectors of reasonable size, an element selection involves
up to 5 primitive array selections. This is what we meant when we
wrote that element access is "effectively constant time".
So, basically, they had to make a design decision as to how many children to have at each node. As they explained, 32 seemed reasonable, but, if you find that it is too restrictive for you, then you could always write your own class.
For more information on why it may have been 32, you can look at this paper, as in the introduction they make the same statement as above, about it being nearly constant time, but this paper deals with Clojure it seems, more than Scala.
http://infoscience.epfl.ch/record/169879/files/RMTrees.pdf
James Black's answer is correct. Another argument for choosing 32 items might have been that the cache line size in many modern processors is 64 bytes, so two lines can hold 32 ints of 4 bytes each, or 32 pointers on a 32-bit machine or on a 64-bit JVM with a heap size up to 32 GB (thanks to pointer compression).
It's the "effectively constant time" for updates. With that large of a branching factor, you never have to go beyond 5 levels, even for terabyte-scale vectors. Here's a video with Rich talking about that and other aspects of Clojure on Channel 9. http://channel9.msdn.com/Shows/Going+Deep/Expert-to-Expert-Rich-Hickey-and-Brian-Beckman-Inside-Clojure
Just adding a bit to James's answer.
From an algorithm analysis standpoint, the growth of the two functions is logarithmic, so they scale the same way.
But, in practical applications, log-base-32(N) hops is a much smaller number of hops than, say, log-base-2(N), sufficiently so that it stays closer to constant time, even for fairly large values of N.
I'm sure they picked 32 exactly (as opposed to a higher number) because of some memory block size, but the main reason is the smaller number of hops compared to smaller branching factors.
I also recommend you watch this presentation on InfoQ, where Daniel Spiewak discusses Vectors starting about 30 minutes in: http://www.infoq.com/presentations/Functional-Data-Structures-in-Scala