From 'A gentle introduction to lisp':
If given a circular list such as #1=(A B C . #1#) as input, LENGTH may
not return a value at all. In most implementations it will go into an
infinite loop.
Is this still true? Was/is it a bug? Why not check the nature of the list first?
Modern standards such as R7RS Scheme and Common Lisp do acknowledge circular lists, but to reduce overhead CL provides both length, which might hang, and list-length, which returns nil if a cycle is detected.
There is no simple way to see the nature of a list when all you see is one cons at a time. What you do is walk the list with two variables: one advances one step at a time, and the other advances two steps at a time, starting at the second element. If those two are ever the same object, there is a circular reference. That is called the tortoise and hare algorithm.
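The two-pointer walk described above can be sketched as follows. This is only an illustration in Python, not how any particular Lisp implements list-length; the Cons class and the list_length function are made-up names, with None standing in for nil.

```python
class Cons:
    """A minimal cons cell: `car` holds the element, `cdr` the rest of the list."""

    def __init__(self, car, cdr=None):
        self.car = car
        self.cdr = cdr


def list_length(cell):
    """Return the length of a proper list, or None if the list is circular."""
    n = 0
    fast = cell  # the hare: advances two cells per iteration
    slow = cell  # the tortoise: advances one cell per iteration
    while True:
        if fast is None:        # hare fell off the end: even length
            return n
        if fast.cdr is None:    # hare is on the last cell: odd length
            return n + 1
        if fast is slow and n > 0:
            return None         # hare caught up with the tortoise: a cycle
        n += 2
        fast = fast.cdr.cdr
        slow = slow.cdr
```

On a proper list the hare reaches the end after about n/2 iterations; on a circular list the hare eventually lands on the same cell as the tortoise, so the loop always terminates.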
length is meant to work on sequences in general; the circularity issue is relevant for lists but not for, say, strings or arrays. list-length is specialized on lists and works as expected for proper lists, but returns nil for circular lists.
Related
Does anyone know the original hash table implementation?
Every implementation I've found is based on either separate chaining or open addressing.
Chaining, by Hans Peter Luhn, in 1953.
https://en.wikipedia.org/wiki/Hash_table#History
The first implementation, and probably still the most common one, uses an array (resized as needed) where each entry points to a list of elements.
The hash code, computed mod the size of the array, gives the integer index of the entry whose list holds the element being searched for. In case of a hash code collision, the elements accumulate in the list of that entry.
So, once the hash code is computed, we have O(1) for accessing the entry of the array and O(N) for the actual search of the element in the list, where N is the length of that list and each candidate is checked for actual equality. N must be kept low for obvious performance reasons.
When collisions become too frequent, we resize the array, increasing the number of entries and decreasing the collisions accordingly. This works because the hash code is now computed mod a larger number than before.
Some more complicated implementations convert the lists to trees if they become too long, so that the equality search drops from O(N) to O(log(N)).
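To make the array-of-lists scheme concrete, here is a minimal sketch in Python. It is not Luhn's original design or any library's implementation; the class name, the starting capacity of 8 and the 0.75 load threshold are all assumptions chosen for illustration, and the list-to-tree conversion is left out.

```python
class ChainedHashTable:
    """Separate chaining: an array of buckets, each a list of (key, value) pairs."""

    def __init__(self, capacity=8, max_load=0.75):
        self._buckets = [[] for _ in range(capacity)]
        self._size = 0
        self._max_load = max_load  # assumed resize threshold (entries per bucket)

    def _index(self, key):
        # Hash code mod the array size selects the bucket.
        return hash(key) % len(self._buckets)

    def put(self, key, value):
        bucket = self._buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # equality check inside the bucket
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # new key (possibly a collision)
        self._size += 1
        if self._size > self._max_load * len(self._buckets):
            self._resize(2 * len(self._buckets))

    def get(self, key, default=None):
        for k, v in self._buckets[self._index(key)]:
            if k == key:
                return v
        return default

    def _resize(self, new_capacity):
        # Re-distribute every entry using the hash code mod the larger size.
        new_buckets = [[] for _ in range(new_capacity)]
        for bucket in self._buckets:
            for k, v in bucket:
                new_buckets[hash(k) % new_capacity].append((k, v))
        self._buckets = new_buckets

    def __len__(self):
        return self._size
```

Doubling the array on resize keeps the average bucket short, which is what keeps the O(N) part of a lookup small in practice.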
I'm gonna write the problem as I found it and I will then explain what confuses me.
"A teacher is marking his students' work from 0-10 but he only marks with an 8 or above for a certain number 'x'(x=15 for example) of the 'n' students. You are given an array with all the students' marks in random order. Find the 'x' best marks in O(1)."
We certainly have been taught hashing, but this requires me to store all the data in a hash table, which is definitely not O(1). Maybe we don't have to take the 'conversion' into account? If we do, maybe the conversion combined with the search time afterwards will lead to a method different from hashing.
In that case, leaving O(1) aside, what is the fastest algorithm including both the conversion and the search time?
Simple: It's not possible.
O(1) can only be achieved if input size, number of necessary comparisons and output size are all constants. You may argue that x could be treated as constant, but it still doesn't work:
You need to inspect every single input element, all n of them, as the random input order does not even allow any heuristic to guess where the xth element would be, even if you had already correctly guessed the other x-1 elements in constant time.
As the problem is stated, there is no solution which can do it in the upper bounds of O(1) or O(x).
Let's just assume your instructor corrects his mistake, and gives you a revised version which correctly states O(n) as the required upper bound.
In that case your hash approach is (almost) correct. The catch of using a hash function is that you now need to account for potential collisions, which are the reason why hash maps don't work strictly in O(1), but rather only in O(1) on average.
As you know all possible values (grades from 0-10), you can just allocate buckets with a known index. Inside each bucket you may use linked lists, as they also allow constant time insertions and linear time iteration.
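A minimal sketch of the fixed-bucket idea, in Python. It assumes only the marks themselves are needed, so each bucket is reduced to a counter instead of a linked list; the function name best_marks is made up for this example.

```python
def best_marks(marks, x):
    """Return the x highest marks (each an integer 0..10), best first.

    One pass over the input fills 11 buckets, then the buckets are read
    from 10 downwards: O(n + x) overall, with no hash collisions possible.
    """
    counts = [0] * 11               # one bucket per possible grade
    for m in marks:
        counts[m] += 1
    result = []
    for grade in range(10, -1, -1):
        take = min(counts[grade], x - len(result))
        result.extend([grade] * take)
        if len(result) == x:
            break
    return result
```

If the students' identities matter, each counter can be replaced by a list of students; insertion into a bucket is still constant time.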
Structural sharing in Scala List is straightforward and easy to understand. But Scala Vector is a more complicated data structure than a list. How is structural sharing achieved in Scala Vector?
Vector is basically a tree (trie) with 32-wide branching at each level. If you have a Vector of, say, 3000 elements and you want to index element 2069, for example, which converts to 100000010101 in binary, it will decompose it into 5-bit blocks to use as indices into the tree: 10 (i.e. 2) in the first branch then 00000 (i.e. 0) in the next, and finally 10101 (i.e. 21) in the terminal branch, and then there's the data.
Given this structure, it's easy to see how to structurally share things: you can share any sub-trees that aren't changed. So if you make a new vector with a different element 2069, you don't have to change all 3000 elements but recreate "only" three arrays of size 32: the terminal one is replaced by a copy with its element 21 updated; then its parent has to be replaced by a copy with this new child in index 0; then its parent has to be replaced with the correct subtree in index 2.
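The indexing and path-copying update described above can be sketched like this. It is a simplified illustration in Python, not Scala's actual Vector code: build, get and updated are made-up helpers, nodes are plain Python lists that are treated as immutable by convention, and the special cases for appending and prepending are ignored.

```python
BITS = 5
WIDTH = 1 << BITS  # 32-way branching
MASK = WIDTH - 1


def build(items):
    """Pack items into a 32-wide trie; return (root, depth in levels)."""
    level = [list(items[i:i + WIDTH]) for i in range(0, len(items), WIDTH)] or [[]]
    depth = 1
    while len(level) > 1:
        level = [level[i:i + WIDTH] for i in range(0, len(level), WIDTH)]
        depth += 1
    return level[0], depth


def get(root, depth, i):
    """Follow one 5-bit block of the index per level, top block first."""
    node = root
    for shift in range(BITS * (depth - 1), 0, -BITS):
        node = node[(i >> shift) & MASK]
    return node[i & MASK]


def updated(root, depth, i, value):
    """Return a new root with element i replaced, copying only the path to it."""
    new_root = list(root)                 # copy of the root array
    node = new_root
    for shift in range(BITS * (depth - 1), 0, -BITS):
        slot = (i >> shift) & MASK
        node[slot] = list(node[slot])     # copy the one child on the path
        node = node[slot]
    node[i & MASK] = value                # write into the copied leaf
    return new_root
```

With 3000 elements the trie has three levels, so updating index 2069 copies exactly three arrays (slots 2, 0 and 21 along the path); every other sub-tree in the new vector is the same object as in the old one.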
Now, this provides quite a lot of structural sharing as long as you have far more than 32 elements in your vector, but it's still a pretty big overhead. Because of this, additions to the end of the vector are special-cased so that you just add to the existing array. The old Vectors still point to that array, but they think the end is earlier (and that part is unchanged) so it works out okay.
There's a more complex but similar scheme to allow addition at the front of a vector in a similar fashion (basically, by leaving space at the front and keeping track of where to point via indices and offsets in addition to the indexing scheme).
The trick as implemented doesn't work to allow alternating addition to both front and back, though, so there you effectively rebuild the trees on every addition. Making a version with even better structural sharing would be possible, but it would probably be a bit slower to access.
It's supposedly faster than a vector, but I don't really understand how locality of reference is supposed to help this (since a vector is by definition the most locally packed data possible -- every element is packed next to the succeeding element, with no extra space between).
Is the benchmark assuming a specific usage pattern or something similar?
How is this possible?
Bitmapped vector tries aren't strictly faster than normal vectors, at least not at everything. It depends on what operation you are considering.
Conventional vectors are faster, for example, at accessing a data element at a specific index. It's hard to beat a straight indexed array lookup. And from a cache locality perspective, big arrays are pretty good if all you are doing is looping over them sequentially.
However, a bitmapped vector trie will be much faster for other operations (thanks to structural sharing). For example, creating a new copy with a single changed element without affecting the original data structure is O(log32 n) vs. O(n) for a traditional vector. That's a huge win.
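To put rough numbers on the O(log32 n) vs. O(n) comparison, here is a small back-of-the-envelope calculation in Python. It assumes a plain path-copying update that copies one full 32-slot array per level, so it gives an upper bound; the function name is made up for this example.

```python
def slots_copied_trie(n, width=32):
    """Upper bound on array slots copied by one persistent update of n elements."""
    depth = 1
    capacity = width
    while capacity < n:          # add levels until the trie can hold n elements
        capacity *= width
        depth += 1
    return depth * width         # one array of `width` slots per level


n = 1_000_000
trie_cost = slots_copied_trie(n)   # 4 levels * 32 slots = 128
flat_cost = n                      # a flat array must be copied in full
```

So for a million elements a single-element update touches at most 128 slots in the trie, against a million for a flat array.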
Here's an excellent video well worth watching on the topic, which includes a lot of the motivation for why you might want these kinds of structures in your language: Persistent Data Structures and Managed References (talk by Rich Hickey).
There is a lot of good stuff in the other answers but nobody answers your question. PersistentVectors are only fast for lots of random lookups by index (when the array is big). "How can that be?" you might ask. "A normal flat array only needs to move a pointer, the PersistentVector has to go through multiple steps."
The answer is "Cache Locality".
The cache always gets a range from memory. If you have a big array it does not fit the cache. So if you want to get item x and item y you have to reload the whole cache. That's because the array is always sequential in memory.
Now with the PVector that's different. There are lots of small arrays floating around and the JVM is smart about that and puts them close to each other in memory. So for random accesses this is fast; if you run through it sequentially it's much slower.
I have to say that I'm not an expert on hardware or how the JVM handles cache locality and I have never benchmarked this myself; I am just retelling stuff I've heard from other people :)
Edit: mikera mentions that too.
Edit 2: See this talk about Functional Data-Structures; skip to the last part if you are only interested in the vector. http://www.infoq.com/presentations/Functional-Data-Structures-in-Scala
A bitmapped vector trie (aka a persistent vector) is a data structure invented by Rich Hickey for Clojure that has been implemented in Scala since 2010 (v2.8). Its clever bitwise indexing strategy is what allows for highly efficient access and modification of large data sets.
From Understanding Clojure's Persistent Vectors:
Mutable vectors and ArrayLists are generally just arrays which grows
and shrinks when needed. This works great when you want mutability,
but is a big problem when you want persistence. You get slow
modification operations because you'll have to copy the whole array
all the time, and it will use a lot of memory. It would be ideal to
somehow avoid redundancy as much as possible without losing
performance when looking up values, along with fast operations. That
is exactly what Clojure's persistent vector does, and it is done
through balanced, ordered trees.
The idea is to implement a structure which is similar to a binary
tree. The only difference is that the interior nodes in the tree have
a reference to at most two subnodes, and does not contain any elements
themselves. The leaf nodes contain at most two elements. The elements
are in order, which means that the first element is the first element
in the leftmost leaf, and the last element is the rightmost element in
the rightmost leaf. For now, we require that all leaf nodes are at the
same depth. As an example, take a look at the tree below: It has
the integers 0 to 8 in it, where 0 is the first element and 8 the
last. The number 9 is the vector size:
If we wanted to add a new element to the end of this vector and we
were in the mutable world, we would insert 9 in the rightmost leaf
node, like this:
But here's the issue: We cannot do that if we want to be persistent.
And this would obviously not work if we wanted to update an element!
We would need to copy the whole structure, or at least parts of it.
To minimize copying while retaining full persistence, we perform path
copying: We copy all nodes on the path down to the value we're about
to update or insert, and replace the value with the new one when we're
at the bottom. A result of multiple insertions is shown below. Here,
the vector with 7 elements share structure with a vector with 10
elements:
The pink coloured nodes are shared between the vectors, whereas the
brown and blue are separate. Other vectors not visualized may also
share nodes with these vectors.
More info
Besides Understanding Clojure's Persistent Vectors, the ideas behind this data structure and its use cases are also explained pretty well in David Nolen's 2014 lecture Immutability, interactivity & JavaScript, from which the screenshot below was taken. Or if you really want to dive deeply into the technical details, see also Phil Bagwell's Ideal Hash Trees, which was the paper upon which Hickey's initial Clojure implementation was based.
What do you mean by "plain vector"? Just a flat array of items? That's great if you never update it, but if you ever change a 1M-element flat-vector you have to do a lot of copying; the tree exists to allow you to share most of the structure.
Short explanation: it relies on the fact that the JVM optimizes read/write/copy operations on arrays so heavily. The key aspect IMO is that once your vector grows past a certain size, index management becomes a bottleneck. This is where the very clever algorithm of the persistent vector comes into play: on very large collections it outperforms the standard variant. So basically it is a functional data structure which only performs so well because it is built on small, mutable, highly optimized JVM data structures.
For further details see here (at the end)
http://topsy.com/vimeo.com/28760673
Judging by the title of the talk, it's talking about Scala vectors, which aren't even close to "the most locally packed data possible": see source at https://lampsvn.epfl.ch/trac/scala/browser/scala/tags/R_2_9_1_final/src/library/scala/collection/immutable/Vector.scala.
Your definition only applies to Lisps (as far as I know).
It seems that Vector was late to the Scala collections party, and all the influential blog posts had already left.
In Java ArrayList is the default collection - I might use LinkedList but only when I've thought through an algorithm and care enough to optimise. In Scala should I be using Vector as my default Seq, or trying to work out when List is actually more appropriate?
As a general rule, default to using Vector. It’s faster than List for almost everything and more memory-efficient for larger-than-trivial sized sequences. See this documentation of the relative performance of Vector compared to the other collections. There are some downsides to going with Vector. Specifically:
Updates at the head are slower than List (though not by as much as you might think)
Another downside before Scala 2.10 was that pattern matching support was better for List, but this was rectified in 2.10 with generalized +: and :+ extractors.
There is also a more abstract, algebraic way of approaching this question: what sort of sequence do you conceptually have? Also, what are you conceptually doing with it? If I see a function that returns an Option[A], I know that function has some holes in its domain (and is thus partial). We can apply this same logic to collections.
If I have a sequence of type List[A], I am effectively asserting two things. First, my algorithm (and data) is entirely stack-structured. Second, I am asserting that the only things I’m going to do with this collection are full, O(n) traversals. These two really go hand-in-hand. Conversely, if I have something of type Vector[A], the only thing I am asserting is that my data has a well defined order and a finite length. Thus, the assertions are weaker with Vector, and this leads to its greater flexibility.
Well, a List can be incredibly fast if the algorithm can be implemented solely with ::, head and tail. I had an object lesson of that very recently, when I beat Java's split by generating a List instead of an Array, and couldn't beat that with anything else.
However, List has a fundamental problem: it doesn't work with parallel algorithms. I cannot split a List into multiple segments, or concatenate it back, in an efficient manner.
There are other kinds of collections that can handle parallelism much better -- and Vector is one of them. Vector also has great locality -- which List doesn't -- which can be a real plus for some algorithms.
So, all things considered, Vector is the best choice unless you have specific considerations that make one of the other collections preferable -- for example, you might choose Stream if you want lazy evaluation and caching (Iterator is faster but doesn't cache), or List if the algorithm is naturally implemented with the operations I mentioned.
By the way, it is preferable to use Seq or IndexedSeq unless you want a specific piece of API (such as List's ::), or even GenSeq or GenIndexedSeq if your algorithm can be run in parallel.
Some of the statements here are confusing or even wrong, especially the idea that immutable.Vector in Scala is anything like an ArrayList.
List and Vector are both immutable, persistent (i.e. "cheap to get a modified copy") data structures.
There is no reasonable default choice as there might be for mutable data structures; rather, it depends on what your algorithm is doing.
List is a singly linked list, while Vector is a base-32 integer trie, i.e. it is a kind of search tree with nodes of degree 32.
Using this structure, Vector can provide most common operations reasonably fast, i.e. in O(log_32(n)). That works for prepend, append, update, random access, decomposition in head/tail. Iteration in sequential order is linear.
List on the other hand just provides linear iteration and constant time prepend, decomposition in head/tail. Everything else takes in general linear time.
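For comparison with the trie, the singly linked list's cost profile can be sketched in a few lines. This is an illustration in Python rather than Scala's List; Cell, prepend, nth and to_pylist are made-up names, and None plays the role of Nil.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Cell:
    """An immutable cons cell: one element plus a reference to the rest."""
    head: Any
    tail: Optional["Cell"] = None


def prepend(x, lst):
    """O(1): allocate one cell; the whole existing list is shared as the tail."""
    return Cell(x, lst)


def nth(lst, i):
    """O(i): random access has to walk the cells one by one."""
    for _ in range(i):
        lst = lst.tail
    return lst.head


def to_pylist(lst):
    """Linear iteration by repeated head/tail decomposition."""
    out = []
    while lst is not None:
        out.append(lst.head)
        lst = lst.tail
    return out
```

Two lists built by prepending different elements to the same base share that base entirely, which is why prepend and head/tail decomposition are constant time while everything else is linear.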
This might look as if Vector were a good replacement for List in almost all cases, but prepend, decomposition and iteration are often the crucial operations on sequences in a functional program, and the constants of these operations are (much) higher for Vector due to its more complicated structure.
I made a few measurements: iteration is about twice as fast for List, prepend is about 100 times faster on List, decomposition in head/tail is about 10 times faster on List, and generation from a traversable is about 2 times faster for Vector. (This is probably because Vector can allocate arrays of 32 elements at once when you build it up using a builder instead of prepending or appending elements one by one.)
Of course all operations that take linear time on lists but effectively constant time on vectors (such as random access or append) will be prohibitively slow on large lists.
So which data structure should we use?
Basically, there are four common cases:
We only need to transform sequences by operations like map, filter, fold etc:
basically it does not matter, we should program our algorithm generically and might even benefit from accepting parallel sequences. For sequential operations List is probably a bit faster. But you should benchmark it if you have to optimize.
We need a lot of random access and different updates, so we should use vector, list will be prohibitively slow.
We operate on lists in a classical functional way, building them by prepending and iterating by recursive decomposition: use list, vector will be slower by a factor 10-100 or more.
We have a performance-critical algorithm that is basically imperative and does a lot of random access on a list, something like in-place quick-sort: use an imperative data structure, e.g. ArrayBuffer, locally and copy your data from and to it.
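The last case (mutate locally, expose only immutable data) can be sketched as follows. This is a Python analogy, not Scala code: a tuple stands in for the immutable sequence and a Python list stands in for ArrayBuffer, and the function names are made up for this example.

```python
def sorted_immutable(seq):
    """Return a sorted tuple; all mutation is confined to a local buffer."""
    buf = list(seq)                    # copy into a mutable array
    _quicksort(buf, 0, len(buf) - 1)   # imperative, in-place algorithm
    return tuple(buf)                  # copy back out to an immutable value


def _quicksort(a, lo, hi):
    if lo >= hi:
        return
    p = _partition(a, lo, hi)
    _quicksort(a, lo, p - 1)
    _quicksort(a, p + 1, hi)


def _partition(a, lo, hi):
    """Lomuto partition around the last element; returns the pivot's final index."""
    pivot = a[hi]
    i = lo
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i
```

The caller never sees the buffer, so the function stays pure from the outside while the random-access-heavy part runs on a flat array.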
For immutable collections, if you want a sequence, your main decision is whether to use an IndexedSeq or a LinearSeq, which give different guarantees for performance. An IndexedSeq provides fast random-access of elements and a fast length operation. A LinearSeq provides fast access only to the first element via head, but also has a fast tail operation. (Taken from the Seq documentation.)
For an IndexedSeq you would normally choose a Vector. Ranges and WrappedStrings are also IndexedSeqs.
For a LinearSeq you would normally choose a List or its lazy equivalent Stream. Other examples are Queues and Stacks.
So in Java terms, ArrayList is used similarly to Scala's Vector, and LinkedList similarly to Scala's List. But in Scala I would tend to use List more often than Vector, because Scala has much better support for functions that include traversal of the sequence, like mapping, folding, iterating etc. You will tend to use these functions to manipulate the list as a whole, rather than randomly accessing individual elements.
In situations which involve a lot of random access and random mutation, a Vector (or, as the docs say, a Seq) seems to be a good compromise. This is also what the performance characteristics suggest.
Also, the Vector class seems to play nicely in distributed environments without much data duplication because there is no need to do a copy-on-write for the complete object. (See: http://akka.io/docs/akka/1.1.3/scala/stm.html#persistent-datastructures)
If you're programming immutably and need random access, Seq is the way to go (unless you want a Set, which you often actually do). Otherwise List works well, except its operations can't be parallelized.
If you don't need immutable data structures, stick with ArrayBuffer since it's the Scala equivalent to ArrayList.