What does it mean for a hash function to be incremental?

I have heard that, for example, MurmurHash2 is not "incremental" but MurmurHash3 is incremental. What does this mean? And why is it useful?

Incremental hash functions are suited to situations where, if a previously hashed message M is slightly updated into a new message M*, it should be fairly quick to compute the hash value of the updated message M*. This is done by computing the new hash m* from the old hash value m, in contrast to conventional hash functions, which have to recompute the new hash m* from scratch, which takes longer.
http://www.cs.berkeley.edu/~daw/papers/inchash-cs06.pdf
They're useful because updates are cheaper to compute, and therefore less expensive in terms of computing power and time.
However, they're not suited to every situation. The Berkeley paper linked above has some nice examples, in its Introduction, of when they can be useful.
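To make the idea concrete, here is a toy Scala sketch of the "combine per-block hashes" idea behind such constructions. All names here (ToyIncrementalHash, blockHash, update) are illustrative and the scheme is not cryptographically secure; it only shows why an update costs O(1) instead of a full rehash:

object ToyIncrementalHash {
  // Hash one block together with its position, using simple xor/multiply mixing.
  def blockHash(index: Int, block: Array[Byte]): Long = {
    var h = 0x9E3779B97F4A7C15L ^ index
    for (b <- block) h = (h ^ (b & 0xFFL)) * 0x100000001B3L
    h
  }

  // Hash of the whole message: the wrapping 64-bit sum of all block hashes.
  def hash(blocks: Seq[Array[Byte]]): Long =
    blocks.zipWithIndex.map { case (blk, i) => blockHash(i, blk) }.sum

  // Incremental update: swap out one block's contribution in O(1).
  def update(oldHash: Long, i: Int, oldBlock: Array[Byte], newBlock: Array[Byte]): Long =
    oldHash - blockHash(i, oldBlock) + blockHash(i, newBlock)
}

Because the combining step (wrapping addition) is commutative and invertible, update(hash(blocks), i, blocks(i), b) equals hash(blocks.updated(i, b)) while touching only the one changed block.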

I'm not an expert on this, but I think MurmurHash3 is not incremental in the sense tommarshall describes.
When people describe it as incremental, they probably mean that you can compute the hash of a stream in O(1) memory, i.e. you can have an API that lets you do the following (in pseudocode):
x = Hasher()
x.add("hello ")
x.add("world!")
x.get_hash()
and that would produce a hash of the string "hello world!" without keeping the whole string in memory at any point in time.
In particular, the imurmurhash-js javascript package seems to use the word 'incremental' in that sense.
The same sense seems to be used in the MetroHash docs.
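For this streaming sense of "incremental", the JDK's MessageDigest exposes exactly this kind of API; a minimal Scala sketch (MD5 is used purely as an example digest):

import java.security.MessageDigest

val md = MessageDigest.getInstance("MD5")
md.update("hello ".getBytes("UTF-8"))   // feed the input chunk by chunk
md.update("world!".getBytes("UTF-8"))   // only fixed-size internal state is kept
val digest = md.digest()                // same result as hashing "hello world!" in one go
println(digest.map("%02x".format(_)).mkString)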

Related

Is there a possibility to create a memory-efficient sequence of bits in the JVM?

I've got a piece of code that takes into account a given number of features, where each feature is boolean. I'm looking for the most efficient way to store a set of such features. My initial thought was to store these in a BitSet. But then I realized that this implementation is meant to store numbers in bit format rather than to manipulate each bit individually, which is something I'd like to do (to see the effect of switching any feature on and off). I then thought of using a boolean array, but apparently the JVM uses much more memory for each boolean element than the one bit it actually needs.
I'm therefore left with the question: what is the most efficient way to store a set of bits that I'd like to treat as independent bits rather than as the building blocks of some number?
Please refer to this question: boolean[] vs. BitSet: Which is more efficient?
According to the answer of Peter Lawrey, boolean[] (not Boolean[]) is the way to go, since its values can be manipulated directly and it takes only one byte of memory per element. Bear in mind that there is no way for a JVM application to store one bit in just one bit of directly (array-like) addressable memory: the smallest addressable unit is a byte.
The site you referenced already states that the mutable BitSet is the same as the java.util.BitSet. There is nothing you can do in Java that you can't do in Scala. But since you are using Scala, you probably want a safe implementation which is probably meant to be even multithreaded. Mutable datatypes are not suitable for that. Therefore, I would simply use an immutable BitSet and accept the memory cost.
However, BitSets have their limits (deriving from the maximum value of an int, since bits are addressed by int indices). If you need larger data sizes, you may use LongBitSets, which are basically a Map<Long, BitSet>. If you need even more space, you may nest them in another map, Map<Long, LongBitSet>, but in that case you need to use two or more identifiers (longs).
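For the simple single-BitSet case, here is a minimal Scala sketch of the immutable approach suggested above, treating each feature as an independently switchable bit:

import scala.collection.immutable.BitSet

val features = BitSet(0, 3, 5)      // features 0, 3 and 5 are on
val withFeature2 = features + 2     // switch feature 2 on (returns a new set)
val withoutFeature3 = features - 3  // switch feature 3 off (returns a new set)
println(features(5))                // true: test a single feature

Each "update" returns a new BitSet; that extra allocation is the memory cost accepted in exchange for safe sharing between threads.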

Disadvantages of Immutable objects

I know that immutable objects offer several advantages over mutable objects: they are easier to reason about, they do not have complex state spaces that change over time, we can pass them around freely, they make safe hash table keys, etc. So my question is: what are the disadvantages of immutable objects?
Quoting from Effective Java:
The only real disadvantage of immutable classes is that they require a separate object for each distinct value. Creating these objects can be costly, especially if they are large. For example, suppose that you have a million-bit BigInteger and you want to change its low-order bit:
BigInteger moby = ...;
moby = moby.flipBit(0);
The flipBit method creates a new BigInteger instance, also a million bits long, that differs from the original in only one bit. The operation requires time and space proportional to the size of the BigInteger. Contrast this to java.util.BitSet. Like BigInteger, BitSet represents an arbitrarily long sequence of bits, but unlike BigInteger, BitSet is mutable. The BitSet class provides a method that allows you to change the state of a single bit of a million-bit instance in constant time.
Read the full discussion in Item 15: Minimize mutability.
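The contrast is easy to see with the two JDK classes the quote mentions; a small Scala sketch:

import java.math.BigInteger

var moby = BigInteger.ONE.shiftLeft(1000000) // a number roughly a million bits long
moby = moby.flipBit(0)                       // allocates a whole new million-bit object: O(n)

val bits = new java.util.BitSet(1000000)
bits.flip(0)                                 // mutates one bit of the existing instance: O(1)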
Apart from possible performance drawbacks (possible, because with the complexity of GC and HotSpot optimisations, immutable structures are not necessarily slower), one drawback can be that state must now be threaded through your whole application. For simple applications or tiny scripts, the effort of maintaining state this way might be too high a price for concurrency safety.
For example, think of a GUI framework like Swing. It would definitely be possible to write a GUI framework entirely using immutable structures and one main "unsafe" outer loop, and I guess this has been done in Haskell. Some of the problems of maintaining nested immutable state can be addressed, for example, with lenses. But managing all the interactions (registering listeners etc.) may get quite involved, so you might instead want to introduce new abstractions such as functional-reactive or hybrid-reactive GUIs.
Basically you lose some of OO's encapsulation by going all immutable, and when this becomes a problem there are alternative approaches such as actors or STM.
I work with Scala on a daily basis. Immutability has certain key advantages as we all know. However sometimes it's just plain easier to allow mutable content in some situations. Here's a contrived example:
var counter = 0
something.map { e =>
  ...
  counter += 1
}
Of course I could just have the map return a tuple with the payload and count, or use the collection's size if available. But in this case the mutable counter is arguably clearer. In general I prefer immutability but also allow myself to make exceptions.
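For comparison, here is the immutable version alluded to above, where the fold returns both the payload and the count (the list and the doubling step are placeholders for the real body):

val xs = List(1, 2, 3)
val (payload, counter) = xs.foldLeft((List.empty[Int], 0)) {
  case ((acc, n), e) => ((e * 2) :: acc, n + 1)
}
// payload == List(6, 4, 2), counter == 3

Whether this is clearer than the var is exactly the judgment call described above.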
To answer this question I would quote Programming in Scala, Second Edition, chapter "Next Steps in Scala", item 11, by Lex Spoon, Bill Venners and Martin Odersky:
The Scala perspective, however, is that val and var are just two different tools in your toolbox, both useful, neither inherently evil. Scala encourages you to lean towards vals, but ultimately reach for the best tool given the job at hand.
So I would say that, just as for programming languages in general, val and var solve different problems: there is no "disadvantage / advantage" without context, there is just a problem to solve, and val and var address the problem differently.
Hope it helps, even if it does not provide a concrete list of pros / cons!

G-Counters in Riak: Don't the underlying vclocks provide the same data?

I've been reading into CvRDTs and I'm aware that Riak has already added a few to Riak 2.
My question is: why would Riak implement a gcounter when it sounds like the underlying vclock that is associated with every object records the same information? Wouldn't the result be a gcounter stored with a vclock, each containing the same essential information?
My only guess right now would be that Riak may garbage-collect the vclocks, trimming information that would actually be important for the purpose of a gcounter (i.e. the number of increments).
I cannot read Erlang particularly well, so maybe I've wrongly assumed that Riak stores vclocks with these special-case data types. However, the question still applies to the homegrown solutions that are written on top of standard Riak (and hence inherit vclocks with each object persisted).
EDIT:
I have since written the following article to help explain CvRDTs in a more practical manner. This article also touches on the redundancy I have highlighted above:
Conflict-free Replicated Data Types (CRDT) - A digestible explanation with less math.
Riak prunes version vectors; that is no big deal for causality (false concurrency, more siblings, safe) but a disaster for counters.
Riak's CRDT support is general. We "hide" CRDTs inside the regular riak object.
Riak's CRDT support is in its first wave; we'll be optimising further as we make further releases.
We have a great mailing list for questions like this, btw. Stack Overflow has its uses, but if you want to talk to the authors of an open source DB, why not use their list? Since Riak is open source, you can submit a pull request; we'd love to incorporate your ideas into the code base.
Quick answer: Riak's counters are actually PN-counters, i.e. they allow both increments and decrements, so they can't be implemented like a vclock, as they require tracking the increments and decrements separately.
Long Answer:
This question suggests you have completely misunderstood the difference between a g-counter and a vector clock (or version vector).
A vector clock (vclock) is a system for tracking the causality of concurrent updates to a piece of data. They are a map of {actor => logical clock}. Actors only increment their logical clocks when the data they're associated with changes, and try to increment it as little as possible (so at most once per update). Two vclocks can either be concurrent, or one can dominate the other.
A g-counter is a CvRDT with what looks like the same structure as a vclock, but with important differences. They are implemented as a map of {actor => counter}. Actors can increment their own counter as much as they want. A g-counter has the concept of a "counter value", and the concept of a "merge", so that when concurrent operations are executed by different actors, they can work out what the actual "counter value" should be.
Importantly, g-counters can't track causality, and vclocks have no idea what their "counter value" is.
To conflate the two in a codebase would not only be confusing, but could also bring in errors.
Add this to the fact that Riak actually implements PN-counters. The difference is that a g-counter can only be incremented, while PN-counters can be both incremented and decremented. PN-counters work as a map of {actor => (increment count, decrement count)}, which more obviously has a different structure from a vclock. You can only increment each of those two counts, which is why there are two and not just one.
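To make the structural difference concrete, here is a minimal G-counter sketch in Scala following the description above (illustrative only, not Riak's Erlang implementation):

final case class GCounter(counts: Map[String, Long] = Map.empty) {
  // Each actor may increment its own entry as often as it wants.
  def increment(actor: String, n: Long = 1): GCounter =
    copy(counts.updated(actor, counts.getOrElse(actor, 0L) + n))

  // The "counter value": the sum over all actors. A vclock has no analogue of this.
  def value: Long = counts.values.sum

  // Merge concurrent replicas with a pointwise maximum. Note that nothing
  // here tracks causality; the merge only reconciles counts.
  def merge(other: GCounter): GCounter =
    GCounter((counts.keySet ++ other.counts.keySet).map { a =>
      a -> math.max(counts.getOrElse(a, 0L), other.counts.getOrElse(a, 0L))
    }.toMap)
}

A PN-counter can then be built as a pair of these, one for increments and one for decrements, with value = incs.value - decs.value.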

Streams vs. tail recursion for iterative processes

This is a follow-up to my previous question.
I understand that we can use streams to generate an approximation of pi (and other numbers), the n-th Fibonacci number, etc. However, I doubt that streams are the right approach for that.
The main drawback (as I see it) is memory consumption: e.g. a stream will retain all Fibonacci numbers for i < n, while I need only the n-th. Of course, I can use drop, but it makes the solution a bit more complicated. Tail recursion looks like a more suitable approach to tasks like that.
What do you think?
If you need to go fast, travel light. That means: avoid allocating any unnecessary memory. If you need memory, use the fastest collections available. If you know how much memory you need, preallocate. Allocation is the absolute performance killer... for calculation. Your code may not look nice anymore, but it will go fast.
However, if you're working with IO (disk, network) or any user interaction, then allocation pales. It's then better to shift priority from code performance to maintainability.
Use Iterator. It does not retain intermediate values.
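For example, the n-th Fibonacci number with an Iterator keeps only the current pair in memory:

def fib(n: Int): BigInt =
  Iterator.iterate((BigInt(0), BigInt(1))) { case (a, b) => (b, a + b) }
    .drop(n)
    .next()
    ._1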
If you want n-th fibonacci number and use a stream just as a temporary data structure (if you do not hold references to previously computed elements of stream) then your algorithm would run in constant space.
Previously computed elements of a Stream (which are no longer used) are going to be garbage collected. And as they were allocated in the youngest generation and immediately collected, almost all allocations might be in cache.
Update:
It seems that the current implementation of Stream is not as space-efficient as it could be, mainly because it inherits an implementation of the apply method from the LinearSeqOptimized trait, where it is defined as
def apply(n: Int): A = {
  val rest = drop(n)
  if (n < 0 || rest.isEmpty) throw new IndexOutOfBoundsException("" + n)
  rest.head
}
A reference to the head of the stream is held here by this, which prevents the stream from being gc'ed. So a combination of the drop and head methods (as in f.drop(100).head) may be better in situations where dropping intermediate results is feasible (thanks to Sebastien Bocq for explaining this on scala-user).
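And for completeness, the tail-recursive formulation the question leans towards, which runs in constant space with nothing for the GC to even consider:

import scala.annotation.tailrec

def fib(n: Int): BigInt = {
  @tailrec
  def loop(i: Int, a: BigInt, b: BigInt): BigInt =
    if (i == 0) a else loop(i - 1, b, a + b)
  loop(n, 0, 1)
}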

What is hash exactly?

I am learning MD5. I found the term 'hash' in most descriptions of MD5. I googled 'hash', but I could not find an exact definition of the term as used in computer programming.
Why do we use 'hash' in computer programming? What is the origin of the word?
I would say any answer must be guesswork, so I will make this a community wiki.
Hash, or hash browns, is a breakfast food made by cutting potatoes into long thin strips (smaller and shorter than french fries, but proportionally similar), then frying the mass of strips in animal or vegetable fat until browned, stuck together, and cooked. By analogy, 'hashing' a number meant turning it into another, usually smaller, number using a method that still depends deterministically on the input number.
I believe the term "hash" was first used in the context of "hash table", which was commonly used in the 1960s on mainframe-type machines. In these cases, an integer value with a large range is typically converted to a "hash table index", which is a small integer. It is important for an efficient hash table that the "hash function" be evenly distributed, or "flat".
I don't have a citation; that is how I have understood the analogy since I heard it in the '80s. Someone must have been there when the term was first applied, though.
A hash value (or simply hash), also called a message digest, is a number generated from a string of text. The hash is substantially smaller than the text itself, and is generated by a formula in such a way that it is extremely unlikely that some other text will produce the same hash value.
You're referring to the "hash function". It is used to generate a value that is, for practical purposes, unique for a given set of parameters.
One great use of a hash is password security. Instead of saving a password in a database, you save a hash of the password.
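A minimal Scala sketch of that idea, salting and hashing the password before storage (illustrative only; real systems should use a dedicated password-hashing scheme such as bcrypt or PBKDF2 rather than a plain digest):

import java.security.{MessageDigest, SecureRandom}

val salt = new Array[Byte](16)
new SecureRandom().nextBytes(salt)          // random salt, stored alongside the hash

def hashPassword(password: String, salt: Array[Byte]): Array[Byte] = {
  val md = MessageDigest.getInstance("SHA-256")
  md.update(salt)
  md.digest(password.getBytes("UTF-8"))     // digest of salt ++ password bytes
}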
A hash is a sequence of byte values (00 to FF in hexadecimal) that represents a certain piece of data, be it a file or a string of bytes, and is ideally unique to it. It is used primarily for password storage and verification, and to test whether one file is the same as another (i.e. you hash two files; if the hashes match, they are considered the same file).
Generally, any of the SHA algorithms is preferred over MD5, due to the hash collisions that can occur when using MD5. See this Wikipedia article.
According to the Wikipedia article on hash functions, Donald Knuth, in The Art of Computer Programming, was able to trace the concept of hash functions back to an internal IBM memo by Hans Peter Luhn in 1953.
And just for fun, here's a scrap of overheard conversation quoted in Two Women in the Klondike: the Story of a Journey to the Gold Fields of Alaska (1899):
They'll have to keep the hash table going all day long to feed us. 'T will be a short order affair.
A hash function hashes its input to a value; for passwords it should also take a salt. The salt need not be kept secret, but it must be stored alongside the hash so that the same hash can be recomputed when matching later, and new hashes still work. A mathematically related concept is the bijection, though a hash, unlike a bijection, maps many inputs to the same value.
Adding to gabriel1836's answer: one of the important properties of a hash function is that it is a one-way function, which means you cannot recover the original string from its hash value.