I'm writing a double hash table which only takes integers.
unsigned int DoubleHashTable::HashFunction1(unsigned int const data)
{
    return (data % GetTableSize());
}
unsigned int DoubleHashTable::HashFunction2(unsigned int const data, unsigned int count)
{
    return ((HashFunction1(data) + count * (5 - (data % 5)) % GetTableSize()));
}
and I'm trying to insert data into the table with SetData():
void DoubleHashTable::SetData(unsigned int const data)
{
    unsigned int probe = HashFunction1(data);
    if (m_table[probe].GetStatus())
    {
        unsigned int count = 1;
        while (m_table[probe].GetStatus() && count <= GetTableSize())
        {
            probe = HashFunction2(data, count);
            count++;
        }
    }
    m_table[probe].Insert(data);
}
After putting 100 integer items into a table of size 100, the table shows that some indexes are left blank. I know insertion will take O(N) in the worst case. My question is: every item should still be inserted, leaving no empty slots, even if it takes the worst-case search time, right? I can't find the problem in my functions.
An additional question: there are well-known hash algorithms, and the purpose of double hashing is to reduce collisions as much as possible, with H2(T) acting as a backup for H1(T). But if a well-known hashing algorithm (like MD5, SHA and others; I'm not talking about security, just well-known algorithms) is fast and well distributed, why do we need double hashing?
Thanks!
When testing hash functions, there may be high collisions with certain pathological inputs (=those which break your hash function). These inputs can be discovered by reversing the hash function which can lead to certain attacks (this is a real concern as internet routers have limited space for hash tables). Even with no adversary, the look up time of such a hash table after certain inputs can grow and even become linear in the worst case.
Double hashing is a method of resolving hash collisions that tries to avoid the linear growth seen on pathological inputs. It is a form of open addressing, as is the popular linear probing. However, the number of inputs must be much lower than the table size in these schemes, unless your hash table can grow dynamically.
To answer your second question (now that you have fixed your code on your own), in a nutshell, double hashing is better-suited for small hash tables, and single hashing is better-suited for large hash tables.
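As for the blank slots in the first question, here is a rough Python sketch of one likely contributor, assuming the intended probe sequence is (h1 + count * step) mod table size with step = 5 - (data % 5). A step that shares a factor with the table size only ever reaches a fraction of the slots, so with a table size of 100 the steps 2, 4 and 5 cannot visit every index; a prime table size avoids this. (Separately, as written in the question, the modulus in HashFunction2 applies only to count * step and not to the whole sum, so the result can exceed the table size.) The helper name `reachable_slots` is made up for this illustration.

```python
def reachable_slots(start: int, step: int, table_size: int) -> int:
    """Count the distinct slots visited by (start + count * step) % table_size."""
    seen = set()
    for count in range(table_size):
        seen.add((start + count * step) % table_size)
    return len(seen)


# step = 5 - (data % 5), as in the question, so step is always in 1..5
for step in range(1, 6):
    print(
        f"step={step}: "
        f"size 100 reaches {reachable_slots(0, step, 100)} slots, "
        f"size 101 reaches {reachable_slots(0, step, 101)} slots"
    )
```

With a table size that is not coprime to the step, an insert can run out of probes while free slots still exist, which would leave blanks even after 100 inserts.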
I use this function to generate unique salts for each of my users when they sign up (random letters and numbers). How big is the chance that salts will collide?
// length is for the underlying bytes, not the resulting string.
String generateSalt([int length = 94]) {
  final Random random = Random.secure();
  var values = List<int>.generate(length, (i) => random.nextInt(256));
  return base64Url.encode(values);
}
to a first approximation they are not going to collide! in fact I'd suggest making length smaller so you don't potentially reveal as much entropy of the RNG
your algorithm currently produces log2(256**94) = 8*94 = 752 random bits. via the birthday problem we know that you'd have to produce about 2**(752/2) = 2**376 values to have a 50% chance of collision. generating this many values is impossible.
let's reduce this down to a more reasonable 2**128 values for a 50% chance of collision. this means you'd want 256 random bits generated, and would mean your length would be 256/8 = 32.
note that all of the above relies upon Random.secure actually being a csPRNG and any attacker not knowing any of its state. given the above probabilities, this is the much bigger vulnerability of this system
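the arithmetic above can be checked with a few lines of Python. this is only a sketch: the function names are made up here, and `secrets.token_urlsafe` is Python's standard-library counterpart for generating such a salt, not the Dart API from the question.

```python
import secrets


def salt_bits(length_bytes: int) -> int:
    """Each random byte contributes 8 bits."""
    return 8 * length_bytes


def values_for_even_odds(bits: int) -> int:
    """Birthday-bound rule of thumb: about 2**(bits/2) values for a ~50% collision chance."""
    return 2 ** (bits // 2)


print(salt_bits(94))                                  # bits in the original salt
print(values_for_even_odds(salt_bits(94)) == 2**376)  # matches the estimate above
print(salt_bits(32))                                  # bits in the shorter salt

# A 32-byte salt drawn from the OS CSPRNG, base64url-encoded without padding.
salt = secrets.token_urlsafe(32)
print(len(salt), salt)
```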
I have the following task.
I have 1 billion or more distinct 20-byte hashes (stored in some database); their total number is less than Java's Long.MAX_VALUE.
After that I have an almost infinite stream of such hashes.
Is it possible to create a bijective mapping from the set of these distinct 20-byte hashes to the set of numbers between 0 and Long.MAX_VALUE?
Something like a Lagrange polynomial calculation, but maybe there is something really fast and effective for this case.
We need a fast long value calculation for each hash from this almost infinite stream.
Each 20-byte hash is just a number.
Before processing the stream we can create the mapping
20-byte | 8-byte
(hash1 1)
....
(hashN N)
After that, when we get the next hash from the infinite stream, we want to obtain the 8-byte value without lookups, using only arithmetic calculations.
Since you gave no practical constraints on size or storage beyond "It has to be fast", I am going to assume you can take your time to pre-process the set of hashes in order to "make it fast". I am further assuming the hashes are distributed randomly and that the mapping to 8-byte numbers is likewise unpredictable.
My first approach would be a local SQLite database. That allows you to use its native BTree indexing to quickly retrieve results. With a large enough page size you can store 256 pointers per BTree node, for an expected log_256(10^9) ≈ 3.74 disk seeks per lookup. This will improve as more of your BTree structures get cached.
Second approach, if you have the memory for it: in-memory BTree.
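A minimal sketch of the SQLite approach, using Python's built-in sqlite3 module. The table and column names, the `sample_hash` stand-in data and the `lookup` helper are all assumptions made for illustration; an in-memory database is used here so the snippet runs on its own, and a file path would replace ":memory:" for a real on-disk table.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a real on-disk table
conn.execute("CREATE TABLE mapping (hash BLOB PRIMARY KEY, id INTEGER NOT NULL)")


def sample_hash(i: int) -> bytes:
    """Stand-in for the real data: a 20-byte SHA-1 digest."""
    return hashlib.sha1(str(i).encode()).digest()


# Pre-processing step: store (hash, sequence number) pairs; the primary key is B-tree indexed.
with conn:
    conn.executemany(
        "INSERT INTO mapping (hash, id) VALUES (?, ?)",
        ((sample_hash(i), i) for i in range(1, 1001)),
    )


def lookup(h: bytes):
    """Return the 8-byte id for a known hash, or None if the hash was never stored."""
    row = conn.execute("SELECT id FROM mapping WHERE hash = ?", (h,)).fetchone()
    return row[0] if row else None


print(lookup(sample_hash(42)))
print(lookup(sample_hash(5000)))
```

Note that this is a lookup, not the lookup-free arithmetic the question asks for; it only shows how cheap an indexed lookup can be made.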
Would it work something like this?
aNextHash = Stream.getHash();
long aValue = aNextHash % Long.MAX_VALUE;
I want to hash a String into a value that is numeric (NSNumber/Int) rather than alphanumeric.
The problem is that after digging through Swift and some 3rd-party libraries, I haven't been able to find any library that meets our needs.
I'm working on a Chat SDK and it takes an NSNumber/Int as a unique identifier to correlate Chat Message and Conversation Message.
My company's requirement is not to store any additional field in the database
or change our existing schema, which complicates things.
A neat solution my team came up with was some sort of hash function that generates a number.
func userIdToConversationNumber(id:String) -> NSNumber
We can use that function to convert a String to an NSNumber/Int. This Int should be produced by that function, and the probability of a collision should be negligible. Any suggestions on an approach?
The key calculation you need to perform is the birthday bound. My favorite table is the one in Wikipedia, and I reference it regularly when I'm designing systems like this one.
The table expresses how many items you can hash for a given hash size before you have a certain expectation of a collision. This is based on a perfectly uniform hash, which a cryptographic hash is a close approximation of.
So for a 64-bit integer, after hashing 6M elements, there is a 1-in-a-million chance that there was a single collision anywhere in that list. After hashing about 190M elements, there is a 1-in-a-thousand chance that there was a single collision. And after 5 billion elements, you should bet on a collision (50% chance).
So it all comes down to how many elements you plan to hash and how bad it is if there is a collision (would it create a security problem? can you detect it? can you do anything about it like change the input data?), and of course how much risk you're willing to take for the given problem.
Personally, I'm a 1-in-a-million type of person for these things, though I've been convinced to go down to 1-in-a-thousand at times. (Again, this is not 1:1000 chance of any given element colliding; that would be horrible. This is 1:1000 chance of there being a collision at all after hashing some number of elements.) I would not accept 1-in-a-million in situations where an attacker can craft arbitrary things (of arbitrary size) for you to hash. But I'm very comfortable with it for structured data (email addresses, URLs) of constrained length.
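If you'd rather compute the birthday bound than look it up, here is a small Python sketch using the standard approximation n ≈ sqrt(2 · 2^bits · ln(1/(1−p))), which assumes a perfectly uniform hash. The function name is mine, not from any library.

```python
import math


def items_for_collision_probability(bits: int, p: float) -> float:
    """Approximate count of uniformly random `bits`-bit values at which the
    chance of at least one collision reaches p."""
    return math.sqrt(2 * 2**bits * -math.log1p(-p))


for p in (1e-6, 1e-3, 0.5):
    n = items_for_collision_probability(64, p)
    print(f"64-bit hash, collision probability {p}: about {n:.2e} items")
```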
If these numbers work for you, then what you want is a hash that is highly uniform in all its bits. And that's a SHA hash. I'd use a SHA-2 (like SHA-256) because you should always use SHA-2 unless you have a good reason not to. Since SHA-2's bits are all independent of each other (or at least that's its intent), you can select any number of its bits to create a shorter hash. So you compute a SHA-256, and take the top (or bottom) 64-bits as an integer, and that's your hash.
As a rule, for modest sized things, you can get away with this in 64 bits. You cannot get away with this in 32 bits. So when you say "NSNumber/Int", I want you to mean explicitly "64-bit integer." For example, on a 32-bit platform, Swift's Int is only 32 bits, so I would use UInt64 or uint64_t, not Int or NSInteger. I recommend unsigned integers here because these are really unique bit patterns, not "numbers" (i.e. it is not meaningful to add or multiply them) and having negative values tends to be confusing in identifiers unless there is some semantic meaning to it.
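To make the truncation concrete, here is a minimal sketch in Python (not Swift, purely to keep it short and runnable): compute SHA-256 over the UTF-8 bytes and read the first 8 bytes as an unsigned 64-bit integer. The function name mirrors the one in the question; the choice of UTF-8 and of big-endian byte order are assumptions, and any fixed choice works as long as every client makes the same one.

```python
import hashlib


def user_id_to_conversation_number(user_id: str) -> int:
    """Truncate SHA-256 to its top 64 bits and return them as an unsigned integer."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


print(user_id_to_conversation_number("alice@example.com"))
print(user_id_to_conversation_number("bob@example.com"))
```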
Note that everything said about hashes here is also true of random numbers, if they're generated by a cryptographic random number generator. In fact, I generally use random numbers for these kinds of problems. For example, if I want clients to generate their own random unique IDs for messages, how many bits do I need to safely avoid collisions? (In many of my systems, you may not be able to use all the bits in your value; some may be used as flags.)
That's my general solution, but there's an even better solution if your input space is constrained. If your input space is smaller than 2^64, then you don't need hashing at all. Obviously, any Latin-1 string up to 8 characters can be stored in a 64-bit value. But if your input is even more constrained, then you can compress the data and get slightly longer strings. It only takes 5 bits to encode 26 symbols, so you can store a 12 letter string (of a single Latin case) in a UInt64 if you're willing to do the math. It's pretty rare that you get lucky enough to use this, but it's worth keeping in the back of your mind when space is at a premium.
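Here is a rough Python sketch of the 5-bits-per-letter idea, assuming the input is at most 12 lowercase ASCII letters. Letters are encoded as 1..26 rather than 0..25 so that leading "a"s survive the round trip; that detail, and both function names, are my own choices for illustration.

```python
def pack_letters(s: str) -> int:
    """Pack up to 12 lowercase ASCII letters into 60 bits, 5 bits per letter."""
    if len(s) > 12 or not all("a" <= c <= "z" for c in s):
        raise ValueError("expected at most 12 lowercase ASCII letters")
    value = 0
    for c in s:
        value = (value << 5) | (ord(c) - ord("a") + 1)  # 1..26; 0 means "no letter"
    return value


def unpack_letters(value: int) -> str:
    """Inverse of pack_letters."""
    letters = []
    while value:
        letters.append(chr((value & 0x1F) - 1 + ord("a")))
        value >>= 5
    return "".join(reversed(letters))


packed = pack_letters("conversation")
print(packed, unpack_letters(packed))
```

Because the mapping is reversible, there are no collisions at all within that input space.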
I've built a lot of these kinds of systems, and I will say that eventually, we almost always wind up just making a longer identifier. You can make it work on a small identifier, but it's always a little complicated, and there is nothing as effective as just having more bits.... Best of luck till you get there.
Yes, you can create hashes that are collision resistant using a cryptographic hash function. The output of such a hash function is in bits if you follow the algorithm's specification. However, implementations will generally only return bytes or an encoding of the byte values. A hash does not return a number, as others have indicated in the comments.
It is relatively easy to convert such a hash into a number of 32 bits such as an Int or Int32. You just take the leftmost bytes of the hash and interpret those as an unsigned integer.
However, a cryptographic hash has a relatively large output size precisely to make sure that the chance of collisions is small. Collisions are subject to the birthday problem, which means that you only have to try about 2^(hLen/2) inputs to create a collision within the generated set. E.g. you'd need about 2^80 tries to create a collision of RIPEMD-160 hashes.
Now for most cryptographic hashes, certainly the common ones, the same rule applies. That means that for a 32-bit hash you'd only need about 2^16 hashes to have a good chance (roughly 40%) of a collision. That's not good: 65536 tries are very easy to accomplish. And somebody may get lucky earlier, e.g. after about 5,800 tries you'd already have roughly a 1 in 256 chance of a collision. That's no good.
So calculating a hash value to use it as an ID is fine, but you'd need the full output of a hash function, e.g. 256 bits of SHA-2, to be very sure you don't have a collision. Otherwise you may need to use something like a serial number instead.
Is it done in O(1) or O(n) or somewhere in between? Is there any disadvantage to computing the hash of a very large object vs a small one? If it matters, I'm using Python.
Generally speaking, computing a hash will be O(1) for "small" items and O(N) for "large" items (where "N" denotes the size of an item's key). The precise dividing line between small and large varies, but is typically somewhere in the general vicinity of the size of a register (e.g., 32 bits on a 32-bit machine, 64 bits on a 64-bit machine). This can also depend on the input type: for example, integer types up to the register size all hash with constant complexity, while strings take time proportional to their size in bytes, right down to a single character (i.e., a two-character string takes roughly twice the time of a single-character string).
Once you've computed the hash, accessing the hash table has expected constant complexity, but can be as bad as O(N) in the worst case (but this is a different "N"--the number of items inserted in the table, not the size of an individual key).
The real answer is: it depends. You didn't specify what hash function you are interested in. For a cryptographic hash like SHA256, the complexity is O(n) in the size of the input. For a hash function that takes the last two digits of a phone number, it will be O(1). Hash functions that are used in hash tables tend to be optimized for speed and thus are closer to O(1).
For further reference on hash tables see this page from python wiki on Time Complexity.
Most of the time a lookup by hash is O(1). However, with a really bad hash function where every value has the same hash, it degrades to O(n) in the worst case.
The more objects that share a hash value, the more collisions there are.
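The worst case is easy to provoke in Python. In this sketch, `BadKey` is a made-up class whose `__hash__` always returns the same value, and the number of `__eq__` calls made by a single failed lookup is counted; with every key in one collision chain, that count grows with the number of stored items.

```python
class BadKey:
    """Key type with a deliberately terrible hash: every instance collides."""

    eq_calls = 0

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return 1  # same hash for every key

    def __eq__(self, other):
        BadKey.eq_calls += 1
        return self.value == other.value


def comparisons_for_missing_lookup(n: int) -> int:
    """Build a dict of n colliding keys, then count comparisons for one failed lookup."""
    table = {BadKey(i): i for i in range(n)}
    BadKey.eq_calls = 0
    _ = BadKey(-1) in table  # must be compared against every stored key
    return BadKey.eq_calls


for n in (10, 100, 500):
    print(n, comparisons_for_missing_lookup(n))
```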
So it's time for me to index my database file format and after looking at various methods, I decided that a hash table would be my best option. Since I've only familiarized myself with the inner workings of a hash table just today though, here's my understanding of it, so please correct me if I'm wrong:
A hash table has a constant size that is equivalent of the maximum value storable in its hash function output size * key value pair size * bucket size + overflow bucket size. So for example, if the hash function makes 16 bit hashes and the bucket size is 4 and the values are 32bit then it would be 2^16 * 4 * 6 = 1572864 or 1.5MB plus overflow.
That in essence would make the hash table a sort of compressed lookup table. If the hash function changes, the whole table has to be reevaluated. Otherwise it just adds stuff to empty slots. Also the hash table can contain the maximum of units that its hash size could address (so for a 16bit hash its 65536) but to perform well without many collisions it would have to be much less.
Ok and heres the things I'm trying to index: (up to) 100 million pairs with 64bit integer keys and a 96bit value. The keys are object ID's(that mostly come in short sequences but can be all over the place) and the values are the object location + length. Reads/writes are equally important and very frequent.
The other options i looked into were various trees but the reason I didn't like them is because it seems to me that i would have to do a lot of sparse reads/writes to look up the data or to restructure the tree each time I go in.
So here are my questions:
It seems to me that I need a hash with a weird number of bits in it, I'm thinking up to ~38 since it would be just about the maximum I can store on a single disk and should be comfy enough for the 100 million. Is the weird bit amount unheard of? I'm thinking I'll probably bottleneck on disk activity way before CPU.
Are there any articles out there on how to design a good hash function for my particular case? Googling gave me an overview of the common methods but I'm looking for explanations behind them.
Any other general tips/pitfalls I should know of?
A hash table has a constant size
...not necessarily - a hash table can support resizing, but that tends to be done in fairly dramatic and invasive chunks where you can reason about the hash table as if it were constant size both before and after.
...that is equivalent of the maximum value storable in its hash function output size * key value pair size * bucket size + overflow bucket size. So for example, if the hash function makes 16 bit hashes and the bucket size is 4 and the values are 32bit then it would be 2^16 * 4 * 6 = 1572864 or 1.5MB plus overflow.
Not at all. A better way to calculate size is to say there are N values of a certain size, and you want to maintain a capacity:size ratio somewhere between say 3:1 and 5:4: the table memory usage is: N * sizeof(Value) * ratio.
The number of bits in the hash value is only relevant in that it indicates the maximum number of distinct buckets you can hash to: if you try to have a bigger table then you'll get more collisions than you would with a hash function generating wider hash values. Having more bits from your hash function than you need is not a problem: you can, for example, take the modulus with the current table size to find your bucket: hashed_to_bucket = hash_value % num_buckets.
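As a rough worked example in Python of the sizing rule and the modulus step, assuming each entry is the 64-bit key plus the 96-bit value from the question (20 bytes, ignoring padding and per-bucket overhead). The function names are illustrative only.

```python
def table_bytes(num_entries: int, entry_size: int, capacity_ratio: float) -> int:
    """Memory estimate: N * sizeof(entry) * (capacity:size ratio)."""
    return int(num_entries * entry_size * capacity_ratio)


def bucket_for(hash_value: int, num_buckets: int) -> int:
    """Reduce a wide hash value to a bucket index."""
    return hash_value % num_buckets


ENTRY_SIZE = 8 + 12  # assumption: 64-bit key + 96-bit value, no padding

for ratio in (1.25, 3.0):  # the 5:4 and 3:1 ends of the suggested range
    size_gb = table_bytes(100_000_000, ENTRY_SIZE, ratio) / 1e9
    print(f"ratio {ratio}: about {size_gb:.1f} GB")

print(bucket_for(2**40 + 7, 1024))
```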
That in essence would make the hash table a sort of compressed lookup table.
That's a good way to look at a hash table.
If the hash function changes, the whole table has to be reevaluated. Otherwise it just adds stuff to empty slots.
Definitely reevaluated/regenerated. Otherwise adding to empty slots is but one of the undesirable consequences.
Also the hash table can contain the maximum of units that its hash size could address (so for a 16bit hash its 65536) but to perform well without many collisions it would have to be much less.
As above, that (e.g. 65536) is not a hard maximum, but "to perform well without collisions" going over that should be avoided. To perform well it does not have to be much less: anything right up to 65536 is perfectly fine if it's a good quality 16-bit hash function.
Ok and heres the things I'm trying to index: (up to) 100 million pairs with 64bit integer keys and a 96bit value. The keys are object ID's(that mostly come in short sequences but can be all over the place) and the values are the object location + length. Reads/writes are equally important and very frequent.
The other options i looked into were various trees but the reason I didn't like them is because it seems to me that i would have to do a lot of sparse reads/writes to look up the data or to restructure the tree each time I go in.
Could be... a lot depends on your access patterns. For example, if you happen to try to access the keys following the "short sequences" then a data organisation model that tends to put them nearby in memory/disk helps. Some types of tree structures do that nicely, and you can sometimes hack your hash function to do it too (but need to balance that up against collision proneness).
It seems to me that I need a hash with a weird number of bits in it, I'm thinking up to ~38 since it would be just about the maximum I can store on a single disk and should be comfy enough for the 100 million. Is the weird bit amount unheard of? I'm thinking I'll probably bottleneck on disk activity way before CPU.
Not so... you have 64 bit integer keys - a 64 bit or larger hash would be desirable. That said, a 32 bit hash may well be fine too - that generates 4 billion distinct values which is greater than your 100 million keys.
Are there any articles out there on how to design a good hash function for my particular case? Googling gave me an overview of the common methods but I'm looking for explanations behind them.
Not that I'm aware of.
Any other general tips/pitfalls I should know of?
For tips... I'd say start simple (e.g. with the hash function returning the key unchanged and using modulus with a hash table capacity that's a prime number, OR using any common hash if you're picking up a hash table implementation that uses e.g. power-of-2 numbers of buckets) and measure your collision rates: that tells you how much effort it's worth putting into improving your hashing.
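A minimal Python sketch of "start simple and measure", assuming the identity hash and made-up sample keys that come in short sequences. `collision_count` is a hypothetical helper, and the two capacities (a prime and a power of two) are arbitrary picks for comparison.

```python
from collections import Counter


def collision_count(keys, num_buckets: int, hash_fn=lambda k: k) -> int:
    """Number of keys that land in a bucket some earlier key already occupies."""
    buckets = Counter(hash_fn(k) % num_buckets for k in keys)
    return sum(n - 1 for n in buckets.values())


# Made-up sample: two short runs of sequential 64-bit object IDs.
keys = list(range(1_000_000, 1_000_500)) + list(range(5_000_000_000, 5_000_000_500))

print("prime capacity 2003:", collision_count(keys, 2003))
print("power-of-two capacity 2048:", collision_count(keys, 2048))
```

Running this against a sample of your real keys, with each candidate hash function passed as `hash_fn`, gives the collision rates to compare.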
One very simple way to get "ideal, randomised" hashing in your case is to have 8 tables of 256 32-bit integers - initialised with hardcoded random numbers (you can google for random number download websites). Given any 64-bit key, just slice it into 8 bytes then use each byte as a key in the successive tables, XORing the 32-bit values you look up. A single bit of difference in any of the 64 input bits will then impact all 32 bits in the hash value with equal probability.
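#include <stdint.h>
uint32_t table[8][256] = { ...add some random numbers... };
/* XOR together one table entry per byte of the 64-bit key. */
uint32_t h(uint64_t n)
{
    uint32_t result = 0;
    /* Walk the key's bytes in memory order, so the result depends on endianness. */
    unsigned char* p = (unsigned char*)&n;
    for (int i = 0; i < 8; ++i)
        result ^= table[i][*p++];
    return result;
}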
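This scheme is usually called tabulation hashing. For experimenting without filling in the tables by hand, here is an equivalent Python sketch; it fills the tables from a seeded `random.Random`, which is an assumption made purely so the example is reproducible (a real table would use fixed, hardcoded random numbers as described above). It reads the key's bytes in little-endian order.

```python
import random

rng = random.Random(12345)  # fixed seed, for illustration only
TABLES = [[rng.getrandbits(32) for _ in range(256)] for _ in range(8)]


def tabulation_hash(key: int) -> int:
    """Hash a 64-bit key to 32 bits by XORing one table entry per key byte."""
    key &= 0xFFFFFFFFFFFFFFFF  # keep only the low 64 bits
    result = 0
    for i, byte in enumerate(key.to_bytes(8, "little")):
        result ^= TABLES[i][byte]
    return result


for key in (0, 1, 2, 2**63):
    print(key, hex(tabulation_hash(key)))
```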