How many elements can a Perl hash have? - perl

I was wondering if there is a limit to how many elements a Perl hash data structure can hold. I am assuming it is probably dependent on how much memory you have available. Do key and value sizes matter in terms of how many elements it can hold?

There's no trivial fixed upper bound. It depends on the memory available in the system. If the keys to the hash are bigger, you will run out of memory quicker than if they are smaller. Similarly with the values in the hash; the bigger they are, the sooner you run out of memory.
Generally, the number of elements that will fit in a hash is the least of your problems; if you run out of memory, you should probably be rethinking your algorithm anyway.
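As a quick illustration of the point above, a Perl hash simply keeps accepting elements until the process runs out of memory; the only number you can usefully query is how many it currently holds. A minimal sketch (the loop bound and the key/value shapes here are arbitrary):

    use strict;
    use warnings;

    my %h;
    # The loop bound is arbitrary; in practice the real ceiling is whatever
    # memory the process can allocate, and bigger keys/values hit it sooner.
    for my $i (1 .. 1_000_000) {
        $h{"key_$i"} = "value_$i";
    }
    printf "hash currently holds %d elements\n", scalar keys %h;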

Related

Comparing hashes to test for collisions

I wish to compare hashes to check for collisions (Yes, I know it is time-consuming, but never mind that). In checking for collisions, hashes need to be compared. Is the best method to have a single hash in a variable to compare against, or to have a list of all hashes previously generated and compare the latest hash to each item in the list?
I would prefer the first option because it is much faster, but is there a recommended method? Are you less likely to find a collision by using the first method?
Is the best method to have a single hash in a variable to compare against or to have a list of all hashes previously generated and compare the latest hash to each item in the list?
Neither.
I would prefer the first option because it is much faster, but is there a recommended method?
I don't understand why you think the first method might work, but then you haven't fully explained your situation. Still, if you want to detect hash values that repeat, you do indeed need to keep track of already-seen hash values. To do that, you don't want to search linearly through a list; you should use a set container to store the seen hashes. A hash table - as suggested in a comment by gnasher729 a few hours back - gives O(1) performance (e.g. in C++, if your hashes are 64 bit, std::unordered_set<uint64_t>), or a balanced binary tree gives O(log N) performance (e.g. C++ std::set<uint64_t>).
Are you less likely to find a collision by using the first method?
You're very likely to miss collisions.
All that said, you may want to reexamine your premise. The chance of a good (cryptographic-quality) hash function producing collisions closely approaches the odds described by the "birthday paradox". As a rule of thumb, if you have 2^N distinct values to hash, you're statistically unlikely to see collisions if your hashes are comfortably more than 2*N bits wide: with enough "comfort", you're more likely to be hit on the noggin by a meteor than to have your program see a collision. You mentioned MD5, so I'd expect 128 bits: unless you're storing on the order of a quadrillion values or more (literally), it's pretty safe to ignore the potential for collisions.
Do note one important use of hash values where collisions happen more often for a different reason: hash tables, where even non-colliding hash values may collide at the same bucket index after they're "wrapped", often à la h % N where N is the number of buckets. In a hash table it's generally impractical to ignore the potential for collisions, and very unwise to try.
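Since this thread started from Perl, here is the same idea as a minimal Perl sketch: a plain hash used as a set plays the role of the std::unordered_set suggested above, and compute_hash() is a hypothetical stand-in (MD5 here) for whatever actually produces your hash values:

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    sub compute_hash { return md5_hex($_[0]) }   # stand-in for your real hash function

    my %seen;          # set of hash values encountered so far (hash used as a set)
    my @collisions;

    for my $item (@ARGV) {
        my $h = compute_hash($item);
        if (exists $seen{$h}) {
            push @collisions, [$seen{$h}, $item];   # same hash value seen before
        } else {
            $seen{$h} = $item;
        }
    }
    printf "%d collision(s) found among %d inputs\n", scalar @collisions, scalar @ARGV;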

best way to resolve collisions in hashing strings

I got asked this question at an interview and said to use a second hash function, but the interviewer kept probing me for other answers. Does anyone have other solutions?
best way to resolve collisions in hashing strings
"with continuous inserts"
Assuming the inserts are of strings whose contents can't be predicted, then reasonable options are:
- Use a displacement list, so you try a number of offsets from the hashed-to bucket until you find a free bucket (modding by table size). Displacement lists might look something like { 3, 5, 11, 19... } etc.; ideally you want the difference between displacements not to be the sum of a sequence of other displacements.
- Rehash using a different algorithm (but then you'd need yet another algorithm if you happen to clash twice, etc.).
- Root a container in the buckets, such that colliding strings can be searched for. Typically the number of buckets should be similar to or greater than the number of elements, so elements per bucket will be fairly small and a brute-force search through an array/vector is a reasonable approach, but a linked list is also credible.
Comparing these, displacement lists tend to be fastest, because adding an offset is cheaper than calculating another hash or making separate heap allocations, and in most cases the first one or two displacements (which can reasonably be by a small number of buckets) are enough to find an empty bucket, so the locality of memory use is reasonable. They are, however, more collision-prone than an alternative hashing algorithm (which should approach a #elements/#buckets chance of further collisions). With both displacement lists and rehashing you have to provide enough retries that in practice you won't expect a complete failure, add some last-resort handling for failures, or accept that failures may happen.
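To make the displacement-list option concrete, here is a toy open-addressing sketch in Perl. The table size, the djb2-style string hash, and the particular offsets are arbitrary illustrative choices, and a real implementation would also need lookup, deletion and growth:

    use strict;
    use warnings;

    my $SIZE = 64;                               # number of buckets (illustrative)
    my @table;                                   # each slot holds [key, value] or undef
    my @displacements = (0, 3, 5, 11, 19, 29);   # offsets tried after the home bucket

    sub string_hash {                            # simple djb2-style hash for the example
        my ($key) = @_;
        my $h = 5381;
        $h = ($h * 33 + ord $_) % 2**32 for split //, $key;
        return $h;
    }

    sub insert {
        my ($key, $value) = @_;
        my $home = string_hash($key) % $SIZE;
        for my $d (@displacements) {
            my $slot = ($home + $d) % $SIZE;
            if (!defined $table[$slot] || $table[$slot][0] eq $key) {
                $table[$slot] = [$key, $value];
                return 1;
            }
        }
        return 0;   # every displacement occupied: the caller must handle this failure
    }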
Use a linked list as the hash bucket, so any collisions are handled gracefully.
Alternative approach: you might want to consider using a trie instead of a hash table for dictionaries of strings.
The upside of this approach is that you get O(|S|) worst-case complexity for seeking/inserting each string, where |S| is the length of that string. Note that a hash table only gives you an average case of O(|S|); the worst case is O(|S|*n), where n is the size of the dictionary. A trie also does not require rehashing when the load factor is too high.
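For comparison, the trie alternative maps naturally onto nested Perl hashes. This is only a sketch of insert and lookup; the "\0end" key is an arbitrary sentinel marking where a complete word ends:

    use strict;
    use warnings;

    my %trie;                                   # root node; each node is a hash keyed by character

    sub trie_insert {
        my ($word) = @_;
        my $node = \%trie;
        $node = $node->{$_} //= {} for split //, $word;   # descend/create one level per character
        $node->{"\0end"} = 1;                   # mark that a complete word ends here
    }

    sub trie_lookup {
        my ($word) = @_;
        my $node = \%trie;
        for my $ch (split //, $word) {
            return 0 unless $node = $node->{$ch};
        }
        return exists $node->{"\0end"} ? 1 : 0; # O(|S|) steps, no rehashing ever needed
    }

    trie_insert($_) for qw(cat car cart);
    print trie_lookup('car') ? "found\n" : "missing\n";   # found
    print trie_lookup('ca')  ? "found\n" : "missing\n";   # missing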
Assuming we are not using a perfect hash function (which you usually don't have), the hash tells you that:
- if the hashes are different, the objects are distinct;
- if the hashes are the same, the objects are probably the same (if a good hash function is used), but may still be distinct.
So in a hash table, a collision is resolved with some additional checking of whether the objects are actually the same or not (this brings some performance penalty, but according to Amdahl's law, you still gain a lot, because collisions rarely happen for good hash functions). In a dictionary you just need to resolve those rare collision cases and ensure you get the right object out.
Using another non-perfect hash function will not resolve anything; it just reduces the chance of (another) collision.
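The "additional checking" mentioned above is simply a full key comparison inside the bucket. Here is a minimal chained-bucket sketch in Perl; the bucket count and the deliberately weak hash function are arbitrary choices made so that collisions actually occur:

    use strict;
    use warnings;

    my $BUCKETS = 16;
    my @table = map { [] } 1 .. $BUCKETS;       # each bucket: array of [key, value] pairs

    sub bucket_of {                             # weak hash (byte sum) so collisions are common
        my ($key) = @_;
        my $sum = 0;
        $sum += ord $_ for split //, $key;
        return $sum % $BUCKETS;
    }

    sub store {
        my ($key, $value) = @_;
        my $bucket = $table[ bucket_of($key) ];
        for my $pair (@$bucket) {
            if ($pair->[0] eq $key) {           # genuinely the same key: overwrite
                $pair->[1] = $value;
                return;
            }
        }
        push @$bucket, [$key, $value];          # different key that merely collided: chain it
    }

    sub fetch {
        my ($key) = @_;
        for my $pair (@{ $table[ bucket_of($key) ] }) {
            return $pair->[1] if $pair->[0] eq $key;   # equality check resolves the collision
        }
        return undef;
    }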

Is there a collision rate difference between one 32-bit hash vs two 16 bit hashes?

I am working on a system where hash collisions would be a problem. Essentially there is a system that references items in a hash-table+tree structure. However, the system in question first compiles text files containing paths in the structure into a binary file containing the hashed values instead. This is done for performance reasons. Because of this, collisions are very bad, as the structure cannot store 2 items with the same hash value; the part asking for an item would not have enough information to know which one it needs.
My initial thought is that 2 hashes, either using 2 different algorithms, or the same algorithm twice, with 2 salts would be more collision resistant. Two items having the same hash for different hashing algorithms would be very unlikely.
I was hoping to keep the hash value 32-bits for space reasons, so I thought I could switch to using two 16-bit algorithms instead of one 32-bit algorithm. But that would not increase the range of possible hash values...
I know that switching to two 32-bit hashes would be more collision resistant, but I am wondering if switching to 2 16-bit hashes has at least some gain over a single 32-bit hash? I am not the most mathematically inclined person, so I do not even know how to begin checking for an answer other than to brute force it...
Some background on the system:
Items are given names by humans; they are not random strings, and will typically be made of words, letters, and numbers with no whitespace. It is a nested hash structure, so if you had something like { a => { b => { c => 'blah' }}} you would get the value 'blah' by getting the value of a/b/c; the compiled request would be 3 hash values in immediate sequence: the hash values of a, b, and then c.
There is only a problem when there is a collision on a given level. A collision between an item at the top level and a lower level is fine. You can have { a => {a => {...}}}, almost guaranteeing collisions that are on different levels (not a problem).
In practice any given level will likely have less than 100 values to hash, and none will be duplicates on the same level.
To test the hashing algorithm I adopted (I forget which one, but I did not invent it), I downloaded the entire list of CPAN Perl modules, split all namespaces/modules into unique words, and finally hashed each one, searching for collisions. I encountered 0 collisions. That means the algorithm has a different hash value for each unique word in the CPAN namespace list (or that I did it wrong). That seems good enough to me, but it's still nagging at my brain.
If you have two 16-bit hashes that produce uncorrelated values, then you have effectively written a 32-bit hash algorithm. That will not be better or worse than any other 32-bit hash algorithm.
If you are concerned about collisions, be sure that you are using a hash algorithm that does a good job of hashing your data (some are written to merely be fast to compute, this is not what you want), and increase the size of your hash until you are comfortable.
This raises the question of the probability of collisions. It turns out that if you have n things in your collection, there are n * (n-1) / 2 pairs of things that could collide. If you're using a k-bit hash, the probability of a single pair colliding is 2^-k. If you have a lot of things, the collisions of different pairs are almost independent, which is exactly the situation that the Poisson distribution describes.
Thus the number of collisions that you will see should approximately follow a Poisson distribution with lambda = n * (n-1) * 2^-(k+1). From that, the probability of no hash collisions is about e^-lambda. With 32 bits and 100 items, the odds of a collision in one level are about 1.1525 in a million. If you do this enough times, with enough different sets of data, eventually those one-in-a-million chances will add up.
But note that if you have many normal-sized levels and a few large ones, the large ones will have a disproportionate impact on your risk of collision. That is because each thing you add to a collection can collide with any of the preceding things - more things means a higher risk of collision. So, for instance, a single level with 1000 data items has about 1 chance in 10,000 of failing, which is about the same risk as 100 levels with 100 data items each.
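Plugging numbers into the formula above is a one-liner; this sketch simply evaluates lambda = n(n-1)/2^(k+1) and the resulting collision probability 1 - e^(-lambda) for the two cases mentioned:

    use strict;
    use warnings;

    # Poisson approximation described above: probability of at least one
    # collision among n items hashed with a k-bit hash.
    sub collision_probability {
        my ($n, $k) = @_;
        my $lambda = $n * ($n - 1) / 2 ** ($k + 1);
        return 1 - exp(-$lambda);
    }

    printf "100 items,  32-bit hash: %.3g\n", collision_probability(100,  32);   # ~1.15e-06
    printf "1000 items, 32-bit hash: %.3g\n", collision_probability(1000, 32);   # ~1.16e-04 (on the order of 1 in 10,000)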
If the hashing algorithm is not doing its job properly, your risk of collision will go up rapidly. How rapidly depends very much on the nature of the failure.
Using those facts and your projections for what the usage of your application is, you should be able to decide whether you're comfortable with the risk from 32-bit hashes, or whether you should move up to something larger.

Are hash collisions with different file sizes just as likely as same file size?

I'm hashing a large number of files, and to avoid hash collisions, I'm also storing a file's original size - that way, even if there's a hash collision, it's extremely unlikely that the file sizes will also be identical. Is this sound (a hash collision is equally likely to be of any size), or do I need another piece of information (if a collision is more likely to also be the same length as the original)?
Or, more generally: Is every file just as likely to produce a particular hash, regardless of original file size?
Hash functions are generally written to evenly distribute the data across all result buckets.
Suppose your files are evenly distributed over a fixed range of available sizes - let's say there are only 1024 (2^10) evenly distributed distinct sizes for your files. Storing the file size at best only reduces the chance of a collision by a factor of the number of distinct file sizes.
Note: we could assume it's 2^32 evenly distributed and distinct sizes and it still doesn't change the rest of the math.
It is commonly accepted that the general probability of a collision on MD5 (for example) is 1/(2^128).
Unless something is specifically built into the hash function that says otherwise, for any valid X the probability P(MD5(X) == MD5(X+1)) remains the same as for any two random values {Y, Z}. That is to say, P(MD5(Y) == MD5(Z)) = P(MD5(X) == MD5(X+1)) = 1/(2^128) for any values of X, Y and Z.
Combining this with the 2^10 distinct file sizes means that by storing the file size you are at most getting an additional 10 bits that signify whether items are different or not (again, this assumes your files are evenly distributed over all values).
So at the very best all you are doing is adding another N bytes of storage for <=N bytes worth of unique values (it can never be >N). Therefore you're much better off increasing the number of bytes returned by your hash function, using something such as SHA-1/2 instead, as this is more likely to give you evenly distributed hash values than storing the file size.
In short: if MD5 isn't good enough against collisions, use a stronger hash. If the stronger hashes are too slow, then use a fast hash with a low chance of collisions (such as MD5) together with a slower hash (such as SHA-1 or SHA-256) to reduce the chance of a collision. But if SHA-256 is fast enough and the doubled space isn't a problem, then you probably should just be using SHA-256.
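In Perl, moving to a wider digest is a one-line change with the core Digest modules, which is usually simpler than carrying the file size alongside an MD5. A minimal sketch that prints both digests for one file:

    use strict;
    use warnings;
    use Digest::MD5;
    use Digest::SHA;

    my $path = shift @ARGV or die "usage: $0 FILE\n";

    open my $fh, '<:raw', $path or die "cannot open $path: $!";
    my $md5    = Digest::MD5->new->addfile($fh)->hexdigest;             # 128-bit digest
    my $sha256 = Digest::SHA->new(256)->addfile($path, 'b')->hexdigest; # 256-bit digest

    print "MD5:     $md5\n";
    print "SHA-256: $sha256\n";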
Depends on your hash function, but in general, files that are of the same size but different content are less likely to produce the same hash as files that are of different size. Still, it would probably be cleaner to simply use a time-tested hash with a larger space (e.g. MD5 instead of CRC32, or SHA1 instead of MD5) than bet on your own solutions like storing file size.
The size of the hash is the same regardless of the size of the original data. As there is only a limited number of possible hashes it is theoretically possible that two files with different sizes may have the same hash. However, this means that it is also possible that two files with the same size may have the same hash.
Hash functions are designed so that it is very difficult to get a collision; otherwise they wouldn't be effective.
If you get a hash collision - which is extremely unlikely, about a 1 : number_of_possible_hashes probability - that says nothing about file size.
If you really want to be double-sure about hash collisions, you can calculate two different hashes for the same file - it will be less error-prone than saving hash + file size.
The whole point of the family of cryptographic hashes (MD5, SHA-x, etc) is to make collisions vanishingly unlikely. The notion is that official legal processes are prepared to depend on it being impractical to manufacture a collision on purpose. So, really, it's a bad use of space and CPU time to add a belt to the suspenders of these hashes.

Explanation about hashing and its use for data compression

I am dealing with an application that uses hashing, but I still cannot figure out how it works. Here is my problem: hashing is used to generate some indexes, and with those indexes I access different tables; then I add up the values from every table that I get using the indexes, and that gives me my final value. This is done to reduce the memory requirements. The input to the hashing function is the XOR of a random constant number and some parameters from the application.
Is this a typical hashing application? The thing I do not understand is how using hashing can reduce the memory requirements. Can anyone clarify this?
Thank you
Hashing alone doesn't have anything to do with memory.
What it is often used for is a hashtable. Hashtables work by computing the hash of what you are keying off of, which is then used as an index into a data structure.
Hashing allows you to reduce the key (string, etc.) into a more compact value like an integer or set of bits.
That might be the memory savings you're referring to--reducing a large key to a simple integer.
Note, though, that hashes are not unique! A good hashing algorithm minimizes collisions, but hashes are not intended to reduce to a unique value - doing so isn't possible (e.g., if your hash outputs a 32-bit integer, your hash would have only 2^32 unique values).
Is it a Bloom filter you are talking about? This uses hash functions to get a space-efficient way to test membership of a set. If so, see the link for an explanation.
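If it is indeed a Bloom filter, the essentials fit in a few lines of Perl using vec() as the bit array. The bit-array size, the use of two probe positions, and deriving them from an MD5 digest are all arbitrary illustrative choices, not tuned parameters:

    use strict;
    use warnings;
    use Digest::MD5 qw(md5);

    my $BITS   = 8192;                 # size of the bit array (illustrative)
    my $filter = "\0" x ($BITS / 8);

    sub positions {                    # derive two bit positions from the key's MD5 digest
        my ($key) = @_;
        my ($p1, $p2) = unpack 'NN', md5($key);
        return ($p1 % $BITS, $p2 % $BITS);
    }

    sub bloom_add {
        my ($key) = @_;
        vec($filter, $_, 1) = 1 for positions($key);
    }

    sub bloom_maybe_contains {
        my ($key) = @_;
        for my $pos (positions($key)) {
            return 0 unless vec($filter, $pos, 1);   # an unset bit: definitely never added
        }
        return 1;                                    # all bits set: probably added (false positives possible)
    }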
Most good hash implementations are memory-inefficient; otherwise there would be more computing involved, and that would be exactly missing the point of hashing.
Hash implementations are used for processing efficiency, as they'll provide you with constant running time for operations like insertion, removal and retrieval.
You can think of the key property of hashing as being that all your data, no matter what type or size, is always represented in a single fixed-length form.
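That fixed-length property is easy to demonstrate with a core module; whatever the input size, the digest length never changes (MD5 is used here purely as an example):

    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);

    for my $input ('x', 'a somewhat longer string', 'x' x 1_000_000) {
        printf "input of %7d bytes -> %d hex-digit digest\n",
               length($input), length(md5_hex($input));   # always 32 hex digits (128 bits)
    }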
This could be explained if the hashing being done isn't to build a true hash table, but is to just create an index in a string/memory block table. If you had the same string (or memory sequence) 20 times in your data, and you then replaced all 20 instances of that string with just its hash/table index, you could achieve data compression in that way. If there's an actual collision chain contained in that table for each hash value, however, then what I just described is not what's going on; in that case, the reason for the hashing would most likely be to speed up execution (by providing quick access to stored values), rather than compression.
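A small sketch of that string-table idea: each distinct string is stored once, every occurrence is replaced by its index, and a plain Perl hash provides the fast string-to-index lookup (the data here is made up for illustration):

    use strict;
    use warnings;

    my @data = ('alpha', 'beta', 'alpha', 'gamma', 'beta', 'alpha');

    my @string_table;   # each distinct string stored exactly once
    my %index_of;       # string -> position in @string_table
    my @compressed;     # original sequence with each string replaced by its index

    for my $s (@data) {
        unless (exists $index_of{$s}) {
            push @string_table, $s;
            $index_of{$s} = $#string_table;
        }
        push @compressed, $index_of{$s};
    }

    print "table:      @string_table\n";   # alpha beta gamma
    print "compressed: @compressed\n";     # 0 1 0 2 1 0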