How Do Hashes Work in Programming? - hash

How do hashes work in programming? The way I think of a hash is as something that lets me use some unique value to retrieve some data. For example, if we have an array and I start putting things in it, and another variable keeps track of which item is in slot 0, 1, 2, ..., then I can find an item instantly. Is that hashing?
What is the purpose of a hash?
When should a hash be implemented? What's a hash similar to in terms of data structure?
What I think I know about hashes is that they let us retrieve an item in O(1). Is that correct?

A hash is like a person's first name -- it's a short way of remembering a person, even though it doesn't have to be unique. If you need to find some information about someone, you can just look them up by their name, and you only need to perform other checks if two or more people have the same name.
That's the power of hashing, and just as remembering people is much easier by name than by Social Security Number, finding an object by its hash code is much easier than actually comparing the object to everything already in your collection.
Now, in this example, if you're looking someone up in a phone book by name, you'd probably find them in O(log n) time, because the names are sorted alphabetically, and because you need to do a binary search. If, however, you instead "hashed" 100 people born in the 1900s by their years of birth, then you'd only need at most 4 comparisons in the hashtable/phonebook (one per digit) to find any one year by hash, which is constant time. Then, if two people are born in the same year, you can use other information to find the person you need, and on average, if your table isn't too full (say, if you have at most 50 people for 100 different years of birth), your lookups will be constant-time.
(If your table gets more than, say, 50% full, you can always double its size, to keep the number of collisions low and hence to keep your lookups fast.)
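As a rough illustration of the year-of-birth example, here is a minimal sketch of a table that buckets people by year and doubles its size before it gets more than 50% full. The class name `YearTable`, its methods, and the starting size of 8 slots are assumptions made for this example, not part of any library.

```python
class YearTable:
    """Toy hash table: people are bucketed by their year of birth."""

    def __init__(self, slots=8):
        self.slots = [[] for _ in range(slots)]
        self.count = 0

    def _index(self, year):
        # The "hash" is simply the year reduced to a valid slot number.
        return year % len(self.slots)

    def add(self, name, year):
        # Double the table first if this insert would push it past 50% full.
        if (self.count + 1) * 2 > len(self.slots):
            self._grow()
        self.slots[self._index(year)].append((year, name))
        self.count += 1

    def _grow(self):
        old = self.slots
        self.slots = [[] for _ in range(len(old) * 2)]
        for bucket in old:
            for year, name in bucket:
                self.slots[self._index(year)].append((year, name))

    def find(self, year):
        # Only one bucket is scanned; the year check resolves collisions.
        return [name for y, name in self.slots[self._index(year)] if y == year]
```

Two people born in the same year land in the same bucket, so `find` still has to look at each of them, which is the "other checks" step described above.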
More information:
If you've ever heard of MD5, SHA-1, or SHA-2 hashes for files, they're like the "fingerprints" of the file. While it's possible for two files to have the same hash, an accidental match is so unlikely that, for practical purposes, it's impossible (deliberately constructed collisions are known for MD5 and SHA-1, so prefer SHA-2 when someone might tamper with the files); hence, if you have the hashes of two files, you can compare the files by their fingerprints rather than by their data, which is immensely faster.
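A minimal sketch of the fingerprint comparison, using Python's standard `hashlib` module with SHA-256 (a member of the SHA-2 family). The function names `fingerprint` and `same_content` and the 64 KiB chunk size are choices made for this example.

```python
import hashlib


def fingerprint(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Reading in chunks keeps memory use flat even for large files.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    # Equal digests are treated as equal content; see the caveat above.
    return fingerprint(path_a) == fingerprint(path_b)
```

The saving comes when the fingerprints are computed once and stored: later comparisons only touch the short digests, not the file data.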

A hash map / dictionary is a key/value data structure that stores objects in buckets based on the value of a hash function. These keys must be unique but the hash function values (sometimes called hashcodes) aren't necessarily unique.
Like if we have an array and I start to put things in the array, if I have another variable that keeps track of what item is in slot 0,1,2... then I have that instant ability to find an item. Is that hashing?
No. A hash function is a deterministic function that always gives the same value for an object. The hash code does not change depending on where the object is stored.
What I think I know about hashes is that it allows us the ability to retrieve the item within O(1). Is that correct?
Nearly. A dictionary has O(1) complexity for lookups if there are not too many hash code collisions. However, if the hash function is poor and every object has the same hash value, then a dictionary could have O(n) performance instead.
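One way to see the effect of a poor hash function is to count how many equality checks a single dictionary lookup needs. This is only a sketch: the `Key` class and the `comparisons_for_lookup` helper are made up for the demonstration, and the exact counts depend on the Python implementation.

```python
class Key:
    eq_calls = 0  # counts equality checks across all Key instances

    def __init__(self, value, constant_hash):
        self.value = value
        self.constant_hash = constant_hash

    def __hash__(self):
        # A constant hash forces every key to collide with every other key.
        return 1 if self.constant_hash else hash(self.value)

    def __eq__(self, other):
        Key.eq_calls += 1
        return isinstance(other, Key) and self.value == other.value


def comparisons_for_lookup(n, constant_hash):
    """Build a dict with n keys, then count the equality checks for one lookup."""
    table = {Key(i, constant_hash): i for i in range(n)}
    Key.eq_calls = 0
    _ = table[Key(n - 1, constant_hash)]
    return Key.eq_calls
```

Calling `comparisons_for_lookup(200, True)` should report far more comparisons than `comparisons_for_lookup(200, False)`, because with a constant hash the dictionary has to walk through the colliding keys one by one.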

A hash makes lookups fast compared with iterating over an array or tree. It makes it possible to search in O(1) time on average, with modest memory overhead.

Related

Implementing Hash Table with binary search tree

This is the controversial line from Cracking the Coding Interview on hash tables.
Another common implementation (besides linked-list) for a hash table is to use a BST as the underlying data structure.
I know this question has been asked before... it's so confusing because everyone is giving two different answers. For example
Why implement a Hashtable with a Binary Search Tree?
The highest-voted answer in this post says that the quoted statement above is talking about a hash table implementation using a binary search tree, without an underlying array. I understood that since each element inserted gets a hash value (an integer), the elements form a total order (every pair can be compared with < and >). Therefore, we can simply use a binary search tree to hold the elements of the hash table.
On the other hand, others say
Hash table - implementing with Binary Search Tree
the book is saying that we should handle collisions with a binary search tree. So there is an underlying array, and when collisions occur because multiple elements get the same hash value and are placed in the same slot in the array, that's where the BST comes in.
So each slot in the array will be a pointer to a BST, which holds elements with the same hash value.
I'm leaning towards the second post's argument because the first post does not really explain how such implementation of a hash table can handle collisions. And I don't think it can achieve expected O(1) insert/delete/lookup time.
But for the second post, if we have multiple elements that get the same hash value and placed in a BST, I'm not sure how these elements are ordered (how can they be compared against each other?)
Please, help me put an end to this question once and for all!
the first post does not really explain how such implementation of a hash table can handle collisions
With a BST, you can use a hashing function that would produce no duplicate keys so there would be no collisions. The advantage here isn't speed but reduced memory consumption and better worst-case guarantees. If you're writing software for a critical real-time system, you might not be able to tolerate an O(n) resizing of your hash table.
if we have multiple elements that get the same hash value and placed in a BST, I'm not sure how these elements are ordered (how can they be compared against each other?)
Rehash with another function.
In the end, it all depends on what your data structure is used for (Is memory vs. speed more important? Is amortized performance vs worst-case performance more important? etc.)
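To make the second reading concrete (an array whose slots each point to a BST), here is a minimal sketch. It orders the elements inside a bucket by comparing the keys themselves with `<`, which is an assumption of this example and only one possible answer to the ordering question; rehashing with a second function, as suggested above, is another. The tree is unbalanced for brevity, and the names `Node` and `TreeBucketTable` are made up.

```python
class Node:
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None


class TreeBucketTable:
    """Hash table whose buckets are (unbalanced) binary search trees."""

    def __init__(self, slots=16):
        self.slots = [None] * slots

    def put(self, key, value):
        i = hash(key) % len(self.slots)
        node = self.slots[i]
        if node is None:
            self.slots[i] = Node(key, value)
            return
        while True:
            if key == node.key:
                node.value = value  # overwrite an existing key
                return
            side = "left" if key < node.key else "right"
            child = getattr(node, side)
            if child is None:
                setattr(node, side, Node(key, value))
                return
            node = child

    def get(self, key):
        node = self.slots[hash(key) % len(self.slots)]
        while node is not None:
            if key == node.key:
                return node.value
            node = node.left if key < node.key else node.right
        raise KeyError(key)
```

With a balanced tree in place of this plain BST, a bucket holding k colliding keys would be searched in O(log k) rather than O(k), which is where the better worst-case behaviour comes from.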

How are hashes used to look up entries in a database?

I'm reading Medium Staff's article on Bloom filters, and I found this snippet:
When you ask the database for a specific post, how does it know where to find it? A hash function. The database takes the ID, hashes it to a unique value, and uses that to jump straight to the record location, just like you’d use the index of a book to jump straight to a specific page number
How exactly does this work in practice? If each ID is hashed, shouldn't the lookup time using the hash be the same as using the actual ID?

Mongo Objectid collision possibility VS MD5

Which is more likely to collide: a Mongo ObjectId or an MD5 hash? I am building a website and looking for a way to index my products.
Thanks in advance.
MongoDB's ObjectId is unlikely to collide. It is built from a timestamp, a random value or machine/process identifier (depending on the driver version), and an incrementing counter. MD5 hashes depend only on the value of the input. If you pass two inputs that have the same value, then the hashes will be the same.
I would need to know more about how you hash your products. If you are sure your product values will never be the same, then you can use either. But I would use ObjectId, because you won't need to worry about product values and hashing at all. An ObjectId (12 bytes) is also smaller than an MD5 hash (16 bytes), which is better for indexing.
If your product model has some relation to real-world products, then there are existing ways to index them: pick article numbers or EAN (IAN). They can be an addition to the natural auto-incrementing IDs of MySQL. UUIDs have pros (mainly for distributed databases) and cons; read https://tomharrisonjr.com/uuid-or-guid-as-primary-keys-be-careful-7b2aa3dcb439
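The point about MD5 depending only on its input can be checked with the standard `hashlib` module. In this sketch, `md5_id` and the idea of joining product fields with a separator character are hypothetical choices for illustration, not a recommended scheme.

```python
import hashlib


def md5_id(product_fields):
    """Derive an ID from product fields (hypothetical scheme for illustration)."""
    # Join with a separator that is unlikely to appear inside a field.
    joined = "\x1f".join(product_fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


first = md5_id(("Mug", "blue", "350ml"))
second = md5_id(("Mug", "blue", "350ml"))   # same values, so same ID
third = md5_id(("Mug", "green", "350ml"))   # different values
print(first == second, first == third)
```

Two products with identical field values get the same ID here, which is exactly the situation where an ID that does not depend on the product's values avoids the problem.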

DynamoDB: Querying of non-key values with comparisons

Let's say we have many data tables structured as timestamp(hash) - value pairs, where values could be for example temperatures or other kinds of varied measurement data.
To get timestamps of certain values we can build a secondary index with value(hash) - timestamp(range), but what if we want to query the value with comparison operations like GT, LT, BETWEEN to get timestamps of a range of values?
Obviously, I want to avoid using scan. The only thing I've come up with is using a dummy hash key and putting the values+timestamps into range attribute, but I'm guessing this has its own problems (better or worse compared to scan?).
Is there a better solution or can this be done with DynamoDB at all?
You need to know the HASH key before you can perform a query on the RANGE key. To get around this you would need to denormalize your table, i.e. create a duplicate with the keys reversed. Although that seems like a bit of a pain in the butt, it is one of the tradeoffs that is at times required for all the performance benefits of a key-value store.
Example for this case:
If both keys are completely random, then you are out of luck. Rather than setting your HASH to a dummy value, you could try using a monthly timestamp instead; that way you should always be able to work out programmatically what the hash should be. You can also then look at setting the range as a combination of both values separated by a hyphen, i.e. timestamp-value, and then, in the denormalized table, value-timestamp. That way you should be able to use the comparison operators with no performance hit.
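A sketch of how the keys for the denormalized (value-timestamp) table might be built. It makes no DynamoDB calls; the function names, the fixed width of 8 characters, the two decimal places, and the restriction to non-negative values are all assumptions of this example. The zero-padding is there because a string range key is compared as text, so `"9.5"` would otherwise sort after `"21.5"`.

```python
from datetime import datetime


def month_hash_key(ts):
    # One hash key per month, so it can always be derived from a timestamp.
    return ts.strftime("%Y-%m")


def value_range_key(value, ts, width=8):
    # Zero-padded value first, then the timestamp, joined by a hyphen.
    return f"{value:0{width}.2f}-{ts.strftime('%Y%m%dT%H%M%S')}"


def between_bounds(low, high, width=8):
    # Lower and upper range-key strings for "low <= value <= high".
    # "~" sorts after digits and letters, so it closes the upper bound.
    return f"{low:0{width}.2f}-", f"{high:0{width}.2f}-~"


ts = datetime(2024, 3, 5, 12, 0, 0)
keys = sorted(value_range_key(v, ts) for v in (100.0, 9.75, 21.5, 25.0))
lo, hi = between_bounds(20, 25)
print(month_hash_key(ts), [k for k in keys if lo <= k <= hi])
```

The two bound strings are what a BETWEEN condition on the range key would receive, with the month string as the hash key; a query spanning several months would need one request per month.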

What's the purpose in hashing information?

After being taught how to create a hash table in class, I don't understand when hashing data would be useful. It seems to me that all hashing does is storing information in semi-random positions in an array. I want to know how any of the data can be made useful after it's stored.
My question is this: what are some examples where hashing information is beneficial? How is data retrieved in any organized manner? It seems to be placed in arbitrary positions where it would be difficult to retrieve.
Hashing can be used for many purposes:
It can be used to compare large amounts of data. You create the hashes for the data, store the hashes and later if you want to compare the data, you just compare the hashes.
Hashes can be used to index data. They can be used in hash tables to point to the correct row. If you want to quickly find a record, you calculate the hash of the data and directly go to the record where the corresponding hash record is pointing to. (This assumes that you have a sorted list of hashes that point to the actual records)
They can be used in cryptographic applications like digital signatures.
Hashing can be used to generate seemingly random strings.
Here are the applications of hash functions that Wikipedia lists:
Finding duplicate records
Finding similar records
Finding similar substrings
Geometric hashing
Now regarding hash table, here are some points to note:
A classic in-memory hash table needs no sorting: the hash, reduced to the size of the array, is the slot index. If instead the hashes are stored as a column alongside the records, they should be kept sorted; if not, you will have to create an index on the hash column. Some implementations store the hashes separately in sorted order and point to the original record.
If somebody is storing hashes in a semi-random order, it must be either because of the above reasons or because they just want to store the message digest of the information for comparing, finding duplicates etc. and not as an index to the data.
One of the major uses of the hash tables you created in class is when you need fast O(1) lookup times. You'll have two components: the keys and the values.
The hash function transforms the key into a hash. That hash is a number which, typically after being reduced modulo the size of the array, is the index of the data in the array.
So, when you need to look up Agscala's reputation in a hash table and the username is used as the key, it takes almost no time to locate the relevant value. It simply re-hashes the username and voilà, there is the index of the data you were looking for. You didn't have to iterate over the entire array looking for that specific value.
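The key-to-index step can be sketched in a few lines. This is a minimal illustration, assuming Python's built-in `hash()` as the hash function and a fixed array of 8 slots; the helper names and the reputation numbers are made up.

```python
def slot_for(key, table_size):
    # hash() gives an integer; the modulo turns it into a valid array index.
    return hash(key) % table_size


def put(slots, key, value):
    bucket = slots[slot_for(key, len(slots))]
    for i, (k, _) in enumerate(bucket):
        if k == key:
            bucket[i] = (key, value)  # replace the value for an existing key
            return
    bucket.append((key, value))


def get(slots, key):
    # Re-hash the key, jump to that slot, and check only the entries there.
    for k, v in slots[slot_for(key, len(slots))]:
        if k == key:
            return v
    raise KeyError(key)


slots = [[] for _ in range(8)]
put(slots, "Agscala", 1234)       # made-up reputation value
put(slots, "someone_else", 42)
print(get(slots, "Agscala"))
```

Each slot holds a small list so that two usernames that happen to map to the same index can coexist.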
For some reference the Wikipedia page on Hash tables is pretty good.
There are a couple of typical reasons to hash data. In the example you reference, you would hash the data and use that as a key to extract the actual value of the hashed item. The hashed data is often referred to as a key and it references a bucket where the actual, non-hashed value can be found.
The other typical reason is to create a signature of the hashed value so that you can check if the value has been changed by someone else. Since it's usually rare, depending on the algorithm used, to have two items hash to the same value, you can rehash a value and compare it to the saved hash value to check if the item is still the same.
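The "rehash and compare" check might look like the following sketch, which uses SHA-256 from the standard `hashlib` module. The function names and the sample data are assumptions for illustration. Note that a bare hash only detects changes if the saved digest itself is stored somewhere an attacker cannot rewrite.

```python
import hashlib
import hmac


def digest_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def unchanged(data: bytes, saved_digest: str) -> bool:
    # Rehash the current data and compare it with the digest saved earlier.
    return hmac.compare_digest(digest_of(data), saved_digest)


original = b"temperature=21.5"
saved = digest_of(original)  # stored when the value was first recorded
print(unchanged(original, saved), unchanged(b"temperature=99.9", saved))
```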
Hashing is a technique useful for fast key lookup. It allows one to more efficiently find a value rather than scanning a list from beginning to end.
Have you ever used a dictionary or a set? They're typically implemented in terms of a hashtable because the value associated with a key can be found quickly.
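states = {
    'WA': 'Washington',
    'WV': 'West Virginia',
    'WY': 'Wyoming'
}
# The key is hashed to find its value directly, without scanning the other entries.
print(states['WV'])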