How are hashes used to look up entries in a database?

I'm reading Medium Staff's article on Bloom filters, and I found this snippet:
When you ask the database for a specific post, how does it know where to find it? A hash function. The database takes the ID, hashes it to a unique value, and uses that to jump straight to the record location, just like you’d use the index of a book to jump straight to a specific page number
How exactly does this work in practice? If each ID is hashed, shouldn't the lookup time using the hash be the same as using the actual ID?
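One way to picture the quoted claim is the rough sketch below. It is not how any particular database is implemented: `TinyStore`, its byte buffer, and the use of a Python `dict` as the hash index are all assumptions made for illustration. The point is that the ID by itself does not say where the record lives, while the hash index maps the ID to a location without scanning the other records.

```python
class TinyStore:
    """Toy record store: a byte buffer plus a hash index of record locations."""

    def __init__(self):
        self._data = bytearray()  # stand-in for the on-disk data file
        self._index = {}          # hash table: record ID -> (offset, length)

    def insert(self, record_id, payload: bytes) -> None:
        # Remember where the record starts and how long it is, then append it.
        self._index[record_id] = (len(self._data), len(payload))
        self._data += payload

    def location(self, record_id):
        # One hash lookup, no scan over the stored records.
        return self._index[record_id]

    def get(self, record_id) -> bytes:
        offset, length = self.location(record_id)
        return bytes(self._data[offset:offset + length])


store = TinyStore()
store.insert(101, b"first post")
store.insert(202, b"second post")
print(store.get(202))
```

Without the index, finding record 202 would mean reading records from the start until the right one turns up. With it, the cost is one hash computation plus one read at a known offset.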

Related

Compound indexes in MongoDB efficiency unclear

I'm looking for a structure to save user data for a Discord bot.
The context is that I need a unique save for a user for each Discord server (aka guild) he is on.
Therefore neither userID nor guildID should be unique, but I could use them as a compound index to quickly find users inside the users collection.
Is my train of thought correct until now?
My actual question is:
Which ID should be the first field the index is "sorted" by?
There are several hundred or thousand users per guild, but a single user is on only about 1-5 of the guilds the bot is on.
Therefore first searching by guildID would make the amount of data to search in by userID somewhat smaller.
But first searching for userID would make the amount of data to search in by guildID even smaller.
Since the DB will search both indexes completely anyway, step 1 will be similarly quick for both, so the second idea, filtering first by userID and then by guildID, seems more efficient to me.
I'd like to know if my assumption seems viable, and if not, why not.
Or if there would be a better way that I haven't thought of.
Thanks in advance!

Compound indexes worked fine.
The collection is still not big enough to see any difference between the two orderings, so I can't say which one is faster.
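For anyone wondering why the field order makes little difference for this exact-match query, here is a minimal sketch. It assumes a compound index behaves like a single sorted structure keyed by the tuple of indexed fields (MongoDB uses a B-tree; a sorted Python list with binary search stands in for it). `build_index` and `find` are hypothetical helpers, not MongoDB API.

```python
from bisect import bisect_left


def build_index(rows, fields):
    """Return a sorted list of (key_tuple, row_position) entries."""
    return sorted(
        (tuple(row[f] for f in fields), pos) for pos, row in enumerate(rows)
    )


def find(index, key):
    """Binary-search the index for an exact key tuple; return row positions."""
    # (key,) sorts just before every (key, pos) entry with the same key.
    i = bisect_left(index, (key,))
    matches = []
    while i < len(index) and index[i][0] == key:
        matches.append(index[i][1])
        i += 1
    return matches


rows = [
    {"userID": "u1", "guildID": "g1"},
    {"userID": "u1", "guildID": "g2"},
    {"userID": "u2", "guildID": "g1"},
    {"userID": "u3", "guildID": "g1"},
]

by_user_guild = build_index(rows, ("userID", "guildID"))
by_guild_user = build_index(rows, ("guildID", "userID"))

# Both orderings reach the same row with one binary search each.
print(find(by_user_guild, ("u2", "g1")))
print(find(by_guild_user, ("g1", "u2")))
```

Under this model the index is not searched "completely" in two steps. When both fields are given, either ordering is a single descent to the matching entry. The order starts to matter when a query supplies only one of the two fields, because only the leading field can be searched on its own.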

Are there any reasons why I should/shouldn't use ObjectIds in my RESTful URLs?

I'm using MongoDB for the first time in a RESTful service. Previously the id column in my SQL databases was an incrementing integer, so my RESTful endpoints would look something like /rest/objectType/1. Is there any reason why I shouldn't just use MongoDB's ObjectIds in the same role, or is it wiser to maintain a separate incrementing integer id column and use this for URLs?

Having used ObjectIds in RESTful APIs several times, the biggest downside is really that they are very noisy in terms of having a clean URL. You'll either leave it as a hex string or convert it to a very large integer, both of which make for a somewhat unfriendly URL:
/rest/resource/52435dbecb970072ec3a780f
/rest/resource/25459211534898951476729247759
I've added a "title" to the URL (like StackOverflow does) to make them slightly more friendly:
/rest/resource/52435dbecb970072ec3a780f/FriendlyResourceName
Of course, the "title" is ignored in software, but the user sees it and can mentally ignore the crazy ID segment.
There's very little useful information about the infrastructure that could be learned by exposing them:
Timestamp
Machine ID
Process ID
Random incrementing value
Other than potentially gathering Machine IDs (which generally would indicate the number of clients creating ObjectIds), there's not much there.
ObjectIds aren't random, so you couldn't use them for security. You'll always need to secure the data. While they may not increment in an obvious way, it would be easy to find other resources through brute force. However, if you were using auto-incrementing IDs before, this isn't a new problem for you.
If you know you aren't creating many new documents at any given time, it might be worth using one of the patterns here to create a simpler ID. In one app I wrote, I used an auto-inc technique for some of the document IDs that were shown in URLs, and for those that were Ajax-only, I used ObjectIds. I really wanted some URLs to be easily "typed". No form of an ObjectId is easily typed by an end user. That's one of the strengths of MongoDB -- that you can use any _id format you want. :)

It's wiser to use the ObjectIds, because keeping an incrementing counter can be a bottleneck. Also, since an ObjectId starts with a timestamp and is therefore roughly monotonic, ObjectIds can be helpful in optimizing queries.
The ObjectIds can be guessed, but since that is definitely true for incrementing IDs, I suspect you didn't rely on security through obscurity before, so that's no trouble for you.
A downside, albeit a small one, is that the creation time on your server leaks to the user, i.e. if the user is able to identify this as an ObjectId, she can reverse-engineer the creation time of the object. That's the only potential issue I see.
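To make the creation-time leak concrete, here is a small sketch. It assumes the standard ObjectId layout, in which the first 4 bytes (8 hex characters) are a big-endian Unix timestamp in seconds. `objectid_timestamp` is a hypothetical helper name, not part of any driver.

```python
from datetime import datetime, timezone


def objectid_timestamp(oid_hex: str) -> datetime:
    """Extract the creation time embedded in a 24-character hex ObjectId."""
    if len(oid_hex) != 24:
        raise ValueError("expected a 24-character hex string")
    # The first 8 hex characters are seconds since the Unix epoch.
    seconds = int(oid_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# The ID from the example URL above decodes to a date in September 2013.
print(objectid_timestamp("52435dbecb970072ec3a780f").isoformat())
```

Anyone who sees the URL can do the same, so if creation times are sensitive, a different _id format may be the better choice.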

Query for set complement in CouchDB

I'm not sure that there is a good way to do this with the facilities CouchDB provides, but I'd like to somehow extract the relative complement of the sets of two different document types over a particular key.
For example, let's say that I have documents representing users and posts, both of which have a username field (unique among users). There's a validation in place ensuring that a user document exists for the username in every post, but there may be any number of post documents with a given username, including none. It's trivial to create a view which counts the number of posts per username. The view can even include zero-counts by emitting zero post-counts for the user documents in the view map function. What I want to do though is retrieve just the list of users who have zero associated posts.
It's possible to build the view I described above and filter client-side for zero-value results, but in my actual situation the number of results could be very, very large, and the interesting results a relatively small proportion of the total. Is there a way to do this server-side and retrieve just the interesting results?

I would write a map function to iterate through the documents and emit the users (or just usernames) with 0 posts.
Then I would write a list function to iterate through the map function results and format them however you want (JSON, csv, etc).
(I would NOT use a reduce function to format the results, even if a reduce function appears to work OK in development. That is just my own experience from lessons learned the hard way.)

Personally I would filter on the client-side until I had performance issues. Next I would probably use Teddy's _filter technique—all pretty standard CouchDB stuff.
However, I stumbled across (IMO) an elegant way to find set complements. I described it when exploring how to find documents missing a field.
The basic idea
Finding non-members of your view obviously can't be done with a simple query (and a straightforward index scan). However, it can be done in constant memory and linear time by iterating through two query results simultaneously.
One query is for all possible document ids. The other query is for matching documents (those you don't want). Importantly, CouchDB sorts query results, therefore you can calculate the complement efficiently.
See my details in the previous question. The basic idea is that you iterate through both (sorted) lists simultaneously, and whenever you can say "hey, this document id is listed in the full set but it's missing in the sub-set", that is a hit.
(You don't have to query _all_docs, you just need two queries to CouchDB: one returning all possible values, and the other returning values not to be counted.)
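The two-list walk described above can be sketched roughly as follows. This is client-side code, not a CouchDB API: `sorted_complement` is a hypothetical helper, and it assumes both inputs arrive sorted in an order that matches Python's comparison for the key type, which may not hold for CouchDB's collation of arbitrary strings.

```python
def sorted_complement(all_keys, excluded_keys):
    """Yield keys from all_keys that are absent from excluded_keys.

    Both inputs must be sorted ascending. Each input is consumed once,
    so the walk is linear time and uses constant extra memory.
    """
    excluded = iter(excluded_keys)
    current = next(excluded, None)
    for key in all_keys:
        # Skip excluded keys that sort before the key under inspection.
        while current is not None and current < key:
            current = next(excluded, None)
        if current is None or current != key:
            yield key  # in the full set but missing from the subset: a hit


all_users = ["alice", "bob", "carol", "dave"]
users_with_posts = ["bob", "dave"]
print(list(sorted_complement(all_users, users_with_posts)))
```

Because the function is a generator over two iterables, it also works on streamed query results without loading either list fully into memory.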

How Do Hashes Work in Programming?

How do hashes work in programming? How I think of a hash is something that allows me the ability to use some unique value to retrieve some data. Like if we have an array and I start to put things in the array, if I have another variable that keeps track of what item is in slot 0,1,2... then I have that instant ability to find an item. Is that hashing?
What is the purpose of a hash?
When should a hash be implemented? What's a hash similar to in terms of data structure?
What I think I know about hashes is that it allows us the ability to retrieve the item within O(1). Is that correct?

A hash is like a person's first name -- it's a short way of remembering a person, even though it doesn't have to be unique. If you need to find some information about someone, you can just look them up by their name, and you only need to perform other checks if two or more people have the same name.
That's the power of hashing, and just as remembering people is much easier by name than by Social Security Number, finding an object by its hash code is much easier than actually comparing the object to everything already in your collection.
Now, in this example, if you're looking someone up in a phone book by name, you'd probably find them in O(log n) time, because the names are sorted alphabetically, and because you need to do a binary search. If, however, you instead "hashed" 100 people born in the 1900s by their years of birth, then you'd only need at most 4 comparisons in the hashtable/phonebook (one per digit) to find any one year by hash, which is constant time. Then, if two people are born in the same year, you can use other information to find the person you need, and on average, if your table isn't too full (say, if you have at most 50 people for 100 different years of birth), your lookups will be constant-time.
(If your table gets more than, say, 50% full, you can always double its size, to keep the number of collisions low and hence to keep your lookups fast.)
More information:
If you've ever heard of MD5, SHA-1, or SHA-2 hashes for files, they're like the "fingerprints" of the file. While it's possible to have two files with the same hash, this is made so unlikely that, for practical purposes, it's impossible; hence, if you have the hashes of two files, you can compare the files by their fingerprints rather than by their data, which is immensely faster.
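As a small illustration of the fingerprint idea, the sketch below uses SHA-256 (a member of the SHA-2 family) from Python's standard `hashlib` module. The `fingerprint` helper and the in-memory byte strings standing in for files are assumptions for the example.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of data as a 64-character hex string."""
    return hashlib.sha256(data).hexdigest()


original = b"hello world" * 1000
copy = bytes(original)
edited = original[:-1] + b"!"  # same length, last byte changed

print(fingerprint(original) == fingerprint(copy))    # True
print(fingerprint(original) == fingerprint(edited))  # False
```

Comparing two stored 64-character digests is cheap no matter how large the underlying files are, although computing each digest the first time still requires reading the whole file.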

A hash map / dictionary is a key/value data structure that stores objects in buckets based on the value of a hash function. These keys must be unique but the hash function values (sometimes called hashcodes) aren't necessarily unique.
Like if we have an array and I start to put things in the array, if I have another variable that keeps track of what item is in slot 0,1,2... then I have that instant ability to find an item. Is that hashing?
No. A hash function is a deterministic function that always gives the same value for an object. The hash code does not change depending on where the object is stored.
What I think I know about hashes is that it allows us the ability to retrieve the item within O(1). Is that correct?
Nearly. A dictionary has O(1) complexity for lookups if there are not too many hash code collisions. However if the hash function is poor and every object has the same hash value then a dictionary could have O(n) performance instead.
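To illustrate the bucket idea and the effect of a poor hash function, here is a minimal sketch of a hash map with separate chaining. `ChainedHashMap` is a toy written for this answer, not how any particular library implements its dictionary. It never resizes, and the constant "bad" hash function is a deliberately extreme assumption.

```python
class ChainedHashMap:
    """Toy hash map: a fixed array of buckets, each a list of (key, value)."""

    def __init__(self, bucket_count=8, hash_fn=hash):
        self._buckets = [[] for _ in range(bucket_count)]
        self._hash_fn = hash_fn

    def _bucket(self, key):
        # The hash code picks the bucket; equal keys always land together.
        return self._buckets[self._hash_fn(key) % len(self._buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # keys are unique: overwrite
                return
        bucket.append((key, value))

    def get(self, key):
        # Only the entries in one bucket are compared against the key.
        for existing_key, value in self._bucket(key):
            if existing_key == key:
                return value
        raise KeyError(key)

    def longest_chain(self):
        return max(len(bucket) for bucket in self._buckets)


good = ChainedHashMap(bucket_count=64)
bad = ChainedHashMap(bucket_count=64, hash_fn=lambda key: 0)  # all collide
for n in range(100):
    good.put(f"user{n}", n)
    bad.put(f"user{n}", n)

print(good.longest_chain(), bad.longest_chain())
```

With the constant hash every key ends up in the same bucket, so `get` degrades to a linear scan over all 100 entries. With the built-in `hash` the entries spread across the 64 buckets and each lookup compares against only a few keys.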

A hash makes lookups fast compared with iterating over an array or tree. It makes it possible to search in O(1) time on average with little extra memory.

What's the purpose in hashing information?

After being taught how to create a hash table in class, I don't understand when hashing data would be useful. It seems to me that all hashing does is storing information in semi-random positions in an array. I want to know how any of the data can be made useful after it's stored.
My question is this: what are some examples where hashing information is beneficial? How is data retrieved in any organized manner? It seems to be placed in arbitrary positions where it would be difficult to retrieve.

Hashing can be used for many purposes:
It can be used to compare large amounts of data. You create the hashes for the data, store the hashes and later if you want to compare the data, you just compare the hashes.
Hashes can be used to index data. They can be used in hash tables to point to the correct row. If you want to quickly find a record, you calculate the hash of the data and directly go to the record where the corresponding hash record is pointing to. (This assumes that you have a sorted list of hashes that point to the actual records)
They can be used in cryptographic applications like digital signatures.
Hashing can be used to generate seemingly random strings.
Here are the applications of hash functions that Wikipedia lists:
Finding duplicate records
Finding similar records
Finding similar substrings
Geometric hashing
Now, regarding hash tables, here are some points to note:
If you are using a hash table, the hashes in the table should be in a sorted manner. If not, you will have to create an index on the hash column. Some implementations store the hash separately in a sorted manner and point to the original record.
If somebody is storing hashes in a semi-random order, it must be either because of the above reasons or because they just want to store the message digest of the information for comparing, finding duplicates etc. and not as an index to the data.

One of the major uses of the hash tables you created in class is when you need fast O(1) lookup times. You'll have two components: the keys and the values.
The hash function transforms the key into a hash. That hash is a number, and specifically, it is the index of the data in the array.
So, when you need to look up Agscala's reputation in a hash table and the username was used as the key, it takes almost no time to locate the relevant value. The table simply re-hashes the username and voilà, there is the index of the data you were looking for. You didn't have to iterate over the entire array looking for that specific value.
For some reference, the Wikipedia page on hash tables is pretty good.
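The key-to-index step can be sketched as below. The multiply-by-31 polynomial hash is just one simple choice made for illustration, and the table size and reputation value are made up. Collision handling is left out on purpose, so this shows only how a key becomes an array index.

```python
def string_hash(key: str) -> int:
    """Simple polynomial hash: deterministic, same key gives same number."""
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % 2**32
    return h


def slot_for(key: str, table_size: int) -> int:
    """Reduce the hash to a valid index into an array of table_size slots."""
    return string_hash(key) % table_size


TABLE_SIZE = 16
table = [None] * TABLE_SIZE

# Store: hash the username to find the slot (1234 is a hypothetical value).
table[slot_for("Agscala", TABLE_SIZE)] = ("Agscala", 1234)

# Look up: re-hash the same username and jump straight to the same slot.
username, reputation = table[slot_for("Agscala", TABLE_SIZE)]
print(username, reputation)
```

Because the function is deterministic, storing and looking up always agree on the slot, and no other slot has to be inspected.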

There are a couple of typical reasons to hash data. In the example you reference, you would hash the data and use that as a key to extract the actual value of the hashed item. The hashed data is often referred to as a key and it references a bucket where the actual, non-hashed value can be found.
The other typical reason is to create a signature of the hashed value so that you can check if the value has been changed by someone else. Since it's usually rare, depending on the algorithm used, to have two items hash to the same value, you can rehash a value and compare it to the saved hash value to check if the item is still the same.

Hashing is a technique useful for fast key lookup. It allows one to more efficiently find a value rather than scanning a list from beginning to end.

Have you ever used a dictionary or a set? They're typically implemented in terms of a hashtable because the value associated with a key can be found quickly.
states = {
    'WA': 'Washington',
    'WV': 'West Virginia',
    'WY': 'Wyoming'
}
print(states['WV'])  # West Virginia