A hash table has a size of 11 and data is fitted in positions {3,5,7,9,6}. How many comparisons have to be made in the worst case if the data is not found in the list: 2, 11, or 6?
That depends a lot on how the terminology's intended and the hash table implementation...
"a hash table of size 11" might mean 11 elements (especially in C++ where .size() is the function for finding out how many elements are in the table), or 11 buckets, but in this case probably means the latter;
"data is fitted" might be intended to mean there's a single datum (the singular of data) in each of those bucket positions, or it might mean there could be one or more (but if the latter's intended and "size" isn't the number of buckets, there's no answer).
"how many comparisons" normally means key-vs-key comparisons, but the implementation will need to perform comparisons of bucket content against no-[more]-value sentinels too
the implementation might handle collisions using a list of colliding keys per bucket, or it might rehash or cascade/jump through some set of offsets (a displacement list) from the hashed-to-bucket
So, if we assume 11 buckets, that there's only one data item at each bucket position {3,5,7,9,6} and key-on-key comparisons, and nominally lists of colliding elements per bucket, then the worst that can happen is a comparison against a bucket with an existing entry, which means a key-on-key comparison then a key-vs-sentinel comparison. That could satisfy the "2" answer.
If the system uses displacement lists or rehashing, there's generally no protection against visiting the same bucket several times (it's just statistically unlikely), so there is no "worst case" limit short of the implementation giving up.
In the specific, dead-simple case where the displacement just increments to try the next bucket, the lookup might give up after visiting a string of populated buckets and finding the first empty one. Given {3,5,7,9,6}, the worst case is that it compares keys at 5, then 6, then 7, then compares against a sentinel at 8, for 3 or 4 total comparisons depending on how you count it.
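As a rough check of that count, here is a minimal sketch that assumes 11 buckets, one key in each of the listed positions, and plain "try the next bucket" probing with wrap-around. The names `TABLE_SIZE`, `OCCUPIED` and `probes_for_miss` are made up for this illustration.

```python
TABLE_SIZE = 11
OCCUPIED = {3, 5, 6, 7, 9}  # bucket positions from the question


def probes_for_miss(start):
    """Count key-vs-key comparisons for an unsuccessful lookup that hashes
    to `start` and then probes start, start+1, ... (mod TABLE_SIZE)."""
    key_comparisons = 0
    pos = start
    while pos in OCCUPIED:          # populated bucket: compare keys, move on
        key_comparisons += 1
        pos = (pos + 1) % TABLE_SIZE
    return key_comparisons          # loop ends at the first empty bucket


# Worst starting bucket and its cost; the final empty-bucket (sentinel)
# check adds one more comparison if you choose to count it.
worst_start = max(range(TABLE_SIZE), key=probes_for_miss)
worst_keys = probes_for_miss(worst_start)
print(worst_start, worst_keys, worst_keys + 1)  # 5 3 4
```

Under these assumptions the longest run of populated buckets is 5, 6, 7, which gives the 3 (or 4, counting the sentinel) mentioned above.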
If the size indicates the number of elements, such that 11 are spread across {3,5,7,9,6} which would only be possible if any individual bucket maintained a list of colliding keys, then if 4 of those buckets had a single key, the remaining bucket might have 11-4 = 7, so there could be 7 key-on-key comparisons before failure.
It's very hard to imagine any reasonable interpretation of the question yielding 6 or 11 comparisons, by any counting method.
If someone's asked you this in an academic setting, there might be some specific implementation and notational conventions you can apply when considering the question. If such context is missing, whoever's asking is either very lazy or not very well versed in even the basics of hash tables.
Our current PostgreSQL database is using GUIDs as primary keys and storing them as a Text field.
My initial reaction to this is that trying to perform any kind of minimal cartesian join would be a nightmare of indexing trying to find all the matching records. However, perhaps my limited understanding of database indexing is wrong here.
I'm thinking that we should be using UUID as these are stored as a binary representation of the GUID where a Text is not and the amount of indexing that you get on a Text column is minimal.
It would be a significant project to change these, and I'm wondering if it would be worth it?
When dealing with UUID numbers store them as data type uuid. Always. There is simply no good reason to even consider text as alternative. Input and output is done via text representation by default anyway. The cast is very cheap.
The data type text requires more space in RAM and on disk, is slower to process and more error prone. @khampson's answer provides most of the rationale. Oddly, he doesn't seem to arrive at the same conclusion.
This has all been asked and answered and discussed before. Related questions on dba.SE with detailed explanation:
Would index lookup be noticeably faster with char vs varchar when all values are 36 chars
What is the optimal data type for an MD5 field?
bigint?
Maybe you don't need UUIDs (GUIDs) at all. Consider bigint instead. It only occupies 8 bytes and is faster in every respect. Its range is often underestimated:
-9223372036854775808 to +9223372036854775807
That's 9.2 millions of millions of millions positive numbers. IOW, nine quintillion two hundred twenty-three quadrillion three hundred seventy-two trillion thirty-six something billion.
If you burn 1 million IDs per second (which is an insanely high number) you can keep doing so for 292471 years. And then another 292471 years for negative numbers. "Tens or hundreds of millions" is not even close.
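The arithmetic behind that figure can be reproduced in a few lines. This is only a back-of-the-envelope sketch; it assumes a 365-day year, and the constant names are invented for the example.

```python
BIGINT_MAX = 2**63 - 1              # upper bound of a signed 8-byte integer
IDS_PER_SECOND = 1_000_000          # the "insanely high" burn rate
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Whole years until the positive half of the range is used up.
years = BIGINT_MAX // IDS_PER_SECOND // SECONDS_PER_YEAR

print(f"{BIGINT_MAX:,}")  # 9,223,372,036,854,775,807
print(years)              # 292471
```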
UUID is really just for distributed systems and other special cases.
As @Kevin mentioned, the only way to know for sure with your exact data would be to compare and contrast both methods, but from what you've described, I don't see why this would be different from any other case where a string was either the primary key in a table or part of a unique index.
What can be said up front is that your indexes will probably be larger, since they have to store larger string values, and in theory the comparisons for the index will take a bit longer, but I wouldn't advocate premature optimization if to do so would be painful.
In my experience, I have seen very good performance on a unique index using md5sums on a table with billions of rows. I have found it tends to be other factors about a query which tend to result in performance issues. For example, when you end up needing to query over a very large swath of the table, say hundreds of thousands of rows, a sequential scan ends up being the better choice, so that's what the query planner chooses, and it can take much longer.
There are other mitigating strategies for that type of situation, such as chunking the query and then UNIONing the results (e.g. a manual simulation of the sort of thing that would be done in Hive or Impala in the Hadoop sphere).
Re: your concern about indexing of text, while I'm sure there are some cases where a dataset produces a key distribution such that it performs terribly, GUIDs, much like md5sums, sha1's, etc. should index quite well in general and not require sequential scans (unless, as I mentioned above, you query a huge swath of the table).
One of the big factors about how an index would perform is how many unique values there are. For that reason, a boolean index on a table with a large number of rows isn't likely to help, since it basically is going to end up having a huge number of row collisions for any of the values (true, false, and potentially NULL) in the index. A GUID index, on the other hand, is likely to have a huge number of values with no collision (in theory definitionally, since they are GUIDs).
Edit in response to comment from OP:
So are you saying that a UUID guid is the same thing as a Text guid as far as the indexing goes? Our entire table structure is using Text fields with a guid-like string, but I'm not sure Postgre recognizes it as a Guid. Just a string that happens to be unique.
Not literally the same, no. However, I am saying that they should have very similar performance for this particular case, and I don't see why optimizing up front is worth doing, especially given that you say to do so would be a very involved task.
You can always change things later if, in your specific environment, you run into performance problems. However, as I mentioned earlier, I think if you hit that scenario, there are other things that would likely yield better performance than changing the PK data types.
A UUID is a 128-bit data type (so, 16 bytes), whereas text has 1 or 4 bytes of overhead plus the actual length of the string. For a GUID, that would mean a minimum of 33 bytes, but could vary significantly depending on the encoding used.
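To make the size difference concrete, the sketch below uses Python's standard `uuid` module rather than Postgres itself, so it only measures the raw value lengths; the 1-byte text header mentioned above comes on top of the string lengths.

```python
import uuid

u = uuid.uuid4()

binary_size = len(u.bytes)     # raw 128-bit value
hyphenated_size = len(str(u))  # canonical text form, with hyphens
bare_hex_size = len(u.hex)     # text form with the hyphens stripped

print(binary_size, hyphenated_size, bare_hex_size)  # 16 36 32
```

So a text column holds at least 32 characters (33 bytes with the short header) where the binary form needs 16 bytes.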
So, with that in mind, certainly indexes of text-based UUIDs will be larger since the values are larger, and comparing two strings versus two numerical values is in theory less efficient, but is not something that's likely to make a huge difference in this case, at least not usual cases.
I would not optimize up front when to do so would be a significant cost and is likely to never be needed. That bridge can be crossed if that time does come (although I would pursue other query optimizations first, as I mentioned above).
Regarding whether Postgres knows the string is a GUID, it definitely does not by default. As far as it's concerned, it's just a unique string. But that should be fine for most cases, e.g. matching rows and such. If you find yourself needing some behavior that specifically requires a GUID (for example, some non-equality based comparisons where a GUID comparison may differ from a purely lexical one), then you can always cast the string to a UUID, and Postgres will treat the value as such during that query.
e.g. for a text column foo, you can do foo::uuid to cast it to a uuid.
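One place where string and UUID semantics differ is formatting: two spellings of the same GUID are different strings but the same UUID. The sketch below illustrates this with Python's `uuid` module as a stand-in (it is not Postgres, although the Postgres uuid type likewise accepts upper-case digits and surrounding braces on input); the sample value is arbitrary.

```python
import uuid

as_stored_1 = "A0EEBC99-9C0B-4EF8-BB6D-6BB9BD380A11"
as_stored_2 = "{a0eebc99-9c0b-4ef8-bb6d-6bb9bd380a11}"

# Compared as plain text, these are two distinct keys.
same_as_text = as_stored_1 == as_stored_2

# Parsed as UUIDs, they are the same 128-bit value.
same_as_uuid = uuid.UUID(as_stored_1) == uuid.UUID(as_stored_2)

print(same_as_text, same_as_uuid)  # False True
```

If every writer already emits one consistent spelling, a unique text index behaves the same as a uuid one for equality lookups.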
There's also a module available for generating uuids, uuid-ossp.
One of the objectives of DHT is to partition the keyspace, so each node (or group of them) has a share of it. To do so, it hashes the filename of a file that wants to be saved and stores it in the node responsible for this part of the network. But, why does it have to hash the filename? Couldn't it just work like a dictionary, so instead of having a node hold hash values between 0000 and 0a2d, it would hold filename values between C and E?
But, why does it have to hash the filename?
It doesn't have to be a filename. It can hash other things too. E.g. file contents. Or metadata. Or cryptographic keys used as identities of users in the network.
Couldn't it just work like a dictionary, so instead of having a node hold hash values between 0000 and 0a2d, it would hold filename values between C and E?
Because filenames are not uniformly distributed throughout the possible keyspace (how often do you see filenames starting with some exotic unicode character?) and their entropy is spread over a variable length, leading to even more clustering at the top level.
If you were to index all existing unix filesystems in the world you would have massive clustering around the /etc/... prefix for example.
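A small experiment shows the clustering. This is a sketch under made-up assumptions: 16 nodes, a "dictionary" scheme that partitions on the first byte of the name, and a hash scheme that partitions on the first hex digit of SHA-1. The file names are invented.

```python
import hashlib

NODES = 16
filenames = [f"/etc/app{i}.conf" for i in range(50)]


def node_by_name(name):
    # Dictionary-style partitioning: slice the keyspace by the first byte.
    return name.encode()[0] * NODES // 256


def node_by_hash(name):
    # Hash partitioning: the first hex digit of SHA-1 selects 1 of 16 slices.
    return int(hashlib.sha1(name.encode()).hexdigest()[0], 16)


name_nodes = {node_by_name(n) for n in filenames}
hash_nodes = {node_by_hash(n) for n in filenames}

# Every /etc/... name lands on one node; the hashed keys spread out.
print(len(name_nodes), len(hash_nodes))
```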
There are other p2p network overlays that can deal with heavy clustering in the keyspace, often by rearranging the nodes around the hotspots to increase network capacity in regions of the affected keyspace, e.g. based on levenshtein distance, but they generally aren't distributed hash tables because they do not employ hashing.
Because searches are done on numbers.
When you hash a file, you end up with a number, and that number will be allocated in the nearest K-buckets of the nearest K-peers.
Names are irrelevant; you're performing XOR searches on numeric spaces, so that you always search half of the space on every hop.
Once you find a peer that has the bucket pointed to by the hash, you can communicate with that peer and exchange related information.
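The XOR search can be sketched as follows. This is a toy with 4-bit ids, not libtorrent's or any real implementation; `xor_distance` and `closest_nodes` are names chosen for the example.

```python
def xor_distance(a, b):
    """Kademlia-style distance: the bitwise XOR of two ids, read as a number."""
    return a ^ b


def closest_nodes(key, node_ids, k):
    """Return the k node ids with the smallest XOR distance to key."""
    return sorted(node_ids, key=lambda n: xor_distance(n, key))[:k]


node_ids = [0b0001, 0b0100, 0b1100, 0b1111]
key = 0b1101  # the hashed value being looked up

print(closest_nodes(key, node_ids, 2))  # [12, 15]
```

Nodes sharing a longer id prefix with the key come out closer, which is what lets each hop discard about half of the remaining id space.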
A DHT, like libtorrent's kademlia implementation has to be seen more of a distributed routing data structure. The problem you're solving is how do I find a number among billions of numbers, how do I find a peer among millions in the least amounts of hops possible, and the answer is that every node on the network has to follow a set of simple rules as to how to organize the numbers they're storing, and the peers that they know about.
I recommend you read these notes on how a real DHT actually works.
https://gist.github.com/gubatron/cd9cfa66839e18e49846
Also, storing a number takes a lot less space than storing a word.
If you know the word, you can hash the word and search for the hash.
Yes, it could work like a dictionary. However, it would be missing some desirable (for the typical DHT use case) emergent properties that come from using a hash.
One property that hashing (along with XOR distance metric) gives you is an even distribution of content amongst all the nodes participating in a DHT. "Even" here is caveated by how the k-bucket data structure works, but in aggregate you get nodes evenly distributing data amongst the DHT peers, at least in theory. In practice, you can get hotspots.
Another property of using a hash is if you're looking for a file with specific contents. So, if you use hashes of the file contents as the identifiers, you can be... statistically sure (the guarantee comes from your hash function collision properties) that you're getting the contents you're looking for. Relying on a filename introduces a level of indirection that can serve different contents for the same file. Depending on your use case, that's acceptable or not.
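A minimal sketch of that content-addressing idea, assuming SHA-1 as the hash; `content_id` and `verify` are illustrative names, not part of any DHT library.

```python
import hashlib


def content_id(data: bytes) -> str:
    """Identifier derived from the bytes themselves, not from a file name."""
    return hashlib.sha1(data).hexdigest()


def verify(expected_id: str, data: bytes) -> bool:
    """Check that data fetched from a peer matches the id that was requested."""
    return content_id(data) == expected_id


original = b"hello, dht"
key = content_id(original)

print(verify(key, original))        # True
print(verify(key, b"tampered"))     # False
```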
I've considered what you're proposing before as a prefix to a SHA-1 hash. So, something like node1-cd9cf... (the prefix could be anything really, doesn't need to be human readable). This would ensure that all the things with that prefix end up pretty much on a node that identifies itself with an id starting with "node1-". But, you'd have to have a DHT implementation (including k-bucket implementation) that supports variable length ids. In this case, you're guaranteeing a hotspot. It's an equivalent of artificially ensuring that things are "close together" as in the difference between them in the XOR metric is very small. Why would anyone want to do this? For example: com.example.www-cd9cf... combined with some crypto could ensure that while you're participating in a DHT, the data is stored on your servers. I haven't seen this implemented before though.
How do I enforce a unique constraint in Key-Value store where the unique data is longer than the key length limit?
I currently use CouchBase to store the document below:
{
  url: "http://google.com",
  siteName: "google.com",
  data:
  {
    //more properties
  }
}
Unique constraint is defined at url + siteName. I however can't use those properties as the key since the length can be longer than the key length limit of CouchBase.
I currently have two solutions in mind but I think that both are not good enough.
Solution 1
Document key is the SHA1 hash of url + siteName.
Advantages: easy to implement
Disadvantages: collisions can occur
Solution 2
Document key is the hash(url + siteName) + index.
This is same as Solution 1 but key includes index in-case a collision occurs.
To add a document, the application server:
1. Set index to 0.
2. Store the document with the key = hash(url + siteName) + index.
3. If a duplicate key conflict occurred, read the existing document back.
4. Does the existing document have the same url and siteName as the one we are storing?
   - If yes, throw an exception if duplicates aren't allowed.
   - If no, increment index and go back to step 2.
This is currently my favorite solution because it can handle collisions
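The steps of Solution 2 could be sketched like this. A plain dict stands in for the key-value store, so this is not Couchbase API code; a real store would need an atomic insert-if-absent operation in place of the `get`/assign pair. `add_document`, `DuplicateError` and the `:` separator in the key are assumptions made for the example.

```python
import hashlib


class DuplicateError(Exception):
    """Raised when the same url + siteName pair is already stored."""


def sha1_hex(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def add_document(store, doc, hash_fn=sha1_hex):
    base = hash_fn(doc["url"] + doc["siteName"])
    index = 0                                   # step 1
    while True:
        key = f"{base}:{index}"
        existing = store.get(key)
        if existing is None:                    # step 2: slot free, store it
            store[key] = doc
            return key
        # steps 3-4: key taken, compare the unique fields
        if (existing["url"], existing["siteName"]) == (doc["url"], doc["siteName"]):
            raise DuplicateError(key)
        index += 1                              # genuine collision: next slot


# Force collisions with a constant "hash" to exercise the index path.
demo_store = {}
first = add_document(demo_store, {"url": "http://a", "siteName": "a"}, lambda t: "x")
second = add_document(demo_store, {"url": "http://b", "siteName": "b"}, lambda t: "x")
print(first, second)  # x:0 x:1
```

Note that a lookup has to walk the same index sequence, so deleting a document in the middle of a chain would need extra care.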
I am a NoSQL n00b! How can I enforce unique constraints in a Key-Value store?
After reading your question, here are my thoughts/opinions, which I think should help give rationale for choosing your first option.
Couchbase is an in-memory cache/dictionary. To store many (read "very large incomprehensible number") values, it requires both RAM and disk space. Regardless of how much space each document occupies, all of the document keys are stored in RAM. If you were therefore permitted to store an arbitrarily large value for the key, your server farm would consume RAM faster than you could supply it, and your design would fall apart.
With item #1 being the case, your application needs to be designed such that key sizes are as small as practicable. Dictionary key/hash value computation is up to application API (in the same way that this is left to the .Net or Java API - which likewise compute hashes on the string inputs). The same method to produce a hash should be used regardless of input, for the sake of consistency.
The SHA1 hash has an extremely low collision probability, and it is designed that way to make "breaking" the hash computationally infeasible. This is the foundation behind the "fingerprint" in bitcoins.
Given what I know about hashes, and given the fact that URLs always start with the same set of characters, this theoretically lowers the likelihood of collision even further.
If you are, in fact, storing enough documents that the odds of a SHA1 collision are significant, then there are almost certainly at least a dozen other issues that will affect your application's usability and reliability in a more significant way, and you should devote your energy to thinking about those things.
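To get a feel for the scale involved, the usual birthday-bound approximation can be evaluated directly. This sketch assumes the hash output behaves like a uniformly random 160-bit value, which covers accidental collisions only; `collision_probability` is a name made up for the example.

```python
import math


def collision_probability(n_items, hash_bits):
    """Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * 2**bits))."""
    space = 2 ** hash_bits
    x = n_items * (n_items - 1) / (2 * space)
    return -math.expm1(-x)  # 1 - exp(-x), accurate for tiny x


# A trillion documents keyed by a 160-bit hash.
p = collision_probability(10**12, 160)
print(p)
```

Under that assumption the probability only becomes substantial once the number of stored items approaches 2**80.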
The hard part about being an engineer is recognizing the need to take a step back from the engineering and say when "good" is "good enough." That being said, option 1 looks like the best choice, it's simple and consistent. If properly applied, that's all you need. Check the box on this one and move on to your next issue.
I’d go for solution 1 however for choosing the hashing function you should consider the following things:
How much data do you have? That determines how large the generated hash should be in order to reduce the probability of collisions to a minimum. Here the best might be SHA-512, which has a 512-bit output, compared to the 160 bits of SHA-1.
What performance do you need from the hashing function? SHA-x are pretty slow compared to md5 and, depending on the number of items you want to store, md5 could be pretty good as well.
In the end you can also use a combination: use sitename+url as the key if it is short enough, switch to sitename+hash(url) if that combination is short enough, and otherwise hash both together.
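That fallback chain might look like the sketch below. The 250-byte limit, the `|` separator and the `make_key` name are assumptions for illustration; check the actual key limit of your store.

```python
import hashlib

MAX_KEY_BYTES = 250  # assumed key length limit


def sha1_hex(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def make_key(site_name, url, max_len=MAX_KEY_BYTES):
    # 1. Readable key if it fits.
    plain = f"{site_name}|{url}"
    if len(plain.encode("utf-8")) <= max_len:
        return plain
    # 2. Keep the site name readable, hash only the url.
    partly_hashed = f"{site_name}|{sha1_hex(url)}"
    if len(partly_hashed.encode("utf-8")) <= max_len:
        return partly_hashed
    # 3. Hash everything.
    return sha1_hex(plain)


print(make_key("google.com", "http://google.com"))  # google.com|http://google.com
```

The trade-off is that keys now come in three shapes, so every reader has to derive the key through the same function.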
on a related note I’ve found also this question http://www.couchbase.com/communities/q-and-a/key-size-limits-couchbasemembase-again where one answer suggests to compress the keys if it is possible for you.
You could actually use normal gzip compression and encode the text. I’m not sure how well this would work on your usecase, you’ll have to check it, but I used it for JSON files and managed to reduce it down to ~20% - however it was a huge 8MB file so the compression possibilities for your key might be much lower.
Considering that a UUID (RFC 4122, 16 bytes) is much larger than a MongoDB ObjectId (12 bytes), I am trying to find out how their collision probabilities compare.
I know that a collision is quite unlikely, but in my case most ids will be generated within a large number of mobile clients, not within a limited set of servers. I wonder if in this case, there is a justified concern.
Compared to the normal case where all ids are generated by a small number of clients:
It might take months to detect a collision since the document creation
IDs are generated from a much larger client base
Each client has a lower ID generation rate
in my case most ids will be generated within a large number of mobile clients, not within a limited set of servers. I wonder if in this case, there is a justified concern.
That sounds like very bad architecture to me. Are you using a two-tier architecture? Why would the mobile clients have direct access to the db? Do you really want to rely on network-based security?
Anyway, some deliberations about the collision probability:
Neither UUID nor ObjectId rely on their sheer size, i.e. both are not random numbers, but they follow a scheme that tries to systematically reduce collision probability. In case of ObjectIds, their structure is:
4 byte seconds since unix epoch
3 byte machine id
2 byte process id
3 byte counter
This means that, contrary to UUIDs, ObjectIds are monotonic (except within a single second), which is probably their most important property. Monotonic indexes will cause the B-Tree to be filled more efficiently, it allows paging by id and allows a 'default sort' by id to make your cursors stable, and of course, they carry an easy-to-extract timestamp. These are the optimizations you should be aware of, and they can be huge.
As you can see from the structure of the other 3 components, collisions become very likely if you're doing > 1k inserts/s on a single process (not really possible, not even from a server), or if the number of machines grows past about 10 (see birthday problem), or if the number of processes on a single machine grows too large (then again, those aren't random numbers, but they are truly unique on a machine, but they must be shortened to two bytes).
Naturally, for a collision to occur, they must match in all these aspects, so even if two machines have the same machine hash, it'd still require a client to insert with the same counter value in the exact same second and the same process id, but yes, these values could collide.
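To put rough numbers on the birthday-problem remark for the machine id alone, here is a sketch that assumes the 3-byte machine id behaves like a uniformly random value (real drivers derive it differently, so treat this as an idealization). `shared_id_probability` is a name invented for the example.

```python
def shared_id_probability(n_machines, id_bits=24):
    """Probability that at least two of n machines share the same id,
    assuming ids are drawn uniformly from 2**id_bits values."""
    space = 2 ** id_bits
    p_all_distinct = 1.0
    for i in range(n_machines):
        p_all_distinct *= (space - i) / space
    return 1.0 - p_all_distinct


for n in (10, 100, 1000, 4823):
    print(n, shared_id_probability(n))
```

A shared machine id is not yet an ObjectId collision; as noted above, the timestamp, process id and counter would have to match as well.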
Let's look at the spec for "ObjectId" from the documentation:
Overview
ObjectId is a 12-byte BSON type, constructed using:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
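The layout quoted above can be made tangible by splitting an id into its fields. This is a sketch following that 4/3/2/3 layout, with big-endian byte order assumed for the numeric fields; the sample id is an arbitrary 24-character hex string and `parse_object_id` is a made-up helper, not a driver function.

```python
def parse_object_id(hex_string):
    raw = bytes.fromhex(hex_string)
    if len(raw) != 12:
        raise ValueError("an ObjectId is 12 bytes (24 hex characters)")
    return {
        "timestamp": int.from_bytes(raw[0:4], "big"),   # seconds since epoch
        "machine": raw[4:7].hex(),                      # 3-byte machine id
        "process_id": int.from_bytes(raw[7:9], "big"),  # 2-byte process id
        "counter": int.from_bytes(raw[9:12], "big"),    # 3-byte counter
    }


parts = parse_object_id("507f1f77bcf86cd799439011")
print(parts)
```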
So let us consider this in the context of a "mobile client".
Note: The context here does not mean using a "direct" connection of the "mobile client" to the database. That should not be done. But the "_id" generation can be done quite simply.
So the points:
Value for the "seconds since epoch". That is going to be fairly random per request. So minimal collision impact just on that component. Albeit in "seconds".
The "machine identifier". So this is a different client generating the _id value. This is removing possibility of further "collision".
The "process id". So where that is accessible to seed ( and it should be ) then the generated _id has more chance of avoiding collision.
The "random value". So another "client" somehow managed to generate all of the same values as above and still managed to generate the same random value.
Bottom line is, if that is not a convincing enough argument to digest, then simply provide your own "uuid" entries as the "primary key" values.
But IMHO, that should be a fair convincing argument to consider that the collision aspects here are very broad. To say the least.
The full topic is probably just a little "too-broad". But I hope this moves consideration a bit more away from "Quite unlikely" and on to something a little more concrete.
Forgive me if this question is silly, but I'm starting to learn about consistent hashing, and after reading Tom White's blog post on it and realizing that most default hash functions are NOT well mixed, I had a thought on ensuring that an arbitrary hash function is minimally well-mixed.
My thought is best explained using an example like this:
Bucket 1: 11000110
Bucket 2: 11001110
Bucket 3: 11010110
Bucket 4: 11011110
Under a standard hash ring implementation for consistent caching across these buckets, you would get terrible performance, and nearly every entry would be lumped into Bucket 1. However, if we use bits 4&5 as the MSBs in each case then these buckets are suddenly excellently mixed, and assigning a new object to a cache becomes trivial and only requires examining 2 bits.
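The difference can be counted exhaustively for 8-bit hashes. This sketch assumes hash values are spread uniformly over 0..255 and that the ring rule is "first bucket at or after the hash, wrapping around"; `ring_owner` and `bit_owner` are names made up for the illustration.

```python
from bisect import bisect_left
from collections import Counter

BUCKETS = [0b11000110, 0b11001110, 0b11010110, 0b11011110]  # sorted ring positions


def ring_owner(h):
    """Classic ring: the first bucket at or clockwise after h, wrapping."""
    i = bisect_left(BUCKETS, h)
    return BUCKETS[i % len(BUCKETS)]


def bit_owner(h):
    """Use bits 4 and 5 (counted from the MSB) as a 2-bit bucket index."""
    return BUCKETS[(h >> 3) & 0b11]


ring_counts = Counter(ring_owner(h) for h in range(256))
bit_counts = Counter(bit_owner(h) for h in range(256))

print(ring_counts[BUCKETS[0]])      # 232 of 256 hashes land in Bucket 1
print(sorted(bit_counts.values()))  # [64, 64, 64, 64]
```

The bit-selection rule is balanced here only because the four ids happen to cover all values of those two bits; it does not give the ring's property that adding or removing a bucket moves only a small share of the keys.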
In my mind this concept could very easily be extended when building distributed networks across multiple nodes. In my particular case I would be using this to determine which cache to place a given piece of data into. The increased placement speed isn't a real concern, but ensuring that my caches are well-mixed is and I was considering just choosing a few bits that are optimally mixed for my given caches. Any information later indexed would be indexed on the basis of the same bits.
In my naive mind this is a much simpler solution than introducing virtual nodes or building a better hash function. That said, I can't see any mention of an approach like this and I'm concerned that in my hashing ignorance I'm doing something wrong here and I might be introducing unintended consequences.
Is this approach safe? Should I use it? Has this approach been used before and are there any established algorithms for determining the minimum unique group of bits?