For this application of mine, I feel like I can get away with a 40-bit hash key, which seems awfully low, but see if you can confirm my reasoning (I want a small key because the key will be converted to a filename, and I want a small filename):
(Note: only accidental collisions are a concern - no security issues.)
A key point here is that the population in question is divided into groups, and a collision is only relevant if it occurs within the same group. A "group" is a directory on a user's system (the contents of files are hashed, and a collision only matters for files within the same directory). So, speculating roughly 100,000 potential users, say 2^17, that corresponds to 2^18 "groups" assuming 2 directories per user on average. With a 40-bit key I can then expect 2^(20+9) files created (among all users) before a collision occurs for some user somewhere. (Or, in other words, 2^((40+18)/2), due to the "birthday effect".) That's an average of 4,096 unique files created per user, across 2^17 users, before a single collision occurs for some user somewhere. And then that long again before another collision occurs somewhere (right?)
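(For concreteness, here's a quick Python sanity check of that estimate, using the birthday approximation P ≈ n²/2^(k+1) per group and linearity of expectation over the groups; all the figures are just the assumptions above.)

    KEY_BITS = 40
    GROUPS = 2 ** 18           # ~2^17 users, ~2 directories each (assumed)
    FILES_PER_GROUP = 2 ** 11  # 2^29 total files spread over 2^18 groups

    # Birthday approximation: P(collision within one group) ~ n^2 / 2^(k+1)
    p_group = FILES_PER_GROUP ** 2 / 2 ** (KEY_BITS + 1)

    # Expected number of groups that see a collision (linearity of expectation)
    expected = GROUPS * p_group

    print(f"P(collision in one group)  ~ {p_group:.3e}")   # ~1.9e-06
    print(f"Expected colliding groups  ~ {expected:.2f}")  # ~0.5 at 2^29 total files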
Your math looks reasonable, but I'm left wondering why you'd bother with this at all. If you want to create unique file names, why not just assign a number to each user, and keep a serial number for that user. When you need a file name, basically just concatenate the user number with the serial number (both padded to the correct number of digits). If you feel that you need to obfuscate those numbers, run that result through a 40-bit encryption (which will guarantee that a unique input produces a unique output).
If, for example, you assign 20 bits to each, you can have 2^20 users create 2^20 documents apiece before there's any chance of a collision at all.
If you don't mind serialized access to it, you could just use a single 40-bit counter instead. The advantage of this is that a single user wouldn't immediately use up 2^20 serial numbers, even though the average user is unlikely to ever create nearly that many documents.
Again, if you think you need to obfuscate this number for some reason, you can use a 40-bit encryption algorithm in counter mode (i.e. use a serial number, but encrypt it) which (again) guarantees that each input maps to a unique output. This guarantees no collision until/unless your users create 2^40 documents (i.e., the maximum possible with only 40 bits). Alternatively, you could create a 40-bit full-range linear feedback shift register to create your pseudo-random 40-bit numbers. This might be marginally less secure, but has the advantage of being faster and simpler to implement.
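For illustration, here's one way the "encrypt a serial number" idea could look: a small balanced Feistel network over two 20-bit halves, which is a permutation of the 40-bit space, so distinct counters can never collide. This is a toy sketch for uniqueness/obfuscation only, not a vetted cipher, and the names are made up.

    import hashlib

    MASK20 = (1 << 20) - 1

    def _round(half: int, key: bytes, rnd: int) -> int:
        # Keyed pseudo-random round function returning 20 bits.
        digest = hashlib.sha256(key + bytes([rnd]) + half.to_bytes(3, "big")).digest()
        return int.from_bytes(digest[:3], "big") & MASK20

    def obfuscate40(counter: int, key: bytes, rounds: int = 4) -> int:
        """Map a 40-bit counter to a scrambled 40-bit value.
        A Feistel network is a bijection, so outputs are unique per counter."""
        left, right = (counter >> 20) & MASK20, counter & MASK20
        for rnd in range(rounds):
            left, right = right, left ^ _round(right, key, rnd)
        return (left << 20) | right

    key = b"app-secret"
    print([hex(obfuscate40(n, key)) for n in range(5)])  # unrelated-looking, all unique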
One of the objectives of a DHT is to partition the keyspace, so each node (or group of them) has a share of it. To do so, it hashes the filename of a file to be saved and stores it in the node responsible for that part of the network. But why does it have to hash the filename? Couldn't it just work like a dictionary, so that instead of having a node hold hash values between 0000 and 0a2d, it would hold filename values between C and E?
But, why does it have to hash the filename?
It doesn't have to be a filename. It can hash other things too. E.g. file contents. Or metadata. Or cryptographic keys used as identities of users in the network.
Couldn't it just work like a dictionary, so instead of having a node hold hash values between 0000 and 0a2d, it would hold filename values between C and E?
Because filenames are not uniformly distributed throughout the possible keyspace (how often do you see filenames starting with some exotic unicode character?) and their entropy is spread over a variable length, leading to even more clustering at the top level.
If you were to index all existing unix filesystems in the world you would have massive clustering around the /etc/... prefix for example.
There are other p2p network overlays that can deal with heavy clustering in the keyspace, often by rearranging the nodes around the hotspots to increase network capacity in the affected regions of the keyspace (e.g. based on Levenshtein distance), but they generally aren't distributed hash tables because they do not employ hashing.
Because searches are done on numbers.
When you hash a file, you end up with a number, and that number will be allocated in the nearest K-buckets of the nearest K-peers.
Names are irrelevant; you're performing XOR searches on numeric spaces, so that you always search half of the space on every hop.
Once you find a peer that has the bucket pointed to by the hash, you can communicate with that peer and exchange related information.
A DHT, like libtorrent's Kademlia implementation, has to be seen more as a distributed routing data structure. The problem you're solving is: how do I find a number among billions of numbers, how do I find a peer among millions, in the fewest hops possible? The answer is that every node on the network has to follow a set of simple rules for how it organizes the numbers it stores and the peers it knows about.
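As a rough illustration of that XOR search (not libtorrent's actual code, just the metric it describes, assuming SHA-1-sized 160-bit IDs):

    import hashlib

    def node_id(name: bytes) -> int:
        # Hash anything (file contents, peer identity) into a 160-bit number.
        return int.from_bytes(hashlib.sha1(name).digest(), "big")

    def xor_distance(a: int, b: int) -> int:
        # Kademlia's distance metric: just XOR the two numbers.
        return a ^ b

    target = node_id(b"some file contents")
    peers = [node_id(b"peer-%d" % i) for i in range(8)]  # made-up peers

    # Each lookup step asks the peers closest to the target, which is what
    # lets you discard half the remaining keyspace per hop.
    closest = sorted(peers, key=lambda p: xor_distance(p, target))[:3]
    print([hex(p)[:12] for p in closest])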
I recommend you read these notes on how a real DHT actually works.
https://gist.github.com/gubatron/cd9cfa66839e18e49846
Also, storing a number takes a lot less space than storing a word.
If you know the word, you can hash the word and search for the hash.
Yes, it could work like a dictionary. However, it would be missing some desirable (for the typical DHT use case) emergent properties that come from using a hash.
One property that hashing (along with the XOR distance metric) gives you is an even distribution of content amongst all the nodes participating in a DHT. "Even" here is caveated by how the k-bucket data structure works (see these k-bucket slides for an overview), but in aggregate you get nodes evenly distributing data amongst the DHT peers... in theory. In practice, you can get hotspots.
Another property of using a hash is if you're looking for a file with specific contents. So, if you use hashes of the file contents as the identifiers, you can be... statistically sure (the guarantee comes from your hash function collision properties) that you're getting the contents you're looking for. Relying on a filename introduces a level of indirection that can serve different contents for the same file. Depending on your use case, that's acceptable or not.
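A tiny sketch of that content-addressing property (a dict stands in for the DHT here, and SHA-1 is assumed as the hash):

    import hashlib

    def content_key(data: bytes) -> str:
        # The identifier commits to the bytes themselves, not to a filename.
        return hashlib.sha1(data).hexdigest()

    store = {}                      # stand-in for the DHT
    blob = b"hello, dht"
    store[content_key(blob)] = blob

    # On retrieval you can re-hash and be statistically sure you got the
    # contents you asked for - no filename indirection to trust.
    fetched = store[content_key(blob)]
    assert content_key(fetched) == content_key(blob)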
I've considered what you're proposing before as a prefix to a SHA-1 hash. So, something like node1-cd9cf... (the prefix could be anything really, doesn't need to be human readable). This would ensure that all the things with that prefix end up pretty much on a node that identifies itself with an id starting with "node1-". But, you'd have to have a DHT implementation (including k-bucket implementation) that supports variable length ids. In this case, you're guaranteeing a hotspot. It's an equivalent of artificially ensuring that things are "close together" as in the difference between them in the XOR metric is very small. Why would anyone want to do this? For example: com.example.www-cd9cf... combined with some crypto could ensure that while you're participating in a DHT, the data is stored on your servers. I haven't seen this implemented before though.
How do I enforce a unique constraint in Key-Value store where the unique data is longer than the key length limit?
I currently use CouchBase to store the document below:
    {
        url: "http://google.com",
        siteName: "google.com",
        data:
        {
            // more properties
        }
    }
The unique constraint is defined on url + siteName. However, I can't use those properties directly as the key, since their combined length can exceed Couchbase's key length limit.
I currently have two solutions in mind but I think that both are not good enough.
Solution 1
Document key is the SHA1 hash of url + siteName.
Advantages: easy to implement
Disadvantages: collisions can occur
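A minimal sketch of Solution 1 in Python (`client.upsert` is a placeholder for whatever your Couchbase SDK call actually is):

    import hashlib

    def doc_key(url: str, site_name: str) -> str:
        # Fixed-length 40-hex-character key, always within the key length limit.
        return hashlib.sha1((url + site_name).encode("utf-8")).hexdigest()

    key = doc_key("http://google.com", "google.com")
    # client.upsert(key, document)   # placeholder for the actual SDK call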
Solution 2
Document key is the hash(url + siteName) + index.
This is the same as Solution 1, but the key includes an index in case a collision occurs.
To add a document, the application server:
1. Set index to 0.
2. Store the document with key = hash(url + siteName) + index.
3. If a duplicate-key conflict occurs, read the existing document back.
4. Does the existing document have the same url and siteName as the one we are storing?
5. If yes, throw an exception if duplicates aren't allowed.
6. If no, increment index and go back to step 2.
This is currently my favorite solution because it can handle collisions.
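A rough Python sketch of that loop; `add_if_absent` and `get` are placeholders for the store's insert-if-missing and read operations, not actual Couchbase SDK method names:

    import hashlib

    def insert_unique(client, url: str, site_name: str, document: dict) -> str:
        base = hashlib.sha1((url + site_name).encode("utf-8")).hexdigest()
        index = 0
        while True:
            key = f"{base}:{index}"
            if client.add_if_absent(key, document):      # step 2: try to store
                return key
            existing = client.get(key)                   # step 3: read it back
            if existing["url"] == url and existing["siteName"] == site_name:
                raise ValueError("duplicate url + siteName")   # step 5
            index += 1                                   # step 6: true collision, probe on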
I am a NoSQL n00b! How can I enforce unique constraints in a key-value store?
After reading your question, here are my thoughts/opinions, which I think should help give rationale for choosing your first option.
Couchbase is an in-memory cache/dictionary. To store many (read "very large incomprehensible number") values, it requires both RAM and disk space. Regardless of how much space each document occupies, all of the document keys are stored in RAM. If you were therefore permitted to store an arbitrarily large value for the key, your server farm would consume RAM faster than you could supply it, and your design would fall apart.
With item #1 being the case, your application needs to be designed so that key sizes are as small as practicable. Dictionary key/hash value computation is left up to the application (in the same way that the .NET or Java dictionary APIs compute hashes of their string inputs). The same method of producing a hash should be used regardless of input, for the sake of consistency.
The SHA1 hash has an extremely low collision probability, and it is designed that way to make "breaking" the hash computationally infeasible. This is the foundation behind the "fingerprints" used in Bitcoin. See here and here for tasty reading on the topic.
Given what I know about hashes, and given the fact that URLs always start with the same set of characters, this theoretically lowers the likelihood of collision even further.
If you are, in fact, storing enough documents that the odds of a SHA1 collision are significant, then there are almost certainly at least a dozen other issues that will affect your application's usability and reliability in a more significant way, and you should devote your energy to thinking about those things.
The hard part about being an engineer is recognizing the need to take a step back from the engineering and say when "good" is "good enough." That being said, option 1 looks like the best choice: it's simple and consistent. If properly applied, that's all you need. Check the box on this one and move on to your next issue.
I'd go for Solution 1; however, when choosing the hash function you should consider the following:
How much data do you have? => How large should the generated hash be in order to keep the probability of collisions to a minimum? Here the best might be SHA-512, which produces a 512-bit hash, compared to the 160 bits of SHA-1.
What performance do you need from the hash function? The SHA-x functions are pretty slow compared to MD5, and depending on the number of items you want to store, MD5 could be good enough as well.
In the end you can also use a combination: use siteName + url as the key if it is short enough, switch to siteName + hash(url) if that combination is short enough, and only hash both together as a last resort.
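A sketch of that tiered approach (the 250-byte limit is an assumption; substitute your store's actual key length limit):

    import hashlib

    MAX_KEY_LEN = 250  # assumed limit; check your store's documentation

    def make_key(site_name: str, url: str) -> str:
        plain = f"{site_name}::{url}"                 # most readable option
        if len(plain.encode("utf-8")) <= MAX_KEY_LEN:
            return plain
        partial = f"{site_name}::{hashlib.sha512(url.encode('utf-8')).hexdigest()}"
        if len(partial.encode("utf-8")) <= MAX_KEY_LEN:
            return partial                            # only the long part is hashed
        return hashlib.sha512(plain.encode("utf-8")).hexdigest()  # last resort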
On a related note, I also found this question http://www.couchbase.com/communities/q-and-a/key-size-limits-couchbasemembase-again where one answer suggests compressing the keys if that's possible for you.
You could actually use normal gzip compression and encode the result as text. I'm not sure how well this would work for your use case - you'll have to check it - but I used it for JSON files and managed to reduce them down to ~20%. However, that was a huge 8 MB file, so the compression gains for your key might be much lower.
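Something along these lines, if you want to try it (note that for short keys the zlib header overhead can actually make the result longer, so measure on your real keys):

    import base64
    import zlib

    def compressed_key(text: str) -> str:
        # Deflate, then re-encode as URL-safe ASCII so it stays a valid key.
        packed = zlib.compress(text.encode("utf-8"), 9)
        return base64.urlsafe_b64encode(packed).decode("ascii")

    long_key = "google.com" + "http://google.com/path?" + "q=1&" * 60
    print(len(long_key), len(compressed_key(long_key)))  # compressed is shorter here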
Considering that a UUID (RFC 4122, 16 bytes) is much larger than a MongoDB ObjectId (12 bytes), I am trying to find out how their collision probabilities compare.
I know that collisions are supposed to be quite unlikely, but in my case most ids will be generated by a large number of mobile clients, not by a limited set of servers. I wonder whether, in this case, there is a justified concern.
Compared to the normal case where all ids are generated by a small number of clients:
It might take months after document creation to detect a collision
IDs are generated from a much larger client base
Each client has a lower ID generation rate
in my case most ids will be generated within a large number of mobile clients, not within a limited set of servers. I wonder if in this case, there is a justified concern.
That sounds like very bad architecture to me. Are you using a two-tier architecture? Why would the mobile clients have direct access to the db? Do you really want to rely on network-based security?
Anyway, some deliberations about the collision probability:
Neither UUID nor ObjectId rely on their sheer size, i.e. both are not random numbers, but they follow a scheme that tries to systematically reduce collision probability. In case of ObjectIds, their structure is:
4 byte seconds since unix epoch
3 byte machine id
2 byte process id
3 byte counter
This means that, contrary to UUIDs, ObjectIds are monotonic (except within a single second), which is probably their most important property. Monotonic keys fill the B-tree more efficiently, allow paging by id, allow a 'default sort' by id to keep your cursors stable, and of course carry an easy-to-extract timestamp. These are the optimizations you should be aware of, and they can be huge.
As you can see from the structure of the other 3 components, collisions become very likely if you're doing > 1k inserts/s on a single process (not really possible, not even from a server), or if the number of machines grows past about 10 (see the birthday problem), or if the number of processes on a single machine grows too large (then again, those aren't random numbers; they are truly unique on a machine, but they must be shortened to two bytes).
Naturally, for a collision to occur, they must match in all these aspects, so even if two machines have the same machine hash, it'd still require a client to insert with the same counter value in the exact same second and the same process id, but yes, these values could collide.
Let's look at the spec for "ObjectId" from the documentation:
Overview
ObjectId is a 12-byte BSON type, constructed using:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
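For illustration, a client could assemble such a value by hand like this (real MongoDB drivers already do this for you; hashing the hostname is just one plausible way to get a 3-byte machine identifier):

    import hashlib
    import os
    import socket
    import struct
    import time

    _counter = int.from_bytes(os.urandom(3), "big")  # 3-byte counter, random start

    def objectid_like() -> bytes:
        """Assemble a 12-byte ObjectId-style value from the four documented parts."""
        global _counter
        _counter = (_counter + 1) & 0xFFFFFF
        ts = struct.pack(">I", int(time.time()))                           # 4-byte seconds since epoch
        machine = hashlib.md5(socket.gethostname().encode()).digest()[:3]  # 3-byte machine identifier
        pid = struct.pack(">H", os.getpid() & 0xFFFF)                      # 2-byte process id
        counter = _counter.to_bytes(3, "big")                              # 3-byte counter
        return ts + machine + pid + counter

    print(objectid_like().hex())  # 24 hex characters, roughly increasing over time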
So let us consider this in the context of a "mobile client".
Note: The context here does not mean using a "direct" connection of the "mobile client" to the database. That should not be done. But the "_id" generation can be done quite simply.
So the points:
Value for the "seconds since epoch". That is going to be fairly random per request. So minimal collision impact just on that component. Albeit in "seconds".
The "machine identifier". So this is a different client generating the _id value, which removes further possibility of "collision".
The "process id". So where that is accessible to seed (and it should be), the generated _id has an even better chance of avoiding collision.
The "random value". So another "client" somehow managed to generate all of the same values as above and still managed to generate the same random value.
Bottom line is, if that is not a convincing enough argument to digest, then simply provide your own "uuid" entries as the "primary key" values.
But IMHO, that should be a fairly convincing argument to consider that the collision aspects here are very broad. To say the least.
The full topic is probably just a little "too-broad". But I hope this moves consideration a bit more away from "Quite unlikely" and on to something a little more concrete.
I have an authentication application and don't know how secure it is.
Here is the algorithm:
1) A clientToken is generated by taking the SHA-512 hash of a new GUID. I have about 1000 clientTokens generated and stored in the database.
Every time a caller calls my web service, it needs to provide the clientToken; if the clientToken does not exist in the database, then it is not a valid client.
The problem is: how long would it take to brute-force an existing clientToken?
A GUID is a 128 bit value, with 6 bits held constant, so a total of 122 bits available. Since this is your input to the hash, you're not going to have 2^512 unique hashes in your application. This is roughly 5.3*10^36 values to check.
Say your attacker is able to calculate 1,000,000 (10^6) hashes per second (I'm not sure how reasonable that is for SHA-512, but at this size, a few orders of magnitude won't influence things that much). This works out to about 5.3*10^30 seconds to check the space (For reference, this will be far beyond the time all stars have gone dark). Also, unless you have several billion clients, a birthday attack probably will not remove too many orders of magnitude from this.
But, just for fun, let's say the attacker has some trick that lets him reduce the number of hashes to check by half (or some combination of reduced space to check and increased speed), either by you having that many users, or some flaw in your GUID generator, or what have you. We're still looking at well over 100 million years to find a collision.
I think you're beyond safe and into somewhat overkill territory. Also note that hashing the GUID in effect does nothing, and that GUIDs probably are not generated via a secure random number generator. You'd actually be a bit better off just generating a 128 bits (16 bytes) of randomness via whatever secure random number generator your platform uses, and using that as the shared secret.
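For example, in Python that shared secret could come straight from the OS CSPRNG:

    import secrets

    # 16 bytes = 128 bits of real randomness, unlike a GUID where some bits are fixed.
    client_token = secrets.token_hex(16)   # 32 hex characters to store and compare
    print(client_token)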
I am designing a storage cloud software on top of a LAMP stack.
Files could have an internal ID, but there would be many advantages to storing them in the servers' filesystems not under an incrementing id as the filename, but under a hash as the filename.
Also, hashes as identifiers in the database would have a lot of advantages if the currently centralized database were to be sharded or decentralized, or if some sort of master-master high-availability environment were set up. But I am not sure about that yet.
Clients can store files under any string (usually some sort of path and filename).
This string is guaranteed to be unique, because the first level is something like the "buckets" that users have to register, as in Amazon S3 and Google Storage.
My plan is to store files under a hash of the client-side-defined path.
This way the storage server can serve a file directly, without asking the database which ID it has, because it can calculate the hash and thus the filename on the fly.
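A minimal sketch of that lookup-free scheme (the bucket/path layout here is just an assumption):

    import hashlib

    def storage_filename(bucket: str, client_path: str) -> str:
        # The on-disk name is derived purely from the client-supplied string,
        # so the server can locate the file without any database round trip.
        key = f"{bucket}/{client_path}".encode("utf-8")
        return hashlib.sha1(key).hexdigest()

    print(storage_filename("my-bucket", "photos/2014/holiday.jpg"))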
But I am afraid of collisions. I currently think about using SHA1 hashes.
I heard that Git also uses hashes as revision identifiers.
I know that the chances of collisions are really really low, but possible.
I just cannot judge this. Would you or would you not rely on hash for this purpose?
I could also use some normalized encoding of the path, maybe base64, as the filename, but I really do not want that because it could get messy, paths could get too long, and there could be other complications.
Assuming you have a hash function with "perfect" properties, and assuming cryptographic hash functions approach that, the theory that applies is the same that applies to birthday attacks. What this says is that, given a maximum number of files, you can make the collision probability as small as you want by using a larger hash digest size. SHA-1 has 160 bits, so for any practical number of files the probability of collision is going to be just about zero. If you look at the table in the link, you'll see that a 128-bit hash with 10^10 files has a collision probability of about 10^-18.
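(You can reproduce those orders of magnitude with the usual birthday approximation P ≈ n²/2^(b+1) for n files and a b-bit hash:)

    def collision_probability(n: float, bits: int) -> float:
        # Birthday approximation, valid while the result is much smaller than 1.
        return n ** 2 / 2 ** (bits + 1)

    print(collision_probability(1e10, 128))  # ~1.5e-19, the order of magnitude in the table
    print(collision_probability(1e10, 160))  # SHA-1: ~3.4e-29, smaller still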
As long as the probability is low enough I think the solution is good. Compare with the probability of the planet being hit by an asteroid, undetectable errors in the disk drive, bits flipping in your memory etc. - as long as those probabilities are low enough we don't worry about them because they'll "never" happen. Just take enough margin and make sure this isn't the weakest link.
One thing to be concerned about is the choice of hash function and its possible vulnerabilities. Is there any other authentication in place, or does the user simply present a path and retrieve a file?
If you think about an attacker trying to brute force the scenario above, they would need to request 2^18 files before they could get some other random file stored in the system (again assuming a 128-bit hash and 10^10 files; you'll have a lot fewer files and a longer hash). 2^18 is a pretty big number, and the speed at which you can brute force this is limited by the network and the server. A simple lock-the-user-out-after-x-attempts policy can completely close this hole (which is why many systems implement this sort of policy). Building a secure system is complicated and there will be many points to consider, but this sort of scheme can be perfectly secure.
Hope this is useful...
EDIT: another way to think about this is that practically every encryption or authentication system relies on certain events having very low probability for its security. E.g. I could get lucky and guess the prime factors of a 512-bit RSA key, but it is so unlikely that the system is considered very secure.
Whilst the probability of a collision might be vanishingly small, imagine serving a highly confidential file from one customer to their competitor just because there happens to be a hash collision.
= end of business
I'd rather use hashing for things that were less critical when collisions DO occur ;-)
If you have a database, store the files under GUIDs - so not an incrementing index, but a proper globally unique identifier. They work nicely when it comes to distributed shards / high availability etc.
Imagine the worst-case scenario and assume it will happen the week after you are featured in Wired magazine as an amazing startup... that's a good stress test for the algorithm.
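A GUID-per-file scheme is a one-liner in most stacks; for instance, assuming Python on the server:

    import uuid

    # One proper globally unique identifier per file, independent of both
    # an incrementing index and the file's contents.
    file_id = uuid.uuid4()
    print(str(file_id))   # use as the database key ...
    print(file_id.hex)    # ... and as the on-disk filename, if you prefer no dashes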