Is perfect hashing without buckets possible?

I've been asked to look for a perfect hash/one way function to be able to hash 10^11 numbers.
However, as we'll be using an embedded device, it won't have the memory to store the relevant buckets, so I was wondering if it's possible to have a decent (minimal) perfect hash without them?
The plan is to use the device to hash the number(s), and we use a rainbow table or a file using the hash as the offset.
Cheers
Edit:
I'll try to provide some more info :)
1) 10^11 is actually now 10^10, so that makes it easier. This number is the number of possible combinations. So we could get a number between 0000000001 and 10000000000 (10^10).
2) The plan is to use it as part of a one-way function to make the number secure so we can send it by insecure means.
We will then look up the original number at the other end using a rainbow table.
The problem is that the source devices generally have 512 KB to 4 MB of memory to use.
3) It must be perfect - we 100% cannot have a collision.
Edit2:
4) We can't use encryption, as we've been told it's not really possible on the devices, and key management would be a nightmare if we could.
Edit3:
As this is not sensible, it's purely an academic question now (I promise).

Okay, since you've clarified what you're trying to do, I rewrote my answer.
To summarize: Use a real encryption algorithm.
First, let me go over why your hashing system is a bad idea.
What is your hashing system, anyway?
As I understand it, your proposed system is something like this:
Your embedded system (which I will call C) is sending some sort of data with a value space of 10^11. This data needs to be kept confidential in transit to some sort of server (which I will call S).
Your proposal is to send the value hash(salt + data) to S. S will then use a rainbow table to reverse this hash and recover the data. salt is a shared value known to both C and S.
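For concreteness, here is a minimal sketch of that proposal as I understand it (the salt value, message, and function names are hypothetical, purely for illustration):

```python
import hashlib

salt = b"shared-secret-salt"  # hypothetical value known to both C and S

def encode(data: bytes) -> bytes:
    """What the embedded client C would send over the wire."""
    return hashlib.sha1(salt + data).digest()

# S would have to precompute hash(salt + v) for all ~10^10 possible
# values of v to build its rainbow/lookup table - the expensive part.
print(encode(b"0000042137").hex())
```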
This is an encryption algorithm
An encryption algorithm, when you boil it down, is any algorithm that gives you confidentiality. Since your goal is confidentiality, any algorithm that satisfies your goals is an encryption algorithm, including this one.
This is a very poor encryption algorithm
First, there is an unavoidable chance of collision. Moreover, the set of colliding values differs each day.
Second, decryption is extremely CPU- and memory-intensive even for the legitimate server S. Changing the salt is even more expensive.
Third, although your stated goal is avoiding key management, your salt is a key! You haven't solved key management at all; anyone with the salt will be able to crack the message just as well as you can.
Fourth, it's only usable from C to S. Your embedded system C will not have enough computational resources to reverse hashes, and can only send data.
This isn't any faster than a real encryption algorithm on the embedded device
Most secure hashing algorithms are just as computationally expensive as a reasonable block cipher, if not worse. For example, SHA-1 requires doing the following for each 512-bit block:
Allocate 12 32-bit variables.
Allocate 80 32-bit words for the expanded message.
64 times: Perform three array lookups, three 32-bit xors, and a rotate operation.
80 times: Perform up to five 32-bit binary operations (some combination of xor, and, or, and not, depending on the round); then a rotate, an array lookup, four adds, another rotate, and several memory loads/stores.
Perform five 32-bit two's-complement adds.
There is one chunk per 512 bits of the message, plus a possible extra chunk at the end. This is 1136 binary operations per chunk (not counting memory operations), or about 16 operations per byte.
For comparison, the RC4 encryption algorithm requires four operations (three additions, plus an xor on the message) per byte, plus two array reads and two array writes. It also requires only 258 bytes of working memory, vs a peak of 368 bytes for SHA-1.
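To make that per-byte cost concrete, here is a minimal pure-Python RC4 (a sketch for illustration only, not production use; as noted later in this answer, real uses should discard the first few kilobytes of keystream). The second loop does the per-byte work: a couple of 8-bit additions, a swap, one table lookup, and one xor:

```python
def rc4(key: bytes, data: bytes) -> bytes:
    # Key-scheduling algorithm (KSA): initialize and shuffle the state.
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) & 0xFF
        S[i], S[j] = S[j], S[i]
    # PRGA: per byte, two additions, a swap, a lookup, and one xor.
    out = bytearray()
    i = j = 0
    for byte in data:
        i = (i + 1) & 0xFF
        j = (j + S[i]) & 0xFF
        S[i], S[j] = S[j], S[i]
        out.append(byte ^ S[(S[i] + S[j]) & 0xFF])
    return bytes(out)

ct = rc4(b"key", b"hello")
assert rc4(b"key", ct) == b"hello"  # RC4 is its own inverse
```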
Key management is fundamental
With any confidentiality system, you must have some sort of secret. If you have no secrets, then anyone else can implement the same decoding algorithm, and your data is exposed to the world.
So, you have two choices as to where to put the secrecy. One option is to make the encipherment/decipherment algorithms secret. However, if the code (or binaries) for the algorithm is ever leaked, you lose - it's quite hard to replace such an algorithm.
Thus, secrets are generally made easy to replace - this is what we call a key.
Your proposed usage of hash algorithms would require a salt - this is the only secret in the system and is therefore a key. Whether you like it or not, you will have to manage this key carefully. And it's a lot harder to replace if leaked than other keys - you have to spend many CPU-hours generating a new rainbow table every time it's changed!
What should you do?
Use a real encryption algorithm, and spend some time actually thinking about key management. These issues have been solved before.
First, use a real encryption algorithm. AES has been designed for high performance and low RAM requirements. You could also use a stream cipher like RC4, as I mentioned before - the thing to watch out for with RC4, however, is that you must discard the first 4 kilobytes or so of output from the cipher, or you will be vulnerable to the same attacks that plagued WEP.
Second, think about key management. One option is to simply burn a key into each client, and physically go out and replace it if the client is compromised. This is reasonable if you have easy physical access to all of the clients.
Otherwise, if you don't care about man-in-the-middle attacks, you can simply use Diffie-Hellman key exchange to negotiate a shared key between S and C. If you are concerned about MitMs, then you'll need to start looking at ECDSA or something similar to authenticate the key obtained from the D-H exchange - beware that when you start going down that road, it's easy to get things wrong. I would recommend implementing TLS at that point. It's not beyond the capabilities of an embedded system - indeed, there are a number of commercial (and open source) embedded TLS libraries available already. If you don't implement TLS, then at least have a professional cryptographer look over your algorithm before implementing it.
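As a toy illustration of that D-H negotiation (with deliberately tiny, insecure parameters; real deployments use standardized groups of 2048 bits or more, e.g. from RFC 3526):

```python
import secrets

p = 4294967291  # largest prime below 2^32 - far too small for real use
g = 5

a = secrets.randbelow(p - 2) + 2  # C's private value
b = secrets.randbelow(p - 2) + 2  # S's private value

A = pow(g, a, p)  # C sends A to S over the insecure channel
B = pow(g, b, p)  # S sends B to C over the insecure channel

# Each side combines its own secret with the other's public value.
assert pow(B, a, p) == pow(A, b, p)  # identical shared key on both ends
```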

There is obviously no such thing as a "perfect" hash unless you have at least as many hash buckets as inputs; if you don't, then inevitably it will be possible for two of your inputs to share the same hash bucket.
However, it's unlikely you'll be storing all the numbers between 0 and 10^11. So what's the pattern? If there's a pattern, there may be a perfect hash function for your actual data set.
It's really not that important to find a "perfect" hash function anyway, though. Hash tables are very fast. A function with a very low collision rate - and when hashing integers, that means nearly any simple function, like modulus - is fine and you'll get O(1) average performance.
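As a minimal sketch of that point, here is a chained hash table keyed by a simple modulus; collisions are tolerated by the per-bucket lists, so no perfect hash is needed for O(1) average lookups (bucket count and values are arbitrary):

```python
NUM_BUCKETS = 1024
buckets = [[] for _ in range(NUM_BUCKETS)]  # chaining handles collisions

def insert(n: int) -> None:
    b = buckets[n % NUM_BUCKETS]  # modulus as the hash function
    if n not in b:
        b.append(n)

def contains(n: int) -> bool:
    return n in buckets[n % NUM_BUCKETS]

insert(10**10)
insert(10**10 + 1024)  # same bucket as above - harmless
print(contains(10**10), contains(42))  # True False
```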

Related

Which SHA-2 function will Facebook use?

I read that on 1st Oct 2015 Facebook will move from SHA-1 to SHA-2, and that we have to update our applications: https://developers.facebook.com/blog/post/2015/06/02/SHA-2-Updates-Needed/
Do you know which SHA-2 function it will use?
I read there are several (224, 256, 384 or 512), and one of these (SHA-224) doesn't work on Windows XP SP3, which I use (http://blogs.msdn.com/b/alejacma/archive/2009/01/23/sha-2-support-on-windows-xp.aspx).
You don't have to worry that much, because usage of SHA-224 is quite limited.
On this question, CBroe has pointed out an important remark:
That blog post is about SSL connections when your app is making API
requests. This is not about anything you do with data within your app,
it is about the transport layer.
According to https://crypto.stackexchange.com/questions/15151/sha-224-purpose,
in the answer by Ilmari Karonen:
Honestly, in practice, there are very few if any reasons to use
SHA-224.
As fgrieu notes, SHA-224 is simply SHA-256 with a different IV and
with 32 of the output bits thrown away. For most purposes, if you want
a hash with more than 128 but less than 256 bits, simply using SHA-256
and truncating the output yourself to the desired bit length is
simpler and just as efficient as using SHA-224. As you observe,
SHA-256 is also more likely to be available on different platforms
than SHA-224, making it the better choice for portability.
Why would you ever want to use SHA-224, then?
The obvious use case is if you need to implement an existing protocol
that specifies the use of SHA-224 hashes. While, for the reasons
described above, it's not a very common choice, I'm sure such
protocols do exist.
Also, a minor advantage of SHA-224 over truncated SHA-256 is that, due
to the different IV, knowing the SHA-224 hash of a given message does
not reveal anything useful about its SHA-256 hash, or vice versa. This
is really more of an "idiot-proofing" feature; since the two hashes
have different names, careless users might assume that their outputs
have nothing in common, so NIST changed the IV to ensure that this is
indeed the case.
However, this isn't really something you should generally rely on. If
you really need to compute multiple unrelated hashes of the same input
string, what you probably want instead is a keyed PRF like HMAC, which
can be instantiated using any common hash function (such as SHA-256).
As you've mentioned, Windows XP with SP3 doesn't support SHA-224, but it does support SHA-256.
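To illustrate Karonen's point concretely: because of the different IV, SHA-224 is not simply a prefix of SHA-256, yet truncating SHA-256 yourself is a drop-in alternative where SHA-224 is unsupported. A quick sketch using Python's hashlib:

```python
import hashlib

msg = b"hello"

sha224 = hashlib.sha224(msg).hexdigest()            # 56 hex chars = 224 bits
truncated = hashlib.sha256(msg).hexdigest()[:56]    # SHA-256 cut to 224 bits

print(sha224)     # differs entirely from the line below (different IV)
print(truncated)
```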
Check also: https://security.stackexchange.com/questions/1751/what-are-the-realistic-and-most-secure-crypto-for-symmetric-asymmetric-hash
Especially: https://stackoverflow.com/a/817121/3964066
And: https://security.stackexchange.com/a/1755
Part of Thomas Pornin's answer:
ECDSA over a 256-bit curve already achieves an "unbreakable" level of
security (i.e. roughly the same level than AES with a 128-bit key, or
SHA-256 against collisions). Note that there are elliptic curves on
prime fields, and curves on binary fields; which kind is most
efficient depends on the involved hardware (for curves of similar
size, a PC will prefer the curves on a prime field, but dedicated
hardware will be easier to build with binary fields; the CLMUL
instructions on the newer Intel and AMD processors may change that).
SHA-512 uses 64-bit operations. This is fast on a PC, not so fast on a
smartcard. SHA-256 is often a better deal on small hardware (including
32-bit architectures such as home routers or smartphones)

CRC32+Size vs MD5/SHA1

We have a storage of files and the storage uniquely identifies a file on the basis of size appended to crc32.
I wanted to know if this checksum ( crc32 + size ) would be good enough for identifying files or should we consider some other hashing technique like MD5/SHA1?
CRC is more an error-detection method than a serious hash function. It helps in identifying corrupted files rather than uniquely identifying them.
So your choice should be between MD5 and SHA1.
If you don't have strong security needs, you can choose MD5, which should be faster
(but remember that MD5 is vulnerable to collision attacks).
If you need more security, you had better use SHA-1 or even SHA-2.
CRC-32 is not good enough; it is trivial to build collisions, i.e. two files (of the same length if you wish it so) which have the same CRC-32. Even in the absence of a malicious attacker, collisions will happen randomly once you have about 65000 distinct files with the same length.
A hash function is designed to avoid collisions. With MD5 or SHA-1, you will not get random collisions. If your setup is security-related (i.e. there is someone, somewhere, who may actively try to create collisions), then you need a secure hash function. MD5 is not secure anymore (creating collisions with MD5 is easy) and SHA-1 is somewhat weak in that respect (no actual collisions were computed, but a method for creating one is known and, while expensive, it is much less expensive than what it ought to be). The usual recommendation is to use SHA-256 or SHA-512 (SHA-256 is enough for security; SHA-512 may be a tad faster on big, 64-bit systems, but file-reading bandwidth will be more limiting than hashing speed).
Note: when using a cryptographic hash function, there is no need to store and compare the file lengths; the hash is sufficient to disambiguate files.
In a non-security setup (i.e. you only fear random collisions), then MD4 can be used. It is thoroughly "broken" as a cryptographic hash function, but it still is a very good checksum, and it is really fast (on some ARM-based platforms, it is even faster than CRC-32, for a much better resistance to random collisions). Basically, you should not use MD5: if you have security issues, then MD5 must not be used (it is broken; use SHA-256); and if you do not have security issues then MD4 is faster than MD5.
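To see the 65000-files figure from above in action, here is a quick sketch that draws random inputs until the first CRC-32 collision; by the birthday bound it typically stops somewhere in the tens of thousands of iterations:

```python
import os
import zlib

seen = {}   # maps crc -> the input that produced it
count = 0
while True:
    data = os.urandom(16)
    crc = zlib.crc32(data)
    if crc in seen and seen[crc] != data:
        print(f"CRC-32 collision after {count} random inputs")
        break
    seen[crc] = data
    count += 1
```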
The space that would be used by CRC32+size gives you enough room for a bigger CRC, which would be a much better choice - that is, if you are not worried about malicious collisions, in which case Thomas' answer applies.
You didn't specify a language, but in C++, for example, there is Boost CRC, which gives you a CRC of the size you want (or can afford to store).
As others have said, CRC doesn't guarantee absence of collisions. However, your problem can be solved simply by giving the files incrementing 64-bit numbers. This is guaranteed to never collide (unless you want to keep a gazillion files in one directory, which is not a good idea anyway).

How safe is it to rely on hashes for file identification?

I am designing a storage cloud software on top of a LAMP stack.
Files could have an internal ID, but it would have many advantages to store them in the server's filesystem not with an incrementing ID as the filename, but with a hash as the filename.
Also, hashes as identifiers in the database would have a lot of advantages if the currently centralized database should ever be sharded or decentralized, or if some sort of master-master high-availability environment should be set up. But I am not sure about that yet.
Clients can store files under any string (usually some sort of path and filename).
This string is guaranteed to be unique, because on the first level there is something like the "buckets" that users have to register, as in Amazon S3 and Google Storage.
My plan is to store files as hash of the client side defined path.
This way the storage server can directly serve the file without needing the database to ask which ID it is because it can calculate the hash and thus the filename on the fly.
But I am afraid of collisions. I currently think about using SHA1 hashes.
I have heard that Git also uses hashes as revision identifiers.
I know that the chances of collisions are really really low, but possible.
I just cannot judge this. Would you or would you not rely on hash for this purpose?
I could also use some normalization of the encoding of the path, maybe Base64 as the filename, but I really do not want that because it could get messy, paths could get too long, and there could be other complications.
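For reference, the scheme the question describes boils down to something like this (the bucket and path names here are illustrative):

```python
import hashlib

def storage_name(bucket: str, client_path: str) -> str:
    """Derive the on-disk filename directly from the client-supplied
    path, so no database lookup is needed to locate the file."""
    key = f"{bucket}/{client_path}".encode("utf-8")
    return hashlib.sha1(key).hexdigest()  # 160-bit name, 40 hex chars

print(storage_name("alice-bucket", "photos/2011/cat.jpg"))
```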
Assuming you have a hash function with "perfect" properties, and assuming cryptographic hash functions approach that, the theory that applies is the same as for birthday attacks. What this says is that, given a maximum number of files, you can make the collision probability as small as you want by using a larger hash digest size. SHA-1 has 160 bits, so for any practical number of files the probability of collision is going to be just about zero. If you look at the table in the link, you'll see that a 128-bit hash with 10^10 files has a collision probability of 10^-18.
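That figure follows from the standard birthday approximation P(collision) ≈ n²/2^(b+1) for n items and a b-bit hash; a quick back-of-the-envelope check:

```python
n = 10**10              # number of files
b = 128                 # hash bits
p = n * n / 2 ** (b + 1)
print(p)                # ~1.5e-19, the same order as the 10^-18 in the table
```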
As long as the probability is low enough I think the solution is good. Compare with the probability of the planet being hit by an asteroid, undetectable errors in the disk drive, bits flipping in your memory etc. - as long as those probabilities are low enough we don't worry about them because they'll "never" happen. Just take enough margin and make sure this isn't the weakest link.
One thing to be concerned about is the choice of the hash function and its possible vulnerabilities. Is there any other authentication in place, or does the user simply present a path and retrieve a file?
If you think about an attacker trying to brute force the scenario above, they would need to request about 10^18 files before they could get some other random file stored in the system (again assuming a 128-bit hash and 10^10 files; you'll have a lot fewer files and a longer hash). 10^18 is a pretty big number, and the speed at which you can brute force this is limited by the network and the server. A simple "lock the user out after x attempts" policy can completely close this hole (which is why many systems implement this sort of policy). Building a secure system is complicated and there will be many points to consider, but this sort of scheme can be perfectly secure.
Hope this is useful...
EDIT: another way to think about this is that practically every encryption or authentication system relies on certain events having very low probability for its security. E.g., I could get lucky and guess the prime factors of a 512-bit RSA key, but it is so unlikely that the system is considered very secure.
Whilst the probability of a collision might be vanishingly small, imagine serving a highly confidential file from one customer to their competitor just because there happens to be a hash collision.
= end of business
I'd rather use hashing for things where it is less critical when collisions DO occur ;-)
If you have a database, store the files under GUIDs - so not an incrementing index, but a proper globally unique identifier. They work nicely when it comes to distributed shards / high availability etc.
Imagine the worst-case scenario, and assume it will happen the week after you are featured in Wired magazine as an amazing startup... that's a good stress test for the algorithm.

Why use SHA1 for hashing secrets when SHA-512 is more secure?

I don't mean for this to be a debate, but I'm trying to understand the technical rationale behind why so many apps use SHA1 for hashing secrets, when SHA512 is more secure. Perhaps it's simply for backwards compatibility.
Besides the obvious larger size (128 chars vs 40), or slight speed differences, is there any other reason why folks use the former?
Also, SHA-1 I believe was first cracked by a VCR's processor years ago. Has anyone cracked 512 yet (perhaps with a leaf blower), or is it still safe to use without salting?
Most uses of SHA-1 are for interoperability: we use SHA-1 when we implement protocols where SHA-1 is mandated. Ease of development also comes into account: SHA-1 implementations in various languages and programming environment are more common than SHA-512 implementations.
Also, even though most usages of hash functions do not have performance issues (at least, no performance issue where the hash function is the bottleneck), there are some architectures where SHA-1 is vastly more efficient than SHA-512. Consider a basic Linksys router: it uses a MIPS-derivative CPU, clocked at 200 MHz. Such a machine can be reprogrammed, e.g. with OpenWRT (a small Linux for embedded systems). As a router, it has fast network (100 Mbit/s). Suppose that you want to hash some data (e.g. as part of some VPN software -- a router looks like a good candidate for running a VPN). With SHA-1, you will get about 6 MB/s, using the full CPU. That's already quite a bit lower than the network bandwidth. SHA-512 will give you no more than 1.5 MB/s on the same machine. On such a system, the difference in performance is not negligible. Also, if I use SHA-1 on my Linksys router for some communication protocol, then the machine at the other end of the link will also have to use SHA-1.
The good news is that there is an ongoing competition to select a new standard hash function, code-named SHA-3. Some of the competing candidates provide performance similar to SHA-1, or even somewhat better, while still yielding a 512-bit output and being (probably) as secure as SHA-512.
Both SHA1 and SHA512 are hash functions. If you are using them as a cryptographic hash, then perhaps that is good reason to use SHA512; however, there are applications that use these functions simply to identify objects. For example, Git uses SHA1 to cheaply distinguish between objects. In that case, because the possibility of collision between two documents is incredibly small with SHA1, there really is no justification for the additional space requirement of SHA512 when SHA1 is more than suitable for the task.
In terms of cryptographic hashes and the choice to use a salt or not, you may be interested in reading Don't Hash Secrets. Even with SHA512, using a salt is a good idea (and it's cheap to do, too, so why not do it?), because otherwise an attacker can guess the top passwords and see if they produce the same hash; the author points out that HMAC is a more secure mechanism. In any case, you will have to determine the costs associated with the extra time+space and the costs associated with the possibility of a breach, and determine how paranoid you want to be. As was recently discovered by Microsoft, constantly changing passwords is a waste of money and doesn't pay off, so while paranoia is usually good when it comes to security, you really have to do the math to determine if it makes sense... do the gains in security outweigh time and storage costs?
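A minimal sketch of the HMAC approach the article recommends, using Python's standard library (the key here is a placeholder):

```python
import hmac
import hashlib

key = b"server-side-secret"  # hypothetical key; keep it out of source control
tag = hmac.new(key, b"message to authenticate", hashlib.sha512).hexdigest()
print(tag)  # without the key, an attacker cannot precompute or verify guesses
```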
If you need something to be hashed quickly, or only need a 160 bit hash, you'd use SHA-1.
For comparing database entries to one another quickly, you might take 100 fields and make a SHA-1 hash from them, yielding 160 bits. Those 160 bits give about 10^48 possible values.
If I'm unlikely to ever have more than a tiny fraction of 10^48 values, it's quicker to just hash what I have with the simpler and faster algorithm.
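As a sketch of that row-fingerprinting idea (the field values are made up):

```python
import hashlib

def row_fingerprint(fields):
    """Concatenate fields with a separator and take one SHA-1 per row,
    so rows can be compared by their 160-bit digests."""
    joined = "\x1f".join(fields)  # unit separator avoids accidental merges
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()

a = row_fingerprint(["Alice", "1970-01-01", "alice@example.com"])
b = row_fingerprint(["Alice", "1970-01-01", "alice@example.org"])
print(a == b)  # False - any field change alters the fingerprint
```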

Purposely create two files to have the same hash?

If someone is purposely trying to modify two files to have the same hash, what are ways to stop them? Can MD5 and SHA-1 prevent the majority of cases?
I was thinking of writing my own, and I figure that even if I don't do a good job, if the user doesn't know my hash he may not be able to fool it.
What's the best way to prevent this?
MD5 is generally considered insecure if hash collisions are a major concern. SHA1 is likewise no longer considered acceptable by the US government. There was a competition under way to find a replacement hash algorithm, but the recommendation at the moment is to use the SHA2 family - SHA-256, SHA-384 or SHA-512. [Update 2012-10-02: NIST has chosen Keccak as the SHA-3 algorithm.]
You can try to create your own hash — it would probably not be as good as MD5, and 'security through obscurity' is likewise not advisable.
If you want security, hash with multiple hash algorithms. Being able to simultaneously create files that have hash collisions using a number of algorithms is excessively improbable. [And, in the light of comments, let me make it clear: I mean publish both the SHA-256 and the Whirlpool values for the file — not combining hash algorithms to create a single value, but using separate algorithms to create separate values. Generally, a corrupted file will fail to match any of the algorithms; if, perchance, someone has managed to create a collision value using one algorithm, the chance of also producing a second collision in one of the other algorithms is negligible.]
The Public TimeStamp uses an array of algorithms. See, for example, sqlcmd-86.00.tgz for an illustration.
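A sketch of the publish-several-digests idea described above. Whirlpool is not in a standard Python build, so SHA-256 and SHA3-512 stand in here as two independent algorithms; each digest is published separately, not combined:

```python
import hashlib

def multi_digest(path):
    """Return independent digests of one file under two unrelated
    algorithms; a forger would have to collide both simultaneously."""
    with open(path, "rb") as f:
        data = f.read()
    return {
        "sha256": hashlib.sha256(data).hexdigest(),
        "sha3_512": hashlib.sha3_512(data).hexdigest(),
    }
```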
If the user doesn't know your hashing algorithm he also can't verify your signature on a document that you actually signed.
The best option is to use publicly known one-way hashing algorithms that generate the longest hash. SHA-256 creates a 256-bit hash, so a forger would have to try 2^255 different documents (on average) before they created one that matched a given document, which is pretty secure. If that's still not secure enough for you, there's SHA-512.
Also, I think it's worth mentioning that a good low-tech way to protect yourself against forged digitally-signed documents is to simply keep a copy of anything you sign. That way, if it comes down to a dispute, you can show that the original document you signed was altered.
There is a hierarchy of difficulty (for an attacker) here. It is easier to find two files with the same hash than to generate one to match a given hash, and easier to do the latter if you don't have to respect form/content/length restrictions.
Thus, if it is possible to use a well-defined document structure and length, you can make an attacker's life a bit harder no matter what underlying hash you use.
Why are you trying to create your own hash algorithm? What's wrong with SHA1HMAC?
Yes, there are repeats for hashes.
Any hash that is shorter than the plaintext is necessarily less information. That means there will be some repeats. The key for hashes is that the repeats are hard to reverse-engineer.
Consider CRC32 - commonly used as a hash. It's a 32-bit quantity. Because there are more than 2^32 possible messages, there will be repeats with CRC32.
The same idea applies to other hashes.
This is called a "hash collision", and the best way to avoid it is to use a strong hash function. With MD5 it is relatively easy to artificially build colliding files, as seen here. Similarly, it's known that there is a relatively efficient method for computing colliding SHA-1 files, although in this case "relatively efficient" still means hundreds of hours of compute time.
Generally, MD5 and SHA1 are still expensive to crack, but not impossible. If you're really worried about it, use a stronger hash function, like SHA256.
Writing your own isn't actually a good idea unless you're a fairly expert cryptographer. Most of the simple ideas have been tried, and there are well-known attacks against them.
If you really want to learn more about it, have a look at Schneier's Applied Cryptography.
I don't think coming up with your own hash algorithm is a good choice.
Another good option is salted MD5. For example, the input to your MD5 hash function is appended with the string "acidzom!##" before being passed to the MD5 function.
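That suggestion, as a sketch (keeping the answer's example salt; note that HMAC, mentioned elsewhere on this page, is the sounder construction when a secret is involved):

```python
import hashlib

SALT = "acidzom!##"  # the answer's example; in practice keep the salt secret

def salted_md5(data: str) -> str:
    return hashlib.md5((data + SALT).encode("utf-8")).hexdigest()

print(salted_md5("my secret number"))
```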
There is also some good reading on Slashdot.