Is it possible to brute force my SHA-512 authentication algorithm? - hash

I have an authentication application and don't know how secure it is; here is the algorithm.
1) A clientToken is generated by taking the SHA-512 hash of a new GUID. I have about 1000 client tokens generated and stored in the database.
2) Every time a caller calls my web service, it needs to provide its clientToken; if the clientToken does not exist in the database, it is not a valid client.
The question is: how long would it take to brute force an existing clientToken?

A GUID is a 128-bit value with 6 bits held constant, so a total of 122 bits are available. Since this is your input to the hash, you're not going to have 2^512 unique hashes in your application. This is roughly 5.3*10^36 values to check.
Say your attacker is able to calculate 1,000,000 (10^6) hashes per second (I'm not sure how reasonable that is for SHA-512, but at this scale, a few orders of magnitude won't change things much). This works out to about 5.3*10^30 seconds to check the space (for reference, that is far beyond the time by which all stars will have gone dark). Also, unless you have several billion clients, a birthday attack probably will not remove too many orders of magnitude from this.
But, just for fun, let's say the attacker has some trick that lets him reduce the number of hashes to check by half (or some combination of reduced space to check and increased speed), either by you having that many users, or some flaw in your GUID generator, or what have you. We're still looking at well over 100 million years to find a collision.
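To put those figures in concrete form, here is a quick back-of-the-envelope check (a sketch in Python; the 10^6 hashes per second rate is just the assumption from above):

```python
# Rough brute-force estimate for a SHA-512(GUID) token scheme.
search_space = 2 ** 122          # usable GUID entropy: 128 bits minus 6 fixed bits
hashes_per_second = 10 ** 6      # assumed attacker speed

seconds = search_space / hashes_per_second
years = seconds / (365.25 * 24 * 3600)

print(f"{search_space:.2e} candidate GUIDs")                 # ~5.3e+36
print(f"{seconds:.2e} seconds, about {years:.2e} years")     # ~5.3e+30 s, ~1.7e+23 years
```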
I think you're beyond safe and into somewhat-overkill territory. Also note that hashing the GUID in effect does nothing, and that GUIDs are probably not generated via a secure random number generator. You'd actually be a bit better off just generating 128 bits (16 bytes) of randomness via whatever secure random number generator your platform uses, and using that as the shared secret.
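For instance, on a platform with Python available, the standard `secrets` module will do this (a sketch; any CSPRNG your platform provides works equally well):

```python
import secrets

# 16 bytes (128 bits) straight from the OS CSPRNG; store the hex string as the client token.
client_token = secrets.token_hex(16)
print(client_token)   # 32 hex characters; no hashing step needed
```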


Why can't a JSON Web Token be reverse engineered? [duplicate]

Why can't you just reverse the algorithm like you could reverse a math function? How is it possible to make an algorithm that isn't reversible?
And if you use a rainbow table, why does adding a salt make it impossible to crack? If you build a rainbow table by brute force, it enumerates every possible plaintext value (up to some length), which would end up including the salt for each possible password and each possible salt (the salt and the password would just come together as a single piece of text).
MD5 is designed to be cryptographically irreversible. In this case, the most important property is that it is computationally infeasible to find the reverse of a hash, but it is easy to find the hash of any data. For example, let's think about just operating on numbers (a binary file, after all, can be interpreted as just a very long number).
Let's say we have the number "7", and we want to take the hash of it. Perhaps the first thing we try as our hash function is "multiply by two". As we'll see, this is not a very good hash function, but we'll try it, to illustrate a point. In this case, the hash of the number will be "14". That was pretty easy to calculate. But now, if we look at how hard it is to reverse it, we find that it is also just as easy! Given any hash, we can just divide it by two to get the original number! This is not a good hash, because the whole point of a hash is that it is much harder to calculate the inverse than it is to calculate the hash (this is the most important property in at least some contexts).
Now, let's try another hash. For this one, I'm going to have to introduce the idea of clock arithmetic. On a clock, there isn't an infinite amount of numbers. In fact, it just goes from 0 to 11 (remember, 0 and 12 are the same on a clock). So if you "add one" to 11, you just get zero. You can extend the ideas of multiplication, addition, and exponentiation to a clock. For example, 8+7=15, but 15 on a clock is really just 3! So on a clock, you would say 8+7=3! 6*6=36, but on a clock, 36=0! So 6*6=0! Now, for the concept of powers, you can do the same thing. 2^4=16, but 16 is just 4. So 2^4=4! Now, here's how it ties into hashing. How about we try the hash function f(x)=5^x, but with clock arithmetic? As you'll see, this leads to some interesting results. Let's try taking the hash of 7 as before.
We see that 5^7=78125 but on a clock, that's just 5 (if you do the math, you see that we've wrapped around the clock 6510 times). So we get f(7)=5. Now, the question is, if I told you that the hash of my number was 5, would you be able to figure out that my number was 7? Well, it's actually very hard to calculate the reverse of this function in the general case. People much smarter than me have proved that in certain cases, reversing this function is way harder than calculating it forward. (EDIT: Nemo has pointed out that this in fact has not been "proven"; in fact, the only guarantee you get is that a lot of smart people have tried a long time to find an easy way to do so, and none of them have succeeded.) The problem of reversing this operation is called the "Discrete Logarithm Problem". Look it up for more in depth coverage. This is at least the beginning of a good hash function.
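If you want to play with the clock example, Python's three-argument `pow` does exactly this modular exponentiation (a sketch reproducing the numbers above):

```python
# Forward direction is easy: f(x) = 5**x on a 12-hour "clock" (i.e. mod 12).
print(pow(5, 7, 12))          # 5 -- matches 5**7 = 78125 = 6510*12 + 5

# Reverse direction: for real-sized moduli this is the discrete logarithm problem.
# Even on this toy clock, all we can do is try every exponent:
target = 5
for x in range(12):
    if pow(5, x, 12) == target:
        print("one preimage is", x)   # finds x = 1 first, not the original 7
        break
```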
With real world hash functions, the idea is basically the same: You find some function that is hard to reverse. People much smarter than me have engineered MD5 and other hashes to be very hard to reverse (though, as noted in the edit above, not provably so).
Now, perhaps earlier the thought has occurred to you: "it would be easy to calculate the inverse! I'd just take the hash of every number until I found the one that matched!" Now, for the case where the numbers are all less than twelve, this would be feasible. But for the analog of a real-world hash function, imagine all the numbers involved are huge. The idea is that it is still relatively easy to calculate the hash function for these large numbers, but searching through all possible inputs becomes harder much more quickly. But what you've stumbled upon is still a very important idea: searching through the input space for an input which will give a matching output. Rainbow tables are a more complex variation on the idea, which use precomputed tables of input-output pairs in smart ways in order to make it possible to quickly search through a large number of possible inputs.
Now let's say that you are using a hash function to store passwords on your computer. The idea is this: The computer just stores the hash of the correct password. When a user tries to login, you compare the hash of the input password to the hash of the correct password. If they match, you assume the user has the correct password. The reason this is advantageous is because if someone steals your computer, they still don't have access to your password, just the hash of it. Because the hash function was designed by smart people to be hard to take the reverse of, they can't easily retrieve your password from it.
An attacker's best bet is a brute-force attack, where they try a bunch of passwords. Just like you might try the numbers less than 12 in the previous problem, an attacker might try all the passwords composed only of numbers and letters and less than 7 characters long, or all words which show up in the dictionary. The important thing here is that he can't try all possible passwords, because there are way too many possible 16-character passwords, for example, to ever test. So the point is that an attacker has to restrict the set of passwords he tests; otherwise he will never even check a small percentage of them.
Now, as for a salt, the idea is this: What if two users had the same password? They would have the same hash. If you think about it, the attacker doesn't really have to crack every user's password individually. He simply goes through every possible input password and compares the hash to all the hashes. If it matches one of them, then he has found a new password. What we'd really like to force him to do is calculate a new hash for every user+password combination he wants to check. That's the idea of a salt: you make the hash function slightly different for every user, so he can't reuse a single set of precomputed values for all users. The most straightforward way to do this is to tack on some random string to each user's password before you take the hash, where the random string is different for each user. So, for example, if my password is "shittypassword", my hash might show up as MD5("6n93nshittypassword"), and if your password is "shittypassword", your hash might show up as MD5("fa9elshittypassword"). This little bit "fa9el" is called the "salt", and it's different for every user; for example, my salt is "6n93n". Now, this little bit which is tacked on to your password is just stored on your computer as well. When you try to login with the password X, the computer can just calculate MD5("fa9el"+X) and see if it matches the stored hash.
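Here's a minimal sketch of that salting scheme in Python, keeping MD5 only to match the example above (a real system would use a dedicated password hash such as bcrypt or Argon2, and a longer random salt):

```python
import hashlib
import secrets

def make_record(password: str) -> tuple[str, str]:
    """Return (salt, hash) to store for a new user."""
    salt = secrets.token_hex(4)   # short random salt, unique per user
    digest = hashlib.md5((salt + password).encode()).hexdigest()
    return salt, digest

def check_login(attempt: str, salt: str, stored_hash: str) -> bool:
    """Recompute MD5(salt + attempt) and compare it to the stored hash."""
    return hashlib.md5((salt + attempt).encode()).hexdigest() == stored_hash

salt, stored = make_record("shittypassword")
print(check_login("shittypassword", salt, stored))   # True
print(check_login("wrongguess", salt, stored))       # False
```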
So the basic mechanics of logging in remain unchanged, but for an attacker, they are now faced with a more daunting challenge: rather than a list of MD5 hashes, they are faced with a list of MD5 sums and salts. They essentially have two options:
They can ignore the fact that the hashes are salted, and try to crack the passwords with their lookup table as is. However, the chances that they'll actually crack a password are much reduced. For example, even if "shittypassword" is on their list of inputs to check, most likely "fa9elshittypassword" isn't. In order to get even a small percentage of the probability of cracking a password that they had before, they'll need to test orders of magnitude more possible passwords.
They can recalculate the hashes on a per-user basis. So rather than calculating MD5(passwordguess), for each user X they calculate MD5(Salt_of_user_X + passwordguess). Not only does this force them to calculate a new hash for each user they want to crack, but most importantly, it prevents them from being able to use precalculated tables (like rainbow tables, for example), because they can't know what Salt_of_user_X is beforehand, so they can't precalculate the hashes to test.
So basically, if they are trying to use precalculated tables, using a salt effectively greatly increases the possible inputs they have to test in order to crack the password, and even if they aren't using precalculated tables, it still slows them down by a factor of N, where N is the number of passwords you are storing.
Hopefully this answers all your questions.
Think of 2 numbers from 1 to 9999. Add them. Now tell me the final digit.
I can't, from that information, deduce which numbers you originally thought of. That is a very simple example of a one-way hash.
Now, I can think of two numbers which give the same result, and this is where this simple example differs from a 'proper' cryptographic hash like MD5 or SHA1. With those algorithms, it should be computationally difficult to come up with an input which produces a specific hash.
One big reason you can't reverse the hash function is because data is lost.
Consider a simple example function: 'OR'. If you apply that to your input data of 1 and 0, it yields 1. But now, if you know the answer is '1', how do you back out the original data? You can't. It could have been 1,1 or maybe 0,1, or maybe 1,0.
As for salting and rainbow tables. Yes, theoretically, you could have a rainbow table which would encompass all possible salts and passwords, but practically, that's just too big. If you tried every possible combination of lower case letters, upper case, numbers, and twelve punctuation symbols, up to 50 characters long, that's (26+26+10+12)^50 = 2.9 x 10^93 different possibilities. That's more than the number of atoms in the visible universe.
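That count is a one-liner to check (a quick sketch):

```python
alphabet = 26 + 26 + 10 + 12     # lowercase, uppercase, digits, twelve punctuation marks
print(f"{alphabet ** 50:.1e}")   # ~2.9e+93 possible 50-character strings
```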
The idea behind rainbow tables is to calculate the hash for a bunch of possible passwords in advance, and passwords are much shorter than 50 characters, so it's possible to do so. That's why you want to add a salt in front: if you add '57sjflk43380h4ljs9flj4ay' to the front of the password, then while someone may have already computed the hash for "pa55w0rd", no one will have already calculated the hash for '57sjflk43380h4ljs9flj4aypa55w0rd'.
I don't think MD5 gives you the whole result, so you can't work backwards to find the original thing that was MD5-ed.
MD5 is 128-bit; that's 3.4*10^38 combinations.
The total number of eight-character passwords:
only lowercase characters and numbers: 36^8 = 2.8*10^12
lower- and uppercase characters and numbers: 62^8 = 2.18*10^14
You have to store 8 bytes for the password and 16 for the MD5 value; that's 24 bytes total per entry.
So you need approximately 67,000 GB or 5,200,000 GB of storage for your rainbow table.
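Those storage figures are easy to reproduce (a sketch; 24 bytes per table entry as assumed above):

```python
entry_bytes = 8 + 16                     # 8-byte password + 16-byte MD5 digest

lowercase_and_digits = 36 ** 8           # ~2.8e12 candidate passwords
mixed_case_and_digits = 62 ** 8          # ~2.18e14 candidate passwords

GB = 1000 ** 3
print(lowercase_and_digits * entry_bytes / GB)    # ~67,700 GB
print(mixed_case_and_digits * entry_bytes / GB)   # ~5,240,000 GB
```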
The only reason that it's actually possible to figure out passwords is because people use obvious ones.

Collision probability of ObjectId vs UUID in a large distributed system

Considering that a UUID (RFC 4122, 16 bytes) is much larger than a MongoDB ObjectId (12 bytes), I am trying to find out how their collision probabilities compare.
I know that it is somewhere around "quite unlikely", but in my case most IDs will be generated by a large number of mobile clients, not by a limited set of servers. I wonder whether, in this case, there is a justified concern.
Compared to the normal case where all ids are generated by a small number of clients:
It might take months after a document is created to detect a collision
IDs are generated from a much larger client base
Each client has a lower ID generation rate
In my case most IDs will be generated by a large number of mobile clients, not by a limited set of servers. I wonder whether, in this case, there is a justified concern.
That sounds like very bad architecture to me. Are you using a two-tier architecture? Why would the mobile clients have direct access to the db? Do you really want to rely on network-based security?
Anyway, some deliberations about the collision probability:
Neither UUID nor ObjectId rely on their sheer size, i.e. both are not random numbers, but they follow a scheme that tries to systematically reduce collision probability. In case of ObjectIds, their structure is:
4 byte seconds since unix epoch
3 byte machine id
2 byte process id
3 byte counter
This means that, contrary to UUIDs, ObjectIds are monotonic (except within a single second), which is probably their most important property. Monotonic indexes cause the B-tree to be filled more efficiently, allow paging by id, and allow a "default sort" by id to keep your cursors stable, and of course the ids carry an easy-to-extract timestamp. These are the optimizations you should be aware of, and they can be huge.
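As an illustration of the easy-to-extract timestamp, the creation time can be read straight off the first 4 bytes of the hex string, without any driver support (a sketch; the ObjectId value is just a hypothetical example):

```python
from datetime import datetime, timezone

object_id = "507f1f77bcf86cd799439011"    # hypothetical 24-character hex ObjectId

seconds = int(object_id[:8], 16)          # first 4 bytes = seconds since the Unix epoch
created = datetime.fromtimestamp(seconds, tz=timezone.utc)
print(created)                            # 2012-10-17 21:13:27+00:00
```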
As you can see from the structure of the other 3 components, collisions become very likely if you're doing more than about 16M inserts/s on a single process (the 3-byte counter wrapping within one second - not really possible, not even from a server), or if the number of machines grows into the thousands (see the birthday problem over the 3-byte machine hash), or if the number of processes on a single machine grows too large (then again, those aren't random numbers - they are truly unique on a machine - but they must be shortened to two bytes).
Naturally, for a collision to occur, they must match in all these aspects, so even if two machines have the same machine hash, it'd still require a client to insert with the same counter value in the exact same second and the same process id, but yes, these values could collide.
Let's look at the spec for "ObjectId" from the documentation:
Overview
ObjectId is a 12-byte BSON type, constructed using:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id, and
a 3-byte counter, starting with a random value.
So let us consider this in the context of a "mobile client".
Note: The context here does not mean using a "direct" connection of the "mobile client" to the database. That should not be done. But the "_id" generation can be done quite simply.
So the points:
Value for the "seconds since epoch". That is going to be fairly random per request. So minimal collision impact just on that component. Albeit in "seconds".
The "machine identifier". So this is a different client generating the _id value. This is removing possibility of further "collision".
The "process id". So where that is accessible to seed ( and it should be ) then the generated _id has more chance of avoiding collision.
The "random value". So another "client" somehow managed to generate all of the same values as above and still managed to generate the same random value.
Bottom line is, if that is not a convincing enough argument to digest, then simply provide your own "uuid" entries as the "primary key" values.
But IMHO, that should be a fairly convincing argument that the collision aspects here are very broad. To say the least.
The full topic is probably just a little "too-broad". But I hope this moves consideration a bit more away from "Quite unlikely" and on to something a little more concrete.

How safe is it to rely on hashes for file identification?

I am designing a storage cloud software on top of a LAMP stack.
Files could have an internal ID, but there would be many advantages to storing them in the servers' filesystems not with an incrementing ID as the filename, but with a hash as the filename.
Also, hashes as identifiers in the database would have a lot of advantages if the currently centralized database were to be sharded or decentralized, or if some sort of master-master high-availability environment were to be set up. But I am not sure about that yet.
Clients can store files under any string (usually some sort of path and filename).
This string is guaranteed to be unique, because the first level is something like "buckets" that users have to register, as in Amazon S3 and Google Storage.
My plan is to store files as hash of the client side defined path.
This way the storage server can directly serve the file without needing to ask the database which ID it is, because it can calculate the hash and thus the filename on the fly.
But I am afraid of collisions. I currently think about using SHA1 hashes.
I have heard that Git also uses hashes as revision identifiers.
I know that the chances of collisions are really really low, but possible.
I just cannot judge this. Would you or would you not rely on hash for this purpose?
I could also use some normalized encoding of the path, maybe Base64 as the filename, but I really do not want that because it could get messy, paths could get too long, and there could be other complications.
Assuming you have a hash function with "perfect" properties, and assuming cryptographic hash functions approach that, the theory that applies is the same as for birthday attacks. What this says is that, given a maximum number of files, you can make the collision probability as small as you want by using a larger hash digest size. SHA-1 has 160 bits, so for any practical number of files the probability of collision is going to be just about zero. If you look at the table in the link, you'll see that a 128-bit hash with 10^10 files has a collision probability of about 10^-18.
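That table entry follows from the usual birthday approximation p ≈ n^2 / 2^(b+1), which is easy to check (a sketch; n = 10^10 files):

```python
def collision_probability(n_files: float, hash_bits: int) -> float:
    """Birthday-bound approximation: p ≈ n^2 / 2^(bits + 1)."""
    return n_files ** 2 / 2 ** (hash_bits + 1)

print(collision_probability(1e10, 128))   # ~1.5e-19, i.e. around the 10^-18 quoted above
print(collision_probability(1e10, 160))   # ~3.4e-29 for a SHA-1-sized digest
```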
As long as the probability is low enough I think the solution is good. Compare with the probability of the planet being hit by an asteroid, undetectable errors in the disk drive, bits flipping in your memory etc. - as long as those probabilities are low enough we don't worry about them because they'll "never" happen. Just take enough margin and make sure this isn't the weakest link.
One thing to be concerned about is the choice of the hash function and its possible vulnerabilities. Is there any other authentication in place, or does the user simply present a path and retrieve a file?
If you think about an attacker trying to brute force the scenario above, they would need to request on the order of 10^18 files before they could hit some other random file stored in the system (again assuming a 128-bit hash and 10^10 files; you'll have far fewer files and a longer hash). 10^18 is a pretty big number, and the speed at which you can brute force this is limited by the network and the server. A simple lock-the-user-out-after-x-attempts policy can completely close this hole (which is why many systems implement this sort of policy). Building a secure system is complicated and there will be many points to consider, but this sort of scheme can be perfectly secure.
Hope this is useful...
EDIT: another way to think about this is that practically every encryption or authentication system relies on certain events having very low probability for its security. E.g. I could get lucky and guess a prime factor of a 512-bit RSA key, but it is so unlikely that the system is considered very secure.
Whilst the probability of a collision might be vanishingly small, imagine serving a highly confidential file from one customer to their competitor just because there happens to be a hash collision.
= end of business
I'd rather use hashing for things that were less critical when collisions DO occur ;-)
If you have a database, store the files under GUIDs - so not an incrementing index, but a proper globally unique identifier. They work nicely when it comes to distributed shards / high availability etc.
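Generating such an identifier is a one-liner on most platforms; for example, a sketch in Python:

```python
import uuid

# A random (version 4) GUID as the stored filename, independent of the client-supplied path.
filename = str(uuid.uuid4())
print(filename)   # 36 characters, e.g. 'xxxxxxxx-xxxx-4xxx-xxxx-xxxxxxxxxxxx'
```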
Imagine the worst-case scenario and assume it will happen the week after you are featured in Wired magazine as an amazing startup... that's a good stress test for the algorithm.

Is perfect hashing without buckets possible?

I've been asked to look for a perfect hash/one way function to be able to hash 10^11 numbers.
However, as we'll be using an embedded device, it won't have the memory to store the relevant buckets, so I was wondering whether it's possible to have a decent (minimal) perfect hash without them?
The plan is to use the device to hash the number(s), and then we use a rainbow table or a file, using the hash as the offset.
Cheers
Edit:
I'll try to provide some more info :)
1) 10^11 is actually now 10^10, so that makes it easier. This number is the number of possible combinations; we could get a number between 0000000001 and 10000000000 (10^10).
2) The plan is to use it as part of a one-way function to make the number secure so we can send it by insecure means.
We will then look up the original number at the other end using a rainbow table.
The problem is that the source devices generally have 512 KB to 4 MB of memory to use.
3) It must be perfect - we 100% cannot have a collision.
Edit2:
4) We can't use encryption, as we've been told it's not really possible on the devices, and key management would be a nightmare even if we could.
Edit3:
As this is not sensible, it's purely an academic question now (I promise).
Okay, since you've clarified what you're trying to do, I rewrote my answer.
To summarize: Use a real encryption algorithm.
First, let me go over why your hashing system is a bad idea.
What is your hashing system, anyway?
As I understand it, your proposed system is something like this:
Your embedded system (which I will call C) is sending some sort of data with a value space of 10^11. This data needs to be kept confidential in transit to some sort of server (which I will call S).
Your proposal is to send the value hash(salt + data) to S. S will then use a rainbow table to reverse this hash and recover the data. salt is a shared value known to both C and S.
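To make the proposed scheme concrete, here is roughly what C would compute (a sketch only; the salt value, the integer range, and the use of SHA-1 are placeholder assumptions, not details from the question):

```python
import hashlib

SHARED_SALT = "s3cr3t-salt"   # known to both C and S -- note that this is effectively a key

def encode(value: int) -> str:
    """What the embedded client C would send: hash(salt + value)."""
    return hashlib.sha1((SHARED_SALT + str(value)).encode()).hexdigest()

# S would have to precompute hash -> value for all 10^10 possible values (the
# "rainbow table") and rebuild that table from scratch whenever the salt changes.
print(encode(1234567890))
```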
This is an encryption algorithm
An encryption algorithm, when you boil it down, is any algorithm that gives you confidentiality. Since your goal is confidentiality, any algorithm that satisfies your goals is an encryption algorithm, including this one.
This is a very poor encryption algorithm
First, there is an unavoidable chance of collision. Moreover, the set of colliding values differs each day.
Second, decryption is extremely CPU- and memory-intensive even for the legitimate server S. Changing the salt is even more expensive.
Third, although your stated goal is avoiding key management, your salt is a key! You haven't solved key management at all; anyone with the salt will be able to crack the message just as well as you can.
Fourth, it's only usable from C to S. Your embedded system C will not have enough computational resources to reverse hashes, and can only send data.
This isn't any faster than a real encryption algorithm on the embedded device
Most secure hashing algorithms are just as computationally expensive as a reasonable block cipher, if not worse. For example, SHA-1 requires doing the following for each 512-bit block:
Allocate 12 32-bit variables.
Allocate 80 32-bit words for the expanded message
64 times: Perform three array lookups, three 32-bit xors, and a rotate operation
80 times: Perform up to five 32-bit binary operations (some combination of xor, and, or, and not, depending on the round); then a rotate, an array lookup, four adds, another rotate, and several memory loads/stores.
Perform five 32-bit two's-complement adds
There is one chunk per 512-bits of the message, plus a possible extra chunk at the end. This is 1136 binary operations per chunk (not counting memory operations), or about 16 operations per byte.
For comparison, the RC4 encryption algorithm requires four operations (three additions, plus an xor on the message) per byte, plus two array reads and two array writes. It also requires only 258 bytes of working memory, vs a peak of 368 bytes for SHA-1.
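For reference, the whole of RC4 fits comfortably on a small device (a sketch; it drops the early keystream bytes as recommended below, and bear in mind that RC4 is considered weak by modern standards):

```python
def rc4_keystream(key: bytes, length: int, drop: int = 4096) -> bytes:
    # Key-scheduling algorithm (KSA): permute S under the key.
    S = list(range(256))
    j = 0
    for i in range(256):
        j = (j + S[i] + key[i % len(key)]) & 0xFF
        S[i], S[j] = S[j], S[i]

    # Pseudo-random generation (PRGA); discard the first `drop` bytes of output.
    out = bytearray()
    i = j = 0
    for k in range(drop + length):
        i = (i + 1) & 0xFF
        j = (j + S[i]) & 0xFF
        S[i], S[j] = S[j], S[i]
        if k >= drop:
            out.append(S[(S[i] + S[j]) & 0xFF])
    return bytes(out)

# Encryption and decryption are the same operation: XOR with the keystream.
message = b"hello"
stream = rc4_keystream(b"my-key", len(message))
ciphertext = bytes(m ^ s for m, s in zip(message, stream))
```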
Key management is fundamental
With any confidentiality system, you must have some sort of secret. If you have no secrets, then anyone else can implement the same decoding algorithm, and your data is exposed to the world.
So, you have two choices as to where to put the secrecy. One option is to make the encipherment/decipherment algorithms secret. However, if the code (or binaries) for the algorithm is ever leaked, you lose - it's quite hard to replace such an algorithm.
Thus, secrets are generally made easy to replace - this is what we call a key.
Your proposed usage of hash algorithms would require a salt - this is the only secret in the system and is therefore a key. Whether you like it or not, you will have to manage this key carefully. And it's a lot harder to replace if leaked than other keys - you have to spend many CPU-hours generating a new rainbow table every time it's changed!
What should you do?
Use a real encryption algorithm, and spend some time actually thinking about key management. These issues have been solved before.
First, use a real encryption algorithm. AES has been designed for high performance and low RAM requirements. You could also use a stream cipher like RC4 as I mentioned before - the thing to watch out for with RC4, however, is that you must discard the first 4 kilobytes or so of output from the cipher, or you will be vulnerable to the same attacks that plagued WEP.
Second, think about key management. One option is to simply burn a key into each client, and physically go out and replace it if the client is compromised. This is reasonable if you have easy physical access to all of the clients.
Otherwise, if you don't care about man-in-the-middle attacks, you can simply use Diffie-Hellman key exchange to negotiate a shared key between S and C. If you are concerned about MitMs, then you'll need to start looking at ECDSA or something to authenticate the key obtained from the D-H exchange - beware that when you start going down that road, it's easy to get things wrong, however. I would recommend implementing TLS at that point. It's not beyond the capabilities of an embedded system - indeed, there are a number of embedded commercial (and open source) libraries available already. If you don't implement TLS, then at least have a professional cryptographer look over your algorithm before implementing it.
There is obviously no such thing as a "perfect" hash unless you have at least as many hash buckets as inputs; if you don't, then inevitably it will be possible for two of your inputs to share the same hash bucket.
However, it's unlikely you'll be storing all the numbers between 0 and 10^11. So what's the pattern? If there's a pattern, there may be a perfect hash function for your actual data set.
It's really not that important to find a "perfect" hash function anyway, though. Hash tables are very fast. A function with a very low collision rate - and when hashing integers, that means nearly any simple function, like modulus - is fine and you'll get O(1) average performance.
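For instance, a toy bucketed table keyed by integers (a sketch; the bucket count is arbitrary):

```python
# Modulus as the hash function: cheap, and colliding keys just share a small bucket.
NUM_BUCKETS = 1024
buckets = [[] for _ in range(NUM_BUCKETS)]

def insert(value: int) -> None:
    buckets[value % NUM_BUCKETS].append(value)

def contains(value: int) -> bool:
    return value in buckets[value % NUM_BUCKETS]   # scans only one short bucket

for v in (42, 7_700_000_001, 123_456_789):
    insert(v)
print(contains(7_700_000_001))   # True
```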

Hash fragments and collisions, continued

For this application of mine, I feel like I can get away with a 40-bit hash key, which seems awfully low, but see if you can confirm my reasoning (I want a small key because I want a small filename, and the key will be converted to a filename):
(Note: only accidental collisions are a concern - no security issues.)
A key point here is that the population in question is divided into groups, and a collision is only relevant if it occurs within the same group. A "group" is a directory on a user's system (the contents of files are hashed, and a collision is only relevant if it occurs for files within the same directory). Speculating roughly 100,000 potential users, say 2^17, that corresponds to 2^18 "groups", assuming 2 directories per user on average. So with a 40-bit key I can expect 2^(20+9) files to be created (among all users) before a collision occurs for some user somewhere (or, in other words, 2^((40+18)/2), due to the "birthday effect"). That's an average of 4096 unique files created per user, for 2^17 users, before a single collision occurs for some user somewhere. And then that long again before another collision occurs somewhere (right?)
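For concreteness, the arithmetic above works out like this (a quick sketch, just plugging in the stated assumptions):

```python
key_bits = 40
group_bits = 18                  # ~2^17 users x 2 directories each
user_bits = 17

total_files = 2 ** ((key_bits + group_bits) // 2)     # birthday bound across all groups
print(total_files)                                     # 2^29 = 536,870,912 files overall
print(total_files // 2 ** user_bits)                   # ~4096 files per user on average
```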
Your math looks reasonable, but I'm left wondering why you'd bother with this at all. If you want to create unique file names, why not just assign a number to each user and keep a serial number for that user? When you need a file name, basically just concatenate the user number with the serial number (both padded to the correct number of digits). If you feel that you need to obfuscate those numbers, run that result through a 40-bit encryption (which will guarantee that a unique input produces a unique output).
If, for example, you assign 20 bits to each, you can have 2^20 users create 2^20 documents apiece before there's any chance of a collision at all.
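A sketch of that user-number-plus-serial scheme (the 20/20-bit split and the hex padding are just the example figures above; the helper name is made up):

```python
def file_name(user_id: int, serial: int) -> str:
    """Pack a 20-bit user number and a 20-bit per-user serial into one 40-bit id."""
    assert 0 <= user_id < 2 ** 20 and 0 <= serial < 2 ** 20
    combined = (user_id << 20) | serial
    return f"{combined:010x}"         # 40 bits = 10 hex digits, zero-padded

print(file_name(user_id=17, serial=1))   # '0001100001'
```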
If you don't mind serialized access to it, you could just use a single 40-bit counter instead. The advantage of this is that a single user wouldn't immediately use up 2^20 serial numbers, even though the average user is unlikely to ever create nearly that many documents.
Again, if you think you need to obfuscate this number for some reason, you can use a 40-bit encryption algorithm in counter mode (i.e., use a serial number, but encrypt it), which (again) guarantees that each input maps to a unique output. This guarantees no collision until/unless your users create 2^40 documents (i.e., the maximum possible with only 40 bits). Alternatively, you could create a 40-bit full-range linear feedback shift register to create your pseudo-random 40-bit numbers. This might be marginally less secure, but has the advantage of being faster and simpler to implement.