Chance of generating 2 equal hashes based on time()?

The question is pretty simple, but I couldn't find an answer for this one... Basically, my application is generating filenames with md5(time());.
What are the chances, if any, that using this technique, I'll have 2 equal results?
P.S. Since my question title says hashes, not one exact hash, what are the chances, again, of generating equal results with other hash functions such as sha1(), sha512(), etc.?
Thanks in advance!

My estimation is that it is unsafe due to possible changes to the clock by humans and by processes such as NTP, which FrankH has kindly noted. I highly recommend using a cryptographically secure RNG (random number generator) if your framework allows it.

Equal results are unlikely to result from this; you can validate that yourself by checking the uniqueness of md5(0) ... md5(INT32_MAX), since that's the total range of a 32-bit time_t. I don't think there are collisions in that input space for any of the hashes you've named.
Predictable results are another matter, though. By choosing time() as your input supplier, you restrict yourself to, well, one unique hash per second, and no more than 86400 per day, ...
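For illustration, here is a minimal Python sketch of the CSPRNG approach recommended above; the 16-byte length and the ".log" extension are just assumptions, not requirements:
import secrets

# Draw the filename from a cryptographically secure RNG instead of md5(time()),
# so two files created in the same second (or after a clock change) won't collide in practice.
filename = secrets.token_hex(16) + ".log"   # 32 hex chars = 128 random bits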

Related

Picking a check digit algorithm

I am generating random OTP-style strings that serve as a short-term identifier to link two otherwise unrelated systems (which have authentication at each end). These need to be read and re-entered by users, so in order to reduce the error rate and reduce the opportunities for forgery, I'd like to make one of the digits a check digit. At present my random string conforms to the pattern (removing I and O to avoid confusion):
^[ABCDEFGHJKLMNPQRSTUVWXYZ][0-9]{4}$
I want to append one extra decimal digit for the check. So far I've implemented this as a BLAKE2 hash (from libsodium) that's converted to decimal and truncated to 1 char. This gives only 10 possibilities for the check digit, which isn't much. My primary objective is to detect single character errors in the input.
This approach kind of works, but it seems that one digit is not enough to detect single-character errors, and undetected errors are quite easy to find; for example, K37705 and K36705 are both considered valid.
I do not have a time value baked into this OTP; instead it's purely random and I'm relying on keeping a record of the OTPs that have been generated recently for each user, which are deleted periodically, and I'm reducing opportunities for brute-forcing by rate and attempt-count limiting.
I'm guessing that BLAKE2 isn't a good choice here, but given there are only 10 possibilities for the result, I don't know that others will be better. What would be a better algorithm/approach to use?
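(For reference, one dedicated check-digit construction in this vein is the Luhn mod N algorithm, computed directly over the code's alphabet rather than via a truncated hash. A rough Python sketch follows; the alphabet ordering and placing the check character last are assumptions, and the check character can be any symbol of the alphabet rather than only 0-9.)
# Luhn mod N over the question's alphabet (digits plus letters without I and O).
ALPHABET = "0123456789ABCDEFGHJKLMNPQRSTUVWXYZ"

def luhn_check_char(code: str) -> str:
    # Returns the character that makes code + check pass the Luhn mod N test.
    n, factor, total = len(ALPHABET), 2, 0
    for ch in reversed(code):
        addend = factor * ALPHABET.index(ch)
        total += addend // n + addend % n   # sum the base-n digits of the addend
        factor = 1 if factor == 2 else 2    # alternate the weight
    return ALPHABET[(n - total % n) % n]

def is_valid(code_with_check: str) -> bool:
    return luhn_check_char(code_with_check[:-1]) == code_with_check[-1]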
Frame challenge
Why do you need a check digit?
It doesn't improve security, and five digits are trivial for most humans to get correct. Check it server side and return an error message if it's wrong.
Normal TOTP tokens are commonly 6 digits, and actors such as Google have determined that people in general manage to get them correct.

How can Hashing prevent collision when mapping key-value pairs?

If the point of hashing is to prevent collisions on key-value pairs (a map), how can hashing achieve this? If you give a hashing algorithm the name "John Smith", wouldn't it give you the same result every time? So what's the difference from just using "John Smith" without hashing the string? That is, if we had two different John Smiths with differing values, how do we know which one we're supposed to pick, since chaining basically just puts all the John Smiths in one bucket, and the two John Smiths might return different values, so we would need to know which one it is.
I've tried googling it but I couldn't find good answers, I would really appreciate some explanation.
Thanks in advance!
I think you've misunderstood the idea of hashing. Normally, a hash function should return the same value every time you apply it; in other words, it should be deterministic.
Hashing has other benefits:
It can be used to calculate an index from an object, which can then be used, for example, in hash-sets or hash-maps. A hash-map calculates the hash of the key and then stores the value associated with that key at the position given by the hash in an array.
It makes it easier to check whether a big collection of objects contains duplicates.
You can store the hash-values of passwords instead of the passwords themselves. When the user wants to log in, you check whether the entered password produces the same hash-value that you initially stored.
There is a collision if two different inputs are mapped to the same hash-value; in other words, the hashing function is not injective. Ideally, there is no hash-value that is produced significantly more often than others, which means that the resulting values are evenly distributed.
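A minimal Python sketch of separate chaining may make this concrete: the hash only selects a bucket, and the full key is still stored so that colliding keys remain distinguishable (the class and its sizing are illustrative only):
class ChainedHashMap:
    def __init__(self, size=16):
        self.buckets = [[] for _ in range(size)]

    def _index(self, key):
        return hash(key) % len(self.buckets)   # the hash picks the bucket

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # same key: overwrite the value
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # new key (possibly a collision): chain it

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:                  # compare full keys, not just hashes
                return v
        raise KeyError(key)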

How do I store Argon2 passwords in my database?

I'm trying to store user passwords in my DB using Argon2 algorithm.
This is what I obtain by using it:
$ echo -n "password" | argon2 "smallsalt" -id -t 4 -m 18 -p 4
Type: Argon2id
Iterations: 4
Memory: 262144 KiB
Parallelism: 4
Hash: cb4447d91dd62b085a555e13ebcc6f04f4c666388606b2c401ddf803055f54ac
Encoded: $argon2id$v=19$m=262144,t=4,p=4$c21hbGxzYWx0$y0RH2R3WKwhaVV4T68xvBPTGZjiGBrLEAd34AwVfVKw
1.486 seconds
Verification ok
In this case, what should I store in the DB?
The "encoded" value as shown above?
The "hash" value as shown above?
Neither, but another solution?
Please, could you help me? I'm a newbie with this and I'm a little bit lost.
I'm a bit late to the party, but I disagree with the previous answers.
You should store the field: Encoded
The $argon2id$.... value.
(At least if you are using a normal Argon2 library that has a verify() function.
It does not look like the man page for the argon2 command offers this, however.
Only if you are stuck with the command line should you consider storing each field individually.)
The $argon2id$ encoded hash
The argon2 encoded hash follows the same syntax as its older cousin bcrypt.
The encoded hash includes all you ever need to verify the hash when the user logs in.
It is most likely more future proof. When a newer and better Argon2 comes along, you can upgrade your one-column hashed passwords, just like you could detect bcrypt's $2a$ hashes and re-hash them as $argon2id$ hashes the next time the user logs in (if you were moving from bcrypt to argon2).
TL;DR
Store the $-encoded value encoded_hash in your database.
Use argon2.verify(password, encoded_hash) to verify that the password is correct.
Don't bother about all the values inside the hash. Let the library do that for you. :)
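As an illustration, this is roughly what that looks like with the argon2-cffi library in Python (other bindings expose an equivalent hash/verify pair); the password literal is obviously a placeholder:
from argon2 import PasswordHasher
from argon2.exceptions import VerifyMismatchError

ph = PasswordHasher()                      # library-default parameters
encoded = ph.hash("correct horse")         # the single $argon2id$... string to store

try:
    ph.verify(encoded, "correct horse")    # salt and parameters are read from the string
    if ph.check_needs_rehash(encoded):     # parameters changed since hashing?
        encoded = ph.hash("correct horse") # upgrade the stored hash on login
except VerifyMismatchError:
    print("wrong password")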
Neither. Save the following as a single value:
algorithm ID (e.g. argon2id)
salt
number of iterations (4)
memory usage factor (18)
parallelism (4)
The output of the field "encoded" is misleading, because you cannot use it as is for password checks (i.e. for hash generation): it contains e.g. m=262144, whereas for a password check you need the original factor m=18.
Are you going to launch an OS process each time you check a password? I would discourage you from doing this. I'd suggest you use a library (C++, Java, ...). They produce a string that contains all of this data concatenated and separated with "$".
I'd put the type, iterations, memory, parallelism, hash, salt and corresponding user id into separate columns and leave the encoded bit out, because it's just all the attributes joined together. If they're in separate columns then you can reference the attributes more easily than having to split and index the encoded string.
The other option is to just store the encoded string in one column, but as I said, it's more tedious to look at certain attributes, as you'd have to split the encoded string and then index into it.
I had the same question and read this post while gathering some information. Now after some days and thoughts about all this, I'll personally take a different route than the accepted answer and therefore slightly disagree with it. I thought I would share my perspective so that it might help others as well.
I suppose it will depend on everyone's context. I don't think there is a one size fits all answer here. I'm sure there are situations where it is perfectly valid and even better/simpler to store the encoded string ($argon2...).
However, I would argue that depending on the context, storing the encoded string doesn't seem to be the right approach.
First of all, it makes the hashing method very obvious. That is probably not that important, but for some reason it makes me a bit more comfortable not having it ^^. But, more importantly, it means that implementation details are stored in your persistence layer (DB or else). At the time of writing, argon2id is the hashing mechanism recommended by OWASP, but these things can change (and eventually do change...). Some day it might be considered insecure, or another function will be considered more secure.
As a result, I would suggest this more function "agnostic" starting point:
The hash (for argon2 -> the hex string)
The salt
The last_modified date
A string with hashing parameters (for argon2, you could put the parameters here in the form of your choosing)
The last_modified date lets you know whether the hash needs updating, and the parameters support the verification and update of "old" hashes.
Of course this means that you have to work a bit more in the code and can't simply use every library's shortcuts straight away. But I would say that this increased complexity offers more flexibility in other circumstances (like moving away from a given hashing function). As always, there is no free lunch.
That's why I suppose it depends on your context and why personally I wouldn't go with the accepted answer in my situation.
PS: I'm no cryptography expert nor some devsecop guru. So feel free to contradict, enrich, agree or disagree. I just like to keep my implementation details contained ;)
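To make the idea concrete, here is a small Python sketch of such a function-agnostic record; the field names, the params string format, and the policy check are all assumptions of mine, not a standard:
from dataclasses import dataclass
from datetime import datetime

CURRENT_PARAMS = "argon2id;v=19;m=262144;t=4;p=4"   # whatever your current policy is

@dataclass
class StoredPassword:
    hash_hex: str            # the raw hash output, hex-encoded
    salt: str
    params: str              # e.g. "argon2id;v=19;m=262144;t=4;p=4"
    last_modified: datetime

def needs_rehash(record: StoredPassword) -> bool:
    # Re-hash on the next successful login whenever the stored parameters lag behind policy.
    return record.params != CURRENT_PARAMS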

hashing strings

I have streaming strings (text containing words and numbers).
Taking one line of the stream at a time, I would like to assign a unique value to it.
Examples might be (strings with their scores/hashes):
User1 logged in Comp1 port8087 1109
User2 logged in comp2 1135
user3 logged in port8080 1098
user1 logged in comp2 port8080 1178
These strings should be in the same cluster. For this, what I have thought of is mapping (a crude kind of hashing) the strings such that a small change in the string won't affect the score that much.
One simple way of doing that may be taking UliCp8, Ulic, ... (i.e. the first letter of each word) and finding some way of scoring them. After that, similarly scored strings are kept in the same bucket and later sub-grouped.
The improved method would be: instead of taking just the first letter of each word of the string, find some way to take a representative value of each word, such that the resulting string representation is suitable for mapping to a score/hash as I mentioned.
Considering Levenshtein distance, the Jaccard index, or other similarity/distance metrics, all of them require inputting pairs of strings for comparison. Isn't there any method to hash/score a string as stated, without doing pairwise comparisons? (POS tagging and pairwise comparison look inefficient for my purpose, as the data are streaming, huge in number, and unstructured.)
I hope you understand what I want to achieve; please help me out. Forget about the comments below and let's restart.
"at least two similar word (not considering length) should have similar hash value"
This goes against the most basic requirements of a hash function. With a hash function, even minimal changes to the input should produce drastic changes to the bucket the hash falls into.
You are looking for an algorithm that calculates the similarity or distance between two inputs.
As stated, you are not looking for a hash function, but rather something like the Levenshtein distance, which is an algorithm for calculating a metric representing the degree of difference between two sequences. It is commonly used to find out how similar/dissimilar two strings are. Hashing / message digests are good for creating identifiers for unique, distinct values, but they will produce entirely different results for "similar" values.
You are interested in the similarity of strings. Here is a nice post that names a few resources that are used for measuring string similarity. Maybe Lucene could help you in your situation.
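For completeness, a small Python sketch of the Levenshtein distance mentioned above (two-row dynamic programming); the log lines used in the comment are taken from the question:
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Similar lines get a small distance, e.g.
# levenshtein("User1 logged in Comp1 port8087", "user1 logged in comp2 port8080")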

How can I generate a unique, small, random, and user-friendly key?

A few months back I was tasked with implementing a unique and random code for our web application. The code would have to be user friendly and as small as possible, but still be essentially random (so users couldn't easily predict the next code in the sequence).
It ended up generating values that looked something like this:
Af3nT5Xf2
Unfortunately, I was never satisfied with the implementation. GUIDs were out of the question; they were simply too big and difficult for users to type in. I was hoping for something more along the lines of 4 or 5 characters/digits, but our particular implementation would generate noticeably patterned sequences if we encoded to fewer than 9 characters.
Here's what we ended up doing:
We pulled a unique sequential 32-bit id from the database. We then inserted it into the center bits of a 64-bit random integer. We created a lookup table of easily typed and recognized characters (A-Z, a-z, 2-9, skipping easily confused characters such as L, l, 1, O, 0, etc.). Finally, we used that lookup table to base-54 encode the 64-bit integer. The high bits were random, the low bits were random, but the center bits were sequential.
The final result was a code that was much smaller than a guid and looked random, even though it absolutely wasn't.
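A rough Python sketch of the scheme described above; the exact alphabet, the bit layout, and the helper names are illustrative, not the original implementation:
import secrets

# 54 easily typed characters: A-Z, a-z, 2-9 minus look-alikes (I, L, O, i, l, o, 0, 1).
ALPHABET = "ABCDEFGHJKMNPQRSTUVWXYZabcdefghjkmnpqrstuvwxyz23456789"

def encode(n: int) -> str:
    base, out = len(ALPHABET), []
    while n:
        n, r = divmod(n, base)
        out.append(ALPHABET[r])
    return "".join(reversed(out)) or ALPHABET[0]

def make_code(sequential_id: int) -> str:
    hi, lo = secrets.randbits(16), secrets.randbits(16)
    # random high bits | 32 sequential bits in the middle | random low bits
    value = (hi << 48) | ((sequential_id & 0xFFFFFFFF) << 16) | lo
    return encode(value)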
I was never satisfied with this particular implementation. What would you guys have done?
Here's how I would do it.
I'd obtain a list of common English words with usage frequency and some grammatical information (like is it a noun or a verb?). I think you can look around the intertubes for some copy. Firefox is open-source and it has a spellchecker... so it must be obtainable somehow.
Then I'd run a filter on it so that obscure words are removed and words which are too long are excluded.
Then my generation algorithm would pick 2 words from the list, concatenate them, and add a random 3-digit number.
I could also randomize the word-selection pattern between verbs/nouns, like
eatCake778
pickBasket524
rideFlyer113
etc..
The case needn't be camel case; you can randomize that as well. You can also randomize the placement of the number and the verb/noun.
And since that's a lot of randomizing, Jeff's The Danger of Naïveté is a must-read. Also make sure to study dictionary attacks well in advance.
And after I'd implemented it, I'd run a test to make sure that my algorithm doesn't collide in practice. If the collision rate was high, then I'd play with the parameters (number of nouns used, number of verbs used, length of the random number, total number of words, different kinds of casing, etc.).
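A small Python sketch of that generation idea; the word lists here are tiny placeholders rather than the frequency-filtered dictionary described above:
import secrets

VERBS = ["eat", "pick", "ride", "build", "carry"]
NOUNS = ["cake", "basket", "flyer", "ladder", "garden"]

def make_code() -> str:
    verb = secrets.choice(VERBS)            # CSPRNG-backed choice
    noun = secrets.choice(NOUNS)
    number = secrets.randbelow(1000)        # 0-999
    return f"{verb}{noun.capitalize()}{number:03d}"   # e.g. "pickBasket524"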
In .NET you can use the RNGCryptoServiceProvider method GetBytes(), which will "fill an array of bytes with a cryptographically strong sequence of random values" (from the MS documentation).
using System.Security.Cryptography;

byte[] randomBytes = new byte[4];
RNGCryptoServiceProvider rng = new RNGCryptoServiceProvider();
rng.GetBytes(randomBytes); // fills the array with cryptographically strong random bytes
You can increase the length of the byte array and pluck out the character values you want to allow.
In C#, I have used the 'System.IO.Path.GetRandomFileName() : String' method... but I was generating salt for debug file names. This method returns stuff that looks like your first example, except with a random '.xyz' file extension too.
If you're in .NET and just want a simpler (but not 'nicer' looking) solution, I would say this is it... you could remove the random file extension if you like.
At the time of this writing, this question's title is:
How can I generate a unique, small, random, and user-friendly key?
To that, I should note that it's not possible in general to create a random value that's also unique, at least if each random value is generated independently of any other. In addition, there are many things you should ask yourself if you want to generate unique identifiers (which come from my section on unique random identifiers):
Can the application easily check identifiers for uniqueness within the desired scope and range (e.g., check whether a file or database record with that identifier already exists)?
Can the application tolerate the risk of generating the same identifier for different resources?
Do identifiers have to be hard to guess, be simply "random-looking", or be neither?
Do identifiers have to be typed in or otherwise relayed by end users?
Is the resource that an identifier identifies available to anyone who knows that identifier (even without being logged in or authorized in some way)?
Do identifiers have to be memorable?
In your case, you have several conflicting goals: You want identifiers that are—
unique,
easy to type by end users (including small), and
hard to guess (including random).
Important points you don't mention in the question include:
How will the key be used?
Are other users allowed to access the resource identified by the key, whenever they know the key? If not, then additional access control or a longer key length will be necessary.
Can your application tolerate the risk of duplicate keys? If so, then the keys can be completely randomly generated (such as by a cryptographic RNG). If not, then your goal will be harder to achieve, especially for keys intended for security purposes.
Note that I don't go into the issue of formatting a unique value into a "user-friendly key". There are many ways to do so, and they all come down to mapping unique values one-to-one with "user-friendly keys" — if the input value was unique, the "user-friendly key" will likewise be unique.
If by user-friendly you mean that a user could type the answer in, then I think you would want to look in a different direction. I've seen and done implementations for initial random passwords that pick random words and numbers as an easier and less error-prone string.
If, though, you're looking for a way to encode a random code in the URL string, which is an issue I've dealt with for a while, then what I have done is use 64-bit encoded GUIDs.
You could load your list of words as chakrit suggested into a data table or xml file with a unique sequential key. When getting your random word, use a random number generator to determine what words to fetch by their key. If you concatenate 2 of them, I don't think you need to include the numbers in the string unless "true randomness" is part of the goal.