Is there an idempotent hash function?

Is there a hash function that is idempotent? I know MD5 and SHA256 are not:
$ echo -n "hello world" | md5sum
5eb63bbbe01eeed093cb22bb8f5acdc3 -
$ echo -n "5eb63bbbe01eeed093cb22bb8f5acdc3" | md5sum
c0b0ef2d0f76f0133b83a9b82c1c7326 -
$ echo -n "hello world" | sha256sum
b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9 -
$ echo -n "b94d27b9934d3e08a52e52d7da7dabfac484efe37a5380ee9088f7ace2efcde9" | sha256sum
049da052634feb56ce6ec0bc648c672011edff1cb272b53113bbc90a8f00249c -
Is there a hash algorithm that can do something like this?
$ echo -n "hello world" | idempotentsum
abcdef1234567890
$ echo -n "abcdef1234567890" | idempotentsum
abcdef1234567890
If such an algorithm does exist, is it useful cryptographically? That is, with reasonable inputs, is it computationally infeasible to recover the input from a known output?
If such an algorithm does not exist, does it not exist because nobody has bothered to find it or is it a mathematical impossibility?
Context
I'm working on a system where a user may want to save a password in a password manager. A particularly paranoid user may prefer to save the password in a hashed form rather than in plain text. I'd like the user to be able to authenticate with this hashed password. Rather than simply trying the authentication twice (once assuming the user's password is hashed and once assuming it is not), I wondered if there was an algorithm to let me only do it once.
I know there are alternative ways of allowing users to store authentication tokens rather than plain-text passwords. But this idea popped into my head, and I am curious. I couldn't find anything about this on Google or SO.
EDIT: I am not suggesting that allowing a user to authenticate with a hashed password means it is OK for the server to not salt/hash the password. The server must still salt/hash the original password or the client-side hashed password.
EDIT: I am not suggesting that allowing the user to log in with a client-side hashed password is a genuine security improvement. As far as I know the only possible benefit this would add is if the user used this password for more than one purpose. In that case, if the user's hashed password was discovered by an attacker, then only access to my service would be compromised rather than all services sharing that password. However, best practice is to not use the same password for multiple services.

Such a function is actually quite easy to construct, and it doesn't weaken the cryptography of the system (except in one obvious and unavoidable way). We can transform any hash function into an idempotent one, so long as we have a way of identifying whether a given value is a potential output of the hash function (more formally, whether an element of the domain is also an element of the range).
(One way to do this is simply to check the size and format of the input, since most hash functions emit uniformly distributed values of a fixed length. This ignores the possibility of misidentifying a value that the hash function can never actually output, but handling that would be specific to the individual hash function.)
We then create a new function that checks whether a value could be an output of the hash function and, if so, returns it unchanged; otherwise, it hashes the value as normal. This new function is as secure as the original except when hashing values in its range, for which it is completely insecure, but that is unavoidable in an idempotent hash function.
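To make this concrete, here is a minimal Python sketch (an illustration, not a standard algorithm), assuming the range-membership test is simply "does the input look like a 64-character lowercase-hex SHA-256 digest", with the false-positive caveat noted above:
import hashlib
import re

# Sketch only: treat any 64-character lowercase-hex string as a potential
# SHA-256 output and return it unchanged; hash everything else. As noted
# above, this misclassifies hex-looking inputs the hash never actually emits.
HEX_DIGEST = re.compile(r"[0-9a-f]{64}")

def idempotent_sha256(data: str) -> str:
    if HEX_DIGEST.fullmatch(data):
        return data
    return hashlib.sha256(data.encode()).hexdigest()

assert idempotent_sha256(idempotent_sha256("hello world")) == idempotent_sha256("hello world")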

If such an algorithm does exist, is it useful cryptographically?
Well, consider this: a hash typically is a map between two sets:
A -> B
where B is the set of possible hashes, and A is the set of things that are hashable.
Now, usually A is much bigger than B -- hashes are like shorter "checksums" that can be calculated from much larger streams of data.
Typically, you'd still want as few collisions as possible in your hash, meaning that statistically, every element of B should have roughly the same number of elements of A mapping to it, and the elements of A that map to the same element of B should be "far away" from each other under some metric. This means the hash tries hard to cover the entire set of words of a fixed length. It is immensely harder to find a systematic function that does all of that yet also maps every element of B to itself; you are "enforcing" collisions. In general, that's a cryptographic weakness, and a serious one at that.
Now, considering your password case: I don't see how that would make sense. It's cryptographically a bad idea to let a user authenticate with either the hashed password or the plain one, because no matter what you do, an eavesdropper who captures either form has everything needed to forge authentication.

Related

Is there a standard or alternative for shorter UUIDs?

The UUID standard has several versions. Version 4, for example, is based on completely random input, but it still encodes version and variant information, leaving only 122 usable bits of the possible 128.
For transferring these via HTTP, it is more efficient to encode them in Base64. There are libraries for this (https://github.com/skorokithakis/shortuuid).
But what I am wondering is: is there an alternative standard for shorter ID strings? Of course I could slap together a version byte + n random bytes and Base64-encode them, giving my own working "short, random ID scheme", but I wonder whether anyone has already specified an alternative before I make my own.
There is no standard for anything shorter.
Numerous folks have asked the same question and all come to the same conclusion: UUIDs are overkill for their specific requirements. And they developed their own alternatives, which essentially boil down to a random string of some length (based on the expected size of their dataset) in a more efficient encoding like base64. Many of them have published libraries for working with these strings. However, all of them suffer from the same two problems:
They cannot handle arbitrarily large datasets (the reason UUIDs are so big)
They are not standardized, which means there is no real interop.
If neither of these problems affects you, then feel free to pick any of the various schemes available or roll your own. But do consider the future cost if you discover you're wrong about your requirements, and whether the minuscule advantage in space or complexity is worth the price of migrating away from a proven, universal system later.
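For illustration, here is a minimal Python sketch of the roll-your-own scheme the question describes (version byte + n random bytes, Base64-encoded); the version constant and the 9-byte length are arbitrary choices, not any standard:
import base64
import os

VERSION = b"\x01"  # arbitrary scheme marker, not standardized

def short_id(n_random_bytes: int = 9) -> str:
    # 9 random bytes = 72 bits of entropy; size this to your expected dataset
    raw = VERSION + os.urandom(n_random_bytes)
    # URL-safe Base64 with the '=' padding stripped keeps the string short
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")

print(short_id())  # e.g. 'AfQ9x2kP7LmWcQ' (different on every call)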
I have just found https://github.com/ai/nanoid
It is not really a "standard", but at least it is not an arbitrary scheme I would have come up with myself. It is shorter thanks to a smarter encoding (a larger alphabet), and it is fast.
A quick and dirty alternative is mktemp, depending on your requirements for security, uniqueness and your access to a shell.
Use the form mktemp -u XXXXXXXX
-u: dry-run, don't create a file
XXXXXXXX is the format, in this case eight random characters
$ echo `mktemp -u XXXXXXXX`
KbbHRWYv
$ echo `mktemp -u XXXXXXXX`
UnyO2eH8
$ echo `mktemp -u XXXXXXXX`
g6NiTfvT

xkcd: Externalities

So the April 1, 2013 xkcd "Externalities" web comic features a Skein-1024-1024 hash-breaking contest. I'm assuming this must be nothing more than a brute-force effort where random strings are hashed in an effort to match Randall's posted hash? Is this correct?
Also, my knowledge of Skein hashing theory is virtually non-existent, but being a halfway decent programmer I was able to download and run both SkeinFish (C#) and Maarten Bodewes' Skein implementation (Java) locally in 1024-1024 mode with some input strings. The hashes they gave, however, were different from the hash that xkcd returned for the same input. This may be an extremely naive question, but do different Skein implementations give different hashes? And which Skein implementation is xkcd using?
Thanks for pardoning my ignorance!
There are several different iterations of the Skein algorithm. xkcd is using version 1.3, which is also the most recent. Sources can be found here (look for "V1.3").
Interestingly enough, this brute-force method is the same one employed by Bitcoin to "mine" bitcoins. The big differences are the hash algorithm (SHA-256 in that case) and the target criterion (dynamically adjusted: any hash starting with a certain number of zeros). It takes a lot of work to discover such a hash, but once it has been found it is trivial to verify the source bits and confirm that the resulting hash meets the criteria.
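To illustrate the shape of that loop, here is a toy Python sketch of the Bitcoin-style criterion (leading zeros, SHA-256); note that the actual xkcd contest instead scored entries by how few bits of the digest differed from Randall's fixed Skein hash:
import hashlib
import itertools

def mine(prefix, difficulty=4):
    # Try prefix0, prefix1, ... until a digest starts with `difficulty`
    # hex zeros. Finding a winner takes many hashes; verifying takes one.
    target = "0" * difficulty
    for nonce in itertools.count():
        candidate = f"{prefix}{nonce}"
        digest = hashlib.sha256(candidate.encode()).hexdigest()
        if digest.startswith(target):
            return candidate, digest

print(*mine("hello"))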
Here's the source code the Stanford team used. We ran this on about a hundred 8-core EC2 servers for a while, but not the whole competition.
https://github.com/jhiesey/skeincrack
If you were hashing non-alphanumeric characters (spaces, punctuation, etc.), you may have been getting different results due to HTML form encoding. The "enctype" attribute on the form XKCD was hosting was "application/octet-stream", which according to https://developer.mozilla.org/en-US/docs/HTML/Element/form is not a browser-supported standard. I assume the browser falls back on the URL-encoding type when it sees one it doesn't recognize.
I observed the string "=" being submitted URL-encoded in Chrome, and returning a different hash from the one I got locally with the latest pyskein. But when I submitted it with this curl command line (no longer works), I got the expected hash:
curl -X POST --data-binary "hashable==" "http://almamater.xkcd.com/?edu=school.edu"
The Stanford code in another answer does the same thing, and they apparently had some success. I never got any random data to locally hash to a better score than even my own school, so I never got a chance to test thoroughly how to pass arbitrary data in properly. I don't know what the exact behavior was (e.g., perhaps if you omitted hashable= the server would detect that and just hash the whole POST body), but it may have intentionally been a little tricky as part of April Fool's.

Perl getpwuid() and getpwnam()

I'm learning Perl's *nix system tools and I've been staring at the following two sentences for several minutes:
You can think of getpwuid() and getpwnam() operators as random access -- they grab a specific entry by key so you have to have a key to start with. Another way of accessing the password file is sequential access -- grabbing each entry in some apparently random order.
I'm 99% sure this is a typo, but if it isn't I'm clearly missing a key idea. Can anyone shed some light on the subject?
Thanks in advance.
Not a typo, but very poorly worded. getpwuid looks up a passwd entry by UID; getpwnam looks up a passwd entry by name. These are "random access" in the same sense that system memory is "random access": you pick the entry you want by providing a key. (For system memory, the "key" is the address; for getpwuid, it is the UID; for getpwnam, it is the name.)
These are in contrast to getpwent, which simply returns the "next" entry from the passwd file, in some unspecified order. This is "sequential access", like reading a file from disk, except that with getpwent you do not know in what order the results will appear.
The wording is confusing because it uses the word "random" both in the phrase "random access" (as in memory) and in "apparently random order" (by which it means "unspecified order").
The book should have said "unspecified order" or "indeterminate order" rather than "apparently random order".
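The same two access patterns exist anywhere the C passwd routines are exposed; here is a short Python illustration (the pwd module, Unix only, wraps the same library calls that Perl's operators do):
import pwd  # Unix only; wraps the C passwd-database calls

# Random access: fetch one specific entry by key.
print(pwd.getpwuid(0).pw_name)      # key is the UID  -> 'root'
print(pwd.getpwnam("root").pw_uid)  # key is the name -> 0

# Sequential access: walk every entry in whatever order the database
# yields them (getpwent in C/Perl; getpwall returns them all here).
for entry in pwd.getpwall():
    print(entry.pw_name, entry.pw_uid)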

Should all implementations of SHA512 give the same Hash?

I am working on writing a SHA512 function. When I check the file I am hashing against different sources (the Linux sha512sum tool, a couple of websites, and old SHA512 source code I have), they all give different hash values. My assumption going into this project was that all correct implementations of a hash algorithm output the same hash value for the same input, which is what makes a hash usable as a checksum. Am I wrong in thinking this? And if I am wrong, how can I check whether my work is correct?
Thanks in advance.
Yes, that's one of the basic building blocks of PKI: the same data block passed to a hash function should always return the same hash value.
Beware of the interpretation, though: the result of a SHA-2 (512-bit) hash is a block of 512 bits, not a string value, so it must first be encoded for human consumption. It is therefore possible to see what look like visually different results when it's simply a matter of different encodings.
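A quick Python check makes the point: the digest bytes are identical, and only the rendering differs:
import base64
import hashlib

digest = hashlib.sha512(b"hello world").digest()  # the raw 512 bits (64 bytes)

print(digest.hex())                       # lowercase hex, as sha512sum prints it
print(digest.hex().upper())               # same digest, uppercase hex
print(base64.b64encode(digest).decode())  # same digest, Base64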

How does the Enterprise Library's CryptographyManager.CompareHash method work?

I've been wondering how the CryptographyManager is able to compare a salted hash with plain text. It has to save the salt for each hash somewhere, right? Does anyone have any insight into this?
We ship source code. Take a look at CryptographyManagerImpl.cs in the Cryptography solution.
Also, you may want to review our unit tests - the ones that start with HashProvider should give you additional insight.
So I checked out the source code, and it is actually quite trivial: the salt is prepended to the actual hash value. When the hash is compared with a plaintext, the salt is extracted and used to hash the plaintext; the two values (salt + hash) are then compared.
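A minimal Python sketch of that scheme (an illustration of the idea, not the actual Enterprise Library code; SHA-256 and a 16-byte salt are stand-ins for whatever the configured provider uses):
import hashlib
import hmac
import os

SALT_LEN = 16  # assumed length; the real provider's salt size may differ

def create_hash(plaintext: bytes) -> bytes:
    salt = os.urandom(SALT_LEN)
    return salt + hashlib.sha256(salt + plaintext).digest()  # salt is the prefix

def compare_hash(plaintext: bytes, stored: bytes) -> bool:
    salt, expected = stored[:SALT_LEN], stored[SALT_LEN:]  # extract the salt
    actual = hashlib.sha256(salt + plaintext).digest()
    return hmac.compare_digest(actual, expected)  # constant-time compare

stored = create_hash(b"s3cret")
assert compare_hash(b"s3cret", stored) and not compare_hash(b"wrong", stored)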