How are SHA-3 variants named?

How should we succinctly refer to SHA-3 variants of specific width? The precedent set by SHA-2 naming is unfortunately ambiguous if applied to SHA-3. Specifically, we have SHA-0 and SHA-1 (160 bits), followed by SHA-2 (224, 256, 384, or 512 bits), where SHA-224, SHA-256, SHA-384, and SHA-512 refer to the SHA-2 variants. SHA-3 supports the same bit counts as SHA-2, but a different naming convention is needed to distinguish between SHA-2 and SHA-3. SHA-3-224, SHA-3-256, SHA-3-384, and SHA-3-512 seem reasonable (if clumsy), but I can find no established naming convention of any sort.

I believe they have been finalized as follows:
"SHA3-224", "SHA3-256", "SHA3-384", "SHA3-512"
Source: FIPS 202, "SHA-3 Standard: Permutation-Based Hash and Extendable-Output Functions".

There is no convention yet. Even the standard itself is not published AFAIK.
I'd use SHA3-256 etc. (like MD6-256).
The same naming scheme is also used in the BouncyCastle library.
As for SHA-3-256 and friends, I personally don't like the idea of using the same character (the dash) both inside the algorithm name and as the property separator. If you really need to keep the dash in the algorithm name, I'd go with SHA-3/256; a similar scheme is used for cipher transformation names in the JCA.
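For what it's worth, the "SHA3-256" spelling is what libraries tend to expose in practice. A minimal Perl sketch, assuming the CPAN module Digest::SHA3 is installed (the module choice is illustrative, not part of the question):

use strict;
use warnings;
use Digest::SHA3 qw(sha3_256_hex sha3_512_hex);

# The exported function names mirror the FIPS 202 names: SHA3-256, SHA3-512, ...
my $msg = "The quick brown fox jumps over the lazy dog";
printf "SHA3-256: %s\n", sha3_256_hex($msg);
printf "SHA3-512: %s\n", sha3_512_hex($msg);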

Related

Why doesn't md5 (and other hash algorithms) output in base32?

It seems like most hashes (usually in base16/hex) could be easily represented in base32 in a lossless way, resulting in much shorter (and more easily readable) hash strings.
I understand that naive implementations might mix "O"s, "0"s, "1"s, and "I"s, but one could easily choose alphabetic characters without such problems. There are also enough characters to keep hashes case-insensitive. I know that shorter hash algorithms exist (like crc32), but this idea could be applied to those too for even shorter hashes.
Why, then, do most (if not all) hash algorithm implementations not output in base32, or at least provide an option to do so?
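To make the premise concrete, here is a sketch comparing hex and base32 spellings of the same digest, assuming the CPAN modules Digest::MD5 and Convert::Base32 are available (module choice is illustrative):

use strict;
use warnings;
use Digest::MD5 qw(md5);
use Convert::Base32 qw(encode_base32);

my $digest = md5("hello world");     # 16 raw bytes
my $hex    = unpack("H*", $digest);  # 32 hex characters
my $b32    = encode_base32($digest); # 26 base32 characters (unpadded)
print "hex:    $hex\n";
print "base32: $b32\n";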

Basics of MD5: How to know hash bit length and symmetry?

I'm curious about some basics of MD5 encryption that I couldn't get from Google, the Java questions here, or a dense legal paper:
1. How do I measure, in bytes, an MD5 hash string? And does it depend on whether the string is Unicode or ANSI?
2. Is MD5 an asymmetric algorithm?
Example: if my app talks (HTTP) to a REST web service using a key (an MD5_128 hash string, ANSI, made of 9 chars) to decrypt received data, does that account for 9x8=72 bits in an asymmetric algorithm?
I'm using WinDev 25 on Windows, with functions like Encrypt and HashString, but I lack knowledge about encryption.
Edit: not answered yet, but it seems I need to know more about charsets before jumping into hashes and encryption. https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
An MD5 hash is 128 bits, 16 bytes. The result is binary, not text, so it is neither "ANSI" nor "Unicode". Like all hashes, it is asymmetric, which should be obvious from the fact that you can hash inputs which are longer than 128 bits. Since it is asymmetric, you cannot "unencrypt" (decrypt) it. This is by design and intentional.
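A quick Perl illustration of both points, using the Digest::MD5 module that ships with Perl: the digest is raw bytes of fixed length, no matter how long the input is.

use strict;
use warnings;
use Digest::MD5 qw(md5 md5_hex);

for my $input ("a", "a" x 1000) {
    # md5() returns 16 raw bytes regardless of input length;
    # md5_hex() is the same digest spelled as 32 hex characters.
    printf "input of %4d chars -> %d-byte digest: %s\n",
        length($input), length(md5($input)), md5_hex($input);
}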

How to get Perl crypt to encrypt more than 8 characters?

Only the first 8 characters are encrypted when the Perl crypt function is used. Is there a way to get it to use more characters?
As an example:
$crypted_password = crypt ("PassWord", "SALT");
and
$crypted_password = crypt ("PassWord123", "SALT");
return exactly the same result; $crypted_password has exactly the same value in both cases.
I would love to use crypt because it is a quick and easy solution for non-reversible encryption, but this limit makes it useless for anything serious.
To quote from the documentation:
Traditionally the result is a string of 13 bytes: two first bytes of the salt, followed by 11 bytes from the set [./0-9A-Za-z], and only the first eight bytes of PLAINTEXT mattered. But alternative hashing schemes (like MD5), higher level security schemes (like C2), and implementations on non-Unix platforms may produce different strings.
So the exact return value of crypt is system-dependent, but it often uses an algorithm that only looks at the first 8 bytes of the password. These two things combined make it a poor choice for portable password encryption. If you're using a system with a stronger encryption routine and don't try to check those passwords on incompatible systems, you're fine. But it sounds like you're using an OS with the old crappy DES routine.
So a better option is to use a module off of CPAN that does the encryption in a predictable, more secure way.
Some searching gives a few promising-looking options (that I haven't used and can't recommend one over another; I just looked for promising keywords on MetaCPAN); a sketch using one of them follows the list:
Crypt::SaltedHash
Authen::Passphrase::SaltedDigest
Crypt::Bcrypt::Easy
Crypt::Password::Util
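As an illustration only (again, I haven't vetted these modules), here is a minimal sketch of what using Crypt::SaltedHash from the list above might look like, going by its documented API:

use strict;
use warnings;
use Crypt::SaltedHash;

# Hash the password with a random salt at registration time.
my $csh = Crypt::SaltedHash->new(algorithm => 'SHA-256');
$csh->add('PassWord123');
my $stored = $csh->generate;   # salted, printable hash string

# Verify a login attempt later.
print "password ok\n"
    if Crypt::SaltedHash->validate($stored, 'PassWord123');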

What is the limit to encoding base in case of Unicode strings as opposed to base64 having base = 64?

This is actually related to code golf in general, but also applicable elsewhere. People commonly use base64 encoding to store large amounts of binary data in source code.
Assuming all programming languages are happy to read Unicode source code, what is the max N for which we can reliably devise a baseN encoding?
Reliability here means being able to encode/decode any data, so every single combination of input bytes can be encoded and then decoded. The encoded form is exempt from this rule.
The main goal is to minimize the character count, regardless of byte-count.
Would it be base2147483647 (32-bit)?
Also, because I know it may vary from browser-to-browser, and we already have problems with copy-pasting code from codegolf answers to our editors, the copy-paste-ability is also a factor here. I know there is a Unicode range of characters that are not displayed.
NOTE:
I know that for binary data, base64 usually expands data, but here the character-count is the main factor.
It really depends on how reliable you want the encoding to be. Character encodings are designed with trade-offs, and in general the more characters allowed, the less likely the encoding is to be universally accepted, i.e., the less reliable it is. Base64 isn't immune to this: RFC 3548, published in 2003, mentions that case sensitivity may be an issue and that the characters + and / may be problematic in certain scenarios. It describes Base32 (no lowercase) and Base16 (hex digits) as potentially safer alternatives.
It does not get better with Unicode. Adding that many characters introduces many more possible points of failure. Depending on how stringent your requirements are, you might have different values for N. I'll cover a few possibilities from large N to small N, adding a requirement each time.
1,114,112: Code points. This is the number of possible code points defined by the Unicode Standard.
1,112,064: Valid UTF. This excludes the surrogates which cannot stand on their own.
1,111,998: Valid for exchange between processes. Unicode reserves 66 code points as permanent non-characters for internal use only. Theoretically, this is the maximum N you could justifiably expect for your copy-paste scenario, but as you noted, in practice many other Unicode strings will fail that exercise.
120,503: Printable characters only, depending on your definition. I've defined it to be all characters outside of the Other and Separator general categories. Also, starting from this bullet point, N is subject to change in future versions of Unicode.
103,595: NFKD normalized Unicode. Unfortunately, many processes automatically normalize Unicode input to a standardized form. If the process used NFKC or NFKD, some information may have been lost. For more reliability, the encoding should thus define a normalization form, with NFKD being better for increasing character count.
101,684: No combining characters. These are "characters" which shouldn't stand on their own, such as accents, and are meant to be combined with another base character. Some processes might panic if they are left standing alone, or if there are too many combining characters on a single base character. I've now excluded the Mark category.
85: ASCII85, aka. I want my ASCII back. Okay, this is no longer Unicode, but I felt like mentioning it because it's a lesser known ASCII-only encoding. It's mainly used in Adobe's PostScript and PDF formats, and has a 5:4 encoded data size increase, rather than Base64's 4:3 ratio.
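To put rough numbers on the character-count payoff: encoding k bytes in base N takes about ceil(8k / log2 N) characters, ignoring padding. A quick Perl sketch (the choice of sizes and bases is illustrative):

use strict;
use warnings;
use POSIX qw(ceil);

my $bytes = 1024;  # size of the binary blob to embed
for my $n (16, 32, 64, 85, 1_111_998) {
    # characters needed = ceil(bits / bits-per-character)
    my $chars = ceil(8 * $bytes / (log($n) / log(2)));
    printf "base%-8s -> about %d characters\n", $n, $chars;
}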

Doing away with encoding in a genetic algorithm implementation

I was wondering if encoding in a genetic algorithm is really necessary. Let's say I have a program that is supposed to implement a GA to guess a word a user inputs.
I don't see the point in having the chromosomes as binary strings; I would rather have each chromosome be just a string of letters, and mutate and crossbreed the strings accordingly.
Is such an approach unorthodox? Will it really affect the outcome, or does it violate the definition of a genetic algorithm?
I do understand that different types of encoding are possible; however, that isn't what I am concerned about. Please keep your answer specific to the objective of guessing a string similar to the one input by the user.
THIS IS NOT A QUESTION ABOUT THE CHOICE OF ENCODING, BUT ABOUT WHETHER I CAN DO AWAY WITH ENCODING ENTIRELY FOR THIS OBJECTIVE.
Though unorthodox, your approach would be perfectly valid; the crossover and mutation operators may have to be tweaked, however. There are in fact numerous such non-standard encodings in use today, including alphabetic, alphanumeric, decimal, etc.
In your specific case, not encoding an alphabetic chromosome is the same as encoding it alphabetically with an identity map. With an alphabetic encoding, the normal crossover operator remains valid, though mutation may have to be changed so that it generates a random letter at the mutation site.
Binary encoding in GAs is generally used only for the simplicity and speed of the operations involved; for your case, a string/character comparison generally takes longer than the integer/boolean alternative. A sketch of the direct-string approach is below.
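To make that concrete, here is a minimal, hypothetical Perl sketch of direct string chromosomes for the word-guessing task: crossover splices two parents at a random cut point, and mutation swaps in a random letter (selection and the main loop are omitted; all names here are invented for illustration):

use strict;
use warnings;

my @alphabet = ('a' .. 'z');

# One-point crossover directly on the letter strings.
sub crossover {
    my ($mom, $dad) = @_;
    my $cut = int(rand(length $mom));
    return substr($mom, 0, $cut) . substr($dad, $cut);
}

# Mutation: replace each position with a random letter at some low rate.
sub mutate {
    my ($child, $rate) = @_;
    for my $i (0 .. length($child) - 1) {
        substr($child, $i, 1) = $alphabet[rand @alphabet] if rand() < $rate;
    }
    return $child;
}

# Fitness: number of positions that already match the target word.
sub fitness {
    my ($guess, $target) = @_;
    return scalar grep { substr($guess, $_, 1) eq substr($target, $_, 1) }
        0 .. length($target) - 1;
}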