Maximum input and output length for Argon2 - hash

As you may know, the maximum input length for bcrypt is 72 bytes and the output length is 60 characters. (I tested this in PHP; correct me if I'm wrong.)
I want to know maximum input length and the exact output length for argon2. Thanks.

According to https://en.wikipedia.org/wiki/Argon2#Algorithm, the max input length is 2^32 − 1 bytes, i.e. 4,294,967,295 bytes.
As to the equivalent in character length, it depends on what character encoding you use.
According to this answer:
In ASCII or ISO 8859, each character is represented by one byte
In UTF-32, each character is represented by 4 bytes
In UTF-8, each character uses between 1 and 4 bytes
In ISO 2022, it's much more complicated
Still according to https://en.wikipedia.org/wiki/Argon2#Algorithm, I cannot give you an 'exact' output length, because it depends on the lengths you choose for various parameters, such as the salt and the output hash itself.
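The character-to-byte relationship listed above is easy to check directly. A quick sketch in Python (the sample string is just an arbitrary example):

```python
# Byte length of the same 5-character string under different encodings;
# Argon2's 2^32 - 1 input limit applies to bytes, not characters.
text = "naïve"  # 5 characters

ascii_bytes = text.encode("ascii", errors="replace")  # 'ï' replaced by '?'
utf32_bytes = text.encode("utf-32-le")  # LE variant avoids the 4-byte BOM
utf8_bytes = text.encode("utf-8")

print(len(ascii_bytes))  # 5  (1 byte per character)
print(len(utf32_bytes))  # 20 (4 bytes per character)
print(len(utf8_bytes))   # 6  ('ï' alone takes 2 bytes)
```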

Related

Which hashing algorithm generates alphanumeric output?

I am looking for a hashing algorithm that generates alphanumeric output. I did a few tests with MD5, SHA3, etc. and they produce hexadecimal output.
Example:
Input: HelloWorld
Output[sha3_256]: 92dad9443e4dd6d70a7f11872101ebff87e21798e4fbb26fa4bf590eb440e71b
The 1st character in the above output is 9. Since the output is in hex format, the possible values are [0-9][a-f].
I am trying to achieve the maximum possible range of values for the 1st character: [0-9][a-z][A-Z].
Any ideas would be appreciated. Thanks in advance.
Where MD5 computes a 128-bit hash and SHA-256 a 256-bit hash, the output they provide is nothing more than a 128-bit, respectively 256-bit, binary number. In short, that is a lot of zeros and ones. To use a more human-friendly representation of binary-coded values, software developers and system designers use hexadecimal numbers, which is a representation in base(16). For example, an 8-bit byte can have values ranging from 00000000 to 11111111 in binary form, which can be conveniently represented as 00 to FF in hexadecimal.
You could convert this binary number into a base(32) if you want. This is represented using the characters "A-Z2-7". Or you could use base(64) which needs the characters "A-Za-z0-9+/". In the end, it is just a representation.
There is, however, some practical use to base(16), or hexadecimal. In computer lingo, a byte is 8 bits and a word consists of two bytes (16 bits). All of these can be comfortably represented in hexadecimal, since 2^8 = 2^4 × 2^4 = 16 × 16, whereas 2^8 = 2^5 × 2^3 = 32 × 8. Hence, in base(32), a byte is not cleanly represented: you already need 5 bytes (40 bits) to get a clean base(32) representation of 8 characters. That is not comfortable to deal with on a daily basis.

Clarification on Joel Spolsky's Unicode article

I'm reading the popular Unicode article from Joel Spolsky and there's one illustration that I don't understand.
What does "Hex Min, Hex Max" mean? What do those values represent? Min and max of what?
Binary can only have 1 or 0. Why do I see tons of letter "v" here?
http://www.joelonsoftware.com/articles/Unicode.html
The Hex Min/Max columns define the range of Unicode characters (typically represented by their Unicode number in hex).
The v's refer to the bits of the original number.
So the first line is saying:
The unicode characters in the range 0 (hex 00) to 127 (hex 7F) (a 7
bit number) are represented by a 1 byte bit string starting with '0'
followed by all 7 bits of the unicode number.
The second line is saying:
The Unicode numbers in the range 128 (hex 80) to 2047 (hex 7FF) (an 11
bit number) are represented by a 2 byte bit string where the first
byte starts with '110' followed by the first 5 of the 11 bits, and the
second byte starts with '10' followed by the remaining 6 of the 11 bits
etc
Hope that makes sense
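The two-byte rule described above can be verified in Python (using 'é', U+00E9, as an arbitrary example of an 11-bit-or-smaller code point):

```python
ch = "é"                      # U+00E9 = 233
b1, b2 = ch.encode("utf-8")   # two bytes: 0xC3, 0xA9

print(f"{b1:08b} {b2:08b}")   # 11000011 10101001

# First byte: '110' + first 5 bits; second byte: '10' + remaining 6 bits.
bits = f"{b1:08b}"[3:] + f"{b2:08b}"[2:]
print(int(bits, 2) == ord(ch))  # True: reassembling the bits gives back U+00E9
```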
Note that the table in Joel's article covers code points that do not, and never will, exist in Unicode. In fact, UTF-8 never needs more than 4 bytes, though the scheme underlying UTF-8 could be extended much further, as shown.
A more nuanced version of the table is available in How does a file with Chinese characters know how many bytes to use per character? It points out some of the gaps. For example, the bytes 0xC0, 0xC1, and 0xF5..0xFF can never appear in valid UTF-8. You can also see information about invalid UTF-8 at Really good bad UTF-8 example test data.
In the table you showed, the Hex Min and Hex Max values are the minimum and maximum U+wxyz values that can be represented using the number of bytes in the 'byte sequence in binary' column. Note that the maximum code point in Unicode is U+10FFFF (and that one is defined/reserved as a noncharacter). This is the maximum value that can be represented using the surrogate encoding scheme in UTF-16, using just 4 bytes (a surrogate pair of two UTF-16 code units).

pbkdf2 key length

What is the $key_length in PBKDF2
It says that it will be derived from the input, but I see people using key_lengths of 256 and greater, but when I enter 256 as a key_length the output is 512 characters. Is this intentional? Can I safely use 64 as the key_length so the output is 128 characters long?
$key_length is the number of output bytes that you desire from PBKDF2. (Note that if key_length is more than the number of output bytes of the hash algorithm, the process is repeated twice, slowing down that hashing perhaps more than you desire. SHA256 gives 32 bytes of output, for example, so asking for 33 bytes will take roughly twice as long as asking for 32.)
The doubling of the length that you mention is because the code converts the output bytes to hexadecimal (i.e. 2 characters per 1 byte) unless you specify $raw_output = true. The test vectors included specify $raw_output = false, since hexadecimal is simply easier to work with and post online. Depending on how you are storing the data in your application, you can decide if you want to store the results as hex, base64, or just raw binary data.
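Both points (key_length counts bytes, and hex encoding doubles the character count) can be seen with Python's standard hashlib; the password, salt, and iteration count here are arbitrary examples:

```python
import hashlib

# dklen plays the role of $key_length: the number of derived *bytes*.
dk = hashlib.pbkdf2_hmac("sha256", b"password", b"salt", 100_000, dklen=64)

print(len(dk))        # 64 raw bytes...
print(len(dk.hex()))  # ...which become 128 characters once hex-encoded
```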
In the IETF Password-Based Cryptography Specification Version 2.0 (RFC 2898), the key length is defined as the "intended length in octets of the derived key, a positive integer, at most (2^32 - 1) * hLen". Here hLen denotes the length in octets of the pseudorandom function output. For further details on PBKDF2 you can refer to How to store passwords securely with PBKDF2

How many characters can be mapped with Unicode?

I am asking for the count of all the possible valid combinations in Unicode, with an explanation. I know a char can be encoded as 1, 2, 3 or 4 bytes. I also don't understand why continuation bytes have restrictions even though the starting byte of a character already makes clear how long it should be.
I am asking for the count of all the possible valid combinations in Unicode with explanation.
1,111,998: 17 planes × 65,536 characters per plane - 2048 surrogates - 66 noncharacters
Note that UTF-8 and UTF-32 could theoretically encode much more than 17 planes, but the range is restricted based on the limitations of the UTF-16 encoding.
137,929 code points are actually assigned in Unicode 12.1.
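The arithmetic in that first line checks out; a one-liner's worth of Python makes the steps explicit:

```python
planes = 17
per_plane = 65_536
surrogates = 2_048
noncharacters = 66

total_code_points = planes * per_plane      # 1,114,112
scalar_values = total_code_points - surrogates  # 1,112,064
characters = scalar_values - noncharacters  # 1,111,998
print(characters)
```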
I also don't understand why continuation bytes have restrictions even though the starting byte of a character already makes clear how long it should be.
The purpose of this restriction in UTF-8 is to make the encoding self-synchronizing.
For a counterexample, consider the Chinese GB 18030 encoding. There, the letter ß is represented as the byte sequence 81 30 89 38, which contains the encoding of the digits 0 and 8. So if you have a string-searching function not designed for this encoding-specific quirk, then a search for the digit 8 will find a false positive within the letter ß.
In UTF-8, this cannot happen, because the non-overlap between lead bytes and trail bytes guarantees that the encoding of a shorter character can never occur within the encoding of a longer character.
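Python ships a GB 18030 codec, so the false-positive example above can be reproduced directly:

```python
# ß encodes in GB 18030 as the four bytes 81 30 89 38, which contain
# the single-byte encodings of the digits '0' and '8'.
gb = "ß".encode("gb18030")
print(gb.hex())    # 81308938
print(b"8" in gb)  # True: a naive byte search finds a bogus digit '8'

# In UTF-8 this cannot happen: lead and trail bytes never overlap ASCII.
print(b"8" in "ß".encode("utf-8"))  # False
```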
Unicode allows for 17 planes, each of 65,536 possible characters (or 'code points'). This gives a total of 1,114,112 possible characters. At present, only about 10% of this space has been allocated.
The precise details of how these code points are encoded differ with the encoding, but your question makes it sound like you are thinking of UTF-8. The reason for restrictions on the continuation bytes are presumably so it is easy to find the beginning of the next character (as continuation characters are always of the form 10xxxxxx, but the starting byte can never be of this form).
Unicode supports 1,114,112 code points. There are 2,048 surrogate code points, giving 1,112,064 scalar values. Of these, there are 66 noncharacters, leading to 1,111,998 possible encoded characters (unless I made a calculation error).
To give a metaphorically accurate answer, all of them.
Continuation bytes in the UTF-8 encoding allow for resynchronization of the encoded octet stream in the face of "line noise". The decoder merely needs to scan forward for a byte that does not have a value between 0x80 and 0xBF to know that it has found the start of a new character.
In theory, the encodings used today allow for expression of characters whose Unicode character number is up to 31 bits in length. In practice, this encoding is actually implemented on services like Twitter, where the maximal length tweet can encode up to 4,340 bits' worth of data. (140 characters [valid and invalid], times 31 bits each.)
According to Wikipedia, Unicode 12.1 (released in May 2019) contains 137,994 distinct characters.
Unicode's code space runs up to hexadecimal 110000, which is 1,114,112 code points.

Would it be possible to have a UTF-8-like encoding limited to 3 bytes per character?

UTF-8 requires 4 bytes to represent characters outside the BMP. That's not bad; it's no worse than UTF-16 or UTF-32. But it's not optimal (in terms of storage space).
There are 13 bytes (C0-C1 and F5-FF) that are never used. And multi-byte sequences that are not used such as the ones corresponding to "overlong" encodings. If these had been available to encode characters, then more of them could have been represented by 2-byte or 3-byte sequences (of course, at the expense of making the implementation more complex).
Would it be possible to represent all 1,114,112 Unicode code points by a UTF-8-like encoding with at most 3 bytes per character? If not, what is the maximum number of characters such an encoding could represent?
By "UTF-8-like", I mean, at minimum:
The bytes 0x00-0x7F are reserved for ASCII characters.
Byte-oriented find / index functions work correctly. You can't find a false positive by starting in the middle of a character like you can in Shift-JIS.
Update -- My first attempt to answer the question
Suppose you have a UTF-8-style classification of leading/trailing bytes. Let:
A = the number of single-byte characters
B = the number of values used for leading bytes of 2-byte characters
C = the number of values used for leading bytes of 3-byte characters
T = 256 - (A + B + C) = the number of values used for trailing bytes
Then the number of characters that can be supported is N = A + BT + CT².
Given A = 128, the optimum is at B = 0 and C = 43. This allows 310,803 characters, or about 28% of the Unicode code space.
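The optimum quoted above can be confirmed by brute force over every (B, C) split, using the formula from the question:

```python
# N = A + B*T + C*T^2, with A = 128 single-byte values and
# T = 256 - (A + B + C) trailing-byte values.
A = 128
best = max(
    (A + B * T + C * T * T, B, C)
    for B in range(129)
    for C in range(129 - B)
    for T in [256 - (A + B + C)]
)
print(best)  # (310803, 0, 43): B = 0, C = 43 maximizes N at 310,803
```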
Is there a different approach that could encode more characters?
It would take a little over 20 bits to record all the Unicode code points (assuming your number is correct), leaving over 3 bits out of 24 for encoding which byte is which. That should be adequate.
I fail to see what you would gain by this, compared to what you would lose by not going with an established standard.
Edit: Reading the spec again, you want the values 0x00 through 0x7f reserved for the first 128 code points. That means you only have 21 bits in 3 bytes to encode the remaining 1,113,984 code points. 21 bits is barely enough, but it doesn't really give you enough extra to do the encoding unambiguously. Or at least I haven't figured out a way, so I'm changing my answer.
As to your motivations, there's certainly nothing wrong with being curious and engaging in a little thought exercise. But the point of a thought exercise is to do it yourself, not try to get the entire internet to do it for you! At least be up front about it when asking your question.
I did the math, and it's not possible (if wanting to stay strictly "UTF-8-like").
To start off, the four-byte range of UTF-8 covers U+010000 to U+10FFFF, which is a huge slice of the available characters. This is what we're trying to replace using only 3 bytes.
By special-casing each of the 13 unused prefix bytes you mention, you could gain 65,536 characters each, which brings us to a total of 13 * 0x10000, or 0xD0000.
This would bring the total 3-byte character range to U+010000 to U+0DFFFF, which is almost all, but not quite enough.
Sure it's possible. Proof:
2^24 = 16,777,216
So there is enough of a bit-space for 1,114,112 characters but the more crowded the bit-space the more bits are used per character. The whole point of UTF-8 is that it makes the assumption that the lower code points are far more likely in a character stream so the entire thing will be quite efficient even though some characters may use 4 bytes.
Assume 0-127 remains one byte. That leaves 8.4M spaces for 1.1M characters. You can then solve this is an equation. Choose an encoding scheme where the first byte determines how many bytes are used. So there are 128 values. Each of these will represent either 256 characters (2 bytes total) or 65,536 characters (3 bytes total). So:
256x + 65536(128-x) = 1114112 - 128
Solving this, you need 111 values of the first byte for 2-byte characters and the remaining 17 for 3-byte characters. To check:
128 + 111 × 256 + 17 × 65,536 = 1,142,656 ≥ 1,114,112
To put it another way:
128 code points require 1 byte;
28,416 code points require 2 bytes; and
1,114,112 code points require 3 bytes.
Of course, this doesn't allow for the inevitable expansion of Unicode, which UTF-8 does. You can adjust this to the first byte meaning:
0-127 (128) = 1 byte;
128-191 (64) = 2 bytes;
192-255 (64) = 3 bytes.
This would be better because it's simple bitwise AND tests to determine length and gives an address space of 4,210,816 code points.
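The 4,210,816 figure follows directly from the three lead-byte ranges:

```python
one_byte = 128               # lead bytes 0-127 encode themselves
two_byte = 64 * 256          # 64 lead values x 256 trailing values
three_byte = 64 * 256 * 256  # 64 lead values x two trailing bytes

print(one_byte + two_byte + three_byte)  # 4210816
```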