Maximum length of headers - email

I am interested in the maximum length of the header name and the header value.
And are there any restrictions on the maximum number of parameters?

None of the relevant specifications define a maximum length for a header name or value, but RFC 5321 section 4.5.3.1.6 states that the maximum line length is 1000 octets (i.e. 1000 bytes) including the terminating <CR><LF> sequence.
How does that affect maximum header name/value lengths, you might ask?
It doesn't affect the maximum header value length at all, because RFC 5322 section 3.2.2 defines CFWS (comments and folding white space), which is used throughout the ABNF grammar definitions for headers and effectively allows header values to be arbitrarily long by folding them across multiple lines.
That said, while there is no explicit maximum length for a header field name, there is a practical one.
The maximum line length is 1000 octets (including the terminating <CR><LF> sequence).
The recommended maximum line length is 78 octets (see RFC 5322 section 2.1.1).
The syntactical definition of a header looks like this:
optional-field  =  field-name ":" unstructured CRLF
field-name      =  1*ftext
ftext           =  %d33-57 /    ; Printable US-ASCII
                   %d59-126     ;  characters not including
                                ;  ":".
(where optional-field is any header field that is not pre-defined in the specification, such as "To", "From", "Date", "Subject", etc.). This syntax definition can be found in RFC 5322 section 3.6.8.
Header field names cannot be folded (as seen by the syntax definition).
Since it must be possible to represent a header field name plus the colon (":") within 998 bytes (1000 bytes minus the <CR><LF> sequence), we can safely conclude that the maximum length of a header field name is 997 bytes (or, since header field names are constrained to US-ASCII, 997 characters). It SHOULD also be constrained to fit within the recommended maximum line length of 78 bytes, meaning a header field name SHOULD be at most 77 bytes/characters.
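To make the numbers concrete, here is a minimal Python sketch (the constant and function names are hypothetical, purely to illustrate the limits derived above):

MUST_MAX_LINE = 998     # hard line limit from RFC 5321/5322, excluding CRLF
SHOULD_MAX_LINE = 78    # recommended line limit from RFC 5322, excluding CRLF

def check_field_name(name):
    # ftext: printable US-ASCII (33-126), excluding the colon
    if not name or not all(33 <= ord(c) <= 126 and c != ":" for c in name):
        return "invalid characters"
    if len(name) > MUST_MAX_LINE - 1:        # leave one octet for the ':'
        return "violates the 998-octet line limit"
    if len(name) > SHOULD_MAX_LINE - 1:
        return "legal, but exceeds the recommended 78-octet line"
    return "ok"

print(check_field_name("X-" + "A" * 75))     # 77 characters -> ok
print(check_field_name("X-" + "A" * 76))     # 78 characters -> exceeds the recommendation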

Related

postgres: how to count multibyte emoji strings display length in UTF-8

Postgres (v11) counts the red heart ❤️ as two characters, and likewise for other multi-code-point UTF-8 emoji that use variation selectors. Does anyone know how I can get Postgres to count perceived characters rather than the individual code points?
For example, I would like both of the examples below to return 1.
select length('❤️') = 2 (Unicode: 2764 FE0F)
select length('🏃‍♂️') = 4 (Unicode: 1F3C3 200D 2642 FE0F)
UPDATE
Thank you to the folks pointing out that Postgres is correctly counting the Unicode code points, and for explaining why and how this happens.
I don't see any option other than pre-processing the emoji strings against a table of official Unicode code-point ranges, in Python or some such, to get the perceived length.
So one way to do this is to skip all characters in the Variation Selectors range and to decrement the count by one when you hit the General Punctuation range (which contains the Zero Width Joiner).
This could be converted into a postgres function.
python
"""
# For reference, these code pages apply to emojis
Name Range
Emoticons 1F600-1F64F
Supplemental_Symbols_and_Pictographs 1F900-1F9FF
Miscellaneous Symbols and Pictographs 1F300-1F5FF
General Punctuation 2000-206F
Miscellaneous Symbols 2600-26FF
Variation Selectors FE00-FE0F
Dingbats 2700-27BF
Transport and Map Symbols 1F680-1F6FF
Enclosed Alphanumeric Supplement 1F100-1F1FF
"""
emojis = "🏃‍♂️🏃‍♂️🏃‍♂️🏃‍♂️🏃‍♂️🏃‍♂️🏃‍♂️"  # true count is 7, postgres length() returns 28
true_count = 0
for char in emojis:
    d = ord(char)
    char_type = None
    if 0x2000 <= d <= 0x206F:
        char_type = "GP"   # General Punctuation (contains the Zero Width Joiner)
    elif 0xFE00 <= d <= 0xFE0F:
        char_type = "VS"   # Variation Selector
    print(d, char_type)
    if char_type == "GP":
        true_count -= 1    # the joiner merges the code points on either side into one glyph
    elif char_type != "VS":
        true_count += 1
print(true_count)          # 7
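As an aside, the same count can be obtained without tracking ranges by hand by matching grapheme clusters with the third-party regex module (a minimal sketch; whether ZWJ emoji sequences collapse into a single cluster depends on the Unicode tables shipped with the module):

import regex  # third-party package, supports the \X grapheme-cluster pattern

emojis = "🏃‍♂️🏃‍♂️🏃‍♂️🏃‍♂️🏃‍♂️🏃‍♂️🏃‍♂️"
print(len(regex.findall(r"\X", emojis)))  # 7 grapheme clusters

Either approach could then be wrapped in a PL/Python function on the Postgres side, as suggested above.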

Interpreting ASN.1 indefinite-length encoding with multiple encapsulated octet-strings

I have a BER structure like this...
$ openssl asn1parse -inform der -in test.der -i -dump
????:d=4 hl=2 l=inf cons: cont [ 0 ]
????:d=5 hl=3 l= 240 prim: OCTET STRING
0000 - AABBCCDD
????:d=5 hl=2 l= 8 prim: OCTET STRING
0000 - EEFF
????:d=5 hl=2 l= 0 prim: EOC
...or in der2ascii style...
[0] `80`
OCTET_STRING { `AABBCCDD` }
OCTET_STRING { `EEFF` }
`0000`
What I know: indefinite-length encoding must contain a constructed type, because primitive types may introduce ambiguities, e.g. when containing 0x0000. What I want to know: How must a decoder behave when parsing this BER structure? Are the header bytes of both OCTET STRINGs included in the encoding? If yes, how is indefinite-length byte data encoded? And how does an application interpret the value of the TLV field tagged [0] when the second element is, e.g., an INTEGER rather than another OCTET STRING?
I am asking this question because, in the CMS standard, the field is defined as a single OCTET STRING, yet in most BER encodings I see two of them. Is this only due to the indefinite-length encoding? Am I missing something?
From ITU-T X.690:
8.1.4 Contents octets
The contents octets shall consist of zero, one or more octets, and shall encode the data value as specified in
subsequent clauses.
NOTE – The contents octets depend on the type of the data value;
subsequent clauses follow the same sequence as the definition of types
in ASN.1.
Does this mean that I can put any constructed type in there, and that the application must only interpret the value part of the constructed TLV structure?
When you encode a primitive OCTET STRING in indefinite length mode, the encoder must:
split up the value into chunks of smaller OCTET STRINGs,
encode each chunk in definite length mode so that each has its own TLV (with length!), and
frame the whole sequence of definite-length encoded primitive OCTET STRINGs with a single, indefinite-length encoded constructed OCTET STRING "container" that has its own TLV (no definite length, but an end-of-octets sentinel).
At the other end, the decoder extracts the V part from the inner, definite-length OCTET STRING chunks (dropping their TL headers), then joins/consumes the V's in order of arrival, dropping the TL part of the outer frame as well.
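To make that concrete, here is a minimal Python sketch of the reassembly (a hypothetical helper; it only handles short-form lengths and the [0]-wrapped shape shown in the question, using the AABBCCDD/EEFF chunks as sample data):

def read_indefinite_constructed(data):
    # Expect: constructed context tag [0], indefinite-length marker, then
    # definite-length primitive OCTET STRING chunks, terminated by EOC (00 00).
    assert data[0] == 0xA0, "expected constructed [0]"
    assert data[1] == 0x80, "expected indefinite-length marker"
    pos, value = 2, b""
    while True:
        tag = data[pos]
        if tag == 0x00 and data[pos + 1] == 0x00:   # end-of-contents
            break
        assert tag == 0x04, "chunks must be primitive OCTET STRINGs"
        length = data[pos + 1]
        assert length < 0x80, "sketch handles short-form lengths only"
        value += data[pos + 2 : pos + 2 + length]   # keep V, drop T and L of the chunk
        pos += 2 + length
    return value

encoded = bytes.fromhex("A080 0404AABBCCDD 0402EEFF 0000")
print(read_indefinite_constructed(encoded).hex().upper())  # AABBCCDDEEFF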
Note that the idea behind the indefinite length encoding technique is that both encoder and decoder can emit/consume incomplete, possibly oversized, data.
Chunk size is chosen by the encoder/application based on data availability, the memory situation and possibly an estimate of the decoder's buffering capabilities. I think this is mentioned somewhere in the X.280/X.680 papers.
The encoder is not allowed to put chunks of different ASN.1 types into any single indefinite length encoded container. In other words, all chunks must be of the same type as the outer container.
That should hopefully explain why you may see multiple (depending on chunk size) OCTET STRINGs in the indefinite length encoded BER/CER stream where just a single OCTET STRING is expected.
DER forbids indefinite length encoding on the grounds that the serialized representation of the same data may change on re-encoding (due to a potentially changing chunk size).

How should the 'System Use' field be interpreted in a 'Directory Record'?

In the ECMA-119 specification (freely available here), I am trying to understand how to fetch the content of the System Use field:
How is one supposed to compute the length of the System Use field, i.e. how is the value of LEN_SU in the left column found?
The value of LEN_SU is given implicitly. From BP1 you know the total number of bytes in the directory record (LEN_DR). LEN_SU is then given (implicitly) as the bytes remaining in the directory record after 33+LEN_FI+possible_padding(1), where you get length LEN_FI from BP33.
Hence
LEN_SU = LEN_DR - (33+LEN_FI+possible_padding(1))
From the spec.:
Padding Field [BP (34 + LEN_FI)]
This field shall be present in the
Directory Record only if the number in the Length of the File
Identifier field is an even number.
System Use [BP (LEN_DR - LEN_SU + 1) to LEN_DR]
This field shall be
optional. If present, this field shall be reserved for system use. Its
content is not specified by this Standard. If necessary, so that the
Directory Record comprises an even number of bytes, a (00) byte shall
be added to terminate this field.
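Putting the rule into code, a minimal sketch (assuming LEN_DR and LEN_FI have already been read from BP1 and BP33 of the directory record; the numbers in the example are made up):

def system_use_length(len_dr, len_fi):
    # ECMA-119 directory record: 33 fixed bytes, then the file identifier
    # (LEN_FI bytes), then one padding byte iff LEN_FI is an even number.
    padding = 1 if len_fi % 2 == 0 else 0
    return len_dr - (33 + len_fi + padding)

print(system_use_length(52, 10))  # 52 - (33 + 10 + 1) = 8 bytes of System Use data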

How can I convert the tiger hash values from the official implementations into the form used by Direct Connect?

I am trying to implement a Direct Connect Client, and I am currently stuck at a point where I need to hash the files in order to be able to upload them to other clients.
All the other clients require TTHL (Tiger Tree Hashing Leaves) support for verification of the downloaded data, so I searched for implementations of the algorithm and found tiger-hash-python.
I have implemented a routine that uses the hash function from before, and is able to hash large files, according to the logic specified in Tree Hash EXchange format (THEX) (basically, the tree diagram is the important part on that page).
However, the value it produces is like those shown on Wikipedia (a hex digest), but different from those shown in the DC clients I'm using for reference.
I have been unable to find out how the hex digest form is converted to this other one (39 characters, A-Z, 0-9). Could someone please explain how that is done?
Well ... I tried what Paulo Ebermann said, using the following functions:
import base64
import math
import tiger  # the aforementioned tiger-hash-python script

def strdivide(seq, length):
    result = []
    # Calculate how many blocks there are, using the condition: i*length = len(seq).
    # The extra maths deals with the last block, which might be smaller than `length`.
    for i in range(0, int(math.ceil(float(len(seq)) / length))):
        result.append(seq[i*length:(i+1)*length])
    return result

def dchash(data):
    result = tiger.hash(data)  # 48-character hex digest from tiger-hash-python
    result = "".join(["".join(strdivide(result[i:i+16], 2)[::-1]) for i in range(0, 48, 16)])  # byte-swap each 8-byte word
    bits = "".join([chr(int(c, 16)) for c in strdivide(result, 2)])  # every 2 hex characters -> 1 raw byte (Python 2 str)
    result = base64.b32encode(bits)  # base32 encoding, 40 characters
    return result[:-1]  # drop the trailing '=' padding
The TTH for an empty file was found to be 8B630E030AD09E5D0E90FB246A3A75DBB6256C3EE7B8635A, which after the transformation specified here, becomes 5D9ED00A030E638BDB753A6A24FB900E5A63B8E73E6C25B6. Base-32 encoding this result yielded LWPNACQDBZRYXW3VHJVCJ64QBZNGHOHHHZWCLNQ, which was found to be what DC++ generates.
The only mention of the format of the hash in the Direct Connect protocol I found is on the $SR page on the NMDC Protocol wiki:
For files containing TTH, the <hub_name> parameter is replaced with TTH:<base32_encoded_tth_hash> (ref: TTH_Hash).
So, it is Base32-encoding. This is defined in RFC 4648 (and some earlier ones), section 6.
Basically, you use the capital letters A-Z and the decimal digits 2 to 7, and one base32 digit represents 5 bits, while one base16 (hexadecimal) digit represents only 4.
This means each 5 hex digits map to 4 base32 digits, and for a Tiger hash (192 bits) you will need 39 base32 digits (in the official encoding the output is padded with a trailing = to 40 characters, which is apparently omitted if you say that there are always 39 characters).
I'm not sure of an implementation of a conversion from hex (or bytes) to base32, but it shouldn't be too complicated with a lookup table and some bit-shifting.
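For what it's worth, Python's standard library can do that conversion directly; a minimal sketch, reusing the byte-swapped digest of the empty file from the update above:

import base64

hex_digest = "5D9ED00A030E638BDB753A6A24FB900E5A63B8E73E6C25B6"  # byte-swapped Tiger digest of an empty file
raw = bytes.fromhex(hex_digest)                # 24 bytes = 192 bits
b32 = base64.b32encode(raw).decode("ascii")    # 40 characters, the last one being '='
print(b32.rstrip("="))                         # LWPNACQDBZRYXW3VHJVCJ64QBZNGHOHHHZWCLNQ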

Why does a base64 encoded string have an = sign at the end

I know what base64 encoding is and how to calculate base64 encoding in C#; however, I have seen several times that when I convert a string into base64, there is an = at the end.
A few questions came up:
Does a base64 string always end with =?
Why does an = get appended at the end?
Q: Does a base64 string always end with =?
A: No. (The word usb is base64-encoded as dXNi.)
Q: Why does an = get appended at the end?
A: As a short answer:
The = sign is added only as padding at the final step of encoding a message whose length is not a multiple of three bytes.
You will not have an = sign if your string length is a multiple of 3 characters, because Base64 encoding takes each group of three bytes (one character = 1 byte) and represents it as four printable characters in the ASCII standard.
Example:
(a) If you want to encode
ABCDEFG <=> [ABC] [DEF] [G]
Base64 deals with the first block (producing 4 characters) and the second (as they are complete). But for the third, it will add a double == in the output in order to complete the 4 needed characters. Thus, the result will be QUJD REVG Rw== (without spaces).
[ABC] => QUJD
[DEF] => REVG
[G] => Rw==
(b) If you want to encode ABCDEFGH <=> [ABC] [DEF] [GH]
similarly, it will add one = at the end of the output to get 4 characters.
The result will be QUJD REVG R0g= (without spaces).
[ABC] => QUJD
[DEF] => REVG
[GH] => R0g=
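The same pattern is easy to verify with, for example, Python's standard base64 module (ABCDEFGHI is added here only to show the no-padding case):

import base64

for s in (b"ABCDEFG", b"ABCDEFGH", b"ABCDEFGHI"):
    print(len(s) % 3, base64.b64encode(s).decode())
# 1 QUJDREVGRw==   (one leftover byte  -> two '=' pads)
# 2 QUJDREVGR0g=   (two leftover bytes -> one '=' pad)
# 0 QUJDREVGR0hJ   (multiple of three  -> no padding)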
It serves as padding.
A more complete answer is that a base64 encoded string doesn't always end with an =; it will only end with one or two = signs if they are required to pad the string out to the proper length.
From Wikipedia:
The final '==' sequence indicates that the last group contained only one byte, and '=' indicates that it contained two bytes.
Thus, this is some sort of padding.
It's defined in RFC 2045 as a special padding character used when fewer than 24 bits are available at the end of the data being encoded.
No.
To pad the Base64-encoded string to a multiple of 4 characters in length, so that it can be decoded correctly.
The equals sign (=) is used as padding in certain forms of base64 encoding. The Wikipedia article on base64 has all the details.
It's padding. From http://en.wikipedia.org/wiki/Base64:
In theory, the padding character is not needed for decoding, since the
number of missing bytes can be calculated from the number of Base64
digits. In some implementations, the padding character is mandatory,
while for others it is not used. One case in which padding characters
are required is concatenating multiple Base64 encoded files.
http://www.hcidata.info/base64.htm
Encoding "Mary had" to Base 64
In this example we are using a simple text string ("Mary had") but the principle holds no matter what the data is (e.g. graphics file). To convert each 24 bits of input data to 32 bits of output, Base 64 encoding splits the 24 bits into 4 chunks of 6 bits. The first problem we notice is that "Mary had" is not a multiple of 3 bytes - it is 8 bytes long. Because of this, the last group of bits is only 4 bits long. To remedy this we add two extra bits of '0' and remember this fact by putting a '=' at the end. If the text string to be converted to Base 64 was 7 bytes long, the last group would have had 2 bits. In this case we would have added four extra bits of '0' and remember this fact by putting '==' at the end.
= is a padding character. If the input stream has a length that is not a multiple of 3, the padding character will be added. This is required by the decoder: without padding, the last byte would have an incorrect number of zero bits.
Better and deeper explanation here: https://base64tool.com/detect-whether-provided-string-is-base64-or-not/
The equals or double equals serves as padding. It's a stupid concept defined in RFC 2045 and it is actually superfluous. Any decent parser can encode and decode a base64 string without knowing about padding by just counting the number of characters and filling in the rest if the size isn't divisible by 3 or 4 respectively. This actually leads to difficulties every now and then, because some parsers expect padding while others blatantly ignore it. My MPU base64 decoder, for example, needs padding, but it receives a non-padded base64 string over the network. This leads to erroneous parsing, and I had to account for it myself.
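For what it's worth, the kind of tolerance described above is a one-liner in Python (a small sketch with a hypothetical b64decode_nopad helper):

import base64

def b64decode_nopad(s):
    # Restore whatever padding the sender stripped: the input length must be a multiple of 4.
    return base64.b64decode(s + "=" * (-len(s) % 4))

print(b64decode_nopad("QUJDREVGRw"))  # b'ABCDEFG'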