What do the numbers in a multi-part email mean? - email

I'm looking at the source of a multi-part message from Thunderbird (in hopes of writing my own multi-part message from C++/Javascript)
I was wondering what the follow means (the part between the text-only part and the html part of the email) and how I might calculate it for my own program to generate a multi-part email:
This is a multi-part message in MIME format.
------=_NextPart_32252.1057009685.31.001
Content-Type: multipart/alternative;
boundary="----=_NextPart_32252.1057009685.31.002"
Content-Description: Message in alternative text and HTML forms
------=_NextPart_32252.1057009685.31.002
(as seen here)
The rest of the message code makes sense to me for the post part.

The numbers you are seeing within the boundary delimiters don't necessarily mean anything (although the RFC doesn't preclude an implementor from trying to include some meaning).
They must be unique and not contained within the part that they encapsulate.
From RFC 2046:
5.1. Multipart Media Type
In the case of multipart entities,
in which one or more different sets
of data are combined in a single body,
a "multipart" media type field must
appear in the entity's header. The
body must then contain one or more
body parts, each preceded by a
boundary delimiter line...
As stated previously, each body part is preceded by a boundary
delimiter line that contains the boundary delimiter. The boundary
delimiter MUST NOT appear inside any of the encapsulated parts, on a
line by itself or as the prefix of any line...
...
5.1.1. Common Syntax
The Content-Type field for
multipart entities requires one
parameter, "boundary". The boundary
delimiter line is then defined as a
line consisting entirely of two
hyphen characters ("-", decimal value
45) followed by the boundary
parameter value from the Content-Type
header field, optional linear
whitespace, and a terminating CRLF.
...
NOTE: Because boundary delimiters must not appear in the body parts
being encapsulated, a user agent must exercise care to choose a
unique boundary parameter value. The boundary parameter value
[could be] the result of an algorithm designed to
produce boundary delimiters with a very low probability of already
existing in the data to be encapsulated without having to prescan the
data. ... The
simplest boundary delimiter line possible is something like "---",
with a closing boundary delimiter line of "-----".

They don't mean anything. They are just a random string that does not occur within the body of the email. They are just used to mark where the embedded message starts and stops.

Related

Garbage in Thunderbird multipart boundaries

In messages I send with Thunderbird I have the following email header
Content-Type: multipart/alternative;
boundary="000_160551222008756181axiscom"
All the history is divided by lines looking like
--000_160551222008756181axiscom
Those initial -- mess things up for other mail viewers like Outlook's webmail client. Where does those -- come from, and how do I get rid of them?
I'm using version 78.4.2 (64-bit) of Thunderbird.
The hyphens come from RFC2046. You don't normally get rid of them. Look for the problem elsewhere.
The Content-Type field for multipart entities requires one parameter,
"boundary". The boundary delimiter line is then defined as a line
consisting entirely of two hyphen characters ("-", decimal value 45)
followed by the boundary parameter value from the Content-Type header
field, optional linear whitespace, and a terminating CRLF.

Maximum length of reply-to header

Is there a maximum length for the reply-to header field? The maximum length for a e-mail address is defined but in reply-to you can add several addresses.
Regarding latest RFC 5322 documentation
2.2.3. Long Header Fields
Each header field is logically a single line of characters
comprising the field name, the colon, and the field body. For
convenience however, and to deal with the 998/78 character
limitations per line, the field body portion of a header field can
be split into a multiple-line representation; this is called
"folding". The general rule is that wherever this specification
allows for folding white space (not simply WSP characters), a CRLF
may be inserted before any WSP.

Are newlines in MIME headers using encoded-words legal?

RFC 2047 defines the encoded-words mechanism for encoding non-ASCII character in MIME documents. It specifies that whitespace characters (space and tabs) are not allowed inside the encoded-word.
However, RFC 5322 for parsing email MIME documents specifies that long header lines should be "folded". Should this folding take place before or after encoded-words decoding?
I recently received an email where encoded-text part of the header had a newline in it, like this:
Header: =?UTF-8?Q?=C3=A5
=C3=A4?=
Would this be valid?
Of course emails can be invalid in lots of exciting ways and the parser needs to handle that, but it's interesting to know the "correct" way. :)
I misread the question and answered as if it was a different sort of whitespace. In this case the white space appears inside the MIME word, not multiple ones separated by white space.
This sort of thing is explicitly disallowed. From the introduction to the format in RFC2047:
2. Syntax of encoded-words
An 'encoded-word' is defined by the following ABNF grammar. The
notation of RFC 822 is used, with the exception that white space
characters MUST NOT appear between components of an 'encoded-word'.
And then later on in the same section:
IMPORTANT: 'encoded-word's are designed to be recognized as 'atom's
by an RFC 822 parser. As a consequence, unencoded white space
characters (such as SPACE and HTAB) are FORBIDDEN within an
'encoded-word'. For example, the character sequence
=?iso-8859-1?q?this is some text?=
would be parsed as four 'atom's, rather than as a single 'atom' (by
an RFC 822 parser) or 'encoded-word' (by a parser which understands
'encoded-words'). The correct way to encode the string "this is some
text" is to encode the SPACE characters as well, e.g.
=?iso-8859-1?q?this=20is=20some=20text?=
The characters which may appear in 'encoded-text' are further
restricted by the rules in section 5.
Earlier answer
This sort of thing is explicitly allowed. Headers with MIME words should be 76 characters or less and folded if needed. RFC822 folded headers are indented second and any additional lines. RFC2047 headers are supposed to only indent one space. The whitespace between ?= on the first line and =? should be suppressed from output.
See the example on the bottom of page 12 of the RFC:
encoded form displayed as
---------------------------------------------------------------------
(=?ISO-8859-1?Q?a?= (ab)
=?ISO-8859-1?Q?b?=)
Any amount of linear-space-white between 'encoded-word's,
even if it includes a CRLF followed by one or more SPACEs,
is ignored for the purposes of display.

What if the form-data boundary is contained in the attached file?

Let's take the following example of multipart/form-data taken from w3.com:
Content-Type: multipart/form-data; boundary=AaB03x
--AaB03x
Content-Disposition: form-data; name="submit-name"
Larry
--AaB03x
Content-Disposition: form-data; name="files"; filename="file1.txt"
Content-Type: text/plain
... contents of file1.txt ...
--AaB03x--
It's pretty straight forward, but let's say you are writing code that implements this and creates such a request from scratch. Let's assume file1.txt is created by a user, and we have no control over its contents.
What if the text file file1.txt contains the string --AaB03x? You likely generated the boundary AaB03x randomly, but let's assume a "million monkeys entering a million web forms" scenario.
Is there a standard way of dealing with this improbably but still possible situation?
Should the text/plain (or even, potentially something like image/jpeg or application/octet-stream) be "encoded" or some of the information within "escaped" in some sort of way?
Or should the developer always search the contents of the file for the boundary, and then repeatedly keep picking a new randomly generated boundary until the chosen string cannot be found within the file?
HTTP delegates to the MIME RFCs for defining the multipart/ types here. The rules are laid out in RFC 2046 section 5.1.
The RFC simply states the boundary must not appear:
The boundary
delimiter MUST NOT appear inside any of the encapsulated parts, on a
line by itself or as the prefix of any line. This implies that it is
crucial that the composing agent be able to choose and specify a
unique boundary parameter value that does not contain the boundary
parameter value of an enclosing multipart as a prefix.
and
NOTE: Because boundary delimiters must not appear in the body parts
being encapsulated, a user agent must exercise care to choose a
unique boundary parameter value. The boundary parameter value in the
example above could have been the result of an algorithm designed to
produce boundary delimiters with a very low probability of already
existing in the data to be encapsulated without having to prescan the
data. Alternate algorithms might result in more "readable" boundary
delimiters for a recipient with an old user agent, but would require
more attention to the possibility that the boundary delimiter might
appear at the beginning of some line in the encapsulated part. The
simplest boundary delimiter line possible is something like "---",
with a closing boundary delimiter line of "-----".
Most MIME software simply generates a random boundary such that the probability of that boundary appearing in the parts is statistically unlikely; e.g. a collision could happen but the probability of that ever happening is so low as to be infeasible. Computer UUID values rely on the same principles; if you generate a few trillion UUIDs in a year, the probability of generating two identical UUID values is about the same as someone being hit by a meteorite, both have a 1 in 17 billion chance.
Note that you usually encode binary data to some form of ASCII-safe encoding like base64, an encoding that doesn't include dashes, removing the likelihood that binary data ever contains the boundary.
As such, the standard way to deal with the possibility is to simply make the possibility so unlikely as to be next to nothing. If you have a greater chance of a computer storing the email being hit by a meteorite, why worry about the MIME boundary?

What is the email subject length limit?

How many characters are allowed to be in the subject line of Internet email?
I had a scan of The RFC for email but could not see specifically how long it was allowed to be.
I have a colleague that wants to programmatically validate for it.
If there is no formal limit, what is a good length in practice to suggest?
See RFC 2822, section 2.1.1 to start.
There are two limits that this
standard places on the number of
characters in a line. Each line of
characters MUST be no more than 998
characters, and SHOULD be no more than
78 characters, excluding the CRLF.
As the RFC states later, you can work around this limit (not that you should) by folding the subject over multiple lines.
Each header field is logically a
single line of characters comprising
the field name, the colon, and the
field body. For convenience however,
and to deal with the 998/78 character
limitations per line, the field body
portion of a header field can be split
into a multiple line representation;
this is called "folding". The general
rule is that wherever this standard
allows for folding white space (not
simply WSP characters), a CRLF may be
inserted before any WSP. For
example, the header field:
Subject: This is a test
can be represented as:
Subject: This
is a test
The recommendation for no more than 78 characters in the subject header sounds reasonable. No one wants to scroll to see the entire subject line, and something important might get cut off on the right.
RFC2322 states that the subject header "has no length restriction"
but to produce long headers but you need to split it across multiple lines, a process called "folding".
subject is defined as "unstructured" in RFC 5322
here's some quotes ([...] indicate stuff i omitted)
3.6.5. Informational Fields
The informational fields are all optional. The "Subject:" and
"Comments:" fields are unstructured fields as defined in section
2.2.1, [...]
2.2.1. Unstructured Header Field Bodies
Some field bodies in this specification are defined simply as
"unstructured" (which is specified in section 3.2.5 as any printable
US-ASCII characters plus white space characters) with no further
restrictions. These are referred to as unstructured field bodies.
Semantically, unstructured field bodies are simply to be treated as a
single line of characters with no further processing (except for
"folding" and "unfolding" as described in section 2.2.3).
2.2.3 [...] An unfolded header field has no length restriction and
therefore may be indeterminately long.
after some test: If you send an email to an outlook client, and the subject is >77 chars, and it needs to use "=?ISO" inside the subject (in my case because of accents) then OutLook will "cut" the subject in the middle of it and mesh it all that comes after, including body text, attaches, etc... all a mesh!
I have several examples like this one:
Subject: =?ISO-8859-1?Q?Actas de la obra N=BA.20100154 (Expediente N=BA.20100182) "NUEVA RED FERROVIARIA.=
TRAMO=20BEASAIN=20OESTE(Pedido=20PC10/00123-125),=20BEASAIN".?=
To:
As you see, in the subject line it cutted on char 78 with a "=" followed by 2 or 3 line feeds, then continued with the rest of the subject baddly.
It was reported to me from several customers who all where using OutLook, other email clients deal with those subjects ok.
If you have no ISO on it, it doesn't hurt, but if you add it to your subject to be nice to RFC, then you get this surprise from OutLook. Bit if you don't add the ISOs, then iPhone email will not understand it(and attach files with names using such characters will not work on iPhones).
Limits in the context of Unicode multi-byte character capabilities
While RFC5322 defines a limit of 1000 (998 + CRLF) characters, it does so in the context of headers limited to ASCII characters only.
RFC 6532 explains how to handle multi-byte Unicode characters.
Section 3.4 ( Effects on Line Length Limits ) states:
Section 2.1.1 of [RFC5322] limits lines to 998 characters and
recommends that the lines be restricted to only 78 characters. This
specification changes the former limit to 998 octets. (Note that, in
ASCII, octets and characters are effectively the same, but this is
not true in UTF-8.) The 78-character limit remains defined in terms
of characters, not octets, since it is intended to address display
width issues, not line-length issues.
So for example, because you are limited to 998 octets, you can't have 998 smiley faces in your subject line as each emoji of this type is 4 octets.
Using PHP to demonstrate:
Run php -a for an interactive terminal.
// Multi-byte string length:
var_export(mb_strlen("\u{0001F602}",'UTF-8'));
// 1
// ASCII string length:
var_export(strlen("\u{0001F602}"));
// 4
// ASCII substring of four octet character:
var_export(substr("\u{0001F602}",0,4));
// '😂'
// ASCI substring of four octet character truncated to 3 octets, mutating character:
var_export(substr("\u{0001F602}",0,3));
// 'â–’'
I don't believe that there is a formal limit here, and I'm pretty sure there isn't any hard limit specified in the RFC either, as you found.
I think that some pretty common limitations for subject lines in general (not just e-mail) are:
80 Characters
128 Characters
256 Characters
Obviously, you want to come up with something that is reasonable. If you're writing an e-mail client, you may want to go with something like 256 characters, and obviously test thoroughly against big commercial servers out there to make sure they serve your mail correctly.
Hope this helps!
What's important is which mechanism you are using the send the email. Most modern libraries (i.e. System.Net.Mail) will hide the folding from you. You just put a very long email subject line in without (CR,LF,HTAB). If you start trying to do your own folding all bets are off. It will start reporting errors. So if you are having this issue just filter out the CR,LF,HTAB and let the library do the work for you. You can usually also set the encoding text type as a separate field. No need for iso encoding in the subject line.