What if the form-data boundary is contained in the attached file? - forms

Let's take the following example of multipart/form-data, taken from the W3C (w3.org):
Content-Type: multipart/form-data; boundary=AaB03x

--AaB03x
Content-Disposition: form-data; name="submit-name"

Larry
--AaB03x
Content-Disposition: form-data; name="files"; filename="file1.txt"
Content-Type: text/plain

... contents of file1.txt ...
--AaB03x--
It's pretty straightforward, but let's say you are writing code that implements this and creates such a request from scratch. Let's assume file1.txt is created by a user, and we have no control over its contents.
What if the text file file1.txt contains the string --AaB03x? You likely generated the boundary AaB03x randomly, but let's assume a "million monkeys entering a million web forms" scenario.
Is there a standard way of dealing with this improbable but still possible situation?
Should the text/plain (or even, potentially something like image/jpeg or application/octet-stream) be "encoded" or some of the information within "escaped" in some sort of way?
Or should the developer always search the contents of the file for the boundary, and then repeatedly keep picking a new randomly generated boundary until the chosen string cannot be found within the file?

HTTP delegates to the MIME RFCs for defining the multipart/ types here. The rules are laid out in RFC 2046 section 5.1.
The RFC simply states the boundary must not appear:
The boundary
delimiter MUST NOT appear inside any of the encapsulated parts, on a
line by itself or as the prefix of any line. This implies that it is
crucial that the composing agent be able to choose and specify a
unique boundary parameter value that does not contain the boundary
parameter value of an enclosing multipart as a prefix.
and
NOTE: Because boundary delimiters must not appear in the body parts
being encapsulated, a user agent must exercise care to choose a
unique boundary parameter value. The boundary parameter value in the
example above could have been the result of an algorithm designed to
produce boundary delimiters with a very low probability of already
existing in the data to be encapsulated without having to prescan the
data. Alternate algorithms might result in more "readable" boundary
delimiters for a recipient with an old user agent, but would require
more attention to the possibility that the boundary delimiter might
appear at the beginning of some line in the encapsulated part. The
simplest boundary delimiter line possible is something like "---",
with a closing boundary delimiter line of "-----".
Most MIME software simply generates a random boundary long enough that the probability of it appearing in the parts is negligible; that is, a collision could happen, but it is so unlikely that it can be ignored in practice. UUIDs rely on the same principle: if you generate a few trillion UUIDs in a year, the probability of generating two identical values is about the same as that of a given person being hit by a meteorite, roughly 1 in 17 billion.
Note that in email you usually encode binary data with an ASCII-safe encoding such as base64, whose alphabet doesn't include dashes, so base64-encoded data can never contain the boundary delimiter (which starts with two dashes). Browsers posting multipart/form-data over HTTP usually send file contents as-is, so there the random boundary is the only safeguard.
As such, the standard way to deal with the possibility is to simply make the possibility so unlikely as to be next to nothing. If you have a greater chance of a computer storing the email being hit by a meteorite, why worry about the MIME boundary?
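For anyone who would still rather prescan than rely on probability alone, here is a minimal Python sketch of the "pick a new boundary until it is not found" approach from the question. The names make_boundary and encode_form_data, the "----FormBoundary" prefix and the new_token hook are made up for this illustration. The check is deliberately stricter than the RFC requires (it rejects the delimiter anywhere in a payload, not just at the start of a line), and field names and filenames are not escaped.

```python
import secrets


def make_boundary(payloads, prefix="----FormBoundary", new_token=None):
    """Return a random boundary whose delimiter occurs in none of the payloads."""
    # new_token is a hypothetical hook so the retry loop can be exercised in tests.
    new_token = new_token or (lambda: secrets.token_hex(16))
    while True:
        boundary = prefix + new_token()
        delimiter = b"--" + boundary.encode("ascii")
        # Stricter than RFC 2046 needs: reject the delimiter anywhere in the data.
        if not any(delimiter in payload for payload in payloads):
            return boundary


def encode_form_data(fields, files):
    """fields: {name: str}; files: {name: (filename, content_type, bytes)}."""
    payloads = [value.encode("utf-8") for value in fields.values()]
    payloads += [content for (_filename, _ctype, content) in files.values()]
    boundary = make_boundary(payloads)
    delimiter = b"--" + boundary.encode("ascii")

    chunks = []
    for name, value in fields.items():
        chunks += [
            delimiter,
            f'Content-Disposition: form-data; name="{name}"'.encode("utf-8"),
            b"",  # blank line separates part headers from part body
            value.encode("utf-8"),
        ]
    for name, (filename, content_type, content) in files.items():
        disposition = f'Content-Disposition: form-data; name="{name}"; filename="{filename}"'
        chunks += [
            delimiter,
            disposition.encode("utf-8"),
            f"Content-Type: {content_type}".encode("ascii"),
            b"",
            content,
        ]
    chunks += [delimiter + b"--", b""]  # closing delimiter plus final CRLF
    return f"multipart/form-data; boundary={boundary}", b"\r\n".join(chunks)
```

The first return value is meant for the Content-Type request header and the second is the request body. The prescan costs one pass over the data, which is the trade-off the RFC note describes.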

Related

Garbage in Thunderbird multipart boundaries

In messages I send with Thunderbird I have the following email header
Content-Type: multipart/alternative;
boundary="000_160551222008756181axiscom"
All the history is divided by lines looking like
--000_160551222008756181axiscom
Those initial -- mess things up for other mail viewers like Outlook's webmail client. Where do those -- come from, and how do I get rid of them?
I'm using version 78.4.2 (64-bit) of Thunderbird.
The hyphens come from RFC2046. You don't normally get rid of them. Look for the problem elsewhere.
The Content-Type field for multipart entities requires one parameter,
"boundary". The boundary delimiter line is then defined as a line
consisting entirely of two hyphen characters ("-", decimal value 45)
followed by the boundary parameter value from the Content-Type header
field, optional linear whitespace, and a terminating CRLF.
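To make the quoted rule concrete, here is a rough Python sketch of how a reader could use those "--" lines to find the parts. The function name split_multipart is invented for this example, and it ignores nested multiparts and other details a real MIME parser has to handle.

```python
def split_multipart(body: bytes, boundary: str) -> list[bytes]:
    """Split a multipart body into its raw parts using the delimiter lines."""
    delimiter = b"--" + boundary.encode("ascii")  # "--" + boundary parameter value
    closing = delimiter + b"--"                   # the last delimiter has a trailing "--"
    parts = []
    current = None  # None while we are still in the preamble before the first delimiter

    for line in body.split(b"\r\n"):
        stripped = line.rstrip(b" \t")  # optional linear whitespace is allowed after it
        if stripped == closing:
            if current is not None:
                parts.append(b"\r\n".join(current))
            return parts  # anything after the closing delimiter is epilogue
        if stripped == delimiter:
            if current is not None:
                parts.append(b"\r\n".join(current))
            current = []
        elif current is not None:
            current.append(line)

    # Tolerate a missing closing delimiter by returning what was collected.
    if current is not None:
        parts.append(b"\r\n".join(current))
    return parts
```

Each returned part still contains its own headers, a blank line, and then its content. The point is only that the leading hyphens are what a receiving client looks for, so they are not garbage.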

UTF-8 encoding in emails, parsing the body

So I don't really want this question to be language specific, however I suspect Go (my language choice) is playing a part here.
I'm trying to find a string within the body of a raw email. To do so, I am getting the encoding, and in the majority of cases it is quoted-printable.
OK, so that's fine: I am encoding my search query as quoted-printable and then doing a search for it. That works.
However, in one specific case the raw email I see in Gmail looks fine, but when I retrieve the raw email from the Gmail API, although the encoding and everything else is identical, it's encoding the " as =22.
Research shows me that's because the charset is utf-8.
I haven't quite got my head around whether that's encoded as utf-8 and then quoted-printable or the other way around, but that's not quite the question either...
If I look at the email where the " is =22, I see the charset is utf-8, and when I look at another where it's not encoded, the charset is UTF-8 (notice the case). I can't believe that the case here is what's causing this to happen, but it doesn't seem a robust enough way to work out whether =22 is actually =22 or is a " encoded as utf-8.
My original thought was to always decode the quoted-printable and then re-encode it before doing the search, but I don't think this is going to be a robust approach going forward and thought others might have a better suggestion.
In conclusion, I'm trying to find a string in a raw email, but the encoding is causing me problems getting my search string to match the encoding of the body.
The =22-type encoding actually has nothing to do with the charset (whether that is utf-8 lowercase or UTF-8 uppercase or any other charset).
It is the Content-Transfer-Encoding: quoted-printable encoding.
The quoted-printable encoding is just a way of hex-encoding octets, typically limited to octets that fall outside of the printable ASCII range. It seems odd that the DQUOTE character would be encoded in this way, but it's perfectly legal to do so.
If you want to search for strings in the body of the message, you'll need to first decode the body of the message. Otherwise you will not be successful.
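The question is about Go, but the idea is language-neutral, so here is a small sketch in Python using its standard email package: parse the message, let the library undo the Content-Transfer-Encoding and charset for each text part, and only then search. The function name body_contains is just for illustration, and the sketch assumes every declared charset is one Python knows about.

```python
import email
from email import policy


def body_contains(raw_message: bytes, needle: str) -> bool:
    """Search the decoded text parts of a raw message for needle."""
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        # get_content() reverses quoted-printable/base64 and decodes the charset,
        # so "=22" and a literal '"' both end up as the same character here.
        if needle in part.get_content():
            return True
    return False
```

With this approach it no longer matters whether the sender wrote " or =22, because the comparison happens on decoded text rather than on the wire format.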
I would recommend reading RFC 2045 at a minimum.
You may also end up needing to read RFC 2047 if you want to search headers at some point, but that gets... tricky due to various bugs that sending clients have.
Now that I've been "triggered" into a rant about MIME, let me explain why decoding headers is so hard to get right. I'm sure just about every developer who has ever worked on an email client could tell you this, but I guess I'm going to be the one to do it.
Here's just a short list of the problems every developer faces when they go to implement a decoder for headers which have been (theoretically) encoded according to the rfc2047 specification:
1. First off, there are technically two variations of header encoding formats specified by rfc2047 - one for phrases and one for unstructured text fields. They are very similar but you can't use the same rules for tokenizing them. I mention this because it seems that most MIME parsers miss this very subtle distinction and so, as you might imagine, do most MIME generators. Hell, most MIME generators probably never even heard of specifications to begin with it seems.
This brings us to:
2. There are so many variations of how MIME headers fail to be tokenizable according to the rules of rfc2822 and rfc2047. You'll encounter fun stuff such as:
a. encoded-word tokens illegally being embedded in other word tokens
b. encoded-word tokens containing illegal characters in them (such as spaces, line breaks, and more) effectively making it so that a tokenizer can no longer, well, tokenize them (at least not easily)
c. multi-byte character sequences being split between multiple encoded-word tokens which means that it's not possible to decode said encoded-word tokens individually
d. the payloads of encoded-word tokens being split up into multiple encoded-word tokens, often splitting in a location which makes it impossible to decode the payload in isolation
You can see some examples here.
3. Something that many developers seem to miss is the fact that each encoded-word token is allowed to be in different character encodings (you might have one token in UTF-8, another in ISO-8859-1 and yet another in koi8-r). Normally, this would be no big deal because you'd just decode each payload, then convert from the specified charset into UTF-8 via iconv() or something. However, due to the fun brokenness that I mentioned above in (2c) and (2d), this becomes more complicated.
If that isn't enough to make you want to throw your hands up in the air and mutter some profanities, there's more...
4. Undeclared 8bit text in headers. Yep. Some mailers just didn't get the memo that they are supposed to encode non-ASCII text. So now you get to have the fun experience of mixing and matching undeclared 8bit text of God-only-knows what charset along with the content of (probably broken) encoded-words.
If you want to see how to deal with these issues, you can take a look at how I did it using C in my GMime library, here: https://github.com/jstedfast/gmime/blob/master/gmime/gmime-utils.c#L1894 (in case line offsets change in the future, look for _g_mime_utils_header_decode_text() and the various internal methods it uses in that source file - I have written comments explaining how it deals with the above issues).
Or you can see how I did it using C# in my MimeKit library, here: https://github.com/jstedfast/MimeKit/blob/master/MimeKit/Utils/Rfc2047.cs
For more information about why & how dealing with email is hard, check out Joshua Cramer's blog series: http://quetzalcoatal.blogspot.com/search/label/email-hard
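For contrast with the libraries linked above, this is roughly what the well-behaved case looks like in Python with the standard email.header module. It is only a sketch of the easy path: it decodes each encoded-word with its own charset and joins the results, and it does nothing about the broken inputs described in the list (split multi-byte sequences, embedded tokens, undeclared 8bit text). The helper name decode_header_text is invented here.

```python
from email.header import decode_header


def decode_header_text(value: str) -> str:
    """Decode RFC 2047 encoded-words in a header value, token by token."""
    pieces = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Each encoded-word carries its own charset; unencoded runs have none.
            pieces.append(chunk.decode(charset or "ascii", errors="replace"))
        else:
            # decode_header returns plain str when nothing was encoded.
            pieces.append(chunk)
    return "".join(pieces)
```

Decoding each token separately is exactly what falls apart when a sender splits one multi-byte character across two encoded-words, which is why real-world decoders need the extra care described above.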

Parsing of HTTP Headers Values: Quoting, RFC 5987, MIME, etc

What confuses me is decoding of HTTP header values.
Example Header:
Some-Header: "quoted string?"; *utf-8'en'Weirdness
Can header values be quoted? What about the encoding of a " itself? Is ' a valid quote character? What's the significance of a semi-colon (;)? Could the value parser for an HTTP header be considered a MIME parser?
I am making a transparent proxy that needs to transparently handle and modify many in-the-wild header fields. That's why I need so much detail on the format.
Can header values be quoted?
If you mean does the RFC 5987 parameter production apply to the main part of the header value, then no.
Some-Header: "foo"; bar*=utf-8'en'bof
Here the main part of the header value would probably be "foo" including the quotes, but...
What's the significance of a semi-colon (;)?
The specific handling is defined for each named header separately. So semicolon is significant for, say, Content-Disposition, but not for Content-Length.
Obviously this is not a very satisfactory solution but that's what we're stuck with.
I am making a transparent proxy that needs to transparently handle and modify many in-the-wild header fields.
You can't handle these in a generic way, you have to know the form of each possible header. For anything you don't recognise, don't attempt to decompose the header value; and really, so little out there supports RFC 5987 at the moment, it's unlikely you'll be able to do much useful handling of it.
The status quo today is that non-ASCII characters in header values don't work well enough cross-browser to be used at all, either encoded or raw.
Luckily they are rarely needed. The only really common use case is non-ASCII filenames for Content-Disposition but that's easier to work around by putting the filename in a trailing URL path part instead.
Could the value parser for a HTTP header be considered a MIME parser?
No. HTTP borrows heavily from MIME and the RFC 822 family of standards in general, but it isn't part of the 822 family. It has its own low-level grammar for headers which looks like 822, but isn't quite compatible. Arbitrary MIME features can't be used in HTTP; there has to be a standardisation mechanism to drag them into HTTP explicitly, which is what RFC 5987 is, for (parts of) RFC 2231.
(See section 19.4 of RFC 2616 for discussion of some other differences.)
In theory, a multipart form submission is part of the 822 family and you should be able to use RFC 2231 encoding there. But the reality is browsers don't support that either.
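For headers where you do know that a parameter uses the RFC 5987 form (charset'language'percent-encoded-value), decoding the value itself is simple. Here is a minimal Python sketch; decode_ext_value is a made-up helper name, and it assumes you have already isolated the part after "name*=" using that header's own grammar.

```python
from urllib.parse import unquote


def decode_ext_value(ext_value: str) -> tuple[str, str, str]:
    """Split an RFC 5987 ext-value into (charset, language, decoded text)."""
    # The two single quotes are separators, not quoting characters.
    charset, language, encoded = ext_value.split("'", 2)
    text = unquote(encoded, encoding=charset, errors="strict")
    return charset, language, text
```

For example, decode_ext_value("UTF-8''%c2%a3%20and%20%e2%82%ac%20rates") yields the text "£ and € rates" with an empty language tag. Finding the parameter in the first place still depends on which header you are looking at, as described above.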

What do the numbers in a multi-part email mean?

I'm looking at the source of a multi-part message from Thunderbird (in hopes of writing my own multi-part message from C++/JavaScript).
I was wondering what the following means (the part between the text-only part and the HTML part of the email) and how I might calculate it for my own program to generate a multi-part email:
This is a multi-part message in MIME format.
------=_NextPart_32252.1057009685.31.001
Content-Type: multipart/alternative;
boundary="----=_NextPart_32252.1057009685.31.002"
Content-Description: Message in alternative text and HTML forms
------=_NextPart_32252.1057009685.31.002
(as seen here)
The rest of the message code makes sense to me for the most part.
The numbers you are seeing within the boundary delimiters don't necessarily mean anything (although the RFC doesn't preclude an implementor from trying to include some meaning).
They must be unique and not contained within the part that they encapsulate.
From RFC 2046:
5.1. Multipart Media Type
In the case of multipart entities,
in which one or more different sets
of data are combined in a single body,
a "multipart" media type field must
appear in the entity's header. The
body must then contain one or more
body parts, each preceded by a
boundary delimiter line...
As stated previously, each body part is preceded by a boundary
delimiter line that contains the boundary delimiter. The boundary
delimiter MUST NOT appear inside any of the encapsulated parts, on a
line by itself or as the prefix of any line...
...
5.1.1. Common Syntax
The Content-Type field for
multipart entities requires one
parameter, "boundary". The boundary
delimiter line is then defined as a
line consisting entirely of two
hyphen characters ("-", decimal value
45) followed by the boundary
parameter value from the Content-Type
header field, optional linear
whitespace, and a terminating CRLF.
...
NOTE: Because boundary delimiters must not appear in the body parts
being encapsulated, a user agent must exercise care to choose a
unique boundary parameter value. The boundary parameter value
[could be] the result of an algorithm designed to
produce boundary delimiters with a very low probability of already
existing in the data to be encapsulated without having to prescan the
data. ... The
simplest boundary delimiter line possible is something like "---",
with a closing boundary delimiter line of "-----".
They don't mean anything. They are just a random string that does not occur within the body of the email. They are just used to mark where the embedded message starts and stops.
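If you generate the message with a library, you normally never calculate this string yourself. As an illustration (in Python rather than C++/JavaScript, since its standard library makes this easy to show), the sketch below builds a text plus HTML message and lets the email package choose the boundary at serialization time. The function name build_alternative and the subject text are placeholders.

```python
import email
from email.message import EmailMessage


def build_alternative(text: str, html: str) -> str:
    """Serialize a multipart/alternative message; the library picks the boundary."""
    msg = EmailMessage()
    msg["Subject"] = "Example"  # placeholder subject for the illustration
    msg.set_content(text)                      # text/plain part
    msg.add_alternative(html, subtype="html")  # turns the message into multipart/alternative
    return msg.as_string()


raw = build_alternative("hello", "<p>hello</p>")
# Read the generated boundary back from the serialized Content-Type header.
boundary = email.message_from_string(raw).get_boundary()
```

The boundary changes on every run, and the only requirement is that "--" plus that value does not occur at the start of a line inside either part.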

What is the email subject length limit?

How many characters are allowed to be in the subject line of Internet email?
I had a scan of The RFC for email but could not see specifically how long it was allowed to be.
I have a colleague that wants to programmatically validate for it.
If there is no formal limit, what is a good length in practice to suggest?
See RFC 2822, section 2.1.1 to start.
There are two limits that this
standard places on the number of
characters in a line. Each line of
characters MUST be no more than 998
characters, and SHOULD be no more than
78 characters, excluding the CRLF.
As the RFC states later, you can work around this limit (not that you should) by folding the subject over multiple lines.
Each header field is logically a
single line of characters comprising
the field name, the colon, and the
field body. For convenience however,
and to deal with the 998/78 character
limitations per line, the field body
portion of a header field can be split
into a multiple line representation;
this is called "folding". The general
rule is that wherever this standard
allows for folding white space (not
simply WSP characters), a CRLF may be
inserted before any WSP. For
example, the header field:
Subject: This is a test
can be represented as:
Subject: This
 is a test
The recommendation for no more than 78 characters in the subject header sounds reasonable. No one wants to scroll to see the entire subject line, and something important might get cut off on the right.
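The reverse operation, unfolding, is just removing each CRLF that is immediately followed by whitespace. A tiny Python sketch, where unfold is a name chosen for this example:

```python
import re


def unfold(header_text: str) -> str:
    """Undo header folding: drop every CRLF that is followed by a space or tab."""
    # The whitespace itself is kept; only the line break in front of it is removed.
    return re.sub(r"\r\n(?=[ \t])", "", header_text)
```

So unfold("Subject: This\r\n is a test") gives back "Subject: This is a test", which is the logical single line the RFC talks about.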
RFC 5322 states that the subject header "has no length restriction",
but to produce long headers you need to split them across multiple lines, a process called "folding".
The subject is defined as "unstructured" in RFC 5322.
Here are some quotes ([...] indicates text I omitted):
3.6.5. Informational Fields
The informational fields are all optional. The "Subject:" and
"Comments:" fields are unstructured fields as defined in section
2.2.1, [...]
2.2.1. Unstructured Header Field Bodies
Some field bodies in this specification are defined simply as
"unstructured" (which is specified in section 3.2.5 as any printable
US-ASCII characters plus white space characters) with no further
restrictions. These are referred to as unstructured field bodies.
Semantically, unstructured field bodies are simply to be treated as a
single line of characters with no further processing (except for
"folding" and "unfolding" as described in section 2.2.3).
2.2.3 [...] An unfolded header field has no length restriction and
therefore may be indeterminately long.
After some testing: if you send an email to an Outlook client, the subject is longer than 77 characters, and it needs to use "=?ISO" encoding inside the subject (in my case because of accents), then Outlook will cut the subject in the middle and make a mess of everything that comes after it, including the body text, attachments, etc.
I have several examples like this one:
Subject: =?ISO-8859-1?Q?Actas de la obra N=BA.20100154 (Expediente N=BA.20100182) "NUEVA RED FERROVIARIA.=
TRAMO=20BEASAIN=20OESTE(Pedido=20PC10/00123-125),=20BEASAIN".?=
To:
As you can see, the subject line was cut at character 78 with a "=" followed by 2 or 3 line feeds, and the rest of the subject then continued badly.
This was reported to me by several customers who were all using Outlook; other email clients deal with those subjects fine.
If you don't have the ISO encoding in the subject, it doesn't hurt, but if you add it to be nice to the RFC, then you get this surprise from Outlook. But if you don't add the ISO encoding, then iPhone email will not understand it (and attached files with names using such characters will not work on iPhones).
Limits in the context of Unicode multi-byte character capabilities
While RFC 5322 defines a limit of 1000 (998 + CRLF) characters, it does so in the context of headers limited to ASCII characters only.
RFC 6532 explains how to handle multi-byte Unicode characters.
Section 3.4 ( Effects on Line Length Limits ) states:
Section 2.1.1 of [RFC5322] limits lines to 998 characters and
recommends that the lines be restricted to only 78 characters. This
specification changes the former limit to 998 octets. (Note that, in
ASCII, octets and characters are effectively the same, but this is
not true in UTF-8.) The 78-character limit remains defined in terms
of characters, not octets, since it is intended to address display
width issues, not line-length issues.
So for example, because you are limited to 998 octets, you can't have 998 smiley faces in your subject line as each emoji of this type is 4 octets.
Using PHP to demonstrate:
Run php -a for an interactive terminal.
// Character count (multi-byte aware):
var_export(mb_strlen("\u{0001F602}", 'UTF-8'));
// 1
// Octet count (strlen() counts bytes, not characters):
var_export(strlen("\u{0001F602}"));
// 4
// Byte-wise substring keeping all four octets of the character:
var_export(substr("\u{0001F602}", 0, 4));
// '😂'
// Byte-wise substring truncated to 3 octets, which corrupts the character:
var_export(substr("\u{0001F602}", 0, 3));
// '▒' (an incomplete UTF-8 sequence; the glyph shown depends on the terminal)
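If the goal is to validate a line programmatically, the two limits can be checked separately: octets for the hard limit, characters for the recommended one. A small Python sketch of that idea follows. The function name check_line is invented, and Python's len() counts code points, which is only an approximation of "characters" as displayed.

```python
def check_line(line: str) -> tuple[bool, bool]:
    """Return (within the 998-octet limit, within the 78-character recommendation).

    The line is expected without its trailing CRLF.
    """
    octets = len(line.encode("utf-8"))  # what RFC 6532 counts for the hard limit
    characters = len(line)              # code points, used for the display-width guideline
    return octets <= 998, characters <= 78
```

For instance, "Subject: " followed by 247 of the emoji above is 997 octets but 256 characters, so it passes the hard limit and fails the recommendation, while 248 of them is 1001 octets and fails both.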
I don't believe that there is a formal limit here, and I'm pretty sure there isn't any hard limit specified in the RFC either, as you found.
I think that some pretty common limitations for subject lines in general (not just e-mail) are:
80 Characters
128 Characters
256 Characters
Obviously, you want to come up with something that is reasonable. If you're writing an e-mail client, you may want to go with something like 256 characters, and obviously test thoroughly against big commercial servers out there to make sure they serve your mail correctly.
Hope this helps!
What's important is which mechanism you are using to send the email. Most modern libraries (e.g. System.Net.Mail) will hide the folding from you. You just pass in a very long subject line without any CR, LF or HTAB characters. If you start trying to do your own folding, all bets are off: the library will start reporting errors. So if you are having this issue, just filter out the CR, LF and HTAB characters and let the library do the work for you. You can usually also set the text encoding as a separate field, so there is no need for ISO encoding in the subject line.
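The same holds outside .NET. As a sketch in Python (not System.Net.Mail), the standard email package accepts a long unfolded subject and folds it when the message is serialized. The helper name serialize_with_subject and the sample body are placeholders.

```python
import email
from email import policy
from email.message import EmailMessage


def serialize_with_subject(subject: str) -> str:
    """Build a message with an unfolded subject and let the library fold it."""
    msg = EmailMessage()
    msg["Subject"] = subject  # pass the subject as one logical line, no CR/LF
    msg.set_content("body")
    return msg.as_string()


long_subject = " ".join(["word"] * 40)  # 199 characters, well over 78
raw = serialize_with_subject(long_subject)
# Parsing the serialized message unfolds the header again.
parsed_subject = email.message_from_string(raw, policy=policy.default)["Subject"]
```

In the serialized text the subject is spread over several lines that each start with a space, and parsing it back returns the original single-line string.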