Parsing of HTTP Header Values: Quoting, RFC 5987, MIME, etc. - unicode

What confuses me is decoding of HTTP header values.
Example Header:
Some-Header: "quoted string?"; *utf-8'en'Weirdness
Can header values be quoted? What about the encoding of a " itself? Is ' a valid quote character? What's the significance of a semi-colon (;)? Could the value parser for an HTTP header be considered a MIME parser?
I am making a transparent proxy that needs to transparently handle and modify many in-the-wild header fields. That's why I need so much detail on the format.

Can header values be quoted?
If you mean whether the RFC 5987 parameter production applies to the main part of the header value, then no.
Some-Header: "foo"; bar*=utf-8'en'bof
Here the main part of the header value would probably be "foo" including the quotes, but...
What's the significance of a semi-colon (;)?
The specific handling is defined for each named header separately. So semicolon is significant for, say, Content-Disposition, but not for Content-Length.
Obviously this is not a very satisfactory solution but that's what we're stuck with.
I am making a transparent proxy that needs to transparently handle and modify many in-the-wild header fields.
You can't handle these in a generic way; you have to know the form of each possible header. For anything you don't recognise, don't attempt to decompose the header value. And really, so little out there supports RFC 5987 at the moment that it's unlikely you'll be able to do much useful handling of it.
The status quo today is that non-ASCII characters in header values don't work well enough cross-browser to be used at all, either encoded or raw.
Luckily they are rarely needed. The only really common use case is non-ASCII filenames for Content-Disposition but that's easier to work around by putting the filename in a trailing URL path part instead.
Could the value parser for a HTTP header be considered a MIME parser?
No. HTTP borrows heavily from MIME and the RFC 822 family of standards in general, but it isn't part of the 822 family. It has its own low-level grammar for headers which looks like 822, but isn't quite compatible. Arbitrary MIME features can't be used in HTTP, there has to be a standardisation mechanism to drag them into HTTP explicitly—which is what RFC 5987 is, for (parts of) RFC 2231.
(See section 19.4 of RFC 2616 for discussion of some other differences.)
In theory, a multipart form submission is part of the 822 family and you should be able to use RFC 2231 encoding there. But the reality is browsers don't support that either.
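For illustration, here is a minimal Python sketch of decoding an RFC 5987-style ext-value (the charset'language'percent-encoded-value form shown above, e.g. utf-8'en'bof). decode_ext_value is a hypothetical helper; a real parser would also need to restrict the accepted charsets and validate the tokens.

from urllib.parse import unquote

def decode_ext_value(ext_value):
    # ext-value = charset "'" [ language ] "'" value-chars
    # e.g. "utf-8'en'bof" or "utf-8''na%C3%AFve.txt"
    charset, language, value = ext_value.split("'", 2)
    return unquote(value, encoding=charset or "utf-8"), language

print(decode_ext_value("utf-8'en'na%C3%AFve.txt"))   # ('naïve.txt', 'en')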

UTF-8 encoding in emails, parsing the body

So I don't really want this question to be language-specific; however, I suspect Go (my language of choice) is playing a part here.
I'm trying to find a string within the body of a raw email. To do so, I am getting the encoding, and the majority of cases are quoted-printable.
OK, so that's fine: I am encoding my search query as quoted-printable and then doing a search for it. That works.
However, in one specific case the raw email I see in Gmail looks fine, but when I retrieve the raw email from the Gmail API, although the encoding and everything else is identical, it encodes the " as =22.
Research shows me that's because the charset is utf-8.
I haven't quite got my head around whether that's encoded as utf-8 then quoted-printable or the other way around, but that's not quite the question either....
If I look at the email where the " is =22, I see the charset is utf-8, and when I look at another where it's not encoded, the charset is UTF-8 (notice the case). I can't believe that the case here is what's causing this to happen, but it doesn't seem a robust enough way to work out whether =22 is actually =22 or is a " encoded as utf-8.
My original thought was to always decode the quoted-printable and then re-encode it before doing the search, but I don't think this is going to be a robust approach going forward, and I thought others might have a better suggestion.
Conclusion: I'm trying to find a string in a raw email, but the encoding is causing me problems getting my search string to match the encoding of the body.
The =22-type encoding actually has nothing to do with the charset (whether that is utf-8 lowercase or UTF-8 uppercase or any other charset).
It is the Content-Transfer-Encoding: quoted-printable encoding.
The quoted-printable encoding is just a way of hex-encoding octets, typically limited to octets that fall outside of the printable ASCII range. It seems odd that the DQUOTE character would be encoded in this way, but it's perfectly legal to do so.
If you want to search for strings in the body of the message, you'll need to first decode the body of the message. Otherwise you will not be successful.
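As a minimal Python sketch of that decode-then-search approach (assuming the part's declared charset is UTF-8; a real implementation should take the charset from the part's Content-Type header):

import quopri

raw_body = b'He said =22hello=22 to the caf=C3=A9 owner.'

# Undo the Content-Transfer-Encoding first, then interpret the octets
# in the declared charset, and only then search.
decoded = quopri.decodestring(raw_body).decode('utf-8')

print('"hello"' in decoded)   # True
print('café' in decoded)      # True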
I would recommend reading rfc2045 at a minimum.
You may also need to read rfc2047 if you end up wanting to search headers at some point, but that gets... tricky due to various bugs that sending clients have.
Now that I've been "triggered" into a rant about MIME, let me explain why decoding headers is so hard to get right. I'm sure just about every developer who has ever worked on an email client could tell you this, but I guess I'm going to be the one to do it.
Here's just a short list of the problems every developer faces when they go to implement a decoder for headers which have been (theoretically) encoded according to the rfc2047 specification:
First off, there are technically two variations of header encoding formats specified by rfc2047 - one for phrases and one for unstructured text fields. They are very similar but you can't use the same rules for tokenizing them. I mention this because it seems that most MIME parsers miss this very subtle distinction and so, as you might imagine, do most MIME generators. Hell, most MIME generators have probably never even heard of the specifications to begin with, it seems.
This brings us to:
There are so many variations of how MIME headers fail to be tokenizable according to the rules of rfc2822 and rfc2047. You'll encounter fun stuff such as:
a. encoded-word tokens illegally being embedded in other word tokens
b. encoded-word tokens containing illegal characters in them (such as spaces, line breaks, and more) effectively making it so that a tokenizer can no longer, well, tokenize them (at least not easily)
c. multi-byte character sequences being split between multiple encoded-word tokens which means that it's not possible to decode said encoded-word tokens individually
d. the payloads of encoded-word tokens being split up into multiple encoded-word tokens, often splitting in a location which makes it impossible to decode the payload in isolation
You can see some examples here.
Something that many developers seem to miss is the fact that each encoded-word token is allowed to be in different character encodings (you might have one token in UTF-8, another in ISO-8859-1 and yet another in koi8-r). Normally, this would be no big deal because you'd just decode each payload, then convert from the specified charset into UTF-8 via iconv() or something. However, due to the fun brokenness that I mentioned above in (2c) and (2d), this becomes more complicated.
If that isn't enough to make you want to throw your hands up in the air and mutter some profanities, there's more...
Undeclared 8bit text in headers. Yep. Some mailers just didn't get the memo that they are supposed to encode non-ASCII text. So now you get to have the fun experience of mixing and matching undeclared 8bit text of God-only-knows what charset along with the content of (probably broken) encoded-words.
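To make the multi-charset point concrete, here is a small Python sketch using the standard library's email.header.decode_header; it handles this well-formed case, but not the broken inputs described in the list above.

from email.header import decode_header

# Two encoded-words in one header, each declaring a different charset.
raw = '=?utf-8?q?caf=C3=A9?= =?iso-8859-1?q?na=EFve?='

for payload, charset in decode_header(raw):
    if isinstance(payload, bytes):
        payload = payload.decode(charset or 'us-ascii')
    print(payload)   # café, then naïve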
If you want to see how to deal with these issues, you can take a look at how I did it using C in my GMime library, here: https://github.com/jstedfast/gmime/blob/master/gmime/gmime-utils.c#L1894 (in case line offsets change in the future, look for _g_mime_utils_header_decode_text() and the various internal methods it uses in that source file - I have written comments explaining how it deals with the above issues).
Or you can see how I did it using C# in my MimeKit library, here: https://github.com/jstedfast/MimeKit/blob/master/MimeKit/Utils/Rfc2047.cs
For more information about why & how dealing with email is hard, check out Joshua Cranmer's blog series: http://quetzalcoatal.blogspot.com/search/label/email-hard

Is it appropriate or necessary to use percent-encoding with HTTP Headers?

When I'm building RESTful clients and servers, is it appropriate or necessary to use percent-encoding with HTTP headers (request or response), or does this type of encoding just apply to URIs?
Basically No, but see below.
RFC2616 describes percent-encoding only for URIs (search for % or HEX HEX or percent) and it defines the field-value without mentioning percent-encoding.
However, RFC2616 allows arbitrary octets (except CTLs) in the header field value, and has a half-baked statement mentioning MIME encoding (RFC2047) for characters not in ISO-8859-1 (see the definition of TEXT in its Section 2.2). I called that statement "half-baked" because it does not explicitly state that ISO-8859-1 is the mandatory character set to be used for interpreting the octets, but despite that, it normatively requires the use of MIME encoding for characters outside of that character set. It seems that both the use of ISO-8859-1 and the MIME encoding of header field values are not widely supported.
HTTPbis seems to have given up on this, and goes back to US-ASCII for header field values. See this answer for details.
My reading of this is:
For standard header fields (those defined in RFC2616), percent-encoding is not permitted.
For extension header fields, percent-encoding is not described in RFC2616, but there is room for applying all kinds of encodings, including percent-encoding, as long as the resulting characters are US-ASCII (if you want to be future-proof). Just don't think you have to use percent-encoding.
Some more sources I found:
https://www.quora.com/Do-HTTP-headers-need-to-be-encoded confirms my understanding, although it is not specific about standard headers vs extension headers and does not quote a source.
https://support.ca.com/us/knowledge-base-articles.TEC1904612.html argues that the percent-encoding of extension headers in their product is a measure of protection against cross-site scripting attacks.
TL;DR: Octet percent-encoding and base64 encoding are fine.
Indicating Character Encoding and Language for HTTP Header Field Parameters
https://www.rfc-editor.org/rfc/rfc8187
This document specifies an encoding suitable for use in HTTP header
fields...
Read the "3.2.3. Examples"
base64 encoding is fine too, as read the HTTP Basic Authorziation spec: https://www.rfc-editor.org/rfc/rfc7617
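As a rough Python sketch of both options (an RFC 8187-style ext-value parameter and RFC 7617 Basic credentials); the header values here are only illustrative:

import base64
from urllib.parse import quote

filename = 'naïve résumé.pdf'

# RFC 8187 ext-value: charset ' language ' percent-encoded UTF-8 octets
content_disposition = "attachment; filename*=UTF-8''" + quote(filename, safe='')

# RFC 7617 Basic auth: base64 of "user:password"; the result is plain ASCII
authorization = 'Basic ' + base64.b64encode('user:pässword'.encode('utf-8')).decode('ascii')

print(content_disposition)
print(authorization)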

Request for update: Is there any "Best Practice" for communicating encoding of POST to REST service?

I'm creating a RESTful service where the client may be posting either some XML, JSON, or some unstructured text. Conceivably the client could post chinese characters, etc. There is this question that is nearly the same, Detecting the character encoding of an HTTP POST request, but it is four years old and I wanted to see if any "best practices" had coalesced.
EDIT: This is not for information posted from a form (web page) but for client applications, so the Content-Type of the POST request will be things like text/xml, text/plain, and maybe application/json.
For XML and JSON the best practice is to always encode in UTF-8. XML has mechanisms for other character sets if you really must not use UTF-8, starting with the charset parameter given on the MIME type and then the encoding attribute of the XML declaration.
The character set of a www form POST is always ASCII due to the embedded percent encoding, so charset declaration for application/x-www-form-urlencoded is unnecessary. In fact, specifying a charset for this MIME type is invalid.
So to get from:
0x6b65793d76254333254134254333254241254333254142
Into:
key=v%C3%A4%C3%BA%C3%AB
Using virtually any encoding will work the same because of ASCII compatibility.
You may notice the data is still encoded. The charset parameter of a request Content-Type only applies to the immediate binaries sent ("converting a sequence of octets into a sequence of characters" as they say in the specs), not to the mechanism used in turning key=v%C3%A4%C3%BA%C3%A into key=väúë, which actually involves converting characters into other characters.
The application/x-www-form-urlencoded scheme "specification" in HTML 4 is pretty useless, but HTML 5 actually tries. The ultimate default encoding of the percent-encoding is UTF-8, with the encoding name transferred in the _charset_ magic parameter if available.
So yeah, there still isn't a good, widely used formal way (and charset in the Content-Type is just invalid, wrong and misunderstood) to declare the character encoding for the embedded percent-encoding. In practice I would just use UTF-8 and, as it's a very strict scheme, fall back to ISO-8859-1 when that fails, because ISO-8859-1 can decode any byte sequence.
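A minimal Python sketch of that UTF-8-first, ISO-8859-1-fallback decode (assuming the value was percent-encoded as in the example above):

from urllib.parse import unquote_to_bytes

raw = 'key=v%C3%A4%C3%BA%C3%AB'
name, _, value = raw.partition('=')
octets = unquote_to_bytes(value)

# UTF-8 is strict, so try it first; ISO-8859-1 accepts any byte sequence.
try:
    text = octets.decode('utf-8')
except UnicodeDecodeError:
    text = octets.decode('iso-8859-1')

print(name, text)   # key väúë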
For JSON, using any encoding other than UTF-8/16/32 is invalid, with UTF-8 being assumed everywhere. For XML, you can read the Content-Type header, fall back to the encoding attribute, and ultimately fall back to UTF-8 and declare the body invalid if it doesn't decode.
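For the XML case, a rough Python sketch of that fallback order (Content-Type charset parameter, then the XML declaration's encoding attribute, then UTF-8). xml_body_charset is a hypothetical helper, not a complete implementation of the XML media type rules.

import re

def xml_body_charset(content_type, body_bytes):
    # 1. charset parameter on the Content-Type header
    m = re.search(r'charset="?([^";\s]+)"?', content_type or '', re.I)
    if m:
        return m.group(1)
    # 2. encoding attribute of the XML declaration
    m = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', body_bytes)
    if m:
        return m.group(1).decode('ascii')
    # 3. last resort
    return 'utf-8'

print(xml_body_charset('text/xml; charset=iso-8859-1', b'<?xml version="1.0"?><r/>'))
# iso-8859-1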

Are email headers case sensitive?

Are email headers case sensitive?
For example, is Content-Type different from Content-type?
According to RFC 5322, I don't see anything about case sensitivity. However, I'm seeing a problem with creating MIME messages using the PEAR Mail_mime module, and everything is pointing to the fact that our SMTP server uses Content-type and MIME-version instead of Content-Type and MIME-Version. I tried using another SMTP server (like GMail), but unfortunately our web servers are firewalled pretty tightly.
RFC 5322 does actually specify this, but it is very indirect.
Section 1.2.2 says:
This specification uses the Augmented
Backus-Naur Form (ABNF) [RFC5234]
notation for the formal definitions of
the syntax of messages.
In turn, Section 2.3 of RFC 5234 says:
NOTE:
ABNF strings are case insensitive and the character set for
these strings is US-ASCII.
So when RFC 5322 specifies a production rule like this:
from = "From:" mailbox-list CRLF
It is implicit that the "From:" is not case-sensitive.
[update]
As for Content-Type and MIME-Version, they are specified by the MIME spec (RFC 2045). That in turn refers to the BNF described by the original RFC 822, which (luckily) also makes it clear that these literal strings are case-insensitive.
Bottom line: According to the spec, Email headers are not case-sensitive, so it sounds like your mail server is buggy.
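So any header lookup has to compare field names case-insensitively; a trivial Python sketch (illustrative only):

headers = {'Content-type': 'text/plain; charset=utf-8', 'MIME-version': '1.0'}

def get_header(headers, name):
    # Field names are case-insensitive per the ABNF, so compare case-folded.
    for key, value in headers.items():
        if key.lower() == name.lower():
            return value
    return None

print(get_header(headers, 'Content-Type'))   # text/plain; charset=utf-8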

What is the proper way to URL encode Unicode characters?

I know of the non-standard %uxxxx scheme but that doesn't seem like a wise choice since the scheme has been rejected by the W3C.
Some interesting examples:
The heart character.
If I type this into my browser:
http://www.google.com/search?q=♥
Then copy and paste it, I see this URL
http://www.google.com/search?q=%E2%99%A5
which makes it seem like Firefox (or Safari) is doing this.
urllib.quote_plus(x.encode("latin-1"))
'%E2%99%A5'
which makes sense, except for things that can't be encoded in Latin-1, like the triple dot character.
…
If I type the URL
http://www.google.com/search?q=…
into my browser then copy and paste, I get
http://www.google.com/search?q=%E2%80%A6
back, which seems to be the result of doing
urllib.quote_plus(x.encode("utf-8"))
which makes sense since … can't be encoded with Latin-1.
But then it's not clear to me how the browser knows whether to decode with UTF-8 or Latin-1.
Since this seems to be ambiguous:
In [67]: u"…".encode('utf-8').decode('latin-1')
Out[67]: u'\xc3\xa2\xc2\x80\xc2\xa6'
works, so I don't know how the browser figures out whether to decode that with UTF-8 or Latin-1.
What's the right thing to be doing with the special characters I need to deal with?
I would always encode in UTF-8. From the Wikipedia page on percent encoding:
The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
It seems like because there were other accepted ways of doing URL encoding in the past, browsers attempt several methods of decoding a URI, but if you're the one doing the encoding you should use UTF-8.
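In Python 3 terms (the snippets above are Python 2's urllib), the equivalent is urllib.parse.quote_plus, which percent-encodes UTF-8 by default:

from urllib.parse import quote_plus

print(quote_plus('♥'))    # %E2%99%A5
print(quote_plus('…'))    # %E2%80%A6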
The general rule seems to be that browsers encode form responses according to the content-type of the page the form was served from. The guess is that if the server sends us "text/xml; charset=iso-8859-1", then it expects responses back in the same format.
If you're just entering a URL in the URL bar, then the browser doesn't have a base page to work on and therefore just has to guess. So in this case it seems to be doing utf-8 all the time (since both your inputs produced three-octet form values).
The sad truth is that AFAIK there's no standard for what character set the values in a query string, or indeed any characters in the URL, should be interpreted as. At least in the case of values in the query string, there's no reason to suppose that they necessarily do correspond to characters.
It's a known problem that you have to tell your server framework which character set you expect the query string to be encoded as; for instance, in Tomcat, you have to call request.setEncoding() (or some similar method) before you call any of the request.getParameter() methods. The dearth of documentation on this subject probably reflects the lack of awareness of the problem amongst many developers. (I regularly ask Java interviewees what the difference between a Reader and an InputStream is, and regularly get blank looks.)
IRI (RFC 3987) is the latest standard that replaces the URI/URL (RFC 3986 and older) standards. URI/URL do not natively support Unicode (well, RFC 3986 adds provisions for future URI/URL-based protocols to support it, but does not update earlier RFCs). The "%uXXXX" scheme is a non-standard extension to allow Unicode in some situations, but is not universally implemented. IRI, on the other hand, fully supports Unicode, and requires that text be encoded as UTF-8 before then being percent-encoded.
IRIs do not replace URIs, because only URIs (effectively, ASCII) are permissible in some contexts -- including HTTP.
Instead, you specify an IRI and it gets transformed into a URI when going out on the wire.
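A rough Python sketch of that transformation, percent-encoding the non-ASCII characters as UTF-8 while leaving the URI delimiters alone (a full RFC 3987 mapping would also handle the host name via IDNA):

from urllib.parse import quote

iri = 'http://www.google.com/search?q=♥'

# Keep reserved delimiters and existing percent-escapes; encode the rest as UTF-8.
uri = quote(iri, safe=":/?#[]@!$&'()*+,;=%")
print(uri)   # http://www.google.com/search?q=%E2%99%A5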
The first question is: what are your needs? UTF-8 encoding is a pretty good compromise between taking text created with a cheap editor and support for a wide variety of languages. As for the browser identifying the encoding, the response (from the web server) should tell the browser the encoding. Still, most browsers will attempt to guess, because this is either missing or wrong in so many cases. They guess by reading some amount of the result stream to see if there is a character that does not fit in the default encoding. Currently all browsers (I did not check this, but it is pretty close to true) use utf-8 as the default.
So use utf-8 unless you have a compelling reason to use one of the many other encoding schemes.