How does an email client read the content-type headers for encoding? - email

It is possible to send an email with different content types: text/html, text/plain, mime, etc. It also is possible to use different encodings, including (according to the RFCs) for header fields: us-ascii, utf8, etc.
How do you solve the chicken and egg problem? The content-type header is just one of several headers. If the headers can be any encoding, how does a mail server or client know how to read the content-type header if it does not know what encoding the headers themselves are in?
I could see it working if, e.g., the first line had to be the content-type and had to be in a pre-agreed encoding (e.g. ASCII), but that is not the case.
How do you parse a stream of bytes whose encoding is embedded as a string inside that very same stream?

Headers are defined to be in ASCII. They can be in UTF-8 if agreed to out of band, such as via the SMTP or IMAP UTF-8 capability extensions.
Internationalization in headers is performed via "encoded words", where the encoding is part of the header data. (An encoded word looks like =?iso-8859-1?q?sample_header_data?=.) See RFC 2047.
The Content-Type header does not apply to the headers themselves, only to the body content.
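As a minimal sketch of how a client might decode such an encoded word, here is a Python example using the standard library's email.header module. The sample header value and the helper name decode_mime_header are made up for illustration.

```python
from email.header import decode_header, make_header


def decode_mime_header(raw: str) -> str:
    """Decode any RFC 2047 encoded words in a raw header value."""
    # decode_header splits the value into (data, charset) pairs;
    # make_header joins them back into a single Unicode string.
    return str(make_header(decode_header(raw)))


# The header itself is pure ASCII; the charset travels inside the encoded word.
raw_subject = "=?iso-8859-1?q?caf=E9_menu?="
decoded_subject = decode_mime_header(raw_subject)
print(decoded_subject)  # café menu
```

The parser never has to guess an encoding for the header line: it reads ASCII, and each encoded word names its own charset.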

Related

Is the charset parameter allowed on application/octet-stream MIME type

I am working on a project where I need to send requests over email instead of HTTP.
To prevent email servers or clients from messing with the body (especially URLs), I have set the Content-Type header in my SMTP request to application/octet-stream instead of text/plain.
The content, however, is actually plain text, so I also specified ;charset=UTF-8.
Looking at the RFC, it seems that the charset parameter is only allowed for text/* types; however, I also found many examples where charset was used with application/* types.
Now I wonder, is application/octet-stream; charset=UTF-8 a valid MIME type?
The application/octet-stream definition (IANA-RFC) doesn't define a charset parameter for this type, and the definition for application/json (IANA-RFC), a more commonly used MIME type, includes a note:
No "charset" parameter is defined for this registration.
Adding one really has no effect on compliant recipients.
I would strongly recommend assuming that this statement applies not only in this special case but also to other application/* types that have no charset defined.
So I can't say whether it is valid to pass parameters that aren't defined, but the RFC clearly implies that the charset parameter for application/octet-stream (and other application/* types that do not define charset) has no effect.
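To see what a generic parser does with such a header, here is a small Python sketch using the standard library's email.message module. It only shows that the parameter is accepted syntactically; it says nothing about whether a given recipient will act on it.

```python
from email.message import Message

# Build a bare message carrying only the header under discussion.
msg = Message()
msg["Content-Type"] = "application/octet-stream; charset=UTF-8"

# The media type and the parameter are parsed independently.
media_type = msg.get_content_type()
charset = msg.get_content_charset()  # coerced to lower case by the library

print(media_type, charset)
```

The parser reports the parameter without complaint, so any meaning given to it for application/octet-stream is a private convention between sender and recipient.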

Headers for REST API with optional Base64 encoding

We have a media file repository, with which other services communicate over a REST API. For various reasons we want the users of the repository to be able to upload and download files over HTTP both directly (plaintext for text files and byte array for binary files) and using Base64 encoding. We want the fact that the file is uploaded (PUT, POST) and requested for download (GET) in the Base64 encoding be reflected in the header of the HTTP request.
How do we reflect the fact that the content of the request or requested response is Base64 encoded in the HTTP header?
So far I'm tending towards appending ;base64 after the mime type in the Content-Type header, for example Content-Type: image/png;base64. Other options (X- header, Content-Encoding) are discussed in this related question but do not offer satisfactory resolution to our question.
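You can use the Content-Transfer-Encoding header.
It is defined in RFC 2045: https://www.rfc-editor.org/rfc/rfc2045#page-14.
It supports base64 among other values: "7bit" / "8bit" / "binary" / "quoted-printable" / "base64" / ietf-token / x-token.
This header is designed for exactly this case, as a complement to the MIME type. Note that it is a MIME header that HTTP itself does not use (RFC 7231, Appendix A.5), so both the client and the server of your API have to agree to honor it.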
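A minimal Python sketch of that convention, assuming client and server both agree to look at Content-Transfer-Encoding. The function names build_upload and read_upload are hypothetical helpers, not part of any framework.

```python
import base64


def build_upload(data: bytes, mime_type: str, use_base64: bool):
    """Return (headers, body) for an upload, optionally Base64 encoded."""
    headers = {"Content-Type": mime_type}
    if use_base64:
        body = base64.b64encode(data)
        headers["Content-Transfer-Encoding"] = "base64"
    else:
        body = data
        headers["Content-Transfer-Encoding"] = "binary"
    headers["Content-Length"] = str(len(body))
    return headers, body


def read_upload(headers: dict, body: bytes) -> bytes:
    """Recover the original bytes on the receiving side."""
    encoding = headers.get("Content-Transfer-Encoding", "binary").lower()
    if encoding == "base64":
        return base64.b64decode(body)
    return body


png_signature = b"\x89PNG\r\n\x1a\n"
headers, body = build_upload(png_signature, "image/png", use_base64=True)
restored = read_upload(headers, body)
```

The Content-Type stays a plain image/png in both modes, so the MIME type and the transfer encoding remain separate concerns.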

HTTP multipart/form-data. What happens when binary data has no string representation?

I want to write an HTTP implementation.
I've been looking around for a few days at how files are sent over HTTP with Content-Type: multipart/form-data, and I'm really interested in how browsers (or any HTTP client) create that kind of request.
I have already looked at a lot of questions about it here on Stack Overflow, like:
How does HTTP file upload work?
What does enctype='multipart/form-data' mean?
I dug into RFCs 2616 (and newer versions), 2046, etc., but I didn't find a clear answer (obviously I did not get the idea behind it). In most articles and answers I found this piece of a request, which is simple for me to interpret; all these things are documented in the RFCs...
POST /upload?upload_progress_id=12344 HTTP/1.1
Host: localhost:3000
Content-Length: 1325
Origin: http://localhost:3000
... other headers ...
Content-Type: multipart/form-data; boundary=----WebKitFormBoundaryePkpFF7tjBAqx29L
------WebKitFormBoundaryePkpFF7tjBAqx29L
Content-Disposition: form-data; name="MAX_FILE_SIZE"
100000
------WebKitFormBoundaryePkpFF7tjBAqx29L
Content-Disposition: form-data; name="uploadedfile"; filename="hello.o"
Content-Type: application/x-object
... contents of file goes here ...
------WebKitFormBoundaryePkpFF7tjBAqx29L--
...and it would be simple to implement an HTTP client that constructs a string that way in any language. The problem comes at ... contents of file goes here ...: there's little information about what "contents of file" is. I know it's binary data with a certain type and encoding, but it's difficult to think outside of string data: how would I add a piece of binary data that has no string representation inside a string?
I would like to see examples of low-level implementations of the HTTP protocol in any language, and maybe in-depth explanations of binary data transfer over HTTP: how the client creates requests and how the server reads and parses them. PS: I know this question may look like a duplicate, but most of the answers are not focused on explaining binary data transfer (like media).
You should not try to handle this part of the body as a string. You should send binary data: think of it as reading bytes from the resource and sending these bytes unaltered.
In particular, no encoding is applied, no UTF-8, no base64. HTTP is not a protocol with a 7-bit ASCII restriction like SMTP, where base64 encoding is applied to ensure only 7-bit ASCII characters are used.
There is, by definition, no string version of this data, and if you look at a raw HTTP transfer (with Wireshark, for example) you should see binary data, bytes, stuff.
This is why most HTTP servers use C to manage HTTP: they parse the HTTP communication byte by byte (the protocol headers are 7-bit ASCII only, certainly not multibyte characters), and they can also read and write arbitrary binary data for the body quite easily (or even use system calls like sendfile to let the kernel manage the binary part).
Now, about examples.
When you use Content-Length and no multipart stuff, the body is exactly (content-length) bytes long, so the recipient parsing your data will just read this number of bytes and treat the whole raw data as the body content (which may have a MIME type and encoding information, but that is just information for layers set on top of the HTTP protocol).
When you use Transfer-Encoding: chunked, the raw binary body is split into pieces. Each piece is prefixed by a hexadecimal number (the size of the chunk) and the end-of-line marker, and followed by another end-of-line marker. A final zero-length chunk marks the end.
If we take the wikipedia example:
4\r\n
Wiki\r\n
5\r\n
pedia\r\n
E\r\n
 in\r\n
\r\n
chunks.\r\n
0\r\n
\r\n
We could replace each ASCII letter with any byte, even a byte that has no ASCII representation. I'll use a * character for each real body byte:
4\r\n
****\r\n
5\r\n
*****\r\n
E\r\n
**************\r\n
0\r\n
\r\n
All the other characters are part of the HTTP protocol (here a chunked body transmission). I could also use a \0 representation of binary data and send only the null byte for each byte of the body, which would be:
4\r\n
\0\0\0\0\r\n
5\r\n
\0\0\0\0\0\r\n
E\r\n
\0\0\0\0\0\0\0\0\0\0\0\0\0\0\r\n
0\r\n
\r\n
That's just a representation, we could also use \xNN or \NN representations, in reality these are bytes, 8 bits (too lazy to write the 0/1 representation of this body :-) ).
The text of the example was:
Wikipedia in\r\n
\r\n
chunks.
It could instead have been more complex, with multibyte characters (here an é in UTF-8):
Wikipédia in\r\n
\r\n
chunks.
This é is in fact 11000011 10101001 in UTF-8, two bytes (\xc3\xa9 in \xNN representation), instead of the single byte 01100101 (\x65) of the plain e character. The HTTP body is now (note that the second chunk size is 6 and not 5):
4\r\n
Wiki\r\n
6\r\n
p\xc3\xa9dia\r\n
E\r\n
 in\r\n
\r\n
chunks.\r\n
0\r\n
\r\n
But this is only valid if the source data was actually in UTF-8; it could have been in another encoding. Unless your web server has a specific configuration setting that enforces conversion of the source document to a given encoding, converting the source document is not really the web server's job: you take what you have, and you maybe add a header to tell the client which encoding the source document declared.
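The chunked framing described above can be sketched in a few lines of Python. This is an illustrative encoder and decoder working purely on bytes, not a full HTTP implementation (trailers are ignored), and the names encode_chunked and decode_chunked are made up for this example.

```python
def encode_chunked(parts):
    """Frame an iterable of byte strings as a chunked HTTP body."""
    out = bytearray()
    for part in parts:
        if not part:
            continue  # a zero-length chunk would terminate the body early
        # Chunk size in hexadecimal, then CRLF, then the raw bytes, then CRLF.
        out += format(len(part), "X").encode("ascii") + b"\r\n"
        out += part + b"\r\n"
    out += b"0\r\n\r\n"  # final zero-length chunk
    return bytes(out)


def decode_chunked(raw: bytes) -> bytes:
    """Reassemble the original body from a chunked byte stream."""
    body = bytearray()
    pos = 0
    while True:
        eol = raw.index(b"\r\n", pos)
        # Ignore optional chunk extensions after ';'.
        size = int(raw[pos:eol].split(b";")[0], 16)
        pos = eol + 2
        if size == 0:
            break
        body += raw[pos:pos + size]
        pos += size + 2  # skip the chunk data and its trailing CRLF
    return bytes(body)


parts = [
    "Wiki".encode("utf-8"),
    "pédia".encode("utf-8"),  # 6 bytes, because é takes two bytes in UTF-8
    " in\r\n\r\nchunks.".encode("utf-8"),
]
wire = encode_chunked(parts)
```

The sizes are counted in bytes, never in characters, which is why the decoder works the same way for text and for arbitrary binary content.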
Finally, there is the multipart way of transmitting the body, as in your question. It is a lot like the chunked version, except that boundaries and intermediary headers are used. For the binary data between these boundaries, headers, and line-ending control characters, the same rule applies: everything inside is just bytes.
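As a rough Python sketch of that idea, the following builds a multipart/form-data body as bytes, reusing the boundary and field names from the question. The helper build_multipart and the sample file content are assumptions for illustration; real clients also escape quotes in names and check that the boundary does not occur in the data.

```python
def build_multipart(fields, files, boundary: str) -> bytes:
    """Build a multipart/form-data body.

    fields: dict of name -> text value
    files:  dict of name -> (filename, content_type, raw bytes)
    """
    delimiter = b"--" + boundary.encode("ascii")
    out = bytearray()
    for name, value in fields.items():
        out += delimiter + b"\r\n"
        out += f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode("utf-8")
        out += value.encode("utf-8") + b"\r\n"
    for name, (filename, content_type, data) in files.items():
        out += delimiter + b"\r\n"
        part_headers = (
            f'Content-Disposition: form-data; name="{name}"; filename="{filename}"\r\n'
            f"Content-Type: {content_type}\r\n\r\n"
        )
        out += part_headers.encode("utf-8")
        out += data + b"\r\n"  # raw bytes appended as-is, no text encoding
    out += delimiter + b"--\r\n"  # closing delimiter
    return bytes(out)


boundary = "----WebKitFormBoundaryePkpFF7tjBAqx29L"
file_bytes = bytes([0x00, 0xFF, 0x80, 0x0A, 0x0D])  # not valid text in any sense
body = build_multipart(
    {"MAX_FILE_SIZE": "100000"},
    {"uploadedfile": ("hello.o", "application/x-object", file_bytes)},
    boundary,
)
content_length = len(body)
```

Only the headers and boundaries are ever built from strings; the file content goes from a bytes object straight into the output buffer.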

Handling diacritics in SIP headers

Following the SIMPLE specification of OMA, when sending a SIP INVITE for chat we can use a header named Subject.
Typically, this header contains the first message sent by a user to his contact.
My question is: this message can contain diacritics, so how should I encode them? Is there a standard definition on how to do this?
You should encode them as UTF-8, as specified in the SIP RFC (RFC 3261). There are a few SIP headers where UTF-8 is not allowed and US-ASCII with escaping rules is mandated, but the Subject header is not one of them.
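In practice this just means writing the header value as UTF-8 bytes on the wire. A tiny Python sketch, where sip_subject_line is a hypothetical helper and not part of any SIP library:

```python
def sip_subject_line(text: str) -> bytes:
    """Serialize a Subject header with its value encoded as UTF-8."""
    if "\r" in text or "\n" in text:
        # Line breaks would let the value inject extra headers.
        raise ValueError("header value must not contain line breaks")
    return b"Subject: " + text.encode("utf-8") + b"\r\n"


line = sip_subject_line("Ça va?")
```

The Ç ends up as the two bytes \xc3\x87 in the serialized header.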

Extracting email attachment filename : Content-Disposition vs Content-type

I am working on a script that will handle email attachments. I see that, most of the time, both content-type and content-disposition headers have the filename, but I have seen cases where only one had proper encoding or valid mime header.
Is there a preferred header to use to extract the file name? If so, which one?
Quoting wikipedia http://en.wikipedia.org/wiki/MIME:
"Many mail user agents also send messages with the file name in the name parameter of the content-type header instead of the filename parameter of the content-disposition header. This practice is discouraged."
So it seems content-disposition is preferred. However as I am using JavaMail, current JavaMail API seems to have only a String getDisposition() method: http://javamail.kenai.com/nonav/javadocs/javax/mail/Part.html#getDisposition(). So you might need to work with the header directly if you are using JavaMail.
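For comparison, here is how the same preference looks in Python rather than JavaMail. The standard library's Message.get_filename() reads the filename parameter of Content-Disposition first and falls back to the name parameter of Content-Type. The header values below are invented sample data.

```python
from email.message import Message

# A part that carries the file name in both headers.
with_both = Message()
with_both["Content-Type"] = 'application/pdf; name="from-type.pdf"'
with_both["Content-Disposition"] = 'attachment; filename="from-disposition.pdf"'

# A part that only carries the discouraged Content-Type name parameter.
only_type = Message()
only_type["Content-Type"] = 'application/pdf; name="from-type.pdf"'

# get_filename() prefers Content-Disposition and falls back to Content-Type.
preferred = with_both.get_filename()
fallback = only_type.get_filename()
missing = Message().get_filename()  # None when neither header has a name

print(preferred, fallback, missing)
```

A script handling attachments can apply the same order by hand: try Content-Disposition, then Content-Type, then give up.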