Mail header fields: Practical difference between quoted-printable and 7bit? - email

Is there any practical difference between "7bit" and "quoted-printable" as Content-Transfer-Encoding in email? From all I could gather the encoding schemes are virtually identical.

For example, in 7bit, you can have a space at the end of a line, but in quoted-printable, you have to write it as =20 (which would be interpreted literally by 7bit).
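To make the trailing-space example concrete, here is a minimal Python sketch using the standard-library quopri module. The helper name decode_body is made up for this illustration; it only shows that the same bytes mean different things depending on the declared Content-Transfer-Encoding.

```python
import quopri


def decode_body(raw: bytes, cte: str) -> bytes:
    """Undo the Content-Transfer-Encoding of a body (hypothetical helper)."""
    if cte.lower() == "quoted-printable":
        return quopri.decodestring(raw)
    # 7bit is an identity encoding: the bytes on the wire are the content.
    return raw


line = b"trailing space=20\n"

# Read as quoted-printable, =20 is an escaped space.
as_qp = decode_body(line, "quoted-printable")

# Read as 7bit, the three characters "=20" are literal text.
as_7bit = decode_body(line, "7bit")

print(as_qp)
print(as_7bit)
```

So the two encodings only look identical as long as the text contains no "=" sign, no trailing whitespace, no long lines and no non-ASCII bytes.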

Related

UTF-8 encoding in emails, parsing the body

So I don't really want this question to be language specific, however I suspect Go (my language choice) is playing a part here.
I'm trying to find a string within the body of a raw email. To do so, I look at the encoding, and in the majority of cases it is quoted-printable.
OK, so that's fine: I encode my search query as quoted-printable and then search for it. That works.
However, in one specific case the raw email I see in Gmail looks fine, but when I retrieve the raw email from the Gmail API, although the encoding and everything else is identical, the " is encoded as =22.
Research shows me that's because the charset is utf-8.
I haven't quite got my head around whether that's encoded as utf-8 and then quoted-printable or the other way around, but that's not quite the question either...
If I look at the email where the " is =22, I see the charset is utf-8, and when I look at another where it's not encoded, the charset is UTF-8 (notice the case). I can't believe that the case is what's causing this, but it doesn't seem a robust enough way to work out whether =22 is actually =22 or is a " encoded as utf-8.
My original thought was to always decode the quoted-printable and then re-encode it before doing the search, but I don't think this is going to be a robust approach going forward, and I thought others might have a better suggestion.
In conclusion, I'm trying to find a string in a raw email, but the encoding is causing me problems getting my search string to match the encoding of the body.
The =22-type encoding actually has nothing to do with the charset (whether that is utf-8 lowercase or UTF-8 uppercase or any other charset).
It is the Content-Transfer-Encoding: quoted-printable encoding.
The quoted-printable encoding is just a way of hex-encoding octets, typically limited to octets that fall outside of the printable ascii range. It seems odd that the DQUOTE character would be encoded in this way, but it's perfectly legal to do so.
If you want to search for strings in the body of the message, you'll need to first decode the body of the message. Otherwise you will not be successful.
I would recommend reading rfc2045 at a minimum.
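As a rough illustration of "decode first, then search", here is a sketch in Python rather than Go, using only the standard-library email package. The sample message and the helper name body_contains are invented for this example; a real mailbox will need more error handling.

```python
import email
from email import policy

# Invented sample: the quote characters are sent as =22, the e-acute as =C3=A9.
RAW = (
    b"From: a@example.org\r\n"
    b"To: b@example.net\r\n"
    b"Subject: test\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"She said =22caf=C3=A9=22 twice.\r\n"
)


def body_contains(raw: bytes, needle: str) -> bool:
    """Search the decoded text parts of a raw message (hypothetical helper)."""
    msg = email.message_from_bytes(raw, policy=policy.default)
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            # get_content() undoes the transfer encoding, then the charset.
            if needle in part.get_content():
                return True
    return False


print(body_contains(RAW, '"café"'))
```

Searching the raw bytes for the same string fails, because the sender was free to escape any character it liked.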
You may also need to end up reading rfc2047 if you end up wanting to search headers at some point, but that gets... tricky due to various bugs that sending clients have.
Now that I've been "triggered" into a rant about MIME, let me explain why decoding headers is so hard to get right. I'm sure just about every developer who has ever worked on an email client could tell you this, but I guess I'm going to be the one to do it.
Here's just a short list of the problems every developer faces when they go to implement a decoder for headers which have been (theoretically) encoded according to the rfc2047 specification:
1. First off, there are technically two variations of header encoding formats specified by rfc2047 - one for phrases and one for unstructured text fields. They are very similar but you can't use the same rules for tokenizing them. I mention this because it seems that most MIME parsers miss this very subtle distinction and so, as you might imagine, do most MIME generators. Hell, most MIME generators probably never even heard of specifications to begin with it seems.
This brings us to:
2. There are so many variations of how MIME headers fail to be tokenizable according to the rules of rfc2822 and rfc2047. You'll encounter fun stuff such as:
a. encoded-word tokens illegally being embedded in other word tokens
b. encoded-word tokens containing illegal characters in them (such as spaces, line breaks, and more) effectively making it so that a tokenizer can no longer, well, tokenize them (at least not easily)
c. multi-byte character sequences being split between multiple encoded-word tokens which means that it's not possible to decode said encoded-word tokens individually
d. the payloads of encoded-word tokens being split up into multiple encoded-word tokens, often splitting in a location which makes it impossible to decode the payload in isolation
You can see some examples here.
3. Something that many developers seem to miss is the fact that each encoded-word token is allowed to be in different character encodings (you might have one token in UTF-8, another in ISO-8859-1 and yet another in koi8-r). Normally, this would be no big deal because you'd just decode each payload, then convert from the specified charset into UTF-8 via iconv() or something. However, due to the fun brokenness that I mentioned above in (2c) and (2d), this becomes more complicated.
If that isn't enough to make you want to throw your hands up in the air and mutter some profanities, there's more...
4. Undeclared 8bit text in headers. Yep. Some mailers just didn't get the memo that they are supposed to encode non-ASCII text. So now you get to have the fun experience of mixing and matching undeclared 8bit text of God-only-knows what charset along with the content of (probably broken) encoded-words.
If you want to see how to deal with these issues, you can take a look at how I did it using C in my GMime library, here: https://github.com/jstedfast/gmime/blob/master/gmime/gmime-utils.c#L1894 (in case line offsets change in the future, look for _g_mime_utils_header_decode_text() and the various internal methods it uses in that source file - I have written comments explaining how it deals with the above issues).
Or you can see how I did it using C# in my MimeKit library, here: https://github.com/jstedfast/MimeKit/blob/master/MimeKit/Utils/Rfc2047.cs
For more information about why and how dealing with email is hard, check out Joshua Cranmer's blog series: http://quetzalcoatal.blogspot.com/search/label/email-hard
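For well-formed input only, the per-token charset point can be sketched in a few lines of Python with the standard library's email.header.decode_header. This is not how GMime or MimeKit do it, and it does nothing about the broken cases listed above (split multi-byte sequences, undeclared 8bit text). The helper name decode_words is made up for this example.

```python
from email.header import decode_header


def decode_words(value: str) -> str:
    """Decode RFC 2047 encoded-words, honouring each word's own charset."""
    out = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            # Assumption: unlabelled bytes are ASCII; bad bytes become U+FFFD.
            chunk = chunk.decode(charset or "ascii", errors="replace")
        out.append(chunk)
    return "".join(out)


# Two adjacent encoded-words in different charsets and different encodings.
mixed = "=?iso-8859-1?Q?T=FCst?= =?utf-8?B?R8O2ZGVs?="
print(decode_words(mixed))
```

The whitespace between two adjacent encoded-words is dropped on purpose: RFC 2047 says it is not part of the text.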

How to escape a full email address for SMTP in the headers when the email address contains non-ascii chars

This is about sending emails with non-ASCII characters in the email address.
When I send the MAIL FROM / RCPT TO commands to the SMTP server, I know that I need to use punycode.
But what about the To: and From: headers? Again, I know that if the user-friendly part contains a non-ASCII character, I can use the standard header encoding that I also use for the subject. But this encoding is only used for the user-friendly part.
But what if the email address itself contains a non-ASCII character? How must the To header be formatted?
So how do I encode "Tüst"?
This is the encoding as far as I know:
"=?iso-8859-1?Q?T=FCst?="<tüst@domain.de>
But what about the email address?
In fact, I don't understand the RFCs. I tried hard but failed.
The answer is: UTF-8 is the correct way to encode the header.
After some more research I found the answer hidden inside this article:
https://en.wikipedia.org/wiki/International_email
Although the traditional format for email header section allows
non-ASCII characters to be included in the value portion of some of
the header fields using MIME-encoded words (e.g. in display names or
in a Subject header field), MIME-encoding must not be used to encode
other information in a header, such as an email address, or header
fields like Message-ID or Received. Moreover, the MIME-encoding
requires extra processing of the header to convert the data to and
from its MIME-encoded word representation, and harms readability of a
header section.
The 2012 standards RFC 6532 and RFC 6531 allow the inclusion of
Unicode characters in a header content using UTF-8 encoding, and their
transmission via SMTP - but in practice support is only slowly rolling
out.
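A sketch of what this looks like with Python's standard email package, assuming the receiving server advertises SMTPUTF8. The address is the one from the question; "dömain.de" is a made-up domain used only to show the punycode step for the envelope.

```python
from email import policy
from email.headerregistry import Address
from email.message import EmailMessage

msg = EmailMessage()
msg["To"] = Address(display_name="Tüst", username="tüst", domain="domain.de")
msg["Subject"] = "test"
msg.set_content("hello")

# policy.SMTPUTF8 writes header values as raw UTF-8 (RFC 6532)
# instead of MIME encoded-words.
wire = msg.as_bytes(policy=policy.SMTPUTF8)
to_line = next(l for l in wire.splitlines() if l.startswith(b"To:"))
print(to_line.decode("utf-8"))

# Hypothetical non-ASCII domain: Python's "idna" codec gives the
# punycode (ACE) form of the domain part.
ace_domain = "dömain.de".encode("idna")
print(ace_domain)
```

If the server does not support SMTPUTF8, there is no standard way to send a non-ASCII local part at all.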

How to encode the filename parameter value of the Content-Disposition header in MIME message?

By checking the source of some emails, I found that many emails use the 'Encoded Words' (RFC 2047) format to encode filename parameter values. However, according to RFC 2047, this encoding method should not be used for header parameter values. Instead, a parameter value, such as the filename parameter in the Content-Disposition header, should use the encoding method specified by RFC 2231.
Thus, my question is why so many emails don't comply with the RFC standards. Is it correct to encode a header parameter value in RFC 2047 format? Can all email agents parse these emails properly?
The sad truth is that many popular email clients are in violation of pertinent RFCs.
Indeed, as you surmise, filenames in MIME body parts should use RFC2231, but many implementations out in the wild use RFC2047 or a number of other informal, ad-hoc, or at worst indeterminable filename encodings.
As for the "why", I don't really think this is answerable. Fundamentally I think we can't do better than guess it's a mistake at some level.
Common and easily identified incorrect encodings seem to work fairly transparently between popular clients; but by definition, failure to adhere to the specification removes any guarantee that the recipient can correctly guess what was intended.
For reference, here is a model message which should hopefully pass validation (-:
From: me <tripleee@example.org>
To: =?utf-8?Q?G=C3=B6del?= <goedel@example.net>
Subject: File name and recipient are identical,
 but encoded differently
Mime-Version: 1.0
Content-type: application/octet-stream;
 name*=UTF-8''G%C3%B6del
Content-disposition: attachment;
 filename*=UTF-8''G%C3%B6del
Content-transfer-encoding: base64

R8O2ZGVsCg==
For the record, the Content-Type: header's name parameter is superseded by the filename parameter of the Content-Disposition: header, but many implementations still conservatively specify both, in case some client somewhere still doesn't grok Content-Disposition:
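For anyone who wants to generate or check the RFC 2231 form programmatically, here is a small Python sketch using the standard library. The helper name filename_param is made up; the message fragment is a cut-down version of the model message above.

```python
import email
from email import policy
from email.utils import encode_rfc2231


def filename_param(name: str) -> str:
    """Build an RFC 2231 filename parameter (hypothetical helper)."""
    return "filename*=" + encode_rfc2231(name, charset="utf-8")


param = filename_param("Gödel")
print(param)

# Parse a body part that carries the parameter and read the name back.
raw = "\n".join([
    "Content-Type: application/octet-stream",
    "Content-Disposition: attachment;",
    " " + param,
    "Content-Transfer-Encoding: base64",
    "",
    "R8O2ZGVsCg==",
    "",
])
part = email.message_from_string(raw, policy=policy.default)
print(part.get_filename())
```

Note that the percent-escapes are of the UTF-8 bytes, which is why they match the Q-encoded bytes in the To: header.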

Content Transfer Encoding 7bit or 8 bit

While sending email content, it is required to set "Content Transfer Encoding" header. I observed many headers of emails that I received. Some emails using "7bit" and some are using "8bit".
What is the difference between these two? Which is recommended? Is there any special encoding required for email body in order to set these headers?
It can be a bit dense to read, but the "Content-Transfer-Encoding" section of RFC 1341 has all of the details:
http://www.w3.org/Protocols/rfc1341/5_Content-Transfer-Encoding.html
The situation kinda goes from bad to worse. Here's my summary:
Background
SMTP, by definition (RFC 821), limits mail to lines of 1000 characters of 7 bits each. That means that none of the bytes you send down the pipe can have the most significant ("highest-order") bit set to "1".
The content that we want to send will often not obey this restriction inherently. Think of an image file, or a text file that contains Unicode characters: the bytes of these files will often have their 8th bit set to "1". SMTP doesn't allow this, so you need to use "transfer encoding" to describe how you've worked around the mismatch.
The values for the Content-Transfer-Encoding header describe the rule that you've chosen to solve this problem.
7Bit Encoding
7bit simply means "My data consists only of US-ASCII characters, which only use the lower 7 bits for each character." You're basically guaranteeing that all of the bytes in your content already adhere to the restrictions of SMTP, and so it needs no special treatment. You can just read it as-is.
Note that when you choose 7bit, you're agreeing that all of the lines in your content are less than 1000 characters in length.
As long as your content adheres to these rules, 7bit is the best transfer encoding, since there's no extra work necessary; you just read/write the bytes as they come off the pipe. It's also easy to eyeball 7bit content and make sense of it. The idea here is that if you're just writing in "plain English text" you'll be fine. But that wasn't true in 2005 and it isn't true today.
8Bit Encoding
8bit means "My data may include extended ASCII characters; they may use the 8th (highest) bit to indicate special characters outside of the standard US-ASCII 7-bit characters." As with 7bit, there's still a 1000-character line limit.
8bit, just like 7bit, does not actually do any transformation of the bytes as they're written to or read from the wire. It just means that you're not guaranteeing that none of the bytes will have the highest bit set to "1".
This seems like a step up from 7bit, since it gives you more freedom in your content. However, RFC 1341 contains this tidbit:
As of the publication of this document, there are no standardized Internet transports for which it is legitimate to include unencoded 8-bit or binary data in mail bodies. Thus there are no circumstances in which the "8bit" or "binary" Content-Transfer-Encoding is actually legal on the Internet.
RFC 1341 came out over 20 years ago. Since then we've gotten 8bit MIME Extensions in RFC 6152. But even then, line limits still may apply:
Note that this extension does NOT eliminate the possibility of an SMTP server limiting line length; servers are free to implement this extension but nevertheless set a line length limit no lower than 1000 octets.
Binary Encoding
binary is the same as 8bit, except that there's no line length restriction. You can still include any characters you want, and there's no extra encoding. Similar to 8bit, RFC 1341 states that it's not really a legitimate transfer encoding. RFC 3030 extended this with BINARYMIME.
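The three identity encodings above differ only in what they promise about the bytes. A simplified Python sketch of that rule follows; the function name narrowest_cte is made up, and it checks only high bits, NUL bytes and line length. The 998 figure is the content limit per line, which together with the CRLF gives the 1000-character limit mentioned above.

```python
def narrowest_cte(body: bytes, max_line: int = 998) -> str:
    """Return the weakest identity encoding that truthfully labels body.

    Simplified check: ignores bare CR / LF rules.
    """
    lines = body.split(b"\r\n")
    too_long = any(len(line) > max_line for line in lines)
    if too_long or b"\x00" in body:
        return "binary"  # no promise about line length or content
    if any(byte > 127 for byte in body):
        return "8bit"    # short lines, but some bytes have the high bit set
    return "7bit"        # short lines, pure US-ASCII


print(narrowest_cte(b"plain English text\r\n"))
print(narrowest_cte("Gödel\r\n".encode("utf-8")))
print(narrowest_cte(b"x" * 2000))
```

Anything that comes out as 8bit or binary has to be re-encoded as quoted-printable or base64 if the transport only guarantees 7-bit delivery.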
Quoted Printable
Before the 8BITMIME extension, there needed to be a way to send content that couldn't be 7bit over SMTP. HTML files (which might have more than 1000-character lines) and files with international characters are good examples of this. The quoted-printable encoding (Defined in Section 5.1 of RFC 1341) is designed to handle this. It does two things:
Defines how to escape non-US-ASCII characters so that they can be represented in only 7-bit characters. (Short version: each such byte is written as an equals sign followed by two hexadecimal digits.)
Defines that lines will be no longer than 76 characters, and that a line break inserted only to respect that limit (a "soft" line break) is marked with a trailing equals sign so that the decoder can remove it again.
Quoted Printable, because of the escaping and short lines, is much harder to read by a human than 7bit or 8bit, but it does support a much wider range of possible content.
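Both rules can be seen with Python's standard quopri module. This is just a quick demonstration; the 200-character line is arbitrary test data.

```python
import quopri

# Rule 1: bytes outside US-ASCII become "=" plus two hex digits.
encoded = quopri.encodestring("Gödel".encode("utf-8"))
print(encoded)

# Rule 2: long lines are split with soft line breaks ("=" at end of line).
long_line = b"x" * 200
wrapped = quopri.encodestring(long_line)
print(wrapped.decode("ascii"))

# Decoding removes the soft line breaks again.
restored = quopri.decodestring(wrapped)
```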
Base64 Encoding
If your data is largely non-text (ex: an image file), you don't have many options. 7bit is off the table. 8bit and binary were unsupported prior to the MIME extension RFCs. quoted-printable would work, but is really inefficient (every byte is going to be represented by 3 characters).
base64 is a good solution for this type of data. It encodes 3 raw bytes as 4 US-ASCII characters, which is relatively efficient. RFC 1341 further limits the line length of base64-encoded data to 76 characters to fit within an SMTP message, but that's relatively easy to manage when you're just splitting or concatenating arbitrary characters at fixed lengths.
The big downside is that base64-encoded data is pretty much entirely unreadable by humans, even if it's just "plain" text underneath.
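The 3-to-4 expansion and the 76-character line limit are easy to check with Python's standard base64 module. A short sketch, with arbitrary sample data:

```python
import base64

payload = "Gödel\n".encode("utf-8")    # 7 raw bytes
encoded = base64.b64encode(payload)    # padded up to a multiple of 4 chars
print(encoded)

# encodebytes() adds the line breaks that MIME expects.
blob = bytes(range(256))
wrapped = base64.encodebytes(blob)
print(wrapped.decode("ascii"))

restored = base64.decodebytes(wrapped)
```

Even though the payload is ordinary text, nothing in the encoded form is recognizable to a human reader.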
With content-transfer-encoding: 7bit, the bytes that are used in the body (or, more correctly, within the part's boundaries) should represent ASCII characters but not extended-ASCII characters. This means 0-127 decimal (8th bit not used).
Since the 8th bit is not used, you cannot encode your text using utf-8 or iso8859-7 bytes, because they use the 8th bit. Nor can you add binary content.
With content-transfer-encoding: 8bit you can use any possible byte, which means that you can encode your text using utf-8 bytes or iso8859-7 bytes (both assuming that the 8BITMIME extension is used in SMTP). However, it is still unsafe to add binary content, because the maximum line length restriction still applies and could break up your bytes with newlines.
Now, even with the 7bit content-transfer-encoding you can still set the content-type's charset param to utf-8, as long as you keep your bytes within the range 0-127.
E.g. a possible way to represent characters outside ASCII using the 7bit content-transfer-encoding could be to use HTML character references (with content-type: text/html).
Many email clients will set content-transfer-encoding to 7bit or 8bit depending on the case, e.g. 7bit when sending English text, 8bit when sending multilingual text. And there are always the options of quoted-printable and base64, whose characters also do not use the 8th bit, but this is out of the scope of the question.

What means Zend_Mime::ENCODING_8BIT when sending mails with Zend_Mail?

In the example for Zend_Mail on http://framework.zend.com/manual/en/zend.mail.attachments.html they use ENCODING_8BIT, but searching for what that might be sends me to http://msdn.microsoft.com/en-us/library/ms526992%28EXCHG.10%29.aspx where (and this sounds logical to me) it is explained that 8bit encoding does not make sense for emails.
Edit:
When I use this encoding for a mail with an attachment, I receive the mail with a corrupted attachment in my mail software (Thunderbird)
In which cases does it make sense to use ENCODING_8BIT?
As everybody said, ENCODING_8BIT represents the Content Transfer Encoding.
Basically, 8BITMIME is used for internationalization. It allows 8-bit character sets and therefore lets you send any character supported by UTF-8 without further encoding.
In general, non-MIME mailers send 8-bit data but do not include any
MIME headers to mark the message as 8-bit data. MIME mailers should
cope with this without any problems. [source]
So basically there is not really a case where it makes sense to use ENCODING_8BIT over another encoding since emails in UTF8 are a standard today. Also, note that most of the MTAs (Message Transfer Agent, such as Postfix, etc.) automatically force the encoding to 8BITMIME (UTF-8).
Here is a good resource about the 8BITMIME encoding.
The 8BITMIME extension has two effects in practice:
The client will avoid Q-P conversion.
The client may add extra information at the end of a MAIL request: a space followed by either "BODY=7BIT" or "BODY=8BITMIME".
Zend_Mime::ENCODING_8BIT sets the Content-Transfer-Encoding.
The Content-Transfer-Encoding defines methods for representing binary data in ASCII text format.
The use of Zend_Mime::ENCODING_8BIT in the example is a bug.
For sending attachments you should always use Zend_Mime::ENCODING_BASE64.
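The same advice, sketched in Python rather than Zend_Mail (so this is an analogy, not the Zend API): the standard email package picks base64 for binary attachments, and the data survives the trip over a 7-bit-only transport. The file name and contents are invented.

```python
import email
from email import policy
from email.message import EmailMessage

data = bytes(range(256))  # arbitrary binary content, includes every byte value

msg = EmailMessage()
msg["Subject"] = "binary attachment"
msg.set_content("See the attached file.")
msg.add_attachment(data, maintype="application",
                   subtype="octet-stream", filename="blob.bin")

# What would go over the wire.
wire = msg.as_bytes()

# What the recipient gets after parsing and decoding.
received = email.message_from_bytes(wire, policy=policy.default)
attachment = next(received.iter_attachments())
cte = attachment["Content-Transfer-Encoding"]
roundtrip = attachment.get_content()
print(cte, len(roundtrip))
```

Labelling the same bytes as 8bit without encoding them is what produces the corrupted attachment described in the question.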
Not for email but for attachments. If you take a look at RFC 2045, page 7:
"Binary data" refers to data where any sequence of octets whatsoever is allowed.