RS-232C and email in a 7-bit character set

The book "Designing Embedded Hardware" in the chapter "9.3. Old Faithful: RS-232C" mentions that emails are still sent in 7bit char set because of RS-232C:
It's also not unheard of to see RS-232C systems still using 7-bit data frames (another leftover from the '60s), rather than the more common 8-bit. In fact, this is one of the reasons why you'll still see email being sent on the Internet limited to a 7-bit character set, just in case the packets happen to be routed via a serial connection that supports only 7-bit transmissions.
How can I confirm the observation?

Check out the spec. The original RFC 822, for ARPA Internet Text Messages, explicitly states:
A message consists of header fields and, optionally, a body. The body is simply a sequence of lines containing ASCII characters.
Since ASCII is 7-bit, voila.
Note, however, that there are a whole bunch of additions to that original spec (all the MIME extensions) which allow message header extensions for non-ASCII text.

The Quoted-printable MIME encoding is specifically designed to encode 8-bit data in 7-bit characters. This encoding is widely used to encode email.
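For instance, Python's standard quopri module implements this encoding, so you can watch the 8-bit-to-7-bit mapping directly (a minimal illustration added here, not part of the original answer; the sample text is arbitrary):

    import quopri

    text = "Grüße aus München".encode("utf-8")  # contains bytes with the 8th bit set
    encoded = quopri.encodestring(text)
    print(encoded)  # b'Gr=C3=BC=C3=9Fe aus M=C3=BCnchen'

    assert all(b < 0x80 for b in encoded)        # pure 7-bit ASCII
    assert quopri.decodestring(encoded) == text  # lossless round trip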
Note also that the text you quoted says "in case the packets happen to be routed via a serial connection", which is misleading, especially if they're talking in the context of IP packets. IP packets assume an 8-bit data path and cannot be sent directly over a 7-bit RS-232 link without additional encoding (and then it's not a 7-bit data path anymore, it's 8-bit).

The systems that were restricted to 7 bits were already old when email first became popular. The chances that you will find one today approach zero.
Since certain characters have special meaning to email programs (most notably the end-of-line character), it still makes sense to limit the character set.

Related

Email with special characters rejected - RFC-6532 and "quoted-printable"

One email provider rejected an email containing special characters (e.g. umlauts). They say that they are RFC-5321 and RFC-5322 compliant. Now, I browsed those standards; however, they do not support international emails (thus no umlauts). Only 7-bit ASCII (0-127) is supported.
Now there is an extension called RFC-6532 which standardizes international emails. Our emails are UTF-8 (quoted-printable) encoded and sent like this:
"=?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?="<boerge.moeller#foo.org>
Is this an RFC-6532 compliant address? Or is it some other/older RFC (like RFC-2047)? After all, there are so many mail-related RFCs that I might have missed 10 or 20 ;-)
It's on the right track, but it's wrong.
"=?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?="<boerge.moeller#foo.org>
There are 2 problems with the above form:
The encoded-word (the =?UTF-8?Q?...?= bit) is quoted and shouldn't be. Mail software that parses this address won't decode that token if it is standards-compliant.
The "name" is butted up against the angle brackets and should not be. There MUST be a space in order to be standards compliant.
In other words, this is what it should look like:
=?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?= <boerge.moeller@foo.org>
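As a sanity check, Python's email library produces exactly this shape when serializing a non-ASCII display name (a hedged sketch; the addresses are placeholders):

    from email.message import EmailMessage
    from email.headerregistry import Address

    msg = EmailMessage()
    msg["From"] = Address("Börge Möller", "boerge.moeller", "example.org")
    msg["To"] = "recipient@example.org"
    msg.set_content("Hello")

    # Serializing to wire format applies RFC 2047 encoding automatically,
    # yielding an unquoted encoded-word followed by a space, e.g.:
    #   From: =?utf-8?q?B=C3=B6rge_M=C3=B6ller?= <boerge.moeller@example.org>
    print(msg.as_bytes().decode("ascii"))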
The RFCs that you need to look at are:
RFC5322 - this defines the modern Message syntax that is implemented by the server you are trying to interoperate with.
RFC2047 - this defines the methods and syntax of the encoded-words that are needed to represent non-ASCII characters in headers like Subject and address headers (e.g. To/From/Cc/Reply-To/etc). (This is the =?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?= part)
RFC822 - this defines the grammar used by RFC2047 and is an older version of RFC5322.
It may also be helpful to read RFC2822 which is newer than RFC822 but older than RFC5322. My guess, however, is that you can skip it because it won't have a lot of value. The only reason RFC822 still has value is because of its much older grammar definitions that are referenced by RFC2047 (such as atom, dot-atom, phrase, angle-addr, addr-spec, tspecials, etc).
RFC6532 is even newer than RFC5322; its purpose is to remove the need to encode headers altogether by allowing the use of UTF-8 as an alternative.
Before RFC6532, there was no standard for the character encoding to use for headers other than ASCII (which was what RFC822 used) and so everything was always supposed to conform to ASCII.
A lot of software doesn't follow the standards, however, and so there was a lot of mail in the real world that used ISO-8859-1 and every other character encoding under the sun, depending on what region the users were in and which character encodings were in wide use there (e.g. Big5 and GB2312 in various parts of China, Shift-JIS in Japan, EUC-KR/KS-C-5601-1987 in Korea, etc.).
This caused major interoperability problems, not least because not every mail client could handle every character encoding under the sun, but also because there was no way for a client to figure out which character encoding was even being used! It's all just binary gobbledygook.
UTF-8, however, has existed for a long time and it can represent all characters in all languages, so it was only logical for it to eventually win out as the standard character encoding to use for international email.

Why is Base64 used "only" to encode binary data?

I saw many resources about the usage of Base64 on today's internet. As I understand it, all of those resources seem to spell out a single use case in different ways: encode binary data in Base64 to avoid it being misinterpreted/corrupted as something else during transit (by intermediate systems). But I found nothing that explains the following:
Why would binary data be corrupted by intermediate systems? If I am sending an image from a server to a client, any intermediate servers/systems/routers will simply forward the data to the next appropriate servers/systems/routers in the path to the client. Why would intermediate servers/systems/routers need to interpret something they receive? Are there any examples of such systems that may corrupt/wrongly interpret data they receive on today's internet?
Why do we fear only binary data being corrupted? We use Base64 because we are sure that those 64 characters can never be corrupted/misinterpreted. But by this same logic, any text characters that do not belong to the Base64 alphabet can be corrupted/misinterpreted. Why, then, is Base64 used only to encode binary data? Extending the same idea, when we use a browser, are JavaScript and HTML files transferred in Base64 form?
There are two reasons why Base64 is used:
systems that are not 8-bit clean. This stems from "the before time", when some systems took ASCII seriously and only ever considered (and transferred) 7 bits out of any 8-bit byte (since ASCII uses only 7 bits, that would be "fine", as long as all content was actually ASCII).
systems that are 8-bit clean, but try to decode the data using a specific encoding (i.e. they assume it's well-formed text).
Both of these would have similar effects when transferring binary (i.e. non-text) data over them: they would try to interpret the binary data as textual data in a character encoding that obviously doesn't make sense (since there is no character encoding in binary data) and, as a consequence, modify the data in an unfixable way.
Base64 solves both of these in a fairly neat way: it maps all possible binary data streams into valid ASCII text: the 8th bit is never set on Base64-encoded data, because only regular old ASCII characters are used.
This pretty much solves the second problem as well, since most commonly used character encodings (with the notable exception of UTF-16 and UCS-2, among a few lesser-used ones) are ASCII compatible, which means: all valid ASCII streams happen to also be valid streams in most common encodings and represent the same characters (examples of these encodings are the ISO-8859-* family, UTF-8 and most Windows codepages).
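A quick check with Python's standard base64 module (an added illustration, not part of the original answer) shows both properties: every possible byte value maps to pure ASCII, at the size cost discussed below:

    import base64

    raw = bytes(range(256))                  # every possible byte value
    encoded = base64.b64encode(raw)

    assert all(b < 0x80 for b in encoded)    # the 8th bit is never set
    print(len(raw), len(encoded))            # 256 -> 344, i.e. ~33% larger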
As to your second question, the answer is two-fold:
textual data often comes with some kind of metadata (an HTTP header or a meta tag inside the data) that describes the encoding to be used to interpret it. Systems built to handle this kind of data understand and either tolerate or interpret those tags.
in some cases (notably for mail transport) we do have to use various encoding techniques to ensure text doesn't get mangled. This might be the use of quoted-printable encoding or sometimes even wrapping text data in Base64.
Last but not least: Base64 has a serious drawback, namely that it's inefficient. For every 3 bytes of data to encode, it produces 4 bytes of output, thus increasing the size of the data by ~33%. That's why it should be avoided when it's not necessary.
One of the uses of BASE64 is sending email.
Mail servers used a terminal to transmit data. It was also common to have translations, e.g. of \r\n into a single \n and vice versa. Note: there was also no guarantee that 8 bits could be used (the email standard is old, and it also allowed non-"internet" email, with ! instead of @). Also, systems may not have been fully ASCII.
Also, a line consisting of a single "." is considered the end of the body, and mbox files mark the start of a new mail with a line beginning "From " (which is why body lines starting with "From" get escaped to ">From"), so even when the 8-bit flag was common in mail servers, the problems were not totally solved.
BASE64 was a good way to remove all these problems: the content is just sent as characters that all servers must know, and the encoding/decoding requires just sender and receiver agreement (and the right programs), without worrying about the many relay servers in between. Note: characters outside the BASE64 alphabet (such as \r and \n) are simply ignored by decoders.
Note: you can also use BASE64 (in its URL-safe variant) to encode strings in URLs, without worrying about how web browsers interpret them. You may also see BASE64 in configuration files (e.g. to include icons): specially crafted images then cannot be misinterpreted as configuration. In short, BASE64 is handy for encoding binary data into protocols that were not designed for binary data.
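Standard BASE64 output can contain + and /, which collide with URL syntax; the URL-safe variant substitutes - and _ instead. A small demonstration with Python's standard library (the payload bytes are made up):

    import base64

    payload = b"\xfb\xef\xff example payload"    # made-up binary data

    print(base64.b64encode(payload))             # b'++//...' - contains '+' and '/'
    print(base64.urlsafe_b64encode(payload))     # b'--__...' - URL-safe replacements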

Content-Transfer-Encoding: 7bit or 8bit

When sending email content, it is required to set the "Content-Transfer-Encoding" header. I looked at the headers of many emails that I received. Some emails use "7bit" and some use "8bit".
What is the difference between these two? Which is recommended? Is there any special encoding required for the email body in order to set these headers?
It can be a bit dense to read, but the "Content-Transfer-Encoding" section of RFC 1341 has all of the details:
http://www.w3.org/Protocols/rfc1341/5_Content-Transfer-Encoding.html
The situation kinda goes from bad to worse. Here's my summary:
Background
SMTP, by definition (RFC 821), limits mail to lines of 1000 characters of 7 bits each. That means that none of the bytes you send down the pipe can have the most significant ("highest-order") bit set to "1".
The content that we want to send will often not obey this restriction inherently. Think of an image file, or a text file that contains Unicode characters: the bytes of these files will often have their 8th bit set to "1". SMTP doesn't allow this, so you need to use "transfer encoding" to describe how you've worked around the mismatch.
The values for the Content-Transfer-Encoding header describe the rule that you've chosen to solve this problem.
7Bit Encoding
7bit simply means "My data consists only of US-ASCII characters, which only use the lower 7 bits for each character." You're basically guaranteeing that all of the bytes in your content already adhere to the restrictions of SMTP, and so it needs no special treatment. You can just read it as-is.
Note that when you choose 7bit, you're agreeing that all of the lines in your content are less than 1000 characters in length.
As long as your content adheres to these rules, 7bit is the best transfer encoding, since there's no extra work necessary; you just read/write the bytes as they come off the pipe. It's also easy to eyeball 7bit content and make sense of it. The idea here is that if you're just writing in "plain English text" you'll be fine. But that wasn't true in 2005 and it isn't true today.
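A hypothetical helper (mine, not the answer's) that checks whether a body honors the 7bit contract (every byte below 0x80, no overlong lines) could look like:

    def is_7bit_clean(body: bytes, max_line: int = 998) -> bool:
        """Check the 7bit contract: pure ASCII and SMTP-safe line lengths.

        998 is the 1000-character SMTP limit minus the trailing CRLF.
        """
        if any(b >= 0x80 for b in body):
            return False
        return all(len(line) <= max_line for line in body.split(b"\r\n"))

    assert is_7bit_clean(b"Hello, world!\r\n")
    assert not is_7bit_clean("Grüße".encode("utf-8"))   # 8-bit bytes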
8Bit Encoding
8bit means "My data may include extended ASCII characters; they may use the 8th (highest) bit to indicate special characters outside of the standard US-ASCII 7-bit characters." As with 7bit, there's still a 1000-character line limit.
8bit, just like 7bit, does not actually do any transformation of the bytes as they're written to or read from the wire. It just means that you're not guaranteeing that none of the bytes will have the highest bit set to "1".
This seems like a step up from 7bit, since it gives you more freedom in your content. However, RFC 1341 contains this tidbit:
As of the publication of this document, there are no standardized Internet transports for which it is legitimate to include unencoded 8-bit or binary data in mail bodies. Thus there are no circumstances in which the "8bit" or "binary" Content-Transfer-Encoding is actually legal on the Internet.
RFC 1341 came out over 20 years ago. Since then we've gotten 8bit MIME Extensions in RFC 6152. But even then, line limits still may apply:
Note that this extension does NOT eliminate the possibility of an SMTP server limiting line length; servers are free to implement this extension but nevertheless set a line length limit no lower than 1000 octets.
Binary Encoding
binary is the same as 8bit, except that there's no line-length restriction. You can still include any characters you want, and there's no extra encoding. Similar to 8bit, RFC 1341 states that it's not really a legitimate transfer encoding. RFC 3030 extended this with BINARYMIME.
Quoted Printable
Before the 8BITMIME extension, there needed to be a way to send content that couldn't be 7bit over SMTP. HTML files (which might have lines longer than 1000 characters) and files with international characters are good examples of this. The quoted-printable encoding (defined in Section 5.1 of RFC 1341) is designed to handle this. It does two things:
Defines how to escape non-US-ASCII characters so that they can be represented in only 7-bit characters. (Short version: they get displayed as an equals sign followed by two hexadecimal digits.)
Defines that lines will be no greater than 76 characters, and that line breaks will be represented using special characters (which are then escaped).
Quoted Printable, because of the escaping and short lines, is much harder to read by a human than 7bit or 8bit, but it does support a much wider range of possible content.
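Python's quopri module shows both rules in action (an added sketch, not from the original answer): long lines are cut with a trailing "=" soft break, which disappears again on decode:

    import quopri

    long_line = b"x" * 100                     # one 100-character line
    wrapped = quopri.encodestring(long_line)
    print(wrapped.decode("ascii"))             # '=' at a line's end is a soft break

    assert all(len(l) <= 76 for l in wrapped.splitlines())
    assert quopri.decodestring(wrapped) == long_line   # soft breaks vanish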
Base64 Encoding
If your data is largely non-text (e.g. an image file), you don't have many options. 7bit is off the table. 8bit and binary were unsupported prior to the MIME extension RFCs. quoted-printable would work, but it is really inefficient (nearly every byte would be represented by 3 characters).
base64 is a good solution for this type of data. It encodes 3 raw bytes as 4 US-ASCII characters, which is relatively efficient. RFC 1341 further limits the line length of base64-encoded data to 76 characters to fit within an SMTP message, but that's relatively easy to manage when you're just splitting or concatenating arbitrary characters at fixed lengths.
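Python's base64.encodebytes (an added illustration; the input bytes are arbitrary) produces exactly this MIME-style output, wrapped at 76 characters per line:

    import base64

    data = bytes(range(120))                 # arbitrary binary data
    wrapped = base64.encodebytes(data)       # MIME variant: 76-char lines

    assert all(len(line) <= 76 for line in wrapped.splitlines())
    print(wrapped.decode("ascii"))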
The big downside is that base64-encoded data is pretty much entirely unreadable by humans, even if it's just "plain" text underneath.
With Content-Transfer-Encoding: 7bit, the bytes used in the body (or, more correctly, within the part's boundaries) should represent ASCII characters, not extended-ASCII characters. That means 0-127 decimal (the 8th bit is not used).
Since the 8th bit is not used, you cannot encode your text as UTF-8 or ISO-8859-7 bytes, because those use the 8th bit. Nor can you add binary content.
With Content-Transfer-Encoding: 8bit you can use any possible byte, which means that you can encode your text as UTF-8 or ISO-8859-7 bytes (both assuming that the 8BITMIME extension is used in SMTP). You are, however, still unsafe adding binary content, due to the maximum line-length restriction that still applies and could break your bytes with newlines.
Even with the 7bit Content-Transfer-Encoding you can still set the Content-Type's charset param to UTF-8, as long as you keep your bytes within the boundaries of 0-127.
E.g. a possible way to represent characters outside ASCII with the 7bit Content-Transfer-Encoding is to use HTML character references (with Content-Type: text/html).
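For example (an illustrative fragment using Python's email library, not part of the original answer), an HTML body can declare charset=utf-8 and 7bit while staying pure ASCII by spelling umlauts as character references:

    from email.message import EmailMessage

    msg = EmailMessage()
    msg["Subject"] = "Demo"
    # "Gr&uuml;&szlig;e" renders as "Grüße", yet every byte is below 0x80,
    # so declaring the part as 7bit is legitimate.
    msg.set_content("<p>Gr&uuml;&szlig;e</p>", subtype="html", cte="7bit")
    print(msg.as_string())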
Many email clients will set the Content-Transfer-Encoding to 7bit or 8bit depending on the case, e.g. 7bit when sending English text, 8bit when sending multilingual text. And there are always the options of quoted-printable and base64, whose characters also avoid the 8th bit, but those are outside the scope of this question.

Is "ISO8859-1" an acceptable variation/alias for "ISO-8859-1"

I've got an application that sends an email notification. When the email is generated, it includes the following in the mime source:
Content-Type: text/plain;
charset="ISO8859-1"
Content-Transfer-Encoding: quoted-printable
I've noticed that other email programs and open-source conversion tools (like iconv) don't support that specific spelling and instead require "ISO-8859-1".
I don't see "ISO8859-1" specifically listed on the IANA character set list: https://www.iana.org/assignments/character-sets/character-sets.xhtml
So my question is:
Is ISO8859-1 an acceptable variant name of ISO-8859-1, and is there some sort of RFC or standard available to definitively "prove" it one way or the other?
The IANA registry mentioned in the question cites RFC 2978, which in turn cites several RFCs which define how character encodings are to be specified in the Internet. Thus, since ISO8859-1 is not listed there, it is not correct to use it.
Programs may still accept it, as part of their error recovery, but they are not required to do so. Programs may do better error recovery, upon encountering an undefined character encoding name, by inspecting the actual content of text data and trying to make a guess on the encoding. Or they may simply fall back to some default encoding they use, and this may well be ISO-8859-1 (or, in fact, more often windows-1252).
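Whether a given program performs such recovery is purely implementation-specific. Python, for instance, happens to accept several spellings through its codec alias table (a quick check, not something the RFCs require):

    import codecs

    for name in ("ISO-8859-1", "ISO8859-1", "latin-1"):
        print(name, "->", codecs.lookup(name).name)   # all resolve to 'iso8859-1'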

What does Zend_Mime::ENCODING_8BIT mean when sending mails with Zend_Mail?

In the example for Zend_Mail at http://framework.zend.com/manual/en/zend.mail.attachments.html they use ENCODING_8BIT, but searching for what that might be sends me to http://msdn.microsoft.com/en-us/library/ms526992%28EXCHG.10%29.aspx where (and this sounds logical to me) it is explained that 8-bit encoding does not make sense for emails.
Edit:
When I use this encoding for a mail with an attachment, I receive the mail with a corrupted attachment in my mail software (Thunderbird).
In which cases does it make sense to use ENCODING_8BIT?
As everybody said, ENCODING_8BIT represents the Content Transfer Encoding.
Basically, 8BITMIME is used for internationalization. It uses 8-bit character sets and therefore allows you to send any character supported in the UTF-8 charset.
In general, non-MIME mailers send 8-bit data but do not include any MIME headers to mark the message as 8-bit data. MIME mailers should cope with this without any problems. [source]
So basically there is not really a case where it makes sense to use ENCODING_8BIT over another encoding, since emails in UTF-8 are a standard today. Also, note that most MTAs (Message Transfer Agents, such as Postfix) automatically force the encoding to 8BITMIME (UTF-8).
Here is a good resource about the 8BITMIME encoding.
The 8BITMIME extension has two effects in practice:
The client will avoid Q-P conversion.
The client may add extra information at the end of a MAIL request: a space followed by either "BODY=7BIT" or "BODY=8BITMIME".
Zend_Mime::ENCODING_8BIT sets the Content-Transfer-Encoding.
The Content-Transfer-Encoding defines methods for representing binary data in ASCII text format.
The use of Zend_Mime::ENCODING_8BIT in the example is a bug.
For sending attachments you should always use Zend_Mime::ENCODING_BASE64.
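The same rule of thumb holds outside Zend: Python's email library, for comparison, defaults to base64 for binary attachments (an illustrative sketch; the payload and filename are made up):

    from email.message import EmailMessage

    msg = EmailMessage()
    msg["Subject"] = "Report"
    msg.set_content("See attachment.")
    msg.add_attachment(
        b"\x89PNG\r\n\x1a\n...",      # made-up binary payload
        maintype="image", subtype="png",
        filename="chart.png",         # hypothetical filename
    )
    # The attachment part gets Content-Transfer-Encoding: base64 by default.
    print(msg.as_string())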
Not for email, but for attachments. If you take a look at RFC 2045, page 7:
"Binary data" refers to data where any sequence of octets whatsoever is allowed.