One email provider rejected an email containing special characters (e.g. umlauts). They say that they are RFC 5321 and RFC 5322 compliant. I browsed those standards, but they do not support international email (and thus no umlauts); only 7-bit ASCII is supported.
Now there is an extension called RFC-6532 which standardizes international emails. Our emails are UTF-8 (quoted-printable) encoded and sent like this:
"=?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?="<boerge.moeller#foo.org>
Is this an RFC-6532 compliant address? Or is it some other/older RFC (like RFC-2054)? After all there are so many mail related RFCs that I might have missed 10 or 20 ;-)
It's on the right track, but it's wrong.
"=?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?="<boerge.moeller#foo.org>
There are 2 problems with the above form:
The encoded-word (the =?UTF-8?Q?...?= bit) is quoted and shouldn't be. Mail software that parses this address won't decode that token if it is standards-compliant.
The "name" is butted up against the angle brackets and should not be. There MUST be a space in order to be standards compliant.
In other words, this is what it should look like:
=?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?= <boerge.moeller@foo.org>
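For reference, you rarely need to assemble that by hand; most mail libraries will produce the correct form for you. Here is a minimal sketch in Go (names taken from the question): net/mail's Address.String applies the RFC 2047 encoding to a non-ASCII display name and inserts the space before the angle-addr.

package main

import (
    "fmt"
    "net/mail"
)

func main() {
    // net/mail applies the RFC 2047 encoded-word encoding to a non-ASCII
    // display name and separates it from the angle-addr with a space.
    addr := mail.Address{Name: "Börge Möller", Address: "boerge.moeller@foo.org"}
    fmt.Println(addr.String())
    // Prints something like:
    // =?utf-8?q?B=C3=B6rge_M=C3=B6ller?= <boerge.moeller@foo.org>
}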
The RFCs that you need to look at are:
RFC5322 - this defines the modern Message syntax that is implemented by the server you are trying to interoperate with.
RFC2047 - this defines the methods and syntax of the encoded-words that are needed to represent non-ASCII characters in headers like Subject and address headers (e.g. To/From/Cc/Reply-To/etc). (This is the =?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?= part)
RFC822 - this defines the grammar used by RFC2047 and is an older version of RFC5322.
It may also be helpful to read RFC2822 which is newer than RFC822 but older than RFC5322. My guess, however, is that you can skip it because it won't have a lot of value. The only reason RFC822 still has value is because of its much older grammar definitions that are referenced by RFC2047 (such as atom, dot-atom, phrase, angle-addr, addr-spec, tspecials, etc).
RFC6532 is even newer than RFC5322. Its purpose is to remove the need for encoded headers altogether by allowing the use of raw UTF-8 as an alternative.
Before RFC6532, there was no standard for the character encoding to use for headers other than ASCII (which was what RFC822 used) and so everything was always supposed to conform to ASCII.
A lot of software doesn't follow the standards, however, and so there was a lot of mail in the real world that used ISO-8859-1 and every other character encoding under the sun, depending on what region the user(s) were in and which character encodings were in wide use there (e.g. Big5 and GB2312 in various parts of China, Shift-JIS in Japan, EUC-KR/KS-C-5601-1987 in Korea, etc).
This caused major interoperability problems, not least because not every mail client could handle every character encoding under the sun, but also because there was no way for a client to figure out which character encoding was even being used! It's all just binary gobbledygook.
UTF-8, however, has existed for a long time and it can represent all characters in all languages, so it was only logical for it to eventually win out as the standard character encoding to use for international email.
Related
So I don't really want this question to be language-specific; however, I suspect Go (my language choice) is playing a part here.
I'm trying to find a string within the body of a raw email. To do so, I am getting the encoding, and in the majority of cases it is quoted-printable.
OK, so that's fine: I am encoding my search query as quoted-printable and then searching for it. That works.
However, in one specific case the raw email I see in Gmail looks fine, but when I retrieve the raw email from the Gmail API, although the encoding and everything else is identical, it encodes the " as =22.
Research shows me that's because the charset is utf-8.
I haven't quite got my head around whether that's encoded as UTF-8 then quoted-printable or the other way around, but that's not quite the question either...
If I look at the email where the " is =22, I see the charset is utf-8, and when I look at another where it's not encoded, the charset is UTF-8 (notice the case). I can't believe that the case here is what's causing this to happen, but it doesn't seem a robust enough way to work out whether =22 is actually =22 or is a " encoded as utf-8.
My original thought was to always decode the quoted-printable and then re-encode it before doing the search, but I don't think this is going to be a robust approach going forward, so I thought others might have a better suggestion.
Conclusion: I'm trying to find a string in a raw email, but the encoding is causing me problems getting my search string to match the encoding of the body.
The =22-type encoding actually has nothing to do with the charset (whether that is utf-8 lowercase or UTF-8 uppercase or any other charset).
It is the Content-Transfer-Encoding: quoted-printable encoding.
The quoted-printable encoding is just a way of hex-encoding octets, typically limited to octets that fall outside of the printable ascii range. It seems odd that the DQUOTE character would be encoded in this way, but it's perfectly legal to do so.
If you want to search for strings in the body of the message, you'll need to first decode the body of the message. Otherwise you will not be successful.
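As a rough sketch of that in Go (the asker's language), with a made-up quoted-printable body: decode the transfer encoding first, then search the decoded text rather than the raw bytes.

package main

import (
    "fmt"
    "io"
    "mime/quotedprintable"
    "strings"
)

func main() {
    // A made-up quoted-printable body in which DQUOTE was (legally) encoded as =22.
    rawBody := "He said =22hello=22 to B=C3=B6rge."

    // Decode the Content-Transfer-Encoding first...
    decoded, err := io.ReadAll(quotedprintable.NewReader(strings.NewReader(rawBody)))
    if err != nil {
        panic(err)
    }

    // ...then search the decoded text, not the raw body.
    fmt.Println(strings.Contains(string(decoded), `"hello"`)) // true
    fmt.Println(strings.Contains(rawBody, `"hello"`))         // false
}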
I would recommend reading rfc2045 at a minimum.
You may also need to end up reading rfc2047 if you end up wanting to search headers at some point, but that gets... tricky due to various bugs that sending clients have.
Now that I've been "triggered" into a rant about MIME, let me explain why decoding headers is so hard to get right. I'm sure just about every developer who has ever worked on an email client could tell you this, but I guess I'm going to be the one to do it.
Here's just a short list of the problems every developer faces when they go to implement a decoder for headers which have been (theoretically) encoded according to the rfc2047 specification:
First off, there are technically two variations of header encoding formats specified by rfc2047 - one for phrases and one for unstructured text fields. They are very similar but you can't use the same rules for tokenizing them. I mention this because it seems that most MIME parsers miss this very subtle distinction and so, as you might imagine, do most MIME generators. Hell, it seems most MIME generators have probably never even heard of the specifications to begin with.
This brings us to:
There are so many variations of how MIME headers fail to be tokenizable according to the rules of rfc2822 and rfc2047. You'll encounter fun stuff such as:
a. encoded-word tokens illegally being embedded in other word tokens
b. encoded-word tokens containing illegal characters in them (such as spaces, line breaks, and more) effectively making it so that a tokenizer can no longer, well, tokenize them (at least not easily)
c. multi-byte character sequences being split between multiple encoded-word tokens which means that it's not possible to decode said encoded-word tokens individually
d. the payloads of encoded-word tokens being split up into multiple encoded-word tokens, often splitting in a location which makes it impossible to decode the payload in isolation
You can see some examples here.
Something that many developers seem to miss is the fact that each encoded-word token is allowed to be in different character encodings (you might have one token in UTF-8, another in ISO-8859-1 and yet another in koi8-r). Normally, this would be no big deal because you'd just decode each payload, then convert from the specified charset into UTF-8 via iconv() or something. However, due to the fun brokenness that I mentioned above in (2c) and (2d), this becomes more complicated.
If that isn't enough to make you want to throw your hands up in the air and mutter some profanities, there's more...
Undeclared 8bit text in headers. Yep. Some mailers just didn't get the memo that they are supposed to encode non-ASCII text. So now you get to have the fun experience of mixing and matching undeclared 8bit text of God-only-knows what charset along with the content of (probably broken) encoded-words.
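For what it's worth, the well-formed cases are easy. Here is a minimal sketch in Go using mime.WordDecoder with an invented header value; it handles multiple encoded-words declared in different charsets, but it will not rescue you from the broken inputs listed above.

package main

import (
    "fmt"
    "mime"
)

func main() {
    dec := new(mime.WordDecoder)

    // An invented header value containing two encoded-words, each declared
    // with a different charset (UTF-8 and ISO-8859-1).
    s, err := dec.DecodeHeader("Re: =?UTF-8?Q?B=C3=B6rge?= and =?ISO-8859-1?Q?M=F6ller?=")
    if err != nil {
        panic(err)
    }
    fmt.Println(s) // Re: Börge and Möller
}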
If you want to see how to deal with these issues, you can take a look at how I did it using C in my GMime library, here: https://github.com/jstedfast/gmime/blob/master/gmime/gmime-utils.c#L1894 (in case line offsets change in the future, look for _g_mime_utils_header_decode_text() and the various internal methods it uses in that source file - I have written comments explaining how it deals with the above issues).
Or you can see how I did it using C# in my MimeKit library, here: https://github.com/jstedfast/MimeKit/blob/master/MimeKit/Utils/Rfc2047.cs
For more information about why & how dealing with email is hard, check out Joshua Cranmer's blog series: http://quetzalcoatal.blogspot.com/search/label/email-hard
I've got an application that sends an email notification. When the email is generated, it includes the following in the mime source:
Content-Type: text/plain;
charset="ISO8859-1"
Content-Transfer-Encoding: quoted-printable
I've noticed that other email programs and open-source conversion tools (like iconv) don't support that specific spelling and instead require "ISO-8859-1".
I don't see "ISO8859-1" specifically listed on the IANA character set list: https://www.iana.org/assignments/character-sets/character-sets.xhtml
So my question is:
Is ISO8859-1 an acceptable variant name of ISO-8859-1, and is there some sort of RFC or standard available to definitively "prove" it one way or the other?
The IANA registry mentioned in the question cites RFC 2978, which in turn cites several RFCs which define how character encodings are to be specified in the Internet. Thus, since ISO8859-1 is not listed there, it is not correct to use it.
Programs may still accept it, as part of their error recovery, but they are not required to do so. Programs may do better error recovery, upon encountering an undefined character encoding name, by inspecting the actual content of text data and trying to make a guess on the encoding. Or they may simply fall back to some default encoding they use, and this may well be ISO-8859-1 (or, in fact, more often windows-1252).
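As a sketch of that kind of error recovery in Go (assuming the golang.org/x/text packages are available): lookupCharset is a made-up helper, and the windows-1252 fallback is just the guess described above, not anything a standard requires.

package main

import (
    "fmt"

    "golang.org/x/text/encoding"
    "golang.org/x/text/encoding/charmap"
    "golang.org/x/text/encoding/ianaindex"
)

// lookupCharset resolves a declared charset label via the IANA registry
// index; if the label is not recognized, it falls back to windows-1252
// as an application-level guess.
func lookupCharset(label string) encoding.Encoding {
    if enc, err := ianaindex.IANA.Encoding(label); err == nil && enc != nil {
        return enc
    }
    return charmap.Windows1252
}

func main() {
    for _, label := range []string{"ISO-8859-1", "ISO8859-1", "UTF-8"} {
        fmt.Printf("%-11s -> %v\n", label, lookupCharset(label))
    }
}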
The book "Designing Embedded Hardware" in the chapter "9.3. Old Faithful: RS-232C" mentions that emails are still sent in 7bit char set because of RS-232C:
It's also not unheard of to see RS-232C systems still using 7-bit data frames (another leftover from the '60s), rather than the more common 8-bit. In fact, this is one of the reasons why you'll still see email being sent on the Internet limited to a 7-bit character set, just in case the packets happen to be routed via a serial connection that supports only 7-bit transmissions.
How can I confirm the observation?
Check out the spec. The original rfc822, for ARPA Internet Text Messages, explicitly states:
A message consists of header fields and, optionally, a body. The body is simply a sequence of lines containing ASCII characters.
Since ASCII is 7-bit, voila.
Note, however, that there are a whole bunch of additions to that original spec, all the MIME extensions, which allow message header extensions for non-ASCII text.
The Quoted-printable MIME encoding is specifically designed to encode 8-bit data in 7-bit characters. This encoding is widely used to encode email.
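A quick sketch in Go illustrates this: mime/quotedprintable turns arbitrary 8-bit (here UTF-8) text into nothing but printable 7-bit ASCII.

package main

import (
    "fmt"
    "mime/quotedprintable"
    "os"
)

func main() {
    // Feed 8-bit UTF-8 text through a quoted-printable encoder; every byte
    // that comes out the other side is printable 7-bit ASCII.
    w := quotedprintable.NewWriter(os.Stdout)
    w.Write([]byte("Zoë designs façades"))
    w.Close()
    fmt.Println()
    // Output along the lines of:
    // Zo=C3=AB designs fa=C3=A7ades
}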
Note also that the text you quoted says "in case the packets happen to be routed via a serial connection" which is misleading, especially if they're talking in a context of IP packets. IP packets assume an 8-bit data path, and cannot be sent directly over a 7-bit RS-232 link without additional encoding (and then it's not a 7-bit data path anymore, it's 8-bit).
The systems that were restricted to 7 bits were already old when email first became popular. The chances that you will find one today approach zero.
Since certain characters have special meaning to email programs (most notably the end-of-line character), it still makes sense to limit the character set.
If it is possible, should I accept such email addresses from users, and what problems should I expect when sending mail to such addresses?
Officially, per RFC 6532 - Yes.
For a quick explanation, check out wikipedia on the subject.
Update 2015: Use RFC 6532
The experimental RFC 5335 has been obsoleted by RFC 6532, and the latter has been set to "Category: Standards Track", making it the standard.
Section 3.2 (Syntax Extensions to RFC 5322) updates most text fields to include (proper) UTF-8.
The following rules extend the ABNF syntax defined in [RFC5322] and
[RFC5234] in order to allow UTF-8 content.
VCHAR =/ UTF8-non-ascii
ctext =/ UTF8-non-ascii
atext =/ UTF8-non-ascii
qtext =/ UTF8-non-ascii
text =/ UTF8-non-ascii
; note that this upgrades the body to UTF-8
dtext =/ UTF8-non-ascii
The preceding changes mean that the following constructs now
allow UTF-8:
1. Unstructured text, used in header fields like
"Subject:" or "Content-description:".
2. Any construct that uses atoms, including but not limited
to the local parts of addresses and Message-IDs. This
includes addresses in the "for" clauses of "Received:"
header fields.
3. Quoted strings.
4. Domains.
Note that header field names are not on this list; these are still
restricted to ASCII.
Please note the explicit inclusion of Domains.
And the explicit exclusion of header names.
Also note this about NFKC:
The UTF-8 NFKC normalization form SHOULD NOT be used because
it may lose information that is needed to correctly spell
some names in some unusual circumstances.
And from the start of Section 3:
Also note that messages in this format require the use of the
SMTPUTF8 extension [RFC6531] to be transferred via SMTP.
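As a hedged sketch in Go (net/smtp; the host names are placeholders), you can ask a server whether it advertises that extension before trying to deliver to a UTF-8 address.

package main

import (
    "fmt"
    "log"
    "net/smtp"
)

func main() {
    // Placeholder host; replace with the real MX you are delivering to.
    c, err := smtp.Dial("smtp.example.org:25")
    if err != nil {
        log.Fatal(err)
    }
    defer c.Close()

    if err := c.Hello("client.example.org"); err != nil {
        log.Fatal(err)
    }

    // Extension reports whether the server advertised SMTPUTF8 in its
    // EHLO response; without it, RFC 6532 messages cannot be transferred.
    ok, _ := c.Extension("SMTPUTF8")
    fmt.Println("SMTPUTF8 supported:", ok)
}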
The problem is that some mail clients (server-side tools and/or desktop tools) don't support it and throw an 'invalid email' exception when you try to send mail to an address which contains umlauts, for example.
If you want full support, you can use the trick of converting the email-address parts to "punycode". This lets users type in their addresses the usual way while you store them in the widely supported form.
Example: müller.com » xn--mller-kva.com
Both point to the same thing.
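A minimal sketch of that conversion in Go, using the golang.org/x/net/idna package:

package main

import (
    "fmt"
    "log"

    "golang.org/x/net/idna"
)

func main() {
    // Convert the Unicode form of the domain to its Punycode (ACE) form
    // for storage or for systems that only accept ASCII domains.
    ascii, err := idna.ToASCII("müller.com")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(ascii) // xn--mller-kva.com

    // And back again for display.
    unicode, err := idna.ToUnicode(ascii)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(unicode) // müller.com
}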
I would assume yes, since a number of top-level domains already allow non-ASCII characters in domains, and since the domain is part of an email address, it's perfectly possible. An example of such a domain would be www.öko.de
Short answer: yes. They are allowed not only in the username (local part) but also in the domain name.
The answer is yes, but they need to be encoded specially.
Look at this. Read the part that refers to email-headers and RFC 2047.
Not yet. The IETF plans to do this:
H-Online article: IETF planning internationalised email addresses; here is the RFC: SMTP Extension for Internationalized Email Addresses
Quote from H-Online (since the article has gone down):
The Internet Engineering Task Force (IETF) has published three crucial documents for the standardisation of email address headers
that include symbols outside the ASCII character set. This means that
soon you'll be able to use Chinese characters, French accents, and
German umlauts in email addresses as well as just in the body of the
message. So if your name is Zoë and you work for a company that makes
façades, you might be interested in a new email address. But
representatives of providers are already moaning. They say there would
need to be an "upgrade mania" if the Unicode standard UTF-8 is to
replace the American Standard Code for Information Interchange (ASCII)
currently used as the general email language.
RFC 5335 specifies the use of UTF-8 in practically all email headers.
Changes would have to be made to SMTP clients, SMTP servers, mail user
agents (MUAs), software for mailing lists, gateways to other media,
and everywhere else where email is processed or passed along. RFC 5336
expands the SMTP email transport protocol. At the level of the
protocol, the expansion is labelled UTF8SMTP.
A new header field will be added as a sort of "emergency parachute" to
ensure that UTF-8 emails have a soft landing if they are thrown out
before reaching the recipient by systems that have not been upgraded.
The "OldAddress" is a purely ASCII address. But OldAddress is not to
be used as a channel for a second transfer attempt, but rather to make
sure that feedback is sent home.
Finally, RFC5337 ensures that correct messages are sent pertaining to
the delivery status of non-ASCII emails. The correct address of an
unreachable addressee must be sent back, even if further transport has
been refused. The email Address Internationalization (EAI) working
group is also working on a number of "downgrade mechanisms" for
various header fields and the envelope. If possible, original header
information is to be "packaged" and preserved.
Germany's DeNIC, the registrar for the ".de" domain, is nonetheless
taking this in its stride. "There is really not much we can do",
explained DeNIC spokesperson Klaus Herzig. DeNIC is instead paying
more attention to the update that the IETF is working on for the
standard of international domains – RFC3490, or IDNA2003 as it's
sometimes known. "We are not that happy about it because there is no
backwards compatibility," Herzig explained. When the update comes,
DeNIC says it will be throwing its weight behind the symbol "ß" - also
known as eszett - which has been overlooked up to now. The German
registrar also says that it may wait a bit before switching in light
of the lack of backward compatibility. Once the new standard is
running stably and registrars and providers have adopted it, the ß
will be added.
In contrast, experts believe that Chinese registrars in China and
Taiwan will quickly implement the change for internationalised email.
Representatives of CNIC and TWNIC are authors of the standards.
Chinese users currently have to write emails in ASCII to the left of
the @ and in Chinese characters to the right of it for Chinese
domains, which have already been internationalized.
(Monika Ermert)
The usual method of URL-encoding a unicode character is to split it into 2 %HH codes. (\u4161 => %41%61)
But, how is unicode distinguished when decoding? How do you know that %41%61 is \u4161 vs. \x41\x61 ("Aa")?
Are 8-bit characters, that require encoding, preceded by %00?
Or, is the point that unicode characters are supposed to be lost/split?
According to Wikipedia:
Current standard
The generic URI syntax mandates that new URI schemes that provide for the representation of character data in a URI must, in effect, represent characters from the unreserved set without translation, and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This requirement was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
Not addressed by the current specification is what to do with encoded character data. For example, in computers, character data manifests in encoded form, at some level, and thus could be treated as either binary data or as character data when being mapped to URI characters. Presumably, it is up to the URI scheme specifications to account for this possibility and require one or the other, but in practice, few, if any, actually do.
Non-standard implementations
There exists a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a Unicode value represented as four hexadecimal digits. This behavior is not specified by any RFC and has been rejected by the W3C. The third edition of ECMA-262 still includes an escape(string) function that uses this syntax, but also an encodeURI(uri) function that converts to UTF-8 and percent-encodes each octet.
So, it looks like it's entirely up to the person writing the unencode method... Aren't standards fun?
What I've always done is first UTF-8 encode a Unicode string to make it a series of 8-bit characters before escaping any of those with %HH.
P.S. - I can only hope the non-standard implementations (%uxxxx) are few and far between.
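For what it's worth, that UTF-8-then-percent-encode approach is what Go's net/url does; here's a quick sketch using the \u4161 example from the question.

package main

import (
    "fmt"
    "net/url"
)

func main() {
    // QueryEscape percent-encodes the UTF-8 bytes of the string, so one
    // non-ASCII code point becomes several %HH triplets, never %uxxxx.
    fmt.Println(url.QueryEscape("\u4161")) // %E4%85%A1 (U+4161 as three UTF-8 bytes)
    fmt.Println(url.QueryEscape("Aa"))     // Aa (plain ASCII is left alone)

    // Decoding reverses it: unescape, then interpret the bytes as UTF-8.
    s, err := url.QueryUnescape("%E4%85%A1")
    if err != nil {
        panic(err)
    }
    fmt.Println(s == "\u4161") // true
}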
Since URIs were introduced before Unicode was around, or at least in wide use, I imagine this is a very implementation-specific question. UTF-8 encoding your text, then escaping that as normal, sounds like the best idea, since that's completely backwards-compatible with any ASCII/ANSI systems in place, though you might get the odd weird character or two.
On the other end, to decode, you'd unescape your text and get a UTF-8 string. If someone using an older system tries to send you some data in ASCII/ANSI, there's no harm done; that's (almost) UTF-8 encoded already.