Can an email address contain international (non-english) characters? - email

If it's possible, should I accept such emails from users and what problems to expect when I will be sending mails to such addresses?

Officially, per RFC 6532 - Yes.
For a quick explanation, check out wikipedia on the subject.

Update 2015: Use RFC 6532
The experimental 5335 has been Obsoleted by: 6532 and
this later has been set to "Category: Standards Track",
making it the standard.
The Section 3.2 (Syntax Extensions to RFC 5322) has updated most text fields to
include (proper) UTF-8.
The following rules extend the ABNF syntax defined in [RFC5322] and
[RFC5234] in order to allow UTF-8 content.
VCHAR =/ UTF8-non-ascii
ctext =/ UTF8-non-ascii
atext =/ UTF8-non-ascii
qtext =/ UTF8-non-ascii
text =/ UTF8-non-ascii
; note that this upgrades the body to UTF-8
dtext =/ UTF8-non-ascii
The preceding changes mean that the following constructs now
allow UTF-8:
1. Unstructured text, used in header fields like
"Subject:" or "Content-description:".
2. Any construct that uses atoms, including but not limited
to the local parts of addresses and Message-IDs. This
includes addresses in the "for" clauses of "Received:"
header fields.
3. Quoted strings.
4. Domains.
Note that header field names are not on this list; these are still
restricted to ASCII.
Please note the explicit inclusion of Domains.
And the explicit exclusion of header names.
Also Note about NFKC:
The UTF-8 NFKC normalization form SHOULD NOT be used because
it may lose information that is needed to correctly spell
some names in some unusual circumstances.
And Section 3 start:
Also note that messages in this format require the use of the
SMTPUTF8 extension [RFC6531] to be transferred via SMTP.

The problem is that some mail clients (server-tools and / or desktop tools) don't support it and throw an 'invalid email' exception when you try to send a mail to an address which contains umlauts for example.
If you want full support, you could do the trick with converting the email-address parts to "punycode". This allows users to type in their addresses the usual way but you save it the supported-level way.
Example: müller.com » xn--mller-kva.com
Both points to the same thing.

I would assume yes since a number of top level domains already allow non ascii
characters for domains and since the domain is part of an email address, it's
perfectly possible. An example for such a domain would be www.öko.de

short answer: yes
not only in the username but also in the domain name are allowed.

The answer is yes, but they need to be encoded specially.
Look at this. Read the part that refers to email-headers and RFC 2047.

Not yet. The IEEE plans to do this:
H-Online article: IEFT planning internationalised email addresses, here is the RfC: SMTP Extension for Internationalized Email Addresses
Quote from H-Online (as it went down):
The Internet Engineering Task Force (IETF) has published three crucial documents for the standardisation of email address headers
that include symbols outside the ASCII character set. This means that
soon you'll be able to use Chinese characters, French accents, and
German umlauts in email addresses as well as just in the body of the
message. So if your name is Zoë and you work for a company that makes
façades, you might be interested in a new email address. But
representatives of providers are already moaning. They say there would
need to be an "upgrade mania" if the Unicode standard UTF-8 is to
replace the American Standard Code for Information Interchange (ASCII)
currently used as the general email language.
RFC 5335 specifies the use of UTF-8 in practically all email headers.
Changes would have to be made to SMTP clients, SMTP servers, mail user
agents (MUAs), software for mailing lists, gateways to other media,
and everywhere else where email is processed or passed along. RFC 5336
expands the SMTP email transport protocol. At the level of the
protocol, the expansion is labelled UTF8SMTP.
A new header field will be added as a sort of "emergency parachute" to
ensure that UTF-8 emails have a soft landing if they are thrown out
before reaching the recipient by systems that have not been upgraded.
The "OldAddress" is a purely ASCII address. But OldAddress is not to
be used as a channel for a second transfer attempt, but rather to make
sure that feedback is sent home.
Finally, RFC5337 ensures that correct messages are sent pertaining to
the delivery status of non-ASCII emails. The correct address of an
unreachable addressee must be sent back, even if further transport has
been refused. The email Address Internationalization (EAI) working
group is also working on a number of "downgrade mechanisms" for
various header fields and the envelope. If possible, original header
information is to be "packaged" and preserved.
Germany's DeNIC, the registrar for the ".de" domain, is nonetheless
taking this in its stride. "There is really not much we can do",
explained DeNIC spokesperson Klaus Herzig. DeNIC is instead paying
more attention to the update that the IETF is working on for the
standard of international domains – RFC3490, or IDNA2003 as it's
sometimes known. "We are not that happy about it because there is no
backwards compatibility," Herzig explained. When the update comes,
DeNIC says it will be throwing its weight behind the symbol "ß" - also
known as estzett - which has been overlooked up to now. The German
registrar also says that it may wait a bit before switching in light
of the lack of backward compatibility. Once the new standard is
running stably and registrars and providers have adopted it, the ß
will be added.
In contrast, experts believe that Chinese registrars in China and
Taiwan will quickly implement the change for internationalised email.
Representatives of CNIC and TWNIC are authors of the standards.
Chinese users currently have to write emails in ASCII to the left of
the # and in Chinese characters to the right of it for Chinese
domains, which have already been internationalized.
(Monika Ermert)

Related

Email with special characters rejected - RFC-6532 and "quoted-printable"

One email provider rejected an email containing special characters (e.g. umlaute). They say that they are RFC-5321 and RFC-5322 compliant. Now I browsed those standards however they are not supporting international emails (thus no umlaute). Only ASCII-127 is supported.
Now there is an extension called RFC-6532 which standardizes international emails. Our emails are UTF-8 (quoted-printable) encoded and sent like this:
"=?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?="<boerge.moeller#foo.org>
Is this an RFC-6532 compliant address? Or is it some other/older RFC (like RFC-2054)? After all there are so many mail related RFCs that I might have missed 10 or 20 ;-)
It's on the right track, but it's wrong.
"=?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?="<boerge.moeller#foo.org>
There are 2 problems with the above form:
The encoded-word (the =?UTF-8?Q?...?= bit) is quoted and shouldn't be. Mail software that parse this address won't decode that token if they are standards-compliant.
The "name" is butted up against the angle brackets and should not be. There MUST be a space in order to be standards compliant.
In other words, this is what it should look like:
=?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?= <boerge.moeller#foo.org>
The RFCs that you need to look at are:
RFC5322 - this defines the modern Message syntax that is implemented by the server you are trying to interoperate with.
RFC2047 - this defines the methods and syntax of the encoded-words that are needed to represent non-ASCII characters in headers like Subject and address headers (e.g. To/From/Cc/Reply-To/etc). (This is the =?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?= part)
RFC822 - this defines the grammar used by RFC2047 and is an older version of RFC5322.
It may also be helpful to read RFC2822 which is newer than RFC822 but older than RFC5322. My guess, however, is that you can skip it because it won't have a lot of value. The only reason RFC822 still has value is because of its much older grammar definitions that are referenced by RFC2047 (such as atom, dot-atom, phrase, angle-addr, addr-spec, tspecials, etc).
RFC6532 is even newer than RFC5322. The purpose of which is to remove the need to encode headers altogether by allowing the use of UTF-8 as an alternative.
Before RFC6532, there was no standard for the character encoding to use for headers other than ASCII (which was what RFC822 used) and so everything was always supposed to conform to ASCII.
A lot of software doesn't follow the standards, however, and so there was a lot of mail in the real world that used ISO-8859-1 and every other character encoding under the sun, all depending on what region the user(s) were in and what character encoding(s) were in wide use in those regions (e.g. Big5 and GB2312 are popular in various parts of China, Shift-JIS being popular in Japan, EUC-KR/KS-C-5601-1987 are popular in Korea, etc).
This caused major interoperability problems, though, not least of which because not every mail client could handle every character encoding under the sun, but also because there was no way for a client to figure out which character encoding was even being used! It's all just binary gobbeldy-gook.
UTF-8, however, has existed for a long time and it can represent all characters in all languages, so it was only logical for it to eventually win out as the standard character encoding to use for international email.

Is it possible to add another Unicode character for "at sign" without changing any code in the back-end of all the email providers?

So lets say for some reason we wanted to add another Unicode character for at sign, and use it instead of # in all the email providers
Now i have three questions:
How do email providers parse the email, do they actually parse the written email until they see a # and they have hard-coded the # symbol's Unicode in the parser?
Do different service providers have different email parser with different standards or is there a standard type of parser library that every email provider use?
Will it be possible to add another at sign symbol and use it in emails without having to make changes in all the email provider's code?
Yes, e-mail addresses are parsed using a hard-wired # character. After almost fifty years of e-mail, there are literally millions of e-mail handling programs, and they all use this same syntax. So you're not going to be able to change this convention, and your second and third questions are moot.
E-mail addresses are parsed by tens of different kind of softwares, not just "email server" software inside "e-mail providers". Even things as trivial as client-side javascript highlighting for an e-mail field - of which there are easly tens of thousands around, would have to adapt.
An "#" is not a character class by itself - so, even if it were an unique "unicode character class" for "Unicode Separator", whou would ever have written code that would check for the character class of the separator? Have you ever done that, even for filtering punctuation out? (A real use case for the unicode classification of characters, and even them, this sees little use in real-world code).
Now, of course, you are free to write email client code that would present the "#" as anything else when rendering e-mail data to the users. Internally, if this software would not use "#", even for its own uses, it would not work with anything else in the World - from antivirus software to text-based templates.
And finally, such a change would hardly have to do with "unicode" itself - Unicode can standardize characters - but the e-mail protocol is a separate thing - normally the series of documents kept as "RFC"s is what mandate various internet protocols, including IMAP, POP and SMTP- the three protocols that are used to enable e-mail to work. Even if new RFCs for all these would be published with a new character accept in place of "#", it would likely take more than a decade until all software around, as detailed above, would be compliant enough to enable it to be used. (And yes, all of it would have to be changed)

RFC 5322 email format validation

How can I check if emails that are generated by my code a valid according to
RFC 5322 ?
Here's a PCRE regular expression (taken from a PHP library) that will validate according to RFC 5322:
'/^(?!(?>(?1)"?(?>\\\[ -~]|[^"])"?(?1)){255,})(?!(?>(?1)"?(?>\\\[ -~]|[^"])"?(?1)){65,}#)((?>(?>(?>((?>(?>(?>\x0D\x0A)?[\t ])+|(?>[\t ]*\x0D\x0A)?[\t ]+)?)(\((?>(?2)(?>[\x01-\x08\x0B\x0C\x0E-\'*-\[\]-\x7F]|\\\[\x00-\x7F]|(?3)))*(?2)\)))+(?2))|(?2))?)([!#-\'*+\/-9=?^-~-]+|"(?>(?2)(?>[\x01-\x08\x0B\x0C\x0E-!#-\[\]-\x7F]|\\\[\x00-\x7F]))*(?2)")(?>(?1)\.(?1)(?4))*(?1)#(?!(?1)[a-z\d-]{64,})(?1)(?>([a-z\d](?>[a-z\d-]*[a-z\d])?)(?>(?1)\.(?!(?1)[a-z\d-]{64,})(?1)(?5)){0,126}|\[(?:(?>IPv6:(?>([a-f\d]{1,4})(?>:(?6)){7}|(?!(?:.*[a-f\d][:\]]){8,})((?6)(?>:(?6)){0,6})?::(?7)?))|(?>(?>IPv6:(?>(?6)(?>:(?6)){5}:|(?!(?:.*[a-f\d]:){6,})(?8)?::(?>((?6)(?>:(?6)){0,4}):)?))?(25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)(?>\.(?9)){3}))\])(?1)$/isD'
Unlike Peter's answer it does allow for single-label domain names (which are syntactically valid) and IPv6 address literals.
However, I'd strongly suggest to instead validate according to RFC 5321 which doesn't allow for comments or folding white space (which are semantically invisible and so not actually a part of the email address) or for obsolete local parts (which can just be re-written as non-obsolete quoted strings):
'/^(?!(?>"?(?>\\\[ -~]|[^"])"?){255,})(?!"?(?>\\\[ -~]|[^"]){65,}"?#)(?>([!#-\'*+\/-9=?^-~-]+)(?>\.(?1))*|"(?>[ !#-\[\]-~]|\\\[ -~])*")#(?!.*[^.]{64,})(?>([a-z\d](?>[a-z\d-]*[a-z\d])?)(?>\.(?2)){0,126}|\[(?:(?>IPv6:(?>([a-f\d]{1,4})(?>:(?3)){7}|(?!(?:.*[a-f\d][:\]]){8,})((?3)(?>:(?3)){0,6})?::(?4)?))|(?>(?>IPv6:(?>(?3)(?>:(?3)){5}:|(?!(?:.*[a-f\d]:){6,})(?5)?::(?>((?3)(?>:(?3)){0,4}):)?))?(25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)(?>\.(?6)){3}))\])$/iD'
Using this regex its like 98% valid. It doesn't validate the following:
postbox#com
admin#mailserver1
user#[IPv6:2001:db8:1ff::a0b:dbd0]
But it covers everything else
^(([^<>()[\\]\\.,;:\\s#\"]+(\\.[^<>()[\\]\\.,;:\\s#\"]+)*)|(\".+\"))#((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$
Note: This is transported directly from some production Golang code so slashes are added.
Email Regex as per RFC 5322 Policy
After so much struggle I made the regex validating all the cases as per 5322 except one:
(1)admin#mailserver1 (local domain name with no TLD, although ICANN highly discourages dot less email addresses)
^(?=.{1,64}#)((?:[A-Za-z0-9!#$%&'*+-/=?^\{\|\}~]+|"(?:\\"|\\\\|[A-Za-z0-9\.!#\$%&'\*\+\-/=\?\^_{|}~ (),:;<>#[].])+")(?:.(?:[A-Za-z0-9!#$%&'*+-/=?^\{\|\}~]+|"(?:\\"|\\\\|[A-Za-z0-9\.!#\$%&'\*\+\-/=\?\^_{|}~ (),:;<>#[].])+")))#(?=.{1,255}.)((?:[A-Za-z0-9]+(?:(?:[A-Za-z0-9-][A-Za-z0-9])?).)+[A-Za-z]{2,})|(((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){0,6}(0|)])$
Please click here to get a clear idea about this regex
https://regex101.com/r/7u0dze/1

Are international characters (e.g. umlaut characters) valid in the local part of email addresses?

Are german umlauts (ä, ö, ü) and the sz-character (ß) valid in the local part of an email-address?
For example take this email-address: björn.nußbaum#trouble.org
RFC 5322 quite clearly says, that umlauts (and other international characters) aren't allowed. If I take a look at chapter 3.4.1, there's the following regarding the local part:
local-part = dot-atom / quoted-string / obs-local-part
So what means dot-atom? It's described in chapter 3.2.3: Well, long story short: Printable US-ASCII characters not including specials
So in the whole RFC 5322 I can't see anything regarding international characters.
Or is RFC 5322 already obsolete? (RFC 822 -> RFC 2822 -> RFC 5322)
Update:
The important point for me is: What's the current standard? International characters allowed or not?
RFC 5322 is marked as DRAFT STANDARD. So I think that's the most recent source to rely on, isn't it?
Efran mentioned, that RFC 5336 allows international characters. But RFC 5336 is marked as EXPERIMENTAL, so that's not interesting for me.
Yes, they are valid characters as long as the mail exchanger responsible for the email address supports the UTF8SMTP extension, discussed in RFC 5336. Beware that just a small portion of the mail exchangers out there supports internationalized email addresses.
Both our email validation component for Microsoft .NET and our REST email validation service, for example, allow UTF8 characters in the local part of an email address but will mark it as invalid if its related mail exchanger does not support the aforementioned extension.
https://www.rfc-editor.org/rfc/rfc5322#section-3.4.1 is your latest standards track reference. Generally it is not advisable to use characters which require quoting due to the outrageously high amount of standards unconformant MTAs out there. Such email are bound to get lost in the long run.
As a friendly advice this table is pretty useful too (from Jochen Topf, titled "Characters in the local part of an email address"): https://www.jochentopf.com/email/chars.html
It looks like rfc6531 replaces 5336 and it is "PROPOSED STANDARD"
https://www.rfc-editor.org/rfc/rfc6531

Is it possible to send email to an address that contains latin unicode characters with cfmail?

We need to be able to send an email with cfmail to an email address that contains a latin a with acute. I assume we'll eventually have to allow other Unicode characters too - a sample email address is foobár#example.com. ColdFusion throws an error on this email address, which is technically valid. Since the acute a is a UTF-8 character, and the default encoding for cfmail is UTF-8, I'm not sure what other settings I would need to enable to make this work. Is this possible?
The error I get is Attribute validation error for tag CFMAIL.
Detail: The value of the attribute to, which is currently foobár#example.com, is invalid.
I'm neither an I18N nor email expert but my understanding FWIW is that current systems don't generally support unicode in the local part of the email address, i.e. the mailbox name before the #. Local mail servers may support it and allow a name such as foobár internally, but if that person wants to receive mail from the outside world they will also need an ASCII alias such as foobar.
There is however a mechanism for supporting unicode in the domain portion of the address, which involves conversion to an ASCII representation called punycode. This means an address such as foo#foobár.com will be converted to foo#xn--foobr-0qa.com which current DNS and mail systems will accept.
It's possible to do this conversion in ColdFusion by using existing Java libraries. For more detail see this question.