What is the RFC 822 format for email addresses?

I have to make a regular expression for the email addresses (RFC 822) and I want to know which characters are allowed in the local part and in the domain.
I found this https://www.rfc-editor.org/rfc/rfc822#section-6.1 but I can't see where it says which characters are valid.

According to RFC 822, the local part may contain any ASCII character: local-part is defined in terms of word, which is defined as atom / quoted-string. An atom may contain any ASCII character except the specials, space, and control characters, and the rest can be written in a quoted-string. There are syntactic restrictions, but as long as you obey them, any ASCII character can be used.
On similar grounds, RFC 822 allows any ASCII character in the domain part.
On the other hand, RFC 822 was obsoleted in 2001 by RFC 2822, which in turn was obsoleted in 2008 by RFC 5322. The status of RFCs can be checked from the RFC Editor’s RFC database.
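To make the atom / quoted-string distinction concrete, here is a minimal Python sketch based on the RFC 822 grammar. The helper names `is_atom` and `quote_word` are illustrative and not taken from any library, and the sketch handles a single word only (a local part is one or more words separated by dots).

```python
# RFC 822 "specials": characters that may not appear in an unquoted atom.
SPECIALS = set('()<>@,;:\\".[]')


def is_atom(word: str) -> bool:
    """Return True if `word` is a non-empty RFC 822 atom.

    An atom is a run of ASCII characters excluding the specials,
    SPACE (32), and control characters (0-31 and 127).
    """
    return bool(word) and all(
        32 < ord(ch) < 127 and ch not in SPECIALS for ch in word
    )


def quote_word(word: str) -> str:
    """Render `word` as an atom if possible, otherwise as a quoted-string."""
    if not word.isascii():
        raise ValueError("RFC 822 words are limited to ASCII")
    if is_atom(word):
        return word
    # Inside a quoted-string, backslash, double quote, and CR need a backslash.
    escaped = "".join("\\" + ch if ch in '\\"\r' else ch for ch in word)
    return f'"{escaped}"'
```

For example, `quote_word("john")` leaves the word untouched, while `quote_word("john doe")` wraps it in double quotes because a space cannot appear in an atom.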

Related

Email with special characters rejected - RFC-6532 and "quoted-printable"

One email provider rejected an email containing special characters (e.g. umlauts). They say that they are RFC-5321 and RFC-5322 compliant. I browsed those standards, but they do not support international emails (so no umlauts); only 7-bit ASCII is supported.
Now there is an extension called RFC-6532 which standardizes international emails. Our emails are UTF-8 (quoted-printable) encoded and sent like this:
"=?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?="<boerge.moeller#foo.org>
Is this an RFC-6532 compliant address? Or is it some other/older RFC (like RFC-2054)? After all there are so many mail related RFCs that I might have missed 10 or 20 ;-)
It's on the right track, but it's wrong.
"=?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?="<boerge.moeller#foo.org>
There are 2 problems with the above form:
The encoded-word (the =?UTF-8?Q?...?= bit) is quoted and shouldn't be. Mail software that parses this address won't decode that token if it is standards-compliant.
The "name" is butted up against the angle brackets and should not be. There MUST be a space in order to be standards compliant.
In other words, this is what it should look like:
=?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?= <boerge.moeller#foo.org>
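As a rough illustration of the corrected form, the following Python sketch builds and parses such a header value with the standard library's `email` package. The function names `format_mailbox` and `parse_mailbox` are made up for this example, and the address is written with the usual @ separator.

```python
from email.charset import QP, Charset
from email.header import Header, decode_header, make_header
from email.utils import parseaddr


def format_mailbox(display_name: str, addr_spec: str) -> str:
    """Return 'encoded-word <addr-spec>': unquoted name, then a space."""
    charset = Charset("utf-8")
    charset.header_encoding = QP  # force the "Q" encoding used above
    encoded_name = Header(display_name, charset).encode()
    return f"{encoded_name} <{addr_spec}>"


def parse_mailbox(header_value: str) -> tuple[str, str]:
    """Split a mailbox into (decoded display name, addr-spec)."""
    raw_name, addr_spec = parseaddr(header_value)
    # decode_header turns RFC 2047 encoded-words back into text.
    return str(make_header(decode_header(raw_name))), addr_spec


name, addr = parse_mailbox(
    "=?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?= <boerge.moeller@foo.org>"
)
```

This is only a sketch for short display names; long names may be split into several encoded-words and folded across lines.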
The RFCs that you need to look at are:
RFC5322 - this defines the modern Message syntax that is implemented by the server you are trying to interoperate with.
RFC2047 - this defines the methods and syntax of the encoded-words that are needed to represent non-ASCII characters in headers like Subject and address headers (e.g. To/From/Cc/Reply-To/etc). (This is the =?UTF-8?Q?B=C3=B6rge_M=C3=B6ller?= part)
RFC822 - this defines the grammar used by RFC2047 and is an older version of RFC5322.
It may also be helpful to read RFC2822 which is newer than RFC822 but older than RFC5322. My guess, however, is that you can skip it because it won't have a lot of value. The only reason RFC822 still has value is because of its much older grammar definitions that are referenced by RFC2047 (such as atom, dot-atom, phrase, angle-addr, addr-spec, tspecials, etc).
RFC6532 is even newer than RFC5322. Its purpose is to remove the need to encode headers altogether by allowing the use of UTF-8 as an alternative.
Before RFC6532, there was no standard for the character encoding to use for headers other than ASCII (which was what RFC822 used) and so everything was always supposed to conform to ASCII.
A lot of software doesn't follow the standards, however, and so there was a lot of mail in the real world that used ISO-8859-1 and every other character encoding under the sun, all depending on what region the user(s) were in and what character encoding(s) were in wide use in those regions (e.g. Big5 and GB2312 are popular in various parts of China, Shift-JIS being popular in Japan, EUC-KR/KS-C-5601-1987 are popular in Korea, etc).
This caused major interoperability problems, though, not least because not every mail client could handle every character encoding under the sun, but also because there was no way for a client to figure out which character encoding was even being used! It's all just binary gobbledygook.
UTF-8, however, has existed for a long time and it can represent all characters in all languages, so it was only logical for it to eventually win out as the standard character encoding to use for international email.
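The difference between RFC 2047 encoded headers and RFC 6532 raw UTF-8 headers can be seen with Python's standard `email` package, which offers both an ASCII-only policy and a UTF-8 policy. This is a small sketch, not a statement about what any particular provider accepts; the address uses the usual @ separator.

```python
from email import policy
from email.headerregistry import Address
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = Address("Börge Möller", "boerge.moeller", "foo.org")

# policy.SMTP keeps headers 7-bit clean by using RFC 2047 encoded-words.
ascii_only = msg.as_bytes(policy=policy.SMTP)

# policy.SMTPUTF8 writes non-ASCII header text as raw UTF-8 (RFC 6532 style).
raw_utf8 = msg.as_bytes(policy=policy.SMTPUTF8)
```

The first serialization is what a server that only implements RFC 5321 and RFC 5322 expects; the second is only safe to send when the receiving side has advertised support for internationalized email.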

RFC 5322 email format validation

How can I check if emails that are generated by my code are valid according to RFC 5322?
Here's a PCRE regular expression (taken from a PHP library) that will validate according to RFC 5322:
'/^(?!(?>(?1)"?(?>\\\[ -~]|[^"])"?(?1)){255,})(?!(?>(?1)"?(?>\\\[ -~]|[^"])"?(?1)){65,}#)((?>(?>(?>((?>(?>(?>\x0D\x0A)?[\t ])+|(?>[\t ]*\x0D\x0A)?[\t ]+)?)(\((?>(?2)(?>[\x01-\x08\x0B\x0C\x0E-\'*-\[\]-\x7F]|\\\[\x00-\x7F]|(?3)))*(?2)\)))+(?2))|(?2))?)([!#-\'*+\/-9=?^-~-]+|"(?>(?2)(?>[\x01-\x08\x0B\x0C\x0E-!#-\[\]-\x7F]|\\\[\x00-\x7F]))*(?2)")(?>(?1)\.(?1)(?4))*(?1)#(?!(?1)[a-z\d-]{64,})(?1)(?>([a-z\d](?>[a-z\d-]*[a-z\d])?)(?>(?1)\.(?!(?1)[a-z\d-]{64,})(?1)(?5)){0,126}|\[(?:(?>IPv6:(?>([a-f\d]{1,4})(?>:(?6)){7}|(?!(?:.*[a-f\d][:\]]){8,})((?6)(?>:(?6)){0,6})?::(?7)?))|(?>(?>IPv6:(?>(?6)(?>:(?6)){5}:|(?!(?:.*[a-f\d]:){6,})(?8)?::(?>((?6)(?>:(?6)){0,4}):)?))?(25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)(?>\.(?9)){3}))\])(?1)$/isD'
Unlike Peter's answer it does allow for single-label domain names (which are syntactically valid) and IPv6 address literals.
However, I'd strongly suggest validating against RFC 5321 instead, which doesn't allow comments or folding white space (which are semantically invisible and so not actually a part of the email address) or obsolete local parts (which can just be rewritten as non-obsolete quoted strings):
'/^(?!(?>"?(?>\\\[ -~]|[^"])"?){255,})(?!"?(?>\\\[ -~]|[^"]){65,}"?#)(?>([!#-\'*+\/-9=?^-~-]+)(?>\.(?1))*|"(?>[ !#-\[\]-~]|\\\[ -~])*")#(?!.*[^.]{64,})(?>([a-z\d](?>[a-z\d-]*[a-z\d])?)(?>\.(?2)){0,126}|\[(?:(?>IPv6:(?>([a-f\d]{1,4})(?>:(?3)){7}|(?!(?:.*[a-f\d][:\]]){8,})((?3)(?>:(?3)){0,6})?::(?4)?))|(?>(?>IPv6:(?>(?3)(?>:(?3)){5}:|(?!(?:.*[a-f\d]:){6,})(?5)?::(?>((?3)(?>:(?3)){0,4}):)?))?(25[0-5]|2[0-4]\d|1\d{2}|[1-9]?\d)(?>\.(?6)){3}))\])$/iD'
This regex is roughly 98% accurate. It doesn't validate the following:
postbox#com
admin#mailserver1
user#[IPv6:2001:db8:1ff::a0b:dbd0]
But it covers everything else:
^(([^<>()[\\]\\.,;:\\s#\"]+(\\.[^<>()[\\]\\.,;:\\s#\"]+)*)|(\".+\"))#((\\[[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\])|(([a-zA-Z\\-0-9]+\\.)+[a-zA-Z]{2,}))$
Note: This is copied directly from some production Golang code, so the backslashes are doubled (escaped for a Go string literal).
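For reference, here is the same pattern with the string-literal escaping removed, as a Python sketch. It assumes that the separator between local part and domain is the usual @ character; the names `SIMPLE_EMAIL` and `looks_valid` are illustrative.

```python
import re

# The pattern from above with string-literal escaping removed.
# Assumption: local part and domain are separated by "@".
SIMPLE_EMAIL = re.compile(
    r'^(([^<>()[\]\.,;:\s@"]+(\.[^<>()[\]\.,;:\s@"]+)*)|(".+"))'
    r'@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])'
    r'|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$'
)


def looks_valid(address: str) -> bool:
    """Return True if `address` matches the simplified pattern."""
    return SIMPLE_EMAIL.match(address) is not None
```

Under that assumption, the three addresses listed above are rejected because the domain part must be either a bracketed IPv4 literal or contain at least one dot followed by a purely alphabetic label of two or more letters.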
Email Regex as per RFC 5322 Policy
After so much struggle I made the regex validating all the cases as per 5322 except one:
(1)admin#mailserver1 (local domain name with no TLD, although ICANN highly discourages dot less email addresses)
^(?=.{1,64}#)((?:[A-Za-z0-9!#$%&'*+-/=?^\{\|\}~]+|"(?:\\"|\\\\|[A-Za-z0-9\.!#\$%&'\*\+\-/=\?\^_{|}~ (),:;<>#[].])+")(?:.(?:[A-Za-z0-9!#$%&'*+-/=?^\{\|\}~]+|"(?:\\"|\\\\|[A-Za-z0-9\.!#\$%&'\*\+\-/=\?\^_{|}~ (),:;<>#[].])+")))#(?=.{1,255}.)((?:[A-Za-z0-9]+(?:(?:[A-Za-z0-9-][A-Za-z0-9])?).)+[A-Za-z]{2,})|(((0|[1-9A-Fa-f][0-9A-Fa-f]{0,3}):){0,6}(0|)])$
For a breakdown of this regex, see https://regex101.com/r/7u0dze/1

Are international characters (e.g. umlaut characters) valid in the local part of email addresses?

Are german umlauts (ä, ö, ü) and the sz-character (ß) valid in the local part of an email-address?
For example take this email-address: björn.nußbaum#trouble.org
RFC 5322 says quite clearly that umlauts (and other international characters) aren't allowed. Section 3.4.1 says the following about the local part:
local-part = dot-atom / quoted-string / obs-local-part
So what does dot-atom mean? It's described in section 3.2.3. Long story short: printable US-ASCII characters, not including specials.
So in the whole RFC 5322 I can't see anything regarding international characters.
Or is RFC 5322 already obsolete? (RFC 822 -> RFC 2822 -> RFC 5322)
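The dot-atom rule can be written down as a short Python sketch. It covers only the dot-atom-text form of the local part (no quoted-string, no obsolete syntax, no surrounding comments or white space), and the names `ATEXT`, `DOT_ATOM_TEXT`, and `is_dot_atom` are illustrative.

```python
import re

# atext from RFC 5322 section 3.2.3: letters, digits, and these symbols.
ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"

# dot-atom-text = 1*atext *("." 1*atext)
DOT_ATOM_TEXT = re.compile(rf"{ATEXT}+(?:\.{ATEXT}+)*")


def is_dot_atom(local_part: str) -> bool:
    """Return True if `local_part` is a plain RFC 5322 dot-atom-text."""
    return DOT_ATOM_TEXT.fullmatch(local_part) is not None
```

With this check, `bjoern.nussbaum` passes, while `björn.nußbaum` fails because ö and ß are not in atext.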
Update:
The important point for me is: What's the current standard? International characters allowed or not?
RFC 5322 is marked as DRAFT STANDARD. So I think that's the most recent source to rely on, isn't it?
Efran mentioned, that RFC 5336 allows international characters. But RFC 5336 is marked as EXPERIMENTAL, so that's not interesting for me.
Yes, they are valid characters as long as the mail exchanger responsible for the email address supports the UTF8SMTP extension, discussed in RFC 5336. Beware that just a small portion of the mail exchangers out there supports internationalized email addresses.
Both our email validation component for Microsoft .NET and our REST email validation service, for example, allow UTF8 characters in the local part of an email address but will mark it as invalid if its related mail exchanger does not support the aforementioned extension.
https://www.rfc-editor.org/rfc/rfc5322#section-3.4.1 is your latest standards-track reference. Generally it is not advisable to use characters that require quoting, because of the outrageously high number of non-conformant MTAs out there. Such emails are bound to get lost in the long run.
As a friendly tip, this table is pretty useful too (from Jochen Topf, titled "Characters in the local part of an email address"): https://www.jochentopf.com/email/chars.html
It looks like RFC 6531 replaces RFC 5336, and it is marked "PROPOSED STANDARD":
https://www.rfc-editor.org/rfc/rfc6531

Are email headers case sensitive?

Are email headers case sensitive?
For example, is Content-Type different from Content-type?
According to RFC 5322, I don't see anything about case sensitivity. However, I'm seeing a problem with creating MIME messages using the PEAR Mail_mime module, and everything is pointing to the fact that our SMTP server uses Content-type and MIME-version instead of Content-Type and MIME-Version. I tried using another SMTP server (like GMail), but unfortunately our web servers are firewalled pretty tightly.
RFC 5322 does actually specify this, but it is very indirect.
Section 1.2.2 says:
This specification uses the Augmented Backus-Naur Form (ABNF) [RFC5234] notation for the formal definitions of the syntax of messages.
In turn, Section 2.3 of RFC 5234 says:
NOTE: ABNF strings are case insensitive and the character set for these strings is US-ASCII.
So when RFC 5322 specifies a production rule like this:
from = "From:" mailbox-list CRLF
It is implicit that the "From:" is not case-sensitive.
[update]
As for Content-Type and MIME-Version, they are specified by the MIME spec (RFC 2045). That in turn refers to the BNF described by the original RFC 822, which (luckily) also makes it clear that these literal strings are case-insensitive.
Bottom line: According to the spec, email header names are not case-sensitive, so it sounds like your mail server is buggy.
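A standards-following parser treats the two spellings as the same header. As a quick illustration (a sketch using Python's standard `email` package rather than the PEAR module from the question), header lookup ignores case while the original spelling is preserved:

```python
from email import message_from_string

# A message whose headers use the lower-case spelling from the question.
raw = (
    "Content-type: text/plain\n"
    "MIME-version: 1.0\n"
    "\n"
    "body\n"
)
msg = message_from_string(raw)

# Lookups are case-insensitive, whatever spelling the sender used.
content_type = msg["Content-Type"]
mime_version = msg["MIME-VERSION"]

# The header names keep the spelling they arrived with.
original_names = msg.keys()
```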

Non-Latin characters in username for FTP

I tried to find the list of characters allowed in username for FTP but the RFC is not very specific. What ftp servers and clients support user names in Unicode? Special characters? Is there a generally accepted spec that explains the list of characters allowed in FTP usernames? (googling was of no help to me)
RFC 959 5.3.2:
<username> ::= <string>
<string> ::= <char> | <char><string>
<char> ::= any of the 128 ASCII characters except <CR> and <LF>
Later RFCs (like proposed standard RFC 3659) talk about UTF-8 extensions, but only in the context of pathnames and file contents encoding.
So you can only depend on ASCII, but I suspect in practice most clients and servers support UTF-8.
Try to encode using UTF-8 because most FTP servers will work with UTF-8.
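The RFC 959 grammar and the UTF-8 suggestion can be combined in a small Python sketch. The function names are hypothetical, and the UTF-8 fallback is an assumption about the server, not something RFC 959 guarantees.

```python
def is_rfc959_username(name: str) -> bool:
    """True if `name` is non-empty and uses only ASCII except CR and LF."""
    return bool(name) and all(
        ord(ch) < 128 and ch not in "\r\n" for ch in name
    )


def user_command(name: str) -> bytes:
    """Build the USER command line sent on the control connection."""
    if not name or "\r" in name or "\n" in name:
        raise ValueError("username must be non-empty without CR or LF")
    # Assumption: the server accepts UTF-8 when the name is not pure ASCII.
    # For ASCII-only names the UTF-8 bytes are identical to the ASCII bytes.
    return f"USER {name}\r\n".encode("utf-8")
```

A client could call `is_rfc959_username` first and warn the user when a name falls outside what the RFC strictly allows, then send the UTF-8 form anyway and see whether the server accepts it.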