Email IDNA encoding. Do we need to encode the whole email or each part separately?

I have an email address with accented characters that needs to be encoded using IDNA (from Python).
Something like this:
CäciliaAbitz@somedomain.net
If I call encode('idna') on the whole address, I get the following:
xn--cciliaabitz@somedomain-04b.net
The domain became somedomain-04b.net, which is not normal (right?)
Encoding each part of the address separately results in:
b'@'.join([x.encode('idna') for x in email.split('@')])
> b'xn--cciliaabitz-l8a@somedomain.net'
But I'm not sure whether this is correct or working, or whether I'm missing something.

RFC 5890 works on labels, which are the dot-separated parts of a domain name. In your example the domain part ("somedomain.net") consists of two labels, while the local part (before the @ sign), "CäciliaAbitz", is not a domain label at all. The @ is not a label separator, so if you encode the whole address at once, the local part, the @ and "somedomain" are treated as one label and encoded together. Your assumption that "somedomain-04b.net" is not normal (or valid) is therefore correct.
To encode correctly, first split the address into local and domain part at the @, then encode the domain label by label, splitting at every dot. Keep in mind that IDNA is defined for domain names only: mail servers will not decode an xn-- form of the local part, and non-ASCII local parts are instead the subject of the SMTPUTF8 extension (RFC 6531).
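As a minimal sketch of the split-then-encode approach, the helper below (the name `idna_encode_address` is made up for this example) runs Python's built-in `idna` codec on the domain only and leaves the local part untouched. It assumes the address contains an @ and that the built-in codec, which follows the older IDNA 2003 rules, is good enough for the domains involved.

```python
def idna_encode_address(address: str) -> str:
    """Return the address with only its domain part converted to ASCII (IDNA) form."""
    # Split at the last "@" so the local part is never fed to the codec.
    local, sep, domain = address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError(f"not an email address: {address!r}")
    # The codec splits the domain at dots and encodes each label on its own.
    ascii_domain = domain.encode("idna").decode("ascii")
    return f"{local}{sep}{ascii_domain}"


print(idna_encode_address("foo@foobár.com"))  # foo@xn--foobr-0qa.com
```

An address whose domain is already ASCII, such as the one in the question, comes back unchanged; only a non-ASCII local part would still need separate handling.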

Related

Is it possible to add another Unicode character for "at sign" without changing any code in the back-end of all the email providers?

So let's say that for some reason we wanted to add another Unicode character for the at sign and use it instead of @ at all the email providers.
Now I have three questions:
How do email providers parse an email address? Do they scan it until they see an @, with the @ symbol's code point hard-coded in the parser?
Do different service providers have different email parsers with different standards, or is there a standard parser library that every email provider uses?
Will it be possible to add another at sign symbol and use it in emails without having to make changes in all the email providers' code?

Yes, e-mail addresses are parsed using a hard-wired @ character. After almost fifty years of e-mail, there are literally millions of e-mail handling programs, and they all use this same syntax. So you're not going to be able to change this convention, and your second and third questions are moot.

E-mail addresses are parsed by dozens of different kinds of software, not just "email server" software inside "e-mail providers". Even things as trivial as client-side JavaScript highlighting for an e-mail field, of which there are easily tens of thousands around, would have to adapt.
An "@" is not a character class by itself. Even if there were a dedicated Unicode character class for a "Unicode separator", who would ever have written code that checks the character class of the separator? Have you ever done that, even for filtering punctuation out? (That is a real use case for the Unicode classification of characters, and even then it sees little use in real-world code.)
Now, of course, you are free to write email client code that presents the "@" as anything else when rendering e-mail data to users. But if this software did not use "@" internally, it would not work with anything else in the world, from antivirus software to text-based templates.
And finally, such a change would hardly have to do with Unicode itself. Unicode can standardize characters, but the e-mail protocols are a separate thing: the series of documents known as RFCs is what specifies the various internet protocols, including IMAP, POP and SMTP, the three protocols that enable e-mail to work. Even if new RFCs for all of these were published with a new character accepted in place of "@", it would likely take more than a decade until all the software around, as detailed above, was compliant enough for it to be used. (And yes, all of it would have to be changed.)

The precise format of Content-Id header

I'm really confused when it comes to the format of Content-Id headers in message parts.
It seems to me that only RFC 2045 covers the format of the header, however briefly:
In constructing a high-level user agent, it may be desirable to allow
one body to make reference to another. Accordingly, bodies may be
labelled using the "Content-ID" header field, which is syntactically
identical to the "Message-ID" header field:
id := "Content-ID" ":" msg-id
Like the Message-ID values, Content-ID values must be generated to be
world-unique.
RFC 2822 explains the format of a msg-id token like so:
The message identifier (msg-id) is similar in syntax to an angle-addr
construct without the internal CFWS.
message-id = "Message-ID:" msg-id CRLF
in-reply-to = "In-Reply-To:" 1*msg-id CRLF
references = "References:" 1*msg-id CRLF
msg-id = [CFWS] "<" id-left "@" id-right ">" [CFWS]
id-left = dot-atom-text / no-fold-quote / obs-id-left
id-right = dot-atom-text / no-fold-literal / obs-id-right
no-fold-quote = DQUOTE *(qtext / quoted-pair) DQUOTE
no-fold-literal = "[" *(dtext / quoted-pair) "]"
Long story short: it includes the at ('@') symbol, just like the Message-Id header of a message. However, almost all reader-friendly articles on the MIME format give examples of Content-Id without the at symbol (including not-really-global identifiers like myimagecid or inlineimage001, as well as randomly generated UUIDs). They would surely stress the importance of the '@' symbol if it were necessary, just like they do with the Message-Id header, right? Right?
I've run some tests on real-world email clients to see how they compose emails with embedded inline images:
Thunderbird generates identifiers with the at symbol. Example: part1.12345678.12345678@domain.example.com
Gmail generates identifiers without such symbol and with no domain part. Example: ii_abc1234x0_12345ab12abcdefa
I didn't test any more email clients (if someone did, it'd be great to complete the list above), but these two already show the striking difference. Google not obeying RFC standards? It sure looks smelly and I want to know whether that's because I missed something, or because the format isn't really that important after all (which in the long run feels rather disturbing). I'm also interested in checking how many popular email clients actually discard the 'at' symbol.

Go by what the spec says, not by what some mail clients do.
So yes, a Content-Id header should have a value that conforms to the specification and should therefore contain an '@' symbol.
The world of email is a broken hell hole of many different mail clients and servers doing their own thing and not respecting the standards.
As someone who has written mail software for the past 17 years, I can assure you, this is not the only place that Google deviates from the specs.
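For anyone generating these headers from Python, here is a small sketch of producing a spec-shaped Content-ID. It relies on the standard library's `email.utils.make_msgid`, which returns a unique string in the `<left@right>` msg-id syntax; the wrapper name `new_content_id` and the `example.com` domain are placeholders chosen for this example.

```python
from email.utils import make_msgid


def new_content_id(domain: str = "example.com") -> str:
    """Return a unique Content-ID value, angle brackets included."""
    # make_msgid builds "<unique-part@domain>", the same syntax as msg-id.
    return make_msgid(domain=domain)


cid = new_content_id()
# The header carries the value with its angle brackets...
header_line = f"Content-ID: {cid}"
# ...while a cid: URL in the HTML body refers to it without them.
html_reference = f'<img src="cid:{cid[1:-1]}">'
print(header_line)
print(html_reference)
```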

How to correctly encode commas in email display name?

I have a similar problem to this question, but could not find any useful information in the answers.
I'm trying to send an email to a recipient with a display name Lastname, firstname using the Quoted-Printable encoding. The exact header, as seen in the source of the received email, is:
To: =?UTF-8?Q?"Lastname,=20firstname"?= <email@example.com>
However, Outlook displays it like this:
Effectively interpreting the comma as a separator between recipients, even though it's enclosed in a Quoted-Printable encoding.
When there is no comma, the header is properly interpreted.
Am I doing something wrong, or is it impossible to use commas in a display-name?
Note: I'm currently using Amazon SES and the ZF2 Zend\Mail component, but the tools should not matter, I'm only interested in the correct header format and will adjust my tools or code accordingly.

What you are seeing is not correct behavior as far as I can tell, but the workaround should be obvious: QP-encode the comma. The double quotes are redundant and should be omitted:
From: =?UTF-8?q?Lastname=2C_Firstname?= <email@example.com>
(As such, it is obviously insane to put the last name first; but e.g. Outlook connected to Active Directory seems to insist on this silly anti-convention.)

Is it possible to send email to an address that contains latin unicode characters with cfmail?

We need to be able to send an email with cfmail to an email address that contains a Latin a with acute. I assume we'll eventually have to allow other Unicode characters too - a sample email address is foobár@example.com. ColdFusion throws an error on this email address, which is technically valid. Since the acute a is a UTF-8 character, and the default encoding for cfmail is UTF-8, I'm not sure what other settings I would need to enable to make this work. Is this possible?
The error I get is Attribute validation error for tag CFMAIL.
Detail: The value of the attribute to, which is currently foobár@example.com, is invalid.

I'm neither an I18N nor email expert but my understanding FWIW is that current systems don't generally support Unicode in the local part of the email address, i.e. the mailbox name before the @. Local mail servers may support it and allow a name such as foobár internally, but if that person wants to receive mail from the outside world they will also need an ASCII alias such as foobar.
There is however a mechanism for supporting Unicode in the domain portion of the address, which involves conversion to an ASCII representation called punycode. This means an address such as foo@foobár.com will be converted to foo@xn--foobr-0qa.com which current DNS and mail systems will accept.
It's possible to do this conversion in ColdFusion by using existing Java libraries. For more detail see this question.
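To make the local-part versus domain distinction concrete, here is a rough Python sketch rather than ColdFusion code. The function name `describe_address` is invented for this example; it reports whether the local part is plain ASCII and what the domain looks like after punycode conversion with Python's built-in `idna` codec.

```python
def describe_address(address: str) -> dict:
    """Report which half of an address needs special treatment."""
    local, _, domain = address.rpartition("@")
    return {
        # A non-ASCII local part cannot be fixed by punycode.
        "local_is_ascii": local.isascii(),
        # A non-ASCII domain can be converted to its ASCII form.
        "domain_ascii": domain.encode("idna").decode("ascii"),
    }


print(describe_address("foobár@example.com"))
print(describe_address("foo@foobár.com"))
```

The first address is the problematic kind from the question (the non-ASCII character is in the mailbox name), while the second only needs its domain converted.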

Are email addresses allowed to contain non-alphanumeric characters?

I'm building a website using Django. The website could have a significant number of users from non-English speaking countries.
I just want to know if there are any technical restrictions on what types of characters an email address could contain.
Are email addresses only allowed to contain English letters, numbers, _, @ and .?
Are they allowed to contain non-English alphabets like é or ü?
Are they allowed to contain Chinese or Japanese or other Unicode characters?

An email address consists of two parts: the local part before the @ and the domain part after it.
The rules for these parts are different.
For the local part you can use these ASCII characters:
Latin letters A - Z a - z
digits 0 - 9
special characters !#$%&'*+-/=?^_`{|}~
dot ., provided that it is not the first or last character and does not appear twice in a row
space and "(),:;<>@[] characters are allowed with restrictions (they are only allowed inside a quoted string, and a backslash or double-quote must be preceded by a backslash)
In addition, since 2012 you can use international characters above U+007F, encoded as UTF-8.
Domain part is more restricted:
Latin letters A - Z a - z
digits 0 - 9
hyphen -, that is not first or last, multiple hyphens in sequence are allowed.
Regex to validate
^(([^<>()\[\]\.,;:\s@\"]+(\.[^<>()\[\]\.,;:\s@\"]+)*)|(\".+\"))@(([^<>()[\]\.,;:\s@\"]+\.)+[^<>()[\]\.,;:\s@\"]{2,})
Hope this saves you some time.

Well, yes. Read (at least) this article from Wikipedia.
I live in Argentina, and addresses like ñoñó1234@server.com are allowed here.

The allowed syntax in an email address is described in [RFC 3696][1], and is pretty involved.
The exact rule [for local part; the part before the '@'] is that any ASCII character, including control
characters, may appear quoted, or in a quoted string. When quoting
is needed, the backslash character is used to quote the following
character
[...]
Without quotes, local-parts may consist of any combination of
alphabetic characters, digits, or any of the special characters
! # $ % & ' * + - / = ? ^ _ ` . { | } ~
[...]
Any characters, or combination of bits (as octets), are permitted in
DNS names. However, there is a preferred form that is required by
most applications...
...and so on, in some depth.
[1]: https://www.rfc-editor.org/rfc/rfc3696

Instead of worrying about what email addresses can and can't contain, which you really don't care about, test whether your setup can send them email or not—this is what you really care about! This means actually sending a verification email.
Otherwise, you can't catch a much more common case of accidental typos that stay within any character set you devise. (Quick: is random@mydomain.com a valid address for me to use at your site, or not?) It also avoids unnecessarily and gratuitously alienating any users when you tell them their perfectly valid and correct address is wrong. You still may not be able to process some addresses (this is necessary alienation), as the other answers say: email address processing isn't trivial; but that's something they need to find out if they want to provide you with an email address!
All you should check is that the user supplies some text before an @, some text after it, and the address isn't outrageously long (say 1000 characters). If you want to provide a warning ("this looks like trouble! is there a typo? double-check before continuing"), that's fine, but it shouldn't block the add-email-address process.
Of course, if you don't care to ever send email to them, then just take whatever they enter. For example, the address might solely be used for Gravatar, but Gravatar verifies all email addresses anyway.
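The minimal check described in this answer (some text before an @, some text after it, and a sane overall length) could be sketched like this. The function name `looks_like_email` is made up, and the 1000-character default is simply the figure suggested above.

```python
def looks_like_email(address: str, max_length: int = 1000) -> bool:
    """Cheap sanity check; real validation is the verification email."""
    if len(address) > max_length:
        return False
    # Split at the first "@": both sides must be non-empty.
    local, sep, domain = address.partition("@")
    return bool(sep and local and domain)


print(looks_like_email("random@mydomain.com"))  # True
print(looks_like_email("no-at-sign"))           # False
```

Anything that passes still has to be confirmed by actually sending mail to it.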

There is a possibility to have non-ASCII email addresses, as shown by this RFC: https://www.rfc-editor.org/rfc/rfc3490 but I think this has not been set for all countries, and from what I understand only one language code will be allowed for each country, and there is also a way to turn it into ASCII, but that won't be a trivial issue.

I have encountered email addresses with single quotes, and not infrequently either. We reject whitespace (though strictly speaking it is allowed), more than one '@' sign and address strings shorter than five characters in total. I believe this solves more problems than it creates, and so far over ten years and several hundred thousand addresses it's worked to reject many garbage addresses. Also there is a trigger to downcase all email addresses on insert or update.
That being said it is impossible to validate an email without a round trip to the owner, but at least we can reject data that is extremely suspect.
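Those rejection rules are easy to express in code. The sketch below is only an illustration of them: the function name `normalize_or_reject` is invented, and the lowercasing that the answer does in a database trigger is done inline here.

```python
from typing import Optional


def normalize_or_reject(address: str) -> Optional[str]:
    """Return the lowercased address, or None if it looks like garbage."""
    if any(ch.isspace() for ch in address):
        return None  # whitespace anywhere in the string
    if address.count("@") > 1:
        return None  # more than one at sign
    if len(address) < 5:
        return None  # shorter than five characters in total
    return address.lower()


print(normalize_or_reject("O'Brien@Example.COM"))  # o'brien@example.com
print(normalize_or_reject("a b@example.com"))      # None
```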

I took a look at the regex in pooh17's answer and noticed it allows the local part to be greater than 64 characters if separated by periods (it just checked the bit before the first period is less than 64 characters). You can make use of positive lookahead to improve this, here's my suggestion if you're really wanting a regex for this
^(((?=.{1,64}@)[^<>()[\].,;:\s@"]+(\.[^<>()[\].,;:\s@"]+)*)|((?=.{1,66}@)".+"))@(?=.{1,255}$)(\[(IPv6:)?[\dA-Fa-f:.]+]|(?!.*?\.\.)(([^\s!"#$%&'()*+,./:;<=>?@[\]^_`{|}~]+\.?)+[^\s!"#$%&'()*+,./:;<=>?@[\]^_`{|}~]{2,}))$
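To show the lookahead idea in isolation, here is a deliberately simplified Python pattern, not the full regex above. It keeps only the `(?=.{1,64}@)` lookahead, which requires an @ within the first 65 characters and so caps the local part at 64 characters no matter how many periods it contains. The name `LOCAL_MAX_64` and the sample addresses are made up for this sketch.

```python
import re

# Lookahead: between 1 and 64 characters must be followed by an "@".
LOCAL_MAX_64 = re.compile(r"^(?=.{1,64}@)[^@\s]+@[^@\s]+$")

ok_local = "a.b." * 16           # 64 characters, split by periods
too_long = "a.b." * 17 + "x"     # 69 characters, also split by periods

print(bool(LOCAL_MAX_64.match(ok_local + "@example.com")))   # True
print(bool(LOCAL_MAX_64.match(too_long + "@example.com")))   # False
```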

Building on @Matas Vaitkevicius' answer: I've fixed up the regex some more in Python, to have it match valid email addresses as defined on this page and this page of Wikipedia, using that awesome regex101 website: https://regex101.com/r/uP2oL7/26
^(([^<>()\[\]\.,;:\s@\"]{1,64}(\.[^<>()\[\]\.,;:\s@\"]+)*)|(\".+\"))@\[*(?!.*?\.\.)(([^<>()[\]\.,;\s@\"]+\.?)+[^<>()[\]\.,;\s@\"]{2,})\]?
Hope this helps someone!:)
