What escaping or purification is needed for an email subject?

Sites like Facebook put the name of the user who sent you a message in the subject line.
Because of this, what escaping would you do on user-entered values in a message subject? Or would you simply disallow anything other than a-z, 0-9, period, comma and single quotes?

You need to be careful with email headers: 8-bit characters are a no-no, because some mail servers will reject them.
The proper way to do it is to MIME-encode your subject lines and make sure the line-break characters \r and \n are not in the subject line (technically, multi-line subjects are possible through header folding, but I'd imagine plenty of mail clients would have problems with them).
See http://en.wikipedia.org/wiki/MIME#Encoded-Word for more info.
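As a rough illustration of the advice above, here is a minimal Python sketch (Python is my choice; the question does not name a language). The helper name safe_subject is hypothetical. It replaces line breaks in user-supplied text with spaces and then MIME-encodes the result as an RFC 2047 encoded-word using the standard library's email.header.Header.

```python
from email.header import Header, decode_header, make_header


def safe_subject(user_text: str) -> str:
    """Return a header-safe Subject value (hypothetical helper)."""
    # Replace every line break (CR, LF, CRLF, ...) with a single space so
    # user input cannot start a new header line.
    single_line = " ".join(user_text.splitlines())
    # Encode as an RFC 2047 encoded-word; the result is pure ASCII.
    return Header(single_line, "utf-8").encode()


if __name__ == "__main__":
    encoded = safe_subject("Zoë\r\nBcc: someone@example.org")
    print(encoded)
    # Decoding gives back the text with the line break replaced by a space.
    print(str(make_header(decode_header(encoded))))
```

Note that Header.encode() may fold a long value onto continuation lines (a line break followed by a space). That is ordinary header folding, not injection, but it means the output of this sketch is only guaranteed to be a single line for short subjects.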

Escaping is needed only if there are forbidden characters. The subject is terminated by a line break (CRLF), so CR and LF are the only ASCII characters that must not be put in the header value.
See also RFC 822 (the message format standard, since replaced by RFC 5322).

It’s the same problem as with contact forms.
If you look at an email header, you see something like this:
Subject: user123 has sent you an invite
From: "User123" <user123@example.org>
You have to make sure that user names cannot be interpreted as email header fields. If a user can name themselves “To: spamreceiver1@example.org, spamreceiver2@example.org, spamreceiver3@example.org, spamreceiver4@example.org”, you have to clean the input.
A search for “contact form spam” should show you what to do. You should at least remove all occurrences of "To:", "Subject:", "From:" etc.
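An alternative to removing strings such as "To:" (my own assumption, not something this answer proposes) is to reject any user-supplied value that contains a line break, since a new header can only begin on a new line. The Python sketch below shows the idea; build_invite and the addresses are hypothetical placeholders. As far as I know, EmailMessage with its default policy also refuses header values that contain line breaks, so the explicit check mostly serves to make the intent visible.

```python
from email.message import EmailMessage


def build_invite(display_name: str) -> EmailMessage:
    """Build a notification whose subject contains a user-chosen name.

    build_invite and the addresses below are hypothetical placeholders.
    """
    if "\r" in display_name or "\n" in display_name:
        # A header can only be injected by starting a new line.
        raise ValueError("line breaks are not allowed in a display name")
    msg = EmailMessage()
    msg["From"] = "noreply@example.org"
    msg["To"] = "recipient@example.org"
    # A name that merely looks like a header ("To: ...") stays inside the
    # Subject value as long as it contains no line break.
    msg["Subject"] = f"{display_name} has sent you an invite"
    msg.set_content("You have a new invite.")
    return msg
```

With this approach a name like "To: spam@example.org" is harmless: it is just odd-looking text inside the Subject value.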

Related

How to escape a full email address for SMTP in the headers when the email address contains non-ascii chars

It's about sending emails with non-ASCII characters in the email address.
When I send the RCPT TO command to the SMTP server, I know that I need to use punycode there.
But what about the To: and From: headers? Again, I know that if the user-friendly part (the display name) contains a non-ASCII character, I can use the standard header encoding that I also use for the subject. But this encoding is only used for the user-friendly part.
But what if the email address itself contains a non-ASCII character? How must the To header be formatted?
So how do I encode "Tüst"?
This is the encoding as far as I know:
"=?iso-8859-1?Q?T=FCst?="<tüst@domain.de>
But what about the email address?
In fact, I don't understand the RFCs. I tried hard but failed.

The answer is: UTF-8 is the correct way to encode the header.
After some more research I found the answer hidden inside this article:
https://en.wikipedia.org/wiki/International_email
Although the traditional format for email header section allows
non-ASCII characters to be included in the value portion of some of
the header fields using MIME-encoded words (e.g. in display names or
in a Subject header field), MIME-encoding must not be used to encode
other information in a header, such as an email address, or header
fields like Message-ID or Received. Moreover, the MIME-encoding
requires extra processing of the header to convert the data to and
from its MIME-encoded word representation, and harms readability of a
header section.
The 2012 standards RFC 6532 and RFC 6531 allow the inclusion of
Unicode characters in a header content using UTF-8 encoding, and their
transmission via SMTP - but in practice support is only slowly rolling
out.
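To make the difference between the two header styles concrete, here is a small Python sketch using the standard library's email package. It serializes the same Subject twice: once with policy.SMTPUTF8, which keeps non-ASCII header text as raw UTF-8 in the spirit of RFC 6532, and once with policy.SMTP, which falls back to MIME encoded-words. The function name render_subject is hypothetical, and the sketch only builds the bytes; actually delivering such a message still depends on the server advertising the SMTPUTF8 extension.

```python
from email import policy
from email.message import EmailMessage


def render_subject(subject: str, utf8: bool) -> bytes:
    """Serialize a one-header message with or without raw UTF-8 headers."""
    msg = EmailMessage()
    msg["Subject"] = subject
    # policy.SMTPUTF8 keeps non-ASCII header text as raw UTF-8 (RFC 6532);
    # policy.SMTP falls back to RFC 2047 encoded-words.
    chosen = policy.SMTPUTF8 if utf8 else policy.SMTP
    return msg.as_bytes(policy=chosen)


if __name__ == "__main__":
    print(render_subject("Tüst", utf8=True))
    print(render_subject("Tüst", utf8=False))
```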

How to correctly encode commas in email display name?

I have a similar problem to this question, but could not find any useful information in the answers.
I'm trying to send an email to a recipient with a display name Lastname, firstname using the Quoted-Printable encoding. The exact header, as seen in the source of the received email, is:
To: =?UTF-8?Q?"Lastname,=20firstname"?= <email@example.com>
However, Outlook displays this as more than one recipient, effectively interpreting the comma as a separator between recipients, even though it is inside a Quoted-Printable encoded word.
When there is no comma, the header is properly interpreted.
Am I doing something wrong, or is it impossible to use commas in a display-name?
Note: I'm currently using Amazon SES and the ZF2 Zend\Mail component, but the tools should not matter, I'm only interested in the correct header format and will adjust my tools or code accordingly.

What you are seeing is not correct behavior as far as I can tell, but the workaround should be obvious: QP-encode the comma. The double quotes are redundant and should be omitted:
From: =?UTF-8?q?Lastname=2C_Firstname?= <email@example.com>
(As such, it is obviously insane to put the last name first; but e.g. Outlook connected to Active Directory seems to insist on this silly anti-convention.)
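For comparison, here is how Python's standard library formats such a display name; this is only a sketch of one tool's behavior, not a statement about what Outlook or Zend\Mail do. The wrapper name format_recipient is hypothetical. For an ASCII-only name, email.utils.formataddr puts the name in a quoted string, which protects the comma without any encoded-word. For a non-ASCII name it emits an encoded-word in which the comma is encoded as well.

```python
from email.utils import formataddr


def format_recipient(display_name: str, address: str) -> str:
    """Format 'display name <address>' for a To or From header."""
    # formataddr quotes ASCII names containing specials such as commas and
    # switches to an RFC 2047 encoded-word when the name is not ASCII.
    return formataddr((display_name, address), charset="utf-8")


if __name__ == "__main__":
    print(format_recipient("Lastname, Firstname", "email@example.com"))
    print(format_recipient("Müller, Jörg", "email@example.com"))
```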

Mandrill "reject_reason": "invalid-sender"

I'm trying to send emails using mandrill email service but I get the following error
Full Response
[
{
"email": "someemail@somedomain.com",
"status": "rejected",
"_id": "b814c2974594466cba9c904c54dca6c6",
"reject_reason": "invalid-sender"
}
]
Apart from the above error, there are no more details about it. We are using .NET to send emails with Mandrill SMTP settings.

It'd be useful to see the call/email that's being sent. That error means that there's an invalid sender, as indicated in the reject reason field. That could be because of an invalid email address, invalidly-encoded from name, or invalid or broken encoding in other headers making it so that Mandrill can't parse the "from" header, but without seeing the actual email that you're sending, it's hard to say for sure exactly what the issue is.
You probably want to check that there's a fully-qualified domain name in the from email address, and that if the subject line is encoded, there aren't things like newline (\n) characters that break multibyte characters in the subject line. If you aren't able to identify the issue in the raw SMTP message, feel free to get in touch with support for further troubleshooting assistance.

I had the same problem, in my case, I had forgotten to complete the template defaults "From Name" and "Subject".

I had the same problem. In my case encoding in headers was the problem. I did change the headers encoding to UTF-8 and it worked. I was using C# SMTP and the code is below.
message.HeadersEncoding = Encoding.UTF8;
Hope it works!

For me, it was because my emails were coming from email@example.net1
Mandrill rejected me because of the 1 at the end. e+mail@example.net and email@example.neta are both valid and will be accepted.
My other tests just had blank From headers, so they were rejected as well. I didn't even realize these emails were being received by Mandrill until I logged in and checked the API logs.

I've had a similar problem recently. It was due to my use of certain characters in the message.from_name field. After searching through documentation and stack overflow, I couldn't find a list of forbidden characters, so although this doesn't necessarily pertain to your case, I thought I'd share this small list I compiled of some acceptable characters (not an exhaustive list):
a-z
A-Z
0-9
_, -, !, #, $, %, \, ^, &, *, +, =, {, }, ?, .
In JS, here's a RegExp that will match with forbidden characters (or, rather, any characters that aren't in the aforementioned list):
const pattern = /[^a-zA-Z0-9_\-!#$%\^&*+={}?.]/;
Hope this is helpful for anyone else stuck on this.

If you use the .NET SmtpClient, this may be caused by a bug in it: https://social.msdn.microsoft.com/Forums/vstudio/en-US/4d1c1752-70ba-420a-9510-8fb4aa6da046/subject-encoding-on-smtpclientmailmessage
A workaround that helped us:
use
message.SubjectEncoding = Encoding.Unicode;
instead of
message.SubjectEncoding = Encoding.UTF8;
This still applies in .NET Framework 4.7.2.

Are email addresses allowed to contain non-alphanumeric characters?

I'm building a website using Django. The website could have a significant number of users from non-English speaking countries.
I just want to know if there are any technical restrictions on what types of characters an email address could contain.
Are email addresses only allowed to contain English letters, numbers, _, @ and .?
Are they allowed to contain non-English alphabets like é or ü?
Are they allowed to contain Chinese or Japanese or other Unicode characters?

An email address consists of two parts: the local part before the @ and the domain part after it.
The rules for these parts are different.
For the local part you can use these ASCII characters:
Latin letters A - Z a - z
digits 0 - 9
special characters !#$%&'*+-/=?^_`{|}~
dot ., provided that it is not the first or last character and does not appear twice in a row
space and "(),:;<>@[] characters are allowed with restrictions (they are only allowed inside a quoted string, and a backslash or double quote must be preceded by a backslash)
Plus, since 2012 you can use international characters above U+007F, encoded as UTF-8.
The domain part is more restricted:
Latin letters A - Z a - z
digits 0 - 9
hyphen -, provided that it is not the first or last character; multiple hyphens in sequence are allowed.
Regex to validate:
^(([^<>()\[\]\.,;:\s@\"]+(\.[^<>()\[\]\.,;:\s@\"]+)*)|(\".+\"))@(([^<>()[\]\.,;:\s@\"]+\.)+[^<>()[\]\.,;:\s@\"]{2,})
Hope this saves you some time.

Well, yes. Read (at least) this article from Wikipedia.
I live in Argentina, and addresses like ñoñó1234@server.com are allowed here.

The allowed syntax in an email address is described in [RFC 3696][1], and is pretty involved.
The exact rule [for local part; the part before the '@'] is that any ASCII character, including control
characters, may appear quoted, or in a quoted string. When quoting
is needed, the backslash character is used to quote the following
character
[...]
Without quotes, local-parts may consist of any combination of
alphabetic characters, digits, or any of the special characters
! # $ % & ' * + - / = ? ^ _ ` . { | } ~
[...]
Any characters, or combination of bits (as octets), are permitted in
DNS names. However, there is a preferred form that is required by
most applications...
...and so on, in some depth.
[1]: https://www.rfc-editor.org/rfc/rfc3696

Instead of worrying about what email addresses can and can't contain, which you really don't care about, test whether your setup can send them email or not—this is what you really care about! This means actually sending a verification email.
Otherwise, you can't catch a much more common case of accidental typos that stay within any character set you devise. (Quick: is random@mydomain.com a valid address for me to use at your site, or not?) It also avoids unnecessarily and gratuitously alienating any users when you tell them their perfectly valid and correct address is wrong. You still may not be able to process some addresses (this is necessary alienation), as the other answers say: email address processing isn't trivial; but that's something they need to find out if they want to provide you with an email address!
All you should check is that the user supplies some text before an @, some text after it, and the address isn't outrageously long (say 1000 characters). If you want to provide a warning ("this looks like trouble! is there a typo? double-check before continuing"), that's fine, but it shouldn't block the add-email-address process.
Of course, if you don't care to ever send email to them, then just take whatever they enter. For example, the address might solely be used for Gravatar, but Gravatar verifies all email addresses anyway.
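The minimal check described in this answer can be sketched in a few lines of Python. The function name looks_like_email is hypothetical, and the 1000-character limit is simply the figure suggested above. The sketch is deliberately permissive: it only confirms that there is some text on both sides of an @ and leaves real validation to the verification email.

```python
def looks_like_email(address: str, max_length: int = 1000) -> bool:
    """Very loose sanity check; real validation is the verification email."""
    if len(address) > max_length:
        return False
    # Split on the last @ so quoted local parts containing @ still pass.
    local, sep, domain = address.rpartition("@")
    return bool(sep and local and domain)


if __name__ == "__main__":
    for candidate in ("random@mydomain.com", "no-at-sign", "@missing-local"):
        print(candidate, looks_like_email(candidate))
```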

There is a possibility to have non-ASCII email addresses, as shown by this RFC: https://www.rfc-editor.org/rfc/rfc3490 but I think this has not been set for all countries, and from what I understand only one language code will be allowed for each country, and there is also a way to turn it into ASCII, but that won't be a trivial issue.

I have encountered email addresses with single quotes, and not infrequently either. We reject whitespace (though strictly speaking it is allowed), more than one '@' sign and address strings shorter than five characters in total. I believe this solves more problems than it creates, and so far over ten years and several hundred thousand addresses it's worked to reject many garbage addresses. Also there is a trigger to downcase all email addresses on insert or update.
That being said it is impossible to validate an email without a round trip to the owner, but at least we can reject data that is extremely suspect.

I took a look at the regex in pooh17's answer and noticed it allows the local part to be greater than 64 characters if separated by periods (it just checked the bit before the first period is less than 64 characters). You can make use of positive lookahead to improve this, here's my suggestion if you're really wanting a regex for this
^(((?=.{1,64}@)[^<>()[\].,;:\s@"]+(\.[^<>()[\].,;:\s@"]+)*)|((?=.{1,66}@)".+"))@(?=.{1,255}$)(\[(IPv6:)?[\dA-Fa-f:.]+]|(?!.*?\.\.)(([^\s!"#$%&'()*+,./:;<=>?@[\]^_`{|}~]+\.?)+[^\s!"#$%&'()*+,./:;<=>?@[\]^_`{|}~]{2,}))$

Building on @Matas Vaitkevicius' answer: I've fixed up the regex some more in Python, to have it match valid email addresses as defined on this page and this page of Wikipedia, using that awesome regex101 website: https://regex101.com/r/uP2oL7/26
^(([^<>()\[\]\.,;:\s@\"]{1,64}(\.[^<>()\[\]\.,;:\s@\"]+)*)|(\".+\"))@\[*(?!.*?\.\.)(([^<>()[\]\.,;\s@\"]+\.?)+[^<>()[\]\.,;\s@\"]{2,})\]?
Hope this helps someone!:)

Can an email address contain international (non-english) characters?

If it's possible, should I accept such emails from users and what problems to expect when I will be sending mails to such addresses?

Officially, per RFC 6532 - Yes.
For a quick explanation, check out wikipedia on the subject.

Update 2015: Use RFC 6532
The experimental RFC 5335 has been obsoleted by RFC 6532, and
the latter has been set to "Category: Standards Track",
making it the standard.
The Section 3.2 (Syntax Extensions to RFC 5322) has updated most text fields to
include (proper) UTF-8.
The following rules extend the ABNF syntax defined in [RFC5322] and
[RFC5234] in order to allow UTF-8 content.
VCHAR =/ UTF8-non-ascii
ctext =/ UTF8-non-ascii
atext =/ UTF8-non-ascii
qtext =/ UTF8-non-ascii
text =/ UTF8-non-ascii
; note that this upgrades the body to UTF-8
dtext =/ UTF8-non-ascii
The preceding changes mean that the following constructs now
allow UTF-8:
1. Unstructured text, used in header fields like
"Subject:" or "Content-description:".
2. Any construct that uses atoms, including but not limited
to the local parts of addresses and Message-IDs. This
includes addresses in the "for" clauses of "Received:"
header fields.
3. Quoted strings.
4. Domains.
Note that header field names are not on this list; these are still
restricted to ASCII.
Please note the explicit inclusion of Domains.
And the explicit exclusion of header names.
Also Note about NFKC:
The UTF-8 NFKC normalization form SHOULD NOT be used because
it may lose information that is needed to correctly spell
some names in some unusual circumstances.
And Section 3 start:
Also note that messages in this format require the use of the
SMTPUTF8 extension [RFC6531] to be transferred via SMTP.

The problem is that some mail clients (server-tools and / or desktop tools) don't support it and throw an 'invalid email' exception when you try to send a mail to an address which contains umlauts for example.
If you want full support, you could do the trick of converting the domain part of the email address to "punycode". This allows users to type in their addresses the usual way, while you store them in the widely supported ASCII form.
Example: müller.com » xn--mller-kva.com
Both point to the same thing.
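A minimal Python sketch of that conversion follows. The function name to_ascii_address is hypothetical. It relies on Python's built-in "idna" codec, which implements the older IDNA 2003 rules, and it converts only the domain part, because punycode is defined for domain names and not for the local part of an address.

```python
def to_ascii_address(address: str) -> str:
    """Convert only the domain part to its ASCII (punycode) form."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        raise ValueError("address has no @")
    # Python's built-in "idna" codec implements the older IDNA 2003 rules.
    ascii_domain = domain.encode("idna").decode("ascii")
    return f"{local}@{ascii_domain}"


if __name__ == "__main__":
    print(to_ascii_address("info@müller.com"))
```

An address whose local part contains non-ASCII characters is returned with that part unchanged, so it would still need SMTPUTF8 support along the delivery path.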

I would assume yes since a number of top level domains already allow non ascii
characters for domains and since the domain is part of an email address, it's
perfectly possible. An example for such a domain would be www.öko.de

short answer: yes
They are allowed not only in the username but also in the domain name.

The answer is yes, but they need to be encoded specially.
Look at this. Read the part that refers to email-headers and RFC 2047.

Not yet. The IETF plans to do this:
H-Online article: IETF planning internationalised email addresses, here is the RFC: SMTP Extension for Internationalized Email Addresses
Quote from H-Online (as it went down):
The Internet Engineering Task Force (IETF) has published three crucial documents for the standardisation of email address headers
that include symbols outside the ASCII character set. This means that
soon you'll be able to use Chinese characters, French accents, and
German umlauts in email addresses as well as just in the body of the
message. So if your name is Zoë and you work for a company that makes
façades, you might be interested in a new email address. But
representatives of providers are already moaning. They say there would
need to be an "upgrade mania" if the Unicode standard UTF-8 is to
replace the American Standard Code for Information Interchange (ASCII)
currently used as the general email language.
RFC 5335 specifies the use of UTF-8 in practically all email headers.
Changes would have to be made to SMTP clients, SMTP servers, mail user
agents (MUAs), software for mailing lists, gateways to other media,
and everywhere else where email is processed or passed along. RFC 5336
expands the SMTP email transport protocol. At the level of the
protocol, the expansion is labelled UTF8SMTP.
A new header field will be added as a sort of "emergency parachute" to
ensure that UTF-8 emails have a soft landing if they are thrown out
before reaching the recipient by systems that have not been upgraded.
The "OldAddress" is a purely ASCII address. But OldAddress is not to
be used as a channel for a second transfer attempt, but rather to make
sure that feedback is sent home.
Finally, RFC5337 ensures that correct messages are sent pertaining to
the delivery status of non-ASCII emails. The correct address of an
unreachable addressee must be sent back, even if further transport has
been refused. The email Address Internationalization (EAI) working
group is also working on a number of "downgrade mechanisms" for
various header fields and the envelope. If possible, original header
information is to be "packaged" and preserved.
Germany's DeNIC, the registrar for the ".de" domain, is nonetheless
taking this in its stride. "There is really not much we can do",
explained DeNIC spokesperson Klaus Herzig. DeNIC is instead paying
more attention to the update that the IETF is working on for the
standard of international domains – RFC3490, or IDNA2003 as it's
sometimes known. "We are not that happy about it because there is no
backwards compatibility," Herzig explained. When the update comes,
DeNIC says it will be throwing its weight behind the symbol "ß" - also
known as eszett - which has been overlooked up to now. The German
registrar also says that it may wait a bit before switching in light
of the lack of backward compatibility. Once the new standard is
running stably and registrars and providers have adopted it, the ß
will be added.
In contrast, experts believe that Chinese registrars in China and
Taiwan will quickly implement the change for internationalised email.
Representatives of CNIC and TWNIC are authors of the standards.
Chinese users currently have to write emails in ASCII to the left of
the @ and in Chinese characters to the right of it for Chinese
domains, which have already been internationalized.
(Monika Ermert)
