How to correctly encode commas in email display name? - email

I have a similar problem to this question, but could not find any useful information in the answers.
I'm trying to send an email to a recipient with a display name Lastname, firstname using the Quoted-Printable encoding. The exact header, as seen in the source of the received email, is:
To: =?UTF-8?Q?"Lastname,=20firstname"?= <email#example.com>
However, Outlook displays it like this:
Effectively interpreting the comma as a separator between recipients, even though it's enclosed in a Quoted-Printable encoding.
When there is no comma, the header is properly interpreted.
Am I doing something wrong, or is it impossible to use commas in a display-name?
Note: I'm currently using Amazon SES and the ZF2 Zend\Mail component, but the tools should not matter, I'm only interested in the correct header format and will adjust my tools or code accordingly.

What you are seeing is not correct behavior as far as I can tell, but the workaround should be obvious: QP-encode the comma. The double quotes are redundant and should be omitted:
From: =?UTF-8?q?Lastname=2C_Firstname?= <email#example.com>
(As such, it is obviously insane to put the last name first; but e.g. Outlook connected to Active Directory seems to insist on this silly anti-convention.)

Related

UTF-8 encoding in emails, parsing the body

So I don't really want this question to be language specific, however I suspect Go (my language choice) is playing a part here.
I'm trying to find a string within the body of a raw email. To do so, I am getting the encoding, and the marjority of cases are quoted-printable.
Ok so thats fine, I am encoding my search query quoted printable and then doing a search for it. That works.
However. In one specific case the raw email I see in gmail looks fine, however when I retrieve the raw email from the gmail API the although the encoding and everything is identical, its encoding the " as =22
Research shows me thats because the charset is utf-8.
I haven't quite got my head around whether thats encoded utf-8 then quoted-printable or the other way around, but thats not quite the question either....
If I look at the email where the " is =22 I see the char set is utf-8 and when I look at another where its not encoded, the charset is UTF-8 (notice the case). I can't believe that the case here is whats causing this to happen, but it doesn't seem a robust enough way to work out if =22 is actually =22 or is a " encoded utf-8.
My original thought was to always decode the quoted-printable and then re-encode it before doing the search but I don't think this is going to be a robust approach going forward and thought others might have a better suggestion?
Conclusion, I'm trying to find a string in a raw email but the encoding is causing me problems getting my search string to match the encoding of the body
The =22-type encoding actually has nothing to do with the charset (whether that is utf-8 lowercase or UTF-8 uppercase or any other charset).
It is the Content-Transfer-Encoding: quoted-printable encoding.
The quoted-printable encoding is just a way of hex-encoding octets, typically limited to octets that fall outside of the printable ascii range. It seems odd that the DQUOTE character would be encoded in this way, but it's perfectly legal to do so.
If you want to search for strings in the body of the message, you'll need to first decode the body of the message. Otherwise you will not be successful.
I would recommend reading rfc2045 at a minimum.
You may also need to end up reading rfc2047 if you end up wanting to search headers at some point, but that gets... tricky due to various bugs that sending clients have.
Now that I've been "triggered" into a rant about MIME, let me explain why decoding headers is so hard to get right. I'm sure just about every developer who has ever worked on an email client could tell you this, but I guess I'm going to be the one to do it.
Here's just a short list of the problems every developer faces when they go to implement a decoder for headers which have been (theoretically) encoded according to the rfc2047 specification:
First off, there are technically two variations of header encoding formats specified by rfc2047 - one for phrases and one for unstructured text fields. They are very similar but you can't use the same rules for tokenizing them. I mention this because it seems that most MIME parsers miss this very subtle distinction and so, as you might imagine, do most MIME generators. Hell, most MIME generators probably never even heard of specifications to begin with it seems.
This brings us to:
There are so many variations of how MIME headers fail to be tokenizable according to the rules of rfc2822 and rfc2047. You'll encounter fun stuff such as:
a. encoded-word tokens illegally being embedded in other word tokens
b. encoded-word tokens containing illegal characters in them (such as spaces, line breaks, and more) effectively making it so that a tokenizer can no longer, well, tokenize them (at least not easily)
c. multi-byte character sequences being split between multiple encoded-word tokens which means that it's not possible to decode said encoded-word tokens individually
d. the payloads of encoded-word tokens being split up into multiple encoded-word tokens, often splitting in a location which makes it impossible to decode the payload in isolation
You can see some examples here.
Something that many developers seem to miss is the fact that each encoded-word token is allowed to be in different character encodings (you might have one token in UTF-8, another in ISO-8859-1 and yet another in koi8-r). Normally, this would be no big deal because you'd just decode each payload, then convert from the specified charset into UTF-8 via iconv() or something. However, due to the fun brokenness that I mentioned above in (2c) and (2d), this becomes more complicated.
If that isn't enough to make you want to throw your hands up in the air and mutter some profanities, there's more...
Undeclared 8bit text in headers. Yep. Some mailers just didn't get the memo that they are supposed to encode non-ASCII text. So now you get to have the fun experience of mixing and matching undeclared 8bit text of God-only-knows what charset along with the content of (probably broken) encoded-words.
If you want to see how to deal with these issues, you can take a look at how I did it using C in my GMime library, here: https://github.com/jstedfast/gmime/blob/master/gmime/gmime-utils.c#L1894 (in case line offsets change in the future, look for _g_mime_utils_header_decode_text() and the various internal methods it uses in that source file - I have written comments explaining how it deals with the above issues).
Or you can see how I did it using C# in my MimeKit library, here: https://github.com/jstedfast/MimeKit/blob/master/MimeKit/Utils/Rfc2047.cs
For more infomation about why & how dealing with email is hard, check out Joshua Cramner's blog series: http://quetzalcoatal.blogspot.com/search/label/email-hard

Email IDNA encoding. Do we need to encode the whole email or each parts separately?

I have an email with accents that needs to be encoded using IDNA (from Python)
Something like this:
CäciliaAbitz#somedomain.net
If I do a encode('idna') for the whole email, I get the following:
xn--cciliaabitz#somedomain-04b.net
The domains became somedomain-04b.net, which is not normal (right?)
Doing a encoding on each part of the email results in :
b''.join([x.encode('idna') for x in email.split('#')])
> b'xn--cciliaabitz-l8a#somedomain.net'
But I'm not sure if this is correct, working or if I'm missing something.
RFC 5890 works on labels, which are mostly dot separated parts of an email address. In your example, you only have one label in the local part (before the # sign), "CäciliaAbitz", and two labels in the domain part ("somedomain.net"). If you encode without paying attention to the labels, you encode the dots, and the result is a single label where you need multiple ones. With that, your assumption, that "somedomain-04b.net" is not normal (or valid), is correct.
To correctly encode, you need to split not only between local and domain part at the #, but also at any dot within both local and domain parts.

The precise format of Content-Id header

I'm really confused when it comes to the format of Content-Id headers in message parts.
It seems to me that only RFC 2045 covers the format of the header, however briefly:
In constructing a high-level user agent, it may be desirable to allow
one body to make reference to another. Accordingly, bodies may be
labelled using the "Content-ID" header field, which is syntactically
identical to the "Message-ID" header field:
id := "Content-ID" ":" msg-id
Like the Message-ID values, Content-ID values must be generated to be
world-unique.
RFC 2822 explains the format of a msg-id token like so:
The message identifier (msg-id) is similar in syntax to an angle-addr
construct without the internal CFWS.
message-id = "Message-ID:" msg-id CRLF
in-reply-to = "In-Reply-To:" 1*msg-id CRLF
references = "References:" 1*msg-id CRLF
msg-id = [CFWS] "<" id-left "#" id-right ">" [CFWS]
id-left = dot-atom-text / no-fold-quote / obs-id-left
id-right = dot-atom-text / no-fold-literal / obs-id-right
no-fold-quote = DQUOTE *(qtext / quoted-pair) DQUOTE
no-fold-literal = "[" *(dtext / quoted-pair) "]"
Long story short: it includes the at ('#') symbol, just like the Message-Id header of a message. However, almost all reader-friendly articles on MIME format give examples of Content-Id without the at symbol (including not-really-global identifiers like myimagecid or inlineimage001 as well as randomly generated UUIDS without the at symbol). They would surely stress the importance of the '#' symbol if that would be necessary, just like they do with the Message-Id header, right? Right?
I've run some tests on real-world email clients and see how they compose emails with embedded inline images:
Thunderbird generates identifiers with the at symbol. Example: part1.12345678.12345678#domain.example.com
Gmail generates identifiers without such symbol and with no domain part. Example: ii_abc1234x0_12345ab12abcdefa
I didn't test any more email clients (if someone did, it'd be great to complete the list above), but these two already show the striking difference. Google not obeying RFC standards? It sure looks smelly and I want to know whether that's because I missed something, or because the format isn't really that important after all (which in the long run feels rather disturbing). I'm also interested in checking how many popular email clients actually discard the 'at' symbol.
Go by what the spec says, not by what some mail clients do.
So yes, a Content-Id header should have a value that conforms to the way the specification says and therefor should have an '#' symbol.
The world of email is a broken hell hole of many different mail clients and servers doing their own thing and not respecting the standards.
As someone who has written mail software for the past 17 years, I can assure you, this is not the only place that Google deviates from the specs.

How to send a csv attachment with lines longer than 990 characters?

Alright. I thought this problem had something to do with my rails app, but it seems to have to do with the deeper workings of email attachments.
I have to send out a csv file from my rails app to a warehouse that fulfills orders places in my store. The warehouse has a format for the CSV, and ironically the header line of the CSV file is super long (1000+ characters).
I was getting a line break in the header line of the csv file when I received the test emails and couldn't figure out what put it there. However, some googling has finally showed the reason: attached files have a line character limit of 1000. Why? I don't know. It seems ridiculous, but I still have to send this csv file somehow.
I tried manually setting the MIME type of the attachment to text/csv, but that was no help. Does anybody know how to solve this problem?
Some relevant google results : http://www.google.com/search?client=safari&rls=en&q=csv+wrapped+990&ie=UTF-8&oe=UTF-8
update
I've tried encoding the attachment in base64 like so:
attachments['205.csv'] = {:data=> ActiveSupport::Base64.encode64(#string), :encoding => 'base64', :mime_type => 'text/csv'}
That doesn't seem to have made a difference. I'm receiving the email with a me.com account via Sparrow for Mac. I'll try using gmail's web interface.
This seems to be because the SendGrid mail server is modifying the attachment content. If you send an attachment with a plain text storage mime type (e.g text/csv) it will wrap the content every 990 characters, as you observed. I think this is related to RFC 2045/821:
Content-Transfer-Encoding Header Field
Many media types which could be usefully transported via email are
represented, in their "natural" format, as 8bit character or binary
data. Such data cannot be transmitted over some transfer protocols.
For example, RFC 821 (SMTP) restricts mail messages to 7bit US-ASCII
data with lines no longer than 1000 characters including any trailing
CRLF line separator.
It is necessary, therefore, to define a standard mechanism for
encoding such data into a 7bit short line format. Proper labelling
of unencoded material in less restrictive formats for direct use over
less restrictive transports is also desireable. This document
specifies that such encodings will be indicated by a new "Content-
Transfer-Encoding" header field. This field has not been defined by
any previous standard.
If you send the attachment using base64 encoding instead of the default 7-bit the attachment remains unchanged (no added line breaks):
attachments['file.csv']= { :data=> ActiveSupport::Base64.encode64(#string), :encoding => 'base64' }
Could you have newlines in your data that would cause this? Check and see if
csv_for_orders(orders).lines.count == orders.count
If so, a quick/hackish fix might be changing where you call values_for_line_item(item) to values_for_line_item(item).map{|c| c.gsub(/(\r|\n)/, '')} (same for the other line_item calls).

What escaping or purification is needed for an email subject?

Sites like Facebook have the user's name in the subject line that sent you a message.
Because of this, what escaping would you do on user entered values in a message subject? Or would you just not allow anything other than a-z, 0-9, period, comma and single quotes?
You need to be careful with email headers, 8 bit chars are a bit of a no-no. (mail servers will reject them).
The proper way to do it is to MIME encode your subject lines and make sure the ASCII char \n is not in the subject line (technically multi-line subjects are possible, but I'd imagine plenty of mail clients would have problems)
See http://en.wikipedia.org/wiki/MIME#Encoded-Word for more info.
Escaping is needed if there are forbidden characters. The subject is terminated by a NL so this is the only (ASCII) character that shouldn't be put in the header.
See also rfc821
It’s the same problem with contact forms.
If you look at an email header you get e.g. this:
Subject: user123 has sent you an invite
From: "User123" <user123#example.org>
You have to make sure that user names do not resemble values of an email header. If it’s possible for a user to name himself “To: spamreceiver1#example.org, spamreceiver2#example.org, spamreceiver3#example.org, spamreceiver4#example.org” you have to clean the input.
A search for “contact form spam” should show you what to do. You should at least remove all occurrences of "To:", "Subject:", "From:" etc.