Email formatting from mail clients - email

I have been doing some research/tests on the standardized email format. Ultimately I am looking to develop an email parser for an application. I am noticing some differences in the format of the email, mainly between email clients (gmail, mac mail, etc) and email marketing services (Constant Contact, Mail Chimp, etc).
My understanding of the format (RFC2822) is that a \n\n separates the headers from the body. This appears to be consistent with emails received from email marketing services. Email clients, however, appear to have an extra set of header(s) or instructions for the message. See examples of email strings below. Note that I pulled these strings via an email pipe. Also note, these are only snippets of the header/body split.
Email Marketing Service:
Content-Type: text/html;
    charset="utf-8"
Content-Transfer-Encoding: 8bit

<html>
<head>
<title>Welcome to Banana Republic. Enjoy 25% off! </title>
<STYLE type="text/css">
.ReadMsgBody
{ width: 100%;}
.ExternalClass
{width: 100%;}
Here you will see the line break separating the headers from the body. All good according to the format. Now look at the email client.
Email client:
Mime-Version: 1.0 (Mac OS X Mail 7.0 (1816))
X-Mailer: Apple Mail (2.1816)

--Apple-Mail=_28DD752B-7960-488D-994F-DA9408FCA880
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
    charset=windows-1252

Testing Mac Mail. This is the body.
You see that in this case, there is an additional set of "headers" which appear to be instructions about how, in this case, Mac Mail has formatted the email.
I guess my question is, is this a valid format? Is there any specification on it? Is there any well known/documented ways to check for and parse this type of format without knowing which type of format is being received?

[extending points made in comments]
is this a valid format?
Yes. The overall framework for mail messages more complex than strict 7-bit ASCII text is known as MIME. It includes the specification of the "Content-Type" header in your first example that informs a client that the whole message is HTML rather than plain text. Many (possibly most) messages these days are of type "multipart/alternative" at the outermost level, encapsulating 2 (or more!) versions of the message body, most often a text/plain representation and text/html version, which is itself often inside a multipart/mixed container including embedded images.
Is there any specification on it?
Yes. The basics of MIME are described in RFCs 2045-2049, and there have been many extensions and corrections described in later RFCs and type registration documents. MIME also provides the core components for the specification of HTTP documents, so many of the extensions are almost irrelevant for email.
Is there any well known/documented ways to check for and parse this
type of format without knowing which type of format is being received?
Yes. While nearly all modern email is in MIME format, formally you can detect it by looking for the "MIME-Version" header. See RFC2045 for specifics. Note that your first example doesn't show that header but it must have existed in the full original because otherwise the headers you showed would be meaningless.
This demonstrates why you probably should reconsider the idea of writing your own mail parser. What you saw as 2 formats are not that in fact, rather they are just different applications of the MIME format framework. MIME is significantly older than RFC2822 (which, incidentally, is itself obsoleted by RFC5322) and has many mature and robust parsers available. It is easy to write a MIME parser that will work for most mail, a little harder to write one that will work for nearly all valid mail, and sanity-challenging to write one that will safely handle the real world of mail which often isn't exactly correct and in some cases is designed to break naive parsers in malicious ways. Take advantage of the torn-out hair of decades of coders who have preceded you: use an existing parser.
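As an illustration of "use an existing parser", here is a minimal sketch using Python's standard `email` package. The sample message, the `sep` boundary, and the `summarize` helper are made up for this example; they are not taken from the messages in the question.

```python
from email import message_from_string, policy

# Hypothetical sample: a multipart/alternative message with two body versions.
RAW = "\n".join([
    "MIME-Version: 1.0",
    "Subject: Testing",
    'Content-Type: multipart/alternative; boundary="sep"',
    "",
    "--sep",
    "Content-Type: text/plain; charset=utf-8",
    "",
    "Testing Mac Mail. This is the body.",
    "--sep",
    "Content-Type: text/html; charset=utf-8",
    "",
    "<p>Testing Mac Mail. This is the body.</p>",
    "--sep--",
    "",
])


def summarize(raw):
    """Return (has MIME-Version header, [(content type, text), ...])."""
    msg = message_from_string(raw, policy=policy.default)
    is_mime = msg["MIME-Version"] is not None
    # walk() visits the container and every nested part; keep only the leaves.
    parts = [
        (part.get_content_type(), part.get_content().strip())
        for part in msg.walk()
        if not part.is_multipart()
    ]
    return is_mime, parts


is_mime, parts = summarize(RAW)
```

The same call handles a single-part message like the first example in the question: `walk()` then yields just the message itself, so the calling code does not need to know which shape it received.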

Related

Position of MIME in the Networking stack

Based on what I found on the internet, MIME (Multipurpose Internet Mail Extensions, now Internet Media Type (?)) is a way to describe file types (a header used by several protocols).
So, MIME itself is not a protocol, rather an extension used by other protocols, right ?
This means that the extension is used at the application layer by the applications with no protocol doing anything other than carrying the MIME header.
So, if I send a mail with an mp3 attachment, does SMTP (or another application layer protocol) recognize that this is an mp3 attachment, or is it solely the duty of the application to recognize the file? In that sense, MIME cannot be called an extension to SMTP but rather a feature to be used by applications.
If SMTP does not recognize that this is a different kind of file, how will it properly store it at the mail server ? (e.g. a MPEG video file needs a particular format to be stored, how will mail server store it without giving it any special treatment ? )
Sorry if my questions sound a bit vague but I want to get an idea of how different protocols (especially, SMTP) use MIME.
Thanks for your help.
RFC 822 email was originally purely plain-text, 7-bit US-ASCII. MIME specifies a facility for encapsulating other media types in email containers. It does not specify any changes to SMTP (although e.g. the 8BITMIME ESMTP extension is useful for simplifying transport of MIME messages). Thus, it is an extension of an existing protocol, not a distinct protocol in its own right. This is also demonstrated by the fact that other protocols -- notably, HTTP -- have incorporated (parts of) MIME for tagging of content types and encodings.
An Internet Media Type is only one aspect of what MIME used to codify; the mechanisms for specifying character sets and encodings are still defined in MIME proper.
Traditionally, the mail server simply stores the bare RFC822 message in its message store; it is the responsibility of the mail client to parse and possibly manipulate any MIME structure in the body for display and interaction. (The fact that RFC 822 has been superseded by 2822 and then 5322 has not fundamentally changed the actual mail message format.)
Some servers deviate from this model; for example, Microsoft Exchange seems to parse all incoming messages in order to borg them into its internal format, somewhat to the detriment of its interoperability with standard tools, and the sanity of those few of us who require reliable, felicitous access to our actual email.
The SMTP protocol itself knows nothing about the MIME format. The SMTP server does have to implement at least basic RFC 822 support in order to add the Received headers, but it does not need to implement MIME.
How does the server save the file to disk? The same way it received it from the client over the TCP/IP stream. It just saves the raw bytes sent (with the addition of the Received header I mentioned).
In other words, you are way over-thinking this. The SMTP server doesn't have to know anything about mp3 file attachments or anything else because the MIME format (it's not a protocol) is just a way to serialize the mp3 data in a message.
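To make the "just a way to serialize" point concrete, here is a small sketch with Python's standard `email` package. The addresses, the `song.mp3` filename, and both helper names are illustrative, and the payload is arbitrary bytes rather than a real MP3.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_message_with_attachment(payload):
    """Serialize a text body plus a binary attachment into message bytes."""
    msg = EmailMessage()
    msg["From"] = "sender@example.com"
    msg["To"] = "recipient@example.com"
    msg["Subject"] = "Song"
    msg.set_content("See the attached file.")
    # Binary content is base64-encoded by default, so the wire form is text.
    msg.add_attachment(payload, maintype="audio", subtype="mpeg",
                       filename="song.mp3")
    return msg.as_bytes()


def extract_attachments(wire):
    """What a mail client does later: parse the MIME structure and decode."""
    msg = message_from_bytes(wire, policy=policy.default)
    return [(part.get_filename(), part.get_content())
            for part in msg.iter_attachments()]


payload = bytes(range(256))          # stand-in for real audio data
wire = build_message_with_attachment(payload)
```

Everything in `wire` is 7-bit text, which is why a server can store it byte for byte without knowing what an mp3 is; only the client that calls something like `extract_attachments` needs to understand MIME.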

Is it safe to send 8-bit emails?

I would like to know if it is safe to send emails with 8-bit characters or if it is still needed to use quoted-printable or base64 encoding.
The 8BITMIME extension is now 20 years old. Are there SMTP servers or mail clients that still are not 8-bit clean? Is there any impact on email deliverability when sending 8-bit emails?
I did not find any numbers but it looks like it is now quite safe to send emails with 8-bit body. But since the big players like Gmail still encode emails there might be some servers that still are not 8-bit clean.
However while sending an email with an 8-bit body might be safe, sending it with 8-bit headers is not.
RFC 2822 which was the standard until late 2008 prohibited non-ASCII characters in headers.
RFC 6532 proposed a standard for 8-bit headers but it is quite recent (2012) and does not seem widely implemented yet.
So sending unencoded 8-bit emails is currently not safe.
There are still SMTP servers that haven't been updated to support 8BITMIME, so yes, you still need to check for the extension.
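A rough sketch of that check in Python follows. `pick_body_encoding` and `encode_subject` are hypothetical helper names, and falling back to quoted-printable is just one reasonable choice. With the standard `smtplib`, the set of extensions could come from `SMTP.esmtp_features` after calling `ehlo()`.

```python
from email.header import Header, decode_header


def pick_body_encoding(server_extensions, body):
    """Choose a Content-Transfer-Encoding for a text body."""
    if body.isascii():
        return "7bit"
    # Only send raw 8-bit data if the server advertised 8BITMIME in EHLO.
    if "8bitmime" in {name.lower() for name in server_extensions}:
        return "8bit"
    return "quoted-printable"


def encode_subject(text):
    """Headers stay ASCII: non-ASCII text becomes an RFC 2047 encoded word."""
    return Header(text, "utf-8").encode()


encoded = encode_subject("Grüße")
raw, charset = decode_header(encoded)[0]
```

The header helper reflects the point made above: even when the body can travel as 8-bit, header text is encoded unless both sides support the newer internationalized-header extensions.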

Multipart/alternative subtype, when client use it?

Why do webmail services (like Gmail) send MIME messages using the multipart/alternative subtype (when composing in HTML), while others send HTML as MIME with text/html parts inside (without using the alternative subtype)?
Section 5.1.4 of RFC 2046 defines the multipart/alternative MIME type to allow the sender to provide different, interchangeable representations of the same message and to leave it up to the receiver to choose the form of presentation most suitable for its capabilities. Note that while the general meaning of each representation for the user should be retained, there usually is some information loss from one representation to the other (e.g. text/plain is missing the formatting information with respect to text/html). The alternatives should generally be ordered from the plainest to the richest, i.e. if the alternatives are again text/html and text/plain then text/plain should come first. This helps the users of non-MIME-conformant viewers, in which the easiest to interpret part will show up first. Generally, a MIME-conformant viewer should display the last representation it is capable of viewing, since it is the most preferable.
This content type is often contrasted with multipart/mixed where a number of different resources are combined in a single message.
The main reason some mail services provide messages as multipart/alternative is to support different types of viewing applications on the receiving end. For example, some viewers lack the ability to render HTML and require text/plain representation for the message to be at all readable. At the same time, other viewers do have the ability to render HTML and can provide much better user experience when message is delivered as text/html. The most flexible solution to the trade-off between supporting wide range of viewers and enhancing user experience for the more capable ones is afforded by delivering both representations wrapped in a multipart/alternative message.
For details see RFC 2046.
multipart/alternative indicates that each part is an "alternative" version of the same (or similar) content, each in a different format denoted by its "Content-Type" header. The formats are ordered by how faithful they are to the original, with the least faithful first and the most faithful last.
Mail-agents like Gmail know what they are doing: they convert the text/html to text/plain, put both alternatives into their emails, and let the receiving end decide which alternative to use.
There are also mail-agents that don't know how to extract a text-only version from the html content, just because the developer did not bother to implement it, so they only send text/html without any alternatives.
And sometimes mail-agents (I call them the crazy ones) send multipart/alternative but actually only put text/html in it, without any alternatives. That is not really nice, but it is not against any spec.
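A small sketch of both sides using Python's standard `email` package may help. `build_alternative` and `pick_part` are made-up helper names, and the `can_render_html` flag stands in for whatever capability check a real viewer would do.

```python
from email.message import EmailMessage


def build_alternative(text, html):
    """Sender side: plainest representation first, richest last."""
    msg = EmailMessage()
    msg["Subject"] = "Hello"
    msg.set_content(text)                        # text/plain
    msg.add_alternative(html, subtype="html")    # wraps both in multipart/alternative
    return msg


def pick_part(msg, can_render_html):
    """Receiver side: take the richest representation the viewer supports."""
    preference = ("html", "plain") if can_render_html else ("plain",)
    body = msg.get_body(preferencelist=preference)
    return body.get_content_type(), body.get_content()


msg = build_alternative("Hi there", "<p>Hi <b>there</b></p>")
order = [part.get_content_type() for part in msg.iter_parts()]
```

A sender that skips the `set_content` call and only attaches HTML ends up in the "text/html without any alternatives" case described above.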

Can punycode-encoded email addresses clash with "real" addresses?

The problem is this: I'm using a third-party Email delivery service that doesn't accept mail addresses with non-ASCII characters in the name part, like müller@example.com .
Encoding such an address with Punycode:
http://en.wikipedia.org/wiki/Punycode
http://idnaconv.phlymail.de/index.php?decoded=m%C3%BCller%40example.com&idn_version=2008&encode=Encode+%3E%3E&lang=de
yields this address:
xn--mller-kva@example.com
And sending mail to it via the service seems to work.
However, I'm not sure if someone couldn't register "xn--mller-kva@example.com" directly, thus receiving Emails meant for "müller@example.com".
Is this clashing possible ? Are there other solutions for this problem ?
UPDATE
Thanks for the answers. Here's a summary of what we learned:
Punycoding the local part of the email address works, and you can send and receive from such an encoded address (of course)
However, there are no guarantees at all that providers or mail clients will understand the encoding, or do it automatically. Clashes are therefore possible, and the whole idea not a good one :)
One should simply do what everyone else does, which is to not allow or accept non-ASCII name parts, as per specification
And finally, it turns out the third-party service prohibits such shenanigans anyway.
Non-ASCII characters are not allowed in the local part of email addresses. Period. Punycode is ONLY FOR DOMAINS, not for local parts of email addresses.
However, it is very likely that the IETF adopts a standard that makes internationalized local parts possible. This standard, however, will probably not be based on punycode.
I got bored and was researching this tonight, and apparently this is now codified in the Extended SMTP standard, specifically SMTPUTF8 as per RFC 6531. See http://en.wikipedia.org/wiki/Extended_SMTP#SMTPUTF8
My brief experiment using emoji mailbox names returned the following error when sending via Gmail:
local-part of envelope contains utf8 but remote server did not offer SMTPUTF8
This is the same regardless whether I used the emoji or punycode version of the address.
You can encode sections of mail header fields into different character encodings using a format like the following: =?UTF-8?B?w6HDq8O0?= This allows you to embed things like umlauts but I'm pretty sure it doesn't work for the actual address part.
There's no reason why you cannot use these characters to form your address. RFC5322 defines the characters that may appear in the address part in Section 3.4 and all the characters you use above are valid. However, as the other comment added, it's all a little fruitless if the mail clients that you are sending to cannot parse this format.
Some SMTP servers might 'accidentally' allow umlauts but since they're not within the supported character ranges I wouldn't risk it.
The only standard way to send non us-ascii characters in the local-part of a email address is through rfc6532 (Internationalized email headers) and rfc6531 (SMTP Extension for SMTPUTF8).
As far as I know there is no standard way to encode non us-ascii chars in a local part of a email address notably:
Punycode is for domain names only, not the local part. You can have a local part which happens to look like the Punycode encoding of some string, but it should be displayed in its Punycode-encoded form. If a mail program decides to display it after Punycode-decoding it, that is non-standard behavior.
The encoded word encoding mechanism mentioned in one of the answers (the =?utf-8?Q?foobar?= thing) is not applicable to the local part of a mail address, only to the display name of a mailbox (which is something different, but related i.e. the thing your mail program might display instead of the mail address).
In the end this means that müler@example.com and xn--mler-0ra@example.com are two completely unrelated email addresses, which would only have the same meaning if they were domains (but they are not, so they can collide).
Theoretically you could hope that by now (2019) all mail servers support
SMTPUTF8 and all client support internationalized mails, but sadly I would
not count on it if it's important.
Btw. it happens that the local part of a email address is the only thing in
the mail standard(s) where you might want to have non us-ascii chars and there
is no way to encode it (as far as I know). All other parts either have encoded word, puny, percent, base64, quoted-printable or some other form of encoding mechanism.
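The domain/local-part distinction can be sketched with Python's built-in `idna` and `punycode` codecs. `to_ascii_address` is a hypothetical helper, and `müller.example` is a made-up domain used only for illustration.

```python
def to_ascii_address(address):
    """ASCII-encode the domain; refuse non-ASCII local parts."""
    local, _, domain = address.rpartition("@")
    if not local.isascii():
        # There is no ASCII encoding for the local part; it needs SMTPUTF8.
        raise ValueError("non-ASCII local part requires SMTPUTF8")
    # The idna codec applies Punycode per label and adds the xn-- prefix.
    return local + "@" + domain.encode("idna").decode("ascii")


domain_form = to_ascii_address("info@müller.example")
raw_punycode = "müller".encode("punycode")   # the bare algorithm, no xn-- prefix
```

Applying the Punycode step by hand to a local part produces a string that a mail server treats as a different, ordinary ASCII mailbox name, which is the clash the question asks about.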
Did a few tests. Umlauts in the local part seem to work in certain setups: neither my MUA (claws) nor the outbound relay (exim) nor the receiving MTA (postfix) complained or did any punycode conversion. Providers like gmail and hotmail, however, don't allow the umlauts at all (tested webmail and direct incoming and outgoing smtp). I didn't find any documentation about this case that suggests punycoding local parts. So, since it's not documented and no one does it, there is no clashing problem :-)
conclusion: you probably shouldn't accept umlauts in the local part in the first place and not even try to send an email to those addresses. (if the big players don't do it and it's not documented/supported by RFC, why should you?)

Simple chat protocol

I'm learning networking and threading in C#. For that purpose I'm developing chat over network.
Currently I have basic communication between client - server (TCP). Server can work with multiple clients. But only client - server communication. Basically client sends ASCII encoded message to server, then server decodes it and shows in console.
By now I want to implement Client-Client communication.
Suppose we have online list of clients in each client and message box for sending message to each client.
The next step is clicking a button, which will open a Socket and send the message; the server then has to work out whom the message is addressed to.
So, what should the structure of my message be, and how should the server determine whom a message is addressed to?
Generally I don't need code, I want theory. Simple and short. Maybe tutorials?
I have looked into XMPP. It's very heavy. I just need direction on how I can do this. My goal is to learn, not to implement it and forget.
TCP is stream based which means that you will never know, with the help of TCP, when a message begins and ends. Any message/protocol design needs to address that.
There are two ways to detect when a message ends. The first way is to add a delimiter at the end of message, and the second way is to include the length in a header.
HTTP uses both. It uses an empty line to determine when the header ends. And in the header it has a Content-Length header which tells how large the body is.
For binary protocols I suggest that you use a fixed length header where the first integer (4 bytes) is a version and the second integer is the body length. In this way you can easily switch header layout between versions (since the version is the first integer).
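A minimal sketch of that fixed-length header, written in Python with `struct` for brevity (the same layout carries over to C#). The network byte order, the version number 1, and the helper names are assumptions made for the example.

```python
import struct

HEADER = struct.Struct("!II")   # version, body length; 4 bytes each, big-endian
VERSION = 1


def pack_frame(body):
    """Prefix the body with the fixed 8-byte header."""
    return HEADER.pack(VERSION, len(body)) + body


def unpack_frames(buffer):
    """Return (list of complete bodies, leftover bytes to keep for later)."""
    frames = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        version, length = HEADER.unpack_from(buffer, offset)
        if version != VERSION:
            raise ValueError("unsupported header version: %d" % version)
        end = offset + HEADER.size + length
        if end > len(buffer):
            break                # body not fully received yet
        frames.append(buffer[offset + HEADER.size:end])
        offset = end
    return frames, buffer[offset:]


stream = pack_frame(b"hello") + pack_frame(b"world!")
first, rest = unpack_frames(stream[:-3])        # simulate a partial TCP read
second, tail = unpack_frames(rest + stream[-3:])
```

The leftover bytes are the important part: because TCP is a stream, a read can end in the middle of a message, so the receiver keeps the remainder and retries once more data arrives.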
For text protocols it really depends on what the message contents look like. The problem is that the content must not contain the delimiter being used (which can be hard to guarantee if you are transporting chat messages). You could of course escape the delimiter if it exists in the actual chat message. But imho a better approach is to use a header/body layout like HTTP (since it's also quite easy to parse and you can have X number of headers without having to change the parser).
A message would look like:
From: Arne
To: #ChannelName
WrittenAt: 2011-07-03 12:00 GMT
Content-Length: 16

This is a text
Notice that the length is 16: "This is a text" is 14 characters, and the new line that follows it is included in the body (which adds up to 16 if the new line is a two-byte CR LF).
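The header/body layout can be sketched like this, again in Python for brevity. The CR LF line endings, ASCII-only headers, UTF-8 body, and the `serialize`/`parse` names are assumptions for the example, not part of any standard.

```python
def serialize(headers, body):
    """Write headers, an empty line, then exactly Content-Length body bytes."""
    payload = body.encode("utf-8")
    lines = ["%s: %s" % (name, value) for name, value in headers.items()]
    lines.append("Content-Length: %d" % len(payload))
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii") + payload


def parse(buffer):
    """Return (headers, body, leftover) or None if the message is incomplete."""
    head, separator, rest = buffer.partition(b"\r\n\r\n")
    if not separator:
        return None              # the empty line has not arrived yet
    headers = {}
    for line in head.decode("ascii").split("\r\n"):
        name, _, value = line.partition(":")   # split at the first colon only
        headers[name.strip()] = value.strip()
    length = int(headers["Content-Length"])
    if len(rest) < length:
        return None              # body not fully received yet
    return headers, rest[:length].decode("utf-8"), rest[length:]


wire = serialize({"From": "Arne", "To": "#ChannelName"}, "This is a text\r\n")
result = parse(wire)
```

Because the body length is counted in bytes and read blindly, the chat text itself can contain empty lines or anything else without any escaping.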
As for client-client communication I would always go through the server if you are a beginner. It's a lot easier since otherwise you have to make sure that at least one of the clients is not behind a router (or it will be impossible to deliver the message).
Just check the To header if it's for a chat room or a user.