I have found conflicting information about dot stuffing when transmitting an email.
stuff a dot if the line contains a single dot (to avoid premature termination)
stuff a dot to every line stat starts with a dot
stuff a dot to (1) and to every line part of a quoted-printable message part only
Can anyone clarify?
According to the SMTP standard RFC 5321, section 4.5.2:
https://www.rfc-editor.org/rfc/rfc5321#section-4.5.2
To allow all user composed text to be transmitted transparently, the following procedures are used:
Before sending a line of mail text, the SMTP client checks the first character of the line. If it is a period, one additional period is inserted at the beginning of the line.
When a line of mail text is received by the SMTP server, it checks the line. If the line is composed of a single period, it is treated as the end of mail indicator. If the first character is a period and there are other characters on the line, the first character is deleted.
So, from the three points of your question, the second one is right.
The practical answer: If you're using quoted printable format then always translate a dot to =2E. You can't rely on all smtp servers doing the dot removal correctly.
If you want to assume the whole world is standards compliant then go with answer 2 above.
In SMTP protocol the mail is terminated by a single dot and a newline character(s)
In simple terms something like:
\r\n.\r\n
The characters:
CR LF DOT CR LF
Which corresponds to a single dot at the beginning of a line.
In case the mail data contains a single . At the beginning of line and is followed by a new line character then the SMTP protocol will consider it as mail termination and hence only a part of mail would be delivered.
So the whole idea is to avoid these type of situation by padding an extra dot.
Related
I'm making a code to check a mailbox, and forward unseen mails to another user.
But sometimes it fails with an error:
ValueError: Header values may not contain linefeed or carriage return characters
I checked the raw fetched data and found out that the 'Subject' value contains \r\n.
Not all mails contain, but some do.
It just appears normal in the mailbox, and I have no idea why some contain such characters.
Does it have to do with the length of the subject?
How can I deal with these situations?
Thanks :)
Email messages have a maximum line length. That's historical and the rule isn't upheld 100% of the time, so to speak. But in header fields, a space is to be treated the same as a CR LF and a sequence of spaces or a htab character. This is a really long subject, encoded in that way:
Subject: Pretend this is about 80-90
characters long
The simplest way to deal with it is to consider any sequences of space characters to be a single space.
Read the source of any email message, you'll see this wrapping in most of then. The Received fields is almost always wrapped, for instance, and quite often To if there are many addressees, or Content-Type/Content-Disposition for attachments.
This question seems fairly pedantic, however it feels reasonably important when trying to follow the RFC. I am trying to write an IRC client and I am using the RFC to follow how the protocol should be written. I came across the section for message prefixes and was slightly confused by what was written.
Each IRC message may consist of up to three main parts: the prefix
(optional), the command, and the command parameters (of which there
may be up to 15). The prefix, command, and all parameters are
separated by one (or more) ASCII space character(s) (0x20).
The presence of a prefix is indicated with a single leading ASCII
colon character (':', 0x3b), which must be the first character of the
message itself. There must be no gap (whitespace) between the colon
and the prefix.
My question concerns the first sentence in the second paragraph; ASCII colon character (':', 0x3b). With (to my understanding) 0x3bbeing the ASCII character for a semi-colon, does this mean that the prefix may be either semi-colon or a colon, or is it simply a typo in the document? I'm going ahead with using a colon for now, however my curiosity is nagging away at me.
The colon : (0x3a) is correct.
This is the first errata listed for RFC1459.
I'm generating emails. They Show up fine for me in gmail and Outlook 2010. However, my client sees the = sign that gets added to the end of lines by the quoted-printable formatting. It also eats the character on the next line, but then displaying the equal sign.
Example:
line that en=
ds like this
shows up like
line that en=s like this
(Note: The EOL character in my emails is just LF. No CR.)
I'm confirming what outlook version my client is using, but I think it's 2007. The email headers from her appear to come through Exchange 6.5.
My emails are created in php using the HtmlMimeMail5 library. They are multipart emails, with the applicable section sent with:
Content-Type: text/html; charset="utf-8"
Content-Transfer-Encoding: quoted-printable
It appears I could just make sure nothing in my email reaches the line wrap at 76 characters, but that seems like the wrong way to solve the problem. Should the EOL character be different? (In emails from the client, the EOL character is simply a LF) Any other ideas?
I do not know what the PHP library does, but in the end MIME mail must contain CR LF line endings. Obviously the client notices that = is not followed by a proper CR LF sequence, so it assumes that it is not a soft line break, but a character encoded in two hex digits, therefore it reads the next two bytes. It should notice that the next two bytes are not valid hex digits, so its behavior is wrong too, but we have to admit that at that point it does not have a chance to display something useful. They opted for the garbage in, garbage out approach.
How many characters are allowed to be in the subject line of Internet email?
I had a scan of The RFC for email but could not see specifically how long it was allowed to be.
I have a colleague that wants to programmatically validate for it.
If there is no formal limit, what is a good length in practice to suggest?
See RFC 2822, section 2.1.1 to start.
There are two limits that this
standard places on the number of
characters in a line. Each line of
characters MUST be no more than 998
characters, and SHOULD be no more than
78 characters, excluding the CRLF.
As the RFC states later, you can work around this limit (not that you should) by folding the subject over multiple lines.
Each header field is logically a
single line of characters comprising
the field name, the colon, and the
field body. For convenience however,
and to deal with the 998/78 character
limitations per line, the field body
portion of a header field can be split
into a multiple line representation;
this is called "folding". The general
rule is that wherever this standard
allows for folding white space (not
simply WSP characters), a CRLF may be
inserted before any WSP. For
example, the header field:
Subject: This is a test
can be represented as:
Subject: This
is a test
The recommendation for no more than 78 characters in the subject header sounds reasonable. No one wants to scroll to see the entire subject line, and something important might get cut off on the right.
RFC2322 states that the subject header "has no length restriction"
but to produce long headers but you need to split it across multiple lines, a process called "folding".
subject is defined as "unstructured" in RFC 5322
here's some quotes ([...] indicate stuff i omitted)
3.6.5. Informational Fields
The informational fields are all optional. The "Subject:" and
"Comments:" fields are unstructured fields as defined in section
2.2.1, [...]
2.2.1. Unstructured Header Field Bodies
Some field bodies in this specification are defined simply as
"unstructured" (which is specified in section 3.2.5 as any printable
US-ASCII characters plus white space characters) with no further
restrictions. These are referred to as unstructured field bodies.
Semantically, unstructured field bodies are simply to be treated as a
single line of characters with no further processing (except for
"folding" and "unfolding" as described in section 2.2.3).
2.2.3 [...] An unfolded header field has no length restriction and
therefore may be indeterminately long.
after some test: If you send an email to an outlook client, and the subject is >77 chars, and it needs to use "=?ISO" inside the subject (in my case because of accents) then OutLook will "cut" the subject in the middle of it and mesh it all that comes after, including body text, attaches, etc... all a mesh!
I have several examples like this one:
Subject: =?ISO-8859-1?Q?Actas de la obra N=BA.20100154 (Expediente N=BA.20100182) "NUEVA RED FERROVIARIA.=
TRAMO=20BEASAIN=20OESTE(Pedido=20PC10/00123-125),=20BEASAIN".?=
To:
As you see, in the subject line it cutted on char 78 with a "=" followed by 2 or 3 line feeds, then continued with the rest of the subject baddly.
It was reported to me from several customers who all where using OutLook, other email clients deal with those subjects ok.
If you have no ISO on it, it doesn't hurt, but if you add it to your subject to be nice to RFC, then you get this surprise from OutLook. Bit if you don't add the ISOs, then iPhone email will not understand it(and attach files with names using such characters will not work on iPhones).
Limits in the context of Unicode multi-byte character capabilities
While RFC5322 defines a limit of 1000 (998 + CRLF) characters, it does so in the context of headers limited to ASCII characters only.
RFC 6532 explains how to handle multi-byte Unicode characters.
Section 3.4 ( Effects on Line Length Limits ) states:
Section 2.1.1 of [RFC5322] limits lines to 998 characters and
recommends that the lines be restricted to only 78 characters. This
specification changes the former limit to 998 octets. (Note that, in
ASCII, octets and characters are effectively the same, but this is
not true in UTF-8.) The 78-character limit remains defined in terms
of characters, not octets, since it is intended to address display
width issues, not line-length issues.
So for example, because you are limited to 998 octets, you can't have 998 smiley faces in your subject line as each emoji of this type is 4 octets.
Using PHP to demonstrate:
Run php -a for an interactive terminal.
// Multi-byte string length:
var_export(mb_strlen("\u{0001F602}",'UTF-8'));
// 1
// ASCII string length:
var_export(strlen("\u{0001F602}"));
// 4
// ASCII substring of four octet character:
var_export(substr("\u{0001F602}",0,4));
// '😂'
// ASCI substring of four octet character truncated to 3 octets, mutating character:
var_export(substr("\u{0001F602}",0,3));
// 'â–’'
I don't believe that there is a formal limit here, and I'm pretty sure there isn't any hard limit specified in the RFC either, as you found.
I think that some pretty common limitations for subject lines in general (not just e-mail) are:
80 Characters
128 Characters
256 Characters
Obviously, you want to come up with something that is reasonable. If you're writing an e-mail client, you may want to go with something like 256 characters, and obviously test thoroughly against big commercial servers out there to make sure they serve your mail correctly.
Hope this helps!
What's important is which mechanism you are using the send the email. Most modern libraries (i.e. System.Net.Mail) will hide the folding from you. You just put a very long email subject line in without (CR,LF,HTAB). If you start trying to do your own folding all bets are off. It will start reporting errors. So if you are having this issue just filter out the CR,LF,HTAB and let the library do the work for you. You can usually also set the encoding text type as a separate field. No need for iso encoding in the subject line.
From http://www.faqs.org/rfcs/rfc2822.html:
CR and LF MUST only occur together as
CRLF; they MUST NOT appear
independently in the body.
We have a web service that sends out confirmation emails, but one of our users pointed out that this does not adhere to the rfc2822 standard. So my question is, why is it important for CR and LF to appear together in email messages?
Because it's in the accepted RFC?
Implementations are derived from RFCs. If that was not the case, then there would be no guarantee of interoperability between different implementations. There may or may not be tangible, technical reasons of requiring them to appear together, but in this case those reasons are irrelevant. It's a simple matter of "because they said so."
Because in email CRLF is the line separator. If you only use CR or only use LF you will have all sorts of unexpected problems with various clients, SMTP server combination. Some servers will reject your emails, some will "fix" your emails. Fixed emails are some of the most fun to deal with.
Think in term of an old teletype. CR returns the write head to the beginning of the line, LF rolls the paper one line forward. You need both steps to begin a new line. If you use CR without LF, you will overwrite the same text, which is of course illegal.
Anyway, this is the historial reason to define CR+LF as the ASCII-code for a new line. Of course in the end it is just arbitrary codes. Some systems use only CR to indicate a new line, some systems use only LF, some use a different character entirely. RFC2822 had to chose one, and decided to allow only the sequence CRLF.
Since the RFC decided to use CRLF, it makes sense to disallow CR or LF seperately, since this would be pretty useless and problematic to handle anyway.
If not you end up with a CR, which puts you on the same line, then whatever you write would be on top of the chars at the left on the same line, then comes the LF and you are in some column towards the middle and start writing again. Messy.