Why is it so important for CR and LF to appear together in Email? - email

From http://www.faqs.org/rfcs/rfc2822.html:
CR and LF MUST only occur together as
CRLF; they MUST NOT appear
independently in the body.
We have a web service that sends out confirmation emails, but one of our users pointed out that this does not adhere to the rfc2822 standard. So my question is, why is it important for CR and LF to appear together in email messages?

Because it's in the accepted RFC?
Implementations are derived from RFCs. If that were not the case, then there would be no guarantee of interoperability between different implementations. There may or may not be tangible, technical reasons for requiring them to appear together, but in this case those reasons are irrelevant. It's a simple matter of "because they said so."

Because in email CRLF is the line separator. If you only use CR or only use LF, you will have all sorts of unexpected problems with various client and SMTP server combinations. Some servers will reject your emails, some will "fix" your emails. Fixed emails are some of the most fun to deal with.
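In practice, a service like the one in the question can simply normalize line endings before handing the body to its mail layer. A minimal sketch in Python (normalize_crlf is just an illustrative name, not a library function):

```python
import re

def normalize_crlf(text: str) -> str:
    # Collapse existing CRLF pairs, bare CRs, and bare LFs into a single CRLF each.
    # The alternation is ordered so CRLF is matched first and never doubled.
    return re.sub(r"\r\n|\r|\n", "\r\n", text)

print(repr(normalize_crlf("line1\nline2\rline3\r\n")))
# 'line1\r\nline2\r\nline3\r\n'
```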

Think in terms of an old teletype. CR returns the write head to the beginning of the line, LF rolls the paper one line forward. You need both steps to begin a new line. If you use CR without LF, you will overwrite the text you just wrote, which is of course not what you want.
Anyway, this is the historical reason for defining CR+LF as the ASCII code for a new line. Of course, in the end these are just arbitrary codes. Some systems use only CR to indicate a new line, some systems use only LF, some use a different character entirely. RFC 2822 had to choose one, and decided to allow only the sequence CRLF.
Since the RFC decided to use CRLF, it makes sense to disallow CR or LF separately, since these would be pretty useless and problematic to handle anyway.

Otherwise you end up with a lone CR, which puts you back at the start of the same line, so whatever you write next lands on top of the characters already there; then the LF arrives and you drop down a line, starting somewhere in the middle of it. Messy.

Related

Is U+0085 NEXT LINE (NEL) deprecated?

There is little information about NEL. I think it was supposed to replace LF and CR LF, but it seems it was never widely used.
Is it somehow deprecated and should applications interpret it as a new line?
There are many different code points which can be used as line separators, mostly for legacy reasons. Unicode's design seeks to preserve all information, so text can be roundtrip en- and decoded to/from Unicode without information loss. NEL is/was used in EBCDIC to represent newlines, so it made its way into Unicode as a separate character.
It's not "deprecated" in the sense that it still fulfils the function it was designed for, but you will hardly find it in actual use unless you're dealing with legacy EBCDIC somehow. If you do, you may want to treat it as newline, but many modern systems to date just treat it as whitespace.
The Wikipedia article on NEL has more information.
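As a small illustration of how unevenly NEL is treated even today, Python's str.splitlines() does recognize U+0085 as a line boundary, while a plain split on \n does not:

```python
text = "first line\u0085second line"

# str.splitlines() counts U+0085 (NEL) among its line boundaries...
print(text.splitlines())   # ['first line', 'second line']

# ...but a plain split on \n does not, so NEL-terminated lines are easy to miss.
print(text.split("\n"))    # ['first line\x85second line']
```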

Is it possible to add another Unicode character for "at sign" without changing any code in the back-end of all the email providers?

So let's say for some reason we wanted to add another Unicode character for the at sign, and use it instead of "@" in all the email providers.
Now I have three questions:
How do email providers parse an email address? Do they actually scan it until they see an "@", with the "@" symbol's code point hard-coded in the parser?
Do different service providers have different email parsers with different standards, or is there a standard parser library that every email provider uses?
Would it be possible to add another at-sign symbol and use it in emails without having to make changes in all the email providers' code?
Yes, e-mail addresses are parsed using a hard-wired "@" character. After almost fifty years of e-mail, there are literally millions of e-mail handling programs, and they all use this same syntax. So you're not going to be able to change this convention, and your second and third questions are moot.
E-mail addresses are parsed by dozens of different kinds of software, not just "email server" software inside "e-mail providers". Even things as trivial as client-side JavaScript highlighting for an e-mail field - of which there are easily tens of thousands of implementations around - would have to adapt.
An "@" is not a character class by itself - so even if there were a unique Unicode character class for an "address separator", who would ever have written code that checks the character class of the separator? Have you ever done that, even for filtering punctuation out? (That is a real use case for the Unicode classification of characters, and even then it sees little use in real-world code.)
Now, of course, you are free to write email client code that presents the "@" as anything else when rendering e-mail data to the user. Internally, though, if this software did not use "@", even for its own purposes, it would not work with anything else in the world - from antivirus software to text-based templates.
And finally, such a change would hardly have anything to do with Unicode itself - Unicode can standardize characters, but the e-mail protocol is a separate thing. Normally the series of documents kept as RFCs is what mandates the various internet protocols, including IMAP, POP and SMTP - the three protocols that make e-mail work. Even if new RFCs for all of these were published with a new character accepted in place of "@", it would likely take more than a decade until all the software around, as detailed above, was compliant enough for it to be used. (And yes, all of it would have to be changed.)
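To illustrate how literally the "@" is hard-wired, here is a small Python sketch: the standard library helper copes with display names, but the local-part/domain split is still a plain search for the rightmost "@", just as in countless other codebases:

```python
from email.utils import parseaddr

# parseaddr() handles display names and angle brackets,
# but the separator itself is still the literal "@" character.
_, addr = parseaddr("Jane Doe <jane.doe@example.com>")
local_part, _, domain = addr.rpartition("@")
print(local_part, domain)   # jane.doe example.com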

SMTP dot stuffing.. when and where to do it?

I have found conflicting information about dot stuffing when transmitting an email.
stuff a dot if the line contains a single dot (to avoid premature termination)
stuff a dot to every line that starts with a dot
stuff a dot to (1) and to every line part of a quoted-printable message part only
Can anyone clarify?
According to the SMTP standard RFC 5321, section 4.5.2:
https://www.rfc-editor.org/rfc/rfc5321#section-4.5.2
To allow all user composed text to be transmitted transparently, the following procedures are used:
Before sending a line of mail text, the SMTP client checks the first character of the line. If it is a period, one additional period is inserted at the beginning of the line.
When a line of mail text is received by the SMTP server, it checks the line. If the line is composed of a single period, it is treated as the end of mail indicator. If the first character is a period and there are other characters on the line, the first character is deleted.
So, of the three options in your question, the second one is right.
The practical answer: if you're using quoted-printable format, then always translate a dot to =2E. You can't rely on all SMTP servers doing the dot removal correctly.
If you want to assume the whole world is standards compliant then go with answer 2 above.
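Here is a minimal sketch of the RFC 5321 procedure in Python, assuming the message has already been split into lines with their CRLF terminators removed (dot_stuff and dot_unstuff are illustrative names, not library functions):

```python
def dot_stuff(lines):
    # Client side: any line whose first character is "." gets one extra "." prepended.
    return ["." + line if line.startswith(".") else line for line in lines]

def dot_unstuff(received):
    # Server side: a lone "." ends the data; otherwise strip one leading ".".
    body = []
    for line in received:
        if line == ".":
            break                                  # end-of-mail indicator
        body.append(line[1:] if line.startswith(".") else line)
    return body

message = ["Dear user,", ".a line that really starts with a dot", "."]
wire = dot_stuff(message) + ["."]                  # terminator appended after stuffing
print(wire)                # ['Dear user,', '..a line that really starts with a dot', '..', '.']
print(dot_unstuff(wire))   # the original three lines come back intact
```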
In the SMTP protocol, the mail data is terminated by a single dot surrounded by newline characters.
In simple terms something like:
\r\n.\r\n
The characters:
CR LF DOT CR LF
Which corresponds to a single dot at the beginning of a line.
If the mail data itself contains a single "." at the beginning of a line, followed by a newline character, the SMTP server will consider it the end-of-mail indicator, and hence only part of the mail will be delivered.
So the whole idea is to avoid this situation by stuffing an extra dot.
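For completeness, a tiny demonstration of the failure mode when the data is sent unstuffed (the naive split stands in for a receiver scanning for the terminator):

```python
raw = "line one\r\n.\r\nline two\r\n"   # user content containing a lone dot, sent unstuffed

# The receiver stops at the first lone-dot line, so everything after it is lost.
delivered = raw.split("\r\n.\r\n", 1)[0]
print(repr(delivered))   # 'line one' -- "line two" never arrives
```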

What is difference between \n and \r? [duplicate]

Possible Duplicate:
What is the difference between \r and \n?
Hi,
What is the difference between \n (newline) and \r (carriage return)? They both move the current cursor to the next line. Are they the same?
\r returns the cursor to the beginning of the line, NOT to the next line. When you use \n on Linux, \r is implied; on Windows, it is not.
Using \r in Unix-like systems may result in overwriting the same line.
I suggest you read this.
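As a quick illustration of that overwriting behaviour, a lone \r is exactly what simple terminal progress indicators rely on. A small Python sketch (nothing here is specific to any library):

```python
import sys
import time

# Each write starts with \r, so it rewinds to column 1 and overprints the previous text.
for i in range(1, 6):
    sys.stdout.write(f"\rprogress: {i}/5")
    sys.stdout.flush()
    time.sleep(0.2)

sys.stdout.write("\n")   # finish with a real newline so the shell prompt starts on a fresh line
```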
In short, a newline in Windows is "\r\n", while a newline in Unix is just "\n" (and, just to make life difficult, a newline in older Macs is "\r")
Actually, a carriage return is supposed to move the cursor to the beginning of the current line. Then, newline moves the cursor exactly down one.
Nowadays, runtime libraries will often automatically convert one or the other to \r\n on Windows or \n on Linux. Mac used to use \r but has since changed to the \n convention.
(edit: removed false/untested statements)
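The translation described above is a text-mode I/O feature that Python exposes too, which makes it easy to observe (demo.txt is just a scratch file name; a quick sketch, not Windows-specific):

```python
# In text mode (newline=None), every '\n' written is translated to os.linesep:
# '\r\n' on Windows, '\n' on Linux and modern macOS.
with open("demo.txt", "w") as f:
    f.write("one\ntwo\n")

# Reading the file back in binary mode shows the bytes that actually reached the disk.
with open("demo.txt", "rb") as f:
    print(f.read())   # b'one\r\ntwo\r\n' on Windows, b'one\ntwo\n' elsewhere
```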
Read The Great Newline Schism; it explains everything in deep detail with great humor.
Ah the old days of the typewriter...
The difference between the two stems from the days of yore when typing was done directly to paper. It required two actions to go to the next line:
pushing the 'carriage' (the big cylinder on top) back to the left (this is where the character would end up).
shifting the paper one line up. (thus going down one line)
Splitting these two actions facilitated going back to a precise character position to correct it (there was no way to go up one line, or left one character!). Holding paper whiteout over the erroneous character and hitting that key would neatly white out exactly that erroneous character; then you could go back again and hit the correct key
(there was a key for not moving the carriage though).
In the young computer age these actions were translated 1 to 1 into \r for carriage return and \n for shifting the 'paper'.
Nowadays the major operating systems apparently have differing opinions on whether this is still necessary for computer technology where going back to previous position is much easier. However, in modern programming languages you'll generally see that \n is assumed to mean \r\n.
No, they're not. However, modern text editors often treat them the same because their old uses don't make much sense for digital word processors.
For example, \r literally means "return to the beginning of the line". While this might have been useful on a typewriter if you just wanted to overwrite everything on that line, this sort of functionality doesn't make much sense for digital type.
\n on the other hand would simply move down a line without returning to the beginning. This was also useful on a typewriter for indentation or bulleting. Again, not something that makes much sense for digital type.
Telnet is one example where both characters are still used in this manner.
Both characters were included in ASCII simply because, when it was being spec'd, nobody had realized that functionality that was useful on a typewriter didn't make much sense on a computer.

Historical reason behind different line ending at different platforms

Why did DOS/Windows and Mac decide to use \r\n and \r for line ending instead of \n? Was it just a result of trying to be "different" from Unix?
And now that Mac OS X is Unix (-like), did Apple switch to \n from \r?
DOS inherited CR-LF line endings (what you're calling \r\n, just making the ASCII characters explicit) from CP/M. CP/M inherited it from the various DEC operating systems which influenced CP/M designer Gary Kildall.
CR-LF was used so that the teletype machines would return the print head to the left margin (CR = carriage return), and then move to the next line (LF = line feed).
The Unix guys handled that in the device driver, and when necessary translated LF to CR-LF on output to devices that needed it.
And as you guessed, Mac OS X now uses LF.
Really adding to @Mark Harrison's answer...
The people who tell you that Unix is "just outputting the text the programmer specified" whereas DOS is broken are plain wrong. There are also claims that it's stupid for DOS to flag EOF when it sees an EOF character, raising the question of what exactly that EOF character is for.
There is no one true convention for text file line endings - only platform-specific conventions. After all, even CR-LF, CR and LF aren't the only line end conventions to ever be used, and ASCII was never even the one and only character set. The problem is the C standard library and runtime, which didn't abstract away this platform-dependent detail. Other third generation languages (such as Pascal and even Basic) managed it, at least to some degree. Because of this, when C compilers were written for other platforms, runtime library hacks were needed to achieve compatibility with existing source code and books.
In fact, it's Unix and Multics that originally needed string translation for console I/O, since users usually sat at an ASCII terminal that required CR LF line ends. This translation was done in a device driver, though - the goal was to abstract away the device-specifics, assuming that it was better to adopt one convention and stick to it for stored text files.
The C text I/O hack is similar in principle to what CygWin does now, hacking Linux runtimes to work as well as can be expected on Windows. There's a real history of hacking things about to turn them into Unix-alikes - but then there's also Wine, turning Linux into Windows. Oddly enough, you can read some misplaced line-end criticism of Windows in the CygWin FAQ (Internet Archive link added 2013 - the page no longer exists). Maybe it's just their sense of humour, since they are basically doing what they are criticising, but on a much grander scale ;-)
The C++ standard library (whatever platform it's implemented on) avoids this issue using iostreams, which abstract away line ends. For output, that suits me fine. For input, I need more control, so I either interpret character-by-character or else use a scanner generator.
[EDIT It turns out that the claim above about iostreams abstracting away line ends isn't true, and never was. std::endl literally translates to a \n and a flush. The \n is exactly the same \n you get in C - it tends to get called "new line", but it's actually an ASCII line feed character, which then gets translated by the runtime if necessary. Funny how false assumptions can get so ingrained you never question them - basically, C++ had no choice but to do what C did (other than adding more layers on top) for compatibility reasons, and that should always have been obvious.]
The biggest slice of blame from my POV is with C, but C isn't the only project to fail to anticipate its move to other platforms. Blaming Bill Gates is just nuts - all he did was buy and polish a variant of the then popular CP/M. Really, it's just history - the same reason we don't know what character codes 128 to 255 refer to in most text files. Given the ease of coping with all three line end conventions, it's odd that some developers still insist on the "my platform's convention is the one true way, and I shall force it on you whether you like it or not" attitude.
Also - will the Unicode line separator codepoint U+2028 replace all these conventions in future text files? ;-)
It's interesting to note that CRLF is pretty much the internet standard. That is, pretty much every standard internet protocol that is line oriented uses CRLF: SMTP, POP, IMAP, NNTP, etc. The body of an email consists of lines terminated by CRLF.
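To make "line oriented" concrete, here is a hedged sketch of speaking a few lines of SMTP by hand over a raw socket; every line sent ends in CRLF regardless of the client's native newline (mail.example.org and the host names are placeholders):

```python
import socket

# Talk minimal SMTP by hand; note that every line sent ends in b"\r\n", never bare b"\n".
with socket.create_connection(("mail.example.org", 25), timeout=10) as conn:
    print(conn.recv(1024).decode(errors="replace"))   # 220 greeting banner
    conn.sendall(b"EHLO client.example.org\r\n")
    print(conn.recv(1024).decode(errors="replace"))   # 250 capability list
    conn.sendall(b"QUIT\r\n")
```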
According to Wikipedia: in the beginning, programs had to put in extra CR characters before the LF to slow things down so the printer had time to keep up - and CP/M and then later Windows used this method. But Multics's printer driver put in the extra characters automatically so the program didn't have to - and Unix followed the Multics practice. But none of that explains why the early Mac didn't do that (it does now that it is based on Unix).
https://en.wikipedia.org/wiki/Newline#History:
The sequence CR+LF was commonly used on many early computer systems that had adopted Teletype machines—typically a Teletype Model 33 ASR—as a console device, because this sequence was required to position those printers at the start of a new line. The separation of newline into two functions concealed the fact that the print head could not return from the far right to the beginning of the next line in time to print the next character. Any character printed after a CR would often print as a smudge in the middle of the page while the print head was still moving the carriage back to the first position. "The solution was to make the newline two characters: CR to move the carriage to column one, and LF to move the paper up."[1] In fact, it was often necessary to send extra characters—extraneous CRs or NULs—which are ignored but give the print head time to move to the left margin. Many early video displays also required multiple character times to scroll the display.
On such systems, applications had to talk directly to the Teletype machine and follow its conventions since the concept of device drivers hiding such hardware details from the application was not yet well developed. Therefore, text was routinely composed to satisfy the needs of Teletype machines. Most minicomputer systems from DEC used this convention. CP/M also used it in order to print on the same terminals that minicomputers used. From there MS-DOS (1981) adopted CP/M's CR+LF in order to be compatible, and this convention was inherited by Microsoft's later Windows operating system.
The Multics operating system began development in 1964 and used LF alone as its newline. Multics used a device driver to translate this character to whatever sequence a printer needed (including extra padding characters), and the single byte was more convenient for programming. What seems like a more obvious[citation needed] choice—CR—was not used, as CR provided the useful function of overprinting one line with another to create boldface and strikethrough effects. Perhaps more importantly, the use of LF alone as a line terminator had already been incorporated into drafts of the eventual ISO/IEC 646 standard. Unix followed the Multics practice, and later Unix-like systems followed Unix. This created conflicts between Windows and Unix-like OSes, whereby files composed on one OS cannot be properly formatted or interpreted by another OS (for example a UNIX shell script written in a Windows text editor like Notepad).