Historical reason behind different line ending at different platforms - operating-system

Why did DOS/Windows and Mac decide to use \r\n and \r for line ending instead of \n? Was it just a result of trying to be "different" from Unix?
And now that Mac OS X is Unix (-like), did Apple switch to \n from \r?

DOS inherited CR-LF line endings (what you're calling \r\n, just making the ascii characters explicit) from CP/M. CP/M inherited it from the various DEC operating systems which influenced CP/M designer Gary Kildall.
CR-LF was used so that the teletype machines would return the print head to the left margin (CR = carriage return), and then move to the next line (LF = line feed).
The Unix guys handled that in the device driver, and when necessary translated LF to CR-LF on output to devices that needed it.
And as you guessed, Mac OS X now uses LF.

Really adding to #Mark Harrison...
The people who tell you that Unix is "just outputting the text the programmer specified" whereas DOS is broken are plain wrong. There are also claims that it's stupid for DOS to flag EOF when it sees an EOF character, raising the question of what exactly that EOF character is for.
There is no one true convention for text file line endings - only platform-specific conventions. After all, even CR-LF, CR and LF aren't the only line end conventions to ever be used, and ASCII was never even the one and only character set. The problem is the C standard library and runtime, which didn't abstract away this platform-dependent detail. Other third generation languages (such as Pascal and even Basic) managed it, at least to some degree. Because of this, when C compilers were written for other platforms, runtime library hacks were needed to achieve compatibility with existing source code and books.
In fact, it's Unix and Multics that originally needed string translation for console I/O, since users usually sat at an ASCII terminal that required CR LF line ends. This translation was done in a device driver, though - the goal was to abstract away the device-specifics, assuming that it was better to adopt one convention and stick to it for stored text files.
The C text I/O hack is similar in principle to what CygWin does now, hacking Linux runtimes to work as well as can be expected on Windows. There's a real history of hacking things about to turn them into Unix-alikes - but then there's also Wine, turning Linux into Windows. Oddly enough, you can read some misplaced line-end criticism of Windows in the CygWin FAQ (Internet Archive link added 2013 - the page no longer exists). Maybe it's just their sense of humour, since they are basically doing what they are criticising, but on a much grander scale ;-)
The C++ standard library (whatever platform its implemented on) avoids this issue using iostreams, which abstract away line ends. For output, that suits me fine. For input, I need more control, so I either interpret character-by-character or else use a scanner generator.
[EDIT It turns out that the struck-out claim above isn't true, and never was. The std::endl literally translates to a \n and a flush. The \n is exactly the same \n you get in C - it tends to get called "new line", but it's actually an ASCII line feed character, which then gets translated by the runtime if necessary. Funny how false assumptions can get so ingrained you never question them - basically, C++ had no choice to do what C did (other than adding more layers on top) for compatibility reasons, and that should always have been obvious.]
The biggest slice of blame from my POV is with C, but C isn't the only project to fail to anticipate its move to other platforms. Blaming Bill Gates is just nuts - all he did was buy and polish a variant of the then popular CP/M. Really, it's just history - the same reason why we don't know what character codes 128 to 255 refer to in most text files. Given the ease of coping with all three line end conventions, it's odd that some developers still insist on that "my platforms convention is the one true way, and I shall force it on you like it or not" attitude.
Also - will the Unicode line separator codepoint U+2028 replace all these conventions in future text files? ;-)

It's interesting to note the CRLF is pretty much the internet standard. That is, pretty much every standard internet protocol that is line oriented uses CRLF. SMTP, POP, IMAP, NNTP, etc.. The body of email consists of lines terminated by CRLF.

According to Wikipedia: in the beginning, the program had to put in extra CR characters before the LF to slow the program down so the printer had time to keep up - and CP/M and then later Windows used this method. But Multics's printer driver put in extra characters automatically so the program didn't have to - and Unix developer from that. But none of that explains why the early Mac didn't do that (they do now that they are based on Unix).
https://en.wikipedia.org/wiki/Newline#History:
The sequence CR+LF was commonly used on many early computer systems that had adopted Teletype machines—typically a Teletype Model 33 ASR—as a console device, because this sequence was required to position those printers at the start of a new line. The separation of newline into two functions concealed the fact that the print head could not return from the far right to the beginning of the next line in time to print the next character. Any character printed after a CR would often print as a smudge in the middle of the page while the print head was still moving the carriage back to the first position. "The solution was to make the newline two characters: CR to move the carriage to column one, and LF to move the paper up."[1] In fact, it was often necessary to send extra characters—extraneous CRs or NULs—which are ignored but give the print head time to move to the left margin. Many early video displays also required multiple character times to scroll the display.
On such systems, applications had to talk directly to the Teletype machine and follow its conventions since the concept of device drivers hiding such hardware details from the application was not yet well developed. Therefore, text was routinely composed to satisfy the needs of Teletype machines. Most minicomputer systems from DEC used this convention. CP/M also used it in order to print on the same terminals that minicomputers used. From there MS-DOS (1981) adopted CP/M's CR+LF in order to be compatible, and this convention was inherited by Microsoft's later Windows operating system.
The Multics operating system began development in 1964 and used LF alone as its newline. Multics used a device driver to translate this character to whatever sequence a printer needed (including extra padding characters), and the single byte was more convenient for programming. What seems like a more obvious[citation needed] choice—CR—was not used, as CR provided the useful function of overprinting one line with another to create boldface and strikethrough effects. Perhaps more importantly, the use of LF alone as a line terminator had already been incorporated into drafts of the eventual ISO/IEC 646 standard. Unix followed the Multics practice, and later Unix-like systems followed Unix. This created conflicts between Windows and Unix-like OSes, whereby files composed on one OS cannot be properly formatted or interpreted by another OS (for example a UNIX shell script written in a Windows text editor like Notepad).

Related

Is U+0085 NEXT LINE (NEL) deprecated?

There is few information about NEL. I think it was supposed to replace LF and CR LF, but it seems like it wasn't very used.
Is it somehow deprecated and should applications interpret it as a new line?
There are many different code points which can be used as line separators, mostly for legacy reasons. Unicode's design seeks to preserve all information, so text can be roundtrip en- and decoded to/from Unicode without information loss. NEL is/was used in EBCDIC to represent newlines, so it made its way into Unicode as a separate character.
It's not "deprecated" in the sense that it still fulfils the function it was designed for, but you will hardly find it in actual use unless you're dealing with legacy EBCDIC somehow. If you do, you may want to treat it as newline, but many modern systems to date just treat it as whitespace.
The Wikipedia article on NEL has more information.

How UTF8/Unicode adapt to new writing systems?

An example to clarify my question:
The Hongkongers' native language is Cantonese, however, we all write in a different language: Madarin Chinese. Two languages are kindof similar, and Hongkongers are educated to write in Madarin Chinese language.
Cantonese doesn't have a writing system. Though we are still happy with Madarin as our writing language, however, in case one day Hongkongers decided to develop a 'Cantonese script' which contains not-yet-existing characters, how should UTF8/Unicode/fonts change, to adapt these new characters?
I mean, who will change the UTF8/Unicode/fonts standard? How exactly Linux/Windows OS have to be modified, in order to display these newly created characters?
(The example is just to make my question clear. We're not talking about politics ;D )
The Unicode coding space has over 1,000,000 code points, and only about 10% of them have been allocated, so there is a lot of room for new characters (even though some areas of the coding space have been set apart for use other than added characters). The Unicode Consortium, working in close cooperation with the relevant body at ISO, assigns code points to new characters on the basis of proposals that demonstrate actual usage or, in some cases, plans with a solid basis and widespread support.
Thus, if a new script were designed and there was a large community that would seriously use it, it would be added, with its characters, into Unicode after due proposals and discussion.
It would then be up to font manufacturers to add glyphs for such characters. This might take a long time, but if there is strong enough need, new fonts and enhancements to existing fonts would emerge.
No change to UTF-8 or other Unicode transfer encodings would be needed. They already encode the entire coding space, whether code points are assigned to characters or not.
Rendering software would need no modifications, unless there are some specialties in the writing system. Normal characters would be rendered just fine, as soon as suitable fonts are available.
However, if the characters added were outside the Basic Multilingual Plane (BMP), the “16-bit subset of Unicode”, both rendering and processing (and input) would be problematic. Many programming languages and programs effectively treat Unicode as if it were a 16-bit code and run into problems (possibly solvable, but still) when characters outside the BMP are used. If the writing system had, say, 10,000 characters, it is quite possible that it would have to allocated outside the BMP.
The Unicode committee adds new characters as they see fit. Then fonts add support for the new characters. Operating systems should not require changes simply to display the new characters. Typing the characters would generally require updates or plug-ins to an operating system's input methods.

How important is file encoding?

How important is file encoding? The default for Notepad++ is ANSI, but would it be better to use UTF-8 or what problems could occur if not using one or the other?
Yes, it would be better if everyone used UTF-8 for all documents always.
Unfortunately, they don't, primarily because Windows text editors (and many other Win tools) default to “ANSI”. This is a misleading name as it is nothing to do with ANSI X3.4 (aka ASCII) or any other ANSI standard, but in fact means the system default code page of the current Windows machine. That default code page can change between machines, or on the same machine, at which point all text files in “ANSI” that have non-ASCII characters like accented letters in will break.
So you should certainly create new files in UTF-8, but you will have to be aware that text files other people give you are likely to be in a motley collection of crappy country-specific code pages.
Microsoft's position has been that users who want Unicode support should use UTF-16LE files; it even, misleadingly, calls this encoding simply “Unicode” in save box encoding menus. MS took this approach because in the early days of Unicode it was believed that this would be the cleanest way of doing it. Since that time:
Unicode was expanded beyond 16-bit code points, removing UTF-16's advantage of each code unit being a code point;
UTF-8 was invented, with the advantage that as well as covering all of Unicode, it's backwards-compatible with 7-bit ASCII (which UTF-16 isn't as it's full of zero bytes) and for this reason it's also typically more compact.
Most of the rest of the world (Mac, Linux, the web in general) has, accordingly, already moved to UTF-8 as a standard encoding, eschewing UTF-16 for file storage or network purposes. Unfortunately Windows remains stuck with the archaic and useless selection of incompatible code pages it had back in the early Windows NT days. There is no sign of this changing in the near future.
If you're sharing files between systems that use differing default encodings, then a Unicode encoding is the way to go. If you don't plan on it, or use only the ASCII set of characters and aren't going to work with encodings that, for whatever reason, modify those (I can't think of any at the moment, but you never know...), you don't really need it.
As an aside, this is the sort of stuff that happens when you don't use a Unicode encoding for files with non-ASCII characters on a system with a different encoding from the one the file was created with: http://en.wikipedia.org/wiki/Mojibake
It is very importaint since your whatevertool will show false chars/whatever if you use the wrong encoding. Try to load a kyrillic file in Notepad without using UTF-8 or so and see a lot of "?" coming up. :)

What is difference between \n and \r? [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
What is the difference between \r and \n?
Hi,
What is difference between \n (newline) and \r (carriage return)? They both move current cursor to the next line. Are they same?
\r returns the cursor to the beginning of the line, NOT to the next line. When you use \nin Linux, \r is implied, in windows, it is not.
Using \r in Unix-like systems may result in overwriting the same line.
I suggest you read this.
In short, a newline in Windows is "\r\n", while a newline in Unix is just "\n" (and, just to make life difficult, a newline in older Macs is "\r")
Actually, a carriage return is supposed to move the cursor to the beginning of the current line. Then, newline moves the cursor exactly down one.
Nowadays, compilers will often automatically convert one or the other to \r\n on Windows or \n on Linux. Mac used to use \r but they have changed to the \n convention.
(edit: removed false/untested statements)
Read The Great Newline Schism it explains everything in deep detail with great humor.
Ah the old days of the typewriter...
The difference between the two stems from days of yonder when typing was done directly to paper. It required two actions to go to the next line:
pushing the 'carriage' (big cilinder on the top) back to the left (this is where the character would end up).
shifting the paper one line up. (thus going down one line)
Splitting these two actions facilitated going back to a precise character position to correct it (there was no way to go up one line, or left one character!). Holding paper whiteout on the erroneous character and hitting that key would neatly whiteout exactly that erroneous character, then you could go back again and hit the correct key
(there was a key for not moving the carriage though).
In the young computer age these actions were translated 1 to 1 into \r for carriage return and \n for shifting the 'paper'.
Nowadays the major operating systems apparently have differing opinions on whether this is still necessary for computer technology where going back to previous position is much easier. However, in modern programming languages you'll generally see that \n is assumed to mean \r\n.
No they're not. Modern text editors often treat them the same however because their old uses don't make much sense for digital word processors.
For example \r literally means "return to the beginning of the line". While this might have been useful for a typewriter if you just wanted to overwrite everything on that line this sort of functionality doesn't make much sense for digital type.
\n on the other hand would simply move down a line without returning to the beginning. This was also useful on a typewriter for indentation or bulleting. Again, not something that makes much sense for digital type.
Telnet is one example where both characters are still used in this manner.
Both characters were included in ascii language simply because when it was being spec'd they hadn't realized that functionality that was useful on a typewriter didn't make much sense on a computer.

What is the difference between \r and \n?

How are \r and \n different? I think it has something to do with Unix vs. Windows vs. Mac, but I'm not sure exactly how they're different, and which to search for/match in regexes.
They're different characters. \r is carriage return, and \n is line feed.
On "old" printers, \r sent the print head back to the start of the line, and \n advanced the paper by one line. Both were therefore necessary to start printing on the next line.
Obviously that's somewhat irrelevant now, although depending on the console you may still be able to use \r to move to the start of the line and overwrite the existing text.
More importantly, Unix tends to use \n as a line separator; Windows tends to use \r\n as a line separator and Macs (up to OS 9) used to use \r as the line separator. (Mac OS X is Unix-y, so uses \n instead; there may be some compatibility situations where \r is used instead though.)
For more information, see the Wikipedia newline article.
EDIT: This is language-sensitive. In C# and Java, for example, \n always means Unicode U+000A, which is defined as line feed. In C and C++ the water is somewhat muddier, as the meaning is platform-specific. See comments for details.
In C and C++, \n is a concept, \r is a character, and \r\n is (almost always) a portability bug.
Think of an old teletype. The print head is positioned on some line and in some column. When you send a printable character to the teletype, it prints the character at the current position and moves the head to the next column. (This is conceptually the same as a typewriter, except that typewriters typically moved the paper with respect to the print head.)
When you wanted to finish the current line and start on the next line, you had to do two separate steps:
move the print head back to the beginning of the line, then
move it down to the next line.
ASCII encodes these actions as two distinct control characters:
\x0D (CR) moves the print head back to the beginning of the line. (Unicode encodes this as U+000D CARRIAGE RETURN.)
\x0A (LF) moves the print head down to the next line. (Unicode encodes this as U+000A LINE FEED.)
In the days of teletypes and early technology printers, people actually took advantage of the fact that these were two separate operations. By sending a CR without following it by a LF, you could print over the line you already printed. This allowed effects like accents, bold type, and underlining. Some systems overprinted several times to prevent passwords from being visible in hardcopy. On early serial CRT terminals, CR was one of the ways to control the cursor position in order to update text already on the screen.
But most of the time, you actually just wanted to go to the next line. Rather than requiring the pair of control characters, some systems allowed just one or the other. For example:
Unix variants (including modern versions of Mac) use just a LF character to indicate a newline.
Old (pre-OSX) Macintosh files used just a CR character to indicate a newline.
VMS, CP/M, DOS, Windows, and many network protocols still expect both: CR LF.
Old IBM systems that used EBCDIC standardized on NL--a character that doesn't even exist in the ASCII character set. In Unicode, NL is U+0085 NEXT LINE, but the actual EBCDIC value is 0x15.
Why did different systems choose different methods? Simply because there was no universal standard. Where your keyboard probably says "Enter", older keyboards used to say "Return", which was short for Carriage Return. In fact, on a serial terminal, pressing Return actually sends the CR character. If you were writing a text editor, it would be tempting to just use that character as it came in from the terminal. Perhaps that's why the older Macs used just CR.
Now that we have standards, there are more ways to represent line breaks. Although extremely rare in the wild, Unicode has new characters like:
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR
Even before Unicode came along, programmers wanted simple ways to represent some of the most useful control codes without worrying about the underlying character set. C has several escape sequences for representing control codes:
\a (for alert) which rings the teletype bell or makes the terminal beep
\f (for form feed) which moves to the beginning of the next page
\t (for tab) which moves the print head to the next horizontal tab position
(This list is intentionally incomplete.)
This mapping happens at compile-time--the compiler sees \a and puts whatever magic value is used to ring the bell.
Notice that most of these mnemonics have direct correlations to ASCII control codes. For example, \a would map to 0x07 BEL. A compiler could be written for a system that used something other than ASCII for the host character set (e.g., EBCDIC). Most of the control codes that had specific mnemonics could be mapped to control codes in other character sets.
Huzzah! Portability!
Well, almost. In C, I could write printf("\aHello, World!"); which rings the bell (or beeps) and outputs a message. But if I wanted to then print something on the next line, I'd still need to know what the host platform requires to move to the next line of output. CR LF? CR? LF? NL? Something else? So much for portability.
C has two modes for I/O: binary and text. In binary mode, whatever data is sent gets transmitted as-is. But in text mode, there's a run-time translation that converts a special character to whatever the host platform needs for a new line (and vice versa).
Great, so what's the special character?
Well, that's implementation dependent, too, but there's an implementation-independent way to specify it: \n. It's typically called the "newline character".
This is a subtle but important point: \n is mapped at compile time to an implementation-defined character value which (in text mode) is then mapped again at run time to the actual character (or sequence of characters) required by the underlying platform to move to the next line.
\n is different than all the other backslash literals because there are two mappings involved. This two-step mapping makes \n significantly different than even \r, which is simply a compile-time mapping to CR (or the most similar control code in whatever the underlying character set is).
This trips up many C and C++ programmers. If you were to poll 100 of them, at least 99 will tell you that \n means line feed. This is not entirely true. Most (perhaps all) C and C++ implementations use LF as the magic intermediate value for \n, but that's an implementation detail. It's feasible for a compiler to use a different value. In fact, if the host character set is not a superset of ASCII (e.g., if it's EBCDIC), then \n will almost certainly not be LF.
So, in C and C++:
\r is literally a carriage return.
\n is a magic value that gets translated (in text mode) at run-time to/from the host platform's newline semantics.
\r\n is almost always a portability bug. In text mode, this gets translated to CR followed by the platform's newline sequence--probably not what's intended. In binary mode, this gets translated to CR followed by some magic value that might not be LF--possibly not what's intended.
\x0A is the most portable way to indicate an ASCII LF, but you only want to do that in binary mode. Most text-mode implementations will treat that like \n.
"\r" => Return
"\n" => Newline or Linefeed
(semantics)
Unix based systems use just a "\n" to end a line of text.
Dos uses "\r\n" to end a line of text.
Some other machines used just a "\r". (Commodore, Apple II, Mac OS prior to OS X, etc..)
\r is used to point to the start of a line and can replace the text from there, e.g.
main()
{
printf("\nab");
printf("\bsi");
printf("\rha");
}
Produces this output:
hai
\n is for new line.
In short \r has ASCII value 13 (CR) and \n has ASCII value 10 (LF).
Mac uses CR as line delimiter (at least, it did before, I am not sure for modern macs), *nix uses LF and Windows uses both (CRLF).
In addition to #Jon Skeet's answer:
Traditionally Windows has used \r\n, Unix \n and Mac \r, however newer Macs use \n as they're unix based.
\r is Carriage Return; \n is New Line (Line Feed) ... depends on the OS as to what each means. Read this article for more on the difference between '\n' and '\r\n' ... in C.
in C# I found they use \r\n in a string.
\r used for carriage return. (ASCII value is 13)
\n used for new line. (ASCII value is 10)