Why would I use a Unicode Signature Byte-Order-Mark (BOM)? - unicode

Are these obsolete? They seem like the worst idea ever -- embed something in the contents of your file that no one can see, but impacts the file's functionality. I don't understand why I would want one.

They're necessary in some cases, yes, because there are both little-endian and big-endian implementations of UTF-16.
When reading an unknown UTF-16 file, how can you tell which of the two is used?
The only solution is to place some kind of easily identifiable marker in the file, which can never be mistaken for anything else, regardless of the endian-ness used.
That's what the BOM does.
And do you need one? Only if you're 1) using an UTF encoding where endianness is an issue (It matters for UTF-16, but UTF8 always looks the same regardless of endianness), and the file is going to be shared with external applications.
If your own app is the only one that's going to read and write the file, you can omit the BOM, and simply decide once and for all which endianness you're going to use. But if another application has to read the file, it won't know the endianness in advance, so adding the BOM might be a good idea.

Some excerpts from the UTF and BOM FAQ from the Unicode Consortium may be helpful.
Q: What is a BOM?
A: A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol. (Emphasis mine.)
I wouldn't exactly say the byte-order mark is embedded in the data. Rather, it prefixes the data. The character is only a byte-order mark when it's the first thing in the data stream. Anywhere else, and it's the zero-width non-breaking space. Unicode-aware programs that don't honor the byte-order mark aren't really harmed by its presence anyway since the character is invisible, and a word-joiner at the start of a block of text just joins the next character to nothing, so it has no effect.
Q: Where is a BOM useful?
A: A BOM is useful at the beginning of files that are typed as text, but for which it is not known whether they are in big or little endian format—it can also serve as a hint indicating that the file is in Unicode, as opposed to in a legacy encoding and furthermore, it act as a signature for the specific encoding form used.
So, you'd want a BOM when your program is capable of handling multiple encodings of Unicode. How else will your program know which encoding to use when interpreting its input?
Q: When a BOM is used, is it only in 16-bit Unicode text?
A: No, a BOM can be used as a signature no matter how the Unicode text is transformed: UTF-16, UTF-8, UTF-7, etc. The exact bytes comprising the BOM will be whatever the Unicode character U+FEFF is converted into by that transformation format. In that form, the BOM serves to indicate both that it is a Unicode file, and which of the formats it is in.
That's probably the case where the BOM is used most frequently today. It distinguishes UTF-8-encoded text from any other encodings; it's not really marking the order of the bytes since UTF-8 only has one order.
If you're designing your own protocol or data format, you're not required to use a BOM. Another question from the FAQ touches on that:
Q: How do I tag data that does not interpret U+FEFF as a BOM?
A: Use the tag UTF-16BE to indicate big-endian UTF-16 text, and UTF-16LE to indicate little-endian UTF-16 text. If you do use a BOM, tag the text as simply UTF-16.
It mentions the concept of tagging your data's format. That means specifying the format out-of-band from the data itself. That's great if such a facility is available to you, but it's often not, especially when older systems are being retrofitted for Unicode.

As you tagged this with UTF-8 I'm going to say you don't need a BOM. Byto Order Marks are only useful for UTF-16 and UTF-32 as it informs the computer whether the file is in Big Endian or Little Endian. Some text editors may use the Byte Order Mark to decide what encoding the document uses but this is not part of the Unicode standard.

The BOM signifies which encoding of Unicode the file is in. Without this distinction, a unicode reader would not know how to read the file.
However, UTF-8 doesn't require a BOM.
Check out the Wikipedia article.

The "BOM" is a holdover from the early days of Unicode when it was assumed that using Unicode would mean using 16-bit characters. It is completely pointless in an encoding like UTF-8 which has only one byte order. The choice of U+FEFF is also suboptimal for UTF-32, because it cannot distinguish between all possible middle-endian byte orders (to do so would require a BOM encoded with 4 different bytes).
The only reason you'd use one is when sending UTF-16 or UTF-32 data between platforms with different byte orders, but (1) most people use UTF-8 anyway, and (2) the MIME charset parameter provides a better mechanism.

As UTF16 and UTF32 BOMs tell whether the content is in Big-Endian or Little-Endian Format and also that content is Unicode, the UTF-8 BOM classifies the file as utf-8 encoded. Without the UTF-8 BOM, how can you know if it is a ANSI file or UTF-8 encoded file? The UTF-8 BOM doesn't tell endianess of course, because utf-8 is always a byte stream, but it tells if content is utf-8 encoded Unicode or ANSI. Of course you can scan for valid utf-8 sequences but in my opinion, it is easier to check the first three Bytes of the file.

UTF16 and UTF32 can be written in both Big-Endian and Little-Endian forms. You could try to heuristically determine the endianess by analysing the result of treating the file in either endianess, but to save you all that bother, the BOM can tell you right away.
UTF-8 doesn't really need a BOM though, as you decode it byte by byte.

Regardless of whether you use these yourself when creating text files, its probably worthwhile to be aware of when you read text files. i.e. detect and skip (and ideally handle accordingly) the BOM at the beginning of the file. I've run into a few which had it and which caused my some issues initially until I figured out what was going on.

Related

(Tcl) what character encoding set should I use?

So I'm trying to open and parse some old Visual Studio compilation log files with Tcl; my only problem is the files are in a strange encoding. Upon examining them with Notepad++ it seems they are in the 'UCS-2 Little Endian' encoding. Two questions:
Is there any command in Tcl that allows me to look at the character encoding of a file? I know there is encoding system which tells me the system encoding.
Using encoding names Tcl tells me the available encoding names are the following list:
cp860 cp861 cp862 cp863 tis-620 cp864 cp865 cp866 gb12345 gb2312-raw cp949 cp950 cp869 dingbats ksc5601 macCentEuro cp874 macUkraine jis0201 gb2312 euc-cn euc-jp macThai iso8859-10 jis0208 iso2022-jp macIceland iso2022 iso8859-13 jis0212 iso8859-14 iso8859-15 cp737 iso8859-16 big5 euc-kr macRomania macTurkish gb1988 iso2022-kr macGreek ascii cp437 macRoman iso8859-1 iso8859-2 iso8859-3 macCroatian koi8-r iso8859-4 ebcdic iso8859-5 cp1250 macCyrillic iso8859-6 cp1251 macDingbats koi8-u iso8859-7 cp1252 iso8859-8 cp1253 iso8859-9 cp1254 cp1255 cp850 cp1256 cp932 identity cp1257 cp852 macJapan cp1258 shiftjis utf-8 cp855 cp936 symbol cp775 unicode cp857
Given this, what would be the appropriate name to use in the fconfigure -encoding command to read these UCS-2 Little Endianencoded files and convert them to UTF-8 for use? If I understand the fconfigure command correctly, I need to specify the encoding type of the source file rather than what I want it to be; I just don't know which of the options in the above list corresponds to UCS-2 Little Endian. After reading a little bit, I see that UCS-2 is a predecessor of the UTF-16 character encoding, but that option isn't here either.
Thanks!
I'm afraid, currently there's no way to do it just by using fconfigure -encoding ?something?: the unicode encoding has rather moot meaning, and there's a feature request to create explicit support for UTF-16 variants.
What you could do about it?
Since unicode in Tcl running on Windows should mean UTF-16 with native endianness1 (little-endian on Wintel), if your solution is supposed to be a quick and dirty one, just try using -encoding unicode and see if that helps.
If you're targeting at some more bullet-proof or future-proof of cross-platform solution, I'd switch the channel to binary more, read the contents in chunks of two bytes at a time, and then use
binary scan $twoBytes s n
to scan the sequence of two bytes in $twoBytes as an 16-bit integer into a variable named "n", followed by something like
set c [format %c $n]
to produce a unicode character out of the number in $n, and assign it to a variable.
This way supposedly requires a bit more trickery to get correctly:
You might check the very first character obtained from the stream to see if it's a byte-order-mark, and drop it if it is.
If you need to process the stream in a line-wise manner, you'd have to implement a little state machine that would handle the CR+LF sequences correctly.
When doing your read $channelId 2, to get the next character, you should check that it returned not just 0 or 2, but also 1 — in case the file happens to be corrupted, — and handle this.
The UCS-2 encoding differs from UTF-16 in that the latter might contain the so-called surrogate pairs, and hence it is not a fixed-length encoding. Hence handling an UTF-16 stream properly implies also detecting those surrogate pairs. On the other hand, I hardly beleive a compilation log produced by MSVS might contain them, so I'd just assume it's encoded in UCS-2LE.
1 The true story is that the only thing Tcl guarantees about textual strings it handles (that is, those obtained by maniputating text, not via binary format or encoding convertto or reading a stream in binary mode) is that they're Unicode (or, rather, the "BMP" part of it).
But technically, the interpreter might switch the internal representation of any string between the UTF-8 encoding it uses by default and some fixed-length encoding which is what is referred to by that name "unicode". The "problem" is that no part of Tcl documentation specifies that internal fixed-length encoding because you're required to explicitly convert any text you output or read to/from some specific encoding — either via configuring the stream or using encoding convertfrom and encoding convertto or using binary format and binary scan, and the interpreter will do the right thing no matter which precise encoding it's currently using for your source string value — it's all transparent. Moreover, the next release of the "standard" Tcl interpreter might decide to drop this internal feature completely, or, say, use 32-bit or 64-bit integers for that internal fixed-length encoding. Whatever "non-standard" interpreters do (like Jacl etc) are also up to them. In other words, this feature is internal and is not a part of the documented contract about the interpreter's behaviour. And by the way, the "standard" encoding for Tcl strings (UTF-8) is not specified as such either — it's just an implementation detail.
In Tcl v8.6.8 I could solve the same issue with fconfigure channelId -encoding unicode.

Is BOM used for 2-byte Unicode text files?

I know that BOM is used for UTF-8 files, but what about the text files where every character is 2-bytes, should I add the byte order mark to them, too?
BOM's were invented for UCS-2 and UTF-16, and then only later appropriated by Microsoft (and then XML) for UTF-8. Think about the name: 'byte order mark'. UTF-8 has only one possible byte order, so it doesn't need a BOM to reveal the order. The three-byte sequence for U+FEFF in UTF-8 has, instead, become a Unicode signature for file type sniffing.
However, early versions of the XML support in Java did not respond well to a UTF-8 BOM, in spite of the inclusion of the UTF-8 BOM in the XML standard. Further, a file with a BOM can't be simply concatenated onto another file, because U+FEFF isn't BOM in the middle of the file; it's ZWNBSP.

Unicode Encodings

I have a question as to how programs parse strings if they do not a priori know the encoding that is used.
As I understand it, the UTF-8 encoding stores ASII characters with 1 byte, and all other chracters with up to as many as 6 (I think it's 6) bytes. Thus, for example, two spaces would be stored in memory as 0x2020.
How then, would a program be able to determine the difference between this string and the string`0x2020 encoded using the UTF-16 encoding which corresponds to the single character which evidently is a character that appears similar to the symbol sometimes used to denote the adjoint of an operator in mathematics (I just looked that up here).
It seems as if the parser would always have to know the encoding of a string before hand. If so, how is this implemented in practice? Is there a byte preceeding each string which tells the parser what encoding is used or something?
In general, it is not possible to know for certain the exact encoding used based solely on the stream of bytes that can represent text. However, if there is a byte order mark somewhere, you can use it at least as a hint as to what encoding is being used.
But with no hints or some kind of contract/exchange of metadata between the producer and consumer of the text, you can't be 100% sure. You can try using a heuristic, but then you get these kinds of problems if you end up guessing wrong.
If you want to be really sure, set up some kind of protocol or contract between the producer and the consumer of the text so that the text and the encoding scheme is known. You can hardcode the encoding scheme (for example, your program may parse UTF-8 and only UTF-8), or ensure the producer of the text always prepend a byte order mark or specially designed header bytes to communicate the encoding scheme.
Does the language always store strings in a certain encoding so that
the display function could safely assume that the string was encoded,
say, using UTF-8?
In depends on the language.
In C#, yes. A char is defined by the language specification (8.2.1) as a UTF-16 code unit, and thus a string is always UTF-16. Just like Java.
In Ruby 1.9, a string is a byte array with an associated Encoding.
But in pre-Unicode languages like C (and badly-designed post-Unicode languages like PHP), a string is just a byte array with no encoding information. You have to rely on convention. It's a real interesting experience to write a program that uses both a library that assumes UTF-8 strings and another that assumes windows-1252 strings.
A question that's equally relevant to all languages is: How do you determine the encoding of a byte array that contains encoded text? There are several different approaches:
Encoding declarations.
In protocols that use MIME types (notably, SMTP and HTTP), you can declare Content-Type: text/html; charset=UTF-8. In HTML, you can use <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"> or the newer <meta charset="UTF-8">. In XML, there's <?xml version="1.0" encoding="UTF-8"?>. In Python source code, there's # -*- coding: UTF-8 -*-.
Unfortunately, such declarations aren't always accurate. And they aren't available at all for locally-stored plain .txt files, so then a different approach must be used.
Byte-order mark (BOM)
Putting the special character U+FEFF at the beginning of a file lets you distinguish between the various UTF encodings.
But it's not usable for legacy encodings like ISO-8859-x or Windows-125x, and not always used with UTF-8.
Validation
Some encodings have strict rules about what makes a valid string. The best-known is UTF-8, with its rigid separation of leading/trailing bytes, prohibition of "overlong" encodings, etc. UTF-32 is even easier to recognize because the restriction of Unicode to 17 "planes" means that every code unit must have the form 00 {00-10} xx xx (or xx xx {00-10} 00 for little-endian).
So if text validates as being UTF-8 or UTF-32, you can safely assume that it is. There's a possibility of false positives, but it's very low.
However, this approach doesn't work well for UTF-16, where the false-positive rate is too high. (The only way for an even-length byte array to not be valid UTF-16 is to contain unpaired surrogates, or U+FFFE or U+FFFF.)
Statistical analysis
Use character frequency tables of various language/encoding combinations. This is the approach used by chardet (in combination with BOM and validation).
Falling back on a default encoding
When all else fails, assume ISO-8859-1, windows-1252, or Encoding.Default.

What's the difference between UTF-8 and UTF-8 with BOM?

What's different between UTF-8 and UTF-8 with BOM? Which is better?
The UTF-8 BOM is a sequence of bytes at the start of a text stream (0xEF, 0xBB, 0xBF) that allows the reader to more reliably guess a file as being encoded in UTF-8.
Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.
According to the Unicode standard, the BOM for UTF-8 files is not recommended:
2.6 Encoding Schemes
... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the “Byte Order Mark” subsection in Section 16.8, Specials, for more information.
The other excellent answers already answered that:
There is no official difference between UTF-8 and BOM-ed UTF-8
A BOM-ed UTF-8 string will start with the three following bytes. EF BB BF
Those bytes, if present, must be ignored when extracting the string from the file/stream.
But, as additional information to this, the BOM for UTF-8 could be a good way to "smell" if a string was encoded in UTF-8... Or it could be a legitimate string in any other encoding...
For example, the data [EF BB BF 41 42 43] could either be:
The legitimate ISO-8859-1 string "ABC"
The legitimate UTF-8 string "ABC"
So while it can be cool to recognize the encoding of a file content by looking at the first bytes, you should not rely on this, as show by the example above
Encodings should be known, not divined.
There are at least three problems with putting a BOM in UTF-8 encoded files.
Files that hold no text are no longer empty because they always contain the BOM.
Files that hold text within the ASCII subset of UTF-8 are no longer themselves ASCII because the BOM is not ASCII, which makes some existing tools break down, and it can be impossible for users to replace such legacy tools.
It is not possible to concatenate several files together because each file now has a BOM at the beginning.
And, as others have mentioned, it is neither sufficient nor necessary to have a BOM to detect that something is UTF-8:
It is not sufficient because an arbitrary byte sequence can happen to start with the exact sequence that constitutes the BOM.
It is not necessary because you can just read the bytes as if they were UTF-8; if that succeeds, it is, by definition, valid UTF-8.
Here are examples of the BOM usage that actually cause real problems and yet many people don't know about it.
BOM breaks scripts
Shell scripts, Perl scripts, Python scripts, Ruby scripts, Node.js scripts or any other executable that needs to be run by an interpreter - all start with a shebang line which looks like one of those:
#!/bin/sh
#!/usr/bin/python
#!/usr/local/bin/perl
#!/usr/bin/env node
It tells the system which interpreter needs to be run when invoking such a script. If the script is encoded in UTF-8, one may be tempted to include a BOM at the beginning. But actually the "#!" characters are not just characters. They are in fact a magic number that happens to be composed out of two ASCII characters. If you put something (like a BOM) before those characters, then the file will look like it had a different magic number and that can lead to problems.
See Wikipedia, article: Shebang, section: Magic number:
The shebang characters are represented by the same two bytes in
extended ASCII encodings, including UTF-8, which is commonly used for
scripts and other text files on current Unix-like systems. However,
UTF-8 files may begin with the optional byte order mark (BOM); if the
"exec" function specifically detects the bytes 0x23 and 0x21, then the
presence of the BOM (0xEF 0xBB 0xBF) before the shebang will prevent
the script interpreter from being executed. Some authorities recommend
against using the byte order mark in POSIX (Unix-like) scripts,[14]
for this reason and for wider interoperability and philosophical
concerns. Additionally, a byte order mark is not necessary in UTF-8,
as that encoding does not have endianness issues; it serves only to
identify the encoding as UTF-8. [emphasis added]
BOM is illegal in JSON
See RFC 7159, Section 8.1:
Implementations MUST NOT add a byte order mark to the beginning of a JSON text.
BOM is redundant in JSON
Not only it is illegal in JSON, it is also not needed to determine the character encoding because there are more reliable ways to unambiguously determine both the character encoding and endianness used in any JSON stream (see this answer for details).
BOM breaks JSON parsers
Not only it is illegal in JSON and not needed, it actually breaks all software that determine the encoding using the method presented in RFC 4627:
Determining the encoding and endianness of JSON, examining the first four bytes for the NUL byte:
00 00 00 xx - UTF-32BE
00 xx 00 xx - UTF-16BE
xx 00 00 00 - UTF-32LE
xx 00 xx 00 - UTF-16LE
xx xx xx xx - UTF-8
Now, if the file starts with BOM it will look like this:
00 00 FE FF - UTF-32BE
FE FF 00 xx - UTF-16BE
FF FE 00 00 - UTF-32LE
FF FE xx 00 - UTF-16LE
EF BB BF xx - UTF-8
Note that:
UTF-32BE doesn't start with three NULs, so it won't be recognized
UTF-32LE the first byte is not followed by three NULs, so it won't be recognized
UTF-16BE has only one NUL in the first four bytes, so it won't be recognized
UTF-16LE has only one NUL in the first four bytes, so it won't be recognized
Depending on the implementation, all of those may be interpreted incorrectly as UTF-8 and then misinterpreted or rejected as invalid UTF-8, or not recognized at all.
Additionally, if the implementation tests for valid JSON as I recommend, it will reject even the input that is indeed encoded as UTF-8, because it doesn't start with an ASCII character < 128 as it should according to the RFC.
Other data formats
BOM in JSON is not needed, is illegal and breaks software that works correctly according to the RFC. It should be a nobrainer to just not use it then and yet, there are always people who insist on breaking JSON by using BOMs, comments, different quoting rules or different data types. Of course anyone is free to use things like BOMs or anything else if you need it - just don't call it JSON then.
For other data formats than JSON, take a look at how it really looks like. If the only encodings are UTF-* and the first character must be an ASCII character lower than 128 then you already have all the information needed to determine both the encoding and the endianness of your data. Adding BOMs even as an optional feature would only make it more complicated and error prone.
Other uses of BOM
As for the uses outside of JSON or scripts, I think there are already very good answers here. I wanted to add more detailed info specifically about scripting and serialization, because it is an example of BOM characters causing real problems.
What's different between UTF-8 and UTF-8 without BOM?
Short answer: In UTF-8, a BOM is encoded as the bytes EF BB BF at the beginning of the file.
Long answer:
Originally, it was expected that Unicode would be encoded in UTF-16/UCS-2. The BOM was designed for this encoding form. When you have 2-byte code units, it's necessary to indicate which order those two bytes are in, and a common convention for doing this is to include the character U+FEFF as a "Byte Order Mark" at the beginning of the data. The character U+FFFE is permanently unassigned so that its presence can be used to detect the wrong byte order.
UTF-8 has the same byte order regardless of platform endianness, so a byte order mark isn't needed. However, it may occur (as the byte sequence EF BB FF) in data that was converted to UTF-8 from UTF-16, or as a "signature" to indicate that the data is UTF-8.
Which is better?
Without. As Martin Cote answered, the Unicode standard does not recommend it. It causes problems with non-BOM-aware software.
A better way to detect whether a file is UTF-8 is to perform a validity check. UTF-8 has strict rules about what byte sequences are valid, so the probability of a false positive is negligible. If a byte sequence looks like UTF-8, it probably is.
UTF-8 with BOM is better identified. I have reached this conclusion the hard way. I am working on a project where one of the results is a CSV file, including Unicode characters.
If the CSV file is saved without a BOM, Excel thinks it's ANSI and shows gibberish. Once you add "EF BB BF" at the front (for example, by re-saving it using Notepad with UTF-8; or Notepad++ with UTF-8 with BOM), Excel opens it fine.
Prepending the BOM character to Unicode text files is recommended by RFC 3629: "UTF-8, a transformation format of ISO 10646", November 2003
at https://www.rfc-editor.org/rfc/rfc3629 (this last info found at: http://www.herongyang.com/Unicode/Notepad-Byte-Order-Mark-BOM-FEFF-EFBBBF.html)
BOM tends to boom (no pun intended (sic)) somewhere, someplace. And when it booms (for example, doesn't get recognized by browsers, editors, etc.), it shows up as the weird characters  at the start of the document (for example, HTML file, JSON response, RSS, etc.) and causes the kind of embarrassments like the recent encoding issue experienced during the talk of Obama on Twitter.
It's very annoying when it shows up at places hard to debug or when testing is neglected. So it's best to avoid it unless you must use it.
Question: What's different between UTF-8 and UTF-8 without a BOM? Which is better?
Here are some excerpts from the Wikipedia article on the byte order mark (BOM) that I believe offer a solid answer to this question.
On the meaning of the BOM and UTF-8:
The Unicode Standard permits the BOM in UTF-8, but does not require
or recommend its use. Byte order has no meaning in UTF-8, so its
only use in UTF-8 is to signal at the start that the text stream is
encoded in UTF-8.
Argument for NOT using a BOM:
The primary motivation for not using a BOM is backwards-compatibility
with software that is not Unicode-aware... Another motivation for not
using a BOM is to encourage UTF-8 as the "default" encoding.
Argument FOR using a BOM:
The argument for using a BOM is that without it, heuristic analysis is
required to determine what character encoding a file is using.
Historically such analysis, to distinguish various 8-bit encodings, is
complicated, error-prone, and sometimes slow. A number of libraries
are available to ease the task, such as Mozilla Universal Charset
Detector and International Components for Unicode.
Programmers mistakenly assume that detection of UTF-8 is equally
difficult (it is not because of the vast majority of byte sequences
are invalid UTF-8, while the encodings these libraries are trying to
distinguish allow all possible byte sequences). Therefore not all
Unicode-aware programs perform such an analysis and instead rely on
the BOM.
In particular, Microsoft compilers and interpreters, and many
pieces of software on Microsoft Windows such as Notepad will not
correctly read UTF-8 text unless it has only ASCII characters or it
starts with the BOM, and will add a BOM to the start when saving text
as UTF-8. Google Docs will add a BOM when a Microsoft Word document is
downloaded as a plain text file.
On which is better, WITH or WITHOUT the BOM:
The IETF recommends that if a protocol either (a) always uses UTF-8,
or (b) has some other way to indicate what encoding is being used,
then it “SHOULD forbid use of U+FEFF as a signature.”
My Conclusion:
Use the BOM only if compatibility with a software application is absolutely essential.
Also note that while the referenced Wikipedia article indicates that many Microsoft applications rely on the BOM to correctly detect UTF-8, this is not the case for all Microsoft applications. For example, as pointed out by #barlop, when using the Windows Command Prompt with UTF-8†, commands such type and more do not expect the BOM to be present. If the BOM is present, it can be problematic as it is for other applications.
† The chcp command offers support for UTF-8 (without the BOM) via code page 65001.
This question already has a million-and-one answers and many of them are quite good, but I wanted to try and clarify when a BOM should or should not be used.
As mentioned, any use of the UTF BOM (Byte Order Mark) in determining whether a string is UTF-8 or not is educated guesswork. If there is proper metadata available (like charset="utf-8"), then you already know what you're supposed to be using, but otherwise you'll need to test and make some assumptions. This involves checking whether the file a string comes from begins with the hexadecimal byte code, EF BB BF.
If a byte code corresponding to the UTF-8 BOM is found, the probability is high enough to assume it's UTF-8 and you can go from there. When forced to make this guess, however, additional error checking while reading would still be a good idea in case something comes up garbled. You should only assume a BOM is not UTF-8 (i.e. latin-1 or ANSI) if the input definitely shouldn't be UTF-8 based on its source. If there is no BOM, however, you can simply determine whether it's supposed to be UTF-8 by validating against the encoding.
Why is a BOM not recommended?
Non-Unicode-aware or poorly compliant software may assume it's latin-1 or ANSI and won't strip the BOM from the string, which can obviously cause issues.
It's not really needed (just check if the contents are compliant and always use UTF-8 as the fallback when no compliant encoding can be found)
When should you encode with a BOM?
If you're unable to record the metadata in any other way (through a charset tag or file system meta), and the programs being used like BOMs, you should encode with a BOM. This is especially true on Windows where anything without a BOM is generally assumed to be using a legacy code page. The BOM tells programs like Office that, yes, the text in this file is Unicode; here's the encoding used.
When it comes down to it, the only files I ever really have problems with are CSV. Depending on the program, it either must, or must not have a BOM. For example, if you're using Excel 2007+ on Windows, it must be encoded with a BOM if you want to open it smoothly and not have to resort to importing the data.
UTF-8 without BOM has no BOM, which doesn't make it any better than UTF-8 with BOM, except when the consumer of the file needs to know (or would benefit from knowing) whether the file is UTF-8-encoded or not.
The BOM is usually useful to determine the endianness of the encoding, which is not required for most use cases.
Also, the BOM can be unnecessary noise/pain for those consumers that don't know or care about it, and can result in user confusion.
It should be noted that for some files you must not have the BOM even on Windows. Examples are SQL*plus or VBScript files. In case such files contains a BOM you get an error when you try to execute them.
Quoted at the bottom of the Wikipedia page on BOM: http://en.wikipedia.org/wiki/Byte-order_mark#cite_note-2
"Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature"
UTF-8 with BOM only helps if the file actually contains some non-ASCII characters. If it is included and there aren't any, then it will possibly break older applications that would have otherwise interpreted the file as plain ASCII. These applications will definitely fail when they come across a non ASCII character, so in my opinion the BOM should only be added when the file can, and should, no longer be interpreted as plain ASCII.
I want to make it clear that I prefer to not have the BOM at all. Add it in if some old rubbish breaks without it, and replacing that legacy application is not feasible.
Don't make anything expect a BOM for UTF-8.
I look at this from a different perspective. I think UTF-8 with BOM is better as it provides more information about the file. I use UTF-8 without BOM only if I face problems.
I am using multiple languages (even Cyrillic) on my pages for a long time and when the files are saved without BOM and I re-open them for editing with an editor (as cherouvim also noted), some characters are corrupted.
Note that Windows' classic Notepad automatically saves files with a BOM when you try to save a newly created file with UTF-8 encoding.
I personally save server side scripting files (.asp, .ini, .aspx) with BOM and .html files without BOM.
When you want to display information encoded in UTF-8 you may not face problems. Declare for example an HTML document as UTF-8 and you will have everything displayed in your browser that is contained in the body of the document.
But this is not the case when we have text, CSV and XML files, either on Windows or Linux.
For example, a text file in Windows or Linux, one of the easiest things imaginable, it is not (usually) UTF-8.
Save it as XML and declare it as UTF-8:
<?xml version="1.0" encoding="UTF-8"?>
It will not display (it will not be be read) correctly, even if it's declared as UTF-8.
I had a string of data containing French letters, that needed to be saved as XML for syndication. Without creating a UTF-8 file from the very beginning (changing options in IDE and "Create New File") or adding the BOM at the beginning of the file
$file="\xEF\xBB\xBF".$string;
I was not able to save the French letters in an XML file.
One practical difference is that if you write a shell script for Mac OS X and save it as plain UTF-8, you will get the response:
#!/bin/bash: No such file or directory
in response to the shebang line specifying which shell you wish to use:
#!/bin/bash
If you save as UTF-8, no BOM (say in BBEdit) all will be well.
The Unicode Byte Order Mark (BOM) FAQ provides a concise answer:
Q: How I should deal with BOMs?
A: Here are some guidelines to follow:
A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as
files. When you need to conform to such a protocol, use a BOM.
Some protocols allow optional BOMs in the case of untagged text. In those cases,
Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM,
the encoding could be anything.
Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there
is no BOM, the text should be interpreted as big-endian.
Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the
BOM as encoding form signature should be avoided.
Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In
particular, whenever a data stream is declared to be UTF-16BE,
UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.
From http://en.wikipedia.org/wiki/Byte-order_mark:
The byte order mark (BOM) is a Unicode
character used to signal the
endianness (byte order) of a text file
or stream. Its code point is U+FEFF.
BOM use is optional, and, if used,
should appear at the start of the text
stream. Beyond its specific use as a
byte-order indicator, the BOM
character may also indicate which of
the several Unicode representations
the text is encoded in.
Always using a BOM in your file will ensure that it always opens correctly in an editor which supports UTF-8 and BOM.
My real problem with the absence of BOM is the following. Suppose we've got a file which contains:
abc
Without BOM this opens as ANSI in most editors. So another user of this file opens it and appends some native characters, for example:
abg-αβγ
Oops... Now the file is still in ANSI and guess what, "αβγ" does not occupy 6 bytes, but 3. This is not UTF-8 and this causes other problems later on in the development chain.
As mentioned above, UTF-8 with BOM may cause problems with non-BOM-aware (or compatible) software. I once edited HTML files encoded as UTF-8 + BOM with the Mozilla-based KompoZer, as a client required that WYSIWYG program.
Invariably the layout would get destroyed when saving. It took my some time to fiddle my way around this. These files then worked well in Firefox, but showed a CSS quirk in Internet Explorer destroying the layout, again. After fiddling with the linked CSS files for hours to no avail I discovered that Internet Explorer didn't like the BOMfed HTML file. Never again.
Also, I just found this in Wikipedia:
The shebang characters are represented by the same two bytes in extended ASCII encodings, including UTF-8, which is commonly used for scripts and other text files on current Unix-like systems. However, UTF-8 files may begin with the optional byte order mark (BOM); if the "exec" function specifically detects the bytes 0x23 0x21, then the presence of the BOM (0xEF 0xBB 0xBF) before the shebang will prevent the script interpreter from being executed. Some authorities recommend against using the byte order mark in POSIX (Unix-like) scripts,[15] for this reason and for wider interoperability and philosophical concerns
Here is my experience with Visual Studio, Sourcetree and Bitbucket pull requests, which has been giving me some problems:
So it turns out BOM with a signature will include a red dot character on each file when reviewing a pull request (it can be quite annoying).
If you hover on it, it will show a character like "ufeff", but it turns out Sourcetree does not show these types of bytemarks, so it will most likely end up in your pull requests, which should be ok because that's how Visual Studio 2017 encodes new files now, so maybe Bitbucket should ignore this or make it show in another way, more info here:
Red dot marker BitBucket diff view
I save a autohotkey file with utf-8, the chinese characters become strang.
With utf-8 BOM, works fine.
AutoHotkey will not automatically recognize a UTF-8 file unless it begins with a byte order mark.
https://www.autohotkey.com/docs/FAQ.htm#nonascii
UTF with a BOM is better if you use UTF-8 in HTML files and if you use Serbian Cyrillic, Serbian Latin, German, Hungarian or some exotic language on the same page.
That is my opinion (30 years of computing and IT industry).

What are some common character encodings that a text editor should support?

I have a text editor that can load ASCII and Unicode files. It automatically detects the encoding by looking for the BOM at the beginning of the file and/or searching the first 256 bytes for characters > 0x7f.
What other encodings should be supported, and what characteristics would make that encoding easy to auto-detect?
Definitely UTF-8. See http://www.joelonsoftware.com/articles/Unicode.html.
As far as I know, there's no guaranteed way to detect this automatically (although the probability of a mistaken diagnosis can be reduced to a very small amount by scanning).
I don't know about encodings, but make sure it can support the multiple different line ending standards! (\n vs \r\n)
If you haven't checked out Mich Kaplan's blog yet, I suggest doing so: http://blogs.msdn.com/michkap/
Specifically this article may be useful: http://www.siao2.com/2007/04/22/2239345.aspx
There is no way how you can detect an encoding. The best thing you could do is something like IE and depend on letter distributions in different languages, as well as standard characters for a language. But that's a long shot at best.
I would advise getting your hands on some large library of character sets (check out projects like iconv) and make all of those available to the user. But don't bother auto-detecting. Simply allow the user to select his preference of a default charset, which itself would be UTF-8 by default.
Latin-1 (ISO-8859-1) and its Windows extension CP-1252 must definitely be supported for western users. One could argue that UTF-8 is a superior choice, but people often don't have that choice. Chinese users would require GB-18030, and remember there are Japanese, Russians, Greeks too who all have there own encodings beside UTF-8-encoded Unicode.
As for detection, most encodings are not safely detectable. In some (like Latin-1), certain byte values are just invalid. In UTF-8, any byte value can occur, but not every sequence of byte values. In practice, however, you would not do the decoding yourself, but use an encoding/decoding library, try to decode and catch errors. So why not support all encodings that this library supports?
You could also develop heuristics, like decoding for a specific encoding and then test the result for strange characters or character combinations or frequency of such characters. But this would never be safe, and I agree with Vilx- that you shouldn't bother. In my experience, people normally know that a file has a certain encoding, or that only two or three are possible. So if they see you chose the wrong one, they can easily adapt. And have a look at other editors. The most clever solution is not always the best, especially if people are used to other programs.
UTF-16 is not very common in plain text files. UTF-8 is much more common because it is back compatible with ASCII and is specified in standards like XML.
1) Check for BOM of various Unicode encodings. If found, use that encoding.
2) If no BOM, check if file text is valid UTF-8, reading until you reach a sufficient non-ASCII sample (since many files are almost all ASCII but may have a few accented characters or smart quotes) or the file ends. If valid UTF-8, use UTF-8.
3) If not Unicode it's probably current platform default codepage.
4) Some encodings are easy to detect, for example Japanese Shift-JIS will have heavy use of the prefix bytes 0x82 and 0x83 indicating hiragana and katakana.
5) Give user option to change encoding if program's guess turns out to be wrong.
Whatever you do, use more than 256 bytes for a sniff test. It's important to get it right, so why not check the whole doc? Or at least the first 100KB or so.
Try UTF-8 and obvious UTF-16 (lots of alternating 0 bytes), then fall back to the ANSI codepage for the current locale.