How does the VBScript FileSystemObject encode characters?

I have this VBScript code:
Set fs = CreateObject("Scripting.FileSystemObject")
Set ts = fs.OpenTextFile("tmp.txt", 2, True)  ' 2 = ForWriting
For i = 128 To 255
    s = Chr(i)
    If LenB(s) <> 2 Then  ' every character of a BSTR should occupy 2 bytes
        WScript.Echo i
        WScript.Quit
    End If
    ts.Write s
Next
ts.Close
On my system, each integer is converted to a two-byte character: every number in that range can be represented by a character, and no number requires more than 2 bytes.
But when I look at the file, I find only 127 bytes.
This answer: https://stackoverflow.com/a/31436726/1335492 suggests that the FSO creates UTF-16 files and inserts a BOM. But the file contains only 127 bytes, and no Byte Order Mark.
How does the FSO decide how to encode text? What encoding allows 8-bit single-byte characters? What encodings do not include 255 8-bit single-byte characters?
(Answers about how FSO reads characters may also be interesting, but that's not what I'm specifically asking here)
Edit: I've limited my question to the high-bit characters, to make it clear what the question is. (Answers about the low-bit characters may also be interesting, but that's not what I'm specifically asking here)

Short Answer:
The file system object maps "Unicode" to "ASCII" using the code page associated with the System Locale. (Chr and ChrW use the User Locale.)
Application:
There may be silent transposition errors between the System code page and the Thread (user) code page. There may also be coding and decoding errors if code points are missing from a code page, or, as with Japanese and UTF-8, the code pages contain multi-byte characters.
VBScript provides no native method to detect the User, Thread, or System code page. The Thread (user) code page may be inferred from the Locale set by SetLocale or returned by GetLocale (there is a list here: https://www.science.co.il/language/Locale-codes.php), but there does not appear to be any MS documentation. On Win2K+, WMI may be used to query the System code page. The CHCP command queries and changes the OEM code page, which is neither the User nor the System code page.
The System code page may be spoofed by an application manifest. There is no way for an application (such as cscript or wscript) or a script (such as VBScript or JScript) to change the System Locale it runs under, except by creating a new process with a new manifest, or by rebooting the system after making a registry change.
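If you can step outside pure VBScript, the System and OEM code pages are one Win32 call away. A minimal Python sketch, assuming a Windows machine with Python available (GetACP and GetOEMCP are documented kernel32 calls):

# Sketch: query the ANSI (system) and OEM code pages on Windows.
import ctypes

ansi_cp = ctypes.windll.kernel32.GetACP()    # code page of the System Locale ("A" APIs)
oem_cp = ctypes.windll.kernel32.GetOEMCP()   # code page used by console I/O (what CHCP reports)
print("ANSI code page:", ansi_cp)            # e.g. 1252 on a US/Western European system
print("OEM code page:", oem_cp)              # e.g. 437 or 850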
In detail:
s = Chr(i)
'creates a Unicode string, using the Thread Locale code page.
Byte values that do not exist as characters in the code page are mapped to control characters: 127 becomes U+007F (a standard Unicode control character) and 129 becomes U+0081 (a code point in a Unicode control-character range), while 128 becomes U+20AC (the Euro symbol). In VBScript, the Thread Locale can be set and read with SetLocale and GetLocale.
createobject("Scripting.FileSystemObject").OpenTextFile(strOutFile, 2, True).write s
'creates a 'code page' string, using the System Locale Codepage.
There are two ways that Windows can handle Unicode values it can't map: it can either map to a default character, or return an error. "Scripting.FileSystemObject" uses the error setting, and throws an exception.
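Both halves can be mimicked with Python's codec machinery, which exposes the same code-page tables. A minimal sketch, assuming cp1252 as both the thread and system code page (note that Python's cp1252 codec raises on undefined bytes such as 0x81, where the Windows conversion silently yields the control character U+0081):

# Sketch: Chr-style decoding (code-page byte -> Unicode)...
assert bytes([0x80]).decode("cp1252") == "\u20ac"      # 128 -> Euro sign
try:
    bytes([0x81]).decode("cp1252")
except UnicodeDecodeError:
    print("0x81 is undefined in cp1252; Windows maps it to U+0081 instead")

# ...and FSO-style encoding (Unicode -> code-page byte). The FSO uses the
# "error" strategy rather than substituting a default character.
snowman = "\u2603"                                     # SNOWMAN, not in cp1252
print(snowman.encode("cp1252", errors="replace"))      # b'?': the default-character strategy
try:
    snowman.encode("cp1252")                           # strict: the strategy the FSO uses
except UnicodeEncodeError as e:
    print("FSO-style failure:", e)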
In More Detail:
The Thread Locale is, by default, the User Locale, which is the date and time format setting in the "Region and Language" control panel applet (called different things in different versions of Windows). It has an associated code page. According to MS internationalization expert Michka (Michael Kaplan, RIP), the reason it has a code page is so that months and days of the week can be written in appropriate characters, and it should not be used for any other purpose.
The ASP-classic people clearly had other ideas, since Response.CodePage is thread-locale, and can be controlled by vbscript GetLocale and SetLocale amongst other methods. If the User Locale is changed, all processes are notified, and any thread that is using the default value updates. (I haven't tested what happens to a thread currently using a non-default value).
The System Locale is also called "Language for non-Unicode programs" and is also found in the "Region and Language" applet, but requires a reboot to change. This is the value used internally by Windows ("The System") to map between the "A" API and the "W" API. Changing this has no effect on the language of the Windows GUI (which is not a "non-Unicode program").
Assuming that the "Time and Date" setting matches the "Language for non-Unicode programs", any Chr(i) that can create a valid Unicode code point (see "mapping errors" below), will map back exactly from Unicode to "code page". Note that this does work for code points that are "control characters": also note that it doesn't work the other way: UTF-CodePage-UTF doesn't always round-trip exactly. Famously (Character,Modifer)-CodePage-(Complex Character) does not round-trip correctly, where Unicode defines more than one way of constructing a language character representation.
If the "Time and Date" does not match the "Language for non-Unicode programs", any translation could take place, for example U+0101 is 0xE0 on cp28594 and 0xE2 on cp28603: Chr(224) would go through U+0101 to be written as 226.
Even if there are no transposition errors, if the "Time and Date" does not match the "Language for non-Unicode programs", the program may fail when translating to the System Locale: if the Unicode code point does not have a matching code-page code point, there will be an exception from the FileSystemObject.
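Both failure modes are easy to check with any codec library. A Python sketch (code pages 28594 and 28603 are exposed in Python as iso8859-4 and iso8859-13; cp1252 stands in for a system code page that lacks the character):

# Sketch: transposition between two code pages, and the missing-code-point error.
assert "\u0101".encode("iso8859-4") == b"\xe0"    # cp28594
assert "\u0101".encode("iso8859-13") == b"\xe2"   # cp28603

# Chr(224) under one locale, written under the other, comes out as byte 226:
print(b"\xe0".decode("iso8859-4").encode("iso8859-13"))  # b'\xe2'

# And if the system code page has no code point for the character at all:
try:
    "\u0101".encode("cp1252")
except UnicodeEncodeError:
    print("no cp1252 code point for U+0101: the FSO would throw here")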
There may also be mapping errors at Chr(i), going from code page to Unicode. Code page 1041 (Japanese) is a double-byte code page (probably Shift JIS). 0x81 is (only) the first byte of a double-byte pair. To be consistent with other code pages, 0x81 should map to the control character U+0081, but when given 81 and code page 1041, Windows assumes that the next byte in the buffer, or in the BSTR, is the second byte of the double-byte pair (I've not determined if the mistake is made before or after the conversion). Chr(&H81) is mapped to U+xx81 (81,xx). When I did it, I got U+4581, which is a CJK Unified Ideograph (Brasenia purpurea): it's not mapped by code page 1041.
Mapping errors at Chr(i) do not cause VBScript exceptions at the point of creation. If the UTF-16 code point created is invalid or not on the System Locale code page, there will be a FileSystemObject exception at .Write. This particular problem can be avoided by using ChrW(i) instead of Chr(i). On code page 1041, ChrW(129) becomes the Unicode control character U+0081 instead of xx81.
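The double-byte lead-byte behaviour can likewise be demonstrated with Python's cp932 codec (the Shift-JIS-based Japanese Windows code page); note that Python raises on a lone lead byte, where the Windows conversion described above consumes whatever byte follows:

# Sketch: 0x81 is only a lead byte in the Japanese DBCS code page.
try:
    b"\x81".decode("cp932")            # a lone lead byte is not a character
except UnicodeDecodeError:
    print("0x81 alone is an incomplete sequence in cp932")

print(b"\x81\x45".decode("cp932"))     # lead byte + trail byte = one character (a middle dot)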
Background:
A program can map between Unicode and "codepage" using any installed code page: the Windows functions MultiByteToWideChar and WideCharToMultiByte take [UINT CodePage] as the first parameter. That mechanism is used internally in Windows to map the "A" API to the "W" API, for example GetAddressByNameA and GetAddressByNameW. Windows is "W", (wide, 16 bit) internally, and "A" strings are mapped to "W" strings on call, and back from "W" to "A" on return. When Windows does the mapping, it uses the code page associated with the "System Locale", also called "Language for non-Unicode programs".
The Windows API function WriteFile writes bytes, not characters, so it's not an "A" or "W" function. Any program that uses it has to handle the conversion between strings and bytes. The C function fwrite writes characters, so it can handle 16-bit characters, but it has no way of handling variable-length code points like UTF-8 or UTF-16: again, any program that uses fwrite has to handle the conversion between strings and words.
The C++ function fwrite can handle UTF, and the compiler function _fwrite does magic that depends on the compiler. Presumably, on Windows, if code page translation is required the MultiByteToWideChar and WideCharToMultiByte API is used.
The "A" code pages and the "A" API were called "ANSI" or "ASCII" or "OEM", and started out as 8 bit characters, then grew to double-byte characters, and have now grown to UTF-8 (1..3 bytes). The "W" API started out as 16 bit characters, then grew to UTF-16 (1..6 bytes). Both are multi-word character encodings: the distinction is that for the "A" API and code pages, the word length is 8 bits: for the "W" API and UTF-16, the word length is 16 bits. Because they are both multi-byte mappings, and because "byte" and "word" and "char" and "character" mean different things in different contexts, and because "W" and particularly "A" mean different things than they did years ago, I've just use "A" and "W" and "code page" and "Unicode".
"OEM" is the code page associated with another locale: The Console I/O API. It is per-process (it's a thread locale), it can be changed dynamically (using the CHCP command) and its default value is set at installation: there is no GUI provided to change the value stored in the registry. Most console programs don't use the console I/O API, and as written, use either the system locale, or the user locale, or, (sometimes inadvertently), a mixture of both.
The System Locale can be spoofed by using a manifest and there was a WinXP utility called "AppLocale" that did the same thing.

The FSO decides how to encode text when the file is opened. Use the format argument as follows:
Set ts = fs.OpenTextFile("tmp.txt", 2, True, -1)
' ↑↑
Resource: OpenTextFile Method
Syntax
object.OpenTextFile(filename[, iomode[, create[, format]]])
Arguments
object - Required. Always the name of a FileSystemObject.
filename - Required. String expression that identifies the file to open.
iomode - Optional. Can be one of three constants: ForReading (1), ForWriting (2), or ForAppending (8).
create - Optional. Boolean value that indicates whether a new file can be created if the specified filename doesn't exist. The value is True if a new file is created, False if it isn't created. If omitted, a new file isn't created.
format - Optional. One of three Tristate values used to indicate the format of the opened file:
TristateTrue = -1 to open the file as Unicode,
TristateFalse = 0 to open the file as ASCII,
TristateUseDefault = -2 to open the file using the system default.
If omitted, the file is opened as ASCII.
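Note that "Unicode" here means little-endian UTF-16 with a BOM, so each character in the 128-255 range becomes two bytes, after a two-byte BOM at the start of the file. A quick Python illustration of the bytes to expect (not the FSO itself, just the same encoding):

# Sketch: the byte layout the FSO's Unicode format (TristateTrue) produces.
data = "".join(chr(i) for i in range(128, 256))   # the characters from the question

encoded = b"\xff\xfe" + data.encode("utf-16-le")  # BOM + two bytes per character
print(len(encoded))                               # 258 = 2 (BOM) + 128 * 2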

Related

Can someone explain the sequence of events that occurs in the encoding/decoding process?

I'm trying to solidify my understanding of encoding and decoding. I'm not sure how the sequence of events works in different settings:
When I type on my computer, is the computer (or whatever program I'm in) automatically decoding my letters in UTF-8 (or whatever encoding is used)?
When I save a file, is it automatically saved using the encoding standard that was used to decode my text? Let's say I send that document or dataset over to someone: am I sending them a bunch of 1s and 0s? And then their decoder decodes it based on whatever default or encoding standard they specify?
How do code points play into this? Does my computer also have a default code point dictionary it uses?
If the above is true, how do I find out what kind of decoding/encoding my computer/program is using?
Sorry if this isn't clear, or if I'm misunderstanding/using terminology incorrectly.
There are a few ways that this can work, but here is one possibility.
First, yes, in a way, the computer "decodes" each letter you type into some encoding. Each time you press a key on your keyboard, you close a circuit, which signals to other hardware in your computer (e.g., a keyboard controller) that a key was pressed. This hardware then populates a buffer with information about the keyboard event (key up, key down, key repeat) and sends an interrupt to the CPU.
When the CPU receives the interrupt, it jumps to a hardware-defined location in memory and begins executing the code it finds there. This code often will examine which device sent the interrupt and then jump to some other location that has code to handle an interrupt sent by the particular device. This code will then read a "scan code" from the buffer on the device to determine which key event occurred.
The operating system then processes the scan code and delivers it to the application that is waiting for keyboard input. One way it can do this is by populating a buffer with the UTF-8-encoded character that corresponds to the key (or keys) that was pressed. The application would then read the buffer when it receives control back from the operating system.
To answer your second question, we first have to remember what happens as you enter data into your file. As you type, your application receives the letters (perhaps UTF-8-encoded, as mentioned above) corresponding to the keys that you press. Now, your application will need to keep track of which letters it has received so that it can later save the data you've entered to a file. One way that it can do this is by allocating a buffer when the program is started and then copying each character into the buffer as it is received. If the characters are delivered from the OS UTF-8-encoded, then your application could simply copy those bytes to the other buffer. As you continue typing, your buffer will continue to be populated by the characters that are delivered by the OS.
When it's time to save your file, your application can ask the OS to write the contents of the buffer to a file or to send them over the network. Device drivers for your disk or network interface know how to send this data to the appropriate hardware device. For example, to write to a disk, you may have to write your data to a buffer, write to a register on the disk controller to signal that it should write the data in the buffer to the disk, and then repeatedly read from another register on the disk controller to check whether the write is complete.
Third, Unicode defines a code point for each character. Each code point can be encoded in more than one way. For example, the code point U+004D ("Latin capital letter M") can be encoded in UTF-8 as 0x4D, in UTF-16 as 0x004D, or in UTF-32 as 0x0000004D (see Table 3-4 in The Unicode Standard). If you have data in memory, then it is encoded using some encoding, and there are libraries available that can convert from one encoding to another.
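You can verify that example with a few lines of Python, which exposes all three encodings directly (big-endian variants chosen so the hex output matches the printed forms):

# Sketch: one code point, three encodings (cf. Table 3-4 of the standard).
m = "\u004D"
print(m.encode("utf-8").hex())      # 4d
print(m.encode("utf-16-be").hex())  # 004d
print(m.encode("utf-32-be").hex())  # 0000004d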
Finally, you can find out how your computer processes keyboard input by examining the device drivers. You could start by looking at some Linux drivers, as many are open source. Each program, however, can encode and decode data however it chooses to. You would have to examine the code for each individual program to understand how its encoding and decoding works.
This is a complex question, partly because it depends on many things.
When I type on my computer, is the computer (or whatever program I'm in) automatically decoding my letters in UTF-8 (or whatever encoding is used)?
This is very complex. Some programs read the raw keyboard codes (e.g. games), but most programs use operating system services to interpret them (taking various keyboard layouts into account, and modifying the result according to Shift, Control, etc.).
So which encoding you get depends on the operating system and on the program. For terminal programs, the locale of the process also includes the encoding of stdin/stdout (standard input and standard output). For graphical interfaces, you may get a different encoding (according to the system encoding).
Note that UTF-8 is an encoding, so "decoding my letters in UTF-8" uses the wrong word: the keyboard input is encoded into UTF-8.
When I save a file, is it automatically saved using the encoding standard that was used to decode my text? Let's say I send that document or dataset over to someone: am I sending them a bunch of 1s and 0s? And then their decoder decodes it based on whatever default or encoding standard they specify?
This is the complex part. Many systems and computer languages are old, so they were designed with just one system encoding, e.g. the C language. So there is not really any decoding: programs use the encoding directly and hard-code the fact that the letter A has a specific numeric value. For computers, only the numeric value matters. Things are interpreted only when data is printed, and in a complex way (fonts, character size, ligatures, line breaking, ...). [Also, if you use string functions, you explicitly tell the program to treat the numbers as a string of characters.]
Some languages introduced a decoding step (as did HTML: you view a page generated by an external machine, so the system encoding is no longer necessarily the same): internally a program has one single way to represent a string (e.g. Unicode code points), but to get such a uniform format the strings must be decoded on the way in. The benefit is that the program can then handle many different encodings, and is not restricted to the encoding of the system.
If you save a file, it will contain a sequence of bytes. To interpret (also known as decode) it, you need to know which encoding the file is in. In general you should either know it or be given out-of-band information ("the following file is UTF-8"), e.g. in HTTP headers, in the file extension, in the field definition of a database, and so on. Some systems (Microsoft Windows) use a BOM (byte order mark) to distinguish between UTF-16LE, UTF-16BE, UTF-8, and the old system encoding (which some people call ANSI, but it is not ANSI, and it could be any of many different code pages).
The decoder usually should know the encoding; failing that, it either uses a default or guesses. HTML defines a list of steps to perform to get an estimate. The BOM method above can help, and some tools check for common combinations of characters (in various languages). But this is still guesswork: without a BOM or out-of-band data we can only estimate, and we often get it wrong.
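A minimal sketch of that BOM check in Python (the cp1252 fallback is an assumption; a real reader would take it from configuration or other out-of-band data):

# Sketch: guess a file's encoding from its BOM, else fall back to a default.
import codecs

def sniff_encoding(raw: bytes, fallback: str = "cp1252") -> str:
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"        # decodes and strips the BOM
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"           # uses the BOM to pick the byte order
    return fallback               # no BOM: we can only guess

with open("some_file.txt", "rb") as f:   # hypothetical file name
    raw = f.read()
print(raw.decode(sniff_encoding(raw)))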
How do code points play into this? Does my computer also have a default code point dictionary it uses?
Code points are the basis of Unicode. Every "character" has a code point: a fixed number, with a description. This is abstract. In UTF-32 you use the same number for the encoding (as a 32-bit integer); in every other encoding there is a function (or a map) from code points to encoded values (and a way back). A code point is just a numeric value that describes the semantics (the meaning) of a character. To transmit such information we usually need an encoding (or just an escape sequence; e.g. the text "U+FEFF" represents the BOM character).
If the above is true, how do I find out what kind of decoding/encoding my computer/program is using?
Nobody can answer this in general: your computer uses a lot of encodings.
macOS, Unix, POSIX systems: modern systems (outside the root account) will probably use UTF-8. Root will probably use just ASCII (7-bit).
Windows: internally it often uses UTF-16. The output depends on the program, but it is nearly always an 8-bit encoding (so not UTF-16). Windows can read and write several encodings. You can ask the system for the default encoding (but programs may still write UTF-8 or another encoding if they want). The terminal and its settings may give different programs different default encodings.
For this reason, if you program on Windows, you should explicitly save files as UTF-8 (my recommendation), possibly with a BOM (though if you need interoperability with non-Windows machines, omit the BOM; in that case you must simply already know that such files are UTF-8).

How do I create a character set like ASCII?

I'm curious about how character sets were implemented in the past, and I want to know how I can implement a character set of my own.
ASCII (American Standard Code for Information Interchange) was the "original" character set, and it remains the basis for most text data. ASCII is actually a 7-bit code (the numeric values range from 0 to 127), with the most significant bit of a byte indicating whether the rest of the byte refers to ASCII (if zero) or to the current code page.
Extra (non-ASCII) characters were then added to these code pages, and the user's computer would load a specific code page to use. Unfortunately this meant that you needed to load the correct code page before viewing a file, or the wrong characters would appear.
We have now moved on, and most systems use Unicode, which has variable-length characters (rather than the single bytes used previously) and can contain thousands upon thousands of characters, allowing a single encoding to cater for what would have been multiple code pages under the ASCII+code-page method of old.
That's the brief history. As to how to create your own character set, I'm not sure what you are trying to achieve. You can create your own fonts, but if you're talking about an actual character set (i.e. characters that do not already exist) then you'll have to get your character set added to a standard such as Unicode so that other computers can make use of your new characters, which would be a considerable amount of work (and I have no idea how you'd even go about it). It's worth considering, however, that almost every character in existence already exists in Unicode, so you may want to review what's already been done before you take on a mammoth undertaking such as creating an entirely new character set.

(Tcl) what character encoding set should I use?

So I'm trying to open and parse some old Visual Studio compilation log files with Tcl; my only problem is the files are in a strange encoding. Upon examining them with Notepad++ it seems they are in the 'UCS-2 Little Endian' encoding. Two questions:
Is there any command in Tcl that allows me to look at the character encoding of a file? I know there is encoding system which tells me the system encoding.
Using encoding names Tcl tells me the available encoding names are the following list:
cp860 cp861 cp862 cp863 tis-620 cp864 cp865 cp866 gb12345 gb2312-raw cp949 cp950 cp869 dingbats ksc5601 macCentEuro cp874 macUkraine jis0201 gb2312 euc-cn euc-jp macThai iso8859-10 jis0208 iso2022-jp macIceland iso2022 iso8859-13 jis0212 iso8859-14 iso8859-15 cp737 iso8859-16 big5 euc-kr macRomania macTurkish gb1988 iso2022-kr macGreek ascii cp437 macRoman iso8859-1 iso8859-2 iso8859-3 macCroatian koi8-r iso8859-4 ebcdic iso8859-5 cp1250 macCyrillic iso8859-6 cp1251 macDingbats koi8-u iso8859-7 cp1252 iso8859-8 cp1253 iso8859-9 cp1254 cp1255 cp850 cp1256 cp932 identity cp1257 cp852 macJapan cp1258 shiftjis utf-8 cp855 cp936 symbol cp775 unicode cp857
Given this, what would be the appropriate name to use in the fconfigure -encoding command to read these UCS-2 Little Endian encoded files and convert them to UTF-8 for use? If I understand the fconfigure command correctly, I need to specify the encoding of the source file rather than what I want it to be; I just don't know which of the options in the above list corresponds to UCS-2 Little Endian. After reading a little bit, I see that UCS-2 is a predecessor of the UTF-16 character encoding, but that option isn't here either.
Thanks!
I'm afraid that, currently, there's no way to do it just by using fconfigure -encoding ?something?: the unicode encoding has a rather moot meaning, and there's a feature request to create explicit support for UTF-16 variants.
What could you do about it?
Since unicode in Tcl running on Windows should mean UTF-16 with native endianness [1] (little-endian on Wintel), if your solution is supposed to be a quick and dirty one, just try using -encoding unicode and see if that helps.
If you're aiming at a more bullet-proof, future-proof or cross-platform solution, I'd switch the channel to binary mode, read the contents in chunks of two bytes at a time, and then use
binary scan $twoBytes s n
to scan the sequence of two bytes in $twoBytes as a 16-bit integer into a variable named "n", followed by something like
set c [format %c $n]
to produce a unicode character out of the number in $n, and assign it to a variable.
This approach admittedly requires a bit more trickery to get right (a sketch follows this list):
You might check the very first character obtained from the stream to see if it's a byte order mark, and drop it if it is.
If you need to process the stream in a line-wise manner, you'd have to implement a little state machine that handles the CR+LF sequences correctly.
When doing your read $channelId 2 to get the next character, you should check that it returned not just 0 or 2 bytes, but also 1 (in case the file happens to be corrupted), and handle this.
The UCS-2 encoding differs from UTF-16 in that the latter might contain so-called surrogate pairs, and hence it is not a fixed-length encoding. Handling a UTF-16 stream properly therefore also implies detecting those surrogate pairs. On the other hand, I hardly believe a compilation log produced by MSVS would contain them, so I'd just assume the file is encoded in UCS-2LE.
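The same logic, sketched in Python for illustration ("build.log" is a hypothetical file name): it reads two bytes at a time, checks for a BOM, and handles the odd-trailing-byte case:

# Sketch: read a UCS-2LE file two bytes at a time.
import struct

chars = []
with open("build.log", "rb") as f:        # hypothetical MSVS log file
    while True:
        pair = f.read(2)
        if len(pair) == 0:
            break                          # clean end of file
        if len(pair) == 1:
            raise ValueError("odd trailing byte: file truncated or corrupted?")
        (n,) = struct.unpack("<H", pair)   # one little-endian 16-bit integer
        chars.append(chr(n))

if chars and chars[0] == "\ufeff":         # drop the byte order mark, if any
    chars.pop(0)
text = "".join(chars)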
[1] The true story is that the only thing Tcl guarantees about the textual strings it handles (that is, those obtained by manipulating text, not via binary format or encoding convertto or reading a stream in binary mode) is that they're Unicode (or, rather, the "BMP" part of it).
But technically, the interpreter might switch the internal representation of any string between the UTF-8 encoding it uses by default and some fixed-length encoding, which is what is referred to by that name "unicode". The "problem" is that no part of the Tcl documentation specifies that internal fixed-length encoding, because you're required to explicitly convert any text you output or read to/from some specific encoding: either via configuring the stream, or using encoding convertfrom and encoding convertto, or using binary format and binary scan. The interpreter will do the right thing no matter which precise encoding it's currently using for your source string value; it's all transparent. Moreover, the next release of the "standard" Tcl interpreter might decide to drop this internal feature completely, or, say, use 32-bit or 64-bit integers for that internal fixed-length encoding. Whatever "non-standard" interpreters do (like Jacl etc.) is also up to them. In other words, this feature is internal and is not part of the documented contract about the interpreter's behaviour. And by the way, the "standard" encoding for Tcl strings (UTF-8) is not specified as such either; it's just an implementation detail.
In Tcl v8.6.8 I could solve the same issue with fconfigure channelId -encoding unicode.

Where can I find a good introduction to character encoding?

I have to write some code working with character encoding. Is there a good introduction to the subject to get me started?
First posted at What every developer should know about character encoding.
If you write code that touches a text file, you probably need this.
Let's start off with two key items:
1. Unicode does not solve this issue for us (yet).
2. Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.
And let's add a codicil to this: most Americans can get by without having to take this into account, most of the time. Because the characters for the first 127 bytes in the vast majority of encoding schemes map to the same set of characters (more accurately called glyphs), and because we only use A-Z without any other characters, accents, etc., we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside the first 127, the trouble starts.
The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked out as 8 bits, or we might have had fewer than 256 values for each character. There were of course numerous character sets (or code pages) developed early on. But we ended up with most everyone using a standard set of code pages where the first 127 bytes were identical on all and the second half was unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
And then for Asia, because 256 characters were not enough, some of the range 128 – 255 had what was called DBCS (double byte character sets). For each value of a first byte (in these higher ranges), the second byte then identified one of 256 characters. This gave a total of 128 * 256 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS codepage.
And for a while this worked well. Operating systems, applications, etc. were mostly set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country: that broke the paradigm.
Fast forward to today. The two file formats where we can explain this best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, then most programs assume it is UTF-8, but that is not a standard and not universally followed. If the encoding is not specified and the program reading the file guesses wrong, the file will be misread.
Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.
Now let's look at UTF-8, because as the prevailing standard, and because of the way it works, it gets people into a lot of trouble. UTF-8 was popular for two reasons. First, it matched the standard code pages for the first 127 characters, and so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible, which mattered a lot back when it was designed and many people were still using dial-up modems.
UTF-8 borrowed from the DBCS designs of the Asian code pages. The first 128 values are all single-byte representations of characters. Then, for the next most common set, it uses a block in the second 128 values as the start of a double-byte sequence, giving us more characters. But wait, there's more. For the less common characters there is a first byte that leads to a series of second bytes, and those in turn lead to a third byte, so that the three bytes together define the character. This goes up to 4-byte sequences (the original design allowed up to 6). Using this MBCS (multi-byte character set) approach you can write the equivalent of every Unicode character, and, assuming what you are writing is not a list of seldom-used Chinese characters, do it in fewer bytes.
But here is what everyone trips over: they have an HTML or XML file, it works fine, and they open it up in a text editor. They then add a character, such as ß, which their text editor inserts using the code page for their region, and save the file. Of course it must be correct: their text editor shows it correctly. But feed it to any program that reads according to the declared encoding, and that byte is now the first byte of a 2-byte sequence. You either get a different character or, if the second byte is not a legal value for that first byte, an error.
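The ß example is easy to reproduce. A Python sketch, assuming the editor saved the byte using code page 1252 while the consuming program expects UTF-8:

# Sketch: one byte saved under cp1252, then (mis)read as UTF-8.
raw = "ß".encode("cp1252")   # b'\xdf': a single byte in the editor's code page

try:
    raw.decode("utf-8")      # 0xDF is a UTF-8 lead byte expecting a trail byte
except UnicodeDecodeError as e:
    print("exactly the failure described above:", e)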
Point 2 – Always create HTML and XML in a program that writes it out correctly using the encoding. If you must create it with a text editor, then view the final file in a browser.
Now, what about when the code you are writing will read or write a file? We are not talking about binary/data files, where you write in your own format, but about files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example: your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.
Here's a key point about these text files – every program is still using an encoding. It may not be setting it in code, but by definition an encoding is being used.
Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
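In Python terms, Point 3 amounts to never opening a text file without an explicit encoding argument. A small sketch (the file name is just an example):

# Sketch: be explicit about the encoding on every text read and write.
with open("notes.txt", "w", encoding="utf-8") as f:   # hypothetical file
    f.write("naïve text with non-ASCII: ß €\n")

with open("notes.txt", "r", encoding="utf-8") as f:   # same encoding to read it back
    print(f.read())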
Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the meta data and you can't get it wrong. (it also adds the endian preamble to the file.)
Ok, you're reading and writing files correctly, but what about inside your code? This is where it's easy: Unicode. That's what those encoders created in the Java and .NET runtimes are designed to do. You read in and get Unicode. You write Unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right, because languages today don't give you much choice in the matter.
Point 5 – (For developers in languages that have been around a while) – Always use Unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes; memory is cheap and you have more important things to do.
Wrapping it up
I think there are two key items to keep in mind here. First, make sure you are taking the encoding into account on text files. Second, this is actually all very easy and straightforward. People rarely screw up how to use an encoding; it's when they ignore the issue that they get into trouble.
From Joel Spolsky
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
http://www.joelonsoftware.com/articles/Unicode.html
As usual, Wikipedia is a good starting point: http://en.wikipedia.org/wiki/Character_encoding
I have a very basic introduction on my blog, which also includes links to in-depth resources if you REALLY want to dig into the subject matter.
http://www.dotnetnoob.com/2011/12/introduction-to-character-encoding.html

What encoding Win32 API functions expect?

For example, the MessageBox function has LPCTSTR-typed arguments for the text and caption, which is a pointer to wchar_t or to char depending on whether _UNICODE or _MBCS is defined, respectively.
How does the MessageBox function interpret those strings? As which encoding?
The only explanation I managed to find is this:
http://msdn.microsoft.com/en-us/library/cwe8bzh0(VS.90).aspx
But it doesn't say anything about the encoding, just that in the case of _UNICODE one character takes up one wchar (which is 16-bit on Windows), and that in the case of _MBCS a character takes one or two chars (8-bit).
So are those Microsoft's versions of UTF-8 and UTF-16 that ignore anything that has to be encoded in 3 or 4 bytes in the case of UTF-8, and anything that has to be encoded in 4 bytes in the case of UTF-16? And is there a way to show anything outside the Basic Multilingual Plane of Unicode with MessageBox?
There are normally two different implementations of each function:
MessageBoxA, which accepts ANSI strings
MessageBoxW, which accepts Unicode strings
Here, 'ANSI' means the multi-byte code page currently assigned to the process. This varies according to the user's preferences and locale setting, although Win32 API functions such as WideCharToMultiByte can be counted on to do the right conversion, and the GetACP function will tell you the code page in use. MSDN explains the ANSI code page and how it interacts with Unicode.
'Unicode' generally means UCS-2; that is, support for characters above 0xFFFF isn't consistent. I haven't tried this, but UI functions such as MessageBox in recent versions (> Windows 2000) should support characters outside the BMP.
The ...A functions are obsolete and only wrap the ...W functions. The former were required for compatibility with Windows 9x, but since that is not used any more, you should avoid them at all costs and use the ...W functions exclusively. They require UTF-16 strings, the only native Windows encoding. All modern Windows versions should support non-BMP characters quite well (if there is a font that has these characters, of course).
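To answer the last part of the question directly: yes, calling MessageBoxW with a string containing a surrogate pair displays a character outside the BMP on any reasonably modern Windows (font permitting). A ctypes sketch in Python, which marshals its strings as UTF-16 for "W" functions (assumes a Windows machine):

# Sketch: show a non-BMP character (U+1F600) via MessageBoxW.
import ctypes

text = "Outside the BMP: \U0001F600"   # stored as a surrogate pair in UTF-16
ctypes.windll.user32.MessageBoxW(None, text, "Unicode test", 0)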