I want to send emails in a number of languages (en/es/fr/de/ru/pl). I notice that Gmail uses the KOI8-R charset when sending emails containing Cyrillic characters.
Can I just use KOI8-R for ALL my emails, or is there any reason to select a particular charset for each language?
I would recommend always using UTF-8 nowadays.
Wikipedia on UTF-8:
UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It is able to represent any character in the Unicode standard, yet is backwards compatible with ASCII. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages, and other places where characters are stored or streamed.
Use UTF-8. KOI8-R wouldn't be ideal for non-Russian languages, and changing codesets always tends to be a headache on the receiving side.
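For illustration, here is a minimal sketch using Python's standard email package, with one UTF-8 body covering all six languages. The addresses are placeholders, and this is not a complete mail-sending setup; non-ASCII headers are RFC 2047 encoded when the message is serialized.

from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Hello / Hola / Bonjour / Hallo / Привет / Cześć"
msg["From"] = "sender@example.com"       # placeholder address
msg["To"] = "recipient@example.com"      # placeholder address
msg.set_content(
    "English, español, français, Deutsch, русский, polski\n",
    charset="utf-8",
    cte="base64",        # 7-bit-safe transfer encoding of the UTF-8 body
)

print(msg.as_string())   # charset="utf-8" appears in the Content-Type header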
I'm currently reading mails from a file and processing some of the header information. Non-ASCII characters are encoded according to RFC 2047 in quoted-printable or Base64, so the files contain no non-ASCII characters. If the file is encoded in UTF-8, Windows-1252 or one of the ISO-8859-* character encodings, I won't run into problems, because ASCII is embedded at the same place in all of these charsets (so 0x41 is an 'A' in all of them).
But what if the file is encoded using an encoding that does not embed ASCII in that way? Do encodings like this even exist? And if so, is there even a reliable way of detecting them?
There is a charset detector from Mozilla, based on this very interesting article. It can detect a large number of different encodings. There is also a port to C# available on GitHub, which I have used before; it turned out to be quite reliable. Of course, when the text contains only ASCII characters, it cannot distinguish between the different encodings that encode ASCII the same way, but any encoding that encodes ASCII differently should be detected correctly by this library.
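If you happen to be in Python, the chardet package is a port of the same Mozilla detection algorithm; here is a quick sketch (the sample strings are my own):

# pip install chardet  -- Python port of Mozilla's universal charset detector
import chardet

raw = "Привет, мир! Это тест.".encode("koi8-r")
print(chardet.detect(raw))
# -> a dict like {'encoding': 'KOI8-R', 'confidence': ..., 'language': ...}

# Pure ASCII input is reported as ASCII, which (as noted above) cannot be
# distinguished from any ASCII-compatible superset such as UTF-8 or ISO-8859-1.
print(chardet.detect(b"Just plain ASCII text."))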
The question says it all. Is it possible to transfer a UTF-8 file over FTP using ASCII mode? Or will this cause the characters to be written incorrectly? Thanks!
UTF-8 encoding was designed to be backward compatible with ASCII encoding.
RFC 959 requires FTP clients and servers to treat files in ASCII mode as 8-bit:
3.1.1.1. ASCII TYPE
...
The sender converts the data from an internal character
representation to the standard 8-bit NVT-ASCII
representation (see the Telnet specification). The receiver
will convert the data from the standard form to his own
internal form.
In accordance with the NVT standard, the <CRLF> sequence
should be used where necessary to denote the end of a line
of text. (See the discussion of file structure at the end
of the Section on Data Representation and Storage.)
...
Using the standard NVT-ASCII representation means that data
must be interpreted as 8-bit bytes.
So even a UTF-8-unaware FTP client or server should correctly translate line endings, as these are encoded identically in ASCII and UTF-8, and it should not corrupt the other characters.
From a practical point of view: I haven't encountered a server that has problems with 8-bit text files. I'm Czech, so I regularly work with UTF-8, and in the past with the Windows-1250 and ISO/IEC 8859-2 8-bit encodings.
RFC 2640, from 1999, updates the FTP protocol to support internationalization. It requires FTP servers to use UTF-8 as the transfer encoding in section 2.2. So as long as you aren't trying to upload to a DEC TOPS-20 server (which stores five 7-bit bytes within a 36-bit word), you should be fine.
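As a Python illustration (host, credentials and file name are placeholders): ftplib's storlines() transfers in ASCII mode (TYPE A) and storbinary() in image mode (TYPE I); either should leave the UTF-8 bytes of non-ASCII characters intact.

from ftplib import FTP

with FTP("ftp.example.com") as ftp:          # placeholder server
    ftp.login("user", "password")            # placeholder credentials
    with open("notes-utf8.txt", "rb") as f:
        # ASCII mode (TYPE A): line endings may be translated to CRLF on the
        # wire, but UTF-8 byte sequences for other characters pass through.
        ftp.storlines("STOR notes-utf8.txt", f)
    with open("notes-utf8.txt", "rb") as f:
        # Binary mode (TYPE I): bytes transferred verbatim, the safer choice
        # when no line-ending conversion is wanted at all.
        ftp.storbinary("STOR notes-utf8.txt", f)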
I'm decoding bytestreams into unicode characters without knowing the encoding that's been used by each of a hundred or so senders.
Many of the senders are not technically astute, and will not be able to tell me what encoding they are using. It will be determined by the happenstance of the toolchains they are using to generate the data.
The senders are, for the moment, all UK/English based, using a variety of operating systems.
Can I ask all the senders to send me a particular string of characters that will unambiguously demonstrate what encoding each sender is using?
I understand that there are libraries that use heuristics to guess at the encoding - I'm going to chase that up too, as a runtime fallback, but first I'd like to try and determine what encodings are being used, if I can.
(Don't think it's relevant, but I'm working in Python)
A full answer to this question depends on a lot of factors, such as the range of encodings used by the various upstream systems, how well your users will comply with instructions to type magic character sequences into text fields, and how skilled they are at the obscure keyboard combinations needed to type those sequences.
There are some very easy character sequences which only some users will be able to type. Only users with a Cyrillic keyboard and encoding will find it easy to type "Ильи́ч" (Ilyich), and so you only have to distinguish between the Cyrillic-capable encodings like UTF-8, UTF-16, iso8859_5, and koi8_r. Similarly, you could come up with Japanese, Chinese, and Korean character sequences which distinguish between users of Japanese, simplified Chinese, traditional Chinese, and Korean systems.
But let's concentrate on users of western European computer systems, and the common encodings like ISO-8859-15, Mac_Roman, UTF-8, UTF-16LE, and UTF-16BE. A very simple test is to have users enter the Euro character '€', U+20AC, and see what byte sequence gets generated:
byte ['\xa4'] means iso-8859-15 encoding
bytes ['\xe2', '\x82', '\xac'] mean utf-8 encoding
bytes ['\x20', '\xac'] mean utf-16be encoding
bytes ['\xac', '\x20'] mean utf-16le encoding
byte ['\x80'] means cp1252 ("Windows ANSI") encoding
byte ['\xdb'] means macroman encoding
iso-8859-1 won't be able to represent the Euro character at all. iso-8859-15 is the Euro-supporting successor to iso-8859-1.
U.S. users probably won't know how to type a Euro character. (OK, that's too snarky. 3% of them will know.)
You should check that each of these byte sequences, interpreted as any of the possible encodings, is not a character sequence that users would be likely to type themselves. For instance, the '\xa4' of the iso-8859-15 Euro symbol could also be the iso-8859-1 or cp1252 encoding of '¤', the first byte of the UTF-16LE encoding of '¤', the macroman encoding of '§', or the first byte of any of thousands of UTF-16 characters, such as the U+A4xx Yi Syllables or U+01A4 LATIN SMALL LETTER OI. It would not be a valid first byte of a UTF-8 sequence. If some of your users submit text in Yi, you might have a problem.
The Python 3.x documentation, 7.2.3. Standard Encodings lists the character encodings which the Python standard library can easily handle. The following program lets you see how a test character sequence is encoded into bytes by various encodings:
>>> euro = '\u20ac'   # U+20AC EURO SIGN
>>> for e in ['iso-8859-1', 'iso-8859-15', 'utf-8', 'utf-16be', 'utf-16le',
...           'cp1252', 'macroman']:
...     print(e, euro.encode(e, 'backslashreplace'))
So, as an expedient, satisficing hack, consider telling your users to type a '€' as the first character of a text field, if there are any problems with encoding. Then your system should interpret any of the above byte sequences as an encoding clue, and discard them. If users want to start their text content with a Euro character, they start the field with '€€'; the first gets swallowed, the second remains part of the text.
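A rough sketch of that scheme in Python; the signature table mirrors the byte sequences above, and the names and sample text are mine:

# Map the leading bytes of a submitted field to an encoding guess, then strip
# the Euro marker. Order matters: longer signatures are checked first.
EURO_SIGNATURES = [
    (b"\xe2\x82\xac", "utf-8"),
    (b"\x20\xac", "utf-16-be"),
    (b"\xac\x20", "utf-16-le"),
    (b"\xa4", "iso-8859-15"),
    (b"\x80", "cp1252"),
    (b"\xdb", "mac_roman"),
]

def sniff_and_strip(raw: bytes):
    """Return (guessed_encoding, payload) or (None, raw) if no marker found."""
    for signature, encoding in EURO_SIGNATURES:
        if raw.startswith(signature):
            return encoding, raw[len(signature):]
    return None, raw

encoding, payload = sniff_and_strip("€Zoë".encode("utf-8"))
print(encoding, payload.decode(encoding))   # utf-8 Zoë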
I have a text editor that can load ASCII and Unicode files. It automatically detects the encoding by looking for the BOM at the beginning of the file and/or searching the first 256 bytes for characters > 0x7f.
What other encodings should be supported, and what characteristics would make that encoding easy to auto-detect?
Definitely UTF-8. See http://www.joelonsoftware.com/articles/Unicode.html.
As far as I know, there's no guaranteed way to detect this automatically (although the probability of a mistaken diagnosis can be reduced to a very small amount by scanning).
I don't know about encodings, but make sure it can support the multiple different line ending standards! (\n vs \r\n)
If you haven't checked out Michael Kaplan's blog yet, I suggest doing so: http://blogs.msdn.com/michkap/
Specifically this article may be useful: http://www.siao2.com/2007/04/22/2239345.aspx
There is no reliable way to detect an encoding. The best you could do is something like IE does and rely on letter distributions in different languages, as well as characters that are standard for a language. But that's a long shot at best.
I would advise getting your hands on some large library of character sets (check out projects like iconv) and making all of those available to the user. But don't bother auto-detecting. Simply allow the user to select a default charset, which itself would be UTF-8 by default.
Latin-1 (ISO-8859-1) and its Windows extension CP-1252 must definitely be supported for western users. One could argue that UTF-8 is a superior choice, but people often don't have that choice. Chinese users would require GB18030, and remember there are Japanese, Russian, and Greek users too, who all have their own encodings besides UTF-8-encoded Unicode.
As for detection, most encodings are not safely detectable. In some (like Latin-1), certain byte values are just invalid. In UTF-8, any byte value can occur, but not every sequence of byte values. In practice, however, you would not do the decoding yourself, but use an encoding/decoding library, try to decode and catch errors. So why not support all encodings that this library supports?
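In Python, that "try to decode and catch errors" loop might look like this; the candidate list and its order are only an assumption about your user base:

CANDIDATES = ["utf-8", "cp1252", "gb18030"]

def try_decode(data: bytes):
    """Return (encoding, text) for the first candidate that decodes cleanly."""
    for enc in CANDIDATES:
        try:
            return enc, data.decode(enc)
        except UnicodeDecodeError:
            continue
    # Nothing decoded cleanly: keep the text, marking undecodable bytes.
    return "utf-8", data.decode("utf-8", errors="replace")

print(try_decode("Grüße".encode("cp1252")))   # UTF-8 fails, cp1252 succeeds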
You could also develop heuristics, like decoding with a specific encoding and then testing the result for strange characters or character combinations, or the frequency of such characters. But this would never be safe, and I agree with Vilx- that you shouldn't bother. In my experience, people normally know that a file has a certain encoding, or that only two or three are possible, so if they see you chose the wrong one, they can easily adapt. And have a look at other editors: the cleverest solution is not always the best, especially if people are used to other programs.
UTF-16 is not very common in plain text files. UTF-8 is much more common because it is backward compatible with ASCII and is specified in standards like XML.
1) Check for BOM of various Unicode encodings. If found, use that encoding.
2) If there is no BOM, check whether the file text is valid UTF-8, reading until you reach a sufficient non-ASCII sample (many files are almost entirely ASCII but may have a few accented characters or smart quotes) or the file ends. If it is valid UTF-8, use UTF-8 (see the sketch after this list).
3) If it's not Unicode, it's probably the current platform's default codepage.
4) Some encodings are easy to detect, for example Japanese Shift-JIS will have heavy use of the prefix bytes 0x82 and 0x83 indicating hiragana and katakana.
5) Give user option to change encoding if program's guess turns out to be wrong.
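Here is a minimal sketch of steps 1–3 in Python, using the BOM constants from the codecs module; the cp1252 fallback stands in for "platform default codepage":

import codecs

# 4-byte BOMs must be tested before the UTF-16 BOMs they start with.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    # Step 1: a BOM decides immediately.
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # Step 2: is the sample valid UTF-8?
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass
    # Step 3: fall back to the platform default codepage (placeholder here).
    return fallback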
Whatever you do, use more than 256 bytes for a sniff test. It's important to get it right, so why not check the whole doc? Or at least the first 100KB or so.
Try UTF-8 and obvious UTF-16 (lots of alternating 0 bytes), then fall back to the ANSI codepage for the current locale.
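The "alternating zero bytes" check for UTF-16 could be as crude as this; the 40%/5% thresholds are arbitrary:

def looks_like_utf16(sample: bytes):
    """Mostly-ASCII UTF-16 text has a zero in every other byte."""
    if len(sample) < 4:
        return None
    even, odd = sample[0::2], sample[1::2]
    even_zero_ratio = even.count(0) / len(even)
    odd_zero_ratio = odd.count(0) / len(odd)
    if even_zero_ratio > 0.4 and odd_zero_ratio < 0.05:
        return "utf-16-be"   # high (zero) bytes come first
    if odd_zero_ratio > 0.4 and even_zero_ratio < 0.05:
        return "utf-16-le"
    return None

print(looks_like_utf16("Hello, world".encode("utf-16-le")))   # utf-16-le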
After figuring out how to encode email that will be read by a Japanese user (encoding for ISO-2022-JP and then base64 encoding), I need to figure out how to test that this actually works. I'm not fluent in Japanese. How does one go about testing that the email reads correctly? The message I'm sending would be written by my program in English, to be read by a Japanese user.
If you have a hard time dealing with Japanese characters for test cases, why not test using wide Latin? Wide Latin is the Latin character set in its fullwidth form, with its own code points in ISO-2022 and in Unicode. If you get yourself a Japanese IME, you should be able to input them.
Here are some comparisons:
Hello World // standard ASCII code points
Ｈｅｌｌｏ　Ｗｏｒｌｄ // Japanese wide Latin
Since you will be able to enter and read everything with ease, this strategy might be good for testing.
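If typing them is a hassle, the wide Latin forms can also be generated programmatically; a small Python sketch, relying on the fullwidth block starting at U+FF01:

def to_fullwidth(text: str) -> str:
    """Map printable ASCII to the fullwidth (wide Latin) forms."""
    return "".join(
        "\u3000" if ch == " "                        # IDEOGRAPHIC SPACE
        else chr(ord(ch) + 0xFEE0) if "!" <= ch <= "~"
        else ch
        for ch in text
    )

wide = to_fullwidth("Hello World")
print(wide)                            # Ｈｅｌｌｏ　Ｗｏｒｌｄ
print(wide.encode("iso-2022-jp"))      # starts with the ESC $ B escape sequence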
If you're sending English, you shouldn't need to worry about encoding. ISO-2022-JP starts in ASCII.
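You can confirm that quickly in Python: plain English comes out byte-for-byte identical, since ISO-2022-JP begins in its ASCII state.

print("Hello World".encode("iso-2022-jp") == b"Hello World")   # True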