I'm currently reading mails from files and processing some of the header information. Non-ASCII characters are encoded according to RFC 2047 in quoted-printable or Base64, so the files contain no non-ASCII characters. If the file is encoded in UTF-8, Windows-1252 or one of the ISO-8859-* character encodings, I won't run into problems, because ASCII is embedded at the same positions in all of these charsets (so 0x41 is an 'A' in all of them).
But what if the file is encoded using an encoding that does not embed ASCII in that way? Do encodings like this even exist? And if so, is there even a reliable way of detecting them?
There is a charset detector from Mozilla based on this very interesting article. It can detect a large number of different encodings. There is also a port to C# available on GitHub, which I have used before; it turned out to be quite reliable. Of course, when the text contains only ASCII characters, it cannot distinguish between the different encodings that encode ASCII in the same way, but any encoding that encodes ASCII differently should be detected correctly by this library.
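For illustration only (this is not the C# port mentioned above): Python's chardet package is a port of the same Mozilla detection approach, and a minimal sketch of using it could look like this. The file name is a placeholder.

import chardet  # pip install chardet; a Python port of Mozilla's universal charset detector

with open("message.eml", "rb") as f:  # placeholder file name
    raw = f.read()

guess = chardet.detect(raw)
print(guess)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

# Pure-ASCII input cannot be told apart from the many ASCII-compatible charsets,
# so the detector will usually just report 'ascii' for such files.
text = raw.decode(guess["encoding"] or "ascii", errors="replace")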
I am a very stupid Programme Manager and I have a client requesting us to send in either ASCII or ANSI encoding format.
Our programmers have used Unicode (UTF-16), so my question is whether Unicode (UTF-16) is compatible with ASCII or ANSI. Or am I understanding this incorrectly? Do we need to change the encoding?
We haven't tried anything yet.
In short: ASCII encoding contains 128 characters. An ANSI code page (such as Windows-1252) contains 256 characters. UTF-16 encoding has the capacity for 1,112,064 character codes. There is some nuance, such as the number of bytes used to store each character, but I don't think that is relevant here.
You can certainly convert a UTF-16 document down to ANSI or ASCII encoding, but any characters that fall outside the target character set will be lost (typically replaced with a substitute character such as '?', or dropped, depending on the conversion settings).
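A rough Python sketch of that loss (illustrative only; the sample string and encodings are assumptions, not your client's actual data):

text = "Résumé costs €20"   # hypothetical data containing non-ASCII characters

ascii_bytes = text.encode("ascii", errors="replace")
print(ascii_bytes)          # b'R?sum? costs ?20' -- the accents and the euro sign are gone

ansi_bytes = text.encode("cp1252", errors="replace")
print(ansi_bytes)           # é and € happen to survive in Windows-1252; many other characters would not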
For you, as a manager, there are some questions. At minimum:
Why does the client need this particular encoding? Can it be accommodated in some other way?
Are any characters in your data beyond the scope of ASCII/ANSI? Most (all?) programming languages provide a way to retrieve an integer representation of a character and determine whether it is beyond the range of the desired encoding. This can be leveraged to discover how many instances exist of characters that are not compatible with the desired encoding (a small sketch follows below).
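For example, a Python sketch of that check (the file name and source encoding are assumptions):

# Count the characters that would not survive conversion to the target encoding.
def incompatible_chars(text, target_encoding="ascii"):
    bad = {}
    for ch in text:
        try:
            ch.encode(target_encoding)
        except UnicodeEncodeError:
            bad[ch] = bad.get(ch, 0) + 1
    return bad

with open("export.txt", encoding="utf-16") as f:   # hypothetical UTF-16 export from your system
    problems = incompatible_chars(f.read(), "ascii")

print(problems)   # e.g. {'€': 3, 'é': 12}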
I'm importing data from flat files (text files). I do not know which encoding they will use; it may be Unicode, or it may be ASCII. What happens if I just choose "Unicode string [DT_WSTR]" (Or unicode data) in my integration package. Would it be able to read ASCII without issues? I am using SSIS 2012.
What happens if I just choose "Unicode string [DT_WSTR]" (Or unicode data) in my integration package. Would it be able to read ASCII without issues?
The encoding that Microsoft misleadingly calls “Unicode” is actually UTF-16LE, an encoding based around two-byte code units.
UTF-16LE is not compatible with ASCII (or any of the locale-specific ANSI code pages), so if you read a file that is actually encoded in an ASCII superset, you will get unreadable nonsense.
There's no magic ‘do the right thing’ option for reading characters from files; you have to know what encoding was used to create them. If you can see an encoded Byte Order Mark at the front of the data, that usually allows you to make a good guess, but otherwise you're on your own.
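For what that BOM-based guess can look like in practice, here is a minimal Python sketch. It only handles the common UTF-8 and UTF-16 BOMs (UTF-32 is ignored for brevity) and falls back to an assumed code page; the fallback and the file name are assumptions.

import codecs

def sniff_encoding(path, fallback="cp1252"):    # the fallback is an assumption, not a detection
    with open(path, "rb") as f:
        head = f.read(4)
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"      # decodes UTF-8 and strips the EF BB BF signature
    if head.startswith(codecs.BOM_UTF16_LE) or head.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"         # this codec reads the BOM itself and picks the byte order
    return fallback             # no BOM: you really are on your own

with open("flatfile.txt", encoding=sniff_encoding("flatfile.txt")) as f:
    first_line = f.readline()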
Can someone please point me to some good references about encoding and decoding in communication and the different encoding techniques (Unicode, Base64, UTF-7), etc.?
Wikipedia is always a good start.
Then there's always Joel Spolsky's article: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Note that the three things you name operate on different levels; the sketch after this list walks through them.
Unicode is a character set: a mapping between characters and numbers (code points).
UTF7 maps between code points and bytes.
base64 maps between bytes and bytes. (It mangles bytes so that they are represented by bytes in the ASCII range.)
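A small Python sketch of those three levels (the sample string is arbitrary):

text = "héllo"                          # characters

code_points = [ord(c) for c in text]    # Unicode level: characters <-> code points
print(code_points)                      # [104, 233, 108, 108, 111]  (é is U+00E9)

utf7_bytes = text.encode("utf-7")       # UTF-7 level: code points <-> bytes
print(utf7_bytes)                       # b'h+AOk-llo' (é becomes a shifted +base64 run)

import base64
b64 = base64.b64encode(text.encode("utf-8"))   # base64 level: bytes <-> bytes in the ASCII range
print(b64)                              # b'aMOpbGxv'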
The definitions of encoding and decoding are somewhat subjective.
Both are forms of transliteration, being the process of converting from one alphabet to another. ASCII to UTF-8, ASCII to Base64, etc. are all examples of this.
What distinguishes the two is that "encoding" is often used when transliterating from a usable format to a transmission or intermediate format of some kind and decoding is the reverse. This is where the "subjective" bit comes in. ASCII to UTF8 can be viewed as encoding or decoding depending on the context.
Other formats like Base64 are used almost universally for transmission only (e.g. binary data in email), and as such converting to them is almost universally called "encoding" and converting from them "decoding".
The important point to take away from all this is that something like ASCII or UTF8 is not magical in any way. All these formats are simply an agreed-upon encoding of information into a binary format. So ASCII 65 is 'A' for no other reason than that's the standard.
Unicode formats get more interesting because they make the distinction between the code point and the encoding. Unicode defines the code points for each character; the binary data is different for each encoding format. For example, look up Unicode Character 'EURO-CURRENCY SIGN' (U+20A0) to see all the different binary values for one code point.
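For illustration, a quick Python sketch of that same code point in several encoding forms (output shown as hex; bytes.hex(' ') requires Python 3.8+):

ch = "\u20a0"                           # EURO-CURRENCY SIGN, a single Unicode code point

print(ch.encode("utf-8").hex(" "))      # e2 82 a0
print(ch.encode("utf-16-be").hex(" "))  # 20 a0
print(ch.encode("utf-16-le").hex(" "))  # a0 20
print(ch.encode("utf-32-be").hex(" "))  # 00 00 20 a0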
Regarding the ones you name, Unicode, Base64 and UTF-7 (hardly anyone uses UTF-7; you may have meant UTF-8): these are not just "encoding and decoding" in general, but encoding and decoding of text data.
Unicode is the way all real and possible characters are enumerated. It says nothing about encoding itself. The UTF-* encodings are encodings of Unicode (they convert code points to actual bytes); the most popular are UTF-8 and UTF-16. Very roughly, UTF-8 is ASCII-compatible (characters with codes below 128 are represented the same way as in ASCII), while other characters are represented by two to four bytes. UTF-16 encodes most characters in two bytes.
Base64 has nothing to do with text data as such. It encodes arbitrary binary data into text consisting of 64 printable ASCII characters. It is typically used to transfer binary data, or UTF-8 and UTF-16 text, via email.
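A minimal Python sketch of that last point (the 9-byte blob is just random sample data):

import base64, os

blob = os.urandom(9)                    # arbitrary binary data, e.g. an attachment
wire = base64.b64encode(blob)           # only printable ASCII characters, safe for mail transport
print(wire)                             # e.g. 12 ASCII characters for the 9 input bytes
assert base64.b64decode(wire) == blob   # decoding restores the original bytes exactly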
Are these obsolete? They seem like the worst idea ever -- embed something in the contents of your file that no one can see, but impacts the file's functionality. I don't understand why I would want one.
They're necessary in some cases, yes, because there are both little-endian and big-endian implementations of UTF-16.
When reading an unknown UTF-16 file, how can you tell which of the two is used?
The only solution is to place some kind of easily identifiable marker in the file, which can never be mistaken for anything else, regardless of the endian-ness used.
That's what the BOM does.
And do you need one? Only if you're (1) using a UTF encoding where endianness is an issue (it matters for UTF-16, but UTF-8 always looks the same regardless of endianness), and (2) the file is going to be shared with external applications.
If your own app is the only one that's going to read and write the file, you can omit the BOM, and simply decide once and for all which endianness you're going to use. But if another application has to read the file, it won't know the endianness in advance, so adding the BOM might be a good idea.
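A short Python sketch of what the endianness difference and the BOM look like on the wire (the exact 'utf-16' output depends on the machine's native byte order, hence the hedge in the comment):

text = "Aé"

print(text.encode("utf-16-le").hex(" "))   # 41 00 e9 00  -- little-endian, no BOM
print(text.encode("utf-16-be").hex(" "))   # 00 41 00 e9  -- big-endian, no BOM
print(text.encode("utf-16").hex(" "))      # ff fe 41 00 e9 00 on a little-endian machine:
                                           # the leading BOM announces the byte order that follows

data = text.encode("utf-16")
print(data.decode("utf-16"))               # the 'utf-16' codec reads and consumes the BOM itself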
Some excerpts from the UTF and BOM FAQ from the Unicode Consortium may be helpful.
Q: What is a BOM?
A: A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files. Under some higher level protocols, use of a BOM may be mandatory (or prohibited) in the Unicode data stream defined in that protocol. (Emphasis mine.)
I wouldn't exactly say the byte-order mark is embedded in the data. Rather, it prefixes the data. The character is only a byte-order mark when it's the first thing in the data stream. Anywhere else, and it's the zero-width non-breaking space. Unicode-aware programs that don't honor the byte-order mark aren't really harmed by its presence anyway since the character is invisible, and a word-joiner at the start of a block of text just joins the next character to nothing, so it has no effect.
Q: Where is a BOM useful?
A: A BOM is useful at the beginning of files that are typed as text, but for which it is not known whether they are in big or little endian format—it can also serve as a hint indicating that the file is in Unicode, as opposed to in a legacy encoding and furthermore, it acts as a signature for the specific encoding form used.
So, you'd want a BOM when your program is capable of handling multiple encodings of Unicode. How else will your program know which encoding to use when interpreting its input?
Q: When a BOM is used, is it only in 16-bit Unicode text?
A: No, a BOM can be used as a signature no matter how the Unicode text is transformed: UTF-16, UTF-8, UTF-7, etc. The exact bytes comprising the BOM will be whatever the Unicode character U+FEFF is converted into by that transformation format. In that form, the BOM serves to indicate both that it is a Unicode file, and which of the formats it is in.
That's probably the case where the BOM is used most frequently today. It distinguishes UTF-8-encoded text from any other encodings; it's not really marking the order of the bytes since UTF-8 only has one order.
If you're designing your own protocol or data format, you're not required to use a BOM. Another question from the FAQ touches on that:
Q: How do I tag data that does not interpret U+FEFF as a BOM?
A: Use the tag UTF-16BE to indicate big-endian UTF-16 text, and UTF-16LE to indicate little-endian UTF-16 text. If you do use a BOM, tag the text as simply UTF-16.
It mentions the concept of tagging your data's format. That means specifying the format out-of-band from the data itself. That's great if such a facility is available to you, but it's often not, especially when older systems are being retrofitted for Unicode.
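Python's codec names happen to mirror that tagging convention, which makes the distinction easy to see (a small sketch; the two-character string is arbitrary):

data = "\ufeffhi".encode("utf-16-be")   # simulate a stream that begins with U+FEFF, stored big-endian

print(repr(data.decode("utf-16")))      # 'hi'        -- tagged plain "UTF-16": U+FEFF is a BOM and is dropped
print(repr(data.decode("utf-16-be")))   # '\ufeffhi'  -- tagged "UTF-16BE": U+FEFF is kept as a character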
As you tagged this with UTF-8, I'm going to say you don't need a BOM. Byte order marks are only useful for UTF-16 and UTF-32, as they inform the computer whether the file is in big-endian or little-endian order. Some text editors may use the byte order mark to decide which encoding the document uses, but this is not part of the Unicode standard.
The BOM signifies which encoding of Unicode the file is in. Without this distinction, a unicode reader would not know how to read the file.
However, UTF-8 doesn't require a BOM.
Check out the Wikipedia article.
The "BOM" is a holdover from the early days of Unicode when it was assumed that using Unicode would mean using 16-bit characters. It is completely pointless in an encoding like UTF-8 which has only one byte order. The choice of U+FEFF is also suboptimal for UTF-32, because it cannot distinguish between all possible middle-endian byte orders (to do so would require a BOM encoded with 4 different bytes).
The only reason you'd use one is when sending UTF-16 or UTF-32 data between platforms with different byte orders, but (1) most people use UTF-8 anyway, and (2) the MIME charset parameter provides a better mechanism.
Just as the UTF-16 and UTF-32 BOMs tell you whether the content is in big-endian or little-endian format (and also that the content is Unicode), the UTF-8 BOM classifies the file as UTF-8 encoded. Without the UTF-8 BOM, how can you know whether it is an ANSI file or a UTF-8 encoded file? The UTF-8 BOM doesn't indicate endianness, of course, because UTF-8 is always a plain byte stream, but it does tell you whether the content is UTF-8 encoded Unicode or ANSI. Of course you can scan for valid UTF-8 sequences, but in my opinion it is easier to check the first three bytes of the file.
UTF-16 and UTF-32 can be written in both big-endian and little-endian forms. You could try to determine the endianness heuristically by analysing the result of treating the file as either, but to save you all that bother, the BOM can tell you right away.
UTF-8 doesn't really need a BOM though, as you decode it byte by byte.
Regardless of whether you use these yourself when creating text files, it's probably worthwhile to be aware of them when you read text files; that is, detect and skip (and ideally handle accordingly) the BOM at the beginning of the file, as in the sketch below. I've run into a few files which had one and which caused me some issues initially, until I figured out what was going on.
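One way to do the detect-and-skip step in Python (the file name is a placeholder): the 'utf-8-sig' codec strips a leading EF BB BF if it is present and is harmless if it is absent, whereas plain 'utf-8' would hand you a stray '\ufeff' at the start of the first line.

with open("input.txt", encoding="utf-8-sig") as f:   # placeholder file name
    first_line = f.readline()

assert not first_line.startswith("\ufeff")           # the BOM, if any, has been consumed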
I have a Perl script that is being called by third parties to send me names of people who have registered my software. One of these parties encodes the names in UTF-8, so I have adapted my script accordingly to decode UTF-8 to ASCII with Encode::decode_utf8(...).
This usually works fine, but every 6 months or so one of the names contains Cyrillic, Greek or Romanian characters, so decoding the name results in garbage characters such as "ПодражанÑкаÑ". I have to follow up with the customer and ask him for a "latin character version" of his name in order to issue a registration code.
So, is there any Perl module that can detect whether there are such characters and automatically translates them to their closest ASCII representation if necessary?
It seems that I can use Lingua::Cyrillic::Translit::ICAO plus Lingua::DetectCharset to handle Cyrillic, but I would prefer something that works with other character sets as well.
I believe you could use Text::Unidecode for this; it is precisely what it tries to do.
In the documentation for Text::Unidecode, under "Caveats", it appears that this phrase is incorrect:
Make sure that the input data really is a utf8 string.
UTF-8 is a variable-length encoding, whereas Text::Unidecode only accepts a fixed-length (two-byte) encoding for each character. So that sentence should read:
Make sure that the input data really is a string of two-byte Unicode characters.
This is also referred to as UCS-2.
If you want to convert strings which really are utf8, you would do it like so:
use Text::Unidecode;                                          # exports unidecode()

my $decode_status    = utf8::decode($input_to_be_converted);  # turn the raw UTF-8 bytes into a Perl character string
my $converted_string = unidecode($input_to_be_converted);     # transliterate to plain ASCII
If you have to deal with UTF-8 data that is not in the ASCII range, your best bet is to change your backend so it doesn't choke on UTF-8. How would you go about transliterating kanji characters?
If you get Cyrillic text, there is no "closest ASCII representation" for many characters.