Different encoding results for same UTF-8 character for ã - encoding

One encoding scheme gives ã as a%CC%83 and the other gives ã as %C3%a3.
What is the difference between these 2? I was debugging this, and found that, a%CC%83 is actually 2 character. One for ~ which is %CC%83 and one for a. But, how can someone obtain that encoding?

It's normal behavior. In Unicode same symbol can be encoded as single character or as composition of two or more characters.
https://en.wikipedia.org/wiki/Precomposed_character

Related

How to remove accents and keep Chinese characters using a command?

I’m trying to remove the accented characters (CAFÉ -> CAFE) while keeping all the Chinese characters by using a command. Currently, I’m using iconv to remove the accented characters. It turns out that all the Chinese characters are encoded as “?????”. I can’t figure out the way to keep the Chinese characters in an ASCII-encoded file at the same time.
How can I do so?
iconv -f utf-8 -t ascii//TRANSLIT//IGNORE -o converted.bin test.bin
There is no way to keep Chinese characters in a file whose encoding is ASCII; this encoding only encodes the code points between NUL (0x00) and 0x7F (DEL) which basically means the basic control characters plus basic
English alphabetics and punctuation. (Look at the ASCII chart for an enumeration.)
What you appear to be asking is how to remove accents from European alphabetics while keeping any Chinese characters intact in a file whose encoding is UTF-8. I believe there is no straightforward way to do this with iconv, but it should be comfortably easy to come up with a one-liner in a language with decent Unicode support, like perhaps Perl.
bash$ python -c 'print("\u4effCaf\u00e9\u9f00")' >unizh.txt
bash$ cat unizh.txt
仿Café鼀
bash$ perl -CSD -MUnicode::Normalize -pe '$_ = NFKD($_); s/\p{M}//g' unizh.txt
仿Cafe鼀
Maybe add the -i option to modify the file in-place; this simple demo just writes out the result to standard output.
This has the potentially undesired side effect of normalizing each character to its NFKD form.
Code inspired by Remove accents from accented characters and Chinese characters to test with gleaned from What's the complete range for Chinese characters in Unicode? (the ones on the boundary of the range are not particularly good test cases so I just guessed a bit).
The iconv tool is meant to convert the way characters are encoded (i.e. saved to a file as bytes). By converting to ASCII (a very limited character set that contains the numbers, some punctuation, and the basic alphabet in upper and lower case), you can save only the characters that can reasonably be matched to that set. So an accented letter like É gets converted to E because that's a reasonably similar ASCII character, but a Chinese character like 公 is so far away from the ASCII character set that only question marks are possible.
The answer by tripleee is probably what you need. But if the conversion to NFKD form is a problem for you, an alternative is using a direct list of characters you want to replace:
sed 'y/áàäÁÀÄéèëÉÈË/aaaAAAeeeEEE/' <test.bin >converted.bin
where you need to list the original characters and their replacements in the same order. Obviously it is more work, so do this only if you need full control over what changes you make.

Are all character encoding compatible with ascii, for at least a-z

Some charset don't have all the 128 first identical to ascii, but is A to Z and a to z, always in the sam position?
I had a plan to set apaches default charset to somting odd in my test envirement, for easy detecting sites that don't tell the browser what encoding it sending.
But so far, I can't find one that makes A to Z show up as someting else.
There is an other question close to the subject, but thats about all 128 ascii chars:
Are ASCII characters always encoded the same way in all character encodings?
No, EBCDIC from IBM is the famous exception.
Another testcase is UTF-16 Big Endian, which puts "A" at U+0041. ASCII would treat the first 00 as a NUL, which often is interpreted as an end-of-string.
In short: no. The encoding mentioned in the other answer, EBCDIC, has a very different layout, to pick just one example.
Most encodings you find in the wild on the web today are probably ASCII compatible. But there are encodings other than ASCII which are entirely incompatible too.

What's the ASCII character code for '—'?

I am working on decoding text. I am trying to find the character code for the — character, not to be mistaken for -, in ASCII. I have tried unsuccessfully. Does anybody know how to convert it?
Quotation from wiki (Em dash)
When an actual em dash is unavailable—as in the ASCII character set—a double ("--") or triple hyphen-minus ("---") is used. In Unicode, the em dash is U+2014 (decimal 8212).
Em dash character is not a part of ASCII character set.
— is known as an Em Dash. It's character code is \u2014. It is not an ASCII character, so you cannot decode it with the ASCII character set because it is not in the ASCII character table. You would probably want to use UTF8 instead.
Windows
For Windows on a keyboard with a Numeric keypad:
Use Alt+0150 (en dash), Alt+0151 (em dash), or Alt+8722 (minus sign) using the numeric keypad.
This character does not exist in ASCII, but only in Unicode, usually encoded by UTF-8.
In UTF-8, characters are encoded by 2- or 3-byte sequences (or occasionally longer), where none of the two or three bytes is a valid ASCII code, where all of them are outside the ASCII range of 0 through 127.
One suspects that the foregoing only partly answers your question, but if so then this is probably because your question is, inadvertently, only partly asked. For further details, you can extend your question with more specifics.
The character — is not part of the ASCII set.
But if you are looking to convert it to some other format (like U+hex), you can use this online tool. Put your character into the first green box and click "Convert" (above the box)
further below you'll find a number of different codes, including U+hex:
U+2014
Feel free to edit this answer if the link breaks or leave a comment so I can find a replacement.
Alt + 0151 seems to do the trick—perhaps it doesn't work on all keyboards.
alt-196 - while holding down the 'Alt' key, type 196 on the numeric keypad, then release the 'Alt' key

What multi-byte character set starts with 0x7F and is 4 bytes long?

I'm trying to get some legacy code to display Chinese characters properly. One character encoding I'm trying to work with starts with a 0x7F and is 4 bytes long (including the 0x7F byte). Does anyone know what kind of encoding this is and where I can find information for it? Thanks..
UPDATE:
I've also had to work with some Japanese encoding that starts every character with a 0xE3 and is three bytes long. It displays on my computer properly if I choose the Japanese locale in Windows, however, it doesn't display properly in our application. However, if any other locale other than Japanese is selected, I cannot even view the filenames properly. So I'm guessing this encoding is not Unicode. Anyone know what it is? Is it ANSI? Is it Shift JIS?
For the Chinese one, I've tested it with Unicode and UTF-8 characters and I'm getting the same pattern; 0x7F followed by three bytes. Are Unicode and UTF-8 the same?
One character encoding I'm trying to work with starts with a 0x7F and is 4 bytes long
What are the other bytes? Do you have any Latin text in this encoding?
If it's “0x7f 0x... 0x00 0x00” you are looking at UTF-32LE. It could also be two UTF-16 (either LE or BE) characters.
Most East Asian encodings use 0x80-0xFF as lead bytes for non-ASCII characters; there is none I know of that would use a leading 0x7F as anything other than an ASCII delete.
ETA:
are there supposed to be Byte Order Marks?
There doesn't need to be a BOM if there is an out-of-band way of signalling that the encoding is ‘UTF-32LE’ (possibly one that is lost before it gets to you).
I've also had to work with some Japanese encoding that starts every character with a 0xE3 and is three bytes long.
That's surely UTF-8. Sequence 0xE3 0x... 0x... would result in a character between U+3000 and U+4000, which is where the hiragana/katakana live.
It displays on my computer properly if I choose the Japanese locale in Windows, however, it doesn't display properly in our application.
Then chances are your application is is one of the regrettable horde of non-Unicode-compliant apps, still using ‘A’(*) versions of the Win32 interfaces inside of the ‘W’-suffixed ones. Whether you can read in the string according to its real encoding is moot: a non-Unicode-compliant app will never be able to display an East Asian ideograph on a Western locale.
(*: named for “ANSI”, which is Windows's misleading term for “whatever the system codepage is set to at the moment”. That's why changing your locale affected it.)
ETA(2):
OK, cracked it. It's not any standardised encoding I've met before, but it's relatively easy to decipher if you assume the premise that Unicode code points are being encoded.
0x00-0x7E: plain ASCII
0x7F A B C: Unicode character
The character encoded in a Unicode escape can be calculated by taking the index in a key string of A, B and C and adding together:
A*0x1000 + B*0x40 + C
That is, it's a base-64 character set, but it's not the usual Base64 standard. A little experimentation gives a key string of:
.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
The ‘.’ and ‘_’ characters are guesses, since none of the characters you posted uses them. We'd need more data to find out the exact string.
So, for example:
0x7F 3 u g
A=4 B=58 C=44
4*0x1000 + 58*0x40 + 44 = 0x4EAC
U+4EAC = 京
ETA(3):
Yeah, it should be easy to create a native Unicode string by sucking out each code point manually and joining as a character. Not quite sure what's available on whatever platform you're using, but any Unicode-capable platform should be able to make a string from codepoints simply (and hopefully without having to manually re-encode to UTF-16LE bytes).
I figured it must be Unicode codepoints by noticing that the three example characters had first escape-characters in the same general range, and in the same numerical order as their Unicode codepoints. The other two characters seemed to change randomly, so it was very likely a big-endian encoding of the code point, and probably a base-64 encoding as 6 is as many bits as you can get out of readable ASCII.
Standard Base64 itself starts with letters, which would put something starting with a number too far up to be in the Basic Multilingual Plane. So I started guessing with ‘0123456789ABCDEFG...’ which would be the other obvious choice of key string. That got resulting numbers that were close to the code points for the given characters, but a bit too low. Inserting an extra character at the start of the key string (so digit ‘0’ doesn't map to number 0) got one of the characters right and the other two very close; the one that was right had no lower-case letters, so to change only the lower-case letters I inserted another character between the upper and lower cases. This came up with the right numbers.
It's not guaranteed that this is actually right, but (apart from the arbitrary choice of inserted characters) it's very likely to be it.
You might want to look at chinese character encoding page on Wikipedia. The only encoding in there that I can see that is always 4 bytes is UTF-32.
GB 18030 is the current standard Chinese character set, but it can be 1 to 4 bytes long.
Try chardet. It does a good job of guessing the character encoding of a string of bytes.
Are Unicode and UTF-8 the same?
No. UTF-8 is just one way to represent Unicode characters as a sequence of bytes. Unicode is the full standard, assigning numeric and human-readable identifiers to each character, as well as lots of metadata about the characters.
It might be a valid unicode encoding, such as a utf-8 or UTF16 surrogate pair.
Yes, the Chinese one is UTF-8, a implementation (encoding) of Unicode.
The UTF-8 is 1 byte long for ASCII characters and up to 4 bytes for others.

How can I convert non-ASCII characters encoded in UTF8 to ASCII-equivalent in Perl?

I have a Perl script that is being called by third parties to send me names of people who have registered my software. One of these parties encodes the names in UTF-8, so I have adapted my script accordingly to decode UTF-8 to ASCII with Encode::decode_utf8(...).
This usually works fine, but every 6 months or so one of the names contains cyrillic, greek or romanian characters, so decoding the name results in garbage characters such as "ПодражанÑкаÑ". I have to follow-up with the customer and ask him for a "latin character version" of his name in order to issue a registration code.
So, is there any Perl module that can detect whether there are such characters and automatically translates them to their closest ASCII representation if necessary?
It seems that I can use Lingua::Cyrillic::Translit::ICAO plus Lingua::DetectCharset to handle Cyrillic, but I would prefer something that works with other character sets as well.
I believe you could use Text::Unidecode for this, it is precisely what it tries to do.
In the documentation for Text::Unicode, under "Caveats", it appears that this phrase is incorrect:
Make sure that the input data really is a utf8 string.
UTF-8 is a variable-length encoding, whereas Text::Unidecode only accepts a fixed-length (two-byte) encoding for each character. So that sentence should read:
Make sure that the input data really is a string of two-byte Unicode characters.
This is also referred to as UCS-2.
If you want to convert strings which really are utf8, you would do it like so:
my $decode_status = utf8::decode($input_to_be_converted);
my $converted_string = unidecode ($input_to_be_converted);
If you have to deal with UTF-8 data that are not in the ascii range, your best bet is to change your backend so it doesn't choke on utf-8. How would you go about transliterating kanji signs?
If you get cyrilic text there is no "closest ASCII representation" for many characters.