How to remove accents and keep Chinese characters using a command?

I’m trying to remove accented characters (CAFÉ -> CAFE) while keeping all the Chinese characters, using a command. Currently I’m using iconv to remove the accented characters, but it turns all the Chinese characters into “?????”. I can’t figure out a way to keep the Chinese characters in an ASCII-encoded file at the same time.
How can I do so?
iconv -f utf-8 -t ascii//TRANSLIT//IGNORE -o converted.bin test.bin

There is no way to keep Chinese characters in a file whose encoding is ASCII; this encoding only covers the code points between NUL (0x00) and DEL (0x7F), which basically means the control characters plus basic English letters, digits, and punctuation. (Look at an ASCII chart for the full enumeration.)
What you appear to be asking is how to remove accents from European alphabetics while keeping any Chinese characters intact in a file whose encoding is UTF-8. I believe there is no straightforward way to do this with iconv, but it is easy enough to come up with a one-liner in a language with decent Unicode support, like perhaps Perl.
bash$ python -c 'print("\u4effCaf\u00e9\u9f00")' >unizh.txt
bash$ cat unizh.txt
仿Café鼀
bash$ perl -CSD -MUnicode::Normalize -pe '$_ = NFKD($_); s/\p{M}//g' unizh.txt
仿Cafe鼀
Maybe add the -i option to modify the file in-place; this simple demo just writes out the result to standard output.
This has the potentially undesired side effect of normalizing each character to its NFKD form.
Code inspired by Remove accents from accented characters and Chinese characters to test with gleaned from What's the complete range for Chinese characters in Unicode? (the ones on the boundary of the range are not particularly good test cases so I just guessed a bit).
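If the NFKD normalization noted above is a concern, one variant (a sketch along the same lines, offered as an assumption rather than a tested recipe) is to strip the marks from the canonical decomposition and recompose afterwards:
perl -CSD -MUnicode::Normalize -pe '$_ = NFD($_); s/\p{M}//g; $_ = NFC($_)' unizh.txt
Precomposed characters come back out recomposed, minus their accents, while compatibility characters (ligatures, full-width forms, and so on) are left untouched because only the canonical decomposition is applied.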

The iconv tool is meant to convert the way characters are encoded (i.e. saved to a file as bytes). By converting to ASCII (a very limited character set that contains the numbers, some punctuation, and the basic alphabet in upper and lower case), you can save only the characters that can reasonably be matched to that set. So an accented letter like É gets converted to E because that's a reasonably similar ASCII character, but a Chinese character like 公 is so far away from the ASCII character set that only question marks are possible.
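To see the effect concretely, here is a small demonstration (assuming GNU iconv in a UTF-8 locale; exact transliterations vary between implementations and locales):
printf 'CAFÉ 公\n' | iconv -f utf-8 -t ascii//TRANSLIT
CAFE ?
The É has a reasonable ASCII stand-in, but the Chinese character does not, so it becomes a question mark.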
The answer by tripleee is probably what you need. But if the conversion to NFKD form is a problem for you, an alternative is using a direct list of characters you want to replace:
sed 'y/áàäÁÀÄéèëÉÈË/aaaAAAeeeEEE/' <test.bin >converted.bin
where you need to list the original characters and their replacements in the same order. Obviously it is more work, so do this only if you need full control over what changes you make.

Related

grep for emojis in linux

I am trying to grep across a list of tokens that includes several non-ASCII characters. I want to match only emojis; other characters such as ð or ñ are fine. The Unicode range for emoji appears to be U+1F600-U+1F1FF, but when I search for it using grep this happens:
grep -P "[\x1F6-\x1F1]" contact_names.tokens
grep: range out of order in character class
https://unicode.org/emoji/charts/full-emoji-list.html#1f3f4_e0067_e0062_e0077_e006c_e0073_e007f
You need to specify the code points with their full value (not 1F6 but 1F600) and wrap them in curly braces. In addition, the first value must be smaller than the last value.
So the regex should be "[\x{1F1FF}-\x{1F600}]".
The Unicode range for emoji is, however, more complex than you assumed. The page you referred to does not sort characters by code point, and emoji are spread across many blocks. If you want to cover almost all of them:
grep -P "[\x{1f300}-\x{1f5ff}\x{1f900}-\x{1f9ff}\x{1f600}-\x{1f64f}\x{1f680}-\x{1f6ff}\x{2600}-\x{26ff}\x{2700}-\x{27bf}\x{1f1e6}-\x{1f1ff}\x{1f191}-\x{1f251}\x{1f004}\x{1f0cf}\x{1f170}-\x{1f171}\x{1f17e}-\x{1f17f}\x{1f18e}\x{3030}\x{2b50}\x{2b55}\x{2934}-\x{2935}\x{2b05}-\x{2b07}\x{2b1b}-\x{2b1c}\x{3297}\x{3299}\x{303d}\x{00a9}\x{00ae}\x{2122}\x{23f3}\x{24c2}\x{23e9}-\x{23ef}\x{25b6}\x{23f8}-\x{23fa}]" contact_names.tokens
(The range is borrowed from Suhail Gupta's answer on a similar question)
If you need to allow/disallow specific emoji blocks, see the emoji sequence data on unicode.org. The list of emoji on Wikipedia also shows the characters in ordered tables, but it might not include the latest ones.
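For example, a sketch for matching just one block, the Regional Indicator symbols (the building blocks of flag emoji, U+1F1E6 to U+1F1FF):
grep -P "[\x{1F1E6}-\x{1F1FF}]" contact_names.tokens
The same pattern shape works for any of the blocks listed in the big character class above; just swap in the block's first and last code point.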
You could use ugrep as a drop-in replacement for grep to do this:
ugrep "[\x{1F1FF}-\x{1F600}]" contact_names.tokens
ugrep matches Unicode patterns by default (disabled with option -U).
The regular expression syntax is POSIX ERE compliant, extended with
Unicode character classes, lazy quantifiers, and negative patterns to
skip unwanted pattern matches to produce more precise results.
ugrep searches UTF-encoded input when a UTF BOM (byte order mark) is
present, and ASCII or UTF-8 when no UTF BOM is present. Option
--encoding permits many other file formats to be searched, such as ISO-8859-1, EBCDIC, and code pages 437, 850, 858, and 1250 to 1258.
ugrep searches text and binary files and produces hexdumps for binary matches.
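For instance, a usage sketch (the file name is hypothetical; the encoding name is one of those listed above):
ugrep --encoding=ISO-8859-1 'café' legacy_latin1.txt
Here the file is read as ISO-8859-1 and the Unicode pattern is matched against the decoded text.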
The Unicode ranges for emoji are larger than the range U+1F1FF to U+1F600. See the official Unicode 12 publication: https://unicode.org/emoji/charts-12.0/full-emoji-list.html

How do code pages work in the case of Chinese?

How do code pages work in the case of Chinese / Japanese?
It is impossible to encode all of these languages' characters within the limit of one byte, so how does it work?
Note that I'm talking about pre-Unicode times.
I'm most familiar with Japanese, but in general the strategy is the same for any language that needs more characters than fit in a single byte - you use a variable width multibyte encoding where some bytes are recognized as starting a "wide" character and ASCII is left alone.
In the early days so-called "ASCII-safe" encodings were useful. These used only seven bits (the high bit was always 0) so they worked with a variety of systems (including hardware) that expected only control characters to set the high bit in any byte. ISO-2022-JP is one of these and is still used in email quite often (mostly on feature phones).
Here's what ISO-2022-JP looks like if you don't decode it:
echo "日本語 test" | iconv -f utf8 -t iso2022jp | cat -v
^[$BF|K\8l^[(B test
Note that "test" comes through unchanged and all other characters are valid ASCII; ^[ is an ASCII escape character. (ISO-2022 also has 8-bit versions, but the 7-bit variety is the most commonly used.)
Later variable-width encodings, like EUC, Shift-JIS, and UTF-8, all work on the same principle except they use binary (non-ASCII) escapes, so the first byte of any multi-byte character has the high bit set (that is, the unsigned byte value is 128 or greater). The Wikipedia article on UTF-8 has a nice table explaining how UTF-8 handles this. Just like the older ASCII-safe encodings, these leave plain ASCII strings unmodified.
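You can see the high-bit rule for yourself with a quick sketch (assuming a UTF-8 locale and the od utility):
printf '語A' | od -An -tx1
 e8 aa 9e 41
All three bytes of the multi-byte character 語 are 0x80 or above, while the ASCII letter A stays a single byte (0x41) below that threshold.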
There also exist fixed-width multibyte encodings, but they're relatively uncommon. There was an attempt to popularize an encoding that just used two bytes for everything, called UCS-2, but it ended up not having room for enough characters and was mostly superseded by the variable-width UTF-16 in the 1990s. UTF-16 is (practically speaking) the internal encoding used in Java and JavaScript, but because of the history with UCS-2, things like string length sometimes work in strange ways.
Technically fixed-width UTF-32 exists, but it's not widely used and I've never personally encountered it in the wild.
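Out of curiosity, the fixed-width layout is easy to peek at (a quick sketch, assuming glibc iconv):
printf 'A語' | iconv -f utf-8 -t UTF-32LE | od -An -tx1
 41 00 00 00 9e 8a 00 00
Every character occupies exactly four bytes, in contrast to the variable-width encodings above.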

Which non-printable ASCII characters have no side-effects?

I want to use a couple of non-printable ASCII characters to encode special meanings to text generated by an application (under Linux).
Some non-printable ASCII characters are interpreted to have special meanings in a great number of applications, such as LF (line feed), character number 10 in the ASCII table. I am unsure whether there are more of these. Basically, I am looking for a character whose original function is outdated, so that it is safe to use in the modern world.
Thank you in advance.

How do I replace literal \xNN with their characters in Perl?

I have a Perl script that takes text values from a MySQL table and writes them to a text file. The problem is, when I open the text file for viewing I see a lot of hex escapes like \x92 and \x93, which stand for single and double quotes, I guess.
I am using the DBI->quote function to escape the special characters before writing the values to the text file. I have tried using Encode::Encoder, but with no luck. The character set on both tables is latin1.
How do I get rid of those hex characters and get the character to show in the text file?
ISO Latin-1 does not define characters in the range 0x80 to 0x9f, so displaying these bytes in hex is expected. Most likely your data is actually encoded in Windows-1252, which is the same as Latin1 except that it defines additional characters (including left/right quotes) in this range.
\x92 and \x93 are not assigned to printable characters in the latin1 character set. If you are certain that you are indeed dealing with latin1, you can simply delete them.
It sounds like you need to change the character sets on the tables, or translate the non-latin-1 characters into latin-1 equivalents. I'd prefer the first solution. Get used to Unicode; you're going to have to learn it at some point. :)
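If the data does turn out to be Windows-1252, a minimal conversion sketch (using the core Encode module; the sample bytes are made up) would be:
perl -MEncode -e 'my $bytes = "\x93hello\x94"; print encode("UTF-8", decode("cp1252", $bytes)), "\n";'
This reads the bytes as Windows-1252 and writes them back out as proper UTF-8 curly quotes instead of leaving stray \x93/\x94 bytes in the file.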

How can I convert non-ASCII characters encoded in UTF8 to ASCII-equivalent in Perl?

I have a Perl script that is being called by third parties to send me names of people who have registered my software. One of these parties encodes the names in UTF-8, so I have adapted my script accordingly to decode UTF-8 to ASCII with Encode::decode_utf8(...).
This usually works fine, but every 6 months or so one of the names contains Cyrillic, Greek or Romanian characters, so decoding the name results in garbage characters such as "ПодражанÑкаÑ". I then have to follow up with the customer and ask for a "Latin character version" of his name in order to issue a registration code.
So, is there any Perl module that can detect whether there are such characters and automatically translates them to their closest ASCII representation if necessary?
It seems that I can use Lingua::Cyrillic::Translit::ICAO plus Lingua::DetectCharset to handle Cyrillic, but I would prefer something that works with other character sets as well.
I believe you could use Text::Unidecode for this; it is precisely what it tries to do.
In the documentation for Text::Unidecode, under "Caveats", it appears that this phrase is incorrect:
Make sure that the input data really is a utf8 string.
UTF-8 is a variable-length byte encoding, whereas Text::Unidecode expects a string of already-decoded Unicode characters (Perl's internal character strings), not raw UTF-8 bytes. So that sentence should read:
Make sure that the input data really is a string of decoded Unicode characters.
If you want to convert strings which really are UTF-8, you would do it like so:
use Text::Unidecode;                                         # provides unidecode()

my $decode_status = utf8::decode($input_to_be_converted);   # decode UTF-8 bytes to characters in place; returns true on success
my $converted_string = unidecode($input_to_be_converted);   # characters -> closest ASCII approximation
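A hedged end-to-end sketch of the same idea (the sample bytes, the UTF-8 encoding of "Под", are made up for illustration):
use Text::Unidecode;

my $name = "\xD0\x9F\xD0\xBE\xD0\xB4";   # raw UTF-8 bytes as received from the third party
utf8::decode($name);                      # turn the bytes into a Perl character string
print unidecode($name), "\n";             # prints an ASCII approximation such as "Pod"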
If you have to deal with UTF-8 data that is not in the ASCII range, your best bet is to change your backend so it doesn't choke on UTF-8. How would you go about transliterating kanji characters, for example?
If you get Cyrillic text, there is no "closest ASCII representation" for many characters.