I use dos2unix to remove Ctrl-M (carriage return) characters, and it works fine for normal text files. But for text files with Unicode characters it replaces all the Unicode characters with junk. Please let me know how to use dos2unix on files with Unicode characters.
It depends on which implementation and version of dos2unix you use. This implementation (http://waterlan.home.xs4all.nl/dos2unix.html) has supported Unicode since version 6.0.
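If you have that implementation, the check and the conversion are simple (a minimal sketch; file.txt is just a placeholder name):
dos2unix --version   # confirm you are running 6.0 or newer
dos2unix file.txt    # converts CRLF line endings to LF in place
Since version 6.0 that implementation also understands UTF-16 input (on Unix-like systems it writes the result out as UTF-8) and leaves UTF-8 text untouched apart from the line endings; older or other implementations may well produce the junk characters you are seeing.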
I’m trying to remove accented characters (CAFÉ -> CAFE) while keeping all the Chinese characters, using a command. Currently I’m using iconv to remove the accented characters, but it turns out that all the Chinese characters come out as “?????”. I can’t figure out a way to keep the Chinese characters in an ASCII-encoded file at the same time.
How can I do so?
iconv -f utf-8 -t ascii//TRANSLIT//IGNORE -o converted.bin test.bin
There is no way to keep Chinese characters in a file whose encoding is ASCII; this encoding only covers the code points between NUL (0x00) and DEL (0x7F), which basically means the basic control characters plus unaccented English letters, digits, and punctuation. (Look at an ASCII chart for the full enumeration.)
What you appear to be asking is how to remove accents from European letters while keeping any Chinese characters intact, in a file whose encoding is UTF-8. I believe there is no straightforward way to do this with iconv, but it should be fairly easy to come up with a one-liner in a language with decent Unicode support, such as Perl.
bash$ python -c 'print("\u4effCaf\u00e9\u9f00")' >unizh.txt
bash$ cat unizh.txt
仿Café鼀
bash$ perl -CSD -MUnicode::Normalize -pe '$_ = NFKD($_); s/\p{M}//g' unizh.txt
仿Cafe鼀
Maybe add the -i option to modify the file in-place; this simple demo just writes out the result to standard output.
This has the potentially undesired side effect of normalizing each character to its NFKD form.
Code inspired by "Remove accents from accented characters"; the Chinese characters to test with were gleaned from "What's the complete range for Chinese characters in Unicode?" (the ones on the boundary of the range are not particularly good test cases, so I just guessed a bit).
The iconv tool is meant to convert the way characters are encoded (i.e. saved to a file as bytes). By converting to ASCII (a very limited character set that contains the numbers, some punctuation, and the basic alphabet in upper and lower case), you can save only the characters that can reasonably be matched to that set. So an accented letter like É gets converted to E because that's a reasonably similar ASCII character, but a Chinese character like 公 is so far away from the ASCII character set that only question marks are possible.
The answer by tripleee is probably what you need. But if the conversion to NFKD form is a problem for you, an alternative is using a direct list of characters you want to replace:
sed 'y/áàäÁÀÄéèëÉÈË/aaaAAAeeeEEE/' <test.bin >converted.bin
where you need to list the original characters and their replacements in the same order. Obviously it is more work, so do this only if you need full control over what changes you make.
I've been given a huge properties file created in Eclipse with ISO-8859-1 encoding, and all the Greek characters in it are written in Unicode escape notation (e.g. \u03bc\u03af\u03b1\u0020\u03bc\u03ad\u03c1\u03b1). It works fine, but I want the actual file to be human-readable.
I converted the file to UTF-8, but the characters remained as they were. Is there a way to automatically convert the contents of the file to UTF-8, either from inside Eclipse or via an external tool?
This can be done with the AnyEdit Tools plugin:
Once installed, hit Ctrl+A in the editor (to select all the text containing ASCII and Unicode-escaped characters), then right-click and choose Convert > From Unicode Notation.
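If you would rather use an external tool, the native2ascii utility that ships with JDK 8 and earlier has a reverse mode that turns \uXXXX escapes back into real characters; a minimal sketch, with app.properties and app-utf8.properties as placeholder names:
native2ascii -reverse -encoding UTF-8 app.properties app-utf8.properties
The output file is then plain UTF-8 text with the Greek characters readable, and you can tell Eclipse to open it as UTF-8.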
If I use dir /s /b > list.txt, all Unicode characters in file names, like äöüß, are broken or missing - instead of ä I get '', ü just disappears, and so on...
Yes, I know Unicode characters aren't a good way to name files - but they weren't named by me.
Is there a way to get the file names listed intact?
The default console code page usually only supports a small subset of Unicode. US Windows defaults to code page 437 and supports only 256 characters.
If you open a Unicode command prompt (cmd /u), when you redirect to a file the file will be encoded in UTF-16LE, which supports all Unicode characters. Notepad should display the content as long as its font supports the glyphs used.
Changing to an encoding such as UTF-8 (chcp 65001) that supports the full Unicode code point set and redirecting to a file will use that encoding and work as well.
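Concretely, either of the following should keep äöüß intact (a sketch of the two approaches above; list.txt is the file name from the question):
cmd /u /c "dir /s /b > list.txt"
or
chcp 65001
dir /s /b > list.txt
The first writes list.txt as UTF-16LE, the second as UTF-8; in both cases a Unicode-aware editor such as Notepad will show the file names correctly.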
Even though the EULA displayed by WiX is an RTF file, and RTF seems to be fully capable of Unicode support, for some reason non-breaking spaces aren't rendered as such.
Investigation showed that non-breaking spaces are stored in RTF (which is internally a plain-text ASCII file) as the \~ control word, whereas other non-ASCII characters are stored using the \'xx control word, where the x's are hexadecimal digits. A simple search-and-replace of \~ with \'a0 did the trick.
Apparently, that's a limitation of the parsing module in the control used to display the EULA. Sadly, it's definitely not the biggest one.
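If anyone needs to script that fix, a one-liner along these lines does the same substitution (a sketch, assuming a POSIX shell with sed available and License.rtf as a placeholder file name):
sed "s/\\\\~/\\\\'a0/g" License.rtf > License-fixed.rtf
Every \~ control word is rewritten as \'a0, which the EULA control then renders as a proper non-breaking space.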
I have a character encoding issue.
I have a text file written in Arabic, and when I open it I get weird characters like this: åÇÜÇáÍÌÑäÇáÑÝÇÚíãäí.
Is there any way to fix this and get the correct text? The file is supposedly UTF-8 (utf8x) encoded.
As noted in the comment: it is not UTF-8, it is WINDOWS-1256 encoding, so you can repair it on Linux by running iconv on the file (here named test):
jh#jh-aspire:4804~$ iconv -fwindows-1256 -tutf8 test
هاـالحجرنالرفاعيمني
(I have no idea what it means as I don't know Arabic)
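To save the repaired text instead of just printing it, you can give iconv an output file, the same way as in the earlier answer (test-utf8.txt is just a placeholder name):
iconv -f windows-1256 -t utf-8 -o test-utf8.txt test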
You can use Notepad++ to display the Arabic text.