How to convert a Big5 encoded txt file to UTF8 encoded txt file? - encoding

I have a Big5 encoded file, which can't be opened by Mac TextEdit. I wonder how to convert the whole file into utf8 encoding, since utf8 is much more universal and common.
I have tried using iconv in my terminal, but it does not work. I can't find anything useful about this error on Google either.
$ iconv -f BIG5 -t UTF8 in.txt > out.txt
iconv: in.txt:5:0: cannot convert
Are there any other ways to convert?
I got the txt file from here; it is a list of Chinese names written in Traditional Chinese as used in Taiwan.

Looking at the first 20 lines of your file, it is clear that the encoding uses the byte 0x8C as the first byte of some multibyte sequences. The encodings that have this property are:
BIG5
BIG5-HKSCS
CP932
CP936
CP949
CP950
GB18030
GBK
JOHAB
Shift_JIS
Shift_JISX0213
Try them in turn:
$ for encoding in BIG5 BIG5-HKSCS CP932 CP936 CP949 CP950 GB18030 GBK \
JOHAB Shift_JIS Shift_JISX0213; do \
if head -n 20 < unique_names_2012.txt | iconv -f $encoding -t UTF-8 > /dev/null 2> /dev/null; then \
echo $encoding ; \
fi; \
done
With GNU libiconv, it prints
BIG5-HKSCS
CP950
GB18030
Is it in GB18030 encoding?
$ iconv -f GB18030 < unique_names_2012.txt
shows hundreds of lines that use characters in the PUA range. While not impossible, it seems unlikely.
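To see those PUA lines for yourself, GNU grep built with PCRE support can match the Private Use Area directly (a sketch; it assumes grep -P is available and the locale is UTF-8):
$ iconv -f GB18030 < unique_names_2012.txt | grep -nP '[\x{E000}-\x{F8FF}]' | head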
Is it in CP950 encoding?
$ iconv -f CP950 < unique_names_2012.txt
gives a conversion error at line 2294.
Is it in BIG5-HKSCS encoding?
$ iconv -f BIG5-HKSCS < unique_names_2012.txt
gives a conversion error at line 713.
So, most probably the file is encoded in a variant of BIG5. There are many such variants, see http://haible.de/bruno/charsets/conversion-tables/Chinese.html. Possibly it's an extension of CP950 or an extension of BIG5-HKSCS (since these are the most popular encodings from the BIG5 family today).
In summary, such conversion errors are caused by the unstandardized proliferation of BIG5 variants.
The best thing you can do is to request the original file in UTF-8 encoding; let the originator deal with it.
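If you cannot get a UTF-8 original and losing a few characters is acceptable, a rough workaround (a sketch; -c tells iconv to silently drop sequences it cannot convert) is to convert with one of the closest candidates and dump the offending lines to see what would be lost:
$ iconv -c -f CP950 -t UTF-8 unique_names_2012.txt > out.txt
$ sed -n '2294p' unique_names_2012.txt | xxd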

Related

How to reencode Source file which has "é" instead of "é"?

I've just inherited a legacy project in which my predecessor pushed incorrectly encoded files.
The comments, in French, should include special characters such as é, è, ç, etc.
But, for instance here, a 'é' is shown as 'é'.
I'm looking for a command line tool to handle all files of the project. I'm pretty sure iconv should do the trick, but what I tried so far did not work:
Here is some initial information:
# problematic file example
$ file Parametres.cpp
Parametres.cpp: C source, ISO-8859 text
# check that my OS handles utf8
$ echo "éè" > test.tmp
$ file test.tmp
test.tmp: UTF-8 Unicode text
$ cat test.tmp
éè
I tried without success (meaning in Parametres.cpp.utf8 I still got 'é'):
iconv -f ISO-8859-1 -t UTF-8 Parametres.cpp -o Parametres.cpp.utf8
iconv -f ISO-8859-1 -t UTF-8//TRANSLIT Parametres.cpp -o Parametres.cpp.utf8
iconv -f ISO-8859-1//TRANSLIT -t UTF-8 Parametres.cpp -o Parametres.cpp.utf8
My guess is that the original encoding was not ISO-8859-1 but something else, and that due to a misconfigured IDE, the chars 'Ã' and '©' ended up literally encoded in ISO-8859-1. From what I understood, TRANSLIT should do the job, but it seems not to.
So, here are my questions :
Is there a better tool than iconv to do this job on CentOS 7.2 (yes, I know, legacy is legacy...)?
Or, how can I determine (or guess) the original encoding so that iconv can solve my problem?
Any help or ideas are appreciated :-)
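As a starting point for the second question, the candidate-probing loop from the first answer above can be adapted to a single source file (a sketch; the file name and the list of candidate encodings are only examples):
$ for enc in ISO-8859-1 ISO-8859-15 CP1252 UTF-8; do \
    if iconv -f $enc -t UTF-8 Parametres.cpp > /dev/null 2>&1; then \
      echo "$enc decodes cleanly"; \
    fi; \
  done
Any encoding that converts without error is only a candidate; you still have to inspect the output to see whether the accented characters come out right.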

Formatting a file changes its encoding on a Red Hat system

I have a bash script which extracts data from an Oracle database. I use spool to extract the data. After extraction I format the file by removing and replacing some characters. My problem is that after formatting, the files are in ANSI encoding instead of UTF-8.
Extraction with spool: the file is UTF-8.
Formatting with cat and tr, redirected into another file: this file is ANSI.
The same process works fine on an AIX system. I tried iconv but it doesn't work. Do you have an idea why the encoding changes from UTF-8 to ANSI? How can I correct it, please?
You should consistently use either ISO-8859-1 or UTF-8. In the latter case, don't use tr, as it doesn't (yet?) support multi-byte characters; use sed instead (e.g. sed 's/deletethis//g').
ISO-8859-1:
export LC_CTYPE=fr_FR.ISO-8859-1
export NLS_LANG=French_France.WE8ISO8859P1
# fetch data from Oracle, emulated by the following line
echo 'âêîôû' >test.latin1 # 5 bytes (+lineend)
# perform formatting, eg:
sed 's/ê/[e-circumflex]/g' test.latin1
# or the same with hex-codes:
sed $'s/\xea/[e-circumflex]/g' test.latin1
UTF-8:
export LC_CTYPE=fr_FR.UTF-8
export NLS_LANG=French_France.AL32UTF8
# fetch data from Oracle, emulated by the following line
echo 'âêîôû' >test.utf8 # 10 bytes (+lineend)
# perform formatting, eg:
sed 's/ê/[e-circumflex]/g' test.utf8
# or the same with hex-codes:
sed $'s/\xc3\xaa/[e-circumflex]/g' test.utf8
Note: no conversion (iconv, recode, etc) is required, just make sure NLS_LANG and LC_CTYPE are compatible. (Also, your terminal(emulator) should be set accordingly; for PuTTY it is Configuration/Category/Window/Translation/Remote-character-set.)
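A quick way to confirm that the two settings agree before running the extraction (a sketch; the variable names are the ones used above):
$ locale | grep LC_CTYPE
$ echo "$NLS_LANG"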
Original answer:
I cannot tell what's wrong with the formatting you perform, but here is one way to damage UTF-8-encoded text:
$ echo 'ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP' | iconv -f iso-8859-2 -t utf-8 | xxd
00000000: c381 5256 c38d 5a54 c5b0 52c5 9020 54c3 ..RV..ZT..R.. T.
00000010: 9c4b c396 5246 c39a 52c3 9347 c389 500a .K..RF..R..G..P.
$ echo 'ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP' | iconv -f iso-8859-2 -t utf-8 | tr -d $'\200-\237' | xxd
00000000: c352 56c3 5a54 c5b0 52c5 2054 c34b c352 .RV.ZT..R. T.K.R
00000010: 46c3 52c3 47c3 500a F.R.G.P.
Here the tr -d $'\200-\237' part deleted every byte in the range 0x80-0x9F, which includes the continuation bytes of many UTF-8 sequences (c381 became c3, c590 became c5), rendering the text unusable.
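For contrast, the same kind of replacement done with sed, as recommended above, leaves every multibyte sequence intact. A sketch, using the hex form of Ű (U+0170, UTF-8 c5 b0) in the same style as the hex-code examples above:
$ echo 'ÁRVÍZTŰRŐ TÜKÖRFÚRÓGÉP' | iconv -f iso-8859-2 -t utf-8 | sed $'s/\xc5\xb0/[U-double-acute]/g' | xxd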

Converting only non utf-8 files to utf-8

I have a set of md files, some of them are utf-8 encoded, and others are not (windows-1256 actually).
I want to convert only non-utf-8 files to utf-8.
The following script can partly do the job:
for file in *.md;
do
iconv -f windows-1256 -t utf-8 "$file" -o "${file%.md}.🆕.md";
done
I still need to exclude the files that are already UTF-8 from this process (maybe using the file command?). Try the following command to see what I mean:
file --mime-encoding *
Notice that although the file command isn't smart enough to detect the right character set of non-UTF-8 files, it's enough in this case that it can distinguish between UTF-8 and non-UTF-8 files.
Thanks in advance for help.
You can use for example an if statement:
if file --mime-encoding "$file" | grep -v -q utf-8 ; then
iconv -f windows-1256 -t utf-8 "$file" -o "${file%.md}.🆕.md";
fi
If grep doesn't find a match, it returns a status code indicating failure; the if statement tests that status code.
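Putting the question's loop and this test together gives a complete pass over the directory (a sketch; file names and encodings exactly as in the question):
for file in *.md;
do
if file --mime-encoding "$file" | grep -v -q utf-8 ; then
iconv -f windows-1256 -t utf-8 "$file" -o "${file%.md}.🆕.md";
fi
done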

Convert UTF-8 to UTF-16 in iconv

As far as I know, the UTF-8 form of "你好" ("hello" in English) is
\xe4\xbd\xa0\xe5\xa5\xbd, and the UTF-16 form is \u4f60\u597d (or, written as bytes, \x4f\x60\x59\x7d).
Now I use iconv to convert from UTF-8 to UTF-16. First, I created a new file named test with one line ("你好") in it, and ran the command:
cat test | iconv -f UTF-8 -t UNICODE
��`O}Y
It's not \x4f\x60\x59\x7d. How can I get the right output?
The output you got is not broken UTF-8; it is UCS-2: little-endian and preceded by a byte-order mark, i.e. the bytes \xff\xfe\x60\x4f\x7d\x59.
Try asking explicitly for the big-endian form you expected:
cat test | iconv -f UTF-8 -t UTF-16BE
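To check the raw bytes rather than guessing from how the terminal renders them, pipe the result through xxd; it should show 4f60 597d, followed by 000a for the trailing newline:
$ iconv -f UTF-8 -t UTF-16BE < test | xxd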

iconv: Converting from Windows ANSI to UTF-8 with BOM

I want to use iconv to convert files on my Mac. The goal is to go from "Windows ANSI" to "whatever Windows Notepad saves, if you tell it to use UTF-8".
This is what I want:
$ file names.csv
names.csv: UTF-8 Unicode (with BOM) text, with CRLF line terminators
This is what I use:
$ iconv -f CP1252 -t UTF-8 names.csv > names.utf8.csv
This is what I get (not what I want):
$ file names.utf8.csv
names.utf8.csv: UTF-8 Unicode text, with CRLF line terminators
How do I get the BOM?
You can add it manually by first echoing the bytes into the file:
echo -ne '\xEF\xBB\xBF' > names.utf8.csv
and then concatenating your required information at the end:
iconv -f CP1252 -t UTF-8 names.csv >> names.utf8.csv
Note the >> rather than >.
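The two steps can also be combined into a single command (a sketch; it assumes a shell whose built-in printf understands \x escapes, as bash's does):
( printf '\xEF\xBB\xBF'; iconv -f CP1252 -t UTF-8 names.csv ) > names.utf8.csv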
Note that "Windows ANSI" may not be CP1252 - that is configured by users.
The BOM is not necessary for UTF-8.
And Windows Notepad can save UTF-8 with or without BOM.
I needed the opposite (convert German text from UTF-8 to ANSI).
So the commands I used were:
1. iconv -l (check available encodings)
2. iconv -f UTF8 -t MS-ANSI de.txt > output.txt
Now if I open output.txt it is already in ANSI. Job done.
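If the German text contains characters with no equivalent in MS-ANSI, iconv will stop with an error; the //TRANSLIT suffix mentioned earlier asks it to approximate them instead (a sketch; the exact substitutions depend on the iconv build and locale):
iconv -f UTF8 -t MS-ANSI//TRANSLIT de.txt > output.txt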