Charset conversion from XXX to UTF-8, command line

I have a bunch of text files that are encoded in ISO-8859-2 (they contain some Polish characters). Is there a command-line tool for Linux/Mac that I could run from a shell script to convert this to a saner UTF-8?

Use iconv, for example like this:
iconv -f ISO-8859-2 -t UTF-8 input.txt > output.txt
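Since this is a bunch of files, a small shell loop might be convenient; something like this sketch (the .txt glob and the output naming are just placeholders):
for f in *.txt; do
    iconv -f ISO-8859-2 -t UTF-8 "$f" > "${f%.txt}.utf8.txt"
done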
Some more information:
You may want to specify UTF-8//TRANSLIT instead of plain UTF-8. To quote the manpage:
If the string //TRANSLIT is appended to to-encoding, characters being converted are transliterated when needed and possible. This means that when a character cannot be represented in the target character set, it can be approximated through one or several similar looking characters. Characters that are outside of the target character set and cannot be transliterated are replaced with a question mark (?) in the output.
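For example, sticking with the ISO-8859-2 input from the question, that would look something like:
iconv -f ISO-8859-2 -t UTF-8//TRANSLIT input.txt > output.txt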
For a full list of encoding codes accepted by iconv, execute iconv -l.
The example above makes use of shell redirection. Make sure you are not using a shell that mangles encodings on redirection – that is, do not use PowerShell for this.

recode latin2..utf8 myfile.txt
This will overwrite myfile.txt with the new version. You can also use recode without a filename as a pipe.
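Used as a pipe it might look something like this, which writes a new file instead of overwriting:
recode latin2..utf8 < myfile.txt > myfile.utf8.txt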

GNU 'libiconv' should be able to do the job.

Related

Encoding from ANSI when having non-Latin letters

I have a very old program (not a server or something on the internet) that I think uses the ANSI (Windows-1252) encoding.
The problem is that some inputs to this program are written in Arabic.
However, when I try to read the result, the Arabic words appear as very weird characters. For example, the input "نور" is converted to "äæÑ".
The program output should contain a combination of English words and Arabic words.
E.g., it outputs "Name äæÑ" while the correct output should be something like "Name نور".
In general, the English words are correct and readable with both UTF-8 and ANSI. But the Arabic words are read for example as "���" with UTF-8 and as "äæÑ" with ANSI.
I understand that this is because ANSI doesn't support non-Latin letters, but what should I do now? How can I convert the words back to Arabic?
Note: I know the exact input and the exact output that this program should produce.
Note2: I don't have the source code of this program. I just want to convert the output file of this program to have the correct words or encoding.
I solved this problem by typing the following in the terminal:
iconv -f WINDOWS-1256 -t utf8 < my_File.ged > result.ged
I tried to write Java code that does a similar thing, but it didn't give me the result I wanted.
I also tried the terminal command above with WINDOWS-1252 instead of WINDOWS-1256, but that didn't work. So I guess it is good to try different encodings until one works.
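One quick way to experiment along those lines is to loop over a few candidate encodings and inspect each result by eye; a sketch (the candidate list is only a guess):
for enc in WINDOWS-1252 WINDOWS-1256 ISO-8859-6; do
    iconv -f "$enc" -t UTF-8 my_File.ged > "result_$enc.ged" 2>/dev/null
done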

How to remove accents and keep Chinese characters using a command?

I'm trying to remove the accented characters (CAFÉ -> CAFE) while keeping all the Chinese characters, using a command. Currently I'm using iconv to remove the accented characters, but it turns out that all the Chinese characters come out as "?????". I can't figure out how to keep the Chinese characters while producing an ASCII-encoded file at the same time.
How can I do so?
iconv -f utf-8 -t ascii//TRANSLIT//IGNORE -o converted.bin test.bin
There is no way to keep Chinese characters in a file whose encoding is ASCII; this encoding only covers the code points between NUL (0x00) and DEL (0x7F), which basically means the basic control characters plus basic English alphabetics and punctuation. (Look at an ASCII chart for an enumeration.)
What you appear to be asking is how to remove accents from European alphabetics while keeping any Chinese characters intact in a file whose encoding is UTF-8. I believe there is no straightforward way to do this with iconv, but it should be comfortably easy to come up with a one-liner in a language with decent Unicode support, like perhaps Perl.
bash$ python -c 'print("\u4effCaf\u00e9\u9f00")' >unizh.txt
bash$ cat unizh.txt
仿Café鼀
bash$ perl -CSD -MUnicode::Normalize -pe '$_ = NFKD($_); s/\p{M}//g' unizh.txt
仿Cafe鼀
Maybe add the -i option to modify the file in-place; this simple demo just writes out the result to standard output.
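With -i that would be something like the following (untested sketch; -i.bak keeps a backup of the original):
perl -CSD -i.bak -MUnicode::Normalize -pe '$_ = NFKD($_); s/\p{M}//g' unizh.txt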
This has the potentially undesired side effect of normalizing each character to its NFKD form.
Code inspired by Remove accents from accented characters; Chinese characters to test with were gleaned from What's the complete range for Chinese characters in Unicode? (the ones on the boundary of the range are not particularly good test cases, so I just guessed a bit).
The iconv tool is meant to convert the way characters are encoded (i.e. saved to a file as bytes). By converting to ASCII (a very limited character set that contains the numbers, some punctuation, and the basic alphabet in upper and lower case), you can save only the characters that can reasonably be matched to that set. So an accented letter like É gets converted to E because that's a reasonably similar ASCII character, but a Chinese character like 公 is so far away from the ASCII character set that only question marks are possible.
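You can see both behaviours in a single run; something like the following (exact replacements vary between the glibc and libiconv implementations of iconv):
bash$ printf 'CAFÉ 公司\n' | iconv -f UTF-8 -t ASCII//TRANSLIT
CAFE ??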
The answer by tripleee is probably what you need. But if the conversion to NFKD form is a problem for you, an alternative is using a direct list of characters you want to replace:
sed 'y/áàäÁÀÄéèëÉÈË/aaaAAAeeeEEE/' <test.bin >converted.bin
where you need to list the original characters and their replacements in the same order. Obviously it is more work, so do this only if you need full control over what changes you make.

Japanese filename being read as 8.3 on Windows

I'm working on a Windows system with Perl 5.8.8 (yes, it's old but it's what's on the server and I can't change that).
I've been sent a CSV data file from our Japan office. The filename is 'a151110a(01_WeLBA①).csv'. My Perl script can open and read the file, but it sees the filename as 'A15111~1.CSV'; that is, it's interpreting the filename in 8.3 format. I've tried using glob and readdir to create a file list and both give the same result.
The problem is that I need to be able to pull some info out of the filename. I need the part that is inside the parentheses, the '01_WeLBA' part. But Perl doesn't seem to "see" that. Those parentheses have a space (or other whitespace character) either just before or just after them. If I manually remove those and the numeral-'1'-inside-a-circle character, then Perl sees the filename as it is.
Is there a way to get Perl to 'see' the filename as it appears in Windows Explorer?
Use Win32::LongPath's opendirL, readdirL and closedirL.
Windows provides two versions of each system call that accepts strings:
The "UNICODE" version (suffixed with "W" for "wide") accepts/returns strings encoded using UTF-16le. This version supports all Unicode characters.
The "ANSI" version (suffixed with "A") accepts/returns strings encoded using the Active Code Page (ACP). The "A" version only supports a small subset of the Unicode characters.
You can obtain the ACP for your system using the following:
perl -Mv5.14 -MWin32 -e"say Win32::GetACP()"
Unfortunately, Perl's functions (named operators) use the "A" version of system calls and expect/return text encoded using the ACP. This severely limits which file names can be passed to them.
For example, my system's ACP is 1252, so the "A" version of system calls would not support Japanese characters. This means there is nothing I can do to make open, -e, etc. work with file names containing Japanese characters. Ouch.
Win32::LongPath provides alternatives to Perl's builtins that use the "W" version of system calls, and thus support all Unicode characters. For example, -e is just a call to stat, and it provides statL.
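A rough sketch of what that might look like (untested; the object-based directory interface here follows the module's documentation, so double-check the exact calling conventions in the Win32::LongPath POD):
use strict;
use warnings;
use Win32::LongPath;

binmode STDOUT, ':encoding(UTF-8)';  # so Unicode names print cleanly

my $dir = Win32::LongPath->new();
$dir->opendirL('.') or die "opendirL failed: $^E";
foreach my $name ($dir->readdirL()) {
    print "$name\n";  # full Unicode name, not the 8.3 short form
}
$dir->closedirL();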
Have a look at Win32::LongPath; it does exactly what you need.

How can I convert Japanese characters to Unicode in Perl?

Can you point me to a tool to convert Japanese characters to Unicode?
CPAN gives me "Unicode::Japanese"; hope this is helpful to start with. You can also look at the article on Character Encodings in Perl and the Perl documentation on Unicode for more information.
See http://p3rl.org/UNI.
use Encode qw(decode encode);
my $bytes_in_sjis_encoding = "\x88\xea\x93\xf1\x8e\x4f";
my $unicode_string = decode('Shift_JIS', $bytes_in_sjis_encoding); # returns 一二三
my $bytes_in_utf8_encoding = encode('UTF-8', $unicode_string); # returns "\xe4\xb8\x80\xe4\xba\x8c\xe4\xb8\x89"
For batch conversion from the command line, use piconv:
piconv -f Shift_JIS -t UTF-8 < infile > outfile
First, you need to find out the encoding of the source text if you don't know it already.
The most common encodings for Japanese are:
euc-jp (often used on Unixes and some web pages, with greater kanji coverage than shift-jis)
shift-jis (Microsoft also added some extensions to shift-jis, called cp932, which is often used by non-Unicode Windows programs)
iso-2022-jp (a distant third)
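If you need to guess among these, the Encode::Guess module that ships with Perl can help; a rough sketch (the suspect list and file name are placeholders):
perl -MEncode::Guess -0777 -ne 'my $e = guess_encoding($_, qw(euc-jp shiftjis 7bit-jis)); print ref $e ? $e->name : "ambiguous: $e", "\n"' infile.txt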
A common encoding conversion library for many languages is iconv (see http://en.wikipedia.org/wiki/Iconv and http://search.cpan.org/~mpiotr/Text-Iconv-1.7/Iconv.pm) which supports many other encodings as well as Japanese.
This question seems a bit vague to me; I'm not sure what you're asking. Usually you would use something like this:
open my $file, '<:encoding(cp932)', 'JapaneseFile.txt' or die "Can't open file: $!";
to open a file with Japanese characters. Then Perl will automatically convert it into its internal Unicode format.

Japanese characters in a Windows batch file

What is the secret to Japanese characters in a Windows XP .bat file?
We have a script for opening a file off disk in kiosk mode:
@ECHO OFF
"%ProgramFiles%\Internet Explorer\iexplore.exe" -K "%CD%\XYZ.htm"
It works fine when the OS is English, and it works fine on the Japanese OS when XYZ is made up of English characters, but when XYZ is made up of Japanese characters, they are mangled into gibberish by the time IE tries to find the file.
If the batch file is saved as Unicode or Unicode big-endian, the script won't even run.
I have tried various ways of encoding the Japanese characters. Ampersand escapes do not work (&#12345;).
Percent escapes do not work (%xx%xx%xx).
ABC works, but AB%43 becomes AB3 in the error message, so it looks like the percent escape is trying to do parameter substitution. This is confirmed because %043 puts in the name of the script!
One thing that does work is pasting the Japanese characters into a command prompt.
@ECHO OFF
CD "%ProgramFiles%\Internet Explorer\"
Set /p URL="file to open: "
start iexplore.exe -K %URL%
This tells me that iexplore.exe will accept and parse the parameter correctly when it contains Japanese characters, but not when they are written into the script.
So it would be nice to know what the secret may be to getting the parameter into IE successfully via the batch file, as opposed to via the clipboard and an environment variable.
Any suggestions greatly appreciated!
best regards
Richard Collins
P.S.
Another post has made this suggestion, which I have yet to follow up on:
You might have more luck in cmd.exe if you opened it in UNICODE mode. Use "cmd /U".
Batch renaming of files with international chars on Windows XP
I will need to find out if this can be done from inside the script.
For the record, a simple answer has been found for this question.
If the batch file is saved as ANSI, it works!
First of all: Batch files are pretty limited in their internationalization support. There is no direct way of telling cmd what codepage a batch file is in. UTF-16 is out anyway, since cmd won't even parse that.
I have detailed an option in my answer to the following question:
Batch file encoding
which might be helpful for your needs.
In principle it boils down to the following:
Use an encoding which has single-byte mappings for ASCII
Put a chcp ... at the start of the batch file
Use that codepage for the rest of the file
You can use codepage 65001, which is UTF-8 but make sure that your file doesn't include the U+FEFF character at the start (used as byte-order mark in UTF-16 and UTF-32 and sometimes used as marker for UTF-8 files as well). Otherwise the first command in the file will produce an error message.
So just use the following:
@echo off
chcp 65001
"%ProgramFiles%\Internet Explorer\iexplore.exe" -K "%CD%\XYZ.htm"
and save it as UTF-8 without BOM (Note: Notepad won't allow you to do that) and it should work.
cmd /u won't do anything here; that advice is pretty much bogus. The /U switch only specifies that Unicode will be used for redirection of input and output (and piping). It has nothing to do with the encoding the console uses for output or for reading batch files.
URL encoding won't help you either. cmd is hardly a web browser, and outside of HTTP and the web, URL encoding isn't exactly widespread (hence the name). cmd uses percent signs for environment variables and for arguments to batch files and subroutines.
"Ampersand escapes", also known as character entities from HTML and XML, won't work either, because cmd is not HTML or XML either. The ampersand is used to execute multiple commands in a single line.
I too suffered this frustrating problem in batch/cmd files. However, so far as I can see, no one yet has stated the reason why this problem occurs, here or in other, similar posts at StackOverflow. The nearest statement addressing this was:
“First of all: Batch files are pretty limited in their internationalization support. There is no direct way of telling cmd what codepage a batch file is in.”
Here is the basic problem. Cmd files are the Windows-2000+ successor to MS-DOS and IBM-DOS bat(ch) files. MS and IBM DOS (1984 vintage) were written in the IBM-PC character set (code page 437). There, the 8th-bit codes were assigned (or “clothed” with) characters different from those assigned to the corresponding codes of Windows, ANSI, or Unicode. The presumption of CP437 encoding is unalterable (except, as previously noted, through cmd.exe /u). Where the characters of the IBM-PC set have exact counterparts in the Unicode set, Windows Explorer remaps them to the Unicode counterparts. Alas, even Windows-1252 characters like š and ¾ have no counterpart in code page 437.
Here is another way to see the problem. Try opening your batch/cmd script using the Windows Edit.com program (at C:\Windows\system32\Edit.com). The Windows-1252 character 0145 ‘ (Unicode 8216) instead appears as IBM-PC 145 æ. A batch command to rename Mary'sFile.txt as Mary‘sFile.txt fails, as it is interpreted as MaryæsFile.txt.
This problem can be avoided in the case of copying a file named Mary’sFile.txt: cite it as Mary?sFile.txt, e.g.:
xCopy Mary?sFile.txt Mary?sLastFile.txt
You will see a similar treatment (substitution of question marks) in a DIR list of files having Unicode characters.
Obviously, this is useless unless an extant file has the Unicode characters. This solution’s range is paltry and inadequate, but please make what use of it you can.
You can try saving the batch file in the Shift-JIS encoding (code page 932), which is the ANSI code page on Japanese Windows.