How can I convert Japanese characters to Unicode in Perl?

Can you point me to a tool for converting Japanese characters to Unicode?

CPAN has Unicode::Japanese, which is a helpful starting point. You can also look at the article on Character Encodings in Perl and the perldoc pages on Unicode for more information.
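For example, a minimal sketch of converting Shift_JIS bytes to UTF-8 bytes with Unicode::Japanese (method names as per that module's documentation; the sample bytes are the same 一二三 used in the answer below):
use Unicode::Japanese;
my $sjis_bytes = "\x88\xea\x93\xf1\x8e\x4f";
my $utf8_bytes = Unicode::Japanese->new($sjis_bytes, 'sjis')->utf8;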

See http://p3rl.org/UNI.
use Encode qw(decode encode);
my $bytes_in_sjis_encoding = "\x88\xea\x93\xf1\x8e\x4f";
my $unicode_string = decode('Shift_JIS', $bytes_in_sjis_encoding); # returns 一二三
my $bytes_in_utf8_encoding = encode('UTF-8', $unicode_string); # returns "\xe4\xb8\x80\xe4\xba\x8c\xe4\xb8\x89"
For batch conversion from the command line, use piconv:
piconv -f Shift_JIS -t UTF-8 < infile > outfile

First, you need to find out the encoding of the source text if you don't know it already.
The most common encodings for Japanese are:
euc-jp: often used on Unix systems and some web pages, with greater kanji coverage than Shift_JIS
shift-jis: Microsoft also added some extensions to Shift_JIS, called cp932, which is often used by non-Unicode Windows programs
iso-2022-jp: a distant third
A common encoding conversion library for many languages is iconv (see http://en.wikipedia.org/wiki/Iconv and http://search.cpan.org/~mpiotr/Text-Iconv-1.7/Iconv.pm), which supports many other encodings as well as the Japanese ones.
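If you don't know which of these encodings the source uses, the core Encode::Guess module can help narrow it down. A sketch, where $bytes stands in for your raw input:
use Encode;
use Encode::Guess;
my $enc = guess_encoding($bytes, qw/euc-jp shiftjis 7bit-jis/);
die "cannot guess encoding: $enc" unless ref $enc;
my $unicode_string = $enc->decode($bytes);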

This question seems a bit vague to me; I'm not sure what you're asking. Usually you would use something like this:
open my $file, "<:encoding(cp932)", "JapaneseFile.txt" or die "cannot open: $!";
to open a file with Japanese characters. Then Perl will automatically convert it into its internal Unicode format.
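Putting it together, a minimal sketch that reads a CP932 file and writes it back out as UTF-8 (the output file name is made up for illustration):
use strict;
use warnings;
open my $in,  '<:encoding(cp932)', 'JapaneseFile.txt'      or die "open: $!";
open my $out, '>:encoding(UTF-8)', 'JapaneseFile.utf8.txt' or die "open: $!";
print {$out} $_ while <$in>;
close $out;
close $in;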

Related

How to remove accents and keep Chinese characters using a command?

I'm trying to remove the accented characters (CAFÉ -> CAFE) while keeping all the Chinese characters, using a command. Currently I'm using iconv to remove the accented characters, but it turns out that all the Chinese characters get turned into "?????". I can't figure out a way to keep the Chinese characters in an ASCII-encoded file at the same time.
How can I do so?
iconv -f utf-8 -t ascii//TRANSLIT//IGNORE -o converted.bin test.bin
There is no way to keep Chinese characters in a file whose encoding is ASCII; that encoding only covers the code points from NUL (0x00) to DEL (0x7F), which basically means the control characters plus English letters, digits, and punctuation. (Look at an ASCII chart for an enumeration.)
What you appear to be asking is how to remove accents from European alphabetics while keeping any Chinese characters intact, in a file whose encoding is UTF-8. I believe there is no straightforward way to do this with iconv, but it is easy enough as a one-liner in a language with decent Unicode support, such as Perl.
bash$ python -c 'print("\u4effCaf\u00e9\u9f00")' >unizh.txt
bash$ cat unizh.txt
仿Café鼀
bash$ perl -CSD -MUnicode::Normalize -pe '$_ = NFKD($_); s/\p{M}//g' unizh.txt
仿Cafe鼀
Maybe add the -i option to modify the file in-place; this simple demo just writes out the result to standard output.
This has the potentially undesired side effect of normalizing each character to its NFKD form.
Code inspired by "Remove accents from accented characters"; the Chinese characters to test with were gleaned from "What's the complete range for Chinese characters in Unicode?" (the ones on the boundary of the range are not particularly good test cases, so I just guessed a bit).
The iconv tool is meant to convert the way characters are encoded (i.e. saved to a file as bytes). By converting to ASCII (a very limited character set that contains the numbers, some punctuation, and the basic alphabet in upper and lower case), you can save only the characters that can reasonably be matched to that set. So an accented letter like É gets converted to E because that's a reasonably similar ASCII character, but a Chinese character like 公 is so far away from the ASCII character set that only question marks are possible.
The answer by tripleee is probably what you need. But if the conversion to NFKD form is a problem for you, an alternative is using a direct list of characters you want to replace:
sed 'y/áàäÁÀÄéèëÉÈË/aaaAAAeeeEEE/' <test.bin >converted.bin
where you need to list the original characters and their replacements in the same order. Obviously it is more work, so do this only if you need full control over what changes you make.
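The same idea also works as a Perl one-liner with tr///, since Perl's tr operates on characters once the streams are declared UTF-8 (a sketch, assuming UTF-8 input as in the answer above):
perl -CSD -pe 'tr/áàäÁÀÄéèëÉÈË/aaaAAAeeeEEE/' <test.bin >converted.bin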

Writing UTF-8 with Perl: a bug?

My problem is simple. I want to output UTF-8 with my Perl script.
This code is not working.
use utf8;
open(TROIS,">utf8.out.2.txt");
binmode(TROIS, ":utf8");
print TROIS "Hello\n";
The output file is not in UTF-8. (My script file is encoded in UTF-8.)
But if I insert an accented character into my print, then it works and my output file is in UTF-8. Example:
print TROIS "é\n";
I use ActivePerl 5.10 under Windows. What might be the problem?
You're writing nothing but ASCII characters with Hello\n. Fortunately, ASCII is still perfectly valid UTF-8. However, auto-detection by editors will most likely not report the encoding as UTF-8 because they have nothing to judge the file content's encoding by. I suspect you're simply not familiar with how file encodings work.
A file's encoding is a property that in general is not stored in a file or externally alongside a file. A lot of editors simply assume a certain encoding based on the operating system they run on or the environment settings (system language), or they include some kind of semi-intelligent auto-detection (which may still fail because file encodings cannot be auto-detected unambiguously). That's why you have to tell Perl that a file is encoded in UTF-8 when you read it with binmode or the corresponding I/O layer.
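For example, a sketch reusing the file name from the question, declaring the layer either at open time or afterwards with binmode:
open my $in, '<:encoding(UTF-8)', 'utf8.out.2.txt' or die $!;
# or, on an already-open handle:
# binmode($in, ':encoding(UTF-8)');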
Now, there is one way of marking a text file's encoding if said encoding is one of the UTF family (UTF-8, UTF-16 LE and BE, UTF-32 LE and BE). That way is called the BOM (byte order mark). However, producing files with a BOM dates from a time when UTF-8 was not as widespread as it is today, and it usually creates more problems than it solves, especially because many editors and applications don't support BOMs at all. Therefore BOMs should probably be avoided nowadays.
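If you do decide you need one, a BOM is just the character U+FEFF written through an encoding layer; a minimal sketch (the file name is made up):
open my $out, '>:encoding(UTF-8)', 'utf8.with-bom.txt' or die $!;
print {$out} "\x{FEFF}";    # becomes the bytes EF BB BF on disk
print {$out} "Hello\n";
close $out;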
There are exceptions, of course, in which the file format itself declares the file's encoding. XML comes to mind, with the encoding pseudo-attribute in its declaration (<?xml version="1.0" encoding="..."?>). However, even for such files you have to recognize whether the file uses a multi-byte encoding with at least two bytes per character (UTF-16/UTF-32) or not, in order to parse that declaration in the first place. It's simply not simple ;)

How can I display Extended ASCII Codes characters in Perl?

How do I display character 192 ( └ ) in Perl?
What you want is to be able to print Unicode, and the answer is in perldoc perluniintro.
You can use \x{nnnn} where nnnn is the hex code point, or you can use \N{...} with the character name:
perl -CS -E 'say "\x{2514}"; use charnames ":full"; say "\N{BOX DRAWINGS LIGHT UP AND RIGHT}"'
To use exactly these codes, your terminal must support Code Page 437, which contains the frame characters. Alternatively you can use the derived CP850, which has fewer box-drawing characters.
Such box-drawing characters also exist in Unicode (the Box Drawing block). The character you want is written in Perl as \N{U+2514}. More details in perlunicode.
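A minimal sketch putting that notation to use (assuming a UTF-8 terminal):
binmode(STDOUT, ':encoding(UTF-8)');
print "\N{U+2514}\n";    # └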
That looks like the Code page 437 encoding. Perl is probably just outputting the bytes you give it, and your terminal is probably expecting UTF-8.
So you need to decode it to Unicode, then re-encode it in UTF-8.
EDIT: Correct encoding.
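For instance, a sketch using the core Encode module (in CP437 the byte 0xC0 is └):
use Encode qw(decode encode);
my $char = decode('cp437', "\xC0");    # bytes -> the character U+2514
print encode('UTF-8', $char), "\n";    # re-encode as UTF-8 for the terminal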
As usual, Jon Skeet nails it: code 192 is in the "extended ASCII" range. I suggest you follow @Douglas Leeder's advice, but I'm not sure which encoding www.LookupTables.com is giving you; ISO-8859-1 maps 192 to "À", and Mac OS Roman maps it to "¿".
Is there a solution that works on ALL characters?
The user says they wanted to use a character from the Latin-1 Supplement block, so let's try an example from it! If they wanted the Æ character, they would run...
print "\x{00C6}";
Output: �
TL;DR: Character Encoding Modes in Perl
So, wait, what just happened there? You'll notice that other ways of producing the character, such as chr(...), \N{U+...}, and even unpack(...), have the same issue. That's right: the problem isn't in any of these functions, but in an underlying character abstraction layer. In this case, you'll want to set up this layer early in your code:
use open qw( :std :encoding(UTF-8) );
print "\x{00C6}";
Output: Æ
Now I can spell 'Ælf' correctly!
Why did that happen?
There is a note in the perldoc for the chr() function:
Note that characters from 128 to 255 (inclusive) are by default internally not encoded as UTF-8 for backward compatibility reasons.
For this reason, characters from this block need the use open pragma shown above so that the standard handles encode output as UTF-8.
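If you only need to affect STDOUT rather than every handle, setting the layer on it directly also works (a sketch):
binmode(STDOUT, ':encoding(UTF-8)');
print "\x{00C6}\n";    # prints Æ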

Working out file encoding: I know the string, know the character, what is the encoding?

I'm adding data from a csv file into a database. If I open the CSV file, some of the entries contain bullet points - I can see them. file says it is encoded as ISO-8859.
$ file data_clean.csv
data_clean.csv: ISO-8859 English text, with very long lines, with CRLF, LF line terminators
I read it in as follows and convert it from ISO-8859-1 to UTF-8, which my database requires.
row = [unicode(x.decode("ISO-8859-1").strip()) for x in row]
print row[4]
description = row[4].encode("UTF-8")
print description
This gives me the following:
'\xa5 Research and insight \n\xa5 Media and communications'
¥ Research and insight
¥ Media and communications
Why is the \xa5 bullet character being converted to a yen sign?
I assume because I'm reading it in as the wrong encoding, but what is the right encoding in this case? It isn't cp1252 either.
More generally, is there a tool where you can specify (i) string (ii) known character, and find out the encoding?
I don't know of any general tool, but this Wikipedia page (linked from the page on codepage 1252) shows that A5 is a bullet point in the Mac OS Roman codepage.
More generally, is there a tool where you can specify (i) string (ii) known character, and find out the encoding?
You can easily write one in Python.
(Examples use 3.x syntax.)
import encodings
ENCODINGS = set(encodings._aliases.values()) - {'mbcs', 'tactis'}

def _decode(data, encoding):
    try:
        return data.decode(encoding)
    except UnicodeError:
        return None

def possible_encodings(encoded, decoded):
    return {enc for enc in ENCODINGS if _decode(encoded, enc) == decoded}
So if you know that your bullet point is U+2022, then
>>> possible_encodings(b'\xA5', '\u2022')
{'mac_iceland', 'mac_roman', 'mac_turkish', 'mac_latin2', 'mac_cyrillic'}
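Since this thread is Perl-focused, here is a rough Perl analogue of that helper (an illustrative sketch using the core Encode module): try every encoding Encode knows about and keep those that decode the given bytes to the expected string.
use strict;
use warnings;
use Encode;

sub possible_encodings {
    my ($bytes, $expected) = @_;
    my @hits;
    for my $enc ( Encode->encodings(':all') ) {
        my $copy    = $bytes;    # decode() may modify its input with some CHECK flags
        my $decoded = eval { decode($enc, $copy, Encode::FB_CROAK) };
        push @hits, $enc if defined $decoded && $decoded eq $expected;
    }
    return @hits;
}

print "$_\n" for possible_encodings("\xA5", "\x{2022}");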
You could try
iconv -f latin1 -t utf8 data_clean.csv
if you know it is indeed ISO-Latin-1. Note, though, that in ISO-Latin-1 \xA5 really is ¥, which is exactly what you were seeing, so Latin-1 is probably not the source encoding.
Edit: Actually this seems to be a known problem on the Mac, involving Word (or similar) and Arial, and printing or converting to PDF; there are some issues with fonts and whatnot. Maybe you need to explicitly massage the file first. Sounds familiar?
http://forums.quark.com/p/14849/61253.aspx
http://www.macosxhints.com/article.php?story=2003090403110643

Perl strings internals

How are Perl strings represented internally? What encoding is used? How do I handle different encodings properly?
I've been using Perl for quite a long time, but my work didn't include much string handling in different encodings, and when I encountered a minor problem that had something to do with encodings I usually resorted to some shamanic actions.
Until this moment I thought of Perl strings as sequences of bytes, which fit pretty well with my tasks. Now I need to do some processing of a UTF-8 encoded file, and here the trouble starts.
First, I read file into string like this:
open(my $in, '<', $ARGV[0]) or die "cannot open file $ARGV[0] for reading";
binmode($in, ':utf8');
my $contents;
{
    local $/;
    $contents = <$in>;
}
close($in);
then simply print it:
print $contents;
And I get two things: a warning, Wide character in print at <scriptname> line <n>, and garbage in the console. So I can conclude that Perl strings have a concept of a "character" that can be "wide" or not, but when printed, these "wide" characters are represented in the console as multiple bytes, not as single characters.
(I wonder now why all my previous experience with binary files worked as I expected, without any "character" issues.)
Why, then, do I see garbage in the console? If Perl stores strings as characters in some known encoding, I don't think it would be a big problem to find out the console encoding and print the text properly. (I use Windows, BTW.)
If Perl stores strings as variable-width character sequences (e.g. using the same UTF-8 encoding), why is it done this way? From my C experience, handling variable-width strings is a PAIN.
Update.
I use two computers for testing: one runs Windows 7 x64 with the English language pack installed but with Russian regional settings (so I have cp866 as the OEM code page and cp1251 as ANSI) and ActivePerl 5.10.1 x64; the other runs a Russian localization of 32-bit Windows XP with Cygwin Perl 5.10.0.
Thanks to links, now I have much more solid understanding on what's going on and how things should be done.
Setting utf8 before reading from the file is good; it automagically decodes the bytes into the internal encoding. (Which is also UTF-8, but you don't need to know that and shouldn't rely on it.)
Before printing you need to encode the characters back to bytes.
use Encode;
utf8::encode($contents);
There is also a two-argument form of encode, for encodings other than Unicode. (That sentence echoes too much, doesn't it?)
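For example (a sketch; cp1251 is just an arbitrary target encoding, and $characters stands in for an already-decoded string):
use Encode;
my $cp1251_bytes = encode('cp1251', $characters);    # characters -> cp1251 bytes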
Here is a good reference. (Would have been more, but it's my first post.) Check out perlunitut too, and the Unicode article on Joel on Software.
http://www.ahinea.com/en/tech/perl-unicode-struggle.html
Oh, and it has to use multi-byte strings, because otherwise it's just not Unicode.
Perl strings are stored internally in one of two encodings: either an 8-bit byte-oriented native encoding, or UTF-8. For backwards compatibility, the assumption is that all I/O and strings are in the native encoding unless otherwise specified. The native encoding is usually 8-bit ASCII, but this can be changed with use locale.
In your sample you call binmode on your input handle, changing it to use :utf8 semantics. One effect of this is that all strings read from this handle will be encoded as UTF-8. print writes to STDOUT by default, and STDOUT defaults to expecting natively encoded characters.
Perl, in an attempt to do the right thing, will allow a UTF-8 string to be sent to a natively encoded output, but if there is no encoding attached to that handle it has to guess how to output multi-byte characters, and it will almost certainly guess wrong. That is what the warning means: a multi-byte character was sent to a stream that only expects single-byte characters, and the result is that the character was probably damaged in translation.
Depending on what you want to accomplish, you can use the Encode module mentioned by dylan to convert the UTF-8 data to a single-byte character set that can be printed safely, or, if you know that whatever is attached to STDOUT can handle UTF-8, you can use binmode(STDOUT, ':utf8'); to tell Perl to send any data to STDOUT as UTF-8.
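A sketch of that second approach, continuing from the question's code ($contents is the decoded string read earlier):
binmode(STDOUT, ':utf8');
print $contents;    # no more "Wide character in print" warning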
You should mention your actual Windows and Perl versions, as this really depends on the versions you use and the installed language packs.
Otherwise, have a look at the perlunicode manual first:
Perl uses logically-wide characters to represent strings internally.
It will confirm your statements.
Windows does not install support for all Unicode characters by default, which might be the reason for your issue; you may need to install an additional language pack.