How can I display extended ASCII characters in Perl?

How do I display character 192 ( └ ) in Perl?

What you want is to be able to print unicode, and the answer is in perldoc perluniintro.
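You can use \x{nnnn}, where nnnn is the hexadecimal code point, or \N{...} with the character's name (the -CS switch makes the standard streams UTF-8, which avoids a "Wide character" warning):
perl -CS -E 'say "\x{2514}"; use charnames ":full"; say "\N{BOX DRAWINGS LIGHT UP AND RIGHT}"'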

To use exactly these codes, your terminal must support Code Page 437, which contains the box-drawing characters. Alternatively, you can use the derived CP850, which has fewer box-drawing characters.
Such characters also exist in the Unicode Box Drawing block. The character you want is written in Perl as \N{U+2514}. More details are in perlunicode.

That looks like the Code Page 437 encoding. Perl is probably just outputting the bytes that you give it, and your terminal is probably expecting UTF-8.
So you need to decode it to Unicode, then re-encode it in UTF-8.
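A minimal sketch of that decode-then-encode step, assuming the input byte really is Code Page 437 and the terminal expects UTF-8 (both are assumptions, and the variable names are only illustrative):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Byte 192 (0xC0) as it would arrive from a CP437 source.
my $cp437_byte = chr(192);

# Decode the byte into a Perl character string (one code point).
my $char = decode('cp437', $cp437_byte);

# Re-encode that character as UTF-8 bytes for a UTF-8 terminal.
my $utf8_bytes = encode('UTF-8', $char);

printf "code point: U+%04X\n", ord($char);
print  "$utf8_bytes\n";
```

Because $utf8_bytes holds only byte values, printing it needs no encoding layer on STDOUT.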
EDIT: Correct encoding.

As usual, Jon Skeet nails it: the 192 code is in the "extended ASCII" range. I suggest you follow @Douglas Leeder's advice, but I'm not sure which encoding www.LookupTables.com is giving you; ISO-8859-1 maps 192 to "À", and Mac OS Roman maps it to "¿".

Is there a solution that works on ALL characters?
The user says they wanted to use a Latin-1 extended charset character, so let's try an example from this block. If they wanted the Æ character, they would run:
print "\x{00C6}";
Output: �
TL;DR: Character Encoding Modes in Perl
So, wait, what just happened there? You'll notice that other ways of producing the character, such as chr(...), \N{U+...}, and even unpack(...), have the same issue. That's right: the problem isn't with any of these functions, but with an underlying character abstraction layer. In this case, you'll want to declare this layer early in your code:
use open qw( :std :encoding(UTF-8) );
print "\x{00C6}";
Output: Æ
Now I can spell 'Ælf' correctly!
Why did that happen?
There is a note in the perldoc for the chr() function:
Note that characters from 128 to 255 (inclusive) are by default internally not encoded as UTF-8 for backward compatibility reasons.
For this reason, characters in this range need the use open pragma shown above, which puts a UTF-8 encoding layer on the standard streams.
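To make the effect of the layer visible, here is a small sketch that writes the same character to two in-memory filehandles, one without and one with an :encoding(UTF-8) layer. The in-memory handles stand in for STDOUT purely for illustration; the variable names are assumptions.

```perl
use strict;
use warnings;

my $ae = "\x{00C6}";    # LATIN CAPITAL LETTER AE, in the 128-255 range

# No encoding layer: Perl writes the character as the single byte 0xC6,
# which is not a valid UTF-8 sequence on its own.
open(my $raw, '>', \my $raw_out) or die "Cannot open in-memory handle: $!";
print {$raw} $ae;
close($raw);

# With an :encoding(UTF-8) layer the same character is written as two bytes.
open(my $enc, '>:encoding(UTF-8)', \my $enc_out) or die "Cannot open in-memory handle: $!";
print {$enc} $ae;
close($enc);

# %vX prints the ordinal of every byte in hex, joined by dots.
printf "without layer: %vX\n", $raw_out;
printf "with layer:    %vX\n", $enc_out;
```

A UTF-8 terminal that receives the lone 0xC6 byte has nothing valid to draw, which is why the replacement character appears.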

Related

Unexpected behavior of HTML::Entities

I'm a newbie user of Perl's HTML::Entities routine decode_entities() to
convert headlines scraped from news media websites.
Here's a good result:
Before: Texas grand jury clears Planned Parenthood, indicts its accusers
After: Texas grand jury clears Planned Parenthood, indicts its accusers
But here's a puzzling result:
Before: Big changes could be coming to Utah’s criminal justice system
After: Big changes could be coming to Utahâs criminal justice system
Notice that not only was the ’ entity not converted to a single quote, the non-breaking-space entity wasn't decoded into a space, unlike in the first example.
What's going on?
The difference between your first and second example is that the first one does not contain any code points above 255, while the second one does. So, the first string can be displayed according to the native 8-bit character set of your system (most likely ISO 8859-1/Latin 1), but the second cannot. The reason for this, according to perlunicode, is that "using a code point above 255 implies Unicode for the whole string".
Since you now have Unicode characters in your string, you'll need to properly encode your text for output, otherwise you'll see "strange characters" (just like the ones in your example!). Since you didn't provide a Minimal, Complete, and Verifiable example, I'm not sure what your output method is, but let's just assume STDOUT to make things easy. There are a couple different ways to encode your text into an octet stream:
Manually, using the Encode module
Automatically, using the correct I/O layer
I prefer the second option because it's less tedious. To do that, we'll just call binmode() on STDOUT:
use strict;
use warnings;
use HTML::Entities;
my $str = 'Big changes could be coming to Utah&#8217;s criminal justice&nbsp;system';
my $decoded = decode_entities($str);
binmode(STDOUT, ':encoding(UTF-8)');
printf("%s\n%vx\n", $decoded, $decoded);
Output:
$ perl foo.pl
Big changes could be coming to Utah’s criminal justice system
42.69.67.20.63.68.61.6e.67.65.73.20.63.6f.75.6c.64.20.62.65.20.63.6f.6d.69.6e.67.20.74.6f.20.55.74.61.68.2019.73.20.63.72.69.6d.69.6e.61.6c.20.6a.75.73.74.69.63.65.a0.73.79.73.74.65.6d
You can see that there's code point 2019 (right single quotation mark) between characters 68 and 73 (h and s, respectively), and also an a0 (non-breaking space) between 65 and 73, which would be e and s.
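For comparison, the first option (encoding manually with the Encode module) might look like the sketch below. The string is a shortened stand-in for the decoded headline, written with escapes so the example does not depend on HTML::Entities being installed.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Stand-in for the decoded headline: U+2019 is the right single quotation
# mark and U+00A0 the non-breaking space seen in the output above.
my $decoded = "Utah\x{2019}s criminal justice\x{A0}system";

# Encode the character string into UTF-8 octets by hand.
my $octets = encode('UTF-8', $decoded);

# $octets holds only byte values, so printing it triggers no
# "Wide character" warning even without an I/O layer.
print $octets, "\n";
printf "%d characters, %d bytes\n", length($decoded), length($octets);
```

The drawback is that every print needs its own encode() call, which is why the layer approach is less tedious.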
In addition to the aforementioned perlunicode reference, you should also read perluniintro, perlunitut (short!), and perlunifaq if you're interested in learning more about how Perl handles Unicode and character encoding in general.

Checklist for going the Unicode way with Perl

I am helping a client convert their Perl flat-file bulletin board site from ISO-8859-1 to Unicode.
Since this is my first time, I would like to know if the following "checklist" is complete. Everything works well in testing, but I may be missing something which would only occur at rare occasions.
This is what I have done so far (forgive me for only including "summary" code examples):
Made sure files are always read and written in UTF-8:
use open ':utf8';
Made sure CGI input is received as UTF-8 (the site is not using CGI.pm):
s{%([a-fA-F0-9]{2})}{ pack ("C", hex ($1)) }eg; # Kept from existing code
s{%u([0-9A-F]{4})}{ pack ('U*', hex ($1)) }eg; # Added
utf8::decode $_;
Made sure text is printed as UTF-8:
binmode STDOUT, ':utf8';
Made sure browsers interpret my content as UTF-8:
Content-Type: text/html; charset=UTF-8
<meta http-equiv="content-type" content="text/html;charset=UTF-8">
Made sure forms send UTF-8 (probably not necessary as long as page encoding is set):
accept-charset="UTF-8"
Don't think I need the following, since inline text (menus, headings, etc.) is only in ASCII:
use utf8;
Does this look reasonable, or am I missing something?
EDIT: I should probably also mention that we will be running a one-time batch to read all existing text data files and save them in UTF-8 encoding.
The :utf8 PerlIO layer is not strict enough. It permits input that fulfills the structural requirement of UTF-8 byte sequences, but for good security, you want to reject stuff that is not actually valid Unicode. Replace it everywhere with the PerlIO::encoding layer, thus: :encoding(UTF-8).
For the same reason, always use Encode::decode('UTF-8', …), not Encode::decode_utf8(…).
Make decoding fail hard with an exception, compare:
perl -E'use Encode qw(decode); say decode(q(UTF-8), qq(\x{c0})); say q(survived)'
perl -E'use Encode qw(decode); say decode(q(UTF-8), qq(\x{c0}), Encode::FB_CROAK); say q(survived)'
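In a script rather than a one-liner, the exception can be caught with eval so that bad input is rejected instead of killing the process. This is only a sketch; decode_utf8_strict is a hypothetical helper name, not part of Encode.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns the decoded string, or undef if the
# octets are not valid UTF-8.
sub decode_utf8_strict {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on malformed input; LEAVE_SRC keeps
    # decode() from modifying $octets in place.
    my $string = eval {
        decode('UTF-8', $octets, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return $string;
}

my $good = decode_utf8_strict("\xC3\x86");   # valid encoding of U+00C6
my $bad  = decode_utf8_strict("\xC0");       # a lone 0xC0 is never valid UTF-8

print defined $good ? "good input decoded\n" : "good input rejected\n";
print defined $bad  ? "bad input decoded\n"  : "bad input rejected\n";
```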
You are not taking care of surrogate pairs in the %u notation. This is the only major bug I can see in your list. Item 2 (receiving CGI input) is correctly written as:
use Encode qw(decode);
use URI::Escape::XS qw(decodeURIComponent);
$_ = decode('UTF-8', decodeURIComponent($_), Encode::FB_CROAK);
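To illustrate what handling surrogate pairs in the %u notation means, here is a pure-Perl sketch. It is not how URI::Escape::XS implements it; decode_percent_u is a hypothetical helper, and the sketch assumes the input contains only %uXXXX escapes and plain text.

```perl
use strict;
use warnings;

# Hypothetical helper: decodes non-standard %uXXXX escapes, joining a
# high/low surrogate pair into the single code point it represents.
sub decode_percent_u {
    my ($text) = @_;

    # First pass: a high surrogate (D800-DBFF) followed by a low
    # surrogate (DC00-DFFF) becomes one supplementary-plane character.
    $text =~ s{%u(D[89AB][0-9A-F]{2})%u(D[C-F][0-9A-F]{2})}
              {chr(0x10000 + ((hex($1) - 0xD800) << 10) + (hex($2) - 0xDC00))}gie;

    # Second pass: every remaining %uXXXX is a single BMP code point.
    $text =~ s{%u([0-9A-F]{4})}{chr(hex($1))}gie;

    return $text;
}

my $decoded = decode_percent_u('%u00C6lf %uD83D%uDE00');

# Print the code points rather than the characters themselves.
printf "U+%04X\n", ord($_) for split //, $decoded;
```

The regex in the question would instead turn %uD83D%uDE00 into two separate surrogate code points, which are not valid characters on their own.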
Do not mess around with the functions from the utf8 module. Its documentation says so. It's intended as a pragma to tell Perl that the source code is in UTF-8. If you want to do encoding/decoding, use the Encode module.
Add the utf8 pragma anyway in every module. It cannot hurt, but you will future-proof code maintenance in case someone adds those string literals. See also CodeLayout::RequireUseUTF8.
Employ encoding::warnings to smoke out remaining implicit upgrades. Verify for each case whether this is intended/needed. If yes, convert it to an explicit upgrade with Unicode::Semantics. If not, this is a hint that you should have earlier had a decoding step. The documents from http://p3rl.org/UNI give the advice to immediately decode after receiving the data from the source. Go over the places where the code is reading/writing data and verify you have a decoding/encoding step, either explicitly (decode('UTF-8', …)) or implicitly through a layer (use open pragma, binmode, 3 argument form of open).
For debugging: if you are not sure which string is in a variable, and in which representation, at a certain time, you cannot just print it. Use the tools Devel::StringInfo and Devel::Peek instead.
You're always missing something. The problem is usually the unknown unknowns, though. :)
Effective Perl Programming has a Unicode chapter that covers many of the Perl basics. The one Item we didn't cover though, was everything you have to do to ensure your database server and web server do the right thing.
Some other things you'll need to do:
Upgrade to the most recent Perl you can. Unicode stuff got a lot easier in 5.8, and even easier in 5.10.
Ensure that site content is converted to UTF-8. You might write a crawler to hit pages and look for the Unicode substitution character (that thing that looks like a diamond with a question mark in it). Let's see if I can make it in StackOverflow: �
Ensure that your database server supports UTF-8, that you've set up the tables with UTF-8 aware columns, and that you tell DBI to use the UTF-8 support in its driver (some of this is in the book).
Ensure that anything looking at @ARGV translates the items from the locale of the command line to UTF-8 (it's in the book).
If you find anything else, please let us know by answering your own question with whatever we left out. ;)

Perl strings internals

How are Perl strings represented internally? What encoding is used? How do I handle different encodings properly?
I've been using Perl for quite a long time, but my work didn't involve much string handling in different encodings, and when I ran into a minor encoding problem I usually resorted to some shamanic actions.
Until now I thought of Perl strings as sequences of bytes, which fit my tasks pretty well. Now I need to process a UTF-8 encoded file, and here the trouble starts.
First, I read file into string like this:
open(my $in, '<', $ARGV[0]) or die "cannot open file $ARGV[0] for reading";
binmode($in, ':utf8');
my $contents;
{
local $/;
$contents = <$in>;
}
close($in);
then simply print it:
print $contents;
And I get two things: a warning, Wide character in print at <scriptname> line <n>, and garbage in the console. So I can conclude that Perl strings have a concept of a "character" that can be "wide" or not, but when printed these "wide" characters are represented in the console as multiple bytes, not as a single "character".
(I now wonder why all my previous experience with binary files worked just as I expected, without any "character" issues.)
Why, then, do I see garbage in the console? If Perl stores strings as characters in some known encoding, I don't think it would be a big problem to find out the console encoding and print the text properly. (I use Windows, BTW.)
If Perl stores strings as variable-width character sequences (e.g. using the same UTF-8 encoding), why is it done this way? From my C experience, handling such strings is a PAIN.
Update.
I use two computers for testing, one runs Windows 7 x64 with English language pack installed, but with Russian regional settings (so I have cp866 as OEM codepage and cp1251 as ANSI) with ActivePerl 5.10.1 x64; another runs Windows XP 32 bit Russian localization with Cygwin Perl 5.10.0.
Thanks to links, now I have much more solid understanding on what's going on and how things should be done.
Setting utf8 before reading from the file is good, it automagically decodes the bytes into the internal encoding. (Which is also UTF-8 but you don't need to know, and shouldn't rely on.)
Before printing you need to encode the characters back to bytes.
use Encode;
utf8::encode($contents);
There is also a two argument form of encode, for other encodings than unicode. (That sentence echoes too much, doesn't it?)
Here is a good reference. (Would have been more, but it's my first post.) Check out perlunitut too, and the unicode article on Joel on Software.
http://www.ahinea.com/en/tech/perl-unicode-struggle.html
Oh, and it must use multi-byte strings, because otherwise it's just not unicode.
Perl strings are stored internally in one of two encodings: either an 8-bit, byte-oriented native encoding, or UTF-8. For backwards compatibility, the assumption is that all I/O and strings are in the native encoding unless otherwise specified. The native encoding is usually an 8-bit superset of ASCII, but this can be changed with use locale.
In your sample you call binmode on your input handle changing it to use :utf8 semantics. One effect of this is that all strings read from this handle will be encoded as UTF-8. print writes to STDOUT by default, and STDOUT defaults to expecting native encoded characters.
Perl in an attempt to do the right thing will allow a UTF-8 string to be sent to a native encoded output, but if there is no encoding attached to that handle then it has to guess how to output multi-byte characters and it will almost certainly guess wrong. That is what the warning means, a multi-byte character was sent to a stream only expecting single byte characters and the result was that the character was probably damaged in translation.
Depending on what you want to accomplish you can use the Encode module mentioned by dylan to convert the UTF-8 data to a single byte character set that can be printed safely or if you know that whatever is attached to STDOUT can handle UTF-8 you can use binmode(STDOUT, ':utf8'); to tell Perl you want any data sent to STDOUT to be sent as UTF-8.
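Both options can be sketched as follows. The choice of cp866 is an assumption taken from the OEM code page mentioned in the question; another console may need a different encoding name.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# A six-letter Cyrillic word ("privet", hello), written with escapes so
# the example does not depend on the encoding of the source file.
my $text = "\x{041F}\x{0440}\x{0438}\x{0432}\x{0435}\x{0442}";

# Option 1: convert to a single-byte character set the console understands.
my $cp866_bytes = encode('cp866', $text);

# Option 2: keep UTF-8, for a console or file that expects it.
my $utf8_bytes = encode('UTF-8', $text);

printf "cp866: %d bytes, UTF-8: %d bytes\n",
    length($cp866_bytes), length($utf8_bytes);

# Roughly equivalent to option 2 for everything printed afterwards:
# binmode(STDOUT, ':encoding(UTF-8)');
```

Either way, the encoding step happens once at the output boundary, and the rest of the program keeps working with characters.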
You should mention your actual Windows and Perl versions, as this really depends on the versions you use and the language packs installed.
Otherwise, have a look at the perlunicode manual first:
Perl uses logically-wide characters to represent strings internally.
It will confirm your statements.
Windows does not install support for all Unicode characters by default, so this might be the reason for your issue. You may need to install an additional language pack.

How can I convert japanese characters to unicode in Perl?

Can you point me to a tool to convert Japanese characters to Unicode?
CPAN gives me "Unicode::Japanese". I hope this is a helpful starting point. You can also look at the article on Character Encodings in Perl and the perldoc for Unicode for more information.
See http://p3rl.org/UNI.
use Encode qw(decode encode);
my $bytes_in_sjis_encoding = "\x88\xea\x93\xf1\x8e\x4f";
my $unicode_string = decode('Shift_JIS', $bytes_in_sjis_encoding); # returns 一二三
my $bytes_in_utf8_encoding = encode('UTF-8', $unicode_string); # returns "\xe4\xb8\x80\xe4\xba\x8c\xe4\xb8\x89"
For batch conversion from the command line, use piconv:
piconv -f Shift_JIS -t UTF-8 < infile > outfile
First, you need to find out the encoding of the source text if you don't know it already.
The most common encodings for Japanese are:
euc-jp: (often used on Unixes and some web pages etc with greater Kanji coverage than shift-jis)
shift-jis (Microsoft also added some extensions to shift-jis which is called cp932, which is often used on non-Unicode Windows programs)
iso-2022-jp is a distant third
A common encoding conversion library for many languages is iconv (see http://en.wikipedia.org/wiki/Iconv and http://search.cpan.org/~mpiotr/Text-Iconv-1.7/Iconv.pm) which supports many other encodings as well as Japanese.
This question seems a bit vague to me, I'm not sure what you're asking. Usually you would use something like this:
open my $file, "<:encoding(cp932)", "JapaneseFile.txt" or die "Cannot open JapaneseFile.txt: $!";
to open a file with Japanese characters. Then Perl will automatically convert it into its internal Unicode format.

How to detect malformed UTF characters

I want to detect and replace malformed UTF-8 characters with blank space using a Perl script while loading the data using SQL*Loader. How can I do this?
Consider Python. It lets you extend codecs with user-defined error handlers, so you can replace undecodable bytes with anything you want.
import codecs
# The handler returns the replacement text and the position at which to resume decoding.
codecs.register_error('spacer', lambda ex: (' ', ex.start + 1))
s = b'spam\xb0\xc0eggs\xd0bacon'.decode('utf8', 'spacer')
print(s)
This prints:
spam  eggs bacon
EDIT: (Removed bit about SQL Loader as it seems to no longer be relevant.)
One problem is going to be working out what counts as the "end" of a malformed UTF-8 character. It's easy to say what's illegal, but it may not be obvious where the next legal character starts.
RFC 3629 describes the structure of UTF-8 characters. If you take a look at that, you'll see that it's pretty straightforward to find invalid characters, AND that the next character boundary is always easy to find (it's a character < 128, or one of the "long character" start markers, with leading bits of 110, 1110, or 11110).
But BKB is probably correct: the easiest answer is to let Perl do it for you, although I'm not sure what Perl does when it detects incorrect UTF-8 with that filter in effect.
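Since the question asks for a Perl script, here is one way it might be done with the Encode module, which accepts a code reference as its CHECK argument and calls it for input it cannot decode. This is a sketch, not necessarily the filter referred to above; utf8_or_spaces is a hypothetical helper name.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: decode octets as UTF-8, substituting a space
# wherever the input is malformed.
sub utf8_or_spaces {
    my ($octets) = @_;
    # The code reference is called for malformed input; whatever it
    # returns is placed in the result instead.
    return decode('UTF-8', $octets, sub { ' ' });
}

my $clean = utf8_or_spaces("spam\xb0\xc0eggs\xd0bacon");

# Encode again before printing, so the output is valid UTF-8.
print encode('UTF-8', $clean), "\n";
```

Valid multi-byte sequences pass through untouched, so only the damaged bytes are blanked before the data is handed to the loader.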