Getting all Unicode aliases for a codepoint - perl

The charnames pragma provides charnames::viacode, which returns the "best" name for a given code point.
For instance:
$ perl -Mcharnames=:full -E'say charnames::viacode(ord "A")'
LATIN CAPITAL LETTER A
Is there a convenient way to discover all known aliases for this name from within Perl?

To get the Unicode aliases of a code point, you can use the following:
use Unicode::UCD qw( charprop );
my @aliases =
map { s/:.*//sr }
split /,/,
charprop($ucp, "Name_Alias"); # $ucp is the Unicode code point as a number.
For example, this returns SP for U+0020 SPACE.
The complete list can be found here.
For all the values you can pass to \N{}, see here.
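The snippet above can be wrapped in a small helper. This is only a sketch: unicode_aliases is a hypothetical name, not part of Unicode::UCD, and it assumes a Perl recent enough to ship charprop (5.22 or later, as far as I know) and that Name_Alias comes back as comma-separated "NAME: type" entries, as the code above implies.

```perl
use strict;
use warnings;
use Unicode::UCD qw( charprop );

# Hypothetical helper: returns the Name_Alias entries for a numeric code point.
sub unicode_aliases {
    my ($ucp) = @_;
    my $raw = charprop($ucp, "Name_Alias") // '';   # guard against undef
    return map { s/:.*//sr } split /,/, $raw;       # drop the ": type" suffix
}

# Print the aliases for SPACE and LINE FEED.
printf "U+%04X: %s\n", $_, join(", ", unicode_aliases($_)) for 0x20, 0x0A;
```

A code point without aliases simply yields an empty list.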

Related

Why do I get garbled output when I decode some HTML entities but not others?

In Perl, I am trying to decode strings which contain numeric HTML entities using HTML::Entities. Some entities work, while "newer" entities don't. For example:
decode_entities('&#174;'); # returns ® as expected
decode_entities('&#937;'); # returns garbled characters instead of Ω
decode_entities('&#9733;'); # returns garbled characters instead of ★
Is there a way to decode these "newer" HTML entities in Perl? In PHP, the html_entity_decode function seems to decode all of these entities without any problem.
The decoding works fine. It's how you're outputting them that's wrong. For example, you may have sent the strings to a terminal without encoding them for that terminal first. This is achieved through the open pragma in the following program:
$ perl -e'
use open ":std", ":encoding(UTF-8)";
use HTML::Entities qw( decode_entities );
CORE::say decode_entities($_)
for "&#174;", "&#937;", "&#9733;";
'
®
Ω
★
Make sure your terminal can handle UTF-8 encoding. It looks like it's having problems with multibyte characters. You can also try to set UTF-8 for STDOUT in case you get wide character warnings.
use strict;
use warnings;
use HTML::Entities;
binmode STDOUT, ':encoding(UTF-8)';
print decode_entities('&#174;'); # prints ®
print decode_entities('&#937;'); # prints Ω
print decode_entities('&#9733;'); # prints ★
This gives me the correct/expected results.
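To check that the decoding itself is fine regardless of the terminal, one option is to inspect the code points rather than print the characters. A minimal sketch, assuming the entities are written in decimal numeric form:

```perl
use strict;
use warnings;
use HTML::Entities qw( decode_entities );

# Report code points instead of characters, so the output encoding
# cannot distort what we see.
my @report = map {
    my $decoded = decode_entities($_);
    sprintf "%s -> U+%04X (length %d)", $_, ord($decoded), length($decoded);
} '&#174;', '&#937;', '&#9733;';

print "$_\n" for @report;
# &#174; -> U+00AE (length 1)
# &#937; -> U+03A9 (length 1)
# &#9733; -> U+2605 (length 1)
```

Each entity decodes to a single character with the expected code point, so any garbling happens at output time.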

Unicode property "Space" in Perl 5 and Perl 6

Is the unicode-property \p{Space} a Perl5 extension?
In Perl5 Space matches all white-spaces
my $s = "one\ttwo\nthree";
$s =~ s/\p{Space}/*/g;
say $s;
# one*two*three
while in Perl 6 it seems to match only a simple space
my $s = "one\ttwo\nthree";
$s.=subst( /<:Space>/, '*', :g );
say $s;
# one two
# three
Tabulators are of category Control, not Space. The property you're interested in is actually called White_Space, and that's what you need to use in Perl 6:
say so "\t" ~~ /<:White_Space>/
Several alternative spellings appear to be available as well, including WhiteSpace, WSpace and its lower-case variants, but not WS.
There is also a built-in rule <ws>, which matches zero or more whitespace characters instead of a single one, and of course \s, which already uses Unicode semantics.
It's not really an extension, but it's a shorthand name for another Unicode property, \p{White_Space}. This is documented in detail in the manpage perluniprops.
I have no idea what the Perl6 people are doing here.
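On the Perl 5 side, the equivalence of the two property names can be checked directly. A small sketch; the sample characters are my own choice:

```perl
use strict;
use warnings;

# Tab, newline, space, no-break space, and a letter as a negative control.
my @samples = ("\t", "\n", " ", "\x{A0}", "a");

my @rows = map {
    my $space = /\p{Space}/       ? 1 : 0;
    my $white = /\p{White_Space}/ ? 1 : 0;
    sprintf "U+%04X Space=%d White_Space=%d", ord($_), $space, $white;
} @samples;

print "$_\n" for @rows;
# U+0009 Space=1 White_Space=1
# U+000A Space=1 White_Space=1
# U+0020 Space=1 White_Space=1
# U+00A0 Space=1 White_Space=1
# U+0061 Space=0 White_Space=0
```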

Perl ord and chr working with unicode

To my horror I've just found out that chr doesn't work with Unicode, although it does something. The man page is all but clear
Returns the character represented by that NUMBER in the character set. For example, chr(65) is "A" in either ASCII or Unicode, and chr(0x263a) is a Unicode smiley face.
Indeed I can print a smiley using
perl -e 'print chr(0x263a)'
but things like chr(0x00C0) do not work. I see that my perl v5.10.1 is a bit ancient, but when I paste various strange letters in the source code, everything's fine.
I've tried funny things like use utf8 and use encoding 'utf8', I haven't tried funny things like use v5.12 and use feature 'unicode_strings' as they don't work with my version, I was fooling around with Encode::decode to find out that I need no decoding as I have no byte array to decode. I've read much more documentation than ever before, and found quite a few interesting things but nothing helpful. It looks like a sort of the Unicode Bug but there's no usable solution given. Moreover I don't care about the whole string semantics, all I need is a trivial function.
So how can I convert a number into a string consisting of the single character corresponding with it, so that for example real_chr(0xC0) eq 'À' holds?
The first answer I've got explains quite everything about IO, but I still don't understand why
#!/usr/bin/perl -w
use strict;
use utf8;
use encoding 'utf8';
print chr(0x00C0) eq 'À' ? 'eq1' : 'ne1', " - ", chr(0x263a) eq '☺' ? 'eq1' : 'ne1', "\n";
print 'À' =~ /\w/ ? "match1" : "no_match1", " - ", chr(0x00C0) =~ /\w/ ? "match2" : "no_match2", "\n";
prints
ne1 - eq1
match1 - no_match2
It means that the manually entered 'À' differs from chr(0x00C0). Moreover, the former is a word constituent character (correct!) while the latter is not (but should be!).
First,
perl -le'print chr(0x263A);'
is buggy. Perl even tells you as much:
Wide character in print at -e line 1.
That doesn't qualify as "working". So while they differ in how they fail, neither of the following gives you what you want:
perl -le'print chr(0x263A);'
perl -le'print chr(0x00C0);'
To properly output the UTF-8 encoding of those Unicode code points, you need to tell Perl to encode them with UTF-8.
$ perl -le'use open ":std", ":encoding(UTF-8)"; print chr(0x263A);'
☺
$ perl -le'use open ":std", ":encoding(UTF-8)"; print chr(0x00C0);'
À
Now on to the "why".
File handles can only transmit bytes, so unless you tell it otherwise, Perl expects the strings you print to contain bytes. That means the string you provide to print cannot contain anything but bytes, or in other words, it cannot contain characters over 255. The output is exactly what you provide:
$ perl -e'print map chr, 0x00, 0x65, 0xC0, 0xF0' | od -t x1
0000000 00 65 c0 f0
0000004
This is useful. This is different than what you want, but that doesn't make it wrong. If you want something different, you just need to tell Perl what you want.
By adding an :encoding layer, the handle now expects a string of Unicode characters, or as I call it, "text". The layer tells Perl how to convert the text into bytes.
$ perl -e'
use open ":std", ":encoding(UTF-8)";
print map chr, 0x00, 0x65, 0xC0, 0xF0, 0x263a;
' | od -t x1
0000000 00 65 c3 80 c3 b0 e2 98 ba
0000011
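The same text-to-bytes conversion can be done explicitly with the core Encode module instead of an :encoding layer. A short sketch that shows the resulting bytes in hex:

```perl
use strict;
use warnings;
use Encode qw( encode );

# Encode each one-character string to UTF-8 and dump the bytes as hex.
my @lines = map {
    sprintf "U+%04X -> %s", $_, unpack("H*", encode("UTF-8", chr $_));
} 0x65, 0xC0, 0x263A;

print "$_\n" for @lines;
# U+0065 -> 65
# U+00C0 -> c380
# U+263A -> e298ba
```

The bytes match the od output above: one byte for U+0065, two for U+00C0, three for U+263A.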
You're right that chr doesn't know or care about Unicode. Like length, substr, ord and reverse, chr implements a basic string function, not a Unicode function. That doesn't mean it can't be used to work with text strings. As you've seen, the problem wasn't with chr but with what you did with the string after you built it.
A character is an element of a string, and a character is a number. That means a string is just a sequence of numbers. Whether you treat those numbers as Unicode code points (text), packed IP addresses or temperature measurements is entirely up to you and the functions to which you pass the strings.
Here are a few examples of operators that do assign meaning to the strings they receive as operands:
m// expects a string of Unicode code points.
connect expects a sequence of bytes that represent a sockaddr_in structure.
print with a handle without :encoding expects a sequence of bytes.
print with a handle with :encoding expects a sequence of Unicode code points.
etc
So how can I convert a number into a string consisting of the single character corresponding with it, so that for example real_chr(0xC0) eq 'À' holds?
chr(0xC0) eq 'À' does hold. Did you remember to tell Perl you encoded your source code using UTF-8 by using use utf8;? If you didn't tell Perl, Perl actually sees a two-character string on the RHS.
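The effect of use utf8 on a literal can be seen by comparing lengths. A sketch, assuming the script file is saved as UTF-8 and the À is the precomposed U+00C0:

```perl
use strict;
use warnings;
use utf8;                          # the source code is UTF-8

my $with = 'À';                    # one character, U+00C0

my $without;
{
    no utf8;                       # literal is now read as raw bytes
    $without = 'À';                # two characters: 0xC3, 0x80
}

my $equal = chr(0xC0) eq $with ? "equal" : "different";
printf "%d %d %s\n", length($with), length($without), $equal;
# 1 2 equal
```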
Regarding the question you've added:
There are problems with the encoding pragma. I recommend against using it. Instead, use
use open ':std', ':encoding(UTF-8)';
That'll fix one of the problems. The other problem you are encountering is with
chr(0x00C0) =~ /\w/
It's a known bug that's intentionally left broken for backwards compatibility reasons. That is, unless you request a more recent version of the language as follows:
use 5.014; # use 5.012; *might* suffice.
A workaround that works as far back as 5.8:
my $x = chr(0x00C0);
utf8::upgrade($x);
$x =~ /\w/
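The three behaviours can be compared side by side. This sketch assumes no locale and no unicode_strings feature are in effect, and a Perl of 5.14 or later for the /u flag:

```perl
use strict;
use warnings;

my $x = chr(0xC0);                     # stored internally as a single byte

my $plain = $x =~ /\w/  ? 1 : 0;       # legacy byte semantics: no match
my $u     = $x =~ /\w/u ? 1 : 0;       # /u forces Unicode rules

my $y = $x;
utf8::upgrade($y);                     # same string, different internal storage
my $upgraded = $y =~ /\w/ ? 1 : 0;
my $same     = $x eq $y  ? 1 : 0;

print "plain=$plain u=$u upgraded=$upgraded same=$same\n";
# plain=0 u=1 upgraded=1 same=1
```

The upgraded copy still compares equal to the original; only the regex behaviour changes.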

how to replace the special character with escape character

my $c= 'ODD_`!"£$%^&*(){}][##;:/?.>,<|\'
I want to escape every special character in it with a backslash.
How can I achieve this quickly? The result should be:
my $c= 'ODD_\`\!\"\£\$\%\^\&\*\(\)\{\}\]\[\#\,\;\:\/\?\.\>\,\<\|\\'
Use quotemeta:
#!/usr/bin/env perl
use warnings; use strict;
my $c = 'ODD_`!"£$%^&*(){}][##;:/?.>,<|\\';
print quotemeta($c), "\n";
Note that your definition of $c would not compile as you have to escape \ even in single quoted strings.
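If the escaped string is meant for use in a regular expression, \Q...\E applies quotemeta inline. A brief sketch with a made-up literal:

```perl
use strict;
use warnings;

my $literal = 'a.b*c';
my $escaped = quotemeta($literal);                      # a\.b\*c

# With \Q...\E the string matches only itself.
my $exact     = 'a.b*c' =~ /^\Q$literal\E$/ ? 1 : 0;    # 1
my $lookalike = 'aXbbc' =~ /^\Q$literal\E$/ ? 1 : 0;    # 0

# Without escaping, . and * act as metacharacters.
my $unescaped = 'aXbbc' =~ /^$literal$/ ? 1 : 0;        # 1

print "$escaped $exact $lookalike $unescaped\n";
```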
While I think that Sinan's answer is correct for what I am assuming you need (based on your list of characters to escape), for completeness I will add the module URI::Escape, which percent-encodes unsafe characters in URIs. It has a facility to specify which characters count as unsafe, so perhaps it could help you too.

Why can't I set $LIST_SEPARATOR in Perl?

I want to set the LIST_SEPARATOR in perl, but all I get is this warning:
Name "main::LIST_SEPARATOR" used only once: possible typo at ldapflip.pl line 7.
Here is my program:
#!/usr/bin/perl -w
@vals;
push @vals, "a";
push @vals, "b";
$LIST_SEPARATOR='|';
print "@vals\n";
I am sure I am missing something obvious, but I don't see it.
Thanks
Only the mnemonic is available
$" = '|';
unless you
use English;
first.
As described in perlvar. Read the docs, please.
The following names have special meaning to Perl. Most punctuation names have reasonable mnemonics, or analogs in the shells. Nevertheless, if you wish to use long variable names, you need only say
use English;
at the top of your program. This aliases all the short names to the long names in the current package. Some even have medium names, generally borrowed from awk. In general, it's best to use the
use English '-no_match_vars';
invocation if you don't need $PREMATCH, $MATCH, or $POSTMATCH, as it avoids a certain performance hit with the use of regular expressions. See English.
perlvar is your friend:
• $LIST_SEPARATOR
• $"
This is like $, except that it applies to array and slice values interpolated into a double-quoted string (or similar interpreted string). Default is a space. (Mnemonic: obvious, I think.)
$LIST_SEPARATOR is only available if you use English;. If you don't want to use English; in all your programs, use $" instead. Same variable, just with a terser name.
Slightly off-topic (the question is already well answered), but I don't get the attraction of English.
Cons:
A lot more typing
Names not more obvious (ie, I still have to look things up)
Pros:
?
I can see the benefit for other readers - especially people who don't know Perl very well at all. But in that case, if it's a question of making code more readable later, I would rather this:
{
local $" = '|'; # Set interpolated list separator to '|'
# fun stuff here...
}
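For comparison, here is a sketch of the same localized change written with the long name from English. The variable names are mine:

```perl
use strict;
use warnings;
use English qw( -no_match_vars );   # enables $LIST_SEPARATOR and friends

my @vals = ('a', 'b');

my $joined;
{
    local $LIST_SEPARATOR = '|';    # same variable as $"
    $joined = "@vals";
}
my $default = "@vals";              # separator is a space again here

print "$joined\n$default\n";
# a|b
# a b
```

Because of local, the change is undone at the end of the block.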
You should use the strict pragma:
use strict;
You might want to use the diagnostics pragma to get additional hints about the warnings (which you already have enabled with the -w flag):
use diagnostics;