Perl HTML Encoding Named Entities - perl

I would like to encode 'special chars' to their named entity.
My code:
use HTML::Entities;
print encode_entities('“');
Desired output:
“
And not:
“
Does anyone have an idea? Greetings

If you don't use use utf8;, the file is expected to be encoded using iso-8859-1 (or subset US-ASCII).
«“» is not found in iso-8859-1's charset.
If you use use utf8;, the file is expected to be encoded using UTF-8.
«“» is found in UTF-8's charset, Unicode.
You indicated your file isn't saved as UTF-8, so as far as Perl is concerned, your source file cannot possibly contain «“».
Odds are that you encoded your file using cp1252, an extension of iso-8859-1 that adds «“». That's not a valid choice.
Options:
[Best option] Save the file as UTF-8 and use the following:
use utf8;
use HTML::Entities;
print encode_entities('“');
Save the file as cp1252, but only use US-ASCII characters.
use charnames ':full';
use HTML::Entities;
print encode_entities("\N{LEFT DOUBLE QUOTATION MARK}");
or
use HTML::Entities;
print encode_entities("\N{U+201C}");
or
use HTML::Entities;
print encode_entities("\x{201C}");
[Unrecommended] Save the file as cp1252 and decode literals explicitly
use HTML::Entities;
print encode_entities(decode('cp1252', '“'));
Perl sees:
use HTML::Entities;
print encode_entities(decode('cp1252', "\x93"));

Perl doesn't know the encoding of your source file. If you include any special characters, you should always save it with UTF-8-encoding and put
use utf8;
at the top of your code. This will make sure your string literals contain codepoints, not just bytes.

I had the same problem and applied all of the above hints. It worked from within my perl script (CGI), e.g. ä = encode_entities("ä") produced the correct result. Yet applying encode_entities(param("test")) would encode the single bytes.
I found this advice: http://blog.endpoint.com/2010/12/character-encoding-in-perl-decodeutf8.html
Putting it together this is my solution which finally works:
use CGI qw/:standard/;
use utf8;
use HTML::Entities;
use Encode;
print encode_entities(decode_utf8(param("test")));
It is not clear to me why that was required, but it works. HTH

Related

Is it possible to print 'é' as '%C3%A9' in Perl?

I have some string with accent like "é" and the goal is to put my string into an URL so I need to convert "é" to "%C3%A9"
I have tested some module as HTML::Entitie, Encode or URI::Encode without any success
Actual Result:
%C3%83%C2%A9
Expected Result:
%C3%A9
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities;
use feature 'say';
use URI::Encode qw( uri_encode );
my $var = "é";
say $var;
$var = uri_encode( $var );
say $var;
You are missing use utf8.
The use utf8 pragma tells the Perl parser to allow UTF-8 in the
program text in the current lexical scope. The no utf8 pragma tells
Perl to switch back to treating the source text as literal bytes in
the current lexical scope. (On EBCDIC platforms, technically it is
allowing UTF-EBCDIC, and not UTF-8, but this distinction is academic,
so in this document the term UTF-8 is used to mean both).
Do not use this pragma for anything else than telling Perl that your script is written in UTF-8. The utility functions described below are
directly usable without use utf8;.

perl prints 3 wrong characters instead of unicode character

Been having trouble with the print function, I know I'm missing something small. I've been looking everywhere and trying stuff out but can't seem to find the solution.
I'm trying to print braille characters in perl, I got the value of 2881 from a table and converted it to hexa. When I try to print the hexadecimal character, perl prints 3 characters instead.
Code:
#!/usr/local/bin/perl
use utf8;
print "\x{AF1}";
Output:
C:\Users\ElizabethTosh\Desktop>perl testff.pl
Wide character in print at testff.pl line 3.
૱
Issue #1: You need to tell Perl to encode the output for your terminal.
Add the following to your program.
use Win32 qw( );
use open ':std', ':encoding(cp'.Win32::GetConsoleOutputCP().')';
use utf8; merely specifies the that source file is encoded using UTF-8 instead of ASCII.
Issue #2: Your terminal probably can't handle that character.
The console of US-English machines likely expect cp437. It's character set doesn't include any braille characters.
You could try switching to code page 65001 (UTF-8) using chcp 65001. You may also need to switch the console's font to one that includes braille characters. (MS Gothic worked for me, although it does weird things to the backslashes.)
Issue #3: You have the wrong character code.
U+0AF1 GUJARATI RUPEE SIGN (૱): "\x{AF1}" or "\N{U+0AF1}" or chr(2801)
U+0B41 ORIYA VOWEL SIGN U (ୁ): "\x{B41}" or "\N{U+0B41}" or chr(2881)
U+2801 BRAILLE PATTERN DOTS-1 (⠁): "\x{2801}" or "\N{U+2801}" or chr(10241)
U+2881 BRAILLE PATTERN DOTS-18 (⢁): "\x{2881}" or "\N{U+2881}" or chr(10369)
All together,
use strict;
use warnings;
use feature qw( say );
use Win32 qw( );
use open ':std', ':encoding(cp'.Win32::GetConsoleOutputCP().')';
say(chr($_)) for 0x2801, 0x2881;
Output:
>chcp 65001
Active code page: 65001
>perl a.pl
⠁
⢁
If you save a character with UTF-8, and it's displayed as 3 strange characters instead of 1, it means that the character is in the range U+0800 to U+FFFF, and that you decode it with some single-byte encoding instead of UTF-8.
So, change the encoding of your terminal to UTF-8. If you can't do this, redirect the output to a file:
perl testff.pl >file
And open the file with a text editor that supports UTF-8, to see if the character is displayed correctly.
You want to print the character U+2881 (⢁), and not U+0AF1. 2881 is already in hexadecimal.
To get rid of the Wide character in print warning, set the input and output of your Perl program to UTF-8:
use open ':std', ':encoding(UTF-8)';
Instead of use utf8;, which only enables the interpretation of the program text as UTF-8.
Summary
Source file (testff.pl):
#!/usr/local/bin/perl
use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';
print "\x{2881}";
Run:
> perl testff.pl
⢁

Why do I get garbled output when I decode some HTML entities but not others?

In Perl, I am trying to decode strings which contain numeric HTML entities using HTML::Entities. Some entities work, while "newer" entities don't. For example:
decode_entities('®'); # returns ® as expected
decode_entities('Ω'); # returns Ω instead of Ω
decode_entities('★'); # returns ★ instead of ★
Is there a way to decode these "newer" HTML entities in Perl? In PHP, the html_entity_decode function seems to decode all of these entities without any problem.
The decoding works fine. It's how you're outputting them that's wrong. For example, you may have sent the strings to a terminal without encoding them for that terminal first. This is achieved through the open pragma in the following program:
$ perl -e'
use open ":std", ":encoding(UTF-8)";
use HTML::Entities qw( decode_entities );
CORE::say decode_entities($_)
for "®", "Ω", "★";
'
®
Ω
★
Make sure your terminal can handle UTF-8 encoding. It looks like it's having problems with multibyte characters. You can also try to set UTF-8 for STDOUT in case you get wide character warnings.
use strict;
use warnings;
use HTML::Entities;
binmode STDOUT, ':encoding(UTF-8)';
print decode_entities('®'); # returns ®
print decode_entities('Ω'); # returns Ω
print decode_entities('★'); # returns ★
This gives me the correct/expected results.

How to use unicode in perl CGI param

I have a Perl CGI script accepting unicode characters as one of the params.
The url is of the form
.../worker.pl?text="some_unicode_chars"&...
In the perl script, I pass the $text variable to a shell script:
system "a.sh \"$text\" out_put_file";
If I hardcode the text in the perl script, it works well. However, the output makes no sense when $text is got from web using CGI.
my $q = CGI->new;
my $text = $q->param('text');
I suspect it's the encoding caused the problem. uft-8 caused me so many troubles. Anyone please help me?
Perhaps this will help. From Perl Programming/Unicode UTF-8:
By default, CGI.pm does not decode your form parameters. You can use
the -utf8 pragma, which will treat (and decode) all parameters as
UTF-8 strings, but this will fail if you have any binary file upload
fields. A better solution involves overriding the param method:
(example follows)
[Wrong - see Correction] Here's documentation for the utf-8 pragma. Since uploading binary data does not appear to be a concern for you, use of the utf-8 pragma appears to be the most straightforward approach.
Correction: Per the comment from #Slaven, do not confuse the general Perl utf8 pragma with the -utf-8 pragma that has been defined for use with CGI.pm:
-utf8
This makes CGI.pm treat all parameters as UTF-8 strings. Use this with
care, as it will interfere with the processing of binary uploads. It
is better to manually select which fields are expected to return utf-8
strings and convert them using code like this:
use Encode;
my $arg = decode utf8=>param('foo');
Follow Up: duleshi, you ask: But I still don't understand the differnce between decode in Encode and utf8::decode. How do the Encode and utf8 modules differ?
From the documentation for the utf8 pragma:
Note that this function does not handle arbitrary encodings. Therefore
Encode is recommended for the general purposes; see also Encode.
Put another way, the Encode module works with many different encodings (including UTF-8), whereas the utf8 functions work only with the UTF-8 encoding.
Here is a Perl program that demonstrates the equivalence of the two approaches to encoding and decoding UTF-8. (Also see the live demo.)
#!/usr/bin/perl
use strict;
use warnings;
use utf8; # allows 'ñ' to appear in the source code
use Encode;
my $word = "Español"; # the 'ñ' is permitted because of the 'use utf8' pragma
# Convert the string to its UTF-8 equivalent.
my $utf8_word = Encode::encode("UTF-8", $word);
# Use 'utf8::decode' to convert the string back to internal form.
my $word_again_via_utf8 = $utf8_word;
utf8::decode($word_again_via_utf8); # converts in-place
# Use 'Encode::decode' to convert the string back to internal form.
my $word_again_via_Encode = Encode::decode("UTF-8", $utf8_word);
# Do the two conversion methods produce the same result?
# Prints 'Yes'.
print $word_again_via_utf8 eq $word_again_via_Encode ? "Yes\n" : "No\n";
# Do we get back the original internal string after converting both ways?
# Prints 'Yes'.
print $word eq $word_again_via_Encode ? "Yes\n" : "No\n";
If you're passing UTF-8 data around in the parameters list, then you definitely want to be URI encoding them using the URI::Escape module. This will convert any extended characters to percent values which as easily printable and readable. On the receiving end you will then need to URI decode them before continuing.

Perl: String literal in module in latin1 - I want utf8

In the Date::Holidays::DK module, the names of certain Danish holidays are written in Latin1 encoding. For example, January 1st is 'Nytårsdag'. What should I do to $x below in order to get a proper utf8-encoded string?
use Date::Holidays::DK;
my $x = is_dk_holiday(2011,1,1);
I tried various combinations of use utf8 and no utf8 before/after use Date::Holidays::DK, but it does not seem to have any effect. I also triede to use Encode's decode, with no luck. More specifically,
use Date::Holidays::DK;
use Encode;
use Devel::Peek;
my $x = decode("iso-8859-1",
is_dk_holiday(2011,1,1)
);
Dump($x);
print "January 1st is '$x'\n";
gives the output
SV = PV(0x15eabe8) at 0x1492a10
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x1593710 "Nyt\303\245rsdag"\0 [UTF8 "Nyt\x{e5}rsdag"]
CUR = 10
LEN = 16
January 1st is 'Nyt sdag'
(with an invalid character between t and s).
use utf8 and no utf8 before/after use Date::Holidays::DK, but it does not seem to have any effect.
Correct. The utf8 pragma only indicates that the source code of the program is written in UTF-8.
I also tried to use Encode's decode, with no luck.
You did not perceive this correctly, you in fact did the right thing. You now have a string of Perl characters and can manipulate it.
with an invalid character between t and s
You also interpret this wrong, it is in fact the å character.
You want to output UTF-8, so you are lacking the encoding step.
my $octets = encode 'UTF-8', $x;
print $octets;
Please read http://p3rl.org/UNI for the introduction to the topic of encoding. You always must decode and encode, either explicitely or implicitely.
use utf8 only is a hint to the perl interpreter/compiler that your file is UTF-8 encoded. If you have strings with high-bit set, it will automatically encode them to unicode.
If you have a variable that is encoded in iso-8859-1 you must decode it. Then your variable is in the internal unicode format. That's utf8 but you shouldn't care which encoding perl uses internaly.
Now if you want to print such a string you need to convert the unicode string back to a byte string. You need to do a encode on this string. If you don't do an encode manually perl itself will encode it back to iso-8859-1. This is the default encoding.
Before you print your variable $x, you need to do a $x = encode('UTF-8', $x) on it.
For correct handling of UTF-8 you always need to decode() every external input over I/O. And you always need to encode() everything that leaves your program.
To change the default input/output encoding you can use something like this.
use utf8;
use open ':encoding(UTF-8)';
use open ':std';
The first line says that your source code is encoded in utf8. The second line says that every input/ouput should automatically encode in utf8. It is important to notice that a open() also open a file in utf8 mode. If you work with binary files you need to call a binmode() on the handle.
But the second line does not change handling of STDIN,STDOUT or STDERR. The third line will change that.
You can probably use the modul utf8:all that makes this process easier. But it is always good to understand how all this works behind the scenes.
To correct your example. One possible way is this:
#!/usr/bin/env perl
use Date::Holidays::DK;
use Encode;
use Devel::Peek;
my $x = decode("iso-8859-1",
is_dk_holiday(2011,1,1)
);
Dump($x);
print encode("UTF-8", "January 1st is '$x'\n");