Am I using utf8::is_utf8 correctly? - perl

Does this work correctly? Some error messages are already decoded, and some need to be decoded to get correct output.
#!/usr/bin/env perl
use warnings;
use strict;
use utf8;
use open qw(:utf8 :std);
use Encode qw(decode_utf8);
# ...
if ( not eval{
# some error-messages (utf8) are decoded some are not
1 }
) {
if ( utf8::is_utf8 $@ ) {
print $@;
}
else {
print decode_utf8( $@ );
}
}

Am I using utf8::is_utf8 correctly?
No. Any use of utf8::is_utf8 is incorrect, as you should never use it! Using utf8::is_utf8 to guess at the semantics of a string is an instance of what's known as The Unicode Bug. Except for inspecting the internal state of variables when debugging Perl itself or an XS module, utf8::is_utf8 has no use.
It does not indicate whether the value in a variable is encoded using UTF-8 or not. In fact, that's impossible to know reliably. For example, does "\xC3\xA9" produce a string that's encoded using UTF-8 or not? Well, there's no way to know! It depends on whether I meant "Ã©", "é" or something entirely different.
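To make that concrete, here is a minimal sketch (the variable names are mine, not from the question). The same one-character string is stored two different ways internally; the two copies compare equal, yet utf8::is_utf8 gives different answers for them.

```perl
use strict;
use warnings;

my $downgraded = "\xE9";     # U+00E9, stored internally as a single byte
my $upgraded   = "\xE9";
utf8::upgrade($upgraded);    # same character, now stored internally as UTF-8

# The strings are identical as far as Perl code is concerned.
my $same_string = ( $downgraded eq $upgraded ) ? 1 : 0;

# The flag only reports the internal storage format.
my $flag_down = utf8::is_utf8($downgraded) ? 1 : 0;
my $flag_up   = utf8::is_utf8($upgraded)   ? 1 : 0;

print "equal: $same_string\n";
print "is_utf8: $flag_down vs $flag_up\n";
```

Neither copy is "encoded" here; both hold the decoded character é. The flag differs only because of how Perl happens to store the string.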
If the variable may contain both encoded and decoded strings, it's up to you to track that using a second variable. I strongly advise against this, though. Just decode everything as it comes in from the outside.
If you really can't, your best bet is to try to decode $@ and ignore errors. It's very unlikely that something readable that isn't UTF-8 would be valid UTF-8.
# $@ is sometimes encoded. If it's not,
# the following will leave it unchanged.
utf8::decode($@);
print $@;
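A small sketch of that behaviour, using made-up sample values rather than a real $@: utf8::decode works in place and returns false, leaving the string untouched, when the contents are not valid UTF-8.

```perl
use strict;
use warnings;

# Sample values standing in for error messages (assumptions for illustration).
my $encoded = "\xC3\xA9";    # é as two UTF-8 bytes
my $decoded = "\xE9";        # é as one already-decoded character

# utf8::decode converts in place and reports whether the input was valid UTF-8.
my $ok_encoded = utf8::decode($encoded) ? 1 : 0;
my $ok_decoded = utf8::decode($decoded) ? 1 : 0;

printf "encoded input: valid UTF-8=%d, length now %d\n", $ok_encoded, length $encoded;
printf "decoded input: valid UTF-8=%d, length now %d\n", $ok_decoded, length $decoded;
print "both now hold the same character\n" if $encoded eq $decoded;
```

The caveat from the answer still applies: a decoded string that happens to look like valid UTF-8 (say, the two characters "Ã©") would be decoded a second time, which is why decoding at the input boundary is the safer habit.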

Related

How to use unicode in perl CGI param

I have a Perl CGI script accepting unicode characters as one of the params.
The url is of the form
.../worker.pl?text="some_unicode_chars"&...
In the perl script, I pass the $text variable to a shell script:
system "a.sh \"$text\" out_put_file";
If I hardcode the text in the perl script, it works well. However, the output makes no sense when $text comes from the web via CGI.
my $q = CGI->new;
my $text = $q->param('text');
I suspect the encoding is causing the problem. UTF-8 has caused me so much trouble. Can anyone please help me?
Perhaps this will help. From Perl Programming/Unicode UTF-8:
By default, CGI.pm does not decode your form parameters. You can use
the -utf8 pragma, which will treat (and decode) all parameters as
UTF-8 strings, but this will fail if you have any binary file upload
fields. A better solution involves overriding the param method:
(example follows)
[Wrong - see Correction] Here's documentation for the utf8 pragma. Since uploading binary data does not appear to be a concern for you, use of the utf8 pragma appears to be the most straightforward approach.
Correction: Per the comment from @Slaven, do not confuse the general Perl utf8 pragma with the -utf8 pragma that has been defined for use with CGI.pm:
-utf8
This makes CGI.pm treat all parameters as UTF-8 strings. Use this with
care, as it will interfere with the processing of binary uploads. It
is better to manually select which fields are expected to return utf-8
strings and convert them using code like this:
use Encode;
my $arg = decode utf8=>param('foo');
Follow Up: duleshi, you ask: But I still don't understand the difference between decode in Encode and utf8::decode. How do the Encode and utf8 modules differ?
From the documentation for the utf8 pragma:
Note that this function does not handle arbitrary encodings. Therefore
Encode is recommended for the general purposes; see also Encode.
Put another way, the Encode module works with many different encodings (including UTF-8), whereas the utf8 functions work only with the UTF-8 encoding.
Here is a Perl program that demonstrates the equivalence of the two approaches to encoding and decoding UTF-8. (Also see the live demo.)
#!/usr/bin/perl
use strict;
use warnings;
use utf8; # allows 'ñ' to appear in the source code
use Encode;
my $word = "Español"; # the 'ñ' is permitted because of the 'use utf8' pragma
# Convert the string to its UTF-8 equivalent.
my $utf8_word = Encode::encode("UTF-8", $word);
# Use 'utf8::decode' to convert the string back to internal form.
my $word_again_via_utf8 = $utf8_word;
utf8::decode($word_again_via_utf8); # converts in-place
# Use 'Encode::decode' to convert the string back to internal form.
my $word_again_via_Encode = Encode::decode("UTF-8", $utf8_word);
# Do the two conversion methods produce the same result?
# Prints 'Yes'.
print $word_again_via_utf8 eq $word_again_via_Encode ? "Yes\n" : "No\n";
# Do we get back the original internal string after converting both ways?
# Prints 'Yes'.
print $word eq $word_again_via_Encode ? "Yes\n" : "No\n";
If you're passing UTF-8 data around in the parameters list, then you definitely want to be URI encoding them using the URI::Escape module. This will convert any extended characters to percent-encoded values, which are easily printable and readable. On the receiving end you will then need to URI decode them before continuing.
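To show what that round trip looks like, here is a hand-rolled sketch that uses only core modules. It is not URI::Escape itself; in real code that module's uri_escape_utf8 would do the escaping for you. The sample word and variable names are assumptions for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $text = "Espa\x{f1}ol";    # decoded characters; \x{f1} is ñ

# Sending side: encode to UTF-8 bytes, then percent-escape every byte
# outside the URI "unreserved" set.
my $escaped = encode( 'UTF-8', $text );
$escaped =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;

# Receiving side: undo the percent-escapes, then decode the UTF-8 bytes.
my $round_trip = $escaped;
$round_trip =~ s/%([0-9A-Fa-f]{2})/chr(hex $1)/ge;
$round_trip = decode( 'UTF-8', $round_trip );

print "$escaped\n";    # Espa%C3%B1ol
print $round_trip eq $text ? "round trip ok\n" : "round trip failed\n";
```

The escaped form is pure ASCII, so it survives being pasted into a URL or passed through a shell command line.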

perl save utf-8 text problem

I am playing around with pplog, a single-file, file-based blog.
The writing to file code:
open(FILE, ">$config_postsDatabaseFolder/$i.$config_dbFilesExtension");
my $date = getdate($config_gmt);
print FILE $title.'"'.$content.'"'.$date.'"'.$category.'"'.$i; # 0: Title, 1: Content, 2: Date, 3: Category, 4: FileName
print 'Your post '. $title.' has been saved. Go to Index';
close FILE;
The input text:
春眠不覺曉,處處聞啼鳥. 夜來風雨聲,花落知多小.
After store to file, it becomes:
春眠不覺�›�,處處聞啼鳥. 夜來風�›�聲,花落知多小.
I can use Eclipse to edit the file and make it render to normal. The problem exists during printing to the file.
Some basic info:
Strawberry perl 5.12
without use utf8;
tried use utf8;, doesn't have any effect.
Thank you.
--- EDIT ---
Thanks for comments. I traced the code:
The code that adds new content:
# Blog Add New Entry Page
my $pass = r('pass');
#BK 7JUL09 patch from fedekun, fix post with no title that caused zero-byte message...
my $title = r('title');
my $content = '';
if($config_useHtmlOnEntries == 0)
{
$content = bbcode(r('content'));
}
else
{
$content = basic_r('content');
}
my $category = r('category');
my $isPage = r('isPage');
sub r
{
escapeHTML(param($_[0]));
}
sub r forwards the call to a CGI.pm function.
In CGI.pm
sub escapeHTML {
# hack to work around earlier hacks
push @_,$_[0] if @_==1 && $_[0] eq 'CGI';
my ($self,$toencode,$newlinestoo) = CGI::self_or_default(@_);
return undef unless defined($toencode);
$toencode =~ s{&}{&amp;}gso;
$toencode =~ s{<}{&lt;}gso;
$toencode =~ s{>}{&gt;}gso;
if ($DTD_PUBLIC_IDENTIFIER =~ /[^X]HTML 3\.2/i) {
# $quot; was accidentally omitted from the HTML 3.2 DTD -- see
# <http://validator.w3.org/docs/errors.html#bad-entity> /
# <http://lists.w3.org/Archives/Public/www-html/1997Mar/0003.html>.
$toencode =~ s{"}{&#34;}gso;
}
else {
$toencode =~ s{"}{&quot;}gso;
}
# Handle bug in some browsers with Latin charsets
if ($self->{'.charset'}
&& (uc($self->{'.charset'}) eq 'ISO-8859-1' # This line causes trouble: it treats Chinese chars as ISO-8859-1
|| uc($self->{'.charset'}) eq 'WINDOWS-1252')) {
$toencode =~ s{'}{&#39;}gso;
$toencode =~ s{\x8b}{&#8249;}gso;
$toencode =~ s{\x9b}{&#8250;}gso;
if (defined $newlinestoo && $newlinestoo) {
$toencode =~ s{\012}{&#10;}gso;
$toencode =~ s{\015}{&#13;}gso;
}
}
return $toencode;
}
Tracing the problem further, I found that the browser defaults to iso-8859-1; even when manually set to utf-8, it sends the string back to the server as iso-8859-1.
Finally, I added the -charset => qw(utf-8) param to header:
print header(-charset => qw(utf-8)), '<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
With that change, the Chinese poem stays a Chinese poem.
Thanks to Schwern for the comments; they inspired me to trace the problem and learn a lesson.
Getting utf8 really working in Perl involves flipping on a lot of individual features. use utf8 only makes your code utf8 (strings, variables, regexes...); you have to do file handles separately.
It's complicated, and the simplest thing is to use utf8::all, which will make utf8 the default for your code, your files, @ARGV, STDIN, STDOUT and STDERR. utf8 support is constantly improving in Perl, and utf8::all will add it as it becomes available.
I'm unsure of how your code can produce that output—for example, the quote marks are missing. Of course, this could be due to "corruption" somewhere between your file and me seeing the page. SO may filter corrupted UTF-8. I suggest providing hex dumps in the future!
Anyway, to get UTF-8 output working in Perl, there are several approaches:
Work with character data, that is, let Perl know that your variables contain Unicode. This is probably the best method. Confirm that utf8::is_utf8($var) is true (you do not need to, and should not, use utf8 for this). If not, look into the Encode module's decode function to make Perl know it's Unicode. Once Perl knows your data is characters, that print will give warnings (which you do have enabled, right?). To fix, enable the :utf8 or :encoding(utf8) layer on your file (the latter version provides error checking). You can do this in your open (open FILE, '>:utf8', "$fname") or alternatively enable it with binmode (binmode FILE, ':utf8'). Note that you can also use other encodings; see the encoding and PerlIO::encoding docs.
Treat your Unicode as opaque binary data. utf8::is_utf8($var) must be false. You must be very careful when manipulating strings; for example, if you've got UTF-16-BE, this would be a bad idea: print "$data\n", because you actually need print "$data\0\n". UTF-8 has fewer of these issues, but you need to be aware of them.
I suggest reading the perluniintro, perlunitut, perlunicode, and perlunifaq manpages/pods.
Also, use utf8; just tells Perl that your script is written in UTF-8. Its effects are very limited; see its pod docs.
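A minimal sketch of the first approach (character data plus an encoding layer on the file handle). The temp file and variable names are assumptions for illustration; the characters are the first two of the poem from the question, written with escapes so the example does not depend on the source file's encoding.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $poem = "\x{6625}\x{7720}";    # two decoded characters

# Create a scratch file; only its name is needed here.
my ( $tmp_fh, $path ) = tempfile( UNLINK => 1 );
close $tmp_fh;

# Write through an encoding layer so each character becomes UTF-8 bytes.
open( my $out, '>:encoding(UTF-8)', $path ) or die "Cannot write $path: $!";
print {$out} $poem;
close $out;

# Read the file back as raw bytes to see what actually landed on disk.
open( my $in, '<:raw', $path ) or die "Cannot read $path: $!";
my $bytes = do { local $/; <$in> };
close $in;

printf "%d characters written as %d bytes\n", length($poem), length($bytes);
```

With the layer in place there is no "Wide character" warning, and the file holds three UTF-8 bytes per character.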
You're not showing the code that is actually running. I successfully processed the text you supplied as input with both 5.10.1 on Cygwin and 5.12.3 on Windows. So I suspect a bug in your code. Try narrowing down the problem by writing a short, self-contained test case.

Perl: String literal in module in latin1 - I want utf8

In the Date::Holidays::DK module, the names of certain Danish holidays are written in Latin1 encoding. For example, January 1st is 'Nytårsdag'. What should I do to $x below in order to get a proper utf8-encoded string?
use Date::Holidays::DK;
my $x = is_dk_holiday(2011,1,1);
I tried various combinations of use utf8 and no utf8 before/after use Date::Holidays::DK, but it does not seem to have any effect. I also tried to use Encode's decode, with no luck. More specifically,
use Date::Holidays::DK;
use Encode;
use Devel::Peek;
my $x = decode("iso-8859-1",
is_dk_holiday(2011,1,1)
);
Dump($x);
print "January 1st is '$x'\n";
gives the output
SV = PV(0x15eabe8) at 0x1492a10
REFCNT = 1
FLAGS = (PADMY,POK,pPOK,UTF8)
PV = 0x1593710 "Nyt\303\245rsdag"\0 [UTF8 "Nyt\x{e5}rsdag"]
CUR = 10
LEN = 16
January 1st is 'Nyt sdag'
(with an invalid character between t and s).
use utf8 and no utf8 before/after use Date::Holidays::DK, but it does not seem to have any effect.
Correct. The utf8 pragma only indicates that the source code of the program is written in UTF-8.
I also tried to use Encode's decode, with no luck.
You did not perceive this correctly; you in fact did the right thing. You now have a string of Perl characters and can manipulate it.
with an invalid character between t and s
You also interpret this wrongly; it is in fact the å character.
You want to output UTF-8, so you are lacking the encoding step.
my $octets = encode 'UTF-8', $x;
print $octets;
Please read http://p3rl.org/UNI for an introduction to the topic of encoding. You always must decode and encode, either explicitly or implicitly.
use utf8 is only a hint to the perl interpreter/compiler that your file is UTF-8 encoded. If you have string literals with the high bit set, it will automatically decode them to Unicode.
If you have a variable that is encoded in iso-8859-1, you must decode it. Then your variable is in the internal Unicode format. That's utf8, but you shouldn't care which encoding perl uses internally.
Now if you want to print such a string, you need to convert the Unicode string back to a byte string. You need to do an encode on this string. If you don't do an encode manually, perl itself will encode it back to iso-8859-1. This is the default encoding.
Before you print your variable $x, you need to do a $x = encode('UTF-8', $x) on it.
For correct handling of UTF-8 you always need to decode() every external input over I/O. And you always need to encode() everything that leaves your program.
To change the default input/output encoding you can use something like this.
use utf8;
use open ':encoding(UTF-8)';
use open ':std';
The first line says that your source code is encoded in utf8. The second line says that all input/output should automatically be encoded in utf8. It is important to notice that open() now also opens files in utf8 mode. If you work with binary files, you need to call binmode() on the handle.
But the second line does not change the handling of STDIN, STDOUT or STDERR. The third line will change that.
You can probably use the module utf8::all, which makes this process easier. But it is always good to understand how all this works behind the scenes.
To correct your example. One possible way is this:
#!/usr/bin/env perl
use Date::Holidays::DK;
use Encode;
use Devel::Peek;
my $x = decode("iso-8859-1",
is_dk_holiday(2011,1,1)
);
Dump($x);
print encode("UTF-8", "January 1st is '$x'\n");
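The same decode-then-encode round trip can be tried without installing Date::Holidays::DK. In this sketch a hard-coded Latin-1 byte string stands in for the module's return value (that stand-in is my assumption, not the module's actual output).

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Stand-in for the module's return value: 'Nytårsdag' as Latin-1 bytes.
my $latin1_bytes = "Nyt\xE5rsdag";

my $chars      = decode( 'iso-8859-1', $latin1_bytes );    # bytes -> characters
my $utf8_bytes = encode( 'UTF-8', $chars );                # characters -> UTF-8 bytes

printf "%d characters, %d UTF-8 bytes\n", length($chars), length($utf8_bytes);
```

The byte count matches the CUR = 10 shown in the Devel::Peek dump above: å takes one byte in Latin-1 and two in UTF-8.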

Unicode string mess in perl

I have an external module that returns some strings to me. I am not sure exactly how the strings are returned. I don't really know how Unicode strings work, or why.
The module should return, for example, the Czech word "být", meaning "to be". (If you cannot see the second letter - it should look like this.) If I display the string, returned by the module, with Data Dumper, I see it as b\x{fd}t.
However, if I try to print it with print $s, I got "Wide character in print" warning, and ? instead of ý.
If I try Encode::decode(whatever, $s);, the resulting string cannot be printed anyway (always with the "Wide character" warning, sometimes with mangled characters, sometimes right), no matter what I put in whatever.
If I try Encode::encode("utf-8", $s);, the resulting string CAN be printed without the problems or error message.
If I use use encoding 'utf8';, printing works without any need of encoding/decoding. However, if I use IO::CaptureOutput or Capture::Tiny module, it starts shouting "Wide character" again.
I have a few questions, mostly about what exactly happens. (I tried to read the perldocs, but I did not come away much wiser.)
Why can't I print the string right after getting it from the module?
Why can't I print the string decoded by "decode"? What exactly did "decode" do?
What exactly did "encode" do, and why was there no problem printing the string after encoding?
What exactly does use encoding do? Why is the default encoding different from utf-8?
What do I have to do if I want to print the scalars without any problems, even when I want to use one of the capturing modules?
edit: Some people tell me to use -C or binmode or PERL_UNICODE. That is great advice. However, somehow, both the capturing modules magically destroy the UTF8-ness of STDOUT. That seems to be more a bug of the modules, but I am not really sure.
edit2: OK, the best solution was to dump the modules and write the "capturing" myself (with much less flexibility).
Because you output a string in perl's internal form (utf8) to a non-unicode filehandle.
The decode function decodes a sequence of bytes assumed to be in ENCODING into Perl's internal form (utf8). Your input seems to be already decoded.
The encode() function encodes a string from Perl's internal form into ENCODING.
The encoding pragma allows you to write your script in any encoding you like. String literals are automatically converted to perl's internal form.
Make sure perl knows which encoding your data comes in as and which encoding it should go out as.
See also perluniintro, perlunicode, Encode module, binmode() function.
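To illustrate the first and last points, here is a sketch that prints the word from the question to two in-memory file handles, one without an encoding layer and one with a UTF-8 layer set via binmode. The in-memory handles are only there so the resulting bytes can be inspected; they are my stand-in for STDOUT.

```perl
use strict;
use warnings;

my $word = "b\x{fd}t";    # three decoded characters; \x{fd} is ý

# No encoding layer: character U+00FD goes out as the single byte 0xFD,
# which a UTF-8 terminal cannot display.
my $plain = '';
open( my $plain_fh, '>', \$plain ) or die "open failed: $!";
print {$plain_fh} $word;
close $plain_fh;

# UTF-8 layer: the same print now emits the two bytes 0xC3 0xBD for ý.
my $layered = '';
open( my $layered_fh, '>', \$layered ) or die "open failed: $!";
binmode( $layered_fh, ':encoding(UTF-8)' );
print {$layered_fh} $word;
close $layered_fh;

printf "no layer: %d bytes, UTF-8 layer: %d bytes\n", length($plain), length($layered);
```

Calling encode("utf-8", $s) before print produces the same bytes as the layered handle, which is why the encoded string printed cleanly in the question.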
I recommend reading the Unicode chapter of my book Effective Perl Programming. We put together all the docs we could find and explained Unicode in Perl much more coherently than I've seen anywhere else.
This program works fine for me:
#!perl
use utf8;
use 5.010;
binmode STDOUT, ':utf8';
my $string = return_string();
say $string;
sub return_string { 'být' }
Additionally, Capture::Tiny works just fine for me:
#!perl
use utf8;
use 5.010;
use Capture::Tiny qw(capture);
binmode STDOUT, ':utf8';
my( $stdout, $stderr ) = capture {
system( $^X, '/Users/brian/Desktop/czech.pl' );
};
say "STDOUT is [$stdout]";
IO::CaptureOutput seems to have some problems though:
#!perl
use utf8;
use 5.010;
use IO::CaptureOutput qw(capture);
binmode STDOUT, ':utf8';
capture {
system( $^X, '/Users/brian/Desktop/czech.pl' );
} \my $stdout, \my $stderr;
say "STDOUT is [$stdout]";
For this I get:
STDOUT is [být
]
However, that's easy to fix. Don't use that module. :)
You should also look at the PERL_UNICODE environment variable, which is the same as using the -C option. That allows you to set STDIN/STDOUT/STDERR (and @ARGV) to be UTF-8 without having to alter your scripts.

Why does this base64 string comparison in Perl fail?

I am trying to compare an encode_base64('test') to the string variable containing the base64 string of 'test'. The problem is it never validates!
use MIME::Base64 qw(encode_base64);
if (encode_base64("test") eq "dGVzdA==")
{
print "true";
}
Am I forgetting anything?
Here's a link to a Perlmonks page which says "Beware of the newline at the end of the encode_base64() encoded strings".
So the simple 'eq' may fail.
To suppress the newline, say encode_base64("test", "") instead.
When you do a string comparison and it fails unexpectedly, print the strings to see what is actually in them. I put brackets around the value to see any extra whitespace:
use MIME::Base64;
$b64 = encode_base64("test");
print "b64 is [$b64]\n";
if ($b64 eq "dGVzdA==") {
print "true";
}
This is a basic debugging technique using the best debugger ever invented. Get used to using it a lot. :)
Also, sometimes you need to read the documentation for things a couple of times to catch the important parts. In this case, MIME::Base64 tells you that encode_base64 takes two arguments. The second argument is the line ending and defaults to a newline. If you don't want a newline on the end of the string, you need to give it another line ending, such as the empty string:
encode_base64("test", "")
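A quick side-by-side sketch of the two calls, with the values bracketed so the trailing newline is visible (the variable names are mine):

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

my $with_newline    = encode_base64("test");        # default line ending is "\n"
my $without_newline = encode_base64( "test", "" );  # empty line ending

print "default: [$with_newline]\n";
print "no eol:  [$without_newline]\n";
print "match\n" if $without_newline eq "dGVzdA==";
```

Only the second form compares equal to the bare "dGVzdA==" literal.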
Here's an interesting tip: use Perl's wonderful and well-loved testing modules for debugging. Not only will that give you a head start on testing, but sometimes they'll make your debugging output a lot faster. For example:
#!/usr/bin/perl
use strict;
use warnings;
use Test::More 0.88;
BEGIN { use_ok 'MIME::Base64' => qw(encode_base64) }
is( encode_base64("test"), "dGVzdA==", q{"test" encodes okay} );
done_testing;
Run that script, with perl or with prove, and it won't just tell you that it didn't match, it will say:
# Failed test '"test" encodes okay'
# at testbase64.pl line 6.
# got: 'dGVzdA==
# '
# expected: 'dGVzdA=='
and sharp-eyed readers will notice that the difference between the two is indeed the newline. :)