UTF-8 in a Perl module name

How can I write a Perl module with UTF-8 in its name and filename? My current try yields "Can't locate Täst.pm in @INC", but the file does exist. I'm on Windows, and haven't tried this on Linux yet.
test.pl:
use strict;
use warnings;
use utf8;
use Täst;
Täst.pm:
package Täst;
use utf8;
1; # a module must end with a true value
Update: My current work-around is to use Tast (ASCII) and put package Täst (Unicode) in Tast.pm (ASCII). It's confusing, though.
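A minimal sketch of that work-around (the greet subroutine is just a placeholder for illustration):
Tast.pm:
use utf8;
package Täst;  # Unicode package name inside an ASCII-named file
sub greet { print "hello\n" }
1;
test.pl:
use strict;
use warnings;
use utf8;
use Tast;      # found via the ASCII filename Tast.pm
Täst::greet(); # but the package it declares is Täst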

Unfortunately, Perl, Windows, and Unicode filenames really don't go together at the moment. My advice is to save yourself a lot of hassle and stick with plain ASCII for your module names. This blog post mentions a few of the problems.

The use utf8 needs to appear before the package Täst, so that the latter can be correctly interpreted. On my Mac:
test.pl:
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Tëst;
# 'use utf8' only indicates the source encoding; we also want STDOUT to be UTF-8
binmode STDOUT, ':encoding(UTF-8)';
Tëst::hëllö();
Tëst.pm:
use utf8;
package Tëst;
sub hëllö {
print "Hëllö, wörld!\n";
}
1;
Output:
Macintosh:Desktop sherm$ ./test.pl
Hëllö, wörld!
As I said though - I ran this on my Mac. As cjm said above, your mileage may vary on Windows.

Unicode support often fails at the boundaries. Package and subroutine names need to map cleanly onto filenames, which is problematic on some operating systems. Not only does the OS have to create the filename that you expect, but you also have to be able to find it later as the same name.
We talked a little about the filename issue in Effective Perl Programming, but I also summarized much more in How do I create then use long Windows paths from Perl?. Jeff Atwood touches on this in his post Filesystem Paths: How Long is Too Long?.

I wouldn't recommend this approach if this is software you plan to release, to be honest. Even if you get it working fine for you, it's likely to be somewhat fragile on machines where UTF-8 isn't configured quite right, and/or filenames may not contain UTF-8 characters, etc.

Related

Reliable Perl encoding with File::Slurp

I need to replace every occurrence of http:// with // in a file. The file may be (at least) in UTF-8, CP1251, or CP1255.
Does the following work?
use File::Slurp;
my $Text = read_file($File, binmode=>':raw');
$Text =~ s{http://}{//}gi;
write_file($File, {atomic=>1, binmode=>':raw'}, $Text);
It seems correct, but I need to be sure that the file will not be damaged whatever encoding it has. Please help me to be sure.
This answer won't make you sure, though I hope it can help.
I don't see any problem with your script (tested with utf8 and iso-8859-1 without problems), though there seems to be a discussion regarding the capacity of File::Slurp to correctly handle encodings: http://blogs.perl.org/users/leon_timmermans/2015/08/fileslurp-is-broken-and-wrong.html
In this answer on a similar subject, the author recommends File::Slurper as an alternative, due to better encoding handling: https://stackoverflow.com/a/206682/6193608
It's no longer recommended to use File::Slurp (see here).
I would recommend using Path::Tiny. It's easy to use, works with both files and directories, only uses core modules, and has slurp/spew methods specifically for utf8 and raw, so you shouldn't have a problem with the encoding.
Usage:
use Path::Tiny;
my $Text = path($File)->slurp_raw;
$Text =~ s{http://}{//}gi;
path($File)->spew_raw($Text);
Update: From documentation on spew:
Writes data to a file atomically. The file is written to a temporary file in the same directory, then renamed over the original. An optional hash reference may be used to pass options. The only option is binmode, which is passed to binmode() on the handle used for writing.
spew_raw is like spew with a binmode of :unix for a fast, unbuffered, raw write.
spew_utf8 is like spew with a binmode of :unix:encoding(UTF-8) (or PerlIO::utf8_strict). If Unicode::UTF8 0.58+ is installed, a raw spew will be done instead on the data encoded with Unicode::UTF8.
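If you do know the file is UTF-8, the encoding-aware counterparts keep the decode/encode steps out of your code entirely. A minimal sketch (the filename is a placeholder):
use Path::Tiny;
my $file = path('example.txt');  # hypothetical file
my $text = $file->slurp_utf8;    # read and decode from UTF-8
$text =~ s{http://}{//}gi;
$file->spew_utf8($text);         # encode back to UTF-8 and write atomically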

how to decode_entities in utf8

In perl, I am working with the following utf-8 text:
my $string = 'a 3.9 k&Omega; resistor and a 5 µF capacitor';
However, when I run the following:
decode_entities('a 3.9 k&Omega; resistor and a 5 µF capacitor');
I get
a 3.9 kΩ resistor and a 5 µF capacitor
The Ω symbol has successfully decoded, but the µ symbol now has gibberish before it.
How can I use decode_entities while making sure non-encoded utf-8 symbols (such as µ) are not converted to gibberish?
This isn't a very well-phrased question. You didn't tell us where your decode_entities() function comes from and you didn't give a simple example that we could just run to reproduce your problem.
But I was able to reproduce your problem with this code:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use HTML::Entities;
say decode_entities('a 3.9 k&Omega; resistor and a 5 µF capacitor');
The problem here is that by default, Perl will interpret your source code (and, therefore, any strings included in it) as ISO-8859-1. As your string is in UTF8, you just need to tell Perl to interpret your source code as UTF8 by adding use utf8 to your code.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use utf8; # Added this line
use HTML::Entities;
say decode_entities('a 3.9 k&Omega; resistor and a 5 µF capacitor');
Running this will give you the correct string, but you'll also get a warning.
Wide character in say
This is because Perl's IO layer expects single-byte characters by default and any attempt to send a multi-byte character through it is seen as a potential problem. You can fix that by telling Perl that STDOUT should accept UTF8 characters. There are many ways to do that. The easiest is probably to add -CS to the shebang line.
#!/usr/bin/perl -CS
use strict;
use warnings;
use 5.010;
use utf8;
use HTML::Entities;
say decode_entities('a 3.9 k&Omega; resistor and a 5 µF capacitor');
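If you would rather not rely on the shebang line (a -CS there is not honoured, and on recent Perls is even fatal, when the script is started as perl script.pl rather than directly), setting the layer on STDOUT yourself does the same job for that handle:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use utf8;
use HTML::Entities;
binmode STDOUT, ':encoding(UTF-8)'; # covers STDOUT only; -CS also covers STDIN and STDERR
say decode_entities('a 3.9 k&Omega; resistor and a 5 µF capacitor');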
Perl has great support for Unicode, but it can be hard to get started with it. I recommend reading perlunitut to see how it all works.
If you are using the Encode CPAN module, you can try this:
use Encode;
use HTML::Entities;
my $string = "...";
$string = decode_entities(decode('utf-8', $string));
This may seem illogical: if the string is already UTF-8, why decode it? Decoding is simply how you tell Perl that the octets you have are UTF-8-encoded and should be interpreted as a character string.
The corruption you are seeing happens when a UTF-8 value doesn't have the right bytes recognized (it shows "0xC1 0xAF" when dumped with Data::Dumper; after the above change, it ought to show "0x1503" or some similarly concatenated bytes).
There are a ton of settings that can affect this in Perl. The above is most likely the right combination of changes for your given settings. Otherwise, some variation of the above (swapping encode with decode('latin1', ...), etc.) should solve the problem.
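As a concrete sketch of that decode-then-work pattern (assuming the value really arrives as raw UTF-8 octets, e.g. from a socket or a :raw file handle):
use Encode qw(decode encode);
use HTML::Entities qw(decode_entities);
my $bytes  = "a 3.9 k\xce\xa9 resistor"; # raw UTF-8 octets for 'kΩ'
my $string = decode('utf-8', $bytes);    # octets -> Perl character string
$string    = decode_entities($string);   # HTML entities -> characters
print encode('utf-8', $string), "\n";    # characters -> octets for output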

how to use utf-8 in a perl cgi-bin script?

I have the following cgi bin script:
#! /usr/bin/perl
#
use utf8;
use CGI;
my $q = CGI->new();
my %params = $q->Vars;
print $q->header('text/html');
$w = $params{"words"};
print "$w\n";
I want to be able to call it as cgi-bin/script.pl?words=É for example, but when I do that, what's printed is not UTF-8, but instead garbled:
É
Is there any way to use cgi-bin with utf8?
Your line use utf8 doesn't do anything for you, other than allowing UTF-8 characters in the source file itself. You must make sure that the output handles (on STDOUT as well as any files) are set to utf8. One easy way to handle this is the utf8::all module. Also, make sure you are sending the correct headers, and use the -utf8 CGI pragma to treat incoming parameters as UTF-8. Finally, as always, be sure to use strict and warnings.
The following should get you started:
#!/usr/bin/perl
use strict;
use warnings;
use utf8::all;
use CGI qw(-utf8);
my $q = CGI->new;
print $q->header("text/html;charset=UTF-8");
print $q->param("words");
exit;
I have been having this problem of intermittent failure of utf8 encoding with my CGI script.
I tried everything but couldn't reliably repeat the problem.
I finally discovered that it is absolutely critical to be consistent in your use of the utf8 pragma throughout every module that uses CGI:
use CGI qw(-utf8);
What seems to happen is that mod_perl invokes the CGI module just once per request. If the CGI module is included inconsistently - say, some utility function just uses a redirect and hasn't bothered to set the utf8 pragma - then that invocation can be the one mod_perl decides to use to decode requests.
You will save yourself a lot of pain in the long run if you start out by reading the perlunitut and perlunicode documentation pages. They will give you the basics on exactly what Unicode and character encodings are, and how to work with them in Perl.
Also, what you're asking for is more complex than you think. There are many layers hidden in the phrase "use cgi-bin with utf8", starting with your interface to whatever tool you're using to send requests to the web server and ending with that tool having parsed a response and presenting it to you. You need to understand all those layers well enough to at least be able to tell if the problem lies in your CGI script or not. For example, it doesn't help if your script works perfectly if the problem is that bash and curl don't agree on the encoding of your command line arguments.

Why are use warnings; use strict; not default in Perl?

I'm wondering why
use warnings;
use strict;
are not default in Perl. They're needed for every script. If someone (for good reason) needs to disable them, they should use no strict and/or should use some command line argument (for one-liners).
Are there too many badly-written CPAN modules (using "badly" to mean without use strict)? Or is it because this can break a lot of code already in production? I'm sure there is a reason and I would like to know it.
In 5.14, IO::File is loaded automagically on demand; wouldn't it be possible to do something like that with these basic pragmas?
It's for backwards compatibility. Perl 4 didn't have strict at all, and there are most likely still scripts out there originally written for Perl 4 that still work fine with Perl 5. Making strict automatic would break those scripts. The situation is even worse for one-liners, many of which don't bother to declare variables. Making one-liners strict by default would break probably millions of shell scripts and Makefiles out there.
It can't be loaded automagically, because it adds restrictions, not features. It's one thing to load IO::File when a method is called on a filehandle, but there is no analogous trigger for strict: until the code does something strict prohibits, there is nothing to react to, and by then it's too late to turn it on.
If a script specifies a minimum version of 5.11.0 or higher (e.g. use 5.012), then strict is turned on automatically. This doesn't enable warnings, but perhaps that will be added in a future version. Also, if you do OO programming in Perl, you should know that using Moose automatically turns on both strict and warnings in that class.
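For instance, a minimal Moose class (the class and attributes here are just placeholders) needs no explicit use strict or use warnings:
package Point;
use Moose; # turns on strict and warnings in this scope

has 'x' => (is => 'ro', isa => 'Int');
has 'y' => (is => 'ro', isa => 'Int');

# an undeclared variable here would already be a compile-time error
1;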
If you are on a modern Perl, you just have to say so: 5.12 applies strict automatically, except for one-liners. It can't be the default because of backward compatibility.
$ cat strict-safe?.pl
use 5.012;
$foo
$ perl strict-safe\?.pl
Global symbol "$foo" requires explicit package name at strict-safe?.pl line 2.
Execution of strict-safe?.pl aborted due to compilation errors.
Well, use strict is default now, sort of.
Since Perl 5.12.0, requiring a version of Perl >= 5.12.0 turns on all the backwards-incompatible features, including strict, by default.
use 5.12.0;
use warnings;
Is the same as:
use strict;
use warnings;
use feature ':5.12';
It hasn't been turned on more broadly because doing so would break a lot of scripts that people depend on to "just work".
Moose also automatically turns on strict and warnings when you use it. So if you do any Moose based Perl OOP, then you get a free pass here, too.
It's a philosophical question, not a "it won't work" question.
First, perl has always followed the "you can do it incorrectly if you want" paradigm, which is why there are a lot of perl haters out there. Many would prefer that the language always force you to write good code, but many quick-script hackers don't want that. Consider:
perl -ne '@a = split(/[,:]/, $_); print $a[1],"\n";'
Now, it would be easy to add a 'my' in front of the @a, but for a one-line, one-time script people don't want to do that.
Second, yes, I think most of CPAN would indeed need to be rewritten.
There isn't a good answer you'll like, I'm afraid.
Both warnings and strict will finally be default (along with some Perl 5 features that were not defaults) with Perl 7. It is expected to be released in the first half of 2021 (with a release candidate maybe around the end of 2020). Maybe it will be out around May 18th to mark the 10 year anniversary of this question? Better late than never!
You can use the common::sense module, if you need it:
use utf8;
use strict qw(vars subs);
use feature qw(say state switch);
no warnings;
use warnings qw(FATAL closed threads internal debugging pack
portable prototype inplace io pipe unpack malloc
deprecated glob digit printf layer
reserved taint closure semicolon);
no warnings qw(exec newline unopened);
It reduces the memory usage.

Why do my Perl tests fail with use encoding 'utf8'?

I'm puzzled with this test script:
#!perl
use strict;
use warnings;
use encoding 'utf8';
use Test::More 'no_plan';
ok('áá' =~ m/á/, 'ok direct match');
my $re = qr{á};
ok('áá' =~ m/$re/, 'ok qr-based match');
like('áá', $re, 'like qr-based match');
The three tests fail, but I was expecting that the use encoding 'utf8' would upgrade both the literal áá and the qr-based regexps to utf8 strings, and thus pass the tests.
If I remove the use encoding line the tests pass as expected, but I can't figure it out why would they fail in utf8 mode.
I'm using perl 5.8.8 on Mac OS X (system version).
Do not use the encoding pragma. It’s broken. (Juerd Waalboer gave a great talk where he mentioned this at YAPC::EU 2k8.)
It does at least two things at once that do not belong together:
It specifies an encoding for your source file.
It specifies an encoding for your file input/output.
And to add insult to injury, it also does #1 in a broken fashion: it reinterprets \xNN sequences as undecoded octets rather than treating them as codepoints, and decodes them, preventing you from expressing characters outside the encoding you specified and making your source code mean different things depending on the encoding. That’s just astonishingly wrong.
Write your source code in ASCII or UTF-8 only. In the latter case, the utf8 pragma is the correct thing to use. If you don’t want to use UTF-8, but you do want to include non-ASCII characters, escape or decode them explicitly.
And use I/O layers explicitly or set them using the open pragma to have I/O automatically transcoded properly.
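Put together, a sketch of that advice (the data file is a placeholder): the utf8 pragma declares the source encoding, and the open pragma (or a per-handle layer) takes care of the I/O transcoding:
use strict;
use warnings;
use utf8;                            # the source file itself is UTF-8
use open qw(:std :encoding(UTF-8));  # STDIN/STDOUT/STDERR transcode UTF-8
open my $fh, '<:encoding(UTF-8)', 'data.txt' or die $!; # per-handle layer
print while <$fh>;                   # decoded on read, encoded on write
close $fh;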
It works fine on my computer (on perl 5.10). Maybe you should try replacing that use encoding 'utf8' with use utf8.
What version of perl are you using? I think older versions had bugs with UTF-8 in regexps.
The Test::More documentation contains a fix for this issue, which I just found today (and this entry shows higher in the googles).
utf8 / "Wide character in print"
If you use utf8 or other non-ASCII characters with Test::More, you might get a "Wide character in print" warning. Using binmode STDOUT, ":utf8" will not fix it. Test::Builder (which powers Test::More) duplicates STDOUT and STDERR, so any changes to them, including changing their output disciplines, will not be seen by Test::More. The workaround is to change the filehandles used by Test::Builder directly.
my $builder = Test::More->builder;
binmode $builder->output, ":utf8";
binmode $builder->failure_output, ":utf8";
binmode $builder->todo_output, ":utf8";
I added this bit of boilerplate to my testing code and it works a charm.
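For reference, here is the script from the question with use encoding swapped for use utf8 (per the advice above) and the Test::Builder boilerplate slotted in:
#!perl
use strict;
use warnings;
use utf8; # instead of the broken 'use encoding'
use Test::More 'no_plan';

# Test::Builder holds its own duplicates of STDOUT and STDERR,
# so set the layers on its handles directly
my $builder = Test::More->builder;
binmode $builder->output,         ':utf8';
binmode $builder->failure_output, ':utf8';
binmode $builder->todo_output,    ':utf8';

ok('áá' =~ m/á/, 'ok direct match');
my $re = qr{á};
ok('áá' =~ m/$re/, 'ok qr-based match');
like('áá', $re, 'like qr-based match');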