How to get a defined ${^OPEN}? - perl

What changes are required in this code to get a defined ${^OPEN}?
#!/usr/bin/env perl
use warnings;
use strict;
use open qw( :std :utf8 );
print ${^OPEN};
Use of uninitialized value $^OPEN in print at ./perl.pl line 6.

This is an awkward way to do it. It may be better to write more readable Perl.
The :utf8 layer reads and writes UTF-8 but does not check that the data is valid, so you should not use it except in one-liners. Use :encoding(UTF-8) instead.
Please refer to the post "How differs the open pragma with different utf8?" for more information about the different kinds of UTF-8 input/output layers.
I do not even know what the ${^OPEN} variable is supposed to be. I advise you not to use it at all, as you should avoid magic punctuation variables.
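As a minimal sketch of the advice above, the script from the question could switch to the checking layer like this. The sample string is an assumption made only for illustration.

```perl
#!/usr/bin/env perl
use warnings;
use strict;
# :encoding(UTF-8) validates data, unlike the bare :utf8 layer
use open qw( :std :encoding(UTF-8) );

my $text = "caf\x{E9}";    # hypothetical sample containing U+00E9
print "$text\n";           # written to STDOUT as UTF-8 bytes
```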
Hope it helps

Related

Is it possible to print 'é' as '%C3%A9' in Perl?

I have a string with an accented character like "é", and the goal is to put the string into a URL, so I need to convert "é" to "%C3%A9".
I have tested some modules such as HTML::Entities, Encode and URI::Encode without any success.
Actual Result:
%C3%83%C2%A9
Expected Result:
%C3%A9
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Entities;
use feature 'say';
use URI::Encode qw( uri_encode );
my $var = "é";
say $var;
$var = uri_encode( $var );
say $var;
You are missing use utf8.
The use utf8 pragma tells the Perl parser to allow UTF-8 in the
program text in the current lexical scope. The no utf8 pragma tells
Perl to switch back to treating the source text as literal bytes in
the current lexical scope. (On EBCDIC platforms, technically it is
allowing UTF-EBCDIC, and not UTF-8, but this distinction is academic,
so in this document the term UTF-8 is used to mean both).
Do not use this pragma for anything else than telling Perl that your script is written in UTF-8. The utility functions described below are
directly usable without use utf8;.
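To make the effect of use utf8 concrete, here is a small sketch that needs only the core Encode module. The percent_encode helper is hypothetical (it is not part of URI::Encode), and the second string simulates what the parser sees when use utf8 is missing.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                      # string literals in this file are decoded as UTF-8
use Encode qw( encode );

# Hypothetical helper: encode to UTF-8, then percent-encode every byte
# outside the URI "unreserved" set.
sub percent_encode {
    my ($text) = @_;
    my $bytes = encode( 'UTF-8', $text );
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf '%%%02X', ord $1/ge;
    return $bytes;
}

my $decoded   = "é";           # one character, U+00E9, thanks to use utf8
my $undecoded = "\xC3\xA9";    # without use utf8 the literal is two characters

print percent_encode($decoded),   "\n";   # %C3%A9
print percent_encode($undecoded), "\n";   # %C3%83%C2%A9
```

The second line of output matches the "Actual Result" in the question: the two source bytes were each treated as a separate character and encoded again.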

how to decode_entities in utf8

In perl, I am working with the following utf-8 text:
my $string = 'a 3.9 k&Omega; resistor and a 5 µF capacitor';
However, when I run the following:
decode_entities('a 3.9 k&Omega; resistor and a 5 µF capacitor');
I get
a 3.9 kΩ resistor and a 5 ÂµF capacitor
The Ω symbol has successfully decoded, but the µ symbol now has gibberish before it.
How can I use decode_entities while making sure non-encoded utf-8 symbols (such as µ) are not converted to gibberish?
This isn't a very well-phrased question. You didn't tell us where your decode_entities() function comes from and you didn't give a simple example that we could just run to reproduce your problem.
But I was able to reproduce your problem with this code:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use HTML::Entities;
say decode_entities('a 3.9 k&Omega; resistor and a 5 µF capacitor');
The problem here is that by default, Perl will interpret your source code (and, therefore, any strings included in it) as ISO-8859-1. As your string is in UTF8, you just need to tell Perl to interpret your source code as UTF8 by adding use utf8 to your code.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
use utf8; # Added this line
use HTML::Entities;
say decode_entities('a 3.9 k&Omega; resistor and a 5 µF capacitor');
Running this will give you the correct string, but you'll also get a warning.
Wide character in say
This is because Perl's IO layer expects single-byte characters by default and any attempt to send a multi-byte character through it is seen as a potential problem. You can fix that by telling Perl that STDOUT should accept UTF8 characters. There are many ways to do that. The easiest is probably to add -CS to the shebang line.
#!/usr/bin/perl -CS
use strict;
use warnings;
use 5.010;
use utf8;
use HTML::Entities;
say decode_entities('a 3.9 k&Omega; resistor and a 5 µF capacitor');
Perl has great support for Unicode, but it can be hard to get started with it. I recommend reading perlunitut to see how it all works.
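One of the other ways mentioned above is to set the layer from inside the script with binmode. The following is a minimal sketch that uses character escapes in place of the decoded entity so that it needs only core Perl. As far as I recall from perlrun, perls from 5.10.1 on want a -C switch on the #! line to be repeated on the command line when the script is started as perl script.pl, so the in-script form may be the more portable choice.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;

# Encode everything printed to STDOUT as UTF-8 (done here instead of via -CS)
binmode STDOUT, ':encoding(UTF-8)';

# \x{3A9} stands in for the character that decode_entities() would produce
my $text = "a 3.9 k\x{3A9} resistor and a 5 \x{B5}F capacitor";
print "$text\n";    # no "Wide character" warning
```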
Are you using the Encode module? If so, you can try this:
my $string = "...";
$string = decode_entities(decode('utf-8', $string));
This may seem illogical. If Perl is natively UTF-8 itself, why should you need to decode a UTF-8 string? It is simply another way of telling Perl that you have a UTF-8 value that it needs to interpret as natively UTF-8.
The corruption you are seeing happens when the bytes of a UTF-8 value are not recognized as UTF-8 (it shows "0xC1 0xAF" when Dumpered; after the above change, it ought to show "0x1503", or some similarly concatenated bytes).
There are a ton of settings that can affect this in perl. The above is most likely the right combination of changes that you need for your given settings. Otherwise, some variation (swap encode with decode('latin1', ...), etc.) of the above should solve the problem.
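To illustrate what the decode() step changes, here is a small sketch using only the core Encode module. The byte string is an assumed example of raw, not yet decoded UTF-8 input.

```perl
use strict;
use warnings;
use Encode qw( decode );

my $bytes = "5 \xC2\xB5F";               # raw UTF-8 bytes: the micro sign takes two
my $chars = decode( 'UTF-8', $bytes );   # now a character string

printf "before: %d, after: %d\n", length $bytes, length $chars;   # before: 5, after: 4
printf "U+%04X\n", ord substr( $chars, 2, 1 );                    # U+00B5
```

Once the micro sign is a single character, a later decode_entities() call has nothing left to misinterpret.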

Perl + Unicode: "Wide Strings" error

I am running Active Perl 5.14 on Windows 7.
I am trying to write a program that will read-in a conversion table, then work on a file and replace certain patterns by other patterns - all of the above in Unicode (UTF-8). Here is the beginning of the program:
#!/usr/local/bin/perl
# Load a conversion table from CONVTABLE to %ConvTable.
# Then find matches in a file and convert them.
use strict;
use warnings;
use Encode;
use 5.014;
use utf8;
use autodie;
use warnings qw< FATAL utf8 >;
use open qw< :std :utf8 >;
use charnames qw< :full >;
use feature qw< unicode_strings >;
my ($i,$j,$InputFile, $OutputFile,$word,$from,$to,$linetoprint);
my (@line, @lineout);
my %ConvTable; # Conversion hash
print 'Conversion table: opening file: E:\My Documents\Perl\Conversion table.txt'."\n";
my $sta= open (CONVTABLE, "<:encoding(utf8)", 'E:\My Documents\Perl\Conversion table.txt');
binmode STDOUT, ':utf8'; # output should be in UTF-8
# Load conversion hash
while (<CONVTABLE>) {
chomp;
print "$_\n"; # etc ...
# etc ...
It turns out that at this point, it says:
wide character in print at (eval 155)[E:/Active Perl/lib/Perl5DB.pl:640] line 2, <CONVTABLE> line 1, etc...
Why is that? I think I've gone through and implemented all the necessary prescriptions for correct handling of Unicode strings, decoding and encoding into UTF-8?
And how to fix it?
TIA
Helen
The Perl debugger has its own output handle that is distinct from STDOUT (although it may ultimately go to the same place as STDOUT). You'll also want to do something like this near the beginning of your script:
binmode $DB::OUT, ':utf8' if $DB::OUT;
I suspect that the problem is in some part of the code that you haven't shown us. I base this suspicion on the following facts:
The error message you quote says at (eval 155). There are no evals in your code.
The code you have shown us above does not produce a "wide character" warning when I run it, even if the input contains Unicode characters. The only way I can make it produce one is to comment out both the use open line and the binmode STDOUT line.
Admittedly, my testing environment is not exactly identical to yours: I'm on Linux, and my Perl is only v5.10.1, meaning that I had to lower the version requirement and turn off the unicode_strings feature (not that you're actually using it). Still, I very much suspect that the problem is not in the code you've posted.

Which safety net do you use in Perl?

Which safety net do you use?
use warnings;
or
use strict;
I know that
A potential problem caught by use
strict; will cause your code to stop
immediately when it is encountered,
while use warnings; will merely give a
warning (like the command-line switch
-w) and let your code run.
Still, I want to know which one Perl programmers use most. Which one have they seen being used the most?
Both, of course. If perl were designed today, use strict and use warnings would be the default behavior. It's just like having warnings turned on in a compiler - why would you not do that by default?
What you have doesn’t even start to be enough.
I use code approximating this as a starting point. It works well in my environment, although as always your mileage may vary.
#!/usr/bin/env perl
use v5.12;
use utf8;
use strict;
use autodie;
use warnings;
use warnings qw< FATAL utf8 >;
use feature qw< unicode_strings >;
use open qw< :std :utf8 >;
use charnames qw< :full >;
# These are core modules:
use Carp qw< carp croak confess cluck >;
use Encode qw< decode >; # needed for the decode() call on @ARGV below
use File::Basename qw< basename dirname >;
use Unicode::Normalize qw< NFD NFKD NFC NFKC >;
use Getopt::Long qw< GetOptions >;
use Pod::Usage qw< pod2usage >;
our $VERSION = v0.0.1;
$0 = basename($0); # shorter messages
## $| = 1;
$SIG{__DIE__} = sub {
confess "Uncaught exception: @_" unless $^S;
};
$SIG{__WARN__} = sub {
if ($^S) { cluck "Trapped warning: @_" }
else { confess "Deadly warning: @_" }
};
END {
local $SIG{PIPE} = sub { exit };
close STDOUT;
}
if (grep /\P{ASCII}/ => @ARGV) {
@ARGV = map { decode("UTF-8", $_) } @ARGV;
}
binmode(DATA, ":utf8");
## Getopt::Long::Configure qw[ bundling auto_version ];
if (!@ARGV && -t STDIN) {
print STDERR "$0: reading from stdin: type ^D to end, ^C to kill...\n";
}
while (<>) {
$_ = NFD($_);
# ...
print NFC($_);
}
exit;
=pod
=encoding utf8
=head1 NAME
=head1 SYNOPSIS
=head1 DESCRIPTION
=head1 OPTIONS
=head1 EXAMPLES
=head1 ERRORS
=head1 FILES
=head1 ENVIRONMENT
=head1 PROGRAMS
=head1 AUTHOR
=head1 COPYRIGHT AND LICENCE
=head1 REVISION HISTORY
=head1 BUGS
=head1 TODO
=head1 SEE ALSO
=cut
__END__
Your UTF-8 data goes here.
You can find more examples of this sort of thing in action in the Perl Unicode Tool Chest, currently up to around 50 files ranging from the simple to the sublime.
use strict generates an error if you use symbolic references (i.e., strings to represent names of symbols). It generates an error if you use a variable without declaring it (this encourages the use of lexical 'my' variables, but is also satisfied if you properly declare package globals). It also generates an error if you leave barewords hanging around in the script (unquoted strings, essentially, by Perl's definition of quotes). With 'strict', you may enable or disable any of three categories of strictures, and may do so within scoped blocks. It is a best practice to enable strictures, though on occasion legitimate code requires that some of its features be locally disabled. However, one should think long and hard about whether this is really necessary, and whether their solution is ideal. You can read about strictures in Perl's POD entitled 'strict'.
use warnings generates a warning message based on many criteria, which are described in the POD 'perllexwarn'. These warnings have nothing to do with strictures, but rather, watch for the most common "gotchas" one is likely to encounter in their programming. It is a best practice to use warnings while writing scripts too. In some cases where the message might be undesirable a certain warning category may be locally disabled within a scope. Additional info is described in 'warnings'.
use diagnostics makes the warnings more verbose, and in a development or learning environment, particularly among newcomers, that's highly desirable. Diagnostics would probably be left out of a 'final product', but while in development they can be a really nice addition to the terse messages normally generated. You can read about diagnostics in the Perl POD "diagnostics."
There is no reason to force oneself to use only one of the above options or another. In particular, use warnings and use strict should generally both be used in modern Perl programs.
In all cases (except diagnostics, which you're only using for development anyway), individual strictures or warnings may be lexically disabled. Furthermore, their errors may be trapped with eval{ .... }, with Try::Tiny's try/catch blocks, and a few other ways. If there's a concern about a message giving a potential attacker more information about a script, the messages could be routed to a logfile. If there's a risk of said logfile consuming lots of space, there's a bigger issue at hand, and the source of the issue should either be resolved or in some rare cases simply have the message disabled.
Perl programs nowadays should be highly strict/warnings compliant as a best practice.
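As a sketch of locally disabling one stricture, the following relaxes 'refs' only inside the block that needs it. The accessor names and the record are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical example: install accessor subs by name, which needs symbolic refs.
for my $field (qw( name size )) {
    no strict 'refs';                     # relaxed only inside this block
    *{"main::get_$field"} = sub { $_[0]{$field} };
}

my $record = { name => 'table.txt', size => 42 };
print get_name($record), "\n";            # table.txt

# Outside the block, strict refs is enforced again and the error can be trapped.
my $symbol = 'size';
my $ok = eval { my $value = ${$symbol}; 1 };
print "trapped: $@" unless $ok;
```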
Use both, as the linked page says.
The documentation is perhaps a bit unclear. use strict and use warnings catch different problems; use strict will not cause your program to immediately exit when mere warnings are encountered, only when you violate the strict syntax requirements. You will still get only warnings printed when your code does things that are less seriously bad.
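The difference described above can be sketched as follows. The variable names are made up for the demonstration, and the warning is collected in an array only so that the script can show that it kept running.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them, to show that execution continues.
my @seen;
local $SIG{__WARN__} = sub { push @seen, $_[0] };

my $missing;                           # deliberately left undefined
my $label = "value: " . $missing;      # warnings reports it, the code keeps running
print "still running after ", scalar @seen, " warning(s)\n";

# A strict violation is a fatal error. A string eval is used to trap it,
# because undeclared variables are caught at compile time.
my $ok = eval q{ $undeclared = 1; 1 };
print "strict stopped it: $@" unless $ok;
```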
use strict;
#use warnings;
use diagnostics; # "This module extends the terse diagnostics..." by http://perldoc.perl.org/diagnostics.html
Both! But I prefer diagnostics to warnings, since it gives you more information.

Unicode string mess in perl

I have an external module that returns some strings to me. I am not sure exactly how the strings are returned. I don't really know how Unicode strings work, or why.
The module should return, for example, the Czech word "být", meaning "to be". (If you cannot see the second letter - it should look like this.) If I display the string, returned by the module, with Data Dumper, I see it as b\x{fd}t.
However, if I try to print it with print $s, I got "Wide character in print" warning, and ? instead of ý.
If I try Encode::decode(whatever, $s);, the resulting string cannot be printed anyway (always with the "Wide character" warning, sometimes with mangled characters, sometimes right), no matter what I put in whatever.
If I try Encode::encode("utf-8", $s);, the resulting string CAN be printed without the problems or error message.
If I use use encoding 'utf8';, printing works without any need of encoding/decoding. However, if I use IO::CaptureOutput or Capture::Tiny module, it starts shouting "Wide character" again.
I have a few questions, mostly about what exactly happens. (I tried to read the perldocs, but they did not make me much wiser.)
Why can't I print the string right after getting it from the module?
Why can't I print the string decoded by "decode"? What exactly did "decode" do?
What exactly did "encode" do, and why was there no problem printing the string after encoding?
What exactly does use encoding do? Why is the default encoding different from utf-8?
What do I have to do if I want to print the scalars without any problems, even when I want to use one of the capturing modules?
edit: Some people tell me to use -C or binmode or PERL_UNICODE. That is great advice. However, somehow, both of the capturing modules magically destroy the UTF8-ness of STDOUT. That seems to be more a bug in the modules, but I am not really sure.
edit2: OK, the best solution was to dump the modules and write the "capturing" myself (with much less flexibility).
Because you output a string in perl's internal form (utf8) to a non-unicode filehandle.
The decode function decodes a sequence of bytes assumed to be in ENCODING into Perl's internal form (utf8). Your input seems to be already decoded.
The encode() function encodes a string from Perl's internal form into ENCODING.
The encoding pragma allows you to write your script in any encoding you like. String literals are automatically converted to perl's internal form.
Make sure perl knows which encoding your data comes in and which encoding it should go out in.
See also perluniintro, perlunicode, Encode module, binmode() function.
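As a minimal sketch of the encode() direction, using only the core Encode module and the string from the question:

```perl
use strict;
use warnings;
use Encode qw( encode decode );

my $chars = "b\x{fd}t";                   # character string, as Data::Dumper showed it
my $bytes = encode( 'UTF-8', $chars );    # internal form -> UTF-8 bytes for output

printf "%d characters, %d bytes\n", length $chars, length $bytes;   # 3 characters, 4 bytes
print "round trip ok\n" if decode( 'UTF-8', $bytes ) eq $chars;
```

The encoded value contains only bytes, which is why printing it to a filehandle without an encoding layer raises no warning.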
I recommend reading the Unicode chapter of my book Effective Perl Programming. We put together all the docs we could find and explained Unicode in Perl much more coherently than I've seen anywhere else.
This program works fine for me:
#!perl
use utf8;
use 5.010;
binmode STDOUT, ':utf8';
my $string = return_string();
say $string;
sub return_string { 'být' }
Additionally, Capture::Tiny works just fine for me:
#!perl
use utf8;
use 5.010;
use Capture::Tiny qw(capture);
binmode STDOUT, ':utf8';
my( $stdout, $stderr ) = capture {
system( $^X, '/Users/brian/Desktop/czech.pl' );
};
say "STDOUT is [$stdout]";
IO::CaptureOutput seems to have some problems though:
#!perl
use utf8;
use 5.010;
use IO::CaptureOutput qw(capture);
binmode STDOUT, ':utf8';
capture {
system( $^X, '/Users/brian/Desktop/czech.pl' );
} \my $stdout, \my $stderr;
say "STDOUT is [$stdout]";
For this I get:
STDOUT is [být
]
However, that's easy to fix. Don't use that module. :)
You should also look at the PERL_UNICODE environment variable, which is the same as using the -C option. That allows you to set STDIN/STDOUT/STDERR (and @ARGV) to be UTF-8 without having to alter your scripts.
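To check from inside a script which of these settings are in effect, one option is to read the ${^UNICODE} variable, which reflects the -C / PERL_UNICODE flags. This is a sketch, with the bit values taken from the -C table in perlrun as I remember it (O is 2, A is 32).

```perl
use strict;
use warnings;

# ${^UNICODE} holds the numeric -C / PERL_UNICODE flags (read-only after startup).
my $flags = ${^UNICODE};
printf "flags: %d\n", $flags;
print "STDOUT is UTF-8 by default\n" if $flags & 2;     # O flag
print "\@ARGV is decoded as UTF-8\n" if $flags & 32;    # A flag
```

Running it once plainly and once with PERL_UNICODE=SA set in the environment should show the difference.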