Non-ASCII data behaves differently with different Perl installations

I have the following script which behaves differently on two different Perl installations I have. One is Perl 5.8.5 and the other is Perl 5.8.8.
Here is the script:
#!/usr/bin/perl
use FindBin(qw($Bin));
use lib $Bin;
use lib "$Bin/../lib";
use XML::LibXML;
use strict; # quote strings, declare variables
use warnings; # on by default
use warnings qw(FATAL utf8); # fatalize encoding glitches
use open qw(:std :utf8); # undeclared streams in UTF-8
my $xml =<<EOS;
<?xml version="1.0" encoding="UTF8"?>
<foo>Привет, мир!</foo>
EOS
my $parser = new XML::LibXML;
my $doc = '';
eval { $doc = $parser->parse_string($xml); };
if ($@) {
die "Error: $@";
}
my $root = $doc->getDocumentElement();
print "XML after parsing: ", $root->toString(), "\n";
On my 5.8.8 Perl installation I get:
XML after parsing: <foo>Привет, мир!</foo>
On my 5.8.5 Perl installation I get:
XML after parsing: <foo>&#x41F;&#x440;&#x438;&#x432;&#x435;&#x442;, &#x43C;&#x438;&#x440;!</foo>
I want my 5.8.5 installation to behave like the 5.8.8 one in this regard. Is this a matter of just upgrading my Perl, or setting some special compilation flag?

First of all, both outputs are equivalent. XML::LibXML is free to generate either one, and it shouldn't matter to the receiving parser. Of course, XML is supposed to be human-readable, and this is probably what concerns you.
No, XML::LibXML does not have an option to control which characters it escapes. In fact, I've only ever known it to escape when needed, which is the first behaviour.
There is no need to upgrade Perl. Upgrading XML::LibXML or libxml2 (the underlying library used by XML::LibXML) will do the trick.
# XML::LibXML's version
>perl -MXML::LibXML -E"say $XML::LibXML::VERSION"
1.70
# libxml2's version
>perl -MXML::LibXML -E"say XML::LibXML::LIBXML_DOTTED_VERSION"
2.7.7
Off-topic tips:
I presume your source code is encoded using UTF-8? If so, I would add use utf8; to let Perl know that. If you do, you'll need to load encode_utf8 (use Encode qw(encode_utf8);) and change
my $xml = <<EOS;
to
my $xml = encode_utf8(<<EOS);
Using
<<'EOI'
instead of
<<EOI
will prevent Perl from messing with your XML (prevent interpolation and interpretation of \ sequences).
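As a rough illustration of the two tips above, here is a minimal sketch that combines a single-quoted heredoc with encode_utf8. It assumes the script file itself is saved as UTF-8; the $name token and the literal \n inside the heredoc are hypothetical and only there to show that nothing gets interpolated.

```perl
use strict;
use warnings;
use utf8;                     # source literals are decoded to characters
use Encode qw(encode_utf8);

# Single-quoted heredoc: no variable interpolation, no backslash processing,
# so "$name" and "\n" below stay exactly as typed.
my $chars = <<'EOS';
<foo>Привет, $name\n</foo>
EOS

# Encode to UTF-8 bytes before handing the text to a byte-oriented parser.
my $bytes = encode_utf8($chars);

printf "characters: %d, bytes: %d\n", length($chars), length($bytes);
```

The byte count should exceed the character count by six, one extra byte for each Cyrillic letter.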

Is there a way to decode UTF-8 characters in @ARGV by default? [duplicate]

How do I treat the elements of @ARGV as UTF-8 in Perl?
Currently I'm using the following work-around ..
use Encode qw(decode encode);
my $foo = $ARGV[0];
$foo = decode("utf-8", $foo);
.. which works but is not very elegant.
I'm using Perl v5.8.8 which is being called from bash v3.2.25 with a LANG set to en_US.UTF-8.
Outside data sources are tricky in Perl. For command-line arguments, you're probably getting them in the encoding specified by your locale. Don't rely on your locale being the same as that of someone else who might run your program.
You have to find out what that encoding is, then decode the arguments into Perl's internal format. Fortunately, it's not that hard.
The I18N::Langinfo module has the stuff you need to get the encoding:
use I18N::Langinfo qw(langinfo CODESET);
my $codeset = langinfo(CODESET);
Once you know the encoding, you can decode them to Perl strings:
use Encode qw(decode);
@ARGV = map { decode $codeset, $_ } @ARGV;
Although Perl encodes internal strings as UTF-8, you shouldn't ever think or know about that. You just decode whatever you get, which turns it into Perl's internal representation for you. Trust that Perl will handle everything else. When you need to store the data, ensure that you use the encoding you like.
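The decode-on-input, encode-on-output idea can be sketched roughly as follows. This assumes the incoming data is UTF-8; the two hard-coded bytes stand in for whatever an outside source would actually supply.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: the outside world handed us UTF-8 bytes (here, the two bytes for "ý").
my $bytes_in = "\xC3\xBD";

# Decode once at the boundary; from here on, work with characters.
my $text = decode('UTF-8', $bytes_in);

# Encode explicitly when the data leaves the program again.
my $bytes_out = encode('UTF-8', $text);

printf "characters: %d, bytes out: %d\n", length($text), length($bytes_out);
```

In between those two calls, the program never needs to know how Perl stores the string internally.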
If you know that your setup is UTF-8 and the terminal will give you the command-line arguments as UTF-8, you can use the A option with Perl's -C switch. This tells your program to assume the arguments are encoded as UTF-8:
% perl -CA program
Use Encode::Locale:
use Encode::Locale;
decode_argv Encode::FB_CROAK;
This works well for me, including on Win32.
The way you've done it seems correct. That's what I would do.
However, this perldoc page suggests that the command-line flag -CA should tell Perl to treat @ARGV as UTF-8 (not tested).
For example, on Windows, set the console code page:
chcp 1251
Then, in Perl:
use utf8;
use Modern::Perl;
use Encode::Locale qw(decode_argv);
if (-t)
{
binmode(STDIN, ":encoding(console_in)");
binmode(STDOUT, ":encoding(console_out)");
binmode(STDERR, ":encoding(console_out)");
}
Encode::Locale::decode_argv();
Then, on the command line:
perl -C ppixregexplain.pl qr/\bмама\b/i > ex1.html 2>&1
where ppixregexplain.pl
You shouldn't have to do anything special to the string. Perl strings are in UTF-8 by default starting with Perl 5.8.
perl -CO -le 'print "\x{2603}"' | xargs perl -le 'print "I saw @ARGV"'
The code above works just fine on Ubuntu 9.04, OS X 10.6, and FreeBSD 7.
FalseVinylShrub brings up a good point: we can see a definite difference between
perl -Mutf8 -wle ';print utf8::is_utf8($ARGV[0]) ? "t" : "f"' a
and
perl -Mutf8 -CA -wle ';print utf8::is_utf8($ARGV[0]) ? "t" : "f"' a

regex /ms not behaving as expected perl 5.8.8

I have a test that fails on 5.8.8, and I don't understand why, especially since it works in more recent versions (perhaps it was just a bug). (Here's a link to the full code.)
use strict;
use warnings;
use Test::More;
my $fname = 'Fo';
my $content = do { local $/ ; <DATA> };
like $content, qr/^$fname $/xms, q[includes first name];
done_testing;
__DATA__
use strict;
use warnings;
use Test::More;
# generated by Dist::Zilla::Plugin::Test::PodSpelling bootstrapped version
eval "use Test::Spelling 0.12; use Pod::Wordlist::hanekomu; 1" or die $@;
add_stopwords(<DATA>);
all_pod_files_spelling_ok('bin', 'lib');
__DATA__
Fo
oer
bar
On all recent versions of Perl this works fine, but in 5.8.8 the test fails. I found that removing the ^ and $ makes the code work; it's as if Perl's regex engine is ignoring the /m, but the documentation says it was supported.
Why does this not work, and what is the most correct way to fix it? (Note: I believe the test should check that these elements are on a line by themselves.)
This is bug RT#7781. It was fixed in 5.8.9 and 5.10.0.
Workarounds:
qr/^/m is equivalent to qr/(?:^|(?<=\n))/
qr/$/m is equivalent to qr/(?=\n|\z)/
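Applied to the test from the question, the workaround might look roughly like the sketch below. The $bol and $eol names are hypothetical helpers, and the inline string is a stand-in for the slurped DATA section; this has not been checked against an actual 5.8.8 build.

```perl
use strict;
use warnings;

my $fname   = 'Fo';
my $content = "use strict;\nFo\noer\nbar\n";   # stand-in for the slurped DATA section

# Line anchors spelled out so that they do not rely on /m.
my $bol = qr/(?:^|(?<=\n))/;   # start of string, or just after a newline
my $eol = qr/(?=\n|\z)/;       # just before a newline, or at end of string

# /x only ignores the literal spaces; \Q...\E guards against metacharacters in the name.
my $on_own_line = qr/$bol \Q$fname\E $eol/x;

print $content =~ $on_own_line ? "found\n" : "not found\n";
```

In the real test, the same pattern would be passed as the second argument to like() in place of qr/^$fname $/xms.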

Too late for -CSD

Trying to run this little perl program from parsCit:
parsCit-client.pl e1.txt
Too late for -CSD option at [filename] line 1
e1.txt is here: http://dl.dropbox.com/u/10557283/parserProj/e1.txt
I'm running the program from win7 cmd, not Cygwin.
filename is parsCit-client.pl - entire program is here:
#!/usr/bin/perl -CSD
#
# Simple SOAP client for the ParsCit web service.
#
# Isaac Councill, 07/24/07
#
use strict;
use encoding 'utf8';
use utf8;
use SOAP::Lite +trace=>'debug';
use MIME::Base64;
use FindBin;
my $textFile = $ARGV[0];
my $repositoryID = $ARGV[1];
if (!defined $textFile || !defined $repositoryID) {
print "Usage: $0 textFile repositoryID\n".
"Specify \"LOCAL\" as repository if using local file system.\n";
exit;
}
my $wsdl = "$FindBin::Bin/../wsdl/ParsCit.wsdl";
my $parsCitService = SOAP::Lite
->service("file:$wsdl")
->on_fault(
sub {
my($soap, $res) = @_;
die ref $res ? $res->faultstring :
$soap->transport->status;
});
my ($citations, $citeFile, $bodyFile) =
$parsCitService->extractCitations($textFile, $repositoryID);
#print "$citations\n";
#print "CITEFILE: $citeFile\n";
#print "BODYFILE: $bodyFile\n";
From perldoc perlrun, about the -C switch:
Note: Since perl 5.10.1, if the -C option is used on the "#!" line, it
must be specified on the command line as well, since the standard
streams are already set up at this point in the execution of the perl
interpreter. You can also use binmode() to set the encoding of an I/O
stream.
Which is presumably what the compiler means by it being "too late".
In other words:
perl -CSD parsCit-client.pl
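If passing the switch on the command line is inconvenient, one alternative is to request the layers from inside the script with the open pragma. This is a minimal sketch of roughly what -CSD asks for, not a drop-in patch for parsCit-client.pl, and it uses the stricter :encoding(UTF-8) layer rather than the :utf8 layer that -C applies.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# :std applies the layer to STDIN, STDOUT and STDERR; the same layer also
# becomes the default for handles opened within this lexical scope.
use open qw(:std :encoding(UTF-8));

print "snowman: \x{2603}\n";   # a wide character, written out as UTF-8
```

Because the pragma is lexically scoped, it only affects open calls in the file or block where it appears.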
Command-line options in a #! "shebang" line are not passed consistently across all operating systems (see this answer), and Perl has already opened its streams before it parses the script's shebang, so on some older OSs it cannot compensate. For that reason it was decided in bug 34087 to forbid -C in the shebang. Of course, not everyone was happy with this "fix", particularly if it would otherwise have worked on their OS and they don't want to think about anything other than UTF-8.
If you think binmode() is ugly and unnecessary (and doesn't cover command-line arguments), you might like to consider the utf8::all package which has a similar effect to perl -CSDL.
Or, if you were using *nix, I would suggest export PERL_UNICODE="SDA" in the enclosing script to get Perl to realise it's in a UTF-8 environment.

perl mktemp and echo

I am trying to write some words to a temp file via the command line.
The temp file is created, but the words are not written to it.
#!/usr/bin/perl -w
system ('clear');
$TMPFILE = "mktemp /tmp/myfile/devid.XXXXXXXXXX";
$echo = "echo /"hello world/" >$TMPFILE";
system ("$TMPFILE");
system ("$echo");
Please help me solve this.
To capture the name output by mktemp, do this instead:
chomp($TMPFILE = `mktemp /tmp/myfile/devid.XXXXXXXXXX`);
But Perl can do all the things you are doing without resorting to the shell.
Avoid using external commands from a Perl script as much as possible.
You can use the File::Temp module in this case; see this.
Here's a specific demonstration of the advice that others have given you: where possible, use Perl directly rather than invoking system. Also, you should get in the habit of including use strict and use warnings in your Perl scripts.
use strict;
use warnings;
use File::Temp;
my $ft = File::Temp->new(
UNLINK => 0,
TEMPLATE => '/tmp/myfile/devid.XXXXXXXXXX',
);
print "Writing to temp file: ", $ft->filename, "\n";
print $ft "Hello, world.\n";

How can I copy files with special characters in their names with Perl's File::Copy?

I am trying to copy all files in one location to a different location, using the copy command from the File::Copy module. The issue I am facing is that I have a file whose name contains a special character with ASCII value &#253, but on the Unix file system it is stored as ?. Will the copy or move command handle files with such special characters when copying or moving them to another location?
If not, what would be a possible workaround?
Note: I cannot create a file with special characters on Unix because the special characters are replaced with ?, and I cannot do so on Windows because there the special characters are replaced with the encoded value, in my case &#253.
my $folderpath = 'the_path';
open my $IN, '<', 'path/to/infile';
my $total;
while (<$IN>) {
chomp;
my $size = -s "$folderpath/$_";
print "$_ => $size\n";
$total += $size;
}
print "Total => $total\n";
Courtesy: RickF Answer
Any suggestion would be highly appreciated.
Reference Question : Perl File Handling Question
As a workaround, I suggest converting all unsupported characters to supported ones. This can be done in many ways. For example, you can use URI::Escape:
use URI::Escape;
my $new_file_name = uri_escape($weird_file_name);
Update:
Here is how I was able to copy a file by its UTF-8 name. I'm on Windows. I used Win32::GetANSIPathName to get the short file name, and then the copy worked fine:
use File::Copy;
use URI::Escape;
use Win32;
use utf8; ## tell perl that source code is in UTF-8
use strict;
use warnings;
my $test_file = "IBMýSoftware.txt";
my $from_file = Win32::GetANSIPathName($test_file); ## get "short" name of file
my $to_file = uri_escape($test_file); ## name with special characters escaped
printf("copy [%s] -> [%s]\n", $from_file, $to_file);
copy($from_file, $to_file);
After copying all the files to new names on Windows, you'll be able to work with them on Linux without problems.
Here are some hints about utf-8 file opening:
How do I create a Unicode directory on Windows using Perl?
With a utf8-encoded Perl script, can it open a filename encoded as GB2312?
Character 253 is ý. I guess that on your Unix system the locale is not set, or only the most primitive fall-back locale is in effect, and that is why you see a replacement character. If I am guessing correctly, the solution is simply to set the locale to something, preferably a UTF-8 locale since that can handle all characters; Perl shouldn't even enter into the problem.
> cat 3761218.pl
use utf8;
use strict;
use warnings FATAL => 'all';
use autodie qw(:all);
my $file_name = '63551_106640_63551 IBMýSoftware Delivery&Fulfillment(Div-61) Data IPS 08-20-2010 v3.xlsm';
open my $h, '>', $file_name;
> perl 3761218.pl
> ls 6*
63551_106640_63551 IBMýSoftware Delivery&Fulfillment(Div-61) Data IPS 08-20-2010 v3.xlsm
> LANG=C ls 6* # temporarily cripple locale so that the problem in the question is exhibited
63551_106640_63551 IBM??Software Delivery&Fulfillment(Div-61) Data IPS 08-20-2010 v3.xlsm
> locale | head -1 # show which locale I have set
LANG=de_DE.UTF-8
The following script works as expected for me:
#!/usr/bin/perl
use strict; use warnings;
use autodie;
use File::Copy qw( copy );
use File::Spec::Functions qw( catfile );
my $fname = chr 0xfd;
open my $out, '>', catfile($ENV{TEMP}, $fname);
close $out;
copy catfile($ENV{TEMP}, $fname) => catfile($ENV{HOME}, $fname);