Win32::Console: InputChar and codepage 65001 - perl

When I run this script in a Windows console where the active codepage is 65001 InputChar returns undef if I enter an ö (U+00F6). Does this mean that InputChar doesn't work with cp65001?
#!perl
use warnings;
use strict;
use 5.10.0;
use Devel::Peek;
use Win32::Console;
my $in = Win32::Console->new( STD_INPUT_HANDLE );
$in->Mode( ENABLE_PROCESSED_INPUT );
my $char = $in->InputChar();
Dump $char;
say "{$char}";
C:>chcp 65001
Active code page: 65001
C:>perl.pl
SV = NULL(0x0) at 0x12b6fac
REFCNT = 1
FLAGS = (PADMY)
Use of uninitialized value $char in concatenation (.) or string at ... line 21.
{}

If you look inside sub InputChar you can see it uses _ReadConsole which doesn't do unicode (i think char * isn't unicode)
It also doesn't do unicode because of the way ReadConsole function (Windows) is called, at least that is what documentation hints to me :)
update: OTOH, if I edit Win32-Console-0.10\Makefile.PL to add
DEFINE => ' -DUNICODE ',
and then recompile/reinstall Win32::Console, I can get AöBöCöDö10 into the program using the following
my $chars = ShInputChar( $in, 10 );
sub ShInputChar {
package Win32::Console;
my($self, $number) = #_;
return undef unless ref($self);
$number = 1 unless defined($number);
my $onumber = $number;
## double up or free to wrong pool, char versus wchar
$number = 2 * $number;
my $buffer = (" " x $number);
my $readed = _ReadConsole($self->{'handle'}, $buffer, $number) ;
my $err = sprintf "ErrSet \$!(%d)(%s)\n\$^E(%d)(%s)\n", $!,$!,$^E,$^E;
use Encode;
$buffer = Encode::decode('UTF-16LE', $buffer );
if ( $readed == $number or $onumber == $readed ) {
return $buffer;
}
else {
warn "wanted $number but read $readed returning buffer anyway";
return $buffer;
}
}
You should report this to the author , hes more knowledgeable about win32

I would be very cautious with libwin32 (of which Win32::Console is a part) as it was last updated over six years ago, in the early days of Windows Vista.
You may want to try Win32::Unicode::Console which has a very different API but is designed for your purpose.

Related

Perl printf to use commas as thousands-separator

Using awk, I can print a number with commas as thousands separators.
(with a export LC_ALL=en_US.UTF-8 beforehand).
awk 'BEGIN{printf("%\047d\n", 24500)}'
24,500
I expected the same format to work with Perl, but it does not:
perl -e 'printf("%\047d\n", 24500)'
%'d
The Perl Cookbook offers this solution:
sub commify {
my $text = reverse $_[0];
$text =~ s/(\d\d\d)(?=\d)(?!\d*\.)/$1,/g;
return scalar reverse $text;
}
However I am assuming that since the printf option works in awk, it should also work in Perl.
The apostrophe format modifier is a non-standard POSIX extension.
The documentation for Perl's printf has this to say about such extensions
Perl does its own "sprintf" formatting: it emulates the C
function sprintf(3), but doesn't use it except for
floating-point numbers, and even then only standard modifiers
are allowed. Non-standard extensions in your local sprintf(3)
are therefore unavailable from Perl.
The Number::Format module will do this for you, and it takes its default settings from the locale, so is as portable as it can be
use strict;
use warnings 'all';
use v5.10.1;
use Number::Format 'format_number';
say format_number(24500);
output
24,500
A more perl-ish solution:
$a = 12345678; # no comment
$b = reverse $a; # $b = '87654321';
#c = unpack("(A3)*", $b); # $c = ('876', '543', '21');
$d = join ',', #c; # $d = '876,543,21';
$e = reverse $d; # $e = '12,345,678';
print $e;
outputs 12,345,678.
I realize this question was from almost 4 years ago, but since it comes up in searches, I'll add an elegant native Perl solution I came up with. I was originally searching for a way to do it with sprintf, but everything I've found indicates that it can't be done. Then since everyone is rolling their own, I thought I'd give it a go, and this is my solution.
$num = 12345678912345; # however many digits you want
while($num =~ s/(\d+)(\d\d\d)/$1\,$2/){};
print $num;
Results in:
12,345,678,912,345
Explanation:
The Regex does a maximal digit search for all leading digits. The minimum number of digits in a row it'll act on is 4 (1 plus 3). Then it adds a comma between the two. Next loop if there are still 4 digits at the end (before the comma), it'll add another comma and so on until the pattern doesn't match.
If you need something safe for use with more than 3 digits after the decimal, use this modification: (Note: This won't work if your number has no decimal)
while($num =~ s/(\d+)(\d\d\d)([.,])/$1\,$2$3/){};
This will ensure that it will only look for digits that ends in a comma (added on a previous loop) or a decimal.
Most of these answers assume that the format is universal. It isn't. CLDR uses Unicode information to figure it out. There's a long thread in How to properly localize numbers?.
CPAN has the CLDR::Number module:
#!perl
use v5.10;
use CLDR::Number;
use open qw(:std :utf8);
my $locale = $ARGV[0] // 'en';
my #numbers = qw(
123
12345
1234.56
-90120
);
my $cldr = CLDR::Number->new( locale => $locale );
my $decf = $cldr->decimal_formatter;
foreach my $n ( #numbers ) {
say $decf->format($n);
}
Here are a few runs:
$ perl comma.pl
123
12,345
1,234.56
-90,120
$ perl comma.pl es
123
12.345
1234,56
-90.120
$ perl comma.pl bn
১২৩
১২,৩৪৫
১,২৩৪.৫৬
-৯০,১২০
It seems heavyweight, but the output is correct and you don't have to allow the user to change the locale you want to use. However, when it's time to change the locale, you are ready to go. I also prefer this to Number::Format because I can use a locale that's different from my local settings for my terminal or session, or even use multiple locales:
#!perl
use v5.10;
use CLDR::Number;
use open qw(:std :utf8);
my #locales = qw( en pt bn );
my #numbers = qw(
123
12345
1234.56
-90120
);
my #formatters = map {
my $cldr = CLDR::Number->new( locale => $_ );
my $decf = $cldr->decimal_formatter;
[ $_, $cldr, $decf ];
} #locales;
printf "%10s %10s %10s\n" . '=' x 32 . "\n", #locales;
foreach my $n ( #numbers ) {
printf "%10s %10s %10s\n",
map { $_->[-1]->format($n) } #formatters;
}
The output has three locales at once:
en pt bn
================================
123 123 ১২৩
12,345 12.345 ১২,৩৪৫
1,234.56 1.234,56 ১,২৩৪.৫৬
-90,120 -90.120 -৯০,১২০
Here's an elegant Perl solution I've been using for over 20 years :)
1 while $text =~ s/(.*\d)(\d\d\d)/$1\.$2/g;
And if you then want two decimal places:
$text = sprintf("%0.2f", $text);
1 liner: Use a little loop whith a regex:
while ($number =~ s/^(\d+)(\d{3})/$1,$2/) {}
Example:
use strict;
use warnings;
my #numbers = (12321, 12.12, 122222.3334, '1234abc', '1.1', '1222333444555,666.77');
for(#numbers) {
my $number = $_;
while ($number =~ s/^(\d+)(\d{3})/$1,$2/) {}
print "$_ -> $number\n";
}
Output:
12321 -> 12,321
12.12 -> 12.12
122222.3334 -> 122,222.3334
1234abc -> 1,234abc
1.1 -> 1.1
1222333444555,666.77 -> 1,222,333,444,555,666.77
Pattern:
(\d+)(\d{3})
-> Take all numbers but the last 3 in group 1
-> Take the remaining 3 numbers in group2 on the beginning of $number
-> Followed is ignored
Substitution
$1,$2
-> Put a seperator sign (,) between group 1 and 2
-> The rest remains unchanged
So if you have 12345.67 the numers the regex uses are 12345. The '.' and all followed is ignored.
1. run (12345.67):
-> matches: 12345
-> group 1: 12,
group 2: 345
-> substitute 12,345
-> result: 12,345.67
2. run (12,345.67):
-> does not match!
-> while breaks.
Parting from #Laura's answer, I tweaked the pure perl, regex-only solution to work for numbers with decimals too:
while ($formatted_number =~ s/^(-?\d+)(\d{3}(?:,\d{3})*(?:\.\d+)*)$/$1,$2/) {};
Of course this assumes a "," as thousands separator and a "." as decimal separator, but it should be trivial to use variables to account for that for your given locale(s).
I used the following but it does not works as of perl v5.26.1
sub format_int
{
my $num = shift;
return reverse(join(",",unpack("(A3)*", reverse int($num))));
}
The form that worked for me was:
sub format_int
{
my $num = shift;
return scalar reverse(join(",",unpack("(A3)*", reverse int($num))));
}
But to use negative numbers the code must be:
sub format_int
{
if ( $val >= 0 ) {
return scalar reverse join ",", unpack( "(A3)*", reverse int($val) );
} else {
return "-" . scalar reverse join ",", unpack( "(A3)*", reverse int(-$val) );
}
}
Did somebody say Perl?
perl -pe '1while s/(\d+)(\d{3})/$1,$2/'
This works for any integer.
# turning above answer into a function
sub format_float
# returns number with commas..... and 2 digit decimal
# so format_float(12345.667) returns "12,345.67"
{
my $num = shift;
return reverse(join(",",unpack("(A3)*", reverse int($num)))) . sprintf(".%02d",int(100*(.005+($num - int($num)))));
}
sub format_int
# returns number with commas.....
# so format_int(12345.667) returns "12,345"
{
my $num = shift;
return reverse(join(",",unpack("(A3)*", reverse int($num))));
}
I wanted to print numbers it in a currency format. If it turned out even, I still wanted a .00 at the end. I used the previous example (ty) and diddled with it a bit more to get this.
sub format_number {
my $num = shift;
my $result;
my $formatted_num = "";
my #temp_array = ();
my $mantissa = "";
if ( $num =~ /\./ ) {
$num = sprintf("%0.02f",$num);
($num,$mantissa) = split(/\./,$num);
$formatted_num = reverse $num;
#temp_array = unpack("(A3)*" , $formatted_num);
$formatted_num = reverse (join ',', #temp_array);
$result = $formatted_num . '.'. $mantissa;
} else {
$formatted_num = reverse $num;
#temp_array = unpack("(A3)*" , $formatted_num);
$formatted_num = reverse (join ',', #temp_array);
$result = $formatted_num . '.00';
}
return $result;
}
# Example call
# ...
printf("some amount = %s\n",format_number $some_amount);
I didn't have the Number library on my default mac OS X perl, and I didn't want to mess with that version or go off installing my own perl on this machine. I guess I would have used the formatter module otherwise.
I still don't actually like the solution all that much, but it does work.
This is good for money, just keep adding lines if you handle hundreds of millions.
sub commify{
my $var = $_[0];
#print "COMMIFY got $var\n"; #DEBUG
$var =~ s/(^\d{1,3})(\d{3})(\.\d\d)$/$1,$2$3/;
$var =~ s/(^\d{1,3})(\d{3})(\d{3})(\.\d\d)$/$1,$2,$3$4/;
$var =~ s/(^\d{1,3})(\d{3})(\d{3})(\d{3})(\.\d\d)$/$1,$2,$3,$4$5/;
$var =~ s/(^\d{1,3})(\d{3})(\d{3})(\d{3})(\d{3})(\.\d\d)$/$1,$2,$3,$4,$5$6/;
#print "COMMIFY made $var\n"; #DEBUG
return $var;
}
A solution that produces a localized output:
# First part - Localization
my ( $thousands_sep, $decimal_point, $negative_sign );
BEGIN {
my ( $l );
use POSIX qw(locale_h);
$l = localeconv();
$thousands_sep = $l->{ 'thousands_sep' };
$decimal_point = $l->{ 'decimal_point' };
$negative_sign = $l->{ 'negative_sign' };
}
# Second part - Number transformation
sub readable_number {
my $val = shift;
#my $thousands_sep = ".";
#my $decimal_point = ",";
#my $negative_sign = "-";
sub _readable_int {
my $val = shift;
# a pinch of PERL magic
return scalar reverse join $thousands_sep, unpack( "(A3)*", reverse $val );
}
my ( $i, $d, $r );
$i = int( $val );
if ( $val >= 0 ) {
$r = _readable_int( $i );
} else {
$r = $negative_sign . _readable_int( -$i );
}
# If there is decimal part append it to the integer result
if ( $val != $i ) {
( undef, $d ) = ( $val =~ /(\d*)\.(\d*)/ );
$r = $r . $decimal_point . $d;
}
return $r;
}
The first part gets the symbols used in the current locale to be used on the second part.
The BEGIN block is used to calculate the sysmbols only once at the beginning.
If for some reason there is need to not use POSIX locale, one can ommit the first part and uncomment the variables on the second part to hardcode the sysmbols to be used ($thousands_sep, $thousands_sep and $thousands_sep)

Fastest method for checking if a LF is at the end of a large scalar in Perl?

I've come up with the following to check the final character of a $scaler for a linefeed:
if( $buffer !~ /\n$/ ) {
if( substr( $buffer, -1, 1 ) !~ /\n/ ) {
if( substr( $buffer, -1, 1 ) ne '\n' ) {
Is there a faster method I could? The size of the $buffer scalar can get large and I've noticed that the larger it gets, the longer these conditionals take to run. I do have another scalar containing the length of $buffer, if that would help.
Thanks
The full code:
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw();
use Time::HiRes qw( gettimeofday tv_interval );
use constant BUFSIZE => 2 ** 21; # 2MB worked best for me, YMMV.
die "ERROR: Missing filename" if( !$ARGV[0] );
my $top = [gettimeofday];
sysopen( my $fh, $ARGV[0], Fcntl::O_RDONLY | Fcntl::O_BINARY ) or
die "ERROR: Unable to open $ARGV[0], because $!\n";
open my $output, ">", "/dev/null"; # for 'dummy' processing
my $size = -s $ARGV[0];
my $osiz = $size;
my( $buffer, $offset, $lnCtr ) = ( "", "", 0 );
while( $size ) {
my $read = sysread( $fh, $buffer, BUFSIZE, length($offset) );
$size -= $read;
my #lines = split /\n/, $buffer;
if( substr( $buffer, -1, 1 ) ne "\n" ) {
$offset = pop( #lines );
} else {
$offset = "";
}
for my $line ( #lines ) {
processLine( \$line );
$lnCtr++;
}
$buffer = $offset if( $offset );
}
close $fh;
print "Processed $lnCtr lines ($osiz bytes) in file: $ARGV[0] in ".
tv_interval( $top ).
" secs.\n";
print "Using a buffered read of ".BUFSIZE." bytes. - JLB\n";
sub processLine {
if( ref($_[0]) ) {
print $output ${$_[0]}."\n";
} else {
print $output $_[0]."\n";
}
return 0;
}
I think I've reached that 'point-of-diminishing returns' in my attempts of making this run any faster. It seems to now be able to read in data as fast as my RAID5 SSDs are able to fetch it. As you can see, there is a reason I didn't use chomp(), the input can contain hundreds of thousands of linefeeds, which I need to keep to be able to break the lines for processing.
./fastread.pl newdata.log
Processed 516670 lines (106642635 bytes) in file: newdata.log in 0.674738 secs.
Using a buffered read of 2097152 bytes. - JLB
Perl has two string storage formats.
One of the formats uses the same number of bytes (1) to store each possible character the string can contain. Because of that and because Perl keeps track of how many bytes is used by a string, the performance of substr($x, -1) on a string in this format does not depend on the the length of the string.
The problem with the aforementioned format is that it can only store a very limited range of characters. It could be used to store the Unicode code points "Eric" and "Éric", but not for "Ελλάδα". When necessary (and even when not necessary), Perl will automatically switch a string's storage format to the other format.
The second format can store any Unicode code point as a character. In fact, it can store any 32-bit or 64-bit value (depending on perl's build settings). The downside is that a variable number of bytes is used to store each character. So even though Perl knows the number of bytes used by the entire string, it doesn't know where any character but the first one starts.* To find the last character, it must scan the entire string.
That said, because of properties of the storage format, it would actually be quite easy to find the last char of a string in constant time.
use Inline C => <<'__END_OF_C__';
# O(1) version of substr($x,-1)
SV* last_char(SV* sv) {
STRLEN len;
const char* s = SvPV(sv, len);
if (!len)
return newSVpvn("", 0);
{
const U32 utf8 = SvUTF8(sv);
const char* p = s+len-1;
if (utf8) {
while (p != s && (*p & 0xC0) != 0xC0)
--p;
}
return newSVpvn_utf8(p, s+len-p, utf8);
}
}
__END_OF_C__
* — It does keep a cache of the couple of char position to byte position mappings.
You've shown code which can be cleaned up so you don't even need to check the last char for a newline.
sub processLine {
print $_[0] $_[1];
}
open(my $fh, '<:raw', $ARGV[0])
or die("Can't open $ARGV[0]: $!\n");
my $buffer = '';
my $lnCtr = 0;
while (1) {
my $rv = sysread($fh, $buffer, BUFSIZE, length($buffer));
die $! if !defined($rv);
last if !$rv;
while ($buffer =~ s/(.*\n)//) {
processLine($1);
++$lnCtr;
}
}
if (length($buffer)) {
processLine($output, $buffer);
++$lnCtr;
}
Notes:
No need for sysopen. open is simpler.
If you pass $buffer to sysread, it doesn't make sense to use length($offset).
As you can see, $offset and the copying thereof is completely unnecessary.
Passing a var to a sub does not copy it, so no need to pass a reference.
If processLine doesn't need the newline, use s/(.*)\n// instead.
Why are you concerned about speed? Is this piece of code in a part of your program that is measurably slow, perhaps profiled with Devel::NYTProf? If not, then I suggest you go with what is the clearest to read and the most idiomatic, which is probably
if( $buffer !~ /\n$/ )
Your final version:
if( substr( $buffer, -1, 1 ) ne '\n' )
would also be a fine choice except for your single-quoting the linefeed, thus giving you a two-character string consisting of a backslash and a lowercase n. Perhaps you're coming from C where single characters are single quoted and strings are double-quoted? You want
if( substr( $buffer, -1, 1 ) ne "\n" )
This version
if( substr( $buffer, -1, 1 ) !~ /\n/ )
is doing a regex match that it shouldn't be because it's checking a one-character string against a single-character regex. The next person to read the code will think that's strange and wonder why you'd do that. Also, back to that speed thing, it's slower to match a string against a regex than just compare against a single character for equality.
Here is a Benchmark:
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(:all);
my $buffer = 'abc'x10_000_000;
$buffer .= "\n";
my $count = -2;
cmpthese($count, {
'regex' => sub {
if ($buffer !~ /\n$/) { }
},
'substr + regex' => sub {
if (substr($buffer, -1, 1) !~ /\n$/) { }
},
'substr + ne' => sub {
if (substr($buffer, -1, 1) ne "\n") { }
},
'chomp' => sub {
if (chomp $buffer) { }
},
});
Output:
Rate substr + regex substr + ne regex chomp
substr + regex 6302468/s -- -11% -44% -70%
substr + ne 7072032/s 12% -- -37% -66%
regex 11294695/s 79% 60% -- -46%
chomp 20910531/s 232% 196% 85% --
chomp is certainly the fastest way.
I suspect perl is treating the string as utf-8 and has to iterate over the whole thing for some reason.
You could temporarily switch to byte semantics to see if the char on the end is a newline.
See docs for Perl's bytes pragma and perlunicode.
You can try chomp. Chomp will return the number of EOL characters removed from the end of a line:
if ( chomp $buffer ) {
print "You had an LF on the end of \$buffer";
}
Of course, chomp removes the NL characters it counted.

How to match Chinese character in the web pages whose charset is big5 (Perl)?

I am doing a match between Chinese words, for example, "语言中心“ and a mount of web files (php, html, htm, etc).
However, somehow I get the following error:
Malformed UTF-8 character (1 byte, need 2, after start byte 0xdf) in regexp compilation at ../Final_FindOnlyNoReplace_CLE_Chinese.pl line 89, <INFILE> line 12.
Can anyone help?
Here is my code.
#!/usr/bin/env perl
use Encode qw/encode decode/;
use utf8;
use strict;
use Cwd;
use LWP::UserAgent;
my($path) = #_;
## append a trailing / if it's not there
$path .= '/' if($path !~ /\/$/);
use File::Glob ':glob';
my #all_files = bsd_glob($path."*");
for my $eachFile (#all_files) {
open(INFILE, "<$eachFile") || die ("Could not open '$eachFile'\n");
my(#inlines) = <INFILE>;
my($line, $find);
my $outkey = 1;
foreach $line (#inlines) {
$find = &find($line);
if ($find ne 'false') {
chomp($line);
print "\tline$outkey : $line\n";
}
$outkey ++;
}
}
#subroutine
sub find {
my $m = encode("utf8", decode("big5", #_));
my $html = LWP::UserAgent->new
->get($m)
->decoded_content;
my $str_chinese = '語言中心';
if ($m =~ /$str_chinese/) {
$m; ##if match, return the whole line.
}
}
You aren't searching in $html you've retrieved and decoded, but in URL instead: $m =~ /$str_chinese/, which, I guess, is not what you intend.
Also, you're comparing result of find function with exact string "false," which will never work. Change if ($find ne 'false') to if (defined($find)) and add explicit returns for success and failure to find for clarity.
Finally, you script seems to fail because you point it to directory that have some other Perl scripts amongst other files. They're most likely in UTF-8, so when your script tries to read them as big5 data, it falis on decoding. Just change your glob to cover data files only.
#!/usr/bin/env perl
use utf8;
use strictures;
use LWP::UserAgent qw();
use Path::Class::Rule qw();
use URI::file qw();
my $start_directory = q(.);
my $search_text = qr'語言中心';
my $next = Path::Class::Rule->new->name(qw(*.php *.htm*))->iter($start_directory);
my #matching_lines;
while (my $file = $next->()) {
for my $line (split /\R/, LWP::UserAgent
->new
->get(URI::file->new_abs($file))
->decoded_content
) {
push #matching_lines, $line if $line =~ $search_text;
}
}
# #matching_lines is (
# '<title>Untitled 語言中心 Document</title>',
# 'abc 語言中心 cde',
# '天天向上語言中心他'
# )

Question about pathname encoding

What have I done to get such a strange encoding in this path-name?
In my file manager (Dolphin) the path-name looks good.
#!/usr/local/bin/perl
use warnings;
use 5.014;
use utf8;
use open qw( :encoding(UTF-8) :std );
use File::Find;
use Devel::Peek;
use Encode qw(decode);
my $string;
find( sub { $string = $File::Find::name }, 'Delibes, Léo' );
$string =~ s|Delibes,\ ||;
$string =~ s|\..*\z||;
my ( $s1, $s2 ) = split m|/|, $string, 2;
say Dump $s1;
say Dump $s2;
# SV = PV(0x824b50) at 0x9346d8
# REFCNT = 1
# FLAGS = (PADMY,POK,pPOK,UTF8)
# PV = 0x93da30 "L\303\251o"\0 [UTF8 "L\x{e9}o"]
# CUR = 4
# LEN = 16
# SV = PV(0x7a7150) at 0x934c30
# REFCNT = 1
# FLAGS = (PADMY,POK,pPOK,UTF8)
# PV = 0x7781e0 "Lakm\303\203\302\251"\0 [UTF8 "Lakm\x{c3}\x{a9}"]
# CUR = 8
# LEN = 16
say $s1;
say $s2;
# Léo
# Lakmé
$s1 = decode( 'utf-8', $s1 );
$s2 = decode( 'utf-8', $s2 );
say $s1;
say $s2;
# L�o
# Lakmé
Unfortunately your operating system's pathname API is another "binary interface" where you will have to use Encode::encode and Encode::decode to get predictable results.
Most operating systems treat pathnames as a sequence of octets (i.e. bytes). Whether that sequence should be interpreted as latin-1, UTF-8 or other character encoding is an application decision. Consequently the value returned by readdir() is simply a sequence of octets, and File::Find doesn't know that you want the path name as Unicode code points. It forms $File::Find::name by simply concatenating the directory path (which you supplied) with the value returned by your OS via readdir(), and that's how you got code points mashed with octets.
Rule of thumb: Whenever passing path names to the OS, Encode::encode() it to make sure it is a sequence of octets. When getting a path name from the OS, Encode::decode() it to the character set that your application wants it in.
You can make your program work by calling find this way:
find( sub { ... }, Encode::encode('utf8', 'Delibes, Léo') );
And then calling Encode::decode() when using the value of $File::Find::name:
my $path = Encode::decode('utf8', $File::Find::name);
To be more clear, this is how $File::Find::name was formed:
use Encode;
# This is a way to get $dir to be represented as a UTF-8 string
my $dir = 'L' .chr(233).'o'.chr(256);
chop $dir;
say "dir: ", d($dir); # length = 3
# This is what readdir() is returning:
my $leaf = encode('utf8', 'Lakem' . chr(233));
say "leaf: ", d($leaf); # length = 7
$File::Find::name = $dir . '/' . $leaf;
say "File::Find::name: ", d($File::Find::name);
sub d {
join(' ', map { sprintf("%02X", ord($_)) } split('', $_[0]))
}
The POSIX filesystem API is broken as no encoding is enforced. Period.
Many problems can happen. For example a pathname can even contain both latin1 and UTF-8 depending on how various filesystems on a path handle encoding (and if they do).

How can I get a call stack listing in Perl?

Is there a way I can access (for printout) a list of sub + module to arbitrary depth of sub-calls preceding a current position in a Perl script?
I need to make changes to some Perl modules (.pm's). The workflow is initiated from a web-page thru a cgi-script, passing input through several modules/objects ending in the module where I need to use the data. Somewhere along the line the data got changed and I need to find out where.
You can use Devel::StackTrace.
use Devel::StackTrace;
my $trace = Devel::StackTrace->new;
print $trace->as_string; # like carp
It behaves like Carp's trace, but you can get more control over the frames.
The one problem is that references are stringified and if a referenced value changes, you won't see it. However, you could whip up some stuff with PadWalker to print out the full data (it would be huge, though).
This code works without any additional modules.
Just include it where needed.
my $i = 1;
print STDERR "Stack Trace:\n";
while ( (my #call_details = (caller($i++))) ){
print STDERR $call_details[1].":".$call_details[2]." in function ".$call_details[3]."\n";
}
Carp::longmess will do what you want, and it's standard.
use Carp qw<longmess>;
use Data::Dumper;
sub A { &B; }
sub B { &C; }
sub C { &D; }
sub D { &E; }
sub E {
# Uncomment below if you want to see the place in E
# local $Carp::CarpLevel = -1;
my $mess = longmess();
print Dumper( $mess );
}
A();
__END__
$VAR1 = ' at - line 14
main::D called at - line 12
main::C called at - line 10
main::B called at - line 8
main::A() called at - line 23
';
I came up with this sub (Now with optional blessin' action!)
my $stack_frame_re = qr{
^ # Beginning of line
\s* # Any number of spaces
( [\w:]+ ) # Package + sub
(?: [(] ( .*? ) [)] )? # Anything between two parens
\s+ # At least one space
called [ ] at # "called" followed by a single space
\s+ ( \S+ ) \s+ # Spaces surrounding at least one non-space character
line [ ] (\d+) # line designation
}x;
sub get_stack {
my #lines = split /\s*\n\s*/, longmess;
shift #lines;
my #frames
= map {
my ( $sub_name, $arg_str, $file, $line ) = /$stack_frame_re/;
my $ref = { sub_name => $sub_name
, args => [ map { s/^'//; s/'$//; $_ }
split /\s*,\s*/, $arg_str
]
, file => $file
, line => $line
};
bless $ref, $_[0] if #_;
$ref
}
#lines
;
return wantarray ? #frames : \#frames;
}
caller can do that, though you may want even more information than that.
There's also Carp::confess and Carp::cluck.
In case you can't use (or would like to avoid) non-core modules, here's a simple subroutine I came up with:
#!/usr/bin/perl
use strict;
use warnings;
sub printstack {
my ($package, $filename, $line, $subroutine, $hasargs, $wantarray, $evaltext, $is_require, $hints, $bitmask, $hinthash);
my $i = 1;
my #r;
while (#r = caller($i)) {
($package, $filename, $line, $subroutine, $hasargs, $wantarray, $evaltext, $is_require, $hints, $bitmask, $hinthash) = #r;
print "$filename:$line $subroutine\n";
$i++;
}
}
sub i {
printstack();
}
sub h {
i;
}
sub g {
h;
}
g;
It produces output like as follows:
/root/_/1.pl:21 main::i
/root/_/1.pl:25 main::h
/root/_/1.pl:28 main::g
Or a oneliner:
for (my $i = 0; my #r = caller($i); $i++) { print "$r[1]:$r[2] $r[3]\n"; }
You can find documentation on caller here.
One that is more pretty: Devel::PrettyTrace
use Devel::PrettyTrace;
bt;
Moving my comment to an answer:
Install Devel::Confess the right way
cpanm Devel::Confess
Run with
perl -d:Confess myscript.pl
On errors, this will show the whole call stack list.