I have a perl string containing Unicode characters and I want to create a file with this string as a filename. It should work on Windows, Linux and Mac whatever the locale used.
Here is my code:
use strict;
use warnings FATAL => 'all';
use Encode::Locale;
use Encode;
# ファイル.c
my $file = "\x{30D5}\x{30A1}\x{30A4}\x{30EB}.c";
$file = encode(locale_fs => $file);
open(my $filehdl, '>', $file) or die("Unable to create file: $!");
close($filehdl);
I use encode function because, according to this answer:
Perl treats file names as opaque strings of bytes. They need to be encoded as per your "locale"'s encoding (ANSI code page).
However, this code fails with the following error:
Unable to create file: Invalid argument at .\perl.pl line 15.
I took a deeper look on how the string is encoded by encode:
my $rep = sprintf '%v02X', $file;
print($rep);
This prints:
3F.3F.3F.3F.2E.63
In my current locale (CP-1252), it corresponds to ????.c. We can see that each Unicode characters has been replaced by a question mark.
I think it is normal to have question marks here because the characters in my string are not representable using CP-1252 encoding.
So, my question is: is there a way to create a file with a name containing Unicode characters?
For Windows there is a module Win32::LongPath, which not only allows long file names, but also unicode characters.
I wrote myself a module for all kinds of file and dir IO that I need, that on Windows uses these module's functions, and else the standard perl ones, like so:
use Carp;
use Fcntl qw( :flock :seek );
use constant USE_LONG => ($^O =~ /Win/i) ? 1 : 0;
use if USE_LONG, 'Win32::LongPath', ':funcs';
sub open
{
my $f = shift; # file
my $m = shift; # mode
my $l = #_ ? (shift) : 'utf8'; # encoding
my $lock = $m eq '<' ? LOCK_SH : LOCK_EX;
length $l
and $m .= ":$l";
my $h;
USE_LONG ? openL( \$h, $m, $f ) : open( $h, $m, $f ) # openL needs REF on Handle!
or confess "Can't open file: '$f' ($^E)";
flock( $h, $lock );
return $h;
}
That way the code is portable. It runs on a Linux server as well as on my Windows PC at home.
Using awk, I can print a number with commas as thousands separators.
(with a export LC_ALL=en_US.UTF-8 beforehand).
awk 'BEGIN{printf("%\047d\n", 24500)}'
24,500
I expected the same format to work with Perl, but it does not:
perl -e 'printf("%\047d\n", 24500)'
%'d
The Perl Cookbook offers this solution:
sub commify {
my $text = reverse $_[0];
$text =~ s/(\d\d\d)(?=\d)(?!\d*\.)/$1,/g;
return scalar reverse $text;
}
However I am assuming that since the printf option works in awk, it should also work in Perl.
The apostrophe format modifier is a non-standard POSIX extension.
The documentation for Perl's printf has this to say about such extensions
Perl does its own "sprintf" formatting: it emulates the C
function sprintf(3), but doesn't use it except for
floating-point numbers, and even then only standard modifiers
are allowed. Non-standard extensions in your local sprintf(3)
are therefore unavailable from Perl.
The Number::Format module will do this for you, and it takes its default settings from the locale, so is as portable as it can be
use strict;
use warnings 'all';
use v5.10.1;
use Number::Format 'format_number';
say format_number(24500);
output
24,500
A more perl-ish solution:
$a = 12345678; # no comment
$b = reverse $a; # $b = '87654321';
#c = unpack("(A3)*", $b); # $c = ('876', '543', '21');
$d = join ',', #c; # $d = '876,543,21';
$e = reverse $d; # $e = '12,345,678';
print $e;
outputs 12,345,678.
I realize this question was from almost 4 years ago, but since it comes up in searches, I'll add an elegant native Perl solution I came up with. I was originally searching for a way to do it with sprintf, but everything I've found indicates that it can't be done. Then since everyone is rolling their own, I thought I'd give it a go, and this is my solution.
$num = 12345678912345; # however many digits you want
while($num =~ s/(\d+)(\d\d\d)/$1\,$2/){};
print $num;
Results in:
12,345,678,912,345
Explanation:
The Regex does a maximal digit search for all leading digits. The minimum number of digits in a row it'll act on is 4 (1 plus 3). Then it adds a comma between the two. Next loop if there are still 4 digits at the end (before the comma), it'll add another comma and so on until the pattern doesn't match.
If you need something safe for use with more than 3 digits after the decimal, use this modification: (Note: This won't work if your number has no decimal)
while($num =~ s/(\d+)(\d\d\d)([.,])/$1\,$2$3/){};
This will ensure that it will only look for digits that ends in a comma (added on a previous loop) or a decimal.
Most of these answers assume that the format is universal. It isn't. CLDR uses Unicode information to figure it out. There's a long thread in How to properly localize numbers?.
CPAN has the CLDR::Number module:
#!perl
use v5.10;
use CLDR::Number;
use open qw(:std :utf8);
my $locale = $ARGV[0] // 'en';
my #numbers = qw(
123
12345
1234.56
-90120
);
my $cldr = CLDR::Number->new( locale => $locale );
my $decf = $cldr->decimal_formatter;
foreach my $n ( #numbers ) {
say $decf->format($n);
}
Here are a few runs:
$ perl comma.pl
123
12,345
1,234.56
-90,120
$ perl comma.pl es
123
12.345
1234,56
-90.120
$ perl comma.pl bn
১২৩
১২,৩৪৫
১,২৩৪.৫৬
-৯০,১২০
It seems heavyweight, but the output is correct and you don't have to allow the user to change the locale you want to use. However, when it's time to change the locale, you are ready to go. I also prefer this to Number::Format because I can use a locale that's different from my local settings for my terminal or session, or even use multiple locales:
#!perl
use v5.10;
use CLDR::Number;
use open qw(:std :utf8);
my #locales = qw( en pt bn );
my #numbers = qw(
123
12345
1234.56
-90120
);
my #formatters = map {
my $cldr = CLDR::Number->new( locale => $_ );
my $decf = $cldr->decimal_formatter;
[ $_, $cldr, $decf ];
} #locales;
printf "%10s %10s %10s\n" . '=' x 32 . "\n", #locales;
foreach my $n ( #numbers ) {
printf "%10s %10s %10s\n",
map { $_->[-1]->format($n) } #formatters;
}
The output has three locales at once:
en pt bn
================================
123 123 ১২৩
12,345 12.345 ১২,৩৪৫
1,234.56 1.234,56 ১,২৩৪.৫৬
-90,120 -90.120 -৯০,১২০
Here's an elegant Perl solution I've been using for over 20 years :)
1 while $text =~ s/(.*\d)(\d\d\d)/$1\.$2/g;
And if you then want two decimal places:
$text = sprintf("%0.2f", $text);
1 liner: Use a little loop whith a regex:
while ($number =~ s/^(\d+)(\d{3})/$1,$2/) {}
Example:
use strict;
use warnings;
my #numbers = (12321, 12.12, 122222.3334, '1234abc', '1.1', '1222333444555,666.77');
for(#numbers) {
my $number = $_;
while ($number =~ s/^(\d+)(\d{3})/$1,$2/) {}
print "$_ -> $number\n";
}
Output:
12321 -> 12,321
12.12 -> 12.12
122222.3334 -> 122,222.3334
1234abc -> 1,234abc
1.1 -> 1.1
1222333444555,666.77 -> 1,222,333,444,555,666.77
Pattern:
(\d+)(\d{3})
-> Take all numbers but the last 3 in group 1
-> Take the remaining 3 numbers in group2 on the beginning of $number
-> Followed is ignored
Substitution
$1,$2
-> Put a seperator sign (,) between group 1 and 2
-> The rest remains unchanged
So if you have 12345.67 the numers the regex uses are 12345. The '.' and all followed is ignored.
1. run (12345.67):
-> matches: 12345
-> group 1: 12,
group 2: 345
-> substitute 12,345
-> result: 12,345.67
2. run (12,345.67):
-> does not match!
-> while breaks.
Parting from #Laura's answer, I tweaked the pure perl, regex-only solution to work for numbers with decimals too:
while ($formatted_number =~ s/^(-?\d+)(\d{3}(?:,\d{3})*(?:\.\d+)*)$/$1,$2/) {};
Of course this assumes a "," as thousands separator and a "." as decimal separator, but it should be trivial to use variables to account for that for your given locale(s).
I used the following but it does not works as of perl v5.26.1
sub format_int
{
my $num = shift;
return reverse(join(",",unpack("(A3)*", reverse int($num))));
}
The form that worked for me was:
sub format_int
{
my $num = shift;
return scalar reverse(join(",",unpack("(A3)*", reverse int($num))));
}
But to use negative numbers the code must be:
sub format_int
{
if ( $val >= 0 ) {
return scalar reverse join ",", unpack( "(A3)*", reverse int($val) );
} else {
return "-" . scalar reverse join ",", unpack( "(A3)*", reverse int(-$val) );
}
}
Did somebody say Perl?
perl -pe '1while s/(\d+)(\d{3})/$1,$2/'
This works for any integer.
# turning above answer into a function
sub format_float
# returns number with commas..... and 2 digit decimal
# so format_float(12345.667) returns "12,345.67"
{
my $num = shift;
return reverse(join(",",unpack("(A3)*", reverse int($num)))) . sprintf(".%02d",int(100*(.005+($num - int($num)))));
}
sub format_int
# returns number with commas.....
# so format_int(12345.667) returns "12,345"
{
my $num = shift;
return reverse(join(",",unpack("(A3)*", reverse int($num))));
}
I wanted to print numbers it in a currency format. If it turned out even, I still wanted a .00 at the end. I used the previous example (ty) and diddled with it a bit more to get this.
sub format_number {
my $num = shift;
my $result;
my $formatted_num = "";
my #temp_array = ();
my $mantissa = "";
if ( $num =~ /\./ ) {
$num = sprintf("%0.02f",$num);
($num,$mantissa) = split(/\./,$num);
$formatted_num = reverse $num;
#temp_array = unpack("(A3)*" , $formatted_num);
$formatted_num = reverse (join ',', #temp_array);
$result = $formatted_num . '.'. $mantissa;
} else {
$formatted_num = reverse $num;
#temp_array = unpack("(A3)*" , $formatted_num);
$formatted_num = reverse (join ',', #temp_array);
$result = $formatted_num . '.00';
}
return $result;
}
# Example call
# ...
printf("some amount = %s\n",format_number $some_amount);
I didn't have the Number library on my default mac OS X perl, and I didn't want to mess with that version or go off installing my own perl on this machine. I guess I would have used the formatter module otherwise.
I still don't actually like the solution all that much, but it does work.
This is good for money, just keep adding lines if you handle hundreds of millions.
sub commify{
my $var = $_[0];
#print "COMMIFY got $var\n"; #DEBUG
$var =~ s/(^\d{1,3})(\d{3})(\.\d\d)$/$1,$2$3/;
$var =~ s/(^\d{1,3})(\d{3})(\d{3})(\.\d\d)$/$1,$2,$3$4/;
$var =~ s/(^\d{1,3})(\d{3})(\d{3})(\d{3})(\.\d\d)$/$1,$2,$3,$4$5/;
$var =~ s/(^\d{1,3})(\d{3})(\d{3})(\d{3})(\d{3})(\.\d\d)$/$1,$2,$3,$4,$5$6/;
#print "COMMIFY made $var\n"; #DEBUG
return $var;
}
A solution that produces a localized output:
# First part - Localization
my ( $thousands_sep, $decimal_point, $negative_sign );
BEGIN {
my ( $l );
use POSIX qw(locale_h);
$l = localeconv();
$thousands_sep = $l->{ 'thousands_sep' };
$decimal_point = $l->{ 'decimal_point' };
$negative_sign = $l->{ 'negative_sign' };
}
# Second part - Number transformation
sub readable_number {
my $val = shift;
#my $thousands_sep = ".";
#my $decimal_point = ",";
#my $negative_sign = "-";
sub _readable_int {
my $val = shift;
# a pinch of PERL magic
return scalar reverse join $thousands_sep, unpack( "(A3)*", reverse $val );
}
my ( $i, $d, $r );
$i = int( $val );
if ( $val >= 0 ) {
$r = _readable_int( $i );
} else {
$r = $negative_sign . _readable_int( -$i );
}
# If there is decimal part append it to the integer result
if ( $val != $i ) {
( undef, $d ) = ( $val =~ /(\d*)\.(\d*)/ );
$r = $r . $decimal_point . $d;
}
return $r;
}
The first part gets the symbols used in the current locale to be used on the second part.
The BEGIN block is used to calculate the sysmbols only once at the beginning.
If for some reason there is need to not use POSIX locale, one can ommit the first part and uncomment the variables on the second part to hardcode the sysmbols to be used ($thousands_sep, $thousands_sep and $thousands_sep)
I try to convert string to utf8.
#!/usr/bin/perl -w
use Encode qw(encode decode is_utf8);
$str = "\320\300\304\310\323\321 \316\320\300\312\313";
Encode::from_to($str, 'windows-1251', 'utf-8');
print "converted:\n$str\n";
And in this case I get what I need:
# ./convert.pl
converted:
РАДИУС ОРАКЛ
But if I use external variable:
#!/usr/bin/perl -w
use Encode qw(encode decode is_utf8);
$str = $ARGV[0];
Encode::from_to($str, 'windows-1251', 'utf-8');
print "converted:\n$str\n";
Nothing happens.
# ./convert.pl "\320\300\304\310\323\321 \316\320\300\312\313"
converted:
\320\300\304\310\323\321 \316\320\300\312\313
This is the dump of the first example:
SV = PV(0x1dceb78) at 0x1ded120
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x1de7970 "\320\300\304\310\323\321 \316\320\300\312\313"\0
CUR = 12
LEN = 16
And the second:
SV = PV(0x1c1db78) at 0x1c3c110
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x1c5e7e0 "\\320\\300\\304\\310\\323\\321 \\316\\320\\300\\312\\313"\0
CUR = 45
LEN = 48
I've tried this method:
#!/usr/bin/perl -w
use Devel::Peek;
$str = pack 'C*', map oct, $ARGV[0] =~ /\\(\d{3})/g;
print Dump ($str);
# ./convert.pl "\320\300\304\310\323\321 \316\320\300\312\313"
SV = PV(0x1c1db78) at 0x1c3c110
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x1c5e7e0 "\320\300\304\310\323\321\316\320\300\312\313"\0
CUR = 11
LEN = 48
But again it's not what I need. Could you help me to get the result like in the first script?
After using this
($str = shift) =~ s/\\([0-7]+)/chr oct $1/eg
as suggested by Borodin, I get this
SV = PVMG(0x13fa7f0) at 0x134d0f0
REFCNT =
FLAGS = (SMG,POK,pPOK)
IV = 0
NV = 0
PV = 0x1347970 "\320\300\304\310\323\321 \316\320\300\312\313"\0
CUR = 12
LEN = 16
MAGIC = 0x1358290
MG_VIRTUAL = &PL_vtbl_mglob
MG_TYPE = PERL_MAGIC_regex_global(g)
MG_LEN = -1
It's not clear exactly what input you're getting or where from, or what you want your output to be, but you shouldn't be encoding your data into UTF-8 for use within the program because you want to deal with characters and not encoded bytes. You should just decode it from whatever external encoding is being sent to the program and work with it like that
It sounds like the input is Windows-1251 and the output is UTF-8 (?) and I assume the backslashes are a distraction. There are no backslashes in the file or typed on the keyboard are there? So changing the base to hex for clarity, your input string is like this
"\xD0\xC0\xC4\xC8\xD3\xD1\x20\xCE\xD0\xC0\xCA\xCB"
and you want to convert it to a Perl character string, do some stuff with it, and print it to the output. If you're on a Linux machine and you want to explicitly decode it from raw input bytes, then you need to write something like this
use utf8;
use strict;
use warnings;
use feature 'say';
use open qw/ :std OUT :encoding(UTF-8) /;
use Encode qw/ decode /;
my $str = "\xD0\xC0\xC4\xC8\xD3\xD1\x20\xCE\xD0\xC0\xCA\xCB";
$str = decode('Windows-1251', $str);
say $str;
output
РАДИУС ОРАКЛ
But that's a contrived situation. The string is actually coming from an input stream, so it's better to set the encoding of the stream and forget about manual decoding. You can use binmode if you're reading from STDIN, like this
binmode STDIN, 'encoding(Windows-1251)';
and then text input from STDIN will be converted implicitly from Windows-1251-encoded bytes to a character string. Alternatively, if you're opening a file on your own handle, you can put the encoding in the open call
open my $fh, '<:encoding(Windows-1251)', $file or die $!;
and then you don't need to add a binmode either
As I said, I've assumed your output is UTF-8, and in the program above the line
use open qw/ :std OUT :encoding(UTF-8) /;
sets all output file handles to have a default of UTF-8 encoding. The :std also sets the built-in handles STDOUT and STDERR to UTF-8. If this isn't what you want and you can't figure out how to set it up as you need it then please do ask
think about this:
$ perl -le 'print length("\320\300\304\310\323\321 \316\320\300\312\313")'
12
$ perl -le 'print length($ARGV[0])' "\320\300\304\310\323\321 \316\320\300\312\313"
45
here we recieve the number of characters in given string.
pay attention that when string is inside perl script, perl interprets backslashed symbols according to their codes. but if backslashed symbols are outside perl script, the are just shell symbols and shell doesn't interpret them somehow and so you get exactly what you give.
A couple of simple methods to convert backslashes and octal digits typed in utf-8 terminal to cp1251:
$str = perl -e 'print "$ARGV[0]"' | iconv -f windows-1251;
print $str;
or
$str = pack "C*", map oct()? oct : 32, $ARGV[0] =~ / \d{3} | \s /gx;
print $str;
When I run this script in a Windows console where the active codepage is 65001 InputChar returns undef if I enter an ö (U+00F6). Does this mean that InputChar doesn't work with cp65001?
#!perl
use warnings;
use strict;
use 5.10.0;
use Devel::Peek;
use Win32::Console;
my $in = Win32::Console->new( STD_INPUT_HANDLE );
$in->Mode( ENABLE_PROCESSED_INPUT );
my $char = $in->InputChar();
Dump $char;
say "{$char}";
C:>chcp 65001
Active code page: 65001
C:>perl.pl
SV = NULL(0x0) at 0x12b6fac
REFCNT = 1
FLAGS = (PADMY)
Use of uninitialized value $char in concatenation (.) or string at ... line 21.
{}
If you look inside sub InputChar you can see it uses _ReadConsole which doesn't do unicode (i think char * isn't unicode)
It also doesn't do unicode because of the way ReadConsole function (Windows) is called, at least that is what documentation hints to me :)
update: OTOH, if I edit Win32-Console-0.10\Makefile.PL to add
DEFINE => ' -DUNICODE ',
and then recompile/reinstall Win32::Console, I can get AöBöCöDö10 into the program using the following
my $chars = ShInputChar( $in, 10 );
sub ShInputChar {
package Win32::Console;
my($self, $number) = #_;
return undef unless ref($self);
$number = 1 unless defined($number);
my $onumber = $number;
## double up or free to wrong pool, char versus wchar
$number = 2 * $number;
my $buffer = (" " x $number);
my $readed = _ReadConsole($self->{'handle'}, $buffer, $number) ;
my $err = sprintf "ErrSet \$!(%d)(%s)\n\$^E(%d)(%s)\n", $!,$!,$^E,$^E;
use Encode;
$buffer = Encode::decode('UTF-16LE', $buffer );
if ( $readed == $number or $onumber == $readed ) {
return $buffer;
}
else {
warn "wanted $number but read $readed returning buffer anyway";
return $buffer;
}
}
You should report this to the author , hes more knowledgeable about win32
I would be very cautious with libwin32 (of which Win32::Console is a part) as it was last updated over six years ago, in the early days of Windows Vista.
You may want to try Win32::Unicode::Console which has a very different API but is designed for your purpose.
I am doing a match between Chinese words, for example, "语言中心“ and a mount of web files (php, html, htm, etc).
However, somehow I get the following error:
Malformed UTF-8 character (1 byte, need 2, after start byte 0xdf) in regexp compilation at ../Final_FindOnlyNoReplace_CLE_Chinese.pl line 89, <INFILE> line 12.
Can anyone help?
Here is my code.
#!/usr/bin/env perl
use Encode qw/encode decode/;
use utf8;
use strict;
use Cwd;
use LWP::UserAgent;
my($path) = #_;
## append a trailing / if it's not there
$path .= '/' if($path !~ /\/$/);
use File::Glob ':glob';
my #all_files = bsd_glob($path."*");
for my $eachFile (#all_files) {
open(INFILE, "<$eachFile") || die ("Could not open '$eachFile'\n");
my(#inlines) = <INFILE>;
my($line, $find);
my $outkey = 1;
foreach $line (#inlines) {
$find = &find($line);
if ($find ne 'false') {
chomp($line);
print "\tline$outkey : $line\n";
}
$outkey ++;
}
}
#subroutine
sub find {
my $m = encode("utf8", decode("big5", #_));
my $html = LWP::UserAgent->new
->get($m)
->decoded_content;
my $str_chinese = '語言中心';
if ($m =~ /$str_chinese/) {
$m; ##if match, return the whole line.
}
}
You aren't searching in $html you've retrieved and decoded, but in URL instead: $m =~ /$str_chinese/, which, I guess, is not what you intend.
Also, you're comparing result of find function with exact string "false," which will never work. Change if ($find ne 'false') to if (defined($find)) and add explicit returns for success and failure to find for clarity.
Finally, you script seems to fail because you point it to directory that have some other Perl scripts amongst other files. They're most likely in UTF-8, so when your script tries to read them as big5 data, it falis on decoding. Just change your glob to cover data files only.
#!/usr/bin/env perl
use utf8;
use strictures;
use LWP::UserAgent qw();
use Path::Class::Rule qw();
use URI::file qw();
my $start_directory = q(.);
my $search_text = qr'語言中心';
my $next = Path::Class::Rule->new->name(qw(*.php *.htm*))->iter($start_directory);
my #matching_lines;
while (my $file = $next->()) {
for my $line (split /\R/, LWP::UserAgent
->new
->get(URI::file->new_abs($file))
->decoded_content
) {
push #matching_lines, $line if $line =~ $search_text;
}
}
# #matching_lines is (
# '<title>Untitled 語言中心 Document</title>',
# 'abc 語言中心 cde',
# '天天向上語言中心他'
# )