Perl: substitute .* with §

I have to substitute multiple substrings in expressions like $fn = '1./(4.*z.^2-1)' in Perl (v5.24.1 on Win10):
$fn =~ s/.\//_/g;
$fn =~ s/.\*/§/g;
$fn =~ s/.\^/^/g;
but the § is not working; I get ┬º in the result (1_(4┬ºz^2-1)). I need this for folder and file names, and it worked fine in Matlab on Win10 with fn = strrep(fn, '.*', '§').
How can I get the § in the Perl substitution result?

It works for me:
#! /usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use utf8;
use open IO => ':encoding(UTF-8)', ':std';
my $fn = '1./(4.*z.^2-1)';
s/.\//_/g,
s/.\*/§/g,
s/.\^/^/g
for $fn;
say $fn;
Output:
1_(4§z^2-1)
Note the use utf8: it tells Perl the source code itself is in UTF-8, so make sure you save the source as UTF-8.
The use open sets the UTF-8 encoding for the standard streams. Make sure the terminal you print to is configured to work in UTF-8, too.
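If you'd rather not depend on how the source file is saved at all, the § can also be written as a character escape. A minimal sketch of that variant (note the dots are escaped here, which the original patterns leave unescaped; `.` unescaped matches any character):

```perl
use strict;
use warnings;

my $fn = '1./(4.*z.^2-1)';
$fn =~ s/\.\//_/g;        # "./" -> "_"  (dot escaped, unlike the original)
$fn =~ s/\.\*/\x{A7}/g;   # ".*" -> § written as \x{A7}, no utf8 pragma needed
$fn =~ s/\.\^/^/g;        # ".^" -> "^"
print "$fn\n";            # 1_(4§z^2-1)
```

Because § is spelled \x{A7}, the script behaves the same regardless of the encoding the file is saved in; only the output layer still has to match the terminal.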

Related

Perl - Correcting char encoding on command line input

I am writing a program to fix mangled encoding, specifically latin1(iso-8859-1) to greek (iso-8859-7).
I created a function that works as intended; a variable with badly encoded text is converted properly.
When I try to convert $ARGV[0] with this function it doesn't seem to correctly interpret the input.
Here is a test program to demonstrate the issue:
#!/usr/bin/env perl
use 5.018;
use utf8;
use strict;
use open qw(:std :encoding(utf-8));
use Encode qw(encode decode);
sub unmangle {
    my $input = shift;
    print $input . "\n";
    print decode('iso-8859-7', encode('latin1', $input)) . "\n";
}
my $test = "ÁöéÝñùìá"; # should be Αφιέρωμα
say "fix variable:";
unmangle($test);
say "\nfix stdin:";
unmangle($ARGV[0]);
When I run this program with the same input as my $test variable, the results are not the same (as I expected they would be):
$ ./fix_bad_encoding.pl "ÁöéÝñùìá"
fix variable:
ÁöéÝñùìá
Αφιέρωμα
fix stdin:
ÃöéÃñùìá
ΓΓΆΓ©ΓñùìÑ
How do I get $ARGV[0] to behave the way the $test variable does?
You decoded the source. You decoded STDIN (which you don't use), STDOUT and STDERR. But not @ARGV.
$_ = decode("UTF-8", $_) for @ARGV;
Alternatively, the -CA command-line switch tells Perl the arguments are UTF-8 encoded. Or you can decode the argument from UTF-8 yourself:
unmangle(decode('UTF-8', $ARGV[0]));
Also, it's not "stdin" (that would be reading from *STDIN), but "argument".
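A minimal sketch of the fix, with @ARGV contents simulated via encode since a test script has no real command line:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Simulate what the OS delivers: arguments arrive as raw UTF-8 bytes.
@ARGV = (encode('UTF-8', "\x{391}\x{3C6}\x{3B9}"));   # "Αφι", 3 characters

print length($ARGV[0]), "\n";   # 6 -- still a byte count

# Decode every argument once, up front:
$_ = decode('UTF-8', $_) for @ARGV;

print length($ARGV[0]), "\n";   # 3 -- a character count now
```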

Sorting UTF-8 input

I need to sort lines from a file saved as UTF-8. The lines can start with Cyrillic or Latin characters. My code handles the Cyrillic ones incorrectly.
sub sort_by_default {
    my @sorted_lines = sort {
        $a <=> $b
            ||
        fc($a) cmp fc($b)
    } @_;
}
The cmp operator used with sort can't help with this; it has no notion of encodings and merely compares codepoint by codepoint, with surprises in many languages. Use Unicode::Collate.† See this post for a bit more, and for far more see this post by tchrist and this perl.com article.
The other issue is of reading (decoding) input and writing (encoding) output in utf8 correctly. One way to ensure that data on standard streams is handled is via the open pragma, with which you can set "layers" so that input and output is decoded/encoded as data is read/written.
Altogether, an example
use warnings;
use strict;
use feature 'say';
use Unicode::Collate;
use open ":std", ":encoding(UTF-8)";
my $file = ...;
open my $fh, '<', $file or die "Can't open $file: $!";
my @lines = <$fh>;
chomp @lines;
my $uc = Unicode::Collate->new();
my @sorted = $uc->sort(@lines);
say for @sorted;
The module's cmp method can be used for individual comparisons (if data
is in a complex data structure and not just a flat list of lines, for instance)
my @sorted = sort { $uc->cmp($a, $b) } @data;
adjusting the block so it extracts from $a and $b whatever needs to be compared in @data.
If you have UTF-8 data right in the source you need use utf8, while if you receive UTF-8 via yet other channels (@ARGV included) you may need to manually Encode::decode those strings.
Please see the linked post (and links in it) and documentation for more detail. See this perlmonks post for far more rounded information. See this Effective Perler article on custom sorting.
† Example: by codepoint comparison ä > b while the accepted order in German is ä < b
perl -MUnicode::Collate -wE'use utf8; binmode STDOUT, ":encoding(UTF-8)";
@s = qw(ä b);
say join " ", sort { $a cmp $b } @s;           #--> b ä
say join " ", Unicode::Collate->new->sort(@s); #--> ä b
'
so we need to use Unicode::Collate (or a custom sort routine).
To open a file saved as UTF-8, use the appropriate layer:
open my $FH, '<:encoding(UTF-8)', 'filename' or die $!;
Don't forget to set the same layer for the output.
#! /usr/bin/perl
use warnings;
use strict;
binmode *DATA, ':encoding(UTF-8)';
binmode *STDOUT, ':encoding(UTF-8)';
print for sort <DATA>;
__DATA__
Борис
Peter
John
Владимир
The key to handling UTF-8 correctly in Perl is to make sure Perl knows that a given source or destination of information is in UTF-8. This is done differently depending on how the information gets in or out. If the UTF-8 is coming from an input file, the way to open the file is:
open( my $fh, '<:encoding(UTF-8)', "filename" ) or die "Cannot open file: $!\n";
If you are going to have UTF-8 inside the source of your script, then make sure you have:
use utf8;
At the beginning of the script.
If you are going to get UTF-8 characters from STDIN, use this at the beginning of the script:
binmode(STDIN, ':encoding(UTF-8)');
For STDOUT use:
binmode(STDOUT, ':encoding(UTF-8)');
Also, make sure you read UTF-8 vs. utf8 vs. UTF8 to know the difference between the encoding names. utf8 or UTF8 will accept both valid and non-valid UTF-8 (according to the first proposed UTF-8 standard) and will not complain about non-valid codepoints. UTF-8 will allow valid UTF-8 but reject non-valid codepoint combinations; it is a short name for utf-8-strict. You may also read the question How do I sanitize invalid UTF-8 in Perl?
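The difference is easy to demonstrate with a byte sequence that is structurally UTF-8 but encodes a forbidden codepoint. Only the strict side is asserted here, since the lax utf8 decoder's treatment of such input has varied across Encode versions:

```perl
use strict;
use warnings;
use Encode qw(decode);

# "\xED\xA0\x80" is the UTF-8-style encoding of U+D800, a surrogate,
# which strict UTF-8 (utf-8-strict) forbids.
my $bad    = "\xED\xA0\x80";
my $strict = eval { decode('UTF-8', $bad, Encode::FB_CROAK) };

print defined $strict ? "accepted\n" : "rejected by UTF-8 (strict)\n";
```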
Finally, following @zdim's advice, you may use at the beginning of the script:
use open ':encoding(UTF-8)';
And other variants as described here. That will set the encoding layer for all open instructions that do not specify a layer explicitly.
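A small round trip shows the pragma in action: a handle opened with a bare open picks up the default layer, while an explicit :raw layer still overrides it (File::Temp is used here only to get a scratch file):

```perl
use strict;
use warnings;
use open ':encoding(UTF-8)';     # default layer for open() in this scope
use File::Temp qw(tempfile);

my (undef, $tmp) = tempfile(UNLINK => 1);

open my $out, '>', $tmp or die $!;   # inherits the UTF-8 layer
print $out "\x{3A9}";                # GREEK CAPITAL LETTER OMEGA
close $out;

open my $in, '<:raw', $tmp or die $!;  # explicit layer overrides the default
my $bytes = do { local $/; <$in> };
close $in;

printf "%vX\n", $bytes;   # CE.A9 -- the two UTF-8 bytes of U+03A9
```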

Could File::Find::Rule be patched to automatically handle filename character encoding/decoding?

Suppose I have a file named æ (U+00E6; UTF-8: 0xC3 0xA6) in the current directory.
Then, I would like to use File::Find::Rule to locate it:
use feature qw(say);
use open qw( :std :utf8 );
use strict;
use utf8;
use warnings;
use File::Find::Rule;
my $fn = 'æ';
my @files = File::Find::Rule->new->name($fn)->in('.');
say $_ for @files;
The output is empty, so apparently this did not work.
If I try to encode the filename first:
use Encode;
my $fn = 'æ';
my $fn_utf8 = Encode::encode('UTF-8', $fn, Encode::FB_CROAK | Encode::LEAVE_SRC);
my @files = File::Find::Rule->new->name($fn_utf8)->in('.');
say $_ for @files;
The output is:
æ
So it found the file, but the returned filename is not decoded into a Perl string. To fix this, I can decode the result, replacing the last line with:
say Encode::decode('UTF-8', $_, Encode::FB_CROAK) for @files;
The question is if both the encoding and decoding could/should have been done automatically by File::Find::Rule so I could have used my original program and not have had to worry about encoding and decoding at all?
(For example, could File::Find::Rule have used I18N::Langinfo to determine that the current locale's codeset is UTF-8?)
Yeah, I wish. If there was one major Perl project I'd work on, this would be it.
The issue is that there can be badly-encoded file names, including file names encoded differently than expected. That means the first thing needed is a way of round-tripping badly-encoded file names through a decode-encode cycle. I believe Python uses surrogate codepoints to represent the bad bytes.
You would need a pragma to ensure backwards compatibility.

perl search & replace script for all files in a directory

I have a directory with nearly 1,200 files. I need to go through each file in a Perl script and search and replace any occurrences of 66 strings, so for each file I need to run all 66 s&r's. My replacement string is in Thai, so I cannot use the shell; it must be a .pl file or similar so that I can use utf8. I am just not familiar with how to open all files in a directory one by one to perform actions on them. Here is a sample of my s&r:
s/psa0*(\d+)/เพลงสดุดี\1/g;
Thanks for any help.
use utf8;
use strict;
use warnings;
use File::Glob qw( bsd_glob );
@ARGV = map bsd_glob($_), @ARGV;
while (<>) {
    s/psa0*(?=\d)/เพลงสดุดี/g;
    print;
}
perl -i.bak script.pl *
I used File::Glob's bsd_glob since glob won't handle spaces "correctly". They are actually the same function, but the function behaves differently based on how it's called.
By the way, using \1 in the replacement expression (i.e. outside a regular expression) makes no sense. \1 is a regex pattern that means "match what the first capture captured". So
s/psa0*(\d+)/เพลงสดุดี\1/g;
should be
s/psa0*(\d+)/เพลงสดุดี$1/g;
The following is a faster alternative:
s/psa0*(?=\d)/เพลงสดุดี/g;
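For instance (using an invented ASCII label in place of the Thai text, purely for illustration):

```perl
use strict;
use warnings;

# $1 in the replacement interpolates the first capture; \1 there would be
# passed through as a literal, not treated as a backreference.
my $s = 'psa0042';
(my $t = $s) =~ s/psa0*(\d+)/PSALM $1/;
print "$t\n";   # PSALM 42
```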
See opendir/readdir/closedir for functions that can iterate through all the filenames in a directory (much like you would use open/readline/close to iterate through all the lines in a file).
Also see the glob function, which returns a list of filenames that match some pattern.
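A sketch of the opendir/readdir approach, using a throwaway directory so the example is self-contained (the file names are invented):

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# opendir/readdir/closedir walk a directory's entries much like
# open/readline/close walk a file's lines.
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(a.html b.html notes.txt)) {
    open my $fh, '>', "$dir/$name" or die $!;
    close $fh;
}

opendir my $dh, $dir or die "Can't open $dir: $!";
my @html = sort grep { /\.html\z/ } readdir $dh;
closedir $dh;

print "@html\n";   # a.html b.html
```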
Just in case someone could use it in the future. This is what I actually did.
use warnings;
use strict;
use utf8;
my @files = glob("*.html");
foreach $a (@files) {
    open IN, "$a" or die $!;
    open OUT, ">$a-" or die $!;
    binmode(IN, ":utf8");
    binmode(OUT, ":utf8");
    select(OUT);
    foreach (<IN>) {
        s/gen0*(\d+)/ปฐมกาล $1/;
        s/exo0*(\d+)/อพยพ $1/;
        s/lev0*(\d+)/เลวีนิติ $1/;
        s/num0*(\d+)/กันดารวิถี $1/;
        ...etc...
        print "$_";
    }
    close IN;
    close OUT;
}

Perl code Save ANSI encoding format xml file into UTF-8 encoding

I need to change the encoding of a file from ANSI to UTF-8. I have tried some methods, but they didn't work. Here is the code I have so far:
use utf8;
use File::Slurp;
$File_Name="c:\\test.xml";
$file_con=read_file($File_Name);
open (OUT, ">c:\\b.xml");
binmode(OUT, ":utf8");
print OUT $file_con;
close OUT;
Assuming you have a valid XML file, this would do it:
use XML::LibXML qw( );
my $doc = XML::LibXML->new()->parse_file('text.xml');
$doc->setEncoding('UTF-8');
open(my $fh, '>:raw', 'test.utf8.xml')
or die("Can't create test.utf8.xml: $!\n");
print($fh $doc->toString());
This handles both converting the encoding and adjusting the <?xml?> declaration. The other approaches below leave the wrong encoding in the <?xml?> declaration.
If you just want to make a filter, try this:
perl -MEncode -pwe 's/(.*)/encode("utf8", $1)/e;'
For example:
type c:\text.xml | perl -MEncode -pwe "s/(.*)/encode('utf8', $1)/e;" > c:\b.xml
Or modifying your code:
use File::Slurp;
use Encode;
$File_Name="c:\\test.xml";
$file_con=read_file($File_Name);
open (OUT, ">c:\\b.xml");
print OUT encode('utf8', $file_con);
close OUT;
Use Text::Iconv:
use Text::Iconv;
$converter = Text::Iconv->new("cp1252", "utf-8");
$converted = $converter->convert($file_con);
(assuming you are using codepage 1252 as your default codepage).
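If installing Text::Iconv is not an option, the core Encode module can do the same cp1252-to-UTF-8 conversion (the sample string here is invented):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Decode the cp1252 bytes into characters, then encode those as UTF-8.
my $cp1252 = "caf\xE9";                               # "café" in cp1252
my $utf8   = encode('UTF-8', decode('cp1252', $cp1252));

printf "%vX\n", $utf8;   # 63.61.66.C3.A9 -- "caf" plus the UTF-8 bytes of é
```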