Sorting UTF-8 input - perl

I need to sort lines from a file saved as UTF-8. The lines can start with Cyrillic or Latin characters. My code works incorrectly on the Cyrillic ones.
sub sort_by_default {
    my @sorted_lines = sort {
        $a <=> $b
        ||
        fc($a) cmp fc($b)
    } @_;
}

The cmp operator used with sort can't help here; it has no notion of encodings and merely compares by codepoint, character by character, with surprises in many languages. Use Unicode::Collate.† See this post for a bit more, this post by tchrist for far more, and this perl.com article.
The other issue is of reading (decoding) input and writing (encoding) output in utf8 correctly. One way to ensure that data on standard streams is handled is via the open pragma, with which you can set "layers" so that input and output is decoded/encoded as data is read/written.
Altogether, an example:
use warnings;
use strict;
use feature 'say';
use Unicode::Collate;
use open ":std", ":encoding(UTF-8)";
my $file = ...;
open my $fh, '<', $file or die "Can't open $file: $!";
my @lines = <$fh>;
chomp @lines;
my $uc = Unicode::Collate->new();
my @sorted = $uc->sort(@lines);
say for @sorted;
The module's cmp method can be used for individual comparisons (if the data is in a complex data structure and not just a flat list of lines, for instance):
my @sorted = sort { $uc->cmp($a, $b) } @data;
where you dereference $a and $b as needed to extract what to compare from @data's elements.
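For large lists, calling cmp repeatedly can get slow; Unicode::Collate can also precompute a binary sort key per string with its getSortKey method, which plain cmp can then compare in a Schwartzian transform. A minimal sketch, assuming @lines holds already-decoded strings:
use Unicode::Collate;

my $uc = Unicode::Collate->new();

# Precompute one binary sort key per line, sort by the keys,
# then keep only the original strings
my @sorted = map  { $_->[1] }
             sort { $a->[0] cmp $b->[0] }
             map  { [ $uc->getSortKey($_), $_ ] } @lines;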
If you have utf8 data right in the source you need use utf8;, while if you receive utf8 via yet other channels (including from @ARGV) you may need to manually Encode::decode those strings.
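For instance, a minimal sketch for command-line arguments, assuming they arrive as UTF-8 encoded bytes:
use Encode qw(decode);

# @ARGV arrives as raw bytes; decode it into Perl character strings
my @args = map { decode('UTF-8', $_) } @ARGV;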
Please see the linked post (and links in it) and the documentation for more detail. See this perlmonks post for far more rounded information, and this Effective Perler article on custom sorting.
† Example: by codepoint comparison ä > b, while the accepted order in German is ä < b:
perl -MUnicode::Collate -wE'use utf8; binmode STDOUT, ":encoding(UTF-8)";
    @s = qw(ä b);
    say join " ", sort { $a cmp $b } @s;           #--> b ä
    say join " ", Unicode::Collate->new->sort(@s); #--> ä b
'
so we need to use Unicode::Collate (or a custom sort routine).

To open a file saved as UTF-8, use the appropriate layer:
open my $FH, '<:encoding(UTF-8)', 'filename' or die $!;
Don't forget to set the same layer for the output.
#! /usr/bin/perl
use warnings;
use strict;
binmode *DATA, ':encoding(UTF-8)';
binmode *STDOUT, ':encoding(UTF-8)';
print for sort <DATA>;
__DATA__
Борис
Peter
John
Владимир

The key to handling UTF-8 correctly in Perl is to make sure Perl knows that a given source or destination of information is in UTF-8. This is done differently depending on the way you get information in or out. If the UTF-8 is coming from an input file, the way to open the file is:
open( my $fh, '<:encoding(UTF-8)', "filename" ) or die "Cannot open file: $!\n";
If you are going to have UTF-8 inside the source of your script, then make sure you have:
use utf8;
At the beginning of the script.
If you are going to get UTF-8 characters from STDIN, use this at the beginning of the script:
binmode(STDIN, ':encoding(UTF-8)');
For STDOUT use:
binmode(STDOUT, ':encoding(UTF-8)');
Also, make sure you read UTF-8 vs. utf8 vs. UTF8 to know the difference between the encoding names. utf8 and UTF8 accept both valid UTF-8 and invalid UTF-8 (according to the first proposed UTF-8 standard) and will not complain about invalid codepoints. UTF-8 accepts only valid UTF-8 and rejects invalid codepoint combinations; it is a short name for utf-8-strict. You may also want to read the question How do I sanitize invalid UTF-8 in Perl?
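A minimal illustration of the difference, assuming Encode's usual handling of a UTF-8-encoded surrogate (well-formed under the lax rules, invalid under strict UTF-8):
use Encode qw(decode FB_CROAK);

# 0xED 0xA0 0x80 encodes the surrogate U+D800: allowed by the lax
# 'utf8' variant, forbidden by strict 'UTF-8'
my $bytes  = "\xED\xA0\x80";
my $lax    = decode('utf8', $bytes, FB_CROAK);            # accepted
my $strict = eval { decode('UTF-8', $bytes, FB_CROAK) };  # undef: croaks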
Finally, following @zdim's advice, you may use this at the beginning of the script:
use open ':encoding(UTF-8)';
And other variants, as described here. That will set the encoding layer for all open calls that do not specify a layer explicitly.

Related

Perl: substitute .* to §

I have to substitute multiple substrings in expressions like $fn = '1./(4.*z.^2-1)' in Perl (v5.24.1 on Win10):
$fn =~ s/.\//_/g;
$fn =~ s/.\*/§/g;
$fn =~ s/.\^/^/g;
but § is not working; I get ┬º in the result (1_(4┬ºz^2-1)). I need this for folder and file names, and it worked fine in Matlab on Win10 with fn = strrep(fn, '.*', '§').
How can I get the § in the Perl substitution result?
It works for me:
#! /usr/bin/perl
use warnings;
use strict;
use feature qw{ say };
use utf8;
use open IO => ':encoding(UTF-8)', ':std';
my $fn = '1./(4.*z.^2-1)';
s/.\//_/g,
s/.\*/§/g,
s/.\^/^/g
for $fn;
say $fn;
Output:
1_(4§z^2-1)
Note the use utf8: it tells Perl the source code itself is in UTF-8. Make sure you save the source as UTF-8, then.
The use open sets the UTF-8 encoding for standard input and output. Make sure the terminal to which you print is configured to work in UTF-8, too.

Could File::Find::Rule be patched to automatically handle filename character encoding/decoding?

Suppose I have a file with name æ (UNICODE : 0xE6, UTF8 : 0xC3 0xA6) in the current directory.
Then, I would like to use File::Find::Rule to locate it:
use feature qw(say);
use open qw( :std :utf8 );
use strict;
use utf8;
use warnings;
use File::Find::Rule;
my $fn = 'æ';
my @files = File::Find::Rule->new->name($fn)->in('.');
say $_ for @files;
The output is empty, so apparently this did not work.
If I try to encode the filename first:
use Encode;
my $fn = 'æ';
my $fn_utf8 = Encode::encode('UTF-8', $fn, Encode::FB_CROAK | Encode::LEAVE_SRC);
my @files = File::Find::Rule->new->name($fn_utf8)->in('.');
say $_ for @files;
The output is:
æ
So it found the file, but the returned filename is not decoded into a Perl string. To fix this, I can decode the result, replacing the last line with:
say Encode::decode('UTF-8', $_, Encode::FB_CROAK) for @files;
The question is whether both the encoding and the decoding could or should have been done automatically by File::Find::Rule, so that I could have used my original program and not had to worry about encoding and decoding at all.
(For example, could File::Find::Rule have used I18N::Langinfo to determine that the current locale's codeset is UTF-8?)
Yeah, I wish. If there were a major Perl project I'd work on, this would be it.
The issue is that there can be badly-encoded file names, including file names encoded with a different encoding than expected. That means the first thing needed is a way of round-tripping badly-encoded file names through a decode-encode cycle. I think Python uses lone surrogate code points to represent the bad bytes.
You would need a pragma to ensure backwards compatibility.
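Lacking that, a workable compromise is a small helper that decodes names which are valid UTF-8 and passes anything else through as raw bytes; a sketch (the helper name decode_filename is hypothetical, not part of any module):
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: return the decoded name if it is valid UTF-8,
# otherwise fall back to the original bytes unchanged
sub decode_filename {
    my ($bytes) = @_;
    my $str = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return defined $str ? $str : $bytes;
}

my @files = map { decode_filename($_) }
            File::Find::Rule->new->name($fn_utf8)->in('.');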

Cannot write UTF-16LE encoded CSV file with Text::CSV_XS Perl module

I want to write a CSV file encoded in UTF-16LE.
However, the output in the file gets messed up. There are strange chinese looking letters: ਍挀攀氀氀㄀⸀㄀㬀挀攀氀氀㄀⸀㈀㬀ഀ.
This looks like off-by-one-byte problem mentioned here: Creating UTF-16 newline characters in Python for Windows Notepad
Other threads about Perl and Text::CSV_XS didn't help.
This is how I try it:
#!perl
use strict;
use warnings;
use utf8;
use Text::CSV_XS;
binmode STDOUT, ":utf8";
my $csv = Text::CSV_XS->new({
binary => 1,
sep_char => ";",
quote_char => undef,
eol => $/,
});
open my $in, '<:encoding(UTF-16LE)', 'in.csv' or die "in.csv: $!";
open my $out, '>:encoding(UTF-16LE)', 'out.csv' or die "out.csv: $!";
while (my $row = $csv->getline($in)) {
$_ =~ s/ä/æ/ for @$row; # something will be done to the data...
$csv->print($out, $row);
}
close $in;
close $out;
in.csv contains some test data and it is encoded in UTF-16LE:
header1;header2;
cell1.1;cell1.2;
äöü2.1;ab"c2.2;
The result looks like this:
header1;header2;਍挀攀氀氀㄀⸀㄀㬀挀攀氀氀㄀⸀㈀㬀ഀ
æöü2.1;abc2.2;਍
It is not an option to switch to UTF-8 as output format (which works fine btw).
So, how do I write valid UTF-16LE encoded CSV files using Text::CSV_XS?
Perl adds :crlf by default on Windows. It's added first, before your :encoding is added.
That means LF⇔CRLF conversion will be performed before decoding on reads, and after encoding on writes. This is backwards.
It ends up working with UTF-8 despite being done backwards because all of the following conditions are met:
The UTF-8 encoding of LF is the same as its Code Point (0A).
The UTF-8 encoding of CR is the same as its Code Point (0D).
A 0A byte always refers to LF no matter where it appears in the file.
A 0D byte always refers to CR no matter where it appears in the file.
None of those conditions holds true for UTF-16LE.
Fix:
open(my $fh_in,  '<:raw:encoding(UTF-16LE):crlf', $qfn_in)  or die "$qfn_in: $!";
open(my $fh_out, '>:raw:encoding(UTF-16LE):crlf', $qfn_out) or die "$qfn_out: $!";
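To verify which layers a handle actually ends up with, PerlIO::get_layers can be used; the exact output varies by platform and Perl version, but on Windows the implicit :crlf shows up:
open my $fh, '<:raw:encoding(UTF-16LE):crlf', 'in.csv' or die $!;
print join(' ', PerlIO::get_layers($fh)), "\n";
# something like: unix encoding(UTF-16LE) utf8 crlf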

How to detect UTF8 with BOM encoding in Perl

I have a simple Perl script that compares two files.
I write the results to different files with a UTF-8 BOM.
To mark a result file as UTF-8 with BOM, I print chr(65279) at the beginning of the file. Sometimes the input text already contains a BOM character at the beginning, and then my script prints one more.
The question is: how can I work around this so the BOM character is not printed twice?
Here is my Perl code:
use strict;
use warnings;
use List::Compare;
use Cwd 'abs_path';
use open ':encoding(utf8)';
use open IO => ':encoding(utf8)';
open F, "<$ARGV[0]" or die $!;
open S, "<$ARGV[1]" or die $!;
my @a = <F>;
my @b = <S>;
close F;
close S;
my $lc = List::Compare->new(\@a, \@b);
my @intersection = $lc->get_intersection;
my @missing = $lc->get_unique;
my @extra = $lc->get_complement;
open EXTRA, ">".$ARGV[2]."file_extra.txt" or die("Unable to open the file");
open MISSING, ">".$ARGV[2]."file_missing.txt" or die("Unable to open the file");
open SUBTRACTED, ">".$ARGV[2]."file_subtr.txt" or die("Unable to open the file");
#Turn on UTF-8 BOM support
print EXTRA chr(65279);
print MISSING chr(65279);
print SUBTRACTED chr(65279);
print MISSING @missing;
print EXTRA @extra;
print SUBTRACTED @intersection;
close MISSING;
close EXTRA;
close SUBTRACTED;
Strip it while reading the file content (in your example, apply s/^\x{FEFF}// to $a[0] and $b[0]) and then, if you really need it, add it in front of the output when you print the results; but better yet, don't print it back at all, as it is useless for UTF-8.
If you have double BOM, this is probably because one BOM comes from your input. So you should clean up your input before processing it:
s/^\x{FEFF}// for $a[0], $b[0];
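Putting it together, a minimal sketch that strips an input BOM and writes exactly one on output (file names are placeholders):
use open ':encoding(UTF-8)';

open my $in, '<', $ARGV[0] or die "Unable to open the file: $!";
my @a = <$in>;
$a[0] =~ s/^\x{FEFF}// if @a;   # drop an input BOM, if any

open my $out, '>', 'result.txt' or die "Unable to open the file: $!";
print {$out} chr(65279);        # emit exactly one BOM
print {$out} @a;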

How do I read UTF-8 with diamond operator (<>)?

I want to read UTF-8 input in Perl, no matter if it comes from the standard input or from a file, using the diamond operator: while(<>){...}.
So my script should be callable in these two ways, as usual, giving the same output:
./script.pl utf8.txt
cat utf8.txt | ./script.pl
But the outputs differ! Only the second call (using cat) seems to work as designed, reading UTF-8 properly. Here is the script:
#!/usr/bin/perl -w
binmode STDIN, ':utf8';
binmode STDOUT, ':utf8';
while(<>){
my @chars = split //, $_;
print "$_\n" foreach(@chars);
}
How can I make it read UTF-8 correctly in both cases? I would like to keep using the diamond operator <> for reading, if possible.
EDIT:
I realized I should probably describe the different outputs. My input file contains this sequence: a\xCA\xA7b. The method with cat correctly outputs:
a
\xCA\xA7
b
But the other method gives me this:
a
\xC3\x8A
\xC2\xA7
b
Try to use the pragma open instead:
use strict;
use warnings;
use open qw(:std :utf8);
while(<>){
my @chars = split //, $_;
print "$_" foreach(@chars);
}
You need to do this because the <> operator is magical. As you know, it reads from STDIN or from the files in @ARGV. Reading from STDIN causes no problem, as STDIN is already open and thus binmode works well on it. The problem is when reading from the files in @ARGV: when your script starts and calls binmode, those files are not open yet. binmode STDIN sets STDIN to UTF-8, but that IO channel is not used when @ARGV has files. In that case the <> operator opens a new file handle for each file in @ARGV, and those handles never get the UTF-8 layer. By using the open pragma you force each newly opened handle to be in UTF-8.
Your script works if you do this:
#!/usr/bin/perl -w
binmode STDOUT, ':utf8';
while(<>){
binmode ARGV, ':utf8';
my @chars = split //, $_;
print "$_\n" foreach(@chars);
}
The magic filehandle that <> reads from is called *ARGV, and it is
opened when you call readline.
But really, I am a fan of explicitly using Encode::decode and
Encode::encode when appropriate.
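A minimal sketch of that explicit style applied to this task, assuming the input is UTF-8 bytes:
use Encode qw(decode encode);

while (my $line = <>) {
    my $text  = decode('UTF-8', $line);        # bytes -> characters
    my @chars = split //, $text;
    print encode('UTF-8', "$_\n") for @chars;  # characters -> bytes
}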
You can switch on UTF8 by default with the -C flag:
perl -CSD -ne 'print join("\n",split //);' utf8.txt
The switch -CSD turns on UTF-8 unconditionally; if you use simply -C it will turn on UTF-8 only if the relevant environment variables (LC_ALL, LC_CTYPE and LANG) indicate so. See perlrun for details.
This is not recommended if you don't invoke perl directly (in particular, it might not work reliably if you pass options to perl from the shebang line). See the other answers in that case.
If you put a call to binmode inside of the while loop, then it will switch the handle to utf8 mode AFTER the first line is read in. That is probably not what you want to do.
Something like the following might work better:
#!/usr/bin/env perl -w
binmode STDOUT, ':utf8';
eof() ? exit : binmode ARGV, ':utf8';
while( <> ) {
my @chars = split //, $_;
print "$_\n" foreach(@chars);
} continue {
binmode ARGV, ':utf8' if eof && !eof();
}
The call to eof() with parens is magical, as it checks for end of file on the pseudo-filehandle used by <>. It will, if necessary, open the next handle that needs to be read, which typically has the effect of making *ARGV valid, but without reading anything out of it. This allows us to binmode the first file that's read from, before anything is read from it.
Later, eof (without parens) is used; this checks the last handle that was read from for end of file. It will be true after we process the last line of each file from the command line (or when stdin reaches its end).
Obviously, if we've just processed the last line of one file, calling eof() (with parens) opens the next file (if there is one), makes *ARGV valid (if it can), and tests for end of file on that next file. If that next file is present, and isn't at end of file, then we can safely use binmode on ARGV.