How to detect UTF-8 with BOM encoding in Perl

I have a simple Perl script that compares two files. I write the results to different files with UTF-8 BOM encoding.
To save the text with a BOM, I print chr(65279) at the beginning of each result file. Sometimes the input text already contains a BOM character at the beginning, and then my script prints one more.
The question is: how can I avoid printing this BOM character twice?
Here is my Perl code:
use strict;
use warnings;
use List::Compare;
use Cwd 'abs_path';
use open ':encoding(utf8)';
use open IO => ':encoding(utf8)';

open F, "<$ARGV[0]" or die $!;
open S, "<$ARGV[1]" or die $!;
my @a = <F>;
my @b = <S>;
close F;
close S;

my $lc = List::Compare->new(\@a, \@b);
my @intersection = $lc->get_intersection;
my @missing      = $lc->get_unique;
my @extra        = $lc->get_complement;

open EXTRA,      ">".$ARGV[2]."file_extra.txt"   or die("Unable to open the file");
open MISSING,    ">".$ARGV[2]."file_missing.txt" or die("Unable to open the file");
open SUBTRACTED, ">".$ARGV[2]."file_subtr.txt"   or die("Unable to open the file");

# Turn on UTF-8 BOM support
print EXTRA chr(65279);
print MISSING chr(65279);
print SUBTRACTED chr(65279);

print MISSING @missing;
print EXTRA @extra;
print SUBTRACTED @intersection;

close MISSING;
close EXTRA;
close SUBTRACTED;

Strip it while reading the file content (in your example, apply s/^\x{FEFF}// to $a[0] and $b[0]). Then either add it back in front of the output when you print the results, if you really need it, or better yet don't print it back at all, as a BOM is useless for UTF-8.

If you have double BOM, this is probably because one BOM comes from your input. So you should clean up your input before processing it:
s/^\x{FEFF}// for $a[0], $b[0];
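Putting both answers together, a minimal sketch using the question's own variables and filehandles (the cleanup line and the single BOM per output file are the only changes):

s/^\x{FEFF}// for $a[0], $b[0];    # drop a leading BOM from either input, if present

# ... comparison as before ...

for my $fh (\*EXTRA, \*MISSING, \*SUBTRACTED) {
    print {$fh} "\x{FEFF}";        # write exactly one BOM per output file
}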

Sorting UTF-8 input

I need to sort lines from a file saved as UTF-8. The lines can start with Cyrillic or Latin characters. My code gets the Cyrillic ones wrong.
sub sort_by_default {
    my @sorted_lines = sort {
        $a <=> $b
            ||
        fc($a) cmp fc($b)
    } @_;
}
The cmp used with sort can't help with this; it has no notion of encodings and merely compares by codepoint, character by character, with surprises in many languages. Use Unicode::Collate.† See this post for a bit more, this post by tchrist for far more, and this perl.com article.
The other issue is of reading (decoding) input and writing (encoding) output in utf8 correctly. One way to ensure that data on standard streams is handled is via the open pragma, with which you can set "layers" so that input and output is decoded/encoded as data is read/written.
Altogether, an example:

use warnings;
use strict;
use feature 'say';
use Unicode::Collate;
use open ":std", ":encoding(UTF-8)";

my $file = ...;

open my $fh, '<', $file or die "Can't open $file: $!";
my @lines = <$fh>;
chomp @lines;

my $uc = Unicode::Collate->new();
my @sorted = $uc->sort(@lines);

say for @sorted;
The module's cmp method can be used for individual comparisons (if data is in a complex data structure and not just a flat list of lines, for instance):

my @sorted = sort { $uc->cmp($a, $b) } @data;

where $a and $b need to be unpacked suitably so as to extract from @data what to compare.
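For instance, a hypothetical sketch (the record layout here is assumed, not from the question) sorting an array of hashes by a name field:

my @data = (
    { id => 1, name => 'Ölberg' },
    { id => 2, name => 'Oslo'   },
);
my @sorted = sort { $uc->cmp( $a->{name}, $b->{name} ) } @data;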
If you have UTF-8 data right in the source you need use utf8, while if you receive UTF-8 via yet other channels (@ARGV included) you may need to manually Encode::decode those strings.
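For example, a minimal sketch of decoding command-line arguments, assuming they arrive as UTF-8 bytes from the terminal:

use Encode qw(decode);

# @ARGV arrives as raw bytes; decode each argument into Perl characters
@ARGV = map { decode('UTF-8', $_) } @ARGV;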
Please see the linked post (and links in it) and documentation for more detail. See this perlmonks post for far more rounded information. See this Effective Perler article on custom sorting.
† Example: by codepoint comparison ä > b, while the accepted order in German is ä < b:

perl -MUnicode::Collate -wE'use utf8; binmode STDOUT, ":encoding(UTF-8)";
    @s = qw(ä b);
    say join " ", sort { $a cmp $b } @s;              #--> b ä
    say join " ", Unicode::Collate->new->sort(@s);    #--> ä b
'
so we need to use Unicode::Collate (or a custom sort routine).
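One such custom routine (a sketch, not from the answer above; it assumes the lines are already decoded into @lines) precomputes binary sort keys with Unicode::Collate's getSortKey, so each line is analyzed only once:

use Unicode::Collate;

my $uc = Unicode::Collate->new;

# Schwartzian transform: compute each line's collation key once,
# then compare the keys with plain cmp (valid on getSortKey output)
my @sorted = map  { $_->[1] }
             sort { $a->[0] cmp $b->[0] }
             map  { [ $uc->getSortKey($_), $_ ] } @lines;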
To open a file saved as UTF-8, use the appropriate layer:
open my $FH, '<:encoding(UTF-8)', 'filename' or die $!;
Don't forget to set the same layer for the output.
#! /usr/bin/perl
use warnings;
use strict;
binmode *DATA, ':encoding(UTF-8)';
binmode *STDOUT, ':encoding(UTF-8)';
print for sort <DATA>;
__DATA__
Борис
Peter
John
Владимир
The key to handling UTF-8 correctly in Perl is to make sure that Perl knows whether a given source or destination of information is in UTF-8. This is done differently depending on how you get data in or out. If the UTF-8 text is coming from an input file, the way to open the file is:
open( my $fh, '<:encoding(UTF-8)', "filename" ) or die "Cannot open file: $!\n";
If you are going to have UTF-8 inside the source of your script, then make sure you have:
use utf8;
at the beginning of the script.
If you are going to get UTF-8 characters from STDIN, use this at the beginning of the script:
binmode(STDIN, ':encoding(UTF-8)');
For STDOUT use:
binmode(STDOUT, ':encoding(UTF-8)');
Also, make sure you read UTF-8 vs. utf8 vs. UTF8 to know the difference between the encoding names. utf8 (or UTF8) will accept valid UTF-8 and also non-valid UTF-8 (according to the first proposed UTF-8 standard) and will not complain about invalid code points. UTF-8 will accept valid UTF-8 but reject invalid code point combinations; it is a short name for utf-8-strict. You may also want to read the question "How do I sanitize invalid UTF-8 in Perl?".
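A small sketch of the difference (the byte string is an assumed example: it is the UTF-8-style encoding of the surrogate U+D800, which strict UTF-8 rejects and, by default, replaces with U+FFFD):

use Encode qw(decode);

my $bytes = "\xED\xA0\x80";             # encodes U+D800, a surrogate

my $lax    = decode('utf8',  $bytes);   # lax decoder accepts it as-is
my $strict = decode('UTF-8', $bytes);   # strict decoder substitutes U+FFFD

printf "lax: U+%04X  strict: U+%04X\n", ord($lax), ord($strict);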
Finally, following @zdim's advice, you may put at the beginning of the script:
use open ':encoding(UTF-8)';
and other variants, as described here. That will set the encoding layer for all open calls that do not specify a layer explicitly.
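For reference, some of those variants (all documented forms of the open pragma):

use open ':encoding(UTF-8)';            # all handles opened in this scope
use open ':std', ':encoding(UTF-8)';    # ...and also STDIN/STDOUT/STDERR
use open IO => ':encoding(UTF-8)';      # input and output handles explicitly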

perl output - failing in printing utf8 text files correctly

So I have UTF-8 text files which I want to read in, put the lines into an array, and print out. The output, however, doesn't print the characters correctly; for example, an output line looks like the following:
"arnſtein gehört gräflichen "
So I tried testing the script with one line pasted directly into the Perl script, without reading it from the file, and there the output is perfectly fine. I checked the files, which are in UTF-8 Unicode. Still, the files must be causing the output problem (?).
Because the script is too long, I have cut it down to the relevant part
(it goes to the directory, opens files, passes the input to the function &align, analyses it, adds it to an array, and prints the array):
#!/usr/bin/perl -w
use strict;
use utf8;

binmode(STDIN,  ":utf8");
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");

# opens directory
# opens file from directory
if (-d "$dir/$first") {
    opendir(UDIR, "$dir/$first") or die "could not open: $!";
    foreach my $t (readdir(UDIR)) {
        next if $t eq ".";
        next if $t eq "..";
        open(GT, "$dir/$first/$t") or die "Could not open GT, $!";
        my $gt = <GT>;
        chomp $gt;

        # directly pasted line in Perl - creates correct output
        &align("det man die Profeſſores der Philoſophie re- ");

        # line from file - output not correct
        # &align($gt);

        close GT;
        next;
    }
    closedir UDIR;
}
Any idea?
You told Perl that your source code was UTF-8, and that STDIN, STDOUT, & STDERR are UTF-8, but you didn't say that the file you're reading contains UTF-8.
open(GT,"<:utf8", "$dir/$first/$t") or die "Could not open GT, $!";
Without that, Perl assumes the file is encoded in ISO-8859-1, since that's Perl's default charset if you don't specify a different one. It helpfully transcodes those ISO-8859-1 characters to UTF-8 for output, since you've told it that STDOUT uses UTF-8. Since the file was actually UTF-8, not ISO-8859-1, you get incorrect output.
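As a side note (not part of the answer above): the :utf8 layer does not validate the input, so a stricter variant of the same fix would be:

# :encoding(UTF-8) validates the byte stream instead of trusting it blindly
open(GT, "<:encoding(UTF-8)", "$dir/$first/$t") or die "Could not open GT, $!";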

Cannot write UTF-16LE encoded CSV file with Text::CSV_XS Perl module

I want to write a CSV file encoded in UTF-16LE.
However, the output in the file gets messed up; there are strange Chinese-looking characters: ਍挀攀氀氀㄀⸀㄀㬀挀攀氀氀㄀⸀㈀㬀ഀ.
This looks like the off-by-one-byte problem mentioned here: Creating UTF-16 newline characters in Python for Windows Notepad
Other threads about Perl and Text::CSV_XS didn't help.
This is how I try it:
#!perl
use strict;
use warnings;
use utf8;
use Text::CSV_XS;

binmode STDOUT, ":utf8";

my $csv = Text::CSV_XS->new({
    binary     => 1,
    sep_char   => ";",
    quote_char => undef,
    eol        => $/,
});

open my $in,  '<:encoding(UTF-16LE)', 'in.csv'  or die "in.csv: $!";
open my $out, '>:encoding(UTF-16LE)', 'out.csv' or die "out.csv: $!";

while (my $row = $csv->getline($in)) {
    $_ =~ s/ä/æ/ for @$row;    # something will be done to the data...
    $csv->print($out, $row);
}

close $in;
close $out;
in.csv contains some test data and it is encoded in UTF-16LE:
header1;header2;
cell1.1;cell1.2;
äöü2.1;ab"c2.2;
The result looks like this:
header1;header2;਍挀攀氀氀㄀⸀㄀㬀挀攀氀氀㄀⸀㈀㬀ഀ
æöü2.1;abc2.2;਍
It is not an option to switch to UTF-8 as output format (which works fine btw).
So, how do I write valid UTF-16LE encoded CSV files using Text::CSV_XS?
Perl adds :crlf by default on Windows. It's added first, before your :encoding is added.
That means LF⇔CRLF conversion will be performed before decoding on reads, and after encoding on writes. This is backwards.
It ends up working with UTF-8 despite being done backwards because all of the following conditions are met:
The UTF-8 encoding of LF is the same as its Code Point (0A).
The UTF-8 encoding of CR is the same as its Code Point (0D).
0A always refers to LF no matter where they are in the file.
0D always refers to CR no matter where they are in the file.
None of those conditions holds true for UTF-16LE.
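A small demonstration of the resulting misalignment (a sketch, not from the answer; the sample string is an assumption):

use Encode qw(encode);

my $bytes = encode('UTF-16LE', "a\nb");    # bytes: 61 00 0A 00 62 00

# What a byte-level :crlf layer does on write: rewrite every 0A to 0D 0A.
# This splits the 0A 00 code unit and shifts all following bytes by one.
(my $broken = $bytes) =~ s/\x0A/\x0D\x0A/g;

print join(' ', map { sprintf '%02X', ord } split //, $broken), "\n";
# 61 00 0D 0A 00 62 00 -- no longer valid UTF-16LE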
Fix: start with :raw to remove the default :crlf, then add :crlf back on top of the encoding layer, so that the newline conversion happens on decoded characters:

open(my $fh_in,  '<:raw:encoding(UTF-16LE):crlf', $qfn_in)
    or die "$qfn_in: $!";
open(my $fh_out, '>:raw:encoding(UTF-16LE):crlf', $qfn_out)
    or die "$qfn_out: $!";

Perl incorrectly adding newline characters?

This is my tab-delimited input file:
Name<tab>Street<tab>Address
This is how I want my output file to look:
Street<tab>Address<tab>Address
(yes, duplicate the next two columns). My output file looks like this instead:
Street<tab>Address
<tab>Address
What is going on with Perl? This is my code:
open (IN, $ARGV[0]);
open (OUT, ">output.txt");
while ($line = <IN>) {
    chomp $line;
    @line = split /\t/, $line;
    $line[2] =~ s/\n//g;
    print OUT $line[1]."\t".$line[2]."\t".$line[2]."\n";
}
close(OUT);
close( OUT);
First of all, you should always
use strict and use warnings for even the most trivial programs. You will also need to declare each of your variables using my as close as possible to their first use
use lexical file handles and the three-parameter form of open
check the success of every open call, and die with a string that includes $! to show the reason for the failure
Note also that there is no need to explicitly open files named on the command line that appear in @ARGV: you can just read from them using <>.
As others have said, it looks like you are reading a file of DOS or Windows origin on a Linux system. Instead of using chomp, you can remove all trailing whitespace characters from each line using s/\s+\z//. Since CR and LF both count as "whitespace", this will remove all line terminators from each record. Beware, however, that, if trailing space is significant or if the last field may be blank, then this will also remove spaces and tabs. In that case, s/[\r\n]+\z// is more appropriate.
This version of your program works fine.
use strict;
use warnings;

@ARGV = 'addr.txt';

open my $out, '>', 'output.txt' or die $!;

while (<>) {
    s/\s+\z//;
    my @fields = split /\t/;
    print $out join("\t", @fields[1, 2, 2]), "\n";
}

close $out or die $!;
If you know beforehand the origin of your data file, and know it to be a DOS-like file that terminates records with CR LF, you can use the PerlIO crlf layer when you open the file. Like this
open my $in, '<:crlf', $ARGV[0] or die $!;
then all records will appear to end in just "\n" when they are read on a Linux system.
A general solution to this problem is to install PerlIO::eol. Then you can write
open my $in, '<:raw:eol(LF)', $ARGV[0] or die $!;
and the line ending will always be "\n" regardless of the origin of the file, and regardless of the platform where Perl is running.
Did you try to eliminate not only the "\n" but also the "\r"?

$line[2] =~ s/\r\n//g;    # the last field is the one carrying the line ending

It could work: DOS line endings contain "\r" as well (not only "\n").
Another way to avoid end of line problems is to only capture the characters you're interested in:
open (IN, $ARGV[0]);
open (OUT, ">output.txt");
while (<IN>) {
    print OUT "$1\t$2\t$2\n" if /^\w+\t(\w+)\t(\w+)/;
}
close(OUT);

Unicode in Perl not working

I have some text files which I am trying to transform with a Perl script on Windows. The text files look normal in Notepad+, but all the regexes in my script were failing to match. Then I noticed that when I open the text files in NotePad+, the status bar says "UCS-2 Little Endia" (sic). I am assuming this corresponds to the encoding UCS-2LE. So I created "readFile" and "writeFile" subs in Perl, like so:
use PerlIO::encoding;

my $enc = ':encoding(UCS-2LE)';

sub readFile {
    my ($fName) = @_;
    open my $f, "<$enc", $fName or die "can't read $fName\n";
    local $/;
    my $txt = <$f>;
    close $f;
    return $txt;
}

sub writeFile {
    my ($fName, $txt) = @_;
    open my $f, ">$enc", $fName or die "can't write $fName\n";
    print $f $txt;
    close $f;
}

my $fName = 'someFile.txt';
my $txt = readFile $fName;
# ... transform $txt using s/// ...
writeFile $fName, $txt;
Now the regexes match (although less often than I expect), but the output contains long strings of Asian-looking characters interspersed with long strings of the correct text. Is my code wrong? Or perhaps Notepad+ is wrong about the encoding? How should I proceed?
OK, I figured it out. The problem was being caused by a disconnect between the encoding translation done by the ":encoding(...)" parameter of the "open" call and the default CRLF translation done by Perl on Windows. What appeared to be happening was that LF was being translated to CRLF on output after the encoding had already been done, which threw off the "parity" of the 16-bit encoding for the following line. Once the next line was reached, the "parity" got put back. That would explain the "long strings of Asian-looking characters interspersed with long strings of the correct text"... every other line was being messed up.
To correct it, I took out the encoding parameter in my "open" call and added a "binmode" call, as follows:
open my $f, $fName or die "can't read $fName\n";
binmode $f, ':raw:encoding(UCS-2LE)';
binmode apparently has a concept of "layered" I/O handling that is somewhat complicated.
One thing I can't figure out is how to get my CRLF translation back. If I leave out :raw or add :crlf, the "parity" problem returns. I've tried re-ordering as well and can't get it to work.
(I added this as a separate question: CRLF translation with Unicode in Perl)
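For what it's worth, the layer ordering from the UTF-16LE CSV answer above should apply here as well (a sketch, with the same caveats):

# :raw drops the default :crlf, :encoding decodes the bytes, and the
# re-added :crlf then performs CRLF translation on decoded characters
open my $f, '<', $fName or die "can't read $fName\n";
binmode $f, ':raw:encoding(UCS-2LE):crlf';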
I don't have the Notepad+ editor to check but it may be a BOM problem with your output encoding not containing a BOM.
http://perldoc.perl.org/Encode/Unicode.html#Size%2c-Endianness%2c-and-BOM
Maybe you need to encode $txt using a byte order mark as described above.
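A minimal sketch of writing the BOM explicitly, echoing the chr(65279) approach from the first question above (reusing this question's writeFile):

sub writeFile {
    my ($fName, $txt) = @_;
    open my $f, ">$enc", $fName or die "can't write $fName\n";
    print $f "\x{FEFF}";    # BOM; the UCS-2LE layer encodes it as FF FE
    print $f $txt;
    close $f;
}

Alternatively, the endianness-unspecified UTF-16 encoding should write a BOM automatically on output.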