Perl output fails to print UTF-8 text files correctly

I have UTF-8 text files that I want to read in, put the lines into an array, and print out. However, the output doesn't show the characters correctly. For example, an output line looks like this:
"arnſtein gehört gräflichen "
So I tested the script with a single line pasted directly into the Perl script instead of read from a file, and there the output is perfectly fine. I checked the files, and they are UTF-8 encoded. Still, the files must be causing the output problem (?).
Because the script is too long, I have cut it down to the relevant parts:
(it goes to a directory, opens the files, passes the input to the function &align, analyses it, adds it to an array, and prints the array)
#!/usr/bin/perl -w
use strict;
use utf8;
binmode(STDIN,":utf8");
binmode(STDOUT,":utf8");
binmode(STDERR,":utf8");
#opens directory
#opens file from directory
if (-d "$dir/$first"){
opendir (UDIR, "$dir/$first") or die "could not open: $!";
foreach my $t (readdir(UDIR)){
next if $first eq ".";
next if $first eq "..";
open(GT,"$dir/$first/$t") or die "Could not open GT, $!";
my $gt= <GT>;
chomp $gt;
#directly pasted lines in perl - creates correct output
&align("det man die Profeſſores der Philoſophie re- ");
#lines from file - output not correct
#&align($gt);
close GT;
next;
}closedir UDIR;
}
Any ideas?

You told Perl that your source code was UTF-8, and that STDIN, STDOUT, & STDERR are UTF-8, but you didn't say that the file you're reading contains UTF-8.
open(GT,"<:utf8", "$dir/$first/$t") or die "Could not open GT, $!";
Without that, Perl reads the file as raw bytes and treats each byte as one ISO-8859-1 character, since that is its default when no encoding layer is given. It then encodes those characters as UTF-8 on output, because you told it that STDOUT uses UTF-8. Since the file was actually UTF-8, not ISO-8859-1, every multi-byte character gets encoded a second time and you get incorrect output.
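To see the effect in isolation, here is a minimal sketch. It assumes the script itself is saved as UTF-8 and that no default I/O layers are set through PERL_UNICODE or the open pragma; the temporary file and the sample line are stand-ins for the real data, not the asker's files.

```perl
use strict;
use warnings;
use utf8;                          # the literal below is UTF-8 source text
use File::Temp qw(tempfile);

# Hypothetical sample line standing in for one line of the real files.
my $text = "arnſtein gehört gräflichen";

# Write the sample as UTF-8 to a temporary file.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out, ':encoding(UTF-8)';
print {$out} "$text\n";
close $out or die "Could not close $path: $!";

# No layer: every byte of the file becomes one character.
open my $raw, '<', $path or die "Could not open $path: $!";
chomp(my $undecoded = <$raw>);
close $raw;

# With a decoding layer: multi-byte sequences become single characters.
open my $in, '<:encoding(UTF-8)', $path or die "Could not open $path: $!";
chomp(my $decoded = <$in>);
close $in;

printf "without layer: %d characters\n", length $undecoded;
printf "with layer:    %d characters\n", length $decoded;
print "decoded line equals the original text\n" if $decoded eq $text;
```

The undecoded string is longer because each non-ASCII character arrives as several separate byte-characters, and those are what get encoded again on the way to STDOUT.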

Related

Remove mysterious line breaks in CSV file using Perl

I have a CSV file that I'm parsing using Perl. The file is a BOM produced by Solidworks 2015 that was saved as an XLS file, then opened in Excel and saved as a CSV file.
There are cells that have line breaks. When I read a line with such a cell from the file, the line comes in with the line breaks. For example, one of the lines read looks like this:
74,,74,1,1,"SJ-TL303202-DET-074-
001",PDSI,"2.25"" DIA. X 8.00""",A2,513,1,
It reads in as a single line in Perl.
When I turn on Show All Characters in Notepad++, I can see the line breaks are caused by [CR][LF].
So I thought this would work to remove the line feeds:
$line =~ s/[\r\n]+//g;
but it does not.
You don't give much of a sample of your CSV data, but what you show is perfectly valid. A text field may contain newlines as long as it is enclosed in double quotes.
The Text::CSV module will process it quite happily as long as you enable the binary option in the constructor call, and you may reformat the data as you wish before you write it back out again.
This program expects the path to the input file as a parameter on the command line, and it writes the modified data to STDOUT, which you can redirect on the command line like this:
$ perl fix_csv.pl input.csv > output.csv
I've assumed that your data contains only 7-bit ASCII characters. The program should work whether you're running it on a Windows system or on Linux.
use strict;
use warnings 'all';
my ($csv_file) = @ARGV;
use Text::CSV;
open my $fh, '<', $csv_file or die qq{Unable to open "$csv_file" for input: $!};
my $csv = Text::CSV->new( { binary => 1 } );
while ( my $row = $csv->getline( $fh ) ) {
tr/\r\n//d for @$row;
$csv->combine(@$row);
print $csv->string, "\n";
}
output
74,,74,1,1,SJ-TL303202-DET-074-001,PDSI,"2.25"" DIA. X 8.00""",A2,513,1,

Perl file processing on SHIFT_JIS encoded Japanese files

I have a set of SHIFT_JIS (Japanese) encoded CSV files from Windows, which I am trying to process on a Linux server running Perl v5.10.1, using regular expressions to make string replacements.
Here is my requirement:
I want the Perl script's regular expressions to be human readable (at least to a Japanese person),
i.e. like this:
s/北/0/g;
instead of being littered with hex codes:
s/\x{5317}/0/g;
Right now, I am editing the Perl script in Notepad++ on Windows, and pasting the string I need to search for from the CSV data file into the Perl script.
I have the following working test script:
use strict;
use warnings;
use utf8;
open (IN1, "<:encoding(shift_jis)", "${work_dir}/tmp00.csv") or die "Error: tmp00.csv\n";
open (OUT1, "+>:encoding(shift_jis)" , "${work_dir}/tmp01.csv") or die "Error: tmp01.csv\n";
while (<IN1>)
{
print $_ . "\n";
chomp;
s/北/0/g;
s/10:00/9:00/g;
print OUT1 "$_\n";
}
close IN1;
close OUT1;
This successfully replaces 10:00 with 9:00 in the CSV file, but I was unable to replace 北 (i.e. North) with 0 unless use utf8 is also included at the top.
Questions:
1) In the open documentation, http://perldoc.perl.org/functions/open.html, I didn’t see use utf8 as a requirement, unless it is implicit?
a) If I had use utf8 only, then the first print statement in the loop would print garbage character to my xterm screen.
b) If I had called open with :encoding(shift_jis) only, then the first print statement in the loop would print Japanese character to my xterm screen, but the replacement would not happen. There is no warning that use utf8 was not specified.
c) If I used both a) and b), then this example works.
How does “use utf8” modify the behavior of calling open with :encoding(shift_jis) in this Perl script?
2) I also tried to open the file without any encoding specified, wouldn’t Perl treat the file strings as raw bytes, and be able to perform regular expression match that way if the strings I pasted in the script, is in the same encoding as the text in the original data file? I was able to do file name replacement earlier this way without specifying any encoding whatsoever (please refer to my related post here: Perl Japanese to English filename replacement).
Thanks.
UPDATES 1
Testing a simple localization sample in Perl for filename and file text replacement in Japanese
In Windows XP, copy the 南 character from within a .csv data file to the clipboard, then use it as both the file name (i.e. 南.txt) and the file content (南). In Notepad++, reading the file under encoding UTF-8 shows \x93\xEC; reading it under SHIFT_JIS displays 南.
Script:
Use the following Perl script south.pl, which will be run on a Linux server with Perl 5.10
#!/usr/bin/perl
use feature qw(say);
use strict;
use warnings;
use utf8;
use Encode qw(decode encode);
my $user_dir="/usr/frank";
my $work_dir = "${user_dir}/test_south";
# forward declare the function prototypes
sub fileProcess;
opendir(DIR, ${work_dir}) or die "Cannot open directory " . ${work_dir};
# readdir OPTION 1 - shift_jis
#my @files = map { Encode::decode("shift_jis", $_); } readdir DIR; # Note filename could not be decoded as shift_jis
#binmode(STDOUT,":encoding(shift_jis)");
# readdir OPTION 2 - utf8
my @files = map { Encode::decode("utf8", $_); } readdir DIR; # Note filename could be decoded as utf8
binmode(STDOUT,":encoding(utf8)"); # setting display to output utf8
say @files;
# pass an array reference of files that will be modified
fileNameTranslate();
fileProcess();
closedir(DIR);
exit;
sub fileNameTranslate
{
foreach (@files)
{
my $original_file = $_;
#print "original_file: " . "$original_file" . "\n";
s/南/south/;
my $new_file = $_;
# print "new_file: " . "$_" . "\n";
if ($new_file ne $original_file)
{
print "Rename " . $original_file . " to \n\t" . $new_file . "\n";
rename("${work_dir}/${original_file}", "${work_dir}/${new_file}") or print "Warning: rename failed because: $!\n";
}
}
}
sub fileProcess
{
# file process OPTION 3, open file as shift_jis, the search and replace would work
# open (IN1, "<:encoding(shift_jis)", "${work_dir}/south.txt") or die "Error: south.txt\n";
# open (OUT1, "+>:encoding(shift_jis)" , "${work_dir}/south1.txt") or die "Error: south1.txt\n";
# file process OPTION 4, open file as utf8, the search and replace would not work
open (IN1, "<:encoding(utf8)", "${work_dir}/south.txt") or die "Error: south.txt\n";
open (OUT1, "+>:encoding(utf8)" , "${work_dir}/south1.txt") or die "Error: south1.txt\n";
while (<IN1>)
{
print $_ . "\n";
chomp;
s/南/south/g;
print OUT1 "$_\n";
}
close IN1;
close OUT1;
}
Result:
(BAD) Uncomment Option 1 and 3, (Comment Option 2 and 4)
Setup: Readdir encoding, SHIFT_JIS; file open encoding SHIFT_JIS
Result: file name replacement failed..
Error: utf8 "\x93" does not map to Unicode at .//south.pl line 68.
\x93
(BAD) Uncomment Option 2 and 4 (Comment Option 1 and 3)
Setup: Readdir encoding, utf8; file open encoding utf8
Result: file name replacement worked, south.txt generated
But south1.txt file content replacement failed , it has the content \x93 ().
Error: "\x{fffd}" does not map to shiftjis at .//south.pl line 25.
... -Ao?= (Bx{fffd}.txt
(GOOD) Uncomment Option 2 and 3, (Comment Option 1 and 4)
Setup: Readdir encoding, utf8; file open encoding SHIFT_JIS
Result: file name replacement worked, south.txt generated
South1.txt file content replacement worked, it has the content south.
Conclusion:
I had to use different encoding scheme for this example to work properly. Readdir utf8, and file processing SHIFT_JIS, as the content of the csv file was SHIFT_JIS encoded.
A good place to start would be to read the documentation for the utf8 module, which says:
The use utf8 pragma tells the Perl parser to allow UTF-8 in the
program text in the current lexical scope (allow UTF-EBCDIC on EBCDIC
based platforms). The no utf8 pragma tells Perl to switch back to
treating the source text as literal bytes in the current lexical
scope.
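If you don't have use utf8 in your code, then the Perl compiler treats your source code as a sequence of single bytes. The character '北' in your script is then seen as the separate bytes of its UTF-8 encoding, which never match the single decoded character that the :encoding(shift_jis) layer hands you from the file. Adding the pragma tells Perl that your code includes UTF-8 encoded characters, so the literal in the regex becomes one character and everything starts to work.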
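The difference can be made visible with a small sketch. It assumes the script file is saved as UTF-8; the variable names are illustrative only, and the decoded value is written as an escape so that it does not depend on the source encoding.

```perl
use strict;
use warnings;

# Stand-in for what an :encoding(shift_jis) layer hands back for 北:
# a single decoded character, U+5317.
my $decoded = "\x{5317}";

my ($len_chars, $len_bytes, $match_chars, $match_bytes);
{
    use utf8;     # literals in this block are decoded as UTF-8
    $len_chars   = length "北";
    $match_chars = $decoded =~ /北/ ? 1 : 0;
}
{
    no utf8;      # literals in this block are the raw source bytes
    $len_bytes   = length "北";
    $match_bytes = $decoded =~ /北/ ? 1 : 0;
}

print "with use utf8:    length $len_chars, matches decoded data: $match_chars\n";
print "without use utf8: length $len_bytes, matches decoded data: $match_bytes\n";
```

Because the pragma is lexically scoped, the same literal is compiled twice here, once as one character and once as its separate bytes.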

Line breaks don't exist on input from FTP file (Perl)

I downloaded a CSV file using Net::FTP. When I look at this file in a text editor or Excel, or even when I cut and paste it, it has line breaks and looks like this:
000000000G911|06
0000000000CDR|25|123
0000000000EGP|19
When I read the file in Perl it sees the entire text as one line like this:
000000000G911|060000000000CDR|25|1230000000000EGP|19
I have tried reading it using
tie @lines, 'Tie::File', "C:/Programs/myfile.csv", autochomp=>0 or die "Can't read file: $!\n";
foreach $l (@lines)
{print "$l\n";
}
and
open FILE, "<$filename" or die $!;
my @lines=<FILE>;
foreach $l (@lines)
{print "$l\n";
}
close FILE;
The file has line breaks in a format that Perl is not recognizing because it is coming from a different operating system. The other programs are automatically detecting the different line break format, but Perl doesn't do that.
If you have Net::FTP perform the transfer in ASCII mode (e.g. $ftp->ascii to enable this mode), this should be taken care of and corrected for you.
Alternatively, you can figure out what is being used for line breaks and then set the special $/ variable to that value.
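As a rough sketch of the second approach, assume inspection showed that the file uses a bare carriage return as its terminator. That is only an assumption for illustration; check the actual bytes first. The temporary file here is a stand-in for the downloaded one.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build a stand-in file whose records end in a bare CR (assumed format).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "000000000G911|06\r0000000000CDR|25|123\r0000000000EGP|19\r";
close $out or die "Could not close $path: $!";

open my $in, '<', $path or die "Could not open $path: $!";
binmode $in;                # no CRLF translation on any platform

my @lines;
{
    local $/ = "\r";        # input record separator, restored after this block
    @lines = <$in>;
    chomp @lines;           # chomp removes whatever $/ currently holds
}
close $in;

print scalar(@lines), " records read\n";
print "$_\n" for @lines;
```

Using local keeps the change to $/ confined to the block, so the rest of the program still reads newline-terminated input as usual.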

How to detect UTF8 with BOM encoding in Perl

I have a simple Perl script that compares two files.
I write the results to different files with UTF-8 BOM encoding.
To save text in a BOM file, I print chr(65279) at the beginning of the result file. Sometimes the input text already contains a BOM character at the beginning, and my script prints one more.
The question is: how can I work around this so that the BOM character is not printed twice?
See my Perl code below:
use strict;
use warnings;
use List::Compare;
use Cwd 'abs_path';
use open ':encoding(utf8)';
use open IO => ':encoding(utf8)';
open F, "<$ARGV[0]" or die $!;
open S, "<$ARGV[1]" or die $!;
my @a=<F>;
my @b=<S>;
close F;
close S;
my $lc = List::Compare->new(\@a, \@b);
my @intersection = $lc->get_intersection;
my @missing = $lc->get_unique;
my @extra = $lc->get_complement;
open EXTRA, ">".$ARGV[2]."file_extra.txt" or die("Unable to open the file");
open MISSING, ">".$ARGV[2]."file_missing.txt" or die("Unable to open the file");
open SUBTRACTED, ">".$ARGV[2]."file_subtr.txt" or die("Unable to open the file");
#Turn on UTF-8 BOM support
print EXTRA chr(65279);
print MISSING chr(65279);
print SUBTRACTED chr(65279);
print MISSING @missing;
print EXTRA @extra;
print SUBTRACTED @intersection;
close MISSING;
close EXTRA;
close SUBTRACTED;
Strip it while reading the file content (in your example, apply s/^\x{FEFF}// to $a[0] and $b[0]), then add it in front of the output when you print the results if you really need it. Better yet, don't print it at all, as it is useless for UTF-8.
If you have a double BOM, this is probably because one BOM comes from your input. So you should clean up your input before processing it:
s/^\x{FEFF}// for $a[0], $b[0];
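A minimal sketch of that cleanup, wrapped in a hypothetical helper named strip_bom (not part of the original script). It assumes the lines were already decoded through an :encoding(utf8) layer, so the BOM shows up as the single character U+FEFF.

```perl
use strict;
use warnings;

# Hypothetical helper: remove one leading BOM from the first line, if any.
sub strip_bom {
    my ($lines) = @_;
    $lines->[0] =~ s/^\x{FEFF}// if @$lines;
    return $lines;
}

# Stand-ins for the two input files: one starts with a BOM, one does not.
my @with_bom    = ("\x{FEFF}first\n", "second\n");
my @without_bom = ("first\n", "second\n");

strip_bom(\@with_bom);
strip_bom(\@without_bom);

print "first lines are identical after stripping\n"
    if $with_bom[0] eq $without_bom[0];
```

Stripping before the comparison also keeps a stray BOM from making otherwise equal first lines look different.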

Unicode in Perl not working

I have some text files which I am trying to transform with a Perl script on Windows. The text files look normal in Notepad+, but all the regexes in my script were failing to match. Then I noticed that when I open the text files in NotePad+, the status bar says "UCS-2 Little Endia" (sic). I am assuming this corresponds to the encoding UCS-2LE. So I created "readFile" and "writeFile" subs in Perl, like so:
use PerlIO::encoding;
my $enc = ':encoding(UCS-2LE)';
sub readFile {
my ($fName) = @_;
open my $f, "<$enc", $fName or die "can't read $fName\n";
local $/;
my $txt = <$f>;
close $f;
return $txt;
}
sub writeFile {
my ($fName, $txt) = @_;
open my $f, ">$enc", $fName or die "can't write $fName\n";
print $f $txt;
close $f;
}
my $fName = 'someFile.txt';
my $txt = readFile $fName;
# ... transform $txt using s/// ...
writeFile $fName, $txt;
Now the regexes match (although less often than I expect), but the output contains long strings of Asian-looking characters interspersed with long strings of the correct text. Is my code wrong? Or perhaps Notepad+ is wrong about the encoding? How should I proceed?
OK, I figured it out. The problem was caused by a disconnect between the encoding translation done by the "encoding..." parameter of the "open" call and the default CRLF translation done by Perl on Windows. What appeared to be happening was that LF was being translated to CRLF on output after the encoding had already been done, which inserted a single extra byte and threw off the "parity" of the 16-bit encoding for the following line. Once the next line was reached, the "parity" got put back. That would explain the "long strings of Asian-looking characters interspersed with long strings of the correct text": every other line was being messed up.
To correct it, I took out the encoding parameter in my "open" call and added a "binmode" call, as follows:
open my $f, $fName or die "can't read $fName\n";
binmode $f, ':raw:encoding(UCS-2LE)';
binmode apparently has a concept of "layered" I/O handling that is somewhat complicated.
One thing I can't figure out is how to get my CRLF translation back. If I leave out :raw or add :crlf, the "parity" problem returns. I've tried re-ordering as well and can't get it to work.
(I added this as a separate question: CRLF translation with Unicode in Perl)
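One layer order that is commonly suggested for this situation puts :crlf above the encoding layer, so that newline translation happens on characters rather than on the encoded bytes. The following is only a sketch under that assumption, checked here with plain ASCII content and a temporary stand-in file; it has not been verified against the files in question.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Assumed layer order: drop the default layers, then encode, then
# translate newlines on the character side of the encoding layer.
my $layers = ':raw:encoding(UCS-2LE):crlf';

my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# Write two lines; "\n" should become CR LF before it is encoded.
open my $out, ">$layers", $path or die "can't write $path: $!";
print {$out} "a\nb\n";
close $out or die "can't close $path: $!";

# Read back through the same layers; CR LF should come back as "\n".
open my $in, "<$layers", $path or die "can't read $path: $!";
my $text = do { local $/; <$in> };
close $in;

# Look at the raw bytes to confirm every code unit is 16 bits wide.
open my $raw, '<:raw', $path or die "can't read $path: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

print "round trip ok\n" if $text eq "a\nb\n";
print join(' ', map { sprintf '%02X', ord } split //, $bytes), "\n";
```

If the hex dump shows a lone 0D byte without its 00 partner, the CRLF translation is still running below the encoding layer.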
I don't have the Notepad+ editor to check but it may be a BOM problem with your output encoding not containing a BOM.
http://perldoc.perl.org/Encode/Unicode.html#Size%2c-Endianness%2c-and-BOM
Maybe you need to encode $txt using a byte order mark as described above.
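For illustration, here is a small sketch of how the BOM behaviour differs between encoding names in the Encode module. The sample string and the hex helper are made up for the example; the point is that the explicit-endian names write no BOM, so one has to be prepended as U+FEFF if it is wanted.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $txt = "hi";    # sample text, ASCII only to keep the dump short

# Explicit endianness: no BOM is written.
my $no_bom   = encode('UTF-16LE', $txt);

# Prepending U+FEFF yourself produces a little-endian BOM.
my $with_bom = encode('UTF-16LE', "\x{FEFF}$txt");

# Plain "UTF-16" adds a BOM on its own and writes big-endian.
my $auto     = encode('UTF-16', $txt);

# Illustrative helper to show the bytes as hex.
sub hex_dump { join ' ', map { sprintf '%02X', ord } split //, shift }

printf "%-24s %s\n", 'UTF-16LE:',          hex_dump($no_bom);
printf "%-24s %s\n", 'UTF-16LE with BOM:', hex_dump($with_bom);
printf "%-24s %s\n", 'UTF-16:',            hex_dump($auto);
```

An editor that relies on the BOM to detect the encoding would need the second form to recognise the file as little-endian.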