Unicode in Perl not working

I have some text files which I am trying to transform with a Perl script on Windows. The text files look normal in Notepad++, but all the regexes in my script were failing to match. Then I noticed that when I open the text files in Notepad++, the status bar says "UCS-2 Little Endia" (sic). I am assuming this corresponds to the encoding UCS-2LE. So I created "readFile" and "writeFile" subs in Perl, like so:
use PerlIO::encoding;

my $enc = ':encoding(UCS-2LE)';

sub readFile {
    my ($fName) = @_;
    open my $f, "<$enc", $fName or die "can't read $fName\n";
    local $/;
    my $txt = <$f>;
    close $f;
    return $txt;
}

sub writeFile {
    my ($fName, $txt) = @_;
    open my $f, ">$enc", $fName or die "can't write $fName\n";
    print $f $txt;
    close $f;
}

my $fName = 'someFile.txt';
my $txt = readFile $fName;
# ... transform $txt using s/// ...
writeFile $fName, $txt;
Now the regexes match (although less often than I expect), but the output contains long strings of Asian-looking characters interspersed with long strings of the correct text. Is my code wrong? Or perhaps Notepad++ is wrong about the encoding? How should I proceed?

OK, I figured it out. The problem was caused by a disconnect between the encoding translation done by the ":encoding(...)" parameter of the "open" call and the default CRLF translation done by Perl on Windows. What appeared to be happening was that LF was being translated to CRLF on output after the encoding had already been done, which threw off the "parity" of the 16-bit encoding for the following line. Once the next line was reached, the "parity" was restored. That would explain the "long strings of Asian-looking characters interspersed with long strings of the correct text"... every other line was being messed up.
To correct it, I took out the encoding parameter in my "open" call and added a "binmode" call, as follows:
open my $f, '<', $fName or die "can't read $fName\n";
binmode $f, ':raw:encoding(UCS-2LE)';
binmode apparently supports a concept of "layered" I/O handling (PerlIO layers) that is somewhat complicated.
One thing I can't figure out is how to get my CRLF translation back. If I leave out :raw or add :crlf, the "parity" problem returns. I've tried re-ordering as well and can't get it to work.
(I added this as a separate question: CRLF translation with Unicode in Perl)
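For what it's worth, one arrangement that may bring the CRLF translation back is to push :crlf above the encoding layer, so the newline mapping operates on decoded characters rather than on the encoded byte stream. A sketch (not verified on every platform):

open my $f, '<', $fName or die "can't read $fName\n";
# :raw removes the default byte-level :crlf layer; the :crlf pushed here
# sits above :encoding(UCS-2LE), so the CRLF translation happens on
# decoded characters instead of on the encoded bytes.
binmode $f, ':raw:encoding(UCS-2LE):crlf';

The same layer string should work for the output handle.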

I don't have the Notepad++ editor to check, but it may be a BOM problem: your output encoding does not include a byte order mark.
http://perldoc.perl.org/Encode/Unicode.html#Size%2c-Endianness%2c-and-BOM
Maybe you need to write $txt with a byte order mark, as described above.
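A minimal sketch of doing that by hand (assuming the files really are UCS-2LE): print U+FEFF as the first character, and the encoding layer will serialize it as the byte pair FF FE.

open my $f, '>:raw:encoding(UCS-2LE)', $fName
    or die "can't write $fName\n";
print $f "\x{FEFF}";    # byte order mark, encoded as FF FE
print $f $txt;
close $f;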

Related

Cyrillic symbols shown strangely when writing to a file

I have a class that has a string field input which contains UTF-8 characters. My class also has a toString method. I want to save instances of the class to a file using toString. The problem is that strange symbols are being written to the file:
my $dest = "output.txt";
print "\nBefore saving to file\n" . $message->toString() . "\n";
open (my $fh, '>>:encoding(UTF-8)', $dest)
or die "Cannot open $dest : $!";
lock($fh);
print $fh $message->toString();
unlock($fh);
close $fh;
The first print works fine:
Input: {"paramkey":"message","paramvalue":"здравейте"}
is being printed to the console. The problem is when I write to the file:
Input: {"paramkey":"message","paramvalue":"здÑавейÑе"}
I used flock for locking/unlocking the file.
The contents of the string returned by your toString method are already UTF-8 encoded. That works fine when you print it to your terminal because it is expecting UTF-8 data. But when you open your output file with
open (my $fh, '>>:encoding(UTF-8)', $dest) or die "Cannot open $dest : $!"
you are asking Perl to re-encode the data as UTF-8. That converts each byte of the already-UTF-8-encoded data to a separate UTF-8 sequence, which isn't what you want at all. Unfortunately you don't show the code for the class that $message belongs to, so I can't help you fix it at the source.
You can fix that by changing your open call to just
open (my $fh, '>>', $dest) or die "Cannot open $dest : $!"
which will avoid the additional encoding step. But you should really be working with unencoded characters throughout your Perl code: removing any encoding from files you are reading from, and encoding output data as necessary when you write to output files.
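A sketch of that end-to-end pattern (the file names here are made up for illustration):

use strict;
use warnings;

# Decode once on input so the program works with characters internally.
open my $in, '<:encoding(UTF-8)', 'input.txt'
    or die "Cannot open input.txt: $!";
my $text = do { local $/; <$in> };
close $in;

# ... transform $text as characters here ...

# Encode once on output; the layer handles the conversion to bytes.
open my $out, '>:encoding(UTF-8)', 'output.txt'
    or die "Cannot open output.txt: $!";
print $out $text;
close $out;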
I suppose you are missing
use utf8;
in your code...
This code produces the "output.txt" file you do expect:
#!/usr/bin/perl
use strict;
use utf8;
my $dest = "output.txt";
my $message = "здравейте";
print "\nBefore saving to file\n" . $message . "\n";
open (my $fh, '>>:encoding(UTF-8)', $dest)
or die "Cannot open $dest : $!";
lock($fh);
print $fh $message;
close $fh;
I did not use the toString() method because I'm working on plain strings, not real objects, but this does not change the substance...
How does your toString method work? I would guess, based on the output you've provided, that the toString method is producing bytes instead of characters, and then perl is getting confused when trying to convert it.
Try binmode STDOUT, ':encoding(UTF-8)' before your print to see if it produces the same output as the file - otherwise your test is apples and oranges.
If it's already bytes instead of characters, you can open your $dest without any encoding(...) layer and it'll work.
In general, I find it quite painful to work in characters rather than bytes, but since doing so resolves many corner cases that I no longer have to think about, the extra work is worth it. It is extra work, though.
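A quick way to make that apples-to-apples comparison concrete (a sketch; $message and $dest are as in the question):

# Put STDOUT behind the same encoding layer as the file, then compare.
binmode STDOUT, ':encoding(UTF-8)';
print $message->toString(), "\n";    # console test

open my $fh, '>>:encoding(UTF-8)', $dest or die "Cannot open $dest : $!";
print $fh $message->toString();      # file test
close $fh;

If the console now shows the same mangled text as the file, toString() is returning UTF-8 bytes rather than characters.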

perl output - failing in printing utf8 text files correctly

So I have UTF-8 text files which I want to read in, put the lines into an array, and print out. But the output doesn't print the characters correctly; for example, an output line looks like the following:
"arnſtein gehört gräflichen "
So I tried testing the script with one line pasted directly into the Perl script, without reading it from the file, and there the output is perfectly fine. I checked the files, which are in UTF-8 Unicode. Still, the files must be causing the output problem (?).
Because the script is too long, I have cut it down to the relevant part
(it goes to a directory, opens the files, passes the input to the function &align, analyses it, adds it to an array, and prints the array):
#!/usr/bin/perl -w
use strict;
use utf8;

binmode(STDIN,  ":utf8");
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");

# opens directory
# opens file from directory
if (-d "$dir/$first") {
    opendir(UDIR, "$dir/$first") or die "could not open: $!";
    foreach my $t (readdir(UDIR)) {
        next if $t eq ".";
        next if $t eq "..";
        open(GT, "$dir/$first/$t") or die "Could not open GT, $!";
        my $gt = <GT>;
        chomp $gt;
        # directly pasted line in perl - creates correct output
        &align("det man die Profeſſores der Philoſophie re- ");
        # line from file - output not correct
        #&align($gt);
        close GT;
        next;
    }
    closedir UDIR;
}
Any idea ?
You told Perl that your source code was UTF-8, and that STDIN, STDOUT, & STDERR are UTF-8, but you didn't say that the file you're reading contains UTF-8.
open(GT,"<:utf8", "$dir/$first/$t") or die "Could not open GT, $!";
Without that, Perl assumes the file is encoded in ISO-8859-1, since that's Perl's default charset if you don't specify a different one. It helpfully transcodes those ISO-8859-1 characters to UTF-8 for output, since you've told it that STDOUT uses UTF-8. Since the file was actually UTF-8, not ISO-8859-1, you get incorrect output.
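As an aside, the ':encoding(UTF-8)' layer is stricter than ':utf8': it validates that the input really is well-formed UTF-8 rather than merely flagging the data, so it is usually the safer spelling:

open(GT, '<:encoding(UTF-8)', "$dir/$first/$t")
    or die "Could not open GT, $!";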

Why doesn't chomp() work in this case?

I'm trying to use chomp() to remove all the newline characters from a file. Here's the code:
use strict;
use warnings;

open (INPUT, 'input.txt') or die "Couldn't open file, $!";
my @emails = <INPUT>;
close INPUT;

chomp(@emails);

my $test = '';
foreach (@emails)
{
    $test = $test . $_;
}
print $test;
and the test content of the input.txt file is simple:
hello.com
hello2.com
hello3.com
hello4.com
My expected output is something like this: hello.comhello2.comhello3.comhello4.com
However, I'm still getting the same content as the input file. Any help, please?
Thank you
If the input file was generated on a different platform (one that uses a different EOL sequence), chomp might not strip off all the newline characters. For example, if you created the text file in Windows (which uses \r\n) and ran the script on Mac or Linux, only the \n would get chomp()ed and the output would still "look" like it had newlines.
If you know what the EOL sequence of the input is, you can set $/ before chomp(). Otherwise, you may need to do something like
my @emails = map { s/[\n\r]+$//g; $_ } <INPUT>;
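And to make the $/ approach concrete: if you know the input has DOS line endings, setting $/ lets chomp remove the full CR LF pair (a sketch):

{
    local $/ = "\r\n";    # DOS end-of-line sequence
    open my $in, '<', 'input.txt' or die "Couldn't open file, $!";
    my @emails = <$in>;
    close $in;
    chomp @emails;        # now strips "\r\n", not just "\n"
}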

Perl incorrectly adding newline characters?

This is my tab-delimited input file:
Name<tab>Street<tab>Address
This is how I want my output file to look:
Street<tab>Address<tab>Address
(yes, duplicate the next two columns). My output file looks like this instead:
Street<tab>Address
<tab>Address
What is going on with Perl? This is my code.
open (IN, $ARGV[0]);
open (OUT, ">output.txt");
while ($line = <IN>) {
    chomp $line;
    @line = split /\t/, $line;
    $line[2] =~ s/\n//g;
    print OUT $line[1] . "\t" . $line[2] . "\t" . $line[2] . "\n";
}
close(OUT);
close( OUT);
First of all, you should always:

- use strict and use warnings for even the most trivial programs, and declare each of your variables using my as close as possible to their first use
- use lexical file handles and the three-parameter form of open
- check the success of every open call, and die with a string that includes $! to show the reason for the failure

Note also that there is no need to explicitly open files named on the command line that appear in @ARGV: you can just read from them using <>.
As others have said, it looks like you are reading a file of DOS or Windows origin on a Linux system. Instead of using chomp, you can remove all trailing whitespace characters from each line using s/\s+\z//. Since CR and LF both count as "whitespace", this will remove all line terminators from each record. Beware, however, that, if trailing space is significant or if the last field may be blank, then this will also remove spaces and tabs. In that case, s/[\r\n]+\z// is more appropriate.
This version of your program works fine.
use strict;
use warnings;

@ARGV = 'addr.txt';

open my $out, '>', 'output.txt' or die $!;

while (<>) {
    s/\s+\z//;
    my @fields = split /\t/;
    print $out join("\t", @fields[1, 2, 2]), "\n";
}

close $out or die $!;
If you know beforehand the origin of your data file, and know it to be a DOS-like file that terminates records with CR LF, you can use the PerlIO crlf layer when you open the file, like this:
open my $in, '<:crlf', $ARGV[0] or die $!;
then all records will appear to end in just "\n" when they are read on a Linux system.
A general solution to this problem is to install PerlIO::eol. Then you can write
open my $in, '<:raw:eol(LF)', $ARGV[0] or die $!;
and the line ending will always be "\n" regardless of the origin of the file, and regardless of the platform where Perl is running.
Did you try to eliminate not only the "\n" but also the "\r"?
$file[2] =~ s/\r\n//g;
$file[3] =~ s/\r\n//g;    # is this the right index?
It could work. DOS line endings are "\r\n", so you need to remove the "\r" as well, not only the "\n".
Another way to avoid end of line problems is to only capture the characters you're interested in:
open (IN, $ARGV[0]);
open (OUT, ">output.txt");
while (<IN>) {
    # capture only the second and third fields; a trailing \r or \n
    # never matches \w, so it is dropped automatically
    print OUT "$1\t$2\t$2\n" if /^\w+\t(\w+)\t(\w+)/;
}
close(OUT);

Perl printing binary to files - cr lf

I am not a regular Perl programmer and I could not find anything about this in the forums or the few books I have.
I am trying to write binary data to a file using the construct:
print filehandle $record
I note that all of my records are truncated when an x'0A' is encountered, so apparently Perl uses the LF as an end-of-record indicator. How can I write complete records, using, for example, a length specifier? I am worried about Perl tampering with other binary "non-printables" as well.
thanks
Fritz
You want to use
open(my $fh, '<', $qfn) or die $!;
binmode($fh);
or
open(my $fh, '<:raw', $qfn) or die $!;
to prevent modifications. Same goes for output handles.
This "truncation at 0A" talk makes it sound like you're using readline and expect to do something other than read a line.
Well, actually, it can! You just need to tell readline you want it to read fix width records.
local $/ = \4096;    # a reference to an integer makes readline return fixed-size records

while (my $rec = <$fh>) {
    ...
}
The other alternative would be to use read.
while (1) {
    my $rv = read($fh, my $rec, 4096);
    die $! if !defined($rv);
    last if !$rv;
    ...
}
Relevant documentation: binmode, open, read, readline (aka <> and <$fh>), and $/.
Perl is not "tampering" with your writes. If your records are being truncated when they encounter a line feed, then that's a problem with the code that reads them, not the code that writes them. (Unless the format specifies that line feeds must be escaped, in which case the "problem" with the code writing the file is that it doesn't tamper with the data (by escaping line feeds) and instead writes exactly what you tell it to.)
Please provide a small (but runnable) code sample demonstrating your issue, ideally including both reading and writing, along with the actual result and the desired result, and we'll be able to give more specific help.
Note, however, that \n does not map directly to a single data byte (ASCII character) unless you're in binary mode. If the file is being read or written in text mode, \n could be just a CR, just a LF, or a CRLF, depending on the operating system it's being run under.
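To make the length-specifier idea from the question concrete, here is a minimal sketch of length-prefixed records (the 32-bit big-endian prefix and the file name are illustrative choices, not anything Perl mandates):

use strict;
use warnings;

# Write: each record is preceded by its byte length as a 32-bit
# big-endian integer, so embedded x'0A' bytes are harmless.
open my $out, '>:raw', 'records.bin' or die $!;
for my $record ("foo\x0Abar", "baz") {
    print $out pack('N', length $record), $record;
}
close $out or die $!;

# Read: recover each record by reading its length prefix first.
open my $in, '<:raw', 'records.bin' or die $!;
while (1) {
    my $n = read($in, my $len_buf, 4);
    die $! if !defined $n;
    last if $n == 0;                        # clean end of file
    die "truncated length prefix" if $n < 4;
    my $len = unpack('N', $len_buf);
    my $got = read($in, my $record, $len);
    die "truncated record" unless defined $got && $got == $len;
    print "record: $record\n";
}
close $in;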