How to properly print non-English characters to a file with Perl? - perl

I am using Perl to print some data read from one file to another. Sometimes I read in non-English characters, such as accented characters like é. However, doing:
print FILE_HANDLER "... $variable ...";
does not keep the accents. The é actually gets printed out as the garbled sequence "é".
How can I print these characters out so that they're properly preserved? For reference, I open the files I read from and write to as follows:
open READ_FILE, "<", "file.xml" or die $!;
open WRITE_FILE, ">", "file.txt" or die $!;
Thanks for all your help.

perldoc -f open says:
You may (and usually should) use the three-argument form of open to specify I/O layers (sometimes referred to as "disciplines") to apply to the handle that affect how the input and output are processed (see open and PerlIO for more details). For example:
open(my $fh, "<:encoding(UTF-8)", "filename")
|| die "can't open UTF-8 encoded filename: $!";
opens the UTF8-encoded file containing Unicode characters; see perluniintro
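Applied to the handles in the question, a minimal sketch (assuming both files hold UTF-8 text; the sample content here is invented for the demo) would be:

```perl
use strict;
use warnings;

# Create a small input file for the demonstration; \x{e9} is é
open my $make, '>:encoding(UTF-8)', 'file.xml' or die $!;
print {$make} "caf\x{e9}\n";
close $make;

# Read with a decoding layer, write with an encoding layer:
# characters like é survive the round trip intact
open my $read_fh,  '<:encoding(UTF-8)', 'file.xml' or die "Can't open file.xml: $!";
open my $write_fh, '>:encoding(UTF-8)', 'file.txt' or die "Can't open file.txt: $!";
print {$write_fh} $_ while <$read_fh>;
close $read_fh;
close $write_fh;
```

With both layers in place, Perl decodes bytes to characters on input and encodes characters back to bytes on output, so no accents are mangled in between.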

Related

Cyrillic symbols shown strangely when writing to a file

I have a class that has a string field input which contains UTF-8 characters. My class also has a method toString. I want to save instances of the class to a file using the method toString. The problem is that strange symbols are being written in the file:
my $dest = "output.txt";
print "\nBefore saving to file\n" . $message->toString() . "\n";
open (my $fh, '>>:encoding(UTF-8)', $dest)
or die "Cannot open $dest : $!";
lock($fh);
print $fh $message->toString();
unlock($fh);
close $fh;
The first print works fine
Input: {"paramkey":"message","paramvalue":"здравейте"}
is being printed to the console. The problem is when I write to the file:
Input: {"paramkey":"message","paramvalue":"здÑавейÑе"}
I used flock for locking/unlocking the file.
The contents of the string returned by your toString method are already UTF-8 encoded. That works fine when you print it to your terminal because it is expecting UTF-8 data. But when you open your output file with
open (my $fh, '>>:encoding(UTF-8)', $dest) or die "Cannot open $dest : $!"
you are asking Perl to reencode the data as UTF-8. That converts each byte of the already-UTF-8-encoded data into a separate UTF-8 sequence, which isn't what you want at all. Unfortunately you don't show the code for the class that $message belongs to, so I can't help you fix it at the source.
You can fix that by changing your open call to just
open (my $fh, '>>', $dest) or die "Cannot open $dest : $!"
which will avoid the additional encoding step. But you should really be working with decoded characters throughout your Perl code: decoding input as you read it, and encoding output only as you write it.
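A sketch of that decode-early, encode-late pattern (the byte string below stands in for whatever encoded data toString actually returns, which isn't shown in the question):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Suppose we received UTF-8 *bytes* from somewhere, e.g. a module
# that returns encoded data: these are the UTF-8 bytes of "зд"
my $bytes = "\xd0\xb7\xd0\xb4";

# Decode once, at the boundary, into Perl characters
my $chars = decode('UTF-8', $bytes);

# Now an :encoding(UTF-8) output layer is correct: it encodes characters
open my $fh, '>:encoding(UTF-8)', 'output.txt'
    or die "Cannot open output.txt: $!";
print {$fh} $chars, "\n";
close $fh;
```

Inside the program everything is characters; encoding happens exactly once, at the output layer.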
I suppose you are missing
use utf8;
in your code...
This code produces the "output.txt" file you do expect:
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                            # this source file contains UTF-8 literals
use Fcntl qw(:flock);

binmode STDOUT, ':encoding(UTF-8)';  # encode the console output too

my $dest    = "output.txt";
my $message = "здравейте";
print "\nBefore saving to file\n" . $message . "\n";
open (my $fh, '>>:encoding(UTF-8)', $dest)
    or die "Cannot open $dest : $!";
flock $fh, LOCK_EX;
print $fh $message;
close $fh;
I did not use the toString() method because I'm working with plain strings rather than objects, but that does not change the substance.
How does your toString method work? I would guess, based on the output you've provided, that the toString method is producing bytes instead of characters, and then perl is getting confused when trying to convert it.
Try binmode STDOUT, ':encoding(UTF-8)' before your print to see if it produces the same output as the file - otherwise your test is apples and oranges.
If it's already bytes instead of characters, you can open your $dest without any encoding(...) layer and it'll work.
In general, I find working in characters rather than bytes to be extra work, but it resolves enough corner cases that I no longer have to think about that the extra work becomes worth it.
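The double-encoding symptom described above can be reproduced in a few lines (a sketch, not the asker's actual class):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $chars = "\x{437}\x{434}";          # the characters "зд"
my $bytes = encode('UTF-8', $chars);   # already-encoded UTF-8 bytes

# Pushing bytes through another UTF-8 encode step encodes each byte
# again, producing the "здÑ..." mojibake seen in the question
my $mojibake = encode('UTF-8', $bytes);

# The fix: decode back to characters before (re)encoding
my $fixed = decode('UTF-8', $bytes);
```

If toString really does return bytes, either decode its result as above or, as noted, drop the :encoding(UTF-8) layer from the open.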

perl parsing inserting new line and ^M

I am trying to modify a few strings in a file using Perl, with logic along these lines:
open FILE1, "<", "/tmp/sam.dsl" or die $!;   # read mode
open FILE2, ">", "/tmp/sam2.dsl" or die $!;  # write mode
while (<FILE1>) {
    if ($_ =~ s/string/found/g) {
        print FILE2 $_;
    }
}
I am able to change the contents; however, when I read the output file it has ^M in it.
my datafile is of the below format
name 'SAMPLE'
i would like to change this to
name 'SAMPLE2'
currently with my code it changes to
name 'SAMPLE2
'
which creates a new line and then does the replacement.
Do I need to use any other mode to open the file for writing?
My guess is that you are working with a Linux file on some Windows machine. Perl automatically converts \n into \r\n on DOS-compatible machines after reading and before writing. To get rid of this behaviour you can use binmode on your filehandles, but that puts them into "raw binary mode"; other layers (like :utf8 or :encoding(UTF-8)) are then not enabled, and you might want to set them yourself if you are handling character data. You could also use the PerlIO::eol module from CPAN.
Consider looking at these documentation pages:
PerlIO for a general understanding how the Perl-IO works.
open the pragma (not the function) to set layers for one program.
binmode the function you might want to consider.
My suggestion, but I can't test it (no Windows around), would be to use the following:
open my $infile,  '<', '/tmp/sam.dsl'  or die "error opening input: $!";
open my $outfile, '>', '/tmp/sam2.dsl' or die "error opening output: $!";
binmode $infile;    # raw mode: no \r\n <-> \n translation on either handle
binmode $outfile;
while (<$infile>) {
    s/\r?\n\z/\n/;          # normalise any CRLF line endings ourselves
    s/string/found/g;
    print $outfile $_;
}
close $outfile or die "error closing output: $!";

Perl incorrectly adding newline characters?

This is my tab delimited input file
Name<tab>Street<tab>Address
This is how I want my output file to look like
Street<tab>Address<tab>Address
(yes duplicate the next two columns) My output file looks like this instead
Street<tab>Address
<tab>Address
What is going on with perl? This is my code.
open (IN, $ARGV[0]);
open (OUT, ">output.txt");
while ($line = <IN>){
    chomp $line;
    @line = split /\t/, $line;
    $line[2] =~ s/\n//g;
    print OUT $line[1]."\t".$line[2]."\t".$line[2]."\n";
}
close(OUT);
First of all, you should always use strict and use warnings, even for the most trivial programs, and declare each of your variables with my as close as possible to its first use. You should also use lexical file handles with the three-parameter form of open, and check the success of every open call, dying with a string that includes $! to show the reason for the failure.
Note also that there is no need to explicitly open files named on the command line that appear in @ARGV: you can just read from them using <>.
As others have said, it looks like you are reading a file of DOS or Windows origin on a Linux system. Instead of using chomp, you can remove all trailing whitespace characters from each line using s/\s+\z//. Since CR and LF both count as "whitespace", this will remove all line terminators from each record. Beware, however, that, if trailing space is significant or if the last field may be blank, then this will also remove spaces and tabs. In that case, s/[\r\n]+\z// is more appropriate.
This version of your program works fine.
use strict;
use warnings;

@ARGV = 'addr.txt';

open my $out, '>', 'output.txt' or die $!;

while (<>) {
    s/\s+\z//;
    my @fields = split /\t/;
    print $out join("\t", @fields[1, 2, 2]), "\n";
}
close $out or die $!;
If you know beforehand the origin of your data file, and know it to be a DOS-like file that terminates records with CR LF, you can use the PerlIO crlf layer when you open the file. Like this
open my $in, '<:crlf', $ARGV[0] or die $!;
then all records will appear to end in just "\n" when they are read on a Linux system.
A general solution to this problem is to install PerlIO::eol. Then you can write
open my $in, '<:raw:eol(LF)', $ARGV[0] or die $!;
and the line ending will always be "\n" regardless of the origin of the file, and regardless of the platform where Perl is running.
Did you try to eliminate not only the "\n" but also the "\r"?
$line[1] =~ s/\r//g;
$line[2] =~ s/\r//g; # the last field is the one carrying the stray "\r"
It could work. DOS line endings are "\r\n", so after chomp removes the "\n" a "\r" is still left behind.
Another way to avoid end-of-line problems is to capture only the characters you're interested in:
open (IN, $ARGV[0]);
open (OUT, ">output.txt");
while (<IN>) {
    print OUT "$1\t$2\t$2\n" if /^\w+\t(\w+)\t(\w+)/;
}
close(OUT);

What are the differences between these "open" formats

I use this syntax to open my files, since I learned that some years ago in a training and the books I have do it the same way.
open( INPUTFILE, "< $input_file" ) || die "Can't open $input_file: $!";
Some days ago I saw this form in an SO answer:
open( my $input_fh, "<", $input_file ) || die "Can't open $input_file: $!";
Is this format new, or just doing the same thing a different way, using a normal variable as the filehandle?
Should I change to the "new" format? Does it have advantages, or does the "old" format have disadvantages?
You should use the three-argument version because it protects against files with crazy names. Consider the following:
my $file = "<file.txt";
open( INPUTFILE, "< $file" ) or die "$!";
This will interpolate as:
open( INPUTFILE, "< <file.txt" ) or die "$!";
...meaning you'll actually open a file named file.txt instead of one named <file.txt.
Now, for the filehandle, you want to use a lexical filehandle:
open( my $fh, "<", $file ) or die "$!";
The reason for this is that when $fh goes out of scope, the file is closed automatically. Further, the other kind of filehandle (a bareword, or package, filehandle) has global scope. Programmers aren't all that imaginative, so it's likely that you'll name your filehandle INPUTFILE or FH or FILEHANDLE. What happens if someone else has done the same and named their filehandle INPUTFILE in a module you use? Well, they're both valid, and one clobbers the other. Which one? Who knows; it depends on the order in which they're opened and closed. And what happens if the other programmer opened INPUTFILE for writing? Worlds end, my friend, worlds end.
If you use a lexical filehandle (the $fh) you don't have to worry about worlds ending, because even if the other programmer does call it $fh, variable scope protects you from clobbering.
So yes, always use the three-argument form of open() with lexical filehandles. Save the world.
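A quick sketch of that scoping benefit (the filename and subroutine are invented for the demo):

```perl
use strict;
use warnings;

sub write_greeting {
    open my $fh, '>', 'greeting.txt' or die "Can't open greeting.txt: $!";
    print {$fh} "hello\n";
}   # $fh goes out of scope here, so the file is flushed and closed

write_greeting();

# A second, completely unrelated $fh: no clobbering possible
open my $fh, '<', 'greeting.txt' or die "Can't open greeting.txt: $!";
my $line = <$fh>;
close $fh;
```

Two lexical $fh variables in different scopes never collide, unlike two bareword INPUTFILE handles in different modules.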
The difference between the two argument form of open and the three argument form has to do with how the filename is treated when it contains special characters. For example, in a two argument open, if the filename contains the | character, or a number of other special characters, the file will be opened with the shell. Thus with the two argument open it is possible to write something like this:
open my $file, 'rm -rf * |';
and perl will happily open a pipe to the output of rm while it runs deleting your system.
Whereas if you use the three argument form of open, the filename is never passed through the shell, which is far safer, if you are getting your filename from an untrusted source.
I also find the three argument form to be less ambiguous because it forces you to specify if you are reading, writing, or appending.
You can get all of the gory details of open on the manual page.
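For reference, the three modes that answer mentions (read, write, append) look like this in the three-argument form; the filename is made up for the demo:

```perl
use strict;
use warnings;

open my $out, '>', 'notes.txt' or die "write: $!";   # write, truncating
print {$out} "first\n";
close $out;

open my $log, '>>', 'notes.txt' or die "append: $!"; # append to the end
print {$log} "second\n";
close $log;

open my $in, '<', 'notes.txt' or die "read: $!";     # read only
my @lines = <$in>;
close $in;
```

Because the mode is its own argument, nothing in the filename can change it.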
From the Perl tutorial:
There is also a 3-argument version of open, which lets you put the special redirection characters into their own argument:
open( INFO, ">", $datafile ) || die "Can't create $datafile: $!";
In this case, the filename to open is the actual string in $datafile, so you don't have to worry about $datafile containing characters that might influence the open mode, or whitespace at the beginning of the filename that would be absorbed in the 2-argument version. Also, any reduction of unnecessary string interpolation is a good thing.
So the 3-argument version of open is the safest to use.

How can I read input from a text file in Perl?

I would like to take input from a text file in Perl. Although a lot of information is available on the net, it is still very confusing how to do the simple task of printing every line of a text file. So how do I do it? I am new to Perl, hence the confusion.
eugene has already shown the proper way. Here is a shorter script:
#!/usr/bin/perl
print while <>
or, equivalently,
#!/usr/bin/perl -p
on the command line:
perl -pe0 textfile.txt
You should start learning the language methodically, following a decent book, not through haphazard searches on the web.
You should also make use of the extensive documentation that comes with Perl.
See perldoc perltoc or perldoc.perl.org.
For example, opening files is covered in perlopentut.
First, open the file:
open my $fh, '<', "filename" or die $!;
Next, use the while loop to read until EOF:
while (<$fh>) {
# each line is automatically stored in the $_ variable
}
close $fh or die $!;
# open the file and associate with a filehandle
open my $file_handle, '<', 'your_filename'
or die "Can't open your_filename: $!\n";
while (<$file_handle>) {
# $_ contains each record from the file in turn
}