Cyrillic symbols shown strangely when writing to a file - perl

I have a class that has a string field input which contains UTF-8 characters. My class also has a method toString. I want to save instances of the class to a file using the method toString. The problem is that strange symbols are being written in the file:
my $dest = "output.txt";
print "\nBefore saving to file\n" . $message->toString() . "\n";
open (my $fh, '>>:encoding(UTF-8)', $dest)
or die "Cannot open $dest : $!";
lock($fh);
print $fh $message->toString();
unlock($fh);
close $fh;
The first print works fine
Input: {"paramkey":"message","paramvalue":"здравейте"}
is being printed to the console. The problem is when I write to the file:
Input: {"paramkey":"message","paramvalue":"здÑавейÑе"}
I used flock for locking/unlocking the file.

The contents of the string returned by your toString method are already UTF-8 encoded. That works fine when you print it to your terminal because it is expecting UTF-8 data. But when you open your output file with
open (my $fh, '>>:encoding(UTF-8)', $dest) or die "Cannot open $dest : $!"
you are asking Perl to re-encode the data as UTF-8. That converts each byte of the already UTF-8-encoded data into a separate UTF-8 sequence, which isn't what you want at all. Unfortunately you don't show the code for the class that $message belongs to, so I can't help you track down where the string gets encoded.
You can fix that by changing your open call to just
open (my $fh, '>>', $dest) or die "Cannot open $dest : $!"
which will avoid the additional encoding step. But you should really be working with unencoded characters throughout your Perl code: decoding data as you read it from input files, and encoding it as necessary when you write to output files.
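To make the double encoding visible, here is a minimal sketch. It assumes that toString() returns UTF-8 bytes (simulated here with Encode::encode), and written_size is a hypothetical helper introduced only for this illustration. It writes the same data three ways and compares the resulting file sizes.

```perl
use strict;
use warnings;
use utf8;                          # the literal below is a character string
use Encode qw(encode decode);
use File::Temp qw(tempfile);

my $chars = "здравейте";                 # 9 characters
my $bytes = encode('UTF-8', $chars);     # 18 bytes; assumed to be what toString() returns

# Hypothetical helper: write $data using the given open mode and
# return the size of the resulting file in bytes.
sub written_size {
    my ($mode, $data) = @_;
    my ($tmp_fh, $path) = tempfile(UNLINK => 1);
    close $tmp_fh;
    open my $fh, $mode, $path or die "Cannot open $path: $!";
    print {$fh} $data;
    close $fh or die "Cannot close $path: $!";
    return -s $path;
}

# Already-encoded bytes pushed through an encoding layer: encoded twice.
my $double  = written_size('>:encoding(UTF-8)', $bytes);
# Already-encoded bytes written without a layer: written as-is.
my $raw     = written_size('>', $bytes);
# Decoded back to characters first, then encoded once by the layer.
my $decoded = written_size('>:encoding(UTF-8)', decode('UTF-8', $bytes));

print "double-encoded: $double bytes\n";
print "raw bytes:      $raw bytes\n";
print "decode first:   $decoded bytes\n";
```

The double-encoded file is twice as large because every byte of the original UTF-8 sequence is treated as a separate character and encoded again. The third variant is the "characters inside, encoding at the edges" approach recommended above.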

I suppose you are missing
use utf8;
in your code...
This code produces the "output.txt" file you expect:
#!/usr/bin/perl
use strict;
use utf8;
my $dest = "output.txt";
my $message = "здравейте";
print "\nBefore saving to file\n" . $message . "\n";
open (my $fh, '>>:encoding(UTF-8)', $dest)
or die "Cannot open $dest : $!";
lock($fh);
print $fh $message;
close $fh;
I did not use the toString() method because I'm working with a plain string rather than a real object, but this does not change the substance...

How does your toString method work? I would guess, based on the output you've provided, that the toString method is producing bytes instead of characters, and then perl is getting confused when trying to convert it.
Try binmode STDOUT, ':encoding(UTF-8)' before your print to see if it produces the same output as the file - otherwise your test is apples and oranges.
If it's already bytes instead of characters, you can open your $dest without any encoding(...) layer and it'll work.
In general, I find it quite painful to work in characters rather than bytes. It is extra work, but it becomes worth it because it resolves many corner cases that I then don't have to think about anymore.

Related

Perl script find and replace not working?

I am trying to create a script in Perl to replace text in all HTML files in a given directory. However, it is not working. Could anyone explain what I'm doing wrong?
my @files = glob "ACM_CCS/*.html";
foreach my $file (@files)
{
open(FILE, $file) || die "File not found";
my @lines = <FILE>;
close(FILE);
my @newlines;
foreach(@lines) {
$_ =~ s/Authors Here/Authors introduced this subject for the first time in this paper./g;
#$_ =~ s/Authors Elsewhere/Authors introduced this subject in a previous paper./g;
#$_ =~ s/D4-/D4: Is the supporting evidence described or cited?/g;
push(@newlines,$_);
}
open(FILE, $file) || die "File not found";
print FILE @newlines;
close(FILE);
}
For example, I'd want to replace "D4-" with "D4: Is the...", etc. Thanks, I'd appreciate any tips.
You are using the two-argument version of open. If $file does not start with "<", ">", or ">>", it will be opened as a read filehandle. You cannot write to a read filehandle. To solve this, use the three-argument version of open:
open my $in, "<", $file or die "could not open $file: $!";
open my $out, ">", $file or die "could not open $file: $!";
Also note the use of lexical filehandles ($in) instead of the bareword file handles (FILE). Lexical filehandles have many benefits over bareword filehandles:
They are lexically scoped instead of global
They close when they go out of scope instead of at the end of the program
They are easier to pass to functions (ie you don't have to use a typeglob reference).
You use them just like you would use a bareword filehandle.
Other things you might want to consider:
1. use the strict pragma
2. use the warnings pragma
3. work on files a line or chunk at a time rather than reading them in all at once
4. use an HTML parser instead of regex
5. use named variables instead of the default variable ($_)
6. if you are using the default variable, don't include it where it is already going to be used (eg s/foo/bar/; instead of $_ =~ s/foo/bar/;)
Number 4 may be very important for what you are doing. If you are not certain of the format these HTML files are in, then you could easily miss things. For instance, "Authors Here" and "Authors\nHere" mean the same thing to HTML, but your regex will miss the latter. You might want to take a look at XML::Twig (I know it says XML, but it handles HTML as well). It is a very easy to use XML/HTML parser.
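Putting the three-argument open and lexical filehandles together, a minimal sketch of the in-place rewrite might look like this. It reads the whole file before reopening it, because opening the same path with ">" truncates it. The rewrite_file name and the temporary test file are assumptions made for the illustration, and it still uses regexes rather than an HTML parser.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read everything first, close the input handle, and only then reopen
# the same path for writing (opening with '>' truncates the file).
sub rewrite_file {
    my ($file) = @_;

    open my $in, '<', $file or die "could not open $file for reading: $!";
    my @lines = <$in>;
    close $in;

    for my $line (@lines) {
        $line =~ s/Authors Here/Authors introduced this subject for the first time in this paper./g;
        $line =~ s/D4-/D4: Is the supporting evidence described or cited?/g;
    }

    open my $out, '>', $file or die "could not open $file for writing: $!";
    print {$out} @lines;
    close $out or die "could not close $file: $!";
    return;
}

# Demonstration on a temporary file standing in for one of the HTML files.
my ($tmp, $path) = tempfile(SUFFIX => '.html', UNLINK => 1);
print {$tmp} "<p>Authors Here</p>\n<p>D4-</p>\n";
close $tmp;

rewrite_file($path);

open my $check, '<', $path or die "could not open $path: $!";
my $result = do { local $/; <$check> };
close $check;
print $result;
```

For the real files you would call rewrite_file($_) for glob "ACM_CCS/*.html". Writing to a separate temporary file and renaming it afterwards would be safer if the script can be interrupted halfway.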

PERL Net::DNS output to file

Completely new to Perl (in the process of learning) and need some help. Here is some code that I found which prints results to the screen great, but I want it printed to a file. How can I do this? When I open a file and send output to it, I get garbage data.
Here is the code:
use Net::DNS;
my $res = Net::DNS::Resolver->new;
$res->nameservers("ns.example.com");
my @zone = $res->axfr("example.com");
foreach $rr (@zone) {
$rr->print;
}
When I add:
open(my $fh, '>', $filename) or die "Could not open file '$filename' $!";
.....
$rr -> $fh; #I get garbage.
Your @zone array contains a list of Net::DNS::RR objects, whose print method stringifies the object and prints it to the currently selected file handle.
To print the same thing to a different file handle you will have to stringify the object yourself.
This should work:
open my $fh, '>', $filename or die "Could not open file '$filename': $!";
print $fh $_->string, "\n" for @zone;
When you're learning a new language, making random changes to code in the hope that they will do what you want is not a good idea. A far better approach is to read the documentation for the libraries and functions that you are using.
The original code uses $rr->print. The documentation for Net::DNS::Resolver says:
print
$resolver->print;
Prints the resolver state on the standard output.
The print() method there is named after the standard Perl print function which we can use to print data to any filehandle. There's a Net::DNS::Resolver method called string which is documented like this:
string
print $resolver->string;
Returns a string representation of the resolver state.
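The $rr objects in your loop are Net::DNS::RR records rather than the resolver itself, but that class has the same pair of print and string methods. So it looks like $rr->print is equivalent to print $rr->string. And it's simple enough to change that to print to your new filehandle.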
print $fh $rr->string;
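Because a print method like this writes to the currently selected filehandle, Perl's built-in select offers a second route. The sketch below compares both. My::Record is a hypothetical stand-in for Net::DNS::RR, so that the example runs without the module or a name server, and the records and in-memory filehandles are likewise made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Net::DNS::RR: string() returns the text and
# print() sends it to the currently selected filehandle.
package My::Record {
    sub new    { my ($class, $text) = @_; return bless { text => $text }, $class }
    sub string { my ($self) = @_; return $self->{text} }
    sub print  { my ($self) = @_; CORE::print($self->string, "\n"); return }
}

my @zone = map { My::Record->new($_) }
    'a.example.com. A 192.0.2.1',
    'b.example.com. A 192.0.2.2';

# Option 1: stringify each record and print it to the handle explicitly.
my $explicit = '';
open my $fh1, '>', \$explicit or die "Cannot open in-memory handle: $!";
print {$fh1} $_->string, "\n" for @zone;
close $fh1;

# Option 2: make the handle the selected one, call the print method,
# then restore whatever was selected before.
my $selected = '';
open my $fh2, '>', \$selected or die "Cannot open in-memory handle: $!";
my $previous = select $fh2;
$_->print for @zone;
select $previous;
close $fh2;

print "same output\n" if $explicit eq $selected;
```

With a real file you would open $fh with '>' and $filename instead of the in-memory scalars. Option 1 is usually easier to follow, since select changes global state.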
p.s. And, by the way, it's "Perl", not "PERL".

How to properly print non-English characters to a file with Perl?

I am using Perl to print some data read from one file to another. Sometimes I read in non-English characters, such as accented characters like é. However, doing:
print FILE_HANDLER "... $variable ...";
does not keep the accents. The é actually gets printed out as "é".
How can I print these characters out so that they're properly preserved? For more information, the files that I open and write to are done as such:
open READ_FILE, "<", "file.xml" or die $!;
open WRITE_FILE, ">", "file.txt" or die $!;
Thanks for all your help.
perldoc -f open says:
You may (and usually should) use the three-argument form of open to specify I/O layers (sometimes referred to as "disciplines") to apply to the handle that affect how the input and output are processed (see open and PerlIO for more details). For example:
open(my $fh, "<:encoding(UTF-8)", "filename")
|| die "can't open UTF-8 encoded filename: $!";
opens the UTF8-encoded file containing Unicode characters; see perluniintro.
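Applied to both handles from the question, a minimal sketch could look like the following. It assumes the input file really is UTF-8 encoded; the temporary directory and the sample text are made up so that the example is self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir      = tempdir(CLEANUP => 1);
my $in_path  = "$dir/file.xml";
my $out_path = "$dir/file.txt";

# Create a small UTF-8 input file; "\x{e9}" is the character é.
open my $setup, '>:encoding(UTF-8)', $in_path or die "Cannot create $in_path: $!";
print {$setup} "caf\x{e9}\n";
close $setup or die "Cannot close $in_path: $!";

# Decode on the way in, encode on the way out.
open my $read_fh,  '<:encoding(UTF-8)', $in_path  or die "Cannot open $in_path: $!";
open my $write_fh, '>:encoding(UTF-8)', $out_path or die "Cannot open $out_path: $!";
while (my $line = <$read_fh>) {
    print {$write_fh} "... $line";
}
close $read_fh;
close $write_fh or die "Cannot close $out_path: $!";

# Read the result back as raw bytes to inspect what was actually written.
open my $raw_fh, '<:raw', $out_path or die "Cannot open $out_path: $!";
my $bytes = do { local $/; <$raw_fh> };
close $raw_fh;
printf "%d bytes written\n", length $bytes;
```

If the input is in some other encoding, such as ISO-8859-1, that name goes into the layer on the read handle instead, while the write handle can stay UTF-8.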

Unicode in Perl not working

I have some text files which I am trying to transform with a Perl script on Windows. The text files look normal in Notepad+, but all the regexes in my script were failing to match. Then I noticed that when I open the text files in NotePad+, the status bar says "UCS-2 Little Endia" (sic). I am assuming this corresponds to the encoding UCS-2LE. So I created "readFile" and "writeFile" subs in Perl, like so:
use PerlIO::encoding;
my $enc = ':encoding(UCS-2LE)';
sub readFile {
my ($fName) = @_;
open my $f, "<$enc", $fName or die "can't read $fName\n";
local $/;
my $txt = <$f>;
close $f;
return $txt;
}
sub writeFile {
my ($fName, $txt) = @_;
open my $f, ">$enc", $fName or die "can't write $fName\n";
print $f $txt;
close $f;
}
my $fName = 'someFile.txt';
my $txt = readFile $fName;
# ... transform $txt using s/// ...
writeFile $fName, $txt;
Now the regexes match (although less often than I expect), but the output contains long strings of Asian-looking characters interspersed with long strings of the correct text. Is my code wrong? Or perhaps Notepad+ is wrong about the encoding? How should I proceed?
OK, I figured it out. The problem was being caused by a disconnect between the encoding translation done by the "encoding..." parameter of the "open" call and the default CRLF translation done by Perl on Windows. What appeared to be happening was that LF was being translated to CRLF on output after the encoding had already been done, which threw off the "parity" of the 16-bit encoding for the following line. Once the next line was reached, the "parity" got put back. That would explain the "long strings of Asian-looking characters interspersed with long strings of the correct text"... every other line was being messed up.
To correct it, I took out the encoding parameter in my "open" call and added a "binmode" call, as follows:
open my $f, $fName or die "can't read $fName\n";
binmode $f, ':raw:encoding(UCS-2LE)';
binmode apparently has a concept of "layered" I/O handling that is somewhat complicated.
One thing I can't figure out is how to get my CRLF translation back. If I leave out :raw or add :crlf, the "parity" problem returns. I've tried re-ordering as well and can't get it to work.
(I added this as a separate question: CRLF translation with Unicode in Perl)
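For reference, here is a minimal round-trip sketch of the :raw:encoding(UCS-2LE) layer stack described above. The temporary file and the sample text are assumptions for the illustration, and it does not attempt to restore the CRLF translation.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

my $text = "one\ntwo\n";    # 8 characters, two of them line feeds

# :raw first removes the platform's default translation (such as CRLF on
# Windows), then the encoding layer turns characters into 16-bit units.
open my $out, '>', $path or die "can't write $path: $!";
binmode $out, ':raw:encoding(UCS-2LE)';
print {$out} $text;
close $out or die "can't close $path: $!";

my $size = -s $path;        # two bytes per character

open my $in, '<', $path or die "can't read $path: $!";
binmode $in, ':raw:encoding(UCS-2LE)';
my $round_trip = do { local $/; <$in> };
close $in;

print "file size: $size bytes\n";
print "round trip ", ($round_trip eq $text ? "ok" : "FAILED"), "\n";
```

Every character, including each line feed, occupies exactly two bytes in the file, so no stray single byte can shift the alignment of the text that follows.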
I don't have the Notepad+ editor to check but it may be a BOM problem with your output encoding not containing a BOM.
http://perldoc.perl.org/Encode/Unicode.html#Size%2c-Endianness%2c-and-BOM
Maybe you need to encode $txt using a byte order mark as described above.

How can I create a new file using a variable value as the name in Perl?

Eg:
$variable = "10000";
for($i=0; $i<3;$i++)
{
$variable++;
$file = $variable."."."txt";
open output,'>$file' or die "Can't open the output file!";
}
This doesn't work. Please suggest a new way.
Everyone here has it right: you are using single quotes in your call to open. Single quotes do not interpolate variables into the quoted string. Double quotes do.
my $foo = 'cat';
print 'Why does the dog chase the $foo?'; # prints: Why does the dog chase the $foo?
print "Why does the dog chase the $foo?"; # prints: Why does the dog chase the cat?
So far, so good. But, the others have neglected to give you some important advice about open.
The open function has evolved over the years, as has the way that Perl works with filehandles. In the old days, open was always called with the mode and the file name combined in the second argument. The first argument was always a global filehandle.
Experience showed that this was a bad idea. Combining the mode and the filename in one argument created security problems. Using global variables, well, is using global variables.
Since Perl 5.6.0 you can use a 3 argument form of open that is much more secure, and you can store your filehandle in a lexically scoped scalar.
open my $fh, '>', $file or die "Can't open $file - $!\n";
print $fh "Goes into the file\n";
There are many nice things about lexical filehandles, but one excellent property is that they are automatically closed when their refcount drops to 0 and they are destroyed. There is no need to explicitly close them.
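The automatic close can be observed with a small sketch. It creates the files in a temporary directory (an assumption, to avoid littering the current directory) and checks their sizes only after each handle has gone out of scope.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
my @created;

for my $base (10_001 .. 10_003) {
    my $file = "$dir/$base.txt";
    open my $fh, '>', $file or die "Can't open $file - $!\n";
    print {$fh} "Goes into $base.txt\n";
    push @created, $file;
    # No explicit close: $fh goes out of scope at the end of each
    # iteration, which flushes and closes the file.
}

# By now every handle has been destroyed, so the data is on disk.
for my $file (@created) {
    printf "%s: %d bytes\n", $file, -s $file;
}
```

An explicit close is still useful when you want to check for write errors, since the implicit close has no way to report them.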
Something else worth noting is that it is considered by most of the Perl community that it is a good idea to always use the strict and warnings pragmas. Using them helps catch many bugs early in the development process and can be a huge time saver.
use strict;
use warnings;
for my $base ( 10_001..10_003 ) {
my $file = "$base.txt";
print "file: $file\n";
open my $fh,'>', $file or die "Can't open the output file: $!";
# Do stuff with handle.
}
I simplified your code a bit too. I used the range operator to generate your base numbers for the file names. Since we are working with numbers and not strings, I was able to use _ as a thousands separator to improve readability without affecting the final result. Finally, I used an idiomatic Perl for loop instead of the C-style for you had.
I hope you find this helpful.
Use double quotes: ">$file". Single quotes will not interpolate your variable.
$variable = "10000";
for($i=0; $i<3;$i++)
{
$variable++;
$file = $variable."."."txt";
print "file: $file\n";
open $output,">$file" or die "Can't open the output file!";
close($output);
}
The problem is that you're using single quotes for the second argument to open, and single-quoted strings do not interpolate variables mentioned in them. Perl interpreted your code as though you wanted to open a file that really had a dollar sign for the first character of its name. (Check your disk; you should see an empty file named $file there.)
You can avoid the issue by using the three-argument version of open:
open output, '>', $file
Then the file-name argument can't accidentally interfere with the open-mode argument, and there's no unnecessary variable interpolation or concatenation.
$variable = "10000";
for($i=0; $i<3;$i++)
{
$variable++;
$file = $variable . '.txt';
open output, ">$file" or die "Can't open the output file!";
}
This works and creates
10001.txt
10002.txt and so on ..
Use a file handle:
my $file = "whatevernameyouwant";
open (MYFILE, ">>$file");
print MYFILE "Bob\n";
close (MYFILE);
print '$file' yields $file, whereas print "$file" yields whatevernameyouwant.
You almost have it right, but there are a couple of issues.
1 - You need to use double quotes around the file you're opening.
open output,">$file" or die[...]
2 - A minor niggle: you don't close the files afterwards.
I'd rewrite your code something like this:
#!/usr/bin/perl
$variable = "10000";
for($i=0; $i<3;$i++) {
$variable++;
$file = $variable."."."txt";
open output,">$file" or die "Can't open the output file!";
close output;
}