How can I output UTF-8 from Perl? - perl

I am trying to write a Perl script using the utf8 pragma, and I'm getting unexpected results. I'm using Mac OS X 10.5 (Leopard), and I'm editing with TextMate. All of my settings for both my editor and operating system are defaulted to writing files in utf-8 format.
However, when I enter the following into a text file, save it as a ".pl", and execute it, I get the friendly "diamond with a question mark" in place of the non-ASCII characters.
#!/usr/bin/env perl -w
use strict;
use utf8;
my $str = 'Çirçös';
print( "$str\n" );
Any idea what I'm doing wrong? I expect to get 'Çirçös' in the output, but I get '�ir��s' instead.

use utf8; does not enable Unicode output - it enables you to type Unicode in your program. Add this to the program, before your print() statement:
binmode(STDOUT, ":utf8");
See if that helps. That should make STDOUT output in UTF-8 instead of ordinary ASCII.
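To see why the output layer matters, here is a minimal sketch (not part of the original answer) that counts characters and bytes for the string from the question. It assumes the script file itself is saved as UTF-8, and it uses the core Encode module only to make the byte count visible.

```perl
use strict;
use warnings;
use utf8;                      # the source file itself is saved as UTF-8
use Encode qw(encode);

# With "use utf8", this literal holds six characters.
my $str = 'Çirçös';

# encode() produces the byte sequence that UTF-8 output of $str consists of.
my $bytes = encode('UTF-8', $str);

# Print only ASCII so the demonstration does not depend on the terminal.
printf "characters: %d, UTF-8 bytes: %d\n", length($str), length($bytes);
```

The three non-ASCII letters take two bytes each in UTF-8, which is why a terminal that receives something other than those bytes shows replacement characters.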

You can use the open pragma.
For example, the following sets STDIN, STDOUT, and STDERR to use UTF-8:
use open qw/:std :utf8/;

TMTOWTDI; choose the method that best fits how you work. I use the environment method so I don't have to think about it.
In the environment:
export PERL_UNICODE=SDL
on the command line:
perl -CSDL -le 'print "\x{1815}"';
or with binmode:
binmode(STDOUT, ":utf8"); #treat as if it is UTF-8
binmode(STDIN, ":encoding(utf8)"); #actually check if it is UTF-8
or with PerlIO:
open my $fh, ">:utf8", $filename
or die "could not open $filename: $!\n";
open my $fh, "<:encoding(utf-8)", $filename
or die "could not open $filename: $!\n";
or with the open pragma:
use open ":encoding(utf8)";
use open IN => ":encoding(utf8)", OUT => ":utf8";
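The difference between "treat as UTF-8" and "actually check" can be illustrated without touching the real standard streams. The following is only a sketch under stated assumptions: it writes through an :encoding(UTF-8) layer into an in-memory buffer, then uses Encode's strict decoding (rather than an input layer) to show what checking means. The two-byte malformed sequence is an illustrative value, not something from the answer above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Write one character through an encoding layer into an in-memory buffer.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer
    or die "could not open in-memory file: $!\n";
print {$out} "\x{1815}";
close $out;
printf "U+1815 was written as %d bytes\n", length $buffer;

# Strict decoding refuses malformed input: \xC3 must be followed by a
# continuation byte, and \x28 is not one.
my $bad = "\xC3\x28";
my $accepted = eval {
    decode('UTF-8', $bad, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    1;
};
print "malformed input rejected\n" unless $accepted;
```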

You also want to say that the strings in your code are UTF-8. See Why does modern Perl avoid UTF-8 by default?. So set not only PERL_UNICODE=SDAL but also PERL5OPT=-Mutf8.

Thanks, I finally got a solution that doesn't require putting utf8::encode all over the code.
To synthesize and complete this for other cases, such as writing and reading files in UTF-8, here is a version that also works with LoadFile on a UTF-8 YAML file:
use utf8;
use open ':encoding(utf8)';
binmode(STDOUT, ":utf8");
open(FH, ">test.txt") or die "could not open test.txt: $!\n";
print FH "something éá";
use YAML qw(LoadFile Dump);
my $PUBS = LoadFile("cache.yaml");
my $f = "2917";
my $ref = $PUBS->{$f};
print "$f \"".$ref->{name}."\" ". $ref->{primary_uri}." ";
where cache.yaml is:
---
2917:
  id: 2917
  name: Semanário
  primary_uri: 2917.xml

Run this in your shell:
$ env |grep LANG
This will probably show that your shell is not using a utf-8 locale.

Related

How to write to an existing file in Perl?

I want to open an existing file on my desktop and write to it, but for some reason I can't do it in Ubuntu. Maybe I'm not writing the path correctly?
Is it possible without modules?
open(WF,'>','/home/user/Desktop/write1.txt');
$text = "I am writing to this file";
print WF $text;
close(WF);
print "Done!\n";
Opening with > truncates an existing file before writing. To add to a file while keeping its current contents, open it in append (>>) mode.
(Use the modern style for file handling, with a lexical filehandle and three-argument open.)
Here is the code snippet (tested in Ubuntu 20.04.1 with Perl v5.30.0):
#!/usr/bin/perl
use strict;
use warnings;
my $filename = '/home/vkk/Scripts/outfile.txt';
open(my $fh, '>>', $filename) or die "Could not open file '$filename' $!";
print $fh "Write this line to file\n";
close $fh;
print "done\n";
For more information, see the documentation for open, or "appending-to-files" by Gabor.
Please see the following code sample. It demonstrates some aspects of the correct usage of open and of environment variables, and it reports an error if the file cannot be opened for writing.
Note: Run a search in Google for Perl bookshelf
#!/usr/bin/env perl
#
# vim: ai ts=4 sw=4
#
use strict;
use warnings;
use feature 'say';
my $fname = $ENV{HOME} . '/Desktop/write1.txt';
my $text = 'I am writing to this file';
open my $fh, '>', $fname
or die "Can't open $fname: $!";
say $fh $text;
close $fh;
say 'Done!';
Documentation quote
About modes
When calling open with three or more arguments, the second argument -- labeled MODE here -- defines the open mode. MODE is usually a literal string comprising special characters that define the intended I/O role of the filehandle being created: whether it's read-only, or read-and-write, and so on.
If MODE is <, the file is opened for input (read-only). If MODE is >, the file is opened for output, with existing files first being truncated ("clobbered") and nonexisting files newly created. If MODE is >>, the file is opened for appending, again being created if necessary.
You can put a + in front of the > or < to indicate that you want both read and write access to the file; thus +< is almost always preferred for read/write updates--the +> mode would clobber the file first. You can't usually use either read-write mode for updating textfiles, since they have variable-length records. See the -i switch in perlrun for a better approach. The file is created with permissions of 0666 modified by the process's umask value.
These various prefixes correspond to the fopen(3) modes of r, r+, w, w+, a, and a+.
Documentation: open, close
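The difference between the > and >> modes quoted above can be checked with a throwaway file. This is a minimal sketch: write_with_mode and slurp are hypothetical helper names, and the temporary file comes from the core File::Temp module rather than the Desktop path in the question.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: open $path with the given mode, write $text, close.
sub write_with_mode {
    my ($mode, $path, $text) = @_;
    open my $fh, $mode, $path or die "Can't open $path: $!";
    print {$fh} $text;
    close $fh or die "Can't close $path: $!";
    return;
}

# Hypothetical helper: return the whole file content as one string.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    local $/;                  # read everything at once
    my $content = <$fh>;
    close $fh;
    return $content;
}

my (undef, $path) = tempfile(UNLINK => 1);

write_with_mode('>',  $path, "first\n");    # create or truncate, then write
write_with_mode('>>', $path, "second\n");   # keep content, add at the end
print slurp($path);                         # both lines are present

write_with_mode('>',  $path, "third\n");    # truncates the earlier content
print slurp($path);                         # only the last line remains
```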

Perl (wrong?) encoding of output file

I am running Active Perl 5.16.3 on Windows 7 (32 bits).
My (short) program massages an input text file (encoded in UTF-8). I wish the output encoding to be in Latin1, so my code is:
open (OUT, '>;encoding(Latin1)', "out.txt") || die "Cannot open output file: $!\n";
print OUT "$string\n";
yet the resulting file is still in UTF-8. What am I doing wrong?
Firstly, the encoding layer is separated from the open mode by a colon, not a semicolon.
open OUT, '>:encoding(latin1)', "out.txt" or die "Cannot open output file: $!\n";
Secondly, Latin-1 can only encode a small subset of Unicode. Furthermore, the ASCII part of this subset is encoded identically in Latin-1 and UTF-8. We therefore have to use a test file with characters that are not encoded the same, e.g. \N{MULTIPLICATION SIGN} U+00D7 ×, which is \xD7 in Latin-1, and \xC3\x97 in UTF-8.
Also make sure that you actually decode the input file.
Here is how you could generate the test file:
$ perl -CSA -E'say "\N{U+00D7}"' > input.txt
Here is how you can test that you are properly recoding the file:
use strict;
use warnings;
use autodie;
open my $in, "<:encoding(UTF-8)", "input.txt";
open my $out, ">:encoding(latin1)", "output.txt";
while (<$in>) {
print { $out } $_;
}
The input.txt and output.txt should afterwards have different lengths (3 bytes → 2 bytes).
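The same recoding can be checked in memory, without any files. This sketch uses the core Encode module directly instead of I/O layers, so it is an illustration of the byte-level difference rather than a replacement for the script above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The multiplication sign as it appears in a UTF-8 encoded input file.
my $utf8_bytes = "\xC3\x97";

# Decode the input bytes into a character, then encode it for the output.
my $char         = decode('UTF-8', $utf8_bytes);
my $latin1_bytes = encode('iso-8859-1', $char);

printf "U+%04X: %d byte(s) as UTF-8, %d byte(s) as Latin-1 (0x%02X)\n",
    ord($char), length($utf8_bytes), length($latin1_bytes), ord($latin1_bytes);
```

Together with the trailing newline, these one and two byte forms account for the 3 byte and 2 byte file sizes mentioned above.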

Reading Cyrillic characters from file in perl

I'm having trouble reading Cyrillic characters from a file in perl.
The text file is written in Notepad and contains "абвгдежзийклмнопрстуфхцчшщъьюя".
Here's my code:
#!/usr/bin/perl
use warnings;
use strict;
open FILE, "text.txt" or die $!;
while (<FILE>) {
print $_;
}
If I save the text file using the ANSI encoding, I get:
рстуфхцчшщъыьэюяЁёЄєЇїЎў°∙·№■
If I save it using the UTF-8 encoding, and I use the function decode('UTF-8', $_) from the package Encode, I get:
Wide character in print at test.pl line 11, <TEXT> line 1.
and a bunch of unreadable characters.
I'm using the command prompt in windows 7x64
You're decoding your inputs, but "forgot" to encode your outputs.
Your file is probably encoded using cp1251.
Your terminal expects cp866.
Use
use open ':std', ':encoding(cp866)';
use open IO => ':encoding(cp1251)';
open(my $FILE, '<', 'text.txt')
or die $!;
or
use open ':std', ':encoding(cp866)';
open(my $FILE, '<:encoding(cp1251)', 'text.txt')
or die $!;
Use :encoding(UTF-8) instead of :encoding(cp1251) if you saved as UTF-8.
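To make the decode/encode pairing concrete, here is a small sketch with the core Encode module. It assumes, as the answer does, that the file is in cp1251 and the console expects cp866; the three bytes stand for the first three letters of the test text.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# First three letters of the test text (U+0430..U+0432) as cp1251 bytes.
my $cp1251_bytes = "\xE0\xE1\xE2";

# Decode with the encoding the file was saved in ...
my $text = decode('cp1251', $cp1251_bytes);

# ... and encode with the encoding the console expects.
my $cp866_bytes = encode('cp866', $text);

printf "cp1251 %s -> cp866 %s\n",
    unpack('H*', $cp1251_bytes), unpack('H*', $cp866_bytes);

# Printing the cp1251 bytes unchanged makes a cp866 console show other letters.
my $misread = decode('cp866', $cp1251_bytes);
printf "misread as: %s\n", join ' ', map { sprintf 'U+%04X', ord } split //, $misread;
```

The misread code points are the letters that open the garbled output quoted in the question.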

Copy/rename images with utf8 names using csv file

I'm working on a script to batch rename and copy images based on a csv file. The csv consists of column 1: old name and column 2: new name. I want to use the csv file as input for the perl script so that it checks the old name and makes a copy using the new name into a new folder. The problem that (i think) I'm having has to do with the images. They contain utf8 characters like ß etc. When I run the script it prints out this: Barfu├ƒg├ñsschen where it should be Barfußgässchen and the following error:
Unsuccessful stat on filename containing newline at C:/Perl64/lib/File/Copy.pm line 148, <$INFILE> line 1.
Copy failed: No such file or directory at X:\Script directory\correction.pl line 26, <$INFILE> line 1.
I know it has to do with Binmode utf8 but even when i try a simple script (saw it here: How can I output UTF-8 from Perl?):
use strict;
use utf8;
my $str = 'Çirçös';
binmode(STDOUT, ":utf8");
print "$str\n";
it prints out this: Ãirþ÷s
This is my entire script. Can someone explain to me where I'm going wrong? (It's not the cleanest code because I was testing things out.)
use strict;
use warnings;
use File::Copy;
use utf8;
my $inputfile = shift || die "give input!\n";
#my $outputfile = shift || die "Give output!\n";
open my $INFILE, '<', $inputfile or die "In use / not found :$!\n";
#open my $OUTFILE, '>', $outputfile or die "In use / not found :$!\n";
binmode($INFILE, ":encoding(utf8)");
#binmode($OUTFILE, ":encoding(utf8)");
while (<$INFILE>) {
s/"//g;
my @elements = split /;/, $_;
my $old = $elements[1];
my $new = "new/$elements[3]";
binmode STDOUT, ':utf8';
print "$old | $new\n";
copy("$old","$new") or die "Copy failed: $!";
#copy("Copy.pm",\*STDOUT);
# my $output_line = join(";", @elements);
# print $OUTFILE $output_line;
#print "\n"
}
close $INFILE;
#close $OUTFILE;
exit 0;
You need to ensure every step of the process is using UTF-8.
When you create the input CSV, you need to make sure that it's saved as UTF-8, preferably without a BOM. Windows Notepad will add a BOM so try Notepad++ instead which gives you more control of the encoding.
You also have the problem that the Windows console is not UTF-8 compliant by default. See Unicode characters in Windows command line - how?. Either set the codepage with chcp 65001 or don't change the STDOUT encoding.
In terms of your code, the first error regarding the new line is probably due to the trailing new line from the CSV. Add chomp() after while (<$INFILE>) {
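The effect of the missing chomp can be sketched in isolation. parse_line is a hypothetical helper and the sample line is made up; the sketch assumes a semicolon-separated line with quoted fields, as in the script above.

```perl
use strict;
use warnings;

# Hypothetical helper: strip the line ending and the quotes, then split
# the line into its semicolon-separated fields.
sub parse_line {
    my ($line) = @_;
    chomp $line;               # without this, the last field keeps its "\n"
    $line =~ s/"//g;
    my @fields = split /;/, $line;
    return @fields;
}

my ($old, $new) = parse_line(qq{"old name.jpg";"new name.jpg"\n});
printf "[%s] -> [%s]\n", $old, $new;
```

Without the chomp, the last field would end in a newline, which is what the "Unsuccessful stat on filename containing newline" warning points at.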
Update:
To "address" the file you need to encode your filenames in the correct locale - See How do you create unicode file names in Windows using Perl and What is the universal way to use file I/O API with unicode filenames?. Assuming you're using Western 1252 / Latin, this means your copy command will look like this (encode comes from the Encode module, so add use Encode; at the top of the script):
copy(encode("cp1252", $old), encode("cp1252", $new))
Also, your open should also encode the filename:
open my $INFILE, '<', encode("cp1252", $inputfile)
Update 2:
As you're running in a DOS window, remove binmode(STDOUT, ":utf8"); and leave the default codepage in place.

How can I copy files with special characters in their names with Perl's File::Copy?

I am trying to copy all files in one location to a different location, using the copy command from the File::Copy module. The issue I am facing is that I have a file whose name contains a special character with ASCII value &#253, but on the Unix file system it is stored as ?. So my question is: will the copy or move command handle files with special characters in their names when copying or moving them to another location?
If not, what would be a possible workaround?
Note: I cannot create a file with special characters on Unix because the special characters are replaced with ?, and I cannot do so on Windows because there the special characters are replaced with the encoded value, &#253 in my case.
my $folderpath = 'the_path';
open my $IN, '<', 'path/to/infile';
my $total;
while (<$IN>) {
chomp;
my $size = -s "$folderpath/$_";
print "$_ => $size\n";
$total += $size;
}
print "Total => $total\n";
Courtesy: RickF Answer
Any suggestion would be highly appreciated.
Reference Question : Perl File Handling Question
As a workaround, I can suggest converting all unsupported characters to supported ones. This can be done in many ways. For example, you can use URI::Escape:
use URI::Escape;
my $new_file_name = uri_escape($weird_file_name);
Update:
Here is how I was able to copy a file by its UTF-8 name. I'm on Windows. I used Win32::GetANSIPathName to get the short file name, and then the copy worked:
use File::Copy;
use URI::Escape;
use Win32;
use utf8; ## tell perl that source code is in utf-8
use strict;
use warnings;
my $test_file = "IBMýSoftware.txt";
my $from_file = Win32::GetANSIPathName($test_file); ## get "short" name of file
my $to_file = uri_escape($test_file); ## name with special characters escaped
printf("copy [%s] -> [%s]\n", $from_file, $to_file);
copy($from_file, $to_file);
After copying all files to new names on Windows, you'll be able to work with them on Linux without problems.
Here are some hints about utf-8 file opening:
How do I create a Unicode directory on Windows using Perl?
With a utf8-encoded Perl script, can it open a filename encoded as GB2312?
Character 253 is ý. I guess that on your Unix system the locale is not set, or only the most primitive fall-back locale is in effect, and that is why you see a replacement character. If I am guessing correctly, the solution is simply to set the locale to something, preferably a UTF-8 locale since that can handle all characters; Perl shouldn't even enter into the problem.
> cat 3761218.pl
use utf8;
use strict;
use warnings FATAL => 'all';
use autodie qw(:all);
my $file_name = '63551_106640_63551 IBMýSoftware Delivery&Fulfillment(Div-61) Data IPS 08-20-2010 v3.xlsm';
open my $h, '>', $file_name;
> perl 3761218.pl
> ls 6*
63551_106640_63551 IBMýSoftware Delivery&Fulfillment(Div-61) Data IPS 08-20-2010 v3.xlsm
> LANG=C ls 6* # temporarily cripple locale so that the problem in the question is exhibited
63551_106640_63551 IBM??Software Delivery&Fulfillment(Div-61) Data IPS 08-20-2010 v3.xlsm
> locale | head -1 # show which locale I have set
LANG=de_DE.UTF-8
The following script works as expected for me:
#!/usr/bin/perl
use strict; use warnings;
use autodie;
use File::Copy qw( copy );
use File::Spec::Functions qw( catfile );
my $fname = chr 0xfd;
open my $out, '>', catfile($ENV{TEMP}, $fname);
close $out;
copy catfile($ENV{TEMP}, $fname) => catfile($ENV{HOME}, $fname);