Copy/rename images with utf8 names using csv file - perl

I'm working on a script to batch rename and copy images based on a CSV file. The CSV consists of column 1: old name and column 2: new name. I want to use the CSV file as input for the Perl script so that it looks up the old name and makes a copy under the new name in a new folder. The problem (I think) has to do with the image names: they contain non-ASCII characters like ß. When I run the script it prints Barfu├ƒg├ñsschen where it should print Barfußgässchen, followed by this error:
Unsuccessful stat on filename containing newline at C:/Perl64/lib/File/Copy.pm line 148, <$INFILE> line 1.
Copy failed: No such file or directory at X:\Script directory\correction.pl line 26, <$INFILE> line 1.
I know it has to do with binmode and utf8, but even when I try a simple script (seen here: How can I output UTF-8 from Perl?):
use strict;
use utf8;
my $str = 'Çirçös';
binmode(STDOUT, ":utf8");
print "$str\n";
it prints out this: Ãirþ÷s
This is my entire script. Can someone explain where I'm going wrong? (It's not the cleanest code because I was testing things out.)
use strict;
use warnings;
use File::Copy;
use utf8;
my $inputfile = shift || die "give input!\n";
#my $outputfile = shift || die "Give output!\n";
open my $INFILE, '<', $inputfile or die "In use / not found :$!\n";
#open my $OUTFILE, '>', $outputfile or die "In use / not found :$!\n";
binmode($INFILE, ":encoding(utf8)");
#binmode($OUTFILE, ":encoding(utf8)");
while (<$INFILE>) {
s/"//g;
my @elements = split /;/, $_;
my $old = $elements[1];
my $new = "new/$elements[3]";
binmode STDOUT, ':utf8';
print "$old | $new\n";
copy("$old","$new") or die "Copy failed: $!";
#copy("Copy.pm",\*STDOUT);
# my $output_line = join(";", @elements);
# print $OUTFILE $output_line;
#print "\n"
}
close $INFILE;
#close $OUTFILE;
exit 0;

You need to ensure every step of the process is using UTF-8.
When you create the input CSV, you need to make sure that it's saved as UTF-8, preferably without a BOM. Windows Notepad will add a BOM so try Notepad++ instead which gives you more control of the encoding.
You also have the problem that the Windows console is not UTF-8 compliant by default. See Unicode characters in Windows command line - how? Either set the codepage with chcp 65001 or don't change the STDOUT encoding.
In terms of your code, the first error (the one about the newline) is probably caused by the trailing newline on each CSV line. Add chomp; as the first statement inside while (<$INFILE>) {
Update:
To "address" the file you need to encode your filenames in the correct locale. See How do you create unicode file names in Windows using Perl and What is the universal way to use file I/O API with unicode filenames?. Assuming you're using Western 1252 / Latin, this means your copy command will look like this (encode comes from the Encode module, so add use Encode;):
copy(encode("cp1252", $old), encode("cp1252", $new))
Your open should encode the filename as well:
open my $INFILE, '<', encode("cp1252", $inputfile)
Update 2:
As you're running in a DOS window, remove binmode(STDOUT, ":utf8"); and leave the default codepage in place.
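Putting the points above together, a minimal sketch of the corrected loop might look like the following. It assumes file names on the system use code page 1252, keeps the column indices 1 and 3 from the question's script, and introduces two hypothetical helper names, names_from_line and copy_listed_files, purely for illustration. The command-line argument is passed to open unchanged here, on the assumption that it already arrives as a byte string.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Copy qw(copy);

# Assumption: file names on this system are stored in code page 1252.
my $FS_ENCODING = 'cp1252';

# Turn one decoded CSV line into (old name, new name).
# Columns 1 and 3 are the ones the question's script uses.
sub names_from_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;    # remove the trailing newline (LF or CRLF)
    $line =~ s/"//g;         # strip the quotes around each field
    my @elements = split /;/, $line;
    return ($elements[1], "new/$elements[3]");
}

# Read the CSV as UTF-8 and copy each file, encoding the names for the OS.
sub copy_listed_files {
    my ($csv) = @_;
    open my $in, '<:encoding(UTF-8)', $csv
        or die "Cannot open $csv: $!\n";
    while (my $line = <$in>) {
        my ($old, $new) = names_from_line($line);
        copy(encode($FS_ENCODING, $old), encode($FS_ENCODING, $new))
            or die "Copy failed for $old: $!\n";
    }
    close $in;
}

# Only touch the file system when a CSV file is given on the command line.
copy_listed_files($ARGV[0]) if @ARGV;
```

STDOUT is deliberately left alone in this sketch, in line with the advice to keep the console's default codepage.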

Related

How to write to an existing file in Perl?

I want to open an existing file in my desktop and write to it, for some reason I can't do it in ubuntu. Maybe I don't write the path exactly?
Is it possible without modules and etc.
open(WF,'>','/home/user/Desktop/write1.txt';
$text = "I am writing to this file";
print WF $text;
close(WF);
print "Done!\n";
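To add to an existing file without losing what it already contains, open it in append (>>) mode. Opening with > also writes to an existing file, but it truncates the file first.
(Also use the modern way to open a file, with a lexical filehandle:)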
Here is the code snippet (tested in Ubuntu 20.04.1 with Perl v5.30.0):
#!/usr/bin/perl
use strict;
use warnings;
my $filename = '/home/vkk/Scripts/outfile.txt';
open(my $fh, '>>', $filename) or die "Could not open file '$filename' $!";
print $fh "Write this line to file\n";
close $fh;
print "done\n";
For more information, see the documentation for open, or the article on appending to files by Gabor.
Please see the following code sample. It demonstrates some aspects of the correct usage of open and of environment variables, and it reports an error if the file cannot be opened for writing.
Note: Run a search in Google for Perl bookshelf
#!/usr/bin/env perl
#
# vim: ai ts=4 sw=4
#
use strict;
use warnings;
use feature 'say';
my $fname = $ENV{HOME} . '/Desktop/write1.txt';
my $text = 'I am writing to this file';
open my $fh, '>', $fname
or die "Can't open $fname: $!";
say $fh $text;
close $fh;
say 'Done!';
Documentation quote
About modes
When calling open with three or more arguments, the second argument -- labeled MODE here -- defines the open mode. MODE is usually a literal string comprising special characters that define the intended I/O role of the filehandle being created: whether it's read-only, or read-and-write, and so on.
If MODE is <, the file is opened for input (read-only). If MODE is >, the file is opened for output, with existing files first being truncated ("clobbered") and nonexisting files newly created. If MODE is >>, the file is opened for appending, again being created if necessary.
You can put a + in front of the > or < to indicate that you want both read and write access to the file; thus +< is almost always preferred for read/write updates--the +> mode would clobber the file first. You can't usually use either read-write mode for updating textfiles, since they have variable-length records. See the -i switch in perlrun for a better approach. The file is created with permissions of 0666 modified by the process's umask value.
These various prefixes correspond to the fopen(3) modes of r, r+, w, w+, a, and a+.
Documentation: open, close
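The difference between the > and >> modes described above can be checked with a small experiment. This is only a sketch: it writes to a temporary file created with File::Temp rather than to the Desktop file from the question, and write_with_mode and slurp are hypothetical helper names.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file that is removed when the program exits.
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# Open the scratch file in the given mode and write one string to it.
sub write_with_mode {
    my ($mode, $text) = @_;
    open my $fh, $mode, $path or die "Can't open $path: $!";
    print {$fh} $text;
    close $fh or die "Can't close $path: $!";
}

# Return the whole content of the scratch file.
sub slurp {
    open my $fh, '<', $path or die "Can't open $path: $!";
    local $/;    # read everything at once
    my $content = <$fh>;
    close $fh;
    return $content;
}

write_with_mode('>',  "first\n");     # creates/truncates, then writes
write_with_mode('>>', "second\n");    # keeps "first", adds "second"
my $after_append = slurp();

write_with_mode('>',  "third\n");     # truncates again: earlier lines are gone
my $after_truncate = slurp();

print $after_append, "---\n", $after_truncate;
```

After the append the file holds both lines; after the final > it holds only the last one.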

Perl (wrong?) encoding of output file

I am running Active Perl 5.16.3 on Windows 7 (32 bits).
My (short) program massages an input text file (encoded in UTF-8). I wish the output encoding to be in Latin1, so my code is:
open (OUT, '>;encoding(Latin1)', "out.txt") || die "Cannot open output file: $!\n";
print OUT "$string\n";
yet the resulting file is still in UTF-8. What am I doing wrong?
Firstly, the encoding layer is separated from the open mode by a colon, not a semicolon.
open OUT, '>:encoding(latin1)', "out.txt" or die "Cannot open output file: $!\n";
Secondly, Latin-1 can only encode a small subset of Unicode. Furthermore, the ASCII half of that subset is encoded identically in Latin-1 and UTF-8. We therefore have to use a test file with characters that are not encoded the same, e.g. \N{MULTIPLICATION SIGN} U+00D7 ×, which is \xD7 in Latin-1, and \xC3\x97 in UTF-8.
Also make sure that you actually decode the input file.
Here is how you could generate the test file:
$ perl -CSA -E'say "\N{U+00D7}"' > input.txt
Here is how you can test that you are properly recoding the file:
use strict;
use warnings;
use autodie;
open my $in, "<:encoding(UTF-8)", "input.txt";
open my $out, ">:encoding(latin1)", "output.txt";
while (<$in>) {
print { $out } $_;
}
The input.txt and output.txt should afterwards have different lengths (3 bytes → 2 bytes).
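The size difference comes entirely from the multiplication sign; the newline is one byte in both files. A minimal in-memory sketch using the core Encode module (the variable names are illustrative) shows the two encodings of that single character:

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char = "\x{D7}";    # U+00D7 MULTIPLICATION SIGN as a character string

my $latin1 = encode('ISO-8859-1', $char);    # the single byte \xD7
my $utf8   = encode('UTF-8',      $char);    # the two bytes \xC3\x97

printf "latin1: %d byte(s), utf8: %d byte(s)\n",
    length $latin1, length $utf8;
```

If the output file were still UTF-8, both lengths would match, which is a quick way to tell whether the encoding layer was applied.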

Reading Cyrillic characters from file in perl

I'm having trouble reading Cyrillic characters from a file in perl.
The text file is written in Notepad and contains "абвгдежзийклмнопрстуфхцчшщъьюя".
Here's my code:
#!/usr/bin/perl
use warnings;
use strict;
open FILE, "text.txt" or die $!;
while (<FILE>) {
print $_;
}
If I save the text file using the ANSI encoding, I get:
рстуфхцчшщъыьэюяЁёЄєЇїЎў°∙·№■
If I save it using the UTF-8 encoding, and I use the function decode('UTF-8', $_) from the package Encode, I get:
Wide character in print at test.pl line 11, <TEXT> line 1.
and a bunch of unreadable characters.
I'm using the command prompt in windows 7x64
You're decoding your inputs, but "forgot" to encode your outputs.
Your file is probably encoded using cp1251.
Your terminal expects cp866.
Use
use open ':std', ':encoding(cp866)';
use open IO => ':encoding(cp1251)';
open(my $FILE, '<', 'text.txt')
or die $!;
or
use open ':std', ':encoding(cp866)';
open(my $FILE, '<:encoding(cp1251)', 'text.txt')
or die $!;
Use :encoding(UTF-8) instead of :encoding(cp1251) if you saved as UTF-8.
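What the two layers do can be sketched in memory with the Encode module. The bytes below are an illustrative sample (the first three letters of the alphabet as cp1251 would store them), not the asker's actual file:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "абв" as stored in a cp1251 ("ANSI" on a Russian Windows) file.
my $file_bytes = "\xE0\xE1\xE2";

# Input side: bytes from the file become Perl characters.
my $text = decode('cp1251', $file_bytes);

# Output side: characters become the bytes a cp866 console expects.
my $console_bytes = encode('cp866', $text);

# Skipping both steps makes the console read the cp1251 bytes as cp866,
# which turns "абв" into "рст", the effect seen in the question.
my $misread = decode('cp866', $file_bytes);

printf "file bytes: %vX, console bytes: %vX\n", $file_bytes, $console_bytes;
```

The same letters are stored as E0 E1 E2 in cp1251 and as A0 A1 A2 in cp866, so printing the file bytes unchanged cannot work.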

Perl program keeps waiting when opening a file using utf8

I am trying to read a UTF-8 encoded xml file. This file is of size around 8M and contains only one line.
I used below line to open this single line xml file:
open(INP,"<:utf8","$infile") or die "Couldn't open file passed as input, $!";
local $/ = undef;
my $inputfile = <INP>;
print $inputfile; ## Not working..
But after this line the program gets stuck and keeps waiting.
I have tried other methods like binmode and decode, but I get the same issue.
The same program works when I change the file-opening code above to:
open(INP,"$infile") or die "Couldn't open file passed as input, $!";
local $/ = undef;
my $inputfile = <INP>;
print $inputfile; ## It works..
open(INP,"$infile") or die "Couldn't open file passed as input, $!";
binmode(INP, ":utf8");
local $/ = undef;
my $inputfile = <INP>;
print $inputfile; ## Not working..
Can you please help me understand what I am doing wrong here? I need to perform some operations on the input data and have to produce UTF-8 encoded output.
I tried your last snippet here (Ubuntu 12.04, perl 5.14.2) and it works as expected. The only problem I see is a difference between the input and the output: the input file is UTF-8 and the output is ISO-8859-1.
When I add
use utf8;
use open qw(:std :utf8);
this problem is gone, though. So this must be an environment issue.

How can I output UTF-8 from Perl?

I am trying to write a Perl script using the utf8 pragma, and I'm getting unexpected results. I'm using Mac OS X 10.5 (Leopard), and I'm editing with TextMate. All of my settings for both my editor and operating system are defaulted to writing files in utf-8 format.
However, when I enter the following into a text file, save it as a ".pl", and execute it, I get the friendly "diamond with a question mark" in place of the non-ASCII characters.
#!/usr/bin/env perl -w
use strict;
use utf8;
my $str = 'Çirçös';
print( "$str\n" );
Any idea what I'm doing wrong? I expect to get 'Çirçös' in the output, but I get '�ir��s' instead.
use utf8; does not enable Unicode output - it enables you to type Unicode in your program. Add this to the program, before your print() statement:
binmode(STDOUT, ":utf8");
See if that helps. That should make STDOUT output in UTF-8 instead of ordinary ASCII.
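The split between the two steps can be made visible with a short sketch. It assumes the script file itself is saved as UTF-8, and it uses the Encode module only to count the bytes that the output layer would produce:

```perl
use strict;
use warnings;
use utf8;                       # the literals in this file are UTF-8 text
use Encode qw(encode);

my $str   = 'Çirçös';           # six characters once decoded by "use utf8"
my $bytes = encode('UTF-8', $str);

# The output layer does the same encoding when the string is printed.
binmode(STDOUT, ':encoding(UTF-8)');
printf "%d characters, %d bytes\n", length $str, length $bytes;
print "$str\n";
```

With the source saved as UTF-8 this should report 6 characters and 9 bytes, since each of the three accented letters takes two bytes on output.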
You can use the open pragma.
For example, the line below sets STDIN, STDOUT and STDERR to use UTF-8:
use open qw/:std :utf8/;
TMTOWTDI, choose the method that best fits how you work. I use the environment method so I don't have to think about it.
In the environment:
export PERL_UNICODE=SDL
on the command line:
perl -CSDL -le 'print "\x{1815}"';
or with binmode:
binmode(STDOUT, ":utf8"); #treat as if it is UTF-8
binmode(STDIN, ":encoding(utf8)"); #actually check if it is UTF-8
or with PerlIO:
open my $fh, ">:utf8", $filename
or die "could not open $filename: $!\n";
open my $fh, "<:encoding(utf-8)", $filename
or die "could not open $filename: $!\n";
or with the open pragma:
use open ":encoding(utf8)";
use open IN => ":encoding(utf8)", OUT => ":utf8";
You also want to declare that the strings in your source code are UTF-8. See Why does modern Perl avoid UTF-8 by default?. So set not only PERL_UNICODE=SDAL but also PERL5OPT=-Mutf8.
Thanks, I finally found a solution that avoids putting utf8::encode all over the code.
To summarize and cover other cases, such as reading and writing files in UTF-8, here is a version that also works with LoadFile on a UTF-8 YAML file:
use utf8;
use open ':encoding(utf8)';
binmode(STDOUT, ":utf8");
open(FH, ">test.txt");
print FH "something éá";
use YAML qw(LoadFile Dump);
my $PUBS = LoadFile("cache.yaml");
my $f = "2917";
my $ref = $PUBS->{$f};
print "$f \"".$ref->{name}."\" ". $ref->{primary_uri}." ";
where cache.yaml is:
---
2917:
  id: 2917
  name: Semanário
  primary_uri: 2917.xml
Run this in your shell:
$ env |grep LANG
This will probably show that your shell is not using a utf-8 locale.