Perl and PostgreSQL UTF-8 problems with directories

I am having trouble with directory names containing UTF-8 characters on a Mac (10.11.2) with Perl 5.22 and PostgreSQL 9.4. The text encoding in PostgreSQL is set to UTF8.
If a directory name contains a non-ASCII UTF-8 character, I can chdir() to that directory when the name is read in by the Perl script (via readdir) or written as a literal string in the script. If I insert the same name into a PG table and read it back out (SELECT dirname FROM utfdirs), I can no longer chdir() to that directory. However, the on-screen printed strings are identical, a Perl cmp test on the two strings reports they are identical, and guess_encoding() reports both are UTF-8.
#!/opt/local/bin/perl5.22
use strict;
use Cwd;
use DBI;
use Encode;
use Encode qw/from_to/;
use Encode::Detect;
use Encode::Guess;
use Encode::UTF8Mac;
#
Encode::Guess->add_suspects(qw/utf-8-mac/);
#
my $dbname = 'test';
my $dbh = DBI->connect("dbi:Pg:dbname=$dbname;host=localhost");
$dbh->do("SET client_min_messages TO WARNING");
#
my $homeDir = '/Users/jldasch';
chdir($homeDir) or die "Cannot cd to [$homeDir]\n";
opendir(D,".");
my @tdlist = sort grep(/(Lambda?)|(Delta?)/,readdir(D));
closedir(D);
$dbh->do("DELETE FROM utfdirs");
my $ins = $dbh->prepare("INSERT INTO utfdirs (dirname) VALUES (?)");
foreach my $d (@tdlist) {
    chdir($homeDir);
    my $ok = chdir($d) ? 1 : 0;
    my $fp = "${homeDir}/${d}";
    printf("%2d %s\n",$ok,$fp);
    $ins->execute($fp);
}
my $rset = $dbh->selectall_arrayref("SELECT dirname FROM utfdirs ORDER BY dirname");
my $i = 0;
foreach my $r (@$rset) {
    my $dbdir = $r->[0];
    my $pdir = ${homeDir} . '/' . $tdlist[$i++];
    print "$r->[0] $pdir\n";
    my $encPerl = guess_encoding($pdir);
    my $encDb = guess_encoding($dbdir);
    print "Perl Encoding [$encPerl->{Name}]\n";
    print "Db Encoding [$encDb->{Name}]\n";
    unless ( chdir($dbdir) ) {
        print "Cannot CD to DbDir [$dbdir]\n";
        print "DbDir and PerlDir Match\n" if ($dbdir eq $pdir);
    }
}
exit;
The output:
bash-3.2$ ./utfstuff2.pl
1 /Users/jldasch/DeltaΔ
1 /Users/jldasch/Lambdaλ
/Users/jldasch/DeltaΔ /Users/jldasch/DeltaΔ
Perl Encoding [utf8]
Db Encoding [utf8]
Cannot CD to DbDir [/Users/jldasch/DeltaΔ]
DbDir and PerlDir Match
/Users/jldasch/Lambdaλ /Users/jldasch/Lambdaλ
Perl Encoding [utf8]
Db Encoding [utf8]
Cannot CD to DbDir [/Users/jldasch/Lambdaλ]
DbDir and PerlDir Match
So at the level I have checked so far, Perl is telling me the strings are the same (both cmp and guess_encoding() agree), and they print the same, yet they clearly are not the same.
How do I convert the UTF-8 string returned by PostgreSQL into a string which Perl accepts as a valid directory name for chdir()?

There is a module, Encode::UTF8Mac, which appears to solve this. The likely culprit is Unicode normalization: HFS+ stores filenames in a decomposed (NFD-style) form, so a composed string that prints identically is not byte-for-byte the same name.
my $macOkDir = Encode::decode('utf-8-mac', $dbDir);
– John Daschbach
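For reference, here is a minimal sketch of the normalization idea, assuming the NFC-vs-NFD mismatch is indeed the cause (the variable handling below is illustrative and not tested against the exact setup above):
use Encode qw(encode decode);
use Unicode::Normalize qw(NFD);

# $dbdir is the value read back from PostgreSQL. If DBD::Pg returned
# raw UTF-8 bytes, decode them to a character string first.
my $chars = Encode::is_utf8($dbdir) ? $dbdir : decode('UTF-8', $dbdir);

# HFS+ stores names in a decomposed form, so retry with the NFD
# spelling, encoded back to UTF-8 bytes, if the string as-is fails.
chdir($dbdir)
    or chdir( encode('UTF-8', NFD($chars)) )
    or die "Cannot chdir with either spelling of [$chars]\n";
The 'utf-8-mac' encoding from Encode::UTF8Mac, as used above, wraps up much the same normalization handling.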

Related

How to extract 4 letters from file names and use in substitution in multiple files

my $v3test;
my $rootDir = "C:\\";
$v3test = "$rootDir" . "test\\";
# directory
chdir $v3test;
opendir(V3, $v3test);
my @str0 = readdir V3;
my $str0 = @str0;
local $^I = '';
local @ARGV = glob "*.rnx";
# File names: GANS???????????????.rnx,
#             YONS???????????????.rnx,
#             GUMC???????????????.rnx
my $str5 = "CREF0001";
my $str6 = substr($ARGV[0], 0, 4);
# I want to extract 4 words from the file title
while (<>) {
    s/\Q$str5/$str6/g;
    print;
}
The *.rnx data is GPS data.
I want to extract 4 words from the *.rnx file title.
How can I do this?
Edit: It has been confirmed in comments that it is four letters, not words. Those should be used, with four spaces appended, to replace the string $str5 in all files.
The following replaces CREF0001 with a string derived from the name of the file.
So in files YONS...rnx the string CREF0001 is replaced by YONS followed by four spaces, while in all files named GANS...rnx the replacement is GANS followed by four spaces, etc.
With $^I set, the files are edited in place. I assign ~ to it so as to keep a backup, filename~. Assign an empty string '' instead if no backup is needed, but only once this has been well tested.
use warnings;
use strict;
use feature 'say';

# Assigning other than '' (empty string) keeps a backup
local $^I = '~';
local @ARGV = glob "*.rnx";

my $bad_data = 'CREF0001';

my $filename = $ARGV[0];    # initialize

# Replace $bad_data with this string
my $gps_name_string = substr($filename, 0, 4) . ' 'x4;

while (<>) {
    if ($filename ne $ARGV) {    # a new file
        $filename = $ARGV;
        $gps_name_string = substr($filename, 0, 4) . ' 'x4;
    }
    s/$bad_data/$gps_name_string/g;
    print;
}
This uses the $ARGV variable, which holds the name of the file currently being processed, to detect when the loop has started reading lines from the next file, so as to build the suitable replacement string.
I presume that there is a reason for using the local-ized @ARGV, and that is fine. I'd like to mention a couple of other options though:
Submit the glob on the command line, as progname *.rnx; this way @ARGV gets set and the while (<>) { } in the program processes lines from those files.
Build the file list as you do, using glob, but then process the files by name, not using <>:
use warnings;
use strict;
use Path::Tiny;    # for path()->edit_lines

my $bad_data = 'CREF0001';
my @files = glob "*.rnx";

foreach my $file (@files) {
    my $gps_name_string = substr($file, 0, 4) . ' 'x4;
    path($file)->edit_lines( sub { s/$bad_data/$gps_name_string/g } );
}
The edit_lines method applies the anonymous sub given as its argument, here containing just the regex, to each line and rewrites the file. See Path::Tiny. Or one can open the files normally and iterate over the lines as in the main text (except that now we know the filename); a sketch of that follows.
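Here is a minimal sketch of that open-per-file variant, writing each result to a temporary file and renaming it over the original (the .new suffix is only an illustrative choice):
use warnings;
use strict;

my $bad_data = 'CREF0001';

foreach my $file (glob "*.rnx") {
    my $gps_name_string = substr($file, 0, 4) . ' ' x 4;

    open my $in,  '<', $file       or die "Can't open $file: $!";
    open my $out, '>', "$file.new" or die "Can't open $file.new: $!";

    while (my $line = <$in>) {
        $line =~ s/$bad_data/$gps_name_string/g;
        print $out $line;
    }
    close $out;
    close $in;

    rename "$file.new", $file or die "Can't rename $file.new: $!";
}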

Perl Script Not Liking Date Extension

Why do I receive the error complaining about the parenthesis?
sh: syntax error at line 1 : `)' unexpected
when adding this date extension to the new file -- mv abc abc$(date +%Y%m%d%H%M%S)
It seems that it doesn't like that last parenthesis.
#!/usr/bin/perl
# =========================================== #
# Script to watch POEDIACK file size
#
# - Comments -
#
# script will check the file size of the POEDIACK file in
# $LAWDIR/$PLINE/edi/in.
# If it's > 1 gig, it will send notification via email
#
#
# =========================================== #
use strict;
use POSIX qw(strftime);
# get env vars from system
my $LAWDIR = $ENV{'LAWDIR'};
my $PLINE = $ENV{'PLINE'};
#my $email_file = "/lsf10/monitors/poediack.email";
my $curr_date = strftime('%m%d%Y', localtime);
my $ack_file = "$LAWDIR" . "/$PLINE" . "/edi/in/POEDIACK";
my $ack_location = "$LAWDIR" . "/$PLINE" . "/edi/in/";
my $mv_location = "$LAWDIR" . "/$PLINE" . "/edi/in/Z_files";
my $ack_file_limit = 10;
#my $ack_file_limit = 1000000000;
my $ack_file_size;
if( -e $ack_file)
{
    $ack_file_size = -s $ack_file;
    if ( $ack_file_size > $ack_file_limit )
    {
        `compress -vf $ack_file`;
        `mv $mv_location\$ack_file.Z $mv_location\$ack_file.Z.$(date +%Y%m%d%H%M%S)`;
    }
}
else
{
    print "POEDIACK File not found: $ack_file\n";
}
### end perl script ###
$( is being interpreted as a variable. It is the group ID of the process. You need to escape it.
And you probably shouldn't escape $ack_file.
`mv $mv_location$ack_file.Z $mv_location$ack_file.Z.\$(date +%Y%m%d%H%M%S)`;
It's safer and faster to avoid complicated shell commands and use rename instead.
use autodie;
my $timestamp = strftime('%Y%m%d%H%M%S', localtime);
rename "$mv_location$ack_file.Z", "$mv_location$ack_file.Z.$timestamp";
Or use an existing log rotator.
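If you want to skip the shell altogether, a minimal sketch along those lines might look like this; it reuses $ack_file and $mv_location from the script above, and assumes the intent is to park the compressed POEDIACK.Z under $mv_location with a timestamp suffix (the use lines go at the top of the script):
use POSIX qw(strftime);
use File::Basename qw(basename);

# List form of system: no shell is involved, so nothing in the
# arguments gets re-interpreted. compress replaces the file with
# $ack_file.Z in the same directory.
system('compress', '-vf', $ack_file) == 0
    or die "compress failed: $?";

# Rename the .Z file into the archive directory with a timestamp.
my $timestamp = strftime('%Y%m%d%H%M%S', localtime);
my $dest = "$mv_location/" . basename($ack_file) . ".Z.$timestamp";
rename "$ack_file.Z", $dest
    or die "Cannot rename $ack_file.Z to $dest: $!";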

Cannot call pdflatex from perl script (due to encoding?)

When I call pdflatex manually from the windows command line, it generates the desired pdf.
When I call pdflatex from a perl script instead, it does not:
system("pdflatex $fileName");
.. results in
Sorry, but pdflatex did not succeed.
You may want to visit the MiKTeX project page, if you need help.
utf8 "\x80" does not map to Unicode at C:/strawberry-perl/perl/site/lib/Encode.pm line 200.
The script was running on Unix before and working fine. Now, after migrating it to a Windows system, it doesn't.
The content of the TeX input file is generated by the script as well. The file command on my Mac tells me that this file is encoded as "us-ascii".
So I tried to make Perl encode it as "utf-8", but it did not work:
open(FH, "> :encoding(utf-8)", $fileName);
or
binmode(FH, ":utf8");
Files are still being generated with us-ascii encoding. How can I change that?
So far, the encoding is my only clue.
What else could be the problem?
If this works fine when typed manually into the command line, then this could be due to the way Perl interpolates the quotation marks before passing the command to the system. Have you tried printing the call you are making, to test whether it produces exactly the same input as when you enter it manually? Otherwise, for passing arguments to a program via the system command in Perl, I always separate them out as follows to avoid any interpolation errors:
#...
my $prog = "Z.*";
my $arg1 = "X";
my $arg2 = "Y";
#...
my $file = "W.*";
system("$prog", ("$arg1", "$arg2", ..., "$file"));
#...
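Applied to this case, a minimal sketch of that list-form call might look like the following (the -interaction flag is only an example option, not something taken from the question):
use strict;
use warnings;

my $fileName = 'report.tex';    # hypothetical input file

# Passing a list to system() avoids the shell, so quoting and special
# characters in $fileName cannot be re-interpreted on the way through.
my $status = system('pdflatex', '-interaction=nonstopmode', $fileName);
die "pdflatex failed (exit status $status)\n" if $status != 0;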
If this doesn't work, another, albeit rather clunky, solution might be to read the file contents into a variable and 'manually' encode them in Perl as follows:
use Encode;
use utf8;
use charnames qw( :full :short );
my $encodedfile = encode("utf8", $filecontents);
If you happen to have any active characters in the file which could influence the way pdflatex handles the final output (for example, in Perl \\ gives \ to pdflatex), you can append the following to the encoding:
my $str = $encodedfile;
my $find = "\\N{U+005C}";
my $replace = "\\textbackslash ";
$str =~ s/$find/$replace/g;
my %special_characters;
$special_characters{"\\N{U+0025}"} = "\\pourcent ";
$special_characters{"\\\$"} = "\\\$";
$special_characters{"\\N{U+007B}"} = "\\{";
$special_characters{"\N{U+007D}"} = "\\}";
$special_characters{"\N{U+0026}"} = "\\&";
$special_characters{"\\N{U+005F}"} = "\\textunderscore ";
$special_characters{"\\N{U+002F}"} = "\/";
$special_characters{"\\N{U+005B}"} = "\[";
$special_characters{"\\N{U+005D}"} = "\]";
$special_characters{"\\N{U+005E}"} = "\\textasciicircum ";
$special_characters{"\\N{U+0023}"} = "\\#";
$special_characters{"\\\N{U+007E}"} = "\\textasciitilde ";
$special_characters{"\\\N{U+0021}"} = " \\newline ";
my $string = $str;
foreach my $char (keys %special_characters) {
    $string =~ s/$char/$special_characters{$char}/g;
}
Hope this helps.

Cleanup failed with tempfile using perl

I needed to use iconv to convert the character encoding of some files generated on Windows. Sometimes those files are very big and execution fails because it runs out of RAM. Googling, I found a script called iconv-chunks.pl, which is basically a Perl script that processes the file in chunks and works pretty well, but it generates temporary files in my /tmp folder.
The problem is that this script runs automatically every day for many files, and it keeps leaving garbage in my /tmp dir even though it has the cleanup flag ON.
The script I'm talking about is:
https://code.google.com/p/clschool-team4/source/browse/trunk/iconv-chunks.pl?r=53
#!/usr/bin/perl
our $CHUNK_SIZE = 1024 * 1024 * 100; # 100M
=head1 NAME
iconv-chunks - Process huge files with iconv
=head1 SYNOPSIS
iconv-chunks <filename> [iconv-options]
=head1 DESCRIPTION
The standard iconv program reads the entire input file into
memory, which doesn't work for large files (such as database exports).
This script is just a wrapper that processes the input file
in manageable chunks and writes it to standard output.
The first argument is the input filename (use - to specify standard input).
Anything else is passed through to iconv.
The real iconv needs to be somewhere in your PATH.
=head1 EXAMPLES
# Convert latin1 to utf-8:
./iconv-chunks database.txt -f latin1 -t utf-8 > out.txt
# Input filename of - means standard input:
./iconv-chunks - -f iso8859-1 -t utf8 < database.txt > out.txt
# More complex example, using compressed input/output to minimize disk use:
zcat database.txt.gz | ./iconv-chunks - -f iso8859-1 -t utf8 | \
gzip - > database-utf.dump.gz
=head1 AUTHOR
Maurice Aubrey <maurice.aubrey+iconv@gmail.com>
=cut
# $Id: iconv-chunks 6 2007-08-20 21:14:55Z mla $
use strict;
use warnings;
use bytes;
use File::Temp qw/ tempfile /;
# iconv errors:
# iconv: unable to allocate buffer for input: Cannot allocate memory
# iconv: cannot open input file `database.txt': File too large
@ARGV >= 1 or die "Usage: $0 <inputfile> [iconv-options]\n";
my @options = splice @ARGV, 1;
my($oh, $tmp) = tempfile(undef, CLEANUP => 1);
# warn "Tempfile: $tmp\n";
my $iconv = "iconv @options $tmp";
sub iconv { system($iconv) == 0 or die "command '$iconv' failed: $!" }
my $size = 0;
# must read by line to ensure we don't split multi-byte character
while (<>) {
    $size += length $_;
    print $oh $_;
    if ($size >= $CHUNK_SIZE) {
        iconv;
        truncate $oh, 0 or die "truncate '$tmp' failed: $!";
        seek $oh, 0, 0 or die "seek on '$tmp' failed: $!";
        $size = 0;
    }
}
iconv if $size > 0;
Any help finding the problem, or a way to make it delete the temporary files after finishing?
Regards
Change
my($oh, $tmp) = tempfile(undef, CLEANUP => 1);
to
my($oh, $tmp) = tempfile(UNLINK => 1);
CLEANUP is used to trigger removal of temporary directories on exit, not files. Note that passing undef as the first argument in order to use the default template is unnecessary.
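For illustration, a minimal sketch of the two options side by side (the file and directory here are throwaways):
use strict;
use warnings;
use File::Temp qw(tempfile tempdir);

# UNLINK => 1 asks File::Temp to remove this temporary *file*
# automatically when the program exits.
my ($fh, $tmpname) = tempfile(UNLINK => 1);
print {$fh} "scratch data\n";

# CLEANUP => 1 is the corresponding option for temporary *directories*.
my $dir = tempdir(CLEANUP => 1);

print "temp file: $tmpname\ntemp dir: $dir\n";
# Both are removed automatically at program exit.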

How do I copy a file with a UTF-8 filename to another UTF-8 filename in Perl on Windows?

For example, given an empty file テスト.txt, how would I make a copy called テスト.txt.copy?
My first crack at it managed to access the file and create the new filename, but the copy generated ãƒ†ã‚¹ãƒˆ.txt.copy.
Here was my first crack at it:
#!/usr/bin/env perl
use strict;
use warnings;
use English '-no_match_vars';
use File::Basename;
use Getopt::Long;
use File::Copy;
use Win32;
my (
    $output_relfilepath,
) = process_command_line();

open my $fh, '>', $output_relfilepath or die $!;
binmode $fh, ':utf8';

foreach my $short_basename ( glob( '*.txt') ) {

    # skip the output basename if it's in the glob
    if ( $short_basename eq $output_relfilepath ) {
        next;
    }

    my $long_basename = Win32::GetLongPathName( $short_basename );
    my $new_basename = $long_basename . '.copy';

    print {$fh} sprintf(
        "short_basename = (%s)\n" .
        " long_basename = (%s)\n" .
        " new_basename = (%s)\n",
        $short_basename,
        $long_basename,
        $new_basename,
    );

    copy( $short_basename, $new_basename );
}

printf(
    "\n%s done! (%d seconds elapsed)\n",
    basename( $0 ),
    time() - $BASETIME,
);

# === subroutines ===

sub process_command_line {

    # default arguments
    my %args
        = (
            output_relfilepath => 'output.txt',
        );

    GetOptions(
        'help'                 => sub { print usage(); exit },
        'output_relfilepath=s' => \$args{output_relfilepath},
    );

    return (
        $args{output_relfilepath},
    );
}

sub usage {
    my $script_name = basename $0;

    my $usage = <<END_USAGE;
======================================================================
Test script to copy files with a UTF-8 filenames to files with
different UTF-8 filenames. This example tries to make copies of all
.txt files with versions that end in .txt.copy.
usage: ${script_name} (<options>)
options:
-output_relfilepath <s> set the output relative file path to <s>.
this file contains the short, long, and
new basenames.
(default: 'output.txt')
----------------------------------------------------------------------
examples:
${script_name}
======================================================================
END_USAGE
    return $usage;
}
Here are the contents of output.txt after execution:
short_basename = (BD9A~1.TXT)
long_basename = (テスト.txt)
new_basename = (テスト.txt.copy)
I've tried replacing File::Copy's copy command with a system call:
my $cmd = "copy \"${short_basename}\" \"${new_basename}\"";
print `$cmd`;
and with Win32::CopyFile:
Win32::CopyFile( $short_basename, $new_basename, 'true' );
Unfortunately, I get the same result in both cases (ãƒ†ã‚¹ãƒˆ.txt.copy). For the system call, the print shows 1 file(s) copied. as expected.
Notes:
I'm running Perl 5.10.0 via Strawberry Perl on Windows 7 Professional
I use the Win32 module to access long filenames
The glob returns short filenames, which I have to use to access the file
テスト = test (tesuto) in katakana
I've read perlunitut and The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
This should be possible with the CopyFileW function from Win32API::File, which should be included with Strawberry. I've never messed with Unicode filenames myself, so I'm not sure of the details. You might need to use Encode to manually convert the filename to UTF-16LE (encode('UTF-16LE', $filename)).
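A rough, untested sketch of that idea (the trailing NUL terminator and the exact argument expectations of the ...W functions are assumptions based on their documentation, not verified here):
use strict;
use warnings;
use Encode qw(encode);
use Win32API::File qw(CopyFileW fileLastError);

my $src = "\x{30C6}\x{30B9}\x{30C8}.txt";         # テスト.txt
my $dst = "\x{30C6}\x{30B9}\x{30C8}.txt.copy";    # テスト.txt.copy

# The wide (...W) Win32 APIs work with NUL-terminated UTF-16LE strings,
# so encode the character strings and append a wide NUL.
my $src_w = encode('UTF-16LE', $src) . "\0\0";
my $dst_w = encode('UTF-16LE', $dst) . "\0\0";

CopyFileW($src_w, $dst_w, 0)    # 0 = allow overwriting an existing target
    or die "CopyFileW failed: " . fileLastError() . "\n";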
You're getting the long filename using Win32, which gives you a UTF-8-encoded string.
However, you're then setting the long filename using plain copy, which uses the C stdlib IO functions. The stdlib functions use the default filesystem encoding.
On modern Linuxes that's usually UTF-8, but on Windows it (sadly) never is, because the system default code page cannot be set to UTF-8. So you'll get your UTF-8 string interpreted as a code page 1252 string on a Western European Windows install, as has happened here. (On a Japanese machine it'd get interpreted as code page 932 — like Shift-JIS — which would come out something like 繝�せ繝�.)
I've not done this in Perl, but I'd suspect the Win32::CopyFile function would be more likely to be able to handle the kind of Unicode paths returned elsewhere in the Win32 module.
Use Encode::Locale:
use Encode::Locale;
use Encode;
use File::Copy;
copy( encode(locale_fs => $short_basename),
encode(locale_fs => $new_basename) ) || die $!;
I successfully duplicated your problem on my Windows machine (Windows XP, Simplified Chinese version) and my conclusion is that the problem is caused by the font. Choose a TrueType font rather than Raster fonts and see if everything is okay.
My experiment is this:
I first changed the code page of my Windows Console from the default 936 (GBK) to 65001 (UTF-8).
by typing C:\>chcp 65001
I wrote a script that contains the code $a = "テスト"; print $a; and saved it as UTF-8.
I ran the script from the Console and found that "テスト" came out garbled, which is exactly the same symptom you described in your question.
I changed the Console font from Raster Fonts to Lucida Console; the console screen gave me this: "テストストトト", which is still not quite right, but I assume it is getting closer to the core of the problem.
So although I'm not 100% sure, the problem is probably caused by the font.
Hope this helps.
See https://metacpan.org/pod/Win32::Unicode
#!/usr/bin/perl --
use utf8;
use strict;
use warnings;
my @kebabs = (
    "\x{45B}\x{435}\x{432}\x{430}\x{43F}.txt",                ## ћевап.txt
    "ra\x{17E}nji\x{107}.txt",                                ## ražnjić.txt
    "\x{107}evap.txt",                                        ## ćevap.txt
    "\x{43A}\x{435}\x{431}\x{430}\x{43F}\x{447}\x{435}.txt",  ## кебапче.txt
    "kebab.txt",
);
{
    use Win32::Unicode qw/ -native /;

    printW "I \x{2665} Perl"; # unicode console out

    mkpathW 'meat';
    chdirW 'meat';

    for my $kebab ( @kebabs ){
        printW "kebabing the $kebab\n";
        open my($fh), '>:raw', $kebab or dieW Fudge($kebab);
        print $fh $kebab or dieW Fudge($kebab);
        close $fh or dieW Fudge($kebab);
    }
}

sub Fudge {
    use Errno();
    join qq/\n/,
        "Error @_",
        map { " $_" }
            int( $! ) . q/ / . $!,
            int( $^E ) . q/ / . $^E,
            grep( { $!{$_} } keys %! ),
            q/ /;
}