Cannot call pdflatex from perl script (due to encoding?) - perl

When I call pdflatex manually from the windows command line, it generates the desired pdf.
When I call pdflatex from a perl script instead, it does not:
system("pdflatex $fileName");
.. results in
Sorry, but pdflatex did not succeed.
You may want to visit the MiKTeX project page, if you need help.
utf8 "\x80" does not map to Unicode at C:/strawberry-perl/perl/site/lib/Encode.pm line 200.
The script was running on Unix before and worked fine. Now, after migrating it to a Windows system, it doesn't.
The content of the tex input file is generated by the script as well. The "file" command on my Mac tells me that this file is encoded as "us-ascii".
So I tried to make perl encode it as "utf-8", but it did not work:
open(FH, "> :encoding(utf-8)", $fileName);
or
binmode(FH, ":utf8");
Files are still being generated with us-ascii encoding. How can I change that?
So far, the encoding is my only clue.
What else could be the problem?

If this works fine when typed manually into the command line, then the problem could be due to the way Perl interpolates the quotation marks before passing the command to the system. Have you tried printing the call you are making, to test whether it provides exactly the same input as when you enter it manually? Otherwise, when passing arguments to a program via the system command in Perl, I always separate them out as follows to avoid any interpolation errors:
#...
my $prog = "Z.*";
my $arg1 = "X";
my $arg2 = "Y";
#...
my $file = "W.*";
system("$prog", ("$arg1", "$arg2", ..., "$file"));
#...
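The advice above (print the call, then pass the arguments as a list) can be sketched as follows. This is only an illustration: run_command is a hypothetical helper name, and the demonstration launches the running perl binary ($^X) instead of pdflatex so that it works without a TeX installation. Note that on Windows the list form avoids the shell only where possible.

```perl
use strict;
use warnings;

# Hypothetical helper: run a command given as a list and return its exit code.
# Passing a list to system() means Perl does not have to build one quoted string.
sub run_command {
    my @cmd = @_;
    print "Running: @cmd\n";        # show exactly what is being executed
    my $status = system(@cmd);
    return -1 if $status == -1;     # the program could not be started
    return $status >> 8;            # exit code of the child process
}

# Demonstration with the current perl binary, which is always available.
my $exit_code = run_command($^X, '-e', 'exit 3');
print "Exit code: $exit_code\n";
```

For the case in the question, the call would be run_command('pdflatex', $fileName), and the printed line can be compared with what is typed manually.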
If this doesn't work, another, albeit rather clunky, solution might be to read the file contents into a variable and encode them 'manually' in Perl as follows:
use Encode;
use utf8;
use charnames qw( :full :short );
my $encodedfile = encode("utf8", $filecontents);
If you happen to have any active characters in the file which could influence the way pdflatex handles the final output (for example, in Perl \\ gives \ to pdflatex), you can append the following to the encoding:
my $str = $encodedfile;
my $find = "\\N{U+005C}";
my $replace = "\\textbackslash ";
$str =~ s/$find/$replace/g;
my %special_characters;
$special_characters{"\\N{U+0025}"} = "\\pourcent ";
$special_characters{"\\\$"} = "\\\$";
$special_characters{"\\N{U+007B}"} = "\\{";
$special_characters{"\N{U+007D}"} = "\\}";
$special_characters{"\N{U+0026}"} = "\\&";
$special_characters{"\\N{U+005F}"} = "\\textunderscore ";
$special_characters{"\\N{U+002F}"} = "\/";
$special_characters{"\\N{U+005B}"} = "\[";
$special_characters{"\\N{U+005D}"} = "\]";
$special_characters{"\\N{U+005E}"} = "\\textasciicircum ";
$special_characters{"\\N{U+0023}"} = "\\#";
$special_characters{"\\\N{U+007E}"} = "\\textasciitilde ";
$special_characters{"\\\N{U+0021}"} = " \\newline ";
my $string = $str;
foreach my $char (keys %special_characters) {
$string =~ s/$char/$special_characters{$char}/g;
}
Hope this helps.

Related

Perl CGI produces unexpected output

I have a Perl CGI script for online concordance application that searches for an instance of word in a text and prints the sorted output.
#!/usr/bin/perl -wT
# middle.pl - a simple concordance
# require
use strict;
use diagnostics;
use CGI;
# ensure all fatals go to browser during debugging and set-up
# comment this BEGIN block out on production code for security
BEGIN {
$|=1;
print "Content-type: text/html\n\n";
use CGI::Carp('fatalsToBrowser');
}
# sanity check
my $q = new CGI;
my $target = $q->param("keyword");
my $radius = $q->param("span");
my $ordinal = $q->param("ord");
my $width = 2*$radius;
my $file = 'concordanceText.txt';
if ( ! $file or ! $target ) {
print "Usage: $0 <file> <target>\n";
exit;
}
# initialize
my $count = 0;
my @lines = ();
$/ = ""; # Paragraph read mode
# open the file, and process each line in it
open(FILE, " < $file") or die("Can not open $file ($!).\n");
while(<FILE>){
# re-initialize
my $extract = '';
# normalize the data
chomp;
s/\n/ /g; # Replace new lines with spaces
s/\b--\b/ -- /g; # Add spaces around dashes
# process each item if the target is found
while ( $_ =~ /\b$target\b/gi ){
# find start position
my $match = $1;
my $pos = pos;
my $start = $pos - $radius - length($match);
# extract the snippets
if ($start < 0){
$extract = substr($_, 0, $width+$start+length($match));
$extract = (" " x -$start) . $extract;
}else{
$extract = substr($_, $start, $width+length($match));
my $deficit = $width+length($match) - length($extract);
if ($deficit > 0) {
$extract .= (" " x $deficit);
}
}
# add the extracted text to the list of lines, and increment
$lines[$count] = $extract;
++$count;
}
}
sub removePunctuation {
my $string = $_[0];
$string = lc($string); # Convert to lowercase
$string =~ s/[^-a-z ]//g; # Remove non-aplhabetic characters
$string =~ s/--+/ /g; #Remove 2+ hyphens with a space
$string =~s/-//g; # Remove hyphens
$string =~ s/\s=/ /g;
return($string);
}
sub onLeft {
#USAGE: $word = onLeft($string, $radius, $ordinal);
my $left = substr($_[0], 0, $_[1]);
$left = removePunctuation($left);
my @word = split(/\s+/, $left);
return($word[-$_[2]]);
}
sub byLeftWords {
my $left_a = onLeft($a, $radius, $ordinal);
my $left_b = onLeft($b, $radius, $ordinal);
lc($left_a) cmp lc($left_b);
}
# process each line in the list of lines
print "Content-type: text/plain\n\n";
my $line_number = 0;
foreach my $x (sort byLeftWords @lines){
++$line_number;
printf "%5d",$line_number;
print " $x\n\n";
}
# done
exit;
The perl script produces expected result in terminal (command line). But the CGI script for online application produces unexpected output. I cannot figure out what mistake I am making in the CGI script. The CGI script should ideally produce the same output as the command line script. Any suggestion would be very helpful.
Command Line Output
CGI Output
The BEGIN block executes before anything else and thus before
my $q = new CGI;
The output goes to the server process' stdout and not to the HTTP stream, so the default is text/plain as you can see in the CGI output.
After you solve that problem you'll find that the output still looks like a big ugly block because you need to format and send a valid HTML page, not just a big block of text. You cannot just dump a bunch of text to the browser and expect it to do anything intelligent with it. You must create a complete HTML page with tags to layout your content, probably with CSS as well.
In other words, the output required will be completely different from the output when writing only to the terminal. How to structure it is up to you, and explaining how to do that is out of scope for StackOverflow.
As the other answers state, the BEGIN block is executed at the very start of your program.
BEGIN {
$|=1;
print "Content-type: text/html\n\n";
use CGI::Carp('fatalsToBrowser');
}
There, you output an HTTP header Content-type: text/html\n\n. The browser sees that first, and treats all your output as HTML. But you only have text. Whitespace in an HTML page is collapsed into single spaces, so all your \n line breaks disappear.
Later, you print another header, the browser cannot see that as a header any more, because you already had one and finished it off with two newlines \n\n. It's now too late to switch back to text/plain.
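The point about a single header block can be sketched like this. It is a minimal illustration, not the asker's script: build_response is a hypothetical helper, and the body text is made up. The response contains exactly one Content-Type line, followed by the blank line that ends the headers.

```perl
use strict;
use warnings;

# Hypothetical helper: build a complete CGI response with exactly one header block.
sub build_response {
    my ($content_type, $body) = @_;
    # The blank line ends the headers; everything after it is the body.
    return "Content-Type: $content_type\n\n" . $body;
}

# With text/plain the browser keeps the line breaks and the column alignment.
print build_response('text/plain', "    1  first line\n\n    2  second line\n");
```

Anything printed after the blank line, including a second Content-type line, is treated as part of the body.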
It is perfectly fine to have a CGI program return text/plain and just have text without markup be displayed in a browser when all you want is text, and no colors or links or tables. For certain use cases this makes a lot of sense, even if it doesn't have the hyper in Hypertext any more. But you're not really doing that.
Your BEGIN block serves a purpose, but you are overdoing it. You're trying to make sure that when an error occurs, it gets nicely printed in the browser, so you don't need to deal with the server log while developing.
The CGI::Carp module and its fatalsToBrowser functionality bring their own mechanism for that. You don't have to do it yourself.
You can safely remove the BEGIN block and just put your use CGI::Carp at the top of the script with all the other use statements. They all get run first anyway, because use gets run at compile time, while the rest of your code gets run at run time.
If you want, you can keep the $|++, which turns off the buffering for your STDOUT handle. It gets flushed immediately and every time you print something, that output goes directly to the browser instead of collecting until it's enough or there is a newline. If your process runs for a long time, this makes it easier for the user to see that stuff is happening, which is also useful in production.
The top of your program should look like this now.
#!/usr/bin/perl -T
# middle.pl - a simple concordance
use strict;
use warnings;
use diagnostics;
use CGI;
use CGI::Carp('fatalsToBrowser');
$|=1;
my $q = CGI->new;
Finally, a few quick words on the other parts I deleted from there.
Your comment "# require" over the use statements is misleading. Those are use, not require. As I said above, use gets run at compile time. require, on the other hand, gets run at run time and can be done conditionally. Misleading comments will make it harder for others (or you) to maintain your code later on.
I removed the -w flag from the shebang (#!/usr/bin/perl) and put the use warnings pragma in. That's a more modern way to turn on warnings, because sometimes the shebang can be ignored.
The use diagnostics pragma gives you extra long explanations when things go wrong. That's useful, but also extra slow. You can use it during development, but please remove it for production.
The comment sanity check should be moved down under the CGI instantiation.
Please use the invocation form of new to instantiate CGI, and any other classes. The -> syntax will take care of inheritance properly, while the old new CGI cannot do that.
I ran your CGI. The BEGIN block is run regardless, and you print a content-type header there: you have explicitly asked for HTML. Then later you attempt to print another header for plain text. This is why you can see the header text (which hasn't taken effect) at the beginning of the text in the browser window.

Get value of autosplit delimiter?

If I run a script with perl -Fsomething, is that something value saved anywhere in the Perl environment where the script can find it? I'd like to write a script that by default reuses the input delimiter (if it's a string and not a regular expression) as the output delimiter.
Looking at the source, I don't think the delimiter is saved anywhere. When you run
perl -F, -an
the lexer actually generates the code
LINE: while (<>) {our @F=split(q\0,\0);
and parses it. At this point, any information about the delimiter is lost.
Your best option is to split by hand:
perl -ne'BEGIN { $F="," } @F=split(/$F/); print join($F, @F)' foo.csv
or to pass the delimiter as an argument to your script:
F=,; perl -F$F -sane'print join($F, @F)' -- -F=$F foo.csv
or to pass the delimiter as an environment variable:
export F=,; perl -F$F -ane'print join($ENV{F}, @F)' foo.csv
As @ThisSuitIsBlackNot says, it looks like the delimiter is not saved anywhere.
This is how perl.c stores the -F parameter:
case 'F':
PL_minus_a = TRUE;
PL_minus_F = TRUE;
PL_minus_n = TRUE;
PL_splitstr = ++s;
while (*s && !isSPACE(*s)) ++s;
PL_splitstr = savepvn(PL_splitstr, s - PL_splitstr);
return s;
And then the lexer generates the code
LINE: while (<>) {our @F=split(q\0,\0);
However this is of course compiled, and if you run it with B::Deparse you can see what is stored.
$ perl -MO=Deparse -F/e/ -e ''
LINE: while (defined($_ = <ARGV>)) {
our(@F) = split(/e/, $_, 0);
}
-e syntax OK
Being perl there is always a way, however ugly. (And this is some of the ugliest code I have written in a while):
use B::Deparse;
use Capture::Tiny qw/capture_stdout/;
BEGIN {
my $f_var;
}
unless ($f_var) {
$stdout = capture_stdout {
my $sub = B::Deparse::compile();
&{$sub}; # Have to capture stdout, since I won't bother to setup compile to return the text, instead of printing
};
my (undef, $split_line, undef) = split(/\n/, $stdout, 3);
($f_var) = $split_line =~ /our\(\@F\) = split\((.*)\, \$\_\, 0\);/;
print $f_var,"\n";
}
Output:
$ perl -Fe/\\\(\\[\\\<\\{\"e testy.pl
m#e/\(\[\<\{"e#
You could possibly traverse the bytecode instead, since the start will probably be identical every time until you reach the pattern.

Search and replace in Perl for particular word

I have a huge file which consists of lines similar to those below, with different clocks:
cmd -quiet [get_ports p1] ref_clocks "cudtclk_sp cudtclk"
cmd -quiet [get_ports p2] clock "cu2xdtclk_sp cu2xdtclk"
And I need to replace cudtclk with some other name like cdtclk whenever I have ref_clocks in my file, globally.
I have written following code but it doesn't seem to be working.
#!/usr/bin/perl
use strict;
use warnings;
sub clock_change
{       # Get the subroutine's argument.
my $arg = shift;
# Hash of stuff we want to replace.
my %replace = (
"cudtclk" => "cdtclk",
);
# See if there's a replacement for the given text.
my $text = $replace{$arg};
if(defined($text)) {
return $text;
}
return $arg;
}
open PAR, "<file name>";
while(<PAR>) {
$_ =~ s/\S+\s\S+\s\S+\s\S+\sref_clocks\s+(\S+\s+\S+)/clock_change($1)/eig;
print $_;   ##print it to some file later.
}
"And I need to replace cudtclk with some other name like cdtclk"
perl -pe 's/\bcudtclk\b/cdtclk/' thefile > newfile
"whenever I have ref_clocks"
perl -pe 's/\bcudtclk\b/cdtclk/ if /\bref_clocks\b/' thefile > newfile
Alternatively:
# saves original file as file.bak
perl -i.bak -pe 's/\bcudtclk\b/cdtclk/ if /\bref_clocks\b/' file
Tighten to suit your data, as necessary.
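The conditional one-liner above can also be written as a small script. The sketch below uses a hypothetical helper named rename_clock and the two sample lines from the question. One detail to be aware of: because the underscore counts as a word character, \bcudtclk\b does not match inside cudtclk_sp, so only the standalone clock name is renamed.

```perl
use strict;
use warnings;

# Hypothetical helper: rename the clock only on lines that mention ref_clocks.
sub rename_clock {
    my ($line) = @_;
    $line =~ s/\bcudtclk\b/cdtclk/g if $line =~ /\bref_clocks\b/;
    return $line;
}

my @lines = (
    'cmd -quiet [get_ports p1] ref_clocks "cudtclk_sp cudtclk"',
    'cmd -quiet [get_ports p2] clock "cu2xdtclk_sp cu2xdtclk"',
);

# Prints the first line with "cudtclk_sp cdtclk" and the second line unchanged.
print rename_clock($_), "\n" for @lines;
```

If cudtclk_sp should be renamed as well, the pattern has to be loosened, for example by dropping the trailing \b.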
Although the substitution seems unnecessarily complex, you can fix it with something similar to:
$_ =~ s/(ref_clocks\s+")([^_]+)_sp(\s+)\2/
$1.clock_change($2)."_sp$3".clock_change($2)/eig;

Perl split() Function Not Handling Pipe Character Saved As A Variable

I'm running into a little trouble with Perl's built-in split function. I'm creating a script that edits the first line of a CSV file which uses a pipe for column delimitation. Below is the first line:
KEY|H1|H2|H3
However, when I run the script, here is the output I receive:
Col1|Col2|Col3|Col4|Col5|Col6|Col7|Col8|Col9|Col10|Col11|Col12|Col13|
I have a feeling that Perl doesn't like the fact that I use a variable to actually do the split, and in this case, the variable is a pipe. When I replace the variable with an actual pipe, it works perfectly as intended. How could I go about splitting the line properly when using pipe delimitation, even when passing in a variable? Also, as a silly caveat, I don't have permissions to install an external module from CPAN, so I have to stick with built-in functions and modules.
For context, here is the necessary part of my script:
our $opt_h;
our $opt_f;
our $opt_d;
# Get user input - filename and delimiter
getopts("f:d:h");
if (defined($opt_h)) {
&print_help;
exit 0;
}
if (!defined($opt_f)) {
$opt_f = &promptUser("Enter the Source file, for example /qa/data/testdata/prod.csv");
}
if (!defined($opt_d)) {
$opt_d = "\|";
}
my $delimiter = "\|";
my $temp_file = $opt_f;
my @temp_file = split(/\./, $temp_file);
$temp_file = $temp_file[0]."_add-headers.".$temp_file[1];
open(source_file, "<", $opt_f) or die "Err opening $opt_f: $!";
open(temp_file, ">", $temp_file) or die "Error opening $temp_file: $!";
my $source_header = <source_file>;
my @source_header_columns = split(/${delimiter}/, $source_header);
chomp(@source_header_columns);
for (my $i=1; $i<=scalar(@source_header_columns); $i++) {
print temp_file "Col$i";
print temp_file "$delimiter";
}
print temp_file "\n";
while (my $line = <source_file>) {
print temp_file "$line";
}
close(source_file);
close(temp_file);
The first argument to split is a compiled regular expression or a regular expression pattern. If you want to split on the literal text |, you'll need to pass a pattern that matches |.
quotemeta creates a pattern from a string that matches that string.
my $delimiter = '|';
my $delimiter_pat = quotemeta($delimiter);
split $delimiter_pat
Alternatively, quotemeta can be accessed as \Q..\E inside double-quoted strings and the like.
my $delimiter = '|';
split /\Q$delimiter\E/
The \E can even be omitted if it's at the end.
my $delimiter = '|';
split /\Q$delimiter/
I mentioned that split also accepts a compiled regular expression.
my $delimiter = '|';
my $delimiter_re = qr/\Q$delimiter/;
split $delimiter_re
If you don't mind hardcoding the regular expression, that's the same as
my $delimiter_re = qr/\|/;
split $delimiter_re
First, the | isn't special inside double quotes. Setting $delimiter to just "|" and then making sure it is quoted later would work, or possibly setting $delimiter to "\\|" would be OK by itself.
Second, the | is special inside regex so you want to quote it there. The safest way to do that is ask perl to quote your code for you. Use the \Q...\E construct within the regex to mark out data you want quoted.
my @source_header_columns = split(/\Q${delimiter}\E/, $source_header);
see: http://perldoc.perl.org/perlre.html
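To see why the unquoted variable produced so many columns, here is a small sketch using the header from the question (without its trailing newline). Unescaped, the pattern | is an alternation of two empty patterns, so it matches between every character. The header has 12 characters, and the newline that was still attached in the original script accounts for the thirteenth column.

```perl
use strict;
use warnings;

my $header    = "KEY|H1|H2|H3";
my $delimiter = '|';

# Unescaped, /|/ means "empty OR empty", so split cuts between every character.
my @per_char = split /$delimiter/, $header;

# \Q...\E escapes the metacharacter, so the pattern matches a literal pipe.
my @fields = split /\Q$delimiter\E/, $header;

print scalar(@per_char), " pieces without quoting\n";
print scalar(@fields), " fields with quoting: @fields\n";
```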
It seems that all you want to do is count the fields in the header and print the header. Might I suggest something a bit simpler than using split?
my $str="KEY|H1|H2|H3";
my $count=0;
$str =~ s/\w+/"Col" . ++$count/eg;
print "$str\n";
Works with almost any delimiter (except alphanumerics and underscore); it also saves the number of fields in $count, in case you need it later.
Here's another version. This one uses a negated character class instead, to specify "any character but this", which is just another way of defining a delimiter. You can specify the delimiter from the command line. You can use your getopts as well, but I just used a simple shift.
my $d = shift || '[^|]';
if ( $d !~ /^\[/ ) {
$d = '[^' . $d . ']';
}
my $str="KEY|H1|H2|H3";
my $count=0;
$str =~ s/$d+/"Col" . ++$count/eg;
print "$str\n";
By using the brackets, you do not need to worry about escaping metacharacters.
#!/usr/bin/perl
use Data::Dumper;
use strict;
my $delimeter="\\|";
my $string="A|B|C|DD|E";
my @arr=split(/$delimeter/,$string);
print Dumper(@arr)."\n";
output:
$VAR1 = 'A';
$VAR2 = 'B';
$VAR3 = 'C';
$VAR4 = 'DD';
$VAR5 = 'E';
It seems you need to define the delimiter as \\|

How do I copy a file with a UTF-8 filename to another UTF-8 filename in Perl on Windows?

For example, given an empty file テスト.txt, how would I make a copy called テスト.txt.copy?
My first crack at it managed to access the file and create the new filename, but the copy generated テスト.txt.copy.
Here was my first crack at it:
#!/usr/bin/env perl
use strict;
use warnings;
use English '-no_match_vars';
use File::Basename;
use Getopt::Long;
use File::Copy;
use Win32;
my (
$output_relfilepath,
) = process_command_line();
open my $fh, '>', $output_relfilepath or die $!;
binmode $fh, ':utf8';
foreach my $short_basename ( glob( '*.txt') ) {
# skip the output basename if it's in the glob
if ( $short_basename eq $output_relfilepath ) {
next;
}
my $long_basename = Win32::GetLongPathName( $short_basename );
my $new_basename = $long_basename . '.copy';
print {$fh} sprintf(
"short_basename = (%s)\n" .
" long_basename = (%s)\n" .
" new_basename = (%s)\n",
$short_basename,
$long_basename,
$new_basename,
);
copy( $short_basename, $new_basename );
}
printf(
"\n%s done! (%d seconds elapsed)\n",
basename( $0 ),
time() - $BASETIME,
);
# === subroutines ===
sub process_command_line {
# default arguments
my %args
= (
output_relfilepath => 'output.txt',
);
GetOptions(
'help' => sub { print usage(); exit },
'output_relfilepath=s' => \$args{output_relfilepath},
);
return (
$args{output_relfilepath},
);
}
sub usage {
my $script_name = basename $0;
my $usage = <<END_USAGE;
======================================================================
Test script to copy files with a UTF-8 filenames to files with
different UTF-8 filenames. This example tries to make copies of all
.txt files with versions that end in .txt.copy.
usage: ${script_name} (<options>)
options:
-output_relfilepath <s> set the output relative file path to <s>.
this file contains the short, long, and
new basenames.
(default: 'output.txt')
----------------------------------------------------------------------
examples:
${script_name}
======================================================================
END_USAGE
return $usage;
}
Here are the contents of output.txt after execution:
short_basename = (BD9A~1.TXT)
long_basename = (テスト.txt)
new_basename = (テスト.txt.copy)
I've tried replacing File::Copy's copy command with a system call:
my $cmd = "copy \"${short_basename}\" \"${new_basename}\"";
print `$cmd`;
and with Win32::CopyFile:
Win32::CopyFile( $short_basename, $new_basename, 'true' );
Unfortunately, I get the same result in both cases (テスト.txt.copy). For the system call, the print shows 1 file(s) copied. as expected.
Notes:
I'm running Perl 5.10.0 via Strawberry Perl on Windows 7 Professional
I use the Win32 module to access long filenames
The glob returns short filenames, which I have to use to access the file
テスト = test (tesuto) in katakana
I've read perlunitut and The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
This should be possible with the CopyFileW function from Win32API::File, which should be included with Strawberry. I've never messed with Unicode filenames myself, so I'm not sure of the details. You might need to use Encode to manually convert the filename to UTF-16LE (encode('UTF16-LE', $filename)).
You're getting the long filename using Win32, which gives you a UTF-8-encoded string.
However, you're then setting the long filename using plain copy, which uses the C stdlib IO functions. The stdlib functions use the default filesystem encoding.
On modern Linuxes that's usually UTF-8, but on Windows it (sadly) never is, because the system default code page cannot be set to UTF-8. So you'll get your UTF-8 string interpreted as a code page 1252 string on a Western European Windows install, as has happened here. (On a Japanese machine it'd get interpreted as code page 932 — like Shift-JIS — which would come out something like 繝�せ繝�.)
I've not done this in Perl, but I'd suspect the Win32::CopyFile function would be more likely to be able to handle the kind of Unicode paths returned elsewhere in the Win32 module.
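The misinterpretation described above can be reproduced with the core Encode module. This is only a sketch of the effect, not of what File::Copy does internally: it encodes the katakana name as UTF-8 and then decodes those bytes as code page 1252, which yields the garbled name from the question.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The katakana word from the question, written with code points so the
# example does not depend on the encoding of this source file.
my $name = "\x{30C6}\x{30B9}\x{30C8}";

# The name as UTF-8 bytes...
my $utf8_bytes = encode('UTF-8', $name);

# ...and the same bytes read as code page 1252 instead.
my $misread = decode('cp1252', $utf8_bytes);

binmode STDOUT, ':encoding(UTF-8)';
print "$misread\n";    # three characters become nine
```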
Use Encode::Locale:
use Encode::Locale;
use Encode;
use File::Copy;
copy( encode(locale_fs => $short_basename),
encode(locale_fs => $new_basename) ) || die $!;
I successfully duplicated your problem on my Windows machine (Win XP Simplified Chinese version) and my conclusion is that the problem is caused by the font. Choose a Truetype font rather than Raster fonts and see if everything is okay.
My experiment is this:
I first changed the code page of my Windows Console from the default 936 (GBK) to 65001 (UTF-8).
by typing C:>chcp 65001
I wrote a script that contains the code: $a= "テスト"; print $a; and saved it as UTF-8.
I ran the script from the Console and found "テスト" became "テスト", which is exactly the same symptom you described in your question.
I changed the Console Font from Raster Fonts to Lucida Console, the console screen gave me this: "テストストトト", which is still not quite right but I assume it is getting closer to the core of the problem.
So although I'm not 100% sure, the problem is probably caused by the font.
Hope this helps.
See https://metacpan.org/pod/Win32::Unicode
#!/usr/bin/perl --
use utf8;
use strict;
use warnings;
my @kebabs = (
"\x{45B}\x{435}\x{432}\x{430}\x{43F}.txt", ## ћевап.txt
"ra\x{17E}nji\x{107}.txt", ## ražnjić.txt
"\x{107}evap.txt", ## ćevap.txt
"\x{43A}\x{435}\x{431}\x{430}\x{43F}\x{447}\x{435}.txt", ## кебапче.txt
"kebab.txt",
);
{
use Win32::Unicode qw/ -native /;
printW "I \x{2665} Perl"; # unicode console out
mkpathW 'meat';
chdirW 'meat';
for my $kebab ( @kebabs ){
printW "kebabing the $kebab\n";
open my($fh), '>:raw', $kebab or dieW Fudge($kebab);
print $fh $kebab or dieW Fudge($kebab);
close $fh or dieW Fudge($kebab);
}
}
sub Fudge {
use Errno();
join qq/\n/,
"Error @_",
map { " $_" } int( $! ) . q/ / . $!,
int( $^E ) . q/ / . $^E,
grep( { $!{$_} } keys %! ),
q/ /;
}