Perl: Encoding messed up after text concatenation - perl

I have encountered a weird situation while updating/upgrading some legacy code.
I have a variable which contains HTML. Before I can output it, it has to be filled with lots of data. In essence, I have the following:
for my $line (#lines) {
$output = loadstuff($line, $output);
}
Inside of loadstuff(), there is the following
sub loadstuff {
my ($line, $output) = #_;
# here the process is simplified for better understanding.
my $stuff = getOtherStuff($line);
my $result = $output.$stuff;
return $result;
}
This function builds a page which consists of different areas. All area is loaded up independently, that's why there is a for-loop.
Trouble starts right about here. When I load the page from ground up (click on a link, Perl executes and delivers HTML), everything is loaded fine. Whenever I load a second page via AJAX for comparison, that HTML has broken encoding.
I tracked down the problem to this line my $result = $output.$stuff. Before the concatenation, $output and $stuff are fine. But afterward, the encoding in $result is messed up.
Does somebody have a clue why concatenation messes up my encoding? While we are on the subject, why does it only happen when the call is done via AJAX?
Edit 1
The Perl and the AJAX call both execute the very same functions for building up a page. So, whenever I fix it for AJAX, it is broken for freshly reloaded pages. It really seems to happen only if AJAX starts the call.
The only difference in this particular case is that the current values for the page are compared with an older one (it is a backup/restore function). From here, everything is the same. The encoding in the variables (as far as I can tell) are ok. I even tried the Encode functions only on the values loaded from AJAX, but to no avail. The files themselves seem to be utf8 according to "Kate".
Besides that, I have a another function with the same behavior which uses the EXACT same functions, values and files. When the call is started from Perl/Apache, the encoding is ok. Via AJAX, again, it is messed up.
I have been examinating the AJAX Request (jQuery) and could not find anything odd. The encoding seems to be utf8 too.

Perl has a “utf8” flag for every scalar value, which may be “on” or “off”. “On” state of the flag tells perl to treat the value as a string of Unicode characters.
If you take a string with utf8 flag off and concatenate it with a string that has utf8 flag on, perl converts the first one to Unicode. This is the usual source of problems.
You need to either convert both variables to bytes with Encode::encode() or to perl's internal format with Encode::decode() before concatenation.
See perldoc Encode.

Expanding on the previous answer, here's a little more information that I found useful when I started messing with character encodings in Perl.
This is an excellent introduction to Unicode in perl: http://perldoc.perl.org/perluniintro.html. The section "Perl's Unicode Model" is particularly relevant to the issue you're seeing.
A good rule to use in Perl is to decode data to Perl characters on it's way in and encode it into bytes on it's way out. You can do this explicitly using Encode::encode and Encode::decode. If you're reading from/writing to a file handle you can specify an encoding on the filehandle by using binmode and setting layer: perldoc -f binmode
You can tell which of the strings in your example has been decoded into Perl characters using Encode::is_utf8:
use Encode qw( is_utf8 );
print is_utf8($stuff) ? 'characters' : 'bytes';

A colleague of mine found the answer to this problem. It really had something to do with the fact that AJAX started the call.
The file structure is as follows:
1 Handler, accessed by Apache
1 Handler, accessed by Apache but who only contains AJAX responders. We call it the AJAX-Handler
1 package, which contains functions relevant for the entire software, who access yet other packages from our own Framework
Inside of the AJAX-Handler, we print the result as such
sub handler {
my $r = shift;
# processing output
$r->print($output);
return Apache2::Const::OK;
}
Now, when I replace $r->print($output); by print($output);, the problem disappears! I know that this is not the recommended way to print stuff in mod_perl, but this seems to work.
Still, any ideas how to do this the proper way are welcome.

Related

How can I get perl to correctly pass a command line argument with multiple arguments and complex file paths (spaces and symbols)?

I have a small perl script which collects file paths from an excel file and passes them through the command line to perltex which then compiles a pdf based on the files and paths chosen.
My problem is that the moment I introduce more complex file paths (which is necessary based on the network setup of the final user pool) perltex fails to find the file paths, cutting them at the space.
A MWE is a follows
#!/usr/bin/perl
use strict;
use warnings;
use 5.14.2;
use Text::Template;
use Spreadsheet::Read;
use Spreadsheet::ParseXLSX;
use utf8;
use charnames qw( :full :short );
use autodie;
my $row = 5;
my $col = 15;
my $File = "C:/Users/me/Desktop/Reporting-Static/Input-test1.xlsm";
my $parser = Spreadsheet::ParseXLSX->new();
my $workbook = $parser->parse($File);
my $worksheet = $workbook->worksheet("Input");
my $cell = $worksheet->get_cell($row, $col);
my $Filename = $cell->Value();
my $texfile = "C:/Users/me/Desktop/Reporting-Static/file.tex";
# can't find this file if there are spaces in the address
system("perltex", "--latex=pdflatex", "--nosafe", "--jobname=$Filename", "$texfile");
if ( $? == -1 )
{
print "command failed: $!\n";
}
else
{
printf "command exited with value %d", $? >> 8;
}
exit;
However, the moment I change the folder name to one with spaces eg. "Reporting Static" it fails to find the tex file.
I have read several other posts regarding this on stack exchange and other websites but for whatever reason the proposed solutions do not appear to work for me. I have tried
my $texfile = "C:/Users/me/Desktop/Reporting Static/file.tex";
my $texfile = C:/Users/me/Desktop/"Reporting Static"/file.tex;
my $texfile = "\"C:/Users/me/Desktop/"Reporting Static"/file.tex\"";
my $texfile = "\"C:/Users/me/Desktop/Reporting Static/file.tex\"";
my $texfile = "C:/Users/me/Desktop/Reporting^ Static/file.tex";
my $texfile = "C:/Users/me/Desktop/Reporting\^ Static/file.tex";
As well as a few other combinations or varioations of the above, all without success. I have also tried replacing the double quote with a single quote so that perl doesn't interpolate the contents.
I have also tried manually typing all of the above into the command prompt to check whether there was a small issue with the way perl passed the commands to the command line but still no luck.
I am aware that I can use the 'dir /X ~1 c:\' command to find system name allocations that avoid spaces but the idea is that the filename and location will be dynamic and change as a funtion of department and site, so I would prefer to avoid trying to write a script which will go and find this pathname and use it to replace all locations using spaces or other special characters.
The final idea that I had is that this problem could be connected ot the way that perltex passes it's arguments yet I am unable to find any documentation (that I can follow...) on the specifics of how this particular aspect of the file functions.
So my questions are, is there something I am missing not metioned in the other answers that I have read regarding how to correctly pass these paths to perltex, is there perhaps some sort of incompatiblity in how I'm trying to go about this, is this more probabl linked to perltex as opposed to perl or cmd or is there something completely different that I am unaware of that is stopping this from working???
EDIT:
from cmd prompt perltex returns a "unable to find path X, please enter another file location". Until now I hadn't really tested retyping the path by by entering 'C:/Users/me/Desktop/"Reporting Static"/file.tex' (no quotes at the beginning) it is subsequently accepted and runs. but initially passing it this path does not work, suggesting that some internal perltex code accepts the inital path differently to being repassed the same path after an error.... not quite sure what to make of this.
EDIT:
The contents of #latexcmdline that I extracted
$VAR1 = [
'pdflatex',
'--jobname=--',
'\\makeatletter\\def\\plmac#tag{AYNNNUVKQVJGZKKPGPTH}\\def\\plmac#tofile{Perl.topl}\\def\\plmac#fromfile{Perl.frpl}\\def\\plmac#toflag{Perl.tfpl}\\def\\plmac#fromflag{Perl.ffpl}\\def\\plmac#doneflag{Perl.dfpl}\\def\\plmac#pipe{Perl.pipe}\\makeatother\\input C:/Users/me/Desktop/PERLTEST/Perl',
'Modules/RevuedeProjetDB.tex'
];
This was done by inserting
use File::Slurp;
use Data::Dumper;
write_file 'C:\Users\me\Desktop\PERLTEST\mydump.log', Dumper( \#latexcmdline );
before the exec command.
Update
I initially recommended that you should use String::ShellQuote but that module is for Linux only so I deleted my answer when I realised that your question was about the Windows system
It seems that there's also a Win32::ShellQuote which does the same thing for Windows, so I am renewing my suggestion
As I said before, the issue is that perltex itself doesn't properly handle paths containing whitespace, even if they are correctly passed as a single element of #ARGV. I believe the solution is to pass the path including enclosing quotes, although I have never been able to test this properly as I have no LaTex installation
Unfortunately, even if I pass qq{"$texfile"}, the quotes are still stripped before they reach the target program, so they must be protected in some way
You need the quote_system function from that module, which will prepare a list of strings so that they retain any quotation marks
Using a parameter of quote_system(qq{"$texfile"}) produces the correct result in my tests. It is the equivalent of passing qq{"\\"$texfile\\""} but less ugly
So your system call should be like this (with no modification to perltex.pl)
I have applied the same principle to $Filename as it may well be that this also contains whitespace
use Win32::ShellQuote 'quote_system';
system(quote_system(
'perltex',
'--latex=pdflatex',
'--nosafe',
qq{--jobname="$Filename"},
qq{"$texfile"},
));
Okay, well I have a solution of sorts
The issue, as I suspected, is that, although the path is passed as a single string to perltex.pl, the latter doesn't handle paths with spaces properly after it has received them
The temporary fix is to hack perltex.pl
Line 82 of my version of perltex.pl (there is no version number in the source) reads
$latexcmdline[$firstcmd] = "\\input $option";
If you change that to
$latexcmdline[$firstcmd] = qq{\\input "$option"};
then all should be well. However this is a solid fix only when it is distributed by the author of perltex. Meanwhile I am looking for a nicer solution from the calling side
There are two steps to solving this problem.
Work out how to get the correct arguments into an external
program.
Work out how to do that from a Perl program.
For step 1, I find a program like this to be useful.
#!/usr/bin/perl
use strict;
use warnings;
print "Received ", scalar #ARGV, " arguments:\n";
for (1 .. #ARGV) {
print "$_: $ARGV[$_ - 1]\n";
}
It just explains what arguments it receives on the command line. You can use this in place of "perltex" for testing purposes.
You'll see that if you give it an argument that contains spaces, then that is interpreted as the called program as multiple arguments. The way to get round that is to quote the argument that contains a space. And I seem to remember that Windows insists on double-quotes (for reasons that I can never remember).
So I think that you want this:
system('perltex', '--latex=pdflatex', '--nosafe', "--jobname=\"$Filename\"", "\"$texfile\"");
I've double-quoted both of the filenames. Of course, those escaped double-quote characters look really ugly, and Perl gives us qq(...) to make that look nicer.
system('perltex', '--latex=pdflatex', '--nosafe', qq(--jobname="$Filename"), qq("$texfile"));
If that's not quite right, then the program I showed earlier will make it easier to find the solution.
Update: Borodin's comment below about this making no difference to $texfile is accurate. The fact that we're passing a list to system() means that the shell isn't involved at all.

How to override a subroutine such as `length` in Perl?

I would like to simply override the length subroutine to take in account ANSI escape sequences so I wrote this:
sub length {
my $str = shift;
if ($cfg{color}) {
return length($str =~ s/\x1B\[\d+[^m]*m//gr);
}
return length($str);
}
Unfortunately Perl detect the ambiguous call that is remplaced with CORE::length.
How can I just tell Perl to use the local declaration instead?
Of course, an alternative solution would be to rename each call to length with ansi_length and rename the custom function accordingly.
To those who want more details:
The context where I would like to override the core module length is a short code that generate ASCII tables (a bit like Text::ASCIITable, but with different features like multicolumns and multirows). I don't want to write a dedicated Perl module because I would like to keep my program as monolithic as possible because the people what will use it are not familiar with CPAN or even modules installation.
In this code, I need to know the width of each columns in each rows in order to align them properly. When a cell contain a colored text with an ANSI sequence like ^[[33mgreen^[[0m, I need to ignore the coloring sequences.
As I already use UTF-8 chars in my Program, I had to add this to my Program:
use utf8;
use open ':std', ':encoding(UTF-8)';
I noticed the utf8 module also overload the core subroutine length. I realized this will also be a good solution in my case.
Eventually I think I added enough details to this question. I would be glad to be notified why I got downvotes on this question. I don't think I can make this more clear. Also I think all these details are not usefull at all to understand the initial question...
Overwriting a core function is not a good idea. If you use a library, that itself uses the core function, the library function would be confronted with the overwritten function and may fail. You could create an own module/namespace ANSI:: or so, then use ANSI::length, but I think it is better to use a name like you proposed: ansi_length.
If you still insist:
You can overwrite the core function with
BEGIN {
*CORE::GLOBAL::length = sub ...
}
Whenever you need access to the origin CORE function, use
CORE::length.
This is valid for all built in functions of Perl.
Here is a reference : http://perldoc.perl.org/CORE.html

Perl security, the print statement?

I don't know perl at all, but I need to change an exitsting program to give a clearer debug output.
This is the statement:
print $lfh "$ts\t$tid\t$msg\n";
where $msg will be created by joining the arguments of the function like this:
my $msg = join( "\t", #_ );
Somewhere in $msg I would like to add one of the command line arguments the user has supplied when calling the program. Is that risk for an exploit if printed to stdout?
Note also that $lfh will be taken from an environment variable (set by the script itself earlier on) if it is to write to a file, like this:
open my $lfh, ">>", $ENV{GL_LOGFILE}
The info I could find on perl security says nothing about the print statement, so maybe I'm just completely paranoid, but better safe than sorry...
You can pass arbitrary data to print, and it won't break. However:
printing to a file handle may go through various IO layers like encodings. Not all data may be valid for these layers. E.g. encoding layers will either use a substitution character or terminate the script when handed invalid data.
Perl's print doesn't care about control characters like NULs, backspaces, or carriage returns and just passes them through. Depending on how the output is viewed, it might look jumbled or could break the viewing application. Example: User input is equal to the string "\rHAHA". When you print "something$userinput", you might see HAHAthing when displayed on the console, because the carriage returns places the cursor in the 1st column. If the user input contains newlines, your output records could be broken up, e.g. "foo${userinput}bar" might become the two lines foo and bar.
If the input is condidential information, the output will be confidential as well, so you should make sure that the output can't be viewed by anyone. E.g. displaying the output including debug information on a web page could allow attackers to obtain more information about your system.

How can I convert CGI input to UTF-8 without Perl's Encode module?

Through this forum, I have learned that it is not a good idea to use the following for converting CGI input (from either an escape()d Ajax call or a normal HTML form post) to UTF-8:
read (STDIN, $_, $ENV{CONTENT_LENGTH});
s{%([a-fA-F0-9]{2})}{ pack ('C', hex ($1)) }eg;
utf8::decode $_;
A safer way (which for example does not allow bogus characters through) is to do the following:
use Encode qw (decode);
read (STDIN, $_, $ENV{CONTENT_LENGTH});
s{%([a-fA-F0-9]{2})}{ pack ('C', hex ($1)) }eg;
decode ('UTF-8', $_, Encode::FB_CROAK);
I would, however, very much like to avoid using any modules (including XSLoader, Exporter, and whatever else they bring with them). The function is for a high-volume mod_perl driven website and I think both performance and maintainability will be better without modules (especially since the current code does not use any).
I guess one approach would be to examine the Encode module and strip out the functions and constants used for the “decode ('UTF-8', $_, Encode::FB_CROAK)” call. I am not sufficiently familiar with Unicode and Perl modules to do this. Maybe somebody else is capable of doing this or know a similar, safe “native” way of doing the UTF-8 conversion?
UPDATE:
I prefer keeping things non-modular, because then the only black-box is Perl's own compiler (unless of course you dig down into the module libs).
Sometimes you see large modules being replaced with a few specific lines of code. For example, instead of the CGI.pm module (which people are also in love with), one can use the following for parsing AJAX posts:
my %Input;
if ($ENV{CONTENT_LENGTH}) {
read (STDIN, $_, $ENV{CONTENT_LENGTH});
foreach (split (/&/)) {
tr/+/ /; s/%([a-fA-F0-9]{2})/pack("C", hex($1))/eg;
if (m{^(\w+)=\s*(.*?)\s*$}s) { $Input{$1} = $2; }
else { die ("bad input ($_)"); }
}
}
In a similar way, it would be great if one could extract or replicate Encode's UTF-8 decode function.
Don't pre-optimize. Do it the conventional way first then profile and benchmark later to see where you need to optimize. People usually waste all their time somewhere else, so starting off blindfolded and hadcuffed doesn't give you any benefit.
Don't be afraid of modules. The point of mod_perl is to load up everything as few times as possible so the startup time and module loading time are insignificant.
Don't use escape() to create your posted data. This isn't compatible with URL-encoding, it's a mutant JavaScript oddity which should normally never be used. One of the defects is that it will encode non-ASCII characters to non-standard %uNNNN sequences based on UTF-16 code units, instead of standard URL-encoded UTF-8. Your current code won't be able to handle that.
You should typically use encodeURIComponent() instead.
If you must URL-decode posted input yourself rather than using a form library (and this does mean you won't be able to handle multipart/form-data), you will need to convert + symbols to spaces before replacing %-sequences. This replacement is standard in form submissions (though not elsewhere in URL-encoded data).
To ensure input is valid UTF-8 if you really don't want to use a library, try this regex. It also excludes some control characters (you may want to tweak it to exclude more).

How do I find "wide characters" printed by perl?

A perl script that scrapes static html pages from a website and writes them to individual files appears to work, but also prints many instances of wide character in print at ./script.pl line n to console: one for each page scraped.
However, a brief glance at the html files generated does not reveal any obvious mistakes in the scraping. How can I find/fix the problem character(s)? Should I even care about fixing it?
The relevant code:
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
...
foreach (#urls) {
$mech->get($_);
print FILE $mech->content; #MESSAGE REFERS TO THIS LINE
...
This is on OSX with Perl 5.8.8.
If you want to fix up the files after the fact, then you could pipe them through fix_latin which will make sure they're all UTF-8 (assuming the input is some mixture of ASCII, Latin-1, CP1252 or UTF-8 already).
For the future, you could use $mech->response->decoded_content which should give you UTF-8 regardless of what encoding the web server used. The you would binmode(FILE, ':utf8') before writing to it, to ensure that Perl's internal string representation is converted to strict UTF-8 bytes on output.
I assume you're crawling images or something of that sort, anyway you can get around the problem by adding binmode(FILE); or if they are webpages and UTF-8 then try binmode( FILE, ':utf8' ). See perldoc -f binmode, perldoc perlopentut, and perldoc PerlIO for more information..
The ":bytes", ":crlf", and ":utf8", and any other directives of the form ":...", are called I/O layers. The "open" pragma can be used to establish default I/O layers. See open.
To mark FILEHANDLE as UTF-8, use ":utf8" or ":encoding(utf8)". ":utf8" just marks the data as UTF-8 without further checking, while ":encoding(utf8)" checks the data for actually being
valid UTF-8. More details can be found in PerlIO::encoding.