How to use UTF-8 in a Perl cgi-bin script?

I have the following cgi bin script:
#! /usr/bin/perl
#
use utf8;
use CGI;
my $q = CGI->new();
my %params = $q->Vars;
print $q->header('text/html');
$w = $params{"words"};
print "$w\n";
I want to be able to call it as cgi-bin/script.pl?words=É for example, but when I do that, what's printed is not UTF-8, but instead garbled:
É
Is there any way to use cgi-bin with utf8?

Your line use utf8 doesn't do anything for you, other than allowing UTF-8 characters in the source file itself. You must make sure that the output handles (on STDOUT as well as any files) are set to utf8. One easy way to handle this is the utf8::all module. Also, make sure you are sending the correct headers, and use the -utf8 CGI pragma to treat incoming parameters as UTF-8. Finally, as always, be sure to use strict and warnings.
The following should get you started:
#!/usr/bin/perl
use strict;
use warnings;
use utf8::all;
use CGI qw(-utf8);
my $q = CGI->new;
print $q->header("text/html;charset=UTF-8");
print $q->param("words");
exit;
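To see where the two-character garbage comes from, here is a minimal sketch using only the core Encode module. It assumes the browser sent words=%C3%89 (the UTF-8 percent-encoding of É); the variable names are illustrative only. The same two bytes, read as Latin-1 or encoded a second time, show up as two characters instead of one.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed input: the two raw bytes that "words=%C3%89" yields after URL-decoding.
my $raw = "\xC3\x89";

# Undecoded, Perl treats each byte as its own character.
my $raw_length = length $raw;

# Decoded, the same data is the single character U+00C9.
my $text        = decode('UTF-8', $raw);
my $text_length = length $text;

# Encoding the undecoded bytes a second time doubles them up (mojibake).
my $double_encoded = encode('UTF-8', $raw);

printf "raw: %d chars, decoded: %d char, double-encoded: %d bytes\n",
    $raw_length, $text_length, length $double_encoded;
```

This is what the -utf8 pragma and the output layer take care of in the script above: parameters are decoded once on the way in and encoded once on the way out.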

I have been having this problem of intermittent failure of utf8 encoding with my CGI script.
I tried everything but couldn't reliably repeat the problem.
I finally discovered that it is absolutely critical to be consistent in your use of the -utf8 pragma in every module that uses CGI:
use CGI qw(-utf8);
What seems to happen is that mod_perl invokes the CGI module just once per request. If the module is included inconsistently (say, some utility function that only uses a redirect function loads CGI and you haven't bothered to set the -utf8 pragma there), then that invocation can be the one that mod_perl decides to use to decode requests.

You will save yourself a lot of pain in the long run if you start out by reading the perlunitut and perlunicode documentation pages. They will give you the basics on exactly what Unicode and character encodings are, and how to work with them in Perl.
Also, what you're asking for is more complex than you think. There are many layers hidden in the phrase "use cgi-bin with utf8", starting with your interface to whatever tool you're using to send requests to the web server and ending with that tool having parsed a response and presenting it to you. You need to understand all those layers well enough to at least be able to tell if the problem lies in your CGI script or not. For example, it doesn't help if your script works perfectly if the problem is that bash and curl don't agree on the encoding of your command line arguments.
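One of those layers is the URL itself: a non-ASCII character has to be turned into bytes and then percent-encoded before it reaches the server. The sketch below shows that step for UTF-8, using the core Encode module. The helper name percent_encode_utf8 is made up for this example, and it assumes the client is using UTF-8, which is exactly the kind of thing that can differ between tools.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: percent-encode a character string as UTF-8,
# leaving only the RFC 3986 unreserved characters untouched.
sub percent_encode_utf8 {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord $1)/ge;
    return $bytes;
}

# U+00C9 is É; written as an escape so the example does not depend on
# the encoding of this file.
print "words=", percent_encode_utf8("\x{C9}"), "\n";
```

If a client encoded the same character as Latin-1 instead, the query string would carry %C9, and a script expecting UTF-8 would fail to decode it even though the script itself is correct.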

Related

Perl - Validate Chinese character input from web page form?

My Perl script accepts and processes input from a text field in a form on a web page. It was written for the English version of the web page and works just fine.
There is also a Chinese version of the page (a separate page, not both languages on the same page), and now I need my script to work with that. The user input on this page is expected to be in Chinese.
Expecting to need to work in UTF-8, I added
use utf8;
This continues to function just fine on the English page.
But in order to, for example, define a string variable for comparison that uses Chinese characters, I have to save the Perl script itself with utf-8 encoding. As soon as I do that, I get the dreaded 500 server error.
Clearly I'm going about this wrong, and any helpful direction will be greatly appreciated.
Thanks.
EDIT - please see my clarification post below.
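To handle UTF-8 properly:
use strict; use warnings;
use utf8;                            # the source file itself is UTF-8
use open IO => ':encoding(UTF-8)';   # default layer for handles opened later
# Pass the standard handles as globs; plain strings fail under strict refs
binmode $_, ':encoding(UTF-8)' for \*STDIN, \*STDOUT, \*STDERR;
open(my $fh, '<:encoding(UTF-8)', '/file/path') or die "Cannot open: $!"; # if you need a file-handle
# code.....
See also:
Why does modern Perl avoid UTF-8 by default?
perluniintro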
I'm sorry - I think I poorly expressed my question by including too much information.
The issue is - if I save my script in ANSI format and upload it to the server, it works just fine for the English page. Expecting to want to use Chinese characters in the script, I saved it in UTF-8 format and re-uploaded, and suddenly it throws 500 for the English page.
I tested with a Hello World script:
#!/usr/bin/perl -T
use strict;
use warnings;
print "Content-type: text/html\n\n";
print "Hello, world!\n";
Works fine when saved as ANSI - fails 500 when saved as UTF8.
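The thread does not say why the UTF-8 copy fails. One common cause worth ruling out, offered here as an assumption rather than a diagnosis, is an editor that writes a byte order mark (BOM) in front of the #! line, so the server can no longer recognise the interpreter line. The sketch below checks a file for that three-byte prefix; the helper name has_utf8_bom is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the file starts with the UTF-8 BOM (EF BB BF).
sub has_utf8_bom {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $count = read $fh, my $head, 3;
    close $fh;
    return defined $count && $count == 3 && $head eq "\xEF\xBB\xBF" ? 1 : 0;
}

# Usage: pass the path of the script to check as the first argument.
if (@ARGV) {
    print has_utf8_bom($ARGV[0]) ? "BOM found\n" : "no BOM\n";
}
```

If a BOM turns up, saving the file as "UTF-8 without BOM" (the exact wording depends on the editor) would be the thing to try; the server's error log should confirm whether that was the problem.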

How can I pass variables from one CGI script to another?

I have a CGI perl script called install-app-pl.cgi:
#!/usr/bin/perl -w
print header('text/html');
use strict;
use CGI ':standard';
# Get me some vars
my @params = param();
my $APP_NAME = param('app_name');
my $APP_WEB_PORT = param('app_web_port');
my $APP_WEB_USER = param('app_web_user');
my $APP_WEB_PASS = param('app_web_pass');
my $DOWNLOAD_DIR = param('download_dir');
my $CONFIG_DIR = param('config_dir');
my $LIBRARY_DIR = param('library_dir');
my $TEMP_DOWNLOAD_DIR = param('temp_download_dir');
# Run another script
if ( $APP_NAME ) {
print "Installing $APP_NAME...";
print "<pre>";
system ("perl /var/www/mysite.local/public_html/lib/$APP_NAME/install-$APP_NAME.pl");
print "</pre>" ;
}
else {
print "No app specified, check the error log";
}
I'm trying to get it to pass the variables defined from the CGI parameters to install-$APP_NAME.pl
#!/usr/bin/perl -w
print header('text/html');
use strict;
use CGI ':standard';
require "/var/www/mysite.local/public_html/cgi-bin/install-app-pl.cgi"
# Echo my vars
print "$CONFIG_DIR $DOWNLOAD_DIR $LIBRARY_DIR $PGID $PUID $TZ $APP_WEB_PORT";
But I'm not sure of the best way to pass those on.
Are you sure that install-app-pl.cgi is a CGI program? Are you sure that it's not just a Perl command-line program? I mean, I see how it's named, but it seems very strange to call a CGI program using system() like that.
And the difference is crucial here. CGI programs access their parameters in a different way than command-line programs.
If it really is a CGI program, then you have a few options:
Make an HTTP request to it (using something from the LWP bundle of modules).
Use CGI.pm's debugging mechanism to call it the same way as you're currently calling it, but passing the CGI parameters like foo=xxx&bar=yyy&baz=zzz (see the DEBUGGING section of the CGI.pm documentation for details). This, of course, relies on the program using CGI.pm and it feels a bit hacky to me.
Ask yourself if the program really needs to be a CGI program if you're calling from another program using system(). And then decide to rewrite it as a command-line program. If you want both a CGI version and a command-line version, then you could move most of the code to a module which could be used by two thin wrappers which just extract the parameters.
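For the last option, a minimal sketch of the command-line route might look like this. It uses only core modules; the helper names (build_install_args, parse_install_args, is_safe_app_name) and the option names are made up for illustration, and the installer path in the final comment is a placeholder rather than the real one.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Hypothetical helper for the calling side: settings become --key=value arguments.
sub build_install_args {
    my (%settings) = @_;
    return map { "--$_=$settings{$_}" } sort keys %settings;
}

# Hypothetical helper for the receiving side: parse the same arguments back.
sub parse_install_args {
    my (@argv) = @_;
    my %opt;
    GetOptionsFromArray(\@argv, \%opt, 'config_dir=s', 'download_dir=s')
        or die "Bad arguments\n";
    return %opt;
}

# Hypothetical check: the app name ends up in a file path, so keep it simple.
sub is_safe_app_name {
    my ($name) = @_;
    return defined $name && $name =~ /\A[A-Za-z0-9_-]+\z/ ? 1 : 0;
}

my @args = build_install_args(config_dir => '/tmp/conf', download_dir => '/tmp/dl');
my %seen = parse_install_args(@args);
print "$_ = $seen{$_}\n" for sort keys %seen;

# The list form of system() bypasses the shell, so the values are not re-parsed:
# system($^X, "/path/to/install-$app_name.pl", @args) == 0 or die "Install failed: $?";
```

Validating the app name before building the path matters here because the value comes straight from a web request.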
A few other points about your code.
Perl 5.6 (released in 2000) introduced a use warnings pragma. Most people now use that in place of -w on the shebang line.
It seems weird to call the header() function before loading the CGI module that defines it. It works, because the use is handled at compile time, but it would be nice to re-order that code to make more sense.
Similarly, most people would have use strict (and use warnings) as the very first things in their program, immediately after the shebang line.
system() returns the return value from the process. If your second program produces useful output that you want displayed on the web page, you should use backticks instead.
If all of your output is going to be in a <pre> element, why not just remove that element and return a content type of "text/plain" instead?
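The point about system() versus backticks can be seen in a small sketch. It runs the current perl ($^X) as the child process so that it does not depend on any other program being installed; the one-liners are placeholders for a real installer script, and a Unix-like shell is assumed for the quoting.

```perl
use strict;
use warnings;

# system() hands back the wait status, not the child's output.
my $status    = system($^X, '-e', 'exit 3');
my $exit_code = $status >> 8;

# Backticks (qx) return whatever the child wrote to STDOUT.
my $output = qx{"$^X" -e "print 42"};

print "exit code: $exit_code, captured output: $output\n";
```

With system(), anything the child prints goes directly to the parent's STDOUT, which is why it appears in the page at all; backticks let the script inspect or escape the text first.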
Update: And I'd be remiss if I didn't reiterate what many people have already said in comments on your original question - this sounds like a terrible idea.

How to print out package code?

I'm having difficulty with a Perl script that uses one module (.pm) encoded by a custom function; the module is always decoded before it is loaded into the .cgi script.
I could leave it as it is, but I now have to make several changes to subroutines in this module, and since it is encoded I am helpless ;/
So far I've tried several ways i.e:
#!/usr/bin/perl
use strict;
use lib '.';
use ModuleX; ### This is encoded module which I need
use CGI::Carp qw(fatalsToBrowser);
Unfortunately, $body contains only ";" as a result ;/ I hope it is possible to get that method's code, but I have no idea what else I could try.
Thanks for help.
Are you trying to deparse the new method in the ModuleX package? Then I believe that you want to say
my $body = $deparse->coderef2text(\&ModuleX::new);
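A self-contained sketch of that call, using the core B::Deparse module, is below. Demo::Module is a stand-in invented for this example because ModuleX itself is not available; the package name in the reference has to match the real one exactly, including case, since a reference to a sub that was never defined has no body to deparse.

```perl
use strict;
use warnings;
use B::Deparse;

# Stand-in for the real module (hypothetical; ModuleX is not shown in the question).
package Demo::Module;

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

package main;

my $deparse = B::Deparse->new;

# coderef2text returns the body of the sub as Perl source text.
my $body = $deparse->coderef2text(\&Demo::Module::new);
print "sub new $body\n";
```

The deparsed text is regenerated from the compiled code, so comments and original formatting are not preserved.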

UTF-8 in a Perl module name

How can I write a Perl module with UTF-8 in its name and filename? My current try yields "Can't locate Täst.pm in @INC", but the file does exist. I'm on Windows, and haven't tried this on Linux yet.
test.pl:
use strict;
use warnings;
use utf8;
use Täst;
Täst.pm:
package Täst;
use utf8;
Update: My current workaround is to use Tast (ASCII) and put package Täst (Unicode) in Tast.pm (ASCII). It's confusing, though.
Unfortunately, Perl, Windows, and Unicode filenames really don't go together at the moment. My advice is to save yourself a lot of hassle and stick with plain ASCII for your module names. This blog post mentions a few of the problems.
The use utf8 needs to appear before the package Täst, so that the latter can be correctly interpreted. On my Mac:
test.pl:
#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Tëst;
# 'use utf8' only declares the source encoding; STDOUT needs its own UTF-8 layer
binmode STDOUT, ':encoding(UTF-8)';
Tëst::hëllö();
Tëst.pm:
use utf8;
package Tëst;
sub Tëst::hëllö() {
print "Hëllö, wörld!\n";
}
1;
Output:
Macintosh:Desktop sherm$ ./test.pl
Hëllö, wörld!
As I said though - I ran this on my Mac. As cjm said above, your mileage may vary on Windows.
Unicode support often fails at the boundaries. Package and subroutine names need to map cleanly onto filenames, which is problematic on some operating systems. Not only does the OS have to create the filename that you expect, but you also have to be able to find it later as the same name.
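The mapping from package name to filename can be sketched as follows. The helper module_to_path is written for this example and only mirrors the documented rule that :: becomes a directory separator and .pm is appended; how the resulting name is turned into bytes on disk is the part that varies between systems, which the two encodings below illustrate.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: Foo::Bar becomes Foo/Bar.pm, as require does for barewords.
sub module_to_path {
    my ($module) = @_;
    (my $path = $module) =~ s{::}{/}g;
    return "$path.pm";
}

# "T\x{E4}st" is Täst, written with an escape so this file stays ASCII.
my $path         = module_to_path("T\x{E4}st");
my $utf8_bytes   = encode('UTF-8', $path);
my $latin1_bytes = encode('ISO-8859-1', $path);

printf "%d bytes as UTF-8, %d bytes as Latin-1\n",
    length $utf8_bytes, length $latin1_bytes;
```

The same seven-character name produces two different byte strings, so a lookup only succeeds if Perl and the filesystem agree on which one is meant.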
We talked a little about the filename issue in Effective Perl Programming, and I summarized much more in How do I create then use long Windows paths from Perl?. Jeff Atwood also covers this in his post Filesystem Paths: How Long is Too Long?
I wouldn't recommend this approach if this is software you plan to release, to be honest. Even if you get it working fine for you, it's likely to be somewhat fragile on machines where UTF-8 isn't configured quite right, and/or filenames may not contain UTF-8 characters, etc.

Unicode string mess in perl

I have an external module, that is returning me some strings. I am not sure how are the strings returned, exactly. I don't really know, how Unicode strings work and why.
The module should return, for example, the Czech word "být", meaning "to be". (If you cannot see the second letter - it should look like this.) If I display the string, returned by the module, with Data Dumper, I see it as b\x{fd}t.
However, if I try to print it with print $s, I got "Wide character in print" warning, and ? instead of ý.
If I try Encode::decode(whatever, $s);, the resulting string cannot be printed anyway (always with the "Wide character" warning, sometimes with mangled characters, sometimes right), no matter what I put in whatever.
If I try Encode::encode("utf-8", $s);, the resulting string CAN be printed without the problems or error message.
If I use use encoding 'utf8';, printing works without any need of encoding/decoding. However, if I use IO::CaptureOutput or Capture::Tiny module, it starts shouting "Wide character" again.
I have a few questions, mostly about what exactly happens. (I tried to read perldocs, but I was not very wise from them)
Why can't I print the string right after getting it from the module?
Why can't I print the string decoded by "decode"? What exactly did "decode" do?
What exactly did "encode" do, and why was there no problem printing the string after encoding?
What exactly does use encoding do? Why is the default encoding different from utf-8?
What do I have to do, if I want to print the scalars without any problems, even when I want to use one of the capturing modules?
edit: Some people tell me to use -C or binmode or PERL_UNICODE. That is a great advice. However, somehow, both the capturing modules magically destroy the UTF8-ness of STDOUT. That seems to be more a bug of the modules, but I am not really sure.
edit2: OK, the best solution was to dump the modules and write the "capturing" myself (with much less flexibility).
Because you are printing a string in Perl's internal form (utf8) to a non-Unicode filehandle.
The decode function decodes a sequence of bytes assumed to be in ENCODING into Perl's internal form (utf8). Your input seems to be already decoded, so decoding it again does not help.
The encode() function encodes a string from Perl's internal form into ENCODING.
The encoding pragma allows you to write your script in any encoding you like. String literals are automatically converted to Perl's internal form.
Make sure Perl knows which encoding your data comes in as and which encoding it should go out in.
See also perluniintro, perlunicode, Encode module, binmode() function.
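The relationship between encode() and an encoding layer on a filehandle can be checked with a small sketch. It writes to an in-memory filehandle instead of STDOUT so the result can be compared directly; the string is the b\x{fd}t value from the question, and everything else is illustrative.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The decoded string from the question: three characters.
my $s = "b\x{fd}t";

# Explicit encoding: characters in, UTF-8 bytes out.
my $bytes = encode('UTF-8', $s);

# The same conversion done by an :encoding layer on an in-memory filehandle.
open my $out, '>:encoding(UTF-8)', \my $buffer
    or die "Cannot open in-memory handle: $!";
print {$out} $s;
close $out;

print $buffer eq $bytes ? "same bytes\n" : "different bytes\n";
```

Either approach works as long as the conversion happens exactly once: printing already-encoded bytes through an encoding layer would encode them a second time.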
I recommend reading the Unicode chapter of my book Effective Perl Programming. We put together all the docs we could find and explained Unicode in Perl much more coherently than I've seen anywhere else.
This program works fine for me:
#!perl
use utf8;
use 5.010;
binmode STDOUT, ':utf8';
my $string = return_string();
say $string;
sub return_string { 'být' }
Additionally, Capture::Tiny works just fine for me:
#!perl
use utf8;
use 5.010;
use Capture::Tiny qw(capture);
binmode STDOUT, ':utf8';
my( $stdout, $stderr ) = capture {
system( $^X, '/Users/brian/Desktop/czech.pl' );
};
say "STDOUT is [$stdout]";
IO::CaptureOutput seems to have some problems though:
#!perl
use utf8;
use 5.010;
use IO::CaptureOutput qw(capture);
binmode STDOUT, ':utf8';
capture {
system( $^X, '/Users/brian/Desktop/czech.pl' );
} \my $stdout, \my $stderr;
say "STDOUT is [$stdout]";
For this I get:
STDOUT is [být
]
However, that's easy to fix. Don't use that module. :)
You should also look at the PERL_UNICODE environment variable, which is the same as using the -C option. That allows you to set STDIN/STDOUT/STDERR (and @ARGV) to be UTF-8 without having to alter your scripts.
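A quick way to confirm which of these settings a perl process actually picked up is the ${^UNICODE} variable, which mirrors the -C / PERL_UNICODE flags. The sketch below starts a child perl with -CSA and reads the value back. It assumes a platform where the list form of a piped open is available, and the bit values in the comment are the ones listed in perlrun.

```perl
use strict;
use warnings;

# Run a child perl with -CSA and have it report ${^UNICODE}.
open my $child, '-|', $^X, '-CSA', '-e', 'print ${^UNICODE}'
    or die "Cannot run perl: $!";
my $flags = <$child>;
close $child;

# Bit values from perlrun: 1 = STDIN, 2 = STDOUT, 4 = STDERR, 32 = @ARGV.
printf "STDOUT is %s, \@ARGV is %s\n",
    ($flags & 2)  ? 'UTF-8' : 'bytes',
    ($flags & 32) ? 'UTF-8' : 'bytes';
```

Setting PERL_UNICODE=SA in the environment is meant to have the same effect as -CSA on the command line, which is convenient when the scripts themselves cannot be edited.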