Perl - Validate Chinese character input from web page form? - perl

My Perl script accepts and processes input from a text field in a form on a web page. It was written for the English version of the web page and works just fine.
There is also a Chinese version of the page (a separate page, not both languages on the same page), and now I need my script to work with that. The user input on this page is expected to be in Chinese.
Expecting to need to work in UTF-8, I added
use utf8;
This continues to function just fine on the English page.
But in order to, for example, define a string variable for comparison that uses Chinese characters, I have to save the Perl script itself with utf-8 encoding. As soon as I do that, I get the dreaded 500 server error.
Clearly I'm going about this wrong, and any helpful direction will be greatly appreciated.
Thanks.
EDIT - please see my clarification post below.

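To handle UTF-8 properly:
use strict; use warnings;
use utf8;                              # the source file itself is UTF-8
use open IO => ':encoding(UTF-8)';     # default layer for handles opened later
binmode $_, ':encoding(UTF-8)' for \*STDIN, \*STDOUT, \*STDERR; # glob refs, since bare names fail under strict
open(my $fh, '<:encoding(UTF-8)', '/file/path') or die "Cannot open: $!"; # if you need a file-handle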
# code.....
See also:
Why does modern Perl avoid UTF-8 by default? (Stack Overflow question)
perluniintro (Perl documentation)

I'm sorry - I think I poorly expressed my question by including too much information.
The issue is - if I save my script in ANSI format and upload it to the server, it works just fine for the English page. Expecting to want to use Chinese characters in the script, I saved it in UTF-8 format and re-uploaded, and suddenly it throws 500 for the English page.
I tested with a Hello World script:
#!/usr/bin/perl -T
use strict;
use warnings;
print "Content-type: text/html\n\n";
print "Hello, world!\n";
Works fine when saved as ANSI - fails 500 when saved as UTF8.
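One possible cause, offered as an assumption since the thread never confirms it: many Windows editors write a byte-order mark (the bytes EF BB BF) at the start of a file saved as "UTF-8". Those bytes land in front of #!, so the system no longer recognizes the interpreter line and the web server reports a 500. A minimal sketch for checking a script for a BOM follows; the helper name has_utf8_bom and the temporary files are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Return true if the file starts with the UTF-8 byte-order mark EF BB BF.
sub has_utf8_bom {
    my ($path) = @_;
    open(my $fh, '<:raw', $path) or die "Cannot open $path: $!";
    my $got = read($fh, my $head, 3);
    close $fh;
    return defined($got) && $got == 3 && $head eq "\xEF\xBB\xBF";
}

# Demo: write the same script with and without a BOM, then check both.
my $script = qq{#!/usr/bin/perl -T\nprint "Content-type: text/html\\n\\n";\n};

my ($fh_plain, $plain) = tempfile(UNLINK => 1);
binmode $fh_plain;
print {$fh_plain} $script;
close $fh_plain;

my ($fh_bom, $with_bom) = tempfile(UNLINK => 1);
binmode $fh_bom;
print {$fh_bom} "\xEF\xBB\xBF", $script;
close $fh_bom;

printf "%s: %s\n", 'plain',    has_utf8_bom($plain)    ? 'BOM found' : 'no BOM';
printf "%s: %s\n", 'with BOM', has_utf8_bom($with_bom) ? 'BOM found' : 'no BOM';
```

If a BOM is found, re-saving the script as UTF-8 without a BOM (the exact menu wording depends on the editor) removes it.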

Related

how to use utf-8 in a perl cgi-bin script?

I have the following cgi bin script:
#! /usr/bin/perl
#
use utf8;
use CGI;
my $q = CGI->new();
my %params = $q->Vars;
print $q->header('text/html');
$w = $params{"words"};
print "$w\n";
I want to be able to call it as cgi-bin/script.pl?words=É for example, but when I do that, what's printed is not UTF-8, but instead garbled:
É
Is there any way to use cgi-bin with utf8?
Your line use utf8 doesn't do anything for you, other than allowing UTF-8 characters in the source file itself. You must make sure that the output handles (on STDOUT as well as any files) are set to utf8. One easy way to handle this is the utf8::all module. Also, make sure you are sending the correct headers, and use the -utf8 CGI pragma to treat incoming parameters as UTF-8. Finally, as always, be sure to use strict and warnings.
The following should get you started:
#!/usr/bin/perl
use strict;
use warnings;
use utf8::all;
use CGI qw(-utf8);
my $q = CGI->new;
print $q->header("text/html;charset=UTF-8");
print $q->param("words");
exit;
I have been having this problem of intermittent failure of utf8 encoding with my CGI script.
I tried everything but couldn't reliably repeat the problem.
I finally discovered that it is absolutely critical to be consistent with your use of the -utf8 pragma in every module that loads CGI:
use CGI qw(-utf8);
What seems to happen is that mod_perl invokes the CGI module just once per request. If CGI is imported inconsistently (say, a utility module that only uses a redirect function imports it without the -utf8 pragma), that import can be the one mod_perl ends up using to decode requests.
You will save yourself a lot of pain in the long run if you start out by reading the perlunitut and perlunicode documentation pages. They will give you the basics on exactly what Unicode and character encodings are, and how to work with them in Perl.
Also, what you're asking for is more complex than you think. There are many layers hidden in the phrase "use cgi-bin with utf8", starting with your interface to whatever tool you're using to send requests to the web server and ending with that tool having parsed a response and presenting it to you. You need to understand all those layers well enough to at least be able to tell if the problem lies in your CGI script or not. For example, it doesn't help if your script works perfectly if the problem is that bash and curl don't agree on the encoding of your command line arguments.
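To look at one of those layers in isolation, here is a minimal sketch of what happens to words=É on the way in, assuming the client sends the parameter as percent-encoded UTF-8 (%C3%89). The helper name decode_query_value is hypothetical, and real code would normally let CGI (with -utf8) do this work.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Turn a percent-encoded query value into a Perl character string.
# Assumes the client sent UTF-8; FB_CROAK makes invalid input die.
sub decode_query_value {
    my ($raw) = @_;
    (my $bytes = $raw) =~ tr/+/ /;                   # '+' means space
    $bytes =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # %XX -> one byte
    return decode('UTF-8', $bytes, Encode::FB_CROAK);
}

my $word = decode_query_value('%C3%89');   # "É" as sent by the client

# One character inside the program, two bytes again on the way out.
my $out = encode('UTF-8', $word);
printf "characters: %d, code point: U+%04X, output bytes: %d\n",
    length($word), ord($word), length($out);
```

If the character count comes out as 2 instead of 1, the decode step is missing somewhere between the request and the script.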

Print other language character in csv using perl file handling

I am scraping a German-language site and trying to store its content in a CSV file using Perl, but I am getting garbage values in the CSV. The code I use is:
open my $fh, '>> :encoding(UTF-8)', 'output.csv';
print {$fh} qq|"$title"\n|;
close $fh;
For example: I expect Weiß, Römersandalen, but I get Weiß, Römersandalen
Update :
Code
use strict;
use warnings;
use utf8;
use WWW::Mechanize::Firefox;
use autodie qw(:all);
my $m = WWW::Mechanize::Firefox->new();
print "\n\n *******Program Begins********\n\n";
$m->get($url) or die "unable to get $url";
my $Home_Con=$m->content;
my $title='';
if($Home_Con=~m/<span id="btAsinTitle">([^<]*?)<\/span>/is){
$title=$1;
print "title ::$1\n";
}
open my $fh, '>> :encoding(UTF-8)', 's.txt'; #<= (Weiß)
print {$fh} qq|"$title"\n|;
close $fh;
open $fh, '>> :encoding(UTF-8)', 's1.csv'; #<= (Weiß)
print {$fh} qq|"$title"\n|;
close $fh;
print "\n\n *******Program ends********";
<>;
This is part of the code. The method works fine for text files, but not for CSV.
You've shown us the code where you're encoding the data correctly as you write it to the file.
What we also need to see is how the data gets into your program. Are you decoding it correctly at that point?
Update:
If the code was really just my $title='Weiß ,Römersandalen' as you say in the comments, then the solution would be as simple as adding use utf8 to your code.
The point is that Perl needs to know how to interpret the stream of bytes that it's dealing with. Outside your program, data exists as bytes in various encodings. You need to decode that data as it enters your program (decoding turns a stream of bytes into a string of characters) and encode it again as it leaves your program. You're doing the encoding step correctly, but not the decoding step.
The reason that use utf8 fixes that in the simple example you've given is that use utf8 tells Perl that your source code should be interpreted as a stream of bytes encoded as utf8. It then converts that stream of bytes into a string of characters containing the correct characters for 'Weiß ,Römersandalen'. It can then successfully encode those characters into bytes representing those characters encoded as utf8 as they are written to the file.
Your data is actually coming from a web page. I assume you're using LWP::Simple or something like that. That data might be encoded as utf8 (I doubt it, given the problems you're having) but it might also be encoded as ISO-8859-1 or ISO-8859-9 or CP1252 or any number of other encodings. Unless you know what the encoding is and correctly decode the incoming data, you will see the results that you are getting.
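The decode-on-input, encode-on-output rule can be sketched with the core Encode module. The byte string below is an assumption standing in for what a UTF-8 web page would deliver; the second half shows how skipping the decode step produces the kind of garbage described in the question.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they might arrive from a UTF-8 page: "Weiß" (ß is C3 9F).
my $bytes = "Wei\xC3\x9F";

# Correct: decode once on input, encode once on output.
my $chars = decode('UTF-8', $bytes);   # 4 characters
my $good  = encode('UTF-8', $chars);   # the same 5 bytes again

# Wrong: no decode. Perl treats each byte as a Latin-1 character,
# so encoding turns the 2 bytes of ß into 4 bytes.
my $bad = encode('UTF-8', $bytes);

printf "decoded length: %d\n", length $chars;
printf "good bytes: %s\n", join ' ', map { sprintf '%02X', ord } split //, $good;
printf "bad bytes:  %s\n", join ' ', map { sprintf '%02X', ord } split //, $bad;
```

The same double encoding happens implicitly when undecoded bytes are printed to a handle opened with an :encoding(UTF-8) layer.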
Check if there are any weird characters at start or anywhere in the file using commands like head or tail

Corrupt Spanish characters when saving variables to a text file in Perl

I think I have an encoding problem. My knowledge of perl is not great. Much better with other languages, but I have tried everything I can think of and checked lots of other posts.
I am collecting a name and address. These can contain non-English characters, in this case Spanish.
A PHP process uses curl to execute a .pl script and passes the values URL-encoded.
The .pl executes a function in a .pm which writes the data to a text file. No database is involved.
Both the .pl and .pm have
use Encode;
use utf8;
binmode (STDIN, 'utf8');
binmode (STDOUT, 'utf8');
defined. Below is the function which is writing the text to a file
sub bookingCSV(@){
my $filename = "test.csv";
utf8::decode($_[1]{booking}->{LeadNameFirst});
open OUT, ">:utf8", $filename;
$_="\"$_[1]{booking}->{BookingNo}¦¦$_[1]{booking}->{ShortPlace}¦¦$_[1]{booking}->{ShortDev}¦¦$_[1]{booking}->{ShortAcc}¦¦$_[1]{booking}->{LeadNameFirst}¦¦$_[1]{booking}->{LeadNameLast}¦¦$_[1]{booking}->{Email}¦¦$_[1]{booking}->{Telephone}¦¦$_[1]{booking}->{Company}¦¦$_[1]{booking}->{Address1}¦¦$_[1]{booking}->{Address2}¦¦$_[1]{booking}->{Town}¦¦$_[1]{booking}->{County}¦¦$_[1]{booking}->{Zip}¦¦$_[1]{booking}->{Country}¦¦";
print OUT $_;
close (OUT);
}
All Spanish characters are corrupted in the text file. I have tried decode on one specific field "LeadNameFirst" but that has not made a difference. I left the code in place just in case it is useful.
Thanks for any help.
What is the encoding of the input? If the input encoding is not utf-8, then it will not do you any good to decode it as utf-8 input.
Does the input come from an HTML form? Then the encoding probably matches the encoding of the web page it came from. ISO-8859-1 is a common default encoding for American/European locales. Anyway, once you discover the encoding, you can decode the input with it:
$name = decode('iso-8859-1',$_[1]{booking}->{LeadNameFirst});
print OUT "name is $name\n"; # utf8 layer already enabled
Some browsers look for and respect an accept-charset attribute inside a <form> tag, e.g.,
<form action="/my_form_processor.php" accept-charset="UTF-8">
...
</form>
This will (cross your fingers) cause you to receive the form input as utf-8 encoded.
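Put together, the flow for a known input encoding looks roughly like this. It is a sketch under the assumption that the input really is ISO-8859-1; the sample name and the temporary file are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);
use File::Temp qw(tempfile);

# "José" as ISO-8859-1 bytes (é is the single byte E9).
my $input = "Jos\xE9";

# Decode with the encoding the data actually uses.
my $name = decode('iso-8859-1', $input);

# Write through a UTF-8 layer; Perl encodes on output.
my ($tmp, $filename) = tempfile(UNLINK => 1);
close $tmp;
open(my $out, '>:encoding(UTF-8)', $filename) or die "Cannot write $filename: $!";
print {$out} $name;
close $out;

# Read the raw bytes back to see what ended up on disk.
open(my $in, '<:raw', $filename) or die "Cannot read $filename: $!";
my $on_disk = do { local $/; <$in> };
close $in;

printf "on disk: %s\n", join ' ', map { sprintf '%02X', ord } split //, $on_disk;
```

Decoding the same bytes as UTF-8 instead would not give é, because a lone E9 byte is not valid UTF-8; that is why the actual input encoding has to be identified first.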

How do I find "wide characters" printed by perl?

A Perl script that scrapes static HTML pages from a website and writes them to individual files appears to work, but also prints many instances of "Wide character in print at ./script.pl line n" to the console: one for each page scraped.
However, a brief glance at the html files generated does not reveal any obvious mistakes in the scraping. How can I find/fix the problem character(s)? Should I even care about fixing it?
The relevant code:
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
...
foreach (@urls) {
$mech->get($_);
print FILE $mech->content; #MESSAGE REFERS TO THIS LINE
...
This is on OSX with Perl 5.8.8.
If you want to fix up the files after the fact, then you could pipe them through fix_latin which will make sure they're all UTF-8 (assuming the input is some mixture of ASCII, Latin-1, CP1252 or UTF-8 already).
For the future, you could use $mech->response->decoded_content, which should give you a decoded character string regardless of what encoding the web server used. Then you would binmode(FILE, ':utf8') before writing to it, to ensure that Perl's internal string representation is converted to UTF-8 bytes on output.
I assume you're crawling images or something of that sort. In that case you can get around the problem by adding binmode(FILE); or, if they are web pages in UTF-8, try binmode(FILE, ':utf8'). See perldoc -f binmode, perldoc perlopentut, and perldoc PerlIO for more information.
The ":bytes", ":crlf", and ":utf8", and any other directives of the form ":...", are called I/O layers. The "open" pragma can be used to establish default I/O layers. See open.
To mark FILEHANDLE as UTF-8, use ":utf8" or ":encoding(utf8)". ":utf8" just marks the data as UTF-8 without further checking, while ":encoding(utf8)" checks the data for actually being valid UTF-8. More details can be found in PerlIO::encoding.
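As for finding the problem characters: the warning fires when a string containing a code point above 255 is printed to a handle that has no encoding layer. A minimal sketch for locating such characters in an already decoded string follows; the helper name find_wide_chars and the sample text are hypothetical.

```perl
use strict;
use warnings;

# Return [offset, code point] pairs for every character above U+00FF.
sub find_wide_chars {
    my ($text) = @_;
    my @found;
    while ($text =~ /([^\x00-\xFF])/g) {
        push @found, [ pos($text) - 1, ord $1 ];
    }
    return @found;
}

# Sample: a curly apostrophe (U+2019) and a euro sign (U+20AC) hide in here.
my $page = "It\x{2019}s 5\x{20AC} now";

for my $hit (find_wide_chars($page)) {
    printf "offset %d: U+%04X\n", @$hit;
}
```

Once the characters are known to be legitimate, setting an encoding layer on the handle before printing makes Perl encode them properly and the warning goes away.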

With a utf8-encoded Perl script, can it open a filename encoded as GB2312?

I'm not talking about reading in the file content in utf-8 or non-utf-8 encoding and stuff. It's about file names. Usually I save my Perl script in the system default encoding, "GB2312" in my case and I won't have any file open problems. But for processing purposes, I'm now having some Perl script files saved in utf-8 encoding. The problem is: these scripts cannot open the files whose names consist of characters encoded in "GB2312" encoding and I don't like the idea of having to rename my files.
Does anyone happen to have any experience in dealing with this kind of situation? Thanks like always for any guidance.
Edit
Here's the minimized code to demonstrate my problem:
# I'm running ActivePerl 5.10.1 on Windows XP (Simplified Chinese version)
# The file system is NTFS
#!perl -w
use autodie;
my $file = "./测试.txt"; #the file name consists of two Chinese characters
open my $in,'<',"$file";
while (<$in>){
print;
}
This test script runs well if saved in "ANSI" encoding (I assume ANSI encoding is the same as GB2312, which is used to display Chinese characters). But it won't work if saved as "UTF-8", and the error message is as follows:
Can't open './娴嬭瘯.txt' for reading: 'No such file or directory'.
In this warning message, "娴嬭瘯" are meaningless junk characters.
Update
I tried first encoding the file name as GB2312 but it does not seem to work :(
Here's what I tried:
#!perl -w
use autodie;
use Encode;
my $file = "./测试.txt";
encode("gb2312", decode("utf-8", $file));
open my $in,'<',"$file";
while (<$in>){
print;
}
My current thinking is: the file name in my OS is 测试.txt but it is encoded as GB2312. In the Perl script the file name looks the same to human eyes, still 测试.txt. But to Perl, they are different because they have different internal representations. But I don't understand why the problem persists when I already converted my file name in Perl to GB2312 as shown in the above code.
Update
I made it, finally made it :)
@brian's suggestion is right. I made a mistake in the above code: I didn't assign the encoded file name back to $file.
Here's the solution:
#!perl -w
use autodie;
use Encode;
my $file = "./测试.txt";
$file = encode("gb2312", decode("utf-8", $file));
open my $in,'<',"$file";
while (<$in>){
print;
}
If you
use utf8;
in your Perl script, that merely tells perl that the source is in UTF-8. It doesn't affect how perl deals with the outside world. Are you turning on any other Perl Unicode features?
Are you having problems with every filename, or just some of them? Can you give us some examples, or a small demonstration script? I don't have a filesystem that encodes names as GB2312, but have you tried encoding your filenames as GB2312 before you call open?
If you want specific strings encoded with a specific encoding, you can use the Encode module. Try that with your filenames that you give to open.
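A sketch of that suggestion, assuming the file system expects GB2312 bytes (as on the asker's Windows XP setup) and that Encode resolves the name gb2312 to its EUC-CN encoding. The code points are written as escapes so the example does not depend on how the script file is saved.

```perl
use strict;
use warnings;
use Encode qw(encode);

# The two characters of the file name, U+6D4B U+8BD5, as a character string.
# With "use utf8;" the literal could be typed directly instead.
my $name = "\x{6D4B}\x{8BD5}.txt";

# Byte strings to hand to open(): one per candidate file-system encoding.
my $gb_name   = encode('gb2312', $name);
my $utf8_name = encode('UTF-8',  $name);

printf "gb2312: %s\n", join ' ', map { sprintf '%02X', ord } split //, $gb_name;
printf "utf-8:  %s\n", join ' ', map { sprintf '%02X', ord } split //, $utf8_name;

# On a GB2312 system the encoded name is what open() needs:
# open(my $in, '<', "./$gb_name") or die "Cannot open: $!";
```

The two byte sequences differ, which is why a script saved as UTF-8 ends up looking for a file name the GB2312 file system does not have.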