Perl drop down menus and Unicode

I've been going around on this for some time now and can't quite get it. This is Perl 5 on Ubuntu. I have a drop down list on my web page:
$output .= start_form . "Student: " . popup_menu(-name=>'student', -values=>['', @students], -labels=>\%labels, -onChange=>'Javascript:submit()') . end_form;
It's just a set of names in the form "Last, First" that are coming from a SQL Server table. The labels are created from the SQL columns like so:
$labels{uc($record->{'id'})} = $record->{'lastname'} . ", " . $record->{'firstname'};
The issue is that the drop down isn't displaying some Unicode characters correctly. For instance, "Søren" shows up in the drop down as "SÃ¸ren". I have in my header:
use utf8;
binmode(STDOUT, ":utf8");
...and I've also played around with various takes on the "decode( )" function, to no avail. To me, the funny thing is that if I pull $labels into a test script and print the list to the console, the names appear just fine! So what is it about the drop down that is causing this? Thank you in advance.
EDIT:
This is the relevant functionality, which I've stripped down to this script that runs in the console and yields the correct results for three entries that have Unicode characters:
#!/usr/bin/perl
use DBI;
use lib '/home/web/library';
use mssql_util;
use Encode;
binmode(STDOUT, ":utf8");
$query = "[SQL query here]";
$dbh = &connect;
$sth = $dbh->prepare($query);
$result = $sth->execute();
while ($record = $sth->fetchrow_hashref())
{
    if ($record->{'id'})
    {
        $labels{uc($record->{'id'})} = Encode::decode('UTF-8', $record->{'lastname'} . ", " . $record->{'nickname'} . " (" . $record->{'entryid'} . ")");
    }
}
$sth->finish();
print "$labels{'ST123'}\n";
print "$labels{'ST456'}\n";
print "$labels{'ST789'}\n";
The difference is that the production script, instead of printing to the console as above, prints to HTTP:
$my_output = "<p>$labels{'ST123'}</p><br>
<p>$labels{'ST456'}</p><br>
<p>$labels{'ST789'}</p>";
$template =~ s/\$body/$my_output/;
print header(-cookie=>$cookie) . $template;
This gives strings like "ZoÃ«" and "SÃ¸ren" on the page. BUT, if I remove binmode(STDOUT, ":utf8"); from the top of the production script, then the strings appear just fine on the page (i.e. I get "Zoë" and "Søren").
I believe that the binmode( ) line is necessary when writing UTF-8 to output, and yet removing it here produces the correct results. What gives?

Problem #1: Decoding inputs
53.C3.B8.72.65.6E is the UTF-8 encoding of Søren. When you instruct Perl to encode it all over again (by printing it to a handle with the :utf8 layer), you are producing garbage.
You need to decode your inputs ($record->{id}, $record->{lastname}, $record->{firstname}, etc.)! This will transform the UTF-8 bytes 53.C3.B8.72.65.6E ("encoded text") into the Unicode code points 53.F8.72.65.6E ("decoded text").
In this form, you will be able to use uc, regex matches, etc. You will also be able to print them out to a handle with an encoding layer (e.g. :encoding(UTF-8), or the improper :utf8).
You let on that these inputs come from a database. Most DBDs have a flag that causes strings to be decoded for you. For example, if it's a MySQL database, you should pass mysql_enable_utf8mb4 => 1 to connect.
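If your driver doesn't offer such a flag (the question goes through a local mssql_util wrapper, so check what your client charset actually is), here is a minimal sketch of decoding each field by hand with Encode, reusing the question's field names and $sth, and assuming the driver hands back UTF-8 bytes:
use Encode qw(decode);

my %labels;
while ( my $record = $sth->fetchrow_hashref() ) {
    next unless $record->{id};
    my $id        = decode( 'UTF-8', $record->{id} );        # bytes -> characters
    my $lastname  = decode( 'UTF-8', $record->{lastname} );
    my $firstname = decode( 'UTF-8', $record->{firstname} );
    $labels{ uc($id) } = "$lastname, $firstname";            # uc() now sees characters
}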
Problem #2: Communicating encoding
If you're going to output UTF-8, don't tell the browser it's ISO-8859-1!
$ perl -e'use CGI qw( :standard ); print header()'
Content-Type: text/html; charset=ISO-8859-1
Fixed:
$ perl -e'use CGI qw( :standard ); print header( -type => "text/html; charset=UTF-8" )'
Content-Type: text/html; charset=UTF-8
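CGI.pm's header() also accepts a -charset argument; the following should produce the same header (a sketch, not checked against every CGI.pm version):
$ perl -e'use CGI qw( :standard ); print header( -charset => "UTF-8" )'
Content-Type: text/html; charset=UTF-8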

Hard to give a definitive solution as you don't give us much useful information. But here are some pointers that might help.
use utf8 only tells Perl that your source code is encoded as UTF-8. It does nothing useful here.
Reading perldoc perlunitut would be a good start.
Do you know how your database tables are encoded?
Do you know whether your database connection is configured to automatically decode data coming from the database into Perl characters?
What encoding are you telling the browser that you have encoded your HTTP response in?

Related

How to handle form input from web with Unicode and/or emoji?

I'm working on code that accepts input from a web-based form. In some fields (like first and last name), users may enter special characters like umlauts. In other fields, like textarea comments, they may enter umlauts or even emoji. The input needs to be handled as entered by the user. The input is saved to MySQL using DBI queries with placeholders.
Do I need to untaint all input used in queries? And if so, how is untaint performed when the data from users may contain special characters and emoji?
I can untaint using something like this command line test, but the umlauts and emoji are stripped out.
#!/usr/bin/perl -T
use strict;
use warnings;
use Scalar::Util qw(tainted);
my $v = shift || die 'nothing';
$v =~ /(([a-z]|[A-Z]|[0-9]| )+)/;
$v = $1;
if (tainted($v)) {
    die 'input is tainted';
}
print "$v\n";
exit;
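For reference, here is a sketch of one way to keep those characters (assuming the terminal passes UTF-8 bytes): decode the argument first, then untaint with Unicode property classes, so letters from any script, marks, digits, punctuation, symbols (emoji fall under \p{S}) and spaces all survive:
#!/usr/bin/perl -T
use strict;
use warnings;
use Encode qw(decode);
use Scalar::Util qw(tainted);

binmode STDOUT, ':encoding(UTF-8)';

# Command-line arguments arrive as bytes; decode them into characters first.
my $v = decode( 'UTF-8', shift // die "nothing\n" );

# Untaint via Unicode property classes instead of [a-zA-Z0-9 ].
if ( $v =~ /\A([\p{L}\p{M}\p{N}\p{P}\p{S}\p{Zs}]+)\z/ ) {
    $v = $1;    # the capture is no longer tainted
} else {
    die "input contains unexpected characters\n";
}

die "input is tainted\n" if tainted($v);
print "$v\n";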
Edit
Based on further testing and the comments, I've come to the following conclusions:
If input will be used only as placeholder data in a query with $dbh->bind_param, untaint is not needed.
"Bind values are passed to the database separately from the SQL statement, so there's no need to 'wrap up' the value in SQL quoting rules."
Lastly, the handling of umlaut and emoji characters in my test seems to be correct. Output in the browser looks correct, and values in the database also look correct. I'm able to search using ascii terms and MySQL matches the appropriate umlaut characters, etc.
CentOS 8
Perl 5, version 26
MySQL 8.0.17 (using CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci)
DBI->connect (using mysql_enable_utf8mb4 => 1)
No binmode is used in the Perl script
HTML has <meta charset="utf-8"> in the <head> section
The input is handled using:
use CGI;
use Encode qw(decode_utf8);
my $q = CGI->new;
my $v = Encode::decode_utf8($q->param('comment') || 0);
I mention all of these details because, even though it looks like everything is working correctly, I could still be doing something wrong.
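One possible simplification (a sketch, not something I have in production): CGI.pm has a -utf8 import pragma that decodes incoming parameters itself, replacing the explicit decode_utf8 call. The CGI docs warn that it interferes with binary file uploads, so it only fits forms without uploads.
use CGI qw(-utf8);

my $q = CGI->new;
my $v = $q->param('comment') // '';   # already decoded to characters by CGI.pm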

Properly displaying UTF-8 chars in Perl

I am running perl 5, version 24, subversion 3 (v5.24.3) built for MSWin32-x64-multi-thread
(with 1 registered patch, see perl -V for more detail) (Active State).
Trying to parse HTML page encoded in UTF-8:
$request = new HTTP::Request('GET', $url);
$response = $ua->request($request);
$content = $response->content();
I parse $content as one giant string using the index and substr functions; that works fine.
The HTML page contains a string with the value ÖBB and I need to insert it in the database exactly as ÖBB.
When I print it and insert it in the db, instead of Ö I get some ASCII characters.
NOTE: this question is not database related; MySQL handles utf-8 just fine, so if I insert value "ÖBB" it will take it no problem.
I've looked at a great number of similar questions/answers here and in other forums and I am none the wiser.
use utf8 and binmode(STDOUT, ":utf8") have not worked for me...
Would greatly appreciate a code snippet that would solve the issue, thank you.
Decode inputs; encode outputs.
First of all, you don't decode your inputs.
$response->content returns the raw content that could be in any encoding. Use $response->decoded_content(); to get the decoded response if it's HTML.
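A minimal sketch of that step, reusing the question's request (and assuming $url is already set):
use LWP::UserAgent;
use HTTP::Request;

my $ua       = LWP::UserAgent->new;
my $request  = HTTP::Request->new( 'GET', $url );
my $response = $ua->request($request);

# decoded_content() honours the charset from the response headers (and, for
# HTML, the meta tags) and returns Perl characters instead of raw bytes.
my $content = $response->decoded_content();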
Second of all, you might not be encoding your outputs.
You didn't specify which database driver you use. Most DBI drivers have an option you need to specify. For example, with MySQL, you want
my $dbh = DBI->connect(
    'dbi:mysql:...',
    $user, $password,
    {
        mysql_enable_utf8mb4 => 1,
        ...
    },
);
You mentioned use utf8;. That tells Perl that your source code is encoded using UTF-8 rather than ASCII. Do use it if your source code is encoded using UTF-8.
This is not directly related to your issue.
You mentioned binmode(STDOUT, ":utf8"). That's a very poor way of writing
use open ':std', ':encoding(UTF-8)';
The above handles STDIN, STDOUT and STDERR, and does so at compile time. It also sets the default encoding for files opened in the scope of the pragma.
But that's assuming the terminal expects UTF-8. That would be the case if you used chcp 65001. For a version that handles whatever encoding the terminal expects, you can use the following:
BEGIN {
    require Win32;

    my $cie = "cp" . Win32::GetConsoleCP();
    my $coe = "cp" . Win32::GetConsoleOutputCP();
    my $ae  = "cp" . Win32::GetACP();

    binmode(STDIN,  ":encoding($cie)");
    binmode(STDOUT, ":encoding($coe)");
    binmode(STDERR, ":encoding($coe)");

    require open;
    "open"->import(":encoding($ae)");
}
This has a few more details.
This is not directly related to your issue.
This is what worked:
use Win32::API;
binmode(STDOUT, ":unix:utf8");
my $SetConsoleOutputCP = Win32::API->new( 'kernel32.dll', 'SetConsoleOutputCP', 'N', 'N' );
$SetConsoleOutputCP->Call(65001);
All this was on the surface and I simply overlooked it ;-)
For the MySQL db to work right and accept UTF-8 encoded strings, this connection parameter had to be enabled:
mysql_enable_utf8 => 1,
There are several components involved when you capture a webpage and output it to the screen.
For the moment, let's assume that you use Windows and run the following script in a terminal window.
First, confirm that your terminal supports UTF-8 encoding: type the command chcp and see whether it outputs 65001.
If it does, you are set; if it does not, issue the command chcp 65001.
Run the script with perl script_name.pl and you should get output with ÖBB included in the terminal window:
use strict;
use warnings;
use utf8;
use feature 'say';
use HTTP::Tiny;
my $url = shift || 'https://www.thetrainline.com/en/train-companies/obb';
my $response = HTTP::Tiny->new->get($url);
if ($response->{success}) {
    my $html = $response->{content};
    $html =~ m/(<p>Planning.+pets.<\/p>)/;
    say $1;
}
To store data in UTF-8 encoding in the database, the database should be configured to support UTF-8.
In the case of a MySQL database, the command looks like the following:
CREATE DATABASE mydb
CHARACTER SET utf8
COLLATE utf8_general_ci;
See the MySQL documentation for details.

Perl UTF8 in CGI problems

I have a very simple Perl script which works right on the terminal, but when run as a CGI script it produces garbage. The script basically takes HTML-entity-encoded data, decodes it, and prints it. I have tried various setups, like using Encode to change the output and setting STDOUT to utf8 mode, and it does not help. I have also tried changing the CGI environment to see if things would work like they do in the terminal environment. Still no luck.
Here is the script
#!/usr/bin/perl
use HTML::Entities qw(encode_entities_numeric decode_entities);
use Encode qw/encode decode/;
binmode(STDOUT, ":utf8");
#$ENV{'PERL_UNICODE'} = 'D';
#$ENV{'LANG'} = 'en_US.UTF-8';
#$ENV{'TERM'} = 'vt100';
#$ENV{'SHELL'} = '/bin/bash';
#binmode(STDOUT, ":utf8");
print "Content-type: text/html\n\n";
my $y = decode_entities("Συστήματα_&#x391;νίχνευσης_Εισ.pdf");
#print encode("UTF8",$y);
print $y;
The output on terminal it is clean like
perl test.pl
Content-type: text/html
Συστήματα_Ανίχνευσης_Εισ.pdf
But on the CGI print it is garbled
ΣυστηÌματα_ΑνιÌχνευσης_Εισ.pdf
I am sort of stuck as I cannot find any simple way to solve this. I tried encode_utf8 and utf8::upgrade on the variable, but still no luck. Anyone's experience here will help a lot!
Thanks
Vijay
When interpreting an HTML document, the browser needs to know the encoding. The default encoding as per the HTML standard is not UTF-8. Since the browser is assuming the wrong encoding, it reads garbage.
Instead, you should specify the encoding explicitly, such as by printing a meta tag
<meta charset="utf-8">
or by including the encoding in the content type:
Content-type: text/html; charset=utf-8
Here, using the content type would seem most appropriate.
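Applied to the script in the question, that is a one-line change to the header (a sketch; the binmode line stays, so the decoded string is encoded exactly once on output):
binmode(STDOUT, ":utf8");
print "Content-type: text/html; charset=utf-8\n\n";
my $y = decode_entities("Συστήματα_&#x391;νίχνευσης_Εισ.pdf");
print $y;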

How do I avoid double UTF-8 encoding in XML::LibXML

My program receives UTF-8 encoded strings from a data source. I need to tamper with these strings, then output them as part of an XML structure.
When I serialize my XML document, it will be double encoded and thus broken. When I serialize only the root element, it will be fine, but of course lacking the header.
Here's a piece of code trying to visualize the problem:
use strict; use diagnostics; use feature 'unicode_strings';
use utf8; use v5.14; use encoding::warnings;
binmode(STDOUT, ":encoding(UTF-8)"); use open qw( :encoding(UTF-8) :std );
use XML::LibXML;
# Simulate actual data source with a UTF-8 encoded file containing '¿Üßıçñíïì'
open( IN, "<", "./input" ); my $string = <IN>; close( IN ); chomp( $string );
$string = "Value of '" . $string . "' has no meaning";
# create example XML document as <response><result>$string</result></response>
my $xml = XML::LibXML::Document->new( "1.0", "UTF-8" );
my $rsp = $xml->createElement( "response" ); $xml->setDocumentElement( $rsp );
$rsp->appendTextChild( "result", $string );
# Try to forward the resulting XML to a receiver. Using STDOUT here, but files/sockets etc. yield the same results
# This will not warn and be encoded correctly but lack the XML header
print( "Just the root document looks good: '" . $xml->documentElement->serialize() . "'\n" );
# This will include the header but wide chars are mangled
print( $xml->serialize() );
# This will even issue a warning from encoding::warnings
print( "The full document looks mangled: '" . $xml->serialize() . "'\n" );
Spoiler 1: Good case:
<response><result>Value of '¿Üßıçñíïì' has no meaning</result></response>
Spoiler 2: Bad case:
<?xml version="1.0" encoding="UTF-8"?><response><result>Value of '¿ÃÃıçñíïì' has no meaning</result></response>
The root element and its contents are already UTF-8 encoded. XML::LibXML accepts the input and is able to work on it and output it again as valid UTF-8. As soon as I try to serialize the whole XML document, the wide characters inside get mangled. In a hex dump, it looks as if the already UTF-8 encoded string gets passed through a UTF-8 encoder again. I've searched, tried and read a lot, from Perl's own Unicode tutorial all the way through tchrist's great answer to the Why does modern Perl avoid UTF-8 by default? question. I don't think this is a general Unicode problem, though, but rather a specific issue between me and XML::LibXML.
What do I need to do to be able to output a full XML document including the header so that its contents remain correctly encoded? Is there a flag/property/switch to set?
(I'll gladly accept links to the corresponding part(s) of TFM that I should have read, as long as they are actually helpful ;)
ikegami is correct, but he didn't really explain what's wrong. To quote the docs for XML::LibXML::Document:
IMPORTANT: unlike toString for other nodes, on document nodes this function returns the XML as a byte string in the original encoding of the document (see the actualEncoding() method)!
(serialize is just an alias for toString)
When you print a byte string to a filehandle marked with an :encoding layer, it gets encoded as if it were ISO-8859-1. Since you have a string containing UTF-8 bytes, it gets double encoded.
As ikegami said, use binmode(STDOUT) to remove the encoding layer from STDOUT. You could also decode the result of serialize back into characters before printing it, but that assumes the document is using the same encoding you have set on your output filehandle. (Otherwise, you'll emit a XML document whose actual encoding doesn't match what its header claims.) If you're printing to a file instead of STDOUT, open it with '>:raw' to avoid double encoding.
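A sketch of the "decode it back" option, assuming the document's declared encoding (UTF-8 here) matches the layer on the output handle:
use Encode qw(decode);

# serialize() on a document node returns bytes in the document's encoding;
# decoding them lets the :encoding(UTF-8) layer on STDOUT encode them exactly once.
print decode( 'UTF-8', $xml->serialize() );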
Since XML documents are parsed without needing any external information, they are binary files rather than text files.
You're telling Perl to encode anything sent to STDOUT[1], but then you proceed to output an XML document to it. You can't apply a character encoding to a binary file as it corrupts it.
Replace
binmode(STDOUT, ":encoding(UTF-8)");
with
binmode(STDOUT);
Note: This assumes the rest of the text you are outputting is just temporary debugging information. The output doesn't otherwise make sense.
In fact, you do this twice! Once using use open qw( :encoding(UTF-8) :std );, and then a second time using binmode(STDOUT, ":encoding(UTF-8)");.
I prefer not to change the settings of STDOUT, because of the differing behaviour of toString() in the two modules XML::LibXML::Document and XML::LibXML::Element.
So I prefer to add Encode::encode where it is required. You can run the following example:
use strict;
use warnings FATAL => 'all';
use Encode;
use XML::LibXML;
my ( $doc, $main, $nodelatin, $nodepolish );
$doc = XML::LibXML::Document->createDocument( '1.0', 'UTF-8' );
$main = $doc->createElement('main');
$doc->addChild($main);
$nodelatin = $doc->createElement('latin');
$nodelatin->appendTextNode('Lorem ipsum dolor sit amet');
$main->addChild($nodelatin);
print __LINE__, ' ', $doc->toString(); # printed OK
print __LINE__, ' ', $doc->documentElement()->toString(), "\n\n"; # printed OK
$nodepolish = $doc->createElement('polish');
$nodepolish->appendTextNode('Zażółć gęślą jaźń');
$main->addChild($nodepolish);
print __LINE__, ' ', $doc->toString(); # printed OK
print __LINE__, ' ', Encode::encode("UTF-8", $doc->documentElement()->toString()), "\n"; # printed OK
print __LINE__, ' ', $doc->documentElement()->toString(), "\n"; # Wide character in print

perl save utf-8 text problem

I am playing around the pplog, a single file file base blog.
The writing to file code:
open(FILE, ">$config_postsDatabaseFolder/$i.$config_dbFilesExtension");
my $date = getdate($config_gmt);
print FILE $title.'"'.$content.'"'.$date.'"'.$category.'"'.$i; # 0: Title, 1: Content, 2: Date, 3: Category, 4: FileName
print 'Your post '. $title.' has been saved. Go to Index';
close FILE;
The input text:
春眠不覺曉,處處聞啼鳥. 夜來風雨聲,花落知多小.
After store to file, it becomes:
春眠不覺�›�,處處聞啼鳥. 夜來風�›�聲,花落知多小.
I can use Eclipse to edit the file and make it render to normal. The problem exists during printing to the file.
Some basic info:
Strawberry Perl 5.12
without use utf8;
tried use utf8;, doesn't have any effect.
Thank you.
--- EDIT ---
Thanks for the comments. I traced the code.
The code that adds new content:
# Blog Add New Entry Page
my $pass = r('pass');
#BK 7JUL09 patch from fedekun, fix post with no title that caused zero-byte message...
my $title = r('title');
my $content = '';
if($config_useHtmlOnEntries == 0)
{
$content = bbcode(r('content'));
}
else
{
$content = basic_r('content');
}
my $category = r('category');
my $isPage = r('isPage');
sub r
{
    escapeHTML(param($_[0]));
}
sub r forwards the call to a CGI.pm function.
In CGI.pm
sub escapeHTML {
    # hack to work around earlier hacks
    push @_, $_[0] if @_ == 1 && $_[0] eq 'CGI';
    my ($self, $toencode, $newlinestoo) = CGI::self_or_default(@_);
    return undef unless defined($toencode);
    $toencode =~ s{&}{&amp;}gso;
    $toencode =~ s{<}{&lt;}gso;
    $toencode =~ s{>}{&gt;}gso;
    if ($DTD_PUBLIC_IDENTIFIER =~ /[^X]HTML 3\.2/i) {
        # $quot; was accidentally omitted from the HTML 3.2 DTD -- see
        # <http://validator.w3.org/docs/errors.html#bad-entity> /
        # <http://lists.w3.org/Archives/Public/www-html/1997Mar/0003.html>.
        $toencode =~ s{"}{&#34;}gso;
    }
    else {
        $toencode =~ s{"}{&quot;}gso;
    }
    # Handle bug in some browsers with Latin charsets
    if ($self->{'.charset'}
        && (uc($self->{'.charset'}) eq 'ISO-8859-1'   # This line causes the trouble: it treats the Chinese chars as ISO-8859-1
            || uc($self->{'.charset'}) eq 'WINDOWS-1252')) {
        $toencode =~ s{'}{&#39;}gso;
        $toencode =~ s{\x8b}{&#8249;}gso;
        $toencode =~ s{\x9b}{&#8250;}gso;
        if (defined $newlinestoo && $newlinestoo) {
            $toencode =~ s{\012}{&#10;}gso;
            $toencode =~ s{\015}{&#13;}gso;
        }
    }
    return $toencode;
}
Tracing the problem further, I found that the browser defaulted to ISO-8859-1; even when manually set to UTF-8, it sent the string back to the server as ISO-8859-1.
Finally,
print header(-charset => qw(utf-8)), '<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8" />
I added the -charset => qw(utf-8) param to header(). The Chinese poem is still a Chinese poem.
Thanks to Schwern's comments, which inspired me to trace out the problem and learn a lesson.
To get UTF-8 really working in Perl, you have to flip on a lot of individual features. use utf8 only makes your source code UTF-8 (strings, variables, regexes...); you have to handle file handles separately.
It's complicated, and the simplest thing is to use utf8::all, which will make UTF-8 the default for your code, your files, @ARGV, STDIN, STDOUT and STDERR. UTF-8 support is constantly improving in Perl, and utf8::all will pick up new features as they become available.
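A minimal sketch, assuming the utf8::all module from CPAN is installed:
use utf8::all;    # UTF-8 for the source code, @ARGV, STDIN/STDOUT/STDERR and the default open() layer

print "春眠不覺曉\n";    # prints correctly on a UTF-8 terminal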
I'm unsure of how your code can produce that output—for example, the quote marks are missing. Of course, this could be due to "corruption" somewhere between your file and me seeing the page. SO may filter corrupted UTF-8. I suggest providing hex dumps in the future!
Anyway, to get UTF-8 output working in Perl, there are several approaches:
Work with character data; that is, let Perl know that your variables contain Unicode. This is probably the best method. Confirm that utf8::is_utf8($var) is true (you do not need to, and should not, use utf8 for this). If it is not, look into the Encode module's decode function to let Perl know it's Unicode. Once Perl knows your data is characters, that print will give warnings (which you do have enabled, right?). To fix this, enable the :utf8 or :encoding(utf8) layer on your file handle (the latter version provides error checking). You can do this in your open (open FILE, '>:utf8', "$fname") or alternatively enable it with binmode (binmode FILE, ':utf8'). Note that you can also use other encodings; see the encoding and PerlIO::encoding docs. A short sketch of this approach, applied to the code from the question, is shown below.
Treat your Unicode as opaque binary data. utf8::is_utf8($var) must be false. You must be very careful when manipulating strings; for example, if you've got UTF-16BE, this would be a bad idea: print "$data\n", because you actually need print "$data\0\n". UTF-8 has fewer of these issues, but you need to be aware of them.
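Here is the promised sketch of the first approach, applied to the pplog write from the question (assuming the form data arrives as UTF-8 bytes; the variable names are taken from the question):
use Encode qw(decode);

# The form fields arrive as UTF-8 bytes; turn them into characters first
# (the same applies to $title, $category and the rest).
my $content_chars = decode( 'UTF-8', $content );

# An :encoding layer on the handle writes those characters back out as UTF-8.
open( my $fh, '>:encoding(UTF-8)',
    "$config_postsDatabaseFolder/$i.$config_dbFilesExtension" )
    or die "Cannot open post file: $!";
print {$fh} $content_chars;
close($fh);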
I suggest reading the perluniintro, perlunitut, perlunicode, and perlunifaq manpages/pods.
Also, use utf8; just tells Perl that your script is written in UTF-8. Its effects are very limited; see its pod docs.
You're not showing the code that is actually running. I successfully processed the text you supplied as input with both 5.10.1 on Cygwin and 5.12.3 on Windows. So I suspect a bug in your code. Try narrowing down the problem by writing a short, self-contained test case.