Properly displaying UTF-8 chars in Perl

I am running perl 5, version 24, subversion 3 (v5.24.3) built for MSWin32-x64-multi-thread
(with 1 registered patch, see perl -V for more detail) (Active State).
Trying to parse HTML page encoded in UTF-8:
$request = new HTTP::Request('GET', $url);
$response = $ua->request($request);
$content = $response->content();
I parse the $content as one giant string using the index and substr functions; that works fine.
The HTML page contains a string with the value ÖBB and I need to insert it in the database exactly as ÖBB.
When I print it and insert it in the db, instead of Ö I get some ascii characters.
NOTE: this question is not database related; MySQL handles utf-8 just fine, so if I insert value "ÖBB" it will take it no problem.
I've looked at a great number of similar questions/answers here and in other forums and I am none the wiser.
use utf8 and binmode(STDOUT, ":utf8") have not worked for me...
Would greatly appreciate a code snippet that would solve the issue, thank you.

Decode inputs; encode outputs.
First of all, you don't decode your inputs.
$response->content returns the raw content that could be in any encoding. Use $response->decoded_content(); to get the decoded response if it's HTML.
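As a minimal sketch of that decode/encode boundary, the following uses only the core Encode module. The byte string is a hand-written stand-in for what $response->content might return for a UTF-8 page; it is an assumption for illustration, not data fetched from the page in the question.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed input: the raw UTF-8 bytes for "ÖBB", standing in for what
# $response->content would hand back for a UTF-8 encoded page.
my $raw = "\xC3\x96BB";

# Decode at the input boundary: bytes become characters.
my $text = decode('UTF-8', $raw);

# Encode at the output boundary: characters become bytes again.
my $out = encode('UTF-8', $text);

printf "raw bytes: %d, characters: %d, encoded bytes: %d\n",
    length($raw), length($text), length($out);
```

Between the two boundaries, functions such as index and substr count characters rather than bytes, so Ö is a single position in $text.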
Second of all, you might not be encoding your outputs.
You didn't specify which database driver you use. Most DBI drivers have an option you need to specify. For example, with MySQL, you want
my $dbh = DBI->connect(
'dbi:mysql:...',
$user, $password,
{
mysql_enable_utf8mb4 => 1,
...
},
);
You mentioned use utf8;. That tells Perl that your source code is encoded using UTF-8 rather than ASCII. Do use it if your source code is encoded using UTF-8.
This is not directly related to your issue.
You mentioned binmode(STDOUT, ":utf8"). That's a very poor way of writing
use open ':std', ':encoding(UTF-8)';
The above handles that for STDIN, STDOUT and STDERR, and does so at compile time. It also sets the default for files opened in the scope of the pragma.
But that's assuming the terminal expects UTF-8. That would be the case if you used chcp 65001. For a version that handles whatever encoding the terminal expects, you can use the following:
BEGIN {
require Win32;
my $cie = "cp" . Win32::GetConsoleCP();
my $coe = "cp" . Win32::GetConsoleOutputCP();
my $ae = "cp" . Win32::GetACP();
binmode(STDIN, ":encoding($cie)");
binmode(STDOUT, ":encoding($coe)");
binmode(STDERR, ":encoding($coe)");
require open;
"open"->import(":encoding($ae)");
}
This has a few more details.
This is not directly related to your issue.
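To see what an encoding layer such as :encoding(UTF-8) does to printed text, here is a small sketch that writes to an in-memory buffer instead of the terminal, so the resulting bytes can be inspected. The variable names are illustrative only.

```perl
use strict;
use warnings;

# "ÖBB" as decoded characters (Ö is U+00D6).
my $text = "\x{D6}BB";

# Open an in-memory file handle with an :encoding(UTF-8) layer.
my $buffer = '';
open(my $fh, '>:encoding(UTF-8)', \$buffer) or die "open failed: $!";
print {$fh} $text;
close($fh) or die "close failed: $!";

# Show the bytes that the layer produced, as dot-separated hex.
my $hex = join '.', map { sprintf '%02X', ord } split //, $buffer;
print "$hex\n";
```

The single character Ö comes out as the two bytes C3.96, which is what a terminal expecting UTF-8 needs to receive.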

This is what worked:
use Win32::API;
binmode(STDOUT, ":unix:utf8");
$SetConsoleOutputCP= new Win32::API( 'kernel32.dll',
'SetConsoleOutputCP', 'N','N' );
$SetConsoleOutputCP->Call(65001);
All this was on the surface and I simply overlooked it ;-)
For MySQL db to work right and accept utf-8 encoded string this connection parameter had to be enabled:
mysql_enable_utf8 => 1,

Several components are involved when you capture a webpage and output it to the screen.
For the moment, let's assume that you use Windows and run the following script in a terminal window.
First you need to confirm that your terminal supports UTF-8 encoding. Type the command chcp and see whether it outputs 65001.
If it does, then you are set; if it does not, then issue the command chcp 65001.
Run the script with the command perl script_name.pl and you should get output with ÖBB included in the terminal window:
use strict;
use warnings;
use utf8;
use feature 'say';
use HTTP::Tiny;
my $url = shift || 'https://www.thetrainline.com/en/train-companies/obb';
my $response = HTTP::Tiny->new->get($url);
if ($response->{success}) {
my $html = $response->{content};
# Print the paragraph only if the pattern actually matched.
say $1 if $html =~ m/(<p>Planning.+pets.<\/p>)/;
}
To store data in UTF-8 encoding in a database, the database should be configured to support UTF-8 encoding.
In the case of a MySQL database, the command should look like the following:
CREATE DATABASE mydb
CHARACTER SET utf8
COLLATE utf8_general_ci;
See the following MySQL documentation webpage.

Related

Perl - Validate Chinese character input from web page form?

My Perl script accepts and processes input from a text field in a form on a web page. It was written for the English version of the web page and works just fine.
There is also a Chinese version of the page (a separate page, not both languages on the same page), and now I need my script to work with that. The user input on this page is expected to be in Chinese.
Expecting to need to work in UTF-8, I added
use utf8;
This continues to function just fine on the English page.
But in order to, for example, define a string variable for comparison that uses Chinese characters, I have to save the Perl script itself with utf-8 encoding. As soon as I do that, I get the dreaded 500 server error.
Clearly I'm going about this wrong and any helpful direction will be greatly appreciated.
Thanks.
EDIT - please see my clarification post below.
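To handle UTF-8 properly:
use strict; use warnings;
use utf8; # the source file itself is saved as UTF-8
use open IO => ':encoding(UTF-8)'; # default layer for files opened in this scope
# Use glob references: handle names given as strings are rejected under strict refs.
binmode $_, ':encoding(UTF-8)' for \*STDIN, \*STDOUT, \*STDERR;
open(my $fh, '<:encoding(UTF-8)', '/file/path') or die $!; # if you need a file-handle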
# code.....
Check
why-does-modern-perl-avoid-utf-8-by-default
perluniintro
I'm sorry - I think I poorly expressed my question by including too much information.
The issue is - if I save my script in ANSI format and upload it to the server, it works just fine for the English page. Expecting to want to use Chinese characters in the script, I saved it in UTF-8 format and re-uploaded, and suddenly it throws 500 for the English page.
I tested with a Hello World script:
#!/usr/bin/perl -T
use strict;
use warnings;
print "Content-type: text/html\n\n";
print "Hello, world!\n";
Works fine when saved as ANSI - fails 500 when saved as UTF8.

Perl drop down menus and Unicode

I've been going around on this for some time now and can't quite get it. This is Perl 5 on Ubuntu. I have a drop down list on my web page:
$output .= start_form . "Student: " . popup_menu(-name=>'student', -values=>['', @students], -labels=>\%labels, -onChange=>'Javascript:submit()') . end_form;
It's just a set of names in the form "Last, First" that are coming from a SQL Server table. The labels are created from the SQL columns like so:
$labels{uc($record->{'id'})} = $record->{'lastname'} . ", " . $record->{'firstname'};
The issue is that the drop down isn't displaying some Unicode characters correctly. For instance, "Søren" shows up in the drop down as "SÃ¸ren". I have in my header:
use utf8;
binmode(STDOUT, ":utf8");
...and I've also played around with various takes on the "decode( )" function, to no avail. To me, the funny thing is that if I pull $labels into a test script and print the list to the console, the names appear just fine! So what is it about the drop down that is causing this? Thank you in advance.
EDIT:
This is the relevant functionality, which I've stripped down to this script that runs in the console and yields the correct results for three entries that have Unicode characters:
#!/usr/bin/perl
use DBI;
use lib '/home/web/library';
use mssql_util;
use Encode;
binmode(STDOUT, ":utf8");
$query = "[SQL query here]";
$dbh = &connect;
$sth = $dbh->prepare($query);
$result = $sth->execute();
while ($record = $sth->fetchrow_hashref())
{
if ($record->{'id'})
{
$labels{uc($record->{'id'})} = Encode::decode('UTF-8', $record->{'lastname'} . ", " . $record->{'nickname'} . " (" . $record->{'entryid'} . ")");
}
}
$sth->finish();
print "$labels{'ST123'}\n";
print "$labels{'ST456'}\n";
print "$labels{'ST789'}\n";
The difference in what the production script is doing is that instead of printing to the console like above, it's printing to HTTP:
$my_output = "<p>$labels{'ST123'}</p><br>
<p>$labels{'ST456'}</p><br>
<p>$labels{'ST789'}</p>";
$template =~ s/\$body/$my_output/;
print header(-cookie=>$cookie) . $template;
This gives strings like "ZoÃ«" and "SÃ¸ren" on the page. BUT, if I remove binmode(STDOUT, ":utf8"); from the top of the production script, then the strings appear just fine on the page (i.e. I get "Zoë" and "Søren").
I believe that the binmode( ) line is necessary when writing UTF-8 to output, and yet removing it here produces the correct results. What gives?
Problem #1: Decoding inputs
53.C3.B8.72.65.6E is the UTF-8 encoding for Søren. When you instruct Perl to encode it all over again (by printing it to a handle with the :utf8 layer), you are producing garbage.
You need to decode your inputs ($record->{id}, $record->{lastname}, $record->{firstname}, etc)! This will transform the UTF-8 bytes 53.C3.B8.72.65.6E ("encoded text") into the Unicode Code Points 53.F8.72.65.6E ("decoded text").
In this form, you will be able to use uc, regex matches, etc. You will also be able to print them out to a handle with an encoding layer (e.g. :encoding(UTF-8), or the improper :utf8).
You let on that these inputs come from a database. Most DBD have a flag that causes strings to be decoded. For example, if it's a MySQL database, you should pass mysql_enable_utf8mb4 => 1 to connect.
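The difference between encoding once and encoding twice can be sketched with the core Encode module. The hex_dump helper is a hypothetical name introduced only for this illustration; the byte string is the one quoted above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: render a string as dot-separated hex values,
# one value per byte or code point.
sub hex_dump { join '.', map { sprintf '%02X', ord } split //, shift }

my $encoded = "S\xC3\xB8ren";              # UTF-8 bytes, as they arrive undecoded
my $decoded = decode('UTF-8', $encoded);   # characters (code points)

# Encoding the still-encoded bytes again is what produces the garbage.
my $double = encode('UTF-8', $encoded);

# Encoding the decoded text gives back the original bytes.
my $single = encode('UTF-8', $decoded);

print "encoded: ", hex_dump($encoded), "\n";
print "decoded: ", hex_dump($decoded), "\n";
print "double:  ", hex_dump($double),  "\n";
print "single:  ", hex_dump($single),  "\n";
```

The "double" line shows C3.83.C2.B8 where the single ø should be, which a browser renders as the two characters Ã¸.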
Problem #2: Communicating encoding
If you're going to output UTF-8, don't tell the browser it's ISO-8859-1!
$ perl -e'use CGI qw( :standard ); print header()'
Content-Type: text/html; charset=ISO-8859-1
Fixed:
$ perl -e'use CGI qw( :standard ); print header( -type => "text/html; charset=UTF-8" )'
Content-Type: text/html; charset=UTF-8
Hard to give a definitive solution as you don't give us much useful information. But here are some pointers that might help.
use utf8 only tells Perl that your source code is encoded as UTF-8. It does nothing useful here.
Reading perldoc perlunitut would be a good start.
Do you know how your database tables are encoded?
Do you know whether your database connection is configured to automatically decode data coming from the database into Perl characters?
What encoding are you telling the browser that you have encoded your HTTP response in?

Why do I get garbled output when I decode some HTML entities but not others?

In Perl, I am trying to decode strings which contain numeric HTML entities using HTML::Entities. Some entities work, while "newer" entities don't. For example:
decode_entities('&#174;'); # returns ® as expected
decode_entities('&#937;'); # returns Î© instead of Ω
decode_entities('&#9733;'); # returns â˜… instead of ★
Is there a way to decode these "newer" HTML entities in Perl? In PHP, the html_entity_decode function seems to decode all of these entities without any problem.
The decoding works fine. It's how you're outputting them that's wrong. For example, you may have sent the strings to a terminal without encoding them for that terminal first. This is achieved through the open pragma in the following program:
$ perl -e'
use open ":std", ":encoding(UTF-8)";
use HTML::Entities qw( decode_entities );
CORE::say decode_entities($_)
for "&#174;", "&#937;", "&#9733;";
'
®
Ω
★
Make sure your terminal can handle UTF-8 encoding. It looks like it's having problems with multibyte characters. You can also try to set UTF-8 for STDOUT in case you get wide character warnings.
use strict;
use warnings;
use HTML::Entities;
binmode STDOUT, ':encoding(UTF-8)';
print decode_entities('&#174;'); # returns ®
print decode_entities('&#937;'); # returns Ω
print decode_entities('&#9733;'); # returns ★
This gives me the correct/expected results.

Decode unicode escape characters with perl

I hate to ask a question that's undoubtedly been answered a dozen times before, but I find encoding issues confusing and am having a hard time matching up other people's q/a with my own problem.
I'm pulling information from a json file online, and my perl script isn't handling unicode escape characters properly.
Script looks like this:
use LWP::Simple;
use JSON;
my $url = ______;
my $json = get($url);
my $data = decode_json($json);
foreach my $i (0 .. $#{ $data->{People} }) {
print "$data->{People}[$i]{first_name} $data->{People}[$i]{last_name}\n";
}
It encounters jsons that look like this: "first_name":"F\u00e9lix","last_name":"Cat" and prints them like this: FΘlix Cat
I'm sure there's a trivial fix here, but I'm stumped. I'd really appreciate any help you can provide.
You didn't tell Perl how to encode the output. You need to add
use open ':std', ':encoding(XXX)';
where XXX is the encoding the terminal expects.
On unix boxes, you normally need
use open ':std', ':encoding(UTF-8)';
On Windows boxes, you normally need
use Win32 qw( );
use open ':std', ':encoding(cp'.Win32::GetConsoleOutputCP().')';
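The following sketch illustrates why the unencoded output looked like FΘlix. It assumes the console in the question used code page 437, which the question does not state; that is an inference from the Θ character. It relies only on the core modules JSON::PP and Encode.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);
use Encode qw(decode encode);

my $json = '{"first_name":"F\u00e9lix","last_name":"Cat"}';
my $data = decode_json($json);
my $name = $data->{first_name};

# decode_json returns characters: \u00e9 is now the code point U+00E9.
my $code_point = ord(substr($name, 1, 1));
printf "second character: U+%04X\n", $code_point;

# Encoded for a cp437 console (assumed), that character becomes byte 0x82.
my $cp437_byte = ord(substr(encode('cp437', $name), 1, 1));
printf "byte after encoding to cp437: 0x%02X\n", $cp437_byte;

# Sent unencoded, byte 0xE9 is what cp437 displays as U+0398 (Greek Theta).
my $shown = ord(decode('cp437', "\xE9"));
printf "cp437 byte 0xE9 displays as: U+%04X\n", $shown;
```

Printing through an encoding layer that matches the console converts U+00E9 to the byte the console actually maps to é.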

How to use unicode in perl CGI param

I have a Perl CGI script accepting unicode characters as one of the params.
The url is of the form
.../worker.pl?text="some_unicode_chars"&...
In the perl script, I pass the $text variable to a shell script:
system "a.sh \"$text\" out_put_file";
If I hardcode the text in the perl script, it works well. However, the output makes no sense when $text is got from web using CGI.
my $q = CGI->new;
my $text = $q->param('text');
I suspect the encoding is what causes the problem. UTF-8 has caused me so much trouble. Can anyone please help me?
Perhaps this will help. From Perl Programming/Unicode UTF-8:
By default, CGI.pm does not decode your form parameters. You can use
the -utf8 pragma, which will treat (and decode) all parameters as
UTF-8 strings, but this will fail if you have any binary file upload
fields. A better solution involves overriding the param method:
(example follows)
[Wrong - see Correction] Here's documentation for the utf8 pragma. Since uploading binary data does not appear to be a concern for you, use of the utf8 pragma appears to be the most straightforward approach.
Correction: Per the comment from @Slaven, do not confuse the general Perl utf8 pragma with the -utf8 pragma that has been defined for use with CGI.pm:
-utf8
This makes CGI.pm treat all parameters as UTF-8 strings. Use this with
care, as it will interfere with the processing of binary uploads. It
is better to manually select which fields are expected to return utf-8
strings and convert them using code like this:
use Encode;
my $arg = decode utf8=>param('foo');
Follow Up: duleshi, you ask: But I still don't understand the difference between decode in Encode and utf8::decode. How do the Encode and utf8 modules differ?
From the documentation for the utf8 pragma:
Note that this function does not handle arbitrary encodings. Therefore
Encode is recommended for the general purposes; see also Encode.
Put another way, the Encode module works with many different encodings (including UTF-8), whereas the utf8 functions work only with the UTF-8 encoding.
Here is a Perl program that demonstrates the equivalence of the two approaches to encoding and decoding UTF-8. (Also see the live demo.)
#!/usr/bin/perl
use strict;
use warnings;
use utf8; # allows 'ñ' to appear in the source code
use Encode;
my $word = "Español"; # the 'ñ' is permitted because of the 'use utf8' pragma
# Convert the string to its UTF-8 equivalent.
my $utf8_word = Encode::encode("UTF-8", $word);
# Use 'utf8::decode' to convert the string back to internal form.
my $word_again_via_utf8 = $utf8_word;
utf8::decode($word_again_via_utf8); # converts in-place
# Use 'Encode::decode' to convert the string back to internal form.
my $word_again_via_Encode = Encode::decode("UTF-8", $utf8_word);
# Do the two conversion methods produce the same result?
# Prints 'Yes'.
print $word_again_via_utf8 eq $word_again_via_Encode ? "Yes\n" : "No\n";
# Do we get back the original internal string after converting both ways?
# Prints 'Yes'.
print $word eq $word_again_via_Encode ? "Yes\n" : "No\n";
If you're passing UTF-8 data around in the parameters list, then you definitely want to URI-encode it using the URI::Escape module. This will convert any extended characters to percent values, which are easily printable and readable. On the receiving end you will then need to URI-decode them before continuing.
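As a rough sketch of what that round trip involves, the following hand-rolls the percent-escaping with the core Encode module. The percent_encode and percent_decode helpers are hypothetical names written for this illustration; in real code the URI::Escape functions would normally be used instead.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: UTF-8 encode the text, then percent-escape every
# byte that is not an RFC 3986 unreserved character.
sub percent_encode {
    my ($text) = @_;
    my $bytes = encode('UTF-8', $text);
    $bytes =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

# Hypothetical helper: undo the percent-escapes, then decode the UTF-8 bytes.
sub percent_decode {
    my ($escaped) = @_;
    (my $bytes = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return decode('UTF-8', $bytes);
}

my $word    = "Espa\x{F1}ol";          # "Español" as characters
my $escaped = percent_encode($word);
my $back    = percent_decode($escaped);

print "$escaped\n";
print $back eq $word ? "round trip ok\n" : "round trip failed\n";
```

The escaped form contains only ASCII, so it survives being passed through a URL and can be decoded back to the original characters on the receiving side.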