Decode unicode escape characters with perl - perl

I hate to ask a question that's undoubtedly been answered a dozen times before, but I find encoding issues confusing and am having a hard time matching up other people's q/a with my own problem.
I'm pulling information from a json file online, and my perl script isn't handling unicode escape characters properly.
Script looks like this:
use LWP::Simple;
use JSON;
my $url = ______;
my $json = get($url);
my $data = decode_json($json);
foreach my $i (0 .. $#{ $data->{People} }) {
print "$data->{People}[$i]{first_name} $data->{People}[$i]{last_name}\n";
}
It encounters jsons that look like this: "first_name":"F\u00e9lix","last_name":"Cat" and prints them like this: FΘlix Cat
I'm sure there's a trivial fix here, but I'm stumped. I'd really appreciate any help you can provide.

You didn't tell Perl how to encode the output. You need to add
use open ':std', ':encoding(XXX)';
where XXX is the encoding the terminal expects.
On unix boxes, you normally need
use open ':std', ':encoding(UTF-8)';
On Windows boxes, you normally need
use Win32 qw( );
use open ':std', ':encoding(cp'.Win32::GetConsoleOutputCP().')';
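To make the distinction between decoded characters and encoded output concrete, here is a minimal sketch. It assumes a payload shaped like the one in the question (the `People` wrapper matches the question's loop) and uses the core JSON::PP module, which exports the same `decode_json` function as JSON. It encodes by hand with Encode rather than through a layer on STDOUT, purely so the byte count is visible.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);   # core module; same decode_json interface as JSON
use Encode qw(encode);

# Sample payload modelled on the question; \u00e9 is the JSON escape for é.
my $json = '{"People":[{"first_name":"F\u00e9lix","last_name":"Cat"}]}';
my $data = decode_json($json);

# After decode_json, the strings hold characters (é is the single character U+00E9).
my @names = map { "$_->{first_name} $_->{last_name}" } @{ $data->{People} };

# Encoding turns characters into the bytes a UTF-8 terminal expects.
my @encoded = map { encode('UTF-8', $_) } @names;

printf "%d characters, %d bytes\n", length $names[0], length $encoded[0];
print "$_\n" for @encoded;
```

With `use open ':std', ':encoding(UTF-8)';` in effect you would print `@names` directly and let the layer do the encoding.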

Related

Properly displaying UTF-8 chars in Perl

I am running perl 5, version 24, subversion 3 (v5.24.3) built for MSWin32-x64-multi-thread
(with 1 registered patch, see perl -V for more detail) (Active State).
Trying to parse HTML page encoded in UTF-8:
$request = new HTTP::Request('GET', $url);
$response = $ua->request($request);
$content = $response->content();
I parse the $content as one giant string using INDEX and SUBSTR functions, that works fine.
HTML page contains string with value ÖBB and I need to insert it in the database exactly as ÖBB
When I print it and insert in the db, instead of Ö I get some ascii characters.
NOTE: this question is not database related; MySQL handles utf-8 just fine, so if I insert value "ÖBB" it will take it no problem.
I've looked at a great number of similar questions/answers here and in other forums and I am none the wiser.
use utf8 and binmode(STDOUT, ":utf8") have not worked for me...
Would greatly appreciate a code snippet that would solve the issue, thank you.
Decode inputs; encode outputs.
First of all, you don't decode your inputs.
$response->content returns the raw content that could be in any encoding. Use $response->decoded_content(); to get the decoded response if it's HTML.
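As a rough illustration of what "decoding" means here, the sketch below uses the core Encode module on a hard-coded byte string instead of a live HTTP response. It assumes the page is UTF-8; decoded_content works the charset out from the response itself, so treat this only as a picture of the byte/character difference.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $raw  = "\xC3\x96BB";             # UTF-8 bytes for "ÖBB", as they arrive off the wire
my $text = decode('UTF-8', $raw);    # decoded text: U+00D6, "B", "B"
my $out  = encode('UTF-8', $text);   # encode again only at the point of output

printf "%d bytes in, %d characters, %d bytes out\n",
    length $raw, length $text, length $out;
```

index and substr operate on characters once the string is decoded, so offsets no longer split a multi-byte sequence.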
Second of all, you might not be encoding your outputs.
You didn't specify which database driver you use. Most DBI drivers have an option you need to specify. For example, with MySQL, you want
my $dbh = DBI->connect(
'dbi:mysql:...',
$user, $password,
{
mysql_enable_utf8mb4 => 1,
...
},
);
You mentioned use utf8;. That tells Perl that your source code is encoded using UTF-8 rather than ASCII. Do use it if your source code is encoded using UTF-8.
This is not directly related to your issue.
You mentioned binmode(STDOUT, ":utf8"). That's a very poor way of writing
use open ':std', ':encoding(UTF-8)';
The above handles that for STDIN, STDOUT and STDERR, and does so at compile time. It also sets the default for files open in scope of the pragma.
But that's assuming the terminal expects UTF-8. That would be the case if you used chcp 65001. For a version that handles whatever encoding the terminal expects, you can use the following:
BEGIN {
require Win32;
my $cie = "cp" . Win32::GetConsoleCP();
my $coe = "cp" . Win32::GetConsoleOutputCP();
my $ae = "cp" . Win32::GetACP();
binmode(STDIN, ":encoding($cie)");
binmode(STDOUT, ":encoding($coe)");
binmode(STDERR, ":encoding($coe)");
require open;
"open"->import(":encoding($ae)");
}
This has a few more details.
This is not directly related to your issue.
This is what worked:
use Win32::API;
binmode(STDOUT, ":unix:utf8");
$SetConsoleOutputCP= new Win32::API( 'kernel32.dll',
'SetConsoleOutputCP', 'N','N' );
$SetConsoleOutputCP->Call(65001);
All this was on the surface and I simply overlooked it ;-)
For MySQL db to work right and accept utf-8 encoded string this connection parameter had to be enabled:
mysql_enable_utf8 => 1,
Several components are involved when you capture a webpage and output it to the screen.
For the moment, let's assume that you use Windows and run the following script in a terminal window.
First, confirm that your terminal is set to UTF-8. Type the command chcp and see whether it outputs 65001.
If it does, you are set; if it does not, issue the command chcp 65001.
Run the script with the command perl script_name.pl and you should get output with ÖBB included in the terminal window.
use strict;
use warnings;
use utf8;
use feature 'say';
use HTTP::Tiny;
my $url = shift || 'https://www.thetrainline.com/en/train-companies/obb';
my $response = HTTP::Tiny->new->get($url);
if ($response->{success}) {
my $html = $response->{content};
$html =~ m/(<p>Planning.+pets.<\/p>)/;
say $1;
}
To store data in UTF-8 in a database, the database must be configured to support UTF-8.
For a MySQL database, the command should look like the following:
CREATE DATABASE mydb
CHARACTER SET utf8
COLLATE utf8_general_ci;
See the following MYSQL documentation webpage.

Perl drop down menus and Unicode

I've been going around on this for some time now and can't quite get it. This is Perl 5 on Ubuntu. I have a drop down list on my web page:
$output .= start_form . "Student: " . popup_menu(-name=>'student', -values=>['', @students], -labels=>\%labels, -onChange=>'Javascript:submit()') . end_form;
It's just a set of names in the form "Last, First" that are coming from a SQL Server table. The labels are created from the SQL columns like so:
$labels{uc($record->{'id'})} = $record->{'lastname'} . ", " . $record->{'firstname'};
The issue is that the drop down isn't displaying some Unicode characters correctly. For instance, "Søren" shows up in the drop down as "SÃ¸ren". I have in my header:
use utf8;
binmode(STDOUT, ":utf8");
...and I've also played around with various takes on the "decode( )" function, to no avail. To me, the funny thing is that if I pull $labels into a test script and print the list to the console, the names appear just fine! So what is it about the drop down that is causing this? Thank you in advance.
EDIT:
This is the relevant functionality, which I've stripped down to this script that runs in the console and yields the correct results for three entries that have Unicode characters:
#!/usr/bin/perl
use DBI;
use lib '/home/web/library';
use mssql_util;
use Encode;
binmode(STDOUT, ":utf8");
$query = "[SQL query here]";
$dbh = &connect;
$sth = $dbh->prepare($query);
$result = $sth->execute();
while ($record = $sth->fetchrow_hashref())
{
if ($record->{'id'})
{
$labels{uc($record->{'id'})} = Encode::decode('UTF-8', $record->{'lastname'} . ", " . $record->{'nickname'} . " (" . $record->{'entryid'} . ")");
}
}
$sth->finish();
print "$labels{'ST123'}\n";
print "$labels{'ST456'}\n";
print "$labels{'ST789'}\n";
The difference in what the production script is doing is that instead of printing to the console like above, it's printing to HTTP:
$my_output = "<p>$labels{'ST123'}</p><br>
<p>$labels{'ST456'}</p><br>
<p>$labels{'ST789'}</p>";
$template =~ s/\$body/$my_output/;
print header(-cookie=>$cookie) . $template;
This gives, i.e., strings like "ZoÃ«" and "SÃ¸ren" on the page. BUT, if I remove binmode(STDOUT, ":utf8"); from the top of the production script, then the strings appear just fine on the page (i.e. I get "Zoë" and "Søren").
I believe that the binmode( ) line is necessary when writing UTF-8 to output, and yet removing it here produces the correct results. What gives?
Problem #1: Decoding inputs
53.C3.B8.72.65.6E is the UTF-8 encoding for Søren. When you instruct Perl to encode it all over again (by printing it to a handle with the :utf8 layer), you are producing garbage.
You need to decode your inputs ($record->{id}, $record->{lastname}, $record->{firstname}, etc)! This will transform the UTF-8 bytes 53.C3.B8.72.65.6E ("encoded text") into the Unicode Code Points 53.F8.72.65.6E ("decoded text").
In this form, you will be able to use uc, regex matches, etc. You will also be able to print them out to a handle with an encoding layer (e.g. :encoding(UTF-8), or the improper :utf8).
You let on that these inputs come from a database. Most DBD have a flag that causes strings to be decoded. For example, if it's a MySQL database, you should pass mysql_enable_utf8mb4 => 1 to connect.
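The double-encoding effect can be reproduced without a database. This is a small sketch that uses the core Encode module and hard-codes the bytes quoted above; the variable names are only illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $from_db = "S\xC3\xB8ren";              # undecoded UTF-8 bytes 53.C3.B8.72.65.6E
my $double  = encode('UTF-8', $from_db);   # what an encoding layer does to undecoded bytes
my $decoded = decode('UTF-8', $from_db);   # code points 53.F8.72.65.6E
my $correct = encode('UTF-8', $decoded);   # decoded text encodes back to the original bytes

# Show the doubly encoded bytes in the same dotted-hex notation.
print join('.', map { sprintf '%02X', ord } split //, $double), "\n";
# prints 53.C3.83.C2.B8.72.65.6E
```

A browser reading those eight bytes as UTF-8 renders "SÃ¸ren", which is the garbage described in the question.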
Problem #2: Communicating encoding
If you're going to output UTF-8, don't tell the browser it's ISO-8859-1!
$ perl -e'use CGI qw( :standard ); print header()'
Content-Type: text/html; charset=ISO-8859-1
Fixed:
$ perl -e'use CGI qw( :standard ); print header( -type => "text/html; charset=UTF-8" )'
Content-Type: text/html; charset=UTF-8
Hard to give a definitive solution as you don't give us much useful information. But here are some pointers that might help.
use utf8 only tells Perl that your source code is encoded as UTF-8. It does nothing useful here.
Reading perldoc perlunitut would be a good start.
Do you know how your database tables are encoded?
Do you know whether your database connection is configured to automatically decode data coming from the database into Perl characters?
What encoding are you telling the browser that you have encoded your HTTP response in?

Escape whitespace when using backticks

I've had a search around, and from my perspective using backticks is the only way I can solve this problem. I'm trying to call the mdls command from Perl for each file in a directory to find its last accessed time. The issue I'm having is that in the file names I have from find I have unescaped spaces, which bash obviously doesn't like. Is there an easy way to escape all of the white space in my file names before passing them to mdls? Please forgive me if this is an obvious question. I'm quite new to Perl.
my $top_dir = '/Volumes/hydrogen/FLAC';
sub wanted { # Learn about sub routines
if ($File::Find::name) {
my $curr_file_path = $File::Find::name. "\n";
`mdls $curr_file_path`;
print $_;
}
}
find(\&wanted, $top_dir);
If you JUST want "last access time" in terms of the OS last access time, mdls is the wrong tool. Use perl's stat. If you want last access time in terms of the Mac registered application (i.e., a song by Quicktime or iTunes) then mdls is potentially the right tool. (You could also use osascript to query the Mac app directly...)
Backticks are for capturing the text return. Since you are using mdls, I assume capturing and parsing the text is still to come.
So there are several methods:
Use the list form of system and the quoting is not necessary (if you don't care about the return text);
Use String::ShellQuote to escape the file name before sending to sh;
Build the string and enclose it in single quotes prior to sending it to the shell. This is harder than it sounds because file names with single quotes defeat your quotes! For example, sam's song.mp4 is a legal file name, but if you surround it with single quotes you get 'sam's song.mp4' which is not what you meant...
Use open to open a pipe to the output of the child process like this: open my $fh, '-|', "mdls", "$curr_file" or die "$!";
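For the single-quote pitfall in the third option, the usual POSIX shell idiom is to close the quote, emit an escaped quote, and reopen it. The helper below is hypothetical (sh_single_quote is not from any module) and is only a sketch of that idiom; String::ShellQuote remains the safer choice.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap a string in single quotes for a POSIX shell,
# rewriting each embedded ' as '\'' (close quote, escaped quote, reopen quote).
sub sh_single_quote {
    my ($s) = @_;
    $s =~ s/'/'\\''/g;
    return "'$s'";
}

my $quoted = sh_single_quote("sam's song.mp4");
print "mdls $quoted\n";   # prints: mdls 'sam'\''s song.mp4'
```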
Example of String::ShellQuote:
use strict; use warnings;
use String::ShellQuote;
use File::Find;
my $top_dir = '/Users/andrew/music/iTunes/iTunes Music/Music';
sub wanted {
if ($File::Find::name) {
my $curr_file = "$File::Find::name";
my $rtr;
return if -d;
my $exec="mdls ".shell_quote($curr_file);
$rtr=`$exec`;
print "$rtr\n\n";
}
}
find(\&wanted, $top_dir);
Example of pipe:
use strict; use warnings;
use File::Find;
my $top_dir = '/Users/andrew/music/iTunes/iTunes Music/Music';
sub wanted {
if ($File::Find::name) {
my $curr_file = "$File::Find::name";
my $rtr;
return if -d;
open my $fh, '-|', "mdls", "$curr_file" or die "$!";
{ local $/; $rtr=<$fh>; }
close $fh or die "$!";
print "$rtr\n\n";
}
}
find(\&wanted, $top_dir);
If you're sure the filenames don't contain newlines (either CR or LF), then pretty much all Unix shells accept backslash quoting, and Perl has the quotemeta function to apply it.
my $curr_file_path = quotemeta($File::Find::name);
my $time = `mdls $curr_file_path`;
Unfortunately, that doesn't work for filenames with newlines, because the shell handles a backslash followed by a newline by deleting both characters instead of just the backslash. So to be really safe, use String::ShellQuote:
use String::ShellQuote;
...
my $curr_file_path = shell_quote($File::Find::name);
my $time = `mdls $curr_file_path`;
That should work on filenames containing anything except a NUL character, which you really shouldn't be using in filenames.
Both of these solutions are for Unix-style shells only. If you're on Windows, proper shell quoting is much trickier.
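To see what quotemeta actually produces, and why a newline is the problem case, here is a short sketch with made-up file names:

```perl
use strict;
use warnings;

# quotemeta puts a backslash before every non-word character.
my $plain   = quotemeta("my song.flac");   # my\ song\.flac
my $newline = quotemeta("a\nb");           # a, backslash, newline, b

print "$plain\n";

# A Unix shell treats backslash-newline as a line continuation and drops both
# characters, so the second name would reach the command as "ab".
printf "%d characters after quoting the 3-character name\n", length $newline;
```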
If you just want to find the last access time, is there some weird Mac reason you aren't using stat? When would it be worse than kMDItemLastUsedDate?
my $last_access = ( stat($file) )[8];
It seems kMDItemLastUsedDate isn't always updated to the last access time. If you work with a file through the terminal (e.g. cat, more), kMDItemLastUsedDate doesn't change but the value that comes back from stat is right. touch appears to do the right thing in both cases.
It looks like you need stat for the real answer, but mdls if you're looking for access through applications.
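A minimal sketch of the stat route, using a temporary file from the core File::Temp module so it runs anywhere; in the real script the path would be $File::Find::name.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a throwaway file so the example is self-contained.
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} "x";
close $fh or die "close: $!";

# Elements 8 and 9 of stat's return list are atime and mtime (epoch seconds).
my ($atime, $mtime) = (stat $path)[8, 9];
print "last access: ", scalar localtime($atime), "\n";
```

No shell is involved, so file names with spaces or quotes need no escaping at all.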
You can bypass the shell by expressing the command as a list, combined with capture() from IPC::System::Simple:
use IPC::System::Simple qw(capture);
my $output = capture('mdls', $curr_file_path);
Quote the variable name inside the backticks:
`mdls "$curr_file_path"`;
`mdls '$curr_file_path'`;

Unicode string mess in perl

I have an external module, that is returning me some strings. I am not sure how are the strings returned, exactly. I don't really know, how Unicode strings work and why.
The module should return, for example, the Czech word "být", meaning "to be". (If you cannot see the second letter - it should look like this.) If I display the string, returned by the module, with Data Dumper, I see it as b\x{fd}t.
However, if I try to print it with print $s, I got "Wide character in print" warning, and ? instead of ý.
If I try Encode::decode(whatever, $s);, the resulting string cannot be printed anyway (always with the "Wide character" warning, sometimes with mangled characters, sometimes right), no matter what I put in whatever.
If I try Encode::encode("utf-8", $s);, the resulting string CAN be printed without the problems or error message.
If I use use encoding 'utf8';, printing works without any need of encoding/decoding. However, if I use IO::CaptureOutput or Capture::Tiny module, it starts shouting "Wide character" again.
I have a few questions, mostly about what exactly happens. (I tried to read perldocs, but I was not very wise from them)
Why can't I print the string right after getting it from the module?
Why can't I print the string, decoded by "decode"? What exactly "decode" did?
What exactly "encode" did, and why there was no problem in printing it after encoding?
What exactly use encoding do? Why is the default encoding different from utf-8?
What do I have to do, if I want to print the scalars without any problems, even when I want to use one of the capturing modules?
edit: Some people tell me to use -C or binmode or PERL_UNICODE. That is a great advice. However, somehow, both the capturing modules magically destroy the UTF8-ness of STDOUT. That seems to be more a bug of the modules, but I am not really sure.
edit2: OK, the best solution was to dump the modules and write the "capturing" myself (with much less flexibility).
Because you output a string in perl's internal form (utf8) to a non-unicode filehandle.
The decode function decodes a sequence of bytes assumed to be in ENCODING into Perl's internal form (utf8). Your input seems to be already decoded.
The encode() function encodes a string from Perl's internal form into ENCODING.
The encoding pragma allows you to write your script in any encoding you like. String literals are automatically converted to perl's internal form.
Make sure perl knows which encoding your data comes in and goes out in.
See also perluniintro, perlunicode, Encode module, binmode() function.
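The relationship between encode and an encoding layer can be checked with an in-memory filehandle. This is only a sketch with the question's string hard-coded; it shows that printing decoded text through an :encoding(UTF-8) layer yields the same bytes as calling encode yourself.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $s     = "b\x{fd}t";              # decoded text: "b", U+00FD (ý), "t"
my $bytes = encode('UTF-8', $s);     # explicit encoding: 4 bytes

# Print the decoded string through an encoding layer into an in-memory file.
my $buffer = '';
open my $fh, '>:encoding(UTF-8)', \$buffer or die "open: $!";
print {$fh} $s;
close $fh or die "close: $!";

print $buffer eq $bytes ? "layer and encode() agree\n" : "mismatch\n";
```

Do one or the other, not both: printing already-encoded bytes through the layer would encode them a second time.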
I recommend reading the Unicode chapter of my book Effective Perl Programming. We put together all the docs we could find and explained Unicode in Perl much more coherently than I've seen anywhere else.
This program works fine for me:
#!perl
use utf8;
use 5.010;
binmode STDOUT, ':utf8';
my $string = return_string();
say $string;
sub return_string { 'být' }
Additionally, Capture::Tiny works just fine for me:
#!perl
use utf8;
use 5.010;
use Capture::Tiny qw(capture);
binmode STDOUT, ':utf8';
my( $stdout, $stderr ) = capture {
system( $^X, '/Users/brian/Desktop/czech.pl' );
};
say "STDOUT is [$stdout]";
IO::CaptureOutput seems to have some problems though:
#!perl
use utf8;
use 5.010;
use IO::CaptureOutput qw(capture);
binmode STDOUT, ':utf8';
capture {
system( $^X, '/Users/brian/Desktop/czech.pl' );
} \my $stdout, \my $stderr;
say "STDOUT is [$stdout]";
For this I get:
STDOUT is [být
]
However, that's easy to fix. Don't use that module. :)
You should also look at the PERL_UNICODE environment variable, which is the same as using the -C option. That allows you to set STDIN/STDOUT/STDERR (and @ARGV) to be UTF-8 without having to alter your scripts.

Decoding URI related entities with Perl

I may not even be referring to this the proper way so my apologies in advance. Our server logs are constantly showing us an encoded style of attack. An example is below....
http://somecompany.com/script.pl?var=%20%D1%EB........ (etc etc)
I am familiar with encoding and decoding HTML entities using Perl (using HTML::Entities) but I am not even sure how to refer to this style of decoding. I'd love to be able to write a script to decode these URI encodings (?). Is there a module that anyone knows of that can point me in the right direction?
Nikki
Use the URI::Escape module to escape and unescape URI-encoded strings.
Example:
use strict;
use warnings;
use URI::Escape;
my $uri = "http://somecompany.com/script.pl?var=%20%D1%EB";
my $decoded = uri_unescape( $uri );
print $decoded, "\n";
There are online resources such as http://www.albionresearch.com/misc/urlencode.php for doing quick encoding/decoding of a string.
Programmatically, you can do this:
use URI::Escape;
my $str = uri_unescape("%20%D1%EB");
print $str . "\n";
or simply:
perl -MURI::Escape -wle'print uri_unescape("%20%D1%EB");'
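Note that unescaping gives you raw bytes, not readable text. The sketch below does the percent-decoding with a plain regex (so it needs no non-core module) and then decodes the bytes as Windows-1251. That charset is purely a guess on my part, since the log excerpt does not say which encoding the sender used.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $escaped = '%20%D1%EB';

# Replace each %XX with the byte it stands for.
(my $bytes = $escaped) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;

# ASSUMPTION: the bytes are Windows-1251 (Cyrillic); try other charsets if the result looks wrong.
my $guess = decode('cp1251', $bytes);

printf "U+%04X\n", ord($_) for split //, $guess;
```

Under that assumption the three bytes come out as a space followed by two Cyrillic letters (U+0421, U+043B).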