"Cannot decode string with wide characters" appears on a weird place - perl

I am trying to use XML::RAI perl module on UTF8 coded text and I still have error I don't really understand... here is the code (it shouldn't do anything useful yet):
use HTTP::Request;
use LWP::UserAgent;
use XML::RAI;
use Encode;
my $ua = LWP::UserAgent->new;
sub readFromWeb{
my $address = shift;
my $request = HTTP::Request->new( GET => $address );
my $response = $ua->request( $request );
return unless $response->code == 200;
return decode("utf8", $response->content());
}
sub readFromRSS{
my $address=shift;
my $content = readFromWeb $address;
my $rai = XML::RAI->parse_string($content);
#this line "causes" the error
}
readFromRSS("http://aktualne.centrum.cz/export/rss-hp.phtml");
#I am testing it on this particular RSS
the error is:
Cannot decode string with wide characters at /usr/lib/perl5/5.8.8/i686-linux/Encode.pm line 166.
I don't have a clue if that's my fault or the fault of XML::RAI. I don't see where these wide characters can be, if $content is already decoded from utf8...
edit: for some reason I still don't understand, removing the "decode" part actually solved the problem.

The problem is double-decoding. XML::RAI::parse_string() apparently
expects an UTF-8 encoded document and does the decoding itself. If you
pass in a string that is already decoded, decoding it a second time will fail,
of course:
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( decode );
use LWP::Simple qw( get );
my $xml = get("http://aktualne.centrum.cz/export/rss-hp.phtml");
$xml = decode('UTF-8', $xml);
$xml = decode('UTF-8', $xml); # dies: Cannot decode string with wide characters ...
So just skip the decode() step and you'll be fine.

Related

I get error 'Can't use string as a HASH ref while "strict refs"....' when i use my perl code

im using this perl code to transform JSON into other form with some regular expressions:
use strict;
use warnings;
use feature 'say';
use JSON;
use utf8;
my %IDs = ( 'User awx01 logged in.' => 1001 );
my %levels = ( INFO => 4 );
my $data = <DATA>;
my $json = data2json($data);
my $record = decode_json($json);
say rec2msg($record);
sub data2json {
my $json = shift;
$json =~ s/[""]/"/g;
$json =~ s/\\//g;
$json =~ s/"(\{.*?\})"/$1/;
return $json;
}
sub rec2msg {
my $r = shift;
$r->{Message}{message} =~ /(\w+) (\w+) (.+)/;
my($user,$msg) = ($2,"$1 $3");
my $ID = $IDs{$r->{Message}{message}};
my $level = $levels{$r->{Message}{level}};
my $out = "$r->{Message}{'#timestamp'} host CEF:0|OpenSource|AWX|7.0.0|$ID|$msg|$level|src=127.0.0.1 dst=$r->{MessageSourceAddress} duser=$user";
return $out;
}
__DATA__
{"MessageSourceAddress":"192.168.81.20","EventReceivedTime":"2020-02-06 11:55:14","SourceModuleName":"udp","SourceModuleType":"im_udp","SyslogFacilityValue":1,"SyslogFacility":"USER","SyslogSeverityValue":5,"SyslogSeverity":"NOTICE","SeverityValue":2,"Severity":"INFO","EventTime":"2020-02-06 11:55:14","Hostname":"192.168.81.20","Message":"{\"#timestamp\": \"2020-02-06T08:55:52.907Z\", \"message\": \"User awx01 logged in.\", \"host\": \"awxweb\", \"level\": \"INFO\", \"logger_name\": \"awx.api.generics\", \"stack_info\": null, \"type\": \"other\", \"cluster_host_id\": \"awx-contr-01\", \"tower_uuid\": \"333b4131-495f-4460-8e4b-890241a9d73d\"}"}
But im getting this error:
2020-03-31 20:48:50 ERROR perl subroutine rec2msg failed with an error: 'Can't use string ("140511667030448") as a HASH ref while "strict refs" in use at /usr/libexec/nxlog/modules/extension/perl/event1.pl line 21.;'
What im doing wrong? How could i solve it?
You have JSON embedded in JSON, so you need to decode it twice. This often happens when you have one service passing through the response for another service.
Your data2json wasn't decoding that second level, so the value for the Message name was still a string. Since that value wasn't a hash reference, you get the error you reported.
You don't want to use a bunch of substitutions on the entire thing because you can inadvertently change things you shouldn't be messing with. Decode the top level just as you did, but then do the same thing for the Message value:
# read in all the data, even though it looks like a single line. Maybe it won't be later.
my $data = do { local $/; <DATA> };
# decode the first layer
my $decoded = decode_json( $data );
# decode the Message value:
$decoded->{Message} = decode_json( $decoded->{Message} );
Now, when you call rec2msg it should work out.
Note that this has the opposite problem to reverse it. You can't merely encode the entire thing to JSON again. The value for Message still needs to be a string, so you have to encode that first if you want to send it somewhere else. If you are doing that, you probably want to work on a copy. I use dclone to make a deep copy so whatever I do to $encoded does not show up in $decoded:
# make a deep copy so nested references aren't shared
use Storable qw(dclone);
my $encoded = dclone( $decoded );
$encoded->{Message} = encode_json( $encoded->{Message} );
my $new_data = encode_json( $encoded );
Then $new_data will have the same escaping as the original input.
Here it is altogether:
use strict;
use warnings;
use feature 'say';
use JSON;
use utf8;
my %IDs = ( 'User awx01 logged in.' => 1001 );
my %levels = ( INFO => 4 );
# read in all the data, even though it looks
my $data = do { local $/; <DATA> };
my $decoded = decode_json( $data );
$decoded->{Message} = decode_json( $decoded->{Message} );
say rec2msg($decoded);
sub rec2msg {
my $r = shift;
$r->{Message}{message} =~ /(\w+) (\w+) (.+)/;
my($user,$msg) = ($2,"$1 $3");
my $ID = $IDs{$r->{Message}{message}};
my $level = $levels{$r->{Message}{level}};
my $out = "$r->{Message}{'#timestamp'} host CEF:0|OpenSource|AWX|7.0.0|$ID|$msg|$level|src=127.0.0.1 dst=$r->{MessageSourceAddress} duser=$user";
return $out;
}

Xpath won't fiind id

I'm failing to get a node by its id.
The code is straight forward and should be self-explaining.
#!/usr/bin/perl
use Encode;
use utf8;
use LWP::UserAgent;
use URI::URL;
use Data::Dumper;
use HTML::TreeBuilder::XPath;
my $url = 'https://www.airbnb.com/rooms/1976460';
my $browser = LWP::UserAgent->new;
my $resp = $browser->get( $url, 'User-Agent' => 'Mozilla\/5.0' );
if ($resp->is_success) {
my $base = $resp->base || '';
print "-> base URL: $base\n";
my $data = $resp->decoded_content;
my $tree= HTML::TreeBuilder::XPath->new;
$tree->parse_content( $resp->decoded_content() );
binmode STDOUT, ":encoding(UTF-8)";
my $price_day = $tree->find('.//*[#id="price_amount"]/');
print Dumper($price_day);
$tree->delete();
}
The code above prints:
-> base URL: https://www.airbnb.com/rooms/1976460
$VAR1 = undef;
How can I select a node by its ID?
Thanks in advance.
Take that / off the end of that XPath.
.//*[#id="price_amount"]
should do. As it is, it's not valid XPath.
There is a trailing slash in your XPath, that you need to remove
my $price_day = $tree->find('.//*[#id="price_amount"]');
However, from my own testing, I believe that HTML::TreeBuilder::XPath is also having trouble parsing that specific URL. Perhaps because of the conditional comments?
As an alternative approach, I would recommend using Mojo::UserAgent and Mojo::DOM instead.
The following uses the css selector div#price_amount to easily find your desired element and print it out.
use strict;
use warnings;
use Mojo::UserAgent;
my $url = 'https://www.airbnb.com/rooms/1976460';
my $dom = Mojo::UserAgent->new->get($url)->res->dom;
my $price_day = $dom->at(q{div#price_amount})->all_text;
print $price_day, "\n";
Outputs:
$285
Note, there is a helpful 8 minute introductory video to this set of modules at Mojocast Episode 5.

Perl script gives different results for HTTP/POST when run from command line and server

I'm trying to write a proxy that takes a POST request in UTF-8, converts the request to TIS-620 (Thai language) and submits it to a server. It then takes the TIS-620 server response and converts it to UTF-8.
The problem I'm having is that everything works fine when I run from the command line. I get back a page of HTML as expected. When I run the same script from my webserver I get a HTTP/200 OK response, but there's nothing in the body.
I'm absolutely stumped as to why this is happening. I'm also at a loss as to how to go about debugging this. Any thoughts much appreciated.
This is a stripped down version of the program:
#!/usr/bin/perlml -w -T
use strict;
use CGI qw(:standard);
use CGI::Carp qw(warningsToBrowser fatalsToBrowser);
use Encode qw(decode encode);
use LWP::UserAgent;
use URI::Escape;
use utf8;
# Print Headers
print "Content-Type: text/html; charset=utf-8\n\n";
my $postData = "word=กระคน";
# Convert to TIS-620
my $post = encode("iso-8859-11", "$postData");
# URI Escape it
$post = uri_escape($post, "\x00-\x1f\x7f-\xff");
# Prepare POST
my $ua = LWP::UserAgent->new;
my $server_endpoint = "http://rirs3.royin.go.th/new-search/word-search-all-x.asp";
my $req = HTTP::Request->new(POST => $server_endpoint);
# Set header fields and POST data
$req->header('CONTENT_TYPE' => 'application/x-www-form-urlencoded');
my $postLength;
{use bytes; my $postLength = length($post);}
$req->header('Content-Length' => $postLength);
$req->content($post);
# Get & print response
my $resp = $ua->request($req);
my $respCode = $resp->code();
print("Response Code = $respCode<br>\n");
$resp = $resp->decoded_content;
my $respUTF = decode("iso-8859-11", $resp);
print ("Response:<br>\n$respUTF\n");
The script can be run at http://thai-notes.com/cgi-bin/test2.pl

UTF-8 support for RSS parsing

I'm using XML::RSSLite for parsing RSS data I retrieved using LWP. LWP is correctly retrieving in the right encoding but when using RSSLite to parse the data, the encoding seems to be lost and characteres like é, è, à, etc. are deleted from the output. Is there an option to set in order to force the encoding?
Here is my script:
use strict;
use XML::RSSLite;
use LWP::UserAgent;
use HTTP::Headers;
use utf8;
my $ua = LWP::UserAgent->new;
$ua->timeout(10);
$ua->env_proxy;
my $URL = "http://www.boursier.com/syndication/rss/news/FR0004031839/FR";
my $response = $ua->get($URL);
if ($response->is_success) {
my $content = $response->decoded_content((charset => 'UTF-8'));
my %result;
parseRSS(\%result, \$content);
foreach my $item (#{ $result{items} }) {
print "ITEM: $item->{title}\n";
}
}
I tried to use XML::RSS as it seems to have more option that may be handy in my case but it failed to install unfortunately. :(
I like that Mojo::UserAgent along with Mojo::DOM already have the support I need without me tracking down the right combinations of modules to use, and it handles the UTF-8 bits without me doing anything special:
use v5.10;
use open qw( :std :utf8 );
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new;
my $URL = "http://www.boursier.com/syndication/rss/news/FR0004031839/FR";
my $response = $ua->get($URL)->res;
my #links = $response
->dom( 'item > title' )
->map( sub { $_->text } )
->each;
$" = "\n";
print "#links\n";
I have another example at Painless RSS processing with Mojo
The RSSLite documentation explicitely states:
Remove characters other than 0-9~!##$%^&*()-+=a-zA-Z[];',.:"<>?\s
Therefore, the module is hopelessly broken. Try again with XML::Feed

How can I suppress \\ in my output from Perl's JSON module?

Following code always print paths with double slashes:
use JSON;
use File::Spec;
my $installdir = $ENV{"ProgramFiles"};
my $xptrlc = File::Spec->catfile($installdir,"bin","sample");
my $jobhash;
my $return_packet;
$jobhash->{'PATH'} = $xptrlc;
$return_packet->{'JOB'} = $jobhash;
my $js = new JSON;
my $str = $js->objToJson($return_packet);
print STDERR "===> $str \n";
OUTPUT of this script is
===> {"JOB":{"PATH":"C:\\Program Files (x86)\\bin\\sample"}}
Any solution to remove those double \\ slashes?
As Greg mentioned, the '\' character is represented as '\\' in JSON.
http://www.ietf.org/rfc/rfc4627.txt?number=4627
If you intend to use "thaw" the JSON somewhere, like in another Perl program or in JavaScript, you will still get back exactly what you put in.
Are you trying to do something else with your JSON?
Windows is perfectly fine with using '/' in paths if that bothers you so much:
use strict; use warnings;
use JSON;
use File::Spec::Functions qw(catfile);
my $installdir = $ENV{ProgramFiles};
my $xptrlc = catfile $installdir,qw(bin sample);
$xptrlc =~ s'\\'/'g;
my $packet = { JOB => { PATH => $xptrlc } };
my $js = JSON->new;
my $str = $js->encode($packet);
warn "===> $str \n";
Output:
===> {"JOB":{"PATH":"C:/Program Files/bin/sample"}}
On the other hand, the encoded value will be correctly decoded:
use JSON;
warn JSON->new->decode(scalar <DATA>)->{JOB}->{PATH}, "\n";
__DATA__
{"JOB":{"PATH":"C:\\Program Files (x86)\\bin\\sample"}}
Output:
C:\Temp> ht
C:\Program Files (x86)\bin\sample