Why is Perl HTTP::Response not decoding this apostrophe? - perl

I'm using
my $ua = new LWP::UserAgent;
$ua->agent("Mozilla/5.0 (Windows NT 6.1; Intel Mac OS X 10.6; rv:7.0.1) Gecko/20100101 Firefox/7.0.1 ");
my $url = "http://somedomain.com/page/";
my $req = new HTTP::Request 'GET' => $url;
$req->header('Accept' => 'text/html');
my $response = $ua->request($req);
my $html = $response->decoded_content;
to get a web page. On this page, Abobo's Big Adventure appears. In $request->content and $request->decoded_content, this is shown as Abobo's Big Adventure.
Is there something I can do to make this decode correctly?

Why, that is completely valid HTML! However, you can decode the Entities using HTML::Entities from CPAN.
use HTML::Entities;
...;
my $html = $response->decoded_content;
my $decoded_string = decode_entities($html);
The docs for HTTP::Response::decoded_content state that the Content-encoding and charsets are reversed, not HTML entities (which are a HTML/XML language feature, not really an encoding).
Edit:
However, as ikegami pointed out, decoding the entities immediately could render the HTML unparsable. Therefore, it might be best to parse the HTML first (e.g. using HTML::Tree), and then only decoding the text nodes when needed.
use HTML::TreeBuilder;
my $url = ...;
my $tree = HTML::TreeBuilder->new_from_url($url); # invokes LWP automatically
my $decoded_text = decode_entities($tree->as_text); # dumps the tree as flat text, then decodes.

I'm guessing there probably is an ampersand there before the hash mark. Making it the HTML entity expressed ' These aren't that hard to change. You can do something like this:
my $content = $response->decoded_content;
$content
=~ s{(&#(\d{2,3});)}{
$2 < 128 ? ord( $2 ) : $1
}gem
;
The range check pretty much assures you you're dealing with ASCII. If you want to get more complex, you could also put together a hash of values, and change it like so:
my %entity_lookup
= ( 150 => '-'
, 151 => '--' # m-dash
, 160 => ' '
...
);
...
$content
=~ s{(&#(\d+);)}{
$2 < 128 ? ord( $2 ) : $entity_lookup{ $2 } // $1
}gem
;
But that would be up to you.

Related

Getting data from table ?

how to display data (Stock name, Capitals, Close Price, Market value)from the website in terminal? I have this website:
http://www.tpex.org.tw/web/stock/aftertrading/daily_mktval/mkt.php?l=en-us
, I create somethink.
my $url = 'http://www.tpex.org.tw/web/stock/aftertrading/daily_mktval/mkt.php?l=en-us';
use LWP::Simple;
my $content = get $url;
die "Couldn't get $url" unless defined $content;
But I don't really know how to use $content to print the data which I need.
I'll be grateful for each help :)
You need to take a look at the excellent HTML::TableExtract module
Here's an example that uses the module to extract the data you require. I've used the URL for the printer-friendly version of the page for two reasons: the standard page uses JavaScript to build the table after it has been downloaded, so it isn't available to LWP::Simple which doesn't have JavaScript support; and it includes all the information on a single page, whereas the main page splits it up into many short sections
This is a far more robust, clear, and flexible technique than using regex patterns to parse HTML, which is generally a terrible idea
use strict;
use warnings 'all';
use LWP::Simple;
use HTML::TableExtract;
use open qw/ :std :encoding(utf-8) /;
use constant URL => 'http://www.tpex.org.tw/web/stock/aftertrading/daily_mktval/mkt_print.php?l=en-us';
my $content = get URL or die "Couldn't get " . URL;
my $te = HTML::TableExtract->new( headers => [
qr/Stock\s+Name/,
qr/Capitals/,
qr/Close\s+Price/,
qr/Market\s+Value/,
] );
$te->parse($content);
for my $row ( $te->rows ) {
next unless $row->[0]; # Skip the final row with empty fields
$_ = qq{"$_"} for $row->[0]; # Enclose the Stock Name in quotes
tr/,//d for #{$row}[1,2,3]; # and remove commas from the numeric columns
print join(',', #$row), "\n";
}
output
"OBI Pharma, Inc.",171199584,594.00,101692
"Vanguard International Semiconductor Co.",1638982267,53.90,88341
"Hermes Microvision, Inc.",71000000,1155.00,82005
"TaiMed Biologics Inc.",247732750,238.00,58960
"Phison Electronics Corp.",197373993,271.00,53488
"FamilyMart.co.,Ltd",223220000,202.00,45090
"WIN SEMICONDUCTORS CORP.",596666262,65.30,38962
"PChome online Inc.",99854871,368.50,36796
"TUNG THIH ELECTRONIC CO.,LTD.",84488699,435.00,36752
"ST.SHINE OPTICAL CO.,LTD",50416516,694.00,34989
"POYA CO.,LTD",95277388,350.00,33347
"SIMPLO TECHNOLOGY CO.,LTD.",308284198,108.00,33294
"LandMark Optoelectronics Corporation",69909752,474.50,33172
"Ginko International Co., Ltd.",92697472,340.00,31517
"GIGASOLAR MATERIALS CORPORATION",60989036,506.00,30860
"TTY Biopharm Company Limited",248649959,114.00,28346
"CHIPBOND TECHNOLOGY CORPORATION",649261998,41.90,27204
"Globalwafers.Co.,Ltd.",369250000,69.10,25515
"eMemory Technology lnc.",75782242,321.00,24326
"Parade Technology, Ltd.",76111677,315.50,24013
"PharmaEngine, Inc.",102101000,235.00,23993
"JIH SUN FINANCIAL HOLDING CO., LTD",3396302860,6.86,23298
...
Simple pattern matching and some trick enough for to do it.
In your task $content contain the whole text.
First, extract the table body content from the $content by using .+ with s flag. s flag helps to allow, match the any character with new line.
Second, split the extracted data by using </tr>.
Third, Iterate the foreach for the array then again will do pattern matching with grouping for extract the data.
Here $l1 and $l2 stores the rank and stock code. And the other data will be stored into the #arc variable
my $url = 'http://www.tpex.org.tw/web/stock/aftertrading/daily_mktval/mkt_print.php?l=en-us&d=2016/06/04&s=0,asc,0';
use LWP::Simple;
my $content = get $url;
die "Couldn't get $url" unless defined $content;
my ($table_body) = $content =~m/<tbody>(.+)<\/tbody>/s;
my #ar = split("</tr>",$table_body);
foreach my $lines(#ar)
{
my ($l1,$l2,#arc) = $lines =~m/>(.+?)<\/td>/g;
$, = "\t\t";
print #arc,"\n";
}

Perl DES CBC encryption results in more bytes

I am using Perl to perform CBC DES encryption using the Crypt::CBC library:
#!/usr/bin/perl
use Crypt::CBC;
$key = "\x4A\x6F\xC2\x2A\x44\xE2\xA4\x48";
$iv = "\x00\x00\x00\x00\x00\x00\x00\x00";
$data = "\x51\x55\x45\x53\x54\x49\x4F\x4E";
print "TXT->", $data, "\n";
print "HEX->", unpack("H*", $data), "\n";
$cipher = Crypt::CBC->new(-literal_key => 1,
-key => $key,
-iv => $iv,
-header => 'none');
$ciphertext = $cipher->encrypt($data);
print "ENC->", unpack("H*", $ciphertext), "\n";
The output of the code is:
TXT->QUESTION
HEX->5155455354494f4e
ENC->8220553e09f1b31ba7691f3f7fb52416
My data is conveniently of size 64bits (16 hex digits) which is in accordance with the DES standard. According to Wikipedia
DES is the archetypal block cipher — an algorithm that takes a fixed-length string of plaintext bits and transforms it through a series of complicated operations into another ciphertext bitstring of the same length
Why is it that the encoded output is of longer byte length than the original input?
Thanks.
Working backward from the second block (a7691f3f7fb52416) gives 8a285d3601f9bb13, and XORed with the first block (8220553e09f1b31b) gives 0808080808080808 (HEX). Something is producing the block value of 0808080808080808 as a second input block value.
So all you have to do is figure out where the backspace characters came from as a second block input.
See https://metacpan.org/pod/Crypt::CBC
This:
#!/usr/bin/perl
use Crypt::CBC;
$key = "\x4A\x6F\xC2\x2A\x44\xE2\xA4\x48";
$iv = "\x00\x00\x00\x00\x00\x00\x00\x00";
$data = "\x51\x55\x45\x53\x54\x49\x4F\x4E";
print "TXT->", $data, "\n";
print "HEX->", unpack("H*", $data), "\n";
$cipher = Crypt::CBC->new(-literal_key => 1,
-key => $key,
-iv => $iv,
-header => 'none',
-padding => 'null');
$ciphertext = $cipher->encrypt($data);
print "ENC->", unpack("H*", $ciphertext), "\n";
Gave:
david_koontz#Macbook: cbc_des
TXT->QUESTION
HEX->5155455354494f4e
ENC->8220553e09f1b31b
david_koontz#Macbook:
I made the mistake poking around because I know a fair bit about DES, not so much perl.
Adding the padding null seemed to do the trick, after I learned how to add Crypt::CBC and Crypt::DES to a perl library.
I used http://code.google.com/p/dpades/source/browse/trunk/simu_js/JS-DES.html to do the encryptions and decryptions necessary to figure out what's going on. Use the view raw file button and save JS-DES.html locally, open it with a browser.
The encrypted message is longer because it includes the IV. BTW, a fixed IV does not make sense, it should be random and newly generated for each message.

Right way to retrieve a ISO-8859-1 encoded website with LWP::UserAgent?

I am retrieving a ´ISO-8859-1´ encoded website by using ´LWP::UserAgent´ with the following code.
The problem is, that the special characters are not displayed right, especialy the "€" sign is displayed wrong.
The content encoding is recognized as ´ISO-8859-1´, which is right.
To display the retrieved text I am saving it into a file and open it with Notepag++.
Question: How can I retrieve ´ISO-8859-1´ encoded special characters, in the right way?
#SENDING REQUEST
my $ua = LWP::UserAgent->new();
$ua->agent('Mozilla/5.0 (Windows NT 6.1; WOW64; rv:15.0) Gecko/20100101 Firefox/15.0.1'); # pretend we are very capable browser
my $req = HTTP::Request->new(GET => $url);
#add some header fields
$req->header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8');
$req->header('Accept-Language', 'en;q=0.5');
$req->header('Connection', 'keep-alive');
$req->header('Host', 'www.url.com');
#SEND
my $response = $ua->request($req);
#decode trial1
print $response->content_charset(); # gives ISO-8859-1 which is right
my $content = $response->decoded_content(); #special chars are displayed wrong
#decode trial2
my $decContent = decode('ISO-8859-1', $response->content());
my $utf8Content = encode( 'utf-8', $decContent ); #special char € is displayed as Â
#decode trial3
Encode::from_to($content, 'iso-8859-1', 'utf8'); #special char € is displayed as  too
#example on writing data to file
open(MYOUTFILE, ">>D:\\encodingperl.html"); #open for write, overwrite
print MYOUTFILE "$utf8Content"; #write text
close(MYOUTFILE);
Same as any other:
my $content = $response->decoded_content();
That said, the iso-8859-1 charset does not include the Euro sign. You probably actually have cp1252. You can fix that as follows:
my $content = $response->decoded_content( charset => 'cp1252' );
Your second problem is that you don't encode your output. This is how you'd do it.
open(my $MYOUTFILE, '>>:encoding(cp1252)', 'D:\\encodingperl.html')
or die $!;
print $MYOUTFILE $content;
Use the encoding that's appropriate for you (e.g. UTF-8) if it's not cp1252 you want. If you want the original file in the original encoding, use
my $content = $response->decoded_content( charset => 'none' );
and
open(my $MYOUTFILE, '>>', 'D:\\encodingperl.html')
or die $!;
binmode($MYOUTFILE);
print $MYOUTFILE $content;
ISO-8859-1 doesn't have a euro symbol. If you need the euro symbol, you should either use ISO-8859-15 or, better yet, UTF-8.

How to convince SOAP::Lite to return UTF-8 data in responses as UTF-8?

I'm trying to transmit UTF-8 strings in complex data structures with SOAP::Lite. However, as it turns out, SOAP::Lite quietly converts all UTF-8 strings into base-64-encoded octets. The problem with that is that the deserializing does not revert the conversion and only does a straight base64 decode.
This leaves me confused as to how a user is supposed to ensure that they get UTF-8 data from the SOAP::Lite response. Walking the tree and running decode_utf8 on all strings seems wasteful.
Any suggestions?
Edit: In a nutshell, how do i make this test pass without monkey-patching?
I just hit the same problem and found the above discussion useful. As you say in the OP, the problem is that the data is encoded in base64 and the is_utf8 flag get lost. what happens in the serlializer treats any string with a non-ascii character as binary. I got it to do what I wanted by tweaking the serializer as below. It could have odd consequences, but it works in my situation..
use strictures;
use Test::More;
use SOAP::Lite;
use utf8;
use Data::Dumper;
my $data = "mü\x{2013}";
my $ser = SOAP::Serializer->new;
$ser->typelookup->{trick_into_ignoring} = [9, \&utf8::is_utf8 ,'as_utf8_string'];
my $xml = $ser->envelope( freeform => $data );
my ( $cycled ) = values %{ SOAP::Deserializer->deserialize( $xml )->body };
is( length( $data ), length( $cycled ), "UTF-8 string is the same after serializing" );
done_testing;
sub check_utf8 {
my ($val) = #_;
return utf8::is_utf8($val);
}
package SOAP::Serializer;
sub as_utf8_string {
my $self = shift;
my($value, $name, $type, $attr) = #_;
return $self->as_string($value, $name, $type, $attr);
}
1;
The 9 means the utf8 check is performed before the check for non-ascii characters. if the utf8 flag is on then it treats it as a 'normal' string.
Use of is_utf8 (line 278) is evil and wrong. As we can't trust SOAP::Lite with encoding character data properly (to be fair, this code was likely written before word got around in the community how to do this particular kind of string processing), we shall give it octet data only and therefore have to handle encoding/decoding ourself. Pick a single encoding, apply it before handing off data to S::L, reverse it after receiving data.
use utf8;
use strictures;
use Encode qw(decode encode);
use SOAP::Lite qw();
use Test::More;
my $original = 'mü';
my $xml = SOAP::Serializer->envelope(
freeform => encode('UTF-8', $original, Encode::FB_CROAK | Encode::LEAVE_SRC)
);
my ($roundtrip) = map {
decode('UTF-8', $_, Encode::FB_CROAK | Encode::LEAVE_SRC)
} values %{SOAP::Deserializer->deserialize($xml)->body};
is(length($original), length($roundtrip),
'Perl character string round-trips without changing length');
done_testing;

Is there a way to check, if an argument is passed in single quotes?

Is there a (best) way to check, if $uri was passed in single quotes?
#!/usr/local/bin/perl
use warnings;
use 5.012;
my $uri = shift;
# uri_check
# ...
Added this example, to make my problem more clear.
#!/usr/local/bin/perl
use warnings;
use 5.012;
use URI;
use URI::Escape;
use WWW::YouTube::Info::Simple;
use Term::Clui;
my $uri = shift;
# uri check here
$uri = URI->new( $uri );
my %params = $uri->query_form;
die "Malformed URL or missing parameter" if $params{v} eq '';
my $video_id = uri_escape( $params{v} );
my $yt = WWW::YouTube::Info::Simple->new( $video_id );
my $info = $yt->get_info();
my $res = $yt->get_resolution();
my #resolution;
for my $fmt ( sort { $a <=> $b } keys %$res ) {
push #resolution, sprintf "%d : %s", $fmt, $res->{$fmt};
}
# with an uri-argument which is not passed in single quotes
# the script doesn't get this far
my $fmt = choose( 'Resolution', #resolution );
$fmt = ( split /\s:\s/, $fmt )[0];
say $fmt;
You can't; bash parses the quotes before the string is passed to the Perl interpreter.
To expand on Blagovest's answer...
perl program http://example.com/foo?bar=23&thing=42 is interpreted by the shell as:
Execute perl and pass it the arguments program and http://example.com/foo?bar=23
Make it run in the background (that's what & means)
Interpret thing=42 as setting the environment variable thing to be 42
You should have seen an error like -bash: thing: command not found but in this case bash interpreted thing=42 as a valid instruction.
The shell handles the quoting and Perl has no knowledge of that. Perl can't issue an error message, it just sees arguments after shell processing. It never even sees the &. This is just one of those Unix things you'll have to learn to live with. The shell is a complete programming environment, for better or worse.
There are other shells which dumb things down quite a bit so you can avoid this issue, but really you're better off learning the quirks and powers of a real shell.