Transparently Handling GZip Encoded content with WWW::Mechanize - perl

I am using WWW::Mechanize and currently handling HTTP responses with the 'Content-Encoding: gzip' header in my code by first checking the response headers and then using IO::Uncompress::Gunzip to get the uncompressed content.
However, I would like to do this transparently, so that WWW::Mechanize methods like forms(), links(), etc. work on and parse the uncompressed content. Since WWW::Mechanize is a subclass of LWP::UserAgent, I would prefer to use the LWP::UserAgent handler mechanism to do this.
While I have been partly successful (I can print the uncompressed content, for example), I am unable to do this transparently in a way that lets me call
$mech->forms();
In summary: How do I "replace" the content inside the $mech object so that from that point onwards, all WWW::Mechanize methods work as if the Content-Encoding never happened?
I would appreciate any help. Thanks.

WWW::Mechanize::GZip, I think.
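That module is a drop-in subclass of WWW::Mechanize that advertises gzip support and decompresses responses before they are parsed, so usage is just (a minimal sketch; the URL is a placeholder):
use WWW::Mechanize::GZip;

my $mech = WWW::Mechanize::GZip->new;
$mech->get('http://example.com/');   # response is decompressed transparently
my @forms = $mech->forms;            # parses the uncompressed content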

It looks to me like you can replace the content by using the $res->content( $bytes ) method.
By the way, I found this stuff by looking at the source of LWP::UserAgent, then HTTP::Response, then HTTP::Message.
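For example, a handler along these lines should decompress the response before WWW::Mechanize gets to parse it (an untested sketch; the URL is a placeholder, and it assumes the response is fully received when response_done fires):
use WWW::Mechanize;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

my $mech = WWW::Mechanize->new;

# Replace gzipped content with the decompressed bytes before parsing
$mech->add_handler(
    response_done => sub {
        my ($response) = @_;
        return unless ($response->header('Content-Encoding') || '') eq 'gzip';
        my $raw = $response->content;
        gunzip(\$raw => \my $plain)
            or die "gunzip failed: $GunzipError";
        $response->content($plain);
        $response->remove_header('Content-Encoding', 'Content-Length');
    },
);

$mech->get('http://example.com/');
my @forms = $mech->forms;   # now works on the uncompressed content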

Decoding is built into LWP::UserAgent, and thus into Mechanize. One MAJOR caveat to save you some hair:
To debug, make sure you check $@ for an error after the call to decoded_content.
$html = $r->decoded_content;
die $@ if $@;
Better yet, look through the source of HTTP::Message and make sure all the supporting packages are installed.
In my case, decoded_content returned undef while content was raw binary, and I went on a wild goose chase. UserAgent will set the error flag on a failure to decode, but Mechanize will just ignore it (it doesn't check or log the incident as its own error/warning).
In my case $@ said: "Can't find IO/HTML.pm ..". The call was eval'ed, so the failure was silent.
After having to dive into the source, I found that the built-in decoding process is long, meticulous, and arduous, covering just about every scenario and making tons of guesses (Thank you Gisle!).
If you are paranoid, explicitly set the default header to be used with every request at new():
$browser = WWW::Mechanize->new(
    default_headers => HTTP::Headers->new(
        'Accept-Encoding' => scalar HTTP::Message::decodable()
    )
);

Related

save directly to file and get filename using WWW::Mechanize

I would normally save to a file using this:
$mech->save_content($mech->response->filename);
But due to some big files causing "Out of memory" errors, I have to use this instead:
$mech->get( $url, ":content_file"=>$tempfile );
How can I get the filename with the second method, or do I have to make one up?
I want the filename that would be returned in the response object: $mech->response->filename. I don't want to make up my own filename.
The :content_file option is inherited from LWP::UserAgent and behaves in the same way. You don't know the filename beforehand.
You could do a HEAD request to check the filename, and then do a GET request.
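A sketch of that approach (untested; the fallback name is made up for illustration):
use WWW::Mechanize;

my $mech = WWW::Mechanize->new;

# HEAD fetches only the headers, so no large body is held in memory
$mech->head($url);
my $filename = $mech->response->filename;
$filename = 'unnamed.download' unless defined $filename;   # fallback (assumption)

# Then stream the body straight to disk under that name
$mech->get($url, ':content_file' => $filename);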
Alternatively, have a look at the lwp-download utility that ships with LWP::UserAgent. It provides exactly what you need. You could use it directly, or lift the stuff you want out of it. A WWW::Mechanize object can be dropped into this code and will behave exactly the same as the LWP::UserAgent one.

Perl get request returns empty response, maybe session related?

I was using an open source tool called SimTT which takes the URL of a table tennis league page and then calculates the probable results (e.g. rankings of teams and players). Unfortunately the page moved to a different website.
I downloaded the source and repaired the parsing of the webpage, but currently I'm only able to download the page manually and then read it from a file.
Below you can find an excerpt of my code to retrieve the page. It prints success, but the response is empty. Unfortunately I'm not very familiar with Perl and web techniques, but in Wireshark I could see that one of the last things sent was a new session key. I'm not sure whether the problem is related to cookies, SSL, or something like that.
It would be very nice if someone could help me get access. I know that there are some people out there who would like to use the tool.
So here's the code:
use LWP::UserAgent ();
use HTTP::Cookies;

my $ua = LWP::UserAgent->new(keep_alive => 1);
$ua->agent('Mozilla/5.0');
$ua->cookie_jar({});

my $request  = HTTP::Request->new('GET', 'https://www.mytischtennis.de/clicktt/ByTTV/18-19/ligen/Bezirksoberliga/gruppe/323819/mannschaftsmeldungen/vr');
my $response = $ua->request($request);

if ($response->is_success) {
    print "Success: ", $response->decoded_content;
}
else {
    die $response->status_line;
}
Either there is some rudimentary anti-bot protection at the server, or the server is misconfigured or otherwise broken. It looks like it expects an Accept-Encoding header in the request, which LWP does not send by default. The value of this header does not really seem to matter, i.e. the server will send the content compressed with gzip if the client claims to support it, but it will send uncompressed data if the client offers only a compression method which is unknown to the server.
With this knowledge one can change the code like this:
my $request = HTTP::Request->new('GET',
    'https://www.mytischtennis.de/clicktt/ByTTV/18-19/ligen/Bezirksoberliga/gruppe/323819/mannschaftsmeldungen/vr',
    [ 'Accept-Encoding' => 'foobar' ]
);
With this simple change the code currently works for me. Note that it might break at any time if the server setup changes, i.e. it might then need other workarounds.

Perl CGI problem

I'm doing some development work that uses an embedded Linux for the OS and Boa for the web server. I have a web page that posts to a CGI script, handles the form data, and replies. My development environment was Ubuntu and everything worked fine, but when I ported my code over to the embedded Linux, the CGI module did not instantiate (or at least does not seem to instantiate). Here is a stripped down section of my code. The print statement complains about an uninitialized variable.
use CGI;
use strict;
use warnings;
my $cgiObj = CGI->new();
print $cgiObj->param('wlanPort');
Again, this works fine in my development environment but fails in the embedded environment. CGI.pm is installed and no errors are generated on the CGI->new() command. I have also verified that the form data is being sent, but obviously can't guarantee that it is being received by the Perl script.
I have a feeling that it is a Boa configuration issue and that's what I'll be looking into next. I'm fairly new to Perl, so I'm not sure what else to do. Any ideas?
EDIT: Definitely not a Boa config issue. Still looking into it.
UPDATE:
I've simplified my code to the following:
#!/usr/bin/perl
use CGI qw(:standard);
my $data = param('wlanPort') || '<i>(No Input)</i>';
print header;
print <<END;
<title>Echoing user input</title>
<p>wlanPort: $data</p>
END
As expected, it prints (No Input)
I should also point out that the form is enctype="multipart/form-data" because I have to have a file upload capability and I am using the "POST" method.
I used the HttpFox plugin to inspect the post data and checked on the wlanPort value:
-----------------------------132407047814270795471206851178 Content-Disposition: form-data;
name="wlanPort"
eth1
So it is almost definitely being sent...
UPDATE 2: I installed the same versions of Perl and Boa used in the embedded system on my Ubuntu laptop. It works on the laptop but not on the device, which is the same result as before. I've told my employer that I've exhausted all possibilities other than the way Boa and (Micro)Perl are built on the device vs. in Ubuntu.
CGI is a beautifully simple protocol and while I cannot answer your question directly, I can suggest some techniques to help isolate the problem.
If your form is being submitted via POST, then the contents of that form will appear (URL-encoded, or as multipart data in your case) in the content of the HTTP request your script is getting. Without using the CGI module at all, you should be able to read the request from STDIN:
# Slurp the raw request body and append it to a log file for inspection
my $request = "";
while (<STDIN>) {
    $request .= $_;
}
if (open my $out, '>>', '/tmp/myapp.log') {
    print $out $request;
    close $out;
}
You can then examine /tmp/myapp.log to see if you are getting all the information from the request that you think you are.
For completeness, if your form submits via GET, then the arguments will be in the environment variable QUERY_STRING, which you can look at in Perl with $ENV{'QUERY_STRING'}.
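For instance, a quick look at the GET parameters (a toy illustration; real code should URL-decode the values):
# Print each name/value pair from the query string
for my $pair ( split /&/, $ENV{'QUERY_STRING'} || '' ) {
    my ($name, $value) = split /=/, $pair, 2;
    print "param: $name = $value\n";
}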
There should be no difference in the way CGI's object interface and functional interface parse the request. I am not familiar with Boa, but I can't imagine it breaking the basic CGI protocol.
A common problem is that you named the form parameter one thing and are looking for a different parameter name in the CGI script. That's always a laugh.
Good luck and hope this helps some.
I know this is a very old post and the OP may not be interested in any new info related to this, but the general question of how to debug CGI scripts still has some relevance. I had similar issues with dev vs. prod environments. To help those who stumble upon this thread, I am posting my experience in dealing with this situation.
My simple answer is to use the Log::Log4perl and Data::Dumper modules to demystify this (assuming there is a way to access logs in your prod environment). This way, with negligible overhead, you can turn on tracing when problems creep in (even if the code worked before but started failing due to changes over time). Log every relevant bit of information at the appropriate log level (trace, debug, info, warning, error, fatal) and configure which level is right for operations. Without these mechanisms, it is very difficult to get insight into production operations. Hope this helps.
use strict;
use warnings;
use CGI;
use Log::Log4perl qw(:easy);
use Data::Dumper;

# Without initialization, get_logger() silently logs nothing
Log::Log4perl->easy_init($TRACE);

my $cgiObj = CGI->new();
my $log    = Log::Log4perl->get_logger();
$log->trace("CGI Data: " . Dumper($cgiObj));

print $cgiObj->param('wlanPort');

How can I detect the file type of image at a URL?

How can I find the file type of an image from a website URL in Perl?
For example,
my $image_name = "logo";
my $image_path = "http://stackoverflow.com/content/img/so/" . $image_name;
From this information, how can I find the file type? In this example it should display "png", since the full URL is http://stackoverflow.com/content/img/so/logo.png. And if there are more files, as on the SO website, it should show all of their file types.
If you're using LWP to fetch the image, you can look at the content-type header returned by the HTTP server.
Both WWW::Mechanize and LWP::UserAgent will give you an HTTP::Response object for any GET request. So you can do something like:
use strict;
use warnings;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
$mech->get( "http://stackoverflow.com/content/img/so/logo.png" );
my $type = $mech->response->headers->header( 'Content-Type' );
You can't easily tell. The URL doesn't necessarily reflect the type of the image.
To get the image type you have to make a request via HTTP (GET, or more efficiently, HEAD), and inspect the Content-type header in the HTTP response.
Well, https://stackoverflow.com/content/img/so/logo is a 404. If it were not, then you could use
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my ($content_type) = head "https://stackoverflow.com/content/img/so/logo.png";
print "$content_type\n" if defined $content_type;
__END__
As Kent Fredric points out, what the web server tells you about content type need not match the actual content sent by the web server. Keep in mind that File::MMagic can also be fooled.
#!/usr/bin/perl
use strict;
use warnings;

use File::MMagic;
use LWP::UserAgent;

my $mm = File::MMagic->new;
my $ua = LWP::UserAgent->new(
    max_size => 1_000 * 1_024,   # don't download more than ~1 MB
);

my $res = $ua->get('https://stackoverflow.com/content/img/so/logo.png');

if ( $res->code eq '200' ) {
    print $mm->checktype_contents( $res->content );
}
else {
    print $res->status_line, "\n";
}

__END__
You really can't make assumptions about content based on URL, or even content type headers.
They're only guides to what is being sent.
A handy trick to confuse things that use suffix matching to identify file-types is doing this:
http://example.com/someurl?q=foo#fakeheheh.png
And if you were to arbitrarily permit that image to be added to the page, it might in some cases be a doorway to an attack of some sorts if the browser followed it. ( For example, http://really_awful_bank.example.com/transfer?amt=1000000;from=123;to=123 )
Content-type based forgery is not so detrimental, but you can do nasty things if the person who controls the name works out how you identify things and sends different content types for HEAD requests than for GET requests.
It could tell the HEAD request that it's an image, but then tell the GET request that it's application/javascript, and goodness knows where that will lead.
The only way to know for certain what it is is to download the file and then do magic-based identification, or more (i.e., try to decode the image). Then all you have to worry about is images that are too large, and specially crafted images that could trip vulnerabilities in computers that are not yet patched for them.
Granted all of the above is extreme paranoia, but if you know the rare possibilities you can make sure they can't happen :)
From what I understand, you're not worried about the content type of an image you already know the name+extension for; you want to find the extension for an image you know the base name of.
In order to do that, you'd have to test each image extension individually and record which ones resolve and which don't. For example, both https://stackoverflow.com/content/img/so/logo.png and https://stackoverflow.com/content/img/so/logo.gif could exist. They don't in this exact situation, but on some arbitrary server you could have multiple images with the same base name and different extensions. Unfortunately, there's no way to get a list of available extensions for a file in a remote web directory by supplying just its base name; you have to loop through the possibilities, as sketched below.
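Something like this (an untested sketch; the candidate extension list is arbitrary):
use LWP::UserAgent;

my $ua   = LWP::UserAgent->new;
my $base = 'https://stackoverflow.com/content/img/so/logo';   # base name, no extension

# Probe candidate extensions with cheap HEAD requests
for my $ext (qw(png gif jpg jpeg ico svg)) {
    my $res = $ua->head("$base.$ext");
    print "found: $base.$ext (", $res->header('Content-Type') || '?', ")\n"
        if $res->is_success;
}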

Compressing HTTP request with LWP, Apache, and mod_deflate

I have a client/server system that performs communication using XML transferred using HTTP requests and responses with the client using Perl's LWP and the server running Perl's CGI.pm through Apache. In addition the stream is encrypted using SSL with certificates for both the server and all clients.
This system works well, except that periodically the client needs to send really large amounts of data. An obvious solution would be to compress the data on the client side, send it over, and decompress it on the server. Rather than implement this myself, I was hoping to use Apache's mod_deflate's "Input Decompression" as described here.
The description warns:
If you evaluate the request body yourself, don't trust the Content-Length header! The Content-Length header reflects the length of the incoming data from the client and not the byte count of the decompressed data stream.
So if I provide a Content-Length value which matches the compressed data size, the data is truncated. This is because mod_deflate decompresses the stream, but CGI.pm only reads to the Content-Length limit.
Alternatively, if I try to outsmart it and override the Content-Length header with the decompressed data size, LWP complains and resets the value to the compressed length, leaving me with the same problem.
Finally, I attempted to hack the part of LWP which does the correction. The original code is:
# Set (or override) Content-Length header
my $clen = $request_headers->header('Content-Length');
if (defined($$content_ref) && length($$content_ref)) {
    $has_content = length($$content_ref);
    if (!defined($clen) || $clen ne $has_content) {
        if (defined $clen) {
            warn "Content-Length header value was wrong, fixed";
            hlist_remove(\@h, 'Content-Length');
        }
        push(@h, 'Content-Length' => $has_content);
    }
}
elsif ($clen) {
    warn "Content-Length set when there is no content, fixed";
    hlist_remove(\@h, 'Content-Length');
}
And I changed the push line to:
push(@h, 'Content-Length' => $clen);
Unfortunately this causes a problem where the content (truncated or not) doesn't even get to my CGI script.
Has anyone made this work? I found this which does compression on a file before uploading, but not compressing a generic request.
Although you said you didn't want to do the compression yourself, there are lots of Perl modules which will do both sides for you, Compress::Zlib for example (see the sketch below).
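For instance (a minimal sketch using Compress::Zlib's one-shot helpers):
use Compress::Zlib;

my $xml        = '<data>...</data>';          # payload to send
my $compressed = compress($xml);              # deflate on the client
my $original   = uncompress($compressed);     # inflate on the server
die "round-trip failed" unless $original eq $xml;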
I have a cheat (with a .NET part of the company) where I get passed the XML as a separate parameter posted in, and can then handle it as if it were a string rather than faffing about with SOAP-like stuff.
I don't think you can change the Content-Length like that. It would confuse Apache, because mod_deflate wouldn't know how much compressed data to read. What about having the client add an X-Uncompressed-Length header, and then use a modified version of CGI.pm that uses X-Uncompressed-Length (if present) instead of Content-Length? (Actually, you probably don't need to modify CGI.pm. Just set $ENV{'CONTENT_LENGTH'} to the appropriate value before initializing the CGI object or calling any CGI functions.)
Or, use a lower-level module that uses the bucket brigade to tell how much data to read.
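A sketch of the custom-header idea above (untested; the endpoint URL is a placeholder, and it assumes Apache is configured with mod_deflate input decompression):
# --- Client side: send gzipped XML plus the uncompressed size ---
use LWP::UserAgent;
use Compress::Zlib;

my $xml        = '<data>...</data>';
my $compressed = Compress::Zlib::memGzip($xml);   # gzip format, as mod_deflate expects

my $ua  = LWP::UserAgent->new;
my $res = $ua->post(
    'https://example.com/cgi-bin/receive.cgi',    # hypothetical endpoint
    'Content-Type'          => 'text/xml',
    'Content-Encoding'      => 'gzip',
    'X-Uncompressed-Length' => length($xml),
    Content                 => $compressed,
);

# --- Server side (separate CGI script): trust the custom header, since
# --- mod_deflate has already inflated the stream before CGI.pm reads it
use CGI;
$ENV{'CONTENT_LENGTH'} = $ENV{'HTTP_X_UNCOMPRESSED_LENGTH'}
    if defined $ENV{'HTTP_X_UNCOMPRESSED_LENGTH'};
my $q = CGI->new;   # now reads the full decompressed body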
I am not sure if I am following what you want, but I have a custom get/post module that I use to do some non-standard stuff. The code below will read in anything sent via POST from STDIN.
read(STDIN, $query_string, $ENV{'CONTENT_LENGTH'});
Instead of using $ENV's value, use yours. I hope this helps, and sorry if it doesn't.