Base URI while using Furl - perl

I've checked the documentation at https://metacpan.org/pod/Furl, but I can't find how to get a site's base URI while using Furl.
With LWP it's easy:
my $res = $ua->get($url);
my $base_uri = $res->base;
The base method tries to get its value from these header fields:
my $base = (
    $self->header('Content-Base'),     # used to be HTTP/1.1
    $self->header('Content-Location'), # HTTP/1.1
    $self->header('Base'),             # HTTP/1.0
)[0];
But I couldn't do the same with Furl.

First: it seems you want to build an anonymous array in $base; for that, it should be:
my $base = [
    $res->header('header1'),
    $res->header('header2'),
    $res->header('header3')
];
The code you had saved only the first header (in your case, Content-Base) and did nothing with the other two; you can verify that with Data::Dumper. Maybe that's why it didn't work.
Second: after reading through the code of the Furl module, I found there is no exposed method for getting a URL's base. So unless your own code also checks for the <base> HTML tag and for the URI you used to request your response (even after redirects), your code might break on some older sites. HTTP::Response does this checking, and that's what LWP uses.
Citation for hierarchy of base URIs: HTTP::Response - HTTP style response message
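For reference, here is a minimal sketch of how you could approximate that header hierarchy on top of Furl. The URL is a placeholder, and the sketch deliberately ignores both the <base> HTML tag and redirects, so it is only a rough stand-in for HTTP::Response->base:
use Furl;
use URI;
my $furl = Furl->new;
my $url  = 'http://example.com/some/page';   # hypothetical URL
my $res  = $furl->get($url);
# Take the first defined base-related header, falling back to the
# request URI. Header values may be relative, so resolve against $url.
my ($base) = grep { defined }
    $res->header('Content-Base'),
    $res->header('Content-Location'),
    $res->header('Base'),
    $url;
$base = URI->new_abs($base, $url);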

Related

Better way to proxy an HTTP request using Perl HTTP::Response and LWP?

I need a Perl CGI script that fetches a URL and then returns the result of the fetch - the status, headers and content - unaltered to the CGI environment so that the "proxied" URL is returned by the web server to the user's browser as if they'd accessed the URL directly.
I'm running my script from cgi-bin in an Apache web server on an Ubuntu 14.04 host, but this question should be independent of server platform - anything that can run Perl CGI scripts should be able to do it.
I've tried using LWP::UserAgent::request() and I've got very close. It returns an HTTP::Response object that contains the status code, headers and content, and even has an as_string method that turns it into a human-readable form. The problem from a CGI perspective is that as_string renders the status code as "HTTP/1.1 200 OK" rather than "Status: 200 OK", so the Apache server doesn't recognise the output as a valid CGI response.
I can fix this by using other methods in HTTP::Response to split out the various parts, but there seems to be no public way of getting at the encapsulated HTTP::Headers object in order to call its as_string method; instead I have to hack into the Perl blessed object hash and yank out the private "_headers" member directly. To me this seems slightly evil, so is there a better way?
Here's some code to illustrate the above. If you put it in your cgi-bin directory then you can call it as
http://localhost/cgi-bin/lwp-test?url=http://localhost/&http-response=1&show=1
You can use a different URL for testing if you want. If you set http-response=0 (or drop the param altogether) then you get the working piece-by-piece solution. If you set show=0 (or drop it) then the proxied request is returned by the script. Apache will return the proxied page if you have http-response=0 and will choke with a 500 Internal Server Error if it's 1.
#!/usr/bin/perl
use strict;
use warnings;
use CGI::Simple;
use HTTP::Request;
use HTTP::Response;
use LWP::UserAgent;
my $q = CGI::Simple->new();
my $ua = LWP::UserAgent->new();
my $req = HTTP::Request->new(GET => $q->param('url'));
my $res = $ua->request($req);
# print a text/plain header if called with "show=1" in the query string
# so proxied URL response is shown in browser, otherwise just output
# the proxied response as if it was ours.
if ($q->param('show')) {
    print $q->header("text/plain");
    print "\n";
}
if ($q->param('http-response')) {
    # This prints the status as "HTTP/1.1 200 OK", not "Status: 200 OK".
    print $res->as_string;
} else {
    # This works correctly as a proxy, but using {_headers} to get at
    # the private encapsulated HTTP::Headers object seems a bit evil.
    # There must be a better way!
    print "Status: ", $res->status_line, "\n";
    print $res->{_headers}->as_string;
    print "\n";
    print $res->content;
}
Please bear in mind that this script was written purely to demonstrate how to forward an HTTP::Response object to the CGI environment and bears no resemblance to my actual application.
You can avoid going through the internals of the response object at $res->{_headers} by using the $res->headers method, which returns the actual HTTP::Headers instance in use. HTTP::Response inherits that method from HTTP::Message.
It would then look like this:
print "Status: ", $res->status_line, "\n";
print $res->headers->as_string;
That looks less evil, though it's still not pretty.
As simbabque pointed out, HTTP::Response has a headers method through inheritance from HTTP::Message. We can also tidy up the handling of the status code by using HTTP::Response's header method to push it into the embedded HTTP::Headers object, then use headers_as_string to print the headers more cleanly. Here's the final script:
#!/usr/bin/perl
use strict;
use warnings;
use CGI::Simple;
use HTTP::Request;
use HTTP::Response;
use LWP::UserAgent;
my $q = CGI::Simple->new();
my $ua = LWP::UserAgent->new();
my $req = HTTP::Request->new(GET => $q->param('url'));
my $res = $ua->request($req);
# print a text/plain header if called with "show=1" in the query string
# so proxied URL response is shown in browser, otherwise just output
# the proxied response as if it was ours.
if ($q->param('show')) {
    print $q->header("text/plain");
}
# $res->as_string returns the status in a "HTTP/1.1 200 OK" line rather than
# a "Status: 200 OK" header field so it can't be used for a CGI response.
# We therefore have a little more work to do...
# convert status from line to header field
$res->header("Status", $res->status_line);
# now print headers and content - don't forget a blank line between the two
print $res->headers_as_string, "\n", $res->content;

How to get URLencoded data from the body of a POST in CGI Perl

POSTDATA is not the correct answer. I have read the docs and still don't see how I can get the data.
I want to receive this request:
POST /cgi-bin/myscript.cgi HTTP/1.1
Host: myhost.com
Content-Length: 3
Content-Type: application/x-www-form-urlencoded

255
and have the server respond
You sent the string "255"
Please assist, I am a Perl beginner and have gotten a bunch of seemingly wrong and useless answers to this seemingly simple request.
CGI will automatically parse form data, so you need to hide from it that what you got is form data (or at least claims to be):
use CGI qw( );
# This must be set before CGI->new, which parses the request on construction.
$ENV{CONTENT_TYPE} = 'application/octet-stream';
my $cgi = CGI->new();
my $post_data = $cgi->param('POSTDATA');
Better solution: Have the requester use a correct content type (e.g. application/octet-stream), or have the requester actually send form data (e.g. data=255).
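For illustration, here is a minimal sketch of the second option, assuming the client actually sends a form-encoded body of data=255 (the parameter name data is an arbitrary choice):
use CGI;
my $cgi = CGI->new;
# With a proper application/x-www-form-urlencoded body like "data=255",
# CGI parses it and exposes the value as a named parameter.
my $value = $cgi->param('data');
print $cgi->header('text/plain');
print qq{You sent the string "$value"\n};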
The only solution for me was changing the Content-Type on the client's request to 'application/octet-stream'.
The CGI module's documentation on CPAN says:
If POSTed data is not of type application/x-www-form-urlencoded or
multipart/form-data, then the POSTed data will not be processed, but
instead be returned as-is in a parameter named POSTDATA.
So if you can't change the client's request to another Content-Type, the data won't be processed.
CGI (in recent versions, at least) will stuff incorrectly encoded x-www-form-urlencoded params into a parameter named keywords. It's better to send a proper content type, though; then POSTDATA works exactly as the docs say:
If POSTed data is not of type application/x-www-form-urlencoded or
multipart/form-data, then the POSTed data will not be processed...
use strictures;
use CGI::Emulate::PSGI;
use Plack::Test;
use HTTP::Request::Common;
use Test::More;
my $post = POST "/non-e-importa",
    "Content-Length" => 5,
    "Content-Type"   => "application/x-www-form-urlencoded",
    Content          => "ohai\n";
my $cgis = CGI::Emulate::PSGI->handler( sub {
    use CGI "param", "header";
    my $incorrectly_encoded_body = param("keywords");
    print header("text/plain"), $incorrectly_encoded_body;
});
test_psgi $cgis, sub {
    my $cb  = shift;
    my $res = $cb->($post);
    is $res->content, "ohai", "Soopersek437 param: keywords";
};
done_testing();
__END__
prove so-16846138 -v
ok 1 - Soopersek437 param: keywords
1..1
ok
All tests successful.
Result: PASS

How to detect a changed webpage?

In my application, I fetch webpages periodically using LWP. Is there any way to check whether the webpage has changed between two consecutive fetches, other than explicitly comparing the content? Is there any signature (say, a CRC) generated at a lower protocol layer that can be extracted and compared against older signatures to detect possible changes?
There are two possible approaches. One is to use a digest of the page, e.g.
use strict;
use warnings;
use Digest::MD5 'md5_hex';
use LWP::UserAgent;
my $ua       = LWP::UserAgent->new;
my $response = $ua->get($url);    # fetch the page; $url defined elsewhere
my $digest = md5_hex $response->decoded_content;
# $saved_digest is the digest stored from the previous fetch
if ( $digest ne $saved_digest ) {
    # the page has changed.
}
Another option is to use an HTTP ETag, if the server provides one for the resource requested. You can simply store it and then set your request headers to include an If-None-Match field on subsequent requests. If the server ETag has remained the same, you'll get a 304 Not Modified status and an empty response body. Otherwise you'll get the new page. (And new ETag.) See Entity Tags in RFC2616.
Of course, the server could be lying, and sending the same ETag even though the content has changed. There's no way to know unless you look.
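Here is a minimal sketch of the ETag approach (the URL is a placeholder; LWP::UserAgent's get accepts extra header fields as key/value pairs after the URL):
use LWP::UserAgent;
my $ua  = LWP::UserAgent->new;
my $url = 'http://example.com/page.html';   # hypothetical URL
my $res  = $ua->get($url);
my $etag = $res->header('ETag');            # store this between runs
# On a later fetch, present the stored ETag:
my $res2 = $ua->get( $url, 'If-None-Match' => $etag );
if ( $res2->code == 304 ) {
    # unchanged; the body is empty
}
else {
    # new content (and a new ETag in $res2->header('ETag'))
}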
You should use the If-Modified-Since request header, noting the gotchas in the RFC. You send this header with the request. If the server supports it and thinks the content is newer, it sends it to you. If it thinks you have the most recent version, it returns a 304 with no message body.
However, as other answers have noted, the server doesn't have to tell you the truth, so you're sometimes stuck downloading the content and checking yourself. Many dynamic things will always claim to have new content because many developers have never thought about supporting basic HTTP things in their web apps.
For the LWP bits, you can create a single request with an extra header:
use HTTP::Request;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
my $request = HTTP::Request->new( GET => $url );
# $time must be an HTTP date string (see HTTP::Date's time2str)
$request->header( 'If-Modified-Since' => $time );
$ua->request( $request );
For all requests, you can set a request handler:
$ua->add_handler(
    request_send => sub {
        my ($request, $ua, $h) = @_;
        # ... look up time from local store
        $request->header( 'If-Modified-Since' => $time );
    }
);
However, LWP can do most of this for you with mirror if you want to save the files:
$ua->mirror( $url, $filename );
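For reference, mirror sets If-Modified-Since from the local file's modification time, only rewrites the file when the server returns new content, and returns the HTTP::Response so you can check for a 304 Not Modified yourself.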

Trying to get source code of a webpage in perl

I'm trying to get the HTML source of a webpage using Perl's get function. I wrote the code 5 months ago and it was working fine, but yesterday I made a small edit and it has failed to work ever since, no matter what I tried.
Here is the code I tried.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my $link = 'www.google.com';
my $sou = get($link) or die "cannot retrieve code\n";
print $sou;
The code compiles fine, but it's not able to retrieve the source; instead it displays
cannot retrieve code
LWP::Simple's get needs an absolute URL, including the scheme:
my $link = 'http://www.google.com';
This might be a bit late. I have been struggling with the same problem, and I think I have figured out why it occurs. I usually web-scrape websites with Python, and I have found that it is ideal to include some extra header info with the GET requests. This fools the website into thinking the bot is a person, so the bot gets access to the website and doesn't trigger a 400 Bad Request status code.
So I applied this thinking to my Perl script, which was similar to yours, and just added some extra header info. The result gave me the source code for the website without a struggle.
Here is the code:
#!/usr/bin/perl
use strict;
use warnings;
# This line specifies the LWP version; if not put in place, the code will fail.
use LWP 5.64;
# This line defines the virtual browser.
my $browser = LWP::UserAgent->new;
# This defines the header information that will be given to the website
# (e.g. Google) in case the website would otherwise send a 400 Bad Request.
my @ns_headers = (
    'User-Agent'      => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept'          => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
    'Accept-Charset'  => 'iso-8859-1,*,utf-8',
    'Accept-Language' => 'en-US',
);
# This line defines the URL that the user agent will browse.
my $url = 'https://www.google.com/';
# This line requests data from the URL specified above.
my $response = $browser->get($url, @ns_headers) or die "cannot retrieve code\n";
# Decode the response so the HTML source code is visible.
my $HTML = $response->decoded_content;
print $HTML;
I used LWP::UserAgent as it has the ability to add extra header information.
I hope this helped,
ME.
PS. Sorry if you already have the answer for this, just wanted to help.

How can I detect the file type of image at a URL?

How can I find an image's file type from a website URL in Perl?
For example,
$image_name = "logo";
$image_path = "http://stackoverflow.com/content/img/so/" . $image_name;
From this information, how can I find the file type? In this example it should display
"png"
because the full URL is http://stackoverflow.com/content/img/so/logo.png.
Suppose the site has more files with the same base name, like the SO website does; it should then show all of their file types.
If you're using LWP to fetch the image, you can look at the content-type header returned by the HTTP server.
Both WWW::Mechanize and LWP::UserAgent will give you an HTTP::Response object for any GET request. So you can do something like:
use strict;
use warnings;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
$mech->get( "http://stackoverflow.com/content/img/so/logo.png" );
my $type = $mech->response->headers->header( 'Content-Type' );
You can't easily tell. The URL doesn't necessarily reflect the type of the image.
To get the image type you have to make a request via HTTP (GET, or more efficiently, HEAD), and inspect the Content-type header in the HTTP response.
Well, https://stackoverflow.com/content/img/so/logo is a 404. If it were not, then you could use
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my ($content_type) = head "https://stackoverflow.com/content/img/so/logo.png";
print "$content_type\n" if defined $content_type;
__END__
As Kent Fredric points out, what the web server tells you about content type need not match the actual content sent by the web server. Keep in mind that File::MMagic can also be fooled.
#!/usr/bin/perl
use strict;
use warnings;
use File::MMagic;
use LWP::UserAgent;
my $mm = File::MMagic->new;
my $ua = LWP::UserAgent->new(
    max_size => 1_000 * 1_024,
);
my $res = $ua->get('https://stackoverflow.com/content/img/so/logo.png');
if ( $res->code eq '200' ) {
    print $mm->checktype_contents( $res->content );
}
else {
    print $res->status_line, "\n";
}
__END__
You really can't make assumptions about content based on URL, or even content type headers.
They're only guides to what is being sent.
A handy trick to confuse things that use suffix matching to identify file-types is doing this:
http://example.com/someurl?q=foo#fakeheheh.png
And if you were to arbitrarily permit that image to be added to the page, it might in some cases be a doorway to an attack of sorts if the browser followed it. ( For example, http://really_awful_bank.example.com/transfer?amt=1000000;from=123;to=123 )
Content-type based forgery is not so detrimental, but you can do nasty things if the person who controls the resource works out how you identify things and sends a different content type for HEAD requests than for GET requests.
It could tell the HEAD request that it's an image, but then tell the GET request that it's application/javascript, and goodness knows where that will lead.
The only way to know for certain what it is is to download the file and do magic-based identification, or go further (i.e., try to decode the image). Then all you have to worry about is images that are too large, and specially crafted images that could trip vulnerabilities in computers that are not yet patched for them.
Granted all of the above is extreme paranoia, but if you know the rare possibilities you can make sure they can't happen :)
From what I understand, you're not worried about the content type of an image you already know the name+extension for; you want to find the extension for an image you know only the base name of.
In order to do that, you'd have to test each image extension individually and record which ones resolve and which don't. For example, both https://stackoverflow.com/content/img/so/logo.png and https://stackoverflow.com/content/img/so/logo.gif could exist. (They don't in this exact case, but on some arbitrary server you could have multiple images with the same base name but different extensions.) Unfortunately, there's no way to get a list of available extensions for a file in a remote web directory from its base name without looping through the possibilities.
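A minimal sketch of that probing approach, using HEAD requests so the image bodies aren't downloaded (the extension list is an arbitrary choice):
use strict;
use warnings;
use LWP::UserAgent;
my $ua   = LWP::UserAgent->new;
my $base = 'https://stackoverflow.com/content/img/so/logo';
# Try each candidate extension and report the ones the server actually serves.
for my $ext (qw( png gif jpg jpeg ico svg )) {
    my $res = $ua->head("$base.$ext");
    print "$ext => ", $res->header('Content-Type'), "\n" if $res->is_success;
}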