How to detect a changed webpage?

How to detect a changed webpage? - perl

In my application, I fetch webpages periodically using LWP. Is there anyway to check whether between two consecutive fetches the webpage has got changed in some respect (other than explicitly doing a comparison) ? Is there any signature(say CRC) that is being generated at lower protocol layers which can be extracted and compared against older signatures to see possible changes ?

There are two possible approaches. One is to use a digest of the page, e.g.
use strict;
use warnings;
use Digest::MD5 'md5_hex';
use LWP::UserAgent;
# fetch the page, etc.
my $digest = md5_hex $response->decoded_content;
if ( $digest ne $saved_digest ) {
# the page has changed.
}
Another option is to use an HTTP ETag, if the server provides one for the resource requested. You can simply store it and then set your request headers to include an If-None-Match field on subsequent requests. If the server ETag has remained the same, you'll get a 304 Not Modified status and an empty response body. Otherwise you'll get the new page. (And new ETag.) See Entity Tags in RFC2616.
Of course, the server could be lying, and sending the same ETag even though the content has changed. There's no way to know unless you look.

You should use the If-Modified-Since request header, noting the gotchas in the RFC. You send this header with the request. If the server supports it and thinks the content is newer, it sends it to you. If it thinks you have the most recent version, it returns a 304 with no message body.
However, as other answers have noted, the server doesn't have to tell you the truth, so you're sometimes stuck downloading the content and checking yourself. Many dynamic things will always claim to have new content because many developers have never thought about supporting basic HTTP things in their web apps.
For the LWP bits, you can create a single request with an extra header:
use HTTP::Request;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new;
my $request = HTTP::Request->new( GET => $url );
$r->header( 'If-Modified-Since' => $time );
$ua->request( $request );
For all requests, you can set a request handler:
$ua->add_handler(
request_send => sub {
my($request, $ua, $h) = #_;
# ... look up time from local store
$r->header( 'If-Modified-Since' => $time );
}
);
However, LWP can do most of this for you with mirror if you want to save the files:
$ua->mirror( $url, $filename )

Related

Send HTTP streaming request in perl

I want to send xml request using HTTP streaming protocol . where transfer-encoding is "chunked". Currently i am using LWP::UserAgent to send the xml transaction.
my $userAgent = LWP::UserAgent->new;
my $starttime = time();
my $response = $userAgent->request(POST $url,
Content_Type => 'application/xml',
Transfer_Encoding => 'Chunked',
Content => $xml);
print "Response".Dumper($response);
But i am getting http status code 411 Length Required. Which means "client error response code indicates that the server refuses to accept the request without a defined "
How we can handle this while sending a request in chunked ?

LWP::UserAgent's API isn't designed to send a stream, but it is able to do so with minimal hacking.
use strict;
use warnings qw( all );
use HTTP::Request::Common qw( POST );
use LWP::UserAgent qw( );
my $ua = LWP::UserAgent->new();
# Don't provide any content.
my $request = POST('http://stackoverflow.org/junk',
Content_Type => 'application/xml',
);
# POST() insists on adding a Content-Length header.
# We need to remove it to get a chunked request.
$request->headers->remove_header('Content-Length');
# Here's where we provide the stream generator.
my $buffer = 'abc\n';
$request->content(sub {
return undef if !length($buffer); # Return undef when done.
return substr($buffer, 0, length($buffer), ''); # Return a chunk of data otherwise.
});
my $response = $ua->request($request);
print($response->status_line);
Using a proxy (Fiddler), we can see this does indeed send a chunked request:
There's no point in using a chunked request if you already have the entire document handy like in the example you gave. Instead, let's say wanted to upload the output of some external tool as it produced its output. To do that, you could use the following:
open(my $pipe, '-|:raw', 'some_tool');
$request->content(sub {
my $rv = sysread($pipe, my $buf, 64*1024);
die $! if !defined($rv);
return undef if !$rv;
return $buf;
});

But i am getting http status code 411 Length Required.
Not all servers understand a request with a chunked payload even though this is standardized in HTTP/1.1 (but not in HTTP/1.0). For example nginx only supports chunking within a request since version 1.3.9 (2012), see Is there a way to avoid nginx 411 Content-Length required errors?. If the server does not understand a request with chunked encoding there is nothing you can do from the client side, i.e. you simply cannot use chunked transfer encoding then. If you have control over the server make sure that the server actually supports it.
I've also never experienced browsers send such requests, probably since they cannot guarantee that the server will support such request. I've only seen mobile apps used where the server and app is managed by the same party and thus support for chunked requests can be guaranteed.

Perl get request returns empty response, maybe session related?

I was using an open source tool called SimTT which gets an URL of a tabletennis league and then calculates the probable results (e.g. ranking of teams and players). Unfortunately the webpage moved to a different webpage.
I downloaded the open source and repaired the parsing of the webpage, but currently I'm only able to download the page manually and read it then from a file.
Below you can find an excerpt of my code to retrieve the page. It prints success, but the response is empty. Unfortunately I'm not familiar with perl and webtechniques very well, but in Wireshark I could see that one of the last things send was a new session key. But I'm not sure, if the problem is related to cookies, ssl or something like that.
It would be very nice if someone could help me to get access. I know that there are some people out there which would like to use the tool.
So heres the code:
use LWP::UserAgent ();
use HTTP::Cookies;
my $ua = LWP::UserAgent->new(keep_alive=>1);
$ua->agent('Mozilla/5.0');
$ua->cookie_jar({});
my $request = new HTTP::Request('GET', 'https://www.mytischtennis.de/clicktt/ByTTV/18-19/ligen/Bezirksoberliga/gruppe/323819/mannschaftsmeldungen/vr');
my $response = $ua->request($request);
if ($response->is_success) {
print "Success: ", $response->decoded_content;
}
else {
die $response->status_line;
}

Either there is some rudimentary anti-bot protection at the server or the server is misconfigured or otherwise broken. It looks like it expects to have an Accept-Encoding header in the request which LWP by default does not sent. The value of this header does not really seem to matter, i.e. the server will send the content compressed with gzip if the client claims to support it but it will send uncompressed data if the client offered only a compression method which is unknown to the server.
With this knowledge one can change the code like this:
my $request = HTTP::Request->new('GET',
'https://www.mytischtennis.de/clicktt/ByTTV/18-19/ligen/Bezirksoberliga/gruppe/323819/mannschaftsmeldungen/vr',
[ 'Accept-Encoding' => 'foobar' ]
);
With this simple change the code works currently for me. Note that it might change at any time if the server setup will be changed, i.e. it might then need other workarounds.

how to know which http action called from rest client at server side

When a request was made(actions like GET POST PATCH) to server through a rest client like LWP or REST::Client or HTTP::Request. how can we decode the request so that we will get the actual method which is called from client. if we can get the action we will process or respond to client accordingly.
this way i am able to get headers, all params sent in post request.
my $q = CGI->new;
my $input = $q->param( 'POSTDATA' ); # for content
my %headers = map { $_ => $q->http($_) } $q->http();
print $q->header('text/plain');
print "Got the following headers:\n";
for my $header ( keys %headers ) {
print "$header: $headers{$header}\n";
}
Now my question is how to receive the action like GET or POST.

From the docs
request_method()
Returns the method used to access your script, usually one of 'POST', 'GET' or 'HEAD'.
Also from the docs:
CGI.pm is no longer considered good practice for developing web applications, including quick prototyping and small web scripts. There are far better, cleaner, quicker, easier, safer, more scalable, more extensible, more modern alternatives available at this point in time. These will be documented with CGI::Alternatives.

LWP Send Post request and get headers only in response

I have code like this
my $ua = new LWP::UserAgent;
$ua->timeout($timeout);
$ua->agent($useragent);
$response = $ua->post($domain,['login_name'=>$login,'login_password'=> $password])->as_string;
Content of page so large, thatI can't receive it. How to get only headers with sending post data?

I think this should do it for you.
my $ua = LWP::UserAgent->new();
$ua->timeout($timeout);
$ua->agent($useragent);
my $response = $ua->post(
$domain,
[ 'login_name' => $login, 'login_password' => $password ]
);
use Data::Dumper;
print Dumper( $response->headers() );
print $response->request()->content(), "\n";

To first, check if you can pass this login_name and login_password via HEAD (in url string: domain/?login_name=...&login_password=...). If this will not work, you are in bad case.
You cannot use POST with behavior of HEAD. LWP will wait full response.
Using POST the server will send you the content anyway, but you can avoid receiving all content using sockets tcp by yourself: gethostbyname, connect, sysread until you get /\r?\n\r?\n/ and close socket after this. Some traffic will be utilized anyway, but you can save memory and receive time.
Its not normal thing to do this with sockets, but sometimes when you have highload/big data - there is no better way than such mess.

How can I detect the file type of image at a URL?

How to find the image file type in Perl form website URL?
For example,
$image_name = "logo";
$image_path = "http://stackoverflow.com/content/img/so/".$image_name
From this information how to find the file type that . here the example it should display
"png"
http://stackoverflow.com/content/img/so/logo.png .
Supposer if it has more files like SO web site . it should show all file types

If you're using LWP to fetch the image, you can look at the content-type header returned by the HTTP server.
Both WWW::Mechanize and LWP::UserAgent will give you an HTTP::Response object for any GET request. So you can do something like:
use strict;
use warnings;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
$mech->get( "http://stackoverflow.com/content/img/so/logo.png" );
my $type = $mech->response->headers->header( 'Content-Type' );

You can't easily tell. The URL doesn't necessarily reflect the type of the image.
To get the image type you have to make a request via HTTP (GET, or more efficiently, HEAD), and inspect the Content-type header in the HTTP response.

Well, https://stackoverflow.com/content/img/so/logo is a 404. If it were not, then you could use
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my ($content_type) = head "https://stackoverflow.com/content/img/so/logo.png";
print "$content_type\n" if defined $content_type;
__END__
As Kent Fredric points out, what the web server tells you about content type need not match the actual content sent by the web server. Keep in mind that File::MMagic can also be fooled.
#!/usr/bin/perl
use strict;
use warnings;
use File::MMagic;
use LWP::UserAgent;
my $mm = File::MMagic->new;
my $ua = LWP::UserAgent->new(
max_size => 1_000 * 1_024,
);
my $res = $ua->get('https://stackoverflow.com/content/img/so/logo.png');
if ( $res->code eq '200' ) {
print $mm->checktype_contents( $res->content );
}
else {
print $res->status_line, "\n";
}
__END__

You really can't make assumptions about content based on URL, or even content type headers.
They're only guides to what is being sent.
A handy trick to confuse things that use suffix matching to identify file-types is doing this:
http://example.com/someurl?q=foo#fakeheheh.png
And if you were to arbitrarily permit that image to be added to the page, it might in some cases be a doorway to an attack of some sorts if the browser followed it. ( For example, http://really_awful_bank.example.com/transfer?amt=1000000;from=123;to=123 )
Content-type based forgery is not so detrimental, but you can do nasty things if the person who controls the name works out how you identify things and sends different content types for HEAD requests as it does for GET requests.
It could tell the HEAD request that it's an Image, but then tell the GET request that its a application/javascript and goodness knows where that will lead.
The only way to know for certain what it is is downloading the file and then doing MAGIC based identification, or more (i.e., try to decode the image). Then all you have to worry about is images that are too large, and specially crafted images that could trip vulnerabilities in computers that are not yet patched for that vulnerability.
Granted all of the above is extreme paranoia, but if you know the rare possibilities you can make sure they can't happen :)

From what i understand you're not worried about the content type of an image you already know the the name+extension for, you want to find the extension for an image you know the base name of.
In order to do that you'd have to test all the image extensions you wanted individually and store which ones resolved and which ones didn't. For example both https://stackoverflow.com/content/img/so/logo.png and https://stackoverflow.com/content/img/so/logo.gif could exist. They don't in this exact situation but on some arbitrary server you could have multiple images with the same base name but different extensions. Unfortunately there's no way to get a list of available extensions of a file in a remote web directory by supplying its base name without looping through the possibilities.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

How to detect a changed webpage? - perl

Related

Send HTTP streaming request in perl

Perl get request returns empty response, maybe session related?

how to know which http action called from rest client at server side

LWP Send Post request and get headers only in response

How can I detect the file type of image at a URL?

Categories

Resources