Malformed URL versus broken link in Perl

I'm looking to distinguish between typos in urls and actual broken links. For example:
typo:
www.google/com
broken link:
www.thislinkpointstonothing.org
I would like my application to recognize the first URL as malformed and to indicate that the second URL returns a 404 Not Found when queried. Is there a Perl module or a regex I can use to make this distinction?
In other words, I'm trying to distinguish between links that are bad because of typos and links that follow the RFC but are bad simply because the page no longer exists.

You are talking about two different types of "queries": DNS lookups and HTTP requests. HTTP requests usually make use of DNS lookups, but not always: for example, a server can be located on a local network and you can request pages from it using an IP address or name information from an /etc/hosts file. Links may not always contain a host name portion either, since they can be relative (which is often good practice for making a website easy to move behind a reverse proxy, or to survive a host name change).
With that distinction in mind, the gist of your question (whether you can check a link for correctness of the URI versus a genuinely missing page returning a 404) is sound, but a DNS query for google/com should not succeed. Is your application being redirected by a proxy to a search page?
A rough approximation follows; it probably won't be much use as it is, but you'll get the idea. For more polished approaches in Perl, you might want to check whether some of the more featureful frameworks (Catalyst, Mojolicious) have methods for doing this. Also, if you are doing this for your front-end UI (i.e. JavaScript on a web page) there may be well-established approaches that are quicker or simpler. The following steps seem like what you want to do with each link:
1. Check that the link/URI is well formed; if not, print an error; if it is, then:
2. Do a DNS lookup on the host part of the link/URI; if it fails, print an error; if it succeeds, then:
3. Try to fetch the web page and print any errors, or report success.
Measuring "success" is hard to automate though: should it be defined strictly as status "200 OK"? Perhaps you or another member can add that part and find an elegant way to read in the "links" (broken or otherwise).
The rough script below stashes the captured pieces in plain lexicals so it runs under use strict. Please somebody make it look nicer and use only CORE:: modules :-)
use strict;
use warnings;
use 5.10.0;                 # for say()
use Regexp::Common qw(URI);
use Net::DNS;
use LWP::UserAgent;

my $url    = "http://www.google.com/adsfdsa";
my $lookup = Net::DNS::Resolver->new;
my $ua     = LWP::UserAgent->new;

### Step 0. regexp the URI ###
my ($hostpart, $filepart);
if ($url =~ /$RE{URI}{HTTP}{-keep}/) {
    say "$url is a URI";
    $hostpart = $3;    # host part, as per the Regexp::Common::URI man page
    $filepart = $5;    # path part
}
else {
    die "$url is not a well-formed HTTP URI\n";
}

### Step 1. do a DNS lookup on the host part ###
if ($lookup->query($hostpart)) { say "$hostpart is a valid host" }
else                           { die "but $hostpart is an invalid host\n" }

### Step 2. fetch the page and check the return code ###
my $request  = HTTP::Request->new(GET => $url);
my $response = $ua->request($request);
if ($response->is_success || $response->is_redirect) {
    print $response->content;    # the response body, not the request
}
else {
    say "but $filepart is an invalid path";
}
Output:
http://www.google.com/adsfdsa is a URI
www.google.com is a valid host
but /adsfdsa is an invalid path
You could make something like the above (but shorter, more elegant and more efficient!) read from your input and return the appropriate message to your users from wherever the error happens, i.e. at step 0, 1 or 2; see the sketch below. Note that there are likely faster and simpler ways to do this, but modules like Regexp::Common and LWP are well exercised and well tested.
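For example, a minimal sketch of such a loop, reading one URL per line from STDIN (the messages and the timeout value are just illustrative):

use strict;
use warnings;
use 5.10.0;
use Regexp::Common qw(URI);
use Net::DNS;
use LWP::UserAgent;

my $resolver = Net::DNS::Resolver->new;
my $ua       = LWP::UserAgent->new(timeout => 10);

while (my $url = <STDIN>) {
    chomp $url;
    next unless length $url;

    # Step 0: is it a well-formed HTTP URI at all?
    if ($url !~ /$RE{URI}{HTTP}{-keep}/) {
        say "$url: malformed URI";
        next;
    }
    my ($host, $path) = ($3, $5);

    # Step 1: does the host part resolve?
    if (!$resolver->query($host)) {
        say "$url: host $host does not resolve";
        next;
    }

    # Step 2: can we actually fetch the page?
    my $response = $ua->get($url);
    say $response->is_success
        ? "$url: OK"
        : "$url: fetch failed (" . $response->status_line . ")";
}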

Related

Perl get request returns empty response, maybe session related?

I was using an open-source tool called SimTT which fetches the URL of a table tennis league and then calculates the probable results (e.g. the ranking of teams and players). Unfortunately the page moved to a different URL.
I downloaded the source and repaired the parsing of the page, but currently I'm only able to download the page manually and then read it from a file.
Below you can find an excerpt of my code to retrieve the page. It prints "Success", but the response is empty. Unfortunately I'm not very familiar with Perl and web techniques, but in Wireshark I could see that one of the last things sent was a new session key. I'm not sure whether the problem is related to cookies, SSL or something like that.
It would be very nice if someone could help me get access. I know that there are some people out there who would like to use the tool.
So here's the code:
use LWP::UserAgent ();
use HTTP::Cookies;
my $ua = LWP::UserAgent->new(keep_alive=>1);
$ua->agent('Mozilla/5.0');
$ua->cookie_jar({});
my $request = HTTP::Request->new('GET', 'https://www.mytischtennis.de/clicktt/ByTTV/18-19/ligen/Bezirksoberliga/gruppe/323819/mannschaftsmeldungen/vr');
my $response = $ua->request($request);
if ($response->is_success) {
print "Success: ", $response->decoded_content;
}
else {
die $response->status_line;
}
Either there is some rudimentary anti-bot protection at the server, or the server is misconfigured or otherwise broken. It looks like it expects an Accept-Encoding header in the request, which LWP by default does not send. The value of this header does not really seem to matter, i.e. the server will send the content compressed with gzip if the client claims to support it, but it will send uncompressed data if the client offers only a compression method which is unknown to the server.
With this knowledge one can change the code like this:
my $request = HTTP::Request->new('GET',
'https://www.mytischtennis.de/clicktt/ByTTV/18-19/ligen/Bezirksoberliga/gruppe/323819/mannschaftsmeldungen/vr',
[ 'Accept-Encoding' => 'foobar' ]
);
With this simple change the code currently works for me. Note that this might change at any time if the server setup is changed, i.e. it might then need other workarounds.
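For reference, the question's script with only that change applied might look like this (a sketch):

use strict;
use warnings;
use LWP::UserAgent ();
use HTTP::Request ();
use HTTP::Cookies;

my $ua = LWP::UserAgent->new(keep_alive => 1);
$ua->agent('Mozilla/5.0');
$ua->cookie_jar({});

# The bogus Accept-Encoding value is deliberate: the server only seems to care
# that the header is present, not that it names an encoding it knows.
my $request = HTTP::Request->new(
    'GET',
    'https://www.mytischtennis.de/clicktt/ByTTV/18-19/ligen/Bezirksoberliga/gruppe/323819/mannschaftsmeldungen/vr',
    [ 'Accept-Encoding' => 'foobar' ],
);

my $response = $ua->request($request);
if ($response->is_success) {
    print "Success: ", $response->decoded_content;
}
else {
    die $response->status_line;
}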

Getting the Error 403 Forbidden while posting the website address through Perl

use strict;
use LWP::UserAgent;
my $UserAgent = LWP::UserAgent->new;
my $response = $UserAgent->get("https://scholar.google.co.in/scholar_lookup?author=N.+R.+Alpert&author=S.+A.+Mohiddin&author=D.+Tripodi&author=J.+Jacobson-Hatzell&author=K.+Vaughn-Whitley&author=C.+Brosseau+&publication_year=2005&title=Molecular+and+phenotypic+effects+of+heterozygous,+homozygous,+and+compound+heterozygote+myosin+heavy-chain+mutations&journal=Am.+J.+Physiol.+Heart+Circ.+Physiol.&volume=288&pages=H1097-H1102");
if ($response->is_success)
{
$response->content =~ /<title>(.*?) - Google Scholar<\/title>/;
print $1;
}
else
{
die $response->status_line;
}
I am getting the below error while running this script.
403 Forbidden at D:\Getelement.pl line 52.
I have pasted this website address into the browser's address bar and it goes exactly to that site, but it does not work when run from my script.
Can you please help me with this issue?
Google's Terms of Service disallow automated searches. They are detecting that you're sending this from a script, because your headers and a browser's standard headers are very different; you can analyze them yourself if you want.
In the old times they had a SOAP API and you could use modules like WWW::Search::Google, but that's not the case anymore because that API was deprecated.
Alternatives were already discussed in the following StackOverflow question:
What are the alternatives now that the Google web search API has been deprecated?
Google has blacklisted LWP::UserAgent: they either blacklisted the User-Agent string or some other part of the request (headers or whatever).
I suggest you use Mojo::UserAgent. Its default request looks more like one from a browser, and you need to write only a single line of code.
use Mojo::UserAgent;
use strict;
use warnings;
print Mojo::UserAgent->new->get('https://scholar.google.co.in/scholar_lookup?author=N.+R.+Alpert&author=S.+A.+Mohiddin&author=D.+Tripodi&author=J.+Jacobson-Hatzell&author=K.+Vaughn-Whitley&author=C.+Brosseau+&publication_year=2005&title=Molecular+and+phenotypic+effects+of+heterozygous,+homozygous,+and+compound+heterozygote+myosin+heavy-chain+mutations&journal=Am.+J.+Physiol.+Heart+Circ.+Physiol.&volume=288&pages=H1097-H1102')->res->dom->at('title')->text;
# Prints Molecular and phenotypic effects of heterozygous, homozygous, and
# compound heterozygote myosin heavy-chain mutations - Google Scholar
Terms
This code does not accept any terms of service, and no additional lines have been added to bypass security checks. It's absolutely fine.
You can fetch your content if you add a User Agent string to identify yourself to the web server:
...
my $UserAgent = LWP::UserAgent->new;
$UserAgent->agent('Mozilla/5.0'); #...add this...
...
print $1;
...
This prints: "Molecular and phenotypic effects of heterozygous, homozygous, and compound heterozygote myosin heavy-chain mutations"
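Putting that together with the question's script, a complete version might look like this sketch (the only changes are the agent line and a guard so that print $1 runs only when the title pattern actually matched):

use strict;
use LWP::UserAgent;

my $UserAgent = LWP::UserAgent->new;
$UserAgent->agent('Mozilla/5.0');    # identify with a browser-style User-Agent string

my $response = $UserAgent->get("https://scholar.google.co.in/scholar_lookup?author=N.+R.+Alpert&author=S.+A.+Mohiddin&author=D.+Tripodi&author=J.+Jacobson-Hatzell&author=K.+Vaughn-Whitley&author=C.+Brosseau+&publication_year=2005&title=Molecular+and+phenotypic+effects+of+heterozygous,+homozygous,+and+compound+heterozygote+myosin+heavy-chain+mutations&journal=Am.+J.+Physiol.+Heart+Circ.+Physiol.&volume=288&pages=H1097-H1102");

if ($response->is_success)
{
    if ($response->decoded_content =~ /<title>(.*?) - Google Scholar<\/title>/)
    {
        print $1;
    }
}
else
{
    die $response->status_line;
}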

Perl transparent proxy

I'm trying to create a transparent HTTP proxy. Its purpose is to sit between the browser and the web server and be invisible. Here is the code I'm using. Unfortunately it's not working very well: when I open the web page (referenced by $base_uri) I get different results depending on whether I've opened it with the browser alone or with the browser through the proxy. I'm trying it on a web site which returns all kinds of responses, including "transfer-encoding: chunked" (so I guess maybe the problem could be there?!). I think there could also be problems with the cookies, but I don't know how to solve them (if any...).
#!/usr/bin/perl
use strict;
use HTTP::Daemon;
use LWP::UserAgent;
use HTTP::Cookies;
my $cookie_jar = HTTP::Cookies->new();
my $ua = LWP::UserAgent->new( max_redirect => 0, env_proxy => 0,keep_alive => 1, timeout => 30, agent => "Mozilla/4.76 [en] (Win98; U)");
my $d = HTTP::Daemon->new(
LocalHost => "localhost", # remove this to listen from other machines
# (i.e. open-relay... be careful of spammers!)
LocalPort => 33331
) || die;
print "[Proxy URL:", $d->url, "]\n";
fork(); fork(); fork(); # 2^3 = 8 processes
$ua->cookie_jar($cookie_jar);
my $base_uri = 'http://example.com/';
while (my $c = $d->accept) {
while (my $request = $c->get_request) {
my $uri = $base_uri . $request->uri->as_string();
my $method = $request->method;
my $req = HTTP::Request->new($method, $uri);
$request->uri($uri);
print "[[ $method >> $uri ]]\n";
my $response = $ua->simple_request($request);
$c->send_response( $response );
}
$c->close;
undef($c);
}
Thank you in advance!
It is not clear what you really want; you should describe it in much more detail. If you explain what you need that proxy for, why, and which features it needs, any help can be much better. Nevertheless I'll try.
What you currently do is take an incoming connection, extract the URI and the request method, and pass them on to your source. Well, HTTP is much more complex: you strip everything else, such as the transported data (e.g. for POST requests) as well as all the header lines (cookies, login data, browser identification, language specs, ...), which usually carry important information. You also modify the timing behavior a lot. Then you send it to your proxy target.
Now you take the server's answer and again strip everything relevant. You also only relay the answer after it has finished; for streaming data this will not work (you already mentioned chunked transfer mode). Your method also requires a lot of memory for large files (e.g. a 4 GB DVD image).
Without further details about your application no real suggestion is possible, but some ideas:
a) As mentioned in the comments, there are Perl proxy modules available. You can test them and see if they fit your needs.
b) You can go down a level: use IO::Socket::INET or INET6 and work directly at the socket level, sending each chunk of data straight on as it comes in (see the sketch below). This is nearly 100% transparent (except for the IP address and probably the TCP packet sizes). Though if you want to change or inspect the data, you then have to handle the HTTP yourself, which can be really complicated nowadays (especially due to transfer encodings).
c) Maybe don't code it yourself, but use an existing proxy (e.g. the Apache web server or a dedicated proxy program).
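As a very rough illustration of option b), here is a minimal byte-level forwarder sketch using only IO::Socket::INET and IO::Select. The listen port and the upstream host/port are placeholders (matching the question's fixed $base_uri), it serves one connection at a time, and it does no HTTP parsing at all:

#!/usr/bin/perl
use strict;
use warnings;
use IO::Socket::INET;
use IO::Select;

my $listen_port = 33331;                                    # placeholder local port
my ($upstream_host, $upstream_port) = ('example.com', 80);  # placeholder target

my $listener = IO::Socket::INET->new(
    LocalAddr => 'localhost',
    LocalPort => $listen_port,
    Listen    => 5,
    ReuseAddr => 1,
) or die "listen failed: $!";

while (my $client = $listener->accept) {
    my $server = IO::Socket::INET->new(
        PeerAddr => $upstream_host,
        PeerPort => $upstream_port,
    ) or do { close $client; next };

    # Shuttle raw bytes in both directions until either side closes.
    my $sel = IO::Select->new($client, $server);
    SHUTTLE: while (my @ready = $sel->can_read) {
        for my $fh (@ready) {
            my $n = sysread($fh, my $buf, 65536);
            last SHUTTLE unless $n;                         # EOF or error: tear down
            my $peer = ($fh == $client) ? $server : $client;
            syswrite($peer, $buf) or last SHUTTLE;          # pass bytes through untouched
        }
    }
    close $client;
    close $server;
}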

is it possible to write a single perl cgi script to serve all http requests

I am wondering if it is possible to use a single Perl CGI script to serve all HTTP requests to my site, no matter what relative URL the visitors ask for.
Please share your thoughts. Many thanks.
If you call your script index.cgi and combine that with a mod_rewrite rule to redirect all requests to /index.cgi/foo then foo will be available as $ENV{'PATH_INFO'}, thereby letting you know what the original request path was.
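A minimal sketch of that setup, assuming Apache and a script named index.cgi as in this answer (the rewrite rule is shown as a comment):

#!/usr/bin/perl
# Assumed Apache configuration (in .htaccess or the vhost config):
#
#   RewriteEngine On
#   RewriteCond %{REQUEST_FILENAME} !-f
#   RewriteRule ^(.*)$ /index.cgi/$1 [L,QSA]
#
# index.cgi then sees the original request path in PATH_INFO.
use strict;
use warnings;

my $path = $ENV{'PATH_INFO'} || '/';

print "Content-Type: text/plain\n\n";
print "You asked for: $path\n";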
It's quite possible using mod_rewrite as other people have said. But you probably don't want to do it in a CGI program. Far better to write a proper web application using something like Catalyst or Dancer (probably with Plack at the back end).
If you're not really committed to your existing web server, you could use something like this:
use HTTP::Daemon; # need LWP-5.32 or better
use HTTP::Status;
use HTTP::Response;
use URI::Heuristic;
my $server = HTTP::Daemon->new(LocalPort => 89);
my $this_url = $server->url;
etc.
I grabbed that snippet from an existing program that ran as its own web server. Not sure how many of the "use" commands are required after the first one, but hopefully that gives you some ideas.
Per request of the submitter, I'm submitting a more-complete version of the HTTPD script:
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Daemon;
my $PORT = 89;
my $server = HTTP::Daemon->new(LocalPort =>$PORT);
# Init
print "Starting server at $server->url\n";
print "You can also use http://localhost:$PORT if browsing from the same machine running this script.\n\n";
# Server
my $count=0;
while (my $client = $server->accept) {
CONNECTION:
while (my $request = $client->get_request) {
$count++;
print "Connection #$count:\n";
print $request->as_string;
print "\n";
$client->autoflush;
RESPONSE:
print $client "Relative URL used was " . $request->url->path;
last CONNECTION;
}
$client->close;
undef $client;
}
Instead of the simple line that prints "Relative URL used was", you'd most likely want to parse the requested URL and dispatch to whatever different functions you need this script to perform for each HTTP request.
You can't set this up in Perl itself. However, you should be able to configure your web server to redirect all requests to a single CGI script, usually passing the originally requested path along as a parameter. If you're running on Apache, look at mod_rewrite.
I can't vote up or comment on davorg's answer because of low reputation (I'm new here).
You can use the Mojolicious framework too. Mojolicious::Lite lets you write full apps in a single file (logic, templating, etc.), and I guess you're searching for something like:
http://search.cpan.org/perldoc?Mojolicious::Lite#Wildcard_Placeholders
More info at:
http://mojolicio.us/
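For illustration, a catch-all app along those lines might look like this minimal sketch (the placeholder name "whatever" is arbitrary):

#!/usr/bin/perl
use Mojolicious::Lite;

# One wildcard route catches every path; the default of '' lets it match '/' too.
any '/*whatever' => { whatever => '' } => sub {
    my $c = shift;
    $c->render(text => 'You requested: /' . $c->stash('whatever'));
};

app->start;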

How can I detect the file type of image at a URL?

How do I find an image's file type in Perl from a website URL?
For example:
my $image_name = "logo";
my $image_path = "http://stackoverflow.com/content/img/so/" . $image_name;
From this information, how can I find the file type? In this example it should display "png", since the full URL is http://stackoverflow.com/content/img/so/logo.png.
And if there are more files, as on the SO web site, it should show the file type of each of them.
If you're using LWP to fetch the image, you can look at the content-type header returned by the HTTP server.
Both WWW::Mechanize and LWP::UserAgent will give you an HTTP::Response object for any GET request. So you can do something like:
use strict;
use warnings;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
$mech->get( "http://stackoverflow.com/content/img/so/logo.png" );
my $type = $mech->response->headers->header( 'Content-Type' );
You can't easily tell. The URL doesn't necessarily reflect the type of the image.
To get the image type you have to make a request via HTTP (GET, or more efficiently, HEAD), and inspect the Content-type header in the HTTP response.
Well, https://stackoverflow.com/content/img/so/logo is a 404. If it were not, then you could use
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my ($content_type) = head "https://stackoverflow.com/content/img/so/logo.png";
print "$content_type\n" if defined $content_type;
__END__
As Kent Fredric points out, what the web server tells you about content type need not match the actual content sent by the web server. Keep in mind that File::MMagic can also be fooled.
#!/usr/bin/perl
use strict;
use warnings;
use File::MMagic;
use LWP::UserAgent;
my $mm = File::MMagic->new;
my $ua = LWP::UserAgent->new(
max_size => 1_000 * 1_024,
);
my $res = $ua->get('https://stackoverflow.com/content/img/so/logo.png');
if ( $res->code eq '200' ) {
print $mm->checktype_contents( $res->content );
}
else {
print $res->status_line, "\n";
}
__END__
You really can't make assumptions about content based on URL, or even content type headers.
They're only guides to what is being sent.
A handy trick to confuse things that use suffix matching to identify file-types is doing this:
http://example.com/someurl?q=foo#fakeheheh.png
And if you were to arbitrarily permit that image to be added to the page, it might in some cases be a doorway to an attack of some sort if the browser followed it. (For example, http://really_awful_bank.example.com/transfer?amt=1000000;from=123;to=123 )
Content-type based forgery is not so detrimental, but you can do nasty things if the person who controls the name works out how you identify things and sends a different content type for HEAD requests than for GET requests.
It could tell the HEAD request that it's an image, but then tell the GET request that it's application/javascript, and goodness knows where that will lead.
The only way to know for certain what it is is to download the file and do magic-based identification, or more (i.e. try to decode the image). Then all you have to worry about are images that are too large, and specially crafted images that could trip vulnerabilities on machines that are not yet patched.
Granted all of the above is extreme paranoia, but if you know the rare possibilities you can make sure they can't happen :)
From what I understand, you're not worried about the content type of an image you already know the name and extension for; you want to find the extension for an image you only know the base name of.
In order to do that you'd have to test all the image extensions you want individually and record which ones resolve and which ones don't. For example, both https://stackoverflow.com/content/img/so/logo.png and https://stackoverflow.com/content/img/so/logo.gif could exist. They don't in this exact situation, but on some arbitrary server you could have multiple images with the same base name and different extensions. Unfortunately there's no way to get a list of the available extensions of a file in a remote web directory by supplying just its base name, short of looping through the possibilities.
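A rough sketch of that probing approach, using HEAD requests against a hypothetical list of candidate extensions:

use strict;
use warnings;
use LWP::UserAgent;

my $base       = 'http://stackoverflow.com/content/img/so/logo';
my @extensions = qw(png gif jpg jpeg ico svg);   # candidate extensions to try

my $ua = LWP::UserAgent->new(timeout => 10);
for my $ext (@extensions) {
    my $url = "$base.$ext";
    my $res = $ua->head($url);                   # HEAD is enough, no body needed
    printf "%-60s %s\n", $url,
        $res->is_success
            ? 'exists (' . $res->content_type . ')'
            : $res->status_line;
}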