I'm trying to get the HTML source of a webpage using Perl's get function. I wrote the code five months ago and it was working fine, but yesterday I made a small edit and it has refused to work since, no matter what I tried.
Here is the code I tried.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my $link = 'www.google.com';
my $sou = get($link) or die "cannot retrieve code\n";
print $sou;
The code compiles fine, but it is not able to retrieve the source; instead it displays
cannot retrieve code
get() needs an absolute URL that includes the scheme, so change the link to:
my $link = 'http://www.google.com';
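For completeness, a minimal sketch of the corrected script (only the URL has changed):

#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;

# get() needs an absolute URL including the scheme.
my $link = 'http://www.google.com';
my $sou  = get($link) or die "cannot retrieve code\n";
print $sou;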
This might be a bit late,
I have been struggling with the same problem and I think I have figured out why it occurs. I usually scrape websites with Python, and I have found that it helps to include some extra header information in the GET requests. This makes the request look like it came from a person using a browser, so the website serves the page instead of responding with a 400 Bad Request status code.
So I applied the same thinking to my Perl script, which was similar to yours, and just added some extra header information. The result gave me the source code of the website with no struggle.
Here is the code:
#!/usr/bin/perl
use strict;
use warnings;

# This line loads LWP (which also brings in LWP::UserAgent) and requires at least version 5.64.
use LWP 5.64;

# This line creates the virtual browser.
my $browser = LWP::UserAgent->new;

# These lines define the header information that will be given to the website (e.g. Google)
# so the request looks like it comes from a regular browser instead of triggering a
# 400 Bad Request status code.
my @ns_headers = (
    'User-Agent'      => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept'          => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
    'Accept-Charset'  => 'iso-8859-1,*,utf-8',
    'Accept-Language' => 'en-US',
);

# This line defines the URL that the user agent will browse.
my $url = 'https://www.google.com/';

# This line requests the data from the URL above, sending the extra headers along.
my $response = $browser->get($url, @ns_headers);
die "cannot retrieve code\n" unless $response->is_success;

# Decode the response so the HTML source code is visible.
my $HTML = $response->decoded_content;
print $HTML;
I used LWP::UserAgent because it lets you add extra header information.
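As a side note, LWP::UserAgent can also attach headers to every request with its default_header method, so you don't have to pass them to each get() call; a small sketch reusing two of the header values above:

use strict;
use warnings;
use LWP::UserAgent;

my $browser = LWP::UserAgent->new;
# Set the headers once; they are sent with every subsequent request.
$browser->default_header('User-Agent'      => 'Mozilla/4.76 [en] (Win98; U)');
$browser->default_header('Accept-Language' => 'en-US');

my $response = $browser->get('https://www.google.com/');
die "cannot retrieve code\n" unless $response->is_success;
print $response->decoded_content;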
I hope this helped,
ME.
PS. Sorry if you already have the answer for this, just wanted to help.
I was using an open source tool called SimTT which takes the URL of a table tennis league and then calculates the probable results (e.g. the ranking of teams and players). Unfortunately the page has moved to a different website.
I downloaded the source and repaired the parsing of the new page, but currently I'm only able to download the page manually and then read it from a file.
Below you can find an excerpt of my code to retrieve the page. It prints "Success", but the response is empty. Unfortunately I'm not very familiar with Perl and web techniques, but in Wireshark I could see that one of the last things sent was a new session key. I'm not sure whether the problem is related to cookies, SSL or something like that.
It would be very nice if someone could help me to get access. I know that there are some people out there who would like to use the tool.
So here's the code:
use LWP::UserAgent ();
use HTTP::Cookies;
my $ua = LWP::UserAgent->new(keep_alive=>1);
$ua->agent('Mozilla/5.0');
$ua->cookie_jar({});
my $request = new HTTP::Request('GET', 'https://www.mytischtennis.de/clicktt/ByTTV/18-19/ligen/Bezirksoberliga/gruppe/323819/mannschaftsmeldungen/vr');
my $response = $ua->request($request);
if ($response->is_success) {
    print "Success: ", $response->decoded_content;
}
else {
    die $response->status_line;
}
Either there is some rudimentary anti-bot protection at the server, or the server is misconfigured or otherwise broken. It looks like it expects an Accept-Encoding header in the request, which LWP by default does not send. The value of this header does not really seem to matter, i.e. the server will send the content compressed with gzip if the client claims to support it, but it will send uncompressed data if the client offers only a compression method which is unknown to the server.
With this knowledge one can change the code like this:
my $request = HTTP::Request->new('GET',
    'https://www.mytischtennis.de/clicktt/ByTTV/18-19/ligen/Bezirksoberliga/gruppe/323819/mannschaftsmeldungen/vr',
    [ 'Accept-Encoding' => 'foobar' ]
);
With this simple change the code currently works for me. Note that it might stop working at any time if the server setup changes, i.e. other workarounds might then be needed.
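Put together with the rest of the script from the question, the full request might look like this sketch (nothing changed apart from the added header):

use strict;
use warnings;
use LWP::UserAgent ();
use HTTP::Request ();
use HTTP::Cookies;

my $ua = LWP::UserAgent->new(keep_alive => 1);
$ua->agent('Mozilla/5.0');
$ua->cookie_jar({});

# The Accept-Encoding header is the workaround; its value does not seem to matter.
my $request = HTTP::Request->new('GET',
    'https://www.mytischtennis.de/clicktt/ByTTV/18-19/ligen/Bezirksoberliga/gruppe/323819/mannschaftsmeldungen/vr',
    [ 'Accept-Encoding' => 'foobar' ]
);

my $response = $ua->request($request);
if ($response->is_success) {
    print "Success: ", $response->decoded_content;
}
else {
    die $response->status_line;
}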
use strict;
use LWP::UserAgent;
my $UserAgent = LWP::UserAgent->new;
my $response = $UserAgent->get("https://scholar.google.co.in/scholar_lookup?author=N.+R.+Alpert&author=S.+A.+Mohiddin&author=D.+Tripodi&author=J.+Jacobson-Hatzell&author=K.+Vaughn-Whitley&author=C.+Brosseau+&publication_year=2005&title=Molecular+and+phenotypic+effects+of+heterozygous,+homozygous,+and+compound+heterozygote+myosin+heavy-chain+mutations&journal=Am.+J.+Physiol.+Heart+Circ.+Physiol.&volume=288&pages=H1097-H1102");
if ($response->is_success)
{
    $response->content =~ /<title>(.*?) - Google Scholar<\/title>/;
    print $1;
}
else
{
    die $response->status_line;
}
I am getting the below error while running this script.
403 Forbidden at D:\Getelement.pl line 52.
I have pasted this website address into the address bar and it redirects exactly to that site, but it does not work when run from the script.
Can you please help me with this issue?
Google's Terms of Service disallow automated searches. They detect that you're sending this from a script because the headers your script sends and the standard headers a browser sends are very different, and you can analyze them if you want.
In the old days they had a SOAP API and you could use modules like WWW::Search::Google, but that's not the case anymore because that API was deprecated.
Alternatives were already discussed in the following StackOverflow question:
What are the alternatives now that the Google web search API has been deprecated?
Google has blacklisted LWP::UserAgent. They either blacklisted the user agent string or some other part of the request (headers and so on).
I suggest you use Mojo::UserAgent. Its requests look more like a browser's by default, and you need to write only a single line of code.
use Mojo::UserAgent;
use strict;
use warnings;
print Mojo::UserAgent->new->get('https://scholar.google.co.in/scholar_lookup?author=N.+R.+Alpert&author=S.+A.+Mohiddin&author=D.+Tripodi&author=J.+Jacobson-Hatzell&author=K.+Vaughn-Whitley&author=C.+Brosseau+&publication_year=2005&title=Molecular+and+phenotypic+effects+of+heterozygous,+homozygous,+and+compound+heterozygote+myosin+heavy-chain+mutations&journal=Am.+J.+Physiol.+Heart+Circ.+Physiol.&volume=288&pages=H1097-H1102')->res->dom->at('title')->text;
# Prints Molecular and phenotypic effects of heterozygous, homozygous, and
# compound heterozygote myosin heavy-chain mutations - Google Scholar
Terms
The code does not accept any terms of service, and no additional lines have been added to bypass security checks. It's absolutely fine.
You can fetch your content if you add a User Agent string to identify yourself to the web server:
...
my $UserAgent = LWP::UserAgent->new;
$UserAgent->agent('Mozilla/5.0'); #...add this...
...
print $1;
...
This prints: "Molecular and phenotypic effects of heterozygous, homozygous, and compound heterozygote myosin heavy-chain mutations"
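Spelled out in full, a sketch of the script with that one added line:

use strict;
use warnings;
use LWP::UserAgent;

my $UserAgent = LWP::UserAgent->new;
# Identify ourselves to the web server with a browser-like User-Agent string.
$UserAgent->agent('Mozilla/5.0');

my $response = $UserAgent->get("https://scholar.google.co.in/scholar_lookup?author=N.+R.+Alpert&author=S.+A.+Mohiddin&author=D.+Tripodi&author=J.+Jacobson-Hatzell&author=K.+Vaughn-Whitley&author=C.+Brosseau+&publication_year=2005&title=Molecular+and+phenotypic+effects+of+heterozygous,+homozygous,+and+compound+heterozygote+myosin+heavy-chain+mutations&journal=Am.+J.+Physiol.+Heart+Circ.+Physiol.&volume=288&pages=H1097-H1102");

if ($response->is_success)
{
    if ($response->content =~ /<title>(.*?) - Google Scholar<\/title>/) {
        print $1;
    }
}
else
{
    die $response->status_line;
}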
I'm new to Perl. In the past few days, I've made some simple scripts that save websites' source code to my computer via "get". They do what they're supposed to, but they will not get the content of one website, which is a forum. Non-forum websites work just fine.
Any idea what's going on? Here's the problem chunk:
my $url = 'http://www.computerforum.com/';
my $content = get $url || die "Unable to get content";
http://p3rl.org/LWP::Simple#get:
The get() function will fetch the document identified by the given URL and return it. It returns undef if it fails. […]
You will not be able to examine the response code or response headers (like 'Content-Type') when you are accessing the web using this function. If you need that information you should use the full OO interface (see LWP::UserAgent).
You really need better error reporting; switch to the LWP::UserAgent library. I suspect the forum software blocks the LWP user agent.
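A minimal sketch of that switch, keeping the forum URL from the question and reporting the real reason on failure:

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
# If the forum blocks the default libwww-perl agent string, a browser-like one may help.
$ua->agent('Mozilla/5.0');

my $response = $ua->get('http://www.computerforum.com/');
if ($response->is_success) {
    print $response->decoded_content;
}
else {
    # Unlike LWP::Simple::get, this tells you why the request failed.
    die "Unable to get content: ", $response->status_line, "\n";
}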
Are both of these versions OK, or is one of them preferable?
#!/usr/bin/env perl
use strict;
use warnings;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new();
my $content;
# 1
$mech->get( 'http://www.kernel.org' );
$content = $mech->content;
print $content;
# 2
my $res = $mech->get( 'http://www.kernel.org' );
$content = $res->content;
print $content;
They are both acceptable. The second one seems cleaner to me because it returns a proper HTTP::Response object which you can query and call methods on, and also means that if you make another Mechanize request, you'll still have access to the old HTTP response. With your first approach, each time you make a request, the content method will change to something new, which sounds error-prone.
Btw, for either method, you should check $response->is_success or $mech->success before accessing the content, as the request may have failed.
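For example, with the second style that check might look like this sketch (autocheck is turned off so Mechanize does not die on its own before the check runs):

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 0 );
my $res  = $mech->get( 'http://www.kernel.org' );

if ($res->is_success) {
    print $res->content;
}
else {
    warn "Request failed: ", $res->status_line, "\n";
}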
The content() method is sometimes more convenient:
$mech->content(...)
Returns the content that the mech uses internally for the last page fetched. Ordinarily this is the same as $mech->response()->content(), but this may differ for HTML documents if "update_html" is overloaded, and/or extra named arguments are passed to content():
$mech->content( format => 'text' )
Returns a text-only version of the page, with all HTML markup stripped. This feature requires HTML::TreeBuilder to be installed, or a fatal error will be thrown.
$mech->content( base_href => [$base_href|undef] )
Returns the HTML document, modified to contain a <base href="$base_href"> mark-up in the header. $base_href is $mech->base() if not specified. This is handy to pass the HTML to e.g. HTML::Display.
$mech->content is specifically there so you can bypass having to get the result response. The simpler it is, the better.
How do I find the image file type in Perl from a website URL?
For example,
my $image_name = "logo";
my $image_path = "http://stackoverflow.com/content/img/so/" . $image_name;
From this information, how do I find the file type? In this example it should display
"png"
since the full URL is http://stackoverflow.com/content/img/so/logo.png.
Suppose the site has more files, like the SO web site does; then it should show all the file types.
If you're using LWP to fetch the image, you can look at the content-type header returned by the HTTP server.
Both WWW::Mechanize and LWP::UserAgent will give you an HTTP::Response object for any GET request. So you can do something like:
use strict;
use warnings;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
$mech->get( "http://stackoverflow.com/content/img/so/logo.png" );
my $type = $mech->response->headers->header( 'Content-Type' );
You can't easily tell. The URL doesn't necessarily reflect the type of the image.
To get the image type you have to make a request via HTTP (GET, or more efficiently, HEAD), and inspect the Content-type header in the HTTP response.
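With LWP::UserAgent, for instance, that check could be sketched like this:

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
# HEAD fetches only the headers, not the image data itself.
my $res = $ua->head('http://stackoverflow.com/content/img/so/logo.png');

if ($res->is_success) {
    print $res->content_type, "\n";   # e.g. "image/png"
}
else {
    die $res->status_line;
}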
Well, https://stackoverflow.com/content/img/so/logo is a 404. If it were not, then you could use
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my ($content_type) = head "https://stackoverflow.com/content/img/so/logo.png";
print "$content_type\n" if defined $content_type;
__END__
As Kent Fredric points out, what the web server tells you about content type need not match the actual content sent by the web server. Keep in mind that File::MMagic can also be fooled.
#!/usr/bin/perl
use strict;
use warnings;
use File::MMagic;
use LWP::UserAgent;
my $mm = File::MMagic->new;
my $ua = LWP::UserAgent->new(
max_size => 1_000 * 1_024,
);
my $res = $ua->get('https://stackoverflow.com/content/img/so/logo.png');
if ( $res->code eq '200' ) {
    print $mm->checktype_contents( $res->content );
}
else {
    print $res->status_line, "\n";
}
__END__
You really can't make assumptions about content based on URL, or even content type headers.
They're only guides to what is being sent.
A handy trick to confuse things that use suffix matching to identify file-types is doing this:
http://example.com/someurl?q=foo#fakeheheh.png
And if you were to arbitrarily permit that image to be added to the page, it might in some cases be a doorway to an attack of some sort if the browser followed it. (For example, http://really_awful_bank.example.com/transfer?amt=1000000;from=123;to=123)
Content-type based forgery is not so detrimental, but you can do nasty things if the person who controls the name works out how you identify things and sends different content types for HEAD requests than for GET requests.
It could tell the HEAD request that it's an image, but then tell the GET request that it's application/javascript, and goodness knows where that will lead.
The only way to know for certain what it is is to download the file and then do magic-based identification, or better still, actually try to decode the image. Then all you have to worry about is images that are too large, and specially crafted images that could trip vulnerabilities on computers that are not yet patched against them.
Granted all of the above is extreme paranoia, but if you know the rare possibilities you can make sure they can't happen :)
From what I understand, you're not worried about the content type of an image you already know the name and extension for; you want to find the extension for an image whose base name you know.
In order to do that you'd have to test each image extension you are interested in individually and record which ones resolve and which don't. For example, both https://stackoverflow.com/content/img/so/logo.png and https://stackoverflow.com/content/img/so/logo.gif could exist. They don't in this exact situation, but on some arbitrary server you could have multiple images with the same base name and different extensions. Unfortunately there's no way to get a list of available extensions for a file in a remote web directory by supplying only its base name without looping through the possibilities, as in the sketch below.
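Under that assumption, a sketch of such a probe loop (the extension list is just an illustrative guess):

use strict;
use warnings;
use LWP::UserAgent;

my $ua         = LWP::UserAgent->new;
my $base_url   = 'https://stackoverflow.com/content/img/so/logo';
my @extensions = qw(png gif jpg jpeg ico svg);   # candidate extensions to try

for my $ext (@extensions) {
    # A HEAD request is enough to learn whether the file exists.
    my $res = $ua->head("$base_url.$ext");
    if ($res->is_success) {
        print "$base_url.$ext exists (", $res->content_type, ")\n";
    }
}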