Please tell me which module I should use to scrape a website that is developed entirely in ASP and whose contents are not in proper HTML syntax.
It does not matter which language was used to develop the website. All that you (client) get from the website is the produced HTML (or broken HTML in this case).
You can use the "LWP" library and the "get" function to read the website content into a variable... and then analyze it using regular expressions.
Like this:
use strict;
use warnings;
use LWP::Simple;

my $url = 'http://...';
my $content = get $url;
die "Could not fetch $url\n" unless defined $content;

if ($content =~ m/.../) {
    ...
}
Or you could use WWW::Mechanize. It builds upon LWP (which LWP::Simple is a very simple subset of) and provides lots of handy 'browser-like' behavior. For example, the typical session management of ASP-generated websites with login cookies and stuff gets handled by Mechanize automatically.
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new;
$mech->get( 'http://www.example.org/login.asp' );
$mech->submit_form(
    form_number => 3,
    fields      => {
        username => 'test',
        password => 'secret',
    },
);
While this is first of all good for testing, the object still has LWP's inherited methods, so you can get at the plain request and response as well, while still having the power of the built-in parser for accessing forms and links, as the sketch below shows.
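For instance, a small sketch continuing the Mechanize session above (the loop over links is purely illustrative):

# The built-in parser exposes the links (and forms) of the fetched page,
# while the underlying HTTP::Response is still available if you need it.
for my $link ( $mech->find_all_links() ) {
    print $link->url, "\n";
}
my $raw_html = $mech->response->decoded_content;    # the plain response body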
Also consider using a proper HTML parser, even if the website's output is not very fancy. There are several of these around that can handle even sloppy markup, and it will be a lot easier than building up a bunch of regexes, which get hard to maintain once you have to go back because something on the page has changed.
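For example, a minimal sketch using HTML::TreeBuilder (just one of several CPAN parsers that cope with broken markup; the URL and the table-cell lookup are only placeholders):

use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder;

my $html = get('http://www.example.org/page.asp')
    or die "Could not fetch the page\n";

# HTML::TreeBuilder is tolerant of broken markup and still builds a tree.
my $tree = HTML::TreeBuilder->new_from_content($html);

# Placeholder example: print the text of every table cell on the page.
for my $cell ( $tree->look_down( _tag => 'td' ) ) {
    print $cell->as_text, "\n";
}

$tree->delete;    # free the tree when done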
Here's a list of related questions that have info on this subject:
Parse html using Perl
Perl parse links from HTML Table
What are some good ways to parse HTML and CSS in Perl?
I'm trying to get the HTML source of a webpage using the Perl "get" function. I wrote the code five months back and it was working fine, but yesterday I made a small edit and it has failed to work ever since, no matter how much I try.
Here is the code I tried.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my $link = 'www.google.com';
my $sou = get($link) or die "cannot retrieve code\n";
print $sou;
The code compiles fine, but it's not able to retrieve the source; instead it displays
cannot retrieve code
The URL needs to include the scheme. With just 'www.google.com', LWP::Simple's get() can't build a valid absolute request and returns undef, so the die fires. Use:
my $link = 'http://www.google.com';
This might be a bit late, but I have been struggling with the same problem and I think I have figured out why this occurs. I usually web-scrape websites with Python, and I have found that it helps to include some extra header information with the GET request. This makes the request look like it came from a regular browser, so the site serves the page instead of responding with a 400 Bad Request status code.
So I applied this thinking to my Perl script, which was similar to yours, and just added some extra header information. The result gave me the source code for the website with no struggle.
Here is the code:
#!/usr/bin/perl
use strict;
use warnings;

# Require a reasonably recent LWP (this also loads LWP::UserAgent).
use LWP 5.64;

# This defines the virtual browser.
my $browser = LWP::UserAgent->new;

# Header information that will be sent to the website (e.g. Google),
# in case the website would otherwise answer with a 400 Bad Request.
my @ns_headers = (
    'User-Agent'      => 'Mozilla/4.76 [en] (Win98; U)',
    'Accept'          => 'image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, image/png, */*',
    'Accept-Charset'  => 'iso-8859-1,*,utf-8',
    'Accept-Language' => 'en-US',
);

# The URL the user agent will browse.
my $url = 'https://www.google.com/';

# Request the page, passing the extra headers along.
my $response = $browser->get($url, @ns_headers);
die "cannot retrieve code: ", $response->status_line, "\n"
    unless $response->is_success;

# Decode the response so the HTML source code is visible.
my $HTML = $response->decoded_content;
print $HTML;
I used LWP::UserAgent here because it lets you add extra header information.
I hope this helped,
ME.
PS. Sorry if you already have the answer for this, just wanted to help.
I'm doing some development work that uses an embedded Linux for the OS and Boa for the web server. I have a web page that posts to a CGI script, handles the form data, and replies. My development environment was Ubuntu and everything worked fine, but when I ported my code over to the embedded Linux, the CGI module did not instantiate (or at least does not seem to instantiate). Here is a stripped down section of my code. The print statement complains about an uninitialized variable.
use strict;
use warnings;
use CGI;

my $cgiObj = CGI->new();
print $cgiObj->param('wlanPort');
Again, this works fine in my development environment, but fails in the embedded environment. The CGI.pm is installed and there are no errors generated on the CGI->new() command. I have also verified that the form data is being sent, but obviously can't guarantee that it is being received by the Perl script.
I have a feeling that it is a Boa configuration issue and that's what I'll be looking into next. I'm fairly new to Perl, so I'm not sure what else to do. Any ideas?
EDIT: Definitely not a Boa config issue. Still looking into it.
UPDATE:
I've simplified my code to the following:
#!/usr/bin/perl
use CGI qw(:standard);
$data = param('wlanPort') || '<i>(No Input)</i>';
print header;
print <<END;
<title>Echoing user input</title>
<p>wlanPort: $data</p>
END
As expected, it prints (No Input)
I should also point out that the form is enctype="multipart/form-data" because I have to have a file upload capability and I am using the "POST" method.
I used the HttpFox plugin to inspect the post data and checked on the wlanPort value:
-----------------------------132407047814270795471206851178 Content-Disposition: form-data;
name="wlanPort"
eth1
So it is almost definitely being sent...
UPDATE 2: I installed the same version of Perl and Boa being used in the embedded system on my Ubuntu laptop. It works on the laptop but not on the device, which is the same result as before. I've told my employer that I've exhausted all possibilities other than the way Boa and (Micro)Perl are built on the device vs. in Ubuntu.
CGI is a beautifully simple protocol and while I cannot answer your question directly, I can suggest some techniques to help isolate the problem.
If your form is being submitted via POST, then the contents of that form will appear in the body of the HTTP request your script is getting (URL-encoded, or as MIME multipart data in your case). Without using the CGI module at all, you should be able to read the request from STDIN:
my $request = "";
while (<STDIN>) {
    $request .= $_;
}

if (open my $out, ">>/tmp/myapp.log") {
    print $out $request;
    close $out;
}
You can then examine /tmp/myapp.log to see if you are getting all the information from the request that you think you are.
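One caveat (an assumption on my part about your server): some servers don't signal end-of-file on STDIN, so the loop above can hang. Reading exactly CONTENT_LENGTH bytes avoids that:

# Read exactly as many bytes as the request body claims to contain.
my $len = $ENV{'CONTENT_LENGTH'} || 0;
my $request = '';
read( STDIN, $request, $len ) if $len;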
For completeness, if your form submits via GET, then the arguments will be in the environment variable QUERY_STRING, which you can look at in Perl with $ENV{'QUERY_STRING'}.
There should be no difference in the way CGI's object interface and functional interface parse the request. I am not familiar with Boa, but I can't imagine it breaking the basic CGI protocol.
A common problem is that you named the form parameter one thing and are looking for a different parameter name in the CGI script. That's always a laugh.
Good luck and hope this helps some.
I know this is a very old post and the OP may not be interested in any new info related to it, but the general question of how to debug CGI scripts is still relevant. I had similar issues with dev vs. prod environments, so to help those who stumble upon this thread, I am posting my experience in dealing with this situation. My simple answer is to use the Log::Log4perl and Data::Dumper modules to demystify it (assuming there is a way to access logs on your prod environment). This way, with negligible overhead, you can turn on tracing when problems creep in (even if the code worked before but started failing due to changes over time). Log every relevant bit of information at the appropriate log level (trace, debug, info, warn, error, fatal) and configure which level is right for operations. Without these mechanisms, it will be very difficult to get insight into what is happening in production. Hope this helps.
use strict;
use warnings;
use CGI;
use Log::Log4perl qw(:easy);
use Data::Dumper;

# Initialize Log4perl; without this the logger emits nothing useful.
# With easy_init, messages go to STDERR, which ends up in the web
# server's error log.
Log::Log4perl->easy_init($TRACE);

my $cgiObj = CGI->new();
my $log    = Log::Log4perl::get_logger();
$log->trace( "CGI Data: " . Dumper($cgiObj) );
print $cgiObj->param('wlanPort');
I am wondering if it is possible to use a single Perl CGI script to serve all HTTP requests to my site, no matter what relative URL the visitors give.
Please share your thoughts. Many thanks.
If you call your script index.cgi and combine that with a mod_rewrite rule to redirect all requests to /index.cgi/foo then foo will be available as $ENV{'PATH_INFO'}, thereby letting you know what the original request path was.
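Here's a minimal sketch of that setup (the rewrite rule shown in the comment is an assumption about an Apache configuration; adjust it for your server):

#!/usr/bin/perl
# index.cgi -- every request is funnelled here by a rewrite rule such as:
#   RewriteEngine On
#   RewriteCond   %{REQUEST_URI} !^/index\.cgi
#   RewriteRule   ^(.*)$ /index.cgi/$1 [L,QSA]
use strict;
use warnings;
use CGI;

my $cgi  = CGI->new;
my $path = $ENV{'PATH_INFO'} || '/';    # the path the visitor originally asked for

print $cgi->header('text/plain');
print "You requested: $path\n";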
It's quite possible using mod_rewrite as other people have said. But you probably don't want to do it in a CGI program. Far better to write a proper web application using something like Catalyst or Dancer (probably with Plack at the back end).
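As an illustration only, here is roughly what a single-file PSGI application (the interface Plack provides) looks like; plackup can serve it directly:

# app.psgi -- run with: plackup app.psgi
use strict;
use warnings;

my $app = sub {
    my $env  = shift;                       # the PSGI environment hash
    my $path = $env->{PATH_INFO} || '/';    # whatever path was requested
    return [
        200,
        [ 'Content-Type' => 'text/plain' ],
        [ "You requested $path\n" ],
    ];
};

$app;    # a .psgi file must return the application code reference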
If you're not really committed to your existing web server, you could use something like this:
use HTTP::Daemon; # need LWP-5.32 or better
use HTTP::Status;
use HTTP::Response;
use URI::Heuristic;
my $server = HTTP::Daemon->new(LocalPort => 89);
my $this_url = $server->url;
etc.
I grabbed that snippet from an existing program that ran as its own web server. Not sure how many of the "use" commands are required after the first one, but hopefully that gives you some ideas.
Per request of the submitter, I'm submitting a more-complete version of the HTTPD script:
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Daemon;
my $PORT = 89;
my $server = HTTP::Daemon->new(LocalPort =>$PORT);
# Init
print "Starting server at ", $server->url, "\n";
print "You can also use http://localhost:$PORT if browsing from the same machine running this script.\n\n";

# Server
my $count = 0;
while (my $client = $server->accept) {
    CONNECTION:
    while (my $request = $client->get_request) {
        $count++;
        print "Connection #$count:\n";
        print $request->as_string;
        print "\n";

        $client->autoflush;
        RESPONSE:
        print $client "Relative URL used was " . $request->url->path;
        last CONNECTION;
    }
    $client->close;
    undef $client;
}
Instead of the simple line that prints "Relative URL used was", you'd most likely want to parse the requested URL and use it to decide which of the different functions this script should perform for each HTTP request.
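For instance, a hypothetical sketch of such a dispatch, reusing the $request and $client variables from the script above (the paths are made up):

my $path = $request->url->path;
if ( $path eq '/hello' ) {
    print $client "Hello, world\n";
}
elsif ( $path =~ m{^/echo/(.+)$} ) {
    print $client "You said: $1\n";
}
else {
    print $client "Unknown path: $path\n";
}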
You can't set up Perl itself to do this. However, you should be able to configure your web server to redirect all requests to a single CGI script, usually passing the original request path along as a parameter. If you're running on Apache, look at mod_rewrite.
I can't vote up or comment on davorg's answer because of low reputation (I'm new here).
You can use the Mojolicious framework too. Mojolicious::Lite allows you to write full apps in a single file (logic, templating, etc.), and I guess you're searching for something like:
http://search.cpan.org/perldoc?Mojolicious::Lite#Wildcard_Placeholders
More info at:
http://mojolicio.us/
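A rough sketch of the wildcard-placeholder approach (assuming a reasonably recent Mojolicious; run it with morbo):

#!/usr/bin/perl
use Mojolicious::Lite;

# A wildcard placeholder (*path) matches everything, including slashes;
# the default value lets it also match the bare "/" request.
any '/*path' => { path => '' } => sub {
    my $c = shift;
    $c->render( text => 'You requested: /' . $c->stash('path') );
};

app->start;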
Are both of these versions OK, or is one of them preferable?
#!/usr/bin/env perl
use strict;
use warnings;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new();
my $content;
# 1
$mech->get( 'http://www.kernel.org' );
$content = $mech->content;
print $content;
# 2
my $res = $mech->get( 'http://www.kernel.org' );
$content = $res->content;
print $content;
They are both acceptable. The second one seems cleaner to me because it returns a proper HTTP::Response object which you can query and call methods on, and also means that if you make another Mechanize request, you'll still have access to the old HTTP response. With your first approach, each time you make a request, the content method will change to something new, which sounds error-prone.
Btw, for either method, you should check $response->is_success or $mech->success before accessing the content, as the request may have failed.
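A small sketch of that check, using the second style and the same URL as in the question:

my $res = $mech->get( 'http://www.kernel.org' );
if ( $res->is_success ) {
    print $res->content;
}
else {
    warn "Request failed: ", $res->status_line, "\n";
}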
The content() method is sometimes more convenient:
$mech->content(...)
Returns the content that the mech uses internally for the last page fetched. Ordinarily this is the same as $mech->response()->content(), but this may differ for HTML documents if "update_html" is overloaded, and/or extra named arguments are passed to content():
$mech->content( format => 'text' )
Returns a text-only version of the page, with all HTML markup stripped. This feature requires HTML::TreeBuilder to be installed, or a fatal error will be thrown.
$mech->content( base_href => [$base_href|undef] )
Returns the HTML document, modified to contain a <base href="$base_href"> mark-up in the header. $base_href is $mech->base() if not specified. This is handy for passing the HTML to, e.g., HTML::Display.
$mech->content is specifically there so you can bypass having to get the result response. The simpler it is, the better.
How do I find the image file type in Perl from a website URL?
For example,
$image_name = "logo";
$image_path = "http://stackoverflow.com/content/img/so/".$image_name;
From this information, how do I find the file type? In this example it should display "png", since the actual file is http://stackoverflow.com/content/img/so/logo.png. And if there are more files with that base name, as on the SO web site, it should show all the file types.
If you're using LWP to fetch the image, you can look at the content-type header returned by the HTTP server.
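For example, a minimal sketch with LWP::UserAgent (the URL is the one from the question):

use strict;
use warnings;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new;
my $res = $ua->get('https://stackoverflow.com/content/img/so/logo.png');
print $res->header('Content-Type'), "\n" if $res->is_success;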
Both WWW::Mechanize and LWP::UserAgent will give you an HTTP::Response object for any GET request. So you can do something like:
use strict;
use warnings;
use WWW::Mechanize;
my $mech = WWW::Mechanize->new;
$mech->get( "http://stackoverflow.com/content/img/so/logo.png" );
my $type = $mech->response->headers->header( 'Content-Type' );
You can't easily tell. The URL doesn't necessarily reflect the type of the image.
To get the image type you have to make a request via HTTP (GET, or more efficiently, HEAD), and inspect the Content-type header in the HTTP response.
Well, https://stackoverflow.com/content/img/so/logo is a 404. If it were not, then you could use
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
my ($content_type) = head "https://stackoverflow.com/content/img/so/logo.png";
print "$content_type\n" if defined $content_type;
__END__
As Kent Fredric points out, what the web server tells you about content type need not match the actual content sent by the web server. Keep in mind that File::MMagic can also be fooled.
#!/usr/bin/perl
use strict;
use warnings;

use File::MMagic;
use LWP::UserAgent;

my $mm = File::MMagic->new;
my $ua = LWP::UserAgent->new(
    max_size => 1_000 * 1_024,
);

my $res = $ua->get('https://stackoverflow.com/content/img/so/logo.png');

if ( $res->code eq '200' ) {
    print $mm->checktype_contents( $res->content );
}
else {
    print $res->status_line, "\n";
}

__END__
You really can't make assumptions about content based on URL, or even content type headers.
They're only guides to what is being sent.
A handy trick to confuse things that use suffix matching to identify file-types is doing this:
http://example.com/someurl?q=foo#fakeheheh.png
And if you were to arbitrarily permit that image to be added to the page, it might in some cases be a doorway to an attack of some sorts if the browser followed it. ( For example, http://really_awful_bank.example.com/transfer?amt=1000000;from=123;to=123 )
Content-Type-based forgery is not so detrimental, but you can do nasty things if the person who controls the resource works out how you identify things and sends a different content type for HEAD requests than for GET requests.
It could tell the HEAD request that it's an image, but then tell the GET request that it's application/javascript, and goodness knows where that will lead.
The only way to know for certain what it is is to download the file and then do magic-based identification, or more (i.e., try to decode the image). Then all you have to worry about is images that are too large, and specially crafted images that could trip vulnerabilities in computers that are not yet patched against them.
Granted all of the above is extreme paranoia, but if you know the rare possibilities you can make sure they can't happen :)
From what I understand, you're not worried about the content type of an image you already know the name and extension for; you want to find the extension for an image you only know the base name of.
In order to do that, you'd have to test all the image extensions you care about individually and record which ones resolve and which ones don't. For example, both https://stackoverflow.com/content/img/so/logo.png and https://stackoverflow.com/content/img/so/logo.gif could exist. They don't in this exact situation, but on some arbitrary server you could have multiple images with the same base name and different extensions. Unfortunately there's no way to get a list of available extensions for a file in a remote web directory by supplying just its base name; you have to loop through the possibilities, as in the sketch below.
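A small sketch of that probing loop (the list of extensions is just a guess at what you might care about):

use strict;
use warnings;
use LWP::Simple;

my $base       = 'https://stackoverflow.com/content/img/so/logo';
my @extensions = qw(png gif jpg jpeg svg);

# head() returns the content type (among other things) if the URL resolves,
# and an empty list if it does not.
for my $ext (@extensions) {
    my ($content_type) = head("$base.$ext");
    print "$base.$ext exists ($content_type)\n" if defined $content_type;
}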