How can I scrape images off of a webpage using Perl - perl

I would like to scrape all images off of a webpage and am running into a problem I don't understand.
For instance, if I enter https://www.google.com/search?as_st=y&tbm=isch&hl=en&as_q=%22escape%20room%22+movie+poster&safe=images into my browser and then use the browser's "View Source" option, I get a massive amount of text/code. Using "find" I get more than 400 instances of
https://
So the simple code I wrote (below) gets the content and writes the result to a file. But a grep search for https:// returns only 7 instances. So obviously I am doing something incorrectly; perhaps the page is dynamic and I can't access that part?
Is there a way I can get the same source, via Perl, that I get via View Source?
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/8.0");    # present a browser-like User-Agent

my $sstring = "https://www.google.com/search?as_st=y&tbm=isch&hl=en&as_q=%22escape%20room%22+movie+poster&safe=images";
my $req = HTTP::Request->new(GET => $sstring);
$req->header('Accept' => 'text/html');

my $res = $ua->request($req);
open(my $fh, '>', 'report.txt') or die "Cannot open report.txt: $!";
print $fh $res->decoded_content;
close $fh;
Here's the example I got from WWW::Mechanize::Chrome:
use WWW::Mechanize::Chrome;

my $mech = WWW::Mechanize::Chrome->new();
my $sstring = "https://www.google.com/search?as_st=y&tbm=isch&hl=en&as_q=\"" . $arrayp . "\"" . "+movie+poster&safe=images";
$mech->get($sstring);
sleep 5;    # give the page's JavaScript time to run
print $_->get_attribute('href'), "\n\t-> ", $_->get_attribute('innerHTML'), "\n"
    for $mech->selector('a.download');

The Google search uses JavaScript to alter the page content after it loads. LWP::UserAgent does not support JavaScript, so what you get is only the initial document. (Hint: an easy way to see in the browser what LWP::UserAgent "sees" is to use a browser add-on that disables JavaScript.)
You will need to use what is called a "headless browser", for example WWW::Mechanize::Chrome.
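A minimal sketch of that approach, assuming Chrome is installed and using the same methods as your own snippet (the img selector is a placeholder; inspect the rendered page to find the right one, since Google's markup changes often):

use strict;
use warnings;
use WWW::Mechanize::Chrome;

my $mech = WWW::Mechanize::Chrome->new();
$mech->get('https://www.google.com/search?tbm=isch&q=%22escape+room%22+movie+poster');
sleep 5;    # crude wait for the page's JavaScript to finish

# 'img' is a guess at the selector for the thumbnails you want
for my $node ($mech->selector('img')) {
    my $src = $node->get_attribute('src');
    print "$src\n" if defined $src;
}

Once the scripts have run, $mech->content also returns the rendered HTML, which is what the plain LWP::UserAgent request was missing.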

Related

Mojolicious and directory traversal

I am new to Mojolicious and am trying to build a tiny web service using this framework.
I wrote the code below, which renders a file remotely:
use Mojolicious::Lite;
use strict;
use warnings;

app->static->paths->[0] = 'C:\results';

get '/result' => sub {
    my $self    = shift;
    my $headers = $self->res->headers;
    $headers->content_type('text/zip;charset=UTF-8');
    $self->render_static('result.zip');
};

app->start;
but it seems that when I try to fetch the file using the following URL:
http://mydomain:3000/result/./../result
I get the file.
Is there any option in Mojolicious to prevent such directory traversal?
That is, in the above case I want only
http://mydomain:3000/result
to serve the page; if someone enters this URL:
http://mydomain:3000/result/./../result
the page should not be served.
Is it possible to do this?
/$result^/ is a regular expression, and if you have not defined the scalar variable $result (which it does not appear you have), it resolves to /^/, which matches not just
http://mydomain:3000/result/./../result but also
http://mydomain:3000/john/jacob/jingleheimer/schmidt.
use strict and use warnings, even on tiny webservices.
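A quick standalone demonstration of why that pattern matches everything, and why use strict matters here:

use strict;
use warnings;

my $result;    # declared but never assigned, as in the question
{
    no warnings 'uninitialized';
    # /$result^/ interpolates the empty string, leaving /^/,
    # which matches any string at all:
    print "matches\n" if 'john/jacob/jingleheimer/schmidt' =~ /$result^/;
}

Without the my declaration, use strict turns the typo into a compile-time error instead of a route that matches every path.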

Processing external page with perl CGI or act as a reverse proxy

There is a page residing on a local server running Apache. I would like to submit the form via a GET request with a single name/value pair, like:
id=item1234
This GET request has to be processed by another server, which I don't have control over, and which then returns a page that I would like to transform with a CGI script. In other words:
User submits form
MY apache proxies to external resource
EXTERNAL resource throws back a page
MY apache transforms it with a CGI (maybe another way?)
User gets a modified page
Again, this is more of an architectural question, so I'd be grateful for any hints; even a pointer to a relevant guide would help, as I wasn't able to structure my Google query well enough to locate anything related.
Thanks.
Pass the id "17929632" to this CGI code ("proxy.pl?id=17929632"), and you should see this exact page in your browser.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use CGI::Pretty qw(:standard -any -no_xhtml -oldstyle_urls);

print header;
print "<html>\n";
print " <head><title>Proxy Demo</title></head>\n";
print " <body bgcolor=\"white\">\n";

my $id = param('id') || die "No CGI param 'id'\n";

my $ua = LWP::UserAgent->new;
$ua->agent("MyApp/0.1 ");

# Create a request
my $req = HTTP::Request->new(GET => "http://stackoverflow.com/questions/$id");

# Pass request to the user agent and get a response back
my $response = $ua->request($req);

# Check the outcome of the response
if ($response->is_success) {
    my $content = $response->content;
    # Modify the original content here!
    print $content;
}
else {
    print $response->status_line;
}

print "</body></html>\n";
Vague question, vague answer: write your CGI program to include an HTTP user agent, e.g. LWP.

500 Internal Server Error in perl-cgi program

I am getting the error "Internal Server Error. The server encountered an internal error or misconfiguration and was unable to complete your request."
I am submitting a form in HTML and trying to get its values.
HTML Code (index.cgi)
#!c:/perl/bin/perl.exe
print "Content-type: text/html; charset=iso-8859-1\n\n";
print "<html>";
print "<body>";
print "<form name = 'login' method = 'get' action = '/cgi-bin/login.pl'> <input type = 'text' name = 'uid'><br /><input type = 'text' name = 'pass'><br /><input type = 'submit'>";
print "</body>";
print "</html>";
Perl Code to fetch data (login.pl)
#!c:/perl/bin/perl.exe
use CGI::Carp qw(fatalsToBrowser);

my(%frmfields);
getdata(\%frmfields);

sub getdata {
    my ($buffer) = "";
    if (($ENV{'REQUEST_METHOD'} eq 'GET')) {
        my (%hashref) = shift;
        $buffer = $ENV{'QUERY_STRING'};
        foreach (split(/&/, $buffer)) {
            my ($key, $value) = split(/=/, $_);
            $key   = decodeURL($key);
            $value = decodeURL($value);
            $hashref{$key} = $value;
        }
    }
    else {
        read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'})
    }
}

sub decodeURL {
    $_ = shift;
    tr/+/ /;
    s/%(..)/pack('c', hex($1))/eg;
    return($_);
}
The HTML page opens correctly, but when I submit the form I get an internal server error.
Please help.
What does the web server's error log say?
Independent of what it says, you must stop parsing the form data yourself. There are modules for that, specifically CGI.pm. Using that, you can do this instead:
use CGI;
my $CGI = CGI->new();
my $uid = $CGI->param( 'uid' );
my $pass = $CGI->param( 'pass' );
# rest of your script
Much cleaner and much safer.
I agree with Tore that you must not parse this yourself. Your code has multiple errors. You don't allow multiple parameter values, you don't allow the ; alternate separator, you don't handle POST with a query string in the URL, and so on.
I don't know how long it will be online for free, but chapter 15 of my new "Beginning Perl" book covers Web programming. That should get you started on some decent basics. Note that the online version is an early, rough draft. The actual book also includes Chapter 19 which has a complete Web app example.
Could it be this line that's the problem?
my (%hashref) = shift;
You're initialising a proper hash, but shift will give you a hash reference, since you did getdata(\%frmfields). You probably want this instead:
my $hashref = shift;
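With that change, the assignments inside the loop also need the arrow dereference. A sketch of the corrected sub, with the logic otherwise unchanged:

sub getdata {
    my $hashref = shift;    # a hash reference, passed in as \%frmfields
    if ($ENV{'REQUEST_METHOD'} eq 'GET') {
        my $buffer = $ENV{'QUERY_STRING'};
        foreach (split /&/, $buffer) {
            my ($key, $value) = split /=/, $_;
            # store through the reference with ->
            $hashref->{ decodeURL($key) } = decodeURL($value);
        }
    }
}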
"500 Internal Server Error" just means that something didn't work the way the web server expected. Maybe you don't have CGI enabled. Maybe the script isn't executable. Maybe it's in a directory the web server isn't allowed to access. It's even possible that maybe the web server ran the script successfully and it worked perfectly, but didn't start its output with a valid set of HTTP headers. You need to look in the web server's error log to find out what it didn't like, which may or may not be a Perl issue.
Like everyone else has said, though, don't try to parse query strings and grovel through %ENV yourself. Use one of the many fine modules or frameworks which are available and already known to work correctly. CGI.pm is the granddaddy of them all and works well for smaller projects, but I'd recommend looking into a proper web application framework such as Dancer, Mojolicious, or Catalyst (there are many others, but those are the big three) if you're planning to build anything with more than a handful of relatively simple pages and forms.
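For a sense of the difference, here is a minimal sketch of the same login handler in Mojolicious::Lite, one of the frameworks mentioned above (route and field names mirror the question's form):

use Mojolicious::Lite;

# GET /login?uid=...&pass=... -- parameter decoding, headers, and
# error pages are all handled by the framework
get '/login' => sub {
    my $c    = shift;
    my $uid  = $c->param('uid');
    my $pass = $c->param('pass');
    $c->render(text => "Got uid '$uid'");
};

app->start;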

Icon Image Proxy in Perl

I'm trying to figure out the right way to serve icon files for our site listings. Basically an icon for a listing can come from an image file from a handful of different services (Flickr, Picasa, Google Static Maps, our own internal image hosting service, etc). The URL of the icon is stored in our database so I'd like to enable each listing icon to be displayed by simply calling:
http://www.example.com/listing/1234/icon
Currently I have been using CGI.pm to redirect to the correct icon URL; however, I want the file to be displayed directly, without a 301 redirect. Here is the code for what we've been using:
my $url = "http://www.example-service.com/image-123.gif";
print $query->redirect(-url=>$url);
I would appreciate any suggestions and code examples on how I could update this to serve the file via proxy without having to redirect the user. Thanks in advance for your help!
Use LWP to get the remote file and print it out.
#!/usr/local/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use CGI;

my $q  = CGI->new;
my $ua = LWP::UserAgent->new;
$ua->agent("MyApp/0.1");

my $url = 'http://www.example-service.com/image-123.gif';

# Create a request for the remote image
my $req = HTTP::Request->new(GET => $url);
my $res = $ua->request($req);

if ($res->is_success) {
    # Pass the image through with its original content type
    print $q->header( $res->content_type );
    print $res->content;
}
else {
    # Propagate the upstream failure status to the client
    print $q->header( 'text/plain', $res->status_line );
    print $res->status_line, "\n";
}
Alternatively you could write a trigger for your database which downloads the image for the listing and stores it either in the webroot somewhere or in the database itself when you add a new listing.
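A minimal sketch of that store-at-insert-time idea, assuming a writable icons/ directory under the webroot (the paths and naming scheme are made up for illustration):

use strict;
use warnings;
use LWP::Simple qw(getstore is_success);

# Hypothetical listing data; in practice this runs wherever a new
# listing is inserted
my $listing_id = 1234;
my $icon_url   = 'http://www.example-service.com/image-123.gif';

# Fetch the remote icon once and park it under the webroot, so the
# web server can serve it as a plain static file afterwards
my $status = getstore($icon_url, "/var/www/icons/$listing_id.gif");
die "Could not fetch $icon_url: $status\n" unless is_success($status);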

How can I download a file using WWW::Mechanize or any Perl module?

Is there a way in WWW::Mechanize or any Perl module to read a file after accessing a website? For example, I click a 'Receive' button, and a file (.txt) containing a message appears. How can I read its content? Answers are very much appreciated; I've been working on this for days and have tried every possibility I can think of. If you can give me an idea, please do. :)
Here is a part of my code:
...
my $username = "admin";<br>
my $password = "12345";<br>
my $url = "http://...do_gsm_sms.cgi";
my $mech = WWW::Mechanize->new(autocheck => 1, quiet => 0, agent_alias =>$login_agent, cookie_jar => $cookie_jar);
$mech->credentials($username, $password);<br>
$mech->get($url);
$mech->success() or die "Can't fetch the Requested page";<br>
print "OK! \n"; #This works <br>
$mech->form_number(1);
$mech->click()
;
After this, the 'Downloads' dialog box appears so I can save the file (I can also set the default to open it immediately instead of saving). The question is: how can I read the content of this file?
I take it you mean that the web site responds to the form submission by returning a non-HTML response (say, a 'text/plain' file) that you wish to save.
I believe you want $mech->save_content( $filename )
Added:
First you need to submit the form before saving the resulting (text) file.
click is for clicking a button, whereas you want to submit a form, using $mech->submit() or $mech->submit_form( ... ).
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $username    = "admin";
my $password    = "12345";
my $login_agent = 'WWW::Mechanize login-agent';
my $cookie_jar;

#my $url = "http://localhost/cgi-bin/form_mech.pl";
my $url = "http://localhost/form_mech.html";

my $mech = WWW::Mechanize->new(
    autocheck   => 1,
    quiet       => 0,
    agent_alias => $login_agent,
    cookie_jar  => $cookie_jar,
);
$mech->credentials($username, $password);
$mech->get($url);
$mech->success() or die "Can't fetch the Requested page";
print "OK! \n"; #This works

$mech->submit_form(
    form_number => 1,
);
die "Submit failed" unless $mech->success;

$mech->save_content('out.txt');
After the click (assuming that's doing what it's supposed to), the returned data should be stored in your $mech object. You should be able to get the file data with $mech->content(),
perhaps after verifying success with $mech->status() and the type of response with $mech->content_type().
You may find it helpful to remember that WWW::Mechanize replaces the browser; anything a browser would have done, like bringing up a download window and saving a file, doesn't actually happen, but all the information the browser would have had is accessible through WWW::Mechanize's methods.
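A short sketch of that check-then-fetch flow, using only the methods mentioned above (the expected content type is an assumption):

# After the form submission, inspect what actually came back
# before treating it as the downloaded file
if ($mech->status == 200 && $mech->content_type eq 'text/plain') {
    my $file_data = $mech->content;    # the "downloaded" file's bytes
    print $file_data;
}
else {
    warn "Unexpected response: ", $mech->status, " ", $mech->content_type, "\n";
}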
Dare I ask... have you tried this?
my $content = $mech->content();
Open the file (not 'Downloads' window) as if you were viewing it within your browser; you can save it later with a few lines of code.
Provided you have HTML::TreeBuilder installed:
my $textFile = $mech->content(format => "text");
should get you the text of the resulting window that opens.
Then open a filehandle to write your results in:
open my $fileHandle, ">", "results.txt"
    or die "Cannot open results.txt: $!";
print $fileHandle $textFile;
close $fileHandle;
I do this all the time with LWP, but I'm sure it's equally possible with Mech.
I think where you might be going wrong is using Mech to request the page that has the button on it, when you actually want to request the content that the button causes to be sent to the browser when clicked.
What you need to do is review the HTML source of the page with the button that initiates the download and see what the action associated with the button is. Most likely it will be a POST with some hidden fields, or a URL to do a GET.
The target URL of the click has the stuff you actually want to get, not the URL of the page with the button on it.
For problems like this, you often have to investigate the complete chain of events that the browser handles. Use an HTTP sniffer tool to see everything the browser is doing until it gets to the file. You then have to do the same thing in Mech.
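Once the sniffer shows the real request, replaying it in Mech is straightforward. A hypothetical sketch (the URL and form fields are placeholders for whatever the sniffer reveals):

use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new(autocheck => 1);

# Suppose the sniffer showed the button fires a POST like this;
# the URL and fields below are illustrative placeholders only
$mech->post(
    'http://example.com/cgi-bin/do_gsm_sms.cgi',
    { action => 'receive', format => 'txt' },
);

# The response body is the file the browser would have offered to save
$mech->save_content('received.txt');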