How can I download a file using WWW::Mechanize or any Perl module? - perl

Is there a way in WWW::Mechanize or any other Perl module to read a file that a website returns? For example, I click a 'Receive' button and a .txt file containing a message appears. How can I read its content? Answers are very much appreciated. I've been working on this for days and have tried everything I can think of. Can anyone help, or at least give me an idea? :)
Here is a part of my code:
...
my $username = "admin";
my $password = "12345";
my $url = "http://...do_gsm_sms.cgi";
my $mech = WWW::Mechanize->new(autocheck => 1, quiet => 0, agent_alias => $login_agent, cookie_jar => $cookie_jar);
$mech->credentials($username, $password);
$mech->get($url);
$mech->success() or die "Can't fetch the Requested page";
print "OK! \n"; # This works
$mech->form_number(1);
$mech->click();
After this, a 'Downloads' dialog box appears so I can save the file (I can also set the default to open it immediately instead of saving). The question is: how can I read the content of this file?
..

I take it you mean that the web site responds to the form submission by returning a non-HTML response (say a 'text/plain' file) that you wish to save.
I believe you want $mech->save_content( $filename )
Added:
First you need to submit the form with WWW::Mechanize, before saving the resulting (text) file.
The click is for clicking a button, whereas you want to submit a form, using $mech->submit() or $mech->submit_form( ... ).
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
my $username = "admin";
my $password = "12345";
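my $login_agent = 'WWW::Mechanize login-agent';
my $cookie_jar = {}; # empty hash ref = in-memory cookie jar; undef would disable cookies
#my $url = "http://localhost/cgi-bin/form_mech.pl";
my $url = "http://localhost/form_mech.html";
my $mech = WWW::Mechanize->new(autocheck => 1, quiet => 0,
agent => $login_agent, # agent_alias only accepts predefined aliases such as 'Windows Mozilla'
cookie_jar => $cookie_jar,
);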
$mech->credentials($username, $password);
$mech->get($url);
$mech->success() or die "Can't fetch the Requested page";
print "OK! \n"; #This works
$mech->submit_form(
form_number => 1,
);
die "Submit failed" unless $mech->success;
$mech->save_content('out.txt');

After the click (assuming that's doing what it's supposed to), the returned data should be stored in your $mech object. You should be able to get the file data with $mech->content(),
perhaps after verifying success with $mech->status() and the type of response with $mech->content_type().
You may find it helpful to remember that WWW::Mechanize replaces the browser; anything a browser would have done, like bringing up a download window and saving a file, doesn't actually happen, but all the information the browser would have had is accessible through WWW::Mechanize's methods.
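As a minimal sketch of that check-then-read sequence: the helper name save_text_response and the output path are made up for illustration, and $mech is assumed to be a WWW::Mechanize object that has just submitted the form.

```perl
use strict;
use warnings;

# Hypothetical helper: $mech is assumed to be a WWW::Mechanize object
# (or anything with the same status/content_type/content methods)
# that has just completed a request.
sub save_text_response {
    my ($mech, $path) = @_;

    # status() is the HTTP status code of the last response.
    my $status = $mech->status;
    die "Unexpected HTTP status $status\n" unless $status == 200;

    # content_type() is the media type of the last response.
    my $type = $mech->content_type;
    die "Expected text/plain, got $type\n" unless $type =~ m{^text/plain}i;

    # content() is the body the browser would have offered to download.
    my $body = $mech->content;
    open my $out, '>', $path or die "Cannot write $path: $!";
    print {$out} $body;
    close $out or die "Cannot close $path: $!";
    return $body;
}
```

After the click or submit, it would be called as save_text_response($mech, 'message.txt'); the returned string is the file's content, so there is no need to read the file back in.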

Dare I ask... have you tried this?
my $content = $mech->content();

Open the file (not 'Downloads' window) as if you were viewing it within your browser; you can save it later with a few lines of code.
Provided you have HTML::TreeBuilder installed:
my $textFile = $mech->content(format => "text");
should get you the text of the resulting window that opens.
Then open a filehandle to write your results in:
open my $fileHandle, ">", "results.txt" or die "Can't open results.txt: $!";
print $fileHandle $textFile;
close $fileHandle;

I do this all the time with LWP, but I'm sure it's equally possible with Mech.
I think where you might be going wrong is requesting the page that has the button on it, when what you actually want is the response the server sends when the button is clicked.
What you need to do is review the HTML source of the page with the button that initiates the download and see what action is associated with the button. Most likely it will be a POST with some hidden fields, or a URL for a GET.
The target URL of the click has the content you actually want, not the URL of the page with the button on it.
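One way to do that inspection from Perl rather than by eye is sketched below. It assumes the HTML::Form module (the form parser WWW::Mechanize builds on); the page source, the URLs, and the helper name describe_form are made up for illustration.

```perl
use strict;
use warnings;
use HTML::Form;

# Summarise one HTML::Form object: method, target URL, and every input.
sub describe_form {
    my ($form) = @_;
    my @lines = (sprintf '%s %s', $form->method, $form->action);
    for my $input ($form->inputs) {
        push @lines, sprintf '  %-8s %s=%s',
            $input->type,
            defined $input->name  ? $input->name  : '(unnamed)',
            defined $input->value ? $input->value : '';
    }
    return join "\n", @lines;
}

# Made-up page source standing in for the page that holds the button.
my $html = <<'HTML';
<form action="/cgi-bin/receive.cgi" method="post">
  <input type="hidden" name="mode" value="inbox">
  <input type="submit" value="Receive">
</form>
HTML

# The second argument is the page's own URL, used to resolve relative actions.
print describe_form($_), "\n"
    for HTML::Form->parse($html, 'http://localhost/page.html');
```

With a live page, the objects returned by $mech->forms can be passed to the same helper, since they are HTML::Form objects too.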

For problems like this, you often have to investigate the complete chain of events that the browser handles. Use an HTTP sniffer tool to see everything the browser is doing until it gets to the file. You then have to do the same thing in Mech.

Related

How can I scrape images off of a webpage using Perl

I would like to scrape all images off of a webpage and am running into a problem I don't understand.
For instance, if I enter https://www.google.com/search?as_st=y&tbm=isch&hl=en&as_q=%22escape%20room%22+movie+poster&safe=images into my browser and then use the browser's "View Source" option, I get a massive amount of text/code. Using "find" I get more than 400 instances of
https://
So the simple code I wrote (below) gets the content and writes the result to a file. But a grep search of https:// only returns 7 instances. So obviously I am doing something incorrectly, perhaps the page is dynamic and I can't access that part?
Is there a way I can get the same source, via Perl, that I get via View Source?
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
my $ua = LWP::UserAgent->new;
$ua->agent("$0/0.1 " . $ua->agent);
$ua->agent("Mozilla/8.0"); # this replaces the agent string set on the previous line
my $sstring = "https://www.google.com/search?as_st=y&tbm=isch&hl=en&as_q=%22escape%20room%22+movie+poster&safe=images";
my $req = HTTP::Request->new(GET => $sstring);
$req->header('Accept' => 'text/html');
my $res = $ua->request($req);
open(my $fh, '>', 'report.txt') or die "Can't open report.txt: $!";
print $fh $res->decoded_content;
close $fh;
Here's the example I got from WWW::Mechanize::Chrome:
my $mech = WWW::Mechanize::Chrome->new();
my $sstring = "https://www.google.com/search?as_st=y&tbm=isch&hl=en&as_q=\"" . $arrayp . "\"" . "+movie+poster&safe=images";
$mech->get($sstring);
sleep 5;
print $_->get_attribute('href'), "\n\t-> ", $_->get_attribute('innerHTML'), "\n"
for $mech->selector('a.download');
The Google search uses JavaScript to alter the page content after load. LWP::UserAgent does not support JavaScript, so what you get is only the initial document. (Hint: an easy way to see in the browser what LWP::UserAgent "sees" is to use a browser add-on to disable JavaScript.)
You will need to use something called a "headless browser", for example WWW::Mechanize::Chrome.

Forms and POST using perl

I have this code, which is supposed to fill in the email form on http://facebook.com/recover.php.
As you know, you can search by email, name or phone number.
So I am trying to search by email, and get the content of that page after the search has been completed to see whether the profile is found or not, but the code doesn't seem to work.
use HTTP::Request::Common;
use LWP::UserAgent;
$email="blabla\#hotmail.com";
my %data=(email=>$email);
my $user_agent = 'Mozilla/6.0';
my $Browser = LWP::UserAgent->new;
$Browser->agent($user_agent);
$ua=$Browser->post('https://www.facebook.com/recover.php',\%data);
if($ua->content=~/couldn\'t/){ #"couldn't" is part of the message displayed when
print "Not Found"; # input doesn't match
}
elsif ($ua->content=~/name/) {
print "Found";
}
else {
print "Not found";
}
$result=$ua->content;
open FILE,">","me.txt" or die $!;
print FILE $result;
close FILE;
use strict
make it compile under strict
review the manpage for LWP::UserAgent, there's a problem with your code that you'll have to discover on your own so you'll remember
review your variable names in light of the conventions used in the manpage
review the approach (considering that Facebook has an API, IIRC)
no need to escape the single quote in the regex
You should post your request to the URL in the action field of the form (while you're using the URL of the page which shows the form).
Also add any hidden field to your %data.
Have a look at the HTML code of the page (or use some sort of form inspector) to get the correct URL and the hidden fields (javascript code, if present, can further complicate things).
Then use strict (and use warnings as well) as already said by Lumi.
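A rough sketch of that advice, assuming the HTML::Form module is available: it parses the form, keeps whatever hidden fields the page supplies, and builds a request aimed at the form's action URL rather than at the page URL. The page source, field names, and URLs here are invented for illustration and are not Facebook's real form.

```perl
use strict;
use warnings;
use HTML::Form;

# Invented page source; a real page would come from $ua->get($page_url).
my $html = <<'HTML';
<form action="/search_submit.php" method="post">
  <input type="hidden" name="token" value="abc123">
  <input type="text" name="email">
  <input type="submit" name="go" value="Search">
</form>
HTML

# In scalar context parse() returns the first form on the page.
# The second argument is the URL of the page that shows the form.
my $form = HTML::Form->parse($html, 'https://www.example.com/recover.php');

# Fill in the visible field; the hidden field keeps the value from the page.
$form->value(email => 'someone@example.com');

# click() builds an HTTP::Request for the form's action URL.
my $request = $form->click;
print $request->method, ' ', $request->uri, "\n";
print $request->content, "\n";
```

The request can then be sent with $ua->request($request) on an LWP::UserAgent object. If the page builds its fields with JavaScript, this approach will not see them.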

WWW::Mechanize Form Select

I am attempting to log in to YouTube with WWW::Mechanize and use forms() to print out all the forms on the page after logging in. My script logs in successfully and also navigates to youtube.com/inbox; however, for some reason Mechanize cannot see any forms at youtube.com/inbox. It just returns blank. Here is my code:
#!"C:\Perl64\bin\perl.exe" -T
use strict;
use warnings;
use CGI;
use CGI::Carp qw/fatalsToBrowser/;
use WWW::Mechanize;
use Data::Dumper;
my $q = CGI->new;
$q->header();
my $url = 'https://www.google.com/accounts/ServiceLogin?uilel=3&service=youtube&passive=true&continue=http://www.youtube.com/signin%3Faction_handle_signin%3Dtrue%26nomobiletemp%3D1%26hl%3Den_US%26next%3D%252Findex&hl=en_US&ltmpl=sso';
my $mechanize = WWW::Mechanize->new(autocheck => 1);
$mechanize->agent_alias( 'Windows Mozilla' );
$mechanize->get($url);
$mechanize->submit_form(
form_id => 'gaia_loginform',
fields => { Email => 'myemail',Passwd => 'mypassword' },
);
die unless ($mechanize->success);
$url = 'http://www.youtube.com/inbox';
$mechanize->get($url);
$mechanize->form_id('comeposeform');
my $page = $mechanize->content();
print Dumper($mechanize->forms());
Mechanize is unable to see any forms at youtube.com/inbox, however, like I said, I can print all of the forms from the initial link, no matter what I change it to...
Thanks in advance.
As always, one of the best debugging approaches is to print what you get and check if it is what you were expecting. This applies to your problem too.
In your case, if you print $mechanize->content() you'll see that you didn't get the page you're expecting. YouTube wants you to follow a JavaScript redirect in order to complete your cross-domain login action. You have multiple options here:
parse the returned content manually – i.e. /location\.replace\("(.+?)"/
try to have your code parse JavaScript (have a look at WWW::Scripter)
[recommended] use YouTube API for managing your inbox
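The first option can be sketched as follows. The helper name extract_js_redirect and the sample page are made up for illustration; a real page may use different quoting or escape characters inside the URL, in which case the captured string would need decoding before use.

```perl
use strict;
use warnings;

# Pull the target URL out of a location.replace("...") call, if there is one.
sub extract_js_redirect {
    my ($html) = @_;
    return unless defined $html;
    return $html =~ /location\.replace\("(.+?)"/ ? $1 : undef;
}

# Made-up response body standing in for $mechanize->content().
my $page = '<script>window.location.replace("http://www.example.com/next?step=2")</script>';

my $target = extract_js_redirect($page);
print defined $target ? "Redirect target: $target\n" : "No redirect found\n";
```

In the script above, the result would be passed to $mechanize->get($target) before requesting the inbox page.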

Icon Image Proxy in Perl

I'm trying to figure out the right way to serve icon files for our site listings. Basically an icon for a listing can come from an image file from a handful of different services (Flickr, Picasa, Google Static Maps, our own internal image hosting service, etc). The URL of the icon is stored in our database so I'd like to enable each listing icon to be displayed by simply calling:
http://www.example.com/listing/1234/icon
Currently I have been using CGI.pm to do a redirect to the correct icon URL, however, I want the file to be directly displayed without having to do a 301 redirect. Here is the code for what we've been using:
my $url = "http://www.example-service.com/image-123.gif";
print $query->redirect(-url=>$url);
I would appreciate any suggestions and code examples on how I could update this to serve the file via proxy without having to redirect the user. Thanks in advance for your help!
Use LWP to get the remote file and print it out.
#!/usr/local/bin/perl
use LWP::UserAgent;
use CGI;
my $q = CGI->new;
my $ua = LWP::UserAgent->new;
$ua->agent("MyApp/0.1");
my $url = 'http://www.example-service.com/image-123.gif';
# Create a request
my $req = HTTP::Request->new(GET => $url);
my $res = $ua->request($req);
if ($res->is_success) {
binmode STDOUT; # image data is binary
print $q->header( scalar $res->content_type ); # scalar: media type only, no parameters
print $res->content;
} else {
print $q->header( 'text/plain', $res->status_line );
print $res->status_line, "\n";
}
Alternatively you could write a trigger for your database which downloads the image for the listing and stores it either in the webroot somewhere or in the database itself when you add a new listing.

How can I access forms without a name or id with Perl's WWW::Mechanize?

I am having problems with my Perl program. This program logs in to a specific web page and fills in the text area for the message and an input box for mobile numbers. Upon clicking the 'Send' button, the message is sent to the specified number. I already got it to work for sending messages, but I can't make it work for receiving messages/replies. I'm using the WWW::Mechanize module in Perl. Here is a part of my code (for receiving msgs):
$username = 'suezy';
$password = '123';
$url = 'http://..sample.cgi';
# ...
$mech->credentials($username, $password);
$mech->get($url);
$mech->submit();
My problem is that the form shows no names. There are two buttons in this form, but I can't select which button to click, since no names are specified and the ids contain a space (e.g. form name='receive msg'..). I need to click the second button, 'Receive'.
The question is: how can I access the forms and buttons with the Mechanize module without using names?
You can pass a form_number argument to the submit_form method.
Or call the form_number method to affect which form is used by later calls to click or field.
Have you tried to use HTTP Recorder?
Have a look at the documentation and try it to see if it gives a reasonable result for you.
Seeing that there are only two buttons on your form, ysth's suggestion should be easy to implement.
use strict;
use warnings;
use WWW::Mechanize;
my $username = "suezy";
my $password = "123";
my $url = 'http://.../sample.cgi';
my $mech = WWW::Mechanize->new();
$mech->credentials($username,$password); # set credentials before the first request
$mech->get($url);
And then:
$mech->click_button(number => 1); # if the 'Receive' button is 1
Or:
$mech->click_button(number => 2); # if the 'Receive' button is 2
A case of trial-and-error is more than adequate for you to figure out which button you're clicking.
EDIT
I'm assuming that the relevant form has already been selected. If not:
$mech->form_number($formNumber);
where $formNumber is the form number on the page in question. Alternatively,
$mech->form_with_fields('username');
will select the form that contains a field named username.
hth