Perl - mechanize

I have the following code that works just fine.
#!/usr/bin/perl -w
use strict;
use LWP 6.03;
use URI;

my $browser = LWP::UserAgent->new;
my $url     = URI->new('http://www.google.com/search');
$url->query_form(
    'hl'  => 'en',      # interface language parameter is 'hl', not 'h1'
    'num' => '100',
    'q'   => 'glass',
);
my $response = $browser->get($url,
    'User-Agent'      => 'Mozilla/4.76 [en] (win98; U)',
    'Accept'          => 'image/gif, image/x-bitmap, image/jpeg, image/pjpeg, image/png, */*',
    'Accept-Charset'  => 'iso-8859-1,*',
    'Accept-Language' => 'en-US',
);
if ($response->content =~ m/glass/i) {
    print "Success";
    open my $out, '>', 'gglass' or die "Cannot write gglass: $!";
    print {$out} $response->content;
    close $out;
} else {
    print "complete failure";
}
I have another piece of code that also works fine.
It uses the following:
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use HTML::TokeParser;
When I look up the documentation for my code on CPAN, it tells me that the libraries I am using are deprecated. Even though the code works on my system, that style of programming is being abandoned. The documentation points me to something I have never used, and I do not know whether that will soon be abandoned as well. What is the popular way to scrape a website these days? I do not want to be considered a dinosaur, or be stuck with antiquated or remedial programs and tactics that leave me in the previous century. If you could come up with a piece of code similar to the first example, that would be nice; that way I could compare the two.

Your documentation is wrong: none of LWP, URI, WWW::Mechanize, or HTML::TokeParser is deprecated. Mechanize works just fine in general for crawling. I would replace HTML::TokeParser with something that handles HTML parsing in a declarative fashion, though: Web::Query is splendid, and HTML::TreeBuilder::XPath is nice.
However, concerning your code example: Google's terms of use forbid scraping. Use their API instead!
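To give a flavour of that style, here is a minimal sketch in the spirit of your first example, but pointed at a placeholder page rather than Google; the URL and the XPath expression are purely illustrative:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use HTML::TreeBuilder::XPath;

my $mech = WWW::Mechanize->new( agent => 'Mozilla/5.0' );
$mech->get('http://example.com/');                    # placeholder URL

# Parse the fetched page and pull nodes out declaratively with XPath
my $tree  = HTML::TreeBuilder::XPath->new_from_content( $mech->content );
my @links = $tree->findvalues('//a/@href');           # every link target on the page

print "$_\n" for @links;
$tree->delete;                                        # free the parse tree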


Getting different source code for the same URL

I'm trying to grab the headline from a Washington Post news page on the web with a simple Perl script:
#!/usr/bin/env perl
use strict;
use warnings;
use LWP::Simple;
use Web::Scraper;

my $url = 'https://www.washingtonpost.com/outlook/why-trump-is-flirting-with-abandoning-fox-news-for-one-america/2019/10/11/785fa156-eba4-11e9-85c0-85a098e47b37_story.html';

my $scraper = scraper {
    process '//h1[@data-qa="headline"]', 'headline' => 'TEXT';
};

my $html = get($url);
print $html;
my $res = $scraper->scrape($html);
The problem I'm having is that it works only about half of the time, even when fetching the exact same URL. The source code that is returned is in a completely different format from one fetch to the next.
Perhaps this is an anti-scraping measure for unknown agents? I'm not sure, but it seems like it should never work at all if that were the case.
Is there a simple workaround I might employ, like accepting cookies?
Update: I modified $scraper to the following to get it to work with both versions of the source code:
my $scraper = scraper {
    process '//h1[@data-qa="headline"]',  'headline'  => 'TEXT';
    process '//h1[@itemprop="headline"]', 'headline2' => 'TEXT';
};
Either headline or headline2 will be populated, depending on which version of the page comes back.
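A tiny follow-up sketch (assuming the $res hash returned by the scraper above) that takes whichever key was filled in:

my $headline = $res->{headline} // $res->{headline2};
die "No headline found in either page variant\n" unless defined $headline;
print "$headline\n";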

How to decide whether or not to use Perl CGI vs regular HTML output

I've discovered that CGI is no longer recommended when it comes to creating HTML pages, but my search for answers as to where the use of CGI is appropriate has caused more confusion than clarity.
I apologise if my question is basic, but I'm hoping that an answer to my question will help to clarify some things.
I'm being told not to create a form like this:
sub output_form {
    my ($q) = @_;
    print $q->start_form(
        -name   => 'main',
        -method => 'POST',
    );
    print $q->start_table;
    print $q->Tr(
        $q->td('Update#:'),
        $q->td(
            $q->textfield(-name => "update_num", -size => 2)
        )
    );
    print $q->Tr(
        $q->td('Date:'),
        $q->td(
            $q->textfield(-name => "date", -id => "datepicker")
        )
    );
    print $q->Tr(
        $q->td('Location:'),
        $q->td(
            $q->textfield(-name => "location", -size => 50)
        )
    );
    print $q->Tr(
        $q->td('Queue:'),
        $q->td(
            $q->textfield(-name => "queue", -size => 50)
        )
    );
    print $q->Tr(
        $q->td('ETO:'),
        $q->td(
            $q->textfield(-name => "eto", -size => 50)
        )
    );
    print $q->Tr(
        $q->td('CAD#:'),
        $q->td(
            $q->textfield(-name => "cad", -size => 50)
        )
    );
    print $q->Tr(
        $q->td('Remarks:'),
        $q->td(
            $q->textfield(-name => "remarks", -size => 50)
        )
    );
    print $q->end_table;
    print $q->end_form;
}
But if I create such a form using a regular HTML page, will I be able to interact with user input from a Perl script?
Update
I've looked at your question again, and it seems like you've become so entrenched in what CGI offers that you've got yourself lost.
But if I create such a form using a regular HTML page, will I be able to interact with user input from a Perl script?
Whatever your program does, and however it does it, it must send an ordinary HTML page back to the browser that made the original request. There is nothing magical about the various start_form, start_table, Tr, td etc. functions that CGI makes available: it is supposed to be a more convenient way of generating HTML using Perl syntax
Generating HTML has nothing to do with the CGI protocol, and many people felt that it was inappropriate to include that sort of functionality in a module called CGI. That led to modules such as HTML::Tiny, which provides HTML construction functions similar to CGI.pm's.
Other modules grew up to provide just the CGI protocol handling, such as CGI::Minimal.
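To illustrate the HTML-generation side, here is a minimal sketch using HTML::Tiny to build one of the table rows from the question (the field name is just carried over for illustration):

use strict;
use warnings;
use HTML::Tiny;

my $h = HTML::Tiny->new;

# One table row with a label cell and a text-field cell, much like the Tr/td calls above
print $h->table(
    $h->tag( 'tr',
        $h->td('Update#:'),
        $h->td( $h->input( { type => 'text', name => 'update_num', size => 2 } ) ),
    )
);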
There are many more examples of the separate implementation of both aspects of the original CGI.pm, but you are concerned about whether you can interact with a user via HTTP.
Once again, there is nothing special about the functions that CGI.pm makes available. You should run an old CGI program from the command line to see that it just generates the string of HTML that you have prescribed in your calls, and you could have created that in any way that was convenient
Once the HTML has been built and sent to the client, it makes no difference how the message was built. The page will be displayed on the browser and it will offer the user the chance to request more information
I hope that's clearer for you?
Take a look at CGI::Alternatives for options other than CGI
But you're talking about constructing HTML, which is nothing to do with CGI, and one of the main criticisms of the module was that it wrapped too much functionality into a single box
You should focus on using a template package to build your HTML, and one of the most popular is Template::Toolkit
You probably have additional CSS styling and JavaScript intelligence, which should be linked from your HTML as separate files
For a browser to present an HTML page to a user, the web server has to return an HTTP response that includes the required HTML in the body. Sometimes that HTML is returned from a static file and sometimes it is generated by some server-side application.
The browser doesn't care (and, indeed, is unlikely to know) how that HTML is generated. All it knows is that it has received an HTTP response with a Content-Type of text/html and a body consisting of HTML which it needs to parse and render.
So you have a couple of options. You can write a static HTML file that contains your form. Or you can write a Perl program that generates it. Either of these options makes no difference to the browser. You have chosen to write a Perl program. There are various technologies that you can use to implement this. I wouldn't recommend CGI these days (see CGI::Alternatives for some suggestions) but let's assume that we're going with that.
(It's also worth pointing out here that CGI - the protocol - is not the same thing as CGI.pm the library that is often used to write Perl programs that run under the protocol. You don't need to use CGI.pm to write a CGI program.)
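To make that point concrete, here is a minimal sketch of a CGI program that uses no module at all; it just follows the protocol by reading the request details from the environment and printing a header block, a blank line, and a body:

#!/usr/bin/perl
use strict;
use warnings;

# Under CGI, the web server hands the request details to us in environment variables
my $method = $ENV{REQUEST_METHOD} // 'GET';
my $query  = $ENV{QUERY_STRING}   // '';

# The response is just headers, a blank line, then the body
print "Content-Type: text/html\n\n";
print "<html><body>\n";
print "<p>Request method: $method</p>\n";
print "<p>Query string length: ", length($query), " characters</p>\n";
print "</body></html>\n";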
CGI.pm used to include helper functions for generating HTML. These are now deprecated and have been moved to a separate module. There are many reasons for their deprecation. The most obvious one is probably that on many projects, the people designing and implementing the front-end of the site are different people to the ones writing the back-end code. If a front-end developer already needs to know HTML, CSS and Javascript, it's slightly unfair to expect them to know Perl as well, which they would need in order to edit the web pages using the HTML generation functions. Even in a situation where I am the only person working on a site, I find that enforcing a strict separation between the front- and back-end technologies helps to keep the code cleaner.
So I really wouldn't recommend using those functions. No-one would, as far as I can see. Instead, I would use a templating system. In particular, I'd use the Template Toolkit (that's a personal preference, but I'm slightly biased).
With a templating engine, you can put all of your HTML code in a completely separate file which your front-end team can own and edit in whatever way they choose. Then, when your back-end code needs to display the HTML page, it can use template-processing functions to do that. A (very!) simple example might look like this:
In template.cgi:
#!/usr/bin/perl
use strict;
use warnings;
use Template;
use CGI qw[header param]; # Only use two functions from CGI.pm

print header;

my $tt = Template->new;

if (my $name = param('name')) {
    $tt->process('output.tt', { name => $name })
        or die $tt->error;
} else {
    $tt->process('form.tt')
        or die $tt->error;
}
form.tt would look like this:
<html>
<head><title>What's your name?</title></head>
<body>
<form enctype="multipart/form-data">
Enter name: <input name="name" />
</form>
</body>
</html>
And output.tt would look like this:
<html>
<head><title>Welcome [% name %]</title></head>
<body>
<h1>Hello [% name %]</h1>
<p>Pleased to meet you.</p>
</body>
</html>
The fall of CGI.pm came from two directions. The HTML-building methods were always an ugly duckling, with some form of template being preferred. At the other side, the methods that handled interaction with the client (CGI) were superseded first by mod_perl (which has its own library for this sort of thing), and later by frameworks like Dancer and Mojolicious.
Those frameworks also incorporate templates. There's very little reason to learn the old-style CGI anymore, unless you're maintaining old code. There's also plenty of debate between the Dancer and Mojo camps; I'd suggest picking one, learning it on one project, and then taking up the other on another project.
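As a taste of how little ceremony those frameworks need, here is a minimal Mojolicious::Lite sketch (the route and greeting are made up for illustration):

#!/usr/bin/perl
use Mojolicious::Lite;    # also enables strict and warnings

# A route with a placeholder; the matched value ends up in the stash
get '/hello/:name' => sub {
    my $c = shift;
    $c->render( text => 'Hello, ' . $c->stash('name') );
};

app->start;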

Mojolicious and directory traversal

I am new to Mojolicious and trying to build a tiny web service using this framework.
I wrote the code below, which renders a file remotely:
use Mojolicious::Lite;
use strict;
use warnings;

app->static->paths->[0] = 'C:\results';

get '/result' => sub {
    my $self    = shift;
    my $headers = $self->res->headers;
    $headers->content_type('text/zip;charset=UTF-8');
    $self->render_static('result.zip');
};

app->start;
But it seems that when I try to fetch the file using the following URL:
http://mydomain:3000/result/./../result
I still get the file.
Is there any option in Mojolicious to prevent such directory traversal?
That is, in the above case I want only
http://mydomain:3000/result
to serve the page. If someone enters this URL:
http://mydomain:3000/result/./../result
the page should not be served.
Is it possible to do this?
If your route is written as a pattern like /$result^/ rather than the literal string '/result': /$result^/ is a regular expression, and if you have not defined the scalar variable $result (which it does not appear you have), it resolves to /^/, which matches not just http://mydomain:3000/result/./../result but also http://mydomain:3000/john/jacob/jingleheimer/schmidt.
Use strict and use warnings, even on tiny web services.
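For what it's worth, here is a minimal sketch along the same lines as the question's code (same assumptions: Mojolicious::Lite and a C:\results directory); the key point is that the file name handed to the static dispatcher is a fixed literal, so nothing from the request URL is ever used to build a filesystem path:

use Mojolicious::Lite;

app->static->paths->[0] = 'C:\results';

# Only this exact route serves the archive, and the file name is hard-coded,
# so user-supplied path segments never reach the filesystem.
get '/result' => sub {
    my $c = shift;
    $c->res->headers->content_type('application/zip');
    $c->reply->static('result.zip');    # reply->static is the newer name for render_static
};

app->start;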

Site scraping in perl using WWW::Mechanize

I have used WWW::Mechanize in Perl for a site scraping application.
I have faced some difficulties when trying to log in to a particular site via WWW::Mechanize. I have gone through some examples of WWW::Mechanize, but I couldn't figure out my issue.
I have included my code below.
#!/usr/bin/perl -w
use strict;
use WWW::Mechanize;
use HTTP::Cookies;
use Crypt::SSLeay;

my $agent = WWW::Mechanize->new(noproxy => 0);
$agent->cookie_jar(HTTP::Cookies->new());
$agent->agent('Mozilla/5.0');
$agent->proxy(['https', 'http', 'ftp'], 'http://proxy.rcapl.com:3128');
$agent->get("http://www.facebook.com");

my $re = $agent->submit_form(
    form_number => 1,
    fields      => {
        Email  => 'xyz@gmail.com',
        Passwd => 'xyz',
    },
);
print $re->content();
When I run the code it says:
Error POSTing https://www.facebook.com/login.php?login_attempt=1: Not Implemented at ./test.pl line 11
Can anybody tell me what's going wrong in the code? Do I need to set all the parameters which Facebook sends for login?
The proxy is faulty:
Error GETing http://www.facebook.com: Can't connect to proxy.rcapl.com:3128 (Bad hostname) at so11406791.pl line 11.
The program works for me without calling the proxy method. Remove this.
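If a proxy genuinely is required on your network, one option (a sketch, not taken from the original code) is to let the agent read the standard proxy environment variables instead of hard-coding a hostname:

my $agent = WWW::Mechanize->new();
$agent->env_proxy;    # honour http_proxy / https_proxy / no_proxy if they are set
$agent->agent('Mozilla/5.0');
$agent->get('http://www.facebook.com');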

mod_perl redirect

So I'm working in a mod_perl environment, and I want to know what the best way is to redirect to a new url. I know in CGI Perl you use print "Location:...", however I've come to find that usually there are better ways to do things in mod_perl, but I can't seem to find anything. Thanks in advance!
use Apache2::Const -compile => qw(REDIRECT);

sub handler {
    my $r = shift;
    # $url holds the destination, set elsewhere in the handler
    $r->headers_out->set( Location => $url );
    return Apache2::Const::REDIRECT;    # 302
}
This is how to properly issue a redirect in mod_perl2.
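For context, the same idea as a complete response handler module might look like the sketch below; the package name, destination URL, and configuration lines are illustrative assumptions:

package MyApp::Redirect;

use strict;
use warnings;

use Apache2::RequestRec ();
use APR::Table ();
use Apache2::Const -compile => qw(REDIRECT);

sub handler {
    my $r = shift;

    # Point the client at the new location and answer with a 302
    $r->headers_out->set( Location => 'https://example.com/new-page' );
    return Apache2::Const::REDIRECT;
}

1;

# Wired into Apache with something like:
#   <Location /old-page>
#       SetHandler perl-script
#       PerlResponseHandler MyApp::Redirect
#   </Location>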