How can I get started with web page scraping using Perl?

I am interested in learning Perl. I am using the Learning Perl book and the CPAN web sites for reference.
I would like to write a web/text scraping application in Perl to apply what I have learnt.
Please suggest some good options to begin with.
(This is not homework. I want to do something in Perl that would help me exercise the basic Perl features.)

If the web pages you want to scrape require JavaScript to function properly, you are going to need more than what WWW::Mechanize can provide you. You might even have to resort to controlling a specific browser via Perl (e.g. using Win32::IE::Mechanize or WWW::Mechanize::Firefox).
I haven't tried it, but there is also WWW::Scripter with the WWW::Scripter::Plugin::JavaScript plugin.

As others have said, WWW::Mechanize is an excellent module to use for web scraping tasks; you'll do well to learn how to use it, as it can make common tasks very easy. I've used it for several web scraping tasks, and it just takes care of all the boring stuff - "go here, find a link with this text and follow it, now find a form with fields named 'username' and 'password', enter these values and submit the form...".
Scrappy is also well worth a look - it lets you do a lot with very little code - an example from its documentation:
my $spidy = Scrappy->new;
$spidy->crawl('http://search.cpan.org/recent', {
    '#cpansearch li a' => sub {
        print shift->text, "\n";
    }
});
Scrappy makes use of Web::Scraper under the hood, which you might want to look at too as another option.
Also, if you need to extract data from HTML tables, HTML::TableExtract makes this dead easy - you can locate the table you're interested in by naming the headings it contains, and extract data very easily indeed, for example:
use HTML::TableExtract;
# Locate the table by the column headings it contains
my $te = HTML::TableExtract->new( headers => [qw(Date Price Cost)] );
$te->parse($html_string) or die "Didn't find table";
foreach my $row ($te->rows) {
    print join(',', @$row), "\n";
}

The most popular web scraping module for Perl is WWW::Mechanize, which is excellent if you can't just retrieve your destination page but need to navigate to it using links or forms, for instance, to log in. Have a look at its documentation for inspiration.
If your needs are simple, you can extract the information you need from the HTML using regular expressions (but beware your sanity), otherwise it might be better to use a module such as HTML::TreeBuilder to do the job.
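For the "simple needs" case, a regex-based extraction might look something like the sketch below. It uses only core Perl; the sample markup and the extract_links helper are made up for illustration, and the pattern assumes double-quoted href attributes with no nested tags inside the link text, which is exactly the kind of assumption that breaks on real pages.

```perl
use strict;
use warnings;

# Pull (href, link text) pairs out of an HTML string with a regex.
# Fragile by design: assumes double-quoted href attributes and no
# nested tags inside the link text.
sub extract_links {
    my ($html) = @_;
    my @links;
    while ($html =~ m{<a\s[^>]*?href="([^"]*)"[^>]*>([^<]*)</a>}gi) {
        push @links, { href => $1, text => $2 };
    }
    return @links;
}

# Hypothetical sample markup, just for illustration.
my $html = <<'HTML';
<ul>
  <li><a href="/dist/Foo">Foo</a></li>
  <li><a class="x" href="/dist/Bar">Bar</a></li>
</ul>
HTML

print "$_->{text} => $_->{href}\n" for extract_links($html);
```

Once the markup gets any more irregular than this, a real parser such as HTML::TreeBuilder is likely to be less painful.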
A module that seems interesting, but that I haven't really tried yet, is WWW::Scripter. It's a subclass of WWW::Mechanize, but has support for Javascript and AJAX, and also integrates HTML::DOM, another way to extract information from the page.

Try the Web::Scraper Perl module. A beginner's tutorial can be found here.
It's safe, easy to use and fast.

You may also want to have a look at my new Perl wrapper over Java HtmlUnit. It is very easy to use, e.g. look at the quick tutorial here:
http://code.google.com/p/spidey/wiki/QuickTutorial
By tomorrow I will publish some detailed installation instructions and a first release.
Unlike Mechanize and the like, you get some JavaScript support, and it is much faster and less memory-hungry than screen scraping through a real browser.

Related

Hyperlinking text to a function perl CGI

I have an option on my social media website to search for users.
After searching for users I display the searched users name and username in plain text. I want to make it so the user can click on the plain text and be redirected to the user's profile they searched for.
So far I searched for CSS which would help me achieve this but all of them link to a url but I want to call a function instead. Is this possible?
No. The functions don't exist outside of the Perl, which runs on the server.
The browser can only interact with the Perl by requesting URLs from the server.
You need to map the URLs on to the functions you want to run.
If you're doing this by hand, you would typically do something like:
my $action = $q->param('action');
if ($action eq "show_user") {
    show_user();
}
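Once there are more than a couple of actions, the if chain is often replaced by a dispatch table. The following is only a sketch of that idea in plain Perl: the handler names, the parameter names and the handle_request helper are all hypothetical, and the parameters are passed in as a plain hash rather than read from a CGI object so that it runs on its own.

```perl
use strict;
use warnings;

# Hypothetical handlers; a real one would look the user up and render HTML.
sub show_user  { my (%p) = @_; return "Profile page for $p{username}" }
sub list_users { return "List of all users" }

# Map the value of the "action" URL parameter to the code that handles it.
my %dispatch = (
    show_user  => \&show_user,
    list_users => \&list_users,
);

sub handle_request {
    my (%params) = @_;
    my $action  = $params{action} // '';
    my $handler = $dispatch{$action}
        or return "Unknown action";
    return $handler->(%params);
}

# e.g. for a link such as script.cgi?action=show_user&username=gettingthere
print handle_request(action => 'show_user', username => 'gettingthere'), "\n";
```

Each search result would then simply be an ordinary HTML link whose URL carries the action and the username.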
Frameworks such as Catalyst, Dancer, and Web::Simple provide routing systems to make this easier.
For example, in Catalyst (probably the most complex of the options I suggested above, but the one I'm most familiar with) you might do something like:
sub show_user : Local : Args(1) {
    # code to run when http://example.com/show_user/gettingthere is requested
    my ($self, $c, $username) = @_;
}
Firstly, you can't put links in plain text. So I suspect that you're actually returning HTML.
Secondly, I'm not sure why you think that CSS would be the right tool for changing how your links work. CSS is for changing appearance, not behaviour.
On most social media web sites, each user has a profile page and that page has a URL. So shouldn't your search results just include those profile page links rather than a call to some Perl function?
All in all, you seem rather confused. I think you should probably have another go at explaining what you're doing - as currently it's really not very clear. Perhaps include some code - that usually clarifies things.

Adding perl to a webpage?

I had posted a question before about asking how to use CGI perl to create webpages with perl.
The general idea I got was no one uses CGI with perl anymore.
So I am sorta stuck then. I would like to use Perl with a small website I want to create, something like a WebGUI. But I don't need anything really complex to start with. One suggestion was something called Catalyst, but that seems to be way more than I need.
Where can I go from here?
I have looked around and I seemed to be getting old pages nothing that really gives me any clear understanding of what is the easiest way to integrate perl with a website.
If you are looking for a quick start-up and easy simple cases, try Dancer. I find it super easy to get something lightweight. Specifically, take a look at the introduction. That should be straightforward enough to build from.
There are plenty of Perl web frameworks. You could also try Mojolicious.
It has no dependencies besides Perl core modules but offers a lot of functionality.
At the same time it allows you to start with minimum knowledge about Perl.
use Mojolicious::Lite;
get '/' => {text => 'Hello World!'};
app->start;
If you want to learn the basics of CGI with Perl, just learn regular Perl and remember that you have to output the HTTP header first, like so:
print "Content-type: text/html; charset=iso-8859-1\n\n";
Then just print the rest of your HTML with print statements.
Keep in mind that the header goes on one line and must be followed by two newlines (i.e. a blank line), and that the string needs double quotes so that \n is interpolated; otherwise the server will generate a 500 error due to the bad header.
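Put together, a complete script along those lines might look like this sketch. The cgi_response helper is just an illustrative name, and it assumes you are sending HTML (hence text/html) in the same charset as above.

```perl
use strict;
use warnings;

# Build a complete CGI response: header line, blank line, then the body.
sub cgi_response {
    my ($body) = @_;
    return "Content-Type: text/html; charset=iso-8859-1\n\n" . $body;
}

# Hypothetical page body, just for illustration.
my $page = <<'HTML';
<html>
  <body><h1>Hello from Perl</h1></body>
</html>
HTML

print cgi_response($page);
```

Running it from the command line first is a quick way to check that the header and the blank line come out as expected before the web server gets involved.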

separate layout from templates in perl cgi::application

I am building a perl cgi::application using html::template.
I am using 7-8 different templates having the same layout - header, footer, left column etc.
How can I separate this html out of the template files into a single layout file. What perl modules do I need in addition to cgi::app and html::template.
Thanks
I agree that Template Toolkit is better.
If you absolutely have to use HTML::Template you can use the TMPL_INCLUDE directive. It'll search your defined template paths or you can specify a full path to another template. It'll process the variables in it as well.
You can create separate template files for the header, footer and such, and in your page templates just TMPL_INCLUDE them. It's less elegant and more repetitive than Template Toolkit's WRAPPER (you'll have to TMPL_INCLUDE in each page several times for all the shared elements), but it'll get the job done.
If you can, invest the time and use Template Toolkit.
I'd switch out HTML::Template for Template Toolkit and make use of its WRAPPER directive.
I don't know Template Toolkit, so I won't comment on which solution is the most convenient.
I can just give you another solution, which depends on the server you're running your CGIs on.
With an Apache server, you can use includes in your HTML:
<!--#include virtual="/includes/header/header.htm"-->
You can include .htm files (static pages) as well as dynamic pages:
<!--#include virtual="/perl/includes/dynamic.pl"-->
But you have to do some Apache tweaking; see Apache Tutorial: Introduction to Server Side Includes.
Hope this will help, or at least give some ideas
Can the people who don't like HTML::Template please say why? While the Wrapper idea seems helpful to this particular poster, there's nothing wrong with the idea of includes: they're more flexible, and many web developers will already be familiar with the concept from non-dynamic publishing.

How could I find files that use certain modules in CPAN?

Some modules on CPAN are excellently documented, others.... not so much, but it's usually easy to discern how to use a module via prior art (e.g modules/tests that used the module you're looking to use). I'm wondering what the best way is to find code that uses the code you're looking to use.
example
I want to use (maybe?) Dist::Zilla::App::Tester for something, but the author has elected not to write any documentation on how to use it, so I'm wondering what the path of least resistance is to find code that already uses it.
please don't answer for this module
Give a man a fish; you have fed him for today. Teach a man to fish; and you have fed him for a lifetime
Try Google Code Search, trying to search for strings like "use Dist::Zilla::App::Tester" (quotes are important).
Use CPANTS - The CPAN Testing Service web site.
Search for the distribution
Click Other dists requiring this
Here is the page for Dist-Zilla
As an aside, you can always read the source by hitting the Source button at the top of the page on search.cpan.org. In this case, the package doesn't have much code to begin with. Also, many big modules these days have ::Cookbook, ::Manual or ::Tutorial documents; Dist-Zilla has one too.
My guess is ::Tester just supplies the dzil test command through its test_dzil sub.
One option is to use Google Code Search (Google for that phrase for a link :) ); unioned with pure googling. Search for "use my::module::name" string.
If the module name is not something well-searchable (e.g. too many hits), may be combine with "
For searches over CPAN, I suggest CPAN Grep over Google code search.
For more complex searches, I'd write a very small program using CPAN::Visitor and a minicpan.
For quick dependency checking, I'd use the not-perfect-but-very-good CPANDB.
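If you already have a pile of unpacked distributions on disk (say, extracted from a minicpan), the "search for use My::Module" idea can also be done locally. This is a rough sketch using only the core File::Find module rather than CPAN::Visitor; the files_using helper, the file extensions it looks at and the command-line usage are my own assumptions.

```perl
use strict;
use warnings;
use File::Find;

# Return the files under $dir that contain a "use $module" or
# "require $module" statement (but not a longer name such as $module::Sub).
sub files_using {
    my ($module, $dir) = @_;
    my @hits;
    find(sub {
        return unless -f $_ && $_ =~ /\.(?:pm|pl|t)\z/;
        open my $fh, '<', $_ or return;
        while (my $line = <$fh>) {
            if ($line =~ /^\s*(?:use|require)\s+\Q$module\E\b(?!::)/) {
                push @hits, $File::Find::name;
                last;
            }
        }
        close $fh;
    }, $dir);
    return sort @hits;
}

# Usage (hypothetical script name): perl find_users.pl Some::Module /path/to/unpacked/dists
if (@ARGV == 2) {
    print "$_\n" for files_using(@ARGV);
}
```

It will miss modules loaded via base, parent or a string require, so treat the output as a starting point rather than a complete list.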

How do I create graphs in Perl on Windows?

How do I use Perl to create graphs?
I'm running a scheduled job that creates text reports. I'd like to move this to the next step (for the management) and also create some graphs that go along with this. Is this possible / feasible? It'd be great if I could do this using Office somehow.
update: solutions i'm going to investigate in this order
Spreadsheet::WriteExcel (this seems to now have changed from the last time i investigated this .... wait, this was suggested by the author of the module. cool.)
GD Graph - this is now available for ActivePerl(wasn't last time i looked)
SVG
Open Charts look interesting.
Chartdirector
GD and GD::Graph are probably your best bets, you can use them to create images that you can then embed into whatever you need.
All of the methods mentioned above are really good, but personally I like SVG::TT::Graph. I really like the power that SVG gives you to draw really nice-looking graphs.
Also you can take a look at Google Charts CPAN module
use Google::Chart;
my $chart = Google::Chart->new(
type => "Bar",
data => [ 1, 2, 3, 4, 5 ]
);
print $chart->as_uri, "\n"; # or simply print $chart, "\n"
$chart->render_to_file( filename => 'filename.png' );
At work we have used the excellent Chartdirector.
It's not free, but is very cheap (maybe 50 bucks or so). The cost is well worth it, as the API and docs are both excellent (way better than GD!), so easily saved more than that amount of my time.
There's also a free version, which includes a small yellow banner advertising the product on each chart - to be honest if this is for personal use, you can go for that as it's really not very intrusive at all.
Chartdirector is available for lots of platforms (Win, Linux, Solaris, BSD, OSX) and has an API for lots of languages, too (Perl, ASP, .NET, Java, PHP, Python, Ruby, C++).
The output is easy on the eye, as you can see at their examples page.
Sorry for blowing my own trumpet, but you might be interested to have a look at some slides I did for a short presentation about Graphing With Perl.
It mentions some of the suggestions here, but also gives you some code snippets that you might be able to use to help you get the most of what you're doing.
Depending on the complexity of your graph, simply generating a command file for Gnuplot—or GraphViz/Dotty, depending on what kind of graph you are referring to—might do the trick?
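As a rough idea of what "generating a command file for Gnuplot" could look like, here is a sketch in core Perl. The gnuplot_script helper, the sample data and the file names are all invented for illustration; the gnuplot commands themselves are the standard png terminal and an inline '-' data block.

```perl
use strict;
use warnings;

# Turn [x, y] pairs into a gnuplot script with the data inlined.
sub gnuplot_script {
    my ($title, $outfile, @points) = @_;
    return join "\n",
        "set terminal png size 800,600",
        "set output '$outfile'",
        "set title '$title'",
        "plot '-' using 1:2 with linespoints notitle",
        (map { "$_->[0] $_->[1]" } @points),
        "e",     # marks the end of the inline data
        "";      # trailing newline
}

# Hypothetical report data: (day, number of jobs run)
my @data = ([1, 12], [2, 17], [3, 9], [4, 21]);
print gnuplot_script('Jobs per day', 'report.png', @data);
```

You would then redirect the output to a file and run gnuplot on it, e.g. perl make_plot.pl > report.gp followed by gnuplot report.gp (script and file names are placeholders).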
The Perl module Spreadsheet::WriteExcel allows you to create Excel workbooks that include charts.
You first have to create the type of chart that you want in Excel and then extract it out using a utility called chartex which is installed with Spreadsheet::WriteExcel.
The chart template can then be added to a new workbook and made to reference new data.
The documentation is here and there are several examples in the charts directory of the distro.
The mechanism is a little inflexible, however, and it is sometimes tricky to get the exact result that you want.
Haven't tried it yet but Chart::Clicker looks quite nifty.
I think it uses the Cairo graphic library (alternative to GD) but is actually built on top of Graphics::Primitive which is an "interesting" graphics agnostic package.
The author in question (GPHAT) seems to be putting together some integrated tools for producing reports... http://www.onemogin.com/blog/582-pixels-and-painting-my-recent-cpan-releases
On a side note... have used both ChartDirector and OFC and both are good (especially if web based).
Spreadsheet::WriteExcel::Chart
You might need something like strawberry or vanilla Perl to get this to compile. Or PPM might have the module.
Tutorial link:
http://search.cpan.org/dist/Spreadsheet-WriteExcel/charts/charts.pod
It won't work with Office, but I really like Chart::OFC which will create Open Flash Charts. Very slick looking and easy to use.
It depends to a great extent on what sort of graphs you want (the look of them) and on the data source. I've had some good results using the YUI Charts and feeding them JSON-style versions of the original source data. Rolling over a live chart for exact values is quite easy, for example. There are plenty of examples on the developer pages.
If you're set on doing this in MS Office you can use the Win32::OLE module to control Excel via OLE. Be warned, that this tends to run slowly and it can be difficult to find documentation for Excel's API. On the plus side, it allows you to do pretty much everything that you can do manually.
Metaprogramming, of course! Output an R script that creates the graph.
PGPlot does great graphs. There are some examples here. It works fine with Perl 5.8.8 but is broken in 5.10.0
Spreadsheet::WriteExcel will let you just get the data into Excel, then write Excel equations for the graphs.