Perl web crawling framework - perl

I've been using Perl for years to crawl and scrape for various purposes. One thing that has always bugged me is that, while there are tons of great CPAN modules for small-scale scraping and crawling (LWP, WWW::Mechanize, Web::Scraper, AnyEvent::HTTP, and now Mojo::UserAgent), there don't appear to be any crawling frameworks of the kind that exist for other languages.
For example, Apache Nutch (and Droids) for Java, and Scrapy for Python.
Does anybody know of any Perl projects that are equivalent?

You probably need to take a look at modules such as HTML::Robot::Scraper or
HTTP::UserAgentString::Robot and I think there are a few more with robot in their name.

Related

Is there any general purpose perl module with helpers for web applications?

Is there a general purpose Perl Module with helpers for web development? Like form-builders, url helper, etc?
Based on my searches, I couldn't find anything. I'm already using Mason, but do not want to define my helpers as components from scratch.
Of course, more than you can shake a stick at.
The CGI module includes support for building forms, and other formatting things, if you want something very basic/low-level.
If you want more of a framework, there are dozens or more. pokey909's answer is a good place to start if that's your goal.
And there are hundreds or more special-use modules related to specific aspects of web applications. Search CPAN for these more specific ones.
Your question is open-ended and it sounds like you want to eat your framework and have it. Still, here are some starting points.
Form helpers
In approximately the order of preference:
HTML::FormHandler
HTML::FormFu
Rose::HTML::Objects
CGI::FormBuilder (I think this is abandoned)
URL helpers
URI
URI::QueryParam
URI::Escape
Path::Class
Mojolicious has some discrete pieces that might be worth looking at too. Even if you don't use a framework, you should look at them and the choices they make for plugins and helpers.
I can't see how you couldn't find anything else. Just google for "perl web framework".
There are quite a few; have a look here:
https://www.socialtext.net/perl5/web_frameworks

How can I automate website testing with Perl? [duplicate]

Possible Duplicate:
Integration Testing for a Web App
I've recently delved into Perl and I'd like to learn how to automate web site testing. To that end I'd appreciate answers to the following:
What modules from CPAN should I look into? Is there something similar to Watir (I know there's a Windows only port of it in Perl)?
I'm using a Windows system, so does it matter if I use ActivePerl or Strawberry Perl?
What books should I look into (aside from "Perl Testing: A Developer's Notebook")?
Edit:
The web application is written in Java
I want to test the application's web GUI, which makes use of javascript/ajax & some flex/flash
WWW::Mechanize, WWW::Selenium and LWP::Simple just to name a few.
ActivePerl is more user-friendly than Strawberry Perl when it comes to downloading CPAN modules, but at the end of the day they are equally good.
I haven't come across any books on the subject; perhaps a tutorial is what you need.
Are you interested in testing the web GUI itself via HTTP calls, or individual Perl modules by calling Perl code directly?
For the former, Selenium - which AFAIK is not Perl specific at all - seems to be the accepted Best Practice (see Integration Testing for a Web App link helpfully provided by Ether ). It has a Perl CPAN module for integration, mentioned in the same SO question.
You can of course do your own testing frameworks, e.g. implement http calls via WWW::Mechanize. But that would not work with JavaScript-enabled web sites (are there any left that aren't?) since you need a JavaScript engine, either provided by the browser (Selenium's approach) or something embedded which AFAIK Perl doesn't actually have, though Java does.
However, there are non-http approaches to testing Perl side code of the web apps as well:
Have all non-presentation logic nicely modularized away, and test it as usual using Test::More or your other favorite test frameworks.
Test presentation (or presentation+logic) by using your web framework in command line mode, emulating running on web server. CGI.pm allows that, as well as EmbPerl. Not sure about Catalyst apps.

How do I create CGI scripts with Perl?

I am creating a website and am done with some of the HTML, but I am thinking of building the site using CGI and Perl scripting. I don't know much about CGI scripting. Can anyone please suggest how to create a CGI script and how to create web pages with it?
Try Ovid's CGI Course - it's a good start. "A Beginner's Introduction to Perl Web Programming" is also good reading, though it covers less.
Later you can try a framework like CGI::Application - it should be easy even for a beginner.
Have a look at A Beginner's Introduction to Perl Web Programming on perl.com.
I would recommend you make use of tools like Template Toolkit or Markapl to make the process of producing and maintaining your HTML view much simpler.
And moving forward you might want to consider looking at using a web framework ranging from something very light like Squatting to a powerful beast like Catalyst.
CGI is a standard for interfacing a web server with a content provider or external application (say, a Perl script or a compiled program). I think you had better study Perl first in order to code some CGI.
This link is a beginner guide to get started with CGI coding in Perl.
Also, if you are a beginner in Perl, you might look at another scripting language more dedicated to the web; the best example is PHP. But if you are already experienced with Perl, CGI.pm is nice too.
As far as I understand your question, you say that you wrote your own web server and now want to implement CGI!?
These links should help you:
CGI Specification
How to implement CGI in a Web Server
Edit: Check the original question before downvoting me.

How do I use mod_perl2 and Apache Bucket Brigades?

I'm writing an application to do proxying and rewriting of webpages on the fly and am pretty settled on using mod_perl2 - there is an existing implementation using mod_perl (v1) that I'm working from. In mod_perl2, there's this idea of APR::Brigades and APR::Buckets which, from my vague understanding, are an efficient way to do the sort of filtering & rewriting that I want. I can't, however, find anything but the Perldoc pages for these modules, so I'm really quite unsure how to utilize them.
Can anyone explain mod_perl2 Bucket Brigades to me, point me to a tutorial, or even show me some open-source app that uses mod_perl2 that I could learn from?
Buckets and brigades are a native concept of the Apache Portable Runtime. You'll find ample examples of the native API, with an HTTP-specific slant, in the source code for Apache HTTP Server modules like mod_proxy, mod_deflate, and mod_substitute.
See the filter info here:
http://www.apachetutor.org/dev/#filter
Then take a peek at the previously mentioned Apache HTTP Server modules.
There seems to be a simple perl-specific filter here:
http://perl.apache.org/docs/2.0/user/handlers/filters.html#Bucket_Brigade_based_Output_Filters

Do you need a framework to write Ruby or Python code for the web?

Every time I see Ruby or Python discussed in the context of web development, it's always with a framework (Rails for Ruby, Django for Python). Are these frameworks necessary? If not, is there a reason why these languages are often used within a framework, while Perl and PHP are not?
I can only speak towards Ruby - but, no, you don't need a framework to run Ruby based pages on the web. You do need a ruby enabled server, such as Apache running eruby/erb. But, once you do, you can create .rhtml files just like RoR, where it processes the inline ruby code.
The short answer is no, they are not necessary. In ruby you have .erb templates that can be used in a similar way as you use PHP pages. You can write a site in ruby or Python using several technologies (Rails-like frameworks, Templates or even talking directly with the HTTP library and building the page CGI-style).
Web frameworks like Python's Django or Ruby's Rails (there are many) just raise the level of abstraction above that of plain PHP or ASP, and automate several processes (like login, database interaction, and REST APIs), which is always a good thing.
"Need" is a strong word. You can certainly write Python without one, but I wouldn't want to.
Python wasn't designed (as PHP was, for example) as a direct web scripting language, so common web-ish things like connecting to databases aren't native, and frameworks are handy.
EDIT: mod_python exists for Apache, so if you're merely looking to write some scripts, then Python doesn't need a framework. If you want to build an entire site, I'd recommend using one.
From a Pythonic point of view, you'd absolutely want to use one of the frameworks. Yes, it might be possible to write a web app without them, but it's not going to be pretty. Here's a few things you'll (probably) end up writing from scratch:
Templating: unless you're writing a really really quick hack, you don't want to be generating all of your HTML within your Python code -- this is a really poor design that becomes a maintainability nightmare.
URL Processing: splitting a URL and identifying which code to run isn't a trivial task. Django (for example) provides a fantastic mechanism to map from a set of regular expressions to a set of view functions.
Authentication: rolling your own login/logout/session management code is a pain, especially when there's already pre-written (and tested) code available
Error handling: frameworks already have a good mechanism in place to a) help you debug your app, and b) help redirect to proper 404 and 500 pages.
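To make the URL processing point concrete, here is a rough sketch of the idea of mapping regular expressions to view functions. It is not Django's actual API, just the bare technique using the standard `re` module; the view names (`home`, `article`), the `ROUTES` table and the `dispatch` helper are all made up for illustration.

```python
import re

# Hypothetical view functions; each returns the body of a page.
def home(path):
    return "home page"

def article(path, year, slug):
    return f"article {slug} from {year}"

# Route table of (compiled pattern, view). Checked in order; first match wins.
ROUTES = [
    (re.compile(r"^/$"), home),
    (re.compile(r"^/articles/(?P<year>\d{4})/(?P<slug>[\w-]+)/$"), article),
]

def dispatch(path):
    """Return (status, body) for a URL path."""
    for pattern, view in ROUTES:
        match = pattern.match(path)
        if match:
            # Named groups in the pattern become keyword arguments of the view.
            return 200, view(path, **match.groupdict())
    return 404, "not found"
```

Even this toy version has to decide on match order, trailing slashes and what a 404 looks like, which is the kind of detail a framework has already settled for you.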
To add to this, the framework libraries are all heavily tested (and fire tested). Additionally, there are communities of people who are developing using the same code base, so if you have any questions, you can probably find help.
In summary, you don't have to, but unless your project is "a new web framework", you're probably better off using one of the existing ones instead.
Framework? Heck, you don't even need a web server if you're using Python, you can make one in around three lines of code.
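For what it's worth, a minimal sketch of that using the Python 3 standard library (`http.server`; in Python 2 the equivalent module was `SimpleHTTPServer`). The helper name `make_server` and the default port 8000 are arbitrary choices for this example.

```python
from http.server import HTTPServer, SimpleHTTPRequestHandler

def make_server(port=8000):
    """Build a server that serves files from the current directory on localhost."""
    return HTTPServer(("127.0.0.1", port), SimpleHTTPRequestHandler)
```

Calling `make_server().serve_forever()` starts it; running `python -m http.server` from the command line does much the same thing. This is fine for poking around locally, but it is not something to put in front of real traffic.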
As to the why:
The most plausible thing I can think of is that Perl and PHP were developed before the notion of using frameworks for web apps became popular. Hence, the "old" way of doing things has stuck around in those cultures. Ruby and Python became popular after frameworks became popular, hence they developed together. If your language has a good framework (or more than one) that's well supported by the community, there's not much reason to try to write a Web App without one.
A framework isn't necessary per se, but it can certainly speed development and help you write "better" code. In PHP, there are definitely frameworks that get used like CakePHP, and in Perl there are many as well like Mason and Catalyst.
The frameworks aren't necessary. However, a lot of developers think frameworks ease development by automating a lot of things. For example, Django will create a production-ready backend for you based on your database structure. It also lets you incorporate various plugins if you choose. I don't know too much about Rails or Perl frameworks, but PHP frameworks such as Zend, Symfony, CodeIgniter, CakePHP, etc. are widely used.
Where I work, we rolled our own PHP framework.
Are these frameworks necessary?
No. They, like any 'framework', are simply for speeding up development time and making the programmer's job easier.
If not, is there a reason why these languages are often used within a framework, while Perl and PHP are not?
PHP and Perl were popular languages for building web sites well before the idea of using frameworks was. Frameworks like Rails are what gave Ruby its following. I'm not sure that Python or Ruby were that common as web languages before they were backed by frameworks.
These days, even PHP/Perl web development should be backed by a framework (of which there are now many).
By no means are those development frameworks required. But as with most development environments, your productivity will increase exponentially if you have a supported framework to reference and build your applications on. It also decreases the training needed to bring others up to speed on your applications if they already have a core understanding of the framework that you use.
For Python, the answer is no, you don't have to. You can write Python directly behind your web server very easily; take a look at mod_python for how to do it.
A lot of people like frameworks because they supply a lot of the boilerplate code in a reliable form so you don't have to write it yourself. But, like any code project, you should choose the tools and frameworks on their merit for your problem.
You can certainly write CGI scripts in either language and do things "raw".
The frameworks (ideally) save the trouble of writing a pile of code for things that other people have already handled (session handling, etc.).
The decision probably comes down to what you need to do. If the framework has the features you need, why not use it. If the framework is going to require extensive modifications, it might be easier to roll your own stuff. Or check out a different framework.
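As an illustration of the "raw" approach, here is a small sketch of what a CGI script boils down to in Python: read the request from environment variables, then print headers, a blank line, and the body. The function name `render_cgi_response` and the `name` query parameter are invented for this example.

```python
import html
from urllib.parse import parse_qs

def render_cgi_response(environ, out):
    """Write a complete CGI response (headers, blank line, body) to `out`."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    # "name" is a hypothetical query parameter, used only for illustration.
    name = query.get("name", ["world"])[0]
    print("Content-Type: text/html; charset=utf-8", file=out)
    print(file=out)  # a blank line separates headers from the body
    # Escape user input before echoing it back into HTML.
    print(f"<html><body><h1>Hello, {html.escape(name)}!</h1></body></html>", file=out)
```

In an actual CGI script you would call it as `render_cgi_response(os.environ, sys.stdout)`. Everything beyond this (sessions, routing, templates) is the pile of code a framework would otherwise hand you.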
The Python standard library has numerous modules for doing CGI, parsing HTML, cookies, WSGI, etc.:
http://docs.python.org/library/index.html
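For example, a bare WSGI application needs nothing outside the standard library. The sketch below is only an assumption about what a minimal app might look like; `application` is the conventional name for the callable, not a requirement.

```python
def application(environ, start_response):
    """Minimal WSGI app: echo the requested path back as plain text."""
    path = environ.get("PATH_INFO", "/")
    body = f"You asked for {path}\n".encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    # WSGI expects an iterable of byte strings.
    return [body]
```

To try it locally, `wsgiref.simple_server.make_server("", 8000, application).serve_forever()` will serve it with the reference server that ships with Python.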
PHP has a lot of frameworks, probably more than most. In Ruby most people use Rails, so that's what you hear about, and for Python, Django is mentioned more often than not.
But with PHP you have many to choose from.
List of web application frameworks
Any language that can "print" can be used to generate web pages, but frameworks handle a lot of the HTML generation for you. They let you concentrate more on the content and less on the details of coding the raw HTML.