Perl, mod_perl2 or CGI for a web-scraping service? - perl

I'm going to design an open-source web service which should collect ("web-scrape") some data from multiple - currently three - web sites.
The web sites do not expose any web service or API; they just publish web pages.
Data will be collected 'live' on any client's request from all the web sites in parallel, and will then be parsed to XML to be returned to the client.
The server operating system will be Linux.
The clients will initially be just an Android application of mine.
There will possibly be about 100 or more concurrent clients, if the project is successful... ;-).
Currently my preferences go to the adoption of:
Perl (for the service language)
mod_perl2 with ModPerl::Registry (for a fast Perl interpreter embedded in Apache)
perl module CHI::Driver::FastMmap (for a modern and fast cache handler)
perl module Coro (for an async event loop to place many requests in parallel)
Since I suppose the project's requirements can be of general use and interest, and since I am running into many problems with the combined use of Coro and mod_perl2, I ask:
Are my adoption preferences well matched?
Do you see any incompatibilities or potential problems?
Do you have any suggestions to enhance (in this order):
compatibility among components
neatness of the implementation
ease of maintainability
performance

You probably don't want to use mod_perl for any new project anymore. You really want something Plack-based, or maybe even plain Plack itself. If you want to use Coro, an AnyEvent-based backend such as Twiggy may make the most sense (though you may want to put a reverse proxy in front of it).
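To give a sense of what that looks like without Coro, here is a minimal sketch of fetching the three sites in parallel with AnyEvent::HTTP; the URLs are placeholders and the parsing step is omitted.

    use strict;
    use warnings;
    use AnyEvent;
    use AnyEvent::HTTP;

    # placeholder URLs for the three sites to be scraped
    my @urls = ('http://site-a.example/', 'http://site-b.example/', 'http://site-c.example/');

    my $cv = AE::cv;

    for my $url (@urls) {
        $cv->begin;
        http_get $url, sub {
            my ($body, $headers) = @_;
            # parse $body and accumulate results here ...
            $cv->end;
        };
    }

    $cv->recv;    # block until all three requests have completed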

Are you happy sticking with Apache?
If so, forget Coro and let Apache handle the concurrency: preload your modules and configuration, and write a super-efficient Apache request handler. (That's the way I go whenever Apache 2 + mod_perl2 is available; a minimal handler sketch follows this answer.)
If not, start learning Plack, which is server-agnostic.
If you choose the first route, I'd recommend avoiding traditional CGI and instead adopting CGI::Application, which gives you almost the lightness and speed of CGI but with a much nicer, more modern development environment and framework (and is Plack-compatible).
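For the Apache route, a skeletal mod_perl2 response handler might look roughly like this; the My::Scraper package name and the XML payload are placeholders, and the actual scraping is left out.

    # My/Scraper.pm -- a skeletal mod_perl2 response handler
    package My::Scraper;

    use strict;
    use warnings;
    use Apache2::RequestRec ();
    use Apache2::RequestIO ();
    use Apache2::Const -compile => qw(OK);

    sub handler {
        my $r = shift;
        my $xml = '<results/>';    # fetch and parse the three sites here
        $r->content_type('application/xml');
        $r->print($xml);
        return Apache2::Const::OK;
    }

    1;

    # httpd.conf:
    #   PerlModule My::Scraper
    #   <Location /scrape>
    #       SetHandler perl-script
    #       PerlResponseHandler My::Scraper
    #   </Location>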

Related

Is FCGI dead? What is the alternative these days?

I'm a beginner at Perl.
My understanding is as follows:
FCGI is a protocol.
It is a gateway interface between the web server and web applications.
The process stays alive for a certain period (such as 5 minutes) and accepts multiple requests, so responses are fast.
You can cache some data before the worker processes are forked, so the cache is shared by all processes and memory is saved via copy-on-write.
It looks nice.
However, I have never come across FCGI in my experience with modern development using Golang, Nginx, and so on.
Do modern web applications no longer need FCGI?
What were the disadvantages of FCGI, and what advantages does it still have?
Saying that there are better alternatives would be a more accurate statement than declaring anything dead or alive. Even in 2021 I have seen code running with FCGI in production, and it's doing fine. The latest activity on its GitHub repository was in 2019. Everything has a time frame: being old doesn't mean bad or dead, and being newer doesn't mean good or alive.
For modern web development there are many frameworks available right now:
Catalyst
Mojolicious
Dancer2
Kelp
Raisin
The top three are the most common ones; Mojolicious is my personal favorite.
You can use them with Plack/uWSGI and be up and running in no time; they will take care of everything (see the minimal sketch below).
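As an illustration of how little code one of these frameworks needs, here is a minimal Mojolicious::Lite sketch; you can run it with morbo or hypnotoad, and it can also be served via Plack.

    #!/usr/bin/env perl
    use Mojolicious::Lite;

    # respond to GET / with plain text
    get '/' => sub {
        my $c = shift;
        $c->render(text => 'Hello from Mojolicious!');
    };

    app->start;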
Since, as you mention, FastCGI is a protocol and not an implementation, it shouldn't be specific to any language. There are implementations across different languages (maybe not all of them popular); you can find them with a single search. One example is Nginx's FastCGI support.
Several similar questions have been asked here before. Have a look at them; they will give you more clarity.
Is there a speed difference between WSGI and FCGI?
Is mod_perl what I'm looking for? FastCGI? PSGI/Plack?
Perl CGI vs FastCGI
Which is better perl-CGI, mod_perl or PSGI?
To expand slightly on Maverick's answer...
When writing a web application, there are two questions you need to ask yourself:
What language/framework am I going to write this application in?
How am I going to deploy this application?
In the bad old days, the answers to the two questions were intertwined in annoying ways. You might write a CGI program originally and then switch to a different deployment environment (FCGI or mod_perl, perhaps) at a later date when you wanted better performance from your application. But changing to a different deployment environment would usually mean making a lot of changes to your code. The application needed too much knowledge about the deployment environment. And that made life hard.
If you use the frameworks in the other answer, then things are different. Most (maybe all) of those frameworks run on top of a protocol called PSGI (the Perl Web Server Gateway Interface). If you write your application using one of those frameworks, it will interact with the web server via the PSGI protocol, and there are PSGI adaptors available for all web deployment environments. So you can start by deploying your Dancer2 application as a CGI program (very few people do that, but it's completely possible) and then move it to an FCGI or mod_perl environment without changing your code at all.
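To make that concrete, a raw PSGI application is nothing more than a code reference returning a status, headers, and body; this sketch is the whole file.

    # app.psgi -- the smallest possible PSGI application
    my $app = sub {
        my $env = shift;    # the PSGI environment hash
        return [ 200, [ 'Content-Type' => 'text/plain' ], [ "Hello from PSGI\n" ] ];
    };

    $app;    # a .psgi file must return the application code reference

You can run the same file with plackup app.psgi during development and later hand it to CGI, FCGI, mod_perl, or a standalone server through the corresponding Plack handler, without touching the application code.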
I still see people deploying applications in FCGI. It's certainly more common than using mod_perl these days. But, to be honest, the most common deployment environment I see is to set your application up as a standalone service running on a high port number and to use a server like nginx to proxy requests to your service. That's what I'd recommend.

What criteria should I use to evaluate a Perl "app server" (mod_perl replacement)?

Short version:
What criteria should I use to evaluate possible candidates for a Perl "app server" (mod_perl replacement)?
We are looking for some sort of framework which will allow executing various Perl programs repeatedly (as a service) without the costs of:
re-launching the Perl interpreter once per execution
loading/compiling Perl modules once per execution
(avoiding both of these is the benefit that running under mod_perl provides)
Notes:
We do NOT care much about any additional benefits afforded by mod_perl, such as deep Apache integration.
This will be a pure app server, meaning there is no need for any web-specific functionality (it's not a problem if the app server provides it, it just won't be needed).
We will of course consider the obvious criteria (raw speed, production-ready stability, active development, ability to run on OSs we care about). What I'm interested in is less trivial and subtle things that we may wish from such a framework/server.
Background:
At $work, the powers that be decided that they want to replace the current situation (simple webapps being developed in Embperl and deployed via Apache/mod_perl).
The decision was made to use a (home-grown) MVC system that will have a Java Spring front end for the View; the Controller will parcel out back-end service requests to per-app services that perform Model duties (don't get hung up on the details of this - it's not very relevant to the main question).
One of the options for back-end services is Perl, so that we can leverage all our existing Perl IP (libraries, webapp backend code) going forward, and not have to port 100% of it to Java.
To summarize:
|     | View    | Model/app | Model loaded/executed by:                           |
================================================================================
| OLD | Embperl | Model.pm  | mod_perl has Model.pm loaded, called from view.epl  |
| NEW | Java    | Model.pm  | perl generic_model.pl -model Model (does "require") |
================================================================================
Now, those of you who have done Perl web development for a while will immediately notice the most glaring problem with the new design:
|     | Perl interpreter starts  | Perl modules are loaded and compiled |
=======================================================================
| OLD | Once per mod_perl thread | Once per mod_perl thread             |
| NEW | Once per EVERY! request  | Once per EVERY! request              |
=======================================================================
In other words, in the new model, we no longer have any performance benefits afforded by mod_perl as a persistent server side app container!!!
Therefore, we are looking at possible app containers to serve the same function.
(as a side note, yes, we thought about simply running an instance of Apache with mod_perl as such an app container, as a viable possibility. However, since web functionality is not required, I'd like to see if any other options may fit the bill).
Starman is a high-performance preforking PSGI/Plack web server that can be used in that context. It's easy to build a REST application that serves stateless JSON objects (this is a simple use case; see the sketch below).
Starman is production-ready, and it's really easy to run a set of Starman instances behind a reverse proxy (this SO question may help you) for scaling purposes.
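For the stateless-JSON use case, the whole service can be a sketch like the following (the payload and port are placeholders); you would start it with something like starman --workers 8 --listen :5000 app.psgi and point the reverse proxy at that port.

    # app.psgi -- a trivial stateless JSON responder to run under Starman
    use strict;
    use warnings;
    use JSON::PP ();

    my $json = JSON::PP->new->canonical;

    my $app = sub {
        my $env  = shift;
        my $body = $json->encode({ status => 'ok', path => $env->{PATH_INFO} });
        return [ 200, [ 'Content-Type' => 'application/json' ], [ $body ] ];
    };

    $app;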
I think you've already identified what you need to know and what to test: execution time versus memory. You also need to evaluate the flexibility and ease of deployment that you get with mod_perl and the big win of preserving the usefulness of software you've already developed (and the accumulated experience of your staff). Remember you can easily separate services by CPU and machine if your new front end is going to be talking to your applications inside your own network. A lot depends on how "web-ish" you can make your services and if they can be efficiently "daemonized". Of course there's lots of ways for web front ends to talk to other services and perl can handle pretty much all of them ... TIMTOWTDI.
Since you mention "tuits" (i.e. "manpower") as a constraint, perhaps the best approach in the short term is to set up a separate apache - mod_perl instance as an "application container" and run your applications that way (since they run well that way already, is this correct?). After all, apache (and mod_perl) are well known, battle tested, and eminently tweakable and configurable. The deployment options are pretty flexible (same machine, different machine(s), failover, load balancing, cloud, local, VMs) and they have been well tested as well.
Once you get that running you could then begin experimenting with various "low manpower required" approaches to magically daemonizing your applications and services without Apache. Plack and Starman have been mentioned; Mojolicious is another framework that is capable of working with JSON, WebSockets, etc. (and Plack). These too have been well tested, but they are perhaps less familiar than mod_perl and Apache. Still, if you are a Perl shop you shouldn't have difficulty working with these "modern" tools. One day, if you do end up with more resources, you could build out a sophisticated networked platform for all your Perl-based services and use all the cool "new" (or at least newer than mod_perl) stuff on CPAN like POE, Plack, etc. You might have a lot of fun developing cool stuff as you solve your business problem.
To clarify my earlier comment: Ubic (see Ubic on MetaCPAN) can daemonize (and thus precompile) your Perl tools, and it offers some monitoring and management facilities that come for free with the framework. There is one Ubic module designed for use with Plack, called Ubic::Service::Plack. In and of itself Ubic does not provide an easy solution for your new Java/Spring application to talk to your Perl applications, but it might help manage/monitor whatever solution you come up with.
Good luck and have fun!
You can create a simple daemon using HTTP::Daemon and keep all the benefits of compiling the necessary parts of your code lazily (with require) or in advance, before the daemon starts.
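A minimal HTTP::Daemon accept loop looks roughly like this sketch; the port and the canned response are placeholders, and a real service would dispatch to your preloaded model code instead.

    use strict;
    use warnings;
    use HTTP::Daemon;
    use HTTP::Response;
    use HTTP::Status qw(RC_FORBIDDEN);

    # preload/compile your model modules here, before the accept loop starts

    my $d = HTTP::Daemon->new(LocalPort => 8080) or die "cannot listen: $!";
    print "Listening at ", $d->url, "\n";

    while (my $c = $d->accept) {
        while (my $r = $c->get_request) {
            if ($r->method eq 'GET') {
                $c->send_response(
                    HTTP::Response->new(200, 'OK',
                        [ 'Content-Type' => 'text/plain' ], "hello\n")
                );
            }
            else {
                $c->send_error(RC_FORBIDDEN);
            }
        }
        $c->close;
        undef $c;
    }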

Web Programming For The Non-Web Programmer (in Perl)

I am looking to start Web Programming in Perl (Perl is the only language I know). The problem is, I have no prior knowledge of anything to do with the web, except surfing it. I have no idea where to start.
So my question(s) is...
Where do I start learning Web Programming? What should I know? What should I use?
I thank everybody in advance for answering and helping.
The key things to understand are:
What you can send to browsers
… or rather, the things you intend to send to browsers, but having an awareness of what else is out there is useful (since, in complex web applications in particular, you will need to select appropriate data formats).
e.g.
HTML
CSS
JavaScript
Images
JSON
XML
PDFs
When you are generating data dynamically, you should also understand the available tools (e.g. the Perl community has a strong preference for TT for generating HTML, but there are other options such as Mason, while JSON::Any tends to be my go-to for JSON).
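For instance, generating HTML with TT (Template Toolkit) boils down to a sketch like this; the inline template is just for illustration, since real applications normally keep templates in separate files.

    use strict;
    use warnings;
    use Template;

    my $tt = Template->new() or die Template->error;

    # a trivial inline template; normally this would live in a .tt file
    my $template = "<p>Hello, [% name %]!</p>\n";

    $tt->process(\$template, { name => 'World' })
        or die $tt->error;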
Transport mechanisms
HTTP (including what status codes to use and when, how to do redirects, what methods (POST, GET, PUT, etc) to use and when).
HTTPS (HTTP with SSL encryption)
How to get a webserver to talk to your Perl
PSGI/Plack if you want modern and efficient
CGI for very simple
mod_perl if you want crazy levels of power (I've seen someone turn the Apache HTTPD into an SMTP spam filter using it).
Security
How to guard against malicious input (which basically comes down to knowing how to take data in one format (such as submitted form data) and convert it safely to another (such as HTML or SQL)).
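As a small illustration of that format conversion, here is a sketch using HTML::Entities for HTML output and DBI placeholders for SQL; the input string, database (SQLite via DBD::SQLite), and table name are all made up for the example.

    use strict;
    use warnings;
    use DBI;
    use HTML::Entities qw(encode_entities);

    my $user_input = '<script>alert(1)</script>';    # pretend this came from a form

    # 1. escape before interpolating into HTML, so markup in the input is neutralised
    my $safe_html = encode_entities($user_input);
    print "<p>Hello, $safe_html</p>\n";

    # 2. use placeholders so user data never becomes part of the SQL text
    my $dbh = DBI->connect('dbi:SQLite:dbname=example.db', '', '', { RaiseError => 1 });
    my $sth = $dbh->prepare('SELECT id FROM users WHERE name = ?');
    $sth->execute($user_input);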
Web Frameworks
You can push a lot of work off to frameworks, which provide structured ways to organise a web application (a tiny example follows the list below).
Web::Simple is simple
Dancer seems to be holding the middle ground (although I have to confess that I haven't had a chance to use it yet)
Catalyst probably has the steepest learning curve but comes with a lot of power and plugins.
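To give a feel for the lighter end of that spectrum, a complete application in Dancer2 (the current incarnation of Dancer) can be as short as this sketch.

    #!/usr/bin/env perl
    use Dancer2;

    # one route: GET / returns plain text
    get '/' => sub {
        return 'Hello, world!';
    };

    dance;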
Depending on the complexity of your project, you could have a look at Catalyst MVC. It is a good framework that takes care of most of the request handling for you, while still giving you enough insight into what's going on.
There is a good tutorial on CPAN.
If you want to start with mod_perl or CGI, there are also some tutorials:
mod_perl
CGI Doc
If you're looking to try some web programming in Perl, you could try hosting a Dancer app for free on OpenShift Express.
There's even a "Dancer on OpenShift Express" repo to get you started: https://github.com/openshift/dancer-example

HTTP::Proxy for pen testing tasks

Could someone explain how the HTTP::Proxy module compares to other proxies like Paros and Burp proxy, and whether anyone uses it in their work; specifically, is it used by the pen-testing community for real jobs?
HTTP::Proxy is a Perl module that you can use to build your own intercepting (and other) proxies. Since it has the entire Perl language at its disposal, it can be very powerful, but it requires you to know what you're doing and to develop a fair amount of code yourself.
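For reference, a bare-bones interception setup with HTTP::Proxy looks roughly like this sketch, based on the module's documented filter interface; the port is arbitrary and the filter only logs the request line.

    use strict;
    use warnings;
    use HTTP::Proxy;
    use HTTP::Proxy::HeaderFilter::simple;

    my $proxy = HTTP::Proxy->new(port => 3128);

    # log each outgoing request's method and URI
    $proxy->push_filter(
        request => HTTP::Proxy::HeaderFilter::simple->new(
            sub {
                my ($self, $headers, $message) = @_;
                print STDERR $message->method, ' ', $message->uri, "\n";
            }
        ),
    );

    $proxy->start;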
The other proxies you mention (Burp, Paros, WebScarab, etc.) have the advantage of being built for the specific purpose of intercepting and tampering with data flows in HTTP sessions. Many of them can man-in-the-middle SSL/TLS sessions, which HTTP::Proxy won't do out of the box. They are by far the preferable tools for pentesting work.
Though a Perl program using HTTP::Proxy could be very extensible, several of the pre-built proxies offer scripting capabilities. WebScarab, for instance, allows scripting with BeanShell.
In short, these are really different tools for different purposes.

mod_perl vs mod_fastcgi

I'm developing a web app in Perl, with some C as necessary for heavy-duty number crunching. The main problem I'm having so far is deciding whether I should use mod_perl, mod_fastcgi, or both to run my scripts, because I'm having a difficult time analyzing the pros and cons of each.
Can anyone post a summary or give a link where I can find some comparison information and perhaps some recommendations with examples?
They are quite different beasts.
mod_fastcgi (by the way, mod_fcgid is recommended) just supports the FCGI protocol to execute CGIs faster, with some knobs to control how many processes it will run simultaneously, and not much more.
mod_perl, on the other hand, is a platform for application development that exposes most of Apache's internals to you, so you can tweak every webserver knob from your code, accelerate CGIs, and much more.
If all you wish is to run your CGIs quickly, and want to support as many hosts as possible, you should stick with supporting those two ways to run your code and probably standard CGI as well.
If you care about maximum efficiency at the cost of flexibility, you could aim for a single platform, probably mod_perl.
But probably the sanest option is to be able to run everywhere and use a framework, like Catalyst, to build the application; it will take care of exploiting the advantages of whichever execution environment is present.
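To illustrate how small the difference can be on the code side, the classic FastCGI pattern in Perl is just a loop around what would otherwise be your CGI code; this sketch uses CGI::Fast.

    use strict;
    use warnings;
    use CGI::Fast;

    # everything outside the loop is compiled once per process;
    # everything inside it runs once per request
    while (my $q = CGI::Fast->new) {
        print $q->header('text/plain');
        print "Hello from FastCGI\n";
    }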
I would advise you to use a framework such as Catalyst that takes care of such details. For most applications, it doesn't matter how the program connects to the webserver, as long as it is done in an efficient way. The choice between mod_perl and FastCGI should be made by the sysadmin who deploys it, not the developer.
Here is a site with some actual performance comparisons of mod_perl, mod_fastcgi, cgi (Perl) and a Java servlet - for a very basic script: https://sites.google.com/site/arjunwebworld/Home/programming/apache-jmeter
In summary:
cgi - 1200+ requests per minute
mod_perl - 6000+ requests per minute (ModPerl::PerlRun only)
fast_cgi - 6000+ requests per minute
mod_perl - 6000+ requests per minute (ModPerl::Registry)
servlets - 2438 requests per minute.
There is an old thread on PerlMonks comparing mod_perl and fastcgi here: http://www.perlmonks.org/?node_id=108008