Is Perl a good option for heavy text-processing? - perl

I have this web application which needs to do several heavy text processing tasks: removing certain characters, parsing XML files, among others. Some of them involve regular expressions.
The web application has some implementations in Java and others in PHP. Is it worth using Perl or other specific text processing language for such tasks, or is there really no difference with using PHP?
I even thought of using sed, awk, or maybe even some compiled C programs to process the text. There's a lot of text to be processed...

Yes, Perl is a good option. As a language, it's definitely more suitable for those kinds of tasks than Java or PHP. If you have the Perl knowledge, I would recommend it for this kind of task.
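To give a feel for the kind of task described in the question, here is a minimal sketch using only core Perl. The helper names clean_text and word_counts and the sample string are made up for illustration; they are not from any module.

```perl
use strict;
use warnings;

# Hypothetical helper: delete ASCII control characters (keeping tab,
# newline and carriage return), squeeze spaces/tabs, and trim both ends.
sub clean_text {
    my ($text) = @_;
    $text =~ tr/\x00-\x08\x0B\x0C\x0E-\x1F//d;   # delete control characters
    $text =~ s/[ \t]+/ /g;                      # squeeze runs of spaces/tabs
    $text =~ s/^\s+|\s+$//g;                    # trim leading/trailing space
    return $text;
}

# Hypothetical helper: count words case-insensitively with a regex loop.
sub word_counts {
    my ($text) = @_;
    my %count;
    $count{ lc $1 }++ while $text =~ /(\w+)/g;
    return \%count;
}

my $raw    = "  Perl\x07 is   good;\tPerl is fast.  ";
my $clean  = clean_text($raw);
my $counts = word_counts($clean);

print "$clean\n";
printf "%s: %d\n", $_, $counts->{$_} for sort keys %$counts;
```

tr/// is usually the cheapest way to delete or map single characters, while s/// handles anything that needs a pattern.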

I too suggest you use Perl; it's made for text crunching.
However, if you are going to parse or process XML, please don't try to roll your own solution. There are several high-quality modules that do the job correctly. As a starter, I recommend you take a look at XML::Twig.
Also, for regular expressions, there are dozens of ready-made ones in the Regexp::Common distribution. Most probably you'll find what you need there, and it will save you time.

Perl is THE language for text processing. It was designed with this in mind.

Text processing is exactly what Perl was created for. After all, it's the Practical Extraction and Report Language. On the other hand, for a web application I'd prefer Python.

Yes, Perl was designed with processing text in mind.
It has tons of useful text processing features, and it was the first language I used (long ago) that had regular expressions.
http://en.wikipedia.org/wiki/Perl

Yes. Text processing is Perl's #1 strong point. Since you will be integrating it into your existing app, you'll need to execute an external program, so think about how to run it securely, and perhaps as a background process (to avoid startup delays in your real-time web app).

Related

How can I hide Perl code?

I've written some Perl programs and am planning on distributing them. They're part of a large binary distribution (mostly compiled C/C++). If possible, I'd prefer to give up as little as possible (I'm responsible for delivering working software, not delivering clever algorithms). What is my best bet for hiding the Perl code so that if someone really wants to see the source, they'd have to put in a bit more effort than simply opening the file in an editor?
You could encrypt your code, then decrypt it at run time and send it to perl's stdin. (Of course, the decryptor itself would not be encrypted.)
I got some minify/compile answers to my question How can I compile my Perl script so to reduce startup time?
Acme::Bleach
Filter::Crypto (potentially via PAR::Filter::Crypto) is clearly the most advanced open source tool for this job (barring perlcc which doesn't work well for many things, YMMV).
If all you want is hide the code from casual tinkerers, that's more than sufficient. Hiding it from determined and/or capable people is practically impossible.
It won't make it harder to just open the files but an obfuscator can make it more difficult to understand and modify your code. Have a look here or here for a start.

Prerequisites for Perl

I am new to developing a website using Perl CGI scripts.
What are the prerequisites for Perl?
Please give me some basic ideas or some good tutorials.
Do you mean: what prerequisite knowledge would help you learn Perl?
The Perl language grew out of the best features of the Unix commands grep, awk, sed, and shell scripts, plus some C. If you are strong in those areas, Perl will seem like a good fit. You'd already be used to cryptic variable names, quick-n-dirty looping constructs, dynamic typing, regular expressions, and standard input/output.
First decide if you are writing a CGI script or an Apache handler. The 2 are coded completely differently, but both have their advantages and disadvantages.
If you want to run your script using CGI (or FCGI), I recommend this tutorial for beginning programmers. It will help you through creating simple forms. If you already know a bit about programming in a scripting language, you will be able to skim a lot of it. If you are a new programmer, you might also want to look up a few alternatives, like coding CGI scripts with Python, a language often thought to be better suited to new programmers.
I would point you towards learn.perl.org, the Modern Perl book by chromatic, and the overview of web frameworks on the Perl 5 wiki. Also make sure that the tutorials and articles you use are relatively recent. Perl has been around for a long time and modern Perl is different from what people wrote 10 years ago.
I suggest starting with PSGI. It is somewhat new (compared to mod_perl), but it's Perl's answer to Python's WSGI. Most of the large frameworks support it, as do other interesting projects like Starman. I would start there and Google anything I don't understand.
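For orientation, a PSGI application is just a code reference that receives an environment hash and returns a three-element array reference: status, headers, body. The sketch below uses no modules at all; the sub name app and the hand-made environment are illustrative assumptions.

```perl
use strict;
use warnings;

# A PSGI app takes an environment hash reference and returns
# [ status, [ header pairs ], [ body chunks ] ].
sub app {
    my ($env) = @_;
    my $path = $env->{PATH_INFO} || '/';
    return [
        200,
        [ 'Content-Type' => 'text/plain; charset=utf-8' ],
        [ "Hello from PSGI, you asked for $path\n" ],
    ];
}

# Exercise the app directly with a hand-made environment (no server needed).
my $res = app({ REQUEST_METHOD => 'GET', PATH_INFO => '/demo' });
print "Status: $res->[0]\n";
print @{ $res->[2] };
```

In a real deployment the file would typically end with the code reference (here \&app) as its last expression and be started by a PSGI server, for example with plackup from the Plack distribution.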

Should I migrate from CGI.pm to CGI::Simple?

I just noticed CGI::Simple while looking something up for the CGI.pm module. Should I be using CGI::Simple instead? What will it offer me over CGI.pm, which I've used for eight years? I see that CGI::Simple doesn't do HTML generation; what should I be using for that? And will it integrate with CGI::Simple by allowing me to make form values persist, as CGI.pm does?
I think it boils down to this line from the docs: "In practical testing this module loads and runs about twice as fast as CGI.pm depending on the precise task."
If you aren't concerned by the speed of your CGI program, I think it is safe to ignore this module. If you are concerned with speed I would suggest you look into CGI::Fast first.
I have rarely used the HTML generation facilities of CGI.pm. For that, I prefer HTML::Template, usually in conjunction with CGI::Application. CGI::Application can use any $cgi object, specified in the call to its constructor.
I think CGI still has its place. I like CGI::Simple because it provides a clean OO interface.
I maintain CGI.pm and have helped patch CGI::Simple as well. I've looked at the code for both in depth and have benchmarked them. I think there are minimal benefits to switching to CGI::Simple. You will find some headaches in the process, like incompatible syntax for handling file uploads that would need to be changed.
I agree with the sentiment of some others here that if you are going to move forward, you should look beyond either of these. I recommend looking towards something that natively works with PSGI.
I'm kinda surprised you're still using CGI at all. Consider a more adult framework like a Catalyst/TT/DBIx stack.
You might try CGI::Simple for new things, but otherwise let sleeping dogs lie. If your old programs are working, leave them alone. :)
CGI.pm has a good install base: most Perl installs have it. A refactored and slightly minimized CGI::Simple doesn't really do it for me. I would hate to reach a point where I ended up needing CGI.pm for something and had to maintain both.
I find CGI's HTML generator a great tool: it handles escaping and encoding and produces solid, compliant HTML.
As you seem ready to migrate, please stop writing dirty old CGI-based scripts. Use instead a modern and clean web engine such as Dancer or Mojolicious.

How can I add internationalization to my Perl script?

I'm looking at introducing multi-lingual support to a mature CGI application written in Perl. I had originally considered rolling my own solution using a Perl hash (stored on disk) for translation files but then I came across a CPAN module which appears to do just what I want (i18n).
Does anyone have any experience with internationalization (specifically the i18n CPAN module) in Perl? Is the i18n module the preferred method for multi-lingual support or should I reconsider a custom solution?
Thanks
There is a Perl Journal article on software localisation. It will give you a good idea of what to expect when adding multi-lingual support. It's beautifully written and humorous.
Specifically, the article is written by the folks who wrote and maintain Locale::Maketext, so I would recommend that module simply based upon the amount of pain it is clear the authors have had to endure to make it work correctly.
If you have the time then do take a look at the way the I18N is done in the Jifty framework - although initially quite confusing it is very elegant and usable.
They overload _ so that you can use _("text to translate") anywhere in the code. These strings are then translated using Locale::Maketext as normal.
What makes it really powerful is that they defer the translation until the string is needed, using Scalar::Defer, so that you can start adding the strings at any time, even before you know which language they will be translated into (for example, in config files). This really makes I18N easy to work with.
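As a rough sketch of what plain Locale::Maketext usage looks like (without Jifty's _ overloading or Scalar::Defer), the following defines a project base class and one lexicon per language. The MyApp::L10N namespace, the greet helper, and the sample strings are assumptions for illustration; the package-block syntax needs Perl 5.14 or later.

```perl
use strict;
use warnings;

# Project base class; language classes inherit from it.
package MyApp::L10N {
    use parent 'Locale::Maketext';
}

# English: _AUTO lets keys that are not in the lexicon stand for themselves.
package MyApp::L10N::en {
    use parent -norequire, 'MyApp::L10N';
    our %Lexicon = ( _AUTO => 1 );
}

# French: maps the English key to its translation; [_1] is the first argument.
package MyApp::L10N::fr {
    use parent -norequire, 'MyApp::L10N';
    our %Lexicon = ( 'Hello, [_1]!' => 'Bonjour, [_1] !' );
}

# Hypothetical helper: pick a language handle and translate one phrase.
sub greet {
    my ($lang, $name) = @_;
    my $lh = MyApp::L10N->get_handle($lang)
        or die "No language handle for '$lang'";
    return $lh->maketext('Hello, [_1]!', $name);
}

print greet('en', 'Alice'), "\n";
print greet('fr', 'Alice'), "\n";
```

In a real application each language class would normally live in its own .pm file, and the language tag would come from the request (for example the Accept-Language header) instead of being hard-coded.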

What's the best way to write a Perl CGI application?

Every example I've seen of CGI/Perl is basically a bunch of print statements containing HTML, and this doesn't seem like the best way to write a CGI app. Is there a better way to do this? Thanks.
EDIT: I've decided to use CGI::Application and HTML::Template, and use the following tutorial: http://docs.google.com/View?docid=dd363fg9_77gb4hdh7b. Thanks!
Absolutely (you're probably looking at tutorials from the 90s). You'll want to pick a framework. In Perl-land these are the most popular choices:
CGI::Application - very lightweight with lots of community plugins
Catalyst - heavier with more bells and whistles
Jifty - more magical than the above
This is a really, really big question. In short, the better way is called Model/View/Controller (aka MVC). With MVC, your application is split into three pieces.
The Model is your data and business logic. It's the stuff that makes up the core of your application.
The View is the code for presenting things to the user. In a web application, this will typically be some sort of templating system, but it could also be a PDF or Excel spreadsheet. Basically, it's the output.
Finally, you have the Controller. This is responsible for putting the Model and View together. It takes a user's request, gets the relevant model objects, and calls the appropriate view.
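To make the three roles concrete, here is a deliberately tiny sketch in plain Perl with no framework. All package names, the in-memory task list, and the 'open' action are invented for illustration; a real application would use a database for the model and a templating module for the view. The package-block syntax needs Perl 5.14 or later.

```perl
use strict;
use warnings;

# Model: data and business logic (a hypothetical in-memory store).
package Model::Tasks {
    my @tasks = (
        { title => 'Write docs',   done => 0 },
        { title => 'Ship release', done => 1 },
    );
    sub open_tasks { return grep { !$_->{done} } @tasks }
}

# View: turns data into output and knows nothing about where it came from.
package View::Text {
    sub render {
        my ($class, @tasks) = @_;
        return join '', map { "- $_->{title}\n" } @tasks;
    }
}

# Controller: maps a request to model calls and picks a view.
package Controller {
    sub handle {
        my ($class, $action) = @_;
        if ($action eq 'open') {
            return View::Text->render( Model::Tasks->open_tasks );
        }
        return "Unknown action: $action\n";
    }
}

print Controller->handle('open');
```

Swapping View::Text for an HTML or PDF view would not require touching the model, which is the point of the split.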
mpeters already mentioned several MVC frameworks for Perl. You'll also want to pick a templating engine. The two most popular are Template Toolkit and Mason.
Leaving the question of CGI vs MVC framework for the moment, what you're going to want is one of the output templating modules from the CPAN.
The Template Toolkit is very popular (Template.pm on CPAN)
Also popular are Text::Template, HTML::Template, and HTML::Mason.
HTML::Mason is much more than a template module, and as such might be a little too heavy for a simple CGI app, but is worth investigating a little while you're deciding which would be best for you.
Text::Template is reasonably simple, and uses Perl inside the templates, so you can loop over data and perform display logic in Perl. This is seen as both a pro and con by people.
HTML::Template is also small and simple. It implements its own small set of tags for if/then/else processing, variable setting, and looping. That's it. This is seen as both a pro and a con for the exact opposite reasons as Text::Template.
Template toolkit (TT) implements a very large, perlish template language that includes looping and logic, and much more.
I used HTML::Template once, and found I wanted a few more features. I then used Text::Template with success, but found its desire to twiddle with namespaces to be a little annoying. I've come to know and love Template Toolkit. For me it just feels right.
Your mileage may vary.
Of course, there is still the old "print HTML" method; sometimes a couple of print statements suffice. But you've hit upon the idea of separating your display from your main logic, which is a good thing.
It's the first step down the road to Model/View/Controller (MVC), in which you keep separate your data model and business logic (your code that accepts the input, does something with it, and decides what needs to be output), your input/output (templates or print statements: HTML, PDF, etc.), and the code that connects the two (CGI, CGI::Application, the Catalyst MVC framework, etc.). The idea is that a change to your data structure (in the Model) should not require changes to your output routines (View).
The Perl5 Wiki provides a good (though not yet complete) list of web frameworks & templates.
The comparison articles linked in the "templates" wiki entry are worth reading. I would also recommend reading this push-style templating systems article on PerlMonks.
For templating, Template Toolkit is the one I've used most, and I can highly recommend it. There is also an O'Reilly book on it, and it is probably the most used template system in the Perl kingdom (inside or outside of web frameworks).
Another approach which I've been drawn more and more to is non template "builder" solutions. Modules like Template::Declare & HTML::AsSubs fit this bill.
One solution that I feel strikes the right balance in the Framework/Roll-your-own dilemma is the use of three key perl modules: CGI.pm, Template Toolkit , and DBI. With these three modules you can do elegant MVC programming that is easy to write and to maintain.
All three modules are very flexible with Template Toolkit (TT) allowing you to output data in html, XML, or even pdf if you need to. You can also include perl logic in TT, even add your database interface there. This makes your CGI scripts very small and easy to maintain, especially when you use the "standard" pragma.
It also allows you to put JavaScript and AJAXy stuff in the template itself which mimics the separation between client and server.
These three modules are among the most popular on CPAN, have excellent documentation and a broad user base. Plus developing with these means you can rather quickly move to mod_perl once you have prototyped your code giving you a quick transition to Apache's embedded perl environment.
All in all a compact yet flexible toolset.
You can also separate presentation out from code and just use a templating system without needing to bring in all the overhead of a full-blown framework. Template Toolkit can be used by itself in this fashion, as can Mason, although I tend to consider it to be more of a framework disguised as a templating system.
If you're really gung-ho about separating code from presentation, though, be aware that TT and Mason both allow (or even encourage, depending on which docs you read) executable code to be embedded in the templates. Personally, I feel that embedding code in your HTML is no better than embedding HTML in your code, so I tend to opt for HTML::Template.