An automated way to do string extraction for Perl/Mason i18n? - perl

I'm currently working on internationalizing a very large Perl/Mason web application, as a team of one (does that make this a death march??). The application is nearing 20 years old, and is written in a relatively old-school Perl style; it doesn't use Moose or another OO module. I'm currently planning on using Locale::Maketext::Gettext to do message lookups, and using GNU Gettext catalog files.
I've been trying to develop some tools to aid in string extraction from our bigass codebase. Currently, all I have is a relatively simple Perl script to parse through source looking for string literals, prompt the user with some context and whether or not the string should be marked for translation, and mark it if so.
There's way too much noise in terms of strings I need to mark versus strings I can ignore. A lot of strings in the source aren't user-facing, such as hash keys, or type comparisons like
if (ref($db_obj) eq 'A::Type::Of::Db::Module')
I do apply some heuristics to each proposed string to see whether I can ignore it off the bat (e.g. I ignore strings that are used for hash lookups, since 99% of the time in our codebase these aren't user-facing). However, despite all that, around 90% of the strings my program shows me are ones I don't care about.
Is there a better way I could help automate my task of string extraction (i.e. something more intelligent than grabbing every string literal from the source)? Are there any commercial programs that do this that could handle both Perl and Mason source?
ALSO, I had a (rather silly) idea for a superior tool, whose workflow I put below. Would it be worth the effort implementing something like this (which would probably take care of 80% of the work very quickly), or should I just submit to an arduous, annoying, manual string extraction process?
Start by extracting EVERY string literal from the source, and putting it into a Gettext PO file.
Then, write a Mason plugin to parse the HTML for each page being served by the application, with the goal of noting strings that the user is seeing.
Use the hell out of the application and try to cover all use cases, building up a store of user facing strings.
Given this store of strings the user saw, do fuzzy matches against strings in the catalog file, and keep track of catalog entries that have a match from the UI.
At the end, anything in the catalog file that didn't get matched would likely not be user facing, so delete those from the catalog.
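Steps 4 and 5 of the workflow above could be sketched roughly as follows. This is only a minimal illustration: the names normalize and user_facing_entries are hypothetical, and the "fuzzy" match is nothing more than case and whitespace normalization followed by a substring test. Strings that interpolate variables would not match this way and would still need manual review.

```perl
use strict;
use warnings;

# Normalize a string for loose comparison: lowercase, collapse whitespace.
sub normalize {
    my ($text) = @_;
    $text = lc $text;
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

# Return the catalog entries that appear somewhere in the text users saw.
# @$catalog holds candidate msgids; @$seen holds text captured from pages.
sub user_facing_entries {
    my ($catalog, $seen) = @_;
    my $haystack = join "\n", map { normalize($_) } @$seen;
    return grep {
        my $needle = normalize($_);
        length($needle) && index($haystack, $needle) >= 0;
    } @$catalog;
}

# Made-up sample data, purely for illustration.
my @catalog = ('Save changes', 'A::Type::Of::Db::Module', 'Your  cart is empty');
my @seen    = ('Your cart is empty', "SAVE\nCHANGES");

my @keep = user_facing_entries(\@catalog, \@seen);
print "$_\n" for @keep;
```

Anything in @catalog that is not in @keep would be a candidate for deletion in step 5.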

There are no Perl tools I know of which will intelligently extract strings which might need internationalization vs ones that will not. You're supposed to mark them in the code as you write them, but as you said that wasn't done.
You can use PPI to do the string extraction intelligently.
#!/usr/bin/env perl
use strict;
use warnings;
use Carp;
use PPI;
my $doc = PPI::Document->new(shift)
    or croak "Could not parse file: " . PPI::Document->errstr;
# See PPI::Node for docs on find
my $strings = $doc->find(sub {
    my($top, $element) = @_;
    # Debugging aid: prints the class of every element visited.
    print ref $element, "\n";
    # Look for any quoted string or here doc.
    # Does not pick up unquoted hash keys.
    return $element->isa("PPI::Token::Quote") ||
           $element->isa("PPI::Token::HereDoc");
});
# Display the content and location.
# find() returns a false value when nothing matches, hence the "|| []".
for my $string (@{ $strings || [] }) {
    my($line, $row, $col) = @{ $string->location };
    print "Found string at line $line starting at character $col.\n";
    printf "String content: '%s'\n", string_content($string);
}
# *sigh* PPI::Token::HereDoc doesn't have a string method
sub string_content {
    my $string = shift;
    # heredoc() returns a list of lines, so join them into one string.
    return $string->isa("PPI::Token::Quote")   ? $string->string :
           $string->isa("PPI::Token::HereDoc") ? join('', $string->heredoc) :
           croak "$string is neither a here-doc nor a quote";
}
You can do more sophisticated examination of the tokens surrounding the strings to determine if it's something significant. See PPI::Element and PPI::Node for more details. Or you can examine the content of the string to determine if it's significant.
I can't go much further because "significant" is up to you.
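As a rough illustration of examining the surrounding tokens, here is a sketch that skips strings used as hash subscripts, strings compared with eq or ne, and strings on the left of a fat comma. The sub name looks_internal and the three rules are assumptions chosen for this example, not a definition of "significant"; the sample source is made up.

```perl
use strict;
use warnings;
use PPI;

# Hypothetical heuristic: true when a string token is probably not user-facing.
sub looks_internal {
    my ($string) = @_;

    # Used as a hash subscript: $h{'key'}
    my $statement = $string->parent;
    my $container = $statement ? $statement->parent : undef;
    return 1 if $container && $container->isa('PPI::Structure::Subscript');

    # Compared with eq/ne: ref($obj) eq 'Some::Class'
    my $prev = $string->sprevious_sibling;
    return 1 if $prev && $prev->isa('PPI::Token::Operator')
                      && $prev->content =~ /^(?:eq|ne)$/;

    # Left side of a fat comma: 'key' => $value
    my $next = $string->snext_sibling;
    return 1 if $next && $next->isa('PPI::Token::Operator')
                      && $next->content eq '=>';

    return 0;
}

# Made-up sample source, parsed from a string instead of a file.
my $source = <<'END';
my %h = ('name' => 'Welcome back!');
print $h{'name'} if ref($obj) eq 'A::Type::Of::Db::Module';
END

my $doc    = PPI::Document->new(\$source) or die PPI::Document->errstr;
my $quotes = $doc->find('PPI::Token::Quote') || [];

my @candidates = map { $_->string } grep { !looks_internal($_) } @$quotes;
print "$_\n" for @candidates;
```

Each rule only looks one step away from the string token, so it will miss plenty of cases; the point is that PPI makes it cheap to add more rules as you learn what your codebase's noise looks like.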

Our Source Code Search Engine is normally used to efficiently search large code bases, using indexes constructed from the lexemes of the languages it knows. That list of languages is pretty broad, including Java, C#, COBOL and ... Perl. The lexeme extractors are language precise (because they are "stolen" from our DMS Software Reengineering Toolkit, a language-agnostic program transformation system, where precision is fundamental).
Given an indexed code base, one can then enter queries to find arbitrary sequences of lexemes in spite of language-specific white space; one can log the hits of such queries and their locations.
The extremely short query:
S
to the Search Engine finds all lexical elements which are classified as strings (keywords, variable names, comments are all ignored; just strings!). (Normally people write more complex queries with regular expression constraints, such as S=*Hello to find strings that end with "Hello")
The relevance here is that the Source Code Search Engine has precise knowledge of lexical syntax of strings in Perl (including specifically elements of interpolated strings and all the wacky escape sequences). So the query above will find all strings in Perl; with logging on, you get all the strings and their locations logged.
This stunt actually works for any language the Search Engine understands, so it is a rather general way to extract the strings for such internationalization tasks.

Related

Effect of use Encode qw/encode decode from_to/;?

What is the effect of this at the top of a perl script?
use Encode qw/encode decode from_to/;
I found this on code I have taken over, but I don't know what it does.
Short story: for an experienced Perl coder who knows what modules are:
The Encode module is for converting Perl strings to "some other" format (for which there are many sub-modules that define different formats). Typically, it's used for converting to and from Unicode formats, e.g.:
... to convert a string from Perl's internal format into ISO-8859-1, also known as Latin1:
$octets = encode("iso-8859-1", $string);
decode is for going the other way, and from_to converts a string of octets from one format to another in place:
from_to($octets, "iso-8859-1", "cp1250");
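To make the three functions concrete, here is a small round trip. The sample string and the choice of UTF-8 as the second encoding are just assumptions for the example.

```perl
use strict;
use warnings;
use Encode qw/encode decode from_to/;

my $string = "caf\x{e9}";                      # 4 characters, the last is U+00E9

# encode: Perl characters -> bytes in a given encoding.
my $latin1 = encode("iso-8859-1", $string);    # 4 bytes, U+00E9 becomes 0xE9
my $utf8   = encode("UTF-8", $string);         # 5 bytes, U+00E9 becomes 0xC3 0xA9

# from_to: re-encode a byte string in place, here Latin-1 -> UTF-8.
my $octets = $latin1;
from_to($octets, "iso-8859-1", "UTF-8");

# decode: bytes in a given encoding -> Perl characters.
my $round_trip = decode("UTF-8", $octets);

printf "latin1: %d bytes, utf8: %d bytes\n", length($latin1), length($utf8);
print "round trip ok\n" if $round_trip eq $string;
```

Note that from_to modifies its first argument rather than returning the converted string, which is why the example copies $latin1 first.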
Long story: for someone who doesn't know what a module is/does:
This is the classic way one uses code from elsewhere. "Elsewhere" usually means one of two possibilities, either:
Code written "in-house", i.e. a part of your private application that a past developer has decided to factor out (presumably) because it's applicable in several locations/applications; or
Code written outside the organisation and made available publicly, typically from the Comprehensive Perl Archive Network - CPAN
Now, it's possible - but unlikely - that someone within your organization has created in-house code and co-incidentally used the same name for a module on CPAN so, if you check CPAN by searching for "Encode" - you can see that there is a module of that name - and that will almost certainly be what you are using. You can read about it here.
The qw/.../ stands for "quote words" and is a simple short hand for creating a list of strings; in this case it translates to ("encode", "decode", "from_to") which in turn is a specification of what parts of the Encode module you (or the original author) want.
You can read about those parts under the heading "Basic methods" on the documentation (or "POD") page I referred to earlier. Don't be put off by the reference to "methods": many modules (and, it appears, this one) are written in such a way that they support both an object-oriented and a functional interface. As a result, you will probably see direct calls to the three functions mentioned earlier as if they were written directly in the program itself.

perl CGI.pm: lists vs strings to build pages

Is there a reason why CGI.pm examples rely on list concatenation rather than on string concatenation? are the two interchangeable? think
print q->hidden(-name =>'rm', -value => $var).
q->submit(-name =>"rm$var");
vs.
print q->hidden(-name =>'rm', -value => $var),
q->submit(-name =>"rm$var");
I have a specific reason for asking. it is very convenient to build up a page from strings. after all, perl understands scalar strings as basic types.
however, I am completely perplexed by some odd behavior in the string concat context. Specifically, I have encountered occasional cases in which the $var in the hidden is not the same as the one in the submit button. I could work around this, but I would rather understand CGI.pm .
could someone please explain whether string concat should work?
Unless you change the default value of $,,
print EXPR1, EXPR2;
and
print EXPR1 . EXPR2;
produce the same result if the expressions aren't context-specific. The functions in question always return an HTML string, so you're good.
You're right that the two examples have the same effect (well, as ikegami says, unless you have changed $,). The only difference is, of course, that in the first example, print is passed one string and in the second example, it gets two.
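The effect of $, can be checked with a small experiment. This sketch prints to an in-memory filehandle so the output can be compared; the helper name captured and the sample strings are made up for the illustration.

```perl
use strict;
use warnings;

# Capture what print emits by printing to an in-memory filehandle.
sub captured {
    my ($code) = @_;
    my $buffer = '';
    open my $fh, '>', \$buffer or die "Cannot open in-memory handle: $!";
    $code->($fh);
    close $fh;
    return $buffer;
}

my ($x, $y) = ('<a>', '<b>');

# Default $, (undefined): list form and concatenation give the same output.
my $list_default   = captured(sub { my $fh = shift; print {$fh} $x, $y });
my $concat_default = captured(sub { my $fh = shift; print {$fh} $x . $y });

# With an output field separator set, only the list form changes.
my ($list_sep, $concat_sep);
{
    local $, = '|';
    $list_sep   = captured(sub { my $fh = shift; print {$fh} $x, $y });
    $concat_sep = captured(sub { my $fh = shift; print {$fh} $x . $y });
}

print "default:   $list_default vs $concat_default\n";
print "separator: $list_sep vs $concat_sep\n";
```

With the separator set, the list form produces <a>|<b> while the concatenated form is still <a><b>.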
But have you read the comments about the HTML generation functions in recent versions of the CGI.pm documentation?
All HTML generation functions within CGI.pm are no longer being
maintained. Any issues, bugs, or patches will be rejected unless they
relate to fundamentally broken page rendering.
The rationale for this is that the HTML generation functions of CGI.pm
are an obfuscation at best and a maintenance nightmare at worst. You
should be using a template engine for better separation of concerns.
See CGI::Alternatives for an example of using CGI.pm with the
Template::Toolkit module.
These functions, and perldoc for them, will continue to exist in the
v4 releases of CGI.pm but may be deprecated (soft) in v5 and beyond.
I would seriously consider moving away from these functions (and, indeed, CGI.pm itself) for new work.

What is Perl's secret of getting small code do so much?

I've seen many (code-golf) Perl programs out there and even if I can't read them (Don't know Perl) I wonder how you can manage to get such a small bit of code to do what would take 20 lines in some other programming language.
What is the secret of Perl? Is there a special syntax that allows you to do complex tasks in few keystrokes? Is it the mix of regular expressions?
I'd like to learn how to write powerful and yet short programs like the ones you know from the code-golf challenges here. What would be the best place to start out? I don't want to learn "clean" Perl - I want to write scripts even I don't understand anymore after a week.
If there are other programming languages out there with which I can write even shorter code, please tell me.
There are a number of factors that make Perl good for code golfing:
No data typing. Values can be used interchangeably as strings and numbers.
"Diagonal" syntax. Usually referred to as TMTOWTDI (There's more than one way to do it.)
Default variables. Most functions act on $_ if no argument is specified. (A few act on @_.)
Functions that take multiple arguments (like split) often have defaults that let you omit some arguments or even all of them.
The "magic" readline operator, <>.
Higher order functions like map and grep
Regular expressions are integrated into the syntax (i.e. not a separate library)
Short-circuiting operators return the last value tested.
Short-circuiting operators can be used for flow control.
Additionally, without strictures (which are off by default):
You don't need to declare variables.
Barewords auto-quote to strings.
undef becomes either 0 or '' depending on context.
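A few of the items from the first list, written out long-hand rather than golfed. The values are arbitrary, and the sketch keeps strictures on, so it does not show the bareword or undeclared-variable tricks.

```perl
use strict;
use warnings;

# Strings and numbers convert on demand.
my $sum    = "10" + 5;      # 15
my $joined = 10 . 5;        # "105"

# Default variable: uc and the match operator use $_ when given nothing else.
my @upper   = map  { uc }   qw(map grep);
my @s_words = grep { /^s/ } qw(split sort map);

# Short-circuit operators return the last value tested.
my $name = '' || 'anonymous';   # '' is false, so the right side is returned
my $port = 0 // 8080;           # 0 is defined, so it is kept (Perl 5.10+)

print "$sum $joined @upper @s_words $name $port\n";
```

This prints: 15 105 MAP GREP split sort anonymous 0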
Now that that's out of the way, let me be very clear on one point:
Golf is a game.
It's great to aspire to the level of perl-fu that allows you to be good at it, but in the name of $DIETY do not golf real code. For one, it's a horrible waste of time. You could spend an hour trying to trim out a few characters. Golfed code is fragile: it almost always makes major assumptions and blithely ignores error checking. Real code can't afford to be so careless. Finally, your goal as a programmer should be to write clear, robust, and maintainable code. There's a saying in programming: Always write your code as if the person who will maintain it is a violent sociopath who knows where you live.
So, by all means, start golfing; but realize that it's just playing around and treat it as such.
Most people miss the point of much of Perl's syntax and default operators. Perl is largely a "DWIM" (do what I mean) language. One of its major design goals is to "make the common things easy and the hard things possible".
As part of that, Perl designers talk about Huffman coding of the syntax and think about what people need to do instead of just giving them low-level primitives. The things that you do often should take the least amount of typing, and functions should act like the most common behavior. This saves quite a bit of work.
For instance, split has many defaults, so in the common cases you can simply leave arguments off. With no arguments, split breaks up $_ on whitespace because that's a very common use.
my @bits = split;
A bit less common but still frequent case is to break up $_ on something else, so there's a slightly longer version of that:
my @bits = split /:/;
And, if you wanted to be explicit about the data source, you can specify the variable too:
my @bits = split /:/, $line;
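Here are the three forms side by side with some made-up input, so the results can be compared.

```perl
use strict;
use warnings;

# No arguments: split $_ on whitespace, ignoring leading whitespace.
$_ = "  the usual   order ";
my @words = split;

# Pattern only: split $_ on that pattern.
$_ = "root:x:0";
my @fields = split /:/;

# Pattern and string: nothing is implied.
my $line     = "a:b:c";
my @explicit = split /:/, $line;

print scalar(@words), " words, ", scalar(@fields), " fields, ",
      scalar(@explicit), " explicit\n";
```

Each call returns three elements: ('the', 'usual', 'order'), ('root', 'x', '0') and ('a', 'b', 'c').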
Think of this as you would normally deal with life. If you have a common task that you perform frequently, like talking to your bartender, you have a shorthand for it that covers the usual case:
The usual
If you need to do something slightly different, you expand that a little:
The usual, but with onions
But you can always note the specifics
A dirty Bombay Sapphire martini shaken not stirred
Think about this the next time you go through a website. How many clicks does it take for you to do the common operations? Why are some websites easy to use and others not? Most of the time, the good websites require you to do the least amount of work to do the common things. Unlike my bank which requires no fewer than 13 clicks to make a credit card bill payment. It should be really easy to give them money. :)
This doesn't answer the whole question, but in regards to writing code you won't be able to read in a couple days, here's a few languages that will encourage you to write short, virtually unreadable code:
J
K
APL
Golfscript
Perl has a lot of single-character special variables that provide a lot of shortcuts, e.g. $. $_ $@ $/ $1 etc. I think it's that, combined with the built-in regular expressions, that allows you to write some very concise but unreadable code.
Perl's special variables ($_, $., $/, etc.) can often be used to make code shorter (and more obfuscated).
I'd guess that the "secret" is in providing native operations for often repeated tasks.
In the domain that perl was originally envisioned for you often have to
Take input linewise
Strip off whitespace
Rip lines into words
Associate pairs of data
...
and perl simply provided operators to do these things. The short variable names and use of defaults for many things is just gravy.
Nor was perl the first language to go this way. Many of the features of perl were stolen more-or-less intact (or often slightly improved) from sed and awk and various shells. Good for Larry.
Certainly perl wasn't the last to go this way, you'll find similar features in python and php and ruby and ... People liked the results and weren't about to give them up just to get more regular syntax.
What's Java's secret of copying a variable in only one line, without worrying about buses and memory? Answer: the code is transformed to bigger code. Same for every language ever invented.

Removing comments using Perl

Something I keep doing is removing comments from a file as I process it. I was wondering if there is a module to do this.
Sort of code I keep writing time and again is
while (<>) {
    s/#.*//;                # drop everything from the first "#" onwards
    next if /^ \s+ $/x;     # skip lines that are now only whitespace
    # ... do something useful here ...
}
Edit: Just to clarify, the input is not Perl. It is a text file of my own making that might have data I want to process in some way or other. I want to be able to place comments that are ignored by my programs.
Unless this is a learning experience I suggest you use Regexp::Common::comment instead of writing your own regular expressions.
It supports quite a few languages.
The question does not make clear what type of file it is. Are we dealing with Perl source files? If so, your approach is not entirely correct; see gbacon's comment. Perl source files are notoriously difficult (impossible?) to parse with regex. In that case, or if you need to deal with several types of files, use Regexp::Common::comment as suggested by Niffle. Otherwise, if you think your regex logic is correct for your scenario, then I personally prefer to write it explicitly: it's just a pair of straightforward lines, and there is little to be gained by using a module (and you introduce a dependency).
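If the explicit approach is kept, the repeated loop body could be factored into a small sub instead of being retyped each time. This is a sketch only: strip_comments is a hypothetical name, it also trims trailing whitespace (which the loop in the question does not), and it treats every "#" as the start of a comment, even one inside the data.

```perl
use strict;
use warnings;

# Hypothetical helper: return the content of each line with "#" comments
# removed, skipping lines that end up blank.
sub strip_comments {
    my @kept;
    for my $line (@_) {
        (my $text = $line) =~ s/#.*//s;   # drop the comment and the newline
        $text =~ s/\s+$//;                # trim trailing whitespace
        next if $text =~ /^\s*$/;         # nothing left on this line
        push @kept, $text;
    }
    return @kept;
}

# Made-up sample input.
my @lines = (
    "# header comment\n",
    "alpha = 1   # trailing note\n",
    "\n",
    "beta = 2\n",
);

my @data = strip_comments(@lines);
print "$_\n" for @data;
```

For the sample input this leaves the two lines "alpha = 1" and "beta = 2".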

Why is there so much "magic" in Perl?

Looking through the perlsub and perlop manpages I've noticed that there are many references to "magic" and "magical" there (just search any of them for "magic"). I wonder why is Perl so rich in them.
Some examples:
print ++($foo = 'zz') # prints 'aaa'
printf "%d: %s", $! = 1, $! # prints '1: Operation not permitted'
while (my $line = <FH>) { ... } # $line is tested for definedness, not truth
use warnings; print "0 but true" + 1 # "0 but true" is a valid number!
When a Perl feature is described as "magic":
It means that that feature is
implemented by NBA star Magic Johnson.
Whenever Perl executes "magic", it is
actually sending an RPC call to a
remote receiver implanted in Magic
himself. He computes the answer, and
then sends a return message. The use
of Mr. Johnson for all the hard parts
of Perl provides a great abstraction
layer and simplifies porting to new
platforms. It's way easier than, say,
the Apache Portable Runtime.
Source: perrin on Perl Monks
It's official! Perl is more magical.
Hits from the following Google searches:
25 site:ruby-doc.org magic
36 site:docs.python.org magic
497 site:perldoc.perl.org magic
Magic, in Perl parlance is simply the word given to attributes applied to variables / functions that allow an extension of their functionality. Some of this functionality is available directly from Perl, and some requires the use of the C api.
A perfect example of magic is the tie interface which allows you to define your own implementation of a variable. Every operation that can be done to a variable (fetching or storing a value for instance) is exposed for reimplementation, allowing for elegant and logical syntactic constructs like a hash with values stored on disk, which are transparently loaded and saved behind the scenes.
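A minimal sketch of the tie interface for a scalar follows. The class name HistoryScalar is made up, and it only keeps values in memory rather than on disk, but the same three methods are where a disk-backed implementation would hook in.

```perl
use strict;
use warnings;

package HistoryScalar;

# Called by tie(): build the object that stands behind the variable.
sub TIESCALAR {
    my ($class) = @_;
    return bless { history => [] }, $class;
}

# Called on every assignment to the tied variable.
sub STORE {
    my ($self, $value) = @_;
    push @{ $self->{history} }, $value;
    return $value;
}

# Called on every read of the tied variable.
sub FETCH {
    my ($self) = @_;
    return $self->{history}[-1];
}

package main;

my $object = tie my $setting, 'HistoryScalar';

$setting = 'draft';     # looks like plain assignment, runs STORE
$setting = 'final';

my $current = $setting; # looks like a plain read, runs FETCH
my @history = @{ $object->{history} };

print "current=$current history=@history\n";
```

The code using $setting never mentions the class; only the tie call reveals that the variable is anything other than an ordinary scalar.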
Magic can also refer to the special ways that certain builtins can behave, such as how the first argument to map or grep can either be a block or a bare expression:
my @squares = map {$_**2} 1 .. 10;
my @roots = map sqrt, 1 .. 10;
which is not a behavior available to user defined subroutines.
Many other features of Perl, such as operator overloading or variables that can return different values when used with numeric or string operators are implemented with magic. Context could be seen as magic as well.
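The observable effect of a value that differs between numeric and string use can be reproduced from plain Perl with dualvar from the core Scalar::Util module. This only illustrates the behaviour as seen by a program; it is not a claim about how $! or overloading are implemented internally, and the values are invented for the example.

```perl
use strict;
use warnings;
use Scalar::Util qw(dualvar);

# One scalar, two faces: 404 in numeric context, 'Not Found' as a string.
my $status = dualvar(404, 'Not Found');

my $as_number = $status + 0;
my $as_string = "$status";

print "$as_number: $as_string\n";
```

This prints 404: Not Found, much like the $! example in the question prints an error number and its message from the same variable.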
In a nutshell, magic is any time that a Perl construct behaves differently than a naive interpretation would suggest, an exception to the rule. Magic is of course very powerful, and should not be wielded without great care. Magic Johnson is of course involved in the execution of all magic (see FM's answer), but that is beyond the scope of this explanation.
I wonder why is Perl so rich in them.
To make things easy.
You'll find that most "magic" in Perl is to simplify the syntax for common tasks.
Because perl always Does What I Mean for some values of always.
I think (opinion more than fact) that this has to do with the organic growth viewpoint that Perl's creator Larry Wall has with the Perl language. Python is a study in the opposite approach, whose style often makes Perl hackers cringe at the perception of being forced to conform to a stylistic regime.
Some of it has to do with Perl being designed to be "efficient" at writing quick scripts to do Perl-ish tasks, in both wall clock time and keystrokes. Some of it has to do with the TMTOWTDI mantra of Perl and its followers.
Programmers tend to be opinionated about Perl's frequent usage of "magic", for some it is an eye-straining visual cacophony of chaos and disrespect for orderliness (which harkens back to the days of computer Priesthood in white lab coats behind a glass window), for others it is a shining example of getting things done efficiently, if not always obviously to the novice or outsider.
Perl's design philosophy is that simple things must be simple. This sounds good, and to some extent it is. However, there's a tradeoff involved: making every simple thing a one-liner results in tons of special-case hacks to save a few lines of code. Different people have different preferences regarding making simple operations within a language simple versus making the language specification simple. Perl is at one extreme. Java is at the other, at least among languages that people actually use. Python and C# are somewhere in between.