Why are Perl source filters bad and when is it OK to use them?

It is "common knowledge" that source filters are bad and should not be used in production code.
When answering a similar, but more specific question I couldn't find any good references that explain clearly why filters are bad and when they can be safely used. I think now is the time to create one.
Why are source filters bad?
When is it OK to use a source filter?

Why source filters are bad:
Nothing but perl can parse Perl. (Source filters are fragile.)
When a source filter breaks, pretty much anything can happen. (They can introduce subtle and very hard-to-find bugs.)
Source filters can break tools that work with source code. (PPI, refactoring, static analysis, etc.)
Source filters are mutually exclusive. (You can't use more than one at a time -- unless you're psychotic).
When they're okay:
You're experimenting.
You're writing throw-away code.
Your name is Damian and you must be allowed to program in Latin.
You're programming in Perl 6.

Only perl can parse Perl (see this example):
@result = (dothis $foo, $bar);
# Which of the following is it equivalent to?
@result = (dothis($foo), $bar);
@result = dothis($foo, $bar);
This kind of ambiguity makes it very hard to write source filters that always succeed and do the right thing. When things go wrong, debugging is awkward.
After crashing and burning a few times, I have developed the superstitious approach of never trying to write another source filter.
I do occasionally use Smart::Comments for debugging, though. When I do, I load the module on the command line:
$ perl -MSmart::Comments test.pl
so as to avoid any chance that it might remain enabled in production code.
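For reference, here's a minimal sketch of Smart::Comments in action (the ### $total line is its documented variable-dumping form; the explicit use line is only here to make the sketch self-contained):
use strict;
use warnings;
use Smart::Comments;    # for real use, prefer: perl -MSmart::Comments test.pl

my $total = 0;
for my $i (1 .. 5) {
    $total += $i;
    # The next line is a smart comment: the filter turns it into code
    # that reports the variable on STDERR, e.g. "### $total: 15"
    ### $total
}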
See also: Perl Cannot Be Parsed: A Formal Proof

I don't like source filters because you can't tell what code is going to do just by reading it. Additionally, things that look like they aren't executable, such as comments, might magically be executable with the filter. You (or more likely your coworkers) could delete what you think isn't important and break things.
Having said that, if you are implementing your own little language that you want to turn into Perl, source filters might be the right tool. However, just don't call it Perl. :)

It's worth mentioning that Devel::Declare keywords (and starting with Perl 5.11.2, pluggable keywords) aren't source filters, and don't run afoul of the "only perl can parse Perl" problem. This is because they're run by the perl parser itself, they take what they need from the input, and then they return control to the very same parser.
For example, when you declare a method in MooseX::Declare like this:
method frob ($bubble, $bobble does coerce) {
... # complicated code
}
The word "method" invokes the method keyword parser, which uses its own grammar to get the method name and parse the method signature (which isn't Perl, but it doesn't need to be -- it just needs to be well-defined). Then it leaves perl to parse the method body as the body of a sub. Anything anywhere in your code that isn't between the word "method" and the end of a method signature doesn't get seen by the method parser at all, so it can't break your code, no matter how tricky you get.

The problem I see is the same problem you encounter with any C/C++ macro more complex than defining a constant: It degrades your ability to understand what the code is doing by looking at it, because you're not looking at the code that actually executes.

In theory, a source filter is no more dangerous than any other module, since you could easily write a module that redefines builtins or other constructs in "unexpected" ways. In practice, however, it is quite hard to write a source filter in a way where you can prove that it's not going to make a mistake. I tried my hand at writing a source filter that implements the Perl 6 feed operators in Perl 5 (Perl6::Feeds on CPAN). You can take a look at the regular expressions to see the acrobatics required simply to figure out the boundaries of expression scope. While the filter works, and provides a test bed to experiment with feeds, I wouldn't consider using it in a production environment without many, many more hours of testing.
Filter::Simple certainly comes in handy by dealing with 'the gory details of parsing quoted constructs', so I would be wary of any source filter that doesn't start there.
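For illustration, a minimal Filter::Simple sketch; the module name BangBang and the substitution are invented, but FILTER_ONLY with a code block is the documented way to keep your rewrite away from strings and regexes:
package BangBang;    # hypothetical module name
use Filter::Simple;

# Rewrite every literal BANG BANG that appears in actual code --
# Filter::Simple keeps the substitution out of quoted constructs:
FILTER_ONLY
    code => sub { s/BANG\s+BANG/warn "bang!";/g };

1;
A script that says use BangBang; then has every BANG BANG in its code replaced with a warn statement before perl compiles it.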
In all, it really depends on the filter you are using and how broad a scope it tries to match against. If it is something simple like a C macro, then it's "probably" OK, but if it's something complicated then it's a judgement call. I personally can't wait to play around with Perl 6's macro system. Finally Lisp won't have anything on Perl :-)

There is a nice example here that shows the kind of trouble you can get into with source filters:
http://shadow.cat/blog/matt-s-trout/show-us-the-whole-code/
They used a module called Switch, which is based on source filters. And because of that, they were unable to find the source of an error message for days.
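For context, this is roughly what Switch offers (a minimal sketch of its documented syntax). Because a filter rewrites the entire file before perl compiles it, a bug in the filter can surface as a baffling error pointing far from the real cause:
use Switch;

my $n = 2;
switch ($n) {
    case 1 { print "one\n" }
    case 2 { print "two\n" }
    else   { print "many\n" }
}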

Related

Perl: Is it possible to dynamically fix compile time error?

If I have, for example, the following Perl script:
use strict;
use warnings;
print $x;
When I run this script, compilation will fail with error:
Global symbol "$x" requires explicit package name (did you forget to declare "my $x"?) at ...
Is it possible to write some Perl module which will be called when this error occurs, automatically fix the error, and continue compilation? (Even links to any relevant info are OK.)
# This code is incorrect.
# Here I am just asking about such an ability.
# This code is a very weak approximation of how it might look.
package AutoFix;

sub fix {
    $main::x = 'You are defined now';
}

1;
So the following code will not fail and will print You are defined now:
use strict;
use warnings;
use AutoFix;
print $x;
How much work would you like to do to create the code that could figure out what the fix should be? And will that amount of work be comparable to, or less than, the work required to examine the code by hand?
Now, I'm writing all of this having spent quite a bit of time trying to come up with a system to analyze CPAN installer output to figure out what went wrong (a major impetus for CPANPLUS, now relegated to history). It's easy to tell that something is not right, but beyond that is a lot of suffering.
In your example, you have an error about an undeclared variable. How does AutoFix know if that should be a package or a lexical variable? You can guess one or the other, but you actually have two big problems:
What is the intent of the code?
Does the code reflect the actual intent?
Determining the intent of the code is often very difficult for even an experienced human programmer to figure out (just read StackOverflow question comments). Compiling code is often not correct code, in the sense that it doesn't achieve the desired outcome. Furthermore, does the programmer even understand the problem? Does the code the programmer wrote (incorrectly here) reflect the actual work the code should do? It's difficult for humans in code review to figure this out. Tools like Coverity can guess at problems it knows about, but they aren't going to be able to correct the code.
But let's say that the programmer understands the problem. Have they correctly expressed that? The longer you've been programming, the more you lean toward "no", in general, in my experience.
This is completely different than the database constraint you mentioned. That's a narrowly targeted fix for an expected and allowed situation. Consider a different parallel: if the record has a New York area code but a Chicago address, should I fix the city? When I was a younger dumbass, I did a similar thing to a database. It was stupid because I thought I knew something I didn't, and everyone who understood the situation recognized it immediately. Even then, those sorts of constraints are how we model what we know about the world, not what the world actually is.
Now, to make AutoFix, you need to make something that can look at code, understand it, and figure out what it should do. You can make guesses, but you have no basis for playing the probabilities there.
Technical matters can't solve this. AutoFix can undo the work of pragmas such that some classes of errors don't show up, but so what? The program with an error just continues? How does that help anyone?
Not only that, compilers tend to complain when they realize they can't parse something. What they complain about is often not the problem. The first thing I teach people while debugging is that they need to look at the statement immediately preceding the line number in the error message. Any error message you catch can have a virtually infinite number of causes.
Consider this code, which fails in the same way as your example (same error message) but for a completely different but common reason:
use strict;
use warnings;
my $x = 5,
print $x++;
How do you figure out what the fix should be? It's not about declaring $x.
So, you now have two cases, and you build those into your fixer. Then you encounter another case, so you build that in. And you keep doing this until eventually you have a large dictionary of fixes. Maybe you get a bit crazy and do some machine learning (and wouldn't a corpus of bad code and resolutions be cool).
But the program still can't continue. It has to start over, because it has to at least back up to where it should have done something but didn't. You can't merely restart the program because you don't know if it's idempotent. Re-running the program might redo work it shouldn't, such as inserting duplicate records into databases.
Having said all that, this sort of thing is related to static analysis and the refactoring browser. Adam Kennedy's Parse Perl Isolated (PPI) project was a first step toward understanding Perl code without compiling it, then moving toward the Smalltalk ideal of understanding which parts of code represent the same thing. If you knew that two things named foo were the same thing, you could rearrange code dealing with foo. For example, if you renamed a method from bar to set_bar, you could immediately know which bars you should rename and which belonged to some other class.
Adam wrote Acme::BadExample and challenged anyone to get it to run. He wrote "any given piece of Perl source exists in bizarre pseudo-quantum-like state, in that it demonstrates both duality and indeterminism."
Jos Boumans stepped up and used some mind-bending Perl, which he then showed in Barely Legal XXX Perl, which I think he first presented in 2006. He was amazingly creative in his solutions, and in a way that I wouldn't want in production code.
Perl doesn't even know, by design, what type of thing will be in a variable, or even that the method you might call on it will exist. In fact, it defers so much to the runtime, trusting that things will be in place by the time you need them, that we often say "only perl can parse Perl". You literally need to be able to run Perl code to properly compile it, since BEGIN blocks can affect the parse. For example, a BEGIN can define a subroutine with a certain arity. How do you parse foo 5, 6? You have to know what has already been defined.
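Here's a minimal sketch of that arity problem (foo is just an illustrative name):
use strict;
use warnings;

sub foo ($) { "got: $_[0]" }    # the ($) prototype makes foo take ONE argument

# Because the parser has already seen the prototype, the next line
# parses as (foo(5), 6) -- a two-element list. Without the prototype
# in scope, it would have parsed as foo(5, 6).
my @list = (foo 5, 6);
print scalar(@list), "\n";      # prints 2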
Perl has other "action at a distance" features that make this even tougher. autodie redefines CORE features to add extra behavior, but you might not be able to see that in the code. You can set default regex flags (and I've seen plenty of big screw ups by people applying /isxm to entire files without checking).
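For example, here's the kind of invisible rewiring autodie does; nothing at the call site hints that open's behavior has changed (the file name is made up):
use strict;
use warnings;
use autodie;    # CORE::open and friends now throw exceptions on failure

# No "or die ..." needed; if the open fails, autodie raises an
# exception. Handy, but invisible when reading this line alone.
open my $fh, '<', 'data.txt';
print while <$fh>;
close $fh;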
As noted above, auto-fixing a compile-time error is not possible (or at least very hard).
Instead of fixing the compile-time error, try to resolve your problem in a different way.
For example: in your script you use the $x variable. You probably know in advance that you will use it and what value it should hold, e.g. You are defined now, so you could use Exporter:
use strict;
use warnings;
use AutoFix qw/ $x /;
print $x;
And the AutoFix module will look like:
package AutoFix;
require Exporter;
our @ISA       = qw(Exporter);
our @EXPORT_OK = qw( $x $y $z );    # symbols to export on request

# create the values that importing code will see
our ($x, $y, $z) = ('You are defined now') x 3;

1;
Good luck ;-)

General check of missing semicolon

As a Perl beginner I sometimes get compilation errors and have to search a lot to find the cause. In the end it is often just a missing semicolon at the end of a line. Perl catches some missing-semicolon syntax errors, but not in general. Is there a way to get such a check?
Edit: I know about Perl::Critic but can't use it at the moment. And I don't know if it checks for missing semicolons in general.
Because semicolons actually mean something in Perl and aren't just there for decoration, it's not possible for any tool (even the Perl interpreter itself) to know in every case whether you actually meant to leave off the semicolon or not. Thus, there's no general-case answer to your question; you'll just need to go through your code and make sure it's correct.
As mentioned in my comments, there are various tricks you can try with your editor to expedite the process of finding potentially-incorrect lines; you must, however, either examine and fix these by hand or risk introducing new problems.
The syntax check is perl -c, but that's no different than attempting to run the program outright. Due to its flexible/undecidable syntax, one cannot generally do what you want. That's the downside of comfort and expressiveness.
Upgrade to the latest stable Perl; the parser's error messages have gotten better/more exact over the last few years and will correctly recognise many circumstances of a missing semicolon.
Rule of thumb that works for many parsers/other languages: if the error makes no sense, look a couple of lines before.
use diagnostics; usually gives you a nice hint, same as use warnings;. Try to keep a consistent coding style, check perlstyle.
Also you can use Perl::Critic online.
Also, as general advice, learn how to use packages and modules, try to group code into subs, and study the syntax of arrays, lists and hashes. A common mistake is forgetting the ; after an anonymous hashref assignment:
my $hashref = { a => 5, b => 10 };
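To see why the complaint lands in the wrong place, consider this sketch (it deliberately does not compile): perl parses happily past the broken line and only gives up at the next statement:
my $hashref = { a => 5, b => 10 }    # <-- the semicolon is missing here

print "done\n";    # ...yet perl reports the syntax error around this line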

Would it be worth it to use Inline::C to speed up math?

I have been working on a Perl program to process large amounts of DNA. It outputs exactly what I need, however it takes much longer than I would like. Using NYTProf I have narrowed down the major problem areas to be the loop that adds my values together. Would using Inline::C to do the math make my program faster, or should I accept the speed and move on? Is there another way to improve the speed? Here is my program and an input it would run on, as well as an executable with the default values entered already.
It's unlikely you'll get useful help here (this included). I can see various problems with your code, and none have to do with the choice of language.
Use CPAN. If you're parsing GenBank, then use an appropriate module.
You're writing assembly in Perl, and neither Perl nor you are very good at that. It's near impossible to know what's going on when you don't pass parameters to subroutines, instead relying on globals all over the place. What do @X1, @X2, @Y1, @Y2 mean?
The following might be your problem: until ($ender - $starter > $tlength) { (line 153). According to your test case, these start out as 103, 1, and 200, and it's not clear when or if they change. Depending on what's in @te, it might or might not ever get out of the loop; I just can't tell from your code.
It would help if we knew, exactly, what are the parameters to add, the in-out invariants, and what it is returning.
That's all I got.
I second the recommendation of PDL made in a comment, if it's applicable. Or the use of a CPAN module tailored to your problem (again, if applicable).
I didn't see anything that looked unambiguously like "the loop that adds my values together" in that code; please, show just the code you are considering optimizing, ideally with just enough structure around it to actually run it.
So to answer your generic question generically, yes, Inline::C can be a useful tool for optimization if you are certain your performance problem is limited to what it actually can do for you. In using it, be aware that invoking your C code from Perl or vice versa is non-trivially expensive, so you have to have enough code translated to C to minimize the transitions.
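If it does apply, here's a minimal sketch of the Inline::C idiom (sum_squares is a made-up stand-in for your adding loop, not your actual code). Because each Perl-to-C call has real overhead, the C function should do a whole chunk of work per call:
use strict;
use warnings;

use Inline C => <<'END_C';
double sum_squares(int n) {
    /* a tight numeric loop in C instead of Perl */
    double total = 0.0;
    int i;
    for (i = 1; i <= n; i++)
        total += (double)i * i;
    return total;
}
END_C

print sum_squares(1000), "\n";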

What is Perl's secret of getting small code do so much?

I've seen many (code-golf) Perl programs out there and even if I can't read them (I don't know Perl) I wonder how you can manage to get such a small bit of code to do what would take 20 lines in some other programming language.
What is the secret of Perl? Is there a special syntax that allows you to do complex tasks in few keystrokes? Is it the mix of regular expressions?
I'd like to learn how to write powerful and yet short programs like the ones you know from the code-golf challenges here. What would be the best place to start out? I don't want to learn "clean" Perl - I want to write scripts even I don't understand anymore after a week.
If there are other programming languages out there with which I can write even shorter code, please tell me.
There are a number of factors that make Perl good for code golfing:
No data typing. Values can be used interchangeably as strings and numbers.
"Diagonal" syntax. Usually referred to as TMTOWTDI (There's more than one way to do it.)
Default variables. Most functions act on $_ if no argument is specified. (A few act on @_.)
Functions that take multiple arguments (like split) often have defaults that let you omit some arguments or even all of them.
The "magic" readline operator, <>.
Higher order functions like map and grep
Regular expressions are integrated into the syntax (i.e. not a separate library)
Short-circuiting operators return the last value tested.
Short-circuiting operators can be used for flow control.
Additionally, without strictures (which are off by default):
You don't need to declare variables.
Barewords auto-quote to strings.
undef becomes either 0 or '' depending on context.
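To make a few of those concrete, here's a small sketch using the magic readline, split's defaults, and a short-circuiting operator for flow control:
while (<>) {                            # magic <>: reads STDIN or files in @ARGV
    my @words = split;                  # no arguments: splits $_ on whitespace
    @words and print "$words[0]\n";     # short-circuiting "and" as flow control
}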
Now that that's out of the way, let me be very clear on one point:
Golf is a game.
It's great to aspire to the level of perl-fu that allows you to be good at it, but in the name of $DIETY do not golf real code. For one, it's a horrible waste of time. You could spend an hour trying to trim out a few characters. Golfed code is fragile: it almost always makes major assumptions and blithely ignores error checking. Real code can't afford to be so careless. Finally, your goal as a programmer should be to write clear, robust, and maintainable code. There's a saying in programming: Always write your code as if the person who will maintain it is a violent sociopath who knows where you live.
So, by all means, start golfing; but realize that it's just playing around and treat it as such.
Most people miss the point of much of Perl's syntax and default operators. Perl is largely a "DWIM" (do what I mean) language. One of its major design goals is to "make the common things easy and the hard things possible".
As part of that, Perl designers talk about Huffman coding of the syntax and think about what people need to do instead of just giving them low-level primitives. The things that you do often should take the least amount of typing, and functions should act like the most common behavior. This saves quite a bit of work.
For instance, split has many defaults because there are use cases where leaving things off matches the common case. With no arguments, split breaks up $_ on whitespace because that's a very common use.
my @bits = split;
A bit less common but still frequent case is to break up $_ on something else, so there's a slightly longer version of that:
my @bits = split /:/;
And, if you wanted to be explicit about the data source, you can specify the variable too:
my @bits = split /:/, $line;
Think of this as you would normally deal with life. If you have a common task that you perform frequently, like talking to your bartender, you have a shorthand for it that covers the usual case:
The usual
If you need to do something slightly different, you expand that a little:
The usual, but with onions
But you can always note the specifics:
A dirty Bombay Sapphire martini shaken not stirred
Think about this the next time you go through a website. How many clicks does it take for you to do the common operations? Why are some websites easy to use and others not? Most of the time, the good websites require you to do the least amount of work to do the common things. Unlike my bank which requires no fewer than 13 clicks to make a credit card bill payment. It should be really easy to give them money. :)
This doesn't answer the whole question, but in regards to writing code you won't be able to read in a couple days, here's a few languages that will encourage you to write short, virtually unreadable code:
J
K
APL
Golfscript
Perl has a lot of single-character special variables that provide a lot of shortcuts, e.g. $. $_ $# $/ $1 etc. I think it's that, combined with the built-in regular expressions, that allows you to write some very concise but unreadable code.
Perl's special variables ($_, $., $/, etc.) can often be used to make code shorter (and more obfuscated).
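For example, a tiny sketch leaning on $_ (the current line) and $. (the input line number):
while (<>) {
    print "$.: $_" if /\S/;    # $_ is the line, $. the input line number
}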
I'd guess that the "secret" is in providing native operations for often repeated tasks.
In the domain that Perl was originally envisioned for, you often have to
Take input linewise
Strip off whitespace
Rip lines into words
Associate pairs of data
...
and Perl simply provided operators to do these things. The short variable names and use of defaults for many things are just gravy.
Nor was perl the first language to go this way. Many of the features of perl were stolen more-or-less intact (or often slightly improved) from sed and awk and various shells. Good for Larry.
Certainly perl wasn't the last to go this way, you'll find similar features in python and php and ruby and ... People liked the results and weren't about to give them up just to get more regular syntax.
What's Java's secret of copying a variable in only one line, without worrying about buses and memory? Answer: the code is transformed to bigger code. Same for every language ever invented.

Removing comments using Perl

Something I keep doing is removing comments from a file as I process it. I was wondering if there is a module to do this.
The sort of code I keep writing time and again is:
while (<>) {
    s/#.*//;            # strip comments
    next if /^ \s+ $/x; # skip lines that are now blank
    # ... do something useful here ...
}
Edit: Just to clarify, the input is not Perl. It is a text file of my own making that might have data I want to process in some way or other. I want to be able to place comments in it that are ignored by my programs.
Unless this is a learning experience I suggest you use Regexp::Common::comment instead of writing your own regular expressions.
It supports quite a few languages.
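For instance, a minimal sketch of the original loop rewritten with it (note: as I read the docs, the Perl comment pattern matches through the trailing newline, hence the replacement with a newline to preserve line structure):
use strict;
use warnings;
use Regexp::Common qw(comment);

while (my $line = <>) {
    # The Perl comment pattern matches from "#" through the end of the
    # line, newline included, so put a newline back when stripping.
    $line =~ s/$RE{comment}{Perl}/\n/;
    next if $line =~ /^\s*$/;    # skip lines that are now blank
    print $line;                 # ... do something useful here ...
}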
The question does not make clear what type of file it is. Are we dealing with Perl source files? If so, your approach is not entirely correct -- see gbacon's comment. Perl source files are notoriously difficult (impossible?) to parse with regexes. In that case, or if you need to deal with several types of files, use Regexp::Common::comment as suggested by Niffle. Otherwise, if you think your regex logic is correct for your scenario, then I personally prefer to write it explicitly; it's just a pair of straightforward lines, and there is little to be gained by using a module (while you introduce a dependency).