Removing comments using Perl

Something I keep doing is removing comments from a file as I process it. I was wondering if there is a module to do this.
The sort of code I keep writing time and again is:
while (<>) {
    s/#.*//;                 # strip comments
    next if /^ \s+ $/x;      # skip whitespace-only lines
    # ... do something useful here ...
}
Edit: Just to clarify, the input is not Perl. It is a text file of my own making that might have data I want to process in some way or other. I want to be able to place comments in it that are ignored by my programs.

Unless this is a learning experience I suggest you use Regexp::Common::comment instead of writing your own regular expressions.
It supports quite a few languages.
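For example, here is a minimal sketch of the loop from the question rewritten with Regexp::Common (assuming the '#'-style comments in your file follow Perl comment syntax; adjust the language key if not):
use strict;
use warnings;
use Regexp::Common qw(comment);

while (<>) {
    s/$RE{comment}{Perl}//g;   # strip '#' comments using the canned pattern
    next unless /\S/;          # skip lines that are now blank
    # ... do something useful here ...
}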

The question does not make clear what type of file it is. Are we dealing with Perl source files? If so, your approach is not entirely correct - see gbacon's comment. Perl source files are notoriously difficult (impossible?) to parse with regexes. In that case, or if you need to deal with several types of files, use Regexp::Common::comment as suggested by Niffle. Otherwise, if you think your regex logic is correct for your scenario, then I personally prefer to write it explicitly: it's just a pair of straightforward lines, and there is little to be gained by using a module (while introducing a dependency).

Related

perl CGI.pm: lists vs strings to build pages

Is there a reason why CGI.pm examples rely on list concatenation rather than on string concatenation? Are the two interchangeable? Think:
print $q->hidden(-name => 'rm', -value => $var) .
      $q->submit(-name => "rm$var");
vs.
print $q->hidden(-name => 'rm', -value => $var),
      $q->submit(-name => "rm$var");
I have a specific reason for asking. It is very convenient to build up a page from strings; after all, Perl understands scalar strings as basic types.
However, I am completely perplexed by some odd behavior in the string-concatenation context. Specifically, I have encountered occasional cases in which the $var in the hidden field is not the same as the one in the submit button. I could work around this, but I would rather understand CGI.pm.
Could someone please explain whether string concatenation should work?
Unless you change the default value of $,,
print EXPR1, EXPR2;
and
print EXPR1 . EXPR2;
produce the same result if the expressions aren't context-specific. The functions in question always return an HTML string, so you're good.
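A small illustration of that point:
# With the default $, (undef), the list form and the concatenation
# produce exactly the same output:
print "foo", "bar";        # foobar
print "foo" . "bar";       # foobar

# Only if $, has been changed does the list form differ:
{
    local $, = '-';
    print "foo", "bar";    # foo-bar
    print "foo" . "bar";   # foobar
}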
You're right that the two examples have the same effect (well, as ikegami says, unless you have changed $,). The only difference is, of course, that in the first example, print is passed one string and in the second example, it gets two.
But have you read the comments about the HTML generation functions in recent versions of the CGI.pm documentation?
All HTML generation functions within CGI.pm are no longer being
maintained. Any issues, bugs, or patches will be rejected unless they
relate to fundamentally broken page rendering.
The rationale for this is that the HTML generation functions of CGI.pm
are an obfuscation at best and a maintenance nightmare at worst. You
should be using a template engine for better separation of concerns.
See CGI::Alternatives for an example of using CGI.pm with the
Template::Toolkit module.
These functions, and perldoc for them, will continue to exist in the
v4 releases of CGI.pm but may be deprecated (soft) in v5 and beyond.
I would seriously consider moving away from these functions (and, indeed, from CGI.pm itself) for new work.
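For what it's worth, here is a minimal sketch of the template-engine style the documentation points to (the page.tt file name and its contents are hypothetical; see CGI::Alternatives for fuller examples):
use strict;
use warnings;
use CGI;
use Template;

my $q   = CGI->new;
my $var = $q->param('rm') // '';

print $q->header;
my $tt = Template->new;
$tt->process('page.tt', { rm => $var })   # page.tt holds the HTML markup
    or die $tt->error;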

How to split long Perl code into several files without too much manual editing?

How do I split a long Perl script into two or more different files that can all access the same variables - without having to rename all shared variables from e.g. $count to $::count (or $main::count which is the same)?
In other words, what's the best and simplest way to split the Perl script into several files without having to import a lot of variables/functions and/or do a lot of manual editing?
I assume it has something to do with making the code part of the same package/scope/namespace, but my experiments so far have failed.
I am not sure it makes a difference, but the script is used for web/CGI purposes and will be running under mod_perl.
EDIT - Background:
I kind of knew I would get that response. The reason I want to split up the file is the following:
Currently I have a single very old and very long Perl file. I know it is not following Perl best practices but it works.
The problem is, I need to distribute the data files it uses between different web servers, first of all for performance reasons. There will be one "master" server and one or several "slaves".
About 20% of the mentioned Perl file contains shared functions, 40% contains the code that needs to run on the master server, and 40% the code for the slave servers. Therefore, I would like to split the code into three files: 1. shared, 2. master-only, 3. slave-only. On the master server, files 1 and 2 will be loaded; on the slaves, files 1 and 3.
I assume this approach would use less process RAM and, more importantly, I would minimize the risk of the code being split incorrectly (e.g. a slave process calling a master data file). I don't see a great need for modularization, as the system works and the code does not need a lot of changes or exchanges with other projects.
EDIT 2 - Solution:
Found the solution I was looking for here:
http://www.perlmonks.org/?node_id=95813
In cases where the main package is in ownership of the variable, the actual word 'main' can be omitted to yield something like: $::var
It is possible to get around having to fully qualify variable names when strict is in use. Applying a simple use vars to your script, with the variable names as its arguments, will get around explicit package names.
Actually, I ended up repeating the our ($count, etc...) statement for the needed variables instead of use vars ();
Do let me know if I am missing something vital - apart from not going with modules! :)
@Axeman: Thanks, I will accept your answer, both for your effort and for sending me in the right direction.
Unless you put different package statements in their files, they will all be treated as if they had package main; at the top. So assuming that the scripts use package variables, you shouldn't have to do anything. If you have declared them with my (that is, if they are lexically scoped variables) then you would have to make sure that all references to the variables are in the same file.
But splitting scripts up for length is a rotten substitute for modularization. Yes, it keeps file length down, but modularization is the proper way to keep code length down: for all the reasons you would want to keep code length down, modularization does it best.
If chopping the files by length could really work for you, then you could create a script like this:
do '/path/to/bin/part1.pl';
do '/path/to/bin/part2.pl';
do '/path/to/bin/part3.pl';
...
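To make the sharing concrete, here is a hypothetical sketch of the layout described in the question (file names invented; the our declarations keep strict happy while every file still sees the same package variables):
# main_master.pl -- loaded only on the master server
use strict;
use warnings;

our ($count, %config);    # package (main::) variables shared by the do'd files

do './shared.pl'      or die "shared.pl failed: ",      $@ || $!;
do './master_only.pl' or die "master_only.pl failed: ", $@ || $!;

# shared.pl would start with the same declaration, e.g.:
#   our ($count, %config);   # these are $main::count and %main::config
#   sub log_msg { print "[$count] @_\n" }
#   1;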
But I kind of suspect that if the organization of this code is as bad as you're -- sort of -- indicating, it might suffer from some of the same wheel-reinvention that I've seen in Perl-ignorant scripts. Just offhand (I might be wrong), I'm thinking you would be surprised how much could be chopped from the length simply by substituting better-tested Perl library idioms for the hand-rolled for-looping and while-ing.

What is Perl's secret of getting small code do so much?

I've seen many (code-golf) Perl programs out there, and even if I can't read them (I don't know Perl) I wonder how you can manage to get such a small bit of code to do what would take 20 lines in some other programming language.
What is the secret of Perl? Is there a special syntax that allows you to do complex tasks in a few keystrokes? Is it the mix of regular expressions?
I'd like to learn how to write powerful and yet short programs like the ones you know from the code-golf challenges here. What would be the best place to start? I don't want to learn "clean" Perl - I want to write scripts that even I won't understand any more after a week.
If there are other programming languages out there with which I can write even shorter code, please tell me.
There are a number of factors that make Perl good for code golfing:
No data typing. Values can be used interchangeably as strings and numbers.
"Diagonal" syntax. Usually referred to as TMTOWTDI (There's more than one way to do it.)
Default variables. Most functions act on $_ if no argument is specified. (A few act on @_.)
Functions that take multiple arguments (like split) often have defaults that let you omit some arguments or even all of them.
The "magic" readline operator, <>.
Higher order functions like map and grep
Regular expressions are integrated into the syntax (i.e. not a separate library)
Short-circuiting operators return the last value tested.
Short-circuiting operators can be used for flow control.
Additionally, without strictures (which are off by default):
You don't need to declare variables.
Barewords auto-quote to strings.
undef becomes either 0 or '' depending on context.
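As a tiny illustration of how those defaults stack up:
# Echo non-blank lines from the files named on the command line:
# <> supplies the input, and both the match and print default to $_.
while (<>) {
    print if /\S/;
}
# Or golfed onto the command line:
#   perl -ne 'print if /\S/' somefile.txt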
Now that that's out of the way, let me be very clear on one point:
Golf is a game.
It's great to aspire to the level of perl-fu that allows you to be good at it, but in the name of $DEITY do not golf real code. For one, it's a horrible waste of time. You could spend an hour trying to trim out a few characters. Golfed code is fragile: it almost always makes major assumptions and blithely ignores error checking. Real code can't afford to be so careless. Finally, your goal as a programmer should be to write clear, robust, and maintainable code. There's a saying in programming: Always write your code as if the person who will maintain it is a violent sociopath who knows where you live.
So, by all means, start golfing; but realize that it's just playing around and treat it as such.
Most people miss the point of much of Perl's syntax and default operators. Perl is largely a "DWIM" (do what I mean) language. One of its major design goals is to "make the common things easy and the hard things possible".
As part of that, Perl designers talk about Huffman coding of the syntax and think about what people need to do instead of just giving them low-level primitives. The things that you do often should take the least amount of typing, and functions should act like the most common behavior. This saves quite a bit of work.
For instance, split has many defaults because in the common use cases you can simply leave things off. With no arguments, split breaks up $_ on whitespace because that's a very common use.
my @bits = split;
A bit less common but still frequent case is to break up $_ on something else, so there's a slightly longer version of that:
my @bits = split /:/;
And, if you wanted to be explicit about the data source, you can specify the variable too:
my @bits = split /:/, $line;
Think of this as you would normally deal with life. If you have a common task that you perform frequently, like talking to your bartender, you have a shorthand for it that covers the usual case:
The usual
If you need to do something slightly different, you expand that a little:
The usual, but with onions
But you can always note the specifics:
A dirty Bombay Sapphire martini shaken not stirred
Think about this the next time you go through a website. How many clicks does it take for you to do the common operations? Why are some websites easy to use and others not? Most of the time, the good websites require you to do the least amount of work to do the common things. Unlike my bank which requires no fewer than 13 clicks to make a credit card bill payment. It should be really easy to give them money. :)
This doesn't answer the whole question, but in regards to writing code you won't be able to read in a couple of days, here are a few languages that will encourage you to write short, virtually unreadable code:
J
K
APL
Golfscript
Perl has a lot of single-character special variables that provide a lot of shortcuts, e.g. $. $_ $# $/ $1 etc. I think it's that, combined with the built-in regular expressions, that allows you to write some very concise but unreadable code.
Perl's special variables ($_, $., $/, etc.) can often be used to make code shorter (and more obfuscated).
I'd guess that the "secret" is in providing native operations for often repeated tasks.
In the domain that perl was originally envisioned for, you often have to:
Take input linewise
Strip off whitespace
Rip lines into words
Associate pairs of data
...
and perl simply provided operators to do these things. The short variable names and the use of defaults for many things are just gravy.
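For illustration, a small sketch (a hypothetical word count) of how those built-in operations combine:
use strict;
use warnings;

my %count;
while (<>) {                       # take input linewise
    s/^\s+|\s+$//g;                # strip off whitespace
    my @words = split;             # rip the line into words
    $count{$_}++ for @words;       # associate pairs of data (word => count)
}
print "$_: $count{$_}\n" for sort keys %count;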
Nor was perl the first language to go this way. Many of the features of perl were stolen more-or-less intact (or often slightly improved) from sed and awk and various shells. Good for Larry.
Certainly perl wasn't the last to go this way; you'll find similar features in python and php and ruby and ... People liked the results and weren't about to give them up just to get more regular syntax.
What's Java's secret of copying a variable in only one line, without worrying about buses and memory? Answer: the code is transformed to bigger code. Same for every language ever invented.

Why are Perl source filters bad and when is it OK to use them?

It is "common knowledge" that source filters are bad and should not be used in production code.
When answering a similar, but more specific, question I couldn't find any good references that explain clearly why filters are bad and when they can be safely used. I think now is the time to create one.
Why are source filters bad?
When is it OK to use a source filter?
Why source filters are bad:
Nothing but perl can parse Perl. (Source filters are fragile.)
When a source filter breaks pretty much anything can happen. (They can introduce subtle and very hard to find bugs.)
Source filters can break tools that work with source code. (PPI, refactoring, static analysis, etc.)
Source filters are mutually exclusive. (You can't use more than one at a time -- unless you're psychotic).
When they're okay:
You're experimenting.
You're writing throw-away code.
Your name is Damian and you must be allowed to program in Latin.
You're programming in Perl 6.
Only perl can parse Perl (see this example):
@result = (dothis $foo, $bar);

# Which of the following is it equivalent to?
@result = (dothis($foo), $bar);
@result = dothis($foo, $bar);
This kind of ambiguity makes it very hard to write source filters that always succeed and do the right thing. When things go wrong, debugging is awkward.
After crashing and burning a few times, I have developed the superstitious approach of never trying to write another source filter.
I do occasionally use Smart::Comments for debugging, though. When I do, I load the module on the command line:
$ perl -MSmart::Comments test.pl
so as to avoid any chance that it might remain enabled in production code.
See also: Perl Cannot Be Parsed: A Formal Proof
I don't like source filters because you can't tell what code is going to do just by reading it. Additionally, things that look like they aren't executable, such as comments, might magically be executable with the filter. You (or more likely your coworkers) could delete what you think isn't important and break things.
Having said that, if you are implementing your own little language that you want to turn into Perl, source filters might be the right tool. However, just don't call it Perl. :)
It's worth mentioning that Devel::Declare keywords (and starting with Perl 5.11.2, pluggable keywords) aren't source filters, and don't run afoul of the "only perl can parse Perl" problem. This is because they're run by the perl parser itself, they take what they need from the input, and then they return control to the very same parser.
For example, when you declare a method in MooseX::Declare like this:
method frob ($bubble, $bobble does coerce) {
    ... # complicated code
}
The word "method" invokes the method keyword parser, which uses its own grammar to get the method name and parse the method signature (which isn't Perl, but it doesn't need to be -- it just needs to be well-defined). Then it leaves perl to parse the method body as the body of a sub. Anything anywhere in your code that isn't between the word "method" and the end of a method signature doesn't get seen by the method parser at all, so it can't break your code, no matter how tricky you get.
The problem I see is the same problem you encounter with any C/C++ macro more complex than defining a constant: It degrades your ability to understand what the code is doing by looking at it, because you're not looking at the code that actually executes.
In theory, a source filter is no more dangerous than any other module, since you could easily write a module that redefines builtins or other constructs in "unexpected" ways. In practice, however, it is quite hard to write a source filter in a way where you can prove that it's not going to make a mistake. I tried my hand at writing a source filter that implements the perl6 feed operators in perl5 (Perl6::Feeds on CPAN). You can take a look at the regular expressions to see the acrobatics required to simply figure out the boundaries of expression scope. While the filter works, and provides a test bed to experiment with feeds, I wouldn't consider using it in a production environment without many, many more hours of testing.
Filter::Simple certainly comes in handy by dealing with 'the gory details of parsing quoted constructs', so I would be wary of any source filter that doesn't start there.
In all, it really depends on the filter you are using and how broad a scope it tries to match against. If it is something simple like a C macro, then it's "probably" OK, but if it's something complicated then it's a judgement call. I personally can't wait to play around with perl6's macro system. Finally Lisp won't have anything on Perl :-)
There is a nice example here that shows what trouble you can get into with source filters.
http://shadow.cat/blog/matt-s-trout/show-us-the-whole-code/
They used a module called Switch, which is based on source filters, and because of that they were unable to find the source of an error message for days.

Why shouldn't I use shell tools in Perl code?

It is generally advised not to use additional Linux tools in Perl code;
e.g. if someone wants to print the last line of a text file, they can write:
$last_line = `tail -1 $file` ;
or otherwise, open the file and read it line by line:
open(INFO, $file) or die "Can't open $file: $!";
while (<INFO>) {
    $last_line = $_ if eof;
}
What are the pitfalls of the former approach, and why should I avoid using shell tools in my code?
Efficiency - you don't have to spawn a new process
Portability - you don't have to worry about an executable not existing, accepting different switches, or having different output
Ease of use - you don't have to parse the output, the results are already in a usable form
Error handling - you have finer-grained control over errors and what to do about them in Perl.
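A small sketch of that last point (illustrative only, reusing $file from the question):
# With the external tool you mostly get an exit status to inspect:
my $last_line = `tail -1 $file`;
die "tail failed with exit status ", $? >> 8, "\n" if $?;

# In pure Perl the failure is specific and carries useful detail in $!:
open my $fh, '<', $file or die "Can't open $file: $!\n";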
It's better to keep all the action in Perl because it's faster and because it's more secure. It's faster because you're not spawning a new process, and it's more secure because you don't have to worry about shell meta character trickery.
For example, in your first case if $file contained "afilename ; rm -rf ~" you would be a very unhappy camper.
P.S. The best all-Perl way to do the tail is to use File::ReadBackwards.
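For example, a minimal sketch with File::ReadBackwards (assuming the module is installed from CPAN):
use File::ReadBackwards;

my $bw = File::ReadBackwards->new($file)
    or die "Can't read $file: $!";
my $last_line = $bw->readline;   # reading backwards, the first line returned is the file's last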
One of the primary reasons (besides portability) for not executing shell commands is that it introduces overhead by spawning another process. That's why much of the same functionality is available via CPAN in Perl modules.
One reason is that your Perl code might be running in an environment where there is no shell tool called 'tail'.
It's a personal call depending on the project:
Is it going to be always used in shell environments with tail?
Do you care about only using pure Perl code?
Using tail? Fine. But that's really a special case, since it's so easy to use and so trivial.
The problem in general is not really efficiency or portability; that is largely irrelevant. The issue is ease of use. To run an external utility, you have to find out what arguments it accepts, write code to transform your program's data structures into that format, quote them properly, build the command line, and run the application. Then you might have to feed it data and read data from it (involving complexity like an event loop, worrying about deadlocking, etc.), and finally interpret the return value. (UNIX processes consider an exit status of "0" to mean success and anything else failure, but Perl treats 0 as false, so something like foo() and die is hard to read.) This is a lot of work, and that's why people avoid it. It's much easier to create an instance of a class and call methods on it to get the data you need.
(You can abstract away processes this way; see Crypt::GpgME for example. It handles the complexity associated with invoking gpg, which would normally involve creating multiple filehandles other than STDOUT, STDIN, and STDERR, among other things.)
The main reason I see for doing it all in Perl would be for robustness. Your use of tail will fail if the filename has shell metacharacters or spaces or doesn't exist or isn't accessible. From Perl, characters in the filename aren't an issue, and you can distinguish between errors in accessing the file. Sometimes being robust is more important than speedy coding and sometimes it's not.