Clean up huge Perl codebase

I am currently working on a roughly 15-year-old web application.
It consists mainly of CGI Perl scripts with HTML::Template templates.
It has over 12,000 files and roughly 260 MB of code in total. I estimate that no more than 1,500 Perl scripts are actually needed, and I want to get rid of all the unused code.
There are practically no tests written for the code.
My questions are:
Are you aware of any CPAN module that can help me get a list of only used and required modules?
What would be your approach if you'd want to get rid of all the extra code?
I was thinking of the following approaches:
try to override the use and require Perl builtins with ones that log the loaded file name to a specific location (see the sketch after this list)
override the import function of the warnings and/or strict modules and log the file name to the same location
study the Devel::Cover module, take the same approach, and analyze the code during manual testing instead of automated tests
replace the perl executable with a custom wrapper that logs the name of each file it reads (I don't know how to do that yet)
some creative use of lsof (?!?)
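For the first of those approaches, rather than literally overriding the builtins, a hook subroutine pushed onto @INC will see every use/require lookup that actually searches @INC; it can log the name and then decline so loading proceeds normally. A minimal sketch, assuming a hypothetical log path (it could live in a small module pulled into every script via PERL5OPT rather than edited into each file):

BEGIN {
    open my $log, '>>', '/tmp/loaded-files.log'
        or die "Cannot open load log: $!";
    unshift @INC, sub {
        my ($self, $file) = @_;   # the hook itself, then e.g. "CGI.pm"
        print {$log} "$file\n";
        return;                   # decline, so the rest of @INC is searched as usual
    };
}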

Devel::Modlist may give you what you need, but I have never used it.
The few times I have needed to do something like this I have opted for the more brute-force approach of inspecting %INC at the end of the program.
END {
    open my $log_fh, ...;
    print $log_fh "$_\n" for sort keys %INC;
}
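Fleshed out, that END block might look something like this (the log path and the choice to append are my own, so adjust to taste; prefixing $0 ties the modules to the script that loaded them):

END {
    if (open my $log_fh, '>>', '/tmp/used-modules.log') {
        print {$log_fh} "$0: $_\n" for sort keys %INC;
        close $log_fh;
    }
}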

As a first approximation, I would simply run
egrep -r '\<(use|require)\>' /path/to/source/*
Then spend a couple of days cleaning up the output from that. That will give you a list of all of the modules used or required.
You might also be able to play around with @INC to exclude certain library paths.
If you're trying to determine execution path, you might be able to run the code through the debugger with 'trace' (i.e. 't' in the debugger) turned on, then redirect the output to a text file for further analysis. I know that this is difficult when running CGI...
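For a non-interactive trace of a single run, the debugger can be driven entirely from the environment; roughly like this (see perldebug for the option details, and treat the file names as examples):

PERLDB_OPTS="NonStop AutoTrace LineInfo=trace.out" perl -d some_script.cgi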

Assuming the relevant timestamps are turned on, you could check access times on the various script files - that should rule out any top-level script files that aren't being used.
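For example, assuming atime is not disabled on the filesystem (noatime/relatime mounts make it unreliable), something like this gives a first cut of candidates; the path and the 90-day threshold are just examples:

find /path/to/cgi-bin -name '*.cgi' -atime +90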
It might be worth adding some instrumentation to CGI.pm to log the current script name ($0) to see what's actually being run.
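One way to do that without patching CGI.pm itself is a tiny module (the name, log path, and PERL5OPT trick below are all my own suggestions) that every script pulls in, e.g. via PERL5OPT=-MLogScriptUse with its directory on PERL5LIB:

# LogScriptUse.pm - hypothetical instrumentation module
package LogScriptUse;
use strict;
use warnings;

# Append the invoked script name and a timestamp to a shared log.
if (open my $fh, '>>', '/tmp/script-usage.log') {
    print {$fh} scalar(localtime), " $0\n";
    close $fh;
}

1;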

Related

Is Perl unit-testing only for modules, not programs?

The docs I find around the ’net and the book I have, Perl Testing, either say or suggest that unit-testing for Perl is usually done when creating modules.
Is this true? Is there no way to unit-test actual programs using Test::More and cousins?
Of course you can test scripts using Test::More. It's just harder, because most scripts would need to be run as a separate process from which you capture the output, and then test it against expected output.
This is why modulinos (see chapter 17 in: brian d foy, Mastering Perl, second edition, O'Reilly, 2014) were developed. A modulino is a script that can also be used as a module. This makes it easier to test, as you can load the modulino into your test script and then test its functions like you would a regular module.
The key feature of a modulino is this:
#!/usr/bin/perl
package App::MyName;    # put it in a package

run() unless caller;    # run the program unless loaded as a module

sub run {
    ...                 # your program here
}
The function doesn't have to be called run; you could use main if you're a C programmer. You'd also normally have additional subroutines that run calls as needed.
Then your test scripts can use require "path/to/script" to load your modulino and exercise its functions. Since many scripts involve writing output, and it's often easier to print as you go instead of doing print sub_that_returns_big_string(), you may find Test::Output useful.
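A hedged sketch of such a test script, where the script path and the munge() function are hypothetical (run() comes from the modulino skeleton above):

#!/usr/bin/perl
use strict;
use warnings;
use Test::More tests => 2;
use Test::Output;

# Loading the modulino: caller() is true inside it, so run() is not invoked.
require './bin/myapp.pl';

# Exercise its functions directly.
is( App::MyName::munge('abc'), 'cba', 'munge reverses its input' );

# And check printed output with Test::Output.
stdout_like( sub { App::MyName::run() }, qr/^Report for/, 'run() prints a report header' );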
It's not the easiest way to test your code, but you can test the script directly. You can run the script with specific parameters using system (or, better, IPC::Run3), capture the output, and compare it with the expected result.
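A rough sketch of that approach with IPC::Run3; the script name, arguments, and expected output are all hypothetical:

use strict;
use warnings;
use Test::More tests => 2;
use IPC::Run3;

my $stdin = '';
my ($stdout, $stderr);
run3 [ $^X, 'bin/report.pl', '--date', '2014-01-01' ], \$stdin, \$stdout, \$stderr;

is( $? >> 8, 0, 'script exited successfully' ) or diag $stderr;
like( $stdout, qr/^Total:\s+\d+/m, 'report contains a total line' );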
But this will be a top-level test. It'll be hard to tell which part of your code caused the problem.
Unit tests are used to test individual modules. This makes it easier to see where the problem came from. Testing functions individually is also much easier, because you only need to think about what can happen in a smaller piece of code.
How you test depends on your project's size. You can, of course, put everything into a single file, but putting your code into a module (or even splitting it into several modules) will give you benefits in the future: the code can be reused and tested more easily.

How to split long Perl code into several files without too much manual editing?

How do I split a long Perl script into two or more different files that can all access the same variables - without having to rename all shared variables from e.g. $count to $::count (or $main::count which is the same)?
In other words, what's the best and simplest way to split the Perl script into several files without having to import a lot of variables/functions and/or do a lot of manual editing?
I assume it has something to do with making the code part of the same package/scope/namespace, but my experiments so far have failed.
I am not sure it makes a difference, but the script is used for web/CGI purposes and will be running under mod_perl.
EDIT - Background:
I kind of knew I would get that response. The reason I want to split up the file is the following:
Currently I have a single very old and very long Perl file. I know it is not following Perl best practices but it works.
The problem is, I need to distribute the data files it uses between different web servers, first of all for performance reasons. There will be one "master" server and one or several "slaves".
About 20% of the mentioned Perl file contains shared functions, 40% is code that needs to run on the master server and 40% is code for the slave servers. Therefore, I would like to split the code into three files: 1. shared, 2. master-only, 3. slave-only. On the master server, 1 and 2 will be loaded; on the slaves, 1 and 3 will be loaded.
I assume this approach would use less process RAM and, more importantly, I would minimize the risk of not splitting the code correctly (e.g. a slave process calling a master data file). I don't see a great need for modularization, as the system works and the code does not need a lot of changes or exchanges with other projects.
EDIT 2 - Solution:
Found the solution I was looking for here:
http://www.perlmonks.org/?node_id=95813
In cases where the main package is in ownership of the variable, the
actual word 'main' can be omitted to yield something like: $::var
It is possible to get around having to fully qualify variable names
when strict is in use. Applying a simple use vars to your script, with
the variable names as its arguments, will get around explicit package
names.
Actually, I ended up repeating the our ($count, etc...) statement for the needed variables instead of use vars ();
Do let me know if I am missing something vital - apart from not going with modules! :)
@Axeman, thanks, I will accept your answer, both for your effort and for sending me in the right direction.
Unless you put different package statements in their files, they will all be treated as if they had package main; at the top. So assuming that the scripts use package variables, you shouldn't have to do anything. If you have declared them with my (that is, if they are lexically scoped variables) then you would have to make sure that all references to the variables are in the same file.
But splitting scripts up for length is a rotten substitute for modularization. Yes, splitting files up helps keep their length down, but modularization is the proper way to keep code length down--for all the reasons that you would want to keep code length down, modularization does it best.
If chopping the files by length could really work for you, then you could create a script like this:
do '/path/to/bin/part1.pl';
do '/path/to/bin/part2.pl';
do '/path/to/bin/part3.pl';
...
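Putting those two points together, a minimal sketch of the shared-variable behaviour (the file names are hypothetical): the main script declares the package variable and pulls in a part with do, and the part just redeclares it with our so it compiles under strict:

# main.pl
use strict;
use warnings;

our $count = 0;    # the package variable $main::count

do './master_part.pl' or die "master_part.pl failed: " . ($@ || $!);

print "count is now $count\n";

# master_part.pl - no package statement, so this code is also in package main
use strict;
use warnings;

our $count;        # the same $main::count as in main.pl
$count += 10;

1;                 # so the do above sees a true return value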
But I kind of suspect that if the organization of this code is as bad as you're--sort of--indicating, it might suffer from some of the same re-inventing of the wheel that I've seen in Perl-ignorant scripts. Just offhand (I might be wrong), but I'm thinking you would be surprised how much could be chopped from the length simply by substituting better-tested Perl library idioms for the hand-rolled for-looping and while-ing of everything.

How expensive is: require "foo.pl";

I'm about to rewrite a large portion of a project that I have developed over the last 10 years while learning Perl. There is a lot of optimisation to be gained.
A key part of the code is a large if/elsif block that requires xxx.cgi files depending on a POST value, e.g.:
if($FORM{'action'} eq "1"){require "1.cgi";}
elsif($FORM{'action'} eq "2"){require "2.cgi";}
elsif($FORM{'action'} eq "3"){require "3.cgi";}
elsif($FORM{'action'} eq "4"){require "4.cgi";}
It has many more irritations, but just how expensive is using "require" in Perl?
require itself has a relatively low cost in any case and, if you require the same file more than once within a single run of your program, it will detect that the file has already been loaded and not attempt to load it a second time. However, if you have a long and highly-populated search path (@INC) and you require (or use) a lot of files, it's possible that all of the directory searches could add up; this isn't common (and doesn't sound likely in your case), but it can be improved by reorganizing your module directories so that the things you're loading show up earlier in @INC.
The potentially-major performance hit referred to by earlier answers is the cost of compiling the code in the files you require. Getting rid of the require by moving the code into your main program will not help with this, as the code will still need to be compiled. In your case, it would probably make things worse, as it would cause the code for all options to be compiled on every run rather than only compiling the code used by the one action selected by the user.
As has been said, it really depends on the actual code in those files. Your best bet would be to do tests using Devel::NYTProf and/or Benchmark to see where the most time is being spent in your code if you are unhappy with its performance.
You can also read Profiling Perl on perl.com, but it is a bit outdated as it uses Devel::DProf.
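For a command-line run (CGI scripts can usually be exercised this way with the right environment variables set), using Devel::NYTProf looks roughly like this; the script name is an example:

perl -d:NYTProf your_script.cgi    # writes nytprof.out in the current directory
nytprofhtml                        # converts it into an HTML report under ./nytprof/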
Not an answer to your primary question, but there is still a good idea for a code refactor that I read recently on Ovid's blog.
The first time, possibly expensive; Perl has to search a path to find the file and load it up. Subsequent times, it's cheap -- a table is consulted and the file isn't actually loaded a second time. If this is in a CGI that is run once per request and then exited, then this is not too good.
It's really going to depend on the size of the files you're pulling in. If you have massive CGI files, then it might hurt the performance of your software. If we're talking 6 or 7 lines of code each, then it's no issue. Try benchmarking your program's performance with and without, and make your own judgement.
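If you do benchmark, note that require's compile cost is paid once per process, so the fairest comparison is between whole runs of each variant. A rough Benchmark sketch, with hypothetical script names:

use strict;
use warnings;
use Benchmark qw(timethese);

# Each iteration starts a fresh perl, so compilation cost is included every time.
timethese(20, {
    with_require => sub { system($^X, 'variant_require.cgi') == 0 or die "run failed: $?" },
    inlined      => sub { system($^X, 'variant_inline.cgi')  == 0 or die "run failed: $?" },
});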

Enable global warnings

I have to optimize an intranet written in Perl (about 3000 files). The first thing I want to do is enable warnings "-w" or "use warnings;" so I can get rid of all those errors, then try to implement "use strict;".
Is there a way of telling Perl to use warnings all the time (like the settings in php.ini for PHP), without needing to modify each script to add "-w" to its first line?
I even thought of making an alias for /usr/bin/perl, or moving it to another name and putting a simple wrapper script in its place just to add the -w flag (like a proxy).
How would you debug it?
Well…
You could set the PERL5OPT environment variable to hold -w. See the perlrun manpage for details. I hope you’ll consider tainting, too, like -T or maybe -t, for security tracking.
But I don’t envy you. Retrofitting code developed without the benefit of use warnings and use strict is usually a royal PITA.
I have something of a standard boiler-plate I use to start new Perl programs. But I haven’t given any thought to one for CGI programs, which would likely benefit from some tweaks against that boiler-plate.
Retrofitting warnings and strict is hard. I don't recommend a Big Bang approach, setting warnings (let alone strictures) on everything. You will be inundated with warnings to the point of uselessness.
You start by enabling warnings on the modules used by the scripts (there are some, aren't there?), rather than applying warnings to everything. Get the core clean, then get to work on the periphery, one unit at a time. So, in fact, I'd recommend having a simple Perl script that finds the first line that does not start with a hash and inserts use warnings; (and maybe use strict; too) before it, so you can do the renovations one script at a time.
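A rough one-liner version of that renovation script (it blindly assumes the pragma isn't already there, so keep the .bak files around):

perl -i.bak -pe 'if ($. == 1) { $_ = /^#!/ ? $_ . "use warnings;\n" : "use warnings;\n" . $_ }' some_script.cgi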
In other words, you will probably be best off actually editing each file as you're about to renovate it.
I'd only use the blanket option to make a simple assessment of the scope of the problem: is it a complete and utter disaster, or merely a few peccadilloes in a few files. Sadly, if the code was developed without warnings and strict, it is more likely to be 'disaster' than 'minimal'.
You may find that your predecessors were prone to copy and paste and some erroneous idioms crop up repeatedly in copied code. Write a Perl script that fixes each one. I have a bunch of fix* scripts in my personal bin directory that deal with various changes - either fixing issues created by recalcitrant (or, more usually, simply long departed) colleagues or to accommodate my own changing standards.
You can set warnings and strictures for all Perl scripts by adding -Mwarnings -Mstrict to your PERL5OPT environment variable. See perlrun for details.
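For example, in the shell (or in whatever environment starts your web server); the script name and log path are just illustrations:

export PERL5OPT='-Mwarnings'
perl some_script.cgi 2>> /tmp/perl-warnings.log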

What are the best-practices for implementing a CLI tool in Perl?

I am implementing a CLI tool using Perl.
What are the best-practices we can follow here?
As a preface, I spent 3 years engineering and implementing a pretty complicated command line toolset in Perl for a major financial company. The ideas below are basically part of our team's design guidelines.
User Interface
Command line options: allow as many as possible to have default values.
NO positional parameters for any command that has more than 2 options.
Have readable option names. If the length of the command line is a concern for non-interactive calling (e.g. some un-named legacy shells have short limits on command lines), provide short aliases - Getopt::Long allows that easily.
At the very least, print all options' default values in the '-help' message.
Better yet, print all the options' "current" values (e.g. if a parameter and a value are supplied along with "-help", the help message will print the parameter's value from the command line). That way, people can assemble the command line string for a complicated command and verify it by appending "-help" before actually running it.
Follow Unix standard convention of exiting with non-zero return code if program terminated with errors.
If your program may produce useful (e.g. worth capturing/grepping/whatnot) output, make sure any error/diagnostic messages go to STDERR so they are easily separable.
Ideally, allow the user to specify input/output files via command line parameters, instead of forcing "<" / ">" redirects - this makes life MUCH simpler for people who need to build complicated pipes using your command. Ditto for error messages - have a logfile option.
If a command has side effects, having a "whatif/no_post" option is usually a Very Good Idea.
Implementation
As noted previously, don't re-invent the wheel. Use standard command line parameter handling modules - MooseX::Getopt, or Getopt::Long
For Getopt::Long, assign all the parameters to a single hash as opposed to individual variables (see the sketch after this list). Many useful patterns include passing that CLI args hash to object constructors.
Make sure your error messages are clear and informative... E.g. include "$!" in any IO-related error messages. It's worth spending an extra minute and two lines in your code to have separate "file not found" vs. "file not readable" errors, as opposed to spending 30 minutes in a production emergency because a non-readable file error was misdiagnosed by Production Operations as "No input file" - this is a real-life example.
Not really CLI-specific, but validate all parameters, ideally right after getting them.
CLI doesn't allow for a "front-end" validation like webapps do, so be super extra vigilant.
As discussed above, modularize business logic. Among other reasons already listed, the number of times I have had to re-implement an existing CLI tool as a web app is vast - and it is not that difficult if the logic is already in a properly designed Perl module.
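A minimal sketch of that single-hash Getopt::Long pattern; the option names, defaults, and the MyTool class are all just examples:

use strict;
use warnings;
use Getopt::Long;

my %opt = (
    verbose => 0,                  # defaults live in one obvious place
    logfile => '/tmp/mytool.log',  # hypothetical default
);

GetOptions( \%opt,
    'verbose|v!',
    'logfile|l=s',
    'output|o=s',
) or die "Error in command line arguments\n";

# The whole hash can then be handed straight to a constructor:
# my $app = MyTool->new(%opt);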
Interesting links
CLI Design Patterns - I think this is ESR's
I will try to add more bullets as I recall them.
Use POD to document your tool, and follow the guidelines of manpages; include at least the following sections: NAME, SYNOPSIS, DESCRIPTION, AUTHOR. Once you have proper POD you can generate a man page with pod2man, and view the documentation at the console with perldoc your-script.pl.
Use a module that handles command line options for you. I really like using Getopt::Long in conjunction with Pod::Usage; this way invoking --help will display a nice help message.
Make sure that your script returns a proper exit value indicating whether it was successful or not.
Here's a small skeleton of a script that does all of these:
#!/usr/bin/perl

=head1 NAME

simple - simple program

=head1 SYNOPSIS

    simple [OPTION]... FILE...

        -v, --verbose  use verbose mode
        --help         print this help message

Where I<FILE> is a file name.

Examples:

    simple /etc/passwd /dev/null

=head1 DESCRIPTION

This is a simple program.

=head1 AUTHOR

Me.

=cut

use strict;
use warnings;
use Getopt::Long qw(:config auto_help);
use Pod::Usage;

exit main();

sub main {
    # Argument parsing
    my $verbose;
    GetOptions(
        'verbose' => \$verbose,
    ) or pod2usage(1);
    pod2usage(1) unless @ARGV;

    my @files = @ARGV;
    foreach my $file (@files) {
        if (-e $file) {
            print "File $file exists\n" if $verbose;
        }
        else {
            print "File $file doesn't exist\n";
        }
    }

    return 0;
}
Some lessons I've learned:
1) Always use Getopt::Long
2) Provide help on usage via --help, ideally with examples of common scenarios. It helps people who don't know or have forgotten how to use the tool (i.e., you in six months).
3) Unless it's pretty obvious to the user why, don't go for a long period (>5s) without output to the user. Something like 'print "Row $row...\n" unless ($row % 1000)' goes a long way.
4) For long running operations, allow the user to recover if possible. It really sucks to get through 500k of a million, die, and start over again.
5) Separate the logic of what you're doing into modules and leave the actual .pl script as barebones as possible; parsing options, display help, invoking basic methods, etc. You're inevitably going to find something you want to reuse, and this makes it a heck of a lot easier.
The most important thing is to have standard options.
Don't try to be clever; simply be consistent with already existing tools.
How to achieve this is also important, but only comes second.
Actually, this is quite generic to all CLI interfaces.
There are a couple of modules on CPAN that will make writing CLI programs a lot easier:
App::CLI
App::Cmd
If your app is Moose-based, also have a look at MooseX::Getopt and MooseX::Runnable.
The following points aren't specific to Perl but I've found many Perl CL scripts to be deficient in these areas:
Use common command line options. To show the version number implement -v or --version not --ver. For recursive processing -r (or perhaps -R although in my Gnu/Linux experience -r is more common) not --rec. People will use your script if they can remember the parameters. It's easy to learn a new command if you can remember "it works like grep" or some other familiar utility.
Many command line tools process "things" (files or directories) within the "current directory". While this can be convenient make sure you also add command line options for explicitly identifying the files or directories to process. This makes it easier to put your utility in a pipeline without developers having to issue a bunch of cd commands and remember which directory they're in.
You should use Perl modules to make your code reusable and easy to understand.
You should also have a look at Perl Best Practices.