How should I restructure a large Perl script?


I have a fairly large Perl script of about 1000 lines. The script accepts a few arguments and runs straight through from top to bottom: no modules, no functions. It could be divided into three parts (initialization, argument parsing, and the actual work), but I don't know how to do that. Everything must be kept in a single file. Can anyone give me advice on how to structure my Perl script?
Thanks.

You ask for advice on how to refactor your script, but you don't appear to understand why to refactor it. Without the why, the how isn't going to do you much good. And with the why, the how may fall out quite naturally.
If your script is working perfectly and needs no modification and all you'll ever do with it is run it, then you probably don't have a reason to refactor it - and I say that from the perspective of despising long routines. But...
If something's wrong with it
If you are trying to find a bug in your 1,000-line program, you have some hard work ahead of you. The problem could be anywhere. Break it up into smaller pieces so that you can verify the input and output at different stages - ideally, write tests for the smaller pieces. Fine-grained unit tests will tell you what isn't working, the nature of the error, and where the error exists.
If you need to modify it
If you need to change the script to - say - accommodate a new graphics format, or take advantage of multiple processors, or record its activities to a log - you will find it easier to extend if the program elements that need revision or extension are better isolated.
If you're trying to explain it to someone else, or show it off
You will find it much easier to convey the ideas in your script to another developer if the ideas are broken out into discrete methods.
So, there are some reasons why you might choose to refactor. If any of them apply, refactor accordingly; the how will drop out naturally. Extract Method may be your best friend.
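As a minimal sketch of what Extract Method plus a fine-grained test might look like in Perl: suppose the long script parses "key=value" lines inline (an assumed example, not the asker's actual code). Pulling that step into a hypothetical parse_pair sub lets Test::More, a core module, check it in isolation.

```perl
use strict;
use warnings;
use Test::More;

# Hypothetical helper extracted from a long script: split "key=value"
# into its two parts, trimming surrounding whitespace.
sub parse_pair {
    my ($line) = @_;
    my ($key, $value) = $line =~ /^\s*(\w+)\s*=\s*(.*?)\s*$/
        or return;    # empty list (undef in scalar context) on bad input
    return ($key, $value);
}

# Each test pins down one small, verifiable piece of behaviour.
is_deeply([ parse_pair('name = demo ') ], [ 'name', 'demo' ], 'splits and trims');
is_deeply([ parse_pair('no separator') ], [],                 'rejects bad input');

done_testing();
```

Once a piece like this is covered by tests, a failure points at that sub rather than at "somewhere in 1,000 lines".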

If you can see logical parts of your script, you should definitely abstract them into functions. Having a single script of over 1000 lines, and not breaking it up into whatever abstraction units your language provides (functions, classes, etc.), is a very bad idea. Maintaining your script, i.e. adding features and fixing bugs, will be a nightmare.
I strongly suggest you read the book Clean Code by Robert C. Martin. It uses Java for examples, but the ideas are applicable to any language. The one that is most relevant here is "Make your functions small. Then make them smaller."

1000 lines and no functions? Why not? This is a vague question.

You could break each section into a separate function, and then add one more function, called 'run()' or something similar, that calls each of them in the correct order. This would break the program up into more manageable chunks.
p.s. man I think I used the word function too many times in this answer!

Have you refactored at all? At 1000 lines I'd expect to see some code that could be broken down into functions internal to the script.
Well, if you have three separate sections that's the logical choice.
You could make each one into a function and then have a simple linear control at the top:
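use strict;
use warnings;

my ($var1, $var2, $var3);   # parentheses are needed to declare all three
$var1 = init();
$var2 = parseInput();
doWork();                   # a plain sub call, with no leading $

sub init {
    # some code here
}
sub parseInput {
    # some code here
}
sub doWork {
    # some code here
}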
The big issue is you're going to be using globals a lot. I'd build them into a structure or two. I would also expect to see the big three broken down into functions themselves. Back in the 80s when the big thing I was learning was structured programming (the best design here I think) the rule of thumb was a function should fit on roughly one screen or less.
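One way to "build them into a structure" is to pass a single hash reference between the three stages instead of sharing file-level globals. The sketch below is only an illustration: the field names, the -v flag, and the run() wrapper are all hypothetical.

```perl
use strict;
use warnings;

# All names here are hypothetical; the point is that state travels in one
# hash reference instead of many file-level globals.
sub init {
    return { verbose => 0, input => undef };
}

sub parse_input {
    my ($state, @args) = @_;
    for my $arg (@args) {
        if ($arg eq '-v') { $state->{verbose} = 1 }
        else              { $state->{input}   = $arg }
    }
    return $state;
}

sub do_work {
    my ($state) = @_;
    my $input = $state->{input} // '(none)';
    return "input=$input verbose=$state->{verbose}";
}

# Simple linear control: each stage receives and returns the same structure.
sub run {
    my (@args) = @_;
    my $state = parse_input(init(), @args);
    return do_work($state);
}

print "result: ", run(@ARGV), "\n";
```

Because every sub gets its data through arguments, each one can be called on its own with a hand-built hash, which is what makes testing the pieces practical.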

Most people would typically answer something like "a subroutine should do one thing" and "a subroutine should only take up one page in your editor". You can try to keep these things in mind when you refactor your code.
Try to identify parts of your code that can be split off into logical sections. You've started this process by spotting 'initialization', 'argument parsing', and 'work'. See if there are sub-sections within those that can be split off into subroutines of their own.
Also, why do you not use any modules? One that springs to mind is Getopt::Long, which is a core module, so you won't have to install it manually. It will handle all of your argument parsing, and by using it you will probably avoid bugs and could shorten your code to make it more maintainable. By using standard modules like this, you not only (hopefully!) reduce the number of bugs in your code, you make it easier for other Perl programmers to understand.
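A minimal sketch of argument parsing with Getopt::Long follows. The option names (--input, --count, --verbose) and the parse_args wrapper are made up for illustration; GetOptionsFromArray is used rather than GetOptions so that the sub can be called with any argument list, not only @ARGV.

```perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Option names below are hypothetical examples, not the asker's options.
sub parse_args {
    my (@argv) = @_;
    my %opt = (verbose => 0, count => 1);    # defaults
    GetOptionsFromArray(
        \@argv,
        'input=s'  => \$opt{input},      # string value, e.g. --input data.txt
        'count=i'  => \$opt{count},      # integer value
        'verbose!' => \$opt{verbose},    # flag; --noverbose turns it off
    ) or die "Usage: $0 --input FILE [--count N] [--verbose]\n";
    $opt{rest} = [@argv];                # whatever was not an option
    return \%opt;
}

my $opt = parse_args(@ARGV);
print "verbose is ", ($opt->{verbose} ? "on" : "off"), "\n";
```

The whole "arguments parsing part" of the script then collapses into one sub that returns a hash reference.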

You could look at search.cpan.org; maybe some Perl module suits your needs. For example, there is CGI::Application.

Related

How to add logging information to perl legacy code

I have a medium-to-large system built in Perl that has been developed over the last 15 years and consists of many scripts and .pm files.
To improve the system I need more data. The easiest way I can see to get it is to have every function in the code print its start and end time to a log, so that it becomes possible to understand what is taking the most time.
However, this is an old system and some parts are less maintainable than others. On top of that, it has to keep running, which means that to get real data I need it to print this out from production.
What I want to do is somehow override the function declaration so that each function start is wrapped in a line like
NAME start STARTTIME PARAMS
and when it leaves the function
NAME ended STARTTIME PARAMS
Can anybody point me in the right direction?
Thanks

Take a look at Devel::NYTProf. It can profile the amount of time that all of your subs are taking (and do a lot more). It doesn't involve a lot of messy code modification; instead you just run your script with it:
perl -d:NYTProf your_script.pl

Previous answers are spot on (I especially recommend Devel::NYTProf). However, a more general technique for gathering data about your subroutines' behaviour is fiddling with the symbol table, "appending" (or prepending) code to the actual sub's code.
A couple of pointers:
In Perl, can I call a method before executing every function in a package? (this answer shows a code example you could adapt to your particular situation)
Hook::LexWrap is a module that lets you augment subroutine behaviour in several ways, without touching the original code
HTH
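A rough sketch of the symbol-table approach, using only core modules: it walks a package's symbol table and replaces each sub with a wrapper that records the requested "start"/"ended" lines. The Legacy package, the instrument_package sub, and the in-memory @LOG are stand-ins invented for this example; real code would log to a file, and Hook::LexWrap packages the same idea more robustly.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical legacy package standing in for the real code.
package Legacy;
sub add   { my ($x, $y) = @_; return $x + $y }
sub greet { my ($name) = @_; return "hello $name" }

package main;

our @LOG;    # collected in memory for the demo only

# Replace every sub defined in $package with a wrapper that logs
# "NAME start STARTTIME PARAMS" and "NAME ended STARTTIME PARAMS".
sub instrument_package {
    my ($package) = @_;
    no strict 'refs';
    no warnings 'redefine';
    for my $name (sort keys %{"${package}::"}) {
        my $orig = *{"${package}::$name"}{CODE} or next;    # skip non-subs
        *{"${package}::$name"} = sub {
            my $started = time;
            push @LOG, "$name start $started @_";
            # Call the original in the same context the caller used.
            my @result = wantarray ? $orig->(@_) : scalar $orig->(@_);
            push @LOG, "$name ended $started @_";
            return wantarray ? @result : $result[0];
        };
    }
}

instrument_package('Legacy');
my $sum = Legacy::add(2, 3);
print "sum: $sum\n";
print "$_\n" for @LOG;
```

The original subs are untouched on disk; the wrapping happens at run time, so it can be switched on from a single place. Note that the wrapper adds overhead to every call, which matters in production.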

Sounds like you need to use a profiler:
http://www.perl.org/about/whitepapers/perl-profiling.html

Perl profilers usually have a huge impact on the program performance, so using them in production may not be a great idea.
You can try Devel::ContinuousProfiler that claims to have very low impact (I myself have never used it, just discovered it this morning!)

Design - When to create new functions?

This is a general design question not relating to any language. I'm a bit torn between going for minimum code or optimum organization.
I'll use my current project as an example. I have a bunch of tabs on a form that perform different functions. Let's say Tab 1 reads in a file with a specific layout, Tab 2 exports a file to a specific location, etc. The problem I'm running into now is that I need these tabs to do something slightly different based on the contents of a variable. If it contains a 1 I may need to use Layout A and perform some extra concatenation, if it contains a 2 I may need to use Layout B and do no concatenation but add two integer fields, etc. There could be 10+ codes that I will be looking at.
Is it preferable to create an individual path for each code early on, or to attempt to create a single path that branches out only when absolutely required?
Creating an individual path for each code would allow my code to be extremely easy to follow at a glance, which in turn will help me out later on down the road when debugging or making changes. The downside to this is that I will increase the amount of code written by calling some of the same functions in multiple places (for example, steps 3, 5, and 9 for every single code may be exactly the same).
Creating a single path that would branch out only when required will be a bit messier and more difficult to follow at a glance, but I would create less code by placing conditionals only at steps that are unique.
I realize that this may be a case-by-case decision, but in general, if you were handed a previously built program to work on, which would you prefer?
Edit: I've drawn some simple images to help express it. Codes 1/2/3 are the variables and the lines under them represent the paths they would take. All of these steps need to be performed in a linear chronological fashion, so there would be a function to essentially just call other functions in the proper order.
[Image: Different Paths]
[Image: Single Path]

"Creating a single path that would branch out only when required will be a bit messier and more difficult to follow at a glance, but I would create less code by placing conditionals only at steps that are unique."
I'm not buying this statement. There is a level of finesse when deciding when to write new functions. Functions should be as simple and reusable as possible (but no simpler). The correct answer is almost never 'one big file that does a lot of branching'.
Less LOC (lines of code) should not be the goal. Readability and maintainability should be the goal. When you create functions, the names should be self documenting. If you have a large block of code, it is good to do something like
function doSomethingComplicated() {
stepOne();
stepTwo();
// and so on
}
where the function names are self documenting. Not only will the code be more readable, you will make it easier to unit test each segment of the code in isolation.
For the case where you will have a lot of methods that call the exact same methods, you can use good OO design and design patterns to minimize the number of functions that do the same thing. This is in reference to your statement "The downside to this is that I will increase the amount of code written by calling some of the same functions in multiple places (for example, steps 3, 5, and 9 for every single code may be exactly the same)."
The biggest danger in starting with one big block of code is that it will never actually get refactored into smaller units. Just start down the right path to begin with....
EDIT --
for your picture, I would create a base-class with all of the common methods that are used. The base class would be abstract, with an abstract method. Subclasses would implement the abstract method and use the common functions they need. Of course, replace 'abstract' with whatever your language of choice provides.
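The base-class idea can be sketched in Perl, which has no abstract keyword, so the "abstract" method is approximated by a base method that dies unless overridden. Every name here (Layout::Base, Layout::A, Layout::B, run_code) is hypothetical, and the two behaviours are loosely modelled on the question's Layout A and Layout B.

```perl
use strict;
use warnings;

# Base class: owns the fixed sequence of steps and the shared ones.
package Layout::Base;
sub new        { my ($class) = @_; return bless {}, $class }
sub read_input { my ($self, $raw) = @_; return [ split /,/, $raw ] }    # shared step
sub transform  { die ref(shift) . " must implement transform\n" }       # "abstract"
sub process {
    my ($self, $raw) = @_;
    my $fields = $self->read_input($raw);
    return $self->transform($fields);    # the only step that varies
}

# Code 1: Layout A concatenates the fields.
package Layout::A;
our @ISA = ('Layout::Base');
sub transform { my ($self, $fields) = @_; return join '', @$fields }

# Code 2: Layout B adds two integer fields.
package Layout::B;
our @ISA = ('Layout::Base');
sub transform { my ($self, $fields) = @_; return $fields->[0] + $fields->[1] }

package main;

# Map each code to the class that handles it, so no if/else chain is needed.
my %handler_for = (1 => 'Layout::A', 2 => 'Layout::B');

sub run_code {
    my ($code, $raw) = @_;
    my $class = $handler_for{$code} or die "unknown code $code\n";
    return $class->new->process($raw);
}

print "code 1: ", run_code(1, 'ab,cd'), "\n";
print "code 2: ", run_code(2, '3,4'), "\n";
```

Adding an eleventh code then means adding one small subclass and one table entry, while the common steps stay in a single place.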

You should always err on the side of generalization, with the only exception being early prototyping (where the speed of producing working code suffers badly if you stop to design correct abstractions/generalizations). Having said that, you should NEVER leave that mess of non-generalized cloned branches in place past the early prototype stage, as it leads to messy, hard-to-maintain code (if you are doing almost the same thing in 3 different places and need to change that thing, you're almost sure to forget to change 1 of the 3).

Again it's hard to specifically answer such an open ended question, but I believe you don't have to sacrifice one for the other.
OOP techniques solve this issue by allowing you to encapsulate the reusable portions of your code and create child classes to handle object-specific behaviors.

Personally I think you might (if your API allows it) create inherited forms, create them on the fly on the master form (with tabs), pass arguments, and embed them in the tab container.
When to inherit a form and when to use arguments (codes) to show/hide/add/remove functionality is up to you, yet the master form should contain only decisions and argument passing, and the embeddable forms just plain functionality. This way you can separate organisation from implementation.

Is it naughty to have a large utility file?

In my C project I have quite a large utils.c file. It is really full of many utilities of different sorts. I feel a bit naughty just stuffing different miscellaneous functions in there. For example it has some utilities related to low level stuff such as a lowercase() function, and it also has some quite sophisticated utilities such as converting to/from different colour formats.
My question is: is it very naughty to have such a large utils.c with many different types of utilities in it? Should I break it up into many different kinds of utility files, such as graphics_utils.c and so on? What do you think?

Breaking them up into separate files based on categories (i.e. graphics, strings, etc.) will lead to better organization: certain pieces of code become easier to locate, and you have smaller files to go through instead of just one large file.

You want to break it up, not just for organizational reasons, but because you will have many other files that depend on this one. Because everything will depend on this file, it makes this one file difficult to change because it might cause widespread breakage.
http://ifacethoughts.net/2006/04/15/stable-dependencies-principle/

If it's just you that will EVER maintain the stuff, it's a matter of when the complexity gets to the point where you find yourself searching for things. That would be the time to refactor and reorganize (there's a cost to reorganize, just as there's a cost to not reorganize).
If it's POSSIBLE that anyone else will maintain a project that includes your utils, you have to consider THEIR pain point when deciding when to reorganize. Theirs is MUCH lower than yours.

I tend to break them up into various sub-utils as you say (graphics_utils) when it becomes appropriate.

Break it up. Stuff will be easier to find, easier to reuse, easier to refactor, easier to unit test. I recently needed to get a set of ISO-8601 date handling methods out of a ginormous Java utility class of static methods, and it was really hard to find the 5% of the code I needed.

It is definitely not kosher, because the next guy coming through your code won't know where to look for anything. Break it up by function, and your coworkers will thank you!

Another advantage of breaking the file up into separate files is that, when you place them under source control, you get finer-grained control. This really is useful if you have bits that are tweaked/extended/specialised frequently, and other bits that are relatively stable.

Another point: You should organize your code, i.e. break it up into smaller modules and categorize it, because at some point you will end up writing a second and third function for the same thing, simply because you can't find the function that you know is there but whose name you don't remember.
I've got a (rather large) project with such a module and there is programming logic for which there are up to 5-6 implementations (for the same thing).

Like everyone else I would break them up. But I tend to use Extension Methods now, so I would have one class (and one file) per class being extended (e.g. StringExtensions, SqlDataReaderExtensions, etc). I find this tends to break up the utility methods nicely.

Does it matter if there are unused functions I put into a big CoolFunctions.h / CoolFunctions.m file that's included everywhere in my project?

I want to create a big file for all the cool functions I find reusable and useful, and put them all into that single file. Well, to begin with I don't have many, so it's not worth thinking much about making several files, I guess. I would use pragma marks to separate them visually.
But the question: would those unused methods cause any problems? Would my application explode or perform worse? Or is the compiler / linker clever enough to know that functions A and B are not needed, and thus not copy their "code" into my resulting app?

This sounds like an absolute architectural and maintenance nightmare. As a matter of practice, you should never make a huge blob file with a random set of methods you find useful. Add the methods to the appropriate classes or categories. See here for information on the blob anti-pattern, which is what you are doing here.
To directly answer your question: no, methods that are never called will not affect the performance of your app.

No, they won't directly affect your app. Keep in mind though, all that unused code is going to make your functions file harder to read and maintain. Plus, writing functions you're not actually using at the moment makes it easy to introduce bugs that aren't going to become apparent until much later on when you start using those functions, which can be very confusing because you've forgotten how they're written and will probably assume they're correct because you haven't touched them in so long.
Also, in an object oriented language like Objective-C global functions should really only be used for exceptional, very reusable cases. In most instances, you should be writing methods in classes instead. I might have one or two global functions in my apps, usually related to debugging, but typically nothing else.
So no, it's not going to hurt anything, but I'd still avoid it and focus on writing the code you need now, at this very moment.

The code would still be compiled and, unless the linker strips unreferenced code, linked into the project. It just wouldn't be used by your code, so your resultant executable may be larger.
I'd probably split the functions into separate files, depending on the common areas they address, so I'd have a library of image functions separate from a library of string manipulation functions, then include whichever are pertinent to the project in hand.

I don't think having unused functions in the .h file will hurt you in any way. If you compile all the corresponding .m files containing the unused functions in your build target, then you will end up making a bigger executable than is required. Same goes for if you include the code via static libraries.
If you do use a function but you didn't include the right .m file or library, then you'll get a link error.

Good practice class line count [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 7 years ago.
I know that there is no right answer to this question, I'm just asking for your opinions.
I know that creating HUGE class files with thousands of lines of code is not good practice, since they're hard to maintain and they usually mean that you should probably review your program logic.
In your opinion, what is an average line count for a class, let's say in Java? (I don't know if the choice of language has anything to do with it, but just in case...)

Yes, I'd say it does have to do with the language, if only because some languages are more verbose than others.
In general, I use these rules of thumb:
< 300 lines: fine
300 - 500 lines: reasonable
500 - 1000 lines: maybe ok, but plan on refactoring
> 1000 lines: definitely refactor
Of course, it really depends more on the nature and complexity of the code than on LOC, but I've found these reasonable.

In general, number of lines is not the issue - a slightly better metric is number of public methods. But there is no correct figure. For example, a utility string class might correctly have hundreds of methods, whereas a business level class might have only a couple.
If you are interested in LOC, cyclomatic and other complexity measurements, I can strongly recommend Source Monitor from http://www.campwoodsw.com, which is free, works with major languages such as Java & C++, and is all round great.

From Eric Raymond's "The Art Of Unix Programming"
In nonmathematical terms, Hatton's empirical results imply a sweet spot between 200 and 400 logical lines of code that minimizes probable defect density, all other factors (such as programmer skill) being equal. This size is independent of the language being used — an observation which strongly reinforces the advice given elsewhere in this book to program with the most powerful languages and tools you can. Beware of taking these numbers too literally however. Methods for counting lines of code vary considerably according to what the analyst considers a logical line, and other biases (such as whether comments are stripped). Hatton himself suggests as a rule of thumb a 2x conversion between logical and physical lines, suggesting an optimal range of 400–800 physical lines.
Taken from here

Better to measure something like cyclomatic complexity and use that as a gauge. You could even stick it in your build script/ant file/etc.
It's too easy, even with a standardized code format, for lines of code to be disconnected from the real complexity of the class.
Edit: See this question for a list of cyclomatic complexity tools.

I focus on methods and (try to) keep them below 20 lines of code. Class length is in general dictated by the single responsibility principle. But I believe that this is no absolute measure because it depends on the level of abstraction, hence somewhere between 300 and 500 lines I start looking over the code for a new responsibility or abstraction to extract.

Small enough to do only the task it is charged with.
Large enough to do only the task it is charged with.
No more, no less.

In my experience any source file over 1000 text lines I will start wanting to break up. Ideally methods should fit on a single screen, if possible.
Lately I've started to realise that removing unhelpful comments can help greatly with this. I comment far more sparingly now than I did 20 years ago when I first started programming.

The short answer: less than 250 lines.
The shorter answer: Mu.
The longer answer: Is the code readable and concise? Does the class have a single responsibility? Does the code repeat itself?

For me, the issue isn't LOC. What I look at is several factors. First, I check my If-Else-If statements. If a lot of them have the same conditions, or result in similar code being run, I try to refactor that. Then I look at my methods and variables. In any single class, that class should have one primary function and only that function. If it has variables and methods for a different area, consider putting those into their own class. Either way, avoid counting LOC for two reasons:
1) It's a bad metric. If you count LOC you're counting not just long lines, but also lines which are whitespace and used for comments as though they are the same. You can avoid this, but at the same time, you're still counting small lines and long lines equally.
2) It's misleading. Readability isn't purely a function of LOC. A class can be perfectly readable but if you have a LOC count which it violates, you're gonna find yourself working hard to squeeze as many lines out of it as you can. You may even end up making the code LESS readable. If you take the LOC to assign variables and then use them in a method call, it's more readable than calling the assignments of those variables directly in the method call itself. It's better to have 5 lines of readable code than to condense it into 1 line of unreadable code.
Instead, I'd look at depth of code and line length. These are better metrics because they tell you two things. First, the nested depth tells you if your logic needs to be refactored. Consider refactoring as soon as you have more than one level of nesting, and seriously consider it if you are looking at If statements or loops nested more than 2 deep. Second, if a line is long, it is generally very unreadable. Try separating out that line onto several more readable lines. This might break your LOC limit if you have one, but it does actually improve readability.
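To make the nesting-depth point concrete, here is a small sketch (in Perl, with invented order data and sub names) of the same logic written nested and then flattened with guard clauses. The flattened version has more or less the same line count but only one level of nesting.

```perl
use strict;
use warnings;

# Nested version: three levels deep before any real work happens.
sub total_nested {
    my (@orders) = @_;
    my $total = 0;
    for my $order (@orders) {
        if ($order->{paid}) {
            if ($order->{amount} > 0) {
                $total += $order->{amount};
            }
        }
    }
    return $total;
}

# Flattened version: guard clauses skip unwanted records early.
sub total_flat {
    my (@orders) = @_;
    my $total = 0;
    for my $order (@orders) {
        next unless $order->{paid};
        next unless $order->{amount} > 0;
        $total += $order->{amount};
    }
    return $total;
}

# Hypothetical sample data: only the first order is paid and positive.
my @orders = (
    { paid => 1, amount => 10 },
    { paid => 0, amount => 99 },
    { paid => 1, amount => -5 },
);
printf "%d %d\n", total_nested(@orders), total_flat(@orders);
```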

line counting == bean counting.
The moment you start employing tools to find out just how many lines of code a certain file or function has, you're screwed, IMHO, because you stopped worrying about manageability of the code and started bureaucratically making rules and placing blame.
Have a look at the file / function, and consider if it is still comfortable to work with, or is starting to get unwieldy. If in doubt, call in a co-developer (or, if you are running a one-man-show, some developer unrelated to the project) to have a look, and have a quick chat about it.
It's really just that: a look. Does someone else immediately get the drift of the code, or is it a closed book to the uninitiated? This quick look tells you more about the readability of a piece of code than any line metrics ever devised. It depends on so many things: language, problem domain, code structure, working environment, experience. What's OK for one function in one project might be all out of proportion for another.
If you are in a team / project situation, and can't readily agree by this "one quick look" approach, you have a social problem, not a technical one. (Differing quality standards, and possibly a communication failure.) Having rules on file / function lengths is not going to solve your problem. Sitting down and talking about it over a cool drink (or a coffee, depending...) is a better choice.

You're right... there is no answer to this. You cannot put a "best practice" down as a number of lines of code.
However, as a guideline, I often go by what I can see on one page. As soon as a method doesn't fit on one page, I start thinking I'm doing something wrong. As far as the whole class is concerned, if I can't see all the method/property headers on one page then maybe I need to start splitting that out as well.
Again though, there really isn't an answer, some things just have to get big and complex. The fact that you know this is bad and you're thinking about it now, probably means that you'll know when to stop when things get out of hand.

Lines of code is much more about verbosity than anything else. In the project I'm currently working on we have some files with over 1000 LOC. But if you strip the comments, probably only about 300 or even fewer will remain. If you change declarations like
int someInt;
int someOtherInt;
to one line, the file will be even shorter.
However, if you're not verbose and you still have a big file, you'll probably need to think about refactoring.
