What's the point of using "map()" for two elements in perl? - perl

I've seen code where there are just two rather static elements to be mapped, such as a time interval with its start and end dates, yet map() is used rather than explicit code for the mapping, e.g.
{ map { ... } qw(start end) } # vs.
{ start => ..., end => ... }
Which way is preferable, and why?
The map form may be less concise but looks more functional (as in functional programming), so I guess that's why it may be preferred over explicit code and is perhaps more DRY.
However, it looks less legible to me because there is more logic going on behind it, and mapping should also be less efficient because it invokes calls and consists of more atomic operations.
EDIT
There is a conflicting goal in programming: KISS (keep it { pick 2 from: small, simple, stupid }). Using map slightly complicates code.

Assuming you're not just setting both items to the same constant or something similarly trivial, I would expect the map version to be more concise.
IMO, the main point in favor of the map version is that you know the same process will be used to produce both values. Not only for the sake of DRY, but also because it eliminates any concern that one might have a subtle change which the other doesn't.
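For instance, a minimal sketch of what that buys you (the %opt data and the format_ts() helper are made up for illustration):

use strict;
use warnings;

my %opt = ( start => 1700000000, end => 1700003600 );   # hypothetical input
sub format_ts { scalar localtime $_[0] }                # hypothetical helper

# Both values are produced by the exact same expression:
my %interval = map { $_ => format_ts( $opt{$_} ) } qw(start end);

# The spelled-out equivalent, where one line can quietly drift from the other:
# my %interval = (
#     start => format_ts( $opt{start} ),
#     end   => format_ts( $opt{end} ),
# );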
As for the performance concern... If your use case is sufficiently performance-sensitive for any potential difference to matter, then you shouldn't be using Perl in the first place. Switching to well-written C (not C#, not C++, not Objective C - just plain C) will have a far greater performance impact than micro-optimizing whether you assign two values individually vs. using a loop to set them. But the odds of your use case being that sensitive are approximately zero anyhow.

There is a principle of coding known as DRY. Don't Repeat Yourself.
It asserts that:
Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.
And that can be interpreted as condensing duplicate typing with (things like) map/for.
I use idioms like the one you've quoted when I'm trying to expand some text - for example:
my @defs = map { "DEF:$_=$source_file:$_:MAX" } qw( read write );
This generates me some DEF lines for rrdtool.
I'm doing it this way, because for some cases, I've got considerably longer lists of 'things I want to define' and want to be consistent. (Sometimes I have say, 10 similar lines that differ only by a single word).
But also because:
my #defs = ( "DEF:read=$source_file:read:MAX",
"DEF:write=$source_file:write:MAX" );
There's not much in it for two elements, and I'd suggest it's as much a matter of style as anything. However, if you've got more than that, it quickly becomes very beneficial because you can change the single line - say you've got a different file location? Want to swap MAX for AVERAGE?
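For example, with a longer (made-up) list, one edit to the template line covers every entry:

my $source_file = 'stats.rrd';   # say the file location changed - edit it here once
my @defs = map { "DEF:$_=$source_file:$_:AVERAGE" }   # and MAX swapped for AVERAGE here
           qw( read write seek fsync );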
It's also quite shockingly easy to go 'punctuation blind' when looking at a long sequence of similar statements, where someone has typo'd and added a , where it should be a . or similar.
And ... you probably don't lose a great deal in terms of readability. But I will acknowledge that's something of a style point, because whilst map is pretty amazing, it can make for some rather hard-to-read code if you're not careful.
Also to specifically address:
mapping should also be less efficient because it invokes calls and consists of more atomic operations.
A wise man once said:
premature optimization is the root of all evil
Don't think about the efficiency of a statement - look at the legibility/readability. Compilers are pretty clever. Most "obvious" optimisations, they already make for you. Processors are also pretty fast. Your limiting factor in most code isn't the amount of CPU cycles you need, it's IO throughput and memory footprint. So don't worry about it - write clear code.
And if there's a performance-critical demand on your code, you should be using a code profiler to look at where you gain the most efficiency for your refactoring effort. You may end up with less clear code in doing so (sometimes), but that's a clearer tradeoff.

Related

Should I prefer hashes or hashrefs in Perl?

I'm still learning perl.
For me it feels more "natural" to use references to hashes rather than to access them directly, because it is easier to pass references to a sub (one variable can be passed instead of a list). Generally I prefer this approach over directly accessing %hashes.
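For illustration, a rough sketch of what I mean (the sub names are made up):

use strict;
use warnings;

sub takes_ref  { my ($h) = @_; print $h->{key}, "\n" }   # one scalar arrives in @_
sub takes_hash { my %h = @_;   print $h{key},  "\n" }    # the hash is flattened into @_

my %hash = ( key => 'value' );
takes_ref(\%hash);     # pass a single reference
takes_hash(%hash);     # copy the whole hash as a list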
The question is: in what situations is it better to use a plain %hash, i.e.
$hash{key} = $value;
instead of
$href->{key} = $value
Is there any speed advantage, or any other reason, to prefer %hashes over $hashrefs? Or is it purely a matter of personal taste and TIMTOWTDI? Could you give some examples of when it is better to use a %hash?
I think this kind of question is very legitimate: programming languages such as Perl or C++ have come a long way and accumulated a lot of historical baggage, but people typically learn them from ahistorical, synchronic exposés. Hence they keep wondering why TIMTOWTDI, WTF all these choices are about, what is better, and what should be preferred.
So: before version 5, Perl didn't have references; it only had value types. References were added in Perl 5 on top of what Perl 4 offered, enabling lots more stuff to be written. Value types had to be retained, of course, to keep backward compatibility; and also for simplicity's sake, because frequently you don't need the indirection that references provide.
To answer your question:
Don't waste time thinking about the speed of Perl hashes. They're fast. They're memory access. Accessing a database, the filesystem, or the network - that is where your program will typically spend its time.
In theory, a dereference operation should take a tiny bit of time, so tiny it shouldn't matter.
If you're curious, then benchmark. Don't draw too many conclusions from differences you might see. Things could look different on another release.
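If you do benchmark, a quick sketch with the core Benchmark module might look like this:

use strict;
use warnings;
use Benchmark qw(cmpthese);

my %hash = ( key => 1 );
my $href = \%hash;

# Compare plain hash access against dereferencing, running each for at least one CPU second:
cmpthese( -1, {
    plain_hash => sub { my $v = $hash{key} },
    hashref    => sub { my $v = $href->{key} },
} );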
So there is no speed reason to favour references over value types or vice versa.
Is there any other reason? I'd say it's a question of style and taste. Personally, I prefer the syntax without -> accessors.
If you can use a plain hash to describe your data, use a plain hash. However, when your data structure gets a bit more complex, you will need to use references.
Imagine a program where I'm storing information about inventory items, and how many I have in stock. A simple hash works quite well:
$item{XP232} = 324;
$item{BV348} = 145;
$item{ZZ310} = 485;
If all you're doing is creating quick programs that can read a file and store simple information for a report, there's no need to use references at all.
However, when things get more complex, you need references. For example, suppose my program isn't just tracking my stock but all aspects of my inventory. Inventory items also have names, the company that makes them, etc. In this case, I'll want each hash value to hold not a single data point (the number of items I have in stock) but a reference to another hash:
$item{XP232}->{DESCRIPTION} = "Blue Widget";
$item{XP232}->{IN_STOCK} = 324;
$item{XP232}->{MANUFACTURER} = "The Great American Widget Company";
$item{BV348}->{DESCRIPTION} = "A Small Purple Whatzit";
$item{BV348}->{IN_STOCK} = 145;
$item{BV348}->{MANUFACTURER} = "Acme Whatzit Company";
You can do all sorts of wacky things to achieve something like this (such as separate hashes for each field, or all fields packed into a single value separated by colons), but it's simply easier to use references to store these more complex structures.
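As a sketch, the same structure can also be built in one declaration with anonymous hashrefs:

my %item = (
    XP232 => {
        DESCRIPTION  => "Blue Widget",
        IN_STOCK     => 324,
        MANUFACTURER => "The Great American Widget Company",
    },
    BV348 => {
        DESCRIPTION  => "A Small Purple Whatzit",
        IN_STOCK     => 145,
        MANUFACTURER => "Acme Whatzit Company",
    },
);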
For me the main reason to prefer $hashrefs over %hashes is the ability to give them meaningful names (a related idea is naming a reference to an anonymous hash), which can help you separate data structures from program logic and make things easier to read and maintain.
If you end up with multiple levels of references (refs to refs?!) you start to lose this clean and readable advantage, though. Also, for short programs or modules, or at earlier stages of development where you are retesting things as you go, directly accessing the %hash can make simple debugging easier (print statements and the like) and help avoid accidental "action at a distance" issues, so you can focus on iterating through your design, using references where appropriate.
In general though I think this is a great question, because TIMTOWTDI and TIMTOCWTDI, where C = "correct". Thanks for asking it and thanks for the answers.

When should you use XS?

I am writing up a talk on XS and I need to know when the community thinks it is proper to reach for XS.
I can think of at least three reasons to use XS:
You have a C library you want to access in Perl 5
You have a block of code that is provably slowing down your program and it would be faster if written in C
You need access to something only available in XS
Reason 1 is obvious and should need no explanation.
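Still, as a minimal illustration, here is a sketch that calls a C function from Perl via Inline::C, a CPAN module that generates and compiles XS behind the scenes (the function is invented and the module must be installed):

use strict;
use warnings;
use Inline C => <<'END_C';
int add_ints(int a, int b) {
    return a + b;
}
END_C

print add_ints(2, 3), "\n";   # prints 5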
When you really need reason 2 is less obvious. Often you are better off looking at how the code is structured. You should only invoke reason 2 if you have profiled your code and have a benchmark and test suite to prove that the XS code is faster and correct.
Reason 3 is a dangerous reason. It is rare that you actually need to look into Perl's guts to do something, but there is at least one valid case.
In a few cases, better memory management is another reason for using XS. For example, if you have a very large block of objects of some similar type, this can be managed more efficiently through XS. KinoSearch uses this for tokens, for example, where start and end offsets in a large string can be managed more effectively through XS than as a huge pool of scalars. PDL also has a memory management aspect to it, as well as speed.
There are proposals to integrate some of this approach into core Perl in the long term, initially because it offers a chance to make sharing data in threading better: see: http://openparallel.com/2011/07/05/a-new-hope-for-efficient-safe-data-sharing-between-threads-in-perl/.

Design - When to create new functions?

This is a general design question not relating to any language. I'm a bit torn between going for minimum code or optimum organization.
I'll use my current project as an example. I have a bunch of tabs on a form that perform different functions. Let's say Tab 1 reads in a file with a specific layout, Tab 2 exports a file to a specific location, etc. The problem I'm running into now is that I need these tabs to do something slightly different based on the contents of a variable. If it contains a 1 I may need to use Layout A and perform some extra concatenation; if it contains a 2 I may need to use Layout B, do no concatenation, but add two integer fields; etc. There could be 10+ codes that I will be looking at.
Is it preferable to create an individual path for each code early on, or to attempt to create a single path that branches out only when absolutely required?
Creating an individual path for each code would make my code extremely easy to follow at a glance, which in turn will help me later on when debugging or making changes. The downside is that I will increase the amount of code written by calling some of the same functions in multiple places (for example, steps 3, 5, and 9 for every single code may be exactly the same).
Creating a single path that would branch out only when required will be a bit messier and more difficult to follow at a glance, but I would create less code by placing conditionals only at steps that are unique.
I realize that this may be a case-by-case decision, but in general, if you were handed a previously built program to work on, which would you prefer?
Edit: I've drawn some simple images to help express it. Codes 1/2/3 are the variables and the lines under them represent the paths they would take. All of these steps need to be performed in a linear chronological fashion, so there would be a function to essentially just call other functions in the proper order.
Different Paths
Single Path
Creating a single path that would branch out only when required will be a bit messier and more difficult to follow at a glance, but I would create less code by placing conditionals only at steps that are unique.
I'm not buying this statement. There is a level of finesse when deciding when to write new functions. Functions should be as simple and reusable as possible (but no simpler). The correct answer is almost never 'one big file that does a lot of branching'.
Less LOC (lines of code) should not be the goal. Readability and maintainability should be the goal. When you create functions, the names should be self documenting. If you have a large block of code, it is good to do something like
function doSomethingComplicated() {
    stepOne();
    stepTwo();
    // and so on
}
where the function names are self documenting. Not only will the code be more readable, you will make it easier to unit test each segment of the code in isolation.
For the case where you will have a lot of methods that call the same exact methods, you can use good OO design and design patterns to minimize the number of functions that do the same thing. This is in reference to your statement "The downside to this is that I will increase the amount of code written by calling some of the same functions in multiple places (for example, steps 3, 5, and 9 for every single code may be exactly the same."
The biggest danger in starting with one big block of code is that it will never actually get refactored into smaller units. Just start down the right path to begin with....
EDIT --
For your picture, I would create a base class containing all of the common methods. The base class would be abstract, with an abstract method; subclasses would implement the abstract method and use the common functions they need. Of course, replace 'abstract' with whatever your language of choice provides.
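As a rough Perl sketch of that idea (class names and steps are invented, and "abstract" here is just a die in the base method):

use strict;
use warnings;

package Processor::Base;
sub new  { my ($class, %args) = @_; bless { %args }, $class }
sub read_file   { print "common step: read the file\n" }
sub export_file { print "common step: export the file\n" }
sub transform   { die "subclass must implement transform()\n" }
sub run {
    my $self = shift;
    $self->read_file;
    $self->transform;      # the only code-specific step
    $self->export_file;
}

package Processor::Code1;
our @ISA = ('Processor::Base');
sub transform { print "layout A plus extra concatenation\n" }

package Processor::Code2;
our @ISA = ('Processor::Base');
sub transform { print "layout B, add two integer fields\n" }

package main;
Processor::Code1->new->run;
Processor::Code2->new->run;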
You should always err on the side of generalization, with the only exception being early prototyping (where the throughput of getting working stuff out is badly impacted by designing correct abstractions/generalizations). Having said that, you should NEVER leave that mess of non-generalized cloned branches past the early prototype stage, as it leads to messy, hard-to-maintain code (if you are doing almost the same thing 3 different times and need to change that thing, you're almost sure to forget to change 1 out of the 3).
Again it's hard to specifically answer such an open ended question, but I believe you don't have to sacrifice one for the other.
OOP techniques solve this issue by allowing you to encapsulate the reusable portions of your code and generate child classes to handle object-specific behaviors.
Personally, I think you might (if your API allows it) create inherited forms, create them on the fly on the master form (with tabs), pass arguments, and embed them in the tab container.
When to inherit a form and when to use arguments (code) to show/hide/add/remove functionality is up to you, but the master form should contain only decisions and argument passing, and the embedded forms just plain functionality - this way you can separate organisation from implementation.

How should I restructure a large Perl script?

I have a more or less large Perl script of ~1000 lines. The script accepts a few arguments and runs straight through. No modules, no functions. The script could be divided into three parts: an initialization part, an argument-parsing part, and a work part; but I don't know how to do that. Everything must be kept in a single file. Please, can anyone give me instructions/advice on how to structure my Perl script?
Thanks.
You ask for advice on how to refactor your script, but you don't appear to understand why to refactor it. Without the why, the how isn't going to do you much good. And with the why, the how may fall out quite naturally.
If your script is working perfectly and needs no modification and all you'll ever do with it is run it, then you probably don't have a reason to refactor it - and I say that from the perspective of despising long routines. But...
If something's wrong with it
If you are trying to find a bug in your 1,000-line program, you have some hard work ahead of you. The problem could be anywhere. Break it up into smaller pieces so that you can verify the input and output at different stages - ideally, write tests for the smaller pieces. Fine-grained unit tests will tell you what isn't working, the nature of the error, and where the error exists.
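For instance, once the argument handling is its own sub, a tiny Test::More script can pin down its behaviour (parse_args() below is an invented stand-in for whatever piece you extract):

use strict;
use warnings;
use Test::More tests => 2;

# Invented example sub; in practice this would be the extracted piece of your script.
sub parse_args { my %opt = ( verbose => 0, file => $_[0] ); return \%opt }

is( parse_args('data.txt')->{file},    'data.txt', 'file name captured' );
is( parse_args('data.txt')->{verbose}, 0,          'verbose defaults to off' );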
If you need to modify it
If you need to change the script to - say - accommodate a new graphics format, or take advantage of multiple processors, or record its activities to a log - you will find it easier to extend if the program elements that need revision or extension are better isolated.
If you're trying to explain it to someone else, or show it off
You will find it much easier to convey the ideas in your script to another developer if the ideas are broken out into discrete methods.
So, there are some reasons why you might choose to refactor. If any of them apply, refactor accordingly; the how will drop out naturally. Extract Method may be your best friend.
If you can see logical parts of your script, you should definitely abstract them into functions. Having a single script of over 1000 lines and not breaking it up into whatever abstraction units your language provides (functions, classes, etc.) is a very bad idea. Maintaining your script, i.e. adding features and fixing bugs, will be a nightmare.
I strongly suggest you read the book Clean Code by Robert C. Martin. It uses Java for examples, but the ideas are applicable to any language. The one that is most relevant here is "Make your functions small. Then make them smaller."
1000 lines and no functions? Why not? This is a vague question.
You could break each section into a separate function, and then have a function that runs through each of these functions in the correct order, called 'run()' or something similar. This would allow you to break up the program into more manageable chunks.
p.s. man I think I used the word function too many times in this answer!
Have you refactored at all? At 1000 lines I'd expect to see some code that could be broken down into functions internal to the script.
Well, if you have three separate sections that's the logical choice.
You could make each one into a function and then have a simple linear control at the top:
my ($var1, $var2, $var3);
$var1 = init();
$var2 = parseInput();
$var3 = doWork();

sub init {
    # some code here
}

sub parseInput {
    # some code here
}

sub doWork {
    # some code here
}
The big issue is that you're going to be using globals a lot. I'd build them into a structure or two. I would also expect to see the big three broken down further into smaller functions themselves. Back in the 80s, when the big thing I was learning was structured programming (still the best design approach here, I think), the rule of thumb was that a function should fit on roughly one screen or less.
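For instance (keys invented), the stray globals could be gathered into one hashref that is passed to each of the big three:

my $state = {
    input_file => $ARGV[0],
    verbose    => 0,
    records    => [],
};
# init($state), parseInput($state) and doWork($state) then share this one
# structure instead of poking at file-scoped globals.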
Most people would typically answer something like "a subroutine should do one thing" and "a subroutine should only take up one page in your editor". You can try to keep these things in mind when you refactor your code.
Try to identify parts of your code that can be split off into logical sections. You've started this process by spotting 'initialization', 'argument parsing', and 'work'. See if there are some sub-sections within that that can be pruned off into other subroutines.
Also, why do you not use any modules? One that springs to mind is Getopt::Long, which is a core module, so you won't have to install it manually. It will handle all of your argument parsing, and by using it you will probably avoid bugs and could shorten your code to make it more maintainable. By using standard modules like this, you not only (hopefully!) reduce the number of bugs in your code, you make it easier for other Perl programmers to understand.
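A minimal Getopt::Long sketch for the argument-parsing part might look like this (the option names are invented):

use strict;
use warnings;
use Getopt::Long;

my %opt = ( verbose => 0 );
GetOptions(
    'input=s'  => \$opt{input},      # --input FILE
    'verbose!' => \$opt{verbose},    # --verbose / --no-verbose
) or die "Usage: $0 --input FILE [--verbose]\n";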
You could also look at search.cpan.org; maybe some Perl module already suits your needs. For example, there is CGI::Application.

Good practice class line count [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 7 years ago.
I know that there is no right answer to this question, I'm just asking for your opinions.
I know that creating HUGE class files with thousands of lines of code is not good practice, since they are hard to maintain and usually a sign that you should review your program logic.
In your opinion, what is an average line count for a class, let's say in Java? (I don't know if the choice of language has anything to do with it, but just in case...)
Yes, I'd say it does have to do with the language, if only because some languages are more verbose than others.
In general, I use these rules of thumb:
< 300 lines: fine
300 - 500 lines: reasonable
500 - 1000 lines: maybe ok, but plan on refactoring
> 1000 lines: definitely refactor
Of course, it really depends more on the nature and complexity of the code than on LOC, but I've found these reasonable.
In general, number of lines is not the issue - a slightly better metric is number of public methods. But there is no correct figure. For example, a utility string class might correctly have hundreds of methods, whereas a business level class might have only a couple.
If you are interested in LOC, cyclomatic and other complexity measurements, I can strongly recommend Source Monitor from http://www.campwoodsw.com, which is free, works with major languages such as Java & C++, and is all round great.
From Eric Raymond's "The Art Of Unix Programming"
In nonmathematical terms, Hatton's empirical results imply a sweet spot between 200 and 400 logical lines of code that minimizes probable defect density, all other factors (such as programmer skill) being equal. This size is independent of the language being used — an observation which strongly reinforces the advice given elsewhere in this book to program with the most powerful languages and tools you can. Beware of taking these numbers too literally however. Methods for counting lines of code vary considerably according to what the analyst considers a logical line, and other biases (such as whether comments are stripped). Hatton himself suggests as a rule of thumb a 2x conversion between logical and physical lines, suggesting an optimal range of 400–800 physical lines.
Taken from here
Better to measure something like cyclomatic complexity and use that as a gauge. You could even stick it in your build script/ant file/etc.
It's too easy, even with a standardized code format, for lines of code to be disconnected from the real complexity of the class.
Edit: See this question for a list of cyclomatic complexity tools.
I focus on methods and (try to) keep them below 20 lines of code. Class length is in general dictated by the single responsibility principle. But I believe that this is no absolute measure because it depends on the level of abstraction, hence somewhere between 300 and 500 lines I start looking over the code for a new responsibility or abstraction to extract.
Small enough to do only the task it is charged with.
Large enough to do only the task it is charged with.
No more, no less.
In my experience, I start wanting to break up any source file over 1000 text lines. Ideally, methods should fit on a single screen, if possible.
Lately I've started to realise that removing unhelpful comments can help greatly with this. I comment far more sparingly now than I did 20 years ago when I first started programming.
The short answer: less than 250 lines.
The shorter answer: Mu.
The longer answer: Is the code readable and concise? Does the class have a single responsibility? Does the code repeat itself?
For me, the issue isn't LOC. What I look at is several factors. First, I check my If-Else-If statements. If a lot of them have the same conditions, or result in similar code being run, I try to refactor that. Then I look at my methods and variables. In any single class, that class should have one primary function and only that function. If it has variables and methods for a different area, consider putting those into their own class. Either way, avoid counting LOC for two reasons:
1) It's a bad metric. If you count LOC, you're counting not just long lines but also whitespace and comment lines as though they were the same. You can avoid this, but even then you're still counting short lines and long lines equally.
2) It's misleading. Readability isn't purely a function of LOC. A class can be perfectly readable, but if it violates your LOC limit you're going to find yourself working hard to squeeze as many lines out of it as you can, and you may even end up making the code LESS readable. If you take a few lines to assign variables and then use them in a method call, that's more readable than nesting the assignments directly inside the method call itself. It's better to have 5 lines of readable code than to condense them into 1 line of unreadable code.
Instead, I'd look at depth of nesting and line length. These are better metrics because they tell you two things. First, the nesting depth tells you whether your logic needs to be refactored: if you are looking at if statements or loops nested more than two deep, seriously consider refactoring, and even more than one level of nesting is worth a second look. Second, if a line is long, it is generally very unreadable; try separating it out onto several more readable lines. This might break your LOC limit if you have one, but it does actually improve readability.
line counting == bean counting.
The moment you start employing tools to find out just how many lines of code a certain file or function has, you're screwed, IMHO, because you've stopped worrying about the manageability of the code and started bureaucratically making rules and placing blame.
Have a look at the file / function, and consider whether it is still comfortable to work with or is starting to get unwieldy. If in doubt, call in a co-developer (or, if you are running a one-man show, some developer unrelated to the project) to have a look, and have a quick chat about it.
It's really just that: a look. Does someone else immediately get the drift of the code, or is it a closed book to the uninitiated? This quick look tells you more about the readability of a piece of code than any line metric ever devised. It depends on so many things: language, problem domain, code structure, working environment, experience. What's OK for one function in one project might be all out of proportion for another.
If you are in a team / project situation, and can't readily agree by this "one quick look" approach, you have a social problem, not a technical one. (Differing quality standards, and possibly a communication failure.) Having rules on file / function lengths is not going to solve your problem. Sitting down and talking about it over a cool drink (or a coffee, depending...) is a better choice.
You're right... there is no answer to this. You cannot put a "best practice" down as a number of lines of code.
However, as a guideline, I often go by what I can see on one page. As soon as a method doesn't fit on one page, I start thinking I'm doing something wrong. As far as the whole class is concerned, if I can't see all the method/property headers on one page then maybe I need to start splitting that out as well.
Again though, there really isn't an answer, some things just have to get big and complex. The fact that you know this is bad and you're thinking about it now, probably means that you'll know when to stop when things get out of hand.
Lines of code is much more about verbosity than anything else. In the project I'm currently working on, we have some files with over 1000 LOC. But if you strip the comments, probably only about 300 lines or even fewer remain. If you change declarations like
int someInt;
int someOtherInt;
to one line, the file will be even shorter.
However, if you're not verbose and you still have a big file, you'll probably need to think about refactoring.