Should I prefer hashes or hashrefs in Perl? - perl

I'm still learning Perl.
For me it feels more "natural" to use references to hashes rather than to access them directly, because it is easier to pass a reference to a sub (one variable can be passed instead of a list). Generally I prefer this approach over directly accessing %hashes.
The question is: in which situations is it better to use a plain %hash, i.e.
$hash{key} = $value;
instead of
$href->{key} = $value;
Is there any speed difference, or any other reason, to prefer %hashes over hashrefs? Or is it only a matter of personal taste and TIMTOWTDI? Some examples of when a %hash is the better choice?
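For concreteness, a sketch of the two calling styles in question (report_list and report_ref are hypothetical subs, not from any real code):
my %config = (host => "localhost", port => 8080);

# Passing the hash as a list: the sub receives a flattened copy.
sub report_list {
    my %h = @_;
    print "$h{host}:$h{port}\n";
}
report_list(%config);

# Passing a reference: one scalar, and the sub sees the original hash.
sub report_ref {
    my ($h) = @_;
    print "$h->{host}:$h->{port}\n";
}
report_ref(\%config);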

I think this kind of question is very legitimate: programming languages such as Perl or C++ have come a long way and accumulated a lot of historical baggage, but people typically learn them from ahistorical, synchronous exposés. Hence they keep wondering why TIMTOWTDI, why all these choices, and which one is better or should be preferred.
So, before version 5, Perl didn't have references; it only had value types. References were added in Perl 5, enabling lots more to be written. Value types had to be retained, of course, to keep backward compatibility; and also, for simplicity's sake, because frequently you don't need the indirection that references provide.
To answer your question:
Don't waste time thinking about the speed of Perl hash lookups. They're fast; they're memory accesses. Accessing a database, the filesystem, or the network is where your program will typically spend its time.
In theory, a dereference operation should take a tiny bit of time, so tiny it shouldn't matter.
If you're curious, then benchmark. Don't draw too many conclusions from differences you might see. Things could look different on another release.
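If you do want to measure it, a minimal sketch using the core Benchmark module might look like this (the structures here are invented purely for illustration):
use Benchmark qw(cmpthese);

my %hash = (key => 42);
my $href = { key => 42 };

# Run each snippet for at least one CPU second and compare rates.
cmpthese(-1, {
    plain => sub { my $v = $hash{key} },
    ref   => sub { my $v = $href->{key} },
});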
So there is no speed reason to favour references over value types or vice versa.
Is there any other reason? I'd say it's a question of style and taste. Personally, I prefer the syntax without -> accessors.

If you can use a plain hash to describe your data, use a plain hash. However, when your data structure gets a bit more complex, you will need to use references.
Imagine a program where I'm storing information about inventory items, and how many I have in stock. A simple hash works quite well:
$item{XP232} = 324;
$item{BV348} = 145;
$item{ZZ310} = 485;
If all you're doing is creating quick programs that can read a file and store simple information for a report, there's no need to use references at all.
However, when things get more complex, you need references. For example, my program isn't just tracking my stock, I'm tracking all aspects of my inventory. Inventory items also have names, the company that creates them, etc. In this case, I'll want to have my hashes not pointing to a single data point (the number of items I have in stock), but a reference to a hash:
$item{XP232}->{DESCRIPTION} = "Blue Widget";
$item{XP232}->{IN_STOCK} = 324;
$item{XP232}->{MANUFACTURER} = "The Great American Widget Company";
$item{BV348}->{DESCRIPTION} = "A Small Purple Whatzit";
$item{BV348}->{IN_STOCK} = 145;
$item{BV348}->{MANUFACTURER} = "Acme Whatzit Company";
You can do all sorts of wacky things to achieve something like this (such as keeping a separate hash for each field, or packing all fields into a single value separated by colons), but it's simply easier to use references to store these more complex structures.
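For example, a report over the nested structure above could be produced like this (a sketch; the field names are the ones from the example):
for my $code (sort keys %item) {
    my $rec = $item{$code};    # hash reference holding this item's fields
    printf "%s: %s (%d in stock, made by %s)\n",
        $code, $rec->{DESCRIPTION}, $rec->{IN_STOCK}, $rec->{MANUFACTURER};
}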

For me the main reason to prefer $hashrefs over %hashes is the ability to give them meaningful names (a related idea is to name the references to an anonymous hash), which helps you separate data structures from program logic and makes things easier to read and maintain.
If you end up with multiple levels of references (refs to refs?!) you start to lose this clean and readable advantage, though. Also, for short programs or modules, or at earlier stages of development where you are re-testing things as you go, directly accessing the %hash can make simple debugging easier (print statements and the like) and helps avoid accidental "action at a distance" issues, so you can focus on iterating through your design, using references where appropriate.
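A small sketch of what I mean by meaningful names (the names here are invented for illustration, not from the original question):
# Separate, named structures instead of one anonymous pile of data.
my $employee = { name => "Ada", department => "R&D" };
my $payroll  = { Ada  => 4200 };

sub monthly_cost {
    my ($employee, $payroll) = @_;
    return $payroll->{ $employee->{name} };
}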
In general though I think this is a great question, because TIMTOWTDI and TIMTOCWTDI, where C = "correct". Thanks for asking it and thanks for the answers.

Related

What's the point of using "map()" for two elements in Perl?

I've seen code where there are just two rather static elements to be mapped such as time intervals with start and end dates, yet map() is being used rather than explicit code for mapping, e.g.
{ map { ... } qw(start end) } # vs.
{ start => ..., end => ... }
Which way is preferable, and why?
The map form may be less concise but looks more functional (as in functional programming), so I guess that's why it may be preferred over explicit code and is perhaps more DRY.
However, it looks less legible to me because there is more logic going on behind the scenes, and mapping should also be less efficient because it involves function calls and consists of more atomic operations.
EDIT
There is a conflicting goal in programming: KISS (keep it { pick 2 from: small, simple, stupid }). Using map slightly complicates code.
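To make the comparison concrete, a sketch (compute() here is just a stand-in for whatever expression the map body would contain):
sub compute { my ($which) = @_; return "value for $which" }   # placeholder

# map version: the same expression produces both values
my %interval_map = map { $_ => compute($_) } qw(start end);

# explicit version: each key/value pair written out by hand
my %interval_explicit = (
    start => compute('start'),
    end   => compute('end'),
);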
Assuming you're not just setting both items to the same constant or something similarly trivial, I would expect the map version to be more concise.
IMO, the main point in favor of the map version is that you know the same process will be used to produce both values. Not only for the sake of DRY, but also because it eliminates any concern that one might have a subtle change which the other doesn't.
As for the performance concern... If your use case is sufficiently performance-sensitive for any potential difference to matter, then you shouldn't be using Perl in the first place. Switching to well-written C (not C#, not C++, not Objective C - just plain C) will have a far greater performance impact than micro-optimizing whether you assign two values individually vs. using a loop to set them. But the odds of your use case being that sensitive are approximately zero anyhow.
There is a principle of coding known as DRY. Don't Repeat Yourself.
It asserts that:
Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.
And that can be interpreted as condensing duplicate typing with (things like) map/for.
I use idioms like the one you've quoted when I'm trying to expand some text - for example:
my @defs = map { "DEF:$_=$source_file:$_:MAX" } qw( read write );
This generates me some DEF lines for rrdtool.
I'm doing it this way, because for some cases, I've got considerably longer lists of 'things I want to define' and want to be consistent. (Sometimes I have say, 10 similar lines that differ only by a single word).
But also because:
my #defs = ( "DEF:read=$source_file:read:MAX",
"DEF:write=$source_file:write:MAX" );
There's not much in it for two elements, and I'd suggest it's as much a matter of style as anything. However, if you've got more than that, it quickly becomes very beneficial because you can change the single line - say you've got a different file location? Want to swap MAX for AVERAGE?
It's also quite shockingly easy to go 'punctuation blind' when looking at a long sequence of similar statements, where someone has typo-ed and put a , where it should be a ., or similar.
And ... you probably don't lose a great deal in terms of readability. I will acknowledge that's something of a style point, though, because whilst map is pretty amazing, it can make for some rather hard-to-read code if you're not careful.
Also to specifically address:
mapping should also be less efficient because it involves function calls and consists of more atomic operations.
A wise man once said:
premature optimization is the root of all evil
Don't think about the efficiency of a statement - look at the legibility/readability. Compilers are pretty clever. Most "obvious" optimisations, they already make for you. Processors are also pretty fast. Your limiting factor in most code isn't the amount of CPU cycles you need, it's IO throughput and memory footprint. So don't worry about it - write clear code.
And if there's a performance-critical demand on your code, you should be using a code profiler to see where you gain the most efficiency for your refactoring effort. You may end up with less clear code in doing so (sometimes), but that's a more explicit tradeoff.

How should I restructure a large Perl script?

I have a more or less large Perl script of ~1000 lines. The script accepts a few arguments and runs straight through. No modules, no functions. The script could be divided into three parts: an initialization part, an argument-parsing part, and a work part, but I don't know how to do that. Everything must be kept in a single file. Please, can anyone give me instructions/advice on how to structure my Perl script?
Thanks.
You ask for advice on how to refactor your script, but you don't appear to understand why to refactor it. Without the why, the how isn't going to do you much good. And with the why, the how may fall out quite naturally.
If your script is working perfectly and needs no modification and all you'll ever do with it is run it, then you probably don't have a reason to refactor it - and I say that from the perspective of despising long routines. But...
If something's wrong with it
If you are trying to find a bug in your 1,000-line program, you have some hard work ahead of you. The problem could be anywhere. Break it up into smaller pieces so that you can verify the input and output at different stages - ideally, write tests for the smaller pieces. Fine-grained unit tests will tell you what isn't working, the nature of the error, and where the error exists.
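For instance, once a piece has been extracted into a sub, a small test for it might look like this (a sketch using the core Test::More module; parse_args() is a hypothetical extracted sub, not from the script in question):
use strict;
use warnings;
use Test::More tests => 2;

# Hypothetical sub extracted from the argument-parsing part of the script.
sub parse_args {
    my (@argv) = @_;
    my %opts = (verbose => 0);
    $opts{verbose} = 1 if grep { $_ eq '--verbose' } @argv;
    return \%opts;
}

is( parse_args('--verbose')->{verbose}, 1, 'verbose flag recognized' );
is( parse_args()->{verbose},            0, 'verbose defaults to off' );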
If you need to modify it
If you need to change the script to - say - accommodate a new graphics format, or take advantage of multiple processors, or record its activities to a log - you will find it easier to extend if the program elements that need revision or extension are better isolated.
If you're trying to explain it to someone else, or show it off
You will find it much easier to convey the ideas in your script to another developer if the ideas are broken out into discrete methods.
So, there are some reasons why you might choose to refactor. If any of them apply, refactor accordingly; the how will drop out naturally. Extract Method may be your best friend.
If you can see logical parts of your script, you should definitely abstract them into functions. Having a single script of over 1000 lines, and not breaking it up into whatever abstraction units your language provides (functions, classes, etc.) is a very bad idea. Maintaining your script, i. e. adding features and fixing bugs, will be a nightmare.
I strongly suggest you read the book Clean Code by Robert C. Martin. It uses Java for examples, but the ideas are applicable to any language. The one that is most relevant here is "Make your functions small. Then make them smaller."
1000 lines and no functions? Why not? This is a vague question.
You could break each section into a separate function, and then have a function that runs through each of them in the correct order, called 'run()' or something similar. This would allow you to break the program up into more manageable chunks.
p.s. man I think I used the word function too many times in this answer!
Have you refactored at all? At 1000 lines I'd suspect to see some code that could be broken down into functions internal to the script.
Well, if you have three separate sections that's the logical choice.
You could make each one into a function and then have a simple linear control at the top:
my ($var1, $var2, $var3);
$var1 = init();
$var2 = parseInput();
$var3 = doWork();

sub init {
    # some code here
}

sub parseInput {
    # some code here
}

sub doWork {
    # some code here
}
The big issue is that you're going to be using globals a lot. I'd build them into a structure or two. I would also expect to see the big three broken down into functions themselves. Back in the 80s, when the big thing I was learning was structured programming (the best design approach here, I think), the rule of thumb was that a function should fit on roughly one screen or less.
Most people would typically answer something like "a subroutine should do one thing" and "a subroutine should only take up one page in your editor". You can try to keep these things in mind when you refactor your code.
Try to identify parts of your code that can be split off into logical sections. You've started this process by spotting 'initialization', 'argument parsing', and 'work'. See if there are some sub-sections within that that can be pruned off into other subroutines.
Also, why do you not use any modules? One that springs to mind is Getopt::Long, which is a core module, so you won't have to install it manually. It will handle all of your argument parsing, and by using it you will probably avoid bugs and could shorten your code to make it more maintainable. By using standard modules like this, you not only (hopefully!) reduce the number of bugs in your code, you make it easier for other Perl programmers to understand.
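A minimal sketch of what Getopt::Long-based parsing could look like (the option names are examples, not taken from the script in question):
use strict;
use warnings;
use Getopt::Long;

my $verbose = 0;
my $input;

GetOptions(
    'verbose' => \$verbose,   # boolean flag: --verbose
    'input=s' => \$input,     # string option: --input FILE
) or die "Usage: $0 [--verbose] --input FILE\n";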
You could look at search.cpan.org; maybe some Perl module suits your needs. For example, there is CGI::Application.

How can I represent sets in Perl?

I would like to represent a set in Perl. What I usually do is use a hash with some dummy value, e.g.:
my %hash=();
$hash{"element1"}=1;
$hash{"element5"}=1;
Then use if (defined $hash{$element_name}) to decide whether an element is in the set.
Is this a common practice? Any suggestions on improving this?
Also, should I use defined or exists?
Thank you
Yes, building hash sets that way is a common idiom. Note that:
my @keys = qw/a b c d/;
my %hash;
@hash{@keys} = ();
is preferable to using 1 as the value because undef takes up significantly less space. This also forces you to use exists (which is the right choice anyway).
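A short illustration of why exists is the right test once the values are undef (a sketch):
my %set;
@set{qw(a b c)} = ();           # members exist, but their values are undef

print "member\n"  if exists  $set{a};   # true: the key is present
print "defined\n" if defined $set{a};   # false: the value is undef
delete $set{a};                          # remove an element from the set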
Use one of the many Set modules on CPAN. Judging from your example, Set::Light or Set::Scalar seem appropriate.
I can defend this advice with the usual arguments in favor of CPAN (disregarding possible synergy effects).
How can we know that look-up is all that is needed, both now and in the future? Experience teaches that even the simplest programs expand and sprawl. Using a module would anticipate that.
An API is much nicer for maintenance, or for people who need to read and understand the code in general, than an ad-hoc implementation, as it allows one to think about partial problems at different levels of abstraction.
Related to that, if it turns out that the overhead is undesirable, it is easy to go from a module to a simple implementation by removing indirection or paring down data structures and source code. On the other hand, if one later needs more features, it is considerably more difficult to go the other way around.
CPAN modules are already tested and to some extent thoroughly debugged, and the API has perhaps undergone improvements over time, whereas with an ad-hoc approach programmers usually implement the first design that comes to mind.
Rarely does it turn out that picking a module at the beginning was the wrong choice.
That's how I've always done it. I would tend to use exists rather than defined but they should both work in this context.

What's the best way to make a deep copy of a data structure in Perl?

Given a data structure (e.g. a hash of hashes), what's the clean/recommended way to make a deep copy for immediate use? Assume reasonable cases, where the data's not particularly large, no complicated cycles exist, and readability/maintainability/etc. are more important than speed at all costs.
I know that I can use Storable, Clone, Clone::More, Clone::Fast, Data::Dumper, etc. What's the current best practice?
Clone is much faster than Storable::dclone, but the latter supports more data types.
Clone::Fast and Clone::More are pretty much equivalent if memory serves me right, but less feature complete than even Clone, and Scalar::Util::Clone supports even less but IIRC is the fastest of them all for some structures.
With respect to readability these should all work the same, they are virtually interchangeable.
If you have no specific performance needs I would just use Storable's dclone.
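For reference, using dclone is essentially a one-liner (a sketch, with a made-up structure):
use Storable qw(dclone);

my %orig = ( items => { XP232 => 324 }, tags => ['widget'] );
my $copy = dclone(\%orig);     # deep copy; nested refs are duplicated

$copy->{items}{XP232} = 0;     # does not affect %orig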
I wouldn't use Data::Dumper for this simply because it's so cumbersome and roundabout. It's probably going to be very slow too.
For what it's worth, if you ever want customizable cloning then Data::Visitor provides hooking capabilities and fairly feature complete deep cloning is the default behavior.
My impression is that Storable::dclone() is somewhat canonical.
Clone is probably what you want for that. At least, that's what all the code I've seen uses.
Try fclone from Panda::Lib, which seems to be the fastest one (written in XS).
Quick and dirty hack if you're already dealing with JSONs and using the JSON module in your code: convert the structure to a JSON and then convert the JSON back to a structure:
use JSON;

my %hash = (
    obj => {},
    arr => [],
);

my $hash_ref_to_hash_copy = from_json(to_json(\%hash));
The only negative possibly being having to deal with a hash reference instead of a pure hash, but still, this has come in handy a few times for me.

When should I use OO Perl?

I'm just learning Perl.
When is it advisable to use OO Perl instead of non-OO Perl?
My tendency would be to always prefer OO unless the project is just a code snippet of < 10 lines.
TIA
From Damian Conway:
10 criteria for knowing when to use object-oriented design
Design is large, or is likely to become large
When data is aggregated into obvious structures, especially if there’s a lot of data in each aggregate
For instance, an IP address is not a good candidate: there are only 4 bytes of information in an IP address. An immigrant going through customs has a lot of data related to him, such as name, country of origin, luggage carried, destination, etc.
When types of data form a natural hierarchy that lets us use inheritance.
Inheritance is one of the most powerful features of OO, and the ability to use it is a flag.
When operations on data vary by data type
GIFs and JPGs might have their cropping done differently, even though they’re both graphics.
When it’s likely you’ll have to add data types later
OO gives you the room to expand in the future.
When interactions between data are best shown by operators
Some relations are best shown by using operators, which can be overloaded.
When implementation of components is likely to change, especially in the same program
When the system design is already object-oriented
When huge numbers of clients use your code
If your code will be distributed to others who will use it, a standard interface will make maintenance and safety easier.
When you have a piece of data on which many different operations are applied
Graphics images, for instance, might be blurred, cropped, rotated, and adjusted.
When the kinds of operations have standard names (check, process, etc)
Objects allow you to have a DB::check, ISBN::check, Shape::check, etc without having conflicts between the types of check.
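A tiny sketch of that last point, using plain bless (class and method names are illustrative only):
package ISBN;
sub new   { my ($class, $code) = @_; return bless { code => $code }, $class }
sub check { my ($self) = @_; return length($self->{code}) == 13 }

package Shape;
sub new   { my ($class, %args) = @_; return bless { %args }, $class }
sub check { my ($self) = @_; return $self->{sides} >= 3 }

package main;
my $isbn  = ISBN->new('9780596000271');
my $shape = Shape->new(sides => 4);
print $isbn->check  ? "valid ISBN\n"  : "bad ISBN\n";
print $shape->check ? "valid shape\n" : "bad shape\n";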
There is a good discussion about the same subject at PerlMonks.
Having Moose certainly makes it easier to always use OO from the word go. The only real exception is if compilation start-up is an issue (Moose does currently have a compile time overhead).
I don't think you should measure it by lines of code.
You are right: often when you are just writing a simple script, OO is probably too much overhead, but I think you should be more flexible than the 10-line approach.
In all cases when you are using OO Perl, remember to use Moose (or Mouse).
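A minimal Moose class, to give a sense of how little boilerplate it needs (the attribute names are just an example):
package Point;
use Moose;

has 'x' => (is => 'ro', isa => 'Num', required => 1);
has 'y' => (is => 'ro', isa => 'Num', required => 1);

sub distance_from_origin {
    my ($self) = @_;
    return sqrt($self->x ** 2 + $self->y ** 2);
}

package main;
my $p = Point->new(x => 3, y => 4);
print $p->distance_from_origin, "\n";   # prints 5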
This question doesn't have that much to do with Perl. The question is "when, given a choice, should I use OO?" That "given a choice" bit is because in some languages (Java, for example), you really don't have any choice.
The answer is "when it makes sense". Think about the problem you're trying to solve. Does the problem fit into the OO concepts of classes and object? If it does, great, use OO. Otherwise use some other paradigm.
Perl is fairly flexible, and you can easily write procedural, functional, or OO Perl, or even mix them together. Don't get hung up on doing OO because everyone else is. Learn to use the right approach for each task.
All of this takes experience and practice, so make sure to try all these approaches out, and maybe even take some smaller problems and solve them in multiple ways to see how each works.
Damian Conway has a passage in Perl Best Practices about this. It is not a rule that you have to follow, but it is probably better advice than I can give without knowing a lot about what you are doing.
Here is the publisher's page if that is a better place to link to the book.