When should you use XS? - perl

I am writing up a talk on XS and I need to know when the community thinks it is proper to reach for XS.

I can think of at least three reasons to use XS:
You have a C library you want to access in Perl 5
You have a block of code that is provably slowing down your program and it would be faster if written in C
You need access to something only available in XS
Reason 1 is obvious and should need no explaination.
When you really need reason 2 is less obvious. Often you are better off looking at how the code is structured. You should only invoke reason 2 if you have profiled your code and have a benchmark and test suite to prove that the XS code is faster and correct.
Reason 3 is a dangerous reason. It is rare that you actually need to look into Perl's guts to do something, but there is at least one valid case.

In a few cases, better memory management is another reason for using XS. For example, if you have a very large block of objects of some similar type, this can be managed more efficiently through XS. KinoSearch uses this for tokens, for example, where start and end offsets in a large string can be managed more effectively through XS than as a huge pool of scalars. PDL also has a memory management aspect to it, as well as speed.
There are proposals to integrate some of this approach into core Perl in the long term, initially because it offers a chance to make sharing data in threading better: see: http://openparallel.com/2011/07/05/a-new-hope-for-efficient-safe-data-sharing-between-threads-in-perl/.

Related

What's the point of using "map()" for two elements in perl?

I've seen code where there are just two rather static elements to be mapped such as time intervals with start and end dates, yet map() is being used rather than explicit code for mapping, e.g.
{ map { ... } qw(start end) } # vs.
{ start => ..., end => ... }
Which way is preferrable, and why?
The map form may be less concise but looks more functional (as in functional programming), so I guess that's why it may be preferred over explicit code and is perhaps more DRY.
However, it looks less legible to me because there is more logic going on behind, and mapping should also be less efficient because it invokes calls a and consists of more atomic operations.
EDIT
There is a conflicting goal in programming: KISS (keep it { pick 2 from: small, simple, stupid }). Using map slightly complicates code.
Assuming you're not just setting both items to the same constant or something similarly trivial, I would expect the map version to be more concise.
IMO, the main point in favor of the map version is that you know the same process will be used to produce both values. Not only for the sake of DRY, but also because it eliminates any concern that one might have a subtle change which the other doesn't.
As for the performance concern... If your use case is sufficiently performance-sensitive for any potential difference to matter, then you shouldn't be using Perl in the first place. Switching to well-written C (not C#, not C++, not Objective C - just plain C) will have a far greater performance impact than micro-optimizing whether you assign two values individually vs. using a loop to set them. But the odds of your use case being that sensitive are approximately zero anyhow.
There is a principle of coding known as DRY. Don't Repeat Yourself.
It asserts that:
Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.
And that can be interpreted as condensing duplicate typing with (things like) map/for.
I use idioms like the one you've quoted when I'm trying to expand some text - for example:
my #defs = map { "DEF:$_=$source_file:$_:MAX" } qw ( read write );
This generates me some DEF lines for rrdtool.
I'm doing it this way, because for some cases, I've got considerably longer lists of 'things I want to define' and want to be consistent. (Sometimes I have say, 10 similar lines that differ only by a single word).
But also because:
my #defs = ( "DEF:read=$source_file:read:MAX",
"DEF:write=$source_file:write:MAX" );
There's not much in it for two elements, and I'd suggest it's as much a matter of style as anything. However, if you've got more than that, it quickly becomes very beneficial because you can change the single line - say you've got a different file location? Want to swap MAX for AVERAGE?
It's also quite shockingly easy to go 'punctuation blind' when looking at a long sequence of similar statements, where someone's typo-ed and added a , where it should be . or similar.
And ... you probably don't lose a great deal in terms of readability. But will acknowledge that's something of a style point, because whilst map is pretty amazing, it can make for some rather hard to read code if you're not careful.
Also to specifically address:
mapping should also be less efficient because it invokes calls a and consists of more atomic operations.
A wise man once said:
premature optimization is the root of all evil
Don't think about the efficiency of a statement - look at the legibility/readability. Compilers are pretty clever. Most "obvious" optimisations, they already make for you. Processors are also pretty fast. Your limiting factor in most code isn't the amount of CPU cycles you need, it's IO throughput and memory footprint. So don't worry about it - write clear code.
And if there's a performance critical demand on your code, you should be using a code profiler to look at where you gain the most efficiency for your effort at refactoring. You may end up with less clear code in doing so (sometimes) but that's a more clear tradeoff.

Stream in production code

Do people really use Scala's Stream class in production code, or is it primarily of academic interest?
There's no problem with Stream, except when people use it to replace Iterator -- as opposed to replacing List, which is the collection most similar to it. In that particular case, one has to be careful in its use. On the other hand, one has to be careful using Iterator as well, since each element can only be iterated through once.
So, since both have their own problems, why single out Stream's? I daresay it's simply that people are used to Iterator from Java, whereas Stream is a functional thing.
Even though I wrote that Iterator is what I want to use nearly all the time I do use Stream in production code. I just don't automatically assume that the cells are garbage collected.
Sometimes Stream fits the problem perfectly. I think the api gives some good examples where recursion is involved...
Look here. This blog post describes how to use Scala Streams (along with memory mapped file) to read large files (1-2G) efficiently.
I did not try it yet but the solution looks reasonable. Stream provides a nice abstraction on top of the low level ByteBuffer Java API to handle a memory mapped file as a sequence of records.
Yes, I use it, although it tends to be for something like this:
(as.toStream collect expensiveConversionToB) match {
case b #:: _ => //found my expensive b
case _ =>
}
Of course, I might use a non-strict view and a find for this example
Since the only reason not to use Streams is that it can be tricky to ensure the JVM isn't keeping references to early conses around, one approach I've used that's fairly nice is to build up a Stream and immediately convert it to an Iterator for actual use. It loses a little of Stream's nice properties on the use side especially with respect to backtracking, but if you're only going to make one pass over the result it's frequently easier to build a structure this way than to contort into the hasNext/next() model of Iterator directly.

How can I represent sets in Perl?

I would like to represent a set in Perl. What I usually do is using a hash with some dummy value, e.g.:
my %hash=();
$hash{"element1"}=1;
$hash{"element5"}=1;
Then use if (defined $hash{$element_name}) to decide whether an element is in the set.
Is this a common practice? Any suggestions on improving this?
Also, should I use defined or exists?
Thank you
Yes, building hash sets that way is a common idiom. Note that:
my #keys = qw/a b c d/;
my %hash;
#hash{#keys} = ();
is preferable to using 1 as the value because undef takes up significantly less space. This also forces you to uses exists (which is the right choice anyway).
Use one of the many Set modules on CPAN. Judging from your example, Set::Light or Set::Scalar seem appropriate.
I can defend this advice with the usual arguments pro CPAN (disregarding possible synergy effects).
How can we know that look-up is all that is needed, both now and in the future? Experience teaches that even the simplest programs expand and sprawl. Using a module would anticipate that.
An API is much nicer for maintenance, or people who need to read and understand the code in general, than an ad-hoc implementation as it allows to think about partial problems at different levels of abstraction.
Related to that, if it turns out that the overhead is undesirable, it is easy to go from a module to a simple by removing indirections or paring data structures and source code. But on the other hand, if one would need more features, it is moderately more difficult to achieve the other way around.
CPAN modules are already tested and to some extent thoroughly debugged, perhaps also the API underwent improvement steps over the time, whereas with ad-hoc, programmers usually implement the first design that comes to mind.
Rarely it turns out that picking a module at the beginning is the wrong choice.
That's how I've always done it. I would tend to use exists rather than defined but they should both work in this context.

Are MooseX::Declare and MooseX::Method::Signatures production ready?

From the current version (0.98) of the Moose::Manual::MooseX are the lines:
We have high hopes for the future of
MooseX::Method::Signatures and
MooseX::Declare. However, these
modules, while used regularly in
production by some of the more insane
members of the community, are still
marked alpha just in case backwards
incompatible changes need to be made.
I noticed that for MooseX::Method::Signatures the change log for September 2009 mentions the removal of the "scary ALPHA disclaimer".
So, are these still "alpha"?
Would I still be considered one of the "more insane" to use them?
I'd say they are production ready - I'm using them in production - but there are several things to consider:
Performance
MooseX::Declare and dependencies do almost all of their magic at compile time. Depending on the size of your program, you might find anywhere from half a second to several seconds of additional initialization overhead. If this a problem, don't use MooseX::Declare.
At runtime, the main overhead is type and argument checking, which you should (ideally) be doing anyway. That said, Moose type constraints have some overheads, namely coercion and the more complex (MooseX::Types::Structured-style) constraints. Don't use these if performance is an issue.
Stability
MooseX::Declare and MooseX::Method::Signature's external syntax is now stable. But it is important to know that the internals are subject to extreme change. (fortunately, changes for the better)
To give you an idea, the signature itself is grabbed using a big block of C code stolen from the Perl tokenizer (toke.c). This can break in some situations since it isn't actually parsing anything. The bit inside the brackets is parsed using PPI, which is designed for pure Perl, but the resulting PPI tree is then hacked up to get something useful. Devel::Declare itself is a hack - after it sees specific keywords (e.g. 'role', 'class', 'method') the Devel::Declare-using module must rewrite the source code by hand, with no interaction with the real Perl parser.
Corner cases may cause Perl to segfault. Or rewrite the source code badly, so you get syntax errors but have no idea what's causing them without -MO::Deparse. If you mess up the MooseX::Declare syntax by accident, there is no guarantee that the module will detect this and give you a sensible error. The ALPHA message may have gone, but this is still doing dark and scary things internally, and you should be prepared for that.
UPDATE
MooseX::Declare has not been updated much, and you may wish to look at alternatives such as Moops. Personally, I have decided to stick with pure Moose until Perl itself begins to support class/method/has syntax natively, which is possibly on the cards.
I think it's a matter of differing perspectives as much as anything -- rafl is one of the aforementioned "more insane members of the community" while Rolsky is more conservative. It's up to you to decide who you agree with, and really I think that the most important variable is your own code.
MooseX::Declare is good code. It won't randomly blow up your machine, it's not awful for performance, and it offers a lot of nifty stuff while reducing the amount of boilerplate that you have to write. But it might change in the future, making your code refuse to compile until it's updated; it might make your editor and other development tools confused when it sees syntax that it can't parse, it might piss off your collaborators by making them learn a new module to work with your code, or it might piss off your boss by making it so any future maintainer has to learn a new module to work with your code. Which of those things apply to you, and to what degree? You know better than I do, I hope.
There are people who feel that the maturity and stability of MooseX::Delcare, Devel::Declare on which it's based, or even Moose itself are not yet ready for "prime time". I also know of two large companies with millions of visitors a month, who have MooseX::Declare in their production environment. I personally am happy with the stack I am provided with Moose and do not see a need yet to bring in MooseX::Declare. I know people who's opinion I deeply respect who refuse to write new code without the declarative sugar from MooseX::Declare.
All of this is to say, the decision on whether something is or is not production ready is highly dependent upon your production environment, your development needs, and taste for risk. Without being in your shoes we can't possibly give an informed decision as to how well any given tool matches that profile.
MooseX::Method::Signatures (MXMS), and MooseX::Declare which uses it, is not production ready. This is not because the code isn't stable, but because it is appallingly slow. Simply using the method keyword, no types or arguments, is a 500-1000x runtime performance hit over a regular method call. My Macbook Pro can do about 6,000 simple method calls per second using MXMS vs 5,000,000 with plain Perl.
Method::Signatures, in contrast, has almost no performance hit above what it would normally cost to do the requested checks. The syntax is almost exactly the same as MXMS and it supports Moose (and Mouse) types. Both rely on the same underlying syntax modifying technique. (Full disclosure, I am the author of Method::Signatures.)
If you like MooseX::Declare but want the performance of Method::Signatures, try Method::Signatures::Modifiers.
It depends on what you mean by "production ready". I wouldn't depend on them until their velocity slows down quite a bit. I like my production stuff to not need frequent care from external code changes, API adjustments, and so on. That's not something particular to Moose, but any young project.
You have to judge how much that matters to you. In some situations, pushing stuff into production is a lengthy process, so you must be circumspect with such things. At the other extreme, some places let you edit files directly on the production server. That is, you have to define your tolerance before anyone can tell you which side a given MooseX module is on.
MooseX::Declare and MooseX::Method::Signatures are working well but they can have really nasty penalty depending on what does your code do. This can be fixed by just not using method keyword or using Method::Signatures::Modifiers.
Performance penalty I am seeing is around 2-5x compared to Method::Signatures::Modifiers (5x being mostly for one specific class I was using). And it seems that it is mostly compile time or maybe first time initialization, because it is getting under 2x when the computation is longer.
Method::Signatures::Modifiers has better errors but you have to turn this optimization off when you use debugger (it goes haywire because it does not see these methods, you can see for yourself in -MO=Deparse output).
It may be worth it to get rid of Perl argument shifting hell.

When should I use OO Perl?

I'm just learning Perl.
When is it advisable to use OO Perl instead of non-OO Perl?
My tendency would be to always prefer OO unless the project is just a code snippet of < 10 lines.
TIA
From Damian Conway:
10 criteria for knowing when to use object-oriented design
Design is large, or is likely to become large
When data is aggregated into obvious structures, especially if there’s a lot of data in each aggregate
For instance, an IP address is not a good candidate: There’s only 4 bytes of information related to an IP address. An immigrant going through customs has a lot of data related to him, such as name, country of origin, luggage carried, destination, etc.
When types of data form a natural hierarchy that lets us use inheritance.
Inheritance is one of the most powerful feature of OO, and the ability to use it is a flag.
When operations on data varies on data type
GIFs and JPGs might have their cropping done differently, even though they’re both graphics.
When it’s likely you’ll have to add data types later
OO gives you the room to expand in the future.
When interactions between data is best shown by operators
Some relations are best shown by using operators, which can be overloaded.
When implementation of components is likely to change, especially in the same program
When the system design is already object-oriented
When huge numbers of clients use your code
If your code will be distributed to others who will use it, a standard interface will make maintenence and safety easier.
When you have a piece of data on which many different operations are applied
Graphics images, for instance, might be blurred, cropped, rotated, and adjusted.
When the kinds of operations have standard names (check, process, etc)
Objects allow you to have a DB::check, ISBN::check, Shape::check, etc without having conflicts between the types of check.
There is a good discussion about same subject # PerlMonks.
Having Moose certainly makes it easier to always use OO from the word go. The only real exception is if compilation start-up is an issue (Moose does currently have a compile time overhead).
I don't think you should measure it by lines of code.
You are right, often when you are just writing a simple script OO is probably too much overhead, but I think you should be more flexible regarding the 10 lines aproach.
In all cases when you are using OO Perl Rememebr to use Moose (or Mouse)
This question doesn't have that much to do with Perl. The question is "when, given a choice, should I use OO?" That "given a choice" bit is because in some languages (Java, for example), you really don't have any choice.
The answer is "when it makes sense". Think about the problem you're trying to solve. Does the problem fit into the OO concepts of classes and object? If it does, great, use OO. Otherwise use some other paradigm.
Perl is fairly flexible, and you can easily write procedural, functional, or OO Perl, or even mix them together. Don't get hung up on doing OO because everyone else is. Learn to use the right approach for each task.
All of this takes experience and practice, so make sure to try all these approaches out, and maybe even take some smaller problems and solve them in multiple ways to see how each works.
Damian Conway has a passage in Perl Best Practices about this. It is not a rule that you have to follow it, but it is probably better advice that I can give without knowing a lot about what you are doing.
Here is the publisher's page if that is a better place to link to the book.