The "Why" behind PMD's rules - pmd

Is there a good resource which describes the "why" behind PMD rule sets? PMD's site has the "what" - what each rule does - but it doesn't describe why PMD has that rule and why ignoring that rule can get you in trouble in the real world. In particular, I'm interested in knowing why PMD has the AvoidInstantiatingObjectsInLoops and OnlyOneReturn rules (instantiating objects in a loop seems necessary if you need to create a new object corresponding to each object in a collection, and multiple returns seem like a necessity in many methods that return a value based on some criteria). But what I'm really after is a link somewhere describing the "why" behind a majority of PMD's rules, since this comes up often enough.
Just to be clear, I know that I can disable these and how to do that; I'm just wondering why they are there in the first place. Sorry if there's something obvious I missed out there, but I did a Google search and SO search before posting this. I also understand that these issues are often a matter of "taste" - what I'm looking for is what the arguments for the rules are and what the alternatives are. To give a concrete example, how are you supposed to create one object corresponding to every object in a collection (which is a common operation in Java) without instantiating each object in a loop?

In each case, the rule can be a matter of specific circumstances or just "taste".
Instantiating an Object in a loop should be avoided if there are a large number of iterations and the instantiation is expensive. If you can move the code out of the loop, you will avoid many object instantiations, and therefore improve performance. Having said that, this isn't always possible, and in some cases it just doesn't matter to the overall performance of the code. In these cases, do whichever is clearer.
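To make that concrete, here is a rough sketch (the type and method names are invented, not from PMD's documentation) of the two situations: a buffer that can be hoisted out of the loop and reused, and the case from the question where a new object per element is exactly the intent, so the warning is just noise:

import java.util.ArrayList;
import java.util.List;

public class LoopAllocationExample {

    // Hoistable case: the same buffer can be reused, so allocate it once.
    static List<String> labels(List<Integer> values) {
        List<String> out = new ArrayList<>(values.size());
        StringBuilder sb = new StringBuilder();   // created once, outside the loop
        for (Integer v : values) {
            sb.setLength(0);                      // reset instead of re-allocating
            out.add(sb.append("value=").append(v).toString());
        }
        return out;
    }

    // Unavoidable case: the whole point is one distinct object per element.
    // PMD still flags the 'new', but here the warning is a false positive.
    static List<Point> toPoints(List<Integer> xs) {
        List<Point> out = new ArrayList<>(xs.size());
        for (Integer x : xs) {
            out.add(new Point(x, 0));
        }
        return out;
    }

    record Point(int x, int y) { }
}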
For OnlyOneReturn, there are several ways to view this (with vehement supporters behind each), but they all basically boil down to taste.
For your example, the OnlyOneReturn proponents want code like:
public int performAction(String input) {
    int result;
    if (input.equals("bob")) {
        result = 1;
    } else {
        result = 2;
    }
    return result;
}
Rather than:
public int performAction(String input) {
    if (input.equals("bob")) {
        return 1;
    } else {
        return 2;
    }
}
As you can see, the additional clarity of OnlyOneReturn can be debated.
Also see this SO question that relates to instantiation within loops.

This article, "A Comparison of Bug Finding Tools for Java" by Nick Rutar, Christian Almazan, and Jeff Foster, "compares several bug checkers for Java..." (FindBugs Documents and Publications). PMD is seen to be rather more verbose.
Addendum: As the authors suggest,
"all of the tools choose different tradeoffs between
generating false positives and false negatives."
In particular, AvoidInstantiatingObjectsInLoops may not be a bug at all if that is the intent. It's included to help Avoid creating unnecessary objects. Likewise OnlyOneReturn is suggestive in nature. Multiple returns represent a form of goto, sometimes considered harmful, but reasonably used to improve readability.
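To illustrate the "reasonably used to improve readability" point, here is a hedged Java sketch (the Customer type and the discount rules are made up) contrasting guard-clause returns, which OnlyOneReturn flags, with the single-exit form it asks for:

public class Discounts {

    // Early returns as guard clauses: flagged by OnlyOneReturn, but arguably easier to follow.
    static int discountPercent(Customer c) {
        if (c == null) {
            return 0;
        }
        if (c.isEmployee()) {
            return 30;
        }
        if (c.yearsActive() > 5) {
            return 10;
        }
        return 0;
    }

    // Single-exit version that satisfies the rule: one result variable, one return statement.
    static int discountPercentSingleExit(Customer c) {
        int result = 0;
        if (c != null) {
            if (c.isEmployee()) {
                result = 30;
            } else if (c.yearsActive() > 5) {
                result = 10;
            }
        }
        return result;
    }

    record Customer(boolean isEmployee, int yearsActive) { }
}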
My pet peeve is people who mandate the use of such tools without understanding the notion of false positives.
As noted here, more recent versions of PMD support improved customization when integrated into the build process.

You can look at the PMD homepage; the rules are explained there in detail, often with a why. The site is organized by rule group; here is the link to the basic rules: http://pmd.sourceforge.net/rules/basic.html

Each rule is in a PMD Rule Set, which can give you a clue to the reasoning behind the rule (if it isn't explained in detail on the Rule Set page itself).
In the case of AvoidInstantiatingObjectsInLoops, it can be expensive to instantiate a similar object again and again. However, it is frequently necessary. On my own project, I have disabled this rule, since it flags too many false positives.
In the case of OnlyOneReturn, note that it is in a Rule Set called Controversial, which is a hint that these rules are debatable, and depend on the case. I have disabled this entire Rule Set as well.

Related

Do we have to implement copy on write behavior for our custom types?

In Swift, collections are implicitly implemented with copy-on-write behavior; however, we don't get it for free in our custom types.
My main question is:
Regardless of how to achieve it, is it a good idea to do for our custom types? Why/Why not?
Moreover:
According to this answer, even the built-in types (but not collections) provided by the Swift standard library do not implement it, which could be an indication that we don't have to. Even so, is there any advantage to doing it?
You do not have to do it, but it can be a worthwhile optimization if you have the resources and need to do so. Ask yourself the following questions:
Is my datatype copied often (i.e. applicability)?
Is it easy enough to implement CoW in reasonable time (i.e. viability)?
Does my application benefit from these optimizations (i.e. return on investment)?
In most applications it is probably not necessary, and the users will not notice the difference. In some specific cases it might be applicable, but be critical. Remember:
Premature optimization is the root of all evil. ~ Donald Knuth

What's the point of using "map()" for two elements in perl?

I've seen code where there are just two rather static elements to be mapped such as time intervals with start and end dates, yet map() is being used rather than explicit code for mapping, e.g.
{ map { ... } qw(start end) } # vs.
{ start => ..., end => ... }
Which way is preferable, and why?
The map form may be less concise but looks more functional (as in functional programming), so I guess that's why it may be preferred over explicit code and is perhaps more DRY.
However, it looks less legible to me because there is more logic going on behind it, and mapping should also be less efficient because it involves extra function calls and consists of more atomic operations.
EDIT
There is a conflicting goal in programming: KISS (keep it { pick 2 from: small, simple, stupid }). Using map slightly complicates code.
Assuming you're not just setting both items to the same constant or something similarly trivial, I would expect the map version to be more concise.
IMO, the main point in favor of the map version is that you know the same process will be used to produce both values. Not only for the sake of DRY, but also because it eliminates any concern that one might have a subtle change which the other doesn't.
As for the performance concern... If your use case is sufficiently performance-sensitive for any potential difference to matter, then you shouldn't be using Perl in the first place. Switching to well-written C (not C#, not C++, not Objective C - just plain C) will have a far greater performance impact than micro-optimizing whether you assign two values individually vs. using a loop to set them. But the odds of your use case being that sensitive are approximately zero anyhow.
There is a principle of coding known as DRY. Don't Repeat Yourself.
It asserts that:
Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.
And that can be interpreted as condensing duplicate typing with (things like) map/for.
I use idioms like the one you've quoted when I'm trying to expand some text - for example:
my @defs = map { "DEF:$_=$source_file:$_:MAX" } qw ( read write );
This generates me some DEF lines for rrdtool.
I'm doing it this way, because for some cases, I've got considerably longer lists of 'things I want to define' and want to be consistent. (Sometimes I have say, 10 similar lines that differ only by a single word).
But also because:
my #defs = ( "DEF:read=$source_file:read:MAX",
"DEF:write=$source_file:write:MAX" );
There's not much in it for two elements, and I'd suggest it's as much a matter of style as anything. However, if you've got more than that, it quickly becomes very beneficial because you can change the single line - say you've got a different file location? Want to swap MAX for AVERAGE?
It's also quite shockingly easy to go 'punctuation blind' when looking at a long sequence of similar statements, where someone's typo-ed and added a , where it should be . or similar.
And ... you probably don't lose a great deal in terms of readability. But I will acknowledge that's something of a style point, because whilst map is pretty amazing, it can make for some rather hard to read code if you're not careful.
Also to specifically address:
mapping should also be less efficient because it involves extra function calls and consists of more atomic operations.
A wise man once said:
premature optimization is the root of all evil
Don't think about the efficiency of a statement - look at the legibility/readability. Compilers are pretty clever. Most "obvious" optimisations, they already make for you. Processors are also pretty fast. Your limiting factor in most code isn't the amount of CPU cycles you need, it's IO throughput and memory footprint. So don't worry about it - write clear code.
And if there's a performance critical demand on your code, you should be using a code profiler to look at where you gain the most efficiency for your effort at refactoring. You may end up with less clear code in doing so (sometimes) but that's a more clear tradeoff.

Writing programs in dynamic languages that go beyond what the specification allows

With the growth of dynamically typed languages, which give us more flexibility, it is very likely that people will write programs that go beyond what the specification allows.
My thinking was influenced by this question, when I read the answer by bobince:
A question about JavaScript's slice and splice methods
The basic thought is that splice, in JavaScript, is specified to be used in only certain situations, but it can be used in others, and there is nothing the language can do to stop it, as the language is designed to be extremely flexible.
Unless someone reads through the specification and decides to adhere to it, I am fairly certain that there are many such violations occurring.
Is this a problem, or a natural extension of writing such flexible languages? Or should we expect tools like JSLint to act as the specification police?
I liked one answer in this question, that the implementation of python is the specification. I am curious if that is actually closer to the truth for these types of languages, that basically, if the language allows you to do something then it is in the specification.
Is there a Python language specification?
UPDATE:
After reading a couple of comments, I thought I would check the splice method in the spec and this is what I found, at the bottom of pg 104, http://www.mozilla.org/js/language/E262-3.pdf, so it appears that I can use splice on the array of children without violating the spec. I just don't want people to get bogged down in my example, but hopefully to consider the question.
The splice function is intentionally generic; it does not require that its this value be an Array object. Therefore it can be transferred to other kinds of objects for use as a method. Whether the splice function can be applied successfully to a host object is implementation-dependent.
UPDATE 2:
I am not interested in this being about JavaScript, but about language flexibility and specs. For example, I expect that the Java spec specifies you can't put code into an interface, but using AspectJ I do that frequently. This is probably a violation, but the writers didn't predict AOP and the tool was flexible enough to be bent for this use, just as the JVM is also flexible enough for Scala and Clojure.
Whether a language is statically or dynamically typed is really a tiny part of the issue here: a statically typed one may make it marginally easier for code to enforce its specs, but marginally is the key word here. Only "design by contract" -- a language letting you explicitly state preconditions, postconditions and invariants, and enforcing them -- can help ward you against users of your libraries empirically discovering what exactly the library will let them get away with, and taking advantage of those discoveries to go beyond your design intentions (possibly constraining your future freedom in changing the design or its implementation). And "design by contract" is not supported in mainstream languages -- Eiffel is the closest to that, and few would call it "mainstream" nowadays -- presumably because its costs (mostly, inevitably, at runtime) don't appear to be justified by its advantages. "Argument x must be a prime number", "method A must have been previously called before method B can be called", "method C cannot be called any more once method D has been called", and so on -- the typical kinds of constraints you'd like to state (and have enforced implicitly, without having to spend substantial programming time and energy checking for them yourself) just don't lend themselves well to be framed in the context of what little a statically typed language's compiler can enforce.
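To give a feel for what that hand-rolled checking looks like in practice (a minimal sketch; the GuardedReader class and its rules below are invented for illustration, not taken from any particular library), the constraints the type system cannot express end up as runtime guards like these:

// Hand-rolled "design by contract": the compiler cannot enforce
// "start() must be called before read()" or "blockSize must be positive",
// so the library has to police those constraints itself at runtime.
public class GuardedReader {
    private boolean started = false;
    private boolean closed = false;

    public void start() {
        if (started) {
            throw new IllegalStateException("start() may only be called once");
        }
        started = true;
    }

    public int read(int blockSize) {
        if (!started || closed) {
            throw new IllegalStateException("read() requires start() and no prior close()");
        }
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        return 0; // placeholder for the real work
    }

    public void close() {
        closed = true;
    }
}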
I think that this sort of flexibility is an advantage as long as your methods are designed around well defined interfaces rather than some artificial external "type" metadata. Most of the array functions only expect an object with a length property. The fact that they can all be applied generically to lots of different kinds of objects is a boon for code reuse.
The goal of any high level language design should be to reduce the amount of code that needs to be written in order to get stuff done - without harming readability too much. The more code that has to be written, the more bugs get introduced. Restrictive type systems can be (if not well designed) a pervasive lie at worst, a premature optimisation at best. I don't think overly restrictive type systems aid in writing correct programs. The reason being that the type is merely an assertion, not necessarily based on evidence.
By contrast, the array methods examine their input values to determine whether they have what they need to perform their function. This is duck typing, and I believe that this is more scientific and "correct", and it results in more reusable code, which is what you want. You don't want a method rejecting your inputs because they don't have their papers in order. That's communism.
I do not think your question really has much to do with dynamic vs. static typing. Really, I can see two cases: on one hand, there are things like Duff's device that martin clayton mentioned; that usage is extremely surprising the first time you see it, but it is explicitly allowed by the semantics of the language. If there is a standard, that kind of idiom may appear in later editions of the standard as a specific example. There is nothing wrong with these; in fact, they can (unless overused) be a great productivity boost.
The other case is that of programming to the implementation. Such a case would be an actual abuse, coming from either ignorance of a standard, or lack of a standard, or having a single implementation, or multiple implementations that have varying semantics. The problem is that code written in this way is at best non-portable between implementations and at worst limits the future development of the language, for fear that adding an optimization or feature would break a major application.
It seems to me that the original question is a bit of a non sequitur. If the specification explicitly allows a particular behavior (as MUST, MAY, SHALL or SHOULD) then any compiler/interpreter that allows/implements the behavior is, by definition, compliant with the language. This would seem to be the situation proposed by the OP in the comments section - the JavaScript specification supposedly* says that the function in question MAY be used in different situations, and thus it is explicitly allowed.
If, on the other hand, a compiler/interpreter implements or allows behavior that is expressly forbidden by a specification, then the compiler/interpreter is, by definition, operating outside the specification.
There is yet a third scenario, and an associated, well-defined term for those situations where the specification does not define a behavior: undefined. If the specification does not actually specify a behavior given a particular situation, then the behavior is undefined, and may be handled either intentionally or unintentionally by the compiler/interpreter. It is then the responsibility of the developer to realize that the behavior is not part of the specification, and, should s/he choose to leverage the behavior, the developer's application is thereby dependent upon the particular implementation. The interpreter/compiler providing that implementation is under no obligation to maintain the officially undefined behavior beyond backwards compatibility and whatever commitments the producer may make. Furthermore, a later iteration of the language specification may define the previously undefined behavior, making the compiler/interpreter either (a) non-compliant with the new iteration, or (b) obliged to come out with a new patch/version to become compliant, thereby breaking older versions.
* "supposedly" because I have not seen the spec, myself. I go by the statements made, above.

Why use a post compiler?

I am battling to understand why a post compiler, like PostSharp, should ever be needed?
My understanding is that it just inserts code where attributed in the original code, so why doesn't the developer just do that code writing themselves?
I expect that someone will say it's easier to write since you can use attributes on methods and then not clutter them up with boilerplate code, but that can be done using DI or reflection and a touch of forethought without a post compiler. I know that since I have said reflection, the performance elephant will now enter - but I do not care about the relative performance here, when the absolute performance for most scenarios is trivial (sub millisecond to millisecond).
Let's try to take an architectural point of view on the issue. Say you are an architect (everyone wants to be an architect ;)
You need to deliver the architecture to your team:
a selected set of libraries, architectural patterns, and design patterns. As a part of your design, you say: "we will implement caching using the following design pattern:"
string key = string.Format("[{0}].MyMethod({1},{2})", this, param1, param2);
T value;
if (!cache.TryGetValue(key, out value))
{
    using (cache.Lock(key))
    {
        if (!cache.TryGetValue(key, out value))
        {
            // Do the real job here and store the value into variable 'value'.
            cache.Add(key, value);
        }
    }
}
This is a correct way to do caching. Developers are going to implement this pattern thousands of times, so you write a nice Word document telling how you want the pattern to be implemented. Yeah, a Word document. Do you have a better solution? I'm afraid you don't. Classic code generators won't help. Functional programming (delegates)? It works fairly well for some aspects, but not here: you need to pass method parameters to the pattern. So what's left? Describe the pattern in natural language and trust that developers will implement it.
What will happen?
First, some junior developer will look at the code and say: "Hm. Two cache lookups. Kinda useless. One is enough." (that's not a joke -- ask the DNN team about this issue). And your pattern ceases to be thread-safe.
As an architect, how do you ensure that the pattern is properly applied? Unit testing? Fair enough, but you will hardly detect threading issues this way. Code review? That's maybe the solution.
Now, what if you decide to change the pattern? For instance, you detect a bug in the cache component and decide to use your own? Are you going to edit thousands of methods? It's not just refactoring: what if the new component has different semantics?
What if you decide that a method is not going to be cached any more? How difficult will it be to remove caching code?
The AOP solution (whatever the framework is) has the following advantages over plain code:
It reduces the number of lines of code.
It reduces the coupling between components, therefore you don't have to change many things when you decide to change the logging component (just update the aspect), therefore it improves the capacity of your source code to cope with new requirements over time.
Because there is less code, the probability of bugs is lower for a given set of features, therefore AOP improves the quality of your code.
So if you put it all together:
Aspects reduce both development costs and maintenance costs of software.
I have a 90 min talk on this topic and you can watch it at http://vimeo.com/2116491.
Again, the architectural advantages of AOP are independent of the framework you choose. The differences between frameworks (also discussed in this video) influence principally the extent to which you can apply AOP to your code, which was not the point of this question.
Suppose you already have a class which is well-designed, well-tested etc. You want to easily add some timing on some of the methods. Yes, you could use dependency injection, create a decorator class which proxies to the original but with timing for each method - but even that class is going to be a mess of repetition...
... or you can add reflection to the mix and use a dynamic proxy of some description, which lets you write the timing code once, but requires you to get that reflection code just right - which isn't as easy as it might be, especially if generics are involved.
... or you can add an attribute to each method that you want timed, write the timing code once, and apply it as a post-compile step.
I know which seems more elegant to me - and more obvious when reading the code. It can be applied even in situations where DI isn't appropriate (and it really isn't appropriate for every single class in a system) and with no other changes elsewhere.
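To sketch the dynamic-proxy option mentioned above in Java terms (the OrderService interface is hypothetical; roughly analogous tools exist in .NET, such as DispatchProxy or Castle DynamicProxy), the timing concern is written once and wrapped around any interface:

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;

public class TimingProxy {

    // Hypothetical service interface; JDK dynamic proxies can only wrap interfaces.
    interface OrderService {
        void placeOrder(String id);
    }

    @SuppressWarnings("unchecked")
    static <T> T withTiming(T target, Class<T> iface) {
        InvocationHandler handler = (proxy, method, args) -> {
            long start = System.nanoTime();
            try {
                return method.invoke(target, args);
            } finally {
                long micros = (System.nanoTime() - start) / 1_000;
                System.out.println(method.getName() + " took " + micros + " us");
            }
        };
        return (T) Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[] { iface }, handler);
    }

    public static void main(String[] args) {
        OrderService real = id -> System.out.println("placing order " + id);
        OrderService timed = withTiming(real, OrderService.class);
        timed.placeOrder("42");   // the timing concern is written once, not per method
    }
}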
AOP (PostSharp) is for attaching code to all sorts of points in your application, from one location, so you don't have to place it there.
You cannot achieve what PostSharp can do with Reflection.
I personally don't see a big use for it, in a production system, as most things can be done in other, better, ways (logging, etc).
You may like to review the other threads on this matter:
Anyone with Postsharp experience in production?
Other than logging, and transaction management what are some practical applications of AOP?
Aspect Oriented Programming: What do you use PostSharp for?
etc (search)
Aspects take away all the copy & paste code and make adding new features faster.
I hate nothing more than, for example, having to write the same piece of code over and over again. Gael has a very nice example regarding INotifyPropertyChanged on his website (www.postsharp.net).
This is exactly what AOP is for. Forget about the technical details, just implement what you are being asked for.
In the long run, I think we all should say goodbye to the way we are writing software now. It's tedious and plainly stupid to write boilerplate code and iterate manually.
The future belongs to declarative, functional style being held together by an object oriented framework - and the cross cutting concerns being handled by aspects.
I guess the only people who will not get it soon are the guys who are still paid for lines of code.

Can something be initializable?

I've created an interface called Initializable, but according to Dictionary.com this is not a word. Searching Google only gives about 30k results and they are mostly API references.
Is there another word to describe something that can be initialized (which is a word)?
Edit:
Thanks for the questions about it being in the constructor; that may be a better way. Right now they are static classes (as static as can be in Ruby) that get loaded dynamically and have some initialization stuff to do.
Technical people create new words all the time (see example below). But this isn't a case of creating a new word. This is a case of a "derivation". You have taken a perfectly good word ("initialize") and added a perfectly good derivative suffix to it ("able"). The resulting word initializable is a derivative word.
In short, if something can be initialized, it is initializable. Just like it can be runnable, or stoppable.
Now, I don't think it will be long before a grammar Nazi points out the error of my ways here. But English is a rich and expressive language. A word doesn't have to be listed on dictionary.com for it to be valid. Nor even on m-w.com (which I believe is a better site).
One of my favorite books is Garner's Modern American Usage. It's a great book and is more than a dictionary - it is a reference and guide on how American English is used.
"Atomic" is a good example of a word we use in software development all the time that is somewhat of a "made up" word. In a development context something that is atomic either happens, or does not happen - it cannot be divided into separate operations. But, the common definition for this word doesn't take this usage into account.
Bah! Here is a better one: "grep". Not in the dictionary, yet a perfectly good word. I use it all the time.
How about -
interface ICanBeInitialized
or...(and I had a little xmas drinky...so sorry)
interface ICanHazInitialization
I think the question --- other than the pedantic one about the word, which I'll mention below --- is what the behavior you intend to identify by this "Initializable" tag might be.
It's not an uncommon style to write a private method init() in, e.g., Java to do complicated initialization; since that code may be needed in several places (what with copy constructors, clone operations and so on) it's just good form. It's less common, but a valid thing to do, to have a "Forward" class that is constructed, but that is waiting for some asynchronous operation in order to be fully initialized (e.g., the Asynchronous Completion Token pattern). So it's not necessarily so that this should be just in the ctor, but I'm curious what the actual behavior you want would be.
On the word: English is a somewhat agglutinating language, like German; there are grammatical rules that construct words from base words and other syllables in patterns. One of those is the one here: "initial" -> "initialize" -> "initializable". Any native speaker will recognize "initializable" as something that has the property of being able to be initialized. So it is a valid word, but one dictionaries don't list, for the same reason they don't have separate entries for plurals.
I see nothing wrong with making up a word for an API-like thing if the invented word is clear.
I think worse words than Initializable have been invented - such as 'stringize' and 'RAII' (I know it's not a word, but it's still a term that's used often, and makes me cringe every time - even though the concept is doubleplusgood).
The problem I might have with Initializable is that it sounds like an interface that does what a constructor should be doing.
If Google gives back API references for "Initializable", it seems to me like a valid name for an interface, even though it might not be a valid English word. There's nothing wrong with using a made-up word, as long as it's descriptive.
The only thing I get confused about is classes are typically able to be initialized through their constructor. How does your interface provide functionality not available through the use of a constructor? The answer to this question may provide a more descriptive name than simply "Initializable". (i.e. in what way is it initializable?)
If Initializable most clearly describes what the interface is about, I wouldn't care about trying to find another word just so it is a valid English word. As long as it's not a UI string, the priority should be in naming clarity not validity of the word in the English language.
Yes, it's fine. Programming terms don't have to be in the dictionary. Dictionary.com also doesn't like "Serializable", and we all know that one's OK.
Yes. Don't let the lack of an existing word spoil your creativity if the meaning is clear
For those questioning the use of initialize - you may want to put constructor logic in a void method to avoid race conditions when constructing weakly coupled classes through factories.
Example Factory:
public static T CreateSingleInstance<T>(string providerName, string sectionName)
{
    // create the key
    ProviderKey key = new ProviderKey(providerName, typeof(T));
    // check key
    if (!_singletons.ContainsKey(key))
    {
        object provider = _singletons[key] = CreateInstance<T>(providerName, sectionName);
        IInitializable initializableProvider = provider as IInitializable;
        if (initializableProvider != null)
            initializableProvider.Initialize();
    }
    return (T)_singletons[key];
}
Example implementation constructors that would cause a race condition:
public class Class
{
    public Class()
    {
        Factory.CreateSingleInstance<OtherClass>(null, null);
    }
}

public class OtherClass
{
    public OtherClass()
    {
        Factory.CreateSingleInstance<Class>(null, null);
    }
}
To echo what others have said, it's valid English since English, like many languages, is formulaic and you're just applying a valid linguistic formula. It looks like there's precedent for what you're doing. The SQL Server team thinks what you're doing is valid since they came up with the same IInitializable interface (see here). It looks something like this:
public interface IInitializable {
    void Initialize(IServiceProvider serviceProvider);
}