Is input validation necessary? - matlab

This is a very naive question about input validation in general.
I learned about input validation techniques such as parse and validatestring. In fact, MATLAB's built-in functions are full of these validations and parsers, so I naturally assumed this is the professional way to develop code. With these techniques you can be sure of the data format of the input variables; otherwise, your code rejects the inputs and returns an error.
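For concreteness, here is a minimal sketch of the kind of validation I mean (the function name and its checks are made up for illustration):

function out = scaledCopy(x, k)
% scaledCopy  Illustrative only: reject bad inputs up front instead of
% failing somewhere deeper in the computation.
validateattributes(x, {'numeric'}, {'2d', 'nonempty'}, mfilename, 'x');
validateattributes(k, {'numeric'}, {'scalar', 'real', 'positive'}, mfilename, 'k');
out = k * x;
end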
However, some people argue that if there is a problem with an input variable, the code will throw an error and stop anyway. You'll notice the problem regardless, so what's the point of those complicated validations? Given that validation code takes effort and time to write, often with fairly complicated flow control, I have to admit this opinion has a point. With massive input validation, the readability of the code may suffer.
I would like to hear opinions from advanced users on this issue.

Here is my experience; I hope it matches best practice.
First of all, let me mention that I typically work in situations where I have full control and won't build my own UI as @tom mentioned. In general, if there is at any point a large probability that your program gets junk inputs, it will be worth checking for them.
Some tradeoffs that I typically make to decide whether I should check my inputs:
Development time vs debug time
If erroneous inputs are hard to debug (for example because they don't cause errors but just undesirable outcomes), the balance will typically be in favor of checking; otherwise not.
If you are not sure where you will end up (re)using the code, it may help to enforce any assumptions that are required on the input.
Development time vs runtime experience
If your code takes an hour to run and will break at the end when an invalid input value occurs, you will want to check for this at the beginning of the code.
If the code runs into an error while opening a file, the user may not immediately understand why; an explicit message that no valid filename was specified is easier to deal with (see the sketch below).
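As a rough illustration of that last point (the function name and checks here are hypothetical):

function result = processLogFile(fname)
% processLogFile  Hypothetical example: fail fast with a clear message
% before the expensive part of the computation starts.
if ~(ischar(fname) || isstring(fname))
    error('processLogFile:badInput', 'fname must be a character vector or string.');
end
if ~isfile(fname)
    error('processLogFile:missingFile', 'No such file: %s', fname);
end
% ... the long-running analysis would go here ...
result = dir(fname);
end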

The really (really) short story:
Break your design down into user interface, business logic and data layers (see the MVC pattern)
In your UI layer, do "common sense" validation, e.g. if the input is a $ cost value then it should be >= 0 and parseable as a decimal, etc.
In your business logic layer, validate the value, e.g. the $ cost value might not be allowed to be greater than the profit margin (etc.)
In your data layer, validate the data operation, e.g. that the insert operation succeeded
The extra really short story: YES! Validate all inputs.
For extra reading credits see: this!

Related

How does a fuzzer deal with invalid inputs?

Suppose that I have a program that takes a pointer as its input. Without prior knowledge about the structure of the pointee, how does a fuzzer create valid inputs that actually hit the internals of the program? To make this more concrete, imagine an artificial C program
int myprogram(unknow_pointer* input) {
    printf("%s", input->name);
}
In some situations, the tested program first checks the input format. If the input format is not good, it raises an exception. In such situations, how can a fuzzer reach program points beyond that exception-raising statement?
Most fuzzers don't know anything about the internal structure of the program. Different fuzzers deal with this in various ways:
Not dealing with it at all: just throw random inputs and hope to produce one that passes some or all of the checks (for example, Radamsa).
Mutating a valid input: take a known valid input and mutate it (flip bits, remove parts, add parts, etc.); in many cases the result is valid enough to pass some or all of the checks. For example, to fuzz VLC you would take a valid movie file as the input for the fuzzer, which then provides mutations of it to VLC. These are often called mutation-based fuzzers (for example, zzuf); a minimal mutator of this kind is sketched after this list.
If you have prior knowledge of the input's structure, build a model of the input and then mutate specific fields within it. A big advantage of this method is the ability to deal with very specific kinds of fields: checksums, hashes, sizes, etc. These are often called generation-based fuzzers (for example, SPIKE, Sulley and their successors, Peach).
However, in recent years a new kind of fuzzer has evolved: feedback-based fuzzers. These perform mutations on a valid (or invalid) input and, based on the feedback they receive from the fuzzed program, decide how and what to mutate next. The feedback is obtained by instrumenting the program's execution: injecting tracing code at compile time, patching the tracing code into the program at runtime, or using hardware tracing mechanisms. The best known of these is AFL (you can read more about it here).
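To make the mutation-based approach concrete, here is a minimal, hypothetical bit-flip mutator in C (the sample buffer and flip count are made up; real fuzzers are far more sophisticated):

#include <stdio.h>
#include <stdlib.h>

/* Flip a few random bits in a known-good input so that most of its
 * structure (magic bytes, headers) survives the target's format checks. */
static void mutate(unsigned char *buf, size_t len, int flips)
{
    for (int i = 0; i < flips; i++) {
        size_t byte = (size_t)rand() % len;
        buf[byte] ^= (unsigned char)(1u << (rand() % 8));
    }
}

int main(void)
{
    unsigned char sample[] = "RIFF....WAVEfmt ";      /* pretend valid input */
    mutate(sample, sizeof(sample) - 1, 3);
    fwrite(sample, 1, sizeof(sample) - 1, stdout);    /* feed to the target */
    return 0;
}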
A fuzzer throws every sort of random combination of inputs at the attack surface. The intention is to look for any opportunity for a "golden BB" to get past the input checks and get a response that can be further explored.

Which is better in PHP: suppress warnings with '@' or run extra checks with isset()?

For example, if I implement some simple object caching, which method is faster?
1. return isset($cache[$cls]) ? $cache[$cls] : $cache[$cls] = new $cls;
2. return @$cache[$cls] ?: $cache[$cls] = new $cls;
I read somewhere @ takes significant time to execute (and I wonder why), especially when warnings/notices are actually being issued and suppressed. isset() on the other hand means an extra hash lookup. So which is better and why?
I do want to keep E_NOTICE on globally, both on dev and production servers.
I wouldn't worry about which method is FASTER. That is a micro-optimization. I would worry more about which is more readable code and better coding practice.
I would certainly prefer your first option over the second, as your intent is much clearer. Also, it's best to avoid edge-condition problems by always explicitly testing variables to make sure you are getting what you expect. For example, what if the class stored in $cache[$cls] is not of type $cls?
Personally, if I typically would not expect the index on $cache to be unset, then I would also put error handling in there rather than using ternary operations. If I could reasonably expect that that index would be unset on a regular basis, then I would make class $cls behave as a singleton and have your code be something like
return $cls::get_instance();
The isset() approach is better. It is code that explicitly states the index may be undefined. Suppressing the error is sloppy coding.
According to the article 10 Performance Tips to Speed Up PHP, warnings take additional execution time, and it also claims the @ operator is "expensive."
Cleaning up warnings and errors beforehand can also keep you from
using @ error suppression, which is expensive.
Additionally, @ will not suppress the errors with respect to custom error handlers:
http://www.php.net/manual/en/language.operators.errorcontrol.php
If you have set a custom error handler function with
set_error_handler() then it will still get called, but this custom
error handler can (and should) call error_reporting() which will
return 0 when the call that triggered the error was preceded by an @.
If the track_errors feature is enabled, any error message generated by
the expression will be saved in the variable $php_errormsg. This
variable will be overwritten on each error, so check early if you want
to use it.
@ temporarily changes the error_reporting state; that's why it is said to take time.
If you expect a certain value, the first thing to do to validate it is to check that it is defined. If you have notices, it's probably because you're missing something. Using isset() is, in my opinion, a good practice.
I ran timing tests for both cases, using hash keys of various lengths, also using various hit/miss ratios for the hash table, plus with and without E_NOTICE.
The results were: with error_reporting(E_ALL) the isset() variant was faster than the # by some 20-30%. Platform used: command line PHP 5.4.7 on OS X 10.8.
However, with error_reporting(E_ALL & ~E_NOTICE) the difference was within 1-2% for short hash keys, and up to 10% for longer ones (16 chars).
Note that the first variant executes two hash-table lookups, whereas the variant with @ does only one lookup.
Thus, @ is inferior in all scenarios and I wonder if there are any plans to optimize it.
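For reference, here is a rough sketch of the kind of micro-benchmark described above (key lengths and loop counts are arbitrary; absolute numbers will vary with PHP version and platform):

<?php
error_reporting(E_ALL);

$keys = [];
for ($i = 0; $i < 1000; $i++) {
    $keys[] = 'key' . $i;
}

// Variant 1: explicit isset() check (two lookups on a hit).
$cache = [];
$start = microtime(true);
for ($n = 0; $n < 100000; $n++) {
    $k = $keys[$n % 1000];
    $v = isset($cache[$k]) ? $cache[$k] : ($cache[$k] = 1);
}
$issetTime = microtime(true) - $start;

// Variant 2: @ suppression (one lookup, but notices are raised and silenced on misses).
$cache = [];
$start = microtime(true);
for ($n = 0; $n < 100000; $n++) {
    $k = $keys[$n % 1000];
    $v = @$cache[$k] ?: ($cache[$k] = 1);
}
$suppressTime = microtime(true) - $start;

printf("isset: %.4fs  @: %.4fs\n", $issetTime, $suppressTime);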
I think you have your priorities a little mixed up here.
First of all, if you want a real-world test of which is faster, load-test them. As stated, though, suppressing will probably be slower.
The problem here is that if you have performance issues with regular code, you should be upgrading your hardware or optimizing the overall logic of your code rather than cutting corners on proper execution and error checking.
Suppressing errors to steal the tiniest fraction of a speed gain won't do you any favours in the long run, especially if this error may keep happening time and time again and cause your app to run more slowly than if it were caught and fixed.

Are there performance reasons against goto? [closed]

Does GOTO affect performance when it is executed and run on the device?
Is it good practice to use GOTO in Objective-C, or is it bad practice?
And when is it a good choice to use a GOTO statement?
Thanks.
A goto is simply a jump, so its effect on performance is practically zero. It's a bad practice because it harms code readability; you can mostly do without it. Some of the cases where it makes sense to use goto are described in previous questions; just search for goto.
Using a goto statement is usually bad practice, especially in an object-oriented language (where you can easily achieve the same purpose in an OO way), not from a performance point of view but from a code-readability point of view...
There is nothing inherently BAD or GOOD about it; it depends on your requirements.
If you have the same code that you want to execute repeatedly, like a loop, you can use goto. Here is a small example that I think will clear up your doubt.
Declare a label name (here hello is the label); then you can jump to it using a goto statement like this:
hello:
    NSLog(@"Print hello!");
    goto hello;
This would print 'Print hello!' again and again.
It does not affect performance; the concern is good structure and readability, which are important features of professional programming. Sometimes, though, using goto may help to ease complexity in cases where loops are nested too deeply but you want to jump out when a certain condition is triggered (see the sketch below). Even so, it can also be avoided in other ways.
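Here is a minimal sketch of that nested-loop case (the matrix and target value are made up); a plain break would only leave the inner loop:

#include <stdio.h>

int main(void)
{
    int matrix[3][3] = { {1, 2, 3}, {4, 5, 6}, {7, 8, 9} };
    int target = 5, foundRow = -1, foundCol = -1;

    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 3; j++) {
            if (matrix[i][j] == target) {
                foundRow = i;
                foundCol = j;
                goto done;   /* jump out of both loops at once */
            }
        }
    }
done:
    if (foundRow >= 0)
        printf("found at (%d, %d)\n", foundRow, foundCol);
    return 0;
}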
In principle goto can affect performance simply by being present in the function.
The performance difference will almost always be unnoticeable, and there are a lot of things other than goto than can slightly perturb the optimizer and affect performance. But if you're interested you could examine the emitted code for differences.
It's a basic requirement in the emitted code that the same registers must be used for the same things at the source and target of the goto[*]. This constrains register allocation when the compiler optimizes the code. Such constraints may have no effect at all, or they may slow things down or cause additional code to be emitted. If they speed things up, it can only be by accident, because the compiler's heuristics were in effect incorrect when applied to the unconstrained version.
The effect might be more pronounced for a computed goto (a GNU extension), where you can store a label in a variable and goto the variable. In that case, every possible target has to share a register state with every possible source.
What doesn't (normally) make a difference to performance is goto the start or end of a block vs. the equivalent break or continue or else. It's all the same to the optimizer: the compiler breaks your code down into so-called "basic blocks" with jumps and conditional jumps between them. It doesn't normally care whether the reason for the jump is a goto or not, and it has to get the register states right no matter which. This is why almost any programming construct can be described as "goto in disguise" by someone who's only thinking about the emitted instructions.
[*] to be more precise -- there could be an implicit zap at a goto, meaning that some register is used for one thing at the source and isn't used at all at the target. But you can't have some register that the target expects to contain a particular value (like the current value of a variable) and the source doesn't. So if that was the case before and then you add the goto, either the target needs to stop expecting it, or the source needs to put it there. Typically either one is going to require extra code to shuffle values between registers and stack.

Implementing snapshot in FRP

I'm implementing an FRP framework in Scala and I seem to have run into a problem. Motivated by some thinking and this question, I decided to restrict the public interface of my framework so that Behaviours can only be evaluated in the 'present', i.e.:
behaviour.at(now)
This also falls in line with Conal's assumption in the Fran paper that Behaviours are only ever evaluated/sampled at increasing times. It does restrict transformations on Behaviours, but otherwise we run into huge problems with Behaviours that represent some input:
val slider = Stepper(0, sliderChangeEvent)
With this Behaviour, evaluating future values would be incorrect and evaluating past values would require an unbounded amount of memory (all occurrences used in the 'slider' event would have to be stored).
I am having trouble with the specification for the 'snapshot' operation on Behaviours given this restriction. My problem is best explained with an example (using the slider mentioned above):
val event = mouseB // an event that occurs when the mouse is pressed
val sampler = slider.snapshot(event)
val stepper = Stepper(0, sampler)
My problem here is that if the 'mouseB' Event has occurred when this code is executed then the current value of 'stepper' will be the last 'sample' of 'slider' (the value at the time the last occurrence occurred). If the time of the last occurrence is in the past then we will consequently end up evaluating 'slider' using a past time which breaks the rule set above (and your original assumption). I can see a couple of ways to solve this:
We 'record' the past (keep hold of all past occurrences in an Event) allowing evaluation of Behaviours with past times (using an unbounded amount of memory)
We modify 'snapshot' to take a time argument ("sample after this time") and enforce that that time >= now
In a more wacky move, we could restrict creation of FRP objects to the initial setup of a program somehow and only start processing events/input after this setup is complete
I could also simply not implement 'sample' or remove 'stepper'/'switcher' (but I don't really want to do either of these things). Has anyone any thoughts on this? Have I misunderstood anything here?
Oh I see what you mean now.
Your "you can only sample at 'now'" restriction isn't tight enough, I think. It needs to be a bit stronger to avoid looking into the past. Since you are using an environmental conception of now, I would define the behavior construction functions in terms of it (so long as now cannot advance by the mere execution of definitions, which, per my last answer, would get messy). For example:
Stepper(i, e) is a behavior with the value i in the interval [now, e1] (where e1 is the time of the first occurrence of e after now), and the value of the most recent occurrence of e afterward.
With this semantics, your prediction about the value of stepper that got you into this conundrum is dismantled, and the stepper will now have the value 0. I don't know whether this semantics is desirable to you, but it seems natural enough to me.
From what I can tell, you are worried about a race condition: what happens if an event occurs while the code is executing.
Purely functional code does not like to have to know that it gets executed. Functional techniques are at their finest in the pure setting, such that it does not matter in what order code is executed. A way out of this dilemma is to pretend that every change happened in one sensitive (internal, probably) piece of imperative code; pretend that any functional declarations in the FRP framework happen in 0 time so it is impossible for something to change during their declaration.
Nobody should ever sleep, or really do anything time sensitive, in a section of code that is declaring behaviors and things. Essentially, code that works with FRP objects ought to be pure, then you don't have any problems.
This does not necessarily preclude running it on multiple threads, but to support that you might need to reorganize your internal representations. Welcome to the world of FRP library implementation -- I suspect your internal representation will fluctuate many times during this process. :-)
I'm confused about your confusion. The way I see it is that Stepper will "set" the behavior to a new value whenever the event occurs. So, what happens is the following:
The instant in which the event mouseB occurs, the value of the slider behavior will be read (snapshot). This value will be "set" into the behavior stepper.
So, it is true that the Stepper will "remember" values from the past; the point is that it only remembers the latest value from the past, not everything.
Semantically, it is best to model Stepper as a function like luqui proposes.
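To make that concrete, here is a minimal, hypothetical sketch in Scala (not the asker's actual API) of a Stepper that keeps only the latest occurrence:

// Only the most recent occurrence is stored, so sampling at 'now' never
// needs the full event history.
class Event[A] {
  private var listeners: List[A => Unit] = Nil
  def subscribe(f: A => Unit): Unit = listeners = f :: listeners
  def fire(a: A): Unit = listeners.foreach(_(a))
}

class Stepper[A](initial: A, e: Event[A]) {
  private var current: A = initial   // latest value only, not the whole past
  e.subscribe(a => current = a)
  def at(now: Long): A = current     // sampling is only meaningful at 'now'
}

With this shape, a snapshot only ever reads the stored current value rather than re-evaluating the behaviour at a past time.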

Looking for the best equivalents of prefetch instructions for ia32, ia64, amd64, and powerpc

I'm looking at some slightly confused code that's attempted a platform abstraction of prefetch instructions, using various compiler builtins. It appears to be based on powerpc semantics initially, with Read and Write prefetch variations using dcbt and dcbtst respectively (both of these passing TH=0 in the new optional stream opcode).
On ia64 platforms we've got for read:
__lfetch(__lfhint_nt1, pTouch)
whereas for write:
__lfetch_excl(__lfhint_nt1, pTouch)
This (read vs. write prefetching) appears to match the powerpc semantics fairly well (with the exception that ia64 allows for a temporal hint).
Somewhat curiously the ia32/amd64 code in question is using
prefetchnta
Not
prefetchnt1
as it would if that code were to be consistent with the ia64 implementations (#ifdef variations of that in our code for our (still live) hpipf port and our now dead windows and linux ia64 ports).
Since we are building with the Intel compiler, I should be able to make many of our ia32/amd64 platforms consistent by switching to the xmmintrin.h builtins:
_mm_prefetch( (char *)pTouch, _MM_HINT_NTA )
_mm_prefetch( (char *)pTouch, _MM_HINT_T1 )
... provided I can figure out what temporal hint should be used.
Questions:
Are there read vs. write ia32/amd64 prefetch instructions? I don't see any in the instruction set reference.
Would one of the nt1, nt2, nta temporal variations be preferred for read vs. write prefetching?
Any idea if there would have been a good reason to use the NTA temporal hint on ia32/amd64, yet T1 on ia64?
Are there read vs. write ia32/amd64 prefetch instructions? I don't see any in the instruction set reference.
Some systems support the prefetchw instruction for writes.
Would one of the nt1, nt2, nta temporal variations be preferred for read vs. write prefetching?
If the line is used exclusively by the calling thread, it shouldn't matter how you bring the line in; both reads and writes would be able to use it. The benefit of prefetchw mentioned above is that it will bring the line in and give you ownership of it, which may take a while if the line was also being used by another core. The hint level, on the other hand, is orthogonal to the MESI states and only affects how long the prefetched line survives. This matters if you prefetch long before the actual access and don't want the prefetch to get lost in that interval, or, alternatively, if you prefetch right before the access and don't want the prefetches to thrash your cache too much.
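As a rough illustration (the function, prefetch distance and hint values are made up), the GCC/Clang __builtin_prefetch builtin exposes the read-vs-write distinction portably; on x86 the write form may lower to prefetchw when the target supports it, otherwise the compiler falls back to a read prefetch:

#include <stddef.h>

/* The second argument selects read (0) vs. write (1) prefetch; the third is
 * the temporal-locality hint, 0 (non-temporal) through 3 (keep in all levels). */
static void scale_in_place(double *a, size_t n, double k)
{
    const size_t ahead = 16;                            /* tune per platform */
    for (size_t i = 0; i < n; i++) {
        if (i + ahead < n)
            __builtin_prefetch(&a[i + ahead], 1, 1);    /* write, low locality */
        a[i] *= k;
    }
}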
Any idea if there would have been a good reason to use the NTA temporal hint on ia32/amd64, yet T1 on ia64?
Just speculating: perhaps the larger caches and aggressive memory bandwidth are more vulnerable to bad prefetching, and you'd want to reduce the impact through the non-temporal hint. If your prefetcher were suddenly set loose to fetch anything it could, you'd end up swamped in junk prefetches that throw away lots of useful cache lines. The NTA hint makes them overrun each other, leaving the rest undamaged.
Of course, this may also just be a bug; I can't tell for sure, only whoever developed the compiler can, but it might make sense for the reason above.
The best resource I could find on x86 prefetching hint types was the good ol' article What Every Programmer Should Know About Memory.
For the most part, on x86 there aren't different instructions for read and write prefetches. The exceptions seem to be the non-temporal (aligned) ones, where a write can bypass the cache but, as far as I can tell, a read will always get cached.
It's going to be hard to backtrack through why the earlier code owners used one hint and not the other on a certain architecture. They could be making assumptions about how much cache is available on processors in that family, typical working set sizes for binaries there, long term control flow patterns, etc... and there's no telling how much any of those assumptions were backed up with good reasoning or data. From the limited background here I think you'd be justified in taking the approach that makes the most sense for the platform you're developing on now, regardless what was done on other platforms. This is especially true when you consider articles like this one, which is not the only context where I've heard that it's really, really hard to get any performance gain at all with software prefetches.
Are there any more details known up front, like typical cache miss ratios when using this code, or how much prefetches are expected to help?