Linspace vs range - matlab

I was wondering what is better style / more efficient:
x = linspace(-1, 1, 100);
or
x = -1:0.01:1;

As Oli Charlesworth mentioned, linspace divides the interval [a,b] into N points, whereas the : form steps out from a with a specified step size (default 1) until you reach b.
One thing to keep in mind is that linspace always includes both end points, whereas the : form only includes the second end point if the step size lands exactly on it at the last step; otherwise it falls short. Example:
0:3:10
ans =
0 3 6 9
That said, which of the two approaches I use depends on what I need to do. If all I need is to sample an interval with a fixed number of points (and I don't care about the step size), I use linspace.
In many cases I don't care whether the last point is hit, e.g., when working with polar coordinates I don't need the last point, as 2*pi is the same as 0. There I use 0:0.01:2*pi.

As always, use the one that best suits your purposes, and that best expresses your intentions. So use linspace when you know the number of points; use : when you know the spacing.
[Incidentally, your two examples are not equivalent; the second one will give you 201 points.]
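For example (a quick check of the element counts):
numel(linspace(-1, 1, 100))   % 100 points
numel(-1:0.01:1)              % 201 points with spacing 0.01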

As Oli already pointed out, it's usually easiest to use linspace when you know the number of points you want and the colon operator when you know the spacing you want between elements.
However, it should be noted that the two will often not give you exactly the same results. As noted here and here, the two approaches use slightly different methods to calculate the vector elements (here's an archived description of how the colon operator works). That's why these two vectors aren't equal:
>> a = 0:0.1:1;
>> b = linspace(0,1,11);
>> a-b
ans =
1.0e-016 *
Columns 1 through 8
0 0 0 0.5551 0 0 0 0
Columns 9 through 11
0 0 0
This is a typical side-effect of how floating-point numbers are represented. Certain numbers can't be exactly represented (like 0.1) and performing the same calculation in different ways (i.e. changing the order of mathematical operations) can lead to ever so slightly different results, as shown in the above example. These differences are usually on the order of the floating-point precision, and can often be ignored, but you should always be aware that they exist.
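A minimal illustration of the underlying representation issue (independent of linspace and the colon operator):
3*0.1 == 0.3         % false: neither 0.1 nor 0.3 is exactly representable
abs(3*0.1 - 0.3)     % about 5.6e-17, i.e. on the order of eps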

Related

Does the rand function ever produce values of 0 or 1 in MATLAB/Octave?

I'm looking for a function that will generate random values between 0 and 1, inclusive. I have generated 120,000 random values using the rand() function in Octave, but haven't once got the values 0 or 1 as output. Does rand() ever produce such values? If not, is there any other function I can use to achieve the desired result?
If you read the documentation of rand in both Octave and MATLAB, it draws from the open interval (0,1), so no, it shouldn't generate the numbers 0 or 1.
However, you can generate a set of random integers, then normalize the values so that they lie in [0,1]. For example, use something like randi (MATLAB docs, Octave docs), which generates integer values from 1 up to a given maximum. Define this maximum number, then subtract 1 and divide by the maximum minus 1 to get values in [0,1] inclusive:
max_num = 10000; %// Define maximum number
N = 1000; %// Define size of vector
out = (randi(max_num, N, 1) - 1) / (max_num - 1); %// Output
If you want this to act more like rand but including 0 and 1, make the max_num variable quite large.
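As a quick sanity check (using a deliberately small max_num so both endpoints show up in a modest sample):
vals = (randi(10, 1e6, 1) - 1) / 9;   %// values in {0, 1/9, 2/9, ..., 1}
[min(vals), max(vals)]                %// almost certainly [0 1] with this many samples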
Mathematically, if you sample from a (continuous) uniform distribution on the closed interval [0 1], values 0 and 1 (or any value, in fact) have probability strictly zero.
Programmatically,
If you have a random generator that produces values of type double on the closed interval [0 1], the probability of getting the value 0, or 1, is not zero, but it's so small it can be neglected.
If the random generator produces values from the open interval (0, 1), the probability of getting a value 0, or 1, is strictly zero.
So the probability is either strictly zero or so small it can be neglected. Therefore, you shouldn't worry about that: in either case the probability is zero for practical purposes. Even if rand were of type (1) above, and thus could produce 0 and 1, it would produce them with probability so small that you would "never" see those values.
Does that sound strange? Well, that happens with any number. You "never" see rand ever outputting exactly 1/4, either. There are so many possible outputs, all of them equally likely, that the probability of any given output is virtually zero.
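To put a rough number on "virtually zero" (assuming the default Mersenne twister generator, which, as described in the next answer, draws from the multiples of 2^(-53) inside (0,1)):
p_single = 1 / (2^53 - 1)   % probability of any one specific output, about 1.1e-16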
rand produces numbers from the open interval (0,1), which does not include 0 or 1, so you should never get those values. This was more clearly documented in previous versions, but it's still stated in the help text for rand (type help rand rather than doc rand).
However, since it produces doubles, there are only a finite number of values that it will actually produce. The precise set varies depending on the RNG algorithm used. For Mersenne twister, the default algorithm, the possible values are all multiples of 2^(-53), within the open interval (0,1). (See doc RandStream.list, and then "Choosing a Random Number Generator" for info on other generators).
Note that 2^(-53) is eps/2. Therefore, it's equivalent to drawing from the closed interval [2^(-53), 1-2^(-53)], or [eps/2, 1-eps/2].
You can scale this interval to [0,1] by subtracting eps/2 and dividing by 1-eps. (Use format hex to display enough precision to check that at the bit level).
So x = (rand-eps/2)/(1-eps) should give you values on the closed interval [0,1].
But I should give a word of caution: they've put a lot of effort into making sure that output of rand gives an appropriate distribution of any given double within (0,1), and I don't think you're going to get the same nice properties on [0,1] if you apply the scaling I suggested. My knowledge of floating-point math and RNGs isn't up to explaining why, or what you might do about that.
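For what it's worth, a small check (assuming the multiples-of-2^(-53) description above) shows that this rescaling at least maps the extreme representable outputs to exactly 0 and 1:
lo = (2^-53       - eps/2) / (1 - eps)   % smallest possible rand output -> exactly 0
hi = ((1 - 2^-53) - eps/2) / (1 - eps)   % largest possible rand output  -> exactly 1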
I just tried this:
octave:1> max(rand(10000000,1))
ans = 1.00000
octave:2> min(rand(10000000,1))
ans = 3.3788e-08
This did not give me 0 strictly, so watch out for floating point operations.
Edit
Even though I said to watch out for floating point operations, I fell for that myself. As @eigenchris pointed out:
format long g
octave:1> a=max(rand(1000000,1))
a = 0.999999711020176
It yields a floating-point number close to one, but not equal to it, as you can see after changing the display precision, as @rayryeng suggested.
Although not directly related to the question here, I find it helpful to link to the SO post "Octave - random generate number", which has a one-liner to generate 1s and 0s using r = rand > 0.5.

Dot Product: * Command vs. Loop gives different results

I have two vectors in Matlab, z and beta. Vector z is a 1x17:
1 0.430742139435890 0.257372971229541 0.0965909090909091 0.694329541928697 0 0.394960106863064 0 0.100000000000000 1 0.264704325268675 0.387774594078319 0.269207605609567 0.472226643323253 0.750000000000000 0.513121013402805 0.697062571025173
... and beta is a 17x1:
6.55269487769363e+26
0
0
-56.3867588816768
-2.21310778926413
0
57.0726052009847
0
3.47223691057151e+27
-1.00249317882651e+27
3.38202232046686
1.16425987969027
0.229504956512063
-0.314243264212449
-0.257394312588330
0.498644243389556
-0.852510642195370
I'm dealing with some singularity issues, and I noticed that if I want to compute the dot product of z*beta, I potentially get 2 different solutions. If I use the * command, z*beta = 18.5045. If I write a loop to compute the dot product (below), I get a solution of 0.7287.
summation = 0;
for i = 1:17
    addition = z(1,i)*beta(i);
    summation = summation + addition;
end
Any idea what's going on here?
Here's a link to the data: https://dl.dropboxusercontent.com/u/16594701/data.zip
The problem here is that addition of floating point numbers is not associative. When summing a sequence of numbers of comparable magnitude, this is not usually a problem. However, in your sequence, most numbers are around 1 or 10, while several entries have magnitude 10^26 or 10^27. Numerical problems are almost unavoidable in this situation.
The wikipedia page http://en.wikipedia.org/wiki/Floating_point#Accuracy_problems shows a worked example where (a + b) + c is not equal to a + (b + c), i.e. demonstrating that the order in which you add up floating point numbers does matter.
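A minimal MATLAB illustration of the same effect, with magnitudes chosen to mimic your data (the numbers themselves are made up):
a = 1e27; b = -1e27; c = 18.5;
(a + b) + c   % 18.5: the huge terms cancel first, then c is added
a + (b + c)   % 0: c is completely lost when added to b, since |b| >> |c|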
I would guess that this is a homework assignment designed to illustrate these exact issues. If not, I'd ask what the data represents to suss out the appropriate approach. It would probably be much more productive to find out why such large numbers are being produced in the first place than trying to make sense of the dot product that includes them.

Removing extreme values in a vector in Matlab?

So say, I have a = [2 7 4 9 2 4 999]
And I'd like to remove 999 from the matrix (which is an obvious outlier).
Is there a general way to remove values like this? I have a set of vectors and not all of them have extreme values like that. prctile(a,99.5) is going to output the largest number in the vector no matter how extreme (or non-extreme) it is.
There are several ways to do that, but first you must define what "extreme" means. Is it above some threshold? Above some number of standard deviations?
Or, if you know you have exactly n of these extreme events and that their values are larger than the rest, you can use sort and then delete the last n elements, etc.
For example, a(a>threshold)=[] will take care of a threshold-like definition, while a(a>mean(a)+n*std(a))=[] will discard values that are n standard deviations above the mean of a.
A completely different approach is to use the median of a. If the vector is as short as you mention, you can look at the median value and then threshold anything above some factor of it: a(a>n*median(a))=[].
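For instance, with the vector from the question (the factor n = 10 is just an illustrative choice):
a = [2 7 4 9 2 4 999];
n = 10;
a(a > n*median(a)) = []   % removes 999 and leaves [2 7 4 9 2 4]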
Lastly, one way to decide how to treat these spikes is to take a histogram of the data and work from there...
I can think of two approaches:
Sort your vector and remove n elements from the top and bottom.
Compute the mean and the standard deviation and discard all values that fall outside mean +/- (n * standard deviation).
In both cases n must be chosen by the user; a sketch of the second option follows below.
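A minimal sketch of the second option, assuming the vector a from the question and a user-chosen n:
n = 2;
a = a(abs(a - mean(a)) <= n*std(a));   % with this data, n = 2 drops 999 and keeps the rest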
Filter your signal:
% choose the moving-average window length
N = 10;
filtered = filter(ones(1,N)/N, 1, signal);
Find the noise:
noise = signal - filtered;
Remove the noisy elements:
THRESH = 50;
signal = signal(abs(noise) < THRESH);
This is better than the mean +/- n*stddev approach because it looks for local changes, so it won't fail on a slowly changing signal like [1 2 3 ... 998 999].
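A small demo of this approach on a synthetic ramp with two injected spikes (signal, spike positions, and threshold are all made up for illustration):
signal = 1:1000;                              % slowly increasing signal
signal([200 700]) = signal([200 700]) + 200;  % inject two spikes
N = 10;                                       % moving-average window
filtered = filter(ones(1,N)/N, 1, signal);
noise = signal - filtered;
cleaned = signal(abs(noise) < 50);            % only the two spike samples are dropped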

Why does crossvalind fail?

I am using the crossvalind function on a very small data set. However, I observe that it gives me incorrect results. Is this supposed to happen?
I have Matlab R2012a and here is my output
crossvalind('KFold',1:1:11,5)
ans =
2
5
1
3
2
1
5
3
5
1
5
Notice the absence of set 4. Is this a bug? I expected at least 2 elements per set, but one gets 0... and it happens a lot, i.e. the values are not uniformly distributed across the sets.
The help for crossvalind says that the form you are using is crossvalind(METHOD, GROUP, ...). In this case, GROUP is, for example, the class labels of your data. So 1:11 as the second argument is confusing here, because it suggests no two examples have the same label. I think this is sufficiently unusual that you shouldn't be surprised if the function does something strange.
I tried doing:
numel(unique(crossvalind('KFold', rand(11, 1) > 0.5, 5)))
and it reliably gave 5 as a result, which is what I would expect; my example corresponds to a two-class problem. (I would guess that, as a general rule, you'd want something like numel(unique(group)) <= numel(group) / folds.) My hypothesis is that it tries to have one example of each class in the Kth fold, and at least 2 examples in every other, with a difference between fold sizes of no more than 1 - but I haven't looked in the code to verify this.
It is possible that you mean to do:
crossvalind('KFold', 11, 5);
which would compute 5 folds for 11 data points - this doesn't attempt to do anything clever with labels, so you would be sure that there will be K folds.
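A quick sanity check of that form (assuming the Bioinformatics Toolbox crossvalind):
indices = crossvalind('KFold', 11, 5);
numel(unique(indices))   % 5: with this form every fold is non-empty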
However, in your problem, if you really have very few data points, then it is probably better to do leave-one-out cross validation, which you could do with:
crossvalind('LeaveMOut', 11, 1);
although a better method would be:
for leave_out = 1:11
    fold_number = (1:11) ~= leave_out;
    % code here: where fold_number is 0, that element is the left-out example;
    % fold_number = 1 means the example is in the main fold.
end

How to compare different distribution means with reference truth value in Matlab?

I have production (q) values from 4 different methods stored in the 4 matrices. Each of the 4 matrices contains q values from a different method as:
Matrix_1 = 1 row x 20 column
Matrix_2 = 100 rows x 20 columns
Matrix_3 = 100 rows x 20 columns
Matrix_4 = 100 rows x 20 columns
The number of columns indicates the number of years; one row contains the production values for the 20 years. The other 99 rows of matrices 2, 3 and 4 are just different realizations (simulation runs), i.e. repeat cases whose values differ slightly because of the random numbers involved.
Consider Matrix_1 as the reference truth (or base case). Now I want to compare the other 3 matrices with Matrix_1 to see which of the three (each with 100 repeats) compares best with, or most closely imitates, Matrix_1.
How can this be done in Matlab?
I know, manually, that we use confidence interval (CI) by plotting the mean of Matrix_1, and drawing each distribution of mean of Matrix_2, mean of Matrix_3 and mean of Matrix_4. The largest CI among matrix 2, 3 and 4 which contains the reference truth (or mean of Matrix_1) will be the answer.
mean of Matrix_1 = (1 row x 1 column)
mean of Matrix_2 = (100 rows x 1 column)
mean of Matrix_3 = (100 rows x 1 column)
mean of Matrix_4 = (100 rows x 1 column)
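For reference, a sketch of how those means could be computed (matrix names as above; averaging along dimension 2 collapses the 20 years of each realization):
m1 = mean(Matrix_1, 2);   % 1 x 1
m2 = mean(Matrix_2, 2);   % 100 x 1
m3 = mean(Matrix_3, 2);   % 100 x 1
m4 = mean(Matrix_4, 2);   % 100 x 1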
I hope the question is clear and relevant to SO. Otherwise please feel free to edit/suggest anything in question. Thanks!
EDIT: My three methods I talked about are a1, a2 and a3 respectively. Here's my result:
ci_a1 =
1.0e+008 *
4.084733001497999
4.097677503988565
ci_a2 =
1.0e+008 *
5.424396063219890
5.586301025525149
ci_a3 =
1.0e+008 *
2.429145282593182
2.838897116739112
p_a1 =
8.094614835195452e-130
p_a2 =
2.824626709966993e-072
p_a3 =
3.054667629953656e-012
h_a1 = 1; h_a2 = 1; h_a3 = 1
None of my CI, from the three methods, includes the mean ( = 3.454992884900722e+008) inside it. So do we still consider p-value to choose the best result?
If I understand correctly, the calculation in MATLAB is pretty straightforward.
Steps 1-2 (mean calculation):
k1_mean = mean(k1);
k2_mean = mean(k2);
k3_mean = mean(k3);
k4_mean = mean(k4);
Step 3, use HIST to plot distribution histograms:
hist([k2_mean; k3_mean; k4_mean]')
Step 4. You can do a t-test comparing your vectors 2, 3 and 4 against a normal distribution with mean k1_mean and unknown variance. See TTEST for details.
[h,p,ci] = ttest(k2_mean,k1_mean);
EDIT: I misinterpreted your question. See Yuk's answer and the following comments. My answer is what you need if you want to compare the distributions of two vectors instead of a vector against a single value. Apparently, the latter is the case here.
Regarding your t-tests, you should keep in mind that they test against a "true" mean. Given the number of values for each matrix and the confidence intervals, it's not too difficult to guess the standard deviation of your results. This is a measure of the "spread" of your results. The standard error of your mean is the standard deviation of your results divided by the square root of the number of observations, and the confidence interval is obtained by multiplying that standard error by approximately 2.
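As an illustration for one of the methods (k2_mean is assumed to be the 100-by-1 vector of per-realization means):
se = std(k2_mean) / sqrt(numel(k2_mean));   % standard error of the mean
ci = mean(k2_mean) + [-1 1] * 1.96 * se;    % approximate 95% confidence interval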
This confidence interval contains the true mean in 95% of the cases. So if the true mean is exactly at the border of that interval, the p-value is 0.05; the further away the mean, the lower the p-value. The p-value can be interpreted as the chance that the values you have in matrix 2, 3 or 4 come from a population with a mean as in matrix 1. Looking at your p-values, these chances can be said to be non-existent.
So you see that when the number of values gets high, the confidence interval becomes smaller and the t-test becomes very sensitive. What this tells you is nothing more than that the three matrices differ significantly from the reference mean. If you have to choose one, I'd take a look at the distributions anyway; otherwise the one with the closest mean seems a good guess. If you want to get deeper into this, you could also ask on stats.stackexchange.com.
Your question and your method aren't really clear:
Is the distribution equal in all columns? This is important, as two distributions can have the same mean but still differ significantly.
Is there a reason why you don't use the Central Limit Theorem? This seems to me like a very complex way of obtaining a result that can easily be found using the fact that the distribution of a mean approaches a normal distribution with sd(mean) = sd(observations)/sqrt(number of observations). That saves you quite some work - if the distributions are alike!
Now if the question is really the comparison of distributions, you should consider looking at a qqplot for a general idea, and at a two-sample Kolmogorov-Smirnov test for formal testing. But please read up on this test, as you have to understand what it does in order to interpret the results correctly.
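For instance (assuming the Statistics Toolbox and the matrix names from the question):
qqplot(Matrix_1(:), Matrix_2(:));             % visual comparison of the two distributions
[h, p] = kstest2(Matrix_1(:), Matrix_2(:));   % two-sample Kolmogorov-Smirnov test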
On a side note: if you do this test on multiple cases, make sure you understand the problem of multiple comparisons and use the appropriate correction, e.g. Bonferroni or Dunn-Sidak.