Are the conditional probabilities of MATLAB's mnrval correct?

An MWE (stats toolbox required, tested on MATLAB R2014b):
x = (1:3)';
b = mnrfit(x,x,'model','hierarchical');
pihat = mnrval(b,x,'model','hierarchical','type','conditional')
Output:
pihat =
1 1
2.2204e-16 1
2.2204e-16 2.2204e-16
(Ignore the warning that is issued; it's due to the trivial example, which is linearly separable (I'm predicting x using itself). It doesn't matter: I've also tried a non-trivial (and not-so-minimal) example that produces no warning, and the results are similar.)
My problem is the result. I've specified I want the conditional probabilities. According to MATLAB's documentation on mnrval:
Specify ['conditional'] to return predictions [...] in terms of the first k – 1 conditional category probabilities [...], i.e., the probability [...] for category j, given an outcome in category j or higher.
In my example this means rows of pihat contain the probability of
x=1 given x>=1
x=2 given x>=2
(A third column for x=3 is not necessary, because if the first two probabilities are known, the third is too. It follows logically from P(x=1) + P(x=2) + P(x=3) = 1.)
Am I interpreting this correctly? Thus, if x=1 is predicted, then the first column value should be large (close to one), because P(x=1) given x>=1 is large. The second column should be close to zero, because P(x=2) given x>=2 can't be large if x=1.
However, as you can see in the first row, the second column value is large as well as the first! I believe this is incorrect according to what the documentation specifies, am I right? The current (incorrect?) result implies the predicted probabilities in the rows are not of x=j given x>=j, but what are they then? Or how should I be interpreting them?
They are not equal to the cumulative probabilities, i.e. the probability of x<=j, which increases with j. I've checked this by calculating pihat2 = mnrval(b,x,'model','hierarchical','type','cumulative'); pihat2-pihat.
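For reference, this is how I understand the relationship between the conditional probabilities and the plain category probabilities for the hierarchical model (a minimal sketch based on my reading of the documentation, using pihat from above; the result should match mnrval(...,'type','category') if my interpretation is right):
p1 = pihat(:,1);                           % P(x=1) = P(x=1 | x>=1)
p2 = pihat(:,2) .* (1 - pihat(:,1));       % P(x=2) = P(x=2 | x>=2) * P(x>=2)
p3 = (1 - pihat(:,2)) .* (1 - pihat(:,1)); % P(x=3) = remaining probability
category = [p1 p2 p3];                     % rows should sum to 1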

Related

what is the difference between defining a vector using linspace and defining a vector using steps?

I am trying to learn the basics of MATLAB.
I wanted to write a MATLAB script in which I define a vector x with a step "d" of size (2*pi/1000),
and plot two sine functions of x:
the first sine with a frequency of 1, and the second with a frequency of 10.3.
This is what I did:
d=(2*pi/1000);
x=-pi:d:pi;
first=sin(x);
second=sin(10.3*x);
plot(x,first,x,second);
My question:
What is the difference between:
x=linspace(-pi,pi,1000);
and
d=(2*pi/1000);
x=-pi:d:pi;
? I am asking because I got confused: I think they are both the same, but something seems wrong with my assumption.
Also, is there a more efficient way to write a sine function with a given frequency?
The main difference can be summarized as predefined size vs predefined step, and your example highlights it very well indeed (1000 elements vs 1001 elements).
The linspace function produces a fixed-length vector (the length being defined by the third input argument, which defaults to 100) whose lower and upper limits are set, respectively, by the first and the second input arguments. The correct step to use is internally computed by the function itself (step = (x2 - x1) / (n - 1)).
The colon operator defines a vector of elements whose values range between the specified lower and upper limits. The step, which is an optional parameter that defaults to 1, is what determines the vector length: the length of the result is determined by the number of steps that can be taken from the lower limit without exceeding the upper one. On a side note, on this MathWorks thread you can find a very interesting discussion concerning the behavior of the colon operator with respect to floating-point handling.
Another difference, related to the first one, is that linspace always includes the upper limit value while the colon operator only contains it if the specified step allows it (0:5:14 = [0 5 10]).
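A quick check with the values from the question (a sketch; the comments describe what you should see) makes both differences visible:
d  = 2*pi/1000;
x1 = linspace(-pi, pi, 1000);   % predefined size: exactly 1000 elements
x2 = -pi:d:pi;                  % predefined step: 1001 elements here
numel(x1)                       % 1000
numel(x2)                       % 1001
x1(end)                         % pi: the upper limit is always included
x2(end)                         % ~pi, but only because the step happens to divide the range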
As a general rule, I prefer to use the former when I want to produce a vector of a predefined length (pretty obvious, isn't it?), and the latter when I need to create a sequence whose length is only of marginal relevance (or of no relevance at all).

How to calculate the "rest value" of a plot?

Didn't know how to paraphrase the question well.
Function for example:
Data: https://www.dropbox.com/s/wr61qyhhf6ujvny/data.mat?dl=0
In this case how do I calculate that the rest point of this function is ~1? I have access to the vector that makes the plot.
I guess the mean is an approximation but in some cases it can be pretty bad.
Under the assumption that the "rest" point is the steady-state value of your data, and that the steady-state value occurs the majority of the time in your data, you can simply bin all of the points and use each unique value as a separate bin. The bin with the highest count should correspond to the steady-state value.
You can do this by a combination of histc and unique. Assuming your data is stored in y, do this:
%// Find all unique values in your data
bins = unique(y);
%// Find the total number of occurrences per unique value
counts = histc(y, bins);
%// Figure out which bin has the largest count
[~,max_bin] = max(counts);
%// Figure out the corresponding y value
ss_value = bins(max_bin);
ss_value contains the steady-state value of your data, corresponding to the most occurring output point with the assumptions I laid out above.
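As a side note, if I'm not mistaken, mode collapses the three steps above into a single call, since it returns the most frequently occurring value:
%// Most frequently occurring value in y (ties resolved towards the smallest value)
ss_value = mode(y);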
A minor caveat with the above approach is that it is not friendly to floating-point data, where values that are essentially the same differ in the decimals beyond the first few significant digits, so each one ends up in its own bin.
Here's an example of your data from point 2300 to 2320:
>> format long g;
>> y(2300:2320)
ans =
0.99995724232555
0.999957488454868
0.999957733165346
0.999957976465197
0.999958218362579
0.999958458865564
0.999958697982251
0.999958935720613
0.999959172088623
0.999959407094224
0.999959640745246
0.999959873049548
0.999960104014889
0.999960333649014
0.999960561959611
0.999960788954326
0.99996101464076
0.999961239026462
0.999961462118947
0.999961683925704
0.999961904454139
Therefore, what I'd recommend is to round so that only the first 5 or so decimal places are maintained.
You can do this to your dataset before you continue:
num_digits = 5;
y_round = round(y*(10^num_digits))/(10^num_digits);
This will first multiply by 10^n, where n is the number of digits you desire, so that the decimal point is shifted over by n positions. We round this result, then divide by 10^n to bring it back to the scale it was at before. If you do this, points that agree in their first n decimal places all get rounded to the same value, so they land in the same bin and the counting above behaves better.
However, more recent versions of MATLAB have this functionality already built-in to round, and you can just do this:
num_digits = 5;
y_round = round(y,num_digits);
Minor Note
More recent versions of MATLAB discourage the use of histc and recommend histcounts instead. Note, however, that histcounts works with bin edges rather than bin centers, and returns one count per bin (i.e. one fewer than the number of edges), so it is not quite a drop-in replacement here.
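A rough equivalent with histcounts (a sketch on my part, assuming y is a column vector as above) is to append one extra edge so that each unique value gets its own bin:
%// Per-unique-value counts with histcounts instead of histc
bins = unique(y);                            %// unique values (sorted column vector)
counts = histcounts(y, [bins; bins(end)+1]); %// one extra edge so the last value gets its own bin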
Following the same logic, you could also use the median. If the majority of the data fluctuates around 1, then there is a high probability that the median picks out the steady-state value... so try this too:
ss_value = median(y_round);

What is this code doing? Machine Learning

I'm just learning matlab and I have a snippet of code which I don't understand the syntax of. The x is an n x 1 vector.
Code is below
p = (min(x):(max(x)/300):max(x))';
The p vector is used a few lines later to plot the function
plot(p,pp*model,'r');
It generates an arithmetic progression.
An arithmetic progression is a sequence of numbers in which each number equals the previous number plus a constant; within a given progression, this constant stays the same.
In your code,
min(x) is the initial value of the sequence
max(x) / 300 is the increment amount
max(x) is the stopping criterion. When an increment would take the sequence past this stopping criterion, no more items are generated.
I cannot comment on this particular choice of initial value and increment amount, without seeing the surrounding code where it was used.
However, from a naive perspective, MATLAB has a linspace command which does something similar, but not exactly the same.
Certainly looks to me like an odd thing to be doing. Basically, it's creating a vector of values p that range from the smallest to the largest values of x, which is fine, but the step between successive values is max(x)/300.
If min(x)=300 and max(x)=300.5 then this would only give 1 point for p.
On the other hand, if min(x)=-1000 and max(x)=0.3 then p would have thousands of elements.
In fact, it's even worse. If max(x) is negative, the step is negative too, so the colon expression tries to count down from min(x) and stops immediately: p ends up empty (MATLAB's colon operator returns an empty vector rather than an error here), so the later plot will either show nothing or fail with a size mismatch.
I think p must be used to create pp or model somehow as well so that the plot works, and without knowing how I can't suggest how to fix this, but I can't think of a good reason why it would be done like this. using linspace(min(x),max(x),300) or setting the step to (max(x)-min(x))/299 would make more sense to me.
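For illustration, either of the alternatives I mention would look like this (a sketch, not necessarily what the original author intended):
% 300 evenly spaced points from min(x) to max(x), as a column vector
p = linspace(min(x), max(x), 300)';
% or with an explicit step (the endpoint can be lost to floating-point rounding)
p = (min(x):(max(x) - min(x))/299:max(x))';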
This code examines an array named x, and finds its minimum value min(x) and its maximum value max(x). It takes the maximum value and divides it by the constant 300.
The code doesn't give max(x)/300 an explicit name, but for the sake of explanation I'm naming it "incr", short for increment.
It then creates a vector named p, which looks something like this:
p = [min(x), min(x) + incr, min(x) + 2*incr, ...]';
where the sequence stops at the largest value that does not exceed max(x) (which is not necessarily max(x) itself, so the number of elements depends on both min(x) and max(x)).

MATLAB/General CS: Sampling Without Replacement From Multiple Sets (+Keeping Track of Unsampled Cases)

I am currently implementing an optimization algorithm that requires me to sample without replacement from several sets. Although I am coding in MATLAB, this is essentially a CS question.
The situation is as follows:
I have a finite number of sets (A, B, C) each with a finite but possibly different number of elements (a1,a2...a8, b1,b2...b10, c1, c2...c25). I also have a vector of probabilities for each set which lists a probability for each element in that set (i.e. for set A, P_A = [p_a1 p_a2... p_a8] where sum(P_A) = 1). I normally use these to create a probability generating function for each set, which given a uniform number between 0 to 1, can spit out one of the elements from that set (i.e. a function P_A(u), which given u = 0.25, will select a2).
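(For concreteness, this is roughly how such a generating function can be implemented for one set; a sketch with made-up probabilities, not my actual code:)
P_A = [0.1 0.2 0.05 0.15 0.1 0.1 0.2 0.1];   % p_a1 ... p_a8, summing to 1
u = rand;                                    % uniform number between 0 and 1
idx = find(u <= cumsum(P_A), 1);             % inverse-CDF lookup: index of the sampled element
% e.g. u = 0.25 falls in the second interval, so element a2 is selected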
I am looking to sample without replacement from the sets A, B, and C. Each "full sample" is a sequence of elements from each of the different sets i.e. (a1, b3, c2). Note that the space of full samples is the set of all permutations of the elements in A, B, and C. In the example above, this space is (a1,a2...a8) x (b1,b2...b10) x (c1, c2...c25) and there are 8*10*25 = 2000 unique "full samples" in my space.
The annoying part of sampling without replacement with this setup is that if my first sample is (a1, b3, c2) then that does not mean I cannot sample the element a1 again - it just means that I cannot sample the full sequence (a1, b3, c2) again. Another annoying part is that the algorithm I am working with requires me do a function evaluation for all permutations of elements that I have not sampled.
The best method at my disposal right now is to keep track of the sampled cases. This is a little inefficient since my sampler is forced to reject any case that has been sampled before (since I'm sampling without replacement). I then do the function evaluations for the unsampled cases, by going through each permutation (ax, by, cz) using nested for loops and only doing the function evaluation if that combination of (ax, by, cz) is not included in the sampled cases. Again, this is a little inefficient since I have to "check" whether each permutation (ax, by, cz) has already been sampled.
I would appreciate any advice in regards to this problem. In particular, I am looking for a method to sample without replacement and keep track of unsampled cases that does not explicitly list out the full sample space (I usually work with 10 sets with 10 elements each, so listing out the full sample space would require a 10^10 x 10 matrix). I realize that this may be impossible, though finding an efficient way to do it will allow me to demonstrate the true limits of the algorithm.
Do you really need to keep track of all of the unsampled cases? Even if you had a 1-by-10^10 vector that stored a logical value of true or false indicating if that permutation had been sampled or not, that would still require about 10 GB of storage, and MATLAB is likely to either throw an "Out of Memory" error or bring your entire machine to a screeching halt if you try to create a variable of that size.
An alternative to consider is storing a sparse vector of indicators for the permutations you've already sampled. Let's consider your smaller example:
A = 1:8;
B = 1:10;
C = 1:25;
nA = numel(A);
nB = numel(B);
nC = numel(C);
beenSampled = sparse(1,nA*nB*nC);
The 1-by-2000 sparse matrix beenSampled is empty to start (i.e. it contains all zeroes) and we will add a one at a given index for each sampled permutation. We can get a new sample permutation using the function RANDI to give us indices into A, B, and C for the new set of values:
indexA = randi(nA);
indexB = randi(nB);
indexC = randi(nC);
We can then convert these three indices into a single unique linear index into beenSampled using the function SUB2IND:
index = sub2ind([nA nB nC],indexA,indexB,indexC);
Now we can test the indexed element in beenSampled to see if it has a value of 1 (i.e. we sampled it already) or 0 (i.e. it is a new sample). If it has been sampled already, we repeat the process of finding a new set of indices above. Once we have a permutation we haven't sampled yet, we can process it:
while beenSampled(index)
indexA = randi(nA);
indexB = randi(nB);
indexC = randi(nC);
index = sub2ind([nA nB nC],indexA,indexB,indexC);
end
beenSampled(index) = 1;
newSample = [A(indexA) B(indexB) C(indexC)];
%# ...do your subsequent processing...
The use of a sparse array will save you a lot of space if you're only going to end up sampling a small portion of all of the possible permutations. For smaller total numbers of permutations, like in the above example, I would probably just use a logical vector instead of a sparse vector.
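For example, the whole thing wrapped in a loop to draw a number of distinct permutations might look like this (a sketch building on the code above; nSamples is an assumed parameter):
nSamples = 50;   %# how many distinct permutations to draw
for iSample = 1:nSamples
    index = sub2ind([nA nB nC],randi(nA),randi(nB),randi(nC));
    while beenSampled(index)                          %# reject already-sampled permutations
        index = sub2ind([nA nB nC],randi(nA),randi(nB),randi(nC));
    end
    beenSampled(index) = 1;                           %# mark this permutation as sampled
    [indexA,indexB,indexC] = ind2sub([nA nB nC],index);
    newSample = [A(indexA) B(indexB) C(indexC)];
    %# ...do your subsequent processing...
end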
Check the MATLAB documentation for the randi function; you'll just want to use that in conjunction with the length function to choose random entries from each vector. Keeping track of each sampled vector should be as simple as just concatenating it to a matrix:
current_values = [5 89 45]; % let's say this is your current sample set
used_values = [used_values; current_values];
% wash, rinse, repeat

How to compare different distribution means with reference truth value in Matlab?

I have production (q) values from 4 different methods stored in the 4 matrices. Each of the 4 matrices contains q values from a different method as:
Matrix_1 = 1 row x 20 columns
Matrix_2 = 100 rows x 20 columns
Matrix_3 = 100 rows x 20 columns
Matrix_4 = 100 rows x 20 columns
The number of columns indicates the number of years. Each row contains the production values for the 20 years. The other 99 rows of matrices 2, 3 and 4 are just different realizations (or simulation runs). So basically the other 99 rows of matrices 2, 3 and 4 are repeat cases (but not with exactly the same values, because of the random numbers).
Consider Matrix_1 as the reference truth (or base case). Now I want to compare the other 3 matrices with Matrix_1 to see which of those three matrices (each with 100 repeats) compares best with, or most closely imitates, Matrix_1.
How can this be done in Matlab?
I know that, manually, we would use confidence intervals (CI): plot the mean of Matrix_1, and draw the distribution of the mean of Matrix_2, the mean of Matrix_3 and the mean of Matrix_4. The largest CI among matrices 2, 3 and 4 that contains the reference truth (the mean of Matrix_1) will be the answer.
mean of Matrix_1 = (1 row x 1 column)
mean of Matrix_2 = (100 rows x 1 column)
mean of Matrix_3 = (100 rows x 1 column)
mean of Matrix_4 = (100 rows x 1 column)
I hope the question is clear and relevant to SO. Otherwise please feel free to edit/suggest anything in question. Thanks!
EDIT: The three methods I talked about are a1, a2 and a3, respectively. Here are my results:
ci_a1 =
1.0e+008 *
4.084733001497999
4.097677503988565
ci_a2 =
1.0e+008 *
5.424396063219890
5.586301025525149
ci_a3 =
1.0e+008 *
2.429145282593182
2.838897116739112
p_a1 =
8.094614835195452e-130
p_a2 =
2.824626709966993e-072
p_a3 =
3.054667629953656e-012
h_a1 = 1; h_a2 = 1; h_a3 = 1
None of the CIs from the three methods includes the reference mean (= 3.454992884900722e+008). So do we still consider the p-value to choose the best result?
If I understand correctly, the calculation in MATLAB is pretty straightforward.
Steps 1-2 (mean calculation):
k1_mean = mean(k1);
k2_mean = mean(k2);
k3_mean = mean(k3);
k4_mean = mean(k4);
Step 3, use HIST to plot distribution histograms:
hist([k2_mean; k3_mean; k4_mean]')
Step 4. You can do a t-test comparing your vectors 2, 3 and 4 against a normal distribution with mean k1_mean and unknown variance. See TTEST for details.
[h,p,ci] = ttest(k2_mean,k1_mean);
EDIT: I misinterpreted your question. See the answer of Yuk and the following comments. My answer is what you need if you want to compare the distributions of two vectors rather than a vector against a single value. Apparently, the latter is the case here.
Regarding your t-tests, you should keep in mind that they test against a "true" mean. Given the number of values for each matrix and the confidence intervals, it's not too difficult to guess the standard deviation of your results. This is a measure of the "spread" of your results. The standard error on your mean is that standard deviation divided by the square root of the number of observations, and the confidence interval is obtained by multiplying that standard error by approximately 2.
This confidence interval contains the true mean in 95% of the cases. So if the true mean lies exactly at the border of that interval, the p-value is 0.05; the further away the mean, the lower the p-value. The p-value can be interpreted as the chance that the values you have in matrix 2, 3 or 4 come from a population with the same mean as matrix 1. Looking at your p-values, those chances can be said to be non-existent.
So you see that when the number of values gets high, the confidence interval becomes smaller and the t-test becomes very sensitive. What this tells you is nothing more than that the three matrices differ significantly from the reference mean. If you have to choose one, I'd take a look at the distributions anyway; otherwise, the one with the closest mean seems a good guess. If you want to get deeper into this, you could also ask on stats.stackexchange.com.
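As a rough illustration of that calculation (a sketch reusing the k1_mean and k2_mean variables from the earlier answer):
n  = numel(k2_mean);                                  % number of realizations (100 here)
se = std(k2_mean) / sqrt(n);                          % standard error of the mean
ci = mean(k2_mean) + [-1 1] * 1.96 * se;              % approximate 95% confidence interval
containsTruth = ci(1) <= k1_mean && k1_mean <= ci(2); % does the CI cover the reference mean?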
Your question and your method aren't really clear:
Is the distribution equal in all columns? This is important, as two distributions can have the same mean but still differ significantly.
Is there a reason why you don't use the Central Limit Theorem? This seems to me like a very complex way of obtaining a result that can easily be found using the fact that the distribution of a mean approaches a normal distribution with sd(mean) = sd(observations)/sqrt(number of observations). That saves you quite some work, if the distributions are alike!
Now if the question is really about comparing distributions, you should consider looking at a qqplot for a general idea, and at a two-sample Kolmogorov-Smirnov test for formal testing. But please read up on this test, as you have to understand what it does in order to interpret the results correctly.
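In MATLAB that would look roughly like this (a sketch; Matrix_1 and Matrix_2 as in the question, and kstest2 needs the Statistics Toolbox):
% Visual comparison of the two samples
qqplot(Matrix_1(:), Matrix_2(:));
% Formal two-sample Kolmogorov-Smirnov test
[h, p] = kstest2(Matrix_1(:), Matrix_2(:));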
On a side note: if you do this test on multiple cases, make sure you understand the problem of multiple comparisons and use the appropriate correction, e.g. Bonferroni or Dunn-Sidak.