Calculating cumulative distribution of two samples

Calculating cumulative distribution of two samples - scipy

I was going through the a scipy code for ks test (2 sample) which calculates the maximum distance between CDF's of any two given samples. code for calculating the cumulative Distribution Function(CDF).
I fail to understand the logic in the lines for calculating cdf. First, data1 and data2 is sorted and then using np.searchsorted we are trying to find the position of data_all in both data1 and data2. data_all is nothing but concatenation of sorted data1 and data2.
What if, the min value of data2 is below data1. Doesn't that violate the assumption that cdf shouldn't be decreasing with value
data_all = np.concatenate([data1,data2])
cdf1 = np.searchsorted(data1,data_all,side='right')/(1.0*n1)
cdf2 = (np.searchsorted(data2,data_all,side='right'))/(1.0*n2)

It's true that data_all is not sorted in general, but this does not matter for the computation.
The array cdf1 holds the values of the CDF of the first sample, computed at each of the points data_all
The array cdf2 holds the values of the CDF of the second sample, computed at each of the points data_all
Then the code does
np.max(np.absolute(cdf1 - cdf2))
taking the maximum of the differences of these. When you find the maximum of numbers, it does not matter in what order you look at them.
So, the order of these two arrays does not matter, as long as it's consistent: cdf1[42] is the value of CDF1 at some point and cdf2[42] is the value of CDF2 at the same point.

Related

How to perform operations along a certain dimension of an array?

I have a 3D array containing five 3-by-4 slices, defined as follows:
rng(3372061);
M = randi(100,3,4,5);
I'd like to collect some statistics about the array:
The maximum value in every column.
The mean value in every row.
The standard deviation within each slice.
This is quite straightforward using loops,
sz = size(M);
colMax = zeros(1,4,5);
rowMean = zeros(3,1,5);
sliceSTD = zeros(1,1,5);
for indS = 1:sz(3)
sl = M(:,:,indS);
sliceSTD(indS) = std(sl(1:sz(1)*sz(2)));
for indC = 1:sz(1)
rowMean(indC,1,indS) = mean(sl(indC,:));
end
for indR = 1:sz(2)
colMax(1,indR,indS) = max(sl(:,indR));
end
end
But I'm not sure that this is the best way to approach the problem.
A common pattern I noticed in the documentation of max, mean and std is that they allow to specify an additional dim input. For instance, in max:
M = max(A,[],dim) returns the largest elements along dimension dim. For example, if A is a matrix, then max(A,[],2) is a column vector containing the maximum value of each row.
How can I use this syntax to simplify my code?

Many functions in MATLAB allow the specification of a "dimension to operate over" when it matters for the result of the computation (several common examples are: min, max, sum, prod, mean, std, size, median, prctile, bounds) - which is especially important for multidimensional inputs. When the dim input is not specified, MATLAB has a way of choosing the dimension on its own, as explained in the documentation; for example in max:
If A is a vector, then max(A) returns the maximum of A.
If A is a matrix, then max(A) is a row vector containing the maximum value of each column.
If A is a multidimensional array, then max(A) operates along the first array dimension whose size does not equal 1, treating the elements as vectors. The size of this dimension becomes 1 while the sizes of all other dimensions remain the same. If A is an empty array whose first dimension has zero length, then max(A) returns an empty array with the same size as A.
Then, using the ...,dim) syntax we can rewrite the code as follows:
rng(3372061);
M = randi(100,3,4,5);
colMax = max(M,[],1);
rowMean = mean(M,2);
sliceSTD = std(reshape(M,1,[],5),0,2); % we use `reshape` to turn each slice into a vector
This has several advantages:
The code is easier to understand.
The code is potentially more robust, being able to handle inputs beyond those it was initially designed for.
The code is likely faster.
In conclusion: it is always a good idea to read the documentation of functions you're using, and experiment with different syntaxes, so as not to miss similar opportunities to make your code more succinct.

Are the conditional probabilities of MATLAB's mnrval correct?

An MWE (stats toolbox required, tested on MATLAB R2014b):
x = (1:3)';
b = mnrfit(x,x,'model','hierarchical');
pihat = mnrval(b,x,'model','hierarchical','type','conditional')
Output:
pihat =
1 1
2.2204e-16 1
2.2204e-16 2.2204e-16
(Ignore the issued warning, it's because of the trivial example, which is linearly separable (I'm predicting x using itself). It doesn't matter: I've tried this as well with a non-trivial (and not-so minimal) example without the warning and the results are similar.)
My problem is the result. I've specified I want the conditional probabilities. According to MATLAB's documentation on mnrval:
Specify ['conditional'] to return predictions [...] in terms of the first k – 1 conditional category probabilities [...], i.e., the probability [...] for category j, given an outcome in category j or higher.
In my example this means rows of pihat contain the probability of
x=1 given x>=1
x=2 given x>=2
(A third column for x=3 is not necessary, because if the first two probabilities are known, the third is too. It follows logically from P(x=1) + P(x=2) + P(x=3) = 1.)
Am I interpreting this correctly? Thus, if x=1 is predicted, then the first column value should be large (close to one), because P(x=1) given x>=1 is large. The second column should be close to zero, because P(x=2) given x>=2 can't be large if x=1.
However, as you can see in the first row, the second column value is large as well as the first! I believe this is incorrect according to what the documentation specifies, am I right? The current (incorrect?) result implies the predicted probabilities in the rows are not of x=j given x>=j, but what are they then? Or how should I be interpreting them?
They are not equal to the cumulative probabilities, i.e. the probability of x<=j, which increases with j. I've checked this by calculating pihat2 = mnrval(b,x,'model','hierarchical','type','cumulative'); pihat2-pihat.

How to calculate the "rest value" of a plot?

Didn't know how to paraphrase the question well.
Function for example:
Data:https://www.dropbox.com/s/wr61qyhhf6ujvny/data.mat?dl=0
In this case how do I calculate that the rest point of this function is ~1? I have access to the vector that makes the plot.
I guess the mean is an approximation but in some cases it can be pretty bad.

Under the assumption that the "rest" point is the steady-state value in your data and the fact that the steady-state value happens the majority of the times in your data, you can simply bin all of the points and use each unique value as a separate bin. The bin with the highest count should correspond to the steady-state value.
You can do this by a combination of histc and unique. Assuming your data is stored in y, do this:
%// Find all unique values in your data
bins = unique(y);
%// Find the total number of occurrences per unique value
counts = histc(y, bins);
%// Figure out which bin has the largest count
[~,max_bin] = max(counts);
%// Figure out the corresponding y value
ss_value = bins(max_bin);
ss_value contains the steady-state value of your data, corresponding to the most occurring output point with the assumptions I laid out above.
A minor caveat with the above approach is that this is not friendly to floating point data whose unique values are generated by floating point values whose decimal values beyond the first few significant digits are different.
Here's an example of your data from point 2300 to 2320:
>> format long g;
>> y(2300:2320)
ans =
0.99995724232555
0.999957488454868
0.999957733165346
0.999957976465197
0.999958218362579
0.999958458865564
0.999958697982251
0.999958935720613
0.999959172088623
0.999959407094224
0.999959640745246
0.999959873049548
0.999960104014889
0.999960333649014
0.999960561959611
0.999960788954326
0.99996101464076
0.999961239026462
0.999961462118947
0.999961683925704
0.999961904454139
Therefore, what I'd recommend is to perhaps round so that the first 5 or so significant digits are maintained.
You can do this to your dataset before you continue:
num_digits = 5;
y_round = round(y*(10^num_digits))/(10^num_digits);
This will first multiply by 10^n where n is the number of digits you desire so that the decimal point is shifted over by n positions. We round this result, then divide by 10^n to bring it back to the scale that it was before. If you do this, for those points that were 0.9999... where there are n decimal places, these will get rounded to 1, and it may help in the above calculations.
However, more recent versions of MATLAB have this functionality already built-in to round, and you can just do this:
num_digits = 5;
y_round = round(y,num_digits);
Minor Note
More recent versions of MATLAB discourage the use of histc and recommend you use histcounts instead. Same function definition and expected inputs and outputs... so just replace histc with histcounts if your MATLAB version can handle it.
Using the above logic, you could also use the median too. If the majority of data is fluctuating around 1, then the median would have a high probability that the steady-state value is chosen... so try this too:
ss_value = median(y_round);

How to sort the significant data in array of minimas?

I have a vector of local minimas and most of these values are noise and having only few significant values. Let suppose if my vector having three significant values and all are below than then zeros. There is huge difference between significant value and noise. The magnitude of each significant value is also different. I want to sort these significant values in ascending order on basis of their indices. For Example
y = [-0.0002, -0.00058, -0.28, -0.0008, -0.25,-0.0006,-0.00004,-0.26]
output = [-0.28, -0.25, -0.26, -0.0002, -0.00058,-0.0008, -0.0006,-0.00004]
Somehow, I want to preserve the order to peak indices.

So for you the order in y is more important than a difference of 1/99 (you may tune 1/99 to anything you prefer).
[~,order]=sort(y+(1:length(y))/99)
output=y(order)
I add the vector [1/99,2/99,3/99,4/99 ...] to y before sorting. Then I only retain the way that modified y would be sorted. And I apply that on the unmodified y.
Won't work if y is very long. For that, you may want to repeat a second time my suggestion:
[~,order]=sort(y+(1:length(y))/99)
y=y(order)
[~,order]=sort(y+(1:length(y))/99)
output=y(order)

MATLAB: What's [Y,I]=max(AS,[],2);?

I just started matlab and need to finish this program really fast, so I don't have time to go through all the tutorials.
can someone familiar with it please explain what the following statement is doing.
[Y,I]=max(AS,[],2);
The [] between AS and 2 is what's mostly confusing me. And is the max value getting assigned to both Y and I ?

According to the reference manual,
C = max(A,[],dim) returns the largest elements along the dimension of A specified by scalar dim. For example, max(A,[],1) produces the maximum values along the first dimension (the rows) of A.
[C,I] = max(...) finds the indices of the maximum values of A, and returns them in output vector I. If there are several identical maximum values, the index of the first one found is returned.
I think [] is there just to distinguish itself from max(A,B).

C = max(A,[],dim) returns the largest elements along the dimension of A specified by scalar dim. For example, max(A,[],1) produces the maximum values along the first dimension (the rows) of A.
Also, the [C, I] = max(...) form gives you the maximum values in C, and their indices (i.e. locations) in I.
Why don't you try an example, like this? Type it into MATLAB and see what you get. It should make things much easier to see.
m = [[1;6;2] [5;8;0] [9;3;5]]
max(m,[],2)

AS is matrix.
This will return the largest elements of AS in its 2nd dimension (i.e. its columns)

This function is taking AS and producing the maximum value along the second dimension of AS. It returns the max value 'Y' and the index of it 'I'.

note the apparent wrinkle in the matlab convention; there are a number of builtin functions which have signature like:
xs = sum(x,dim)
which works 'along' the dimension dim. max and min are the oddbal exceptions:
xm = max(x,dim); %this is probably a silent semantical error!
xm = max(x,[],dim); %this is probably what you want
I sometimes wish matlab had a binary max and a collapsing max, instead of shoving them into the same function...

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Calculating cumulative distribution of two samples - scipy

Related

How to perform operations along a certain dimension of an array?

Are the conditional probabilities of MATLAB's mnrval correct?

How to calculate the "rest value" of a plot?

How to sort the significant data in array of minimas?

MATLAB: What's [Y,I]=max(AS,[],2);?

Categories

Resources