I guess this is a simple question, but I can't sort it out. I have a vector, the first elements of which look like:
V = [31 52 38 29 29 34 29 24 25 25 32 28 24 28 29 ...];
and I want to perform a chi2gof test in Matlab to test if V is exponentially distributed. I did:
[h,p] = chi2gof(V,'cdf',@expcdf);
but I get a warning message saying:
Warning: After pooling, some bins still have low expected counts.
The chi-square approximation may not be accurate
Have I defined the chi2gof call incorrectly?
At 36 values, you have a very small sample set. From the second sentence of Wikipedia's article on the chi-squared test (emphasis added):
It is suitable for unpaired data from large samples.
Large in this case usually means at least around 100. Read about more assumptions of this test here.
Alternatives
You might try kstest in Matlab, which is based on the Kolmogorov-Smirnov test:
[h,p] = kstest(V,'cdf',[V(:) expcdf(V(:),expfit(V))])
Or try lillietest, which is based on the Lilliefors test and has an option specifically for exponentially distributed data:
[h,p] = lillietest(V,'Distribution','exp')
In case you can increase your sample size, you are doing one thing wrong with chi2gof. From the help for the 'cdf' option:
A fully specified cumulative distribution function. This
can be a ProbabilityDistribution object, a function
handle, or a function name. The function must take X
values as its only argument. Alternately, you may provide
a cell array whose first element is a function name or
handle, and whose later elements are parameter values,
one per cell. The function must take X values as its
first argument, and other parameters as later arguments.
You're not supplying any additional parameters, so expcdf is using the default mean parameter of mu = 1. Your data values are very large and don't correspond at all to an exponential distribution with this mean. You need to estimate the parameter as well. Using the expfit function, which is based on maximum likelihood estimation, you might try something like this:
[h,p] = chi2gof(V,'cdf',@(x)expcdf(x,expfit(x)),'nparams',1)
However, with only 36 samples you may not get a very good estimate for a distribution like this and still may not get expected results even for data sampled from a known distribution, e.g.:
V = exprnd(10,1,36);
[h,p] = chi2gof(V,'cdf',@(x)expcdf(x,expfit(x)),'nparams',1)
Related
Why do I get different results when running the same code, sum(b.*x1) where b is single and x1 is double, in different versions of MATLAB (2016 vs 2021)? How can I avoid such differences between MATLAB versions?
MATLAB v.2021:
sum(b.*x1)
ans =
single
-0.0013286
MATLAB 2016
sum(b.*x1)
ans =
single
-0.0013283
In R2017b, they changed the behavior of sum for single-precision floats, and in R2020b they made the same changes for other data types too.
The change speeds up the computation and improves accuracy by reducing the rounding errors. Simply put, previously the algorithm would just run through the array in sequence, adding up the values. The new behavior computes the sum over smaller portions of the array, and then adds up those results. This is more precise because the running total can become a very large number, and adding small numbers to it causes larger rounding errors in those small numbers. The speed improvement comes from loop unrolling: the loop now steps over, say, 8 values at a time, and in the loop body, 8 running totals are computed (they don't specify the number they use, the 8 here is an example).
Thus, your newer result is a better approximation to the sum of your array than the old one.
For more details (a better explanation of the new algorithm and the reason for the change), see this blog post.
Regarding how to avoid the difference: you could implement your own sum function, and use that instead of the builtin one. I would suggest writing it as a MEX-file for efficiency. However, do make sure you match the newer behavior of the builtin sum, as that is the better approximation.
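For illustration, here is a minimal sketch (not MathWorks' actual algorithm; the function name blockedSum is mine) of a blocked summation that keeps several independent running totals, which mimics the post-R2017b behavior more closely than a single sequential total:
function s = blockedSum(a)
% Sketch of blocked summation: keep 8 independent running totals ("lanes")
% and combine them at the end, so no single accumulator grows far larger
% than the values being added to it. Save as blockedSum.m.
B = 8;                                 % number of lanes (illustrative choice)
totals = zeros(1, B, 'like', a);
n = numel(a);
m = B * floor(n / B);
for k = 1:B:m
    totals = totals + reshape(a(k:k+B-1), 1, B);
end
s = sum(totals) + sum(a(m+1:end));     % combine lanes and any leftover tail
end
A real MEX implementation would do the same thing with an explicit C loop, but the blocking idea is what reduces the rounding error.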
Here is an example of the problem. Let's create an array with N+1 elements, where the first one has a value of N and the rest have a value of 1.
N = 1e8;
a = ones(N+1,1,'single');
a(1) = N;
The sum over this array is expected to be 2*N. If we set N large enough w.r.t. the data type, I see this in R2017a (before the change):
>> sum(a)
ans =
single
150331648
And I see this in R2018b (after the change for single-precision sum):
>> sum(a)
ans =
single
199998976
Both implementations make rounding errors here, but one is obviously much, much closer to the expected result (2e8, or 200000000).
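For reference (you can verify this yourself), accumulating the same data in double precision should give the exact answer here, since every intermediate total stays well within the integer range a double can represent exactly:
>> sum(double(a))
ans =
   200000000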
I'm running MATLAB R2012a. I tried to write a function that imitates randi using rand (only rand), producing the same output when the same arguments are passed and the same seed is provided. I tried something in the command window and here's what I got:
>> s = rng;
>> R1 = randi([2 20], 3, 5)
R1 =
2 16 11 15 14
10 17 10 16 14
9 5 14 7 5
>> rng(s)
>> R2 = 2+18*rand(3, 5)
R2 =
2.6200 15.7793 10.8158 14.7686 14.2346
9.8974 16.3136 10.0206 15.5844 13.7918
8.8681 5.3637 13.6336 6.9685 4.9270
>>
A swift comparison led me to believe that there's some link between the two: each integer in R1 is within plus or minus one of the corresponding element in R2. Nonetheless, I failed to go any further: I checked ceiling, flooring, fixing and rounding, but none of them seems to work.
randi([2 20]) generates integers between 2 and 20, both included. That is, it can generate 19 different values, not 18.
19 * rand
generates values uniformly distributed within the half-open interval [0,19), flooring it gives you uniformly distributed integers in the range [0,18].
Thus, in general,
x = randi([a,b]);
y = floor(rand * (b-a+1)) + a;
should yield numbers with the same distribution. From the OP's experiment it looks like they might generate the same sequence, but this cannot be guaranteed, and they likely don't.
Why? It is likely that randi is not implemented in terms of rand, but in terms of its underlying random number generator, which produces integers. To go from a random integer x in a large range ([0,N-1]) to one in a small range ([0,n-1]), you would normally use the modulo operator (mod(x,n)) or a floored division like above, but remove a small subset of the values that would skew the distribution. This other answer gives a detailed explanation. I like to think of it in terms of examples:
Say random values are in the range [0,2^16-1] (N=2^16) and you want values in the range [0,18] (n=19). mod(2^16,19)=5. That is, the largest 5 values that can be generated by the random number generator are mapped to the lowest 5 values of the output range (assuming the modulo method), making those numbers slightly more likely to be generated than the rest of the output range. Each of these lowest 5 values can be produced by floor(N/n)+1 raw values, whereas each of the rest can be produced by only floor(N/n). This is bad. [Using floored division instead of modulo yields a different distribution of the unevenness, but the end result is the same: some numbers are slightly more likely than others.]
To solve this issue, a correct implementation does the following: if the raw value from the random generator is floor(N/n)*n or higher, throw it away and try again. The chance of this happening is very small, of course, with a typical random number generator where N=2^64.
Though we don't know how randi is implemented, we can be fairly certain that it follows the correct implementation described here. So your sequence based on rand might match for millions of numbers, but then start to deviate.
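As an illustration only (this is hypothetical; MATLAB does not document randi's implementation, and the function name unbiasedRandi is mine), a rejection-based mapping from a raw uniform source to integers in [a,b] could look like this:
function r = unbiasedRandi(a, b)
% Hypothetical sketch of rejection sampling. The raw integer source is
% simulated with floor(rand*N); a real implementation would use the
% generator's native integer output directly.
N = 2^16;                    % size of the raw range [0, N-1]
n = b - a + 1;               % size of the target range
limit = floor(N / n) * n;    % raw values >= limit would bias the result
x = floor(rand * N);
while x >= limit             % reject and redraw the biased tail
    x = floor(rand * N);
end
r = a + mod(x, n);           % unbiased integer in [a, b]
end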
Interestingly enough, Octave's randi is implemented as an M-file, so we can see how they do it. And it turns out it uses the wrong algorithm shown at the top of this answer, based on rand:
ri = imin + floor ( (imax-imin+1)*rand (varargin{:}) );
Thus, Octave's randi is biased!
I have to set up a phoneme table with a specific probability distribution for encoding things.
Now there are 22 base elements (each with an assigned probability, sum 100%), which shall be mapped onto a 12-element table with desired element probabilities (sum 100%).
So part of the minimisation is to merge several base elements to get 12 table elements. Each base element must occur exactly once.
In addition, the table has 3 rows. So the same 12 element composition of the 22 base elements must minimise the error for 3 target vectors. Let's say the given target vectors are b1,b2,b3 (dimension 12x1), the given base vector is x (dimension 22x1) and they are connected by the unknown matrix A (12x22) by:
b1+err1=Ax
b2+err2=Ax
b3+err3=Ax
To sum it up: A is to be found so that dot_prod(err1+err2+err3, err1+err2+err3)=min (least squares). And - according to the above explanation - A must contain only 1's and 0's, while having exactly one 1 per column.
Unfortunately I have no idea how to approach this problem. Can it be expressed in a way different from the matrix-vector form?
Which tools in matlab could do it?
I think I found the answer while parsing some sections of the Matlab documentation.
First of all, the problem can be rewritten as:
errSum=err1+err2+err3=3Ax-b1-b2-b3
=> minimise dot_prod(errSum, errSum) over A
Applying the dot product (least squares) yields a quadratic scalar expression.
Syntax-wise, the fmincon tool within the Optimization Toolbox could do the job. It has constraint parameters that allow forcing Aij to be binary and each column to sum to 1.
But algorithm-wise, fmincon is apparently not ideal for binary problems, and the ga tool should be used instead; it can be called in a similar way.
Since the equation would be very long in my case and would need to be written out, I haven't tried it yet. Please correct me if I'm wrong, or add further solution methods if available.
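For what it's worth, here is a minimal sketch of how the ga formulation could be set up (all data below are placeholders, and names such as cost and idxBest are mine): the assignment of the 22 base elements to the 12 table elements is encoded as an integer vector rather than as the matrix A directly, which automatically enforces exactly one 1 per column.
% Placeholder data: 22 base probabilities and three 12-element target vectors.
x  = rand(22,1);  x  = x  / sum(x);
b1 = rand(12,1);  b1 = b1 / sum(b1);
b2 = rand(12,1);  b2 = b2 / sum(b2);
b3 = rand(12,1);  b3 = b3 / sum(b3);
bSum = b1 + b2 + b3;

% idx(j) = row of A that holds the single 1 in column j; accumarray(idx,x)
% then equals A*x, so the least-squares cost is:
cost = @(idx) sum((3*accumarray(idx(:), x, [12 1]) - bSum).^2);

nvars = 22;
lb = ones(1, nvars);
ub = 12 * ones(1, nvars);
opts = optimoptions('ga', 'Display', 'off');
idxBest = ga(cost, nvars, [], [], [], [], lb, ub, [], 1:nvars, opts);

% Recover the 12x22 binary assignment matrix A from the integer vector.
A = full(sparse(idxBest, 1:nvars, 1, 12, nvars));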
I'm working on a time series forecasting problem and I would like to confirm if it makes sense to compute the standard deviation of the root mean squared error. If so, is this the correct way?
STD_test = std(sqrt((y_real-y_pred).^2))
Also, imagine that the output of the model is 100, the RMSE 20 and the STD 10. This means that the real value is between [70,120] ?
The term y_real-y_pred is the vector of errors. The expression squares each element of it and then takes the square root of each element, which has the effect of abs(). Then std() is run on that vector. Thus, this is computing the S.D. of the (absolute) error. That is a meaningful metric, but unlikely to be what you are after. Try:
e = y_real-y_pred;
MSE = mean(e.^2);
RMSE = sqrt(MSE);
sd = std(RMSE);
That computes the RMSE correctly. However, since RMSE is a scalar value, sd will be zero, so to answer the first part of your question: no, it is not meaningful. What is meaningful is to look at the s.d. of the error itself:
sd = std(e);
RMSE and s.d. are somewhat related but they are distinct.
Your RMSE is fine, but the final conclusion is not! A std of 10 means that, assuming roughly normal errors, there is about a 68% chance that the real value lies within one std of your output, i.e. in [90, 110] for an output of 100. You can refer to this wiki link to learn more about the 68-95-99.7 rule.
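As a quick check (a sketch only; it assumes y_real and y_pred from the question and roughly normal, zero-mean errors), the empirical coverage of the one-std band can be verified directly:
e = y_real - y_pred;            % error vector
sd = std(e);
coverage = mean(abs(e) <= sd)   % close to 0.68 if the errors are roughly normal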
I would like to partition a number into almost equal values. The only criterion is that each partition must be between 60 and 80.
For example, if I have a value = 300, this means that 75 * 4 = 300.
I would like to know a method to get this 4 and 75 in the above example. In some cases, all partitions don't need to be of equal value, but they should be between 60 and 80. Any constraints can be used (addition, subtraction, etc.). However, the outputs must not be floating point.
Also, the total doesn't have to be exactly 300 as in this case; it can exceed the target by up to 40, so for the case of 300 the numbers can sum up to 340 if required.
Assuming only addition, you can formulate this problem as a linear programming problem. You would choose an objective function that maximizes the sum of all of the factors chosen to generate that number for you. Therefore, your objective function would be:
maximize x_1 + x_2 + ... + x_n
In this case, n would be the number of factors you are using to try and decompose your number into. Each x_i is a particular factor in the overall sum of the value you want to decompose. I'm also going to assume that none of the factors can be floating point; they can only be integers. As such, you need to use a special case of linear programming called integer programming, where the constraints and the actual solution to your problem are all integers. In general, the integer programming problem is formulated thusly:
minimize f'*x subject to A*x <= b, Aeq*x = beq, lb <= x <= ub, with x restricted to integer values
You are actually trying to minimize this objective function, such that you produce a parameter vector of x that are subject to all of these constraints. In our case, x would be a vector of numbers where each element forms part of the sum to the value you are trying to decompose (300 in your case).
You have inequalities, equalities and also boundaries of x that each parameter in your solution must respect. You also need to make sure that each parameter of x is an integer. As such, MATLAB has a function called intlinprog that will perform this for you. However, this function assumes that you are minimizing the objective function, and so if you want to maximize, simply minimize on the negative. f is a vector of weights to be applied to each value in your parameter vector, and with our objective function, you just need to set all of these to -1.
Therefore, to formulate your problem in an integer programming framework, you are actually doing:
minimize -(x_1 + x_2 + ... + x_n) subject to x_1 + x_2 + ... + x_n = V, 60 <= x_i <= 80, and each x_i an integer
V would be the value you are trying to decompose (so 300 in your example).
The standard way to call intlinprog is in the following way:
x = intlinprog(f,intcon,A,b,Aeq,beq,lb,ub);
f is the vector that weights each parameter of the solution you want to solve, and intcon denotes which of your parameters need to be integer. In this case, you want all of them to be integer, so you would have to supply an increasing vector from 1 to n, where n is the number of factors you want to decompose the number V into (same as before). A and b are the matrix and vector that define your inequality constraints; because your only constraint here is an equality, you'd set these to empty ([]). Aeq and beq are the same as A and b, but for equality constraints. Because you only have one constraint here, you would simply create a matrix of 1 row, where each value is set to 1, and beq would be a single value which denotes the number you are trying to factorize. lb and ub are the lower and upper bounds for each value in the parameter set, so these would be 60 and 80 respectively, and you'd have to specify a vector to ensure that each parameter is bounded between these two values.
Now, because you don't know how many factors will evenly decompose your value, you'll have to loop over a given set of factors (like between 1 to 10, or 1 to 20, etc.), place your results in a cell array, then examine manually whether or not an integer decomposition was successful.
num_factors = 20; %// Number of factors to try and decompose your value
V = 300;
results = cell(1, num_factors);
%// Try to solve the problem for a number of different factors
for n = 1 : num_factors
x = intlinprog(-ones(n,1),1:n,[],[],ones(1,n),V,60*ones(n,1),80*ones(n,1));
results{n} = x;
end
You can then go through results and see which value of n was successful in decomposing your number into that number of factors.
One small problem here is that we also don't know how many factors we should check up to. That unfortunately I don't have an answer to, and so you'll have to play with this value until you get good results. This is also an unconstrained parameter, and I'll talk about this more later in this post.
However, intlinprog was only released in recent versions of MATLAB. If you want to do the same thing without it, you can use linprog, which is the floating point version of integer programming... actually, it's just the core linear programming framework itself. You would call linprog this way:
x = linprog(f,A,b,Aeq,beq,lb,ub);
All of the variables are the same, except that intcon is not used here... which makes sense, as linprog may generate floating point numbers as part of its solution. Because linprog can generate floating point solutions, to check whether a given value of n produced an integer decomposition you can take the floor of each element of the result, subtract that from the element itself, and sum up the absolute differences. If you get a value of 0, this means that you had a completely integer result. Therefore, you'd have to do something like:
num_factors = 20; %// Number of factors to try and decompose your value
V = 300;
results = cell(1, num_factors);
%// Try to solve the problem for a number of different factors
for n = 1 : num_factors
x = linprog(-ones(n,1),[],[],ones(1,n),V,60*ones(n,1),80*ones(n,1));
results{n} = x;
end
%// Loop through and determine which decompositions were successful integer ones
out = cellfun(@(x) sum(abs(floor(x) - x)), results);
%// Determine which values of n were successful in the integer composition.
final_factors = find(~out);
final_factors will contain which number of factors you specified that was successful in an integer decomposition. Now, if final_factors is empty, this means that it wasn't successful in finding anything that would be able to decompose the value into integer factors. Noting your problem description, you said you can allow for tolerances, so perhaps scan through results and determine which overall sum best matches the value, then choose whatever number of factors that gave you that result as the final answer.
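As a sketch of that fallback (variable names are mine), you could round each candidate solution and pick the number of factors whose sum lands closest to V:
sums = cellfun(@(x) sum(round(x)), results);   % rounded total for each n
[~, best_n] = min(abs(sums - V));              % n whose total is closest to V
best_partition = round(results{best_n});       % candidate integer partition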
Now, noting from my comments, you'll see that this problem is very unconstrained. You don't know how many factors are required to get an integer decomposition of your value, which is why we had to semi-brute-force it. In fact, this is a more general case of the subset sum problem, which is NP-complete. Basically, no polynomial-time algorithm is known for this kind of problem, so in practice the only way to get a valid solution is to brute-force each possible candidate and check whether it satisfies the problem. Brute-forcing solutions usually requires exponential time, which is intractable for large problems.
Another interesting fact is that modern cryptography relies on this kind of computational intractability. Such schemes are banking on the fact that the only way for you to determine the right key that was used to encrypt your plain text is to check all possible keys, which is an intractable problem... especially if you use 128-bit encryption! That means you would have to check 2^128 possibilities, and assuming a moderately fast computer, the worst-case time to find the right key would exceed the current age of the universe. Check out this cool Wikipedia post for more details on intractability with regards to key breaking in cryptography.
In fact, NP-complete problems are very popular and there have been many attempts to determine whether there is or there isn't a polynomial-time algorithm to solve such problems. An interesting property is that if you can find a polynomial-time algorithm that will solve one problem, you will have found an algorithm to solve them all.
The Clay Mathematics Institute has what are known as Millennium Problems where if you solve any problem listed on their website, you get a million dollars.
Also, that's for each problem, so one problem solved == 1 million dollars!
The P versus NP problem is among the seven problems up for solving. If I recall correctly, only one problem has been solved so far, and these problems were first released to the public in the year 2000 (hence millennium...). So... it has been about 14 years and only one problem has been solved. Don't let that discourage you though! If you want to invest some time and try to solve one of the problems, please do!
Hopefully this will be enough to get you started. Good luck!