Measure the STD of RMSE - matlab

I'm working on a time series forecasting problem and I would like to confirm if it makes sense to compute the standard deviation of the root mean squared error. If so, is this the correct way?
STD_test = std(sqrt((y_real-y_pred).^2))
Also, imagine that the output of the model is 100, the RMSE 20 and the STD 10. This means that the real value is between [70,120] ?

The term y_real-y_pred is the vector of errors. The expression squares each element of it, and then sqrts each element of it, thus having the effect of abs(). Then std() is run on the vector of errors. Thus, this is computing the S.D. of the (absolute) error. That is a meaningful metric, but unlikely to be what you are after. Try:
e = y_real-y_pred;
MSE = mean(e.^2);
RMSE = sqrt(MSE);
sd = std(RMSE);
That will compute what you want. However, since RMSE is a scalar value, the value sd will be zero, so to answer the first part of your question, no it is not meaningful. What is meaningful is to look at the s.d. of the error itself:
sd = std(e);
RMSE and s.d. are somewhat related but they are distinct.
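To make the relationship concrete, here is a minimal sketch using made-up values for y_real and y_pred (hypothetical data, not taken from the question). It computes the RMSE and the s.d. of the error, then checks that RMSE^2 equals the squared mean error plus the error variance when the s.d. is normalised by N rather than N-1.

```matlab
% Hypothetical data, for illustration only
y_real = [102 98 110 95 105];
y_pred = [100 100 100 100 100];

e    = y_real - y_pred;        % error vector: [2 -2 10 -5 5]
RMSE = sqrt(mean(e.^2));       % scalar summary of the error magnitude
sd   = std(e, 1);              % s.d. of the error, normalised by N
bias = mean(e);                % mean error

% RMSE^2 splits into squared bias plus error variance (N-normalised)
gap = abs(RMSE^2 - (bias^2 + sd^2));
```

When the mean error is zero, the RMSE and the N-normalised s.d. of the error coincide; otherwise the RMSE is the larger of the two.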

Your RMSE is fine, but the final conclusion is not! Assuming the errors are roughly normally distributed, a std of 10 means there's a roughly 68% chance that the error lies within ±1 std of its mean (and about 95% within ±2 std). You can refer to this wiki link to learn more about the rule.

Related

Understanding expectation operator with an example : Matlab

Say, I have N observations stored in an array, X = [x_1,x_2,...,x_N]. What is the meaning of E[\sum_{i=1}^N{x_i}]/N? This to me appears an average operation. But not sure. Can somebody please help what is the meaning of this operator with the help of any example in Matlab?
In general, what is (1) E[x], where x is a random variable,
(2) E[x^2], and
(3) E[d], where d = 1/N (sum_{i=1}^N x_i)?
Your notation is still pretty bad to the point it's hard to understand, BUT there's a limit to what you can do on this site since it does not have MathJax enabled...
The expectation value is a generalized average in the sense that it should be weighted by the distribution from which X is drawn. If X is uniformly distributed then you'll get what you would call the "average" and what I think your first formula is giving. If X is distributed by some other distribution, then you will get something else.
If the distribution is discrete, then in general
E[f(x)] = sum_{i=1}^N [f(x_i) p(x_i)],
where p(x) is the distribution for the random variable.
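As a small illustration of the discrete formula, the sketch below uses a hypothetical random variable with four outcomes and made-up probabilities. It evaluates E[x] and E[x^2] as weighted sums, and shows that uniform weights reduce the expectation to the ordinary average.

```matlab
% Hypothetical discrete random variable: outcomes and their probabilities
x = [1 2 3 4];
p = [0.1 0.2 0.3 0.4];           % probabilities, must sum to 1

Ex  = sum(x .* p);               % E[x]   with f(x) = x
Ex2 = sum(x.^2 .* p);            % E[x^2] with f(x) = x^2

% With uniform weights p_i = 1/N the expectation is the plain average
N  = numel(x);
Eu = sum(x .* (ones(1, N) / N)); % equals mean(x)
```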

MATLAB Simple - Linear Predictive Coding and Energy Forecasting

I have a dataset with 274 samples (9 months) of the daily energy (watt-hours) used by a residential household. I'm not sure if I'm applying the lpc function correctly.
My code is the following:
filename='9-months.csv';
energy = csvread(filename);
C=zeros(5,1);
counter=0;
N=3;
for n=274:-1:31
w2=energy(1:n-1,1);
a=lpc(w2,N);
energy_estimated=0;
for X = 1:N
energy_estimated = energy_estimated + (-a(X+1)*energy(n-X));
end
w_real=energy(n);
error2=abs(w_real-energy_estimated);
counter=counter+1;
C(counter,1)=error2;
end
mean_error=round(mean(C));
With "n" being the sample under analysis, I use the energy array's values from 1 to n-1 to calculate the lpc coefficients (with N=3).
After that, the calculated coefficients are applied in the inner "for" loop shown above in order to calculate the estimated energy.
Finally, error2 outputs the error between the real energy and estimated value.
On the example presented ( http://www.mathworks.com/help/signal/ref/lpc.html ) some filters are used. Do I need to apply any filter to it? Is my methodology correct?
Thank you very much in advance!
The lpc seems to be used correctly, but there are a few other things about your code. I am addressing the part at the "for n" loop:
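estimates=zeros(274,1); %preallocate the one-step-ahead predictions
for n=31:274 %going forward in time seems more logical to me
w2=energy(1:n-1,1);
a=lpc(w2,N);
%filter over samples 1..n so the last output predicts energy(n) from energy(n-1..n-N)
energy_estimate=filter([0 -a(2:end)],1,energy(1:n,1));
estimates(n)=energy_estimate(end);
end
err=energy(31:274)-estimates(31:274); %named err to avoid shadowing the built-in error()
meanerror=mean(err); %you don't really round mean errors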
filter does exactly what you are trying to do with the X=1:N loop, but it performs the calculation for its entire input vector. If you just want the last value, index the result with (end).
There is also no reason to calculate the error for every single value inside the loop and append it to a vector; you can do that faster after the loop, in one vectorized step.
If you're trying to estimate future values with lpc, it could work like that, but you are implying that every value depends only on the last 3 values. Have you tried something like a polynomial approach? I would think that this would be closer to reality.
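To check that filter and the explicit X=1:N loop give the same one-step prediction, here is a minimal sketch in base MATLAB. The data vector x and the coefficient vector a are hypothetical stand-ins (a has a leading 1, the same layout lpc returns), so no toolbox is needed to run it.

```matlab
% Hypothetical data and predictor coefficients (a(1) is 1, as lpc returns)
x = [5 7 6 8 9 7 10]';
a = [1 -0.5 -0.3 -0.1];
N = numel(a) - 1;                  % predictor order
n = numel(x);                      % predict the last sample x(n)

% Explicit loop over the N previous samples
est_loop = 0;
for k = 1:N
    est_loop = est_loop - a(k+1) * x(n-k);
end

% Same prediction via filter: the leading 0 delays the input by one sample
y = filter([0 -a(2:end)], 1, x);
est_filter = y(end);
```

Note that the filter input has to run up to sample n for the last output to be the prediction of x(n); the current sample itself gets weight 0.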

Matlab: Chi-squared fit (chi2gof) to test if data is exponentially distributed

I guess this is a simple question, but I can't sort it out. I have a vector, the first elements of which look like:
V = [31 52 38 29 29 34 29 24 25 25 32 28 24 28 29 ...];
and I want to perform a chi2gof test in Matlab to test if V is exponentially distributed. I did:
[h,p] = chi2gof(V,'cdf',@expcdf);
but I get a warning message saying:
Warning: After pooling, some bins still have low expected counts.
The chi-square approximation may not be accurate
Have I defined the chi2gof call incorrectly?
At 36 values, you have a very small sample set. From the second sentence of Wikipedia's article on the chi-squared test (emphasis added):
It is suitable for unpaired data from large samples.
Large in this case usually means at least around 100. Read about more assumptions of this test here.
Alternatives
You might try kstest in Matlab, which is based on the Kolmogorov-Smirnov test:
[h,p] = kstest(V,'cdf',[V(:) expcdf(V(:),expfit(V))])
Or try lillietest, which is based on the Lilliefors test and has an option specifically for exponential distributed data:
[h,p] = lillietest(V,'Distribution','exp')
In case you can increase your sample size, you are doing one thing wrong with chi2gof. From the help for the 'cdf' option:
A fully specified cumulative distribution function. This
can be a ProbabilityDistribution object, a function
handle, or a function name. The function must take X
values as its only argument. Alternately, you may provide
a cell array whose first element is a function name or
handle, and whose later elements are parameter values,
one per cell. The function must take X values as its
first argument, and other parameters as later arguments.
You're not supplying any additional parameters, so expcdf is using the default mean parameter of mu = 1. Your data values are very large and don't correspond at all to an exponential distribution with this mean. You need to estimate the parameters as well. Using the expfit function, which is based on maximum likelihood estimation, you might try something like this (the mean is fitted to the data V itself, because the cdf function is evaluated at the bin edges rather than at the data points):
[h,p] = chi2gof(V,'cdf',@(x)expcdf(x,expfit(V)),'nparams',1)
However, with only 36 samples you may not get a very good estimate for a distribution like this and still may not get expected results even for data sampled from a known distribution, e.g.:
V = exprnd(10,1,36);
[h,p] = chi2gof(V,'cdf',@(x)expcdf(x,expfit(V)),'nparams',1)
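To see where the low-expected-count warning comes from, here is a rough sketch that computes expected bin counts by hand for an exponential fit, using only base MATLAB. It uses just the 15 values shown in the question (the full vector is elided there), the bin edges are an arbitrary choice for illustration, and the mean is estimated as mean(V), which is the maximum likelihood estimate for an exponential distribution. It is not a reimplementation of chi2gof.

```matlab
V = [31 52 38 29 29 34 29 24 25 25 32 28 24 28 29];  % values shown in the question
mu = mean(V);                       % MLE of the exponential mean
edges = [0 20 30 40 Inf];           % hypothetical bin edges
F = 1 - exp(-edges / mu);           % exponential CDF at the edges (F at Inf is 1)
expected = numel(V) * diff(F);      % expected count per bin under the fit

% Observed count per bin (left edge included, right edge excluded)
observed = zeros(1, numel(edges) - 1);
for k = 1:numel(observed)
    observed(k) = sum(V >= edges(k) & V < edges(k+1));
end

% Pearson chi-square statistic for these bins
chi2stat = sum((observed - expected).^2 ./ expected);
```

With so few samples, several bins end up with an expected count below 5, which is the usual rule of thumb behind the warning.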

Partitioning a number into a number of almost equal partitions

I would like to partition a number into an almost equal number of values in each partition. The only criterion is that each partition must be between 60 and 80.
For example, if I have a value = 300, this means that 75 * 4 = 300.
I would like to know a method to get this 4 and 75 in the above example. In some cases, all partitions don't need to be of equal value, but they should be in between 60 and 80. Any constraints can be used (addition, subtraction, etc..). However, the outputs must not be floating point.
Also it's not that the total must be exactly 300 as in this case, but they can be up to a maximum of +40 of the total, and so for the case of 300, the numbers can sum up to 340 if required.
Assuming only addition, you can formulate this problem as a linear programming problem. You would choose an objective function that maximizes the sum of all of the factors chosen to generate that number for you. Therefore, your objective function would be:
max sum_{i=1}^n x_i
In this case, n would be the number of factors you are using to try and decompose your number into. Each x_i is a particular factor in the overall sum of the value you want to decompose. I'm also going to assume that none of the factors can be floating point, and can only be integer. As such, you need to use a special case of linear programming called integer programming where the constraints and the actual solution to your problem are all in integers. In general, the integer programming problem is formulated thusly:
min_x f'*x subject to x(intcon) integer, A*x <= b, Aeq*x = beq, lb <= x <= ub
You are actually trying to minimize this objective function, such that you produce a parameter vector of x that are subject to all of these constraints. In our case, x would be a vector of numbers where each element forms part of the sum to the value you are trying to decompose (300 in your case).
You have inequalities, equalities and also boundaries of x that each parameter in your solution must respect. You also need to make sure that each parameter of x is an integer. As such, MATLAB has a function called intlinprog that will perform this for you. However, this function assumes that you are minimizing the objective function, and so if you want to maximize, simply minimize on the negative. f is a vector of weights to be applied to each value in your parameter vector, and with our objective function, you just need to set all of these to -1.
Therefore, to formulate your problem in an integer programming framework, you are actually doing:
min_x -sum_{i=1}^n x_i subject to sum_{i=1}^n x_i = V, 60 <= x_i <= 80, x_i integer
V would be the value you are trying to decompose (so 300 in your example).
The standard way to call intlinprog is in the following way:
x = intlinprog(f,intcon,A,b,Aeq,beq,lb,ub);
f is the vector that weights each parameter of the solution you want to solve, and intcon denotes which of your parameters need to be integer. In this case, you want all of them to be integer, so you would supply an increasing vector from 1 to n, where n is the number of factors you want to decompose the number V into (same as before). A and b are the matrix and vector that define your inequality constraints. Because you only have an equality constraint, you'd set both of these to empty ([]). Aeq and beq are the same as A and b, but for equality. Because you only have one constraint here, you would simply create a matrix of 1 row, where each value is set to 1. beq would be a single value which denotes the number you are trying to factorize. lb and ub are the lower and upper bounds for each value in the parameter set, so these would be 60 and 80 respectively, and you'd have to specify a vector for each to ensure that every parameter is bounded between these two values.
Now, because you don't know how many factors will evenly decompose your value, you'll have to loop over a given set of factors (like between 1 to 10, or 1 to 20, etc.), place your results in a cell array, then you have to manually examine yourself whether or not an integer decomposition was successful.
num_factors = 20; %// Number of factors to try and decompose your value
V = 300;
results = cell(1, num_factors);
%// Try to solve the problem for a number of different factors
for n = 1 : num_factors
x = intlinprog(-ones(n,1),1:n,[],[],ones(1,n),V,60*ones(n,1),80*ones(n,1));
results{n} = x;
end
You can then go through results and see which value of n was successful in decomposing your number into that said number of factors.
One small problem here is that we also don't know how many factors we should check up to. That unfortunately I don't have an answer to, and so you'll have to play with this value until you get good results. This is also an unconstrained parameter, and I'll talk about this more later in this post.
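One observation that may narrow the search, offered as a sketch and not as part of the method above: since every x_i is bounded between 60 and 80 and the sum is fixed at V, a given n can only be feasible when 60*n <= V <= 80*n. The snippet below computes that range for V = 300 and builds one even integer split for the smallest feasible n. The variable names are hypothetical, and it ignores the +40 tolerance mentioned in the question.

```matlab
V  = 300;
lo = 60;  hi = 80;                 % bounds on each partition
n_min = ceil(V / hi);              % fewest partitions that can reach V
n_max = floor(V / lo);             % most partitions that stay at or below V
candidates = n_min:n_max;          % empty if no n is feasible for this V

% Build one integer split for the smallest feasible n (assumes candidates
% is non-empty): spread V evenly, give the remainder to the first parts
n = candidates(1);
x = floor(V / n) * ones(n, 1);
r = V - sum(x);
x(1:r) = x(1:r) + 1;
```

For V = 300 this leaves only n = 4 and n = 5 to try, which is far fewer than looping up to 20.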
However, intlinprog was only released in recent versions of MATLAB. If you want to do the same thing without it, you can use linprog, which is the floating point version of integer programming... actually, it's just the core linear programming framework itself. You would call linprog this way:
x = linprog(f,A,b,Aeq,beq,lb,ub);
All of the variables are the same, except that intcon is not used here... which makes sense as linprog may generate floating point numbers as part of its solution. Because linprog can generate floating point solutions, you need to check whether the solution for a given value of n is entirely integer. To do that, you could loop over your results, take the floor of each solution, subtract it from the solution itself, and sum the absolute differences. If you get a value of 0, this means that you had a completely integer result. Therefore, you'd have to do something like:
num_factors = 20; %// Number of factors to try and decompose your value
V = 300;
results = cell(1, num_factors);
%// Try to solve the problem for a number of different factors
for n = 1 : num_factors
x = linprog(-ones(n,1),[],[],ones(1,n),V,60*ones(n,1),80*ones(n,1));
results{n} = x;
end
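%// Loop through and determine which decompositions were successful integer ones
%// (an empty x means linprog found no feasible solution, so flag it as unsuccessful)
out = cellfun(@(x) isempty(x) || sum(abs(floor(x) - x)) > 0, results);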
%// Determine which values of n were successful in the integer composition.
final_factors = find(~out);
final_factors will contain which number of factors you specified that was successful in an integer decomposition. Now, if final_factors is empty, this means that it wasn't successful in finding anything that would be able to decompose the value into integer factors. Noting your problem description, you said you can allow for tolerances, so perhaps scan through results and determine which overall sum best matches the value, then choose whatever number of factors that gave you that result as the final answer.
Now, noting from my comments, you'll see that this problem is very unconstrained. You don't know how many factors are required to get an integer decomposition of your value, which is why we had to semi-brute-force it. In fact, this is closely related to the subset sum problem, which is NP-complete. Basically, what this means is that no polynomial-time algorithm is known for this kind of problem, and the known exact approaches amount to checking candidate solutions one by one in the worst case. Brute-forcing solutions usually requires exponential time, which is intractable for large problems. Another interesting fact is that modern cryptography relies on a similar kind of intractability. Basically, it banks on the fact that the only practical way for you to determine the right key that was used to encrypt your plain text is to check all possible keys, which is an intractable problem... especially if you use 128-bit encryption! This means you would have to check up to 2^128 possibilities, and even assuming a moderately fast computer, the worst-case time to find the right key is more than the current age of the universe. Check out this cool Wikipedia post for more details on intractability with regards to key breaking in cryptography.
In fact, NP-complete problems are very popular and there have been many attempts to determine whether there is or there isn't a polynomial-time algorithm to solve such problems. An interesting property is that if you can find a polynomial-time algorithm that solves one NP-complete problem, you will have found an algorithm to solve them all.
The Clay Mathematics Institute has what are known as Millennium Problems where if you solve any problem listed on their website, you get a million dollars.
Also, that's for each problem, so one problem solved == 1 million dollars!
The P versus NP problem is one of the seven problems up for solving. If I recall correctly, only one problem has been solved so far, and these problems were first released to the public in the year 2000 (hence millennium...). So... it has been about 14 years and only one problem has been solved. Don't let that discourage you though! If you want to invest some time and try to solve one of the problems, please do!
Hopefully this will be enough to get you started. Good luck!

ILMath.Vec() Appears to Be Generating Slightly Wrong Output (Potential Bug?)

I'm comparing ILMath.Vec() with MatLab's and I'm seeing significant rounding errors.
For example, if I create a vector (using Start:0, Step:1.2635048525911006, End:20700) for each system:
MatLab: [Start:Step:End]
ILNumerics: Vec<double>(Start, Step, End)
And then take the average abs difference, I get an average error of 1.56019608343883E-09. However, if I create the vector by hand (using multiplication), I get an average error of only 3.10973469197506E-13, which is about four orders of magnitude smaller.
After looking at ILNumerics' vec function (using Reflector), I think I know why the average error value is so large. The ILMath.vec() function is using addition vs. multiplication. Summing the step value 16,384 times is not the same thing as multiplying the step value x N (where N is the current loop count) 16,384 times! The addition's rounding errors add up very quickly!
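The effect can be reproduced outside ILNumerics. The sketch below is plain MATLAB, not the ILMath.vec() source: it builds the same 16,384-element vector once by repeated addition and once by multiplying the step by the element index, then measures how far the two drift apart.

```matlab
step = 1.2635048525911006;
N    = 16384;                      % number of elements, as in the report

v_add = zeros(1, N);               % built by repeated addition
for k = 2:N
    v_add(k) = v_add(k-1) + step;  % each addition can round, and errors carry forward
end

v_mul = (0:N-1) * step;            % built by multiplication: one rounding per element

mean_diff = mean(abs(v_add - v_mul));
```

The multiplicative form rounds each element once, independently of the others, whereas the additive form inherits the rounding of every previous element.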
Please consider fixing this issue.