MATLAB Multinomial Logistic Regression Inputs - matlab

This is my first time attempting to use multinomial logistic regression, and I'm having a hard time getting started. I currently have a dataset of 203 observations with 22 independent variables and 1 dependent variable, all of which are numerical and continuous. My goal is to use MATLAB's mnrfit function to predict the probabilities of future observations having a dependent variable that falls into one of three intervals (y<0, 0<y<5, and 5<y).
How would I input my data into the mnrfit function to get these results? I believe that I would have to use this function to get the coefficients and then use the mnrval function to determine the probabilities for future observations. Thanks for the help!

According to the documentation at http://se.mathworks.com/help/stats/mnrfit.html
it seems all you have to do is turn your Y variable into an integer array of category labels, something like
Yord = (Y>0) + (Y>5) + 1
then call B = mnrfit(X, Yord)
where X is the matrix of predictors/features.
Reshape B in the way suggested in the example on the link above and finally call
mnrval(B, X) to get the probabilities of being less than zero, between zero and five, or above five.
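To make the labelling step concrete, here is a minimal Python sketch of the same binning idea (it is not MATLAB code, and bin_response is a hypothetical helper name). It shows that the expression yields labels 1, 2 and 3, and that values exactly equal to 0 or 5 land in the lower class, which the intervals in the question leave unassigned.

```python
def bin_response(y_values):
    """Map each continuous y to a class label 1, 2 or 3.

    Mirrors Yord = (Y>0) + (Y>5) + 1: class 1 is y <= 0,
    class 2 is 0 < y <= 5, class 3 is y > 5.
    """
    return [int(y > 0) + int(y > 5) + 1 for y in y_values]


# Made-up sample values, including the two boundary cases 0 and 5.
labels = bin_response([-2.5, 0.0, 3.1, 5.0, 8.4])
print(labels)
```

If the boundary values should go to the upper class instead, swap > for >= in the comparisons.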

Related

Matlab's Arburg (Autoregression Burg's method) for forecasting time series

Matlab's arburg function returns a vector of coefficients of the form [1 c(1) c(2) ... c(p)], where p is the model's order. But these are not the coefficients for forecasting; instead they are used with a random input vector to simulate a stochastic AR process. Without forecasting anything on test data, how can I compute the model's error to calculate, say, the AIC? Is there a categorical difference between AR models like this and those used for forecasting?
So I have found that yes, we can indeed use those coefficients (except the first one, which is always 1) to forecast the next time point. To use the coefficients, first remove the first one, then flip and negate the array. The absolute error of the one-step forecast at time t can be calculated like this (averaging it over all t gives the mean absolute error):
coeffs = -flip(coeffs(2:end));
err = abs(time_series(t) - coeffs*time_series(t-length(coeffs):t-1));
where * is matrix multiplication, assuming coeffs is a row vector and time_series is a column vector. The result is named err because error is a built-in MATLAB function.
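The same flip-and-negate recipe can be sketched in plain Python to show how the error is averaged over every time point that has enough history. This is only an illustration: forecast_weights and mean_abs_error are hypothetical helper names, and the coefficients and series are made-up numbers, not arburg output.

```python
def forecast_weights(arburg_coeffs):
    """Drop the leading 1, then flip and negate (like -flip(coeffs(2:end)))."""
    return [-c for c in reversed(arburg_coeffs[1:])]


def mean_abs_error(series, arburg_coeffs):
    """Mean absolute one-step-ahead forecast error over the series."""
    weights = forecast_weights(arburg_coeffs)
    p = len(weights)
    errors = []
    for t in range(p, len(series)):
        window = series[t - p:t]  # oldest value first, newest last
        prediction = sum(w * x for w, x in zip(weights, window))
        errors.append(abs(series[t] - prediction))
    return sum(errors) / len(errors)


# Hypothetical order-2 coefficients and a toy series, for illustration only.
coeffs = [1.0, -0.5, 0.25]
series = [1.0, 2.0, 3.0, 4.0]
print(forecast_weights(coeffs))
print(mean_abs_error(series, coeffs))
```

The window is ordered oldest to newest, which is why the coefficients have to be flipped before taking the dot product.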

Covariance of predictions made from mgcv::gam

Suppose I use mgcv's gam to predict with some newdata, producing two outputs (say g1, g2). How can I ask predict.gam to return the covariance of the two predictions?
I would like to compute the confidence interval of f(g1,g2) for some f. Assuming the two predictions follow a bivariate normal distribution, I could use the delta method to compute the confidence interval.
Or is there an alternative method to compute this?
I am not really sure what you are asking, but if you need to calculate the covariance of any two data sets g1, g2 (regardless of whether they are the output of predict.gam), you can simply use the function cov() after calling predict.gam, i.e., cov(g1,g2).
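For the delta-method route mentioned in the question, a minimal Python sketch of the arithmetic might look like this. It assumes the 2x2 covariance matrix of the two predictions is already known (obtaining it from mgcv is not shown here), and delta_method_interval and all numbers are hypothetical.

```python
import math


def delta_method_interval(f_value, gradient, covariance, z=1.96):
    """Approximate interval for f(g1, g2) via the delta method.

    gradient   : [df/dg1, df/dg2] evaluated at the predictions
    covariance : 2x2 covariance matrix of the two predictions
                 (assumed to be available; obtaining it is not shown here)
    """
    variance = sum(
        gradient[i] * covariance[i][j] * gradient[j]
        for i in range(2)
        for j in range(2)
    )
    half_width = z * math.sqrt(variance)
    return f_value - half_width, f_value + half_width, variance


# Hypothetical numbers: f(g1, g2) = g1 + g2, so the gradient is [1, 1].
g1, g2 = 2.0, 3.0
cov = [[4.0, 1.0], [1.0, 9.0]]
low, high, var = delta_method_interval(g1 + g2, [1.0, 1.0], cov)
print(var)
```

For a different f, only the gradient changes; for example f = g1*g2 would use [g2, g1].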

For loop equation into Octave / Matlab code

I'm using Octave 3.8.1, which works like Matlab.
I have an array of thousands of values; I've only included three groupings as an example below:
(amp1=0.2; freq1=3; phase1=1; is an example of one grouping)
t=0;
amp1=0.2; freq1=3; phase1=1; %1st grouping
amp2=1.4; freq2=2; phase2=1.7; %2nd grouping
amp3=0.8; freq3=5; phase3=1.5; %3rd grouping
The Octave / Matlab code below solves for Y so I can plug it back into the equation to check values along with calculating values not located in the array.
clear all
t=0;
Y=0;
% each row is one grouping: [amp, freq, phase]
a1=[.2,3,1;1.4,2,1.7;.8,5,1.5];
for kk=1:size(a1,1) % loop over the rows (groupings), not length(a1)
Y=Y+a1(kk,1)*cos(a1(kk,2)*t+a1(kk,3));
end
Y
PS: I'm not trying to solve for Y, since it's already solved for; I'm trying to solve for the phase.
The formulas below are used to calculate the phase, but I'm not sure how to put them into a for loop that will work on an array of n groupings.
How would I write the equation / for loop for finding the phase if, for example, freq=2.5 and amp=.23 are given and the phase is unknown? I've looked online, and it may require writing non-linear equations, but I'm not sure how to convert what I'm trying to do into such an equation.
phase1_test=acos(Y/amp1-amp3*cos(2*freq3*pi*t+phase3)/amp1-amp2*cos(2*freq2*pi*t+phase2)/amp1)-2*freq1*pi*t
phase2_test=acos(Y/amp2-amp3*cos(2*freq3*pi*t+phase3)/amp2-amp1*cos(2*freq1*pi*t+phase1)/amp2)-2*freq2*pi*t
phase3_test=acos(Y/amp3-amp2*cos(2*freq2*pi*t+phase2)/amp3-amp1*cos(2*freq1*pi*t+phase1)/amp3)-2*freq3*pi*t
Image of formula below:
I would like to do a check / calculate phases if given a freq and amp values.
I know I have to do a for loop but how do I convert the phase equation into a for loop so it will work on n groupings in an array and calculate different values not found in the array?
Basically I would be given an array of n groupings and freq=2.5 and amp=.23 and use the formula to calculate phase. Note: freq will not always be in the array hence why I'm trying to calculate the phase using a formula.
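The three phase formulas above follow one pattern, so they can be written as a single loop over the groupings. Here is a Python sketch of that pattern only (not Octave code; signal and recover_phase are hypothetical names). It assumes the 2*pi*freq*t convention used in the formulas, and it only recovers a phase when 2*pi*freq*t + phase lies in [0, pi], because acos returns the principal value.

```python
import math


def signal(groupings, t):
    """Y(t) = sum of amp * cos(2*pi*freq*t + phase) over all groupings."""
    return sum(a * math.cos(2 * math.pi * f * t + p) for a, f, p in groupings)


def recover_phase(groupings, k, y, t):
    """Solve for the phase of grouping k, given Y and all other groupings."""
    amp_k, freq_k, _ = groupings[k]
    others = sum(
        a * math.cos(2 * math.pi * f * t + p)
        for j, (a, f, p) in enumerate(groupings)
        if j != k
    )
    ratio = (y - others) / amp_k
    ratio = max(-1.0, min(1.0, ratio))  # guard against rounding outside [-1, 1]
    return math.acos(ratio) - 2 * math.pi * freq_k * t


groupings = [(0.2, 3, 1.0), (1.4, 2, 1.7), (0.8, 5, 1.5)]  # (amp, freq, phase)
t = 0.0
y = signal(groupings, t)
for k in range(len(groupings)):
    print(k + 1, recover_phase(groupings, k, y, t))
```

With the three example groupings and t=0, each recovered phase should match the value stored in the array up to rounding.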
Ok, I think I finally understand your question:
you are trying to find a set of phase1, phase2,..., phaseN, such that equations like the ones you describe are satisfied
You know how to find y, and you supply values for freq and amp.
In Matlab, such a problem would be solved using, for example fsolve, but let's look at your problem step by step.
For simplicity, let me re-write your equations for phase1, phase2, and phase3. For example, your first equation, the one for phase1, would read
amp1*cos(2*freq1*pi*t + phase1) + amp2*cos(2*freq2*pi*t + phase2) + amp3*cos(2*freq3*pi*t + phase3) - y = 0
Note that ampX (X is a placeholder for 1, 2, 3) are given, pi is a constant, t is given via Y (I think), freqX are given.
Hence, you are, in fact, dealing with a non-linear vector equation of the form
F(phase) = 0
where F is a multi-dimensional (vector) function taking a multi-dimensional (vector) input variable phase (comprised of phase1, phase2,..., phaseN). And you are looking for the set of phaseX, where all of the components of your vector function F are zero. N.B. F is a shorthand for your functions. Therefore, the first component of F, called f1, for example, is
f1 = amp1*cos(phase1+...) + amp2*cos(phase2+...) + amp3*cos(phase3+...) - y = 0.
Hence, f1 is a one-dimensional function of phase1, phase2, and phase3.
The technical term for what you are trying to do is find a zero of a non-linear vector function, or find a solution of a non-linear vector function. In Matlab, there are different approaches.
For a one-dimensional function, you can use fzero, which is explained at http://www.mathworks.com/help/matlab/ref/fzero.html?refresh=true
For a multi-dimensional (vector) function such as yours, I would look into using fsolve, which is part of Matlab's Optimization Toolbox (which means I don't know how to do this in Octave). The function fsolve is explained at http://www.mathworks.com/help/optim/ug/fsolve.html
If you know an approximate solution for your phases, you may also look into iterative, local methods.
In particular, I would recommend you look into Newton's method, which allows you to find a solution to your system of equations F. Wikipedia has a good explanation of Newton's method at https://en.wikipedia.org/wiki/Newton%27s_method . Newton iterations are very simple to implement and you should find a lot of resources online. You will have to compute the derivative of your function F with respect to each of your variables phaseX, which is easy since you're only dealing with cos() functions. For starters, have a look at the one-dimensional Newton iteration method in Matlab at http://www.math.colostate.edu/~gerhard/classes/331/lab/newton.html .
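To give an idea of how short a Newton iteration is, here is a one-dimensional Python sketch (not Matlab, and not the full vector problem). It assumes a single unknown phase and a known target value, i.e. y minus the contributions of all other groupings. newton_phase and its parameters are hypothetical names, and the starting guess has to be reasonably close to the solution.

```python
import math


def newton_phase(amp, freq, t, target, phase0, tol=1e-12, max_iter=50):
    """Find phase with amp*cos(2*pi*freq*t + phase) = target by Newton's method."""
    phase = phase0
    for _ in range(max_iter):
        angle = 2 * math.pi * freq * t + phase
        f = amp * math.cos(angle) - target   # function value
        df = -amp * math.sin(angle)          # derivative with respect to phase
        if df == 0.0:
            raise ZeroDivisionError("derivative vanished; pick another start")
        step = f / df
        phase -= step
        if abs(step) < tol:
            break
    return phase


# Toy check: the target is generated from a known phase of 1.0.
amp, freq, t = 0.2, 3, 0.0
target = amp * math.cos(2 * math.pi * freq * t + 1.0)
print(newton_phase(amp, freq, t, target, phase0=0.8))
```

The vector version replaces the scalar derivative with the Jacobian of F and the division with a linear solve.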
Finally, if you want to dig deeper, I found a textbook on this topic from the society for industrial and applied math: https://www.siam.org/books/textbooks/fr16_book.pdf .
As you can see, this is a very large field; Newton's method should be able to help you out, though.
Good luck!

Normalize in Adaboost without numerical error - Matlab

I'm implementing AdaBoost on Matlab. This algorithm requires that in every iteration the weights of each data point in the training set sum up to one.
If I simply use the normalization v = v / sum(v), I get a vector whose 1-norm is 1 up to some numerical error, which later leads to the failure of the algorithm.
Is there a Matlab function for normalizing a vector so that its 1-norm is EXACTLY 1?
Assuming you want identical values to be normalised with the same factor, this is not possible. Simple counter example:
v=ones(21,1);
v = v / sum(v);
sum(v)-1
One common way to deal with it is to enforce sum(v)>=1 or sum(v)<=1, if your algorithm can deal with a deviation to one side:
if sum(v)>1
v=v-eps(v)
end
Alternatively you can try using vpa, but this will drastically increase your computation time.
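The one-sided approach can be sketched in Python as follows (an illustration of the idea, not Matlab code). normalize_not_above_one is a hypothetical name, math.nextafter requires Python 3.9 or newer, and the guarantee only holds for the sum as computed by sum() in this order.

```python
import math


def normalize_not_above_one(weights):
    """Normalize to sum 1, then nudge down until the float sum is <= 1."""
    total = sum(weights)
    v = [w / total for w in weights]
    while sum(v) > 1.0:
        # Move every entry one representable step toward zero, so that
        # identical inputs still end up with identical normalized values.
        v = [math.nextafter(x, 0.0) for x in v]
    return v


v = normalize_not_above_one([1.0] * 21)
print(sum(v) <= 1.0)
```

The loop usually runs zero or a few times, since each pass lowers the sum by roughly one unit of rounding error.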

Simple binary logistic regression using MATLAB

I'm working on doing a logistic regression using MATLAB for a simple classification problem. My covariate is one continuous variable ranging between 0 and 1, while my categorical response is a binary variable of 0 (incorrect) or 1 (correct).
I'm looking to run a logistic regression to establish a predictor that would output the probability of some input observation (e.g. the continuous variable as described above) being correct or incorrect. Although this is a fairly simple scenario, I'm having some trouble running this in MATLAB.
My approach is as follows: I have one column vector X that contains the values of the continuous variable, and another equally-sized column vector Y that contains the known classification of each value of X (e.g. 0 or 1). I'm using the following code:
[b,dev,stats] = glmfit(X,Y,'binomial','link','logit');
However, this gives me nonsensical results with a p = 1.000, coefficients (b) that are extremely high (-650.5, 1320.1), and associated standard error values on the order of 1e6.
I then tried using an additional parameter to specify the size of my binomial sample:
glm = GeneralizedLinearModel.fit(X,Y,'distr','binomial','BinomialSize',size(Y,1));
This gave me results that were more in line with what I expected. I extracted the coefficients, used glmval to create estimates (Y_fit = glmval(b,[0:0.01:1],'logit');), and created an array for the fitting (X_fit = linspace(0,1)). When I overlaid the plots of the original data and the model using figure, plot(X,Y,'o',X_fit,Y_fit,'-'), the resulting plot of the model essentially looked like the lower quarter of the 'S'-shaped curve that is typical of logistic regression plots.
My questions are as follows:
1) Why did my use of glmfit give strange results?
2) How should I go about addressing my initial question: given some input value, what's the probability that its classification is correct?
3) How do I get confidence intervals for my model parameters? glmval should be able to input the stats output from glmfit, but my use of glmfit is not giving correct results.
Any comments and input would be very useful, thanks!
UPDATE (3/18/14)
I found that mnrval seems to give reasonable results. I can use [b_fit,dev,stats] = mnrfit(X,Y+1); where Y+1 simply makes my binary classifier into a nominal one.
I can loop through [pihat,lower,upper] = mnrval(b_fit,loopVal(ii),stats); to get various pihat probability values, where loopVal = linspace(0,1) or some appropriate input range and ii = 1:length(loopVal).
The stats parameter has a great correlation coefficient (0.9973), but the p values for b_fit are 0.0847 and 0.0845, which I'm not quite sure how to interpret. Any thoughts? Also, why would mnrfit work over glmfit in my example? I should note that the p-values for the coefficients when using GeneralizedLinearModel.fit were both p<<0.001, and the coefficient estimates were quite different as well.
Finally, how does one interpret the dev output from the mnrfit function? The MATLAB document states that it is "the deviance of the fit at the solution vector. The deviance is a generalization of the residual sum of squares." Is this useful as a stand-alone value, or is this only compared to dev values from other models?
It sounds like your data may be linearly separable. In short, since your input data is one-dimensional, that means there is some value xDiv such that all values of x < xDiv belong to one class (say y = 0) and all values of x > xDiv belong to the other class (y = 1).
If your data were two-dimensional this means you could draw a line through your two-dimensional space X such that all instances of a particular class are on one side of the line.
This is bad news for logistic regression (LR) as LR isn't really meant to deal with problems where the data are linearly separable.
Logistic regression is trying to fit a function of the following form (with weights b0 and b1):
y = 1 / (1 + exp(-(b0 + b1*x)))
This will only return values of y = 0 or y = 1 when the expression within the exponential in the denominator is at negative infinity or infinity.
Now, because your data is linearly separable, and Matlab's LR function attempts to find a maximum likelihood fit for the data, you will get extreme weight values.
This isn't necessarily a solution, but try flipping the labels on just one of your data points (so for some index t where y(t) == 0 set y(t) = 1). This will cause your data to no longer be linearly separable and the learned weight values will be dragged dramatically closer to zero.
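The effect can be seen in a toy Python sketch (not Matlab's fitting routine, which uses a different algorithm). It assumes a simplified model with no intercept and made-up data centered on the dividing value, fitted by plain gradient ascent on the mean log-likelihood. fit_slope is a hypothetical helper name.

```python
import math


def fit_slope(xs, ys, iterations, learning_rate=1.0):
    """Gradient ascent on the mean log-likelihood of p = sigmoid(w * x)."""
    w = 0.0
    for _ in range(iterations):
        grad = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-w * x))
            grad += (y - p) * x
        w += learning_rate * grad / len(xs)
    return w


xs = [-0.8, -0.6, -0.4, -0.2, 0.2, 0.4, 0.6, 0.8]
separable = [0, 0, 0, 0, 1, 1, 1, 1]  # perfectly split at x = 0
flipped = [0, 0, 0, 0, 1, 1, 0, 1]    # label at x = 0.6 flipped

for n in (2000, 4000):
    print(n, fit_slope(xs, separable, n), fit_slope(xs, flipped, n))
```

On the separable labels the weight keeps growing the longer the fit runs, while with one flipped label it settles at a finite value.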