I need to use SVMs for regression.
I have y: a 261x1 vector and x: a 261x10 vector.
I would like to calculate 10 weights such that the weighted 10 values of x at each of the 261 data points mimic the y value.
However, when I run this using the libsvm package, I am getting 261 weights and not the 10 I want.
From my understanding, libsvm requires the x and y vector to be the same length and hence inputting the transpose of x and y will not work.
(Note: this is a portfolio optimization problem and 261 is the number of days, and 10 is the number of stocks)
I could not understand what 'weights' means but I suggest you to use libsvmwrite function to write your labels and feature vectors in the required format. and use libsvmread method to get the formatted data to pass as an input.
Related
I plotted 5 fold cross-validation data as a cell array to perfcurve function with positive class=1. Then it generated 3 curves as you can see in the diagram. I was expecting only one curve.
[X,Y,T,AUC,OPTROCPT,SUBY,SUBYNAMES] = perfcurve(Actual_label,Score,1);
plot(X,Y)
Here, Actual_label and Score are a cell array of size 5 X 1. Each cell array is of size 70 X 1. And 1 denotes positive class=1.
P.S: I am using One-class SVM and 'fitSVMPosterior' function is not appropriate for one-class learning (same has been mentioned in the documentation of MATLAB). Therefore posterior probability can't be used here.
When you compute the confidence bounds, X and Y are an m-by-3 array, where m is the number of fixed X values or thresholds (T values). The first column of Y contains the mean value. The second and third columns contain the lower bound and the upper bound, respectively, of the pointwise confidence bounds. AUC is also a row vector with three elements, following the same convention.
Above explanation is taken from MATLAB documentation.
That is expected because you are plotting the ROC curve for each of the 5 folds.
Now if you want to have only one ROC for your classifier, you can either use the 5 trained classifiers to predict the labels of an independent test set or you can average the posterior probabilities of the 5 folds and have one ROC.
I have got a matrix of AirFuelRatio values at certain engine speeds and throttlepositions. (eg. the AFR is 14 at 2500rpm and 60% throttle)
The matrix is now 25x10, and the engine speed ranges from 1200-6000rpm with interval 200rpm, the throttle range from 0.1-1 with interval 0.1.
Say i have measured new values, eg. an AFR of 13.5 at 2138rpm and 74,3% throttle, how do i merge that in the matrix? The matrix closest values are 2000 or 2200rpm and 70 or 80% throttle. Also i don't want new data to replace the older data. How can i make the matrix take this value in and adjust its values to take the new value in account?
Simplified i have the following x-axis values(top row) and 1x4 matrix(below):
2 4 6 8
14 16 18 20
I just measured an AFR value of 15.5 at 3 rpm. If you interpolate the AFR matrix you would've gotten a 15, so this value is out of the ordinary.
I want the matrix to take this data and adjust the other variables to it, ie. average everything so that the more data i put in the more reliable and accurate the matrix becomes. So in the simplified case the matrix would become something like:
2 4 6 8
14.3 16.3 18.2 20.1
So it averages between old and new data. I've read the documentation about concatenation but i believe my problem can't be solved with that function.
EDIT: To clarify my question, the following visual clarification.
The 'matrix' keeps the same size of 5 points whil a new data point is added. It takes the new data in account and adjusts the matrix accordingly. This is what i'm trying to achieve. The more scatterd data i get, the more accurate the matrix becomes. (and yes the green dot in this case would be an outlier, but it explains my case)
Cheers
This is not a matter of simple merge/average. I don't think there's a quick method to do this unless you have simplifying assumptions. What you want is a statistical inference of the underlying trend. I suggest using Gaussian process regression to solve this problem. There's a great MATLAB toolbox by Rasmussen and Williams called GPML. http://www.gaussianprocess.org/gpml/
This sounds more like a data fitting task to me. What you are suggesting is that you have a set of measurements for which you wish to get the best linear fit. Instead of producing a table of data, what you need is a table of values, and then find the best fit to those values. So, for example, I could create a matrix, A, which has all of the recorded values. Let's start with:
A=[2,14;3,15.5;4,16;6,18;8,20];
I now need a matrix of points for the inputs to my fitting curve (which, in this instance, lets assume it is linear, so is the set of values 1 and x)
B=[ones(size(A,1),1), A(:,1)];
We can find the linear fit parameters (where it cuts the y-axis and the gradient) using:
B\A(:,2)
Or, if you want the points that the line goes through for the values of x:
B*(B\A(:,2))
This results in the points:
2,14.1897 3,15.1552 4,16.1207 6,18.0517 8,19.9828
which represents the best fit line through these points.
You can manually extend this to polynomial fitting if you want, or you can use the Matlab function polyfit. To manually extend the process you should use a revised B matrix. You can also produce only a specified set of points in the last line. The complete code would then be:
% Original measurements - could be read in from a file,
% but for this example we will set it to a matrix
% Note that not all tabulated values need to be present
A=[2,14; 3,15.5; 4,16; 5,17; 8,20];
% Now create the polynomial values of x corresponding to
% the data points. Choosing a second order polynomial...
B=[ones(size(A,1),1), A(:,1), A(:,1).^2];
% Find the polynomial coefficients for the best fit curve
coeffs=B\A(:,2);
% Now generate a table of values at specific points
% First define the x-values
tabinds = 2:2:8;
% Then generate the polynomial values of x
tabpolys=[ones(length(tabinds),1), tabinds', (tabinds').^2];
% Finally, multiply by the coefficients found
curve_table = [tabinds', tabpolys*coeffs];
% and display the results
disp(curve_table);
This is my first time attempting to use multinomial logistic regression, and I'm having a hard time getting started. I currently have a dataset of 203 observations with 22 independent variables and 1 dependent variable, all of which are numerical and continuous. My goal is to use MATLAB mnrfit function to predict the probabilities of future observations having a dependent variable falling into one of three intervals (y<0, 0<y<5, and 5<y).
How would I input my data into the mnrfit function to get these results? I believe that I would have to use this function to get the coefficients and then use the mnrval function to determine the probabilities for future observations. Thanks for the help!
Given http://se.mathworks.com/help/stats/mnrfit.html
It seems all you have to do is turn your Y variable to an integer array, something like
say Yord = (Y>0) + (Y>5) + 1
then call B = mnrfit(X, Yord)
where X is the matrix of predictors/features
reshape B in the way suggested in the example on the link above and finally call
mnrval(B, X) to get the probabilites of being less than zero, between zero and five or above zero
I'm working on doing a logistic regression using MATLAB for a simple classification problem. My covariate is one continuous variable ranging between 0 and 1, while my categorical response is a binary variable of 0 (incorrect) or 1 (correct).
I'm looking to run a logistic regression to establish a predictor that would output the probability of some input observation (e.g. the continuous variable as described above) being correct or incorrect. Although this is a fairly simple scenario, I'm having some trouble running this in MATLAB.
My approach is as follows: I have one column vector X that contains the values of the continuous variable, and another equally-sized column vector Y that contains the known classification of each value of X (e.g. 0 or 1). I'm using the following code:
[b,dev,stats] = glmfit(X,Y,'binomial','link','logit');
However, this gives me nonsensical results with a p = 1.000, coefficients (b) that are extremely high (-650.5, 1320.1), and associated standard error values on the order of 1e6.
I then tried using an additional parameter to specify the size of my binomial sample:
glm = GeneralizedLinearModel.fit(X,Y,'distr','binomial','BinomialSize',size(Y,1));
This gave me results that were more in line with what I expected. I extracted the coefficients, used glmval to create estimates (Y_fit = glmval(b,[0:0.01:1],'logit');), and created an array for the fitting (X_fit = linspace(0,1)). When I overlaid the plots of the original data and the model using figure, plot(X,Y,'o',X_fit,Y_fit'-'), the resulting plot of the model essentially looked like the lower 1/4th of the 'S' shaped plot that is typical with logistic regression plots.
My questions are as follows:
1) Why did my use of glmfit give strange results?
2) How should I go about addressing my initial question: given some input value, what's the probability that its classification is correct?
3) How do I get confidence intervals for my model parameters? glmval should be able to input the stats output from glmfit, but my use of glmfit is not giving correct results.
Any comments and input would be very useful, thanks!
UPDATE (3/18/14)
I found that mnrval seems to give reasonable results. I can use [b_fit,dev,stats] = mnrfit(X,Y+1); where Y+1 simply makes my binary classifier into a nominal one.
I can loop through [pihat,lower,upper] = mnrval(b_fit,loopVal(ii),stats); to get various pihat probability values, where loopVal = linspace(0,1) or some appropriate input range and `ii = 1:length(loopVal)'.
The stats parameter has a great correlation coefficient (0.9973), but the p values for b_fit are 0.0847 and 0.0845, which I'm not quite sure how to interpret. Any thoughts? Also, why would mrnfit work over glmfit in my example? I should note that the p-values for the coefficients when using GeneralizedLinearModel.fit were both p<<0.001, and the coefficient estimates were quite different as well.
Finally, how does one interpret the dev output from the mnrfit function? The MATLAB document states that it is "the deviance of the fit at the solution vector. The deviance is a generalization of the residual sum of squares." Is this useful as a stand-alone value, or is this only compared to dev values from other models?
It sounds like your data may be linearly separable. In short, that means since your input data is one dimensional, that there is some value of x such that all values of x < xDiv belong to one class (say y = 0) and all values of x > xDiv belong to the other class (y = 1).
If your data were two-dimensional this means you could draw a line through your two-dimensional space X such that all instances of a particular class are on one side of the line.
This is bad news for logistic regression (LR) as LR isn't really meant to deal with problems where the data are linearly separable.
Logistic regression is trying to fit a function of the following form:
This will only return values of y = 0 or y = 1 when the expression within the exponential in the denominator is at negative infinity or infinity.
Now, because your data is linearly separable, and Matlab's LR function attempts to find a maximum likelihood fit for the data, you will get extreme weight values.
This isn't necessarily a solution, but try flipping the labels on just one of your data points (so for some index t where y(t) == 0 set y(t) = 1). This will cause your data to no longer be linearly separable and the learned weight values will be dragged dramatically closer to zero.
In Matlab, if I am given a time series dataset in which the second column of values is a function of the times of the first column, and I need to integrate over the second column of values, how do I do that without a function?
Why dont you just use trapz function. This is in Octave but should be the same in Matlab which uses the trapezoidal method.
octave-3.6.2.exe:1> x=1:5
x =
1 2 3 4 5
octave-3.6.2.exe:2> y=x.*x
y =
1 4 9 16 25
octave-3.6.2.exe:3> Area=trapz(x,y)
Area = 42
In MATLAB you can use the function
cumtrapz(time,data)
which is the cumulative trapezoidal integration.
The 2 inputs are vectors having the same lenght.
In this way you can for example obtain the velocity integrating the acceleration.
The output is a vector with the same length of the inputs.
EDIT
You can also have a look at what I answered here Numerical integration using Simpson's Rule on discrete data