MATLAB: Using the 'resubstitution' option in sequentialfs

I'm new to MATLAB and trying to use sequentialfs to identify the best feature subsets for fitting a linear regression. I've read through the online documentation but am finding it difficult to understand.
I have a set of training data and would like to use the 'resubstitution' option so that the same data also serves as the test data.
x_train is a matrix with 67 rows and 8 columns.
y_train has 67 rows and 1 column.
Could somebody please check why this code doesn't work?
x_train = std_pros_pred_full;
y_train = pros_resp;
fun_RSS = @RSS_check;
inmodel = sequentialfs(fun_RSS,x_train,y_train,'resubstitution');
The function RSS_check performs a linear regression calculation and outputs the sum of squared errors. It's defined (externally) like this:
function RSS_out = RSS_check(X,Y)
lin1 = X'*X;
lin2 = X'*Y;
lin_coef = (lin1^-1)*lin2;
lin_fit = X*lin_coef;
row_count = size(Y,1);
RSS_out = 0;
for q = 1:row_count
pred_diff = lin_fit(q,1) - Y(q,1);
RSS_out =RSS_out + pred_diff^2;
end
The error message is:
Error using sequentialfs (line 212)
Wrong number of arguments.
When trying different options, I've also had errors concerning the number of inputs and outputs to the function. A lot of examples I've seen reference separate matrices of test data (giving 4 inputs), but I thought that would be unnecessary with the 'resubstitution' option.
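A minimal sketch of one way the call could be written, assuming (from my reading of the sequentialfs documentation) that 'resubstitution' is passed as the value of the 'cv' name-value pair, and that the criterion is still called with four inputs, where the training and test pairs are simply the same data. An unpaired trailing option would plausibly explain the "Wrong number of arguments" message. The synthetic data and the anonymous criterion are hypothetical stand-ins for std_pros_pred_full, pros_resp and RSS_check.

```matlab
% Synthetic stand-in for the 67-by-8 training set (hypothetical data).
rng(0);
x_train = randn(67, 8);
y_train = x_train(:, [2 5]) * [3; -2] + 0.1 * randn(67, 1);

% Four-input criterion: fit on the training pair, return the residual
% sum of squares on the test pair. Backslash replaces the explicit inverse.
fun_RSS = @(XT, yT, Xt, yt) sum((Xt * (XT \ yT) - yt).^2);

% 'resubstitution' is given as the value of the 'cv' parameter.
inmodel = sequentialfs(fun_RSS, x_train, y_train, 'cv', 'resubstitution');
```

Because the training RSS cannot increase when a column is added, this criterion tends to keep adding features under resubstitution; the 'nfeatures' parameter can cap the subset size. As I read the documentation, 'cv','none' instead calls the criterion with two inputs, fun(X,y), which would match the existing RSS_check signature.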

Related

Error in MATLAB sequentialfs while selecting features from 94*263 feature vectors

I have 94 samples with 263 features each, so the feature matrix is 94*263 in size. There are no NaN or Inf values in the feature vectors. There are two classes (51 samples in class a and 43 in class b). I am using sequentialfs to select features, but I get the following error each time:
Error using crossval>evalFun (line 480)
The function '@(XT,yT,Xt,yt)(sum(~strcmp(yt,classify(Xt,XT,yT,'quadratic'))))' generated the following error:
The input to SVD must not contain NaN or Inf.
The code is:
X = FEATUREVECTOR;
y = LABELS;
c = cvpartition(y,'k',10);
opts = statset('display','iter');
fun = @(XT,yT,Xt,yt)...
(sum(~strcmp(yt,classify(Xt,XT,yT,'quadratic'))));
[fs,history] = sequentialfs(fun,X,y,'cv',c,'options',opts)
Can you please tell me how to solve the problem?
It looks like you are calling sequentialfs with some inputs, that MAY be vaguely related to the mess of random numbers we see in your question. Beyond that, I can't read anything from your mind. If you want help you need to show what you did.
I changed the input data and it works well:
load fisheriris;
X = randn(150,10);
X(:,[1 3 5 7 ])= meas;
y = species;
c = cvpartition(y,'k',10);
opts = statset('display','iter');
fun = @(XT,yT,Xt,yt)...
(sum(~strcmp(yt,classify(Xt,XT,yT,'quadratic'))));
[fs,history] = sequentialfs(fun,X,y,'cv',c,'options',opts)
So the problem is in your input data.
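A quick way to look for problems in the data might be to list columns that contain non-finite values or that never vary. This is only a sketch: the matrix below is a hypothetical stand-in for FEATUREVECTOR with two planted defects, and constant columns are a guess at what could upset the classifier, not a confirmed cause of the error.

```matlab
% Hypothetical stand-in for the real feature matrix.
rng(0);
X = randn(94, 6);
X(3, 2) = NaN;   % planted non-finite entry
X(:, 4) = 1;     % planted constant column

% Columns containing NaN or Inf.
badCols = find(any(~isfinite(X), 1));

% Columns whose values never change (max and min skip NaN entries).
constCols = find(max(X, [], 1) == min(X, [], 1));
```

Running the same two checks inside each class or each training fold may also be worthwhile, since a column can vary overall and still be constant within a subset.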

EEG data classification with SWLDA using matlab

I would like to ask for your help with EEG data classification.
I am a graduate student trying to analyze EEG data.
I am currently struggling to classify ERP speller (P300) data with SWLDA in MATLAB.
There may be something wrong in my code.
I have read several articles, but they did not cover the details.
My data sizes are as follows:
size(target) = [300 1856]
size(nontarget) = [998 1856]
Rows are trials and columns are the flattened features.
(I flattened the [64 29] data into one row; for visual representation I did not select an ROI.)
I used the stepwisefit function in MATLAB to classify target vs. non-target.
The code is below.
ingredients = [targets; nontargets];
heat = [class_targets; class_nontargets]; % target: 1, non-target: -1
randomized_set = shuffle([ingredients heat]);
for k=1:10 % 10-fold cross validation
parition_factor = ceil(size(randomized_set,1) / 10);
cv_test_idx = (k-1)*parition_factor + 1:min(k * parition_factor, size(randomized_set,1));
total_idx = 1:size(randomized_set,1);
cv_train_idx = total_idx(~ismember(total_idx, cv_test_idx));
ingredients = randomized_set(cv_train_idx, 1:end-1);
heat = randomized_set(cv_train_idx, end);
[W,SE,PVAL,INMODEL,STATS,NEXTSTEP,HISTORY]= stepwisefit(ingredients, heat, 'penter', .1);
valid_id = find(INMODEL==1);
v_weights = W(valid_id)';
t_ingredients = randomized_set(cv_test_idx, 1:end-1);
t_heat = randomized_set(cv_test_idx, end); % true labels for test set
v_features = t_ingredients(:, valid_id);
v_weights = repmat(v_weights, size(v_features, 1), 1);
predictor = sum(v_weights .* v_features, 2);
m_result = predictor > 0; % class A: +1, B: 0
t_heat(t_heat==-1) = 0;
acc(k) = sum(m_result==t_heat) / length(m_result);
end
P.S. My code is currently very inefficient and may be poorly written.
My assumption is that stepwisefit tests the significance of the coefficients at every step, and that only the significant columns remain in the model.
Although this is not LDA, for binary classification LDA and linear regression should not differ.
However, the results were close to chance level (on other binary datasets from the internet, it worked).
I think I have made a mistake somewhere, and I hope you can help me find it.
I would appreciate any suggestions and tips for implementing a classifier for the ERP speller.
Or any ideas for implementing SWLDA in MATLAB?
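One detail that may matter here: stepwisefit always fits a constant term and reports it separately in STATS.intercept rather than in the coefficient vector. A score built only from the selected coefficients therefore leaves that term out, which could shift the decision threshold when the classes are unbalanced or the features are not zero-mean. The sketch below shows the score with the intercept included; the data are hypothetical and this is not a confirmed explanation of the chance-level accuracy.

```matlab
% Hypothetical two-class data standing in for the EEG features.
rng(0);
n = 200; p = 10;
X = randn(n, p) + 2;                          % features with non-zero mean
y = sign(X(:, 3) - 2 + 0.5 * randn(n, 1));    % labels +1 / -1
y(y == 0) = 1;

trainIdx = 1:150;
testIdx = 151:200;

% Stepwise selection on the training part only.
[b, ~, ~, inmodel, stats] = stepwisefit(X(trainIdx, :), y(trainIdx), ...
    'penter', 0.1, 'display', 'off');

% Score = constant term + selected columns times their coefficients.
score = stats.intercept + X(testIdx, inmodel) * b(inmodel);
pred = sign(score);
acc = mean(pred == y(testIdx));
```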
The name SWLDA is only used in the context of Brain Computer Interfaces, but I bet it has another name in a more general context.
If you trace the SWLDA recipe back, you end up at the Krusienski 2006 papers ("A comparison..." and "Toward enhanced P300..") and from there at the book where stepwise logarithmic regression is explained: Draper and Smith, Applied Regression Analysis, 1981. However, as far as I am aware, no paper actually gives the complete recipe for implementing it (with all its details and secrets).
My approach was using stepwiseglm:
H=predictors;
TH=variables;
lbs=labels % (1,2)
if (stepwiseflag)
mdl = stepwiseglm(H', lbs'-1,'constant','upper','linear','distr','binomial');
if (mdl.NumEstimatedCoefficients>1)
inmodel = [];
for i=2:mdl.NumEstimatedCoefficients
inmodel = [inmodel str2num(mdl.CoefficientNames{i}(2:end))];
end
H = H(inmodel,:);
TH = TH(inmodel,:);
end
end
lbls = classify(TH',H',lbs','linear');
You can also use a k-fold cross-validation approach with MATLAB's cvpartition.
c = cvpartition(lbs,'k',10);
opts = statset('display','iter');
fun = @(XT,yT,Xt,yt)...
(sum(~strcmp(yt,classify(Xt,XT,yT,'linear'))));

First non demo example for Gaussian process using GPML (Matlab)?

After gaining a basic understanding of the GPML toolbox, I wrote my first code using it. I have a data matrix named data, consisting of two arrays of values with a total size of 1000. I want to use this matrix to estimate the GP value using the GPML toolbox. My code is as follows:
X = data(1:200,1); % training inputs
Y = data(1:200,2); % training targets
Ys = data(201:400,2);
Xs = data(201:400,1); %possibly test cases
covfunc = {@covSE, 3};
ell = 1/4; sf = 1;
hyp.cov = log([ell; sf]);
likfunc = @likGauss;
sn = 0.1;
hyp.lik = log(sn);
[ymu ys2 fmu fs2] = gp(hyp, @infExact, [], covfunc, likfunc,X,Y,Xs,Ys);
plot(Xs, fmu);
But when I run this code I get:
Error using covMaha (line 58) Parameter mode is either 'eye', 'iso',
'ard', 'proj', 'fact', or 'vlen'
Could you please help me figure out where I am making a mistake?
I know this is way late, but I just ran into this myself. The way to fix it is to change
covfunc = {@covSE, 3};
to something like
covfunc = {@covSE, 'iso'};
It doesn't have to be 'iso', it can be any of the options listed in the error message. Just make sure your hyperparameters are set correctly for the specific mode you choose. This is detailed more in the covMaha.m file in GPML.
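To illustrate what "set correctly" means for 'iso', here is the isotropic squared-exponential covariance written out by hand, without calling GPML. It assumes the usual form k(x,z) = sf^2 * exp(-(x-z)^2 / (2*ell^2)) with hyp.cov = log([ell; sf]), which matches the two-element vector in the question. The helper name seIso is hypothetical, and an 'ard' mode would, to my understanding, need one length scale per input dimension instead.

```matlab
% Hyperparameters as in the question: hyp.cov = log([ell; sf]).
ell = 1/4;
sf = 1;
hypcov = log([ell; sf]);

% Hand-written isotropic SE covariance for 1-D inputs (hypothetical helper).
% h(1) = log(ell), h(2) = log(sf); x and z are column vectors.
seIso = @(x, z, h) exp(2 * h(2)) * exp(-(x - z').^2 / (2 * exp(2 * h(1))));

x = [0; 0.25; 1];
K = seIso(x, x, hypcov);   % 3-by-3 covariance matrix
```

The diagonal equals sf^2, and two points one length scale apart have covariance sf^2 * exp(-1/2).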

Variable error rate of SVM Classifier using K-Fold Cross Validation Matlab

I'm using K-Fold Cross-validation to get the error rate of an SVM Classifier. This is the code with which I'm getting the error rate for 8-Fold Cross-validation:
data = load('Entrenamiento.txt');
group = importdata('Grupos.txt');
CP = classperf(group);
N = length(group);
k = 8;
indices = crossvalind('KFold',N,k);
single_error = zeros(1,k);
for j = 1:k
test = (indices==j);
train = ~test;
SVMModel_1 = fitcsvm(data(train,:),group(train,:),'BoxConstraint',1,'KernelFunction','linear');
classification = predict(SVMModel_1,data(test,:));
classperf(CP,classification,test);
single_error(1,j) = CP.ErrorRate;
end
confusion_matrix = CP.CountingMatrix
VP = confusion_matrix(1,1);
FP = confusion_matrix(1,2);
FN = confusion_matrix(2,1);
VN = confusion_matrix(2,2);
mean_error = mean(single_error)
However, the mean_error changes each time I run the script. This is due to crossvalind, which generates random cross-validation indices, so each time I run the script, it generates different random indices.
What should I do to calculate the true error rate? Should I calculate the mean error rate of n code executions? Or what value should I use?
You can check Wikipedia:
In k-fold cross-validation, the original sample is randomly partitioned into k equal size subsamples.
and
The k results from the folds can then be averaged (or otherwise combined) to produce a single estimation.
So there is no need to worry about the error rate changing when the folds are selected randomly.
Of course the results will differ between runs.
However, if your error rate varies over a wide range, increasing k would help.
rng can also be used to get fixed results.
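A small sketch of the rng idea, assuming crossvalind draws from MATLAB's global random stream so that seeding just before the call fixes the folds. To keep the example free of toolbox dependencies, a randperm-based assignment stands in for crossvalind('KFold',N,k); N and the seed value are arbitrary.

```matlab
N = 40;   % hypothetical number of observations
k = 8;

rng(1);                               % fix the seed
foldsA = mod(randperm(N), k) + 1;     % stand-in for crossvalind('KFold',N,k)

rng(1);                               % same seed again
foldsB = mod(randperm(N), k) + 1;     % reproduces the same fold labels
```

Fixing the seed makes a single run repeatable, but it does not make that particular split more correct than any other. Averaging the error over several different seeds gives a steadier estimate.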

using precomputed kernels with libsvm

I'm currently working on classifying images with different image descriptors. Since they have their own metrics, I am using precomputed kernels. So given these NxN kernel matrices (for a total of N images), I want to train and test an SVM. I'm not very experienced with SVMs, though.
What confuses me is how to supply the input for training. Using an MxM subset of the kernel (M being the number of training images) trains the SVM with M features. However, if I understand correctly, this limits me to test data with the same number of features. Trying to use a sub-kernel of size MxN causes infinite loops during training, and using more features when testing gives poor results.
As a result, only equally sized training and test sets give reasonable results. But if I only want to classify, say, one image, or train with a given number of images per class and test with the rest, this doesn't work at all.
How can I remove the dependency between the number of training images and the number of features, so that I can test with any number of images?
I'm using libsvm for MATLAB; the kernels are distance matrices with values in [0,1].
You seem to already have figured out the problem... According to the README file included in the MATLAB package:
To use precomputed kernel, you must include sample serial number as
the first column of the training and testing data.
Let me illustrate with an example:
%# read dataset
[dataClass, data] = libsvmread('./heart_scale');
%# split into train/test datasets
trainData = data(1:150,:);
testData = data(151:270,:);
trainClass = dataClass(1:150,:);
testClass = dataClass(151:270,:);
numTrain = size(trainData,1);
numTest = size(testData,1);
%# radial basis function: exp(-gamma*|u-v|^2)
sigma = 2e-3;
rbfKernel = @(X,Y) exp(-sigma .* pdist2(X,Y,'euclidean').^2);
%# compute kernel matrices between every pairs of (train,train) and
%# (test,train) instances and include sample serial number as first column
K = [ (1:numTrain)' , rbfKernel(trainData,trainData) ];
KK = [ (1:numTest)' , rbfKernel(testData,trainData) ];
%# train and test
model = svmtrain(trainClass, K, '-t 4');
[predClass, acc, decVals] = svmpredict(testClass, KK, model);
%# confusion matrix
C = confusionmat(testClass,predClass)
The output:
*
optimization finished, #iter = 70
nu = 0.933333
obj = -117.027620, rho = 0.183062
nSV = 140, nBSV = 140
Total nSV = 140
Accuracy = 85.8333% (103/120) (classification)
C =
65 5
12 38
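The same slicing applies when the full NxN kernel matrix is already available, as in the question: the training input uses the (train, train) block and the test input uses the (test, train) block, so the test set can have any number of rows, including one. The sketch below only builds the two matrices; Kfull, trainIdx and testIdx are hypothetical names, and the linear kernel is just a placeholder for the real precomputed one.

```matlab
% Hypothetical precomputed N-by-N kernel (a linear kernel as a placeholder).
rng(0);
N = 10;
F = randn(N, 3);
Kfull = F * F';

trainIdx = 1:7;   % M = 7 training images
testIdx = 9;      % a single test image

% Serial number in the first column, then kernel values against the
% training images only, for both the training and the test input.
K = [(1:numel(trainIdx))', Kfull(trainIdx, trainIdx)];
KK = [(1:numel(testIdx))', Kfull(testIdx, trainIdx)];
```

K and KK would then be passed to svmtrain and svmpredict with '-t 4', exactly as in the example above. Both have M+1 columns, whatever the number of test images.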