How to split the dataset into train / validation / test with cvpartition?


I'm training NNs for classification. However, I only have 525 samples and approximately 300 predictor variables. I know I could try to reduce the number of variables, looking for the ones that are really more important, but this is not the point.
Currently I divide my data into training / validation / test sets, using the validation set for early stopping during network training.
I want to use cross-validation in Matlab with the cvpartition function, but this function only divides the dataset into training / test. Is there any way to use cvpartition to split into training / validation / test?
c=cvpartition(t_class,'KFold',10,'Stratify', true)
K-fold cross validation partition
NumObservations: 525
NumTestSets: 10
TrainSize: 473 472 472 472 472 472 473 473 473 473
TestSize: 52 53 53 53 53 53 52 52 52 52

Cross-validation is only meant to have two sets: the one it trains on and the one it tests on, with the roles rotating in each iteration. So cvpartition won't give you a split into three sets. You could argue that the validation set is only a subset of the test set, so you use cvpartition again on the test set, making sure that you do not accidentally test on the whole test set (this doesn't work for cross-validation). Or, if you want to apply cross-validation, do it the other way around:
% 20% for validation
cvp = cvpartition(t_class,'HoldOut',0.2);
% extract the data set
t_class_Val = t_class(cvp.test);
% Dat_Val = Dat(cvp.test,:);
t_class_TrnTst = t_class(cvp.training);
% Dat_TrnTst = Dat(cvp.training,:);
% cross-validation for the rest
cvp2 = cvpartition(t_class_TrnTst,'KFold',10,'Stratify', true);
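To make the resulting three-way split explicit, here is a minimal sketch of how the two partitions could be combined in a loop. It assumes the Statistics Toolbox is available; the label vector below is a made-up stand-in for t_class, and Dat is the (hypothetical) predictor matrix from the commented lines above.

```matlab
% Hypothetical labels standing in for t_class (two classes, 525 samples).
t_class = [ones(300,1); 2*ones(225,1)];

% Step 1: hold out 20% as a fixed validation set (used for early stopping).
cvp     = cvpartition(t_class, 'HoldOut', 0.2);
valMask = test(cvp);             % logical mask of the validation samples
restIdx = find(training(cvp));   % indices of the remaining samples

% Step 2: 10-fold cross-validation on the remaining samples.
% (With a grouping variable, cvpartition stratifies by default.)
cvp2 = cvpartition(t_class(restIdx), 'KFold', 10);
for i = 1:cvp2.NumTestSets
    trainIdx = restIdx(training(cvp2, i));   % indices into the full data set
    testIdx  = restIdx(test(cvp2, i));
    % Train on Dat(trainIdx,:), stop early using Dat(valMask,:),
    % then evaluate on Dat(testIdx,:).
end
```

Mapping the fold masks back through restIdx keeps all three index sets in terms of the original 525 rows, which avoids mixing up indices of the reduced and the full data set.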
The other option is to code it yourself. You can randomize indices with randperm.
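A minimal sketch of the randperm route, using only base MATLAB. The 70/15/15 ratios and the variable names are assumptions for illustration, and note that this simple version is not stratified by class.

```matlab
% Three-way split with randperm (no toolbox needed).
% Ratios and names are illustrative assumptions, not from the question.
nObs = 525;                 % number of samples, as in the question
rng(1);                     % fix the seed so the split is reproducible
idx = randperm(nObs);       % random ordering of the observation indices

nTrain = round(0.70 * nObs);
nVal   = round(0.15 * nObs);

trainIdx = idx(1:nTrain);
valIdx   = idx(nTrain+1 : nTrain+nVal);
testIdx  = idx(nTrain+nVal+1 : end);   % whatever is left over

% Usage with the (hypothetical) data matrix Dat and labels t_class:
% Dat_Train = Dat(trainIdx,:);  t_class_Train = t_class(trainIdx);
```

If class balance matters, the same idea can be applied per class (run randperm on the indices of each class separately and concatenate the pieces).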

Related

Writing (and using) principal component analysis in matlab

I (hope to) obtain a matrix with data on different characteristics on rat calls (in the ultrasound). Variables include starting frequency, ending frequency, duration etc etc. The observations will include all the rat calls in my audio recording.
I want to use PCA to analyze my data, hopefully decorrelating any principal components that are not important to the structure of these calls and how they work, allowing me to group the calls up.
My problem is that while I have a basic understanding of how PCA works, I don't have an understanding of the finer points including how to implement this in Matlab.
I know I should standardise my data. All the methods I have seen adjust for the mean by subtracting it. However, some also divide by the standard deviation, or divide the transpose of the mean-adjusted data by the square root of N-1 (N being the number of variables).
I know that with the standardised data you can then find the covariance matrix and extract the eigenvalues and eigenvectors, for example with eig(cov(...)). Others use svd(...) instead. I still don't understand what this is and why it is important.
I know there are different ways to implement PCA, but I don't like how I get different results for all of them.
There is even a pca(...) command also.
While reconstructing the data, some people multiply the mean-adjusted data with the principal component data; others do the same but with the transpose of the principal component data.
I just want to be able to analyse my data by plotting graphs of the principal components, and of the data (with the most insignificant principal components removed). I want to know about the variances of these eigen vectors and how much they represent the total variance of the data. I want to be able to fully exploit all the information PCA can allow me to get out
can anyone help?
=========================================================
This code seems to work based on pg 20 of http://people.maths.ox.ac.uk/richardsonm/SignalProcPCA.pdf
X = [105 103 103 66; 245 227 242 267; 685 803 750 586;...
147 160 122 93; 193 235 184 209; 156 175 147 139;...
720 874 566 1033; 253 265 171 143; 488 570 418 355;...
198 203 220 187; 360 365 337 334; 1102 1137 957 674;...
1472 1582 1462 1494; 57 73 53 47; 1374 1256 1572 1506;...
375 475 458 135; 54 64 62 41];
[M,N] = size(X);
mn = mean(X,2);
data = X - repmat(mn,1,N);
Y = data' / sqrt(N-1);
[~,S,PC] = svd(Y);
S = diag(S);
V = S .* S;
signals = PC' * data;
%plotting single PC1 on its own
figure;
plot(signals(1,:),zeros(1,size(signals,2)),'b.','markersize',15)
xlabel('PC1')
title('plotting single PC1 on its own')
%plotting PC1 against PC2
figure;
plot(signals(1,:),signals(2,:),'b.','markersize',15)
xlabel('PC1'),ylabel('PC2')
title('plotting PC1 against PC2')
figure;
plot(PC(:,1),PC(:,2),'m.','markersize',15)
xlabel('effect(PC1)'),ylabel('effect(PC2)')
But where is the standard deviation? How is the result different from:
B=zscore(X);
[PC, D] = eig(cov(B));
D = diag(D);
cumsum(flipud(D)) / sum(D)
PC*B %(note how this says PC whereas above it says PC')
If the principal components are represented as columns, then I can remove the most insignificant eigen vectors by finding the smallest eigenvalue and setting its corresponding eigen vector column to a column of zeros.
How can either of these methods above be applied by using the pca(...) command and achieve THE SAME result? can anyone help explain this to me (and ideally show me how all of these can achieve the same results)?
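One thing worth noting about the two snippets above: the first takes the mean along dimension 2, so it treats the rows of X as variables, while zscore(X) and cov(...) work column-wise and treat the columns as variables. zscore also divides by the standard deviation, which the first snippet does not. The sketch below is not the asker's code; it uses a small made-up matrix with observations in rows and variables in columns, and shows that eig(cov(...)) and svd(...) of the centered data agree on the variances and on the components up to sign.

```matlab
% Made-up data: 6 observations (rows) of 3 variables (columns).
X = [2 4 1; 3 7 2; 5 6 4; 6 9 3; 8 10 7; 9 8 5];
n = size(X, 1);

% Center each column (subtract the column mean).
Xc = X - repmat(mean(X, 1), n, 1);

% Route 1: eigen-decomposition of the covariance matrix.
[V, D]     = eig(cov(X));
[lam, ord] = sort(diag(D), 'descend');   % eig does not sort in descending order
V          = V(:, ord);                  % principal components as columns

% Route 2: SVD of the centered data scaled by sqrt(n-1).
[~, S, W] = svd(Xc / sqrt(n - 1), 'econ');
s2 = diag(S).^2;                         % squared singular values = variances

% Fraction of total variance carried by each component.
explained = lam / sum(lam);

% Scores: the data expressed in the principal-component basis.
scores = Xc * V;

% With the Statistics Toolbox, [coeff, score, latent] = pca(X) should match
% V, scores and lam up to the sign of each column (pca centers by default).
```

To standardise as well (PCA on the correlation matrix), divide each column of Xc by its standard deviation before either route. The sign of each component is arbitrary, which is one common reason different implementations appear to disagree.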

Alternative to dec2hex in MATLAB?

I am calling dec2hex up to 100 times in MATLAB, and this slows my code down. For a single point I call dec2hex 100 times, which takes a minute or more, and I have to do the same for 5000 points, so because of dec2hex the code takes hours to run. How can I do the decimal to hexadecimal conversion optimally? Is there any alternative that can be used instead of dec2hex?
As example:
%%Data[1..256]: can be any data from
for i=1:1:256
Table=dec2hex(Data);
%%Some permutation applied on Data
end;
Here I am using dec2hex more than 100 times for one point. And I have to use it for 5000 points.
Data =
Columns 1 through 16
105 232 98 250 234 216 98 199 172 226 250 215 188 11 52 174
Columns 17 through 32
111 181 71 254 133 171 94 91 194 136 249 168 177 202 109 187
Columns 33 through 48
232 249 191 60 230 67 183 122 164 163 91 24 145 124 200 142
This is the kind of data my code will use.

Function calls are (still) expensive in MATLAB. This is one of the reasons why vectorization and pseudo-vectorization is strongly recommended: processing an entire array of N values in one function call is way better than calling the processing function N times for each element, thus saving the N-1 supplemental calls overhead.
So, what can you do? Here are some non-mutually-exclusive choices:
Profile your code first. Just because something looks like the main culprit for a slow run does not mean it is. Type profview in your command window, choose the script that you want to run, and see where the hotspots of your code are. Optimize those hotspots rather than your initial guesses.
Try faster functions. sprintf is usually fast and flexible:
Table = sprintf('%04X\n', Data);
(and — if you dive into the function code with edit dec2hex — you'll see that in some cases dec2hex actually calls sprintf).
Reduce the number of function calls. Suppose you have to build the table for the 100 datasets of different lengths, that are stored in a cell array:
DataSet = cell(1,100);
for k = 1:100
DataSet{k} = fix(1000*rand(k,1));
end;
The idea is to assemble all the numbers in a single array that you convert at once:
Table = dec2hex(vertcat(DataSet{:}));
Mind you, this is done at the expense of using supplemental memory for assembling the partial inputs in a single one — it's not always convenient to do that.
All the variants above. Okay, this point is not actually a point. :-)
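As an illustration of the "faster functions" idea, here is a small sketch that builds the same two-character hex table in two ways: with a single sprintf call and with a direct lookup into a digit string. It assumes the values are integers in the range 0 to 255, as in the sample data from the question; the variable names are made up.

```matlab
% First row of the sample data from the question.
Data = [105 232 98 250 234 216 98 199 172 226 250 215 188 11 52 174];

% Variant 1: one sprintf call, then reshape into an N-by-2 char array.
TableSprintf = reshape(sprintf('%02X', Data), 2, []).';

% Variant 2: look up the high and low hex digit of every value at once.
hexDigits   = '0123456789ABCDEF';
hi          = floor(Data(:) / 16) + 1;   % index of the high nibble (1-based)
lo          = mod(Data(:), 16) + 1;      % index of the low nibble (1-based)
TableLookup = hexDigits([hi, lo]);       % N-by-2 char array

% Both should agree with dec2hex(Data, 2), e.g. the value 11 becomes '0B'.
```

Whether either variant is actually faster than dec2hex depends on the MATLAB version and the data size, so it is worth timing them with the profiler before committing to one.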

How to use KNN in Matlab

I need to use KNN in matlab to find the closest data in training data from A.
I have data in .mat that has this kind of information (training data):
train_data = 1 232 34 21 0.542
2 32 333 542 0.32
and so on.
Then i have a second information that I will gather through the application but I will only get
A = 2 343 543 43 0.23
So my question is: do I only need to do something like this, and can I use something like this?
Does KNN need to learn something first, or do you only need to load the training data and some new data (like A) and run them through some formula? Or do you preload the data in one function that learns it and then use a second function to get the result?
Best regards.

So you have a training set (with labels) and some test data without labels? I think you can use the function you linked to, classificationknn(). If I understand your question, you want something like the example "Predict Classification Based on a KNN Classifier":
http://www.mathworks.se/help/stats/classification-using-nearest-neighbors.html#btap7nm
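On the question of whether KNN has to "learn" anything: it is a lazy method, so the training step essentially just stores the data, and the work happens when a query arrives. A minimal base-MATLAB sketch of a 1-nearest-neighbour lookup is below. It assumes that all five columns are features and that plain Euclidean distance is appropriate; neither is stated in the question.

```matlab
% Training rows and query taken from the question.
% Assumption: every column is a feature (no label or ID column).
train_data = [1 232  34  21 0.542;
              2  32 333 542 0.32];
A = [2 343 543 43 0.23];

% Euclidean distance from A to every training row.
diffs = train_data - repmat(A, size(train_data, 1), 1);
dists = sqrt(sum(diffs.^2, 2));

% Index and contents of the closest training row.
[minDist, nearestIdx] = min(dists);
nearestRow = train_data(nearestIdx, :);

% With the Statistics Toolbox and a label vector (hypothetical name: labels),
% the classifier route would look roughly like:
%   mdl  = fitcknn(train_data, labels, 'NumNeighbors', 1);
%   pred = predict(mdl, A);
```

Because the columns have very different scales (hundreds versus fractions), rescaling each feature before computing distances is usually worth considering; otherwise the large-valued columns dominate the result.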

how to feed Hidden Markov Model (HMM) with several datastreams simultaneously?

I have built a body sensor network consisting of 8 accelerometers. At each sample (at about 30 Hz) each accelerometer gives me a X Y and Z value.
I have used the jahmm Java library to classify a datastream from a single accelerometer. This works fine. But now I am confused about how to extend my code so that it can be fed with more than one accelerometer.
a single datastream looks like this:
[-4.976763 7.096352 1.3488603]; [-4.8699903 7.417777 1.3515397];...
The library allows to define the dimensionality of the feature vector. In the above stream the dimensionality is 3. I thought of raising the dimensionality to 3 x 8 = 24, and then simply concatenate all accelerometers into a single 24D feature vector.
is this the way to go or will this deteriorate my results?
EDIT:
I have collected my data by now and it looks like this (for one participant):
"GESTURE A",[{407 318 425};...{451 467 358};{427 525 445};][{440 342 456}...;{432 530 449};]
"GESTURE A",[{406 318 424};...{450 467 357};{422 525 445};][{440 342 456}...;{428 531 449};]
"GESTURE B",[{407 318 424};...{449 466 357};{423 524 445};][{440 342 456}...;{429 530 449};]
"GESTURE B",[{380 299 399};...{424 438 338};{404 500 426};][{433 337 449}...;{429 529 449};]
The values in between {... ... ...} represent one accelerometer. Per sample (at 30 Hz or so) I have 8 accelerometers. One sample is within [...]. Per gesture example I have about 40 blocks of [...].
Is your suggestion that I take the first sensor (the first {} of each block of []) and create a model with the resulting sequence, and the same for the second up to the eighth?
This would give me 8 models for each gesture. Then a test sequence yields 8 probabilities, so I would need some sort of plurality voting in order to get the overall class. Is this what you meant?
Thank you

I suggest using one HMM per accelerometer, so 8 parallel models in your case. Then you can evaluate each channel individually and put everything together to get your result. So you have to write some code around the library.
If you want to handle everything in one HMM, you have to write your own observation type which can handle all 8 input streams, e.g. MyObservation extends Observation.
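The "put everything together" step can be sketched independently of the HMM library. The snippet below is written in MATLAB purely for illustration; the log-likelihood values are made up and would in practice come from evaluating each per-sensor model on the test sequence. It shows two simple ways to combine the 8 channels: plurality voting, and summing the log-likelihoods (which assumes the sensors are independent given the gesture).

```matlab
% Made-up log-likelihoods: rows = gestures (A, B), columns = 8 sensors.
% logLik(g, s) = log-likelihood of the test sequence under the model for
% gesture g trained on sensor s. Real values would come from the HMM library.
logLik = [-10 -12 -15  -9 -20 -11 -10 -18;    % gesture A
          -14 -13 -12 -16 -19 -15 -13 -17];   % gesture B

% Option 1: each sensor votes for its most likely gesture, majority wins.
[~, winners] = max(logLik, [], 1);   % winning gesture index per sensor
votedClass   = mode(winners);

% Option 2: sum the log-likelihoods over sensors, then take the best gesture.
[~, jointClass] = max(sum(logLik, 2));
```

Voting ignores how confident each sensor is, while summing log-likelihoods lets a very confident sensor outweigh several marginal ones; which behaves better depends on the data.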

Need to generate a cluster of points in k-dimensional space in MATLAB

The points generated should be something like this-
21 32 34 54 76 34
23 55 67 45 75 23.322
54 23 45 76 85.1 32
the above example is when k=6.
How can I generate such a cluster of, say, around 1000 points and vary the value of k and the radius of the cluster?
Is there any built-in function that can do this for me? I can use any other tool if needed.
Any help would be appreciated.

Have a look at ELKI. It comes with a quite flexible data generator for clustering datasets, and there is a 640d subspace clustering example somewhere on the wiki.
Consider using d for the dimensionality, as when you are talking about clusters k usually refers to the number of clusters (think of k-means ...)

I think you would need to write your own code for this. Supposing your center is at the origin, you have to pick k numbers, in sequence, with the constraint at every step that the sum of the squares of all the numbers up to (and including) it must not exceed the radius of the hypersphere squared. That is, the k-th number squared must be less than or equal to the radius squared minus the sum of the squares of all previously picked numbers.
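A minimal sketch of the sequential procedure described above, vectorized over all points. The values of n, d, R and the center are assumptions for illustration. This keeps every point inside the hypersphere, but the points are not uniformly distributed (earlier coordinates get more room than later ones), so a second variant is included that swaps in a different, well-known technique: a random direction from a normalized Gaussian vector combined with a radius drawn as R*u^(1/d).

```matlab
% Hypothetical settings: 1000 points, 6 dimensions, radius 10, centered at 0.
n = 1000;  d = 6;  R = 10;  center = zeros(1, d);
rng(0);                                   % reproducible random numbers

% Variant 1: sequential picking, as described in the text.
P = zeros(n, d);
remaining = R^2 * ones(n, 1);             % squared radius still available
for j = 1:d
    bound     = sqrt(remaining);          % largest allowed |coordinate|
    P(:, j)   = (2*rand(n, 1) - 1) .* bound;
    remaining = max(remaining - P(:, j).^2, 0);   % guard against round-off
end
P = P + repmat(center, n, 1);

% Variant 2 (a different technique): uniform points in the d-ball.
U = randn(n, d);
U = U ./ repmat(sqrt(sum(U.^2, 2)), 1, d);   % unit-length random directions
r = R * rand(n, 1).^(1/d);                   % radii giving uniform density
Q = U .* repmat(r, 1, d) + repmat(center, n, 1);
```

Changing d varies the dimensionality and changing R varies the cluster radius; shifting center moves the whole cluster.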

If you have the Statistics Toolbox, this is easy:
http://www.mathworks.co.uk/help/toolbox/stats/kmeans.html
Otherwise, you can quite easily write the code yourself using Lloyd's algorithm.
