Split dataset into test and train sets in MATLAB [duplicate]

This question already has an answer here: Matlab: How can I split my data matrix into two random subsets of column vectors while keeping the label information? (1 answer). Closed 5 years ago.
I want to split a very large dataset that I have (over one million observations) into a test and a train set. As you can see, I have already managed to do something similar in the code below with the use of dividerand.
The code works on a very large set X: on every iteration it selects N = 1700 observations and splits them into train and test with a 7/3 ratio. What I would like to do instead of using percentages with dividerand is to use specific counts. For instance, split the data into mini-batches of size 2000, and then use 500 for testing and 1500 for training. In the next loop we would select the data (2001:4000) and again split it into 500 test and 1500 train, and so on.
dividerand allows doing that with ratios, but I would like to use actual counts.
X = randn(10000,9);
mu_6 = zeros(510,613); % 390/802 - 450/695 - 510/613 - Test/Iterations
s2_6 = zeros(510,613);
nl6 = zeros(613,1);
RSME6 = zeros(613,1);
prev_batch = 0;
inf = @infGaussLik;
meanfunc = []; % empty: don't use a mean function
covfunc = @covSEiso; % Squared Exponential covariance
likfunc = @likGauss; % Gaussian likelihood
for k=1:613
new_batch = k*1700;
X_batch = X(1+prev_batch:new_batch,:);
[train,~,test] = dividerand(transpose(X_batch),0.7,0,0.3);
train = transpose(train);
test = transpose(test);
x_t = train(:,1:8); % Train batch we get 910 values
y_t = train(:,9);
x_z = test(:,1:8); % Test batch we get 390 values
y_z = test(:,9);
% Calculations for Gaussian process regression
if k==1
hyp = struct('mean', [], 'cov', [0 0], 'lik', -1);
else
hyp = hyp2;
end
hyp2 = minimize(hyp, @gp, -100, inf, meanfunc, covfunc, likfunc, x_t, y_t);
[m4 s4] = gp(hyp2, inf, meanfunc, covfunc, likfunc, x_t, y_t, x_z);
[nlZ4,dnlZ4] = gp(hyp2, inf, meanfunc, covfunc, likfunc, x_t, y_t);
RSME6(k,1) = sqrt(sum(((m4-y_z).^2))/450);
nl6(k,1) = nlZ4;
mu_6(:,k) = m4;
s2_6(:,k) = s4;
% End of calculations
prev_batch = new_batch;
disp(k);
end

How about:
[~, idx] = sort(randn(2000,1)); % random permutation of 1:2000
group1_idx = idx(1:1500);
group2_idx = idx(1501:end);
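For the fixed-count split described in the question, one option (a sketch I am adding, not part of the original answer) is to permute the row indices of each mini-batch with randperm and take explicit counts. X_batch is the slice taken inside the question's loop, assuming the loop is changed to take 2000-row slices (1:2000, 2001:4000, and so on):
% Sketch: fixed-count split of one 2000-row mini-batch into 1500 train / 500 test
n_train = 1500;
perm = randperm(size(X_batch,1));            % random ordering of the batch rows
train = X_batch(perm(1:n_train), :);         % first 1500 shuffled rows -> training
test  = X_batch(perm(n_train+1:end), :);     % remaining 500 rows -> test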

Related

Separate matrix of column vectors into training and testing data in MATLAB

I'm currently doing a project in MATLAB using the MNIST data set. I have a training data set of n = 50000, represented by a matrix of 784 x 50000 (50000 column vectors of size 784).
I am trying to separate my training and testing data (70/30, respectively), but the method I am using is a bit wordy and brute-force for my liking. Being that this is MATLAB, I'm sure there must be a better way. The code I have been using is listed below. I'm brand new to MATLAB, so please help! Thanks :)
% MNIST - data loads into trn and trnAns, representing
% the input vectors and the desired output vectors, respectively
load('Data/mnistTrn.mat');
mnist_train = zeros(784, 35000);
mnist_train_ans = zeros(10, 35000);
mnist_test = zeros(784, 15000);
mnist_test_ans = zeros(10, 15000);
indexes = zeros(1,50000);
for i = 1:50000
indexes(i) = i;
end
indexes = indexes(randperm(length(indexes))); % assign the shuffled indices back, otherwise the permutation is discarded
for i = 1:50000
if i <= 35000
mnist_train (:,i) = trn(:,indexes(i));
mnist_train_ans(:,i) = trnAns(:,indexes(i));
else
mnist_test(:,i-35000) = trn(:,indexes(i));
mnist_test_ans(:,i-35000) = trnAns(:,indexes(i));
end
end
I hope this works:
% MNIST - data loads into trn and trnAns, representing
% the input vectors and the desired output vectors, respectively
load('Data/mnistTrn.mat');
% Generating a random permutation for both trn and trnAns:
perm = randperm(50000);
% Shuffling both trn and trnAns columns using a single random permutation:
trn = trn(:, perm);
trnAns = trnAns(:, perm);
mnist_train = trn(:, 1:35000);
mnist_train_ans = trnAns(:, 1:35000);
mnist_test = trn(:, 35001:50000);
mnist_test_ans = trnAns(:, 35001:50000);
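As an optional sanity check (my addition, assuming the variables above): the same permutation is applied to both trn and trnAns, so column j of mnist_train still pairs with column j of mnist_train_ans, and the split sizes can be verified directly:
% Verify the 70/30 split sizes produced above
assert(isequal(size(mnist_train),     [784 35000]));
assert(isequal(size(mnist_train_ans), [10  35000]));
assert(isequal(size(mnist_test),      [784 15000]));
assert(isequal(size(mnist_test_ans),  [10  15000]));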

Vectorize double for loops in Matlab

I present my simple working Matlab code and will ask questions:
tic
nrand1 = 10000;
nrand2 = 20000;
% Location matrix 1: [longitude, latitude, w1]
lmat1=[rand(nrand1,1)-75 rand(nrand1,1)+39 round(rand(nrand1,1)*1000)+1];
% Location matrix 2: [longitude, latitude, w2]
lmat2=[rand(nrand2,1)-75 rand(nrand2,1)+39 round(rand(nrand2,1)*100)+1];
% The number of rows in each matrix (these are just nrand1 and nrand2, obviously)
nobs1 = size(lmat1,1);
nobs2 = size(lmat2,1);
% The number of pair-wise distances
% between L1 locations X L2 locations
ndist = nobs1*nobs2;
% Initialization: Distance vector and weight vector
hdist = zeros(ndist,1);
weight = zeros(ndist,1);
% Double for loop -- for calculating the pair-wise distances and weights
k=1;
for i=1:nobs1
for j=1:nobs2
% distances in kilometers.
lonH = sin(0.5*(lmat1(i,1)-lmat2(j,1))*pi/180.0)^2;
latH = sin(0.5*(lmat1(i,2)-lmat2(j,2))*pi/180.0)^2;
hdist(k) = 0.001*6372797.560856*2 ...
*asin(sqrt(latH+(cos(lmat1(i,2)*pi/180.0) ...
*cos(lmat2(j,2)*pi/180.0))*lonH));
weight(k) = lmat1(i,3)*lmat2(j,3);
k=k+1;
end
end
toc
The code calculates 10000 X 20000 distances and weights.
Elapsed time is 67.124844 seconds.
Is there a way to vectorize the double-loop processing, or to perform the computation in parallel? If there is no room for performance improvement in Matlab, I may have to write the double loops in C and call them from Matlab. I don't know how to call C from Matlab, so I will ask that as a separate question. Thanks!
Using bsxfun, you can eliminate the for loops and the need for calculating matrices for each combination (this should reduce memory usage). The following is about six times faster than your original code on my computer using R2014b:
nrand1 = 10000;
nrand2 = 20000;
% Location matrix 1: [longitude, latitude, w1]
lmat1=[rand(nrand1,1)-75 rand(nrand1,1)+39 round(rand(nrand1,1)*1000)+1];
% Location matrix 2: [longitude, latitude, w2]
lmat2=[rand(nrand2,1)-75 rand(nrand2,1)+39 round(rand(nrand2,1)*100)+1];
p180 = pi/180;
lonH = sin(0.5*bsxfun(@minus,lmat1(:,1).',lmat2(:,1))*p180).^2;
latH = sin(0.5*bsxfun(@minus,lmat1(:,2).',lmat2(:,2))*p180).^2;
hdist = 0.001*6372797.560856*2*asin(sqrt(latH+bsxfun(@times,cos(lmat1(:,2).'*p180),cos(lmat2(:,2)*p180)).*lonH));
hdist1 = hdist(:);
weight1 = bsxfun(@times,lmat1(:,3).',lmat2(:,3));
weight1 = weight1(:);
Note that by using the variable p180, the math is changed slightly so you won't get precisely the same values, but they will be very close.
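On R2016b or newer, implicit expansion can stand in for bsxfun; the following equivalent sketch (my addition, not part of the original answer) should produce the same values as the bsxfun version above:
% Same computation using implicit expansion (row vector against column vector)
p180 = pi/180;
lonH = sin(0.5*(lmat1(:,1).' - lmat2(:,1))*p180).^2;   % nrand2-by-nrand1
latH = sin(0.5*(lmat1(:,2).' - lmat2(:,2))*p180).^2;
hdist1 = 0.001*6372797.560856*2*asin(sqrt(latH + cos(lmat1(:,2).'*p180).*cos(lmat2(:,2)*p180).*lonH));
hdist1 = hdist1(:);
weight1 = lmat1(:,3).' .* lmat2(:,3);
weight1 = weight1(:);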
The key point is that your inputs (lmat1 and lmat2) do not need to be matrices as you have them; each one is really three vectors. Once you've broken out the vectors, you can create arrays that hold every pairwise combination of lmat1 and lmat2 entries (which is what your double loop enumerates). At that point, you can write the math as single, fully vectorized operations...
%make your vectors
lmat1A = rand(nrand1,1)-75;
lmat1B = rand(nrand1,1)+39;
lmat1C = round(rand(nrand1,1)*1000)+1;
lmat2A = rand(nrand2,1)-75;
lmat2B = rand(nrand2,1)+39;
lmat2C = round(rand(nrand2,1)*100)+1; % *100 to match the original lmat2 definition
%make every combination
lmat1A = lmat1A(:)*ones(1,nrand2);
lmat1B = lmat1B(:)*ones(1,nrand2);
lmat1C = lmat1C(:)*ones(1,nrand2);
lmat2A = ones(nrand1,1)*(lmat2A(:)');
lmat2B = ones(nrand1,1)*(lmat2B(:)');
lmat2C = ones(nrand1,1)*(lmat2C(:)');
%do your math
lonH = sin(0.5*(lmat1A-lmat2A)*pi/180.0).^2;
latH = sin(0.5*(lmat1B-lmat2B)*pi/180.0).^2;
hdist = 0.001*6372797.560856*2 ...
.*asin(sqrt(latH+(cos(lmat1B*pi/180.0) ...
.*cos(lmat2B*pi/180.0)).*lonH)); %use element-wise multiplication
weight = lmat1C.*lmat2C;
%reshape your output into vectors (not arrays), which is what your original code does
lonH = lonH(:);
latH = latH(:);
hdist = hdist(:);
weight = weight(:);
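One practical caveat (my addition, not from the original answer): this approach materializes full nrand1-by-nrand2 matrices, so it trades memory for speed. A rough estimate of the footprint per expanded matrix:
% Memory for one fully expanded double matrix
bytes_per_matrix = nrand1*nrand2*8;                                  % 10000*20000*8 = 1.6e9 bytes
fprintf('%.2f GiB per expanded matrix\n', bytes_per_matrix/2^30);    % about 1.49 GiB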

How can I convert my CPU code for the dot product of two matrices to GPU code in MATLAB?

I want to take a weighted sum of two matrices using gpuArray so that it is fast. For example, my CPU code is given below:
mat1 = rand(19,19);
mat2= rand(19,19);
Receptive_fieldsize = [4,3];
overlap = 1;
Output = GetweightedSum(mat1,mat2, Receptive_fieldsize,overlap); % this will output a 6x6 matrix
whereas my function body is:
function Output = GetweightedSum(mat1,mat2, RF,overlap)
gap = RF(1) - overlap;
size_mat = size(mat1);
output_size=[6,6];
for u=1: output_size(1)
for v=1: output_size(2)
min_u = (u - 1) * gap + 1;
max_u = (u - 1) * gap + RF(1);
min_v = (v - 1) * gap + 1;
max_v = (v - 1) * gap + RF(2);
input1 = mat1(min_u:max_u,min_v:max_v);
input2 = mat2(min_u:max_u,min_v:max_v);
Output(u,v) = sum(sum(input1 .*input2));
end
end
How can I convert it to a GPU function? Can I do it directly, or can I use a for loop in GPU code? I am totally new to GPU computing, so I don't know anything about it.
I would be thankful if someone could guide me, or convert the above code to a GPU function as a reference, so that I may learn from it.
Regards
See if the code and the comments alongside it make sense to you:
function Output = GetweightedSumGPU(mat1,mat2, RF,overlap)
%// Create parameters
gap = RF(1) - overlap;
output_size=[6,6];
sz1 = output_size(1);
sz2 = output_size(2);
nrows = size(mat1,1); %// get number of rows in mat1
%// Copy data to GPU
gmat1 = gpuArray(mat1);
gmat2 = gpuArray(mat2);
start_row_ind = gpuArray([1:RF(1)]'); %// starting row indices for each block
col_offset = gpuArray([0:RF(2)-1]*nrows); %// column offset for each block
%// Linear indices for each block
ind = bsxfun(@plus,start_row_ind,col_offset);
%// Linear indices along rows and columns respectively
ind_rows = bsxfun(@plus,ind(:),[0:sz1-1]*gap);
ind_rows_cols = bsxfun(@plus,ind_rows,permute([0:sz2-1]*gap*nrows,[1 3 2]));
%// Elementwise multiplication, summing and gathering back result to CPU
Output = gather(reshape(sum(gmat1(ind_rows_cols).*gmat2(ind_rows_cols),1),sz1,sz2));
return;
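A quick way to compare the GPU version against the original CPU function (my addition; it assumes both functions are saved on the MATLAB path under these names):
mat1 = rand(19,19);
mat2 = rand(19,19);
Receptive_fieldsize = [4,3];
overlap = 1;
out_cpu = GetweightedSum(mat1, mat2, Receptive_fieldsize, overlap);
out_gpu = GetweightedSumGPU(mat1, mat2, Receptive_fieldsize, overlap);
max(abs(out_cpu(:) - out_gpu(:)))   % should be zero up to floating-point round-off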

Matlab - How to improve efficiency of two port matrix calculations?

I'm looking for a way to speed up some simple two-port matrix calculations. See the code example below for what I'm doing currently. In essence, I first create an [Nx1] frequency vector. I then loop through the frequency vector and create the [2x2] matrices H1 and H2 (all functions of f). A bit of simple matrix math, including a matrix left division '\', later, and I get my result pb as an [Nx1] vector. The problem is the loop: it takes a long time to calculate, and I'm looking for a way to improve the efficiency of the calculations. I tried assembling the problem using [2x2xN] transfer matrices, but the mtimes operation cannot handle 3-D multiplications.
Can anybody please give me an idea of how I can approach such a calculation without the need to loop through f?
Many thanks: svenr
% calculate frequency and wave number vector
f = linspace(20,200,400);
w = 2.*pi.*f;
% calculation for each frequency w
for i=1:length(w)
H1(i,1) = {[1, rho*c*k(i)^2 / (crad*pi); 0,1]};
H2(i,1) = {[1, 1i.*w(i).*mp; 0, 1]};
HZin(i,1) = {H1{i,1}*H2{i,1}};
temp_mat = HZin{i,1}*[1; 0];
Zin(i,1) = temp_mat(1,1)/temp_mat(2,1);
temp_mat= H1{i,1}\[1; 1/Zin(i,1)];
pb(i,1) = temp_mat(1,1); Ub(i,:) = temp_mat(2,1);
end
Assuming that length(w) == length(k) returns true, that rho, c, crad and mp are all scalars, and that the last line should be Ub(i,1) = temp_mat(2,1) instead of Ub(i,:) = temp_mat(2,1):
temp1 = repmat(eye(2),[1 1 length(w)]);
temp2 = temp1;
temp1(1,2,:) = rho*c*(k.^2)/crad/pi;
temp2(1,2,:) = (1i.*w)*mp;
H1 = permute(num2cell(temp1,[1 2]),[3 2 1]);
H2 = permute(num2cell(temp2,[1 2]),[3 2 1]);
HZin = cellfun(@(a,b)(a*b),H1,H2,'UniformOutput',0);
temp_cell = cellfun(@(a,b)(a*b),HZin,repmat({[1;0]},length(w),1),'UniformOutput',0);
Zin_cell = cellfun(@(a)(a(1,1)/a(2,1)),temp_cell,'UniformOutput',0);
Zin = cell2mat(Zin_cell);
temp2_cell = cellfun(@(a)([1;1/a]),Zin_cell,'UniformOutput',0);
temp3_cell = cellfun(@(a,b)(pinv(a)*b),H1,temp2_cell,'UniformOutput',0);
temp4 = cell2mat(temp3_cell);
p(:,1) = temp4(1:2:end-1);
Ub(:,1) = temp4(2:2:end);
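For completeness (my addition, with placeholder values only): the snippet above assumes rho, c, crad, mp and a wave-number vector k are already defined by the asker's model, for example:
rho = 1.2; c = 343; crad = 0.05; mp = 0.01;   % hypothetical scalar values
f = linspace(20,200,400);
w = 2*pi*f;
k = w./c;                                     % one common wave-number definition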

How to set output size in Matlab newff method

Summary:
I'm trying to classify some images depending on the angles between body parts.
I assume that the human body consists of 10 parts (as rectangles), find the center of each part, and calculate the angle of each part with reference to the torso.
I have three action categories: Handwave, Walking, Running.
My goal is to find which test images fall into which action category.
Facts:
TrainSet: 1057x10 feature set; 1057 is the number of images.
TestSet: 821x10
I want my output to be a 3x1 matrix, each row showing the classification percentage for an action category.
row1: Handwave
row2: Walking
row3: Running
Code:
actionCat='H';
[train_data_hw train_label_hw] = tugrul_traindata(TrainData,actionCat);
[test_data_hw test_label_hw] = tugrul_testdata(TestData,actionCat);
actionCat='W';
[train_data_w train_label_w] = tugrul_traindata(TrainData,actionCat);
[test_data_w test_label_w] = tugrul_testdata(TestData,actionCat);
actionCat='R';
[train_data_r train_label_r] = tugrul_traindata(TrainData,actionCat);
[test_data_r test_label_r] = tugrul_testdata(TestData,actionCat);
Train=[train_data_hw;train_data_w;train_data_r];
Test=[test_data_hw;test_data_w;test_data_r];
Target=eye(3,1);
net=newff(minmax(Train),[10 3],{'logsig' 'logsig'},'trainscg');
net.trainParam.perf='sse';
net.trainParam.epochs=50;
net.trainParam.goal=1e-5;
net=train(net,Train);
trainSize=size(Train,1);
testSize=size(Test,1);
if(trainSize > testSize)
pend=-1*ones(trainSize-testSize,size(Test,2));
Test=[Test;pend];
end
x=sim(net,Test);
Question:
I'm using the MATLAB newff method, but my output is always an Nx10 matrix, not 3x1.
My input set should be grouped into 3 classes, but it is grouped into 10 classes.
Thanks
%% Load data : I generated some random data instead
Train = rand(1057,10);
Test = rand(821,10);
TrainLabels = randi([1 3], [1057 1]);
TestLabels = randi([1 3], [821 1]);
trainSize = size(Train,1);
testSize = size(Test,1);
%% prepare the input/output vectors (1-of-N output encoding)
input = Train'; % matrix of size numFeatures-by-numImages
output = zeros(3,trainSize); % matrix of size numCategories-by-numImages
for i=1:trainSize
output(TrainLabels(i), i) = 1;
end
%% create net: one hidden layer with 10 nodes (output layer size is inferred: 3)
net = newff(input, output, 10, {'logsig' 'logsig'}, 'trainscg');
net.trainParam.perf = 'sse';
net.trainParam.epochs = 50;
net.trainParam.goal = 1e-5;
view(net)
%% training
net = init(net); % initialize
[net,tr] = train(net, input, output); % train
%% performance (on Training data)
y = sim(net, input); % predict
%[err cm ind per] = confusion(output, y);
[maxVals predicted] = max(y); % predicted
cm = confusionmat(predicted, TrainLabels);
acc = sum(diag(cm))/sum(cm(:));
fprintf('Accuracy = %.2f%%\n', 100*acc);
fprintf('Confusion Matrix:\n');
disp(cm)
%% Testing (on Test data)
y = sim(net, Test');
Note how I converted from a category label for each instance (1/2/3) to a 1-of-N encoding vector ([1 0 0]: 1, [0 1 0]: 2, [0 0 1]: 3).
Also note that the test set is currently not being used, since by default the input data is divided into train/test/validation. You could achieve your manual division by setting net.divideFcn to the divideind function, and setting the corresponding net.divideParam parameters.
I showed the testing on the same training data, but you could do the same for the Test data.
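As a minimal sketch of the manual train/test division mentioned above (my addition; it reuses the random data and labels generated in this answer), you could pass all images to train and let divideind pick the split:
allLabels = [TrainLabels; TestLabels];
allInput  = [Train' Test'];                        % numFeatures-by-(1057+821)
allOutput = zeros(3, trainSize + testSize);        % 1-of-N targets for all images
for i = 1:(trainSize + testSize)
    allOutput(allLabels(i), i) = 1;
end
net.divideFcn = 'divideind';                       % use explicit index sets
net.divideParam.trainInd = 1:trainSize;            % first 1057 columns -> training
net.divideParam.valInd   = [];                     % no validation set
net.divideParam.testInd  = trainSize+1 : trainSize+testSize;   % remaining 821 -> test
[net, tr] = train(net, allInput, allOutput);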