How to implement data I have to svmtrain() function in MATLAB? - matlab

I have to write a script using MATLAB which will classify my data.
My data consists of 1051 web pages (rows) and 11000+ words (columns). I am holding the word occurences in the matrix for each page. The first 230 rows are about computer science course (to be labeled with +1) and remaining 821 are not (to be labeled with -1). I am going to label few part of these rows (say 30 rows) by myself. Then SVM will label the remaining unlabeled rows.
I have found that I could solve my problem using MATLAB's svmtrain() and svmclassify() methods. First I need to create SVMStruct.
SVMStruct = svmtrain(Training,Group)
Then I need to use
Group = svmclassify(SVMStruct,Sample)
But the point that I do not know what Training and Group are. For Group Mathworks says:
Grouping variable, which can be a categorical, numeric, or logical
vector, a cell vector of strings, or a character matrix with each row
representing a class label. Each element of Group specifies the group
of the corresponding row of Training. Group should divide Training
into two groups. Group has the same number of elements as there are
rows in Training. svmtrain treats each NaN, empty string, or
'undefined' in Group as a missing value, and ignores the corresponding
row of Training.
And for Training it is said that:
Matrix of training data, where each row corresponds to an observation
or replicate, and each column corresponds to a feature or variable.
svmtrain treats NaNs or empty strings in Training as missing values
and ignores the corresponding rows of Group.
I want to know how I can adopt my data to Training and Group? I need (at least) a little code sample.
EDIT
What I did not understand is that in order to have SVMStruct I have to run
SVMStruct = svmtrain(Training, Group);
and in order to have Group I have to run
Group = svmclassify(SVMStruct,Sample);
Also I still did not get what Sample should be like?
I am confused.

Training would be a matrix with 1051 rows (the webpages/training instances) and 11000 columns (the features/words). I'm assuming you want to test for the existence of each word on a webpage? In this case you could make the entry of the matrix a 1 if the word exists for a given webpage and a 0 if not.
You could initialize the matrix with Training = zeros(1051,11000); but filling the entries would be up to you, presumably done with some other code you've written.
Group is a 1-D column vector with one entry for every training instance (webpage) than tells you which of two classes the webpage belongs to. In your case you would make the first 230 entries a "+1" for computer science and the remaining 821 entries a "-1" for not.
Group = zeros(1051,1); % gives you a matrix of zeros with 1051 rows and 1 column
Group(1:230) = 1; % set first 230 entries to +1
Group(231:end) = -1; % set the rest to -1

Related

MATLAB matrix operation

I am having matrix with approx 3000 rows(changing) and 3 columns.
I have count of both rows and columns.
I am trying to plot the graph:
x=1:3000;
plot(matrix(x,1))
is there any way that I can include all rows in the plot instruction itself so that I can remove 'x=1:3000' ?
Also, I want to divide, 1st column of matrix which have 3000 rows into another matrix of 3 columns each with 1000 rows. Any specific instruction for this ?
I have made for loop for this and then i am placing individually the elements in the new array. But its taking long time.
As to the plotting issue, using the colon operator will plot all rows for your desired column:
plot(matrix(:,1));
EDIT: You mentioned you were a beginner. In case you haven't seen the colon operator used like this before, a colon operator all by itself when indexing into a matrix essentially means "all __", either "all rows" if in the first position or "all columns" if in the second position.
As for the second question, of splitting one column into a new matrix with multiple columns, you can use the reshape() function, which takes the input matrix to be reshaped and a number of output rows and columns. For example, to split the first column of matrix into 3 columns and put them into newMatrix, use the following:
newMatrix = reshape(matrix(:,1),[],3);
Note that the above code uses [] in the second argument (the number of rows argument) to mean "automatically determine number of rows".This is automatically determined based on the number of columns, which is defined in the third argument here as 3. The reshape function requires that the number of output rows * output columns be equal to input rows * input columns. So in the above case this will only work if the starting matrix has a number of rows which is divisible by 3.

MATLAB Extracting Column Number

My goal is to create a random, 20 by 5 array of integers, sort them by increasing order from top to bottom and from left to right, and then calculate the mean in each of the resulting 20 rows. This gives me a 1 by 20 array of the means. I then have to find the column whose mean is closest to 0. Here is my code so far:
RandomArray= randi([-100 100],20,5);
NewArray=reshape(sort(RandomArray(:)),20,5);
MeanArray= mean(transpose(NewArray(:,:)))
X=min(abs(x-0))
How can I store the column number whose mean is closest to 0 into a variable? I'm only about a month into coding so this probably seems like a very simple problem. Thanks
You're almost there. All you need is a find:
RandomArray= randi([-100 100],20,5);
NewArray=reshape(sort(RandomArray(:)),20,5);
% MeanArray= mean(transpose(NewArray(:,:))) %// gives means per row, not column
ColNum = find(abs(mean(NewArray,1))==min(abs(mean(NewArray,1)))); %// gives you the column number of the minimum
MeanColumn = RandomArray(:,ColNum);
find will give you the index of the entry where abs(mean(NewArray)), i.e. the absolute values of the mean per column equals the minimum of that same array, thus the index where the mean of the column is closest to 0.
Note that you don't need your MeanArray, as it transposes (which can be done by NewArray.', and then gives the mean per column, i.e. your old rows. I chucked everything in the find statement.
As suggested in the comment by Matthias W. it's faster to use the second output of min directly instead of a find:
RandomArray= randi([-100 100],20,5);
NewArray=reshape(sort(RandomArray(:)),20,5);
% MeanArray= mean(transpose(NewArray(:,:))) %// gives means per row, not column
[~,ColNum] = min(abs(mean(NewArray,1)));
MeanColumn = RandomArray(:,ColNum);

Changing numbers for given indices between matrices

I'm struggling with one of my matlab assignments. I want to create 10 different models. Each of them is based on the same original array of dimensions 1x100 m_est. Then with for loop I am choosing 5 random values from the original model and want to add the same random value to each of them. The cycle repeats 10 times chosing different values each time and adding different random number. Here is a part of my code:
steps=10;
for s=1:steps
for i=1:1:5
rl(s,i)=m_est(randi(numel(m_est)));
rl_nr(s,i)=find(rl(s,i)==m_est);
a=-1;
b=1;
r(s)=(b-a)*rand(1,1)+a;
end
pert_layers(s,:)=rl(s,:)+r(s);
M=repmat(m_est',s,1);
end
for k=steps
for m=1:1:5
M_pert=M;
M_pert(1:k,rl_nr(k,1:m))=pert_layers(1:k,1:m);
end
end
In matrix M I am storing 10 initial models and want to replace the random numbers with indices from rl_nr matrix into those stored in pert_layers matrix. However, the last loop responsible for assigning values from pert_layers to rl_nr indices does not work properly.
Does anyone know how to solve this?
Best regards
Your code uses a lot of loops and in this particular circumstance, it's quite inefficient. It's better if you actually vectorize your code. As such, let me go through your problem description one point at a time and let's code up each part (if applicable):
I want to create 10 different models. Each of them is based on the same original array of dimensions 1x100 m_est.
I'm interpreting this as you having an array m_est of 100 elements, and with this array, you wish to create 10 different "models", where each model is 5 elements sampled from m_est. rl will store these values from m_est while rl_nr will store the indices / locations of where these values originated from. Also, for each model, you wish to add a random value to every element that is part of this model.
Then with for loop I am choosing 5 random values from the original model and want to add the same random value to each of them.
Instead of doing this with a for loop, generate all of your random indices in one go. Since you have 10 steps, and we wish to sample 5 points per step, you have 10*5 = 50 points in total. As such, why don't you use randperm instead? randperm is exactly what you're looking for, and we can use this to generate unique random indices so that we can ultimately use this to sample from m_est. randperm generates a vector from 1 to N but returns a random permutation of these elements. This way, you only get numbers enumerated from 1 to N exactly once and we will ensure no repeats. As such, simply use randperm to generate 50 elements, then reshape this array into a matrix of size 10 x 5, where the number of rows tells you the number of steps you want, while the number of columns is the total number of points per model. Therefore, do something like this:
num_steps = 10;
num_points_model = 5;
ind = randperm(numel(m_est));
ind = ind(1:num_steps*num_points_model);
rl_nr = reshape(ind, num_steps, num_points_model);
rl = m_est(rl_nr);
The first two lines are pretty straight forward. We are just declaring the total number of steps you want to take, as well as the total number of points per model. Next, what we will do is generate a random permutation of length 100, where elements are enumerated from 1 to 100, but they are in random order. You'll notice that this random vector uses only a value within the range of 1 to 100 exactly once. Because you only want to get 50 points in total, simply subset this vector so that we only get the first 50 random indices generated from randperm. These random indices get stored in ind.
Next, we simply reshape ind into a 10 x 5 matrix to get rl_nr. rl_nr will contain those indices that will be used to select those entries from m_est which is of size 10 x 5. Finally, rl will be a matrix of the same size as rl_nr, but it will contain the actual random values sampled from m_est. These random values correspond to those indices generated from rl_nr.
Now, the final step would be to add the same random number to each model. You can certainly use repmat to replicate a random column vector of 10 elements long, and duplicate them 5 times so that we have 5 columns then add this matrix together with rl.... so something like:
a = -1;
b = 1;
r = (b-a)*rand(num_steps, 1) + a;
r = repmat(r, 1, num_points_model);
M_pert = rl + r;
Now M_pert is the final result you want, where we take each model that is stored in rl and add the same random value to each corresponding model in the matrix. However, if I can suggest something more efficient, I would suggest you use bsxfun instead, which does this replication under the hood. Essentially, the above code would be replaced with:
a = -1;
b = 1;
r = (b-a)*rand(num_steps, 1) + a;
M_pert = bsxfun(#plus, rl, r);
Much easier to read, and less code. M_pert will contain your models in each row, with the same random value added to each particular model.
The cycle repeats 10 times chosing different values each time and adding different random number.
Already done in the above steps.
I hope you didn't find it an imposition to completely rewrite your code so that it's more vectorized, but I think this was a great opportunity to show you some of the more advanced functions that MATLAB has to offer, as well as more efficient ways to generate your random values, rather than looping and generating the values one at a time.
Hopefully this will get you started. Good luck!

How do I find frequency of values appearing in all rows of a matrix in MATLAB?

I have a 161*32 matrix (labelled "indpic") in MATLAB and I'm trying to find the frequency of a given number appearing in a row. So I think that I need to analyse each row separately for each value, but I'm incredibly unsure about how to go about this (I'm only new to MATLAB). This also means I'm incredibly useless with loops and whatnot as well.
Any help would be greatly appreciated!
If you want to count the number of times a specific number appears in each row, you can do this:
sum(indpic == val, 2)
where indpic is your matrix (e.g image) and val is the desired value to be counted.
Explanation: checking equality of each element with the value produces a boolean matrix with "1"s at the locations of the counted value. Summing each row (i.e summing along the 2nd dimension results in the desired column vector, where each element being equal to the number of times val is repeated in the corresponding row).
If you want to count how many times each value is repeated in your image, this is called a histogram, and you can use the histc command to achieve that. For example:
histc(indpic, 1:256)
counts how many times each value from 1 to 256 appears in image indpic.
Like this,
sum(indpic(rownum,:) == 7)
obviously change 7 to whatever.
You can just write
length(find(indpic(row_num,:)==some_value))
and it will give you the number of elements equal to "some_value" in the "row_num"th row in matrix "indpic"

Preserving matrix columns using Matlab brush/select data tool

I'm working with matrices in Matlab which have five columns and several million rows. I'm interested in picking particular groups of this data. Currently I'm doing this using plot3() and the brush/select data tool.
I plot the first three columns of the matrix as X,Y, Z and highlight the matrix region I'm interested in. I then use the brush/select tool's "Create variable" tool to export that region as a new matrix.
The problem is that when I do that, the remaining two columns of the original, bigger matrix are dropped. I understand why- they weren't plotted and hence the figure tool doesn't know about them. I need all five columns of that subregion though in order to continue the processing pipeline.
I'm adding the appropriate 4th and 5th column values to the exported matrix using a horrible nested if loop approach- if columns 1, 2 and 3 match in both the original and exported matrix, attach columns 4/5 of the original matrix to the exported one. It's bad design and agonizingly slow. I know there has to be a Matlab function/trick for this- can anyone help?
Thanks!
This might help:
1. I start with matrix 1 with columns X,Y,Z,A,B
2. Using the brush/select tool, I create a new (subregion) matrix 2 with columns X,Y,Z
3. I then loop through all members of matrix 2 against all members of matrix 1. If X,Y,Z match for a pair of rows, I append A and B
from that row in matrix 1 to the appropriate row in matrix 2.
4. I become very sad as this takes forever and shows my ignorance of Matlab.
If I understand your situation correctly here is a simple way to do it:
Assuming you have a matrix like so: M = [A B C D E] where each letter is a Nx1 vector.
You select a range, this part is not really clear to me, but suppose you can create the following:
idxA,idxB and idxC, that are 1 if they are in the region and 0 otherwise.
Then you can simply use:
M(idxA&idxB&idxC,:)
and you will get the additional two columns as well.