K-means clustering KDDcup99 data set error

I am using idx = kmeans(kddcup,5); to run k-means clustering on the 10% subset of KDD Cup 99 (145586 records with 41 features), asking for 5 clusters, but MATLAB R2017a gives this error:
Kmeans cannot accept complex data!
I loaded the data set into MATLAB and it has 42 columns instead of 41: the 42nd column holds the type of each row (attack, normal, ...) and is not a feature. I don't know whether I should keep that 42nd column or delete it.
I don't know if my work is correct or if there is a mistake in that code.

idx = kmeans(X,k) requires numeric input for X. From the documentation:
Data, specified as a numeric matrix. The rows of X correspond to
observations, and the columns correspond to variables.
If X is a numeric vector, then kmeans treats it as an n-by-1 data
matrix, regardless of its orientation.
Data Types: single | double
You must not pass the 42nd column of kddcup to kmeans.
You said that kddcup contains row types (attack, normal, ...). Are those char? What data type is kddcup?
Regardless, the 42nd column needs to be stripped, and the rest possibly converted into a numeric matrix if it isn't one already.
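A minimal sketch of that preparation step, assuming kddcup was loaded as a plain numeric matrix. The random stand-in data below is hypothetical and only there to make the snippet self-contained; kmeans itself needs the Statistics and Machine Learning Toolbox.

```matlab
% Hypothetical stand-in for the loaded data:
% 41 feature columns plus one label column.
rng(1);
kddcup = [rand(100,41), randi(5,100,1)];

X = kddcup(:,1:41);      % keep only the 41 features
labels = kddcup(:,42);   % set the row types aside, e.g. to evaluate clusters later

% kmeans accepts only real single or double matrices, so check before calling it.
assert(isnumeric(X) && isreal(X), 'X must be a real numeric matrix');
X = double(X);

idx = kmeans(X,5);       % one cluster index (1..5) per row of X
```

If the isreal check fails on the real data, the import step is the place to look, since kmeans reports complex input exactly as in the error above.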

Related

Matlab Histogram Function

I'm new to Matlab and for an assignment my professor is having the class write (complete really) a custom Matlab function for generating a histogram from a set of data. Essentially a new vector is being created, L which is being updated with the information from a 2D matrix M. The first column of L contains the information from M(i,j) and in a second column contains the count (total) of M(i,j) in the data set. I'm in need of some direction as to how to proceed next.
Below is where I'm at thus far:
function L = hist_count(M)
L = [ [0:255' zeros(256,1) ];
for i = 1:size(M,1)
for j = 1:size(M,2)
L(double(M(i,j))+1,2) = <<finish code here>>;
end
end
figure;
plot(L(:,1),L(:,2));
The <<finish code here>> section is where I'm stuck. I understand everything up to the point where I need to update L with the information.
Assistance is appreciated.

Note: Your initialization of your histogram L has the brackets mismatched.
Remove the second [ bracket in the code. In addition, the creation of the 0:255 vector is incorrect. Doing 0:255' transposes only the single constant 255, which means that it still creates a horizontal vector 0:255, and concatenating that with the 256-by-1 column of zeros makes the code fail. You should surround the creation of this vector with parentheses, then transpose that result. Therefore:
L = [ (0:255)' zeros(256,1) ];
Now onto your actual problem. Judging by your initialization of the histogram, there are 256 possible values so your input is most likely of type uint8, which means that the values in your data will only be from [0-255] in steps of 1. Recall that a histogram records the total number of times you see a value. In this case, you have a two column matrix where the first column tells you the value you want to examine and the second column tells you how many times you see that value in your data. Therefore, each row tells you which value you are examining in your data as well as how many times you have seen that value in your data. Note that the counts are all initialized to zero, so the logic is that every time you see a value, you need to access the right row corresponding to the data point, then increment that value by 1.
Therefore, the line is simply just accessing the current count and adding 1 to it... you then store it back:
L(double(M(i,j))+1,2) = L(double(M(i,j))+1,2) + 1;
M(i,j) is the value found at location (i,j) in your 2D data. The remaining question is why you cast the value to double and add 1. The offset is needed because MATLAB indexes rows and columns starting at 1, while your values start at 0, so value 0 belongs in row 1 of the histogram, value 1 in row 2, and so on. The cast is needed because integer types in MATLAB saturate: any result beyond the range of the type is clamped to its limit. With uint8 input, adding 1 to the value 255 in native uint8 arithmetic gives 255 again, which means that the values 254 and 255 get lumped into the same bin. Therefore, you must first convert to a type whose range extends beyond the limits of uint8 and only then add 1. double is the usual default, but any type with a larger range than uint8 is suitable.
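Putting the pieces together, here is a small sketch of the completed loop, written as a script rather than a function so it can be pasted and run directly. The test matrix is hypothetical, and the accumarray line is an alternative that was not part of the assignment; it is only there to cross-check the loop.

```matlab
% Hypothetical 8-bit test data; any uint8 matrix works.
M = uint8([0 0 255; 254 255 7]);

% Column 1: the values 0..255. Column 2: their counts, starting at zero.
L = [(0:255)' zeros(256,1)];
for i = 1:size(M,1)
    for j = 1:size(M,2)
        row = double(M(i,j)) + 1;   % convert first, then shift to 1-based indexing
        L(row,2) = L(row,2) + 1;    % increment the count for this value
    end
end

% The same counts without loops.
counts = accumarray(double(M(:)) + 1, 1, [256 1]);

% Why the cast matters: uint8 arithmetic saturates instead of reaching 256.
saturated = M(1,3) + 1;             % M(1,3) is 255, and the result stays uint8(255)
```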

How to implement data I have to svmtrain() function in MATLAB?

I have to write a script using MATLAB which will classify my data.
My data consists of 1051 web pages (rows) and 11000+ words (columns). I am holding the word occurrences in the matrix for each page. The first 230 rows are about computer science courses (to be labeled with +1) and the remaining 821 are not (to be labeled with -1). I am going to label a small part of these rows (say 30 rows) myself. Then the SVM will label the remaining unlabeled rows.
I have found that I could solve my problem using MATLAB's svmtrain() and svmclassify() methods. First I need to create SVMStruct.
SVMStruct = svmtrain(Training,Group)
Then I need to use
Group = svmclassify(SVMStruct,Sample)
But I do not know what Training and Group are. For Group, MathWorks says:
Grouping variable, which can be a categorical, numeric, or logical
vector, a cell vector of strings, or a character matrix with each row
representing a class label. Each element of Group specifies the group
of the corresponding row of Training. Group should divide Training
into two groups. Group has the same number of elements as there are
rows in Training. svmtrain treats each NaN, empty string, or
'undefined' in Group as a missing value, and ignores the corresponding
row of Training.
And for Training it is said that:
Matrix of training data, where each row corresponds to an observation
or replicate, and each column corresponds to a feature or variable.
svmtrain treats NaNs or empty strings in Training as missing values
and ignores the corresponding rows of Group.
I want to know how I can adapt my data to Training and Group. I need (at least) a small code sample.
EDIT
What I did not understand is that in order to have SVMStruct I have to run
SVMStruct = svmtrain(Training, Group);
and in order to have Group I have to run
Group = svmclassify(SVMStruct,Sample);
Also, I still do not understand what Sample should look like.
I am confused.

Training would be a matrix with 1051 rows (the webpages/training instances) and 11000 columns (the features/words). I'm assuming you want to test for the existence of each word on a webpage? In this case you could make the entry of the matrix a 1 if the word exists for a given webpage and a 0 if not.
You could initialize the matrix with Training = zeros(1051,11000); but filling the entries would be up to you, presumably done with some other code you've written.
Group is a 1-D column vector with one entry for every training instance (webpage) that tells you which of the two classes the webpage belongs to. In your case you would make the first 230 entries "+1" for computer science and the remaining 821 entries "-1" for not.
Group = zeros(1051,1); % gives you a matrix of zeros with 1051 rows and 1 column
Group(1:230) = 1; % set first 230 entries to +1
Group(231:end) = -1; % set the rest to -1
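As for Sample: it holds the rows you want classified, with the same columns as Training. A rough sketch of the whole workflow follows, assuming a release that still ships svmtrain and svmclassify (they were removed from newer releases, where fitcsvm and predict are the replacements). The small random matrix, the variable names allData, allLabels, labeledRows and unlabeledRows, and the choice of hand-labeled rows are all hypothetical stand-ins for the 1051-by-11000 word matrix.

```matlab
% Hypothetical stand-in data: rows 1-8 play the role of the +1 pages,
% rows 9-20 the role of the -1 pages.
rng(0);
allData = [rand(8,5) + 2; rand(12,5)];
allLabels = [ones(8,1); -ones(12,1)];

% Rows labeled by hand (must contain examples of both classes).
labeledRows = [1:3, 9:12];
unlabeledRows = setdiff(1:size(allData,1), labeledRows);

Training = allData(labeledRows,:);   % features of the hand-labeled rows
Group = allLabels(labeledRows);      % their +1 / -1 labels
Sample = allData(unlabeledRows,:);   % rows the SVM should label

SVMStruct = svmtrain(Training, Group);
predicted = svmclassify(SVMStruct, Sample);   % one label per row of Sample
```

The Group passed to svmtrain contains only the labels you supplied yourself, while the Group returned by svmclassify contains the predicted labels for Sample, so there is no circular dependency between the two calls.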

Preserving matrix columns using Matlab brush/select data tool

I'm working with matrices in Matlab which have five columns and several million rows. I'm interested in picking particular groups of this data. Currently I'm doing this using plot3() and the brush/select data tool.
I plot the first three columns of the matrix as X,Y, Z and highlight the matrix region I'm interested in. I then use the brush/select tool's "Create variable" tool to export that region as a new matrix.
The problem is that when I do that, the remaining two columns of the original, bigger matrix are dropped. I understand why- they weren't plotted and hence the figure tool doesn't know about them. I need all five columns of that subregion though in order to continue the processing pipeline.
I'm adding the appropriate 4th and 5th column values to the exported matrix using a horrible nested if loop approach- if columns 1, 2 and 3 match in both the original and exported matrix, attach columns 4/5 of the original matrix to the exported one. It's bad design and agonizingly slow. I know there has to be a Matlab function/trick for this- can anyone help?
Thanks!
This might help:
1. I start with matrix 1 with columns X,Y,Z,A,B
2. Using the brush/select tool, I create a new (subregion) matrix 2 with columns X,Y,Z
3. I then loop through all members of matrix 2 against all members of matrix 1. If X,Y,Z match for a pair of rows, I append A and B
from that row in matrix 1 to the appropriate row in matrix 2.
4. I become very sad as this takes forever and shows my ignorance of Matlab.

If I understand your situation correctly here is a simple way to do it:
Assuming you have a matrix like so: M = [A B C D E], where each letter is an N-by-1 vector.
You select a range (this part is not really clear to me), but suppose you can create the following:
logical vectors idxA, idxB and idxC, which are 1 for the rows that lie in the region and 0 otherwise.
Then you can simply use:
M(idxA&idxB&idxC,:)
and you will get the additional two columns as well.
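A small sketch of both routes on hypothetical data. The first part builds the logical vectors from made-up region bounds. The second part is an alternative not mentioned above: if the X,Y,Z matrix has already been exported from the brush tool, ismember with the 'rows' option can look up the matching rows of the original matrix in a single call instead of a nested loop.

```matlab
% Hypothetical 5-column matrix: X, Y, Z plus two extra columns.
M = [1 2 3 10 20;
     4 5 6 30 40;
     7 8 9 50 60;
     1 1 1 70 80];

% Route 1: logical masks for a made-up box-shaped region.
idxA = M(:,1) >= 1 & M(:,1) <= 4;
idxB = M(:,2) >= 2 & M(:,2) <= 5;
idxC = M(:,3) >= 3 & M(:,3) <= 6;
region = M(idxA & idxB & idxC, :);        % all five columns of the selected rows

% Route 2: recover columns 4 and 5 for an exported X,Y,Z matrix.
brushed = [4 5 6; 1 2 3];                 % stand-in for the exported variable
[found, loc] = ismember(brushed, M(:,1:3), 'rows');
restored = M(loc(found), :);              % matching rows of M, in brushed order
```

Note that if several rows of M share the same X,Y,Z values, ismember returns only one of them, so this assumes the coordinates identify a row uniquely.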

Numeric and Alphabetic symbols in same matrix

I'm working on a model that uses MATLAB as a graphical representation for another model. Therefore I'd like to have a matrix that can be updated with both letters and numbers. Numbers will represent a speed, while for example '-' may represent an empty section. In the MATLAB documentation and on the internet I found a lot of interesting tips, but not what I need.
Thanks in advance!

You cannot store data of numeric type (integers/floating point) and data of char type together in one matrix. However, you can use cell arrays, which are similar to matrices and can hold a different data type in each cell. Here's an example.
A={[1 2 3],'hello';'world',[4,5,6]'}
A =
[1x3 double] 'hello'
'world' [3x1 double]
Here the first cell contains a row vector, the second and third cells contain strings and the fourth cell contains a column vector. Indexing into a cell is similar to that of arrays, with one minor difference: use {} to group the indices. e.g., to access the element in the second row, first column, do
A{2,1}
ans =
world
You can also access an element of an array inside a cell like
A{2,2}(2)
ans =
5

If you're wanting to store mixtures of numeric and character type data, yoda has the correct suggestion: use cell arrays.
However, based on the example you described you may have another option. If the character entries in your matrix are there for the purpose of identifying "missing data", it may make more sense to use a purely numeric matrix containing unique values like NaN or Inf to identify data points that are empty or where data is not available.
When performing operations on your matrix, you would then have to index only elements that are finite (using, for example, ISFINITE) and perform your calculations on them. There are even some functions in the Statistics Toolbox that will perform operations ignoring NaN values. This may be a cleaner way to go since you can keep your matrix as a numeric type ('single' or 'double' precision) instead of having to mess with cell arrays.
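A minimal sketch of the NaN approach, with a hypothetical speed matrix in which NaN marks an empty section:

```matlab
% Hypothetical data: speeds per section, NaN where the section is empty.
speeds = [30 NaN 45;
          NaN 60 NaN];

valid = isfinite(speeds);             % logical mask of sections that hold a speed
meanSpeed = mean(speeds(valid));      % average over the occupied sections only

% Update only the occupied sections; empty ones stay NaN.
speeds(valid) = speeds(valid) * 2;
```

The matrix stays a plain double array throughout, so it can be passed directly to plotting and arithmetic functions.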

What's an appropriate data structure for a matrix with random variable entries?

I'm currently working in an area that is related to simulation and trying to design a data structure that can include random variables within matrices. To motivate this let me say I have the following matrix:
[a b; c d]
I want to find a data structure that will allow for a, b, c, d to either be real numbers or random variables. As an example, let's say that a = 1, b = -1, c = 2 but let d be a normally distributed random variable with mean 0 and standard deviation 1.
The data structure that I have in mind will give no value to d. However, I also want to be able to design a function that can take in the structure, simulate a uniform(0,1), obtain a value for d using an inverse CDF and then spit out an actual matrix.
I have several ideas to do this (all related to the MATLAB icdf function) but would like to know how more experienced programmers would do this. In this application, it's important that the structure is as "lean" as possible since I will be working with very very large matrices and memory will be an issue.
EDIT #1:
Thank you all for the feedback. I have decided to use a cell structure and store random variables as function handles. To save some processing time for large scale applications, I have decided to reference the location of the random variables to save time during the "evaluation" part.

One solution is to create your matrix initially as a cell array containing both numeric values and function handles to functions designed to generate a value for that entry. For your example, you could do the following:
generatorMatrix = {1 -1; 2 @randn};
Then you could create a function that takes a matrix of the above form, evaluates the cells containing function handles, then combines the results with the numeric cell entries to create a numeric matrix to use for further calculations:
function numMatrix = create_matrix(generatorMatrix)
  index = cellfun(@(c) isa(c,'function_handle'),...  % Find function handles
                  generatorMatrix);
  generatorMatrix(index) = cellfun(@feval,...        % Evaluate functions
                                   generatorMatrix(index),...
                                   'UniformOutput',false);
  numMatrix = cell2mat(generatorMatrix);  % Change from cell to numeric matrix
end
Some additional things you can do would be to use anonymous functions to do more complicated things with built-in functions or create cell entries of varying size. This is illustrated by the following sample matrix, which can be used to create a matrix with the first row containing a 5 followed by 9 ones and the other 9 rows containing a 1 followed by 9 numbers drawn from a uniform distribution between 5 and 10:
generatorMatrix = {5 ones(1,9); ones(9,1) @() 5*rand(9)+5};
And each time this matrix is passed to create_matrix it will create a new 10-by-10 matrix where the 9-by-9 submatrix will contain a different set of random values.
An alternative solution...
If your matrix can be easily broken into blocks of submatrices (as in the second example above) then using a cell array to store numeric values and function handles may be your best option.
However, if the random values are single elements scattered sparsely throughout the entire matrix, then a variation similar to what user57368 suggested may work better. You could store your matrix data in three parts: a numeric matrix with placeholders (such as NaN) where the randomly-generated values will go, an index vector containing linear indices of the positions of the randomly-generated values, and a cell array of the same length as the index vector containing function handles for the functions to be used to generate the random values. To make things easier, you can even store these three pieces of data in a structure.
As an example, the following defines a 3-by-3 matrix with 3 random values stored in indices 2, 4, and 9 and drawn respectively from a normal distribution, a uniform distribution from 5 to 10, and an exponential distribution:
matData = struct('numMatrix',[1 nan 3; nan 2 4; 0 5 nan],...
                 'randIndex',[2 4 9],...
                 'randFcns',{{@randn , @() 5*rand+5 , @() -log(rand)/2}});
And you can define a new create_matrix function to easily create a matrix from this data:
function numMatrix = create_matrix(matData)
  numMatrix = matData.numMatrix;
  numMatrix(matData.randIndex) = cellfun(@feval,matData.randFcns);
end
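Since the question mentions drawing a uniform(0,1) value and pushing it through an inverse CDF, here is a sketch of that variant of the placeholder approach. It is written as a script, the names invCdfs, normalInv, uniformInv and expInv are hypothetical, and the inverse CDFs are written out by hand so that no toolbox is needed (the normal one uses the identity norminv(u) = -sqrt(2)*erfcinv(2*u)).

```matlab
% Placeholder matrix and the linear indices of its random entries.
numMatrix = [1 nan 3; nan 2 4; 0 5 nan];
randIndex = [2 4 9];

% Hypothetical inverse CDFs, each mapping u in (0,1) to one sample.
normalInv = @(u) -sqrt(2)*erfcinv(2*u);   % standard normal
uniformInv = @(u) 5*u + 5;                % uniform on (5,10)
expInv = @(u) -log(1-u)/2;                % exponential with rate 2
invCdfs = {normalInv, uniformInv, expInv};

% One uniform draw per random entry, then apply the matching inverse CDF.
u = rand(size(randIndex));
numMatrix(randIndex) = cellfun(@(f,ui) f(ui), invCdfs, num2cell(u));
```

Only the numeric matrix, one index vector and one small cell array are stored, which keeps the memory overhead proportional to the number of random entries rather than to the size of the matrix.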

If you were using NumPy, then masked arrays would be the obvious place to start, but I don't know of any equivalent in MATLAB. Cell arrays might not be compact enough, and if you did use a cell array, then you would have to come up with an efficient way to find the non-real entries and replace them with a sample from the right distribution.
Try using a regular or sparse matrix to hold the real values, and leave it at zero wherever you want a random variable. Then alongside that, store a sparse matrix of the same shape whose non-zero entries correspond to the random variables in your matrix. If you want, the value of each entry in the second matrix can be used to indicate which distribution to draw from (e.g. 1 for uniform, 2 for normal, etc.).
Whenever you want to get a purely real matrix to work with, you iterate over the non-zero values in the second matrix to convert them to samples, and then add that matrix to your first.
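A rough sketch of this two-matrix idea, using the 2-by-2 example from the question. The variable names and the distribution codes (1 for uniform(0,1), 2 for standard normal) are hypothetical choices made for illustration.

```matlab
% Real entries; zero wherever a random variable will go.
base = [1 -1;
        2  0];

% Same shape, sparse; non-zero codes mark the random entries.
% Hypothetical coding: 1 = uniform(0,1), 2 = standard normal.
distCode = sparse([0 0;
                   0 2]);

% Visit only the non-zero entries of the code matrix.
[r, c, code] = find(distCode);
samples = zeros(size(code));
samples(code == 1) = rand(nnz(code == 1), 1);
samples(code == 2) = randn(nnz(code == 2), 1);

% Scatter the samples back into place and add them to the real part.
result = base + full(sparse(r, c, samples, size(base,1), size(base,2)));
```

Running the last four statements again gives a fresh realization without touching base or distCode.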