What kind of data/format should matlabs clustering toolbox use [duplicate] - cluster-analysis

I'm trying to cluster some data I have from the KDD 1999 cup dataset
the output from the file looks like this:
0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal.
with 48 thousand different records in that format. I have cleaned the data up and removed the text keeping only the numbers. The output looks like this now:
I created a comma delimited file in excel and saved as a csv file then created a data source from the csv file in matlab, ive tryed running it through the fcm toolbox in matlab (findcluster outputs 38 data types which is expected with 38 columns).
The clusters however don't look like clusters or its not accepting and working the way I need it to.
Could anyone help finding the clusters? Im new to matlab so don't have any experience and I'm also new to clustering.
The method:
Chose number of clusters (K)
Initialize centroids (K patterns randomly chosen from data set)
Assign each pattern to the cluster with closest centroid
Calculate means of each cluster to be its new centroid
Repeat step 3 until a stopping criteria is met (no pattern move to another cluster)
This is what I'm trying to achieve:
This is what I'm getting:
load kddcup1.dat
plot(kddcup1(:,1),kddcup1(:,2),'o')
[center,U,objFcn] = fcm(kddcup1,2);
Iteration count = 1, obj. fcn = 253224062681230720.000000
Iteration count = 2, obj. fcn = 241493132059137410.000000
Iteration count = 3, obj. fcn = 241484544542298110.000000
Iteration count = 4, obj. fcn = 241439204971005280.000000
Iteration count = 5, obj. fcn = 241090628742523840.000000
Iteration count = 6, obj. fcn = 239363408546874750.000000
Iteration count = 7, obj. fcn = 238580863900727680.000000
Iteration count = 8, obj. fcn = 238346826370420990.000000
Iteration count = 9, obj. fcn = 237617756429912510.000000
Iteration count = 10, obj. fcn = 226364785036628320.000000
Iteration count = 11, obj. fcn = 94590774984961184.000000
Iteration count = 12, obj. fcn = 2220521449216102.500000
Iteration count = 13, obj. fcn = 2220521273191876.200000
Iteration count = 14, obj. fcn = 2220521273191876.700000
Iteration count = 15, obj. fcn = 2220521273191876.700000
figure
plot(objFcn)
title('Objective Function Values')
xlabel('Iteration Count')
ylabel('Objective Function Value')
maxU = max(U);
index1 = find(U(1, :) == maxU);
index2 = find(U(2, :) == maxU);
figure
line(kddcup1(index1, 1), kddcup1(index1, 2), 'linestyle',...
'none','marker', 'o','color','g');
line(kddcup1(index2,1),kddcup1(index2,2),'linestyle',...
'none','marker', 'x','color','r');
hold on
plot(center(1,1),center(1,2),'ko','markersize',15,'LineWidth',2)
plot(center(2,1),center(2,2),'kx','markersize',15,'LineWidth',2)

Since you are new to machine-learning/data-mining, you shouldn't tackle such advanced problems. After all, the data you are working with was used in a competition (KDD Cup'99), so don't expect it to be easy!
Besides the data was intended for a classification task (supervised learning), where the goal is predict the correct class (bad/good connection). You seem to be interested in clustering (unsupervised learning), which is generally more difficult.
This sort of dataset requires a lot of preprocessing and clever feature extraction. People usually employ domain knowledge (network intrusion detection) to obtain better features from the raw data.. Directly applying simple algorithms like K-means will generally yield poor results.
For starters, you need to normalize the attributes to be of the same scale: when computing the euclidean distance as part of step 3 in your method, the features with values such as 239 and 486 will dominate over the other features with small values as 0.05, thus disrupting the result.
Another point to remember is that too many attributes can be a bad thing (curse of dimensionality). Thus you should look into feature selection or dimensionality reduction techniques.
Finally, I suggest you familiarize yourself with a simpler dataset...

Related

Problem understanding Loss function behavior using Flux.jl. in Julia

So. First of all, I am new to Neural Network (NN).
As part of my PhD, I am trying to solve some problem through NN.
For this, I have created a program that creates some data set made of
a collection of input vectors (each with 63 elements) and its corresponding
output vectors (each with 6 elements).
So, my program looks like this:
Nₜᵣ = 25; # number of inputs in the data set
xtrain, ytrain = dataset_generator(Nₜᵣ); # generates In/Out vectors: xtrain/ytrain
datatrain = zip(xtrain,ytrain); # ensamble my data
Now, both xtrain and ytrain are of type Array{Array{Float64,1},1}, meaning that
if (say)Nₜᵣ = 2, they look like:
julia> xtrain #same for ytrain
2-element Array{Array{Float64,1},1}:
[1.0, -0.062, -0.015, -1.0, 0.076, 0.19, -0.74, 0.057, 0.275, ....]
[0.39, -1.0, 0.12, -0.048, 0.476, 0.05, -0.086, 0.85, 0.292, ....]
The first 3 elements of each vector is normalized to unity (represents x,y,z coordinates), and the following 60 numbers are also normalized to unity and corresponds to some measurable attributes.
The program continues like:
layer1 = Dense(length(xtrain[1]),46,tanh); # setting 6 layers
layer2 = Dense(46,36,tanh) ;
layer3 = Dense(36,26,tanh) ;
layer4 = Dense(26,16,tanh) ;
layer5 = Dense(16,6,tanh) ;
layer6 = Dense(6,length(ytrain[1])) ;
m = Chain(layer1,layer2,layer3,layer4,layer5,layer6); # composing the layers
squaredCost(ym,y) = (1/2)*norm(y - ym).^2;
loss(x,y) = squaredCost(m(x),y); # define loss function
ps = Flux.params(m); # initializing mod.param.
opt = ADAM(0.01, (0.9, 0.8)); #
and finally:
trainmode!(m,true)
itermax = 700; # set max number of iterations
losses = [];
for iter in 1:itermax
Flux.train!(loss,ps,datatrain,opt);
push!(losses, sum(loss.(xtrain,ytrain)));
end
It runs perfectly, however, it comes to my attention that as I train my model with an increasing data set(Nₜᵣ = 10,15,25, etc...), the loss function seams to increase. See the image below:
Where: y1: Nₜᵣ=10, y2: Nₜᵣ=15, y3: Nₜᵣ=25.
So, my main question:
Why is this happening?. I can not see an explanation for this behavior. Is this somehow expected?
Remarks: Note that
All elements from the training data set (input and output) are normalized to [-1,1].
I have not tryed changing the activ. functions
I have not tryed changing the optimization method
Considerations: I need a training data set of near 10000 input vectors, and so I am expecting an even worse scenario...
Some personal thoughts:
Am I arranging my training dataset correctly?. Say, If every single data vector is made of 63 numbers, is it correctly to group them in an array? and then pile them into an ´´´Array{Array{Float64,1},1}´´´?. I have no experience using NN and flux. How can I made a data set of 10000 I/O vectors differently? Can this be the issue?. (I am very inclined to this)
Can this behavior be related to the chosen act. functions? (I am not inclined to this)
Can this behavior be related to the opt. algorithm? (I am not inclined to this)
Am I training my model wrong?. Is the iteration loop really iterations or are they epochs. I am struggling to put(differentiate) this concept of "epochs" and "iterations" into practice.
loss(x,y) = squaredCost(m(x),y); # define loss function
Your losses aren't normalized, so adding more data can only increase this cost function. However, the cost per data doesn't seem to be increasing. To get rid of this effect, you might want to use a normalized cost function by doing something like using the mean squared cost.

Can I vectorize extraction of data from a cell array in MATLAB?

I am wondering if it is possible to use a vector to access data within a cell array. I am hoping to accomplish this using a vectorized approach rather than a for-loop.
I'm attempting to run a simple microsimulation in MATLAB. I have a simulated cohort that is initially healthy, but some are at low risk for a particular disease while others are at high risk. Thus, I have an array (Starting_Cohort) that indicates each patient's risk level (first column) and their initial status (second column). In addition, I have a cell array (pstar) that indicates each patient's likelihood of transitioning between two hypothetical health states (i.e., "healthy" and "sick").
What I would like to accomplish is the following:
1) During each period of the simulation (t = 1:T), use the first column of the starting cohort to determine the patient's risk level (i.e., 1 or 2).
2) Use the risk level to access a specific row (dependent on their current health status) within a specific cell (dependent on their risk level) of the cell array.
3) Compare the resultant vector against a random draw from the uniform distribution (contained in the array "r"), and select the column number associated with the first value larger than that draw (the column number determines their health state in the subsequent period).
HOWEVER, I want to avoid doing this for one patient at a time (i.e., introducing a nested loop), as this increases the execution time of the code by an order of magnitude (the actual cohort consists of approximately 20000 patients). I've been trying to accomplish this through vectorization - that is, running the simulation over the entire patient cohort concurrently - but I hit a roadblock when trying to access data from the cell array described above.
Starting_Cohort = [1 1; 1 1; 2 1; 2 1;];
[Cohort_Size, ~] = size(Starting_Cohort);
pstar = cell(2, 1);
pstar{1, 1} = [0.75 1.00; 0.15 1.00]; pstar{2, 1} = [0.65 1.00; 0.25 1.00];
rng(1234, 'twister'); T = 5; r = rand(Cohort_Size, T);
Sim_Results = [Starting_Cohort zeros(Cohort_Size, T)];
for t = 1:T
[~, Sim_Results(:, t+2)] = max(pstar{Sim_Results(:, 1), 1} ...
(Sim_Results(:, t+1), :) > r(:, t), [], 2);
end
When I run the above code, I obtain the error "Expected one output from a curly brace or dot indexing expression, but there were 4 results." I take this to mean that my approach to extracting information from the cell array is inappropriate, although I'm not sure whether I can address this or how. I would be deeply appreciative for any assistance rendered!
UPDATE 070619: I did eventually get this to work, using the code below. Effectively, I created a string array containing the expression I wanted to apply to each row. The expression is identical for every row EXCEPT in that it contains the row index. I can then use arrayfun and evalin to produce results similar to those I was looking for. Unfortunately, my own problem involves sparse arrays, so I could not actually solve my original problem. However, I'm hoping this information may nonetheless be useful for others.
Starting_Cohort = [1 1; 1 1; 2 1; 2 1;];
[Cohort_Size, ~] = size(Starting_Cohort);
pstar = cell(2, 1);
pstar{1, 1} = [0.75 1.00; 0.15 1.00];
pstar{2, 1} = [0.65 1.00; 0.25 1.00];
rng(1234, 'twister'); T = 5; r = rand(Cohort_Size, T);
Sim_Results = [Starting_Cohort zeros(Cohort_Size, T)];
for i = 1:Cohort_Size
TEST(i, 1) = strcat("max(pstar{Sim_Results(", string(i), ", 1), 1}",...
"(Sim_Results(", string(i), ", t+1), :) > ", ...
"r(", string(i), ", t), [], 2)");
end
for t = 1:T
[~, Sim_Results(:, t+2)] = arrayfun(#(x) evalin('base', x), TEST);
end

How can I avoid constructing these grid variables in MATLAB?

I have the following calculations in two steps:
Initially, I create a set of 4 grid vectors, each spanning from -2 to 2:
u11grid=[-2:0.1:2];
u12grid=[-2:0.1:2];
u22grid=[-2:0.1:2];
u21grid=[-2:0.1:2];
[ca, cb, cc, cd] = ndgrid(u11grid, u12grid, u22grid, u21grid);
u11grid=ca(:);
u12grid=cb(:);
u22grid=cc(:);
u21grid=cd(:);
%grid=[u11grid u12grid u22grid u21grid]
sg=size(u11grid,1);
Next, I have an algorithm assigning the same index (equalorder) to the rows of grid sharing a specific structure:
U1grid=[-u11grid -u21grid -u12grid -u22grid Inf*ones(sg,1) -Inf*ones(sg,1)];
U2grid=[u21grid-u11grid -u21grid u22grid-u12grid -u22grid Inf*ones(sg,1) -Inf*ones(sg,1)];
s1=size(U1grid,2);
s2=size(U2grid,2);
%-------------------------------------------------------
%sortedU1grid gives U1grid with each row sorted from smallest to largest
%for each row i of sortedU1grid and for j=1,2,...,s1 index1(i,j) gives
%the column position 1,2,...,s1 in U1grid(i,:) of sortedU1grid(i,j)
[sortedU1grid,index1] = sort(U1grid,2);
%for each row i of sortedU1grid, d1(i,:) is a 1x(s1-1) row of ones and zeros
% d1(i,j)=1 if sortedU1grid(i,j)-sortedU1grid(i,j-1)=0 and d1(i,j)=0 otherwise
d1 = diff(sortedU1grid,[],2) == 0;
%-------------------------------------------------------
%Repeat for U2grid
[sortedU2grid,index2] = sort(U2grid,2);
d2 = diff(sortedU2grid,[],2) == 0;
%-------------------------------------------------------
%Assign the same index to the rows of grid sharing the same "ordering"
[~,~,equalorder] = unique([index1 index2 d1 d2],'rows', 'stable'); %sgx1
My question: is there a way to compute the algorithm in step 2 without the initial construction of the grid vectors in step 1? I am asking this because step 1 takes a lot of memory given that it basically generates the Cartesian product of 4 sets.
A solution should not rely on the specific content of U1grid and U2grid as that part changes in my actual code. To be more clear: U1grid and U2grid are ALWAYS derived from u11grid, ..., u21grid; however, the way in which they are derived from u11grid, ..., u21grid is slightly more complicated in my actual code from what I have reported here.
As Cris Luengo mentions in a comment, you're always going to be dealing with a trade-off between speed and memory. That said, one option you have is to only compute each of your 4 grid variables (u11grid u12grid u22grid u21grid) when needed instead of computing them once and storing them. You will save on memory but will lose speed if you are recomputing each one multiple times.
The solution I came up with involves creating an anonymous function equivalent for each of the 4 grid variables, using combinations of repmat and repelem to compute each individually instead of ndgrid to compute them all together:
u11gridFcn = #() repmat((-2:0.1:2).', 41.^3, 1);
u12gridFcn = #() repmat(repelem((-2:0.1:2).', 41), 41.^2, 1);
u22gridFcn = #() repmat(repelem((-2:0.1:2).', 41.^2), 41, 1);
u21gridFcn = #() repelem((-2:0.1:2).', 41.^3);
sg = 41.^4;
You would then use these by replacing every usage of your 4 grid variables in U1grid and U2grid with their corresponding function call. For your specific example above, this would be the new code for U1grid and U2grid (note also the use of inf(...) instead of Inf*ones(...), a small detail):
U1grid = [-u11gridFcn() ...
-u21gridFcn() ...
-u12gridFcn() ...
-u22gridFcn() ...
inf(sg, 1) ...
-inf(sg, 1)];
U2grid = [u21gridFcn()-u11gridFcn() ...
-u21gridFcn() ...
u22gridFcn()-u12gridFcn() ...
-u22gridFcn() ...
inf(sg, 1) ...
-inf(sg, 1)];
In this example, you avoid the memory needed to store the 4 grid variables, but the values for u11grid and u12grid will each be computed twice while the values for u21grid and u22grid will each be computed three times. Likely a small time trade-off for a potentially significant memory savings.
You may be able to remove the ndgrid, but it is not the memory bottleneck of this code, which is the call to unique on the large matrix A = [index1 index2 d1 d2]. The size of A is 2825761 by 22 (much larger than the grids), and it seems that unique may even internally copy A. I was able to avoid this call using
[sorted, ind] = sortrows([index1 index2 d1 d2]);
change = [1; any(diff(sorted), 2)];
uniqueInd = cumsum(change);
equalorder(ind) = uniqueInd;
[~, ~, equalorder] = unique(equalorder, 'stable');
where the last line is still the memory bottleneck and is only needed if you want the same numbering as your code produces. If any unique ordering is okay, you can skip it. You may be able to further reduce the memory footprint by carefully clearing variables are soon as they are no longer needed.

For loop with multiplication step in MATLAB

Is there any way to use a for-loop in MATLAB with a custom step? What I want to do is iterate over all powers of 2 lesser than a given number. The equivalent loop in C++ (for example) would be:
for (int i = 1; i < 65; i *= 2)
Note 1: This is the kind of iteration that best fits for-loops, so I'd like to not use while-loops.
Note 2: I'm actually using Octave, not MATLAB.
Perhaps you want something along the lines of
for i=2.^[1:6]
disp(i)
end
Except you will need to figure out the range of exponents. This uses the fact that since
a_(i+1) = a_i*2 this can be rewritten as a_i = 2^i.
Otherwise you could do something like the following
i=1;
while i<65
i=i*2;
disp(i);
end
You can iterate over any vector, so you can use vector operations to create your vector of values before you start your loop. A loop over the first 100 square numbers, for example, could be written like so:
values_to_iterate = [1:100].^2;
for i = values_to_iterate
i
end
Or you could loop over each position in the vector values_to_iterate (this gives the same result, but has the benefit that i keeps track of how many iterations you have done - this is useful if you are writing a result from each loop sequentially to an output vector):
values_to_iterate = [1:100].^2;
for i = 1:length(values_to_iterate)
values_to_iterate(i)
results_vector(i) = some_function( values_to_iterate(i) );
end
More concisely, you can write the first example as simply:
for i = [1:100].^2
i
end
Unlike in C, there doesn't have to be a 'rule' to get from one value to the next.
The vector iterated over can be completely arbitrary:
for i = [10, -1000, 23.3, 5, inf]
i
end

clustering and matlab

I'm trying to cluster some data I have from the KDD 1999 cup dataset
the output from the file looks like this:
0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal.
with 48 thousand different records in that format. I have cleaned the data up and removed the text keeping only the numbers. The output looks like this now:
I created a comma delimited file in excel and saved as a csv file then created a data source from the csv file in matlab, ive tryed running it through the fcm toolbox in matlab (findcluster outputs 38 data types which is expected with 38 columns).
The clusters however don't look like clusters or its not accepting and working the way I need it to.
Could anyone help finding the clusters? Im new to matlab so don't have any experience and I'm also new to clustering.
The method:
Chose number of clusters (K)
Initialize centroids (K patterns randomly chosen from data set)
Assign each pattern to the cluster with closest centroid
Calculate means of each cluster to be its new centroid
Repeat step 3 until a stopping criteria is met (no pattern move to another cluster)
This is what I'm trying to achieve:
This is what I'm getting:
load kddcup1.dat
plot(kddcup1(:,1),kddcup1(:,2),'o')
[center,U,objFcn] = fcm(kddcup1,2);
Iteration count = 1, obj. fcn = 253224062681230720.000000
Iteration count = 2, obj. fcn = 241493132059137410.000000
Iteration count = 3, obj. fcn = 241484544542298110.000000
Iteration count = 4, obj. fcn = 241439204971005280.000000
Iteration count = 5, obj. fcn = 241090628742523840.000000
Iteration count = 6, obj. fcn = 239363408546874750.000000
Iteration count = 7, obj. fcn = 238580863900727680.000000
Iteration count = 8, obj. fcn = 238346826370420990.000000
Iteration count = 9, obj. fcn = 237617756429912510.000000
Iteration count = 10, obj. fcn = 226364785036628320.000000
Iteration count = 11, obj. fcn = 94590774984961184.000000
Iteration count = 12, obj. fcn = 2220521449216102.500000
Iteration count = 13, obj. fcn = 2220521273191876.200000
Iteration count = 14, obj. fcn = 2220521273191876.700000
Iteration count = 15, obj. fcn = 2220521273191876.700000
figure
plot(objFcn)
title('Objective Function Values')
xlabel('Iteration Count')
ylabel('Objective Function Value')
maxU = max(U);
index1 = find(U(1, :) == maxU);
index2 = find(U(2, :) == maxU);
figure
line(kddcup1(index1, 1), kddcup1(index1, 2), 'linestyle',...
'none','marker', 'o','color','g');
line(kddcup1(index2,1),kddcup1(index2,2),'linestyle',...
'none','marker', 'x','color','r');
hold on
plot(center(1,1),center(1,2),'ko','markersize',15,'LineWidth',2)
plot(center(2,1),center(2,2),'kx','markersize',15,'LineWidth',2)
Since you are new to machine-learning/data-mining, you shouldn't tackle such advanced problems. After all, the data you are working with was used in a competition (KDD Cup'99), so don't expect it to be easy!
Besides the data was intended for a classification task (supervised learning), where the goal is predict the correct class (bad/good connection). You seem to be interested in clustering (unsupervised learning), which is generally more difficult.
This sort of dataset requires a lot of preprocessing and clever feature extraction. People usually employ domain knowledge (network intrusion detection) to obtain better features from the raw data.. Directly applying simple algorithms like K-means will generally yield poor results.
For starters, you need to normalize the attributes to be of the same scale: when computing the euclidean distance as part of step 3 in your method, the features with values such as 239 and 486 will dominate over the other features with small values as 0.05, thus disrupting the result.
Another point to remember is that too many attributes can be a bad thing (curse of dimensionality). Thus you should look into feature selection or dimensionality reduction techniques.
Finally, I suggest you familiarize yourself with a simpler dataset...