Document Clustering in Matlab

I am working on code for document clustering in MATLAB. My document is:
'The first step in analyzing the requirements is to construct an object model.
It describes real world object classes and their relationships to each other.
Information for the object model comes from the problem statement, expert knowledge of the application domain, and general knowledge of the real world.
Britvic plc is one of the leading soft drinks manufacturers of soft drinks in the Beverages Sector functioning in Europe with its distribution branches in Great Britain, Ireland and France. '
As seen, the paragraphs contain different classes of data. The following is my main program:
global n;
n=1;
%read the whole of doc1.txt into a string (fileread opens and closes the file itself)
text=fileread('doc1.txt');
i=0;
%now text has the content of doc1 as a string. Next split the sentences
%into words. For that we call the clustering function
[C1,C2]=clustering(text)
And below comes the code for 'clustering':
function [C1,C2]=clustering(text)
global C1;
text1=strsplit(text,'.');
[rt1,ct1]=size(text1);
for i=1:(ct1-1)
var=text1{i};
vv=strsplit(var,' ');
text2=setdiff(vv,{'this','you','is','an','with','as','well','like','and','to','it','on','off','of','in','mine','your','yours','these','this','will','would','shall','should','or','a','about','all','also','am','are','but','of','for','by','my','did','do','her','his','the','him','she','he','they','that','when','we','us','not','them','if','in','just','may','not'},'stable');
[rt2,ct2]=size(text2);
for r=1:ct2
tmar=porterStemmer(text2{r});
mapr{i,r}=tmar;
end
end
[mr,mc]=size(mapr);
mapr
A=zeros(mr,mr);
for i=1:mr
for j=1:mc
for m=i+1:mr
for k=1:mc
if ~isempty(mapr{i,j})
%if(~(mapr{i,j}=='[]'))
%mapr(i,j)
if strcmp(mapr{i,j},mapr{m,k})
p=mapr{i,j};
str=sprintf('Sentences %d and %d match',i,m)
str;
str1=sprintf('And the word is : %s ',p)
str1;
A(i,m)=1;
A(m,i)=1;
end
end
end
end
end
end
sprintf('Adjacency matrix is:')
A
sprintf('The corresponding diagonal matrix is:')
[ar,ac]=size(A);
for i=1:ar
B(i)=0;
for j=1:ac
B(i)=B(i)+A(i,j);
end
end
[br,bc]=size(B);
D=zeros(bc,bc);
for i=1:bc
D(i,i)=B(i);
end
D
sprintf('The similarity matrix is:')
C=D-A
[V,D]=eig(C,'nobalance')
F=inv(V);
V*D*F
%mvar =no of edges/total degree of vertices
no_of_edges=0;
for i=1:ar
for j=1:ac
if(i<=j)
no_of_edges=no_of_edges+A(i,j);
end
end
end
no_of_edges;
tdv=0;
for i=1:bc
tdv=tdv+B(i);
end
tdv;
mvar=no_of_edges/tdv
[dr,dc]=size(D);
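temp=abs(D(1,1)-mvar);
x=D(1,1);
q=1; %index of the closest eigenvalue so far; otherwise q is undefined when D(1,1) is already the closest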
for i=2:dc
temp2=abs(D(i,i)-mvar);
if temp>temp2
temp=temp2;
x=D(i,i);
q=i
end
end
x
[vr,vc]=size(V);
for i=1:vr
V(i,q);
Track(i)=V(i,q);
end
sprintf('Eigen vectors corresponding to the closest value:')
Track
j=1;
m=1;
C1=' ';
C2=' ';
for i=1:vr
if(Track(i)<0)
C1=strcat(C1,text1{1,i},'.');
else
C2=strcat(C2,text1{1,i},'.');
end
end
I can generate the initial two clusters from the document. However, I want the clustering process to continue on the generated clusters, producing more and more subclusters of each until there is no change in the population generated. Can somebody help me implement this so that I can not only generate the clusters but also keep track of them for further processing? Thanks in advance.
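One possible way to keep splitting and to keep track of the results is sketched below. It is only a sketch under assumptions: it takes the sentence adjacency matrix A built in clustering, splits each group by the sign of the Fiedler vector (the eigenvector of the second-smallest Laplacian eigenvalue) instead of the mvar-based eigenvalue choice used above, and stops when a group has at most maxLeaf sentences or when a split leaves one side empty. The names recursive_bisect, bisect and maxLeaf are hypothetical.

```matlab
function clusters = recursive_bisect(A, maxLeaf)
% Hypothetical helper: split sentence groups again and again until none changes.
% A       symmetric 0/1 adjacency matrix, one row per sentence
% maxLeaf groups with this many sentences or fewer are not split further (assumed rule)
% clusters cell array; each cell holds the sentence indices of one final cluster
queue = {1:size(A,1)};      % groups still waiting to be split
clusters = {};              % groups that are final
while ~isempty(queue)
    idx = queue{1};
    queue(1) = [];
    if numel(idx) <= max(maxLeaf, 1)
        clusters{end+1} = idx;            %#ok<AGROW> small enough: keep as is
        continue
    end
    [left, right] = bisect(A(idx, idx));
    if isempty(left) || isempty(right)
        clusters{end+1} = idx;            %#ok<AGROW> population did not change
    else
        queue{end+1} = idx(left);         %#ok<AGROW> map local indices back to sentence numbers
        queue{end+1} = idx(right);        %#ok<AGROW>
    end
end
end

function [left, right] = bisect(A)
% Split one group by the sign of the Fiedler vector of its graph Laplacian.
L = diag(sum(A, 2)) - A;        % degree matrix minus adjacency matrix
[V, E] = eig(L);
[~, order] = sort(diag(E));     % ascending eigenvalues
f = V(:, order(2));             % eigenvector of the second-smallest eigenvalue
left  = find(f < 0).';
right = find(f >= 0).';
end
```

With the sentences in text1, the text of cluster k could then be rebuilt with strjoin(text1(clusters{k}), '.'). Keeping indices rather than concatenated strings is what makes the clusters easy to process further.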

Related

How to plot Phasor in MATLAB

I have been working on data compression with the SDT algorithm. I was given the task of compressing data with 20001 points, and I successfully compressed it to 944 points. Now I want to plot the phasors in MATLAB. How do I do it?
I used the plot3 command but unfortunately could not get the desired result.
My code:
clear
clc
load('s.mat');
[N,~]=size(a(:,1));
b=[]
s=1; %storage point
a(s,6)=a(1,2); %vs, store the first phasor
a(s,5)=a(1,1); %ts
a(s,7)=a(1,3); %alphas
for n=3:1:N
p=n-1;
vp=a(p,2);
tp=a(p,1);
alphap=a(p,3);
vn=a(n,2);
tn=a(n,1);
alphan=a(n,3);
vs=a(s,6); %vs
ts=a(s,5);%ts
alphas=a(s,7);%alphas
for i=s+1:n-1 %Condition for Vi and current points
ti=a(i,1);
alphai=a(i,3);
vi=a(i,2)*(cos(alphai)+1i*sin(alphai));%Phasors (Vi)
vcomp=((vn-vs)/(tn-ts))*(ti-ts)+vs;
alphacomp=((alphan-alphas)/(tn-ts))*(ti-ts)+alphas;
vcpx=vcomp*(cos(alphacomp)+1i*sin(alphacomp));%Phasors (Vcomp)
%e_TVE condition, with the polar values converted into rectangular components
e_TVE=norm((vcpx-vi),2)/norm(vi,2);%etve condition
a(i,4)=e_TVE;
if e_TVE<=0.001
continue;
else
s=p;
a(s,6)=vp; %vs
a(s,5)=tp;%ts
a(s,7)=alphap;%alps
% a(s,8)=a(p,6)*(cos(a(p,7))+1i*sin(a(p,7))); %Phasors of Storage point
break;
end
end
end
a(n,6)=vn; %vs,reserve the last data
a(n,5)=tn;%ts
a(n,7)=alphan;%alps
x=1;
for y=1:1:N %N is the number of rows of a (20001 for this data set)
if a(y,5)>0 || a(y,6)>0
b(x,1)=a(y,5);
b(x,2)=a(y,6);
b(x,3)=a(y,7);
x=x+1;
end
end
Figure 1 is the original (Uncompressed data):
Figure 2 is the compressed data result:
Figure 3 is the desired result:
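Since the figures are not available here, the following is only a guess at what a phasor plot of the compressed data could look like. It assumes that the columns of b are time, magnitude and phase angle in radians, as in the code above, and uses stand-in values so that it runs on its own.

```matlab
% Stand-in data so the sketch runs on its own; in the question this would be b.
t     = (0:0.1:1).';              % column 1: time stamps
mag   = ones(size(t));            % column 2: magnitudes
alpha = 2*pi*t;                   % column 3: phase angles in radians (assumed unit)
b = [t, mag, alpha];

z = b(:,2) .* exp(1i*b(:,3));     % complex phasors, same as mag.*(cos(alpha)+1i*sin(alpha))

figure;
plot3(b(:,1), real(z), imag(z), '.-');   % path of the phasor tip over time
xlabel('time'); ylabel('real part'); zlabel('imaginary part');
grid on;

figure;
compass(z);                       % each phasor drawn as an arrow from the origin
```

plot3 shows how the phasor evolves over time, while compass shows all phasors in one complex plane; which of the two is closer to the desired figure depends on what that figure shows.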

How do I adjust this code so that I can enter how many runs I want and it will store each run in a matrix?

I have created this code to generate one set of lottery numbers, but I am trying to make it so that the user can enter how many sets they want (input n) and it prints them out as one matrix of size n-by-6. I tried a few options from online suggestions, but to no avail. I put the initial for i=1:1:n at the beginning, but I do not know how to store each run in a growing matrix. Right now it still generates just one set.
function lottery(n)
for i=1:1:n
xlow=1;
xhigh=69;
m=5;
i=1;
while (i<=m)
lottonum(i)=floor(xlow+rand*(xhigh-xlow+1));
flag=0;
for j=1:i-1
if (lottonum(i)==lottonum(j))
flag=1;
end
end
if flag==0
i=i+1;
end
end
ylow=1;
yhigh=26;
m=1;
lottonum1=floor(ylow+rand*(yhigh-ylow+1));
z = horzcat(lottonum, lottonum1);
end
disp('The lotto numbers picked are')
fprintf('%g ',z)
disp (' ')
The problem is that you are not storing or displaying each newly generated set, only the last one. To solve this, initialize z with NaNs or zeros, and then store each set in its own row of z by indexing, e.g. z(i,:) = horzcat(lottonum, lottonum1).
However, you are already using i as the iterator in the while loop, so you should use another variable for the outer loop, e.g. k.
You can also make z an output of the function, so you can use this matrix in some other part of a program.
function z = lottery(n)
% init z
z = NaN(n,6);
for k = 1:n
xlow=1;
xhigh=69;
m=5;
i=1;
while (i<=m)
lottonum(i)=floor(xlow+rand*(xhigh-xlow+1));
flag=0;
for j=1:i-1
if (lottonum(i)==lottonum(j))
flag=1;
end
end
if flag==0
i=i+1;
end
end
ylow=1;
yhigh=26;
lottonum1 = floor(ylow+rand*(yhigh-ylow+1));
z(k,:) = horzcat(lottonum, lottonum1); % put the numbers in a row of z
end
disp('The lotto numbers picked are')
disp(z) % prettier display than fprintf in this case.
disp (' ')
end
The nice answer from rinkert corrected your basic mistakes (like trying to modify your loop iterator i from within the loop, which does not work), and answered your question on how to store all your results.
This left you with working code; however, I'd like to propose a different way to look at it.
The proposed architecture divides the tasks into separate functions:
One function, draw_numbers, which draws N numbers randomly (and does only that).
One function, draw_lottery, which calls the previous function as many times as needed (your n), collects the results and displays them.
draw_lottery
This architecture has the benefit of greatly simplifying your main function. It can now be as simple as:
function Draws = draw_lottery(n)
% define your draw parameters
xmin = 1 ; % minimum number drawn
xmax = 69 ; % maximum number drawn
nballs = 5 ; % number of balls to draw
% pre allocate results
Draws = zeros( n , nballs) ;
for iDraw=1:1:n
% draw "nballs" numbers
thisDraw = draw_numbers(xmin,xmax,nballs) ;
% add them to the result matrix
Draws(iDraw,:) = thisDraw ;
end
disp('The lotto numbers picked are:')
disp (Draws)
disp (' ')
end
draw_numbers
Instead of using an intricate set of if conditions and several iterators (i/m/k) to branch the program flow, I made the function recursive. This means the function may have to call itself a number of times until a condition is satisfied. In our case the condition is to have a set of nballs unique numbers.
The function:
(1) draws N integer numbers randomly, using randi.
(2) removes duplicate numbers (if any), using unique.
(3) counts how many unique numbers are left, Nu.
(4a) if Nu = N => exits the function.
(4b) if Nu < N => calls itself again, sending the existing Nu numbers and asking to draw an additional N-Nu numbers to add to the collection. Then back to step (2).
In code, it looks like this:
function draw = draw_numbers(xmin,xmax,nballs,drawn_set)
% check if we received a partial set
if nargin == 4
% if yes, adjust the number of balls to draw
n2draw = nballs - numel(drawn_set) ;
else
% if not, make a full draw
drawn_set = [] ;
n2draw = nballs ;
end
% draw "n2draw" numbers between "xmin" and "xmax"
% and concatenate these new numbers with the partial set
d = [drawn_set , randi([xmin xmax],1,n2draw)] ;
% Remove duplicate
drawn_set = unique(d) ;
% check if we have some more balls to draw
if numel(drawn_set) < nballs
% draw some more balls
draw = draw_numbers(xmin,xmax,nballs,drawn_set) ;
else
% we're good to go, assign output and exit function
draw = drawn_set ;
end
end
You can put both functions in the same file if you want.
I encourage you to look at the documentation of the two MATLAB built-in functions used:
randi
unique
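For comparison, here is a minimal sketch of a different technique, sampling without replacement with randperm instead of the recursive randi/unique approach above. The function name draw_numbers_randperm is hypothetical; it takes the same three parameters as draw_numbers.

```matlab
function draw = draw_numbers_randperm(xmin, xmax, nballs)
% Hypothetical alternative to draw_numbers: sample without replacement.
% randperm(n,k) returns k unique integers from 1..n, so adding xmin-1
% maps them onto the range xmin..xmax.
draw = randperm(xmax - xmin + 1, nballs) + xmin - 1;
draw = sort(draw);   % sorted, like the output of unique in draw_numbers
end
```

Because the numbers are unique by construction, no retry or recursion is needed. It could be called from draw_lottery in place of draw_numbers.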

CBIR average rank functions

Here is my code for computing the average rank for each image out of 1000 images. (We assume every 100 images form one category, e.g. 1-100, 101-200, ...)
for z=1:1000
H{z}=imread(strcat(int2str(z-1),'.jpg'));
Im_red=H{z}(:,:,1);
Im_green= H{z}(:,:,2);
Im_blue= H{z}(:,:,3);
hist_im1=zeros(1,256);
[h,w]=size(Im_red);
for i=1:h
for j=1:w
value_pixel1=double(Im_red(i,j))+1; %convert first: uint8 arithmetic saturates at 255, so bin 256 would never be reached
hist_im1(value_pixel1)=hist_im1(value_pixel1)+1;
end
end
hist_im2=zeros(1,256);
[h,w]=size(Im_green);
for i=1:h
for j=1:w
value_pixel2=double(Im_green(i,j))+1; %convert first: uint8 arithmetic saturates at 255
hist_im2(value_pixel2)=hist_im2(value_pixel2)+1;
end
end
hist_im3=zeros(1,256);
[h,w]=size(Im_blue);
for i=1:h
for j=1:w
value_pixel3 = double(Im_blue(i,j)) + 1; %convert first: uint8 arithmetic saturates at 255
hist_im3(value_pixel3) = hist_im3(value_pixel3)+1;
end
end
Q{z}=[hist_im1, hist_im2, hist_im3];
end
for r=1:1000
for i=1:1000
a(r,i)=matchfunction(Q{r},Q{i});
end
for j=1:1000
b(r,j)=j;
end
L=[a;b];
end
for r=1:1000
B=[L(r,:);L(r+1000,:)];
[d1,d2] = sort(B(1,:),'descend');
C=B(:,d2);
aaa=C(1,:);
bbb=C(2,:);
ccc=zeros(1,1000);
for g=1:1000
if ((bbb(g)>=fix((r-1)/100)*100+1) & (bbb(g)<=ceil(r/100)*100))
ccc(g)=g;
end
end
ddd=sum(ccc(g))/100;
s(r)=ddd
end
avgrank(1)=sum(s(1:100))/100
avgrank(2)=sum(s(101:200))/100
avgrank(3)=sum(s(201:300))/100
avgrank(4)=sum(s(301:400))/100
avgrank(5)=sum(s(401:500))/100
avgrank(6)=sum(s(501:600))/100
avgrank(7)=sum(s(601:700))/100
avgrank(8)=sum(s(701:800))/100
avgrank(9)=sum(s(801:900))/100
avgrank(10)=sum(s(901:1000))/100
xCoordinates = 1:10;
plot(xCoordinates,avgrank,'b:*');
The match function computes the match value of two images, taking their two histograms as input. You can see that Q{z} is the histogram. I think my problem is within here:
for g=1:1000
if ((bbb(g)>=fix((r-1)/100)*100+1) & (bbb(g)<=ceil(r/100)*100))
ccc(g)=g;
end
end
This is how I calculate the rank. Since g runs from 1 to 1000 over the sorted results, g itself is the rank we need whenever
(bbb(g)>=fix((r-1)/100)*100+1) & (bbb(g)<=ceil(r/100)*100)
holds for that g, so I store it in ccc(g).
But why, after I run this program, is the value of ccc one thousand 0s? Why 0? Is there anything wrong with my way of getting the rank through ccc? And are there more errors in my code? I just get the average ranks and ccc all 0 but cannot figure out why. Thanks in advance!
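One thing that can be said from the code alone: sum(ccc(g)) is evaluated after the inner loop has finished, when g is 1000, so it reads the single element ccc(1000) rather than the whole vector. A minimal sketch of the rank computation for one query follows; it assumes that larger match values mean more similar (as the 'descend' sort suggests). The helper name mean_category_rank and its parameter perCat are hypothetical.

```matlab
function s = mean_category_rank(scores, r, perCat)
% Hypothetical helper for one query image r.
% scores  row of match values of image r against every image (larger = more similar, assumed)
% perCat  number of images per category (100 in the question)
[~, order] = sort(scores, 'descend');         % order(g) = index of the image at rank g
lo = floor((r - 1) / perCat) * perCat + 1;    % first image index of r's category
hi = lo + perCat - 1;                         % last image index of r's category
ranks = find(order >= lo & order <= hi);      % ranks at which same-category images appear
s = sum(ranks) / perCat;                      % sum over all of them, not over one element
end
```

In the loop of the question this would correspond to s(r) = mean_category_rank(a(r,:), r, 100).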

How to delete decimals that have repeating decimals

I'm working on code to process data for a star from a public catalogue. I've already got arrays for every variable that is available:
fid=fopen('000006+2553.txt','r');
i=1;
while 1
tline=fgetl(fid);
if ~ischar(tline), break, end
A{i}=tline;
i=i+1;
end
k=1;
for j=1:1:length(A)
if length(A{j}) > 50 && length(A{j})<=92
B{k}=A{j};
k=k+1;
end
end
m=1;
for l=1:1:length(B)
C=strread(B{l},'%s','delimiter',' ');
HJD(m)=str2num(C{1});
MAG_3(m)=str2num(C{2});
MAG_0(m)=str2num(C{3});
MAG_1(m)=str2num(C{4});
MAG_2(m)=str2num(C{5});
MAG_4(m)=str2num(C{6});
MER_3(m)=str2num(C{7});
MER_0(m)=str2num(C{8});
MER_1(m)=str2num(C{9});
MER_2(m)=str2num(C{10});
MER_4(m)=str2num(C{11});
FRAME(m)=str2num(C{13});
m=m+1;
end
My problem is that some of the values in the arrays are repeating decimals, like 29.99999 and 99.99999. Since these numbers are the result of saturation of the sensor, they are wrong data and must be eliminated. Is there any way I can tell MATLAB to delete these particular numbers? Any help would be appreciated.
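A minimal sketch of one way to do this with logical indexing is shown below. It assumes the saturation markers are known in advance (the two values named above) and uses stand-in arrays; the names sentinels, tol and bad are illustrative only.

```matlab
% Stand-in data; in the question these would be HJD, MAG_3, ... read from the file.
HJD   = [1.1     2.2      3.3     4.4];
MAG_3 = [12.345  29.99999 11.870  99.99999];

sentinels = [29.99999 99.99999];   % assumed list of saturation markers
tol = 1e-4;                        % tolerance, since parsed decimals are rarely exact

bad = false(size(MAG_3));
for v = sentinels
    bad = bad | (abs(MAG_3 - v) < tol);   % flag entries equal to a marker within tol
end

HJD(bad)   = [];    % delete the same positions from every array so they stay aligned
MAG_3(bad) = [];
```

If the positions should be kept but ignored in later calculations, assigning NaN (MAG_3(bad) = NaN) instead of deleting is an alternative.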

Create image from Huffman code - Matlab

I have a project about image compression in MATLAB. So far I have successfully applied Huffman encoding to the image, which gives me a vector of binary codes. After that I run Huffman decoding and get a vector which contains the elements of the compressed image. My problem is that I cannot find out how to reconstruct the image from this vector and create the image file.
Any help would be appreciated.
Update
Based on Ben A.'s help I have made progress, but I still have some issues.
To be more exact: I have an image matrix. After finding the unique symbols (elements) in this image matrix, I calculate the probabilities and then, with this function:
function [h,L,H]=Huffman_code(p,opt)
% Huffman code generator gives a Huffman code matrix h,
% average codeword length L & entropy H
% for a source with probability vector p given as argin(1)
zero_one=['0'; '1'];
if nargin>1&&opt>0, zero_one=['1'; '0']; end
if abs(sum(p)-1)>1e-6
fprintf('\n The probabilities in p does not add up to 1!');
end
M=length(p); N=M-1; p=p(:); % Make p a column vector
h={zero_one(1),zero_one(2)};
if M>2
pp(:,1)=p;
for n=1:N
% To sort in descending order
[pp(1:M-n+1,n),o(1:M-n+1,n)]=sort(pp(1:M-n+1,n),1,'descend');
if n==1, ord0=o; end % Original descending order
if M-n>1, pp(1:M-n,n+1)=[pp(1:M-1-n,n); sum(pp(M-n:M-n+1,n))]; end
end
for n=N:-1:2
tmp=N-n+2; oi=o(1:tmp,n);
for i=1:tmp, h1{oi(i)}=h{i}; end
h=h1; h{tmp+1}=h{tmp};
h{tmp}=[h{tmp} zero_one(1)];
h{tmp+1}=[h{tmp+1} zero_one(2)];
end
for i=1:length(ord0), h1{ord0(i)}=h{i}; end
h=h1;
end
L=0;
for n=1:M, L=L+p(n)*length(h{n}); end % Average codeword length
H=-sum(p.*log2(p)); % Entropy by Eq.(9.1.4)
I calculate the Huffman codes for the image.
Now I use this function:
function coded_seq=source_coding(src,symbols,codewords)
% Encode a data sequence src based on the given (symbols,codewords).
no_of_symbols=length(symbols); coded_seq=[];
if length(codewords)<no_of_symbols
error('The number of codewords must equal that of symbols');
end
for n=1:length(src)
found=0;
for i=1:no_of_symbols
if src(n)==symbols(i), tmp=codewords{i}; found=1; break; end
end
if found==0, tmp='?'; end
coded_seq=[coded_seq tmp];
end
where I pass my image (matrix) as src and get a coded sequence for my image.
The last one is this function:
function decoded_seq=source_decoding(coded_seq,h,symbols)
% Decode a coded_seq based on the given (codewords,symbols).
M=length(h); decoded_seq=[];
while ~isempty(coded_seq)
lcs= length(coded_seq); found=0;
for m=1:M
codeword= h{m};
lc= length(codeword);
if lcs>=lc&codeword==coded_seq(1:lc)
symbol=symbols(m); found=1; break;
end
if found==0, symbol='?'; end
end
decoded_seq=[decoded_seq symbol];
coded_seq=coded_seq(lc+1:end);
end
This is used to decode the coded sequence. The problem is that the sequence I finally get back is a 1x400 matrix, where I should get a 225x400 one, which is my image dimensions.
Am I missing something? Should I maybe replace something, because I have a matrix and not a number sequence (for which the code is written)?
You might want to take a look at this:
http://www.mathworks.com/matlabcentral/answers/2158-huffman-coding-and-decoding-for-image-jpeg-bmp
This seems like it's right up your alley. It should ultimately lead you to here:
http://www.mathworks.com/matlabcentral/fileexchange/26384-ppt-for-chapter-9-of-matlabsimulink-for-digital-communication
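One detail in source_coding is worth noting: it loops over 1:length(src), and length of a matrix is the size of its largest dimension, which is 400 for a 225x400 image. The sketch below uses a small stand-in image to show flattening the matrix before encoding and reshaping the decoded sequence afterwards. The encode/decode step itself is skipped, and the output file name is hypothetical.

```matlab
img = uint8(magic(4));            % stand-in 4x4 "image"; the question's is 225x400

% length() of a matrix is its largest dimension, so a loop over 1:length(img)
% visits only 4 of the 16 pixels here.
nVisited = length(img);
nPixels  = numel(img);

src = img(:).';                   % flatten column by column into a 1x16 sequence
% ... encode src and decode it again; decoded stands in for the decoder output
decoded = src;

% undo the flattening in the same column-major order and restore the pixel type
rec = reshape(uint8(decoded), size(img));
imwrite(rec, 'reconstructed.png');   % hypothetical output file name
```

Passing the flattened src to source_coding, or looping over 1:numel(src) inside it, would make the coder see every pixel.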