Calculating chi-squared from Poisson-distributed data - Matlab

So I have a table of values
v:            0  1  2  3  4  5  6  7  8  9
# times obs.: 5 19 23 21 14 12  3  2  1  0
I am supposed to calculate chi-squared assuming the data fits a Poisson distribution with mean mu = 3.
I have to group values >=6 all in one bin.
I am unsure of how to plot the Poisson distribution, and most of all how to control what goes into what bin, if that makes sense.
I have plotted a histogram using histc before, but it was with random numbers that I normalized. The amount in each bin was set for me.
I am super new...sorry if this question sucks.

You use bar to plot a bar graph in matlab.
So this is what you do:
v=0:9;
f=[5 19 23 21 14 12 3 2 1 0];
fc=f(find(v<6)); % copy the counts for v = 0..5 into a new array
fc(end+1)=sum(f(v>=6)); % append the sum of the counts where v >= 6 to that array
figure
bar(v(v<=6), fc);
That should do the trick...
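The question also asked how to plot the Poisson distribution itself. A minimal sketch, assuming the Statistics Toolbox functions poisspdf and poisscdf are available (the expected count in each bin is just the Poisson probability at mean 3 scaled by the total number of observations):
% Overlay the expected Poisson counts (mean mu = 3) on the grouped bar chart
mu = 3;                                % mean given in the exercise
N  = sum(f);                           % total number of observations
Ev = N * poisspdf(0:5, mu);            % expected counts for v = 0..5
Ev(end+1) = N * (1 - poisscdf(5, mu)); % expected count for the grouped bin v >= 6
hold on
plot(0:6, Ev, 'r-o');                  % expected counts drawn over the observed bars
hold off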
Now you didn't actually ask about the chi-squared calculation. I would urge you not to put all the values of v >= 6 into one bin for that calculation, as it will give you a really bad result.
There is another technique: if you use the hist function, you can choose the bins - and Matlab will automatically put things that exceed the limits into the last bin. So if your observations were in the array Obs, you can do what was asked with:
h = hist(Obs, 0:6);
figure
bar(0:6, h)
The advantage is that you have the array h available (frequencies) for other calculations.
If you do instead
hist(Obs, 0:6)
Matlab will plot the graph for you in a single statement (but you don't have the values...)
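For the chi-squared statistic itself (which the exercise does ask for, with the grouped ">= 6" bin), here is a minimal sketch along the same lines, again assuming poisspdf/poisscdf for the expected counts:
% Chi-squared statistic for the bins 0,1,...,5 and ">= 6"
mu = 3;
O  = fc;                              % observed counts per bin (from the bar example above)
N  = sum(O);
E  = N * poisspdf(0:5, mu);           % expected counts for v = 0..5
E(end+1) = N * (1 - poisscdf(5, mu)); % expected count for the ">= 6" bin
chi2 = sum((O - E).^2 ./ E)           % the chi-squared statistic
With 7 bins and the mean given rather than estimated from the data, you would compare chi2 against a chi-squared distribution with 6 degrees of freedom (7 bins minus 1).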

Related

Matlab Dimensions Swapped in Meshgrid

After something that made me spend several hours, I have found an inconsistency in how Matlab deals with dimensions. If somebody can explain this to me, or tell me how to report it to MathWorks, please enlighten me.
For size, ones, zeros, mean, std, and most other long-established Matlab commands, the dimension arrangement follows the classical, intended standard (as reported by size for every dimension): the first dimension runs down the columns, the second dimension runs along the rows, and the following dimensions are the non-graphical indices.
>x(:,:,1)=[1 2 3 4;5 6 7 8];
>x(:,:,2)=[9 10 11 12;13 14 15 16];
>m=mean(x,1)
m(:,:,1) = 3 4 5 6
m(:,:,2) = 11 12 13 14
m=mean(x,2)
m(:,:,1) =
2.5000
6.5000
m(:,:,2) =
10.5000
14.5000
m=mean(x,3)
m = 5 6 7 8
9 10 11 12
d=size(x)
d = 2 4 2
However, for graphical commands like stream3, streamline and others that rely on the meshgrid output format, dimensions 1 and 2 are swapped: the first dimension is along the row vector and the second dimension is along the column vector, and the following (third) one is the non-graphical index.
> [x,y]=meshgrid(1:2,1:3)
x = 1 2
1 2
1 2
y = 1 1
2 2
3 3
Then, for stream3 to operate with classically arranged matrices, we should use permute(XXX,[2 1 3]) at every 3D argument:
xyz=stream3(permute(x,[2 1 3]),permute(y,[2 1 3]),permute(z,[2 1 3])...
,permute(u,[2 1 3]),permute(v,[2 1 3]),permute(w,[2 1 3])...
,xs,ys,zs);
If anybody can explain why this happens, and why this is not a bug, please do.
This behavior is not a bug because it is clearly documented as the intended behavior: https://www.mathworks.com/help/matlab/ref/meshgrid.html. Specifically:
[X,Y,Z]= meshgrid(x,y,z) returns 3-D grid coordinates defined by the vectors x, y, and z. The grid represented by X, Y, and Z has size length(y)-by-length(x)-by-length(z).
Without speaking to the original authors, the exact motivation may be a bit obscure, but I suspect it has to do with the fact that the y-axis is generally associated with the rows of an image, while x is the columns.
Columns are either "j" or "x" in the documentation, rows are either "i" or "y".
Some functions deal with spatial coordinates. The documentation will refer to "x, y, z". These functions will thus take column values before row values as input arguments.
Some functions deal with array indices. The documentation will refer to "i, j" (or sometimes "i1, i2, i3, ..., in", or using specific names instead of "i" before the dimension number). These functions will thus take row values before column values as input arguments.
Yes, this can be confusing. But if you pay attention to the names of the variables in the documentation, you will quickly figure out the right order.
With meshgrid in particular, if the "x, y, ..." order of arguments is confusing, use ndgrid instead, which takes arguments in array indexing order.
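For comparison, a small sketch of the two conventions side by side; only the orientation of the outputs differs:
% meshgrid: outputs are length(y)-by-length(x)  ("x, y" graphical order)
[Xm, Ym] = meshgrid(1:2, 1:3);   % size(Xm) is [3 2]
% ndgrid: outputs are length(x1)-by-length(x2)  (array-indexing order)
[Xn, Yn] = ndgrid(1:2, 1:3);     % size(Xn) is [2 3]
% The two are related by a transpose (or permute for N-D grids):
isequal(Xn, Xm.')                % returns true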

Vector with specific number of equally spaced values

I am familiar with the command:
(-5:0.1:5)
which creates a vector of values, equally spaced by 0.1, from -5 to 5.
However, is there a way to produce a vector of values, equally spaced from -5 to 5, such that there are, say, 100 values in the vector?
(-5:0.1:5) gives a vector with 101 values; is there a way to get a vector of 100 values without manually calculating the step size?
Yes there is. Use the function linspace. See its documentation.
linspace(1,10,10)
ans =
1 2 3 4 5 6 7 8 9 10
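For the vector in the question (100 equally spaced values from -5 to 5), that would be:
v = linspace(-5, 5, 100);   % 100 points; the step works out to 10/99 rather than 0.1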
Also, this question is a duplicate of an earlier question.

Graphing different sets of data on same graph within a ‘for’ loop MATLAB

I just have a problem with graphing different plots on the same graph within a 'for' loop. I hope someone can point me in the right direction.
I have a 2-D array, with discrete chunks of data in and amongst zeros. My data is the following:
A=
0 0
0 0
0 0
3 9
4 10
5 11
6 12
0 0
0 0
0 0
0 0
7 9.7
8 9.8
9 9.9
0 0
0 0
A chunk of data is defined as a contiguous set of data, without interruptions by a [0 0] row. So in this example, the 1st chunk of data would be
3 9
4 10
5 11
6 12
And 2nd chunk is
7 9.7
8 9.8
9 9.9
The first column is x and the second column is y. I would like to plot y as a function of x (x is the horizontal axis, y is the vertical axis). I want to plot these data sets on the same graph as a scatter graph, and put a line of best fit through the points whenever I come across a chunk of data. In this case, I will have 2 sets of points and 2 lines of best fit (because I have 2 chunks of data). I would also like to calculate the R-squared value.
The code that I have so far is shown below:
fh1 = figure;
hold all;
ah1 = gca;
% plot graphs:
for d = 1:max_number_zeros+num_rows
if sequence_holder(d,1)==0
continue;
end
c = d;
while sequence_holder(c,1)~=0
plot(ah1,sequence_holder(c,1),sequence_holder(c,num_cols),'*');
%lsline;
c =c+1;
continue;
end
end
sequence_holder is the array with the data in it. I can only plot the first set of data, with no line of best fit. I tried lsline, but that didn't work.
Can anyone tell me how to
- plot both sets of data
- draw a line of best fit and get the regression coefficient?
The first part could be done in a number of ways. I would test the second column for zeroness
zerodata = A(:,2) == 0;
which will give you a logical array of ones and zeros like [1 1 1 0 1 0 0 ...]. Then you can use this to split up your input. You could look at the diff of that array and test it for positive or negative sign. Your data starts on 0 so you won't get a transition for that one, so you'd need to think of some way to deal with that or the opposite case, unless you know for certain that it will always be one way or the other. You could just test the first element, or you could insert a known value at the start of your input array.
You will then have to store your chunks. As they may be of variable length and variable number, you wouldn't put them into a big matrix, but you still want to be able to use a loop. I would use either a cell array, where each cell in a row contains the x or y data for a chunk, or a struct array where, say, structarray(1).x and structarray(1).y hold your data values.
Then you can iterate through your struct array and call plot on each chunk separately.
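A rough sketch of that splitting idea, under the assumption that a chunk is delimited by [0 0] rows as in the question (the variable names are illustrative); it pads the zero-test so that chunks touching the start or end of the array also produce transitions:
zerodata = A(:,2) == 0;                 % true for the separator rows
d        = diff([1; zerodata(:); 1]);   % pad with 1 so edge chunks produce transitions
starts   = find(d == -1);               % first row of each chunk
stops    = find(d == 1) - 1;            % last row of each chunk
chunks   = cell(numel(starts), 1);
figure; hold all
for k = 1:numel(starts)
    chunks{k} = A(starts(k):stops(k), :);        % x in column 1, y in column 2
    plot(chunks{k}(:,1), chunks{k}(:,2), '*');   % scatter of this chunk
end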
As for fitting, you can use the fit command. It's complex and has lots of options, so you should check out the help first (type doc fit in the console to get the inline help, which has the same content as the online documentation). The short version is that you can do a simple linear fit like this:
[fitobject, gof] = fit(x, y, 'poly1');
where 'poly1' specifies that you want a first-order polynomial (i.e. a straight line). The first output is a fit object, which you can do various things with, like plotting or interpolating, and the second is a struct containing, among other things, the r^2 and adjusted r^2. The fit object also contains your fit coefficients.
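Continuing the sketch above, the per-chunk fit and R-squared could then look like this (assuming the Curve Fitting Toolbox is available; polyfit would be an alternative without it):
for k = 1:numel(chunks)
    xk = chunks{k}(:,1);
    yk = chunks{k}(:,2);
    [fitobject, gof] = fit(xk, yk, 'poly1');        % first-order polynomial (straight line)
    plot(fitobject);                                % draw the fitted line on the current axes
    fprintf('chunk %d: R^2 = %.4f\n', k, gof.rsquare);
end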

Matlab general matrix indexing for accesing several rows

Edit for clarity:
I have two matrices, p.valor (2x1000) and p.clase (1x1000). p.valor consists of random numbers spanning from -6 to 6. p.clase contains, in order, 200 ones, 200 twos and 600 threes. What I want to do is:
Plot p.valor using a different color/marker for each class given in p.clase, as in the following figure.
I first wrote this, in order to find out which locations in p.valor correspond to the 1s, 2s and 3s in p.clase:
%identify the locations of all 1,2 respective 3 in p.clase
f1=find(p.clase==1);
f2=find(p.clase==2);
f3=find(p.clase==3);
%define vectors in p.valor representing the locations of 1,2,3 in p.clase
x1=p.valor(f1);
x2=p.valor(f2);
x3=p.valor(f3);
There are 200 ones in p.clase, so f1 = 1:200 and thus x1 = p.valor(1:200). The problem is that each class label (1, and likewise 2 and 3) corresponds to TWO elements in p.valor, since p.valor has 2 rows. So even though p.clase (and thus x1) only has one row, I need to include the elements in both rows of the columns given by f1.
So the different alternatives I have tried have not yet been successful. Examples:
plot(x1(:,1), x1(:,2),'ro')
hold on
plot(x2(:,1),x2(:,2),'k.')
hold on
plot(x3(:,1),x3(:,2),'b+')
and
y1=p.valor(201:400);
y2=p.valor(601:800);
y3=p.valor(1401:2000);
scatter(x1,y1,'k+')
hold on
scatter(x2,y1,'b.')
hold on
scatter(x3,y1,'ro')
and
y1=p.valor(201:400);
y2=p.valor(601:800);
y3=p.valor(1401:2000);
plot(x1,y1,'k+')
hold on
plot(x2,y2,'b.')
hold on
plot(x3,y3,'ro')
My figures have the axes right, but the plotted values do not match the correct figure provided (see top of the question).
Ergo, my question is: how do I include the values in the second row of p.valor in my plotted figure?
I hope this is clearer!
Values from both rows simultaneously can be accessed using this syntax:
X=p.value(:,findX)
In this case, the resulting X will be a matrix with 2 rows and length(findX) columns.
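A minimal sketch of the plotting step using that syntax, with the f1/f2/f3 indices from the question; it assumes the first row of p.valor holds the x values and the second row the y values:
x1 = p.valor(:, f1);   % 2-by-200: both rows of the class-1 columns
x2 = p.valor(:, f2);   % 2-by-200 for class 2
x3 = p.valor(:, f3);   % 2-by-600 for class 3
figure; hold on
plot(x1(1,:), x1(2,:), 'ro')   % class 1
plot(x2(1,:), x2(2,:), 'k.')   % class 2
plot(x3(1,:), x3(2,:), 'b+')   % class 3
hold off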
M = magic(5)
M =
17 24 1 8 15
23 5 7 14 16
4 6 13 20 22
10 12 19 21 3
11 18 25 2 9
M2 = M(1:2, :)
M2 =
17 24 1 8 15
23 5 7 14 16
Matlab uses column-major indexing. So to get to the next row, you actually just have to add 1 to a linear index. Adding 2 to a linear index on M2 here gets you to the next column, as does adding 5 on M.
E.g. M2(3) is 24. To get to the next row you just add one, i.e. M2(4) returns 5. To get to the next column, add the number of rows, so M2(3 + 2) gets you 1. If you add the number of columns like you suggested, you just get gibberish.
So your method is very wrong. Freude's method is 100% correct, it's much easier to use subscript indexing than linear indexing for this. But I just wanted to explain why what you were trying doesn't work in Matlab. (aside from the fact that X=p.value(findX findX+1000) gives you a syntax error, I assume you meant X=p.value([findX findX+1000]))
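If you do want to work with linear indices anyway, here is a small illustration of the column-major layout (sub2ind converts row/column subscripts into the equivalent linear index):
M2 = [17 24  1  8 15;
      23  5  7 14 16];
M2(3)                          % 24: third element counting down the columns, i.e. M2(1,2)
M2(sub2ind(size(M2), 1, 2))    % also 24: subscripts (row 1, column 2) map to linear index 3
M2(3 + 2)                      % 1: adding the number of rows (2) moves one column to the right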

How to visualize binary data?

I have a dataset 6x1000 of binary data (6 data points, 1000 boolean dimensions).
I perform cluster analysis on it
[idx, ctrs] = kmeans(x, 3, 'distance', 'hamming');
And I get the three clusters. How can I visualize my result?
I have 6 rows of data, each having 1000 attributes; 3 of them should be alike or similar in a way. Applying clustering will reveal the clusters. Since I know the number of clusters,
I only need to find similar rows. Hamming distance tells us the similarity between rows, and the result is correct that there are 3 clusters.
[EDIT: for any reasonable data, kmeans will always find the asked-for number of clusters]
I want to take that knowledge
and make it easily observable and understandable without having to write huge explanations.
Matlab's example is not suitable since it deals with numerical 2D data, while my question concerns n-dimensional categorical data.
The dataset is here http://pastebin.com/cEWJfrAR
[EDIT1: how to check if clusters are significant?]
For more information please visit the following link:
https://chat.stackoverflow.com/rooms/32090/discussion-between-oleg-komarov-and-justcurious
If the question is not clear, ask for anything you are missing.
For representing the differences between high-dimensional vectors or clusters, I have used Matlab's dendrogram function. For instance, after loading your dataset into the matrix x I ran the following code:
l = linkage(x, 'average');
dendrogram(l);
and got a dendrogram plot of the six rows.
The height of the bar that connects two groups of nodes represents the average distance between members of those two groups. In this case it looks like (5 and 6), (1 and 2), and (3 and 4) are clustered.
If you would rather use the Hamming distance than the Euclidean distance (which linkage uses by default), then you can just do
l = linkage(x, 'average', {'hamming'});
although it makes little difference to the plot.
You can start by visualizing your data with a 'barcode' plot and then labeling the rows with the cluster group they belong to:
% Create figure
figure('pos',[100,300,640,150])
% Calculate patch xy coordinates
[r,c] = find(A);
Y = bsxfun(@minus,r,[.5,-.5,-.5, .5])';
X = bsxfun(@minus,c,[.5, .5,-.5,-.5])';
% plot patch
patch(X,Y,ones(size(X)),'EdgeColor','none','FaceColor','k');
% Set axis prop
set(gca,'pos',[0.05,0.05,.9,.9],'ylim',[0.5 6.5],'xlim',[0.5 1000.5],'xtick',[],'ytick',1:6,'ydir','reverse')
% Cluster
c = kmeans(A,3,'distance','hamming');
% Add lateral labeling of the clusters
nc = numel(c);
h = text(repmat(1010,nc,1),1:nc,reshape(sprintf('%3d',c),3,numel(c))');
cmap = hsv(max(c));
set(h,{'Background'},num2cell(cmap(c,:),2))
Definition
The Hamming distance between binary strings a and b is equal to the number of ones (the population count) in a XOR b (see Hamming distance).
Solution
Since you have six data strings, you could create a 6-by-6 matrix filled with the Hamming distances. The matrix would be symmetric (the distance from a to b is the same as the distance from b to a) and the diagonal is 0 (the distance from a to itself is zero).
For example, the Hamming distance between your first and second string is:
hamming_dist12 = sum(xor(x(1,:),x(2,:)));
Loop that and fill your matrix:
hamming_dist = zeros(6);
for i = 1:6
    for j = 1:6
        hamming_dist(i,j) = sum(xor(x(i,:), x(j,:)));
    end
end
(And yes, this code is redundant given the symmetry and the zero diagonal, but the computation is minimal and optimizing it is not worth the effort.)
Print your matrix as a spreadsheet in text format, and let the reader find which data string is similar to which.
This does not use your "kmeans" approach, but your added description regarding the problem helped shaping this out-of-the-box answer. I hope it helps.
Results
0 182 481 495 490 500
182 0 479 489 492 488
481 479 0 180 497 517
495 489 180 0 503 515
490 492 497 503 0 174
500 488 517 515 174 0
Edit 1:
How to read the table? The table is a simple distance table. Each row and each column represent a series of data (here a binary string). The value at the intersection of row 1 and column 2 is the Hamming distance between string 1 and string 2, which is 182. The distance between string 1 and 2 is the same as between string 2 and 1, which is why the matrix is symmetric.
Data analysis
Three clusters can readily be identified: 1-2, 3-4 and 5-6, whose Hamming distance are, respectively, 182, 180, and 174.
Within a cluster, the data has ~18% dissimilarity. By contrast, data not part of a cluster has ~50% dissimilarity (which is random given binary data).
Presentation
I recommend a Kohonen network (self-organizing map) or a similar technique to present your data in, say, 2 dimensions. In general this area is called dimensionality reduction.
You can also go a simpler way, e.g. Principal Component Analysis, but there's no guarantee you can effectively remove 998 dimensions :P
scikit-learn is a good Python package to get you started; similar packages exist for Matlab, Java, etc. I can assure you it's rather easy to implement some of these algorithms yourself.
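As a small sketch of the PCA route in Matlab (assuming the Statistics Toolbox pca function; with only 6 observations there are at most 5 non-trivial components):
% Project the 6-by-1000 binary matrix x onto its first two principal components
[~, score] = pca(double(x));                 % rows = observations, columns = attributes
figure
scatter(score(:,1), score(:,2), 60, 'filled')
text(score(:,1) + 0.2, score(:,2), cellstr(num2str((1:6)')))   % label the 6 points
xlabel('PC 1'); ylabel('PC 2')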
Concerns
I have a concern over your data set though. 6 data points is a really small number. Moreover, your attributes seem boolean at first glance; if that's the case, Manhattan distance is what you should use. I think (someone correct me if I'm wrong) Hamming distance only makes sense if your attributes are somehow related, e.g. if the attributes are actually a 1000-bit-long binary string rather than 1000 independent 1-bit attributes.
Moreover, with 6 data points you have only 2^6 = 64 possible attribute columns, which means that at least 936 of your 1000 attributes are either truly redundant or indistinguishable from redundant.
K-means almost always finds as many clusters as you ask for. To test the significance of your clusters, run K-means several times with different initial conditions and check whether you get the same clusters. If you get different clusters every time, or even just from time to time, you cannot really trust your result.
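A rough sketch of that stability check; since the cluster numbers can be permuted between runs, it compares co-membership matrices rather than the raw labels (the names here are illustrative):
% Run k-means several times with different random initialisations and
% check whether the induced partition of the 6 rows stays the same.
nruns = 10;
parts = cell(nruns, 1);
for r = 1:nruns
    idx = kmeans(x, 3, 'distance', 'hamming');   % as in the question
    parts{r} = bsxfun(@eq, idx, idx');           % 6-by-6 co-membership matrix
end
stable = all(cellfun(@(p) isequal(p, parts{1}), parts))   % true if every run agrees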
I used a barcode type visualization for my data. The code which was posted here earlier by Oleg was too heavy for my solution (the image files were over 500 kB), so I used image() to make the figures:
function barcode(A)
    B = (A+1)*2;
    image(B);
    colormap flag;
    set(gca,'Ydir','Normal')
    axis([0 size(B,2) 0 size(B,1)]);
    ax = gca;
    ax.TickDir = 'out';
end
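Usage is then simply the following (with the 6-by-1000 binary matrix in x, as elsewhere in this thread; the flag colormap maps the shifted values to white for 0 and black for 1):
barcode(x)   % draws the 6 binary rows as black-and-white stripes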