I am plotting two histograms of two distributions in one figure in MATLAB. However, the two histograms do not have the same bin width even though I use the same number of bins. How can I make the two histograms have the same bin width?
My code is simply this:
a = distribution one
b = distribution two
nbins = number of bins
[c,d] = hist(a,nbins);
[e,f] = hist(b,nbins);
%Plotting
bar(d,c);hold on;
bar(f,e);hold off;
This can be done by using the bin centres from one call to hist as the bins for the other,
for example:
[aCounts,aBins] = hist(a,nBins);
[bCounts,bBins] = hist(b,aBins);
Note that all(aBins==bBins) returns 1.
This method, however, will lose information when the min and max values of the two data sets are not similar*. One simple solution is to create the bins based on the combined data:
[~ , bins] = hist( [a(:);b(:)] ,nBins);
aCounts = hist( a , bins );
bCounts = hist( b , bins );
*If the ranges are vastly different, it may be better to create the vector of bin centres manually.
(After re-reading the question) If what you want to control is the bin width, rather than using the same bins, creating the bin centres manually is probably best.
To do this, create a vector of bin centres to pass to hist.
For example (note that the number of bins is only enforced for one set of data here):
aBins = linspace( min(a(:)), max(a(:)), nBins );
binWidth = aBins(2)-aBins(1);
bBins = min(b(:)):binWidth:max(b(:))+binWidth/2;
and then use
aCounts = hist( a , aBins );
bCounts = hist( b , bBins );
Use histcounts with the 'BinWidth' option:
https://www.mathworks.com/help/matlab/ref/histcounts.html
For example:
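data1 = randn(1000,1)*10;
data2 = randn(1000,1);
[hist1,edges1] = histcounts(data1, 'BinWidth', 10);
[hist2,edges2] = histcounts(data2, 'BinWidth', 10);
% plot the counts at the bin centres so both histograms share one x-axis
bar(edges1(1:end-1)+5, hist1); hold on;
bar(edges2(1:end-1)+5, hist2); hold off;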
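As a minimal sketch of the same idea with explicit edges (the sample vectors and the names loEdge, hiEdge and edges are made up for illustration and do not come from the answers above), passing one shared edge vector to histcounts guarantees that both sets of counts refer to identical bins:

```matlab
% Two small made-up samples with different ranges
a = [1 2 2 3 9];
b = [4 5 5 6];

% Build one edge vector that covers both samples with a fixed bin width
binWidth = 2;
loEdge = floor(min([a(:); b(:)]));
hiEdge = ceil(max([a(:); b(:)]));
edges  = loEdge:binWidth:(hiEdge + binWidth);   % extra bin so the maximum is never cut off

% Count both samples against the same edges
aCounts = histcounts(a, edges);
bCounts = histcounts(b, edges);

% Plot at the bin centres
centres = edges(1:end-1) + binWidth/2;
bar(centres, aCounts); hold on;
bar(centres, bCounts); hold off;
```

Because the edges are computed once from the combined range, neither sample decides the binning on its own.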
The behavior of hist is different when the 2nd argument is a vector instead of a scalar.
Instead of specifying a number of bins, specify the bin centers using a vector, as demonstrated in the documentation (see "Specify Bin Intervals"):
rng(0,'twister')
data1 = randn(1000,1)*10;
rng(1,'twister')
data2 = randn(1000,1);
figure
xvalues1 = -40:40;
[c,d] = hist(data1,xvalues1);
[e,f] = hist(data2,xvalues1);
%Plotting
bar(d,c,'b');hold on;
bar(f,e,'r');hold off;
This results in:
Suppose that I have generated some data in matlab as follows:
n = 100;
x = randi(n,[n,1]);
y = rand(n,1);
data = [x y];
plot(x,y,'rx')
axis([0 100 0 1])
Now I want to write an algorithm that groups these data into clusters (the number of which is arbitrary) such that a point is a member of a cluster only if the distance between this point and at least one other member of the cluster is less than 10. How could I write the code?
The clustering method you are describing is DBSCAN. Note that this algorithm will find only one cluster in provided data, since it's very unlikely that there is a point in the dataset so that its distance to all other points is more than 10.
If this is really what you want, you can use DBSCAN, or the implementation posted on the File Exchange if you are using a version older than R2019a.
% Generating random points, almost similar to the data provided by OP
data = bsxfun(@times, rand(100, 2), [100 1]);
% Adding more random points
for i=1:5
mu = rand(1, 2)*100 -50;
A = rand(2)*5;
sigma = A*A'+eye(2)*(1+rand*2);%[1,1.5;1.5,3];
data = [data;mvnrnd(mu,sigma,20)];
end
% clustering using DBSCAN, with epsilon = 10, and min-points = 1 as
idx = DBSCAN(data, 10, 1);
% plotting clusters
numCluster = max(idx);
colors = lines(numCluster);
scatter(data(:, 1), data(:, 2), 30, colors(idx, :), 'filled')
title(['No. of Clusters: ' num2str(numCluster)])
axis equal
The numbers in the above figure show the distance between the closest pair of points in any two different clusters.
The Matlab built-in function clusterdata() works well for what you're asking.
Here is how to apply it to your example:
% number of points
n = 100;
% create the data
x = randi(n,[n,1]);
y = rand(n,1);
data = [x y];
% the number of clusters you want to create
num_clusters = 5;
T1 = clusterdata(data,'Criterion','distance',...
'Distance','euclidean',...
'MaxClust', num_clusters)
scatter(x, y, 100, T1,'filled')
In this case, I used 5 clusters and used the Euclidean distance to be the metric to group the data points, but you can always change that (see documentation of clusterdata())
See the result below for 5 clusters with some random data.
Note that the data is skewed (x-values are from 0 to 100, and y-values are from 0 to 1), so the results are also skewed, but you could always normalize your data.
Here is a way using the connected components of a graph:
D = pdist2(data, data) < 10;
D(1:size(D,1)+1:end) = 0;
G = graph(D);
C = conncomp(G);
The vector C of connected components gives the cluster number of each point.
Use pdist2 to compute the pairwise distance matrix of the points.
Use the distance matrix to create a logical adjacency matrix in which two points are neighbors if the distance between them is less than 10.
Set the diagonal elements of the adjacency matrix to 0 to eliminate self-loops.
Create a graph from the adjacency matrix.
Compute the connected components of the graph.
Note that pdist2 may not be practical for large datasets, in which case you need another method to form a sparse adjacency matrix.
I noticed after posting my answer that the answer provided by @saastn suggests the DBSCAN algorithm, which follows nearly the same approach.
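The steps above can be tried on a tiny hand-made data set. This is only a sketch: the points in pts and the threshold of 10 are illustrative, and the distance matrix is built with implicit expansion (R2016b or later) instead of pdist2 so that no toolbox is assumed.

```matlab
% Five made-up 2-D points: two close pairs and one isolated point
pts = [0 0; 5 0; 30 0; 36 0; 100 0];

% Pairwise Euclidean distances (N-by-N) via implicit expansion
diffs = permute(pts, [1 3 2]) - permute(pts, [3 1 2]);  % N-by-N-by-2
distMat = sqrt(sum(diffs.^2, 3));

% Logical adjacency: neighbours if closer than 10, no self-loops
adj = distMat < 10;
adj(1:size(adj,1)+1:end) = false;

% Each connected component of the graph is one cluster
G = graph(adj);
labels = conncomp(G);
numClusters = max(labels);
```

Points 1 and 2 end up in one cluster, points 3 and 4 in another, and point 5 forms a cluster of its own.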
I have a matrix m and plot a histogram of the third column. I search for the peak in the first 100 bins and get the frequency as a and the index of the bin as b. Now I need the edges of the bin with index b. How can I get them?
nbins = 1000;
histo = histogram(m(:,3),nbins,'Orientation','horizontal');
[a,b] = max(histo.Values(1:100))
I can think of two easy ways to do this:
function q41505566
m = randn(10000,5);
nBins = 1000;
% Option 1: using histcounts:
[N,E] = histcounts(m(:,3),nBins);
disp(E(find(N(1:100) == max(N(1:100)),1,'first')+[0 1])); % find() returns the left bin edge
% Option 2: using BinEdges:
histo = histogram(m(:,3),nBins,'Orientation','horizontal');
[a,b] = max(histo.Values(1:100));
disp(histo.BinEdges(b:b+1));
If you need an explanation for the "tricks" - please say so.
Consider this example code, which finds the best fit to the data by varying the number of Gaussian components and selecting according to the Akaike information criterion:
MU1 = [1];
SIGMA1 = [2];
MU2 = [-3];
SIGMA2 = [1 ];
X = [mvnrnd(MU1,SIGMA1,1000);mvnrnd(MU2,SIGMA2,1000)];
AIC = zeros(1,4);
obj = cell(1,4);
options = statset('Display','final');
for k = 1:4
obj{k} = gmdistribution.fit(X,k,'Options',options);
AIC(k)= obj{k}.AIC;
end
[minAIC,numComponents] = min(AIC)
I want to do the same thing but with data that are given in a form of a histogram (consider for example the data http://pastebin.com/embed_js.php?i=1mNRuEHZ).
What is the most direct way to implement the same procedure in matlab in this case?
If I'm getting you right, then your problem is to convert between data that is already compiled as a histogram (so numbers of observations paired with the actual value of an observation) and the original individual observations. Of course, when compiling the histogram, you have lost two things:
Order. You don't know the order of the observations in the original data, which is probably not important provided your observations are independent. Also, as far as I understand gmdistribution.fit(), it doesn't take order into account anyway.
Resolution. When you create a histogram, you need to bin your data, which makes you lose precision, so to speak, because it is impossible to recover the precise values of your observations from the bins.
Once you are aware of that, you can create a 'vector of observations' from your histogram data. Say X1 is your histogram data (an Nx2 matrix). If you do
invX = cell2mat(arrayfun(@(x,y) repmat(y,1,x), abs(int16(1000*X1(:, 2)))', X1(:, 1)', ...
'UniformOutput', false))';
you get a vector that contains individual observations, just like X in your example.
Note that you have to convert the bin counts to integers first. At this step, because the given data's precision is quite high, I had to round to make the computation possible for my machine. However, the final result seems fairly reasonable.
Also note that I used absolute values: there are some cases in your histogram data where the counts are actually negative, which obviously doesn't make sense for a histogram.
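The expansion from bin counts to individual observations described above can also be sketched with repelem (available from R2015a). The bin values and counts here are made up for illustration, and repelem is swapped in for the arrayfun/repmat combination, so this is not the exact code of this answer:

```matlab
% Made-up histogram: one value per bin and a non-negative integer count per bin
binValues = [-1; 0; 2];
binCounts = [2; 0; 3];

% Repeat each bin value as many times as its count says
obs = repelem(binValues, binCounts);   % column vector of "observations"
```

The resulting vector can then be passed to the fitting routine in place of X, with the same caveat about lost resolution.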
Last but not least you have to change the number of iterations for the fit procedure to 1000. The final code to produce the below figure reads
MU1 = [1];
SIGMA1 = [2];
MU2 = [-3];
SIGMA2 = [1 ];
X = [mvnrnd(MU1,SIGMA1,1000);mvnrnd(MU2,SIGMA2,1000)];
X = X1(:, 2);
invX = cell2mat(arrayfun(@(x,y) repmat(y,1,x), abs(int16(1000*X1(:, 2)))', X1(:, 1)', ...
'UniformOutput', false))'; %'
X = invX;
AIC = zeros(1,4);
obj = cell(1,4);
options = statset('Display','final', 'MaxIter', 1000);
for k = 1:4
obj{k} = gmdistribution.fit(X,k,'Options',options);
AIC(k)= obj{k}.AIC;
end
[minAIC,numComponents] = min(AIC);
hold on;
plot(linspace(-1, 2, length(X1(:, 2))), abs(X1(:, 2)), 'LineWidth', 2)
plot(x, pd/max(pd)*double(max(abs(X1(:, 2)))), 'LineWidth', 5);
h = legend('Original data', 'PDF');
set(h,'FontSize',32);
Output looks like this:
I am working on code that selects a set of pixels at random from grayscale images and then compares the intensities of pairs of pixels by subtracting the intensity of a pixel at one location from that of a pixel at a different location.
I have code that does the random selection, but I am not sure it is correct, and I do not know how to do the pixel subtraction.
Thank you in advance.
{
N = 100; % number of random pixels
im = imread('image.bmp');
[nRow,nCol,c] = size(im);
randRow = randi(nRow,[N,1]);
randCol = randi(nCol,[N,1]);
subplot(2,1,1)
imagesc(im(randRow,randCol,:))
subplot(2,1,2)
imagesc(im)
}
Parag basically gave you the answer. In order to achieve this vectorized, you need to use sub2ind. However, what I would do is generate two sets of rows and columns. The reason why is because you need one set for the first set of pixels and another set for the next set of pixels so you can subtract the two sets of intensities. Therefore, do something like this:
N = 100; % number of random pixels
im = imread('image.bmp');
[nRow,nCol,c] = size(im);
%// Generate two sets of locations
randRow1 = randi(nRow,[N,1]);
randCol1 = randi(nCol,[N,1]);
randRow2 = randi(nRow,[N,1]);
randCol2 = randi(nCol,[N,1]);
%// Convert each 2D location into a single linear index
%// for vectorization, then subtract
locs1 = sub2ind([nRow, nCol], randRow1, randCol1);
locs2 = sub2ind([nRow, nCol], randRow2, randCol2);
im_subtract = double(im(locs1)) - double(im(locs2)); %// cast first: uint8 subtraction saturates at 0
subplot(2,1,1)
imagesc(im_subtract);
subplot(2,1,2)
imagesc(im);
However, the above code assumes that your image is grayscale. If you want to do this for colour, you'll have to do a bit more work: you need to access each channel and subtract on a per-channel basis. The linear indices defined above are for a single channel only, so you need to offset them by nRow*nCol for each channel to access the same locations in the following channels. I would therefore use sub2ind in combination with bsxfun to generate the right indices for a vectorized subtraction. This requires just a slight modification to the above code:
N = 100; % number of random pixels
im = imread('image.bmp');
[nRow,nCol,c] = size(im);
%// Generate two sets of locations
randRow1 = randi(nRow,[N,1]);
randCol1 = randi(nCol,[N,1]);
randRow2 = randi(nRow,[N,1]);
randCol2 = randi(nCol,[N,1]);
%// Convert each 2D location into a single linear index
%// for vectorization
locs1 = sub2ind([nRow, nCol], randRow1, randCol1);
locs2 = sub2ind([nRow, nCol], randRow2, randCol2);
%// Extend to as many channels as we have
skip_ind = permute(0:nRow*nCol:(c-1)*nRow*nCol, [1 3 2]);
%// Create 3D linear indices
locs1 = bsxfun(@plus, locs1, skip_ind);
locs2 = bsxfun(@plus, locs2, skip_ind);
%// Now subtract the locations
im_subtract = im(locs1) - im(locs2);
subplot(2,1,1)
imagesc(im_subtract);
subplot(2,1,2)
imagesc(im);
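To see what sub2ind does and why the integer type matters when subtracting, here is a small sketch on a 4x4 matrix. The matrix (magic(4) cast to uint8) and the chosen locations are purely illustrative stand-ins for an image and its random pixel positions:

```matlab
% A tiny stand-in for a grayscale image
im = uint8(magic(4));            % [16 2 3 13; 5 11 10 8; 9 7 6 12; 4 14 15 1]

% Two sets of (row, col) locations, converted to linear indices
locs1 = sub2ind(size(im), [1; 4], [1; 4]);   % pixels (1,1) and (4,4)
locs2 = sub2ind(size(im), [2; 1], [3; 1]);   % pixels (2,3) and (1,1)

% uint8 arithmetic saturates, so negative differences become 0
diffUint8 = im(locs1) - im(locs2);

% Casting to double first keeps the sign of the difference
diffDouble = double(im(locs1)) - double(im(locs2));
```

Here the second pair gives 1 - 16, which is 0 in uint8 arithmetic but -15 after casting to double, so which form is appropriate depends on whether the sign of the difference is needed.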
I have a matrix A (369x10) which I want to cluster into 19 clusters.
I use this method:
[idx ctrs]=kmeans(A,19)
which yields
idx(369x1) and ctrs(19x10)
I understand everything up to here. All the rows of A are assigned to 19 clusters.
Now I have an array B (49x10). I want to know which of the 19 clusters each row of B corresponds to.
How can I do this in MATLAB?
Thank you in advance
The following is a complete example of clustering:
%% generate sample data
K = 3;
numObservarations = 100;
dimensions = 3;
data = rand([numObservarations dimensions]);
%% cluster
opts = statset('MaxIter', 500, 'Display', 'iter');
[clustIDX, clusters, interClustSum, Dist] = kmeans(data, K, 'options',opts, ...
'distance','sqEuclidean', 'EmptyAction','singleton', 'replicates',3);
%% plot data+clusters
figure, hold on
scatter3(data(:,1),data(:,2),data(:,3), 50, clustIDX, 'filled')
scatter3(clusters(:,1),clusters(:,2),clusters(:,3), 200, (1:K)', 'filled')
hold off, xlabel('x'), ylabel('y'), zlabel('z')
%% plot clusters quality
figure
[silh,h] = silhouette(data, clustIDX);
avrgScore = mean(silh);
%% Assign data to clusters
% calculate distance (squared) of all instances to each cluster centroid
D = zeros(numObservarations, K); % init distances
for k=1:K
%d = sum((x-y).^2).^0.5
D(:,k) = sum( ((data - repmat(clusters(k,:),numObservarations,1)).^2), 2);
end
% find for all instances the cluster closest to it
[minDists, clusterIndices] = min(D, [], 2);
% compare it with what you expect it to be
sum(clusterIndices == clustIDX)
I can't think of a better way to do it than what you described. A built-in function would save one line, but I couldn't find one. Here's the code I would use:
[ids ctrs]=kmeans(A,19);
D = dist([testpoint;ctrs]'); %testpoint is 1x10; dist expects points as columns, so D will be 20x20
[distance testpointID] = min(D(1,2:end));
I don't know if I understand you correctly, but if you want to know which cluster your points belong to, you can easily use the knnsearch function. It takes two arguments and, for each row of the second argument, finds the closest row of the first; for example, knnsearch(ctrs,B) returns the index of the nearest centroid for each row of B.
Assuming you're using squared euclidean distance metric, try this:
for i = 1:size(ctrs,1)
d(:,i) = sum((B-ctrs(repmat(i,size(B,1),1),:)).^2,2);
end
[distances,predicted] = min(d,[],2)
predicted should then contain the index of the closest centroid, and distances should contain the distances to the closest centroid.
Take a look inside the kmeans function, at the subfunction 'distfun'. This shows you how to do the above, and also contains the equivalents for other distance metrics.
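As a small worked sketch of this nearest-centroid assignment, assuming the squared Euclidean metric, here are three made-up centroids and four made-up new points (the values are illustrative only; in the question ctrs would come from kmeans and B would be the 49x10 array):

```matlab
% Made-up centroids (one per row) and new points to classify
ctrs = [0 0; 10 10; -5 5];
B    = [1 1; 9 11; -4 4; 6 6];

% Squared Euclidean distance from every row of B to every centroid
d = zeros(size(B,1), size(ctrs,1));
for k = 1:size(ctrs,1)
    d(:,k) = sum((B - repmat(ctrs(k,:), size(B,1), 1)).^2, 2);
end

% The closest centroid per row is the predicted cluster
[distances, predicted] = min(d, [], 2);
```

Note that the loop runs over the rows of ctrs (the number of centroids), not over its columns.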
For a small amount of data, you could do
[testpointID,dum] = find(permute(all(bsxfun(@eq,B,permute(ctrs,[3,2,1])),2),[3,1,2]))
but this is somewhat obscure: the bsxfun with the permuted ctrs creates a 49 x 10 x 19 array of booleans, which is then 'all-ed' across the second dimension and permuted back, and then the row ids are found. Again, this is probably not practical for large amounts of data.