How to visualize binary data? - matlab

I have a dataset 6x1000 of binary data (6 data points, 1000 boolean dimensions).
I perform cluster analysis on it
[idx, ctrs] = kmeans(x, 3, 'distance', 'hamming');
And I get the three clusters. How can I visualize my result?
I have 6 rows of data each having 1000 attributes; 3 of them should be alike or similar in a way. Applying clustering will reveal the clusters. Since I know the number of clusters
I only need to find similar rows. Hamming distance tell us the similarity between rows and the result is correct that there are 3 clusters.
[EDIT: for any reasonable data, kmeans will always finds asked number
of clusters]
I want to take that knowledge
and make it easily observable and understandable without having to write huge explanations.
Matlab's example is not suitable since it deals with numerical 2D data while my questions concerns n-dimensional categorical data.
The dataset is here http://pastebin.com/cEWJfrAR
[EDIT1: how to check if clusters are significant?]
For more information please visit the following link:
https://chat.stackoverflow.com/rooms/32090/discussion-between-oleg-komarov-and-justcurious
If the question is not clear ask, for anything you are missing.

For representing the differences between high-dimensional vectors or clusters, I have used Matlab's dendrogram function. For instance, after loading your dataset into the matrix x I ran the following code:
l = linkage(a, 'average');
dendrogram(l);
and got the following plot:
The height of the bar that connects two groups of nodes represents the average distance between members of those two groups. In this case it looks like (5 and 6), (1 and 2), and (3 and 4) are clustered.
If you would rather use the hamming distance rather than the euclidian distance (which linkage does by default), then you can just do
l = linkage(x, 'average', {'hamming'});
although it makes little difference to the plot.

You can start by visualizing your data with a 'barcode' plot and then labeling rows with the cluster group they belong:
% Create figure
figure('pos',[100,300,640,150])
% Calculate patch xy coordinates
[r,c] = find(A);
Y = bsxfun(#minus,r,[.5,-.5,-.5, .5])';
X = bsxfun(#minus,c,[.5, .5,-.5,-.5])';
% plot patch
patch(X,Y,ones(size(X)),'EdgeColor','none','FaceColor','k');
% Set axis prop
set(gca,'pos',[0.05,0.05,.9,.9],'ylim',[0.5 6.5],'xlim',[0.5 1000.5],'xtick',[],'ytick',1:6,'ydir','reverse')
% Cluster
c = kmeans(A,3,'distance','hamming');
% Add lateral labeling of the clusters
nc = numel(c);
h = text(repmat(1010,nc,1),1:nc,reshape(sprintf('%3d',c),3,numel(c))');
cmap = hsv(max(c));
set(h,{'Background'},num2cell(cmap(c,:),2))

Definition
The Hamming distance for binary strings a and b the Hamming distance is equal to the number of ones (population count) in a XOR b (see Hamming distance).
Solution
Since you have six data strings, so you could create a 6 by 6 matrix filled with the Hamming distance. The matrix would be symetric (distance from a to b is the same as distance from b to a) and the diagonal is 0 (distance for a to itself is nul).
For example, the Hamming distance between your first and second string is:
hamming_dist12 = sum(xor(x(1,:),x(2,:)));
Loop that and fill your matrix:
hamming_dist = zeros(6);
for i=1:6,
for j=1:6,
hamming_dist(i,j) = sum(xor(x(i,:),x(j,:)));
end
end
(And yes this code is a redundant given the symmetry and zero diagonal, but the computation is minimal and optimizing not worth the effort).
Print your matrix as a spreadsheet in text format, and let the reader find which data string is similar to which.
This does not use your "kmeans" approach, but your added description regarding the problem helped shaping this out-of-the-box answer. I hope it helps.
Results
0 182 481 495 490 500
182 0 479 489 492 488
481 479 0 180 497 517
495 489 180 0 503 515
490 492 497 503 0 174
500 488 517 515 174 0
Edit 1:
How to read the table? The table is a simple distance table. Each row and each column represent a series of data (herein a binary string). The value at the intersection of row 1 and column 2 is the Hamming distance between string 1 and string 2, which is 182. The distance between string 1 and 2 is the same as between string 2 and 1, this is why the matrix is symmetric.
Data analysis
Three clusters can readily be identified: 1-2, 3-4 and 5-6, whose Hamming distance are, respectively, 182, 180, and 174.
Within a cluster, the data has ~18% dissimilarity. By contrast, data not part of a cluster has ~50% dissimilarity (which is random given binary data).

Presentation
I recommend Kohonen network or similar technique to present your data in, say, 2 dimensions. In general this area is called Dimensionality reduction.
I you can also go simpler way, e.g. Principal Component Analysis, but there's no quarantee you can effectively remove 9998 dimensions :P
scikit-learn is a good Python package to get you started, similar exist in matlab, java, ect. I can assure you it's rather easy to implement some of these algorithms yourself.
Concerns
I have a concern over your data set though. 6 data points is really a small number. moreover your attributes seem boolean at first glance, if that's the case, manhattan distance if what you should use. I think (someone correct me if I'm wrong) Hamming distance only makes sense if your attributes are somehow related, e.g. if attributes are actually a 1000-bit long binary string rather than 1000 independent 1-bit attributes.
Moreover, with 6 data points, you have only 2 ** 6 combinations, that means 936 out of 1000 attributes you have are either truly redundant or indistinguishable from redundant.
K-means almost always finds as many clusters as you ask for. To test significance of your clusters, run K-means several times with different initial conditions and check if you get same clusters. If you get different clusters every time or even from time to time, you cannot really trust your result.

I used a barcode type visualization for my data. The code which was posted here earlier by Oleg was too heavy for my solution (image files were over 500 kb) so I used image() to make the figures
function barcode(A)
B = (A+1)*2;
image(B);
colormap flag;
set(gca,'Ydir','Normal')
axis([0 size(B,2) 0 size(B,1)]);
ax = gca;
ax.TickDir = 'out'
end

Related

Interpolation on 4D data

I am trying to perform an interpolation/fit (preferably non-linear, but linear should also be fine) on 4D data. My data has a form of:
[a,b,c] = func(input)
obviously, func is unknown and ultimately data looks like (input, a, b, c):
0 -0.1253 0.0341 0.01060
35 -0.0985 0.0176 0.02060
50 -0.0315 -0.0533 0.1118
60 -0.0518 -0.0327 0.03020
80 0.2939 -0.0713 0.05670
100 0.3684 -0.0765 0.06740
I take observations at e.g. input = [0, 35, 50, 60, 80, 100] (0 being min and 100 being max; I take 6 samples in between min and max) and then I get corresponding a, b and c values (I understand that 6 sample points are a bad design of experiment so I will extend it in future).
I am trying to guess the value of a, b and c at say input = 19? Any pointers?
How to estimate goodness of fit in such scenario?
This is not 4D interpolation, this is 3 times 1D interpolation. You just interpolate interp1([0 35],[-0.1253 -0.0985],19) and the same for b and c. (interp1(intput,a,19))
Note that for the most basic 1D interpolation in a mesh grid (not what you have), you need 2 data points in general. For the most basic 2D interpolation, you need 4 data points. For 3D interpolation, 8 minimum, 4D, 16.... (2^d in general).
Also note that 1D interpolation uses 2 "dims". Because you use one to guide the interpolation, the other one is interpolated. General, with [v,a,b,c] data you would use 3D interpolation.
all that said, you do are nto in this case. You have scattered data, not a grid, thus the problem becomes considerably more complicated.
In case you can generate a few more points (not necessarily 16) you can use the function griddatan for interpolating scattered data. Note that you can not just say "give me [a,b,c] for input=19, there could be infinite amount of a,b,cs that have that condition. In any case, you always need to give dim-1 amount of sample points, and get the last one interpolated. Just an advice: this function is computationally and memory-wise very expensive. Do not use for big data points because it will crash your PC.
In the case you want to find a set of parameters that make input=19 then you are getting to more complicated area. You want to minimise a function f(x), where x=[a,b,c] for f(x)=input
In math terms:
argmin_x |f(x)-input|^2= \vec{input}
this is a harder problem and arguably more mathematics than a programming question. Perhaps a ND bspline fitting of your data would be a good f

Unreasonable [positive] log-likelihood values from matlab "fitgmdist" function

I want to fit a data sets with Gaussian mixture model, the data sets contains about 120k samples and each sample has about 130 dimensions. When I use matlab to do it, so I run scripts (with cluster number 1000):
gm = fitgmdist(data, 1000, 'Options', statset('Display', 'iter'), 'RegularizationValue', 0.01);
I get the following outputs:
iter log-likelihood
1 -6.66298e+07
2 -1.87763e+07
3 -5.00384e+06
4 -1.11863e+06
5 299767
6 985834
7 1.39525e+06
8 1.70956e+06
9 1.94637e+06
The log likelihood is bigger than 0! I think it's unreasonable, and don't know why.
Could somebody help me?
First of all, it is not a problem of how large your dataset is.
Here is some code that produces similar results with a quite small dataset:
options = statset('Display', 'iter');
x = ones(5,2) + (rand(5,2)-0.5)/1000;
fitgmdist(x,1,'Options',options);
this produces
iter log-likelihood
1 64.4731
2 73.4987
3 73.4987
Of course you know that the log function (the natural logarithm) has a range from -inf to +inf. I guess your problem is that you think the input to the log (i.e. the aposteriori function) should be bounded by [0,1]. Well, the aposteriori function is a pdf function, which means that its value can be very large for very dense dataset.
PDFs must be positive (which is why we can use the log on them) and must integrate to 1. But they are not bounded by [0,1].
You can verify this by reducing the density in the above code
x = ones(5,2) + (rand(5,2)-0.5)/1;
fitgmdist(x,1,'Options',options);
this produces
iter log-likelihood
1 -8.99083
2 -3.06465
3 -3.06465
So, I would rather assume that your dataset contains several duplicate (or very close) values.

matlab k-means clustering evaluation [duplicate]

This question already has answers here:
Evaluating K-means accuracy
(2 answers)
Closed 5 years ago.
How effectively evaluate the performance of the standard matlab k-means implementation.
For example I have a matrix X
X = [1 2;
3 4;
2 5;
83 76;
97 89]
For every point I have a gold standard clustering. Let's assume that (83,76), (97,89) is the first cluster and (1,2), (3,4), (2,5) is the second cluster. Then we run matlab
idx = kmeans(X,2)
And get the following results
idx = [1; 1; 2; 2; 2]
According the the NOMINAL values it's very bad clustering because only (2,5) is correct, but we don't care about nominal values, we care only about points that is clustered together. Therefore somehow we have to identify that only (2,5) gets to the incorrect cluster.
For me a newbie in matlab is not a trivial task to evaluate the performance of clustering. I would appreciate if you can share with us your ideas about how to evaluate the performance.
To evaluate the "best clustering" is somewhat ambiguous, especially if you have points in two different groups that may eventually cross over with respect to their features. When you get this case, how exactly do you define which cluster those points get merged to? Here's an example from the Fisher Iris dataset that you can get preloaded with MATLAB. Let's specifically take the sepal width and sepal length, which is the third and fourth columns of the data matrix, and plot the setosa and virginica classes:
load fisheriris;
plot(meas(101:150,3), meas(101:150,4), 'b.', meas(51:100,3), meas(51:100,4), 'r.', 'MarkerSize', 24)
This is what we get:
You can see that towards the middle, there is some overlap. You are lucky in that you knew what the clusters were before hand and so you can measure what the accuracy is, but if we were to get data such as the above and we didn't know what labels each point belonged to, how do you know which cluster the middle points belong to?
Instead, what you should do is try and minimize these classification errors by running kmeans more than once. Specifically, you can override the behaviour of kmeans by doing the following:
idx = kmeans(X, 2, 'Replicates', num);
The 'Replicates' flag tells kmeans to run for a total of num times. After running kmeans num times, the output memberships are those which the algorithm deemed to be the best over all of those times kmeans ran. I won't go into it, but they determine what the "best" average is out of all of the membership outputs and gives you those.
Not setting the Replicates flag obviously defaults to running one time. As such, try increasing the total number of times kmeans runs so that you have a higher probability of getting a higher quality of cluster memberships. By setting num = 10, this is what we get with your data:
X = [1 2;
3 4;
2 5;
83 76;
97 89];
num = 10;
idx = kmeans(X, 2, 'Replicates', num)
idx =
2
2
2
1
1
You'll see that the first three points belong to one cluster while the last two points belong to another. Even though the IDs are flipped, it doesn't matter as we want to be sure that there is a clear separation between the groups.
Minor note with regards to random algorithms
If you take a look at the comments above, you'll notice that several people tried running the kmeans algorithm on your data and they received different clustering results. The reason why is because when kmeans chooses the initial points for your cluster centres, these are chosen in a random fashion. As such, depending on what state their random number generator was in, it is not guaranteed that the initial points chosen for one person will be the same as another person.
Therefore, if you want reproducible results, you should set the random seed of your random seed generator to be the same before running kmeans. On that note, try using rng with an integer that is known before hand, like 123. If we did this before the code above, everyone who runs the code will be able to reproduce the same results.
As such:
rng(123);
X = [1 2;
3 4;
2 5;
83 76;
97 89];
num = 10;
idx = kmeans(X, 2, 'Replicates', num)
idx =
1
1
1
2
2
Here the labels are reversed, but I guarantee that if any else runs the above code, they will get the same labelling as what was produced above each time.

How do I plot Precision-Recall graphs for Content-Based Image Retrieval in MATLAB?

I am accessing 10 images from a folder "c1" and I have query image. I have implemented code for loading images in cell array and then I'm calculating histogram intersection between query image and each image from folder "c1" one-by-one. Now i want to draw precision-recall curve but i am not sure how to write code for getting "precision-recall curve" using the data obtained from histogram intersection.
My code:
Inp1=rgb2gray(imread('D:\visionImages\c1\1.ppm'));
figure, imshow(Inp1), title('Input image 1');
srcFiles = dir('D:\visionImages\c1\*.ppm'); % the folder in which images exists
for i = 1 : length(srcFiles)
filename = strcat('D:\visionImages\c1\',srcFiles(i).name);
I = imread(filename);
I=rgb2gray(I);
Seq{i}=I;
end
for i = 1 : length(srcFiles) % loop for calculating histogram intersections
A=Seq{i};
B=Inp1;
a = size(A,2); b = size(B,2);
K = zeros(a, b);
for j = 1:a
Va = repmat(A(:,j),1,b);
K(j,:) = 0.5*sum(Va + B - abs(Va - B));
end
end
Precision-Recall graphs measure the accuracy of your image retrieval system. They're also used in the performance of any search engine really, like text or documents. They're also used in machine learning evaluation and performance, though ROC Curves are what are more commonly used.
Precision-Recall graphs are more suitable for document and data retrieval. For the case of images here, given your query image, you are measuring how similar this image is with the rest of the images in your database. You then have similarity measures for each of the database images in relation to your query image and then you sort these similarities in descending order. For a good retrieval system, you would want the images that are the most relevant (i.e. what you are searching for) to all appear at the beginning, while the irrelevant images would appear after.
Precision
The definition of Precision is the ratio of the number of relevant images you have retrieved to the total number of irrelevant and relevant images retrieved. In other words, supposing that A was the number of relevant images retrieved and B was the total number of irrelevant images retrieved. When calculating precision, you take a look at the first several images, and this amount is A + B, as the total number of relevant and irrelevant images is how many images you are considering at this point. As such, another definition Precision is defined as the ratio of how many relevant images you have retrieved so far out of the bunch that you have grabbed:
Precision = A / (A + B)
Recall
The definition of Recall is slightly different. This evaluates how many of the relevant images you have retrieved so far out of a known total, which is the the total number of relevant images that exist. As such, let's say you again take a look at the first several images. You then determine how many relevant images there are, then you calculate how many relevant images that have been retrieved so far out of all of the relevant images in the database. This is defined as the ratio of how many relevant images you have retrieved overall. Supposing that A was again the total number of relevant images you have retrieved out of a bunch you have grabbed from the database, and C represents the total number of relevant images in your database. Recall is thus defined as:
Recall = A / C
How you calculate this in MATLAB is actually quite easy. You first need to know how many relevant images are in your database. After, you need to know the similarity measures assigned to each database image with respect to the query image. Once you compute these, you need to know which similarity measures map to which relevant images in your database. I don't see that in your code, so I will leave that to you. Once you do this, you then sort on the similarity values then you go through where in the sorted similarity values these relevant images occur. You then use these to calculate your precision and recall.
I'll provide a toy example so I can show you what the graph looks like as it isn't quite clear on how you're calculating your similarities here. Let's say I have 5 images in a database of 20, and I have a bunch of similarity values between them and a query image:
rng(123); %// Set seed for reproducibility
num_images = 20;
sims = rand(1,num_images);
sims =
Columns 1 through 13
0.6965 0.2861 0.2269 0.5513 0.7195 0.4231 0.9808 0.6848 0.4809 0.3921 0.3432 0.7290 0.4386
Columns 14 through 20
0.0597 0.3980 0.7380 0.1825 0.1755 0.5316 0.5318
Also, I know that images [1 5 7 9 12] are my relevant images.
relevant_IDs = [1 5 7 9 12];
num_relevant_images = numel(relevant_IDs);
Now let's sort the similarity values in descending order, as higher values mean higher similarity. You'd reverse this if you were calculating a dissimilarity measure:
[sorted_sims, locs] = sort(sims, 'descend');
locs will now contain the image ranks that each image ranked as. Specifically, these tell you which position in similarity the image belongs to. sorted_sims will have the similarities sorted in descending order:
sorted_sims =
Columns 1 through 13
0.9808 0.7380 0.7290 0.7195 0.6965 0.6848 0.5513 0.5318 0.5316 0.4809 0.4386 0.4231 0.3980
Columns 14 through 20
0.3921 0.3432 0.2861 0.2269 0.1825 0.1755 0.0597
locs =
7 16 12 5 1 8 4 20 19 9 13 6 15 10 11 2 3 17 18 14
Therefore, the 7th image is the highest ranked image, followed by the 16th image being the second highest image and so on. What you need to do now is for each of the images that you know are relevant, you need to figure out where these are located after sorting. We will go through each of the image IDs that we know are relevant, and figure out where these are located in the above locations array:
locations_final = arrayfun(#(x) find(locs == x, 1), relevant_IDs)
locations_final =
5 4 1 10 3
Let's sort these to get a better understand of what this is saying:
locations_sorted = sort(locations_final)
locations_sorted =
1 3 4 5 10
These locations above now tell you the order in which the relevant images will appear. As such, the first relevant image will appear first, the second relevant image will appear in the third position, the third relevant image appears in the fourth position and so on. These precisely correspond to part of the definition of Precision. For example, in the last position of locations_sorted, it would take ten images to retrieve all of the relevant images (5) in our database. Similarly, it would take five images to retrieve four relevant images in the database. As such, you would compute precision like this:
precision = (1:num_relevant_images) ./ locations_sorted;
Similarly for recall, it's simply the ratio of how many images were retrieved so far from the total, and so it would just be:
recall = (1:num_relevant_images) / num_relevant_images;
Your Precision-Recall graph would now look like the following, with Recall on the x-axis and Precision on the y-axis:
plot(recall, precision, 'b.-');
xlabel('Recall');
ylabel('Precision');
title('Precision-Recall Graph - Toy Example');
axis([0 1 0 1.05]); %// Adjust axes for better viewing
grid;
This is the graph I get:
You'll notice that between a recall ratio of 0.4 to 0.8 the precision is increasing a bit. This is because you have managed to retrieve a successive chain of images without touching any of the irrelevant ones, and so your precision will naturally increase. It goes way down after the last image, as you've had to retrieve so many irrelevant images before finally hitting a relevant image.
You'll also notice that precision and recall are inversely related. As such, if precision increases, then recall decreases. Similarly, if precision decreases, then recall will increase.
The first part makes sense because if you don't retrieve that many images in the beginning, you have a greater chance of not including irrelevant images in your results but at the same time, the amount of relevant images is rather small. This is why recall would decrease when precision would increase
The second part also makes sense because as you keep trying to retrieve more images in your database, you'll inevitably be able to retrieve all of the relevant ones, but you'll most likely start to include more irrelevant images, which would thus drive your precision down.
In an ideal world, if you had N relevant images in your database, you would want to see all of these images in the top N most similar spots. As such, this would make your precision-recall graph a flat horizontal line hovering at y = 1, which means that you've managed to retrieve all of your images in all of the top spots without accessing any irrelevant images. Unfortunately, that's never going to happen (or at least not for now...) as trying to figure out the best features for CBIR is still an on-going investigation, and no image search engine that I have seen has managed to get this perfect. This is still one of the most broadest and unsolved computer vision problems that exist today!
Edit
You retrieved this code to compute histogram intersection from this post. They have a neat way of computing histogram intersection as:
n is the total number of bins in your histogram. You'll have to play around with this to get good results, but we can leave that as a parameter in your code. The code above assumes that you have two matrices A and B where each column is a histogram. You'll generate a matrix that is of a x b, where a is the number of columns in A and b is the number of columns in b. The row and column of this matrix (i,j) tells you the similarity between the ith column in A with the b jth column in B. In your case, A would be a single column which denotes the histogram of your query image. B would be a 10 column matrix that denotes the histograms for each of the database images. Therefore, we will get a 1 x 10 array of similarity measures through histogram intersection.
As such, we need to modify your code so that you're using imhist for each of the images. We can also specify an additional parameter that gives you how many bins each histogram will have. Therefore, your code will look like this. Each new line that I have placed will have a %// NEW comment beside each line.
Inp1=rgb2gray(imread('D:\visionImages\c1\1.ppm'));
figure, imshow(Inp1), title('Input image 1');
num_bins = 32; %// NEW - I'm specifying 32 bins here. Play around with this yourself
A = imhist(Inp1, num_bins); %// NEW - Calculate histogram
srcFiles = dir('D:\visionImages\c1\*.ppm'); % the folder in which images exists
B = zeros(num_bins, length(srcFiles)); %// NEW - Store histograms of database images
for i = 1 : length(srcFiles)
filename = strcat('D:\visionImages\c1\',srcFiles(i).name);
I = imread(filename);
I=rgb2gray(I);
B(:,i) = imhist(I, num_bins); %// NEW - Put each histogram in a separate
%// column
end
%// NEW - Taken directly from the website
%// but modified for only one histogram in `A`
b = size(B,2);
Va = repmat(A, 1, b);
K = 0.5*sum(Va + B - abs(Va - B));
Take note that I have copied the code from the website, but I have modified it because there is only one image in A and so there is some code that isn't necessary.
K should now be a 1 x 10 array of histogram intersection similarities. You would then use K and assign sims to this variable (i.e. sims = K;) in the code I have written above, then run through your images. You also need to know which images are relevant images, and you'd have to change the code I've written to reflect that.
Hope this helps!

How to compare different distribution means with reference truth value in Matlab?

I have production (q) values from 4 different methods stored in the 4 matrices. Each of the 4 matrices contains q values from a different method as:
Matrix_1 = 1 row x 20 column
Matrix_2 = 100 rows x 20 columns
Matrix_3 = 100 rows x 20 columns
Matrix_4 = 100 rows x 20 columns
The number of columns indicate the number of years. 1 row would contain the production values corresponding to the 20 years. Other 99 rows for matrix 2, 3 and 4 are just the different realizations (or simulation runs). So basically the other 99 rows for matrix 2,3 and 4 are repeat cases (but not with exact values because of random numbers).
Consider Matrix_1 as the reference truth (or base case ). Now I want to compare the other 3 matrices with Matrix_1 to see which one among those three matrices (each with 100 repeats) compares best, or closely imitates, with Matrix_1.
How can this be done in Matlab?
I know, manually, that we use confidence interval (CI) by plotting the mean of Matrix_1, and drawing each distribution of mean of Matrix_2, mean of Matrix_3 and mean of Matrix_4. The largest CI among matrix 2, 3 and 4 which contains the reference truth (or mean of Matrix_1) will be the answer.
mean of Matrix_1 = (1 row x 1 column)
mean of Matrix_2 = (100 rows x 1 column)
mean of Matrix_3 = (100 rows x 1 column)
mean of Matrix_4 = (100 rows x 1 column)
I hope the question is clear and relevant to SO. Otherwise please feel free to edit/suggest anything in question. Thanks!
EDIT: My three methods I talked about are a1, a2 and a3 respectively. Here's my result:
ci_a1 =
1.0e+008 *
4.084733001497999
4.097677503988565
ci_a2 =
1.0e+008 *
5.424396063219890
5.586301025525149
ci_a3 =
1.0e+008 *
2.429145282593182
2.838897116739112
p_a1 =
8.094614835195452e-130
p_a2 =
2.824626709966993e-072
p_a3 =
3.054667629953656e-012
h_a1 = 1; h_a2 = 1; h_a3 = 1
None of my CI, from the three methods, includes the mean ( = 3.454992884900722e+008) inside it. So do we still consider p-value to choose the best result?
If I understand correctly the calculation in MATLAB is pretty strait-forward.
Steps 1-2 (mean calculation):
k1_mean = mean(k1);
k2_mean = mean(k2);
k3_mean = mean(k3);
k4_mean = mean(k4);
Step 3, use HIST to plot distribution histograms:
hist([k2_mean; k3_mean; k4_mean]')
Step 4. You can do t-test comparing your vectors 2, 3 and 4 against normal distribution with mean k1_mean and unknown variance. See TTEST for details.
[h,p,ci] = ttest(k2_mean,k1_mean);
EDIT : I misinterpreted your question. See the answer of Yuk and following comments. My answer is what you need if you want to compare distributions of two vectors instead of a vector against a single value. Apparently, the latter is the case here.
Regarding your t-tests, you should keep in mind that they test against a "true" mean. Given the number of values for each matrix and the confidence intervals it's not too difficult to guess the standard deviation on your results. This is a measure of the "spread" of your results. Now the error on your mean is calculated as the standard deviation of your results divided by the number of observations. And the confidence interval is calculated by multiplying that standard error with appx. 2.
This confidence interval contains the true mean in 95% of the cases. So if the true mean is exactly at the border of that interval, the p-value is 0.05 the further away the mean, the lower the p-value. This can be interpreted as the chance that the values you have in matrix 2, 3 or 4 come from a population with a mean as in matrix 1. If you see your p-values, these chances can be said to be non-existent.
So you see that when the number of values get high, the confidence interval becomes smaller and the t-test becomes very sensitive. What this tells you, is nothing more that the three matrices differ significantly from the mean. If you have to choose one, I'd take a look at the distributions anyway. Otherwise the one with the closest mean seems a good guess. If you want to get deeper into this, you could also ask on stats.stackexchange.com
Your question and your method aren't really clear :
Is the distribution equal in all columns? This is important, as two distributions can have the same mean, but differ significantly :
is there a reason why you don't use the Central Limit Theorem? This seems to me like a very complex way of obtaining a result that can easily be found using the fact that the distribution of a mean approaches a normal distribution where sd(mean) = sd(observations)/number of observations. Saves you quite some work -if the distributions are alike! -
Now if the question is really the comparison of distributions, you should consider looking at a qqplot for a general idea, and at a 2-sample kolmogorov-smirnov test for formal testing. But please read in on this test, as you have to understand what it does in order to interprete the results correctly.
On a sidenote : if you do this test on multiple cases, make sure you understand the problem of multiple comparisons and use the appropriate correction, eg. Bonferroni or Dunn-Sidak.