matlab k-means clustering evaluation [duplicate] - matlab

How can I effectively evaluate the performance of the standard MATLAB k-means implementation?
For example I have a matrix X
X = [1 2;
3 4;
2 5;
83 76;
97 89]
For every point I have a gold standard clustering. Let's assume that (83,76), (97,89) is the first cluster and (1,2), (3,4), (2,5) is the second cluster. Then we run matlab
idx = kmeans(X,2)
And get the following results
idx = [1; 1; 2; 2; 2]
According to the NOMINAL values it's a very bad clustering because only (2,5) is correct, but we don't care about nominal values; we care only about which points are clustered together. Therefore we somehow have to identify that only (2,5) ends up in the incorrect cluster.
For me, as a newbie in MATLAB, evaluating the performance of a clustering is not a trivial task. I would appreciate it if you could share your ideas about how to evaluate the performance.

Evaluating the "best clustering" is somewhat ambiguous, especially if you have points in two different groups whose features overlap. When you get this case, how exactly do you define which cluster those points should be assigned to? Here's an example from the Fisher Iris dataset that comes preloaded with MATLAB. Let's specifically take the petal length and petal width, which are the third and fourth columns of the data matrix, and plot the versicolor (rows 51 to 100) and virginica (rows 101 to 150) classes:
load fisheriris;
plot(meas(101:150,3), meas(101:150,4), 'b.', meas(51:100,3), meas(51:100,4), 'r.', 'MarkerSize', 24)
This is what we get:
You can see that towards the middle, there is some overlap. You are lucky in that you knew what the clusters were beforehand, so you can measure the accuracy, but if we were to get data such as the above and we didn't know which label each point belonged to, how would we know which cluster the middle points belong to?
Instead, what you should do is try and minimize these classification errors by running kmeans more than once. Specifically, you can override the behaviour of kmeans by doing the following:
idx = kmeans(X, 2, 'Replicates', num);
The 'Replicates' flag tells kmeans to run a total of num times, each time from a new set of initial centroids. After running kmeans num times, the output memberships are those of the run the algorithm deemed to be the best. "Best" here means the run with the lowest total sum of point-to-centroid distances within the clusters.
Not setting the Replicates flag obviously defaults to running one time. As such, try increasing the total number of times kmeans runs so that you have a higher probability of getting a higher quality of cluster memberships. By setting num = 10, this is what we get with your data:
X = [1 2;
3 4;
2 5;
83 76;
97 89];
num = 10;
idx = kmeans(X, 2, 'Replicates', num)
idx =
2
2
2
1
1
You'll see that the first three points belong to one cluster while the last two points belong to another. Even though the IDs are flipped, it doesn't matter as we want to be sure that there is a clear separation between the groups.
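To put a number on how well a clustering matches the gold standard regardless of which ID each cluster received, one option is to try every relabelling of the cluster IDs and keep the best agreement. The following is only a minimal sketch under that assumption: the names gold, bestAcc and P are made up for illustration, idx is the first result from the question, and trying all permutations is only practical for a small number of clusters.

```matlab
% Gold standard from the question: (1,2), (3,4), (2,5) together, (83,76), (97,89) together
gold = [2; 2; 2; 1; 1];
% The kmeans output reported in the question
idx  = [1; 1; 2; 2; 2];

k = 2;                       % number of clusters
P = perms(1:k);              % every possible relabelling of the cluster IDs
bestAcc = 0;
for p = 1:size(P, 1)
    mapped = P(p, idx)';     % apply relabelling p to the kmeans labels
    acc = mean(mapped == gold);
    bestAcc = max(bestAcc, acc);
end
disp(bestAcc)                % 0.8: four of five points agree, only (2,5) is misplaced
```

For many clusters a pair-counting measure such as the Rand index avoids enumerating permutations.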
Minor note with regards to random algorithms
If you take a look at the comments above, you'll notice that several people tried running the kmeans algorithm on your data and received different clustering results. The reason is that kmeans chooses the initial points for your cluster centres in a random fashion. As such, depending on what state their random number generator was in, it is not guaranteed that the initial points chosen for one person will be the same as for another person.
Therefore, if you want reproducible results, you should set the seed of your random number generator to the same value before running kmeans. On that note, try using rng with an integer that is known beforehand, like 123. If we do this before the code above, everyone who runs the code will be able to reproduce the same results.
As such:
rng(123);
X = [1 2;
3 4;
2 5;
83 76;
97 89];
num = 10;
idx = kmeans(X, 2, 'Replicates', num)
idx =
1
1
1
2
2
Here the labels are reversed, but I guarantee that if anyone else runs the above code, they will get the same labelling as what was produced above each time.

Related

Unreasonable [positive] log-likelihood values from matlab "fitgmdist" function

I want to fit a Gaussian mixture model to a data set that contains about 120k samples, each with about 130 dimensions. I use MATLAB to do it and run the following script (with 1000 clusters):
gm = fitgmdist(data, 1000, 'Options', statset('Display', 'iter'), 'RegularizationValue', 0.01);
I get the following outputs:
iter log-likelihood
1 -6.66298e+07
2 -1.87763e+07
3 -5.00384e+06
4 -1.11863e+06
5 299767
6 985834
7 1.39525e+06
8 1.70956e+06
9 1.94637e+06
The log likelihood is bigger than 0! I think it's unreasonable, and don't know why.
Could somebody help me?
First of all, it is not a problem of how large your dataset is.
Here is some code that produces similar results with a quite small dataset:
options = statset('Display', 'iter');
x = ones(5,2) + (rand(5,2)-0.5)/1000;
fitgmdist(x,1,'Options',options);
this produces
iter log-likelihood
1 64.4731
2 73.4987
3 73.4987
Of course you know that the log function (the natural logarithm) has a range from -inf to +inf. I guess your problem is that you think the input to the log (i.e. the likelihood) should be bounded by [0,1]. However, the likelihood is built from values of a pdf, not from probabilities, which means that it can be very large for a very dense dataset.
PDFs must be non-negative (the log of a zero density is -inf) and must integrate to 1. But they are not bounded by [0,1].
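A quick way to convince yourself is to evaluate a narrow normal density at its mean. This is just a small illustration using the closed-form normal pdf; the value sigma = 0.01 is an arbitrary choice, not something taken from the question.

```matlab
sigma = 0.01;                        % a very narrow (dense) distribution
peak = 1 / (sigma * sqrt(2*pi));     % normal pdf evaluated at its mean
logPeak = log(peak);
fprintf('pdf at the mean: %.2f, log: %.2f\n', peak, logPeak);
% pdf at the mean is about 39.89, so its log is about 3.69 (positive)
```

Summing such positive log-densities over many tightly packed samples gives a large positive log-likelihood.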
You can verify this by reducing the density in the above code
x = ones(5,2) + (rand(5,2)-0.5)/1;
fitgmdist(x,1,'Options',options);
this produces
iter log-likelihood
1 -8.99083
2 -3.06465
3 -3.06465
So, I would rather assume that your dataset contains several duplicate (or very close) values.

Optimization with discrete parameters in Matlab

I have 12 sets of vectors (about 10-20 vectors each) and I want to pick one vector from each set so that a function f that takes the sum of these vectors as its argument is maximized. In addition, I have constraints on some components of that sum.
Example:
a_1 = [3 2 0 5], a_2 = [3 0 0 2], a_3 = [6 0 1 1], ... , a_20 = [2 12 4 3]
b_1 = [4 0 4 -2], b_2 = [0 0 1 0], b_3 = [2 0 0 4], ... , b_16 = [0 9 2 3]
...
l_1 = [4 0 2 0], l_2 = [0 1 -2 0], l_3 = [4 4 0 1], ... , l_19 = [3 0 9 0]
s = [s_1 s_2 s_3 s_4] = a_x + b_y + ... + l_z
Constraints:
s_1 > 40
s_2 < 100
s_4 > -20
Target: Chose x, y, ... , z to maximize f(s):
f(s) -> max
Where f is a nonlinear function that takes the vector s and returns a scalar.
Brute-forcing takes too long because there are about 5.9 trillion combinations, and since I need the maximum (or even better the top 10 combinations) I cannot use any of the greedy algorithms that came to my mind.
The vectors are quite sparse; about 70-90% of the entries are zeros. Maybe that helps somehow?
The MATLAB Optimization Toolbox didn't help either, since it doesn't offer much support for discrete optimization.
Basically this is a lock-picking problem, where the lock's pins have 20 distinct positions, and there are 12 pins. Also:
some of the pin's positions will be blocked, depending on the positions of all the other pins.
Depending on the specifics of the lock, there may be multiple keys that fit
...interesting!
Based on Rasman's approach and Phpdna's comment, and the assumption that you are using int8 as data type, under the given constraints there are
>> d = double(intmax('int8'));
>> (d-40) * (d+100) * (d+20) * 2*d
ans =
737388162
possible vectors s (give or take a few, haven't thought about +1's etc.). ~740 million evaluations of your relatively simple f(s) shouldn't take more than 2 seconds, and having found all s that maximize f(s), you are left with the problem of finding linear combinations in your vector set that add up to one of those solutions s.
Of course, this finding of combinations is no easy feat, and the whole method breaks down anyway if you are dealing with
int16: ans = 2.311325368800510e+018
int32: ans = 4.253529737045237e+037
int64: ans = 1.447401115466452e+076
So, I'll discuss a more direct and more general approach here.
Since we're talking integers and a fairly large search space, I'd suggest using a branch-and-bound algorithm. But unlike the bintprog algorithm, you'd have to use different branching strategies, and of course, these should be based on a non-linear objective function.
Unfortunately, there is nothing like this in the Optimization Toolbox (or on the File Exchange, as far as I could find). fmincon is a no-go, since it uses gradient and Hessian information (which will usually be all-zero for integers), and fminsearch is a no-go, since you'll need a really good initial estimate, and the rate of convergence is (roughly) O(N), meaning that for this 12-dimensional problem (one index per set) you'll have to wait quite long before convergence, without the guarantee of having found the global solution.
An interval method could be a possibility, however, I personally have very little experience with this. There is no native interval-related stuff in MATLAB or any of its toolboxes, but there's the freely available INTLAB.
So, if you're not feeling like implementing your own non-linear binary integer programming algorithm, or are not in the mood for an adventure with INTLAB, there's really only one thing left: heuristic methods. In this link there is a similar situation, with an outline of the solution: use the genetic algorithm (ga) from the Global Optimization toolbox.
I would implement the problem roughly like so:
function [sol, fval, exitflag] = bintprog_nonlinear()
%// insert your data here
%// Any sparsity you may have here will only make this more
%// *memory* efficient, not *computationally*
data = [...
... %// this will be an array with size 4-by-20-by-12
... %// (or some permutation of that you find more intuitive)
];
%// offsets into the 3D array to facilitate indexing a bit
offsets = bsxfun(@plus, ...
repmat(1:size(data,1), size(data,3),1), ...
(0:size(data,3)-1)' * size(data,1)*size(data,2)); %//'
%// your objective function
function val = obj(X)
%// limit "X" to integers in [1 20]
X = min(max(round(X),1),size(data,2));
%// "X" will be a collection of 12 integers between 1 and 20, which are
%// indices into the data matrix
%// form "s" from "X": build linear indices, then read and sum the data
s = sum(data(bsxfun(@plus, offsets, X(:)*size(data,1) - size(data,1))));
%// XxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxX
%// Compute the NEGATIVE VALUE of your function here
%// XxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxXxX
end
%// your "non-linear" constraint function
function [C, Ceq] = nonlcon(X)
%// limit "X" to integers in [1 20]
X = min(max(round(X),1),size(data,2));
%// form "s" from "X": build linear indices, then read and sum the data
s = sum(data(bsxfun(@plus, offsets, X(:)*size(data,1) - size(data,1))));
%// we have no equality constraints
Ceq = [];
%// Compute inequality constraints
%// NOTE: solver is trying to solve C <= 0, so:
C = [...
40 - s(1)
s(2) - 100
-20 - s(4)
];
end
%// useful GA options
options = gaoptimset(...
'UseParallel', 'always'...
...
);
%// The rest really depends on the specifics of the problem.
%// Useful to look at will be at least 'TolCon', 'Vectorized', and of course,
%// 'PopulationType', 'Generations', etc.
%// THE OPTIMIZATION
[sol, fval, exitflag] = ga(...
@obj, size(data,3), ... %// objective function, taking a vector of 12 values
[],[], [],[], ... %// no linear (in)equality constraints
ones(1,size(data,3)), size(data,2)*ones(1,size(data,3)), ... %// lower and upper limits
@nonlcon, options); %// your "nonlinear" constraints
end
Note that even though your constraints are essentially linear, the way by which you must compute the value for your s necessitates the use of a custom constraint function (nonlcon).
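The offset-based indexing used to form s can be checked on a toy array. The sketch below is only an illustration with made-up sizes (4 components, 3 candidate vectors per set, 2 sets) and made-up names (s_loop, s_vec, nComp, nVec, nSets); it compares a plain loop with the linear-index version and assumes data(:, j, k) holds vector j of set k.

```matlab
% Toy data: data(:, j, k) is candidate vector j of set k
data = reshape(1:24, 4, 3, 2);
X = [2 3];                       % pick vector 2 from set 1, vector 3 from set 2

% Reference: accumulate the chosen vectors with a loop
s_loop = zeros(1, size(data, 1));
for k = 1:size(data, 3)
    s_loop = s_loop + data(:, X(k), k)';
end

% Vectorized: linear indices of data(i, 1, k), shifted to column X(k)
[nComp, nVec, nSets] = size(data);
offsets = bsxfun(@plus, repmat(1:nComp, nSets, 1), (0:nSets-1)' * nComp * nVec);
idx = bsxfun(@plus, offsets, (X(:) - 1) * nComp);
s_vec = sum(data(idx), 1);

disp(s_loop)                     % 26 28 30 32
disp(s_vec)                      % 26 28 30 32
```

Both versions should agree for any valid X, which makes the loop a convenient check before handing the vectorized form to ga.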
Especially note that this is currently (probably) a sub-optimal way to use ga -- I don't know the specifics of your objective function, so a lot more may be possible. For instance, I currently use a simple round() to convert the input X to integers, but using 'PopulationType', 'custom' (with a custom 'CreationFcn', 'MutationFcn' etc.) might produce better results. Also, 'Vectorized' will likely speed things up a lot, but I don't know whether your function is easily vectorized.
And yes, I use nested functions (I just love those things!); it prevents these huge, usually identical lists of input arguments if you use sub-functions or stand-alone functions, and they can really be a performance boost because there is little copying of data. But, I realize that their scoping rules make them somewhat akin to goto constructs, and so they are -ahum- "not everyone's cup of tea"...you might want to convert them to sub-functions to prevent long and useless discussions with your co-workers :)
Anyway, this should be a good place to start. Let me know if this is useful at all.
Unless you define some intelligence on how the vector sets are organized, there will be no intelligent way of solving your problem other than pure brute force.
Say you find s such that f(s) is maximal given the constraints on s; you still need to figure out how to build s from twelve 4-element vectors (an overdetermined system if there ever was one), where each vector has 20 possible values. Sparsity may help, although I'm not sure how a vector with four elements can be 70-90% zero, and sparsity would only be useful if there was some yet-to-be-described methodology in how the vectors are organized.
So I'm not saying you can't solve the problem, I'm saying you need to rethink how the problem is set-up.
I know, this answer is reaching you really late.
Unfortunately, the problem as stated does not show many patterns to exploit beyond brute force (Branch & Bound, Master & Slave, etc.). Trying a Master/Slave approach, i.e. first solving the continuous nonlinear problem as master and then solving the discrete selection as slave, could help, but with this many combinations, and without any more information about the vectors, there is not much room to work with.
But assuming the function is continuous almost everywhere and built from sums, multiplications and their inverses, the sparsity is a clear point to be exploited here. If 70-90% of the vector entries are zero, a good part of the solution space will be close to zero, or close to infinite. Hence an 80-20 pseudo-solution would easily discard the 'zero' combinations and use only the 'infinite' ones.
This way, the brute-force could be guided.

How to visualize binary data?

I have a dataset 6x1000 of binary data (6 data points, 1000 boolean dimensions).
I perform cluster analysis on it
[idx, ctrs] = kmeans(x, 3, 'distance', 'hamming');
And I get the three clusters. How can I visualize my result?
I have 6 rows of data each having 1000 attributes; 3 of them should be alike or similar in a way. Applying clustering will reveal the clusters. Since I know the number of clusters
I only need to find similar rows. Hamming distance tell us the similarity between rows and the result is correct that there are 3 clusters.
[EDIT: for any reasonable data, kmeans will always find the requested number of clusters]
I want to take that knowledge
and make it easily observable and understandable without having to write huge explanations.
Matlab's example is not suitable since it deals with numerical 2D data while my questions concerns n-dimensional categorical data.
The dataset is here http://pastebin.com/cEWJfrAR
[EDIT1: how to check if clusters are significant?]
For more information please visit the following link:
https://chat.stackoverflow.com/rooms/32090/discussion-between-oleg-komarov-and-justcurious
If the question is not clear, ask about anything you are missing.
For representing the differences between high-dimensional vectors or clusters, I have used MATLAB's dendrogram function. For instance, after loading your dataset into the matrix x I ran the following code:
l = linkage(x, 'average');
dendrogram(l);
and got the following plot:
The height of the bar that connects two groups of nodes represents the average distance between members of those two groups. In this case it looks like (5 and 6), (1 and 2), and (3 and 4) are clustered.
If you would rather use the Hamming distance than the Euclidean distance (which linkage uses by default), then you can just do
l = linkage(x, 'average', {'hamming'});
although it makes little difference to the plot.
You can start by visualizing your data with a 'barcode' plot and then labeling rows with the cluster group they belong:
% Create figure
figure('pos',[100,300,640,150])
% Calculate patch xy coordinates
[r,c] = find(A);
Y = bsxfun(@minus,r,[.5,-.5,-.5, .5])';
X = bsxfun(@minus,c,[.5, .5,-.5,-.5])';
% plot patch
patch(X,Y,ones(size(X)),'EdgeColor','none','FaceColor','k');
% Set axis prop
set(gca,'pos',[0.05,0.05,.9,.9],'ylim',[0.5 6.5],'xlim',[0.5 1000.5],'xtick',[],'ytick',1:6,'ydir','reverse')
% Cluster
c = kmeans(A,3,'distance','hamming');
% Add lateral labeling of the clusters
nc = numel(c);
h = text(repmat(1010,nc,1),1:nc,reshape(sprintf('%3d',c),3,numel(c))');
cmap = hsv(max(c));
set(h,{'Background'},num2cell(cmap(c,:),2))
Definition
For binary strings a and b, the Hamming distance is equal to the number of ones (population count) in a XOR b (see Hamming distance).
Solution
Since you have six data strings, you could create a 6 by 6 matrix filled with the Hamming distances. The matrix would be symmetric (the distance from a to b is the same as the distance from b to a) and the diagonal is 0 (the distance from a to itself is zero).
For example, the Hamming distance between your first and second string is:
hamming_dist12 = sum(xor(x(1,:),x(2,:)));
Loop that and fill your matrix:
hamming_dist = zeros(6);
for i=1:6,
for j=1:6,
hamming_dist(i,j) = sum(xor(x(i,:),x(j,:)));
end
end
(And yes, this code is redundant given the symmetry and zero diagonal, but the computation is minimal and optimizing is not worth the effort.)
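If the loop ever becomes a bottleneck, the same matrix can be obtained with two matrix products. This is only a sketch on a small made-up 3-by-4 binary matrix (the names xd, D_loop and D_vec are illustrative): xd*(1-xd)' counts the positions where row i has a 1 and row j has a 0, and the second term counts the opposite case.

```matlab
% Small made-up binary data set: 3 strings of 4 bits each
x = logical([1 0 1 1; 1 1 1 0; 0 0 0 1]);
n = size(x, 1);

% Loop version, as in the answer
D_loop = zeros(n);
for i = 1:n
    for j = 1:n
        D_loop(i, j) = sum(xor(x(i, :), x(j, :)));
    end
end

% Matrix-product version: count (1,0) mismatches plus (0,1) mismatches
xd = double(x);
D_vec = xd * (1 - xd)' + (1 - xd) * xd';

disp(D_vec)    % [0 2 2; 2 0 4; 2 4 0]
```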
Print your matrix as a spreadsheet in text format, and let the reader find which data string is similar to which.
This does not use your "kmeans" approach, but your added description regarding the problem helped shaping this out-of-the-box answer. I hope it helps.
Results
0 182 481 495 490 500
182 0 479 489 492 488
481 479 0 180 497 517
495 489 180 0 503 515
490 492 497 503 0 174
500 488 517 515 174 0
Edit 1:
How to read the table? The table is a simple distance table. Each row and each column represent a series of data (herein a binary string). The value at the intersection of row 1 and column 2 is the Hamming distance between string 1 and string 2, which is 182. The distance between string 1 and 2 is the same as between string 2 and 1, this is why the matrix is symmetric.
Data analysis
Three clusters can readily be identified: 1-2, 3-4 and 5-6, whose Hamming distances are, respectively, 182, 180, and 174.
Within a cluster, the data has ~18% dissimilarity. By contrast, data not part of a cluster has ~50% dissimilarity (which is random given binary data).
Presentation
I recommend a Kohonen network or a similar technique to present your data in, say, 2 dimensions. In general this area is called dimensionality reduction.
You can also go a simpler way, e.g. Principal Component Analysis, but there's no guarantee you can effectively remove 998 dimensions :P
scikit-learn is a good Python package to get you started; similar packages exist in MATLAB, Java, etc. I can assure you it's rather easy to implement some of these algorithms yourself.
Concerns
I have a concern over your data set though. 6 data points is really a small number. Moreover, your attributes seem boolean at first glance; if that's the case, Manhattan distance is what you should use. I think (someone correct me if I'm wrong) Hamming distance only makes sense if your attributes are somehow related, e.g. if the attributes are actually a 1000-bit long binary string rather than 1000 independent 1-bit attributes.
Moreover, with 6 data points, you have only 2^6 = 64 distinct possible columns, which means at least 936 out of your 1000 attributes are either truly redundant or indistinguishable from redundant.
K-means almost always finds as many clusters as you ask for. To test significance of your clusters, run K-means several times with different initial conditions and check if you get same clusters. If you get different clusters every time or even from time to time, you cannot really trust your result.
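A repeated-run check along those lines could look roughly like the sketch below. Everything in it is an assumption made for illustration: the data is synthetic (three random prototypes, two noisy copies each) rather than the dataset from the question, and nRuns, ref and nDifferent are made-up names. Runs are compared through their co-membership matrices so that a mere renumbering of the clusters does not count as a difference.

```matlab
rng(1);                                  % fixed seed so the sketch is repeatable
base = rand(3, 200) > 0.5;               % three hypothetical prototype rows
x = double(base([1 1 2 2 3 3], :));      % two copies of each prototype
flip = rand(size(x)) < 0.1;              % flip roughly 10% of the bits
x = abs(x - flip);

nRuns = 20;
ref = [];                                % co-membership matrix of the first run
nDifferent = 0;
for r = 1:nRuns
    idx = kmeans(x, 3, 'distance', 'hamming', 'EmptyAction', 'singleton');
    co = bsxfun(@eq, idx, idx');         % co(i,j) is true if rows i and j share a cluster
    if isempty(ref)
        ref = co;
    elseif ~isequal(co, ref)
        nDifferent = nDifferent + 1;
    end
end
fprintf('%d of %d runs differ from the first run\n', nDifferent, nRuns - 1);
```

A count that is often non-zero suggests the clusters depend on the initial conditions and should not be trusted.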
I used a barcode type visualization for my data. The code which was posted here earlier by Oleg was too heavy for my solution (image files were over 500 kb) so I used image() to make the figures
function barcode(A)
B = (A+1)*2;
image(B);
colormap flag;
set(gca,'Ydir','Normal')
axis([0 size(B,2) 0 size(B,1)]);
ax = gca;
ax.TickDir = 'out'
end

k nearest neighbor classifier example matlab code found at mathworks cannot understand

I understood the first example that they have provided because they have clearly explained what happens in each line. But the second example, which is the one that would be used most in practice, is not explained, and I am having a hard time trying to understand it :(. The following are the lines of code that I am having trouble with
training = [mvnrnd([ 1 1], eye(2), 100); ...
mvnrnd([-1 -1], 2*eye(2), 100)];
group = [repmat(1,100,1); repmat(2,100,1)];
and
sample = unifrnd(-5, 5, 100, 2);
and this is the link -> http://www.mathworks.in/help/toolbox/bioinfo/ref/knnclassify.html
Could someone please explain this as this will not only be beneficial to me, but for all others as well.
The first line of the code you cite constructs a training set of vectors drawn from two multivariate normal distributions centered at [ 1 1] and [-1 -1] respectively. The second argument of mvnrnd is a covariance matrix, so the first class has variance 1 (standard deviation 1) in both x and y, and the second class has variance 2 (standard deviation sqrt(2), about 1.41) in both x and y. 100 vectors are drawn for each group (or class).
Then you construct the group vector, which contains group labels: the first 100 are from class 1 (repmat(1,100,1) is actually the same as ones(100,1)) and the second 100 are from class 2 (repmat(2,100,1) == ones(100,1)*2).
The second chunk of code you cite actually just generates a matrix containing 100 random data rows, all in the range [-5 , 5] having 2 dimensions (so 2 columns). This matrix gets used to test the classification on.
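To see what the second argument of mvnrnd does, you can draw many samples and look at their spread. This is a small sketch, with the sample count n and the variable classB chosen only for illustration; a large n is used so the estimates settle close to the true values.

```matlab
rng(0);                                   % fixed seed for repeatability
n = 1e5;                                  % many samples so the estimates are stable
classB = mvnrnd([-1 -1], 2*eye(2), n);    % same parameters as the second class

disp(mean(classB))   % close to [-1 -1]
disp(var(classB))    % close to [2 2]: the diagonal of the covariance matrix
disp(std(classB))    % close to [1.41 1.41], i.e. sqrt(2)
```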
You might also get into the habit of using the MATLAB help or doc function on functions you don't know or understand.

How to compare different distribution means with reference truth value in Matlab?

I have production (q) values from 4 different methods stored in the 4 matrices. Each of the 4 matrices contains q values from a different method as:
Matrix_1 = 1 row x 20 column
Matrix_2 = 100 rows x 20 columns
Matrix_3 = 100 rows x 20 columns
Matrix_4 = 100 rows x 20 columns
The number of columns indicate the number of years. 1 row would contain the production values corresponding to the 20 years. Other 99 rows for matrix 2, 3 and 4 are just the different realizations (or simulation runs). So basically the other 99 rows for matrix 2,3 and 4 are repeat cases (but not with exact values because of random numbers).
Consider Matrix_1 as the reference truth (or base case ). Now I want to compare the other 3 matrices with Matrix_1 to see which one among those three matrices (each with 100 repeats) compares best, or closely imitates, with Matrix_1.
How can this be done in Matlab?
I know, manually, that we use confidence interval (CI) by plotting the mean of Matrix_1, and drawing each distribution of mean of Matrix_2, mean of Matrix_3 and mean of Matrix_4. The largest CI among matrix 2, 3 and 4 which contains the reference truth (or mean of Matrix_1) will be the answer.
mean of Matrix_1 = (1 row x 1 column)
mean of Matrix_2 = (100 rows x 1 column)
mean of Matrix_3 = (100 rows x 1 column)
mean of Matrix_4 = (100 rows x 1 column)
I hope the question is clear and relevant to SO. Otherwise please feel free to edit/suggest anything in question. Thanks!
EDIT: My three methods I talked about are a1, a2 and a3 respectively. Here's my result:
ci_a1 =
1.0e+008 *
4.084733001497999
4.097677503988565
ci_a2 =
1.0e+008 *
5.424396063219890
5.586301025525149
ci_a3 =
1.0e+008 *
2.429145282593182
2.838897116739112
p_a1 =
8.094614835195452e-130
p_a2 =
2.824626709966993e-072
p_a3 =
3.054667629953656e-012
h_a1 = 1; h_a2 = 1; h_a3 = 1
None of my CI, from the three methods, includes the mean ( = 3.454992884900722e+008) inside it. So do we still consider p-value to choose the best result?
If I understand correctly, the calculation in MATLAB is pretty straightforward.
Steps 1-2 (mean calculation):
k1_mean = mean(k1);
k2_mean = mean(k2);
k3_mean = mean(k3);
k4_mean = mean(k4);
Step 3, use HIST to plot distribution histograms:
hist([k2_mean; k3_mean; k4_mean]')
Step 4. You can do t-test comparing your vectors 2, 3 and 4 against normal distribution with mean k1_mean and unknown variance. See TTEST for details.
[h,p,ci] = ttest(k2_mean,k1_mean);
EDIT : I misinterpreted your question. See the answer of Yuk and following comments. My answer is what you need if you want to compare distributions of two vectors instead of a vector against a single value. Apparently, the latter is the case here.
Regarding your t-tests, you should keep in mind that they test against a "true" mean. Given the number of values for each matrix and the confidence intervals, it's not too difficult to guess the standard deviation of your results. This is a measure of the "spread" of your results. Now the standard error of your mean is calculated as the standard deviation of your results divided by the square root of the number of observations. And the 95% confidence interval is calculated by taking approximately 2 of those standard errors on either side of the mean.
This confidence interval contains the true mean in 95% of the cases. So if the true mean is exactly at the border of that interval, the p-value is 0.05; the further away the mean, the lower the p-value. The p-value is the chance of seeing values at least as far from the reference as those in matrix 2, 3 or 4 if they really came from a population with a mean as in matrix 1. Looking at your p-values, that chance can be said to be non-existent.
So you see that when the number of values gets high, the confidence interval becomes smaller and the t-test becomes very sensitive. What this tells you is nothing more than that the three matrices differ significantly from the reference mean. If you have to choose one, I'd take a look at the distributions anyway. Otherwise the one with the closest mean seems a good guess. If you want to get deeper into this, you could also ask on stats.stackexchange.com
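The standard error and the approximate confidence interval described above can be computed by hand. The sketch below uses made-up data (100 hypothetical realization means around 10) and the normal-approximation factor 1.96 instead of the exact t quantile, so treat the interval as approximate; x, se and ci are illustrative names.

```matlab
rng(0);                               % fixed seed for repeatability
x = 10 + 2*randn(100, 1);             % hypothetical means of 100 realizations
n = numel(x);

se = std(x) / sqrt(n);                % standard error: sd divided by sqrt(n)
ci = mean(x) + [-1 1] * 1.96 * se;    % approximate 95% confidence interval

fprintf('mean %.3f, se %.3f, CI [%.3f, %.3f]\n', mean(x), se, ci(1), ci(2));
```

With 100 observations the interval is only about a fifth as wide as the spread of the individual values, which is why a reference value can fall outside it even when it looks close.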
Your question and your method aren't really clear:
Is the distribution equal in all columns? This is important, as two distributions can have the same mean but differ significantly.
Is there a reason why you don't use the Central Limit Theorem? This seems to me like a very complex way of obtaining a result that can easily be found using the fact that the distribution of a mean approaches a normal distribution where sd(mean) = sd(observations)/sqrt(number of observations). That saves you quite some work, if the distributions are alike!
Now if the question is really the comparison of distributions, you should consider looking at a qqplot for a general idea, and at a 2-sample Kolmogorov-Smirnov test for formal testing. But please read up on this test, as you have to understand what it does in order to interpret the results correctly.
On a side note: if you do this test on multiple cases, make sure you understand the problem of multiple comparisons and use the appropriate correction, e.g. Bonferroni or Dunn-Sidak.