Clustering with multiple metrics in Matlab

I have a data set that contains both categorical and numerical features for each row. I would like to select a different similarity metric for each feature (column) and perform hierarchical clustering on the data. Is there a way to do that in Matlab?

Yes, this is fairly straightforward: linkage, which builds the tree, takes as input the pairwise dissimilarities in the vector form that pdist returns. Consequently, in the example workflow below
Y = pdist(X,'cityblock');
Z = linkage(Y,'average');
T = cluster(Z,'cutoff',c);   % c is the cutoff threshold you choose
you simply replace the call to pdist with a call to your own function that calculates the pairwise dissimilarity between rows. If that function produces a square matrix, squareform converts it to the vector form. Everything else stays the same.
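A minimal sketch of such a per-column dissimilarity, written in plain Python for illustration rather than in Matlab. The name `mixed_pdist`, the two per-column metrics, and the unweighted sum are all assumptions; the answer does not prescribe any of them. The result is a flat list over the row pairs (0,1), (0,2), ..., (n-2,n-1), which is the upper-triangle order a pdist-style vector uses.

```python
from itertools import combinations

def numeric_diff(a, b):
    """Absolute difference, the one-column analogue of 'cityblock'."""
    return abs(a - b)

def categorical_diff(a, b):
    """0 if the categories match, 1 otherwise (simple mismatch)."""
    return 0.0 if a == b else 1.0

def mixed_pdist(rows, column_metrics):
    """Condensed pairwise dissimilarities using one metric per column.

    rows           -- list of equal-length records
    column_metrics -- one two-argument function per column (hypothetical API)
    Returns a flat list ordered (0,1), (0,2), ..., (n-2,n-1).
    """
    out = []
    for r, s in combinations(rows, 2):
        # Sum the per-column contributions; the unweighted sum is an assumption.
        out.append(sum(m(a, b) for m, a, b in zip(column_metrics, r, s)))
    return out

rows = [
    (1.0, "red"),
    (2.5, "red"),
    (1.0, "blue"),
]
Y = mixed_pdist(rows, [numeric_diff, categorical_diff])
print(Y)  # [1.5, 1.0, 2.5]
```

In practice the numeric columns would probably need rescaling (or the columns need weights) so that no single feature dominates the sum.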

Related

How to use silhouette_score in Sklearn with mixed (categorical and numerical) data?

I have a mixed data set, as mentioned in the title, and I am trying unsupervised clustering on it.
I am running many different experiments, including Gower's distance and K-prototype. I want to try some of the sklearn metrics to see what values they give me.
While looking at silhouette_score, I noticed the argument 'metric', which lets me choose how distances are computed. But as my data has mixed types, I would like to use manhattan for the numerical features and hamming for the categorical ones. Is there a way to use silhouette_score with both metrics in one go? If all my input data were numerical, I would have done the following:
silhouette_score(friendRecomennderData, labels, metric = 'manhattan')
Thank you in advance.
You are confusing the arguments that are passed to silhouette_score. The documentation says the following about the input data, i.e. the parameter X:
X: array [n_samples_a, n_samples_a] if metric == “precomputed”, or, [n_samples_a, n_features] otherwise. Array of pairwise distances between samples, or a feature array.
Thus X must be a numerical array: either a feature array or an array of pairwise distances between the samples. Categorical values cannot be passed in directly.
You need to cluster your data first, then compute the distance matrix and provide it as the input to silhouette_score.
You can use a distance such as Gower's distance, which handles mixed data types, and then pass the computed distance matrix as X with metric = 'precomputed' in the silhouette_score function.
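To make concrete what a precomputed distance matrix is used for, here is a small pure-Python sketch that computes the mean silhouette coefficient directly from a square distance matrix. It is not sklearn's implementation: the name `silhouette_from_distances` is hypothetical, the toy matrix is made up, and giving singleton clusters a score of 0 is a convention assumed here.

```python
def silhouette_from_distances(D, labels):
    """Mean silhouette coefficient from a square distance matrix D.

    D[i][j] is the (already mixed-type) distance between samples i and j;
    labels[i] is the cluster of sample i. Hypothetical helper, not sklearn.
    """
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # singleton cluster: score 0 by convention
            continue
        # a: mean distance to the other members of the sample's own cluster
        a = sum(D[i][j] for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(D[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

# Toy symmetric distance matrix for four samples (made-up values).
D = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 4.0, 4.0],
    [4.0, 4.0, 0.0, 2.0],
    [5.0, 4.0, 2.0, 0.0],
]
labels = [0, 0, 1, 1]
score = silhouette_from_distances(D, labels)
```

With sklearn, the same kind of matrix would be passed as X together with metric = 'precomputed', as described above; any mixed-type distance (Gower's, or manhattan plus hamming) can be used to fill D.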

Kullback Leibler Divergence of 2 Histograms in MatLab

I would like a function to calculate the KL distance between two histograms in MatLab. I tried this code:
http://www.mathworks.com/matlabcentral/fileexchange/13089-kldiv
It says that I should have two distributions P and Q of sizes n x nbins, but I am having trouble understanding how the author of the package wants me to arrange the histograms. I thought that providing the discretized values of the random variable together with the number of bins would suffice (I would assume the algorithm would use an arbitrary support to evaluate the expectations).
Any help is appreciated.
Thanks.
The function you link to requires that the two histograms passed be aligned and thus have the same size, NBIN x N (not N x NBIN). That is, if N>1, the number of rows in the inputs should equal the number of bins in the histograms. If you are only comparing two histograms (N=1), it doesn't really matter: you can pass either row or column vectors, as long as you are consistent and the order of the bins matches.
A generic call to the function looks like this:
dists = kldiv(bins,P,Q)
The implementation allows comparison of multiple histograms to each other (that is, N>1), in which case pairs of columns (with matching column index) in each array are compared and the result is a row vector with distances for each matching pair.
Array bins should be the same size as P and Q and is used to perform a very minimal check that the inputs are of the same size, but is not used in the computation. The routine expects bins to contain the numeric labels of your bins so that it can check for repeated bin labels and warn you if repeats occur, but otherwise doesn't use the information.
You could do away with bins and compute the distance with
KL = sum(P .* (log2(P)-log2(Q)));
without using the Matlab Central versions. However, the version you link to performs the above-mentioned minimal checks and in addition allows computation of two alternative distances (consult the documentation).
The version linked to by eigenchris checks that no histogram bins are empty (which would make the computation blow up numerically) and, if there are any, removes their contribution to the sum (I am not sure this is entirely appropriate; consult an expert on the subject). You should probably also be aware of the exact form of the formula: note the use of log2 above versus the natural logarithm in the version linked to by eigenchris.
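The formula and the two caveats above (empty bins, choice of logarithm) can be sketched in plain Python as follows. The function name `kl_divergence` and its `base` parameter are assumptions made for this illustration and do not mirror either File Exchange routine. Bins where P is zero are skipped under the usual 0·log 0 = 0 convention, and a bin where P is positive but Q is zero is reported as an infinite divergence.

```python
import math

def kl_divergence(p, q, base=2.0):
    """KL divergence D(P || Q) for two aligned histograms.

    p, q -- bin counts or probabilities over the same bins, in the same order
    base -- logarithm base: 2 gives bits, math.e gives nats
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    sp, sq = sum(p), sum(q)
    total = 0.0
    for pi, qi in zip(p, q):
        pi, qi = pi / sp, qi / sq   # normalise counts to probabilities
        if pi == 0:
            continue                # 0 * log(0 / q) is taken as 0
        if qi == 0:
            return math.inf         # P has mass where Q has none
        total += pi * math.log(pi / qi, base)
    return total

p = [10, 20, 30, 40]
q = [25, 25, 25, 25]
bits = kl_divergence(p, q)            # log base 2, as in the one-liner above
nats = kl_divergence(p, q, math.e)    # natural logarithm
```

The two results differ only by the constant factor ln 2, so the choice of logarithm changes the units (bits versus nats) but not the ranking of histogram pairs.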

How to select randomly and fairly data in matlab

How can I select some data randomly and fairly from a dataset in matlab?
When we use the randperm function to select data, is the selection random and fair?
As you already suggested, selecting k rows uniformly at random out of n can be done with randperm, assuming you don't want duplicates.
Example:
dataSet = rand(1000,4);
idx = randperm(size(dataSet,1),10)
dataSet(idx,:)
If you have the Statistics Toolbox, you can use randsample:
sample = randsample(data,k);
takes k values sampled uniformly at random, without replacement, from the values in the vector data. See the randsample documentation for other options.
Equivalent code with randperm:
ind = randperm(numel(data));
sample = data(ind(1:k));
Yes, either of these approaches gives random samples, and yes, they are fair. I assume that by "fair" you mean "uniform": each entry of data is picked with the same probability.
Anything that uses a uniform distribution is "fair", because the output is supposed to be distributed uniformly over a specific range. An example is the rand function in matlab.
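The two sampling approaches discussed in this thread, and a rough empirical check of what "fair" means, can be sketched in plain Python. The helper names are hypothetical, and the fixed seed is there only to make the run repeatable.

```python
import random
from collections import Counter

def sample_rows(data, k, rng=random):
    """Pick k distinct items uniformly at random (no duplicates)."""
    return rng.sample(data, k)

def sample_rows_via_permutation(data, k, rng=random):
    """Same idea via a full random permutation of the indices."""
    idx = list(range(len(data)))
    rng.shuffle(idx)
    return [data[i] for i in idx[:k]]

data = list(range(1000))
rng = random.Random(0)   # fixed seed only so that the run is repeatable
picked = sample_rows(data, 10, rng)
picked_perm = sample_rows_via_permutation(data, 10, rng)

# Rough fairness check: draw 2 of 5 items many times and count selections.
n_draws = 20000
counts = Counter()
for _ in range(n_draws):
    counts.update(rng.sample(range(5), 2))
# Each item should be selected in about 2/5 of the draws.
freqs = {item: counts[item] / n_draws for item in range(5)}
```

If the selection is uniform, every entry of `freqs` comes out close to 0.4; a systematic deviation for one item would indicate an unfair sampler.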

Grouping Data in a Matrix in MATLAB

I've got a really big matrix which I should "upscale" (i.e.: create another matrix where the elements of the first are grouped 40-by-40). For every 40-by-40 group I should evaluate a series of parameters (i.e.: frequencies, average and standard deviation).
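I'm quite sure I can do this with a loop, but I was wondering if there is a more elegant vectorized method...
You might find blockproc useful. This command lets you apply a function (e.g. mean, std etc., wrapped in a function handle) to each distinct block in a 2D matrix. The function receives a block struct whose data field holds the block, e.g. blockproc(M, [40 40], @(b) mean(b.data(:))).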
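What such a block-wise reduction computes can be sketched in plain Python on a small matrix. The name `block_stats` is hypothetical, and the sketch assumes the matrix dimensions are exact multiples of the block size. It uses the population standard deviation, which differs from Matlab's default N-1 normalisation.

```python
from statistics import mean, pstdev

def block_stats(matrix, block):
    """Mean and standard deviation of each distinct block x block tile.

    matrix -- list of equal-length rows whose dimensions are multiples
              of `block` (an assumption to keep the sketch short)
    Returns two smaller matrices: tile means and tile standard deviations.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    if n_rows % block or n_cols % block:
        raise ValueError("matrix dimensions must be multiples of block")
    means, stds = [], []
    for r0 in range(0, n_rows, block):
        mean_row, std_row = [], []
        for c0 in range(0, n_cols, block):
            # Flatten the tile that starts at (r0, c0).
            tile = [matrix[r][c]
                    for r in range(r0, r0 + block)
                    for c in range(c0, c0 + block)]
            mean_row.append(mean(tile))
            std_row.append(pstdev(tile))  # population std; stdev() gives N-1
        means.append(mean_row)
        stds.append(std_row)
    return means, stds

# 4-by-4 example grouped 2-by-2 (the question uses 40-by-40 tiles).
M = [
    [1, 1, 2, 4],
    [1, 1, 6, 8],
    [0, 2, 5, 5],
    [2, 4, 5, 5],
]
means, stds = block_stats(M, 2)
```

For the question's case the call would use a block size of 40, and further per-tile statistics (such as value frequencies) could be added in the same inner loop.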

Matlab - Stratified Sampling of Multidimensional Data

I want to divide a corpus into training & testing sets in a stratified fashion.
The observation data points are arranged in a Matrix A as
A=[16,3,0;12,6,4;19,2,1;.........;17,0,2;13,3,2]
Each column of the matrix represent a distinct feature.
In Matlab, the cvpartition(A,'holdout',p) function requires A to be a vector. How can I perform the same action with A as a Matrix i.e. resulting sets have roughly the same distribution of each feature as in the original corpus.
By using a matrix A rather than grouped data, you are assuming that a random partition of your data will return test and training sets with the same column distributions.
More generally, the assumption in your question is that there is a partition of A that preserves each of the marginal distributions of A (one per column) for all three variables at once. There is no guarantee that this is true. Check whether the columns of your matrix are correlated. If they are not, simply partition on one column and use the row indices to define a test matrix:
cv = cvpartition(A(:, 1), 'holdout', p);
test_mat = A(test(cv), :);
If they are correlated, you may need to go back and reconsider what you are trying to do.
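The idea of partitioning on a single column and reusing the row indices for the whole matrix can be sketched in plain Python. The function `stratified_holdout`, the per-stratum rounding, and the example data are assumptions for illustration; this is not how cvpartition is implemented.

```python
import random
from collections import defaultdict

def stratified_holdout(rows, key_col, p, seed=None):
    """Split row indices into train/test, stratifying on one column.

    rows    -- list of records (one per observation)
    key_col -- index of the column whose values define the strata
    p       -- fraction of each stratum to hold out for testing
    Returns (train_idx, test_idx) as sorted lists of row indices.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for i, row in enumerate(rows):
        strata[row[key_col]].append(i)

    test_idx = []
    for members in strata.values():
        rng.shuffle(members)
        n_test = round(p * len(members))  # per-stratum rounding is an assumption
        test_idx.extend(members[:n_test])
    test_set = set(test_idx)
    train_idx = [i for i in range(len(rows)) if i not in test_set]
    return train_idx, sorted(test_idx)

# Made-up data: the first column takes two values, 12 rows of each.
A = [(16, i % 4, i % 3) for i in range(12)] + [(12, i % 4, i % 3) for i in range(12)]
train_idx, test_idx = stratified_holdout(A, key_col=0, p=0.25, seed=1)
test_mat = [A[i] for i in test_idx]
train_mat = [A[i] for i in train_idx]
```

Only the chosen column is guaranteed to keep its proportions in both sets; the other columns follow only to the extent that a random split of each stratum happens to preserve them.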