I want to use the Hamming distance in kmeans clustering in MATLAB, but I get an error saying that my data must be binary.
Is there any way around this? The data matrix that I use can't be binary (it has a physical interpretation that must allow for the values 0, 1, 2, 3), but it's important that I use the Hamming distance.
Per the MATLAB documentation, the Hamming distance measure for kmeans can only be used with binary data, as it's a measure of the percentage of bits that differ.
You could try mapping your data into a binary representation before using the function. You could also look at using the city block distance as an alternative if possible, as it is suitable for non-binary input.
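For example, here is a minimal sketch of one such mapping, assuming your data matrix is called X (integer values 0-3) and k is the desired number of clusters (both placeholder names). One-hot encoding each value as four indicator bits keeps the bit-level Hamming distance proportional to the symbol-level one (two differing bits per mismatched feature):
[N, P] = size(X);
Xbin = zeros(N, 4*P);                              % one block of 4 indicator columns per feature
for j = 1:P
    for v = 0:3
        Xbin(:, 4*(j-1) + v + 1) = (X(:, j) == v); % 1 where feature j has value v
    end
end
idx = kmeans(Xbin, k, 'Distance', 'hamming');      % binary input now satisfies the Hamming requirement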
The data to cluster must be of type logical. You can convert your 0/1 double, single, uintX data by a single command:
x = logical( y );
If you want to convert uint8 data to binary, check the function uint8tobit(), and take a look at the de2bi() and bi2de() functions.
I created a 2-dimensional random dataset (composed of a set of points and a column of labels) for centroid-based k-means clustering in MATLAB, where each point is represented by a vector of X and Y coordinates and each label gives the point's cluster; see the example in the figure below.
I applied the K-means clustering algorithm on these point datasets. I need help with the following:
What function can I use to evaluate the accuracy of the K-means algorithm? In more detail: my aim is to score the K-means algorithm based on how many labels it correctly identifies, by comparing its cluster assignments with the labels generated in MATLAB. For example, I verify whether the point (7.200592168, 11.73878455) is assigned to the same cluster as the point (6.951107307, 11.27498898), etc.
If I correctly understand your question, you are looking for the adjusted Rand index. This will score the similarity between your MATLAB labels and your k-means labels.
Alternatively, you can create a confusion matrix to visualise the mapping between your two label sets.
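A minimal MATLAB sketch of the confusion-matrix route, assuming points is your N-by-2 coordinate matrix, trueLabels your generated labels, and k the number of clusters (all placeholder names):
% kmeans numbers its clusters arbitrarily, so each true class should simply
% concentrate in one column of the matrix (a permutation, not the identity).
kmeansLabels = kmeans(points, k);
C = confusionmat(trueLabels, kmeansLabels)   % rows: true labels, columns: k-means clusters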
I would use squared error
You are trying to minimize the total squared distance between each point and the mean coordinate of its cluster.
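In MATLAB you can read this directly off the outputs of kmeans; a short sketch, with X and k as placeholders for your data matrix and cluster count:
% sumd(j) is the within-cluster sum of squared point-to-centroid distances
% for cluster j (squared Euclidean is the default distance measure).
[idx, C, sumd] = kmeans(X, k);
totalSSE = sum(sumd)   % the quantity k-means minimises; lower is better for a fixed k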
I have come to a situation where I have a mixed data set, as mentioned, and am trying unsupervised clustering.
I am running many different experiments, including Gower's distance and k-prototypes. I want to try some of sklearn's metrics to see what values they give me.
While looking at silhouette_score, I saw it has a 'metric' argument that lets me decide how the distances are computed. But as my data has mixed types, I would like to use Manhattan for the numerical features and Hamming for the categorical ones. Is there a way to use silhouette_score with both metrics in one go? If all my input data were numerical, I would have done as below:
silhouette_score(friendRecomennderData, labels, metric = 'manhattan')
Thank you in advance.
You are getting confused about the arguments that are passed to silhouette_score. If you read the documentation mentioned here, it says the following about the input data, i.e. the parameter X:
X: array [n_samples_a, n_samples_a] if metric == “precomputed”, or, [n_samples_a, n_features] otherwise. Array of pairwise distances between samples, or a feature array.
Thus the data can only be a numerical array containing distances between the samples. It's not possible to pass distances as categorical values.
You need to first cluster your data, then compute the distance matrix and provide that matrix as input to silhouette_score.
You can use a distance measure like Gower's distance, which deals with mixed data types, and then pass the computed distance matrix as X with metric = 'precomputed' in the silhouette_score function.
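A minimal Python sketch of that route; the split into X_num (numerical columns) and X_cat (integer-encoded categorical columns) and the 50/50 weighting are assumptions you would adapt to your own data:
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_score
# Hypothetical mixed data: numerical columns in X_num, integer-encoded
# categorical columns in X_cat, one row per sample, plus cluster labels.
X_num = np.array([[1.0, 200.0], [1.5, 180.0], [9.0, 40.0], [8.5, 55.0]])
X_cat = np.array([[0, 2], [0, 2], [1, 1], [1, 0]])
labels = np.array([0, 0, 1, 1])
# Manhattan for the numerical part, Hamming (fraction of mismatched
# categories) for the categorical part.
D_num = pairwise_distances(X_num, metric='manhattan')
D_cat = pairwise_distances(X_cat, metric='hamming')
# Combine the two matrices; the scaling and the equal weighting are choices
# you may want to tune (or use a Gower implementation instead).
D = 0.5 * D_num / D_num.max() + 0.5 * D_cat
print(silhouette_score(D, labels, metric='precomputed'))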
I have a training set with the size of (size(X_Training)=122 x 125937).
122 is the number of features
and 125937 is the sample size.
From my limited understanding, PCA is useful when you want to reduce the dimension of the features, meaning I should reduce 122 to a smaller number.
But when I use in matlab:
X_new = pca(X_Training)
I get a matrix of size 125937x121. I am really confused, because this seems to change not only the features but also the sample size. This is a big problem for me, because I still have the target vector Y_Training that I want to use for my neural network.
Any help? Did I badly interpret the results? I only want to reduce the number of features.
Firstly, the documentation of the PCA function is useful: https://www.mathworks.com/help/stats/pca.html. It mentions that the rows are the samples while the columns are the features. This means you need to transpose your matrix first.
Secondly, you need to specify the number of dimensions to reduce to a priori. The PCA function does not do that for you automatically. Therefore, in addition to extracting the principal coefficients for each component, you also need to extract the scores as well. Once you have this, you simply subset into the scores and perform the reprojection into the reduced space.
In other words:
n_components = 10; % Change to however you see fit.
[coeff, score] = pca(X_Training.');
X_reduce = score(:, 1:n_components);
X_reduce will be the dimensionality reduced feature set with the total number of columns being the total number of reduced features. Also notice that the number of training examples does not change as we expect. If you want to make sure that the number of features are along the rows instead of the columns after we reduce the number of features, transpose this output matrix as well before you proceed.
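If you later need to project new data (say a hypothetical test set X_Test, laid out like X_Training with 122 features along the rows) into the same reduced space, centre it with the training means and reuse coeff:
% Centre the test data with the TRAINING means, then project onto the
% retained principal components (implicit expansion, R2016b or later).
mu_train      = mean(X_Training.', 1);                        % 1 x 122 row vector of feature means
X_test_reduce = (X_Test.' - mu_train) * coeff(:, 1:n_components);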
Finally, if you want to automatically determine the number of features to reduce to, one method to do so is to calculate the variance explained of each feature, then accumulate the values from the first feature up to the point where we exceed some threshold. Usually 95% is used.
Therefore, you need to provide additional output variables to capture these:
[coeff, score, latent, tsquared, explained, mu] = pca(X_Training.');
I'll let you go through the documentation to understand the other variables, but the one you're looking at is the explained variable. What you should do is find the point where the total variance explained exceeds 95%:
[~,n_components] = max(cumsum(explained) >= 95);
Finally, if you want to perform a reconstruction and see how well the reconstruction into the original feature space performs from the reduced feature, you need to perform a reprojection into the original space:
X_reconstruct = bsxfun(@plus, score(:, 1:n_components) * coeff(:, 1:n_components).', mu);
mu contains the mean of each feature as a row vector. Therefore you need to add this vector across all examples, so broadcasting is required, which is why bsxfun is used. If you're using MATLAB R2016b or later, this is done implicitly when you use the addition operation:
X_reconstruct = score(:, 1:n_components) * coeff(:, 1:n_components).' + mu;
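To quantify how well the reconstruction matches the original data, one simple check is the relative Frobenius-norm error (a sketch; 0 would mean a perfect reconstruction):
X_orig  = X_Training.';                                        % samples along the rows, as above
rel_err = norm(X_orig - X_reconstruct, 'fro') / norm(X_orig, 'fro');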
I need to calculate the variance of a large vector which is stored as uint8. The MATLAB var function however only accepts double and single types as input. The easiest way to calculate the variance would therefore be
vec = randi(255,1,100,'uint8');
var(single(vec))
This of course gives the correct result. However, using the single datatype increases the memory usage by a factor of 4. For large vectors (~1 million elements) this will quickly fill up the memory.
What I tried: The definition of the variance of a discrete random variable X is
Var(X) = sum_i p_i * (x_i - mu)^2, where mu = sum_i p_i * x_i
(Source: Wikipedia)
I estimated the p's using the histogram, but then got stuck: To calculate the variance in a vectorized fashion, I would need to convert the x_i's to single or double.
Is there any possibility to calculate the variance without converting the whole vector to single or double?
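For reference, a minimal sketch of the histogram idea described above; only the 256 possible values 0..255 (and the 256 counts) are handled in double precision, never the full vector:
vec    = randi(255, 1, 100, 'uint8');       % example input, as above
counts = histcounts(vec, 0:256);            % counts(i) = number of occurrences of the value i-1
n      = numel(vec);
p      = counts / n;                        % estimated probabilities p_i
x      = 0:255;                             % the possible values x_i
mu     = sum(p .* x);                       % estimated mean
v      = sum(p .* (x - mu).^2) * n/(n-1)    % unbiased variance, matching var(single(vec))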
If you're willing to work with uint16, you can do the following. It creates only 3 floating-point numbers (the variance and the 2 means), using Var(X) = Mean(X^2) - Mean(X)^2:
uivec=uint16(vec);
mean(uivec.^2)-mean(uivec)^2
So it's not as good as keeping uint8, but it still uses half the memory of converting to single. uint16 is sufficient because your input is uint8, so the largest squared value, 255^2 = 65025, still fits in uint16.
If you want the exact same answer as var, you need to remember that MATLAB uses the unbiased estimator for var (it divides the sum by n-1 instead of n, where n is your number of samples) so you need to do:
n=length(vec);
v=(mean(uivec.^2)-mean(uivec)^2)*(n/(n-1))
then your v will be exactly equal to var(single(vec)).
No. The value of the variance is going to be a floating point value most likely, so you need to perform floating point operations.
p_i comes from the probability mass function, so sum(p_i) should be one; therefore each p_i is a floating-point number.
In addition, mu, the mean, will probably not be an integer either.
I have created an auto-encoder neural network in MATLAB. I have quite large inputs at the first layer, which I have to reconstruct through the network's output layer. I cannot use the large inputs as they are, so I convert them to the range [0, 1] using MATLAB's sigmf function. However, it gives me a value of 1.000000 for all the large values. I have tried setting the format, but it does not help.
Is there a workaround to using large values with my auto encoder?
The process of converting your inputs to the range [0,1] is called normalization; however, as you noticed, the sigmf function is not adequate for this task. This link may be useful to you.
Suppose that your inputs are given by a matrix of N rows and M columns, where each row represents an input pattern and each column is a feature. If your first column is:
vec =
-0.1941
-2.1384
-0.8396
1.3546
-1.0722
Then you can convert it to the range [0,1] using:
%# get max and min
maxVec = max(vec);
minVec = min(vec);
%# normalize to 0...1
vecNormalized = ((vec-minVec)./(maxVec-minVec))
vecNormalized =
0.5566
0
0.3718
1.0000
0.3052
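To apply the same min-max normalization to the whole N-by-M input matrix at once (inputs being a placeholder name for your data matrix):
%# column-wise min and max
minVals = min(inputs, [], 1);
maxVals = max(inputs, [], 1);
%# normalize every column to 0...1 (implicit expansion, R2016b or later; use bsxfun on older releases)
inputsNormalized = (inputs - minVals) ./ (maxVals - minVals);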
As @Dan indicates in the comments, another option is to standardize the data. The goal of this process is to scale the inputs to have mean 0 and variance 1. In this case, you need to subtract the mean value of the column and divide by the standard deviation:
meanVec = mean(vec);
stdVec = std(vec);
vecStandardized = (vec-meanVec)./ stdVec
vecStandardized =
0.2981
-1.2121
-0.2032
1.5011
-0.3839
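For the whole matrix, the zscore function from the Statistics and Machine Learning Toolbox standardizes each column in one call (again with inputs as the hypothetical data matrix):
inputsStandardized = zscore(inputs);   %# every column ends up with mean 0 and standard deviation 1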
Before I give you my answer, let's think a bit about the rationale behind an auto-encoder (AE):
The purpose of an auto-encoder is to learn, in an unsupervised manner, something about the underlying structure of the input data. How does an AE achieve this goal? If it manages to reconstruct the input signal from its output signal (which is usually of lower dimension), it means that it did not lose information and effectively managed to learn a more compact representation.
In most examples, it is assumed, for simplicity, that both the input signal and the output signal range in [0..1]. Therefore, the same non-linearity (sigmf) is applied both for obtaining the output signal and for reconstructing the inputs back from the outputs.
Something like
output      = sigmf( W*input + b, [1 0] );          % compute output signal (logistic non-linearity)
reconstruct = sigmf( W'*output + b_prime, [1 0] );  % notice the different constant b_prime
Then the AE learning stage tries to minimize the training error || input - reconstruct ||.
However, who said the reconstruction non-linearity must be identical to the one used for computing the output?
In your case, the assumption that the inputs range in [0..1] does not hold. Therefore, it seems that you need to use a different non-linearity for the reconstruction. You should pick one that agrees with the actual range of your inputs.
If, for example, your input ranges in (0..inf) you may consider using exp or ().^2 as the reconstruction non-linearity. You may use polynomials of various degrees, log or whatever function you think may fit the spread of your input data.
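For instance, a minimal sketch of that modification, reusing W, output and b_prime from the snippet above and assuming the inputs lie in (0..inf):
reconstruct = exp( W'*output + b_prime );   % the reconstruction can now span (0..inf), like the raw inputs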
Disclaimer: I never actually encountered such a case and have not seen this type of solution in literature. However, I believe it makes sense and at least worth trying.