I want to be able to stream k-means, meaning that after clustering a set of data, I want to add new data to an existing cluster or create new clusters, all without having to run over the old data.
I did a lot of searching but wasn't able to find a MATLAB implementation of this; there were many C implementations, however. Does anyone know of something like this?
You could use the 'start' parameter of kmeans.
Matrix: k-by-p matrix of centroid starting locations. In this case,
you can pass in [] for k, and kmeans infers k from the first dimension
of the matrix. You can also supply a 3-D array, implying a value for
the 'replicates' parameter from the array's third dimension.
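A minimal sketch of this idea, where X_old and X_new are placeholder names for the already-clustered batch and the new batch (this only approximates streaming, since the old data is not revisited):

```
% Cluster the initial batch and keep the centroids.
[idxOld, C] = kmeans(X_old, 5);

% Seed kmeans with the previous centroids: pass [] for k so it is
% inferred from the number of rows of C. Only the new batch is
% processed; X_old is never touched again.
[idxNew, Cnew] = kmeans(X_new, [], 'Start', C);
```

If genuinely incremental updates are required (one point at a time), you would instead maintain per-cluster counts and update each centroid as a running mean, which kmeans itself does not do for you.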
I have a situation where I have a mixed data set as mentioned and am trying unsupervised clustering.
I am trying many different experiments, including Gower's distance and k-prototypes. I want to try some of sklearn's metrics to see what values they give me.
While looking at silhouette_score, I saw there is a 'metric' argument that decides how distances are computed. Since my data has mixed types, I would like to use Manhattan distance for the numerical features and Hamming distance for the categorical ones. Is there a way to use silhouette_score with both metrics in one go? If all my input data were numerical, I would have done as below:
silhouette_score(friendRecomennderData, labels, metric = 'manhattan')
Thank you in advance.
You are getting confused about the arguments that are passed to silhouette_score. If you read the documentation mentioned here, it says the following about the input data, i.e. the parameter X:
X: array [n_samples_a, n_samples_a] if metric == "precomputed", or [n_samples_a, n_features] otherwise. Array of pairwise distances between samples, or a feature array.
Thus the data can only be a numerical array, comprising either features or distances between samples. It's not possible to pass categorical values as distances.
You need to first cluster your data, then compute the distance matrix and provide that matrix as input to silhouette_score.
You can use a distance measure like Gower's distance, which handles mixed data types, and then pass the computed distance matrix as X with metric = 'precomputed' to the silhouette_score function.
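As a sketch of that approach, here is one way to build a simple Gower-style combined distance by hand (Manhattan on the numeric columns, Hamming on the categorical ones) and feed it to silhouette_score with metric='precomputed'. The toy data, the integer encoding of the categories, and the unweighted sum of the two distance matrices are all illustrative choices, not a prescribed recipe:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import silhouette_score

# Toy mixed data: two numeric columns and one categorical column
# (categories encoded as integers purely for illustration).
num = np.array([[0.1, 0.2], [0.0, 0.3], [5.0, 5.1], [5.2, 4.9]])
cat = np.array([[0], [0], [1], [1]])

# Manhattan distance on the numeric features
d_num = cdist(num, num, metric='cityblock')
# Hamming distance (mismatch rate) on the categorical features
d_cat = cdist(cat, cat, metric='hamming')

# Combine the two; equal weighting here is a modelling choice
D = d_num + d_cat

# Labels would normally come from your clustering step
labels = np.array([0, 0, 1, 1])
score = silhouette_score(D, labels, metric='precomputed')
print(score)
```

The same precomputed matrix D can also be reused for the clustering itself (e.g. with AgglomerativeClustering and metric='precomputed'), so the clustering and the evaluation agree on one notion of distance.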
I have a 33000 x 1975 table in MATLAB, which obviously requires dimensionality reduction before any further analysis. The 1975 columns are the features and the rows are instances of the data. I tried using the tsne() function on the MATLAB table, but it seems tsne() only works on numeric arrays. Is there a way to apply tsne to my MATLAB table? The table consists of both numeric and string data types, so table2array() doesn't work in my case for converting the table to a numeric array.
Moreover, it seems from the MathWorks documentation (which uses the fisheriris dataset as an example) that tsne() takes the feature columns as the function argument. So I would need to separate the predictors from the responses, which shouldn't be a problem. But initially it is confusing how to proceed with tsne. Any suggestions in this regard would be highly appreciated.
You can use curly-brace {} indexing into the table to extract the data you want. Here's a simple example adapted from the tsne reference page:
load fisheriris
% Make a table where the first variable is the species name,
% and the other variables are the measurements
data = table(species, meas(:,1), meas(:,2), meas(:,3), meas(:,4))
% Use {} indexing on 'data' to extract a numeric matrix, then
% call 'tsne' on that
Y = tsne(data{:, 2:end});
% plot as per example.
gscatter(Y(:,1),Y(:,2),data.species)
I have a 2D 2401-by-266 matrix K whose columns correspond to x values (t, stored in a 1-by-266 array) and whose rows correspond to y values (z, stored in a 1-by-2401 array).
I want to extrapolate the matrix K to predict some future values (corresponding to t(1,267:279)). So far I have extended t so that it is now a 1-by-279 array, using a for loop:
for tq = 267:279
t(1,tq) = t(1,tq-1)+0.0333333333;
end
However, I am stumped on how to extrapolate K without fitting a polynomial to each individual row.
I feel like there must be a more efficient way than this.
There are countless extrapolation methods in the literature; "fitting a polynomial to each row" is just one of them, and not necessarily invalid, so I'm not sure why you want to avoid it. For 2D data, fitting a surface might lead to better results, though.
However, if you want an easy, simple way (that may or may not work for your problem), you can always use the function interp2 for interpolation. If you choose spline or makima as the interpolation method, it will also extrapolate for any query point outside the domain of K.
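A short sketch of that interp2 route, assuming K is 2401-by-266 and t has already been extended to 1-by-279 as described above:

```
% Build grids for the known samples and for the full (extended) range
[T0, Z0] = meshgrid(t(1:266), z);   % matches the size of K
[Tq, Zq] = meshgrid(t, z);          % includes the 13 new time points

% 'spline' (or 'makima') extrapolates beyond the domain of K,
% giving a 2401-by-279 result
Kext = interp2(T0, Z0, K, Tq, Zq, 'spline');
```

Keep in mind that spline extrapolation can oscillate badly far from the data, so it is worth sanity-checking the extrapolated columns against any known behaviour of your signal.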
I have a data set that contains both categorical and numerical features for each row. I would like to select a different similarity metric for each feature (column) and preform hierarchical clustering on the data. Is there a way to do that in Matlab?
Yes, this is actually fairly straightforward: linkage, which creates the tree, takes a dissimilarity matrix as input. Consequently, in the example workflow below
Y = pdist(X,'cityblock');
Z = linkage(Y,'average');
T = cluster(Z,'cutoff',1.2);  % 1.2 is just an example cutoff value
you simply replace the call to pdist with a call to your own function that computes the pairwise dissimilarity between rows; everything else stays the same.
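For instance, a sketch assuming (hypothetically) that columns 1:3 of X are numeric and column 4 is an integer-encoded categorical feature:

```
% Per-column-type dissimilarities; the column split is illustrative
Dnum = pdist(X(:,1:3), 'cityblock');  % city block on numeric features
Dcat = pdist(X(:,4),   'hamming');    % mismatch rate on the categorical one

% Combine into one dissimilarity vector; equal weighting is a choice
Y = Dnum + Dcat;

Z = linkage(Y, 'average');
T = cluster(Z, 'cutoff', 1.0);        % example cutoff value
```

Because linkage only ever sees the combined vector Y, any per-column metric and weighting scheme works, as long as the result has the same pdist-style vector form.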
I've got a really big matrix which I need to "upscale" (i.e., create another matrix where the elements of the first are grouped 40-by-40). For every 40-by-40 group I need to evaluate a series of parameters (e.g., frequencies, average, and standard deviation).
I'm quite sure I can do this with a loop, but I was wondering if there is a more elegant vectorized method...
You might find blockproc useful. This command allows you to apply a function (e.g. @mean, @std, etc.) to each distinct block in a 2D matrix.
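A short sketch of that approach (blockproc requires the Image Processing Toolbox; the matrix size here is just an illustration):

```
A = rand(4000, 4000);                 % example large matrix

% blockproc passes each distinct 40-by-40 block as a struct whose
% .data field holds the block; here we reduce each block to a scalar,
% giving a 100-by-100 result.
M = blockproc(A, [40 40], @(blk) mean(blk.data(:)));  % per-block mean
S = blockproc(A, [40 40], @(blk) std(blk.data(:)));   % per-block std
```

Each statistic needs its own blockproc call (or a function that returns a small array per block), since the block function is applied uniformly to every block.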