How to apply tsne() to MATLAB tabular data?

I have a 33000 x 1975 table in MATLAB, which obviously requires dimensionality reduction before I do any further analysis. The 1975 columns are the features and the rows are instances of the data. I tried using the tsne() function on the MATLAB table, but it seems tsne() only works on numeric arrays. Is there a way to apply tsne to my MATLAB table? The table contains both numeric and string data types, so table2array() doesn't work in my case for converting the table to a numeric array.
Moreover, it seems from the MathWorks documentation, which uses the fisheriris dataset as an example, that tsne() takes the feature columns as the function argument. So I would need to separate the predictors from the responses, which shouldn't be a problem. But initially it is confusing how to proceed with using tsne. Any suggestions in this regard would be highly appreciated.

You can probably use table indexing with {} to extract the data that you want. Here's a simple example adapted from the tsne reference page:
load fisheriris
% Make a table where the first variable is the species name,
% and the other variables are the measurements
data = table(species, meas(:,1), meas(:,2), meas(:,3), meas(:,4))
% Use {} indexing on 'data' to extract a numeric matrix, then
% call 'tsne' on that
Y = tsne(data{:, 2:end});
% plot as per example.
gscatter(Y(:,1),Y(:,2),data.species)
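If your real table mixes numeric and string variables, one option is to select only the numeric variables before extracting the matrix. A minimal sketch, assuming your table is called T and that Class is a hypothetical label column:
% Keep only the numeric variables, then pull them out as a matrix for tsne
numericVars = T(:, vartype('numeric'));   % subtable containing the numeric variables only
X = numericVars{:, :};                    % numeric matrix (rows = instances)
Y = tsne(X);                              % 2-D embedding, one row per instance
gscatter(Y(:,1), Y(:,2), T.Class)         % color by a label column, if you have one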

Related

Pairwise Similarity and Sorting Samples

The following is a problem from an assignment that I am trying to solve:
Visualization of similarity matrix. Represent every sample with a four-dimension vector (sepal length, sepal width, petal length, petal width). For every two samples, compute their pair-wise similarity. You may do so using the Euclidean distance or other metrics. This leads to a similarity matrix where the element (i,j) stores the similarity between samples i and j. Please sort all samples so that samples from the same category appear together. Visualize the matrix using the function imagesc() or any other function.
Here is the code I have written so far:
load('iris.mat'); % create a table of the data
iris.Properties.VariableNames = {'Sepal_Length' 'Sepal_Width' 'Petal_Length' 'Petal_Width' 'Class'}; % change the variable names to their actual meaning
iris_copy = iris(1:150,{'Sepal_Length' 'Sepal_Width' 'Petal_Length' 'Petal_Width'}); % make a copy of the (numerical) features of the table
iris_distance = table2array(iris_copy); % convert the table to an array
% pairwise similarity
D = pdist(iris_distance); % calculate the Euclidean distance and store the result in D
W = squareform(D); % convert to squareform
figure()
imagesc(W); % visualize the matrix
Now, I think I've got the coding mostly right to answer the question. My issue is how to sort all the samples so that samples from the same category appear together because I got rid of the names when I created the copy. Is it already sorted by converting to squareform? Other suggestions? Thank you!
It should be in the same order as the original data. While you could sort the matrix afterwards, the easiest solution is to sort your data by class after line 2 (renaming the variables) and before line 3 (extracting the features):
load('iris.mat'); % create a table of the data
iris.Properties.VariableNames = {'Sepal_Length' 'Sepal_Width' 'Petal_Length' 'Petal_Width' 'Class'}; % change the variable names to their actual meaning
% Sort the table here on the "Class" attribute. Don't forget to change the table name
% in the next line too if you need to.
iris_copy = iris(1:150,{'Sepal_Length' 'Sepal_Width' 'Petal_Length' 'Petal_Width'}); % make a copy of the (numerical) features of the table
Consider using sortrows:
tblB = sortrows(tblA,'RowNames') sorts a table based on its row names. Row names of a table label the rows along the first dimension of the table. If tblA does not have row names, that is, if tblA.Properties.RowNames is empty, then sortrows returns tblA.
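Putting it together, a minimal sketch of the suggested fix (assuming the class labels live in the 'Class' variable named above):
iris_sorted = sortrows(iris, 'Class');    % group rows of the same class together
iris_copy = iris_sorted(:, {'Sepal_Length' 'Sepal_Width' 'Petal_Length' 'Petal_Width'});
iris_distance = table2array(iris_copy);
D = pdist(iris_distance);                 % Euclidean distance
W = squareform(D);
figure()
imagesc(W);                               % same-class blocks now show up along the diagonal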

Use every level of a categorical variable in a regression

Short description
I am trying to run a (GLM) regression in Matlab (using the fitglm function) where one of the regressors is a categorical variable. However, instead of adding an intercept and dropping the first level, I would like to include each level of the categorical variable and exclude the constant term.
Motivation
I know that theoretically the results are the same either way, but I have two reasons against estimating the model with a constant and interpreting the dummy-level coefficients differently:
The smaller problem is that I am running many regressions as part of a larger estimation procedure using different subsets of a large dataset, and the available levels of my categorical variable might not be the same across the regressions. In the end I would like to compare the estimated coefficients for the levels. This can be solved with some additional code/hacking, but it would not be an elegant solution.
The bigger problem is that there are orders of magnitude of difference between the coefficients of the levels: some of them are extremely small. If such a level is used as the base level, I am afraid that it might cause numerical or optimization problems.
Tried approaches
I tried subclassing the GeneralizedLinearModel class, but unfortunately it is marked as final. Class composition also does not work, as I cannot even inherit from the parent of the GeneralizedLinearModel class. Modifying Matlab's files is not an option, as I use a shared Matlab installation.
The only idea I could come up with is using dummyvar or something similar to turn my categorical variable into a set of dummies, and then using these dummy variables in the regression. AFAIK this is how Matlab works internally, but by taking this approach I lose the user-friendliness of dealing with categorical variables.
P.S. This question was also posted on MatlabCentral at this link.
As there seems to be no built-in way to do this, I am posting a short function that I wrote to get the job done.
I have a helper function to convert the categorical variable into an array of dummies:
function dummyTable = convert_to_dummy_table(catVar)
% Convert a categorical variable into a table of 0/1 dummy variables,
% one column per level, named <variableName>_<levelName>.
dummyTable = array2table(dummyvar(catVar));
varName = inputname(1);          % name of the variable as passed in by the caller
levels = categories(catVar)';    % level names as a row cell array
dummyTable.Properties.VariableNames = strcat(varName, '_', levels);
end
The usage is quite simple. If you have a table T with some continuous explanatory variables X1, X2, X3, a categorical explanatory variable C and a response variable Y, then instead of using
M = fitglm(T, 'Distribution', 'binomial', 'Link', 'logit', 'ResponseVar', 'Y')
which would fit a logit model using k - 1 levels for the categorical variable and an intercept, one would do
estTable = [T(:, {'X1', 'X2', 'X3', 'Y'}), convert_to_dummy_table(T.C)]
M = fitglm(estTable, 'Distribution', 'binomial', 'Link', 'logit', ...
'ResponseVar', 'Y', 'Intercept', false)
It is not as nice and readable as the default way of handling categorical variables, but it has the advantage that the names of the dummy variables are identical to the names that Matlab automatically assigns during estimation using a categorical variable. Therefore the Coefficients table of the resulting M object is easy to parse or understand.

Clustering with multiple metrics in Matlab

I have a data set that contains both categorical and numerical features for each row. I would like to select a different similarity metric for each feature (column) and perform hierarchical clustering on the data. Is there a way to do that in Matlab?
Yes, this is actually fairly straightforward: linkage, which creates the tree, takes as input the pairwise dissimilarities (in the condensed vector form returned by pdist). Consequently, in the example workflow below
Y = pdist(X,'cityblock');
Z = linkage(Y,'average');
T = cluster(Z,'cutoff',1.2); % 'cutoff' must be followed by a threshold value
you simply replace the call to pdist with a call to your own function that calculates the pairwise dissimilarities between rows; everything else stays the same.
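For mixed categorical/numerical data, that custom function could compute a Gower-style dissimilarity. Here is a minimal sketch, assuming the data sit in a table T and that numCols and catCols are cell arrays naming the numeric and categorical variables (all names are illustrative, and each numeric column is assumed to have a nonzero range):
function D = mixed_dissimilarity(T, numCols, catCols)
% Gower-style dissimilarity: range-scaled absolute difference for numeric
% variables, 0/1 mismatch for categorical variables, averaged over all
% variables. Returns a condensed vector, like pdist, so it can be fed
% straight into linkage.
n = height(T);
Xnum = T{:, numCols};
Xnum = (Xnum - min(Xnum)) ./ (max(Xnum) - min(Xnum));   % scale each numeric column to [0,1]
Xcat = T{:, catCols};                                   % categorical array
D = zeros(1, n*(n-1)/2);
k = 1;
for i = 1:n-1
    for j = i+1:n
        dNum = mean(abs(Xnum(i,:) - Xnum(j,:)));
        dCat = mean(Xcat(i,:) ~= Xcat(j,:));
        D(k) = (numel(numCols)*dNum + numel(catCols)*dCat) / (numel(numCols) + numel(catCols));
        k = k + 1;
    end
end
end
The result plugs into the workflow above unchanged, e.g. Z = linkage(mixed_dissimilarity(T, numCols, catCols), 'average').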

Grouping Data in a Matrix in MATLAB

I've got a really big matrix which I should "upscale" (i.e.: create another matrix where the elements of the first are grouped 40-by-40). For every 40-by-40 group I should evaluate a series of parameters (i.e.: frequencies, average and standard deviation).
I'm quite sure I can make such thing with a loop, but I was wondering if there was a more elegant vectorized method...
You might find blockproc useful. This command allows you to apply a function (e.g. @mean, @std, etc.) to each distinct block in a 2D matrix.
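A minimal sketch (blockproc ships with the Image Processing Toolbox; the matrix A and the 40-by-40 block size below are just for illustration):
A = rand(400, 400);                                         % stand-in for the big matrix
blockMeans = blockproc(A, [40 40], @(b) mean(b.data(:)));   % one mean per 40x40 block
blockStds  = blockproc(A, [40 40], @(b) std(b.data(:)));    % one std per 40x40 block
Each anonymous function receives a block struct and returns a scalar, so the outputs are small matrices with one entry per block.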

matlab code for data streaming kmeans

I want to have the ability to stream kmeans, meaning that after clustering a set of data, I want to add additional data to a cluster or create new clusters, all without having to run over the old data.
I did a lot of searching but wasn't able to find a MATLAB implementation of this; there was plenty of C source code, however. Does anyone know of something like this?
You could use the 'start' parameter of kmeans.
Matrix: k-by-p matrix of centroid starting locations. In this case, you can pass in [] for k, and kmeans infers k from the first dimension of the matrix. You can also supply a 3-D array, implying a value for the 'replicates' parameter from the array's third dimension.
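A minimal sketch of the idea (oldData, newData and k are illustrative; this only warm-starts kmeans on the new batch, which approximates streaming rather than reprocessing or weighting the old points):
k = 5;
[~, C] = kmeans(oldData, k);                            % initial clustering; keep the centroids
% Later, cluster only the newly arrived batch, starting from the old centroids:
[idxNew, Cupdated] = kmeans(newData, [], 'Start', C);   % k is inferred from the rows of C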