for how How to compute whether representational similarity matrix values are significant - matlab

I am new to RSA analysis in fMRI images. I used SPM 12 for preprocessing and first level analysis of my fMRI images and used RSA-toolbox to compute RDMs (representational dissimilarity matrix) for my conditions in an specific region of the brain. Now I have the RDM mateix for every single subject also have the overall RDM across all subjects. However, RSA-toolbox doesn't report any p value or significance test for the values in the RDM. How can I compute or determine which values in the RDM matrix are significant and which are not? I used pearson's r method for to compute RDMs. In paeticular I want to have an explaination about the mathematics that can be used to test significancy of these values.

Related

Multi-label clustering

I have a question regarding a task that I am trying to solve. The data that I have are characterisation data,
meaning that I have a label (PASS/FAIL) for every single datapoint.
So my data matrix, is of n rows and m columns and the target variables are again a matrix of
n rows and m columns composed of binary values (0s and 1s).
My task is to apply clustering and partition all these datapoints into two clusters, one being for PASS
datapoints and the other for FAIL datapoints. I wasn't able to find an algorithm that can solve
this type of 'multi-label' problem with clustering.
I tried to implement algorithms like k-means but while tuning the number of clusters to initialise
I get k=6 which doesn't really make sense. In the data, outliers are already dropped and they
are normalised as well.
I have a large amount of features on my data matrix (eg. >3000) and I tried to apply
dimensionality reduction methods like PCA to at least drop the features that are more
irrelevant than the rest. But I am not sure if this would be applicable in my case when
I have a binary matrix as target variables.
Is there a specific algorithm that can solve this type of problem and if so, what is the
necessary pre-processing I should be doing before applying it?

Gaussian Mixture Model (GMM) giving only one cluster

I have a dataset that has 70 columns and 4.4 million rows. I want to perform clustering on it. I did TF-IDF first then I used clustering with K-means, Bisecting k-means and Gaussian Mixture Model (GMM). While the other techniques give me the specified number of clusters, GMM gives only one cluster. Example, in the code below, I want 20 clusters but it returns only 1 cluster. Is this happening because of the fact that I have many columns or it is merely caused by the nature of the data?
gmm = GaussianMixture(k = 20, tol = 0.000001, maxIter=10000, seed =1)
model = gmm.fit(rescaledData)
df1 = model.transform(rescaledData).select(['label','prediction'])
df1.groupBy('prediction').count().show() # this returns 1 row
In my opinion, the main reason behind of bad clustering performance of Pyspark GMM is that it's implementation is done using diagonal covariance matrix which do not take account of covariance between different features present within the dataset.
Check it's implementation here: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/clustering/GaussianMixture.scala
where they have cleary mentioned to be using diagonal covariance matrix because of curse of dimensionality.
#note This algorithm is limited in its number of features since it requires storing a covariance matrix which has size quadratic in the number of features. Even when the number of features does not exceed this limit, this algorithm may perform poorly on high-dimensional data. This is due to high-dimensional data (a) making it difficult to cluster at all (based on statistical/theoretical arguments) and (b) numerical issues with Gaussian distributions.

Proper way of generating initial random vectors for generator model in GAN?

Frequently linear interpolation is used with a Gaussian or uniform prior which has unit variance and zero mean where the size of the vector can be defined in an arbitrary way e.g. 100 to generate initial random vectors for generator model in Generative Adversarial Neural (GAN).
Let's say we have 1000 images for training and batch size is 64. Then each epoch, need to generate a number of random vectors using prior distribution corresponding to each image given small batch. But the problem I see is that since there is no mapping between random vector and corresponding image, the same image can be generated using multiple initial random vectors. In this paper, it suggests overcoming this problem by using different spherical interpolation up to some extent.
So what will happens if initially generate random vectors corresponding to the number of training images and when train the model uses the same random vector which is generated initially?
In GANs the random seed used as input does not actually correspond to any real input image. What GANs actually do is learn a transformation function from a known noise distribution (e.g. Gaussian) to a complex unknown distribution, which is representated by i.i.d. samples (e.g. your training set). What the discriminator in a GAN does is to calculate a divergence (e.g. Wasserstein divergence, KL-divergence, etc.) between the generated data (e.g. transformed gaussian) and the real data (your training data). This is done in a stochastic fashion and therefore no link is neccessary between the real and the fake data. If you want to learn more about this on a hands on example, I can recommend that you train to train a Wasserstein GAN to transform one 1D gaussian distribution into another one. There you can visualize the discriminator and the gradient of the discriminator and really see the dynamics of such a system.
Anyways, what your paper is trying to tell you is after you have trained your GAN and want to see how it has mapped the generated data from the known noise space to the unknown image space. For this reason interpolation schemes have been invented like the spherical one you are quoting. They also show that the GAN has learned to map some parts of the latent space to key characteristics in images, like smiles. But this has nothing to do with the training of GANs.

clustering vs fitting a mixture model

I have a question about using a clustering method vs fitting the same data with a distribution.
Assuming that I have a dataset with 2 features (feat_A and feat_B) and let's assume that I use a clustering algorithm to divide the data in an optimal number of clusters...say 3.
My goal is to assign for each of the input data [feat_Ai,feat_Bi] a probability (or something similar) that the point belongs to cluster 1 2 3.
a. First approach with clustering:
I cluster the data in the 3 clusters and I assign to each point the probability of belonging to a cluster depending on the distance from the center of that cluster.
b. Second approach using mixture model:
I fit a mixture model or mixture distribution to the data. Data are fit to the distribution using an expectation maximization (EM) algorithm, which assigns posterior probabilities to each component density with respect to each observation. Clusters are assigned by selecting the component that maximizes the posterior probability.
In my problem I find the cluster centers (or I fit the model if approach b. is used) with a subsample of data. Then I have to assign a probability to a lot of other data... I would like to know in presence of new data which approach is better to use to still have meaningful assignments.
I would go for a clustering method for example a kmean because:
If the new data come from a distribution different from the one used to create the mixture model, the assignment could be not correct.
With new data the posterior probability changes.
The clustering method minimizes the variance of the clusters in order to find a kind of optimal separation border, the mixture model take into consideration the variance of the data to create the model (not sure that the clusters that will be formed are separated in an optimal way).
More info about the data:
Features shouldn't be assumed dependent.
Feat_A represents the duration of a physical activity Feat_B the step counts In principle we could say that with an higher duration of the activity the step counts increase, but it is not always true.
Please help me to think and if you have any other point please let me know..

Clustering: a training dataset of variable data dimensions

I have a dataset of n data, where each data is represented by a set of extracted features. Generally, the clustering algorithms need that all input data have the same dimensions (the same number of features), that is, the input data X is a n*d matrix of n data points each of which has d features.
In my case, I've previously extracted some features from my data but the number of extracted features for each data is most likely to be different (I mean, I have a dataset X where data points have not the same number of features).
Is there any way to adapt them, in order to cluster them using some common clustering algorithms requiring data to be of the same dimensions.
Thanks
Sounds like the problem you have is that it's a 'sparse' data set. There are generally two options.
Reduce the dimensionality of the input data set using multi-dimensional scaling techniques. For example Sparse SVD (e.g. Lanczos algorithm) or sparse PCA. Then apply traditional clustering on the dense lower dimensional outputs.
Directly apply a sparse clustering algorithm, such as sparse k-mean. Note you can probably find a PDF of this paper if you look hard enough online (try scholar.google.com).
[Updated after problem clarification]
In the problem, a handwritten word is analyzed visually for connected components (lines). For each component, a fixed number of multi-dimensional features is extracted. We need to cluster the words, each of which may have one or more connected components.
Suggested solution:
Classify the connected components first, into 1000(*) unique component classifications. Then classify the words against the classified components they contain (a sparse problem described above).
*Note, the exact number of component classifications you choose doesn't really matter as long as it's high enough as the MDS analysis will reduce them to the essential 'orthogonal' classifications.
There are also clustering algorithms such as DBSCAN that in fact do not care about your data. All this algorithm needs is a distance function. So if you can specify a distance function for your features, then you can use DBSCAN (or OPTICS, which is an extension of DBSCAN, that doesn't need the epsilon parameter).
So the key question here is how you want to compare your features. This doesn't have much to do with clustering, and is highly domain dependant. If your features are e.g. word occurrences, Cosine distance is a good choice (using 0s for non-present features). But if you e.g. have a set of SIFT keypoints extracted from a picture, there is no obvious way to relate the different features with each other efficiently, as there is no order to the features (so one could compare the first keypoint with the first keypoint etc.) A possible approach here is to derive another - uniform - set of features. Typically, bag of words features are used for such a situation. For images, this is also known as visual words. Essentially, you first cluster the sub-features to obtain a limited vocabulary. Then you can assign each of the original objects a "text" composed of these "words" and use a distance function such as cosine distance on them.
I see two options here:
Restrict yourself to those features for which all your data-points have a value.
See if you can generate sensible default values for missing features.
However, if possible, you should probably resample all your data-points, so that they all have values for all features.