Cluster gene expression to specific breast cancer subtypes in R using clinical parameters and gene expression data - cluster-analysis

I have a set of gene expression values of 1500 genes from breast cancer samples of 3 different subtypes and their clinical parameters. I need to cluster the gene-expression based on the breast cancer subtypes to find unique genes to each of the breast cancer subtypes. I had tried PCA and Kmeans but couldn't explain the name of subtypes. Is there any method to do this? Thanks in Advance.

What you're looking for are differentially expressed genes. If you have expression data for the three subtypes, you can use a differential expression analysis tool (DESeq2 and EdgeR are commonly used) to get the genes that are differentially expressed among your samples. The analysis gives you the extent of differential expression as well as the statistical confidence of the results.
If you also have a control sample, you can infer genes that are specifically up or down-regulated in a certain subtype from the list DEGs.
This is a helpful reference to learn more about the theory as well as the application of differential expression analysis - Introduction to DGE

Related

Classification with confidence value

I am using LDA (Matlab machine learning toolbox) to classify different type of wines based on NMR spectra. The classifier is trained on samples coming from 3 different types. I would like to manage the case where a wine is tested and is not coming from any of these 3 types or it is far from their distributions.
For instance, in the figure below we can see the boundaries and the sample distributions. In this case the posterior probability will be the same for the cross and square markers. They will be both tagged as coming from type 1. We should except that the square is rejected because it is too far from the gaussian distribution of type 1.
How can I have a classification with a confidence value? For example the cross would be tagged as type 1 with a confidence of 95% and the square as type 1 with confidence of <1% and thus rejected.

Matlab fitcsvm Feature Coefficients

I'm running a series of SVM classifiers for a binary classification problem, and am getting very nice results as far as classification accuracy.
The next step of my analysis is to understand how the different features contribute to the classification. According to the documentation, Matlab's fitcsvm function returns a class, SVMModel, which has a field called "Beta", defined as:
Numeric vector of trained classifier coefficients from the primal linear problem. Beta has length equal to the number of predictors (i.e., size(SVMModel.X,2)).
I'm not quite sure how to interpret these values. I assume higher values represent a greater contribution of a given feature to the support vector? What do negative weights mean? Are these weights somehow analogous to beta parameters in a linear regression model?
Thanks for any help and suggestions.
----UPDATE 3/5/15----
In looking closer at the equations describing the linear SVM, I'm pretty sure Beta must correspond to w in the primal form.
The only other parameter is b, which is just the offset.
Given that, and given this explanation, it seems that taking the square or absolute value of the coefficients provides a metric of relative importance of each feature.
As I understand it, this interpretation only holds for the linear binary SVM problem.
Does that all seem reasonable to people?
Intuitively, one can think of the absolute value of a feature weight as a measure of it's importance. However, this is not true in the general case because the weights symbolize how much a marginal change in the feature value would affect the output, which means that it is dependent on the feature's scale. For instance, if we have a feature for "age" that is measured in years, but than we change it to months, the corresponding coefficient will be divided by 12, but clearly,it doesn't mean that the age is less important now!
The solution is to scale the data (which is usually a good practice anyway).
If the data is scaled your intuition is correct and in fact, there is a feature selection method that does just that: choosing the features with the highest absolute weight. See http://jmlr.csail.mit.edu/proceedings/papers/v3/chang08a/chang08a.pdf
Note that this is correct only to linear SVM.

Classifier with a vector of features and a matrix of classes

I'm new in Classification so I'm asking for some advice on how to start.
I've created a Matlab script which create two matrices, one is the class identifier, meaning 100x1 which contains the group from where the data is. group one (1) or group two (2).
The second matrix contains the features 100x40 with 40 features for each point.
What's the best way to start, I'm really lost. Does Matlab has some functions I can use?
I would really appreciate some help.
Thank you.
It depends on what version of MATLAB you are using, but the best starting point would be to look at statistics toolbox for supervised learning. Here are some starting tips for MATLAB 2013a:
http://www.mathworks.co.uk/help/stats/supervised-learning.html
Let's assume that your data is
classes: 100x1
features: 100x40
For each method, the first line shows you how to fit your classification model and the second lines shows how to classify the first row of data in features.
Statistics Toolbox
Naive Bayes Classification
Wikipedia: https://en.wikipedia.org/wiki/Naive_Bayes_classifier
myClassifier = NaiveBayes.fit(features, classes)
myClassifier.predict(features(1,:))
Nearest Neighbors
Wikipedia: https://en.wikipedia.org/wiki/Nearest_neighbour_classifiers
myClassifier = ClassificationKNN.fit(features, classes)
myClassifier.predict(features(1,:))
Classification Trees
Wikipedia: https://en.wikipedia.org/wiki/Classification_tree
myClassifier = ClassificationTree.fit(features, classes)
myClassifier.predict(features(1,:))
Support Vector Machines
Wikipedia: https://en.wikipedia.org/wiki/Support_vector_machine
Note that Support Vector Machines moved into 2013a from Bioinformatics toolbox and it only supports classification into two groups.
myClassifier = svmtrain(features, classes)
svmclassify(myClassifier, features(1,:))
Discriminant Analysis
Wikipedia: https://en.wikipedia.org/wiki/Discriminant_analysis
myClassifier = ClassificationDiscriminant.fit(features, classes)
myClassifier.predict(features(1,:))
Neural Network Toolbox:
If you only have two classes, you could use Neural Network Toolbox for pattern recognition by typing nnstart

measuring uncertainty in matlabs svmclassify

I'm doing contextual object recognition and I need a prior for my observations. e.g. this space was labeled "dog", what's the probability that it was labeled correctly? Do you know if matlabs svmclassify has an argument to return this level of certainty with it's classification?
If not, matlabs svm has the following structures in it:
SVM =
SupportVectors: [11x124 single]
Alpha: [11x1 double]
Bias: 0.0915
KernelFunction: #linear_kernel
KernelFunctionArgs: {}
GroupNames: {11x1 cell}
SupportVectorIndices: [11x1 double]
ScaleData: [1x1 struct]
FigureHandles: []
Can you think of any ways to compute a good measure of uncertainty from these? (Which support vector to use?) Papers/articles explaining uncertainty in SVMs welcome. More in depth explanations of matlabs SVM are also welcome.
If you can't do it this way, can you think of any other libraries with SVMs that have this measure of uncertainty?
Salam,
Hint: I would suggest that you modify the svmclassify.m function to pass the f values of the svmdecision.m to the user. This is in addition to the outputclass.
To access svmclassify.m, just type the following in the Matlab command line:
open svmclassify
I found that svmdecision.m (https://code.google.com/p/auc-recognition/source/browse/trunk/ALLMatlab/AucLib/DigitsOfflineNew/svmdecision.m?spec=svn260&r=260) can pass the f value; so the following may be substitude when calling the svmdecision.m in svmclassify.m:
[classified, f] = svmdecision(sample,svmStruct);
By further passing this f value to the user, the multi-class implementations such as one-vs-all may be designed with the binary classifier already built in Matlab.
f values are what you are looking for in terms of comparison of different inputs being related to their output class.
I hope this helps you to write your code and understand it! though you've probably solved your problem by now.
I know this was posted a long time ago, but I think the author is looking for a measure of uncertainty in the output of the trained SVM, whether the output is an estimated label or an estimated probability, which are both point estimates. One measure would be the variance of the output for the same input x. The problem is that regular SVMs are not stochastic models, such as probabilistic/Bayesian neural nets for example. The inference with an SVM is always the same, given the same x.
I would need to check, but perhaps it would be possible to train an SVM with stochastic regularization. Maybe some form of stochastic margin maximization, whereby during the steps of the optimization routine, input training vectors can undergo small stochastic perturbations or perhaps randomly chosen features could be dropped out. Similarly, during testing, one could drop or modify different randomly chosen features, producing different point estimates each time. Then, you could take the variance of the estimates, producing the measure of uncertainty.
It could be argued that, if the input x presents patterns that the model is not familiar with, the point estimates will be wildly unstable and variable.
One simple illustration can be the following: Imagine a 2d toy example, where the two classes are well separated and occupy a dense range of feature values. While a model trained on this set will generalize well to points that fall into the distribution/range of the values seen in the train data, it would be quite unstable given a test observation that sits far away from both classes, but at the same latitude, so to speak, as the separating hyperplane. Such an observations presents a challenge to the model since a tiny perturbation of one of the support vectors could cause a tiny rotation in the separating hyperplane, not significantly changing training or validation error, but potentially changing the estimate of the far-away test observation by a lot.
Are you supplying the data and doing the training yourself? If so, the best thing to do is to partition your data into training and test sets. Matlab has a function for this called cvpartition. You can use the classification results on the test data to estimate the rate of false positive and the miss rate. For a binary classification task, those two numbers will quantify the uncertainty. For a classification test with multiple hypotheses the best thing to do, probably, is to compile your results in a confusion matrix.
edit. found some older code I'd used that might help a little
P=cvpartition(Y,'holdout',0.20);
rbfsigma=1.41;
svmStruct=svmtrain(X(P.training,:),Y(P.training),'kernel_function','rbf','rbf_sigma',rbfsigma,'boxconstraint',0.7,'showplot','true');
C=svmclassify(svmStruct,X(P.test,:));
errRate=sum(Y(P.test)~=C)/P.TestSize
conMat=confusionmat(Y(P.test),C)
LIBSVM, which also has a Matlab interface, has an option -b that makes the classification function return probability estimates. They seem to be computed following the general approach of Platt (2000), which is to perform a one-dimensional logistic regression applied to the decision values.
Platt, J. C. (2000). Probabilities for SV machines. In Smola, A. J. et al. (eds.) Advances in large margin classifiers. pp. 61–74. Cambridge: The MIT Press.

Rapidminer - neural net operator - output confidence

I have feed-forward neural network with six inputs, 1 hidden layer and two output nodes (1; 0). This NN is learned by 0;1 values.
When applying model, there are created variables confidence(0) and confidence(1), where sum of this two numbers for each row is 1.
My question is: what do these two numbers (confidence(0) and confidence(1)) exactly mean? Are these two numbers probabilities?
Thanks for answers
In general
The confidence values (or scores, as they are called in other programs) represent a measure how, well, confident the model is that the presented example belongs to a certain class. They are highly dependent on the general strategy and the properties of the algorithm.
Examples
The easiest example to illustrate is the majority classifier, who just assigns the same score for all observations based on the proportions in the original testset
Another is example the k-nearest-neighbor-classifier, where the score for a class i is calculated by averaging the distance to those examples which both belong to the k-nearest-neighbors and have class i. Then the score is sum-normalized across all classes.
In the specific example of NN, I do not know how they are calculated without checking the code. I guess it is just the value of output node, sum-normalized across both classes.
Do the confidences represent probabilities ?
In general no. To illustrate what probabilities in this context mean: If an example has probability 0.3 for class "1", then 30% of all examples with similar feature/variable values should belong to class "1" and 70% should not.
As far as I know, his task is called "calibration". For this purpose some general methods exist (e.g. binning the scores and mapping them to the class-fraction of the corresponding bin) and some classifier-dependent (like e.g. Platt Scaling which has been invented for SVMs). A good point to start is:
Bianca Zadrozny, Charles Elkan: Transforming Classifier Scores into Accurate Multiclass Probability Estimates
The confidence measures correspond to the proportion of outputs 0 and 1 that are activated in the initial training dataset.
E.g. if 30% of your training set has outputs (1;0) and the remaining 70% has outputs (0; 1), then confidence(0) = 30% and confidence(1) = 70%