my aim is to classify the data into two sections- upper and lower- finding the mid line of the peaks.
I would like to apply machine learning methods- i.e. Discriminant analysis.
Could you let me know how to do that in MATLAB?
It seems that what you are looking for is GMM (gaussian mixture model). With K=2 (number of mixtures) and dimension equal 1 this will be simple, fast method, which will give you a direct solution. Given components it is easy to analytically find a local minima (which is just a weighted average of means, with weights proportional to the std's).
Related
I have a question on self-organizing maps:
But first, here is my approach on implementing one:
The som neurons are stored in a basic array. Each neuron consists of a vector (another array of the size of the input neurons) of double values which are initialized to a random value.
As far as I understand the algorithm, this is actually all I need to implement it.
So, for the training I choose a sample of the training data at random an calculate the BMU using the Euclidian distance of sample's values and the neuron weights.
Afterwards I update it's weights and all other neurons in it's range depending on the neighborhood function and the learning rate.
Then, I decrease the neighborhood function and the learning rate.
This is done until a fixed amount of iterations.
My question is now: How do I determine the clusters after the training? My approach so far is to present a new input vector and calculate the min Euclidian distance between it and the BMU . But this seems a little naive to me. I'm sure that I've missed something.
There is no single correct way of doing that. As you noted, finding the BMU is one of them and the only one that makes sense if you just want to find the most similar cluster.
If you want to reconstruct your input vector, returning the BMU prototype works too, but may not be very precise (it is equivalent to the Nearest Neighbor rule or 1NN). Then you need to interpolate between neurons to find a better reconstruction. This could be done by weighting each neuron inversely proportional to their distance to the input vector and then computing the weighted average (this is equivalent to weighted KNN). You can also restrict this interpolation only to the BMU's neighbors, which will work faster and may give better results (this would be weighted 5NN). This technique was used here: The Continuous Interpolating Self-organizing Map.
You can see and experiment with those different options here: http://www.inf.ufrgs.br/~rcpinto/itm/ (not a SOM, but a close cousin). Click "Apply" to do regression on a curve using the reconstructed vectors, then check "Draw Regression" and try the different options.
BTW, the description of your implementation is correct.
A pretty common approach nowadays is the soft subspace clustering, where feature weights are added to find the most relevant features. You can use these weights to increase performance and improve the BMU calculation with euclidean distance.
I know a Gaussian Process Regression model is mainly specified by its covariance matrix and the free hyper-parameters act as the 'weights'of the model. But could anyone explain what do the 2 hyper-parameters (length-scale & amplitude) in the covariance matrix represent (since they are not 'real' parameters)? I'm a little confused on the 'actual' meaning of these 2 parameters.
Thank you for your help in advance. :)
First off I would like to point out that there are infinite number of kernels that could be used in a gaussian process. One of the most common however is the RBF (also referred to as squared exponential, the expodentiated quadratic, etc). This kernel is of the following form:
The above equation is of course for the simple 1D case. Here l is the length scale and sigma is the variance parameter (note they go under different names depending on the source). Effectively the length scale controls how two points appear to be similar as it simply magnifies the distance between x and x'. The variance parameter controls how smooth the function is. These are related but not the same.
The Kernel Cookbook give a nice little description and compares RBF kernels to other commonly used kernels.
I understand the concept of PCA, and what it's doing, but trying to apply the concept to my application is proving difficult.
I have a 1 by X matrix of a physiological signal (it's not EMG, but very similar, so think of it as EMG if it helps) which contains various noise and artefacts. What I've noticed of the noise is that some of it is very large and I would assume after PCA this would be the largest principal component, thus my idea of using PCA for some dimensional reduction.
My problem is that with a 1 by X matrix there is no covariance matrix, only the variance, and thus eigenvectors and all of PCA falls through.
I know I need to rearrange my data into a matrix more than 1D, but this is where I need some suggestions. Do I split my data into windows of equal length to create a large dimensional matrix which I can apply PCA to? Do I perform several trials of the same action so I have lots of data sets (this would be impractical for my application)?
Any suggestions or examples would be helpful. I'm using MATLAB to perform this task.
I am new to using Matlab and am trying to follow the example in the Bioinformatics Toolbox documentation (SVM Classification with Cross Validation) to handle a classification problem.
However, I am not able to understand Step 9, which says:
Set up a function that takes an input z=[rbf_sigma,boxconstraint], and returns the cross-validation value of exp(z).
The reason to take exp(z) is twofold:
rbf_sigma and boxconstraint must be positive.
You should look at points spaced approximately exponentially apart.
This function handle computes the cross validation at parameters
exp([rbf_sigma,boxconstraint]):
minfn = #(z)crossval('mcr',cdata,grp,'Predfun', ...
#(xtrain,ytrain,xtest)crossfun(xtrain,ytrain,...
xtest,exp(z(1)),exp(z(2))),'partition',c);
What is the function that I should be implementing here? Is it exp or minfn? I will appreciate if you can give me the code for this section. Thanks.
I will like to know what does it mean when it says exp([rbf_sigma,boxconstraint])
rbf_sigma: The svm is using a gaussian kernel, the rbf_sigma set the standard deviation (~size) of the kernel. To understand how kernels work, the SVM is putting the kernel around every sample (so that you have a gaussian around every sample). Then the kernels are added up (sumed) for the samples of each category/type. At each point the type which sum is higher would be the "winner". For example if type A has a higher sum of these kernels at point X, then if you have a new datum to classify in point X, it will be classified as type A. (there are other configuration parameters that may change the actual threshold where a category is selected over another)
Fig. Analyze this figure from the webpage you gave us. You can see how by adding up the gaussian kernels on the red samples "sumA", and on the green samples "sumB"; it is logical that sumA>sumB in the center part of the figure. It is also logical that sumB>sumA in the outer part of the image.
boxconstraint: it is a cost/penalty over miss-classified data. During the training stage of the classifier, where you use the training data to adjust the SVM parameters, the training algorithm is using an error function to decide how to optimize the SVM parameters in an iterative fashion. The cost for a miss-classified sample is proportional to how far it is from the boundary where it would have been classified correctly. In the figure that I am attaching the boundary is the inner blue circumference.
Taking into account BGreene indications and from what I understand of the tutorial:
In the tutorial they advice to try values for rbf_sigma and boxconstraint that are exponentially apart. This means that you should compare values like {0.2, 2, 20, ...} (note that this is {2*10^(i-2), i=1,2,3,...}), and NOT like {0.2, 0.3, 0.4, 0.5} (which would be linearly apart). They advice this to try a wide range of values first. You can further optimize later FROM the first optimum that you obtained before.
The command "[searchmin fval] = fminsearch(minfn,randn(2,1),opts)" will give you back the optimum values for rbf_sigma and boxconstraint. Probably you have to use exp(z) because it affects how fminsearch increments the values of z(1) and z(2) during the search for the optimum value. I suppose that when you put exp(z(1)) in the definition of #minfn, then fminsearch will take 'exponentially' big steps.
In machine learning, always try to understand that there are three subsets in your data: training data, cross-validation data, and test data. The training set is used to optimize the parameters of the SVM classifier for EACH value of rbf_sigma and boxconstraint. Then the cross validation set is used to select the optimum value of the parameters rbf_sigma and boxconstraint. And finally the test data is used to obtain an idea of the performance of your classifier (the efficiency of the classifier is determined upon the test set).
So, if you start with 10000 samples you may divide the data for example as training(50%), cross-validation(25%), test(25%). So that you will sample randomly 5000 samples for the training set, then 2500 samples from the 5000 remaining samples for the cross-validation set, and the rest of samples (that is 2500) would be separated for the test set.
I hope that I could clarify your doubts. By the way, if you are interested in the optimization of the parameters of classifiers and machine learning algorithms I strongly suggest that you follow this free course -> www.ml-class.org (it is awesome, really).
You need to implement a function called crossfun (see example).
The function handle minfn is passed to fminsearch to be minimized.
exp([rbf_sigma,boxconstraint]) is the quantity being optimized to minimize classification error.
There are a number of functions nested within this function handle:
- crossval is producing the classification error based on cross validation using partition c
- crossfun - classifies data using an SVM
- fminsearch - optimizes SVM hyperparameters to minimize classification error
Hope this helps
I have a set of 100 observations where each observation has 45 characteristics. And each one of those observations have a label attached which I want to predict based on those 45 characteristics. So it's an input matrix with the dimension 45 x 100 and a target matrix with the dimension 1 x 100.
The thing is that I want to know how many of those 45 characteristics are relevant in my set of data, basically the principal component analysis, and I understand that I can do this with Matlab function processpca.
Could you please tell me how can I do this? Suppose that the input matrix is x with 45 rows and 100 columns and y is a vector with 100 elements.
Assuming that you want to construct a model of the 1x100 vector, based on the 45x100 matrix, I am not convinced that PCA will do what you think. PCA can be used to select variables for model estimation, but this is a somewhat indirect way to gather a set of model features. Anyway, I suggest reading both:
Principal Components Analysis
and...
Putting PCA to Work
...both of which provide code in MATLAB not requiring any Toolboxes.
Have you tried COEFF = princomp(x)?
COEFF = princomp(X) performs principal
components analysis (PCA) on the
n-by-p data matrix X, and returns the
principal component coefficients, also
known as loadings. Rows of X
correspond to observations, columns to
variables. COEFF is a p-by-p matrix,
each column containing coefficients
for one principal component. The
columns are in order of decreasing
component variance.
From your question I deduced you don't need to do it in MATLAB, but you just want to analyze your dataset. According to my opinion the key is visualization of the dependencies.
If you're not forced to do the analysis in MATLAB I'd suggest you try more specialized software something like WEKA (www.cs.waikato.ac.nz/ml/weka/) or RapidMiner (rapid-i.com). Both tools can provide PCA and other dimension reduction algorithms + they contain nice visualization tools.
Your use case sounds like a combination of Classification and Feature Selection.
Statistics Toolbox offers a lot of good capabilities in this area. The toolbox provides access to a number of classification algorithms including
Naive Bayes Classifiers Bagged
Decision Trees (aka Random Forests)
Binomial and Multinominal logistic regression
Linear Discriminant analysis
You also have a variety of options available for feature selection include
sequentialfs (forwards and backwards feature selection)
relifF
"treebagger" also supports options for feature selection and estimating variable importance.
Alternatively, you can use some of Optimization Toolbox's capabilities to write your own custom equations to estimate variable importance.
A couple monthes back, I did a webinar for The MathWorks titled "Compuational Statistics: Getting Started with Classification using MTALAB". You can watch the Webinar at
http://www.mathworks.com/company/events/webinars/wbnr51468.html?id=51468&p1=772996255&p2=772996273
The code and the data set for the examples is available at MATLAB Central
http://www.mathworks.com/matlabcentral/fileexchange/28770
With all this said and done, many people using Principal Component Analysis as a pre-processing step before applying classification algorithms. PCA gets used alot
When you need to extract features from images
When you're worried about multicollinearity
You should find correlation matrix. in the following example matlab finds correlation matrix with 'corr' function
http://www.mathworks.com/help/stats/feature-transformation.html#f75476