Fitting lognormal distribution using percentile matching - MATLAB

I have very limited cumulative probability information, as shown below:
x1=6 F(x1)=0
x2=7.25 F(x2)=0.1
x3=8 F(x3)=0.35
x4=9.5 F(x4)=1
I want to fit a cumulative log-normal distribution curve to this data. I know there is a method called percentile matching for estimating the parameters from this kind of information, but it usually uses only two data points, since there are only two unknown parameters.
Here, however, I also have to accommodate the upper and lower thresholds. Is there any way to do this in MATLAB?
Any reference on estimating lognormal distribution parameters from percentile information would be appreciated. Thanks!
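
For illustration, here is a minimal sketch of percentile matching on the numbers in the question, using only base MATLAB functions. Step 1 is the plain two-point match on the interior points, using ln(x_p) = mu + sigma*z_p. Step 2 is only one possible way to honor the thresholds and is an assumption, not an established recipe: the lognormal is truncated to [6, 9.5], so F(6)=0 and F(9.5)=1 hold by construction, and mu and sigma are refitted by least squares. The names F, G, sse, q0 and qhat are made up for this sketch.

```matlab
% Percentile matching for a lognormal CDF using base MATLAB functions only.
a = 6; b = 9.5;            % thresholds where F is 0 and 1 in the question
x = [7.25 8];              % interior quantiles
p = [0.10 0.35];           % their cumulative probabilities

% Step 1: plain two-point matching, ln(x) = mu + sigma*z_p.
z      = -sqrt(2) * erfcinv(2 * p);      % standard normal quantiles
sigma0 = diff(log(x)) / diff(z);
mu0    = log(x(1)) - sigma0 * z(1);

% Lognormal CDF and its truncation to [a, b] (assumed treatment of thresholds).
F = @(t, mu, sigma) 0.5 * erfc(-(log(t) - mu) ./ (sigma * sqrt(2)));
G = @(t, mu, sigma) (F(t, mu, sigma) - F(a, mu, sigma)) ./ ...
                    (F(b, mu, sigma) - F(a, mu, sigma));

% Step 2: refit so the truncated CDF passes near the interior points.
% q = [mu, log(sigma)] keeps sigma positive during the search.
sse   = @(q) sum((G(x, q(1), exp(q(2))) - p).^2);
q0    = [mu0, log(sigma0)];
qhat  = fminsearch(sse, q0);
mu    = qhat(1);
sigma = exp(qhat(2));
```

With the Statistics Toolbox, norminv and logncdf could replace the erfcinv and erfc expressions. Whether truncation is the right model for the thresholds depends on what the thresholds mean physically; a shifted (three-parameter) lognormal would be another option.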

Related

Matlab cftool - fit using percentage deviation as criteria

I have strongly varying data which I am trying to fit with a custom function in cftool in MATLAB. The issue is that, by default, the fit minimizes the sum of squared errors (SSE). This results in large fitting errors for small values, since their offsets contribute little to the SSE.
I would like to do the fit with the sum of squared relative deviations (percentage errors) as the criterion to minimize. Is there a way to achieve this?
Of course, I can do this in a script by writing out the function to be minimized explicitly. However, my equation is not yet final, so it is awkward to experiment with when it is hard-coded in a script. I find the interactive cftool way of creating fits easier.
I found a way to use the sum of squared percentage errors as the criterion in MATLAB cftool, as follows.
The sum of squared errors (SSE) is computed as
$$\mathrm{SSE} = \sum_{i=1}^{n} w_i \,(y_i - \hat{y}_i)^2$$
There is some freedom here through the weights w_i. If we take
$$w_i = \frac{1}{y_i^2}$$
then the SSE becomes the sum of squared percentage errors, where y_i is the data to be fitted and \hat{y}_i is the fitted value.
Update: As James pointed out, care has to be taken when y_i=0, since the weight is then undefined.
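
As a quick check of the weighting idea, the sketch below builds the weight vector w_i = 1/y_i^2 and confirms that the weighted SSE equals the sum of squared relative errors. The data and the model prediction yhat are made up for illustration, and giving zero-valued points a weight of 0 is just one possible way to handle y_i=0.

```matlab
% Weighted SSE with w_i = 1/y_i^2 equals the sum of squared relative errors.
x    = (1:6)';
y    = [0.011; 0.10; 0.95; 10.5; 98; 1010];   % made-up data spanning decades
yhat = 0.001 * 10.^x;                          % hypothetical model prediction

nz    = y ~= 0;              % guard against division by zero
w     = zeros(size(y));
w(nz) = 1 ./ y(nz).^2;       % zero-valued points get weight 0 (one possible choice)

sse_weighted = sum(w .* (y - yhat).^2);
sse_relative = sum(((y(nz) - yhat(nz)) ./ y(nz)).^2);
```

A vector like w can then be selected as the weights in cftool, or, assuming the Curve Fitting Toolbox is available, passed to fit through its 'Weights' option.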

Discriminant analysis method to classify data

My aim is to classify the data into two sections, upper and lower, by finding the mid line between the peaks.
I would like to apply a machine learning method, i.e. discriminant analysis.
Could you let me know how to do that in MATLAB?
It seems that what you are looking for is a GMM (Gaussian mixture model). With K=2 (the number of mixture components) and one dimension, this is a simple, fast method that gives you a direct solution. Given the two components, the local minimum between them is easy to find analytically (it is just a weighted average of the means, with weights proportional to the standard deviations).
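
A minimal sketch of this idea follows: a hand-written EM loop for a two-component 1-D Gaussian mixture in base MATLAB (with the Statistics Toolbox, fitgmdist would be the usual route). The data are synthetic and deterministic, and the split line used here, the point with equal standardized distance to both means, is one assumed reading of the "weighted average of the means" mentioned above.

```matlab
% Minimal EM for a two-component 1-D Gaussian mixture (base MATLAB only).
% The data below are synthetic and deterministic, made up for illustration.
t = linspace(-2, 2, 50)';
x = [1 + 0.3 * t; 5 + 0.5 * t];        % lower cluster near 1, upper near 5

m = [min(x); max(x)];                   % initial means
s = [std(x); std(x)];                   % initial standard deviations
w = [0.5; 0.5];                         % initial mixing weights
npdf = @(x, m, s) exp(-0.5 * ((x - m) ./ s).^2) ./ (s * sqrt(2 * pi));

for iter = 1:200
    % E-step: responsibility of each component for each point
    r1 = w(1) * npdf(x, m(1), s(1));
    r2 = w(2) * npdf(x, m(2), s(2));
    g1 = r1 ./ (r1 + r2);
    g2 = 1 - g1;
    % M-step: update weights, means and standard deviations
    w  = [mean(g1); mean(g2)];
    m  = [sum(g1 .* x) / sum(g1); sum(g2 .* x) / sum(g2)];
    s1 = sqrt(sum(g1 .* (x - m(1)).^2) / sum(g1));
    s2 = sqrt(sum(g2 .* (x - m(2)).^2) / sum(g2));
    s  = [s1; s2];
end

% Split line where the standardized distances to both means are equal
midline = (m(1) * s(2) + m(2) * s(1)) / (s(1) + s(2));
isUpper = x > midline;                  % logical label: upper vs. lower section
```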

Select data based on a distribution in matlab

I have a set of data in a vector. If I were to plot a histogram of the data I could see (by clever inspection) that the data is distributed as the sum of three distributions:
One normal distribution centered around x_1 with variance s_1;
One normal distribution centered around x_2 with variance s_2;
One lognormal distribution.
My data is obviously a subset of the 'real' data.
What I would like to do is to take a random subset of my data away from my data ensuring that the resulting subset is a reasonable representative sample of the original data.
I would like to do this as easily as possible in matlab but am new to both statistics and matlab and am unsure where to start.
Thank you for any help :)
If you can identify each of the 3 distributions (in the sense that you can estimate their parameters), one approach is to select a random subset of your data, estimate the parameters of each distribution from that subset, and check whether they are close enough (according to your own definition of "close") to the parameters estimated from the original data. You should repeat this process several times and look at the average difference for a given subset size.
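
The resampling loop described above could be sketched as follows. To keep the example short, a handful of empirical quantiles stands in for the per-distribution parameter estimates; that substitution, the synthetic data vector, and the values of k and nTrial are all assumptions made for illustration.

```matlab
% Draw random subsets and check how well they track the full data set.
% Quantiles are used here as a simple proxy for the distribution parameters.
data = [2 + 0.5 * randn(300, 1); 6 + randn(300, 1); exp(0.5 * randn(300, 1))];
k      = 200;                          % subset size
nTrial = 50;                           % number of repetitions
probs  = [0.1 0.25 0.5 0.75 0.9];

sFull = sort(data);
qFull = sFull(ceil(probs * numel(data)));   % quantiles of the full data

err = zeros(nTrial, 1);
for t = 1:nTrial
    idx    = randperm(numel(data), k);      % sample without replacement
    sub    = sort(data(idx));
    qSub   = sub(ceil(probs * k));          % same quantiles on the subset
    err(t) = mean(abs(qSub - qFull));       % discrepancy for this subset
end
avgErr = mean(err);                         % average difference for subset size k
```

Repeating this for several values of k shows how small the subset can get before avgErr exceeds whatever tolerance counts as "close".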

SVM Classification with Cross Validation

I am new to using Matlab and am trying to follow the example in the Bioinformatics Toolbox documentation (SVM Classification with Cross Validation) to handle a classification problem.
However, I am not able to understand Step 9, which says:
Set up a function that takes an input z=[rbf_sigma,boxconstraint], and returns the cross-validation value of exp(z).
The reason to take exp(z) is twofold:
rbf_sigma and boxconstraint must be positive.
You should look at points spaced approximately exponentially apart.
This function handle computes the cross validation at parameters
exp([rbf_sigma,boxconstraint]):
minfn = @(z)crossval('mcr',cdata,grp,'Predfun', ...
    @(xtrain,ytrain,xtest)crossfun(xtrain,ytrain,...
    xtest,exp(z(1)),exp(z(2))),'partition',c);
What is the function that I should be implementing here? Is it exp or minfn? I would appreciate it if you could give me the code for this section. Thanks.
I would also like to know what exp([rbf_sigma,boxconstraint]) means.
rbf_sigma: The SVM is using a Gaussian kernel, and rbf_sigma sets the standard deviation (roughly the size) of the kernel. To understand how kernels work, think of the SVM as placing a kernel around every sample (so that you have a Gaussian around every sample). The kernels are then added up (summed) over the samples of each category/type. At each point, the type whose sum is higher is the "winner". For example, if type A has the higher sum of these kernels at point X, then a new datum at point X will be classified as type A. (There are other configuration parameters that may change the actual threshold at which one category is selected over another.)
Fig.: Look at this figure from the webpage you linked. You can see how, by adding up the Gaussian kernels on the red samples ("sumA") and on the green samples ("sumB"), it is logical that sumA>sumB in the center part of the figure. It is also logical that sumB>sumA in the outer part of the image.
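
The kernel-sum picture can be tried out numerically. The sketch below is a 1-D toy with made-up samples; it follows the simplified intuition given here and is not the actual SVM decision function, which weights the kernels (only the support vectors contribute) and adds a bias term.

```matlab
% Kernel-sum intuition in 1-D with made-up samples.
xA = [-0.5 0 0.4];            % class A samples, in the center
xB = [-3 -2.5 2.8 3.2];       % class B samples, on the outside
rbf_sigma = 1;                % kernel width

% Gaussian kernel centered at c, evaluated at x
kern = @(x, c) exp(-(x - c).^2 / (2 * rbf_sigma^2));
sumA = @(x) sum(kern(x, xA)); % sum of kernels over class A samples
sumB = @(x) sum(kern(x, xB)); % sum of kernels over class B samples

centerIsA = sumA(0) > sumB(0);   % the center is dominated by class A
outerIsB  = sumB(3) > sumA(3);   % the outer region is dominated by class B
```

Making rbf_sigma larger widens every kernel, which smooths the boundary between the two regions; making it smaller lets each sample influence only its immediate neighborhood.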
boxconstraint: this is a cost/penalty on misclassified data. During the training stage, where the training data is used to adjust the SVM parameters, the training algorithm uses an error function to decide how to optimize the SVM parameters iteratively. The cost of a misclassified sample is proportional to how far it is from the boundary at which it would have been classified correctly. In the figure that I am attaching, the boundary is the inner blue circumference.
Taking into account BGreene's comments, and from what I understand of the tutorial:
The tutorial advises trying values for rbf_sigma and boxconstraint that are exponentially spaced. This means that you should compare values like {0.2, 2, 20, ...} (note that this is {2*10^(i-2), i=1,2,3,...}), and NOT values like {0.2, 0.3, 0.4, 0.5} (which are linearly spaced). The point is to cover a wide range of values first. You can refine later, starting from the first optimum you obtained.
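
The two kinds of spacing can be generated like this (an illustrative sketch; the variable names are made up):

```matlab
% Exponentially vs. linearly spaced candidate values (illustrative only).
i         = 1:4;
expSpaced = 2 * 10.^(i - 2);       % 0.2, 2, 20, 200
linSpaced = 0.2 + 0.1 * (i - 1);   % 0.2, 0.3, 0.4, 0.5
zGrid     = log(expSpaced);        % evenly spaced in z, so exp(zGrid) is exponential
```

Values that are evenly spaced in z become exponentially spaced after exp(z), which is why the tutorial searches over z rather than over the parameters directly.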
The command "[searchmin fval] = fminsearch(minfn,randn(2,1),opts)" returns the optimum z, so exp(searchmin) gives the optimum values of rbf_sigma and boxconstraint. You probably have to use exp(z) because it affects how fminsearch steps through z(1) and z(2) while searching for the optimum. I suppose that when you put exp(z(1)) in the definition of minfn, a fixed step in z corresponds to an 'exponentially' large step in the parameter itself.
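
To see the effect of the exp(z) parameterization without the toolbox functions, here is a small sketch in which a hypothetical objective toyErr stands in for the cross-validation error. In the real example minfn would call crossval as in the question; toyErr and its minimum at sigma = 3, C = 0.5 are invented for illustration.

```matlab
% Toy stand-in for the cross-validation error: minimized at sigma = 3, C = 0.5.
% (Hypothetical objective; the real one would call crossval as in the question.)
toyErr = @(sigma, C) (log(sigma) - log(3)).^2 + (log(C) - log(0.5)).^2;

% Search over z, evaluate at exp(z): both parameters stay positive for any z.
minfn = @(z) toyErr(exp(z(1)), exp(z(2)));
opts  = optimset('TolX', 1e-8, 'TolFun', 1e-8);
[zbest, fval] = fminsearch(minfn, [0; 0], opts);
best = exp(zbest);    % convert back to [rbf_sigma; boxconstraint]
```

fminsearch itself is unconstrained, so without the exp wrapper it could propose negative values for either parameter.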
In machine learning, always keep in mind that there are three subsets of your data: training data, cross-validation data, and test data. The training set is used to optimize the parameters of the SVM classifier for EACH value of rbf_sigma and boxconstraint. The cross-validation set is then used to select the optimum values of rbf_sigma and boxconstraint. Finally, the test data is used to get an idea of the performance of your classifier (the performance of the classifier is measured on the test set).
So, if you start with 10000 samples, you may divide the data, for example, as training (50%), cross-validation (25%) and test (25%). You would randomly draw 5000 samples for the training set, then 2500 of the remaining 5000 samples for the cross-validation set, and the last 2500 samples would form the test set.
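
A random 50/25/25 split like the one described can be sketched with randperm (the index variable names are made up for illustration):

```matlab
% Random 50/25/25 split of 10000 sample indices.
n   = 10000;
idx = randperm(n);                 % random ordering of the sample indices
trainIdx = idx(1:5000);            % 50% for training
cvIdx    = idx(5001:7500);         % 25% for cross-validation
testIdx  = idx(7501:end);          % 25% for the final test
```

Each index vector can then be used to pick rows from the data matrix and the label vector, e.g. cdata(trainIdx,:) and grp(trainIdx).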
I hope that I could clarify your doubts. By the way, if you are interested in the optimization of the parameters of classifiers and machine learning algorithms I strongly suggest that you follow this free course -> www.ml-class.org (it is awesome, really).
You need to implement a function called crossfun (see example).
The function handle minfn is passed to fminsearch to be minimized.
exp([rbf_sigma,boxconstraint]) is the quantity being optimized to minimize classification error.
Several functions work together here:
- crossval is producing the classification error based on cross validation using partition c
- crossfun - classifies data using an SVM
- fminsearch - optimizes SVM hyperparameters to minimize classification error
Hope this helps

Fit A Curve to a Histogram

Is there any possibility to fit a curve to that histogram above in Matlab?
The histogram is not normalized or anything like that.
I know that there is a function called histfit, but can I use it here?
Try this FileExchange submission:
ALLFITDIST - Fit all valid parametric probability distributions to data.
--- UPDATE ---
ALLFITDIST is no longer available on the MATLAB File Exchange.
You can try this instead:
FITMETHIS - finds best-fitting distribution to data vector, including non-parametric.
If you know the underlying distribution (e.g. a skewed Gaussian), you can do a maximum likelihood estimate of its parameters yourself and then plot the resulting distribution on top of your histogram. However, you need to normalize the histogram first so that it shows empirical probabilities (densities) instead of raw counts.
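
As a sketch of that procedure in base MATLAB, the example below normalizes a histogram to a density and fits a lognormal by maximum likelihood, which has a closed form for this distribution. The lognormal choice and the synthetic data are assumptions made for illustration; the real data may call for a different family.

```matlab
% Normalize a histogram to a density and fit a lognormal by maximum likelihood.
% Synthetic positive data (assumption: the real data are positive and unimodal).
data = exp(1 + 0.4 * randn(2000, 1));

[counts, centers] = hist(data, 30);            % raw counts per bin
binWidth = centers(2) - centers(1);
density  = counts / (sum(counts) * binWidth);  % bar heights now integrate to 1

% Closed-form lognormal MLE: mean and (1/n) standard deviation of log(data)
mu    = mean(log(data));
sigma = std(log(data), 1);
pdfFit = exp(-(log(centers) - mu).^2 / (2 * sigma^2)) ./ ...
         (centers * sigma * sqrt(2 * pi));     % fitted density at the bin centers

% To overlay: bar(centers, density, 1); hold on; plot(centers, pdfFit, 'r');
```

Without the normalization step the fitted curve and the bars would be on different vertical scales, since the raw counts grow with the sample size and the bin width.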
I think what you want is to fit a distribution, not just any curve, which might not have a finite area under it. The data looks like it is censored on the right tail, but overall it may fit a lognormal or Gamma distribution pretty well. If you have the Statistics Toolbox, try gamfit or lognfit for a start.
See also Kernel density estimation
http://en.wikipedia.org/wiki/Kernel_density