I am trying to use the stats toolbox to fit a distribution function. In my case, I already have the PMF (probability mass function, stored in an array) and wanted to fit it. But apparently, the toolbox can only take a vector of samples, from which a histogram is created.
Is there a way to pass it the PMF instead?
There are many functions in the Stats toolbox that can fit data to a distribution. Say you wanted to fit your data (in an array) to a normal distribution; you'd use normfit like this (example taken from the normfit doc):
data = normrnd(10,2,100,2);
[mu,sigma,muci,sigmaci] = normfit(data)
If the above doesn't answer your question, can you please include a snippet of code that succinctly describes what you are trying to do?
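If all you have is the PMF rather than raw samples, one possible workaround is to treat the PMF values as weights. This is only a sketch, and not something the answer above proposes: the support x and the PMF values are made up for illustration, and because the 'Frequency' option of fitdist expects integer counts, the PMF is scaled by an assumed pseudo-sample size N and rounded, which is an approximation.

```matlab
% Hypothetical PMF defined on the support x (illustrative values only)
x   = (0:20)';
pmf = binopdf(x, 20, 0.4);
pmf = pmf / sum(pmf);                  % make sure the PMF sums to 1

% Option 1: normal-distribution parameters straight from the PMF moments
mu    = sum(x .* pmf);
sigma = sqrt(sum((x - mu).^2 .* pmf));

% Option 2: convert the PMF to integer counts and pass them as frequencies
N      = 1e4;                          % assumed pseudo-sample size
counts = round(N * pmf);
pd     = fitdist(x, 'Normal', 'Frequency', counts);
```

The point estimates from both options should be close to each other; any confidence intervals from option 2 depend on the arbitrary choice of N and should not be taken at face value.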
Related
I have a 115*8000 data where 115 is the number of features. When I use pca function of matlab like this
[coeff,score,latent,tsquared,explained,mu] = pca(data);
on my data, I get some values. I read here how I can reduce my data, but one thing confuses me. The explained output shows how much each feature weighs in the calculation, but do the features get reorganized in this process, or do they stay in exactly the same order as I pass them to the function?
Also I give 115 features but explained shows 114. Why does it happen?
The data is not "reorganized" in PCA; it is transformed to a new space. When you crop the PCA space, that is your data, but you are not going to be able to visualize/understand it there. You need to convert it back to "normal" space, using the eigenvectors.
explained gives you 114 because you know what the answer is with 115! 100% of the data can be explained with the whole data!
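Something else worth checking here is the orientation of the input. pca treats rows as observations and columns as variables, so a 115-by-8000 matrix is read as 115 observations of 8000 variables, and after centering at most 114 components carry any variance. The sketch below uses random data of the same shape as in the question (the data itself is an assumption) to compare both orientations.

```matlab
rng(0);
X = randn(115, 8000);                     % same shape as in the question, synthetic values

[~, ~, ~, ~, explainedRows] = pca(X);     % 115 rows treated as observations
[~, ~, ~, ~, explainedCols] = pca(X');    % transposed: the 115 features are now columns

nRows = numel(explainedRows);             % number of components, rows as observations
nCols = numel(explainedCols);             % number of components, features as columns
```

If the 115 features are meant to be the variables, passing the transposed matrix is probably what was intended.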
Read about it further in this answer: Significance of 99% of variance covered by the first component in PCA
PCA does not "choose" some of your features and remove the rest.
So you should not still be thinking about the original features after running PCA.
It is well-explained here on Wikipedia. You are converting your samples from the space defined by your original features to a space where features are linearly uncorrelated and called "principal components". Note: these components are no longer the original features.
An example of this in 2D could be: you have a vector z=(2,3) defined in your Euclidean space. It needs 2 features (the x and the y). If we change the space and define it using the coordinate vectors v=(2,3) and w an orthogonal vector to v, then z=(1,0) i.e. z=1.v+0.w and can now be represented with only 1 feature (the first coordinate!).
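The change of basis in this 2D example can be checked numerically. This is a small sketch in which w = (-3,2) is one arbitrary choice of a vector orthogonal to v.

```matlab
z = [2; 3];           % the vector from the example, in the usual x/y coordinates
v = [2; 3];           % first new coordinate vector
w = [-3; 2];          % assumed choice of w; orthogonal to v because dot(v, w) is 0

B      = [v, w];      % columns are the new coordinate vectors
coords = B \ z;       % coordinates of z in the basis {v, w}
```

Here coords comes out as (1,0) up to rounding, so only the first coordinate is needed to describe z.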
The link that you shared explains exactly (in the selected answer) how you can go about using the outputs of the pca function to reduce your dimensionality.
(As noted by Ander, you do not care about the last components, since these are the weakest anyway and you want to drop them.)
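As a rough sketch of how the pca outputs can be used for this, the snippet below keeps just enough components to explain 95% of the variance and then maps the reduced data back to the original space. The 95% threshold and the synthetic data are assumptions chosen for illustration.

```matlab
rng(0);
X = randn(200, 5) * randn(5, 10);      % synthetic: 200 observations, 10 correlated variables

[coeff, score, ~, ~, explained, mu] = pca(X);

k        = find(cumsum(explained) >= 95, 1);   % smallest number of components reaching 95%
Xreduced = score(:, 1:k);                      % data expressed in the first k components

% Back to the original variable space (an approximation once components are dropped)
Xapprox  = bsxfun(@plus, Xreduced * coeff(:, 1:k)', mu);
```

Xreduced is what you would feed to a downstream model; Xapprox is only needed if you want to inspect the result in terms of the original variables.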
I am working with image processing in MATLAB. I have two different images whose histogram plots are as shown below.
Image 1: [histogram plot]
Image 2: [histogram plot]
I have multiple images like those, and the only distinguishing (separating) feature is that some have a single peak and others have two peaks.
In other words some can be thresholded (to generate good results) while others cannot. Is there any way I can separate the two images? Are there any functions that do so in MATLAB or any reference code that will help?
The function used is imhist()
If you mean "distinguish" by "separate", then yes: The property you describe is called bimodality, i.e. you have 2 peaks that can be separated by one threshold. So your question is actually "how do I test for an underlying bimodal distribution?"
One option to do this programmatically is Binning. This is not the most robust method but the easiest. It might work, it might not.
Kernel Smoothing is probably the more robust solution. You basically shift and scale a certain function (e.g. Gaussian) to fit the data. This can be done with histfit (using its 'kernel' distribution option) or with ksdensity in MATLAB.
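A minimal sketch of the kernel-smoothing idea, assuming the pixel values are available as a vector: smooth the distribution with ksdensity and count the prominent local maxima. The synthetic pixel data and the prominence threshold (10% of the highest peak) are assumptions that would need tuning on real images, and islocalmax requires a reasonably recent MATLAB release.

```matlab
rng(1);
% Synthetic stand-in for double(img(:)): two clusters of gray levels
pix = [60 + 10*randn(5000, 1); 180 + 12*randn(5000, 1)];
pix = min(max(pix, 0), 255);                   % clip to the 8-bit range

% Kernel density estimate evaluated on the gray-level axis
xi = linspace(0, 255, 256);
f  = ksdensity(pix, xi);

% Count peaks that stand out clearly from their surroundings
isPeak    = islocalmax(f, 'MinProminence', 0.1 * max(f));
numModes  = nnz(isPeak);
isBimodal = (numModes == 2);
```

Images for which isBimodal is true are the candidates for simple thresholding; the others would need a different treatment.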
There are more solutions to this problem, which you can research for yourself now that you know the terms needed. Be aware, though, that your problem is not a trivial one if you want to do it properly.
I have a 40X3249 noisy dataset and 40X1 resultset. I want to perform simple sequential feature selection on it, in Matlab. Matlab example is complicated and I can't follow it. Even a few examples on SoF didn't help. I want to use decision tree as classifier to perform feature selection. Can someone please explain in simple terms.
Also is it a problem that my dataset has very low number of observations compared to the number of features?
I am following this example: Sequential feature selection Matlab and I am getting error like this:
The pooled covariance matrix of TRAINING must be positive definite.
I've explained the error message you're getting in answers to your previous questions.
In general, it is a problem that you have many more variables than samples. This will prevent you using some techniques, such as the discriminant analysis you were attempting, but it's a problem anyway. The fact is that if you have that high a ratio of variables to samples, it is very likely that some combination of variables would perfectly classify your dataset even if they were all random numbers. That's true if you build a single decision tree model, and even more true if you are using a feature selection method to explicitly search through combinations of variables.
I would suggest you try some sort of dimensionality reduction method. If all of your variables are continuous, you could try PCA as suggested by @user1207217. Alternatively you could use a latent variable method for model-building, such as PLS (plsregress in MATLAB).
If you're still intent on using sequential feature selection with a decision tree on this dataset, then you should be able to modify the example in the question you linked to, replacing the call to classify with one to classregtree.
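As a rough sketch of that modification, the snippet below runs sequentialfs with a decision tree as the classifier. It swaps in fitctree, the newer replacement for classregtree, and uses a small synthetic dataset (40 observations, 20 features) rather than the 40-by-3249 data from the question; the fold count is also an assumption.

```matlab
rng(0);
X = randn(40, 20);                               % synthetic: 40 observations, 20 features
y = double(X(:, 3) + 0.5*randn(40, 1) > 0);      % labels driven mostly by feature 3

% Criterion: number of misclassified test observations for a tree
% trained on the training fold (sequentialfs minimizes this)
critfun = @(Xtr, ytr, Xte, yte) sum(yte ~= predict(fitctree(Xtr, ytr), Xte));

cv   = cvpartition(y, 'KFold', 5);               % stratified 5-fold cross-validation
opts = statset('Display', 'off');

[selected, history] = sequentialfs(critfun, X, y, 'cv', cv, 'options', opts);
selectedIdx = find(selected);                    % indices of the chosen features
```

With only 40 observations the selection can change noticeably between different cross-validation partitions, which is the overfitting risk described above.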
This error comes from the use of the classify function in that question, which is performing LDA. This error occurs when the data is rank deficient (or in other words, some features are almost exactly correlated). In order to overcome this, you should project the data down to a lower dimensional subspace. Principal component analysis can do this for you. See here for more details on how to use pca function within statistics toolbox of Matlab.
[basis, ~, latent] = pca(X); % basis: principal directions as column vectors; latent: variance of each component; X holds one observation per row
indices = find(latent > eps(2*max(latent))); % keep components whose variance exceeds machine precision relative to the biggest component, with a little extra tolerance (2x)
new_basis = basis(:, indices); % the relevant components, stored as column vectors
X_new = bsxfun(@minus, X, mean(X, 1))*new_basis; % project the centered data (pca centers X by default) onto the retained components
This should get you automatic projections down into a relevant subspace. Note that your features won't have the same meaning as before, because they will be weighted combinations of the old features.
Extra note: If you don't want to change your feature representation, then instead of classify, you need to use something which works with rank deficient data. You could roll your own version of penalised discriminant analysis (which is quite simple), use support vector machines, or other classification functions which don't break with correlated features as LDA does (by virtue of requiring matrix inversion of the covariance estimate).
EDIT: P.S I haven't tested this, because I have rolled my own version of PCA in Matlab.
I need to construct an interpolating function from a 2D array of data. The reason I need something that returns an actual function is, that I need to be able to evaluate the function as part of an expression that I need to numerically integrate.
For that reason, "interp2" doesn't cut it: it does not return a function.
I could use "TriScatteredInterp", but that's heavy-weight: my grid is equally spaced (and big), so I don't need the Delaunay triangulation.
Are there any alternatives?
(Apologies for the 'late' answer, but I have some suggestions that might help others if the existing answer doesn't help them)
It's not clear from your question how accurate the resulting function needs to be (or how big 'big' is), but one approach that you could adopt is to regress the data points that you have using a least-squares or Kalman filter-based method. You'd need to do this with a number of candidate function forms and then choose the one that is 'best', for example by using a measure such as MAE or MSE.
Of course this requires some idea of what the form of the underlying function could be, but your question isn't clear as to whether you have this kind of information.
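A small sketch of the least-squares variant, assuming two candidate forms (a plane and a full quadratic) and synthetic gridded data; the candidate set and the test surface are illustrative choices, not something taken from the question.

```matlab
% Synthetic gridded data standing in for the 2D array
[X, Y] = meshgrid(linspace(-1, 1, 21));
Z = 1 + 2*X + 3*Y.^2;
x = X(:);  y = Y(:);  z = Z(:);

% Candidate 1: plane. Candidate 2: full quadratic surface.
A1 = [ones(size(x)), x, y];
A2 = [ones(size(x)), x, y, x.^2, x.*y, y.^2];
c1 = A1 \ z;                          % least-squares coefficients
c2 = A2 \ z;

mse1 = mean((A1*c1 - z).^2);          % compare the candidates by MSE
mse2 = mean((A2*c2 - z).^2);

% The better candidate (the quadratic here) as a function handle
f = @(xi, yi) c2(1) + c2(2)*xi + c2(3)*yi + c2(4)*xi.^2 + c2(5)*xi.*yi + c2(6)*yi.^2;
I = integral2(f, -1, 1, -1, 1);       % the handle can go straight into an integrator
```

Because f is an ordinary vectorized function handle, it can also be embedded in a larger expression before integrating.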
Another approach that could work (and requires no knowledge of what the underlying function might be) is the use of the fuzzy transform (F-transform) to generate line segments that provide local approximations to the surface.
The method for this would be:
1. Define a 2D universe that includes the x and y domains of your input data
2. Create a 2D fuzzy partition of this universe, choosing partition sizes that give the accuracy you require
3. Apply the discrete F-transform using your input data to generate fuzzy data points in a 3D fuzzy space
4. Pass the inverse F-transform as a function handle (along with the fuzzy data points) to your integration function
If you're not familiar with the F-transform then I posted a blog a while ago about how the F-transform can be used as a universal approximator in a 1D case: http://iainism-blogism.blogspot.co.uk/2012/01/fuzzy-wuzzy-was.html
To see the mathematics behind the method and extend it to a multidimensional case, the University of Ostrava has published a PhD thesis that explains its application to various engineering problems and also provides an example of how it is constructed for the case of a 2D universe: http://irafm.osu.cz/f/PhD_theses/Stepnicka.pdf
If you want a function handle, why not define f=@(xi,yi)interp2(X,Y,Z,xi,yi) ?
It might be a little slow, but I think it should work.
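A minimal sketch of that idea on a made-up regular grid, with the handle passed to integral2. The grid, the test surface and the integration limits are assumptions, and the limits must stay inside the grid because interp2 returns NaN outside it.

```matlab
% Hypothetical equally spaced grid and sampled values
[X, Y] = meshgrid(0:0.1:1, 0:0.1:2);
Z = X.^2 + Y;

% Wrap interp2 in an anonymous function so it can be used inside expressions
f = @(xi, yi) interp2(X, Y, Z, xi, yi, 'linear');

% The handle accepts the matrix inputs that integral2 passes to it
I = integral2(f, 0, 1, 0, 2);

% It can also be part of a larger integrand
g  = @(xi, yi) f(xi, yi) .* exp(-xi);
Ig = integral2(g, 0, 1, 0, 2);
```

For this surface the exact integral is 8/3, and I lands close to it; the small difference comes from the linear interpolation between grid points.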
If I understand you correctly, you want to perform a surface/line integral of 2-D data. There are ways to do it but maybe not the way you want it. I had the exact same problem and it's annoying! The only way I solved it was using the Surface Fitting Tool (sftool) to create a surface then integrating it.
After you create your fit using the tool (it has a GUI as well), it will generate an sftool object, which you can then integrate (in 2-D) using quad2d.
I also tried your method of using interp2 and got results similar to the sfobject, but I had no idea how to do a numerical integration (line/surface) with the data. Creating the sfobject and then integrating it was much faster.
It was the first time I had done something like this, so I confirmed it using a numerically evaluated line integral. According to Stokes' theorem, the surface integral and the line integral should be the same, and they did turn out to be the same.
I asked this question on the Mathematics Stack Exchange: I wanted to do a line integral of 2-D data, ended up doing a surface integral, and then confirmed the answer using a line integral!
Does MATLAB have a built-in function to find general properties like center of mass & moments of inertia for a polygon defined as a list of (non-integer valued) points?
regionprops performs this task for integer valued points, on the assumption that these represent indices of pixels in an image. But the only functions I can find that treat non integral point lists are polyarea and inpolygon.
My kludge for now is to create a bwconncomp structure with all the points multiplied by some large value (like 10,000), then feeding it in to regionprops, but wondered if there is a more elegant solution.
You should check out the submission POLYGEOM by H.J. Sommer on the MathWorks File Exchange. It looks like it has all the property measurements you want, and nice documentation describing the formulae used in the code.
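For a simple (non-self-intersecting) polygon, these properties can also be computed directly from the vertex list with the standard shoelace-type formulas. The snippet below is a sketch of those textbook formulas, not the POLYGEOM code, and the rectangle is a made-up test case. It returns second moments of area about the coordinate axes, so a mass moment of inertia would still need a density factor.

```matlab
% Vertices of a simple polygon in counter-clockwise order
% (hypothetical 4-by-3 rectangle with one corner at the origin)
x = [0 4 4 0];
y = [0 0 3 3];

% Next vertex along each edge, wrapping around to the first vertex
xn = x([2:end 1]);
yn = y([2:end 1]);

c  = x.*yn - xn.*y;                          % cross term of each edge
A  = sum(c) / 2;                             % signed area (positive for counter-clockwise order)
cx = sum((x + xn).*c) / (6*A);               % centroid, x coordinate
cy = sum((y + yn).*c) / (6*A);               % centroid, y coordinate
Ix = sum((y.^2 + y.*yn + yn.^2).*c) / 12;    % second moment of area about the x-axis
Iy = sum((x.^2 + x.*xn + xn.^2).*c) / 12;    % second moment of area about the y-axis
```

For the rectangle this gives A = 12, a centroid at (2, 1.5), Ix = 36 and Iy = 64, which matches the closed-form values b*h^3/3 and h*b^3/3. With clockwise vertex order the signs of A, Ix and Iy flip.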
I don't know of a function in MATLAB that would do this for you.
However, poly2mask might be of use for you to create the pixel masks to feed into regionprops. I also suggest that, should you decide to go this route, you carefully test how much the discretization affects the results, so that you don't create crazy large arrays (and waste time) for no real gain in accuracy.
One possibility is to farm out the calculations to the Java Topology Suite. I don't know about "moments of inertia", but it does at least have a centroid method.