How to estimate a multi-dimensional probability distribution from data using MATLAB?

Hi, I'm trying to estimate the distribution of some data using MATLAB.
For one-dimensional data, I can use ksdensity.
However, my problem is that I need the multi-dimensional joint distribution and the conditional distributions.
I've tried the KDE toolbox from UCI. It does not work in my case and I cannot figure out why, so I'm asking for another tool I could use.
Edit
The toolbox is not working and gives extreme results. I used 1e5 points, and it might be because the points are too dense.
KDE toolbox result
ksdensity result
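For what it's worth, a joint and conditional density can be sketched without any toolbox using a product-Gaussian kernel. This is only a minimal sketch under stated assumptions: the data are synthetic stand-ins, the variable names (X, fJoint, fCond) are made up for illustration, the bandwidth is Silverman's rule of thumb, and implicit expansion (R2016b or later) is assumed. If you have a recent Statistics and Machine Learning Toolbox, mvksdensity(X, pts, 'Bandwidth', bw) should give a comparable joint estimate.

```matlab
% Product-Gaussian KDE for 2-D data, written with base MATLAB only.
rng(0);                                    % reproducible synthetic data
n = 2000;
X = randn(n, 2) * [1 0.8; 0 0.6];          % correlated 2-D sample (stand-in for real data)
d = size(X, 2);

% Silverman's rule-of-thumb bandwidth, one value per dimension
bw = std(X) * (4 / ((d + 2) * n))^(1 / (d + 4));

% Evaluation grid for each dimension
g1 = linspace(-4, 4, 60);
g2 = linspace(-4, 4, 60);

% 1-D Gaussian kernel matrices: K1(i,k) is the kernel at grid point i for sample k
gauss = @(u) exp(-0.5 * u.^2) / sqrt(2 * pi);
K1 = gauss((g1(:) - X(:, 1)') / bw(1)) / bw(1);   % 60-by-n
K2 = gauss((g2(:) - X(:, 2)') / bw(2)) / bw(2);   % 60-by-n

% Joint density on the grid: fJoint(i,j) = mean over k of K1(i,k)*K2(j,k)
fJoint = (K1 * K2') / n;                   % 60-by-60

% Conditional density p(x2 | x1) = joint / marginal of x1
fX1 = trapz(g2, fJoint, 2);                % integrate out x2, 60-by-1
fCond = fJoint ./ fX1;                     % row i is p(x2 | x1 = g1(i))
```

With 1e5 points the kernel matrices are still only 60-by-1e5, so memory should not be the issue; a fixed bandwidth that is too small for dense data is a more likely cause of spiky estimates.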

Related

Advice on Speeding up SciPy Custom Distribution Sampling & Fitting

I am trying to fit a custom distribution to a large (~O(500,000) measurements) dataset using scipy. I have derived a theoretical PDF based on some other factors, but both by hand and using symbolic integration software I cannot find an exact form of the CDF.
Currently, simply evaluating 1000 random samples from my custom distribution is expensive, which I believe is due to the need to invert an unknown CDF. If I cannot find an explicit form of the CDF and its inverse, is there anything else I can do to speed up usage of this distribution?
I've used Maple, MATLAB and SymPy to try to determine a CDF, yet none gives a result. I also tried down-sampling my data whilst still retaining the tail attributes, but this still required so much data that doing anything with the distribution was slow.
My distribution is a sub-class of SciPy's rv_continuous class.
Thanks for any advice.
It sounds like you want to sample from a kernel density estimate of the probability distribution. While SciPy does offer a Gaussian KDE (scipy.stats.gaussian_kde), for that many measurements you would be much better off using sklearn's implementation. A good resource with code examples can be found on Jake VanderPlas's blog.

Is there any software that implements multiple-output Gaussian processes?

I am trying to implement Bayesian optimization using Gaussian process regression, and I want to try a multiple-output GP first.
There are many packages that implement GPs, such as the fitrgp function in MATLAB and the ooDACE toolbox.
But I haven't found any available software that implements the so-called multiple-output GP, that is, a Gaussian process model that predicts vector-valued functions.
So, is there any software implementing multiple-output Gaussian processes that I can use directly?
I am not sure my answer will help you, since you seem to be looking for MATLAB libraries.
However, you can do co-kriging in R with gstat. See http://www.css.cornell.edu/faculty/dgr2/teach/R/R_ck.pdf or https://github.com/cran/gstat/blob/master/demo/cokriging.R for more details about usage.
The lack of tools for cokriging is partly due to how difficult it is to use. You need more assumptions than for simple kriging: in particular, you must model the dependence between the cokriged outputs via a cross-covariance function (https://stsda.kaust.edu.sa/Documents/2012.AGS.JASA.pdf). The covariance matrix is much bigger, and you still need to make sure that it is positive definite, which can become quite hard depending on your covariance functions.

Obtaining distribution from histogram

I have an array of values, and I plotted a histogram of those values. I want to know which distribution corresponds to the histogram I obtained. How is this possible?
Could you please explain the steps for obtaining an appropriate probability distribution from a histogram?
You would be better off asking this question on stats.stackexchange.com, as it is more about the method than the programming. However, one thing you can do is fit a parametric distribution (using moment matching or maximum likelihood, for example) and then compare the fitted distribution to the normalized histogram using KL divergence or Bhattacharyya distance.
One option might be to use the "Distribution Fitting App" in the Statistics and Machine Learning Toolbox. That should help you evaluate if your data seems like it might have been drawn from some common distributions. You may never know for sure, since multiple distributions could account for the data, but if you have a lot of data it might help you narrow it down.
I think that in many cases an eye-ball comparison is enough. With a reasonable amount of data, it is usually easy to distinguish between, say, a Gaussian and a Weibull.
I would use fitdist or histfit to eye-ball different distributions.
If you have no idea at all about the distribution and you want to know whether two datasets are distributed differently, it could be useful to compare their distributions by estimating them with the 'Kernel' option of fitdist.
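The "fit, then compare against the normalized histogram" suggestion above can be sketched roughly as follows. This is an illustrative sketch only: the data are synthetic, the normal distribution is just one candidate (its maximum-likelihood fit has a closed form, so no toolbox is needed), and the variable names are made up. With the Statistics and Machine Learning Toolbox, fitdist(data, 'Normal') should give the same kind of fit for many other families.

```matlab
rng(1);
data = 2 + 0.5 * randn(5000, 1);            % stand-in for the measured values

% Maximum-likelihood fit of a normal distribution (closed form)
mu = mean(data);
sigma = std(data, 1);                        % normalise by N, the ML estimate

% Normalised histogram: bar heights integrate to 1
[pHist, edges] = histcounts(data, 40, 'Normalization', 'pdf');
centers = (edges(1:end-1) + edges(2:end)) / 2;
binWidth = diff(edges);

% Fitted density evaluated at the bin centres
pFit = exp(-0.5 * ((centers - mu) / sigma).^2) / (sigma * sqrt(2 * pi));

% Discrete KL divergence between empirical and fitted bin probabilities
P = pHist .* binWidth;                       % empirical bin probabilities
Q = pFit .* binWidth;
Q = Q / sum(Q);                              % fitted bin probabilities, renormalised
mask = P > 0;                                % treat 0*log(0) as 0
klDiv = sum(P(mask) .* log(P(mask) ./ Q(mask)));
```

Repeating this for several candidate families and keeping the one with the smallest divergence is one way to narrow things down, although a small value never proves the data came from that distribution.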

Simple Sequential feature selection in Matlab

I have a 40x3249 noisy dataset and a 40x1 result set. I want to perform simple sequential feature selection on it in MATLAB. The MATLAB example is complicated and I can't follow it. Even a few examples on SO didn't help. I want to use a decision tree as the classifier to perform feature selection. Can someone please explain this in simple terms?
Also, is it a problem that my dataset has a very low number of observations compared to the number of features?
I am following this example: Sequential feature selection Matlab, and I am getting an error like this:
The pooled covariance matrix of TRAINING must be positive definite.
I've explained the error message you're getting in answers to your previous questions.
In general, it is a problem that you have many more variables than samples. This will prevent you using some techniques, such as the discriminant analysis you were attempting, but it's a problem anyway. The fact is that if you have that high a ratio of variables to samples, it is very likely that some combination of variables would perfectly classify your dataset even if they were all random numbers. That's true if you build a single decision tree model, and even more true if you are using a feature selection method to explicitly search through combinations of variables.
I would suggest you try some sort of dimensionality reduction method. If all of your variables are continuous, you could try PCA as suggested by @user1207217. Alternatively you could use a latent variable method for model-building, such as PLS (plsregress in MATLAB).
If you're still intent on using sequential feature selection with a decision tree on this dataset, then you should be able to modify the example in the question you linked to, replacing the call to classify with one to classregtree.
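As a rough illustration of that last suggestion, sequential forward selection with a decision tree might look like the sketch below. It assumes the Statistics and Machine Learning Toolbox and uses fitctree, the documented replacement for classregtree in recent releases. The data here are small and synthetic, and the names critFun, cvp and selected are made up for the example.

```matlab
rng(0);
X = randn(40, 20);                           % 40 observations, 20 candidate features (synthetic)
y = double(X(:, 3) + X(:, 7) > 0);           % labels depend on features 3 and 7 only

% Criterion: number of misclassified test observations for a tree trained
% on the candidate feature subset; sequentialfs combines this over the folds.
critFun = @(XT, yT, Xt, yt) sum(yt ~= predict(fitctree(XT, yT), Xt));

cvp = cvpartition(y, 'KFold', 5);            % stratified 5-fold partition
[inmodel, history] = sequentialfs(critFun, X, y, 'cv', cvp);
selected = find(inmodel);                    % column indices of the chosen features
```

With 3249 columns every step trains a tree per remaining feature per fold, so expect this to be slow, and with only 40 samples the selected set is likely to change from one partition to the next.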
This error comes from the use of the classify function in that question, which is performing LDA. This error occurs when the data is rank deficient (or in other words, some features are almost exactly correlated). In order to overcome this, you should project the data down to a lower dimensional subspace. Principal component analysis can do this for you. See here for more details on how to use pca function within statistics toolbox of Matlab.
[basis, scores, latent] = pca(X); % X has one observation per row; columns of basis are the principal directions, latent holds the variance of each component
indices = find(latent > eps(2*max(latent))); % Keep components whose variance is above machine precision relative to the biggest component, with a little extra tolerance (2x)
new_basis = basis(:, indices); % This gets us the relevant components, which are stored in variable "basis" as column vectors
X_new = X*new_basis; % Inner products between the retained basis vectors and the original feature vectors; differs from scores(:, indices) only by a constant offset, because pca centres X
This should get you automatic projections down into a relevant subspace. Note that your features won't have the same meaning as before, because they will be weighted combinations of the old features.
Extra note: If you don't want to change your feature representation, then instead of classify, you need to use something which works with rank deficient data. You could roll your own version of penalised discriminant analysis (which is quite simple), use support vector machines, or other classification functions which don't break with correlated features as LDA does (by virtue of requiring matrix inversion of the covariance estimate).
EDIT: P.S I haven't tested this, because I have rolled my own version of PCA in Matlab.

Can anyone provide a simple MATLAB routine for kernel density estimation?

I am trying to learn kernel density estimation from the basics. If anyone has a simple routine for 1-D KDE, that would be a great help. Thanks.
If you have the Statistics Toolbox in MATLAB, you can use ksdensity to estimate the pdf/cdf using kernel smoothing. Here's an example:
data = [randn(2000,1); 4+randn(2000,1)]; % create a bimodal Gaussian sample
x = linspace(-4,8,1e4); % points at which to evaluate the density
pF = ksdensity(data,x,'function','pdf'); % evaluate the estimated pdf at the points in x
If you plot it, it should look like this
You can also get the cumulative distribution or the inverse cumulative or change the kernel that is used. You can look up the list of options from the link provided. This should help you get started :)
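Since the question asks about learning KDE from the basics, here is a hand-rolled Gaussian version of the same estimate that needs no toolbox. It is only a sketch: the bandwidth uses Silverman's rule of thumb, which is an assumption of this example rather than what ksdensity necessarily picks, and implicit expansion (R2016b or later) is assumed.

```matlab
rng(0);
data = [randn(2000,1); 4 + randn(2000,1)];   % same bimodal sample as above
x = linspace(-4, 8, 500);                     % evaluation points (row vector)
n = numel(data);

% Silverman's rule-of-thumb bandwidth for a Gaussian kernel
h = 1.06 * std(data) * n^(-1/5);

% Scaled distance from every sample (rows) to every evaluation point (columns)
u = (x - data) / h;                           % n-by-500 via implicit expansion

% Average of Gaussian bumps centred at the samples
pF = sum(exp(-0.5 * u.^2), 1) / (n * h * sqrt(2 * pi));
```

Each sample contributes one Gaussian bump of width h, and the estimate is their average. For clearly bimodal data this rule of thumb tends to oversmooth, so trying a few smaller values of h and plotting pF against x is a good way to see what the bandwidth does.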