Advice on Speeding up SciPy Custom Distribution Sampling & Fitting - scipy

I am trying to fit a custom distribution to a large (~O(500,000) measurements) dataset using scipy. I have derived a theoretical PDF based on some other factors, but both by hand and using symbolic integration software I cannot find an exact form of the CDF.
Currently, simply evaluating 1000 random samples from my custom distribution is expensive, which I believe is due to the need to invert an unknown CDF. If I cannot find an explicit form of the CDF and it's inverse, is there anything else I can do to speed up usage of this distribution?
I've used maple, matlab and Sympy to try and determine a CDF, yet none give a result. I also tried down-sampling my data whilst still retaining the tail attributes, but this still required so much data that doing anything with the distribution was slow.
My distribution is a sub-class of SciPy's rv_continuous class.
Thanks for any advice.

This sounds like you want to sample from a Kernel Density Estimation of the probability distribution. While Scipy does offer a Gaussian Kernel package, for that many measurements you would be much better off using sklearn's implementation. A good resource with code examples can be found on Jake VanderPlas's blog.

Related

Why is it important to transform the data into normal / Gaussian distribution when creating a linear regression model

I'm currently building my first regression model, and as we know that, owing to the limitations of the algorithm, we need to remove outliers and transform the distribution into a normal one.
I know that it's important and the ways to do it, but can someone please help me in understanding why exactly we need to do so? Why can't I work with a highly skewed distribution? Why does linear regression mandates this transformation in processing stage?
Classifier with and without Outliers in the data
Hope the above picture clears your doubt.
As LinearRegression model is optimized by passing through a path which has minimum squared error,
due to outliers (which are abnormal data points or noise) our classifier may deviate and work poorly on the test data (much general data).

Are there any softwares that implemented the multiple output gauss process?

I am trying to implement bayesian optimization using gauss process regression, and I want to try the multiple output GP firstly.
There are many softwares that implemented GP, like the fitrgp function in MATLAB and the ooDACE toolbox.
But I didn't find any available softwares that implementd the so called multiple output GP, that is, the Gauss Process Model that predict vector valued functions.
So, Are there any softwares that implemented the multiple output gauss process that I can use directly?
I am not sure my answer will help you as you seem to search matlab libraries.
However, you can do co-kriging in R with gstat. See http://www.css.cornell.edu/faculty/dgr2/teach/R/R_ck.pdf or https://github.com/cran/gstat/blob/master/demo/cokriging.R for more details about usage.
The lack of tools to do cokriging is partly due to the relative difficulty to use it. You need more assumptions than for simple kriging: in particular, modelling the dependence between in of the cokriged outputs via a cross-covariance function (https://stsda.kaust.edu.sa/Documents/2012.AGS.JASA.pdf). The covariance matrix is much bigger and you still need to make sure that it is positive definite, which can become quite hard depending on your covariance functions...

Obtaining distribution from histogram

I have an array of values, with that values I plotted the histogram.I want to know the corresponding distribution from the histogram obtained. How is it possible.
Could you please explain the steps in obtaining appropriate probability distribution from histogram.
You'd better to ask this question in stats.stackexchange.com as it is more about the method than the programming. However, one thing that you can do is to fit a parametric distribution (using moment matching or maximum likelihood for example) then compare the fitted distribution to the normalized histogram using KL divergence or Bhattacharyya distance.
One option might be to use the "Distribution Fitting App" in the Statistics and Machine Learning Toolbox. That should help you evaluate if your data seems like it might have been drawn from some common distributions. You may never know for sure, since multiple distributions could account for the data, but if you have a lot of data it might help you narrow it down.
I think that in many cases an eye-ball comparison is enough. With a reasonable amount of data, it is quite difficult to not be able to distinguish between a gaussian or a weibull or...
I would use fitdist or fithist to eye-ball different distributions.
If you have no idea at all on the distribution and you want to know if two datasets are distributed differently it could be useful to compare their distributions by obtaining them with the option 'kernel'

Linear least-squares fit with constraint - any ideas?

I have a problem where I am fitting a high-order polynomial to (not very) noisy data using linear least squares. Currently I'm using polynomial orders around 15 - 25, which work surprisingly well: The dependence is very nearly linear, but the accuracy of modelling the 'very nearly' is critical. I'm using Matlab's polyfit() function, and (obviously) normalising the x-data. This generally works fine, but I have come across an issue with some recent datasets. The fitted polynomial has extrema within the x-data interval. For the application I'm working on this is a non-no. The polynomial model must have no stationary points over the x-interval.
So I need to add a constraint to the least-squares problem: the derivative of the fitted polynomial must be strictly positive over a known x-range (or strictly negative - this depends on the data but a simple linear fit will quickly tell me which it is.) I have had a quick look at the available optimisation toolbox functions, but I admit I'm at a loss to know how to go about this. Does anyone have any suggestions?
[I appreciate there are probably better models than polynomials for this data, but in the short term it isn't feasible to change the form of the model]
[A closing note: I have finally got the go-ahead to replace this awful polynomial model! I am going to adopt a nonparametric approach, spline smoothing, using the excellent SPLINEFIT code by Jonas Lundgren. This has the advantage that I'm already using a spline model in the end-user application, so I already have C# code available to evaluate a spline model]
You could use cftool and use the exclude data points option.

matlab probability distribution fitting

This might be a silly question! I have a array P which represents the probability distribution of some data e.g. [0;0.3;0.7] How can I determine the type or class of discrete probability distribution of P? The original data is unavailable to me.
dfittool or fitdist requires me to give the data as input, while I already have its probability distribution. Any ideas?
You probably might have seen different probability distributions during lecture or your reading. All you have to do is plotting the given distribution against the candidates. As the distributions itself are parametrized, curve fitting or trial end error come into play. The distribution with the least error, best fit, might be the one you are looking for.
It is not possible to find out a priori what kind of distribution some data (especially with as low n as in your example) is coming from.
If you have an idea of the process that generated your data, you might be able to get an idea of which distributions to test. Maybe your data comes from the family of gamma distributions, maybe your data comes from the family of Weibull distributions etc. Then, you can fit these general distributions and see whether they are likely to simplify to a more common distribution.
For a visual representation of how well your data could approximate a certain distribution, you can use PROBPLOT.
Once you have identified possible distributions, you can fit them to the data and use the Bayesian Information Criterion (BIC) to compare which fit describes the data best. Note that unless you have huge numbers of noise-free data, it is impossible to tell which fit is correct if you have several possible distributions with comparatively low BIC.