Measuring the overlap between two probability distributions

I have many probability distributions, and I need to compute the amount of overlap between two of them. I don't know the type of each distribution in advance, since it depends on the data itself.
Which approach is appropriate in this case?

Related

distance metrics for clustering non-normally distributed data

The dataset I want to cluster consists of ~1000 samples and 10 features, which have different scales and ranges (negative, positive, both). Using scipy.stats.normaltest() I found that none of the features are normally-distributed (all p-values < 1e-4, small enough to reject the null hypothesis that the data are taken from a normal distribution). But all of the distance measures that I'm aware of assume normally-distributed data (I was using Mahalanobis until I realized how non-uniform the data was). What distance measures would one use in this situation? Or is this where one simply has to normalize every feature and hope that that doesn't introduce bias?

Why do you think all distances assume normal data? (Normal is, by the way, not the same as uniform.)
Consider Euclidean distance. In many physical applications this distance makes perfect sense, because it is "as the crow flies". Manhattan distance makes a lot of sense when movement is constrained to two axes that cannot be used at the same time. Both are completely appropriate for non-normally distributed data.
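To make the two distances concrete, here is a minimal sketch in plain Python. The sample values are made up, and the `standardize` helper (z-scoring each feature) is only one possible way to deal with features on different scales, not something the answer above prescribes.

```python
import math
import statistics

def euclidean(a, b):
    """Straight-line ("as the crow flies") distance."""
    return math.dist(a, b)

def manhattan(a, b):
    """Sum of the absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

def standardize(rows):
    """Rescale every feature (column) to zero mean and unit standard deviation."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

# Hypothetical samples: feature 0 spans thousands, feature 1 spans fractions.
samples = [[1000.0, 0.1], [1200.0, 0.9], [3000.0, 0.2]]
scaled = standardize(samples)

d_euclid = euclidean(samples[0], samples[1])    # dominated by feature 0
d_manhattan = manhattan(samples[0], samples[1])
d_scaled = euclidean(scaled[0], scaled[1])      # both features now contribute
```

Neither function makes any assumption about how the features are distributed; the only thing that changes the result here is the relative scale of the features.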

Mutate weights and biases in a neural network through genetic algorithm

I have a genetic algorithm evolving a population of neural networks.
Until now I have mutated weights and biases using random.randn (Python), which draws a random value from a normal distribution with mean = 0.
It works "well" and I managed to complete my project using it, but wouldn't it be better to use a uniform distribution on a given interval?
My intuition is that it would lead to more variety in my networks.

I don't think this question has a simple answer. With a normal distribution, numbers close to the mean are more likely to be drawn by your random number generator, whereas a uniform distribution gives every number in the interval an equal chance. That much is clear, but whether equal chances lead to better results can, in my view, only be answered empirically. So I suggest you run experiments with both the normal and the uniform distribution and judge by the results.
About variety: I assume that you create a random vector that represents the weights, and that at the mutation stage you add a random number to it. With a normal distribution this number is most likely to come from a narrow interval around the mean, so with mean 0 a mutation will, with high probability, change the elements only slightly. The vector therefore improves in small steps, and only occasionally does a big change show up. With a uniform distribution the changes are more random, which leads to more varied individuals. The question is whether these individuals will be better. I don't know, but I can offer another view. I see genetic algorithms as an analogy to the theory of evolution, and from that point of view, accumulating small improvements with a small probability of an occasional big change is more appropriate. Consider what happens with a uniform distribution: the children have low fitness because of the big changes, so they are not selected when the new generation is created, and you may wait a long time for the one small improvement that makes your network work well.
One more thing: your experiments may show that either the uniform or the normal distribution is better, but such a result may hold only for your current problem, not in general.
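The difference in step sizes can be checked with a small experiment. This is only a sketch: the function names, `sigma = 0.1`, and the choice to give both distributions the same variance (a uniform distribution on [-a, a] has variance a²/3) are assumptions made for illustration, and the standard `random` module stands in for whatever generator the project uses.

```python
import math
import random

def mutate_normal(weights, sigma, rng):
    """Add zero-mean Gaussian noise to every weight."""
    return [w + rng.gauss(0.0, sigma) for w in weights]

def mutate_uniform(weights, half_width, rng):
    """Add noise drawn uniformly from [-half_width, half_width]."""
    return [w + rng.uniform(-half_width, half_width) for w in weights]

def fraction_within(steps, limit):
    """Share of mutation steps whose absolute size is at most `limit`."""
    return sum(abs(s) <= limit for s in steps) / len(steps)

rng = random.Random(0)
sigma = 0.1
half_width = sigma * math.sqrt(3.0)  # same variance as the normal noise

# Start from all-zero weights so the mutated values equal the steps taken.
weights = [0.0] * 10000
normal_steps = mutate_normal(weights, sigma, rng)
uniform_steps = mutate_uniform(weights, half_width, rng)

small_normal = fraction_within(normal_steps, 0.5 * sigma)
small_uniform = fraction_within(uniform_steps, 0.5 * sigma)
largest_normal = max(abs(s) for s in normal_steps)
largest_uniform = max(abs(s) for s in uniform_steps)
```

At equal variance the normal noise produces more very small steps and, rarely, steps larger than anything the uniform noise can produce, which matches the "many small changes, occasional big one" picture. Whether that helps fitness still has to be measured on the actual problem.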

Obtaining distribution from histogram

I have an array of values, and I plotted a histogram of them. I want to know which distribution corresponds to the histogram. How can I find out?
Could you please explain the steps for obtaining an appropriate probability distribution from a histogram?

You would do better to ask this question on stats.stackexchange.com, as it is more about the method than the programming. However, one thing you can do is fit a parametric distribution (using moment matching or maximum likelihood, for example) and then compare the fitted distribution to the normalized histogram using KL divergence or Bhattacharyya distance.
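The fit-then-compare idea can be sketched as follows, in plain Python rather than MATLAB. Everything specific here is an assumption for illustration: the candidate is a normal distribution fitted by moment matching, the histogram uses 20 equal-width bins, the fitted distribution is renormalized over the histogram range, and the two datasets are synthetic.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev

def histogram_probs(data, lo, hi, bins):
    """Normalized histogram: fraction of the data in each equal-width bin."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in data:
        i = min(int((x - lo) / width), bins - 1)  # x == hi goes in the last bin
        counts[i] += 1
    return [c / len(data) for c in counts]

def model_probs(dist, lo, hi, bins):
    """Probability the fitted distribution assigns to each bin, renormalized."""
    width = (hi - lo) / bins
    raw = [dist.cdf(lo + (i + 1) * width) - dist.cdf(lo + i * width)
           for i in range(bins)]
    total = sum(raw)
    return [r / total for r in raw]

def kl_divergence(p, q):
    """KL(p || q); empty histogram bins contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def bhattacharyya_distance(p, q):
    """Negative log of the Bhattacharyya coefficient sum(sqrt(p * q))."""
    return -math.log(sum(math.sqrt(pi * qi) for pi, qi in zip(p, q)))

def compare_to_fitted_normal(data, bins=20):
    lo, hi = min(data), max(data)
    fitted = NormalDist(fmean(data), pstdev(data))  # moment matching
    p = histogram_probs(data, lo, hi, bins)
    q = model_probs(fitted, lo, hi, bins)
    return kl_divergence(p, q), bhattacharyya_distance(p, q)

rng = random.Random(1)
normal_data = [rng.gauss(5.0, 2.0) for _ in range(2000)]   # synthetic
flat_data = [rng.uniform(0.0, 10.0) for _ in range(2000)]  # synthetic
kl_normal, bd_normal = compare_to_fitted_normal(normal_data)
kl_flat, bd_flat = compare_to_fitted_normal(flat_data)
```

Both measures come out smaller for the data that really is normal than for the flat data, so repeating this for several candidate families and keeping the smallest value is one way to rank them.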

One option might be to use the "Distribution Fitting App" in the Statistics and Machine Learning Toolbox. That should help you evaluate if your data seems like it might have been drawn from some common distributions. You may never know for sure, since multiple distributions could account for the data, but if you have a lot of data it might help you narrow it down.

I think that in many cases an eyeball comparison is enough. With a reasonable amount of data, it is usually easy to distinguish between a Gaussian, a Weibull, and so on.
I would use fitdist or histfit to eyeball different distributions.
If you have no idea at all what the distribution is and you want to know whether two datasets are distributed differently, it can be useful to compare their distributions by estimating them with the 'kernel' option.
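When nothing is known about the distribution type, the kernel idea also gives a direct overlap number: estimate both densities and integrate the pointwise minimum. The sketch below is plain Python, not the MATLAB 'kernel' option; the Gaussian kernel, the rule-of-thumb bandwidth 1.06·s·n^(-1/5), the midpoint-rule integration, and the synthetic samples are all assumptions.

```python
import math
import random
from statistics import stdev

def gaussian_kde(data):
    """Return a kernel density estimate of `data` and the bandwidth used."""
    n = len(data)
    h = 1.06 * stdev(data) * n ** (-0.2)  # rule-of-thumb bandwidth
    norm = 1.0 / (n * h * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - d) / h) ** 2) for d in data)

    return density, h

def overlap_coefficient(a, b, grid_points=400):
    """Approximate the integral of min(f, g): 0 = disjoint, 1 = identical."""
    f, h_a = gaussian_kde(a)
    g, h_b = gaussian_kde(b)
    pad = 3.0 * max(h_a, h_b)  # leave room for the kernel tails
    lo = min(min(a), min(b)) - pad
    hi = max(max(a), max(b)) + pad
    step = (hi - lo) / grid_points
    mids = (lo + (i + 0.5) * step for i in range(grid_points))  # midpoint rule
    return step * sum(min(f(x), g(x)) for x in mids)

rng = random.Random(0)
sample_a = [rng.gauss(0.0, 1.0) for _ in range(300)]
sample_b = [rng.gauss(0.0, 1.0) for _ in range(300)]   # same distribution
sample_c = [rng.gauss(10.0, 1.0) for _ in range(300)]  # far away

overlap_same = overlap_coefficient(sample_a, sample_b)
overlap_far = overlap_coefficient(sample_a, sample_c)
```

The result depends on the bandwidth, so with small samples it is worth checking how much the number moves when the bandwidth is changed.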

outlier detection based on gaussian mixture model

I have a set of data. I want to build a one-class distribution from that data. Based on the learned distribution, I want to get a probability value for each data instance.
Based on these probability values (by thresholding), I want to build a classifier that decides whether a particular data instance comes from that distribution or not.
Let's say I have data of size 50x100000, where 50 is the dimension of each data instance and 100000 is the number of instances. I am learning a Gaussian mixture model from this data.
When I try to get the probability values for the instances, I get very low values. How can I build a classifier in this case?

I don't think this makes sense. For example, suppose your data is 1 dimensional, and suppose the truth is that it has been sampled from a bimodal distribution. But suppose you haven't worked out that it's from a bimodal distribution and you fit a normal distribution. You'd still have the best possible fit, but it would be the best possible fit to the wrong distribution, and the truth is that none of the points come from that distribution or from any distribution that looks like it.
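On the very low values in the question: in 50 dimensions a density value is tiny even at typical points, so the scores are usually compared on a log scale and thresholded relative to the training scores. The sketch below illustrates only that point. It uses a single Gaussian with a diagonal covariance as a stand-in for the mixture (an assumption made to keep it short), synthetic data, and an arbitrary 1% cut-off. Mixture implementations such as scikit-learn's `GaussianMixture.score_samples` return per-sample log-likelihoods that could be thresholded the same way.

```python
import math
import random
from statistics import fmean, pstdev

def fit_diag_gaussian(rows):
    """Per-feature mean and standard deviation (diagonal covariance)."""
    cols = list(zip(*rows))
    return [fmean(c) for c in cols], [pstdev(c) for c in cols]

def log_density(x, means, sds):
    """Log of the fitted density at x, summed over independent features."""
    return sum(-0.5 * math.log(2.0 * math.pi * s * s) - 0.5 * ((v - m) / s) ** 2
               for v, m, s in zip(x, means, sds))

rng = random.Random(0)
dim = 50
# Synthetic training set: 2000 instances of dimension 50.
train = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(2000)]
means, sds = fit_diag_gaussian(train)

scores = sorted(log_density(x, means, sds) for x in train)
threshold = scores[int(0.01 * len(scores))]  # lowest ~1% of training scores

def is_inlier(x):
    """True if x scores at least as high as the training cut-off."""
    return log_density(x, means, sds) >= threshold

# A synthetic instance whose features are all shifted away from the training data.
shifted = [rng.gauss(4.0, 1.0) for _ in range(dim)]
shifted_is_inlier = is_inlier(shifted)
highest_density = math.exp(max(scores))  # tiny, even for the best-fitting point
```

This does not address the objection in the answer above: if the fitted model is the wrong shape for the data, the threshold is applied to scores from the wrong distribution.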

matlab probability distribution fitting

This might be a silly question! I have an array P which represents the probability distribution of some data, e.g. [0;0.3;0.7]. How can I determine the type or class of discrete probability distribution of P? The original data is unavailable to me.
dfittool and fitdist require me to give the data as input, while I only have its probability distribution. Any ideas?

You have probably seen different probability distributions in lectures or in your reading. All you have to do is plot the given distribution against the candidates. As the distributions themselves are parametrized, curve fitting or trial and error come into play. The distribution with the least error, that is, the best fit, might be the one you are looking for.
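The "least error" comparison can be sketched directly on the probability array. The example below is in plain Python and makes several assumptions: P is taken to describe the outcomes 0, 1 and 2, the only candidates are a binomial distribution with two trials and a discrete uniform distribution, and the parameter is found by a simple grid search over squared error.

```python
import math

P = [0.0, 0.3, 0.7]  # given probabilities, assumed to belong to outcomes 0, 1, 2

def binomial_pmf(n, p):
    """Probabilities of 0..n successes in n trials with success rate p."""
    return [math.comb(n, k) * p ** k * (1.0 - p) ** (n - k) for k in range(n + 1)]

def squared_error(candidate, target):
    """Sum of squared differences between two probability vectors."""
    return sum((c - t) ** 2 for c, t in zip(candidate, target))

# Candidate 1: binomial with n = 2; search p on a grid of 0.000, 0.001, ..., 1.000.
grid = [i / 1000 for i in range(1001)]
best_p = min(grid, key=lambda p: squared_error(binomial_pmf(2, p), P))
binomial_error = squared_error(binomial_pmf(2, best_p), P)

# Candidate 2: discrete uniform on the three outcomes (no free parameter).
uniform_error = squared_error([1.0 / 3.0] * 3, P)
```

With only three probabilities, several families can fit almost equally well, so a small error is a reason to keep a candidate, not proof that it is the right one.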

It is not possible to find out a priori what kind of distribution some data (especially with as low n as in your example) is coming from.
If you have an idea of the process that generated your data, you might be able to get an idea of which distributions to test. Maybe your data comes from the family of gamma distributions, maybe your data comes from the family of Weibull distributions etc. Then, you can fit these general distributions and see whether they are likely to simplify to a more common distribution.
For a visual representation of how well your data could approximate a certain distribution, you can use PROBPLOT.
Once you have identified possible distributions, you can fit them to the data and use the Bayesian Information Criterion (BIC) to compare which fit describes the data best. Note that unless you have a huge amount of noise-free data, it is impossible to tell which fit is correct if several candidate distributions have comparably low BIC values.
