pooled mean and standard deviation - matlab

I have data obtained from different patient groups (biosignals from which I want to obtain the power value). I want to have the overall standard deviation and mean for ALL the patients. So I suppose what I am looking for is the pooled versions obtained from the individual standard deviations and means.
This is a lot of data in table format.
I was trying to avoid writing code for this.
The MATLAB command grpstats seems quite good, but apparently it only deals with per-group standard deviations and means. Am I correct?
ttest2.m seems to deal only with the pooled std.
Any help with this?
Many thanks in advance!
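For reference, the combination itself is only a few lines once the per-group statistics are available. A minimal sketch, assuming n, m and s are vectors holding the group sizes, group means and group sample standard deviations (grpstats can return all three via {'numel','mean','std'}):

    % Combine per-group stats into the overall (pooled) mean and std.
    % n, m, s: group sizes, group means, group sample std devs (normalized by n-1).
    N = sum(n);
    M = sum(n .* m) / N;                                   % overall mean
    % Overall sample variance = within-group + between-group contributions
    v = (sum((n - 1) .* s.^2) + sum(n .* (m - M).^2)) / (N - 1);
    S = sqrt(v);                                           % overall standard deviation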

Related

Obtaining distribution from histogram

I have an array of values, and from those values I plotted a histogram. I want to know the corresponding distribution from the histogram I obtained. How is this possible?
Could you please explain the steps for obtaining an appropriate probability distribution from a histogram?
You'd be better off asking this question on stats.stackexchange.com, as it is more about the method than the programming. However, one thing you can do is fit a parametric distribution (using moment matching or maximum likelihood, for example) and then compare the fitted distribution to the normalized histogram using KL divergence or the Bhattacharyya distance.
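A minimal sketch of that comparison, assuming x is the raw data vector and a normal fit as the candidate:

    % Fit a normal by maximum likelihood, then compute KL(histogram || fit).
    pd = fitdist(x, 'Normal');                        % ML fit
    [counts, edges] = histcounts(x, 'Normalization', 'probability');
    centers = (edges(1:end-1) + edges(2:end)) / 2;
    q = pdf(pd, centers) .* diff(edges);              % model probability per bin
    p = counts;
    valid = p > 0 & q > 0;                            % avoid log(0)
    kl = sum(p(valid) .* log(p(valid) ./ q(valid)));  % smaller = closer fit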
One option might be to use the "Distribution Fitting App" in the Statistics and Machine Learning Toolbox. That should help you evaluate if your data seems like it might have been drawn from some common distributions. You may never know for sure, since multiple distributions could account for the data, but if you have a lot of data it might help you narrow it down.
I think that in many cases an eyeball comparison is enough. With a reasonable amount of data, it is usually not difficult to distinguish between, say, a Gaussian and a Weibull.
I would use fitdist or histfit to eyeball different distributions.
If you have no idea at all about the distribution and you want to know whether two datasets are distributed differently, it can be useful to compare their distributions by estimating both with fitdist's 'Kernel' option.
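A minimal sketch of that kernel-based comparison, assuming x1 and x2 are column vectors of observations:

    % Nonparametric (kernel) density estimates for two datasets, overlaid.
    pd1 = fitdist(x1, 'Kernel');
    pd2 = fitdist(x2, 'Kernel');
    xi = linspace(min([x1; x2]), max([x1; x2]), 200);
    plot(xi, pdf(pd1, xi), xi, pdf(pd2, xi));
    legend('dataset 1', 'dataset 2');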

Creating a stochastic time-series with given parameters

I would like to create a tool for generating a stochastic time series, for which I can provide the parameters: the mean and standard deviation (as for a normal distribution), plus the skewness and kurtosis. There is a similar question here using R, but I am not able to interpret it and translate it to MATLAB.
Is there something that someone knows can do this already? (I haven't been able to find anything)
If not, what would be some good advice for starting something of my own? Any known useful functions? I would also like to be able to build upon it afterwards, for example: adding outliers, clusters of volatility, adjusting heteroscedasticity.
I realise that saying 'stochastic' and then 'given parameters' in the same sentence may seem odd, but it isn't: I want each time point to be random, but the parameters to describe, say, 10,000 time points.
If you're looking for the equivalent of the solution in R, MATLAB's Statistics Toolbox has limited support for the Johnson and Pearson distribution systems. In particular, the johnsrnd function produces random variates for the Johnson system, while pearsrnd, for the Pearson system, takes the moments directly.
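For example (the moment values here are just placeholders):

    % Draw 10,000 points from the Pearson system matching four moments.
    mu = 0; sigma = 1; skew = 0.5; kurt = 4;                % placeholder values
    [r, type] = pearsrnd(mu, sigma, skew, kurt, 10000, 1);
    % 'type' reports which member of the Pearson system the moments selected.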
A big caveat: using moments to describe, fit, or produce random variates (often referred to as moment matching) is not robust and is poorly regarded by statisticians. A finite set of moments is not guaranteed to uniquely define a distribution; for that you need the entire moment generating function.

K-means analysis on the KDD Cup 99 dataset

What kind of knowledge/ inference can be made from k means clustering analysis of KDDcup99 dataset?
We plotted some graphs using MATLAB; they look like this:
Experiment 1: Plot of dst_host_count vs serror_rate
Experiment 2: Plot of srv_count vs srv_serror_rate
Experiment 3: Plot of count vs serror_rate
I just extracted some features from the KDD Cup dataset and plotted them.
The main problem I am facing is that, due to a lack of domain knowledge, I can't determine what inferences can be drawn from these graphs. Also, if I have chosen the wrong axes, what would be the correct features to choose?
I have very little time to complete this, so I don't understand the background very well.
Any help interpreting these graphs would be appreciated.
What kind of unsupervised learning can be done using this data and these plots?
Just to give you some domain knowledge: the KDD Cup dataset contains information about different aspects of network connections. Each sample contains 'connection duration', 'protocol used', 'source/destination byte size' and many other features that describe one connection. Now, some of these connections are malicious. The malicious samples have their own unique 'fingerprint' (a unique combination of feature values) that separates them from the good ones.
What kind of knowledge/ inference can be made from k means clustering analysis of KDDcup99 dataset?
You can try k-means clustering to initially separate the normal and bad connections. Also, the bad connections themselves fall into 4 main categories. So you can try k = 5, where one cluster will capture the good connections and the other 4 the 4 malicious categories. Look at the first section of the tasks page for details.
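A minimal sketch, assuming X is an n-by-d numeric feature matrix (categorical features such as the protocol would need to be encoded or dropped first):

    % k-means with k = 5 on standardized features.
    Xz = zscore(X);                              % so no feature dominates by scale
    [idx, C] = kmeans(Xz, 5, 'Replicates', 10);  % 10 restarts to avoid poor local minima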
You can also check whether some dimensions in your dataset are highly correlated. If so, you can use something like PCA to reduce the dimensionality. Look at the full list of features. After PCA, your data will have a simpler representation (with fewer dimensions) and might give better performance.
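A sketch of that reduction, under the same assumption about X:

    % PCA on standardized features, keeping ~95% of the variance.
    [coeff, score, ~, ~, explained] = pca(zscore(X));
    k = find(cumsum(explained) >= 95, 1);   % number of components to keep
    Xreduced = score(:, 1:k);               % lower-dimensional representation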
What should be the correct chosen feature?
This is hard to tell. The data is currently very high-dimensional, so I don't think visualizing 2 or 3 of the dimensions in a graph will give you a good heuristic for which dimensions to choose. I would suggest the following (see the sketch after this list):
Use all the dimensions for training and testing the model. This gives you a measure of the best achievable performance.
Then try removing one dimension at a time to see how much the performance is affected. For example, if you remove the dimension 'srv_serror_rate' from your data and the model performance stays almost the same, then you know this dimension is not giving you any important information about the problem at hand.
Repeat step two until you can't find any dimension that can be removed without hurting performance.
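A sketch of that loop; evalModel is a hypothetical placeholder for whatever scores your model (e.g. accuracy on a held-out set):

    % Leave-one-feature-out ablation.
    baseline = evalModel(X);                 % score with all features
    d = size(X, 2);
    score = zeros(1, d);
    for j = 1:d
        Xred = X(:, [1:j-1, j+1:d]);         % drop feature j
        score(j) = evalModel(Xred);
    end
    % Features whose removal barely changes the score are candidates to drop.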

fit data (measurements) with numerical datasets

I have some data for which I have a set of numerically determined model curves. Now I would like to find the one with the smallest least-squares deviation; I only need to vary one parameter, namely the amplitude of the model curves.
I have used fitting with analytic functions before, but I did not find a way to handle this kind of problem.
Is there a solution?
Thanks a lot!
One of the optimization functions should do the trick. You can also read the section on optimization in the manual. Without any specifics on the data or the model you wish to match, it's hard to recommend anything more concrete. For example, if your cost function has many maxima and minima, or is not differentiable, you'll have to choose one of the more expensive routines.
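That said, because the model is linear in the amplitude, the least-squares optimum has a closed form, so you may not need an optimizer at all. A minimal sketch, assuming y is the data vector and M is a matrix whose columns are the model curves sampled at the same points:

    % Best amplitude per model curve, then pick the curve with least deviation.
    nCurves = size(M, 2);
    ssr = zeros(1, nCurves);
    for j = 1:nCurves
        m = M(:, j);
        a = (m' * y) / (m' * m);      % least-squares amplitude for curve j
        ssr(j) = norm(y - a * m)^2;   % sum of squared residuals
    end
    [~, best] = min(ssr);             % index of the best-fitting model curve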

matlab probability distribution fitting

This might be a silly question! I have an array P which represents the probability distribution of some data, e.g. [0; 0.3; 0.7]. How can I determine the type or class of the discrete probability distribution of P? The original data is unavailable to me.
dfittool and fitdist require the raw data as input, whereas I only have its probability distribution. Any ideas?
You have probably seen different probability distributions in lectures or in your reading. All you have to do is plot the given distribution against the candidates. As the distributions themselves are parametrized, curve fitting or trial and error comes into play. The distribution with the least error, i.e. the best fit, might be the one you are looking for.
It is not possible to find out a priori what kind of distribution some data comes from (especially with an n as low as in your example).
If you have an idea of the process that generated your data, you might be able to get an idea of which distributions to test. Maybe your data comes from the family of gamma distributions, maybe your data comes from the family of Weibull distributions etc. Then, you can fit these general distributions and see whether they are likely to simplify to a more common distribution.
For a visual representation of how well your data could approximate a certain distribution, you can use PROBPLOT.
Once you have identified possible distributions, you can fit them to the data and use the Bayesian Information Criterion (BIC) to compare which fit describes the data best. Note that unless you have huge amounts of noise-free data, it is impossible to tell which fit is correct if several candidate distributions have comparably low BIC.
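A sketch of that comparison, assuming the raw observations x are available (with only the probability vector P, fitting is not possible) and that x is positive where a candidate family requires it:

    % Fit candidate families and compare by BIC (lower is better).
    candidates = {'Normal', 'Gamma', 'Weibull', 'Lognormal'};
    bic = zeros(size(candidates));
    for i = 1:numel(candidates)
        pd = fitdist(x, candidates{i});
        bic(i) = 2 * negloglik(pd) + pd.NumParameters * log(numel(x));
    end
    [~, best] = min(bic);
    fprintf('Best candidate: %s\n', candidates{best});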