Can't calculate confidence intervals on a pooled dataset after MICE imputation - confidence-interval

I performed MICE imputation on my dataset:
LWP.imputeX<- mice(databasename, m=5, maxit=50, seed=500)
Following this I ran a linear regression on the pooled dataset:
PBMI_ModelX <- with(data=LWP.imputeX, exp = lm(SDS_BMI_y5_IOTF ~ Paternal_reported_bmi_BL))
summary(PBMI_ModelX)
Pool_PBMI_ModelX <- pool(PBMI_ModelX)
summary(Pool_PBMI_ModelX)
Although I generated the summary statistics for this analysis, when I try to get the 95% confidence intervals for the pooled results I receive the following error message:
confint(Pool_PBMI_ModelX)
Error in t.test.default(x, y = y, alternative = alternative, mu = mu, :
not enough 'x' observations
I'm not sure why the confidence intervals won't work for me using the 'summary' or 'confint' commands, as in my complete-cases analysis I can generate them this way.
Any assistance would be appreciated, as would any recommendations for good references/sources on running MICE imputation aimed at people from a non-stats background. Thanks.

Related

Get Sobol indices error from kriging with Openturns

I'm looking for a Python package which computes the Sobol' indices from a kriging model and which provides a confidence interval on these indices that takes into account both the meta-model and the Monte-Carlo errors. Is it possible to do this with OpenTURNS?
Thanks
It is indeed possible to compute Sobol' indices from a Kriging model with confidence intervals that take into account Kriging uncertainty in addition to the uncertainty of Sobol' indices estimators.
The trick is to sample from trajectories of the conditional Gaussian process. Assuming you have a KrigingResult object kri_res, and also that you have obtained an inputDesign from a SobolIndicesExperiment with given size, you can build the ConditionalGaussianProcess with:
import openturns as ot
conditional_gp = ot.ConditionalGaussianProcess(kri_res, ot.Mesh(inputDesign))
And then sample output designs corresponding to N different trajectories of the conditional Gaussian process:
outputDesigns = conditional_gp.getSample(N)
Then you can get the distribution of the estimator of the (here first order) Sobol' indices for each trajectory:
distributions = []
for i in range(N):
    algo = ot.SaltelliSensitivityAlgorithm(inputDesign, outputDesigns[i], size)
    dist = algo.getFirstOrderIndicesDistribution()
    distributions.append(dist)
In order to average out Kriging uncertainty, you can build the mixture of the distributions and an associated confidence interval:
mixture = ot.Mixture(distributions)
ci = mixture.computeBilateralConfidenceInterval(0.9)
Be careful, what you get this way is a domain which contains 90% of the probability mass of the joint distribution. If you want to get confidence intervals marginal by marginal, then you need to do:
intervals = []
for j in range(mixture.getDimension()):
    marginal = mixture.getMarginal(j)
    ci = marginal.computeBilateralConfidenceInterval(0.9)
    intervals.append(ci)

How to pull out amplitude information from complex output of cpsd - MATLAB

I have two data sets that I want to analyze using a cross power spectral density plot in MATLAB with the function cpsd. With the complex output of cpsd, I was wondering how I can get amplitude information out of it. I know I can get phase info by angle(Pxy) but I don't know how to pull the amplitude information. Thanks
I think what you are looking for is abs(Pxy). According to the documentation, if Pxy = x + i*y, then:
abs(Pxy) = sqrt(x^2 + y^2) = sqrt(real(Pxy)^2 + imag(Pxy)^2)
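For example, a minimal sketch (the signals, sampling rate, and window settings here are illustrative, not from the question):

fs = 1000;                                    % sampling frequency in Hz (assumed)
t  = (0:fs-1)/fs;                             % one second of data
x  = sin(2*pi*50*t) + 0.1*randn(size(t));     % two example signals
y  = sin(2*pi*50*t + pi/4) + 0.1*randn(size(t));
[Pxy, f] = cpsd(x, y, [], [], [], fs);        % cross power spectral density, default window settings
amplitude = abs(Pxy);                         % magnitude of the CPSD at each frequency
phase     = angle(Pxy);                       % phase of the CPSD at each frequency
plot(f, amplitude); xlabel('Frequency (Hz)'); ylabel('|P_{xy}|');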
Edit:
In light of your comment, though, you are looking for the time-domain magnitude (not the frequency-domain magnitude, which is what the above gives you). This thread from the Signal Processing Stack Exchange may be of some help. It looks like the averaging that cpsd performs removes the time-domain information from the signal.

Naïve Bayes Classifier -- is normalization necessary?

We recently studied the Naïve Bayesian Classifier in our Machine Learning class and now I'm trying to implement it on the Fisher Iris dataset as a self-exercise. The concept is easy and straightforward, with some trickiness involved for continuous attributes. I read up on several literature resources which recommended using a Gaussian approximation to compute the probability of test data values, so that is what I'm using in my code.
Now I'm trying to run it initially for 50% training and 50% test data samples, but something is missing. The current code is always predicting class 1 (I used integers to represent the classes) for all test samples, which is obviously wrong.
My guess is that the problem may be due to normalization being omitted by the code, though I think adding normalization would still yield proportionate results, and so far my attempts to normalize have produced the same classification results.
Can someone please suggest if there is anything obvious missing here? Or if I'm not approaching this right? Since most of the code is 'mechanics', I have made prominent (****************) the 2 lines that are responsible for the calculations. Any help is appreciated, thanks!
nsamples = 75; % 50% samples
% acquire training set and test set
[trainingSample, idx] = datasample(data, nsamples, 'Replace', false);
testData = data(setdiff(1:150, idx), :);
% define Gaussian function
%***********************************************************%
Phi = @(mu,sig2,x) (1/sqrt(2*pi*sig2))*exp(-((x-mu)^2)/2*sig2);
%***********************************************************%
for c = 1:3                              % for 3 classes in training set
    clear y x mu sig2;
    index = 1;
    for i = 1:length(trainingSample)
        if trainingSample(i,5) == c
            y(index,:) = trainingSample(i,:); % filter current class samples
            index = index + 1;                % for conditional probabilities
        end
    end
    for j = 1:size(testData,1)           % iterate over test samples
        clear pf p;
        for i = 1:4                      % iterate over columns
            x = testData(j,i);           % representing attributes
            mu = mean(y(:,i));
            sig2 = var(y(:,i));
            pf(i) = Phi(mu,sig2,x);      % calc conditional probability
        end
        % calc class likelihood; prior * posterior
        %*****************************************************%
        pc(j,c) = size(y,1)/nsamples * pf(1)*pf(2)*pf(3)*pf(4);
        %*****************************************************%
    end
end
% find the predicted class for each test sample
% by taking the max probability calculated
for i = 1:size(pc,1)
    [~,q] = max(pc(i,:));
    predicted(i) = q;
    actual(i) = testData(i,5);
end
Normalization shouldn't be necessary since the features are only compared to each other.
p(class|thing) ∝ p(class) p(thing|class) = p(class) p(feature_1|class) p(feature_2|class) ... p(feature_N|class)
So when fitting the parameters of the distribution feature_i|class, rescaling the data just rescales the fitted parameters (mu, sigma2) to the new scale; the relative probabilities across classes, and therefore the predicted class, remain the same.
It's hard to read the MATLAB code due to a lot of indexing and splitting of training/testing etc., which is a possible source of problems.
You should try something with a lot less unnecessary stuff around it (I would recommend Python with scikit-learn, for example; it has a lot of helpers for splitting data and such: http://scikit-learn.org/).
It's really important that you separate the training and test data, and only train the model with training data and test the trained model with the test data. (Is this done?)
The next step is to check the parameters, which is easiest done by either printing them out (sanity check) or,
for each feature, plotting the fitted Gaussian bell next to a histogram of the data to see that they match (remember that each histogram bar must have height number_of_samples_within_range/total_number_of_samples).
Visualising the data and the model is really important to know what is happening.
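For instance, a rough MATLAB sketch of that sanity check for one class and one feature might look like this (here y is the per-class training matrix from the question's loop and column i is the feature you want to inspect):

i = 1;                                        % feature (column) to inspect
mu   = mean(y(:,i));
sig2 = var(y(:,i));
histogram(y(:,i), 'Normalization', 'pdf');    % density-normalized histogram, directly comparable to a pdf
hold on;
xs = linspace(min(y(:,i)), max(y(:,i)), 200); % grid over the data range
plot(xs, (1./sqrt(2*pi*sig2)) .* exp(-((xs-mu).^2) ./ (2*sig2)), 'LineWidth', 2); % fitted Gaussian pdf
hold off;
legend('data histogram', 'fitted Gaussian');

If the curve and the histogram are wildly different for some feature, that points to a problem in how the parameters are being computed.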

Spectrogram from Complex Morlet wavelet (recreating article result from data)

I am trying to recreate the following results:
from the following data:
https://www.dropbox.com/s/mi3szqgzgku29rn/FS40.dat
The time is in milliseconds (the sampling frequency is 40000 Hz).
The article states that they used a complex Morlet wavelet to create the spectrogram:
"Power estimates from the averaged LFPs were calculated from time–frequency spectrograms of the data from 1–88 Hz by convolving the signals with a complex Morlet wavelet of the form w(t, f0) = A exp(−t^2 / (2σ^2)) exp(2iπf0 t) for each frequency of interest f0, where σ = m/(2πf0), and i is the imaginary unit. The normalization factor was A = 1/(σ(2π)^0.5), and the constant m defining the compromise between time and frequency resolution was 7."
I only managed to get some "good" results using the spectrogram function in MATLAB,
but I don't have much idea of how to use the complex Morlet wavelet.
I got bad results when trying to use cwt with the 'morl' wavelet.
Thank You.
P.S.
I'm trying to recreate this article:
Computational modeling of distinct neocortical oscillations driven by cell-type selective optogenetic drive: separable resonant circuits controlled by low-threshold spiking and fast-spiking interneurons.
Well, I managed to find this MATLAB package:
https://sites.google.com/site/rwfwuwx/Home/mfeeg
It does almost the same operation as was done in the paper.
The related files in the package are mf_tfcm.m and mf_cmorlet.m.
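For reference, a bare-bones MATLAB sketch of the convolution described in the quoted passage might look like the following (assuming x is the LFP signal as a row vector sampled at fs = 40000 Hz; the variable names and the naive edge handling are illustrative, not taken from the paper or the package):

fs = 40000;                          % sampling rate from the question (Hz)
freqs = 1:88;                        % frequencies of interest (Hz)
m = 7;                               % time/frequency trade-off constant from the paper
P = zeros(numel(freqs), numel(x));   % power at each frequency and time point
for k = 1:numel(freqs)
    f0    = freqs(k);
    sigma = m / (2*pi*f0);
    A     = 1 / (sigma*sqrt(2*pi));
    t     = -4*sigma : 1/fs : 4*sigma;                          % support of the wavelet
    w     = A * exp(-t.^2 ./ (2*sigma^2)) .* exp(2i*pi*f0*t);   % complex Morlet wavelet
    c     = conv(x, w, 'same');                                 % convolve signal with wavelet
    P(k,:) = abs(c).^2;                                         % power estimate
end
imagesc((0:numel(x)-1)/fs, freqs, P); axis xy;
xlabel('Time (s)'); ylabel('Frequency (Hz)');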

Calculating confidence intervals for a non-normal distribution

First, I should specify that my knowledge of statistics is fairly limited, so please forgive me if my question seems trivial or perhaps doesn't even make sense.
I have data that doesn't appear to be normally distributed. Typically, when I plot confidence intervals, I would use the mean +/- 2 standard deviations, but I don't think that is acceptable for a non-normal distribution. My sample size is currently 1000 samples, which would seem like enough to determine whether or not the distribution is normal.
I use Matlab for all my processing, so are there any functions in Matlab that would make it easy to calculate the confidence intervals (say 95%)?
I know there are the 'quantile' and 'prctile' functions, but I'm not sure if that's what I need to use. The function 'mle' also returns confidence intervals for normally distributed data, although you can also supply your own pdf.
Could I use ksdensity to create a pdf for my data, then feed that pdf into the mle function to give me confidence intervals?
Also, how would I go about determining whether my data is normally distributed? I can currently tell just by looking at the histogram or the pdf from ksdensity, but is there a way to measure it quantitatively?
Thanks!
So there are a couple of questions there. Here are some suggestions:
You are right that the mean of 1000 samples should be normally distributed (unless your data is "heavy tailed", which I'm assuming is not the case). To get a (1 - alpha) confidence interval for the mean (in your case alpha = 0.05) you can use the 'norminv' function. For example, say we want a 95% CI for the mean of a sample of data X; then we can type
N = 1000; % sample size
X = exprnd(3,N,1); % sample from a non-normal distribution
mu = mean(X); % sample mean (normally distributed)
sig = std(X)/sqrt(N); % estimated standard error of the mean
alphao2 = .05/2; % alpha over 2
CI = [mu + norminv(alphao2)*sig ,...
mu - norminv(alphao2)*sig ]
CI =
2.9369 3.3126
Testing whether a data sample is normally distributed can be done in a lot of ways. One simple method is with a QQ plot. To do this, use 'qqplot(X)' where X is your data sample. If the result is approximately a straight line, the sample is normal. If the result is not a straight line, the sample is not normal.
For example if X = exprnd(3,1000,1) as above, the sample is non-normal and the qqplot is very non-linear:
X = exprnd(3,1000,1);
qqplot(X);
On the other hand if the data is normal the qqplot will give a straight line:
qqplot(randn(1000,1))
You might also consider using bootstrapping, with the bootci function.
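For example, a bootstrap confidence interval for the mean of a data vector X could be obtained with something like this (2000 resamples chosen arbitrarily; requires the Statistics Toolbox):

ci = bootci(2000, @mean, X)   % 95% bootstrap confidence interval for the mean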
You may use the method proposed in [1]:
median +/- 1.7 * (1.25*R) / (1.35*sqrt(N))
where R is the interquartile range and N is the sample size.
This is often used in notched box plots, a useful data visualization for non-normal data. If the notches of two medians do not overlap, the medians are, approximately, significantly different at about a 95% confidence level.
[1] McGill, R., J. W. Tukey, and W. A. Larsen. "Variations of Boxplots." The American Statistician. Vol. 32, No. 1, 1978, pp. 12–16.
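A quick MATLAB sketch of this rule for a data vector X (iqr is from the Statistics Toolbox):

N = numel(X);
notch = 1.7 * (1.25*iqr(X)) / (1.35*sqrt(N));
ci = [median(X) - notch, median(X) + notch]   % approximate 95% CI for the median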
Are you sure you need confidence intervals or just the 90% range of the random data?
If you need the latter, I suggest you use prctile(). For example, if you have a vector holding independent identically distributed samples of random variables, you can get some useful information by running
y = prctile(x, [5 50 95])
This will return in [y(1), y(3)] the range where 90% of your samples occur. And in y(2) you get the median of the sample.
Try the following example (using a normally distributed variable):
t = 0:99;
tt = repmat(t, 1000, 1);
x = randn(1000, 100) .* tt + tt; % simple gaussian model with varying mean and variance
y = prctile(x, [5 50 95]);
plot(t, y);
legend('5%','50%','95%')
I have not used MATLAB, but from my understanding of statistics, if your distribution cannot be assumed to be a normal distribution, then you can treat it as a Student's t distribution and calculate the confidence interval and accuracy from that.
http://www.stat.yale.edu/Courses/1997-98/101/confint.htm
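A t-based interval for the mean of a sample X might look like this in MATLAB (a sketch, assuming tinv from the Statistics Toolbox and a 95% level):

N  = numel(X);
se = std(X)/sqrt(N);                          % estimated standard error of the mean
ci = mean(X) + tinv([0.025 0.975], N-1)*se    % 95% CI using the Student t distribution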