Matlab confidence intervals from normfit/fitdist/paramci differ from a manual calculation depending on the number of elements. Why?

I am testing some confidence interval calculations; however, I have noticed differences between the Matlab functions normfit/fitdist/paramci and a manual calculation. Please have a look at the code below and test it with more data elements: as the data size increases, the differences get smaller. Does anyone have a clue/solution/explanation?
Thanks
Will
%% Cleaning service
clear all; close all;
%% Data and processing
conf = norminv([0.025 0.975],0,1); % for 95%
data = normrnd(0.158,0.0265,10,1); % Change the third argument to 100, 1000, 10000, ...
[mu,sigma,muci,sigmaci] = normfit(data,.05); % for 95%
pd = fitdist(data,'Normal'); ci = paramci(pd,'Alpha',.05); % for 95%
xplus = mu + conf(2)*sigma*(1/sqrt(length(data)));
xminus = mu - conf(2)*sigma*(1/sqrt(length(data)));
Difference = [ci(1,1)-xminus ci(2,1)-xplus]

A "typical" confidence interval for a mean will actually use critical values from the t-distribution, not the normal - this means slightly wider intervals, and noticeably wider at small sample sizes. As the sample size increases, the t critical values converge to the normal critical values. I'm not a Matlab programmer these days, but I'd expect a canned function such as paramci to use the t-distribution rather than the normal.
So, this work is in R, not Matlab, but I'm hoping that I'll produce some numbers you'll recognize. Let's say for a sample of size n=10, mean=5, sd=2...
n <- 10
mn <- 5
sd <- 2
A 95% CI using normal critical values would be constructed "by hand" like so:
mn + qnorm(c(0.025, 0.975))*sd/sqrt(n)
# 3.76041 6.23959
and a 95% CI using t critical values like so:
mn + qt(c(0.025, 0.975), n-1)*sd/sqrt(n)
# 3.569286 6.430714
# ...note slightly wider
At n=500, the two become essentially indistinguishable.
n <- 500
mn + qnorm(c(0.025, 0.975))*sd/sqrt(n)
# 4.824695 5.175305
mn + qt(c(0.025, 0.975), n-1)*sd/sqrt(n)
# 4.824269 5.175731
These are all manual calculations that I'm hoping will match what Matlab does in a similar scenario. If not ... then I can always retract my answer ;)
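For reference, the matching check in Matlab would be built the same way (a minimal sketch, assuming the Statistics Toolbox tinv function and the mu, sigma and data variables from the question's code):
n = length(data);
tcrit = tinv([0.025 0.975], n-1);        % t critical values with n-1 degrees of freedom
muci_manual = mu + tcrit*sigma/sqrt(n);  % should line up with the muci returned by normfit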

Related

Matlab codes for making scales

I've extracted certain data from an Excel file.
It involves two columns: one for certain periods and another for the corresponding daily prices. The following is my code (t1 and t2 are user inputs).
row_1 = find(period==t1)
row_2 = find(period==t2)
f_0 = period(row_1:row_2, 1)
f_1 = price(row_1:row_2 , 1)
y_1 = plot(handles.axes2, f_0, f_1)
f_0: period (x-axis), f_1: price (y-axis)
My goal is to express the trend of price fluctuations using sounds, so the approach I came up with is as follows.
Step 1: Find the maximum and minimum value of the price over the given period. Step 2: Divide the distance between these two points into eight sections. Step 3: Allocate eight musical notes (C D E F G A B C) to the eight sections and play them.
So far I have managed to find the min/max values of the given period, but from the next stage on I can't come up with any ideas.
Please help me with any advice.
If I understand you correctly, you want to allocate eight musical notes to the divided sections, and the code below may help.
%% let's play some music~
clc; clear;
%% Set the Sampling frequency & time period
fs=44100;
t=0:1/fs:0.5;
%% eight musical scales
Cscale{1}=sin(2*pi*262*t); %c-do
Cscale{2}=sin(2*pi*294*t); %c-re
Cscale{3}=sin(2*pi*330*t); %c-mi
Cscale{4}=sin(2*pi*349*t); %c-fa
Cscale{5}=sin(2*pi*392*t); %c-so
Cscale{6}=sin(2*pi*440*t); %c-la
Cscale{7}=sin(2*pi*494*t); %c-ti
Cscale{8}=sin(2*pi*523*t); %c-do-high
%you could call "sound(Cscale{i},fs)" to play each scale
%% Divide the distance between these two points
% the highest point needs special treatment
Min_p=0;
Max_p=8;
Sample_p=[0 1 2 3 4 5 6 7 8];
for i=1:length(Sample_p)
    S_p=Sample_p(i);
    if (S_p == Max_p)
        sound(Cscale{end},fs);
    else
        %Find the correct musical note and play it
        sound(Cscale{1+floor(8*(Sample_p(i)-Min_p)/(Max_p-Min_p))},fs);
    end
    pause(0.5)
end
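To connect this to your price data, here is a rough sketch (it reuses f_1 from your code and the Cscale/fs defined above, and assumes the maximum price is strictly larger than the minimum):
p_min = min(f_1); p_max = max(f_1);
for k = 1:length(f_1)
    idx = 1 + floor(7*(f_1(k)-p_min)/(p_max-p_min)); % map each price into note 1..8
    sound(Cscale{idx}, fs);
    pause(0.5)
end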
Here is what I looked at (you may need Google Translate because it is written in Chinese):
http://blog.csdn.net/weaponsun/article/details/46695255

How to identify an optimal subsample from a data set with missing values in MATLAB

I would like to identify the largest possible contiguous subsample of a large data set. My data set consists of roughly 15,000 financial time series of up to 360 periods in length. I have imported the data into MATLAB as a 360 by 15,000 numerical matrix.
This matrix contains a lot of NaNs due to some of the financial data not being available for the entire period. In the illustration, NaN entries are shown in dark blue, and non-NaN entries appear in light blue. It is these light blue non-NaN entries which I would like to ideally combine into an optimal subsample.
I would like to find the largest possible contiguous block of data that is contained in my matrix, while ensuring that my matrix contains a sufficient number of periods.
In a first step I would like to sort my matrix from left to right in descending order by the number of non-NaN entries in each column, that is, I would like to sort by the vector obtained by entering sum(~isnan(data),1).
In a second step I would like to find the sub-array of my data matrix that is at least 72 entries along the first dimension and is otherwise as large as possible, measured by the total number of entries.
What is the best way to implement this?
A big warning (may or may not apply depending on context)
As Oleg mentioned, when an observation is missing from a financial time series, it's often missing for a reason: e.g. the entity went bankrupt, the entity was delisted, or the instrument did not trade (i.e. it was illiquid). Constructing a sample without NaNs is likely equivalent to constructing a sample where none of these events occur!
For example, if this were hedge fund return data, selecting a sample without NaNs would exclude funds that blew up and ceased trading. Excluding imploded funds would bias estimates of expected returns upwards and estimates of variance or covariance downwards.
Picking a sample period with the fewest time series containing NaNs would also exclude periods like the 2008 financial crisis, which may or may not make sense. Excluding 2008 could lead to an underestimate of how haywire things can get (though including it could lead to overestimating the probability of certain rare events).
Some things to do:
Pick a sample period as long as possible but be aware of the limitations.
Do your best to handle survivorship bias: eg. if NaNs represent delisting events, try to get some kind of delisting return.
You almost certainly will have an unbalanced panel with missing observations, and your algorithm will have to deal with that.
Another general finance / panel data point: selecting a sample at some time point t and then following it into the future is perfectly OK, but selecting a sample based upon what happens during or after the sample period can be incredibly misleading.
Code that does what you asked:
This should do what you asked and be quite fast. Be aware of the problems above, though, if whether an observation is missing is not random and orthogonal to what you care about.
Inputs are a T by n sized matrix X:
T = 360; % number of time periods (i.e. rows) in X
n = 15000; % number of time series (i.e. columns) in X
T_subsample = 72; % desired length of sample (i.e. rows of newX)
% number of possible starting points for series of length T_subsample
nancount_periods = T - T_subsample + 1;
nancount = zeros(n, nancount_periods, 'int32'); % will hold a count of NaNs
X_isnan = int32(isnan(X));
nancount(:,1) = sum(X_isnan(1:T_subsample, :))'; % initialize with the count for the first window
% We need to obtain a count of nans in T_subsample sized window for each
% possible time period
j = 1;
for i=T_subsample + 1:T
    % One pass: add new period in the window and subtract period no longer in the window
    nancount(:,j+1) = nancount(:,j) + X_isnan(i,:)' - X_isnan(j,:)';
    j = j + 1;
end
indicator = nancount==0; % indicator of whether starting_period, series
% has no NaNs
% number of nonan series of length T_subsample by starting period
max_subsample_size_by_starting_period = sum(indicator);
max_subsample_size = max(max_subsample_size_by_starting_period);
% find the best starting period
starting_period = find(max_subsample_size_by_starting_period==max_subsample_size, 1);
ending_period = starting_period + T_subsample - 1;
columns_mask = indicator(:,starting_period);
columns = find(columns_mask); %holds the column ids we are using
newX = X(starting_period:ending_period, columns_mask);
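To try the block above, you could feed it synthetic data first (a sketch; the 5% missing rate is just an illustrative assumption):
X = randn(360, 15000);
X(rand(size(X)) < 0.05) = NaN;  % knock out roughly 5% of the entries at random
% ...then run the code above and inspect the result:
size(newX)                      % 72 rows by however many NaN-free series were found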
Here's an idea,
Assuming you can rearrange the series, calculate the pairwise distance between them (you decide the metric, but if you are only looking at NaN vs. non-NaN, Hamming distance is OK).
Now hierarchically cluster the series and rearrange them using either a dendrogram
or http://www.mathworks.com/help/bioinfo/examples/working-with-the-clustergram-function.html
You should probably prune any series that doesn't have a minimum number of non-NaN values before you start.
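A minimal sketch of that idea (assuming the Statistics Toolbox functions pdist, linkage and dendrogram, and the T-by-n data matrix X from above; for 15,000 series the distance vector gets large, so pruning first helps):
nan_pattern = double(isnan(X))';         % one row per series, 1 where data is missing
D = pdist(nan_pattern, 'hamming');       % pairwise Hamming distance between NaN patterns
Z = linkage(D, 'average');               % hierarchical clustering
[~, ~, leaf_order] = dendrogram(Z, 0);   % leaf order given by the dendrogram
X_rearranged = X(:, leaf_order);         % series with similar NaN patterns end up adjacent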
First, I have only a little insight into financial mathematics. My understanding is that you want to find the longest continuous chain of non-NaN values for each time series, sort the time series by the length of this chain, and discard every time series that does not contain a chain above a threshold. This can be done using:
data = rand(360,15e3);
data(abs(data) <= 0.02) = NaN;
%% sort and chop data based on amount of consecutive non-NaN values
binary_data = ~isnan(data);
% find edges, denote their type and calculate the biggest chunk in each
% column
edges = [2*binary_data(1,:)-1; diff(binary_data, 1)];
chunk_size = diff(find(edges));
chunk_size(end+1) = numel(edges)-sum(chunk_size);
[row, ~, id] = find(edges);
num_row_elements = diff(find(row == 1));
num_row_elements(end+1) = numel(chunk_size) - sum(num_row_elements);
%a chunk of NaN has a -1 in id, a chunk of non-NaN a 1
chunks_per_row = mat2cell(chunk_size .* id,num_row_elements,1);
% sort by largest consecutive block of non-NaNs
max_size = cellfun(@max, chunks_per_row);
[max_size_sorted, idx] = sort(max_size, 'descend');
data_sorted = data(:,idx);
% remove all elements that only have block sizes smaller then some number
some_number = 20;
data_sort_chop = data_sorted(:,max_size_sorted >= some_number);
Note that this can be done a lot simpler if the order of periods within a time series doesn't matter, i.e. data([1 2 3],id) and data([3 1 2],id) are treated as identical.
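For illustration, the simpler version would presumably just rank the series by their total number of non-NaN periods (a sketch reusing data and some_number from above):
nonnan_count = sum(~isnan(data), 1);                  % non-NaN periods per series
[count_sorted, idx] = sort(nonnan_count, 'descend');
data_sorted = data(:, idx);
data_sort_chop = data_sorted(:, count_sorted >= some_number);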
What I do not know is whether you want to discard all periods within a time series that don't belong to the biggest chunk, extract all those chains as individual time series, ...
Feel free to drop a comment if it has to be more specific.

Trying to produce exponential traffic

I'm trying to simulate an optical network algorithm in MATLAB for a homework project. Most of it is already done, but I have an issue with the diagrams I'm getting.
In the simulation I'm generating exponential traffic; however, for low lambda values (0.1) I'm getting very high packet drop rates (99%). I wrote a sample here which is very close to the testbench I'm running on my simulator.
% Run the simulation 10 times, with different lambda values
l = [1 2 3 4 5 6 7 8 9 10];
for i=l(1):l(end)
    X = rand();
    % In the 'real' simulation the following line defines the time
    % when the next packet generation event will occur. Suppose that
    % i is the current time
    t_poiss = i + ceil((-log(X)/(i/10)));
    distr(i)=t_poiss;
end
figure, plot(distr)
axis square
grid on;
title('Exponential test:')
The resulting image is
The diagram I'm getting in this sample is IDENTICAL to the diagram I'm getting for the drop rate vs. λ. So I would like to ask: am I doing something wrong, or am I missing something? Is this the right thing to expect?
The problem might be a numerical one. Since you are generating a random number for X, the number might be incredibly small - say, close to zero. If X is numerically close to zero, -log(X) is going to be HUGE, so your calculation of t_poiss will be huge. I would suggest doing something like X = rand() + 1 to make sure that X is never close to zero.
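If you would rather keep X inside (0,1), one alternative way to implement that guard is to clamp the draw away from zero (a sketch, not from the original answer):
X = max(rand(), eps);                % eps keeps -log(X) finite
t_poiss = i + ceil(-log(X)/(i/10));  % same update as in the question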

Matlab fast neighborhood operation

I have a problem. I have a matrix A with integer values between 0 and 5,
for example:
x=randi(5,10,10)
Now I want to apply a filter of size 3x3 which gives me the most common value in each neighborhood.
I have tried 2 solutions:
fun = @(z) mode(z(:));
y1 = nlfilter(x,[3 3],fun);
which takes very long...
and
y2 = colfilt(x,[3 3],'sliding',@mode);
which also takes long.
I have some really big matrices and both solutions take a long time.
Is there any faster way?
+1 to @Floris for the excellent suggestion to use hist. It's very fast. You can do a bit better though: hist is based on histc, which can be used instead. histc is a compiled function, i.e., not written in Matlab, which is why the solution is much faster.
Here's a small function that attempts to generalize what @Floris did (that solution also returns a vector rather than the desired matrix) and achieve what you're doing with nlfilter and colfilt. It doesn't require that the input have particular dimensions and uses im2col to efficiently rearrange the data. In fact, the first three lines and the call to im2col are virtually identical to what colfilt does in your case.
function a=intmodefilt(a,nhood)
    [ma,na] = size(a);
    % zero-pad so that sliding nhood-sized blocks cover every pixel of a
    aa(ma+nhood(1)-1,na+nhood(2)-1) = 0;
    aa(floor((nhood(1)-1)/2)+(1:ma),floor((nhood(2)-1)/2)+(1:na)) = a;
    % histogram each sliding block over the integer range and take the mode
    [~,a(:)] = max(histc(im2col(aa,nhood,'sliding'),min(a(:))-1:max(a(:))));
    a = a-1;
Usage:
x = randi(5,10,10);
y3 = intmodefilt(x,[3 3]);
For large arrays, this is over 75 times faster than colfilt on my machine. Replacing hist with histc is responsible for a factor of two speedup. There is of course no input checking so the function assumes that a is all integers, etc.
Lastly, note that randi(IMAX,N,N) returns values in the range 1:IMAX, not 0:IMAX as you seem to state.
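If you do need values in 0:5 rather than 1:5, randi also accepts a range (a quick sketch):
x = randi([0 5], 10, 10);  % uniform integers from 0 to 5 inclusive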
One suggestion would be to reshape your array so each 3x3 block becomes a column vector. If your initial array dimensions are divisible by 3, this is simple; if they aren't, you need to work a little bit harder. You also need to repeat this nine times, starting at different offsets into the matrix - I will leave that as an exercise.
Here is some code that shows the basic idea (using only functions available in FreeMat - I don't have Matlab on my machine at home...):
N = 100;
A = randi(0,5*ones(3*N,3*N));
B = reshape(permute(reshape(A,[3 N 3 N]),[1 3 2 4]), [ 9 N*N]);
hh = hist(B, 0:5); % histogram of each 3x3 block: bin with largest value is the mode
[mm mi] = max(hh); % mi will contain bin with largest value
figure; hist(B(:),0:5); title 'histogram of B'; % flat, as expected
figure; hist(mi-1, 0:5); title 'histogram of mi' % not flat?...
Here are the plots:
The strange thing, when you run this code, is that the distribution of mi is not flat, but skewed towards smaller values. When you inspect the histograms, you will see that this is because you will frequently have more than one bin with the "max" value in it; in that case, you get the first bin with the max count. This is obviously going to skew your results badly; something to think about. A much better filter might be a median filter - the one that has equal numbers of neighboring pixels above and below. That has a unique solution (while the mode can have up to four values for nine pixels - namely, four bins with two values each).
Something to think about.
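For comparison, a 3x3 median filter is a one-liner if the Image Processing Toolbox is available (a sketch):
y_med = medfilt2(x, [3 3]);  % median of each 3x3 neighborhood, zero-padded at the borders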
Can't show you a mex example today (wrong computer); but there are ample good examples on the Mathworks website (and all over the web) that are quite easy to follow. See for example http://www.shawnlankton.com/2008/03/getting-started-with-mex-a-short-tutorial/

How can I find low regions in a graph using Perl/R?

I'm examining some biological data which is basically a long list (a few million values) of integers, each saying how well this position in the genome is covered. Here is a graphical example for a data set:
I would like to look for "valleys" in this data, that is, regions which are significantly lower than their surrounding environment.
Note that the size of the valleys I'm looking for is not really known - it may range from 50 bases to a few thousands. Defining what is a valley is of course one of the questions I'm struggling with, but the previous examples are relatively easy for me:
What kind of paradigms would you recommend using to find those valleys? I mainly program using Perl and R.
Thanks!
We do peak detection (and valley detection) using running medians and median absolute deviation. You can specify how much deviation from the running median means a peak.
In the next step, we use a binomial model to check which regions contain more "extreme" values than can be expected. This model (basically a score test) results in "peak regions" instead of single peaks. Turning it around to get "valley regions" is trivial.
The running median is calculated using the function weightedMedian from the package aroma.light. We use the embed() function to make a list of "windows" and apply a kernel function on it.
The application of the weighted median:
center <- apply(embed(tmp,wdw),1,weightedMedian,w=weights,na.rm=T)
Here tmp is the temporary data vector and wdw the window size (which has to be odd). tmp is constructed by adding (wdw-1)/2 NA values on each side of the data vector. The weights are constructed using a customized function. For the MAD we use the same procedure, but on diff(data) instead of the data itself.
Running sample code:
require(aroma.light)
# make.weights : function to make weights on basis of a normal distribution
# n is window size !!!!!!
make.weights <- function(n,
                         type=c("gaussian","epanechnikov","biweight","triweight","cosinus")){
    type <- match.arg(type)
    x <- seq(-1,1,length.out=n)
    out <- switch(type,
                  gaussian=(1/sqrt(2*pi)*exp(-0.5*(3*x)^2)),
                  epanechnikov=0.75*(1-x^2),
                  biweight=15/16*(1-x^2)^2,
                  triweight=35/32*(1-x^2)^3,
                  cosinus=pi/4*cos(x*pi/2)
    )
    out <- out/sum(out)*n
    return(out)
}
# score.test : function to become a p-value based on the score test
# uses normal approximation, but is still quite correct when p0 is
# pretty small.
# This test is one-sided, and tests whether the observed proportion
# is bigger than the hypothesized proportion
score.test <- function(x,p0,w){
    n <- length(x)
    if(missing(w)) w <- rep(1,n)
    w <- w[!is.na(x)]
    x <- x[!is.na(x)]
    if(sum(w)!=n) w <- w/sum(w)*n
    phat <- sum(x*w)/n
    z <- (phat-p0)/sqrt(p0*(1-p0)/n)
    p <- 1-pnorm(z)
    return(p)
}
# embed.na is a modification of embed, adding NA strings
# to the beginning and end of x. window size= 2n+1
embed.na <- function(x,n){
    extra <- rep(NA,n)
    x <- c(extra,x,extra)
    out <- embed(x,2*n+1)
    return(out)
}
# running.score : function to calculate the weighted p-value for the chance of being in
# a run of peaks. This chance is based on the weighted proportion of the neighbourhood
# the null hypothesis is calculated by taking the weighted proportion
# of detected peaks in the whole dataset.
# This lessens the need for adjusting parameters and makes the
# method more automatic.
# for a correct calculation, the weights have to sum up to n
running.score <- function(sel,n=20,w,p0){
    if(missing(w)) w <- rep(1,2*n+1)
    if(missing(p0)) p0 <- sum(sel,na.rm=T)/length(sel[!is.na(sel)]) # null hypothesis
    out <- apply(embed.na(sel,n),1,score.test,p0=p0,w=w)
    return(out)
}
# running.med : function to calculate the running median and mad
# for a dataset. Window size = 2n+1
running.med <- function(x,w,n,cte=1.4826){
    wdw <- 2*n+1
    if(missing(w)) w <- rep(1,wdw)
    center <- apply(embed.na(x,n),1,weightedMedian,w=w,na.rm=T)
    mad <- median(abs(x-center))*cte
    return(list(med=center,mad=mad))
}
##############################################
#
# Create series
set.seed(100)
n = 1000
series <- diffinv(rnorm(20000),lag=1)
peaks <- apply(embed.na(series,n),1,function(x) x[n+1] < quantile(x,probs=0.05,na.rm=T))
pweight <- make.weights(0.2*n+1)
p.val <- running.score(peaks,n=n/10,w=pweight)
plot(series,type="l")
points((1:length(series))[p.val<0.05],series[p.val<0.05],col="red")
points((1:length(series))[peaks],series[peaks],col="blue")
The sample code above was developed to find regions with big fluctuations rather than valleys. I adapted it a bit, but it's not optimal. On top of that, for series longer than 20,000 values you need a whole lot of memory; I can't run it on my computer any more.
Alternatively, you could work with an approximation of the numerical derivative and second derivative to define valleys. In your case, this might even work better. A pragmatic way of calculating the derivatives and the minima/maxima of the first derivative:
#first derivative
f.deriv <- diff(lowess(series,f=n/length(series),delta=1)$y)
#second derivative
f.sec.deriv <- diff(f.deriv)
#minima and maxima defined by where f.sec.deriv changes sign :
minmax <- cumsum(rle(sign(f.sec.deriv))$lengths)
op <- par(mfrow=c(2,1))
plot(series,type="l")
plot(f.deriv,type="l")
points((1:length(f.deriv))[minmax],f.deriv[minmax],col="red")
par(op)
You can define a valley by different criteria:
depth
width
volume (depth*width)
You might also have a valley inside a big mountain; do you want these too?
For example, there is a valley here: 1 2 3 4 1000 1000 800 800 800 1000 1000 500 200 3
Try to explain in more detail how YOU (or any expert in your field) would choose the valleys given the data.
You might want to look at the watershed transform.
You might want to try the peak detection function to identify the regions of interest. The desired minimum width of the valleys can be specified with the span parameter.
It might be a good idea to smooth the data first, to get rid of the noise peaks like the one in the right "valley" of the blue graph. A simple stats::filter should be enough.
The final step would be to check the depth of the found "valleys". This really depends on your requirements. As a first approximation, you can simply compare the peak value with the median level of the data.