Why is the confidence interval not consistent with the standard errors in this regression? - linear-regression

I am running a linear regression with fixed effect and standard errors clustered by a certain group.
areg ref1 ref1_l1 rf1 ew1 vol_ew1 sk_ew1, a(us_id) vce(cluster us_id)
The one-line command is as above.
In the output, the t statistics and the P-values look inconsistent: how can we have a t statistic > 5 and a P-value > 11%? Similarly, the 95% confidence intervals appear to be far wider than coefficient +/- 2 standard errors.
What am I missing?

There is nothing inconsistent here. You have a small sample size and a less-than-parsimonious model, and you have all but run out of degrees of freedom. Notice how areg won't post an F statistic or a P-value for the model, a strong danger sign. Your t statistics are consistent with checks by hand:
. display 2 * ttail(1, 5.54)
.11368912
. display 2 * ttail(1, 113.1)
.00562868
In short, there is no bug here and no programming issue; it's just a matter of your model over-fitting your data and the side effects of that.
Similarly, the +/- 2 SE rule of thumb for a 95% confidence interval is far off here: with only 1 residual degree of freedom the relevant t multiplier is about 12.7, not roughly 2. Again, a hand calculation is instructive:
. display invt(1, 0.975)
12.706205
. display invt(60, 0.975)
2.0002978
. display invt(61, 0.975)
1.9996236
. display invnormal(0.975)
1.959964
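If it helps to see the same interval arithmetic in MATLAB, here is a minimal sketch with made-up coefficient and standard-error values (purely illustrative; tinv needs the Statistics Toolbox):
% hypothetical coefficient and clustered standard error, purely for illustration
b  = 0.85;
se = 0.15;
% 95% CI half-width = t multiplier * SE; the multiplier depends on the residual df
hw_df1  = tinv(0.975, 1)  * se;    % multiplier ~12.71 with 1 df: a very wide interval
hw_df60 = tinv(0.975, 60) * se;    % multiplier ~2.00 with 60 df: close to the +/- 2 SE rule
ci_df1  = [b - hw_df1,  b + hw_df1]
ci_df60 = [b - hw_df60, b + hw_df60]
With 1 residual degree of freedom the interval is roughly 6.5 times wider than the +/- 2 SE rule of thumb would suggest.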

Related

temporal autocorrelation and perfect fit in glmm

I am having some trouble with temporal autocorrelation and the way to implement it in a GLMM.
Here (https://cran.r-project.org/web/packages/glmmTMB/vignettes/covstruct.html), with an autoregressive process (AR1), they fit time series with one point per time step.
However, when I do that I get a surprisingly high R², with simulated data and with empirical data, for example:
> library(glmmTMB)
> library(MASS)
> library(data.table)
> myfile <- "https://raw.githubusercontent.com/f-duchenne/data_for_test/master/data_for_test.txt"
> dat <- fread(myfile)
> dat$Annee <- numFactor(dat$Annee)
> model <- glmmTMB(pheno_derivs2 ~ ratio + urban + ar1(Annee + 0 | species), data=dat)
> MuMIn::r.squaredGLMM(model)
             R2m R2c
[1,] 0.002107069   1
> 1 - var(residuals(model)) / var(dat$pheno_derivs2)
[1] 1
> cor(dat$pheno_derivs2, predict(model))^2
[1] 1
How is it possible that the model fits the data perfectly while the fixed effects are far from overfitting?
Is it normal that including the temporal autocorrelation process gives such an R² and an almost perfect fit? (It is largely due to the random part; the fixed part often explains only a small share of the variance in my data.) Is the model still interpretable?

MATLAB Simple - Linear Predictive Coding and Energy Forecasting

I have a dataset with 274 samples (9 months) of the daily energy (watt-hours) used in a residential household. I'm not sure if I'm applying the lpc function correctly.
My code is the following:
filename='9-months.csv';
energy = csvread(filename);
C=zeros(5,1);
counter=0;
N=3;
for n=274:-1:31
w2=energy(1:n-1,1);
a=lpc(w2,N);
energy_estimated=0;
for X = 1:N
energy_estimated = energy_estimated + (-a(X+1)*energy(n-X));
end
w_real=energy(n);
error2=abs(w_real-energy_estimated);
counter=counter+1;
C(counter,1)=error2;
end
mean_error=round(mean(C));
Being "n" the sample on analysis, I will use the energy array's values, from 1 to n-1, to calculate the lpc coefficientes (with N=3).
After that, it will apply the calculated coefficients on the "for" cycle presented, in order to calculate the estimated energy.
Finally, error2 outputs the error between the real energy and estimated value.
On the example presented ( http://www.mathworks.com/help/signal/ref/lpc.html ) some filters are used. Do I need to apply any filter to it? Is my methodology correct?
Thank you very much in advance!
The lpc call seems to be used correctly, but there are a few other things about your code. I am addressing the part starting at the "for n" loop:
estimates = zeros(274,1);                  % preallocate the forecasts
for n=31:274                               % it seems more logical to go forward in time
    w2 = energy(1:n-1,1);                  % the history known before day n
    a = lpc(w2,N);
    pred = filter(-a(2:end),1,w2);         % same calculation as your X=1:N loop, for the whole history
    estimates(n) = pred(end);              % the last value is the one-step-ahead forecast of energy(n)
end
err = energy(31:274) - estimates(31:274);  % all errors in one vectorized step
meanerror = mean(err);                     % you don't really need to round mean errors
filter does exactly what you are trying to do with the X=1:N loop, but it performs the calculation for the entire w2 vector. If you just want the last value, take it with the (end) index as well.
There is no reason to calculate the error for every single value inside the loop and append it to a vector; you can do that faster in one vectorized step after the loop.
If you're trying to estimate future values with an LPC it could work like that, but you are implying that every value depends only on the last 3 values. Have you tried something like a polynomial approach? I would think that would be closer to reality.
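To illustrate the polynomial idea, something like this could be a starting point (the cubic degree and the variable names are just assumptions, not tested on your data):
% one-step-ahead forecast from a cubic trend fitted to the history
filename = '9-months.csv';
energy = csvread(filename);                  % 274x1 column of daily watt-hours
estimates = zeros(274,1);
for n = 31:274
    t = (1:n-1)';                            % time indices of the known history
    [p,~,mu] = polyfit(t, energy(1:n-1), 3); % centred/scaled fit avoids conditioning warnings
    estimates(n) = polyval(p, n, [], mu);    % extrapolate the trend to day n
end
err = energy(31:274) - estimates(31:274);
mean_abs_error = mean(abs(err));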

Johansen test on two stocks (for pairs trading) yielding weird results

I hope you can help me with this one.
I am using cointegration to discover potential pairs trading opportunities within stocks and more precisely I am utilizing the Johansen trace test for only two stocks at a time.
I have several securities, but for each test I only test two at a time.
If two stocks are found to be cointegrated using the Johansen test, the idea is to define the spread as
beta' * p(t-1) - c
where beta'=[1 beta2] and p(t-1) is the (2x1) vector of the previous stock prices. Notice that I seek a normalized first coefficient of the cointegration vector. c is a constant which is allowed within the cointegration relationship.
I am using Matlab to run the tests (jcitest), but have also tried Eviews for comparison. The two programs yield the same results.
When I run the test and find two stocks to be cointegrated, I usually get output like
beta_1 = 12.7290
beta_2 = -35.9655
c = 121.3422
Since I want a normalized first beta coefficient, I set beta1 = 1 and obtain
beta_2 = -35.9655/12.7290 = -2.8255
c =121.3422/12.7290 = 9.5327
I can then generate the spread as beta' * p(t-1) - c. When the spread gets sufficiently low, I buy 1 share of stock 1 and short beta_2 shares of stock 2 and vice versa when the spread gets high.
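To make the normalization and spread construction concrete, in Matlab it looks something like this (the beta and c values reuse the example output above; the price matrix here is just placeholder data for illustration):
beta_raw = [12.7290; -35.9655];   % example estimates quoted above
c_raw    = 121.3422;
beta = beta_raw / beta_raw(1);    % [1; -2.8255]: normalized on the first stock
c    = c_raw / beta_raw(1);       % 9.5327
% placeholder prices purely for illustration (T-by-2 matrix [p1 p2])
p = cumsum(randn(500,2)) + 100;
spread = p(1:end-1,:) * beta - c; % spread(t) = beta' * p(t-1) - c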
~~~~~~~~~~~~~~~~ The problem ~~~~~~~~~~~~~~~~~~~~~~~
Since I am testing an awful lot of stock pairs, I obtain a lot of output. Quite often, however, I receive output where the estimated beta_1 and beta_2 are of the same sign, e.g.
beta_1= -1.4
beta_2= -3.9
When I normalize these according to beta_1, I get:
beta_1 = 1
beta_2 = -3.9 / -1.4 = 2.79
The current pairs trading literature doesn't mention any cases where the betas are of the same sign - how should it be interpreted? Since this is pairs trading, I am supposed to long one stock and short the other when the spread deviates from its long run mean. However, when the betas are of the same sign, to me it seems that I should always go long/short in both at the same time? Is this the correct interpretation? Or should I modify the way in which I normalize the coefficients?
I could really use some help...
EXTRA QUESTION:
Under some of my tests, I reject both the hypothesis of r=0 cointegrating relationships and that of r<=1 cointegrating relationships. I find this very mysterious, as I am only considering two variables at a time, and there can be at most r=1 cointegrating relationship. Can anyone tell me what this means?

Translating MQL4 ibands() to Matlab

I'm trying to translate an indicator from MQL4 (Metatrader language) to Matlab. The Bollinger bands code is as follows:
for(int i=Bars;i>=0;i--)
{
BANDS=iBands(Symbol(),0,20,2,1,0,1,i+1);
}
the iBands() documentation lists the 8 inputs as:
symbol
timeframe
period
deviation
bands_shift
applied_price
mode
shift
I understand all these except bands_shift and shift. Question: if i = Bars is the entire range of the data, why does the i+1 not create an out-of-range error? As far as I can tell, this is code for a 20-period, 2-standard-deviation Bollinger band. For a given time interval, are the associated Bollinger band values the ones calculated for the previous time interval (hence the 1 after the fourth comma)? What does the i+1 do then? Given this code, how would I implement it in MATLAB? My attempt, using this moving standard deviation and this moving average:
moving_average = movemean(EURUSD_closes(1:end-1),20); %end-1 in order to shift by 1
moving_average = [NaN; moving_average]; %adding NaN to make BANDS the length of price
moving_std = movestd(EURUSD_closes(1:end-1),20,'backward');
moving_std = [NaN; moving_std]; %same NaN padding as for the moving average
BANDS = moving_average + 2*moving_std;
I don't think this gives the same output as the MQL4 code. Any hints would definitely be appreciated!
With my limited knowledge of Bollinger bands, it seems like you might have an implementation issue. Have you tried the output of the Bollinger function in MATLAB?
Bollinger bands may have been implemented differently for edge cases where the window size is less than 20. You might have to contact the MQL4 authors to check the formulae used. I noticed a similar difference between my own Python implementation and the indicator shown in Google Finance. Nevertheless, if you have implemented it correctly, then for the values where the window is a full 20 samples you will see the same values.
Unless you are very sure of the File Exchange (FEX) code, you should use std and mean for the implementation.
How to understand the i+1 ( up to Bars+1 ) and a "missing" Out of Range Error
MQL4 works in a "reversed-TimeDOMAIN-indexing" space. Thus the bar index i shows the "depth" into the historical TimeSeriesDataSET, while the most recent (live) bar has an index of [0].
Always [0].
This means that, for the calculation of any technical indicator, the coder must arrange the processing in this manner.
This also means that, for any new bar, the internal representation of the data-storage-layer has to somehow "shift" all DataCELLs by one to the "left" ( backwards in the TimeDOMAIN, towards History ) to make "space" for the new bar, which still has index [0] ( the Now moment in the TimeDOMAIN ).
While "physically" moving the whole current "depth" of the DataSTORE would be an absolute waste of resources ( both time and CPU ), the data-storage-layer works smarter: it adjusts the indexing head on each new-bar event and uses some form of elastic DataSTORE capacity planning / on-demand resizing, so as to minimise the mem-alloc(s) during the continuous growth of the DataSTORE.
This also means that an Out of Range error of this kind is not something the user-code namespace of the MQL4 language exposes or raises here.
How to understand bands_shift and shift.
Calling iBands() has to state for which Bar one asks the function to calculate a result.
shift provides input for this. The index obeys the rules above.
Once the Bollinger Bands calculations are done, one may wish to offset the curves by some number of Bars -- transposing the graph in the TimeDOMAIN { +N << left | -N >> right } -- so that the visualised graphics meet one's expectations or pleasure.
bands_shift provides input for this graph ad-hoc shifting.
Also note that the observed differences between Google, Y!Finance, MATLAB and MQL4 graphs are simply to be expected; they reflect additional ( unknown ) implementation details one can hardly decode from the lines shown on the screen.
applied_price := { PRICE_CLOSE | ... | ..._TYPICAL | ..._WEIGHTED } provides an input for selecting appropriate type of price entering the Bollinger calculus.
mode := { MODE_UPPER | MODE_MAIN | MODE_LOWER } provides input for receiving either an { upper_band | Bollinger Bands' axis | lower_band } PriceDOMAIN value. Thus a "lazy approach" is to call iBands() thrice to get the three-line Bollinger, or many times for spectrum-coloured Bollinger Band heat-maps.
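For the MATLAB side of the translation, a minimal sketch of the upper band computed from the 20 closes strictly before each bar ( one reading of how bands_shift = 1 and shift = i+1 move the window; verify against the MT4 output before relying on it, and EURUSD_closes is assumed to exist as in the question ):
closes = EURUSD_closes(:);           % column vector of close prices
nbars = numel(closes);
period = 20; dev = 2;
upper = NaN(nbars,1);
for k = period+1:nbars
    win = closes(k-period:k-1);      % the 20 bars before bar k
    % MT4 may divide by N rather than N-1 in its deviation; use std(win,1) if so
    upper(k) = mean(win) + dev*std(win);
end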

Random numbers that add to 1 with a minimum increment: Matlab

Having read carefully the previous question
Random numbers that add to 100: Matlab
I am struggling to solve a similar but slightly more complex problem.
I would like to create an array of n elements that sums to 1; however, I want the added constraint that the minimum increment (or, if you like, the number of significant figures) for each element is fixed.
For example if I want 10 numbers that sum to 1 without any constraint the following works perfectly:
num_stocks=10;
num_simulations=100000;
temp = [zeros(num_simulations,1),sort(rand(num_simulations,num_stocks-1),2),ones(num_simulations,1)];
weights = diff(temp,[],2);
I foolishly thought that by scaling this I could add the constraint as follows
num_stocks=10;
min_increment=0.001;
num_simulations=100000;
scaling=1/min_increment;
temp2 = [zeros(num_simulations,1),sort(round(rand(num_simulations,num_stocks-1)*scaling)/scaling,2),ones(num_simulations,1)];
weights2 = diff(temp2,[],2);
However, though this works for small values of n and small values of the increment, if for example n = 1,000 and the increment is 0.1%, then over a large number of trials the first and last numbers have a mean which is consistently below 0.1%.
I am sure there is a logical explanation/solution to this, but I have been tearing my hair out trying to find it, and wondered if anybody would be so kind as to point me in the right direction. To put the problem into context, I am creating random stock portfolios (hence the sum to 1).
Thanks in advance
Thank you for the responses so far. Just to clarify (as I think my initial question was perhaps badly phrased), it is the weights that have a fixed increment of 0.1%, so 0%, 0.1%, 0.2%, etc.
I did try using integers initially
num_stocks=1000;
min_increment=0.001;
num_simulations=100000;
scaling=1/min_increment;
temp = [zeros(num_simulations,1),sort(randi([0 scaling],num_simulations,num_stocks-1),2),ones(num_simulations,1)*scaling];
weights = (diff(temp,[],2)/scaling);
test=mean(weights);
but this was worse; the mean for the first and last weights is well below 0.1%.
Edit to reflect excellent answer by Floris & clarify
The original code I was using to solve this problem (before finding this forum) was
function x = monkey_weights_original(simulations,stocks)
stockmatrix=1:stocks;
base_weight=1/stocks;
r=randi(stocks,stocks,simulations);
x=histc(r,stockmatrix)*base_weight;
end
This runs very fast, which was important considering I want to run a total of 10,000,000 simulations, 10,000 simulations on 1,000 stocks takes just over 2 seconds with a single core & I am running the whole code on an 8 core machine using the parallel toolbox.
It also gives exactly the distribution I was looking for in terms of means, and I think that it is just as likely to get a portfolio that is 100% in 1 stock as it is to get a portfolio that is 0.1% in every stock (though I'm happy to be corrected).
My issue is that although it works for 1,000 stocks and an increment of 0.1%, and I guess it works for 100 stocks and an increment of 1%, as the number of stocks decreases each pick becomes a very large percentage (in the extreme, with 2 stocks you will always get a 50/50 portfolio).
In effect I think this solution is like the binomial solution Floris suggests (but more limited)
However, my question has arisen because I would like to make my approach more flexible and allow for, say, 3 stocks and an increment of 1%, which my current code will not handle correctly; hence how I stumbled across the original question on Stack Overflow.
Floris's recursive approach will get to the right answer, but the speed will be a major issue considering the scale of the problem.
An example of the original research is here
http://www.huffingtonpost.com/2013/04/05/monkeys-stocks-study_n_3021285.html
I am currently working on extending it with more flexibility on portfolio weights and numbers of stocks in the index, but it appears my programming and probability theory abilities are the limiting factor.
One problem I can see is that your formula allows numbers to be zero - when the rounding operation results in two consecutive numbers being the same after sorting. Not sure if you consider that a problem - but I suggest you think about it (it would mean your model portfolio has fewer than N stocks in it, since the contribution of one of the stocks would be zero).
The other thing to note is that the probability of getting the extreme values in your distribution is half of what you want it to be: if you have uniformly distributed numbers from 0 to 1000 and you round them, the numbers that round to 0 were in the interval [0, 0.5); the ones that round to 1 came from [0.5, 1.5) - twice as big. The last number (rounding to 1000) is again from a smaller interval: [999.5, 1000]. Thus you will not get the first and last number as often as you think. If instead of round you use floor, I think you will get the answer you expect.
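A quick numerical check of that edge effect (a small illustration, not part of the original code):
% endpoint bins are hit half as often with round as interior bins
u = 1000*rand(1,1e6);
r = round(u);
f = floor(u);
[mean(r==0), mean(r==500), mean(r==1000)]   % ~0.0005, ~0.001, ~0.0005
[mean(f==0), mean(f==500), mean(f==999)]    % all ~0.001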
EDIT
I thought about this some more, and came up with a slow but (I think) accurate method for doing this. The basic idea is this:
Think in terms of integers; rather than dividing the interval 0 - 1 in steps of 0.001, divide the interval 0 - 1000 in integer steps
If we try to divide N into m intervals, the mean size of a step should be N / m; but being integer, we would expect the intervals to be binomially distributed
This suggests an algorithm in which we choose the first interval as a binomially distributed variate with mean (N/m) - call the first value v1; then divide the remaining interval N - v1 into m-1 steps; we can do so recursively.
The following code implements this:
% random integers adding up to a definite sum
function r = randomInt(n, limit)
% returns an array of n random integers
% whose sum is limit
% calls itself recursively; slow but accurate
if n>1
v = binomialRandom(limit, 1 / n);
r = [v randomInt(n-1, limit - v)];
else
r = limit;
end
function b = binomialRandom(N, p)
b = sum(rand(1,N)<p); % slow but direct
To get 10000 instances, you run this as follows:
tic
portfolio = zeros(10000, 10);
for ii = 1:10000
portfolio(ii,:) = randomInt(10, 1000);
end
toc
This ran in 3.8 seconds on a modest machine (single thread) - of course the method for obtaining a binomially distributed random variate is the thing slowing it down; there are statistical toolboxes with more efficient functions but I don't have one. If you increase the granularity (for example, by setting limit=10000) it will slow down more since you increase the number of random number samples that are generated; with limit = 10000 the above loop took 13.3 seconds to complete.
As a test, I found mean(portfolio)' and std(portfolio)' as follows (with limit=1000; the first column is the mean, the second the standard deviation):
100.20 9.446
99.90 9.547
100.09 9.456
100.00 9.548
100.01 9.356
100.00 9.484
99.69 9.639
100.06 9.493
99.94 9.599
100.11 9.453
This looks like a pretty convincing "flat" distribution to me. We would expect the numbers to be binomially distributed with a mean of 100 and a standard deviation of sqrt(p*(1-p)*n). In this case, p = 0.1 and n = 1000, so we expect s = 9.4868. The values I actually got were again quite close.
I realize that this is inefficient for large values of limit, and I made no attempt at efficiency; I find that clarity trumps speed when you develop something new. You could, for instance, pre-compute the cumulative binomial distributions for p = 1./(1:10) and then do a random lookup; but if you are just going to do this once, for 100,000 instances, it will run in under a minute. Unless you intend to do it many times, I wouldn't bother. But if anyone wants to improve this code I'd be happy to hear from them.
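As a rough sketch of that pre-computation idea (for a single fixed N and p; the recursive code above would need such a table for every (N, p) pair it encounters):
% draw binomial(N, p) variates by inverse lookup in a precomputed CDF
% instead of summing N Bernoulli trials per draw
N = 1000; p = 0.1;
k = 0:N;
logpmf = gammaln(N+1) - gammaln(k+1) - gammaln(N-k+1) ...
         + k*log(p) + (N-k)*log(1-p);
cdf = cumsum(exp(logpmf));
cdf = cdf / cdf(end);                       % guard against round-off
draw = @(u) find(cdf >= u, 1) - 1;          % inverse-CDF lookup for one uniform u
samples = arrayfun(draw, rand(1,10000));    % 10,000 binomial draws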
Eventually I have solved this problem!
I found a paper by two academics at Johns Hopkins University, "Sampling Uniformly From The Unit Simplex":
http://www.cs.cmu.edu/~nasmith/papers/smith+tromble.tr04.pdf
In the paper they outline how naive algorithms don't work, in a way very similar to woodchips' answer to the "Random numbers that add to 100" question. They then go on to show that the method suggested by David Schwartz can also be slightly biased and propose a modified algorithm which appears to work.
If you want x numbers that sum to y
Sample uniformly x-1 random numbers from the range 1 to x+y-1 without replacement
Sort them
Add a zero at the beginning & x+y at the end
Difference them and subtract 1 from each value
If you want to scale them as I do, then divide by y
It took me a while to realise why this works when the original approach didn't, and it comes down to the probability of getting a zero weight (as highlighted by Floris in his answer). In the original version, to get a zero weight anywhere except the first or last position your random numbers had to contain two equal values, but for the first and last positions a random number of zero or the maximum value would also produce a zero weight, which is more likely.
In the revised algorithm, zero and the maximum number are not in the set of random choices, and a zero weight occurs only if you select two consecutive numbers, which is equally likely for every position.
I coded it up in Matlab as follows
function weights = unbiased_monkey_weights(num_simulations,num_stocks,min_increment)
scaling=1/min_increment;                         % y: the total expressed in increments
sample=NaN(num_simulations,num_stocks-1);
for i=1:num_simulations
    allcomb=randperm(scaling+num_stocks-1);      % a permutation of 1 .. x+y-1
    sample(i,:)=allcomb(1:num_stocks-1);         % keep x-1 of them, i.e. sample without replacement
end
temp = [zeros(num_simulations,1),sort(sample,2),ones(num_simulations,1)*(scaling+num_stocks)];  % prepend 0, append x+y
weights = (diff(temp,[],2)-1)/scaling;           % difference, subtract 1, rescale
end
Obviously the loop is a bit clunky, and as I'm using the 2009 version the randperm function only allows you to generate permutations of the whole set; despite this, I can run 10,000 simulations for 1,000 numbers in 5 seconds on my clunky laptop, which is fast enough.
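(For reference, newer Matlab versions accept a two-argument randperm, which avoids generating the whole permutation; something like the following, with the same variables as in the function above:)
for i = 1:num_simulations
    sample(i,:) = randperm(scaling+num_stocks-1, num_stocks-1);  % x-1 unique draws from 1 .. x+y-1
end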
The mean weights are now correct, and as a quick test I replicated woodchips' example of generating 3 numbers that sum to 1 with a minimum increment of 0.01%, and it also looks right.
Thank you all for your help and I hope this solution is useful to somebody else in the future
The simple answer is to use the schemes that work well with NO minimum increment, then transform the problem. As always, be careful. Some methods do NOT yield uniform sets of numbers.
Thus, suppose I want 11 numbers that sum to 100, with a constraint of a minimum increment of 5. I would first find 11 numbers that sum to 45 (that is, 100 - 11*5), with no lower bound on the samples (other than zero). I could use a tool from the File Exchange for this. Simplest is to sample 10 numbers in the interval [0,45]. Sort them, then find the differences.
X = diff([0,sort(rand(1,10)),1]*45);
The vector X is a sample of numbers that sums to 45. But the vector Y sums to 100, with a minimum value of 5.
Y = X + 5;
Of course, this is trivially vectorized if you wish to find multiple sets of numbers with the given constraint.
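For example, one way to vectorize it (a sketch generating 10,000 sets at once, each row being one sample):
% many sets at once: each row sums to 100 with a minimum value of 5
nsets = 10000;
X = diff([zeros(nsets,1), sort(rand(nsets,10),2), ones(nsets,1)], [], 2) * 45;
Y = X + 5;                 % 10000-by-11; every row sums to 100, each entry >= 5
max(abs(sum(Y,2) - 100))   % sanity check: should be ~0 up to rounding error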