Stan effective sample size (n_eff) - hierarchical data

I reproduced the results of a hierarchical model fitted with the rethinking package using plain rstan, and I am curious why the n_eff values are not closer.
Here is the model with random intercepts for 2 groups (intercept_x2) using the rethinking package:
Code:
response = c(rnorm(500,0,1),rnorm(500,200,10))
predictor1_continuous = rnorm(1000)
predictor2_categorical = factor(c(rep("A",500),rep("B",500) ))
data = data.frame(y = response, x1 = predictor1_continuous, x2 = predictor2_categorical)
head(data)
library(rethinking)
m22 <- map2stan(
  alist(
    y ~ dnorm( mu , sigma ),
    mu <- intercept + intercept_x2[x2] + beta*x1,
    intercept ~ dnorm(0, 10),
    intercept_x2[x2] ~ dnorm(0, sigma_2),
    beta ~ dnorm(0, 10),
    sigma ~ dnorm(0, 10),
    sigma_2 ~ dnorm(0, 10)
  ),
  data = data, chains = 1, iter = 5000, warmup = 500 )
precis(m22, depth = 2)
Mean StdDev lower 0.89 upper 0.89 n_eff Rhat
intercept 9.96 9.59 -5.14 25.84 1368 1
intercept_x2[1] -9.94 9.59 -25.55 5.43 1371 1
intercept_x2[2] 189.68 9.59 173.28 204.26 1368 1
beta 0.06 0.22 -0.27 0.42 3458 1
sigma 6.94 0.16 6.70 7.20 2927 1
sigma_2 43.16 5.01 35.33 51.19 2757 1
Now here is the same model in rstan():
# create a numeric vector to indicate the categorical groups
data$GROUP_ID = match( data$x2, levels( data$x2 ) )
library(rstan)
standat <- list(
N = nrow(data),
y = data$y,
x1 = data$x1,
GROUP_ID = data$GROUP_ID,
nGROUPS = 2
)
stanmodelcode = '
data {
  int<lower=1> N;
  int nGROUPS;
  real y[N];
  real x1[N];
  int<lower=1, upper=nGROUPS> GROUP_ID[N];
}
transformed data {
}
parameters {
  real intercept;
  vector[nGROUPS] intercept_x2;
  real beta;
  real<lower=0> sigma;
  real<lower=0> sigma_2;
}
transformed parameters { // none needed
}
model {
  real mu;
  // priors
  intercept ~ normal(0, 10);
  intercept_x2 ~ normal(0, sigma_2);
  beta ~ normal(0, 10);
  sigma ~ normal(0, 10);
  sigma_2 ~ normal(0, 10);
  // likelihood
  for(i in 1:N){
    mu = intercept + intercept_x2[ GROUP_ID[i] ] + beta*x1[i];
    y[i] ~ normal(mu, sigma);
  }
}
'
fit22 = stan(model_code=stanmodelcode, data=standat, iter=5000, warmup=500, chains = 1)
fit22
Inference for Stan model: b212ebc67c08c77926c59693aa719288.
1 chains, each with iter=5000; warmup=500; thin=1;
post-warmup draws per chain=4500, total post-warmup draws=4500.
mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
intercept 10.14 0.30 9.72 -8.42 3.56 10.21 16.71 29.19 1060 1
intercept_x2[1] -10.12 0.30 9.73 -29.09 -16.70 -10.25 -3.50 8.36 1059 1
intercept_x2[2] 189.50 0.30 9.72 170.40 182.98 189.42 196.09 208.05 1063 1
beta 0.05 0.00 0.21 -0.37 -0.10 0.05 0.20 0.47 3114 1
sigma 6.94 0.00 0.15 6.65 6.84 6.94 7.05 7.25 3432 1
sigma_2 43.14 0.09 4.88 34.38 39.71 42.84 46.36 53.26 3248 1
lp__ -2459.75 0.05 1.71 -2463.99 -2460.68 -2459.45 -2458.49 -2457.40 1334 1
Samples were drawn using NUTS(diag_e) at Thu Aug 31 15:53:09 2017.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
My Questions:
1. The n_eff is larger using rethinking. There are simulation differences, but do you think something else is going on here?
2. Besides the n_eff being different, the percentiles of the posterior distributions are different. I was thinking rethinking and rstan should return similar results with 5000 iterations, since rethinking is just calling rstan. Are differences like that normal, or is something different between the two implementations?
3. I created data$GROUP_ID to indicate the categorical groupings. Is this the correct way to incorporate categorical variables into a hierarchical model in rstan()? I have 2 groups here; if I had 50 groups, would I use the same kind of data$GROUP_ID vector, i.e. is that the standard way? (A generalized version is sketched just below.)
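For reference, here is how I would generalize the group indexing to an arbitrary number of groups (a sketch on the same toy data; as.integer() on a factor gives the same 1..K codes as match()):
# Sketch: the same integer-index pattern works for any number of groups
data$GROUP_ID <- as.integer(data$x2)          # identical to match(data$x2, levels(data$x2))
standat <- list(
  N        = nrow(data),
  y        = data$y,
  x1       = data$x1,
  GROUP_ID = data$GROUP_ID,
  nGROUPS  = nlevels(data$x2)                 # 2 here, but the same for 50 groups
)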
Thank you.

Related

Adonis result does not match the NMDS plot?

I'm not a statistician and am kind of working through this blindly, but why does my NMDS plot of Bray-Curtis distances show clear groupings for Status (blue and pink), while adonis says that it is Time.Point that differs significantly?
distance_calc <- phyloseq::distance(rarified2, "bray")
sampledf <- data.frame(sample_data(rarified2))
adonisresult<- adonis2(formula = distance_calc ~ Status * Time.Point, data = sampledf, type = "bray")
> adonisresult
Permutation test for adonis under reduced model
Terms added sequentially (first to last)
Permutation: free
Number of permutations: 999
adonis2(formula = adonisf, data = sampledf, type = "bray")
Df SumOfSqs R2 F Pr(>F)
Status 1 0.2534 0.05397 1.2969 0.198
Time.Point 1 0.4277 0.09111 2.1893 0.016 *
Status:Time.Point 1 0.1061 0.02259 0.5429 0.901
Residual 20 3.9072 0.83232
Total 23 4.6943 1.00000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
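For reference, the NMDS plot I am comparing against was made roughly like this (a sketch from memory, so the exact aesthetics may differ; it assumes Status and Time.Point are factors in the sample data):
# Sketch (assumed workflow): NMDS on the same Bray-Curtis distances,
# coloured by Status and shaped by Time.Point for comparison with the adonis2 result
library(phyloseq)
library(ggplot2)
ord <- ordinate(rarified2, method = "NMDS", distance = "bray")
plot_ordination(rarified2, ord, color = "Status", shape = "Time.Point") +
  stat_ellipse()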

mgcv: Difference between s(x, by=cat) and s(cat, bs='re')

What is the difference between adding a by= parameter to a smooth and adding a random effect smooth?
I've tried both, and get different results. E.g.:
library(mgcv)
library(tibble)   # for tibble(); not attached by mgcv
set.seed(26)
gam.df <- tibble(y = rnorm(400),
                 x1 = rnorm(400),
                 cat = factor(rep(1:4, each = 100)))
gam0 <- gam(y ~ s(x1, by=cat), data=gam.df)
summary(gam0)
produces:
15:15:39> summary(gam0)
Family: gaussian
Link function: identity
Formula:
y ~ s(x1, by = cat)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.001275 0.049087 -0.026 0.979
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x1):cat1 1 1 7.437 0.00667 **
s(x1):cat2 1 1 0.047 0.82935
s(x1):cat3 1 1 0.393 0.53099
s(x1):cat4 1 1 0.019 0.89015
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.00968 Deviance explained = 1.96%
GCV = 0.97413 Scale est. = 0.96195 n = 400
On the other hand:
gam1 <- gam(y ~ s(x1) + s(cat, bs='re'), data=gam.df)
summary(gam1)
produces:
15:16:33> summary(gam1)
Family: gaussian
Link function: identity
Formula:
y ~ s(x1) + s(cat, bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.0001211 0.0572271 0.002 0.998
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x1) 1.0000 1 2.359 0.125
s(cat) 0.7883 3 0.356 0.256
R-sq.(adj) = 0.00594 Deviance explained = 1.04%
GCV = 0.97236 Scale est. = 0.96558 n = 400
I understand that by= shows the summary by each factor level, but shouldn't the overall results of the model such as R^2 be the same?
The factor by model, gam0, contains a separate smooth of x1 for each level of cat, but doesn't include anything specifically for the means of y in each group[*], because it is misspecified. Compare this with gam1, which has a single smooth of x1 plus group means for the levels of cat.
Even though you generated random data without any smooth or group-level effects, gam0 is potentially a much more complex and flexible model, as it contains 4 separate smooths, each potentially using 9 degrees of freedom. Your gam1 has a single smooth of x1 which uses up to 9 degrees of freedom, plus something between 0 and 4 degrees of freedom for the random-effect smooth. gam0 is simply exploiting random variation in the data that can be explained a little bit by those extra potential degrees of freedom. You can see this in the R-sq.(adj), which is still essentially zero for gam0 even though it explains roughly twice the deviance that gam1 does (not that either is a good amount of deviance explained).
r$> library("gratia")
r$> smooths(gam0)
[1] "s(x1):cat1" "s(x1):cat2" "s(x1):cat3" "s(x1):cat4"
r$> smooths(gam1)
[1] "s(x1)" "s(cat)"
[*] Note that your by model should be
gam0 <- gam(y ~ cat + s(x1, by=cat), data=gam.df)
because the smooths created by s(x1, by=cat) are subject to an identifiability constraint (as there's a constant term, the intercept, in the model). This is a sum-to-zero constraint, which means that the individual smooths do not contain the group means. Without a parametric cat term, the smooths are forced not only to model the way y changes as a function of x1 in each group but also the magnitude of y in the respective groups, yet there are no functions in the span of the basis that could model such constant (magnitude) effects.
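A minimal sketch of the corrected fit on the same simulated data (gam0b is just an illustrative name):
# Sketch: include the parametric factor term so the group means are modelled explicitly
gam0b <- gam(y ~ cat + s(x1, by = cat), data = gam.df)
summary(gam0b)   # the group means now appear among the parametric coefficients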

Doing Andrew Ng's Logistic Regression exercise without fminunc

I've been trying to finish Andrew Ng's Machine Learning course and am now at the part about logistic regression. I am trying to find the parameters and also calculate the cost without using the MATLAB function fminunc. However, I am not converging to the correct results posted by other students who finished the assignment using fminunc. Specifically, my problems are:
the parameters theta are incorrect
my cost seems to be blowing up
I get many NaNs in my cost vector (I just create a vector of the costs to keep track)
I attempted to find the parameters via gradient descent as I understood it from the course material. However, my implementation still seems to give incorrect results.
dataset = load('dataStuds.txt');
x = dataset(:,1:end-1);
y = dataset(:,end);
m = length(x);
% Padding with a column of 1's (the intercept term, as they call it)
x = [ones(length(x),1), x];
thetas = zeros(size(x,2),1);
% Setting the learning rate to 0.1
alpha = 0.1;
for i = 1:100000
% theta' * x for every example; in MATLAB this is written x * thetas so the
% dimensions line up (x is m-by-n, thetas is n-by-1)
ttrx = x * thetas;
% the hypothesis h_x = g(z) = 1 ./ (1 + exp(-z)), i.e. the sigmoid of z
h_x = 1 ./ (1 + exp(-ttrx));
err = h_x - y;   % prediction error for every example (renamed so it doesn't shadow MATLAB's error())
% the gradient, i.e. the partial derivative of J(theta) with respect to theta_j
for j = 1:length(thetas)
gradient = 1/m * err' * x(:,j);
% Updating the parameters theta
thetas(j) = thetas(j) - alpha * gradient;
end
% Calculating the cost, just to keep track...
cost(i) = 1/m * ( -y' * log(h_x) - (1-y)' * log(1-h_x) );
end
% Displaying the final theta's that I obtained
thetas
The parameters theta that I get are:
thetas =
-482.8509
3.7457
2.6976
The results below are from one example that I downloaded; its author used fminunc.
Cost at theta found by fminunc: 0.203506
theta:
-24.932760
0.204406
0.199616
The data:
34.6236596245170 78.0246928153624 0
30.2867107682261 43.8949975240010 0
35.8474087699387 72.9021980270836 0
60.1825993862098 86.3085520954683 1
79.0327360507101 75.3443764369103 1
45.0832774766834 56.3163717815305 0
61.1066645368477 96.5114258848962 1
75.0247455673889 46.5540135411654 1
76.0987867022626 87.4205697192680 1
84.4328199612004 43.5333933107211 1
95.8615550709357 38.2252780579509 0
75.0136583895825 30.6032632342801 0
82.3070533739948 76.4819633023560 1
69.3645887597094 97.7186919618861 1
39.5383391436722 76.0368108511588 0
53.9710521485623 89.2073501375021 1
69.0701440628303 52.7404697301677 1
67.9468554771162 46.6785741067313 0
70.6615095549944 92.9271378936483 1
76.9787837274750 47.5759636497553 1
67.3720275457088 42.8384383202918 0
89.6767757507208 65.7993659274524 1
50.5347882898830 48.8558115276421 0
34.2120609778679 44.2095285986629 0
77.9240914545704 68.9723599933059 1
62.2710136700463 69.9544579544759 1
80.1901807509566 44.8216289321835 1
93.1143887974420 38.8006703371321 0
61.8302060231260 50.2561078924462 0
38.7858037967942 64.9956809553958 0
61.3792894474250 72.8078873131710 1
85.4045193941165 57.0519839762712 1
52.1079797319398 63.1276237688172 0
52.0454047683183 69.4328601204522 1
40.2368937354511 71.1677480218488 0
54.6351055542482 52.2138858806112 0
33.9155001090689 98.8694357422061 0
64.1769888749449 80.9080605867082 1
74.7892529594154 41.5734152282443 0
34.1836400264419 75.2377203360134 0
83.9023936624916 56.3080462160533 1
51.5477202690618 46.8562902634998 0
94.4433677691785 65.5689216055905 1
82.3687537571392 40.6182551597062 0
51.0477517712887 45.8227014577600 0
62.2226757612019 52.0609919483668 0
77.1930349260136 70.4582000018096 1
97.7715992800023 86.7278223300282 1
62.0730637966765 96.7688241241398 1
91.5649744980744 88.6962925454660 1
79.9448179406693 74.1631193504376 1
99.2725269292572 60.9990309984499 1
90.5467141139985 43.3906018065003 1
34.5245138532001 60.3963424583717 0
50.2864961189907 49.8045388132306 0
49.5866772163203 59.8089509945327 0
97.6456339600777 68.8615727242060 1
32.5772001680931 95.5985476138788 0
74.2486913672160 69.8245712265719 1
71.7964620586338 78.4535622451505 1
75.3956114656803 85.7599366733162 1
35.2861128152619 47.0205139472342 0
56.2538174971162 39.2614725105802 0
30.0588224466980 49.5929738672369 0
44.6682617248089 66.4500861455891 0
66.5608944724295 41.0920980793697 0
40.4575509837516 97.5351854890994 1
49.0725632190884 51.8832118207397 0
80.2795740146700 92.1160608134408 1
66.7467185694404 60.9913940274099 1
32.7228330406032 43.3071730643006 0
64.0393204150601 78.0316880201823 1
72.3464942257992 96.2275929676140 1
60.4578857391896 73.0949980975804 1
58.8409562172680 75.8584483127904 1
99.8278577969213 72.3692519338389 1
47.2642691084817 88.4758649955978 1
50.4581598028599 75.8098595298246 1
60.4555562927153 42.5084094357222 0
82.2266615778557 42.7198785371646 0
88.9138964166533 69.8037888983547 1
94.8345067243020 45.6943068025075 1
67.3192574691753 66.5893531774792 1
57.2387063156986 59.5142819801296 1
80.3667560017127 90.9601478974695 1
68.4685217859111 85.5943071045201 1
42.0754545384731 78.8447860014804 0
75.4777020053391 90.4245389975396 1
78.6354243489802 96.6474271688564 1
52.3480039879411 60.7695052560259 0
94.0943311251679 77.1591050907389 1
90.4485509709636 87.5087917648470 1
55.4821611406959 35.5707034722887 0
74.4926924184304 84.8451368493014 1
89.8458067072098 45.3582836109166 1
83.4891627449824 48.3802857972818 1
42.2617008099817 87.1038509402546 1
99.3150088051039 68.7754094720662 1
55.3400175600370 64.9319380069486 1
74.7758930009277 89.5298128951328 1
I ran your code and the implementation itself is fine. However, the tricky thing about gradient descent is ensuring that your costs don't diverge to infinity. If you look at your cost array, you will see that the costs definitely diverge, and this is why you are not getting the correct results.
The best way to eliminate this in your case is to reduce the learning rate. Through experimentation, I have found that a learning rate of alpha = 0.003 works best for your problem. I've also increased the number of iterations to 200000. Changing these two things gives the following parameters and associated cost:
>> format long g;
>> thetas
thetas =
-17.6287417780435
0.146062780453677
0.140513170941357
>> cost(end)
ans =
0.214821863463963
This is more or less in line with the magnitudes of the parameters you see when using fminunc. However, it finds slightly different parameters and a different cost because of the minimization method itself: fminunc uses a quasi-Newton method (a BFGS variant) which finds the solution much faster.
What is most important is the actual accuracy. Remember that to classify whether an example belongs to label 0 or 1, you take the weighted sum of the parameters and the example's features, run it through the sigmoid function, and threshold the result at 0.5. We then compute how often the predicted label matches the expected label.
Using the parameters we found with gradient descent gives us the following accuracy:
>> ttrx = x * thetas;
>> h_x = 1 ./ (1 + exp(-ttrx)) >= 0.5;
>> mean(h_x == y)
ans =
0.89
This means that we've achieved an 89% classification accuracy. Using the parameters found by fminunc also gives:
>> thetas2 = [-24.932760; 0.204406; 0.199616];
>> ttrx = x * thetas2;
>> h_x = 1 ./ (1 + exp(-ttrx)) >= 0.5;
>> mean(h_x == y)
ans =
0.89
So we can see that the accuracy is the same, and I wouldn't worry too much about the magnitude of the parameters; comparing the costs of the two implementations is the more meaningful check, and they are close.
As a final note, I would suggest looking at this post of mine for some tips on making logistic regression work in the long run. I would definitely recommend normalizing your features before finding the parameters to make the algorithm run faster. It also addresses why you were finding the wrong parameters (namely, the cost blowing up): Cost function in logistic regression gives NaN as a result.
Normalizing the data using the mean and standard deviation, as follows, lets you use a large learning rate and still get a similar answer:
clear; clc
data = load('ex2data1.txt');
m = length(data);
alpha = 0.1;
theta = [0; 0; 0];
y = data(:,3);
% Normalizing the data
xm1 = mean(data(:,1)); xm2 = mean(data(:,2));
xs1 = std(data(:,1)); xs2 = std(data(:,2));
x1 = (data(:,1)-xm1)./xs1; x2 = (data(:,2)-xm2)./xs2;
X = [ones(m, 1) x1 x2];
for i=1:10000
h = 1./(1+exp(-(X*theta)));
theta = theta - (alpha/m)* (X'*(h-y));
J(i) = (1/m)*(-y'*log(h)-(1-y)'*log(1-h));
end
theta
J(end)
figure
plot(J)

multiple training data for cascade-forward backpropagation network

I am training my neural network with data from 3 consecutive days and testing it with data from a 4th day. The values in this example are randomly chosen and have no relation to reality. I want the neural network to learn the current as a function of the temperature and the solar radiation.
%% initialize data for training
Temperature_Day1 = [25 26 27 26 25];
Temperature_Day2 = [25 24 24 23 24];
Temperature_Day3 = [21 20 22 21 20];
SolarRadiation_Day1 = [990 944 970 999 962];
SolarRadiation_Day2 = [993 947 973 996 967];
SolarRadiation_Day3 = [993 948 973 998 965];
Current_Day1 = [0.11 0.44 0.44 0.45 0.56];
Current_Day2 = [0.41 0.34 0.43 0.55 0.75];
Current_Day3 = [0.34 0.98 0.34 0.76 0.71];
Day1 = [Temperature_Day1; SolarRadiation_Day1]; % 2-by-5
Day2 = [Temperature_Day2; SolarRadiation_Day2]; % 2-by-5
Day3 = [Temperature_Day3; SolarRadiation_Day3]; % 2-by-5
%% training input and training target
Training_Input = [Day1; Day2; Day3]; % 6-by-5
Training_Target = [Current_Day1; Current_Day2; Current_Day3]; % 3-by-5
%% training the network
hiddenLayers= 2;
net = newcf(Training_Input, Training_Target, hiddenLayers);
y = sim(net, Training_Input);
net.trainParam.epochs = 100;
net = train(net, Training_Input, Training_Target);
%% initialize data for prediction
Temperature_Day4 = [45 23 22 11 24];
SolarRadiation_Day4 = [960 984 980 993 967];
Current_Day4 = [0.14 0.48 0.37 0.46 0.77];
Day4 = [Temperature_Day4; SolarRadiation_Day4]; % 2-by-5
Test_Input = [Day4; Day4; Day4]; % same dimension as Training_Input; subject to question
%% prediction
Predicted_Target = sim(net, Test_Input); % yields 3-by-5
My question is: how do I train the network with the data of 3 days and then predict the target for the 4th day? Since training and testing inputs must have the same dimension, how do I test it with data from only one day? Here it is solved by just concatenating three identical copies of the test input. However, this also yields 3 different data sets for the predicted target.
What is the right way to do it?
BTW: I have seen this type of question many times, but the answers are never satisfying because they always suggest changing the dimensions of the test input without considering the nature of the problem (which is that only one data set is available for testing). So please don't mark this as a duplicate.
The features that you have for your network are Temperature and SolarRadiation, each taken at specific times during the day. The day on which these readings are taken is irrelevant (otherwise you wouldn't be able to predict the outputs for day 4 given data for days 1-3).
This means that we can simply pass each observation separately by concatenating the days horizontally (and similarly for the target data):
Training_Input = [Day1, Day2, Day3]; % 2-by-15
Training_Target = [Current_Day1, Current_Day2, Current_Day3]; % 1-by-15
The resulting network will give you one output (Current) per observation in the test set, so you don't need to duplicate:
Day4 = [Temperature_Day4; SolarRadiation_Day4]; % 2-by-5
Test_Input = [Day4]; % 2-by-5
Predicted_Target will now be 1-by-5, showing the predicted Current for each of the test observations.
You might consider adding a third feature as input to your net, representing the time at which each observation was taken. Assuming that you have t time slots each day at which observations are taken (so that length(Temperature) == length(SolarRadiation) == t for all days) and that the s-th observation is taken at the same time every day, you can add a feature called TimeSlot:
TimeSlot_Day1 = 1:numel(Temperature_Day1);
TimeSlot_Day2 = 1:numel(Temperature_Day2);
TimeSlot_Day3 = 1:numel(Temperature_Day3);
Day1 = [Temperature_Day1; SolarRadiation_Day1; TimeSlot_Day1]; % 3-by-5
Day2 = [Temperature_Day2; SolarRadiation_Day2; TimeSlot_Day2]; % 3-by-5
Day3 = [Temperature_Day3; SolarRadiation_Day3; TimeSlot_Day3]; % 3-by-5

How to find subset selection for linear regression model?

I am working with the mtcars dataset and using linear regression:
data(mtcars)
fit <- lm(mpg ~ ., mtcars); summary(fit)
When I fit the model with lm, it shows this result:
Call:
lm(formula = mpg ~ ., data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.5087 -1.3584 -0.0948 0.7745 4.6251
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 23.87913 20.06582 1.190 0.2525
cyl6 -2.64870 3.04089 -0.871 0.3975
cyl8 -0.33616 7.15954 -0.047 0.9632
disp 0.03555 0.03190 1.114 0.2827
hp -0.07051 0.03943 -1.788 0.0939 .
drat 1.18283 2.48348 0.476 0.6407
wt -4.52978 2.53875 -1.784 0.0946 .
qsec 0.36784 0.93540 0.393 0.6997
vs1 1.93085 2.87126 0.672 0.5115
amManual 1.21212 3.21355 0.377 0.7113
gear4 1.11435 3.79952 0.293 0.7733
gear5 2.52840 3.73636 0.677 0.5089
carb2 -0.97935 2.31797 -0.423 0.6787
carb3 2.99964 4.29355 0.699 0.4955
carb4 1.09142 4.44962 0.245 0.8096
carb6 4.47757 6.38406 0.701 0.4938
carb8 7.25041 8.36057 0.867 0.3995
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.833 on 15 degrees of freedom
Multiple R-squared: 0.8931, Adjusted R-squared: 0.779
F-statistic: 7.83 on 16 and 15 DF, p-value: 0.000124
I found that none of the variables are marked as significant at the 0.05 significance level.
To find significant variables, I want to do subset selection to find the best combination of variables as predictors for the response variable mpg.
The function regsubsets in the package leaps does best subset regression (see ?leaps). Adapting your code:
library(leaps)
regfit <- regsubsets(mpg ~., data = mtcars)
summary(regfit)
# or for a more visual display
plot(regfit,scale="Cp")
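From there, a common follow-up (a sketch, assuming you want the subset that minimizes Mallows' Cp) is to pull out the coefficients of the best-scoring model:
# Sketch: choose the subset size with the smallest Cp and inspect its coefficients
regsum <- summary(regfit)
best <- which.min(regsum$cp)   # index of the best subset size by Cp
coef(regfit, best)             # coefficients of that best subset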