Compare contrasts in a linear model in Python (like R's contrast library?)

In R I can do the following to compare two contrasts from a linear model:
url <- "https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/spider_wolff_gorb_2013.csv"
filename <- "spider_wolff_gorb_2013.csv"
install.packages("downloader", repos="http://cran.us.r-project.org")
library(downloader)
if (!file.exists(filename)) download(url, filename)
spider <- read.csv(filename, skip=1)
head(spider, 5)
# leg type friction
# 1 L1 pull 0.90
# 2 L1 pull 0.91
# 3 L1 pull 0.86
# 4 L1 pull 0.85
# 5 L1 pull 0.80
fit = lm(friction ~ type + leg, data=spider)
fit
# Call:
# lm(formula = friction ~ type + leg, data = spider)
#
# Coefficients:
# (Intercept) typepush legL2 legL3 legL4
# 1.0539 -0.7790 0.1719 0.1605 0.2813
install.packages("contrast", repos="http://cran.us.r-project.org")
library(contrast)
l4vsl2 = contrast(fit, list(leg="L4", type="pull"), list(leg="L2",type="pull"))
l4vsl2
# lm model parameter contrast
#
# Contrast S.E. Lower Upper t df Pr(>|t|)
# 0.1094167 0.04462392 0.02157158 0.1972618 2.45 277 0.0148
I have found out how to do much of the above in Python:
import pandas as pd
df = pd.read_table("https://raw.githubusercontent.com/genomicsclass/dagdata/master/inst/extdata/spider_wolff_gorb_2013.csv", sep=",", skiprows=1)
df.head(2)
import statsmodels.formula.api as sm
model1 = sm.ols(formula='friction ~ type + leg', data=df)
fitted1 = model1.fit()
print(fitted1.summary())
Now all that remains is finding the t-statistic for the contrast of leg pair L4 vs. leg pair L2. Is this possible in Python?

statsmodels is still missing some predefined contrasts, but the t_test, wald_test, and f_test methods of the model results classes can be used to test linear (or affine) restrictions. The restrictions can be given either as arrays or as strings that use the parameter names.
Details on how to specify contrasts/restrictions are in the documentation.
For example:
>>> tt = fitted1.t_test("leg[T.L4] - leg[T.L2]")
>>> print(tt.summary())
Test for Constraints
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
c0 0.1094 0.045 2.452 0.015 0.022 0.197
==============================================================================
The results are available as attributes or methods of the instance returned by t_test. For example, the confidence interval can be obtained with conf_int:
>>> tt.conf_int()
array([[ 0.02157158, 0.19726175]])
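The restriction can also be given as a numeric contrast vector instead of a string. A minimal sketch (the column order is read off the fitted design matrix rather than assumed):

```python
import numpy as np

# Build the L4-vs-L2 contrast as a restriction vector aligned with the
# design-matrix columns (Intercept, type[T.push], leg[T.L2], leg[T.L3], leg[T.L4]).
names = fitted1.model.exog_names
r = np.zeros(len(names))
r[names.index("leg[T.L4]")] = 1
r[names.index("leg[T.L2]")] = -1

# Same t-statistic (about 2.45) as the string version above.
print(fitted1.t_test(r).summary())
```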
t_test is vectorized and treats each restriction or contrast as a separate hypothesis. wald_test treats a list of restrictions as a joint hypothesis:
>>> tt = fitted1.t_test(["leg[T.L3] - leg[T.L2]", "leg[T.L4] - leg[T.L2]"])
>>> print(tt.summary())
Test for Constraints
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
c0 -0.0114 0.043 -0.265 0.792 -0.096 0.074
c1 0.1094 0.045 2.452 0.015 0.022 0.197
==============================================================================
>>> tt = fitted1.wald_test(["leg[T.L3] - leg[T.L2]", "leg[T.L4] - leg[T.L2]"])
>>> print(tt.summary())
<F test: F=array([[ 8.10128575]]), p=0.00038081249480917173, df_denom=277, df_num=2>
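f_test gives the same joint test (it is wald_test with use_f=True), so this reproduces the F statistic above:

```python
# Joint test of both restrictions, reported as an F statistic.
ft = fitted1.f_test(["leg[T.L3] - leg[T.L2]", "leg[T.L4] - leg[T.L2]"])
print(ft)  # same F of about 8.10 with df (2, 277) as the wald_test above
```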
Aside: this also works for robust covariance matrices if cov_type was specified as an argument to fit.
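For instance, a minimal sketch with an arbitrarily chosen HC3 covariance; the contrast then uses the robust standard errors automatically:

```python
# Refit with a heteroscedasticity-robust covariance; HC3 is chosen here purely for illustration.
fitted_robust = model1.fit(cov_type="HC3")
print(fitted_robust.t_test("leg[T.L4] - leg[T.L2]").summary())
```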

Related

Marginal Means accounting for the random effect uncertainty

When we have repeated measurements on an experimental unit, these units typically cannot be considered 'independent' and need to be modeled in a way that gives valid estimates of our standard errors.
When I compare the intervals obtained by computing the marginal means for the treatment using a mixed model (treating the unit as a random effect) against those obtained by first averaging over the unit and THEN running a simple linear model on the averaged responses, I get exactly the same uncertainty intervals.
How do we incorporate the uncertainty of the measurements on the unit into the uncertainty of what we think our treatments look like?
In order to really propagate all the uncertainty, shouldn't we see what the treatment looks like, averaged over "all possible measurements" on a unit?
``` r
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(emmeans)
library(lme4)
#> Loading required package: Matrix
library(ggplot2)
tmp <- structure(list(treatment = c("A", "A", "A", "A", "A", "A", "A",
"A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B", "B", "B",
"B", "B", "B", "B"), response = c(151.27333548, 162.3933313,
159.2199999, 159.16666725, 210.82, 204.18666667, 196.97333333,
194.54666667, 154.18666667, 194.99333333, 193.48, 191.71333333,
124.1, 109.32666667, 105.32, 102.22, 110.83333333, 114.66666667,
110.54, 107.82, 105.62000069, 79.79999821, 77.58666557, 75.78666928
), experimental_unit = c("A-1", "A-1", "A-1", "A-1", "A-2", "A-2",
"A-2", "A-2", "A-3", "A-3", "A-3", "A-3", "B-1", "B-1", "B-1",
"B-1", "B-2", "B-2", "B-2", "B-2", "B-3", "B-3", "B-3", "B-3"
)), row.names = c(NA, -24L), class = c("tbl_df", "tbl", "data.frame"
))
### Option 1 - Treat the experimental unit as a random effect since there are
### 4 repeat observations for the same unit
lme4::lmer(response ~ treatment + (1 | experimental_unit), data = tmp) %>%
emmeans::emmeans(., ~ treatment) %>%
as.data.frame()
#> treatment emmean SE df lower.CL upper.CL
#> 1 A 181.0794 10.83359 4 151.00058 211.1583
#> 2 B 101.9683 10.83359 4 71.88947 132.0472
#ggplot(.,aes(treatment, emmean)) +
#geom_pointrange(aes(ymin = lower.CL, ymax = upper.CL))
### Option 2 - instead of treating the unit as random effect, we average over the
### 4 repeat observations, and run a simple linear model
tmp %>%
group_by(experimental_unit) %>%
summarise(mean_response = mean(response)) %>%
mutate(treatment = c(rep("A", 3), rep("B", 3))) %>%
lm(mean_response ~ treatment, data = .) %>%
emmeans::emmeans(., ~ treatment) %>%
as.data.frame()
#> treatment emmean SE df lower.CL upper.CL
#> 1 A 181.0794 10.83359 4 151.00058 211.1583
#> 2 B 101.9683 10.83359 4 71.88947 132.0472
#ggplot(., aes(treatment, emmean)) +
#geom_pointrange(aes(ymin = lower.CL, ymax = upper.CL))
### Whether we include a random effect for the unit, or average over it and THEN model it, we find no difference in the
### marginal means for the treatments
### How do we incorporate the variation of the repeat measurements into the marginal means of the treatments?
### Do we then ignore the variation in the 'subsamples' and simply average over them PRIOR to modeling?
```

<sup>Created on 2021-07-31 by the [reprex package](https://reprex.tidyverse.org) (v2.0.0)</sup>
emmeans() does take into account the errors of random effects. This is what I get when I remove the complex sequences of pipes:
> mmod = lme4::lmer(response ~ treatment + (1 | experimental_unit), data = tmp)
> emmeans(mmod, "treatment")
treatment emmean SE df lower.CL upper.CL
A 181 10.8 4 151.0 211
B 102 10.8 4 71.9 132
Degrees-of-freedom method: kenward-roger
Confidence level used: 0.95
This is the same as shown in the question. If I fit a fixed-effects model that accounts for experimental units as a fixed effect, I get:
> fmod = lm(response ~ treatment + experimental_unit, data = tmp)
> emmeans(fmod, "treatment")
NOTE: A nesting structure was detected in the fitted model:
experimental_unit %in% treatment
treatment emmean SE df lower.CL upper.CL
A 181 3.25 18 174.2 188
B 102 3.25 18 95.1 109
Results are averaged over the levels of: experimental_unit
Confidence level used: 0.95
The SEs of the latter results are considerably lower, and that is because the random variations in experimental_unit are modeled as fixed variations.
Apparently the piping you did accounts for the variation of the random effects and includes those in the EMMs. I think that is because you did things separately for each experimental unit and somehow combined those results. I'm not very comfortable with a sequence of pipes that is 7 steps long, and I don't understand why that results in just one set of means.
I recommend against the as.data.frame() at the end. That zaps out annotations that can be helpful in understanding what you have. If you are doing it to get more digits of precision, I'll claim those are digits you don't need; they just exaggerate the precision you are entitled to claim.
Notes on some follow-up comments
Subsequently, I am convinced that the piped operations in the second part of the OP do indeed amount to computing the mean of each EU, then analyzing those means.
Let's look at that in the context of the formal model. We have (sorry MathJax doesn't work on stackoverflow, but I'll leave the markup there anyway)
$$ Y_{ijk} = \mu + \tau_i + U_{ij} + E_{ijk} $$
where $Y_{ijk}$ is the kth response measurement on the jth EU in the ith treatment, and the rhs terms represent, respectively, the overall mean, the (fixed) treatment effects, the (random) EU effects, and the (random) error effects. We assume the random effects are all mutually independent. With a balanced design, the EMMs are just the marginal means:
$$ \bar Y_{i..} = \mu + \tau_i + \bar U_{i.} + \bar E_{i..} $$
where a '.' subscript means we averaged over that subscript. If there are n EUs per treatment and m measurements on each EU, we get that
$$ Var(\bar Y_{i..}) = \sigma^2_U / n + \sigma^2_E / (mn) $$
Now, if we aggregate the data on EUs ahead of time, we are starting with
$$ \bar Y_{ij.} = \mu + \tau_i + U_{ij} + \bar E_{ij.} $$
However, if we then compute marginal means by averaging over j, we get exactly the same thing as we did before with $\bar Y_{i..}$, and the variance is exactly as already shown. That is why it doesn't matter if we aggregated first or not.
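As a quick sanity check on that variance formula, here is a small simulation sketch (in Python; the sigma values are made up for illustration, while n and m match the design in tmp):

```python
import numpy as np

# Monte-Carlo check of Var(Ybar_i..) = sigma_U^2 / n + sigma_E^2 / (m * n)
rng = np.random.default_rng(0)
n, m = 3, 4                   # EUs per treatment, measurements per EU
sigma_U, sigma_E = 5.0, 2.0   # hypothetical standard deviations
reps = 200_000

U = rng.normal(0.0, sigma_U, size=(reps, n))      # random EU effects
E = rng.normal(0.0, sigma_E, size=(reps, n, m))   # random error effects
Ybar = (U[:, :, None] + E).mean(axis=(1, 2))      # marginal mean of one treatment

print(Ybar.var())                                 # simulated variance
print(sigma_U**2 / n + sigma_E**2 / (m * n))      # value from the formula
```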

statistical test to compare 1st/2nd differences based on output from ggpredict / ggeffect

I want to conduct a simple two sample t-test in R to compare marginal effects that are generated by ggpredict (or ggeffect).
Both ggpredict and ggeffect provide nice output: (1) a table (predicted probabilities / standard errors / CIs) and (2) a plot. However, they do not provide p-values for assessing the statistical significance of the marginal effects (i.e., is the difference between two predicted probabilities different from zero?). Further, since I'm working with interaction effects, I'm also interested in two-sample t-tests for the first differences (between two marginal effects) and the second differences.
Is there an easy way to run the relevant t-tests with ggpredict/ggeffect output? Other options?
Attaching:
- reprex code with fictitious data
- To be specific, I want to test the following first differences:
--> .67 - .33 = .34 (different from zero?)
--> .5 - .5 = 0 (different from zero?)
...and the following second difference:
--> 0 - .34 = -.34 (different from zero?)
See also Figure 12 / Table 3 in Mize 2019 (interaction effects in nonlinear models)
Thanks Scott
library(mlogit)
#> Loading required package: dfidx
#>
#> Attaching package: 'dfidx'
#> The following object is masked from 'package:stats':
#>
#> filter
library(sjPlot)
library(ggeffects)
# create ex. data set. 1 row per respondent (dataset shows 2 resp). Each resp answers 3 choice sets, w/ 2 alternatives in each set.
cedata.1 <- data.frame( id = c(1,1,1,1,1,1,2,2,2,2,2,2), # respondent ID.
QES = c(1,1,2,2,3,3,1,1,2,2,3,3), # Choice set (with 2 alternatives)
Alt = c(1,2,1,2,1,2,1,2,1,2,1,2), # Alt 1 or Alt 2 in choice set
LOC = c(0,0,1,1,0,1,0,1,1,0,0,1), # attribute describing alternative. binary categorical variable
SIZE = c(1,1,1,0,0,1,0,0,1,1,0,1), # attribute describing alternative. binary categorical variable
Choice = c(0,1,1,0,1,0,0,1,0,1,0,1), # if alternative is Chosen (1) or not (0)
gender = c(1,1,1,1,1,1,0,0,0,0,0,0) # male or female (repeats for each individual)
)
# convert dep var Choice to factor as required by sjPlot
cedata.1$Choice <- as.factor(cedata.1$Choice)
cedata.1$LOC <- as.factor(cedata.1$LOC)
cedata.1$SIZE <- as.factor(cedata.1$SIZE)
# estimate model.
glm.model <- glm(Choice ~ LOC*SIZE, data=cedata.1, family = binomial(link = "logit"))
# estimate MEs for use in IE assessment
LOC.SIZE <- ggpredict(glm.model, terms = c("LOC", "SIZE"))
LOC.SIZE
#>
#> # Predicted probabilities of Choice
#> # x = LOC
#>
#> # SIZE = 0
#>
#> x | Predicted | SE | 95% CI
#> -----------------------------------
#> 0 | 0.33 | 1.22 | [0.04, 0.85]
#> 1 | 0.50 | 1.41 | [0.06, 0.94]
#>
#> # SIZE = 1
#>
#> x | Predicted | SE | 95% CI
#> -----------------------------------
#> 0 | 0.67 | 1.22 | [0.15, 0.96]
#> 1 | 0.50 | 1.00 | [0.12, 0.88]
#> Standard errors are on the link-scale (untransformed).
# plot
# plot(LOC.SIZE, connect.lines = TRUE)
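Not a ggpredict-based answer, but for what it's worth: on the log-odds (link) scale, these first and second differences are linear contrasts of the interaction model, so they can be tested directly from the fitted model. A rough sketch in Python with statsmodels and the same fictitious data (probability-scale differences would additionally require the delta method):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Same fictitious data as above, with LOC and SIZE kept as 0/1 numeric codes.
ce = pd.DataFrame({
    "Choice": [0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1],
    "LOC":    [0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1],
    "SIZE":   [1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1],
})

m = smf.logit("Choice ~ LOC * SIZE", data=ce).fit()

# First difference of LOC at SIZE = 0 and at SIZE = 1, then the second
# difference, all on the log-odds scale.
print(m.t_test("LOC"))             # LOC effect when SIZE = 0
print(m.t_test("LOC + LOC:SIZE"))  # LOC effect when SIZE = 1
print(m.t_test("LOC:SIZE"))        # second difference (interaction term)
```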

Option to cut values below a threshold in papaja::apa_table

I can't figure out how to selectively print, in a table, only the values above or below some threshold. What I'm looking for is known as "cut" in Revelle's psych package. MWE below.
library("psych")
library("psychTools")
derp <- fa(ability, nfactors=3)
print(derp, cut=0.5) #removes all loadings smaller than 0.5
derp <- print(derp, cut=0.5) #apa_table still doesn't print like this
Question is, how do I add that cut to an apa_table? Printing apa_table(derp) prints the entire table, including all values.
The print-method from psych does not return the formatted loadings but only the table of variance accounted for. You can, however, get the result you want by manually formatting the loadings table:
library("psych")
library("psychTools")
derp <- fa(ability, nfactors=3)
# Class `loadings` cannot be coerced to data.frame or matrix
class(derp$Structure)
[1] "loadings"
# Class `matrix` is supported by apa_table()
derp_loadings <- unclass(derp$Structure)
class(derp_loadings)
[1] "matrix"
# Remove values below "cut"
derp_loadings[derp_loadings < 0.5] <- NA
colnames(derp_loadings) <- paste("Factor", 1:3)
apa_table(
  derp_loadings
  , caption = "Factor loadings"
  , added_stub_head = "Item"
  , format = "pandoc"  # Omit this in your R Markdown document
  , format.args = list(na_string = "")  # Don't print NA
)
*Factor loadings*
Item Factor 1 Factor 2 Factor 3
---------- --------- --------- ---------
reason.4 0.60
reason.16
reason.17 0.65
reason.19
letter.7 0.61
letter.33 0.56
letter.34 0.65
letter.58
matrix.45
matrix.46
matrix.47
matrix.55
rotate.3 0.70
rotate.4 0.73
rotate.6 0.63
rotate.8 0.63

How to reproduce a linear regression done via pseudo inverse in pytorch

I am trying to reproduce a simple linear regression, x = A†b, using PyTorch, but I get completely different numbers.
So first I use plain numpy and do
A_pinv = np.linalg.pinv(A)
betas = A_pinv.dot(b)
print(((b - A.dot(betas))**2).mean())
print(betas)
which results in:
364.12875
[0.43196774 0.14436531 0.42414093]
Now I try to get similar enough numbers using pytorch:
# re-implement via a pytorch model using built-ins
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import TensorDataset, DataLoader
from tqdm import trange

# We'll create a TensorDataset, which allows access to rows from inputs and targets as tuples.
# We'll also create a DataLoader, to split the data into batches while training.
# It also provides other utilities like shuffling and sampling.
inputs = torch.from_numpy(A)
targets = torch.from_numpy(b)
train_ds = TensorDataset(inputs, targets)
batch_size = 5
train_dl = DataLoader(train_ds, batch_size, shuffle=True)

# define model, loss and optimizer
source_variables = A.shape[1]   # 3 input features
predict_variables = 1           # single response
new_model = nn.Linear(source_variables, predict_variables, bias=False)
loss_fn = F.mse_loss
opt = torch.optim.SGD(new_model.parameters(), lr=1e-10)

def fit(num_epochs, new_model, loss_fn, opt):
    for epoch in trange(num_epochs, desc="epoch"):
        for xb, yb in train_dl:
            # Generate predictions
            pred = new_model(xb)
            loss = loss_fn(pred, yb)
            # Perform gradient descent
            loss.backward()
            opt.step()
            opt.zero_grad()
        if epoch % 1000 == 0:
            print((new_model.weight, loss))
    print('Training loss: ', loss_fn(new_model(inputs), targets))

# fit the model
fit(10000, new_model, loss_fn, opt)
It prints as the last result:
tensor([[0.0231, 0.5185, 0.4589]], requires_grad=True), tensor(271.8525, grad_fn=<MseLossBackward>))
Training loss: tensor(378.2871, grad_fn=<MseLossBackward>)
As you can see these numbers are completely different so I must have made a mistake somewhere ...
Here are the numbers for A and b to reproduce the result:
A = np.array([[2822.48, 2808.48, 2810.92],
              [2832.94, 2822.48, 2808.48],
              [2832.57, 2832.94, 2822.48],
              [2824.23, 2832.57, 2832.94],
              [2854.88, 2824.23, 2832.57],
              [2800.71, 2854.88, 2824.23],
              [2798.36, 2800.71, 2854.88],
              [2818.46, 2798.36, 2800.71],
              [2805.37, 2818.46, 2798.36],
              [2815.44, 2805.37, 2818.46]], dtype=np.float32)
b = np.array([2832.94, 2832.57, 2824.23, 2854.88, 2800.71, 2798.36, 2818.46, 2805.37, 2815.44, 2834.4], dtype=np.float32)
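For reference, the pseudo-inverse solution itself can also be computed directly in torch, which should match the numpy numbers up to float32 precision (a sketch, assuming a PyTorch version that has the torch.linalg namespace):

```python
import torch

A_t = torch.from_numpy(A)
b_t = torch.from_numpy(b)

# Closed-form least-squares solution via the pseudo-inverse (no SGD involved).
betas_t = torch.linalg.pinv(A_t) @ b_t
print(((b_t - A_t @ betas_t) ** 2).mean())
print(betas_t)
```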

Tensorflow: Cannot interpret feed_dict key as Tensor

I am trying to build a neural network model with one hidden layer (1024 nodes). The hidden layer is nothing but a relu unit. I am also processing the input data in batches of 128.
The inputs are images of size 28 * 28. In the following code, I get the error at the line
_, c = sess.run([optimizer, loss], feed_dict={x: batch_x, y: batch_y})
Error: TypeError: Cannot interpret feed_dict key as Tensor: Tensor Tensor("Placeholder_64:0", shape=(128, 784), dtype=float32) is not an element of this graph.
Here is the code I have written
#Initialize
batch_size = 128
layer1_input = 28 * 28
hidden_layer1 = 1024
num_labels = 10
num_steps = 3001

#Create neural network model
def create_model(inp, w, b):
    layer1 = tf.add(tf.matmul(inp, w['w1']), b['b1'])
    layer1 = tf.nn.relu(layer1)
    layer2 = tf.matmul(layer1, w['w2']) + b['b2']
    return layer2

#Initialize variables
x = tf.placeholder(tf.float32, shape=(batch_size, layer1_input))
y = tf.placeholder(tf.float32, shape=(batch_size, num_labels))
w = {
    'w1': tf.Variable(tf.random_normal([layer1_input, hidden_layer1])),
    'w2': tf.Variable(tf.random_normal([hidden_layer1, num_labels]))
}
b = {
    'b1': tf.Variable(tf.zeros([hidden_layer1])),
    'b2': tf.Variable(tf.zeros([num_labels]))
}
init = tf.initialize_all_variables()
train_prediction = tf.nn.softmax(model)
tf_valid_dataset = tf.constant(valid_dataset)
tf_test_dataset = tf.constant(test_dataset)
model = create_model(x, w, b)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(model, y))
optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

#Process
with tf.Session(graph=graph1) as sess:
    tf.initialize_all_variables().run()
    total_batch = int(train_dataset.shape[0] / batch_size)
    for epoch in range(num_steps):
        loss = 0
        for i in range(total_batch):
            batch_x, batch_y = train_dataset[epoch * batch_size:(epoch+1) * batch_size, :], train_labels[epoch * batch_size:(epoch+1) * batch_size,:]
            _, c = sess.run([optimizer, loss], feed_dict={x: batch_x, y: batch_y})
            loss = loss + c
        loss = loss / total_batch
        if epoch % 500 == 0:
            print ("Epoch :", epoch, ". cost = {:.9f}".format(avg_cost))
            print("Minibatch accuracy: %.1f%%" % accuracy(predictions, batch_labels))
            valid_prediction = tf.run(tf_valid_dataset, {x: tf_valid_dataset})
            print("Validation accuracy: %.1f%%" % accuracy(valid_prediction.eval(), valid_labels))
    test_prediction = tf.run(tf_test_dataset, {x: tf_test_dataset})
    print("TEST accuracy: %.1f%%" % accuracy(test_prediction.eval(), test_labels))
This worked for me:
from keras import backend as K
After predicting on my data, I inserted this line of code and then loaded the model again:
K.clear_session()
I faced this problem on a production server, but on my PC it was running fine. This is what fixed it:
from keras import backend as K
# Before prediction
K.clear_session()
# After prediction
K.clear_session()
Variable x is not in the same graph as model; try to define all of these in the same graph scope. For example:
# define a graph
graph1 = tf.Graph()
with graph1.as_default():
    # placeholder
    x = tf.placeholder(...)
    y = tf.placeholder(...)

    # create model
    model = create(x, w, b)

with tf.Session(graph=graph1) as sess:
    # initialize all the variables
    sess.run(init)
    # then feed_dict
    # ......
If you use the Django development server, just run runserver with --nothreading. For example:
python manage.py runserver --nothreading
I had the same issue with Flask. Adding the --without-threads flag to flask run, or threaded=False to app.run(), fixed it.
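For the app.run() variant, that just means something like the following (a sketch with a placeholder app):

```python
from flask import Flask

app = Flask(__name__)

if __name__ == "__main__":
    # Serve requests in a single thread, so the TensorFlow graph/session is
    # always used from the same thread that created it.
    app.run(threaded=False)
```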
In my case, I was calling the CNN multiple times inside a loop; I fixed my problem by doing the following:
# Declare this as global:
global graph
graph = tf.get_default_graph()

# Then, just before you call your model, use this
with graph.as_default():
    # call your model here
Note: In my case too, the app ran fine for the first time and then gave the error above. Using the above fix solved the problem.
Hope that helps.
The error message TypeError: Cannot interpret feed_dict key as Tensor: Tensor Tensor("...", dtype=dtype) is not an element of this graph can also arise in case you run a session outside of the scope of its with statement. Consider:
with tf.Session() as sess:
    sess.run(logits, feed_dict=feed_dict)

sess.run(logits, feed_dict=feed_dict)
If logits and feed_dict are defined properly, the first sess.run command will execute normally, but the second will raise the mentioned error.
You can also experience this while working on notebooks hosted on online learning platforms like Coursera. Implementing the following code in the topmost block of the notebook file could help get past the issue:
from keras import backend as K
K.clear_session()
Similar to @javan-peymanfard and @hmadali-shafiee, I ran into this issue when loading the model in an API. I was using FastAPI with uvicorn. To fix the issue, I just set the API function definitions to async, similar to this:
@app.post('/endpoint_name')
async def endpoint_function():
    # Do stuff here, including possibly (re)loading the model