How to check model assumptions after svy: etregress in Stata (postestimation)

When I fit an endogenous treatment-effects model with svy: etregress on survey data in Stata, a number of diagnostics and postestimation commands are no longer available.
svy: etregress logwage i.race gender, treat(training = i.education gender)
--------------------------------------------------------------------------------------------------
                                 |             Linearized
                                 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
---------------------------------+----------------------------------------------------------------
logwage                          |
                            race |
                African American |   .3891554   .0031105    12.20   0.000     .2000000    .8474752
                  Asian American |   .1487310   .0002843    04.11   0.000      .027113    .8765290
                                 |
                          gender |
                          female |  -.0230411    .010445    -6.85   0.000     -.115341   -.0107295
                                 |
                      1.training |   .3703371   .0451778    10.61   0.000     .2018037    .4186134
---------------------------------+----------------------------------------------------------------
training                         |
                     i.education |
                      Highschool |  -.0715731   .0490565     1.28   0.098    -.1106579    .1291781
                         College |   .1271380   .0401052     3.95   0.003     .0329516    .2107563
                     Grad School |   .8522143   .0085337     8.99   0.000     .8271381    .9573284
                                 |
                          gender |
                          female |   .0127444   .0100058     5.33   0.041     .0100558    .0866312
                           _cons |  -1.260083   .0327235   -26.12   0.000    -1.531405   -1.098524
---------------------------------+----------------------------------------------------------------
                         /athrho |   .0051552    .031410     0.17   0.827    -.0722533    .0810246
                        /lnsigma |  -1.872551   .0166818   -73.50   0.000    -1.928624   -1.278064
---------------------------------+----------------------------------------------------------------
                             rho |   .0084120   .0421116                     -.0649947    .0888529
                           sigma |   .4000831   .0038170                      .1925127    .5067780
                          lambda |   .0012673   .0226365                     -.0324029
With this model, the usual checks for a linear model (linearity, independence, homoscedasticity, normality, and goodness-of-fit diagnostics) do not produce any output.
A residuals-versus-predicted-values plot would normally come from rvfplot, but this gives the error:
last estimates not found
Trying estat gof gives
invalid subcommand gof
and the same happens with estat hettest.
help etregress postestimation
does not discuss model assumption tests or goodness-of-fit tests, which we normally see with regress or log-linear models in Stata.
When I try predict residual or predict rstudent, nothing is reported, so plotting is again not possible.
Here is a reproducible example of the problem, adapted from an example given by others:
webuse nhanes2f, clear
qui svyset psuid [pweight=finalwgt], strata(stratid)
qui svy: etregress loglead i.female i.diabetes, treat(diabetes = weight age height i.female) // coefl
nlcom pct_eff:(100*(exp(_b[loglead:1.female])-1))
Here, too, etregress is used with a log-transformed dependent variable and a treatment equation. After fitting a model like this, how do we check the assumptions and goodness of fit?

Related

What is | in diff output?

I am diff'ing two .md5 files with diff -yb --suppress-common-lines --width=250
and getting
0b92397d4978b7b5ba1ae2d4be0ca639 1.__Ms._Eldoris_McCondichie_3-5-04-Apple_ProRes_422_for_Interlaced_material_copy | 66c0a190ccf79e6ca1e34b86bfe89788 1. Ms. Eldoris McCondichie 3:5:04.mov
66c0a190ccf79e6ca1e34b86bfe89788 1.__Ms._Eldoris_McCondichie_3-5-04.mov | 7519e0c6d5f8f56be15b5c6eb82f1678 10. Mr. Clyde Eddy.mov
9ca5150c4f399ad58aae1e4a4a97f809 10._Mr._Clyde_Eddy-Apple_ProRes_422_for_Interlaced_material_copy.mov | 1eedc2c35fdbbc033dbc7bea8b3e4c0d 11. Ms. Jewel Smitherman Rogers 1a w TC.mov
7519e0c6d5f8f56be15b5c6eb82f1678 10._Mr._Clyde_Eddy.mov | 535acdb76a8f56f50bad418c8ff18ec7 12. Ms. Jewel Smitherman Rogers 1b w TC.mov
11411fac492a50e7692669849281dd8a 11._Ms._Jewel_Smitherman_Rogers_1a_w_TC-Apple_ProRes_422_for_Interlaced_material | 2f5d941751409dfd407e8f181aa745c6 13. Ms. Jewel Smitherman Rogers 2 w TC.mov
1eedc2c35fdbbc033dbc7bea8b3e4c0d 11._Ms._Jewel_Smitherman_Rogers_1a_w_TC.mov | 4a0c342444fedb2096c8495fae5e1459 14. Ms. Thelma Knight w TC.mov
acf43bd1b507f1370238dd9d7855f177 12._Ms._Jewel_Smitherman_Rogers_1b_w_TC-Apple_ProRes_422_for_Interlaced_material | 56b0d7952f01b48f47d90a5c300411ef 15. Mr. Robert Holloway w TC.mov
535acdb76a8f56f50bad418c8ff18ec7 12._Ms._Jewel_Smitherman_Rogers_1b_w_TC.mov | 3117d375927b0032f7b804d1f272f97a 16. Mr. Archie Franklin w TC.mov
1e2c9a47ef1ae1869e35e1c439af054f 13._Ms._Jewel_Smitherman_Rogers_2_w_TC-Apple_ProRes_422_for_Interlaced_material_ | 552230c7e504e0f3fa819a46b8169bd4 17. Mr. John Hope Franklin by Charles Ogletree.mov
2f5d941751409dfd407e8f181aa745c6 13._Ms._Jewel_Smitherman_Rogers_2_w_TC.mov | 933b65571efefd2ea1e642f478f4cc94 18. Dr. Olivia Hooker - Congress 03:07.mov
86841f59d7660a99feb5d3ce65c827a0 14._Ms._Thelma_Knight_w_TC-Apple_ProRes_422_for_Interlaced_material_copy.mov | 5fdea1ec2667544cb59e0a4b5b377092 19. Mr. John Hope Franklin - Congress 03:07.mov
4a0c342444fedb2096c8495fae5e1459 14._Ms._Thelma_Knight_w_TC.mov | 83a52017b776abd0a0c82e3866ac08b5 2. Dr. Olivia Hooker 11:16:04.mov
d6853f5f8fda9f130073fb3e3dbf16e6 15._Mr._Robert_Holloway_w_TC-Apple_ProRes_422_for_Interlaced_material_copy.mov | 3611978acf5e564efe7567262b9960f8 20. Bill O'Brian - Historian.mov
56b0d7952f01b48f47d90a5c300411ef 15._Mr._Robert_Holloway_w_TC.mov | 09108415da82b24c114f3c135db611eb 21. John Rogers - Descendant.mov
eafa27fce895e52bc7b6071668e0c10f 16._Mr._Archie_Franklin_w_TC-Apple_ProRes_422_for_Interlaced_material_copy.mov | 1d74cad5bf3e0887d31c4df709b91957 22. Ms. Eddie Faye Gates - Historian 3:4:04.mov
3117d375927b0032f7b804d1f272f97a 16._Mr._Archie_Franklin_w_TC.mov | d6a5dbc85ed5b60b4fc678ba9ad9672d 23. Scott Elsworth - Historian.mov
3969d4eb70bf9595bb7a3bac283fda28 17._Mr.__John_Hope_Franklin_by_Charles_Ogletree-Apple_ProRes_422_for_Interlaced_ | 529a6ad514dca022836f43a780f8d1dd 24. Dr. Olivia Hooker MV 8:07.mov
552230c7e504e0f3fa819a46b8169bd4 17._Mr.__John_Hope_Franklin_by_Charles_Ogletree.mov | 1ca96afc9135bc73e2be7f6702469416 25. Mr. Otis Clark MV 8:07.mov
7947f017d3e0dc194dd3085c2474583f 18._Dr._Olivia_Hooker_-_Congress_03-07-Apple_ProRes_422_for_Interlaced_material_ | 2bf6c9f2cc68f7e8d54b89d9585d90d9 26. Mr. Wes Young MV 8:07.mov
933b65571efefd2ea1e642f478f4cc94 18._Dr._Olivia_Hooker_-_Congress_03-07.mov | b609627fef5619d4d735f46352d0effb 27. Survivors Supreme Court 3:9:05.mov
eb837afbd180e1c9217b5cb02ca69849 19._Mr._John_Hope_Franklin_-_Congress_03-07-Apple_ProRes_422_for_Interlaced_mate | 91c74d7902d1bd4895e77c64fd163d7d 3. Ms. Jimmie Lily Franklin 7:8:04.mov
5fdea1ec2667544cb59e0a4b5b377092 19._Mr._John_Hope_Franklin_-_Congress_03-07.mov | ac6c3ac0437a93425f1b6be05c9a4914 4. Mr. Otis Clark 3:5:04.mov
1968980fe9a3d19650fa7c3ec5507a2e 2.__Dr._Olivia_Hooker_11-16-04-Apple_ProRes_422_for_Interlaced_material_copy.mov | 978f5771c0c6a821124387ce029983c8 5. Ms. Juanita Arnold 03:04.mov
83a52017b776abd0a0c82e3866ac08b5 2.__Dr._Olivia_Hooker_11-16-04.mov | b9fc1244ffbd733f4de4e014e5bef209 6. Ms. Eulis Jackson 03:04.mov
ada33ffb5e5e2083856f52b5709f1b31 20._Bill_O_Brian_-_Historian-Apple_ProRes_422_for_Interlaced_material_copy.mov | f7e5822a03f8bad047b701fd5c2c6704 7. Mr. Wes Young 3:25:04.mov
3611978acf5e564efe7567262b9960f8 20._Bill_O_Brian_-_Historian.mov | 1fc6342a34e1efc6b6c99225543a2900 8. Mr. J.B. Bates 3:5:04.mov
94e6f75688705b0e0a1b73e61cb367d7 21._John_Rogers_-_Descendant-Apple_ProRes_422_for_Interlaced_material_copy.mov | b760f9ef2c13fbdf17c56001860176b6 9. Ms. Beatrice Campbell Webster.mov
09108415da82b24c114f3c135db611eb 21._John_Rogers_-_Descendant.mov <
9345c8828f917026d88592d095190556 22._Ms._Eddie_Faye_Gates_-_Historian_3-4-04-Apple_ProRes_422_for_Interlaced_mate <
1d74cad5bf3e0887d31c4df709b91957 22._Ms._Eddie_Faye_Gates_-_Historian_3-4-04.mov <
65772091bcc8ebc75cca81a2fa29ecf4 23._Scott_Elsworth_-_Historian-Apple_ProRes_422_for_Interlaced_material_copy.mov <
d6a5dbc85ed5b60b4fc678ba9ad9672d 23._Scott_Elsworth_-_Historian.mov <
dd53a3132b03327ccf1660905aca1884 24._Dr._Olivia_Hooker_MV_8-07-Apple_ProRes_422_for_Interlaced_material_copy.mov <
529a6ad514dca022836f43a780f8d1dd 24._Dr._Olivia_Hooker_MV_8-07.mov <
f75eb9a79e4e4fccf53d0463a1c7d520 25._Mr._Otis_Clark_MV_8-07-Apple_ProRes_422_for_Interlaced_material_copy.mov <
1ca96afc9135bc73e2be7f6702469416 25._Mr._Otis_Clark_MV_8-07.mov <
f992aa8cf233d306fc7cea0a32d1d1a6 26._Mr._Wes_Young__MV_8-07-Apple_ProRes_422_for_Interlaced_material_copy.mov <
2bf6c9f2cc68f7e8d54b89d9585d90d9 26._Mr._Wes_Young__MV_8-07.mov <
10f38ecde7590255e88529e9ba41cd06 27._Survivors_Supreme_Court_3-9-05-Apple_ProRes_422_for_Interlaced_material_copy <
b609627fef5619d4d735f46352d0effb 27._Survivors_Supreme_Court_3-9-05.mov <
6d566d7dbbf79652a6545e89c8a0e5a6 3.__Ms._Jimmie_Lily_Franklin_7-8-04-Apple_ProRes_422_for_Interlaced_material_cop <
91c74d7902d1bd4895e77c64fd163d7d 3.__Ms._Jimmie_Lily_Franklin_7-8-04.mov <
What is the | symbol telling me? I can't find that anywhere in the diff documentation. Some of the lines match on checksums, yes, but the filenames are slightly different. For other lines, both the checksums and filenames only exist in the first file, not the second, yet in both instances the | is used.
Further down I get the < symbol, which I understand.
I don't get it.
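For context: in GNU diff's side-by-side format (-y), the gutter character | marks a pair of lines that differ, < marks a line found only in the first file, and > a line found only in the second. diff compares whole lines, so a line with the same checksum but a different filename still counts as different; when no line matches at all, the two files are paired off row by row with |, and the leftover rows of the longer file get <. A minimal sketch, assuming each line has the shape "checksum, space, filename", compares the two lists by checksum instead. The function names are hypothetical, and the sample lines are taken from the output above.

```python
def parse_md5_lines(lines):
    """Map checksum -> filename for lines shaped like '<md5> <filename>'.

    Assumption: text-mode md5sum output. Duplicate checksums within one
    file would overwrite each other here.
    """
    entries = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        checksum, _, name = line.partition(" ")
        entries[checksum] = name.strip()
    return entries


def compare_by_checksum(left_lines, right_lines):
    """Return (checksums in both, only in left, only in right), each sorted."""
    left = parse_md5_lines(left_lines)
    right = parse_md5_lines(right_lines)
    both = sorted(set(left) & set(right))
    only_left = sorted(set(left) - set(right))
    only_right = sorted(set(right) - set(left))
    return both, only_left, only_right


left = [
    "0b92397d4978b7b5ba1ae2d4be0ca639 1.__Ms._Eldoris_McCondichie_3-5-04-Apple_ProRes_422_for_Interlaced_material_copy",
    "66c0a190ccf79e6ca1e34b86bfe89788 1.__Ms._Eldoris_McCondichie_3-5-04.mov",
]
right = [
    "66c0a190ccf79e6ca1e34b86bfe89788 1. Ms. Eldoris McCondichie 3:5:04.mov",
]

both, only_left, only_right = compare_by_checksum(left, right)
print("same content:", both)
print("only in first file:", only_left)
print("only in second file:", only_right)
```

Matching on the checksum alone ignores the renamed files (underscores versus spaces and colons), which is what makes every line look different to diff.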

statistical test to compare 1st/2nd differences based on output from ggpredict / ggeffect

I want to conduct a simple two sample t-test in R to compare marginal effects that are generated by ggpredict (or ggeffect).
Both ggpredict and ggeffect provide nice outputs: (1) a table (predicted probabilities / standard errors / CIs) and (2) a plot. However, they do not provide p-values for assessing the statistical significance of the marginal effects (i.e., is the difference between the two predicted probabilities different from zero?). Further, since I’m working with interaction effects, I'm also interested in two-sample t-tests for the first differences (between two marginal effects) and the second differences.
Is there an easy way to run the relevant t tests with ggpredict/ggeffect output? Other options?
Attaching:
- reprex code with fictitious data
- To be specific, I want to test the following first differences:
--> .67 - .33 = .34 (different from zero?)
--> .5 - .5 = 0 (different from zero?)
...and the following second difference:
--> 0 - .34 = -.34 (different from zero?)
See also Figure 12 / Table 3 in Mize 2019 (interaction effects in nonlinear models).
Thanks, Scott
library(mlogit)
#> Loading required package: dfidx
#>
#> Attaching package: 'dfidx'
#> The following object is masked from 'package:stats':
#>
#> filter
library(sjPlot)
library(ggeffects)
# create ex. data set. 1 row per respondent (dataset shows 2 resp). Each resp answers 3 choice sets, w/ 2 alternatives in each set.
cedata.1 <- data.frame(
  id     = c(1,1,1,1,1,1,2,2,2,2,2,2), # respondent ID
  QES    = c(1,1,2,2,3,3,1,1,2,2,3,3), # choice set (with 2 alternatives)
  Alt    = c(1,2,1,2,1,2,1,2,1,2,1,2), # Alt 1 or Alt 2 in choice set
  LOC    = c(0,0,1,1,0,1,0,1,1,0,0,1), # attribute describing alternative (binary categorical variable)
  SIZE   = c(1,1,1,0,0,1,0,0,1,1,0,1), # attribute describing alternative (binary categorical variable)
  Choice = c(0,1,1,0,1,0,0,1,0,1,0,1), # whether the alternative is chosen (1) or not (0)
  gender = c(1,1,1,1,1,1,0,0,0,0,0,0)  # male or female (repeats for each individual)
)
# convert dep var Choice to factor as required by sjPlot
cedata.1$Choice <- as.factor(cedata.1$Choice)
cedata.1$LOC <- as.factor(cedata.1$LOC)
cedata.1$SIZE <- as.factor(cedata.1$SIZE)
# estimate model.
glm.model <- glm(Choice ~ LOC*SIZE, data=cedata.1, family = binomial(link = "logit"))
# estimate MEs for use in IE assessment
LOC.SIZE <- ggpredict(glm.model, terms = c("LOC", "SIZE"))
LOC.SIZE
#>
#> # Predicted probabilities of Choice
#> # x = LOC
#>
#> # SIZE = 0
#>
#> x | Predicted | SE | 95% CI
#> -----------------------------------
#> 0 | 0.33 | 1.22 | [0.04, 0.85]
#> 1 | 0.50 | 1.41 | [0.06, 0.94]
#>
#> # SIZE = 1
#>
#> x | Predicted | SE | 95% CI
#> -----------------------------------
#> 0 | 0.67 | 1.22 | [0.15, 0.96]
#> 1 | 0.50 | 1.00 | [0.12, 0.88]
#> Standard errors are on the link-scale (untransformed).
# plot
# plot(LOC.SIZE, connect.lines = TRUE)
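One way to put numbers on the first and second differences is a Wald z-test on the probability scale. The sketch below is Python rather than R and rests on a strong assumption: because Choice ~ LOC*SIZE is saturated for two binary factors, each predicted probability equals the observed cell proportion, the four cell estimates are independent, and the delta-method variance of each is p(1-p)/n. For a non-saturated model the full coefficient covariance matrix would be needed instead. The cell counts are tallied from the fictitious data above; with 12 rows the normal approximation is rough at best, and all function names are hypothetical.

```python
import math

# (successes, trials) per LOC x SIZE cell, tallied from the fictitious data
cells = {
    ("LOC=0", "SIZE=0"): (1, 3),
    ("LOC=1", "SIZE=0"): (1, 2),
    ("LOC=0", "SIZE=1"): (2, 3),
    ("LOC=1", "SIZE=1"): (2, 4),
}


def cell_estimate(successes, trials):
    """Cell proportion and its variance p(1-p)/n (saturated-model assumption)."""
    p = successes / trials
    return p, p * (1 - p) / trials


def wald_test(estimate, variance):
    """Standard error, z statistic and two-sided normal p-value.

    Fails if the variance is zero (a cell proportion of exactly 0 or 1).
    """
    se = math.sqrt(variance)
    z = estimate / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p_value


est = {cell: cell_estimate(*counts) for cell, counts in cells.items()}


def first_difference(loc):
    """Effect of SIZE (1 vs 0) at a given LOC level: difference and variance."""
    p1, v1 = est[(loc, "SIZE=1")]
    p0, v0 = est[(loc, "SIZE=0")]
    return p1 - p0, v1 + v0  # independent cells, so variances add


fd_loc0 = first_difference("LOC=0")
fd_loc1 = first_difference("LOC=1")
# Second difference: does the SIZE effect differ between LOC=1 and LOC=0?
second_diff = (fd_loc1[0] - fd_loc0[0], fd_loc1[1] + fd_loc0[1])

for label, (diff, var) in [("first difference, LOC=0", fd_loc0),
                           ("first difference, LOC=1", fd_loc1),
                           ("second difference", second_diff)]:
    se, z, p = wald_test(diff, var)
    print(f"{label}: diff={diff:.3f} se={se:.3f} z={z:.2f} p={p:.3f}")
```

The sketch also ignores that each respondent contributes several rows, just as the glm() call above does.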

Odds and Rate Ratio CIs in Hurdle Models with Factor-Factor Interactions

I am trying to build hurdle models with factor-factor interactions but can't figure out how to calculate the CIs of the odds or rate ratios among the various factor-factor combinations.
library(glmmTMB)
data(Salamanders)
m3 <- glmmTMB(count ~ spp + mined + spp * mined,
zi=~spp + mined + spp * mined,
family=truncated_poisson, data=Salamanders) # added in the interaction
pred_dat <- data.frame(spp = rep(unique(Salamanders$spp), 2),
mined = rep(unique(Salamanders$mined), each = length(unique(Salamanders$spp))))
pred_dat # All factor-factor combos
Does anyone know how to appropriately calculate the CI around the ratios among these various factor-factor combos? I know how to calculate the actual ratio estimates (which consists of exponentiating the sum of 1-3 model coefficients, depending on the exact comparison being made) but I just can't seem to find any info on how to get the corresponding CI when an interaction is involved. If the ratio in question only requires exponentiating a single coefficient, the CI can easily be calculated; I just don't know how to do it when two or three coefficients are involved in calculating the ratio. Any help would be much appreciated.
EDIT:
I need the actual odds and rate ratios and their CIs, not the predicted values and their CIs. For example, exp(confint(m3)[2,3]) gives the rate ratio of sppPR/minedYes vs sppGP/minedYes, and c(exp(confint(m3)[2,1]), exp(confint(m3)[2,2])) gives the CI of that rate ratio. However, a number of the potential comparisons among the spp/mined combinations require summing multiple coefficients, e.g., exp(confint(m3)[2,3] + confint(m3)[8,3]), but in these circumstances I do not know how to calculate the rate ratio CI, because it involves multiple coefficients, each of which has its own SE estimate. How can I calculate those CIs, given that multiple coefficients are involved?
If I understand your question correctly, this would be one way to obtain the uncertainty around the predicted/fitted values of the interaction term:
library(glmmTMB)
library(ggeffects)
data(Salamanders)
m3 <- glmmTMB(count ~ spp + mined + spp * mined,
zi=~spp + mined + spp * mined,
family=truncated_poisson, data=Salamanders) # added in the interaction
ggpredict(m3, c("spp", "mined"))
#>
#> # Predicted counts of count
#> # x = spp
#>
#> # mined = yes
#>
#> x | Predicted | SE | 95% CI
#> --------------------------------------
#> GP | 1.59 | 0.92 | [0.26, 9.63]
#> PR | 1.13 | 0.66 | [0.31, 4.10]
#> DM | 1.74 | 0.29 | [0.99, 3.07]
#> EC-A | 0.61 | 0.96 | [0.09, 3.96]
#> EC-L | 0.42 | 0.69 | [0.11, 1.59]
#> DF | 1.49 | 0.27 | [0.88, 2.51]
#>
#> # mined = no
#>
#> x | Predicted | SE | 95% CI
#> --------------------------------------
#> GP | 2.67 | 0.11 | [2.15, 3.30]
#> PR | 1.59 | 0.28 | [0.93, 2.74]
#> DM | 3.10 | 0.10 | [2.55, 3.78]
#> EC-A | 2.30 | 0.17 | [1.64, 3.21]
#> EC-L | 5.25 | 0.07 | [4.55, 6.06]
#> DF | 2.68 | 0.12 | [2.13, 3.36]
#> Standard errors are on link-scale (untransformed).
plot(ggpredict(m3, c("spp", "mined")))
Created on 2020-08-04 by the reprex package (v0.3.0)
The ggeffects-package calculates marginal effects / estimates marginal means (EMM) with confidence intervals for your model terms. ggpredict() computes these EMMs based on predict(), ggemmeans() wraps the fantastic emmeans package and ggeffect() uses the effects package.
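For the ratios themselves (as opposed to predicted counts), the usual Wald approach works on the link scale: a ratio built from several coefficients is exp(w'b) for a weight vector w, its variance on the log scale is w'Vw (which includes the covariances between coefficients, so the separate confint() endpoints cannot simply be added), and the interval is exp(w'b ± z·sqrt(w'Vw)). A minimal sketch in Python with made-up numbers; the function name, coefficients, and covariance matrix are hypothetical, and in R the matrix would come from vcov() on the fitted model.

```python
import math


def ratio_ci(weights, coefs, vcov, z=1.96):
    """Ratio exp(w'b) with a Wald interval computed on the log scale.

    weights: which coefficients enter the comparison (e.g. 1 for each summed term)
    coefs:   estimated coefficients on the link (log) scale
    vcov:    covariance matrix of those coefficients (list of lists)
    """
    n = len(coefs)
    estimate = sum(w * b for w, b in zip(weights, coefs))
    variance = sum(weights[i] * vcov[i][j] * weights[j]
                   for i in range(n) for j in range(n))
    se = math.sqrt(variance)
    return (math.exp(estimate),
            math.exp(estimate - z * se),
            math.exp(estimate + z * se))


# Hypothetical values: a main effect and an interaction term that are summed
coefs = [0.2, 0.3]
vcov = [[0.04, -0.01],
        [-0.01, 0.09]]

ratio, lower, upper = ratio_ci([1, 1], coefs, vcov)
print(f"ratio={ratio:.3f}, 95% CI [{lower:.3f}, {upper:.3f}]")
```

Here the variance of the sum is 0.04 + 0.09 + 2(-0.01) = 0.11; dropping the covariance term would give 0.13 and a wider interval than the data support.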

Convert KMeans "centres" output to PySpark dataframe

I'm running a K-means clustering model, and I want to analyse the cluster centroids, however the centers output is a LIST of my 20 centroids, with their coordinates (8 each) as an ARRAY. I need it as a dataframe, with clusters 1:20 as rows, and their attribute values (centroid coordinates) as columns like so:
c1 | 0.85 | 0.03 | 0.01 | 0.00 | 0.12 | 0.01 | 0.00 | 0.12
c2 | 0.25 | 0.80 | 0.10 | 0.00 | 0.12 | 0.01 | 0.00 | 0.77
c3 | 0.05 | 0.10 | 0.00 | 0.82 | 0.00 | 0.00 | 0.22 | 0.00
The dataframe format is important because what I WANT to do is:
For each centroid:
- Identify the 3 strongest attributes
- Create a "name" for each of the 20 centroids that is a concatenation of the 3 most dominant traits in that centroid
For example:
c1 | milk_eggs_cheese
c2 | meat_milk_bread
c3 | toiletries_bread_eggs
This code is running in Zeppelin, EMR version 5.19, Spark 2.4. The model works great, but this is the boilerplate code from the Spark documentation (https://spark.apache.org/docs/latest/ml-clustering.html#k-means), which produces the list-of-arrays output that I can't really use.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
This is an excerpt of the output I get.
Cluster Centers:
[0.12391775 0.04282062 0.00368751 0.27282358 0.00533401 0.03389095
0.04220946 0.03213536 0.00895981 0.00990327 0.01007891]
[0.09018751 0.01354349 0.0130329 0.00772877 0.00371508 0.02288211
0.032301 0.37979978 0.002487 0.00617438 0.00610262]
[7.37626746e-02 2.02469798e-03 4.00944473e-04 9.62304581e-04
5.98964859e-03 2.95190585e-03 8.48736175e-01 1.36797882e-03
2.57451073e-04 6.13320072e-04 5.70559278e-04]
Based on How to convert a list of array to Spark dataframe I have tried this:
df = sc.parallelize(centers).toDF(['fresh_items', 'wine_liquor', 'baby', 'cigarettes', 'fresh_meat', 'fruit_vegetables', 'bakery', 'toiletries', 'pets', 'coffee', 'cheese'])
df.show()
But this throws the following error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
model.clusterCenters() gives you a list of numpy arrays, not a list of lists like in the answer you have linked. Just convert the numpy arrays to lists before creating the dataframe:
bla = [e.tolist() for e in centers]  # numpy arrays -> plain Python lists
df = sc.parallelize(bla).toDF(['fresh_items', 'wine_liquor', 'baby', 'cigarettes', 'fresh_meat', 'fruit_vegetables', 'bakery', 'toiletries', 'pets', 'coffee', 'cheese'])
# or: df = spark.createDataFrame(bla, ['fresh_items', 'wine_liquor', 'baby', 'cigarettes', 'fresh_meat', 'fruit_vegetables', 'bakery', 'toiletries', 'pets', 'coffee', 'cheese'])
df.show()
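The naming step from the question (concatenating the three strongest attributes of each centroid) can be done on the plain lists before any dataframe is built. A minimal sketch, assuming the centers have already been converted with tolist(); name_centroid is a hypothetical helper, and the two centers are copied from the output excerpt in the question.

```python
columns = ['fresh_items', 'wine_liquor', 'baby', 'cigarettes', 'fresh_meat',
           'fruit_vegetables', 'bakery', 'toiletries', 'pets', 'coffee', 'cheese']


def name_centroid(center, columns, top_n=3):
    """Join the names of the top_n largest coordinates, strongest first."""
    ranked = sorted(range(len(center)), key=lambda i: center[i], reverse=True)
    return "_".join(columns[i] for i in ranked[:top_n])


# First two centers from the printed output (already plain Python lists)
centers_as_lists = [
    [0.12391775, 0.04282062, 0.00368751, 0.27282358, 0.00533401, 0.03389095,
     0.04220946, 0.03213536, 0.00895981, 0.00990327, 0.01007891],
    [0.09018751, 0.01354349, 0.0130329, 0.00772877, 0.00371508, 0.02288211,
     0.032301, 0.37979978, 0.002487, 0.00617438, 0.00610262],
]

# One (cluster id, name) row per centroid
named_rows = [("c%d" % (k + 1), name_centroid(c, columns))
              for k, c in enumerate(centers_as_lists)]
for row in named_rows:
    print(row)
```

A list of tuples like named_rows could then be passed to spark.createDataFrame together with two column names if a Spark dataframe is still wanted.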

how to use estat vif in the right way

I have 2 questions concerning estat vif to test multicollinearity:
1. Is it correct that you can only calculate estat vif after the regress command?
2. If I execute this command, Stata only gives me the VIF of one independent variable. How do I get the VIF of all the independent variables?
Q1. I find estat vif documented under regress postestimation. If you can find it documented under any other postestimation heading, then it is applicable after that command.
Q2. You don't give any examples, reproducible or otherwise, of your problem. But estat vif by default gives a result for each predictor (independent variable).
. sysuse auto, clear
(1978 Automobile Data)

. regress mpg weight price

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(2, 71)        =     66.85
       Model |  1595.93249         2  797.966246   Prob > F        =    0.0000
    Residual |  847.526967        71  11.9369995   R-squared       =    0.6531
-------------+----------------------------------   Adj R-squared   =    0.6434
       Total |  2443.45946        73  33.4720474   Root MSE        =     3.455

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0058175   .0006175    -9.42   0.000    -.0070489   -.0045862
       price |  -.0000935   .0001627    -0.57   0.567     -.000418    .0002309
       _cons |   39.43966   1.621563    24.32   0.000     36.20635    42.67296
------------------------------------------------------------------------------

. estat vif

    Variable |       VIF       1/VIF
-------------+----------------------
       price |      1.41    0.709898
      weight |      1.41    0.709898
-------------+----------------------
    Mean VIF |      1.41
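The two identical values in that output are expected: a predictor's VIF is 1/(1 - R²), where R² comes from regressing it on the other predictors, and with only two predictors that R² is the squared correlation between them, so both share the same VIF. A minimal sketch of the two-predictor case in Python, with made-up data; the function names are hypothetical, and only the 1/VIF value is taken from the output above.

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors when the model has exactly two of them."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


# Made-up predictor values, for illustration only
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]
print(round(vif_two_predictors(x1, x2), 4))

# Consistency check against the Stata output: 1/VIF = 0.709898
print(round(1 / 0.709898, 2))
```

With three or more predictors each variable gets its own auxiliary regression, so the VIFs generally differ.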