Copy a categorical variable with its value labels

Is it possible to copy a labeled categorical variable in a single line, or do I generally have to copy over labels as a separate step?
In the case I'm looking at, egen ... group() comes close, but changes the underlying integers:
sysuse auto
** starts them from different indices
egen mycut = cut(mpg), at(0 20 30 50) label icodes
egen mycut_copy = group(mycut), label
** does weird stuff
egen mycut2 = cut(mpg), at(0 20 30 50) label icodes
replace mycut2 = group(mycut2)
egen mycut_copy2 = group(mycut2), label
** the correct approach?
egen mycut3 = cut(mpg), at(0 20 30 50) label icodes
gen mycut_copy3 = mycut3
label values mycut_copy3 mycut3

You can do what you want very easily using the lesser-known clonevar command:
sysuse auto, clear
egen mycut = cut(mpg), at(0 20 30 50) label icodes
clonevar mycut2 = mycut
list mycut* in 1/10, separator(0)
     +----------------+
     | mycut   mycut2 |
     |----------------|
  1. |   20-      20- |
  2. |    0-       0- |
  3. |   20-      20- |
  4. |   20-      20- |
  5. |    0-       0- |
  6. |    0-       0- |
  7. |   20-      20- |
  8. |   20-      20- |
  9. |    0-       0- |
 10. |    0-       0- |
     +----------------+
Note that group() refers to different functions when used with generate and egen, which is why you do not get the same results.
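For the record, clonevar is doing roughly this under the hood (a sketch; clonevar additionally copies the storage type, display format, variable label, and any notes):
generate mycut3 = mycut      // copy the values
label values mycut3 mycut    // reuse the same value-label set; egen cut, label names the label after the variable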

Related

altair bar chart for grouped data sorting NOT working

   Reservation_branch_code  ON_ACCOUNT  Rcount
0                     1101         170    5484
1                     1103         101    5111
2                     1118           1     232
3                     1121           0      27
4                     1126          90     191
I would like the chart sorted by "Rcount", with "Reservation_branch_code" on the x axis.
The code below gives me the chart without sorting by Rcount:
base = alt.Chart(df1).transform_fold(
    ['Rcount', 'ON_ACCOUNT'],
    as_=['column', 'value']
)
bars = base.mark_bar().encode(
    # x='Reservation_branch_code:N',
    x='Reservation_branch_code:O',
    y=alt.Y('value:Q', stack=None),  # stack=None enables layered bars
    color=alt.Color('column:N', scale=alt.Scale(range=["#f50520", "#bab6b7"])),
    tooltip=alt.Tooltip(['ON_ACCOUNT', 'Rcount']),
    # order=alt.Order('value:Q')
)
text = base.mark_text(
    align='center',
    color='black',
    baseline='middle',
    dx=0, dy=-8,  # nudges the labels up so they don't sit on top of the bars
).encode(
    x='Reservation_branch_code:N',
    y='value:Q',
    text=alt.Text('value:Q', format='.0f')
)
rule = alt.Chart(df1).mark_rule(color='blue').encode(
    y='mean(Rcount):Q'
)
(bars + text + rule).properties(width=790, height=330)
I sorted the data in the dataframe and used that df in the Altair chart, but the x axis is still not sorted by the Rcount column. Thanks.
You can pass a list with the sort order:
import altair as alt
from vega_datasets import data
source = data.barley()
alt.Chart(source).mark_bar().encode(
    x='sum(yield):Q',
    y=alt.Y('site:N', sort=source['site'].unique().tolist())
)
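Applied to the data in the question, a sketch (assuming df1 holds the columns shown above): alt.EncodingSortField sorts one encoding by the values of another column, and the fold transform keeps the original Rcount column available for sorting.
base = alt.Chart(df1).transform_fold(
    ['Rcount', 'ON_ACCOUNT'],
    as_=['column', 'value']
)
bars = base.mark_bar().encode(
    # sort the branch codes by their Rcount value, largest first
    x=alt.X('Reservation_branch_code:O',
            sort=alt.EncodingSortField(field='Rcount', op='max', order='descending')),
    y=alt.Y('value:Q', stack=None),
    color='column:N'
)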

pytorch forecasting extract feature from hidden layer

I'm following the PyTorch Forecasting tutorial: https://pytorch-forecasting.readthedocs.io/en/latest/tutorials/building.html
I implemented an LSTM using AutoRegressiveBaseModelWithCovariates and initialized the model from my dataset.
from pytorch_forecasting.models.rnn import RecurrentNetwork
...
model = RecurrentNetwork.from_dataset(dataset_with_covariates)
I've been asked to get the output of a hidden layer and visualize it with t-SNE or UMAP (something I've done before with Keras). I'm new to PyTorch, unfortunately. Does anyone know how to do this?
Here's the summary.
| Name | Type | Params
----------------------------------------------------------------------------------
0 | loss | MAE | 0
1 | logging_metrics | ModuleList | 0
2 | logging_metrics.0 | SMAPE | 0
3 | logging_metrics.1 | MAE | 0
4 | logging_metrics.2 | RMSE | 0
5 | logging_metrics.3 | MAPE | 0
6 | logging_metrics.4 | MASE | 0
7 | embeddings | MultiEmbedding | 47
8 | embeddings.embeddings | ModuleDict | 47
9 | embeddings.embeddings.level_0 | Embedding | 12
10 | embeddings.embeddings.supervisorvehiclestatus | Embedding | 35
11 | rnn | LSTM | 2.5 K
12 | output_projector | Linear | 11
----------------------------------------------------------------------------------
2.5 K Trainable params
0 Non-trainable params
2.5 K Total params
0.010 Total estimated model params size (MB)
In an attempt to find the layer name, I did:
for name, layer in model.named_modules():
    print(name, layer)
RecurrentNetwork(
(loss): MAE()
(logging_metrics): ModuleList(
(0): SMAPE()
(1): MAE()
(2): RMSE()
(3): MAPE()
(4): MASE()
)
(embeddings): MultiEmbedding(
(embeddings): ModuleDict(
(group_name): Embedding(4, 3)
(categorical_var ): Embedding(7, 5)
)
)
(rnn): LSTM(28, 10, num_layers=2, batch_first=True, dropout=0.1)
(output_projector): Linear(in_features=10, out_features=1, bias=True)
)
loss MAE()
logging_metrics ModuleList(
(0): SMAPE()
(1): MAE()
(2): RMSE()
(3): MAPE()
(4): MASE()
)
logging_metrics.0 SMAPE()
logging_metrics.1 MAE()
logging_metrics.2 RMSE()
logging_metrics.3 MAPE()
logging_metrics.4 MASE()
embeddings MultiEmbedding(
(embeddings): ModuleDict(
(group_name): Embedding(4, 3)
(categorical_var ): Embedding(7, 5)
)
)
embeddings.embeddings ModuleDict(
(group_name): Embedding(4, 3)
(categorical_var ): Embedding(7, 5)
)
embeddings.embeddings.level_0 Embedding(4, 3)
embeddings.embeddings.categorical_var Embedding(7, 5)
rnn LSTM(28, 10, num_layers=2, batch_first=True, dropout=0.1)
output_projector Linear(in_features=10, out_features=1, bias=True)
I thought I could do something like this to get the activations, but it is not working.
def get_hidden_features(x, layer):
    activation = {}

    def get_activation(name):
        def hook(m, i, o):
            activation[name] = o.detach()
        return hook

    model.register_forward_hook(get_activation(layer))
    _ = model(x)
    return activation[layer]

outhidden = get_hidden_features(x, "rnn")
Returns:
AttributeError: 'Output' object has no attribute 'detach'
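The hook above is registered on the whole model, so o is pytorch-forecasting's Output named tuple rather than a tensor. A sketch of one way around this (assuming x is the same input batch as above): register the hook on the rnn submodule itself and unpack the LSTM's tuple output.
def get_hidden_features(x, layer_name):
    activation = {}

    def hook(module, inputs, output):
        # nn.LSTM returns (output, (h_n, c_n)); keep the output sequence
        out = output[0] if isinstance(output, tuple) else output
        activation[layer_name] = out.detach()

    # register on the named submodule, not on the top-level model
    handle = dict(model.named_modules())[layer_name].register_forward_hook(hook)
    _ = model(x)
    handle.remove()
    return activation[layer_name]

hidden = get_hidden_features(x, "rnn")  # hidden states per time step, ready for t-SNE/UMAP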

Odds and Rate Ratio CIs in Hurdle Models with Factor-Factor Interactions

I am trying to build hurdle models with factor-factor interactions but can't figure out how to calculate the CIs of the odds or rate ratios among the various factor-factor combinations.
library(glmmTMB)
data(Salamanders)
m3 <- glmmTMB(count ~ spp + mined + spp * mined,
              zi = ~ spp + mined + spp * mined,
              family = truncated_poisson, data = Salamanders)  # added in the interaction
pred_dat <- data.frame(
  spp = rep(unique(Salamanders$spp), 2),
  mined = rep(unique(Salamanders$mined), each = length(unique(Salamanders$spp)))
)
pred_dat  # all factor-factor combos
Does anyone know how to appropriately calculate the CI around the ratios among these various factor-factor combos? I know how to calculate the actual ratio estimates (which consist of exponentiating the sum of 1-3 model coefficients, depending on the exact comparison being made), but I just can't seem to find any info on how to get the corresponding CI when an interaction is involved. If the ratio in question only requires exponentiating a single coefficient, the CI is easily calculated; I just don't know how to do it when two or three coefficients are involved in calculating the ratio. Any help would be much appreciated.
EDIT:
I need the actual odds and rate ratios and their CIs, not the predicted values and their CIs. For example, exp(confint(m3)[2,3]) gives the rate ratio of sppPR/minedYes vs sppGP/minedYes, and c(exp(confint(m3)[2,1]), exp(confint(m3)[2,2])) gives the CI of that rate ratio. However, a number of the potential comparisons among the spp/mined combinations require summing multiple coefficients, e.g. exp(confint(m3)[2,3] + confint(m3)[8,3]), but in these circumstances I do not know how to calculate the rate ratio CI because multiple coefficients are involved, each of which has its own SE estimate. How can I calculate those CIs, given that multiple coefficients are involved?
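For context, the quantity being asked about is the variance of a sum of coefficients, Var(b1 + b2) = Var(b1) + Var(b2) + 2*Cov(b1, b2). A sketch for the example indices above, assuming both coefficients sit in the conditional component of m3 and that the vcov rows line up with the confint rows used here:
b <- fixef(m3)$cond
V <- vcov(m3)$cond
est <- b[2] + b[8]                            # sum of the two coefficients
se  <- sqrt(V[2, 2] + V[8, 8] + 2 * V[2, 8])  # SE of the sum
exp(est + c(-1, 1) * qnorm(0.975) * se)       # Wald 95% CI for the rate ratio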
If I understand your question correctly, this would be one way to obtain the uncertainty around the predicted/fitted values of the interaction term:
library(glmmTMB)
library(ggeffects)
data(Salamanders)
m3 <- glmmTMB(count ~ spp + mined + spp * mined,
              zi = ~ spp + mined + spp * mined,
              family = truncated_poisson, data = Salamanders)  # added in the interaction
ggpredict(m3, c("spp", "mined"))
#>
#> # Predicted counts of count
#> # x = spp
#>
#> # mined = yes
#>
#> x | Predicted | SE | 95% CI
#> --------------------------------------
#> GP | 1.59 | 0.92 | [0.26, 9.63]
#> PR | 1.13 | 0.66 | [0.31, 4.10]
#> DM | 1.74 | 0.29 | [0.99, 3.07]
#> EC-A | 0.61 | 0.96 | [0.09, 3.96]
#> EC-L | 0.42 | 0.69 | [0.11, 1.59]
#> DF | 1.49 | 0.27 | [0.88, 2.51]
#>
#> # mined = no
#>
#> x | Predicted | SE | 95% CI
#> --------------------------------------
#> GP | 2.67 | 0.11 | [2.15, 3.30]
#> PR | 1.59 | 0.28 | [0.93, 2.74]
#> DM | 3.10 | 0.10 | [2.55, 3.78]
#> EC-A | 2.30 | 0.17 | [1.64, 3.21]
#> EC-L | 5.25 | 0.07 | [4.55, 6.06]
#> DF | 2.68 | 0.12 | [2.13, 3.36]
#> Standard errors are on link-scale (untransformed).
plot(ggpredict(m3, c("spp", "mined")))
Created on 2020-08-04 by the reprex package (v0.3.0)
The ggeffects package calculates marginal effects / estimated marginal means (EMMs) with confidence intervals for your model terms. ggpredict() computes these EMMs based on predict(), ggemmeans() wraps the fantastic emmeans package, and ggeffect() uses the effects package.
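If you want the ratios themselves rather than the predictions, a sketch using emmeans directly (assuming the comparisons of interest live in the conditional part of the model): pairwise differences of the EMMs on the log scale back-transform to rate ratios, and their CIs already account for the covariances among the coefficients.
library(emmeans)
emm <- emmeans(m3, ~ spp * mined, component = "cond")
# back-transforming the pairwise log-scale contrasts yields rate ratios with CIs
confint(contrast(emm, method = "pairwise"), type = "response")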

how to use estat vif in the right way

I have two questions concerning estat vif to test multicollinearity:
Is it correct that you can only calculate estat vif after the regress command?
If I execute this command, Stata only gives me the VIF of one independent variable.
How do I get the VIFs of all the independent variables?
Q1. I find estat vif documented under regress postestimation. If you can find it documented under any other postestimation heading, then it is applicable after that command.
Q2. You don't give any examples, reproducible or otherwise, of your problem. But estat vif by default gives a result for each predictor (independent variable).
. sysuse auto, clear
(1978 Automobile Data)
. regress mpg weight price
      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(2, 71)        =     66.85
       Model |  1595.93249         2  797.966246   Prob > F        =    0.0000
    Residual |  847.526967        71  11.9369995   R-squared       =    0.6531
-------------+----------------------------------   Adj R-squared   =    0.6434
       Total |  2443.45946        73  33.4720474   Root MSE        =     3.455

------------------------------------------------------------------------------
         mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      weight |  -.0058175   .0006175    -9.42   0.000    -.0070489   -.0045862
       price |  -.0000935   .0001627    -0.57   0.567     -.000418    .0002309
       _cons |   39.43966   1.621563    24.32   0.000     36.20635    42.67296
------------------------------------------------------------------------------
. estat vif
    Variable |       VIF       1/VIF
-------------+----------------------
       price |      1.41    0.709898
      weight |      1.41    0.709898
-------------+----------------------
    Mean VIF |      1.41
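For intuition, a sketch of what estat vif computes: for each predictor j, VIF_j = 1/(1 - R2_j), where R2_j is the R-squared from regressing that predictor on all the other predictors. With only two predictors, both auxiliary regressions share the same R-squared, which is why both VIFs are 1.41 here:
. quietly regress weight price            // auxiliary regression for weight
. display "VIF for weight = " 1/(1 - e(r2))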

Spark: All RDD data not getting saved to Cassandra table

Hi, I am trying to load RDD data into a Cassandra column family using Scala. Out of a total of 50 rows, only 28 are getting stored in the Cassandra table.
Below is the code snippet:
val states = sc.textFile("state.txt")
// list of all the 50 states of the USA
var n = 0  // corrected to var
val statesRDD = states.map { a =>
  n = n + 1
  (n, a)
}
scala> statesRDD.count
res2: Long = 50
cqlsh:brs> CREATE TABLE BRS.state(state_id int PRIMARY KEY, state_name text);
statesRDD.saveToCassandra("brs","state", SomeColumns("state_id","state_name"))
// this statement saves only 28 rows out of 50, not sure why!!!!
cqlsh:brs> select * from state;
state_id | state_name
----------+-------------
23 | Minnesota
5 | California
28 | Nevada
10 | Georgia
16 | Kansas
13 | Illinois
11 | Hawaii
1 | Alabama
19 | Maine
8 | Oklahoma
2 | Alaska
4 | New York
18 | Virginia
15 | Iowa
22 | Wyoming
27 | Nebraska
20 | Maryland
7 | Ohio
6 | Colorado
9 | Florida
14 | Indiana
26 | Montana
21 | Wisconsin
17 | Vermont
24 | Mississippi
25 | Missouri
12 | Idaho
3 | Arizona
(28 rows)
Can anyone please help me in finding where the issue is?
Edit:
I understood why only 28 rows are getting stored in Cassandra: I made the first column the PRIMARY KEY, and it looks like in my code n is incremented up to a maximum of 28 and then starts again from 1 up to 22 (50 in total). Since state_id is the primary key, Cassandra upserts on it, so the rows with duplicate keys overwrite each other, leaving only 28 distinct rows.
val states = sc.textFile("states.txt")
var n = 0
var statesRDD = states.map { a =>
  n += 1
  (n, a)
}
I tried making n an accumulator variable as well (viz. val n = sc.accumulator(0, "Counter")), but I don't see any difference in the output.
scala> statesRDD.foreach(println)
[Stage 2:> (0 + 0) / 2]
(1,New Hampshire)
(2,New Jersey)
(3,New Mexico)
(4,New York)
(5,North Carolina)
(6,North Dakota)
(7,Ohio)
(8,Oklahoma)
(9,Oregon)
(10,Pennsylvania)
(11,Rhode Island)
(12,South Carolina)
(13,South Dakota)
(14,Tennessee)
(15,Texas)
(16,Utah)
(17,Vermont)
(18,Virginia)
(19,Washington)
(20,West Virginia)
(21,Wisconsin)
(22,Wyoming)
(1,Alabama)
(2,Alaska)
(3,Arizona)
(4,Arkansas)
(5,California)
(6,Colorado)
(7,Connecticut)
(8,Delaware)
(9,Florida)
(10,Georgia)
(11,Hawaii)
(12,Idaho)
(13,Illinois)
(14,Indiana)
(15,Iowa)
(16,Kansas)
(17,Kentucky)
(18,Louisiana)
(19,Maine)
(20,Maryland)
(21,Massachusetts)
(22,Michigan)
(23,Minnesota)
(24,Mississippi)
(25,Missouri)
(26,Montana)
(27,Nebraska)
(28,Nevada)
I am curious to know what is causing n not to get updated past 28. Also, what are the ways in which I can create a counter to use when building an RDD?
There are some misconceptions about distributed systems embedded in your question. The real heart of this is "How do I keep a counter in a distributed system?"
The short answer is you don't. What your original code effectively does is something like this:
Task One {
  var x = 0
  record 1: x = 1
  record 2: x = 2
}
Task Two {
  var x = 0
  record 20: x = 1
  record 21: x = 2
}
Each machine independently creates a new x variable set to 0, which gets incremented within its own context, independently of the other nodes.
For most use cases the "counter" question can be replaced with "How can I get a Unique Identifier per Record in a distributed system?"
For this, most users end up using a UUID, which can be generated on independent machines with an infinitesimal chance of collision.
If the question is instead "How can I get a monotonically increasing unique identifier?", then you can use zipWithUniqueId, which does not count but does generate monotonically increasing ids (zipWithIndex gives consecutive indices, at the cost of an extra Spark job to count the partitions).
If you just want the numbers to begin with, it's best to generate them on the local system.
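Applied to your example, a sketch using zipWithIndex (assuming the same states.txt input and the Cassandra connector on the classpath):
import com.datastax.spark.connector._

val states = sc.textFile("states.txt")
// zipWithIndex assigns consecutive 0-based indices across all partitions
val statesRDD = states.zipWithIndex.map { case (name, idx) => ((idx + 1).toInt, name) }
statesRDD.saveToCassandra("brs", "state", SomeColumns("state_id", "state_name"))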
Edit: Why can't I use an accumulator?
Accumulators store their state (surprise) per task. You can see this with a little example:
val x = sc.accumulator(0, "x")
sc.parallelize(1 to 50).foreachPartition { it =>
  it.foreach(y => x += 1)
  println(x)  // prints the accumulator's local value within each task
}
/*
6
7
6
6
6
6
6
7
*/
x.value
// res38: Int = 50
The accumulators combine their state after finishing their tasks, which means you can't use them as a global distributed counter.