Convert KMeans "centres" output to PySpark dataframe - pyspark

I'm running a K-means clustering model and want to analyse the cluster centroids. However, the centers output is a list of my 20 centroids, with each centroid's coordinates (8 each) as an array. I need it as a dataframe, with clusters 1:20 as rows and their attribute values (centroid coordinates) as columns, like so:
c1 | 0.85 | 0.03 | 0.01 | 0.00 | 0.12 | 0.01 | 0.00 | 0.12
c2 | 0.25 | 0.80 | 0.10 | 0.00 | 0.12 | 0.01 | 0.00 | 0.77
c3 | 0.05 | 0.10 | 0.00 | 0.82 | 0.00 | 0.00 | 0.22 | 0.00
The dataframe format is important because what I WANT to do is:
For each centroid
Identify the 3 strongest attributes
Create a "name" for each of the 20 centroids that is a concatenation of the 3 most dominant traits in that centroid
For example:
c1 | milk_eggs_cheese
c2 | meat_milk_bread
c3 | toiletries_bread_eggs
This code is running in Zeppelin, EMR version 5.19, Spark 2.4. The model works great, but this is the boilerplate code from the Spark documentation, which produces the list-of-arrays output that I can't really use.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
This is an excerpt of the output I get.
Cluster Centers:
[0.12391775 0.04282062 0.00368751 0.27282358 0.00533401 0.03389095
0.04220946 0.03213536 0.00895981 0.00990327 0.01007891]
[0.09018751 0.01354349 0.0130329 0.00772877 0.00371508 0.02288211
0.032301 0.37979978 0.002487 0.00617438 0.00610262]
[7.37626746e-02 2.02469798e-03 4.00944473e-04 9.62304581e-04
5.98964859e-03 2.95190585e-03 8.48736175e-01 1.36797882e-03
2.57451073e-04 6.13320072e-04 5.70559278e-04]
Based on "How to convert a list of array to Spark dataframe", I have tried this:
df = sc.parallelize(centers).toDF(['fresh_items', 'wine_liquor', 'baby', 'cigarettes', 'fresh_meat', 'fruit_vegetables', 'bakery', 'toiletries', 'pets', 'coffee', 'cheese'])
But this throws the following error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

model.clusterCenters() gives you a list of numpy arrays, not a list of lists like in the answer you have linked. Just convert the numpy arrays to lists before creating the dataframe:
bla = [e.tolist() for e in centers]
df = sc.parallelize(bla).toDF(['fresh_items', 'wine_liquor', 'baby', 'cigarettes', 'fresh_meat', 'fruit_vegetables', 'bakery', 'toiletries', 'pets', 'coffee', 'cheese'])
# or: df = spark.createDataFrame(bla, ['fresh_items', 'wine_liquor', 'baby', 'cigarettes', 'fresh_meat', 'fruit_vegetables', 'bakery', 'toiletries', 'pets', 'coffee', 'cheese'])
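Once the centers are plain lists, the naming step described in the question (the 3 strongest attributes per centroid, joined into one name) does not need Spark at all, since there are only 20 short rows. A minimal sketch, assuming the column order matches the coordinate order; name_centroids and the c1, c2 labels are hypothetical helpers, and the two rows are the first two centroids from the printed output above:

```python
def name_centroids(centers, columns, top_n=3):
    """Return (cluster_label, name) pairs, where name joins the top_n
    column names with the largest coordinates in that centroid."""
    named = []
    for idx, center in enumerate(centers, start=1):
        # Pair each coordinate with its column name, largest value first.
        ranked = sorted(zip(center, columns), key=lambda pair: pair[0], reverse=True)
        label = "_".join(col for _, col in ranked[:top_n])
        named.append(("c%d" % idx, label))
    return named


columns = ['fresh_items', 'wine_liquor', 'baby', 'cigarettes', 'fresh_meat',
           'fruit_vegetables', 'bakery', 'toiletries', 'pets', 'coffee', 'cheese']

# First two centroids copied from the question's output.
centers = [
    [0.12391775, 0.04282062, 0.00368751, 0.27282358, 0.00533401, 0.03389095,
     0.04220946, 0.03213536, 0.00895981, 0.00990327, 0.01007891],
    [0.09018751, 0.01354349, 0.0130329, 0.00772877, 0.00371508, 0.02288211,
     0.032301, 0.37979978, 0.002487, 0.00617438, 0.00610262],
]

named = name_centroids(centers, columns)
for cluster, name in named:
    print(cluster, "|", name)
```

If a dataframe is still wanted, the pairs can be passed on with spark.createDataFrame(named, ['cluster', 'name']).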


pytorch forecasting extract feature from hidden layer

I'm following the PyTorch Forecasting tutorial:
I implemented an LSTM using AutoRegressiveBaseModelWithCovariates and initialized the model from my dataset.
from pytorch_forecasting.models.rnn import RecurrentNetwork
model = RecurrentNetwork.from_dataset(dataset_with_covariates)
I've been asked to get the output of a hidden layer and visualize it with t-SNE or UMAP (something I've done before with Keras). Unfortunately, I'm new to PyTorch. Does anyone know how to do this?
Here's the summary.
| Name | Type | Params
0 | loss | MAE | 0
1 | logging_metrics | ModuleList | 0
2 | logging_metrics.0 | SMAPE | 0
3 | logging_metrics.1 | MAE | 0
4 | logging_metrics.2 | RMSE | 0
5 | logging_metrics.3 | MAPE | 0
6 | logging_metrics.4 | MASE | 0
7 | embeddings | MultiEmbedding | 47
8 | embeddings.embeddings | ModuleDict | 47
9 | embeddings.embeddings.level_0 | Embedding | 12
10 | embeddings.embeddings.supervisorvehiclestatus | Embedding | 35
11 | rnn | LSTM | 2.5 K
12 | output_projector | Linear | 11
2.5 K Trainable params
0 Non-trainable params
2.5 K Total params
0.010 Total estimated model params size (MB)
In an attempt to find the layer name, I did:
for name, layer in model.named_modules():
    print(name, layer)
(loss): MAE()
(logging_metrics): ModuleList(
(0): SMAPE()
(1): MAE()
(2): RMSE()
(3): MAPE()
(4): MASE()
(embeddings): MultiEmbedding(
(embeddings): ModuleDict(
(group_name): Embedding(4, 3)
(categorical_var ): Embedding(7, 5)
(rnn): LSTM(28, 10, num_layers=2, batch_first=True, dropout=0.1)
(output_projector): Linear(in_features=10, out_features=1, bias=True)
loss MAE()
logging_metrics ModuleList(
(0): SMAPE()
(1): MAE()
(2): RMSE()
(3): MAPE()
(4): MASE()
logging_metrics.0 SMAPE()
logging_metrics.1 MAE()
logging_metrics.2 RMSE()
logging_metrics.3 MAPE()
logging_metrics.4 MASE()
embeddings MultiEmbedding(
(embeddings): ModuleDict(
(group_name): Embedding(4, 3)
(categorical_var ): Embedding(7, 5)
embeddings.embeddings ModuleDict(
(group_name): Embedding(4, 3)
(categorical_var ): Embedding(7, 5)
embeddings.embeddings.level_0 Embedding(4, 3)
embeddings.embeddings.categorical_var Embedding(7, 5)
rnn LSTM(28, 10, num_layers=2, batch_first=True, dropout=0.1)
output_projector Linear(in_features=10, out_features=1, bias=True)
I thought I could do something like this to get the activations, but it is not working.
def get_hidden_features(x, layer):
    activation = {}
    def get_activation(name):
        def hook(m, i, o):
            activation[name] = o.detach()
        return hook
    _ = model(x)
    return activation[layer]
outhidden = get_hidden_features(x, "rnn")
AttributeError: 'Output' object has no attribute 'detach'

How to use stata svy etregress postestimation assumption check

When using survey data and etregress with an endogenous treatment effect in Stata, a number of diagnostics and postestimation tools stop being available.
svy: etregress logwage i.race gender, treat(training = gender)
| Linearized
| Coef. Std. Err. t P>|t| [95% Conf. Interval]
logwage |
race |
African American | .3891554 .0031105 12.20 0.000 .2000000 .8474752
Asian American | .1487310 .0002843 04.11 0.000 .027113 .8765290
gender |
female | -.0230411 .010445 -6.85 0.000 -.115341 -.0107295
| | .3703371 .0451778 10.61 0.000 .2018037 .4186134
training | |
Highschool | -.0715731 .0490565 1.28 0.098 -.1106579 .1291781
College | .1271380 .0401052 3.95 0.003 .0329516 .2107563
Grad School | .8522143 .0085337 8.99 0.000 .8271381 .9573284
gender |
female | .0127444 .0100058 5.33 0.041 .0100558 .0866312
_cons | -1.260083 .0327235 -26.12 0.000 -1.531405 -1.098524
/athrho | .0051552 .031410 0.17 0.827 -.0722533 .0810246
/lnsigma | -1.872551 .0166818 -73.50 0.000 -1.928624 -1.278064
rho | .0084120 .0421116 -.0649947 .0888529
sigma | .4000831 .0038170 .1925127 .5067780
lambda | .0012673 .0226365 -.0324029
With this model, checks of the usual linear-model assumptions (linearity, independence, homoscedasticity, normality) and goodness-of-fit diagnostics do not give any output.
A residuals-versus-fitted plot would normally come from rvfplot, but this gives the error:
last estimates not found
Trying estat gof gives
invalid subcommand gof
and the same happens with estat hettest.
help etregress postestimation
does not discuss model assumption tests or goodness-of-fit tests, which we normally see with regress or log-linear models in Stata.
When I try predict residual or predict rstudent, nothing is reported, so plotting is again not possible.
Here is a reproducible example of the problem, based on an example given by others:
webuse nhanes2f, clear
qui svyset psuid [pweight=finalwgt], strata(stratid)
qui svy: etregress loglead i.female i.diabetes, treat(diabetes = weight age height i.female) // coefl
nlcom pct_eff:(100*(exp(_b[loglead:1.female])-1))
Here, too, etregress is used with a log-transformed dependent variable and a treatment component. After fitting such a model, how do we check the assumptions and goodness of fit?

Odds and Rate Ratio CIs in Hurdle Models with Factor-Factor Interactions

I am trying to build hurdle models with factor-factor interactions but can't figure out how to calculate the CIs of the odds or rate ratios among the various factor-factor combinations.
library(glmmTMB)  # provides glmmTMB() and the Salamanders data
m3 <- glmmTMB(count ~ spp + mined + spp * mined,
              zi = ~spp + mined + spp * mined,
              family = truncated_poisson, data = Salamanders) # added in the interaction
pred_dat <- data.frame(spp = rep(unique(Salamanders$spp), 2),
                       mined = rep(unique(Salamanders$mined), each = length(unique(Salamanders$spp))))
pred_dat # All factor-factor combos
Does anyone know how to appropriately calculate the CI around the ratios among these various factor-factor combos? I know how to calculate the actual ratio estimates (which consists of exponentiating the sum of 1-3 model coefficients, depending on the exact comparison being made) but I just can't seem to find any info on how to get the corresponding CI when an interaction is involved. If the ratio in question only requires exponentiating a single coefficient, the CI can easily be calculated; I just don't know how to do it when two or three coefficients are involved in calculating the ratio. Any help would be much appreciated.
I need the actual odds and rate ratios and their CIs, not the predicted values and their CIs. For example, exp(confint(m3)[2,3]) gives the rate ratio of sppPR/minedYes vs sppGP/minedYes, and c(exp(confint(m3)[2,1]), exp(confint(m3)[2,2])) gives the CI of that rate ratio. However, a number of the potential comparisons among the spp/mined combinations require summing multiple coefficients, e.g., exp(confint(m3)[2,3] + confint(m3)[8,3]), but in these circumstances I do not know how to calculate the rate ratio CI because it involves multiple coefficients, each of which has its own SE estimate. How can I calculate those CIs, given that multiple coefficients are involved?
If I understand your question correctly, this would be one way to obtain the uncertainty around the predicted/fitted values of the interaction term:
library(glmmTMB)   # model fitting and the Salamanders data
library(ggeffects) # ggpredict()
m3 <- glmmTMB(count ~ spp + mined + spp * mined,
              zi = ~spp + mined + spp * mined,
              family = truncated_poisson, data = Salamanders) # added in the interaction
ggpredict(m3, c("spp", "mined"))
#> # Predicted counts of count
#> # x = spp
#> # mined = yes
#> x | Predicted | SE | 95% CI
#> --------------------------------------
#> GP | 1.59 | 0.92 | [0.26, 9.63]
#> PR | 1.13 | 0.66 | [0.31, 4.10]
#> DM | 1.74 | 0.29 | [0.99, 3.07]
#> EC-A | 0.61 | 0.96 | [0.09, 3.96]
#> EC-L | 0.42 | 0.69 | [0.11, 1.59]
#> DF | 1.49 | 0.27 | [0.88, 2.51]
#> # mined = no
#> x | Predicted | SE | 95% CI
#> --------------------------------------
#> GP | 2.67 | 0.11 | [2.15, 3.30]
#> PR | 1.59 | 0.28 | [0.93, 2.74]
#> DM | 3.10 | 0.10 | [2.55, 3.78]
#> EC-A | 2.30 | 0.17 | [1.64, 3.21]
#> EC-L | 5.25 | 0.07 | [4.55, 6.06]
#> DF | 2.68 | 0.12 | [2.13, 3.36]
#> Standard errors are on link-scale (untransformed).
plot(ggpredict(m3, c("spp", "mined")))
Created on 2020-08-04 by the reprex package (v0.3.0)
The ggeffects package calculates marginal effects / estimated marginal means (EMMs) with confidence intervals for your model terms. ggpredict() computes these EMMs based on predict(), ggemmeans() wraps the fantastic emmeans package, and ggeffect() uses the effects package.
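The question's specific case, a ratio built from the sum of two or three coefficients, can also be handled with a Wald interval computed on the link scale: the variance of the sum includes the covariances between the coefficients, not only their individual SEs, and the interval is exponentiated at the end. A minimal sketch in Python, assuming the coefficient vector and its covariance matrix have already been extracted from the fitted model; ratio_ci and the numbers below are hypothetical and are not taken from m3:

```python
import math
from statistics import NormalDist


def ratio_ci(beta, vcov, contrast, level=0.95):
    """Wald CI for exp(sum_i contrast[i] * beta[i]).

    beta     : list of coefficient estimates (link scale)
    vcov     : covariance matrix of beta, as a list of lists
    contrast : weights picking the coefficients to sum (e.g. 1 or 0)
    """
    k = len(beta)
    # Point estimate of the linear combination on the log scale.
    est = sum(contrast[i] * beta[i] for i in range(k))
    # Var(c'b) = c' V c, which includes the covariance terms.
    var = sum(contrast[i] * vcov[i][j] * contrast[j]
              for i in range(k) for j in range(k))
    se = math.sqrt(var)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return math.exp(est), math.exp(est - z * se), math.exp(est + z * se)


# Hypothetical estimates for two coefficients and their covariance matrix.
beta = [0.5, 0.2]
vcov = [[0.04, 0.01],
        [0.01, 0.09]]

ratio, lower, upper = ratio_ci(beta, vcov, contrast=[1, 1])
print("ratio = %.3f, 95%% CI = [%.3f, %.3f]" % (ratio, lower, upper))
```

For a comparison that involves three coefficients, the same function applies with a longer contrast vector. Summing the endpoints of the individual confint() intervals is not equivalent, because it ignores the covariance terms.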

how to use estat vif in the right way

I have 2 questions concerning estat vif to test multicollinearity:
Is it correct that you can only calculate estat vif after the regress command?
If I execute this command Stata only gives me the vif of one independent variable.
How do I get the vif of all the independent variables?
Q1. I find estat vif documented under regress postestimation. If you can find it documented under any other postestimation heading, then it is applicable after that command.
Q2. You don't give any examples, reproducible or otherwise, of your problem. But estat vif by default gives a result for each predictor (independent variable).
. sysuse auto, clear
(1978 Automobile Data)
. regress mpg weight price
Source | SS df MS Number of obs = 74
-------------+---------------------------------- F(2, 71) = 66.85
Model | 1595.93249 2 797.966246 Prob > F = 0.0000
Residual | 847.526967 71 11.9369995 R-squared = 0.6531
-------------+---------------------------------- Adj R-squared = 0.6434
Total | 2443.45946 73 33.4720474 Root MSE = 3.455
mpg | Coef. Std. Err. t P>|t| [95% Conf. Interval]
weight | -.0058175 .0006175 -9.42 0.000 -.0070489 -.0045862
price | -.0000935 .0001627 -0.57 0.567 -.000418 .0002309
_cons | 39.43966 1.621563 24.32 0.000 36.20635 42.67296
. estat vif
Variable | VIF 1/VIF
price | 1.41 0.709898
weight | 1.41 0.709898
Mean VIF | 1.41
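For reference, the VIF of a predictor is 1/(1 - R²), where R² comes from regressing that predictor on the other predictors. With exactly two predictors, as in the example above, that R² is the squared correlation between them, which is why price and weight share the same VIF. A minimal sketch in Python for the two-predictor case only; the function names and the data are hypothetical and are not the auto dataset:

```python
import math


def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


# Hypothetical predictor values, chosen only to illustrate the formula.
x1 = [1, 2, 3, 4, 5]
x2 = [2, 1, 4, 3, 5]

vif = vif_two_predictors(x1, x2)
print("VIF = %.2f, 1/VIF = %.6f" % (vif, 1.0 / vif))
```

With more than two predictors, each VIF needs its own auxiliary regression, so the values generally differ from one predictor to the next.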

Spark Iterated Function CUSUM

I'm still fairly new to Spark and I'm struggling to implement an iterated function. I'm hoping someone can help me out?
In particular, I'm trying to implement the CUSUM control statistic:
$ S_i = \max(0, S_{i-1} + x_i - Target - w) $ with $ S_0 = 0 $ and $ w, Target $ being fixed parameters.
The challenge is that the CUSUM statistic is defined as an iterated function which requires ordered data and the previous function value.
The following data frame shows the desired output for $ Target = 1 $ and $ w = 0.1 $ :
i x S
1 1.3 0.2
2 1.8 0.9
3 0.5 0.3
4 0.6 0
5 1.2 0.1
6 1.8 0.8
On a different note: I guess it's not possible to run CUSUM in a distributed fashion? My data set is fairly large but contains multiple groups. I hope this means I can still achieve some concurrency. I guess I have to re-partition my data to have one single partition per group to run the CUSUM algorithm per group concurrently?
I hope this makes sense and any pointers are highly appreciated!
Ideally I am looking for a solution in Scala and Spark 2.1
Thanks a lot!
After a lot of Google research I found a solution to the problem using mapPartitions:
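import spark.implicits._ // needed for .toDS and the Double encoder

val dataset = Seq(1.3, 1.8, 0.5, 0.6, 1.2, 1.8).toDS
// A single partition, so the running state s sees every row.
dataset.repartition(1).mapPartitions(iterator => {
  var s = 0.0
  val target = 1.0
  val w = 0.1
  iterator.map(x => {
    s = Math.max(0.0, s + x - target - w)
    Math.round(10.0 * s) / 10.0
  })
}).show()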
| 0.2|
| 0.9|
| 0.3|
| 0.0|
| 0.1|
| 0.8|
I hope this will save someone some time in the future.
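On the question about groups: the recurrence only depends on the previous value within the same series, so each group can carry its own running state and the groups can be processed independently of one another. A minimal sketch of that logic in plain Python rather than Spark, assuming rows arrive as (group, i, x) tuples; cusum and cusum_by_group are hypothetical names, and the group labels are made up for illustration:

```python
from collections import defaultdict


def cusum(xs, target, w):
    """S_i = max(0, S_{i-1} + x_i - target - w), with S_0 = 0."""
    s = 0.0
    out = []
    for x in xs:
        s = max(0.0, s + x - target - w)
        # Round only the reported value, as the Scala version does.
        out.append(round(s, 1))
    return out


def cusum_by_group(rows, target, w):
    """rows: iterable of (group, i, x). Returns {group: [S_1, S_2, ...]}."""
    groups = defaultdict(list)
    for group, i, x in rows:
        groups[group].append((i, x))
    # Sort each group by its index i, then run the recurrence on that group alone.
    return {g: cusum([x for _, x in sorted(pairs)], target, w)
            for g, pairs in groups.items()}


# The series from the question, with Target = 1 and w = 0.1.
print(cusum([1.3, 1.8, 0.5, 0.6, 1.2, 1.8], target=1.0, w=0.1))

# Two hypothetical groups, deliberately interleaved and out of order.
rows = [("a", 2, 1.8), ("b", 1, 2.0), ("a", 1, 1.3), ("b", 2, 0.5)]
result = cusum_by_group(rows, target=1.0, w=0.1)
print(result)
```

In Spark, the same idea would mean bringing each group's rows together, ordering them by i, and then applying the per-group loop.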