How do I jointly test a multi-level categorical effect in IPython using statsmodels?

I am using the Ordinary Least Squares (ols) function in statsmodels in ipython to fit a linear model where one covariate (City) is a multi-level categorical effect:
result=smf.ols(formula="Y ~ C(City) + X*C(Group)",data=s).fit();
(X is continuous, Group is a binary categorical variable).
When I call result.summary(), I get one row per level of City. What I would like to know is the overall significance of the 'City' covariate (i.e., compare Y~C(City)+X*C(Group) with the partial model Y~X*C(Group)).
Is there a way of doing it?
thanks in advance

Thank you user333700!
Here's an elaboration of your hint. I generate data with a 3-level categorical variable, use statsmodels to fit a model, and then test all levels of the categorical variable jointly:
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
# 1. generate data
def rnorm(n, u, s):
    # n draws from a normal distribution with mean u and standard deviation s
    return np.random.standard_normal(n) * s + u
a = rnorm(100, -1, 1)
b = rnorm(100, 0, 1)
c = rnorm(100, +1, 1)
n = rnorm(300, 0, 1)  # some noise
y = np.concatenate((a, b, c)) + n
g = np.zeros(300)
g[0:100] = 1
g[100:200] = 2
g[200:300] = 3
df = pd.DataFrame({'Y': y, 'G': g, 'N': n})
# 2. fit model
r = smf.ols(formula="Y ~ N + C(G)", data=df).fit()
print(r.summary())
# 3. joint test
print(r.params)
A = np.identity(len(r.params))  # identity matrix with size = number of params
GroupTest = A[1:3, :]  # for the categorical var., keep the corresponding rows of A
CovTest = A[3, :]  # row for the continuous var.
print("Group effect test", r.f_test(GroupTest).fvalue)
print("Covariate effect test", r.f_test(CovTest).fvalue)
The result should be something like this:
Intercept -1.188975
C(G)[T.2.0] 1.315898
C(G)[T.3.0] 2.137431
N 0.922038
dtype: float64
Group effect test [[ 120.86097747]]
Covariate effect test [[ 259.34155851]]

Brief answer:
You can use anova_lm (type 3) directly, or use f_test or wald_test, either constructing the constraint matrix yourself or providing the constraints of the hypothesis as a sequence of formulas.
http://statsmodels.sourceforge.net/devel/anova.html
http://statsmodels.sourceforge.net/devel/generated/statsmodels.regression.linear_model.RegressionResults.f_test.html
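The comparison asked for in the question (full model versus the partial model without the categorical term) is a nested-model F test. For ordinary least squares with the default covariance, this should agree with the joint F test on the corresponding dummy coefficients. Below is a minimal sketch in plain Python on made-up data, using an intercept-only restricted model so that both residual sums of squares can be computed by hand. The helper names rss_about_mean and nested_f_stat are illustrative and are not part of statsmodels.

```python
def rss_about_mean(values):
    """Residual sum of squares when every fitted value is the sample mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def nested_f_stat(rss_restricted, rss_full, df_diff, df_resid_full):
    """F statistic for comparing a restricted model against a full model."""
    return ((rss_restricted - rss_full) / df_diff) / (rss_full / df_resid_full)


# Made-up data: one list of responses per level of a 3-level factor.
groups = {"g1": [1.0, 2.0, 3.0], "g2": [2.0, 3.0, 4.0], "g3": [5.0, 6.0, 7.0]}
all_y = [y for ys in groups.values() for y in ys]

# Full model Y ~ C(G): each level is fitted by its own mean.
rss_full = sum(rss_about_mean(ys) for ys in groups.values())
# Restricted model Y ~ 1: every observation is fitted by the grand mean.
rss_restricted = rss_about_mean(all_y)

df_diff = len(groups) - 1                 # parameters dropped: 2 dummy columns
df_resid_full = len(all_y) - len(groups)  # n minus parameters in the full model

f_value = nested_f_stat(rss_restricted, rss_full, df_diff, df_resid_full)
print("Joint F statistic for the factor:", f_value)
```

With additional covariates in both models the formula is unchanged; only the two residual sums of squares and the degrees of freedom come from the fitted models instead.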

Related

How to decide the range for the hyperparameter space in SVM tuning? (MATLAB)

I am tuning an SVM using a for loop to search over a range of the hyperparameter space. The learned SVM model contains the following fields:
SVMModel: [1×1 ClassificationSVM]
C: 2
FeaturesIdx: [4 6 8]
Score: 0.0142
Question 1) What is the meaning of the field 'Score' and what is it used for?
Question 2) I am tuning the BoxConstraint (the C value). Let the number of features be denoted by the variable featsize. The variable gridC contains the search space, which can start from any value, say 2^-5, 2^-3, up to 2^15, so gridC = 2.^(-5:2:15). Is there a principled way to select this range?
1. The score is described in the MATLAB documentation, which says:
Classification Score
The SVM classification score for classifying observation x is the signed distance from x to the decision boundary ranging from -∞ to +∞.
A positive score for a class indicates that x is predicted to be in that class. A negative score indicates otherwise.
In the two-class case, if there are six observations and the predict function returns a score matrix called TestScore, we can determine the class assigned to each observation with:
TestScore=[-0.4497 0.4497;
-0.2602 0.2602;
-0.0746 0.0746;
0.1070 -0.1070;
0.2841 -0.2841;
0.4566 -0.4566;];
[~,Classes] = max(TestScore,[],2);
In two-class classification we can also use find(TestScore > 0) instead. Either way, the first three observations belong to the second class, and the 4th to 6th observations belong to the first class.
In multiclass cases there can be several scores > 0, but max(scores,[],2) is still valid. For example, the following code (from an example called Find Multiple Class Boundaries Using Binary SVM) determines the classes of the samples in Samples.
for j = 1:numel(classes)
    [~,score] = predict(SVMModels{j},Samples);
    Scores(:,j) = score(:,2); % Second column contains positive-class scores
end
[~,maxScore] = max(Scores,[],2);
maxScore then holds, for each sample, the index of the predicted class.
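The same "pick the column with the largest score" rule can be sketched in plain Python. This is only an illustration of the logic behind max(TestScore,[],2), using the six score rows from above; predict_from_scores is a hypothetical helper, and the class labels 1 and 2 are chosen to mirror MATLAB's 1-based column indices.

```python
def predict_from_scores(score_rows, class_names):
    """Return, for each row of scores, the class whose score is largest."""
    predictions = []
    for row in score_rows:
        best_col = max(range(len(row)), key=lambda j: row[j])
        predictions.append(class_names[best_col])
    return predictions


# One row per observation, one column per class (same values as TestScore).
test_score = [
    [-0.4497, 0.4497],
    [-0.2602, 0.2602],
    [-0.0746, 0.0746],
    [0.1070, -0.1070],
    [0.2841, -0.2841],
    [0.4566, -0.4566],
]

labels = predict_from_scores(test_score, class_names=[1, 2])
print(labels)  # [2, 2, 2, 1, 1, 1]
```

The function works unchanged for more than two columns, which is the one-vs-rest situation described for the multiclass case.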
2. The BoxConstraint denotes C in the SVM model, so we can train SVMs with different hyperparameter values and select the best one with something like:
gridC = 2.^(-5:2:15);
for ii=1:length(gridC)
    SVModel = fitcsvm(data3,theclass,'KernelFunction','rbf',...
        'BoxConstraint',gridC(ii),'ClassNames',[-1,1]);
    %if (some constraints are met)
    %    %save the current SVModel
    %end
end
Note: Another way to implement this is using libsvm, a fast and easy-to-use SVM toolbox, which has the interface of MATLAB.
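On choosing the range itself: a common heuristic (the one suggested in the libsvm practical guide) is to start with a coarse, exponentially spaced grid such as 2^-5 ... 2^15 and then search a finer grid around the best coarse value. Below is a rough sketch of that idea in plain Python. The function toy_validation_error is a made-up stand-in for a real cross-validation error, and the finer grid's width and step are arbitrary assumptions.

```python
import math


def coarse_grid(low_exp=-5, high_exp=15, step=2):
    """Exponentially spaced candidates, the analogue of 2.^(-5:2:15)."""
    return [2.0 ** k for k in range(low_exp, high_exp + 1, step)]


def fine_grid(best_c, half_width=2.0, step=0.25):
    """Finer exponents centred on log2(best_c), spanning +/- half_width."""
    centre = math.log2(best_c)
    n_steps = int(round(2 * half_width / step))
    return [2.0 ** (centre - half_width + i * step) for i in range(n_steps + 1)]


def toy_validation_error(c):
    """Hypothetical error curve, minimal at C = 2^3.5 (not a real SVM)."""
    return (math.log2(c) - 3.5) ** 2


grid_c = coarse_grid()
best_coarse = min(grid_c, key=toy_validation_error)
best_fine = min(fine_grid(best_coarse), key=toy_validation_error)
print(len(grid_c), best_coarse, best_fine)
```

If the best coarse value sits at either end of the grid, that is usually a sign the range should be extended in that direction before refining.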

MATLAB Murphy's HMM Toolbox: Inconsistent Output Sequence and Label Statesname and Symbols

Hi, I have been using Murphy's HMM toolbox with Gaussian mixture outputs. In brief, I have 2 datasets for training. Each dataset comprises 2000 observations with 11 dimensions per observation. I implemented the following steps to observe the output path sequence.
N_states=2
N_Gaussian_Mixture=1
For each dataset, an HMM model was generated. The steps are:
Step 1: mixgauss_init() was used to generate the GMM signature for my training data.
Step 2: After declaring the matrices for Prior and Transmat, mhmm_em() was used to generate the HMM model for the training dataset.
Testing: 2 test samples from each dataset are used for testing with mhmm_logprob(). The outputs were correctly predicted from the log-likelihood scores in every run.
However, when I tried to observe the state sequence of the HMM (Dataset_123 with testdata_123) via mixgauss_prob() followed by viterbi_path(), the output sequences were inconsistent. For example, in the first run the output sequence can be 2221111111111, but when I rerun the program the sequence can change to 1111111111111 or 1111111111222. Initially I thought it could be due to my Prior matrix, but fixing the Prior value did not help.
Secondly, is there a way to assign labels to the states and symbols, as the MATLAB function hmmgenerate allows?
hmmgenerate(...,'Symbols',SYMBOLS) specifies the symbols that are emitted. SYMBOLS can be a numeric array or a cell array of the names of the symbols. The default symbols are integers 1 through N, where N is the number of possible emissions.
hmmgenerate(...,'Statenames',STATENAMES) specifies the names of the states. STATENAMES can be a numeric array or a cell array of the names of the states. The default state names are 1 through M, where M is the number of states.
Thank you for your time; I hope to hear from the experts.

Assessing performance of a zero inflated negative binomial model

I am modelling the diffusion of movies through a contact network (based on telephone data) using a zero inflated negative binomial model (package: pscl)
m1 <- zeroinfl(LENGTH_OF_DIFF ~ ., data = trainData, dist = "negbin")
(variables described below.)
The next step is to evaluate the performance of the model.
My attempt has been to do multiple out-of-sample predictions and calculate the MSE.
Using
predict(m1, newdata = testData)
I received a prediction for the mean length of a diffusion chain for each datapoint, and using
predict(m1, newdata = testData, type = "prob")
I received a matrix containing the probability of each datapoint being a certain length.
Problem with the evaluation: Since I have a 0 (and 1) inflated dataset, the model would be correct most of the time if it predicted 0 for all the values. The predictions I receive are good for chains of length zero (according to the MSE), but the deviation between the predicted and the true value for chains of length 1 or larger is substantial.
My question is:
How can we assess how well our model predicts chains of non-zero length?
Is this approach the correct way to make predictions from a zero inflated negative binomial model?
If yes: how do I interpret these results?
If no: what alternative can I use?
My variables are:
Dependent variable:
length of the diffusion chain (count [0,36])
Independent variables:
movie characteristics (both dummies and continuous variables).
Thanks!
It is straightforward to evaluate RMSPE (root mean square predictive error), but it is probably best to transform your counts beforehand, to ensure that the really big counts do not dominate this sum.
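As a rough illustration of why the transformation matters, here is a small plain-Python sketch of RMSPE computed on raw counts and on log(1 + count). The choice of log1p as the transform is an assumption (a square root would be another option), the rmspe helper is illustrative, and the numbers are made up.

```python
import math


def rmspe(observed, predicted, transform=math.log1p):
    """Root mean square predictive error on a transformed scale."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared = [(transform(o) - transform(p)) ** 2
               for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up counts with one very long chain, and made-up predicted means.
obs = [0, 0, 1, 3, 36]
pred = [0.2, 0.1, 1.5, 2.0, 10.0]

raw_error = rmspe(obs, pred, transform=lambda v: v)  # untransformed scale
log_error = rmspe(obs, pred)                         # log(1 + count) scale
print(raw_error, log_error)
```

On the raw scale the single chain of length 36 accounts for almost all of the error; on the transformed scale the contributions are far more balanced.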
You may find false negative and false positive error rates (FNR and FPR) to be useful here. FNR is the chance that a chain of actual non-zero length is predicted to have zero length (i.e. absence, also known as negative). FPR is the chance that a chain of actual zero length is falsely predicted to have non-zero (i.e. positive) length. I suggest doing a Google on these terms to find a paper in your favourite quantitative journals or a chapter in a book that helps explain these simply. For ecologists I tend to go back to Fielding & Bell (1997, Environmental Conservation).
First, let's define a repeatable example, that anyone can use (not sure where your trainData comes from). This is from help on zeroinfl function in the pscl library:
# an example from help on zeroinfl function in pscl library
library(pscl)
fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")
There are several packages in R that calculate these. But here's the by hand approach. First calculate observed and predicted values.
# store observed values, and determine how many are nonzero
obs <- bioChemists$art
obs.nonzero <- obs > 0
table(obs)
table(obs.nonzero)
# calculate predicted counts, and check their distribution
preds.count <- predict(fm_zinb2, type="response")
plot(density(preds.count))
# also the predicted probability that each item is nonzero
preds <- 1-predict(fm_zinb2, type = "prob")[,1]
preds.nonzero <- preds > 0.5
plot(density(preds))
table(preds.nonzero)
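Then get the confusion matrix (the basis of FNR and FPR):
# the confusion matrix is obtained by tabulating the dichotomized observations and predictions
confusion.matrix <- table(preds.nonzero, obs.nonzero)
# FNR: actual non-zero observations that are predicted to be zero
FNR <- confusion.matrix["FALSE","TRUE"] / sum(confusion.matrix[,"TRUE"])
FNR
# FPR: actual zero observations that are predicted to be non-zero
FPR <- confusion.matrix["TRUE","FALSE"] / sum(confusion.matrix[,"FALSE"])
FPR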
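The two error rates as defined earlier (FNR: actual non-zero predicted as zero; FPR: actual zero predicted as non-zero) can also be written out without a table. This is a plain-Python sketch on made-up data; error_rates is a hypothetical helper, and the 0.5 threshold matches the one used for preds.nonzero.

```python
def error_rates(observed_counts, prob_nonzero, threshold=0.5):
    """Return (FNR, FPR) for a zero / non-zero classification of counts."""
    false_neg = false_pos = n_pos = n_neg = 0
    for count, prob in zip(observed_counts, prob_nonzero):
        actual_nonzero = count > 0
        predicted_nonzero = prob > threshold
        if actual_nonzero:
            n_pos += 1
            if not predicted_nonzero:
                false_neg += 1  # non-zero chain predicted as zero
        else:
            n_neg += 1
            if predicted_nonzero:
                false_pos += 1  # zero chain predicted as non-zero
    fnr = false_neg / n_pos if n_pos else float("nan")
    fpr = false_pos / n_neg if n_neg else float("nan")
    return fnr, fpr


# Made-up observed counts and predicted probabilities of a non-zero count.
obs = [0, 0, 0, 1, 2, 5]
prob = [0.2, 0.7, 0.4, 0.9, 0.3, 0.8]

fnr, fpr = error_rates(obs, prob)
print(fnr, fpr)
```

Varying the threshold trades one error rate against the other, so it can be worth reporting both rates at more than one threshold.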
In terms of calibration, we can assess it visually or via a calibration regression.
# let's look at how well the counts are being predicted
library(ggplot2)
output <- as.data.frame(list(preds.count=preds.count, obs=obs))
ggplot(aes(x=obs, y=preds.count), data=output) + geom_point(alpha=0.3) + geom_smooth(col="blue")
Transforming the counts to "see" what is going on:
output$log.obs <- log(output$obs)
output$log.preds.count <- log(output$preds.count)
ggplot(aes(x=log.obs, y=log.preds.count), data=output[is.finite(output$log.obs) & is.finite(output$log.preds.count),]) + geom_jitter(alpha=0.3, width=.15, size=2) + geom_smooth(col="blue") + labs(x="Observed count (non-zero, natural logarithm)", y="Predicted count (non-zero, natural logarithm)")
In your case you could also evaluate the correlations between the predicted counts and the actual counts, either including or excluding the zeros.
So you could fit a regression as a kind of calibration to evaluate this. However, since the predictions are not necessarily counts, we can't use a Poisson regression; instead we can use a lognormal model, regressing the log prediction against the log observed and assuming a Normal response.
calibrate <- lm(log(preds.count) ~ log(obs), data=output[output$obs!=0 & output$preds.count!=0,])
summary(calibrate)
sigma <- summary(calibrate)$sigma
sigma
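For readers who want to see what this calibration regression computes, here is a plain-Python sketch of the same least-squares fit of log(prediction) on log(observation), dropping zeros first. The data are made up so that the answer is known in advance, and log_calibration is an illustrative helper. For a perfectly calibrated model the intercept would be near 0 and the slope near 1.

```python
import math


def log_calibration(observed, predicted):
    """Least-squares intercept and slope of log(predicted) on log(observed)."""
    pairs = [(math.log(o), math.log(p))
             for o, p in zip(observed, predicted) if o > 0 and p > 0]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up example: predictions follow 2 * sqrt(obs) for the non-zero counts,
# so the fit should recover intercept log(2) and slope 0.5.
obs = [0, 1, 2, 4, 8]
pred = [0.3, 2.0, 2.0 * math.sqrt(2.0), 4.0, 2.0 * math.sqrt(8.0)]

intercept, slope = log_calibration(obs, pred)
print(intercept, slope)
```

A slope well below 1, as in this toy case, means the predictions are compressed relative to the observed counts: long chains are under-predicted and short ones over-predicted.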
There are more fancy ways of assessing calibration I suppose, as in any modelling exercise ... but this is a start.
For a more advanced assessment of zero-inflated models, check out the ways in which the log likelihood can be used, in the references provided for the zeroinfl function. This requires a bit of finesse.

Weka Simple K means handling nominal attributes

I am trying to understand how simple K-means in Weka handles nominal attributes and why it is not efficient in handling such attributes.
I read that it calculates modes for such attributes. I want to know how the similarity is calculated.
Let's take an example:
Consider a dataset with 3 numeric attributes and one nominal attribute.
The nominal attribute has 3 values: A, B and C.
Instance1 has value A, Instance2 has value B and Instance3 has value A.
In this case, Instance1 may be more similar to Instance3 (depending on the other numeric attributes, of course). How will Simple K-means work in this case?
Follow up:
What if the nominal attribute has more (say 10) possible values?
You can convert each such nominal attribute to binary features, e.g. has_A, has_B, has_C. Then, if you scale them, i1 and i3 will be closer (in your example the mean of has_A will be above 0.5) and i2 will stand out more.
If the attribute has more values, you just add one binary feature for every possible value. Basically you pivot each nominal attribute.
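A minimal plain-Python sketch of this pivoting, using the three instances from the question. It only illustrates the encoding and the resulting distances on the nominal part; it is not how Weka itself is implemented, and the helper names are illustrative.

```python
def one_hot(value, categories):
    """Encode a nominal value as binary features (has_A, has_B, ...)."""
    return [1.0 if value == c else 0.0 for c in categories]


def squared_distance(u, v):
    """Squared Euclidean distance between two equally long vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


categories = ["A", "B", "C"]
i1 = one_hot("A", categories)  # Instance1
i2 = one_hot("B", categories)  # Instance2
i3 = one_hot("A", categories)  # Instance3

d13 = squared_distance(i1, i3)  # same nominal value
d12 = squared_distance(i1, i2)  # different nominal values
print(i1, i2, i3, d13, d12)
```

Any two instances with different values end up the same distance apart, regardless of how many values the attribute has, so the encoding carries no notion of some values being "closer" than others.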

how to find mean of columns in nested structure in MATLAB

I've organized some data into a nested structure that includes several subjects, 4-5 trials per subject, then identifying data like height, joint torque over a gait cycle, etc. So, for example:
subject(2).trial(4).torque
gives a matrix of joint torques for the 4th trial of subject 2, where the torque matrix columns represent degrees of freedom (hip, knee, etc.) and the rows represent time increments from 0 through 100% of a stride. What I want to do is take the mean of 5 trials for each degree of freedom and use that to represent the subject (for that degree of freedom). When I try to do it like this for the 1st degree of freedom:
for i = 2:24
numTrialsThisSubject = size(subject(i).trial, 2);
subject(i).torque = mean(subject(i).trial(1:numTrialsThisSubject).torque(:,1), 2);
end
I get this error:
??? Scalar index required for this type of multi-level indexing.
I know I can use a nested for loop to loop through the trials, store them in a temp matrix, then take the mean of the temp columns, but I'd like to avoid creating another variable for the temp matrix if I can. Is this possible?
You can use a combination of deal() and cell2mat().
Try this (use the built-in debugger to run through the code to see how it works):
for subject_k = 2:24
    % create temporary cell array for holding the matrices:
    temp_torques = cell(length(subject(subject_k).trial), 1);
    % deal the matrices from all the trials (copy to temp_torques):
    [temp_torques{:}] = deal(subject(subject_k).trial.torque);
    % convert to a matrix and concatenate all matrices over rows:
    temp_torques = cell2mat(temp_torques);
    % calculate mean of degree of freedom number 1 for all trials:
    subject(subject_k).torque = mean(temp_torques(:,1));
end
Notice that I use subject_k for the subject counter variable. Be careful with using i and j as variable names in MATLAB, as they are already defined as the imaginary unit (0 + 1.0000i).
As mentioned above in my comment, adding another loop and temp variable turned out to be the simplest execution.
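For comparison, the loop-and-temporary approach can be sketched in plain Python, here averaging one degree of freedom across trials at each time step. This is an illustration of the logic only, not MATLAB code; it assumes every trial has the same number of rows, and both the nested-list layout and the mean_across_trials helper are made up for the example.

```python
def mean_across_trials(trials, dof):
    """Average one degree of freedom over trials, separately per time step.

    trials: list of torque matrices, each a list of rows (time steps),
            each row a list with one value per degree of freedom.
    dof:    0-based column index of the degree of freedom to average.
    """
    columns = [[row[dof] for row in trial] for trial in trials]
    n_steps = len(columns[0])
    if any(len(col) != n_steps for col in columns):
        raise ValueError("all trials must have the same number of time steps")
    return [sum(col[t] for col in columns) / len(columns)
            for t in range(n_steps)]


# Two made-up trials, three time steps, two degrees of freedom each.
trials = [
    [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]],
    [[3.0, 30.0], [4.0, 40.0], [5.0, 50.0]],
]

hip_mean = mean_across_trials(trials, dof=0)
print(hip_mean)  # [2.0, 3.0, 4.0]
```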