How to interpret coefficients and p-values in multiple linear regression with two categorical variables and interaction - linear-regression

I am new to linear regression so I hope you can help me with interpreting the output of a multiple linear regression with two categorical predictor variables and an interaction term.
I did the following linear regression:
lm(H1A1c ~ Vowel * Speaker, data=data)
Vowel and Speaker are both categorical variables. Vowel can be "breathy", "modal" or "creaky" and there are four different speakers (F01, F02, M01, M02). I want to see if a combination of those two categories can predict the values for H1A1c.
My output is this:
[screenshot: output of lm]
Please correct me if I am wrong but I think we can see from this output that the relationship between most of my variables can't be characterised as linear. What I don't really understand is how to interpret the first p-value. When I googled I found that all the other p-values refer to the relationship of the respective coefficient and what this coefficient relates to. E.g. the p-value in the third line refers to the relationship of the coefficient of the third line to the first one, i.e. 23.1182-9.6557.
What about the p-value of the first coefficient, though? There can't be a linear relationship if there is no relationship? What does this p-value refer to?
Thanks in advance for your answers!

The first p-value (Intercept) tests the null hypothesis that the y-intercept of your fitted model is zero (i.e., that the fit passes through the origin). Since the p-value in your result is far below 0.05, you can conclude that the y-intercept is significantly different from zero.
The other p-values are to be interpreted differently. Your interpretation is correct: each one tells you whether the coefficient it belongs to is likely to be zero or not.
the p-value in the third line refers to the relationship of the coefficient of the third line to the first one, i.e. 23.1182-9.6557
The estimate of -9.6557 means that, on average, the predicted value of H1A1c is 9.6557 units lower when GlottalContext = creaky (i.e. GlottalContextcreaky = 1) than when GlottalContext = breathy (breathy being your reference category here), keeping all other predictors unchanged. That reading is supported when the corresponding p-value is below 0.05, which, as I can see, is the case for GlottalContextcreaky.
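To make the reference-cell logic concrete, here is a minimal sketch in MATLAB (the question uses R's lm; MATLAB's fitlm with categorical predictors behaves analogously, and the data below are made up):
% Hypothetical data: 120 tokens covering every Vowel x Speaker combination
Vowel   = categorical(repmat({'breathy'; 'modal'; 'creaky'}, 40, 1));
Speaker = categorical(repmat({'F01'; 'F02'; 'M01'; 'M02'}, 30, 1));
H1A1c   = randn(120, 1);                 % placeholder response values
tbl = table(H1A1c, Vowel, Speaker);
mdl = fitlm(tbl, 'H1A1c ~ Vowel*Speaker');
disp(mdl.Coefficients)
% (Intercept) = predicted mean of H1A1c when both factors sit at their
% reference levels (the first category of each). Each main-effect row is
% the difference from that reference cell while the other factor stays at
% its reference level; interaction rows add the extra adjustment for
% specific level combinations.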
(Additionally, assuming H1A1c is a continuous variable, I am not sure that linear regression is the best way to predict it, since both of your predictors are categorical. You might want to explore other approaches, e.g. convert your dependent variable to a categorical one and fit a binary/multinomial logistic regression, or use a decision tree.)


Pearson correlation coefficient

This question of mine is not tightly related to Matlab, but it is relevant to it:
I'm looking for a few nontrivial ways to fill in the matrix [[a,b,c],[d,e,f]] so that as many entries as possible in
corrcoef([a,b,c],[d,e,f])
are zero. My attempts yield NaN results in most cases.
Given the current comments, you are trying to understand how two series of random draws from two distributions can have zero correlation. Specifically, exercise 4.6.9 to which you refer mentions draws from two normal distributions.
An issue with your approach is that you are hoping to link a theoretical property to an experiment, in this case using Matlab. And, as you seem to have noticed, unless you look at specific degenerate cases, the experiment will fail. That is because, although the true correlation parameter rho in the exercise might be zero, a sample of random draws will almost always have some nonzero sample correlation. Here is an illustration; as you'll notice if you run it, the sample correlations span the whole range between -1 and 1, even though their average is close to zero (as it should be, since the two generators are independent):
n = 1e4;
experiment = nan(n,1);
for i = 1:n
    % correlation between two independent samples of four uniform draws
    r = corrcoef(rand(4,1), rand(4,1));
    experiment(i) = r(1,2);   % off-diagonal element = sample correlation
end
hist(experiment);
title(sprintf('Average correlation: %.4f', mean(experiment)));
If you look at the definition of the Pearson correlation on Wikipedia, you will see that the only way it can be zero is when the numerator is zero, i.e. when the sum of (X_i - Xbar)(Y_i - Ybar) over the sample is zero. Though this holds asymptotically for independent variables, you will be hard-pressed to find a non-degenerate case where it happens exactly in a small sample. Nevertheless, to show that you can derive some such degenerate cases, let's dig a bit further. For the sum of these products to be zero, you could make either the left or the right factor zero whenever the other one is non-zero. For one factor to be zero, the draw must be exactly equal to the average of the draws. Therefore we can imagine creating such a pair of variables using this technique:
we create two vectors of 4 draws each, and alternate which draws are set equal to the average.
let's say we want X to average 1, and Y to average 2, and we make even-indexed draws equal to the average for X and odd-indexed draws equal to the average for Y.
one such generation would be: X=[0,1,2,1], Y=[2,0,2,4], and you can check that corrcoef([0,1,2,1],[2,0,2,4]) does in fact produce an identity matrix. This is because, every time a component of X is different than its average of 1, the component in Y is equal to its average of 2.
another example, where the average of X is 3 and that of Y is 4, is X=[3,-5,3,11], Y=[1008,4,-1000,4], and so on (both examples are checked in the snippet below).
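A quick check of the two constructions above:
% Verify that both hand-built pairs have exactly zero sample correlation
X1 = [0 1 2 1];    Y1 = [2 0 2 4];
X2 = [3 -5 3 11];  Y2 = [1008 4 -1000 4];
disp(corrcoef(X1, Y1))   % identity matrix: off-diagonal entries are 0
disp(corrcoef(X2, Y2))   % identity matrix again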
If you wanted to know how to create samples from non-correlated distributions altogether, that would be an entirely different question, though (perhaps) more interesting in terms of understanding statistics. If this is your case, and given that the exercise you mention discusses normal distributions, I would suggest you take a look at antithetic variables and at generating normal draws with the Box-Muller transform.
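For reference, here is a rough sketch of the Box-Muller transform itself, which turns pairs of independent uniforms into pairs of independent standard normals (so their sample correlation is only close to zero, not exactly zero):
% Box-Muller: two independent uniforms -> two independent standard normals
n  = 1e5;
u1 = rand(n,1);  u2 = rand(n,1);
z1 = sqrt(-2*log(u1)) .* cos(2*pi*u2);
z2 = sqrt(-2*log(u1)) .* sin(2*pi*u2);
corrcoef(z1, z2)   % off-diagonal entries are near 0 for large n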
Happy randomizing!

Anomaly in accuracy calculation

I am classifying a dataset with four classes using pretrained VGG19. To calculate accuracy, I used this formula:
accuracy = sum(predictedLabels == testLabels) / numel(predictedLabels)   (Eq. 1)
Then I calculated the confusion matrix using:
confMat = confusionmat(testLabels, predictedLabels)   (Eq. 2)
From which I got a matrix with 4 rows and 4 columns since I had 4 classes.
Now, we know that accuracy can also be written as:
Accuracy = (TP + TN) / (TP + TN + FP + FN)   (Eq. 3)
So I also calculated accuracy from the confusion matrix obtained through Eq. 2 above, where
TP = the value where row == column,
FP = sum of the column - TP,
FN = sum of the row - TP,
TN = sum of the diagonal - TP.
If I am doing the above steps right, my confusion is that I get different accuracies from the two methods, Eq. 1 and Eq. 3. The accuracy I get with Eq. 1 is equivalent to the formula TP/(TP+TN). So, if that is the case, then Eq. 1 is the wrong formula for calculating accuracy. Yet this formula is used across all the MATLAB deep learning examples.
So either MATLAB is doing something wrong (which has probability 0, I know) or I am doing something wrong. But, unfortunately, I am unable to pinpoint my mistake.
Now, the question is,
Am I doing it wrong? Where am I missing the step? How to correct it? What is the logical explanation of this anomaly?
EDIT
This anomaly in the accuracy calculation happens due to the class imbalance problem, that is, when there are different numbers of samples in each class; the regular accuracy formula in Eq. 3 will therefore not work in such cases.
The main issue is that 'negative' and 'positive' apply to a binary prediction (is this a cat or not), while you are doing classification with more than two categories. The classifier doesn't give you a positive/negative decision, so its answers cannot be labelled directly as true positives, false positives, etc. Therefore Eq. 3, as you applied it, is meaningless, and so is the method for computing TP, TN, etc. For example, if TP is the value where row == column as you defined, then these are the correct predictions on the diagonal of confMat. But what is TN? According to your definition it is the sum of the diagonal minus the class's own diagonal entry, which is not a count of true negatives at all. I hope this helps put things on the right track.
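For what it's worth, here is a sketch of how per-class counts and the overall accuracy are usually derived from a multiclass confusion matrix (assuming testLabels and predictedLabels as in the question, and confusionmat's convention of rows = true class, columns = predicted class):
confMat = confusionmat(testLabels, predictedLabels);   % Eq. 2
N  = sum(confMat(:));
TP = diag(confMat);              % correct predictions of each class
FP = sum(confMat, 1)' - TP;      % predicted as class k but truly another class
FN = sum(confMat, 2)  - TP;      % truly class k but predicted as another class
TN = N - TP - FP - FN;           % everything not involving class k at all
overallAccuracy = sum(TP) / N;   % identical to Eq. 1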

Kalman Filter prediction error estimation: why two constants and transposed matrices?

Hi everybody!
I have found a very informative and good tutorial for understanding the Kalman filter. In the end I would like to understand the Extended Kalman Filter in the second half of the tutorial, but first I want to clear up one mystery.
Kalman Filter tutorial Part 6.
I think we multiply the prediction error by a constant because the value at a given time step k can differ from the previous one. But why do we multiply by the constant twice? The tutorial says:
we multiply twice by a because the prediction error pk is itself a squared error; hence, it is scaled by the square of the coefficient associated with the state value xk.
I can't see the meaning of this sentence.
And later in the EKF he creates a matrix and a transposed matrix from that (in Part 12). Why the transposed one?
Thanks a lot.
The Kalman filter maintains its error estimates as variances, which are squared standard deviations. When you multiply a Gaussian random variable N(x, p) by a constant a, you scale its standard deviation by |a|, which means its variance scales by a^2. He writes this as a*p*a to keep a parallel structure for when he converts from a scalar state to a matrix state: if you have an error covariance matrix P representing the state x, then the error covariance of A*x is A*P*A^T, as he shows in Part 12. It's a convenient shorthand for that calculation; you can expand the matrix multiplication by hand to see that every coefficient ends up in the right place.
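Here is a small numerical check of that statement (the values of A and P are arbitrary, purely for illustration):
% Draw many samples of a state x with covariance P and check cov(A*x) == A*P*A'
P = [2.0 0.3; 0.3 1.0];              % arbitrary example covariance of x
A = [1.0 0.1; 0.0 1.0];              % arbitrary example state-transition matrix
n = 1e6;
X = chol(P, 'lower') * randn(2, n);  % samples whose covariance is approximately P
cov_empirical = cov((A*X)');         % sample covariance of A*x
cov_theory    = A * P * A';          % what the filter propagates
disp(cov_empirical); disp(cov_theory);   % the two agree closely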
If any of this is fuzzy to you, I strongly recommend you read a tutorial on Gaussian random variables. Between x and P in a Kalman filter, your success depends a lot more on your understanding of P than of x, even though most people get started by being interested in improving x.

Matlab's VAR[X] coefficient constraints for vector time series

Matlab's VARMAX model allows the user to set flags that determine whether individual linear coefficients are to be estimated. In particular, vgxset accepts an ARsolve parameter containing flags that determine whether individual time series lag coefficients are estimated. The fact that there are individual scalar flags for each scalar lag term implies each coefficient can be activated independently.
I have 3 questions concerning this flexible feature.
(1) Does turning off a flag essentially mean that the corresponding coefficient is zero?
(2) Where is the documentation of which switch is for which coefficient? That is, for a given lag, if I wanted to turn on the coefficient for the dependence of series i on series j, would I turn on flag (i,j) or (j,i)?
(3) Since AR0solve is ignored, does that mean that there is no contemporaneous dependence between time series?
I have posted this to:
Matlab's VAR[X] coefficient constraints for vector time series
http://groups.google.com/forum/#!topic/comp.soft-sys.matlab/5AIeQYoqeWg
On Friday, May 29, 2015 at 2:05:06 PM UTC-4, Rick wrote:
(1) No, not necessarily. Turning off a flag (i.e., setting a particular element of an input "solve" flag to logical FALSE) holds the corresponding parameter value fixed throughout the estimation. For example, if the 3rd element of the "asolve" parameter is FALSE (logical 0) and the 3rd element of the corresponding value parameter "a" is such that a(3) = 0, then the estimation will effectively exclude that coefficient from the model.
The important thing is that the parameter is held fixed at whatever you specify. Of course, to hold a coefficient fixed you also need to indicate its value, and so the "asolve" and "a" parameters must both be set. These values do not necessarily need to be zero, although zeros (i.e., exclusion constraints) are very common.
(2) The best I have found is the reference page for the "vgxset" function. There might be specific examples in the documentation, but the reference page is the place I'd start.
As for your 2nd sentence, I think you are over-thinking the storage. There is a 1-to-1 correspondence between the solve parameter and its paired value. I suggest you simply write out a 2-D VAR model and contrive a simple experiment. The model, and the placement of coefficients and corresponding TRUEs/FALSEs, conforms to the linear algebra of the equation.
(3) Yes, at least from the perspective of model estimation. That is, "vgxvarx" will not estimate "structural" VAR models, and so the corresponding "AR0" structural coefficient is not estimated. You can specify a non-identity AR0 coefficient, in which case the estimation simply fits the VAR model to the modified series A0*y(t). So, in this case you can effect contemporaneous dependence between the series, but you cannot estimate it.
On Friday, May 29, 2015 at 2:40:28 PM UTC-4, paul.d...#gmail.com wrote:
I just want to check a specific detail about answer #2. I am thinking of the matrix representation of the equations when I mull over the vgxset parameters. Are the Boolean flags for solving the coefficients supposed to form a symmetric matrix? I was more interested in the asymmetric case where, e.g., for a given lag, series i depends on series j according to some coefficient that is different from the dependence of series j on series i. If that constraint is not necessary, and the flags occupy exactly the same positions as the coefficients themselves, I think I can run with that.
On Saturday, May 30, 2015 at 6:41:04 AM UTC-4, Rick wrote:
Paul, no, the Boolean flags need not form a symmetric matrix. Best, Rick
So getting the meaning of the row and column of each flag becomes important, and that meaning comes from the coefficient matrix to which the flag corresponds. Based on my understanding of the vector setup, the row represents the dependent series while the column represents the predictor series (illustrated in the sketch below).
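As an illustration of that reading, here is a rough sketch of the 2-D experiment Rick suggests; the vgxset name-value pairs should be double-checked against its reference page, and note that vgxset/vgxvarx were removed in later MATLAB releases in favor of varm/estimate:
% 2-series VAR(1): y(t) = AR1*y(t-1) + e(t)
% Entry (i,j) of AR1 multiplies lagged series j in the equation for series i,
% so the (i,j) flag controls whether "series i depends on series j" is estimated.
AR1      = {[0.5  0.0;          % series 1 depends only on its own lag
             0.2  0.3]};        % series 2 depends on lagged series 1 and 2
ARsolve1 = {logical([1 0;       % FALSE at (1,2): hold that coefficient fixed (at 0 above)
                     1 1])};
Spec = vgxset('n', 2, 'nAR', 1, 'AR', AR1, 'ARsolve', ARsolve1);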

How to select the top 100 features (a subset) which are most relevant after PCA?

I performed PCA on a 63x2308 matrix and obtained a score matrix and a coefficient matrix. The score matrix is 63x2308 and the coefficient matrix is 2308x2308 in dimensions.
How do I extract the column names for the top 100 features which are most important, so that I can perform regression on them?
PCA should give you both a set of eigenvectors (your coefficient matrix) and a vector of eigenvalues (1x2308, often referred to as lambda). You might need to use a different PCA function in Matlab to get them.
The eigenvalues indicate how much of the variance in your data each eigenvector explains. A simple approach would be to keep the 100 components with the highest eigenvalues; this gives you the set of components that explains most of the variance in the data.
If you need to justify your approach in a write-up, you can calculate the amount of variance explained per eigenvector and cut off at, for example, 95% variance explained (see the sketch after the code example below).
Bear in mind that selecting based solely on eigenvalues might not give you the set of features most relevant to your regression, so if you don't get the performance you expect you might want to try a different feature selection method, such as recursive feature selection. I would suggest using Google Scholar to find a couple of papers doing something similar and seeing what methods they use.
A quick Matlab example of taking the top 100 principal components using PCA:
% princomp returns components already sorted by decreasing eigenvalue (pca is its newer replacement)
[eigenvectors, projected_data, eigenvalues] = princomp(X);
[foo, feature_idx] = sort(eigenvalues, 'descend');               % kept for clarity; already descending
selected_projected_data = projected_data(:, feature_idx(1:100));
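And a sketch of the 95%-variance cutoff mentioned above, using the newer pca function (assuming X is the 63x2308 data matrix):
[coeff, score, latent] = pca(X);              % latent holds the eigenvalues, sorted descending
explained = cumsum(latent) / sum(latent);     % cumulative fraction of variance explained
k = find(explained >= 0.95, 1);               % smallest number of components reaching 95%
selected_scores = score(:, 1:k);              % keep those components for the regression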
Have you tried with
B = sort(your_matrix,2,'descend');
C = B(:,1:100);
Be careful!
With just 63 observations and 2308 variables, your PCA result will be close to meaningless because the data are underspecified; as a rule of thumb, you should have at least about three times as many observations as dimensions.
With 63 observations, you can at most define a 62-dimensional hyperspace!