Must the single predictor in the best-fitting 1-predictor model also be selected in the best-fitting 2-predictor model? (linear regression)

In best subset selection, the algorithm selects, for each model size, the model with the smallest residual sum of squares.
The question is whether the single predictor variable in the best-fitting 1-predictor model must also be selected in the best-fitting 2-predictor model.
I see that this holds in many real cases, but the algorithm itself does not seem to guarantee it.
I was wondering whether there is any statistical argument showing that it is true.
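For concreteness, here is the kind of brute-force check I have in mind, written as a small Python/numpy sketch (the simulated data-generating process is just an illustration, not a proof either way):

    import numpy as np
    from itertools import combinations

    rng = np.random.default_rng(1)
    n, p = 200, 5
    X = rng.normal(size=(n, p))
    y = X[:, 0] + X[:, 1] + rng.normal(size=n)   # a made-up true model

    def rss(cols):
        """Residual sum of squares of the least-squares fit on the given columns."""
        Xs = np.column_stack([np.ones(n), X[:, list(cols)]])
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
        resid = y - Xs @ beta
        return resid @ resid

    best1 = min(combinations(range(p), 1), key=rss)   # best subset of size 1
    best2 = min(combinations(range(p), 2), key=rss)   # best subset of size 2
    print("best 1-variable model:", best1)
    print("best 2-variable model:", best2)
    print("is the size-1 model nested in the size-2 model?", set(best1) <= set(best2))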

Related

Why does one independent variable get dropped in SPSS multiple regression?

I was running a linear multiple regression as well as a logistic multiple regression in SPSS.
When looking at the results afterwards, I realised that in each regression one independent variable had been automatically excluded by SPSS. Did I do something wrong, or what do I have to do in order to have all independent variables included in the regression?
Thanks for your help!
In LOGISTIC REGRESSION you can declare a predictor variable as categorical and thus use the single categorical variable directly. If you do this, you can specify the type of contrast coding to use and choose which category is used as the reference category.
The REGRESSION procedure doesn't have facilities for declaring predictors categorical, so if you have an intercept or constant in the model (which of course is the default) and you try to enter K dummy or indicator variables for a K-level categorical variable, one of them will be linearly dependent on the intercept and the other K-1 dummies, and as Kevin said, will be left out.
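As a small numeric illustration of that linear dependence (in Python/numpy rather than SPSS, with made-up data): an intercept plus all K dummies for a K-level factor gives a rank-deficient design matrix.

    import numpy as np

    levels = np.array([0, 1, 2, 0, 1, 2, 2, 0])   # a 3-level categorical variable
    dummies = np.eye(3)[levels]                   # all K = 3 indicator columns
    design = np.column_stack([np.ones(len(levels)), dummies])

    print("columns:", design.shape[1], " rank:", np.linalg.matrix_rank(design))
    # 4 columns but rank 3: the intercept equals the sum of the three dummies,
    # so one dummy is redundant and has to be left out.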
You may be confused by which one the procedure leaves out. When the default ENTER method is used in REGRESSION, it tries to enter all the independent variables, but does it one variable at a time and performs a tolerance check to avoid numerical problems associated with redundant or nearly redundant variables. The way it does this results in the last predictor specified being the first to enter, and after that, the variable with the smallest absolute correlation with that previously entered variable, then the remaining candidate variable with the least relationship with the two entered already, etc. In cases where there are ties on tolerance values, the later listed variable gets entered.

Usage of indicator functions as features in Sequential Models

I am currently using Mallet for training a sequential model using CRF. I have understood how to provide features (that depend solely on the input sequence) to the Mallet package. Based on my understanding, in Mallet we have to compute all the values of the feature functions upfront. Now, I would like to use indicator functions that depend on the label of a token. The value of these functions depends on the output label sequence, and during training I can compute them because the output label sequence is known. But when I apply the trained CRF model to a new input (whose output label sequence is unknown), how should I calculate the values of such features?
It would be very helpful if anyone could provide tips or relevant documents.
As you've phrased it, the question doesn't make sense: if you don't know the hidden labels, you can't set anything based on those unknown labels. An example might help.
You may not need to explicitly record these relationships. At training time the algorithm sets the parameters of the CRF to represent the relationship between the observed features and the unobserved state. Different CRF architectures can allow you to add dependencies between multiple hidden states.
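To illustrate the point, here is a minimal Python/numpy sketch of linear-chain CRF scoring (not Mallet, and the weights are toy values): input-only features reduce to per-token emission scores that can be computed upfront, while label-dependent indicator features reduce to transition scores that the decoder evaluates for every candidate labelling it considers, so they never need to be precomputed for unseen data.

    import numpy as np
    from itertools import product

    rng = np.random.default_rng(0)
    n_tokens, n_labels = 4, 3
    emission = rng.normal(size=(n_tokens, n_labels))     # from input-only features
    transition = rng.normal(size=(n_labels, n_labels))   # label-pair indicator weights

    def sequence_score(labels):
        """Unnormalized CRF score of one candidate label sequence."""
        score = emission[np.arange(n_tokens), list(labels)].sum()
        score += sum(transition[labels[t - 1], labels[t]] for t in range(1, n_tokens))
        return score

    # Brute-force decoding over all label sequences (Viterbi does this efficiently).
    best = max(product(range(n_labels), repeat=n_tokens), key=sequence_score)
    print("predicted labels:", best)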

ROC curve using Euclidean distance (MATLAB)

I am trying to understand the function perfcurve in MATLAB.
The documentation for the function confuses me at two points.
At one place, it says that
You can use perfcurve with any classifier or, more broadly, with any method that returns a numeric score for an instance of input data. By convention adopted here,
A high score returned by a classifier for any given instance signifies that the instance is likely from the positive class.
A low score signifies that the instance is likely from the negative classes.
At another point, it says that
perfcurve does not impose any requirements on the input score range. Because of this lack of normalization, you can use perfcurve to process scores returned by any classification, regression, or fit method. perfcurve does not make any assumptions about the nature of input scores or relationships between the scores for different classes.
So I was using Euclidean distance for a face-recognition (user identification) problem, to decide whether a user is already enrolled in the database or not. Since Euclidean distance is a measure of dissimilarity rather than similarity, a lower score denotes a 1 and a higher score denotes a 0. Can I then use these output scores directly as an argument to perfcurve, or do I need to modify them in some way?
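To make the setup concrete, here is a small sketch in Python, with scikit-learn's roc_curve standing in for perfcurve (the labels and distances are made up). One common convention is to negate a dissimilarity so that larger scores again mean "more likely positive", matching the convention quoted above; I am not sure whether perfcurve requires this or handles it some other way.

    import numpy as np
    from sklearn.metrics import roc_curve, auc

    labels = np.array([1, 1, 0, 1, 0, 0, 1, 0])                      # 1 = enrolled user
    distances = np.array([0.2, 0.5, 1.4, 0.3, 1.1, 0.9, 0.6, 1.6])   # Euclidean distances

    # Negate the distances so that higher scores correspond to the positive class.
    fpr, tpr, thresholds = roc_curve(labels, -distances, pos_label=1)
    print("AUC with negated distances:", auc(fpr, tpr))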
This is the output I am currently getting for SIFT-based matching. Either there is some problem with my implementation, or the plot isn't correct. I need to figure that out.

Logistic Regression with variables that do not vary

A few questions about constant variables and logistic regression:
Let's say I have a continuous variable that takes only one value across the whole data set. I know I should ideally eliminate the variable, since it brings no predictive value. Instead of doing this manually for each feature, does logistic regression set the coefficient of such variables to 0 automatically?
If I use such a variable (that has only one value) in Logistic Regression with L1 regularization, will the regularization force the coefficient to 0?
Along similar lines, suppose I have a categorical variable with 3 levels: the first level spans about 60% of the data set, the second about 35%, and the third about 5%. If I split the data into training and test sets, there is a good chance that the third level does not end up in the test set, leaving us with a level that appears in the training set but not in the test set. How do I handle such scenarios? Does regularization take care of things like this automatically?
Regarding question 3)
If you want to be sure that both the training set and the test set contain samples from each level of the categorical variable, you can simply divide each subgroup into training and test portions and then combine them again (a stratified split), for example as sketched below.
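One possible implementation uses scikit-learn's train_test_split; X, y, and the categorical column cat below are placeholders for your own data.

    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    cat = rng.choice([0, 1, 2], size=1000, p=[0.60, 0.35, 0.05])   # 3-level factor
    X = np.column_stack([cat, rng.normal(size=1000)])
    y = rng.integers(0, 2, size=1000)

    # Stratifying on the categorical column keeps the rare third level represented
    # in both the training and the test set at roughly its 5% share.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=cat, random_state=0)
    print(np.bincount(X_train[:, 0].astype(int)), np.bincount(X_test[:, 0].astype(int)))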
Regarding question 1) and 2)
The coefficient for a variable with zero variance should be zero, yes. However, whether such a coefficient is "automatically" set to zero, or the variable is excluded from the regression, depends on the implementation.
If you implement logistic regression yourself, you can post the code and we can discuss it specifically.
I recommend finding an implemented version of logistic regression and testing it on toy data. Then you will have your answer as to whether the coefficient is set to zero (which I assume it will be).
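For instance, a quick toy-data check with scikit-learn's LogisticRegression (one implementation among many; other packages may behave differently), where the simulated data contain a zero-variance column:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 500
    informative = rng.normal(size=n)
    constant = np.full(n, 3.0)                        # a feature with only one value
    X = np.column_stack([informative, constant])
    y = (informative + rng.normal(scale=0.5, size=n) > 0).astype(int)

    model = LogisticRegression(penalty="l1", solver="saga", C=1.0, max_iter=5000)
    model.fit(X, y)
    print("coefficients:", model.coef_, "intercept:", model.intercept_)
    # Expectation: the coefficient of the constant column comes out at (or very
    # near) zero, since the unpenalized intercept already absorbs any constant
    # shift and the L1 penalty shrinks the redundant coefficient.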

Algorithm to detect a linear behaviour in a data set

I have posted a question about an algorithm to make a polynomial fit of a part of a data set some time ago and received some propositions to do what I wanted. But I face another problem now that I try to apply the ideas suggested in the answers.
My goal was to find the best linear fit of a data set, in which only a part of it was linear.
Here is an example of what I must do:
We have these two data sets, and I must fit a linear trend to the linear part of the data that lies to the left of the dashed line. In red, we have the ideal data set, which has a linear part from the beginning up to the dashed line. In blue, we have the 'problematic' data set, which has a plateau. The bold part is the part I have to use for the linear fit.
My problem is that I tried to do as mentioned in the question linked above: I computed the second-order derivative of the smoothed data and looked for where it was no longer 'close enough' to 0. But here are my results for the problematic data set (first image) and for the ideal data set (second image):
(Sorry for quality, I don't know why it is so blurred)
On both images, I plotted the first-order derivative and, in red, the second-order derivative. On the first image, we see peaks in the second-derivative values. But the problem is that the peaks are not very 'high', which makes it difficult to establish a threshold that would tell whether the set is linear or not... On the contrary, the peak of the first derivative is quite high, making it easy to see visually.
I thought that calculating the mean of the first-derivative values and looking for where the values differ too much from that mean would be enough... But when I take the mean of the first-derivative values, there is a sort of offset due to the peak.
How can I efficiently remove this offset so that I take only the mean of the data to the right of the peak (the data to the left of the discontinuity seen in Image 1 could be nonlinear, or be linear but with a different value from the values on the right)?
The mean operator (as you have noticed) is very sensitive to outliers (peaks). You may wish to use more robust estimators, such as the median or an x-th percentile of the values (which should be more appropriate for your case).
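For example, a sketch of that idea on simulated first-derivative values (the values and the cut-off factor of 5 are assumptions, not taken from your data):

    import numpy as np

    rng = np.random.default_rng(0)
    d1 = np.concatenate([
        rng.normal(loc=2.0, scale=0.05, size=490),   # slope of the linear part
        rng.normal(loc=15.0, scale=1.0, size=10),    # the peak near the discontinuity
    ])

    print("mean  :", d1.mean())        # pulled upward by the peak
    print("median:", np.median(d1))    # stays close to the linear-part slope

    med = np.median(d1)
    mad = np.median(np.abs(d1 - med))  # median absolute deviation: robust spread
    keep = np.abs(d1 - med) < 5.0 * mad
    print("points kept as 'linear':", int(keep.sum()), "of", d1.size)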