Logistic Regression with variables that do not vary - linear-regression

A few questions around constant variables and logistic regression -
Let's say I have a continuous variable that has only one value across the whole data set. I know I should ideally eliminate the variable since it brings no predictive value. Instead of manually doing this for each feature, does Logistic Regression make the coefficient of such variables 0 automatically?
If I use such a variable (that has only one value) in Logistic Regression with L1 regularization, will the regularization force the coefficient to 0?
On similar lines, suppose I have a categorical variable with 3 levels (the first level spans say 60% of the data set, the second 35%, and the third 5%). If I split the data into training and test sets, there is a good chance that the third level may not end up in the test set, leaving me with a variable whose levels in the test set differ from those in the training set. How do I handle such scenarios? Does regularization take care of things like this automatically?
ND

Regarding question 3)
If you want to be sure that both the training and test sets contain samples from each level of a categorical variable, you can simply divide each subgroup into test and training portions and then combine these again (a stratified split).
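Most libraries will do this stratified split for you; here is a minimal sketch, assuming scikit-learn and made-up data with a 3-level categorical column:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: a 3-level categorical feature with roughly 60/35/5 proportions
rng = np.random.default_rng(0)
cat = rng.choice(["A", "B", "C"], size=1000, p=[0.60, 0.35, 0.05])
y = rng.integers(0, 2, size=1000)

# stratify=cat keeps each level's share roughly equal in train and test
cat_train, cat_test, y_train, y_test = train_test_split(
    cat, y, test_size=0.2, stratify=cat, random_state=0
)
print(np.unique(cat_test, return_counts=True))  # all three levels appear in the test set
```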
Regarding question 1) and 2)
The coefficient for a variable with zero variance should be zero, yes. However, whether such a coefficient is "automatically" set to zero or the variable is excluded from the regression depends on the implementation.
If you implement logistic regression yourself, you can post the code and we can discuss it specifically.
I recommend you find an implemented version of logistic regression and test it on toy data. Then you will have your answer as to whether or not the coefficient is set to zero (which I assume it will be).
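A toy test along those lines might look like this, assuming scikit-learn (the constant column and the data are made up; with an L1 penalty I would expect the constant column's coefficient to end up at or very near zero, since the unpenalized intercept can absorb it):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
x_varying = rng.normal(size=(200, 1))
x_constant = np.ones((200, 1))          # a "continuous" feature with only one value
X = np.hstack([x_varying, x_constant])
y = (x_varying[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# L1-penalised logistic regression (the liblinear solver supports penalty="l1")
model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)
print(model.coef_)  # coefficient on the constant column is expected to be (close to) 0
```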

Related

Why one independent variable gets dropped in SPSS multiple regression?

I was running a linear multiple regression as well as a logistic multiple regression in SPSS.
After that, when looking at the results, I realised that in each regression one independent variable was automatically excluded by SPSS. Did we do something wrong here, or what do we have to do in order to have all independent variables included in the regression?
thanks for your help!
In LOGISTIC REGRESSION you can specify a predictor variable as categorical, and thus make use of a single categorical variable. If you do this, you can specify the type of contrast coding to use and choose which category would be used as the reference category.
The REGRESSION procedure doesn't have facilities for declaring predictors categorical, so if you have an intercept or constant in the model (which of course is the default) and you try to enter K dummy or indicator variables for a K-level categorical variable, one of them will be linearly dependent on the intercept and the other K-1 dummies, and as Kevin said, will be left out.
You may be confused by which one the procedure leaves out. When the default ENTER method is used in REGRESSION, it tries to enter all the independent variables, but does it one variable at a time and performs a tolerance check to avoid numerical problems associated with redundant or nearly redundant variables. The way it does this results in the last predictor specified being the first to enter, and after that, the variable with the smallest absolute correlation with that previously entered variable, then the remaining candidate variable with the least relationship with the two entered already, etc. In cases where there are ties on tolerance values, the later listed variable gets entered.
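The redundancy itself is easy to see outside SPSS as well; here is a small illustration (a Python sketch with made-up data) of why an intercept plus all K dummies cannot be entered together:

```python
import numpy as np

# Hypothetical 3-level categorical variable, encoded as all K = 3 dummy columns
levels = np.array([0, 1, 2, 0, 1, 2, 0, 0])
dummies = np.eye(3)[levels]              # one indicator column per level
intercept = np.ones((len(levels), 1))    # constant term

X = np.hstack([intercept, dummies])      # intercept + K dummies

# The dummy columns sum to the intercept column, so the design matrix is rank deficient:
print(np.linalg.matrix_rank(X))          # 3, not 4 -> one column has to be dropped
```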

Is it better to have 1 or 10 output neurons?

Is it better to have:
1 output neuron that outputs a value between 0 and 15 which would be my ultimate value
or
16 output neurons that each output a value between 0 and 1 which represents the probability for that value?
Example: We want to find out the grade (ranging from 0 to 15) a student gets by inputting the number of hours he studied and his IQ.
TL;DR: I think your problem would be better framed as a regression task, so use one output neuron, but it is worth trying both.
I don't quite like the broadness of your question in contrast to the very specific answers, so I am going to go a little deeper and explain what the proper formulation should be.
Before we start, we should clarify the two big tasks that classical Artificial Neural Networks perform:
Classification
Regression
They are inherently very different from one another; in short, classification tries to put a label on your input (e.g., the input image shows a dog), whereas regression tries to predict a numerical value (e.g., the input data corresponds to a house with an estimated worth of US$1.5 million).
Obviously, predicting a single numerical value requires (trivially) only one output neuron. Also note that this is only true for this specific example. There are other regression use cases in which you want your output to be not a single scalar (0-dimensional) but 1D or 2D instead.
A common example is image colorization, which, interestingly enough, can also be framed as a classification problem. The provided link shows examples of both. In this case you would obviously have to regress (or classify) every pixel, which leads to more than one output neuron.
Now, to get to your actual question, I want to elaborate a little more on why one-hot encoded outputs (i.e. an output with as many channels as there are classes) are preferred over a single neuron for classification tasks.
Since we could argue that a single neuron is enough to predict the class value, we have to understand why it is problematic to get to a specific class that way.
Categorical vs Ordinal vs Interval Variables
One of the main problems is the type of your variable. In your case, there exists a clear order (15 is better than 14 is better than 13, etc.), and even an interval ordering (at least on paper), since the difference between a 15 and 13 is the same as between 14 and 12, although some scholars might argue against that ;-)
Thus, your target is an interval variable and could in theory be regressed on directly. More on that later. But consider, for example, a variable that describes whether the image depicts a cat (0), a dog (1), or a car (2). Now, arguably, we cannot even order the values (is a car > a dog, or a car < a dog?), nor can we say that there exists an "equal distance" between a cat and a dog (similar, since both are animals?) or a cat and a car (arguably more different from each other). Thus, it becomes really hard to interpret a single output value of the network. Suppose an input image results in an output of 1.4.
Does this now still correspond to a dog, or is this closer to a car? But what if the image actually depicts a car that has properties of a cat?
On the other hand, having 3 separate neurons that reflect the different probabilities of each class eliminates that problem, since each one can depict a relatively "undisturbed" probability.
Choosing a Loss Function
The other problem is the question of how to backpropagate through the network in the previous example. Classically, classification tasks make use of the Cross-Entropy Loss (CE), whereas regression uses the Mean Squared Error (MSE) as a measure. The two are inherently different, and especially the combination of CE and softmax leads to very convenient (and stable) derivatives.
Arguably, you could apply rounding to get from 1.4 to a discrete class value (in that case, 1) and then use the CE loss, but that may lead to numerical instability; MSE, on the other hand, will never give you a "clear class value", but rather a regressed estimate.
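To make the difference concrete, here is a tiny numeric sketch (all numbers made up) of the two losses for a 3-class example:

```python
import numpy as np

logits = np.array([0.3, 2.1, -1.2])   # raw network outputs for 3 classes
target_class = 1                      # true label

# Classification head: softmax turns logits into probabilities, CE penalises the true class's probability
probs = np.exp(logits) / np.exp(logits).sum()
ce_loss = -np.log(probs[target_class])

# Regression head: a single output compared against the class index via MSE
prediction = 1.4
mse_loss = (prediction - target_class) ** 2

print(ce_loss, mse_loss)
```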
In the end, the question boils down to: do I have a classification or a regression problem? In your case, I would argue that both approaches could work reasonably well. A (classification) network might not recognize the correlation between the different output classes; i.e. a student who has a high likelihood for class 14 basically has zero probability of scoring a 3 or lower. On the other hand, regression might not be able to accurately predict the results for other reasons.
If you have the time, I would highly encourage you to try both approaches. For now, considering the interval type of your target, I would personally go with a regression task and apply rounding after you have trained your network and can make accurate predictions.
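If you do try both, the two variants differ only in the output layer and the loss. A minimal sketch, assuming Keras and an arbitrary hidden-layer size (the two inputs being hours studied and IQ):

```python
from tensorflow import keras

# Regression variant: one linear output neuron trained with MSE
reg_model = keras.Sequential([
    keras.Input(shape=(2,)),                        # hours studied, IQ
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1),                          # the grade as a real number
])
reg_model.compile(optimizer="adam", loss="mse")

# Classification variant: 16 output neurons (grades 0-15), softmax + cross-entropy
clf_model = keras.Sequential([
    keras.Input(shape=(2,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(16, activation="softmax"),
])
clf_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```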
It is better to have a single neuron for each class (except in binary classification). This allows for better design in terms of expanding upon an existing design. A simple example is creating a network for recognizing the digits 0 through 9 and then changing the design to hexadecimal, 0 through F.

sigmoid - back propagation neural network

I'm trying to create a sample neural network that can be used for credit scoring. Since this is a complicated structure for me, I'm trying to learn on something small first.
I created a network using back propagation - input layer (2 nodes), 1 hidden layer (2 nodes + 1 bias), output layer (1 node) - which uses sigmoid as the activation function for all layers. I'm trying to test it first using a^2 + b^2 = c^2, which means my inputs would be a and b, and the target output would be c.
My problem is that my input and target output values are real numbers which can range over (-∞, +∞). So when I pass these values to my network, my error function would be something like (target - network output). Would that be correct or accurate? In the sense that I'm getting the difference between the network output (which ranges from 0 to 1) and the target output (which can be a large number).
I've read that the solution would be to normalise first, but I'm not really sure how to do this. Should I normalise both the input and target output values before feeding them to the network? Which normalisation function is best to use, since I've read about several different methods? After getting the optimised weights and using them to test some data, I'm getting an output value between 0 and 1 because of the sigmoid function. Should I revert the computed values to their un-normalised/original form? Or should I only normalise the target output and not the input values? This has had me stuck for weeks, as I'm not getting the desired outcome and am not sure how to incorporate the normalisation idea into my training algorithm and testing.
Thank you very much!!
So, to answer your questions:
The sigmoid function squashes its input to the interval (0, 1). It's usually useful in classification tasks because you can interpret its output as the probability of a certain class. Your network performs a regression task (you need to approximate a real-valued function), so it's better to set a linear function as the activation coming out of your last hidden layer (in your case also the first :) ).
I would advise you not to use the sigmoid function as the activation in your hidden layers. It's much better to use tanh or ReLU nonlinearities. A detailed explanation (as well as some useful tips if you want to keep sigmoid as your activation) can be found here.
It's also important to understand that the architecture of your network is not well suited for the task you are trying to solve. You can learn a little about what different networks can learn here.
In the case of normalisation: the main reason you should normalise your data is to avoid giving any spurious prior knowledge to your network. Consider two variables: age and income. The first varies from, e.g., 5 to 90. The second varies from, e.g., 1000 to 100000. The mean absolute value is much bigger for income than for age, so due to the linear transformations in your model, the ANN treats income as more important at the beginning of training (because of random initialisation). Now consider that you are trying to solve a task where you need to classify whether a given person has grey hair :) Is income truly the more important variable for this task?
There are a lot of rules of thumb for how you should normalise your input data. One is to squash all inputs to the [0, 1] interval. Another is to transform every variable to have mean = 0 and sd = 1. I usually use the second method when the distribution of a given variable is similar to a normal distribution, and the first in other cases.
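In numpy terms, those two rules of thumb look like this (the values are made up):

```python
import numpy as np

x = np.array([5.0, 23.0, 41.0, 67.0, 90.0])   # e.g. age

# Rule of thumb 1: squash to [0, 1] (min-max scaling)
x_minmax = (x - x.min()) / (x.max() - x.min())

# Rule of thumb 2: mean = 0, sd = 1 (standardisation)
x_standard = (x - x.mean()) / x.std()
```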
When it comes to the output, it's usually also useful to normalise it when you are solving a regression task (especially in the multiple regression case), but it's not as crucial as for the inputs.
You should remember to keep the parameters needed to restore the original scale of your inputs and outputs. You should also remember to compute them only on the training set and then apply them to the training, test, and validation sets.
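A minimal sketch of that workflow, assuming scikit-learn scalers and made-up data in place of your real inputs and targets:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up splits standing in for the real data
X_train, X_test = np.random.rand(80, 2), np.random.rand(20, 2)
y_train = np.random.rand(80, 1)

x_scaler = StandardScaler().fit(X_train)   # scaling parameters come from the training set only
y_scaler = StandardScaler().fit(y_train)

X_train_n = x_scaler.transform(X_train)
X_test_n = x_scaler.transform(X_test)      # same parameters reused on the test set
y_train_n = y_scaler.transform(y_train)

# ... train the network on (X_train_n, y_train_n), then predict on X_test_n ...
y_pred_n = np.zeros((20, 1))                   # placeholder for the network's normalised predictions
y_pred = y_scaler.inverse_transform(y_pred_n)  # back to the original scale
```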

One Class SVM using LibSVM in Matlab - Conceptual

Perhaps this is an easy question, but I want to make sure I understand the conceptual basis of the LibSVM implementation of one-class SVMs and whether what I am doing is permissible.
I am using one class SVMs in this case for outlier detection and removal. This is used in the context of a greater time series prediction model as a data preprocessing step. That said, I have a Y vector (which is the quantity we are trying to predict and is continuous, not class labels) and an X matrix (continuous features used to predict). Since I want to detect outliers in the data early in the preprocessing step, I have yet to normalize or lag the X matrix for use in prediction, or for that matter detrend/remove noise/or otherwise process the Y vector (which is already scaled to within [-1,1]). My main question is whether it is correct to model the one class SVM like so (using libSVM):
% -s 2: one-class SVM, -t 2: RBF kernel, -g: gamma, -n: nu (upper bound on the outlier fraction)
svmod = svmtrain(ones(size(Y,1),1),Y,'-s 2 -t 2 -g 0.00001 -n 0.01');
[od,~,~] = svmpredict(ones(size(Y,1),1),Y,svmod);
The resulting model does yield performance somewhat in line with what I would expect (99% or so prediction accuracy, meaning 1% of the observations are outliers). The reason I ask is that in other questions regarding one-class SVMs, people appear to be using their X matrices where I use Y. Thanks for your help.
What you are doing here is nothing more than a fancy range check. If you are not willing to use X to find outliers in Y (even though you really should), it would be a lot simpler and better to just check the distribution of Y to find outliers instead of this improvised SVM solution (for example, remove the upper and lower 0.5 percentiles of Y).
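For example, that percentile-based check could look like this (a Python/numpy sketch with made-up data standing in for Y):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(size=1000)                 # stand-in for the real Y vector

lo, hi = np.quantile(Y, [0.005, 0.995])   # cut off the lower and upper 0.5 percentiles
outliers = (Y < lo) | (Y > hi)
print(outliers.mean())                    # about 1% of the observations flagged
```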
In reality, this is probably not even close to what you really want to do. With this setup you are rejecting Y values as outliers without considering any context (e.g. X). Why are you using RBF and how did you come up with that specific value for gamma? A kernel is total overkill for one-dimensional data.
Secondly, you are training and testing on the same data (Y). A kitten dies every time this happens. One-class SVM attempts to build a model which recognizes the training data, it should not be used on the same data it was built with. Please, think of the kittens.
Additionally, note that the nu parameter of one-class SVM controls the amount of outliers the classifier will accept. This is explained in the LIBSVM implementation document (page 4): "It is proved that nu is an upper bound on the fraction of training errors and a lower bound of the fraction of support vectors." In other words: your training options specifically state that up to 1% of the data can be rejected. For one-class SVM, replace can by should.
So when you say that the resulting model does yield performance somewhat in line with what I would expect... of course it does, by definition. Since you have set nu = 0.01, 1% of the data is rejected by the model and thus flagged as outliers.
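You can see the same behaviour directly; here is a sketch (assuming scikit-learn, with made-up data and the default RBF kernel) of how nu drives the flagged fraction:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Made-up one-dimensional "Y", trained and predicted on the same vector as in the question
rng = np.random.default_rng(0)
Y = rng.normal(size=(1000, 1))

oc_svm = OneClassSVM(kernel="rbf", nu=0.01).fit(Y)
flags = oc_svm.predict(Y)       # +1 = inlier, -1 = flagged as "outlier"

print((flags == -1).mean())     # at most about 0.01, because nu bounds the rejected fraction
```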

Interpretation of MATLAB's NaiveBayes 'posterior' function

After we have created a Naive Bayes classifier object nb (say, with a multivariate multinomial (mvmn) distribution), we can call the posterior function on testing data using the nb object. This function has 3 output parameters:
[post,cpre,logp] = posterior(nb,test)
I understand how post is computed and what it means; cpre is the predicted class, based on the maximum over the posterior probabilities for each class.
The question is about logp. It is clear how it is computed (the logarithm of the PDF of each pattern in test), but I don't understand the meaning of this measure and how it can be used in the context of the Naive Bayes procedure. Any light on this is very much appreciated.
Thanks.
The logp you are referring to is the log likelihood, which is one way to measure how well a model fits. We use log probabilities to prevent computers from underflowing on very small floating-point numbers, and also because adding is faster than multiplying.
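A quick numpy illustration of why the log is used (the per-feature likelihood values are made up):

```python
import numpy as np

# 1000 small per-feature likelihoods, e.g. around 1e-5 each
p = np.full(1000, 1e-5)

print(np.prod(p))          # 0.0 -- the raw product underflows in floating point
print(np.sum(np.log(p)))   # about -11513 -- the log likelihood stays representable
```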
If you learned your classifier several times with different starting points, you would get different results because the likelihood function is not log-concave, meaning there are local maxima that you would get stuck in. If you computed the likelihood of the posterior on your original data you would get the likelihood of the model. Although the likelihood gives you a good measure of how one set of parameters fits compared to another, you need to be careful that you're not overfitting.
In your case, you are computing the likelihood on some unobserved (test) data, which gives you an idea of how well your learned classifier is fitting on the data. If you were trying to learn this model based on the test set, you would pick the parameters based on the highest test likelihood; however in general when you're doing this it's better to use a validation set. What you are doing here is computing predictive likelihood.
Computing the log likelihood is not limited to Naive Bayes classifiers and can in fact be computed for any Bayesian model (Gaussian mixture, latent Dirichlet allocation, etc.).