I use logistic regression with some features whose values lie between 0 and 1 (the function that produces them has a minimum of 0 and a maximum of 1), but in both the training and the test data the observed maximum is very low (e.g. 0.11), so all the values are small and close to each other. What is the standard way to normalize/transform the feature values onto a full scale (between 0 and 1) so that the logistic regression isn't affected by these poorly scaled values?
Any help would be highly appreciated.
There are different methods for feature scaling/normalization.
If you just want the feature values to be in the range [0, 1], apply min-max scaling to each feature: x' = (x - min(x)) / (max(x) - min(x)).
Some tutorials recommend scaling features into the range [-0.5, 0.5]: x' = (x - min(x)) / (max(x) - min(x)) - 0.5.
I prefer to scale features by their mean and standard deviation, as explained in the Stanford lectures (see the chapter "Preprocessing your data"): x' = (x - mean(x)) / std(x).
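A minimal NumPy sketch of the three options above (the function names are mine, just for illustration):

```python
import numpy as np

def min_max_scale(x):
    """Rescale a feature vector to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def centered_scale(x):
    """Rescale a feature vector to the range [-0.5, 0.5]."""
    return min_max_scale(x) - 0.5

def standardize(x):
    """Subtract the mean and divide by the standard deviation."""
    return (x - x.mean()) / x.std()

# Example: a feature whose observed maximum is only 0.11
x = np.array([0.01, 0.03, 0.05, 0.08, 0.11])
print(min_max_scale(x))   # now spread over the full [0, 1] range
print(centered_scale(x))
print(standardize(x))     # zero mean, unit variance
```

Whichever variant you choose, compute the min/max (or mean/std) on the training data only and reuse those same values to transform the test data.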
I'm reading through the Make Your Own Neural Network book, and in the example showing how to classify handwritten digits, the text says that the input color values, which are in the range 0 to 255, will be rescaled to the much smaller range of 0.01 to 1.0. A few questions on this:
What speaks against using the actual range, which is 0 to 255? What would rescaling gain me?
Does this mean that if I rescale my training set and train my model on the rescaled data, I should also rescale the test data?
Any arguments please?
Rescaling the data leads to faster convergence when using methods like gradient descent. Also, when your dataset has features with highly varying magnitudes, any method that relies on Euclidean distance can give poor results. To avoid this, scaling the features to the range between 0.0 and 1.0 is a wise choice.
For the second question: yes, you should rescale the test data as well, applying the same transformation (with the parameters derived from the training data).
See links 1, 2 and 3 for more information.
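As an illustration of the rescaling described above (a sketch, assuming grayscale pixel inputs in 0-255 as in the digit-classification example):

```python
import numpy as np

def rescale_pixels(raw, lo=0.01, hi=1.0):
    """Map raw pixel values in [0, 255] to the range [lo, hi]."""
    return raw / 255.0 * (hi - lo) + lo

train_raw = np.array([0, 64, 128, 255], dtype=float)
test_raw = np.array([10, 200], dtype=float)

train_scaled = rescale_pixels(train_raw)   # ~[0.01, 0.26, 0.51, 1.0]
test_scaled = rescale_pixels(test_raw)     # test data goes through the same transform
```

Keeping the lower bound at 0.01 rather than 0.0 avoids zero-valued inputs, which would otherwise zero out the weight updates for the connections attached to them.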
I was reading through all (or most) previously asked questions, but couldn't find an answer to my problem...
I have 13 variables measured on an ordinal scale (they represent knowledge-transfer channels), which I want to cluster (HCA) for a subsequent binary logistic regression analysis (including all 13 variables is not possible due to the sample size of N = 208). A factor analysis seems inappropriate given the scale level. I am using SPSS (but have tried R as well).
Questions:
1. Am I right in using the chi-squared measure for count data instead of the (squared) Euclidean distance?
2. How can I justify the choice of linkage method? I tried single, complete, Ward, and average linkage, but they all give different results and I can't find a source to base my decision on.
Thanks a lot in advance!
Answer 1: Since the variables are on an ordinal scale, the chi-squared measure is an appropriate choice: "A Chi-square test is designed to analyze categorical data. That means that the data has been counted and divided into categories. It will not work with parametric or continuous data (such as height in inches)." Reference.
Again, since ordinal-scaled data is essentially count or frequency data, you shouldn't rely on regular parametric statistics (mean, standard deviation, etc.); instead, use non-parametric tests such as the Mann-Whitney U test to compare two groups or the Kruskal-Wallis H test (the non-parametric counterpart of one-way ANOVA) to compare three or more groups.
Answer 2: In a clustering problem, the choice of distance and linkage method depends largely on the type of variables. I recommend reading these detailed posts: 1, 2, 3.
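For reference, here is a minimal Python/SciPy sketch of hierarchically clustering the 13 variables with a chi-squared-style distance (the data array and the exact distance definition are illustrative assumptions, not your actual SPSS setup):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical data: 208 respondents x 13 ordinal variables (e.g. ratings 1-5)
rng = np.random.default_rng(0)
X = rng.integers(1, 6, size=(208, 13)).astype(float)

# To cluster the variables rather than the respondents, work on the transpose.
V = X.T

def chi2_distance(a, b):
    """Chi-squared-style distance between two non-negative vectors."""
    denom = a + b
    mask = denom > 0
    return 0.5 * np.sum((a[mask] - b[mask]) ** 2 / denom[mask])

d = pdist(V, metric=chi2_distance)    # condensed pairwise distance matrix
Z = linkage(d, method="average")      # note: Ward strictly assumes Euclidean distances
dendrogram(Z, labels=[f"var{i + 1}" for i in range(13)])
plt.show()
```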
I am trying to train a CNN model to classify images based on their aesthetic score. There are 200,000 images, and every image is rated by more than 100 subjects. The mean score is calculated for each image and the scores are normalized.
The distribution of the scores is approximately Gaussian, so I have decided to build a 10-class classification model, assigning an appropriate weight to each class since the data is imbalanced.
My question:
For this problem the scores are continuous and ordered, i.e. 0 < 0.2 < 0.3 < 0.4 < 0.5 < ... < 1.
Does that mean this is a regression problem? If so, how do I balance the data for a regression problem, since most of the data points lie between 0.4 and 0.6?
Thanks!
Since your labels are continuous, you could divide them into 10 equal-frequency bins (quantiles) using pandas.qcut() and assign a class label to each bin, as sketched below. This turns the regression problem into a classification problem.
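A quick sketch of the binning idea with pandas.qcut() (the score array here is simulated, just to show the mechanics):

```python
import numpy as np
import pandas as pd

# Hypothetical normalized mean scores, concentrated between 0.4 and 0.6
rng = np.random.default_rng(0)
scores = pd.Series(np.clip(rng.normal(0.5, 0.1, size=200_000), 0, 1))

# Ten equal-frequency bins: each class gets ~10% of the images,
# so the class distribution is balanced by construction.
labels = pd.qcut(scores, q=10, labels=False, duplicates="drop")
print(labels.value_counts().sort_index())
```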
As far as the imbalance is concerned, you may want to oversample the minority classes, for example as sketched below. This ensures your model is not biased towards the majority classes.
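A minimal oversampling sketch with sklearn.utils.resample (the DataFrame layout, with a hypothetical label column, is an assumption):

```python
import pandas as pd
from sklearn.utils import resample

def oversample(df, label_col="label", random_state=0):
    """Upsample every class to the size of the largest class."""
    target = df[label_col].value_counts().max()
    parts = [
        resample(group, replace=True, n_samples=target, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    return pd.concat(parts).sample(frac=1, random_state=random_state)  # shuffle

# df = pd.DataFrame({"image_path": ..., "label": labels})
# balanced = oversample(df)
```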
Hope this helps.
I would recommend first doing a histogram equalization over ALL of your participants' ratings, so that the ratings are distributed equally.
Then, for each image in your training set, calculate the expected value (and, if you want, the variance). The expected value is just the mean of the votes; for the variance there are standard functions in (almost) every programming language that take an array of votes and return the variance.
Now take the expected value (and, if you want, also the variance) as the ground truth for your network.
EDIT: Histogram Equalization:
Histogram equalization is a method for using the given numerical range as efficiently as possible.
In the context of images, this changes the pixel values so that the darkest pixel becomes 0 and the lightest becomes 255. Furthermore, every grayscale value is redistributed so that, on average, each value occurs as often as every other. You want the same for your dataset, even though your values run from 0 to 10 rather than from 0 to 255. Also, you don't need to (and shouldn't) round the resulting values to integers. In this way frequently occurring votes are spread out and rarely occurring votes are contracted.
Maybe you should first calculate the expected value and then do the histogram equalization over the expected values of all images.
This way the CNN should be able to better differentiate those small differences.
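A small sketch of this idea (rank-based equalization of the per-image mean scores; the data is simulated and the helper names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-image vote arrays on a 0-10 scale
votes = [rng.normal(5.0, 1.5, size=120).clip(0, 10) for _ in range(1000)]

means = np.array([v.mean() for v in votes])      # expected value per image
variances = np.array([v.var() for v in votes])   # optional spread per image

def equalize(x, lo=0.0, hi=10.0):
    """Map values to [lo, hi] so their empirical distribution is uniform."""
    ranks = x.argsort().argsort().astype(float)  # ranks 0 .. n-1
    return lo + ranks / (len(x) - 1) * (hi - lo)

targets = equalize(means)   # use these (and optionally the variances) as ground truth
```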
So I am currently trying to implement my first NN with a genetic algorithm for training and a sigmoid activation function. It all works, but I'm not quite sure what range the weights should lie in. I've searched around but with no luck. How does one choose the range of the weights in a NN? What does it depend on?
The weights can be seen as an intrinsic property of the problem you're trying to solve with the GA/NN approach; there is no generally best value for them, so you're best off studying different weight spans (with respect to your training sets) while keeping the other parameters fixed.
E.g., study different settings of the parameter weightSpan in
weights ∈ [-weightSpan/2, +weightSpan/2],
and let your initial chromosomes describe weights with values randomized within this range. Your squashing function (the sigmoid) then maps the NN response to the range [0, 1].
Finding an appropriate weight span is, much like choosing the number of hidden layers, a process of problem-specific testing ("there is no free lunch").
Edit:
I thought I'd add that the easiest way to study different weight spans is probably to fix the weight span, say to [-1, 1], and instead study the squashing constant in your squashing function (the sigmoid), i.e., study different (non-negative) values of the constant c in
σ(s) = 1 / (1 + e^(-c*s))
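A tiny NumPy sketch of both knobs (weightSpan and the squashing constant c); the network and population sizes are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(s, c=1.0):
    """Squashing function with tunable steepness c (non-negative)."""
    return 1.0 / (1.0 + np.exp(-c * s))

def random_chromosome(n_weights, weight_span=2.0):
    """Initial weights drawn uniformly from [-weightSpan/2, +weightSpan/2]."""
    return rng.uniform(-weight_span / 2, weight_span / 2, size=n_weights)

# e.g. a 4-3-1 network has 4*3 + 3*1 = 15 weights (ignoring biases)
chromosome = random_chromosome(15, weight_span=2.0)
print(sigmoid(chromosome, c=0.5))   # flatter squashing
print(sigmoid(chromosome, c=4.0))   # steeper squashing
```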
I'm running a series of SVM classifiers for a binary classification problem, and I'm getting very nice results in terms of classification accuracy.
The next step of my analysis is to understand how the different features contribute to the classification. According to the documentation, Matlab's fitcsvm function returns a class, SVMModel, which has a field called "Beta", defined as:
Numeric vector of trained classifier coefficients from the primal linear problem. Beta has length equal to the number of predictors (i.e., size(SVMModel.X,2)).
I'm not quite sure how to interpret these values. I assume higher values represent a greater contribution of a given feature to the support vector? What do negative weights mean? Are these weights somehow analogous to beta parameters in a linear regression model?
Thanks for any help and suggestions.
----UPDATE 3/5/15----
In looking closer at the equations describing the linear SVM, I'm pretty sure Beta must correspond to w in the primal form.
The only other parameter is b, which is just the offset.
Given that, and given this explanation, it seems that taking the square or absolute value of the coefficients provides a metric of relative importance of each feature.
As I understand it, this interpretation only holds for the linear binary SVM problem.
Does that all seem reasonable to people?
Intuitively, one can think of the absolute value of a feature's weight as a measure of its importance. However, this is not true in general, because the weights express how much a marginal change in the feature value would affect the output, which means they depend on the feature's scale. For instance, if we have an "age" feature measured in years and we change it to months, the corresponding coefficient is divided by 12, but clearly that doesn't mean age has become less important!
The solution is to scale the data (which is usually a good practice anyway).
If the data is scaled, your intuition is correct; in fact, there is a feature selection method that does exactly that: choosing the features with the highest absolute weights. See http://jmlr.csail.mit.edu/proceedings/papers/v3/chang08a/chang08a.pdf
Note that this holds only for linear SVMs.
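For what it's worth, here is the same idea sketched in Python with scikit-learn, as an analogue of MATLAB's fitcsvm Beta (the data is synthetic and the workflow is illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification data with a handful of features
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

# Scale the features so the weight magnitudes are comparable
X_scaled = StandardScaler().fit_transform(X)

clf = SVC(kernel="linear").fit(X_scaled, y)
w = clf.coef_.ravel()               # analogous to SVMModel.Beta (the primal w)

ranking = np.argsort(-np.abs(w))    # features ordered by |w|, largest first
for i in ranking:
    print(f"feature {i}: weight = {w[i]:+.3f}")
```

The sign of each weight only indicates which class an increase in that feature pushes the decision towards; for ranking importance (given scaled features) it is the magnitude that matters.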