Putting weights on values of a categorical feature - classification

Suppose we have the following dataset:
df = pd.DataFrame({'feature 1': ['a','b','c','d','e'],
                   'feature 2': [1,2,3,4,5], 'y': [1,0,0,1,1]})
As we can see, feature 1 is categorical. In the usual tree-based models such as XGBoost or CatBoost, all values of a feature are treated with the same weight. I was wondering how one can assign weights to the individual values of a categorical feature. For instance, I want my model to put weight 1 on a, 0.5 on b, 2 on c, 1 on d, and 0.6 on e. This is different from assigning a weight to the feature as a whole; I am trying to let the model understand that each value of each feature has its own distinct weight.
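Neither XGBoost nor CatBoost exposes per-value weights directly, but both accept per-row sample weights, so one workaround is to translate the per-value weights into row weights. A minimal Python sketch of that idea, assuming the xgboost scikit-learn wrapper (the value_weights mapping just mirrors the numbers above):
import pandas as pd
from xgboost import XGBClassifier

df = pd.DataFrame({'feature 1': ['a','b','c','d','e'],
                   'feature 2': [1,2,3,4,5], 'y': [1,0,0,1,1]})

# hypothetical per-value weights from the question
value_weights = {'a': 1.0, 'b': 0.5, 'c': 2.0, 'd': 1.0, 'e': 0.6}

# each row is weighted by the weight of its 'feature 1' value
w = df['feature 1'].map(value_weights)

X = pd.get_dummies(df[['feature 1', 'feature 2']],
                   columns=['feature 1']).astype(float)
model = XGBClassifier(n_estimators=10)
model.fit(X, df['y'], sample_weight=w)
Note that this weights whole rows rather than the values themselves, so it approximates the behaviour asked for rather than reproducing it exactly.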

Related

How to decide the range for the hyperparameter space in SVM tuning? (MATLAB)

I am tuning an SVM using a for loop to search over the hyperparameter space. The learned SVM model contains the following fields:
SVMModel: [1×1 ClassificationSVM]
C: 2
FeaturesIdx: [4 6 8]
Score: 0.0142
Question 1) What does the field 'Score' mean, and what is it used for?
Question 2) I am tuning the box constraint, the C value. Let the number of features be denoted by the variable featsize. The variable gridC contains the search space, which can start from any value, say 2^-5, 2^-3, up to 2^15, i.e. gridC = 2.^(-5:2:15). Is there a principled way to select this range?
1. Score is documented here, which says:
Classification Score
The SVM classification score for classifying observation x is the signed distance from x to the decision boundary ranging from -∞ to +∞.
A positive score for a class indicates that x is predicted to be in
that class. A negative score indicates otherwise.
In the two-class case, if there are six observations and the predict function gives us a score matrix called TestScore, then we can determine which class each observation is assigned to with:
TestScore = [-0.4497  0.4497;
             -0.2602  0.2602;
             -0.0746  0.0746;
              0.1070 -0.1070;
              0.2841 -0.2841;
              0.4566 -0.4566];
[~,Classes] = max(TestScore,[],2);
In two-class classification we can also use find(TestScore > 0) instead; here it is clear that the first three observations belong to the second class and the 4th to 6th observations belong to the first class.
In multiclass cases there can be several scores > 0, but max(Scores,[],2) is still valid. For example, we can use the following code (from here, the example called "Find Multiple Class Boundaries Using Binary SVM") to determine the classes of the samples to predict.
for j = 1:numel(classes)
    [~,score] = predict(SVMModels{j},Samples);
    Scores(:,j) = score(:,2); % second column contains positive-class scores
end
[~,maxScore] = max(Scores,[],2);
maxScore then holds the predicted class index of each sample.
2. BoxConstraint corresponds to C in the SVM model, so we can train SVMs with different hyperparameters and select the best one with something like:
gridC = 2.^(-5:2:15);
bestLoss = Inf;
for ii = 1:length(gridC)
    SVModel = fitcsvm(data3,theclass,'KernelFunction','rbf',...
        'BoxConstraint',gridC(ii),'ClassNames',[-1,1]);
    % score each candidate by its 10-fold cross-validated loss
    cvLoss = kfoldLoss(crossval(SVModel));
    if cvLoss < bestLoss  % keep the best model seen so far
        bestLoss = cvLoss;
        bestModel = SVModel;
    end
end
Note: Another way to implement this is using libsvm, a fast and easy-to-use SVM toolbox, which provides a MATLAB interface.

Matlab - Stepwise GLM with Categoricals

I have a table of 85 predictors, some numerical, some logical, some ordinal, and some nominal (one-hot encoded). They predict a single finalScore outcome variable which ranges from 0 to 1. I'm running a stepwise GLM using:
model2 = stepwiseglm(predictors, finalScore);
Each predictor's header indicates which of the four types it is, and I'm wondering if there is a way to tell the model about these different types. This page suggests there is for categoricals, but so far I have not found anything covering all four types I have.
Per the Generalized Linear Models walk-through:
For a table or dataset array tbl, fitting functions assume that these data types are categorical:
Logical
Categorical (nominal or ordinal)
Character array
As long as the data is represented by the appropriate types in the input table, you shouldn't have to specify anything further. To ensure this, you can cast nominals with categorical(), ordinals with ordinal(), and logicals with logical().
You can specify categorical vs. non-categorical explicitly with stepwiseglm(...,'CategoricalVars',[0 1 0 1 0 0 0 ...]), but if you cast your inputs correctly this is redundant anyway.
Once the model is built, you can verify that the categorical variables and their ranges are handled appropriately by checking model2.VariableInfo.

Weka Simple K means handling nominal attributes

I am trying to understand how simple K-means in Weka handles nominal attributes and why it is not efficient in handling such attributes.
I read that it calculates modes for such attributes. I want to know how the similarity is calculated.
Let's take an example:
Consider a dataset with three numeric attributes and one nominal attribute.
The nominal attribute has 3 values: A, B and C.
Instance1 has value A, Instance2 has value B, and Instance3 has value A.
In this case, Instance1 may be more similar to Instance3 (depending on the other numeric attributes, of course). How will simple K-means work in this case?
Follow up:
What if the nominal attribute has more(10) possible values?
You can convert each such nominal attribute into binary features, e.g. has_A, has_B, has_C. Then, if you scale them, Instance1 and Instance3 will be closer (in your example the mean of has_A will be above 0.5), and Instance2 will stand out more.
If the attribute has more possible values (say 10), you just add one binary feature per value; see the sketch below. Basically, you pivot each nominal attribute.
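A minimal sketch of this binarization in Python rather than Weka (pd.get_dummies does the pivoting; the numeric columns x1, x2 are made up for illustration):
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'x1': [0.2, 0.3, 0.1],
                   'x2': [1.0, 5.0, 1.2],
                   'colour': ['A', 'B', 'A']})  # the nominal attribute

# pivot the nominal attribute into one binary column per value
X = pd.get_dummies(df, columns=['colour'])  # adds colour_A, colour_B

# scale so binary and numeric columns are comparable
X_scaled = StandardScaler().fit_transform(X)

km = KMeans(n_clusters=2, n_init=10).fit(X_scaled)
print(km.labels_)  # Instance1 and Instance3 should share a cluster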

How do I joint test a multi-level categorical effect in ipython using statsmodels?

I am using the Ordinary Least Squares (ols) function in statsmodels in ipython to fit a linear model where one covariate (City) is a multi-level categorical effect:
result=smf.ols(formula="Y ~ C(City) + X*C(Group)",data=s).fit();
(X is continuous, Group is a binary categorical variable).
When I do result.summary(), I get one row per level of City; however, what I would like to know is the overall significance of the 'City' covariate (i.e., compare Y ~ C(City) + X*C(Group) with the partial model Y ~ X*C(Group)).
Is there a way of doing it?
thanks in advance
Thank you user333700!
Here's an elaboration of your hint. I generate data with a 3-level categorical variable, use statsmodels to fit a model, and then test all levels of the categorical variable jointly:
# 1. generate data
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def rnorm(n, u, s):
    return np.random.standard_normal(n)*s + u

a = rnorm(100, -1, 1)
b = rnorm(100, 0, 1)
c = rnorm(100, +1, 1)
n = rnorm(300, 0, 1)  # some noise
y = np.concatenate((a, b, c)) + n
g = np.zeros(300)
g[0:100] = 1
g[100:200] = 2
g[200:300] = 3
df = pd.DataFrame({'Y': y, 'G': g, 'N': n})
# 2. fit model
r = smf.ols(formula="Y ~ N + C(G)", data=df).fit()
r.summary()
# 3. joint test
print(r.params)
A = np.identity(len(r.params))  # identity matrix, size = number of params
GroupTest = A[1:3, :]  # rows of A for the two categorical dummy terms
CovTest = A[3, :]      # row for the continuous covariate
print("Group effect test", r.f_test(GroupTest).fvalue)
print("Covariate effect test", r.f_test(CovTest).fvalue)
The result should be something like this:
Intercept -1.188975
C(G)[T.2.0] 1.315898
C(G)[T.3.0] 2.137431
N 0.922038
dtype: float64
Group effect test [[ 120.86097747]]
Covariate effect test [[ 259.34155851]]
Brief answer:
You can use anova_lm (type 3) directly, or use f_test or wald_test and either construct the constraint matrix or provide the constraints of the hypothesis as a sequence of formulas.
http://statsmodels.sourceforge.net/devel/anova.html
http://statsmodels.sourceforge.net/devel/generated/statsmodels.regression.linear_model.RegressionResults.f_test.html
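For example, a sketch of both routes applied to the fitted model r from above (the term names passed to f_test must match the names printed by r.params):
import statsmodels.api as sm

# route 1: an ANOVA table with one joint test per term
print(sm.stats.anova_lm(r, typ=3))

# route 2: a joint F-test on the two dummy coefficients, written as formulas
print(r.f_test("C(G)[T.2.0] = 0, C(G)[T.3.0] = 0"))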

How to determine whether a feature contains discrete or continuous data in matlab?

I am wondering whether there is a way to determine whether a feature (a vector) contains discrete or continuous data.
like feature1 = [red, blue, green]
feature2 = [1.1, 1.2, 1.5, 1.8]
How can I judge that feature1 is discrete and feature2 is continuous?
Many Thanks.
You basically check how many distinct values your variable of interest takes. If the number of distinct values is below a percentage threshold of the number of instances, then you can treat the variable as categorical. The percentage threshold depends on the number of instances you have. For example, if you have 100 instances and set a threshold of 5%, then if those instances take fewer than 5 distinct values you can treat the variable as categorical. If you have 1,000,000 instances, a much smaller percentage threshold is appropriate.
Check out this answer from Cross Validated:
https://stats.stackexchange.com/questions/12273/how-to-test-if-my-data-is-discrete-or-continuous
Note that this answer refers to R, but the same principles apply to any programming environment, and it should not be hard to translate this to MATLAB.
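A minimal sketch of that distinct-value heuristic in Python (looks_categorical is a made-up helper; the 5% threshold is the illustrative figure from above):
import numpy as np

def looks_categorical(values, threshold=0.05):
    # treat as categorical if the share of distinct values is small
    values = np.asarray(values)
    return len(np.unique(values)) / len(values) < threshold

codes = np.random.choice(['red', 'blue', 'green'], size=1000)
print(looks_categorical(codes))         # True: 3 distinct values in 1000
measurements = np.random.randn(1000)
print(looks_categorical(measurements))  # False: essentially all distinct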
All data represented in a computer is discrete, but this is probably not the answer you are looking for.
What does the value stand for? Feature 1 seems to be discrete because it describes names of colours from a finite set. But as soon as any mixture is allowed (e.g. "23%red_42%blue_0.11%green_34.89%white"), this becomes a really strange description of a continuous artefact.
Feature 2: no idea, just some arbitrary numbers without any meaning.
This might help: class(feature), where feature is any object, tells you the class name of the object. For example:
feature1 = {'red','blue', 'green'};
feature2 = [1.1 1.2 1.5 1.8];
>> class(feature1)
ans =
cell
>> class(feature1{1})
ans =
char
>> class(feature2)
ans =
double
>> class(feature2(1))
ans =
double