Weka SimpleKMeans handling of nominal attributes - cluster-analysis

I am trying to understand how SimpleKMeans in Weka handles nominal attributes and why it is not efficient in handling such attributes.
I read that it calculates modes for such attributes. I want to know how the similarity is calculated.
Let's take an example:
Consider a dataset with three numeric attributes and one nominal attribute.
The nominal attribute has 3 values: A, B and C.
Instance1 has value A, Instance2 has value B and Instance3 has value A.
In this case, Instance1 may be more similar to Instance3 (depending on the other numeric attributes, of course). How will SimpleKMeans work in this case?
Follow up:
What if the nominal attribute has more (say, 10) possible values?

You can try converting each such nominal attribute to binary indicator features, e.g. has_A, has_B, has_C. If you then scale them, i1 and i3 will be closer (referring to your example), since they agree on every indicator, and i2 will stand out more.
If the attribute has more possible values, you just add one binary feature per value. Basically you pivot each nominal attribute.
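A minimal sketch of what that pivoting does to distances in your example (plain Python with numpy; the has_A/has_B/has_C columns are the illustration above, not anything Weka itself generates):
import numpy as np

# One-hot ("pivoted") encoding of the nominal attribute:
# Instance1 = A, Instance2 = B, Instance3 = A
i1 = np.array([1, 0, 0])  # has_A, has_B, has_C
i2 = np.array([0, 1, 0])
i3 = np.array([1, 0, 0])

# Euclidean distance on the indicator features
print(np.linalg.norm(i1 - i3))  # 0.0   -> i1 and i3 agree on this attribute
print(np.linalg.norm(i1 - i2))  # ~1.41 -> i2 stands out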

Related

Putting weights on values of a categorical feature

Suppose we have the following dataset
df = pd.DataFrame({'feature 1': ['a','b','c','d','e'],
                   'feature 2': [1,2,3,4,5],
                   'y': [1,0,0,1,1]})
As we can see, feature 1 is categorical. In the usual tree-based models such as XGBoost or CatBoost, the values under each feature are treated with the same weight. I was wondering how one can assign weights to the individual values of a categorical feature? For instance, I want my model to put weight 1 on a, 0.5 on b, 2 on c, 1 on d and 0.6 on e. This is different from assigning a weight to a feature as a whole, as I am trying to let the model understand that each value under each feature has its own distinct weight.

Matlab - Stepwise GLM with Categoricals

I have a table of 85 predictors, some of which are numerical, logical, ordinal and nominal (one-hot encoded). They predict a single finalScore outcome variable which ranges from 0 to 1. I'm running a stepwise GLM using:
model2 = stepwiseglm(predictors, finalScore);
Each predictor's header indicates which of the four types it is, and I'm wondering if there is a way to tell the model that there are these different types. This page suggests there is for categoricals, but so far I have not found anything covering each of the 4 types I have.
Per the Generalized Linear Models walk-through:
For a table or dataset array tbl, fitting functions assume that these data types are categorical:
Logical
Categorical (nominal or ordinal)
Character array
As long as the data is represented by the appropriate types in the input table, you shouldn't have to specify anything further. To ensure this you can typecast nominal predictors with categorical(), ordinal ones with ordinal(), and logical ones with logical().
You can specify categorical vs. non-categorical with stepwiseglm(..., 'CategoricalVars', [0 1 0 1 0 0 0 ...]); but if you typecast your input correctly this should be redundant anyway.
Once the model is built, you can verify that categorical variables and ranges are handled appropriately by checking model2.VariableInfo.

Neutrality for sentiment analysis in spark

I have built a pretty basic Naive Bayes classifier on Apache Spark, using MLlib of course. But I have a few questions about what exactly neutrality means.
From what I understand, a given dataset contains pre-labeled sentences covering the necessary classes; let's take 3 for the example below.
0-> Negative sentiment
1-> Positive sentiment
2-> Neutral sentiment
This neutral class is pre-labeled in the training set itself.
Is there any other form of neutrality handling? Suppose there are no neutral sentences available in the dataset; is it then possible to derive neutrality from the probability scale, like
0.0 - 0.4 => Negative
0.4 - 0.6 => Neutral
0.6 - 1.0 => Positive
Is this kind of mapping possible in Spark? I searched around but could not find any. The NaiveBayesModel class in the RDD API has a predict method which just returns a double that is mapped according to the training set, i.e. if only 0 and 1 are present it will return only 0 or 1, not a value on a 0.0 - 1.0 scale as above.
Any pointers/advice on this would be incredibly helpful.
Edit - 1
Sample code
// Performs tokenization, POS tagging and then lemmatization
// Returns an array of strings
val tokenizedString = Util.tokenizeData(text)
val hashingTF = new HashingTF()
// Returns a double
// According to the training set, 1.0 => Positive, 0.0 => Negative
val status = model.predict(hashingTF.transform(tokenizedString.toSeq))
if (status == 1.0) "Positive" else "Negative"
Sample dataset content
1,Awesome movie
0,This movie sucks
Of course the original dataset contains longer sentences, but this should be enough for the explanation, I guess.
The above code is what I am computing with. My questions are the same:
1) Neutrality handling in the dataset
If I add another category to the above dataset, such as
2,This movie can be enjoyed by kids
(for argument's sake, let's assume it is a neutral review), then the model.predict method will return 1.0, 0.0 or 2.0 based on the passed-in sentence.
2) model.predictProbabilities gives an array of doubles, but I am not sure in what order it gives the results, i.e. is index 0 for negative or for positive? With three classes, i.e. Negative, Positive and Neutral, in what order will that method return the predictions?
It would have been helpful to have the code that builds the model (for your example to work, the 0.0 from the dataset must end up as a Double in the model, either after indexing with a StringIndexer stage or by converting it when reading the file), but assuming that this code works:
val status = model.predict(hashingTF.transform(tokenizedString.toSeq))
if(status == 1.0) "Positive" else "Negative"
Then yes, it means the probability at index 0 is that of the negative class and the one at index 1 that of the positive class (it's a bit strange and there must be a reason, but everything is a double in ML, even feature and category indexes). If you have something like this in your code:
val labelIndexer = new StringIndexer()
.setInputCol("sentiment")
.setOutputCol("indexedsentiment")
.fit(trainingData)
Then you can use labelIndexer.labels to identify the labels (the probability at index 0 is for labelIndexer.labels at index 0).
Now regarding your other questions.
Neutrality can mean two different things. Type 1: a review contains as many positive as negative words. Type 2: there is (almost) no sentiment expressed.
A Neutral category can be very helpful if you want to manage Type 2. If that is the case, you need neutral examples in your dataset. Naive Bayes is not a good classifier for applying thresholds to the probabilities in order to determine Type 2 neutrality.
Option 1: Build a dataset (if you think you will have to deal with a lot of Type 2 neutral texts). The good news is, building a neutral dataset is not too difficult. For instance, you can pick random texts that are not movie reviews and assume they are neutral. It would be even better if you could pick content that is closely related to movies (but neutral), like a dataset of movie synopses. You could then create a multi-class Naive Bayes classifier (between neutral, positive and negative) or a hierarchical classifier (a first-step binary classifier determines whether a text is a movie review or not; a second step determines the overall sentiment).
Option 2 (can be used to deal with both Type 1 and Type 2). As I said, Naive Bayes is not great at dealing with thresholds on the probabilities, but you can try that. Without a dataset, though, it will be difficult to determine the thresholds to use. Another approach is to count the number of words or stems that have a significant polarity. One quick and dirty way to achieve that is to query your classifier with each individual word and count the number of times it returns "positive" with a probability significantly higher than that of the negative class (discard words whose probabilities are too close to each other, for instance within 25%; a bit of experimentation will be needed here). In the end, you may have, say, 20 positive words vs. 15 negative ones and decide the review is neutral because the counts are balanced, or you may have 0 positive and 1 negative and return neutral because the count of polarized words is too low.
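A rough sketch of that quick-and-dirty heuristic in plain Python (classify_word is a hypothetical stand-in for querying your classifier with a single word, and margin, min_polarized and the balance ratio are exactly the knobs that need experimentation):
def classify_word(word):
    # Hypothetical stand-in: query your trained classifier with a single
    # word and return (p_negative, p_positive).
    raise NotImplementedError

def neutrality_from_word_counts(words, margin=0.25, min_polarized=3):
    pos = neg = 0
    for w in words:
        p_neg, p_pos = classify_word(w)
        if abs(p_pos - p_neg) < margin:
            continue  # probabilities too close: not a polarized word
        if p_pos > p_neg:
            pos += 1
        else:
            neg += 1
    if pos + neg < min_polarized:
        return "Neutral"  # too few polarized words (Type 2)
    if abs(pos - neg) / (pos + neg) < 0.2:
        return "Neutral"  # balanced counts, e.g. 20 vs. 15 (Type 1)
    return "Positive" if pos > neg else "Negative"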
Good luck and hope this helped.
I am not sure if I understand the problem, but:
the prior in Naive Bayes is computed from the data and cannot be set manually.
in MLlib you can use predictProbabilities to obtain class probabilities.
in ML you can use setThresholds to set a prediction threshold for each class.
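If you still want to try the banded mapping from the question on top of predictProbabilities, the mapping itself is trivial; a minimal sketch in plain Python (the 0.4/0.6 cut-offs are the ones proposed in the question, not principled values, and the caveat above about thresholding Naive Bayes probabilities applies):
def sentiment_from_probability(p_positive, low=0.4, high=0.6):
    # Map the positive-class probability onto the three proposed bands.
    if p_positive < low:
        return "Negative"
    if p_positive <= high:
        return "Neutral"
    return "Positive"

print(sentiment_from_probability(0.95))  # Positive
print(sentiment_from_probability(0.50))  # Neutral
print(sentiment_from_probability(0.10))  # Negative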

How do I joint test a multi-level categorical effect in ipython using statsmodels?

I am using the Ordinary Least Squares (ols) function in statsmodels in ipython to fit a linear model where one covariate (City) is a multi-level categorical effect:
result = smf.ols(formula="Y ~ C(City) + X*C(Group)", data=s).fit()
(X is continuous, Group is a binary categorical variable.)
When I do result.summary(), I get one row per level of City; however, what I would like to know is the overall significance of the 'City' covariate (i.e., compare Y ~ C(City) + X*C(Group) with the partial model Y ~ X*C(Group)).
Is there a way of doing it?
Thanks in advance.
Thank you user333700!
Here's an elaboration of your hint. I generate data with a 3-level categorical variable, use statsmodels to fit a model, and then test all levels of the categorical variable jointly:
# 0. imports
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# 1. generate data
def rnorm(n, u, s):
    return np.random.standard_normal(n) * s + u

a = rnorm(100, -1, 1)
b = rnorm(100, 0, 1)
c = rnorm(100, +1, 1)
n = rnorm(300, 0, 1)  # some noise
y = np.concatenate((a, b, c)) + n
g = np.zeros(300)
g[0:100] = 1
g[100:200] = 2
g[200:300] = 3
df = pd.DataFrame({'Y': y, 'G': g, 'N': n})

# 2. fit model
r = smf.ols(formula="Y ~ N + C(G)", data=df).fit()
r.summary()

# 3. joint test
print(r.params)
A = np.identity(len(r.params))  # identity matrix with size = number of params
GroupTest = A[1:3, :]  # for the categorical var., keep the corresponding rows of A
CovTest = A[3, :]      # row for the continuous var.
print("Group effect test", r.f_test(GroupTest).fvalue)
print("Covariate effect test", r.f_test(CovTest).fvalue)
The result should be something like this:
Intercept -1.188975
C(G)[T.2.0] 1.315898
C(G)[T.3.0] 2.137431
N 0.922038
dtype: float64
Group effect test [[ 120.86097747]]
Covariate effect test [[ 259.34155851]]
Brief answer:
You can use anova_lm (type 3) directly, or use f_test or wald_test and either construct the constraint matrix or provide the constraints of the hypothesis as a sequence of formulas.
http://statsmodels.sourceforge.net/devel/anova.html
http://statsmodels.sourceforge.net/devel/generated/statsmodels.regression.linear_model.RegressionResults.f_test.html
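A short sketch of both routes, reusing the fitted model r from the elaborated answer above (anova_lm and string constraints are standard statsmodels features; note that type-3 results depend on the coding scheme used for the categorical variable):
import statsmodels.api as sm

# Route 1: a type-3 ANOVA table gives one joint F-test per model term
print(sm.stats.anova_lm(r, typ=3))

# Route 2: the same joint hypothesis written as a constraint string
# (the names match the rows of r.params shown above)
print(r.f_test("C(G)[T.2.0] = 0, C(G)[T.3.0] = 0"))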

How to determine whether a feature contains discrete or continuous data in matlab?

I am wondering whether there is a way to determine if a feature (a vector) contains discrete or continuous data,
like feature1 = [red, blue, green]
feature2 = [1.1, 1.2, 1.5, 1.8]
How can I tell that feature1 is discrete and feature2 is continuous?
Many thanks.
You basically check how many distinct values there are in your variable of interest. If the number of distinct values is below a percentage threshold of the number of instances, then you can treat the variable as categorical. The percentage threshold depends on the number of instances you have. For example, if you have 100 instances and set a threshold of 5%, then if these instances take fewer than 5 distinct values it is reasonable to treat the variable as categorical. If you have 1,000,000 instances, a 5% threshold would already allow 50,000 distinct values, so you would want a much lower percentage (or an absolute cutoff).
Check out this answer from cross validated.
https://stats.stackexchange.com/questions/12273/how-to-test-if-my-data-is-discrete-or-continuous
Note that this answer refers to R but the same principles apply to any programming environment and it should not be hard to translate this to matlab.
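The heuristic from the first paragraph is only a few lines in any environment; here is a sketch in Python rather than matlab, where the 5% threshold is the arbitrary knob discussed above:
def looks_categorical(values, threshold=0.05):
    # Treat a feature as categorical/discrete when the share of distinct
    # values among all instances falls below the threshold.
    return len(set(values)) / len(values) < threshold

feature1 = ['red', 'blue', 'green'] * 100          # 3 distinct values in 300
feature2 = [1.1 + 0.001 * i for i in range(300)]   # 300 distinct values in 300

print(looks_categorical(feature1))  # True  -> treat as discrete/categorical
print(looks_categorical(feature2))  # False -> treat as continuous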
All data represented in a computer is discrete, but this is probably not the answer you are looking for.
What does the value stand for? Feature 1 seems to be discrete because it describes names for colours drawn from a finite set. But as soon as any mixture is allowed (e.g. "23%red_42%blue_0.11%green_34.89%white") this becomes a really strange description of a continuous artefact.
Feature 2: no idea; some arbitrary numbers without any meaning.
This might help: class(feature), where feature is any object, tells you the class name of the object. For example:
feature1 = {'red','blue', 'green'};
feature2 = [1.1 1.2 1.5 1.8]
>> class(feature1)
ans =
cell
>> class(feature1{1})
ans =
char
>> class(feature2)
ans =
double
>> class(feature2(1))
ans =
double