How to use sample weights for a random forest classifier in Orange? - orange

I am trying to train a random forest classifier on a very imbalanced dataset with 2 classes (benign/malign).
I followed the code from a previous question (How to set up and use sample weight in the Orange python package?) and tried assigning various higher weights to the minority-class instances, but the classifiers I get behave exactly the same.
My code:
import Orange

data = Orange.data.Table(filename)
st = Orange.classification.tree.SimpleTreeLearner(min_instances=3)
forest = Orange.ensemble.forest.RandomForestLearner(learner=st, trees=40, name="forest")
# Register a meta attribute that holds the per-instance weight
weight = Orange.feature.Continuous("weight")
weight_id = -10
data.domain.add_meta(weight_id, weight)
data.add_meta_attribute(weight, 1.0)
# Give the minority class a larger weight
for inst in data:
    if inst[data.domain.class_var] == 'malign':
        inst[weight] = 100
classifier = forest(data, weight_id)
Am I missing something?

The simple tree learner is exactly that: it is optimized for speed and does not support weights. Arguably, learning algorithms in Orange that do not support weights should raise an exception when the weight argument is specified.
If you need the weights only to change the class distribution, replicate data instances instead: create a new data table and add 100 copies of each malignant instance.
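The replication idea can be sketched in plain Python, independent of Orange. The helper name `oversample` and the (features, label) row layout are assumptions made for illustration only; with an Orange table you would append the copies to a new table instead.

```python
def oversample(rows, minority_label, copies):
    """Return a new list in which every minority-class row appears `copies` times.

    `rows` is a list of (features, label) tuples; all other rows are kept once.
    """
    result = []
    for features, label in rows:
        repeat = copies if label == minority_label else 1
        result.extend([(features, label)] * repeat)
    return result


# Tiny illustrative dataset: three benign rows, one malign row.
rows = [
    ([0.1, 0.2], "benign"),
    ([0.3, 0.1], "benign"),
    ([0.2, 0.4], "benign"),
    ([0.9, 0.8], "malign"),
]
balanced = oversample(rows, "malign", copies=3)
malign_count = sum(1 for _, label in balanced if label == "malign")
```

Because the class distribution is changed in the data itself, the result can be passed to any learner, including ones that ignore weights.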

Related

How to emphasize selected output in neural network

I'm training a model on a data set with 17 features and 5 outputs using PyTorch. I'm most interested in two of them, let's say outputs 2 and 3 out of 0-4. What's a good strategy to get as high an accuracy as possible on 2 and 3, while the rest can have lower accuracy?
If you are using nn.CrossEntropyLoss(), you can pass in the weights to emphasize or de-emphasize certain classes. From the PyTorch docs:
torch.nn.CrossEntropyLoss(weight: Optional[torch.Tensor] = None, ...)
The weights do not have to sum to one: with reduction='mean', which is the default setting, PyTorch divides the weighted loss by the sum of the weights of the targets in the batch. The weights specify which classes count more heavily when calculating the loss. In other words, the higher a class's weight, the higher the penalty for mispredicting samples of that class.
import torch
import torch.nn as nn
x = torch.randn(10, 5) # dummy data
target = torch.randint(0, 5, (10,)) # dummy targets
weights = torch.tensor([1., 1., 2., 2., 1.]) # emphasize classes 2 and 3
criterion_weighted = nn.CrossEntropyLoss(weight=weights)
loss_weighted = criterion_weighted(x, target)
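To see what the weighting does, the weighted mean can be written out by hand. The following plain-Python sketch (the function name `weighted_cross_entropy` is hypothetical, not a PyTorch API) follows the formula documented for reduction='mean': each sample's negative log-likelihood is multiplied by the weight of its target class, and the sum is divided by the sum of those weights.

```python
import math


def weighted_cross_entropy(logits, targets, weights):
    """Weighted mean cross-entropy over a batch of raw class scores."""
    total_loss = 0.0
    total_weight = 0.0
    for row, target in zip(logits, targets):
        # Log of the softmax normalizer, computed with the log-sum-exp trick.
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in row))
        nll = log_norm - row[target]
        total_loss += weights[target] * nll
        total_weight += weights[target]
    return total_loss / total_weight


logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
targets = [0, 1]  # the first sample is predicted well, the second poorly
uniform = weighted_cross_entropy(logits, targets, [1.0, 1.0, 1.0])
emphasized = weighted_cross_entropy(logits, targets, [1.0, 3.0, 1.0])
rescaled = weighted_cross_entropy(logits, targets, [2.0, 6.0, 2.0])
```

Raising the weight of class 1 makes the badly predicted class-1 sample dominate the loss, while multiplying all weights by the same constant leaves the result unchanged.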

How to decide the range for the hyperparameter space in SVM tuning? (MATLAB)

I am tuning an SVM, using a for loop to search over a range of the hyperparameter space. The learned SVM model contains the following fields:
SVMModel: [1×1 ClassificationSVM]
C: 2
FeaturesIdx: [4 6 8]
Score: 0.0142
Question 1) What is the meaning of the field 'Score', and what is it used for?
Question 2) I am tuning the BoxConstraint (the C value). Let the number of features be denoted by the variable featsize. The variable gridC holds the search space, which can run from, say, 2^-5, 2^-3, ... up to 2^15, so gridC = 2.^(-5:2:15). Is there a principled way to select this range?
1. The score is documented here, which says:
Classification Score
The SVM classification score for classifying observation x is the signed distance from x to the decision boundary ranging from -∞ to +∞.
A positive score for a class indicates that x is predicted to be in
that class. A negative score indicates otherwise.
In the two-class case, suppose there are six observations and the predict function returns a score matrix TestScore; we can then determine the class of each observation with:
TestScore=[-0.4497 0.4497;
-0.2602 0.2602;
-0.0746 0.0746;
0.1070 -0.1070;
0.2841 -0.2841;
0.4566 -0.4566;];
[~,Classes] = max(TestScore,[],2);
In two-class classification we can also test the sign of the scores (TestScore > 0) instead. Either way, the first three observations belong to the second class, and the 4th to 6th observations belong to the first class.
In multiclass cases several scores may be > 0, but max(Scores,[],2) is still valid. For example, the following code (from here, an example called Find Multiple Class Boundaries Using Binary SVM) determines the classes of the samples to predict:
for j = 1:numel(classes);
    [~,score] = predict(SVMModels{j},Samples);
    Scores(:,j) = score(:,2); % Second column contains positive-class scores
end
[~,maxScore] = max(Scores,[],2);
maxScore then holds the index of the predicted class for each sample.
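For readers who do not use MATLAB, the same decision rule can be sketched in Python: each row holds the per-class scores of one observation, and the predicted class is the column with the largest score. The values are those of the TestScore matrix above; the helper name is made up for this illustration, and the indices are 0-based, unlike MATLAB's.

```python
def predict_from_scores(scores):
    """Return, for each row, the 0-based index of the largest score."""
    return [max(range(len(row)), key=row.__getitem__) for row in scores]


test_score = [
    [-0.4497, 0.4497],
    [-0.2602, 0.2602],
    [-0.0746, 0.0746],
    [0.1070, -0.1070],
    [0.2841, -0.2841],
    [0.4566, -0.4566],
]
predicted = predict_from_scores(test_score)
```

The same function works unchanged for a multiclass score matrix with one column per binary learner.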
2. BoxConstraint is the C parameter of the SVM, so we can train SVMs with different values and select the best one with something like:
gridC = 2.^(-5:2:15);
for ii=1:length(gridC)
    SVModel = fitcsvm(data3,theclass,'KernelFunction','rbf',...
        'BoxConstraint',gridC(ii),'ClassNames',[-1,1]);
    %if (%some constraints were met)
    %    %save the current SVModel
    %end
end
Note: Another option is libsvm, a fast and easy-to-use SVM toolbox that provides a MATLAB interface.
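As for choosing the range itself, a common practice is to start with a coarse, exponentially spaced grid, score each value on held-out data, and then refine around the best one. A minimal Python sketch of that coarse-to-fine idea follows. `validation_score` is a stand-in that you would replace with a real cross-validated accuracy; here it is a made-up function that peaks at log2(C) = 3.5 so that the example runs on its own.

```python
import math


def validation_score(c):
    """Placeholder for cross-validated accuracy; peaks at log2(C) = 3.5 by construction."""
    return 1.0 - 0.05 * abs(math.log2(c) - 3.5)


def best_on_grid(exponents, score_fn):
    """Evaluate C = 2**e for each exponent and return the best exponent."""
    return max(exponents, key=lambda e: score_fn(2.0 ** e))


# Coarse pass: the same grid as gridC = 2.^(-5:2:15).
coarse = list(range(-5, 16, 2))
coarse_best = best_on_grid(coarse, validation_score)

# Fine pass: quarter steps within one coarse step on either side of the best exponent.
fine = [coarse_best + step / 4.0 for step in range(-8, 9)]
fine_best = best_on_grid(fine, validation_score)
best_c = 2.0 ** fine_best
```

If the best value of the coarse pass lands on either end of the grid, that is usually a sign that the range should be extended in that direction before refining.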

Is nearest centroid classifier really inefficient?

I am currently reading "Introduction to machine learning" by Ethem Alpaydin. I came across nearest centroid classifiers and tried to implement one. I believe I have implemented the classifier correctly, but I am getting only 68% accuracy. So, is the nearest centroid classifier itself inefficient, or is there an error in my implementation (below)?
The data set contains 1372 data points, each with 4 features, and there are 2 output classes.
My MATLAB implementation :
DATA = load("-ascii", "data.txt");
#DATA is 1372x5 matrix with 762 data points of class 0 and 610 data points of class 1
#there are 4 features of each data point
X = DATA(:,1:4); #matrix to store all features
X0 = DATA(1:762,1:4); #matrix to store the features of class 0
X1 = DATA(763:1372,1:4); #matrix to store the features of class 1
X0 = X0(1:610,:); #to make sure both datasets have same size for prior probability to be equal
Y = DATA(:,5); # to store outputs
mean0 = sum(X0)/610; #mean of features of class 0
mean1 = sum(X1)/610; #mean of features of class 1
count = 0;
for i = 1:1372
    pre = 0;
    cost1 = X(i,:)*(mean0'); #calculates the dot product of dataset with mean of features of both classes
    cost2 = X(i,:)*(mean1');
    if (cost1<cost2)
        pre = 1;
    end
    if pre == Y(i)
        count = count+1; #counts the number of correctly predicted values
    end
end
disp("accuracy"); #calculates the accuracy
disp((count/1372)*100);
There are at least a few things here:
You are using the dot product to measure similarity in the input space, which is almost never valid. The only reason to use the dot product would be the assumption that all your data points have the same norm, or that the norm does not matter (which is nearly never true). Try using Euclidean distance instead; even though it is very naive, it should be significantly better.
Is it an inefficient classifier? That depends on the definition of efficiency. It is an extremely simple and fast one, but in terms of predictive power it is extremely weak. In fact, it is worse than Naive Bayes, which is already considered a "toy model".
There is something wrong with the code too:
X0 = DATA(1:762,1:4); #matrix to store the features of class 0
X1 = DATA(763:1372,1:4); #matrix to store the features of class 1
X0 = X0(1:610,:); #to make sure both datasets have same size for prior probability to be equal
Once you subsample X0, you have 1220 training samples, yet later, during "testing", you test on both the training set and the "missing" elements of X0, which does not really make sense from a probabilistic perspective. First, you should never measure accuracy on the training set (it overestimates the true accuracy). Second, subsampling your training data does not equalize the priors, not in a method like this one; you are simply degrading the quality of your centroid estimate, nothing else. Techniques of this kind (sub- or over-sampling) equalize priors for models that actually model priors. Your method does not (it is basically a generative model with an assumed prior of 1/2), so nothing good can come of it.
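A minimal Python sketch of a nearest centroid classifier that uses the Euclidean distance recommended above, with the centroids estimated on training data and the predictions checked on separate points. The toy data and function names are illustrative assumptions, not the data set from the question.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def nearest_centroid_predict(x, centroids):
    """Return the label whose centroid is closest to x in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))


# Training points grouped by class label.
train = {
    0: [[1.0, 1.0], [1.5, 0.5], [0.5, 1.5]],
    1: [[5.0, 5.0], [5.5, 4.5], [4.5, 5.5]],
}
centroids = {label: centroid(points) for label, points in train.items()}

# Held-out points that were not used to estimate the centroids.
held_out = [([1.2, 0.9], 0), ([4.8, 5.1], 1)]
predictions = [nearest_centroid_predict(x, centroids) for x, _ in held_out]
accuracy = sum(p == y for p, (_, y) in zip(predictions, held_out)) / len(held_out)
```

With the dot-product rule from the question, both held-out points would have a larger product with the class-1 centroid, simply because that centroid has the larger norm.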

Enhancing accuracy of knn classifier

I have a training set of size 54 * 65536 and a test set of size 18 * 65536.
I want to use a knn classifier, but I have some questions:
1) How should I define trainlabel?
Class = knnclassify(TestVec,TrainVec, TrainLabel,k);
Is it a vector of size 54 * 1 that defines which group each row of the training set belongs to? Here the groups are numbered 1, 2, ...
2) To find the accuracy I used this:
cp = classperf(TrainLabel);
Class = knnclassify(TestVec,TrainVec, TrainLabel);
cp = classperf(TestLabel,Class);
cp.CorrectRate*100
Is this right? Is there another method to calculate it?
3) How can I enhance the accuracy?
4) How do I choose the best value of k?
I do not know MATLAB or the k-NN implementation you are using, so I can answer only some of your questions.
1) Your assumption is correct. trainlabel is a 54*1 vector (or an array of size 54, or something equivalent) that defines which group each data point (row) in the training set belongs to.
2) ... MATLAB / implementation related, sorry.
3) That is a very big discussion. Possible ways are:
Choose a better value of K.
Preprocess the data (or make preprocessing better if already applied).
Get a better / bigger trainset.
to name a few...
4) You can try different values of k, measure the accuracy for each one, and keep the best. (Note: if you do that, do not measure the accuracy for each value of k only once; use a technique such as 10-fold cross-validation.)
There is a good chance that the library you are using for the k-NN classifier provides such utilities.
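The selection of k by cross-validation can be sketched in a few lines of plain Python. This is an illustration under assumed toy data and made-up function names, not the MATLAB `knnclassify` API; it uses 3 folds only because the toy set is tiny.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k training points closest to x."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def cv_accuracy(xs, ys, k, folds):
    """Fraction of points classified correctly when each fold is held out once."""
    correct = 0
    for f in range(folds):
        test_idx = [i for i in range(len(xs)) if i % folds == f]
        train_idx = [i for i in range(len(xs)) if i % folds != f]
        tx = [xs[i] for i in train_idx]
        ty = [ys[i] for i in train_idx]
        correct += sum(knn_predict(tx, ty, xs[i], k) == ys[i] for i in test_idx)
    return correct / len(xs)


# Toy data: two well-separated 2-D clusters with group labels 1 and 2.
xs = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.2, 0.4], [0.4, 0.1],
      [3.0, 3.1], [3.2, 3.0], [3.1, 3.3], [3.3, 3.2], [3.2, 3.4], [3.4, 3.1]]
ys = [1] * 6 + [2] * 6

# Score a few candidate values of k and keep the best one.
scores = {k: cv_accuracy(xs, ys, k, folds=3) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
```

On real data the candidate values of k would give different scores; odd values avoid ties when there are two groups.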

Example of fuzzy logic in classification

I need to classify objects using fuzzy logic. Each object is characterized by 4 features - {size, shape, color, texture}. Each feature is fuzzified by linguistic terms and some membership function. The problem is I am unable to understand how to defuzzify such that I may know which class an unknown object belongs to. Using the Mamdani Max-Min inference, can somebody help in solving this issue?
Objects = {Dustbin, Can, Bottle, Cup}, denoted as {1,2,3,4} respectively. The fuzzy sets for each feature are:
Feature : Size
$\tilde{Size_{Large}}$ = {1/1,1/2,0/3,0.6/4} for crisp values in range 10cm - 20 cm
$\tilde{Size_{Small}}$ = {0/1,0/2,1/3,0.4/4} (4cm - 10cm)
Shape:
$\tilde{Shape_{Square}}$ = {0.9/1, 0/2,0/3,0/4} for crisp values in range 50-100
$\tilde{Shape_{Cylindrical}}$ = {0.1/1, 1/2,1/3,1/4} (10-40)
Feature : Color
$\tilde{Color_{Reddish}}$ = {0/1, 0.8/2, 0.6/3,0.3/4} say red values in between 10-50 (not sure, assuming)
$\tilde{Color_{Greenish}}$ = {1/1, 0.2/2, 0.4/3, 0.7/4} say color values in 100-200
Feature : Texture
$\tilde{Tex_{Coarse}}$ = {0.2/1, 0.2/2,0/3,0.5/4} if texture crisp values 10-20
$\tilde{Tex_{Shiny}}$ = {0.8/1, 0.8/2, 1/3, 0.5/4} 30-40
The If then else rules for classification are
R1: IF object is large in size AND cylindrical shape AND greenish in color AND coarse in texture THEN object is Dustbin
or in tabular form just to save space
Object type   Size    Shape         Color      Texture
Dustbin       large   cylindrical   greenish   coarse
Can           small   cylindrical   reddish    shiny
Bottle        small   cylindrical   reddish    shiny
Cup           small   cylindrical   greenish   shiny
Then, there is an unknown object with crisp feature values X = {12cm, 52, 120, 11}. How do I classify it? Or is my understanding incorrect, so that I need to reformulate the entire thing?
Fuzzy logic means that every pattern belongs to each class to some degree. In other words, the output of the algorithm for every pattern could be a vector of, let's say, percentages of similarity to each class that sum to unity. The decision for a class could then be taken by checking a threshold. This means that the purpose of fuzzy logic is to quantify the uncertainty. If you just need a decision for your case, a simple minimum distance classifier or a majority vote should be enough. Otherwise, redefine your problem, taking the "number factor" into consideration.
One possible approach is to define a centroid for each feature's distinct linguistic term, for example Large_size=15cm and Small_size=7cm. The membership function can then be defined as a function of the distance from these centroids. Then you could do the following:
1) For every feature, calculate the Euclidean distance from the centroid and pass it through a Gaussian or Butterworth kernel (in order to capture the range around the centroid). Prepare a kernel for every class; for example, dustbin as a target needs large size, coarse texture, etc.
2) Calculate the product of all the above (this is a Naive Bayes approach). Fuzzy logic ends here.
3) Then, you could assign the pattern to the class with the highest value of the membership function.
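The three steps above can be sketched as follows, using a Gaussian kernel. Only the two size centroids (15 and 7) come from the text; the other centroids are assumed to be the midpoints of the ranges given in the question, and every spread value is an assumption chosen for illustration.

```python
import math

# (centroid, spread) per linguistic term. Centroids other than size are assumed
# range midpoints; all spreads are assumed values.
TERMS = {
    "size": {"large": (15.0, 5.0), "small": (7.0, 3.0)},
    "shape": {"square": (75.0, 25.0), "cylindrical": (25.0, 15.0)},
    "color": {"reddish": (30.0, 20.0), "greenish": (150.0, 50.0)},
    "texture": {"coarse": (15.0, 5.0), "shiny": (35.0, 5.0)},
}

# Rule table from the question: one required term per feature for each class.
RULES = {
    "Dustbin": {"size": "large", "shape": "cylindrical", "color": "greenish", "texture": "coarse"},
    "Can": {"size": "small", "shape": "cylindrical", "color": "reddish", "texture": "shiny"},
    "Bottle": {"size": "small", "shape": "cylindrical", "color": "reddish", "texture": "shiny"},
    "Cup": {"size": "small", "shape": "cylindrical", "color": "greenish", "texture": "shiny"},
}


def membership(value, centroid, spread):
    """Gaussian membership: 1 at the centroid, decaying with distance from it."""
    return math.exp(-0.5 * ((value - centroid) / spread) ** 2)


def class_scores(sample):
    """Product of the per-feature memberships required by each class rule."""
    scores = {}
    for label, rule in RULES.items():
        score = 1.0
        for feature, term in rule.items():
            centroid, spread = TERMS[feature][term]
            score *= membership(sample[feature], centroid, spread)
        scores[label] = score
    return scores


# Unknown object from the question: X = {12cm, 52, 120, 11}.
sample = {"size": 12.0, "shape": 52.0, "color": 120.0, "texture": 11.0}
scores = class_scores(sample)
predicted = max(scores, key=scores.get)
```

Note that Can and Bottle share the same rule in the table, so they always receive the same score; telling them apart would require an additional feature or rule. Replacing the product with min() over the memberships would give a max-min style rule evaluation instead.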
Sorry for taking too long to answer, hope this will help.