Does sklearn support a cost matrix? - classification

Is it possible to train classifiers in sklearn with a cost matrix that assigns different costs to different mistakes? For example, in a two-class problem the cost matrix would be a 2 by 2 square matrix with A_ij = the cost of classifying class i as class j.
The main classifier I am using is a Random Forest.
Thanks.

The cost-sensitive framework you describe is not supported in scikit-learn, in any of the classifiers we have.

You could use a custom scoring function that accepts a matrix of per-class or per-instance costs. Here's an example of a scorer that calculates per-instance misclassification cost:
import pandas as pd

def financial_loss_scorer(y, y_pred, **kwargs):
    totals = kwargs['totals']
    # Create an indicator - 0 if correct, 1 otherwise
    errors = pd.DataFrame((~(y == y_pred)).astype(int).rename('Result'))
    # Use the product totals dataset to attach a cost to each error
    results = errors.merge(totals, left_index=True, right_index=True, how='inner')
    # Calculate per-prediction loss
    loss = results.Result * results.SumNetAmount
    return loss.sum()
The scorer becomes:
make_scorer(financial_loss_scorer, totals=totals_data, greater_is_better=False)
Where totals_data is a pandas.DataFrame with indexes that match the training set indexes.
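To make this concrete, here is a minimal sketch of plugging such a scorer into cross-validation. It assumes financial_loss_scorer and totals_data from above, and hypothetical X (a DataFrame) and y (a Series); keeping y as a Series matters so the index-based merge still works inside each fold.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

# greater_is_better=False flips the sign, so higher cross-validation scores mean lower total loss
cost_scorer = make_scorer(financial_loss_scorer, totals=totals_data, greater_is_better=False)
cv_scores = cross_val_score(RandomForestClassifier(n_estimators=200), X, y, scoring=cost_scorer, cv=5)
print(cv_scores.mean())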

You could always just look at your ROC curve. Each point on the ROC curve corresponds to a separate confusion matrix, so choosing a classifier threshold picks a confusion matrix, and that choice implicitly encodes some cost weighting scheme. You then just have to pick the threshold whose confusion matrix corresponds to the cost matrix you are looking for.
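As a rough sketch of that idea (clf, X_val, y_val and the two cost constants are placeholders, not anything from the question): sweep the decision threshold on held-out data and keep the one with the lowest total cost.
import numpy as np

cost_fp, cost_fn = 1.0, 5.0                      # hypothetical costs of a false positive / false negative
probs = clf.predict_proba(X_val)[:, 1]           # clf is any fitted binary classifier with predict_proba
thresholds = np.linspace(0.01, 0.99, 99)
total_costs = []
for t in thresholds:
    y_hat = (probs >= t).astype(int)
    fp = np.sum((y_hat == 1) & (y_val == 0))
    fn = np.sum((y_hat == 0) & (y_val == 1))
    total_costs.append(cost_fp * fp + cost_fn * fn)
best_threshold = thresholds[int(np.argmin(total_costs))]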
On the other hand, if you really have your heart set on it and really want to "train" an algorithm using a cost matrix, you can "sort of" do it in sklearn.
Although you cannot directly train an algorithm to be cost-sensitive in sklearn, you can use a cost-matrix-style setup to tune your hyper-parameters. I've done something similar using a genetic algorithm. It doesn't do a great job, but it should give a modest boost in performance.
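For example, one way to approximate this (a sketch, not a built-in sklearn feature) is to turn the cost matrix into a scorer via the confusion matrix and let a grid search pick the Random Forest hyper-parameters that minimize total cost. The cost values, the 0/1 class labels and X, y are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import GridSearchCV

cost_matrix = np.array([[0, 1],
                        [5, 0]])  # hypothetical: entry (i, j) = cost of classifying true class i as j

def total_cost(y_true, y_pred):
    # Rows of the confusion matrix are true classes, columns are predicted classes
    cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
    return float(np.sum(cm * cost_matrix))

cost_scorer = make_scorer(total_cost, greater_is_better=False)
search = GridSearchCV(RandomForestClassifier(),
                      param_grid={'n_estimators': [100, 300], 'max_depth': [5, 10, None]},
                      scoring=cost_scorer, cv=5)
search.fit(X, y)  # X, y are your training data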

One way to circumvent this limitation is to use under- or oversampling. E.g., if you are doing binary classification with an imbalanced dataset and want to make errors on the minority class more costly, you could oversample it. You may want to have a look at imbalanced-learn, which is a package from scikit-learn-contrib.
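A minimal sketch with imbalanced-learn's random oversampler (X and y stand for your original, imbalanced training data):
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier

ros = RandomOverSampler(random_state=0)            # duplicate minority-class samples until the classes are balanced
X_resampled, y_resampled = ros.fit_resample(X, y)
clf = RandomForestClassifier().fit(X_resampled, y_resampled)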

This may not directly answer your question (since you are asking about Random Forest).
But for SVM in sklearn, you can use the class_weight parameter to specify the weights of the different classes. Essentially, you pass in a dictionary mapping each class label to its weight.
You might want to refer to this page to see an example of using class_weight.
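For instance, a small sketch (the 5x weight on class 1 is just an illustrative choice, and X_train, y_train are placeholders):
from sklearn.svm import SVC

# Penalize mistakes on class 1 five times as heavily as mistakes on class 0
clf = SVC(kernel='rbf', class_weight={0: 1, 1: 5})
clf.fit(X_train, y_train)
Several other sklearn classifiers expose the same class_weight parameter.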

Is it possible to use evaluation metrics (like NDCG) as a loss function?

I am working on an Information Retrieval model called DPR, which is basically a neural network (2 BERTs) that ranks documents given a query. Currently, this model is trained in a binary manner (documents are either relevant or not relevant) and uses Negative Log Likelihood (NLL) loss. I want to change this binary behavior and create a model that can handle graded relevance (like 3 grades: relevant, somewhat relevant, not relevant). I have to change the loss function because currently I can only assign one positive target to each query (DPR uses PyTorch's NLLLoss), and this is not what I need.
I was wondering if I could use an evaluation metric like NDCG (Normalized Discounted Cumulative Gain) to calculate the loss. I mean, the whole point of a loss function is to tell how far off our prediction is, and NDCG does the same.
So, can I use such metrics in place of a loss function with some modifications? In the case of NDCG, I think something like subtracting the result from 1 (1 - NDCG_score) might be a good loss function. Is that true?
With best regards, Ali.
Yes, this is possible. You would want to apply a listwise learning-to-rank approach instead of the more standard pairwise loss function.
In a pairwise loss, the network is provided with example pairs (relevant, non-relevant) and the ground-truth label is binary (say 1 if the first element of the pair is the relevant one, and 0 otherwise).
In the listwise approach, however, during training you provide a list instead of a pair, and the ground-truth value (still binary) indicates whether this permutation is the optimal one, e.g. the one that maximizes nDCG. In a listwise approach, the ranking objective is thus transformed into a classification over permutations.
For more details, refer to this paper.
Obviously, instead of taking hand-crafted features as input, the network may take BERT vectors of the query and the documents within a list, similar to ColBERT. Unlike ColBERT, where you feed in vectors from 2 documents (pairwise training), for listwise training you need to feed in vectors from, say, 5 documents.
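As a very rough illustration of the listwise idea (this is a generic ListNet-style top-one loss in PyTorch, not the DPR or ColBERT code itself), the graded relevance labels can be turned into a target distribution over the list and compared against the distribution induced by the model scores:
import torch
import torch.nn.functional as F

def listnet_top_one_loss(scores, relevance):
    # scores: (batch, list_size) raw model scores for each document in the list
    # relevance: (batch, list_size) graded labels, e.g. 0 = not relevant, 1 = somewhat, 2 = relevant
    target_dist = F.softmax(relevance.float(), dim=-1)        # "ideal" top-one probabilities from the labels
    log_pred_dist = F.log_softmax(scores, dim=-1)             # top-one log-probabilities from the model
    return -(target_dist * log_pred_dist).sum(dim=-1).mean()  # cross-entropy between the two distributions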

When to use PCA for dimensionality reduction?

I am using the Matlab Classification Learner app to test different classifiers over a training set (size = 700). My response variable is a categorical label with 5 possible values. I have 7 numerical features and 2 categorical ones. I found a Cubic SVM to have the highest accuracy of 83%. But the performance goes down considerably when I enable PCA with 95% explained variance (accuracy = 40.5%). I am a student and this is the first time I am using PCA.
Why do I see such a result?
Could it be because of a small / unbalanced data set?
When is it useful to apply PCA? When we say "reduce dimensionality", is there a minimum number of features (dimensionality) in the original set?
Any help is appreciated. Thanks in advance!
I want to share my opinion.
A training set of 700 means your data is < 1k samples, which is very small. I'm actually surprised the SVM reaches 83%. Even MNIST (60,000 training / 10,000 test images) is considered small, and your data is much, much smaller.
If you then use PCA to shrink this small dataset even further, what is left for the SVM to learn? There may be hardly any discriminating information remaining.
If I were you I would also test a random-forest classifier; it might even perform better.
Even if you balance your data, it is still a small dataset. I don't believe SMOTE would improve the result. If your data consisted of images you could use ImageDataGenerator to replicate them, though I'm not sure Matlab has an equivalent.
PCA is useful when you have lots of features (dimensions); not every feature directly affects the accuracy, even though they are all components of the data.
For instance, consider handwritten digit classification: can we say that each pixel directly affects the accuracy? No. The black background pixels carry almost no information about the class, and removing that redundancy is exactly what PCA is used for.
If you want a detailed explanation with a Python example, check out my other answer.
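For a quick feel for this in Python (a sketch on sklearn's small digits dataset, not your Matlab data), you can let PCA keep just enough components to explain 95% of the variance and see how many of the 64 pixel features survive:
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, y = load_digits(return_X_y=True)      # 1797 images of 8x8 = 64 pixels
pca = PCA(n_components=0.95)             # keep enough components for 95% explained variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                   # far fewer than 64 columns, since border pixels carry little variance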

Simple Sequential feature selection in Matlab

I have a noisy 40x3249 dataset and a 40x1 result set. I want to perform simple sequential feature selection on it in Matlab. The Matlab example is complicated and I can't follow it, and a few examples on Stack Overflow didn't help either. I want to use a decision tree as the classifier to perform feature selection. Can someone please explain in simple terms?
Also, is it a problem that my dataset has a very low number of observations compared to the number of features?
I am following this example: Sequential feature selection Matlab and I am getting error like this:
The pooled covariance matrix of TRAINING must be positive definite.
I've explained the error message you're getting in answers to your previous questions.
In general, it is a problem that you have many more variables than samples. This will prevent you from using some techniques, such as the discriminant analysis you were attempting, but it's a problem in any case. The fact is that with such a high ratio of variables to samples, it is very likely that some combination of variables would perfectly classify your dataset even if they were all random numbers. That's true if you build a single decision tree model, and even more true if you use a feature selection method to explicitly search through combinations of variables.
I would suggest you try some sort of dimensionality reduction method. If all of your variables are continuous, you could try PCA as suggested by @user1207217. Alternatively you could use a latent variable method for model-building, such as PLS (plsregress in MATLAB).
If you're still intent on using sequential feature selection with a decision tree on this dataset, then you should be able to modify the example in the question you linked to, replacing the call to classify with one to classregtree.
This error comes from the use of the classify function in that question, which is performing LDA. This error occurs when the data is rank deficient (or in other words, some features are almost exactly correlated). In order to overcome this, you should project the data down to a lower dimensional subspace. Principal component analysis can do this for you. See here for more details on how to use pca function within statistics toolbox of Matlab.
[basis, ~, variances] = pca(X); % columns of "basis" are the principal directions, "variances" is the variance along each; X has observations as rows
indices = find(variances > eps(2*max(variances))); % keep components whose variance exceeds machine precision relative to the largest one (with a little extra tolerance, 2x)
new_basis = basis(:, indices); % the retained principal directions, stored as column vectors
X_new = X*new_basis; % inner products between the retained basis vectors spanning a subspace of the original space and the original feature vectors
This should get you automatic projections down into a relevant subspace. Note that your features won't have the same meaning as before, because they will be weighted combinations of the old features.
Extra note: If you don't want to change your feature representation, then instead of classify, you need to use something which works with rank deficient data. You could roll your own version of penalised discriminant analysis (which is quite simple), use support vector machines, or other classification functions which don't break with correlated features as LDA does (by virtue of requiring matrix inversion of the covariance estimate).
EDIT: P.S I haven't tested this, because I have rolled my own version of PCA in Matlab.

SVM Classification with Cross Validation

I am new to using Matlab and am trying to follow the example in the Bioinformatics Toolbox documentation (SVM Classification with Cross Validation) to handle a classification problem.
However, I am not able to understand Step 9, which says:
Set up a function that takes an input z=[rbf_sigma,boxconstraint], and returns the cross-validation value of exp(z).
The reason to take exp(z) is twofold:
rbf_sigma and boxconstraint must be positive.
You should look at points spaced approximately exponentially apart.
This function handle computes the cross validation at parameters
exp([rbf_sigma,boxconstraint]):
minfn = @(z)crossval('mcr',cdata,grp,'Predfun', ...
    @(xtrain,ytrain,xtest)crossfun(xtrain,ytrain,...
    xtest,exp(z(1)),exp(z(2))),'partition',c);
What is the function that I should be implementing here? Is it exp or minfn? I would appreciate it if you could give me the code for this section. Thanks.
I would also like to know what it means when it says exp([rbf_sigma,boxconstraint]).
rbf_sigma: the SVM is using a Gaussian kernel, and rbf_sigma sets the standard deviation (~size) of that kernel. To understand how kernels work: the SVM puts a kernel around every sample (so you have a Gaussian around every sample). The kernels are then summed over the samples of each category/type. At each point, the type whose sum is higher is the "winner". For example, if type A has the higher sum of these kernels at point X, then a new datum at point X will be classified as type A. (There are other configuration parameters that may change the actual threshold at which one category is selected over another.)
Look at the figure from the webpage you linked. You can see how, by adding up the Gaussian kernels on the red samples ("sumA") and on the green samples ("sumB"), it is logical that sumA > sumB in the center part of the figure, and that sumB > sumA in the outer part of the image.
boxconstraint: this is a cost/penalty on misclassified data. During the training stage, where the training data are used to adjust the SVM parameters, the training algorithm uses an error function to decide how to optimize the SVM parameters iteratively. The cost for a misclassified sample is proportional to how far it is from the boundary where it would have been classified correctly. In the figure, that boundary is the inner blue circumference.
Taking into account BGreene's indications and what I understand from the tutorial:
In the tutorial they advise trying values for rbf_sigma and boxconstraint that are exponentially apart. This means you should compare values like {0.2, 2, 20, ...} (note that this is {2*10^(i-2), i=1,2,3,...}), and NOT values like {0.2, 0.3, 0.4, 0.5} (which are linearly apart). They advise this so that you try a wide range of values first; you can refine the search later, starting FROM the first optimum you obtained.
The command "[searchmin fval] = fminsearch(minfn,randn(2,1),opts)" will give you back the optimum values for rbf_sigma and boxconstraint. You probably have to use exp(z) because it affects how fminsearch increments the values of z(1) and z(2) during the search for the optimum: with exp(z(1)) in the definition of @minfn, fminsearch effectively takes 'exponentially' big steps.
In machine learning, always remember that there are three subsets of your data: training data, cross-validation data, and test data. The training set is used to optimize the parameters of the SVM classifier for EACH value of rbf_sigma and boxconstraint. The cross-validation set is then used to select the optimum values of rbf_sigma and boxconstraint. Finally, the test data are used to get an idea of the performance of your classifier (the classifier's performance is judged on the test set).
So, if you start with 10000 samples you may divide the data for example as training(50%), cross-validation(25%), test(25%). So that you will sample randomly 5000 samples for the training set, then 2500 samples from the 5000 remaining samples for the cross-validation set, and the rest of samples (that is 2500) would be separated for the test set.
I hope that I could clarify your doubts. By the way, if you are interested in the optimization of the parameters of classifiers and machine learning algorithms I strongly suggest that you follow this free course -> www.ml-class.org (it is awesome, really).
You need to implement a function called crossfun (see example).
The function handle minfn is passed to fminsearch to be minimized.
exp([rbf_sigma,boxconstraint]) is the quantity being optimized to minimize classification error.
There are a number of functions nested within this function handle:
- crossval is producing the classification error based on cross validation using partition c
- crossfun - classifies data using an SVM
- fminsearch - optimizes SVM hyperparameters to minimize classification error
Hope this helps

Principal component analysis

I have to write a classifier (Gaussian mixture model) that I will use for human action recognition.
I have 4 video datasets. I choose 3 of them as the training set and 1 of them as the testing set.
Before I apply the GM model to the training set, I run PCA on it.
pca_coeff = princomp(training_data);
score = training_data * pca_coeff;
training_data = score(:,1:min(size(score,2),numDimension));
During the testing step, what should I do? Should I run a new princomp on the testing data,
new_pca_coeff=princomp(testing_data);
score = testing_data * new_pca_coeff;
testing_data = score(:,1:min(size(score,2),numDimension));
or should I use the pca_coeff that I computed for the training data?
score = testing_data * pca_coeff;
testing_data = score(:,1:min(size(score,2),numDimension));
The classifier is being trained on data in the space defined by the principal components of the training data. It doesn't make sense to evaluate it in a different space; therefore, you should apply the same transformation to the testing data as you did to the training data, so don't compute a different pca_coeff.
Incidentally, if your testing data are drawn independently from the same distribution as the training data, then for large enough training and test sets the principal components should be approximately the same.
One method for choosing how many principal components to use involves examining the eigenvalues from the PCA decomposition. You can get these from the princomp function like this:
[pca_coeff score eigenvalues] = princomp(data);
The eigenvalues variable will then be an array where each element describes the amount of variance accounted for by the corresponding principal component. If you do:
plot(eigenvalues);
you should see that the first eigenvalue is the largest and that they rapidly decrease (this is called a "scree plot", and should look like this: http://www.ats.ucla.edu/stat/SPSS/output/spss_output_pca_5.gif, though yours may have up to 800 points instead of 12).
Principal components with small corresponding eigenvalues are unlikely to be useful, since the variance of the data in those dimensions is so small. Many people choose a threshold value and then select all principal components whose eigenvalue is above that threshold. An informal way of picking the threshold is to look at the scree plot and choose the threshold to be just after the point where the line 'levels out'; in the image linked above, a good value might be ~0.8, selecting 3 or 4 principal components.
IIRC, you could do something like:
proportion_of_variance = sum(eigenvalues(1:k)) ./ sum(eigenvalues);
to calculate "the proportion of variance described by the low dimensional data".
However, since you are using the principal components for a classification task, you can't really be sure that any particular number of PCs is optimal; the variance of a feature doesn't necessarily tell you anything about how useful it will be for classification. An alternative to choosing PCs with the scree plot is simply to try classification with various numbers of principal components and see which number works best empirically.