I want to use k-nearest neighbors for multi-label classification. There are several kNN-based classifiers, such as ML-kNN, that are implemented in the Mulan library or written in C or MATLAB.
When I use the same classifier on a numeric dataset I get identical results,
but on nominal datasets such as slashdot and genbase (notably, the attribute values are only 0 and 1) I obtain different results.
I want to know why this happens. These classifiers use Euclidean distance, and Mulan uses Weka's Euclidean distance.
Why do the lazy classifiers in Mulan give different results on nominal data than the implementations written in other languages, and which one is correct?
I would be happy if you could help me find the reason.
I am trying to cluster a dataset using an encoder, and since I am new to this field I can't tell how to do it. My main issue is how to define the loss function: the dataset is unlabeled, and everything I have seen in the literature so far defines the loss as the distance between the desired output and the predicted output. Since I don't have a desired output, how should I implement this?
You can use an autoencoder to pre-train your convolutional layers, as described in my question here about using a convolutional autoencoder for images.
As you can see from the code, the optimizer is Adam, with accuracy and the Dice coefficient as metrics; I think you can use accuracy only, since the Dice coefficient is image-specific.
I'm not sure how it will work for you, because you haven't explained how you will transform your bibliography lists into vectors; perhaps you will create a list of bibliography IDs sorted by the cosine distance between them.
For example, for each reference in your dataset you could use a vector of cosine distances to each item in the bibliography list above, and use that as the input to the autoencoder.
After the encoder is trained, you can remove the decoder part from your model and use the encoder output as the input to one of the unsupervised clustering algorithms, for example k-means; a sketch is below. You can find details about them here.
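A minimal sketch of that pipeline in Python, assuming Keras and scikit-learn, with a random array standing in for your bibliography feature vectors (the layer sizes and cluster count are placeholders, not recommendations):

import numpy as np
from tensorflow.keras import layers, models
from sklearn.cluster import KMeans

# Stand-in for your bibliography feature vectors (e.g. cosine distances).
X = np.random.rand(200, 100).astype("float32")
n_features = X.shape[1]

inputs = layers.Input(shape=(n_features,))
encoded = layers.Dense(64, activation="relu")(inputs)
encoded = layers.Dense(16, activation="relu")(encoded)  # bottleneck code
decoded = layers.Dense(64, activation="relu")(encoded)
decoded = layers.Dense(n_features, activation="linear")(decoded)

autoencoder = models.Model(inputs, decoded)
# Reconstruction loss: the "desired output" is the input itself,
# so no labels are needed.
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=50, batch_size=32, verbose=0)

# Remove the decoder: keep only the encoder for clustering.
encoder = models.Model(inputs, encoded)
codes = encoder.predict(X)

clusters = KMeans(n_clusters=5).fit_predict(codes)  # 5 is a placeholder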
My aim is to classify the data into two sections, upper and lower, by finding the mid line between the peaks.
I would like to apply machine learning methods, e.g. discriminant analysis.
Could you let me know how to do that in MATLAB?
It seems that what you are looking for is a GMM (Gaussian mixture model). With K=2 (the number of mixture components) and dimension equal to 1 this is a simple, fast method that gives you a direct solution. Given the components, it is easy to find the local minimum of the density analytically (roughly a weighted average of the means, with weights proportional to the standard deviations).
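The question asks for MATLAB (fitgmdist is the analogous tool there), but here is a minimal sketch of the idea in Python with scikit-learn's GaussianMixture; the data array is synthetic, standing in for your 1-D measurements:

import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for your 1-D peak values (two groups).
x = np.concatenate([np.random.normal(1.0, 0.2, 500),
                    np.random.normal(3.0, 0.3, 500)]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2).fit(x)

# Find the density minimum between the two component means:
# that is the "mid line" separating the upper and lower peaks.
m = np.sort(gmm.means_.ravel())
grid = np.linspace(m[0], m[1], 1000)
density = gmm.score_samples(grid.reshape(-1, 1))  # log-density
midline = grid[np.argmin(density)]

labels = (x.ravel() > midline).astype(int)  # 0 = lower section, 1 = upper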
I have found the following Matlab implementation of a Naive Bayes classifier:
https://github.com/jjedele/Naive-Bayes-Classifier-Octave-Matlab
What is the difference between Gaussian Naive Bayes and Naive Bayes? How could I extend the above implementation to become Gaussian Naive Bayes?
How can I extend the implementation to work with 4 classes? Just one-vs-all?
Thank you very much for the help.
In Naive Bayes classification we take a set of features (x0, x1, ..., xn) and try to assign it to one class from a known set Y of classes (y0, y1, ..., yk). We do that by using training data to calculate the conditional probabilities p(x|C), which tell us how often a particular class had a certain feature value in the training set, and then multiplying them together.
The result is a score for each class in the set Y. We then take the highest-scoring member of Y as the class that our feature set should be assigned to.
Up until this point we haven't made any assumptions about what the p(x|C) distributions look like.
In Gaussian Naive Bayes we assume that all those p(x|C) values are normally distributed. That's the only "difference", and it really isn't a difference: GNB is just a special case of Naive Bayes.
This can be useful if you don't have a lot of training data and are willing to assume that the population is normally distributed about the mean of the sample (training) data you do have.
Full disclosure: the TeX comes from Wikipedia.
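To make the Gaussian assumption concrete, here is a minimal Python sketch (the linked repository is Octave/MATLAB, but the idea carries over): per class, estimate a mean and variance for each feature, then score a sample by the class prior times the product of the per-feature Gaussian densities. Note this handles 4 classes directly, so no one-vs-all wrapping is needed:

import numpy as np

def fit_gnb(X, y):
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0),        # per-feature mean
                     Xc.var(axis=0) + 1e-9,  # per-feature variance (smoothed)
                     len(Xc) / len(X))       # class prior
    return params

def predict_gnb(params, x):
    best_class, best_score = None, -np.inf
    for c, (mu, var, prior) in params.items():
        # Work in log space to avoid underflow when multiplying densities.
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        score = np.log(prior) + log_lik
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Tiny synthetic example with 4 classes:
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in range(4)])
y = np.repeat(np.arange(4), 50)
model = fit_gnb(X, y)
print(predict_gnb(model, np.array([2.1, 1.9])))  # expected: 2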
Is it possible to train classifiers in sklearn with a cost matrix that assigns different costs to different mistakes? In a 2-class problem, for example, the cost matrix would be a 2 by 2 square matrix, with A_ij = the cost of classifying i as j.
The main classifier I am using is a Random Forest.
Thanks.
The cost-sensitive framework you describe is not supported in scikit-learn, in any of the classifiers we have.
You could use a custom scoring function that accepts a matrix of per-class or per-instance costs. Here's an example of a scorer that calculates per-instance misclassification cost:
import pandas as pd

def financial_loss_scorer(y, y_pred, **kwargs):
    totals = kwargs['totals']

    # Create an indicator - 0 if correct, 1 otherwise
    errors = pd.DataFrame((~(y == y_pred)).astype(int).rename('Result'))

    # Use the product totals dataset to create results
    results = errors.merge(totals, left_index=True, right_index=True, how='inner')

    # Calculate per-prediction loss
    loss = results.Result * results.SumNetAmount

    return loss.sum()
The scorer becomes:
from sklearn.metrics import make_scorer

make_scorer(financial_loss_scorer, totals=totals_data, greater_is_better=False)
Where totals_data is a pandas.DataFrame whose index matches the training set index.
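For example, the scorer could then drive a hyper-parameter search for the Random Forest (a sketch, assuming X and y are pandas objects whose index lines up with totals_data, since the scorer joins on the index):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer

scorer = make_scorer(financial_loss_scorer, totals=totals_data,
                     greater_is_better=False)

search = GridSearchCV(RandomForestClassifier(),
                      param_grid={'n_estimators': [100, 300],
                                  'max_depth': [None, 10]},
                      scoring=scorer)
search.fit(X, y)  # selects the parameters with the lowest total cost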
You could always just look at your ROC curve. Each point on the ROC curve corresponds to a separate confusion matrix, so choosing a classifier threshold amounts to choosing a confusion matrix, which in turn implies some cost weighting scheme. You then just have to pick the threshold whose confusion matrix matches the cost matrix you are looking for.
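As a sketch of that idea (the function name and the example cost matrix are illustrative), you can sweep the thresholds returned by scikit-learn's roc_curve and pick the one that minimizes the expected cost under your 2x2 cost matrix:

import numpy as np
from sklearn.metrics import roc_curve

def min_cost_threshold(y_true, y_score, A):
    # A[i][j] = cost of predicting j when the truth is i.
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    # Counts at each threshold: FP = fpr * n_neg, TP = tpr * n_pos, etc.
    cost = (A[0][1] * fpr * n_neg           # false positives
            + A[0][0] * (1 - fpr) * n_neg   # true negatives
            + A[1][1] * tpr * n_pos         # true positives
            + A[1][0] * (1 - tpr) * n_pos)  # false negatives
    return thresholds[np.argmin(cost)]

# e.g. thr = min_cost_threshold(y_test, clf.predict_proba(X_test)[:, 1],
#                               [[0, 1], [5, 0]])  # illustrative names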
On the other hand, if you really have your heart set on it and really want to "train" an algorithm using a cost matrix, you can "sort of" do it in sklearn.
Although it is impossible to directly train an algorithm to be cost-sensitive in sklearn, you can use a cost-matrix-style setup to tune your hyper-parameters. I've done something similar using a genetic algorithm. It really doesn't do a great job, but it should give a modest boost to performance.
One way to circumvent this limitation is to use under- or oversampling. E.g., if you are doing binary classification with an imbalanced dataset and want to make errors on the minority class more costly, you can oversample it. You may want to have a look at imbalanced-learn, a package from scikit-learn-contrib.
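A minimal sketch with imbalanced-learn's RandomOverSampler on a synthetic imbalanced dataset (duplicating minority samples makes mistakes on that class weigh more during training):

from collections import Counter
from imblearn.over_sampling import RandomOverSampler
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 90/10 imbalanced binary dataset.
X, y = make_classification(weights=[0.9, 0.1], n_samples=1000, random_state=0)

X_res, y_res = RandomOverSampler(random_state=0).fit_resample(X, y)
print(Counter(y), "->", Counter(y_res))  # minority class duplicated to parity

clf = RandomForestClassifier().fit(X_res, y_res)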
This may not directly answer your question (since you are asking about Random Forest).
But for SVMs in sklearn, you can use the class_weight parameter to specify the weights of the different classes; essentially, you pass in a dictionary.
You might want to refer to this page for an example of using class_weight.
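For example (a sketch; the 1:10 weighting is arbitrary, and note that class_weight also exists on RandomForestClassifier, so the same trick applies to the question's classifier):

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Make errors on class 1 cost ten times more than errors on class 0.
svm = SVC(class_weight={0: 1, 1: 10})

# The same parameter exists on Random Forest.
rf = RandomForestClassifier(class_weight={0: 1, 1: 10})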
I am trying to use LibSVM for regression. I am trying to detect faces (10 classes of different faces). I labeled classes 1-10 as faces and class 11 as non-face. I want to develop a script using LibSVM that gives me a continuous score between 0 and 1 if the test image falls into any of the 10 face classes, and otherwise gives me -1 (non-face). From this score, I want to be able to predict the class: if the test image matches the 1st class, the score should be around 0.1; similarly, if it matches class 10, the score should be around 1 (some continuous value close to 1).

I am trying to solve this with SVR using LibSVM. I can easily get the predicted class through classification, but I want the continuous score value, which I hoped to get through regression. I have been looking on the net for the function, or the parameters of a function, for SVR in LibSVM, but I couldn't find anything. Can anybody please help me in this regard?
This is not a regression problem. Solving it through regression will not yield good results.
You are dealing with a multiclass classification problem. The best way to approach this is to construct 10 one-vs-all classifiers with probabilistic output. To get probabilistic output (e.g. in the interval [0,1]), you can train and predict with the -b 1 option for C-SVC (-s 0).
If any of the 10 classifiers yields a sufficiently large probability for its positive class, you use that probability (which is close to 1). If none of the 10 classifiers yields its positive class with high enough confidence, you can default to -1.
So in short: make a multiclass classifier containing one-vs-all classifiers with probabilistic output. Subsequently post-process the predictions as I described, using a probability threshold of your choice (for example 0.7).
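A sketch of that recipe in Python, using scikit-learn's libsvm wrapper (SVC(probability=True) corresponds to training with -b 1; the synthetic data stands in for your face feature vectors, with labels 1-10 for faces and -1 returned below the threshold):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic stand-in for face feature vectors with 10 face classes.
X, y = make_classification(n_samples=500, n_classes=10, n_informative=8,
                           random_state=0)
y = y + 1  # relabel 0-9 -> 1-10 to match the question's face classes

# One-vs-rest C-SVC with probabilistic output (libsvm's -b 1).
clf = OneVsRestClassifier(SVC(probability=True)).fit(X, y)

def predict_face(x, threshold=0.7):
    probs = clf.predict_proba(x.reshape(1, -1))[0]
    best = np.argmax(probs)
    # Return the face class if confident enough, otherwise -1 (non-face).
    return clf.classes_[best] if probs[best] >= threshold else -1

print(predict_face(X[0]))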