How can you access specific elements of the rawPrediction in Scala?

I am trying to compute the AUC for a multiclass probabilistic classifier (random forest) in Scala. I know that traditionally this is not done.
I would like to access the specific rawPrediction value for a given label, but I cannot get that value based on the index.
How should I go about it?
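The rawPrediction column is a Spark ML Vector, so individual entries can be pulled out by position. The question asks for Scala, but purely to illustrate the indexing idea, here is a minimal sketch of the same pattern in PySpark on a made-up toy dataset (the class index, column names, and data are placeholders, not the asker's setup):

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.master("local[1]").getOrCreate()

# tiny toy multiclass dataset (3 classes)
train = spark.createDataFrame(
    [(0.0, Vectors.dense(0.0, 1.0)),
     (1.0, Vectors.dense(1.0, 0.0)),
     (2.0, Vectors.dense(1.0, 1.0))],
    ["label", "features"],
)

model = RandomForestClassifier(numTrees=5).fit(train)
pred = model.transform(train)

# pull the raw score for one class index (here: class 2) out of the rawPrediction vector
label_index = 2
get_raw = udf(lambda v: float(v[label_index]), DoubleType())
pred.select("label", get_raw("rawPrediction").alias("raw_for_class_2")).show()
```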

Is it possible to use evaluation metrics (like NDCG) as a loss function?

I am working on an Information Retrieval model called DPR, which is basically a neural network (two BERTs) that ranks documents given a query. Currently, this model is trained in a binary manner (documents are either relevant or not relevant) and uses Negative Log Likelihood (NLL) loss. I want to change this binary behavior and create a model that can handle graded relevance (like 3 grades: relevant, somewhat relevant, not relevant). I have to change the loss function because currently I can only assign one positive target for each query (DPR uses PyTorch's NLLLoss), and this is not what I need.
I was wondering if I could use an evaluation metric like NDCG (Normalized Discounted Cumulative Gain) to calculate the loss. I mean, the whole point of a loss function is to tell how far off our prediction is, and NDCG does the same.
So, can I use such metrics in place of a loss function with some modifications? In the case of NDCG, I think something like subtracting the result from 1 (1 - NDCG_score) might be a good loss function. Is that true?
With best regards, Ali.
Yes, this is possible. You would want to apply a listwise learning-to-rank approach instead of the more standard pairwise loss function.
In pairwise loss, the network is provided with example pairs (relevant, non-relevant) and the ground-truth label is binary (say 1 if the first of the pair is relevant, and 0 otherwise).
In the listwise learning approach, however, during training you provide a list instead of a pair, and the ground-truth value (still binary) indicates whether this permutation is indeed the optimal one, e.g. the one which maximizes nDCG. In a listwise approach, the ranking objective is thus transformed into a classification of permutations.
For more details, refer to this paper.
Obviously, instead of taking features as input, the network may take BERT vectors of the query and the documents within a list, similar to ColBERT. Unlike ColBERT, where you feed in vectors from two documents (pairwise training), for listwise training you need to feed in vectors from, say, five documents.
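To make the listwise idea concrete, here is a minimal sketch of a ListNet-style listwise loss in PyTorch, assuming graded relevance labels per list (the list size, toy labels, and the softmax over the grades are illustrative choices, not the exact formulation of the paper linked above):

```python
import torch
import torch.nn.functional as F

def listnet_loss(scores, relevance):
    """Listwise (ListNet-style) loss: cross-entropy between the softmax of the
    model scores and the softmax of the graded relevance labels.
    scores:    (batch, list_size) raw model scores for each document in the list
    relevance: (batch, list_size) graded relevance labels, e.g. 0, 1, 2
    """
    target = F.softmax(relevance.float(), dim=-1)   # target "top-one" probabilities
    log_pred = F.log_softmax(scores, dim=-1)        # predicted log-probabilities
    return -(target * log_pred).sum(dim=-1).mean()

# toy usage: 2 queries, lists of 5 documents each
scores = torch.randn(2, 5, requires_grad=True)
relevance = torch.tensor([[2, 1, 0, 0, 1],
                          [0, 2, 1, 0, 0]])
loss = listnet_loss(scores, relevance)
loss.backward()
print(loss.item())
```

The softmax of the scores is pushed toward the softmax of the graded labels, so a list that already ranks the most relevant documents highest gets a low loss; smoothed surrogates of 1 - nDCG (e.g. ApproxNDCG) are another common listwise option.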

The implementation of lazy multi-label classifiers in Mulan

I want to use k-nearest neighbor for multi-label classification. There are some kNN-based classifiers, such as MLkNN, which are implemented in the Mulan library or are written in C or Matlab.
When I use the same classifier on a numeric dataset I get identical results,
but for nominal datasets such as slashdot and genbase (note that the data are only 0s and 1s) I obtain different results.
I want to know why this happens. These classifiers use Euclidean distance, and Mulan uses Weka's Euclidean distance.
Why do the results of the lazy classifiers in Mulan for nominal data differ from those of the implementations written in other languages? Which one is correct?
I would be happy if you could help me find the reason.

Auto-encoder based unsupervised clustering

I am trying to cluster a dataset using an encoder, and since I am new to this field I can't tell how to do it. My main issue is how to define the loss function, since the dataset is unlabeled; up to now, what I have seen in the literature defines the loss function as the distance between the desired output and the predicted output. My question is: since I don't have a desired output, how should I implement this?
You can use an autoencoder to pre-train your convolutional layers, as described in my question here about using a convolutional autoencoder for images.
As you can see from the code, the optimizer is Adam, with accuracy and the Dice coefficient as metrics; I think you can use accuracy only, since the Dice coefficient is image-specific.
I'm not sure how this will work for you, because you haven't explained how you will transform your bibliography lists into vectors; perhaps you will create a list of bibliography IDs sorted by the cosine distance between them.
For example, for each reference in your dataset you could use a vector of cosine distances to each item in the bibliography list above, and use it as input for the autoencoder.
After the encoder is trained, you can remove the decoder part from your model and use the encoder output as input for one of the unsupervised clustering algorithms, for example k-means. You can find details about them here.
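As a concrete illustration of the overall flow (reconstruction loss for training, then k-means on the encoder output), here is a minimal sketch in Keras, assuming the references have already been turned into fixed-length numeric vectors; the layer sizes, number of clusters, and random data are placeholders:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.cluster import KMeans

# assume X is a (n_samples, n_features) float array of already-vectorized data
X = np.random.rand(1000, 64).astype("float32")

input_dim, latent_dim = X.shape[1], 8

# simple dense autoencoder; the reconstruction error (MSE) is the loss,
# so no labels are needed
inputs = keras.Input(shape=(input_dim,))
encoded = layers.Dense(32, activation="relu")(inputs)
encoded = layers.Dense(latent_dim, activation="relu")(encoded)
decoded = layers.Dense(32, activation="relu")(encoded)
decoded = layers.Dense(input_dim, activation="sigmoid")(decoded)

autoencoder = keras.Model(inputs, decoded)
encoder = keras.Model(inputs, encoded)

autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=20, batch_size=32, verbose=0)  # target = input

# drop the decoder: cluster the latent codes with k-means
codes = encoder.predict(X, verbose=0)
labels = KMeans(n_clusters=5, n_init=10).fit_predict(codes)
print(labels[:10])
```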

How to see which attribute (feature) contributes most to the performance of the classification with PCA in Matlab?

I would like to perform classification on a small data set (65x9) using some of the machine learning classification methods (SVM, decision trees or any other).
So, before starting with the classification I would like to do attribute analysis with PCA in Matlab or Weka (preferably Matlab). I would like to find out which attributes contribute most to the performance of the classifier, so I can maybe reduce the number of attributes and/or include more in the future. Is there any example of PCA for this in Matlab or Weka?
Thanks
PCA is an unsupervised feature extraction method.
If your question is about selecting attributes to use with PCA, I don't know what your purpose is, but it is unnecessary to do something like that to improve classification performance. Just use all the attributes; PCA will give you derived attributes (principal components) in decreasing order of importance.
If your question is about selecting attributes after PCA, you can choose a threshold (for example 0.95 of the total variance) and, starting from the first component, count how many components are needed to reach that threshold. You can use the eigenvalues of the covariance matrix to compute the explained variance and apply the threshold in PCA.
After running PCA, we know that the first component is the best one, the second component is the best one after the first, and so on.
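The question asks about Matlab or Weka; purely to make the threshold idea concrete, here is a minimal sketch of the same computation in Python with scikit-learn (the 0.95 threshold and the random 65x9 matrix are placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X: the 65x9 data matrix (random placeholder here)
X = np.random.rand(65, 9)

X_std = StandardScaler().fit_transform(X)   # PCA is scale-sensitive
pca = PCA().fit(X_std)

# cumulative explained variance; keep enough components to reach 95%
cumvar = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.searchsorted(cumvar, 0.95) + 1)
print("components needed for 95% variance:", n_components)

# loadings: how strongly each original attribute weighs on each component
# (large absolute values on the first components point to the attributes that matter most)
print(np.round(pca.components_[:n_components], 2))
```

In Matlab, the explained output of pca gives the same percentages and coeff holds the loadings.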

KNN classification with categorical data

I'm busy working on a project involving k-nearest neighbor (KNN) classification. I have mixed numerical and categorical fields. The categorical values are ordinal (e.g. bank name, account type). Numerical types are, for example, salary and age. There are also some binary types (e.g., male, female).
How do I go about incorporating categorical values into the KNN analysis?
As far as I'm aware, one cannot simply map each categorical field to number keys (e.g. bank 1 = 1; bank 2 = 2, etc.), so I need a better approach for using the categorical fields. I have heard that one can use binary numbers. Is this a feasible method?
You need to find a distance function that works for your data. The use of binary indicator variables solves this problem implicitly. This has the benefit of allowing you to continue your probably matrix-based implementation with this kind of data, but a much simpler way - and appropriate for most distance-based methods - is to just use a modified distance function.
There is an infinite number of such combinations. You need to experiment to find what works best for you. Essentially, you might want to use some classic metric on the numeric values (usually with normalization applied; but it may make sense to also move this normalization into the distance function), plus a distance on the other attributes, scaled appropriately.
In most real application domains of distance-based algorithms, this is the most difficult part: optimizing your domain-specific distance function. You can see this as part of preprocessing: defining similarity.
There is much more than just Euclidean distance. There are various set-theoretic measures which may be much more appropriate in your case, for example the Tanimoto coefficient, Jaccard similarity, Dice's coefficient and so on. Cosine might be an option, too.
There are whole conferences dedicated to the topics of similarity search - nobody claimed this is trivial in anything but Euclidean vector spaces (and actually, not even there): http://www.sisap.org/2012
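For instance, here is a minimal sketch of such a combined distance in Python with scikit-learn, assuming the numeric columns are min-max normalized and the categorical columns are integer codes compared only for equality; the columns, toy data, and the categorical weight are made up and need tuning for any real dataset:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# toy data: columns 0-1 numeric (salary, age), columns 2-3 categorical codes (bank, account type)
X = np.array([[50_000, 30, 0, 1],
              [52_000, 32, 0, 1],
              [90_000, 55, 2, 0],
              [30_000, 22, 1, 2]], dtype=float)
y = np.array([0, 0, 1, 1])

NUM_COLS = [0, 1]        # numeric columns
CAT_COLS = [2, 3]        # categorical columns (integer codes; the order is meaningless)
W_CAT = 1.0              # placeholder weight for the categorical part, to be tuned

# min-max statistics so salary does not dominate age
mins, maxs = X[:, NUM_COLS].min(axis=0), X[:, NUM_COLS].max(axis=0)

def mixed_distance(a, b):
    # Euclidean on the normalized numeric part + weighted mismatch count on the categorical part
    a_num = (a[NUM_COLS] - mins) / (maxs - mins)
    b_num = (b[NUM_COLS] - mins) / (maxs - mins)
    d_num = np.linalg.norm(a_num - b_num)
    d_cat = np.sum(a[CAT_COLS] != b[CAT_COLS])
    return d_num + W_CAT * d_cat

knn = KNeighborsClassifier(n_neighbors=3, metric=mixed_distance, algorithm="brute")
knn.fit(X, y)
print(knn.predict(X[:1]))
```

Scaling W_CAT up or down controls how much a categorical mismatch counts relative to the normalized numeric distance, which is exactly the "scaled appropriately" part mentioned above.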
The most straightforward way to convert categorical data into numeric is by using indicator vectors. See the reference I posted in my previous comment.
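A minimal sketch of that indicator-vector encoding with pandas (the column names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "salary": [50_000, 90_000, 30_000],
    "bank":   ["bank_a", "bank_b", "bank_a"],
    "type":   ["savings", "cheque", "savings"],
})

# each category becomes its own 0/1 indicator column; numeric columns are kept as-is
encoded = pd.get_dummies(df, columns=["bank", "type"])
print(encoded)
```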
Can we use Locality Sensitive Hashing (LSH) + edit distance and assume that every bin represents a different category? I understand that categorical data does not show any order and the bins in LSH are arranged according to a hash function. Finding the hash function that gives a meaningful number of bins sounds to me like learning a metric space.