classification algorithms for high dimension dataset - classification

I'm new to machine learning, I need advice on what kind of classification algorithms and techniques can be used to deal with high dimension datasets (129 rows and 1900 columns) for classification problem.

Algorithms will remain same as any other classification but you can do the try following:
Do dimensionality reduction using PCA to reduce dimensions
Use forward or backward selection algorithms
Remove highly correlated variable
Use L1 regularisation with high alpha value as it does features selection intinsically

Related

When to use PCA for dimensionality reduction?

I am using the Matlab Classification Learner app to test different classifiers over a training set (size = 700). My response variable is a categorical label with 5 possible values. I have 7 numerical features and 2 categorical ones. I found a Cubic SVM to have the highest accuracy of 83%. But the performance goes down considerably when I enable PCA with 95% explained variance (accuracy = 40.5%). I am a student and this is the first time I am using PCA.
Why do I see such a result?
Could it be because of a small / unbalanced data set?
When is it useful to apply PCA? When we say "reduce dimensionality", is there a minimum number of features (dimensionality) in the original set?
Any help is appreciated. Thanks in advance!
I want to share my opinion
I think training set 700 means, your data is < 1k.
I'm even surprised that svm performs 83%.
Even MNIST dataset is considered to be small (60.000 training - 10.000 test). Your data is much-much smaller.
You try to reduce your small data even smaller using pca. So what will svm learns? There is no discriminating samples left?
If I were you I would test using random-forest classifier. Random-forest might even perform better.
Even if you balanced your data, it is small data.
I believe using SMOTE will not improve the result. If your data consist of images then you could use ImageDataGenerator for replicating your data. Though I'm not sure matlab contains ImageDataGenerator.
You will use PCA, when you have lots of samples. Yet the samples are not directly effecting the accuracy but they are the components of data.
For instance: Let's consider handwritten digit classification data.
From above can we say each pixel is directly effecting the accuracy?
The answer is no? Above the black pixels are not important for the accuracy, therefore to remove them we use pca.
If you want a detailed explanation with a python example. Check out my other answer

Dimensionality reduction, noralization, resampling, k-fold CV... In what order?

In Python I am working on a binary classification problem of Fraud detection on travel insurance. Here is the characteristic about my dataset:
Contains 40,000 samples with 20 features. After one hot encoding, the number of features is 50(4 numeric, 46 categorical).
Majority unlabeled: out of 40,000 samples, 33,000 samples are unlabeled.
Highly imbalanced: out of 7,000 labeled samples, only 800 samples(11%) are positive(Fraud).
Metrics is precision, recall and F2 score. We focus more on avoiding false positive, therefore high recall is appreciated. As preprocessing I oversampled positive cases using SMOTE-NC, which takes into account categorical variables as well.
After trying several approaches including Semi-Supervised Learning with Self Training and Label Propagation/Label Spreading etc, I achieved high recall score(80% on training, 65-70% on test). However, my precision score shows some trace of overfitting(60-70% on training, 10% on testing). I understand that precision is good on training because it's resampled, and low on test data because it directly reflects the imbalance of the classes in test data. But this precision score is unacceptably low so I want to solve it.
So to simplify the model I am thinking about applying dimensionality reduction. I found a package called prince which comes with FAMD(Factor Analysis for Mixture Data).
Question 1: How I should do normalization, FAMD, k-fold Cross Validation and resampling? Is my approach below correct?
Question 2: The package prince does not have methods such as fit or transform like in Sklearn, so I cannot do the 3rd step described below. Any other good packages to do fitand transform for FAMD? And is there any other good way to reduce dimensionality on this kind of dataset?
My approach:
Make k folds and isolate one of them for validation, use the rest for training
Normalize training data and transform validation data
Fit FAMD on training data, and transform training and test data
Resample only training data using SMOTE-NC
Train whatever model it is, evaluate on validation data
Repeat 2-5 k times and take the average of precision, recall F2 score
*I would also appreciate for any kinds of advices on my overall approach to this problem
Thanks!

Choosing a margin for contrastive loss in a siamese network

I'm building a siamese network for a metric-learning task, using a contrastive loss function, and I'm uncertain on how to set the 'margin' hyperparameter for the loss.
My inputs to the loss function are currently 1024-dimension dense embeddings from an RNN layer - Does the dimensionality of that input affect how I pick a margin? Should I use a dense layer to project it to a lower-dimensional space first? Any pointers on how to pick a specific margin value (or any relevant research) would be really appreciated! In case it matters, I'm using PyTorch.
You don't need to project it to a lower dimensional space.
The dependence of the margin with the dimensionality of the space depends on how the loss is formulated: If you don't normalize the embedding values and compute a global difference between vectors, the right margin will depend on the dimensionality. But if you compute a normalize difference, such as cosine distance, the margin values won't depend on the dimensionality of the embedding space.
Here ranking (or contrastive) losses are explained, it might be useful https://gombru.github.io/2019/04/03/ranking_loss/

How to train large dataset for classification in MATLAB

I have a large features dataset of around 111 Mb for classification with 217000 data points and each point has 1760000 features point. When used in training with SVM in MATLAB, it takes a lot of time.
How can be this data processed in MATLAB.
It depends on what sort of SVM you are building.
As a rule of thumb, with such big feature sets you need to look at linear classifiers, such as an SVM with no/the linear kernel, or logistic regression with various regularizations etc.
If you're training an SVM with a Gaussian kernel, the training algorithm has O(max(n,d) min (n,d)^2) complexity, where n is the number of examples and d the number of features. In your case it ends up being O(dn^2) which is quite big.

Evaluating performance of Neural Network embeddings in kNN classifier

I am solving a classification problem. I train my unsupervised neural network for a set of entities (using skip-gram architecture).
The way I evaluate is to search k nearest neighbours for each point in validation data, from training data. I take weighted sum (weights based on distance) of labels of nearest neighbours and use that score of each point of validation data.
Observation - As I increase the number of epochs (model1 - 600 epochs, model 2- 1400 epochs and model 3 - 2000 epochs), my AUC improves at smaller values of k but saturates at the similar values.
What could be a possible explanation of this behaviour?
[Reposted from CrossValidated]
To cross check if imbalanced classes are an issue, try fitting a SVM model. If that gives a better classification(possible if your ANN is not very deep) it may be concluded that classes should be balanced first.
Also, try some kernel functions to check if this transformation makes data linearly separable?