MinMaxScaler vs StandardScaler for Scaling Features?

I'm training a neural network to predict Bitcoin close prices. I'm testing MinMaxScaler vs StandardScaler for the input features (High, Low, Volatility) and using MSE (mean squared error) to evaluate the results.
[Plot: predictions with MinMaxScaler]
[Plot: predictions with StandardScaler]
My questions:
As the plots show, MinMaxScaler does a worse job of predicting prices, yet its MSE is 0.107 while StandardScaler's MSE is 0.2. Why is that?
Is it because MinMaxScaler scales to [0, 1], so the scaled values are closer together compared to StandardScaler?
Which type of scaling is used in research papers? Most of them don't mention that information, so I can't tell whether my results are better or worse than theirs.
Both scalers scale each column individually, right? Because each feature has a very different range of values (Volatility vs Prices). Also, I've noticed that after fitting all the features together, the relationship between features is lost, e.g. scaled Low prices end up higher than scaled High prices!

StandardScaler is useful for features that follow a normal distribution. It centers each feature to mean 0 and scales it to unit variance.
MinMaxScaler may be used when the upper and lower bounds are well known from domain knowledge. It scales every feature to the range [0, 1], or to [-1, 1] if there are negative values in the dataset. This scaling compresses all the inliers into a narrow range.
The Bitcoin price distribution appears roughly normal, so StandardScaler predicts it better.
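For reference, a small sketch with made-up numbers illustrating two of the points above: scikit-learn's scalers are fit per column, and MSE values computed on differently scaled targets are not comparable across scalers; to compare models fairly, map the predictions back to price units with inverse_transform first.

    import numpy as np
    from sklearn.preprocessing import MinMaxScaler, StandardScaler
    from sklearn.metrics import mean_squared_error

    # toy data: High, Low, Volatility columns (made-up values)
    X = np.array([[105.0,  98.0, 0.02],
                  [110.0, 101.0, 0.05],
                  [120.0, 112.0, 0.08]])

    # Both scalers fit each column independently (its own min/max or mean/std),
    # so cross-column relationships such as High > Low are not preserved.
    print(MinMaxScaler().fit_transform(X))
    print(StandardScaler().fit_transform(X))

    # MSE on scaled targets is not comparable across scalers because the target
    # lives on a different numeric range in each case. Compare in price units:
    y = np.array([[104.0], [109.0], [119.0]])           # toy close prices
    y_scaler = MinMaxScaler().fit(y)
    y_pred_scaled = y_scaler.transform(y) * 0.98         # pretend model output
    mse_in_prices = mean_squared_error(y, y_scaler.inverse_transform(y_pred_scaled))
    print(mse_in_prices)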

Related

Confusion matrix doesn't match evaluation

I'm training a CNN with TensorFlow for image classification on the Food-101 dataset and reach a test accuracy of about 80% (using model.evaluate()).
The issue is that when I plot the confusion matrix for the 3 classes involved, it looks very different, with at most 40% on the main diagonal.
I could understand it if at least one of the 3 classes were around 100%, because then the averaged accuracy could still be high despite bad results on the other classes. But in this case none of them is anywhere near what I get during evaluation.
I also tried plotting the confusion matrix with the training data, on which I reached more than 90% accuracy during training, and it is wrong there as well.
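One frequent cause of such a mismatch is that the labels and the predictions are collected from two different (e.g. shuffled) passes over the data, so rows and labels no longer line up. A rough sketch of one consistent way to build the matrix, assuming a trained Keras model named model and a tf.data dataset named test_ds of (image, label) batches without shuffle() or repeat() (both names are hypothetical):

    import numpy as np
    import tensorflow as tf

    # Collect the labels and the predictions from the same, un-shuffled pass,
    # so that row i of y_true and row i of y_pred refer to the same image.
    y_true = np.concatenate([labels.numpy() for _, labels in test_ds])
    # if the labels are one-hot encoded, use labels.numpy().argmax(axis=1) above
    y_pred = np.argmax(model.predict(test_ds), axis=1)

    cm = tf.math.confusion_matrix(y_true, y_pred, num_classes=3).numpy()
    print(cm)
    # overall accuracy recomputed from the matrix; this should match model.evaluate()
    print(np.trace(cm) / cm.sum())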

How to get the best threshold for classification using H2O in Python

I have a classification model built with H2O in Python for which the AUC = 71%,
but the accuracy based on the confusion matrix is only 61%. I understand that the confusion matrix is based on a 0.5 threshold.
How do I determine at which threshold the accuracy will be 71%?
The AUC of the ROC curve is not accuracy, and its value is threshold-independent. It is a measure of how well separated the two classes are. The 71% value tells you the probability that a randomly sampled positive example receives a higher predicted probability than a randomly sampled negative example. See this explanation.
Selecting the threshold should depend on your cost matrix (how large the penalty is for false positives versus false negatives). You want to select the threshold that maximizes your desired metric (max F1, precision, accuracy). H2O gives you multiple options: if you ask for the model performance (Python example: your_model.model_performance()), you will get the thresholds for maximum accuracy and the other optimized metrics listed.
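A minimal sketch of that lookup, assuming a hypothetical CSV file and target column; on the binomial metrics object, find_threshold_by_max_metric returns the threshold that maximizes a given metric:

    import h2o
    from h2o.estimators import H2OGradientBoostingEstimator

    h2o.init()
    df = h2o.import_file("data.csv")            # hypothetical dataset
    df["label"] = df["label"].asfactor()        # target must be a factor for classification
    train, test = df.split_frame(ratios=[0.8], seed=42)

    model = H2OGradientBoostingEstimator()
    model.train(y="label", training_frame=train)

    perf = model.model_performance(test)
    # threshold that maximizes accuracy (analogous calls work for "f1", "precision", ...)
    print(perf.find_threshold_by_max_metric("accuracy"))
    # with no threshold argument, accuracy() reports the maximum accuracy and its threshold
    print(perf.accuracy())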

Dimensionality reduction, normalization, resampling, k-fold CV... In what order?

In Python I am working on a binary classification problem: fraud detection on travel insurance. Here are the characteristics of my dataset:
It contains 40,000 samples with 20 features. After one-hot encoding, the number of features is 50 (4 numeric, 46 categorical).
Mostly unlabeled: out of the 40,000 samples, 33,000 are unlabeled.
Highly imbalanced: out of the 7,000 labeled samples, only 800 (11%) are positive (fraud).
The metrics are precision, recall and F2 score. We focus more on avoiding false negatives, therefore high recall is appreciated. As preprocessing I oversampled the positive cases using SMOTE-NC, which takes the categorical variables into account as well.
After trying several approaches, including semi-supervised learning with self-training and label propagation / label spreading, I achieved a high recall score (80% on training, 65-70% on test). However, my precision score shows signs of overfitting (60-70% on training, 10% on test). I understand that precision is good on the training data because it has been resampled, and low on the test data because the test data directly reflects the class imbalance. But this precision score is unacceptably low, so I want to fix it.
So, to simplify the model, I am thinking about applying dimensionality reduction. I found a package called prince which comes with FAMD (Factor Analysis of Mixed Data).
Question 1: In what order should I do normalization, FAMD, k-fold cross-validation and resampling? Is my approach below correct?
Question 2: The prince package does not have methods such as fit or transform like in scikit-learn, so I cannot do step 3 described below. Are there any other good packages that provide fit and transform for FAMD? And is there any other good way to reduce dimensionality on this kind of dataset?
My approach:
1. Make k folds and set one of them aside for validation; use the rest for training.
2. Fit the normalization on the training data and use it to transform the validation data.
3. Fit FAMD on the training data, and transform both the training and validation data.
4. Resample only the training data using SMOTE-NC.
5. Train whatever model it is and evaluate it on the validation data.
6. Repeat steps 2-5 k times and take the average of precision, recall and F2 score.
I would also appreciate any kind of advice on my overall approach to this problem.
Thanks!
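A rough per-fold sketch of the ordering in steps 1-6, on synthetic stand-in data; TruncatedSVD stands in for FAMD and plain SMOTE for SMOTE-NC here (after the reduction every feature is numeric), so this only illustrates the fit-on-training / transform-validation ordering, not the exact tools:

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import TruncatedSVD            # stand-in for FAMD
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_score, recall_score, fbeta_score
    from imblearn.over_sampling import SMOTE                  # stand-in for SMOTE-NC

    # synthetic stand-in data: 7,000 labeled rows, 50 features, ~11% positives
    rng = np.random.default_rng(0)
    X = rng.random((7000, 50))
    y = (rng.random(7000) < 0.11).astype(int)

    precisions, recalls, f2s = [], [], []
    for tr, va in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
        X_tr, X_va, y_tr, y_va = X[tr], X[va], y[tr], y[va]

        scaler = StandardScaler().fit(X_tr)                   # step 2: fit on the training fold only
        X_tr, X_va = scaler.transform(X_tr), scaler.transform(X_va)

        reducer = TruncatedSVD(n_components=10, random_state=0).fit(X_tr)   # step 3
        X_tr, X_va = reducer.transform(X_tr), reducer.transform(X_va)

        X_tr, y_tr = SMOTE(random_state=0).fit_resample(X_tr, y_tr)         # step 4: training fold only

        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)             # step 5
        pred = clf.predict(X_va)
        precisions.append(precision_score(y_va, pred, zero_division=0))
        recalls.append(recall_score(y_va, pred))
        f2s.append(fbeta_score(y_va, pred, beta=2))

    print(np.mean(precisions), np.mean(recalls), np.mean(f2s))              # step 6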

What is the threshold in AUC (Area under curve)

Assume I have a binary classifier (say a random forest) rfc and I want to calculate the AUC. I struggle to understand how the thresholds are used in the calculation. I understand that you plot TPR against FPR for different thresholds, and that a threshold is a cutoff for predicting class 1 (else class 0), but how does the AUC algorithm predict classes?
Say I use sklearn.metrics.roc_auc_score and pass y_true and y_rfc (the true values and the predicted values); I do not see how the thresholds come into play in the AUC score/plot.
I have read various guides/tutorials on AUC, but their explanations of the threshold and how it is used are all rather vague.
I have also had a look at How does sklearn actually calculate AUROC?
The ROC curve is generated from the TPR/FPR at different thresholds. The idea is to sweep the threshold over (0, 1) and get one point on the curve per threshold. Notice that if your classifier is perfect there is a threshold that gives the point (0, 1), i.e. TPR = 1 with FPR = 0, and the curve passing through that corner leads to AUC = 1.
AUC therefore gives you information not only about classification quality but also about how well the classifier's confidence scores rank positives above negatives.
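To make the threshold sweep concrete, a small sketch on synthetic data with a random forest: roc_curve exposes the thresholds it tries (one per distinct score), and roc_auc_score expects the predicted scores/probabilities rather than hard 0/1 class predictions:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_curve, roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, weights=[0.7], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    rfc = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    y_score = rfc.predict_proba(X_te)[:, 1]      # scores, not hard 0/1 predictions

    # roc_curve sweeps the threshold internally: each distinct score is tried as a
    # cutoff, and every cutoff yields one (FPR, TPR) point on the curve.
    fpr, tpr, thresholds = roc_curve(y_te, y_score)
    print(list(zip(thresholds[:5], fpr[:5], tpr[:5])))

    print(roc_auc_score(y_te, y_score))          # area under that curve, threshold-free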

Relation between support vectors and accuracy with an RBF kernel

I am using the RBF kernel with MATLAB's SVM function.
On a couple of datasets, as I keep increasing the sigma value, the number of support vectors increases and the accuracy increases.
But in the case of one dataset, as I increase the sigma value, the number of support vectors decreases and the accuracy increases.
I am not able to work out the relationship between support vectors and accuracy in the case of the RBF kernel.
The number of support vectors doesn't have a direct relationship to accuracy; it depends on the shape of the data (and your C/nu parameter).
Higher sigma means that the kernel is a "flatter" Gaussian and so the decision boundary is "smoother"; lower sigma makes it a "sharper" peak, and so the decision boundary is more flexible and able to reproduce strange shapes if they're the right answer. If sigma is very high, your data points will have a very wide influence; if very low, they will have a very small influence.
Thus, often, increasing the sigma values will result in more support vectors: for more-or-less the same decision boundary, more points will fall within the margin, because points become "fuzzier." Increased sigma also means, though, that the slack variables "moving" points past the margin are more expensive, and so the classifier might end up with a much smaller margin and fewer SVs. Of course, it also might just give you a dramatically different decision boundary with a completely different number of SVs.
In terms of maximizing accuracy, you should be doing a grid search on many different values of C and sigma and choosing the one that gives you the best performance on e.g. 3-fold cross-validation on your training set. One reasonable approach is to choose from e.g. 2.^(-9:3:18) for C and median_eval * 2.^(-4:2:10); those numbers are fairly arbitrary, but they're ones I've used with success in the past.
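A rough scikit-learn equivalent of that grid search (rather than MATLAB), assuming sklearn's parameterization gamma = 1 / (2 * sigma^2) and interpreting median_eval as the median pairwise distance between training points, which is one common heuristic:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.metrics import pairwise_distances
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, random_state=0)   # synthetic stand-in data

    # translate the sigma grid into sklearn's gamma = 1 / (2 * sigma^2)
    median_dist = np.median(pairwise_distances(X))
    sigmas = median_dist * 2.0 ** np.arange(-4, 11, 2)          # median_eval * 2.^(-4:2:10)
    param_grid = {"C": 2.0 ** np.arange(-9, 19, 3),             # 2.^(-9:3:18)
                  "gamma": 1.0 / (2.0 * sigmas ** 2)}

    search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3)  # 3-fold CV on the training set
    search.fit(X, y)
    print(search.best_params_, search.best_score_)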