Improving a classification model's poor accuracy

I am new to machine learning and have been working on a classification problem. After preprocessing, my model consistently shows poor accuracy, even after hyperparameter optimization. Can anyone suggest where I am going wrong? Thank you.
In preprocessing, I filled the null values with the mean, checked whether the data was normally distributed, and applied feature scaling after the train/test split.

Related

Why is it important to transform the data into a normal (Gaussian) distribution when creating a linear regression model?

I'm currently building my first regression model. As I understand it, owing to the limitations of the algorithm, we need to remove outliers and transform the distribution into a normal one.
I know that this is important and I know the ways to do it, but can someone please help me understand why exactly we need to do so? Why can't I work with a highly skewed distribution? Why does linear regression mandate this transformation in the preprocessing stage?
[Figure: classifier with and without outliers in the data]
I hope the picture above clears up your doubt.
A LinearRegression model is fitted by minimizing the squared error, so outliers (abnormal data points or noise) can pull the fitted line away from the bulk of the data, and the model may then perform poorly on the test data (more general data).
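To make the effect concrete, here is a minimal sketch in plain Python. The data points and the `fit_line` helper are made up for illustration; they are not taken from the picture. It fits a least-squares line to five points that lie exactly on y = 2x + 1, then adds a single outlier and fits again.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
clean_slope, clean_intercept = fit_line(xs, ys)

# The same data plus one abnormal point far above the line.
xs_out = xs + [5.0]
ys_out = ys + [40.0]
out_slope, out_intercept = fit_line(xs_out, ys_out)

print("without outlier:", clean_slope, clean_intercept)
print("with outlier:   ", out_slope, out_intercept)
```

Because the error is squared, the single far-away point contributes a very large penalty, so the fit bends toward it and away from the other five points.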

SVM Matlab classification

I'm working on a 4-class classification problem. It's not particularly unbalanced, has no missing features, and has a lot of observations. Everything seems fine, but when I run the classification with fitcecoc, it assigns everything to the first class. I tried fitclinear and fitcsvm on one-vs-all decomposed data, but got the same results. Do you have any idea what could cause this problem?
Here are a few recommendations:
Have you normalized your data? SVMs are sensitive to features being on different scales.
Save the mean and std you obtain during training and use those values during the prediction phase to normalize the test samples.
Change the C value and see if that changes the results.
I hope these help.
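The second recommendation is language-agnostic. A minimal sketch in plain Python follows; the `fit_scaler` and `apply_scaler` helpers and the numbers are hypothetical, and the same idea applies to MATLAB arrays.

```python
import statistics


def fit_scaler(column):
    """Return the mean and standard deviation of one training feature."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        std = 1.0  # assumption: leave constant features unscaled
    return mean, std


def apply_scaler(column, mean, std):
    """Standardize values with statistics computed elsewhere."""
    return [(value - mean) / std for value in column]


# Hypothetical values of a single feature.
train_feature = [10.0, 20.0, 30.0, 40.0]
test_feature = [25.0, 50.0]

# Statistics come from the training data only ...
mean, std = fit_scaler(train_feature)
train_scaled = apply_scaler(train_feature, mean, std)
# ... and are reused unchanged at prediction time.
test_scaled = apply_scaler(test_feature, mean, std)
```

Recomputing the mean and std on the test samples would put them on a different scale from the one the SVM was trained on.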

How to train a model in keras with multiple input-output datasets with different batch sizes

I have a supervised learning problem that I am solving with the Keras functional API.
As this model is predicting the state of a physical system, I know the supervised model should follow additional constraints.
I would like to add these constraints as an additional loss term that penalizes the model for making predictions that do not follow them. Unfortunately, the number of training examples for the supervised learning problem is much larger than the number of constraint examples.
Basically, I am trying to do this:
Minimizing both the supervised learning error, and the constraint error as an auxiliary loss.
I do not believe that alternating training batches on each dataset will be successful, because the gradient will only capture the error of one problem at a time, when I really want the physical constraint to act as regularization on the supervised learning task. (If I am incorrect in my interpretation, please let me know).
I know this could be implemented in pure TensorFlow or Theano, but I am hesitant to leave the Keras ecosystem that makes everything else so convenient. If anybody knows how to train a model with batch sizes that vary across inputs, I'd really appreciate the help.
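One possible way to think about the batching, independent of any framework, is sketched below in plain Python. It is not a Keras API: `batches`, `paired_batches`, `combined_loss`, and the `constraint_weight` value are all hypothetical names chosen for illustration. The smaller constraint set is cycled so that every training step sees one batch of each kind, and the two losses are summed so that a single gradient step reflects both.

```python
import itertools


def batches(items, batch_size):
    """Split a list into consecutive batches."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def paired_batches(supervised, constraints, sup_batch_size, con_batch_size):
    """Yield one (supervised batch, constraint batch) pair per training step.

    The smaller constraint set is cycled, so it is reused several times
    within one pass over the supervised data.
    """
    constraint_cycle = itertools.cycle(batches(constraints, con_batch_size))
    for sup_batch in batches(supervised, sup_batch_size):
        yield sup_batch, next(constraint_cycle)


def combined_loss(supervised_loss, constraint_loss, constraint_weight=0.1):
    """Weighted sum of both terms; the weight is an assumed hyperparameter."""
    return supervised_loss + constraint_weight * constraint_loss


supervised = list(range(10))       # stand-in for 10 labeled examples
constraints = ["c0", "c1", "c2"]   # stand-in for 3 constraint examples
steps = list(paired_batches(supervised, constraints, 4, 2))
for sup_batch, con_batch in steps:
    print(sup_batch, con_batch)
```

Whether this pairing can be fed to a particular Keras model depends on how the model's inputs and losses are set up, which the question does not specify.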

Increased Error with more Training Data for a Neural Network in Matlab

I have a question regarding the Matlab NN toolbox. As part of a research project, I decided to create a Matlab script that uses the NN toolbox for some fitting solutions.
I have a data stream that is being loaded into my system. The input data consists of 5 input channels and 1 output channel. I train my network on this configuration for a while and try to fit the output (for a certain period of time) as new data streams in. I retrain my network constantly to keep it updated.
So far everything works fine, but after a certain period of time the results get bad and no longer represent the desired output. I really can't explain why this happens, but I imagine there must be some kind of memory issue, since everything is OK while the data set is still small.
Only when it gets bigger does the quality of the simulation drop. Is there some kind of memory that gets full, or is the bad simulation just a result of the huge data sets? I'm a beginner with this tool and would really appreciate your feedback. Best regards and thanks in advance!
Please elaborate on your method of retraining with new data. Do you run further iterations? What do you consider as "time"? Do you mean epochs?
At first glance, assuming time means epochs, I would say that you're overfitting the data. Neural networks are supposed to be trained for a limited number of epochs with early stopping. You could try regularization, different gradient descent methods (if you're using a GD method), or GD momentum. Also, depending on the values of your first few training datasets, you may have trained your network using an incorrect normalization range. You should check these issues if my assumptions are correct.
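As a rough illustration of early stopping, here is a minimal sketch in plain Python rather than the Matlab toolbox. The `early_stopping_epoch` helper, the `patience` parameter, and the loss values are made up for the example. Training stops once the loss on a held-out validation set has not improved for `patience` epochs, and the weights from the best epoch are the ones to keep.

```python
def early_stopping_epoch(validation_losses, patience=3):
    """Return the index of the epoch whose weights should be kept.

    Training is treated as stopped once the validation loss has failed to
    improve for `patience` consecutive epochs.
    """
    best_epoch = 0
    best_loss = float("inf")
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch


# Hypothetical validation losses: they improve, then start to rise.
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.40]
print(early_stopping_epoch(losses))
```

With these numbers the loop stops before it ever reaches the last value, which is the price of a small `patience`: a later improvement can be missed.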

Genetic algorithm for classification

I am trying to solve a classification problem using the Matlab GPTIPS framework.
So far I have managed to build a reasonable data representation and fitness function, and I get an average accuracy per class of nearly 65%.
What I need now is some help with two difficulties:
My data is biased. I am solving a binary classification problem, and only 20% of the data belongs to class 1, while the other 80% belongs to class 0. At first I used prediction accuracy as my fitness function, but it was really bad. The best I have now is
Fitness = 0.5*(PositivePredictiveValue + NegativePredictiveValue) - const*ComplexityOfSolution
Please advise how I can improve my function to correct for the data bias.
The second problem is overfitting. I divided my data into three parts: training (70%), testing (20%), validation (10%). I train each chromosome on the training set, then evaluate its fitness function on the testing set. This routine allows me to reach a fitness of 0.82 on my test data for the best individual in the population. But the same individual's result on the validation data is only 60%.
I added a validation check for the best individual each time before a new population is generated. Then I compare the fitness on the validation set with the fitness on the test set. If the difference is more than 5%, I increase the penalty for solution complexity in my fitness function. But it didn't help.
I could also try to evaluate all individuals on the validation set during each generation and simply remove the overfitted ones. But then I don't see any difference between my test and validation data. What else can be done here?
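For reference, the fitness function from the first difficulty can be written out as the following Python sketch. The function name, the default `complexity_weight` (standing in for `const`), and the convention of scoring an undefined predictive value as 0 are assumptions made for this example.

```python
def fitness(actual, predicted, complexity, complexity_weight=0.01):
    """0.5 * (PPV + NPV) minus a complexity penalty, for labels 0 and 1."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1
        elif p == 1 and a == 0:
            fp += 1
        elif p == 0 and a == 0:
            tn += 1
        else:
            fn += 1
    # Assumption: a predictive value with an empty denominator counts as 0.
    ppv = tp / (tp + fp) if tp + fp else 0.0
    npv = tn / (tn + fn) if tn + fn else 0.0
    return 0.5 * (ppv + npv) - complexity_weight * complexity


# Hypothetical sample with the 20% / 80% class split from the question.
actual = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_zero = [0] * len(actual)

print(fitness(actual, always_zero, complexity=0))
print(fitness(actual, actual, complexity=0))
```

On this sample, a solution that always predicts class 0 would reach 80% plain accuracy, but under this function it scores only half of its negative predictive value, because it never predicts class 1.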
UPDATE:
For my second question I've found a great article, "Experiments on Controlling Overfitting in Genetic Programming". Along with the authors' own ideas on dealing with overfitting in GP, it has an impressive review with references to many different approaches to the issue. Now I have a lot of new ideas I can try for my problem.
Unfortunately, I still can't find anything on selecting a proper fitness function that takes into account the unbalanced class proportions in my data.
65% accuracy is very bad when the baseline (classify everything as the class with most samples) would be 80%. You need to achieve at least baseline classification in order to have a better model than the naive one.
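The baseline is quick to compute. A minimal Python sketch, where the helper name and the label list are made up to match the 80% / 20% split described in the question:

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical labels: 80 samples of class 0 and 20 samples of class 1.
labels = [0] * 80 + [1] * 20
baseline = majority_baseline_accuracy(labels)
print(baseline)
```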
I would not penalize complexity. Rather, limit the tree size (if possible). You could identify simpler models during the run, for example by storing a Pareto front of models with quality and complexity as their two fitness values.
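A Pareto front here means the set of models that no other model beats on both criteria at once. The sketch below is a plain Python illustration with made-up model names and scores; it is not GPTIPS or HeuristicLab code.

```python
def pareto_front(models):
    """Keep the models that are not dominated by any other model.

    Each model is a (name, quality, complexity) tuple. Higher quality and
    lower complexity are better. A model is dominated if another one is at
    least as good on both criteria and strictly better on at least one.
    """
    front = []
    for name, quality, complexity in models:
        dominated = any(
            q >= quality and c <= complexity and (q > quality or c < complexity)
            for _, q, c in models
        )
        if not dominated:
            front.append((name, quality, complexity))
    return front


# Hypothetical models: (name, quality, tree size).
models = [("a", 0.80, 30), ("b", 0.78, 10), ("c", 0.75, 12), ("d", 0.80, 45)]
for model in pareto_front(models):
    print(model)
```

Keeping the whole front lets you pick a slightly less accurate but much smaller model after the run, instead of fixing the trade-off in advance with a penalty constant.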
In HeuristicLab we have integrated GP-based classification that can do these things. There are several options: you can choose to use MSE or R2 for classification. In the latest trunk build there is also an evaluator that optimizes accuracy directly (strictly speaking, it optimizes the classification penalties). Optimizing MSE means it assigns each class a value (1, 2, 3, ...) and tries to minimize the mean squared error from that value. This may not seem optimal at first, but it works. Optimizing accuracy directly may lead to faster overfitting. There is also a formula simplifier which allows you to prune and shrink your formula (and view the effects of that).
Also, does it need to be GP? Have you tried Random Forest classification or Support Vector Machines as well? Random forests are pretty fast and usually work pretty well.