How do you evaluate an ML model already deployed in production? - deployment

To make this more concrete, let's consider the problem of loan default prediction.
Say I have trained and tested multiple classifiers offline, ensembled them, and then deployed this model to production.
But people change, and the data and many other factors change with them, so the performance of the model will eventually degrade and it will need to be replaced with a new, better one.
What are the common techniques, model stability tests, model performance tests, and metrics used after deployment? How do you decide when to replace the current model with a newer one?

It depends on the type of problem (classification, regression or clustering). Say you have a classification problem and you trained and tested a model with 75% accuracy (or some other metric); once it is in production, if the accuracy is significantly less than 75%, you can stop the model and investigate what is happening.
In my case, I record the accuracy of the model in production each day for a week, then compute the mean and variance of those accuracies and apply a t-test of the mean to see whether the production accuracy diverges significantly from the desired accuracy.
Hope that helps.
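As a rough illustration of that monitoring idea, here is a minimal sketch assuming scipy is available; the baseline value and the daily accuracies are made-up placeholders:

    from scipy import stats

    baseline_accuracy = 0.75          # accuracy measured during offline testing (placeholder)
    daily_accuracy = [0.71, 0.73, 0.70, 0.74, 0.69, 0.72, 0.70]  # one value per production day

    # one-sample t-test: is the mean production accuracy significantly
    # different from the offline baseline?
    t_stat, p_value = stats.ttest_1samp(daily_accuracy, popmean=baseline_accuracy)

    if p_value < 0.05 and t_stat < 0:
        print("Production accuracy is significantly below the baseline -> investigate/retrain.")
    else:
        print("No significant degradation detected so far.")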

Related

How much data is actually required to train a doc2Vec model?

I have been using gensim's libraries to train a Doc2Vec model. After experimenting with different datasets for training, I am fairly confused about what the ideal training data size for a Doc2Vec model should be.
I will share my understanding here. Please feel free to correct me or suggest changes:
Training on a general-purpose dataset: if I want to use a model trained on a general-purpose dataset in a specific use case, I need to train on a lot of data.
Training on a context-related dataset: if I train on data having the same context as my use case, the training dataset can usually be smaller.
But how many words of training data are needed in each of these cases?
Generally, we stop training an ML model when the error curve reaches an "elbow point", where further training won't significantly decrease the error. Has any study been done in this direction, where a Doc2Vec model's training is stopped after reaching such an elbow?
There are no absolute guidelines - it depends a lot on your dataset and specific application goals. There's some discussion of the sizes of datasets used in published Doc2Vec work at:
what is the minimum dataset size needed for good performance with doc2vec?
If your general-purpose corpus doesn't match your domain's vocabulary – including the same words, or using words in the same senses – that's a problem that can't be fixed with just "a lot of data". More data could just 'pull' word contexts and representations more towards generic, rather than domain-specific, values.
You really need to have your own quantitative, automated evaluation/scoring method, so you can measure whether results with your specific data and goals are sufficient, or improving with more data or other training tweaks.
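For example, one simple automated sanity check is to re-infer vectors for the training documents and see how often each document comes back as its own nearest neighbour. A minimal sketch, assuming gensim 4.x; corpus is a hypothetical list of raw text strings from your domain:

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    # 'corpus' is a hypothetical list of raw text strings from your domain
    docs = [TaggedDocument(words=text.lower().split(), tags=[i])
            for i, text in enumerate(corpus)]

    model = Doc2Vec(vector_size=50, min_count=2, epochs=40)
    model.build_vocab(docs)
    model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)

    # self-recognition check: re-infer each training document and see whether
    # its own tag comes back as the nearest neighbour
    hits = 0
    for doc in docs:
        inferred = model.infer_vector(doc.words)
        nearest_tag, _ = model.dv.most_similar([inferred], topn=1)[0]
        hits += int(nearest_tag == doc.tags[0])
    print(f"self-similarity hit rate: {hits / len(docs):.1%}")

A hit rate that stays low, or stops improving as you add data or adjust parameters, is a sign the corpus or settings need another look.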
Sometimes parameter tweaks can help get the most out of thin data; in particular, more training iterations or a smaller model (fewer vector dimensions) can slightly offset some issues with small corpora. But Word2Vec/Doc2Vec really benefit from lots of subtly-varied, domain-specific data: it's the constant, incremental tug-of-war between all the text examples during training that helps the final representations settle into a useful constellation of arrangements, with the desired relative-distance/relative-direction properties.

Shouldn't we take the average of n models in cross-validation in linear regression?

I have a question regarding cross-validation in a linear regression model.
From my understanding, in cross-validation we split the data into (say) 10 folds, train on 9 folds, and use the remaining fold for testing. We repeat this process until all folds have been used, so that every fold is tested exactly once.
When we train the model on 9 folds, should we not get a different model (maybe slightly different from the model created using the whole dataset)? I know that we take an average of all the "n" performances.
But what about the model? Shouldn't the resulting model also be the average of all the "n" models? I see that the resulting model is the same as the model created using the whole dataset before cross-validation. If we keep the overall model even after cross-validation (and do not average the models), then what's the point of calculating the average performance of n different models (since they are trained on different folds of data and are supposed to be different, right)?
I apologize if my question is not clear or sounds silly.
Thanks for reading, though!
I think there is some confusion in some of the proposed answers because of the use of the word "model" in the question. If I am guessing correctly, you are referring to the fact that in K-fold cross-validation we learn K different predictors (or decision functions), which you call "models" (this is unfortunate terminology, because in machine learning "model selection" usually means choosing between families of predictors, which is itself something that can be done with cross-validation). Cross-validation is typically used for hyperparameter selection, or to choose between different algorithms or different families of predictors. Once these are chosen, the most common approach is to relearn a predictor from all the data with the selected hyperparameters and algorithm.
However, if the loss function being optimized is convex with respect to the predictor, then it is possible to simply average the different predictors obtained from each fold.
This is because for a convex risk, the risk of the average of the predictors is never larger than the average of the individual risks.
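Concretely, writing $R$ for the (convex) risk and $\hat{f}_1, \dots, \hat{f}_K$ for the K per-fold predictors, Jensen's inequality gives
$$R\!\left(\frac{1}{K}\sum_{k=1}^{K}\hat{f}_k\right) \;\le\; \frac{1}{K}\sum_{k=1}^{K}R(\hat{f}_k).$$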
The PROs and CONs of averaging (vs. retraining) are as follows:
PROs: (1) In each fold, the evaluation that you made on the held-out set gives you an unbiased estimate of the risk of the very predictor that you obtained, and for these estimates the only source of uncertainty is the estimation of the empirical risk (the average of the loss function) on the held-out data.
This should be contrasted with the logic used when retraining, which is that the cross-validation risk is an estimate of the "expected risk of a given learning algorithm" (and not of a given predictor), so that if you relearn from data drawn from the same distribution, you should on average obtain the same level of performance. But note that this holds only on average: when retraining on the whole dataset the performance could go up or down. In other words, there is an additional source of uncertainty due to the fact that you will retrain.
(2) The hyperparameters have been selected for exactly the number of data points that you used for learning in each fold. If you relearn from the whole dataset, the optimal values of the hyperparameters are, in theory and in practice, no longer the same, so when retraining you are really crossing your fingers and hoping that the hyperparameters you chose are still fine for the larger dataset.
If you used leave-one-out there is obviously no concern, and if the number of data points is large, 10-fold CV should be fine as well. But if you are learning from 25 data points with 5-fold CV, the hyperparameters tuned for 20 points are not really the same as for 25 points...
CONs: Well, intuitively you do not benefit from training on all the data at once.
There is unfortunately very little thorough theory on this, but the following two papers (especially the second) consider precisely the averaging or aggregation of the predictors from K-fold CV.
Jung, Y. (2016). Efficient Tuning Parameter Selection by Cross-Validated Score in High Dimensional Models. International Journal of Mathematical and Computational Sciences, 10(1), 19-25.
Maillard, G., Arlot, S., & Lerasle, M. (2019). Aggregated Hold-Out. arXiv preprint arXiv:1909.04890.
The answer is simple: you use the process of (repeated) cross-validation (CV) to obtain a relatively stable performance estimate for a model, rather than to improve it.
Think of trying out different model types and parametrizations, which are suited to your problem to different degrees. Using CV, you obtain many estimates of how each model type and parametrization would perform on unseen data. From those results you usually choose one well-suited model type + parametrization, which you then train again on all of the (training) data. The reason for doing this many times (different partitions with repeats, each using different splits) is to get a stable estimate of the performance, which lets you, for example, look at the mean/median performance and its spread (this tells you how well the model usually performs and how likely you are to get lucky/unlucky and see better/worse results).
Two more things:
Usually, using CV will improve your results in the end, simply because you end up with a model that is better suited for the job.
You mentioned taking the "average" model. This actually exists as "model averaging", where you average the predictions of multiple, possibly differently trained models to obtain a single result. It's one way to use an ensemble of models instead of a single one. But for those, too, you want to use CV in the end to choose a reasonable model.
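To make the select-then-refit workflow concrete, here is a minimal sketch assuming scikit-learn, with hypothetical arrays X and y and two arbitrary candidate models:

    from sklearn.model_selection import RepeatedKFold, cross_val_score
    from sklearn.linear_model import LinearRegression, Ridge

    # X, y are hypothetical training features/targets
    candidates = {"ols": LinearRegression(), "ridge": Ridge(alpha=1.0)}
    cv = RepeatedKFold(n_splits=10, n_repeats=5, random_state=0)

    scores = {name: cross_val_score(est, X, y, cv=cv, scoring="r2")
              for name, est in candidates.items()}
    for name, s in scores.items():
        print(f"{name}: mean R2 = {s.mean():.3f} (+/- {s.std():.3f})")

    # pick the candidate with the best mean CV score,
    # then refit that single model on all of the training data
    best_name = max(scores, key=lambda name: scores[name].mean())
    final_model = candidates[best_name].fit(X, y)

Note that the per-fold predictors fitted inside cross_val_score are discarded; only their scores are used to decide which single model to refit on everything.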
I like your thinking. I think you have just accidentally discovered Random Forest:
https://en.wikipedia.org/wiki/Random_forest
Without repeated CV, your seemingly best model is likely to be only a mediocre model when you score it on new data...

Genetic algorithm for classification

I am trying to solve a classification problem using the Matlab GPTIPS framework.
I have managed to build a reasonable data representation and fitness function so far, and I get an average accuracy per class of about 65%.
What I need now is help with two difficulties:
My data is imbalanced. I am solving a binary classification problem and only 20% of the data belongs to class 1, while the other 80% belongs to class 0. I used prediction accuracy as my fitness function at first, but it was really bad. The best I have now is
Fitness = 0.5*(PositivePredictiveValue + NegativePredictiveValue) - const*ComplexityOfSolution
Please advise how I can improve my fitness function to correct for the class imbalance.
The second problem is overfitting. I divided my data into three parts: training (70%), testing (20%) and validation (10%). I train each chromosome on the training set, then evaluate its fitness on the test set. This routine lets me reach a fitness of 0.82 on the test data for the best individual in the population, but the same individual's result on the validation data is only 60%.
I added a validation check for the best individual each time before a new population is generated. I then compare the fitness on the validation set with the fitness on the test set, and if the difference is more than 5% I increase the penalty for solution complexity in my fitness function. But it didn't help.
I could also evaluate all individuals on the validation set during each generation and simply remove the overfitted ones, but then I don't see any difference between my test and validation data. What else can be done here?
UPDATE:
For my second question I've found the great article "Experiments on Controlling Overfitting in Genetic Programming". Along with the authors' own ideas on dealing with overfitting in GP, it has an impressive review with many references to different approaches to the issue. Now I have a lot of new ideas to try for my problem.
Unfortunately, I still can't find anything on selecting a proper fitness function that takes the unbalanced class proportions in my data into account.
65% accuracy is very bad when the baseline (classify everything as the class with most samples) would be 80%. You need to achieve at least baseline classification in order to have a better model than the naive one.
I would not penalize complexity. Rather, limit the tree size (if possible). You could identify simpler models during the run, for example by storing a Pareto front of models with quality and complexity as its two fitness values.
In HeuristicLab we have integrated GP-based classification that can do these things. There are several options: you can choose to use MSE or R2 for classification. In the latest trunk build there is also an evaluator that optimizes accuracy directly (strictly speaking, it optimizes the classification penalties). Optimizing MSE means each class is assigned a value (1, 2, 3, ...) and the mean squared error from that value is minimized. This may not seem optimal at first, but it works. Optimizing accuracy directly may lead to faster overfitting. There is also a formula simplifier which allows you to prune and shrink your formula (and view the effects of doing so).
Also, does it need to be GP? Have you tried random forest classification or support vector machines as well? RFs are pretty fast and usually work quite well.
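On the class-imbalance part of the question, one commonly used alternative to plain accuracy is a fitness based on balanced accuracy (mean of sensitivity and specificity) computed from the confusion matrix. A minimal sketch, in Python rather than Matlab/GPTIPS, with hypothetical label arrays y_true and y_pred:

    import numpy as np

    def balanced_fitness(y_true, y_pred, complexity=0, const=0.0):
        """Balanced accuracy minus an optional complexity penalty, for binary labels 0/1."""
        y_true = np.asarray(y_true)
        y_pred = np.asarray(y_pred)
        tp = np.sum((y_true == 1) & (y_pred == 1))
        tn = np.sum((y_true == 0) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
        specificity = tn / (tn + fp) if (tn + fp) else 0.0
        return 0.5 * (sensitivity + specificity) - const * complexity

Unlike raw accuracy, this scores the trivial "everything is class 0" classifier at 0.5 regardless of how skewed the class proportions are.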

How to improve the accuracy of a decision tree in Matlab

I have a set of data which I classify in Matlab using a decision tree. I divide the set into two parts: training data (85%) and test data (15%). The problem is that the accuracy is around 90% and I do not know how I can improve it. I would appreciate any ideas.
Decision trees might perform poorly for many reasons; one prominent reason I can think of is that when calculating a split they do not consider the inter-dependency of variables, or the dependency of the target variable on other variables.
Before trying to improve performance, one should make sure the changes do not cause over-fitting and that the model can still generalize.
A few things that can be done to improve performance:
Variable preselection: different tests can be run, such as a multicollinearity test, VIF calculation, or IV calculation, to select only the top few variables. This can improve performance by cutting out undesired variables.
Ensemble learning: use multiple trees (a random forest) to predict the outcomes. Random forests generally perform better than a single decision tree, since averaging many trees reduces variance without greatly increasing bias, and they are less prone to overfitting (see the sketch after this list).
K-fold cross-validation: cross-validation within the training data itself helps you tune the tree and get a more reliable performance estimate.
Hybrid model: use a hybrid model, i.e. use logistic regression after the decision tree, to improve performance.
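A sketch of the ensemble and cross-validation suggestions, shown in Python/scikit-learn for brevity (in Matlab the rough equivalents are fitctree, TreeBagger/fitcensemble and crossval), with hypothetical arrays X and y:

    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier

    # X, y are a hypothetical feature matrix and label vector
    tree = DecisionTreeClassifier(max_depth=5, random_state=0)
    forest = RandomForestClassifier(n_estimators=300, random_state=0)

    # 10-fold CV gives a more reliable estimate than a single 85/15 split
    for name, clf in [("single tree", tree), ("random forest", forest)]:
        scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")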
I guess the more important question here is what counts as a good accuracy in the given domain: if you're classifying spam then 90% might be a bit low, but if you're predicting stock prices then 90% is really high!
If you're working on a known domain and there are previous examples of classification accuracy higher than yours, then you can try several things:
K-Fold Cross Validation
Ensemble Learning
Generalized Iterative Scaling (GIS)
Logistic Regression
I don't think you should try to improve this; maybe the data is being overfitted by the classifier. Try other datasets, or cross-validation, to get a more accurate picture of the result.
By the way, 90%, if not overfitted, is a great result; you may not even need to improve it.
You could look into pruning the leaves to improve the generalization of the decision tree. But as was mentioned, 90% accuracy can be considered quite good.
Whether 90% is good or bad depends on the domain of the data.
However, it might be that the classes in your data overlap and you can't really do better than 90%.
You can try to look at which nodes the errors occur in, and check whether it's possible to improve the classification by changing them.
You can also try Random Forest.

What's the usual success rate for neural network models?

I am building a system with an NN trained for classification.
I am interested in the error rates of the systems you have built.
A classic example from the UCI ML repository is the Iris dataset.
An NN trained on it is almost perfect, with an error rate of 0-1%; however, it is a very basic dataset.
My network has the following structure: 80 inputs, 208 hidden units, 2 outputs.
My result is an 8% error rate on the test dataset.
Basically, in this question I want to ask about the various results you have encountered in your work, in papers, etc.
Addition 1:
The error rate is of course on the test data, not the training data, so it is a completely new dataset for the network.
Addition 2 (from my comment under the question):
My new results: 1200 entries, 900 for training, 300 for testing; 85 entries in Class 1 and 1115 in Class 2. Out of the 85 Class 1 entries, 44 are in the test set. Error rate: 6%. That is not so bad, because 44 is ~15% of 300, so I am about 2.5 times better.
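For reference, the arithmetic behind that comparison: a trivial classifier that assigns everything to Class 2 would get all 44 Class 1 test examples wrong, so
$$\text{baseline error} = \frac{44}{300} \approx 14.7\%, \qquad \frac{14.7\%}{6\%} \approx 2.4,$$
i.e. roughly the 2.5x improvement quoted above.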
Model performance is completely problem-specific. Even among situations with similar quality and volumes of development data, and identical target variable definitions, performance can vary substantially. Obviously, the more similar the problem definitions, the more likely the performance of different models is to match.
Another thing to consider is the difference between technical performance and business performance. In some applications an accuracy of 52% is tremendously profitable, whereas in other areas an accuracy of 98% would be hopelessly low.
Let me also add that, besides what Predictor mentions, measuring your performance on the training set is usually useless as a guide to how your classifier will perform on previously unseen data. Often, with relatively simple classifiers, you can get a 0% error rate on the training set without learning anything useful (this is called overfitting).
What is more commonly used (and more helpful in determining how your classifier works) is either held-out data or cross-validation; even better, separate your data into three parts: training, validation and testing.
It is also very hard to get a sense of how good a classifier is from a single threshold, reporting only true positives + true negatives. People tend to also evaluate false positives and false negatives and plot ROC curves to see the trade-off. So, when saying "2.5 times better", you should be clear that you are comparing against a classifier that assigns everything to C2, which is a pretty weak baseline.
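A minimal sketch of that kind of evaluation, assuming scikit-learn and matplotlib, with hypothetical arrays y_test (true 0/1 labels) and scores (predicted class-1 probabilities):

    import numpy as np
    from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
    import matplotlib.pyplot as plt

    # y_test: true 0/1 labels; scores: predicted probability of class 1
    preds = (np.asarray(scores) >= 0.5).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
    print(f"TP={tp}  FP={fp}  TN={tn}  FN={fn}")
    print("ROC AUC:", roc_auc_score(y_test, scores))

    # trade-off between true positive rate and false positive rate across thresholds
    fpr, tpr, _ = roc_curve(y_test, scores)
    plt.plot(fpr, tpr)
    plt.plot([0, 1], [0, 1], linestyle="--")  # chance line
    plt.xlabel("false positive rate")
    plt.ylabel("true positive rate")
    plt.show()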
See for example this paper:
Danilo P. Mandic and Jonathon A. Chambers (2000). Towards the Optimal Learning Rate for Backpropagation. Neural Processing Letters 11: 1-5.