I have been using gensim's libraries to train a doc2Vec model. After experimenting with different datasets for training, I am fairly confused about what should be an ideal training data size for doc2Vec model?
I will be sharing my understanding here. Please feel free to correct me/suggest changes-
Training on a general purpose dataset- If I want to use a model trained on a general purpose dataset, in a specific use case, I need to train on a lot of data.
Training on the context related dataset- If I want to train it on the data having the same context as my use case, usually the training data size can have a smaller size.
But what are the number of words used for training, in both these cases?
On a general note, we stop training a ML model, when the error graph reaches an "elbow point", where further training won't help significantly in decreasing error. Has any study being done in this direction- where doc2Vec model's training is stopped after reaching an elbow ?
There are no absolute guidelines - it depends a lot on your dataset and specific application goals. There's some discussion of the sizes of datasets used in published Doc2Vec work at:
what is the minimum dataset size needed for good performance with doc2vec?
If your general-purpose corpus doesn't match your domain's vocabulary – including the same words, or using words in the same senses – that's a problem that can't be fixed with just "a lot of data". More data could just 'pull' word contexts and representations more towards generic, rather than domain-specific, values.
You really need to have your own quantitative, automated evaluation/scoring method, so you can measure whether results with your specific data and goals are sufficient, or improving with more data or other training tweaks.
Sometimes parameter tweaks can help get the most out of thin data – in particular, more training iterations or a smaller model (fewer vector-dimensions) can slightly offset some issues with small corpuses, sometimes. But the Word2Vec/Doc2Vec really benefit from lots of subtly-varied, domain-specific data - it's the constant, incremental tug-of-war between all the text-examples during training that helps the final representations settle into a useful constellation-of-arrangements, with the desired relative-distance/relative-direction properties.
Related
I am trying to build cnn model (keras) that can classify image based on users emotions. I am having issues with data. I have really small data for training. Will augmenting data help? Does it improve accuracy? In which case one should choose to augment data and should avoid?
Will augmenting data help? Does it improve accuracy?
That's hard to say in advance. But almost certainly, when you already have a model which is better than random. And when you choose the right augmentation method.
See my masters thesis Analysis and Optimization of Convolutional Neural Network Architectures, page 80 for many different augmentation methods.
In which case one should choose to augment data and should avoid?
When you don't have enough data -> augment
Avoid augmentations where you can't tell the emotion after the augmentation. So in case of character recognition, rotation is a bad idea (e.g. due to 6 vs 9 or u vs n or \rightarrow vs \nearrow)
Yes, data augmentation really helps, and sometimes it's really necessary. (But take a look at Martin Thoma's answer, there are more details there and some important "take-cares").
You should use it when:
You have too little data
You notice your model is overfitting too easily (may be a model too powerful too)
Overfitting is something that happens when your model is capable of memorizing the data. Then it gets splendid accuracy for training data, but terrible accuracy for test data.
Increasing the size of training data will make it more difficult for your model to memorize. Small changes here and there will make your model stop paying attention to details that don't mean anything (but are capable of creating distinctions between images) and start paying attention to details that indeed cause the desired effect.
I'm in the overtures of designing a prose imitation system. It will read a bunch of prose, then mimic it. It's mostly for fun so the mimicking prose doesn't need to make too much sense, but I'd like to make it as good as I can, with a minimal amount of effort.
My first idea is to use my example prose to train a classifying feed-forward neural network, which classifies its input as either part of the training data or not part. Then I'd like to somehow invert the neural network, finding new random inputs that also get classified by the trained network as being part of the training data. The obvious and stupid way of doing this is to randomly generate word lists and only output the ones that get classified above a certain threshold, but I think there is a better way, using the network itself to limit the search to certain regions of the input space. For example, maybe you could start with a random vector and do gradient descent optimisation to find a local maximum around the random starting point. Is there a word for this kind of imitation process? What are some of the known methods?
How about Generative Adversarial Networks (GAN, Goodfellow 2014) and their more advanced siblings like Deep Convolutional Generative Adversarial Networks? There are plenty of proper research articles out there, and also more gentle introductions like this one on DCGAN and this on GAN. To quote the latter:
GANs are an interesting idea that were first introduced in 2014 by a
group of researchers at the University of Montreal lead by Ian
Goodfellow (now at OpenAI). The main idea behind a GAN is to have two
competing neural network models. One takes noise as input and
generates samples (and so is called the generator). The other model
(called the discriminator) receives samples from both the generator
and the training data, and has to be able to distinguish between the
two sources. These two networks play a continuous game, where the
generator is learning to produce more and more realistic samples, and
the discriminator is learning to get better and better at
distinguishing generated data from real data. These two networks are
trained simultaneously, and the hope is that the competition will
drive the generated samples to be indistinguishable from real data.
(DC)GAN should fit your task quite well.
I have a question regarding cross validation in Linear regression model.
From my understanding, in cross validation, we split the data into (say) 10 folds and train the data from 9 folds and the remaining folds we use for testing. We repeat this process until we test all of the folds, so that every folds are tested exactly once.
When we are training the model from 9 folds, should we not get a different model (may be slightly different from the model that we have created when using the whole dataset)? I know that we take an average of all the "n" performances.
But, what about the model? Shouldn't the resulting model also be taken as the average of all the "n" models? I see that the resulting model is same as the model which we created using whole of the dataset before cross-validation. If we are considering the overall model even after cross-validation (and not taking avg of all the models), then what's the point of calculating average performance from n different models (because they are trained from different folds of data and are supposed to be different, right?)
I apologize if my question is not clear or too funny.
Thanks for reading, though!
I think that there is some confusion in some of the answers proposed because of the use of the word "model" in the question asked. If I am guessing correctly, you are referring to the fact that in K-fold cross-validation we learn K-different predictors (or decision functions), which you call "model" (this is a bad idea because in machine learning we also do model selection which is choosing between families of predictors and this is something which can be done using cross-validation). Cross-validation is typically used for hyperparameter selection or to choose between different algorithms or different families of predictors. Once these chosen, the most common approach is to relearn a predictor with the selected hyperparameter and algorithm from all the data.
However, if the loss function which is optimized is convex with respect to the predictor, than it is possible to simply average the different predictors obtained from each fold.
This is because for a convex risk, the risk of the average of the predictor is always smaller than the average of the individual risks.
The PROs and CONs of averaging (vs retraining) are as follows
PROs: (1) In each fold, the evaluation that you made on the held out set gives you an unbiased estimate of the risk for those very predictors that you have obtained, and for these estimates the only source of uncertainty is due to the estimate of the empirical risk (the average of the loss function) on the held out data.
This should be contrasted with the logic which is used when you are retraining and which is that the cross-validation risk is an estimate of the "expected value of the risk of a given learning algorithm" (and not of a given predictor) so that if you relearn from data from the same distribution, you should have in average the same level of performance. But note that this is in average and when retraining from the whole data this could go up or down. In other words, there is an additional source of uncertainty due to the fact that you will retrain.
(2) The hyperparameters have been selected exactly for the number of datapoints that you used in each fold to learn. If you relearn from the whole dataset, the optimal value of the hyperparameter is in theory and in practice not the same anymore, and so in the idea of retraining, you really cross your fingers and hope that the hyperparameters that you have chosen are still fine for your larger dataset.
If you used leave-one-out, there is obviously no concern there, and if the number of data point is large with 10 fold-CV you should be fine. But if you are learning from 25 data points with 5 fold CV, the hyperparameters for 20 points are not really the same as for 25 points...
CONs: Well, intuitively you don't benefit from training with all the data at once
There are unfortunately very little thorough theory on this but the following two papers especially the second paper consider precisely the averaging or aggregation of the predictors from K-fold CV.
Jung, Y. (2016). Efficient Tuning Parameter Selection by Cross-Validated Score in High Dimensional Models. International Journal of Mathematical and Computational Sciences, 10(1), 19-25.
Maillard, G., Arlot, S., & Lerasle, M. (2019). Aggregated Hold-Out. arXiv preprint arXiv:1909.04890.
The answer is simple: you use the process of (repeated) cross validation (CV) to obtain a relatively stable performance estimate for a model instead of improving it.
Think of trying out different model types and parametrizations which are differently well suited for your problem. Using CV you obtain many different estimates on how each model type and parametrization would perform on unseen data. From those results you usually choose one well suited model type + parametrization which you will use, then train it again on all (training) data. The reason for doing this many times (different partitions with repeats, each using different partition splits) is to get a stable estimation of the performance - which will enable you to e.g. look at the mean/median performance and its spread (would give you information about how well the model usually performs and how likely it is to be lucky/unlucky and get better/worse results instead).
Two more things:
Usually, using CV will improve your results in the end - simply because you take a model that is better suited for the job.
You mentioned taking the "average" model. This actually exists as "model averaging", where you average the results of multiple, possibly differently trained models to obtain a single result. Its one way to use an ensemble of models instead of a single one. But also for those you want to use CV in the end for choosing reasonable model.
I like your thinking. I think you have just accidentally discovered Random Forest:
https://en.wikipedia.org/wiki/Random_forest
Without repeated cv your seemingly best model is likely to be only a mediocre model when you score it on new data...
I have a training dataset which gives me the ranking of various cricket players(2008) on the basis of their performance in the past years(2005-2007).
I've to develop a model using this data and then apply it on another dataset to predict the ranking of players(2012) using the data already given to me(2009-2011).
Which predictive modelling will be best for this? What are the pros and cons of using the different forms of regression or neural networks?
The type of model to use depends on different factors:
Amount of data: if you have very little data, you better opt for a simple prediction model like linear regression. If you use a prediction model which is too powerful you run into the risk of over-fitting your model with the effect that it generalizes bad on new data. Now you might ask, what is little data? That depends on the number of input dimensions and on the underlying distributions of your data.
Your experience with the model. Neural networks can be quite tricky to handle if you have little experience with them. There are quite a few parameters to be optimized, like the network layer structure, the number of iterations, the learning rate, the momentum term, just to mention a few. Linear prediction is a lot easier to handle with respect to this "meta-optimization"
A pragmatic approach for you, if you still cannot opt for one of the methods, would be to evaluate a couple of different prediction methods. You take some of your data where you already have target values (the 2008 data), split it into training and test data (take some 10% as test data, e.g.), train and test using cross-validation and compute the error rate by comparing the predicted values with the target values you already have.
One great book, which is also on the web, is Pattern recognition and machine learning by C. Bishop. It has a great introductory section on prediction models.
Which predictive modelling will be best for this? 2. What are the pros
and cons of using the different forms of regression or neural
networks?
"What is best" depends on the resources you have. Full Bayesian Networks (or k-Dependency Bayesian Networks) with information theoretically learned graphs, are the ultimate 'assumptionless' models, and often perform extremely well. Sophisticated Neural Networks can perform impressively well too. The problem with such models is that they can be very computationally expensive, so models that employ methods of approximation may be more appropriate. There are mathematical similarities connecting regression, neural networks and bayesian networks.
Regression is actually a simple form of Neural Networks with some additional assumptions about the data. Neural Networks can be constructed to make less assumptions about the data, but as Thomas789 points out at the cost of being considerably more difficult to understand (sometimes monumentally difficult to debug).
As a rule of thumb - the more assumptions and approximations in a model the easier it is to A: understand and B: find the computational power necessary, but potentially at the cost of performance or "overfitting" (this is when a model suits the training data well, but doesn't extrapolate to the general case).
Free online books:
http://www.inference.phy.cam.ac.uk/mackay/itila/
http://ciml.info/dl/v0_8/ciml-v0_8-all.pdf
Beginner on ANNs:
I am implementing a back propagation neural network to predict the price of gold. I know that I have to split my data into training data, selection data and test data.
However I unsure How to go on about using these sets of data. At first I was training the data network with my training set then after it's trained I am getting a number of inputs to my network from the test set and comparing the output.
I'm not sure if I'm doing this right and were does the selection set come in ?
thanks in advance!
The general idea is:
Train the network for a little while on the training set.
Evaluate the network on a second set, often called the validation set. Probably what you're calling the selection set.
Train the network a little more on the training set.
Evaluate the new network on the selection set again.
Which did better, the old network or the new network? If the new network is better, we're still getting some use out of training, so goto 3. If the new network is worse, more training will probably only hurt. Use the previously version of the network, since it did better.
In this way, you can tell when to stop training.
One easy modification to this is to always keep track of the best network seen so far, and we only stop training when we see some number (say, three) of training attempts that do worse in a row.
The third set, the test set, is necessary because the selection set is, if indirectly, involved in the training process. Final evaluation must be done on data that was not used at all during training.
This sort of thing is sufficient for simple experiments, but in general you'll want to use cross-validation to get a better idea of your system's performance.
I wanted to leave a comment just to say that validation sets are a good place for model-dependent hyper-parameter tuning, but I'm new here and hence lack the reputation points to do so. To make this more worthy of a separate posting, I've included an outline of my own train-validate-test process. In practice, my workflow is as follows:
Identify, collect, and clean data. Try to limit complaining during data munging process.
Split data into three sets: training, validation, test.
Establish two "base" models for evaluating more complex models built later on in the process. The first of these models is typically a basic linear/logistic regression using all possible features. The second models uses only the most obviously informative (initial identification of informative features depends on use case, typically involves combination of domain knowledge, basic clustering, simple correlation).
Begin more empirical feature selection (i.e. unsupervised NN, but usually random forest) and prototype a broad range of models using the training set.
Eliminate poorly performing models as well as uninformative features
Compare performance of remaining models against each other and the "base" models, using a modified version of the training set (same data, but sans uninformative features). Toss under-performing models.
Using the validation set, tune the appropriate hyper-parameters for each of the models (either by hand or gridsearch). Further reduce the number of models in consideration, ideally to just 2-3 (excluding base models).
Finally, evaluate model performance (with optimized hyper-parameters) on the test set. Again, compare models among themselves and against the base models. Make final model choice based on a problem-specific appropriate combination of computational complexity/cost, ease of interpretation/transparency/"explainability", and improvement over and/or performance vs base models.