Why do we need training and test datasets in research? - cluster-analysis

I'm a newbie in the research area of data mining (text clustering) and I have a couple of questions regarding training and test datasets.
Does clustering need training and test datasets?
Why do we need to separate the data into training and test datasets?
Sorry for the rookie question; I hope the experts in this group can help me.

As your question is on clustering:
In cluster analysis, there usually is no split into training and test data, because you do cluster analysis when you do not have labels, so you cannot "train".
Training is a concept from machine learning, and train-test splitting is used to detect (and so avoid) overfitting.
But if you are not learning labels, you cannot overfit.
Properly used cluster analysis is a knowledge discovery method. You want to discover some new structure in your data, not rediscover something that is already labeled.

To train your algorithm you need a set of relevant data that is similar but not identical to your testing data. For example, you could split up your data so that 70% of it is used for training and the rest for testing. The training portion allows your algorithm to get a feel for what it should be looking for. The remaining 30% can be used for testing, as it is (hopefully) a distinct set of information on which the algorithm can be checked.
Why split it up?
Well, if you train your algorithm on data A and then test it on data A, it will be able to identify all the information correctly, because that is exactly what it was trained on.
For example, if when learning addition you were given the sums 3+4, 4+5 and 6+9, and you solved them correctly, it would be pointless to test your knowledge of addition using the same sums.
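A minimal sketch of such a 70/30 split, assuming scikit-learn is available. The synthetic two-blob data and the choice of LogisticRegression are only illustrative assumptions, not something from the question:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical labelled data: two Gaussian blobs in 2-D, 50 points each.
X = np.vstack([
    rng.normal(-2.0, 1.0, size=(50, 2)),
    rng.normal(2.0, 1.0, size=(50, 2)),
])
y = np.array([0] * 50 + [1] * 50)

# Hold out 30% of the data; the model never sees it during training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

model = LogisticRegression().fit(X_train, y_train)

# Accuracy on the held-out 30% estimates performance on unseen data.
test_accuracy = model.score(X_test, y_test)
print(f"train size={len(X_train)}, test size={len(X_test)}, "
      f"test accuracy={test_accuracy:.2f}")
```

Scoring on X_train instead would be like re-testing the same sums: it tells you little about how the model handles new data.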
Further information:
http://en.wikipedia.org/wiki/Natural_language_processing
http://www.nltk.org/book
Hope this helps.

Related

Is it possible to simultaneously use and train a neural network?

Is it possible to use Tensorflow or some similar library to make a model that you can efficiently train and use at the same time?
An example/use case for this would be a chat bot that you give feedback to, somewhat like how pets learn (i.e. replicating what they just did for a reward), or one to which you can add new entries or new responses it can use.
I think what you are asking is whether a model can be trained continuously, without having to retrain it from scratch each time new labelled data comes in.
The answer to that is online models.
There are models that can be trained continuously on incoming data without retraining them from scratch. As per the Wikipedia definition:
Online machine learning is a method of machine learning in which data becomes available in sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques which generate the best predictor by learning on the entire training data set at once.
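Some examples of such algorithms are the following (all but the last are scikit-learn estimators that support incremental training through partial_fit):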
BernoulliNB
GaussianNB
MiniBatchKMeans
MultinomialNB
PassiveAggressiveClassifier
PassiveAggressiveRegressor
Perceptron
SGDClassifier
SGDRegressor
DNNs
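As a rough sketch of what "train and use at the same time" can look like with one of the estimators listed above, assuming scikit-learn. The make_batch helper and the synthetic data stream are hypothetical stand-ins for real feedback arriving over time:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
classes = np.array([0, 1])  # all labels must be declared on the first partial_fit call
model = SGDClassifier(random_state=0)


def make_batch(n):
    """Hypothetical data stream: n points from each of two Gaussian blobs."""
    X = np.vstack([
        rng.normal(-2.0, 1.0, size=(n, 2)),
        rng.normal(2.0, 1.0, size=(n, 2)),
    ])
    y = np.array([0] * n + [1] * n)
    return X, y


for step in range(20):
    X_batch, y_batch = make_batch(10)
    if step > 0:
        # Use the current model on the new inputs...
        predictions = model.predict(X_batch)
    # ...then update it with the feedback (true labels) for the same batch.
    model.partial_fit(X_batch, y_batch, classes=classes)

# Check the incrementally trained model on fresh data.
X_new, y_new = make_batch(100)
accuracy = model.score(X_new, y_new)
print(f"accuracy after 20 incremental updates: {accuracy:.2f}")
```

Each partial_fit call updates the existing weights rather than restarting training, which is the property the question is asking about.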

How to cluster data using self-organising maps?

Suppose that we train a self-organising map (SOM) with a given dataset. Would it make sense to cluster the neurons of the SOM instead of the original datapoints? This doubt came to me after reading this paper, in which the following is stated:
The most important benefit of this procedure
is that computational load decreases considerably, making
it possible to cluster large data sets and to consider several
different preprocessing strategies in a limited time. Naturally,
the approach is valid only if the clusters found using the SOM
are similar to those of the original data.
In this answer it is clearly stated that SOMs don't include clustering, but some clustering procedure can be made on the SOM after it has been trained. I thought that this meant the clustering was done on the neurons of the SOM, which are in some sense a mapping of the original data, but I'm not sure about this. So, what I want to know is:
Is it correct to cluster data performing the clustering algorithm on the trained neuron weights as datapoints? If not, how is clustering done using a SOM then?
What characteristics should a dataset have, in general, for this approach to be useful?
Yes, the usual approach seems to be to run either hierarchical clustering or k-means on the neurons. You'll need to dig up how it was originally done; as seen in the paper you linked, many variants, including two-level approaches, have been explored later. If you consider SOMs to be a quantization and projection technique, all of these approaches are valid to use.
It's cheaper because the map is just a 2-dimensional Euclidean grid, and there are far fewer neurons than data points. So that is well in line with the source that you have.
Note that a SOM neuron may be empty if it lies in between two extremely well-separated clusters.
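To make the two-level idea concrete, here is a rough sketch assuming NumPy and scikit-learn. The tiny SOM below is a hand-rolled illustration with arbitrarily chosen grid size, learning rate and neighbourhood schedule, not the procedure from the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical data: two well-separated blobs in 2-D.
data = np.vstack([
    rng.normal(0.0, 0.3, size=(200, 2)),
    rng.normal(3.0, 0.3, size=(200, 2)),
])

# Level 1: train a small 5x5 SOM (its weights quantize the data).
rows, cols = 5, 5
grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
weights = data[rng.choice(len(data), rows * cols, replace=False)].copy()

n_steps = 3000
for t in range(n_steps):
    x = data[rng.integers(len(data))]
    bmu = np.argmin(((weights - x) ** 2).sum(axis=1))  # best-matching unit
    frac = t / n_steps
    lr = 0.5 * (1.0 - frac)              # decaying learning rate
    sigma = 0.5 + 1.5 * (1.0 - frac)     # shrinking neighbourhood radius
    grid_dist2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
    h = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
    weights += lr * h[:, None] * (x - weights)

# Level 2: cluster the 25 neuron weight vectors instead of the 400 points.
neuron_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(weights)

# Each data point inherits the cluster of its best-matching neuron.
dist2 = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
bmus = dist2.argmin(axis=1)
data_labels = neuron_labels[bmus]

# Neurons that are never a best-matching unit are the "empty" ones.
n_empty = rows * cols - len(np.unique(bmus))
print(f"empty neurons: {n_empty}")
```

The clustering step only ever sees the 25 neuron weights, which is where the computational saving comes from.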

How much data is actually required to train a doc2Vec model?

I have been using gensim's libraries to train a doc2Vec model. After experimenting with different datasets for training, I am fairly confused about what the ideal training data size for a doc2Vec model should be.
I will share my understanding here. Please feel free to correct me or suggest changes:
Training on a general-purpose dataset: if I want to use a model trained on a general-purpose dataset in a specific use case, I need to train on a lot of data.
Training on a context-related dataset: if I want to train it on data with the same context as my use case, the training data can usually be smaller.
But how many words should be used for training in each of these cases?
On a general note, we stop training an ML model when the error graph reaches an "elbow point", where further training won't help significantly in decreasing the error. Has any study been done in this direction, where a doc2Vec model's training is stopped after reaching an elbow?
There are no absolute guidelines - it depends a lot on your dataset and specific application goals. There's some discussion of the sizes of datasets used in published Doc2Vec work at:
what is the minimum dataset size needed for good performance with doc2vec?
If your general-purpose corpus doesn't match your domain's vocabulary – including the same words, or using words in the same senses – that's a problem that can't be fixed with just "a lot of data". More data could just 'pull' word contexts and representations more towards generic, rather than domain-specific, values.
You really need to have your own quantitative, automated evaluation/scoring method, so you can measure whether results with your specific data and goals are sufficient, or improving with more data or other training tweaks.
Sometimes parameter tweaks can help get the most out of thin data. In particular, more training iterations or a smaller model (fewer vector dimensions) can slightly offset some issues with small corpora. But Word2Vec/Doc2Vec really benefit from lots of subtly varied, domain-specific data: it's the constant, incremental tug-of-war between all the text examples during training that helps the final representations settle into a useful constellation of arrangements, with the desired relative-distance/relative-direction properties.

Neural network preprocessing

I'm working on a school project about data prediction with a NN. I have my data normalized, and I have three inputs and one output.
My questions are:
What is the difference between the training data and the test data (is the training data supposed to be the input data and the test data the output data)?
What is the testing rate? Is it any random number, or is there a rule to find it?
What is the training error?
And my final question: after training on my data, I remember something about an error. I'm not quite sure, but do I need to find the error of my prediction, and how do I find it?
I know my questions might not be clear, but I'm just confused and tried to explain as much as I can.
Answering in a school spirit: Let's suppose you are given 10 solved exercises to study. You do study them, and then the teacher tests you on these exact exercises. You do well on the test. However, there is an important question. Why did you do well? Did you really understand the exercises, or did you just memorize them? And how can the teacher know?
There is only one way: the teacher must test you on a set of similar but different exercises. If you also do well on them, you have gotten a feel for the subject, and you are able to generalize the knowledge you acquired. If not, you probably memorized them without understanding a thing. This kind of knowledge is useless.
The same happens with neural networks. You use some patterns (the training set) to train them. But to check whether they are able to generalize, you have to test them on a different set of patterns (the test set), without the network knowing the correct answers. Ideally, there should be only a small difference in performance between the two sets; that indicates good generalization ability.
So both the training and test sets are inputs, not outputs. The only difference is when you use them: the training set during training, and the test set after it. The training/test rate is the percentage you got correct on the training/test set respectively. The training/test error is the complement, that is, the percentage you got wrong.
I know this reply might come late, but I will just complement the previous answer by saying that in supervised learning both the training set and the test set consist of input-output pairs. By structure alone they are exactly the same: a set of inputs paired with their corresponding outputs (or labels). There is no difference in structure between the two.
As blue_note said, they are just used on different occasions: one during training and one after that.
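To put numbers on training error versus test error, here is a small sketch using only NumPy. A 1-nearest-neighbour rule stands in for a model that memorizes its exercises (it is not a neural network), and the noisy three-input data is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)


def make_data(n):
    """Hypothetical data: three inputs and one noisy binary output."""
    X = rng.normal(size=(n, 3))
    y = (X.sum(axis=1) + rng.normal(scale=1.0, size=n) > 0).astype(int)
    return X, y


# Both sets are input-output pairs with exactly the same structure.
X_train, y_train = make_data(200)
X_test, y_test = make_data(200)


def predict_1nn(X_query):
    """Return the label of the closest training point for each query row."""
    dist2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[dist2.argmin(axis=1)]


# Error = fraction of wrong predictions; rate (accuracy) = 1 - error.
train_error = np.mean(predict_1nn(X_train) != y_train)
test_error = np.mean(predict_1nn(X_test) != y_test)
print(f"training error: {train_error:.3f}, test error: {test_error:.3f}")
```

Because every training point is its own nearest neighbour, the training error here is zero by construction. Only the test error says anything about generalization.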

Clustering of data - Pre-processing of data

These days I am using some clustering algorithm and I just wanted to ask a question related to this field. Maybe those who are working in this field already have this answer.
During clustering I need to have some training data which I am going to cluster. The number of iterations (e.g. for the K-Means algorithm) depends on the amount of training data (the number of vectors). Is there any method to find the most important data within the training data? What I mean is: instead of training K-Means with all the data, maybe there is a method to find just the important vectors (those that affect the clusters the most) and use these "important" vectors from the training data to train the algorithm.
I hope you understood me.
Thank You for reading and trying to answer.
"Training" and "Test" data is a concept from classification, not from cluster analysis.
K-means is a statistical method. If you want to speed it up, running it on a large enough random sample should give you nearly the same result.
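A quick way to check this on your own data is to compare the centres from the full run with those from a random sample. Below is a sketch assuming scikit-learn, with made-up blob data and an arbitrarily chosen 10% sample:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical data: three blobs of 2000 points each.
true_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([rng.normal(c, 0.5, size=(2000, 2)) for c in true_centers])

# k-means on all 6000 points.
full = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# k-means on a 10% uniform random sample.
sample = X[rng.choice(len(X), size=600, replace=False)]
sub = KMeans(n_clusters=3, n_init=10, random_state=0).fit(sample)

# Match each sample-based centre to its nearest full-data centre
# and report the largest displacement.
dists = np.linalg.norm(
    sub.cluster_centers_[:, None, :] - full.cluster_centers_[None, :, :],
    axis=2,
)
max_shift = dists.min(axis=1).max()
print(f"largest centre shift between sample and full run: {max_shift:.3f}")
```

If the shift is small relative to the spread of the clusters, the sample was large enough. If not, increase the sample size and compare again.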