getting paragraph representation for unseen paragraphs in doc2vec - classification

I would like to use the gensim doc2vec model for a classification task.
However, it seems that the gensim implementation of doc2vec needs to see all documents (train and test) to build the vocabulary before training the model. Otherwise, you get a KeyError when you ask for the document vector of a document that was not present when the vocabulary was built. Is my understanding correct? In practice, one does not have access to the test data at training time.
Is there any way to update the vocabulary at test time so that I can get document representations for test documents?

You can only look-up learned document-vectors for material that was presented during training.
But there is a method, infer_vector(), which takes a new tokenized document, presents it to the frozen, trained model, and returns a 'best-fit' vector. This approximates the vector the document would have received had it been available during training. See:
https://radimrehurek.com/gensim/models/doc2vec.html#gensim.models.doc2vec.Doc2Vec.infer_vector


How do word embeddings/word vectors work, and how are they created?

How does word2vec create vectors for words? I trained two word2vec models using two different files (from the Common Crawl website), but I am getting the same word vectors for a given word from both models.
Actually, I have created multiple word2vec models using different text files from the Common Crawl website. Now I want to check which model is the best among them. How can I select the best model out of all these models, and why am I getting the same word vectors from different models?
Sorry if the question is not clear.
If you are getting identical word-vectors from models that you've prepared from different text corpuses, something is likely wrong in your process. You may not be performing any training at all, perhaps because of a problem in how the text iterable is provided to the Word2Vec class. (In that case, word-vectors would remain at their initial, randomly-initialized values.)
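One possible cause (an assumption on my part, the question does not show its code) is handing the model a one-shot generator instead of an iterable that can be restarted. Training needs several passes over the corpus, and a generator is empty after the first pass. The pure-Python sketch below, with hypothetical names and a toy corpus, shows the difference:

```python
def sentence_generator(lines):
    """One-shot generator: it can be consumed only once."""
    for line in lines:
        yield line.split()


class RestartableCorpus:
    """Iterable that starts over every time iteration begins."""

    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.split()


lines = ["the cat sat", "the dog ran"]  # hypothetical toy corpus

one_shot = sentence_generator(lines)
first_pass = sum(1 for _ in one_shot)   # e.g. the vocabulary scan
second_pass = sum(1 for _ in one_shot)  # e.g. a training epoch: nothing left

corpus = RestartableCorpus(lines)
restart_first = sum(1 for _ in corpus)
restart_second = sum(1 for _ in corpus)

print(first_pass, second_pass)        # 2 0
print(restart_first, restart_second)  # 2 2
```

Passing a list of token lists, or an object like RestartableCorpus that re-reads the file in __iter__, avoids the problem.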
You should enable logging, and review the logs carefully to see that sensible counts of words, examples, progress, and incremental-progress are displayed during the process. You should also check that results for some superficial, ad-hoc checks look sensible after training. For example, does model.most_similar('hot') return other words/concepts somewhat like 'hot'?
Once you're sure models are being trained on varied corpuses – in which case their word-vectors should be very different from each other – deciding which model is 'best' depends on your specific goals with word-vectors.
You should devise a repeatable, quantitative way to evaluate a model against your intended end-uses. This might start crudely with a few of your own manual reviews of results, like looking over most_similar() results for important words for better/worse results, but should become more extensive, rigorous, and automated as your project progresses.
An example of such an automated scoring is the accuracy() method on gensim's word-vectors object. See:
https://github.com/RaRe-Technologies/gensim/blob/6d6f5dcfa3af4bc61c47dfdf5cdbd8e1364d0c3a/gensim/models/keyedvectors.py#L652
If supplied with a specifically-formatted file of word-analogies, it will check how well the word-vectors solve those analogies. For example, the questions-words.txt of Google's original word2vec code release includes the analogies they used to report vector quality. Note, though, that the word-vectors that are best for some purposes, like understanding text topics or sentiment, might not also be the best at solving this style of analogy, and vice-versa. If training your own word-vectors, it's best to choose your training corpus/parameters based on your own goal-specific criteria for what 'good' vectors will be.
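To make the idea of analogy scoring concrete, here is a rough pure-Python sketch. It is not gensim's implementation: the 2-d vectors are hand-picked toy values and the function names are hypothetical. Each question "a b c expected" is answered with the word whose vector is closest (by cosine similarity) to vec(b) - vec(a) + vec(c), and the score is the fraction answered correctly:

```python
import math

# Hypothetical 2-d vectors chosen so the analogies work; real vectors are learned.
vectors = {
    "man": [1.0, 0.0],
    "woman": [1.0, 1.0],
    "king": [3.0, 0.0],
    "queen": [3.0, 1.0],
    "apple": [0.0, 3.0],
}


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c):
    """Return the word closest to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


def analogy_accuracy(questions):
    """Fraction of (a, b, c, expected) questions answered correctly."""
    correct = sum(1 for a, b, c, expected in questions
                  if solve_analogy(a, b, c) == expected)
    return correct / len(questions)


questions = [
    ("man", "woman", "king", "queen"),
    ("king", "queen", "man", "woman"),
]
score = analogy_accuracy(questions)
print(score)
```

Running the same question set against each of your models gives one comparable number per model, which is the kind of repeatable check described above.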

Comparison between fasttext and LDA

Hi. Last week Facebook announced fastText, which is a way to categorize words into buckets. Latent Dirichlet Allocation (LDA) is another way to do topic modeling. My question is: has anyone compared the pros and cons of these two?
I haven't tried fastText, but here are a few pros and cons of LDA based on my experience.
Pro
It is an iterative model with support in Apache Spark.
It takes in a corpus of documents and does topic modeling.
It not only finds out what a document is talking about but also finds related documents.
The Apache Spark community is continuously contributing to it. Earlier they made it work in the mllib library, and now in the ml library.
Con
Stopwords need to be defined well, and they have to fit the context of the documents. For example, "document" is a word that appears with high frequency and may top the list of recommended topics, but it may or may not be relevant, so the stopword list needs to be updated for it.
Sometimes the classification can be irrelevant. In the example below, it is hard to infer what this bucket is talking about.
Topic:
Term:discipline
Term:disciplines
Term:notestable
Term:winning
Term:pathways
Term:chapterclosingtable
Term:metaprograms
Term:breakthroughs
Term:distinctions
Term:rescue
If anyone has done research on fastText, can you please share what you learned?
fastText offers more than topic modelling: it is a tool for generating word embeddings and for text classification using a shallow neural network.
The authors state its performance is comparable with much more complex “deep learning” algorithms, but the training time is significantly lower.
Pros:
=> It is extremely easy to train your own fastText model:
$ ./fasttext skipgram -input data.txt -output model
Just provide the input file, the output file and the architecture to use, and that's all. If you wish to customize your model, fastText also lets you change the hyper-parameters.
=> While generating word vectors, fastText takes into account sub-parts of words called character n-grams so that similar words have similar vectors even if they happen to occur in different contexts. For example, “supervised”, “supervise” and “supervisor” all are assigned similar vectors.
=> A previously trained model can be used to compute word vectors for out-of-vocabulary words. This one is my favorite. Even if the vocabulary of your corpus is finite, you can get a vector for almost any word that exists in the world.
=> fastText also provides the option to generate vectors for paragraphs or sentences. Similar documents can be found by comparing the vectors of documents.
=> The option to predict likely labels for a piece of text has been included too.
=> Pre-trained word vectors for about 90 languages trained on Wikipedia are available in the official repo.
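To illustrate the character n-gram point from the list above, here is a small pure-Python sketch (hypothetical helper names, not fastText's own code). It wraps a word in '<' and '>' boundary markers and extracts n-grams of length 3 to 6, the range used in the first paper linked below. It leaves out the hashing of n-grams and the extra whole-word token that fastText also uses:

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the character n-grams of a word wrapped in '<' and '>' markers."""
    wrapped = "<" + word + ">"
    ngrams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


def shared_fraction(word_a, word_b):
    """Jaccard overlap between the n-gram sets of two words."""
    set_a, set_b = set(char_ngrams(word_a)), set(char_ngrams(word_b))
    return len(set_a & set_b) / len(set_a | set_b)


print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']
print(shared_fraction("supervise", "supervisor"))
print(shared_fraction("supervise", "banana"))
```

Because a word vector is built from the vectors of its n-grams, words that share many n-grams ("supervise", "supervisor") start out with related vectors, and an out-of-vocabulary word can still get a vector from the n-grams it shares with known words.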
Cons:
=> As fastText is command-line based, I struggled to incorporate it into my project. This might not be an issue for others, though.
=> No in-built method to find similar words or paragraphs.
For those who wish to read more, here are the links to the official research papers:
1) https://arxiv.org/pdf/1607.04606.pdf
2) https://arxiv.org/pdf/1607.01759.pdf
And link to the official repo:
https://github.com/facebookresearch/fastText

Multiclass classification in SVM

I have been working on "Script identification from bilingual documents".
I want to classify pages/blocks as either English (class 1), Hindi (class 2) or Mixed using libsvm in Matlab. The problem is that my training data consists of samples of Hindi and English pages/blocks only, with no Mixed pages.
The test data may also contain Mixed pages/blocks, and in that case I want them to be classified as "Mixed". I am planning to do this using confidence scores or probability values: if the probability of class 1 is greater than a threshold (say 0.8) and the probability of class 2 is less than a threshold (say 0.05), the sample is classified as class 1, and vice versa for class 2. If neither condition is satisfied, I want to classify it as "Mixed".
The third return value of "libsvmpredict" is prob_values, and I was planning to use these prob_values to decide whether the test data is Hindi, English or Mixed. But in a few places I have read that "libsvmpredict" does not produce actual probability values.
Is there any way to classify the test data into 3 classes (Hindi, English, Mixed) using training data that contains only 2 classes in SVM?
This is not the modus operandi for SVMs.
In no way can an SVM predict a class it has never seen, because it has not learned how to separate that class from all the others.
The function svmpredict() in LibSVM does return probability estimates, and the greater the value, the more confident you can be in the prediction. But with just two trained classes you cannot rely on such values to predict a third class: svmpredict() returns as many decision values as there are classes.
You can go on with your thresholding system (which, again, is not SVM-based), but it will most likely fail or perform badly. Think about it: you have to set two thresholds and combine them with a logical AND. The chance of correctly classifying non-Mixed documents will drop drastically.
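For reference, the two-threshold rule from the question can be sketched in a few lines of Python (the function name is hypothetical; the thresholds 0.8 and 0.05 are the ones proposed in the question). If the two probability estimates sum to 1, as they do for a two-class model with probability outputs, the AND condition is only met when the winning class scores above 0.95, so the 0.8 threshold never matters:

```python
def classify_block(p_english, p_hindi, high=0.8, low=0.05):
    """Apply the two-threshold rule from the question to one block."""
    if p_english > high and p_hindi < low:
        return "English"
    if p_hindi > high and p_english < low:
        return "Hindi"
    return "Mixed"


print(classify_block(0.97, 0.03))  # English
print(classify_block(0.90, 0.10))  # Mixed, although English is at 0.90
print(classify_block(0.02, 0.98))  # Hindi
```

The second call shows the problem: a fairly confident English block falls into "Mixed" just because the two conditions are combined.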
My suggestion is: instead of spending time tuning thresholds, with a high chance of bad performance, join some of these texts together, or create new files with some Hindi and some English lines, so that your training data contains proper Mixed documents, and then train a standard 3-class SVM.
To create such files you can also use Matlab, which has pretty decent file I/O functions such as fread(), fwrite(), fprintf(), fscanf(), importdata() and so on.
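The same idea can be sketched in Python rather than Matlab. This is only an illustration under assumptions: the samples are taken to be lists of text lines, the placeholder strings stand in for real English and Hindi lines, and the function name is made up. Half of the lines are drawn from each language and then shuffled into one synthetic Mixed block:

```python
import random


def make_mixed_sample(english_lines, hindi_lines, n_lines=6, seed=0):
    """Build one synthetic Mixed block by drawing lines from both languages."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    half = n_lines // 2
    chosen = (rng.sample(english_lines, half)
              + rng.sample(hindi_lines, n_lines - half))
    rng.shuffle(chosen)
    return chosen


# Placeholder lines (hypothetical); in practice these come from your files.
english_lines = ["en-line-1", "en-line-2", "en-line-3"]
hindi_lines = ["hi-line-1", "hi-line-2", "hi-line-3"]

mixed_block = make_mixed_sample(english_lines, hindi_lines, n_lines=4)
print(mixed_block)
```

Generating many such blocks with different seeds and line counts gives the third class some variety before the 3-class model is trained.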

Sentiment analysis Google Prediction API

I am reading about the Google Prediction API and can't figure out a part of the docs.
From the use cases I am stuck a bit on this part:
Each line can only have one label assigned, but you can apply multiple
labels to one example by repeating an example and applying different
labels to each one. For example: "excited", "OMG! Just had a fabulous
day!" "annoying", "OMG! Just had a fabulous day!" If you send a tweet
to this model, you might get a classification something like this:
"excited":0.6, "annoying":0.2.
Why would it return "excited":0.6, "annoying":0.2 when there are no additional features for "excited"? Why is "excited" preferred?
It's not that the tag "excited" is preferred; the number is the probability that the message should in fact be classified as "excited" rather than "annoying".
Suppose I have 2 classifications for sentiment: "bullish" and "bearish". I then train a model in the Prediction API with even amounts of "bullish" and "bearish" training data. When I submit a message to the Prediction API to get the sentiment, it reads the text and assigns both a "bullish" and a "bearish" probability based on the words in the message. The probabilities will add up to 1.
So again, it's not that one label is preferred to another, but that the probability of the message being "excited" is 3 times greater than that of it being "annoying".
If you train the model with just those 2 examples, the "excited" and "annoying" labels for the sentence "OMG! Just had a fabulous day!", the only reasonable result when querying the classification of a tweet like "OMG! Just had a fabulous day!" should be "excited":0.5, "annoying":0.5.
So the case is probably not explained perfectly in the Google documentation. I guess they are more focused on explaining that it is possible to associate 2 different labels with exactly the same sentence.
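The 0.5/0.5 expectation can be checked with a toy estimator. This is an assumption for illustration only: it scores a text by the relative frequency of the labels attached to identical training texts, which is not a description of how the Prediction API works internally. The function name is hypothetical:

```python
from collections import Counter

# The two training rows from the documentation excerpt: same text, two labels.
training = [
    ("excited", "OMG! Just had a fabulous day!"),
    ("annoying", "OMG! Just had a fabulous day!"),
]


def label_probabilities(text, examples):
    """Relative frequency of each label among examples with exactly this text."""
    counts = Counter(label for label, example_text in examples
                     if example_text == text)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


probabilities = label_probabilities("OMG! Just had a fabulous day!", training)
print(probabilities)  # {'excited': 0.5, 'annoying': 0.5}
```

With nothing else in the training data, neither label has any evidence in its favour, so an even split is the only result this kind of estimate can give.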

Using Conditional Random Fields for Named Entity Recognition

What is a Conditional Random Field?
How exactly does a Conditional Random Field identify proper names as person, organization, or place in structured or unstructured text?
For example: This product is ordered by StackOverFlow Inc.
What does a Conditional Random Field do to identify StackOverFlow Inc. as an organization?
A CRF is a discriminative, batch, tagging model, in the same general family as a Maximum Entropy Markov model.
A full explanation is book-length.
A short explanation is as follows:
1. Humans annotate 200-500K words of text, marking the entities.
2. Humans select a set of features that they hope indicate entities. Things like capitalization, or whether the word was seen in the training set with a tag.
3. A training procedure counts all the occurrences of the features.
4. The meat of the CRF algorithm searches the space of all possible models that fit the counts to find a pretty good one.
5. At runtime, a decoder (probably a Viterbi decoder) looks at a sentence and decides what tag to assign to each word.
The hard parts of this are feature selection and the search algorithm in step 4.
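The runtime step can be sketched in Python. This covers decoding only, under loud assumptions: the feature weights below are set by hand instead of being learned by the training search, there are just two tags ("O" and "ORG"), and all names are hypothetical. Each word gets a score per tag from its features, each pair of adjacent tags gets a transition score, and the Viterbi decoder finds the tag sequence with the highest total:

```python
def viterbi(tokens, tags, emission_score, transition_score):
    """Return the highest-scoring tag sequence for the tokens."""
    # best[i][tag] = (score of the best path ending in tag at position i,
    #                 previous tag on that path)
    best = [{tag: (emission_score(tokens, 0, tag), None) for tag in tags}]
    for i in range(1, len(tokens)):
        column = {}
        for tag in tags:
            prev_tag = max(
                tags,
                key=lambda p: best[i - 1][p][0] + transition_score(p, tag),
            )
            score = (best[i - 1][prev_tag][0]
                     + transition_score(prev_tag, tag)
                     + emission_score(tokens, i, tag))
            column[tag] = (score, prev_tag)
        best.append(column)
    # Follow the back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(tokens) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return list(reversed(path))


def emission_score(tokens, i, tag):
    """Hand-set weights standing in for learned feature weights (hypothetical)."""
    token = tokens[i]
    score = 0.0
    # Feature: capitalized word that is not the first word of the sentence.
    if i > 0 and token[0].isupper():
        score += 2.0 if tag == "ORG" else -1.0
    else:
        score += 1.0 if tag == "O" else -1.0
    # Feature: company suffix.
    if token.rstrip(".") in {"Inc", "Ltd", "Corp"} and tag == "ORG":
        score += 2.0
    return score


def transition_score(prev_tag, tag):
    """Small bonus for staying in the same tag (hypothetical weight)."""
    return 0.5 if prev_tag == tag else 0.0


tokens = "This product is ordered by StackOverFlow Inc.".split()
predicted = viterbi(tokens, ["O", "ORG"], emission_score, transition_score)
print(list(zip(tokens, predicted)))
```

With these weights the decoder tags "StackOverFlow" and "Inc." as ORG and everything else as O. In a real CRF the weights come out of training on the annotated text, and the feature set is far larger.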
Well, to understand that you have to study a lot of things.
For a start:
Understand the basics of Markov and Bayesian networks.
There is an online course on Coursera by Daphne Koller:
https://class.coursera.org/pgm/lecture/index
A CRF is a special type of Markov network in which we have observations and hidden states.
The objective is to find the best state assignment for the unobserved variables, also known as the MAP problem.
Be prepared for a lot of probability and optimization. :-)