Comparison between fasttext and LDA - facebook

Hi. Last week Facebook announced fastText, which is a way to categorize words into buckets. Latent Dirichlet Allocation (LDA) is another way to do topic modeling. My question is: has anyone compared the pros and cons of the two?
I haven't tried fastText, but here are a few pros and cons of LDA based on my experience.
Pro
Iterative model with support for Apache Spark
Takes in a corpus of documents and does topic modeling.
Not only finds out what a document is talking about but also finds related documents
The Apache Spark community is continuously contributing to it. Earlier they made it work in mllib, and now in the ml libraries
Con
Stopwords need to be defined well, and they have to fit the context of the documents. For example, "document" is a word that appears with high frequency and may top the chart of recommended topics, but it may or may not be relevant, so the stopword list has to be updated for it
Sometimes the classification may be irrelevant. In the example below it is hard to infer what this bucket is talking about
Topic:
Term:discipline
Term:disciplines
Term:notestable
Term:winning
Term:pathways
Term:chapterclosingtable
Term:metaprograms
Term:breakthroughs
Term:distinctions
Term:rescue
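On the stopword point above, here is a minimal sketch of one way to find corpus-specific stopword candidates: flag terms that occur in most documents. The function name `high_df_terms`, the 0.8 cutoff and the toy corpus are all assumptions for illustration, not anything taken from Spark's LDA.

```python
from collections import Counter

def high_df_terms(docs, max_doc_ratio=0.8):
    """Return terms that occur in more than max_doc_ratio of the documents."""
    doc_freq = Counter()
    for doc in docs:
        # Count each term once per document, so this is document frequency.
        doc_freq.update(set(doc.lower().split()))
    threshold = max_doc_ratio * len(docs)
    return sorted(term for term, df in doc_freq.items() if df > threshold)

# Hypothetical toy corpus, for illustration only.
docs = [
    "this document covers card rewards",
    "this document covers late fees",
    "another document about travel insurance",
    "document upload failed again",
]
extra_stopwords = high_df_terms(docs, max_doc_ratio=0.8)
print(extra_stopwords)  # ['document']
```

The flagged terms would still need a manual look before being added to the stopword list, since a frequent word can also be a meaningful one.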
If anyone has done research on fastText, could you please share what you learned?

fastText offers more than topic modelling: it is a tool for generating word embeddings and for text classification using a shallow neural network.
The authors state its performance is comparable with much more complex “deep learning” algorithms, but the training time is significantly lower.
Pros:
=> It is extremely easy to train your own fastText model:
$ ./fasttext skipgram -input data.txt -output model
Just provide your input and output files and the architecture to use, and that's all. If you wish to customize your model a bit, fastText also lets you change the hyper-parameters.
=> While generating word vectors, fastText takes into account sub-parts of words called character n-grams so that similar words have similar vectors even if they happen to occur in different contexts. For example, “supervised”, “supervise” and “supervisor” all are assigned similar vectors.
=> A previously trained model can be used to compute word vectors for out-of-vocabulary words. This one is my favorite. Even if the vocabulary of your corpus is finite, you can get a vector for almost any word that exists in the world.
=> fastText also provides the option to generate vectors for paragraphs or sentences. Similar documents can be found by comparing the vectors of documents.
=> The option to predict likely labels for a piece of text has been included too.
=> Pre-trained word vectors for about 90 languages trained on Wikipedia are available in the official repo.
Cons:
=> As fastText is command-line based, I struggled to incorporate it into my project. This might not be an issue for others, though.
=> No in-built method to find similar words or paragraphs.
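To make the character n-gram point from the pros above more concrete, here is a rough sketch of how a word can be split into n-grams with `<` and `>` boundary markers. This is plain Python, not fastText's own code; the helper names are made up, and the 3 to 6 range is what I understand the first paper linked below to describe.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams

def ngram_overlap(a, b):
    """Jaccard overlap between the n-gram sets of two words."""
    ga, gb = set(char_ngrams(a)), set(char_ngrams(b))
    return len(ga & gb) / len(ga | gb)

print(char_ngrams("where", 3, 3))  # ['<wh', 'whe', 'her', 'ere', 're>']
# Words that share a stem share many n-grams; unrelated words share few or none.
print(ngram_overlap("supervised", "supervisor"))
print(ngram_overlap("supervised", "banana"))
```

Since a word's vector is built from the vectors of its n-grams, shared n-grams are why related word forms end up close together, and why an unseen word can still get a vector.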
For those who wish to read more, here are the links to the official research papers:
1) https://arxiv.org/pdf/1607.04606.pdf
2) https://arxiv.org/pdf/1607.01759.pdf
And link to the official repo:
https://github.com/facebookresearch/fastText

Related

Can LDA model be useful for sentences (not documents) clustering / classification?

I've recently been working on a sentence classification problem. The sentences are one- or two-line product reviews in which customers post their feedback on various features the product has to offer. After pre-processing (stop-word removal and stemming), I use feature extraction (word2vec, tf-idf) and a clustering algorithm (k-means) on my sentences to get an unsupervised sentence classification, and the output is fairly acceptable. However, I'm looking for more clustering options and specifically wanted to try LDA to further improve output quality, but I came across this paper listing a few facts about using LDA for sentence classification.
My question is: would it be helpful to use LDA for sentence (not document) classification? Also, apart from k-means, what other unsupervised alternatives work well for sentence classification? Thank you in advance for all your suggestions.
Note: I’m practicing my exercise in Spark 1.6.1 environment with pyspark API.
After trying out LDA myself, this is the output:
1. The topics came out similar: the frequent words for each topic overlap a lot, and the topics share almost the same set of words.
My understanding is that my reviews belong to one specific domain. For example, my product belongs to the credit card domain and all reviews revolve around this single domain. Further, I plotted the word distribution and found that the most frequently used word accounts for only about 2% of the total population.
The overlapping is not necessarily a function of your input (documents or sentences) but could well be the result of your hyperparameter choices. For example, you could choose lower alpha to have less overlap over topics.
From
https://stats.stackexchange.com/questions/37405/natural-interpretation-for-lda-hyperparameters
In practice, a high alpha-value will lead to documents being more similar in terms of what topics they contain. A high beta-value will similarly lead to topics being more similar in terms of what words they contain.
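To see the effect of alpha described in that quote, here is a small sketch that samples document-topic mixtures from a symmetric Dirichlet prior using only the standard library. It is not an LDA implementation; the function names, the 5 topics and the two alpha values are assumptions chosen for illustration.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k topics."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def mean_largest_share(alpha, k=5, n=2000, seed=0):
    """Average weight of the dominant topic across n sampled documents."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n)) / n

low = mean_largest_share(alpha=0.1)
high = mean_largest_share(alpha=10.0)
print(f"alpha=0.1  -> mean dominant-topic share: {low:.2f}")
print(f"alpha=10.0 -> mean dominant-topic share: {high:.2f}")
```

With a low alpha each sampled document is dominated by one topic, while a high alpha spreads weight almost evenly, so documents look alike in their topic mix. The same reasoning applies to beta and the word distribution of each topic.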
"""
Distinct from our proposed “one topic per sentence” assumption, all these methods allow each sentence to include multiple topics, and use various means to incorporate sentence structure. The most straightforward method is to treat each sentence as a document and apply the LDA model on the collection of sentences rather than documents. Despite its simplicity, this method, called local-LDA (Brody and Elhadad 2010), has been demonstrated to be effective in discovering meaningful topics while summarizing consumer reviews. (p.1376)
"""
see: https://pubsonline.informs.org/doi/pdf/10.1287/mnsc.2014.1930
Yes. LDA can also work on sentences (but won't always work).
It tends to work better on longer documents though. But your sentences are longer than tweets, that's good.

How does word embedding/ word vectors work/created?

How does word2vec create vectors for words? I trained two word2vec models using two different files (from the commoncrawl website), but I am getting the same word vectors for a given word from both models.
Actually, I have created multiple word2vec models using different text files from the commoncrawl website. Now I want to check which model is the best among them. How can I select the best model out of all these, and why am I getting the same word vectors from different models?
Sorry if the question is not clear.
If you are getting identical word-vectors from models that you've prepared from different text corpuses, something is likely wrong in your process. You may not be performing any training at all, perhaps because of a problem in how the text iterable is provided to the Word2Vec class. (In that case, word-vectors would remain at their initial, randomly-initialized values.)
You should enable logging, and review the logs carefully to see that sensible counts of words, examples, progress, and incremental-progress are displayed during the process. You should also check that results for some superficial, ad-hoc checks look sensible after training. For example, does model.most_similar('hot') return other words/concepts somewhat like 'hot'?
Once you're sure models are being trained on varied corpuses – in which case their word-vectors should be very different from each other – deciding which model is 'best' depends on your specific goals with word-vectors.
You should devise a repeatable, quantitative way to evaluate a model against your intended end-uses. This might start crudely with a few of your own manual reviews of results, like looking over most_similar() results for important words to judge whether they are better or worse, but should become more extensive, rigorous, and automated as your project progresses.
An example of such an automated scoring is the accuracy() method on gensim's word-vectors object. See:
https://github.com/RaRe-Technologies/gensim/blob/6d6f5dcfa3af4bc61c47dfdf5cdbd8e1364d0c3a/gensim/models/keyedvectors.py#L652
If supplied with a specifically-formatted file of word-analogies, it will check how well the word-vectors solve those analogies. For example, the questions-words.txt of Google's original word2vec code release includes the analogies they used to report vector quality. Note, though, that the word-vectors that are best for some purposes, like understanding text topics or sentiment, might not also be the best at solving this style of analogy, and vice-versa. If training your own word-vectors, it's best to choose your training corpus/parameters based on your own goal-specific criteria for what 'good' vectors will be.
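For intuition about what such an analogy score measures, here is a simplified sketch in plain Python. It is not gensim's implementation: the function names are made up, the 2-d vectors and the question lines are hypothetical toy data, and the lines only mimic the questions-words.txt layout (section headers starting with ":" and four words per line).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def parse_questions(lines):
    """Keep 'a b c d' lines and skip ': section' headers."""
    questions = []
    for line in lines:
        parts = line.split()
        if len(parts) == 4 and parts[0] != ":":
            questions.append(tuple(parts))
    return questions

def solve_analogy(vectors, a, b, c):
    """Word closest to vec(b) - vec(a) + vec(c), excluding a, b and c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

def analogy_accuracy(vectors, questions):
    """Fraction of 'a is to b as c is to d' questions answered with d."""
    correct = sum(solve_analogy(vectors, a, b, c) == d for a, b, c, d in questions)
    return correct / len(questions)

# Hypothetical toy vectors and questions, for illustration only.
vectors = {
    "man": [1.0, 0.0], "woman": [1.0, 1.0],
    "king": [3.0, 0.0], "queen": [3.0, 1.0],
    "apple": [0.0, 3.0],
}
lines = [": family", "man woman king queen", "king queen man woman"]
score = analogy_accuracy(vectors, parse_questions(lines))
print(score)  # 1.0
```

Running the same questions against each candidate model gives one comparable number per model, which is the kind of repeatable check described above.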

What is the effect of adding new word vector embeddings onto an existing embedding space for Neural networks

In word2vec, the word embeddings are learned using co-occurrence, updating the vectors' dimensions such that words that occur in each other's context come closer together.
My questions are the following:
1) If you already have a pre-trained set of embeddings, let's say a 100-dimensional space with 40k words, can you add 10 additional words to this embedding space without changing the existing word embeddings? You would only be updating the dimensions of the new words, using the existing word embeddings. I'm thinking of this problem with respect to the word2vec algorithm, but if people have insights on how GloVe embeddings work in this case, I am still very interested.
2) Part 2 of the question is: can you then use the NEW word embeddings in a NN that was trained with the previous embedding set and expect reasonable results? For example, if I had trained a NN for sentiment analysis, and the word "nervous" was previously not in the vocabulary, would "nervous" be correctly classified as "negative"?
This is a question about how sensitive (or robust) NN are with respect to the embeddings. I'd appreciate any thoughts/insight/guidance.
The initial training used info about known words to plot them in a useful N-dimensional space.
It is of course theoretically possible to then use new information, about new words, to also give them coordinates in the same space. You would want lots of varied examples of the new words being used together with the old words.
Whether you want to freeze the positions of old words, or let them also drift into new positions based on the new examples, could be an important choice to make. If you've already trained a pre-existing classifier (like a sentiment classifier) using the older words, and didn't want to re-train that classifier, you'd probably want to lock the old words in place, and force the new words into compatible positioning (even if the newer combined text examples would otherwise change the relative positions of older words).
Since after an effective train-up of the new words, they should generally be near similar-meaning older words, it would be reasonable to expect classifiers that worked on the old words to still do something useful on the new words. But how well that'd work would depend on lots of things, including how well the original word-set covered all the generalizable 'neighborhoods' of meaning. (If the new words bring in shades of meaning of which there were no examples in the old words, that area of the coordinate-space may be impoverished, and the classifier may have never had a good set of distinguishing examples, so performance could lag.)
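As a toy illustration of locking the old words in place, the sketch below only ever writes to the vectors of new words. Note the big assumption: the update rule is NOT the word2vec objective, just a simplified stand-in that nudges each new word toward the mean vector of the known words it co-occurs with. The function name, the 2-d vectors and the example sentences are all hypothetical.

```python
import random

def fit_new_words(embeddings, new_words, sentences, epochs=20, lr=0.1, seed=0):
    """Place new words in an existing space while leaving old vectors untouched.

    Simplified update (not the word2vec loss): each new word moves a small
    step towards the mean vector of the frozen words in the same sentence.
    """
    rng = random.Random(seed)
    dim = len(next(iter(embeddings.values())))
    frozen = set(embeddings)  # existing vocabulary, never modified
    vectors = {w: list(v) for w, v in embeddings.items()}
    for w in new_words:
        vectors[w] = [rng.uniform(-0.01, 0.01) for _ in range(dim)]
    for _ in range(epochs):
        for sentence in sentences:
            context = [vectors[t] for t in sentence if t in frozen]
            if not context:
                continue
            mean = [sum(col) / len(context) for col in zip(*context)]
            for t in sentence:
                if t in new_words:
                    vectors[t] = [x + lr * (m - x) for x, m in zip(vectors[t], mean)]
    return vectors

# Hypothetical 2-d space where the first axis loosely tracks sentiment.
old = {"sad": [-1.0, 0.0], "afraid": [-0.8, 0.2], "happy": [1.0, 0.0]}
sentences = [["nervous", "afraid"], ["sad", "nervous"]]
result = fit_new_words(old, {"nervous"}, sentences)
print(result["nervous"])
```

Because "nervous" only appears next to negative words here, it lands on the negative side of the space, which is the situation in which a classifier trained on the old words has a chance of handling it sensibly.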

How to preprocess text for embedding?

In the traditional "one-hot" representation of words as vectors you have a vector of the same dimension as the cardinality of your vocabulary. To reduce dimensionality usually stopwords are removed, as well as applying stemming, lemmatizing, etc. to normalize the features you want to perform some NLP task on.
I'm having trouble understanding whether/how to preprocess text to be embedded (e.g. word2vec). My goal is to use these word embeddings as features for a NN to classify texts into topic A, not topic A, and then perform event extraction on them on documents of topic A (using a second NN).
My first instinct is to preprocess by removing stopwords, lemmatizing, stemming, etc. But as I learn a bit more about NNs I realize that, applied to natural language, the CBOW and skip-gram models would in fact require the whole set of words to be present (to be able to predict a word from context one would need to know the actual context, not a reduced form of the context after normalizing... right?). The actual sequence of POS tags seems to be key for a human-feeling prediction of words.
I've found some guidance online but I'm still curious to know what the community here thinks:
Are there any recent commonly accepted best practices regarding punctuation, stemming, lemmatizing, stopwords, numbers, lowercase etc?
If so, what are they? Is it better in general to process as little as possible, or more on the heavier side to normalize the text? Is there a trade-off?
My thoughts:
It is better to remove punctuation (but e.g. in Spanish don't remove the accents because they do convey contextual information), change written numbers to numeric, do not lowercase everything (useful for entity extraction), no stemming, no lemmatizing.
Does this sound right?
I've been working on this problem myself for some time. I totally agree with the other answers, that it really depends on your problem and you must match your input to the output that you expect.
I found that for certain tasks like sentiment analysis it's OK to remove lots of nuance by preprocessing, but for text generation, for example, it is quite essential to keep everything.
I'm currently working on generating Latin text and therefore I need to keep quite a lot of structure in the data.
I found a very interesting paper doing some analysis on that topic, but it covers only a small area. However, it might give you some more hints:
On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis
by Jose Camacho-Collados and Mohammad Taher Pilehvar
https://arxiv.org/pdf/1707.01780.pdf
Here is a quote from their conclusion:
"Our evaluation highlights the importance of being consistent in the preprocessing strategy employed across training and evaluation data. In general a simple tokenized corpus works equally or better than more complex preprocessing techniques such as lemmatization or multiword grouping, except for a dataset corresponding to a specialized domain, like health, in which sole tokenization performs poorly. Additionally, word embeddings trained on multiword-grouped corpora perform surprisingly well when applied to simple tokenized datasets."
So many questions. The answer to all of them is probably "depends". You need to consider the classes you are trying to predict and the kind of documents you have. Predicting authorship (where you definitely need to keep all kinds of punctuation and case so that stylometry will work) is not the same as sentiment analysis (where you can get rid of almost everything but have to pay special attention to things like negations).
I would say apply the same preprocessing to both ends. The surface forms are your link so you can't normalise in different ways. I do agree with the point Joseph Valls makes, but my impression is that most embeddings are trained in a generic rather than a specific manner. What I mean is that the Google News embeddings perform quite well on various different tasks and I don't think they had some fancy preprocessing. Getting enough data tends to be more important. All that being said -- it still depends :-)
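A minimal sketch of "the same preprocessing on both ends": one light tokenizer, used both for the corpus the embeddings are trained on and for the text you later classify. The function name and the Spanish sample sentences are assumptions; it drops punctuation but keeps accents, case and digits, and it does not attempt stemming, lemmatizing or converting written numbers.

```python
import re
import unicodedata

TOKEN_RE = re.compile(r"\w+")  # in Python 3 this also matches accented letters

def preprocess(text):
    """Light normalisation: drop punctuation, keep accents, case and digits."""
    # NFC keeps an accented letter as a single code point so it is not split.
    text = unicodedata.normalize("NFC", text)
    return TOKEN_RE.findall(text)

# The same function is applied to training text and to later input text.
train_tokens = preprocess("¡El niño comió 3 manzanas en Madrid!")
query_tokens = preprocess("¿Comió el niño?")
print(train_tokens)  # ['El', 'niño', 'comió', '3', 'manzanas', 'en', 'Madrid']
print(query_tokens)  # ['Comió', 'el', 'niño']
```

Keeping it in a single function makes it hard for the training side and the inference side to drift apart, whichever level of normalisation you settle on.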

Grouping similar words (bad, worse)

I know there are ways to find synonyms either by using NLTK/pywordnet or Pattern package in python but it isn't solving my problem.
If there are words like
bad,worst,poor
bag,baggage
lost,lose,misplace
I am not able to capture them. Can anyone suggest a possible way?
There has been a great deal of research in this area over the past 20 years. Yes, computers don't understand language, but we can train them to find the similarity or difference between two words with the help of some manual effort.
Approaches may be:
Based on manually curated datasets that contain how words in a language are related to each other.
Based on statistical or probabilistic measures of words appearing in a corpus.
Method 1:
Try WordNet. It is a human-curated network of words that preserves the relationships between words according to human understanding. In short, it is a graph whose nodes are 'synsets' and whose edges are relations between them, so any two words that are very close to each other in the graph are close in meaning. Words that fall within the same synset might mean exactly the same thing. 'Bag' and 'baggage' are close, which you can find by iteratively exploring node to node in breadth-first style: start at 'bag' and explore its neighbors in an attempt to find 'baggage'. You'll have to limit this search to a small number of iterations for any practical application. Another approach is to start a random walk from one node and try to reach the other node within a given number of tries and distance. If you reach 'baggage' from 'bag', say, 500 times out of 1000 within 10 moves, you can be pretty sure that they are very similar to each other. Random walks are more helpful in much larger and more complex graphs.
There are many other similar resources online.
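The breadth-first exploration from Method 1 can be sketched like this. The graph below is a hypothetical toy adjacency list, NOT real WordNet data, and `hops_between` is a made-up helper; with a real resource you would build the neighbor lists from its relations instead.

```python
from collections import deque

def hops_between(graph, start, goal, max_hops=3):
    """Breadth-first search; returns the hop count, or None if goal is
    not reachable within max_hops."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

# Hypothetical toy graph, for illustration only.
graph = {
    "bag": ["container", "baggage"],
    "baggage": ["bag", "luggage"],
    "luggage": ["baggage", "suitcase"],
    "suitcase": ["luggage"],
    "container": ["bag", "box"],
    "box": ["container"],
    "bad": ["poor"],
    "poor": ["bad"],
}
print(hops_between(graph, "bag", "baggage"))   # 1
print(hops_between(graph, "bag", "suitcase"))  # 3
print(hops_between(graph, "bag", "bad"))       # None
```

A small hop count suggests closely related words, and the `max_hops` limit is what keeps the search practical on a large graph.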
Method 2:
Word2Vec. It is hard to explain here, but it works by creating, for each word, a vector with a user-chosen number of dimensions based on the word's context in the text. The idea that words appearing in similar contexts have similar meanings has been around for two decades. For example, "I'm gonna check out my bags" and "I'm gonna check out my baggage" might both appear in a text. You can read the paper for an explanation (link at the end).
So you can train a Word2Vec model on a large corpus. In the end, you will be able to get a 'vector' for each word. You do not need to understand the significance of this vector. You can use this vector representation to find the similarity or difference between words, or to generate synonyms of any word. The idea is that words which are similar to each other have vectors close to each other.
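"Close to each other" is usually measured with cosine similarity. Here is a bare-bones sketch of that idea in plain Python; the 3-d vectors are hypothetical toy values rather than output from a trained model, and `most_similar` here is a made-up helper, not the gensim method of the same name.

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(vectors, word, topn=2):
    """Rank every other word by cosine similarity to the given word."""
    scores = [(w, cosine(vectors[word], v)) for w, v in vectors.items() if w != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical toy vectors, for illustration only.
vectors = {
    "bag": [0.9, 0.1, 0.0],
    "baggage": [0.8, 0.2, 0.1],
    "lost": [0.0, 0.9, 0.3],
    "misplace": [0.1, 0.8, 0.4],
}
print(most_similar(vectors, "bag", topn=1))
print(most_similar(vectors, "lost", topn=1))
```

Grouping words like bag/baggage or lost/misplace then comes down to picking a similarity threshold, or clustering the vectors.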
Word2vec came up two years ago and immediately became the 'thing-to-use' in most NLP applications. The quality of this approach depends on the amount and quality of your data. Generally a Wikipedia dump is considered good training data, as it contains articles about almost everything that makes sense. You can easily find ready-to-use models trained on Wikipedia online.
A tiny example from Radim's website:
>>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]
>>> model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
>>> model.similarity('woman', 'man')
0.73723527
The first example gives you the word closest (topn=1) to woman and king while also being farthest from man. The answer is queen. The second example picks the odd one out. The third one tells you how similar the two words are in your corpus.
Easy to use tool for Word2vec :
https://radimrehurek.com/gensim/models/word2vec.html
http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf (Warning : Lots of Maths Ahead)