Grouping similar words (bad , worse ) - text-processing

I know there are ways to find synonyms either by using NLTK/pywordnet or Pattern package in python but it isn't solving my problem.
If there are words like
bad,worst,poor
bag,baggage
lost,lose,misplace
I am not able to capture them. Can anyone suggest me a possible way?

There has been a great deal of research in this area over the past 20 years. Computers don't understand language, but with some manual effort we can train them to measure how similar or different two words are.
Approaches may be:
Based on manually curated datasets that contain how words in a language are related to each other.
Based on statistical or probabilistic measures of words appearing in a corpus.
Method 1:
Try WordNet. It is a human-curated network of words that preserves the relationships between words according to human understanding. In short, it is a graph whose nodes are called 'synsets' and whose edges are relations between them. Two words that are close to each other in the graph are close in meaning, and words that fall within the same synset may mean exactly the same thing. Bag and baggage are close, which you can confirm by exploring node to node in breadth-first style: start with 'bag' and explore its neighbors in an attempt to find 'baggage'. You'll have to limit this search to a small number of iterations for any practical application. Another approach is to start a random walk from one node and try to reach the other node within a limited number of tries and steps. If you reach baggage from bag, say, 500 times out of 1000 within 10 moves, you can be pretty sure that they are very similar to each other. Random walks are more helpful in much larger and more complex graphs.
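The breadth-first search and the random walk described above can be sketched on a tiny hand-made graph. This is only an illustration: the graph, its edges, and the function names are made up and are not WordNet data or any library's API.

```python
import random
from collections import deque

# Hypothetical toy graph standing in for WordNet; the edges are invented.
GRAPH = {
    "bag": ["baggage", "container", "handbag"],
    "baggage": ["bag", "luggage"],
    "luggage": ["baggage", "suitcase"],
    "suitcase": ["luggage"],
    "handbag": ["bag"],
    "container": ["bag", "box"],
    "box": ["container"],
}


def bfs_distance(graph, start, goal, max_depth=3):
    """Return the number of hops from start to goal, or None if the goal
    is not reached within max_depth hops."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the search limit
        for neighbor in graph.get(node, []):
            if neighbor == goal:
                return depth + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return None


def random_walk_hit_rate(graph, start, goal, walks=1000, max_steps=10, seed=0):
    """Fraction of random walks from start that reach goal within max_steps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(walks):
        node = start
        for _ in range(max_steps):
            neighbors = graph.get(node, [])
            if not neighbors:
                break
            node = rng.choice(neighbors)
            if node == goal:
                hits += 1
                break
    return hits / walks
```

A higher hit rate or a shorter hop count suggests a closer pair; the cut-off you accept as "similar" is something you would have to tune on your own data.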
There are many other similar resources online.
Method 2:
Word2Vec. It is hard to explain here, but it works by creating, for each word, a vector with a user-specified number of dimensions based on the word's context in the text. The idea that words appearing in similar contexts have similar meanings has been around for two decades. For example, "I'm gonna check out my bags" and "I'm gonna check out my baggage" might both appear in text. You can read the paper for an explanation (link at the end).
So you can train a Word2Vec model over a large corpus. In the end, you will be able to get a 'vector' for each word. You do not need to understand the significance of this vector. You can use this vector representation to find the similarity or difference between words, or to generate synonyms of any word. The idea is that words which are similar to each other have vectors close to each other.
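To make "vectors close to each other" concrete, here is a minimal sketch that compares words by cosine similarity. The three-dimensional vectors are invented for illustration; a trained model would give you vectors with many more dimensions, and the helper names are hypothetical.

```python
import math

# Made-up vectors for illustration only; real vectors come from a trained model.
VECTORS = {
    "bag": [0.9, 0.1, 0.0],
    "baggage": [0.8, 0.2, 0.1],
    "lose": [0.0, 0.9, 0.3],
    "misplace": [0.1, 0.8, 0.4],
}


def cosine(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word, vectors, topn=1):
    """Rank every other word by cosine similarity to the given word."""
    scores = [
        (other, cosine(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:topn]
```

Grouping words then amounts to putting together the words whose similarity exceeds a threshold you choose.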
Word2vec came up two years ago and immediately became the 'thing-to-use' in most NLP applications. The quality of this approach depends on the amount and quality of your data. A Wikipedia dump is generally considered good training data, as it contains articles about almost everything that makes sense. You can easily find ready-to-use models trained on Wikipedia online.
A tiny example from Radim's website:
>>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]
>>> model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
>>> model.similarity('woman', 'man')
0.73723527
The first example tells you the word closest (topn=1) to the words woman and king while also being farthest from the word man. The answer is queen. The second example picks the odd one out. The third tells you how similar the two words are in your corpus.
Easy to use tool for Word2vec :
https://radimrehurek.com/gensim/models/word2vec.html
http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf (Warning : Lots of Maths Ahead)

Related

How to predict word using trained CBOW

I have a question about CBOW prediction. Suppose my job is to use 3 surrounding words w(t-3), w(t-2), w(t-1) as input to predict one target word w(t). Once the model is trained, I want to predict a missing word at the end of a sentence. Does this model only work for a sentence of four words in which the first three are known and the last is unknown? If I have a sentence of 10 words and the first nine words are known, can I use 9 words as input to predict the last missing word in that sentence?
Word2vec CBOW mode typically uses symmetric windows around a target word. But it simply averages the (current in-training) word-vectors for all words in the window to find the 'inputs' for the prediction neural-network. Thus, it is tolerant of asymmetric windows: if fewer words are available on either side, fewer words on that side are used (and perhaps even zero on that side, for words at the front/end of a text).
Additionally, during each training example, it doesn't always use the maximum-window specified, but some random-sized window up-to the specified size. So for window=5, it will sometimes use just 1 on either side, and other times 2, 3, 4, or 5. This is done to effectively overweight closer words.
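The two behaviours just described (windows truncated at the edges of a text, and a randomly shrunk window per training example) can be sketched roughly as follows. This is an illustration with hypothetical function names, not code from any word2vec implementation.

```python
import random


def cbow_context(tokens, position, window, rng):
    """Pick the context words around tokens[position] using an effective
    window drawn at random between 1 and `window`."""
    reduced = rng.randint(1, window)  # closer words end up used more often
    start = max(0, position - reduced)  # fewer words if near the front
    end = min(len(tokens), position + reduced + 1)  # fewer if near the end
    return [tokens[i] for i in range(start, end) if i != position]


def average_vectors(words, vectors):
    """Average the vectors of the context words to form the CBOW input."""
    dims = len(next(iter(vectors.values())))
    total = [0.0] * dims
    for word in words:
        for i, value in enumerate(vectors[word]):
            total[i] += value
    return [value / len(words) for value in total]
```

Because the input is an average, the number of context words can vary from example to example, which is why nine known words are as acceptable an input as three.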
Finally, and most importantly for your question, word2vec doesn't really do a full prediction during training of "what exact word does the model say should be at this target location?" In either the 'hierarchical softmax' or 'negative-sampling' variants, such an exact prediction can be expensive, requiring calculations of neural-network output-node activation levels proportionate to the size of the full corpus vocabulary.
Instead, it does the much-smaller number-of-calculations required to see how strongly the neural-network is predicting the actual target word observed in the training data, perhaps in contrast to a few other words. In hierarchical-softmax, this involves calculating output nodes for a short encoding of the one target word – ignoring all other output nodes encoding other words. In negative-sampling, this involves calculating the one distinct output node for the target word, plus a few output nodes for other randomly-chosen words (the 'negative' examples).
In neither case does training know if this target word is being predicted in preference over all other words, because it's not taking the time to evaluate all other words. It just looks at the current strength-of-outputs for a real example's target word, and nudges them (via back-propagation) to be slightly stronger.
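As a rough illustration of the negative-sampling case, the sketch below scores only the target word's output node plus a handful of sampled 'negative' nodes, never the whole vocabulary. The vectors and names are made up, and real implementations also compute gradients and update the weights, which is omitted here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def negative_sampling_loss(hidden, output_vectors, target, negatives):
    """Loss for one training example, touching 1 + len(negatives) output
    nodes instead of one node per vocabulary word."""
    # Push the score of the observed target word up...
    loss = -math.log(sigmoid(dot(hidden, output_vectors[target])))
    # ...and the scores of a few randomly chosen other words down.
    for word in negatives:
        loss -= math.log(sigmoid(-dot(hidden, output_vectors[word])))
    return loss
```

Nothing in this loss says whether the target would rank first among all words; it only measures how strongly the target is favoured over the sampled negatives.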
The end result of this process is the word-vectors that are usefully-arranged for other purposes, where similar words are close to each other, and even certain relative directions and magnitudes also seem to match human judgements of words' relationships.
But the final word-vectors, and model-state, might still be just mediocre at predicting missing words from texts – because it was only ever nudged to be better on individual examples. You could theoretically compare a model's predictions for every possible target word, and thus force-create a sort of ranked-list of predicted-words – but that's more expensive than anything needed for training, and prediction of words like that isn't the usual downstream application of sets of word-vectors. So indeed most word2vec libraries don't even include any interface methods for doing full target-word prediction. (For example, the original word2vec.c from Google doesn't.)
A few versions ago, the Python gensim library added an experimental method for prediction, predict_output_word(). It only works for negative-sampling mode, and it doesn't quite handle window-word-weighting the same way as is done in training. You could give it a try, but don't be surprised if the results aren't impressive. As noted above, making actual predictions of words isn't the usual real goal of word2vec-training. (Other more stateful text-analysis, even just large co-occurrence tables, might do better at that. But they might not force word-vectors into interesting constellations like word2vec.)

Can LDA model be useful for sentences (not documents) clustering / classification?

Recently I have been working on a sentence classification problem. The sentences are one- or two-line product reviews in which customers post their feedback on various features the product has to offer. After pre-processing (removal of stop words and stemming), I use feature extraction libraries (like word2vec, tf-idf) and a clustering algorithm (k-means) over my sentences for unsupervised sentence classification, and the output is fairly acceptable. However, I'm looking for more options for the clustering algorithm. Specifically, I wanted to try LDA to further improve the quality of the output, but I came across this paper listing a few facts about using LDA for sentence classification.
My question is: would it be helpful to use LDA for sentence (not document) classification? Also, apart from k-means, what other unsupervised learning alternatives can work well for sentence classification? Thank you in advance for all your suggestions.
Note: I’m practicing my exercise in Spark 1.6.1 environment with pyspark API.
After trying out LDA myself, this is the output:
1. The topics came out similar: the frequent words for each topic overlap a lot, and the topics share almost the same set of words.
My understanding is that my reviews belong to a specific domain. For example, my product belongs to the credit card domain, and all reviews revolve around this single domain. Further, I plotted the word distribution and found that the most frequently used word accounts for only around 2% of the total population.
The overlapping is not necessarily a function of your input (documents or sentences) but could well be the result of your hyperparameter choices. For example, you could choose lower alpha to have less overlap over topics.
From
https://stats.stackexchange.com/questions/37405/natural-interpretation-for-lda-hyperparameters
In practice, a high alpha-value will lead to documents being more similar in terms of what topics they contain. A high beta-value will similarly lead to topics being more similar in terms of what words they contain.
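The effect of alpha can be seen without running LDA at all, by sampling document-topic mixtures from the symmetric Dirichlet prior that LDA uses. The sketch below is only an illustration; the function names, the number of topics, and the sample counts are arbitrary choices.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one topic mixture from a symmetric Dirichlet(alpha) over k topics
    by normalizing independent gamma draws."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_max_share(alpha, k=5, samples=2000, seed=0):
    """Average share of the single largest topic across sampled documents.
    Values near 1.0 mean each document is dominated by one topic."""
    rng = random.Random(seed)
    shares = (max(sample_dirichlet(alpha, k, rng)) for _ in range(samples))
    return sum(shares) / samples
```

Comparing `mean_max_share(0.1)` with `mean_max_share(10.0)` shows the pattern from the quote: a low alpha concentrates each document on few topics, while a high alpha spreads every document across all topics so that documents look alike.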
"""
Distinct from our proposed “one topic per sentence” assumption, all these methods allow each sentence to include multiple topics, and use various means to incorporate sentence structure. The most straightforward method is to treat each sentence as a document and apply the LDA model on the collection of sentences rather than documents. Despite its simplicity, this method, called local-LDA (Brody and Elhadad 2010), has been demonstrated to be effective in discovering meaningful topics while summarizing consumer reviews. (p.1376)
"""
see: https://pubsonline.informs.org/doi/pdf/10.1287/mnsc.2014.1930
Yes. LDA can also work on sentences (but won't always work).
It tends to work better on longer documents though. But your sentences are longer than tweets, that's good.

What is the effect of adding new word vector embeddings onto an existing embedding space for Neural networks

In Word2Vector, the word embeddings are learned using co-occurrence and updating the vector's dimensions such that words that occur in each other's context come closer together.
My questions are the following:
1) If you already have a pre-trained set of embeddings, let's say a 100 dimensional space with 40k words, can you add 10 additional words onto this embedding space without changing the existing word embeddings. So you would only be updating the dimensions of the new words using the existing word embeddings. I'm thinking of this problem with respect to the "word 2 vector" algorithm, but if people have insights on how GLoVe embeddings work in this case, I am still very interested.
2) Part 2 of the question is; Can you then use the NEW word embeddings in a NN that was trained with the previous embedding set and expect reasonable results. For example, if I had trained a NN for sentiment analysis, and the word "nervous" was previously not in the vocabulary, then would "nervous" be correctly classified as "negative".
This is a question about how sensitive (or robust) NN are with respect to the embeddings. I'd appreciate any thoughts/insight/guidance.
The initial training used info about known words to plot them in a useful N-dimensional space.
It is of course theoretically possible to then use new information, about new words, to also give them coordinates in the same space. You would want lots of varied examples of the new words being used together with the old words.
Whether you want to freeze the positions of old words, or let them also drift into new positions based on the new examples, could be an important choice to make. If you've already trained a pre-existing classifier (like a sentiment classifier) using the older words, and didn't want to re-train that classifier, you'd probably want to lock the old words in place, and force the new words into compatible positioning (even if the newer combined text examples would otherwise change the relative positions of older words).
Since after an effective train-up of the new words, they should generally be near similar-meaning older words, it would be reasonable to expect classifiers that worked on the old words to still do something useful on the new words. But how well that'd work would depend on lots of things, including how well the original word-set covered all the generalizable 'neighborhoods' of meaning. (If the new words bring in shades of meaning of which there were no examples in the old words, that area of the coordinate-space may be impoverished, and the classifier may have never had a good set of distinguishing examples, so performance could lag.)
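One crude way to picture "lock the old words in place and force the new words into compatible positioning" is to place a new word at the average of the frozen vectors of the words it appears with. This is a deliberate simplification and not how word2vec or GloVe training actually works; the function name and the toy vectors are made up.

```python
def place_new_word(new_word, sentences, frozen):
    """Return a vector for new_word as the mean of the frozen vectors of the
    known words that co-occur with it. The frozen vectors are never modified."""
    dims = len(next(iter(frozen.values())))
    total = [0.0] * dims
    count = 0
    for sentence in sentences:
        if new_word not in sentence:
            continue
        for word in sentence:
            if word != new_word and word in frozen:
                for i, value in enumerate(frozen[word]):
                    total[i] += value
                count += 1
    if count == 0:
        raise ValueError("no usage examples with known context words")
    return [value / count for value in total]
```

With usage examples that pair "nervous" with words like "anxious" and "worried", the new vector lands near those older words, which is the condition under which an existing classifier has a chance of treating it sensibly.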

How to preprocess text for embedding?

In the traditional "one-hot" representation of words as vectors you have a vector of the same dimension as the cardinality of your vocabulary. To reduce dimensionality usually stopwords are removed, as well as applying stemming, lemmatizing, etc. to normalize the features you want to perform some NLP task on.
I'm having trouble understanding whether/how to preprocess text to be embedded (e.g. word2vec). My goal is to use these word embeddings as features for a NN to classify texts into topic A, not topic A, and then perform event extraction on them on documents of topic A (using a second NN).
My first instinct is to preprocess by removing stopwords, lemmatizing, stemming, etc. But as I learn a bit more about NNs, I realize that, applied to natural language, the CBOW and skip-gram models would in fact require the whole set of words to be present (to be able to predict a word from context one would need to know the actual context, not a reduced form of the context after normalizing... right?). The actual sequence of POS tags seems to be key for a human-feeling prediction of words.
I've found some guidance online but I'm still curious to know what the community here thinks:
Are there any recent commonly accepted best practices regarding punctuation, stemming, lemmatizing, stopwords, numbers, lowercase etc?
If so, what are they? Is it better in general to process as little as possible, or more on the heavier side to normalize the text? Is there a trade-off?
My thoughts:
It is better to remove punctuation (but, e.g., in Spanish don't remove the accents because they do convey contextual information), change written numbers to numeric, do not lowercase everything (useful for entity extraction), no stemming, no lemmatizing.
Does this sound right?
I've been working on this problem myself for some time. I totally agree with the other answers, that it really depends on your problem and you must match your input to the output that you expect.
I found that for certain tasks like sentiment analysis it's OK to remove lots of nuances by preprocessing, but for others, e.g. text generation, it is quite essential to keep everything.
I'm currently working on generating Latin text and therefore I need to keep quite a lot of structure in the data.
I found a very interesting paper doing some analysis on that topic, but it covers only a small area. However, it might give you some more hints:
On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis
by Jose Camacho-Collados and Mohammad Taher Pilehvar
https://arxiv.org/pdf/1707.01780.pdf
Here is a quote from their conclusion:
"Our evaluation highlights the importance of being consistent in the preprocessing strategy employed across training and evaluation data. In general a simple tokenized corpus works equally or better than more complex preprocessing techniques such as lemmatization or multiword grouping, except for a dataset corresponding to a specialized domain, like health, in which sole tokenization performs poorly. Additionally, word embeddings trained on multiword-grouped corpora perform surprisingly well when applied to simple tokenized datasets."
So many questions. The answer to all of them is probably "it depends". You need to consider the classes you are trying to predict and the kind of documents you have. Predicting authorship (where you definitely need to keep all kinds of punctuation and case so that stylometry will work) is not the same as sentiment analysis (where you can get rid of almost everything but have to pay special attention to things like negations).
I would say apply the same preprocessing to both ends. The surface forms are your link so you can't normalise in different ways. I do agree with the point Joseph Valls makes, but my impression is that most embeddings are trained in a generic rather than a specific manner. What I mean is that the Google News embeddings perform quite well on various different tasks and I don't think they had some fancy preprocessing. Getting enough data tends to be more important. All that being said -- it still depends :-)

Doubts about clustering methods for tweets

I'm fairly new to clustering and related topics so please forgive my questions.
I'm trying to get introduced into this area by doing some tests, and as a first experiment I'd like to create clusters on tweets based on content similarity. The basic idea for the experiment would be storing tweets on a database and periodically calculate the clustering (ie. using a cron job). Please note that the database would obtain new tweets from time to time.
Being ignorant in this field, my idea (probably naive) would be to do something like this:
1. For each new tweet in the db, extract N-grams (N=3 for example) into a set
2. Perform Jaccard similarity and compare with each of the existing clusters. If result > threshold then it would be assigned to that cluster
3. Once finished I'd get M clusters containing similar tweets
Now I see some problems with this basic approach. Let's put aside computational cost, how would the comparison between a tweet and a cluster be done? Assuming I have a tweet Tn and a cluster C1 containing T1, T4, T10 which one should I compare it to? Given that we're talking about similarity, it could well happen that sim(Tn,T1) > threshold but sim(Tn,T4) < threshold. My gut feeling tells me that something like an average should be used for the cluster, in order to avoid this problem.
Also, it could happen that sim(Tn, C1) and sim(Tn, C2) are both > threshold but similarity with C1 would be higher. In that case Tn should go to C1. This could be done brute force as well to assign the tweet to the cluster with maximum similarity.
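The scheme discussed above (N-gram sets, Jaccard similarity, an averaged comparison against each cluster, and assignment to the best-scoring cluster) might look roughly like this. It is a sketch under assumptions: word-level N-grams, whitespace tokenization, and the threshold value are all arbitrary choices, and the function names are hypothetical.

```python
def ngrams(text, n=2):
    """Set of word n-grams; texts shorter than n yield one short tuple."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def assign(tweet, clusters, threshold=0.3, n=2):
    """Add tweet to the cluster with the highest average similarity above the
    threshold, or start a new cluster. Returns the chosen cluster index."""
    grams = ngrams(tweet, n)
    best_index, best_score = None, threshold
    for index, members in enumerate(clusters):
        # Average over members so one close member alone does not decide.
        score = sum(jaccard(grams, ngrams(m, n)) for m in members) / len(members)
        if score > best_score:
            best_index, best_score = index, score
    if best_index is None:
        clusters.append([tweet])
        return len(clusters) - 1
    clusters[best_index].append(tweet)
    return best_index
```

Note that this compares the new tweet against every member of every cluster, so the computational cost raised in the question still applies.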
And last of all, it's the computational issue. I've been reading a bit about minhash and it seems to be the answer to this problem, although I need to do some more research on it.
Anyway, my main question would be: could someone with experience in the area recommend which approach I should aim for? I read some mentions of LSA and other methods, but trying to cope with everything is getting a bit overwhelming, so I'd appreciate some guidance.
From what I'm reading a tool for this would be hierarchical clustering, as it would allow regrouping of clusters whenever new data enters. Is this correct?
Please note that I'm not looking for any complicated case. My use case idea would be being able to cluster similar tweets into groups without any previous information. For example, tweets from Foursquare ("I'm checking in ..." which are similar to each other would be one case, or "My klout score is ..."). Also note that I'd like this to be language independent, so I'm not interested in having to deal with specific language issues.
It looks to me like you are trying to address two different problems in one, i.e. "syntactic" and "semantic" clustering. They are quite different problems, especially if you are in the realm of short-text analysis (and Twitter is the king of short-text analysis, of course).
"Syntactic" clustering means aggregating tweets that come, most likely, from the same source. Your example of Foursquare fits perfectly, but it is also common for retweets, people sharing online newspaper articles or blog posts, and many other cases. For this type of problem, using an N-gram model is almost mandatory, as you said (my experience suggests that N=2 is good for tweets, since you can find significant tweets that have as few as 3-4 features). Normalization is also an important factor here: removing the RT tag, mentions, and hashtags might help.
"Semantic" clustering means aggregating tweets that share the same topic. This is a much more difficult problem, and it likely won't work if you try to aggregate a random sample of tweets, because they usually carry too little information. These techniques might work, though, if you restrict your domain to a specific subset of tweets (e.g. the ones matching a keyword or a hashtag). LSA could be useful here, while it is useless for syntactic clusters.
Based on your observation, I think what you want is syntactic clustering. Your biggest issue, though, is the fact that you need online clustering, and not static clustering. The classical clustering algorithms that would work well in the static case (like hierarchical clustering, or union find) aren't really suited for online clustering, unless you redo the clustering from scratch every time a new tweet gets added to your database. "Averaging" the clusters to add new elements isn't a great solution in my experience, because you need to retain all the information of every cluster member to update the "average" every time new data comes in. Also, algorithms like hierarchical clustering and union find work well because they can join pre-existing clusters if a link of similarity is found between them, and they don't simply assign a new element to the "closest" cluster, which is what you suggested doing in your post.
Algorithms like MinHash (or SimHash) are indeed more suited to online clustering, because they support the idea of "querying" for similar documents. MinHash is essentially a way to obtain pairs of documents that exceed a certain threshold of similarity (in particular, MinHash can be considered an estimator of Jaccard similarity) without having to rely on a quadratic algorithm like pairwise comparison (it is, in fact, O(nlog(n)) in time). It is, though, quadratic in space, therefore a memory-only implementation of MinHash is useful for small collections only (say 10000 tweets). In your case, though, it can be useful to save "sketches" (i.e., the set of hashes you obtain by min-hashing a tweet) of your tweets in a database to form an "index", and query the new ones against that index. You can then form a similarity graph, by adding edges between vertices (tweets) that matched the similarity query. The connected components of your graph will be your clusters.
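A compact sketch of that pipeline follows: sketches are stored in an in-memory index, new tweets are queried against it, and matches are linked so that connected components become clusters. The assumptions are mine: word bigrams as features, `hashlib.blake2b` with a seed prefix as the hash family, a banded bucket index to find candidates, and a dictionary standing in for the database. The class and function names are hypothetical.

```python
import hashlib
from collections import defaultdict


def shingles(text, n=2):
    """Word n-grams of a tweet, used as the features to hash."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}


def minhash(features, num_hashes=32):
    """Sketch = the minimum hash value of the features under each seed."""
    def h(seed, feature):
        data = f"{seed}:{feature}".encode("utf-8")
        return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")
    return [min(h(seed, f) for f in features) for seed in range(num_hashes)]


class MinHashIndex:
    """Bucket index over sketch bands, with union-find to merge matches."""

    def __init__(self, num_hashes=32, bands=16):
        self.num_hashes = num_hashes
        self.rows = num_hashes // bands  # hash values per band
        self.buckets = defaultdict(list)
        self.parent = []

    def _find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def add(self, text):
        """Index a tweet and link it to every tweet sharing a band bucket."""
        doc_id = len(self.parent)
        self.parent.append(doc_id)
        sketch = minhash(shingles(text), self.num_hashes)
        for start in range(0, self.num_hashes, self.rows):
            key = (start, tuple(sketch[start:start + self.rows]))
            for other in self.buckets[key]:
                self.parent[self._find(other)] = self._find(doc_id)
            self.buckets[key].append(doc_id)
        return doc_id

    def clusters(self):
        """Connected components of the similarity graph, as lists of ids."""
        groups = defaultdict(list)
        for doc_id in range(len(self.parent)):
            groups[self._find(doc_id)].append(doc_id)
        return sorted(groups.values())
```

The number of bands and the rows per band control how similar two tweets must be before they are likely to share a bucket, so they play the role of the similarity threshold and would need tuning.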
This sounds a lot like canopy pre-clustering to me.
Essentially, each cluster is represented by the first object that started the cluster.
Objects within the outer radius join the cluster. Objects that are not within the inner radius of at least one cluster start a new cluster. This way, you get an overlapping (non-disjoint!) quantization of your dataset. Since this can drastically reduce the data size, it can be used to speed up various algorithms.
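The rule described above can be sketched in a single pass over the data. This follows the description literally, with a caller-supplied distance function; the function name and the sample numbers are made up, and membership depends on the order in which objects arrive.

```python
def canopy(points, distance, inner, outer):
    """Single-pass canopy pre-clustering.

    Each canopy is (center, members), where the center is the first object
    that started it. inner must not exceed outer.
    """
    canopies = []
    for point in points:
        within_inner = False
        for center, members in canopies:
            d = distance(point, center)
            if d <= outer:
                members.append(point)  # joins every canopy in outer range
            if d <= inner:
                within_inner = True
        if not within_inner:
            canopies.append((point, [point]))  # starts a new canopy
    return canopies
```

For example, with the numbers 0, 1, 3, 10, an inner radius of 2 and an outer radius of 4, the point 3 joins the canopy centered on 0 and also starts its own, which is the overlapping behaviour mentioned above. For tweets, the distance could be one minus the Jaccard similarity of their N-gram sets.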
However, don't expect useful results from clustering tweets. Tweet data is just too noisy. Most tweets have just a few words, too few to define a good similarity. On the other hand, you have the various retweets that are near duplicates, but those are trivial to detect.
So what would be a good cluster of tweets? Can this n-gram similarity actually capture this?