Can LDA model be useful for sentences (not documents) clustering / classification? - pyspark

Recently, I’m working on sentence classification problem, these sentences are nothing but one or two line of reviews about product and customers post there feedback on various features that product has to offer. After pre-processing (removal of stop words and stemming) I’m using feature extraction libraries (like word2vec, tf-idf) and clustering algorithms (k-mean) to run over my sentences to have unsupervised sentence classification - output is fairly acceptable. However I’m looking for more options on clustering algorithm, specifically wanted to try out LDA to further improve quality of output however I have come across this paper listing few facts on LDA for using on sentence classification.
My question is – Would be helpful to use LDA on sentence (not documents) classification? Also apart from K-mean what are other alternative with unsupervised learning that that can work well with sentence classification. Thank you in advance for all your suggestion.
Note: I’m practicing my exercise in Spark 1.6.1 environment with pyspark API.
After Trying out LDA by myself, below is output:
1 Topics came out similar: frequent words for each of the topics overlap a lot and topics share almost the same set of words.
One of my understanding was, my reviews belongs to specific domain. For example my product belong to credit card domain & all reviews revolving around this singl domain. Further, I tried to plot word distribution and found that most frequently use word is just around 2% of total population.

The overlapping is not necessarily a function of your input (documents or sentences) but could well be the result of your hyperparameter choices. For example, you could choose lower alpha to have less overlap over topics.
From
https://stats.stackexchange.com/questions/37405/natural-interpretation-for-lda-hyperparameters
In practice, a high alpha-value will lead to documents being more similar in terms of what topics they contain. A high beta-value will similarly lead to topics being more similar in terms of what words they contain.

"""
Distinct from our proposed “one
topic per sentence” assumption, all these methods
allow each sentence to include multiple topics, and
use various means to incorporate sentence structure.
The most straightforward method is to treat each
sentence as a document and apply the LDA model
on the collection of sentences rather than documents.
Despite its simplicity, this method, called local-LDA
(Brody and Elhadad 2010), has been demonstrated to
be effective in discovering meaningful topics while
summarizing consumer reviews. (p.1376)
"""
see: https://pubsonline.informs.org/doi/pdf/10.1287/mnsc.2014.1930

Yes. LDA can also work on sentences (but won't always work).
It tends to work better on longer documents though. But your sentences are longer than tweets, that's good.

Related

How to preprocess text for embedding?

In the traditional "one-hot" representation of words as vectors you have a vector of the same dimension as the cardinality of your vocabulary. To reduce dimensionality usually stopwords are removed, as well as applying stemming, lemmatizing, etc. to normalize the features you want to perform some NLP task on.
I'm having trouble understanding whether/how to preprocess text to be embedded (e.g. word2vec). My goal is to use these word embeddings as features for a NN to classify texts into topic A, not topic A, and then perform event extraction on them on documents of topic A (using a second NN).
My first instinct is to preprocess removing stopwords, lemmatizing stemming, etc. But as I learn about NN a bit more I realize that applied to natural language, the CBOW and skip-gram models would in fact require the whole set of words to be present --to be able to predict a word from context one would need to know the actual context, not a reduced form of the context after normalizing... right?). The actual sequence of POS tags seems to be key for a human-feeling prediction of words.
I've found some guidance online but I'm still curious to know what the community here thinks:
Are there any recent commonly accepted best practices regarding punctuation, stemming, lemmatizing, stopwords, numbers, lowercase etc?
If so, what are they? Is it better in general to process as little as possible, or more on the heavier side to normalize the text? Is there a trade-off?
My thoughts:
It is better to remove punctuation (but e.g. in Spanish don't remove the accents because the do convey contextual information), change written numbers to numeric, do not lowercase everything (useful for entity extraction), no stemming, no lemmatizing.
Does this sound right?
I've been working on this problem myself for some time. I totally agree with the other answers, that it really depends on your problem and you must match your input to the output that you expect.
I found that for certain tasks like sentiment analysis it's OK to remove lot's of nuances by preprocessing, but e.g. for text generation, it is quite essential to keep everything.
I'm currently working on generating Latin text and therefore I need to keep quite a lot of structure in the data.
I found a very interesting paper doing some analysis on that topic, but it covers only a small area. However, it might give you some more hints:
On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis
by Jose Camacho-Collados and Mohammad Taher Pilehvar
https://arxiv.org/pdf/1707.01780.pdf
Here is a quote from their conclusion:
"Our evaluation highlights the importance of being consistent in the preprocessing strategy employed across training and evaluation data. In general a simple tokenized corpus works equally or better than more complex preprocessing techniques such as lemmatization or multiword grouping, except for a dataset corresponding to a specialized domain, like health, in which sole tokenization performs poorly. Addi- tionally, word embeddings trained on multiword- grouped corpora perform surprisingly well when applied to simple tokenized datasets."
So many questions. The answer to all of them is probably "depends". It needs to be considered the classes you are trying to predict and the kind of documents you have. It's not the same to try to predict authorship (then you definitely need to keep all kinds of punctuation and case so stylometry will work) than sentiment analysis (where you can get rid of almost everything but have to pay special attention to things like negations).
I would say apply the same preprocessing to both ends. The surface forms are your link so you can't normalise in different ways. I do agree with the point Joseph Valls makes, but my impression is that most embeddings are trained in a generic rather than a specific manner. What I mean is that the Google News embeddings perform quite well on various different tasks and I don't think they had some fancy preprocessing. Getting enough data tends to be more important. All that being said -- it still depends :-)

Comparison between fasttext and LDA

Hi Last week Facebook announced Fasttext which is a way to categorize words into bucket. Latent Dirichlet Allocation is also another way to do topic modeling. My question is did anyone do any comparison regarding pro and con within these 2.
I haven't tried Fasttext but here are few pro and con for LDA based on my experience
Pro
Iterative model, having support for Apache spark
Takes in a corpus of document and does topic modeling.
Not only finds out what the document is talking about but also finds out related documents
Apache spark community is continuously contributing to this. Earlier they made it work on mllib now on ml libraries
Con
Stopwords need to be defined well. They have to be related to the context of the document. For ex: "document" is a word which is having high frequency of appearance and may top the chart of recommended topics but it may or maynot be relevant, so we need to update the stopword for that
Sometime classification might be irrelevant. In the below example it is hard to infer what this bucket is talking about
Topic:
Term:discipline
Term:disciplines
Term:notestable
Term:winning
Term:pathways
Term:chapterclosingtable
Term:metaprograms
Term:breakthroughs
Term:distinctions
Term:rescue
If anyone has done research in Fasttext can you please update with your learning?
fastText offers more than topic modelling, it is a tool for generation of word embeddings and text classification using a shallow neural network.
The authors state its performance is comparable with much more complex “deep learning” algorithms, but the training time is significantly lower.
Pros:
=> It is extremely easy to train your own fastText model,
$ ./fasttext skipgram -input data.txt -output model
Just provide your input and output file, the architecture to be used and that's all, but if you wish to customize your model a bit, fastText provides the option to change the hyper-parameters as well.
=> While generating word vectors, fastText takes into account sub-parts of words called character n-grams so that similar words have similar vectors even if they happen to occur in different contexts. For example, “supervised”, “supervise” and “supervisor” all are assigned similar vectors.
=> A previously trained model can be used to compute word vectors for out-of-vocabulary words. This one is my favorite. Even if the vocabulary of your corpus is finite, you can get a vector for almost any word that exists in the world.
=> fastText also provides the option to generate vectors for paragraphs or sentences. Similar documents can be found by comparing the vectors of documents.
=> The option to predict likely labels for a piece of text has been included too.
=> Pre-trained word vectors for about 90 languages trained on Wikipedia are available in the official repo.
Cons:
=> As fastText is command line based, I struggled while incorporating this into my project, this might not be an issue to others though.
=> No in-built method to find similar words or paragraphs.
For those who wish to read more, here are the links to the official research papers:
1) https://arxiv.org/pdf/1607.04606.pdf
2) https://arxiv.org/pdf/1607.01759.pdf
And link to the official repo:
https://github.com/facebookresearch/fastText

Grouping similar words (bad , worse )

I know there are ways to find synonyms either by using NLTK/pywordnet or Pattern package in python but it isn't solving my problem.
If there are words like
bad,worst,poor
bag,baggage
lost,lose,misplace
I am not able to capture them. Can anyone suggest me a possible way?
There have been numerous research in this area in past 20 years. Yes computers don't understand language but we can train them to find similarity or difference in two words with the help of some manual effort.
Approaches may be:
Based on manually curated datasets that contain how words in a language are related to each other.
Based on statistical or probabilistic measures of words appearing in a corpus.
Method 1:
Try Wordnet. It is a human-curated network of words which preserves the relationship between words according to human understanding. In short, it is a graph with nodes as something called 'synsets' and edges as relations between them. So any two words which are very close to each other are close in meaning. Words that fall within the same synset might mean exactly the same. Bag and Baggage are close - which you can find either by iteratively exploring node-to-node in a breadth first style - like starting with 'baggage', exploring its neighbors in an attempt to find 'baggage'. You'll have to limit this search upto a small number of iterations for any practical application. Another style is starting a random walk from a node and trying to reach the other node within a number of tries and distance. It you reach baggage from bag say, 500 times out of 1000 within 10 moves, you can be pretty sure that they are very similar to each other. Random walk is more helpful in much larger and complex graphs.
There are many other similar resources online.
Method 2:
Word2Vec. Hard to explain it here but it works by creating a vector of a user's suggested number of dimensions based on its context in the text. There has been an idea for two decades that words in similar context mean the same. e.g. I'm gonna check out my bags and I'm gonna check out my baggage both might appear in text. You can read the paper for explanation (link in the end).
So you can train a Word2Vec model over a large amount of corpus. In the end, you will be able to get 'vector' for each word. You do not need to understand the significance of this vector. You can this vector representation to find similarity or difference between words, or generate synonyms of any word. The idea is that words which are similar to each other have vectors close to each other.
Word2vec came up two years ago and immediately became the 'thing-to-use' in most of NLP applications. The quality of this approach depends on amount and quality of your data. Generally Wikipedia dump is considered good training data for training as it contains articles about almost everything that makes sense. You can easily find ready-to-use models trained on Wikipedia online.
A tiny example from Radim's website:
>>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]
>>> model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
>>> model.similarity('woman', 'man')
0.73723527
First example tells you the closest word (topn=1) to words woman and king but meanwhile also most away from the word man. The answer is queen.. Second example is odd one out. Third one tells you how similar the two words are, in your corpus.
Easy to use tool for Word2vec :
https://radimrehurek.com/gensim/models/word2vec.html
http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf (Warning : Lots of Maths Ahead)

Doubts about clustering methods for tweets

I'm fairly new to clustering and related topics so please forgive my questions.
I'm trying to get introduced into this area by doing some tests, and as a first experiment I'd like to create clusters on tweets based on content similarity. The basic idea for the experiment would be storing tweets on a database and periodically calculate the clustering (ie. using a cron job). Please note that the database would obtain new tweets from time to time.
Being ignorant in this field, my idea (probably naive) would be to do something like this:
1. For each new tweet in the db, extract N-grams (N=3 for example) into a set
2. Perform Jaccard similarity and compare with each of the existing clusters. If result > threshold then it would be assigned to that cluster
3. Once finished I'd get M clusters containing similar tweets
Now I see some problems with this basic approach. Let's put aside computational cost, how would the comparison between a tweet and a cluster be done? Assuming I have a tweet Tn and a cluster C1 containing T1, T4, T10 which one should I compare it to? Given that we're talking about similarity, it could well happen that sim(Tn,T1) > threshold but sim(Tn,T4) < threshold. My gut feeling tells me that something like an average should be used for the cluster, in order to avoid this problem.
Also, it could happen that sim(Tn, C1) and sim(Tn, C2) are both > threshold but similarity with C1 would be higher. In that case Tn should go to C1. This could be done brute force as well to assign the tweet to the cluster with maximum similarity.
And last of all, it's the computational issue. I've been reading a bit about minhash and it seems to be the answer to this problem, although I need to do some more research on it.
Anyway, my main question would be: could someone with experience in the area recommend me which approach should I aim to? I read some mentions about LSA and other methods, but trying to cope with everything is getting a bit overwhelming, so I'd appreciate some guiding.
From what I'm reading a tool for this would be hierarchical clustering, as it would allow regrouping of clusters whenever new data enters. Is this correct?
Please note that I'm not looking for any complicated case. My use case idea would be being able to cluster similar tweets into groups without any previous information. For example, tweets from Foursquare ("I'm checking in ..." which are similar to each other would be one case, or "My klout score is ..."). Also note that I'd like this to be language independent, so I'm not interested in having to deal with specific language issues.
It looks like to me that you are trying to address two different problems in one, i.e. "syntactic" and "semantic" clustering. They are quite different problems, expecially if you are in the realm of short-text analysis (and Twitter is the king of short-text analysis, of course).
"Syntactic" clustering means aggregating tweets that come, most likely, from the same source. Your example of Foursquare fits perfectly, but it is also common for retweets, people sharing online newspaper articles or blog posts, and many other cases. For this type of problem, using a N-gram model is almost mandatory, as you said (my experience suggests that N=2 is good for tweets, since you can find significant tweets that have as low as 3-4 features). Normalization is also an important factor here, removing RT tag, mentions, hashtags might help.
"Semantic" clustering means aggregating tweets that share the same topic. This is a much more difficult problem, and it won't likely work if you try to aggregate random sample of tweets, due to the fact that they, usually, carry too little information. These techniques might work, though, if you restrict your domain to a specific subset of tweets (i.e. the one matching a keyword, or an hashtag). LSA could be useful here, while it is useless for syntactic clusters.
Based on your observation, I think what you want is syntactic clustering. Your biggest issue, though, is the fact that you need online clustering, and not static clustering. The classical clustering algorithms that would work well in the static case (like hierarchical clustering, or union find) aren't really suited for online clustering , unless you redo the clustering from scratch every time a new tweet gets added to your database. "Averaging" the clusters to add new elements isn't a great solution according to my experience, because you need to retain all the information of every cluster member to update the "average" every time new data gets in. Also, algorithms like hierarchical clustering and union find work well because they can join pre-existant clusters if a link of similarity is found between them, and they don't simply assign a new element to the "closest" cluster, which is what you suggested to do in your post.
Algorithms like MinHash (or SimHash) are indeed more suited to online clustering, because they support the idea of "querying" for similar documents. MinHash is essentially a way to obtain pairs of documents that exceed a certain threshold of similarity (in particular, MinHash can be considered an estimator of Jaccard similarity) without having to rely on a quadratic algorithm like pairwise comparison (it is, in fact, O(nlog(n)) in time). It is, though, quadratic in space, therefore a memory-only implementation of MinHash is useful for small collections only (say 10000 tweets). In your case, though, it can be useful to save "sketches" (i.e., the set of hashes you obtain by min-hashing a tweet) of your tweets in a database to form an "index", and query the new ones against that index. You can then form a similarity graph, by adding edges between vertices (tweets) that matched the similarity query. The connected components of your graph will be your clusters.
This sounds a lot like canopy pre-clustering to me.
Essentially, each cluster is represented by the first object that started the cluster.
Objects within the outer radius join the cluster. Objects that are not within the inner radius of at least one cluster start a new cluster. This way, you get an overlapping (non-disjoint!) quantization of your dataset. Since this can drastically reduce the data size, it can be used to speed up various algorithms.
However don't expect useful results from clustering tweets. Tweet data is just to much noise. Most tweets have just a few words, too little to define a good similarity. On the other hand, you have the various retweets that are near duplicates - but trivial to detect.
So what would be a good cluster of tweets? Can this n-gram similarity actually capture this?

Clustering or classification?

I am stuck between a decision to apply classification or clustering on the data set I got. The more I think about it, the more I get confused. Heres what I am confronted with.
I have got news documents (around 3000 and continuously increasing) containing news about companies, investment, stocks, economy, quartly income etc. My goal is to have the news sorted in such a way that I know which news correspond to which company. e.g for the news item "Apple launches new iphone", I need to associate the company Apple with it. A particular news item/document only contains 'title' and 'description' so I have to analyze the text in order to find out which company the news referes to. It could be multiple companies too.
To solve this, I turned to Mahout.
I started with clustering. I was hoping to get 'Apple', 'Google', 'Intel' etc as top terms in my clusters and from there I would know the news in a cluster corresponds to its cluster label, but things were a bit different. I got 'investment', 'stocks', 'correspondence', 'green energy', 'terminal', 'shares', 'street', 'olympics' and lots of other terms as the top ones (which makes sense as clustering algos' look for common terms). Although there were some 'Apple' clusters but the news items associated with it were very few.I thought may be clustering is not for this kind of problem as many of the company news goes into more general clusters(investment, profit) instead of the specific company cluster(Apple).
I started reading about classification which requires training data, The name was convincing too as I actually want to 'classify' my news items into 'company names'. As I read on, I got an impression that the name classification is a bit deceiving and the technique is used more for prediction purposes as compared to classification. The other confusions that I got was how can I prepare training data for news documents? lets assume I have a list of companies that I am interested in. I write a program to produce training data for the classifier. the program will see if the news title or description contains the company name 'Apple' then its a news story about apple. Is this how I can prepare training data?(off course I read that training data is actually a set of predictors and target variables). If so, then why should I use mahout classification in the first place? I should ditch mahout and instead use this little program that I wrote for training data(which actually does the classification)
You can see how confused I am about how to address this issue. Another thing that concerns me is that if its possible to make a system this intelligent, that if the news says 'iphone sales at a record high' without using the word 'Apple', the system can classify it as a news related to apple?
Thank you in advance for pointing me in the right direction.
Copying my reply from the mailing list:
Classifiers are supervised learning algorithms, so you need to provide
a bunch of examples of positive and negative classes. In your example,
it would be fine to label a bunch of articles as "about Apple" or not,
then use feature vectors derived from TF-IDF as input, with these
labels, to train a classifier that can tell when an article is "about
Apple".
I don't think it will quite work to automatically generate the
training set by labeling according to the simple rule, that it is
about Apple if 'Apple' is in the title. Well, if you do that, then
there is no point in training a classifier. You can make a trivial
classifier that achieves 100% accuracy on your test set by just
checking if 'Apple' is in the title! Yes, you are right, this gains
you nothing.
Clearly you want to learn something subtler from the classifier, so
that an article titled "Apple juice shown to reduce risk of dementia"
isn't classified as about the company. You'd really need to feed it
hand-classified documents.
That's the bad news, but, sure you can certainly train N classifiers
for N topics this way.
Classifiers put items into a class or not. They are not the same as
regression techniques which predict a continuous value for an input.
They're related but distinct.
Clustering has the advantage of being unsupervised. You don't need
labels. However the resulting clusters are not guaranteed to match up
to your notion of article topics. You may see a cluster that has a lot
of Apple articles, some about the iPod, but also some about Samsung
and laptops in general. I don't think this is the best tool for your
problem.
First of all, you don't need Mahout. 3000 documents is close to nothing. Revisit Mahout when you hit a million. I've been processing 100.000 images on a single computer, so you really can skip the overhead of Mahout for now.
What you are trying to do sounds like classification to me. Because you have predefined classes.
A clustering algorithm is unsupervised. It will (unless you overfit the parameters) likely break Apple into "iPad/iPhone" and "Macbook". Or on the other hand, it may merge Apple and Google, as they are closely related (much more than, say, Apple and Ford).
Yes, you need training data, that reflects the structure that you want to measure. There is other structure (e.g. iPhones being not the same as Macbooks, and Google, Facebook and Apple being more similar companies than Kellogs, Ford and Apple). If you want a company level of structure, you need training data at this level of detail.