Incremental clustering algorithm for grouping news articles? - cluster-analysis

I'm doing a little research on how to cluster articles into 'news stories' ala Google News.
Looking at previous questions here on the subject, I often see it recommended to simply pull out a vector of words from an article, weight some of the words more if they're in certain parts of the article (e.g. the headline), and then to use something like a k-means algorithm to cluster the articles.
But this leads to a couple of questions:
With k-means, how do you know in advance how much k should be? In a dynamic news environment you may have a very variable number of stories, and you won't know in advance how many stories a collection of articles represents.
With hierarchal clustering algorithms, how do you decide which clusters to use as your stories? You'll have clusters at the bottom of the tree that are just single articles, which you obviously won't want to use, and a cluster at the root of the tree which has all of the articles, which again you won't want...but how do you know which clusters in between should be used to represent stories?
Finally, with either k-means or hierarchal algorithms, most literature I have read seems to assume you have a preset collection of documents you want to cluster, and it clusters them all at once. But what of a situation where you have new articles coming in every so often. What happens? Do you have to cluster all the articles from scratch, now that there's an additional one? This is why I'm wondering if there are approaches that let you 'add' articles as you go without re-clustering from scratch. I can't imagine that's very efficient.

I worked on a start-up that built exactly this: an incremental clustering engine for news articles. We based our algorithm on this paper: Web Document Clustering Using Document Index Graph (http://ieeexplore.ieee.org/xpl/articleDetails.jsp?reload=true&arnumber=4289851). Worked well for us for 10K articles / day.
It has two main advantages:
1) It's incremental, which addresses the problem you have with having to deal with a stream of incoming articles (rather than clustering all at once)
2) It uses phrase-based modeling, as opposed to just "bag of words", which results in much higher accuracy.
A Google search pops up http://www.similetrix.com, they might have what you're looking for.

I would do a search for adaptive K-means clustering algorithms. There is a good section of research devoted to the problems you describe. Here is one such paper (pdf)

Related

Can LDA model be useful for sentences (not documents) clustering / classification?

Recently, I’m working on sentence classification problem, these sentences are nothing but one or two line of reviews about product and customers post there feedback on various features that product has to offer. After pre-processing (removal of stop words and stemming) I’m using feature extraction libraries (like word2vec, tf-idf) and clustering algorithms (k-mean) to run over my sentences to have unsupervised sentence classification - output is fairly acceptable. However I’m looking for more options on clustering algorithm, specifically wanted to try out LDA to further improve quality of output however I have come across this paper listing few facts on LDA for using on sentence classification.
My question is – Would be helpful to use LDA on sentence (not documents) classification? Also apart from K-mean what are other alternative with unsupervised learning that that can work well with sentence classification. Thank you in advance for all your suggestion.
Note: I’m practicing my exercise in Spark 1.6.1 environment with pyspark API.
After Trying out LDA by myself, below is output:
1 Topics came out similar: frequent words for each of the topics overlap a lot and topics share almost the same set of words.
One of my understanding was, my reviews belongs to specific domain. For example my product belong to credit card domain & all reviews revolving around this singl domain. Further, I tried to plot word distribution and found that most frequently use word is just around 2% of total population.
The overlapping is not necessarily a function of your input (documents or sentences) but could well be the result of your hyperparameter choices. For example, you could choose lower alpha to have less overlap over topics.
From
https://stats.stackexchange.com/questions/37405/natural-interpretation-for-lda-hyperparameters
In practice, a high alpha-value will lead to documents being more similar in terms of what topics they contain. A high beta-value will similarly lead to topics being more similar in terms of what words they contain.
"""
Distinct from our proposed “one
topic per sentence” assumption, all these methods
allow each sentence to include multiple topics, and
use various means to incorporate sentence structure.
The most straightforward method is to treat each
sentence as a document and apply the LDA model
on the collection of sentences rather than documents.
Despite its simplicity, this method, called local-LDA
(Brody and Elhadad 2010), has been demonstrated to
be effective in discovering meaningful topics while
summarizing consumer reviews. (p.1376)
"""
see: https://pubsonline.informs.org/doi/pdf/10.1287/mnsc.2014.1930
Yes. LDA can also work on sentences (but won't always work).
It tends to work better on longer documents though. But your sentences are longer than tweets, that's good.

Clustering techniques for Binary Data

I want to use clustering techniques for binary data analysis. I have collected the data through survey in which i asked the users to select exactly 20 features out of list of 94 product features. The columns in my data represents the 94 product features and the rows represents the participants. I am trying to cluster the similar users in different user groups based on the product features they selected. Each user cluster should also tell me the product features associated with each cluster. I am using some open source clustering tools like NCSS and JMP. I was trying to use fuzzy clustering technique for achieveing my goal but unfortunately these tools do not deal with binary data. Can you please suggest me which technique would really be appropriate for my tasks , also which online tool i can use for using the cluster analysis on my data? As beacuse of the time limitation, I am not looking to code myself and i am only looking for some open source tools that have all the functionality available in them which i can use as it is.
Clustering for binary data is not really well defined.
Rather than looking for some tool/function that may or may not work by trial and error, you should first try to answer a 'simple" question:
What is a good cluster, mathematically?
Vague terms not allowed. The next questions to answer then are: I) when is clustering A better than clustering B (I.e. how does the computer compute quality), and ii) how can this be found efficiently.
You won't get far if you don't understand what you are doing just by calling random functions...
Also, is clustering actually what you are looking for? Most of the time with binary data e.g. frequent itemset mining is the better choice.

Which clustering algorithm is suitable for clustering geographical locations?

I am developing an application which works similar to Tinder. I am guessing Tinder groups closest results by running clustering algorithm.
In my application have to similarly group data based on geographical location. I might have to run clustering based on many inputs, so it has to be efficient. Please suggest suitable algorithm for it.
There is no reason to group or cluster for a Tinder-like use case:
it's too expensive
it's too static
it does not add value (you can't just present a cluster to the user)
What you want to use is similarity search. Find other users that are a) nearby, b) recently online, c) have some interests in common, d) have not been recently shown.
For somebody who is looking for similar solution, There is a good answer about fast similarity search algoritms on quora https://www.quora.com/What-are-some-fast-similarity-search-algorithms-and-data-structures-for-high-dimensional-vectors/answer/Raghavendran-Balu?srid=hYuT
I found R-tree to be most suited for my application. There is a good github project for R-tree https://github.com/davidmoten/rtree

Doubts about clustering methods for tweets

I'm fairly new to clustering and related topics so please forgive my questions.
I'm trying to get introduced into this area by doing some tests, and as a first experiment I'd like to create clusters on tweets based on content similarity. The basic idea for the experiment would be storing tweets on a database and periodically calculate the clustering (ie. using a cron job). Please note that the database would obtain new tweets from time to time.
Being ignorant in this field, my idea (probably naive) would be to do something like this:
1. For each new tweet in the db, extract N-grams (N=3 for example) into a set
2. Perform Jaccard similarity and compare with each of the existing clusters. If result > threshold then it would be assigned to that cluster
3. Once finished I'd get M clusters containing similar tweets
Now I see some problems with this basic approach. Let's put aside computational cost, how would the comparison between a tweet and a cluster be done? Assuming I have a tweet Tn and a cluster C1 containing T1, T4, T10 which one should I compare it to? Given that we're talking about similarity, it could well happen that sim(Tn,T1) > threshold but sim(Tn,T4) < threshold. My gut feeling tells me that something like an average should be used for the cluster, in order to avoid this problem.
Also, it could happen that sim(Tn, C1) and sim(Tn, C2) are both > threshold but similarity with C1 would be higher. In that case Tn should go to C1. This could be done brute force as well to assign the tweet to the cluster with maximum similarity.
And last of all, it's the computational issue. I've been reading a bit about minhash and it seems to be the answer to this problem, although I need to do some more research on it.
Anyway, my main question would be: could someone with experience in the area recommend me which approach should I aim to? I read some mentions about LSA and other methods, but trying to cope with everything is getting a bit overwhelming, so I'd appreciate some guiding.
From what I'm reading a tool for this would be hierarchical clustering, as it would allow regrouping of clusters whenever new data enters. Is this correct?
Please note that I'm not looking for any complicated case. My use case idea would be being able to cluster similar tweets into groups without any previous information. For example, tweets from Foursquare ("I'm checking in ..." which are similar to each other would be one case, or "My klout score is ..."). Also note that I'd like this to be language independent, so I'm not interested in having to deal with specific language issues.
It looks like to me that you are trying to address two different problems in one, i.e. "syntactic" and "semantic" clustering. They are quite different problems, expecially if you are in the realm of short-text analysis (and Twitter is the king of short-text analysis, of course).
"Syntactic" clustering means aggregating tweets that come, most likely, from the same source. Your example of Foursquare fits perfectly, but it is also common for retweets, people sharing online newspaper articles or blog posts, and many other cases. For this type of problem, using a N-gram model is almost mandatory, as you said (my experience suggests that N=2 is good for tweets, since you can find significant tweets that have as low as 3-4 features). Normalization is also an important factor here, removing RT tag, mentions, hashtags might help.
"Semantic" clustering means aggregating tweets that share the same topic. This is a much more difficult problem, and it won't likely work if you try to aggregate random sample of tweets, due to the fact that they, usually, carry too little information. These techniques might work, though, if you restrict your domain to a specific subset of tweets (i.e. the one matching a keyword, or an hashtag). LSA could be useful here, while it is useless for syntactic clusters.
Based on your observation, I think what you want is syntactic clustering. Your biggest issue, though, is the fact that you need online clustering, and not static clustering. The classical clustering algorithms that would work well in the static case (like hierarchical clustering, or union find) aren't really suited for online clustering , unless you redo the clustering from scratch every time a new tweet gets added to your database. "Averaging" the clusters to add new elements isn't a great solution according to my experience, because you need to retain all the information of every cluster member to update the "average" every time new data gets in. Also, algorithms like hierarchical clustering and union find work well because they can join pre-existant clusters if a link of similarity is found between them, and they don't simply assign a new element to the "closest" cluster, which is what you suggested to do in your post.
Algorithms like MinHash (or SimHash) are indeed more suited to online clustering, because they support the idea of "querying" for similar documents. MinHash is essentially a way to obtain pairs of documents that exceed a certain threshold of similarity (in particular, MinHash can be considered an estimator of Jaccard similarity) without having to rely on a quadratic algorithm like pairwise comparison (it is, in fact, O(nlog(n)) in time). It is, though, quadratic in space, therefore a memory-only implementation of MinHash is useful for small collections only (say 10000 tweets). In your case, though, it can be useful to save "sketches" (i.e., the set of hashes you obtain by min-hashing a tweet) of your tweets in a database to form an "index", and query the new ones against that index. You can then form a similarity graph, by adding edges between vertices (tweets) that matched the similarity query. The connected components of your graph will be your clusters.
This sounds a lot like canopy pre-clustering to me.
Essentially, each cluster is represented by the first object that started the cluster.
Objects within the outer radius join the cluster. Objects that are not within the inner radius of at least one cluster start a new cluster. This way, you get an overlapping (non-disjoint!) quantization of your dataset. Since this can drastically reduce the data size, it can be used to speed up various algorithms.
However don't expect useful results from clustering tweets. Tweet data is just to much noise. Most tweets have just a few words, too little to define a good similarity. On the other hand, you have the various retweets that are near duplicates - but trivial to detect.
So what would be a good cluster of tweets? Can this n-gram similarity actually capture this?

Clustering or classification?

I am stuck between a decision to apply classification or clustering on the data set I got. The more I think about it, the more I get confused. Heres what I am confronted with.
I have got news documents (around 3000 and continuously increasing) containing news about companies, investment, stocks, economy, quartly income etc. My goal is to have the news sorted in such a way that I know which news correspond to which company. e.g for the news item "Apple launches new iphone", I need to associate the company Apple with it. A particular news item/document only contains 'title' and 'description' so I have to analyze the text in order to find out which company the news referes to. It could be multiple companies too.
To solve this, I turned to Mahout.
I started with clustering. I was hoping to get 'Apple', 'Google', 'Intel' etc as top terms in my clusters and from there I would know the news in a cluster corresponds to its cluster label, but things were a bit different. I got 'investment', 'stocks', 'correspondence', 'green energy', 'terminal', 'shares', 'street', 'olympics' and lots of other terms as the top ones (which makes sense as clustering algos' look for common terms). Although there were some 'Apple' clusters but the news items associated with it were very few.I thought may be clustering is not for this kind of problem as many of the company news goes into more general clusters(investment, profit) instead of the specific company cluster(Apple).
I started reading about classification which requires training data, The name was convincing too as I actually want to 'classify' my news items into 'company names'. As I read on, I got an impression that the name classification is a bit deceiving and the technique is used more for prediction purposes as compared to classification. The other confusions that I got was how can I prepare training data for news documents? lets assume I have a list of companies that I am interested in. I write a program to produce training data for the classifier. the program will see if the news title or description contains the company name 'Apple' then its a news story about apple. Is this how I can prepare training data?(off course I read that training data is actually a set of predictors and target variables). If so, then why should I use mahout classification in the first place? I should ditch mahout and instead use this little program that I wrote for training data(which actually does the classification)
You can see how confused I am about how to address this issue. Another thing that concerns me is that if its possible to make a system this intelligent, that if the news says 'iphone sales at a record high' without using the word 'Apple', the system can classify it as a news related to apple?
Thank you in advance for pointing me in the right direction.
Copying my reply from the mailing list:
Classifiers are supervised learning algorithms, so you need to provide
a bunch of examples of positive and negative classes. In your example,
it would be fine to label a bunch of articles as "about Apple" or not,
then use feature vectors derived from TF-IDF as input, with these
labels, to train a classifier that can tell when an article is "about
Apple".
I don't think it will quite work to automatically generate the
training set by labeling according to the simple rule, that it is
about Apple if 'Apple' is in the title. Well, if you do that, then
there is no point in training a classifier. You can make a trivial
classifier that achieves 100% accuracy on your test set by just
checking if 'Apple' is in the title! Yes, you are right, this gains
you nothing.
Clearly you want to learn something subtler from the classifier, so
that an article titled "Apple juice shown to reduce risk of dementia"
isn't classified as about the company. You'd really need to feed it
hand-classified documents.
That's the bad news, but, sure you can certainly train N classifiers
for N topics this way.
Classifiers put items into a class or not. They are not the same as
regression techniques which predict a continuous value for an input.
They're related but distinct.
Clustering has the advantage of being unsupervised. You don't need
labels. However the resulting clusters are not guaranteed to match up
to your notion of article topics. You may see a cluster that has a lot
of Apple articles, some about the iPod, but also some about Samsung
and laptops in general. I don't think this is the best tool for your
problem.
First of all, you don't need Mahout. 3000 documents is close to nothing. Revisit Mahout when you hit a million. I've been processing 100.000 images on a single computer, so you really can skip the overhead of Mahout for now.
What you are trying to do sounds like classification to me. Because you have predefined classes.
A clustering algorithm is unsupervised. It will (unless you overfit the parameters) likely break Apple into "iPad/iPhone" and "Macbook". Or on the other hand, it may merge Apple and Google, as they are closely related (much more than, say, Apple and Ford).
Yes, you need training data, that reflects the structure that you want to measure. There is other structure (e.g. iPhones being not the same as Macbooks, and Google, Facebook and Apple being more similar companies than Kellogs, Ford and Apple). If you want a company level of structure, you need training data at this level of detail.