Text clustering using MATLAB - matlab

I've got a text file which looks like this:
leave messages
enterrement de vie de garçon
sacré coeur
paris skyline
singer montmartre girl audience joined man singing playing guitar front tourists
paris skyline
paris skyline
Each row of this text file corresponds to a document, which I want to cluster using either tf-idf with cosine similarity, or agglomerative clustering. I'm using MATLAB. I've removed the stop words, and punctuation marks.
My issue is that there are 300k of these rows (documents). So scaling is one issue. Another issue is that I'm having trouble understanding how to convert each row of text into a vector of values? Can anybody explain please, with an example?
Thanks.
I tried using k-means clustering (nltk library python) and ran out of memory. Also with k-means I don't have a clue how many clusters I'm supposed to get (so I was just guessing wildly).
Another thing: I have ground truth available for this text (like, I have 0,1,2 labels in another file for this data). And I also have test data (another text file). I'm confused as to how to use this information to help cluster the test data.
Please help. Thanks.

Related

What is the effect of adding new word vector embeddings onto an existing embedding space for Neural networks

In Word2Vector, the word embeddings are learned using co-occurrence and updating the vector's dimensions such that words that occur in each other's context come closer together.
My questions are the following:
1) If you already have a pre-trained set of embeddings, let's say a 100 dimensional space with 40k words, can you add 10 additional words onto this embedding space without changing the existing word embeddings. So you would only be updating the dimensions of the new words using the existing word embeddings. I'm thinking of this problem with respect to the "word 2 vector" algorithm, but if people have insights on how GLoVe embeddings work in this case, I am still very interested.
2) Part 2 of the question is; Can you then use the NEW word embeddings in a NN that was trained with the previous embedding set and expect reasonable results. For example, if I had trained a NN for sentiment analysis, and the word "nervous" was previously not in the vocabulary, then would "nervous" be correctly classified as "negative".
This is a question about how sensitive (or robust) NN are with respect to the embeddings. I'd appreciate any thoughts/insight/guidance.
The initial training used info about known words to plot them in a useful N-dimensional space.
It is of course theoretically possible to then use new information, about new words, to also give them coordinates in the same space. You would want lots of varied examples of the new words being used together with the old words.
Whether you want to freeze the positions of old words, or let them also drift into new positions based on the new examples, could be an important choice to make. If you've already trained a pre-existing classifier (like a sentiment classifier) using the older words, and didn't want to re-train that classifier, you'd probably want to lock the old words in place, and force the new words into compatible positioning (even if the newer combined text examples would otherwise change the relative positions of older words).
Since after an effective train-up of the new words, they should generally be near similar-meaning older words, it would be reasonable to expect classifiers that worked on the old words to still do something useful on the new words. But how well that'd work would depend on lots of things, including how well the original word-set covered all the generalizable 'neighborhoods' of meaning. (If the new words bring in shades of meaning of which there were no examples in the old words, that area of the coordinate-space may be impoverished, and the classifier may have never had a good set of distinguishing examples, so performance could lag.)

Tag Clustering in Lastfm database

I have a last.fm dataset composed of songs and their tags given by the users. I want to apply a clusterization on the dataset in order to find clusters of songs based on tags.
The dataset has 200k songs and 119k different tags. I was previously thinking on making a matrix NxM, where N is the number of songs and M is the number of attributes, and each position is 0 or 1 indicating the presence or not presence of a tag in the song. However, the huge dimension of the matrix has stopped me for doing so. I have some ideas on applying a SVD for reducing dimensionality before applying the clustering, but I don't know exactly if it is the best approach.
Therefore, does anybody know some work in the literature which attempts to perform such kind of clustering? Or any other idea in my problem?
Thank you very much in advance
Clustering probably is the wrong tool for your problem.
Are you sure you want to split your data into (usually) non-overlapping chunks? What if there is some overlap needed? Say, there are songs that are both "hip hop" and "driving beats" but these tags are not synonys?
Frequent itemset mining (Market basket analysis)
is much more applicable, isn't it?
Consider every song to be a "market basket", every tag to be an "item" in these transactions. The FIM will identify frequent tag combinations, and derive patterns from that.

Grouping similar words (bad , worse )

I know there are ways to find synonyms either by using NLTK/pywordnet or Pattern package in python but it isn't solving my problem.
If there are words like
bad,worst,poor
bag,baggage
lost,lose,misplace
I am not able to capture them. Can anyone suggest me a possible way?
There have been numerous research in this area in past 20 years. Yes computers don't understand language but we can train them to find similarity or difference in two words with the help of some manual effort.
Approaches may be:
Based on manually curated datasets that contain how words in a language are related to each other.
Based on statistical or probabilistic measures of words appearing in a corpus.
Method 1:
Try Wordnet. It is a human-curated network of words which preserves the relationship between words according to human understanding. In short, it is a graph with nodes as something called 'synsets' and edges as relations between them. So any two words which are very close to each other are close in meaning. Words that fall within the same synset might mean exactly the same. Bag and Baggage are close - which you can find either by iteratively exploring node-to-node in a breadth first style - like starting with 'baggage', exploring its neighbors in an attempt to find 'baggage'. You'll have to limit this search upto a small number of iterations for any practical application. Another style is starting a random walk from a node and trying to reach the other node within a number of tries and distance. It you reach baggage from bag say, 500 times out of 1000 within 10 moves, you can be pretty sure that they are very similar to each other. Random walk is more helpful in much larger and complex graphs.
There are many other similar resources online.
Method 2:
Word2Vec. Hard to explain it here but it works by creating a vector of a user's suggested number of dimensions based on its context in the text. There has been an idea for two decades that words in similar context mean the same. e.g. I'm gonna check out my bags and I'm gonna check out my baggage both might appear in text. You can read the paper for explanation (link in the end).
So you can train a Word2Vec model over a large amount of corpus. In the end, you will be able to get 'vector' for each word. You do not need to understand the significance of this vector. You can this vector representation to find similarity or difference between words, or generate synonyms of any word. The idea is that words which are similar to each other have vectors close to each other.
Word2vec came up two years ago and immediately became the 'thing-to-use' in most of NLP applications. The quality of this approach depends on amount and quality of your data. Generally Wikipedia dump is considered good training data for training as it contains articles about almost everything that makes sense. You can easily find ready-to-use models trained on Wikipedia online.
A tiny example from Radim's website:
>>> model.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
[('queen', 0.50882536)]
>>> model.doesnt_match("breakfast cereal dinner lunch".split())
'cereal'
>>> model.similarity('woman', 'man')
0.73723527
First example tells you the closest word (topn=1) to words woman and king but meanwhile also most away from the word man. The answer is queen.. Second example is odd one out. Third one tells you how similar the two words are, in your corpus.
Easy to use tool for Word2vec :
https://radimrehurek.com/gensim/models/word2vec.html
http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf (Warning : Lots of Maths Ahead)

Doubts about clustering methods for tweets

I'm fairly new to clustering and related topics so please forgive my questions.
I'm trying to get introduced into this area by doing some tests, and as a first experiment I'd like to create clusters on tweets based on content similarity. The basic idea for the experiment would be storing tweets on a database and periodically calculate the clustering (ie. using a cron job). Please note that the database would obtain new tweets from time to time.
Being ignorant in this field, my idea (probably naive) would be to do something like this:
1. For each new tweet in the db, extract N-grams (N=3 for example) into a set
2. Perform Jaccard similarity and compare with each of the existing clusters. If result > threshold then it would be assigned to that cluster
3. Once finished I'd get M clusters containing similar tweets
Now I see some problems with this basic approach. Let's put aside computational cost, how would the comparison between a tweet and a cluster be done? Assuming I have a tweet Tn and a cluster C1 containing T1, T4, T10 which one should I compare it to? Given that we're talking about similarity, it could well happen that sim(Tn,T1) > threshold but sim(Tn,T4) < threshold. My gut feeling tells me that something like an average should be used for the cluster, in order to avoid this problem.
Also, it could happen that sim(Tn, C1) and sim(Tn, C2) are both > threshold but similarity with C1 would be higher. In that case Tn should go to C1. This could be done brute force as well to assign the tweet to the cluster with maximum similarity.
And last of all, it's the computational issue. I've been reading a bit about minhash and it seems to be the answer to this problem, although I need to do some more research on it.
Anyway, my main question would be: could someone with experience in the area recommend me which approach should I aim to? I read some mentions about LSA and other methods, but trying to cope with everything is getting a bit overwhelming, so I'd appreciate some guiding.
From what I'm reading a tool for this would be hierarchical clustering, as it would allow regrouping of clusters whenever new data enters. Is this correct?
Please note that I'm not looking for any complicated case. My use case idea would be being able to cluster similar tweets into groups without any previous information. For example, tweets from Foursquare ("I'm checking in ..." which are similar to each other would be one case, or "My klout score is ..."). Also note that I'd like this to be language independent, so I'm not interested in having to deal with specific language issues.
It looks like to me that you are trying to address two different problems in one, i.e. "syntactic" and "semantic" clustering. They are quite different problems, expecially if you are in the realm of short-text analysis (and Twitter is the king of short-text analysis, of course).
"Syntactic" clustering means aggregating tweets that come, most likely, from the same source. Your example of Foursquare fits perfectly, but it is also common for retweets, people sharing online newspaper articles or blog posts, and many other cases. For this type of problem, using a N-gram model is almost mandatory, as you said (my experience suggests that N=2 is good for tweets, since you can find significant tweets that have as low as 3-4 features). Normalization is also an important factor here, removing RT tag, mentions, hashtags might help.
"Semantic" clustering means aggregating tweets that share the same topic. This is a much more difficult problem, and it won't likely work if you try to aggregate random sample of tweets, due to the fact that they, usually, carry too little information. These techniques might work, though, if you restrict your domain to a specific subset of tweets (i.e. the one matching a keyword, or an hashtag). LSA could be useful here, while it is useless for syntactic clusters.
Based on your observation, I think what you want is syntactic clustering. Your biggest issue, though, is the fact that you need online clustering, and not static clustering. The classical clustering algorithms that would work well in the static case (like hierarchical clustering, or union find) aren't really suited for online clustering , unless you redo the clustering from scratch every time a new tweet gets added to your database. "Averaging" the clusters to add new elements isn't a great solution according to my experience, because you need to retain all the information of every cluster member to update the "average" every time new data gets in. Also, algorithms like hierarchical clustering and union find work well because they can join pre-existant clusters if a link of similarity is found between them, and they don't simply assign a new element to the "closest" cluster, which is what you suggested to do in your post.
Algorithms like MinHash (or SimHash) are indeed more suited to online clustering, because they support the idea of "querying" for similar documents. MinHash is essentially a way to obtain pairs of documents that exceed a certain threshold of similarity (in particular, MinHash can be considered an estimator of Jaccard similarity) without having to rely on a quadratic algorithm like pairwise comparison (it is, in fact, O(nlog(n)) in time). It is, though, quadratic in space, therefore a memory-only implementation of MinHash is useful for small collections only (say 10000 tweets). In your case, though, it can be useful to save "sketches" (i.e., the set of hashes you obtain by min-hashing a tweet) of your tweets in a database to form an "index", and query the new ones against that index. You can then form a similarity graph, by adding edges between vertices (tweets) that matched the similarity query. The connected components of your graph will be your clusters.
This sounds a lot like canopy pre-clustering to me.
Essentially, each cluster is represented by the first object that started the cluster.
Objects within the outer radius join the cluster. Objects that are not within the inner radius of at least one cluster start a new cluster. This way, you get an overlapping (non-disjoint!) quantization of your dataset. Since this can drastically reduce the data size, it can be used to speed up various algorithms.
However don't expect useful results from clustering tweets. Tweet data is just to much noise. Most tweets have just a few words, too little to define a good similarity. On the other hand, you have the various retweets that are near duplicates - but trivial to detect.
So what would be a good cluster of tweets? Can this n-gram similarity actually capture this?

Fuzzy c-means tcp dump clustering in matlab

Hi I have some data thats represented like this:
0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal.
Its from the kdd cup 1999 which was based on the darpa set.
the text file I have has rows and rows of data like this, in matlab there is the generic clustering tool you can use by typing findcluster but it only accepts .dat files.
Im also not very sure if it will accept the format like this. Im also not sure why there is so many trailing zeros in the dump files.
Can anyone help how I can utilise the text document and run it thru a fcm clustering method in matlab? Code help is really needed.
FINDCLUSTER is simply a GUI interface for two clustering algorithms: FCM and SUBCLUST
You first need to read the data from file, look into the TEXTSCAN function for that.
Then you need to deal with non-numeric attributes; either remove them or convert them somehow. As far as I can tell, the two algorithms mentioned only support numeric data.
Visit the original website of the KDD cup dataset to find out the description of each attribute.