So, I've been mulling over these concepts for some time, and my understanding is very basic. Information retrieval seems to be a topic seldom covered in the wild...
My questions stem from the process of clustering documents. Let's say I start off with a collection of documents containing only interesting words. What is the first step here? Parse the words from each document and create a giant 'bag-of-words' type model? Do I then proceed to create vectors of word counts for each document? And how do I compare these documents using something like K-means clustering?
Try Tf-idf for starters.
If you read Python, look at "Clustering text documents using MiniBatchKMeans" in scikit-learn: "an example showing how scikit-learn can be used to cluster documents by topics using a bag-of-words approach". The feature_extraction/text.py module in the source also has some very nice classes.
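For instance, a minimal sketch of that pipeline, assuming `docs` is a list of raw document strings and that two clusters is a reasonable first guess (both are assumptions you would replace with your own data and tuning):

# A rough sketch: bag-of-words -> TF-IDF -> K-means, per the steps above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

docs = [
    "the cat sat on the mat",
    "dogs and cats make good pets",
    "stock markets fell sharply today",
    "investors worry about the bond market",
]

# Each document becomes a sparse vector of TF-IDF weighted word counts.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# K-means then compares documents by the distance between these vectors.
km = MiniBatchKMeans(n_clusters=2, random_state=0)
print(km.fit_predict(X))  # cluster label for each document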
We are storing lots of data in MongoDB, say 30M documents, and these documents do not get modified very often. There is a high volume of read queries (~15k qps), and many of these queries (by the _id field) will return an empty result because of the nature of our use case.
I want to understand whether MongoDB does some sort of optimisation for detecting that a document is not present in the database or index. Is there any plugin to enable this? The other option I see is an application-level Bloom filter, but that would be another piece to maintain. AFAIK, HBase has built-in Bloom filter support for checking whether a document is present or not.
Finding a non-existent document is the worst case of finding a document. Just as in real life, if what you're looking for doesn't exist, you have to check all the places it could be, which takes longer than if it existed somewhere.
All of the find optimizations apply equally to finding documents that end up not existing (indexes, shard keys, etc.).
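That said, if you do go the application-level route, a Bloom filter is straightforward to sketch. The class below is a toy illustration, not production code; its name and parameters are made up for the example, and you would still have to rebuild or extend the filter as new documents arrive:

import hashlib

class BloomFilter:
    # False from might_contain() means definitely absent;
    # True only means "probably present" (false positives are possible).
    def __init__(self, size_bits=10_000_000, num_hashes=7):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

bf = BloomFilter()
for _id in ("a1", "b2", "c3"):      # in practice, stream _ids from MongoDB
    bf.add(_id)

if not bf.might_contain("zzz"):
    print("definitely not in the db -- skip the query entirely")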
I have a set of documents all of which come under the category "crime".
Now, I want to categorize them into a number of (possibly overlapping) clusters of documents, where each cluster is formed under a sub-category such as murder or kidnapping.
I want to accomplish this using some way of identifying the importance of individual words occurring in each document. I have already tried using TF-IDF but it is not giving me satisfactory results.
Another alternative is to assign weights to frequently occurring words and then group the words using a k-prototypes or k-modes approach.
You'll need supervision.
Words such as "suspect" and "gun" are likely significant, but they do not produce the categories you want. An unsupervised approach cannot know what a "kind of" crime is.
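To make that concrete, here is a minimal supervised sketch with scikit-learn, assuming you can hand-label a small training set with the sub-categories you actually care about (the example texts and labels are invented):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train_texts = [
    "victim was shot twice outside the bar",
    "the body was found with stab wounds",
    "child taken from school, ransom demanded",
    "abductors held the victim for three days",
]
train_labels = ["murder", "murder", "kidnapping", "kidnapping"]

# TF-IDF features feed a classifier that knows your target sub-categories.
clf = make_pipeline(TfidfVectorizer(stop_words="english"),
                    LogisticRegression())
clf.fit(train_texts, train_labels)

print(clf.predict(["armed men seized the diplomat at gunpoint"]))

Since your sub-categories may overlap, clf.predict_proba() gives per-category probabilities you can threshold instead of committing to a single label.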
I am a novice MongoDB user. In our application the data size for each table is quite large, so I decided to split it into different collections even though the documents are all of the same kind. The only difference between documents across the collections is the "id" (the documents in one collection all belong to one category). So we decided to insert the data into a number of collections, each holding a certain number of documents. Currently I have 10 collections of the same kind of document data.
My requirements are:
1) to get the data from all the collections in a single query, to display on the application home page;
2) to sort and filter the data before fetching it.
I have gone through some posts on Stack Overflow that suggest using MongoDB 3.2's $lookup aggregation for this requirement, but I suspect that if I use $lookup across 10 collections, there might be performance issues and an overly complex query.
I have divided the same kind of data into a number of collections (each collection holds the documents that come under one category, and I have 10 categories, so I need 10 collections).
Could anybody please tell me whether my approach is correct?
If you have a lot of data, how could you display all of it on a webpage?
My understanding is that you will only display a portion of the dataset by querying the database. Since you didn't mention how many records you have, it's not easy to make a recommendation.
Based on the vague description, sharding is the solution; you should check out the official docs.
However, before you shard, and since you mentioned you are a novice user, you probably want to check your database's indexing and data models, and benchmark your performance first.
Hope this helps.
You should store all 10 types of data in 1 collection, not 10. Don't make things more difficult than they need to be.
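A sketch of what that looks like with pymongo: keep a category field on each document and let one compound index serve both the filter and the sort (the database, collection, and field names below are assumptions for the example):

from pymongo import MongoClient, DESCENDING

client = MongoClient("mongodb://localhost:27017")
docs = client.app.documents

# One compound index covers filtering by category and sorting by date.
docs.create_index([("category", 1), ("created_at", DESCENDING)])

# Home page: filter, sort, and paginate across all categories in one query.
page = (docs.find({"category": {"$in": ["cat1", "cat2", "cat3"]}})
            .sort("created_at", DESCENDING)
            .limit(20))
for doc in page:
    print(doc)

No $lookup needed, and the query stays the same whether you have 10 categories or 100.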
I am trying to store word frequency data using Mongo. Each word needs to be associated with a user so I can calculate how often an individual uses each word. Currently my words collection looks like this:
{'Hello':3, 'user_id':1}
Which obviously only works on a 'One To One' basis and is no good.
I am trying to work out how best to make this a 'One To Many' relationship between the user and the words. Would I store the user relationship in my words collection, like so:
{'word':"Hello", 'users':[{'id':1, 'count':4},{'id':2, 'count':10}]}
Or would I attach the word counts to the user collection instead?
{'id':1, 'username':'SomeUser', 'words':[{'Hello':4}]}
The obvious disadvantage to the second approach is that the same words will be used across different users, so having a single words collection would help to keep the data size down.
Can anyone advise me as to what I should do here? Is there a method I have perhaps overlooked in the documentation?
The obvious disadvantage to the second approach is that the same words will be used across different users, so having a single words collection would help to keep the data size down.
Nope, that's the nature of using a document DB. Data size really doesn't matter in NoSQL solutions; what matters is how easily and how fast you can access your data.
Your first approach is a typical textbook relational model. There is no advantage to using it in Mongo (though you can model things relationally in Mongo). The second approach instead gives you:
Faster reads/writes, since every word is stored inside the user document. You don't need to perform multiple queries for this.
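For example, with pymongo the per-user counters can be updated atomically with $inc (the database and field names here are illustrative):

from pymongo import MongoClient

users = MongoClient("mongodb://localhost:27017").app.users

def record_word(user_id, word):
    # $inc creates the counter on first use and increments it atomically.
    users.update_one({"_id": user_id},
                     {"$inc": {f"words.{word}": 1}},
                     upsert=True)

record_word(1, "hello")
record_word(1, "hello")
record_word(2, "world")

print(users.find_one({"_id": 1}))   # -> {'_id': 1, 'words': {'hello': 2}}

One caveat with using the word itself as a field name: words containing "." or starting with "$" would need escaping first.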
I am inserting my data into MongoDB and have 240 such files. Instead of inserting everything into one big collection, I was thinking of inserting each file as a collection by itself. Is this a good idea if I do a lot of queries on a commonly indexed column?
If so, how can I initiate a query to query all the collections in my database?
Using a search server such as Solr can help you achieve what you want, with the addition of fuzzy matching, synonyms, phonetic matching, misspellings, etc.
Solr is built on top of Lucene. Its docs are here:
http://lucene.apache.org/solr/
The learning curve is a little steep, but you can get pretty good searchability using many of its defaults, leaving you to build a schema and index your data to get started.
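As a taste of how little is needed once a core is indexed, here is a minimal query over Solr's HTTP select API; the core name "docs" and the field name "text" are assumptions for the example:

import requests

resp = requests.get(
    "http://localhost:8983/solr/docs/select",
    # deliberate misspelling; the trailing ~ asks Solr for fuzzy matching
    params={"q": "text:murdre~", "rows": 10, "wt": "json"},
)
for hit in resp.json()["response"]["docs"]:
    print(hit)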
I think the answer you're looking for is really here on your other question: Is there any multicore exploiting NoSQL system?
There is no way to query across all collections in Mongo. It wouldn't make a lot of sense to do so. MongoDB's strength is focused on tactically denormalizing data into collections. Providing operations to query across all collections runs exactly counter to the concept of tactical denormalization.
In theory, you could just run 240 queries. But more practically you'll probably end up "partitioning" your data so that you only need to query some of the collections. At this point you end up back at the link I provided above, which suggests that sharding your data is probably the answer here.
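If you do stay with many collections, the "just run N queries" approach looks like this in pymongo (the part_ naming scheme and the field name are assumptions for the example):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017").mydb

results = []
for name in db.list_collection_names():
    if not name.startswith("part_"):       # skip unrelated collections
        continue
    results.extend(db[name].find({"indexed_col": {"$gte": 100}}))

print(len(results))

It works, but every collection pays its own query overhead, which is why consolidating or sharding is usually the better answer.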