How to determine what a cluster is about?

I have tweets retrieved using the Twitter API and need to group them into 2 categories. To do the grouping, I used doc2vec to represent the tweets in numerical form and then ran the DBSCAN clustering algorithm. However, how do I know which category a cluster belongs to? My output is just tweets assigned to different clusters.
For example, I need to know which tweets indicate that people need help and which tweets indicate that people have help to offer.
How can I tell which cluster contains which type of tweets?
Thank you!

Probably neither cluster is either of these two things.
Clustering is unsupervised. You don't get to control what it finds. It could be tweets that contain the f... word vs. tweets that don't.
If you want something specific such as "needs" and "offers", then you absolutely need to train a supervised algorithm from labeled data.
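As a minimal sketch of that supervised route, assuming you can hand-label a modest set of tweets as "need" or "offer" (the example tweets, labels, and the TF-IDF features below are illustrative, not from the question):

# Sketch: supervised "need" vs. "offer" tweet classifier (scikit-learn).
# Assumes a small hand-labeled set; the texts and labels are made up.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "We urgently need water and blankets in sector 4",   # hypothetical
    "I can offer a spare room for two people tonight",   # hypothetical
    "Need medical supplies near the station",
    "Offering free rides to the shelter all day",
]
labels = ["need", "offer", "need", "offer"]

# TF-IDF plus logistic regression is a common, simple baseline;
# the doc2vec vectors from the question could be used as features instead.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["anyone need food? happy to help"]))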

Related

Extracting important sub-sections and the subset of documents associated with them from a set of documents

I have a set of documents all of which come under the category "crime".
Now, I want to categorize them into a number of (possibly overlapping) clusters of documents, where each cluster corresponds to a sub-category such as murder or kidnapping.
I want to accomplish this using some way of identifying the importance of individual words occurring in each document. I have already tried using TF-IDF but it is not giving me satisfactory results.
Another alternative is to assign weights to frequently occurring words. Then you can group the words using a k-prototypes or k-modes approach.
You'll need supervision.
Words such as "suspect" and "gun" are likely significant, but they do not produce desirable categories. An unsupervised approach cannot know what a "kind of" crime is.
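Since the questioner wants possibly overlapping sub-categories, one hedged way to act on that advice is multi-label supervised classification, one binary classifier per sub-category (the documents and labels below are invented for illustration):

# Sketch: overlapping sub-categories as multi-label classification.
# Labels and documents are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

docs = [
    "The suspect fled after the kidnapping and shooting",  # hypothetical
    "Victim was abducted from the parking lot",
    "Armed robbery at the bank downtown",
]
label_sets = [["kidnapping", "murder"], ["kidnapping"], ["robbery"]]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(label_sets)

# One binary classifier per sub-category lets a document
# fall into several clusters at once.
clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LogisticRegression()))
clf.fit(docs, y)

pred = clf.predict(["Two men abducted the witness"])
print(mlb.inverse_transform(pred))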

IBM Bluemix Retrieve and Rank

How do I re-rank the results generated by the IBM Retrieve and Rank service to get the optimal answer? I am unable to find any tutorial related to re-ranking.
I'm not sure I completely understand your question. Do you mean the initial ranking by Retrieve and Rank of answers retrieved from Solr, or refining the ranking of already-ranked results? These specific links might be of help:
Preparing training data. This covers how to train rankers.
Reranking results. This covers how to refine the results produced by a ranker.

Mongo Architecture Efficiency

I am currently designing a local content-sharing system that depends on MongoDB. I need to make a critical architecture decision that will undoubtedly have a huge impact on query performance, scaling, and long-term maintainability.
Our system has a library of topics, and each topic is available in specific cities/metropolitan areas. When a person creates a piece of content, it needs to be stored as part of the topic in a specific city. There are three approaches I am currently considering to address these requirements (and I am open to other ideas as well).
Option 1 (Single Collection per Topic/City):
Example: A collection name would be TopicID123CityID456 and each entry would obviously be a document within that collection.
Option 2 (Single Topic Collection):
Example: A collection name would be Topic123 and each entry would create a document that contains an indexed cityID.
Option 3 (Single City Collection):
Example: A collection name would be City456 and each entry would create a document that contains an indexed topicID.
When querying the DB, I always want to build a feed in date order based on the member's selected topic(s) and city. Since members can group multiple topics together to build a custom feed, option 3 seems the best; however, I am concerned about the long-term performance of that approach. Option 1 seems the most performant, but it forces multiple queries whenever more than one topic is selected.
Another thing I need to consider is that some topics will be far more active and grow much larger than others, and this will also vary by location.
Since I still consider myself a beginner with MongoDB, I want to make sure the general DB structure is ideal before coding all of the logic for writing and retrieving the data. I also don't know how well Mongo performs with hundreds of thousands, if not millions, of documents in a collection, hence my uncertainty about the approach.
From experience, which is the optimal way of tackling the storage and recall of this data? Any insight would be greatly appreciated.
UPDATE: June 22, 2016
It is important to note that we are starting in a single-DB-server environment. #profesor79 provided a great scaling solution for once we need to move to a multi-server (sharded) environment.
From your 3 proposals I will pick number 4 :-)
Have one collection sharded over multiple servers.
Instead of one collection per topic/city pair (TopicCity), we can have a single collection covering all topics and all cities.
Then the collection topicCities will have all documents sharded.
Sharding on the key {topic: 1, city: 1} will balance the load across the shard servers, and any time you need more power you will be able to add a shard to the cluster.
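A rough PyMongo sketch of that single-collection layout (the collection name topicCities, the field names topic, city, and createdAt, and the connection string are illustrative assumptions; enabling sharding itself happens on the admin side):

# Sketch: single collection keyed on (topic, city), queried as a date-ordered feed.
# Names (sharing, topicCities, topic, city, createdAt) are illustrative assumptions.
from datetime import datetime, timezone
from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
coll = client["sharing"]["topicCities"]

# A compound index mirroring the proposed shard key, plus date for feed order.
coll.create_index([("topic", ASCENDING), ("city", ASCENDING),
                   ("createdAt", DESCENDING)])

coll.insert_one({
    "topic": "Topic123",
    "city": "City456",
    "createdAt": datetime.now(timezone.utc),
    "body": "example content",
})

# Feed: several selected topics in one city, newest first.
feed = coll.find(
    {"topic": {"$in": ["Topic123", "Topic789"]}, "city": "City456"}
).sort("createdAt", DESCENDING).limit(20)
for doc in feed:
    print(doc["topic"], doc["createdAt"])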
Any comments welcome!

Document Clustering Basics

So, I've been mulling over these concepts for some time, and my understanding is very basic. Information retrieval seems to be a topic seldom covered in the wild...
My questions stem from the process of clustering documents. Let's say I start off with a collection of documents containing only interesting words. What is the first step? Parse the words from each document and build a giant bag-of-words model? Do I then create vectors of word counts for each document? And how do I compare these documents using something like K-means clustering?
Try Tf-idf for starters.
If you read Python, look at
"Clustering text documents using MiniBatchKmeans"
in scikit-learn:
"an example showing how the scikit-learn can be used to cluster
documents by topics using a bag-of-words approach".
Then feature_extraction/text.py in the source has very nice classes.
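Putting the question's steps together (a bag-of-words model, tf-idf-weighted count vectors per document, then K-means over those vectors), here is a minimal scikit-learn sketch; the toy corpus and the cluster count are assumptions:

# Sketch: tf-idf vectors + mini-batch K-means over a toy corpus.
# The documents and n_clusters are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors worry about market volatility",
]

# Each document becomes a sparse vector of tf-idf-weighted word counts.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

km = MiniBatchKMeans(n_clusters=2, n_init=3, random_state=0)
labels = km.fit_predict(X)

for doc, label in zip(docs, labels):
    print(label, doc)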

Storing a graph in mongodb

I have an undirected graph where each node contains an array. Data can be added/deleted from the array. What's the best way to store this in MongoDB so I can run this query effectively: given node A, select all the data contained in the adjacent nodes of A?
In a relational DB, you could create one table representing the edges and another table storing the data in each node, like so:
table 1 (edges)
NodeA, NodeB
NodeA, NodeC
table 2 (node data)
NodeA, item1
NodeA, item2
NodeB, item3
You would then join the tables when querying for the data in adjacent nodes. But joins are not possible in MongoDB, so what's the best way to set up this database and efficiently query for data in adjacent nodes (favoring performance slightly over space)?
Specialized Distributed Graph Databases
I know this sounds a little far afield from the OP's question about Mongo, but these days there are more specialized graph databases that excel at this kind of work and may be much easier for you to use, especially on large graphs.
There is a comparison of 7 such offerings here: https://docs.google.com/spreadsheet/ccc?key=0AlHPKx74VyC5dERyMHlLQ2lMY3dFQS1JRExYQUNhdVE#gid=0
Of the three most significant open-source offerings (Titan, OrientDB, and Neo4j), all support the Tinkerpop Blueprints interface. So for a graph that looks like this...
... a query for "all the people that Juno greatly admires who she has known since the year 2011" would look like this:
Iterable<Vertex> results = juno.query().labels("knows").has("since", 2011).has("stars", 5).vertices();
This, of course, is just the tip of the iceberg. Pretty powerful stuff!
If you have to stay with Mongo
Think of Tinkerpop Blueprints as the "JDBC of storing graph structures" in various databases. The Tinkerpop Blueprints API has a specific MongoDB implementation that would work for you I'm sure. Then using Tinkerpop Gremlin, you have all sorts of advanced traversal and search methods at your disposal.
I'm picking up Mongo and looking into this sort of schema as well (undirected graphs, querying for information from neighbors). The way I favor so far looks something like this:
Each node contains an array of neighbor keys, like so:
{
    nodeIndex: 4,
    myData: "data",
    neighbors: [8, 15, 16, 23, 42]
}
To find data from neighbors, use the $in operator:
db.nodes.find({nodeIndex:{$in: [8,15,16,23,42]}});
You can use field selection to limit results to the relevant data.
db.nodes.find({nodeIndex:{$in: [8,15,16,23,42]}}, {myData:1});
See http://www.mongodb.org/display/DOCS/Trees+in+MongoDB for inspiration.
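One step the example leaves implicit is that you first fetch node A's own document to read its neighbor list, and only then run the $in query. A PyMongo sketch of both steps, reusing the assumed collection name nodes and the fields above:

# Sketch: two-step adjacency query over the schema above (PyMongo).
# Collection and field names follow the example and are assumptions.
from pymongo import MongoClient

nodes = MongoClient("mongodb://localhost:27017")["graph"]["nodes"]

# Step 1: fetch node A itself to read its adjacency list.
node_a = nodes.find_one({"nodeIndex": 4})

# Step 2: pull just the payload field from every adjacent node.
if node_a is not None:
    for neighbor in nodes.find(
        {"nodeIndex": {"$in": node_a["neighbors"]}},
        {"myData": 1, "nodeIndex": 1, "_id": 0},
    ):
        print(neighbor["nodeIndex"], neighbor["myData"])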
MongoDB will introduce native graph capabilities in version 3.4, which could be used to store graph structures and do analytics on them. Performance might not be as good as native graph databases like Neo4j, depending on the case, but it is too early to judge.
Check those links for more information:
$graphLookup (aggregation)
MongoDB 3.4 Accelerates Digital Transformation for the Modern Enterprise
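As a rough illustration, this is what a $graphLookup aggregation could look like over the adjacency schema from the earlier answer (the collection and field names are the same assumptions; maxDepth: 0 limits the traversal to direct neighbors):

# Sketch: MongoDB 3.4 $graphLookup over the nodeIndex/neighbors schema.
# Field and collection names are assumptions carried over from the example.
from pymongo import MongoClient

nodes = MongoClient("mongodb://localhost:27017")["graph"]["nodes"]

pipeline = [
    {"$match": {"nodeIndex": 4}},
    {"$graphLookup": {
        "from": "nodes",
        "startWith": "$neighbors",        # begin from A's adjacency list
        "connectFromField": "neighbors",  # follow each node's own list...
        "connectToField": "nodeIndex",    # ...to the nodes it references
        "as": "adjacent",
        "maxDepth": 0,                    # 0 = direct neighbors only
    }},
]

for doc in nodes.aggregate(pipeline):
    for neighbor in doc["adjacent"]:
        print(neighbor["nodeIndex"], neighbor.get("myData"))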
MongoDB can simulate a graph using a flexible tree hierarchy. You may want to consider Neo4j for strict graphing needs.