Efficient way for mongodb.find() to search through 1 million documents?

I have a blog server that will contain millions of articles, and I need to be able to get all articles written by User A.
What would be the best schema design?
1) Keep User and Article documents separate, and to get User A's articles, search all the million records for the user's id:
articles.find({Writer_id: User_A.id})
2) Put article id references inside the User schema, e.g.:
userSchema = {
  name: "name",
  age: "age",
  articles: [ {type: mongoose.Article_id}, {type: mongoose.Article_id} ]
}
Then search for User A and do a join to get the articles back.

It's far better to keep the Writer_id approach and create an index on that property. If you store an array of references, then you'll need to perform an $in operation on your find() calls. This will result in your query "jumping" from one matching Article_id to another. If instead you have a Writer_id and an index built for that property, all of the user's articles will exist in the same sequential "block" in the index, requiring no jumping around whatsoever. The result is a far more read-efficient find() operation.
Additionally, the articles array approach would require frequent updates to the user document, whereas the Writer_id approach only requires inserts. Inserts are incredibly efficient, whereas frequent updates are relatively inefficient. Finally, an array of Article_ids can potentially (if unlikely) result in hitting the 16 MB document size limit. The Writer_id approach runs into no such limitation.
The difference should be relatively negligible for a smaller project, but if you're looking for scalability, then you're better off with the Writer_id approach.
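To make the "sequential block" idea concrete, here is a minimal in-memory analogy of a single-field index on Writer_id. It is not MongoDB code and does not reflect how the server stores its indexes internally; the names buildIndex, lowerBound and articleIdsByWriter are made up for this sketch. In the shell, the real index would be created with db.articles.createIndex({Writer_id: 1}).

```javascript
// In-memory analogy of a single-field index on Writer_id (not MongoDB code).
// Index entries are [writerId, articleId] pairs kept sorted by writerId.
function buildIndex(articles) {
  return articles
    .map((a) => [a.writerId, a.id])
    .sort((x, y) => x[0] - y[0] || x[1] - y[1]);
}

// Binary search for the first entry whose writerId is >= the target.
function lowerBound(index, writerId) {
  let lo = 0;
  let hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (index[mid][0] < writerId) lo = mid + 1;
    else hi = mid;
  }
  return lo;
}

// All of one writer's entries sit next to each other, so one seek plus a
// short forward scan collects them.
function articleIdsByWriter(index, writerId) {
  const ids = [];
  for (let i = lowerBound(index, writerId); i < index.length && index[i][0] === writerId; i++) {
    ids.push(index[i][1]);
  }
  return ids;
}

// Made-up sample data.
const articles = [
  { id: 1, writerId: 7 },
  { id: 2, writerId: 3 },
  { id: 3, writerId: 7 },
  { id: 4, writerId: 5 },
  { id: 5, writerId: 7 },
];
const writerIndex = buildIndex(articles);
console.log(articleIdsByWriter(writerIndex, 7)); // [ 1, 3, 5 ]
```

The lookup cost here is one binary search plus one step per matching article, regardless of how many other writers' articles are in the index.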

Related

Mongodb Index or not to index

Quick question on whether to index or not. There are frequent queries to a collection that look for a specific 'user_id' within an array of a doc. See below:
{
  _id: "bQddff44SF9SC99xRu",
  participants: [
    {
      type: "client",
      user_id: "mi7x5Yphuiiyevf5",
      screen_name: "Bob",
      active: false
    },
    {
      type: "agent",
      user_id: "rgcy6hXT6hJSr8czX",
      screen_name: "Harry",
      active: false
    }
  ]
}
Would it be a good idea to add an index to 'participants.user_id'? The array is added to frequently and occasionally items are removed.
Update
I've added the index after testing locally with the same set of data and this certainly seems to have decreased the high CPU usage on the mongo process. As there are only a small number of updates to these documents I think it was the right move. I'm looking at more possible indexes and optimisation now.
Why do you want to index? Do you have significant latency problems when querying? Or are you trying to optimise in advance?
Ultimately there are lots of variables here which make it hard to answer. Including but not limited to:
how often is the query made
how many documents in the collection
how many users are in each document
how often you add/remove users from the document after the document is inserted.
do you need to optimise inserts/updates to the collection
It may be that indexing isn't the answer, but rather how you have structured your data.

many-to-many relationships for social app: Mongodb or graph databases like Neo4j

I have tried to understand embedding in MongoDB but could not find good enough documentation. Linking is not advised, as writes are not atomic across documents and it also takes two lookups. Does someone know how to solve this, or would you suggest that I move to a graph DB like Neo4j?
I am trying to build an application which would need many-to-many relationships. To explain, I will take the example of a library. It can suggest books to a user based on the books his friends are reading and the books his neighbors (like-minded users) are reading.
There are Users and Books. Users borrow books and have friends who are other users.
Given a user, I need all books he is reading and the number of mutual friends for each book.
Given a book, I need all the people who are reading it. Maybe, given a user A, this would return the intersection of the people reading the book and the friends of user A. This is mutual friendship.
Users = [
  { name: 'xyz', id: '000000', friend_ids: ['949583', '958694'] },
  { name: 'abc', id: '000001', friend_ids: ['949582', '111111'] }
]
Books = [
  { book: 'da vinci code', author: 'dan brown', readers: ['949583', '000000'] },
  { book: 'iCon', author: 'Young', readers: ['000000', '000001'] }
]
As seen above, I generally need two documents if I use MongoDB, as I might need a two-way lookup. Duplicating (embedding) one document into another could lead to a lot of duplication (these schemas could store much more information than shown).
Am I modeling my data correctly? Can this be effectively done in mongodb or should I look at graph dbs.
A disclaimer: I work for Neo4j
From your outline, requirements and type of data, it seems that your app is in a sweet spot for graph databases.
I'd suggest you just do a quick spike with a graph database and see how it goes.
There will be no duplication
you have transactions for atomic operations
following links is the natural operation
local queries (e.g. from a user or a book) are cheap and fast
you can use graph algorithms like shortest path to find interesting information about your data
recommendations and similar operations are natural to graph databases
Some Questions:
Why did you choose MongoDB in the first place?
What implementation language do you use?
Your basic schema proposal above would work fine for MongoDB, with a few suggestions:
Use integers for identifiers, rather than strings. Integers will often be stored more compactly by MongoDB (a fixed 4 or 8 bytes depending on the numeric type, whereas a string's stored size depends on its length). You can use findAndModify to emulate unique sequence generators (like auto_increment in some relational databases) -- see Mongoengine's SequenceField for an example of how this is done. You could also use ObjectIds, which are always 12 bytes, but are virtually guaranteed to be unique without having to store any coordination information in the database.
You should use the _id field instead of id, as this field is always present in MongoDB and has a default unique index created on it. This means your _ids are always unique, and lookups by _id are very fast.
You are right that using this sort of schema will require multiple find()s, and will incur network round-trip overhead each time. However, for each of the queries you have suggested above, you need no more than 2 lookups, combined with some straightforward application code:
1. "Given a user, I need all books he is reading and number of mutual friends for the book"
a. Look up the user in question, then
b. query the books collection using db.books.find({_id: {$in: [list, of, books, for, the, user]}}), then
c. for each book, compute the set intersection of that book's readers and the user's friends
2. "Given a book, I need all the people who are reading it."
a. Look up the book in question, then
b. look up all the users who are reading that book, again using $in like db.users.find({_id: {$in: [list, of, users, reading, book]}})
3. "May be given a user A, this would return the intersection of people reading book and friends of user A."
a. Look up the user in question, then
b. look up the book in question, then
c. compute the set intersection of the user's friends and the book's readers
I should note that $in can be slow if you have very long lists, as it is effectively equivalent to doing N number of lookups for a list of N items. The server does this for you, however, so it only requires one network round-trip rather than N.
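The application-side step in the queries above (finding which readers of a book are also the user's friends) is a set intersection. A minimal sketch, assuming the documents have already been fetched and use the friend_ids and readers fields from the question; the helper names and the sample ids are made up for illustration:

```javascript
// Application-side step for the "mutual friends per book" query.
// Field names (friend_ids, readers) follow the question's schema;
// the sample ids are made up for illustration.
function intersect(a, b) {
  const inB = new Set(b);
  return a.filter((x) => inB.has(x));
}

// For each book, list the readers who are also friends of the user.
function mutualReadersPerBook(user, books) {
  return books.map((book) => {
    const mutual = intersect(book.readers, user.friend_ids);
    return { book: book.book, mutual, count: mutual.length };
  });
}

const user = { _id: 1, name: "xyz", friend_ids: [2, 3, 4] };
const booksForUser = [
  { book: "da vinci code", readers: [1, 2, 3] },
  { book: "iCon", readers: [1, 5] },
];
console.log(mutualReadersPerBook(user, booksForUser));
```

Building a Set from one side keeps the intersection linear in the combined list sizes, which matters once friend or reader lists grow long.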
As an alternative to using $in for some of these queries, you can create an index on the array fields, and query the collection for documents with a specific value in the array. For instance, for query #1 above, you could do:
// create an index on the array field "readers"
// (createIndex replaces the older, deprecated ensureIndex helper)
db.books.createIndex({readers: 1})
// now find all books for user whose id is 1234
db.books.find({readers: 1234})
This is called a multikey index and can perform better than $in in some cases. Your exact experience will vary depending on the number of documents and the size of the lists.
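As a rough mental model of why an index over an array field helps, the sketch below builds an in-memory lookup with one entry per array element, each pointing back to its document. This is only an analogy, not MongoDB code; buildMultikeyIndex and the sample documents are assumptions made for illustration.

```javascript
// In-memory analogy of an index over the array field "readers" (not MongoDB
// code): every distinct array element gets its own entry pointing back to
// the document that contains it.
function buildMultikeyIndex(docs, field) {
  const index = new Map();
  for (const doc of docs) {
    for (const value of new Set(doc[field])) {
      if (!index.has(value)) index.set(value, []);
      index.get(value).push(doc._id);
    }
  }
  return index;
}

// Made-up sample data.
const books = [
  { _id: 10, book: "da vinci code", readers: [1234, 5678] },
  { _id: 11, book: "iCon", readers: [1234] },
  { _id: 12, book: "other", readers: [9999] },
];
const readersIndex = buildMultikeyIndex(books, "readers");

// Similar in spirit to db.books.find({readers: 1234})
console.log(readersIndex.get(1234) || []); // [ 10, 11 ]
```

The trade-off is visible in the build loop: a document with N array elements contributes N index entries, so long or frequently changing arrays make writes more expensive.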

MongoDB - One Collection Using Indexes

OK, so the more I develop in MongoDB, the more I start to wonder about the need for multiple collections vs. having one large collection with indexes (since columns and fields can be different for each document, unlike tabular data). If I am trying to develop in the most efficient way possible (meaning less code and reusable code), can I use one collection for all documents and just index on a field? By having all documents in one collection with indexes, I can reuse all my form processing code and other code, since it will all be inserting into the same collection.
For Example:
Let's say I am developing a contact manager and I have two types of contacts, "individuals" and "businesses". My original thought was to create a collection called individuals and a second collection called businesses. But that was because I'm used to developing in SQL, where this would be appropriate since columns would be different for each table. The more I started to think about the flexibility of document DBs, the more I started to think, "do I really need two collections for this?" If I just add a field to each document called "contact type" and index on that, do I really need two collections? Since the fields/columns in each document do not have to be the same for all (like in SQL), each document can have its own fields as long as I have a "document type" field and an index on that field.
So then I took that concept further and started to think: if I only need one collection for "individuals" and "businesses", do I even need a separate collection for "Users" or "Contact History" or any other data? In theory, couldn't I build the entire solution in one collection and just have a field in each document that specifies the "type" and index on it, such as "Users", "Individual Contact", "Business Contacts", "Contact History", etc.? And if a document is related to another document, I can index on the "parent key/foreign" Id field...
This would allow me to code the front end dynamically, since the form processing code would all be the same (inserting into the same collection). This would save a lot of coding, but I want to make sure that, by using indexes and secondary indexes, the DB would still run fast and not cause future problems as the collection grew. As you can imagine, if everything was in one collection there might be hundreds of thousands or even millions of documents in this collection as the user base grows, but it would have indexes and secondary indexes to optimize performance.
My question is: Is this a common method MongoDB developers use? Why or why not? What are the downsides, if any? If this is a commonly used method, please also give any positives to using this method. Thank you.
This is a really big point in Mongo and the answer is a little bit more of an art than science. Having one collection full of gigantic documents is definitely an anti-pattern because it works against many of Mongo's features.
For instance, when retrieving documents, you can only retrieve a whole document out of a collection (not entirely true, but mostly). So if you have huge documents, you're retrieving huge documents each time. Also, having huge documents makes sharding less flexible since only the top level documents are indexed (and hence, sharded) in each collection. You can index values deep into a document, but the index value is associated with the top level document.
At the same time, going purely relational is also an anti-pattern because you've lost a lot of the referential integrity by going to Mongo in the first place. Also, all joins are done in application memory, so each one requires a full round-trip (slow).
So the answer is to do something in between. I'm thinking you'll probably want a collection for individuals and a different collection for businesses in this case. I say this because it seems like businesses have enough metadata associated that they could bulk up a lot. (Also, the individual-business relationship seems like a many-to-many.) However, an individual might have a Name object (with first and last properties). It would be a bad idea to make Name into a separate collection.
Some info from 10gen about schema design: http://www.mongodb.org/display/DOCS/Schema+Design
EDIT
Also, Mongo has limited support for transactions, in the form of atomic aggregates. When you insert an object into Mongo, the entire object is either inserted or not inserted. So if your application domain requires consistency between certain objects, you probably want to keep them in the same document/collection.
For example, consider an application that requires that a User always has a Name object (containing FirstName, LastName, and MiddleInitial). If a User was somehow inserted with no corresponding Name, the data would be considered to be corrupted. In an RDBMS you would wrap a transaction around the operations to insert User and Name. In Mongo, we make sure Name is in the same document (aggregate) as User to achieve the same effect.
Your example is a little less clear, since I don't understand the business cases. One thing that does come to mind is that Mongo has excellent support for inheritance. It might make sense to put all users, individuals, and potentially businesses into the same collection (depending on how the application is modeled). If one individual has many contacts, you probably want individuals to have an array of IDs. If your application requires that you get a quick preview of contacts, you might consider duplicating part of an individual and storing an array of contact objects.
If you're used to RDBMS thinking, you probably think all your data always has to be consistent. The truth is, that's probably not entirely true. This concept of applying atomic aggregates to the domain has been preached heavily by the DDD community recently. When you look at your domain in depth, like your business users do, the consistency boundaries should become distinct.
MongoDB, and NoSQL in general, is about de-normalising data and about reducing joins. It goes against normal SQL thinking.
In your case, I don't see any reason why you would want to have separate collections, because it introduces unnecessary complexity and performance overhead. Consider, for example, if you wanted to have a screen that displayed all contacts in alphabetical order. If you have one single collection for contacts, then it's really easy, but if you have two collections it becomes a more complicated proposition.
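A small sketch of the single-collection idea, using a plain array to stand in for the collection. The field names contactType and displayName are hypothetical, chosen only for this example; with a real collection the same two operations would be a sorted find() with or without a filter on the type field.

```javascript
// Contacts of both kinds in one array, standing in for one collection.
// "contactType" and "displayName" are hypothetical field names; the other
// fields differ per type, as documents in one collection are allowed to.
const contacts = [
  { contactType: "business", displayName: "Zeta Corp", taxId: "12-345" },
  { contactType: "individual", displayName: "Alice", birthday: "1990-01-01" },
  { contactType: "business", displayName: "Acme", taxId: "98-765" },
  { contactType: "individual", displayName: "Bob" },
];

// One alphabetical listing across every contact type.
function allContactsSorted(docs) {
  return [...docs].sort((a, b) => a.displayName.localeCompare(b.displayName));
}

// The same listing restricted to one type via the discriminator field.
function contactsOfType(docs, type) {
  return allContactsSorted(docs.filter((d) => d.contactType === type));
}

console.log(allContactsSorted(contacts).map((c) => c.displayName));
console.log(contactsOfType(contacts, "business").map((c) => c.displayName));
```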
Where I would have multiple collections is if your application had multiple users storing contacts. I would then have one collection for each user. This makes it easy to extract that user's contacts.

Querying directly on results from MongoDB mapreduce versus updating original collection

I have a mapreduce job that runs on a collection of posts and calculates a popularity for each post. The mapreduce outputs a collection with the post_id and popularity for each post. The application needs to be able to get posts sorted by popularity. There are millions of posts, and these popularities are updated every 10 minutes. Two methods I can think of:
Method 1
Keep an index on the posts table popularity field
Run mapreduce on the posts table (this will replace any previous mapreduce results)
Loop through each row in the mapreduce results collection and individually update the popularity of its corresponding post in the posts table
Query directly on the posts table to get posts sorted by popularity
Method 2
Run mapreduce on the posts table (this will replace the previous mapreduce results)
Add an index to the popularity field in the resulting mapreduce collection
When the application needs posts, first query the mapreduce results collection to get the sorted post_ids, then query the posts collection to get the actual post data
Questions
1) Method 1 would need to maintain an index on the popularity in the posts table. It'll also need to update millions (the post table has millions of rows) of popularities individually every 10 or so minutes. It'll only update those posts that have changed popularity, but it's still a lot of updates on a collection with a couple of indexes. There will be a significant number of reads on this collection as well. Is this scalable?
2) For method 2, is it possible to mapreduce the posts collection to create a new popularities collection, immediately create an index on it, and query it?
3) Are there any concurrency issues for question #2, assuming the application will be querying that popularities collection as it's being updated by the map reduce and re-indexed?
4) If the mapreduce replaces the popularities collection, do I need to manually create a new index every time, or will mongo know to keep an index on the popularity field? Basically, how do indexes work with mapreduce result collections?
5) Is there some tweak or other method I could use for this?
Thanks for any help!
The generic advice concerning Map Reduce is to have your application perform a little extra computation on each insert, and avoid doing a processor-intensive map reduce job whenever possible.
Is it possible to add a "popularity" field to each "post" document and have your application increment it each time each post is viewed, clicked on, voted for, or however you measure popularity? You could then index the popularity field, and searches for posts by popularity would be lightning-fast.
If simply incrementing a "popularity" field is not an option, and a MapReduce operation must be performed, try to prevent it from paging through all of the documents in the collection. You will find that this becomes prohibitively slow as your collection grows. It sounds as though your collection is already pretty large.
It is possible to perform an incremental map reduce, where the results of the latest map reduce are integrated with the results of the previous one, instead of merely being overwritten. You can also provide a query to the mapReduce function, so not all documents will be read. Perhaps add a query that matches only posts that have been viewed, voted for, or added since the last map reduce.
The documentation on incremental mapReduce operations is here:
http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-IncrementalMapreduce
Integrating the new results with the old ones is explained in the "Output options" section.
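The idea of folding new results into old ones can be sketched in memory as follows. This is not the MongoDB mapReduce API, only an illustration of the merge step that the "reduce" output mode performs; the mapReduce helper, mapVote, and the vote-event documents are assumptions made for this example.

```javascript
// In-memory sketch of incremental map-reduce (not the MongoDB API).
// map returns a list of [key, value] pairs; reduce folds the values of one key.
function mapReduce(docs, map, reduce, previous = new Map()) {
  const grouped = new Map();
  for (const doc of docs) {
    for (const [key, value] of map(doc)) {
      if (!grouped.has(key)) grouped.set(key, []);
      grouped.get(key).push(value);
    }
  }
  // Start from the earlier output and reduce the new values into it,
  // instead of replacing the earlier output wholesale.
  const out = new Map(previous);
  for (const [key, values] of grouped) {
    const fresh = reduce(key, values);
    out.set(key, out.has(key) ? reduce(key, [out.get(key), fresh]) : fresh);
  }
  return out;
}

// Hypothetical popularity measure: one point per vote event.
const mapVote = (vote) => [[vote.post_id, 1]];
const sum = (_key, values) => values.reduce((a, b) => a + b, 0);

const firstRun = mapReduce(
  [{ post_id: "a" }, { post_id: "a" }, { post_id: "b" }],
  mapVote,
  sum
);
// Only events recorded since the first run are processed here.
const secondRun = mapReduce([{ post_id: "a" }, { post_id: "c" }], mapVote, sum, firstRun);
console.log([...secondRun.entries()]);
```

This only works when the reduce function can be re-applied to its own output, as a sum can; a measure like an average would need to carry a count and a total through the reduce step instead.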
I realize that my advice has been pretty general so far, so I will attempt to address your questions now:
1) As discussed above, if your MapReduce operation has to read every single document, this will not scale well.
2) The MapReduce operation only outputs a collection. Creating an index and querying that collection will have to be done programmatically.
3) If there is one process that is querying a collection at the same time that another is updating it, then it is possible for the query to return a document before it has been updated. The short answer is "yes".
4) If the collection is dropped then indexes will have to be rebuilt. If the documents in the collection are deleted, but the collection itself is not dropped, then the index(es) will persist. In the case of a MapReduce run with the {out:{replace:"output"}} option, the index(es) will persist, and won't have to be recreated.
5) As stated above, if possible it would be preferable to add another field to your "posts" collection, and update that, instead of performing so many MapReduce operations.
Hopefully I have been able to provide you with some additional factors to consider when building your application. Ultimately, it is important to remember that each application is unique, and so for the ultimate proof of which way is "best", you will have to experiment with all of the different options and decide for yourself which way is most efficient. Good Luck!

mongo db design of following and feeds, where should I embed?

I have a basic question about where I should embed a collection of followers/following in a mongo db. It makes sense to have an embedded collection of following in a user object, but does it make sense to embed the converse followers collection as well? That would mean I would have to update and embed in the profile record of both:
1. The following embedded list in the follower
2. The followers embedded list of the followee
I can't ensure atomicity on that unless I also somehow keep a transaction or update status somewhere. Is it worth embedding in both entities, or should I just update #1, embed following in the follower's profile, and put an index on it so that I can query for the converse (followers) across all profiles? Is the performance hit on that too much?
Is this a candidate for a collection that should not be embedded? Should I just have a collection of edges where I store following in its own collection with followerid and followedbyId ?
Now if I also have to update a feed to both users when they are followed or following, how should I organize that?
As for the use case, the user will see the people they are following when viewing their feeds, which happens quite often, and also see the followers of a profile when they view the profile detail of anyone, which also happens often but not quite as much as the 1st case. In both cases, the total numbers of following and followers shows up on every profile page.
In general, it's a bad idea to embed following/followed-by relationships into user documents, for several reasons:
(1) there is a maximum document size limit of 16MB, and it's plausible that a popular user of a well-subscribed site might end up with hundreds of thousands of followers, which will approach the maximum document size,
(2) followership relationships change frequently, and so the case where a user gains a lot of followers translates into repeated document growth if you're embedding followers. Frequent document growth will significantly hinder MongoDB performance, and so should be avoided (occasional document growth, especially if documents tend to reach a stable final size, is less of a performance penalty).
So, yes, it is best to split out following/followed-by relationship into a separate collection of records each having two fields, e.g., { _id : , oid : }, with indexes on _id (for the "who am I following?" query) and oid (for the "who's following me?" query). Any individual state change is modeled by a single document addition or removal, though if you're also displaying things like follower counts, you should probably keep separate counters that you update after any edge insertion/deletion.
(Of course, this supposes your business requirements allow you some flexibility on the consistency details: in general, if your display code tells a user he's got 304 followers and then proceeds to enumerate them, only the most fussy user will check that the followers enumerated tally up to 304. If business requirements necessitate absolute consistency, you'll either need a database that isolates transactions for you, or else you'll have to do the counting yourself as part of displaying all user identities.)
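The edge-collection design with separate counters can be sketched in memory like this. It is an illustration only: the class name FollowEdges and the follower/followee naming for the two ends of an edge are assumptions, and the two Maps merely stand in for the two indexes described above.

```javascript
// In-memory sketch of a follow-edge store with separately kept counters.
// "follower" and "followee" are hypothetical names for the two ends of an edge.
class FollowEdges {
  constructor() {
    this.byFollower = new Map(); // follower -> Set of followees ("who am I following?")
    this.byFollowee = new Map(); // followee -> Set of followers ("who's following me?")
    this.counts = new Map();     // user -> { following, followers }
  }

  countsFor(user) {
    if (!this.counts.has(user)) this.counts.set(user, { following: 0, followers: 0 });
    return this.counts.get(user);
  }

  follow(follower, followee) {
    const followees = this.byFollower.get(follower) || new Set();
    if (followees.has(followee)) return false; // edge already exists
    followees.add(followee);
    this.byFollower.set(follower, followees);
    const followers = this.byFollowee.get(followee) || new Set();
    followers.add(follower);
    this.byFollowee.set(followee, followers);
    // Counters are updated after the edge write, as a separate step.
    this.countsFor(follower).following += 1;
    this.countsFor(followee).followers += 1;
    return true;
  }

  unfollow(follower, followee) {
    const followees = this.byFollower.get(follower);
    if (!followees || !followees.delete(followee)) return false; // no such edge
    this.byFollowee.get(followee).delete(follower);
    this.countsFor(follower).following -= 1;
    this.countsFor(followee).followers -= 1;
    return true;
  }

  following(user) { return [...(this.byFollower.get(user) || [])]; }
  followers(user) { return [...(this.byFollowee.get(user) || [])]; }
}

const graph = new FollowEdges();
graph.follow("alice", "bob");
graph.follow("carol", "bob");
graph.follow("alice", "bob"); // duplicate, ignored
console.log(graph.followers("bob"), graph.countsFor("bob"));
```

In a real database the edge write and the counter update are two separate operations, which is exactly where the consistency caveat above comes from: a reader can briefly see a count that is one off from the enumerated list.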
You can embed them all, but create a new document when you reach a certain limit. For example, you can limit a document to an array of 500 elements and then create a new one. Also, for the feed, you don't have to keep publications that have already been viewed; you can replace them with new ones, so you don't need to create a new document for additional publication storage.
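A minimal sketch of that bucketing idea, with plain objects standing in for documents. The document shape ({ userId, seq, followers }) and the helper name addFollower are hypothetical; only the 500-element limit comes from the suggestion above.

```javascript
// Sketch of the bucketing idea: each bucket document holds at most `limit`
// ids, and a new bucket is started once the last one is full.
// The document shape ({ userId, seq, followers }) is hypothetical.
function addFollower(buckets, userId, followerId, limit = 500) {
  let last = buckets[buckets.length - 1];
  if (!last || last.followers.length >= limit) {
    last = { userId, seq: buckets.length, followers: [] };
    buckets.push(last);
  }
  last.followers.push(followerId);
  return buckets;
}

const buckets = [];
for (let i = 0; i < 1201; i++) addFollower(buckets, "bob", i);
console.log(buckets.map((b) => b.followers.length)); // [ 500, 500, 201 ]
```

Each bucket stays bounded in size, so no single document keeps growing, at the cost of reading several documents to enumerate a long list.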
To maintain performance, I'd advise you to use a collection that works with the $graphLookup aggregation stage, where you store who each user is following. An account can have millions of followers, so store what people follow rather than who follows them.
I hope it helps you.