MongoDB Shard considering DBRefs - mongodb

I have a case where documents in one collection use a DBRef to point to another collection.
The first collection is Books, the second is Users (who read those books). A user can have avatars and various other information, which it is reasonable to keep in a separate collection.
But now I need to shard the Books collection. If I shard it across 2 nodes, how will the Users collection be sharded? I would like to keep the users that are related to particular books on the same node. Is that possible? Thanks!

At the moment this is not possible outside of tag-aware sharding (http://docs.mongodb.org/manual/core/tag-aware-sharding/). Kristina (when she was still with 10gen) wrote a good article on how to distribute your data, which can easily be used to group multiple collections: http://www.kchodorow.com/blog/2012/07/25/controlling-collection-distribution/
However, you might find that very difficult to maintain, so I wouldn't advise it unless you are solely a DBA: with an ever-expanding network like that, you will literally spend most of your time keeping it together.
What I would do instead is shard the books collection on user_id and shard the users collection on a hashed _id. That way you only need to query two shards at most.
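The effect of that layout can be illustrated with a toy model. The sketch below is not MongoDB code: it assumes a simple modulo-of-hash placement (real clusters assign ranges of key values to shards), and the names shard_for and shards_touched are made up for illustration. It only shows that loading one user plus that user's books consults one shard per collection, so two shards at most.

```python
import hashlib

def shard_for(collection, key, num_shards):
    """Toy placement rule (an assumption, not MongoDB's real chunk logic)."""
    digest = hashlib.md5((collection + ":" + key).encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def shards_touched(user_id, num_shards):
    """Shards consulted to load one user and all books that reference them."""
    # books is sharded on user_id, so all of a user's books share one shard
    books_shard = shard_for("books", user_id, num_shards)
    # users is sharded on a hashed _id, so the user document is on one shard
    users_shard = shard_for("users", user_id, num_shards)
    return {books_shard, users_shard}

# Hypothetical user id and shard count, chosen only for the demonstration
touched = shards_touched("user42", num_shards=8)
```

However many shards the cluster has, the set returned here never holds more than two entries, which is the property the answer relies on.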

Related

Sharding with mongodb. Optimal way to write my query

Let me try to explain my problem first, and then the solution I'm implementing. I have a collection of "events", which can be shared with specific users. I also have a collection of "users". Any user could share an event with any number of other users. When an event is shared with a user, it is seen in the home page of my website by that user (let's say that it is sorted by creation date to make it simple).
I want to use sharding to balance both my writes and my reads, and to be able to scale horizontally if needed. Before I thought of sharding, I had an events collection, which had an array of userIds within. Those userIds are the ones that can see the event. My query then was every event where the logged in user was contained within that array, sorted by creation date, limiting to my page size.
To implement sharding in this scenario, the obvious choice would be to somehow have the userId as shard key, as every event returned by my query has the userId within that embedded array. However, my userId is contained within an array, so that wouldn't work. I then thought of having a new collection, with the following fields:
userId: ObjectId (hashed shard key, to avoid monotony)
eventId: ObjectId
creationDate: Date
This way, I can run my query by userId and have it go only to the corresponding shard. My problem with this solution, of course, is that I now have eventIds instead of events. An event is a somewhat big document, so I wouldn't want to store it redundantly as an embedded document within that collection (remember that the same event can be shared with many users).
To solve this, I think the correct solution would be to have the eventId be the shard key of the events collection (again, hashed to avoid monotony). I can then query the events collection by just those ids.
This raises two questions:
1. Is this the correct way to think about this particular problem? Is it a good solution?
2. As I now have several eventIds, let's just say five, and each one of them can be located in a different shard, which would be more performant: a single query looking for the five ids, or five different queries looking for a single id each?
1. Yes, this is the correct way and the solution is fine: users sharded by userId and events sharded by eventId.
2. The latter: five different queries, each searching for a single id, because then each query goes to one shard. If you have a single query that looks for five ids at the same time ($in: []), it will probably scatter to multiple shards.
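The routing difference between the two options can be sketched with a toy model. This is not MongoDB code: the modulo-of-hash placement, the function names and the event ids are all assumptions made for illustration. It shows only which shards each approach contacts, not how long either takes.

```python
import hashlib

def shard_for(event_id, num_shards):
    """Toy hashed placement (an assumption, not MongoDB's real chunk logic)."""
    digest = hashlib.md5(event_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

def shards_for_in_query(event_ids, num_shards):
    """Distinct shards that one query with $in over all ids has to contact."""
    return {shard_for(e, num_shards) for e in event_ids}

def shards_for_single_queries(event_ids, num_shards):
    """For each single-id query, the one shard it is routed to."""
    return [{shard_for(e, num_shards)} for e in event_ids]

event_ids = ["event%d" % i for i in range(5)]  # hypothetical ids
in_query_shards = shards_for_in_query(event_ids, num_shards=4)
per_query_shards = shards_for_single_queries(event_ids, num_shards=4)
```

In this model the same shards do the work either way. The difference is whether one request fans out to several shards or each request stays on exactly one. The sketch does not measure latency, so which option is faster for a given cluster would still need benchmarking.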

mongodb/mongoose - when to use subdocument and when to use new collection

I would like to know if there is a rule of thumb about when to use a new document and when to use a subdocument. In SQL databases I used to break all relations into separate tables by the rules of normalization and connect them with keys, but I can't find a good approach for what to do in MongoDB (I don't know how other NoSQL databases handle this).
Any help will be appreciated.
Kind regards.
Though there are no fixed rules, there are some general guidelines that are intuitive enough to follow while modeling data in NoSQL.
Nearly all cases of 1-1 can be handled with sub-documents. For example: a user has an address. In all likelihood the address would be unique for each user (in the context of your system, say a social website). So keeping the address in another collection would be a waste of space and queries. An address sub-document is the best choice.
Another example: hundreds of employees share the same building/address. In this case keeping 1-1 is a poor use of space and will cost you a lot of updates whenever a slight change happens in any of the addresses, because the address is replicated across multiple employee documents as a sub-document. Therefore, an address should have many employees, i.e. a 1-to-many relationship.
You must have noticed that in NoSQL there are multiple ways to represent a 1-to-many relationship:
1. Keep an array of references. Choose this if you're sure the size of the array won't get too big and preferably the document containing the array is not expected to be updated a lot.
2. Keep an array of sub-documents. A handy option if the sub-documents don't qualify for a separate collection and you don't run the risk of hitting the 16 MB document size limit. (thanks greyfairer for reminding!)
3. SQL-style foreign key. Use this if 1 and 2 are not good enough or you prefer this style over them.
Modeling documents for retrieval, Document design considerations and Modeling Relationships from Couchbase (another NoSQL database) are really good reads and equally applicable to MongoDB.
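The three one-to-many options listed above can be sketched with plain Python dictionaries standing in for documents. The building/employee data, field names and helper functions are invented for illustration; no database is involved.

```python
# Option 1: the "one" side keeps an array of references to the "many" side.
employees = {
    "e1": {"_id": "e1", "name": "Ann"},
    "e2": {"_id": "e2", "name": "Bob"},
}
building_refs = {"_id": "b1", "address": "1 Main St", "employee_ids": ["e1", "e2"]}

def names_via_references(building, employees_by_id):
    # One extra lookup per referenced id
    return [employees_by_id[eid]["name"] for eid in building["employee_ids"]]

# Option 2: the "one" side embeds the "many" side as an array of sub-documents.
building_embedded = {
    "_id": "b1",
    "address": "1 Main St",
    "employees": [{"name": "Ann"}, {"name": "Bob"}],
}

def names_via_subdocuments(building):
    # Everything is in the one document, so no second lookup is needed
    return [e["name"] for e in building["employees"]]

# Option 3: SQL-style foreign key, each "many" document points at its "one".
employees_fk = [
    {"_id": "e1", "name": "Ann", "building_id": "b1"},
    {"_id": "e2", "name": "Bob", "building_id": "b1"},
]

def names_via_foreign_key(building_id, employee_docs):
    # Filter the "many" side by the key it stores
    return [e["name"] for e in employee_docs if e["building_id"] == building_id]
```

All three shapes answer the same question (who works in building b1). They differ in where the data lives and therefore in which document grows or has to be updated when an employee is added.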

MongoDB sharding by collection

I have an application which creates a collection in MongoDB for every user where a collection is expected to have at most 100,000 documents (a few "big" users are like this while many "small" users only have less than 10,000 documents). Now the number of users grows and I want to shard my database. Is it possible to say "put this collection (thus this user) on this shard and that collection on that shard, but do not shard documents inside a collection further", and is it possible to do this automatically?
Edit: I'm already aware of MongoDB's standard sharding design now, but my application was scaled up from a small application for a single person's use, where a nedb datastore is created for the user. When multi-user support was added, it was an obvious choice to create a nedb datastore for every user so that many parts of my application could stay unchanged. When I migrated it to MongoDB, since one nedb datastore is the equivalent of a MongoDB collection, I used one collection per user. Given the current situation, I wonder what the quickest way to solve the current performance issue is (~= the one with the smallest change to my application and overall configuration).
Sharding is done on a collection and how the sharded collection is broken up is based on the shard key (where one or more object elements from your collection make up the key).
It might be better to rethink your document design. You could have all users in one collection and then use the user id as the shard key. That would shard each user as a whole and do it automatically.
See MongoDB's sharding documentation for more information on sharding.
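The redesign suggested above (one collection for all users, with the user id stored on every document so it can serve as the shard key) can be sketched as a data transformation. Plain lists and dictionaries stand in for collections here. The field name user_id and both helper functions are assumptions for illustration, not part of any MongoDB API.

```python
def merge_user_collections(per_user):
    """Flatten {user_id: [docs]} into one list, tagging each doc with its owner."""
    merged = []
    for user_id, docs in per_user.items():
        for doc in docs:
            # Copy the document and add the field that would become the shard key
            merged.append({**doc, "user_id": user_id})
    return merged

def find_for_user(collection, user_id):
    """Equivalent of the old 'read this user's collection' access path."""
    return [doc for doc in collection if doc["user_id"] == user_id]

# Hypothetical per-user collections as they might exist today
per_user = {
    "alice": [{"title": "note 1"}, {"title": "note 2"}],
    "bob": [{"title": "todo"}],
}
all_docs = merge_user_collections(per_user)
alice_docs = find_for_user(all_docs, "alice")
```

With this shape, application code that used to pick a collection by user would instead add a user_id filter to each query, which is the main change the migration would require.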

MongoDB - Using email id as identifier across collections

I have a user collection in which both email_id and _id are unique. I want to store user data across various collections, and I would like to use email_id as the identifier in those collections, because it is easier to query those collections in the shell with email_id than with a complex ObjectId.
Is this the right way? Will it cause any performance problems when creating indexes on large email_id values?
Also, don't consider this option if you plan to allow changing email_id in the future.
While relational databases encourage you to normalize your data and spread it over many tables, this approach is usually not the best for MongoDB. MongoDB doesn't support JOINs over multiple collections or even multiple documents from the same collection. So you should try to design your database documents in a way that each query can be satisfied by searching for a single document. That means it is usually a good idea to store all information about a user in one document.
An exception to this is when certain pieces of user data grow indefinitely (like the posts made by a user in a forum). First, MongoDB documents have a size limit, and second, when the size of a document increases, the database needs to reallocate its hard drive space frequently. This slows down writes and leads to fragmentation in the database. In that case it's better to put each entity in a different collection.
The size of the fields covered by an index doesn't matter when you search for equality. When you have a unique index on email_id, it should be just as fast as searching by _id.

Sharing a document with users

I have to choose a database for implementing a sharing system.
My system will have users and documents. I have to share a document with a few users.
Example:
There are 2 users, and there is one document.
So if I have to share that one document with both the users, I could do these possible solutions:
The current method I'm using is with MySQL (I don't want to use this):
Relational Databases (MySQL)
Users Table = user1, user2
Docs Table = doc1
Docs-User Relation Table = (doc1, user1), (doc1, user2)
And I would like to use something like this:
NoSQL Document Stores (MongoDB)
Users Documents:
{
  _id: "user1",
  docs_i_have_access_to: ["doc1"]
}
{
  _id: "user2",
  docs_i_have_access_to: ["doc1"]
}
Document's Document:
{
  _id: "doc1",
  members_of_this_doc: ["user1", "user2"]
}
And I don't yet know how I would implement this in a key-value store like Redis.
So I just wanted to know: would the MongoDB way I have given above be the best solution?
And is there any other way I could implement this? Maybe with another database solution?
Should I try to implement it with Redis or not?
Which database and which method should I choose as the best way to share the data, and why?
Note: I want something highly scalable and persistent. :D
Thanks. :D
Actually, you need to represent a many-to-many relationship. One user can have several documents. One document can be shared among several users.
See my previous answer to this question: how to have relations many to many in redis
With Redis, representing relationship with the set datatype is a pretty common pattern. You can expect to get better performance than with MongoDB for this kind of data model. And as a bonus, you can easily and efficiently find which users have a given list of documents in common, or which documents are shared by a given set of users.
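The set-based pattern can be sketched without a Redis server by letting Python sets play the role of Redis sets. Adding to a set corresponds to SADD and intersecting sets corresponds to SINTER. The two dictionaries, the function names and the sample ids are assumptions for illustration.

```python
from collections import defaultdict

user_docs = defaultdict(set)  # stands in for one Redis set per user
doc_users = defaultdict(set)  # stands in for one Redis set per document

def share(doc_id, user_id):
    """Record both directions of the relationship (two SADD calls in Redis)."""
    user_docs[user_id].add(doc_id)
    doc_users[doc_id].add(user_id)

def docs_in_common(*user_ids):
    """Documents shared by all given users (SINTER over the users' sets)."""
    sets = [user_docs[u] for u in user_ids]
    return set.intersection(*sets) if sets else set()

def users_sharing_all(*doc_ids):
    """Users who have access to every given document (SINTER over doc sets)."""
    sets = [doc_users[d] for d in doc_ids]
    return set.intersection(*sets) if sets else set()

# The example from the question, plus one extra document for contrast
share("doc1", "user1")
share("doc1", "user2")
share("doc2", "user1")
common = docs_in_common("user1", "user2")
both = users_sharing_all("doc1", "doc2")
```

Storing the relationship in both directions costs a second write per share, but it lets either question be answered with a single set intersection.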
Considering only this simple example (you just need to keep track of who owns what), SQL seems to be the most appropriate, as it gives you additional options for free, such as reporting who has how many docs, the most popular documents, the most active users, etc. at almost zero cost. The data will also be more consistent (no duplication, and possibly foreign keys). This holds unless you have millions of documents, of course.
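As a rough illustration of the reporting that comes for free with the relational layout, here is a sketch using Python's built-in sqlite3 module with an in-memory database in place of MySQL. The table and column names are assumptions modelled on the question's Users, Docs and Docs-User relation tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id TEXT PRIMARY KEY);
CREATE TABLE docs (id TEXT PRIMARY KEY);
-- Relation table: one row per (document, user) share
CREATE TABLE doc_users (
    doc_id TEXT NOT NULL REFERENCES docs(id),
    user_id TEXT NOT NULL REFERENCES users(id),
    PRIMARY KEY (doc_id, user_id)
);
INSERT INTO users VALUES ('user1'), ('user2');
INSERT INTO docs VALUES ('doc1');
INSERT INTO doc_users VALUES ('doc1', 'user1'), ('doc1', 'user2');
""")

# Who has how many docs
docs_per_user = conn.execute(
    "SELECT user_id, COUNT(*) FROM doc_users GROUP BY user_id ORDER BY user_id"
).fetchall()

# The most widely shared document and its number of members
most_shared = conn.execute(
    "SELECT doc_id, COUNT(*) AS n FROM doc_users "
    "GROUP BY doc_id ORDER BY n DESC LIMIT 1"
).fetchone()
```

Both reports are single GROUP BY queries over the relation table, with no duplicated membership data to keep in sync.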
If I were choosing between a document-oriented and a relational DB, I'd make the decision based mostly on the structure of the documents themselves: whether they're all uniform or may have different fields for different types, and whether you need nested sub-documents or arrays with the ability to search by their contents.