How does MongoDB scale when you have relationships between collections?

I have a MongoDB database whose documents link to each other (the data cannot be embedded).
Does the mongos cluster (http://docs.mongodb.org/manual/core/sharding-introduction/) support sharding when the documents are linked?
How does this impact performance?
Thanks!

Since there is nothing special about referenced documents (the relationship is purely logical, inferred by the application layer rather than by MongoDB itself), sharding is supported. This applies to "manual" references as well as DBRefs. You can even shard on a DBRef property, although I'm not sure why you'd want to, considering a DBRef should have inherently low cardinality.
There is a performance impact for both manual references and DBRefs, in that multiple queries must be performed to "join" the data. From the docs:
To resolve DBRefs, your application must perform additional queries to
return the referenced documents. Many drivers have helper methods that
form the query for the DBRef automatically. The drivers do not
automatically resolve DBRefs into documents.
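To make the "multiple queries" point concrete, here is a minimal sketch that resolves a manual reference in application code. It uses in-memory Python structures as stand-ins for two collections, so it runs without a server; the names `posts`, `authors`, `author_id`, and the helper functions are hypothetical and not part of any driver API.

```python
# In-memory stand-ins for two MongoDB collections (hypothetical data).
authors = {1: {"_id": 1, "name": "Ada"}}
posts = [
    {"_id": 10, "title": "Sharding notes", "author_id": 1},
    {"_id": 11, "title": "Indexing notes", "author_id": 1},
]

query_count = 0  # counts simulated round trips to the database


def find_post(post_id):
    """First query: fetch the referencing document."""
    global query_count
    query_count += 1
    return next((p for p in posts if p["_id"] == post_id), None)


def find_author(author_id):
    """Second query: fetch the referenced document by the stored id."""
    global query_count
    query_count += 1
    return authors.get(author_id)


def post_with_author(post_id):
    """Application-level 'join': the database does not follow the reference."""
    post = find_post(post_id)
    if post is None:
        return None
    return {**post, "author": find_author(post["author_id"])}


result = post_with_author(10)
print(result["author"]["name"], query_count)
```

Each resolved reference costs one extra lookup, which is the overhead the answer describes.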

There is no such thing as "document links" in MongoDB, just fields in documents of collection A which happen to have the same values as fields of documents in collection B. DBRefs are just a convention on the application layer and get no special treatment whatsoever from the database.
What matters for sharding efficiency is how you define the shard key for the referenced collection. When the field you search by is part of the shard key of the collection, mongos can speed up the query by routing it to the correct shard only.
You likely want all documents of collection A which belong to the same document of collection B to reside on the same shard. That means the shard key of A should include the field of A which is a unique identifier of B (ObjectId, name or whatever).
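The routing behaviour can be sketched with a toy model. This is not how mongos is implemented; it only illustrates that a query containing the shard key can be sent to one shard, while a query without it has to go to all of them. The field name `b_id`, the split points, and the shard names are assumptions made for the example.

```python
import bisect

# Hypothetical chunk layout for collection A, range-sharded on "b_id".
# Each split point is the inclusive lower bound of the next chunk:
# (-inf, 100), [100, 200), [200, +inf)
SPLIT_POINTS = [100, 200]
CHUNK_OWNERS = ["shard0", "shard1", "shard2"]


def shards_for_query(query):
    """Return the shards a router would need to contact for an equality query."""
    if "b_id" in query:
        # Shard key present: exactly one chunk, hence one shard, can match.
        chunk = bisect.bisect_right(SPLIT_POINTS, query["b_id"])
        return [CHUNK_OWNERS[chunk]]
    # Shard key absent: any shard may hold matches, so all must be asked.
    return sorted(set(CHUNK_OWNERS))


targeted = shards_for_query({"b_id": 150})
broadcast = shards_for_query({"name": "foo"})
print(targeted, broadcast)
```

With `b_id` in the shard key of A, fetching all A documents that belong to one B document touches a single shard.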

Related

Same shards across different MongoDB collections

I have a collection A containing one type of documents, and a second collection B containing another kind of documents.
There are multiple documents in collection B that have the same value for the field "b" which references field "a" in the collection A.
If we shard the two collections A and B on "a" and "b" respectively, can we be assured that documents in collection A having "a=foobar" will be co-located with documents in collection B having "b=foobar"?
Shard key indexes are defined per collection, and (as of MongoDB 4.0) collections are balanced independently. Even if two collections have identical shard keys, there is no guarantee that the chunk ranges or shard assignments will align.
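A toy illustration of why identical shard key values need not be co-located: each collection has its own chunk boundaries and its own chunk-to-shard assignment. The split points and shard names below are made up for the example.

```python
import bisect


def owner(split_points, owners, key):
    """Return the shard owning the chunk that contains `key`."""
    return owners[bisect.bisect_right(split_points, key)]


# Hypothetical, independently balanced layouts for collections A and B.
layout_a = (["g", "p"], ["shard0", "shard1", "shard2"])
layout_b = (["c", "k"], ["shard2", "shard1", "shard0"])

shard_a = owner(layout_a[0], layout_a[1], "foobar")  # a = "foobar" in A
shard_b = owner(layout_b[0], layout_b[1], "foobar")  # b = "foobar" in B
print(shard_a, shard_b)
```

In this layout the same value lands on different shards in the two collections, which is the situation the answer warns about.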
If you plan to use server-side queries to combine data from these collections using $lookup or $graphLookup, note that additional collections you are looking up from cannot currently be sharded. For this use case you would only shard one of the collections. For sharded lookup support there are some relevant improvements to watch/upvote in the MongoDB issue tracker: SERVER-29159 (sharded $lookup) and SERVER-27533 (sharded $graphLookup).
There are a few possible approaches to co-locating data, but all have caveats:
Denormalize: duplicate the most commonly used fields from A into B. This can speed up data retrieval by avoiding the need for joins, but adds some overhead for updates and data storage.
Embed the related data so you have a single sharded collection. This will not be ideal if your collections have very different growth or access patterns, or a large one-to-many relationship.
Manage the data distribution manually: disable balancing for these collections, manually split (or pre-split) chunks so the chunk ranges are identical, and use zone sharding for shard affinity.
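As a rough sketch of the manual approach, the snippet below only builds the admin command documents one might send through a driver; it does not connect to anything. `split`, `addShardToZone`, and `updateZoneKeyRange` are MongoDB admin commands, but the namespaces `app.A` and `app.B`, the shard and zone names, and the split value are hypothetical. Disabling the balancer for the collections is left out here.

```python
SPLIT_AT = "m"                                      # hypothetical split value
ZONES = {"low": "shardA", "high": "shardB"}         # zone -> shard (assumed names)
RANGES = {"low": ("a", "m"), "high": ("m", "z")}    # zone -> [min, max)


def commands_for(namespace, key_field):
    """Build identical split and zone-range commands for one collection."""
    cmds = [{"split": namespace, "middle": {key_field: SPLIT_AT}}]
    for zone, (low, high) in RANGES.items():
        cmds.append({
            "updateZoneKeyRange": namespace,
            "min": {key_field: low},
            "max": {key_field: high},
            "zone": zone,
        })
    return cmds


zone_setup = [{"addShardToZone": shard, "zone": zone} for zone, shard in ZONES.items()]
plan = zone_setup + commands_for("app.A", "a") + commands_for("app.B", "b")

for command in plan:
    print(command)
```

Because both collections get the same split point and the same zone ranges, equal values of "a" and "b" are pinned to the same shard, at the price of managing the layout yourself.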
For more information on relationship patterns, the Six Rules of Thumb for MongoDB Schema Design blog series is a helpful read. It doesn't cover sharding but the general data model considerations still apply.

MongoDB sharding by collection

I have an application which creates a collection in MongoDB for every user, where a collection is expected to have at most 100,000 documents (a few "big" users are like this, while many "small" users have fewer than 10,000 documents). Now the number of users is growing and I want to shard my database. Is it possible to say "put this collection (thus this user) on this shard and that collection on that shard, but do not shard documents inside a collection further", and is it possible to do this automatically?
Edit: I'm now aware of MongoDB's standard sharding design, but my application was scaled up from a small single-user application, where a nedb datastore is created for the user. When multi-user support was added, it was an obvious choice to create a nedb datastore for every user so that many parts of my application could stay unchanged. When I migrated it to MongoDB, since one nedb datastore is the equivalent of a MongoDB collection, I used one collection per user. Given the current situation, I wonder what the quickest way is (i.e. with the smallest change to my application and overall configuration) to solve the current performance issue.
Sharding is done on a collection and how the sharded collection is broken up is based on the shard key (where one or more object elements from your collection make up the key).
It might be better to rethink your document design. You could have all users in one collection and then use the user id as the shard key. That would shard each user as a whole and do it automatically.
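A minimal sketch of that redesign, using in-memory lists instead of real collections. The `user_` collection prefix, the `userId` field name, and the `app.items` namespace are assumptions for illustration; `shardCollection` is the MongoDB admin command, shown here only as a command document.

```python
PREFIX = "user_"  # hypothetical naming scheme for per-user collections

# Hypothetical per-user collections, keyed by collection name.
per_user = {
    "user_alice": [{"_id": 1, "text": "a1"}, {"_id": 2, "text": "a2"}],
    "user_bob": [{"_id": 1, "text": "b1"}],
}


def merge_collections(collections):
    """Flatten per-user collections into one list, tagging each doc with its user."""
    merged = []
    for coll_name, docs in collections.items():
        user_id = coll_name[len(PREFIX):]
        for doc in docs:
            # Drop the old _id: values may collide across users, and the
            # database would assign a fresh one on insert.
            body = {k: v for k, v in doc.items() if k != "_id"}
            merged.append({"userId": user_id, **body})
    return merged


merged = merge_collections(per_user)

# Command document that would shard the merged collection by user.
shard_command = {"shardCollection": "app.items", "key": {"userId": 1}}
print(len(merged), shard_command)
```

With `userId` alone as the shard key, all documents of one user share a key value and therefore stay in one chunk, which matches the "do not shard inside a user" requirement.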
See MongoDB's sharding documentation for more information on sharding.

MongoDB - Same _id in different collections

I have two collections called Users and ElectedUsers. ElectedUsers is a subset of Users.
The main reason to have two collections is that there are some different services unique to each collection, so I have to maintain both.
When saving a document to ElectedUsers, the application first fetches the document from the Users collection, applies some business logic, and saves it to ElectedUsers with the same _id. So for a particular document, the _id field can be the same in both collections.
I want to know: does this violate best practices? Does it have a negative impact on sharding or any other operation?
If you are using _id as the shard key, then having duplicate _id values can be problematic. Otherwise, if you are not using _id as the shard key and you maintain some other globally unique value for sharding, there shouldn't be any issue.
Refer to this link:
http://docs.mongodb.org/manual/faq/sharding/

Mongodb choose shard key

I have a mongodb collection which I want to shard. This collection holds messages from users and a document from the collection has the following properties
{
_id : ObjectId,
conversationId: ObjectId,
created: DateTime
}
All queries will be done using the conversationId property and sorted by created.
Sharding by _id obviously won't work because I need to query by conversationId (plus _id is of type ObjectId, which won't scale very well to many inserts).
Sharding by conversationId would be a logical choice in terms of query isolation, but I'm afraid that it won't scale very well to many inserts (even if I use a hashed shard key on conversationId, or if I change the type of the property from ObjectId to some other non-incremental type like GUID), because some conversations might be much more active than others (i.e. have many more messages added to them).
From what I see in the mongo documentation: "The shard key is either an indexed field or an indexed compound field that exists in every document in the collection."
Does this mean that I can create a shard key on a compound index?
Bottom line is that:
creating a hashed shard key from the _id property would offer good distribution of the data
creating a shard key on conversationId would offer good query isolation
So a combination of these two things would be great, if it could be done.
Any ideas?
Thanks
For your case, none of the fields looks like a good choice for range-based sharding. For instance, if you shard on conversationId, it will result in hot spotting: most of your inserts will go to the last shard, because conversationId (an ObjectId) increases monotonically over time. The other two fields have the same problem.
Also, conversationId will not offer a high degree of isolation, for the same reason: newer conversations get updated much more frequently than very old ones.
In your case, a hashed shard key (available from version 2.4 onwards) over conversationId would be the smart choice, as one would imagine that there can be tons of conversations going on in parallel.
Refer to the following link for details on creating a hashed shard key: http://docs.mongodb.org/manual/tutorial/shard-collection-with-a-hashed-shard-key/
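The hot-spotting argument can be illustrated with a small simulation. It is a simplification: integers stand in for monotonically increasing ObjectIds, MD5 plus a modulo stands in for MongoDB's hashed index and chunk mapping, and the split points and shard count are made up.

```python
import bisect
import hashlib
from collections import Counter

NUM_SHARDS = 4
keys = list(range(1000))      # stand-in for monotonically increasing ids

# Range sharding with split points that were chosen from older data.
SPLIT_POINTS = [10, 20, 30]   # chunk 3 covers everything from 30 upwards


def range_shard(key):
    """Chunk index under range sharding."""
    return bisect.bisect_right(SPLIT_POINTS, key)


def hashed_shard(key):
    """Chunk index under a simplified hashed scheme (hash, then modulo)."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


recent = keys[-100:]          # the most recent inserts
range_counts = Counter(range_shard(k) for k in recent)
hash_counts = Counter(hashed_shard(k) for k in keys)
print(range_counts)
print(hash_counts)
```

Under range sharding every recent insert lands in the last chunk, while hashing spreads distinct ids across all chunks. Note that all messages of one conversation still hash to the same chunk, so a single very busy conversation remains a hot key.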

Why does MongoDB have collections

MongoDB being document-oriented, the structure of collections seems to be a special case of documents. By that I mean one can define a document to contain other documents. So a collection is just a document containing other documents.
So why do we need collections after all?
Logically yes, you could design a database system like that, but practically speaking no.
A collection has indexes on the documents in it.
A collection requires the documents in it to have unique ids.
A document is limited in size.
Object ids (the _id top-level document attribute) must be unique within a collection. Multiple collections may contain the same _id, just like in an RDBMS, where the key constraint is per table, yet multiple tables may contain the same value for a key.
A collection is a container for documents. So describing it as a document that contains other documents is kind of wrong, because a document can already have inner documents.
A collection is the unit in which you put documents together. Be aware that, due to the schema-free design, you can put anything in a collection, but that is not good design. So a collection is a kind of logical container for documents, the same as a table in the relational world.
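The per-collection scope of the _id constraint can be shown with a toy model. The `Collection` class below is a hypothetical stand-in written for this example, not a driver class; the collection names come from the Users/ElectedUsers question above.

```python
class Collection:
    """Toy collection that enforces unique _id values, like a MongoDB collection."""

    def __init__(self, name):
        self.name = name
        self._docs = {}

    def insert(self, doc):
        if doc["_id"] in self._docs:
            raise KeyError("duplicate _id %r in %s" % (doc["_id"], self.name))
        self._docs[doc["_id"]] = doc

    def count(self):
        return len(self._docs)


users = Collection("Users")
elected = Collection("ElectedUsers")

users.insert({"_id": 42, "name": "Sam"})
elected.insert({"_id": 42, "name": "Sam", "term": 1})  # same _id, other collection: fine

try:
    users.insert({"_id": 42, "name": "Someone else"})   # same _id, same collection
    duplicate_rejected = False
except KeyError:
    duplicate_rejected = True

print(users.count(), elected.count(), duplicate_rejected)
```

The same _id is accepted once per collection and rejected only when repeated within one collection.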