MongoDB - Same _id in different collections

I have two collections called Users and ElectedUsers. ElectedUsers is a subset of Users.
The main reason for having two collections is that each one is used by its own distinct services, so I have to maintain both.
When saving a document to ElectedUsers, the application first fetches the document from the Users collection, applies some business logic, and saves it to ElectedUsers with the same _id. So for a particular document, the _id field can be the same in both collections.
I want to know: does this violate best practices? Does it have a bad impact on sharding or any other operation?

If you are using _id as the shard key, then having duplicate _id values can be problematic. If you are not using _id as the shard key and maintain some other globally unique value for sharding, there shouldn't be any issue.
Refer to this link:
http://docs.mongodb.org/manual/faq/sharding/
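The workflow from the question can be modeled with a small in-memory sketch. This is not MongoDB driver code: the Collection class, DuplicateKeyError, and the sample document are hypothetical stand-ins, and the only behavior assumed is that _id must be unique within a single collection.

```python
class DuplicateKeyError(Exception):
    """Raised when an _id already exists in the same collection."""


class Collection:
    """Toy stand-in for a collection: _id is unique per collection only."""

    def __init__(self, name):
        self.name = name
        self._docs = {}  # maps _id -> document, like a unique _id index

    def insert_one(self, doc):
        if doc["_id"] in self._docs:
            raise DuplicateKeyError(f"{self.name}: duplicate _id {doc['_id']!r}")
        self._docs[doc["_id"]] = dict(doc)

    def find_one(self, _id):
        doc = self._docs.get(_id)
        return dict(doc) if doc is not None else None


users = Collection("Users")
elected_users = Collection("ElectedUsers")

users.insert_one({"_id": "u1", "name": "Alice"})

# Fetch from Users, apply (hypothetical) business logic, save under the same _id.
source = users.find_one("u1")
elected = {**source, "elected": True}
elected_users.insert_one(elected)  # accepted: it is a different collection

try:
    elected_users.insert_one(elected)  # same collection, same _id
    second_insert_rejected = False
except DuplicateKeyError:
    second_insert_rejected = True
```

In this model the copy into ElectedUsers succeeds, and only a second insert of the same _id into the same collection is rejected.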

Related

In meteor/mongo collections, is the _id field unique in its collection or in the entire database?

I want to check that if I use _id fields that refer to documents from different collections, I will never have a duplicate _id, i.e. used in 2 different collections inside the same database.
Using meteor (so both in minimongo and mongodb), is the _id field unique in its collection or in the entire database?
The _id values you have in your database are generated by Meteor using Random.id(). These are unique across all collections.
Please note that the uniqueness of _id values in MongoDB is ensured at the collection level, meaning that there is always a unique index on the _id field of every collection. There is no MongoDB mechanism in place that would ensure _id uniqueness across collections.
In any case, it is quite a safe assumption that Meteor's random IDs will never collide.
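To get a feel for how unlikely a collision between random IDs is, here is a rough birthday-bound estimate. The parameters are assumptions, not taken from this thread: IDs of 17 characters drawn uniformly from a 55-symbol alphabet, and one billion generated IDs. Adjust them to whatever your ID generator actually uses.

```python
import math


def collision_probability(num_ids, alphabet_size, id_length):
    """Birthday-bound approximation: 1 - exp(-n(n-1) / (2N))."""
    space = alphabet_size ** id_length  # N: number of possible IDs
    exponent = -num_ids * (num_ids - 1) / (2 * space)
    return -math.expm1(exponent)  # stays accurate when the result is tiny


# Assumed parameters (hypothetical): 17 characters, 55-symbol alphabet, 1e9 IDs.
p = collision_probability(num_ids=10**9, alphabet_size=55, id_length=17)
print(f"approximate collision probability: {p:.3e}")
```

The estimate treats all IDs as one pool, so it covers collisions across collections as well as within one.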

How does MongoDB scale when you have relationships between collections?

I have a MongoDB database in which documents are linked (the data cannot be embedded).
Does the mongos cluster (http://docs.mongodb.org/manual/core/sharding-introduction/) support sharding when the documents are linked?
How does this impact performance?
Thanks!
Considering there is nothing special about referenced documents (a reference is just a logical relationship inferred by the application layer, not by MongoDB itself), sharding is supported. This applies to "manual" references as well as DBRefs. You can even shard on a DBRef property, although I'm not sure why you'd want to, considering a DBRef should have inherently low cardinality.
There is a performance impact for both manual references and DBRefs, in that multiple queries must be performed to "join" the data. From the docs:
To resolve DBRefs, your application must perform additional queries to
return the referenced documents. Many drivers have helper methods that
form the query for the DBRef automatically. The drivers do not
automatically resolve DBRefs into documents.
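The extra round trip described above can be sketched with plain dictionaries standing in for two collections. The collection names, field names, and the find_post_with_author helper are hypothetical; with a real driver each lookup would be a separate query to the database.

```python
# Toy in-memory "collections": plain dicts keyed by _id.
authors = {
    "a1": {"_id": "a1", "name": "Alice"},
}
posts = {
    "p1": {"_id": "p1", "title": "Hello", "author_id": "a1"},
}


def find_post_with_author(post_id):
    """Resolve a manual reference with two separate lookups."""
    post = posts.get(post_id)  # lookup 1: the referencing document
    if post is None:
        return None
    author = authors.get(post["author_id"])  # lookup 2: the referenced document
    return {**post, "author": author}


result = find_post_with_author("p1")
```

The "join" happens entirely in application code, which is why each reference that has to be followed costs one more lookup.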
There is no such thing as "document links" in MongoDB, just fields in documents of collection A which happen to have the same values as fields of documents in collection B. DBRefs are just a convention on the application layer and get no special treatment whatsoever by the database.
What matters for sharding efficiency is how you define the shard key for the referenced collection. When the field you search by is part of the shard key of the collection, mongos can accelerate it by redirecting the query to the correct shard.
You likely want all documents of collection A which belong to the same document of collection B to reside on the same shard. That means the shard key of A should include the field of A which holds a unique identifier of B (ObjectId, name or whatever).
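A toy router illustrates the co-location idea. Everything here is an assumption made for illustration: the shard count, the comments/post_id naming, and the shard_for function, which hashes the key value instead of using the chunk ranges a real mongos works with.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical cluster size


def shard_for(key):
    """Map a shard-key value to a shard with a stable hash (toy router)."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


# Documents of collection A, each referencing a document of collection B.
comments = [{"_id": f"c{i}", "post_id": f"post{i % 3}"} for i in range(12)]

# Sharding A on the reference field keeps all comments of one post together.
placement = {}
for doc in comments:
    placement.setdefault(doc["post_id"], set()).add(shard_for(doc["post_id"]))

# A query for one post's comments therefore targets exactly one shard.
shards_per_post = {post: len(shards) for post, shards in placement.items()}
```

Because placement depends only on post_id, every comment of a given post lands on the same shard, and a query filtered by post_id can be routed to that shard alone.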

MongoDB - Choosing a shard key

I have a mongodb collection which I want to shard. This collection holds messages from users and a document from the collection has the following properties
{
_id : ObjectId,
conversationId: ObjectId,
created: DateTime
}
All queries will be done using the conversationId property and sorted by created.
Sharding by _id obviously won't work because I need to query by conversationId (plus _id is of type ObjectId, which won't scale very well with many inserts).
Sharding by conversationId would be a logical choice in terms of query isolation, but I'm afraid that it won't scale very well with many inserts (even if I use a hashed shard key on conversationId, or if I change the type of the property from ObjectId to some other type which isn't incremental, like GUID), because some conversations might be much more active than others (i.e. have many more messages added to them).
From what I see in the MongoDB documentation: "The shard key is either an indexed field or an indexed compound field that exists in every document in the collection."
Does this mean that I can create a shard key on a compound index?
Bottom line is that:
creating a hashed shard key from the _id property would offer good distribution of the data
creating a shard key on conversationId would offer good query isolation
So a combination of these two things would be great, if it could be done.
Any ideas?
Thanks
For your case, none of the fields looks like a good choice for a plain (range-based) shard key. For instance, if you shard on conversationId, it will result in hot spotting, i.e. most of your inserts will go to the last shard, because conversationId increases monotonically over time. The other two fields have the same problem.
Also, conversationId will not offer a high degree of isolation, since newer conversations will be updated much more frequently than very old ones.
In your case, a hashed shard key (available from version 2.4 onwards) over conversationId would be the smart choice, as one would imagine that there can be tons of conversations going on in parallel.
Refer to the following link for details on creating a hashed shard key: http://docs.mongodb.org/manual/tutorial/shard-collection-with-a-hashed-shard-key/
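The difference between range-based and hashed placement for monotonically increasing keys can be simulated in a few lines. This is a simplified model, not MongoDB's actual chunk logic: the shard count, the fixed boundaries, the integer keys standing in for ObjectIds, and the use of MD5 are all assumptions made for illustration.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical cluster size


def range_shard(key, boundaries):
    """Range-based placement: first shard whose upper bound exceeds the key."""
    for shard, upper in enumerate(boundaries):
        if key < upper:
            return shard
    return len(boundaries)  # everything above the last boundary


def hashed_shard(key):
    """Hash-based placement: spread keys by a stable hash of the value."""
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS


# Boundaries were chosen from old data; new keys keep increasing past them.
boundaries = [1000, 2000, 3000]
new_keys = range(5000, 6000)  # stand-in for monotonically increasing IDs

range_counts = Counter(range_shard(k, boundaries) for k in new_keys)
hash_counts = Counter(hashed_shard(k) for k in new_keys)

print("range-based:", dict(range_counts))
print("hashed:     ", dict(hash_counts))
```

In this model every new key goes to the last shard under range-based placement, while hashing spreads distinct keys over all shards. Note that hashing only spreads different conversationId values: all messages of one very active conversation still hash to the same shard.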

MongoDB - Using email id as identifier across collections

I have a user collection in which both email_id and _id are unique. I want to store user data across various collections, and I would like to use email_id as the identifier in those collections, because it is easier to query those collections in the shell with email_id than with a complex ObjectId.
Is this the right way? Will it cause any performance problems when creating indexes on long email_id values?
Also, don't consider this option if you plan to allow email_id to be changed in the future.
While relational databases encourage you to normalize your data and spread it over many tables, this approach is usually not the best for MongoDB. MongoDB doesn't support JOINs over multiple collections or even multiple documents from the same collection. So you should try to design your database documents in a way that each query can be satisfied by searching for a single document. That means it is usually a good idea to store all information about a user in one document.
An exception to this is when certain data of the user grows indefinitely (like the posts made by a user in a forum). First, MongoDB documents have a size limit, and second, when the size of a document increases, the database needs to reallocate its hard drive space frequently. This slows down writes and leads to fragmentation in the database. In that case it's better to put each entity in a different collection.
The size of the fields covered by an index doesn't matter when you search for equality. When you have a unique index on email_id, it should be just as fast as searching by _id.

Why does MongoDB have collections

MongoDB being document-oriented, the structure of collections seems to be a special case of documents. By that I mean one can define a document to contain other documents. So a collection is just a document containing other documents.
So why do we need collections after all?
Logically yes, you could design a database system like that, but practically speaking no.
A collection has indexes on the documents in it.
A collection requires the documents in it to have unique ids.
A document is limited in size.
Object IDs (the top-level _id attribute of a document) must be unique within a collection. Multiple collections may contain the same _id, just like in an RDBMS, where the key constraint is per table, yet multiple tables may contain the same value for a key.
A collection is a container for documents. So when you say a document that contains other documents, that's kinda wrong, because a document can already have inner documents.
A collection is the unit where you put documents together. Be aware that due to the schema-free design you can put anything in a collection, but that's not good design. So a collection is kinda a logical container for documents, the same as tables in the relational world.