MongoDB Extended Reference Pattern - Managing Data Duplication - mongodb

I'm new to MongoDB and trying to wrap my head around managing duplicate data. The Extended Reference Pattern (link) is a good example. When you have two related collections (e.g., Customers and Orders), it can make sense for performance reasons to duplicate some information that would otherwise just live in the referenced collection. So for instance, the Order collection might duplicate the customer's name to avoid unnecessary joins with some queries.
I totally get that. And I totally get that you should be careful about what data you duplicate ("it works best if [duplicated fields] don't frequently change"), as updating those records can be expensive. What I don't understand is how you're supposed to keep track of where all that data is housed. Suppose you do need to update a customer's name. If that information is duplicated in multiple orders within the Order collection, plus maybe one or two other collections, tracking down everywhere the customer name lives (and the mechanics of changing it) sounds like a logistical nightmare!
Is there some sort of Mongo voodoo magic that can help with these sorts of updates, or is that just a necessarily messy process?

You have to manage all of those changes in your app, so take care when selecting one pattern or another; they are not silver bullets.
And remember that not all of the data needs to be updated. It depends on the situation, the data, and the context of your app.
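
One way to keep track of where copies live is to record every duplicated field in a single registry inside the application and drive all propagation from it. The sketch below is a minimal in-memory illustration, not a MongoDB feature: plain arrays stand in for collections, and the names `duplicationRegistry` and `propagateChange` are hypothetical. Against a real database, the inner loop would correspond to one multi-document update per registry entry.

```javascript
// Plain arrays stand in for collections (an assumption for illustration only).
const db = {
  customers: [{ _id: 1, name: "Ann Lee" }],
  orders: [
    { _id: 10, customerId: 1, customerName: "Ann Lee" },
    { _id: 11, customerId: 1, customerName: "Ann Lee" },
    { _id: 12, customerId: 2, customerName: "Bob Ray" },
  ],
  invoices: [{ _id: 20, customerId: 1, customerName: "Ann Lee" }],
};

// Hypothetical registry: for each source field, where its copies are stored.
const duplicationRegistry = {
  "customers.name": [
    { collection: "orders", foreignKey: "customerId", field: "customerName" },
    { collection: "invoices", foreignKey: "customerId", field: "customerName" },
  ],
};

// Update the source document, then every registered copy.
// Returns the number of copies that were changed.
function propagateChange(sourceCollection, sourceField, id, newValue) {
  const source = db[sourceCollection].find((doc) => doc._id === id);
  if (!source) throw new Error(`no ${sourceCollection} document with _id ${id}`);
  source[sourceField] = newValue;

  let changed = 0;
  const targets = duplicationRegistry[`${sourceCollection}.${sourceField}`] || [];
  for (const { collection, foreignKey, field } of targets) {
    for (const doc of db[collection]) {
      if (doc[foreignKey] === id && doc[field] !== newValue) {
        doc[field] = newValue;
        changed += 1;
      }
    }
  }
  return changed;
}

const changedCopies = propagateChange("customers", "name", 1, "Ann Lee-Smith");
console.log(changedCopies);
```

Adding a new duplicated field then means adding one registry entry, rather than hunting through the code for every place that writes the customer name.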

Related

NoSQL how to lookup id in a collection

NoSQL noob here. I'm building an app using Firestore NoSQL. I'm looping through items where every item has an owner id (creator user id).
I want to display the owner's name on the listing page. In traditional SQL, I have a foreign key, so I can just reference, say, Item.Owner.FirstName.
What's the best practice in NoSQL? Should I be saving the owner name as a field at the time of saving the item, or do a lookup of each owner id to get the user object whilst I'm looping through items?
The second option sounds expensive, so I'm assuming the first way is the way to go. Unless there's a better, more accepted way?
Both will work. You either reference the data in the other document in whatever way you see fit, or you duplicate information into the document that you intend to query to build the display. You just have to decide which problem you want to deal with:
If you duplicate data among documents (known as "denormalization"), then you'll have to put effort into keeping them all up to date with each other, if that's what you require. So, writing one document might actually turn into writing multiple documents.
If you normalize your data with no duplication, then each of your queries will require more queries to get the related data from other documents. This could result in a drop in performance and an increase in cost for apps with heavy read loads.
Since we don't know the performance requirements and usage behavior of your app, there is no way to give specific advice. You will have to think carefully about which problem you want to have, perhaps based on complexity, performance, and overall cost.

Are there any advantages to using a custom _id for documents in MongoDB?

Let's say I have a collection called Articles. If I were to insert a new document into that collection without providing a value for the _id field, MongoDB will generate one for me that is specific to the machine and the time of the operation (e.g. sdf4sd89fds78hj).
However, I do have the ability to pass a value for MongoDB to use as the value of the _id key (e.g. 1).
My question is, are there any advantages to using my own custom _ids, or is it best to just let Mongo do its thing? In what scenarios would I need to assign a custom _id?
Update
For anyone else that may find this. The general idea (as I understand it) is that there's nothing wrong with assigning your own _ids, but it forces you to maintain unique values within your application layer, which is a PITA, and requires an extra query before every insert to make sure you don't accidentally duplicate a value.
Sammaye provides an excellent answer here:
Is it bad to change _id type in MongoDB to integer?
Advantages with generating your own _ids:
You can make them more human-friendly, by assigning incrementing numbers: 1, 2, 3, ...
Or you can make them more human-friendly, using random strings: t3oSKd9q
(That doesn't take up too much space on screen, could be picked out from a list, and could potentially be copied manually if needed. However you do need to make it long enough to prevent collisions.)
If you use randomly generated strings they will have an approximately even sharding distribution, unlike the standard mongo ObjectIds, which tend to group records created around the same time onto the same shard. (Whether that is helpful or not really depends on your sharding strategy.)
Or you may like to generate your own custom _ids that will group related objects onto one shard, e.g. by owner, or geographical region, or a combination. (Again, whether that is desirable or not depends on how you intend to query the data, and/or how rapidly you are producing and storing it. You can also do this by specifying a shard key, rather than the _id itself. See the discussion below.)
Advantages to using ObjectIds:
ObjectIds are very good at avoiding collisions. If you generate your own _ids randomly or concurrently, then you need to manage the collision risk yourself.
ObjectIds contain their creation time within them. That can be a cheap and easy way to retain the creation date of a document, and to sort documents chronologically. (On the other hand, if you don't want to expose/leak the creation date of a document, then you must not expose its ObjectId!)
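
The creation time mentioned above sits in the first four bytes of the ObjectId, stored as seconds since the Unix epoch. Here is a minimal sketch of reading it from the 24-character hex form without any driver. The function name `objectIdToDate` is hypothetical, and the sample id is only an illustrative value.

```javascript
// Decode the creation time embedded in an ObjectId's hex string.
// The first 8 hex characters (4 bytes) are seconds since the Unix epoch.
function objectIdToDate(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected a 24-character hex string");
  }
  const seconds = parseInt(hex.slice(0, 8), 16);
  return new Date(seconds * 1000);
}

const created = objectIdToDate("507f1f77bcf86cd799439011");
console.log(created.toISOString()); // 2012-10-17T21:13:27.000Z
```

In the mongo shell, calling `getTimestamp()` on an ObjectId gives the same value. The sketch also shows why exposing an ObjectId exposes the creation date: anyone holding the id can decode it.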
The nanoid module can help you to generate short random ids. They also provide a calculator which can help you choose a good id length, depending on how many documents/ids you are generating each hour.
Alternatively, I wrote mongoose-generate-unique-key for generating very short random ids (provided you are using the mongoose library).
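
To make the "long enough to prevent collisions" advice concrete, here is a rough sketch that generates a random id with Node's built-in `crypto` module and estimates collision risk with the birthday approximation p ≈ n² / (2N), where N is the number of possible ids. This is not how nanoid or mongoose-generate-unique-key work internally; `randomId` and `collisionProbability` are hypothetical helpers.

```javascript
const crypto = require("crypto");

// 62 symbols: digits, upper-case and lower-case letters.
const ALPHABET =
  "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

// Build an id by drawing each character uniformly from ALPHABET.
function randomId(length) {
  let id = "";
  for (let i = 0; i < length; i += 1) {
    id += ALPHABET[crypto.randomInt(ALPHABET.length)];
  }
  return id;
}

// Birthday approximation: chance of at least one collision among `count` ids.
// Only meaningful while the result is small (well below 1).
function collisionProbability(count, length) {
  const space = Math.pow(ALPHABET.length, length);
  return (count * count) / (2 * space);
}

console.log(randomId(8));
console.log(collisionProbability(1e6, 8));  // roughly 0.002
console.log(collisionProbability(1e6, 12)); // many orders of magnitude smaller
```

With a million 8-character ids the estimated risk is already around one in a few hundred, which is why id length has to be chosen with the expected document count in mind.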
Sharding strategies
Note: Sharding is only needed if you have a huge number of documents (or very heavy documents) that cannot be managed by one server. It takes quite a bit of effort to set up, so I would not recommend worrying about it until you are sure you actually need it.
I won't claim to be an expert on how best to shard data, but here are some situations we might consider:
An astronomical observatory or particle accelerator handles gigabytes of data per second. When an interesting event is detected, they may want to store a huge amount of data in only a few seconds. In this case, they probably want an even distribution of documents across the shards, so that each shard will be working equally hard to store the data, and no one shard will be overwhelmed.
You have a huge amount of data and you sometimes need to process all of it at once. In this case (but depending on the algorithm) an even distribution might again be desirable, so that all shards can work equally hard on processing their chunk of the data, before combining the results at the end. (Although in this scenario, we may be able to rely on MongoDB's balancer, rather than our shard key, for the even distribution. The balancer runs in the background after data has been stored. After collecting a lot of data, you may need to leave it to redistribute the chunks overnight.)
You have a social media app with a large amount of data, but this time many different users are making many light queries related mainly to their own data, or their specific friends or topics. In this case, it doesn't make sense to involve every shard whenever a user makes a little query. It might make sense to shard by userId (or by topic or by geographical region) so that all documents belonging to one user will be stored on one shard, and when that user makes a query, only one shard needs to do work. This should leave the other shards free to process queries for other users, so many users can be served at once.
Sharding documents by creation time (which the default ObjectIds will give you) might be desirable if you have lots of light queries looking at data for similar time periods. For example many different users querying different historical charts.
But it might not be so desirable if most of your users are querying only the most recent documents (a common situation on social media platforms) because that would mean one or two shards would be getting most of the work. Distributing by topic or perhaps by region might provide a flatter overall distribution, whilst also allowing related documents to clump together on a single shard.
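
The contrast between time-ordered keys and evenly distributed keys can be seen in a toy router. This sketch is a simulation in plain JavaScript, not MongoDB's actual chunk-assignment logic: `rangeShard` assigns by fixed timestamp ranges, `hashShard` by an FNV-1a hash of the key, and both names as well as the four-shard setup are assumptions.

```javascript
const SHARDS = 4;

// Range routing: each shard owns a contiguous slice of the key space.
// `boundaries` holds the exclusive upper bound of every shard but the last.
function rangeShard(timestamp, boundaries) {
  const index = boundaries.findIndex((upper) => timestamp < upper);
  return index === -1 ? boundaries.length : index;
}

// Hash routing: FNV-1a (32-bit) of the key, reduced modulo the shard count.
function hashShard(key) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i += 1) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash % SHARDS;
}

// 1000 new documents, all created after the last range boundary.
const boundaries = [1000, 2000, 3000];
const rangeCounts = new Array(SHARDS).fill(0);
const hashCounts = new Array(SHARDS).fill(0);
for (let i = 0; i < 1000; i += 1) {
  rangeCounts[rangeShard(3000 + i, boundaries)] += 1;
  hashCounts[hashShard(`user-${i}`)] += 1;
}

console.log(rangeCounts); // every write lands on the last shard
console.log(hashCounts);  // writes are spread across all shards
```

The range-routed writes pile onto one shard, which is the "most recent documents" hotspot described above, while the hashed keys spread the same writes across all four.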
You may like to read the official docs on this subject:
https://docs.mongodb.com/manual/sharding/#shard-key-strategy
https://docs.mongodb.com/manual/core/sharding-choose-a-shard-key/
I can think of one good reason to generate your own ID up front. That is for idempotency. For example so that it is possible to tell if something worked or not after a crash. This method works well when using re-try logic.
Let me explain. The reason people might consider re-try logic:
Inter-app communication can sometimes fail for different reasons (especially in a microservice architecture). The app would be more resilient and self-healing if it were coded to re-try and not give up right away. This rides over odd blips that might occur without the consumer ever being affected.
For example when dealing with mongo, a request is sent to the DB to store some object, the DB saves it, but just as it is trying to respond to the client to say everything worked fine, there is a network blip for whatever reason and the “OK” is never received. The app assumes it didn't work and so the app may end up re-trying the same data and storing it twice, or worse it just blows up.
Creating the ID up front is an easy, low overhead way to help deal with re-try logic. Of course one could think of other schemes too.
Although this sort of resiliency may be overkill in some types of projects, it really just depends.
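
The retry scenario can be sketched with an in-memory store standing in for the database. The id is generated once, before the first attempt, so a retry after a lost acknowledgement hits a duplicate-key error instead of creating a second document. `FakeCollection`, `insertWithRetry`, the `DUPLICATE_KEY` code, and the simulated network failure are all hypothetical; MongoDB itself reports a duplicate `_id` as error code 11000.

```javascript
const crypto = require("crypto");

// In-memory stand-in for a collection with a unique _id (illustration only).
class FakeCollection {
  constructor() {
    this.docs = new Map();
    this.dropNextAck = false; // when true, simulate a lost "OK" response once
  }
  insert(doc) {
    if (this.docs.has(doc._id)) {
      const err = new Error("duplicate key");
      err.code = "DUPLICATE_KEY";
      throw err;
    }
    this.docs.set(doc._id, doc);
    if (this.dropNextAck) {
      this.dropNextAck = false;
      throw new Error("network blip: acknowledgement lost");
    }
  }
}

// Retry the insert; a duplicate-key error means an earlier attempt succeeded.
// Returns the attempt number on which success was confirmed.
function insertWithRetry(collection, doc, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    try {
      collection.insert(doc);
      return attempt;
    } catch (err) {
      if (err.code === "DUPLICATE_KEY") return attempt;
      if (attempt === maxAttempts) throw err;
    }
  }
}

const payments = new FakeCollection();
payments.dropNextAck = true;
const order = { _id: crypto.randomUUID(), amount: 42 }; // id created up front
const attempts = insertWithRetry(payments, order);
console.log(attempts, payments.docs.size);
```

The first attempt stores the document but loses the reply; the second attempt is recognized as a repeat, so exactly one document exists afterwards.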
I have used custom ids a couple of times and it was quite useful.
In particular I had a collection where I would store stats by date, so the _id was actually a date in a specific format. I did that mostly because I would always query by date. Keep in mind that using this approach can simplify your indexes as no extra index is needed, the basic cursor is sufficient.
Sometimes the ID is something more meaningful than a randomly generated one. For example, a user collection may use the email address as the _id instead. In my project I generate IDs that are much shorter than the ones MongoDB uses so that the ID shown in the URL is much shorter.
I'll use an example. I created a property management tool that had multiple collections. For simplicity, some fields were duplicated, for example the payment. When I needed to update these records, the change had to happen simultaneously across all the collections they appeared in, so I assigned each one a custom payment id. That way, when a delete or update action is performed, it can target all instances of that payment database-wide.

MongoDB - One Collection Using Indexes

OK, so the more I develop in MongoDB, the more I start to wonder about the need for multiple collections vs. having one large collection with indexes (since columns and fields can be different for each document, unlike tabular data). If I am trying to develop in the most efficient way possible (meaning less code and reusable code), can I use one collection for all documents and just index on a field? By having all documents in one collection with indexes, I can reuse all my form processing code and other code, since it will all be inserting into the same collection.
For Example:
Let's say I am developing a contact manager and I have two types of contacts, "individuals" and "businesses". My original thought was to create a collection called individuals and a second collection called businesses. But that was because I'm used to developing in SQL, where this would be appropriate since columns would be different for each table. The more I started to think about the flexibility of document DBs, the more I started to think, "do I really need two collections for this?" If I just add a field to each document called "contact type" and index on that, do I really need two collections? Since the fields/columns in each document do not have to be the same for all (like in SQL), each document can have its own fields as long as I have a "document type" field and an index on that field.
So then I took that concept and started to think: if I only need one collection for "individuals" and "businesses", do I even need a separate collection for "Users" or "Contact History" or any other data? In theory, couldn't I build the entire solution in one collection and just have a field in each document that specifies the "type" and index on it, such as "Users", "Individual Contact", "Business Contacts", "Contact History", etc.? And if it is a document related to another document, I can index on the "parent key/foreign" Id field...
This would allow me to code the front end dynamically since the form processing code would all be the same (inserting into the same collection). This would save a lot of coding, but I want to make sure that, by using indexes and secondary indexes, the DB would still run fast and not cause future problems as the collection grew. As you can imagine, if everything was in one collection there might be hundreds of thousands or even millions of documents in this collection as the user base grows, but it would have indexes and secondary indexes to optimize performance.
My question is: Is this a common method mongodb developers use? Why or why not? What are the downfalls, if any? If this is a commonly used method, please also give any positives to using this method. thank you.
This is a really big point in Mongo and the answer is a little bit more of an art than science. Having one collection full of gigantic documents is definitely an anti-pattern because it works against many of Mongo's features.
For instance, when retrieving documents, you can only retrieve a whole document out of a collection (not entirely true, but mostly). So if you have huge documents, you're retrieving huge documents each time. Also, having huge documents makes sharding less flexible since only the top level documents are indexed (and hence, sharded) in each collection. You can index values deep into a document, but the index value is associated with the top level document.
At the same time, going purely relational is also an anti-pattern because you've lost a lot of the referential integrity by going to Mongo in the first place. Also, all joins are done in application memory, so each one requires a full round-trip (slow).
So the answer is to do something in between. I'm thinking you'll probably want a collection for individuals and a different collection for businesses in this case. I say this because it seems like businesses have enough meta-data associated that it could bulk up a lot. (Also, the individual-business relationship seems like a many-to-many.) However, an individual might have a Name object (with first and last properties). It would be a bad idea to make Name into a separate collection.
Some info from 10gen about schema design: http://www.mongodb.org/display/DOCS/Schema+Design
EDIT
Also, Mongo has limited support for transactions - in the form of atomic aggregates. When you insert an object into mongo, the entire object is either inserted or not inserted. So if your application domain requires consistency between certain objects, you probably want to keep them in the same document/collection.
For example, consider an application that requires that a User always has a Name object (containing FirstName, LastName, and MiddleInitial). If a User was somehow inserted with no corresponding Name, the data would be considered to be corrupted. In an RDBMS you would wrap a transaction around the operations to insert User and Name. In Mongo, we make sure Name is in the same document (aggregate) as User to achieve the same effect.
Your example is a little less clear, since I don't understand the business cases. One thing that does come to mind is that Mongo has excellent support for inheritance. It might make sense to put all users, individuals, and potentially businesses into the same collection (depending on how the application is modeled). If one individual has many contacts, you probably want individuals to have an array of IDs. If your application requires that you get a quick preview of contacts, you might consider duplicating part of an individual and storing an array of contact objects.
If you're used to RDBMS thinking, you probably think all your data always has to be consistent. The truth is, that's probably not entirely true. This concept of applying atomic aggregates to the domain has been preached heavily by the DDD community recently. When you look at your domain in depth, like your business users do, the consistency boundaries should become distinct.
MongoDB, and NoSQL in general, is about de-normalising data and about reducing joins. It goes against normal SQL thinking.
In your case, I don't see any reason why you would want to have separate collections because it introduces unnecessary complexity and performance overhead. Consider, for example, if you wanted to have a screen that displayed all contacts, in alphabetical order. If you have one single collection for contacts, then it's really easy, but if you have two collections it becomes a more complicated proposition.
Where I would have multiple collections is if your application had multiple users storing contacts. I would then have one collection for each user. This makes it so easy to extract that user's contacts.
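
The single-collection layout argued for in this answer can be sketched in memory: every contact carries a `type` field, so one sorted listing covers individuals and businesses alike, while a filter on `type` recovers either subset. The field names are assumptions, and an array stands in for the collection.

```javascript
// One "contacts" collection; the `type` field tells the document kinds apart.
const contacts = [
  { type: "business", name: "Zeta Corp", taxId: "Z-1" },
  { type: "individual", name: "Adams, Jo", birthday: "1990-01-01" },
  { type: "business", name: "Blue LLC", taxId: "B-7" },
  { type: "individual", name: "Young, Al" },
];

// All contacts, alphabetically, regardless of type.
function listAll(docs) {
  return [...docs].sort((a, b) => a.name.localeCompare(b.name));
}

// Only one kind of contact; in a database this filter would rely on an index on `type`.
function listByType(docs, type) {
  return listAll(docs.filter((doc) => doc.type === type));
}

const allNames = listAll(contacts).map((c) => c.name);
const businessNames = listByType(contacts, "business").map((c) => c.name);
console.log(allNames);
console.log(businessNames);
```

With two separate collections, the first listing would need two queries and a merge step in application code.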

Many to many update in MongoDB without transactions

I have two collections with a many-to-many relationship. I want to store an array of linked ObjectIds in both documents so that I can take Document A and retrieve all linked Document B's quickly, and vice versa.
Creating this link is a two step process
Add Document A's ObjectId to Document B
Add Document B's ObjectId to Document A
After watching a MongoDB video I found this to be the recommended way of storing a many-to-many relationship between two collections
I need to be sure that both updates are made. What is the recommended way of robustly dealing with this crucial two step process without a transaction?
I could condense this relationship into a single link collection, the advantage being a single update with no chance of Document B missing the link to Document A. The disadvantage being that I'm not really using MongoDB as intended. But, because there is only a single update, it seems more robust to have a link collection that defines the many-to-many relationship.
Should I use safe mode and manually check the data went in afterwards and try again on failure? Or should I represent the many-to-many relationship in just one of the collections and rely on an index to make sure I can still quickly get the linked documents?
Any recommendations? Thanks
@Gareth, you have multiple legitimate ways to do this. So the key concern is how you plan to query for the data (i.e., what queries need to be fast).
Here are a couple of methods.
Method #1: the "links" collection
You could build a collection that simply contains mappings between the collections.
Pros:
Supports atomic updates so that data is not lost
Cons:
Extra query when trying to move between collections
Method #2: store copies of smaller mappings in larger collection
For example: you have millions of Products, but only a hundred Categories. Then you would store the Categories as an array inside each Product.
Pros:
Smallest footprint
Only need one update
Cons:
Extra query if you go the "wrong way"
Method #3: store copies of all mappings in both collections
(what you're suggesting)
Pros:
Single query access to move between either collection
Cons:
Potentially large indexes
Needs transactions (?)
Let's talk about "needs transactions". There are several ways to do transactions and it really depends on what type of safety you require.
Should I use safe mode and manually check the data went in afterwards and try again on failure?
You can definitely do this. You'll have to ask yourself, what's the worst that happens if only one of the saves fails?
Method #4: queue the change
I don't know if you've ever worked with queues, but if you have some leeway you can build a simple queue and have different jobs that update their respective collections.
This is a much more advanced solution. I would tend to go with #2 or #3.
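
Method #1 can be sketched as follows. Each link is one small document, so creating it is a single write, and either direction is answered by filtering on one side of the pair. An array stands in for the links collection, and `link`, `categoriesOf`, and `productsIn` are hypothetical helper names.

```javascript
// Each link document records one product-category pair.
const links = [];

// Creating a link is a single insert; skip it if the pair already exists.
function link(productId, categoryId) {
  const exists = links.some(
    (l) => l.productId === productId && l.categoryId === categoryId
  );
  if (!exists) links.push({ productId, categoryId });
  return !exists;
}

// Direction 1: categories of a product.
function categoriesOf(productId) {
  return links.filter((l) => l.productId === productId).map((l) => l.categoryId);
}

// Direction 2: products in a category.
function productsIn(categoryId) {
  return links.filter((l) => l.categoryId === categoryId).map((l) => l.productId);
}

link("p1", "c1");
link("p1", "c2");
link("p2", "c1");
link("p1", "c1"); // duplicate pair, ignored

console.log(categoriesOf("p1")); // [ 'c1', 'c2' ]
console.log(productsIn("c1"));   // [ 'p1', 'p2' ]
```

These helpers return only ids. Loading the linked documents themselves takes a second query, which is the "extra query" listed under the cons.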
Why don't you create a dedicated collection holding the relations between A and B as dedicated rows/documents, as one would do in an RDBMS? You can modify the relation table with one operation, which is of course atomic.
Should I use safe mode and manually check the data went in afterwards and try again on failure?
Yes, this is an approach, but there is another: you can implement an optimistic transaction. It has some overhead and limitations, but it guarantees data consistency. I wrote an example and some explanation on a GitHub page.
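
The check-and-retry approach from the question can be sketched as two idempotent steps that are repeated until both sides agree. Adding an id with set semantics (what `$addToSet` does in MongoDB) makes it safe to repeat a step that may already have succeeded. Everything else here is an in-memory assumption: `addToSet`, `linkBothWays`, and the `failOnce` hook that simulates one failed write. This is not the optimistic-transaction implementation the answer refers to.

```javascript
// Set-style append: adding a value that is already present changes nothing.
function addToSet(array, value) {
  if (!array.includes(value)) array.push(value);
}

// Write both sides of the link, verify, and retry until both are present.
// `failOnce` is a hypothetical hook that makes the second write fail one time.
function linkBothWays(a, b, failOnce = { pending: false }, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    addToSet(a.linkedB, b._id);
    if (failOnce.pending) {
      failOnce.pending = false; // simulated failure: the second write is lost
    } else {
      addToSet(b.linkedA, a._id);
    }
    const consistent = a.linkedB.includes(b._id) && b.linkedA.includes(a._id);
    if (consistent) return attempt;
  }
  throw new Error("link still inconsistent after retries");
}

const docA = { _id: "a1", linkedB: [] };
const docB = { _id: "b1", linkedA: [] };
const attemptsNeeded = linkBothWays(docA, docB, { pending: true });
console.log(attemptsNeeded, docA.linkedB, docB.linkedA);
```

Because both writes are idempotent, repeating the whole sequence never produces duplicate entries. The remaining risk is a crash between the two writes, which a periodic consistency check or a queue would have to cover.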

Relations in Document-oriented database?

I'm interested in document-oriented databases, and I'd like to play with MongoDB. So I started a fairly simple project (an issue tracker), but am having hard times thinking in a non-relational way.
My problems:
I have two objects that relate to each other (e.g. issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}} - here I have a user related to the issue). Should I create another document 'user' and reference it in 'issue' document by its id (like in relational databases), or should I leave all the user's data in the subdocument?
If I have objects (subdocuments) in a document, can I update them all in a single query?
I'm totally new to document-oriented databases, and right now I'm trying to develop sort of a CMS using node.js and mongodb so I'm facing the same problems as you.
By trial and error I found this rule of thumb: I make a collection for every entity that may be a "subject" for my queries, while embedding the rest inside other objects.
For example, comments in a blog entry can be embedded, because usually they're bound to the entry itself and I can't think about a useful query made globally on all comments. On the other side, tags attached to a post might deserve their own collection, because even if they're bound to the post, you might want to reason globally about all the tags (for example making a list of trending topics).
In my mind this is actually pretty simple. Embedded documents can only be accessed via their master document. If you can envision a need to query an object outside the context of the master document, then don't embed it. Use a ref.
For your example
issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}}
I would make issue and reporter each their own document, and reference the reporter in the issue. You could also reference a list of issues in reporter. This way you won't duplicate reporters in issues, you can query them each separately, you can query reporter by issue, and you can query issues by reporter. If you embed reporter in issue, you can only query the one way, reporter by issue.
If you embed documents, you can update them all in a single query, but you have to repeat the update in each master document. This is another good reason to use reference documents.
The beauty of MongoDB and other "NoSQL" products is that there isn't any schema to design. I use MongoDB and I love it, not having to write SQL queries and awful JOIN queries! So to answer your two questions.
1 - If you create multiple documents, you'll need to make two calls to the DB. Not saying it's a bad thing, but if you can throw everything into one document, why not? I recall when I used to use MySQL, I would create a "blog" table and a "comments" table. Now, I append the comments to the record in the same collection (aka table) and keep building on it.
2 - Yes ...
Schema design in document-oriented DBs can seem difficult at first, but while building my startup with Symfony2 and MongoDB I've found that 80% of the time it is just like working with a relational DB.
First, think of it like a normal DB:
To start, just create your schema as you would with a relational DB:
Each entity should have its own collection, especially if you'll need to paginate the documents in it.
(in Mongo you can somewhat paginate nested document arrays, but the capabilities are limited)
Then just remove overly complicated normalization:
do I need a separate category table? (simply write the category in a column/property as a string or embedded doc)
Can I store comments count directly as an Int in the Author collection? (then update the count with an event, for example in Doctrine ODM)
Embedded documents:
Use embedded documents only for:
clearness (nested documents like: addressInfo, billingInfo in the User collection)
to store tags/categories (e.g.: [ { name: "Sport", parent: "Hobby", page: "/sport" } ])
to store simple multiple values (for eg. in User collection: list of specialties, list of personal websites)
Don't use them when:
the parent Document will grow too large
when you need to paginate them
when you feel the entity is important enough to deserve its own collection
Duplicate values across collection and precompute counts:
Duplicate some column/attribute values from one collection to another if you need to query on those values in the where conditions. (remember there aren't joins)
e.g.: in the Ticket collection, also store the author name (not only the ID)
Also, if you need a counter (number of tickets opened by user, by category, etc.), precompute it.
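
The duplication and counter advice can be sketched together: when a ticket is created, the author's name is copied onto it and a per-user counter is incremented, so neither the listing nor the count needs a second lookup. A Map and an array stand in for the collections; `openTicket` and the field names are assumptions.

```javascript
const users = new Map([
  ["u1", { _id: "u1", name: "Maria", ticketCount: 0 }],
  ["u2", { _id: "u2", name: "Paolo", ticketCount: 0 }],
]);
const tickets = [];

// Create a ticket, copying the author name and bumping the precomputed counter.
function openTicket(authorId, title) {
  const author = users.get(authorId);
  if (!author) throw new Error(`unknown user ${authorId}`);
  tickets.push({ title, authorId, authorName: author.name });
  author.ticketCount += 1; // in MongoDB this would typically be an $inc update
}

openTicket("u1", "Login fails");
openTicket("u1", "Typo on home page");
openTicket("u2", "Slow search");

// Filtering by author name needs no lookup in the users collection.
const mariasTickets = tickets.filter((t) => t.authorName === "Maria");
console.log(mariasTickets.length, users.get("u1").ticketCount);
```

The price is the one discussed throughout this page: if Maria changes her name, every ticket carrying the copy has to be updated as well.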
Embed references:
When you have a One-to-Many or Many-to-Many reference, use an embedded array with the list of the referenced document ids (see MongoDB DB Ref).
You'll need to use an Event again to remove an id if the referenced document gets deleted.
(There is an extension for Doctrine ODM if you use it: Reference Integrity)
This kind of reference is directly managed by Doctrine ODM: Reference Many
It's easy to fix errors:
If you find out late that you have made a mistake in the schema design, it's quite simple to fix it with a few lines of JavaScript run directly in the Mongo console.
(stored procedures made easy: no need for complex migration scripts)
Warning: don't use Doctrine ODM Migrations, you'll regret that later.
Redid this answer since the original answer took the relation the wrong way round due to reading incorrectly.
issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}}
As to whether embedding some important information about the user (creator) of the ticket is a wise decision or not depends upon the system specifics.
Are you giving these users the ability to login and report issues they find? If so then it is likely you might want to factor that relation off to a user collection.
On the other hand, if that is not the case then you could easily get away with this schema. The one problem I see here is if you wish to contact the reporter and their job role has changed, that's somewhat awkward; however, that is a real world dilemma, not one for the database.
Since the subdocument represents a single one-to-one relation to a reporter you also should not suffer fragmentation problems mentioned in my original answer.
There is one glaring problem with this schema and that is duplication of changing repeating data (Normalised Form stuff).
Let's take an example. Imagine you hit the real world dilemma I spoke about earlier and a user called Nigel wants his role to reflect his new job position from now on. This means you have to update all rows where Nigel is the reporter and change his role to that new position. This can be a lengthy and resource consuming query for MongoDB.
To contradict myself again, if you were to only have maybe 100 tickets (aka something manageable) per user then the update operation would likely not be too bad and would, in fact, be manageable for the database quite easily; plus, due to the lack of movement (hopefully) of the documents, this would be a completely in-place update.
So whether this should be embedded or not depends heavily upon your querying, your documents, etc. However, I would say this schema isn't a good idea, specifically due to the duplication of changing data across many root documents. Technically, yes, you could get away with it, but I would not try.
I would instead split the two out.
If I have objects (subdocuments) in a document, can I update them all in a single query?
Just like the relation style in my original answer, yes and easily.
For example, let's update the role of Nigel to MD (as hinted earlier) and change the ticket status to completed:
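// multi:true is needed so that every ticket reported by Nigel is updated; by default update() changes only the first match
db.tickets.update({'reporter.username':'Nigel'},{$set:{'reporter.role':'MD', status: 'completed'}},{multi:true})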
So a single document schema does make CRUD easier in this case.
One thing to note, stemming from your English, you cannot use the positional operator to update all subdocuments under a root document. Instead it will update only the first found.
Again hopefully that makes sense and I haven't left anything out. HTH
Original Answer
here I have a user related to the issue). Should I create another document 'user' and reference it in 'issue' document by its id (like in relational databases), or should I leave all the user's data in the subdocument?
This is a considerable question and requires some background knowledge before continuing.
The first thing to consider is the size of an issue:
issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}}
Is not very big, and since you no longer need the reporter information (that would be on the root document) it could be smaller, however, issues are never that simple. If you take a look at the MongoDB JIRA for example: https://jira.mongodb.org/browse/SERVER-9548 (as a random page that proves my point) the contents of a "ticket" can actually be quite considerable.
The only way you would gain a true benefit from embedding the tickets would be if you could store ALL user information in a single 16 MB block of contiguous storage, which is the maximum size of a BSON document (as imposed by mongod currently).
I don't think you would be able to store all tickets under a single user.
Even if you were to shrink the ticket to, maybe, a code, title and a description, you could still suffer from the "swiss cheese" problem caused by regular updates and changes to documents in MongoDB. As ever, this: http://www.10gen.com/presentations/storage-engine-internals is a good reference for what I mean.
You would typically witness this problem as users add multiple tickets to their root user document. The tickets themselves will change as well but maybe not in a drastic or frequent manner.
You can, of course, remedy this problem a bit by using power of 2 sizes allocation: http://docs.mongodb.org/manual/reference/command/collMod/#usePowerOf2Sizes which will do exactly what it says on the tin.
Ok, hypothetically, if you were to only have code and title then yes, you could store the tickets as subdocuments in the root user without too many problems, however, this is something that comes down to specifics that the bounty assignee has not mentioned.
If I have objects (subdocuments) in a document, can I update them all in a single query?
Yes, quite easily. This is one thing that becomes easier with embedding. You could use a query like:
db.users.update({user_id:uid,'tickets.code':'asdf-1'}, {$set:{'tickets.$.title':'Oh NOES'}})
However, to note, you can only update ONE subdocument at a time using the positional operator. As such this means you cannot, in a single atomic operation, update all ticket dates on a single user to 5 days in the future.
As for adding a new ticket, that is quite simple:
db.users.update({user_id:uid},{$push:{tickets:{code:"asdf-1",title:"Whoop"}}})
So yes, you can quite simply, depending on your queries, update the entire users data in a single call.
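
The limitation of the positional operator can be illustrated with a small simulation. `updateFirstMatch` below is a hypothetical function that mimics what `tickets.$` does: it changes only the first array element that satisfies the query, even when several match. It is plain JavaScript, not driver code.

```javascript
// Mimic {$set: {'tickets.$.<field>': value}}: only the first match changes.
// Returns the number of array elements modified (0 or 1).
function updateFirstMatch(user, predicate, field, value) {
  const index = user.tickets.findIndex(predicate);
  if (index === -1) return 0;
  user.tickets[index][field] = value;
  return 1;
}

const user = {
  user_id: 7,
  tickets: [
    { code: "asdf-1", status: "open" },
    { code: "asdf-2", status: "open" },
    { code: "asdf-3", status: "closed" },
  ],
};

const modified = updateFirstMatch(user, (t) => t.status === "open", "status", "done");
console.log(modified, user.tickets.map((t) => t.status));
```

Two tickets match, but only the first one changes. Updating the rest means repeating the update or iterating in application code; later MongoDB releases added the all-positional operator `$[]` for this case.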
That was quite a long answer so hopefully I haven't missed anything out, hope it helps.
I like MongoDB, but I have to say that I will use it a lot more soberly in my next project.
Specifically, I have not had as much luck with the Embedded Document facility as people promise.
Embedded Document seems to be useful for Composition (see UML Composition), but not for aggregation. Leaf nodes are great, anything in the middle of your object graph should not be an embedded document. It will make searching and validating your data more of a struggle than you'd want.
One thing that is absolutely better in MongoDB is your many-to-X relationships. You can do a many-to-many with only two tables, and it's possible to represent a many-to-one relationship on either table. That is, you can either put 1 key in N rows, or N keys in 1 row, or both. Notably, queries to accomplish set operations (intersection, union, disjoint set, etc) are actually comprehensible by your coworkers. I have never been satisfied with these queries in SQL. I often have to settle for "two other people will understand this".
If you've ever had your data get really big, you know that inserts and updates can be constrained by how much the indexes cost. You need fewer indexes in MongoDB; an index on A-B-C can be used to query for A, A & B, or A & B & C (but not B, C, or B & C; a query on A & C can use only the A prefix of the index). Plus the ability to invert a relationship lets you move some indexes to secondary tables. My data hasn't gotten big enough to try, but I'm hoping that will help.
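
The prefix rule for compound indexes can be written down as a small check. `usablePrefix` is a hypothetical helper that returns how many leading index fields a query with equality conditions on the given fields could use; zero means the index does not help with filtering. A query on A and C still gets the A part, which is why it scores 1 rather than 0.

```javascript
// Count the leading index fields that also appear among the query's fields.
function usablePrefix(indexFields, queryFields) {
  const wanted = new Set(queryFields);
  let length = 0;
  for (const field of indexFields) {
    if (!wanted.has(field)) break;
    length += 1;
  }
  return length;
}

const index = ["A", "B", "C"];
console.log(usablePrefix(index, ["A"]));           // 1
console.log(usablePrefix(index, ["A", "B"]));      // 2
console.log(usablePrefix(index, ["A", "B", "C"])); // 3
console.log(usablePrefix(index, ["B", "C"]));      // 0
console.log(usablePrefix(index, ["A", "C"]));      // 1 (only the A prefix)
```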