MongoDB: Declarative or Navigational? - mongodb

I am still new on the whole area of MongoDB systems.
I was wondering whether anyone of you knows if MongoDB is declarative or navigational when it comes to accessing objects within a document?
What I mean is:
-> Declarative: a pattern is given and the system works out the result. In other words, it works in the same way as SPJ queries
-> Navigational: it always starts from the beginning of a document and continues from there

SPJ (Select-Project-Join) is more than just accession of documents/rows, it is the entire process of forming a result set for queries.
I am unsure how the two: "Declarative" and "Navigational" are compatible. You talk about "Declarative" being the formation of a result set but then "Navigational" being related to accessing a document, i.e. reading the beginning of it.
I will answer what I believe your question to be about, the access patterns of documents.
I believe MongoDB leaves reading up to the OS itself (might use some C++ library to do its work for it) as such it starts from the beginning of a pointer (i.e. "Navigational") however, as to how it actually reads document does not really matter.
Here is why, MongoDB does not split a document up into many pieces to be stored on the hard disk, instead it stores a document as a single "block" of space. So I am going to turn around and say you probably need to look at the code to be sure exactly how it reads, though I don't see the point and here is a presentation that will help you understand more about the internals of MongoDB: http://www.mongodb.com/presentations/storage-engine-internals

Related

Performance Implications of Accessing Single MongoDB Document vs Different MongoDB Documents in The Same Collection

Say I have a MongoDB Document that contains within itself a list.
This list gets altered a lot and there's no real reason why it couldn't have its own collection and each of the items became a document.
Would there be any performance implications of the former? I've got an inkling that document read/writes are going to be blocked while any given connection tries to read it, but the same wouldn't be true for accessing different documents in the same collection.
I find that these questions are effectively impossible to 'answer' here on Stack Overflow. Not only is there not really a 'right' answer, but it is impossible to get enough context from the question to frame a response that appropriately factors in the items that are most important for you to consider in your specific situation. Nonetheless, here are some thoughts that come to mind that may help point you in the right direction.
Performance is obviously an important consideration here, so it's good to have it in mind as you think through the design. Even within the single realm of performance there are various aspects. For example, would it be acceptable for the source document and the associated secondary documents in another collection to be out of sync? If not, and you had to pursue a route such as using transactions to keep them aligned, then that may be a much bigger performance hit overall and not worth pursuing.
As broad as performance is, it is also just a single consideration here. What about usability? Are you able to succinctly express the type of modifications that you would be doing to the array using MongoDB's query language? What about retrieving the data, would you always pull the information back as a single logical document? If so, then that would imply needing to use $lookup very frequently. Even doing so via a view may be cumbersome and could be both a usability as well as performance consideration. Indeed, an overreliance on $lookup can be considered an antipattern.
What does it mean when you say that the list gets "altered" a lot? Are you inserting new information, or updating existing entries? There has been a 16MB size limit for individual documents for a long time in MongoDB, so they generally recommend avoiding unbounded arrays. Indeed processing them can be costly in various ways depending on some specific factors.
Also, where does your inkling about concurrency behavior come from? There is a FAQ on concurrency here which helps outline some of the expected behavior for various operations and their locking. Often (with any system) it can be most appropriate to build out an environment that appropriately represents your end state and stress test it directly. That often gives a good general sense for how the approach would work in your situation without having to become an expert in the particulars of how the database (or tool in general) works.
You can see that even in this short response, the "recommendation" fluctuates back and forth. Ultimately this question is about a trade-off which we are not in a good position answer for you. Hopefully this response helps give you some things to think about while doing so.

Best way to make a medical history DB in mongodb

i'm making a sistem that stores all medical , and healthy data from a person in a database , i've chosen mongodb to do the work but i'm new in mongodb modeling and i dont have an idea of whats the best way to do this.
Do i use a document for each pacient and insert subdocuments like this:
$evolution=array(); //subdocument
$record=array(); //subdocument
$prescriptions=array(); //subdocument
$exams=array(); //subdocument
$surgeries=array(); //subdocument
or do i create a new document for each one of these data?.
i know the limitation of document size that is 16 megabytes, but i don't know if the informations will reach the limmit.
The exact layout of your documents is highly dependent on the types of queries you need to make. Unfortunately without a detailed understanding of your use case it would be impossible to provide good advice about what is the best layout.
Depending on your use case it may be valid to have a document/patient with sub documents as you indicate. In some cases though it may be better to have a separate collection for each of the fields indicated. It all depends on how big those documents will be, what types of queries you will need to perform etc.
Some general advice:
Try to avoid queries that use multiple collections.
If your queries are getting difficult, you may have the wrong layout. Re-evaluate your layout any time you are in this situation.
Documents that constantly grow can create problems because Mongo constantly has to move them around in order to make room for the growth. If they will be growing quickly then reevaluate to see if there is a better layout.
While you can technically store different document layouts in the same collection in Mongo it is not generally considered a good practice. All documents in your collection should ideally follow some sort of schema even if that schema is not rigidly defined.
Field names matter. They take up space in Mongo so short field names are better if you expect to have a lot of data.
The best advice I can offer would be to start with what you think might work and see how it goes. If it gets awkward or difficult to get the information you need then reevaluate.

MongoDB space usage inefficiencies when using $push

Let's say that I have two collections, A and B. Among other things, one of them (collection A) has an array whose cells contain subdocuments with a handful of keys.
I also have a script that will go through a queue (external to MongoDB), insert its items on collection B, and push any relevant info from these items into subdocuments in an array in collection A, using $push. As the script runs, the size of the documents in collection A grows significantly.
The problem seems to be that, whenever a document does not fit its allocated size, MongoDB will move it internally, but it won't release the space it occupied previously---new MongoDB documents won't use that space, not unless I run a compact or repairDatabase command.
In my case, the script seems to scorch through my disk space quickly. It inserts a couple of items into collection B, then tries to inserts into a document in collection A, and (I'm guessing) relocates said document without reusing its old spot. Perhaps this does not happen every time, with padding, but when these documents are about 10MB in size, that means that every time it does happen it burns through a significant chunk of the DB, even though the actual data size remains small. The process eats up my (fairly small, admittedly) DB in minutes.
Requiring a compact or repairDatabase command every time this happens is clumsy: there is space on disk, and I would like MongoDB to use it without requesting it explicitly. The alternative of having a separate collection for the subdocuments in the array would fix this issue, and is probably a better design anyway, but one that will require me to make joins that I wanted to avoid, this being one of the advantages of NoSQL.
So, first, does MongoDB actually use space the way I described above? Second, am I approaching this the wrong way? Perhaps there is a parameter I can set to get MongoDB to reuse this space automatically; if there is, is it advisable to use it? And third, are there other, more fitting, design approaches I'm missing?
Most of the questions you have asked you should have already known (Google searching would have brought up 100's of links including critical blog posts on the matter) having tried to use MongoDB in such a case however, this presentation should answer like 90% of your questions: http://www.mongodb.com/presentations/storage-engine-internals
As for solving the problem through settings etc, not really possible here, power of 2 sizes won't help for an array which grows like this. So to answer:
Perhaps there is a parameter I can set to get MongoDB to reuse this space automatically; if there is, is it advisable to use it?
I would say no.
And third, are there other, more fitting, design approaches I'm missing?
For something like this I would recommend using a separate collection to store each of the array elements as a new row independent of the parent document.
Sammaye's recommendation was correct, but I needed to do some more digging to understand the cause of this issue. Here's what I found.
So, first, does MongoDB actually use space the way I described above?
Yes, but that's not as intended. See bug SERVER-8078, and its (non-obvious) duplicate, SERVER-2958. Frequent $push operations cause MongoDB to shuffle documents around, and their old spots are not (yet!) reused without a compact or repairDatabase command.
Second, am I approaching this the wrong way? Perhaps there is a parameter I can set to get MongoDB to reuse this space automatically; if there is, is it advisable to use it?
For some usages of $push, the usePowerOf2Size option initially consumes more memory, but stabilizes better (see the discussion on SERVER-8078). It may not work well with arrays that consistently tend to grow, which are a bad idea anyway because document sizes are capped.
And third, are there other, more fitting, design approaches I'm missing?
If an array is going to have hundreds or thousands of items, or if its length is arbitrary but likely large, it's better to move its cells to a different collection, despite the need for additional database calls.

MongoDB. Use cursor as value for $in in next query

Is there a way to use the cursor returned by the previous query as a value for $in in the next query? For example, something like this:
var users = db.user.find({state:1})
var offers = db.offer.find({user:{$in:users}})
I think this can reduce the traffic between mongodb and client in case the client doesn't need user information at all, just offers. Am i wrong?
Basically you want to do a join between two collections which Mongo doesn't support. You can reduce the amount of data being transferred from the server by limiting the fields returned from the first query to only the unique user information (i.e. the _id) that you need to get data from the offers collection.
If you really just want to make one query then you should store more information in the offers collection. For example, if you're trying to find offers for active users then you would store the active state of the user in the offers collection.
To work from your comment:
Yes, that's why I used tag 'join' in a question. The idea is that I
can make a first query more сomplex using a bunch of fields and
regexes without storing user data in other collections except
references. In these cases I always have to perform two consecutive
queries, but transfering of the results of the first query is not
necessary neither for me nor for the mongodb itself. I just want to
understand could it be done now, will it be possible to do so in the
future or it cannot be implemented for some technical reasons
As far as I understand it there is no immediate hurry to make this possible. Also the way it is coded atm will make this quite a big change to the way cursors work and are defined. A change big enough to possibly cause implementation breaks for other people. It is really a case of whether to set safe for inserts and updates for all future drivers. It is recognised that safe should be default but this will break implementation for other people who expect it the other way around.
It is rather inefficient if you don't require the results of the first query at all however since most networks are prepped with high traffic in mind and the traffic is cheap there hasn't been a demand to make it able to do chained queries server side in the cursor.
However subselects (which this basically is, it is selecting a set of rows based upon a sub selection of previous rows) have been on mongodb-user a couple of times and there might even be a JIRA for it somewhere, if not might be useful to make one.
As for doing it right now: there is no way.

Relations in Document-oriented database?

I'm interested in document-oriented databases, and I'd like to play with MongoDB. So I started a fairly simple project (an issue tracker), but am having hard times thinking in a non-relational way.
My problems:
I have two objects that relate to each other (e.g. issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}} - here I have a user related to the issue). Should I create another document 'user' and reference it in 'issue' document by its id (like in relational databases), or should I leave all the user's data in the subdocument?
If I have objects (subdocuments) in a document, can I update them all in a single query?
I'm totally new to document-oriented databases, and right now I'm trying to develop sort of a CMS using node.js and mongodb so I'm facing the same problems as you.
By trial and error I found this rule of thumb: I make a collection for every entity that may be a "subject" for my queries, while embedding the rest inside other objects.
For example, comments in a blog entry can be embedded, because usually they're bound to the entry itself and I can't think about a useful query made globally on all comments. On the other side, tags attached to a post might deserve their own collection, because even if they're bound to the post, you might want to reason globally about all the tags (for example making a list of trending topics).
In my mind this is actually pretty simple. Embedded documents can only be accessed via their master document. If you can envision a need to query an object outside the context of the master document, then don't embed it. Use a ref.
For your example
issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}}
I would make issue and reporter each their own document, and reference the reporter in the issue. You could also reference a list of issues in reporter. This way you won't duplicate reporters in issues, you can query them each separately, you can query reporter by issue, and you can query issues by reporter. If you embed reporter in issue, you can only query the one way, reporter by issue.
If you embed documents, you can update them all in a single query, but you have to repeat the update in each master document. This is another good reason to use reference documents.
The beauty of mongodb and other "NoSQL" product is that there isn't any schema to design. I use MongoDB and I love it, not having to write SQL queries and awful JOIN queries! So to answer your two questions.
1 - If you create multiple documents, you'll need make two calls to the DB. Not saying it's a bad thing but if you can throw everything into one document, why not? I recall when I used to use MySQL, I would create a "blog" table and a "comments" table. Now, I append the comments to the record in the same collection (aka table) and keep building on it.
2 - Yes ...
The schema design in Document-oriented DBs can seems difficult at first, but building my startup with Symfony2 and MongoDB I've found that the 80% of the time is just like with a relational DB.
At first, think it like a normal db:
To start, just create your schema as you would with a relational Db:
Each Entity should have his own Collection, especially if you'll need to paginate the documents in it.
(in Mongo you can somewhat paginate nested document arrays, but the capabilities are limited)
Then just remove overly complicated normalization:
do I need a separate category table? (simply write the category in a column/property as a string or embedded doc)
Can I store comments count directly as an Int in the Author collection? (then update the count with an event, for example in Doctrine ODM)
Embedded documents:
Use embedded documents only for:
clearness (nested documents like: addressInfo, billingInfo in the User collection)
to store tags/categories ( eg: [ name: "Sport", parent: "Hobby", page: "/sport"
] )
to store simple multiple values (for eg. in User collection: list of specialties, list of personal websites)
Don't use them when:
the parent Document will grow too large
when you need to paginate them
when you feel the entity is important enough to deserve his own collection
Duplicate values across collection and precompute counts:
Duplicate some columns/attributes values from a Collection to another if you need to do a query with each values in the where conditions. (remember there aren't joins)
eg: In the Ticket collection put also the author name (not only the ID)
Also if you need a counter (number of tickets opened by user, by category, ecc), precompute them.
Embed references:
When you have a One-to-Many or Many-to-Many reference, use an embedded array with the list of the referenced document ids (see MongoDB DB Ref).
You'll need to use an Event again to remove an id if the referenced document get deleted.
(There is an extension for Doctrine ODM if you use it: Reference Integrity)
This kind of references are directly managed by Doctrine ODM: Reference Many
Its easy to fix errors:
If you find late that you have made a mistake in the schema design, its quite simply to fix it with few lines of Javascript to run directly in the Mongo console.
(stored procedures made easy: no need of complex migration scripts)
Waring: don't use Doctrine ODM Migrations, you'll regret that later.
Redid this answer since the original answer took the relation the wrong way round due to reading incorrectly.
issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}}
As to whether embedding some important information about the user (creator) of the ticket is a wise decision or not depends upon the system specifics.
Are you giving these users the ability to login and report issues they find? If so then it is likely you might want to factor that relation off to a user collection.
On the other hand, if that is not the case then you could easily get away with this schema. The one problem I see here is if you wish to contact the reporter and their job role has changed, that's somewhat awkward; however, that is a real world dilemma, not one for the database.
Since the subdocument represents a single one-to-one relation to a reporter you also should not suffer fragmentation problems mentioned in my original answer.
There is one glaring problem with this schema and that is duplication of changing repeating data (Normalised Form stuff).
Let's take an example. Imagine you hit the real world dilemma I spoke about earlier and a user called Nigel wants his role to reflect his new job position from now on. This means you have to update all rows where Nigel is the reporter and change his role to that new position. This can be a lengthy and resource consuming query for MongoDB.
To contradict myself again, if you were to only have maybe 100 tickets (aka something manageable) per user then the update operation would likely not be too bad and would, in fact, by manageable for the database quite easily; plus due to the lack of movement (hopefully) of the documents this would be a completely in place update.
So whether this should be embedded or not depends heavily upn your querying and documents etc, however, I would say this schema isn't a good idea; specifically due to the duplication of changing data across many root documents. Technically, yes, you could get away with it but I would not try.
I would instead split the two out.
If I have objects (subdocuments) in a document, can I update them all in a single query?
Just like the relation style in my original answer, yes and easily.
For example, let's update the role of Nigel to MD (as hinted earlier) and change the ticket status to completed:
db.tickets.update({'reporter.username':'Nigel'},{$set:{'reporter.role':'MD', status: 'completed'}})
So a single document schema does make CRUD easier in this case.
One thing to note, stemming from your English, you cannot use the positional operator to update all subdocuments under a root document. Instead it will update only the first found.
Again hopefully that makes sense and I haven't left anything out. HTH
Original Answer
here I have a user related to the issue). Should I create another document 'user' and reference it in 'issue' document by its id (like in relational databases), or should I leave all the user's data in the subdocument?
This is a considerable question and requires some background knowledge before continuing.
First thing to consider is the size of a issue:
issue = {code:"asdf-11", title:"asdf", reporter:{username:"qwer", role:"manager"}}
Is not very big, and since you no longer need the reporter information (that would be on the root document) it could be smaller, however, issues are never that simple. If you take a look at the MongoDB JIRA for example: https://jira.mongodb.org/browse/SERVER-9548 (as a random page that proves my point) the contents of a "ticket" can actually be quite considerable.
The only way you would gain a true benefit from embedding the tickets would be if you could store ALL user information in a single 16 MB block of contigious sotrage which is the maximum size of a BSON document (as imposed by the mongod currently).
I don't think you would be able to store all tickets under a single user.
Even if you was to shrink the ticket to, maybe, a code, title and a description you could still suffer from the "swiss cheese" problem caused by regular updates and changes to documents in MongoDB, as ever this: http://www.10gen.com/presentations/storage-engine-internals is a good reference for what I mean.
You would typically witness this problem as users add multiple tickets to their root user document. The tickets themselves will change as well but maybe not in a drastic or frequent manner.
You can, of course, remedy this problem a bit by using power of 2 sizes allocation: http://docs.mongodb.org/manual/reference/command/collMod/#usePowerOf2Sizes which will do exactly what it says on the tin.
Ok, hypothetically, if you were to only have code and title then yes, you could store the tickets as subdocuments in the root user without too many problems, however, this is something that comes down to specifics that the bounty assignee has not mentioned.
If I have objects (subdocuments) in a document, can I update them all in a single query?
Yes, quite easily. This is one thing that becomes easier with embedding. You could use a query like:
db.users.update({user_id:uid,'tickets.code':'asdf-1'}, {$set:{'tickets.$.title':'Oh NOES'}})
However, to note, you can only update ONE subdocument at a time using the positional operator. As such this means you cannot, in a single atomic operation, update all ticket dates on a single user to 5 days in the future.
As for adding a new ticket, that is quite simple:
db.users.update({user_id:uid},{$push:{tickets:{code:asdf-1,title:"Whoop"}}})
So yes, you can quite simply, depending on your queries, update the entire users data in a single call.
That was quite a long answer so hopefully I haven't missed anything out, hope it helps.
I like MongoDB, but I have to say that I will use it a lot more soberly in my next project.
Specifically, I have not had as much luck with the Embedded Document facility as people promise.
Embedded Document seems to be useful for Composition (see UML Composition), but not for aggregation. Leaf nodes are great, anything in the middle of your object graph should not be an embedded document. It will make searching and validating your data more of a struggle than you'd want.
One thing that is absolutely better in MongoDB is your many-to-X relationships. You can do a many-to-many with only two tables, and it's possible to represent a many-to-one relationship on either table. That is, you can either put 1 key in N rows, or N keys in 1 row, or both. Notably, queries to accomplish set operations (intersection, union, disjoint set, etc) are actually comprehensible by your coworkers. I have never been satisfied with these queries in SQL. I often have to settle for "two other people will understand this".
If you've ever had your data get really big, you know that inserts and updates can be constrained by how much the indexes cost. You need fewer indexes in MongoDB; an index on A-B-C can be used to query for A, A & B, or A & B & C (but not B, C, B & C or A & C). Plus the ability to invert a relationship lets you move some indexes to secondary tables. My data hasn't gotten big enough to try, but I'm hoping that will help.