I'm working on a schema that will make heavy use of array fields in Mongo documents.
Are there any known problems with the approach of holding rather large arrays of other documents, etc.? Performance issues?
Being rather new to Mongo and coming from a SQL background, the approach seems "out of place" to me, since it's a bit different from grouping all records in a table under a set of "primary keys": here you hold the "primary keys" once and keep the rest of the data in arrays.
The "primary keys" approach is my other option in Mongo as well.
What is best?
I'm not aware of any special performance issues. What you need to keep in mind is that you have at most 16 MB per document. So if you have several hundred or thousand subdocs of some reasonable length, you may run into trouble with the document limit. Depending on how often you need those subdocs versus the primary docs, you may consider splitting them out. Otherwise your primary doc carries a lot of overhead (your subdocs), preventing other "business objects" from being kept in RAM. So this may be a point.
Additionally, although I haven't actually worked with arrays much, it may be wise to use only primary keys if those subdocs need to be displayed or queried across all parents, without the parent documents being needed. I think it highly depends on whether you need those subdocs separately and often.
What is best is a question of your use case, nothing in general :)
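To make that concrete, here's a minimal mongo-shell sketch of the two layouts; the collection and field names (orders, orderItems, and so on) are invented for the example, nothing from your schema:

// Option A: embed the subdocuments; the whole array counts against
// the 16 MB document limit and is loaded into RAM with its parent.
db.orders.insertOne({
  _id: 1,
  customer: "acme",
  items: [ { sku: "a1", qty: 2 }, { sku: "b7", qty: 5 } ]
})

// Option B: hold the "primary key" once and keep each former array
// element as its own document in a separate collection.
db.orderItems.insertMany([
  { orderId: 1, sku: "a1", qty: 2 },
  { orderId: 1, sku: "b7", qty: 5 }
])
db.orderItems.createIndex({ orderId: 1 })  // fast lookup by parent
db.orderItems.find({ orderId: 1 })         // fetch the subdocs on demand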
Say I have a MongoDB Document that contains within itself a list.
This list gets altered a lot, and there's no real reason why it couldn't have its own collection, with each of the items becoming a document.
Would there be any performance implications of the former? I've got an inkling that reads/writes to that document will be blocked while any given connection is reading it, but that the same wouldn't be true for accessing different documents in the same collection.
I find that these questions are effectively impossible to 'answer' here on Stack Overflow. Not only is there not really a 'right' answer, but it is impossible to get enough context from the question to frame a response that appropriately factors in the items that are most important for you to consider in your specific situation. Nonetheless, here are some thoughts that come to mind that may help point you in the right direction.
Performance is obviously an important consideration here, so it's good to have it in mind as you think through the design. Even within the single realm of performance there are various aspects. For example, would it be acceptable for the source document and the associated secondary documents in another collection to be out of sync? If not, and you had to pursue a route such as using transactions to keep them aligned, then that may be a much bigger performance hit overall and not worth pursuing.
As broad as performance is, it is also just a single consideration here. What about usability? Are you able to succinctly express the type of modifications that you would be doing to the array using MongoDB's query language? What about retrieving the data, would you always pull the information back as a single logical document? If so, then that would imply needing to use $lookup very frequently. Even doing so via a view may be cumbersome and could be both a usability as well as performance consideration. Indeed, an overreliance on $lookup can be considered an antipattern.
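For instance, if the list were split out, pulling everything back as one logical document would mean something like the following $lookup on every read; this is a sketch only, and parents, items, and parentId are invented names:

db.parents.aggregate([
  { $match: { _id: 42 } },
  { $lookup: {
      from: "items",            // the former embedded array, one doc per element
      localField: "_id",
      foreignField: "parentId",
      as: "items"
  } }
])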
What does it mean when you say that the list gets "altered" a lot? Are you inserting new information, or updating existing entries? There has been a 16MB size limit for individual documents for a long time in MongoDB, so they generally recommend avoiding unbounded arrays. Indeed processing them can be costly in various ways depending on some specific factors.
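If "altered" mostly means appending, one common way to keep an embedded array bounded is the $push with $each/$slice pattern; a generic sketch, assuming a hypothetical events array:

// Append an entry but keep only the most recent 1000 elements, so
// the embedded array can never grow without bound.
db.parents.updateOne(
  { _id: 42 },
  { $push: { events: { $each: [ { at: new Date(), v: 7 } ], $slice: -1000 } } }
)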
Also, where does your inkling about concurrency behavior come from? There is a FAQ on concurrency here which helps outline some of the expected behavior for various operations and their locking. Often (with any system) it can be most appropriate to build out an environment that appropriately represents your end state and stress test it directly. That often gives a good general sense for how the approach would work in your situation without having to become an expert in the particulars of how the database (or tool in general) works.
You can see that even in this short response, the "recommendation" fluctuates back and forth. Ultimately this question is about a trade-off which we are not in a good position to answer for you. Hopefully this response gives you some things to think about while you weigh it.
I'm building a system that stores all of the medical and health data for a person in a database. I've chosen MongoDB to do the work, but I'm new to MongoDB modelling and I don't have an idea of the best way to do this.
Do I use one document per patient and insert subdocuments like this:
$evolution = array();     // subdocument
$record = array();        // subdocument
$prescriptions = array(); // subdocument
$exams = array();         // subdocument
$surgeries = array();     // subdocument
or do I create a new document for each one of these data sets?
I know about the document size limit of 16 megabytes, but I don't know whether the information will reach that limit.
The exact layout of your documents is highly dependent on the types of queries you need to make. Unfortunately without a detailed understanding of your use case it would be impossible to provide good advice about what is the best layout.
Depending on your use case it may be valid to have a document/patient with sub documents as you indicate. In some cases though it may be better to have a separate collection for each of the fields indicated. It all depends on how big those documents will be, what types of queries you will need to perform etc.
Some general advice:
Try to avoid queries that use multiple collections.
If your queries are getting difficult, you may have the wrong layout. Re-evaluate your layout any time you are in this situation.
Documents that constantly grow can create problems because Mongo constantly has to move them around in order to make room for the growth. If they will be growing quickly then reevaluate to see if there is a better layout.
While you can technically store different document layouts in the same collection in Mongo it is not generally considered a good practice. All documents in your collection should ideally follow some sort of schema even if that schema is not rigidly defined.
Field names matter. They take up space in Mongo so short field names are better if you expect to have a lot of data.
The best advice I can offer would be to start with what you think might work and see how it goes. If it gets awkward or difficult to get the information you need then reevaluate.
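For what it's worth, here's a minimal mongo-shell sketch of the two layouts under discussion, reusing the field names from the question; treat it as an illustration of the trade-off, not a recommendation:

// Layout 1: one document per patient with embedded subdocuments.
db.patients.insertOne({
  _id: 1,
  name: "Jane Doe",
  prescriptions: [ { drug: "x", dose: "5mg" } ],
  exams: [ { type: "blood", date: ISODate("2012-01-01") } ],
  surgeries: []
})

// Layout 2: a separate collection per data type, each document
// referencing its patient.
db.prescriptions.insertOne({ patientId: 1, drug: "x", dose: "5mg" })
db.prescriptions.createIndex({ patientId: 1 })  // query a patient's data on demand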
In MongoDB, I want to use mongos and do MongoDB sharding over 2 machines. Is it common to have a single collection and to add documents to my collection such as:
{type:'user',name:'xxx',id:1,.........}
{type:'userentery',userid:1.........}
{type:'usersettings',userid:1.......}
{type:'userevent',userid:1.......}
{type:'SomthingNotRelated',....}
Is my understanding of how you should use MongoDB correct?
And is this the way to do horizontal scaling, and to avoid vertical scaling, by avoiding adding more collections?
What are the disadvantages of my approach?
If a user had a very big array, wouldn't it be better to put it in a separate document rather than in the user document itself?
"shredding" no such word for MongoDB. It is "sharding", since you cannot get the name right I would strongly suggest you read the documentation right here: http://docs.mongodb.org/manual/core/sharding/
Is my understanding of how you should use MongoDB correct?
From what I understand, yes.
and is this the way to do horizontal scaling and avoid vertical scaling by avoiding adding more collections?
More or less. Sometimes an aggregated version of another collection, containing unique or summed entries, is also helpful for scaling.
what are the disadvantages of my approach?
You haven't really described a specific approach to anything, so I cannot answer that.
if a user had a very big array wouldn't it be better to put it in a separate document rather than in the user document itself?
That depends on the operations on that array. If the array were consistently and continuously updated, so that it shifted dramatically in size on a regular basis, then yes, you would be better off splitting it out.
Such subdocuments are normally separate entities in their own right when thought of logically.
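As a hypothetical sketch of that split (the collection names echo the question's type values but are otherwise invented):

// The user document stays small and stable in size...
db.users.findOne({ id: 1 })

// ...while each former array element lives in its own document,
// keyed by userid and fetched only when needed.
db.userevents.insertOne({ userid: 1, type: "login", at: new Date() })
db.userevents.createIndex({ userid: 1 })
db.userevents.find({ userid: 1 }).sort({ at: -1 }).limit(10)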
Sharding is the ability of Mongo to split a single collection (any collection) into shards (pieces of the collection) spread across different small databases (to put it simply). For you it is completely transparent: you use a sharded collection "colX", and you can split it across several machines if you want. The only recommendation is that you have to be smart enough, and read the documentation, to choose a proper shard key that helps you split your collection in the best-balanced way possible.
You can use your collection, and if the type field is statistically well distributed across the collection (meaning that if you have 10 million records and 10 types, it's normal to have around 1 million of each), you can shard by type.
Your approach is correct, you just need to use the correct shard key.
One more comment to add to my note: a wrong shard key won't accelerate your queries much. If you query by type and your shard key is type, it's faster to reach the proper shard to return your information. In the other case, if you need, say, to query by date and date is not in your shard key, Mongo will have to send your query to every shard and merge the results at the end. Sharding helps you a lot in some cases and not so much in others; of course you multiply processing power and it's always better, but you won't always see a big difference if you didn't choose your shard key properly.
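A concrete sketch of what that looks like in the mongo shell (the database and collection names are made up for the example):

// Shard the collection; a compound key such as { type: 1, userid: 1 }
// keeps documents of one type together while still subdividing large types.
sh.enableSharding("mydb")
sh.shardCollection("mydb.events", { type: 1, userid: 1 })

// Targeted query: the shard key is in the query, so mongos routes it
// straight to the shard(s) that can hold matching documents.
db.events.find({ type: "userevent", userid: 1 })

// Scatter-gather: no shard key in the query, so mongos has to ask
// every shard and merge the results at the end.
db.events.find({ date: { $gte: ISODate("2013-01-01") } })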
I've been getting into Mongo, but coming from an RDBMS background I'm facing the probably obvious questions with regard to denormalisation and general data modelling.
If I have a document type with an array of sub docs, each sub doc has a status code.
In the relational world I would add a foreign key to the record, StatusId. Simple.
In MongoDB, would you denormalise the key pieces of data from the "status", e.g. code and description, and hold an ObjectId referencing another collection of proper statuses? I guess the next question is one of design: if the status doc is modified, I'd then need to modify the denormalised data?
Another question on the same theme: how would you model a transaction table? Say I have events and people; the events could be quite granular, say time sheets, which over time may lead to many records. Based on what I've seen, this would seem like a good candidate for a child/sub-array of docs, which of course could be indexed for speed.
Therefore, is it possible to query/find just the sub-array, or part of it? And given the 4mb limit for doc size, should I just limit the transaction history of the person? Or should the transaction history be a separate collection with an ObjectId referencing the person?
Thanks for any input
Sam
Or should the transaction history be a separate collection with an ObjectId referencing the person?
Probably. I think this S/O question may help you understand why.
if the status doc is modified, I'd then need to modify the denormalised data?
Yes, this is a standard trade-off in MongoDB. You will encounter this question a lot. You may need to leverage a queue structure to ensure that the data remains consistent across multiple collections.
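A rough sketch of what propagating a status change could look like; all names here are invented, and the arrayFilters option assumes a reasonably recent server:

// The status document changes...
var statusId = 7   // hypothetical _id of the modified status doc
db.statuses.updateOne({ _id: statusId }, { $set: { desc: "Approved" } })

// ...so every document that denormalised the description has to be
// patched as well; $[it] with arrayFilters updates each matching
// array element.
db.records.updateMany(
  { "items.statusId": statusId },
  { $set: { "items.$[it].statusDesc": "Approved" } },
  { arrayFilters: [ { "it.statusId": statusId } ] }
)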
Therefore is it possible to query / find just the sub array or part of it?
This is a tough one, specific to MongoDB. With the basic query syntax you have only limited support for dealing with arrays of objects. The new "Aggregation Framework" is actually much better here, but it's not yet available in a stable build.
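For reference, here is roughly what both approaches look like; the people/timesheets names are invented, and operator availability depends on your server version:

// Basic query syntax: project only part of the array rather than
// the whole document.
db.people.find(
  { "timesheets.week": 12 },
  { "timesheets.$": 1 }               // first matching element only
)
db.people.find({ _id: 1 }, { timesheets: { $slice: -10 } })  // last 10 entries

// Aggregation framework: unwind the array and filter its elements.
db.people.aggregate([
  { $match: { _id: 1 } },
  { $unwind: "$timesheets" },
  { $match: { "timesheets.week": 12 } }
])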
All your "how to model this or that" can't really be answered, because good schema design depends on so many factors (access patters, hardware characteristics, is cluster used, etc).
if the status doc is modified I'd then need to modified the denormalised data?
Usually yes, that's the drawback of denormalisation. But sometimes you don't have to (some social network site stores user name with a photo tag and doesn't update it when user changes his name).
to query / find just the sub array or part of it?
It is not currently possible to fetch only a part of array (unless using map/reduce, of course).
And given the 4mb limit
Where did you get this from? It's 16mb at the moment.
While it's true that schema design takes many factors into account, the need to denormalize data usually comes up somewhere. I tend to take advantage of denormalization in my apps that use MongoDB because I feel it lends itself well to storing denormalized data:
no additional column maintenance
support for hashes and arrays as field types (perfect for storing denormalized fields)
speedy, non-blocking writes make syncing data less expensive
document size growth only marginally affects performance up to limits (for the most part)
There are a few gems that help you manage denormalized data, including setting it up and keeping it in sync. If you're using Mongoid, you can try mongoid_alize. DISCLAIMER: I am the author and maintainer of mongoid_alize.
I'm evaluating MongoDB, coming from Membase/memcached, because I want more flexibility.
Of course Membase is excellent at doing fast (multi-)key lookups.
I like the additional options that MongoDB gives me, but is it also fast at doing multi-key lookups? I've seen the $or and $in operators, and I'm sure I can model the query with those. I just want to know whether it's performant (in the same league as Membase).
Use case: Lucene/Solr returns 20 product ids, say; look up these product ids in MongoDB to return the docs / appropriate fields.
Thanks,
Geert-Jan
For your use case I'd say it is, from my experience: I hacked some analytics into a database of mine that made a lot of $in queries with thousands of ids, and it worked fine (it was a hack). To my surprise, it worked rather well, in the low-millisecond range.
Of course it's hard to compare this, and, as usual, theory is a bad companion when it comes to performance. I guess the best way to figure it out is to migrate some test data and send some queries at the system.
Use MongoDB's excellent built-in profiler, use $explain, keep the one-index-per-query rule in mind, take a look at the logs, keep an eye on mongostat, and do some benchmarks. This shouldn't take too long and will give you a definitive answer. If your queries turn out slow, people here and on the newsgroup probably have some ideas on how to improve the exact query, or the indexing.
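A minimal sketch of that workflow in the shell (the collection and field names are assumptions for the example):

// Multi-key lookup: fetch the documents for the ids Solr returned.
var ids = [101, 102, 103]   // product ids from Lucene/Solr
db.products.createIndex({ productId: 1 })
db.products.find({ productId: { $in: ids } })

// Verify the query really uses the index (look for IXSCAN, not COLLSCAN).
db.products.find({ productId: { $in: ids } }).explain("executionStats")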
"One index per query. It's sometimes thought that queries on multiple keys can use multiple indexes; this is not the case with MongoDB. If you have a query that selects on multiple keys, and you want that query to use an index efficiently, then a compound-key index is necessary."
http://www.mongodb.org/display/DOCS/Indexing+Advice+and+FAQ#IndexingAdviceandFAQ-Oneindexperquery
There's more information on that page with regard to indexes as well.
The bottom line is that Mongo will be great if your indexes are in memory and you are indexing on the fields you want to query, using compound keys. If you have poor indexing then your performance will suffer as a result. This is pretty much in line with most systems.
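To illustrate the one-index-per-query advice with a hedged sketch (names invented; modern servers can sometimes intersect indexes, but a compound index remains the reliable approach):

// One compound index serves queries on both keys, and on the leading
// key alone; two separate single-field indexes would not be combined
// under the advice quoted above.
db.products.createIndex({ category: 1, price: -1 })

db.products.find({ category: "books", price: { $lte: 20 } })  // uses the index
db.products.find({ category: "books" })                       // leading prefix, also uses it
db.products.find({ price: { $lte: 20 } })                     // not a prefix; can't use it efficiently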