MongoDB Schema: Nested, Flattened, or Independent Collections? - mongodb

We are writing an application in which we have multiple 'Projects'. Each 'Project' has multiple 'Boards'. Each 'Board' has its own set of 'Comments'. What is the recommended way to structure this in MongoDB?
= Option I (nested collection)
-Project
    |----- Board
              |----- Comments

= Option II (flattened collection)
-Project
    |----- Board
    |----- Comment
              |----- Board_ID

= Option III (independent collections)
-Project
-Boards
    |----- Project_ID
-Comments
    |----- Board_ID
There are 10,000 projects. Each project has 5 boards, so there are 50,000 boards in total. Each board has 20 comments, so there are 1,000,000 comments in total. Only one project and one board can be open in the application at a time.
So, if we pick Option I, then to get the associated comments for a particular project/board combination, we only have to query/parse through 20 comments. However, if we pick Option III, then to get the associated comments for a given project/board combination, we have to query/parse through 1,000,000 comments. So, in theory, Option I sounds faster and more efficient. However, Option I uses a nested collection: are there any disadvantages to a nested collection? Are there any reasons not to use nested collections in MongoDB, as in Option I?
MongoDB experts: which option (I, II, or III) is the recommended practice for such cases?

Probably the most important question is: What do you read and write together?
Only one project, and one board can be open in the application at one time.
So basically one board with its 20 comments is mainly read and written together? Then I'd store them in one document (embedded comments) and have a projects collection pointing to the boards collection.
Background:
Even if you read a single attribute from a document, the whole document is fetched from disk and loaded into RAM. Even if you limit the query to that single attribute, you'll still load the entire document; you just won't send it over the network.
Since there are no (multi-document) transactions, put the things you want to write atomically into a single document.
Avoid growing documents: a document must be stored as a single block on disk, and moving it when it outgrows its allocated space is expensive. Either preallocate values (if possible) or use separate documents for data you want to add later on.
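As a minimal mongo-shell sketch of that layout (the collection and field names here are illustrative assumptions, not from the question):

    // One document per board, with its ~20 comments embedded:
    const boardId = db.boards.insertOne({
      name: "Sprint planning",
      comments: [
        { author: "alice", text: "Looks good", created: new Date() }
      ]
    }).insertedId;

    // The projects collection points at its boards:
    db.projects.insertOne({ name: "Demo project", boards: [boardId] });

    // Opening a board fetches it and all of its comments in one read:
    db.boards.findOne({ _id: boardId });

    // Adding a comment is a single atomic update on one document:
    db.boards.updateOne(
      { _id: boardId },
      { $push: { comments: { author: "bob", text: "Agreed", created: new Date() } } }
    );

With only ~20 comments per board, the embedded array stays small, so the growing-document concern above remains manageable.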

Related

Mongo Architecture Efficiency

I am currently working on designing a local content-based sharing system that depends on MongoDB. I need to make a critical architecture decision that will undoubtedly have a huge impact on query performance, scaling, and overall long-term maintainability.
Our system has a library of topics; each topic is available in specific cities/metropolitan areas. When a person creates a piece of content, it needs to be stored as part of the topic in a specific city. There are three approaches I am currently considering to address these requirements (and I am open to other ideas as well).
Option 1 (Single Collection per Topic/City):
Example: a collection name would be TopicID123CityID456, and each entry would be a document within that collection.
Option 2 (Single Topic Collection):
Example: a collection name would be Topic123, and each entry would create a document that contains an indexed cityID.
Option 3 (Single City Collection):
Example: a collection name would be City456, and each entry would create a document that contains an indexed topicID.
When querying the DB, I always want to build a feed in date order based on the member's selected topic(s) and city. Since members can group multiple topics together to build a custom feed, option 3 seems the best fit; however, I am concerned about the long-term performance of this approach. Option 1 seems like it would be the most performant, but it also forces multiple queries when more than one topic is selected.
Another thing I need to consider is that some topics will be far more active and grow much larger than others, which will also vary by location.
Since I still consider myself a beginner with MongoDB, I want to make sure the general DB structure is right before coding all of the logic around writing and retrieving the data. And I don't know how well Mongo performs with hundreds of thousands, if not millions, of documents in a collection, hence my uncertainty about the approach.
From experience, which is the optimal way of tackling the storage and recall of this data? Any insight would be greatly appreciated.
UPDATE: June 22, 2016
It is important to note that we are starting in a single-DB-server environment. @profesor79 provided a great scaling solution for once we need to move to a multi-server (sharded) environment.
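For illustration, the date-ordered feed in a single-collection layout (options 2/3, or the sharded variant below) could be built with a query like the following; the collection and field names are invented:

    // Compound index to support the feed: filter by city and topics, newest first.
    db.content.createIndex({ cityID: 1, topicID: 1, created: -1 });

    // A member's custom feed: their selected topics within one city, in date order.
    db.content.find({
      cityID: "CityID456",
      topicID: { $in: ["TopicID123", "TopicID789"] }
    }).sort({ created: -1 }).limit(50);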
From your 3 proposals, I will pick number 4 :-)
Have one collection, sharded over multiple servers.
Instead of one collection per TopicCity, we could have a single collection for all topics and all cities.
The collection topicCities will then hold all the documents, sharded.
Sharding on the key {topic: 1, city: 1} will balance the load across the shard servers, and any time you need more power, you will be able to add another shard to the cluster.
Any comments welcome!
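A rough sketch of that setup in the mongo shell, run against a mongos router (the database name contentDB is an assumption):

    // Enable sharding for the database, then shard the single collection
    // on the compound key suggested above.
    sh.enableSharding("contentDB");
    sh.shardCollection("contentDB.topicCities", { topic: 1, city: 1 });

    // Queries that include the shard key are routed to a single shard:
    db.topicCities.find({ topic: "TopicID123", city: "CityID456" });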

All vs All comparisons on MongoDB

We are planning to use MongoDB for a general purpose system and it seems well suited to the particular data and use cases we have.
However, we have one use case where we will need to compare every document (of which there could be tens of millions) with every other document. The 'distance measure' could be precomputed offline by another system, but we are concerned about the online performance of MongoDB when we want to query, e.g. when we want to see the top 10 closest documents in the entire collection to a list of specific documents...
Is this likely to be slow? Also, can this be done across collections (e.g. query for the top 10 closest documents in one collection to a document in another collection)?
Thanks in advance,
FK
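No answer is recorded for this one, but for illustration: assuming the distances really are precomputed offline as described, one common approach is a dedicated pair collection with an index that turns "closest to X" into an index scan (all names here are invented):

    // One document per (from, to) pair, written by the offline job.
    // For tens of millions of documents, storing all pairs is infeasible,
    // so such a job typically keeps only the k nearest neighbours per document.
    const docA = ObjectId(), docB = ObjectId();
    db.distances.insertOne({ from: docA, to: docB, distance: 0.42 });
    db.distances.createIndex({ from: 1, distance: 1 });

    // Top 10 closest documents to docA, served straight from the index:
    db.distances.find({ from: docA }).sort({ distance: 1 }).limit(10);

Since the "to" field can reference an _id in any collection, the same layout also covers the cross-collection variant of the question.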

Best way to group meteor collections

The Meteor project I'm working on right now has multiple collections: Projects, Sensors, and Readings. Looking at it relationally, Users have many Projects, Projects have many Sensors, and Sensors have many Readings.
What's the best way to organize this within Meteor? Right now I have created three discrete collections, and each item in the Sensors and Readings collections has a "parent" field with a mongo _id to relate it to the parent.
I'm not sure if this is the best way to go about it; the alternative would be one all-encompassing Projects collection that contains all the information about the sensors and readings belonging to each project. When adding a new Sensor to a project, all you'd need to do is append a newly instantiated Sensor object to the project's Sensors array.
What do you think?
It all depends on your use case. Will your app be more write or read heavy?
The way you've set it up right now is called normalized. You don't have redundant data, and data in one collection references related data in other collections (like a table join in SQL). When you update data, you only have to update the small record it is part of. The downside is that reading a lot of data takes longer because of all the separate db queries.
The alternative you're suggesting is called denormalization: you add redundant data to records or group collections into larger collections. This has the advantage that reading data doesn't need as many db look-ups, since all the data you need is mostly in one record. On the other hand, writing to the db becomes more complex, since you no longer update one small record but one or more large records.
Either way, both methods are used; it just depends on the use case and how you want your app to perform (more read or more write performance).
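As a rough sketch of the two shapes in Meteor's classic collection API (the field names are assumptions):

    // Normalized (what you have now): children hold a "parent" reference.
    const Projects = new Mongo.Collection("projects");
    const Sensors  = new Mongo.Collection("sensors");
    const Readings = new Mongo.Collection("readings");

    const projectId = Projects.insert({ name: "Demo project" });
    const sensorId  = Sensors.insert({ parent: projectId, name: "temp-1" });
    Readings.insert({ parent: sensorId, value: 21.5, at: new Date() });

    // Denormalized alternative: embed sensors (and their readings) in the
    // project, appending a new sensor with a single atomic $push.
    Projects.update(
      { _id: projectId },
      { $push: { sensors: { name: "temp-1", readings: [] } } }
    );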

NoSQL schema for folder structure

I have documents that represent a folder structure. A folder can contain other folders (nested), theoretically unlimited levels deep, but more realistically 3 or 4 levels for our application. I need to be able to retrieve a single item (a node), and perhaps embedding will make this task a bit difficult?
Any suggestions?
The docs give a great summary of the more popular/common ways to store hierarchical data in MongoDB.
Embedding documents has significant drawbacks:
Hard to search
Hard to get back partial results
Can get unwieldy if you need a huge tree; further, there is a limit on document size in MongoDB (16 MB as of v1.8; the limit may rise in future versions)
As you need to be able to retrieve single items, embedding is not likely to be the best option for your use case.
An array of ancestors or a materialized path is likely to be much more suitable for what you've described. You could use the full file path as the _id, since it is unique and is the path you will most commonly look data up by.
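A minimal sketch of the materialized-path variant with the full path as _id (the sample data is invented):

    // Each node stores its full path as _id and its parent's path.
    db.folders.insertOne({ _id: "/docs",            name: "docs" });
    db.folders.insertOne({ _id: "/docs/reports",    name: "reports", parent: "/docs" });
    db.folders.insertOne({ _id: "/docs/reports/q1", name: "q1",      parent: "/docs/reports" });

    // Retrieving a single node is a direct _id lookup:
    db.folders.findOne({ _id: "/docs/reports" });

    // All descendants of a folder, via an anchored regex that can use the _id index:
    db.folders.find({ _id: /^\/docs\/reports\// });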

Many to many update in MongoDB without transactions

I have two collections with a many-to-many relationship. I want to store an array of linked ObjectIds in both documents so that I can take Document A and retrieve all linked Document B's quickly, and vice versa.
Creating this link is a two-step process:
Add Document A's ObjectId to Document B
Add Document B's ObjectId to Document A
After watching a MongoDB video, I found this to be the recommended way of storing a many-to-many relationship between two collections.
I need to be sure that both updates are made. What is the recommended way of robustly dealing with this crucial two-step process without a transaction?
I could condense this relationship into a single link collection, the advantage being a single update with no chance of Document B missing the link to Document A. The disadvantage is that I'm not really using MongoDB as intended. But because there is only a single update, it seems more robust to have a link collection that defines the many-to-many relationship.
Should I use safe mode and manually check the data went in afterwards and try again on failure? Or should I represent the many-to-many relationship in just one of the collections and rely on an index to make sure I can still quickly get the linked documents?
Any recommendations? Thanks
@Gareth, you have multiple legitimate ways to do this, so the key concern is how you plan to query the data (i.e.: which queries need to be fast).
Here are a couple of methods.
Method #1: the "links" collection
You could build a collection that simply contains mappings between the collections.
Pros:
Supports atomic updates so that data is not lost
Cons:
Extra query when trying to move between collections
Method #2: store copies of smaller mappings in larger collection
For example: you have millions of Products, but only a hundred Categories. Then you would store the Categories as an array inside each Product.
Pros:
Smallest footprint
Only need one update
Cons:
Extra query if you go the "wrong way"
Method #3: store copies of all mappings in both collections
(what you're suggesting)
Pros:
Single query access to move between either collection
Cons:
Potentially large indexes
Needs transactions (?)
Let's talk about "needs transactions". There are several ways to do transactions and it really depends on what type of safety you require.
Should I use safe mode and manually check the data went in afterwards and try again on failure?
You can definitely do this. You'll have to ask yourself, what's the worst that happens if only one of the saves fails?
Method #4: queue the change
I don't know if you've ever worked with queues, but if you have some leeway, you can build a simple queue and have separate jobs update their respective collections.
This is a much more advanced solution. I would tend to go with #2 or #3.
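As a rough sketch of the safe-mode/check-and-retry idea from the question, applied to method #3 and using $addToSet so that a retried step is idempotent (the collection and field names are invented):

    // Two-step link; re-running either step is a no-op if the id is already set.
    function link(aId, bId) {
      const r1 = db.collA.updateOne({ _id: aId }, { $addToSet: { bIds: bId } });
      const r2 = db.collB.updateOne({ _id: bId }, { $addToSet: { aIds: aId } });
      // Verify both writes found their document; otherwise retry (or enqueue
      // a repair job, as in method #4).
      if (r1.matchedCount !== 1 || r2.matchedCount !== 1) {
        throw new Error("link incomplete, retry link(" + aId + ", " + bId + ")");
      }
    }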
Why don't you create a dedicated collection holding the relations between A and B as dedicated documents, as one would do in an RDBMS? You can then modify the relation table with a single operation, which is of course atomic.
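A brief sketch of that relation collection (names invented): each link is one document, created in a single atomic insert, with a unique index to prevent duplicates.

    db.links.createIndex({ a_id: 1, b_id: 1 }, { unique: true });
    db.links.createIndex({ b_id: 1, a_id: 1 });   // for the reverse direction

    // Creating the link is a single atomic insert:
    const someAId = ObjectId(), someBId = ObjectId();
    db.links.insertOne({ a_id: someAId, b_id: someBId });

    // All B's for a given A (and vice versa, via the second index):
    db.links.find({ a_id: someAId });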
Should I use safe mode and manually check the data went in afterwards and try again on failure?
Yes, this is an approach, but there is another: you can implement an optimistic transaction. It has some overhead and limitations, but it guarantees data consistency. I wrote an example and some explanation on a GitHub page.
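The linked example is not reproduced here; as a generic illustration only (my sketch, not the answerer's code), optimistic concurrency is usually built on a version field that makes an update fail if the document changed underneath you:

    // Seed a document carrying a version counter.
    const aId = ObjectId(), bId = ObjectId();
    db.collA.insertOne({ _id: aId, version: 0, bIds: [] });

    // Read, then update only if the version is unchanged.
    const doc = db.collA.findOne({ _id: aId });
    const res = db.collA.updateOne(
      { _id: aId, version: doc.version },           // no match if doc changed
      { $addToSet: { bIds: bId }, $inc: { version: 1 } }
    );
    if (res.matchedCount === 0) {
      // Lost the race: re-read and retry, or give up after N attempts.
    }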