I'm beginning to use MongoDB, and I have a collection whose documents may have subdocuments, which in turn may have subdocuments... a tree:
{
    a: [
        {
            a: [
                { },
                { }
            ]
        },
        { }
    ],
    ...
},
{
    a: [
        {
            a: [
                { },
                { }
            ]
        },
        { }
    ],
    ...
}
Should I do it like that explicitly, should I store each subdocument in another collection and store their IDs in an array, or should I use a traditional relational SQL DB for this?
Thank you :)
MongoDB is a document-oriented database. Data that is usually broken up into multiple tables (i.e. "normalized") in a relational database is often stored together in document databases, sometimes in a single collection.
MongoDB is "schema-less," meaning it has no fixed schema, so the structure of its documents is very flexible. You can have sub-documents or even arrays in your documents, as you showed in your tree structure. This makes designing a NoSQL schema very challenging, since you have many options to choose from. Really, the best schema depends on your application and the queries you will perform against the database. In SQL databases, the data is normalized based on the data itself, without regard to queries.
Keep in mind that MongoDB doesn't support joins in normal queries (newer versions do offer the aggregation framework's $lookup), so anything you "split up" into multiple collections will generally need to be joined in your application. Go ahead and take advantage of MongoDB's flexible schema and try different options. Documents with many levels of nesting will make queries more complicated, so there's a trade-off as usual.
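To make the trade-off concrete, here is a minimal sketch of both options (collection and field names are hypothetical):

// Option A: embed the whole tree in one document. Good when the tree is
// read as a unit and stays well under the 16MB document limit.
db.trees.insertOne({
    name: "root",
    children: [
        { name: "child1", children: [ { name: "grandchild1", children: [] } ] },
        { name: "child2", children: [] }
    ]
});

// Option B: one document per node, with child IDs in an array. Good when
// subtrees are queried or updated independently.
db.nodes.insertOne({ _id: "root", childIds: ["child1", "child2"] });
db.nodes.insertOne({ _id: "child1", childIds: ["grandchild1"] });
db.nodes.insertOne({ _id: "child2", childIds: [] });
db.nodes.insertOne({ _id: "grandchild1", childIds: [] });

// With option B, each level of the tree costs an extra round trip:
var root = db.nodes.findOne({ _id: "root" });
db.nodes.find({ _id: { $in: root.childIds } });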
I am new to MongoDB and I have heard that it is good for massive amounts of read and write operations.
Embedded documents are one of the features that make that possible. But I am not sure whether they can also cause performance issues.
Book document example:
{
    "_id": 1,
    "Authors": [
        {
            "Email": "email",
            "Name": "name"
        }
    ],
    "Title": "title",
    ...
}
If there are thousands of books by one author, and his email needs to be updated, I need to write a query which can:
search through all book documents and pick out the thousands with this author
update the author's email field across those book documents
These operations do not seem efficient. But this type of update is ubiquitous, so I believe the developers have considered it. Where did I get it wrong?
Your current embedded schema design has its merits, one of them being data locality. Since MongoDB stores a document's data contiguously on disk, putting all the data you need in one document means the disk spends less time seeking to particular locations.
If your application frequently accesses book information along with the Authors data, then you'll almost certainly want to go the embedded route. The other advantage of embedded documents is atomicity and isolation when writing data.
To illustrate this, say you want all books by one author to have his email field updated. This can be done with a single operation (atomic per document), which is not a performance issue in MongoDB:
db.books.updateMany(
    { "Authors.Name": "foo" },
    { "$set": { "Authors.$.Email": "new@email.com" } }
);
or with earlier MongoDB versions:
db.books.update(
    { "Authors.Name": "foo" },
    { "$set": { "Authors.$.Email": "new@email.com" } },
    { "multi": true }
)
In the above, you use the positional $ operator, which facilitates updates to arrays of embedded documents by identifying an element to update without explicitly specifying its position in the array; use it with dot notation.
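One caveat worth noting: the positional $ operator only updates the first matching element of the array in each document. If the same author could appear more than once in one Authors array, a sketch using the filtered positional operator $[<identifier>] with arrayFilters (available since MongoDB 3.6) would update all of them:

db.books.updateMany(
    { "Authors.Name": "foo" },
    { "$set": { "Authors.$[a].Email": "new@email.com" } },
    { "arrayFilters": [ { "a.Name": "foo" } ] }
);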
For more details on data modelling in MongoDB, please read the docs Data Modeling Introduction, especially Model One-to-Many Relationships with Embedded Documents.
The other design option you can consider is referencing documents, where you follow a normalized schema. For example:
// db.books schema
{
    "_id": 3,
    "authors": [1, 2, 3], // <-- array of references to the authors collection
    "title": "foo"
}

// db.authors schema
{
    "_id": 1,
    "name": "foo",
    "surname": "bar",
    "address": "xxx",
    "email": "foo@mail.com"
}
{
    "_id": 2,
    "name": "abc",
    "surname": "def",
    "address": "xyz",
    "email": "abc@mail.com"
}
{
    "_id": 3,
    "name": "alice",
    "surname": "bob",
    "address": "xyz",
    "email": "alice@mail.com"
}
The normalized schema above, using the document-reference approach, also has an advantage when you have one-to-many relationships with very unpredictable arity. If you have hundreds or thousands of author documents per book entity, embedding runs into space constraints: the larger the document, the more RAM it uses, and MongoDB documents have a hard size limit of 16MB.
For querying a normalized schema, you can consider the aggregation framework's $lookup stage, which performs a left outer join against the authors collection in the same database to pull author data into documents from the books collection for processing.
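As a rough sketch of what that join could look like with the normalized schema above:

db.books.aggregate([
    { "$match": { "_id": 3 } },
    { "$lookup": {
        "from": "authors",        // collection to join with
        "localField": "authors",  // the array of author _ids in books
        "foreignField": "_id",    // matched against _id in authors
        "as": "authorDocs"        // joined author documents land here
    } }
]);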
That said, I believe your current schema is a better approach than creating a separate collection of authors, since separate collections require more work: finding a book plus its authors takes two queries, whereas with the embedded schema it is easy and fast (a single seek). There are no big differences for inserts and updates. Separate collections are good if you need to select individual documents, need more control over querying, or have huge documents. Embedded documents are good when you want the entire document, the document with a $slice of the embedded authors, or no authors at all.
The general rule of thumb is: if your application's query pattern is well known and data tends to be accessed in only one way, an embedded approach works well. If your application queries data in many ways, or you are unable to anticipate the query patterns, a more normalized document-referencing model is more appropriate.
Ref:
MongoDB Applied Design Patterns: Practical Use Cases with the Leading NoSQL Database By Rick Copeland
I think you basically have the wrong schema design. MongoDB allows you to structure your data hierarchically, but that isn't an excuse for structuring it inefficiently. If it's likely you'll need to update thousands of documents across entire collections on a regular basis, then it's worth asking whether you have the right schema design.
There are lots of articles covering schema design and comparing it with relational DB structures. For example:
http://blog.mongodb.org/post/87200945828/6-rules-of-thumb-for-mongodb-schema-design-part-1
I have two document formats and I can't decide which is the mongo way of doing things. Are the two examples equivalent? The idea is to search by userId, with userId indexed. It seems to me the performance will be equal for either schema.
multiple bookmarks as separate documents in a collection:
{
    userId: 123,
    bookmarkName: "google",
    bookmarkUrl: "www.google.com"
},
{
    userId: 123,
    bookmarkName: "yahoo",
    bookmarkUrl: "www.yahoo.com"
},
{
    userId: 456,
    bookmarkName: "google",
    bookmarkUrl: "www.google.com"
}
multiple bookmarks within one document per user:
{
    userId: 123,
    bookmarks: [
        {
            bookmarkName: "google",
            bookmarkUrl: "www.google.com"
        },
        {
            bookmarkName: "yahoo",
            bookmarkUrl: "www.yahoo.com"
        }
    ]
},
{
    userId: 456,
    bookmarks: [
        {
            bookmarkName: "google",
            bookmarkUrl: "www.google.com"
        }
    ]
}
The problem with the second option is that it causes growing documents. Growing documents are bad for write performance with the MMAPv1 storage engine, because the database constantly has to move them around in the database files.
To improve write performance, MMAPv1 writes each document as a contiguous block in the database files, with a little padding after each document. When a change makes a document grow beyond its current padding, the document has to be deleted and rewritten at the end of the current file, which is a quite slow operation.
Also, MongoDB has a hard limit of 16MB per document (partly to discourage endlessly growing documents). In your illustrated use case this might not be a problem, but I assume this is a simplified example and your actual data will have more fields per bookmark entry. If you store a lot of metadata with each entry, that 16MB limit could become a problem.
So I would recommend picking the first option.
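With the first option, the by-userId lookup from the question stays fast with a simple index; a minimal sketch, assuming a bookmarks collection:

// One index covers the "all bookmarks for a user" query:
db.bookmarks.createIndex({ userId: 1 });
db.bookmarks.find({ userId: 123 });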
I would go with option 2, multiple bookmarks within one document per user, because this schema takes advantage of MongoDB's rich documents, also known as "denormalized" models.
Embedded data models allow applications to store related pieces of information in the same database record. As a result, applications may need to issue fewer queries and updates to complete common operations. (From the MongoDB data modeling docs.)
There are two tools that allow applications to represent these relationships: references and embedded documents.
When designing data models, always consider the application usage of the data (i.e. queries, updates, and processing of the data) as well as the inherent structure of the data itself.
The second type of structure is the embedded type.
Generally, an embedded structure should be chosen when your application needs:
a) better performance for read operations.
b) the ability to request and retrieve related data in a single database operation.
c) data consistency: to update related data in a single atomic write operation (see the sketch after this list).
In MongoDB, operations are atomic at the document level. No single write operation can change more than one document. Operations that modify more than a single document in a collection still operate on one document at a time. Ensure that your application stores all fields with atomic dependency requirements in the same document. If the application can tolerate non-atomic updates for two pieces of data, you can store these data in separate documents. A data model that embeds related data in a single document facilitates these kinds of atomic operations.
d) to issue fewer queries and updates to complete common operations.
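For instance, with the embedded design, adding or renaming a bookmark is a single write to one document, and therefore atomic; a sketch, assuming the documents live in a users collection:

// Add a bookmark atomically:
db.users.updateOne(
    { userId: 123 },
    { $push: { bookmarks: { bookmarkName: "bing", bookmarkUrl: "www.bing.com" } } }
);

// Rename one embedded bookmark in place, still one atomic document write:
db.users.updateOne(
    { userId: 123, "bookmarks.bookmarkName": "yahoo" },
    { $set: { "bookmarks.$.bookmarkName": "yahoo!" } }
);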
When not to choose it:
Embedding related data in documents may lead to situations where documents grow after creation. Document growth can impact write performance and lead to data fragmentation (and there is a limit of 16MB per document).
Now let's compare the structures from a developer's perspective:
Say I want to see all the bookmarks of a particular user:
The first type would require an aggregation across all the documents.
The minimum set of stages needed to get the aggregated result is $match and $group (with the $push operator):
db.collection.aggregate([
    { $match: { "userId": 123 } },
    { $group: {
        "_id": "$userId",
        "bookmarkNames": { $push: "$bookmarkName" },
        "bookmarkUrls": { $push: "$bookmarkUrl" }
    } }
])
or a find() which returns multiple documents to be iterated.
Whereas the embedded type allows us to fetch everything with a simple $match in the find query:
db.collection.find({"userId":123});
This just indicates the added overhead from the developer's point of view. You can view the first type as an unwound form of the embedded document.
The first type, multiple bookmarks as separate documents in a collection, is normally used for logging: the log entries are huge and carry a TTL (time to live) so that documents are automatically deleted after a particular period, or the collection is capped so that the oldest entries are evicted once a size limit is reached.
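As a sketch of that logging pattern (collection and field names are hypothetical; note that TTL indexes and capped collections are separate mechanisms and cannot be combined):

// TTL index: documents are removed roughly 30 days after their createdAt time.
db.logs.createIndex({ createdAt: 1 }, { expireAfterSeconds: 2592000 });

// Capped collection: fixed size on disk; the oldest entries are evicted first.
db.createCollection("recentLogs", { capped: true, size: 104857600 });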
Bottom line: if your documents will not grow beyond 16 MB at any particular time, opt for the embedded type; it saves development effort as well.
See Also: MongoDB relationships: embed or reference?
The problem
As usual, the problem is to represent a directed acyclic graph in a database. My choices for the database were a relational one like MySQL, or MongoDB. I chose MongoDB because DAGs in relational databases are a mess, but if there is a trick I just didn't find, please tell me.
The goal is to map a DAG onto one or multiple MongoDB documents. Because nodes can have multiple children and multiple parents, plain subdocuments were not an option. I came across multiple design patterns but am not sure which one is best to go with.
Tree-structure with Ancestors Array
The ancestors array is suggested by the MongoDB docs and is quite easy to understand. As I understand it, my documents would look like this:
{
    "_id": "root",
    "ancestors": [ null ],
    "left": 1
}
{
    "_id": "child1",
    "ancestors": [ "root" ],
    "left": 2
}
{
    "_id": "child2",
    "ancestors": [ "root", "child1" ],
    "left": 1
}
This allows me to find all descendants of an element like this:
db.Tree.find({ancestors: 'root'}).sort({left: -1})
and all ancestors like this:
db.Tree.findOne({_id: 'child1'}).ancestors
DBRefs instead of Strings
My second approach would be to replace the string keys with DBRefs. But apart from longer database records, I don't see many advantages over the ancestors array.
String-based array with children and parents
The last idea is to store not only the children of each document but its parents as well. This would give me all the features I want. The downside is the massive overhead of storing every relation twice. Furthermore, I am worried about the amount of bookkeeping involved: e.g. if a document gets deleted, I have to check all other documents for references to it in multiple fields (see the sketch below).
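A sketch of what that cleanup would look like under this pattern (node documents assumed to carry both parents and children arrays of _id strings):

// Deleting "child1" means scrubbing the reference from both directions:
db.Tree.updateMany({ children: "child1" }, { $pull: { children: "child1" } });
db.Tree.updateMany({ parents: "child1" }, { $pull: { parents: "child1" } });
db.Tree.deleteOne({ _id: "child1" });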
My Questions
Is MongoDB the right choice over a relational database for this purpose?
Are there any up- or downsides of any of my patterns that I missed?
Which pattern would you suggest and why? Do you maybe have experience with one of them?
Why don't you use a graph database? Check out ArangoDB: you can use documents as in MongoDB, and also graphs. MongoDB is a great database, but not for storing graph-oriented data; ArangoDB is built for that.
https://www.arangodb.com/
I have a 'reports' collection in MongoDB. Each document holds the ObjectId of the user that the report belongs to. I am currently doing an aggregate so that I can group the reports by user.
db.reports.aggregate([
    {
        $group: {
            _id: "$user",
            stuff: {
                $push: {
                    things: {
                        properties: "$properties"
                    }
                }
            }
        }
    },
    {
        $project: {
            _id: 0,
            user: "$_id",
            stuff: "$stuff"
        }
    }
])
This gives me an array of user ids and the report 'stuff'. I am wondering if I can form the aggregate such that, instead of just the userId, I could hit the users collection and return more information about each user.
Is that possible? I am using Mongoose as an ODM, but looking at the Mongoose docs, aggregate looks to be a straight pass-through to MongoDB's aggregate function. I don't think I can take advantage of Mongoose's populate, as aggregate is not dealing with a schema, but I could be wrong.
Lastly, reports are computer-generated; each user could have millions of them. This prevents me from storing the reports in an array in the users collection.
No, you can't do that, as there are no joins in MongoDB (not in normal queries, MapReduce, or the aggregation framework). (Note: MongoDB 3.2 later added the $lookup aggregation stage, which performs a left outer join between two collections.)
Only one collection can be accessed at a time from the aggregation functions.
Mongoose won't directly help, or necessarily make the query for the additional user information any more efficient than doing an $in on a large batch of users at a time (an array of userIds).
There really aren't workarounds for this, as the lack of joins is currently an intentional design decision in MongoDB (i.e., it's how it works). Two paths you might consider:
You may find that another database platform would be better suited to the types of queries that you're trying to run.
You might try using $in as suggested above after the aggregation results are returned to your client code (this is one of the ways Mongoose handles fetching related documents). Just grab the userIds and request the associated user documents in batches. While it's a bit more network traffic, depending on how you want to display the results you may consider it an optimization to fetch the extra user information only as it's shown to a user (if that's possible), incrementally loading the extra data.
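A sketch of that second path, assuming the aggregation output from the question is in a results array and user documents live in a users collection:

// Collect the distinct userIds from the aggregation output...
var userIds = results.map(function (r) { return r.user; });

// ...then fetch the matching user documents in one batched query:
db.users.find({ _id: { $in: userIds } });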
You could combine the two collections into one and run the aggregation on that. Top-level documents in a collection don't need to share a structure. It wouldn't be too difficult to mark each document as type A or type B and adapt your queries involving type-A documents to filter on that type.
You might not like how it looks, but it would set you up for the "join".
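A sketch of that idea, with a hypothetical type field marking each document's kind:

// Reports and users share one collection, distinguished by "type":
db.combined.insertOne({ _id: 1, type: "user", name: "alice" });
db.combined.insertOne({ type: "report", user: 1, properties: { score: 42 } });

// Queries that used to target db.reports now add the type filter:
db.combined.aggregate([
    { $match: { type: "report" } },
    { $group: { _id: "$user", stuff: { $push: { properties: "$properties" } } } }
]);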
I am using MongoDB as my backend. I have data for movies, music, books and more, which I am storing in one single collection. The compulsory fields for every BSON entry are "_id", "name", and "category". The rest of the fields depend on the category to which the entry belongs.
For example, I have a movie record stored like this:
{
    "_id": <some_id>,
    "name": <movie_name>,
    "category": "movie",
    "director": <director_name>,
    "actors": <list_of_actors>,
    "genre": <list_of_genre>
}
For music, I have,
{
    "_id": <some_id>,
    "name": <music_name>,
    "category": "music",
    "record_label": <label_name>,
    "length": <length>,
    "lyrics": <lyrics>
}
Now I have 12 different categories, for which only _id, name, and category are common fields; the rest of the fields all differ between categories. Is my decision to store all data in one single collection fine, or should I make a different collection per category?
A single collection is best if you're searching across categories. Having a single collection might slow performance on inserts, but if you don't have high write volume, that shouldn't matter.
MongoDB allows you to store any field structure in a document, even if every document is different, so that isn't a concern. Having those 3 consistent fields means you can use them in the index and in your queries. This is a good example of where a schemaless database helps, because you can store everything in a single collection.
There is no performance hit for using a single collection in this way. Indeed, there is actually a benefit, because you can shard the collection as a scaling strategy later. Sharding is done at the collection level, so you could shard on the _id field to distribute documents evenly, or use your category field to place certain categories on each shard, or even a combination.
One thing to be aware of is future query requirements. If you need to index the other fields, you can use sparse indexes, which means documents without the indexed field won't appear in the index and so won't take any space in it; a handy optimisation.
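For example (field names taken from the movie record above, collection name assumed), a sparse index on director only contains the documents that actually have that field:

// Music and book documents lack "director", so they cost nothing here:
db.media.createIndex({ director: 1 }, { sparse: true });

// A category-prefixed compound index supports the common query shape:
db.media.createIndex({ category: 1, name: 1 });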
You should also be aware of growing documents if you make updates; with the older MMAPv1 storage engine this can have a major performance impact.