The problem
As usual, the problem is to store a directed acyclic graph in a database. The database choices I had were a relational database like MySQL, or MongoDB. I chose MongoDB because DAGs in relational databases are a mess, but if there is a trick I just didn't find, please tell me.
The goal is to map a DAG onto one or multiple MongoDB documents. Because nodes can have multiple children and multiple parents, plain subdocuments were not an option. I came across multiple design patterns but am not sure which one is best to go with.
Tree-structure with Ancestors Array
The Ancestors Array pattern is suggested by the MongoDB docs and is quite easy to understand. As I understand it, my documents would look like this:
{
"_id" : "root",
"ancestors" : [ null ],
"left": 1
}
{
"_id" : "child1",
"ancestors" : [ "root" ],
"left": 2
}
{
"_id" : "child2",
"ancestors" : [ "root", "child1" ],
"left": 1
}
This allows me to find all descendants of an element like this:
db.Tree.find({ancestors: 'root'}).sort({left: -1})
and all ancestors of an element like this:
db.Tree.findOne({_id: 'child1'}).ancestors
DBRefs instead of Strings
My second approach would be to replace the string keys with DBRefs. But apart from longer database records, I don't see many advantages over the ancestors array.
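For illustration, a DBRef is just a small embedded document with a fixed shape ($ref, $id and optionally $db), so the child1 document from above would look roughly like this:
{
"_id" : "child1",
"ancestors" : [ { "$ref" : "Tree", "$id" : "root" } ],
"left": 2
}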
String-based array with children and parents
The last idea is to store not only the children of each document but its parents as well. This would give me all the features I want. The downside is the massive overhead of storing every relation twice. Further, I am worried about the amount of bookkeeping involved, e.g. if a document gets deleted I have to check all the other documents for references to it in multiple fields.
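Roughly, such a document and the cleanup a delete would require might look like this (the field names parents/children and the $pull calls are only meant to illustrate the bookkeeping, not a tested schema):
{
"_id" : "child2",
"parents" : [ "root", "child1" ],
"children" : [ ]
}
// Deleting child1 means removing it and then scrubbing every reference to it:
db.Tree.deleteOne({ _id: "child1" })
db.Tree.updateMany({ parents: "child1" }, { "$pull": { parents: "child1" } })
db.Tree.updateMany({ children: "child1" }, { "$pull": { children: "child1" } })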
My Questions
Is MongoDb the right choice over a relational database for this purpose?
Are there any up-/downsides of any of my patterns that I missed?
Which pattern would you suggest and why? Do you maybe have experience with one of them?
Why don't you use a graph database? Check out ArangoDB: you can use documents like in MongoDB and also graphs. MongoDB is a great database, but not for storing graph-oriented documents; ArangoDB is.
https://www.arangodb.com/
Related
I'm figuring out whether it would be easy to move my already existing app to Google Cloud Datastore. It currently lives in MongoDB, so it would be a move from NoSQL to NoSQL, but it does not seem like it would be easy. An example:
My app uses a hierarchy of objects which is why I like MongoDB because of its elegance:
{
"_id" : "31851460-7052-4c89-ab51-8941eb5bdc7g",
"_status" : "PRIV",
"_active" : true,
"emails" : [
{ "id" : 1, "email" : "anexample#gmail.com", "_hash" : "1514e2c9e71318e5", "_objecttype" : "EmailObj" },
{ "id" : 1, "email" : "asecondexample#gmail.com", "_hash" : "78687668871318e7", "_objecttype" : "EmailObj" }
],
"numbers": [
...
],
"socialnetworks": [
...
],
"settings": [
...
]
}
While moving towards Google Cloud Datastore, I could save emails, numbers, socialnetworks, settings, etc. as strings, but that defeats the whole purpose of using JSON, as I would lose the ability to query inside them.
I have a number of tables where I have this issue.
The only solution I see is to move all of these into separate entities (tables). But in that case the number of queries will increase, and therefore also the cost.
Another solution might be to keep only arrays of ids and do key-value-style lookups on Google Cloud Datastore, which are free (except traffic, maybe).
What is the best approach here?
The transition from Mongo to Google's Datastore would not be trivial. A big part of the reason is that Datastore (although technically still NoSQL) is a columnar database, whereas Mongo is a document database. Just as SQL databases require a different mindset from NoSQL databases, columnar databases require a different mindset again.
The transition would therefore require a comprehensive restructuring of your data.
You don't have to save everything as a string. Datastore has what is called an Embedded Entity, which looks very much like what you have in Mongo: just a full object inside another.
Check out a library like Objectify, which makes it very easy to interact with Datastore.
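For what it's worth, here is a rough sketch with the Node.js client (@google-cloud/datastore) rather than Objectify; the kind and property names are simply lifted from the example document above, so treat this as an assumption rather than a recipe:
const { Datastore } = require('@google-cloud/datastore');
const datastore = new Datastore();

// Nested objects such as the entries of "emails" are stored as embedded
// entities, not as opaque strings, so the structure is preserved.
datastore.save({
  key: datastore.key(['User', '31851460-7052-4c89-ab51-8941eb5bdc7g']),
  data: {
    _status: 'PRIV',
    _active: true,
    emails: [
      { id: 1, email: 'anexample@gmail.com', _objecttype: 'EmailObj' }
    ]
  }
}).then(() => console.log('saved'));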
We version most of our collections in Mongodb. The selected versioning mechanism is as follows:
{ "docId" : 174, "v" : 1, "attr1": 165 } /*version 1 */
{ "docId" : 174, "v" : 2, "attr1": 165, "attr2": "A-1" }
{ "docId" : 174, "v" : 3, "attr1": 184, "attr2" : "A-1" }
So, when we perform our queries we always need to use the aggregation framework in this way to make sure we get the latest versions of our objects:
db.docs.aggregate( [
{"$sort":{"docId":-1,"v":-1}},
{"$group":{"_id":"$docId","doc":{"$first":"$$ROOT"}}}
{"$match":{<query>}}
] );
The problem with this approach is that once you have done your grouping, you have a set of data in memory which has nothing to do with your collection, and thus your indexes cannot be used.
As a result, the more documents your collection has, the slower the query gets.
Is there any way to speed this up?
If not, I will consider moving to one of the approaches described in this good post: http://www.askasya.com/post/trackversions/
Just to complete this question: we went with option 3, one collection to keep the latest versions and one collection to keep the historical ones. It is introduced here: http://www.askasya.com/post/trackversions/ and some further description (with some nice code snippets) can be found at http://www.askasya.com/post/revisitversions/.
It has been running in production for 6 months now. So far so good. The former approach meant we were always using the aggregation framework, which moves away from indexes as soon as you reshape the documents (using $group, $project...) because the pipeline output no longer matches the original collection. This was making our performance terrible as the data grew.
With the new approach the problem is gone. 90% of our queries go against the latest data, which means we target a collection with a simple ObjectId as identifier and we no longer need the aggregation framework, just regular finds.
Our queries against historical data always include id and version, so by indexing these (we include both in _id, so we get it out of the box), reads against those collections are equally fast. This is a point not to overlook, though: read patterns in your application are crucial when designing how your collections/schemas should look in MongoDB, so make sure you know them when taking such decisions.
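For completeness, a minimal sketch of the write path under this scheme, assuming a latest-versions collection docs and a history collection docs_history (the names and field layout are illustrative, not our exact code):
// Move the outgoing latest version into the history collection, keyed by
// docId + version, then overwrite the latest document in place.
var current = db.docs.findOne({ _id: 174 });
if (current) {
    current._id = { docId: current._id, v: current.v };
    db.docs_history.insertOne(current);
}
db.docs.replaceOne(
    { _id: 174 },
    { v: current ? current.v + 1 : 1, attr1: 184, attr2: "A-1" },
    { upsert: true }
);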
I am new to MongoDB and I have heard that MongoDB is good for massive amounts of read and write operations.
Embedded documents are one of the features that make this possible. But I am not sure whether they can also become a cause of performance issues.
Book document example:
{
"_id": 1,
"Authors": [
{
"Email": "email",
"Name": "name"
}
],
"Title": "title",
...
}
If there are thousands of books by one author and his email needs to be updated, I need to write a query which can
search through all book documents and pick out the thousands with this author
update the author's email field across these book documents
These operations do not seem efficient. But this type of update is ubiquitous, so I believe the developers have considered it. Where am I going wrong?
Your current embedded schema design has its merits, one of them being data locality. Since MongoDB stores data contiguously on disk, putting all the data you need in one document ensures that the spinning disks will take less time to seek to a particular location on the disk.
If your application frequently accesses books information along with the Authors data then you'll almost certainly want to go the embedded route. The other advantage with embedded documents is the atomicity and isolation in writing data.
To illustrate this, say you want all books by one author to have his email field updated; this can be done in a single update operation (atomic per document), which is not a performance issue with MongoDB:
db.books.updateMany(
    { "Authors.Name": "foo" },
    {
        "$set": { "Authors.$.Email": "new@email.com" }
    }
);
or with earlier MongoDB versions:
db.books.update(
    { "Authors.Name": "foo" },
    {
        "$set": { "Authors.$.Email": "new@email.com" }
    },
    { "multi": true }
)
In the above, you use the positional $ operator, which facilitates updating arrays that contain embedded documents: combined with dot notation, it identifies the element to update without you having to specify its position in the array explicitly.
For more details on data modelling in MongoDB, please read the docs Data Modeling Introduction, especially Model One-to-Many Relationships with Embedded Documents.
The other design option which you can consider is referencing documents where you follow a normalized schema. For example:
// db.books schema
{
"_id": 3,
"authors": [1, 2, 3], // <-- array of references to the authors collection
"title": "foo"
}
// db.authors schema
/*
1
*/
{
"_id": 1,
"name": "foo",
"surname": "bar",
"address": "xxx",
"email": "foo#mail.com"
}
/*
2
*/
{
"_id": 2,
"name": "abc",
"surname": "def",
"address": "xyz",
"email": "abc#mail.com"
}
/*
3
*/
{
"_id": 3,
"name": "alice",
"surname": "bob",
"address": "xyz",
"email": "alice#mail.com"
}
The normalized schema above, which uses the document-reference approach, also has an advantage when you have one-to-many relationships with very unpredictable arity. If you have hundreds or thousands of author documents per book entity, embedding has serious drawbacks as far as space is concerned, because the larger the document, the more RAM it uses, and MongoDB documents have a hard size limit of 16MB.
For querying a normalized schema, you can consider using the aggregation framework's $lookup operator, which performs a left outer join to the authors collection in the same database and pulls the referenced author documents into each book document for processing.
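A rough sketch of such a $lookup against the normalized schema above (field and collection names are taken from that example):
db.books.aggregate([
    { "$match": { "title": "foo" } },
    // pull the referenced author documents into each matching book
    { "$lookup": {
        "from": "authors",
        "localField": "authors",
        "foreignField": "_id",
        "as": "authorDocs"
    } }
])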
That said, I believe your current schema is a better approach than creating a separate collection of authors, since separate collections require more work: finding a book plus its authors takes two queries, whereas with the embedded schema above it is a single, fast read. There are no big differences for inserts and updates. Separate collections are good if you need to select individual author documents, need more control over querying, or have huge documents. Embedded documents are good when you want the entire document, the document with a $slice of the embedded authors, or no authors at all.
The general rule of thumb is that if your application's query patterns are well known and data tends to be accessed in only one way, an embedded approach works well. If your application queries data in many ways, or you are unable to anticipate the query patterns, a more normalized document-referencing model is appropriate.
Ref:
MongoDB Applied Design Patterns: Practical Use Cases with the Leading NoSQL Database By Rick Copeland
I think you basically have the wrong schema design. MongoDB allows you to structure your data hierarchically, but that isn't an excuse for structuring it inefficiently. If it is likely you'll need to update thousands of documents across entire collections on a regular basis, then it's worth asking whether you have the right schema design.
There are lots of articles covering schema design and the comparison with relational DB structures. For example:
http://blog.mongodb.org/post/87200945828/6-rules-of-thumb-for-mongodb-schema-design-part-1
I'm coming from a Postgres background, and I am currently contemplating whether I should use a NoSQL database such as MongoDB in my next project. For this I have a few questions.
Is it possible to perform queries in NoSQL that fetch all the documents sharing some common subdocument/attribute, e.g. "select all users where country = italy"?
Also, how is redundancy handled in NoSQL? Say I have a document that represents a given car model that multiple people can own. Would I then have to insert the exact same data describing that car model into all of these People documents?
Thanks
Sure, you can do queries with a where clause in MongoDB (and other NoSQL engines). Taking your example, you would store the users in a "collection" named "users" and query it more or less the same way:
db.users.find( { "country" : "Italy" } );
MongoDB has a very rich and powerful query and aggregation engine ( http://docs.mongodb.org/manual/tutorial/query-documents/ ). I invite you to follow the tutorials ( http://docs.mongodb.org/manual/tutorial/getting-started/ ) or the free online training ( http://university.mongodb.com ) to learn more about it.
Inserting a document is also really easy:
db.users.insert( {"first_name" : "John", "last_name" : "Doe", "country" : "USA" } );
that's it!
You ask about redundancy: as in your SQL world, it depends a lot on the design. In MongoDB you organize your documents and the links between them (linked or embedded documents) based on your business needs. It is hard to give an answer about document design in this context, so I will invite you to read some interesting articles:
MongoDB Documentation : http://docs.mongodb.org/manual/data-modeling/
MongoDB Blog, 6 Rules of Thumb for MongoDB Schema Design (part 1,2,3)
http://blog.mongodb.org/post/87200945828/6-rules-of-thumb-for-mongodb-schema-design-part-1
http://blog.mongodb.org/post/87892923503/6-rules-of-thumb-for-mongodb-schema-design-part-2
http://blog.mongodb.org/post/88473035333/6-rules-of-thumb-for-mongodb-schema-design-part-3
The answer to the question about users and cars is "It depends on your application".
If your application is mostly reads and you need most of the data about cars and users together, duplication (denormalization) is a good approach to make development easy (and yes, you will need to do more work when you have to update the information...). The blog posts and documentation should help you find your way.
Yes, you can run queries in MongoDB on documents that share a common subdocument/attribute.
MongoDB encourages embedding (denormalization) of data, as disk space is cheap and embedded docs can result in better query performance. And yes, embedding documents would mean inserting the same car-model data into all the People documents. If you want to avoid that duplication/denormalization, you can go for "referencing", a normalized model which stores the data something like this: a People collection contains the people docs and a Car collection contains the car docs, similar to what we have in an RDBMS. But the primary/foreign key relationship is not enforced by MongoDB, so you end up doing the joins in code, and query performance degrades accordingly.
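To make the referencing option concrete with the car example (collection and field names here are illustrative only):
// cars collection: the car model lives once in its own document
{ "_id" : "car-1", "brand" : "Fiat", "model" : "Panda" }

// people collection: each person only stores the car's id
{ "_id" : "person-1", "name" : "Mario", "car_id" : "car-1" }

// The "join" then happens in application code, i.e. two round trips:
var person = db.people.findOne({ _id: "person-1" });
var car = db.cars.findOne({ _id: person.car_id });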
Say I have a collection of users like this:-
{
"_id" : "1234",
"Name" : "John",
"OS" : "5.1",
"Groups" : [{
"_id" : "A",
"Name" : "Group A"
}, {
"_id" : "C",
"Name" : "Group C"
}]
}
And I have a collection of events like this:-
{
"_id" : "15342",
"Event" : "VIEW",
"UserId" : "1234"
}
I'm able to use mapreduce to work out the count of events per user as I can just emit the "UserId" and count off of that, however what I want to do now is count events by group.
If I had a "Groups" array in my event document then this would be easy, however I don't and this is only an example, the actual application of this is much more complicated and I don't want to replicate all that data into the event document.
I've see an example at http://tebros.com/2011/07/using-mongodb-mapreduce-to-join-2-collections/ but I can't see how that applies in this situation as it is aggregating values from two places... all I really want to do is perform a lookup.
In SQL I would simply JOIN my flattened UserGroup table to the event table and just GROUP BY UserGroup.GroupName
I'd be happy with multiple passes of mapreduce... a first pass to count by UserId into something like { "_id" : "1234", "count" : 9 }, but I get stuck on the next pass... how do I include the group ids?
Some potential approaches I've considered:-
Include group info in the event document (not feasible)
Work out how to "join" the user collection or look-up the users groups from within the map function so I can emit the group id's as well (don't know how to do this)
Work out how to "join" the event and user collections into a third collection I can run mapreduce over
What is possible and what are the benefits/issues with each approach?
Your third approach is the way to go:
Work out how to "join" the event and user collections into a third collection I can run mapreduce over
To do this you'll need to create a new collection J with the "joined" data you need for map-reduce. There are several strategies you can use for this:
Update your application to insert/update J in the normal course of business. This is best in the case where you need to run MR very frequently and with up-to-date data. It can add substantially to code complexity. From an implementation standpoint, you can do this either directly (by writing to J) or indirectly (by writing changes to a log collection L and then applying the "new" changes to J). If you choose the log collection approach you'll need a strategy for determining what's changed. There are two common ones: high-watermark (based on _id or a timestamp) and using the log collection as a queue with the findAndModify command.
Create/update J in batch mode. This is the way to go in the case of high-performance systems where the multiple updates from the above strategy would affect performance. This is also the way to go if you do not need to run the MR very frequently and/or you do not have to guarantee up-to-the-second data accuracy.
If you go with (2) you will have to iterate over documents in the collections you need to join--as you've figured out, Mongo map-reduce won't help you here. There are many possible ways to do this:
If you don't have many documents and if they are small, you can iterate outside of the DB with a direct connection to the DB.
If you cannot do (1) you can iterate inside the DB using db.eval(). If the number of documents is not small, make sure to use nolock: true, as db.eval is blocking by default. This is typically the strategy I choose, as I tend to deal with very large document sets and cannot afford to move them over the network.
If you cannot do (1) and do not want to do (2) you can clone the collections to another node with a temporary DB. Mongo has a convenient cloneCollection command for this. Note that this does not work if the DB requires authentication (don't ask why; it's a strange 10gen design choice). In that case you can use mongodump and mongorestore. Once you have the data local to a new DB you can party on it as you see fit. Once you complete the MR you can update the result collection in your production DB. I use this strategy for one-off map-reduce operations with heavy pre-processing so as to not load the production replica sets.
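Whichever of those you pick, the core of the batch "join" itself is just an iterate-and-copy pass. A rough sketch in the shell, using the user and event documents from the question (the collection names, and the name of the joined collection, are assumptions):
// Rebuild a joined collection by copying each event's owning user's group
// ids onto the event document; map-reduce (or aggregation) can then emit
// one count per group id straight from this collection.
db.eventsWithGroups.drop();
db.events.find().forEach(function (ev) {
    var user = db.users.findOne({ _id: ev.UserId });
    db.eventsWithGroups.insert({
        _id: ev._id,
        Event: ev.Event,
        UserId: ev.UserId,
        GroupIds: (user && user.Groups) ? user.Groups.map(function (g) { return g._id; }) : []
    });
});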
Good luck!