How to model mongodb for custom user data

I'm developing a cms using MongoDb and am trying to get some modelling advice. It's multi-tenant and each tenant can create their own schema and choose what custom fields they want searchable/indexed. The only thing I'm waffling on is how to model my collections. It seems to me like it would be ideal for each tenant to have their own collection due to indexing, but I am not very experienced with MongoDb and would love to hear if that's even a valid statement or not.
I'm thinking about separating each tenant's schema definitions from their data - perhaps a customSchema and customData collection for each tenant. Maybe something like customSchema_5543e1191a85d8946f0ee6fc and customData_5543e1191a85d8946f0ee6fc? The major question here is how many collections are feasible in MongoDb. I'm not clear if there's a cap with the new WiredTiger or not. If not, would such a large number of collections have any downsides?
Or, is it better to have just two collections with all tenants' data in them, along with all of their individual indexes? What are the pros and cons of this approach?
Any thoughts or suggestions are welcome, particularly if anyone has had experience doing something like this before.
Update:
My use case is a cms where tenants can specify their own data, like in Sharepoint or Expression Engine, or most other content APIs, like Contentful or CloudCMS. A user can say, "I want to store Products, and each product has a Name, Description, Quantity, and a Price". Another user could say, "I want to store Bands, and each band has a Name, a HomeCity, and whatever else." The users would then want to retrieve and display that data on their pages however they like. It's a basic cms scenario where tenants can create their own schema, then create, edit, and retrieve entries of those schemas. Tenants would need to be able to denote which fields they can search on, so this highly customizable indexing per tenant is the primary area of focus and concern in the modelling strategy.
I'm waffling between two big collections to store schemas and data, shared by all tenants, and a pair of those collections for every tenant. I just don't know the pros and cons of each of those solutions in MongoDb. I'm also open to any ideas I haven't thought of yet :)
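To make the two options concrete, here's a rough sketch of what I mean (all field names are placeholders, not a real design):

// Option A: one pair of collections per tenant (names follow the pattern above)
db.customSchema_5543e1191a85d8946f0ee6fc.insert({
  name: "Product",
  fields: [
    { name: "Name", type: "string", searchable: true },
    { name: "Description", type: "string", searchable: false },
    { name: "Quantity", type: "int", searchable: true },
    { name: "Price", type: "decimal", searchable: true }
  ]
})
// data documents store the tenant's fields at the top level, and indexes
// are created per tenant, only on the fields marked searchable
db.customData_5543e1191a85d8946f0ee6fc.createIndex({ Name: 1 })

// Option B: two collections shared by all tenants, every document tagged with its tenant
db.customData.insert({
  tenantId: ObjectId("5543e1191a85d8946f0ee6fc"),
  schemaName: "Product",
  data: { Name: "Widget", Description: "A widget", Quantity: 3, Price: 9.99 }
})
// every index then needs the tenant as a prefix
db.customData.createIndex({ tenantId: 1, "data.Name": 1 })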

Related

Designing mongodb data model: embedding vs. referencing

I'm writing an application that gathers statistics of users across multiple social networks accounts. I have a collection of users and I would like to store the statistics information of each user.
Now, I have two options:
Create a collection that stores users statistics documents, and add a reference object to each of the user documents that links it to the corresponding document in the statistics collection.
Embed a statistics document in each of the users document.
Besides query performance (which I'm less concerned about):
what are the pros and cons of each of these approaches?
What should I take into account if I choose to use references rather than embedding the information inside the user document?
The shape of the data is determined by the application itself.
There's a good chance that when you're working with the user data, you will also need the statistics details.
The decision about what to put in the document is pretty much determined by how the data is used by the application.
Data that is used together with the user documents is a good candidate for pre-joining or embedding.
One limitation of this approach is document size: a single document cannot exceed 16 MB.
Another approach is to split data between multiple collections.
One limitation of this approach is that MongoDB does not enforce constraints, so there are no foreign key constraints either.
The database does not guarantee consistency of the data; it is up to you as the programmer to make sure your data has no orphans.
Data from multiple collections can be joined by applying the $lookup operator. But each collection is a separate file on disk, so querying multiple collections means reading from multiple files, and that is, as you are probably guessing, slow.
Generally speaking, embedded data is the preferable approach.
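As a rough illustration of the two shapes (field names invented for the example), and of joining referenced data back with $lookup:

// Embedded: the statistics live inside each user document
{ _id: "alice", networks: ["twitter", "facebook"],
  statistics: { followers: 120, posts: 340 } }

// Referenced: statistics in their own collection, linked back by userId
// users:      { _id: "alice", networks: ["twitter", "facebook"] }
// statistics: { userId: "alice", followers: 120, posts: 340 }

// joining the referenced version back together
db.users.aggregate([
  { $lookup: { from: "statistics", localField: "_id",
               foreignField: "userId", as: "statistics" } }
])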

MongoDB beginner - to normalize or not to normalize?

I'm going to try and make this as straight-forward as I can.
Coming from MySQL and thinking in terms of tables, let's use the following example:
Let's say that we have a real-estate website and we're displaying a list of houses.
Normally, I'd use the following tables:
houses - the real estate asset at hand
owners - the owner of the house (one-to-many relationship with houses)
agencies - the real-estate broker agency (many-to-many relationship with houses)
images - many-to-one relationship with houses
reviews - many-to-one relationship with houses
I understand that MongoDB gives you the flexibility to design your web app with separate collections linked by unique IDs, much like a relational database (normalized), or, to get fast reads, to nest related objects and data inside a single collection (denormalized).
Back to our real-estate house list: the query used to populate it is quite expensive in a normal relational DB. For each house you need to query its images, reviews, owner and agencies; each entity resides in a different table with its own fields, so you'd probably use joins and combine multiple queries into one - expensive!
Enter MongoDB - where you don't need joins, and you can store all the related data of a house in a house item in the houses collection. Selection was never faster; it's DB heaven!
But what happens when you need to add/update/delete related reviews/agencies/owner/images?
This is a mystery to me. If I had to guess, each related entity lives in its own collection in addition to its copy inside the houses collection, and once one of these pieces of related data is added/updated/deleted, you'd have to update it in its own collection as well as in the houses collection. When doing that update - do I need to query the other collections as well to make sure I'm updating the house record with all the updated related data?
I'm just guessing here and would really appreciate your feedback.
Thanks,
Ajar
Try this approach:
Work out which entity (or entities) is the hero
By 'hero', I mean the entity (or entities) that the database is centered around. Let's take your example. The hero of the real-estate example is the house*.
Work out the ownerships
Go through the other entities, such as the owner, agency, images and reviews and ask yourself whether it makes sense to place their information together with the house. Would you have a cascading delete on any of the foreign keys in your relational database? If so, then that implies ownership.
Work out whether it actually matters that data is de-normalised
You will have agency (and probably owner) details spread across multiple houses. Does that matter? (A sketch of the update cost follows the house document below.)
Your house collection will probably look like this:
house: {
  owner,
  agency,
  images[], // recommend references to GridFS here
  reviews[] // you probably won't get too many of these for a single house
}
*Actually, it's probably the ad of the house (since houses are typically advertised on a real-estate website and that's probably what you're really interested in) so just consider that
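Whether or not it matters, the cost of denormalisation shows up on writes: when an embedded agency changes its details, every house that embeds a copy has to be touched. A sketch of that fix-up (assuming the embedded copy keeps the agency's original _id; names are illustrative):

// update the agency in its own collection (if you keep one)
db.agencies.update({ _id: agencyId }, { $set: { phone: "555-0100" } })
// then patch every house document that embeds a copy of it
db.houses.update(
  { "agency._id": agencyId },
  { $set: { "agency.phone": "555-0100" } },
  { multi: true }
)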
Sarah Mei wrote an informative article about the kinds of data-integrity issues that can arise in NoSQL DBs: the choice between duplicating data or using IDs, code-based joins, and the challenges of keeping data integrity. Her take is that any NoSQL DB with code-based joins will lose data integrity at some point. IMHO, the article's comments are as valuable as the article itself for understanding these issues and possible resolutions.
Link: http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/comment-page-1/
I would just like to give a normalization refresher from the MongoDB's perspective -
What are the goals of normalization?
Frees the database from modification anomalies - for MongoDB, it looks like embedding (duplicating) data is what would mostly cause these. In fact, we should try to avoid embedding data in documents in MongoDB where it could create these anomalies. Occasionally we might need to duplicate data in the documents for performance reasons; however, that's not the default approach. The default is to avoid it.
Should minimize re-design when extending - MongoDB is flexible enough because it allows addition of keys without re-designing all the documents
Avoid bias toward any particular access pattern - this is something we're not going to worry about when designing a schema in MongoDB. One of the ideas behind MongoDB is to tune your database to the applications we're trying to write and the problems we're trying to solve.

MongoDB Schema Design ordering service

I have the following objects: Company, User and Order (which contains order lines). Users place orders with one or more order lines, and these relate to a Company. The time period for which orders can be placed for this Company is only a week.
What I'm not sure of is where to place the orders array: should it be a collection of its own containing a link to the User and a link to the Company, or should it sit under the Company, or should the orders sit under the User?
Numbers-wise, I need to plan for 50k+ orders.
Query-wise, I'll mostly be looking at Orders by Company, but I would also need to find the Orders a specific user placed with a Company.
1) For folks coming from the SQL world (such as myself), one of the hardest things to learn about MongoDB is the new style of schema design. In the SQL world, everything goes into third normal form. Folks come to think that there is a single right way to design their schema, because there typically is one.
In the MongoDB world, there is no one best schema design. More accurately, in MongoDB schema design depends on how the application is going to access the data.
2) Here are the key questions that you need to have answered in order to design a good schema for MongoDB:
How much data do you have?
What are your most common operations? Will you be mostly inserting new data, updating existing data, or doing queries?
What are your most common queries?
How many I/O operations do you expect per second?
What you're talking about here is modeling Many-to-One relationships:
Company -> User
User -> Order
Order -> Order Lines
Company -> Order
Using SQL you would create a pair of master/detail tables with a primary key/foreign key relationship. In MongoDB, you have a number of choices: you can embed the data, you can create a linked relationship, you can duplicate and denormalize the data, or you can use a hybrid approach.
The correct approach would depend on a lot of details about the use case of your application, many of which you haven't provided.
3) This is my best guess - and it's only a guess - as to a good schema for you; a rough sketch of the resulting documents follows the list below.
a) Have separate collections for Users, Companies, and Orders
If you're looking at 50k+ orders, there are too many to embed in a single document. Having them as a separate collection will allow you to reference them from both the Company and the User documents.
b) Have an array of references to the Order documents in both the Company and the User documents. This makes the query "Find all Orders for this Company" a single-document query
c) If your query pattern supports it, you might also have a duplicate link from Orders back to the owning Company and/or User.
d) Assuming that the order lines are unique to the individual Order, you would embed the Order Lines in an array within the Order documents.
e) If your order lines refer back to individual Products, you might want to have a separate Product collection, and include a reference to the Product document in the order line sub-document
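Putting a) through e) together, the documents might look roughly like this (all names are illustrative; this is only a sketch of the guess above, not a definitive design):

// companies: array of references to orders (point b)
{ _id: companyId, name: "Acme Ltd", orders: [orderId1, orderId2] }

// users: also hold references to their own orders (point b)
{ _id: userId, companyId: companyId, orders: [orderId1] }

// orders: back-references to the owners (point c) and embedded order lines (points d, e)
{ _id: orderId1, companyId: companyId, userId: userId,
  lines: [ { productId: productId, qty: 2, price: 9.99 } ] }

// "Orders for this Company placed by a specific user" via the back-references:
db.orders.find({ companyId: companyId, userId: userId })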
4) Here are some good general references on MongoDB schema design.
MongoDB presentations:
http://www.10gen.com/presentations/mongosf2011/schemabasics
http://www.10gen.com/presentations/mongosv-2011/schema-design-by-example
http://www.10gen.com/presentations/mongosf2011/schemascale
Here are a couple of books about MongoDB schema design that I think you would find useful:
http://www.manning.com/banker/ (MongoDB in Action)
http://shop.oreilly.com/product/0636920018391.do
Here are some sample schema designs:
http://docs.mongodb.org/manual/use-cases/
Note that the "MongoDB in Action" book includes a sample schema for an e-commerce application, which is very similar to what you're trying to build -- I recommend you check it out.

Sharing a document with users

I have to choose a database for implementing a sharing system.
My system will have users and documents. I have to share a document with a few users.
Example:
There are 2 users, and there is one document.
So if I have to share that one document with both the users, I could do these possible solutions:
The current method I'm using is with MySQL (I don't want to use this):
Relational Databases (MySQL)
Users Table = user1, user2
Docs Table = doc1
Docs-User Relation Table = (doc1, user1), (doc1, user2)
And I would like to use something like this:
NoSQL Document Stores (MongoDB)
Users Documents:
{
  _id: user1,
  docs_i_have_access_to: [doc1]
}
{
  _id: user2,
  docs_i_have_access_to: [doc1]
}
Document's Document:
{
  _id: doc1,
  members_of_this_doc: [user1, user2]
}
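With this shape, I'm assuming the day-to-day operations would look something like this (collection names users/docs are just placeholders):

// share doc1 with user2: update both sides of the relationship
db.users.update({ _id: "user2" }, { $addToSet: { docs_i_have_access_to: "doc1" } })
db.docs.update({ _id: "doc1" }, { $addToSet: { members_of_this_doc: "user2" } })

// documents user1 can access
db.users.findOne({ _id: "user1" }).docs_i_have_access_to

// documents shared by both users
db.docs.find({ members_of_this_doc: { $all: ["user1", "user2"] } })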
And I don't yet know how I would implement in a key-value store like Redis.
So I just wanted to know: would the MongoDB way I have given above be the best solution?
And is there any other way I could implement this? Maybe with another database solution?
Should I try to implement it with Redis or not?
Which database and which method should I choose and will be the best to share the data and why?
Note: I want something highly scalable and persistent. :D
Thanks. :D
Actually, you need to represent a many-to-many relationship. One user can have several documents. One document can be shared among several users.
See my previous answer to this question: how to have relations many to many in redis
With Redis, representing relationship with the set datatype is a pretty common pattern. You can expect to get better performance than with MongoDB for this kind of data model. And as a bonus, you can easily and efficiently find which users have a given list of documents in common, or which documents are shared by a given set of users.
Considering only this simple example (you just need to keep track of who owns what), SQL seems the most appropriate, as it gives you additional options almost for free, such as reporting who has how many docs, the most popular documents, the most active users, etc., and the data will be more consistent (no duplication, and possibly foreign keys). This is valid unless you have millions of documents, of course.
If I were choosing between a document-oriented and a relational DB, I'd make the decision based mostly on the structure of the documents themselves: whether they're all uniform or have different fields for different types, and whether you need nested sub-documents or arrays with the ability to search by their contents.

Using mongodb for form-building web application, schema considerations

I'm creating a web application with a dynamic survey creation & submission component. I'm using MongoDB to store the schema of the form and the form submissions.
I can imagine organizing this in several different ways:
Having all form submissions and form schemas as documents in a single collection.
Have separate collections for all form schemas and all form submissions
Have a separate collection for all form schemas, and create a new collection for all submissions of a form for each schema.
I'm still researching this; I come from the world of RDBMS and I'm a noob to NoSQL databases. Does anyone have any advice?
EDIT 1
Forgot to address embedding the responses as a property within the form schema document.
Having all form submissions and form schemas as documents in a single collection.
You will want to avoid this one (#1). The simple reason here is that the form submission has a different role than the form schema. Mixing these in the same collection will make it more difficult to query.
Have separate collections for all form schemas and all form submissions
To clarify, it sounds like you're suggesting two collections: schema and submission.
This is a logical way to proceed. You will have one small schema collection and one large submission collection.
The key limitation will be the queries you make against that submission collection. Are you planning to query "across types"? Or are major queries centered about "submission type"?
If you end up including "submission type" on every query, then it makes sense to...
Have a separate collection for all form schemas, and create a new collection for all submissions of a form for each schema.
The reason for this is simply the indexes. If you have one collection, you will need an index on "type". So by making separate collections, you can save an index. However, if you ever end up needing the sharding features, this can make for lots of collections to manage.
Of course, you can work around this "extra index", by being creative with the _id. MongoDB has an auto-generated ObjectId that it will use by default, kind of like an auto-increment ID. However, you can override this and create a smarter _id, something like submissionid_userid.
My preference is honestly for the last option. But really, #2 and #3 are both good options; it's just an issue of trade-offs in terms of code complexity and management complexity.
I'd go for two collections: form and submissions.
This approach scales horizontally well, as you only have two collections to worry about.
I agree with @Gates VP about providing a custom _id rather than the default ObjectId, as you are spared the need for an extra index.
On the submissions collection, if you set the _id format to formID_userID, then to get all the submissions for a given form all you'd need to do is:
db.submissions.find({'_id': /^formID/})
The bonus here is that the anchored regex will use the _id_ index - so it's efficient.
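For completeness, writing a submission with such a composite _id might look like this (a sketch; the separator and the payload fields are made up):

db.submissions.insert({
  _id: formID + "_" + userID,          // e.g. "form123_user456"
  submittedAt: new Date(),
  answers: { q1: "yes", q2: "blue" }   // the actual form payload
})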
For general reference and for others stumbling upon this: there are some good presentations about schema design that are worth checking out:
http://www.10gen.com/presentations/mongodb-tokyo-2012/basic-application-and-schema-design
http://www.10gen.com/presentations/mongosv-2011/schema-design-principles-and-practice