I'm creating a platform whose customers (users) come from different organisations, so I would like to keep their data totally separated according to the organisation they belong to. How would you suggest storing such data in MongoDB? At which level?
Are you keeping the data separate for security reasons (i.e. compliance or regulation) or simply for administration/ease-of-use?
If it's the former, I'd go with separate databases at the very least, if not separate MongoDB instances. Separate instances enable you to perform segregation at the IP level through something like iptables, so that you can tie different instances down to different IP ranges representing the different organisations, presuming they will be accessing the data.
If it's the latter, I'd still go with separate databases, because it gives you the ability to have different users at the database level and, from version 2.2, concurrency will be at the database level (so there's no sharing of the write lock, for example, that you'd have if you split the data out at the collection level).
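A minimal sketch of the database-per-organisation idea, assuming a hypothetical naming scheme (the `org_` prefix and the slug cleaning are illustrative, not something MongoDB prescribes). It only derives a safe database name; the 64-character check reflects MongoDB's limit on database name length.

```python
import re

# Anything outside letters, digits and underscore is replaced, which also
# avoids the characters MongoDB rejects in database names (/, \, ., space, ...).
_UNSAFE = re.compile(r"[^A-Za-z0-9_]")


def org_database_name(org_slug: str, prefix: str = "org_") -> str:
    """Map an organisation identifier to its own database name (hypothetical scheme)."""
    cleaned = _UNSAFE.sub("_", org_slug.strip().lower())
    if not cleaned:
        raise ValueError("organisation slug must not be empty")
    name = prefix + cleaned
    # MongoDB database names must have fewer than 64 characters.
    if len(name) >= 64:
        raise ValueError("database name too long: " + name)
    return name
```

With a driver such as PyMongo, the result could then be used as `client[org_database_name("Acme Ltd")]` to obtain that organisation's database handle.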
As an FYI, here is some additional information on schema design in MongoDB:
Schema Design
Schema Design Presentation by Kyle Banker
Schema Design Blogs from Customers
MongoSF2012: mongodb-schema-design-insights-and-tradeoffs
There was actually a schema introduction webinar held last week that you can now listen to.
You can create a document for each organization and put the users' details into sub-documents inside the root document.
If the combined user profiles are so big that they don't fit within MongoDB's document size limit (16 MB), then you can use a different approach: create a document for every user and add a field referring to the organization.
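The two shapes described above could look roughly like this. The field names (`users`, `org_id`) are assumptions chosen for illustration; the functions just build plain dictionaries of the kind a driver would insert.

```python
def embedded_org(org_name, users):
    """One document per organisation, with users embedded as sub-documents."""
    return {"_id": org_name, "users": [dict(u) for u in users]}


def referenced_users(org_name, users):
    """One document per user, each carrying a reference to its organisation."""
    return [{**u, "org_id": org_name} for u in users]


users = [{"name": "ann"}, {"name": "bob"}]
org_doc = embedded_org("acme", users)        # a single, potentially large document
user_docs = referenced_users("acme", users)  # many small documents
```

The embedded form grows with every user added, which is where the 16 MB limit becomes relevant; the referenced form keeps each document small at the cost of a second query to assemble an organisation.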
I'm writing an application that gathers statistics of users across multiple social network accounts. I have a collection of users and I would like to store the statistics information of each user.
Now, I have two options:
Create a collection that stores users statistics documents, and add a reference object to each of the user documents that links it to the corresponding document in the statistics collection.
Embed a statistics document in each of the users document.
Besides query performance (which I'm less concerned about):
what are the pros and cons of each of these approaches?
What should I take into account if I choose to use references rather than embedding the information inside the user document?
The shape of the data is determined by the application itself.
There's a good chance that when you are working with user data, you also need the statistics details.
The decision about what to put in the document is pretty much determined by how the data is used by the application.
Data that is used together with the user documents is a good candidate to be pre-joined, that is, embedded.
One of the limitations of this approach is the size of the document: it can be at most 16 MB.
Another approach is to split data between multiple collections.
One of the limitations of this approach is that MongoDB does not enforce constraints between collections, so there are no foreign key constraints either.
The database does not guarantee the consistency of the data. It is up to you as a programmer to make sure that your data has no orphans.
Data from multiple collections can be joined by applying the $lookup aggregation stage. But a collection is a separate file on disk, so seeking in multiple collections means seeking in multiple files, and that is, as you are probably guessing, slow.
Generally speaking, embedded data is the preferable approach.
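For the referenced layout, a sketch of the two things mentioned above: a `$lookup` pipeline that joins a user with its statistics, and an application-side orphan check, since the database will not do one for you. The collection name `statistics` and the field `stats_id` are assumptions for illustration.

```python
def stats_lookup_pipeline(user_id):
    """Aggregation pipeline joining one user with its statistics document."""
    return [
        {"$match": {"_id": user_id}},
        {"$lookup": {
            "from": "statistics",       # hypothetical collection name
            "localField": "stats_id",   # hypothetical reference field on the user
            "foreignField": "_id",
            "as": "stats",
        }},
    ]


def orphaned_stats(users, statistics):
    """Return ids of statistics documents that no user references."""
    referenced = {u["stats_id"] for u in users if "stats_id" in u}
    return [s["_id"] for s in statistics if s["_id"] not in referenced]
```

With embedding, neither helper is needed: the statistics travel with the user document and cannot be orphaned.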
I am currently working on designing a local content-based sharing system that depends on MongoDB. I need to make a critical architecture decision that will undoubtedly have a huge impact on query performance, scaling and overall long-term maintainability.
Our system has a library of topics, and each topic is available in specific cities/metropolitan areas. When a person creates a piece of content, it needs to be stored as part of the topic in a specific city. There are three approaches I am currently considering to address these requirements (and I am open to other ideas as well).
Option 1 (Single Collection per Topic/City):
Example: a collection name would be TopicID123CityID456 and each entry would obviously be a document within that collection.
Option 2 (Single Topic Collection)
Example: A collection name would be Topic123 and each entry would create a document that contains an indexed cityID.
Option 3 (Single City Collection)
Example: A collection name would be City456 and each entry would create a document that contains an indexed topicID
When querying the DB I always want to build a feed in date order based on the member's selected topic(s) and city. Since members can group multiple topics together to build a custom feed, option 3 seems to be the best, however I am concerned with long term performance of this approach. It seems option 1 would be the most performant but also forces multiple queries when needing to select more than one topic.
Another thing that I need to consider is some topics will be far more active and grow much larger than other topics which will also vary by location.
Since I still consider myself a beginner with MongoDB, I want to make sure the general DB structure is as good as it can be before coding all of the logic around writing and retrieving the data. I also don't know how well MongoDB performs with hundreds of thousands, if not millions, of documents in a collection, hence my uncertainty about the approach.
From experience which is the most optimal way of tackling the storage and recall of this data? Any insight would be greatly appreciated.
UPDATE: June 22, 2016
It is important to note that we are starting in a single DB server environment. @profesor79 provided a great scaling solution for once we need to move to a multi-server (sharded) environment.
From your three proposals, I would pick number 4 :-)
Have one collection sharded over multiple servers.
Instead of one collection per TopicCity pair, we could have one collection for all topics and all cities.
That collection, topicCities, will then have all of its documents sharded.
Sharding on the key {topic:1, city:1} will balance the load across the shard servers, and any time you need more power you will be able to add a shard to the cluster.
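A rough sketch of what this could look like from application code. The `shardCollection` command document follows MongoDB's admin command shape; the namespace `app.topicCities` and the `created_at` field used for date ordering are assumptions, since the question does not name them.

```python
def shard_collection_command(namespace):
    """Admin command document for sharding a collection on {topic: 1, city: 1}."""
    return {"shardCollection": namespace, "key": {"topic": 1, "city": 1}}


def feed_query(topic_ids, city_id):
    """Filter and sort for a member's feed: several topics, one city, newest first."""
    filter_doc = {"topic": {"$in": list(topic_ids)}, "city": city_id}
    sort_spec = [("created_at", -1)]  # hypothetical timestamp field
    return filter_doc, sort_spec


command = shard_collection_command("app.topicCities")
feed_filter, feed_sort = feed_query([123, 124], 456)
```

Because the filter names both shard key fields, a feed that groups several topics is still a single query rather than one query per topic collection.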
Any comments welcome!
I'm developing a cms using MongoDb and am trying to get some modelling advice. It's multi-tenant and each tenant can create their own schema and choose what custom fields they want searchable/indexed. The only thing I'm waffling on is how to model my collections. It seems to me like it would be ideal for each tenant to have their own collection due to indexing, but I am not very experienced with MongoDb and would love to hear if that's even a valid statement or not.
I'm thinking about separating each tenant's schema definitions from their data - perhaps a customSchema and customData collection for each tenant. Maybe something like customSchema_5543e1191a85d8946f0ee6fc and customData_5543e1191a85d8946f0ee6fc? The major question here being how many collections are feasible in MongoDb. I'm not clear if there's a cap with the new WiredTiger or not. If not, would such a large number of collections have any downsides?
Or is it better to have just two collections with all tenants' data in them, along with all of their individual indexes? What are the pros and cons of this approach?
Any thoughts or suggestions are welcome, particularly if anyone has had experience doing something like this before.
Update:
My use case is a cms where tenants can specify their own data, like in Sharepoint or Expression Engine, or most other content apis, like contentful or CloudCMS. A user can say, "I want to store Products, and each product has a Name, Description, Quantity, and a price". Another user could say, "I want to store bands, and each band has a Name, a HomeCity, and a whatever." The users would then want to retrieve and display that data on their pages however they like. It's a basic cms scenario where tenants can create their own schema, then create, edit, and retrieve entries of those schemas. Tenants would need to be able to denote which fields they can search on, so this highly customizable indexing per tenant is the primary area of focus and concern in the modelling strategy.
I'm waffling between two big collections to store schemas and data, shared by all tenants, and a pair of those collections for every tenant. I just don't know the pros and cons of each of those solutions in MongoDb. I'm also open to any ideas I haven't thought of yet :)
As an example, imagine a trivial "helpdesk" type app where there are support tickets, and the app supports multiple companies logging in and managing their tickets.
Given that companies won't interact with each other's "Tickets"....
Is it better to have one collection of "Tickets" and query or is it better to create collections of Tickets per Company?
There are a couple of things to consider here.
The first thing is pre-allocation of space. You will find a couple of threads on the mongodb-user group whereby the OP is confused about why their database is taking so much space when their data is taking so little space. This is because when you reach a certain point of pre-alloc within a collection it will create files 2GB in size by default, even if you are only using 100meg of that space.
Now imagine this pre-alloc pattern for 1000 companies; this quickly creates inefficient use of disk space and, in most of the threads, performance and cost problems.
The second thing to consider here is the nssize, which is 2GB maximum. This may seem crazy but what if you do have more than 3 million members (assume a company is a "registered user")? You will quickly use up the maximum namespace file size that MongoDB can give.
Also, you will gain no benefit from the lock (which is at the DB level) without splitting them out into separate databases. That, of course, creates an operational overhead in maintaining the database connections for each company.
MongoDB is typically designed to scale through a cluster rather than scale vertically and scaling vertically is normally considered a bad idea for large websites.
I don't have much experience with MongoDB, but I'll give some arguments so we can discuss it. I think you should create just one Tickets collection, for the following reasons:
Creating a Collection for each company seems like redundancy.
You will have to create and configure a collection every time you add a new company to your system before it can create tickets, whereas with a single collection you only have to create the company.
I don't know how you were planning to create the link between a company document and its corresponding ticket collection, but I think it is more straightforward to create the link by storing the id of the company document in an idcompany attribute in the Tickets collection.
I think one of the reasons that might make you consider creating a ticket collection per company is that the large amount of data could decrease the speed of your queries (all the companies inserting into the same tickets collection). But you could counter this by creating a sharded cluster, using a compound shard key made of idcompany and some useful attribute from the Tickets document. This way it is very likely that all the documents of a given company remain in the same shard, so the common queries will perform relatively quickly.
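With a single Tickets collection, every query has to be scoped to one company. A small sketch of a helper that enforces this, assuming the `idcompany` attribute from above; `created_at` as the second shard key field is only a stand-in for "some useful attribute".

```python
# Compound shard key: idcompany plus a hypothetical second field.
TICKETS_SHARD_KEY = {"idcompany": 1, "created_at": 1}


def company_filter(idcompany, extra=None):
    """Return a query filter that is always scoped to a single company."""
    extra = dict(extra or {})
    if "idcompany" in extra and extra["idcompany"] != idcompany:
        raise ValueError("filter tries to cross a company boundary")
    return {**extra, "idcompany": idcompany}


open_tickets = company_filter("c1", {"status": "open"})
```

Routing all reads and writes through one helper like this keeps the tenant check in a single place instead of relying on each call site to remember it.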
My $0.02:
By separating out each company into their own collections, or better, databases... it makes customer migration and individualized backups, restores, imports and exports much easier at the expense of making your code a tad crappier.
Isolating customer data may reduce your data storage requirements, as you won't need to embed the customer ID into every single document. Of course, with separate databases, most drivers will treat that as a separate network connection.
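As an illustration of the per-customer backup point, here is a sketch that builds one `mongodump` invocation per company database using the tool's `--db` and `--out` options. The database names and backup directory are hypothetical, and the commands are only constructed, not executed.

```python
def dump_commands(company_dbs, backup_root):
    """Build one mongodump argument list per company database."""
    return [
        ["mongodump", "--db", db_name, "--out", backup_root]
        for db_name in company_dbs
    ]


commands = dump_commands(["company_acme", "company_globex"], "/backups")
# Each entry could be handed to subprocess.run(...) on a host with mongodump installed.
```

mongodump writes each database into its own sub-directory of the `--out` path, so a single customer's data can be archived or restored without touching the others.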
As with everything, there are tradeoffs.
I have a MongoDB database named maindatabase with three collections named users, tags and categories, and I would like to know whether it is possible to split them across three different servers (on different cloud service providers).
I mean not as replicas, but just one collection per server (one db with only the categories collection on one server, one with users on another server and one with tags on the third server), perhaps routed selectively by a mongos router.
Anyone know if it is possible?
Aside from @matulef's answer regarding manual manipulation of databases through movePrimary, maybe this calls for a simpler solution of just maintaining 3 database connections: one per server, each in a different cloud provider's data center as you originally specified. You wouldn't have the simplicity of a single mongos connection point, but with your three connections, you could then directly manipulate users, tags, and categories on each of their respective connections.
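The three-connection approach amounts to a small routing table in the application. A sketch, with placeholder host names (the `.invalid` URIs are not real servers):

```python
# Hypothetical mapping from collection name to the server that hosts it.
COLLECTION_SERVERS = {
    "users": "mongodb://users-db.example.invalid:27017",
    "tags": "mongodb://tags-db.example.invalid:27017",
    "categories": "mongodb://categories-db.example.invalid:27017",
}


def uri_for(collection):
    """Return the connection URI of the server holding the given collection."""
    if collection not in COLLECTION_SERVERS:
        raise ValueError("no server configured for collection: " + collection)
    return COLLECTION_SERVERS[collection]
```

The application would open one client per URI and pick the client by collection name, which is the bookkeeping a mongos router would otherwise do for you.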
Unfortunately you can't currently split up the collections in a single database this way. However it is possible to do this if you put each collection in a different database. In a sharded system, each database has a "primary shard" associated with it, where all the unsharded collections on that database live. If you separate your 3 collections into 3 different databases, you can individually move them to different shards using the "movePrimary" command:
http://www.mongodb.org/display/DOCS/movePrimary+Command
There is, however, some overhead associated with making more databases, so it's not clear whether this is the best solution for your needs.
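A sketch of the command documents involved, assuming the three collections have already been moved into three hypothetical databases (`users_db`, `tags_db`, `categories_db`) and that the shards are named `shardA`, `shardB` and `shardC`. The `movePrimary` document takes the database name and a `to` field naming the target shard.

```python
def move_primary_command(database, shard):
    """Admin command document that moves a database's primary shard."""
    return {"movePrimary": database, "to": shard}


# Hypothetical placement: one database (and so one collection) per shard.
PLACEMENT = {"users_db": "shardA", "tags_db": "shardB", "categories_db": "shardC"}

commands = [move_primary_command(db, shard) for db, shard in PLACEMENT.items()]
```

Each document would be run against the admin database through mongos; the actual shard names come from the cluster's own configuration.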