I am running several NodeJS instances on separate compute engines all accessing the same MongoDB. In each instance I am running a background housekeeping process which scans the entire customer collection in the database. I am using cursors to access documents, fetching the next customer one by one.
This yields a number of competing housekeeping processes, all wanting to access the same documents (customers) in the same order.
What I am looking for instead, is for my housekeeping processes to cooperate rather than compete.
So if I had two instances I could construct two opposite direction cursors. But if I have 3 or more instances or if I want to be tolerant to any number of instances going up and down without duplicating or phantoming customers, I need to find a different approach.
I was thinking, does MongoDB provide cursors addressable by name from multiple NodeJS instances such that all instances fetch the next document from the same cursor, never obtaining the same document?
If not, can anyone suggest a good pattern to apply to this problem?
Related
Do many mongodb collections have a big impact on mongodb performance, memory and capacity? I am designing an api with mvc pattern, and a collection is being created for each model. I question the way I am doing now.
MongoDB with the WiredTiger engine supports an unlimited number of collections. So you are not going to run into any hard technical limitations.
When you wonder if something should be in one collection or in multiple collections, these are some of the considerations you need to keep in mind:
More collections = more maintenance work. Sharding is configured on the collection level. So having a large number of collections will make shard configuration a lot more work. You also need to set up indexes for each collection separately, but this is quite easy to automate, because createIndex on an index which already exists does nothing.
The MongoDB API is designed in a way that every database query operates on one collection at a time. That means when you need to search for a document in n different collections, you need to perform n queries. When you need to aggregate data stored in multiple collections, you run into even more problems. So any data which is queried together should be stored together in the same collection.
Creating one collection for each class in your model is usually a good rule of thumb, but it is not a golden hammer solution. There are situations where you want to embed objects in their parent-object documents instead of putting them into a separate collection. There are also cases where you want to put all objects with the same base class in the same collection to benefit from MongoDB's ability to handle heterogeneous collections. But that goes beyond the scope of this question.
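To illustrate the point above that createIndex is harmless to repeat, here is a minimal sketch of an index bootstrap that could run on every application start. It assumes a PyMongo-style database handle (db[name].create_index(keys)); the collection and field names in INDEX_SPECS are hypothetical and not taken from the question.

```python
# Hypothetical index layout: collection name -> list of index key specs.
# Each key spec is a list of (field, direction) pairs, as PyMongo expects.
INDEX_SPECS = {
    "users": [[("email", 1)]],
    "tickets": [[("company_id", 1), ("created_at", -1)]],
}


def ensure_indexes(db, specs=INDEX_SPECS):
    """Request every index in specs and return the index names reported back.

    `db` is assumed to behave like a PyMongo Database: db[name] yields a
    collection whose create_index(keys) returns the index name. Asking for
    an index that already exists with the same definition is a no-op, so
    this function is safe to call repeatedly.
    """
    created = []
    for coll_name, key_specs in specs.items():
        for keys in key_specs:
            created.append(db[coll_name].create_index(keys))
    return created
```

Because the call is idempotent, the same function can serve both a fresh deployment and an existing one without checking which indexes are already present.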
Why don't you use this and test your application?
https://docs.mongodb.com/manual/tutorial/evaluate-operation-performance/
By the way, your question is not completely clear; it is more of a "discussion" than a question. And you're asking others to evaluate your work instead of searching the web for the right approach.
I am currently designing a local content-based sharing system that depends on MongoDB. I need to make a critical architecture decision that will undoubtedly have a huge impact on query performance, scaling and overall long-term maintainability.
Our system has a library of topics, each topic is available in specific cities/metropolitan areas. When a person creates a piece of content it needs to be stored as part of the topic in a specific city. There are three approaches I am currently considering to address these requirements (And open to other ideas as well).
Option 1 (Single Collection per Topic/City):
Example: a collection name would be TopicID123CityID456 and each entry would obviously be a document within that collection.
Option 2 (Single Topic Collection)
Example: A collection name would be Topic123 and each entry would create a document that contains an indexed cityID.
Option 3 (Single City Collection)
Example: A collection name would be City456 and each entry would create a document that contains an indexed topicID
When querying the DB I always want to build a feed in date order based on the member's selected topic(s) and city. Since members can group multiple topics together to build a custom feed, option 3 seems to be the best, however I am concerned with long term performance of this approach. It seems option 1 would be the most performant but also forces multiple queries when needing to select more than one topic.
Another thing that I need to consider is some topics will be far more active and grow much larger than other topics which will also vary by location.
Since I still consider myself a beginner with MongoDB, I want to make sure the general DB structure is sound before coding all of the logic around writing and retrieving the data. I also don't know how well Mongo performs with hundreds of thousands, if not millions, of documents in a collection, hence my uncertainty about the approach.
From experience which is the most optimal way of tackling the storage and recall of this data? Any insight would be greatly appreciated.
UPDATE: June 22, 2016
It is important to note that we are starting in a single DB server environment. #profesor79 provided a great scaling solution for once we need to move to a multi-server (sharded) environment.
From your three proposals I would pick number 4 :-)
Have one collection sharded over multiple servers.
Alongside the single topicCities collection, we could have one collection for all topics and one for all cities.
The topicCities collection will then hold all documents, sharded.
Sharding on the key {topic:1, city:1} will balance the load across the shard servers, and any time you need more power you can add a shard to the cluster.
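As a rough sketch of how the feed from the question could be read out of such a single collection: the helper below builds the filter, sort and limit for one city and several topics. The topic and city field names follow the shard key above; created_at and the default limit are assumptions made for illustration only.

```python
def build_feed_query(city_id, topic_ids, limit=50):
    """Build the pieces of a feed query for one city and several topics.

    Returns a dict with a MongoDB filter document, a sort specification
    (newest first) and a limit. `created_at` is a hypothetical date field.
    """
    return {
        # One query covers every selected topic via $in.
        "filter": {"city": city_id, "topic": {"$in": list(topic_ids)}},
        "sort": [("created_at", -1)],
        "limit": limit,
    }
```

With PyMongo this could be used as coll.find(q["filter"]).sort(q["sort"]).limit(q["limit"]). Because the filter names the city and the topics, it includes both shard key fields, so on a sharded cluster the query should not need to be broadcast to every shard.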
Any comments welcome!
I'm new to mongodb and I'm facing a dilemma regarding my DB Schema design:
Should I create one single collection or put my data into several collections (we could call these categories I suppose).
Now I know many such questions have been asked, but I believe my case is different for 2 reasons:
If I go for many collections, I'll have to create about 120 and that's it. This won't grow in the future.
I know I'll never need to query or insert into multiple collections. I will always have to query only one, since a document in collection X is not related to any document stored in the other collections. Documents may hold references to other parts of the DB though (like userId etc).
So my question is: could the 120 collections improve query performance? Is this a useful optimization in my case?
Or should I just go for single collection + sharding?
Each collection is expected to hold millions of documents. If I use only one, it will store billions of docs.
Thanks in advance!
------- Edit:
Thanks for the great answers.
In fact the 120 collections is only a self made limit, it's not really optimal:
The data in the collections is related to web publishers. There could be millions of these (any web site can join).
I guess the ideal situation would be if I could create a collection for each publisher (to hold their data only). But obviously, this is not possible due to mongo limitations.
So I came up with the idea of a fixed number of collections to at least distribute the data somehow. Like: collection "A_XX" would hold XX Platform related data for publishers whose names start with "A".. etc. We'll only support a few of these platforms, so 120 collections should be more than enough.
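For concreteness, the bucketing scheme described above could be sketched as the small routing function below. The "_" bucket for names that do not start with A to Z, and the upper-casing of the platform code, are assumptions added for the example; the question only specifies names of the form "A_XX".

```python
def collection_for(publisher_name: str, platform: str) -> str:
    """Map a publisher and platform to a bucketed collection name.

    The bucket is the first letter of the publisher name; anything that is
    not A-Z (digits, symbols, empty names) falls into a hypothetical "_"
    bucket so that every publisher still maps to some collection.
    """
    first = publisher_name.strip()[:1].upper()
    if not ("A" <= first <= "Z"):
        first = "_"
    return f"{first}_{platform.upper()}"
```

Note that a first-letter split is unlikely to spread publishers evenly, since some initial letters are far more common than others.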
On another website someone suggested using many databases instead of many collections. But this means overhead and then I would have to use / manage many different connections.
What do you think about this? Is there a better solution?
Sorry for not being specific enough in my original question.
Thanks in advance
Single Sharded Collection
The edited version of the question makes the actual requirement clearer: you have a collection that can potentially grow very large and you want an approach to partition the data. The artificial collection limit is your own planned partitioning scheme.
In that case, I think you would be best off using a single collection and taking advantage of MongoDB's auto-sharding feature to distribute the data and workload to multiple servers as required. Multiple collections is still a valid approach, but unnecessarily complicates your application code & deployment versus leveraging core MongoDB features. Assuming you choose a good shard key, your data will be automatically balanced across your shards.
You do not have to shard immediately; you can defer the decision until you see your workload actually requiring more write scale (but knowing the option is there when you need it). You have other options before deciding to shard as well, such as upgrading your servers (disks and memory in particular) to better support your workload. Conversely, you don't want to wait until your system is crushed by workload before sharding, so you definitely need to monitor the growth. I would suggest using the free MongoDB Monitoring Service (MMS) provided by 10gen.
On another website someone suggested using many databases instead of many collections. But this means overhead and then I would have to use / manage many different connections.
Multiple databases will add significantly more administrative overhead, and would likely be overkill and possibly detrimental for your use case. Storage is allocated at the database level, so 120 databases would be consuming much more space than a single database with 120 collections.
Fixed number of collections (original answer)
If you can plan for a fixed number of collections (120 as per your original question description), I think it makes more sense to take this approach rather than using a monolithic collection.
NOTE: the design considerations below still apply, but since the question was updated to clarify that multiple collections are an attempted partitioning scheme, sharding a single collection would be a much more straightforward approach.
The motivations for using separate collections would be:
Your documents for a single large collection will likely have to include some indication of the collection subtype, which may need to be added to multiple indexes and could significantly increase index sizes. With separate collections the subtype is already implicit in the collection namespace.
Sharding is enabled at the collection level. A single large collection only gives you an "all or nothing" approach, whereas individual collections allow you to control which subset(s) of data need to be sharded and choose more appropriate shard keys.
You can use the compact command to defragment individual collections. Note: compact is a blocking operation, so the normal recommendation for a HA production environment would be to deploy a replica set and use rolling maintenance (i.e. compact the secondaries first, then step down and compact the primary).
MongoDB 2.4 (and 2.2) currently have database-level write lock granularity. In practice this has not proven a problem for the vast majority of use cases, however multiple collections would allow you to more easily move high activity collections into separate databases if needed.
Further to the previous point .. if you have your data in separate collections, these will be able to take advantage of future improvements in collection-level locking (see SERVER-1240 in the MongoDB Jira issue tracker).
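The rolling maintenance order mentioned in the compact point above can be sketched as a simple plan builder. This only produces the ordered list of (host, command document) pairs; it does not connect to anything. The host names, the 60 second step-down window, and the idea of feeding the plan to a driver are assumptions for illustration. compact and replSetStepDown are the actual server command names.

```python
def rolling_compact_plan(members, collection, stepdown_secs=60):
    """Return ordered (host, command) steps for a rolling compact.

    `members` is a list of (host, role) pairs where role is "PRIMARY" or
    "SECONDARY". Secondaries are compacted first; the primary is then asked
    to step down and is compacted last, once it is no longer primary.
    """
    primaries = [h for h, role in members if role == "PRIMARY"]
    secondaries = [h for h, role in members if role == "SECONDARY"]
    if len(primaries) != 1:
        raise ValueError("expected exactly one primary")

    steps = [(host, {"compact": collection}) for host in secondaries]
    steps.append((primaries[0], {"replSetStepDown": stepdown_secs}))
    steps.append((primaries[0], {"compact": collection}))
    return steps
```

In practice each step would need to wait for the previous member to finish and catch up before moving on, which this sketch leaves out.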
The main problem here is that you will gain very little performance in the current MongoDB versions if you separate out collections into the same database. To get any sort of extra performance over a single collection setup you would need to move the collections out into separate databases, then you will have operational overhead for judging what database you should query etc.
So yes, you could easily go for 120 collections; however, you won't really gain anything currently, because https://jira.mongodb.org/browse/SERVER-1240 is not being implemented (anytime soon).
Housing billions of documents in a single collection isn't too bad. I presume that even if you were to house this in separate collections it probably would not be on a single server either, just like sharding a single collection, so any speed reduction due to a multi-server setup will also not matter in this case.
In my personal opinion, using a single collection is easier on everything.
As an example, imagine a trivial "helpdesk" type app where there are support tickets, and the app supports multiple companies logging in and managing their tickets.
Given that companies won't interact with each other's "Tickets"....
Is it better to have one collection of "Tickets" and query or is it better to create collections of Tickets per Company?
There are a couple of things to consider here.
The first thing is pre-allocation of space. You will find a couple of threads on the mongodb-user group where the OP is confused about why their database is taking so much space when their data is taking so little. This is because when you reach a certain point of pre-allocation within a collection it will create files 2GB in size by default, even if you are only using 100 MB of that space.
Now imagine this pre-alloc pattern for 1000 companies; this quickly creates inefficient use of disk space and, in most of the threads, performance and cost problems.
The second thing to consider here is the nssize, which is 2GB maximum. This may seem crazy but what if you do have more than 3 million members (assume a company is a "registered user")? You will quickly use up the maximum namespace file size that MongoDB can give.
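A back-of-the-envelope check of the namespace limit above, as a sketch under stated assumptions: it assumes the old MMAPv1 figure of 628 bytes per namespace entry, and that each collection and each index consumes one entry (so a collection with only its _id index uses two). The raw quotient ignores any file overhead, so treat the result as an upper bound rather than an exact limit.

```python
# Assumed MMAPv1-era size of one entry in the .ns namespace file.
NAMESPACE_ENTRY_BYTES = 628


def max_namespaces(nssize_mb: int) -> int:
    """Upper bound on namespaces that fit in a .ns file of nssize_mb MB."""
    return (nssize_mb * 1024 * 1024) // NAMESPACE_ENTRY_BYTES


def max_companies(nssize_mb: int, namespaces_per_company: int = 2) -> int:
    """Upper bound on per-company collections for a given .ns file size.

    The default of 2 assumes one collection plus its _id index per company;
    every extra index per collection lowers the bound further.
    """
    return max_namespaces(nssize_mb) // namespaces_per_company
```

Calling max_namespaces(2048) gives a figure a little above three million, which matches the order of magnitude quoted above; max_companies(2048) shows that one collection per company, each with its _id index, would hit the ceiling at roughly half that number.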
Also, you will gain no benefit from the lock (which is at DB level) without splitting the collections out into separate databases. This of course creates operational overhead in maintaining the database connections for each company.
MongoDB is typically designed to scale through a cluster rather than scale vertically and scaling vertically is normally considered a bad idea for large websites.
I don't have much time using mongodb, but I'll give some arguments so we can discuss it. I think you should create just one Tickets collection, for the following reasons:
Creating a Collection for each company seems like redundancy.
You will have to create and configure a collection every time you add a new company to your system in order to create tickets, whereas with a single collection you only have to create the company.
I don't know how you were planning to create the link between your company document and its corresponding ticket collection, but I think it is more straightforward to create the link using the id of the company document with an idcompany attribute in the Tickets collection.
I think one of the reasons that might make you consider creating a ticket collection per company is that the large amount of data could decrease the speed of your queries (all the companies inserting into the same tickets collection). But you could counter this by creating a sharded cluster, using a compound shard key with idcompany and some useful attribute from the Tickets document. This way it is very likely that all the documents of a given company remain in the same shard, so the common queries will perform relatively quickly.
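A minimal sketch of the single Tickets collection with a compound shard key, as suggested in the last point. The database name helpdesk and the second key field created_at are hypothetical; the answer only says "some useful attribute". enableSharding and shardCollection are the real admin command names, and the helper merely builds the command documents.

```python
def shard_tickets_commands(db_name="helpdesk", second_field="created_at"):
    """Build the admin commands that shard <db_name>.tickets.

    The shard key starts with idcompany so that one company's tickets tend
    to stay together; `second_field` stands in for "some useful attribute".
    """
    namespace = f"{db_name}.tickets"
    return [
        {"enableSharding": db_name},
        {"shardCollection": namespace, "key": {"idcompany": 1, second_field: 1}},
    ]


def tickets_for_company(company_id):
    """Filter for one company's tickets; it names the shard key prefix."""
    return {"idcompany": company_id}
```

With PyMongo, each command document could be passed to client.admin.command(cmd) on a connection to a mongos, and the filter to db.tickets.find(...). Since the filter includes the leading shard key field, the router can target only the shard or shards that hold that company's data.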
My $0.02:
By separating out each company into their own collections, or better, databases... it makes customer migration and individualized backups, restores, imports and exports much easier at the expense of making your code a tad crappier.
Isolating customer data may reduce your data storage requirements, as you won't need to embed the customer ID into every single document. Of course, with separate databases, most drivers will treat that as a separate network connection.
As with everything, there are tradeoffs.
I have a MongoDB database named maindatabase which has 3 document collections named users, tags and categories. I would like to know whether it is possible to split them across three different servers (on different cloud service providers).
I mean not as a replica, but just one collection per server (one db with just the categories collection on one server, one with users on another server and one with tags on the third server), perhaps routed selectively by a mongos router.
Does anyone know if it is possible?
Aside from #matulef's answer regarding manual manipulation of databases through movePrimary, maybe this calls for a simpler solution of just maintaining 3 database connections: one per server, each in a different cloud provider's data center as you originally specified. You wouldn't have the simplicity of a single mongos connection point, but with your three connections, you could then directly manipulate users, tags, and categories on each of their respective connections.
Unfortunately you can't currently split up the collections in a single database this way. However it is possible to do this if you put each collection in a different database. In a sharded system, each database has a "primary shard" associated with it, where all the unsharded collections on that database live. If you separate your 3 collections into 3 different databases, you can individually move them to different shards using the "movePrimary" command:
http://www.mongodb.org/display/DOCS/movePrimary+Command
There is, however, some overhead associated with making more databases, so it's not clear whether this is the best solution for your needs.
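The one-database-per-collection approach with movePrimary could be sketched as below. The database names and shard names are hypothetical placeholders; movePrimary and its to field are the actual command syntax. The helper only builds the command documents and does not talk to a cluster.

```python
# Hypothetical placement: one database per collection, each on its own shard.
PLACEMENT = {
    "usersdb": "shard0000",
    "tagsdb": "shard0001",
    "categoriesdb": "shard0002",
}


def move_primary_commands(placement=PLACEMENT):
    """Build one movePrimary admin command per database.

    Each command asks the cluster to make `shard` the primary shard of
    `db_name`, which is where that database's unsharded collections live.
    """
    return [
        {"movePrimary": db_name, "to": shard}
        for db_name, shard in placement.items()
    ]
```

With PyMongo, each document could be sent through client.admin.command(cmd) on a mongos connection, one database at a time.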