setting up mongodb for sharding/scalability?

setting up mongodb for sharding/scalability? - mongodb

Any recommended readings for setting up mongodb for sharding/scalability?
I'm looking for best practices. i don't know a lot about sharding or scaling db solutions. are there examples out there with practical real world examples?
i apologize if i'm using the wrong terms.
Is my understanding correct in that mongodb acts like a "single database" but knows how to distribute data across disparate instances of mongodb (maybe located in different locations, etc)
Are each of those instances called shards? is that data replicated across all instances?

MongoDB provides two types of scaling.
Read scaling: is provided by Replica Sets.
Write scaling is provided by Sharding.
Those links are a reasonable place to start.
There are also numerous slides and videos from the multiple Mongo conferences that have run recently. Here are some recent ones with use cases.
are each of those instances called shards? is that data replicated across all instances?
Think of a shard as a "slice" of your data. Each shard is generally composed of a replica set. So each shard has multiple computers managing replication of data.
is my understanding correct in that mongodb acts like a "single database" but knows how to distribute data across disparate instances of mongodb...
Sharding allows MongoDB to automatically distribute writes. But there's a little more to it, so I think it's best you work through some of the presentations.

MongoDB has a great documentation. Issues like Sharding and Replica sets are documented in depth:
http://www.mongodb.org/display/DOCS/Sharding+Introduction
http://www.mongodb.org/display/DOCS/Replica+Sets
Apart from that there are lot of presentations
http://www.10gen.com/presentations
and videos
http://www.10gen.com/presentations
dealing with your questions.
Please research first and come up with some more specific questions.

Related

Is there any performance impact to have multiple Mongo databases?

We are currently working on an application using Mongo and we try to evaluate benefits and constraints on each differents architecture choices related to spreading data on multiple databases/collections or using a single shared one.
Is there any performance penalties between one single database with a lot of collections or many databases with less collections per database ?
From what I understand it does not seem to have any impact because sharding is done per collection basis but I would like some confirmations.
Regards

By performance, I guess you mean read/write speed. Using multiple databases with fewer collections would definitely increase your read/write speed since each database can handle more read/write operations on the collections associated with them. 
However, spreading data across databases this way I believe can bring about extra complexity to your project, depending on how your codebase is structured, it might introduce complexity to your application logic, things like backup and other admin database operations won't be straight forward, cross collection ad-hoc queries for collection that lives in different databases would be next to impossible.
If the goal of the architecture design is to ensure high read/write speed, you can still go with using a single DB that can be auto-scaled at the deployment level. I don't know much about it but I think Replication is a MongoDB feature that can help you achieve such auto-scaling and if you are in for database-as-a-service, you should check out MongoDB Atlas, auto-scaling is out of the box.

MongoDB - Can I shard all new db's (Created by application) automatically?

My team will deploy a new version of our app (Capture social media posts, hashtags etc.) they create a different DB for each user and we may have thousands of collections on each DB. I read all mongoDB shard documentation and I saw that I can only shard an collection or one DB at time, I'm missing something ?
We will start this new version fresh, without any databases and we will grow from 0 again (For now, we have 23k users) but we will escalate this number really quickly (100.000+ at the end of the year)
My question is: I really need a Shard cluster ? (My test setup have 3 shards with 3 microshards, 3 config servers and 2 mongos) for now, in production, i have a large server doing all the hard work but i dont want to scale to top, the horizontal scale is the best choice, i think.
Can I shard all my databases automatically or I really need to do that one by one doing the shard key procedure and so. ?
Thanks in advance

You are reading correctly. What you intend to do is so far away from what any sensible person would do that MongoDB doesn't offer any tools to support this. If you really want to go with this WTF solution, your application will be responsible to set up sharding for each collection it creates. This forces you to give administration permission to the application (despite what any security guides recommend).
"Will you really need a sharded cluster" - that depends on how much data you will have and how often you query it with what kind of query. But it is unlikely to work anyway, because your sharded cluster will have to manage (100,000 databases* 1.000 collections) = a hundred million collections. MongoDB is not designed for scaling in that direction. The cluster will likely be so busy with bookkeeping that you won't really see any notable performance gain.
It is also questionable if clustering would even theoretically make sense. Clustering is usually only useful when you have very large collections. But in your scenario where your data is so heavily fragmented into a million collections, each individual collection is unlikely to be very large.
If you really want to go this route, it might in fact be a better solution to separate the databases physically by assigning each user to a database server.
Or you could just build a database architecture like a normal team would with one database for all users and one collection per type of document. You would then speed up lookups by creating a compound index on user and whatever criteria you used to tell which database a document belonged to. This index might also be a good shard key.

120 mongodb collections vs single collection - which one is more efficient?

I'm new to mongodb and I'm facing a dilemma regarding my DB Schema design:
Should I create one single collection or put my data into several collections (we could call these categories I suppose).
Now I know many such questions have been asked, but I believe my case is different for 2 reasons:
If I go for many collections, I'll have to create about 120 and that's it. This won't grow in the future.
I know I'll never need to query or insert into multiple collections. I will always have to query only one, since a document in collection X is not related to any document stored in the other collections. Documents may hold references to other parts of the DB though (like userId etc).
So my question is: could the 120 collections improve query performance? Is this a useful optimization in my case?
Or should I just go for single collection + sharding?
Each collection is expected hold millions of documents. If use only one, it will store billions of docs.
Thanks in advance!
------- Edit:
Thanks for the great answers.
In fact the 120 collections is only a self made limit, it's not really optimal:
The data in the collections is related to web publishers. There could be millions of these (any web site can join).
I guess the ideal situation would be if I could create a collection for each publisher (to hold their data only). But obviously, this is not possible due to mongo limitations.
So I came up with the idea of a fixed number of collections to at least distribute the data somehow. Like: collection "A_XX" would hold XX Platform related data for publishers whose names start with "A".. etc. We'll only support a few of these platforms, so 120 collections should be more than enough.
On another website someone suggested using many databases instead of many collections. But this means overhead and then I would have to use / manage many different connections.
What do you think about this? Is there a better solution?
Sorry for not being specific enough in my original question.
Thanks in advance

Single Sharded Collection
The edited version of the question makes the actual requirement clearer: you have a collection that can potentially grow very large and you want an approach to partition the data. The artificial collection limit is your own planned partitioning scheme.
In that case, I think you would be best off using a single collection and taking advantage of MongoDB's auto-sharding feature to distribute the data and workload to multiple servers as required. Multiple collections is still a valid approach, but unnecessarily complicates your application code & deployment versus leveraging core MongoDB features. Assuming you choose a good shard key, your data will be automatically balanced across your shards.
You can do not have to shard immediately; you can defer the decision until you see your workload actually requiring more write scale (but knowing the option is there when you need it). You have other options before deciding to shard as well, such as upgrading your servers (disks and memory in particular) to better support your workload. Conversely, you don't want to wait until your system is crushed by workload before sharding so you definitely need to monitor the growth. I would suggest using the free MongoDB Monitoring Service (MMS) provided by 10gen.
On another website someone suggested using many databases instead of many collections. But this means overhead and then I would have to use / manage many different connections.
Multiple databases will add significantly more administrative overhead, and would likely be overkill and possibly detrimental for your use case. Storage is allocated at the database level, so 120 databases would be consuming much more space than a single database with 120 collections.
Fixed number of collections (original answer)
If you can plan for a fixed number of collections (120 as per your original question description), I think it makes more sense to take this approach rather than using a monolithic collection.
NOTE: the design considerations below still apply, but since the question was updated to clarify that multiple collections are an attempted partitioning scheme, sharding a single collection would be a much more straightforward approach.
The motivations for using separate collections would be:
Your documents for a single large collection will likely have to include some indication of the collection subtype, which may need to be added to multiple indexes and could significantly increase index sizes. With separate collections the subtype is already implicit in the collection namespace.
Sharding is enabled at the collection level. A single large collection only gives you an "all or nothing" approach, whereas individual collections allow you to control which subset(s) of data need to be sharded and choose more appropriate shard keys.
You can use the compact to command to defragment individual collections. Note: compact is a blocking operation, so the normal recommendation for a HA production environment would be to deploy a replica set and use rolling maintenance (i.e. compact the secondaries first, then step down and compact the primary).
MongoDB 2.4 (and 2.2) currently have database-level write lock granularity. In practice this has not proven a problem for the vast majority of use cases, however multiple collections would allow you to more easily move high activity collections into separate databases if needed.
Further to the previous point .. if you have your data in separate collections, these will be able to take advantage of future improvements in collection-level locking (see SERVER-1240 in the MongoDB Jira issue tracker).

The main problem here is that you will gain very little performance in the current MongoDB versions if you separate out collections into the same database. To get any sort of extra performance over a single collection setup you would need to move the collections out into separate databases, then you will have operational overhead for judging what database you should query etc.
So yes, you could go for 120 collections easily however, you won't really gain anything currently due to: https://jira.mongodb.org/browse/SERVER-1240 not being implemented (anytime soon).
Housing billions of documents in a single collection isn't too bad. I presume that even if you was to house this in separate collections it probably would not be on a single server either, just like sharding a single collection, so any speed reduction due to multi server setup will also not matter in this case.
In my personal opinion, using a single collection is easier on everything.

Configure a Mongo replica set to only replicate certain collections

I have a ~3GB mongo database with several dozen collections. Three of these collections handle ~300 queries per second, while the rest sustain a much lower volume. I expect the traffic to continue to grow quickly.
I'd like to set up a replica set to handle the high-traffic collections. It isn't necessary for this new instance to replicate the rest of the database. Is this possible?

Seems like not possible at the moment by built-in features of mongodb and only way to do is to come up with your own manual replication algorithm or use some other tools written by third parties.
https://github.com/wordnik/wordnik-oss project might help you to achieve this according to the following post.
https://groups.google.com/forum/?fromgroups=#!topic/mongodb-user/Ap9V4ArGuFo
Describes workaround to filter documents in replication.
Replicate only documents where {'public':true} in MongoDB
Or just replicate the data yourself manually which might worth trying.
Good luck.

No that isn't possible now. What you could do is move those collections into another unreplicated database. But this will cause headaches once these collections see higher traffic too, so you would need to move them into your "replication"-db.
But in general Replication isn't the way to go if you need to scale, it's more considered for DR/failover. Replicaset Secondaries can only (optionally) answer read queries but no write queries, this is something you should keep in mind. So if you have high write load this may not cure your problem.
Once you allow your application to read from secondaries you need to live with eventual consistency, meaning that your application isn't guaranteed to see always the latest data. This is caused due to the asynchronous replication to the secondaries.
Indeed you can cure this problem if you configure your writeconcern, so that the write needs to succeeded on all replicas, before it's considered written and your driver returns. But this may slow down your write operations significant.
So for scaling query execution capabilities I would go with Sharding. This is possible on a per collection level, all unsharded collections will remain on a "default-shard".

Not possible but then if the data size is so small and these collections aren't updated, then the only overhead of having them replicated is the small storage size on the secondary. That is a relatively small price to pay, especially since the collections won't grow in size, compared with writing your own replication logic.

Instead of that archive the data, and have only the latest data set on the production server and the rest of the data can archive on the new server.

Scaling MongoDB on EC2 or should I just switch to DynamoDB?

I currently run my website on a single server with MongoDB. On my server I have two components (1) a crawler that runs hourly and appends data to my MongoDB instance (2) a web-site that reads from the crawler index and also writes to a user personalization DB. I am moving to Amazon EC2 for auto-scaling, so that web-server can auto-scale, so I can increase the number of servers as the web-traffic increases. I don't need auto-scaling for my crawler. This poses a challenge for how I use MongoDB. I'm wondering what my best option is to optimize on
Minimal changes to my code (the code is in perl)
Ability to seamlessly add/remove web-servers without worry about losing data in the DB
Low cost
In the short-term, the DB will certainly be able to fit in memory across all machies since it will be under 2 GB. The user personalization DB can't be rebuilt so its more important to have this, while the index can easily be re-built. The current MongoDB crawl index has about 100k entries that are keyed on ~15 different columns. This is built for speed, as I am working on an online dating site (that is searchable in many ways).
I can think of a few options
Use SimpleDB for the user personalization store, and MongoDB for the index. Have the index replicate across all machines, however, I don't know too much about MongoDB replication.
Move everything to SimpleDB
Move everything to DynamoDB
I don't know too much about SimpleDB and/or DynamoDB. Based on articles it seems like DynamoDB would bew a natural choice, but I'm not sure about good perl support, whether I can have all columns, index, etc. Anyone have experience or have any advice?

You could host Mongo on a single server on EC2 which each of the boxes in the web farm connect to. You can then easily spin up another web instance that uses the same DB box.
We currently have three Mongo servers as we run a replica set and when we get to the point where we need to scale horizontally with Mongo we'll spin up some new instances and shard the larger collections.

I currently run my website on a single server with MongoDB.
First off, this is a big red flag. When running on production, it is always recommended to run a replica set with at least three full nodes.
Replication provides automatic redundancy and fail-over.
Ability to seamlessly add/remove web-servers without worry about losing data in the DB
MongoDB supports a concept called sharding. Sharding provides a way to scale horizontally by automatically partioning data. The partitioning is done via a shard key.
If you plan to use sharding, please read that link very carefully and recognize the limitations. For MongoDB sharding you have to select the correct key that will allow queries to be evenly distributed across the shards.
The current MongoDB crawl index has about 100k entries that are keyed on ~15 different columns.
This is going to be a problem with sharding. Sharding can only scale queries that use the shard key. A query on the shard key can be routed directly to a single machine. A query on a secondary index goes to all machines.
You have 15 different indexes, so basically all of these queries will go to all shards. That will not "auto-scale" very well at all.

Beware that at the moment EC2 does not have 64 bit small instances, making replication potentially expensive. Because MongoDB memory maps files, a 32 bit OS is not advised.

I've had very bad experiences with SimpleDB and think it's fundamentally flawed, so I would avoid it.
Three is a good white paper on how to set up MongoDB on Amazon EC2: http://d36cz9buwru1tt.cloudfront.net/AWS_NoSQL_MongoDB.pdf
I suspect setting up MongoDB on EC2 is the fastest solution versus rewriting-for/migrating-to DynamoDB.
Best of luck!

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse