Is it good to add sharding on single server? - mongodb

Is it good to do sharding on a single machine/server? If the size of the MongoDB documents is above 10GB, will it perform well?

The key rule of sharding is: don't shard until you absolutely have to. If you're not having performance problems now, you don't need to shard. Choosing a shard key can be a difficult process, and a poor choice means your data won't be balanced correctly between shards. Sharding a database adds significant operational overhead, since you will need additional mongod processes, config servers, and replica sets for the deployment to be stable in production.
I'm assuming you mean your collections are 10GB. Depending on the size of your machine, a 10GB collection is not a lot for Mongo to handle. If you're having performance issues with queries, my first step would be to go through your mongo log and see if there are any queries you can add indexes for.
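As a minimal sketch of that workflow in the mongo shell (the collection and field names below are purely illustrative):

// log operations slower than 100 ms to the system.profile collection
db.setProfilingLevel(1, 100);
// inspect the slowest recent operations
db.system.profile.find().sort({ millis: -1 }).limit(5).pretty();
// if a frequent query filters on a field with no index, add one for it
db.orders.createIndex({ customer_id: 1 });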

Related

What Issues could arise, when reading/writing to a single shard in a sharded MongoDB?

In the official MongoDB documentation it is stated that:
"Clients should never connect to a single shard in order to perform read or write operations." Source
I did try writing some data to a single shard node and it worked, but it is not a big dataset (certainly not one that would require sharding).
Could this lead to other issues, which I am not yet aware of?
Or are clients discouraged from doing this, simply because of performance reasons?
In a sharded cluster, the individual shards contain the data, but some of the operational state is stored on the mongos nodes. When you communicate with individual shards directly, you bypass the mongos routers and, with them, any state management they would perform.
Depending on the operation, this could be innocuous or could result in unexpected behavior/wrong results.
There isn't a comprehensive list of issues that would arise because the scenario of direct shard access isn't tested (because it's not supported).
The problem is that, by default, you don't know which portion of the data is stored on which shard. The balancer runs in the background and may change the data distribution at any time.
If you insert some data manually into one shard, then a properly connected client may not find this data.
If you read some data from a single shard then you don't know whether this data is complete or not.
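To illustrate, you can only reason about data placement through a mongos, and even then it is just a snapshot; a small sketch, with 'mydb.orders' as a made-up namespace:

sh.status();  // lists the chunk ranges and which shard currently owns each of them
db.getSiblingDB('mydb').orders.getShardDistribution();  // per-shard document and size counts
// note: this is only a snapshot - the balancer may move chunks at any moment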

Does MongoDB always write to the primary shard and then rebalance?

use vsm;
sh.enableSharding('vsm');
sh.shardCollection('vsm.pricelist', {maker_id:1});
OK, we enabled sharding for the database (vsm) and for a collection in this database (pricelist).
We are trying to write about 80 million documents to the 'pricelist' collection.
And we have about 2000 different maker_ids, distributed uniformly.
We have three shards. And Shard002 is PRIMARY for the 'vsm' database.
We write to the 'pricelist' collection from four application nodes, each running its own mongos.
And while writing data to the 'pricelist' collection we see 100% CPU usage ONLY on Shard002!
We see the rebalancing process, and data migrates to Shard000 and Shard003. But Shard002 still has high CPU usage and load average!
The shards are deployed on c4.xlarge EBS-optimized instances. dbdata is stored on io1 EBS volumes with 2000 IOPS.
It looks like MongoDB writes data only to one shard :( What are we doing wrong?
The problem
What you describe is usually an indication that you have chosen a poor shard key with maker_id, most likely one that is monotonically increasing.
What usually happens is that one shard is assigned the key range from x to infinity (Shard002 in your case). All new documents get written to that shard until it holds chunks in excess of the current migration threshold. Then the balancer kicks in and moves some chunks. The problem is that new documents still get written to that same shard.
The solution
An easy solution for that problem is to use a hashed shard key.
Now here comes the serious problem: you cannot change the shard key.
So what you have to do is make a backup of the sharded collection, drop it, reshard the collection using the hashed maker_id, and restore the backup into the new collection.
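A rough sketch of that workflow, assuming mongodump/mongorestore for the backup and reusing the namespace from the question (the mongos host is a placeholder):

# dump only the sharded collection, going through a mongos
mongodump --host <mongos-host> --db vsm --collection pricelist --out /backup
# then, in the mongo shell: drop the old collection and reshard it with a hashed key
use vsm
db.pricelist.drop()
sh.shardCollection('vsm.pricelist', { maker_id: 'hashed' })
# finally, restore the data; documents are now routed by the hashed maker_id
mongorestore --host <mongos-host> --db vsm --collection pricelist /backup/vsm/pricelist.bson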
Does MongoDB always write to the primary shard and then rebalance?
Yes, if you are relying on the auto-balancer and loading huge amounts of data into an empty collection.
In your situation, you are relying on the auto-balancer to do all of the sharding/balancing work. I assume what you want is for the data to be spread across the shards as it is loaded, so that no single shard carries all the CPU load.
This is how sharding/auto-balancing takes place at a high level.
Create chunks of data using split http://docs.mongodb.org/manual/reference/command/split/
Move the chunks to other shards http://docs.mongodb.org/manual/reference/command/moveChunk/#dbcmd.moveChunk
Now, when the auto-balancer is ON, these two steps occur after your data is already loaded or while it is still loading.
Solution
Create the empty collection where your data is going to be loaded and execute the shard command on it.
Turn off the auto balancer http://docs.mongodb.org/manual/tutorial/manage-sharded-cluster-balancer/#disable-the-balancer
Manually create empty chunks using split. http://docs.mongodb.org/manual/tutorial/create-chunks-in-sharded-cluster/
Move those empty chunks to different shards. http://docs.mongodb.org/manual/tutorial/migrate-chunks-in-sharded-cluster/
Start the load. This time all your data should go directly to its respective shards.
Turn the balancer back ON once the load is complete. http://docs.mongodb.org/manual/tutorial/manage-sharded-cluster-balancer/#enable-the-balancer
You will have to test this approach with a small data set first, but I think this is enough information to get you started; a rough sketch of the shell commands is below.
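A minimal sketch of those steps in the mongo shell, reusing the namespace from the question; the split points and shard names are illustrative (take the real shard names from sh.status()):

sh.stopBalancer();  // turn the auto-balancer off before loading

// manually pre-split the empty, already-sharded collection
sh.splitAt('vsm.pricelist', { maker_id: 500 });
sh.splitAt('vsm.pricelist', { maker_id: 1000 });
sh.splitAt('vsm.pricelist', { maker_id: 1500 });

// spread the empty chunks across the shards before the load starts
sh.moveChunk('vsm.pricelist', { maker_id: 1 }, 'shard0000');
sh.moveChunk('vsm.pricelist', { maker_id: 500 }, 'shard0001');
sh.moveChunk('vsm.pricelist', { maker_id: 1000 }, 'shard0002');

// load the data through mongos, then re-enable the balancer
sh.startBalancer();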

mongodb - Reclaim disk space regularly with no downtime

We have a replica set of 1 primary, 1 secondary, and 1 arbiter. We delete collections often, so I am looking for a fast way to reclaim the disk space used by deleted collections with no downtime; the current database size is close to 3TB.
I've been researching various ways of doing this; 2 common approaches are:
repairDatabase(): this needs free space equal to the currently used space in order to run. It takes the primary offline and then starts an initial sync on the secondary, which is a very lengthy process. During it only one node is available: read-only from the secondary while repairDatabase runs, and read/write during the initial sync.
Run an initial sync on a new node, then promote it to primary and retire the old one, and repeat the process for the secondary. With this option both primary and secondary stay available, but it is a very lengthy process and takes almost a week to run the initial sync twice.
Is there a better solution to reclaim disk space on a regular basis, and relatively faster than the above approaches?
Note that every single collection is subject to deletion.
Thanks
There's no easy way to achieve this unless you design your DB structure to keep different collections in different databases, which in turn means storing them in different paths on your disk, as long as you have directoryPerDB set to true in your mongo.conf. This is a workaround, and depending on your app it might be impractical.
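If you do go that route, this is roughly what the relevant mongod configuration fragment looks like (YAML format; the dbPath is just an example):

storage:
  dbPath: /data/db
  directoryPerDB: true   # each database gets its own subdirectory under dbPath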
While it's true that dropping a collection won't free the disk space, it's also true that the space is not lost: it will eventually be reused for new collections.
That being said, unless you are really short on space, don't reclaim that space. The CPU and I/O cost of doing so regularly is far more expensive than the storage capacity cost with every provider I know of.
I'd take a look at using MongoDB's sharding functionality to address some of your issues. To quote from the documentation:
Sharding is a method for storing data across multiple machines.
MongoDB uses sharding to support deployments with very large data sets
and high throughput operations.
While sharding is frequently used to balance throughput for a large collection across more servers, to avoid hot spots and spread the overall load, it's also useful for managing storage for large collections. In your specific case I'd investigate the use of shard tags to pin a collection to a specific shard.
Again, to quote the documentation, shard tags are useful to
isolate a specific subset of data on a specific set of shards.
For example, let's say you split your production environment into a couple of shards, shard1 and shard2. You could, using shard tags and the sharding tools, pin the collections that you frequently delete onto shard2. In this setup shard1 contains all your normal collections. When you then choose to reclaim disk storage via your second option, you'd perform this only on the shard that holds the deleted collections - that way you avoid having to recreate the more static data. It should run faster that way (how much faster is a function of how much data is in the deleted-collections shard at any given time).
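A small sketch of what that pinning could look like in the mongo shell, using the hypothetical shard1/shard2 names from above and a made-up collection sharded on _id:

sh.addShardTag('shard1', 'static');    // long-lived collections stay here
sh.addShardTag('shard2', 'volatile');  // frequently deleted collections go here
// pin the whole key range of a frequently deleted collection to the 'volatile' shard
sh.addTagRange('mydb.crawl_tmp', { _id: MinKey }, { _id: MaxKey }, 'volatile');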
It also has the secondary benefit that each shard (actually, the replica set within each shard) requires smaller servers, since each only contains a subset of the overall data.
The specifics of the best way to do this will be driven by your exact use case - number and size of collections, insert, update, query and deletion frequency, etc. I described a simple 2 shard case but you can do this with many more shards. You can also have some shards running on higher performance hardware for collections that have more transaction volume.
I can't really do sharding justice within the limited space here other than to point you in the right direction to investigate it. MongoDB has a lot of good information within their documentation, and their 2 online DBA courses (which are free) get into this in some detail.
Some useful links:
http://docs.mongodb.org/manual/core/sharding-introduction/
http://docs.mongodb.org/manual/core/tag-aware-sharding/

Scaling MongoDB on EC2 or should I just switch to DynamoDB?

I currently run my website on a single server with MongoDB. On my server I have two components: (1) a crawler that runs hourly and appends data to my MongoDB instance, and (2) a website that reads from the crawler index and also writes to a user personalization DB. I am moving to Amazon EC2 for auto-scaling, so that the web server can scale out as traffic increases. I don't need auto-scaling for my crawler. This poses a challenge for how I use MongoDB. I'm wondering what my best option is to optimize for:
Minimal changes to my code (the code is in perl)
Ability to seamlessly add/remove web-servers without worry about losing data in the DB
Low cost
In the short term, the DB will certainly be able to fit in memory across all machines since it will be under 2 GB. The user personalization DB can't be rebuilt, so it's more important to preserve it, while the index can easily be rebuilt. The current MongoDB crawl index has about 100k entries that are keyed on ~15 different columns. This is built for speed, as I am working on an online dating site (that is searchable in many ways).
I can think of a few options:
Use SimpleDB for the user personalization store and MongoDB for the index. Have the index replicate across all machines; however, I don't know too much about MongoDB replication.
Move everything to SimpleDB
Move everything to DynamoDB
I don't know too much about SimpleDB and/or DynamoDB. Based on articles it seems like DynamoDB would be a natural choice, but I'm not sure about good Perl support, whether I can have all my columns, indexes, etc. Anyone have experience or any advice?
You could host Mongo on a single server on EC2 to which each of the boxes in the web farm connects. You can then easily spin up another web instance that uses the same DB box.
We currently have three Mongo servers as we run a replica set and when we get to the point where we need to scale horizontally with Mongo we'll spin up some new instances and shard the larger collections.
I currently run my website on a single server with MongoDB.
First off, this is a big red flag. When running in production, it is always recommended to run a replica set with at least three full nodes.
Replication provides automatic redundancy and fail-over.
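Setting one up is straightforward; a minimal sketch from the mongo shell, with placeholder hostnames:

rs.initiate({
  _id: 'rs0',
  members: [
    { _id: 0, host: 'mongo-a.example.com:27017' },
    { _id: 1, host: 'mongo-b.example.com:27017' },
    { _id: 2, host: 'mongo-c.example.com:27017' }
  ]
});
rs.status();  // verify a primary has been elected and the secondaries are syncing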
Ability to seamlessly add/remove web-servers without worry about losing data in the DB
MongoDB supports a concept called sharding. Sharding provides a way to scale horizontally by automatically partitioning data. The partitioning is done via a shard key.
If you plan to use sharding, please read that link very carefully and recognize the limitations. For MongoDB sharding you have to select the correct key that will allow queries to be evenly distributed across the shards.
The current MongoDB crawl index has about 100k entries that are keyed on ~15 different columns.
This is going to be a problem with sharding. Sharding can only scale queries that use the shard key. A query on the shard key can be routed directly to a single machine. A query on a secondary index goes to all machines.
You have 15 different indexes, so basically all of these queries will go to all shards. That will not "auto-scale" very well at all.
Beware that at the moment EC2 does not have 64 bit small instances, making replication potentially expensive. Because MongoDB memory maps files, a 32 bit OS is not advised.
I've had very bad experiences with SimpleDB and think it's fundamentally flawed, so I would avoid it.
There is a good white paper on how to set up MongoDB on Amazon EC2: http://d36cz9buwru1tt.cloudfront.net/AWS_NoSQL_MongoDB.pdf
I suspect setting up MongoDB on EC2 is the fastest solution versus rewriting-for/migrating-to DynamoDB.
Best of luck!

MongoDB Index in Memory with Sharding

The word on the street is that MongoDB gets slow if you can't keep the indexes you're using in memory. How does this work with sharding? Does a shard only keep its own BTree in memory, or does every shard need to keep the index for the entire collection in memory?
Does a shard only keep its own BTree in memory...?
Yes, each shard manages its own indexes.
The word on the street is that MongoDB gets slow if you can't keep the indexes you're using in memory.
You can actually expect worse when using sharding and secondary indexes. The key problem is that the router process (mongos) knows nothing about data in secondary indexes.
If you do a query using the shard key, it will be routed directly to the correct server(s). In most cases, this levels out the workload. So 100 queries can be spread across 100 servers and each server only answers 1 query.
However, if you do a query using a secondary key, that query has to go to every server. So 100 queries to the router will result in 10,000 queries across 100 servers, or 100 queries per server. As you add more servers, these "non-shard-key" queries become less and less efficient; the workload does not become more balanced.
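You can see the difference in the explain output when connected through a mongos; a sketch using an illustrative 'users' collection sharded on { user_id: 1 } (the exact plan stage names depend on the server version):

// targeted: mongos routes this to the one shard owning that key range (e.g. SINGLE_SHARD)
db.users.find({ user_id: 12345 }).explain();
// scatter-gather: no shard key in the filter, so every shard is queried and the results merged (e.g. SHARD_MERGE)
db.users.find({ email: 'a@example.com' }).explain();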
Some details are available in the MongoDB docs here.
Just its own portion of the index (it doesn't know about the other shards' data). Scaling wouldn't work very well, otherwise. See this documentation for some more information about sharding:
http://www.mongodb.org/display/DOCS/Sharding+Introduction
http://www.mongodb.org/display/DOCS/Choosing+a+Shard+Key