MongoDB 'Manually Sharding' for multi-tenancy

We are using Mongo to host a multi-tenant application. Each tenant is going to have their own database. To get around resource utilization issues, the approach we are taking is to shard by database (as opposed to by collection, if that is the correct term to use).
This means for every x tenants we will create a new 3-node replica set. So we may have for example 1000 tenants on 1 shard and another 1000 tenants on another shard.
My question is regarding the placement of the databases for new signups. The approach we were going to take was to flag a shard as being the 'active' shard and create all new tenants on that shard. When it reaches capacity, create a new shard, flag that as the active shard, and continue on.
Can you choose which shard you create a new database on in Mongo directly? If left to Mongo, from what I understand, it will do it in round-robin fashion when there is more than one shard, which may leave our shards imbalanced.
Is this the right approach or is there an alternative better approach?

You can use shard tags to force some collections to reside only on specific shards. So you could, for example, tag each shard with its serial number, and tag the collections/databases you want to have on that shard with that tag, until it fills up, at which point you create a new shard, increase the counter, and use that for new data.
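For example, a minimal sketch of the tag approach from the mongos shell (the shard, tag, and namespace names here are hypothetical, and the tenant collection is assumed to be sharded on _id):
sh.addShardTag("shard0000", "group1")  // mark the currently 'active' shard with a tag
sh.addTagRange("tenant123.data", { _id: MinKey }, { _id: MaxKey }, "group1")  // pin the collection's whole key range to that shard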
Another option then is to not enable sharding on the individual databases at all, and use the movePrimary command to force a specific shard to act as the primary shard for a specific database. Since the database won't be sharded, all its data will remain on its designated primary shard, which is exactly what you want.
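For example, assuming a tenant database named 'tenant123' (a hypothetical name), you could pin it to a chosen shard from the mongos shell like this:
db.adminCommand({ movePrimary: "tenant123", to: "shard0001" })  // all of tenant123's unsharded data now lives on shard0001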
That being said, it seems to me like this approach conflicts with the very concept of sharding, which is meant to evenly distribute data across multiple machines automatically.

Related

MongoDB sharding with repeated documents

I am new to MongoDB and wish to create a distributed database environment using docker-compose with MongoDB. I've created multiple Docker containers with shards to simulate multiple sites. However, I have a problem replicating the same set of documents into multiple shards.
For example, I have a collection with a key that has values "A" and "B". I want to distribute this collection across 2 shards where
Shard 1 = A & B
Shard 2 = B only
However, when I run the balancer it distributes all A's into shard 1 and B's into shard 2. Is there any way I can do the sharding with repeated data or am I using the wrong approach for my problem?
You might be approaching sharding (horizontal scaling) incorrectly. What makes sharding in Mongo work is choosing a shard key such that documents end up roughly evenly distributed across the shards. Another requirement that makes sharding work well is that queries are typically directed to only a single shard. If you have queries which need to return documents with both values A and B of some field, that implies this field should not be the shard key. Queries can go across shards, but certain cross-shard operations, such as joins, can be very costly. In your particular case, perhaps some other field could be used as the shard key.
Redundancy in MongoDB is provided by replica sets, not sharded clusters.
Each shard can be backed by a replica set with your desired number of nodes to provide the required redundancy level.
It is not possible to have the same document be (authoritatively) located in multiple shards.
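For instance, a shard can be added to the cluster as a three-node replica set; the replica set and host names below are placeholders:
sh.addShard("rs0/mongo1.example.net:27017,mongo2.example.net:27017,mongo3.example.net:27017")  // each shard is its own replica set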

mongodb - one collection per shard

My system is built on multi-tenancy, and I'm intending to apply database sharding and replica set on it. This is new to me, so I have some questions below:
Is it possible to partition a collection disjointly, onto one shard only? That means instead of splitting some documents into 1 shard and some others into another shard, I want to put 1 collection completely in 1 shard, and another collection completely in another shard. My multi-tenant system is built on schema-per-tenant, so 1 collection represents 1 tenant; putting each of them completely in 1 shard would make aggregate queries more reliable within that tenant's scope.
If MongoDB cannot support what question 1 asks for, how can I correctly aggregate the queried data across shards when a collection's documents are scattered?
I want to know the full extent of support provided by the DBMS instead of delegating the logic to the backend. Thank you very much.

Does MongoDB always write to the primary shard and then rebalance?

use vsm;
sh.enableSharding('vsm');
sh.shardCollection('vsm.pricelist', {maker_id:1});
OK, we enabled sharding for the database (vsm) and a collection in this database (pricelist).
We are trying to write about 80 million documents to the 'pricelist' collection.
And we have about 2000 different maker_ids, distributed uniformly.
We have three shards. And Shard002 is PRIMARY for the 'vsm' database.
We write to the 'pricelist' collection from four application nodes, each with its own mongos started.
And while writing data to the 'pricelist' collection we see 100% CPU usage ONLY on Shard002!
We see the rebalancing process, and data migrates to Shard000 and Shard003. But Shard002 still has high CPU usage and load average!
Shards are deployed on c4.xlarge EBS-optimized instances. dbdata is stored on io1 EBS volumes with 2000 IOPS.
It looks like MongoDB writes data only to one shard :( What are we doing wrong?
The problem
What you describe is usually an indication that you have chosen a poor shard key: maker_id is most likely monotonically increasing.
What usually happens is that one shard is assigned the key range from x to infinity (Shard002 in your case). All new documents get written to that shard until it holds chunks in excess of the current migration threshold. Then the balancer kicks in and moves some chunks, but the problem is that new documents still get written to that same shard.
The solution
An easy solution to that problem is to use a hashed shard key.
Now here comes the serious problem: you cannot change the shard key.
So what you have to do is make a backup of the sharded collection, drop it, reshard the collection using the hashed maker_id, and restore the backup into the new collection.
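A rough sketch of that procedure, using mongodump/mongorestore with the namespace from the question (test it on a copy first; the dump path shown is the default one mongodump creates):
From the system shell, back up the collection:
mongodump --db vsm --collection pricelist
Then, from the mongos shell:
db.getSiblingDB("vsm").pricelist.drop()  // drop the poorly sharded collection
sh.shardCollection("vsm.pricelist", { maker_id: "hashed" })  // reshard with a hashed key (MongoDB 2.4+)
Back in the system shell, restore into the newly sharded collection:
mongorestore --db vsm --collection pricelist dump/vsm/pricelist.bson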
Does MongoDB always write to the primary shard and then rebalance?
Yes, if you are relying on the auto-balancer and loading huge amounts of data into an empty collection.
In your situation, you are relying on the auto-balancer to do all the sharding/balancing work. I assume what you want is for the data to go to each shard as it is loaded, which would keep CPU usage lower.
This is how sharding/auto-balancing takes place at a high level:
Create chunks of data using split http://docs.mongodb.org/manual/reference/command/split/
Move the chunks to other shards http://docs.mongodb.org/manual/reference/command/moveChunk/#dbcmd.moveChunk
Now, when the auto-balancer is ON, these two steps occur after your data is already loaded, or while it is loading.
Solution
Create an empty collection (the collection your data is going to be loaded into) and execute the shard command on it.
Turn off the auto balancer http://docs.mongodb.org/manual/tutorial/manage-sharded-cluster-balancer/#disable-the-balancer
Manually create empty chunks using split. http://docs.mongodb.org/manual/tutorial/create-chunks-in-sharded-cluster/
Move those empty chunks to different shards. http://docs.mongodb.org/manual/tutorial/migrate-chunks-in-sharded-cluster/
Start the load. This time all your data should go directly to its respective shard.
Turn the balancer back ON (once the load is complete). http://docs.mongodb.org/manual/tutorial/manage-sharded-cluster-balancer/#enable-the-balancer
You will have to test this approach using a small data set first, but I think you have enough information to get started.
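A minimal sketch of that workflow from the mongos shell, using the 'vsm.pricelist' namespace from the question; the split points and shard names are illustrative (roughly 2000 maker_ids split into thirds across three shards):
sh.shardCollection("vsm.pricelist", { maker_id: 1 })  // shard the still-empty collection
sh.setBalancerState(false)  // turn the auto-balancer off
sh.splitAt("vsm.pricelist", { maker_id: 667 })  // create empty chunks at chosen split points
sh.splitAt("vsm.pricelist", { maker_id: 1334 })
db.adminCommand({ moveChunk: "vsm.pricelist", find: { maker_id: 1 }, to: "shard0000" })  // spread the chunks across shards
db.adminCommand({ moveChunk: "vsm.pricelist", find: { maker_id: 1000 }, to: "shard0001" })
// ...run the load, then once it completes:
sh.setBalancerState(true)  // turn the balancer back on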

Mongodb - sharded and unsharded collections

I'm a bit confused as to how this works.
When sharding MySQL, we kept some tables, usually small ones with reference data, whole in each shard. This was to enable joins.
If we have small collections in MongoDB that we don't shard in a sharded setup, what happens to them? Do they get sent to each shard, or just stay in the first shard?
This strikes me as a potential bottleneck if all processes in a heavily sharded system with many application servers were hitting one server.
In MongoDB, with the auto-sharding feature, a sharded collection will be distributed roughly evenly across all the shards you have.
Collections which you are not likely to shard (which are not sharded) reside on a single shard: their database's primary shard. The primary shard is designated per database, can be moved, and can be different for different databases.
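For example, you can see each database's primary shard in the config database and relocate it with the movePrimary command (the database and shard names here are illustrative):
db.getSiblingDB("config").databases.find()  // each entry lists its 'primary' shard
db.adminCommand({ movePrimary: "smalldb", to: "shard0001" })  // move an unsharded database's data to another shard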
There is also the notion of shard tagging, with which you can influence where sharded collections are placed. Basically, you can constrain a collection, or a part of a collection, to be stored on a specific set of shards. (Reference)

Can we move the document dynamically across shards in mongo db?

I am building a tracking platform which has the following use cases.
Need to track 50,000 vehicles
Each vehicle relays its location every 60 secs.
Get API which returns all the vehicles in the X km range.
So, I need to scale writes and also achieve query isolation.
I can create a sharded cluster with the geographical region as the shard key (geohash). This will help me balance the writes and also achieve query isolation. But what happens when a vehicle moves across regions? Does MongoDB automatically move the document to the new shard in this case?
You cannot change the shard key fields for a record once written. Using the region as the shard key would prevent documents from moving across regions unless you delete the record from the original region and then insert it using the new one.
On choosing a shard key, look for one which matches your most common query pattern. Querying on the shard key will allow you to retrieve a record directly from a single shard; queries which don't use the shard key will have to perform a scatter-gather query against all shards.
If you are on, or can use, MongoDB 2.4, and you don't need to perform range-based queries, you may want to consider using a hashed shard key, which allows for even distribution even if the underlying field is monotonically increasing. See this page for advice on choosing a shard key.
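For example, a sketch of that option under the stated assumptions (the database, collection, and field names below are hypothetical):
sh.enableSharding("tracking")
sh.shardCollection("tracking.positions", { vehicle_id: "hashed" })  // writes spread evenly; equality queries on vehicle_id still target a single shard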