MongoDB Sharding and Indexing - mongodb

I have been struggling to deploy a large database.
I have deployed 3 shard clusters and started indexing my data.
However, it's been 16 days and I'm only halfway through.
The question is: should I import all the data into a non-sharded cluster, activate sharding once the raw data is in the database, and then attach more clusters and start indexing? Will this auto-balance my data?
Or should I just wait another 16 days for the current method to finish...
Edit:
Here is more explanation of the setup and data that is being imported...
We have 160 million documents that look like this:
"_id" : ObjectId("5146ae7de4b0d58a864bcfda"),
"subject" : "<concept/resource/propert/122322xyz>",
"predicate" : "<concept/property/os/123ABCDXZYZ>",
"object" : "<http://host/uri_to_object_abcdy>"
Indexes: subject, predicate, object, plus compound indexes on (subject, predicate) and (object, predicate)
Shard keys: subject, predicate, object
Setup:
3 shard clusters on AWS (each with a 3-member replica set), with each node having 8 GiB of RAM
(config servers run within each cluster, and mongos runs on a separate server)
The data is imported through mongos by a Java program.
What would be the ideal way to import this data, index it, and shard it, without waiting a month for the process to complete?

If you are doing a massive bulk insert, it is often faster to perform the insert without an index and then index the collection. This has to do with the way Mongo manages index updates on the fly.
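As a rough sketch in the mongo shell (the database and collection names are placeholders; the index list mirrors the one in the question):
use rdf
db.triples.dropIndexes()   // drops secondary indexes; the _id index (and a shard-key index) cannot be dropped
// ... run the bulk import through mongos ...
db.triples.createIndex({subject: 1})
db.triples.createIndex({predicate: 1})
db.triples.createIndex({object: 1})
db.triples.createIndex({subject: 1, predicate: 1})
db.triples.createIndex({object: 1, predicate: 1})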
Also, MongoDB is particularly sensitive to memory when it indexes. Check the size of your indexes with db.stats() and hook your databases up to the MongoDB Monitoring Service (MMS).
In my experience, whenever MongoDB takes a lot more time than expected, it is due to one of two things:
Running out of physical memory or getting itself into a poor I/O pattern. MMS can help diagnose both; check the page faults graph in particular.
Operating on unindexed collections, which does not apply in your case.

Related

Is it good to add sharding on a single server?

Is it good to do sharding on a single machine/server? If the size of the MongoDB documents is above 10 GB, will it perform well?
The key rule of sharding is: don't shard until you absolutely have to. If you're not having problems with performance now, you don't need to shard. Choosing a shard key can be a difficult process, and you must choose well so that your data gets balanced correctly between shards. Sharding also adds severe operational overhead, since you will need additional mongod processes, config servers, and replica sets for the deployment to be stable in production.
I'm assuming you mean your collections are 10 GB. Depending on the size of your machine, a 10 GB collection is not a lot for Mongo to handle. If you're having performance issues with queries, my first step would be to go through your mongo log and see if there are any queries you could add indexes for.
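One way to surface such queries is the built-in profiler; a small sketch (the 100 ms threshold is just an example):
db.setProfilingLevel(1, 100)   // profile queries slower than 100 ms
db.system.profile.find().sort({ts: -1}).limit(5).pretty()   // inspect the most recent slow queries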

Is MongoDB always write to primary Shard and then rebalance?

use vsm;
sh.enableSharding('vsm');
sh.shardCollection('vsm.pricelist', {maker_id:1});
OK, we enabled sharding for the database (vsm) and for a collection in that database (pricelist).
We are trying to write about 80 million documents to the 'pricelist' collection.
And we have about 2,000 different maker_ids, uniformly distributed.
We have three shards, and Shard002 is the PRIMARY for the 'vsm' database.
We write to the 'pricelist' collection from four application nodes, each running its own mongos.
And while writing data to the 'pricelist' collection we see 100% CPU usage ONLY on Shard002!
We see the rebalancing process, and data migrates to Shard000 and Shard003, but Shard002 still has high CPU usage and load average!
The shards are deployed on c4.xlarge EBS-optimized instances; dbdata is stored on io1 EBS volumes with 2000 IOPS.
It looks like MongoDB writes data only to one shard :( What are we doing wrong?
The problem
What you describe is usually an indication that you have chosen a poor shard key with maker_id, most likely one that is monotonically increasing.
What usually happens is that one shard is assigned the key range from x to infinity (Shard002 in your case). All new documents get written to that shard until it holds chunks in excess of the current migration threshold. Then the balancer kicks in and moves some chunks, but the problem is that new documents still get written to that same shard.
The solution
An easy solution for that problem is to use a hashed shard key.
Now here comes the serious problem: you cannot change the shard key of an existing collection.
So what you have to do is make a backup of the sharded collection, drop it, reshard the collection using the hashed maker_id, and restore the backup into the new collection.
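A sketch of the resharding step in the mongo shell; the backup and restore happen outside the shell with mongodump and mongorestore, and the names are the ones from the question:
// after backing up the collection with mongodump:
db.pricelist.drop()
sh.shardCollection('vsm.pricelist', {maker_id: 'hashed'})
// then reload the data with mongorestore (or re-run the import)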
Is MongoDB always write to primary Shard and then rebalance?
Yes, if you are relying on the auto-balancer and loading a huge amount of data into an empty collection.
In your situation, you are relying on the auto-balancer to do all the sharding/balancing work. I assume what you want is for the data to go to each shard as it is loaded, so that the CPU usage is spread out.
This is how sharding/auto-balancing works at a high level:
Create chunks of data using split http://docs.mongodb.org/manual/reference/command/split/
Move the chunks to other shards http://docs.mongodb.org/manual/reference/command/moveChunk/#dbcmd.moveChunk
Now, when the auto-balancer is ON, these two steps occur after your data is already loaded, or while it is still loading.
Solution
Create an empty collection (the one your data is going to be loaded into) and execute the shard command on it.
Turn off the auto balancer http://docs.mongodb.org/manual/tutorial/manage-sharded-cluster-balancer/#disable-the-balancer
Manually create empty chunks using split. http://docs.mongodb.org/manual/tutorial/create-chunks-in-sharded-cluster/
Move those empty chunks to different shards. http://docs.mongodb.org/manual/tutorial/migrate-chunks-in-sharded-cluster/
Start the load. This time all your data should go directly to its respective shard.
Turn the balancer back ON once the load is complete. http://docs.mongodb.org/manual/tutorial/manage-sharded-cluster-balancer/#enable-the-balancer
You will have to test this approach with a small data set first, but I think this gives you enough information to get started; a condensed sketch of the steps follows.
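In the mongo shell the sequence would look roughly like this; the split points, chunk destinations, and shard names are illustrative only:
// 1. shard the empty collection
sh.enableSharding('vsm')
sh.shardCollection('vsm.pricelist', {maker_id: 1})
// 2. turn off the balancer
sh.setBalancerState(false)
// 3. pre-split into empty chunks (the boundaries here are made up)
for (var i = 100; i < 2000; i += 100) {
    sh.splitAt('vsm.pricelist', {maker_id: i})
}
// 4. spread the empty chunks across the shards (repeat per chunk and shard)
db.adminCommand({moveChunk: 'vsm.pricelist', find: {maker_id: 100}, to: 'shard0000'})
// 5. run the load, then:
sh.setBalancerState(true)   // 6. turn the balancer back on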

mongodb - huge insert time on sharding

I am having a problem with the insertion time of 300,000,000 documents into a collection.
I checked the insertion performance for the same number of documents on a single node: the time taken was approximately 23 minutes.
Then I created 2 shards and tried to insert the same number of documents. The insert time was more than 25 hours.
The two shards are each 8 GB RAM, 8-core machines.
The config server and router are on the same machine, which has 4 GB of RAM and 4 cores.
I am using the C# driver in my app, creating BSON documents for insertion.
The collection structure is:
Logs{
"_id"
"LID"
"Ver"
"Y"
"M"
"D"
"H"
"Min"
"Sec"
"MSec"
"FID"
}
The shard key is the _id field. The sharding chunkSize is set to 1 (MB).
What should I check to find where the performance problem lies?
Can anyone suggest a solution, or the things I should look into, to find the factors that are increasing the insertion time?
Thanks in advance.
I think the problem is due to chunk migration. Basically, while you are inserting, data is also being moved from one shard to another, and it might even move back to the same shard. The indexes may also be eating some of your time (it is a common observation in databases that creating an index and then inserting data is slower than inserting the data first and creating the index afterwards).
So if I were you, I would do the following:
Create a single-node mongod and load all the data into it, not with db.coll.insert() but with mongodump and mongorestore.
Then create indexes on whatever fields are needed.
Then shard your collection.
You might also try disabling the balancer for the duration of the insert; a sketch of the whole sequence follows.
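A rough sketch of that sequence in the mongo shell, assuming the data has already been restored with mongorestore (the 'logdb' database name and the index fields are placeholders; the _id shard key is the one from the question):
// 2. build indexes after the data is loaded
db.Logs.createIndex({LID: 1})
db.Logs.createIndex({Y: 1, M: 1, D: 1})
// 3. shard the collection
sh.enableSharding('logdb')
sh.shardCollection('logdb.Logs', {_id: 1})
// optionally keep the balancer off while inserting more data, then re-enable it
sh.setBalancerState(false)
// ... insert ...
sh.setBalancerState(true)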

What is the performance of a query that doesn't contains the shard key in a sharded MongoDB environment?

The title says it all. Assume you have a sharded MongoDB environment and the user provides a query that doesn't contain the shard key. What is the actual performance of the query? What happens in the background?
The performance depends on any number of factors; however, the default action of MongoDB in this case is to do a global scatter-gather operation, whereby it sends the query to all shards and then merges duplicates to give you the end result.
Returning to performance: it normally depends on the indexes on each shard, on how well each shard has optimised its own data set, and on how much of the key range each one holds.
However, processing is parallel in sharding, which means all shards receive the query at once and the mongos merges results as they come in. So the behaviour shouldn't be: go to shard 1, get results, then shard 2; instead it is: go to all shards, each shard returns its results, and the mongos merges and returns them.
Here is a good presentation (with nice pictures) on exactly how queries with sharding work in certain situations: http://www.slideshare.net/mongodb/how-queries-work-with-sharding
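One way to see the scatter-gather behaviour yourself is explain(); in this hypothetical example the collection is sharded on {region: 1} while the query filters on a different field:
db.users.find({email: 'someone@example.com'}).explain()
// the output contains one entry per shard under 'shards',
// showing that every shard was consulted for the query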
If the query targets a sharded collection, it runs on all shards; if it targets a non-sharded collection, MongoDB reads all the data from the single shard (the database's primary shard) that holds it.
Here is a link to the sharding FAQ in the MongoDB docs:
http://docs.mongodb.org/manual/faq/sharding/

MongoDB Index in Memory with Sharding

The word on the street is that MongoDB gets slow if you can't keep the indexes you're using in memory. How does this work with sharding? Does each shard only keep its own B-tree in memory, or does every shard need to keep the index for the entire collection in memory?
Does each shard only keep its own B-tree in memory...?
Yes, each shard manages its own indexes.
The word on the street is that MongoDB gets slow if you can't keep the indexes you're using in memory.
You can actually expect things to get worse when combining sharding with secondary indexes. The key problem is that the router process (mongos) knows nothing about the data in secondary indexes.
If you do a query using the shard key, it will be routed directly to the correct server(s). In most cases, this levels out the workload, so 100 queries can be spread across 100 servers with each server answering only 1 query.
However, if you do a query using a secondary key, that query has to go to every server. So 100 queries to the router result in 10,000 queries across 100 servers, or 100 queries per server. As you add more servers, these non-shard-key queries become less and less efficient; the workload does not become more balanced.
Some details are available in the MongoDB docs here.
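To confirm this on a running cluster, collection stats gathered through mongos break the figures down per shard; a small sketch ('mycoll' is a placeholder):
var s = db.mycoll.stats()
// 'shards' holds one sub-document per shard, each with its own index statistics
Object.keys(s.shards).forEach(function(name) {
    print(name, s.shards[name].totalIndexSize)
})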
Just its own portion of the index (a shard doesn't know about the other shards' data). Scaling wouldn't work very well otherwise. See this documentation for more information about sharding:
http://www.mongodb.org/display/DOCS/Sharding+Introduction
http://www.mongodb.org/display/DOCS/Choosing+a+Shard+Key