Is a MongoDB database available while it is being sharded? - mongodb

I am sharding a large database in MongoDB. Since the collection holds more than 100,000,000 documents, sharding it will take a really long time, right? I am running the sharding operation on this database now, and the command line just sits in a waiting state. How can I check whether the shard operation is progressing normally?

Do the following steps (order matters):
Set up the sharded MongoDB servers
Then run the sharding operation on the database you have (the collection with over 100,000,000 documents or so)

Well, in fact, it is because I should have sharded the database before populating it with data. I should never load all the data into a database whose shard configuration is ready but which has not been shard-enabled. In that case, when I run sh.enableSharding("database name") and sh.shardCollection(), the terminal falls into a seemingly endless wait. But if I run sh.enableSharding("database name") and sh.shardCollection() on an empty database and then populate it, the data is distributed across the shards while I am inserting it.
Well, another piece of experience.
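The order described above (enable sharding, shard the still-empty collection, then load) can be sketched as a small helper for the mongo shell. This is only a sketch under assumptions: the function name shardBeforeLoading and the names "mydb", "events" and userId are placeholders that do not come from the question; sh.enableSharding() and sh.shardCollection() are the shell helpers already mentioned in the answer.

```javascript
// Hypothetical helper: shard an empty collection first, load data afterwards.
// `sh` is the sharding helper object that the mongo shell provides.
function shardBeforeLoading(sh, dbName, collectionName, shardKey) {
  const namespace = dbName + "." + collectionName;
  sh.enableSharding(dbName);               // 1. enable sharding on the database
  sh.shardCollection(namespace, shardKey); // 2. shard the (still empty) collection
  return namespace;                        // 3. only now start inserting into it
}

// Example call inside the shell (placeholder names, not from the question):
// shardBeforeLoading(sh, "mydb", "events", { userId: 1 });
```

Running sh.status() afterwards should list the collection as sharded, which is one way to confirm the setup before starting the import.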

Related

MongoDB shard locks up after deleting large amounts of data from mongo-router

I have a configuration that contains three shards and a mongo-router. One collection in the database contains a large amount of data. If I try to remove some of the data by opening the mongo shell and running db.collectionName.deleteMany({}), the collection starts removing the data. However, after a while (thousands of documents deleted), one of the shards becomes unresponsive (locks up). After that, I can still get into the mongo shell, but all queries hang.
Any thoughts on what's happening/how to fix it?

Mongo dynamic collection creation and locking

I am working on an app where I am looking into creating MongoDB collections on the fly as they are needed.
The app consumes data from a data source and maps the data to a collection. If the collection does not exist, the app:
creates the collection
kicks off appropriate indexes in the background
shards the collection
and then inserts the data into the collection.
While this is going on, other processes will be reading and writing from the database.
Looking at this MongoDB locking FAQ, it appears that reads and writes in other collections of the database should not be affected by the dynamic collection creation and setup, i.e. they won't end up waiting on a lock we acquired to create the collection.
Question: Is the above assumption correct?
Thank you in advance!
No, when you insert into a collection which does not exist, then a new collection is created automatically. Of course, this new collection does not have any index (apart from _id) and is not sharded.
So, you must ensure the collection is created before the application starts any inserts.
However, it is no problem to create indexes and enable sharding on a collection which already contains some data.
Note that when you enable sharding on an empty collection or a collection with a small amount of data, all data is initially written to the primary shard. You may use sh.splitAt() to pre-split chunks for the upcoming data.
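As a rough illustration of pre-splitting, the sketch below computes evenly spaced split points for a numeric shard key and shows, in a comment, how they might be passed to sh.splitAt(). The helper computeSplitPoints, the namespace "mydb.events" and the userId key are made up for the example, and an even spread only makes sense if the key values are roughly uniformly distributed.

```javascript
// Hypothetical helper: returns chunkCount - 1 evenly spaced split values
// strictly between minKey and maxKey.
function computeSplitPoints(minKey, maxKey, chunkCount) {
  const points = [];
  for (let i = 1; i < chunkCount; i++) {
    points.push(Math.floor(minKey + ((maxKey - minKey) * i) / chunkCount));
  }
  return points;
}

console.log(computeSplitPoints(0, 100, 4)); // prints [ 25, 50, 75 ]

// In the mongo shell, each point could then become a chunk boundary
// (placeholder namespace and key, assumed for illustration):
// for (const point of computeSplitPoints(0, 1000000, 8)) {
//   sh.splitAt("mydb.events", { userId: point });
// }
```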

Query against local MongoDB shard data only

I have a sharded collection, with a shard key "user id". I would like to perform a query where, instead of passing the shard key, I simply restrict the query to only the data on the local mongos shard.
Is this possible / advisable?
Furthermore, can it be used with findAndModify? This would allow me to perform atomic updates on local documents, without specifying a shard key in the query.
Edit
As stated in some answers and comments below, my understanding of mongos vs. mongod was a little skewed. I now appreciate that mongos doesn't hold the local data.
Does mongos have any "local" data?
No. Each mongos daemon routes queries to your shards and does not store any data itself, so there is no such concept as "local" documents stored by a mongos. The mongos interface provides a logical view of the entire sharded cluster and does not have affinity to a specific shard.
Based on the type of query/command you send to mongos, the query will be:
Directed: sent to a specific shard if the query uses the shard key
Targeted: sent to applicable shards if the query includes multiple shard key values (or uses a prefix subset of a compound shard key)
Scatter/gather: sent to all shards, if the query is not using the shard key
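The three cases above can be modelled with a small classifier. This is a deliberately simplified sketch, not how mongos is implemented: the function classifyRouting is hypothetical, it only looks at which shard key fields appear in the query and whether a value uses $in, and it ignores the chunk metadata that the real router consults.

```javascript
// Hypothetical model of the routing decision; not a MongoDB API.
// shardKeyFields: ordered field names of the shard key, e.g. ["userId"].
// query: a plain query document, e.g. { userId: 42 }.
function classifyRouting(shardKeyFields, query) {
  // Count how many leading shard key fields the query constrains.
  let prefixLength = 0;
  for (const field of shardKeyFields) {
    if (!Object.prototype.hasOwnProperty.call(query, field)) break;
    prefixLength += 1;
  }
  if (prefixLength === 0) return "scatter/gather"; // shard key not usable
  if (prefixLength < shardKeyFields.length) return "targeted"; // prefix of a compound key

  // Full key present: several values for a field may span several shards.
  const hasMultipleValues = shardKeyFields.some((field) => {
    const value = query[field];
    return (
      value !== null &&
      typeof value === "object" &&
      Array.isArray(value.$in) &&
      value.$in.length > 1
    );
  });
  return hasMultipleValues ? "targeted" : "directed";
}

console.log(classifyRouting(["userId"], { userId: 42 }));     // directed
console.log(classifyRouting(["userId"], { name: "alice" }));  // scatter/gather
```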
Should I read from shards directly?
No. It's technically possible to read data from the shards directly, but it is definitely not recommended, as you can get an inconsistent view of the data. For example, if there is a migration in progress, the data will temporarily exist on both the donor shard and the target shard. Similarly, copies of documents may be orphaned as the result of failed migrations.
A query through mongos correctly directs queries to the appropriate shard(s) and filters results based on the sharded cluster metadata.
Can I use findAndModify() on a sharded collection without a query based on a shard key?
No. For a sharded collection, findAndModify() requires a query based on the shard key. The shard key provides a guarantee that the requested document only exists on one shard.
Can I update sharded collections without going through mongos?
No. All updates to a sharded collection must go through mongos.
Please keep in mind that doing so is ill-advised, as traffic to a sharded cluster should go through a mongos service.
That being said, it's possible to query the shard itself if you're performing the query locally on the shard instance.
I've never tried to do that programmatically, but it may be worth a shot.
You can log in directly to the machine running the shard and open a mongo shell there. If you've never created a local user/password on it, I believe you can connect without credentials; otherwise, the mongod process on that specific shard must have its own user/password, as those created via the mongos are not recognised by the mongod shards.
Each shard knows only its own data files, so if you run, for example, a count() operation on one of your collections, you'll see that the result is only a portion of the actual collection size.
Your question is a little vague since you mix your English:
I simply restrict the query to only the data on the local mongos shard.
The shard will in fact be a mongod process, not a mongos process. However, your wording can make sense if you have a mongos per shard, in which case you would want to direct the query to the mongos on that shard so that it queries its local mongod data.
If you are considering circumventing the mongos, then @Stennie's comment answers your question. If you mean something else, then I do not believe the mongos currently has a command switch that lets you direct queries without a shard key.

What is the performance of a query that doesn't contain the shard key in a sharded MongoDB environment?

The title says it all. Assume that you have a sharded MongoDB environment and the user provides a query which doesn't contain the shard key. What is the actual performance of the query? What happens in the background?
The performance depends on any number of factors; however, the default action of MongoDB in this case is to do a global scatter and gather operation, whereby it sends the query to all shards and then merges their partial results to give you an end result.
Returning to the performance, it normally depends on the indexes on each shard, how well each shard's data set is optimised in isolation, and how much of the data set's range each shard holds.
However, processing is parallel in sharding, which means all shards get the query and the mongos just merges the results as they come in. So the performance shouldn't be: go to shard 1, get it, then shard 2; instead it should be: go to all shards, each shard returns its results, and the mongos merges and returns them.
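A back-of-the-envelope model of that difference, with invented numbers: if the shards were queried one after another, the response time would be roughly the sum of the per-shard times, whereas with parallel scatter/gather it is roughly the time of the slowest shard plus the merge. The function scatterGatherLatency and the millisecond values are assumptions made for illustration only, not measurements.

```javascript
// Hypothetical latency model; shardLatenciesMs holds one value per shard.
function scatterGatherLatency(shardLatenciesMs, mergeOverheadMs = 0) {
  const sum = shardLatenciesMs.reduce((total, ms) => total + ms, 0);
  const slowest = Math.max(...shardLatenciesMs);
  return {
    sequentialMs: sum + mergeOverheadMs,     // shard 1, then shard 2, ...
    parallelMs: slowest + mergeOverheadMs,   // all shards queried at once
  };
}

// Three shards answering in 40 ms, 120 ms and 75 ms (made-up values):
console.log(scatterGatherLatency([40, 120, 75]));
// prints { sequentialMs: 235, parallelMs: 120 }
```

The model also shows why one slow or poorly indexed shard dominates the response time of a scatter/gather query.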
Here is a good presentation (with nice pictures) on exactly how queries with sharding work in certain situations: http://www.slideshare.net/mongodb/how-queries-work-with-sharding
If the query is made against a sharded collection, it is run on all shards; if it is made against a non-sharded collection, MongoDB reads all the data from the single shard that holds it.
Here is the link to the sharding FAQ in the MongoDB documentation:
http://docs.mongodb.org/manual/faq/sharding/

MongoDB Sharding and Indexing

I have been struggling to deploy a large database.
I have deployed 3 shard clusters and started indexing my data.
However it's been 16 days and I'm only half way through.
The question is: should I import all the data into a non-sharded cluster, activate sharding once the raw data is in the database, and then attach more clusters and start indexing? Will this auto-balance my data?
Or should I wait another 16 days with the method I am currently using...
*Edit:
Here is more explanation of the setup and data that is being imported...
So we have 160 million documents that are like this
"_id" : ObjectId("5146ae7de4b0d58a864bcfda"),
"subject" : "<concept/resource/propert/122322xyz>",
"predicate" : "<concept/property/os/123ABCDXZYZ>",
"object" : "<http://host/uri_to_object_abcdy>"
Indexes: subject, predicate, object, subject > predicate, object > predicate
Shard keys: subject, predicate, object
Setup:
3 clusters on AWS (each with 3 Replica sets) with each node having 8 GiB RAM
(Config servers are within each cluster and Mongos is in a separate server)
The data gets imported by a Java program into the mongos.
What would be the ideal way to import, index and shard this data (without waiting a month for the process to complete)?
If you are doing a massive bulk insert, it is often faster to perform the insert without an index and then index the collection. This has to do with the way Mongo manages index updates on the fly.
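A minimal sketch of that order of operations for the mongo shell: insert in batches first, build the indexes only once the load has finished. The helper loadThenIndex, the batch size and the index specifications in the usage comment are assumptions for illustration; insertMany() and createIndex() are the standard collection methods, and the sketch assumes the shell, where these calls block until they complete.

```javascript
// Hypothetical helper: bulk load without secondary indexes, then index.
function loadThenIndex(collection, docs, indexSpecs, batchSize = 1000) {
  let inserted = 0;
  for (let start = 0; start < docs.length; start += batchSize) {
    const batch = docs.slice(start, start + batchSize);
    // Unordered inserts let the server continue past individual failures.
    collection.insertMany(batch, { ordered: false });
    inserted += batch.length;
  }
  // Build the indexes only after all data is in place.
  for (const spec of indexSpecs) {
    collection.createIndex(spec);
  }
  return inserted;
}

// Example call in the shell (collection name and specs are placeholders):
// loadThenIndex(db.triples, docs, [{ subject: 1 }, { subject: 1, predicate: 1 }]);
```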
Also, MongoDB is particularly sensitive to memory when it indexes. Check the size of your indexes in your db.stats() and hook up your DBs to the Mongo Monitoring Service.
In my experience, whenever MongoDB takes a lot more time than expected, it is due to one of two things:
Running out of physical memory or getting itself into a poor I/O pattern. MMS can help diagnose both. Check out the page faults graph in particular.
Operating on unindexed collections, which does not apply in your case.