InsertBatch in Sharding - mongodb

What is actually happening behind the scenes with a big InsertBatch if
one is writing to a sharded cluster? Does MongoDB actually support
bulk insert, or is InsertBatch actually inserting one document at a time at
the server level? How does this work with sharding then? Does this
mean that a mongos will look at every item in the batch to figure out
the shard key of each item and then route it to the right
server? This would break bulk insert if it exists and does not seem
efficient. What are the mechanics of InsertBatch for a sharding
solution? I am using version 2.0 and am willing to upgrade if that makes any difference.

Bulk inserts are an actual MongoDB feature and are (somewhat) more performant than separate per-document inserts due to fewer round trips.
In a sharded environment, if mongos receives a bulk insert it will figure out which part of the bulk has to be sent to which shard. There are no differences between 2.0 and 2.1, and it is the most efficient way to bulk insert data into a sharded database.
If you're curious about how exactly mongos works, have a look at its source code here:
https://github.com/mongodb/mongo/tree/master/src/mongo/s
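The per-shard splitting that mongos performs can be sketched in a few lines. This is a toy model, not the real routing code: the split points, shard names, and `route_batch` helper are all illustrative, standing in for the chunk table mongos reads from the config servers.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical chunk table: each shard owns one range of shard-key values.
# Real mongos loads this metadata from the config servers.
SPLIT_POINTS = [100, 200, 300]                        # ranges: (-inf,100), [100,200), [200,300), [300,inf)
SHARD_FOR_RANGE = ["shard0", "shard1", "shard2", "shard3"]

def route_batch(docs, shard_key="user_id"):
    """Group a bulk insert into per-shard sub-batches, as mongos does."""
    batches = defaultdict(list)
    for doc in docs:
        # Find which key range (and therefore which shard) the document falls into.
        idx = bisect_right(SPLIT_POINTS, doc[shard_key])
        batches[SHARD_FOR_RANGE[idx]].append(doc)
    return dict(batches)

docs = [{"user_id": k} for k in (5, 150, 150, 250, 999)]
print(route_batch(docs))
```

Each sub-batch is then forwarded to its shard as a single bulk insert, so the round-trip savings of the batch are preserved per shard rather than lost.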

Related

What issues could arise when reading/writing to a single shard in a sharded MongoDB?

In the official MongoDB documentation it is stated that:
"Clients should never connect to a single shard in order to perform read or write operations." Source
I did try writing some data to a single shard node and it worked, but it was not a big dataset (certainly not one that would require sharding).
Could this lead to other issues which I am not yet aware of?
Or are clients discouraged from doing this simply for performance reasons?
In a sharded cluster, the individual shards contain the data but some of the operation state is stored on mongos nodes. When you communicate with individual shards you bypass mongoses and as such any state management that they would perform.
Depending on the operation, this could be innocuous or could result in unexpected behavior/wrong results.
There isn't a comprehensive list of issues that would arise because the scenario of direct shard access isn't tested (because it's not supported).
The problem is that, by default, you don't know which portion of the data is stored on which shard. The balancer runs in the background and may change the data distribution at any time.
If you insert some data manually into one shard, a properly connected client may not find this data.
If you read some data from a single shard, you don't know whether this data is complete or not.
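The "incomplete reads" point can be made concrete with a toy model. The dictionaries below stand in for shards and chunks; the names and the migration step are illustrative, not real MongoDB API calls.

```python
# Toy model: each shard holds some chunks of the keyspace. The balancer may
# migrate chunks between shards at any time, so a client that bypasses mongos
# and reads one shard directly sees only whatever chunks happen to live there.
shards = {
    "shard0": {1: "a", 2: "b"},
    "shard1": {3: "c", 4: "d"},
}

def read_via_mongos(shards):
    """A mongos merges results from every shard, so the view is complete."""
    merged = {}
    for data in shards.values():
        merged.update(data)
    return merged

def read_single_shard(shards, name):
    """Bypassing mongos returns only that one shard's chunks."""
    return dict(shards[name])

# The balancer migrates the chunk holding key 2 from shard0 to shard1:
shards["shard1"][2] = shards["shard0"].pop(2)

print(read_via_mongos(shards))              # still complete: keys 1-4
print(read_single_shard(shards, "shard0"))  # now incomplete: only key 1
```

A client that connected to shard0 before the migration would silently lose sight of key 2 afterwards, which is exactly why direct shard access is unsupported.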

Is it good to add sharding on single server?

Is it good to do sharding on a single machine/server? If the size of the MongoDB data is above 10GB, will it perform well?
The key rule of sharding is: don't shard until you absolutely have to. If you're not having problems with performance now, you don't need to shard. Choosing a shard key can be a difficult process, and a poor choice can keep your data from being balanced correctly between shards. Sharding can add severe overhead to your deployment and take a lot to manage, since you will need additional mongod processes, config servers, and replica sets in order for it to be stable for production.
I'm assuming you mean your collections are 10GB. Depending on the size of your machine, a 10GB collection is not a lot for Mongo to handle. If you're having performance issues with queries, my first step would be to go through your mongod log and see if there are any queries you could add indexes for.
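Picking slow operations out of the mongod log can be scripted. This is a minimal sketch: the threshold, the sample log lines, and the regex are assumptions, and the exact log format varies between MongoDB versions, so treat it as a starting point rather than a parser for every release.

```python
import re

# Hypothetical threshold: flag any operation at or above 100 ms.
SLOW_MS = 100
# Matches "query <namespace> ... <n>ms" at the end of a log line.
LINE_RE = re.compile(r"(query|command) (\S+).*?(\d+)ms$")

def slow_queries(log_lines, threshold_ms=SLOW_MS):
    """Return (namespace, milliseconds) for each slow operation in the log."""
    hits = []
    for line in log_lines:
        m = LINE_RE.search(line)
        if m and int(m.group(3)) >= threshold_ms:
            hits.append((m.group(2), int(m.group(3))))
    return hits

log = [
    "Mon Jul  1 12:00:00 [conn1] query mydb.users query: { age: 31 } ntoreturn:0 250ms",
    "Mon Jul  1 12:00:01 [conn2] query mydb.users query: { _id: 1 } ntoreturn:0 2ms",
]
print(slow_queries(log))  # [('mydb.users', 250)]
```

Each namespace that shows up repeatedly is a candidate for an index on the fields in its query predicate.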

MongoDB Sharded cluster: Inserts only hitting one shard

We are using a cluster with 6 shards.
The collection uses a non-hashed key.
The documents are rather big and our chunk-size is set to 512MB.
Two huge bulk inserts hit our cluster but everything is inserted on a single shard.
This leads to 120% effective lock, while the other shards are chilling at 5% lock.
I think the bulk inserts only append to the last chunk, since the inserts are ordered. Due to the heavy load, no chunks are redistributed until the insert ends.
After the bulk insert redistribution works nicely.
MongoDB version is 2.6.5.
How can I configure the config servers to automatically distribute bulk inserts?
I will edit the post if more information is required.
Thank You all!!!
As answered below:
pre-splitting is the best solution for us. This allows us to evenly distribute the whole set before insertion since we know the key space! Thank you!
It sounds like your shard key is monotonically increasing. The documentation has a large section about bulk inserts in sharded environments.
Essentially,
either pre-split the collection,
or insert through different mongos instances (not for the initial insert),
and/or make sure that your shard key doesn't increase monotonically (for non-hashed keys, that's usually a good idea).
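When the key space is known up front, as in this question, the pre-split points can be computed ahead of time and then applied with `sh.splitAt()` in the mongo shell before the bulk insert. The key range and counts below are assumptions for illustration; the helper name is made up.

```python
# Sketch: compute evenly spaced interior split points for a known key space,
# one fewer than the number of ranges you want (here, one range per shard).
def split_points(key_min, key_max, n_ranges):
    """Evenly spaced split points dividing [key_min, key_max) into n_ranges."""
    step = (key_max - key_min) / n_ranges
    return [round(key_min + step * i) for i in range(1, n_ranges)]

points = split_points(0, 6_000_000, 6)
print(points)  # [1000000, 2000000, 3000000, 4000000, 5000000]

# In the mongo shell you would then run, for each point p:
#   sh.splitAt("mydb.mycoll", { myShardKey: p })
# and optionally sh.moveChunk(...) to place the empty chunks on distinct
# shards before starting the insert.
```

Because the chunks exist and are distributed before any documents arrive, the bulk insert lands on all shards in parallel instead of hammering the shard that owns the last chunk.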

Does MongoDB have features such as triggers and stored procedures like a relational database?

As the title suggests (leaving out the map-reduce framework):
if I want to trigger an event to run a consistency check or security operations before a record is inserted, how can I do that with MongoDB?
MongoDB does not support triggers, but people have created solutions around them, mostly using the oplog. This will only help you if you are running with replica sets, as the oplog is a capped collection that keeps track of data changes for the purposes of replication.
For a Node.js solution see https://www.npmjs.org/package/mongo-watch, or see an earlier SO thread: How to listen for changes to a MongoDB collection?
If you are concerned with consistency, read about write concern in MongoDB: http://docs.mongodb.org/manual/core/write-concern/ You can be as relaxed or as strict as you want by setting insert write concern levels, from fire-and-forget to getting an acknowledgement from all members of the replica set.
So, if you want to run a consistency check before inserting data, you will probably have to move that logic to the client application and set your write concern to a level that ensures consistency.
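Moving the check to the client application can look like the sketch below. `FakeCollection` stands in for a real pymongo collection so the example is self-contained; the validation rule and the `checked_insert` helper are illustrative assumptions.

```python
# Sketch: since MongoDB has no server-side triggers, run the consistency
# check in the application layer before calling the driver's insert.
class FakeCollection:
    """Stand-in for a pymongo collection; stores documents in a list."""
    def __init__(self):
        self.docs = []
    def insert(self, doc):
        self.docs.append(doc)

def checked_insert(collection, doc):
    """Run consistency/security checks, then insert via the driver."""
    if "email" not in doc or "@" not in doc["email"]:
        raise ValueError("document failed consistency check")
    # With a real driver you would also choose a write concern here,
    # e.g. acknowledgement from a majority of replica set members.
    collection.insert(doc)

coll = FakeCollection()
checked_insert(coll, {"email": "a@example.com"})
print(len(coll.docs))  # 1
```

Invalid documents never reach the database, and the write concern setting controls how strongly the successful inserts are acknowledged.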
MongoDB does not have triggers or stored procedures. While there are solutions that some have used to try to emulate the behavior, as it is not a built-in feature you'll need to decide whether those solutions are effective for you. Searching for "triggers and mongodb" should find dozens; all depend on the oplog and replica sets.
But, given the nature of MongoDB and a typical three-tier architecture, I would expect that at the point of data insertion, which could be on a web server for example, you would run the necessary consistency and security checks. You wouldn't allow a client such as a mobile application to write data directly into the database collection without some checks.
Many drivers for MongoDB and extended libraries already have validation and consistency checks built in, so there is less to do. Using unique indexes for some fields can also provide a level of consistency that you cannot get from the driver alone. Look at calls like findAndModify, which makes atomic updates.

creating a different database for each collection in MongoDB 2.2

MongoDB 2.2 has a write lock per database, as opposed to the global server-wide write lock of previous versions. So would it be OK if I stored each collection in a separate database to effectively get a write lock per collection? (This would make it behave like MyISAM's table-level locking.) Is this approach faulty?
There's a key limitation to the locking, and that is the local database. That database includes the oplog collection, which is used for replication.
If you're running in production, you should be running with replica sets. If you're running with replica sets, you need to be aware of the write lock's effect on that database.
Breaking out your 10 collections into 10 DBs is useless if they all block waiting for the oplog.
Before taking a large step to re-write, please ensure that the oplog will not cause issues.
Also, be aware that MongoDB implements DB-level security. If you're using any security features, you are now creating more DBs to secure.
Yes, that will work; 10gen actually offers this as an option in their talks on locking.
I probably wouldn't isolate every collection, though. Most databases seem to have 2-5 high-activity collections. For the sake of simplicity it's probably better to keep the low-activity collections grouped in one DB and put the high-activity collections in their own databases.
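The suggested layout, a shared database for quiet collections and a dedicated database (and therefore a dedicated write lock) for each busy one, can be expressed as a small routing helper. The collection names and the `database_for` function are illustrative assumptions.

```python
# Sketch: map each collection to a database so that high-activity collections
# get their own per-database write lock, while quiet ones share one DB.
HIGH_ACTIVITY = {"events", "sessions"}   # hypothetical busy collections

def database_for(collection_name, shared_db="app"):
    """Return the database name a collection should live in."""
    if collection_name in HIGH_ACTIVITY:
        return f"app_{collection_name}"  # dedicated DB, dedicated write lock
    return shared_db                     # low-activity collections share a lock

print(database_for("events"))    # app_events
print(database_for("settings"))  # app
```

The trade-off named in the answer still applies: every extra database is one more thing to secure, and all of them still contend on the local database's oplog when replication is in play.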