Sharding GridFS on MongoDB

I'm researching GridFS and the possibility of sharding it across different machines.
Reading the documentation here, the suggested shard key is chunks.files_id. This key is linked to the _id of the files collection, and that _id is incremental: every new file I save in the grid gets a new, increasing _id.
In the O'Reilly book "Scaling MongoDB", the use of an incremental shard key is discouraged to avoid hotspots (the last shard would receive all the writes and reads).
What is your suggestion for sharding the GridFS collections?
Has anybody experienced the hotspot problem?
Thank you.

You should shard on files_id to keep file chunks together, but you are correct that this will create a hotspot. If you can, use something other than ObjectId for the _ids in the fs.files collection (MD5s would probably be better than ObjectIds).
We'll be adding hashing for sharding, which will solve this, but not until at least 2.0.
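A minimal sketch of that advice in the mongo shell (the database name "mydb" is an assumption, and the commands must be run through a mongos):

db.adminCommand({ enablesharding: "mydb" })
db.adminCommand({ shardcollection: "mydb.fs.chunks", key: { files_id: 1 } })
// most drivers let you pass a custom _id when storing a file, so
// files_id can be an MD5 digest of the contents instead of a
// monotonically increasing ObjectId, avoiding the hotspot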

You can shard GridFS data because GridFS is just two collections: chunks and files. And sharding GridFS is a very useful and great thing. As for the GridFS shard key, it is always a bad idea to choose a random or an incremental shard key, because data will not be evenly distributed across the shards. With an incremental shard key, all writes go to the last shard, which keeps growing, and once the difference between shards reaches 10 or more chunks, the balancer moves data to the other shards. Moving data to another shard is always an expensive task that should be avoided as much as possible.
So when you choose a shard key, you should care about even distribution of data.
Also, if you are lucky, maybe the author of 'Scaling MongoDB', Kristina (a great specialist in shard keys), will answer your question.
Documentation says that in common cases you should choose the default index files_id:1, n:1 as the shard key:
There are different ways that GridFS can be sharded, depending on the need. One common way to shard, based on pre-existing indexes, is: the "files" collection is not sharded. All file records will live in 1 shard. It is highly recommended to make that shard very resilient (at least a 3 node replica set). The "chunks" collection gets sharded using the existing index "files_id: 1, n: 1". Some files at the end of ranges may have their chunks split across shards, but most files will be fully contained within the same shard.
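In shell terms, that recommendation looks roughly like this (the database name "mydb" is assumed):

db.adminCommand({ enablesharding: "mydb" })
db.adminCommand({ shardcollection: "mydb.fs.chunks", key: { files_id: 1, n: 1 } })
// fs.files stays unsharded on one shard, which should therefore be a
// resilient replica set (at least 3 nodes); fs.chunks is range-sharded
// on files_id, n so a file's chunks mostly land on the same shard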

Currently, MongoDB as of version 1.8.1 supports sharding only on the "files_id" field, because MD5 is used to verify the upload, and that verification doesn't work across shards yet. So you cannot split a single file across shards.
(Answer on Google Groups)
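The upload verification referred to here is the filemd5 command; a hedged sketch of it (the ObjectId is a placeholder for a real file _id):

db.runCommand({ filemd5: ObjectId("4f1f10e37671b50e4ecd2776"), root: "fs" })
// computes the MD5 from every chunk of the file, which is why all
// chunks of a file had to live on a single shard at that time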

Related

How to shard a collection with data inside a sharded MongoDB cluster

I want to shard a collection that already contains data. When I try sh.shardCollection("myDb.myCollection", {id:"hashed"}), the collection gets sharded, but it is not spread across all the shards; it stays only on the primary shard. For example:
Empty collection after sharding (sh.status() output): when data is then added, it spreads across all the shards.
Collection with data after sharding (sh.status() output): when data is added, it goes only to the primary shard.
My question is: how do I correctly shard a collection with data in MongoDB? Is there any alternative way?
I agree with @Wernfried Domscheit in the comments about the fact that the cluster will take care of distributing the data once the collection is sharded. As mentioned, that is done based on writes to the collection and happens over time. Your test may have too little data or too few writes to trigger the migrations.
To your specific question about the initial distribution of chunks, this is covered in the documentation. Applying a hashed shard key on an empty collection in your first example is covered here:
The sharding operation creates empty chunks to cover the entire range of the shard key values and performs an initial chunk distribution. By default, the operation creates 2 chunks per shard and migrates across the cluster. You can use numInitialChunks option to specify a different number of initial chunks. This initial creation and distribution of chunks allows for faster setup of sharding.
And behavior on the collection with data is covered just above it here:
The sharding operation creates the initial chunk(s) to cover the entire range of the shard key values. The number of chunks created depends on the configured chunk size.
Both of these described behaviors match what you have demonstrated in your question.
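As a small illustration of the empty-collection case (the namespace and key are taken from the question; the numInitialChunks value is just an example):

sh.enableSharding("myDb")
db.adminCommand({
  shardCollection: "myDb.myCollection",
  key: { id: "hashed" },
  numInitialChunks: 8   // pre-create and distribute 8 empty chunks
})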

How to efficiently shard a MongoDB collection that already has millions of documents?

I have a collection named order_error, which has over 60 million documents. Today I was trying to shard it. I have 3 replica sets. Initially there were no issues; the balancer was distributing the chunks among the shards. But eventually it started to consume all the RAM and then all the swap space too, and now everything is unresponsive. We can't follow this procedure in production. We need a better solution. How can I do the sharding in a better way?
If someone could help me with that, please let me know.
When you insert documents into an empty collection, then initially all data will be written to the primary shard, so that alone will not solve your issue.
But you can use sh.splitAt on the empty collection to pre-split it.
Note that even if the collection is empty, it will take some time until the chunks are distributed over all shards! When you split a chunk, it still remains on its current shard. Check with db.collection.getShardDistribution() whether the chunks are evenly distributed.
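A minimal sketch of that approach, assuming a hypothetical shard key { orderId: 1 } and hypothetical shard names (neither comes from the answer):

// shard the still-empty collection, then pre-split it
sh.shardCollection("mydb.order_error", { orderId: 1 })
sh.splitAt("mydb.order_error", { orderId: 20000000 })
sh.splitAt("mydb.order_error", { orderId: 40000000 })
// optionally place the chunks yourself instead of waiting for the balancer
sh.moveChunk("mydb.order_error", { orderId: 20000000 }, "shardB")
sh.moveChunk("mydb.order_error", { orderId: 40000000 }, "shardC")
// verify the layout before loading the 60 million documents
db.order_error.getShardDistribution()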

MongoDB Sharded cluster: Inserts only hitting one shard

We are using a cluster with 6 shards.
The collection uses a non-hashed key.
The documents are rather big and our chunk-size is set to 512MB.
Two huge bulk inserts hit our cluster but everything is inserted on a single shard.
This leads to 120% effective lock, while the other shards are chilling at 5% lock.
I think that the bulk inserts only append to the last chunk, since the inserts are ordered. Due to heavy load, there is no redistribution of chunks until the insert ends.
After the bulk insert redistribution works nicely.
MongoDB version is 2.6.5.
How can I configure the config servers to automatically distribute bulk inserts?
I will edit the post if more information is required.
Thank You all!!!
As answered below:
pre-splitting is the best solution for us. This allows us to evenly distribute the whole set before insertion since we know the key space! Thank you!
Sounds like your shard key is monotonic? The documentation has a large section about bulk insert in sharded environments.
Essentially,
either pre-split the collection
or insert to different mongos (not for the initial insert)
and/or make sure that your shard key doesn't increase monotonically (for non-hashed collections, that's usually a good idea).
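A sketch of the last point, using a hashed key so that ordered bulk inserts still fan out (the namespace is hypothetical; hashed shard keys need MongoDB 2.4+):

sh.shardCollection("mydb.bulkTarget", { _id: "hashed" })
// monotonically increasing ObjectIds now hash to different chunks,
// so an ordered bulk insert no longer appends to a single shard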

Sharding by ObjectID, is it the right way?

I, just like many others, am thinking about the correct approach to sharding my collections in Mongo. The main question is: how does auto-sharding work?
The official doc says - "MongoDB scales horizontally via an auto-sharding (partitioning) architecture" and "To partition a collection, we specify a shard key pattern." with note "It is important to choose the right shard key for a collection" :).
http://www.mongodb.org/display/DOCS/Sharding+Introduction#ShardingIntroduction-ShardKeys
http://www.mongodb.org/display/DOCS/Choosing+a+Shard+Key
Now the question is: "is this the right key" (sharding by ObjectID)?
db.runCommand({ shardcollection : "test.collection", key : { _id : 1 } })
What happens internally in Mongo for this command? How will Mongo split the data into chunks in this case? Assuming I initially have 10 million records on 2 shard servers, what happens on Mongo's side when I add 2 more shard servers once the collection reaches 20 million records? I could not find that level of detail anywhere in Mongo-related sources.
Taking into account the random nature of the autogenerated _id and its structure (http://www.mongodb.org/display/DOCS/Object+IDs), I would shard by the least significant byte (right-to-left order), with chunks split by the value of 2-3 bytes. This would provide an easy way to shard across 2^N shard servers (2, 4, 8, ..., 256) with a more-or-less even load on each shard and minimal required configuration. As far as I understand, Mongo supports only sharding/chunking by explicitly defined ranges, so my idea will not work. Is this true?
It's generally not a good idea to use the default ObjectId as the shard key, since it has an embedded timestamp and monotonically increases over time. This may work fine if you do a lot of updates that touch old and new documents in an evenly distributed fashion. However, it is really bad news if your application is heavy on inserts, since the majority of your writes will go to a single shard: the shard that owns the [nearCurrentTimestamp -> infinity] chunk.
Each mongos monitors write traffic to the shards and uses a very simple heuristic to determine whether a chunk has become too big and needs to be split (the threshold size is configurable via chunkSize).
When you add a new shard to the cluster, the balancer (http://www.mongodb.org/display/DOCS/Sharding+Administration#ShardingAdministration-Balancing) will see a chunk imbalance and will start migrating chunks to the new shards.
Mongo supports range-based sharding; however, that does not mean the ranges are fixed, since chunks can be split into smaller ranges and moved around the cluster over time.
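For reference, the split threshold mentioned above is adjusted through the config database (value in MB; this is the classic shell mechanism):

use config
db.settings.save({ _id: "chunksize", value: 64 })
// new chunk splits will now be attempted at roughly 64MB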
An exciting new feature in version 2.4 is that hashed indexes are supported and can be used as shard keys. So the answer to your main question, "Sharding by ObjectID, is it the right way?", may now be yes!
More references are in the official docs:
Hashed Shard Keys
http://docs.mongodb.org/manual/core/sharded-cluster-internals/#hashed-shard-keys
Hashed Index
http://docs.mongodb.org/manual/core/indexes/#hashed-index
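A sketch of the 2.4+ approach (the collection name is hypothetical):

db.events.ensureIndex({ _id: "hashed" })
sh.shardCollection("mydb.events", { _id: "hashed" })
// inserts with auto-generated ObjectIds now distribute evenly across
// shards, at the cost of efficient range queries on _id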

Using GridFS - Should it be on a separate DB?

I am making a site that has a lot of audio storage, terabytes, and I want to use GridFS for sharding and to be able to easily expand the database across multiple machines.
My question is: would it be better to put the files in a separate Mongo database? There will be a good number of documents in MongoDB; I just was not sure what happens when you start sharding the GridFS portion.
Thanks!
Even if you keep the GridFS storage in the same database as your other collections, you can still choose which collections to shard (or not) when you need to move to sharding. That said, if you have it in a separate database, you will be able to move it to a separate cluster more easily if you so choose: you could, for instance, have a 3 shard cluster for your "main" collections and a 5 shard cluster for GridFS (or any other configuration you choose).
As far as sharding GridFS collections goes, please see the MongoDB docs on choosing a shard key for GridFS. Commonly, people shard the chunks collection (which is where the file data itself is stored) on files_id, so that all chunks for the same file reside on the same shard. Again, please see the documentation page for more detail.
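A minimal sketch of the separate-database layout (the database names "app" and "media" are hypothetical):

// application collections live in "app"; GridFS lives in "media",
// and only the media database needs sharding enabled for GridFS
sh.enableSharding("media")
sh.shardCollection("media.fs.chunks", { files_id: 1, n: 1 })
// later, "media" can be moved to its own cluster without touching "app"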