Mongo sharding not removing data of sharded collection from source shard

I have MongoDB 3.2.6 installed on 5 machines which together form a sharded cluster consisting of 2 shards (each a replica set with a primary-secondary-arbiter configuration).
I also have a database with a very large collection (~50M records, 200GB). It was imported through mongos, which placed it on the primary shard along with the other collections.
I generated a hashed index on _id for that collection, which will be my shard key.
After that I sharded the collection with:
> use admin
> db.runCommand( { enablesharding : "my-database" } )
> use my-database
> sh.shardCollection("my-database.my-collection", { "_id": "hashed" } )
The command returned:
{ "collectionsharded" : "my-database.my-collection", "ok" : 1 }
And it actually started to shard. The shard distribution looks like this:
> db.getCollection("my-collection").getShardDistribution()
Totals
data : 88.33GiB docs : 45898841 chunks : 2825
Shard my-replica-1 contains 99.89% data, 99.88% docs in cluster, avg obj size on shard : 2KiB
Shard my-replica-2 contains 0.1% data, 0.11% docs in cluster, avg obj size on shard : 2KiB
This all looks OK, but the problem is that when I count my-collection through mongos I see the number increasing.
When I log in to the primary replica set (my-replica-1) I see that the number of records in my-collection is not decreasing, although the number in my-replica-2 is increasing (which is expected), so I guess MongoDB is not removing chunks from the source shard while migrating to the second shard.
Does anyone know if this is normal, and if not, why it is happening?
EDIT: Actually it has now started to decrease on my-replica-1, although it still grows when counting through mongos (sometimes it goes down a little and then back up). Maybe this is normal behaviour when migrating a large collection, I don't know.

According to the documentation here, you are observing a valid situation.
When a document is moved from shard A to shard B, it is counted twice until shard A receives confirmation that the relocation was successful.
On a sharded cluster, db.collection.count() can result in an inaccurate count if orphaned documents exist or if a chunk migration is in progress.
To avoid these situations, on a sharded cluster, use the $group stage of the db.collection.aggregate() method to $sum the documents. For example, the following operation counts the documents in a collection:
db.collection.aggregate([
    { $group: { _id: null, count: { $sum: 1 } } }
])

Related

Insert data to MongoDB Shard Cluster?

To start, here is my shard cluster, created with Ops Manager:
I have 2 mongos instances and 2 shards (each shard is configured as a replica set). I have not configured any shard key, i.e. no sharded collections exist in my cluster.
When I use mongos to insert a database for testing purposes, the database is stored on only one shard.
I want the data to be split and balanced across both shards when I insert a database, and I want to be able to query through mongos and get accurate data.
Has anyone had the same issue?
Databases and collections are not sharded automatically: a sharded deployment can contain both unsharded and sharded data. Unsharded collections will be created on the primary shard for a given database.
If you want to shard a collection you need to take a few steps in the mongo shell connected to a mongos process for your sharded deployment:
Run sh.enableSharding(<database>) for a database (this is a one-off action per database)
Choose a shard key using sh.shardCollection()
See Shard a Collection in the MongoDB manual for specific steps.
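For example, a minimal sequence in the mongo shell might look like the following (myDB, orders, and the { customerId: 1 } key are placeholder names for illustration, not values from the question):
// run against a mongos; enable sharding for the database (one-off per database)
sh.enableSharding("myDB")
// make sure an index exists that supports the chosen shard key
db.getSiblingDB("myDB").orders.createIndex( { customerId: 1 } )
// shard the collection on that key
sh.shardCollection("myDB.orders", { customerId: 1 })
After this, new and existing data is distributed across the shards by the balancer according to the shard key.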
It is important to choose a good shard key for your data distribution and use case. Poor choices of shard key may result in unequal data distribution or limit your sharding performance. The MongoDB documentation has more information on the considerations and options for choosing a shard key.
If you are not sure whether a collection is sharded, or want to see a summary of the current data distribution, you can use db.collection.getShardDistribution() in the mongo shell.
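For example, with placeholder names again:
// run through mongos; shows per-shard data size, document counts and chunk counts
db.getSiblingDB("myDB").orders.getShardDistribution()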
You need to implement zone ranges, so that data is stored on each shard according to its range.
The code below helps you create the zones.
For zone01:
sh.addShardTag("rs1", "zone01")
sh.addTagRange("myDB.col01", { num: 1 }, { num: 10 }, "zone01")
For zone02:
sh.addShardTag("rs2", "zone02")
sh.addTagRange("myDB.col01", { num: 11 }, { num: 20 }, "zone02")
For more details, see Manage Shard Zones in the MongoDB manual.

Query based on shard key hitting multiple shards

While browsing mongodb sharding tutorials I came across the following assertion :
"If you use shard key in the query, its going to hit a small number of shards, often only ONE"
On the other hand, from my earlier, elementary knowledge of sharding, I was under the impression that the mongos routing service can uniquely identify the target shard if the query is fired on the shard key. My question is: under what circumstances does a shard-key-based query stand a chance of hitting multiple shards?
A query using the shard key will target the subset of shards to retrieve data for your query, but depending on the query and data distribution this could be as few as one or as many as all shards.
Borrowing a helpful image from the MongoDB documentation on shard keys (it shows a collection partitioned on a shard key x into four contiguous chunk ranges, Chunk 1 through Chunk 4, with higher values of x falling in the later chunks):
MongoDB uses the shard key to automatically partition data into logical ranges of shard key values called chunks. Each chunk represents approximately 64MB of data by default, and is associated with a single shard that currently owns that range of shard key values. Chunk counts are balanced across available shards, and there is no expectation of adjacent chunks being on the same shard.
If you query for a shard key value (or range of values) that falls within a single chunk, the mongos can definitely target a single shard.
Assuming chunk ranges as in the image above:
// Targeted query to the shard with Chunk 3
db.collection.find( { x: 50 } )
// Targeted query to the shard with Chunk 4
db.collection.find( {x: { $gte: 200} } )
If your query spans multiple chunk ranges, the mongos can target the subset of shards that contain relevant documents:
// Targeted query to the shard(s) with Chunks 3 and 4
db.collection.find( {x: { $gte: 50} } )
The two chunks in this example will either be on the same shard or two different shards. You can review the explain results for a query to find out more information about which shards were accessed.
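For example, a sketch of checking this with explain (the exact field paths in the output vary a little between server versions):
// run through mongos; the explain output lists the shards that were targeted
db.collection.find( { x: { $gte: 50 } } ).explain("executionStats")
// look for a SINGLE_SHARD or SHARD_MERGE stage and the per-shard entries under
// queryPlanner.winningPlan / executionStats to see which shards ran the query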
It's also possible to construct a query that would require data from all shards (for example, based on a large range of shard key values):
// Query includes data from all chunk ranges
db.collection.find( {x: { $gte: -100} } )
Note: the above information describes range-based sharding. MongoDB also supports hash-based shard keys which will (intentionally) distribute adjacent shard key values to different chunk ranges after hashing. Range queries on hashed shard keys are expected to include multiple shards. See: Hashed vs Ranged Sharding.
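As a small illustration of the difference (placeholder namespace, not from the question):
// hashed shard key: adjacent values of x hash into different chunks
sh.shardCollection("myDB.events", { x: "hashed" })
// an equality match on x can still be routed to a single shard
db.getSiblingDB("myDB").events.find( { x: 50 } )
// a range on x is expected to be broadcast to multiple shards
db.getSiblingDB("myDB").events.find( { x: { $gte: 50 } } )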

Collection count increasing after sharding in MongoDB

I am not able to understand why the collection count is increasing after sharding in MongoDB.
I have a collection of 20M records; when I sharded it, the collection count kept increasing. Please help me out.
Cluster configuration:
3 shards
3 config servers
6 query routers
If I get it right, you mean that db.shardedCollection.count() returns more documents than you expect. This is a known bug (SERVER-3645).
TL;DR
The problem is that, the way sharding works, it can happen that after a chunk migration so-called orphaned documents exist. These are documents which exist as a duplicate on a shard that is not responsible for the key range the document falls into. For almost all practical purposes this is not a problem, since the mongos takes care of "sorting them out" (which is a bit simplified, but sufficient in this context).
However, when calling a db.collection.count() on a sharded collection, this query gets routed to all shards, since it does not contain the shard key.
Disclaimer: from here on, this is my theory, deduced from the observed behavior.
Since the orphaned documents still technically exist on a shard, they seem to get counted, and the result of the count as a whole is reported back to the mongos, which simply sums up all the results. I assume .count() takes a shortcut on the individual shards, possibly simply counting the entries of the _id index for performance reasons.
Workaround
As written in the ticket, using an aggregation mitigates the problem:
db.collection.aggregate([
    { $group: { _id: "uniqueDocs", count: { $sum: 1 } } }
])
However, this aggregation is not ideal and should show better performance when changed as below, if you have a shard key other than _id:
db.books.aggregate([
    { $project: { _id: 0, yourShardKey: 1 } },
    { $group: { _id: "uniqueDocs", count: { $sum: 1 } } }
])
or
db.books.aggregate([
    { $project: { _id: 1 } },
    { $group: { _id: "uniqueDocs", count: { $sum: 1 } } }
])
if you use _id as your shard key.
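Separately from counting, if the orphaned documents themselves are a concern, server versions of this generation provide a cleanupOrphaned admin command that can be run against the primary of each shard (not through mongos); a sketch with a placeholder namespace:
// removes orphaned documents from one contiguous range; rerun with startingFromKey
// set to the returned stoppedAtKey until no stoppedAtKey comes back
db.adminCommand( { cleanupOrphaned: "mydb.shardedCollection" } )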

Why the count of a collection's docs in MongoDB sharding is decreasing

I have a mongo sharding cluster with 3 shards, and all operations on this db are finds or updates (with the upsert=true option). That means the count of the collection should keep increasing. But when the count of the collection (db.mycollection.find().count()) grows to 80,000,000 or larger, I found that sometimes it is increasing and sometimes it is decreasing. Why? I promise that there are no delete operations on this db.
I am using db.myCollection.getShardDistribution() to show the distribution, and shard2 holds only 29%, which is less than average.
Here is the trend of the count:
mongos> db.myCollection.find().count()
84374837
mongos> db.myCollection.find().count()
84375036
mongos> db.myCollection.find().count()
84409281
mongos> db.myCollection.find().count()
84408921
mongos> db.myCollection.find().count()
84407190
mongos> db.myCollection.find().count()
84407173
mongos> db.myCollection.find().count()
84407013
mongos> db.myCollection.find().count()
84406911
I'd bet this is sharding in action. This is how it works:
All documents are broken into virtual chunks
Chunks can be moved between shards
When the balancer moves a chunk, it:
1) Copies all documents from the chunk to the new shard
2) Transfers ownership of the chunk to the new shard
3) Deletes the documents from the old shard
Again, this is just a guess based on the information provided. But since you swear there are no deletes in your app, it must be this.
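If you want to confirm this, you can check the balancer and the migration history recorded by the cluster (a sketch; run through mongos):
// is the balancer enabled, and is it running right now?
sh.getBalancerState()
sh.isBalancerRunning()
// recent chunk migrations logged in the config database
db.getSiblingDB("config").changelog.find( { what: /moveChunk/ } ).sort( { time: -1 } ).limit(5)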

MongoDB/GridFS out of disk space

I'm using MongoDB 2.0.6, and I ran out of disk space and can't seem to do anything to clear some up.
I have 3 shards (not replicated, since I can regenerate the photos).
I have one collection called photos, with sharding enabled. I store everything in GridFS, so photos.fs.files and photos.fs.chunks.
All the files ended up on my primary shard and are not distributed, even though sharding is enabled on my collection.
When I try:
> db.runCommand({ shardcollection : "photos.fs.chunks", key : { files_id : 1 }})
I get an error about needing to create the index.
When I try to create the index I get "Can't take a write lock while out of disk space".
When I try to delete things:
> db.fs.files.remove({ $and: [ { _id: { $gt: "1667" } }, { _id: { $lt: "5000" } } ] });
Can't take a write lock while out of disk space
Any ideas?