Why is the count of a collection's documents in a MongoDB sharded cluster decreasing?

I have a MongoDB sharded cluster with 3 shards, and every operation against this database is either a find or an update (with upsert=true). That means the count of the collection should keep increasing. But once the count (db.mycollection.find().count()) grows to 80000000 or more, I found that sometimes it's increasing and sometimes it's decreasing. Why? I promise there is no delete operation against this database.
I am using db.myCollection.getShardDistribution() to show the distribution, and shard2 holds only 29% of the data, which is less than the average.
Here is the trend of the count:
mongos> db.myCollection.find().count()
84374837
mongos> db.myCollection.find().count()
84375036
mongos> db.myCollection.find().count()
84409281
mongos> db.myCollection.find().count()
84408921
mongos> db.myCollection.find().count()
84407190
mongos> db.myCollection.find().count()
84407173
mongos> db.myCollection.find().count()
84407013
mongos> db.myCollection.find().count()
84406911

I'd bet this is sharding in action. This is how it works:
All documents are broken into virtual chunks
Chunks can be moved between shards
When the balancer moves a chunk, it:
1) Copies all documents in the chunk to the new shard
2) Transfers ownership of the chunk to the new shard
3) Deletes the documents from the old shard
While a migration is in flight, both shards hold a copy of those documents, so the count can temporarily rise and then drop back once the old copies are deleted.
Again, this is just a guess based on the information provided. But since you swear there are no deletes in your app, it must be this.
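If you want to check this, you can ask the cluster whether the balancer is active and look at recent migrations. A minimal sketch from the mongos shell; sh.getBalancerState() and sh.isBalancerRunning() are standard shell helpers, and the config.changelog query assumes the changelog collection the cluster keeps for migration events:
// Is the balancer enabled, and is a balancing round running right now?
sh.getBalancerState()
sh.isBalancerRunning()
// Recent chunk-migration events recorded by the cluster
use config
db.changelog.find( { what: /moveChunk/ } ).sort( { time: -1 } ).limit( 5 )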

Related

MongoDB sh.status() and db.getShardDistribution() results not consistent about shards

When I run sh.status() on my MongoDB server, it shows that the collections are sharded, that there are 3 shards and identifies the primary shard in each database.
When I run db.getCollection('ReportRow').getShardDistribution(), it returns
Collection reporting.ReportRow is not sharded, even though sh.status shows that it is.
Any ideas on why MongoDB would have this discrepancy?
The shard instances exist but the data is not being sharded.
Check to ensure the shard key exists on the collection as an index:
- db.collection.getIndexes()
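For example, a minimal sketch using the namespace from that question; the shard key { customerId: 1 } is only a placeholder for whatever key the collection was meant to be sharded on:
use reporting
db.getCollection('ReportRow').getIndexes()                       // confirm an index on the shard key exists
db.getCollection('ReportRow').createIndex( { customerId: 1 } )   // create it if missing (placeholder key)
sh.shardCollection( "reporting.ReportRow", { customerId: 1 } )   // then shard the collection from mongos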

Query based on shard key hitting multiple shards

While browsing MongoDB sharding tutorials I came across the following assertion:
"If you use the shard key in the query, it's going to hit a small number of shards, often only ONE"
On the other hand, from my earlier elementary knowledge of sharding, I was under the impression that the mongos routing service can uniquely identify the target shard if the query is fired on the shard key. My question is: under what circumstances does a shard-key-based query stand a chance of hitting multiple shards?
A query using the shard key will target the subset of shards to retrieve data for your query, but depending on the query and data distribution this could be as few as one or as many as all shards.
Borrowing a helpful image from the MongoDB documentation on shard keys:
MongoDB uses the shard key to automatically partition data into logical ranges of shard key values called chunks. Each chunk represents approximately 64MB of data by default, and is associated with a single shard that currently owns that range of shard key values. Chunk counts are balanced across available shards, and there is no expectation of adjacent chunks being on the same shard.
If you query for a shard key value (or range of values) that falls within a single chunk, the mongos can definitely target a single shard.
Assuming chunk ranges as in the image above:
// Targeted query to the shard with Chunk 3
db.collection.find( { x: 50 } )
// Targeted query to the shard with Chunk 4
db.collection.find( { x: { $gte: 200 } } )
If your query spans multiple chunk ranges, the mongos can target the subset of shards that contain relevant documents:
// Targeted query to the shard(s) with Chunks 3 and 4
db.collection.find( { x: { $gte: 50 } } )
The two chunks in this example will either be on the same shard or two different shards. You can review the explain results for a query to find out more information about which shards were accessed.
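As a rough sketch of what that looks like (field names as in recent versions of the mongo shell), the explain output from mongos lists the shards that contributed to the plan:
// Which shards did this query touch?
db.collection.find( { x: { $gte: 50 } } ).explain("executionStats")
// Inspect queryPlanner.winningPlan (SINGLE_SHARD vs SHARD_MERGE) and the
// per-shard entries under winningPlan.shards / executionStats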
It's also possible to construct a query that would require data from all shards (for example, based on a large range of shard key values):
// Query includes data from all chunk ranges
db.collection.find( { x: { $gte: -100 } } )
Note: the above information describes range-based sharding. MongoDB also supports hash-based shard keys which will (intentionally) distribute adjacent shard key values to different chunk ranges after hashing. Range queries on hashed shard keys are expected to include multiple shards. See: Hashed vs Ranged Sharding.
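A quick sketch of the contrast, using a hypothetical collection test.events sharded on a hashed userId:
// Shard on a hashed key (hypothetical namespace and field)
sh.shardCollection( "test.events", { userId: "hashed" } )
// An equality match on the hashed key can still be routed to a single shard:
db.events.find( { userId: 12345 } )
// A range on the hashed key is scattered to multiple shards, because adjacent
// values hash into different chunk ranges:
db.events.find( { userId: { $gte: 10000, $lte: 20000 } } )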

Mongo sharding not removing data of sharded collection in source shard

I have MongoDB 3.2.6 installed on 5 machines, which together form a sharded cluster consisting of 2 shards (each a replica set with a primary-secondary-arbiter configuration).
I also have a database with a very large collection (~50M records, 200GB) that was imported through mongos, which put it on the primary shard along with the other collections.
I created a hashed index on _id on that collection, which will be my shard key.
After that I sharded the collection with:
> use admin
> db.runCommand( { enablesharding : "my-database" } )
> use my-database
> sh.shardCollection("my-database.my-collection", { "_id": "hashed" } )
The command returned:
{ "collectionsharded" : "my-database.my-collection", "ok" : 1 }
And it actually started to shard. The shard distribution looks like this:
> db.getCollection("my-collection").getShardDistribution()
Totals
data : 88.33GiB docs : 45898841 chunks : 2825
Shard my-replica-1 contains 99.89% data, 99.88% docs in cluster, avg obj size on shard : 2KiB
Shard my-replica-2 contains 0.1% data, 0.11% docs in cluster, avg obj size on shard : 2KiB
This all looks OK, but the problem is that when I count my-collection through mongos, I see the number increasing.
When I log in to the primary replica set (my-replica-1) I see that the number of records in my-collection is not decreasing, although the number in my-replica-2 is increasing (which is expected), so I guess MongoDB is not removing chunks from the source shard while migrating them to the second shard.
Does anyone know if this is normal and, if not, why it is happening?
EDIT: Actually it has now started to decrease on my-replica-1, although the count through mongos still grows (sometimes it goes a little down and then up). Maybe this is normal behaviour when migrating a large collection, I don't know.
Ivan, according to the documentation quoted below, you are observing a valid situation.
When a document is moved from shard a to shard b, it is counted twice until a receives confirmation that the relocation was successful.
On a sharded cluster, db.collection.count() can result in an inaccurate count if orphaned documents exist or if a chunk migration is in progress.
To avoid these situations, on a sharded cluster, use the $group stage of the db.collection.aggregate() method to $sum the documents. For example, the following operation counts the documents in a collection:
db.collection.aggregate( [
    { $group: { _id: null, count: { $sum: 1 } } }
] )
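If orphaned documents have been left behind by past migrations, MongoDB versions of that era (before 4.4, which cleans up orphans automatically) also provide the cleanupOrphaned command, run against the primary of each shard's replica set rather than through mongos. A minimal sketch using the namespace from the question:
// Run on the PRIMARY of each shard's replica set, not on mongos.
// Loops over the collection's ranges until no orphaned ranges remain.
var nextKey = { };
var result;
while ( nextKey != null ) {
    result = db.adminCommand( { cleanupOrphaned: "my-database.my-collection",
                                startingFromKey: nextKey } );
    if ( result.ok != 1 ) {
        print("cleanupOrphaned failed or timed out");
        break;
    }
    nextKey = result.stoppedAtKey;
}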

MongoDB - Loading data into sharded DB with balancer on

MongoDB - Has anyone encountered this problem?
Loading 0.5M docs (using a Node.js script reading a CSV file) into a sharded database (3 shards) in MongoDB (v3.0.3) with the balancer on. Multiple databases share the shards. I noticed the following behaviour:
Missing data (a range of 1 to 5 docs), which has happened intermittently. The shard key is hashed and all docs have this key/value pair.
While the data is loading, I expected the count to continuously increase, but noticed it increases, then decreases, then increases again:
mongos> db.logs.count()
471566
mongos> db.logs.count()
468772
mongos> db.logs.count()
465814
mongos> db.logs.count()
554979
Turning the balancer off did not produce the same problem. Any explanation for this? Thanks.
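As in the questions above, a count that dips and recovers during a load is consistent with chunk migrations being in flight, which also fits the observation that the problem disappears with the balancer off. One common workaround, sketched with standard shell helpers (the namespace mydb.logs is a placeholder), is to pause balancing around the bulk load:
// Pause the balancer for the whole cluster during the import...
sh.stopBalancer()
// ... run the CSV load through mongos, then re-enable it:
sh.startBalancer()
// ...or disable balancing for just this collection:
sh.disableBalancing("mydb.logs")
// ... run the CSV load ...
sh.enableBalancing("mydb.logs")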

MongoDB/GridFS out of disk space

I'm using MongoDB 2.0.6, and I ran out of disk space and can't seem to do anything to clear some up.
I have 3 shards (not replicated, since I can regenerate the photos).
I have one collection called photos, with sharding enabled. I store everything in gridfs so db.photos.files and db.photos.chunks.
All the files ended up on my primary shard and are not distributed, even though sharding is enabled on my collection.
When I try:
> db.runCommand({ shardcollection : "photos.fs.chunks", key : { files_id : 1 }})
I get an error about needing to create the index.
When I try to create the index I get "Can't take a write lock while out of disk space".
When I try to delete things:
> db.fs.files.remove( { $and: [ { _id: { $gt: "1667" } }, { _id: { $lt: "5000" } } ] } );
Can't take a write lock while out of disk space
Any ideas?
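For reference, the GridFS sharding documentation indexes and shards fs.chunks on { files_id: 1, n: 1 }; a sketch adapted to the namespace from this question (building the index still needs free disk space, which is exactly what the error above is complaining about):
> use photos
> db.fs.chunks.ensureIndex( { files_id : 1, n : 1 } )    // ensureIndex, since this is MongoDB 2.0
> db.runCommand( { shardcollection : "photos.fs.chunks", key : { files_id : 1, n : 1 } } )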