Collection count increasing after sharding MongoDB

I am not able to understand why the collection count keeps increasing after sharding in MongoDB.
I have a collection of 20M records; after I sharded it, the collection count keeps increasing. Please help me out.
Cluster configuration:
3 shards
3 config servers
6 query routers

If I get it right, you mean that db.shardedCollection.count() returns more documents than you expect. This is a known bug (SERVER-3645).
TL;DR
The problem is that, because of the way sharding works, so-called orphaned documents can exist after a chunk migration. These are documents which exist as a duplicate on a shard that is not responsible for the key range the document falls into. For almost all practical purposes this is not a problem, since the mongos takes care of "sorting them out" (which is a bit simplified, but sufficient in this context).
However, when calling a db.collection.count() on a sharded collection, this query gets routed to all shards, since it does not contain the shard key.
Disclaimer: from here on, this is my theory, deduced from the observed behavior.
Since the orphaned documents still technically exist on a shard, they seem to get counted, and the result of the count as a whole is reported back to the mongos, which simply sums up all the results. I assume .count() takes a shortcut on the individual shard, possibly simply counting the entries of the _id index for performance reasons.
Workaround
As written in the ticket, using an aggregation mitigates the problem:
db.collection.aggregate([{ $group: { _id: "uniqueDocs", count: { $sum: 1 } } }])
However, this aggregation is not ideal; it should show better performance when changed as below, if you have a shard key other than _id:
db.books.aggregate([
  { $project: { _id: 0, yourShardKey: 1 } },
  { $group: { _id: "uniqueDocs", count: { $sum: 1 } } }
])
or
db.books.aggregate([
  { $project: { _id: 1 } },
  { $group: { _id: "uniqueDocs", count: { $sum: 1 } } }
])
if you use _id as your shard key.
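As a side note, if you want to remove the orphaned documents themselves rather than only work around them when counting, MongoDB versions before 4.4 offer the cleanupOrphaned admin command, which must be run against the primary of each shard (not through mongos), one key range per call. A minimal sketch, with the namespace myDatabase.shardedCollection as a placeholder:
var nextKey = {};
var result;
while (nextKey != null) {
  // each call cleans a single contiguous range of orphaned documents
  // and reports where it stopped, so we loop until the whole key space is done
  result = db.adminCommand({
    cleanupOrphaned: "myDatabase.shardedCollection",
    startingFromKey: nextKey
  });
  if (result.ok != 1) {
    print("Unable to complete at this time: failure or timeout.");
    break;
  }
  nextKey = result.stoppedAtKey;
}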

Related

Upsert performance with sharding on MongoDB

I am analyzing some performance issues that we have with our MongoDB cluster, and it led me to a question I'm not able to find an answer for at the moment.
Let's consider the collection MyCollection, which is sharded on the index {myField: 1} and contains the two indexes {_id: 1} and {myField: 1}.
If I execute the following request:
db.MyCollection.update({_id: X}, {$set: {otherField: Y}, $setOnInsert: {myField: Z}}, {upsert:true});
Will it lead to performance issues, since the query contains only the _id, which is not part of the shard key?
Would it be significantly better with the following query:
db.MyCollection.update({_id: X, myField: Z}, {$set: {otherField: Y}}, {upsert:true});
Or would it be the same?
My reasoning is that for the first query, as it doesn't have the sharding key in the query, it will ask all shards to find _id:X whereas with the second, it'll go directly to the appropriate shard.
However, I still have some doubts about the second one. Even though the shard key is immutable, won't it check all other shards too, to ensure that the _id I provided is not present with a different shard key?
Note: we're on version 4.0
If the query does not include the shard key, the mongos must send the query to all shards as a scatter/gather operation. Otherwise, mongos knows which shard(s) to target.
UPDATE:
To determine whether the query is a targeted operation or a broadcast operation, use the .explain() method and look at the number of shards involved in the operation: "queryPlanner" > "winningPlan" > "shards".
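For example, a quick sketch using the placeholders X, Y, Z from the question:
db.MyCollection.explain().update(
  { _id: X, myField: Z },
  { $set: { otherField: Y } },
  { upsert: true }
)
// queryPlanner.winningPlan.shards in the output lists the shards that would run
// the update: a single entry indicates a targeted operation, one entry per shard
// indicates a broadcast (scatter/gather).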

How to improve terrible MongoDB query performance when aggregating with arrays

I have a data schema consisting of many updates (hundreds of thousands+ per entity) that are assigned to entities. I'm representing this with a single top-level document for each of the entities and an array of updates under each of them. The schema for those top-level documents looks like this:
{
  "entity_id": "uuid",
  "updates": [
    { "timestamp": Date(...), "value": 10 },
    { "timestamp": Date(...), "value": 11 }
  ]
}
I'm trying to create a query that returns the number of entities that have received an update within the past n hours. All updates in the updates array are guaranteed to be sorted by virtue of the manner in which they're updated by my application. I've created the following aggregation to do this:
db.getCollection('updates').aggregate([
  { "$project": { last_update: { "$arrayElemAt": ["$updates", -1] } } },
  { "$replaceRoot": { newRoot: "$last_update" } },
  { "$match": { timestamp: { "$gte": new Date(...) } } },
  { "$count": "count" }
])
For some reason that I don't understand, the query I just pasted takes an absurd amount of time to complete. It exhausts the 15-second timeout on the client I use, as a matter of fact.
From a time-complexity point of view, this query looks incredibly cheap (which is part of why I designed the schema the way I did). It looks to be linear with respect to the total number of top-level documents in the collection, which are then filtered down, and there are fewer than 10,000 of them.
The confusing part is that it doesn't seem to be the $project step which is expensive. If I run that one alone, the query completes in under 2 seconds. However, just adding the $match step makes it time out and shows large amounts of CPU and IO usage on the server the database is running on. My best guess is that it's doing some operations on the full update array for some reason, which makes no sense since the first step explicitly limits it to only the last element.
Is there any way I can improve the performance of this aggregation? Does having all of the updates in a single array like this somehow cause Mongo to not be able to create optimal queries even if the array access patterns are efficient themselves?
Would it be better to do what I was doing previously and store each update as a top-level document tagged with the id of its parent entity? This is what I was doing previously, but performance was quite bad and I figured I'd try this schema instead in an effort to improve it. So far, the experience has been the opposite of what I was expecting/hoping for.
Use indexing; it will enhance the performance of your query.
https://docs.mongodb.com/manual/indexes/
Use MongoDB Compass to check which indexes are used most, then add them one by one to improve performance.
After that, fetch only the fields you need at the end, using a projection in the aggregation.
I hope this solves your issue, but I would suggest going for indexing first. It's a huge plus when fetching large amounts of data.
You need to support your query with an index and simplify it as much as possible.
You're querying against the timestamp field of the first element of the updates field, so add an index for that:
db.updates.createIndex({'updates.0.timestamp': 1})
You're just looking for a count, so get that directly:
db.updates.count({'updates.0.timestamp': {$gte: new Date(...)}})
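If you want to verify that the count actually uses that index, you can inspect the plan; a small sketch, with an arbitrary placeholder date:
db.updates.explain("executionStats").count(
  { "updates.0.timestamp": { $gte: ISODate("2020-01-01T00:00:00Z") } }
)
// the winning plan should be index-based (e.g. COUNT_SCAN or IXSCAN on
// { "updates.0.timestamp": 1 }) rather than a full COLLSCAN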

MongoDB $group operation - memory usage optimisation

I need to perform a $group operation over my entire collection. This group stage is reaching the limit of 100MB RAM usage.
The $group stage has a limit of 100 megabytes of RAM. By default, if the stage exceeds this limit, $group will produce an error. However, to allow for the handling of large datasets, set the allowDiskUse option to true to enable $group operations to write to temporary files.
I'm not limited by RAM, but I couldn't find how to increase this memory usage limit. Does anyone know how to configure this restriction?
Setting allowDiskUse to true will solve the problem, but I assume the whole operation will be much slower, and I'd like to find a better solution.
{
  $group: {
    _id: {
      producer: "$producer",
      dataset: "$dataset",
      featureOfInterest: "$_id.featureOfInterest",
      observedProperty: "$_id.observedProperty"
    },
    documentId: { $push: "$documentId" }
  }
}
This $group operation is performed over entire complex objects (producer and dataset). I understand that this operation is expensive, since "It requires to scan the entire result set before yielding, and MongoDB will have to at least store a pointer or an index of each element in the groups." I'd rather $group on uniqueId fields for both of these objects.
How could I $group objects using a unique ID and $project the whole object afterwards?
I'd like to obtain the same result as the group operation above using the group operation below at the beginning of my aggregation pipeline:
{
  $group: {
    _id: {
      producer: "$producer.producerId",
      dataset: "$dataset.datasetId",
      featureOfInterest: "$_id.featureOfInterest",
      observedProperty: "$_id.observedProperty"
    },
    documentId: { $push: "$documentId" }
  }
}
allowDiskUse
There is no option in MongoDB to increase the 100 MB memory limit for aggregations, so for a heavy pipeline you have to set the flag to true.
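For reference, a minimal sketch of passing the option, using the $group stage from the question (the collection name observations is only a placeholder):
db.observations.aggregate(
  [
    {
      $group: {
        _id: {
          producer: "$producer",
          dataset: "$dataset",
          featureOfInterest: "$_id.featureOfInterest",
          observedProperty: "$_id.observedProperty"
        },
        documentId: { $push: "$documentId" }
      }
    }
  ],
  { allowDiskUse: true }  // lets $group spill to temporary files instead of erroring at 100 MB
)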
However
You may be interested in reading about the MongoDB In-Memory Storage Engine.
Example of starting mongod with the in-memory storage engine on the command line:
mongod --storageEngine inMemory --dbpath <path> --inMemorySizeGB <newSize>
More information in Mongodb docs
https://docs.mongodb.com/manual/core/inmemory/
Regarding the second question - I didn't get it. Please post example documents.

Mongo sharding not removing data of sharded collection in source shard

I have MongoDB 3.2.6 installed on 5 machines which all form sharded cluster consisting of 2 shards (each is replica set with primary-secondary-arbiter configuration).
I also have a database with a very large collection (~50M records, 200GB), and it was imported through mongos, which put it on the primary shard along with other collections.
I generated a hashed _id on that collection, which will be my shard key.
After that I sharded the collection with:
> use admin
> db.runCommand( { enablesharding : "my-database" } )
> use my-database
> sh.shardCollection("my-database.my-collection", { "_id": "hashed" } )
Command returned:
{ "collectionsharded" : "my-database.my-collection", "ok" : 1 }
And it actually started to shard. Status of shard looks like this:
> db.my-collection.getShardingDistribution()
Totals
data : 88.33GiB docs : 45898841 chunks : 2825
Shard my-replica-1 contains 99.89% data, 99.88% docs in cluster, avg obj size on shard : 2KiB
Shard my-replica-2 contains 0.1% data, 0.11% docs in cluster, avg obj size on shard : 2KiB
This all looks OK, but the problem is that when I count my-collection through mongos, I see the number increasing.
When I log in to the primary replica set (my-replica-1), I see that the number of records in my-collection is not decreasing, although the number in my-replica-2 is increasing (which is expected), so I guess MongoDB is not removing chunks from the source shard while migrating to the second shard.
Does anyone know if this is normal, and if not, why it is happening?
EDIT: Actually, now it has started to decrease on my-replica-1, although it still grows when counting on mongos (sometimes it goes a little down and then up). Maybe this is normal behaviour when migrating a large collection, I don't know.
Ivan, according to the documentation, you are observing a valid situation.
When a document is moved from shard a to shard b, it is counted twice until shard a receives confirmation that the relocation was successful.
On a sharded cluster, db.collection.count() can result in an
inaccurate count if orphaned documents exist or if a chunk migration
is in progress.
To avoid these situations, on a sharded cluster, use the $group stage
of the db.collection.aggregate() method to $sum the documents. For
example, the following operation counts the documents in a collection:
db.collection.aggregate(
  [
    { $group: { _id: null, count: { $sum: 1 } } }
  ]
)

Incorrect Count returned by MongoDB (WiredTiger)

This sounds odd, and I hope I am doing something wrong, but my MongoDB collection is returning the Count off by one in my collection.
I have a collection with (I am sure) 359671 documents. However the count() command returns 359670 documents.
I am executing the count() command using the mongo shell:
rs0:PRIMARY> db.COLLECTION.count()
359670
This is incorrect.
It is not finding each and every document in my collection.
If I provide the following query to count, I get the correct result:
rs0:PRIMARY> db.COLLECTION.count({_id: {$exists: true}})
359671
I believe this is a bug in WiredTiger. As far as I am aware each document has the same definition, an _id field of an integer ranging from 0 to 359670, and a BinData field. I did not have this problem with the older storage engine (or Mongo 2, either could have caused the issue).
Is this something I have done wrong? I do not want to use the {_id: {$exists: true}} query as that takes 100x longer to complete.
According to this issue, this behaviour can occur if MongoDB experiences a hard crash and is not shut down gracefully. If no query is issued, MongoDB probably just falls back to the collected statistics.
According to the article, calling db.COLLECTION.validate(true) should reset the counters.
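A short hedged example (collection name as in the question); note that a full validation scans every document and can be slow on a large collection:
rs0:PRIMARY> db.COLLECTION.validate(true)
rs0:PRIMARY> db.COLLECTION.count()
If stale statistics were the cause, the metadata-based count should match the real document count again afterwards.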
As now stated in the docs, db.collection.count() without a query parameter returns results based on the collection's metadata:
This may result in an approximate count. In particular:
On a sharded cluster, the resulting count will not correctly filter out orphaned documents.
After an unclean shutdown, the count may be incorrect.
When using a query parameter, as you did in the second query ({_id: {$exists: true}}), then it forces count to not use the collection's metadata, but to scan the collection instead.
Starting Mongo 4.0.3, count() is considered deprecated and the following alternatives are recommended instead:
Exact count of documents:
db.collection.countDocuments({})
which under the hood actually performs the following "expensive", but accurate aggregation (expensive since the whole collection is scanned to count records):
db.collection.aggregate([{ $group: { _id: null, n: { $sum: 1 } } }])
Approximate count of documents:
db.collection.estimatedDocumentCount()
which does exactly what db.collection.count() does/did (it's actually a wrapper around count) and uses the collection's metadata.
This is thus almost instantaneous, but may lead to an approximate result in the particular cases mentioned above.