Best way to count documents in mongoDB - mongodb

We have a collection with a large number of documents, say around 100k. We now want to count the documents that have the key x set.
If I try it with Collection.countDocuments({ x: { $exists: true }}) I get the result, but it instantly triggers a warning in the console: Query Targeting: Scanned Objects / Returned has gone above 1000.
So, is there a better way to count the documents? There is an index on the field; is it possible to get the length of the index?
Thanks

There's no real way of viewing the index trees in Mongo. What other people have linked just returns the size of the tree, and I'm not sure how useful that information is in this context.
Now to your question: is this the best way to count?
The answer is yes ... -ish.
countDocuments is a wrapper function, it just simulates the following pipeline:
db.collection.aggregate([
{ $match: <query> },
{ $group: { _id: null, n: { $sum: 1 } } }
])
This pipeline is the most efficient way to go, but the difference between running this aggregation directly and using the wrapper function is only about 100-200 milliseconds, depending on your machine spec.
Meaning that if you're looking for "way" better performance, you're not going to find it.
With that said, this warning is stupid; it just means you have more than 1000 documents with that field. Its true purpose is to alert you when you're trying to query 1-20 documents without a proper index.
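As a rough illustration of the two points above, here is a minimal sketch that builds the pipeline as a plain value and computes the ratio the warning is named after. The helper names buildCountPipeline and scannedToReturnedRatio are made up for this example, and the figures (100k scanned, 1 returned) are assumptions based on the question, not measured values.

```javascript
// Build the aggregation that countDocuments is described as wrapping.
// buildCountPipeline is a hypothetical helper, not part of any driver.
function buildCountPipeline(query) {
  return [
    { $match: query },
    { $group: { _id: null, n: { $sum: 1 } } },
  ];
}

// Ratio named in the warning: documents scanned per document returned.
// Guard against division by zero when nothing is returned.
function scannedToReturnedRatio(scanned, returned) {
  return scanned / Math.max(returned, 1);
}

const pipeline = buildCountPipeline({ x: { $exists: true } });

// Assumption: all ~100k documents carry the field, and the count comes
// back as a single result document.
const ratio = scannedToReturnedRatio(100000, 1);

console.log(JSON.stringify(pipeline), ratio);
```

Under those assumptions the ratio is far above 1000 even though the query is doing exactly what was asked, which is why the warning is not informative for counts.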

You can use the indexSizes field returned by the stats() method.
The stats() method "Returns statistics about the collection".
See example here :
https://docs.mongodb.com/manual/reference/method/db.collection.stats/#basic-stats-lookup
{
...,
"indexSizes" : {
"_id_" : 237568,
"cuisine_1" : 143360,
"borough_1_cuisine_1" : 151552,
"borough_1_address.zipcode_1" : 151552
},
...
}

The indexSizes key returns size, as in the storage space used, not a count.
Check with explain() whether the index is being used (and add the output to the question).
You can use the hint option to check performance after specifying an index.
Precalculating the count with the $inc operator might also be a good option, if that is possible in your use case.
Try cursor.count() to see whether it is faster. countDocuments should be faster, but there is no harm in checking.
https://docs.mongodb.com/manual/reference/method/cursor.count/

Related

MongoDB - how do I update a value in nested array/object?

I have a document in my Mongo collection which has a field with the following structure:
"_id" : "F7WNvjwnFZZ7HoKSF",
"process" : [
{
"process_id" : "wTGqVk5By32mpXadZ",
"stages" : [
{
"stage_id" : "D6Huk89DGFsd29ds7",
"completed" : "N"
},
{
"stage_id" : "Msd390vekn09nvL23",
"completed" : "N"
}
]
}
]
I need to update the value of completed where the stage_id is equal to 'D6Huk89DGFsd29ds7' - the update query will not know which object in the stages array this value of stage_id will be in.
How do I do this?
Since you have nested arrays in your object, this is a bit tricky, and I'm not sure whether this problem can be solved with just one update query.
However, if you happen to know the index of the matching object in the first array (in your case process[0]), you can write your update query like this:
db.collection.update(
{"process.stages.stage_id":"D6Huk89DGFsd29ds7"},
{$set:{"process.0.stages.$.completed":"Y"}}
);
The query above will work perfectly with your test case. However, there may be multiple objects at the root level, and there is no guarantee that the matching object will always be at index 0.
The solution I proposed above will fail if process has multiple children and the index of the matching object is not zero.
However, you can achieve your goal with client-side programming: find the matching document, modify it on the client side, and replace the whole document with the new content.
Since this approach is very inefficient, I suggest that you consider altering your document structure to avoid nesting. Create another collection and move the content of the process array there.
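For illustration, the client-side modification step could look roughly like the sketch below. It only covers the in-memory part (the find and the replace against the database are left out), and markStageCompleted is a hypothetical helper name, not something from the original posts.

```javascript
// Return a copy of `doc` with completed set to "Y" on every stage whose
// stage_id matches, wherever it sits inside the nested process array.
function markStageCompleted(doc, stageId) {
  return {
    ...doc,
    process: doc.process.map((proc) => ({
      ...proc,
      stages: proc.stages.map((stage) =>
        stage.stage_id === stageId ? { ...stage, completed: "Y" } : stage
      ),
    })),
  };
}

// Document shaped like the one in the question.
const original = {
  _id: "F7WNvjwnFZZ7HoKSF",
  process: [
    {
      process_id: "wTGqVk5By32mpXadZ",
      stages: [
        { stage_id: "D6Huk89DGFsd29ds7", completed: "N" },
        { stage_id: "Msd390vekn09nvL23", completed: "N" },
      ],
    },
  ],
};

// `updated` is what would then be written back in place of the original.
const updated = markStageCompleted(original, "D6Huk89DGFsd29ds7");
console.log(JSON.stringify(updated, null, 2));
```

The function copies rather than mutates, so the original document stays available for comparison if the write-back needs to detect concurrent changes.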
In the end, I removed the outer process block so that process_id and stages were in the root of the document, which made updating easier:
MyColl.update(
{
_id: 'F7WNvjwnFZZ7HoKSF',
"stages.stage_id": 'D6Huk89DGFsd29ds7'
},
{
$set: {"stages.$.completed": 'Y'}
}
);

Best way to create a mongo expression that never matches

What I am looking for is somehow the equivalent of doing in SQL:
WHERE 1 = 0
I'm looking for such a thing because I'm building a typesafe DSL to perform queries on my domain, supporting conjunctions and disjunctions. Sometimes it may be easier to add a query that never matches anything instead of dealing with it in the code.
For example, in my use case:
StampleFilters().underCategoryIds(sharedCategoryIds.toList)
In this case, it does not work as expected because sharedCategoryIds is empty, so the resulting query is $(), which does not filter anything.
For an empty list, I would rather build a query that never returns anything.
Is there an easy way to do such a thing, without any impact on performance?
I could probably add some query like { somefield: unexistingvalue }, but I wonder if there is anything better.
Edit
I expect the expression to be composable. I mean it should work in queries like $or(exp1,exp2,exp3), where exp1 is, for example, the expression that never matches.
If you have any suggestions, it would be nice to explain why one is better than the others and how it affects query engine performance (or not).
I think the best way to achieve what you want is to add {_id : -1}.
db.coll.find({a : 1}) will be transformed into db.coll.find({a : 1, _id : -1}). This is simpler than all of shx2's solutions (except the last one with noScan, which is nice).
Moreover, the _id field already has a primary index, so the server will quickly determine that there is no such _id value in the collection.
P.S. If someone is smart enough to use -1 as an _id, you can use {_id : NaN} instead.
If there is an _id equal to NaN, you most probably need to redevelop your app.
I came up with a few ways to achieve that:
"P&!P": { $and: [ {X:0}, {X:{$ne:0}} ] }
Can't be "$in" an empty list: { X: {$in: []} }
Nothing can be this long { X: {$size: 9999999999999999} }
"noScan": db.coll.find({})._addSpecial("$maxScan", 0)
EDIT:
one more, using $where: { $where: function() {return 0} }
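To show how one of these expressions composes, here is a minimal sketch using the empty $in form from the list above. The field name categoryId, the helpers underCategoryIds and or, and the tiny matches evaluator are all made up for illustration. The evaluator is a toy that understands only $or, $in and plain equality on top-level fields, and is not how the server evaluates queries.

```javascript
// Hypothetical DSL helper: with an empty list this yields
// { categoryId: { $in: [] } }, which matches nothing, so the empty
// case needs no special handling.
function underCategoryIds(ids) {
  return { categoryId: { $in: ids } };
}

// Hypothetical disjunction helper.
function or(...exprs) {
  return { $or: exprs };
}

// Toy in-memory evaluator, only for demonstrating the semantics.
function matches(doc, expr) {
  return Object.entries(expr).every(([key, cond]) => {
    if (key === "$or") {
      return cond.some((sub) => matches(doc, sub));
    }
    if (cond !== null && typeof cond === "object" && Array.isArray(cond.$in)) {
      return cond.$in.includes(doc[key]);
    }
    return doc[key] === cond;
  });
}

const doc = { categoryId: 7, a: 1 };
const never = underCategoryIds([]);

console.log(matches(doc, never));               // the never-matching filter alone
console.log(matches(doc, or(never, { a: 1 }))); // composed inside a disjunction
```

Because the never-matching branch is just another expression, it drops out of an $or naturally and forces a conjunction to match nothing, which is the composability the question asks for.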

sort by string length in Mongodb/pymongo

I was wondering if anyone knows how to sort a MongoDB find() result by string length.
I have tried something like db.foo.find().sort({item.lenght:-1}), but obviously it doesn't work. Can somebody help me and also suggest a way to do the same thing in pymongo?
There are a lot of things (and basic API) I would personally love to see in the aggregation framework, such as:
Math functions: log (as in logarithm), ceil, floor
Array: sum
String: length
Just to name a few.
And that is without resorting to obscure usages of the $mod operator or other means for cases such as "ceil" and "floor". But I digress.
Your "string length" falls into this category. Raise a JIRA issue about it. But for now you can use mapReduce and the existing JavaScript functionality:
db.collection.mapReduce(
function() {
emit( this.item.length, this.item );
},
function(key,values) {
return values;
},
{ "out": { "inline": 1 } }
)
So while that has the funky "mapReduce" style of returning a re-shaped document, with everything of the same length collected in an array, it takes advantage of the nature of "mapReduce" (not restricted to MongoDB): the emitted "key" values are sorted in the response.
There is now a solution for this in MongoDB v3.4+ using the aggregation framework's $strLenBytes operator. Given the following document:
{_id: 0, name: "Bob"}
We can use
db.mycollection.aggregate([{
$project: {
byteLength: {$strLenBytes: "$name"}
}
}])
This returns a byteLength of 3, the number of bytes in "Bob".
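The projection above computes the length but does not sort by it yet. A minimal sketch of one way to finish the job is to add a $sort stage on the computed field. The field name itemLength and the helper sortByItemByteLength are assumptions made for this example; the in-memory function is only there to show the intended ordering without a server.

```javascript
// Pipeline sketch: compute the byte length of `item`, then sort on it,
// longest first. `itemLength` is a made-up field name.
const sortByLengthPipeline = [
  { $addFields: { itemLength: { $strLenBytes: "$item" } } },
  { $sort: { itemLength: -1 } },
];

// In-memory equivalent of the ordering, using UTF-8 byte length to
// mirror $strLenBytes.
function sortByItemByteLength(docs) {
  const encoder = new TextEncoder();
  const byteLength = (s) => encoder.encode(s).length;
  return [...docs].sort((a, b) => byteLength(b.item) - byteLength(a.item));
}

const sorted = sortByItemByteLength([
  { item: "aa" },
  { item: "abcd" },
  { item: "a" },
]);

console.log(JSON.stringify(sortByLengthPipeline));
console.log(sorted.map((d) => d.item));
```

If character count matters more than byte count (for non-ASCII text), $strLenCP is the code point counterpart of $strLenBytes and could be swapped in.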
No, it is actually not possible.
I was dealing with a similar problem. What I did was store the string length of every object as a property of the object itself, which bypassed the problem.
If you think this should be implemented (I do), I recommend you upvote the issue in JIRA, which for some reason does not have many votes:
https://jira.mongodb.org/browse/SERVER-5319

Maintaining total count of a set in another collection

I have a simple scenario with two entities: post and bump (i.e. upvote).
Example of a post:
{_id: 'happy_days', 'title': 'Happy days', text: '...', bumps: 2}
Example of a bump:
{_id: {user: 'jimmy', post: 'happy_days'}}
{_id: {user: 'hans', post: 'happy_days'}}
Question: how do I maintain correct bumps count in post under all circumstances (and failures)?
The method I have come up with so far is:
To bump, upsert and check for existence. Only if inserted, increase bumps count.
To unbump, delete and check for existence. Only if deleted, decrease bumps count.
The above fails if the app crashes between the two ops, and the only way to correct the bump stats is to query all documents in the bump collection and recalculate everything offline (i.e. there is no way to know which posts have incorrect bump counts).
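To make the two-step method and its failure window concrete, here is a minimal in-memory sketch. BumpStore is a hypothetical stand-in for the two collections, with no database involved, and the comments mark where a crash would leave the count out of sync.

```javascript
// In-memory stand-in for the bump and post collections (illustration only).
class BumpStore {
  constructor() {
    this.bumps = new Set();  // one entry per (user, post) pair
    this.counts = new Map(); // post id -> bumps counter
  }

  static key(user, post) {
    return JSON.stringify([user, post]);
  }

  // Step 1: insert if absent. Step 2: increment only if inserted.
  bump(user, post) {
    const key = BumpStore.key(user, post);
    if (this.bumps.has(key)) return false;
    this.bumps.add(key);
    // A crash here would leave the counter one too low.
    this.counts.set(post, (this.counts.get(post) || 0) + 1);
    return true;
  }

  // Step 1: delete if present. Step 2: decrement only if deleted.
  unbump(user, post) {
    if (!this.bumps.delete(BumpStore.key(user, post))) return false;
    // A crash here would leave the counter one too high.
    this.counts.set(post, (this.counts.get(post) || 0) - 1);
    return true;
  }
}

const store = new BumpStore();
store.bump("jimmy", "happy_days");
store.bump("hans", "happy_days");
store.bump("hans", "happy_days"); // duplicate, counter unchanged
console.log(store.counts.get("happy_days"));
```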
I suggest that you stick with what you already have. The worst that can happen if there is a failover/connection issue between your two operations is that your bump count is wrong. So what? This is not the end of the world, and nobody is going to care too much whether a bump count is 812 or 813. If something went wrong, you can always recreate the count anyway by running an aggregation query that checks how many bumps you have for each post. Embrace eventual consistency!
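The recount mentioned above might look roughly like this. It is a sketch assuming bump documents shaped as in the question, with the post id stored under _id.post. The pipeline groups bumps per post, and recountBumps is a hypothetical in-memory equivalent showing the same result without a server.

```javascript
// Aggregation sketch: one output document per post with its true bump count.
const recountPipeline = [
  { $group: { _id: "$_id.post", bumps: { $sum: 1 } } },
];

// In-memory equivalent of the grouping above.
function recountBumps(bumpDocs) {
  const counts = new Map();
  for (const bump of bumpDocs) {
    const post = bump._id.post;
    counts.set(post, (counts.get(post) || 0) + 1);
  }
  return counts;
}

const recounted = recountBumps([
  { _id: { user: "jimmy", post: "happy_days" } },
  { _id: { user: "hans", post: "happy_days" } },
]);

console.log(JSON.stringify(recountPipeline), [...recounted.entries()]);
```

The recounted values can then be written back to the posts to repair any counters that drifted.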
As an alternative to updating the data in multiple places (which will probably be best for read performance but, as you noticed, complicates updates), it may be worth considering storing the uids of the bumps in an array (here called bump_uids) directly on the post, and just counting the bumps when needed using the aggregation framework:
> db.test.aggregate( [ { $match: { _id:'happy_days' } },
{ $project: { bump_uids: 1 } },
{ $unwind: '$bump_uids' },
{ $group: {_id:'$_id', bumps: { $sum:1 } } } ] )
>>> { "result" : [ { "_id" : "happy_days", "bumps" : 3 } ], "ok" : 1 }
Since MongoDB does not yet support triggers ( https://jira.mongodb.org/browse/SERVER-124 ) you have to do this the gritty way with application logic.
As a brief example:
db.follower.insert({fromId:u,toId:c});
db.user.update({_id:u},{$inc:{totalFollowing:1}});
db.user.update({_id:c},{$inc:{totalFollowers:1}});
Yes, it is not atomic, etc., but it is the way to do it. In reality, many people update counters like this, whether in MongoDB or not.

Is it possible to physically reorganize a mongoDB collection to avoid using a sort()?

I have a collection that stores information about articles. The collection is for archival purposes so it is read only. Only two fields are being used at the moment: "title" and "page_length". Because I am always interested in getting longer articles first, I have the following index in place: { title: 1, page_length: -1}.
I have found that sorts are still slow because the collection is very large and won't fit into memory.
Assuming that almost every query I use on this collection will require a sort({page_length:-1}), is there any way to simply have the records stored on disk in order of page_length descending? In other words, is there a simple way to make the first record in the collection the largest page_length value, the second record the second largest, and so on?
That way I could just grab the first n records using limit(n) without having to run a sort. Any ideas?
Updating with more information:
I'm using this for a search autocomplete feature so speed is critical. The query I've been using looks like this:
db.articles.find({"title": /^SomeKeyword/}).sort({page_length:-1})
I'm happy to create multiple indexes since inserts are not a concern, I just want to maximize read speed.
EDIT: For reference, I actually was able to reorganize the records in the collection by using a find().forEach() into a new collection. I then searched the collection and grabbed the first N results without the need for any sort, which worked very well. Note that this ONLY works because my dataset does not ever change.
Your index { title: 1, page_length: -1 } is not used for a query that looks like this:
db.collection.find( {} ).sort( { page_length: -1 } );
MongoDB can only use compound indexes from left to right, so in order for the index to be used, you need to have the "title" as a find or sort argument:
db.collection.find( { title: 'foo' } ).sort( { page_length: -1 } );
db.collection.find().sort( { title: 1, page_length: -1 } );
Explain will tell you:
db.so.find( {} ).sort( { page_length: -1 } ).explain();
{
"cursor" : "BasicCursor",
…
If you change your index to:
db.so.ensureIndex({ page_length: -1, title: 1 } );
Then the index will be used for sorting, but you can't use it for just doing a lookup by title, so you will need an additional index for that. If you're really only interested in those two fields, making sure you use a covered index helps. You will need the compound index { page_length: -1, title: 1 }, and you can make sure it is used by using a projection:
db.collection.find( {}, { page_length: 1, title: 1, _id: 0 } ).sort( { page_length: -1 } );
But you cannot decide or influence how MongoDB stores things on disk.
I can think of a solution that uses two queries.
First, you can do a covered query to get the list of documents you care about. Second you can use the list of documents retrieved and the $in operator to get the final result.
The covered query will operate within memory (or at least sequentially on disk), so it should be fast, and the $in can utilize the _id index and should be tolerably efficient with a reasonable number of documents.