How to do an efficient query to replace $exists in MongoDB - mongodb

I have a MongoDB collection with millions of documents in it.
The documents have a structure like {k: {a:1,b:2,c:{},...}}, and I don't know exactly what is in k.
Now I want to count the documents in this collection where k is not empty, using {k:{$exists:true}}, but that turns out to be very slow...
Then I added an index on k and tried querying with {k:{$gt:{}}}, but that does not return the correct results.
So, how should I do this count on the collection?
Note that I don't know the data structure of k.

If you are using a version before 2.0, $exists is not able to use an index.
See this answer: https://stackoverflow.com/a/7503114/131809
So, try upgrading your version of MongoDB.
From the docs:
Before v2.0, $exists is not able to use an index. Indexes on other
fields are still used.
$exists is not very efficient even with an
index, and esp. with {$exists:true} since it will effectively have to
scan all indexed values.
The second part of that is perhaps the important bit.
It sounds like a sparse index may be the key here...

db.collection.count({k:{$ne:null}})
By the way, use a sparse index on k:
db.collection.ensureIndex({k:1}, {sparse: true});
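One thing worth checking before swapping {$exists:true} for {$ne:null} is that the two predicates do not count exactly the same documents. The sketch below is a plain in-memory model in JavaScript, not a query against a real server; the sample documents and the helper names matchesExists and matchesNeNull are made up for illustration.

```javascript
// In-memory model of the two predicates; no MongoDB connection is involved.
// The sample documents are invented for illustration.
const docs = [
  { _id: 1, k: { a: 1, b: 2 } }, // k present and non-empty
  { _id: 2, k: {} },             // k present but empty
  { _id: 3, k: null },           // k present, explicitly null
  { _id: 4 },                    // k missing
];

// Models {k: {$exists: true}}: matches any document that has the field, even if null.
const matchesExists = (doc) => Object.prototype.hasOwnProperty.call(doc, "k");

// Models {k: {$ne: null}}: rejects both a missing field and an explicit null.
const matchesNeNull = (doc) => doc.k !== undefined && doc.k !== null;

const countExists = docs.filter(matchesExists).length;
const countNeNull = docs.filter(matchesNeNull).length;

console.log(countExists, countNeNull); // 3 2
```

If no document stores k as an explicit null, the two counts agree.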

Try using $ne : null
So, as per your code example:
{k:{$ne : null}}

Related

MongoDB Indexing a field which may not exist

I have a collection which has an optional field xy_id. About 10% of the documents (out of 500k) do not have this xy_id field.
I have quite a lot of queries to this collection like find({xy_id: <id>}).
I tried indexing it normally (.createIndex({xy_id: 1}, {"background": true})) and it does improve the query speed.
Is this the correct way to index the field in this case? Or should I be using a sparse index or some other approach?
Yes, this is the correct way. The default behaviour of MongoDB serves well in this case. You can see in the docs that index creation supports a unique flag, which is false by default. All your documents missing the index key will be indexed under a single index entry. Queries can use this index in all cases because all the documents are indexed.
On the other hand, if you use a sparse index, the documents missing the index key will not be indexed at all. Some operations, such as count, sort and other queries, will not be able to use the sparse index unless explicitly hinted to do so. If you hint explicitly, be prepared for incomplete results: documents that have no entry in the index are omitted from the result. You can read about it here.
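The difference between the two index types described above can be sketched with a toy model. This is plain JavaScript, not MongoDB's actual index structure; buildIndex and indexedCount are hypothetical helpers, and the sample documents are invented.

```javascript
// Toy model of index entries; a real B-tree index is far more involved.
const docs = [
  { _id: 1, xy_id: "A" },
  { _id: 2, xy_id: "B" },
  { _id: 3 },             // no xy_id
  { _id: 4, xy_id: "A" },
];

// Build a map from key value to the _ids stored under that key.
function buildIndex(documents, field, { sparse = false } = {}) {
  const index = new Map();
  for (const doc of documents) {
    const present = Object.prototype.hasOwnProperty.call(doc, field);
    if (!present && sparse) continue;        // sparse: skip documents without the field
    const key = present ? doc[field] : null; // regular: a missing field is stored under null
    if (!index.has(key)) index.set(key, []);
    index.get(key).push(doc._id);
  }
  return index;
}

// Number of documents reachable through an index.
const indexedCount = (index) =>
  [...index.values()].reduce((total, ids) => total + ids.length, 0);

const regular = buildIndex(docs, "xy_id");
const sparse = buildIndex(docs, "xy_id", { sparse: true });

console.log(indexedCount(regular)); // 4, every document is reachable
console.log(indexedCount(sparse));  // 3, the document without xy_id is absent
```

In this model a count answered only from the sparse index misses document 3, which is the incomplete-result risk mentioned for hinted queries.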

MongoDB: Indexes, Sorting

After reading the official documentation on indexes, sorting, and index intersection, I'm a little confused about how everything works together.
I'm having trouble making my query use the indexes I've created. I'm working with MongoDB 3.0.3, on a collection of ~4 million documents.
To simplify, let's say my document is composed of 6 fields:
{
a:<text>,
b:<boolean>,
c:<text>,
d:<boolean>,
e:<date>,
f:<date>
}
The query I want to achieve is the following :
db.mycoll.find({ a:"OK", b:true, c:"ProviderA", d:true, e:{ $gte:ISODate("2016-10-28T12:00:01Z"),$lt:ISODate("2016-10-28T12:00:02") } }).sort({f:1});
So intuitively I've created two indexes
db.mycoll.createIndex({a: 1, b: 1, c: 1, d:1, e:1 }, {background: true,name: "test1"})
db.mycoll.createIndex({f:1}, {background: true,name: "test2"})
But explain() shows that the first index is not used at all.
I know there is some kind of limitation when ranges are involved in the filter (on the e field), but I can't find my way around it.
Also, instead of a single index on f, I tried a compound index on {e:1,f:1}, but it didn't change anything.
So what have I misunderstood?
Thanks for your support.
Update: some time ago I also found the following rule of thumb for MongoDB 2.6:
A good rule of thumb for queries with sort is to order the indexed fields in this order:
First, the field(s) on which you will query for exact values.
Second, the field(s) on which you will sort.
Finally, field(s) on which you will query for a range of values (e.g., $gt, $lt, $in)
An example of using this rule of thumb is in the section on “Sorting the results of a complex query on a range of values” below, including a link to further reading.
Does this also apply to 3.x versions?
Update 2: following the rule of thumb above, I created the following index
db.mycoll.createIndex({a: 1, b: 1, c: 1, d:1 , f:1, e:1}, {background: true,name: "test1"})
And for the same query :
db.mycoll.find({ a:"OK", b:true, c:"ProviderA", d:true, e:{ $gte:ISODate("2016-10-28T12:00:01Z"),$lt:ISODate("2016-10-28T12:00:02") } }).sort({f:1});
the index is indeed used. However, too many keys seem to be scanned; I may need to find a better order for the fields in the query/index.
Mongo sometimes acts a bit strangely when it comes to index selection.
Mongo automagically decides which index to use. In my experience, the smaller an index is, the more likely it is to be used (especially indexes with only one field). Maybe this happens because a smaller index is more often already loaded in RAM? To find out which index to use, Mongo performs test queries when it is idle. However, the result is sometimes unexpected.
Therefore, if you know which index to use, you can force a query to use a specific index with the $hint option. You should try that.
Your two indexes, one used for the query and one for the sort, do not overlap, so MongoDB cannot use them for index intersection:
Index intersection does not apply when the sort() operation requires an index completely separate from the query predicate.
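The rule of thumb quoted in the question (equality fields first, then sort fields, then range fields) can be sketched as a small helper that builds the key specification for createIndex. buildIndexSpec is a hypothetical function written for illustration, not part of any MongoDB API, and it assumes ascending order for every field.

```javascript
// Hypothetical helper: orders index keys as equality, then sort, then range.
// Assumes ascending (1) for every field, which is a simplification.
function buildIndexSpec({ equality = [], sort = [], range = [] }) {
  const spec = {};
  for (const field of [...equality, ...sort, ...range]) {
    if (!(field in spec)) spec[field] = 1; // a repeated field keeps its first position
  }
  return spec;
}

// Fields taken from the query in the question:
// a, b, c, d are matched exactly, f is the sort key, e is the range filter.
const spec = buildIndexSpec({
  equality: ["a", "b", "c", "d"],
  sort: ["f"],
  range: ["e"],
});

console.log(JSON.stringify(spec)); // {"a":1,"b":1,"c":1,"d":1,"f":1,"e":1}
```

The resulting key order matches the index created in "Update 2" of the question.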

mongodb: why indexOnly=false when collection is empty

Let's say I have an empty db without any collections. Then I run db.qqq.ensureIndex({a:1}).
In the output of db.qqq.find().explain() I see BasicCursor and "indexOnly" : false. That seems OK.
db.qqq.find({a:"somevalue"}).explain() outputs BtreeCursor a_1, but it also reports "indexOnly" : false. Why does this happen?
Why isn't the given index enough for MongoDB to fulfill my query?
UPD: OK, so I need to use a projection, since not all fields are in my index. But what I don't understand is this: if Mongo can see from the index that there are no documents matching the query, why should it scan the actual documents?
You need to add a projection to that query; index-only means it gets ALL data from the index. MongoDB cannot use an index-only cursor if you want to get the full document back. For example:
db.qqq.find({a:"somevalue"},{a:1,_id:0}).explain()
Should work.
MongoDB doesn't know that there are no documents until it searches for them, so it will have to at least check in the index if it can. A "BasicCursor" with "n=0" is not really a bad thing of course as no actual documents are read (or index elements, as there are none).
Also, if you want to use a covered index, you need to use a projection so that only fields are returned that are actually part of the index. You do that with:
db.qqq.find({a:"somevalue"},{a:1,_id:0}).explain()
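The condition both answers describe can be written down as a rough check: a query is covered only when every queried field and every returned field is part of the index, and _id counts as returned unless it is excluded. isCovered below is a hypothetical, simplified helper (it ignores multikey indexes, embedded fields and other restrictions), not something MongoDB exposes.

```javascript
// Hypothetical, simplified check for whether a query can be answered from the index alone.
function isCovered(indexKeys, queryFields, projection) {
  const inIndex = (field) => indexKeys.includes(field);
  const included = Object.keys(projection).filter((f) => projection[f] === 1);
  // Without an inclusion projection the full document comes back: not covered.
  if (included.length === 0) return false;
  // _id is returned by default unless the projection sets _id: 0.
  const returned =
    projection._id === 0 || included.includes("_id")
      ? included
      : [...included, "_id"];
  return queryFields.every(inIndex) && returned.every(inIndex);
}

console.log(isCovered(["a"], ["a"], {}));               // false
console.log(isCovered(["a"], ["a"], { a: 1 }));         // false, _id is not in the index
console.log(isCovered(["a"], ["a"], { a: 1, _id: 0 })); // true
```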

Time Complexity of $addToset vs $push when element does not exist in the Array

Given: Connection is Safe=True so Update's return will contain update information.
Say I have documents that look like:
[{'a': [1]}, {'a': [2]}, {'a': [1,2]}]
And I issue:
coll.update({}, {'$addToSet': {'a':1}}, multi=True)
The result would be:
{u'connectionId': 28,
u'err': None,
u'n': 3,
u'ok': 1.0,
u'updatedExisting': True
}
Even when some documents already have that value. To avoid this, I could issue this command instead:
coll.update({'a': {'$ne': 1}}, {'$push': {'a':1}}, multi=True)
What's the Time Complexity Comparison for $addToSet vs. $push with a $ne check ?
Looks like $addToSet is doing the same thing as your command: $push with a $ne check. Both would be O(N)
https://github.com/mongodb/mongo/blob/master/src/mongo/db/ops/update_internal.cpp
If speed is really important, then why not use a hash?
Instead of:
{'$addToSet': {'a':1}}
{'$addToSet': {'a':10}}
use:
{'$set': {'a.1': 1}}
{'$set': {'a.10': 1}}
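The idea behind the hash suggestion can be illustrated in plain JavaScript: an array-backed set has to scan for an existing copy before adding, while a keyed object just assigns. This is only an in-memory analogy under that assumption, and addToArraySet and addToKeyedSet are made-up names.

```javascript
// Array model: adding a value requires scanning for an existing copy.
function addToArraySet(arr, value) {
  if (!arr.includes(value)) arr.push(value); // includes() walks the array
  return arr;
}

// Keyed model, mirroring {'$set': {'a.<value>': 1}}: assignment is idempotent.
function addToKeyedSet(obj, value) {
  obj[String(value)] = 1;
  return obj;
}

const asArray = [];
const asKeys = {};
for (const v of [1, 10, 1]) {
  addToArraySet(asArray, v);
  addToKeyedSet(asKeys, v);
}

console.log(asArray.length, Object.keys(asKeys).length); // 2 2
```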
Edit
OK, since I misread your question all along: it turns out you are actually looking at two different queries and comparing their time complexity.
The first query being:
coll.update({}, {'$addToSet': {'a':1}}, multi=True)
And the second being:
coll.update({'a': {'$ne': 1}}, {'$push': {'a':1}}, multi=True)
The first problem that springs to mind here: no indexes. $addToSet, being an update modifier, does not use an index as far as I know, so you are doing a full table scan to accomplish what you need.
In reality you are looking for all documents that do not already have 1 in a, and you want to $push the value 1 onto that a array.
So 2 points to the second query even before we get into time complexity here because the first query:
Does not use indexes
Would be a full table scan
Would then do a full array scan (with no index) to $addToSet
So I have pretty much made up my mind here that the second query is what you're looking for, before any of the Big O notation stuff.
There is a problem to using big O notation to explain the time complexity of each query here:
I am unsure of what perspective you want, whether it is per document or for the whole collection.
I am unsure about indexes as such. Using an index on a gives a logarithmic lookup; not using one does not.
However, the first query would look something like O(n) per document, since:
The $addToSet would need to iterate over each element
The $addToSet would then need to do an O(1) op to insert the value if it does not exist. I should note I am unsure whether the O(1) is cancelled out or not (light reading suggests my version); I have cancelled it out here.
Per collection, without the index, it would be O(2n^2), since the complexity of iterating a will increase exponentially with every new document.
The second query, without indexes, would look something like O(2n^2) (O(n) per document), I believe, since $ne would have the same problems as $addToSet without indexes. However, with indexes I believe this would actually be O(log n log n) (O(log n) per document), since it would first find all documents that have a, then all documents without 1 in their set, based upon the b-tree.
So based upon time complexity and the notes at the beginning I would say query 2 is better.
If I am honest I am not used to explaining in "Big O" Notation so this is experimental.
Hope it helps,
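Leaving the complexity estimates aside, the observable difference between the two updates can be modelled in memory: both leave the arrays in the same state, but the first matches every document while the second matches only those that lack the value. The functions addToSetAll and pushWhereMissing are hypothetical stand-ins for the two update calls, and their return value mimics the "n" field of the result.

```javascript
// In-memory model of the two updates; the return value mimics the "n" field.
const sample = () => [{ a: [1] }, { a: [2] }, { a: [1, 2] }];

// Models update({}, {'$addToSet': {'a': value}}, multi=True): every document is matched.
function addToSetAll(docs, value) {
  let matched = 0;
  for (const doc of docs) {
    matched += 1;
    if (!doc.a.includes(value)) doc.a.push(value);
  }
  return matched;
}

// Models update({'a': {'$ne': value}}, {'$push': {'a': value}}, multi=True).
function pushWhereMissing(docs, value) {
  let matched = 0;
  for (const doc of docs) {
    if (doc.a.includes(value)) continue; // filtered out by $ne
    matched += 1;
    doc.a.push(value);
  }
  return matched;
}

const first = sample();
const second = sample();
const nFirst = addToSetAll(first, 1);
const nSecond = pushWhereMissing(second, 1);

console.log(nFirst, nSecond);                                  // 3 1
console.log(JSON.stringify(first) === JSON.stringify(second)); // true
```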
Adding my observation on the difference between $addToSet and $push, from a bulk update of 100k documents.
When you are doing a bulk update, $addToSet is executed separately.
For example,
bulkInsert.find({x:y}).upsert().update({"$set":{..},"$push":{ "a":"b" } , "$setOnInsert": {} })
will first insert and set the document, and then execute the $addToSet query.
I saw a clear difference of 10k between
db.collection_name.count() #gives around 40k
db.collection_name.count({"a":{$in:["b"]}}) # it gives only around 30k
But when I replaced $addToSet with $push, both count queries returned the same value.
Note: when you're not concerned about duplicate entries in the array, you can go with $push.

How to set array length after updating it via $addToSet in mongodb?

Document structure looks like this,
{
blacklists:[], // elements should be unique
blacklistsLength:0 // length of blacklists
}
Adding a set of values to blacklists is easy.
db.posts.update({_id:...}, {$addToSet:{blacklists:{$each:['peter', 'bob', 'steven']}}});
But how can I update blacklistsLength at the same time to reflect the changes?
This is not possible. You can either:
Update the length separately using a subsequent findAndModify command, or
Do it per name and rewrite the query using a negation in your criteria and $push rather than $addToSet (not strictly needed, but a lot faster with large blacklists, since $addToSet is always O(n) regardless of indexes):
db.posts.update({_id:..., blacklists:{$ne:'peter'}}, {$push:{blacklists:'peter'}, $inc:{blacklistsLength: 1}});
The latter is perfectly safe, since the list and the length are adjusted atomically, but it obviously costs some performance because it needs one update per name. It also benefits from better overall performance thanks to the $push versus $addToSet difference on large arrays (blacklists tend to become huge, and remember that the $push version of the update uses an index on blacklists in the update criteria, while $addToSet will NOT use an index during its set scan), so it is generally the best solution.
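The per-name update with a $ne guard can be modelled in memory to show why the counter stays in step with the array: the push and the increment either both happen or neither does. addToBlacklist is a hypothetical stand-in for the guarded update on a single document, not a driver call.

```javascript
// Models update({blacklists: {$ne: name}}, {$push: ..., $inc: ...}) on one document.
function addToBlacklist(doc, name) {
  if (doc.blacklists.includes(name)) return false; // criteria did not match; nothing changes
  doc.blacklists.push(name);
  doc.blacklistsLength += 1; // changed together with the push
  return true;
}

const post = { blacklists: [], blacklistsLength: 0 };
for (const name of ["peter", "bob", "peter", "steven"]) {
  addToBlacklist(post, name);
}

console.log(post.blacklists.length, post.blacklistsLength); // 3 3
```

The repeated "peter" is rejected by the guard, so the counter is not incremented a second time.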
Would the following not work?
db.posts.update({_id:...}, {
$addToSet:{blacklists:{$each:['peter', 'bob', 'steven']}},
$set: {blacklistsLength: ['peter', 'bob', 'steven'].length}
});
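One way to probe that suggestion is to model it in memory with an array that is not empty beforehand. The sketch below assumes the combined update behaves as written (add the batch as a set, then set the counter to the batch size); addBatchAndSetLength and the starting document are invented for illustration.

```javascript
// Models $addToSet with $each plus $set of blacklistsLength to the batch size.
function addBatchAndSetLength(doc, names) {
  for (const name of names) {
    if (!doc.blacklists.includes(name)) doc.blacklists.push(name);
  }
  doc.blacklistsLength = names.length; // size of the batch, not of the array
  return doc;
}

// Starting state: two names already stored, one of them also in the new batch.
const post = { blacklists: ["alice", "peter"], blacklistsLength: 2 };
addBatchAndSetLength(post, ["peter", "bob", "steven"]);

console.log(post.blacklists.length, post.blacklistsLength); // 4 3
```

Under these assumptions the counter only matches the array when the array starts out empty and the batch has no duplicates.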
I had a similar problem, please see the discussion here: google groups mongo
As you can see, following this discussion, a bug was opened:
Mongo Jira
As you upsert items into the database, simply query the item to see if it's in your embedded array. That way, you're avoiding pushing duplicate items, and only incrementing the counter as you add new items.
q = {'blacklists': {'$nin': ['blacklist_to_insert'] }}
u = {
'$push' : {'blacklists': 'blacklist_to_insert' },
'$inc' : {'total_blacklists': 1 }
}
o = { 'upsert' : true }
db.posts.update(q,u,o)