mongo index field presence - mongodb

How can you create an index in mongodb to only index presence of a field (whether it's null/undefined or it actually has a value). I understand null is considered a value, however for this case I would like to consider null the same as undefined so the index will only care whether the property is null/undefined or not.
Example:
I have a data structure with a very big internal property:
document = {
  _id: 1,
  info: { /* a fairly large object here */ }
}
index = { info: 1 }
If I were to make a query as follows:
collection.find({ info: null })
This works fine in terms of search performance. The issue is that the object is large, so the index ends up very large (hundreds of MB with only 500k documents).
I have looked into conditional indexes (partial and sparse), but they only help if you want to index a subset of documents. I want to index all documents, but only on the presence of the field, not its entire value.
The other practical option is adding a boolean flag hasInfo: Boolean and indexing just that, but then I have to back-populate the data and keep a redundant property on the collection.
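A minimal sketch of the flag option mentioned above, in plain JavaScript so it runs without a database. The helper name withPresenceFlag and the sample documents are hypothetical; the mongosh commands in the trailing comment are one possible way to back-populate and index the flag, not something taken from the question.

```javascript
// Hypothetical helper: derive a small boolean from the large `info` field.
// `!= null` is false for both null and undefined, so the two are treated alike.
function withPresenceFlag(doc) {
  return { ...doc, hasInfo: doc.info != null };
}

// Made-up sample documents covering the three cases.
const sampleDocs = [
  { _id: 1, info: { big: "payload" } },
  { _id: 2, info: null },
  { _id: 3 },
];

const flagged = sampleDocs.map(withPresenceFlag);
console.log(flagged.map((d) => [d._id, d.hasInfo]));

// In mongosh the same idea might look like this (assumption, not run here):
//   db.collection.updateMany({ info: { $ne: null } }, { $set: { hasInfo: true } });
//   db.collection.updateMany({ info: null }, { $set: { hasInfo: false } });
//   db.collection.createIndex({ hasInfo: 1 });
```

The index on hasInfo stores one small boolean per document instead of the whole info object, at the cost of keeping the flag in sync on every write.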

Related

Compound index where one field can be null MongoDB

How can I create a compound index in MongoDB where one of the fields may be absent or null?
For example, suppose I create a compound index on name+age for the documents below. How can I still achieve this when age is missing or null in some documents?
{
  name: "Anurag",
  age: "21",
},
{
  name: "Nitin",
},
You can create a partial index as follows:
db.contacts.createIndex(
  { name: 1, age: 1 },
  { partialFilterExpression: { age: { $exists: true } } }
)
Explained:
As per the documentation, partial indexes only index the documents in a collection that meet a specified filter expression. By indexing a subset of the documents in a collection, partial indexes have lower storage requirements and reduced performance costs for index creation and maintenance. In this particular case, imagine your collection has 100k documents but only 5 of them have the age field. The partial index will then include only those 5 documents, which reduces index storage space and improves performance.
For the query optimizer to choose this partial index, the query predicate must include a condition on the name field as well as a non-null match on the age field. A query on name alone, such as db.contacts.find({name:"John"}), cannot use it, because the index leaves out documents that have no age field and the result would be incomplete.
The following example queries can use the index:
db.contacts.find({name:"John",age:{$gt:20}})
db.contacts.find({name:"John",age:30})
The following example query is a "covered query" based on this index:
db.contacts.find({name:"John",age:30},{_id:0,name:1,age:1})
(This query is highly efficient since it returns the data directly from the index.)
The following example queries cannot use the index:
db.contacts.find({name:"John",age:{$exists:false}})
db.contacts.find({name:"John",age:null})
db.contacts.find({age:20})
Please note that you need to analyze whether you actually search on the age field together with name. The name field has very good selectivity, and this index will not be used if you search only by age. If fetching contacts by age is a possible use case, a good option may be to create an additional sparse or partial index on the age field alone.
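The filtering described in the answer can be sketched as a small simulation. This is a simplified model, not how MongoDB stores index entries: buildPartialIndex is a hypothetical function, the key is assumed to be name plus age, and the third sample document (age: null) is made up to show the edge case.

```javascript
// Simplified model: a partial index with { age: { $exists: true } } only
// holds entries for documents that have an `age` field (even if it is null).
function buildPartialIndex(docs) {
  return docs
    .filter((doc) => Object.prototype.hasOwnProperty.call(doc, "age"))
    .map((doc) => ({ key: [doc.name, doc.age], _id: doc._id }));
}

// The first two documents mirror the question; the third is hypothetical.
const contacts = [
  { _id: 1, name: "Anurag", age: "21" },
  { _id: 2, name: "Nitin" },
  { _id: 3, name: "John", age: null },
];

const partialIndex = buildPartialIndex(contacts);
console.log(partialIndex);
```

In this model Nitin has no age and gets no entry, while the document with age: null is still indexed because $exists: true only checks that the field is there.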

In MongoDB how to decide for a collection which fields to be indexed for a costly query

I have a collection with 1000+ records and I need to run the query below. I have come across the issue that this query takes more than a minute even if the departmentIds array has length something like 15-20. I think if I use an index the query time will be reduced.
From what I observe the 99% of the time spent on the query is due to the $in part.
How do I decide which fields to index? Should I index only department.department_id, since that is what takes most of the time, or should I create a compound index on userId, something and department.department_id (basically all the fields I'm using in the query here)?
Here is what my query looks like
let departmentIds = [.......................... can be large]
let query = {
  userId: someid,
  something: something,
  'department.department_id': {
    $in: departmentIds
  }
};
//db query
let result = db
  .collection(TABLE_NAME)
  .find(query)
  .project({
    anotherfield: 1,
    department: 1
  })
  .toArray();
You need to check all search cases and create indexes for those that are used often and are most critical for your application. For the particular case above, these seem to be the index options:
1. { userId: 1 }
2. { userId: 1, something: 1 }
3. { userId: 1, something: 1, "department.department_id": 1 }
I would bet on option 1, since userId sounds like a unique key with high selectivity, which makes it very suitable for an index. Of course, it is best to do some testing and identify the fastest option; the explain option can help a lot with that:
db.collection.find().explain("executionStats")
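When comparing candidate indexes, it can help to reduce each explain("executionStats") result to a handful of numbers. The sketch below assumes the standard executionStats fields (nReturned, totalKeysExamined, totalDocsExamined, executionTimeMillis); summarizeExplain is a hypothetical helper and the sample figures are made up, not measurements from the question.

```javascript
// Pull the numbers worth comparing out of an explain("executionStats") result.
function summarizeExplain(explainResult) {
  const stats = explainResult.executionStats;
  return {
    nReturned: stats.nReturned,
    totalKeysExamined: stats.totalKeysExamined,
    totalDocsExamined: stats.totalDocsExamined,
    executionTimeMillis: stats.executionTimeMillis,
    // Close to 1 means few documents were examined beyond those returned.
    docsExaminedPerReturned:
      stats.nReturned === 0 ? null : stats.totalDocsExamined / stats.nReturned,
  };
}

// Made-up figures for illustration only.
const sampleExplain = {
  executionStats: {
    nReturned: 20,
    totalKeysExamined: 25,
    totalDocsExamined: 20,
    executionTimeMillis: 3,
  },
};

const summary = summarizeExplain(sampleExplain);
console.log(summary);
```

Running the same query with each candidate index in place and comparing these summaries is one way to do the testing suggested above.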

Mongo .find() returning duplicate documents (with same _id) (!)

Mongo appears to be returning duplicate documents for the same query, i.e. it returns more documents than there are unique _ids in the returned documents:
lobby-brain> count_iterated = 0; ids = {}
{}
lobby-brain> db.the_collection.find({
  'a_boolean_key': true
}).forEach((el) => {
  count_iterated += 1;
  ids[el._id] = (ids[el._id]||0) + 1;
})
lobby-brain> count_iterated
278
lobby-brain> Object.keys(ids).length
251
That is, the number of unique _id returned is 251 -- but there were 278 documents returned by the cursor.
Investigating further:
lobby-brain> ids
{
'60cb8cb92c909a974a96a430': 1,
'61114dea1a13c86146729f21': 1,
'6111513a1a13c861467d3dcf': 1,
...
'61114c491a13c861466d39cf': 2,
'61114bcc1a13c861466b9f8e': 2,
...
}
lobby-brain> db.the_collection.find({
  _id: ObjectId("61114c491a13c861466d39cf")
}).forEach((el) => print("foo"));
foo
>
That is, there aren't actually duplicate documents with the same _id -- it's just an issue with the .find().
I tried restarting the database, and rebuilding an index involving 'a_boolean_key', with the same results.
I've never seen this before and this seems impossible... what is causing this and how can I fix it?
Version info:
Using MongoDB: 5.0.5
Using Mongosh: 1.0.4
It is a stand-alone database, no replica set or sharding or anything like that.
Further Info
One thing to note is, there is a compound index with a_boolean_key as the first index, and a datetime field as the second. The boolean key is rarely updated on the database (~once/day), but the datetime field is frequently updated.
Maybe these updates are causing the duplicate return values?
Update Feb 15, 2022: I added a Mongo JIRA task here.
Try checking whether you have an index on the a_boolean_key field.
"When performing a count, MongoDB can return the count using only the index."
So maybe the index does not contain entries for all documents, and the result of the count method is therefore not equal to your manual count.
According to Louis Williams over at Mongo JIRA, this is not a bug but expected behavior.
Learn something new every day!
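The thread does not spell out the JIRA explanation. If the repeats come from documents changing position in the index while the cursor is open (the hypothesis raised in the question), one client-side workaround is to drop repeated _id values while iterating. A minimal sketch, with a hypothetical uniqueById helper and a plain array standing in for the cursor:

```javascript
// Yield each document at most once, keyed by the string form of its _id.
function* uniqueById(docs) {
  const seen = new Set();
  for (const doc of docs) {
    const key = String(doc._id);
    if (seen.has(key)) continue; // already yielded this document
    seen.add(key);
    yield doc;
  }
}

// Hypothetical cursor output in which one document shows up twice.
const returned = [
  { _id: "61114c491a13c861466d39cf", a_boolean_key: true },
  { _id: "60cb8cb92c909a974a96a430", a_boolean_key: true },
  { _id: "61114c491a13c861466d39cf", a_boolean_key: true },
];

const unique = [...uniqueById(returned)];
console.log(unique.length);
```

The Set grows with the number of distinct documents, so this suits result sets of a few hundred documents like the one in the question better than very large scans.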

Mongo - How can a narrower query be slower than a generic one?

This is executed immediately:
db.mycollection.find({ strField: 'AAA'}).count()
And this takes a long time to finish:
db.mycollection.find({ strField: 'AAA', dateTimeField: { $exists: true }}).count()
This is how I created my index:
db.mycollection.createIndex({strField: 1, dateTimeField: 1}, { sparse: true })
But it doesn't help, even when using hint(indexName).
Why does this happen, and how can I fix it?
The { $exists: true } query predicate is problematic, especially if there are documents in the collection for which that field does not exist.
When MongoDB creates an index entry for a document, it collects all of the field values according to the index spec, and concatenates them.
If a field is not present in the document, the index stores null in that field's position.
If the field is explicitly set to null, it also stores null in that field's position.
This means that these 2 documents will have identical entries in the index:
{ strField: 'AAA', dateTimeField: null}
{ strField: 'AAA'}
Note that even with the index being sparse, both documents will be indexed, since at least one of the indexed fields exists in each document.
When testing {dateTimeField:{$exists:true}}, the first document will match, while the second will not.
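The point about identical index entries can be sketched in plain JavaScript. This is a simplified model of key generation as described above, not MongoDB's actual key encoding; indexKey and fieldExists are hypothetical helpers.

```javascript
// Simplified model: a missing field and an explicit null both become null
// in the generated key, so the index alone cannot tell them apart.
function indexKey(doc, fields) {
  return fields.map((f) => (doc[f] === undefined ? null : doc[f]));
}

// What { $exists: true } has to decide, which needs the document itself.
function fieldExists(doc, field) {
  return Object.prototype.hasOwnProperty.call(doc, field);
}

const spec = ["strField", "dateTimeField"];
const docWithNull = { strField: "AAA", dateTimeField: null };
const docWithoutField = { strField: "AAA" };

const sameKey =
  JSON.stringify(indexKey(docWithNull, spec)) ===
  JSON.stringify(indexKey(docWithoutField, spec));

console.log("identical index keys:", sameKey);
console.log("exists on first doc:", fieldExists(docWithNull, "dateTimeField"));
console.log("exists on second doc:", fieldExists(docWithoutField, "dateTimeField"));
```

Both documents produce the same key, yet only the first satisfies the existence test, which is why the documents have to be fetched to answer the query.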
When processing a count query using an index, if the query can be satisfied by scanning a single range of the index, the query executor can use a count_scan stage, and get the correct result without loading a single document from disk.
Because the executor cannot definitively tell from the index whether or not the field exists, it cannot use a count_scan, and must instead use an ordinary ixscan followed by a fetch stage, and load all of the matching documents from disk in order to arrive at the correct count.
In the case of the first query, the executor would have been able to use a count_scan, while the second would have had to examine all of the documents. You should be able to see this by running explain with the executionStats option on each query.
One way to avoid this pitfall is to take advantage of the fact that MongoDB query operators are type-sensitive. This means that the following query will match any document where dateTimeField holds a date at or after the Unix epoch:
db.mycollection.find({ strField: 'AAA', dateTimeField: { $gte: new ISODate("1970-01-01T00:00:00Z") }}).count()
This will allow the query executor to count all of the documents that have the matching string and contain a date, but will exclude documents that contain a dateTimeField with a numeric or string value.
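The type-sensitive matching described above can be illustrated with a small filter. This is a simplified model of the behavior the answer describes, not MongoDB's comparison code; matchesDateGte and the sample documents are hypothetical.

```javascript
// Simplified model: a range predicate with a Date bound only matches
// values that are themselves dates.
function matchesDateGte(value, bound) {
  return value instanceof Date && value.getTime() >= bound.getTime();
}

const epoch = new Date("1970-01-01T00:00:00Z");

// Made-up documents: a real date, null, missing, a number and a string.
const sample = [
  { _id: 1, strField: "AAA", dateTimeField: new Date("2022-02-15T00:00:00Z") },
  { _id: 2, strField: "AAA", dateTimeField: null },
  { _id: 3, strField: "AAA" },
  { _id: 4, strField: "AAA", dateTimeField: 1644883200 },
  { _id: 5, strField: "AAA", dateTimeField: "2022-02-15" },
];

const matchedIds = sample
  .filter((d) => matchesDateGte(d.dateTimeField, epoch))
  .map((d) => d._id);

console.log(matchedIds);
```

Only the document holding an actual date passes, so null and missing values drop out without an $exists check.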

MongoDB MongoEngine index declaration

I have a Document:
class Store(Document):
    store_id = IntField(required=True)
    items = ListField(ReferenceField(Item, required=True))
    meta = {
        'indexes': [
            {
                'fields': ['campaign_id'],
                'unique': True
            },
            {
                'fields': ['items']
            }
        ]
    }
I want to set up indexes on items and store_id. Is my configuration right?
Your second index declaration looks like it should do what you want. But to make sure that the index is really effective, you should use explain. Connect to your database with the mongo shell and perform a find-query which should use that index followed by .explain(). Example:
db.yourCollection.find({items:"someItem"}).explain();
The output will be a document with lots of fields. The documentation explains what exactly each field means. Pay special attention to these fields:
millis: time in milliseconds the query required
indexOnly: whether the query could be answered from the index alone
n: number of returned documents
nscannedObjects: the number of documents that had to be examined. For a query fully served by an index this should be no higher than n. When it is higher, some documents could not be excluded by the index and had to be scanned individually.