Consider a MongoDB document like the following:
{
"_id": ObjectId("59f05fd6aa6e2bdef4275ebb"),
"property": "example value",
"updatedAt": ISODate("2018-07-14T01:00:00+01:00"),
"createdAt": ISODate("2018-07-15T01:00:00+01:00"),
"deletedAt": ISODate("2018-07-16T01:00:00+01:00")
}
My questions are about MongoDB best practices, especially minimising response time and indexing costs.
Since the creation timestamp is encoded in _id and is queryable with comparison operators ($gt(e), $lt(e), ...; see this post), which is more performant: querying the _id property directly (and keeping _id as the collection's only index), or adding an ISODate field createdAt and indexing that field as well?
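For reference, filtering by creation time through _id would look something like this in the mongo shell (the cutoff date is just an example):
var since = new Date("2018-07-14T00:00:00Z");
// Build an ObjectId whose embedded timestamp matches the cutoff; the remaining 16 hex digits are zeroed.
var sinceId = ObjectId(Math.floor(since.getTime() / 1000).toString(16) + "0000000000000000");
db.myCollection.find({ _id: { $gte: sinceId } })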
Finally, how should I choose between indexing a single property (deletedAt) and a compound index (_id, deletedAt), knowing that the most frequent query will look like this:
db.myCollection.find({
deletedAt: { $exists: false },
_id: { $gt: ObjectId('somepastobjectid') }
})
Note that _id is indexed by default, and I only need to check for the existence of deletedAt (its value will be used very occasionally).
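For clarity, the two indexing options I am comparing would be created like this:
// Option 1: a single-field index on deletedAt (the default _id index handles the _id range).
db.myCollection.createIndex({ deletedAt: 1 })
// Option 2: a compound index on both fields used by the frequent query.
db.myCollection.createIndex({ _id: 1, deletedAt: 1 })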
Thank you for your answers
I have a lot of documents looking like this:
[{
"title": "Luxe [daagse] [verzorging] # Egypte! Incl. vluchten, transfers & 4* ho",
"price": 433,
"automatic": false,
"destination": "5d26fc92f72acc7a0b19f2c4",
"date": "2020-01-19T00:00:00.000+00:00",
"days": 8,
"arrival_airport": "5d1f5b407ec7385fa2963623",
"departure_airport": "5d1f5adb7ec7385fa2963307",
"board_type": "5d08e1dfff6c4f13f6db1e6c"
},
{
"title": "Luxe [daagse] [verzorging] # Egypte! Incl. vluchten, transfers & 4* ho",
"automatic": true,
"destination": "5d26fc92f72acc7a0b19f2c4",
"prices": [{
"price": 433,
"date_from": "2020-01-19T00:00:00.000+00:00",
"date_to": "2020-01-28T00:00:00.000+00:00",
"day_count": 8,
"arrival_airport": "5d1f5b407ec7385fa2963623",
"departure_airport": "5d1f5adb7ec7385fa2963307",
"board_type": "5d08e1dfff6c4f13f6db1e6c"
},
{
"price": 899,
"date_from": "2020-04-19T00:00:00.000+00:00",
"date_to": "2020-04-28T00:00:00.000+00:00",
"day_count": 19,
"arrival_airport": "5d1f5b407ec7385fa2963623",
"departure_airport": "5d1f5adb7ec7385fa2963307",
"board_type": "5d08e1dfff6c4f13f6db1e6c"
}
]
}
]
As you can see, automatic deals have multiple prices (potentially a lot, between 1000 and 4000) and do not have the original top-level fields available.
Now I need to search both the original document and the subdocuments to look for a match.
This is the aggregation I use to search through the documents:
[{
"$match": {
"destination": {
"$in": ["5d26fc9af72acc7a0b19f313"]
}
}
}, {
"$match": {
"$or": [{
"prices": {
"$elemMatch": {
"price": {
"$lte": 1500,
"$gte": 400
},
"date_to": {
"$lte": "2020-04-30T22:00:00.000Z"
},
"date_from": {
"$gte": "2020-03-31T22:00:00.000Z"
},
"board_type": {
"$in": ["5d08e1bfff6c4f13f6db1e68"]
}
}
}
}, {
"price": {
"$lte": 1500,
"$gte": 400
},
"date": {
"$lte": "2020-04-30T22:00:00.000Z",
"$gte": "2020-03-31T22:00:00.000Z"
},
"board_type": {
"$in": ["5d08e1bfff6c4f13f6db1e68"]
}
}]
}
}, {
"$limit": 20
}]
I would like to speed things up, because it can be quite slow. What is the best index strategy for this aggregation, and which fields should I use? Is this the best way of doing it, or is there a better way?
From Mongo's $or docs:
When evaluating the clauses in the $or expression, MongoDB either performs a collection scan or, if all the clauses are supported by indexes, MongoDB performs index scans. That is, for MongoDB to use indexes to evaluate an $or expression, all the clauses in the $or expression must be supported by indexes. Otherwise, MongoDB will perform a collection scan.
So with that in mind, in order to avoid a collection scan in this pipeline, you have to create a compound index covering both the price and prices fields.
Remember that order matters in compound indexes, so the order of the fields should depend on how you expect to use them.
It seems to me that the index you want to create looks something like:
{destination: 1, date: 1, board_type: 1, price: 1, prices: 1}
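In mongo shell form that would be created roughly like this (db.collection is a placeholder for your collection):
db.collection.createIndex({ destination: 1, date: 1, board_type: 1, price: 1, prices: 1 })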
A compound index including the match filter fields is required to make the aggregation run fast. In aggregation queries, placing the $match stage early in the pipeline (preferably as the first stage) lets it use indexes, if any are defined on the filter fields. That is the case in the posted query, so defining the indexes is all that is needed for a fast query. But an index on which fields?
The index is going to be a compound index, i.e., an index on multiple fields of the query criteria. The index prefix starts with the destination field; the remaining index fields are to be determined. What are they?
Most of the candidate fields are sub-document fields of the prices array - price, date_from, date_to and board_type. There is also the date field from the main document. Which of these fields need to be used in the compound index?
Defining indexes on array elements (or on fields of sub-documents in an array) creates lots of index keys. This means lots of storage and, when the index is used, lots of memory (RAM); this is an important consideration. Indexes on array elements are called multikey indexes. For an index to be properly utilized, the collection's documents and the index used by the query (together called the working set) must fit into RAM.
Another aspect to consider is query selectivity: how many documents get selected by a filter on an indexed field. To be effective, the filter field must select a small subset of the input documents. See Create Queries that Ensure Selectivity.
Based on these two factors it is difficult to determine exactly which other fields need to be included (certainly some of the prices fields). So, the index is going to be something like this:
{ destination: 1, fld1: 1, fld2: 1, ... }
The fld1, fld2, ... are going to be fields from the prices array's sub-documents and/or the date field. I think only one set of date fields can be used with the index. An example index could be one of these:
{ destination: 1, date: 1, "prices.price": 1, "prices.board_type": 1}
{ destination: 1, "prices.price": 1, "prices.date_from": 1, "prices.date_to": 1, "prices.board_type": 1}
Note that the order of the index keys, and whether price, date_from, date_to and board_type are actually needed, must be determined based upon the two main factors mentioned above - the working set requirement and the query selectivity - this is important.
NOTES: A small sample data set with a similar structure showed usage of the compound index with destination as the prefix field and two fields from prices (one with an equality condition and one with a range condition). The query plan from explain showed an IXSCAN (index scan) on the compound index, and using an index will certainly improve the query performance.
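As a sketch (db.collection is a placeholder), index usage can be checked by running the posted pipeline through explain:
db.collection.createIndex({ destination: 1, date: 1, "prices.price": 1, "prices.board_type": 1 })
// Run the posted pipeline through explain; only the first $match stage is shown here for brevity.
db.collection.explain().aggregate([
    { $match: { destination: { $in: ["5d26fc9af72acc7a0b19f313"] } } }
    // ... the $or $match and $limit stages from the question go here ...
])
// The queryPlanner output should show an IXSCAN on the compound index rather than a COLLSCAN.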
For example, I want to update the half of the data whose _id is an odd number, such as:
db.col.updateMany({"$where": "this._id is an odd number"})
However, _id is not an integer but Mongo's ObjectId, which can be regarded as a hexadecimal "string", so it is not supported to write it like this:
db.col.updateMany(
{"$where": "this._id % 2 = 1"},
{"$set": {"a": 1}}
)
So, what is the correct way to write this?
And what about applying a modulus to _id?
This operation can also be done using two database calls:
1. Get the list of _id values from the collection.
2. Push only the odd _id values into an array.
3. Update the collection.
Updating the collection:
db.collection.update(
{ _id: { $in: ['id1', 'id2', 'id3'] } }, // Array with odd _id
{ $set: { urkey : urValue } },
{ multi: true } // update every matched document, not just the first
)
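Steps 1 and 2 could look something like the following sketch in the legacy mongo shell, assuming "odd" is judged by the last hexadecimal digit of the ObjectId:
var oddIds = [];
db.collection.find({}, { _id: 1 }).forEach(function (doc) {
    // .str is the 24-character hex string of the ObjectId in the legacy shell
    var lastDigit = parseInt(doc._id.str.slice(-1), 16);
    if (lastDigit % 2 === 1) {
        oddIds.push(doc._id);
    }
});
The resulting oddIds array can then be passed as the $in list in the update above.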
I am trying to store key-value data in MongoDB.
The key could be any string and I don't know anything more about it before storing; the value could be of any type (int, string, array). I would like to have an index on such a key and value.
I was looking at a multikey index over an array of my key-value pairs, but it looks like it can't cover queries over array fields.
Is it possible to have an index on a custom key and value in MongoDB and run queries with operators such as $exists, $eq, $gte, $lte, $and, $or and $in without a COLLSCAN, going through an IXSCAN stage instead?
Or do I maybe need another database for that?
I may have misunderstood your question but I think that this is precisely where MongoDB's strengths are - dealing with different shapes of documents and data types.
So let's say you have the following two documents:
db.test.insertMany([
{
key: "test",
value: [ "some array", 1 ]
},
{
key: 12.7,
values: "foo"
}
])
and you create a compound index like this:
db.test.createIndex({
"key": 1,
"value": 1
})
then the following query will use that index:
db.test.find({ "key": "test", "value": 1 })
and also more complicated queries will do the same:
db.test.find({ "key": { $exists: true }, "value": { gt: 0 } })
You can verify this by adding a .explain() to the end of the above queries.
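For example, using the first query above:
db.test.find({ "key": "test", "value": 1 }).explain("executionStats")
// The winning plan should show an IXSCAN on { key: 1, value: 1 } rather than a COLLSCAN.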
UPDATE based on your comment:
You don't need the aggregation framework for that. You can simply do something like this:
db.test.distinct("user_id", { "key": { $exists: true } })
This query is going to use the above index. Moreover it can be made even faster by changing the index definition to include the "user_id" field like this:
db.test.createIndex({
"key" : 1.0,
"value" : 1.0,
"user_id" : 1
})
This, again, can be verified by running the following query:
db.test.explain().distinct("user_id", { "key": { $exists: true } })
If your key can be any arbitrary value, then this is impossible. Your best bet is to create an index on some other known field to limit the initial results so that the inevitable collection scan's impact is reduced to a minimum.
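A rough sketch of that approach (the user_id field and the values used are only illustrative):
// Index a known field so it can narrow down the candidate documents first.
db.test.createIndex({ user_id: 1 })
// The arbitrary key/value filter is then applied only to the documents selected
// through the user_id index, instead of to the whole collection.
db.test.find({ user_id: "u123", someArbitraryKey: { $gte: 5 } })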
Context: I have a MongoDB collection populated with a large number of emails. I'd like to search for all emails that include a given email address within any of the following fields: To, From, CC and BCC. The result needs to be sorted by the Date field. We're currently trying the following query:
db.collection.find({ $text : {$search: "\"email#domain.com\""}}).sort({Date:1})
I've tried doing a compound index including the date but it does not work.
With this index...
db.collection.createIndex({Date: 1, From:"text", To:"text", CC:"text", BCC:"text"})
it gives error 17007, because Date is the index prefix and would need an equality match. This is not an option, as we'd like all emails regardless of the date.
Also with this other index...
db.collection.createIndex({From:"text", To:"text", CC:"text", BCC:"text", Date:1})
Then it gives error 17144, as it goes over the internal memory limit for the sort.
We've read the following:
Stackoverflow ref
Stackoverflow ref
mongoDB doc on compound index
From these references and others I get the idea that this is not possible, but I don't think what we're trying to do is atypical or so far out of the ordinary.
Are we doing something wrong? Is there a way to do this query with compound index or any other MongoDB feature?
thanks!
Regardless of other compound index keys, you need to include the $meta for the "textScore" in order to get the correct sorting:
db.collection.find(
{ "$text": { "$search": "\"email#domain.com\""}},
{ "score": { "$meta": "textScore" } }
).sort({
"score": { "$meta": "textScore" }, "Date": 1
})
So naturally you want to sort by that "score" first, and then by "Date", in order for results to be correctly ranked by search relevance.
The order of the index keys does not matter, but of course you can only have "one" text index per collection. So make sure you drop all others before creating it:
db.collection.createIndex({
"From": "text",
"To": "text",
"CC":"text",
"BCC": "text",
"Date":1
})
Look at the currently defined indexes with:
db.collection.getIndexes()
Or just drop everything and start fresh:
db.collection.dropIndexes()
For the data you appear to be searching on though, I would have thought a regular compound index on each field should suit you better. Looking for "email" addresses should be an "exact match", and if you expect multiple items for each field then they should be arrays of strings, like so:
{
"TO": ["bill#example.com"],
"FROM": ["ted#example.com"],
"CC": ["marty#example.com","sarah#example.com"],
"BCC": [],
"Date": ISODate("2015-07-27T13:42:05.535Z")
}
Then you need separate indexes on each field, possibly in compound with "Date", like so:
db.email.createIndex({ "TO": 1, "Date": 1 })
db.email.createIndex({ "FROM": 1, "Date": 1 })
db.email.createIndex({ "CC": 1, "Date": 1 })
db.email.createIndex({ "BCC": 1, "Date": 1 })
And query with an $or condition:
db.email.find({
"$or": [
{ "TO": "sarah#example.com" },
{ "FROM": "sarah#example.com" },
{ "CC": "sarah#example.com" },
{ "BCC": "sarah#example.com" }
],
"Date": { "$lt": new Date() }
})
If you look at the .explain(true) (verbose) output from that, you should see that the winning plan is an "index intersection" of all the specified indexes. This works out to be very efficient, as every field (and the index selected for it) has an exact match value, plus a range match on the indexed date.
That's going to be a lot better for you than the "fuzzy matching" of text searches. Even regular expressions should work better here in general (for e-mail addresses), especially if they are "anchored" with ^ to the start of the string.
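For example, a case-sensitive regular expression anchored with ^ on an indexed field can still use the index bounds (a sketch reusing the addresses above):
db.email.find({ "FROM": /^ted#example\.com$/, "Date": { "$lt": new Date() } })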
Text indexes are meant for matching "word-like tokens", but that is not what your data looks like. The $or may not look as nice, but it should do a much better job.
I have read this doc, which states that an index can optimize update operations. So I am adding an index to my collection to optimize the update operation I am using.
Records in the collection have an object as _id, and a timestamp:
{_id: {userId: "sample"}, firstTimestamp: 123, otherField: "abc"}
What I want to do is run an update using the query below:
db.userFirstTimestamp.update(
{_id: {userId: "sample"}, firstTimestamp: {$gt: 100}},
{_id: {userId: "sample"}, firstTimestamp: 100, otherField2: "efg"})
I want to store the 'first document' based on 'firstTimestamp'. The fields of the old and new documents can differ, hence it cannot be a $set query; it should rewrite the document instead. In the sample, "otherField" should no longer exist after the update; it should be "otherField2" instead.
Based on my understanding of the MongoDB docs and this article, I created an index as below:
db.sample.createIndex({_id:1, timestamp:1})
Then I try to benchmark the query on an isolated experimental node using MongoDB 3.0.4 with spec below:
MongoDB 3.0.4
Machine is empty, no other operation, only mongo
RAM ~30GB
Disk is RAID 0 stripped
Collection has 60 million records
Average object size 1001 bytes
Index size 5.34 gig
When I check the log, many update queries take more than 100 ms, and when I run mongotop, the top entry is the write query, which takes ~1000 ms. That is quite slow for a single query.
When I run mongostat, throughput is only 400-500 queries per second.
I then explain the query using an equivalent find query (since update does not support explain):
When I am not using a projection, it uses the default {_id: 1} index.
When I project only _id and timestamp, it uses the {_id: 1, timestamp: 1} index.
My questions are:
Does the index I created help this update query?
If not, what should the index look like?
Is there any other way to optimize this update query?
Somewhat. But not optimally.
It should really be this - an index on the "element" of the object in the _id key:
db.sample.createIndex({ "_id.userId": 1, "timestamp": 1 })
Use the $set operator and stop overwriting your documents:
db.sample.update(
{
"_id.userId": "sample",
"firstTimestamp": { "$gt": 100 }
},
{
"$set": { "otherfield": "cfg" }
}
)
But really your data "should" look like this:
{
"_id": "sample",
"firstTimestamp": 200,
"otherfield2": "sam"
}
And update like:
db.sample.update(
{
"_id.userId": "sample",
"firstTimestamp": { "$gt": 100 }
},
{
"$set": {
"fistTimetamp": 100,
"otherfield2": "efg"
}
}
)
Or, if you insist that fields other than "_id" and "firstTimestamp" are going to change a lot, then do this instead:
{
"_id": "sample",
"firstTimestamp": 200,
"data": {
"otherfield2": "sam"
}
}
Then, if you just want to replace the data, do:
db.sample.update(
{
"_id.userId": "sample",
"firstTimestamp": { "$gt": 100 }
},
{
"$set": {
"fistTimetamp": 100,
"data": {
"overwritingField": "efg"
}
}
}
)
Since "data" can be replaced as an entire object if you wish, or just update a single key:
db.sample.update(
{
"_id.userId": "sample",
"firstTimestamp": { "$gt": 100 }
},
{
"$set": {
"fistTimetamp": 100,
"data.newfield": "efg"
}
}
)
In all cases, try to use the update operators rather than replacing the whole object, as a full replacement typically means more traffic and more load on the server.
But overall, what makes sense here is that the "userId" part "should" be the portion of the index that narrows down the results the most. So it definitely goes before the timestamp, which should have many more possible values.
Compound primary keys are fine, but make sure you actually use them. A singular value would not make any sense and could just be assigned to _id. If you can only ever query on one field of the key, as you are doing here, then you probably don't need a compound object as the primary key.
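For instance, an exact match on an embedded-document _id must supply the whole object, with the same fields in the same order (a sketch using the collection from the question):
// Matches the document stored with _id: { userId: "sample" } and uses the default _id index.
db.userFirstTimestamp.find({ _id: { userId: "sample" } })
// A different shape or field order in _id is a different value and will not match.
db.userFirstTimestamp.find({ _id: { userId: "sample", extra: 1 } }) // no match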
Your _id in the update suggests that you are getting exact matches on it, and therefore it is not a compound field with other keys. That being the case, it should just be a plain value in _id itself.
Also a "range" is okay, but again consider that you are trying to match a single document ( well you don't mention "multi" anywhere ), so again questin why is it needed and either then go for an exact match or at "least" an upper limit.
The $set will "only" update the fields that you specifiy. I think you made a mistake in typing your question though, as the syntax for the "update" portion would not be valid. But use update operators anyway, as they send less traffic by sending a single field, or just the fields you intend to update.