MongoDB match on document and subdocuments, what to use as indexes? - mongodb

I have a lot of documents looking like this:
[{
"title": "Luxe [daagse] [verzorging] # Egypte! Incl. vluchten, transfers & 4* ho",
"price": 433,
"automatic": false,
"destination": "5d26fc92f72acc7a0b19f2c4",
"date": "2020-01-19T00:00:00.000+00:00",
"days": 8,
"arrival_airport": "5d1f5b407ec7385fa2963623",
"departure_airport": "5d1f5adb7ec7385fa2963307",
"board_type": "5d08e1dfff6c4f13f6db1e6c"
},
{
"title": "Luxe [daagse] [verzorging] # Egypte! Incl. vluchten, transfers & 4* ho",
"automatic": true,
"destination": "5d26fc92f72acc7a0b19f2c4",
"prices": [{
"price": 433,
"date_from": "2020-01-19T00:00:00.000+00:00",
"date_to": "2020-01-28T00:00:00.000+00:00",
"day_count": 8,
"arrival_airport": "5d1f5b407ec7385fa2963623",
"departure_airport": "5d1f5adb7ec7385fa2963307",
"board_type": "5d08e1dfff6c4f13f6db1e6c"
},
{
"price": 899,
"date_from": "2020-04-19T00:00:00.000+00:00",
"date_to": "2020-04-28T00:00:00.000+00:00",
"day_count": 19,
"arrival_airport": "5d1f5b407ec7385fa2963623",
"departure_airport": "5d1f5adb7ec7385fa2963307",
"board_type": "5d08e1dfff6c4f13f6db1e6c"
}
]
}
]
As you can see, automatic deals have multiple prices (can be a lot, between 1000 and 4000) and does not have the original fields available.
Now I need to search in the original document as well in the subdocuments to look for a match.
This is the aggregation I use to search through the documents:
[{
"$match": {
"destination": {
"$in": ["5d26fc9af72acc7a0b19f313"]
}
}
}, {
"$match": {
"$or": [{
"prices": {
"$elemMatch": {
"price": {
"$lte": 1500,
"$gte": 400
},
"date_to": {
"$lte": "2020-04-30T22:00:00.000Z"
},
"date_from": {
"$gte": "2020-03-31T22:00:00.000Z"
},
"board_type": {
"$in": ["5d08e1bfff6c4f13f6db1e68"]
}
}
}
}, {
"price": {
"$lte": 1500,
"$gte": 400
},
"date": {
"$lte": "2020-04-30T22:00:00.000Z",
"$gte": "2020-03-31T22:00:00.000Z"
},
"board_type": {
"$in": ["5d08e1bfff6c4f13f6db1e68"]
}
}]
}
}, {
"$limit": 20
}]
I would like to speed things up, because it can be quite slow. I was wondering, what is the best index strategy for this aggregate, what fields do I use? Is this the best way of doing it or is there a better way?

From Mongo's $or docs:
When evaluating the clauses in the $or expression, MongoDB either performs a collection scan or, if all the clauses are supported by indexes, MongoDB performs index scans. That is, for MongoDB to use indexes to evaluate an $or expression, all the clauses in the $or expression must be supported by indexes. Otherwise, MongoDB will perform a collection scan.
So with that in mind in order to avoid a collection scan in this pipeline you have to create a compound index containing both price and prices fields.
Remember that order matters in compound indexes so the order of the field should vary depending on your possible usage of it.
It seems to me that the index you want to create looks something like:
{destination: 1, date: 1, board_type: 1, price: 1, prices: 1}

A compound index including the match filter fields is required to make the aggregation run fast. In aggregation queries, having the $match stage early in the pipeline (preferably, first stage) utilizes indexes, if any are defined on the filter fields. In the posted query it is so, and defining the indexes is all needed for a fast query. But, index on what fields?
The index is going to be compound index; i.e., index on multiple fields of the query criteria. The index prefix starts with the destination field. The remaining index fields are to be determined. What are the remaining fields?
Most of these fields are in the prices array's sub-document fields - price, date_from, date_to and board_type. There is also the date field from the main document. Which of these fields need to be used in the compound index?
Defining indexes on array elements (or fields of sub-documents in an array) creates lots of index keys. This means lots of storage and for using the index the memory (or RAM). This is an important consideration. Indexes on array elements are called as multikey indexes. For an index to be properly utilized, the collection's documents and the index being used by the query (together called as working set) must fit into the RAM.
Another aspect you need to consider is the query selectivity. How many documents gets selected using a filter which uses an index field, is a factor. It is imperative that the filter field with must select a small set of the input documents to be effective. See Create Queries that Ensure Selectivity.
It is difficult to determine what other fields need to be considered (sure some of the fields of the prices) based on the above two factors. So, the index is going to be something like this:
{ destination: 1, fld1: 1, fld2: 1, ... }
The fld1, fld2, ..., are going to be the prices array sub-document fields and / or the date field. I think only one set of date fields can be used with the index. An example index can be one of these:
{ destination: 1, date: 1, "prices.price": 1, "prices.board_type": 1}
{ destination: 1, "prices.price": 1, "prices.date_from": 1, "prices.date_to": 1, "prices.board_type": 1}
Note the index keys order and the necessity of the price, date_from, date_to and board_type is to be determined based upon the two main factors - requirement of the working set and the query selectivity - this is important.
NOTES: On a small sample data set with similar structure showed usage of the compound index with the primary destination field and two fields from the prices (one with equality condition and one with range condition). The query plan using the explain showed an IXSCAN (index scan) on the compound index, and using an index will sure improve the query performance.

Related

MongoDB: returning documents in order until a condition match

In a MongoDB collection, I have documents with a "position" field for ordering and an optional "date" field, e.g.
[
{
"_id": "doc1",
"position": 1
},
{
"_id": "doc2",
"position": 2,
"date": "2021-05-20T08:00:00.000Z"
},
{
"_id": "doc3",
"position": 3
},
{
"_id": "doc4",
"position": 4,
"date": "2021-05-20T08:00:00.000Z"
}
]
I would like the query this collection to get the documents "before" a specified date, in position order. The algorithm would be:
find the first element whose date is "after" the specified date
return all the documents whose position is less than the position of the element found, sorted by "position"
I have implemented this algorithm naïvely with 2 independent queries. However, I suspect it can be done with a single call to the database, but I have no idea how to proceed. Maybe with an aggregation pipeline?
Can someone give me a clue how this can be done?
EDIT: Here are the current queries I use (roughly):
limit_element = db.getCollection('collection').find({
"date": { "$gte": ISODate("2021-05-20T08:00:00.000Z") }
}).sort({
"position": 1
}).limit(1)
position = limit_element['position']
elements = db.getCollection('collection').find({
"position": { "$lt": position }
}).sort({
"position": 1
})
You can use an aggregation pipeline with two match clauses. Essentially its the same thing as you do now but within one DB access so a bit faster. With aggregation you can acess results from the previus stage to use in the next stage. If that is worth it you have to decide. I think your naive approach is sensible. In any case this a conditional problem so you will have to first find one and then do the other. Difference is just where you do the steps.

Mongo queries to search all the collections of a database (Mongo/PyMongo)

I have been stuck on how to query db which the common data structure of every document looks as:
{
"_id": {
"$oid": "5e0983863bcf0dab51f2872b"
},
"word": "never", // get the `word` value for each of below queries
"wordset_id": "a42b50e85e",
"meanings": [{
"id": "1f1bca9d9f",
"def": "not ever",
"speech_part": "adverb",
"synonyms": ["ne'er"]
}, {
"id": "d35f973ed0",
"def": "not at all",
"speech_part": "adverb"
}]
}
1) query to get all the wordfor speech_part: "adverb" (eg: never,....) //
2)query to get all the word for: word length of 6 and speech_part: "adverb"
I have learnt from SO that ,to search whole collections first i have to retrieve all collections in the database , but how to write a query is where i stuck
db.collection.find({"meanings.speech_part":"adverb"},{"_id":0, "word":1})
To get array of all word of a specific speech_part above is the query.
First part of the query is filter predicate like in your scenario matching speach_part.if your matching column were not inside another object or a object inside a array, you could just write {column_name: "something"}.
as speech_part is inside an object which is inside an array, you have to write {"parentClumn.key":"something"}, in your case {"meanings.speech_part":"adverb"}.
where second part of the query is projection where you define which columns you want in your result. so to get only word column values you do {word:1}, to have more column you do {word:1, etc:1}. While mongodb project _id by default, so to remove _id from result you have to explicitly set {_id:0}
db.collection.find({
"meanings.speech_part":"adverb",
"$expr": { "$gt": [ { "$strLenCP": "$word" }, 6 ] }
},{"_id":0, "word":1})
To get array of all word of a specific speech_part with length greater than 6. This one is a bit complex query. You can look up $expr documentation. In $expr you can run function on your column and match the result. In your case strLenCP is calculating the length of your word column value and then checking, is it greater then 6 by $gt comparison operator
You may try below query to get the matching rows. You will have to try the same with pymongo.
db.getCollection('test-collection').find(
{
'meanings.speech_part': 'adverb'
},
{
_id: 0,
word: 1
}
);
Read about the projections in mongodb here:
https://docs.mongodb.com/manual/tutorial/project-fields-from-query-results

MongoDB: How to do a text search and sort by a date

Context: I have a MongoDB populated with large number of emails. I'd like to do a search for all emails that include a given email address within any of the following fields: To, From, CC and BCC. The result needs to be sorted by the field Date. We're currently trying the following query:
db.collection.find({ $text : {$search: "\"email#domain.com\""}}).sort({Date:1})
I've tried doing a compound index including the date but it does not work.
With this index...
db.collection.createIndex({Date: 1, From:"text", To:"text", CC:"text", BCC:"text"})
it gives error 17007 as Date should have an equality match as it's a prefix. This is not an option as we'd like all emails regardless of the date.
Also with this other index...
db.collection.createIndex({From:"text", To:"text", CC:"text", BCC:"text", Date:1})
Then it gives error 17144 as it goes over the internal limit for the sort.
We've read the following:
Stackoverflow ref
Stackoverflow ref
mongoDB doc on compound index
In these references and others I'm getting the idea that this is not possible but I don't think what we're trying to do is atypical or so much out of the box.
Are we doing something wrong? Is there a way to do this query with compound index or any other MongoDB feature?
thanks!
Regardless of other compound index keys, you need to include the $meta for the "textScore" in order to get the correct sorting:
db.collection.find(
{ "$text": { "$search": "\"email#domain.com\""}},
{ "score": { "$meta": "textScore" } }
).sort({
"score": { "$meta": "textScore" }, "Date": 1
})
So naturally you want that "score" to sort first, and then by "Date" in order for things to be correctly ranked by relevance of the search.
The order of index does not matter, but of course you can ony have "one" text index. So make sure you drop all others before creating:
db.collection.createIndex({
"From": "text",
"To": "text",
"CC":"text",
"BCC": "text",
"Date":1
})
Look for indexes that are current with:
db.collection.getIndicies()
Or just drop everything and start fresh:
db.collection.dropIndexes()
For the data you appear to be searching on though, I would have thought a regular compound index on each field should suit you better. Looking for "email" addresses should be an "exact match", and if you expect multiple items for each field then they should be arrays of strings, like so:
{
"TO": ["bill#example.com"],
"FROM": ["ted#example.com"],
"CC": ["marty#example.com","sarah#example.com"],
"BCC": [],
"Date": ISODate("2015-07-27T13:42:05.535Z")
}
Then you need seperate indexes on each field, possibly in compound with "Date" like so:
db.email.createIndex({ "TO": 1, "Date": 1 })
db.email.createIndex({ "FROM": 1, "Date": 1 })
db.email.createIndex({ "CC": 1, "Date": 1 })
db.email.createIndex({ "BCC": 1, "Date": 1 })
And query with an $or condition:
db.email.find({
"$or": [
{ "TO": "sarah#example.com" },
{ "FROM": "sarah#example.com" },
{ "CC": "sarah#example.com" },
{ "BCC": "sarah#example.com" }
],
"Date": { "$lt": new Date() }
})
If you look at the .explain(true) (verbose) output from that, you should see that the winning plan is an "index intersection" of all the specified indexes. This works out to be very efficient as every field ( and index selected ) has an exact match value, and a range match on the indexed date.
That's going to be a lot better for you than the "fuzzy matching" of text searches. Even regular expressions should work better here in general ( for e-mail addresses ) and especially if they are "anchored" ^ to the start of the string.
Text indexes are meant for "word like tokens" to match, but this should not be your data. The $or does not look at nice, but it should do a much better job.

MongoDB Compound Index to Optimize Update with Key and Range Condition

Have read this doc, it states that index can optimize update operation. Then, I am adding an index to my collection to optimize update operation I am using.
Records in the collection have object as _id, and a timestamp:
{_id: {userId: "sample"}, firstTimestamp: 123, otherField: "abc"}
What I want to do is operate update using query below:
db.userFirstTimestamp.update(
{_id: {userId: "sample"}, firstTimestamp: {$gt: 100}},
{_id: {userId: "sample"}, firstTimestamp: 100, otherField2: "efg"})
I want to store 'first document' based on 'firstTimestamp', field of old and new document can be different, hence it cannot be $set query, it should rewrite document instead. For sample below "otherField" should not be exist, it should be "otherField2" instead.
Based on my understanding on MongoDB doc and this article, I created index as per below
db.sample.createIndex({_id:1, timestamp:1})
Then I try to benchmark the query on an isolated experimental node using MongoDB 3.0.4 with spec below:
MongoDB 3.0.4
Machine is empty, no other operation, only mongo
RAM ~30GB
Disk is RAID 0 stripped
Collection has 60 million record
Average object size 1001 bytes
Index size 5.34 gig
When I check the log, many update query take more than 100ms, and when I do mongotop, top of the query is write query which takes ~1000ms. It is a bit slow since it takes that long to do one query.
When I do mongostat, throughput is only 400-500 query per second.
Then I try to do query explain using find query (since update does not support explain)
When I am not using projection, it is using default index {_id:1}.
When I am using projection for _id and timestamp only, it is using {_id:1, timestamp:1} index.
My question is:
Does index I have created help this update query?
If it is not helping, then how the index should be?
Any other way to optimize this update query?
Somewhat. But not optimally.
Should be this really, so index on the "element" of the object in the _id key:
db.sample.createIndex({ "_id.userId": 1, "timestamp": 1 })
Use the $set operator and stop overwiting your documents:
db.sample.update(
{
"_id.userId": "sample",
"firstTimestamp": { "$gt": 100 }
},
{
"$set": { "otherfield": "cfg" }
}
)
But really your data "should" look like this:
{
"_id": "sample",
"firstTimestamp": 200,
"otherfield2": "sam"
}
And update like:
db.sample.update(
{
"_id.userId": "sample",
"firstTimestamp": { "$gt": 100 }
},
{
"$set": {
"fistTimetamp": 100,
"otherfield2": "efg"
}
}
)
Or if you insist that fields other than "_id" and "firstTimestamp" are going to change a lot, then rather do this:
{
"_id": "sample",
"firstTimestamp": 200,
"data": {
"otherfield2": "sam"
}
}
When if you just want to replace data then do:
db.sample.update(
{
"_id.userId": "sample",
"firstTimestamp": { "$gt": 100 }
},
{
"$set": {
"fistTimetamp": 100,
"data": {
"overwritingField": "efg"
}
}
}
)
Since "data" can be replaced as an entire object if you wish, or just update a single key:
db.sample.update(
{
"_id.userId": "sample",
"firstTimestamp": { "$gt": 100 }
},
{
"$set": {
"fistTimetamp": 100,
"data.newfield": "efg"
}
}
)
In all cases, try to use the operators rather than replacing the whole object as it typically works out as more traffic and more load to the server.
But overall, what makes sense here is that the "userId" part "should" be the portion of the index that narrows down the results the most. So it definately goes before the timestamp, of which there should be a lot more possible values.
Compound primary keys are fine, but make sure you actually use them. A singular value would not make any sense and could just be assigned to _id. If you can just query on one field of they key as you are here, then you probably don't need a compound object as the primary key.
Your _id in the update suggests that you are getting exact matches for the _id therefore it is not a compound field with other keys. With this being the case, it should just a value in the _id itself.
Also a "range" is okay, but again consider that you are trying to match a single document ( well you don't mention "multi" anywhere ), so again questin why is it needed and either then go for an exact match or at "least" an upper limit.
The $set will "only" update the fields that you specifiy. I think you made a mistake in typing your question though, as the syntax for the "update" portion would not be valid. But use update operators anyway, as they send less traffic by sending a single field, or just the fields you intend to update.

mongodb index. how to index a single object on a document, nested in an array

I have the following document:
{
'date': date,
'_id': ObjectId,
'Log': [
{
'lat': float,
'lng': float,
'date': float,
'speed': float,
'heading': float,
'fix': float
}
]
}
for 1 document, the Log array can be some hundred entries.
I need to query the first and last date element of Log on each document. I know how to query it, but I need to do it fast, so I would like to build an index for that. I don't want to index Log.date since it is too big... how can I index them?
In fact it's hard to advise without knowing how you work with the documents. One of the solutions could be to use a sparse index. You just need to add a new field to every first and last array element, let's call it shouldIndex. Then just create a sparse index which includes shouldIndex and date fields. Here's a short example:
Assume we have this document
{"Log":
[{'lat': 1, 'lng': 2, 'date': new Date(), shouldIndex : true},
{'lat': 3, 'lng': 4, 'date': new Date()},
{'lat': 5, 'lng': 6, 'date': new Date()},
{'lat': 7, 'lng': 8, 'date': new Date(), shouldIndex : true}]}
Please note the first element and the last one contain shouldIndex field.
db.testSparseIndex.ensureIndex( { "Log.shouldIndex": 1, "Log.date":1 }, { spar
se: true } )
This index should contain entries only for your first and last elements.
Alternatively you may store first and last elements date field in a seperate array.
For more info on sparse indexes please refer to this article.
Hope it helps!
So there was an answer about indexing that is fundamentally correct. As of writing though it seems a little unclear whether you are talking about indexing at all. It almost seems like what you want to do is get the first and last date from the elements in your array.
With that in mind there are a few approaches:
1. The elements in your array have been naturally inserted in increasing date values
So if the way all writes that are made to this field is done, only with use of the $push operator over a period of time, and you never update these items, at least in so much as changing a date, then your items are already in order.
What this means is you just get the first and last element from the array
db.collection.find({ _id: id },{ Log: {$slice: 1 }}); // gets the first element
db.collection.find({ _id: id },{ Log: {$slice: -1 }}); // gets the last element
Now of course that is two queries but it's a relatively simple operation and not costly.
2. For some reason your elements are not naturally ordered by date
If this is the case, or indeed if you just can't live with the two query form, then you can get the first and last values in aggregation, but using $min and $max modifiers
db.collection.aggregate([
// You might want to match first. Just doing one _id here. (commented)
//{"$match": { "_id": id }},
//Unwind the array
{"$unwind": "$Log" },
//
{"$group": {
"_id": "$_id",
"firstDate": {"$min": "$Log.Date" },
"lastDate": {"$max": "$Log.Date" }
}}
])
So finally, if your use case here is getting the details of the documents that have the first and last date, we can do that as well, mirroring the initial two query form, somewhat. Using $first and $last :
db.collection.aggregate([
// You might want to match first. Just doing one _id here. (commented)
//{"$match": { "_id": id }},
//Unwind the array
{"$unwind": "$Log" },
// Sort the results on the date
{"$sort": { "_id._id": 1, "Log.date": 1 }},
// Group using $first and $last
{"$group": {
"_id": "$_id",
"firstLog": {"$first": "$Log" },
"lastLog": {"$last": "$Log" }
}}
])
Your mileage may vary, but those approaches may obviate the need to index if this indeed would the the only usage for that index.