MongoDB index array issue

I have a question about MongoDB (version 3.4.10) array indexes, as they don't seem to work correctly. Maybe I am doing something wrong.
I have a schedule document that has some props that aren't important for this question. But each schedule has its plan (occurrence) periods in an array:
{ .... "plans": [ { "startDateTime": "2018-01-04T00:00:00Z",
"endDateTime": "2018-01-04T23:59:59Z" }, { "startDateTime":
"2018-01-11T00:00:00Z", "endDateTime": "2018-01-11T23:59:59Z" } ... ]
},
Now I need to search schedule documents by those array items and find all schedules that fit in a given period.
I have created an index on plans.startDateTime and plans.endDateTime.
When I run the following query using Compass's explain plan option, I get really bad results:
{"Plans.StartDateTime": {$lt: new Date ('2018-01-10')}, "Plans.EndDateTime": {$gte: new Date ('2018-01-15')} }
The results are (this is in a test environment where the number of documents is really low; in production the ratio would be even higher):
Documents Returned: 2823
Index Keys Examined: 65708
Documents Examined: 11554
When I dig a little deeper into the analysis I get the following (meaning that Mongo is ignoring the plan end date in the index search):
"indexBounds": { "Plans.StartDateTime": [ "(true, new
Date(1515542400000))" ], "Plans.EndDateTime": [ "[MinKey,
MaxKey]" ] },
Can somebody please tell me how to create better indexes for this search, because this one isn't working?

In order to find all scheduleDocuments having at least one plan overlapping with a given time interval (e.g. 2018-01-10 and 2018-01-14), you have to use the $elemMatch MongoDB operator:
db.scheduleDocuments.find({
  plans: {
    $elemMatch: {
      startDateTime: { $lte: ISODate("2018-01-14Z") },
      endDateTime: { $gt: ISODate("2018-01-10Z") }
    }
  }
});
The rule used to test for overlapping intervals (two intervals overlap when each one starts before the other ends) can be found here.
This search performs a collection scan, unless you create an index on the array.
db.scheduleDocuments.createIndex({
  "plans.startDateTime": 1,
  "plans.endDateTime": 1
});
Thanks to the index, non-matching documents in the collection are not scanned at all. An IXSCAN is performed, and only matching documents are accessed, fetched and returned.
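To verify the index is actually used, the same query can be re-run under explain; this is a minimal sketch, and the winning plan should show an IXSCAN stage:

db.scheduleDocuments.find({
  plans: {
    $elemMatch: {
      startDateTime: { $lte: ISODate("2018-01-14Z") },
      endDateTime: { $gt: ISODate("2018-01-10Z") }
    }
  }
}).explain("executionStats");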

Related

MongoDB match on document and subdocuments, what to use as indexes?

I have a lot of documents looking like this:
[{
    "title": "Luxe [daagse] [verzorging] # Egypte! Incl. vluchten, transfers & 4* ho",
    "price": 433,
    "automatic": false,
    "destination": "5d26fc92f72acc7a0b19f2c4",
    "date": "2020-01-19T00:00:00.000+00:00",
    "days": 8,
    "arrival_airport": "5d1f5b407ec7385fa2963623",
    "departure_airport": "5d1f5adb7ec7385fa2963307",
    "board_type": "5d08e1dfff6c4f13f6db1e6c"
  },
  {
    "title": "Luxe [daagse] [verzorging] # Egypte! Incl. vluchten, transfers & 4* ho",
    "automatic": true,
    "destination": "5d26fc92f72acc7a0b19f2c4",
    "prices": [{
        "price": 433,
        "date_from": "2020-01-19T00:00:00.000+00:00",
        "date_to": "2020-01-28T00:00:00.000+00:00",
        "day_count": 8,
        "arrival_airport": "5d1f5b407ec7385fa2963623",
        "departure_airport": "5d1f5adb7ec7385fa2963307",
        "board_type": "5d08e1dfff6c4f13f6db1e6c"
      },
      {
        "price": 899,
        "date_from": "2020-04-19T00:00:00.000+00:00",
        "date_to": "2020-04-28T00:00:00.000+00:00",
        "day_count": 19,
        "arrival_airport": "5d1f5b407ec7385fa2963623",
        "departure_airport": "5d1f5adb7ec7385fa2963307",
        "board_type": "5d08e1dfff6c4f13f6db1e6c"
      }
    ]
  }
]
As you can see, automatic deals have multiple prices (there can be a lot, between 1000 and 4000) and do not have the original fields available.
Now I need to search in the original document as well as in the subdocuments to look for a match.
This is the aggregation I use to search through the documents:
[{
  "$match": {
    "destination": {
      "$in": ["5d26fc9af72acc7a0b19f313"]
    }
  }
}, {
  "$match": {
    "$or": [{
      "prices": {
        "$elemMatch": {
          "price": { "$lte": 1500, "$gte": 400 },
          "date_to": { "$lte": "2020-04-30T22:00:00.000Z" },
          "date_from": { "$gte": "2020-03-31T22:00:00.000Z" },
          "board_type": { "$in": ["5d08e1bfff6c4f13f6db1e68"] }
        }
      }
    }, {
      "price": { "$lte": 1500, "$gte": 400 },
      "date": {
        "$lte": "2020-04-30T22:00:00.000Z",
        "$gte": "2020-03-31T22:00:00.000Z"
      },
      "board_type": { "$in": ["5d08e1bfff6c4f13f6db1e68"] }
    }]
  }
}, {
  "$limit": 20
}]
I would like to speed things up, because it can be quite slow. I was wondering: what is the best index strategy for this aggregation, and which fields do I use? Is this the best way of doing it, or is there a better way?
From Mongo's $or docs:
When evaluating the clauses in the $or expression, MongoDB either performs a collection scan or, if all the clauses are supported by indexes, MongoDB performs index scans. That is, for MongoDB to use indexes to evaluate an $or expression, all the clauses in the $or expression must be supported by indexes. Otherwise, MongoDB will perform a collection scan.
So, with that in mind, in order to avoid a collection scan in this pipeline you have to create a compound index containing both the price and prices fields.
Remember that order matters in compound indexes, so the order of the fields should depend on how you actually use them.
It seems to me that the index you want to create looks something like:
{destination: 1, date: 1, board_type: 1, price: 1, prices: 1}
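Expressed as a createIndex call, that suggested index would look something like the sketch below (the collection name deals is an assumption; substitute your own):

db.deals.createIndex({ destination: 1, date: 1, board_type: 1, price: 1, prices: 1 })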
A compound index including the match filter fields is required to make the aggregation run fast. In aggregation queries, having the $match stage early in the pipeline (preferably as the first stage) utilizes indexes, if any are defined on the filter fields. In the posted query that is the case, so defining the indexes is all that is needed for a fast query. But an index on which fields?
The index is going to be a compound index, i.e., an index on multiple fields of the query criteria. The index prefix starts with the destination field. The remaining index fields are to be determined. What are the remaining fields?
Most of these fields are the prices array's sub-document fields - price, date_from, date_to and board_type. There is also the date field from the main document. Which of these fields need to be used in the compound index?
Defining indexes on array elements (or on fields of sub-documents in an array) creates lots of index keys. This means lots of storage, and lots of memory (RAM) when the index is used. This is an important consideration. Indexes on array elements are called multikey indexes. For an index to be properly utilized, the collection's documents and the index being used by the query (together called the working set) must fit into RAM.
Another aspect you need to consider is query selectivity: how many documents get selected by a filter that uses an index field. For an index to be effective, the filter field must select a small subset of the input documents. See Create Queries that Ensure Selectivity.
Based on the above two factors, it is difficult to determine which other fields need to be considered (surely some of the prices fields). So the index is going to be something like this:
{ destination: 1, fld1: 1, fld2: 1, ... }
The fld1, fld2, ... are going to be the prices array's sub-document fields and/or the date field. I think only one set of date fields can be used with the index. An example index can be one of these:
{ destination: 1, date: 1, "prices.price": 1, "prices.board_type": 1}
{ destination: 1, "prices.price": 1, "prices.date_from": 1, "prices.date_to": 1, "prices.board_type": 1}
Note that the key order, and whether price, date_from, date_to and board_type are all necessary, is to be determined based upon the two main factors - the working set requirement and the query selectivity - this is important.
NOTE: A trial on a small sample data set with a similar structure showed usage of a compound index with the primary destination field and two fields from prices (one with an equality condition and one with a range condition). The query plan generated using explain showed an IXSCAN (index scan) on the compound index, and using an index will surely improve query performance.
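As a concrete sketch, the first of those candidate indexes could be created like this (the deals collection name is again an assumption):

db.deals.createIndex({ destination: 1, date: 1, "prices.price": 1, "prices.board_type": 1 })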

MongoDb aggregate with limit and without limit

There is a collection in Mongo with 40 million records.
db.getCollection('feedposts').aggregate([
  {
    "$match": {
      "$or": [
        { "isOfficial": true },
        {
          "creator": ObjectId("537f267c984539401ff448d2"),
          type: { $nin: ['challenge_answer', 'challenge_win'] }
        }
      ]
    }
  },
  {
    $sort: { timeline: -1 }
  }
])
This request never ends.
But if you add a limit before sorting, with the limit far higher than the total number of records (for example, 10,000,000,000,000,000), the request is processed instantly:
db.getCollection('feedposts').aggregate([
  {
    "$match": {
      "$or": [
        { "isOfficial": true },
        {
          "creator": ObjectId("537f267c984539401ff448d2"),
          type: { $nin: ['challenge_answer', 'challenge_win'] }
        }
      ]
    }
  },
  {
    $limit: 10000000000000000
  },
  {
    $sort: { timeline: -1 }
  }
])
Please tell me why this is happening?
What problems can I expect in the future if I leave it this way?
TLDR: Mongo is using the wrong index for the query
Why is this happening?
Well, basically for every query you run, Mongo simulates a quick "competition" between the relevant indexes in order to choose which one to use; the first index to retrieve 101 documents "wins".
Now, picking the wrong index usually happens when you have an ascending or descending field with a matching index, which lets that index win the fetching competition under certain conditions. This is very risky, as stable code can suddenly become a huge bottleneck.
What can we do?
You have a few options:
Use the hint option and make Mongo use the compound index you have ready for this pipeline (see the sketch after this list).
Drop the rogue index to ensure this will never happen again elsewhere (which is my recommended option).
Keep doing what you're doing: basically, by adding this random $limit stage you're throwing Mongo's competition off and ensuring the right index will be picked.
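A minimal sketch of the hint option (note that hint on aggregate requires MongoDB 3.6+, and the hinted key pattern below is an assumption - substitute the compound index you actually prepared):

db.getCollection('feedposts').aggregate(
  [
    {
      "$match": {
        "$or": [
          { "isOfficial": true },
          {
            "creator": ObjectId("537f267c984539401ff448d2"),
            type: { $nin: ['challenge_answer', 'challenge_win'] }
          }
        ]
      }
    },
    { $sort: { timeline: -1 } }
  ],
  { hint: { timeline: -1 } } // assumed key pattern; an index name string also works
)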

Why is sorting documents by _id slower with $match than without in MongoDB?

So, I tried to query
db.collection('collection_name').aggregate([
  {
    $match: { owner_id: '5be9b2f03ef77262c2bd49e6' }
  },
  {
    $sort: { _id: -1 }
  }
])
The query above takes 20s.
But if I try this query:
db.collection('collection_name').aggregate([{ $sort: { _id: -1 } }])
it only takes 0.7s.
Why is the query without $match actually faster than the one with $match?
Update:
When I try this query:
db.getCollection('callbackvirtualaccounts').aggregate([
  {
    $match: { owner_id: '5860457640b4fe652bd9c3eb' }
  },
  {
    $sort: { created: -1 }
  }
])
it only takes 0.781s.
Why is sorting by _id slower than by the created field?
Note: I'm using MongoDB v3.0.0.
db.collection('collection_name').aggregate([
  {
    $match: { owner_id: '5be9b2f03ef77262c2bd49e6' }
  },
  {
    $sort: { _id: -1 }
  }
])
This collection probably doesn't have an index on owner_id. Try one of the index creation queries below and rerun your previous code.
db.collection('collection_name').createIndex({ owner_id: 1 }) // Simple index
or
db.collection('collection_name').createIndex({ owner_id: 1, _id: -1 }) // Compound index
Note: If you don't know how to build a compound index yet, you can create simple indexes individually on all keys used in either the match or the sort; that should make the query efficient as well.
Query speed depends upon a lot of factors: the size of the collection, the size of the documents, the indexes defined on the collection (and whether the queries use them properly), the hardware (CPU, RAM, network) and other processes running at the time of the query.
You need to share what indexes are defined on the collection being discussed for further analysis. This command will retrieve them: db.collection.getIndexes()
Note that the unique index on the _id field is created by default, and cannot be modified or deleted.
(i)
But if I try to query: db.collection.aggregate([{ $sort: { _id: -1 } }]) it only takes 0.7s.
The query is faster because there is an index on the _id field and it is used in the sort process. Aggregation queries use indexes in the sort stage when the sort happens early in the pipeline. You can verify whether the index is used by generating a query plan (use explain with executionStats mode); there will be an index scan (IXSCAN) in the generated query plan.
(ii)
db.collection.aggregate([
  {
    $match: { owner_id: '5be9b2f03ef77262c2bd49e6' }
  },
  {
    $sort: { _id: -1 }
  }
])
The query above takes 20s.
When I try this query, it only takes 0.781s:
db.collection.aggregate([
  {
    $match: { owner_id: '5860457640b4fe652bd9c3eb' }
  },
  {
    $sort: { created: -1 }
  }
])
Why is sorting by _id slower than by the created field?
No conclusions can be drawn from the available information. In general, $match and $sort stages present early in the aggregation pipeline can use any indexes created on the fields used in those operations.
Generating a query plan will reveal what the issues are.
Please run explain with executionStats mode and post the query plan details for all the queries in question. There is documentation for MongoDB v3.0.0 on generating query plans using explain: db.collection.explain()
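For an aggregation on MongoDB 3.0, the plan can be obtained with the explain option; a minimal sketch:

db.collection.aggregate(
  [
    { $match: { owner_id: '5be9b2f03ef77262c2bd49e6' } },
    { $sort: { _id: -1 } }
  ],
  { explain: true }
)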

MongoDB Compound Index to Optimize Update with Key and Range Condition

I have read this doc; it states that an index can optimize update operations. So I am adding an index to my collection to optimize the update operation I am using.
Records in the collection have an object as _id, and a timestamp:
{_id: {userId: "sample"}, firstTimestamp: 123, otherField: "abc"}
What I want to do is run an update using the query below:
db.userFirstTimestamp.update(
  { _id: { userId: "sample" }, firstTimestamp: { $gt: 100 } },
  { _id: { userId: "sample" }, firstTimestamp: 100, otherField2: "efg" }
)
I want to store the 'first document' based on 'firstTimestamp'; the fields of the old and new document can differ, hence it cannot be a $set query - it should rewrite the document instead. In the sample, "otherField" should no longer exist; it should be "otherField2" instead.
Based on my understanding of the MongoDB doc and this article, I created an index as per below:
db.sample.createIndex({_id:1, timestamp:1})
Then I try to benchmark the query on an isolated experimental node using MongoDB 3.0.4 with the spec below:
MongoDB 3.0.4
Machine is empty, no other operation, only mongo
RAM ~30GB
Disk is RAID 0 striped
Collection has 60 million records
Average object size 1001 bytes
Index size 5.34 GB
When I check the log, many update queries take more than 100ms, and when I run mongotop, the top entry is a write query which takes ~1000ms. That is a bit slow for a single query.
When I run mongostat, throughput is only 400-500 queries per second.
Then I try to explain the query using an equivalent find query (since update does not support explain):
When I am not using a projection, it uses the default {_id: 1} index.
When I use a projection for _id and timestamp only, it uses the {_id: 1, timestamp: 1} index.
My question is:
Does the index I have created help this update query?
If it is not helping, then what should the index be?
Is there any other way to optimize this update query?
Somewhat. But not optimally.
It should really be this: an index on the "element" of the object in the _id key, together with the firstTimestamp field you actually query:
db.sample.createIndex({ "_id.userId": 1, "firstTimestamp": 1 })
Use the $set operator and stop overwriting your documents:
db.sample.update(
  {
    "_id.userId": "sample",
    "firstTimestamp": { "$gt": 100 }
  },
  {
    "$set": { "otherfield2": "efg" }
  }
)
But really your data "should" look like this:
{
  "_id": "sample",
  "firstTimestamp": 200,
  "otherfield2": "sam"
}
And update like:
db.sample.update(
  {
    "_id": "sample",
    "firstTimestamp": { "$gt": 100 }
  },
  {
    "$set": {
      "firstTimestamp": 100,
      "otherfield2": "efg"
    }
  }
)
Or if you insist that fields other than "_id" and "firstTimestamp" are going to change a lot, then rather do this:
{
  "_id": "sample",
  "firstTimestamp": 200,
  "data": {
    "otherfield2": "sam"
  }
}
Then, if you just want to replace the data, do:
db.sample.update(
  {
    "_id": "sample",
    "firstTimestamp": { "$gt": 100 }
  },
  {
    "$set": {
      "firstTimestamp": 100,
      "data": {
        "overwritingField": "efg"
      }
    }
  }
)
Since "data" can be replaced as an entire object if you wish, or just update a single key:
db.sample.update(
  {
    "_id": "sample",
    "firstTimestamp": { "$gt": 100 }
  },
  {
    "$set": {
      "firstTimestamp": 100,
      "data.newfield": "efg"
    }
  }
)
In all cases, try to use the update operators rather than replacing the whole object, as replacement typically means more traffic and more load on the server.
But overall, what makes sense here is that the "userId" part "should" be the portion of the index that narrows down the results the most. So it definitely goes before the timestamp, of which there should be many more possible values.
Compound primary keys are fine, but make sure you actually use them. A singular value would not make any sense and could just be assigned to _id. If you can query on just one field of the key, as you do here, then you probably don't need a compound object as the primary key.
Your _id in the update suggests that you are getting exact matches on the _id, therefore it is not a compound field with other keys. That being the case, it should just be a value in the _id itself.
Also, a "range" is okay, but consider that you are trying to match a single document (you don't mention "multi" anywhere), so again question why it is needed, and either go for an exact match or at "least" add an upper limit.
The $set will "only" update the fields that you specify. I think you made a typing mistake in your question though, as the syntax of the "update" portion would not be valid. But use the update operators anyway, as they send less traffic by sending a single field, or just the fields you intend to update.
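Since update does not support explain in this MongoDB version, one hedged way to confirm the new index is picked up is to explain the equivalent find with the same filter; the winning plan should show an IXSCAN on the compound index:

db.sample.find(
  { "_id.userId": "sample", "firstTimestamp": { "$gt": 100 } }
).explain("executionStats")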

Moving MongoDB to Multikeys but indexOnly is returning false

What I'm trying to do sounds logical to me however I'm not sure.
I am trying to improve part of a MongoDB collection by using Multikeys.
For example: I have multiple documents with the following format:
Document:
{
  "_id": ObjectId("528a4177dbcfd00000000013"),
  "name": "Shopping",
  "tags": [
    "retail",
    "shop",
    "shopping",
    "store",
    "grocery"
  ]
}
Query:
Up until now, I have been using the following query to match the tags field.
var tags = Array("store", "shopping", "etc");
db.collection.findOne({ 'tags': { $in: tags } }, { name: true });
This has been working well, however I think Multikeys should be used in this instance to improve speed & performance. Please, correct me if I am wrong!
Indexing:
I issued the following command in an attempt to index the tags.
db.collection.ensureIndex( { tags: 1 }, { safe: true }, function(err, doc) {} );
ensureIndex was successful.
Result:
However when using RockMongo's explain feature on the above query, the result is:
{
  "indexOnly": false,
  "indexBounds": {
    "tags": [
      ["etc", "etc"],
      ["shopping", "shopping"],
      ["store", "store"]
    ]
  }
}
Questions:
Why is indexing not working? Is there something else I have to do?
Is Multikey indexing in this case beneficial? (I'm assuming yes.)
Is there another form of indexing that would be more beneficial?
Edit:
I've just noticed that in the RockMongo explain data there is a field:
"isMultiKey": true,
could it be that Multikeys are being used, and I've completely misunderstood - it IS being indexed?
As you say in your edit, the "isMultiKey": true field (coming from the part of the explain output you did not post), along with other information on the cursor, shows that the index is being used. The indexBounds are another indicator.
What indexOnly describes is the fact that your query returns another field, name, which is not part of the index. When the query optimizer sees that all elements of the query can be met using only fields from within the index, this is referred to as a covered query, and the indexOnly property is set to true.
So in an ideal situation your query and results use the information from the index only, and MongoDB does not also have to follow the index entry back to the collection in order to return more data.
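For contrast, here is a minimal sketch of a query that could report indexOnly: true, assuming an extra index on name is created; note that a multikey index like the one on tags generally cannot cover a query involving the array field itself:

db.collection.ensureIndex({ name: 1 });
// Covered: the filter and projection touch only the indexed, non-array field,
// and _id is excluded from the projection.
db.collection.find({ name: "Shopping" }, { _id: 0, name: 1 });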