optimize indexes in MongoDB

I have an Order collection with records looking like this:
{
"_id": ObjectId,
"status": String Enum,
"products": [{
"sku": String UUID,
...
}, ...],
...
},
My goal is to find what products users buy together. Given a SKU, I would like to browse past orders and find, for orders that contain more than 1 product AND of course the product with the looked-up SKU, what other products were bought along with it.
So I created an aggregation pipeline that works:
[
// exclude cancelled orders
{
'$match': {
'status': {
'$nin': [
'CANCELLED', 'CHECK_OUT'
]
}
}
},
// add a fields with product size and just the products sku
{
'$addFields': {
'size': {
'$size': '$products'
},
'skus': '$products.sku'
}
},
// limit to orders with 2 products or more including the looked up SKU
{
'$match': {
'size': {
'$gte': 2
},
'skus': {
'$elemMatch': {
'$eq': '3516215049767'
}
}
}
},
// group by skus
{
'$unwind': {
'path': '$skus'
}
}, {
'$group': {
'_id': '$skus',
'count': {
'$sum': 1
}
}
},
// sort by count, exclude the looked up sku, limit to 4 results
{
'$sort': {
'count': -1
}
}, {
'$match': {
'_id': {
'$ne': '3516215049767'
}
}
}, {
'$limit': 4
}
]
Although this works, this collection contains more than 10K docs and I have an alert on my MongoDB instance telling me that the ratio Scanned Objects / Returned has gone above 1000.
So my question is: how can my query be improved, and what indexes can I add to improve this?
db.Orders.stats();
{
size: 14329835,
count: 10571,
avgObjSize: 1355,
storageSize: 4952064,
freeStorageSize: 307200,
capped: false,
nindexes: 2,
indexBuilds: [],
totalIndexSize: 466944,
totalSize: 5419008,
indexSizes: { _id_: 299008, status_1__created_at_1: 167936 },
scaleFactor: 1,
ok: 1,
operationTime: Timestamp({ t: 1635415716, i: 1 })
}

Let's start with rewriting the query a little bit to make it more efficient.
Currently you're matching all the orders with a certain status, and only after that do you start the data manipulations. This means every single stage is doing work on a larger-than-needed data set.
What we can do is move all the query conditions into the first stage. This is made possible using Mongo's dot notation, like so:
{
'$match': {
'status': {
'$nin': [
'CANCELLED', 'CHECK_OUT',
],
},
'products.sku': '3516215049767', // mongo allows you to do this using the dot notation.
'products.1': { $exists: true }, // this requires the array to have at least two elements.
},
},
Now this achieves two things:
We start the pipeline only with relevant results; there is no need to calculate the $size of the array anymore for the many irrelevant documents. This alone will already boost your performance greatly.
Now we can create a compound index that will support this specific query. Before, we couldn't do that, as index usage is limited to the first stage, and that stage only included the status field. (As an aside, Mongo actually does optimize pipelines, but in this specific case no optimization was possible due to the usage of $addFields.)
The index that I recommend building is:
{ status: 1, "products.sku": 1 }
This will allow the best match to start off your pipeline.
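For reference, here is a minimal sketch of how the index creation and the rewritten pipeline could look end to end (assuming the collection is named Orders, as in the stats output above; the tail of the pipeline is kept from the question, just unwinding products directly instead of the skus helper field and excluding the looked-up SKU before grouping):
// one-off: create the supporting compound index
db.Orders.createIndex({ status: 1, "products.sku": 1 })

db.Orders.aggregate([
  {
    $match: {
      status: { $nin: ['CANCELLED', 'CHECK_OUT'] },
      'products.sku': '3516215049767',
      'products.1': { $exists: true }  // at least two products
    }
  },
  { $unwind: '$products' },
  // drop the looked-up SKU so only the companion products are counted
  { $match: { 'products.sku': { $ne: '3516215049767' } } },
  { $group: { _id: '$products.sku', count: { $sum: 1 } } },
  { $sort: { count: -1 } },
  { $limit: 4 }
])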

Fetch immediate next and previous documents based on conditions in MongoDB

Background
I have the following collection:
article {
title: String,
slug: String,
published_at: Date,
...
}
MongoDB version: 4.4.10
The problem
Given an article, I want to fetch the immediate next and previous articles depending on the published_at field of that article.
Let's say I have an article with published_at as 100. And there are a lot of articles with published_at less than 100 and a lot having published_at more than 100. I want the pipeline/query to fetch only the articles with published_at values of 99 or 101 or the nearest possible.
Attempts
Here's my aggregation pipeline:
const article = await db.article.findOne({ ... });
const nextAndPrev = db.article.aggregate([
{
$match: {
$or: [
{
published_at: { $lt: article.published_at },
published_at: { $gt: article.published_at },
},
],
},
},
{
$project: { slug: 1, title: 1 },
},
{
$limit: 2,
},
]);
It gives the wrong result (two articles after the provided article), which is expected as I know it's incorrect.
Possible solutions
I can do this easily using two separate findOne queries like the following:
const next = await db.article.findOne({ published_at: { $gt: article.published_at } });
const prev = await db.article.findOne({ published_at: { $lt: article.published_at } });
But I was curious to know of any available methods to do it in a single trip to the database.
If I sort all the articles, offset it to the timestamp, and pull out the previous and next entries, that might work. I don't know the syntax.
Starting from MongoDB v5.0,
you can use $setWindowFields to fetch immediate prev/next documents according to certain sorting/ranking.
You can get the _ids of the previous, current and next documents by manipulating the documents: [<prev offset>, <next offset>] window. For the OP's scenario, it would be [-1, 1] to collect those documents' _ids at once. Then perform a $lookup to fetch back the documents through the _ids stored in the nearIds array.
{
"$setWindowFields": {
"partitionBy": null,
"sortBy": {
"published_at": 1
},
"output": {
nearIds: {
$addToSet: "$_id",
window: {
documents: [
-1,
1
]
}
}
}
}
}
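Putting the pieces together, here is a minimal sketch of the full pipeline as described above (the slug value used to pick out the current article is hypothetical, and the collection is assumed to be named article as in the question):
db.article.aggregate([
  {
    $setWindowFields: {
      partitionBy: null,
      sortBy: { published_at: 1 },
      output: {
        nearIds: {
          $addToSet: "$_id",
          window: { documents: [-1, 1] }
        }
      }
    }
  },
  // keep only the current article (identified here by its slug)
  { $match: { slug: "my-current-article" } },
  // fetch back the neighbouring documents through the _ids collected above
  {
    $lookup: {
      from: "article",
      localField: "nearIds",
      foreignField: "_id",
      as: "near"
    }
  },
  { $project: { "near.slug": 1, "near.title": 1, "near.published_at": 1 } }
])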

How to construct a mongo query that aggregates on two fields?

TL;DR; How do I query mongo in a way that aggregates a collection two different ways?
I'm learning to query MongoDB. Suppose that I have a collection that contains the results of users submitting a questionnaire and then reviewers signing off that they have reviewed it.
A record in my collection looks like this:
{
_id: 0,
Questions:
[
{
label: "fname",
response: "Sir"
},
{
label: "lname",
response: "Robin"
},
{
label: "What is your name?",
response: "Sir Robin of Camelot"
},
{
label: "What is your quest?",
response: "To seek the holy grail"
},
{
label: "What is the capital of Asyria?",
response: "I don't know that."
}
],
Signatures:
[
"Lancelot",
"Arthur"
]
}
I am creating a report that will display a summary of each record.
For each record, I need to display the following:
The first name
The last name
The number of signatures.
I am able to write an aggregate query that gets the first and last name.
I am also able to write an aggregate query that gets the number of signatures.
However, I am stuck when I try to write an aggregate query that gets both.
// In order to query the first and last name, I use this query:
[
{ $unwind: "$Questions" },
{ $match: { "Questions.Label": { $in: ["fname", "lname"] } } },
{ $project: { "Questions": 1 } },
{ $group: { _id: "$_id", Questions: { $push: "$Questions" } } }
]
// In order to query the number of signatures, I use this query:
[
{ $project: { "SignatureCount": { $size: "$Signatures" } } }
]
Everything works with these two queries, but I want to write a single query that returns the data all together.
So, for example, if the example record above were the only record in my collection, I would want the query to return this:
{
_id: 0,
Questions:
[
{
label: "fname",
response: "Sir"
},
{
label: "lname",
response: "Robin"
}
],
SignatureCount: 2
}
You can rewrite the query to get both in a single $project stage in version 3.4.
Use $filter to filter the Questions.
[
{"$match":{"Questions.Label":{"$in":["fname", "lname"]}},
{"$project":{
"Questions":{
"$filter":{
"input":"$Questions",
"as":"q",
"cond":{"$in":["$$q.Label",["fname","lname"]]}
}
},
"SignatureCount":{"$size":"$Signatures"}
}}
]
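Run against the single sample record from the question, this pipeline would be expected to return the desired shape:
{
  "_id": 0,
  "Questions": [
    { "label": "fname", "response": "Sir" },
    { "label": "lname", "response": "Robin" }
  ],
  "SignatureCount": 2
}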

Aggregation is very slow

I have a collection with a structure similar to this.
{
"_id" : ObjectId("59d7cd63dc2c91e740afcdb"),
"dateJoined": ISODate("2014-12-28T16:37:17.984Z"),
"activatedMonth": 5,
"enrollments" : [
{ "month":-10, "enrolled":'00'},
{ "month":-9, "enrolled":'00'},
{ "month":-8, "enrolled":'01'},
//other months
{ "month":8, "enrolled":'11'},
{ "month":9, "enrolled":'11'},
{ "month":10, "enrolled":'00'}
]
}
month in enrollments sub document is a relative month from dateJoined.
activatedMonth is a month of activation relative to dateJoined. So, this will be different for each document.
I am using the MongoDB aggregation framework to process queries like "Find all documents that are enrolled from 10 months before activation to 25 months after activation".
"enrolled" values 01, 10, 11 are considered enrolled and 00 is considered not enrolled. For a document to be considered enrolled, it should be enrolled for every month in the range.
I am applying all the filters that I can in the match stage, but this can be empty in most cases. In the projection stage I am trying to find all the documents with at least one not-enrolled month; if the size is zero, then the document is enrolled.
Below is the query that I am using. It takes 3 to 4 seconds to finish. It is more or less the same time with or without the group stage. My data is relatively small in size (0.9 GB), the total number of documents is 41K and the sub-document count is approx. 13 million.
I need to reduce the processing time. I tried creating an index on enrollments.month and enrollment.enrolled and it is of no use, and I think it is because the project stage can't use indexes. Am I right?
Are there are any other things that I can do to the query or the collection structure to improve performance?
let startMonth = -10;
let endMonth = 25;
mongoose.connection.db.collection("collection").aggregate([
{
$match: filters
},
{
$project: {
_id: 0,
enrollments: {
$size: {
$filter: {
input: "$enrollment",
as: "enrollment",
cond: {
$and: [
{
$gte: [
'$$enrollment.month',
{
$add: [
startMonth,
"$activatedMonth"
]
}
]
},
{
$lte: [
'$$enrollment.month',
{
$add: [
endMonth,
"$activatedMonth"
]
}
]
},
{
$eq: [
'$$enrollment.enrolled',
'00'
]
}
]
}
}
}
}
}
},
{
$match: {
enrollments: {
$eq: 0
}
}
},
{
$group: {
_id: null,
enrolled: {
$sum: 1
}
}
}
]).toArray(function(err, result) {
// some calculations
});
Also, I definitely need the group stage as I will group the counts based on different field. I have omitted this for simplicity.
Edit:
I have missed a key details in the initial post. Updated the question with the actual use case why I need projection with a calculation.
Edit 2:
I converted this to just a count query to see how it performs (based on comments to this question by Niel Lunn).
My query:
mongoose.connection.db.collection("collection")
.find({
"enrollment": {
"$not": {
"$elemMatch": { "month": { "$gte": startMonth, "$lte": endMonth }, "enrolled": "00" }
}
}
})
.count(function(e,count){
console.log(count);
});
This query takes 1.6 seconds. I tried the following indexes separately:
1. { 'enrollment.month': 1 }
2. { 'enrollment.month': 1 }, { 'enrollment.enrolled': 1 } -- two separate indexes
3. { 'enrollment.month': 1, 'enrollment.enrolled': 1 } -- just one index on both fields
The winning query plan does not use keys in any of these cases; it always does a COLLSCAN. What am I missing here?
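As a sanity check on the plan, the same query can be run with explain() from the mongo shell (a minimal sketch, with startMonth/endMonth substituted by their values from above):
db.collection.find({
  "enrollment": {
    "$not": {
      "$elemMatch": { "month": { "$gte": -10, "$lte": 25 }, "enrolled": "00" }
    }
  }
}).explain("executionStats")
// winningPlan.stage shows COLLSCAN vs IXSCAN;
// executionStats.totalDocsExamined shows how many documents were scanned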

Mongoose Aggregate pagination and total number [duplicate]

I am interested in optimizing a "pagination" solution I'm working on with MongoDB. My problem is straightforward. I usually limit the number of documents returned using the limit() functionality. This forces me to issue a redundant query without the limit() function in order to also capture the total number of documents in the query, so I can pass that to the client, letting them know they'll have to issue additional request(s) to retrieve the rest of the documents.
Is there a way to condense this into one query? Get the total number of documents but at the same time only retrieve a subset using limit()? Is there a different way to think about this problem than the way I am approaching it?
MongoDB 3.4 has introduced the $facet aggregation stage,
which processes multiple aggregation pipelines within a single stage
on the same set of input documents.
Using $facet and $group you can find documents with $limit and also get the total count.
You can use the below aggregation in MongoDB 3.4:
db.collection.aggregate([
{ "$facet": {
"totalData": [
{ "$match": { }},
{ "$skip": 10 },
{ "$limit": 10 }
],
"totalCount": [
{ "$group": {
"_id": null,
"count": { "$sum": 1 }
}}
]
}}
])
You can even use the $count aggregation stage, which was introduced in MongoDB 3.6.
You can use the below aggregation in MongoDB 3.6:
db.collection.aggregate([
{ "$facet": {
"totalData": [
{ "$match": { }},
{ "$skip": 10 },
{ "$limit": 10 }
],
"totalCount": [
{ "$count": "count" }
]
}}
])
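Note that either pipeline returns a single document holding both facets, and the count comes back wrapped in a one-element array, so the result is shaped roughly like this (illustrative values only):
[
  {
    "totalData": [ /* the 10 documents for this page */ ],
    "totalCount": [ { "count": 42 } ]
  }
]
To read the total you would access something like result[0].totalCount[0].count.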
No, there is no other way. Two queries - one for count - one with limit. Or you have to use a different database. Apache Solr for instance works like you want. Every query there is limited and returns totalCount.
MongoDB allows you to use cursor.count() even when you pass limit() or skip().
Let's say you have a db.collection with 10 items.
You can do:
async function getQuery() {
let query = await db.collection.find({}).skip(5).limit(5); // returns last 5 items in db
let countTotal = await query.count() // returns 10-- will not take `skip` or `limit` into consideration
let countWithConstraints = await query.count(true) // returns 5 -- will take into consideration `skip` and `limit`
return { query, countTotal, countWithConstraints }
}
Here's how to do this with MongoDB 3.4+ (with Mongoose) using $facet. This example returns a $count based on the documents after they have been matched.
const facetedPipeline = [
{ "$match": { "dateCreated": { $gte: new Date('2021-01-01') } } },
{ "$project": { 'exclude.some.field': 0 } },
{
"$facet": {
"data": [
{ "$skip": 10 },
{ "$limit": 10 }
],
"pagination": [
{ "$count": "total" }
]
}
}
];
const results = await Model.aggregate(facetedPipeline);
This pattern is useful for getting pagination information to return from a REST API.
Reference: MongoDB $facet
Times have changed, and I believe you can achieve what the OP is asking by using aggregation with $sort, $group and $project. For my system, I needed to also grab some user info from my users collection. Hopefully this can answer any questions around that as well. Below is an aggregation pipe. The last three objects (sort, group and project) are what handle getting the total count, then providing pagination capabilities.
db.posts.aggregate([
{ $match: { public: true } },
{ $lookup: {
from: 'users',
localField: 'userId',
foreignField: 'userId',
as: 'userInfo'
} },
{ $project: {
postId: 1,
title: 1,
description: 1,
updated: 1,
userInfo: {
$let: {
vars: {
firstUser: {
$arrayElemAt: ['$userInfo', 0]
}
},
in: {
username: '$$firstUser.username'
}
}
}
} },
{ $sort: { updated: -1 } },
{ $group: {
_id: null,
postCount: { $sum: 1 },
posts: {
$push: '$$ROOT'
}
} },
{ $project: {
_id: 0,
postCount: 1,
posts: {
$slice: [
'$posts',
currentPage ? (currentPage - 1) * RESULTS_PER_PAGE : 0,
RESULTS_PER_PAGE
]
}
} }
])
There is a way in MongoDB 3.4: $facet.
You can do:
db.collection.aggregate([
{
$facet: {
data: [{ $match: {} }],
total: [{ $count: 'total' }]
}
}
])
Then you will be able to run the two aggregations at the same time.
By default, the count() method ignores the effects of the
cursor.skip() and cursor.limit() (MongoDB docs)
As the count method excludes the effects of limit and skip, you can use cursor.count() to get the total count
const cursor = await database.collection(collectionName).find(query).skip(offset).limit(limit)
return {
data: await cursor.toArray(),
count: await cursor.count() // this will give count of all the documents before .skip() and limit()
};
It all depends on the pagination experience you need as to whether or not you need to do two queries.
Do you need to list every single page or even a range of pages? Does anyone even go to page 1051 - conceptually what does that actually mean?
There's been lots of UX work on patterns of pagination - Avoid the pains of pagination covers various types of pagination and their scenarios, and many don't need a count query to know if there's a next page. For example, if you display 10 items on a page and you limit the query to 13 - you'll know if there's another page.
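For completeness, a minimal sketch of that "fetch a few extra" idea (collection name, query and page variables are hypothetical):
const PAGE_SIZE = 10;
const docs = await db.collection('items')
  .find(query)
  .skip(page * PAGE_SIZE)
  .limit(PAGE_SIZE + 1)  // fetch one more than we display
  .toArray();

const hasNextPage = docs.length > PAGE_SIZE; // an extra doc came back, so there is a next page
const pageItems = docs.slice(0, PAGE_SIZE);  // only show the page itself
No count query is needed to decide whether to render a "next" link.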
MongoDB has introduced a new method for getting only the count of the documents matching a given query and it goes as follows:
const result = await db.collection('foo').count({name: 'bar'});
console.log('result:', result) // prints the matching doc count
Recipe for usage in pagination:
const query = {name: 'bar'};
const skip = (pageNo - 1) * pageSize; // assuming pageNo starts from 1
const limit = pageSize;
const [listResult, countResult] = await Promise.all([
db.collection('foo')
.find(query)
.skip(skip)
.limit(limit),
db.collection('foo').count(query)
])
return {
totalCount: countResult,
list: listResult
}
For more details on db.collection.count visit this page
It is possible to get the total result size without the effect of limit() using count() as answered here:
Limiting results in MongoDB but still getting the full count?
According to the documentation you can even control whether limit/pagination is taken into account when calling count():
https://docs.mongodb.com/manual/reference/method/cursor.count/#cursor.count
Edit: in contrast to what is written elsewhere - the docs clearly state that "The operation does not perform the query but instead counts the results that would be returned by the query". Which - from my understanding - means that only one query is executed.
Example:
> db.createCollection("test")
{ "ok" : 1 }
> db.test.insert([{name: "first"}, {name: "second"}, {name: "third"},
{name: "forth"}, {name: "fifth"}])
BulkWriteResult({
"writeErrors" : [ ],
"writeConcernErrors" : [ ],
"nInserted" : 5,
"nUpserted" : 0,
"nMatched" : 0,
"nModified" : 0,
"nRemoved" : 0,
"upserted" : [ ]
})
> db.test.find()
{ "_id" : ObjectId("58ff00918f5e60ff211521c5"), "name" : "first" }
{ "_id" : ObjectId("58ff00918f5e60ff211521c6"), "name" : "second" }
{ "_id" : ObjectId("58ff00918f5e60ff211521c7"), "name" : "third" }
{ "_id" : ObjectId("58ff00918f5e60ff211521c8"), "name" : "forth" }
{ "_id" : ObjectId("58ff00918f5e60ff211521c9"), "name" : "fifth" }
> db.test.count()
5
> var result = db.test.find().limit(3)
> result
{ "_id" : ObjectId("58ff00918f5e60ff211521c5"), "name" : "first" }
{ "_id" : ObjectId("58ff00918f5e60ff211521c6"), "name" : "second" }
{ "_id" : ObjectId("58ff00918f5e60ff211521c7"), "name" : "third" }
> result.count()
5 (total result size of the query without limit)
> result.count(1)
3 (result size with limit(3) taken into account)
Try as below:
cursor.count(false, function(err, total){ console.log("total", total) })
core.db.users.find(query, {}, {skip:0, limit:1}, function(err, cursor){
if(err)
return callback(err);
cursor.toArray(function(err, items){
if(err)
return callback(err);
cursor.count(false, function(err, total){
if(err)
return callback(err);
console.log("cursor", total)
callback(null, {items: items, total:total})
})
})
})
Thought of providing a caution while using aggregate for the pagination. It's better to use two queries for this if the API is used frequently to fetch data by the users. This is at least 50 times faster than getting the data using aggregate on a production server when more users are accessing the system online. aggregate and $facet are more suited for dashboards, reports and cron jobs that are called less frequently.
We can do it using two queries.
const limit = parseInt(req.query.limit || 50, 10);
let page = parseInt(req.query.page || 0, 10);
if (page > 0) { page = page - 1}
let doc = await req.db.collection('bookings').find().sort( { _id: -1 }).skip(page * limit).limit(limit).toArray();
let count = await req.db.collection('bookings').find().count();
res.json({data: [...doc], count: count});
I took the two queries approach, and the following code has been taken straight out of a project I'm working on, using MongoDB Atlas and a full-text search index:
return new Promise( async (resolve, reject) => {
try {
const search = {
$search: {
index: 'assets',
compound: {
should: [{
text: {
query: args.phraseToSearch,
path: [
'title', 'note'
]
}
}]
}
}
}
const project = {
$project: {
_id: 0,
id: '$_id',
userId: 1,
title: 1,
note: 1,
score: {
$meta: 'searchScore'
}
}
}
const match = {
$match: {
userId: args.userId
}
}
const skip = {
$skip: args.skip
}
const limit = {
$limit: args.first
}
const group = {
$group: {
_id: null,
count: { $sum: 1 }
}
}
const searchAllAssets = await Models.Assets.schema.aggregate([
search, project, match, skip, limit
])
const [ totalNumberOfAssets ] = await Models.Assets.schema.aggregate([
search, project, match, group
])
return await resolve({
searchAllAssets: searchAllAssets,
totalNumberOfAssets: totalNumberOfAssets.count
})
} catch (exception) {
return reject(new Error(exception))
}
})
I had the same problem and came across this question. The correct solution to this problem is posted here.
You can do this in one query. First you run a count and within that run the limit() function.
In Node.js and Express.js, you will have to use it like this to be able to use the "count" function along with the toArray's "result".
var curFind = db.collection('tasks').find({query});
Then you can run two functions after it like this (one nested in the other)
curFind.count(function (e, count) {
// Use count here
curFind.skip(0).limit(10).toArray(function(err, result) {
// Use result here and count here
});
});

Mongodb aggregate on subdocument in array

I am implementing a small application using mongodb as a backend. In this application I have a data structure where the documents will contain a field that contains an array of subdocuments.
I use the following use case as a basis:
http://docs.mongodb.org/manual/use-cases/inventory-management/
As you can see from the example, each document has a field called carted, which is an array of subdocuments.
{
_id: 42,
last_modified: ISODate("2012-03-09T20:55:36Z"),
status: 'active',
items: [
{ sku: '00e8da9b', qty: 1, item_details: {...} },
{ sku: '0ab42f88', qty: 4, item_details: {...} }
]
}
This fits me perfect, except for one problem:
I want to count each unique item (with "sku" as the unique identifier key) in the entire collection, where each document adds 1 to the count (multiple instances of the same "sku" in the same document still count just 1). E.g. I would like this result:
{ sku: '00e8da9b', doc_count: 1 },
{ sku: '0ab42f88', doc_count: 9 }
After reading up on MongoDB, I am quite confused about how to do this (fast) when you have a complex schema as described above. If I have understood the otherwise excellent documentation correctly, such an operation may be achieved using either the aggregation framework or the map/reduce framework, but this is where I need some input:
Which framework would be better suited to achieve the result I am looking for, given the complexity of the structure?
What kind of indexes would be preferred in order to gain the best possible performance out of the chosen framework?
MapReduce is slow, but it can handle very large data sets. The Aggregation framework on the other hand is a little quicker, but will struggle with large data volumes.
The trouble with your structure shown is that you need to "$unwind" the arrays to crack open the data. This means creating a new document for every array item and with the aggregation framework it needs to do this in memory. So if you have 1000 documents with 100 array elements it will need to build a stream of 100,000 documents in order to groupBy and count them.
You might want to consider seeing if there's a schema layout that will serve your queries better, but if you want to do it with the Aggregation framework, here's how you could do it (with some sample data so the whole script will drop into the shell):
db.so.remove();
db.so.ensureIndex({ "items.sku": 1}, {unique:false});
db.so.insert([
{
_id: 42,
last_modified: ISODate("2012-03-09T20:55:36Z"),
status: 'active',
items: [
{ sku: '00e8da9b', qty: 1, item_details: {} },
{ sku: '0ab42f88', qty: 4, item_details: {} },
{ sku: '0ab42f88', qty: 4, item_details: {} },
{ sku: '0ab42f88', qty: 4, item_details: {} },
]
},
{
_id: 43,
last_modified: ISODate("2012-03-09T20:55:36Z"),
status: 'active',
items: [
{ sku: '00e8da9b', qty: 1, item_details: {} },
{ sku: '0ab42f88', qty: 4, item_details: {} },
]
},
]);
db.so.runCommand("aggregate", {
pipeline: [
{ // optional filter to exclude inactive elements - can be removed
// you'll want an index on this if you use it too
$match: { status: "active" }
},
// unwind creates a doc for every array element
{ $unwind: "$items" },
{
$group: {
// group by unique SKU, but you only wanted to count a SKU once per doc id
_id: { _id: "$_id", sku: "$items.sku" },
}
},
{
$group: {
// group by unique SKU, and count them
_id: { sku:"$_id.sku" },
doc_count: { $sum: 1 },
}
}
]
//,explain:true
})
Note that I've $group'd twice, because you said that an SKU can only count once per document, so we need to first sort out the unique doc/sku pairs and then count them up.
If you want the output a little different (in other words, EXACTLY like in your sample) we can $project them.
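For example, a minimal sketch of that extra stage, reshaping the output to the sample format in the question:
{
  $project: {
    _id: 0,
    sku: "$_id.sku",
    doc_count: 1
  }
}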
With the latest mongo build (it may be true for other builds too), I've found that a slightly different version of cirrus's answer performs faster and consumes less memory. I don't know the details why; it seems that with this version Mongo somehow has more opportunity to optimize the pipeline.
db.so.runCommand("aggregate", {
pipeline: [
{ $unwind: "$items" },
{
$group: {
// create array of unique sku's (or set) per id
_id: { id: "$_id"},
sku: {$addToSet: "$items.sku"}
}
},
// unroll all sets
{ $unwind: "$sku" },
{
$group: {
// then count unique values per each Id
_id: { id: "$_id.id", sku:"$sku" },
count: { $sum: 1 },
}
}
]
})
To match exactly the same format as asked for in the question, the document id should be dropped from the final $group key (i.e. group by sku only), as sketched below.
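A minimal sketch of that adjusted final stage (everything before it stays the same):
{
  $group: {
    // each document now contributes each of its unique sku values exactly once,
    // so counting per sku gives the number of documents containing that sku
    _id: { sku: "$sku" },
    doc_count: { $sum: 1 }
  }
}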