I have a collection named "questions" with around 15 fields. There is an indexed field called "user". Another field is "response_api" which is a subdocument of around 60KB. And there are around 40000 documents in this collection.
When I run an aggregate query with only a $match stage on the user field, it is very slow and takes around 11 seconds to complete.
The query is:
db.questions.aggregate([{$match: {user: ObjectId("5c9a19abc89b2d09740ccd1d")} }])
But when I run the same query with a $project stage added, it returns pretty fast, in less than 10 ms. The query is:
db.questions.aggregate([{$match: {user: ObjectId("5c9a19abc89b2d09740ccd1d")} }, {$project: {_id: 1, user: 1, subject: 1}}])
Note: This particular user has 5000 documents in the questions collection.
When I copied this collection without the "response_api" field and created an index on the user field, both of these queries on the copied collection were pretty fast and took less than 10 ms.
Can somebody explain what's going on here? Is it because of that large field?
According to this documentation provided by MongoDB, you should keep a check on the size of your indexes. Basically, if the index size is more than what your RAM can accommodate, then MongoDB will be reading the indexes from disk and your queries will be a lot slower.
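For example, the index sizes can be checked from the mongo shell (a sketch; the returned values are in bytes):
db.questions.stats().indexSizes   // size of each index on the collection
db.questions.totalIndexSize()     // total size of all indexes
If that total exceeds the RAM available for the cache, index reads start hitting disk.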
I have a collection called "sessions".
In this collection I have two fields that form a compound index:
app (ObjectId)
udid (string)
I would like to count the amount of distinct udids in a specific app.
The first way that I went was:
db.getCollection('sessions').distinct('udid', {"app" : ObjectId("...")}).length
My problem is that the result is bigger than 16 MB and I get the following error: "distinct too big, 16mb cap".
Then I tried to go with the next solution:
db.getCollection('sessions').aggregate([
{$match: {app: ObjectId('...')}},
{$group:{_id: '$udid'}},
{$count: 'id'}
])
The problem is that this pipeline isn't using the compound index: it first finds all the documents in the $match stage and only after that does the grouping, so the index isn't used for the $group. This makes the query very slow; on a collection of about 3 million documents it takes 50 seconds.
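Whether the index is used can be confirmed with explain() (a sketch of the same pipeline as above):
db.getCollection('sessions').explain('executionStats').aggregate([
    {$match: {app: ObjectId('...')}},
    {$group: {_id: '$udid'}},
    {$count: 'id'}
])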
It would be great if someone could tell me another way to do a distinct count.
MongoDB runs find() operations very slowly when I sort or search on non-indexed fields. When there is no sorting or searching on those fields, or when these operations are done on indexed fields, the query runs very fast. My test database has only 650k documents (which is not very much), but the final database will have 50 million documents, growing by 30k/day.
This is a query which runs fast (less than 1s), only filtering on indexed fields:
.find({
tipo_doc: 'nfe', //indexed field
id_empresa: 7 //indexed field
},
{
sort: {'data_requisicao': -1}, //indexed field
limit: 3000
});
And this is a query which runs very slow (30s or more):
.find({
tipo_doc: 'nfe', //indexed field
id_empresa: 7, //indexed field
'transportador.xNome': 'A common name' //not indexed field
},
{
sort: {'emitente.enderEmit.UF': -1}, //not indexed field
limit: 3000
});
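For reference, the winning plan for the slow query can be inspected in the mongo shell (a sketch; the collection name is a placeholder):
db.getCollection('docs').find({
    tipo_doc: 'nfe',
    id_empresa: 7,
    'transportador.xNome': 'A common name'
}).sort({'emitente.enderEmit.UF': -1}).limit(3000).explain('executionStats')
// expect a filter applied during the FETCH stage plus an in-memory SORT,
// since the extra filter field and the sort field are not indexed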
Here's a JSON document example: https://pastebin.com/1EeaL9VL
I've done a lot of research but nothing seems to work here. Am I doing something wrong to get this poor performance?
EDIT:
As I said in my comment, this data will be shown in an HTML table, and every column will be searchable and sortable, so I assume I can't create an index on every column.
Imagine you have a schema like:
[{
    name: "Bob",
    naps: [{
        time: ISODate("2019-05-01T15:35:00"),
        location: "sofa"
    }, ...]
}, ...
]
So lots of people, each with a few dozen naps. You want to find out 'what days do people take the most naps?', so you index naps.time, and then query with:
aggregate([
    {$unwind: "$naps"},
    {$group: {_id: {$dayOfWeek: "$naps.time"}, napsOnDay: {$sum: 1}}}
])
But when doing explain(), mongo tells me no index was used in this query, when clearly the index on the naps.time date field could have been used. Why is this? How can I get mongo to use the index for a more optimal query?
Indexes store pointers to actual documents and can only be used when working with a material document (i.e. the document that is actually stored on disk).
$match and $sort do not mutate the actual documents, so indexes can be used in these stages.
In contrast, $unwind, $group, or any other stage that changes the document representation basically loses the connection between the index and the material documents.
Additionally, when those stages are run without a $match, you're basically saying that you want to process the whole collection, and there is no point in using an index if you want to process the whole collection.
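A sketch of how the index can still help (the collection name and date range are assumptions; only the leading $match can use the index on naps.time):
db.people.aggregate([
    // this first $match can use the index on naps.time
    {$match: {"naps.time": {$gte: ISODate("2019-05-01"), $lt: ISODate("2019-06-01")}}},
    {$unwind: "$naps"},
    // re-apply the filter per unwound element, since the first $match
    // filters whole documents, not individual naps
    {$match: {"naps.time": {$gte: ISODate("2019-05-01"), $lt: ISODate("2019-06-01")}}},
    {$group: {_id: {$dayOfWeek: "$naps.time"}, napsOnDay: {$sum: 1}}}
])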
Let's say you have a database of People.
[
#1 {qualities: ['brown', 'nice', 'happy', 'cool', 'cheery']}
#2 {qualities: ['blue', 'okay', 'happy', 'decent', 'cheery']}
#3 {qualities: ['green', 'alright', 'happy', 'cool', 'cheery']}
]
Here's the People schema and model:
var peopleSchema = mongoose.Schema({
qualities: [],
});
var People = mongoose.model('People', peopleSchema);
If you want to get the documents ranked by how many of the given qualities they match, you can use this aggregate query:
People.aggregate([
{$unwind:"$qualities"},
{$match:{'qualities': {$in: ["brown", "happy", "cool"]}}},
{$group:{_id:"$_id",count:{$sum:1}}},
{$sort:{count:-1}}
]).exec(function(err, persons) {
console.log(persons)
});
It will return the documents in the order 1, 3, 2, because the first one matched 3 items, the third one matched 2 items, and the second one matched 1 item.
Question
This aggregate works fast for my database of 10,000 people - in fact, it completed in 273.199 ms. However, how will it fare on a MongoDB collection of 10 million entries? If these rates are proportional [100k: 2.7s, 1m: 27s, 10m: 4m30s], it could take around 4 minutes and 30 seconds. Perhaps the rate is not proportional, I do not know. But are there any optimizations or suggestions for querying such a large database, if my time hypothesis happens to be true?
Okay, since you have asked, I will ask you to look into how the aggregate query works.
The aggregate query works on the basis of pipeline stages.
Now, what is a pipeline stage? From your example:
{$unwind:"$qualities"},
{$match:{'qualities': {$in: ["brown", "happy", "cool"]}}},
{$group:{_id:"$_id",count:{$sum:1}}},
{$sort:{count:-1}}
Here, $unwind, $match, $group, and $sort are all pipeline stages.
$unwind works by creating a separate document for each element of the array field (in your case qualities), which allows better nested searching.
But if you keep $unwind as the first stage, it creates a performance overhead by unwinding unnecessary documents.
A better approach would be to keep $match as the first stage in the aggregation pipeline, as sketched below.
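A sketch of the same pipeline with $match moved first (note that the filter is applied again after $unwind so that only matching elements are counted):
People.aggregate([
    // filter whole documents first, so only candidates are unwound
    {$match: {'qualities': {$in: ["brown", "happy", "cool"]}}},
    {$unwind: "$qualities"},
    // filter again per unwound element before counting
    {$match: {'qualities': {$in: ["brown", "happy", "cool"]}}},
    {$group: {_id: "$_id", count: {$sum: 1}}},
    {$sort: {count: -1}}
]).exec(function(err, persons) {
    console.log(persons)
});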
Now, how fast is the aggregation query?
The aggregation query's speed depends on the amount of data stored in the embedded field. If you store a million entries in the embedded array qualities, unwinding those million entries will create a performance overhead.
So it all comes down to how you design your database schema. Also, for faster querying you can look into multikey indexing and sharding approaches for MongoDB.
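For example, a multikey index on the array field could be created like this (a sketch; the underlying collection name is assumed to be people):
db.people.createIndex({qualities: 1})   // multikey index: one key per array element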
I have a MongoDB collection whose documents have the following fields:
{"word":"ipad", "date":20140113, "docid": 324, "score": 98}
which is a reverse index for a log of documents (about 120 million).
There are two kinds of queries in my system.
One of them is:
db.index.find({"word":"ipad", "date":20140113}).sort({"score":-1})
This query fetches the word "ipad" on date 20140113 and sorts all the matching docs by score.
The other query is:
db.index.find({"word":"ipad", "date":20140113, "docid":324})
To speed up these two kinds of queries, what indexes should I build?
Should I build two indexes like this:
db.index.ensureIndex({"word":1, "date":1, "docid":1}, {"unique":true})
db.index.ensureIndex({"word":1, "date":1, "score":1}
But I think building these two indexes uses too much hard disk space.
So do you have any good ideas?
You are sorting by score descending (.sort({"score":-1})), which means that your index should also be descending on the score-field so it can support the sorting:
db.index.ensureIndex({"word":1, "date":1, "score":-1});
The other index looks good to speed up that query, but you still might want to confirm that by running the query in the mongo shell followed by .explain().
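For example (a sketch of the exact-match query with explain appended):
db.index.find({"word":"ipad", "date":20140113, "docid":324}).explain("executionStats")
// the winning plan should show an IXSCAN on the {word, date, docid} index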
Indexes are always a tradeoff of space and write-performance for read-performance. When you can't afford the space, you can't have the index and have to deal with it. But usually the write-performance is the larger concern, because drive space is usually cheap.
But maybe you could save one of the three indexes you have. "Wait, three indexes?" Yes, keep in mind that every collection must have a unique index on the _id field, which is created implicitly when the collection is initialized.
But the _id field doesn't have to be an auto-generated ObjectId. It can be anything you want. When you have another index with a uniqueness constraint and you have no use for the _id field, you can move that unique constraint to the _id field to save an index. Your documents would then look like this:
{
    _id: {
        "word": "ipad",
        "date": 20140113,
        "docid": 324
    },
    "score": 98
}
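The exact-match query could then be served by the mandatory _id index (a sketch; note that equality on an embedded _id matches only when the field order is identical):
db.index.find({_id: {"word":"ipad", "date":20140113, "docid":324}})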