Performance degradation in MongoDB aggregation framework

I am using MongoDB 3.6.8.
I have a collection (called states) with an ObjectId field (sensor_id), a date, as well as a few other fields.
I created a compound index on the collection:
db.states.createIndex({ "sensor_id" : 1, "date" : 1 });
I am using the aggregation framework with a match stage, for example:
{
    "$match" : {
        "sensor_id" : { "$oid" : "5b8fd62c4f0cd13c05296df7" },
        "date" : {
            "$gte" : { "$date" : "2018-10-06T04:19:00.000Z" },
            "$lt" : { "$date" : "2018-10-06T10:19:09.000Z" }
        }
    }
}
My problem: as the states collection gets bigger, the aggregation pipeline gets slower and slower, even when the documents being added fall outside the dates in the match filter. With this index, I expected performance not to vary much as the collection grows.
Other info:
The states collection does not have very many documents (about 200,000), of which about 20,000 are matched by the above filter.
The indexes in the collection (and other collections) are just a few megabytes and easily fit in memory.
The aggregation pipeline does not insert any documents in the database.
Can anyone suggest what I should investigate to explain the pretty drastic fall in performance as the collection grows (with new documents outside the date range in $match)?
Thank you.

Solved. Another stage in the pipeline was processing ever more data as time went on. Nothing was wrong with the indexes.
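For anyone hitting something similar: running the aggregation through explain is a quick way to spot a stage that examines a growing number of documents. A minimal sketch against the collection above, written in mongo shell syntax rather than the driver's extended JSON (the remaining stages are placeholders, since the full pipeline was not posted); explain shows the optimized pipeline and whether the { sensor_id, date } index is picked up by the initial $match, and on newer server versions a higher explain verbosity also reports how many documents each stage handles:
db.states.aggregate(
    [
        { "$match" : {
            "sensor_id" : ObjectId("5b8fd62c4f0cd13c05296df7"),
            "date" : {
                "$gte" : ISODate("2018-10-06T04:19:00.000Z"),
                "$lt" : ISODate("2018-10-06T10:19:09.000Z")
            }
        } }
        // ... the remaining stages of the real pipeline go here ...
    ],
    { explain : true }
)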

Related

MongoDB aggregate query performance improvement

I recently started shifting data from Microsoft SQL Server to MongoDB to obtain scalability. All good in terms of migration.
The documents have 2 important fields: customer and timestamphash (year month day).
We imported 75 million documents into MongoDB installed on an Azure Linux VM.
After adding a compound index on both fields, we have the following problem:
On 3 million documents (after filtering) it takes 24 seconds to finish an aggregate group-by count by customerId. SQL Server gives the result in less than 1 second on the same data.
Do you think Cassandra would be a better solution? We need query performance on large amounts of data.
I tried disk writes and giving the VM more RAM. Nothing worked.
Query:
aggregate([
    { "$match" : { "Customer" : 2 } },
    { "$match" : { "TimestampHash" : { "$gte" : 20160710 } } },
    { "$match" : { "TimestampHash" : { "$lte" : 20190909 } } },
    { "$group" : { "_id" : { "Device" : "$Device" }, "__agg0" : { "$sum" : 1 } } },
    { "$project" : { "Device" : "$_id.Device", "Count" : "$__agg0", "_id" : 0 } },
    { "$skip" : 0 },
    { "$limit" : 10 }
])
Update:
I used 'allowDiskUse: true' and the problem was solved: it is down to 4 seconds for the 3M filtered documents.
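For reference, a hedged sketch of what that looks like (the collection name is a placeholder since it isn't given in the question; the three $match stages are merged into one, which the optimizer normally does anyway, and allowDiskUse is passed as an aggregate option):
db.collection.aggregate([
    { "$match" : {
        "Customer" : 2,
        "TimestampHash" : { "$gte" : 20160710, "$lte" : 20190909 }
    } },
    { "$group" : { "_id" : { "Device" : "$Device" }, "__agg0" : { "$sum" : 1 } } },
    { "$project" : { "Device" : "$_id.Device", "Count" : "$__agg0", "_id" : 0 } },
    { "$skip" : 0 },
    { "$limit" : 10 }
], { allowDiskUse : true })
With the compound index on Customer and TimestampHash already in place, the merged $match can be served entirely from that index.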
I have encountered a similar problem before and, to be honest, I guess Cassandra is better in your particular case, but the question was about Mongo aggregation query optimization, right?
As of now, one of my collections has more than 3M docs, and aggregation queries shouldn't take 24s if you build your indexes correctly.
First of all, check the index usage via MongoDB Compass. Is Mongo actually using it? If your app spams queries at the DB and the index has 0 usage, then, as you already guessed, something is wrong with your index.
The second thing is to use the explain method (the documentation will help you out) to get more info about your query.
And third: index field order matters. For example, if you have a $match stage on 3 fields and you request docs by those fields:
{ $match: { a_field: a, b_field: b, c_field: c } }
then you should build a compound index on the a, b, c fields in that exact same order.
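As a minimal sketch with the hypothetical field names above, the matching index would be:
db.coll.createIndex({ a_field : 1, b_field : 1, c_field : 1 })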
There is also often some kind of DB architecture problem. I highly recommend that you not stockpile all data inside one collection. Use
{
    timestamps: true
}
in your schema on insertion (it creates two fields, createdAt and updatedAt), store old/outdated data in a different collection, and use the $lookup aggregation stage for it when you really need to operate on it.
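A rough sketch of that layout, with hypothetical collection names ("events" for current data, "events_archive" for the moved-out documents) and the Customer field borrowed from the question; the archive is only joined in via $lookup for the rare queries that need it:
db.events.aggregate([
    { $match : { Customer : 2 } },
    { $lookup : {
        from : "events_archive",
        localField : "Customer",
        foreignField : "Customer",
        as : "archived"
    } }
])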
Hope you'll find something useful in my answer.

Mongodb query using db.collection.find() method 100 times faster than using db.collection.aggregate()?

My collection testData has some 4 million documents with identical structure:
{"_id" : ObjectId("5932c56571f5a268cea12226"),
"x" : 1.0,
"text" : "w592cQzC5aAfZboMujL3knCUlIWgHqZNuUcH0yJNS9U4",
"country" : "Albania",
"location" : {
"longitude" : 118.8775183,
"latitude" : 75.4316019
}}
The collection is indexed on (country, location.longitude) pair.
The following two queries, which I would consider identical and which produce identical output, differ in execution time by a factor of 100:
db.testData.aggregate(
[
{ $match : {country : "Brazil"} },
{ $sort : { "location.longitude" : 1 } },
{ $project : {"_id" : 0, "country" : 1, "location.longitude" : 1} }
]);
(this one produces output within about 6 seconds for the repeated query and about 120 seconds for the first-time query)
db.testData.find(
{ country : "Brazil" },
{"_id" : 0, "country" : 1, "location.longitude" : 1}
).sort(
{"location.longitude" : 1}
);
(this one produces output within 15 milliseconds for the repeated query and about 1 second for the first-time query).
What am I missing here? Thanks for any feedback.
The MongoDB find operation is used to fetch documents from a collection according to filters.
MongoDB aggregation groups values from a collection and performs computations on those groups by executing the stages of a pipeline, returning the computed result.
A find operation is generally faster than an aggregation because an aggregation encapsulates multiple stages in a pipeline, with each stage's output serving as input to the next, before the processed result is returned.
A find operation returns a cursor over the fetched documents that match the filters, and the cursor is iterated to access the documents.
According to the description above, we only need to fetch the documents whose country key is Brazil and sort them by the longitude key in ascending order, which can be accomplished easily with a find operation.
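One way to see where the time actually goes is to compare the two plans directly. A diagnostic sketch, assuming the index described above is { country : 1, "location.longitude" : 1 }:
// Plan for the find() version, with execution statistics.
db.testData.find(
    { country : "Brazil" },
    { "_id" : 0, "country" : 1, "location.longitude" : 1 }
).sort({ "location.longitude" : 1 }).explain("executionStats")

// Plan for the aggregation version.
db.testData.aggregate(
    [
        { $match : { country : "Brazil" } },
        { $sort : { "location.longitude" : 1 } },
        { $project : { "_id" : 0, "country" : 1, "location.longitude" : 1 } }
    ],
    { explain : true }
)
The number of documents examined and whether an in-memory SORT stage appears are the fields to compare between the two outputs.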

query to retrieve multiple objects in an array in mongodb

Suppose I have an array of objects as below.
"array" : [
{
"id" : 1
},
{
"id" : 2
},
{
"id" : 2
},
{
"id" : 4
}
]
If I want to retrieve multiple objects ({id : 2}) from this array, the aggregation query goes like this.
db.coll.aggregate([
    { $match : { "_id" : ObjectId("5492690f72ae469b0e37b61c") } },
    { $unwind : "$array" },
    { $match : { "array.id" : 2 } },
    { $group : { _id : "$_id", array : { $push : { id : "$array.id" } } } }
])
The output of above aggregation is
{
    "_id" : ObjectId("5492690f72ae469b0e37b61c"),
    "array" : [
        { "id" : 2 },
        { "id" : 2 }
    ]
}
Now the questions are:
1) Is retrieving multiple objects from an array possible using find() in MongoDB?
2) With respect to performance, is aggregation the correct way to do this (given that we need to use four pipeline operators)?
3) Can we do this with Java manipulation (looping over the array and keeping only the {id : 2} objects) after a
find({"_id" : ObjectId("5492690f72ae469b0e37b61c")}) query? find retrieves the document once and keeps it in RAM, whereas with aggregation four operations need to be performed in RAM to produce the output.
The reason I ask question 3) is: if thousands of clients are accessing the database at the same time, its RAM will be overloaded, whereas if the filtering is done in Java there is less work for that RAM.
4) For how long does the working set stay in RAM?
Is my understanding correct?
Please correct me if I am wrong, and please point me toward the right way to think about this.
1) No. You project the first matching element with the positional $ operator, you project all of them, or you project none of them.
2) No-ish. If you have to work with this array, aggregation is what will allow you to extract multiple matching elements (see the sketch after these answers), but the correct solution, conceptually and for performance, is to design your document structure so this problem does not arise, or arises only for rare queries whose performance is not particularly important.
3) Yes.
4) We have no information that would allow us to give a reasonable answer to this question. It is also out of scope relative to the rest of the question and should be asked separately.
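To expand on the second point: if you are on MongoDB 3.2 or newer, a hedged alternative sketch is to filter the array inside a single $project stage with $filter, which avoids the $unwind/$group round trip entirely:
db.coll.aggregate([
    { $match : { "_id" : ObjectId("5492690f72ae469b0e37b61c") } },
    { $project : {
        array : {
            $filter : {
                input : "$array",
                as : "item",
                cond : { $eq : [ "$$item.id", 2 ] }
            }
        }
    } }
])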

group in aggregate framework stopped working properly

I hate this kind of question, but maybe you can point me to the obvious. I'm using Mongo 2.2.2.
I have a collection (in a replica set) with 6M documents, which has a string field called username on which I have an index. The index was non-unique but recently I made it unique. Suddenly the following query gives me false alarms that I have duplicates.
db.users.aggregate(
{ $group : {_id : "$username", total : { $sum : 1 } } },
{ $match : { total : { $gte : 2 } } },
{ $sort : {total : -1} } );
which returns
{
    "result" : [
        { "_id" : "davidbeges", "total" : 2 },
        { "_id" : "jesusantonio", "total" : 2 },
        { "_id" : "elesitasweet", "total" : 2 },
        { "_id" : "theschoolofbmx", "total" : 2 },
        { "_id" : "longflight", "total" : 2 },
        { "_id" : "thenotoriouscma", "total" : 2 }
    ],
    "ok" : 1
}
I tested this query on a sample collection with a few documents and it works as expected.
Someone from 10gen responded in their JIRA:
Are there any updates on this collection? If so, I'd try adding {$sort: {username:1}} to the front of the pipeline. That will ensure that you only see each username once if it is unique.
If there are updates going on, it is possible that aggregation would see a document twice if it moves due to growth. Another possibility is that a document was deleted after being seen by the aggregation and a new one was inserted with the same username.
So sorting by username before grouping helped.
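For reference, a minimal sketch of the adjusted pipeline (2.2-era shell syntax, matching the original query), with the sort on the indexed username field added up front so each document is seen only once:
db.users.aggregate(
    { $sort : { username : 1 } },
    { $group : { _id : "$username", total : { $sum : 1 } } },
    { $match : { total : { $gte : 2 } } },
    { $sort : { total : -1 } } );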
I think the answer may lie in the fact that your $group is not using an index; it's just doing a scan over the entire collection. These operators can currently use an index in the aggregation framework:
$match $sort $limit $skip
And they will work if placed before:
$project $unwind $group
However, $group by itself will not use an index. When you do your find() test I am betting you are using the index, possibly as a covered index (you can verify by looking at an explain() for that query), rather than scanning the collection. Basically my theory is that your index has no dupes, but your collection does.
Edit: This likely happens because a document is updated/moved during the aggregation operation and hence is seen twice, not because of dupes in the collection as originally thought.
If you add an operator earlier in the pipeline that can use the index but not alter the results fed into $group, then you can avoid the issue.

Unreasonably slow MongoDB query, even though the query is simple and aligned to indexes

I'm running a MongoDB server (that's literally all it has running). The server has 64 GB of RAM and 16 cores, plus 2 TB of hard drive space to work with.
The Document Structure
The database has a collection domains with around 20 million documents. There is a decent amount of data in each document, but for our purposes the documents are structured like so:
{
_id: "abcxyz.com",
LastUpdated: <date>,
...
}
The _id field is the domain name referenced by the document. There is an ascending index on LastUpdated. LastUpdated is updated on hundreds of thousands of records per day; basically, every time new data becomes available for a document, the document is updated and the LastUpdated field is set to the current date/time.
The Query
I have a mechanism that extracts the data from the database so it can be indexed in a Lucene index. The LastUpdated field is the key driver for flagging changes made to a document. In order to search for documents that have been changed and page through those documents, I do the following:
{
    LastUpdated: { $gte: ISODate(<firstdate>), $lt: ISODate(<lastdate>) },
    _id: { $gt: <last_id_from_previous_page> }
}
sort: { _id : 1 }
When no documents are returned, the start and end dates move forward and the _id "anchor" field is reset. This setup is tolerant to documents from previous pages that have had their LastUpdated value changed, i.e. the paging won't become incorrectly offset by the number of documents in previous pages that are now technically no longer in those pages.
The Problem
I want to ideally select about 25000 documents at a time, but for some reason the query itself (even when only selecting <500 documents) is extremely slow.
The query I ran was:
db.domains.find({
    "LastUpdated" : {
        "$gte" : ISODate("2011-11-22T15:01:54.851Z"),
        "$lt" : ISODate("2011-11-22T17:39:48.013Z")
    },
    "_id" : { "$gt" : "1300broadband.com" }
}).sort({ _id : 1 }).limit(50).explain()
It is so slow in fact that the explain (at the time of writing this) has been running for over 10 minutes and has not yet completed. I will update this question if it ever finishes, but the point of course is that the query is EXTREMELY slow.
What can I do? I don't have the faintest clue what the problem might be with the query.
EDIT
The explain finished after 55 minutes. Here it is:
{
    "cursor" : "BtreeCursor Lastupdated_-1__id_1",
    "nscanned" : 13112,
    "nscannedObjects" : 13100,
    "n" : 50,
    "scanAndOrder" : true,
    "millis" : 3347845,
    "nYields" : 5454,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
        "LastUpdated" : [
            [
                ISODate("2011-11-22T17:39:48.013Z"),
                ISODate("2011-11-22T15:01:54.851Z")
            ]
        ],
        "_id" : [
            [
                "1300broadband.com",
                { }
            ]
        ]
    }
}
Bumped into a very similar problem, and the Indexing Advice and FAQ on Mongodb.org says, quote:
The range query must also be the last column in an index
So if you have the keys a, b and c and run db.ensureIndex({a:1, b:1, c:1}), these are the "guidelines" in order to use the index as much as possible:
Good:
find(a=1,b>2)
find(a>1 and a<10)
find(a>1 and a<10).sort(a)
Bad:
find(a>1, b=2)
Only use a range query OR sort on one column.
Good:
find(a=1,b=2).sort(c)
find(a=1,b>2)
find(a=1,b>2 and b<4)
find(a=1,b>2).sort(b)
Bad:
find(a>1,b>2)
find(a=1,b>2).sort(c)
Hope it helps!
/J
Ok, I solved it. The culprit was "scanAndOrder": true, which suggested that the index wasn't being used as intended. The correct composite index has the primary sort field first and then the fields being queried on.
{ "_id":1, "LastUpdated":1 }
Have you tried adding _id to your composite index? As you're using it as part of the query, won't it still have to do a full table scan otherwise?