MongoDB slow aggregate time - mongodb

I'm facing an issue where the aggregate function is performing very slowly, taking about 30 seconds to gather all my data. Assume one record with this structure:
{
"_id":{
"$oid":"5909a5cefece40f172895a6b"
},
"Record":1,
"Link":"https://www.google.com",
"Location":["loc1", "loc2", "loc3"],
"Organization":["org1", "org2", "org3"],
"Date":2017,
"PeoplePPL":["ppl1", "ppl2", "ppl3"]
}
And the aggregate query as follows:
db.testdata_4.aggregate([{
"$unwind": "$PeoplePPL"
},{
"$unwind": "$Location"
},{
"$match": {
Date: {
$gte: lowerBoundYear,
$lte: upperBoundYear
}
}
},{
"$group": {
"_id": {
"People": "$PeoplePPL",
"Date": "$Date"
},
Links: {
$addToSet: "$Link"
},
Locations: {
$addToSet: "$Location"
}
}
},{
"$group": {
"_id": "$_id.People",
Record: {
$push: {
"Country": "$Locations",
"Year": "$_id.Date",
"Links": "$Links"
}
}
}
}]).toArray()
There are a total of 154 records in the "testdata_4" collection, and upon aggregation, there will be 5571 records returned with the query time of 28 seconds. I have performed the ensureIndex() on "Locations" and "Date". Is this supposed to be normal as the number of records returned increases?
If it isn't normal, may I know if there's a workaround to decrease my query time to 5 seconds at max instead of having it at 28 seconds or more?

It's very likely that the index on Date isn't being used.
The $match and $sort stages can take advantage of indexes when they appear at the beginning of the pipeline. In this case, the filter is applied after several $unwind stages, which means the index likely isn't being used.
Suggestions:
Move the $match stage to the beginning of the pipeline (see the sketch below). Since "Date" is a scalar field, the filter doesn't depend on the $unwind stages, so reordering doesn't change the results.
Only unwind the fields you actually need split into separate documents. "PeoplePPL" and "Location" are the arrays being unwound here; if you don't need per-element grouping on one of them, removing its $unwind stage reduces the number of intermediate documents.

Related

Efficiently find the most recent filtered document in MongoDB collection using datetime field

I have a large collection of documents with datetime fields in them, and I need to retrieve the most recent document for any given queried list.
Sample data:
[
{"_id": "42.abc",
"ts_utc": "2019-05-27T23:43:16.963Z"},
{"_id": "42.def",
"ts_utc": "2019-05-27T23:43:17.055Z"},
{"_id": "69.abc",
"ts_utc": "2019-05-27T23:43:17.147Z"},
{"_id": "69.def",
"ts_utc": "2019-05-27T23:44:02.427Z"}
]
Essentially, I need to get the most recent record for the "42" group as well as the most recent record for the "69" group. Using the sample data above, the desired result for the "42" group would be document "42.def".
My current solution is to query each group one at a time (looping with PyMongo), sort by the ts_utc field, and limit it to one, but this is really slow.
// Requires official MongoShell 3.6+
db = db.getSiblingDB("someDB");
db.getCollection("collectionName").find(
{
"_id" : /^42\..*/
}
).sort(
{
"ts_utc" : -1.0
}
).limit(1);
Is there a faster way to get the results I'm after?
Assuming all your documents have the format displayed above, you can split the _id into two parts (on the dot character) and use aggregation to find the max timestamp per group, keyed on the first (numeric) part.
That way you can do it in one shot, instead of iterating per group.
db.foo.aggregate([
{ $project: { id_parts : { $split: ["$_id", "."] }, ts_utc : 1 }},
{ $group: {"_id" : { $arrayElemAt: [ "$id_parts", 0 ] }, max : {$max: "$ts_utc"}}}
])
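On the sample data above, this would yield one document per numeric prefix, along these lines:
[
{ "_id": "42", "max": "2019-05-27T23:43:17.055Z" },
{ "_id": "69", "max": "2019-05-27T23:44:02.427Z" }
]
Note that this returns only the max timestamp; if you need the whole document back, see the pipeline below, which keeps $$ROOT.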
As #danh mentioned in the comment, the best approach is probably to add an auxiliary field to indicate the grouping. You can then index the auxiliary field to boost performance.
Here is an ad-hoc way to derive the field and get the latest result per grouping:
db.collection.aggregate([
{
"$addFields": {
"group": {
"$arrayElemAt": [
{
"$split": [
"$_id",
"."
]
},
0
]
}
}
},
{
$sort: {
ts_utc: -1
}
},
{
"$group": {
"_id": "$group",
"doc": {
"$first": "$$ROOT"
}
}
},
{
"$replaceRoot": {
"newRoot": "$doc"
}
}
])
Here is the Mongo playground for your reference.
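If the grouping is stable, you can persist that derived field once and index it, so each per-group query becomes a cheap index scan. A minimal sketch, assuming MongoDB 4.2+ (pipeline-style updates) and a field named group:
// persist the prefix of _id as a real field
db.collection.updateMany(
{},
[ { $set: { group: { $arrayElemAt: [ { $split: [ "$_id", "." ] }, 0 ] } } } ]
)
// compound index so "latest per group" can be served by sort + limit
db.collection.createIndex({ group: 1, ts_utc: -1 })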

SELECT COUNT with HAVING clause

This is my input:
{"_id": "phd/Klink2006","type": "Phd", "title": "IQForCE - Intelligent Query (Re-)Formulation with Concept-based Expansion", "year": 2006, "publisher": "Verlag Dr. Hut, M?nchen", "authors": ["Stefan Klink"], "isbn": ["3-89963-303-2"]}
I want to count books that have fewer than 3 authors. How can I achieve this?
Group by null, and use a condition inside $sum: if the size of authors is less than 3, count 1, otherwise 0:
db.collection.aggregate([
{
$group: {
_id: null,
count: {
$sum: {
$cond: [
{ $lt: [{ $size: "$authors" }, 3] },
1,
0
]
}
}
}
}
])
Playground
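With the sample document above (a single author), the condition matches, so the expected result is:
[ { "_id": null, "count": 1 } ]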
You can use the $where operator (note that this returns the matching documents themselves, not a count):
db.collection.find({
"$where": "this.authors.length < 3"
});
Important consideration:
$where evaluates JavaScript and cannot take advantage of indexes. Therefore, query performance improves when you express your query using the standard MongoDB operators (e.g., $gt, $in). In general, you should use $where only when you cannot express your query using another operator. If you must use $where, try to include at least one other standard query operator to filter the result set. Using $where alone requires a collection scan.
The best option in terms of performance is to create a new key authorsLength (populated in advance, and ideally indexed):
db.collection.aggregate([
{
"$match": {
"authorsLength": {
"$lt": 3
}
}
},
{
"$group": {
"_id": null,
"count": {
"$sum": 1
}
}
}
])
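The pipeline above assumes authorsLength already exists. A sketch of how you might backfill and index it, assuming MongoDB 4.2+ (pipeline-style updates):
// backfill the length of the authors array as a real field
db.collection.updateMany(
{ authors: { $exists: true } },
[ { $set: { authorsLength: { $size: "$authors" } } } ]
)
// index it so the $match stage can use an index
db.collection.createIndex({ authorsLength: 1 })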

MongoDB: Sorting, subtracting dates and counting

I'm a beginner in MongoDB and I'm practising the aggregation method. In my example, I would like to get the wines produced in the last 5 years (5 years back from the newest wine), and then count how many wines were produced in that period of time (the database gives us the year of each wine as an integer).
I believe that first I have to sort the wines by year, then get the year of the newest wine and subtract five years, using that period of time to count the wines. But I don't know how to write all of that using the aggregation code.
Thanks!
You need to use various aggregation pipeline stages to transform your data.
MongoDB’s aggregation framework is modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms the documents into an aggregated result.
As you have mentioned,
First you have to get the year of the newest wine.
I have used $group to group the data; $max is used to get newestWineYear, and the entire documents ($$ROOT) are pushed into data using $push.
Stage1
{
$group: {
_id: null,
"newestWineYear": {
$max: "$year"
},
data: {
$push: "$$ROOT"
}
}
}
The output of the first stage contains the entire documents in an array, which we have named data, along with newestWineYear.
So, in order to flatten the data array, $unwind is used.
Stage2
{
$unwind: "$data"
}
Get the count of the wines that have been produced in the last 5 years.
I have used $group to get the count; the count is obtained using $sum.
Stage3
{
$group: {
_id: null,
count: {
"$sum": {
$cond: [
{
"$gte": [
"$data.year",
{
"$subtract": [
"$newestWineYear",
5
]
}
]
},
1,
0
]
}
}
}
}
A condition is added to $sum to count only the wines that have been produced in the last 5 years.
Condition is available inside $cond.
It says:
If "data.year" >= [ "$newestWineYear" - 5 ], then add 1 to count, else add 0
data.year is used because we had pushed year of the wines to the data array in our first stage of aggregation pipeline.
Final aggregation query can be found here: Playground
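Assembled, the three stages form this single call:
db.collection.aggregate([
{
$group: {
_id: null,
"newestWineYear": { $max: "$year" },
data: { $push: "$$ROOT" }
}
},
{
$unwind: "$data"
},
{
$group: {
_id: null,
count: {
"$sum": {
$cond: [
{ "$gte": [ "$data.year", { "$subtract": [ "$newestWineYear", 5 ] } ] },
1,
0
]
}
}
}
}
])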
An alternative method can be found here that avoids the $cond inside $group; instead, a $match stage is introduced to keep only the wines that have been produced in the last 5 years.
As you guessed, you need to use mongo's aggregation framework.
Start with a simple pipeline of 3 simple steps:
group with $group to get the newest year along with all the data (pushed into an array)
flatten the array back into single documents (the $unwind operator)
filter the documents with the $match operator
Finally, you can replace the root of your documents to get better-formed results.
Imagine having data like this:
[
{
"wine": "Red xxx",
"year": 2018
},
{
"wine": "Red yyy",
"year": 2017
},
{
"wine": "Red zzz",
"year": 2017
},
{
"wine": "White 1",
"year": 2016
},
{
"wine": "White 2",
"year": 2013
},
{
"wine": "White 3",
"year": 2017
},
{
"wine": "White 4",
"year": 2009
}
]
Here's the pipeline and here you can see the results in playground.
db.collection.aggregate({
$group: {
_id: null,
"lastYear": {
$max: "$year"
},
data: {
$push: "$$ROOT"
}
}
},
{
$unwind: "$data"
},
{
$match: {
$expr: {
$gte: [
"$data.year",
{
"$subtract": [
"$lastYear",
5
]
}
]
}
}
},
{
"$replaceWith": "$data"
})
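On the sample data the newest year is 2018, so the $match keeps every wine from 2013 onward, i.e. everything except "White 4" (2009). The expected output (order may vary) looks like this:
[
{ "wine": "Red xxx", "year": 2018 },
{ "wine": "Red yyy", "year": 2017 },
{ "wine": "Red zzz", "year": 2017 },
{ "wine": "White 1", "year": 2016 },
{ "wine": "White 2", "year": 2013 },
{ "wine": "White 3", "year": 2017 }
]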

Poor lookup aggregation performance

I have two collections
Posts:
{
"_Id": "1",
"_PostTypeId": "1",
"_AcceptedAnswerId": "192",
"_CreationDate": "2012-02-08T20:02:48.790",
"_Score": "10",
...
"_OwnerUserId": "6",
...
},
...
and users:
{
"_Id": "1",
"_Reputation": "101",
"_CreationDate": "2012-02-08T19:45:13.447",
"_DisplayName": "Geoff Dalgas",
...
"_AccountId": "2"
},
...
and I want to find users who wrote between 5 and 15 posts.
This is what my query looks like:
db.posts.aggregate([
{
$lookup: {
from: "users",
localField: "_OwnerUserId",
foreignField: "_AccountId",
as: "X"
}
},
{
$group: {
_id: "$X._AccountId",
posts: { $sum: 1 }
}
},
{
$match : {posts: {$gte: 5, $lte: 15}}
},
{
$sort: {posts: -1 }
},
{
$project : {posts: 1}
}
])
and it is terribly slow. For 6k users and 10k posts it takes over 40 seconds to get a response, while in a relational database I get a response in a split second.
Where's the problem? I'm just getting started with MongoDB and it's quite possible that I messed up this query.
From https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/:
foreignField specifies the field from the documents in the from collection. $lookup performs an equality match on the foreignField to the localField from the input documents. If a document in the from collection does not contain the foreignField, the $lookup treats the value as null for matching purposes.
This will be performed the same as any other query.
If you don't have an index on the field _AccountId, it will do a full collection scan for each one of the 10,000 posts. The bulk of the time will be spent in those scans.
db.users.createIndex({ "_AccountId": 1 })
speeds up the process so it's doing 10,000 index hits instead of 10,000 collection scans.
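To confirm the index is picked up, you can check the plan for the kind of equality match $lookup performs (a quick shell check; the value "6" is just taken from the sample post):
// winningPlan should show IXSCAN on { _AccountId: 1 } instead of COLLSCAN
db.users.find({ "_AccountId": "6" }).explain("executionStats")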
In addition to bauman.space's suggestion to put an index on the _AccountId field (which is critical), you should also do your $match stage as early as possible in the aggregation pipeline (i.e. as the first stage). Even though it won't use any indexes (unless you index the posts field), it will filter the result set before the $lookup (join) stage.
The reason why your query is terribly slow is that for every post, it is doing a non-indexed lookup (sequential read) for every user. That's around 60m reads!
Check out the Pipeline Optimization section of the MongoDB Aggregation Docs.
Use $match first, then $lookup: the $match reduces the number of documents that need to be examined by the $lookup, which is more efficient.
Since you're going to group by user, you should do the $group first on _OwnerUserId, and then $lookup only after filtering to accounts having between 5 and 15 posts; this will reduce the lookups:
db.posts.aggregate([{
$group: {
_id: "$_OwnerUserId",
postsCount: {
$sum: 1
},
posts: {
$push: "$$ROOT"
} //if you need to keep original posts data
}
},
{
$match: {
postsCount: {
$gte: 5,
$lte: 15
}
}
},
{
$lookup: {
from: "users",
localField: "_id",
foreignField: "_AccountId",
as: "X"
}
},
{
$unwind: "$X"
},
{
$sort: {
postsCount: -1
}
},
{
$project: {
postsCount: 1,
X: 1
}
}
])

Mongo $subtract date doesn't work in aggregation $match block

I am creating a mongo aggregation query which uses a $subtract operator in my $match block, as shown in the code below.
This query doesn't work:
db.coll.aggregate(
[
{
$match: {
timestamp: {
$gte: {
$subtract: [new Date(), 24 * 60 * 60 * 1000]
}
}
}
},
{
$group: {
_id: {
timestamp: "$timestamp"
},
total: {
$sum: 1
}
}
},
{
$project: {
_id: 0,
timestamp: "$_id.timestamp",
total: "$total",
}
},
{
$sort: {
timestamp: -1
}
}
]
)
However, this second query works:
db.coll.aggregate(
[
{
$match: {
timestamp: {
$gte: new Date(new Date() - 24 * 60 * 60 * 1000)
}
}
},
{
$group: {
_id: {
timestamp: "$timestamp"
},
total: {
$sum: 1
}
}
},
{
$project: {
_id: 0,
timestamp: "$_id.timestamp",
total: "$total",
}
},
{
$sort: {
timestamp: -1
}
}
]
)
I need to use $subtract in my $match block, so I can't use the second query.
As of MongoDB 3.6 you can use $subtract in the $match stage via $expr. Here are the docs: https://docs.mongodb.com/manual/reference/operator/query/expr/
I was able to get a query like what you're describing via this $expr and a new system variable in mongodb 4.2 called $$NOW. Here is my query, which gives me orders that have been created within the last 4 hours:
[
{ $match:
{ $expr:
{ $gt: [
"$_created_at",
{ $subtract: [ "$$NOW", 4 * 60 * 60 * 1000] } ]
}
}
}
]
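Note that $$NOW evaluates to the same datetime value across all documents and stages of the pipeline, so every document is compared against the same cutoff.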
Well, you cannot do that, and you are not meant to either. You say you "need" to do this, but in reality you do not.
Pretty much all of the general aggregation operators outside of the pipeline operators are really only valid within a $project or a $group pipeline stage. Mostly within $project, but certainly not in others.
A $match pipeline stage is really the same as a general "query" operation, so the only things valid in there are the query operators.
As for the case of your "need": any "value" that is submitted within an aggregation pipeline, and particularly within a $match, needs to be evaluated outside of the actual pipeline, before the BSON representation is sent to the server.
The only exception is the notation that defines variables in the document, particularly "fieldnames" such as "$fieldname", and then only really in $project or $group. That means something that "refers" to an existing value of a document, which is something that cannot be done within any type of "query" document expression.
If you need to work with the value of another field in the document then you work it out with $project first, as in:
db.collection.aggregate([
{ "$project": {
"fieldMath": { "$subtract": [ "$fieldOne", "$fieldTwo" ] }
}},
{ "$match": { "fieldMath": { "$gt": 2 } }}
])
For any other purpose you really want to evaluate the value "outside" the pipeline.
The above answers the question you asked, but this answers the question you didn't ask.
Your pipeline doesn't make much sense, since grouping on the "timestamp" alone is unlikely to group anything: the values have millisecond accuracy, so even very active systems will see only a handful of duplicates at best.
It appears you are looking for the math to group by "day", which you can do like this:
db.collection.aggregate([
{ "$group": {
"_id": {
"$subtract": [
{ "$subtract": [ "$timestamp", new Date(0) ] },
{ "$mod": [
{ "$subtract": [ "$timestamp", new Date(0) ] },
1000 * 60 * 60 * 24
]}
]
},
"total": { "$sum": "$total" }
}}
])
That "rounds" your timestamp value to a single day and has a much better chance of "aggregating" something than you would otherwise have.
Or you can use the "date aggregation operators" to do much the same thing with a composite key.
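A sketch of that composite-key variant with the date aggregation operators:
db.collection.aggregate([
{ "$group": {
"_id": {
"year": { "$year": "$timestamp" },
"month": { "$month": "$timestamp" },
"day": { "$dayOfMonth": "$timestamp" }
},
"total": { "$sum": 1 }
}}
])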
So if you want to "query" then it evaluates externally. If you want to work on a value "within the document" then you must do so in either a $project or $group pipeline stage.
The $subtract operator is an aggregation expression, not a query operator; it is only available in stages that accept expressions, such as $project. So your options are:
(not recommended) Add a $project step before your $match step to compute the value for the following match step. I would not recommend this, because the operation has to be performed on every single document in the collection and prevents the database from using an index on the timestamp field, so it could cost you a lot of performance.
(recommended) Generate the Date you want to match against in the shell / in your application: create a new Date() object, subtract 24 hours, store it in a variable, and perform your second query using that variable, as in the sketch below.
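A minimal sketch of that second option in the shell (same pipeline as the working query from the question, with the cutoff hoisted into a variable):
// compute the cutoff once, outside the pipeline
var cutoff = new Date(Date.now() - 24 * 60 * 60 * 1000);
db.coll.aggregate([
{ $match: { timestamp: { $gte: cutoff } } },
{ $group: { _id: { timestamp: "$timestamp" }, total: { $sum: 1 } } },
{ $project: { _id: 0, timestamp: "$_id.timestamp", total: "$total" } },
{ $sort: { timestamp: -1 } }
])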