Aggregation is very slow - MongoDB

I have a collection with a structure similar to this.
{
  "_id": ObjectId("59d7cd63dc2c91e740afcdb"),
  "dateJoined": ISODate("2014-12-28T16:37:17.984Z"),
  "activatedMonth": 5,
  "enrollments": [
    { "month": -10, "enrolled": "00" },
    { "month": -9,  "enrolled": "00" },
    { "month": -8,  "enrolled": "01" },
    // other months
    { "month": 8,   "enrolled": "11" },
    { "month": 9,   "enrolled": "11" },
    { "month": 10,  "enrolled": "00" }
  ]
}
month in the enrollments sub-documents is relative to dateJoined.
activatedMonth is the month of activation, also relative to dateJoined, so it will differ from document to document.
I am using the MongoDB aggregation framework to process queries like "Find all documents that are enrolled from 10 months before activation to 25 months after activation" (months being relative to dateJoined).
The "enrolled" values 01, 10 and 11 are considered enrolled, and 00 is considered not enrolled. For a document to be considered enrolled, it must be enrolled for every month in the range.
I apply whatever filters I can in the $match stage, but in most cases there are none. In the $project stage I count the not-enrolled months in the requested range; if that count is zero, the document is enrolled.
Below is the query I am using. It takes 3 to 4 seconds to finish, roughly the same with or without the $group stage. My data is relatively small (0.9 GB): 41K documents in total, with approximately 13 million sub-documents.
I need to reduce the processing time. I tried creating an index on enrollments.month and enrollments.enrolled, but it is of no use; I think that is because the $project stage can't use indexes. Am I right?
Is there anything else I can do to the query or the collection structure to improve performance?
let startMonth = -10;
let endMonth = 25;

mongoose.connection.db.collection("collection").aggregate([
  { $match: filters },
  {
    $project: {
      _id: 0,
      // number of not-enrolled months inside the requested range
      enrollments: {
        $size: {
          $filter: {
            input: "$enrollments",
            as: "enrollment",
            cond: {
              $and: [
                { $gte: ["$$enrollment.month", { $add: [startMonth, "$activatedMonth"] }] },
                { $lte: ["$$enrollment.month", { $add: [endMonth, "$activatedMonth"] }] },
                { $eq: ["$$enrollment.enrolled", "00"] }
              ]
            }
          }
        }
      }
    }
  },
  // keep only the documents with no not-enrolled month in the range
  { $match: { enrollments: { $eq: 0 } } },
  { $group: { _id: null, enrolled: { $sum: 1 } } }
]).toArray(function (err, result) {
  // some calculations
});
Also, I definitely need the group stage, as I will be grouping the counts by a different field; I have omitted this for simplicity (a hypothetical example is sketched below).
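For illustration only, the final stage might look something like this; "region" is a made-up stand-in for whatever field I actually group on:
// hypothetical: "region" stands in for the real grouping field
{
  $group: {
    _id: "$region",          // one bucket per group value
    enrolled: { $sum: 1 }    // count of enrolled documents in each bucket
  }
}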
Edit:
I had missed a key detail in the initial post. I have updated the question with the actual use case for why I need a projection with a calculation.
Edit 2:
I converted this to just a count query to see how it performs (based on comments on this question by Neil Lunn).
My query:
mongoose.connection.db.collection("collection")
  .find({
    "enrollments": {
      "$not": {
        "$elemMatch": { "month": { "$gte": startMonth, "$lte": endMonth }, "enrolled": "00" }
      }
    }
  })
  .count(function (e, count) {
    console.log(count);
  });
This query takes 1.6 seconds. I tried it with the following indexes separately:
1. { 'enrollments.month': 1 }
2. { 'enrollments.month': 1 } and { 'enrollments.enrolled': 1 } -- two separate indexes
3. { 'enrollments.month': 1, 'enrollments.enrolled': 1 } -- a single compound index on both fields
The winning query plan does not use keys in any of these cases; it always does a COLLSCAN. What am I missing here?
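For reference, this is roughly how I create the compound index and inspect the plan (mongo shell syntax; the collection name is a placeholder):
// compound multikey index over the embedded array fields
db.collection.createIndex({ "enrollments.month": 1, "enrollments.enrolled": 1 })

// check which plan wins for the count query above
db.collection.find({
  "enrollments": {
    "$not": {
      "$elemMatch": { "month": { "$gte": -10, "$lte": 25 }, "enrolled": "00" }
    }
  }
}).explain("executionStats")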

Related

Is there a way to write a nested query in MongoDB?

I am new to MongoDB and I am trying to achieve the equivalent of the SQL query below in MongoDB:
SELECT ROUND(
         (SELECT COUNT(*) FROM INFODOCS WHERE ML_PRIORITY = 'HIGH' AND PROCESSOR_ID = userid)
       / (SELECT COUNT(*) FROM INFODOCS WHERE PROCESSOR_ID = userid) * 100
       ) AS EFFORTS
FROM DUMMY;
EFFORTS = (Total High Priority Infodocs / Total Infodocs for a given Processor) * 100
I tried to write an aggregation pipeline using $match, $group, and $count, but once I had the output of one subquery I could not find a way to compute the other subquery and then combine the two outputs into the final result.
The mongo-y way would be not to execute two different queries to get the two different counts, but to compute both dynamically with one query.
You can achieve this in many different ways; here is an example of how to use $cond inside $group to do a conditional sum.
db.collection.aggregate([
  { $match: { PROCESSOR_ID: "1" } },
  {
    $group: {
      _id: null,
      totalCount: { $sum: 1 },
      priorityHighCount: {
        $sum: { $cond: [{ $eq: ["$ML_PRIORITY", "HIGH"] }, 1, 0] }
      }
    }
  },
  {
    $project: {
      EFFORTS: {
        $round: [
          { $multiply: [{ $divide: ["$priorityHighCount", "$totalCount"] }, 100] }
        ]
      }
    }
  }
])

MongoDB: Aggregation ($sort) on a union of collections very slow

I have a few collections that I need to union and then query. However, this is very slow for some reason. The explain output is not that helpful, as it only tells me whether the first $match stage is indexed. I am using a pipeline like:
[
  {
    "$match": {
      "$and": [
        { ... }
      ]
    }
  },
  // repeat this chunk for each collection
  {
    "$unionWith": {
      "coll": "anotherCollection",
      "pipeline": [
        {
          "$match": {
            "$and": [
              { ... }
            ]
          }
        }
      ]
    }
  },
  // Then an overall limit / handle pagination for all the unioned results
  // UPDATE: realised the sort is the culprit
  { "$sort": { "createdAt": -1 } },
  { "$skip": 0 },
  { "$limit": 50 }
]
Is there a better way to do such a query? Does mongo do the unions in parallel maybe? Is there a "DB View" I can use to obtain a union of all the collections?
UPDATE: I just realised the runtime increases once I add the sort. I suspect it cannot use indexes because it's on a union?
Yes, there is a way, but it's not trivial: you need to change how you do pagination. It requires more engineering, as you have to keep track of the page not only by its number, but also by the last elements found.
If you paginate by filtering on a unique identifier (usually _id) with a cursor, you can do the filtering early.
!!! Important !!!
You will need to keep track of the last item found instead of skipping a number of elements. If you don't, you will lose track of the pagination and may never return some data, or return some twice, which is far worse than being slow.
[
  {
    "$match": {
      "$and": [
        { ... }
      ],
      "_id": { "$gt": lastKnownIdOfCollectionA } // this filters out everything you already saw, so no skip needed
    }
  },
  { "$sort": { "createdAt": -1 } }, // this sorting is indexed!
  { "$limit": 50 },                 // maybe you will take 0 but max 50, you don't care about the rest
  // repeat this chunk for each collection
  {
    "$unionWith": {
      "coll": "anotherCollection",
      "pipeline": [
        {
          "$match": {
            "$and": [
              { ... }
            ],
            "_id": { "$gt": lastKnownIdOfCollectionB } // this filters out everything you already saw, so no skip needed
          }
        },
        { "$sort": { "createdAt": -1 } }, // this sorting is indexed!
        { "$limit": 50 }                  // maybe you will take 0 but max 50, you don't care about the rest
      ]
    }
  },
  // At this point you have at most 100 elements, so an index is not needed for sorting :)
  { "$sort": { "createdAt": -1 } },
  { "$skip": 0 },
  { "$limit": 50 }
]
In this example, I do the early filtering on _id, which also encodes the createdAt timestamp. If the filtering is not about the creation date, you will have to decide which identifier suits your case best. Remember the identifier must be unique, but you can combine more than one value (e.g. createdAt + a randomized id). A rough sketch of how a client could carry the last-seen ids between pages follows.
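Here is a minimal, illustrative sketch of tracking those ids, assuming each branch tags its documents with a made-up _source field (e.g. via $addFields) so you can tell which collection a result came from; all names are made up:
// illustrative only: derive the next page token from the documents just returned
function buildNextPageToken(pageDocs, previousToken) {
  const lastIdFrom = (source) => {
    const docs = pageDocs.filter((d) => d._source === source);
    // if this page contained nothing from that collection, keep the previous cursor
    return docs.length ? docs[docs.length - 1]._id : previousToken[source];
  };
  return {
    collectionA: lastIdFrom("collectionA"), // passed back as lastKnownIdOfCollectionA
    collectionB: lastIdFrom("collectionB")  // passed back as lastKnownIdOfCollectionB
  };
}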

Is it possible in MongoDB to list the elements whose value is within 10% of another field?

I basically have a database where I record motorcycles and their mileage.
{
  "motorcycle": "A",
  "current_km": 4600,
  "review_km": 5000
},
{
  "motorcycle": "B",
  "current_km": 4000,
  "review_km": 5000
},
{
  "motorcycle": "C",
  "current_km": 4900,
  "review_km": 5000
},
{
  "motorcycle": "D",
  "current_km": 3000,
  "review_km": 5000
}
I have a field called current_km that holds the current mileage, and another field called review_km that specifies the mileage at which the review should be done. I want to flag a motorcycle once its current mileage (current_km) is within 10% of the review mileage (review_km).
So I would like to list the elements where current_km is greater than or equal to:
(review_km - (review_km * 0.10))
For example:
current_km = 4600;
review_km = 5000;
result = 5000 - (5000 * 0.10); // 4500
4600 (current_km) >= 4500 (result) // in this case it is shown
In my database this would return motorcycles A and C.
How can I do this? I don't know if it is possible to do directly in MongoDB.
You need to use aggregation with $subtract and $multiply:
$addFields adds new fields; here it generates the result field for the expression (review_km - (review_km * 0.10)) using $subtract and $multiply.
$match with $expr then checks whether current_km >= result; if so, the document is returned.
db.collection.aggregate([
  {
    $addFields: {
      result: {
        $subtract: ["$review_km", { $multiply: ["$review_km", 0.10] }]
      }
    }
  },
  {
    $match: {
      $expr: { $gte: ["$current_km", "$result"] }
    }
  }
])
Working Playground: https://mongoplayground.net/p/s2qenvuzLKF
Shorter version
If you don't want the result field in the response, you can combine the condition into $match and $addFields is no longer needed:
db.collection.aggregate([
  {
    $match: {
      $expr: {
        $gte: [
          "$current_km",
          { $subtract: ["$review_km", { $multiply: ["$review_km", 0.10] }] }
        ]
      }
    }
  }
])
Working Playground: https://mongoplayground.net/p/fii__3tTika

Each document contains 3 date fields; is it possible to return the documents ordered by whichever of those dates is closest to today?

I have a database with a structure like this:
{
  "bidding": "0ABF3",
  "dates": {
    "expiration_date_registration": ISODate("2020-08-24T23:51:25.000Z"),
    "expiration_date_tender": ISODate("2020-08-23T23:51:25.000Z"),
    "expiration_date_complaints": ISODate("2020-08-22T23:51:25.000Z")
  }
},
{
  "bidding": "0ABF4",
  "dates": {
    "expiration_date_registration": ISODate("2020-08-19T23:51:25.000Z"),
    "expiration_date_tender": ISODate("2020-07-25T23:51:25.000Z"), // the closest expiration date to today (this question was asked on July 24)
    "expiration_date_complaints": ISODate("2020-08-13T23:51:25.000Z")
  }
}
I have 3 fields, each containing a date: expiration_date_registration, expiration_date_tender, expiration_date_complaints.
When a request is made to my database, I would like the documents returned in order of expiration date, according to the dates contained in these 3 fields.
In this case the output should show the second document first (its "expiration_date_tender": ISODate("2020-07-25T23:51:25.000Z") is the date closest to today; this question was asked on July 24), and so on, with these 3 fields determining the order in which my documents are displayed.
Is this possible?
It's possible, but not very efficient. Here's an aggregation pipeline that calculates each date's distance from now, then takes and sorts by the minimum distance.
https://mongoplayground.net/p/HwittyBRjzZ
https://mongoplayground.net/p/EGp20ftjh-P
db.collection.aggregate([
  {
    $addFields: {
      datesArr: [
        "$dates.expiration_date_registration",
        "$dates.expiration_date_tender",
        "$dates.expiration_date_complaints"
      ]
    }
  },
  {
    $addFields: {
      distances: {
        $map: {
          input: "$datesArr",
          in: { $abs: { $subtract: ["$$NOW", "$$this"] } }
        }
      }
    }
  },
  {
    $addFields: {
      minDist: { $min: "$distances" }
    }
  },
  {
    $addFields: {
      closestDate: {
        $let: {
          vars: {
            closestDateIndex: { $indexOfArray: ["$distances", "$minDist"] }
          },
          in: { $arrayElemAt: ["$datesArr", "$$closestDateIndex"] }
        }
      }
    }
  },
  {
    $sort: { minDist: 1 }
  }
])
You can use $subtract to find the difference between today's date (the $$NOW aggregation variable) and "expiration_date_tender", and then use $sort to get the desired document on top. You can fetch the top document with the collection.findOne() function. A rough sketch of this follows.
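A minimal sketch of that approach, using the field names from the question; tenderDist is a made-up helper field:
db.collection.aggregate([
  {
    $addFields: {
      // distance, in milliseconds, between now and the tender expiration date
      tenderDist: { $abs: { $subtract: ["$$NOW", "$dates.expiration_date_tender"] } }
    }
  },
  { $sort: { tenderDist: 1 } } // closest expiration_date_tender first
])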

Aggregation pipeline slow with large collection

I have a single collection with over 200 million documents containing dimensions (things I want to filter on or group by) and metrics (things I want to sum or take averages of). I'm currently running into some performance issues and I'm hoping for advice on how I could optimize/scale MongoDB, or suggestions for alternative solutions. I'm running the latest stable MongoDB version with WiredTiger. The documents basically look like the following:
{
  "dimensions": {
    "account_id": ObjectId("590889944befcf34204dbef2"),
    "url": "https://test.com",
    "date": ISODate("2018-03-04T23:00:00.000+0000")
  },
  "metrics": {
    "cost": 155,
    "likes": 200
  }
}
I have three indexes on this collection, as there are various aggregations being run on it:
account_id
date
account_id and date
The following aggregation query fetches 3 months of data, summing cost and likes and grouping by week/year:
db.large_collection.aggregate(
  [
    {
      $match: { "dimensions.date": { $gte: new Date(1512082800000), $lte: new Date(1522447200000) } }
    },
    {
      $match: { "dimensions.account_id": { $in: [ "590889944befcf34204dbefc", "590889944befcf34204dbf1f", "590889944befcf34204dbf21" ] } }
    },
    {
      $group: {
        cost: { $sum: "$metrics.cost" },
        likes: { $sum: "$metrics.likes" },
        _id: {
          year: { $year: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } },
          week: { $isoWeek: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } }
        }
      }
    },
    {
      $project: {
        cost: 1,
        likes: 1
      }
    }
  ],
  {
    cursor: { batchSize: 50 },
    allowDiskUse: true
  }
);
This query takes about 25-30 seconds to complete and I'm looking to reduce this to 5-10 seconds at most. It's currently a single MongoDB node, no shards or anything. The explain output can be found here: https://pastebin.com/raw/fNnPrZh0 and the executionStats here: https://pastebin.com/raw/WA7BNpgA As you can see, MongoDB is using indexes, but there are still 1.3 million documents that need to be read. I currently suspect I'm facing some I/O bottlenecks.
Does anyone have an idea how I could improve this aggregation pipeline? Would sharding help at all? Is MongoDB the right tool here?
The following could improve performance if and only if precomputing dimensions within each record is an option.
If this type of query represents an important portion of the queries on this collection, then including additional fields to make these queries faster could be a viable alternative.
This hasn't been benchmarked.
One of the costly parts of this query probably comes from working with dates.
First, during the $group stage, when computing for each matching record the year and the ISO week associated with a specific time zone.
Then, to a lesser extent, during the initial filtering, when keeping dates from the last 3 months.
The idea would be to store the year and the ISO week in each record; for the given example this would be { "year" : 2018, "week" : 10 }. This way the _id key in the $group stage wouldn't need any computation (which would otherwise represent 1.3 million complex date operations).
In a similar fashion, we could also store the associated month in each record, which would be { "month" : "201803" } for the given example. This way the first match could be on months [2, 3, 4, 5] before applying a more precise and costlier filter on the exact timestamps. This would reduce the initial, costlier Date filtering on 200M records to a simple Int filtering.
Let's create a new collection with these new pre-computed fields (in a real scenario, these fields would be included during the initial insert of the records):
db.large_collection.aggregate([
  {
    $addFields: {
      "prec.year": { $year: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } },
      "prec.week": { $isoWeek: { date: "$dimensions.date", timezone: "Europe/Amsterdam" } },
      "prec.month": { $dateToString: { format: "%Y%m", date: "$dimensions.date", timezone: "Europe/Amsterdam" } }
    }
  },
  { "$out": "large_collection_precomputed" }
])
which will store these documents:
{
"dimensions" : { "account_id" : ObjectId("590889944befcf34204dbef2"), "url" : "https://test.com", "date" : ISODate("2018-03-04T23:00:00Z") },
"metrics" : { "cost" : 155, "likes" : 200 },
"prec" : { "year" : 2018, "week" : 10, "month" : "201803" }
}
And let's query:
db.large_collection_precomputed.aggregate([
  // Initial coarse filtering on dates (months) (on 200M documents):
  { $match: { "prec.month": { $gte: "201802", $lte: "201805" } } },
  {
    $match: {
      "dimensions.account_id": {
        $in: [
          ObjectId("590889944befcf34204dbf1f"), ObjectId("590889944befcf34204dbef2")
        ]
      }
    }
  },
  // Exact filtering on dates (costlier, but only on ~1.5M documents):
  { $match: { "dimensions.date": { $gte: new Date(1512082800000), $lte: new Date(1522447200000) } } },
  {
    $group: {
      // The _id is now extremely fast to compute:
      _id: { year: "$prec.year", week: "$prec.week" },
      cost: { $sum: "$metrics.cost" },
      likes: { $sum: "$metrics.likes" }
    }
  },
  ...
])
In this case we would use indexes on account_id and month (a rough sketch of the index creation is below).
Note: here, months are stored as Strings ("201803") since I'm not sure how to cast them to Int within an aggregation query, but it would be best to store them as Int when the records are inserted.
As a side effect, this will obviously make the collection heavier on disk/RAM.
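For completeness, here is an untested sketch of a supporting index (names taken from the example above), together with a possible integer conversion using the $toInt operator available since MongoDB 4.0:
// compound index supporting the account_id + month match stages above
db.large_collection_precomputed.createIndex({ "dimensions.account_id": 1, "prec.month": 1 })

// from MongoDB 4.0 onwards, the month could be stored as an Int instead of a String, e.g.:
// "prec.month": { $toInt: { $dateToString: { format: "%Y%m", date: "$dimensions.date", timezone: "Europe/Amsterdam" } } }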