Execution time of a query - MongoDB - mongodb

I have two collections: coach and team.
Coach collection contains information about coaches like name, surname, age and an array coached_Team that contains the _id of the team that a coach coached.
The team collection contains data about teams like _id, common name, official name, country, championship....
If I want to find, for example, the official name of all teams coached by Allegri, I have to do two queries, the first on coach collection:
>var x = db.coach.find({surname:"Allegri"},{_id:0, "coached_Team.team_id":1})
>var AllegriTeams
>while(x.hasNext()) AllegriTeams=x.next()
{
"coached_Team" : [
{
"team_id" : "Juv.26"
},
{
"team_id" : "Mil.74"
},
{
"team_id" : "Cag.00"
}
]
}
>AllegriTeams=AllegriTeams.coached_Team
[
{
"team_id" : "Juv.26"
},
{
"team_id" : "Mil.74"
},
{
"team_id" : "Cag.00"
}
]
And then I have to execute three queries on team collection:
> db.team.find({ _id:AllegriTeams[0].team_id}, {official_name:1,_id:0})
{official_name : "Juventus Football Club S.p.A."}
> db.team.find({ _id:AllegriTeams[1].team_id}, {official_name:1,_id:0})
{official_name : "Associazione Calcio Milan S.p.A"}
> db.team.find({ _id:AllegriTeams[2].team_id}, {official_name:1,_id:0})
{official_name:"Cagliari Calcio S.p.A"}
Now suppose the team and coach collections each hold about 100k documents. The first query, on the coach collection, takes about 71 ms plus the time of the while loop. The three queries on the team collection, measured with cursor.explain("executionStats"), each report 0 ms. I don't understand why these queries take 0 ms.
I need the executionTimeMillis of these three queries to get the total execution time of the query "find official names of all teams coached by Allegri". I want to add the execution time of the query on the coach collection (71 ms) to the execution time of these three. If the time of these three queries is 0, what can I say about the execution time of the overall query?

I think the more important observation here is that 71ms is a long time for a simple fetch of one item. Looks like your "surname" field needs an index. The other "three" queries are simple lookups of a primary key, which is why they are relatively fast.
db.coach.createIndex({ "surname": 1 })
If that surname is actually "unique" then add that too:
db.coach.createIndex({ "surname": 1 },{ "unique": true })
You can also simplify your "three" queries into one by mapping the array and applying the $in operator:
var teamIds = [];
db.coach.find(
{ "surname": "Allegri" },
{ "_id":0, "coached_Team.team_id":1 }
).forEach(function(coach) {
teamIds = coach.coached_Team.map(function(team) {
return team.team_id }).concat(teamIds);
});
db.team.find(
{ "_id": { "$in": teamIds }},
{ "official_name": 1, "_id": 0 }
).forEach(function(team) {
printjson(team);
});
The overall execution time should then drop considerably, and the overhead of multiple operations is reduced to just the two queries required.
Also remember that, regardless of what the execution plan stats say, the more queries you send to the server, the longer the overall real-world execution takes, because every request and every retrieval of data adds a round trip. So it is best to keep the number of operations as small as possible.
It would therefore be even more logical, if you need this information regularly, to store the "coach name" on the "team" document itself (and index that field), which leads to the fastest possible response with only a single query operation.
It's easy to get caught up in observing execution stats. But really, think of what is "best" and "fastest" as a pattern for the sort of queries you want to do.
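As for adding up the reported times: executionTimeMillis is given in whole milliseconds, so a primary-key lookup that finishes in under a millisecond would be expected to show up as 0. A minimal sketch of summing the values follows. The helper name totalExecutionMillis and the trimmed explain objects are made up for illustration; only the 71 and 0 come from the question.

```javascript
// Sum the server-side time reported by several explain("executionStats") results.
// totalExecutionMillis is a hypothetical helper, not part of the mongo shell.
function totalExecutionMillis(explainResults) {
  return explainResults.reduce(function (sum, result) {
    return sum + result.executionStats.executionTimeMillis;
  }, 0);
}

// Trimmed stand-ins for real explain output (values taken from the question).
var coachExplain = { executionStats: { executionTimeMillis: 71 } };
var teamExplain = { executionStats: { executionTimeMillis: 0 } };

console.log(totalExecutionMillis([coachExplain, teamExplain])); // 71
```

The sum only covers time spent inside the server; network round trips and the client-side loop are not included, which is why fewer operations still matter.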

Related

Next.js/MongoDB - Query Optimization

I am building a website using Next.js and MongoDB. On one of my pages, I have implemented filters to help search for products. To retrieve and update the filters (updating the item count each time a filter changes), I have an API endpoint which queries my MongoDB collection. This collection contains ~200,000 items. Each item has several fields such as brand, model, place, etc.
I have 9 fields which I use to filter, and each must be fetched through my API every time there's a change. Therefore I have 9 queries running through my API, one for each field/filter, and each query on MongoDB looks like:
var models = await db_collection
.aggregate([
{
$match: {
$and: [filter],
},
},
{
$group: { _id: '$model', count: { $sum: 1 } },
},
{ $sort: { _id: 1 } },
])
.toArray();
The problem is that, with 9 queries running, updating the page takes ~4 seconds (mainly due to the queries), which is too long. I would like to get below 1 second. I would like to know if there is a good practice I am missing, such as doing one query instead of one per filter, or an optimization on my database.
Thank you,
I have tried using a $project stage before $group in the aggregation pipeline to reduce the number of fields returned, and using distinct followed by a sort instead of aggregate, but none of these seems to improve efficiency.
EDIT :
As suggested by R2D2, I am posting the structure of a document on MongoDB in my collection :
{
_id : ObjectId('example_id')
source : string
date : date
brand : string
family : string
model : string
size : string
color : string
condition : string
contact : string
SKU : string
}
Depending on the page, I query the unique values of each field of interest (source, date, brand, family, model, size, color, condition, contact) and their counts depending on the filters (e.g. the count for each unique value of model for the selected brands). I also query documents based on specific values of these fields.
As mentioned, your indexes are important, and if you are querying by those fields I recommend creating compound indexes; see here for index optimisation: https://learnmongodbthehardway.com/schema/indexes/
As far as the aggregation pipeline goes, nothing is out of the ordinary, but this specific aggregation only returns the number of items per model matching the criteria, not the matching documents. If that is all the data you need, you might find it useful to create a new collection in which you pre-calculate the counts for common searches daily (how many items have the color black, ...). This way, when the page loads, you don't have to look through your 200k+ items, just your pre-calculated statistics collection. Schedule a cron task or use a lambda function to invoke a route on your API that calculates all your stats once a day and upserts them into the new collection.
Also, I believe the explicit $and is unnecessary, since you can use the implicit $and. You can query with an object like:
{
color : {$in : ['BLACK', 'BLUE']},
size : 3
}
rather than :
[{color : 'BLACK'}, {color : 'BLUE'}, {size : 3}]
Reserve the explicit $and for when you really need it.
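On the "one query instead of nine" part of the question, one option worth trying is the $facet stage, which runs several sub-pipelines over the same matched documents in a single aggregation. The sketch below only builds the pipeline. The helper name buildFacetPipeline, the example filter and the chosen field list are assumptions for illustration, and whether it gets you under one second would have to be measured.

```javascript
// Build one aggregation pipeline that returns value counts for several fields at once.
// buildFacetPipeline is a hypothetical helper; filter and fields are supplied by the caller.
function buildFacetPipeline(filter, fields) {
  var facets = {};
  fields.forEach(function (field) {
    // Same $group / $sort as the per-field query in the question.
    facets[field] = [
      { $group: { _id: "$" + field, count: { $sum: 1 } } },
      { $sort: { _id: 1 } }
    ];
  });
  // The shared $match runs once, before the facets, so it can still use an index.
  return [{ $match: filter }, { $facet: facets }];
}

var pipeline = buildFacetPipeline(
  { color: { $in: ["BLACK", "BLUE"] } }, // example filter using the implicit $and form
  ["brand", "model", "size"]
);
console.log(JSON.stringify(pipeline, null, 2));
// In the API route, assuming the db_collection handle from the question:
// var counts = await db_collection.aggregate(pipeline).toArray();
```

The result is a single document with one array of { _id, count } entries per field, so the page needs one round trip instead of nine.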

$addFields with subQuery in MongoDB aggregation

I have 2 collections: Chargers and Reservation. I would like for the reservations in the next 7 days to be appended as a field when querying for chargers (i.e. time range from now until the same day next week).
The model for chargers looks like this:
{
"name":"charger 2",
"address":"test, test, USA",
"current_type": 1,
"charge_level" : 2,
"plug_type" : 1,
}
and the reservation model looks like this:
{
"charger_id": ObjectId("test ID"),
"start_time": "31-01-2021",
"end_time": "1-02-2021"
}
I found the $addFields stage, which adds a field within an aggregation pipeline, but I was wondering if I could use $addFields with a "subquery". Essentially, for each charger that matches, get all the reservations within a given time range where the charger_id is the one from the match, and add the array of reservations as a field. The resulting model would look like this:
{
"name":"charger 2",
"address":"test, test, USA",
"current_type": 1,
"charge_level" : 2,
"plug_type" : 1,
"reservations" : [
...
]
}
The current way of getting the data is with 2 queries at the application level, but that becomes quite taxing with the network latencies.
Query 1 - chargers:
{"_id":<charger_id>}
once I get the result, I query again.
Query 2 - Reservations:
{
"charger_id": <charger_id>,
"start_time" : {
"$lte" : <date in a week>
}
}
This is not too bad with a get by ID, because I already have the ID before querying for the charger in the first place, but with a getAll, or any other query that doesn't query by ID, it can get really taxing.
To bring in data from another collection you need to use $lookup. $addFields can be used to add fields based on data you already have in the result set.
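A minimal sketch of such a $lookup, using the let/pipeline form so the date range can be applied inside the join. It assumes the collection is named "reservations" and that start_time is stored as a Date rather than as the strings shown in the question; the helper name chargersWithReservations is also made up.

```javascript
// Build a pipeline that attaches each charger's reservations for the next 7 days.
// Assumptions: collection name "reservations", start_time stored as a Date.
function chargersWithReservations(chargerMatch, now) {
  var weekFromNow = new Date(now.getTime() + 7 * 24 * 60 * 60 * 1000);
  return [
    { $match: chargerMatch },
    { $lookup: {
        from: "reservations",
        let: { chargerId: "$_id" },
        pipeline: [
          { $match: { $expr: { $and: [
            { $eq: ["$charger_id", "$$chargerId"] },
            { $gte: ["$start_time", now] },
            { $lte: ["$start_time", weekFromNow] }
          ] } } }
        ],
        as: "reservations"
    } }
  ];
}

var pipeline = chargersWithReservations({ plug_type: 1 }, new Date());
console.log(JSON.stringify(pipeline, null, 2));
// In the shell, assuming a "chargers" collection: db.chargers.aggregate(pipeline)
```

Each charger in the output then carries a "reservations" array, which is empty when nothing falls in the range.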

Return only matching subdocs with Mongo aggregation

I have a schema with subdocs.
// Schema
var company = {
_id: ObjectId,
publish: Boolean,
divisions: {
employees: [ObjectId]
}
};
I need to find all the subdocs (divisions) that match my query. It appears that I have to use 2 matches - one to filter out initial docs and a second one to filter out the matching subdocs from the resulting $unwind operation. Is there a more efficient way?
// Query
this.aggregate({
$match: {
'publish': 1,
'divisions.employees': new ObjectId(userid)
}
}, {
$unwind: '$divisions'
}, {
$match: {
'divisions.employees': new ObjectId(userid)
}
});
I found this ticket but I am unsure this does what I need.
Doing both matches is the right thing here. You could eliminate the first match stage and just unwind, but having an initial $match allows you to narrow down the pipeline to exactly those documents that will produce at least one output document (i.e. those documents for which publish : true and some employees ObjectId matches the given ObjectId). You will be able to use indexes, like an index on { "publish" : 1, "divisions.employees" : 1 }, to perform the first match stage quickly.
However, you should ask yourself why you are using the aggregation pipeline here and if it is appropriate. Will you commonly be querying for a given employee that's part of a company with publish : 1? Is this one of the main queries for your use case? If it's infrequent or not critical then the aggregation is a good way to do it. Otherwise, you should reconsider the schema. You could make this query easy with a schema like
{
"_id" : ObjectId,
"publish" : Boolean,
"company" : (unique identifier, possibly a String or ObjectId)
}
that models employees as documents and denormalizes company information into the employee document. Then your query is as easy as
db.employees.find({ "_id" : ObjectId(userid), "publish" : true })
which will be much quicker than the aggregation (not that the aggregation will be slow; this is just relatively quicker). I'm not telling you to do it this way - use your own knowledge of your use case to make the right call.
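If you stay with the aggregation on the original company schema, a possible variation on the $unwind plus second $match is to trim the array in place with $filter. This is only a sketch: it assumes divisions is an array of subdocuments that each have an employees array, and a server version that supports $filter and the aggregation form of $in. The helper name matchingDivisionsPipeline is made up.

```javascript
// Keep only the divisions whose employees array contains the given user id.
// Unlike $unwind, this returns one document per company with a trimmed divisions array.
function matchingDivisionsPipeline(userId) {
  return [
    // Same initial $match as in the question, so an index can still be used.
    { $match: { publish: true, "divisions.employees": userId } },
    { $project: {
        divisions: {
          $filter: {
            input: "$divisions",
            as: "division",
            cond: { $in: [userId, "$$division.employees"] }
          }
        }
    } }
  ];
}

// In the shell the argument would be ObjectId(userid); a string stands in here.
var pipeline = matchingDivisionsPipeline("someUserId");
console.log(JSON.stringify(pipeline, null, 2));
```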

1 document with updates vs Many smaller and inserting

I need to develop a data set for users which stores their favourite items - maybe 5% of users will have favourites, and for those perhaps 5-10 favourites on average, with a max of 50. Almost every user will have a "get favourites" call happen, regardless of whether they have any, but will probably add favourites infrequently.
My assumption is: There will probably be 100x more "get favourites" than "add/post favourite".
Would it be better to have this structure in mongo, which may slow inserts (since it needs to update 1 document per user), but could be faster to retrieve all.
{
_id : 123456, (the user id)
favourites : [
{ item_id : 43563, created_date : ... },
{ item_id : 31232, created_date : ... },
{ item_id : 23472, created_date : ... }
]
}
Or 1 document per favourite
{
_id: ...,
user_id : 123456,
item_id : 43563,
created_date:...
}
{
_id: ...,
user_id : 123456,
item_id : 31232,
created_date:...
}
{
_id: ...,
user_id : 123456,
item_id : 23472,
created_date:...
}
The second structure is probably more flexible for future requirements change, but I assume the first structure would localise all the data in one area on a disk and may be much quicker for reads.
Then again, I'm not sure whether changing the size of a document (through many updates) may have a detrimental effect (i.e. at a low level, would it have to move the document around on disk, or would it fragment the data anyway, since it may not preallocate enough space for it on storage on first insert)?
The question is: Is one method recommended, or significantly more performant than the other?
One way to design a Mongo collection is to think of the way in which the data is most likely to be used and design it for that purpose. In your case, users will query favourites much more frequently than they add them. Therefore the collection should be designed to optimise this query.
With this in mind, the first option is the better of the two. However, you might want to consider a slight modification to that structure.
As you have said the getFavourites method will be called for all users but will only return a list of favourites for 5% of users. This call will have to retrieve the favourites array and determine if it has content. While this does not cost too much you could pre-calculate this call by adding an additional field that is true only if the user has favourites. Therefore it will only be necessary to query this field and then only query for favourites if the value returned is true.
I imagine a structure as follows:
{
_id : 123456, (the user id),
hasFavourites: 1,
favourites : [
{ item_id : 43563, created_date : ... },
{ item_id : 31232, created_date : ... },
{ item_id : 23472, created_date : ... }
]
}
This document has favourites so the field hasFavourites is 1, if it didn't it would be 0.
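The flag is only useful if it stays in step with the array, so one way to maintain it is to set it in the same update that pushes the favourite. A minimal sketch, assuming a collection called "users"; the helper name addFavouriteOp is made up for illustration.

```javascript
// Build the arguments for a single update that adds a favourite and sets the flag.
// Both changes are part of one document update, so they cannot drift apart.
function addFavouriteOp(userId, itemId, createdDate) {
  return {
    filter: { _id: userId },
    update: {
      $push: { favourites: { item_id: itemId, created_date: createdDate } },
      $set: { hasFavourites: 1 }
    },
    // Create the user's document on the first favourite if it does not exist yet.
    options: { upsert: true }
  };
}

var op = addFavouriteOp(123456, 43563, new Date());
console.log(JSON.stringify(op, null, 2));
// In the shell, assuming a "users" collection:
// db.users.updateOne(op.filter, op.update, op.options)
```

Removing the last favourite would need the reverse step (setting hasFavourites back to 0), which is not shown here.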

how do I do 'not-in' operation in mongodb?

I have two collections - shoppers (everyone in shop on a given day) and beach-goers (everyone on beach on a given day). There are entries for each day, and person can be on a beach, or shopping or doing both, or doing neither on any day. I want to now do query - all shoppers in last 7 days who did not go to beach.
I am new to Mongo, so it might be that my schema design is not appropriate for nosql DBs. I saw similar questions around join and in most cases it was suggested to denormalize. So one solution, I could think of is to create collection - activity, index on date, embed actions of user. So something like
{
user_id
date
actions {
[action_type, ..]
}
}
Insertion now becomes costly, as now I will have to query before insert.
A few suggestions.
Figure out all the queries you'll be running, and all the types of data you will need to store. For example, do you expect to add activities in the future or will beach and shop be all?
Consider how many writes vs. reads you will have and which has to be faster.
Determine how your documents will grow over time to make sure your schema is scalable in the long term.
Here is one possible approach, if you will only have these two activities ever. One record per user per day.
{ user: "user1",
date: "2012-12-01",
shopped: 0,
beached: 1
}
Now your query becomes even simpler, whether you have two or ten activities.
When new activity comes in you always have to update the correct record based on it.
If you were thinking you could just append a record to your collection indicating user, date, activity then your inserts are much faster but your queries now have to do a LOT of work querying for both users, dates and activities.
With proposed schema, here is the insert/update statement:
db.coll.update({"user":"username", "date": "somedate"}, {$inc:{"shopped":1}}, true)
What that's saying is: "for username on somedate, increment their shopped attribute by 1, and create the record if it doesn't exist" (an "upsert"; that's the last 'true' argument).
Here is the query for all users on a particular day who did activity1 more than once but didn't do any of activity2.
db.coll.find({"date":"somedate","shopped":0,"danced":{$gt:1}})
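For the original question (everyone who shopped in the last 7 days but never went to the beach in that period), a single-day find is not enough with one record per user per day, because the days have to be combined per user. A sketch of one way to do that with an aggregation follows. It assumes dates are stored as "YYYY-MM-DD" strings as in the example record, so that string comparison orders them correctly; the helper name shoppedButNotBeached is made up.

```javascript
// Build a pipeline: users with at least one shopping day and no beach day in the range.
// Assumes date is a "YYYY-MM-DD" string, which sorts in calendar order.
function shoppedButNotBeached(startDate, endDate) {
  return [
    { $match: { date: { $gte: startDate, $lte: endDate } } },
    // Add up the per-day counters for each user.
    { $group: {
        _id: "$user",
        shopped: { $sum: "$shopped" },
        beached: { $sum: "$beached" }
    } },
    { $match: { shopped: { $gt: 0 }, beached: 0 } }
  ];
}

var pipeline = shoppedButNotBeached("2012-11-25", "2012-12-01");
console.log(JSON.stringify(pipeline, null, 2));
// In the shell: db.coll.aggregate(pipeline)
```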
Be wary of picking a schema where a single document can have continuous and unbounded growth.
For example, storing everything in a users collection where the array of dates and activities keeps growing will run into this problem. See the highlighted section here for explanation of this - and keep in mind that large documents will keep getting into your working data set and if they are huge and have a lot of useless (old) data in them, that will hurt the performance of your application, as will fragmentation of data on disk.
Remember, you don't have to put all the data into a single collection. It may be best to have a users collection with a fixed set of attributes of that user, where you track how many friends they have or other semi-stable information about them, and also a user_activity collection where you add a record per user per day listing the activities they did. The amount of normalizing or denormalizing of your data is very tightly coupled to the types of queries you will be running on it, which is why figuring out what those are is the first suggestion I made.
Insertion now becomes costly, as now I will have to query before insert.
Keep in mind that even with RDBMS, insertion can be (relatively) costly when there are indices in place on the table (ie, usually). I don't think using embedded documents in Mongo is much different in this respect.
For the query, as Asya Kamsky suggests, you can use the $nin operator to find everyone who didn't go to the beach. E.g.:
db.people.find({
actions: { $nin: ["beach"] }
});
Using embedded documents probably isn't the best approach in this case though. I think the best would be to have a "flat" activities collection with documents like this:
{
user_id
date
action
}
Then you could run a query like this:
var start = new Date(2012, 5, 27);
var end = new Date(2012, 6, 3);
db.activities.find({
date: {$gte: start, $lt: end },
action: { $in: ["beach", "shopping" ] }
});
The last step would be on your client driver, to find user ids where records exist for "shopping", but not for "beach" activities.
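That client-side step can be sketched as below, assuming the query results have been read into an array of { user_id, action } records; the function name shoppersWhoSkippedBeach and the sample records are made up for illustration.

```javascript
// Given activity records, return the user ids that have "shopping" but no "beach".
function shoppersWhoSkippedBeach(records) {
  var shopped = new Set();
  var beached = new Set();
  records.forEach(function (record) {
    if (record.action === "shopping") shopped.add(record.user_id);
    if (record.action === "beach") beached.add(record.user_id);
  });
  return Array.from(shopped).filter(function (userId) {
    return !beached.has(userId);
  });
}

// Sample data standing in for db.activities.find(...).toArray()
var records = [
  { user_id: 1, action: "shopping" },
  { user_id: 1, action: "beach" },
  { user_id: 2, action: "shopping" },
  { user_id: 3, action: "beach" }
];
console.log(shoppersWhoSkippedBeach(records)); // [ 2 ]
```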
One possible structure is to use an embedded array of documents (a users collection):
{
user_id: 1234,
actions: [
{ action_type: "beach", date: "6/1/2012" },
{ action_type: "shopping", date: "6/2/2012" }
]
},
{ another user }
Then you can do a query like this, using $elemMatch to find users matching certain criteria (in this case, people who went shopping in the last three days):
var start = new Date(2012, 6, 1);
db.people.find( {
actions : {
$elemMatch : {
action_type : { $in: ["shopping"] },
date : { $gt : start }
}
}
});
Expanding on this, you can use the $and operator to find all people who went shopping, but did not go to the beach, in the past three days:
var start = new Date(2012, 6, 1);
db.people.find( {
$and: [
{ actions : {
$elemMatch : {
action_type : { $in: ["shopping"] },
date : { $gt : start }
}
} },
{ actions : {
$not: {
$elemMatch : {
action_type : { $in: ["beach"] },
date : { $gt : start }
}
}
} }
]
});