Let's say I have a million entries in the db, each with 10 fields ("columns"). It seems to me that the more columns I search by, the faster the query runs - for example:
db.items.find( {
$and: [
{ field1: x },
{ field2: y },
{ field3: z}
]
} )
is faster than:
db.items.find( {
$and: [
{ field1: x },
{ field2: y }
]
} )
While I would love to say "Great, this makes total sense to me" - it doesn't. I just know it's happening in my particular case, and I'm wondering whether it is actually always true. If so, ideally, I would like to know why.
Furthermore, when creating multi-field indexes, does the order of the fields matter? For example, let's say I add a compound index:
db.collection.ensureIndex( { field1: 1, field2: 1, field3: 1 } )
Do these fields have any sort of order? If so, does the order matter? Let's say 90% of items will match the field1 criterion, but only 1% will match the field3 criterion. Would ordering them make a difference?
It may just be that the more restrictive query returns fewer documents, since 90% of items match the field1 criterion and only 1% match the field3 criterion. Check what explain() says for both queries.
Mongo has quite a good profiler. Give it a try. Play with different indexes and different queries. Not on a production db, of course.
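For instance, a minimal sketch of turning the profiler on (the 100 ms threshold is just an example value):

// Profile operations slower than 100 ms (level 1 = slow ops only)
db.setProfilingLevel(1, 100)
// Inspect the most recent profiled operations
db.system.profile.find().sort({ ts: -1 }).limit(5).pretty()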
The order of fields in the index matters. If you have an index { field1: 1, field2: 1, field3: 1 }
and a query db.items.find( { field2: x, field3: y }), the index won't be used at all,
and for the query db.items.find( { field1: x, field3: y }) it can be used only partially, for field1.
On the other hand, the order of conditions in the query does not matter:
db.items.find( { field1: x, field2: y }) is as good as
db.items.find( { field2: y, field1: x }) and will use the index in both cases.
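A quick way to check this yourself (a minimal sketch; collection and placeholder values assumed) is to compare the winning plans, which should both show an IXSCAN over the same index:

db.items.find({ field1: x, field2: y }).explain("queryPlanner")
db.items.find({ field2: y, field1: x }).explain("queryPlanner")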
When choosing an indexing strategy you should examine both the data and your typical queries. It may be the case that index intersection works better for you, and instead of a single compound index you get better overall performance with simple indexes like { field1: 1 }, { field2: 1 }, { field3: 1 } than with multiple compound indexes for different kinds of queries.
It is also important to check the index size and make sure it fits in memory, in most cases anyway.
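For example (a sketch against the items collection from the question):

// Total size of all indexes on the collection, in bytes
db.items.totalIndexSize()
// Per-index breakdown
db.items.stats().indexSizes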
It's complicated... MongoDB keeps recently accessed documents in RAM, and the query plan is calculated the first time a query is executed, so the second time you run a query may be much faster than the first time.
But, putting that aside, the order of a compound index does matter. In a compound index, you can use the index in the order it was created, a bit like opening a door, walking through and finding that you have more doors to open.
So, having two overlapping indexes set up, e.g.:
{ city: 1, building: 1, room: 1 }
AND
{ city: 1, building: 1 }
Would be a waste, because you can still search for all the rooms in a particular building using the first two levels (fields) of the { city: 1, building: 1, room: 1 } index.
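For example (hypothetical collection name), this query can be served by the { city: 1, building: 1, room: 1 } index through its { city, building } prefix:

db.rooms.find({ city: "Vienna", building: "B7" })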
Your intuition does make sense. If you had to find a particular room in a building, going straight to the right city, straight to the right building and then knowing the approximate place in the building will make it faster to find the room than if you didn't know the approximate place (assuming that there are lots of rooms). Take a look at levels in a B-Tree, e.g. the search visualisation here: http://visualgo.net/bst.html
It's not universally the case though; not all data is neatly distributed in a sort order. For example, English names or words tend to clump together under common letters - there aren't many words that start with the letter X.
The (free, online) MongoDB University developer courses cover indexes quite well, but the best way to find out about the performance of a query is to look at the results of the explain() method against your query to see if an index was used, or whether a collection was scanned (COLLSCAN).
db.items.find( {
$and: [
{ field1: x },
{ field2: y }
]
})
.explain()
I'm using MongoDB 4.0 on a MongoDB Atlas cluster (3 replicas - 1 shard).
Assume I have a collection that contains multiple documents.
Each of these documents holds an array of subdocuments that represent cities in a certain year, with additional information. An example document would look like this (I removed unnecessary information to simplify the example):
{_id: 123,
 cities: [
    {name: "vienna", year: 1985},
    {name: "berlin", year: 2001},
    {name: "vienna", year: 1985}
]}
I have a compound index on cities.name and cities.year. What is the fastest way to count the occurrences of name and year combinations?
I already tried the following aggregation:
[{$unwind: {
    path: '$cities'
}}, {$group: {
    _id: {
        name: '$cities.name',
        year: '$cities.year'
    },
    count: {
        $sum: 1
    }
}}, {$project: {
    count: 1,
    name: '$_id.name',
    year: '$_id.year',
    _id: 0
}}]
Another approach I tried was a map-reduce of the following form; the map-reduce performed a bit better (~30% less time needed).
map function:
function m() {
for (var i in this.cities) {
emit({
name: this.cities[i].name,
year: this.cities[i].year
},
1);
}
}
reduce function (I also tried replacing the sum with the array length, but surprisingly sum is faster):
function r(id, counts) {
return Array.sum(counts);
}
function call in the mongo shell:
db.test.mapReduce(m,r,{out:"mr_test"})
Now I was asking myself: is it possible to access the index directly? As far as I know it is a B+ tree that holds pointers to the relevant documents on disk, so from a technical point of view I think it would be possible to iterate through all leaves of the index tree and just count the pointers. Does anybody know if this is possible?
Does anybody know another way to solve this in a highly performant way? (It is not possible to change the design because of other dependencies in the software, and we are running this on a very big dataset.) Does anybody have experience solving such a task via shards?
The index will not be very helpful in this situation.
MongoDB indexes were designed for identifying documents that match given criteria.
If you create an index on {"cities.name": 1, "cities.year": 1},
This document:
{_id: 123,
 cities: [
    {name: "vienna", year: 1985},
    {name: "berlin", year: 2001},
    {name: "vienna", year: 1985}
]}
Will have 2 entries in the b-tree that refer to this document:
vienna|1985
berlin|2001
Even if it were possible to count the incidence of a specific key in the index, that count would not necessarily correspond to the number of matching array elements, since duplicate values within a single document produce only one index entry.
MongoDB does not provide a method to examine the raw entries in an index, and it explicitly refuses to use an index on a field containing an array for counting.
The MongoDB count command and helper functions all count documents, not elements inside of them. As you noticed, you can unwind the array and count the items in an aggregation pipeline, but at that point you've already loaded all of the documents into memory, so it's too late to make use of an index.
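If you stay with the aggregation pipeline, allowDiskUse at least keeps a large $group from hitting the memory limit; a sketch using the pipeline from the question:

db.test.aggregate([
    { $unwind: '$cities' },
    { $group: {
        _id: { name: '$cities.name', year: '$cities.year' },
        count: { $sum: 1 }
    } }
], { allowDiskUse: true })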
My MongoDB collection has this document structure:
{
_id: 1,
my_dict: {
my_key: [
{id: x, other_fields: other_values},
...
]
},
...
},
I need to update the array subdocuments very often, so an index on the id field seems like a good idea. Still, I have many documents (millions), but the arrays inside them are small (max ~20 elements). Would indexing still improve performance a lot, compared to the cost of maintaining the index?
PS: I'm not using the id as a key (which would make it a dict instead of an array), as I also often need to get the number of elements in "the array" ($size only works on arrays). I cannot use count, as I am using MongoDB 3.2.
Followup question: If it would make a very big difference, I could instead use a dict like so:
{id: {other_fields: other_values}}
and store the size myself in a field. What I dislike about this is that I would need another field and would have to update it myself (with possible errors, as I would need to use $inc each time I add/delete an item) instead of relying on "real" values. I would also have to handle the possibility that a key could be called _my_size, which would conflict with my logic. It would then look like this:
{
_id: 1,
my_dict: {
my_key: {
id: {other_fields: other_values},
_my_size: 1
},
},
},
Still not sure which is best for performance. I will need to update the subdocuments (via the id field) a lot, as well as compute the $size a lot (maybe 1/10 as often as the updates).
Which Schema/Strategy would give me a better performance? Or even more important, would it actually make a big difference? (possibly thousands of calls per second)
Update example:
update(
    {_id: 1, "my_dict.my_key.id": update_data_id},
    {$set: {"my_dict.my_key.$": update_data}}
)
Getting the size example:
aggregate(
    {$match: {_id: 1}},
    {$project: {_id: 0, nb_of_sub_documents: {$size: '$my_dict.my_key'}}}
)
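For reference, the index I'm considering would be created like this (collection name assumed):

db.my_collection.createIndex({ "my_dict.my_key.id": 1 })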
Given Users with an index on age:
{ name: 'Bob',
age: 21 }
{ name: 'Cathy',
age: 21 }
{ name: 'Joe',
age: 33 }
To get the output:
[
{ _id: 21,
names: ['Bob', 'Cathy'] },
{ _id: 33,
names: ['Joe'] }
]
Is it possible to sort, group and limit by age?
db.users.aggregate(
    [
        {
            $sort: {
                age: 1
            }
        },
        {
            $group: {
                _id: '$age',
                names: { $push: '$name' }
            }
        },
        {
            $limit: 10
        }
    ]
)
I did some research, but it's not clear if it is possible to sort first and then group. In my testing, the group loses the sort, but I don't see why.
If the group preserves the sort, then the sort and limit can greatly reduce the required processing. It only needs to do enough work to "fill" the limit of 10 groups.
So,
Does group preserve the sort order? Or do you have to group and then sort?
Is it possible to sort, group and limit doing only enough processing to return the limit? Or does it need to process the entire collection and then limit?
To answer your first question: $group does not preserve the order. There are open change requests which also highlight the background a little, but it doesn't look like the product will be changed to preserve the input documents' order:
https://jira.mongodb.org/browse/SERVER-24799
https://jira.mongodb.org/browse/SERVER-4507
https://jira.mongodb.org/browse/SERVER-21022
Two things can be said in general: you generally want to group first and then sort. The reason is that sorting fewer elements (which the grouping generally produces) is going to be faster than sorting all input documents.
Secondly, MongoDB is going to make sure to sort as efficiently and as little as possible. The documentation states:
When a $sort immediately precedes a $limit in the pipeline, the $sort operation only maintains the top n results as it progresses, where n is the specified limit, and MongoDB only needs to store n items in memory. This optimization still applies when allowDiskUse is true and the n items exceed the aggregation memory limit.
So this code gets the job done in your case:
collection.aggregate([{
    $group: {
        _id: '$age',
        names: { $push: '$name' }
    }
}, {
    $sort: {
        '_id': 1
    }
}, {
    $limit: 10
}])
EDIT following your comments:
I agree with what you say. And taking your logic a little further, I would go as far as saying: if $group were smart enough to use an index, it shouldn't even require a $sort stage at the start. Unfortunately, it's not (not yet, probably). As things stand today, $group will never use an index, and it won't take shortcuts based on the following stages ($limit in this case). Also see this link where someone ran some basic tests.
The aggregation framework is still pretty young, so I guess there is a lot of work being done to make the aggregation pipeline smarter and faster.
There are answers here on StackOverflow (e.g. here) where people suggest using an upfront $sort stage in order to "force" MongoDB to use an index somehow. However, this slowed down my tests (1 million records of your sample shape using different random distributions) significantly.
When it comes to the performance of an aggregation pipeline, $match stages at the start are what really help the most. If you can limit the total number of records that need to go through the pipeline from the beginning, then that's your best bet - obviously... ;)
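For example (a hypothetical age filter, just to illustrate the shape), a selective $match at the front lets the rest of the pipeline work on a fraction of the collection, and the $match itself can use the index on age:

db.users.aggregate([
    { $match: { age: { $gte: 18, $lte: 30 } } },
    { $group: { _id: '$age', names: { $push: '$name' } } },
    { $sort: { _id: 1 } },
    { $limit: 10 }
])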
I want to ask about some info related to findAndModify in MongoDB.
As far as I know, the query is "isolated by document".
This means that if I run 2 findAndModify operations like this:
{a:1}, {$set: {status: "processing", engine: 1}}
{a:1}, {$set: {status: "processing", engine: 2}}
and this query can potentially affect 2,000 documents, then, because there are 2 queries (2 engines), some documents may end up with engine:1 and others with engine:2.
I don't think findAndModify will isolate the "first query".
In order to isolate the first query, I need to use $isolated.
Is everything I have written correct?
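For reference, this is how I understand $isolated would be applied (collection name assumed; note that it only prevents other writers from interleaving with a multi-update - it does not make the update all-or-nothing, and it does not work on sharded collections):

db.items.update(
    { a: 1, $isolated: 1 },
    { $set: { status: "processing", engine: 1 } },
    { multi: true }
)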
UPDATE - scenario
The idea is to write a proximity engine.
The collection User has 1000-2000-3000 users, or millions.
1 - Order by nearest from a point "lng,lat"
2 - In NodeJS I make some computation that I CAN'T do in MongoDB
3 - Now I group the users into a "UserGroup" and write a bulk update
When I have 2000-3000 users, this process (from 1 to 3) takes time.
So I want to have multiple threads in parallel.
Parallel threads mean parallel queries.
This can be a problem, since Query3 can take some users of Query1.
If this happens, then at point (2) I don't have the nearest users but the nearest "for this query", because maybe another query has taken the rest of the users. This could mean that some users in New York get grouped with users in Los Angeles.
UPDATE 2 - scenario
I have a collection like this:
{location:[lng,lat], name:"1",gender:"m", status:'undone'}
{location:[lng,lat], name:"2",gender:"m", status:'undone'}
{location:[lng,lat], name:"3",gender:"f", status:'undone'}
{location:[lng,lat], name:"4",gender:"f", status:'done'}
What I should be able to do is create 'groups' of users by grouping the nearest ones together. Each group has 1 male + 1 female. In the example above, I'm expecting to have only 1 group (user1 + user3), since they are male + female and near each other (user-2 is also male, but is far away from user-3, and user-4 is also female but has status 'done', so it is already processed).
Now the group is created (only 1 group), so the 2 users are marked as 'done' and the other user-2 stays marked as 'undone' for a future operation.
I want to be able to manage 1000-2000-3000 users very fast.
UPDATE 3: from the community
Okay now. Can I please try to summarise your case. Given your data, you want to "pair" male and female entries together based on their proximity to each other. Presumably you don't want to do every possible match but just set up a list of general "recommendations", let's say 10 for each user, by the nearest location. Now I'd have to be stupid to not see the full direction of where this is going, but does this sum up the basic initial problem statement? Process each user, find their "pairs", mark them as "done" once paired and exclude them from other pairings once complete?
This is a non-trivial problem and can not be solved easily.
First of all, an iterative approach (which admittedly was my first one) may lead to wrong results.
Given we have the following documents
{
_id: "A",
gender: "m",
location: { longitude: 0, latitude: 1 }
}
{
_id: "B",
gender: "f",
location: { longitude: 0, latitude: 3 }
}
{
_id: "C",
gender: "m",
location: { longitude: 0, latitude: 4 }
}
{
_id: "D",
gender: "f",
location: { longitude: 0, latitude: 9 }
}
With an iterative approach, we now would start with "A" and calculate the closest female, which, of course would be "B" with a distance of 2. However, in fact, the closest distance between a male and a female would be 1 (distance from "B" to "C"). But even when we found this, that would leave the other match, "A" and "D", at a distance of 8, where, with our previous solution, "A" would have had a distance of only 2 to "B".
So we need to decide which way to go:
Naively iterate over the documents
Find the lowest sum of distances between matching individuals (which itself isn't trivial to solve), so that all participants together have the shortest travel.
Matching only participants within an acceptable distance
Do some sort of divide and conquer and match participants within a certain radius of a common landmark (say cities, for example)
Solution 1: Naively iterate over the documents
var users = db.collection.find(yourQueryToFindThe1000users);

// We can safely use an unordered op here,
// which has greater performance.
// Since we use the "done" array to keep track of
// the processed members, there is no drawback.
var pairs = db.pairs.initializeUnorderedBulkOp();
var done = [];

users.forEach(
    function(currentUser) {
        // Skip users already paired in this run.
        if (done.indexOf(currentUser._id) != -1) { return; }

        var genderToLookFor = (currentUser.gender === "m") ? "f" : "m";

        // Using the $near operator, the returned documents are
        // automatically sorted from nearest to farthest, and since
        // findAndModify returns only one document, we get the
        // closest matching partner.
        var nearPartner = db.collection.findAndModify({
            query: {
                status: "undone",
                gender: genderToLookFor,
                location: {
                    $near: {
                        $geometry: {
                            type: "Point",
                            coordinates: currentUser.location
                        }
                    }
                }
            },
            update: { $set: { status: "done" } },
            fields: { _id: 1 }
        });

        // Nobody left to match for this user.
        if (nearPartner === null) { return; }

        // Obviously, the current user is already processed.
        // However, we store it to simplify setting the
        // processed users to done.
        done.push(currentUser._id, nearPartner._id);

        // We have a pair, so we store it in a bulk operation.
        pairs.insert({
            _id: {
                a: currentUser._id,
                b: nearPartner._id
            }
        });
    }
)

// Write the found pairs
pairs.execute();

// Mark all that are unmarked by now as done
db.collection.update(
    {
        _id: { $in: done },
        status: "undone"
    },
    {
        $set: { status: "done" }
    },
    { multi: true }
)
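Note that the $near operator used above requires a geospatial index on the queried field; for GeoJSON points that would be a 2dsphere index, e.g.:

db.collection.createIndex({ location: "2dsphere" })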
Solution 2: Find the smallest sum of distances between matches
This would be the ideal solution, but it is extremely complex to solve. We need to take all members of one gender, calculate all distances to all members of the other gender, and iterate over all possible sets of matches. In our example it is quite simple, since there are only 4 combinations for any given gender. Thinking of it twice, this might be at least a variant of the traveling salesman problem (MTSP?). If I am right with that, the number of combinations should be
n!/2 for all n>2, where n is the number of possible pairs,
and hence
10!/2 = 1,814,400 for n=10
and an astonishing
25!/2 ≈ 7.755 × 10^24 for n=25.
That's 7.755 quadrillion (long scale) or 7.755 septillion (short scale).
While there are approaches to solving this kind of problem, the world record is somewhere in the range of 25,000 nodes using massive amounts of hardware and quite tricky algorithms. I think for all practical purposes, this "solution" can be ruled out.
Solution 3: Match participants within a radius of a common landmark
In order to prevent people being matched across unacceptable distances, and depending on your use case, you might want to match people by their distance to a common landmark (where they are going to meet, for example the next bigger city).
For our example, assume we have cities at [0,2] and [0,7]. The distance (5) between the cities hence has to be our acceptable range for matches. So we do a query for each city:
db.collection.find({
    status: "undone",
    location: {
        $near: {
            $geometry: {
                type: "Point",
                coordinates: [0, 2]
            },
            $maxDistance: 5
        }
    }
})
and iterate over the results naively. Since "A" and "B" would be the first in the result set, they would be matched and done. Bad luck for "C" here, as no girl is left for him. But when we do the same query for the second city he gets his second chance. Ok, his travel gets a bit longer, but hey, he got a date with "D"!
To find the respective distances, take a fixed set of cities (towns, metropolitan areas, whatever your scale is), order them by location, and set each city's radius to the larger of the two distances to its immediate neighbors. This way you get overlapping areas, so even when a match cannot be found in one place, it may be found in others.
IIRC, Google Maps allows you to grab the cities of a nation based on their size. An easier way would be to let people choose their respective city.
Notes
The code shown is not production ready and needs to be refined.
Instead of using "m" and "f" to denote gender, I suggest using 1 and 0: they can still be easily mapped, but need less space to store.
The same goes for status.
I think the last solution is the best, optimizing distances to some degree while keeping the chances of a match high.
I want to sort a collection of documents on the backend with Meteor in a publication. I want to make sure I index the collection properly, and I couldn't find any concrete guidance for the fields I want.
Basically I want to find all the documents in collection "S" that have field "a" equal to a specific string value. Then after I have all those documents, I want to sort them by three fields: b, c (descending), and d.
Would this be the right way to index it:
S._ensureIndex({ a: 1 });
S._ensureIndex({ b: 1, c: -1, d: 1 });
S.find({ a: "specificString"}, { sort: { b: 1, c: -1, d: 1 } });
As a bonus question: field "d" is the time added in milliseconds, and it would be highly unlikely that there is a duplicate document with all 3 fields being the same. Should I also include the unique option in the index?
S._ensureIndex({ b: 1, c: -1, d: 1 }, { unique: true });
Will adding the unique option help any with sorting and indexing performance?
I was also wondering what the general performance is for an indexed sort in Mongo. The docs say indexed sorts don't need to be done in memory because it is just doing a read. Does this mean the sort is instant?
Thanks for all your help :)
I was able to find my own answer here:
http://docs.mongodb.org/manual/tutorial/sort-results-with-indexes/#sort-and-non-prefix-subset-of-an-index
Basically the correct way to index for my example here is:
S._ensureIndex({ a: 1, b: 1, c: -1, d: 1 });
S.find({ a: "specificString"}, { sort: { b: 1, c: -1, d: 1 } });
-- As for my other sub-questions:
Because the sort uses the index here, there is no separate in-memory sort step: documents are simply read back in index order. However, do keep in mind that a 4-field index can be expensive in other ways.
Without looking this up for sure, my hunch is that including unique: true would not make a difference in sort performance, since the sort already comes for free from the index.