General question about Mongo query performance and the order of query arguments.
We have a collection for storing "files" metadata, which includes the file name and the status of the file (an integer code value). Only a small number of files in the collection will share the same name (maybe a few dozen at most), but there can be thousands of files with the same status.
If there is a Mongo query structured something like this:
db.getCollection('files').find( {
'$and': [
{ 'name': 'someFileName.csv' },
{ 'status': { '$in': [ 12, 6 ] } }
]
})
...would it perform any differently than the same query formatted like this:
db.getCollection('files').find( {
'$and': [
{ 'status': { '$in': [ 12, 6 ] } },
{ 'name': 'someFileName.csv' }
]
})
Which is to say: does the order of the $and clause arguments matter? Would scenario #1 perform better than scenario #2 since theoretically the file-name search would eliminate all but a few records? Or does Mongo operate in that manner under the covers?
No, the order of the fields in a query doesn't matter.
Also, due to query fields being implicitly "anded", these would also be equivalent to:
db.getCollection('files').find( {
'status': { '$in': [ 12, 6 ] },
'name': 'someFileName.csv'
})
and
db.getCollection('files').find( {
'name': 'someFileName.csv',
'status': { '$in': [ 12, 6 ] }
})
They're all treated the same by the query analyzer when determining the optimal way to execute the query.
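As a rough illustration of why these forms are logically interchangeable, here is a small pure-Python sketch. The `matches` helper is hypothetical: it only mimics equality, `$in`, and `$and` on a list of dicts, and it says nothing about how MongoDB's planner actually picks an index.

```python
def matches(doc, query):
    """Return True if doc satisfies a tiny subset of MongoDB filter syntax."""
    for key, cond in query.items():
        if key == "$and":
            # Every sub-filter must hold, so their order cannot change the result.
            if not all(matches(doc, sub) for sub in cond):
                return False
        elif isinstance(cond, dict) and "$in" in cond:
            if doc.get(key) not in cond["$in"]:
                return False
        elif doc.get(key) != cond:
            return False
    return True


# Made-up sample data standing in for the 'files' collection.
files = [
    {"name": "someFileName.csv", "status": 12},
    {"name": "someFileName.csv", "status": 3},
    {"name": "other.csv", "status": 6},
]

name_clause = {"name": "someFileName.csv"}
status_clause = {"status": {"$in": [12, 6]}}

name_first = {"$and": [name_clause, status_clause]}
status_first = {"$and": [status_clause, name_clause]}
implicit_and = {**status_clause, **name_clause}

results = [
    [d for d in files if matches(d, q)]
    for q in (name_first, status_first, implicit_and)
]
```

All three filters select the same documents here. On a real deployment, the plan that was actually chosen can be inspected with `explain()` on the cursor.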
I have a collection that is made up of companies. Each company has a "number_of_employees" as well as a subdocument of "offices" which includes "state_code" and "country_code". For example:
{
'_id': ObjectId('52cdef7c4bab8bd675297da5'),
'name': 'Technorati',
'number_of_employees': 35,
'offices': [
{'description': '',
'address1': '360 Post St. Ste. 1100',
'address2': '',
'zip_code': '94108',
'city': 'San Francisco',
'state_code': 'CA',
'country_code': 'USA',
'latitude': 37.779558,
'longitude': -122.393041}
]
}
I'm trying to get the number of employees per state across all companies. My latest attempt looks like:
db.research.aggregate([
{ "$match": {"offices.country_code": "USA" } },
{ "$unwind": "$offices" },
{ "$project": { "_id": 1, "number_of_employees": 1, "offices.state_code": 1 } }
])
But now I'm stuck on how to do the $group. Because number_of_employees is at the company level and not the office level, I want to split the employees evenly across the offices. For example, if Technorati has 5 offices in 5 different states, then each state would be allocated 7 employees.
In SQL I could do this easily enough using a windowed function to get average employees across offices by company and then summing those while grouping by state. I can't seem to find any clear examples of similar functionality in MongoDB though.
Note, this is for a school assignment, so the use of third-party libraries isn't feasible. Also, I'm hoping that this can all be done in a simple snippet of code, possibly even one call. I could certainly create new intermediate collections or do this in Python and process data there, but that's probably outside of the scope of the homework.
Anything to point me in the right direction would be greatly appreciated!
You are actually on the right track. You just need to derive an extra field, numOfEmpPerOffice, using $divide, and then $sum it when you $group by state.
db.collection.aggregate([
{
"$match": {
"offices.country_code": "USA"
}
},
{
"$addFields": {
"numOfEmpPerOffice": {
"$divide": [
"$number_of_employees",
{
"$size": "$offices"
}
]
}
}
},
{
"$unwind": "$offices"
},
{
$group: {
_id: "$offices.state_code",
totalEmp: {
$sum: "$numOfEmpPerOffice"
}
}
}
])
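To check the arithmetic outside the database, the same stages can be sketched in plain Python. This is only an emulation under assumed sample data (the company names and states below are made up); `employees_per_state` is a hypothetical helper, not part of any driver.

```python
from collections import defaultdict


def employees_per_state(companies, country="USA"):
    """Emulate $match, $addFields ($divide/$size), $unwind and $group."""
    totals = defaultdict(float)
    for company in companies:
        offices = company.get("offices", [])
        # $match: keep companies with at least one office in the country.
        if not any(o.get("country_code") == country for o in offices):
            continue
        # $addFields: number_of_employees divided by the number of offices.
        per_office = company["number_of_employees"] / len(offices)
        # $unwind + $group: add each office's share to its state's total.
        for office in offices:
            totals[office.get("state_code")] += per_office
    return dict(totals)


companies = [
    {"name": "FiveOffices", "number_of_employees": 35, "offices": [
        {"state_code": s, "country_code": "USA"}
        for s in ("CA", "NY", "TX", "WA", "FL")
    ]},
    {"name": "OneOffice", "number_of_employees": 10, "offices": [
        {"state_code": "CA", "country_code": "USA"},
    ]},
]

totals = employees_per_state(companies)
```

Note that, as in the pipeline, the match is on the company, not the office: a company with one USA office and several foreign offices still contributes its foreign offices to the groups. If that is not wanted, a second `$match` on `offices.country_code` after the `$unwind` would be one way to restrict it.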
I am trying to implement a search feature to MongoDB and this is the aggregate pipeline I am using:
[
{
'$search': {
'text': {
'query': 'albus',
'path': [
'first_name', 'email', 'last_name'
]
}
}
}, {
'$project': {
'_id': 1,
'first_name': 1,
'last_name': 1
}
}, {
'$limit': 5
}
]
The command returns documents that contain exactly albus or Albus, but returns nothing for queries like alb, albu, etc. In the demo video I watched here: https://www.youtube.com/watch?time_continue=8&v=kZ77X67GUfk, the instructor was able to search based on a substring.
The search index I am currently using is the default dynamic one.
How would I need to change my command?
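You need to use the autocomplete feature. The autocomplete operator takes a single field as its path rather than an array, so to cover several fields you combine one clause per field under compound, and your query will look like this:
{
$search: {
"compound": {
"should": [
{ "autocomplete": { "query": "albus", "path": "first_name" } },
{ "autocomplete": { "query": "albus", "path": "email" } },
{ "autocomplete": { "query": "albus", "path": "last_name" } }
]
}
}
}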
Note that first_name, email, and last_name all need to be mapped as the autocomplete type in the search index, so that a name like albus is indexed as a, al, alb, albu, albus. This will considerably increase your index size.
Another thing to consider is tweaking the maxGrams and tokenization parameters. Raising maxGrams lets very long names still work as expected, and changing tokenization (for example to nGram instead of the default edgeGram) allows substring matches such as lbu matching albus.
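To make the effect of these parameters concrete, here is a simplified sketch of how prefix (edge) grams and general n-grams could be generated for one term. It is an illustration only, not Atlas Search's actual tokenizer; `edge_grams` and `n_grams` are hypothetical helpers, and the `min_grams`/`max_grams` arguments merely mirror the idea behind minGrams and maxGrams.

```python
def edge_grams(term, min_grams=1, max_grams=15):
    """Prefixes of term, from min_grams up to max_grams characters long."""
    upper = min(len(term), max_grams)
    return [term[:n] for n in range(min_grams, upper + 1)]


def n_grams(term, min_grams=1, max_grams=15):
    """All substrings of term whose length is between min_grams and max_grams."""
    grams = []
    for size in range(min_grams, min(len(term), max_grams) + 1):
        for start in range(len(term) - size + 1):
            grams.append(term[start:start + size])
    return grams


prefix_tokens = edge_grams("albus")
short_prefix_tokens = edge_grams("albus", max_grams=3)
substring_tokens = n_grams("albus", min_grams=3, max_grams=3)
```

With prefixes only, a query for lbu has no token to match, while the n-gram variant produces it. Capping `max_grams` at 3 drops albu and albus from the prefix list, which is the reason long names can stop matching when the limit is too low.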
I have a document which is structured like this:
{
'item_id': '12345',
'total_score': 100,
'user_scores': {
'ABC': 40,
'DEF': 60
}
}
I'm using PyMongo, but the MongoDB documentation seems to translate easily across drivers. With PyMongo, I could update user scores with:
collection.update_one(
{ 'item_id': '12345' },
{ '$set': { 'user_scores.GHI': 20 } },
upsert=True
)
Which results in this:
{
'item_id': '12345',
'total_score': 100,
'user_scores': {
'ABC': 40,
'DEF': 60,
'GHI': 20
}
}
The issue is of course that the total_score is now incorrect. I want that total score to update, so that in a future query, I can quickly ascertain the score of each result, and even sort by score.
One solution could be to find an existing document using find_one({'item_id': '12345'}) (creating it if it doesn't exist), then update it with the new scores and update the total score. The problem there is that I want to run thousands of these at the same time, and it's far more efficient to call bulk_write on a series of requests.
So, a better solution would be to do two sequential update requests:
request1 = UpdateOne(
{ 'item_id' : '12345' },
{ '$set': { 'user_scores.GHI': 20 } },
upsert = True
)
request2 = UpdateOne(
{ 'item_id' : '12345' },
{ '$set': { 'total_score': { '$sum': { '$values': 'user_scores' } } } },
upsert = True
)
The first request updates the user scores, same as before. The second request involves two concepts. The syntax for this isn't correct, but here's what I'm trying to do:
I need to get the values from the user_scores dictionary. { '$values': 'user_scores' } is how I've tried to convey this.
That gives me an array of values. I know these are all numeric, so I now need to sum those, conveyed with { '$sum': { '$values': 'user_scores' } }.
I can run these batch updates consecutively, so there's no risk of summing the wrong thing. The danger with having a total_score field will always be that it isn't updated and thus doesn't contain the correct number. I'd imagine this is a common case with document-based models?
If you're using MongoDB 4.2+, there is a newer feature, pipelined updates, meaning you can now do what you want in one go:
db.collection.updateOne({ 'item_id' : '12345' },
[
{ '$set': { 'user_scores.GHI': 20 } },
{ '$set': { 'total_score': { '$sum': [ "$user_scores.ABC", "$user_scores.DEF", "$user_scores.GHI" ] } } }
]);
Unfortunately this is not possible in earlier MongoDB versions, so if that is your case you'll have to keep using your solution of splitting this into two actions.
EDIT:
For dynamic update we can use $map and $objectToArray like so:
db.collection.updateOne(
{'item_id': '12345'},
[
{'$set': {'user_scores.GHI': 20}},
{
'$set':
{
'total_score': {
'$sum': {
'$map': {
'input': {'$objectToArray': '$user_scores'},
'as': 'score',
'in': '$$score.v'
}
}
}
}
}
]);
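Since the question is about sending thousands of these with bulk_write, here is a hedged sketch of how the same pipeline could be built per request in Python. `build_update` and `apply_locally` are hypothetical helpers; the second one only emulates the two stages on a plain dict so the arithmetic can be checked without a server.

```python
def build_update(item_id, user, score):
    """Return (filter, pipeline) for one pipelined upsert."""
    pipeline = [
        {"$set": {f"user_scores.{user}": score}},
        {"$set": {"total_score": {"$sum": {"$map": {
            "input": {"$objectToArray": "$user_scores"},
            "as": "score",
            "in": "$$score.v",
        }}}}},
    ]
    return {"item_id": item_id}, pipeline


def apply_locally(doc, user, score):
    """Emulate the two $set stages on a copy of doc."""
    scores = {**doc.get("user_scores", {}), user: score}
    return {**doc, "user_scores": scores, "total_score": sum(scores.values())}


doc = {"item_id": "12345", "total_score": 100,
       "user_scores": {"ABC": 40, "DEF": 60}}
new_doc = apply_locally(doc, "GHI", 20)
flt, pipeline = build_update("12345", "GHI", 20)

# Assumption: with PyMongo 3.9+ and a 4.2+ server, each pair could be wrapped
# as UpdateOne(flt, pipeline, upsert=True) and passed to collection.bulk_write.
```

Because both stages run inside a single update, the total cannot drift from the scores between two separate requests.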
This is my model:
order:{
_id: 88565,
activity:
[
{_id: 57235, content: "foo"},
{_id: 57236, content: "bar"}
]
}
This is my query:
db.order.find({
"$and": [
{
"activity.content": "bar"
},
{
"activity._id": 57235
}
]
});
This query will select the order with id 88565 even if the conditions are satisfied together by 2 different embedded activities.
I would expect that this query returned nothing.
I know that I can use $elemMatch to filter embedded documents with more precision, but this behaviour seems very confusing.
Is there a way to obtain a proper filtering where an AND clause has a single embedded document scope?
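The difference between the two scopes can be sketched in plain Python. The two helper functions below are hypothetical and only emulate equality conditions on the `activity` array; the last line shows the shape of a filter that uses the $elemMatch operator mentioned above.

```python
order = {"_id": 88565, "activity": [
    {"_id": 57235, "content": "foo"},
    {"_id": 57236, "content": "bar"},
]}


def matches_across_elements(doc, conditions):
    """Dot-notation semantics: each condition may be met by a different element."""
    return all(
        any(el.get(field) == value for el in doc["activity"])
        for field, value in conditions.items()
    )


def matches_single_element(doc, conditions):
    """$elemMatch semantics: one element must meet every condition."""
    return any(
        all(el.get(field) == value for field, value in conditions.items())
        for el in doc["activity"]
    )


conditions = {"content": "bar", "_id": 57235}
across = matches_across_elements(order, conditions)
single = matches_single_element(order, conditions)

# Filter document that scopes both conditions to one embedded document.
elem_match_filter = {"activity": {"$elemMatch": conditions}}
```

Here `across` is true because two different activities satisfy one condition each, while `single` is false because no one activity satisfies both.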
I am trying to put a limit on the number of distinct results returned from a mongoid query.
Place.where({:tags.in => ["food"]}).distinct(:name).limit(2)
But this throws up the following error:
NoMethodError: undefined method 'limit' for ["p1", "p2", "p3", "p4"]:Array
How do I put a limit in the mongoid query instead of fetching the entire result set and then selecting the limited number of items from the array?
Thanks
distinct in MongoDB is a command, and commands don't return cursors. Only cursors support limit(); command results don't. sort() isn't supported here for the same reason, and I think sorting first is important, otherwise you never know which "first two" distinct items you get.
There is a way around this, by using the aggregation framework. In plain MongoDB query speak (as you would use in the MongoDB shell), you'd use:
db.places.aggregate( [
{ $match: { 'tags' : { $in: [ 'food' ] } } },
{ $group: { '_id': '$name' } },
{ $sort: { '_id': 1 } },
{ $limit: 2 },
] );
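The intent of the pipeline above (match, deduplicate by name, sort, then limit) can be emulated in a few lines of Python to see why the sort matters. The sample data and the `distinct_names` helper are made up for illustration.

```python
def distinct_names(places, tag, limit):
    """Emulate $match on tags, $group by name, $sort and $limit."""
    # $match + $group: unique names of places carrying the tag.
    names = {p["name"] for p in places if tag in p.get("tags", [])}
    # $sort before $limit makes the "first N" deterministic.
    return sorted(names)[:limit]


places = [
    {"name": "p3", "tags": ["food"]},
    {"name": "p1", "tags": ["food", "bar"]},
    {"name": "p1", "tags": ["food"]},
    {"name": "p2", "tags": ["food"]},
    {"name": "p4", "tags": ["hotel"]},
]

first_two = distinct_names(places, "food", 2)
```

Without the sort step, which two names come back would depend on the order in which the groups happen to be produced.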
In Mongoid I suspect you can change the first line to:
Place.collection.aggregate( [