I'm trying to learn NoSQL aggregation queries. Here is the structure of my dataset (collection name: shakespeare_plays):
{
"_id" : "Romeo and Juliet",
"acts" : [
{
"title" : "ACT I",
"scenes" : [
{
"title" : "SCENE I. Verona. A public place.",
"action" : [
{
"character" : "SAMPSON",
"says" : [
"Gregory, o' my word, we'll not carry coals."
]
},
{
"character" : "GREGORY",
"says" : [
"No, for then we should be colliers."
]
},
// ...
{
"character" : "GREGORY",
"says" : [
"To move is to stir; and to be valiant is to stand:",
"therefore, if thou art moved, thou runn'st away."
]
},
{
"character" : "SAMPSON",
"says" : [
"A dog of that house shall move me to stand: I will",
"take the wall of any man or maid of Montague's."
]
},
{
"character" : "GREGORY",
"says" : [
"That shows thee a weak slave; for the weakest goes",
"to the wall."
]
},
// ...
]
},
// ...
]
},
// ...
]
}
These are the tasks I'm trying to solve:
Which characters appear in more than one play?
How many lines (speeches) does Juliet have?
How many characters are there in Othello?
Any tips on how to do this with aggregate?
You're on the right track. Here are some queries that get you closer to your goal.
From where you are right now, you can get a list of all characters by adding a $group stage:
db.getCollection('shakespeare_plays').aggregate([{
$unwind: "$acts"
}, {
$unwind: "$acts.scenes"
}, {
$unwind: "$acts.scenes.action"
}, {
$group: {
_id: "$acts.scenes.action.character"
}
}])
Going further, to see how many times each character speaks, use the $sum operator inside $group:
db.getCollection('shakespeare_plays').aggregate([{
$unwind: "$acts"
}, {
$unwind: "$acts.scenes"
}, {
$unwind: "$acts.scenes.action"
}, {
$group: {
_id: "$acts.scenes.action.character",
count: {$sum: 1}
}
}])
//Results : [{ "_id" : "GREGORY", "count" : 4 }]
You can export the results to an array and apply any further logic to them in your application code, which is enough to answer all three questions:
var myResults = db.getCollection('shakespeare_plays').aggregate(pipelineQuery).toArray(); // pipelineQuery is one of the stage arrays above
// Here you can perform any logic on the variable myResults in your programming language
Read more about $group and $sum
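If you would rather let the server answer the three questions directly, the pipelines could look roughly like this. It is only a sketch: the variable names are made up for illustration, and it assumes that every play's _id is its title (so "Othello" exists as an _id), that Juliet is stored as "JULIET" like the upper-case names in the sample, that every "says" field is an array, and that your server supports $count (3.4 or later).

```javascript
// Shared prefix: one document per speech; the play title stays in _id.
const perSpeech = [
  { $unwind: "$acts" },
  { $unwind: "$acts.scenes" },
  { $unwind: "$acts.scenes.action" }
];

// 1. Characters found in more than one play.
const inSeveralPlays = [
  ...perSpeech,
  { $group: { _id: "$acts.scenes.action.character", plays: { $addToSet: "$_id" } } },
  // "plays.1" exists only when the set holds at least two plays.
  { $match: { "plays.1": { $exists: true } } }
];

// 2. Juliet's speeches and spoken lines (assumes the name is stored as "JULIET").
const julietLines = [
  { $match: { _id: "Romeo and Juliet" } },
  ...perSpeech,
  { $match: { "acts.scenes.action.character": "JULIET" } },
  { $group: {
      _id: "$acts.scenes.action.character",
      speeches: { $sum: 1 },
      lines: { $sum: { $size: "$acts.scenes.action.says" } }
  } }
];

// 3. Number of distinct characters in Othello (assumes the _id is "Othello").
const othelloCharacters = [
  { $match: { _id: "Othello" } },
  ...perSpeech,
  { $group: { _id: "$acts.scenes.action.character" } },
  { $count: "characters" }
];

// With a live connection:
// db.getCollection('shakespeare_plays').aggregate(inSeveralPlays)
```

Note that a name shared by two different characters in two plays is counted as one character by the first pipeline.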
{
"_id" : ObjectId("5ae84dd87f5b72618ba7a669"),
"main_sub" : "MATHS",
"reporting" : [
{
"teacher" : "ABC"
}
],
"subs" : [
{
"sub" : "GEOMETRIC",
"teacher" : "XYZ",
}
]
}
{
"_id" : ObjectId("5ae84dd87f5b72618ba7a669"),
"main_sub" : "SOCIAL SCIENCE",
"reporting" : [
{
"teacher" : "XYZ"
}
],
"subs" : [
{
"sub" : "CIVIL",
"teacher" : "ABC",
}
]
}
I have simplified the structure of my documents.
The basic structure is a parent subject with an array of reporting teachers and an array of sub-subjects (each with its own teacher).
I now want to extract all the subjects (parent subjects and sub-subjects) taught by a particular teacher, together with a flag that says whether each one is a parent subject.
For example, for teacher ABC I want the following structure:
[{'subject':'MATHS', 'is_parent':'True'}, {'subject':'CIVIL', 'is_parent':'FALSE'}]
-- What is the most efficient query possible? I have tried $project with $cond and $switch, but in both cases I had to repeat the conditional statement for 'subject' and 'is_parent'.
-- Is it advisable to do the computation in the query, or should I fetch the raw data and reshape it in the server code? That is, I could $unwind to get a mapping of each parent subject to each sub-subject and then run a for loop.
I have tried
db.collection.aggregate(
{$unwind:'$reporting'},
{$project:{
'result':{$cond:[
{$eq:['ABC', '$reporting.teacher']},
"$main_sub",
"$subs.sub"]}
}}
)
Then I realised that even if I turn the else branch into another expression for the sub-subjects, I will have to write exactly the same condition again for the is_parent property.
You have 2 arrays, so you need to unwind both: reporting and subs.
After that stage, each document holds at most one parent teacher-subject pair and at most one sub-subject teacher-subject pair.
Put the two pairs into a new array and unwind again to get a single teacher-subject pair per document; this is where you define whether it is a parent or not.
Then you can group by teacher. No need for $cond, $filter, or $facet. E.g.:
db.collection.aggregate([
{ $unwind: "$reporting" },
{ $unwind: "$subs" },
{ $project: {
teachers: [
{ teacher: "$reporting.teacher", sub: "$main_sub", is_parent: true },
{ teacher: "$subs.teacher", sub: "$subs.sub", is_parent: false }
]
} },
{ $unwind: "$teachers" },
{ $group: {
_id: "$teachers.teacher",
subs: { $push: {
subject: "$teachers.sub",
is_parent: "$teachers.is_parent"
} }
} }
])
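On the second question (query versus server code), here is a rough sketch of what the server-side version might look like, so the two approaches can be compared. The function name subjectsTaughtBy is hypothetical, and the sample array is an in-memory copy of the two documents from the question with the ObjectIds left out. In practice the documents would first be fetched with something like find({ $or: [{ "reporting.teacher": "ABC" }, { "subs.teacher": "ABC" }] }).

```javascript
// In-memory copy of the two sample documents (ObjectIds omitted for brevity).
const subjects = [
  { main_sub: "MATHS", reporting: [{ teacher: "ABC" }], subs: [{ sub: "GEOMETRIC", teacher: "XYZ" }] },
  { main_sub: "SOCIAL SCIENCE", reporting: [{ teacher: "XYZ" }], subs: [{ sub: "CIVIL", teacher: "ABC" }] }
];

// Hypothetical helper: list every subject taught by `teacher`,
// flagging parent subjects with is_parent: true.
function subjectsTaughtBy(docs, teacher) {
  return docs.flatMap(doc => {
    const rows = [];
    // The parent subject counts once, even with several reporting entries.
    if ((doc.reporting || []).some(r => r.teacher === teacher)) {
      rows.push({ subject: doc.main_sub, is_parent: true });
    }
    // Every sub-subject taught by the teacher gets its own row.
    for (const s of doc.subs || []) {
      if (s.teacher === teacher) {
        rows.push({ subject: s.sub, is_parent: false });
      }
    }
    return rows;
  });
}

console.log(subjectsTaughtBy(subjects, "ABC"));
// [ { subject: 'MATHS', is_parent: true }, { subject: 'CIVIL', is_parent: false } ]
```

Unlike the double $unwind, this version does not build a cross product of reporting teachers and sub-subjects, so a subject cannot show up twice for the same teacher.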
I'm working with a mongodb query. Each document in the collection looks like this:
{
"_id": "12345",
"name": "Trinity Force",
"price": 3702,
"comp": [
"Zeal",
"Phage",
"Sheen",
]
}
I was working on a query that returns the 5 cheapest items (lowest price), with prices equal to 0 excluded (those trinkets though). I wrote this (sorry for poor formatting)
db.league.aggregate( { $project : { _id : 1, name: 1, price: 1, comp: 0 } },
{ $match : {price : { $gt : 0 } } },
{ $sort: { price : 1 } }).limit(5)
I ran into two problems, though; the limit function doesn't seem to work with this aggregation, and neither does the $project. The output I'm looking for should exclude the item components (hence comp: 0) and limit it to 5 outputs. Could I get some assistance, please?
db.league.aggregate(
{ $project : { _id : "$_id", name: "$name", price: "$price"} },
{ $match : { "price" : { $gt : 0 } } },
{ $sort: { "price" : 1 } },
{ $limit : 5 })
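This aggregation query returns the 5 cheapest items. The limit goes into the pipeline as a $limit stage instead of a chained .limit(), and $project lists only the fields to keep, because inclusions (1) and exclusions (0) cannot be mixed in one $project (apart from _id: 0).
IMO, this is not aggregating but sorting results, so a plain find() does the job:
db.league.find({ price: { $gt: 0 } }, { comp: 0 }).sort({ price: 1 }).limit(5)
Nevertheless, I would test both for performance.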
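If you stay with aggregate, the order of the stages may be worth a look. The sketch below (the variable name cheapestItems is made up) filters and sorts first, so that an index on price can be used if you have one, and reshapes only the five surviving documents at the end. Whether it is actually faster on your data is something to measure rather than assume.

```javascript
// Stage order: filter, sort, cut to five, then drop the unwanted fields.
const cheapestItems = [
  { $match: { price: { $gt: 0 } } },   // exclude free items first
  { $sort: { price: 1 } },             // cheapest first
  { $limit: 5 },                       // keep only five documents
  { $project: { name: 1, price: 1 } }  // _id is included by default, comp is dropped
];

// With a live connection:
// db.league.aggregate(cheapestItems)
```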
I have the following documents:
[{
"_id":1,
"name":"john",
"position":1
},
{"_id":2,
"name":"bob",
"position":2
},
{"_id":3,
"name":"tom",
"position":3
}]
In the UI, a user can change the position of items (e.g. moving bob to the first position, so that john gets position 2 and tom keeps position 3).
Is there any way to update all positions in all documents at once?
With a plain update document you cannot apply two distinct updates in a single MongoDB query. You can set a field to the same value, or increment it by the same number, in every matched document, but writing a different value to each document takes one query per document.
You can use db.collection.bulkWrite() to send multiple operations in one call. It has been available since MongoDB 3.2.
Passing { ordered: false } lets the server execute the operations out of order, which can increase performance.
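As a rough sketch of how the operations for bulkWrite() could be produced, the helper below moves one item to a new 1-based position and emits an updateOne only for the documents whose position actually changed. The name buildMoveOperations and its parameters are hypothetical; only the updateOne operation shape comes from bulkWrite().

```javascript
// Hypothetical helper: move one item to a new 1-based position and return
// bulkWrite operations for only the documents whose position changed.
function buildMoveOperations(items, movedId, newPosition) {
  // Work on a copy sorted by the current position.
  const ordered = [...items].sort((a, b) => a.position - b.position);
  const from = ordered.findIndex(item => item._id === movedId);
  if (from === -1) {
    throw new Error("unknown _id: " + movedId);
  }
  // Take the item out and put it back at its new place.
  const [moved] = ordered.splice(from, 1);
  ordered.splice(newPosition - 1, 0, moved);

  return ordered
    .map((item, index) => ({ item, position: index + 1 }))
    .filter(({ item, position }) => item.position !== position)
    .map(({ item, position }) => ({
      updateOne: {
        filter: { _id: item._id },
        update: { $set: { position: position } }
      }
    }));
}

const items = [
  { _id: 1, name: "john", position: 1 },
  { _id: 2, name: "bob", position: 2 },
  { _id: 3, name: "tom", position: 3 }
];

// Moving bob to the first position touches bob and john, but not tom.
const operations = buildMoveOperations(items, 2, 1);
console.log(JSON.stringify(operations));

// With a live connection:
// db.collection.bulkWrite(operations, { ordered: false })
```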
From MongoDB 4.2 you can do this with an aggregation pipeline in the update, using the $set stage.
The aggregation operators allow many ways of doing it; here is one of them:
exports.updateDisplayOrder = async keyValPairArr => {
// .collection is the native driver collection, so Mongoose does no casting or validation here:
// the ids in keyValPairArr must already have the same type as _id.
return ContestModel.collection.updateMany(
{ _id: { $in: keyValPairArr.map(o => o.id) } },
[{
$set: {
displayOrder: {
$let: {
// Pick the entry of keyValPairArr whose id equals this document's _id.
vars: { obj: { $arrayElemAt: [{ $filter: { input: keyValPairArr, as: "kvpa", cond: { $eq: ["$$kvpa.id", "$_id"] } } }, 0] } },
in: "$$obj.displayOrder"
}
}
}
}]
);
}
An example keyValPairArr is: [{"id":"5e7643d436963c21f14582ee","displayOrder":9}, {"id":"5e7643e736963c21f14582ef","displayOrder":4}]
Since MongoDB 4.2, update can accept an aggregation pipeline as its second argument, which allows modifying multiple documents based on their own data.
See https://docs.mongodb.com/manual/reference/method/db.collection.update/#modify-a-field-using-the-values-of-the-other-fields-in-the-document
Excerpt from documentation:
Modify a Field Using the Values of the Other Fields in the Document
Create a members collection with the following documents:
db.members.insertMany([
{ "_id" : 1, "member" : "abc123", "status" : "A", "points" : 2, "misc1" : "note to self: confirm status", "misc2" : "Need to activate", "lastUpdate" : ISODate("2019-01-01T00:00:00Z") },
{ "_id" : 2, "member" : "xyz123", "status" : "A", "points" : 60, "misc1" : "reminder: ping me at 100pts", "misc2" : "Some random comment", "lastUpdate" : ISODate("2019-01-01T00:00:00Z") }
])
Assume that instead of separate misc1 and misc2 fields, you want to gather these into a new comments field. The following update operation uses an aggregation pipeline to:
add the new comments field and set the lastUpdate field.
remove the misc1 and misc2 fields for all documents in the collection.
db.members.update(
{ },
[
{ $set: { status: "Modified", comments: [ "$misc1", "$misc2" ], lastUpdate: "$$NOW" } },
{ $unset: [ "misc1", "misc2" ] }
],
{ multi: true }
)
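Applied to the positions from the question, a pipeline-style update could be built along these lines. This is a sketch under assumptions: the helper name buildPositionUpdate is made up, the server is 4.2 or later, and $switch is just one of several operators that would work here.

```javascript
// Hypothetical helper: build a filter and a pipeline-style update that
// assigns each listed _id its own position in a single updateMany call.
function buildPositionUpdate(pairs) {
  const filter = { _id: { $in: pairs.map(p => p._id) } };
  const pipeline = [
    {
      $set: {
        position: {
          $switch: {
            // One branch per document: "if this is document X, use position Y".
            branches: pairs.map(p => ({
              case: { $eq: ["$_id", p._id] },
              then: p.position
            })),
            // Documents without a branch keep their current position.
            default: "$position"
          }
        }
      }
    }
  ];
  return { filter, pipeline };
}

const { filter, pipeline } = buildPositionUpdate([
  { _id: 1, position: 2 },
  { _id: 2, position: 1 }
]);
console.log(JSON.stringify(pipeline));

// With a live connection (MongoDB 4.2 or later):
// db.collection.updateMany(filter, pipeline)
```

For a long list, the number of branches grows with the number of documents, so bulkWrite() may be the simpler choice.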
Suppose that after the positions change, your array looks like this:
const objectToUpdate = [{
"_id":1,
"name":"john",
"position":2
},
{
"_id":2,
"name":"bob",
"position":1
},
{
"_id":3,
"name":"tom",
"position":3
}].map( eachObj => {
return {
updateOne: {
filter: { _id: eachObj._id },
update: { $set: { name: eachObj.name, position: eachObj.position } }
}
}
})
YourModelName.bulkWrite(objectToUpdate,
{ ordered: false }
).then((result) => {
console.log(result);
}).catch(err=>{
console.log(err.result.result.writeErrors[0].err.op.q);
})
This updates every document with its own position value.
Note: I have used ordered: false here for better performance.
Is it possible to group-by field name? Or do I need a different structure so I can group-by value?
I know we can use group by on values and we can unwind arrays, but is it possible to get total apples, pears and oranges owned by John amongst the three houses here without specifying "apples", "pears" and "oranges" explicitly as part of the query? (so NOT like this);
// total all the fruit John has at each house
db.houses.aggregate([
{
$group: {
_id: null,
"apples": { $sum: "$people.John.items.apples" },
"pears": { $sum: "$people.John.items.pears" },
"oranges": { $sum: "$people.John.items.oranges" },
}
},
])
In other words, can I group-by the first field-name under "items" and get the aggregate sum of apples:104, pears:202 and oranges:306, but also bananas, melons and anything else that might be there? Or do I need to restructure the data into an array of key/value pairs like categories?
db.createCollection("houses");
db.houses.remove({});
db.houses.insert(
[
{
House: "birmingham",
categories : [
{
k : "location",
v : { d : "central" }
}
],
people: {
John: {
items: {
apples: 2,
pears: 1,
oranges: 3,
}
},
Dave: {
items: {
apples: 30,
pears: 20,
oranges: 10,
},
},
},
},
{
House: "London", categories: [{ k: "location", v: { d: "central" } }, { k: "type", v: { d: "rented" } }],
people: {
John: { items: { apples: 2, pears: 1, oranges: 3, } },
Dave: { items: { apples: 30, pears: 20, oranges: 10, }, },
},
},
{
House: "Cambridge", categories: [{ k: "type", v: { d: "rented" } }],
people: {
John: { items: { apples: 100, pears: 200, oranges: 300, } },
Dave: { items: { apples: 0.3, pears: 0.2, oranges: 0.1, }, },
},
},
]
);
Secondly, and more importantly, could I then also group by "house.categories.k" ? In other words, is it possible to find out how many "apples" "John" has in "rented" vs "owned" or "friends" houses (so group by "categories.k.type")?
Finally - if this is even possible, is it sensible? At first I thought it was quite useful to create dictionaries of nested objects using actual field names of the object, as it seemed a logical use of a document database, and it seemed to make the MR queries easier to write vs arrays, but now I'm starting to wonder if this is all a bad idea and having variable field names makes it very tricky/inefficient to write aggregation queries.
OK, so I think I have this partially solved. At least for the shape of data in the initial question.
// How many of each type of fruit does John have at each location
db.houses.aggregate([
{
$unwind: "$categories"
},
{
$match: { "categories.k": "location" }
},
{
$group: {
_id: "$categories.v.d",
"numberOf": { $sum: 1 },
"Total Apples": { $sum: "$people.John.items.apples" },
"Total Pears": { $sum: "$people.John.items.pears" },
}
},
])
which yields;
{
"result" : [
{
"_id" : "central",
"numberOf" : 2,
"Total Apples" : 4,
"Total Pears" : 2
}
],
"ok" : 1
}
Note that there's only "central", but if I had other "location"s in my DB I'd get a range of totals for each location. I wouldn't need the $unwind step if I had named properties instead of an array for "categories", but this is where I find the structure is at odds with itself. There are several keywords likely under "categories". The sample data shows "type" and "location" but there could be around 10 of these categorizations all with different values. So if I used named fields;
"categories": {
location: "london",
type: "owned",
}
...the problem I then have is indexing. I can't afford to simply index "location" since those are user-defined categories, and if 10,000 users choose 10,000 different ways of categorizing their houses I'd need 10,000 indexes, one for each field. But by making it an array I only need one on the array field itself. The downside is the $unwind step. I ran into this before with MapReduce. The last thing you want to be doing is a ForEach loop in JavaScript to cycle an array if you can help it. What you really want is to filter out the fields by name because it's much quicker.
Now this is all well and good where I already know what fruit I'm looking for, but if I don't, it's much harder. I can't (as far as I can see) $unwind or otherwise ForEach "people.John.items" here. If I could, I'd be overjoyed. So since the names of fruit are again user-defined, it looks like I need to convert them to an array as well, like this;
{
"people" : {
"John" : {
"items" : [
{ k:"apples", v:100 },
{ k:"pears", v:200 },
{ k:"oranges", v:300 },
]
},
}
}
So that now allows me to get the fruit (where I don't know which fruit to look for) totalled, again by location;
db.houses.aggregate([
{
$unwind: "$categories"
},
{
$match: { "categories.k": "location" }
},
{
$unwind: "$people.John.items"
},
{
$group: { // compound key - thanks to Jenna
_id: { fruit:"$people.John.items.k", location:"$categories.v.d" },
"numberOf": { $sum: 1 },
"Total Fruit": { $sum: "$people.John.items.v" },
}
},
])
So now I'm doing TWO $unwinds. If you're thinking that looks grotesquely inefficient, you'd be right. If I have just 10,000 house records, with 10 categories each, and 10 types of fruit, this query takes half a minute to run.
OK, so I can see that moving the $match before the $unwind improves things significantly but then it's the wrong output. I don't want an entry for every category, I want to filter out just the "location" categories.
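One way to get both effects might be to use the same $match twice: once before the $unwind, where it can discard whole documents (and use an index on "categories.k" if one exists), and once after it, where it drops the unwound category entries that are not "location". A sketch against the original data shape, with a made-up variable name:

```javascript
// Same condition twice: the first $match filters documents,
// the second filters the individual array entries produced by $unwind.
const fruitByLocation = [
  { $match: { "categories.k": "location" } },  // skip houses without a location category
  { $unwind: "$categories" },
  { $match: { "categories.k": "location" } },  // drop the "type" and other entries
  { $group: {
      _id: "$categories.v.d",
      numberOf: { $sum: 1 },
      "Total Apples": { $sum: "$people.John.items.apples" },
      "Total Pears": { $sum: "$people.John.items.pears" }
  } }
];

// With a live connection:
// db.houses.aggregate(fruitByLocation)
```

The first $match only helps when some houses have no "location" entry at all; if every house has one, it filters nothing and the cost of the $unwind stays the same.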
I would have made this a comment, but it's easier to format in a response text box.
{ _id: 1,
house: "New York",
people: {
John: {
items: {apples: 1, oranges:2}
},
Dave: {
items: {apples: 2, oranges: 1}
}
}
}
{ _id: 2,
house: "London",
people: {
John: {
items: {apples: 3, oranges:2}
},
Dave: {
items: {apples: 1, oranges:3}
}
}
}
Just to make sure I understand your question, is this what you're trying to accomplish?
{location: "New York", johnFruit:3}
{location: "London", johnFruit: 5}
Since categories is not nested under house, you can't group by "house.categories.k", but you can use a compound key for the _id of $group to get this result:
{ $group: { _id: { house: "$House", category: "$categories.k" } } }
Although "k" doesn't contain the information that you're presumably trying to group by. And as for "categories.k.type", type is the value of k, so you can't use this syntax. You would have to group by "categories.v.d".
It may be possible with your current schema to accomplish this aggregation using $unwind, $project, possibly $match, and finally $group, but the command won't be pretty. If possible, I would highly recommend restructuring your data to make this aggregation much simpler. If you would like some help with schema, please let us know.
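For the first part of the question (totals per fruit without naming the fruit), newer servers offer $objectToArray, which turns the fields of a sub-document into an array of { k, v } pairs that can be unwound and grouped. As far as I know it needs MongoDB 3.4.4 or later, so treat this as a sketch for a newer deployment rather than something the current schema discussion depends on. The names johnTotalsPipeline and totalItems are made up, and the plain JavaScript function only mirrors the pipeline so the expected totals can be checked without a server.

```javascript
// Pipeline: turn John's items into { k, v } pairs, then sum per fruit name.
const johnTotalsPipeline = [
  { $project: { items: { $objectToArray: "$people.John.items" } } },
  { $unwind: "$items" },
  { $group: { _id: "$items.k", total: { $sum: "$items.v" } } }
];

// Plain JavaScript mirror of the pipeline, for checking the expected totals.
function totalItems(houses, person) {
  const totals = {};
  for (const house of houses) {
    const items = (house.people[person] || {}).items || {};
    for (const [name, count] of Object.entries(items)) {
      totals[name] = (totals[name] || 0) + count;
    }
  }
  return totals;
}

// John's items from the three sample houses in the question.
const houses = [
  { House: "birmingham", people: { John: { items: { apples: 2, pears: 1, oranges: 3 } } } },
  { House: "London", people: { John: { items: { apples: 2, pears: 1, oranges: 3 } } } },
  { House: "Cambridge", people: { John: { items: { apples: 100, pears: 200, oranges: 300 } } } }
];

console.log(totalItems(houses, "John"));
// { apples: 104, pears: 202, oranges: 306 }

// With a live connection:
// db.houses.aggregate(johnTotalsPipeline)
```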
I'm not sure if this is a possible solution, but what if you begin the aggregation process by determining the number of different locations using distinct(), and run separate aggregation commands for each location? distinct() may not be efficient, but every subsequent aggregation will be able to use $match, and therefore, the index on categories. You could use the same logic to count the fruit for "categories.type".
{
"_id" : 1,
"house" : "New York",
"people" : {
"John" : [{"k" : "apples","v" : 1},{"k" : "oranges","v" : 2}],
"Dave" : [{"k" : "apples","v" : 2},{"k" : "oranges","v" : 1}]
},
"categories" : [{"location" : "central"},{"type" : "rented"}]
}
{
"_id" : 2,
"house" : "London",
"people" : {
"John" : [{"k" : "apples","v" : 3},{"k" : "oranges","v" : 2}],
"Dave" : [{"k" : "apples","v" : 3},{"k" : "oranges","v" : 1}]
},
"categories" : [{"location" : "suburb"},{"type" : "rented"}]
}
{
"_id" : 3,
"house" : "London",
"people" : {
"John" : [{"k" : "apples","v" : 0},{"k" : "oranges","v" : 1}],
"Dave" : [{"k" : "apples","v" : 2},{"k" : "oranges","v" : 4}]
},
"categories" : [{"location" : "central"},{"type" : "rented"}]
}
Run distinct() and iterate through the results by running aggregate() commands for each unique value of "categories.location":
db.agg.distinct("categories.location")
[ "central", "suburb" ]
db.agg.aggregate([
// The index entry is on the entire sub-document {location:"central"},
// so match on the whole sub-document to be able to use the index.
{$match: {categories: {location:"central"}}},
{$unwind: "$people.John"},
{$group:{
_id:"$people.John.k",
"numberOf": { $sum: 1 },
"Total Fruit": { $sum: "$people.John.v"}
}
}
])
{
"result" : [
{
"_id" : "oranges",
"numberOf" : 2,
"Total Fruit" : 3
},
{
"_id" : "apples",
"numberOf" : 2,
"Total Fruit" : 1
}
],
"ok" : 1
}