Search on multiple collections in MongoDB - mongodb

I know the theory of MongoDB and the fact that it doesn't support joins, and that I should use embedded documents or denormalize as much as possible, but here goes:
I have multiple documents, such as:
Users, which embed Suburbs, but also has: first name, last name
Suburbs, which embed States
Child, which embeds School, belongs to a User, but also has: first name, last name
Example:
Users:
{ _id: 1, first_name: 'Bill', last_name: 'Gates', suburb: 1 }
{ _id: 2, first_name: 'Steve', last_name: 'Jobs', suburb: 3 }
Suburb:
{ _id: 1, name: 'Suburb A', state: 1 }
{ _id: 2, name: 'Suburb B', state: 1 }
{ _id: 3, name: 'Suburb C', state: 3 }
State:
{ _id: 1, name: 'LA' }
{ _id: 3, name: 'NY' }
Child:
{ _id: 1, _user_id: 1, first_name: 'Little Billy', last_name: 'Gates' }
{ _id: 2, _user_id: 2, first_name: 'Little Stevie', last_name: 'Jobs' }
The search I need to implement is on:
first name, last name of Users and Child
State from Users
I know that I have to do multiple queries to get it done, but how can that be achieved? With mapReduce or aggregate?
Can you point out a solution please?
I've tried to use mapReduce but that didn't get me to have documents from Users which contained a state_id, so that's why I brought it up here.

This answer is outdated. Since version 3.2, MongoDB has limited support for left outer joins with the $lookup aggregation operator
MongoDB does not do queries which span multiple collections - period. When you need to join data from multiple collections, you have to do it on the application level by doing multiple queries.
Query collection A
Get the secondary keys from the result and put them into an array
Query collection B passing that array as the value of the $in-operator
Join the results of both queries programmatically on the application layer
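The four steps above can be sketched in plain JavaScript. This is only an illustration under stated assumptions: the arrays stand in for the Users and Suburb collections from the question, the filter calls stand in for the find() queries named in the comments, and usersWithSuburbs is a hypothetical helper name.

```javascript
// In-memory stand-ins for the Users and Suburb collections from the question.
const users = [
  { _id: 1, first_name: 'Bill', last_name: 'Gates', suburb: 1 },
  { _id: 2, first_name: 'Steve', last_name: 'Jobs', suburb: 3 },
];
const suburbs = [
  { _id: 1, name: 'Suburb A', state: 1 },
  { _id: 2, name: 'Suburb B', state: 1 },
  { _id: 3, name: 'Suburb C', state: 3 },
];

// Hypothetical helper: join users to their suburbs on the application layer.
function usersWithSuburbs(lastName) {
  // 1. Query collection A (stands in for db.users.find({ last_name: lastName })).
  const matched = users.filter(u => u.last_name === lastName);
  // 2. Collect the secondary keys into an array.
  const suburbIds = matched.map(u => u.suburb);
  // 3. Query collection B (stands in for db.suburbs.find({ _id: { $in: suburbIds } })).
  const found = suburbs.filter(s => suburbIds.includes(s._id));
  // 4. Join both result sets programmatically.
  const byId = new Map(found.map(s => [s._id, s]));
  return matched.map(u => ({ ...u, suburb: byId.get(u.suburb) }));
}

console.log(usersWithSuburbs('Jobs'));
```

The same pattern repeats for each extra collection (Suburb to State, User to Child), which is why it gets expensive when it becomes the norm.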
Having to do this should be rather the exception than the norm. When you frequently need to emulate JOINs like that, it either means that you are still thinking too relational when you design your database schema or that your data is simply not suited for the document-based storage concept of MongoDB.

Joins are now possible in MongoDB. You can achieve this with the $lookup and $facet aggregation stages, which is probably the best way to search across multiple collections:
db.collection.aggregate([
{ "$limit": 1 },
{ "$facet": {
"c1": [
{ "$lookup": {
"from": Users.collection.name,
"pipeline": [
{ "$match": { "first_name": "your_search_data" } }
],
"as": "collection1"
}}
],
"c2": [
{ "$lookup": {
"from": State.collection.name,
"pipeline": [
{ "$match": { "name": "your_search_data" } }
],
"as": "collection2"
}}
],
"c3": [
{ "$lookup": {
"from": State.collection.name,
"pipeline": [
{ "$match": { "name": "your_search_data" } }
],
"as": "collection3"
}}
]
}},
{ "$project": {
"data": {
"$concatArrays": [ "$c1", "$c2", "$c3" ]
}
}},
{ "$unwind": "$data" },
{ "$replaceRoot": { "newRoot": "$data" } }
])

You'll find MongoDB easier to understand if you take a denormalized approach to schema design. That is, you want to structure your documents the way the requesting client application understands them. Essentially, you are modeling your documents as domain objects with which the application deals. Joins become less important when you model your data this way. Consider how I've denormalized your data into a single collection:
{
_id: 1,
first_name: 'Bill',
last_name: 'Gates',
suburb: 'Suburb A',
state: 'LA',
child : [ 3 ]
}
{
_id: 2,
first_name: 'Steve',
last_name: 'Jobs',
suburb: 'Suburb C',
state: 'NY',
child: [ 4 ]
}
{
_id: 3,
first_name: 'Little Billy',
last_name: 'Gates',
suburb: 'Suburb A',
state: 'LA',
parent : [ 1 ]
}
{
_id: 4,
first_name: 'Little Stevie',
last_name: 'Jobs',
suburb: 'Suburb C',
state: 'NY',
parent: [ 2 ]
}
The first advantage is that this schema is far easier to query. Plus, updates to address fields are now consistent with the individual Person entity since the fields are embedded in a single document. Notice also the bidirectional relationship between parent and children? This makes this collection more than just a collection of individual people. The parent-child relationships mean this collection is also a social graph. Here are some resources which may be helpful to you when thinking about schema design in MongoDB.
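To illustrate why the single-collection schema is easier to query, here is a minimal sketch that builds one filter covering first name, last name and state. The helper names buildPersonSearch and escapeRegex are hypothetical, and the collection name "people" in the closing comment is an assumption.

```javascript
// Escape regex metacharacters so user input is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: build a find() filter for the single-collection schema.
function buildPersonSearch(name, state) {
  const pattern = new RegExp(escapeRegex(name), 'i');
  const filter = { $or: [{ first_name: pattern }, { last_name: pattern }] };
  // The state is stored directly on each person, so no second query is needed.
  if (state) filter.state = state;
  return filter;
}

const filter = buildPersonSearch('gates', 'LA');
console.log(filter);
// In the mongo shell this would be passed as db.people.find(filter),
// where "people" is an assumed collection name.
```

Because parents and children live in the same collection, this one filter searches both.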

Here's a JavaScript function that will return an array of all records matching specified criteria, searching across all collections in the current database:
function searchAll(query,fields,sort) {
var all = db.getCollectionNames();
var results = [];
for (var i in all) {
var coll = all[i];
if (coll == "system.indexes") continue;
db[coll].find(query,fields).sort(sort).forEach(
function (rec) {results.push(rec);} );
}
return results;
}
From the Mongo shell, you can copy/paste the function in, then call it like so:
> var recs = searchAll( {filename: {$regex:'.pdf$'} }, {moddate:1,filename:1,_id:0}, {filename:1} )
> recs

Based on @brian-moquin and others, I made a set of functions that search every field (key) of every collection for a simple keyword.
It's in my gist; https://gist.github.com/fkiller/005dc8a07eaa3321110b3e5753dda71b
For more detail, I first made a function to gather all keys.
function keys(collectionName) {
mr = db.runCommand({
'mapreduce': collectionName,
'map': function () {
for (var key in this) { emit(key, null); }
},
'reduce': function (key, stuff) { return null; },
'out': 'my_collection' + '_keys'
});
return db[mr.result].distinct('_id');
}
Then one more to generate $or query from keys array.
function createOR(fieldNames, keyword) {
var query = [];
fieldNames.forEach(function (item) {
var temp = {};
temp[item] = { $regex: '.*' + keyword + '.*' };
query.push(temp);
});
if (query.length == 0) return false;
return { $or: query };
}
Below is a function to search a single collection.
function findany(collection, keyword) {
var query = createOR(keys(collection.getName()));
if (query) {
return collection.findOne(query, keyword);
} else {
return false;
}
}
And finally, a search function that covers every collection.
function searchAll(keyword) {
var all = db.getCollectionNames();
var results = [];
all.forEach(function (collectionName) {
print(collectionName);
if (db[collectionName]) results.push(findany(db[collectionName], keyword));
});
return results;
}
You can simply load all functions in Mongo console, and execute searchAll('any keyword')

You can achieve this using $lookup together with $mergeObjects.
Example
Create a collection orders with the following documents:
db.orders.insert([
{ "_id" : 1, "item" : "abc", "price" : 12, "ordered" : 2 },
{ "_id" : 2, "item" : "jkl", "price" : 20, "ordered" : 1 }
])
Create another collection items with the following documents:
db.items.insert([
{ "_id" : 1, "item" : "abc", description: "product 1", "instock" : 120 },
{ "_id" : 2, "item" : "def", description: "product 2", "instock" : 80 },
{ "_id" : 3, "item" : "jkl", description: "product 3", "instock" : 60 }
])
The following operation first uses the $lookup stage to join the two collections by the item fields and then uses $mergeObjects in the $replaceRoot to merge the joined documents from items and orders:
db.orders.aggregate([
{
$lookup: {
from: "items",
localField: "item", // field in the orders collection
foreignField: "item", // field in the items collection
as: "fromItems"
}
},
{
$replaceRoot: { newRoot: { $mergeObjects: [ { $arrayElemAt: [ "$fromItems", 0 ] }, "$$ROOT" ] } }
},
{ $project: { fromItems: 0 } }
])
The operation returns the following documents:
{ "_id" : 1, "item" : "abc", "description" : "product 1", "instock" : 120, "price" : 12, "ordered" : 2 }
{ "_id" : 2, "item" : "jkl", "description" : "product 3", "instock" : 60, "price" : 20, "ordered" : 1 }
This technique merges the joined objects and returns the result.

Minime's solution worked, except that it required one fix. In findany, the line
var query = createOR(keys(collection.getName()));
needs keyword added as the second parameter of the createOR call.
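The effect of the missing argument can be seen without a database. In this sketch createOR is reproduced from the earlier answer so that it runs standalone; the field name and keyword are example values only.

```javascript
// createOR as given in the earlier answer, reproduced so the sketch runs standalone.
function createOR(fieldNames, keyword) {
  var query = [];
  fieldNames.forEach(function (item) {
    var temp = {};
    temp[item] = { $regex: '.*' + keyword + '.*' };
    query.push(temp);
  });
  if (query.length == 0) return false;
  return { $or: query };
}

// Without the second argument, the pattern is built from the string "undefined".
var broken = createOR(['first_name']);
// With the keyword passed through, the pattern matches the search term.
var fixed = createOR(['first_name'], 'Bill');

console.log(broken.$or[0].first_name.$regex); // .*undefined.*
console.log(fixed.$or[0].first_name.$regex);  // .*Bill.*
```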

Related

How to update a document with a reference to its previous state?

Is it possible to reference the root document during an update operation such that a document like this:
{"name":"foo","value":1}
can be updated with new values and have the full (previous) document pushed into a new field (creating an update history):
{"name":"bar","value":2,"previous":[{"name":"foo","value":1}]}
And so on..
{"name":"baz","value":3,"previous":[{"name":"foo","value":1},{"name":"bar","value":2}]}
I figure I'll have to use the new aggregate set operator in Mongo 4.2, but how can I achieve this?
Ideally I don't want to have to reference each field explicitly. I'd prefer to push the root document (minus the _id and previous fields) without knowing what the other fields are.
In addition to the new $set operator, what makes your use case really easier with Mongo 4.2 is the fact that db.collection.update() now accepts an aggregation pipeline, finally allowing the update of a field based on its current value:
// { name: "foo", value: 1 }
db.collection.update(
{},
[{ $set: {
previous: {
$ifNull: [
{ $concatArrays: [ "$previous", [{ name: "$name", value: "$value" }] ] },
[ { name: "$name", value: "$value" } ]
]
},
name: "bar",
value: 2
}}],
{ multi: true }
)
// { name: "bar", value: 2, previous: [{ name: "foo", value: 1 }] }
// and if applied again:
// { name: "baz", value: 3, previous: [{ name: "foo", value: 1 }, { name: "bar", value: 2 } ] }
The first part {} is the match query, filtering which documents to update (in our case probably all documents).
The second part [{ $set: { previous: { $ifNull: [ ... } ] is the update aggregation pipeline (note the square brackets signifying the use of an aggregation pipeline):
$set is a new aggregation operator and an alias of $addFields. It's used to add/replace a new field (in our case "previous") with values from the current document.
Using an $ifNull check, we can determine whether "previous" already exists in the document or not (this is not the case for the first update).
If "previous" doesn't exist (is null), then we have to create it and set it with an array of one element: the current document: [ { name: "$name", value: "$value" } ].
If "previous" already exists, then we concatenate ($concatArrays) the existing array with the current document.
Don't forget { multi: true }, otherwise only the first matching document will be updated.
As you mentioned "root" in your question and if your schema is not the same for all documents (if you can't tell which fields should be used and pushed in the "previous" array), then you can use the $$ROOT variable which represents the current document and filter out the "previous" array. In this case, replace both { name: "$name", value: "$value" } from the previous query with:
{ $arrayToObject: { $filter: {
input: { $objectToArray: "$$ROOT" },
as: "root",
cond: { $ne: [ "$$root.k", "previous" ] }
}}}
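To make the behaviour of the $$ROOT variant easier to follow, here is a plain-JavaScript model of what the pipeline computes for one document. It is a sketch, not driver code: updateWithHistory is a hypothetical name, and like the filter above it only drops the "previous" key, so an _id field would still be copied into each snapshot.

```javascript
// Hypothetical model of the update: returns what the document would look like afterwards.
function updateWithHistory(doc, changes) {
  // Equivalent of filtering $objectToArray: "$$ROOT" to drop the "previous" key.
  const snapshot = Object.fromEntries(
    Object.entries(doc).filter(([key]) => key !== 'previous')
  );
  // Equivalent of $ifNull around $concatArrays: start a new array when none exists.
  const previous = (doc.previous || []).concat([snapshot]);
  return { ...doc, ...changes, previous };
}

let doc = { name: 'foo', value: 1 };
doc = updateWithHistory(doc, { name: 'bar', value: 2 });
doc = updateWithHistory(doc, { name: 'baz', value: 3 });
console.log(JSON.stringify(doc));
```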
Imho, you are making your life infinitely more complex for no reason with such complicated data models.
Think of what you really want to achieve. You want to correlate different values in one or more interconnected series which are written to the collection consecutively.
Storing this in one document comes with some strings attached. While it seems to be reasonable in the beginning, let me name a few:
How do you get the most current document if you do not know its value for name?
How do you deal with very large series, which make the document hit the 16MB limit?
What is the benefit of the added complexity?
Simplify first
So, let's assume you have only one series for a moment. It gets as simple as
[{
"_id":"foo",
"ts": ISODate("2019-07-03T17:40:00.000Z"),
"value":1
},{
"_id":"bar",
"ts": ISODate("2019-07-03T17:45:00.000"),
"value":2
},{
"_id":"baz",
"ts": ISODate("2019-07-03T17:50:00.000"),
"value":3
}]
Assuming the name is unique, we can use it as _id, potentially saving an index.
You can actually get the semantic equivalent by simply doing a
> db.seriesa.find().sort({ts:-1})
{ "_id" : "baz", "ts" : ISODate("2019-07-03T17:50:00Z"), "value" : 3 }
{ "_id" : "bar", "ts" : ISODate("2019-07-03T17:45:00Z"), "value" : 2 }
{ "_id" : "foo", "ts" : ISODate("2019-07-03T17:40:00Z"), "value" : 1 }
Say you only want to have the two latest values, you can use limit():
> db.seriesa.find().sort({ts:-1}).limit(2)
{ "_id" : "baz", "ts" : ISODate("2019-07-03T17:50:00Z"), "value" : 3 }
{ "_id" : "bar", "ts" : ISODate("2019-07-03T17:45:00Z"), "value" : 2 }
Should you really need to have the older values in a queue-ish array, you can build it with an aggregation:
db.seriesa.aggregate([{
$group: {
_id: "queue",
name: {
$last: "$_id"
},
value: {
$last: "$value"
},
previous: {
$push: {
name: "$_id",
value: "$value"
}
}
}
}, {
$project: {
name: 1,
value: 1,
previous: {
$slice: ["$previous", {
$subtract: [{
$size: "$previous"
}, 1]
}]
}
}
}])
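As a rough model of what the $group and $project stages above produce, the following sketch does the same work on an in-memory array. It assumes the documents arrive oldest first (the pipeline itself has no $sort stage) and that there is at least one document; toQueueDocument is a hypothetical name.

```javascript
// Hypothetical model of the $group + $project pipeline above.
function toQueueDocument(docs) {
  // $push collects every document of the group in input order.
  const all = docs.map(d => ({ name: d._id, value: d.value }));
  // $last takes the fields of the final document in the group.
  const last = all[all.length - 1];
  // $slice with (size - 1) keeps everything except that final element.
  const previous = all.slice(0, all.length - 1);
  return { _id: 'queue', name: last.name, value: last.value, previous };
}

const seriesa = [
  { _id: 'foo', value: 1 },
  { _id: 'bar', value: 2 },
  { _id: 'baz', value: 3 },
];
console.log(JSON.stringify(toQueueDocument(seriesa)));
```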
Nail it
Now, let us say you have more than one series of data. Basically, there are two ways of dealing with it: put different series in different collections or put all the series in one collection and make a distinction by a field, which for obvious reasons should be indexed.
So, when to use what? It boils down to whether you want to do aggregations over all series (maybe later down the road) or not. If you do, you should put all series into one collection. Of course, we have to slightly modify our data model:
[{
"name":"foo",
"series": "a",
"ts": ISODate("2019-07-03T17:40:00.000Z"),
"value":1
},{
"name":"bar",
"series": "a",
"ts": ISODate("2019-07-03T17:45:00.000"),
"value":2
},{
"name":"baz",
"series": "a",
"ts": ISODate("2019-07-03T17:50:00.000"),
"value":3
},{
"name":"foo",
"series": "b",
"ts": ISODate("2019-07-03T17:40:00.000Z"),
"value":1
},{
"name":"bar",
"series": "b",
"ts": ISODate("2019-07-03T17:45:00.000"),
"value":2
},{
"name":"baz",
"series": "b",
"ts": ISODate("2019-07-03T17:50:00.000"),
"value":3
}]
Note that for demonstration purposes, I fell back to the default ObjectId value for _id.
Next, we create an index over series and ts, as we are going to need it for our query:
> db.series.createIndex({series:1,ts:-1})
And now our simple query looks like this
> db.series.find({"series":"b"},{_id:0}).sort({ts:-1})
{ "name" : "baz", "series" : "b", "ts" : ISODate("2019-07-03T17:50:00Z"), "value" : 3 }
{ "name" : "bar", "series" : "b", "ts" : ISODate("2019-07-03T17:45:00Z"), "value" : 2 }
{ "name" : "foo", "series" : "b", "ts" : ISODate("2019-07-03T17:40:00Z"), "value" : 1 }
In order to generate the queue-ish document, we need to add a $match stage:
> db.series.aggregate([{
$match: {
"series": "b"
}
},
// other stages omitted for brevity
])
Note that the index we created earlier will be utilized here.
Or, we can generate a document like this for every series by simply using series as the _id in the $group stage and replace _id with name where appropriate
db.series.aggregate([{
$group: {
_id: "$series",
name: {
$last: "$name"
},
value: {
$last: "$value"
},
previous: {
$push: {
name: "$name",
value: "$value"
}
}
}
}, {
$project: {
name: 1,
value: 1,
previous: {
$slice: ["$previous", {
$subtract: [{
$size: "$previous"
}, 1]
}]
}
}
}])
Conclusion
Stop Being Clever when it comes to data models in MongoDB. Most of the problems with data models I saw in the wild and the vast majority I see on SO come from the fact that someone tried to be Smart (by premature optimization) ™.
Unless we are talking of ginormous series (which can not be, since you settled for a 16MB limit in your approach), the data model and queries above are highly efficient without adding unneeded complexity.
addMultipleData: (req, res, next) => {
let name = req.body.name;
let value = req.body.value;
// Validate the input and send at most one response before touching the database.
if (!name) { return res.json({ message: "Please enter Name" }); }
if (!value) { return res.json({ message: "Please Enter Value" }); }
//Step 1
models.dynamic.findOne({}, function (findError, findResponse) {
if (findResponse == null) {
let insertedValue = {
name: name,
value: value
}
//Step 2
models.dynamic.create(insertedValue, function (error, response) {
res.json({
message: "successfully inserted"
})
})
}
else {
let pushedValue = {
name: findResponse.name,
value: findResponse.value
}
let updateWith = {
$set: { name: name, value: value },
$push: { previous: pushedValue }
}
let options = { upsert: true }
//Step 3
models.dynamic.updateOne({}, updateWith, options, function (error, updatedResponse) {
if (updatedResponse.nModified == 1) {
res.json({
message: "successfully inserted"
})
}
})
}
})
}
//This is the schema
var multipleAddSchema = mongoose.Schema({
"name":String,
"value":Number,
"previous":[]
})

Remove duplicates based on a key and referenced Objects in Mongodb?

I have MongoDB models of Actor and Movies. The Mongoose schema of both the models is as following :
var ActorsSchema = new Schema({
id : {
type : Number
},
known_for:[{
type: Schema.Types.ObjectId,
ref: 'Movie'
}]
})
var MovieSchema = new Schema({
genres: [{
type: Schema.Types.ObjectId,
ref: 'Genre'
}],
id: {
type: Number
}
});
known_for attribute in the actor model contains the reference to a list of movies in which that actor has starred.
I want to delete duplicate Actor records, where duplicates are determined by the id field (not _id). I also want the movies referenced in a deleted actor's known_for field to be deleted. I want to do this from the Mongo interface, because the number of records in these collections is very large and doing it programmatically would be too slow.
I have looked into a related question, but it does not apply to models that reference other models in their fields.
Consider using the aggregation framework to identify the duplicate documents, get a list of the duplicate _ids for the actors collection alongside the arrays of movie ids and issue remove and update commands with the ids array as the query.
For testing purposes, suppose you have the following data in your collections (with minimum test cases, for demonstration purposes of course):
db.movies.insert([
{
"_id" : ObjectId("5543e79e42063d2be5d2ea84"),
"id" : 1,
"genres" : []
},
{
"_id" : ObjectId("5543e79e42063d2be5d2ea85"),
"id" : 2,
"genres" : []
},
{
"_id" : ObjectId("5543e79e42063d2be5d2ea86"),
"id" : 3,
"genres" : []
}
]);
db.actors.insert([
{ id: 1, known_for: [ObjectId("5543e79e42063d2be5d2ea84")] },
{ id: 1, known_for: [ObjectId("5543e79e42063d2be5d2ea84")] },
{ id: 2, known_for: [ObjectId("5543e79e42063d2be5d2ea84"), ObjectId("5543e79e42063d2be5d2ea85")] },
{ id: 3, known_for: [ObjectId("5543e79e42063d2be5d2ea85"), ObjectId("5543e79e42063d2be5d2ea86")] }
]);
Now for the magical part. The aggregation pipeline groups the actors documents by id, calculates the grouped count, creates two array fields which hold the actor _id duplicates and the movies object ids. The pipeline outputs the results to a collection dupes that will be used later on to remove the duplicates:
db.actors.aggregate([
{
"$group": {
"_id": "$id",
"duplicates": { "$addToSet": "$_id" },
"movies": { "$addToSet": "$known_for"},
"count": { "$sum": 1 }
}
},
{
"$match": {
"count": { "$gt": 1 }
}
},
{
"$out": "dupes"
}
])
Querying the dupes collection will give the result:
/* 1 */
{
"_id" : 1.0000000000000000,
"duplicates" : [
ObjectId("5543fc8e42063d2be5d2eaa2"),
ObjectId("5543fc8e42063d2be5d2eaa1")
],
"movies" : [
[
ObjectId("5543e79e42063d2be5d2ea84")
]
],
"count" : 2
}
Now for the fun part. Use the dupes collection to then remove the dupes from the actors collection. As you have noticed from the dupes collection, the movies field is an array of arrays so you will need to flatten it and use the flattened array to then remove the movies and pull the orphaned movie references from the actors collection:
db.dupes.find({}).forEach( function (doc) {
var movie_dupes = [];
db.actors.remove({ "_id": { "$in": doc.duplicates } });
doc.movies.forEach( function (arr){
arr.forEach(function (id){
movie_dupes.push(id)
});
});
db.movies.remove({ "_id": { "$in": movie_dupes } });
db.actors.update({ "known_for": { "$in": movie_dupes } }, { "$pull": { "known_for": { "$in": movie_dupes } } }, { "multi": true });
});
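The nested forEach loops above flatten the array of arrays by hand. In an environment with modern JavaScript (Node.js or mongosh, which is an assumption about your setup), the same step could be sketched as below. Strings stand in for ObjectId values here; with real ObjectIds, removing repeats through a Set would need their string form, because a Set compares objects by identity. flattenMovieIds is a hypothetical name.

```javascript
// Hypothetical helper: flatten the array of arrays produced by $addToSet
// and drop repeated ids. Strings stand in for ObjectId values here.
function flattenMovieIds(movies) {
  return [...new Set(movies.flat())];
}

// Example shaped like a document from the dupes collection.
const dupe = {
  _id: 1,
  duplicates: ['actorA', 'actorB'],
  movies: [['movie84'], ['movie84', 'movie85']],
  count: 2,
};

console.log(flattenMovieIds(dupe.movies)); // → ["movie84", "movie85"]
```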
Logs to console:
Removed 2 record(s) in 38ms
Removed 1 record(s) in 2ms
Updated 1 existing record(s) in 1ms
Now to verify whether our duplicates have been obliterated:
db.actors.find()
/* 1 */
{
"_id" : ObjectId("5543fc8e42063d2be5d2eaa3"),
"id" : 2,
"known_for" : [
ObjectId("5543e79e42063d2be5d2ea85")
]
}
/* 2 */
{
"_id" : ObjectId("5543fc8e42063d2be5d2eaa4"),
"id" : 3,
"known_for" : [
ObjectId("5543e79e42063d2be5d2ea85"),
ObjectId("5543e79e42063d2be5d2ea86")
]
}
Actor with id 1 (which was a duplicate) was indeed removed.
db.movies.find()
/* 1 */
{
"_id" : ObjectId("5543e79e42063d2be5d2ea85"),
"id" : 2,
"genres" : []
}
/* 2 */
{
"_id" : ObjectId("5543e79e42063d2be5d2ea86"),
"id" : 3,
"genres" : []
}
Movie with ObjectId("5543e79e42063d2be5d2ea84") was removed.

way to update multiple documents with different values

I have the following documents:
[{
"_id":1,
"name":"john",
"position":1
},
{"_id":2,
"name":"bob",
"position":2
},
{"_id":3,
"name":"tom",
"position":3
}]
In the UI a user can change position of items(eg moving Bob to first position, john gets position 2, tom - position 3).
Is there any way to update all positions in all documents at once?
You can not update two documents at once with a MongoDB query. You will always have to do that in two queries. You can of course set a value of a field to the same value, or increment with the same number, but you can not do two distinct updates in MongoDB with the same query.
You can use db.collection.bulkWrite() to perform multiple operations in bulk. It has been available since 3.2.
It is possible to perform operations out of order to increase performance.
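As a minimal sketch of how the reordering from the question might be turned into bulkWrite operations: the helper below takes the new order as an array of _id values and emits one updateOne operation per document. buildPositionOps is a hypothetical name and "people" is an assumed collection name.

```javascript
// Hypothetical helper: turn the new UI order (an array of _id values)
// into one updateOne operation per document.
function buildPositionOps(orderedIds) {
  return orderedIds.map((id, index) => ({
    updateOne: {
      filter: { _id: id },
      update: { $set: { position: index + 1 } },
    },
  }));
}

// Bob (2) moved to the top, then John (1), then Tom (3).
const ops = buildPositionOps([2, 1, 3]);
console.log(JSON.stringify(ops, null, 2));
// In the shell: db.people.bulkWrite(ops, { ordered: false })
// ("people" is an assumed collection name).
```

All operations travel to the server in one call, and since no update depends on another, unordered execution is safe here.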
From MongoDB 4.2 you can do this with an aggregation pipeline in update, using the $set operator.
The aggregation operators now make many approaches possible; here is one of them:
exports.updateDisplayOrder = async keyValPairArr => {
try {
let data = await ContestModel.collection.update(
{ _id: { $in: keyValPairArr.map(o => o.id) } },
[{
$set: {
displayOrder: {
$let: {
vars: { obj: { $arrayElemAt: [{ $filter: { input: keyValPairArr, as: "kvpa", cond: { $eq: ["$$kvpa.id", "$_id"] } } }, 0] } },
in:"$$obj.displayOrder"
}
}
}
}],
{ runValidators: true, multi: true }
)
return data;
} catch (error) {
throw error;
}
}
example key val pair is: [{"id":"5e7643d436963c21f14582ee","displayOrder":9}, {"id":"5e7643e736963c21f14582ef","displayOrder":4}]
Since MongoDB 4.2 update can accept aggregation pipeline as second argument, allowing modification of multiple documents based on their data.
See https://docs.mongodb.com/manual/reference/method/db.collection.update/#modify-a-field-using-the-values-of-the-other-fields-in-the-document
Excerpt from documentation:
Modify a Field Using the Values of the Other Fields in the Document
Create a members collection with the following documents:
db.members.insertMany([
{ "_id" : 1, "member" : "abc123", "status" : "A", "points" : 2, "misc1" : "note to self: confirm status", "misc2" : "Need to activate", "lastUpdate" : ISODate("2019-01-01T00:00:00Z") },
{ "_id" : 2, "member" : "xyz123", "status" : "A", "points" : 60, "misc1" : "reminder: ping me at 100pts", "misc2" : "Some random comment", "lastUpdate" : ISODate("2019-01-01T00:00:00Z") }
])
Assume that instead of separate misc1 and misc2 fields, you want to gather these into a new comments field. The following update operation uses an aggregation pipeline to:
add the new comments field and set the lastUpdate field.
remove the misc1 and misc2 fields for all documents in the collection.
db.members.update(
{ },
[
{ $set: { status: "Modified", comments: [ "$misc1", "$misc2" ], lastUpdate: "$$NOW" } },
{ $unset: [ "misc1", "misc2" ] }
],
{ multi: true }
)
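Applied to the positions from the question, one possible way to use a pipeline update is to generate a $switch expression with one branch per document. This is a sketch under assumptions: buildPositionPipeline is a hypothetical helper, "people" is an assumed collection name, and the pairs array must be non-empty because $switch needs at least one branch.

```javascript
// Hypothetical helper: build an update pipeline that assigns each listed
// document its own position and leaves every other document unchanged.
function buildPositionPipeline(pairs) {
  return [{
    $set: {
      position: {
        $switch: {
          branches: pairs.map(p => ({
            case: { $eq: ['$_id', p._id] },
            then: p.position,
          })),
          // Documents that match no branch keep their current position.
          default: '$position',
        },
      },
    },
  }];
}

const pairs = [{ _id: 1, position: 2 }, { _id: 2, position: 1 }];
const pipeline = buildPositionPipeline(pairs);
console.log(JSON.stringify(pipeline, null, 2));
// In the shell (MongoDB 4.2+):
// db.people.update({ _id: { $in: pairs.map(p => p._id) } }, pipeline, { multi: true })
```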
Suppose that, after the positions have been changed, your array looks like this:
const objectToUpdate = [{
"_id":1,
"name":"john",
"position":2
},
{
"_id":2,
"name":"bob",
"position":1
},
{
"_id":3,
"name":"tom",
"position":3
}].map( eachObj => {
return {
updateOne: {
filter: { _id: eachObj._id },
update: { $set: { name: eachObj.name, position: eachObj.position } }
}
}
})
YourModelName.bulkWrite(objectToUpdate,
{ ordered: false }
).then((result) => {
console.log(result);
}).catch(err=>{
console.log(err.result.result.writeErrors[0].err.op.q);
})
This updates each document's position with its own value.
Note : I have used here ordered : false for better performance.

MongoDB: select matched elements of subcollection

I'm using mongoose.js to do queries to mongodb, but I think my problem is not specific to mongoose.js.
Say I have only one record in the collection:
var album = new Album({
tracks: [{
title: 'track0',
language: 'en',
},{
title: 'track1',
language: 'en',
},{
title: 'track2',
language: 'es',
}]
})
I want to select all tracks with language field equal to 'en', so I tried two variants:
Album.find({'tracks.language':'en'}, {'tracks.$': 1}, function(err, albums){
and tried to do the same thing with an $elemMatch projection:
Album.find({}, {tracks: {$elemMatch: {'language': 'en'}}}, function(err, albums){
In either case I got the same result:
{tracks:[{title: 'track0', language: 'en'}]}
The selected album.tracks contains only ONE track element, with title 'track0', but it should contain both 'track0' and 'track1':
{tracks:[{title: 'track0', language: 'en'}, {title: 'track1', language: 'en'}]}
What am I doing wrong?
Like @JohnnyHK already said, you'll have to use the aggregation framework to accomplish that because both $ and $elemMatch only return the first match.
Here's how:
db.Album.aggregate(
// This is optional. It might make your query faster if you have
// many albums that don't have any English tracks. Take a larger
// collection and measure the difference. YMMV.
{ $match: {tracks: {$elemMatch: {'language': 'en'}} } },
// This will create an 'intermediate' document for each track
{ $unwind : "$tracks" },
// Now filter out the documents that don't contain an English track
// Note: at this point, documents' 'tracks' element is not an array
{ $match: { "tracks.language" : "en" } },
// Re-group so the output documents have the same structure, ie.
// make tracks a subdocument / array again
{ $group : { _id : "$_id", tracks : { $addToSet : "$tracks" } }}
);
You might want to try that aggregate query with only the first expression and then add expressions line by line to see how the output is changed. It's particularly important to understand how $unwind creates intermediate documents that are later re-merged using $group and $addToSet.
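To make the intermediate documents concrete, here is a plain-JavaScript model of the $unwind, $match and $group steps on the album from the question. It is only an illustration: the _id value is made up, and unlike $addToSet the sketch keeps the original track order and does not remove duplicate tracks.

```javascript
const albums = [{
  _id: 'album1', // made-up id for the example
  tracks: [
    { title: 'track0', language: 'en' },
    { title: 'track1', language: 'en' },
    { title: 'track2', language: 'es' },
  ],
}];

// $unwind: one intermediate document per array element; "tracks" is now a single object.
const unwound = albums.flatMap(a => a.tracks.map(t => ({ _id: a._id, tracks: t })));

// $match: keep only the intermediate documents whose track is English.
const english = unwound.filter(d => d.tracks.language === 'en');

// $group: collect the surviving tracks back into one array per album.
const grouped = new Map();
for (const d of english) {
  if (!grouped.has(d._id)) grouped.set(d._id, { _id: d._id, tracks: [] });
  grouped.get(d._id).tracks.push(d.tracks);
}
const result = [...grouped.values()];

console.log(unwound.length); // 3 intermediate documents
console.log(JSON.stringify(result));
```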
Results:
> db.Album.aggregate(
{ $match: {tracks: {$elemMatch: {'language': 'en'}} } },
{ $unwind : "$tracks" },
{ $match: { "tracks.language" : "en" } },
{ $group : { _id : "$_id", tracks : { $addToSet : "$tracks" } }} );
{
"result" : [
{
"_id" : ObjectId("514217b1c99766f4d210c20b"),
"tracks" : [
{
"title" : "track1",
"language" : "en"
},
{
"title" : "track0",
"language" : "en"
}
]
}
],
"ok" : 1
}

MongoDB 2.2 Aggregation Framework group by field name

Is it possible to group-by field name? Or do I need a different structure so I can group-by value?
I know we can use group by on values and we can unwind arrays, but is it possible to get total apples, pears and oranges owned by John amongst the three houses here without specifying "apples", "pears" and "oranges" explicitly as part of the query? (so NOT like this);
// total all the fruit John has at each house
db.houses.aggregate([
{
$group: {
_id: null,
"apples": { $sum: "$people.John.items.apples" },
"pears": { $sum: "$people.John.items.pears" },
"oranges": { $sum: "$people.John.items.oranges" },
}
},
])
In other words, can I group-by the first field-name under "items" and get the aggregate sum of apples:104, pears:202 and oranges:306, but also bananas, melons and anything else that might be there? Or do I need to restructure the data into an array of key/value pairs like categories?
db.createCollection("houses");
db.houses.remove();
db.houses.insert(
[
{
House: "birmingham",
categories : [
{
k : "location",
v : { d : "central" }
}
],
people: {
John: {
items: {
apples: 2,
pears: 1,
oranges: 3,
}
},
Dave: {
items: {
apples: 30,
pears: 20,
oranges: 10,
},
},
},
},
{
House: "London", categories: [{ k: "location", v: { d: "central" } }, { k: "type", v: { d: "rented" } }],
people: {
John: { items: { apples: 2, pears: 1, oranges: 3, } },
Dave: { items: { apples: 30, pears: 20, oranges: 10, }, },
},
},
{
House: "Cambridge", categories: [{ k: "type", v: { d: "rented" } }],
people: {
John: { items: { apples: 100, pears: 200, oranges: 300, } },
Dave: { items: { apples: 0.3, pears: 0.2, oranges: 0.1, }, },
},
},
]
);
Secondly, and more importantly, could I then also group by "house.categories.k" ? In other words, is it possible to find out how many "apples" "John" has in "rented" vs "owned" or "friends" houses (so group by "categories.k.type")?
Finally - if this is even possible, is it sensible? At first I thought it was quite useful to create dictionaries of nested objects using actual field names of the object, as it seemed a logical use of a document database, and it seemed to make the MR queries easier to write vs arrays, but now I'm starting to wonder if this is all a bad idea and having variable field names makes it very tricky/inefficient to write aggregation queries.
OK, so I think I have this partially solved. At least for the shape of data in the initial question.
// How many of each type of fruit does John have at each location
db.houses.aggregate([
{
$unwind: "$categories"
},
{
$match: { "categories.k": "location" }
},
{
$group: {
_id: "$categories.v.d",
"numberOf": { $sum: 1 },
"Total Apples": { $sum: "$people.John.items.apples" },
"Total Pears": { $sum: "$people.John.items.pears" },
}
},
])
which yields;
{
"result" : [
{
"_id" : "central",
"numberOf" : 2,
"Total Apples" : 4,
"Total Pears" : 2
}
],
"ok" : 1
}
Note that there's only "central", but if I had other "location"s in my DB I'd get a range of totals for each location. I wouldn't need the $unwind step if I had named properties instead of an array for "categories", but this is where I find the structure is at odds with itself. There are several keywords likely under "categories". The sample data shows "type" and "location" but there could be around 10 of these categorizations all with different values. So if I used named fields;
"categories": {
location: "london",
type: "owned",
}
...the problem I then have is indexing. I can't afford to simply index "location" since those are user-defined categories, and if 10,000 users choose 10,000 different ways of categorizing their houses I'd need 10,000 indexes, one for each field. But by making it an array I only need one on the array field itself. The downside is the $unwind step. I ran into this before with MapReduce. The last thing you want to be doing is a ForEach loop in JavaScript to cycle an array if you can help it. What you really want is to filter out the fields by name because it's much quicker.
Now this is all well and good where I already know what fruit I'm looking for, but if I don't, it's much harder. I can't (as far as I can see) $unwind or otherwise ForEach "people.John.items" here. If I could, I'd be overjoyed. So since the names of fruit are again user-defined, it looks like I need to convert them to an array as well, like this;
{
"people" : {
"John" : {
"items" : [
{ k:"apples", v:100 },
{ k:"pears", v:200 },
{ k:"oranges", v:300 },
]
},
}
}
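The reshaping described above, from named fields to an array of key/value pairs, could be done with a small helper before the documents are written. toKeyValueArray is a hypothetical name, and Object.entries needs a reasonably modern JavaScript engine.

```javascript
// Hypothetical helper: convert { apples: 100, ... } into [{ k: "apples", v: 100 }, ...].
function toKeyValueArray(items) {
  return Object.entries(items).map(([k, v]) => ({ k, v }));
}

const johnItems = toKeyValueArray({ apples: 100, pears: 200, oranges: 300 });
console.log(JSON.stringify(johnItems));
```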
So that now allows me get the fruit (where I don't know which fruit to look for) totalled, again by location;
db.houses.aggregate([
{
$unwind: "$categories"
},
{
$match: { "categories.k": "location" }
},
{
$unwind: "$people.John.items"
},
{
$group: { // compound key - thanks to Jenna
_id: { fruit:"$people.John.items.k", location:"$categories.v.d" },
"numberOf": { $sum: 1 },
"Total Fruit": { $sum: "$people.John.items.v" },
}
},
])
So now I'm doing TWO $unwinds. If you're thinking that looks grotesquely inefficient, you'd be right. If I have just 10,000 house records, with 10 categories each, and 10 types of fruit, this query takes half a minute to run.
OK, so I can see that moving the $match before the $unwind improves things significantly but then it's the wrong output. I don't want an entry for every category, I want to filter out just the "location" categories.
I would have made this comment, but it's easier to format in a response text box.
{ _id: 1,
house: "New York",
people: {
John: {
items: {apples: 1, oranges:2}
},
Dave: {
items: {apples: 2, oranges: 1}
}
}
}
{ _id: 2,
house: "London",
people: {
John: {
items: {apples: 3, oranges:2}
},
Dave: {
items: {apples: 1, oranges:3}
}
}
}
Just to make sure I understand your question, is this what you're trying to accomplish?
{location: "New York", johnFruit:3}
{location: "London", johnFruit: 5}
Since categories is not nested under house, you can't group by "house.categories.k", but you can use a compound key for the _id of $group to get this result:
{ $group: { _id: { house: "$House", category: "$categories.k" } } }
Although "k" doesn't contain the information that you're presumably trying to group by. And as for "categories.k.type", type is the value of k, so you can't use this syntax. You would have to group by "categories.v.d".
It may be possible with your current schema to accomplish this aggregation using $unwind, $project, possibly $match, and finally $group, but the command won't be pretty. If possible, I would highly recommend restructuring your data to make this aggregation much simpler. If you would like some help with schema, please let us know.
I'm not sure if this is a possible solution, but what if you begin the aggregation process by determining the number of different locations using distinct(), and run separate aggregation commands for each location? distinct() may not be efficient, but every subsequent aggregation will be able to use $match, and therefore, the index on categories. You could use the same logic to count the fruit for "categories.type".
{
"_id" : 1,
"house" : "New York",
"people" : {
"John" : [{"k" : "apples","v" : 1},{"k" : "oranges","v" : 2}],
"Dave" : [{"k" : "apples","v" : 2},{"k" : "oranges","v" : 1}]
},
"categories" : [{"location" : "central"},{"type" : "rented"}]
}
{
"_id" : 2,
"house" : "London",
"people" : {
"John" : [{"k" : "apples","v" : 3},{"k" : "oranges","v" : 2}],
"Dave" : [{"k" : "apples","v" : 3},{"k" : "oranges","v" : 1}]
},
"categories" : [{"location" : "suburb"},{"type" : "rented"}]
}
{
"_id" : 3,
"house" : "London",
"people" : {
"John" : [{"k" : "apples","v" : 0},{"k" : "oranges","v" : 1}],
"Dave" : [{"k" : "apples","v" : 2},{"k" : "oranges","v" : 4}]
},
"categories" : [{"location" : "central"},{"type" : "rented"}]
}
Run distinct() and iterate through the results by running aggregate() commands for each unique value of "categories.location":
db.agg.distinct("categories.location")
[ "central", "suburb" ]
db.agg.aggregate(
{$match: {categories: {location:"central"}}}, //the index entry is on the entire
{$unwind: "$people.John"}, //document {location:"central"}, so
{$group:{ //use this syntax to use the index
_id:"$people.John.k",
"numberOf": { $sum: 1 },
"Total Fruit": { $sum: "$people.John.v"}
}
}
)
{
"result" : [
{
"_id" : "oranges",
"numberOf" : 2,
"Total Fruit" : 3
},
{
"_id" : "apples",
"numberOf" : 2,
"Total Fruit" : 1
}
],
"ok" : 1
}
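The distinct-then-aggregate idea can be checked against the sample documents with a small in-memory model. This is a sketch, not shell code: the documents are trimmed to John's items, and distinctLocations and johnFruitTotals are hypothetical stand-ins for the distinct() call and the per-location aggregate() call.

```javascript
// Sample documents from above, trimmed to the fields the sketch needs.
const houses = [
  { _id: 1, people: { John: [{ k: 'apples', v: 1 }, { k: 'oranges', v: 2 }] },
    categories: [{ location: 'central' }, { type: 'rented' }] },
  { _id: 2, people: { John: [{ k: 'apples', v: 3 }, { k: 'oranges', v: 2 }] },
    categories: [{ location: 'suburb' }, { type: 'rented' }] },
  { _id: 3, people: { John: [{ k: 'apples', v: 0 }, { k: 'oranges', v: 1 }] },
    categories: [{ location: 'central' }, { type: 'rented' }] },
];

// Stand-in for db.agg.distinct("categories.location").
function distinctLocations(docs) {
  const values = docs.flatMap(d => d.categories.map(c => c.location));
  return [...new Set(values.filter(v => v !== undefined))];
}

// Stand-in for the $match / $unwind / $group pipeline run for one location.
function johnFruitTotals(docs, location) {
  const totals = {};
  docs
    .filter(d => d.categories.some(c => c.location === location))
    .forEach(d => d.people.John.forEach(({ k, v }) => {
      totals[k] = (totals[k] || 0) + v;
    }));
  return totals;
}

// One "aggregation" per distinct location.
for (const location of distinctLocations(houses)) {
  console.log(location, johnFruitTotals(houses, location));
}
```

For "central" the totals agree with the aggregation result shown above: 3 oranges and 1 apple.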