Use Case:
I've got a mongodb collection with a couple million documents. Documents in this
collection must be updated sometimes. Therefore I've set up a monitorFrequency field which defines that a specific document must be updated every 6, 12, 24 or 720 hours. Additionally I set up a field called lastRefreshAt, which is a timestamp of the last actual update.
The problem:
How can I select all documents from my collection profiles which need to be refreshed again (because more time than monitorFrequency allows has passed since lastRefreshAt)?
Should I run that as a single query which returns only those documents which need to be refreshed, or should I rather iterate over all documents with a cursor and check in my node application whether each document needs to be refreshed?
I would know how to approach #2, but I am not sure which approach to choose or how the query for #1 would look.
There are a couple of approaches depending on available architecture and choices. Some are good choices and some are bad, but we might as well explain them all.
Use $where with multi-update
As a first option to examine, you could use $where to calculate the difference for selection and feed that directly to .update(), or .updateMany() for that matter:
db.profiles.update(
    {
        "$where": function() {
            return (Date.now() - this.lastRefreshAt.valueOf())
                > ( this.monitorFrequency * 1000 * 60 * 60 );
        }
    },
    { "$currentDate": { "lastRefreshAt": true } },
    { "multi": true }
)
That simply works out the difference in milliseconds between the stored "lastRefreshAt" value and the current Date, and compares it to the stored "monitorFrequency" converted into milliseconds itself.
The $currentDate is applied because it is a "multi" update applied to all matched documents, so this ensures the "server timestamp" at the actual time of the document update is written to each document.
It's not fantastic, as it requires a full collection scan in order to select the documents via calculation and thus cannot use an index. Plus it's JavaScript evaluation, which, not being native code, adds overhead.
Loop the matched selection
So JavaScript is not that great a selection option in general when other options apply. Instead try using the aggregation framework for the calculation and loop the cursor result:
var ops = [];

db.profiles.aggregate([
    { "$redact": {
        "$cond": {
            "if": {
                "$gt": [
                    { "$subtract": [ new Date(), "$lastRefreshAt" ] },
                    { "$multiply": [ "$monitorFrequency", 1000 * 60 * 60 ] }
                ]
            },
            "then": "$$KEEP",
            "else": "$$PRUNE"
        }
    }}
]).forEach(doc => {
    ops.push({
        "updateOne": {
            "filter": { "_id": doc._id },
            "update": { "$currentDate": { "lastRefreshAt": true } }
        }
    });

    // send in batches of 1000
    if ( ops.length >= 1000 ) {
        db.profiles.bulkWrite(ops);
        ops = [];
    }
});

// flush any remaining operations
if ( ops.length > 0 ) {
    db.profiles.bulkWrite(ops);
    ops = [];
}
So again that's a collection scan due to the calculation, but it is done with native operators, so that part at least should be a bit faster. Also, from a technical standpoint, it's a little different because new Date() is established once at the time of the request and not per document iterated, as it would be using $where. Lacking an operator to produce the "current date" internally, the aggregation framework has no way to do this per iteration.
And of course, instead of just applying our "update" expression as it matches documents, we are looping the result cursor and applying a function. So whilst there are "some" gains, there is also additional overhead. Mileage may vary as to performance and practicality.
Parallel Updates
Personally I would do neither of the above and would instead run a separate query for each marked "monitorFrequency", looking for the dates between the boundaries that exceed the allowed difference.
As a simple example, using NodeJS with Promise.all() for parallel calls:
const MongoClient = require('mongodb').MongoClient;

const oneHour = 1000 * 60 * 60;

(async function() {

  let db;

  try {
    db = await MongoClient.connect('mongodb://localhost/test');

    let collection = db.collection('profiles');
    let intervals = [6, 12, 24, 720];
    let snapDate = new Date();

    await Promise.all(
      intervals.map( (monitorFrequency, i) =>
        collection.updateMany(
          {
            monitorFrequency,
            "lastRefreshAt": Object.assign(
              { "$lt": new Date(snapDate.valueOf() - intervals[i] * oneHour) },
              // bound the range below by the next interval, except for the last one
              (i < intervals.length - 1)
                ? { "$gt": new Date(snapDate.valueOf() - intervals[i+1] * oneHour) }
                : {}
            )
          },
          { "$currentDate": { "lastRefreshAt": true } }
        )
      )
    );

  } catch(e) {
    console.error(e);
  } finally {
    db.close();
  }

})();
This would allow you to index on the two fields and allow optimal selection, and since the "date ranges" are paired to their calculated difference from "monitorFrequency" then those documents that "require refresh" are the only ones that get selected for update.
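As a rough sketch of what that index might look like (the field order here is my assumption, worth verifying against your own workload), a compound index covering both parts of each selection:
// compound index covering both selection fields
db.profiles.createIndex({ "monitorFrequency": 1, "lastRefreshAt": 1 })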
Given the finite number of possible intervals, this is what I would suspect to be the most optimal solution. But the construction, along with the fact that the actual "update" portion remains consistent for each selection, leads to one other option.
Use $or for each selection
Much the same logic as above, but instead applied to build an $or condition for the "query" portion of a "single" update. It is an "array of criteria" after all, which is essentially the same as the "array of queries" we are issuing above. So just turn it around a little:
let oneHour = 1000 * 60 * 60;
let intervals = [6, 12, 24, 720];
let snapDate = new Date();

db.profiles.updateMany(
    {
        "$or": intervals.map( (monitorFrequency, i) =>
            ({
                monitorFrequency,
                "lastRefreshAt": Object.assign(
                    { "$lt": new Date(snapDate.valueOf() - intervals[i] * oneHour) },
                    (i < intervals.length - 1)
                        ? { "$gt": new Date(snapDate.valueOf() - intervals[i+1] * oneHour) }
                        : {}
                )
            })
        )
    },
    { "$currentDate": { "lastRefreshAt": true } }
)
This then becomes one simple statement which can of course use indexes where available. Generally this is what you should be doing, though as I have suggested, my intuition tells me that four threads of execution, constrained only by the slowest one, get the job done slightly faster. Again, mileage may vary on that, but logic dictates that this is so.
So the basic lesson here is that whilst you may think the logical approach is to calculate the values and compare within the database itself, it's actually the worst possible thing you can do for query performance.
The simple approach is to work out the criteria that should select the documents you want "before" you issue the query statement to the server. This means you are comparing against "concrete values" rather than "calculation results". And "concrete values" can actually be indexed, which is generally what you want for database queries.
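To confirm that a given selection actually uses such an index, you can always inspect the winning plan with explain(); a quick illustrative check (the values here are placeholders):
db.profiles.find({
    "monitorFrequency": 6,
    "lastRefreshAt": { "$lt": new Date(Date.now() - 6 * 1000 * 60 * 60) }
}).explain("executionStats")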
I am currently doing some basic mapReduce with MongoDB.
I currently have data that looks like this:
db.football_team.insert({name: "Tane Shane", weight: 93, gender: "m"});
db.football_team.insert({name: "Lily Jones", weight: 45, gender: "f"});
...
I want to create a mapReduce function to group data by gender and show
Total number of each gender, Male & Female
Average weight of each gender
I can create a map / reduce function to carry out each task separately, I just can't get my head around how to show output for both. I am guessing that since the grouping is based on gender, the map function should stay the same and I just need to alter something in the reduce section...
Work so far
var map1 = function() {
    var key = this.gender;
    emit(key, { count: 1 });
};

var reduce1 = function(key, values) {
    var sum = 0;
    values.forEach(function(value) { sum += value["count"]; });
    return { count: sum };
};

db.football_team.mapReduce(map1, reduce1, { out: "gender_stats" });
Output
db.gender_stats.find()
{"_id" : "f", "value" : {"count": 12} }
{"_id" : "m", "value" : {"count": 18} }
Thanks
The key rule to "map/reduce" in any implementation is that the same shape of data needs to be emitted by the mapper as is returned by the reducer. The key reason for this is part of how "map/reduce" conceptually works: the reducer may quite possibly be called multiple times. Which means your reducer function can be called on output that was already emitted from a previous pass through the reducer, along with other data from the mapper.
MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key.
That said, your best approach to "average" is therefore to total the data along with a count, and then simply divide the two. This actually adds another step to a "map/reduce" operation as a finalize function.
db.football_team.mapReduce(
// mapper
function() {
emit(this.gender, { count: 1, weight: this.weight });
},
// reducer
function(key,values) {
var output = { count: 0, weight: 0 };
values.forEach(value => {
output.count += value.count;
output.weight += value.weight;
});
return output;
},
// options and finalize
{
"out": "gender_stats", // or { "inline": 1 } if you don't need another collection
"finalize": function(key,value) {
value.avg_weight = value.weight / value.count; // take an average
delete value.weight; // optionally remove the unwanted key
return value;
}
}
)
All fine because both mapper and reducer are emitting data with the same shape and also expecting input in that shape within the reducer itself. The finalize method of course is just invoked after all "reducing" is finally done and just processes each result.
As noted though, the aggregate() method actually does this far more effectively, in native code which does not incur the overhead (and potential security risks) of server-side JavaScript interpretation and execution:
db.football_team.aggregate([
{ "$group": {
"_id": "$gender",
"count": { "$sum": 1 },
"avg_weight": { "$avg": "$weight" }
}}
])
And that's basically it. Moreover you can actually continue and do other things after a $group pipeline stage ( or any stage for that matter ) in ways that you cannot do with a MongoDB mapReduce implementation. Notably something like applying a $sort to the results:
db.football_team.aggregate([
{ "$group": {
"_id": "$gender",
"count": { "$sum": 1 },
"avg_weight": { "$avg": "$weight" }
}},
{ "$sort": { "avg_weight": -1 } }
])
The only sorting applied by mapReduce is that the key used with emit is always sorted in ascending order. You cannot sort the aggregated output in any other way, without of course performing queries against the output collection, or by working "in memory" with results returned from the server.
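For example, with the "gender_stats" output collection used above, a descending sort by average weight has to be issued as a separate query against that collection:
// sorting mapReduce output requires a follow-up query on the output collection
db.gender_stats.find().sort({ "value.avg_weight": -1 })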
As a "side note" ( though an important one ), you probably should also consider in "learning" that the reality is the "server-side JavaScript" functionality of MongoDB is really a work-around more than being a feature. When MongoDB was first introduced, it applied a JavaScript engine for server execution mostly to make up for features which had not yet been implemented.
Thus, to make up for the incomplete implementation of many query operators and aggregation functions which would come later, adding a JavaScript engine was a "quick fix" to allow certain things to be done with minimal implementation.
The result over the years is that those JavaScript engine features are gradually being removed. The group() function of the API has been removed. The eval() function of the API is deprecated and scheduled for removal in the next major version. The writing is basically "on the wall" for the limited future of these JavaScript-on-the-server features: the clear pattern is that once native features provide support for something, the need to keep supporting the JavaScript engine for it goes away.
The core wisdom here is that focusing on learning these JavaScript-on-the-server features is probably not worth the time invested, unless you have a pressing use case that currently cannot be solved by any other means.
I want to use a facet to create a simple query that I can use to get paged data, however I have noticed that if I do this I get really poor performance compared to running just two separate queries.
As a quick test I created a collection with 50000 random documents and ran the following test:
var x = new Date();
var a = {
    count : db.getCollection("test").find({}).count(),
    data : db.getCollection("test").find({}).skip(0).limit(10)
};
var y = new Date();
print('result ' + a);
print(y - x);
var x = new Date();
var a = db.getCollection("test").aggregate([
    { "$match": {} },
    { "$facet": {
        "data": [
            { "$skip": 0 },
            { "$limit": 10 }
        ],
        "pageInfo": [
            { "$group": { "_id": null, "count": { "$sum": 1 } } }
        ]
    }}
])
var y = new Date();
print('result ' + a);
print(y - x);
The result of this is that the two separate queries, one for the find and the other for the count, take around 2 milliseconds, vs the single aggregation query taking upwards of 500 milliseconds.
Why is it that the aggregation is so slow?
Update
Even just a count without a facet within an aggregation is slow
var x = new Date();
var a = db.getCollection("test").find({}).count();
var y = new Date();
print('result ' + a);
print(y - x);
var x = new Date();
var a = db.getCollection("test").aggregate([
    { "$count": "count" }
])
var y = new Date();
print('result ' + a);
print(y - x);
In the above, with my test data set, the aggregation count takes 200ms vs the count() method taking 2ms.
This issue extends into the NodeJS MongoDB driver, where the count() method has been deprecated and replaced with a countDocuments() method. Under the hood, the new countDocuments() method uses an aggregation rather than the count command on a find. Just like my example above, it has significantly worse performance, to the point where I will continue using the deprecated method over the newer countDocuments() method.
Of course it is slow. The count() method just returns the cursor size after a query is applied (which does not necessarily require all documents to be read, depending on your query and indexes). Furthermore, with an empty query, the query optimizer knows that all documents are to be returned and basically only has to return the length of the _id index.
Aggregations, by definition, do not work that way. Unless there is a match stage actually ruling out a document, each and every document is read from “disk” (MongoDB’s own cache and FS caches aside for the moment) for further processing.
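If an approximate figure is acceptable, a count can also come purely from collection metadata without reading any documents. This is a sketch using estimatedDocumentCount(), which is available in recent shell and driver versions:
// approximate count from collection metadata; no documents are read
db.getCollection("test").estimatedDocumentCount()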
I am running into the same issue, and I just hope that anyone might have a better answer than what was previously posted.
I have a "user" collection with 12 million users in it, using MongoDB 5.0.
My query looks like this:
db.users.aggregate([
{ '$sort': { updated_at: -1 } },
{ '$facet': {
results: [
{ $skip: 0 },
{ $limit: 20 }
],
total: [
{ $count: 'count' }
]
}
}
])
The query takes around 1 minute, so that is not acceptable.
I have an index on "updated_at", that is not the issue.
Also, I have this issue even if I run it directly in the mongo shell in Compass. So it is not related to any NodeJS Mongo driver, as was previously suspected.
Can I somehow tell Mongo to use the estimated count here?
Or is there any other way to improve the query?
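One workaround in the spirit of the earlier answers, sketched here rather than verified against your data, is to drop the total branch from the $facet and issue the page and the count as two separate operations, letting the count come from metadata when no filter applies:
// hypothetical split: page via the updated_at index, count via metadata
var results = db.users.find({}).sort({ "updated_at": -1 }).skip(0).limit(20).toArray();
var total = db.users.estimatedDocumentCount();  // no document scan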
Is it possible to use the MongoDB aggregation framework to generate a time series output where any source documents that are deemed to fall within each bucket are added to that bucket?
Say my collection looks something like this:
/*light_1 on from 10AM to 1PM*/
{
"_id" : "light_1",
"on" : ISODate("2015-01-01T10:00:00Z"),
"off" : ISODate("2015-01-01T13:00:00Z"),
},
/*light_2 on from 11AM to 7PM*/
{
"_id" : "light_2",
"on" : ISODate("2015-01-01T11:00:00Z"),
"off" : ISODate("2015-01-01T19:00:00Z")
}
..and I am using a 6 hour bucket interval to generate a report for 2015-01-01. I wish my result to look something like:
{
    "start" : ISODate("2015-01-01T00:00:00Z"),
    "end" : ISODate("2015-01-01T06:00:00Z"),
    "lights_on" : []
},
{
    "start" : ISODate("2015-01-01T06:00:00Z"),
    "end" : ISODate("2015-01-01T12:00:00Z"),
    "lights_on" : ["light_1", "light_2"]
},
{
    "start" : ISODate("2015-01-01T12:00:00Z"),
    "end" : ISODate("2015-01-01T18:00:00Z"),
    "lights_on" : ["light_1", "light_2"]
},
{
    "start" : ISODate("2015-01-01T18:00:00Z"),
    "end" : ISODate("2015-01-02T00:00:00Z"),
    "lights_on" : ["light_2"]
}
A light is considered to be 'on' during a bucket if its 'on' value < the bucket 'end' AND its 'off' value >= the bucket 'start'.
I know I can use $group and the aggregation date operators to group by either start or end time, but in that case it's a one-to-one mapping. Here, a single source document may make it into several time buckets if it spans several buckets.
The report range and interval span are not known until run-time.
Introduction
Your goal here demands a bit of thinking about how to record the events, as you have them structured, into the given time period aggregations. The obvious point is that one single document, as you have them represented, can actually represent events to be reported in "multiple" time periods in the final aggregated result.
On analysis, this turns out to be a problem outside the scope of the aggregation framework, due to the time periods you are looking for. Certain events need to be "generated" beyond what can simply be grouped on, as you should be able to see.
In order to do this "generation", you need mapReduce. This has the "flow control" of JavaScript as its processing language, so it can determine whether the time between on/off crosses more than one period, and therefore emit the data for each of those periods.
As a side note, the "light" is probably not best suited to the _id, since it can possibly be turned on/off many times in a given day. So the "instance" of on/off is likely better. However, I am just following your example here; to transpose this, just replace the reference to _id within the mapper code with whatever field actually represents the light's identifier.
But onto the code:
// start date and next date for query ( should be external to main code )
var oneHour = ( 1000 * 60 * 60 ),
    sixHours = ( oneHour * 6 ),
    oneDay = ( oneHour * 24 ),
    today = new Date("2015-01-01"),    // your input
    tomorrow = new Date( today.valueOf() + oneDay ),
    yesterday = new Date( today.valueOf() - sixHours ),
    nextday = new Date( tomorrow.valueOf() + sixHours );
// main logic
db.collection.mapReduce(
    // mapper to emit data
    function() {

        // constants, and round dates to the hour
        var oneHour = ( 1000 * 60 * 60 ),
            sixHours = ( oneHour * 6 ),
            startPeriod = new Date( this.on.valueOf()
                - ( this.on.valueOf() % oneHour )),
            endPeriod = new Date( this.off.valueOf()
                - ( this.off.valueOf() % oneHour ));

        // hour to 6 hour period and convert to UTC timestamp
        startPeriod = startPeriod.setUTCHours(
            Math.floor( startPeriod.getUTCHours() / 6 ) * 6 );
        endPeriod = endPeriod.setUTCHours(
            Math.floor( endPeriod.getUTCHours() / 6 ) * 6 );

        // init empty results for each period, only on first document processed
        if ( counter == 0 ) {
            for ( var x = startDay.valueOf(); x < endDay.valueOf(); x += sixHours ) {
                emit(
                    { start: new Date(x), end: new Date(x + sixHours) },
                    { lights_on: [] }
                );
            }
        }

        // emit for every period until turned off, only within the day
        for ( var x = startPeriod; x <= endPeriod; x += sixHours ) {
            if ( ( x >= startDay ) && ( x < endDay ) ) {
                emit(
                    { start: new Date(x), end: new Date(x + sixHours) },
                    { lights_on: [this._id] }
                );
            }
        }

        counter++;
    },

    // reducer to keep all lights in one array per period
    function(key, values) {
        var result = { lights_on: [] };
        values.forEach(function(value) {
            value.lights_on.forEach(function(light) {
                if ( result.lights_on.indexOf(light) == -1 )
                    result.lights_on.push(light);
            });
        });
        result.lights_on.sort();
        return result;
    },

    // options and query
    {
        "out": { "inline": 1 },
        "query": {
            "on": { "$gte": yesterday, "$lt": tomorrow },
            "$or": [
                { "off": { "$gte": today, "$lt": nextday } },
                { "off": null },
                { "off": { "$exists": false } }
            ]
        },
        "scope": {
            "startDay": today,
            "endDay": tomorrow,
            "counter": 0
        }
    }
)
Map and Reduce
In essence, the "mapper" function looks at the current record, rounds each on/off time to hours and then works out the start hour of which six hour period the event occurred in.
With those new date values, a loop is initiated to take the starting "on" time and emit an event for the current "light" being turned on during that period, within a single-element array as explained later. Each loop increments the start period by six hours until the end "light off" time is reached.
These appear in the reducer function, which requires the same shape of input as it returns, hence the array of lights turned on in the period within the value object. It processes the emitted data under the same key as a list of these value objects.
It first iterates the list of values to reduce, then looks at the inner array of lights, which could have come from a previous reduce pass, and processes each of those into a single result array of unique lights. This is done simply by looking for the current light value within the result array and pushing it where it does not already exist.
Note the "previous pass", as if you are not familiar with how mapReduce works, then you should understand that the reducer function itself emits a result that might not have been achived by processing "all" of the possible values for the "key" in a single pass. It can and often does only process a "sub-set" of the emitted data for a key, and therefore will take a "reduced" result as input in just the same way as the data is emitted from the mapper.
That point of design is why both the mapper and reducer need to output data with the same structure, as the reducer itself can also get its input from data that has been previously reduced. This is how mapReduce deals with large data sets emitting a large number of the same key values. It typically processes in "chunks", and not all at once.
The end reduction comes down to the list of lights turned on during the period with each period start and end as the emitted key. Like this:
{
    "_id": {
        "start": ISODate("2015-01-01T06:00:00Z"),
        "end": ISODate("2015-01-01T12:00:00Z")
    },
    "value": {
        "lights_on": [ "light_1", "light_2" ]
    }
},
That "_id", "result" structure is just a property of how all mapReduce output comes out, but the desired values are all there.
Query
Now there is also a note on the query selection here, which needs to take into account that a light could already be "on" via a collection entry dated before the start of the current day. The same is true in that it can be turned "off" after the date being reported on, and may in fact have either a null value or no "off" key in the document, depending on how your data is being stored and which day is actually being observed.
That logic requires some calculation from the start of the day being reported on, considering the six-hour period both before and after that date, with query conditions as listed:
{
    "on": { "$gte": yesterday, "$lt": tomorrow },
    "$or": [
        { "off": { "$gte": today, "$lt": nextday } },
        { "off": null },
        { "off": { "$exists": false } }
    ]
}
The basic selectors there use the range operators of $gte and $lt to find the values that are greater than or equal to and less than respectively on the fields that they are testing the values of in order to find the data within a suitable range.
Within the $or condition, the various possibilities for the "off" value are considered. Either being that it falls within the range criteria, or either has a null value or possibly no key present in the document at all via the $exists operator. It depends on how you actually represent "off" where a light has not yet been turned off as to the requirements of those conditions within $or, but these would be the reasonable assumptions.
Like all MongoDB queries, all conditions are implicitly an "AND" condition unless stated otherwise.
That is still somewhat flawed, depending on how long a light could be expected to stay on. But the variables are all intentionally listed externally for adjustment to your needs, with consideration of the expected duration to fetch either before or after the date being reported on.
Creating Empty Time Series
The other note here is that the data itself is likely not to have events showing a light turned on within every time period. For that reason, a simple method is embedded in the mapper function that checks whether we are on the first iteration of results.
On that first time only, a set of the possible period keys is emitted that includes an empty array for the lights turned on in each period. This allows the reporting to also show those periods where no light was on at all as this is inserted into the data sent to the reducer and output.
You may vary this approach, as it is still dependent on there being some data that meets the query criteria in order to output anything. To cater for a truly "blank day" where no data is recorded or meets the criteria, it might be better to create an external hash table of keys, each showing an empty result for the lights, and then just "merge" the result of the mapReduce operation into those pre-existing keys to produce the report.
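A minimal sketch of that "merge" idea, assuming the same six-hour buckets and the inline output shape shown above (mapReduceResults here is an illustrative name for the results array returned by the operation):
// pre-seed every bucket of the day with an empty result
var report = {};
for ( var x = today.valueOf(); x < tomorrow.valueOf(); x += sixHours ) {
    report[x] = { start: new Date(x), end: new Date(x + sixHours), lights_on: [] };
}
// overlay whatever the mapReduce actually produced
mapReduceResults.forEach(function(doc) {
    report[doc._id.start.valueOf()].lights_on = doc.value.lights_on;
});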
Summary
There are a number of calculations on dates here, and being unaware of the actual end language implementation, I am just declaring anything that works externally to the actual mapReduce operation separately. So anything that looks like duplication here is done with that intent, making that part of the logic language independent. Most programming languages have the capability to manipulate dates in the way the methods here do.
The inputs that are language specific are then all passed in via the options block, shown as the last argument to the mapReduce method here. Notably there is the query, with its parameterized values all calculated from the date to be reported on. Then there is the "scope", which is a way to pass in values that can be used by the functions within the mapReduce operation.
With those things considered, the JavaScript code of the mapper and reducer remains unaltered, as that is what is expected by the method as input. Any variables to the process are fed by both the scope and query results in order to get the outcome without changing that code.
It is mainly, therefore, because the duration of a "light being on" can span different periods to be reported on that this becomes something the aggregation framework is not designed to do. It cannot perform the "looping" and "emission of data" required to get to the result, and that is why we use mapReduce for this task instead.
That said, great question. I don't know if you had already considered the concepts of how to achieve the results here, but at least now there is a guide for someone approaching a similar problem.
I originally misunderstood your question. Assuming I understand what you need now, this looks more like a job for map-reduce. I am not sure how you are determining the range or the interval span, so I will make these constants; you can modify that section of code as needed. You could do something like this:
var mapReduceObj = {};

mapReduceObj.map = function() {
    var start = new Date("2015-01-01T00:00:00Z").getTime(),
        end = new Date("2015-01-02T00:00:00Z").getTime(),
        interval = 21600000;    // 6 hours in milliseconds

    var time = start;
    while (time < end) {
        var endtime = time + interval;
        // emit for every bucket the light overlaps, not just the first
        if (this.on < endtime && this.off >= time) {
            emit({ start: new Date(time), end: new Date(endtime) }, { lights: [this._id] });
        }
        time = endtime;
    }
};

mapReduceObj.reduce = function(times, values) {
    // combine the arrays; same shape in and out, as mapReduce requires
    var lightsArr = { lights: [] };
    for (var i = 0; i < values.length; i++) {
        lightsArr.lights = lightsArr.lights.concat(values[i].lights);
    }
    return lightsArr;
};
The result will have the following form:
results : [
    {
        _id : {
            start : ISODate("2015-01-01T06:00:00Z"),
            end : ISODate("2015-01-01T12:00:00Z")
        },
        value : {
            lights : [
                "light_6",
                "light_7"
            ]
        }
    },
    ...
]
~~~Original Answer~~~
This should give you the exact format that you want.
db.lights.aggregate([
    { "$match": {
        "$and": [
            { on : { $lt : ISODate("2015-01-01T12:00:00Z") } },
            { off : { $gte: ISODate("2015-01-01T06:00:00Z") } }
        ]
    }},
    { "$group": {
        _id : null,
        "lights_on" : { $push : "$_id" }
    }},
    { "$project": {
        _id : false,
        start : { $literal : ISODate("2015-01-01T06:00:00Z") },
        end : { $literal : ISODate("2015-01-01T12:00:00Z") },
        lights_on : true
    }}
]);
First, the $match condition finds all documents that meet the time constraints for the bucket. Then $group pushes the _id field (in this case, light_n where n is an integer) into the lights_on array. Either $addToSet or $push could be used since the _id field is unique, but if you were using a field that could have duplicates, you would need to decide whether duplicates in the array were acceptable. Finally, $project shapes the output into the exact format you want.
One way is to use the $cond operator within $project and compare each bucket's "start" and "end" with the "on" and "off" fields of the original collection. Loop over each bucket using your MongoDB client and do something like this:
db.lights.aggregate([
    { "$project": {
        "present": { "$cond": [
            { "$and": [
                { "$lt":  [ "$on",  ISODate("2015-01-01T12:00:00Z") ] },
                { "$gte": [ "$off", ISODate("2015-01-01T06:00:00Z") ] }
            ]},
            1,
            0
        ]}
    }}
]);
The result should look something like this:
{ "_id" : "light_1", "present" : 0 }
{ "_id" : "light_2", "present" : 0 }
{ "_id" : "light_3", "present" : 1 }
For all documents with {"present":1}, add the "_id" of lights collection to the "lights_on" field with your client. Hope this helps.
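A rough sketch of that client-side step, run once per bucket (bucketStart and bucketEnd are illustrative variables for the bucket being tested):
var lights_on = [];
db.lights.aggregate([
    { "$project": {
        "present": { "$cond": [
            { "$and": [
                { "$lt":  [ "$on",  bucketEnd ] },
                { "$gte": [ "$off", bucketStart ] }
            ]},
            1,
            0
        ]}
    }}
]).forEach(function(doc) {
    if ( doc.present === 1 ) lights_on.push(doc._id);
});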
I have a collection1 of documents with tags in MongoDB. The tags are an embedded array of strings:
{
    name: 'someObj',
    tags: ['tag1', 'tag2', ...]
}
I want to know the count of each tag in the collection. Therefore I have another collection2 with tag counts:
{
    tag: 'tag1',
    score: 2
}
{
    tag: 'tag2',
    score: 10
}
Now I have to keep both in sync. It is rather trivial when inserting to or removing from collection1. However when I update collection1 I do the following:
1.) get the old document
var oldObj = collection1.findOne({ _id: id });
2.) calculate the difference between old and new tag arrays
var removedTags = $(oldObj.tags).not(obj.tags).get();
var insertedTags = $(obj.tags).not(oldObj.tags).get();
3.) update the old document
collection1.update(
    { _id: id },
    { $set: obj }
);
4.) update the scores of inserted & removed tags
// increment score of each inserted tag
insertedTags.forEach(function(val, idx) {
    // $inc will set score = 1 on insert
    collection2.update(
        { tag: val },
        { $inc: { score: 1 } },
        { upsert: true }
    )
});

// decrement score of each removed tag
removedTags.forEach(function(val, idx) {
    // $inc will set score = -1 on insert
    collection2.update(
        { tag: val },
        { $inc: { score: -1 } },
        { upsert: true }
    )
});
My questions:
A) Is this approach of keeping the scores in a separate collection efficient? Or is there a more efficient one-time query to get the scores from collection1?
B) Even if separate bookkeeping is the better choice: can it be done in fewer steps, e.g. letting MongoDB calculate which tags are new / removed?
The solution, as nickmilion correctly states, would be an aggregation. Though I would do it with a twist: we'll save its results in a collection. What we will do is trade real-time results for an extreme speed boost.
How I would do it
More often than not, the need for real-time results is overestimated. Hence, I'd go with precalculated stats for the tags and renew them every 5 minutes or so. That should be good enough, since most such calls are requested asynchronously by the client, and hence some delay, in case the calculation has to be made on a specific request, is negligible.
db.tags.aggregate(
    { $unwind: "$tags" },
    { $group: { _id: "$tags", score: { "$sum": 1 } } },
    { $out: "tagStats" }
)

db.tagStats.update(
    { 'lastRun': { $exists: true } },
    { 'lastRun': new Date() },
    { upsert: true }
)

db.tagStats.ensureIndex({ lastRun: 1 }, { sparse: true })
Ok, here is the deal. First, we unwind the tags array, group it by the individual tags, and increment the score for each occurrence of the respective tag. Next, we upsert lastRun in the tagStats collection, which we can do since MongoDB is schemaless. Then, we create a sparse index, which only holds values for documents in which the indexed field exists. In case the index already exists, ensureIndex is an extremely cheap operation; and since it is part of our code, we don't need to create the index manually. With this procedure, the following query
db.tagStats.find(
    { lastRun: { $lte: new Date( ISODate().getTime() - 300000 ) } },
    { _id: 0, lastRun: 1 }
)
becomes a covered query: a query which is answered from the index alone, which tends to reside in RAM, making this query lightning fast (slightly less than 0.5 msecs median in my tests). So what does this query do? It returns a record when the last run of the aggregation was more than 5 minutes (5*60*1000 = 300000 msecs) ago. Of course, you can adjust this to your needs.
Now, we can wrap it up:
// use findOne: it returns the matching document or null, whereas find()
// returns a cursor object, which is always truthy in the check below
// (on a brand new deployment, run the aggregation once to seed tagStats)
var hasToRun = db.tagStats.findOne(
    { lastRun: { $lte: new Date( ISODate().getTime() - 300000 ) } },
    { _id: 0, lastRun: 1 }
);

if ( hasToRun ) {
    db.tags.aggregate(
        { $unwind: "$tags" },
        { $group: { _id: "$tags", score: { "$sum": 1 } } },
        { $out: "tagStats" }
    );
    db.tagStats.update(
        { 'lastRun': { $exists: true } },
        { 'lastRun': new Date() },
        { upsert: true }
    );
    db.tagStats.ensureIndex({ lastRun: 1 }, { sparse: true });
}
// For all stats
var tagsStats = db.tagStats.find({score:{$exists:true}});
// score for a specific tag
var scoreForTag = db.tagStats.find({score:{$exists:true},_id:"tag1"});
Alternative approach
If real-time results really matter and you need the stats for all the tags, simply use the aggregation without saving it to another collection:
db.tags.aggregate(
    { $unwind: "$tags" },
    { $group: { _id: "$tags", score: { "$sum": 1 } } }
)
If you only need the results for one specific tag at a time, a real time approach could be to use a special index, create a covered query and simply count the results:
db.tags.ensureIndex({tags:1})
var numberOfOccurences = db.tags.find({tags:"tag1"},{_id:0,tags:1}).count();
Answering your questions:
B): You don't have to calculate the diff yourself; use $addToSet (see the sketch below).
A): You can get the counts via the aggregation framework with a combination of $unwind and $group.
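As a minimal sketch of the $addToSet point, reusing obj.tags and removedTags from the question (whether this fully replaces the diff depends on how the scores are maintained, so treat it as an illustration):
// $addToSet only adds what is not already present, so the new tag list
// can be applied as-is without computing insertedTags first
collection1.update(
    { _id: id },
    { $addToSet: { tags: { $each: obj.tags } } }
);
// removals would still need a diff, or an explicit $pull of the removed tags
collection1.update(
    { _id: id },
    { $pull: { tags: { $in: removedTags } } }
);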
I am totally new to MongoDB... I am missing a "newbie" tag, so the experts would not have to see this question.
I am trying to update all documents in a collection using an expression. The query I was expecting to solve this was:
db.QUESTIONS.update({}, { $set: { i_pp : i_up * 100 - i_down * 20 } }, false, true);
That, however, results in the following error message:
ReferenceError: i_up is not defined (shell):1
At the same time, the database did not have any problem with eating this one:
db.QUESTIONS.update({}, { $set: { i_pp : 0 } }, false, true);
Do I have to do this one document at a time or something? That just seems excessively complicated.
Update
Thank you, Sergio Tulentsev, for telling me that it does not work. Now I am really struggling with how to do this. I offer 500 Profit Points to the helpful soul who can write this in a way that MongoDB understands. If you register on our forum, I can add the Profit Points to your account there.
I just came across this while searching for the MongoDB equivalent of SQL like this:
update t
set c1 = c2
where ...
Sergio is correct that you can't reference another property as a value in a straight update. However, db.c.find(...) returns a cursor and that cursor has a forEach method:
Queries to MongoDB return a cursor, which can be iterated to retrieve
results. The exact way to query will vary with language driver.
Details below focus on queries from the MongoDB shell (i.e. the
mongo process).
The shell find() method returns a cursor object which we can then iterate to retrieve specific documents from the result. We use
hasNext() and next() methods for this purpose.
for ( var c = db.parts.find(); c.hasNext(); ) {
    print( c.next() );
}
Additionally in the shell, forEach() may be used with a cursor:
db.users.find().forEach( function(u) { print("user: " + u.name); } );
So you can say things like this:
db.QUESTIONS.find({}, {_id: true, i_up: true, i_down: true}).forEach(function(q) {
    db.QUESTIONS.update(
        { _id: q._id },
        { $set: { i_pp: q.i_up * 100 - q.i_down * 20 } }
    );
});
to update them one at a time without leaving MongoDB.
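If the collection is large, the same loop can batch its writes instead of issuing one update per document. Here is a sketch of that variant using bulkWrite() (available from MongoDB 3.2; this batching is my addition, not part of the original approach):
var ops = [];
db.QUESTIONS.find({}, {_id: true, i_up: true, i_down: true}).forEach(function(q) {
    ops.push({
        updateOne: {
            filter: { _id: q._id },
            update: { $set: { i_pp: q.i_up * 100 - q.i_down * 20 } }
        }
    });
    // flush in batches of 1000 to limit round trips
    if ( ops.length >= 1000 ) {
        db.QUESTIONS.bulkWrite(ops);
        ops = [];
    }
});
if ( ops.length > 0 ) db.QUESTIONS.bulkWrite(ops);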
If you're using a driver to connect to MongoDB then there should be some way to send a string of JavaScript into MongoDB; for example, with the Ruby driver you'd use eval:
connection.eval(%q{
    db.QUESTIONS.find({}, {_id: true, i_up: true, i_down: true}).forEach(function(q) {
        db.QUESTIONS.update(
            { _id: q._id },
            { $set: { i_pp: q.i_up * 100 - q.i_down * 20 } }
        );
    });
})
Other languages should be similar.
// the only difference is to make it look like an aggregation pipeline
// (update with an aggregation pipeline requires MongoDB 4.2+)
db.table.updateMany({}, [
    { $set: { col3: { "$sum": [ "$col1", "$col2" ] } } }
])
You can't use expressions in updates. Or, rather, you can't use expressions that depend on fields of the document. Simple self-contained math expressions are fine (e.g. 2 * 2).
If you want to set a new field for all documents that is a function of other fields, you have to loop over them and update manually. Multi-update won't help here.
Rha7 gave a good idea, but the code above does not work without defining a temporary variable.
This sample code produces an approximate calculation of the age (leap years are not accounted for), based on the 'birthday' field, and inserts the value into the suitable field for all documents not already containing one:
db.employers.find({age: {$exists: false}}).forEach(function(doc){
    var new_age = parseInt((ISODate() - doc.birthday)/(3600*1000*24*365));
    db.employers.update({_id: doc._id}, {$set: {age: new_age}});
});
Example to remove "00" from the beginning of a caller id:
db.call_detail_records_201312.find(
    { destination: /^001/ },
    { "destination": true }
).forEach(function(row){
    db.call_detail_records_201312.update(
        { _id: row["_id"] },
        { $set: {
            destination: row["destination"].replace(/^001/, '1')
        }}
    )
});