Is it possible to use the MongoDB aggregation framework to generate a time series output where any source documents that are deemed to fall within each bucket are added to that bucket?
Say my collection looks something like this:
/* light_1 on from 10AM to 1PM */
{
    "_id" : "light_1",
    "on" : ISODate("2015-01-01T10:00:00Z"),
    "off" : ISODate("2015-01-01T13:00:00Z")
},
/* light_2 on from 11AM to 7PM */
{
    "_id" : "light_2",
    "on" : ISODate("2015-01-01T11:00:00Z"),
    "off" : ISODate("2015-01-01T19:00:00Z")
}
...and I am using a 6-hour bucket interval to generate a report for 2015-01-01. I wish my result to look something like:
{
    "start" : ISODate("2015-01-01T00:00:00Z"),
    "end" : ISODate("2015-01-01T06:00:00Z"),
    "lights_on" : []
},
{
    "start" : ISODate("2015-01-01T06:00:00Z"),
    "end" : ISODate("2015-01-01T12:00:00Z"),
    "lights_on" : ["light_1", "light_2"]
},
{
    "start" : ISODate("2015-01-01T12:00:00Z"),
    "end" : ISODate("2015-01-01T18:00:00Z"),
    "lights_on" : ["light_1", "light_2"]
},
{
    "start" : ISODate("2015-01-01T18:00:00Z"),
    "end" : ISODate("2015-01-02T00:00:00Z"),
    "lights_on" : ["light_2"]
}
A light is considered to be 'on' during a bucket if its 'on' value < the bucket 'end' AND its 'off' value >= the bucket 'start'.
I know I can use $group and the aggregation date operators to group by either start or end time, but in that case it's a one-to-one mapping. Here, a single source document may make it into several time buckets if it spans several buckets.
The report range and interval span are not known until run-time.
Introduction
Your goal here demands a bit of thinking about when to record the events, as you have them structured, into the given time-period aggregations. The obvious point is that one single document as you have them represented can actually represent events to be reported in "multiple" time periods in the end aggregated result.
On analysis, this turns out to be a problem outside the scope of the aggregation framework, due to the time periods you are looking for. Certain events need to be "generated" beyond what can simply be grouped on, which you should be able to see.
In order to do this "generation", you need mapReduce. It has the "flow control" of JavaScript as its processing language, so it can determine whether the time between on/off crosses more than one period and therefore emit the data into each of those periods.
As a side note, the "light" is probably not best suited to the _id, since a light can possibly be turned on/off many times in a given day. So the "instance" of on/off is likely better. However, I am just following your example here, so to transpose this simply replace the reference to _id within the mapper code with whatever field actually represents the light's identifier.
But onto the code:
// start date and next date for query ( should be external to main code )
var oneHour = ( 1000 * 60 * 60 ),
    sixHours = ( oneHour * 6 ),
    oneDay = ( oneHour * 24 ),
    today = new Date("2015-01-01"),                     // your input
    tomorrow = new Date( today.valueOf() + oneDay ),
    yesterday = new Date( today.valueOf() - sixHours ),
    nextday = new Date( tomorrow.valueOf() + sixHours );
// main logic
db.collection.mapReduce(
    // mapper to emit data
    function() {
        // Constants and round date to hour
        var oneHour = ( 1000 * 60 * 60 ),
            sixHours = ( oneHour * 6 ),
            startPeriod = new Date( this.on.valueOf()
                - ( this.on.valueOf() % oneHour )),
            endPeriod = new Date( this.off.valueOf()
                - ( this.off.valueOf() % oneHour ));

        // Hour to 6 hour period and convert to UTC timestamp
        startPeriod = startPeriod.setUTCHours(
            Math.floor( startPeriod.getUTCHours() / 6 ) * 6 );
        endPeriod = endPeriod.setUTCHours(
            Math.floor( endPeriod.getUTCHours() / 6 ) * 6 );

        // Init empty results for each period only on first document processed
        if ( counter == 0 ) {
            for ( var x = startDay.valueOf(); x < endDay.valueOf(); x += sixHours ) {
                emit(
                    { start: new Date(x), end: new Date(x + sixHours) },
                    { lights_on: [] }
                );
            }
        }

        // Emit for every period until turned off, only within the day
        for ( var x = startPeriod; x <= endPeriod; x += sixHours ) {
            if ( ( x >= startDay ) && ( x < endDay ) ) {
                emit(
                    { start: new Date(x), end: new Date(x + sixHours) },
                    { lights_on: [this._id] }
                );
            }
        }
        counter++;
    },

    // reducer to keep all lights in one array per period
    function(key,values) {
        var result = { lights_on: [] };
        values.forEach(function(value) {
            value.lights_on.forEach(function(light){
                if ( result.lights_on.indexOf(light) == -1 )
                    result.lights_on.push(light);
            });
        });
        result.lights_on.sort();
        return result;
    },

    // options and query
    {
        "out": { "inline": 1 },
        "query": {
            "on": { "$gte": yesterday, "$lt": tomorrow },
            "$or": [
                { "off": { "$gte": today, "$lt": nextday } },
                { "off": null },
                { "off": { "$exists": false } }
            ]
        },
        "scope": {
            "startDay": today,
            "endDay": tomorrow,
            "counter": 0
        }
    }
)
Map and Reduce
In essence, the "mapper" function looks at the current record, rounds each on/off time down to the hour, and then works out the starting hour of the six-hour period in which each event occurred.
With those new date values, a loop is initiated to take the starting "on" time and emit an event for the current "light" being turned on during that period, within a single-element array as explained later. Each loop iteration increments the start period by six hours until the end "light off" time is reached.
These emissions arrive at the reducer function, which must accept the same shape of input that it returns; hence the array of lights turned on in the period sits within the value object. It processes the emitted data under the same key as a list of these value objects.
The reducer first iterates the list of values, then looks at the inner array of lights, which could have come from a previous reduce pass, and merges each of those into a single result array of unique lights. This is simply done by looking for the current light value within the result array and pushing it only where it does not already exist.
Note the "previous pass": if you are not familiar with how mapReduce works, you should understand that the reducer's output might not have been achieved by processing "all" of the possible values for the "key" in a single pass. It can, and often does, process only a "sub-set" of the emitted data for a key, and will therefore take a "reduced" result as input in just the same way as data emitted from the mapper.
That point of design is why both the mapper and reducer need to output the data with the same structure, as the reducer itself can also get its input from data that has been previously reduced. This is how mapReduce deals with large data sets emitting a large number of the same key values. It typically processes in "chunks", not all at once.
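To make that concrete, here is a small plain-JavaScript illustration, outside MongoDB, of a reducer being fed its own prior output. The function body is the same reducer as above:
function reducer(key, values) {
    var result = { lights_on: [] };
    values.forEach(function(value) {
        value.lights_on.forEach(function(light) {
            if ( result.lights_on.indexOf(light) == -1 )
                result.lights_on.push(light);
        });
    });
    result.lights_on.sort();
    return result;
}

var key = { start: new Date("2015-01-01T06:00:00Z"), end: new Date("2015-01-01T12:00:00Z") };

// first pass reduces only a subset of the emitted values
var partial = reducer(key, [ { lights_on: ["light_1"] }, { lights_on: [] } ]);

// a later pass mixes that previously reduced result with fresh mapper output
var merged = reducer(key, [ partial, { lights_on: ["light_2"] } ]);
// merged.lights_on => [ "light_1", "light_2" ]
Because the partial result has the same shape as a mapper emission, the second pass works without knowing which inputs were already reduced.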
The end reduction comes down to the list of lights turned on during the period with each period start and end as the emitted key. Like this:
{
    "_id" : {
        "start" : ISODate("2015-01-01T06:00:00Z"),
        "end" : ISODate("2015-01-01T12:00:00Z")
    },
    "value" : {
        "lights_on" : [ "light_1", "light_2" ]
    }
},
That "_id" and "value" structure is just a property of how all mapReduce output comes out, but the desired values are all there.
Query
Now there is also a note on the query selection, which needs to take into account that a light could already be "on" via a collection entry dated before the start of the day being reported on. Equally, it can be turned "off" after that day, and may in fact have a null value or no "off" key in the document at all, depending on how your data is stored and which day is actually being observed.
That logic requires some calculation from the start of the day being reported on, considering the six-hour period both before and after that date, with query conditions as listed:
{
    "on": { "$gte": yesterday, "$lt": tomorrow },
    "$or": [
        { "off": { "$gte": today, "$lt": nextday } },
        { "off": null },
        { "off": { "$exists": false } }
    ]
}
The basic selectors use the range operators $gte and $lt ("greater than or equal to" and "less than" respectively) on the fields they test, in order to find data within a suitable range.
Within the $or condition, the various possibilities for the "off" value are considered: either it falls within the range criteria, or it has a null value, or the key may not be present in the document at all, via the $exists operator. It depends on how you actually represent "off" where a light has not yet been turned off as to which of those conditions within $or are required, but these would be the reasonable assumptions.
Like all MongoDB queries, all conditions are implicitly an "AND" condition unless stated otherwise.
That is still somewhat flawed, depending on how long a light can be expected to stay on. But the variables are all intentionally listed externally for adjustment to your needs, with consideration of the expected duration to fetch either before or after the date being reported on.
Creating Empty Time Series
The other note here is that the data itself is likely not to have any events that show a light turned on within a given time period. For that reason, there is a simple method embedded in the mapper function that looks to see if we are on the first iteration of results.
On that first time only, a set of the possible period keys is emitted that includes an empty array for the lights turned on in each period. This allows the reporting to also show those periods where no light was on at all as this is inserted into the data sent to the reducer and output.
You may vary on this approach, as it is still dependent on there being some data that meets the query criteria in order to output anything. To cater for a truly "blank day" where no data is recorded or meets the criteria, it might be better to create an external hash table of the keys, all showing an empty result for the lights, and then "merge" the result of the mapReduce operation into those pre-existing keys to produce the report.
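As a rough sketch of that alternative in the shell, reusing today/tomorrow/sixHours from above, with mapper/reducer/options standing in for the functions and options block already defined, and assuming the inline output shape shown earlier:
// pre-build an empty entry for every six-hour bucket in the day
var buckets = {};
for ( var x = today.valueOf(); x < tomorrow.valueOf(); x += sixHours ) {
    buckets[x] = { start: new Date(x), end: new Date(x + sixHours), lights_on: [] };
}

// overlay whatever the mapReduce operation actually returned
var res = db.collection.mapReduce( mapper, reducer, options );
res.results.forEach(function(doc) {
    buckets[doc._id.start.valueOf()].lights_on = doc.value.lights_on;
});

// final report in bucket order
var report = Object.keys(buckets).sort().map(function(k) { return buckets[k]; });
With the buckets pre-built like this, the counter == 0 block in the mapper arguably becomes redundant.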
Summary
There are a number of calculations here on the dates, and being unaware of the actual end language implementation, I am declaring anything that works externally to the actual mapReduce operation separately. So anything that looks like duplication here is done with that intent, making that part of the logic language independent. Most programming languages have the capability to manipulate dates in the way the methods here do.
The inputs that are language specific are then all passed in via the options block shown as the last argument to the mapReduce method here. Notably there is the query, with its parameterized values all calculated from the date to be reported on. Then there is the "scope", which is a way to pass in values that can be used by the functions within the mapReduce operation.
With those things considered, the JavaScript code of the mapper and reducer remains unaltered, as that is what is expected by the method as input. Any variables to the process are fed by both the scope and query results in order to get the outcome without changing that code.
It is mainly therefore that because the duration of a "light being on" can span over different periods to be reported on, that this becomes something the aggregation framework is not designed to do. It cannot perform the "looping" and "emission of data" that is required to get to the result, and therefore why we use mapReduce for this task instead.
That said, great question. I don't know if you had already considered the concepts of how to achieve the results here, but at least now there is a guide for someone approaching a similar problem.
I originally misunderstood your question. Assuming I understand what you need now, this looks more like a job for map-reduce. I am not sure how you are determining the range or the interval span, so I will make these constants; you can modify that section of the code as appropriate. You could do something like this:
var mapReduceObj = {};
mapReduceObj.map = function() {
    var start = new Date("2015-01-01T00:00:00Z").getTime(),
        end = new Date("2015-01-02T00:00:00Z").getTime(),
        interval = 21600000;    // 6 hours in milliseconds
    var time = start;
    while ( time < end ) {
        var endtime = time + interval;
        // emit for every bucket the on/off range overlaps, not just the first
        if ( this.on < endtime && this.off >= time ) {
            emit({ start : new Date(time), end : new Date(endtime) }, { lights : [this._id] });
        }
        time = endtime;
    }
};
mapReduceObj.reduce = function(key, values) {
    // merge the single-element arrays emitted (or previously reduced) for this bucket
    var result = { lights : [] };
    values.forEach(function(value) {
        value.lights.forEach(function(light) {
            if ( result.lights.indexOf(light) === -1 )
                result.lights.push(light);
        });
    });
    return result;
};
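For completeness, a sketch of how you might invoke it, with inline output assumed to fit in a single result set:
var res = db.lights.mapReduce(
    mapReduceObj.map,
    mapReduceObj.reduce,
    { "out": { "inline": 1 } }
);
printjson(res.results);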
The result will have the following form:
results : [
    {
        "_id" : {
            "start" : ISODate("2015-01-01T06:00:00Z"),
            "end" : ISODate("2015-01-01T12:00:00Z")
        },
        "value" : {
            "lights" : [
                "light_6",
                "light_7"
            ]
        }
    },
    ...
]
~~~Original Answer~~~
This should give you the exact format that you want.
db.lights.aggregate([
    { "$match": {
        "$and": [
            { "on" : { "$lt" : ISODate("2015-01-01T12:00:00Z") } },
            { "off" : { "$gte": ISODate("2015-01-01T06:00:00Z") } }
        ]
    }},
    { "$group": {
        "_id" : null,
        "lights_on" : { "$push" : "$_id" }
    }},
    { "$project": {
        "_id" : false,
        "start" : { "$add" : [ ISODate("2015-01-01T06:00:00Z") ] },
        "end" : { "$add" : [ ISODate("2015-01-01T12:00:00Z") ] },
        "lights_on" : true
    }}
]);
First, the $match condition finds all documents that meet your time constraints. Then $group pushes the _id field (in this case, light_n where n is an integer) into the lights_on field. Either $addToSet or $push could be used since the _id field is unique, but if you were using a field that could have duplicates you would need to decide if duplicates in the array were acceptable. Finally, use $project to get the exact format you want.
One way is to use the $cond operator within $project and compare each bucket's "start" and "end" with the "on" and "off" fields in the original collection. Loop over each bucket using your MongoDB client and do something like this:
db.lights.aggregate([
    { "$project": {
        "present": { "$cond": [
            { "$and": [
                { "$lt": [ "$on", ISODate("2015-01-01T12:00:00Z") ] },
                { "$gte": [ "$off", ISODate("2015-01-01T06:00:00Z") ] }
            ]},
            1,
            0
        ]}
    }}
]);
The result should look something like this:
{ "_id" : "light_1", "present" : 1 }
{ "_id" : "light_2", "present" : 1 }
For all documents with {"present":1}, add the "_id" of lights collection to the "lights_on" field with your client. Hope this helps.
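To tie that together, here is a hedged sketch of the client-side loop over the buckets, assuming the 3.x shell cursor API with .toArray():
var start = ISODate("2015-01-01T00:00:00Z"),
    interval = 1000 * 60 * 60 * 6,    // 6 hour buckets
    report = [];

for ( var i = 0; i < 4; i++ ) {
    var bucketStart = new Date(start.valueOf() + i * interval),
        bucketEnd = new Date(bucketStart.valueOf() + interval);

    var lights_on = db.lights.aggregate([
        { "$project": {
            "present": { "$cond": [
                { "$and": [
                    { "$lt": [ "$on", bucketEnd ] },
                    { "$gte": [ "$off", bucketStart ] }
                ]},
                1,
                0
            ]}
        }}
    ]).toArray()
      .filter(function(doc) { return doc.present === 1; })
      .map(function(doc) { return doc._id; });

    report.push({ start: bucketStart, end: bucketEnd, lights_on: lights_on });
}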
Related
I am currently using doing some basic mapReduce using MongoDB.
I currently have data that looks like this:
db.football_team.insert({name: "Tane Shane", weight: 93, gender: "m"});
db.football_team.insert({name: "Lily Jones", weight: 45, gender: "f"});
...
I want to create a mapReduce function to group data by gender and show
Total number of each gender, Male & Female
Average weight of each gender
I can create a map/reduce function to carry out each calculation separately, I just can't get my head around how to show the output for both. I am guessing that since the grouping is based on gender, the map function should stay the same and I just need to alter something in the reduce section...
Work so far
var map1 = function() {
    var key = this.gender;
    emit(key, { count: 1 });
};

var reduce1 = function(key, values) {
    var sum = 0;
    values.forEach(function(value) { sum += value["count"]; });
    return { count: sum };
};

db.football_team.mapReduce(map1, reduce1, { out: "gender_stats" });
Output
db.gender_stats.find()
{ "_id" : "f", "value" : { "count": 12 } }
{ "_id" : "m", "value" : { "count": 18 } }
Thanks
The key rule to "map/reduce" in any implementation is basically that the same shape of data needs to be emitted by the mapper as is returned by the reducer. The reason lies in how "map/reduce" conceptually works: the reducer may quite possibly be called multiple times, which means it can be invoked on output that was already returned from a previous pass through the reducer, along with other data from the mapper.
MongoDB can invoke the reduce function more than once for the same key. In this case, the previous output from the reduce function for that key will become one of the input values to the next reduce function invocation for that key.
That said, your best approach to "average" is therefore to total the data along with a count, and then simply divide the two. This actually adds another step to a "map/reduce" operation as a finalize function.
db.football_team.mapReduce(
    // mapper
    function() {
        emit(this.gender, { count: 1, weight: this.weight });
    },
    // reducer
    function(key,values) {
        var output = { count: 0, weight: 0 };
        values.forEach(value => {
            output.count += value.count;
            output.weight += value.weight;
        });
        return output;
    },
    // options and finalize
    {
        "out": "gender_stats",  // or { "inline": 1 } if you don't need another collection
        "finalize": function(key,value) {
            value.avg_weight = value.weight / value.count;  // take an average
            delete value.weight;                            // optionally remove the unwanted key
            return value;
        }
    }
)
All fine because both mapper and reducer are emitting data with the same shape and also expecting input in that shape within the reducer itself. The finalize method of course is just invoked after all "reducing" is finally done and just processes each result.
As noted though, the aggregate() method actually does this far more effectively and in native coded methods which do not incur the overhead ( and potential security risks ) of server side JavaScript interpretation and execution:
db.football_team.aggregate([
    { "$group": {
        "_id": "$gender",
        "count": { "$sum": 1 },
        "avg_weight": { "$avg": "$weight" }
    }}
])
And that's basically it. Moreover you can actually continue and do other things after a $group pipeline stage ( or any stage for that matter ) in ways that you cannot do with a MongoDB mapReduce implementation. Notably something like applying a $sort to the results:
db.football_team.aggregate([
    { "$group": {
        "_id": "$gender",
        "count": { "$sum": 1 },
        "avg_weight": { "$avg": "$weight" }
    }},
    { "$sort": { "avg_weight": -1 } }
])
The only sorting available with mapReduce is that the key used with emit is always sorted in ascending order. You cannot sort the aggregated output in any other way without either performing queries when the output goes to another collection, or working "in memory" with the results returned from the server.
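For example, if the output went to the "gender_stats" collection as above, a descending sort by the finalized average is just an ordinary query on that collection:
// sort the materialized mapReduce output by the finalized average
db.gender_stats.find().sort({ "value.avg_weight": -1 })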
As a "side note" ( though an important one ), you probably should also consider in "learning" that the reality is the "server-side JavaScript" functionality of MongoDB is really a work-around more than being a feature. When MongoDB was first introduced, it applied a JavaScript engine for server execution mostly to make up for features which had not yet been implemented.
Thus to make up for the lack of the complete implementation of many query operators and aggregation functions which would come later, adding a JavaScript engine was a "quick fix" to allow certain things to be done with minimal implementation.
The result over the years is those JavaScript engine features are gradually being removed. The group() function of the API is removed. The eval() function of the API is deprecated and scheduled for removal at the next major version. The writing is basically "on the wall" for the limited future of these JavaScript on the server features, as the clear pattern is where the native features provide support for something, then the need to continue support for the JavaScript engine basically goes away.
The core wisdom here being that focusing on learning these JavaScript on the server features, is probably not really worth the time invested unless you have a pressing use case that currently cannot be solved by any other means.
Is it possible to run a sort on a Mongo collection before running the filtering query? I have older code in which I was getting a random result from the database by having a field which was a random float between 0 and 1, then querying with findOne to get the first document with a value greater than a random float generated at that time. The sample set was small, so I didn't notice a problem at the time, but recently I noticed that with one query I was almost always getting the same value. The "first" document had a random > .9, so nearly every query matched it first.
I realized, for this solution to work, I need to sort by random, then find the first value greater than my random. As I understand it, this isn't as necessary a solution as in the past, as $sample exists as of 3.2, but I figure learning how I could do this would be good? Plus, my understanding is that $sample can return the same document multiple times (where N > 1 obviously, so not directly applicable to my question).
So for example, the following data:
> db.links.find()
{ "_id" : ObjectId("553c072bc87652a80e00002a"), "random" : 0.9162904409691691 }
{ "_id" : ObjectId("553c3332c87652c80700002a"), "random" : 0.00427396921440959 }
{ "_id" : ObjectId("553c3c5cc87652a80e00002b"), "random" : 0.2409569111187011 }
{ "_id" : ObjectId("553c3c66c876521c10000029"), "random" : 0.35101076657883823 }
{ "_id" : ObjectId("553c3c6ec87652200700002e"), "random" : 0.3234482416883111 }
{ "_id" : ObjectId("553c68d5c87652a80e00002c"), "random" : 0.5221220930106938 }
Any attempt to run db.mycollection.findOne({'random': {'$gte': x}}) where x is any value up to .91 always returns the first object (_id 553c072). Anything greater returns nothing. If I could sort by the random value in ascending order and then filter, it would keep searching until it found the correct value.
I would strongly recommend that you drop your custom solution and simply switch to the MongoDB built-in $sample stage, which will return a random result from your collection.
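For reference, that looks like this (available from MongoDB 3.2):
// pick one random document; $sample can only repeat documents when size > 1
db.links.aggregate([ { "$sample": { "size": 1 } } ])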
EDIT based on your comment:
Here's how you can do what you originally asked for:
db.links.find({ "random": { $gte: /* put your value here */ } })
.sort({ "random": 1 /* sort by "random" field in ascending order */ })
.limit(1)
You can also use the aggregation framework, though you don't need to:
db.links.aggregate({
    $match: {
        "random": {
            $gte: /* put your value here */  // filter the collection
        }
    }
}, {
    $sort: {
        "random": 1  // sort by "random" field in ascending order
    }
}, {
    $limit: 1  // return only the first element
})
Use Case:
I've got a mongodb collection with a couple million documents. Documents in this
collection must be updated sometimes. Therefore I've set up a monitorFrequency field which defines that a specific document must be updated every 6, 12, 24 or 720 hours. Additionally I've set up a field called lastRefreshAt, which is a timestamp of the last actual update.
The problem:
How can I select all documents from my profiles collection which need to be refreshed again (because more time than monitorFrequency allows has passed since lastRefreshAt)?
Should I run that on a single query which would only return those documents which need to be refreshed again or should I rather iterate on all documents with a cursor and check in my node application if the document needs to be refreshed or not?
I would know how to approach #2, but I am not sure which approach to choose or how the query for #1 would look.
There are a couple of approaches depending on available architecture and choices. Some are good choices and some are bad, but we might as well explain them all.
Use $where with multi-update
As a first option to examine, you could use $where to calculate the difference for selection and feed directly to .update() or .updateMany() for that matter:
db.profiles.update(
    {
        "$where": function() {
            return ( Date.now() - this.lastRefreshAt.valueOf() )
                > ( this.monitorFrequency * 1000 * 60 * 60 );
        }
    },
    { "$currentDate": { "lastRefreshAt": true } },
    { "multi": true }
)
Which pretty simply works out the milliseconds difference between the current "lastRefreshAt" value and the current Date value and compares that to the stored "monitorFrequency" converted into milliseconds itself.
The $currentDate is applied because it is a "multi" update applied to all matched documents, so this ensures the "server timestamp" at the actual time of document update is written to the document.
It's not fantastic as it does require a full collection scan in order to select the documents via calculation and thus cannot use an index. Plus it's JavaScript evaluation, which not being native code does add some overhead.
Loop the matched selection
So JavaScript is not that great a selection option in general when other options apply. Instead try using the aggregation framework for the calculation and loop the cursor result:
var ops = [];

db.profiles.aggregate([
    { "$redact": {
        "$cond": {
            "if": {
                "$gt": [
                    { "$subtract": [ new Date(), "$lastRefreshAt" ] },
                    { "$multiply": [ "$monitorFrequency", 1000 * 60 * 60 ] }
                ]
            },
            "then": "$$KEEP",
            "else": "$$PRUNE"
        }
    }}
]).forEach(doc => {
    ops.push({
        "updateOne": {
            "filter": { "_id": doc._id },
            "update": { "$currentDate": { "lastRefreshAt": true } }
        }
    });
    if ( ops.length > 1000 ) {
        db.profiles.bulkWrite(ops);
        ops = [];
    }
});

if ( ops.length > 0 ) {
    db.profiles.bulkWrite(ops);
    ops = [];
}
So again that's a collection scan due to the calculation but it is done with native operators, so that part at least should be a bit faster. Also from a technical standpoint it's a little different because the new Date() is actually established at the time of request and not per document iterated as it would be using $where. Lacking an operator to produce the "current date" internally, there is no way for the aggregation framework to do this per iteration.
And of course, instead of just applying our "update" expression as it matches documents, we are looping the result cursor and applying a function. So whilst there are "some" gains, there is also additional overhead. Mileage may vary as to performance and practicality.
Parallel Updates
Personally I would do neither of the above and simply run a query selecting each marked "monitorFrequency" and looking for the dates between the boundaries that exceed the allowed difference.
As a simple example using NodeJS to implement Promise.all() for parallel calls:
const MongoClient = require('mongodb').MongoClient;
const oneHour = 1000 * 60 * 60;

(async function() {
    let db;
    try {
        db = await MongoClient.connect('mongodb://localhost/test');
        let collection = db.collection('profiles');
        let intervals = [6, 12, 24, 720];
        let snapDate = new Date();

        await Promise.all(
            intervals.map( (monitorFrequency, i) =>
                collection.updateMany(
                    {
                        monitorFrequency,
                        "lastRefreshAt": Object.assign(
                            { "$lt": new Date(snapDate.valueOf() - intervals[i] * oneHour) },
                            // the largest interval has no lower bound
                            ( i < intervals.length - 1 )
                                ? { "$gt": new Date(snapDate.valueOf() - intervals[i+1] * oneHour) }
                                : {}
                        )
                    },
                    { "$currentDate": { "lastRefreshAt": true } }
                )
            )
        );
    } catch(e) {
        console.error(e);
    } finally {
        db.close();
    }
})();
This would allow you to index on the two fields and allow optimal selection, and since the "date ranges" are paired to their calculated difference from "monitorFrequency" then those documents that "require refresh" are the only ones that get selected for update.
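A compound index to support that selection might look like this, with the exact-match field first and the range field second:
// exact match on monitorFrequency, then the lastRefreshAt range
db.profiles.createIndex({ "monitorFrequency": 1, "lastRefreshAt": 1 })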
Given the finite number of possible intervals, this is what I would suspect to be the most optimal solution. But the construction, along with the fact that the actual "update" portion remains consistent for each selection, leads to one other option.
Use $or for each selection.
Much the same logic as above, but instead applied to build an $or condition for the "query" portion of a "single" update. It is an "array of criteria" after all, which is essentially the same as the "array of queries" we are issuing above. So just turn it around a little:
let oneHour = 1000 * 60 * 60,
    intervals = [6, 12, 24, 720],
    snapDate = new Date();

db.profiles.updateMany(
    {
        "$or": intervals.map( (monitorFrequency, i) =>
            ({
                monitorFrequency,
                "lastRefreshAt": Object.assign(
                    { "$lt": new Date(snapDate.valueOf() - intervals[i] * oneHour) },
                    // the largest interval has no lower bound
                    ( i < intervals.length - 1 )
                        ? { "$gt": new Date(snapDate.valueOf() - intervals[i+1] * oneHour) }
                        : {}
                )
            })
        )
    },
    { "$currentDate": { "lastRefreshAt": true } }
)
This then becomes one simple statement and of course can actually use indexes where available. Generally this is what you should be doing, though as I have suggested my intuition tells me that 4 threads of execution constrained only by the slowest one gets the job done slightly faster. Again, mileage may vary on that but logic dictates that this is so.
So the basic lesson here is "whilst you may think" that the logical approach is to calculate the values and compare within the database itself, it's actually the worst possible thing you can do for query performance.
The simple approach is to work out the criteria that should select the documents you want "before" you issue the query statement to the server. This means you are comparing against "concrete values" rather than "calculation results". And "concrete values" can actually be indexed, which is generally what you want for database queries.
Below is my sample JSON message, which has a Timestamp in the format YYYY-MM-DDThh:mmTZD (e.g. 2015-08-18T22:43:01-04:00).
Also I have a TTL index set up for 30 days, but my data is not getting removed. I know that MongoDB uses an ISODate("2015-09-03T14:21:30.177-04:00") kind of format, but is that absolutely necessary? What modification can I make to my index to get the TTL working?
We have millions of documents under multiple collections and we run out of space every now and then.
JSON:
{
    "_id" : ObjectId("55d3ed35817f4809e14e2"),
    "AuditEnvelope" : {
        "TrackingInformation" : {
            "CorelationId" : "2703-4ce2-af68-47832462",
            "Timestamp" : "2015-08-18T22:43:01-04:00",
            "LogData" : {
                "msgDetailJson" : "[Somedata here]"
            }
        }
    }
}
Index
"1" : {
"v" : 1,
"key" : {
"AuditEnvelope.TrackingInformation.Timestamp" : 1
},
"name" : "TTL",
"ns" : "MyDB.MyColl",
"expireAfterSeconds" : 2592000
},
MongoDB version : 3.0.1
In order for the TTL clean-up process to work with a defined TTL index, the specified field must contain a Date BSON type, as is covered in the documentation for TTL indexes.
If the indexed field in a document is not a date or an array that holds a date value(s), the document will not expire.
You will need to convert such strings to BSON dates. This is also a wise thing to do, as the internal storage of a BSON Date is a numeric timestamp value, which takes up a lot less storage than a string does.
The transformation requires an update to "cast" each string to a date object. As a "one off" operation, this is probably best done through the MongoDB shell, using Bulk Operations to minimize the network overhead when writing the data back.
var bulk = db.MyColl.initializeOrderedBulkOp(),
    count = 0;

db.MyColl.find({
    "AuditEnvelope.TrackingInformation.Timestamp": { "$type": 2 }
}).forEach(function(doc) {
    bulk.find({ "_id": doc._id }).updateOne({
        "$set": {
            "AuditEnvelope.TrackingInformation.Timestamp":
                new Date(doc.AuditEnvelope.TrackingInformation.Timestamp)
        }
    });
    count++;

    if ( count % 1000 == 0 ) {
        bulk.execute();
        bulk = db.MyColl.initializeOrderedBulkOp();
    }
});

if ( count % 1000 != 0 )
    bulk.execute();
Also note that the BSON $type value used there (2) matches "strings", so even if you began a conversion or changed some code to start producing BSON date objects in the field, the query only picks up the remaining "string" values for conversion.
Ideally you should drop the indexes already on the "Timestamp" fields and then re-create them after the update. This removes the overhead of writing to the index with the updated information. You can also set a foreground index build on the new index creation and this will also save some space in what the index itself consumes.
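As a sketch, assuming the index name "TTL" shown in the question:
// drop the existing TTL index, run the conversion above, then rebuild it
db.MyColl.dropIndex("TTL");

// ... run the bulk conversion shown above ...

db.MyColl.createIndex(
    { "AuditEnvelope.TrackingInformation.Timestamp": 1 },
    { "name": "TTL", "expireAfterSeconds": 2592000 }
);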
I have a mongo collection with the fields
visit_id, user_id, date, action 1, action 2
example:
1 u100 2012-01-01 phone-call -
2 u100 2012-01-02 - computer-check
Can I get, in MongoDB, the users that have made both a phone-call and a computer-check, no matter the time? (Basically it's an AND across different rows.)
I guess it is not possible without map/reduce work.
I see it can be done in the following way:
1. First you need to run a map/reduce that produces results like this:
{
    _id : "u100",
    value: {
        actions: [
            "phone-call",
            "computer-check",
            "etc..."
        ]
    }
}
2. Then you can query the above m/r result via $elemMatch (or $all), as shown below.
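For example, assuming the map/reduce output was written to a collection named user_actions (a hypothetical name), $all checks that both actions are present:
db.user_actions.find({
    "value.actions": { "$all": [ "phone-call", "computer-check" ] }
})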
You won't be able to do this with a single query. If this is something you're doing frequently in your application, I wouldn't recommend map/reduce; I'd recommend doing a query in MongoDB using the $or operator, and then processing the results on the client to get a unique set of user_id values.
For example:
db.users.find({$or:[{"action 1":"phone-call"}, {"action 2":"computer-check"}]})
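A sketch of that client-side pass in the shell, with field names taken from the question:
// collect per-user flags from the $or results, then keep users with both
var seen = {};
db.users.find({ "$or": [ { "action 1": "phone-call" }, { "action 2": "computer-check" } ] })
  .forEach(function(doc) {
      var flags = seen[doc.user_id] || ( seen[doc.user_id] = {} );
      if ( doc["action 1"] == "phone-call" ) flags.call = true;
      if ( doc["action 2"] == "computer-check" ) flags.check = true;
  });
var both = Object.keys(seen).filter(function(u) { return seen[u].call && seen[u].check; });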
In the future, you should save your data in a different format like the one suggested above by Andrew.
There is the MongoDB group method that can be used for your query, comparable to an SQL group by operator.
I haven't tested this, but your query could look something similar to:
var results = db.coll.group({
    key: { user_id: true },
    cond: { $or: [ { action1: "phone-call" }, { action2: "computer-check" } ] },
    initial: { actionFlags: 0 },
    reduce: function(obj, prev) {
        if ( obj.action1 == "phone-call" )     { prev.actionFlags |= 1; }
        if ( obj.action2 == "computer-check" ) { prev.actionFlags |= 2; }
    },
    finalize: function(doc) {
        if ( doc.actionFlags == 3 ) { return doc; }
        return null;
    }
});
Again, I haven't tested this; it's based on my reading of the documentation. You're grouping by the user_id (the key declaration). The rows you want to let through have either action1 == "phone-call" or action2 == "computer-check" (the cond declaration). The initial state when you start checking a particular user_id is 0 (the initial declaration). For each row you check whether action1 == "phone-call" and set its flag, then check whether action2 == "computer-check" and set its flag (the reduce function). Once the row types are marked, you check that both flags are set; if so, keep the object, otherwise eliminate it (the finalize function).
That last part is the only part I'm unsure of, since the documentation doesn't explicitly state that you can knock out records in the finalize function. It will probably take me more time to get some test data set up to see than it would for you to see if the example above works.
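For what it's worth, my reading of the documentation is that finalize only transforms each result rather than removing it, so the null returns above would likely need a client-side filter; this is an assumption worth testing:
// drop the entries that finalize nulled out
var usersWithBoth = results.filter(function(doc) { return doc !== null; });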