MongoDB mapReduce sorting (optimization) or alternative

I have a few documents that look like this:
{
'page_id': 123131,
'timestamp': ISODate('2014-06-10T12:13:59'),
'processed': false
}
The documents have other fields, but these are the only ones relevant for this purpose. The collection also has the following index:
{
'page_id': 1,
'timestamp': -1
}
I run a mapreduce that returns distinct (page_id, day) results, with day being the date-portion of the timestamp (in the above, it would be 2014-06-10).
This is done with the following mapreduce:
function() {
emit({
site_id: this.page_id,
day: Date.UTC(this.timestamp.getUTCFullYear(),
this.timestamp.getUTCMonth(),
this.timestamp.getUTCDate())
}, {
count: 1
});
}
The reduce-function basically just returns { count: 1 } as I am not really interested in the number, just unique tuples.
I wish to make this more efficient. I tried adding sort: { 'page_id' }, but it triggers an error - googling shows that I can apparently only sort by the key, but since this is not a "raw" key how does that work?
Also, is there a faster alternative to this mapReduce? I know MongoDB has distinct, but from what I can gather it only works on one field. Might the group aggregate function be relevant?

The aggregation framework would seem more appropriate, since it runs in native code whereas mapReduce runs in a JavaScript interpreter instance. MapReduce has its uses, but the aggregation framework is generally better suited to common tasks that do not need the specific control that only JavaScript methods allow:
db.collection.aggregate([
{ "$group": {
"_id": {
"page": "$page_id",
"day": {
"year": { "$year": "$timestamp" },
"month": { "$month": "$timestamp" },
"day": { "$dayOfMonth": "$timestamp" },
}
},
"count": { "$sum": 1 }
}}
])
This largely makes use of the date aggregation operators. See other aggregation framework operators for more details.
Of course, if you want to reverse sort those unique dates (the opposite of what mapReduce does), or sort on other fields, just add a $sort stage to the end of the pipeline:
db.collection.aggregate([
{ "$group": {
"_id": {
"page": "$page_id",
"day": {
"year": { "$year": "$timestamp" },
"month": { "$month": "$timestamp" },
"day": { "$dayOfMonth": "$timestamp" },
}
},
"count": { "$sum": 1 }
}},
{ "$sort": {
"_id.day.year": -1, "_id.day.month": -1, "_id.day.day": -1
}}
])
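To make the grouping key concrete, here is a rough plain JavaScript sketch of what the $group stage above computes. It does not talk to MongoDB at all: `distinctPageDays` and the sample documents are hypothetical, made up for illustration, and the sketch assumes the date operators work on UTC parts (the default when no time zone is given).

```javascript
// Plain JavaScript sketch (no MongoDB needed) of the distinct (page, day) grouping.
// `distinctPageDays` and `sample` are hypothetical names used only for illustration.
function distinctPageDays(docs) {
  const groups = new Map();
  for (const doc of docs) {
    const t = doc.timestamp;
    // The same parts that $year / $month / $dayOfMonth extract, taken in UTC.
    const day = {
      year: t.getUTCFullYear(),
      month: t.getUTCMonth() + 1, // $month is 1-12, getUTCMonth() is 0-11
      day: t.getUTCDate()
    };
    const key = [doc.page_id, day.year, day.month, day.day].join("|");
    const group = groups.get(key) || { _id: { page: doc.page_id, day: day }, count: 0 };
    group.count += 1; // mirrors "count": { "$sum": 1 }
    groups.set(key, group);
  }
  return Array.from(groups.values());
}

const sample = [
  { page_id: 123131, timestamp: new Date("2014-06-10T12:13:59Z"), processed: false },
  { page_id: 123131, timestamp: new Date("2014-06-10T18:00:00Z"), processed: false },
  { page_id: 123131, timestamp: new Date("2014-06-11T01:00:00Z"), processed: true }
];
const result = distinctPageDays(sample);
console.log(JSON.stringify(result));
```

The three sample documents collapse into two groups, one per distinct (page, day) pair, which is the same shape the pipeline returns under `_id`.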

You might want to look at the aggregation framework.
A query like this:
collection.aggregate([
{$group:
{
_id: {
year: { $year: [ "$timestamp" ] },
month: { $month: [ "$timestamp" ] },
day: { $dayOfMonth: [ "$timestamp" ] },
pageId: "$page_id"
}
}
}
])
will give you all unique combinations of the fields you're looking for.

Related

Aggregate with a Composite Key

I am trying to aggregate some data and group it by Time Intervals as well as maintaining a sub-category, if you will. I want to be able to chart this data out so that I will have multiple different Lines corresponding to each Office that was called. The X axis will be the Time Intervals and the Y axis would be the Average Ring Time.
My data looks like this:
Calls: [{
created: ISODate(xyxyx),
officeCalled: 'ABC Office',
answeredAt: ISODate(xyxyx)
},
{
created: ISODate(xyxyx),
officeCalled: 'Office 2',
answeredAt: ISODate(xyxyx)
},
{
created: ISODate(xyxyx),
officeCalled: 'Office 3',
answeredAt: ISODate(xyxyx)
}];
My goal is to get my calls grouped by Time Intervals (30 Minutes/1 Hour/1 Day) AND by the Office Called. So when my aggregate completes, I'm looking for data like this:
[{"_id":TimeInterval1,"calls":[{"office":"ABC Office","ringTime":30720},
{"office":"Office2","ringTime":3070}]},
{"_id":TimeInterval2,"calls":[{"office":"Office1","ringTime":1125},
{"office":"ABC Office","ringTime":15856}]}]
I have been poking around for the past few hours and I was able to aggregate my data, but I haven't figured out how to group it properly so that I have each time interval along with the office data. Here is my latest code:
Call.aggregate([
{$match: {
$and: [
{created: {$exists: 1}},
{answeredAt: {$exists: 1}}]}},
{$project: { created: 1,
officeCalled: 1,
answeredAt: 1,
timeToAns: {$subtract: ["$answeredAt", "$created"]}}},
{$group: {_id: {"day": {"$dayOfYear": "$created"},
"hour": {
"$subtract": [
{"$hour" : "$created"},
{"$mod": [ {"$hour": "$created"}, 2]}
]
},
"officeCalled": "$officeCalled"
},
avgRingTime: {$avg: '$timeToAns'},
total: {$sum: 1}}},
{"$group": {
"_id": "$_id.day",
"calls": {
"$push": {
"office": "$_id.officeCalled",
"ringTime": "$avgRingTime"
},
}
}},
{$sort: {_id: 1}}
]).exec(function(err, results) {
// My results look like this:
// [{"_id":118,"calls":[{"office":"ABC Office","ringTime":30720},
// {"office":"Office 2","ringTime":31384.5},
// {"office":"Office 3","ringTime":7686.066666666667},...]
});
This just doesn't quite get it...I get my data but it's broken down by Day only. Not my 2 hour time interval that I was shooting for. Let me know if I'm doing this all wrong, please --- I am VERY NEW to aggregation so your help is very much appreciated.
Thank you!!
All you really need to do is include both parts of the _id value you want in the final $group. Referencing only "$_id.day" there is what collapsed the results to one entry per day.
Also, lose the $project: it is just wasted cycles and processing when you can compute the value directly in the first $group:
Call.aggregate(
[
{ "$match": {
"created": { "$exists": 1 },
"answeredAt": { "$exists": 1 }
}},
{ "$group": {
"_id": {
"day": {"$dayOfYear": "$created"},
"hour": {
"$subtract": [
{"$hour" : "$created"},
{"$mod": [ {"$hour": "$created"}, 2]}
]
},
"officeCalled": "$officeCalled"
},
"avgRingTime": {
"$avg": { "$subtract": [ "$answeredAt", "$created" ] }
},
"total": { "$sum": 1 }
}},
{ "$group": {
"_id": {
"day": "$_id.day",
"hour": "$_id.hour"
},
"calls": {
"$push": {
"office": "$_id.officeCalled",
"ringTime": "$avgRingTime"
},
},
"total": { "$sum": "$total" }
}},
{ "$sort": { "_id": 1 } }
]
).exec(function(err, results) {
});
Also note the complete omission of $and. This is not needed as all MongoDB query arguments are already "AND" conditions anyway, unless specifically stated otherwise. Just stick to what is simple. It's meant to be simple.
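The 2-hour bucket in the first $group is just the hour minus its remainder. As a small sketch, the same arithmetic in plain JavaScript looks like this (`hourBucket` is a hypothetical helper name, not part of MongoDB or Mongoose):

```javascript
// Hypothetical helper: floor an hour of the day (0-23) to the start of its bucket.
// Same arithmetic as { "$subtract": [ hour, { "$mod": [ hour, size ] } ] }.
function hourBucket(hour, size) {
  return hour - (hour % size);
}

console.log(hourBucket(13, 2)); // 12
console.log(hourBucket(23, 2)); // 22
console.log(hourBucket(7, 1));  // 7
```

Changing the divisor from 2 to 1 in the pipeline would give hourly buckets by the same logic. An hour-based key like this cannot express the 30-minute interval from the question; that would presumably need the minutes as well, or date math on the full timestamp.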

How can I improve performance on a MongoDB aggregation query?

I am using the following query to get the count of records per day where the air temperature is below 7.2 degrees. The documentation recommends using the aggregation framework since it is faster than map-reduce:
db.maxial.aggregate([{
$project: {
time:1,
temp:1,
frio: {
$cond: [
{ $lte: [ "$temp", 7.2 ] },
0.25,
0
]
}
}
}, {
$match: {
time: {
$gte: new Date('11/01/2011'),
$lt: new Date('11/03/2011')
}
}
}, {
$group: {
_id: {
ord_date: {
day: { $dayOfMonth: "$time" },
month: { $month: "$time" },
year: { $year: "$time" }
}
},
horasFrio: { $sum: '$frio' }
}
}, {
$sort: {
'_id.ord_date': 1
}
}])
I get an average execution time of 2 secs. Am I doing something wrong? I am already using indexes on time and temp field.
You might have indexes defined, but you are not using them. For an aggregation pipeline to use an index, the $match stage must come first. Also, if you omit the $project entirely and compute the value inside $group, you are doing it in the most efficient way.
db.maxial.aggregate( [
{ "$match": {
"time": {
"$gte": new Date('2011-11-01'),
"$lt": new Date('2011-11-03')
}
}},
{ "$group": {
"_id": {
"day": { "$dayOfMonth": "$time" },
"month": { "$month": "$time" },
"year": { "$year": "$time" }
},
"horasFrio": {
"$sum": {
"$cond": [{ "$lte": [ "$temp", 7.2 ] }, 0.25, 0 ]
}
}
}},
{ "$sort": { "_id": 1} }
])
$project does not provide the benefits people think it does in terms of "reducing fields" in a direct way.
And beware of JavaScript "Date" object constructors. Unless you construct the date in the right way, you get a locally converted date rather than the UTC time reference you should be issuing. That and other misconceptions are cleared up in the rewritten listing.
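The Date pitfall can be checked without a database. This is a minimal sketch: ECMAScript defines ISO date-only strings as UTC, while the parsing of other formats such as '11/01/2011' is left to the implementation (most engines treat it as local time, but that is an assumption about your runtime).

```javascript
// ISO 8601 date-only strings are parsed as UTC midnight.
const isoParsed = new Date("2011-11-01");
// Date.UTC is explicit about the zone; months are zero-based (10 = November).
const explicitUtc = new Date(Date.UTC(2011, 10, 1));

console.log(isoParsed.toISOString());   // 2011-11-01T00:00:00.000Z
console.log(explicitUtc.toISOString()); // 2011-11-01T00:00:00.000Z

// Non-ISO strings are parsed in an implementation-defined way, typically as
// local time, so the resulting instant depends on the machine's time zone.
const localParsed = new Date("11/01/2011");
const offsetMinutes = (localParsed.getTime() - isoParsed.getTime()) / 60000;
console.log("difference from UTC midnight in minutes:", offsetMinutes);
```

On a machine whose time zone is not UTC, the last line usually prints a non-zero difference, which is exactly the shift that would move the $match boundaries.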
To improve the performance of an aggregation query, use the appropriate pipeline stages in the right order.
Put $match first, followed by $limit and $skip if needed. These all reduce the number of records that must be traversed for grouping and hence improve performance.

Correct query for group by user, per month

I have a MongoDB collection that stores documents in this format:
"name" : "Username",
"timeOfError" : ISODate("...")
I'm using this collection to keep track of who got an error and when it occurred.
What I want to do now is create a query that retrieves errors per user, per month or something similar. Something like this:
{
"result": [
{
"_id": "$name",
"errorsPerMonth": [
{
"month": "0",
"errorsThisMonth": 10
},
{
"month": "1",
"errorsThisMonth": 20
}
]
}
]
}
I have tried several different queries, but none have given the desired result. The closest result came from this query:
db.collection.aggregate(
[
{
$group:
{
_id: { $month: "$timeOfError"},
name: { $push: "$name" },
totalErrorsThisMonth: { $sum: 1 }
}
}
]
);
The problem here is that the $push just adds the username for each error. So I get an array with duplicate names.
You need to compound the _id value in $group:
db.collection.aggregate([
{ "$group": {
"_id": {
"name": "$name",
"month": { "$month": "$timeOfError" }
},
"totalErrors": { "$sum": 1 }
}}
])
The _id is essentially the "grouping key", so whatever elements you want to group by need to be a part of that.
If you want a different order then you can change the grouping key precedence:
db.collection.aggregate([
{ "$group": {
"_id": {
"month": { "$month": "$timeOfError" },
"name": "$name"
},
"totalErrors": { "$sum": 1 }
}}
])
Or, if you have other conditions in your pipeline involving different fields, just add a $sort pipeline stage at the end:
db.collection.aggregate([
{ "$group": {
"_id": {
"month": { "$month": "$timeOfError" },
"name": "$name"
},
"totalErrors": { "$sum": 1 }
}},
{ "$sort": { "_id.name": 1, "_id.month": 1 } }
])
Where you can essentially $sort on whatever you want.
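If the nested shape from the question is wanted (one document per user with an errorsPerMonth array), one possible approach is a second $group that pushes the per-month totals. This is only a sketch: the field names `errorsPerMonth` and `errorsThisMonth` come from the desired output in the question, and `errorsPipeline` is a hypothetical variable name. Note that $month yields 1 to 12 as numbers, not the zero-based strings shown in the question.

```javascript
// Sketch: group per (name, month) first, then regroup per name and push the totals.
const errorsPipeline = [
  { "$group": {
    "_id": { "name": "$name", "month": { "$month": "$timeOfError" } },
    "totalErrors": { "$sum": 1 }
  }},
  // Sorting first is intended to keep the months in order inside each array.
  { "$sort": { "_id.name": 1, "_id.month": 1 } },
  { "$group": {
    "_id": "$_id.name",
    "errorsPerMonth": {
      "$push": { "month": "$_id.month", "errorsThisMonth": "$totalErrors" }
    }
  }}
];

// In the mongo shell this would be run as: db.collection.aggregate(errorsPipeline)
console.log(JSON.stringify(errorsPipeline, null, 2));
```

The order of the resulting user documents is not guaranteed after the last $group, so a final $sort on "_id" can be appended if that matters.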

Server Side Looping

I've solved this problem, but I'm looking for a better way to do it on the MongoDB server rather than on the client.
I have one collection of Orders with a placement datetime (iso date) and a product.
{ _id: 1, datetime: "T1", product: "Apple" }
{ _id: 2, datetime: "T2", product: "Orange" }
{ _id: 3, datetime: "T3", product: "Pear" }
{ _id: 4, datetime: "T4", product: "Pear" }
{ _id: 5, datetime: "T5", product: "Apple" }
Goal: For a given time (or set of times) show the last order for EACH product in the set of my products before that time. Products are finite and known.
eg. query for time T6 will return:
{ _id: 2, datetime: "T2", product: "Orange" }
{ _id: 4, datetime: "T4", product: "Pear" }
{ _id: 5, datetime: "T5", product: "Apple" }
T4 will return:
{ _id: 1, datetime: "T1", product: "Apple" }
{ _id: 2, datetime: "T2", product: "Orange" }
{ _id: 4, datetime: "T4", product: "Pear" }
I've implemented this by creating a composite index on orders [datetime: descending, product: ascending].
Then on the Java client:
findLastOrdersForTimes(times) {
for (time : times) {
for (product : products) {
db.orders.findOne({ product: product, datetime: { $lt: time } })
}
}
}
Now that is pretty fast, since it hits the index and fetches only the data I need. However, I need to query for many time points (100000+), which means a lot of calls over the network. My orders table will also be very large. So how can I do this on the server in one hit, i.e. return a collection of time -> array of products? If it were Oracle, I'd create a stored procedure with a cursor that loops back in time, collects the results for every time point, and breaks when it gets to the last product after the last time point. I've looked at the aggregation framework and mapReduce but can't see how to achieve this kind of loop. Any pointers?
If you truly want the last order for each product, then the aggregation framework comes in:
db.times.aggregate([
{ "$match": {
"product": { "$in": products },
}},
{ "$group": {
"_id": "$product",
"datetime": { "$max": "$datetime" }
}}
])
Example with an array of products:
var products = ['Apple', 'Orange', 'Pear'];
{ "_id" : "Pear", "datetime" : "T4" }
{ "_id" : "Orange", "datetime" : "T2" }
{ "_id" : "Apple", "datetime" : "T5" }
Or if the _id from the original document is important to you, use the $sort with $last instead:
db.times.aggregate([
{ "$match": {
"product": { "$in": products },
}},
{ "$sort": { "datetime": 1 } },
{ "$group": {
"_id": "$product",
"id": { "$last": "$_id" },
"datetime": { "$last": "$datetime" }
}}
])
And that is most likely what you really want to do in either of those last cases. But the index you really want there is on "product":
db.times.createIndex({ "product": 1 })
So even if you need to repeat that with an additional $match condition using $lt for a certain time point, that is still better. Otherwise, you can modify the grouping to include "datetime" as well, while keeping a set of values in the $match.
It seems better at any rate, so perhaps this helps at least to modify your thinking.
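As a sketch of the "additional $match condition" idea, the $sort / $last pipeline above can be wrapped in a small function that takes the time point. `lastOrdersBefore` and `lastOrdersPipeline` are hypothetical names, and the sample date is made up; the sketch assumes "datetime" holds real Date values.

```javascript
// Hypothetical helper: build a pipeline that returns the last order per product
// strictly before `time`. Use "$lte" instead of "$lt" to include orders placed
// exactly at `time`.
function lastOrdersBefore(time, products) {
  return [
    { "$match": {
      "product": { "$in": products },
      "datetime": { "$lt": time }
    }},
    { "$sort": { "datetime": 1 } },
    { "$group": {
      "_id": "$product",
      "id": { "$last": "$_id" },
      "datetime": { "$last": "$datetime" }
    }}
  ];
}

const lastOrdersPipeline = lastOrdersBefore(
  new Date("2014-06-10T00:00:00Z"),
  ["Apple", "Orange", "Pear"]
);
// In the mongo shell this would be run as: db.times.aggregate(lastOrdersPipeline)
console.log(JSON.stringify(lastOrdersPipeline, null, 2));
```

This still means one aggregate call per time point, but each call covers all products at once instead of one findOne() per product.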
If I'm reading your notes correctly, you seem to simply be looking to turn this on its head and find the last product for each point in time. So the statement is not much different:
db.times.aggregate([
{ "$match": {
"datetime": { "$in": ["T4","T5"] },
}},
{ "$sort": { "product": 1, "datetime": 1 } },
{ "$group": {
"_id": "$datetime",
"id": { "$last": "$_id" },
"product": { "$last": "$product" }
}}
])
In theory, that is how it works based on how you present the question. I have the feeling, though, that you are abstracting here and that "datetime" is actually a real timestamp stored as a date object type.
So you might not be aware of the date aggregation operators you can apply, for example to get the boundary of each hour:
db.times.aggregate([
{ "$group": {
"_id": {
"year": { "$year": "$datetime" },
"dayOfYear": { "$dayOfYear": "$datetime" },
"hour": { "$hour": "$datetime" }
},
"id": { "$last": "$_id" },
"datetime": { "$last": "$datetime" },
"product": { "$last": "$product" }
}}
])
Or even use date math instead of the operators, if you prefer an epoch-based timestamp value:
db.times.aggregate([
{ "$group": {
"_id": {
"$subtract": [
{ "$subtract": [ "$datetime", new Date("1970-01-01") ] },
{ "$mod": [
{ "$subtract": [ "$datetime", new Date("1970-01-01") ] },
1000*60*60
]}
]
},
"id": { "$last": "$_id" },
"datetime": { "$last": "$datetime" },
"product": { "$last": "$product" }
}}
])
Of course you can add a range query for dates in the $match with $gt and $lt operators to keep the data within the range you are particularly looking at.
Your overall solution is probably a combination of ideas, but as I said, your question seems to be about matching the last entries on certain time boundaries, so the last examples, possibly in combination with filtering certain products, are what you need rather than looping .findOne() requests.
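The epoch-based date math in the last pipeline above (subtract the epoch, then subtract the remainder of the interval) can be tried in plain JavaScript. This is only a sketch with a hypothetical helper name, `floorToInterval`, and a made-up sample timestamp:

```javascript
// Hypothetical helper mirroring the $subtract / $mod date math:
// floor a Date to the start of its interval.
function floorToInterval(date, intervalMs) {
  const ms = date.getTime(); // milliseconds since 1970-01-01T00:00:00Z
  return new Date(ms - (ms % intervalMs));
}

const HOUR = 1000 * 60 * 60;
const t = new Date("2014-06-10T12:13:59Z");

console.log(floorToInterval(t, HOUR).toISOString());      // 2014-06-10T12:00:00.000Z
console.log(floorToInterval(t, HOUR / 2).toISOString());  // 2014-06-10T12:00:00.000Z
console.log(floorToInterval(t, 24 * HOUR).toISOString()); // 2014-06-10T00:00:00.000Z
```

In the pipeline the grouping key comes out as a number of milliseconds rather than a Date, so the client would wrap it in new Date(...) if a date object is needed. Swapping 1000*60*60 for another interval changes the boundary in the same way.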

Group distinct count() on a single field in a single request

Is there a way to get different counts on a single field from a single document?
Here is the schema for a User document:
UserSchema: {
name: { type: String, required: true },
created_at: { type: Date, default: now }
}
I would like to get every User created on 01/05/2013 and on 06/08/2013; I may need to count more dates later.
Can I get this data with a single count(), or should I fetch all the Users with find() and then count them in JavaScript?
You can use the collection.count() form which accepts a query, along with the use of $or and ranges:
db.collection.count({ "$or": [
{ "created_at":
{"$gte": new Date("2013-05-01"), "$lt": new Date("2013-05-02") }
},
{ "created_at":
{"$gte": new Date("2013-08-06"), "$lt": new Date("2013-08-07") }
}
]})
Or you can pass that query to .find() and use the cursor count from there if that suits your taste.
But then I read your title again: a distinct count would be different, and there it is best to use aggregate to get the distinct days:
db.collection.aggregate([
// Match the dates you want to filter
{ "$match": {
"$or": [
{ "created_at": {
"$gte": new Date("2013-05-01"),
"$lt": new Date("2013-05-02")
}},
{ "created_at": {
"$gte": new Date("2013-08-06"),
"$lt": new Date("2013-08-07")
}}
]
}},
// Group on the *whole* day and sum the count
{ "$group": {
"_id": {
"year": { "$year": "$created_at" },
"month": { "$month": "$created_at" },
"day": { "$dayOfMonth": "$created_at" }
},
"count": { "$sum": 1 }
}}
])
And that would give you a count of the documents for each distinct day you selected in your $or clause.
No need for looping in code.
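Since the question mentions that more dates may be needed later, the $or clauses could be generated from a list of days rather than written by hand. A minimal sketch, assuming the days are given as "YYYY-MM-DD" strings in UTC; `dayClauses` and `dayQuery` are hypothetical names, and the two sample days are the ones from the question (1 May 2013 and 6 August 2013):

```javascript
// Hypothetical helper: turn "YYYY-MM-DD" strings into one range clause per day.
function dayClauses(field, days) {
  const DAY_MS = 24 * 60 * 60 * 1000;
  return days.map(function (day) {
    const start = new Date(day); // ISO date-only strings parse as UTC midnight
    const end = new Date(start.getTime() + DAY_MS);
    const clause = {};
    clause[field] = { "$gte": start, "$lt": end };
    return clause;
  });
}

const dayQuery = { "$or": dayClauses("created_at", ["2013-05-01", "2013-08-06"]) };
// Usable as db.collection.count(dayQuery) or as the body of a $match stage.
console.log(JSON.stringify(dayQuery));
```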