Slow date aggregation query in mongo - mongodb

I have a mongo collection with 7 million documents. Besides a couple of other fields, each document has a 'createdAt' Date field, there is a { createdAt: 1 } index on it, and the database is hosted on a dedicated mongo service.
When I try to group by day, the query gets really slow. Here is my aggregation query:
db.collection.aggregate([
{
"$match": {
"createdAt": {
$gte:new Date(1472189560111)
}
}
},
{
"$project": {
"date":
{
"$dateToString": {
"format": "%Y-%m-%d",
"date": "$createdAt"
}
},
"count": 1
}
},
{
"$group": {
"_id": "$date",
"count": {
"$sum": 1
}
}
},
{
"$sort": {
"_id": 1
}
},
{
"$project": {
"date": "$_id",
"count": 1,
"_id": 0
}
}
])
What's a good strategy to improve the performance? Is there a problem in my aggregation pipeline? Do I need a field that contains the day as a Date object with a fixed time like 00:00, and group on that? It seems like such a basic operation that I believe there must be a native MongoDB way of doing it.
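Something like this is what I have in mind for the pre-computed field (a rough sketch; the collection name and the 'day' field are illustrative):
// One-off backfill: store a truncated 'day' Date next to 'createdAt'
db.collection.find({ day: { $exists: false } }).forEach(function (doc) {
var d = doc.createdAt;
db.collection.update(
{ _id: doc._id },
{ $set: { day: new Date(d.getFullYear(), d.getMonth(), d.getDate()) } }
);
});
// With an index on { day: 1 }, the daily count becomes a plain $group
db.collection.aggregate([
{ "$match": { "day": { "$gte": new Date("2016-08-26") } } },
{ "$group": { "_id": "$day", "count": { "$sum": 1 } } },
{ "$sort": { "_id": 1 } }
])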

Related

Using Sum with Last mongodb

UseCase: I have the following data:
{"accountNumber":"1-1", "details":["version":{ "number": "1","accountGroup":"1", "editable":"false" , "amount":100 }]}
{"accountNumber":"1-2", "details":[version":{ "number": "2", "accountGroup":"1", "editable":"false" , "amount":200}]}
{"accountNumber":"2-1", "details":[version":{ "number": "1", "accountGroup":"2", "editable":"false", "amount":200 }]}
Where: each document is an account. Each record has an accountGroup (1, 2), and a group can have multiple versions. accountNumber is built from the combination of accountGroup and version.
I want to get the latest version of each account group (accountNumbers 1-2 and 2-1) along with the sum of their amounts.
Expected output:
{ accountNumber: "2-1" }, { accountNumber: "1-2" }, total: 400 (the sum of the amounts of the latest version of each account group)
I am using the following query:
db.getCollection('account').aggregate([
{ "$sort": { "accountNumber": 1 } },
{ "$unwind": "$details"},
{ "$group": {
"_id": "$details.version.accountGroup",
"Latestversion": { "$last": "$$ROOT" },
"total": {
$sum: "$details.version.amount"
}
}
}])
It gets the sum of all the versions that belong to a group, rather than just the latest one.
Current output:
{"accountNumber": "1-2", total: 300}, {"accountNumber":"2-1", total: 200}
I am new to MongoDB, so any help is appreciated. Looking forward to a response.
You will need two $group stages.
The first $group finds the latest document for each account group, and the second $group sums the amount across those latest documents.
Something like
db.getCollection('account').aggregate([
{ "$sort": { "accountNumber": 1 } },
{ "$unwind": "$details"},
{ "$group": {
"_id": "$details.version.accountGroup",
"latest": { "$last": "$$ROOT" }
}
},
{ "$group": {
"_id": null,
"accountNumbers": { $push:"$latest.accountNumber" },
"total": { $sum: "$latest.details.version.amount" }
}
}
])
Alternatively, you can flatten your structure as below and remove the $unwind stage:
{"accountNumber":"1-1", detail:{"number": "1","accountGroup":"1", "editable":"false" , "amount":100 }}

Correct query for group by user, per month

I have a MongoDB collection that stores documents in this format:
{
"name" : "Username",
"timeOfError" : ISODate("...")
}
I'm using this collection to keep track of who got an error and when it occurred.
What I want to do now is create a query that retrieves errors per user, per month or something similar. Something like this:
{
"result": [
{
"_id": "$name",
"errorsPerMonth": [
{
"month": "0",
"errorsThisMonth": 10
},
{
"month": "1",
"errorsThisMonth": 20
}
]
}
]
}
I have tried several different queries, but none have given the desired result. The closest result came from this query:
db.collection.aggregate(
[
{
$group:
{
_id: { $month: "$timeOfError"},
name: { $push: "$name" },
totalErrorsThisMonth: { $sum: 1 }
}
}
]
);
The problem here is that the $push just adds the username for each error. So I get an array with duplicate names.
You need to compound the _id value in $group:
db.collection.aggregate([
{ "$group": {
"_id": {
"name": "$name",
"month": { "$month": "$timeOfError" }
},
"totalErrors": { "$sum": 1 }
}}
])
The _id is essentially the "grouping key", so whatever elements you want to group by need to be a part of that.
If you want a different order then you can change the grouping key precedence:
db.collection.aggregate([
{ "$group": {
"_id": {
"month": { "$month": "$timeOfError" },
"name": "$name"
},
"totalErrors": { "$sum": 1 }
}}
])
Or if you have other conditions in your pipeline with different fields, just add a $sort pipeline stage at the end:
db.collection.aggregate([
{ "$group": {
"_id": {
"month": { "$month": "$timeOfError" },
"name": "$name"
},
"totalErrors": { "$sum": 1 }
}},
{ "$sort": { "_id.name": 1, "_id.month": 1 } }
])
Where you can essentially $sort on whatever you want.
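If you then want output shaped like the errorsPerMonth array from the question, one option (a sketch, not part of the original answer) is to add a second $group on the name:
db.collection.aggregate([
{ "$group": {
"_id": {
"name": "$name",
"month": { "$month": "$timeOfError" }
},
"totalErrors": { "$sum": 1 }
}},
{ "$sort": { "_id.name": 1, "_id.month": 1 } },
{ "$group": {
"_id": "$_id.name",
"errorsPerMonth": {
"$push": {
"month": "$_id.month",
"errorsThisMonth": "$totalErrors"
}
}
}}
])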

Mongodb mapreduce sorting (optimization) or alternative

I have a few documents that look like this:
{
'page_id': 123131,
'timestamp': ISODate('2014-06-10T12:13:59'),
'processed': false
}
The documents have other fields, but these are the only ones relevant for this purpose. The collection also has an index on these fields:
{
'page_id': 1,
'timestamp': -1
}
I run a mapreduce that returns distinct (page_id, day) results, with day being the date-portion of the timestamp (in the above, it would be 2014-06-10).
This is done with the following mapreduce:
function() {
emit({
site_id: this.page_id,
day: Date.UTC(this.timestamp.getUTCFullYear(),
this.timestamp.getUTCMonth(),
this.timestamp.getUTCDate())
}, {
count: 1
});
}
The reduce function basically just returns { count: 1 }, as I am not really interested in the count, just the unique tuples.
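For reference, that reduce function is essentially just this (a sketch of what is described above):
function (key, values) {
return { count: 1 };
}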
I wish to make this more efficient. I tried adding sort: { page_id: 1 }, but it triggers an error; googling shows that I can apparently only sort by the key, but since this is not a "raw" key, how does that work?
Also, is there an alternative to this mapreduce that is faster? I know mongodb has the distinct, but from what I can gather it only works on one field. Might the group aggregate function be relevant?
The aggregation framework would seem more appropriate, since it runs in native code whereas mapReduce runs under a JavaScript interpreter instance. MapReduce has its uses, but the aggregation framework is generally best suited to common tasks that do not require specific processing where only the JavaScript methods allow the needed control:
db.collection.aggregate([
{ "$group": {
"_id": {
"page": "$page_id",
"day": {
"year": { "$year": "$timestamp" },
"month": { "$month": "$timestamp" },
"day": { "$dayOfMonth": "$timestamp" },
}
},
"count": { "$sum": 1 }
}}
])
This largely makes use of the date aggregation operators. See other aggregation framework operators for more details.
Of course, if you wanted to reverse-sort those unique dates (the opposite of what mapReduce will do) or sort on other fields, just add a $sort to the end of the pipeline:
db.collection.aggregate([
{ "$group": {
"_id": {
"page": "$page_id",
"day": {
"year": { "$year": "$timestamp" },
"month": { "$month": "$timestamp" },
"day": { "$dayOfMonth": "$timestamp" },
}
},
"count": { "$sum": 1 }
}},
{ "$sort": {
"day.year": -1, "day.month": -1, "day.day": -1
}}
])
You might want to look at the aggregation framework. A query like this:
collection.aggregate([
{ $group: {
_id: {
year: { $year: "$timestamp" },
month: { $month: "$timestamp" },
day: { $dayOfMonth: "$timestamp" },
pageId: "$page_id"
}
}}
])
will give you all unique combinations of the fields you're looking for.

Server Side Looping

I've solved this problem, but I'm looking for a better way to do it on the mongodb server rather than on the client.
I have one collection of Orders with a placement datetime (iso date) and a product.
{ _id: 1, datetime: "T1", product: "Apple" }
{ _id: 2, datetime: "T2", product: "Orange" }
{ _id: 3, datetime: "T3", product: "Pear" }
{ _id: 4, datetime: "T4", product: "Pear" }
{ _id: 5, datetime: "T5", product: "Apple" }
Goal: For a given time (or set of times) show the last order for EACH product in the set of my products before that time. Products are finite and known.
e.g. a query for time T6 will return:
{ _id: 2, datetime: "T2", product: "Orange" }
{ _id: 4, datetime: "T4", product: "Pear" }
{ _id: 5, datetime: "T5", product: "Apple" }
T4 will return:
{ _id: 1, datetime: "T1", product: "Apple" }
{ _id: 2, datetime: "T2", product: "Orange" }
{ _id: 4, datetime: "T4", product: "Pear" }
I've implemented this by creating a compound index on orders: { datetime: -1, product: 1 }.
Then on the Java client (pseudocode):
findLastOrdersForTimes(times) {
for (time : times) {
for (product : products) {
db.orders.findOne({ product: product, datetime: { $lt: time } })
}
}
}
Now that is pretty fast, since it hits the index and only fetches the data I need. However, I need to query for many time points (100,000+), which means a lot of calls over the network, and my orders table will be very large.
So how can I do this on the server in one hit, i.e. return a collection of time -> array of products? If it were Oracle, I'd create a stored proc with a cursor that loops back in time, collects the results for every time point, and breaks when it reaches the last product after the last time point. I've looked at the aggregation framework and mapreduce but can't see how to achieve this kind of loop. Any pointers?
If you truly want the last order for each product, then the aggregation framework comes in:
db.times.aggregate([
{ "$match": {
"product": { "$in": products },
}},
{ "$group": {
"_id": "$product",
"datetime": { "$max": "$datetime" }
}}
])
With an array of products such as:
var products = ['Apple', 'Orange', 'Pear'];
this returns:
{ "_id" : "Pear", "datetime" : "T4" }
{ "_id" : "Orange", "datetime" : "T2" }
{ "_id" : "Apple", "datetime" : "T5" }
Or if the _id from the original document is important to you, use the $sort with $last instead:
db.times.aggregate([
{ "$match": {
"product": { "$in": products },
}},
{ "$sort": { "datetime": 1 } },
{ "$group": {
"_id": "$product",
"id": { "$last": "$_id" },
"datetime": { "$last": "$datetime" }
}}
])
Either of those last forms is most likely what you really want. The index you really want there is on "product":
db.times.ensureIndex({ "product": 1 })
So even if you need to iterate this with an additional $match condition using $lt for a certain time point, that is still better; alternatively, you can modify the grouping to include the "datetime" while keeping the filter set in the $match.
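For a single time point, that might look like this (a sketch; someTimePoint is a placeholder for your own value):
var someTimePoint = "T6"; // placeholder, using the question's abstracted times
db.times.aggregate([
{ "$match": {
"product": { "$in": products },
"datetime": { "$lt": someTimePoint }
}},
{ "$sort": { "datetime": 1 } },
{ "$group": {
"_id": "$product",
"id": { "$last": "$_id" },
"datetime": { "$last": "$datetime" }
}}
])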
It seems better at any rate, so perhaps this helps at least to modify your thinking.
If I'm reading your notes correctly, you seem to be looking to turn this on its head and find the last product for each point in time. The statement is not much different:
db.times.aggregate([
{ "$match": {
"datetime": { "$in": ["T4","T5"] },
}},
{ "$sort": { "product": 1, "datetime": 1 } },
{ "$group": {
"_id": "$datetime",
"id": { "$last": "$_id" },
"product": { "$last": "$product" }
}}
])
In theory at least, that works based on how you present the question. I have the feeling, though, that you are abstracting this, and "datetime" is possibly an actual timestamp as a Date object type.
So you might not be aware of the date aggregation operators you can apply, for example to get the boundary of each hour:
db.times.aggregate([
{ "$group": {
"_id": {
"year": { "$year": "$datetime" },
"dayOfYear": { "$dayOfYear": "$datetime" },
"hour": { "$hour": "$datetime" }
},
"id": { "$last": "$_id" },
"datetime": { "$last": "$datetime" },
"product": { "$last": "$product" }
}}
])
Or even using date math instead of the operators if it is an epoch-based timestamp:
db.times.aggregate([
{ "$group": {
"_id": {
"$subtract": [
{ "$subtract": [ "$datetime", new Date("1970-01-01") ] },
{ "$mod": [
{ "$subtract": [ "$datetime", new Date("1970-01-01") ] },
1000*60*60
]}
]
},
"id": { "$last": "$_id" },
"datetime": { "$last": "$datetime" },
"product": { "$last": "$product" }
}}
])
Of course you can add a range query for dates in the $match with $gt and $lt operators to keep the data within the range you are particularly looking at.
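For example (a sketch; the boundary dates are illustrative):
db.times.aggregate([
{ "$match": {
"datetime": {
"$gt": new Date("2014-06-01"),
"$lt": new Date("2014-07-01")
}
}},
{ "$group": {
"_id": "$product",
"datetime": { "$max": "$datetime" }
}}
])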
Your overall solution is probably a combination of ideas, but as I said, your question seems to be about matching the last entries on certain time boundaries, so the last examples, possibly in combination with filtering certain products, are what you need rather than looping .findOne() requests.

Group distinct count() on a single field in a single request

Is there a way to get several different counts on a single field in a single request?
Here is the schema for a User document:
UserSchema: {
name: { type: String, required: true },
created_at: { type: Date, default: Date.now }
}
I would like to count every User created on 01/05/2013 and on 06/08/2013, and I may need to count more dates later.
Can I get these counts in a single count(), or should I fetch all the Users with a find() and then count them in JavaScript?
You can use the collection.count() form which accepts a query, along with the use of $or and ranges:
db.collection.count({ "$or": [
{ "created_at":
{"$gte": new Date("2014-05-01"), "$lt": new Date("2014-05-02") }
},
{ "created_at":
{"$gte": new Date("2013-08-06"), "$lt": new Date("2013-08-07") }
}
]})
Or you can pass that query to .find() and use the cursor count from there if that suits your taste.
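For instance, the same query counted from the cursor (a sketch):
db.collection.find({ "$or": [
{ "created_at": { "$gte": new Date("2014-05-01"), "$lt": new Date("2014-05-02") } },
{ "created_at": { "$gte": new Date("2013-08-06"), "$lt": new Date("2013-08-07") } }
]}).count()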
But then, reading your title again, a distinct count is different, and it is best to use aggregate to count per distinct day:
db.collection.aggregate([
// Match the dates you want to filter
{ "$match": {
{ "$or": [
{ "created_at": {
"$gte": new Date("2014-05-01"),
"$lt": new Date("2014-05-02")
}},
{ "created_at": {
"$gte": new Date("2013-08-06"),
"$lt": new Date("2013-08-07")
}}
]
}},
// Group on the *whole* day and sum the count
{ "$group": {
"_id": {
"year": { "$year": "$created_at" },
"month": { "$month": "$created_at" },
"day": { "$dayOfMonth": "$created_at" }
},
"count": { "$sum": 1 }
}}
])
And that would give you a distinct count of the documents for each selected day you had added in your $or clause.
No need for looping in code.