Aggregating metrics data in MongoDB

I'm trying to pull report data out of a realtime metrics system inspired by the NYC MUG/SimpleReach schema, and maybe my mind is still stuck in SQL mode.
The data is stored in a document like so...
{
    "_id": ObjectId("5209683b915288435894cb8b"),
    "account_id": 922,
    "project_id": 22492,
    "stats": {
        "2009": {
            "04": {
                "17": {
                    "10": {
                        "sum": {
                            "impressions": 11
                        }
                    },
                    "11": {
                        "sum": {
                            "impressions": 603
                        }
                    }
                }
            }
        }
    }
}
and I've been trying different variations of the aggregation pipeline with no success.
db.metrics.aggregate({
    $match: {
        'project_id': 22492
    }
}, {
    $group: {
        _id: "$project_id",
        'impressions': {
            // This works, but doesn't sum up the data...
            $sum: '$stats.2009.04.17.10.sum.impressions'
            /* none of these work.
            $sum: ['$stats.2009.04.17.10.sum.impressions',
                   '$stats.2009.04.17.11.sum.impressions']
            $sum: {'$stats.2009.04.17.10.sum.impressions',
                   '$stats.2009.04.17.11.sum.impressions'}
            $sum: '$stats.2009.04.17.10.sum.impressions',
                  '$stats.2009.04.17.11.sum.impressions'
            */
        }
    }
})
Any help would be appreciated.
(P.S. Does anyone have any ideas on how to do date range searches using this document schema?)

$group is designed to be applied to many documents, but here we only have one matched document.
Instead, $project could be used to sum up specific fields, like this:
db.metrics.aggregate([
    { $match: {
        'project_id': 22492
    }},
    { $project: {
        'impressions': {
            $add: [
                '$stats.2009.04.17.10.sum.impressions',
                '$stats.2009.04.17.11.sum.impressions'
            ]
        }
    }}
])
I don't think there is an elegant way to do date range searches with this schema, because MongoDB operators/predicates are designed to be applied to values, rather than to keys in a document. If I understand correctly, the most interesting point in the slides you mentioned is to cache/pre-aggregate the metrics when updating. That's a good idea, but it could be implemented with another schema. For example, storing the date and time as values, which MongoDB can index, might be a better choice for range searches. The aggregation framework even supports date operators, giving you more flexibility.
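For illustration, here is a minimal sketch of what such an alternative schema and a range query could look like. The metrics_ts collection name, the one-document-per-hour bucketing, and the field layout are assumptions for the example, not part of the original schema:

db.metrics_ts.insertOne({
    account_id: 922,
    project_id: 22492,
    ts: ISODate("2009-04-17T10:00:00Z"),  // assumed: one document per hour bucket
    impressions: 11
})

// A compound index makes the range scan cheap
db.metrics_ts.createIndex({ project_id: 1, ts: 1 })

// A date range search is now a plain range predicate
db.metrics_ts.aggregate([
    { $match: {
        project_id: 22492,
        ts: {
            $gte: ISODate("2009-04-17T00:00:00Z"),
            $lt: ISODate("2009-04-18T00:00:00Z")
        }
    }},
    { $group: { _id: "$project_id", impressions: { $sum: "$impressions" } } }
])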

Related

Getting aggregate count for nested fields in mongo

I have a mongo collection mimicking this Java class. A student can be taught a number of subjects across campuses.
class Students {
    String studentName;
    Map<String, List<String>> subjectsByCampus;
}
So a structure will look like this:
{
    _id: ObjectId("someId"),
    studentName: 'student1',
    subjectByCampusName: {
        campus1: ['subject1', 'subject2'],
        campus2: ['subject3']
    },
    _class: 'fqnOfTheEntity'
}
I want to find the count of subjects offered by each campus, or be able to query the count of subjects offered by a specific campus. Is there a way to get it through a query?
As mentioned in the comments, the schema here does not appear to be particularly well-suited to gathering the data requested in the question. For an individual student, this is doable via $size and processing the object (as an array) via $map. For example, if we want your desired output of campus1: 2, campus2: 1 for the sample document provided, a pipeline to produce that in a countsByCampus field might look as follows:
[
    {
        "$addFields": {
            "countsByCampus": {
                "$arrayToObject": {
                    "$map": {
                        "input": {
                            "$objectToArray": "$subjectByCampusName"
                        },
                        "in": {
                            "$mergeObjects": [
                                "$$this",
                                {
                                    v: {
                                        $size: "$$this.v"
                                    }
                                }
                            ]
                        }
                    }
                }
            }
        }
    }
]
Playground demonstration here with an output document of:
{
    "_class": "fqnOfTheEntity",
    "_id": "someId",
    "countsByCampus": {
        "campus1": 2,
        "campus2": 1
    },
    "studentName": "student1",
    "subjectByCampusName": {
        "campus1": ["subject1", "subject2"],
        "campus2": ["subject3"]
    }
}
Doing that across the entire collection would involve $grouping the results together. This can be done, but it would be an extremely resource-intensive and likely slow operation.
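For completeness, a sketch of what that collection-wide version might look like, unwinding each student's campus map and summing the subject-array sizes per campus (the students collection name is an assumption for the example):

db.students.aggregate([
    // Turn the campus map into an array of { k, v } pairs
    { $project: { campuses: { $objectToArray: "$subjectByCampusName" } } },
    { $unwind: "$campuses" },
    // One output document per campus, summing subject-list lengths
    { $group: {
        _id: "$campuses.k",
        subjectCount: { $sum: { $size: "$campuses.v" } }
    }}
])

This touches every document and cannot use an index for the grouping, which is why it tends to be slow on a large collection.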

Merging / aggregating multiple documents in MongoDB

I have a little problem with my aggregations. I have a large collection of flat documents with the following schema:
{
    _id: ObjectId("5dc027d38da295b969eca568"),
    emp_no: 10001,
    salary: 60117,
    from_date: "1986-06-26",
    to_date: "1987-06-26"
}
It's all about annual employee salaries. The data was exported from a relational database, so there are multiple documents with the same value of "emp_no" while the rest of their attributes vary. I need to aggregate them by the value of the "emp_no" attribute, so as a result I will have something like this:
// one document
{
    _id: ObjectId("5dc027d38da295b969eca568"),
    emp_no: 10001,
    salaries: [
        {
            salary: 60117,
            from_date: "1986-06-26",
            to_date: "1987-06-26"
        },
        {
            salary: 62102,
            from_date: "1987-06-26",
            to_date: "1988-06-25"
        },
        ...
    ]
}
// another document
{
    _id: ObjectId("5dc027d38da295b969eca579"),
    emp_no: 10002,
    salaries: [
        {
            salary: 65828,
            from_date: "1996-08-03",
            to_date: "1997-08-03"
        },
        ...
    ]
}
// and so on
Last but not least, there are almost 2.9 million documents, so aggregating by "emp_no" manually would be a bit of a problem. Is there a way I can aggregate them using just Mongo queries? How do I do this kind of thing? Thank you in advance for any help.
The $group stage of the aggregation pipeline can be used to get this type of aggregate. Specify the attribute you want to group by as the value of the _id field in the group stage.
How does the below query work for you?
db.collection.aggregate([
    {
        "$group": {
            "_id": "$emp_no",
            "salaries": {
                "$push": {
                    "salary": "$salary",
                    "from_date": "$from_date",
                    "to_date": "$to_date"
                }
            },
            "emp_no": { "$first": "$emp_no" },
            "first_document_id": { "$first": "$_id" }
        }
    },
    {
        "$project": {
            "_id": "$first_document_id",
            "salaries": 1,
            "emp_no": 1
        }
    }
])
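One practical note given the ~2.9 million documents: $group accumulates its groups in memory, and a stage that exceeds MongoDB's 100 MB pipeline-stage limit will abort unless it is allowed to spill to disk. Passing the standard allowDiskUse option avoids that:

db.collection.aggregate([
    // ... the same two stages as above ...
], { allowDiskUse: true })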

MongoDB: get all entries with most recent revision

I'm trying to query a prebuilt MongoDB that keeps revision information around for every entry. A sample of what the data looks like is this:
{
    "_id": ObjectId("5cd48bad15447900012fae00"),
    "sku": "abc123",
    "_revision": 5,
    "_created": ISODate("2019-05-10T15:00:00.000Z"),
    "provider": "MySupplier5"
}
In this example there are entries for the same record with:
different _id
same sku
same provider
_revision set to 1, 2, 3 and 4 (one per document)
varied _created dates
What I want to select is:
All documents with provider set to "MySupplier5", but only the most recent revision of each.
I tried reading the documentation on Aggregate but wasn't able to get what I needed. I felt that the approach I am using to work around it is exceptionally inefficient and was hoping someone could point me in the right direction.
Right now, I am doing this:
Get all SKUs:
db.products.distinct('sku', { provider: "MySupplier5" })
Loop over all SKUs and get the most recent revision:
db.products.find({ sku: 'abc123' }).sort({ _revision: -1 }).limit(1)
That works, but it feels incredibly inefficient.
I was able to get those two queries running using Mongoid just fine and I'm sure I could translate whatever is given into the Ruby equivalent as well. I just can't figure out how to say "only give me the most recent revision" in Mongo.
Thanks!
For anyone looking for the answer that I wound up with, based on the links provided, I am listing it here (apologies for the late, late answer).
It took me all of the following links to be able to learn & figure out exactly what I needed:
StackOverflow 55101531
MongoDB Aggregation
MongoDB Aggregation API Doc
Some Samples to drive it home
[
    {
        "$group": {
            "_id": "$sku",
            "maxRevision": { "$max": "$_revision" },
            "originalDocuments": { "$push": "$$ROOT" }
        }
    },
    {
        "$project": {
            "originalDocument": {
                "$filter": {
                    "input": "$originalDocuments",
                    "as": "doc",
                    "cond": {
                        "$and": [
                            { "$eq": ["abc123", "$$doc.sku"] },
                            { "$eq": ["$maxRevision", "$$doc._revision"] }
                        ]
                    }
                }
            }
        }
    },
    {
        "$unwind": "$originalDocument"
    }
]
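For comparison, a more compact pattern for "latest revision per sku" is to sort descending and take the first document in each group. This is a sketch assuming the provider and _revision fields from the sample document:

db.products.aggregate([
    { $match: { provider: "MySupplier5" } },
    // Highest revision first within each sku
    { $sort: { sku: 1, _revision: -1 } },
    { $group: { _id: "$sku", latest: { $first: "$$ROOT" } } },
    { $replaceRoot: { newRoot: "$latest" } }
])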

MongoDB: adding fields based on partial match query - expression vs query

So I have one collection that I'd like to query/aggregate. The query is made up of several parts that are OR'ed together. For every part of the query, I have a specific set of fields that need to be shown.
So my hope was to do this with an aggregation that $matches the queries OR'ed together all at once, and then uses $project with $cond to decide which fields are needed. The problem here is that $cond uses expressions, while $match uses queries. That is a problem, since some query features are not available as expressions, so a simple conversion is not an option.
So I need another solution...
- I could just run a separate aggregation per query, because there I know which fields to match, and then merge the results together. But this will not work if I use pagination in the queries (limit/skip etc.).
- Find some other way to tag every document so I can (afterwards) remove any fields not needed. It might not be super efficient, but it would work. No clue yet how to do that.
- Figure out a way to build queries that are made only of expressions. For my purpose that might be good enough, but it would mean a rewrite of the query parser. It could work, but it is not ideal.
So this is the next incarnation. It will deduplicate and merge records, and finally transform the result back into something resembling a normal query result:
db.getCollection('somecollection').aggregate([
    {
        "$facet": {
            "f1": [
                { "$match": { <some query 1> } },
                { "$project": { <some fixed field projection> } }
            ],
            "f2": [
                { "$match": { <some query 2> } },
                { "$project": { <some fixed field projection> } }
            ]
        }
    },
    {
        "$project": {
            "rt": { "$concatArrays": [ "$f1", "$f2" ] }
        }
    },
    { "$unwind": { "path": "$rt" } },
    { "$replaceRoot": { "newRoot": "$rt" } },
    { "$group": { "_id": "$_id", "items": { "$push": { "item": "$$ROOT" } } } },
    {
        "$project": {
            "rt": { "$mergeObjects": "$items" }
        }
    },
    { "$replaceRoot": { "newRoot": "$rt.item" } }
]);
There might still be some optimisation to be done, so any comments are welcome.
I found an extra option using $facet. This way, I can make a facet for every group of fields/subqueries. This seems to work fine, except that the result is a single document with a bunch of arrays; I'm not yet sure how to convert that back to multiple documents.
Okay, so now I have that figured out. I'm not sure yet about all of the intricacies of this solution, but it seems to work in general. Here is an example:
db.getCollection('somecollection').aggregate([
    {
        "$facet": {
            "f1": [
                { "$match": { <some query 1> } },
                { "$project": { <some fixed field projection> } }
            ],
            "f2": [
                { "$match": { <some query 2> } },
                { "$project": { <some fixed field projection> } }
            ]
        }
    },
    {
        "$project": {
            "rt": { "$concatArrays": [ "$f1", "$f2" ] }
        }
    },
    { "$unwind": { "path": "$rt" } },
    { "$replaceRoot": { "newRoot": "$rt" } }
]);

Sub-query in MongoDB

I have two collections in MongoDB, one with users and one with actions. Users look roughly like:
{_id: ObjectId("xxxxx"), country: "UK",...}
and actions like
{_id: ObjectId("yyyyy"), createdAt: ISODate(), user: ObjectId("xxxxx"),...}
I am trying to count events and distinct users, split by country. The first half of this works fine; however, when I try to add a sub-query to pull in the country, I only get nulls out for country:
db.events.aggregate({
    $match: {
        createdAt: { $gte: ISODate("2013-01-01T00:00:00Z") },
        user: { $exists: true }
    }
},
{
    $group: {
        _id: {
            year: { $year: "$createdAt" },
            user_obj: "$user"
        },
        count: { $sum: 1 }
    }
},
{
    $group: {
        _id: {
            year: "$_id.year",
            country: db.users.findOne({
                _id: { $eq: "$_id.user_obj" },
                country: { $exists: true }
            }).country
        },
        total: { $sum: "$count" },
        distinct: { $sum: 1 }
    }
})
No Joins in here, just us bears
So MongoDB "does not do joins". You might have tried something like this in the shell for example:
db.events.find().forEach(function(event) {
    event.user = db.users.findOne({ "_id": event.user });
    printjson(event);
})
But this does not do what you seem to think it does. It does exactly what it looks like: it runs a query on the "users" collection for every item returned from the "events" collection, shuttling data "to and from" the client; it is not run on the server.
For the same reason, your 'embedded' statement within an aggregation pipeline does not work like that. Unlike the above, the "whole pipeline" logic is sent to the server before execution. So if you did something like this to select "UK" users:
db.events.aggregate([
    { "$match": {
        "user": {
            "$in": db.users.distinct("_id", { "country": "UK" })
        }
    }}
])
Then that .distinct() query is actually evaluated on the "client" and not the server, and therefore has no access to any document values in the aggregation pipeline. The .distinct() runs first, returns its array as an argument, and then the whole pipeline is sent to the server. That is the order of execution.
Correcting
You need at least some level of de-normalization for the sort of query you want to run to work. So you generally have two choices:
Embed your whole user object data within the event data.
At least embed "some" of the user object data within the event data. In this case "country", because you are going to use it.
So then, if you follow the "second" case there and at least "extend" your existing data a little to include the "country", like this:
{
    "_id": ObjectId("yyyyy"),
    "createdAt": ISODate(),
    "user": {
        "_id": ObjectId("xxxxx"),
        "country": "UK"
    }
}
Then the "aggregation" process becomes simple:
db.events.aggregate([
    { "$match": {
        "createdAt": { "$gte": ISODate("2013-01-01T00:00:00Z") },
        "user": { "$exists": true }
    }},
    { "$group": {
        "_id": {
            "year": { "$year": "$createdAt" },
            "user_id": "$user._id",
            "country": "$user.country"
        },
        "count": { "$sum": 1 }
    }},
    { "$group": {
        "_id": "$_id.country",
        "total": { "$sum": "$count" },
        "distinct": { "$sum": 1 }
    }}
])
We're not normal
Fixing your data to include the information it needs in a single collection, where we "do not do joins", is a relatively simple process. It is really just a variant of the original query sample above:
var bulk = db.events.initializeUnorderedBulkOp(),
    count = 0;

db.users.find().forEach(function(user) {
    // Update multiple events for this user
    bulk.find({ "user": user._id }).update({
        "$set": { "user": { "_id": user._id, "country": user.country } }
    });
    count++;

    // Send the batch every 1000 operations
    if ( count % 1000 == 0 ) {
        bulk.execute();
        bulk = db.events.initializeUnorderedBulkOp();
    }
});

// Clear any queued operations
if ( count % 1000 != 0 )
    bulk.execute();
So that's what it's all about. Individual queries to a MongoDB server get "one collection" and "one collection only" to work with. Even the fantastic "Bulk Operations" as shown above can still only be "batched" on a single collection.
If you want to do things like "aggregate on related properties", then you "must" contain those properties in the collection you are aggregating data for. It is perfectly okay to live with having data sitting in separate collections, as, for instance, "users" would generally have more information attached to them than just an "_id" and a "country".
But the point here is: if you need "country" for analysis of "event" data by "user", then include it in that data as well. The most efficient server join is a "pre-join", which is the general theory in practice here.