Sub-query in MongoDB - mongodb

I have two collections in MongoDB, one with users and one with actions. Users look roughly like:
{_id: ObjectId("xxxxx"), country: "UK",...}
and actions like
{_id: ObjectId("yyyyy"), createdAt: ISODate(), user: ObjectId("xxxxx"),...}
I am trying to count events and distinct users split by country. The first half works fine, but when I add a sub-query to pull in the country, I only get nulls for country:
db.events.aggregate({
$match: {
createdAt: { $gte: ISODate("2013-01-01T00:00:00Z") },
user: { $exists: true }
}
},
{
$group: {
_id: {
year: { $year: "$createdAt" },
user_obj: "$user"
},
count: { $sum: 1 }
}
},
{
$group: {
_id: {
year: "$_id.year",
country: db.users.findOne({
_id: { $eq: "$_id.user_obj" },
country: { $exists: true }
}).country
},
total: { $sum: "$count" },
distinct: { $sum: 1 }
}
})

No Joins in here, just us bears
So MongoDB "does not do joins". You might have tried something like this in the shell for example:
db.events.find().forEach(function(event) {
event.user = db.users.findOne({ "_id": event.user });
printjson(event)
})
But this does not do what you seem to think it does. It does exactly what it looks like: it runs a query on the "users" collection for every item returned from the "events" collection, with every request going "to and from" the "client", so the join logic is not run on the server.
For the same reason, your "embedded" statement within an aggregation pipeline does not work like that. Unlike the above, the "whole pipeline" logic is sent to the server before execution. So if you did something like this to select "UK" users:
db.events.aggregate([
{ "$match": {
"user": {
"$in": db.users.distinct("_id",{ "country": "UK" })
}
}}
])
Then that .distinct() query is actually evaluated on the "client" and not the server, and therefore has no access to any document values in the aggregation pipeline. So the .distinct() runs first, returns its array as an argument, and then the whole pipeline is sent to the server. That is the order of execution.
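To make that order of execution concrete, here is a minimal sketch in plain JavaScript with no MongoDB involved. The fakeDb object and its distinct, findOne and aggregate functions are hypothetical stand-ins that only record what they receive; the point is simply that JavaScript evaluates arguments before the outer call is made.

```javascript
// Hypothetical stand-ins for the shell's db object; nothing here talks to a server.
const calls = [];
const fakeDb = {
  lastFilter: null,
  users: {
    // Pretend query: records the call and returns a fixed list of ids.
    distinct(field, query) {
      calls.push("distinct");
      return ["u1", "u2"];
    },
    // Records the filter it was actually given.
    findOne(filter) {
      calls.push("findOne");
      fakeDb.lastFilter = filter;
      return null;
    }
  },
  events: {
    // Stands in for sending the finished pipeline to the server.
    aggregate(pipeline) {
      calls.push("aggregate");
      return pipeline;
    }
  }
};

const sent = fakeDb.events.aggregate([
  { $match: { user: { $in: fakeDb.users.distinct("_id", { country: "UK" }) } } }
]);

// The inner call ran first, and the pipeline holds a plain array.
console.log(calls);                   // [ 'distinct', 'aggregate' ]
console.log(sent[0].$match.user.$in); // [ 'u1', 'u2' ]

// A "sub-query" written inside a pipeline sees only the literal string, not a document value.
fakeDb.users.findOne({ _id: { $eq: "$_id.user_obj" } });
console.log(fakeDb.lastFilter._id.$eq); // $_id.user_obj
```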
Correcting
You need at least some level of de-normalization for the sort of query you want to run to work. So you generally have two choices:
Embed your whole user object data within the event data.
At least embed "some" of the user object data within the event data. In this case "country" because you are going to use it.
So then if you follow the "second" case there and at least "extend" your existing data a little to include the "country" like this:
{
"_id": ObjectId("yyyyy"),
"createdAt": ISODate(),
"user": {
"_id": ObjectId("xxxxx"),
"country": "UK"
}
}
Then the "aggregation" process becomes simple:
db.events.aggregate([
{ "$match": {
"createdAt": { "$gte": ISODate("2013-01-01T00:00:00Z") },
"user": { "$exists": true }
}},
{ "$group": {
"_id": {
"year": { "$year": "$createdAt" },
"user_id": "$user._id",
"country": "$user.country"
},
"count": { "$sum": 1 }
}},
{ "$group": {
"_id": "$_id.country",
"total": { "$sum": "$count" },
"distinct": { "$sum": 1 }
}}
])
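As a rough illustration of what the two $group stages compute, here is an in-memory sketch in plain JavaScript. The sample events are made up and the helper name countByCountry is hypothetical; it mirrors the logic of the pipeline, not anything MongoDB runs internally.

```javascript
// Made-up sample events in the de-normalized shape shown above.
const events = [
  { createdAt: new Date("2013-02-01T00:00:00Z"), user: { _id: "a", country: "UK" } },
  { createdAt: new Date("2013-03-01T00:00:00Z"), user: { _id: "a", country: "UK" } },
  { createdAt: new Date("2013-04-01T00:00:00Z"), user: { _id: "b", country: "UK" } },
  { createdAt: new Date("2013-05-01T00:00:00Z"), user: { _id: "c", country: "FR" } }
];

function countByCountry(docs) {
  // First $group: one bucket per (year, user, country) with an event count.
  const perUser = new Map();
  for (const doc of docs) {
    const key = JSON.stringify([doc.createdAt.getUTCFullYear(), doc.user._id, doc.user.country]);
    const bucket = perUser.get(key) || { country: doc.user.country, count: 0 };
    bucket.count += 1;
    perUser.set(key, bucket);
  }
  // Second $group: per country, sum the counts and count the buckets.
  const perCountry = {};
  for (const { country, count } of perUser.values()) {
    const out = perCountry[country] || (perCountry[country] = { total: 0, distinct: 0 });
    out.total += count;
    out.distinct += 1;
  }
  return perCountry;
}

console.log(countByCountry(events));
// { UK: { total: 3, distinct: 2 }, FR: { total: 1, distinct: 1 } }
```

Because the final key is only the country, a user who is active in two different years ends up in two first-stage buckets and is counted once per year in "distinct".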
We're not normal
Fixing your data to include the information it needs on a single collection where we "do not do joins" is a relatively simple process. Just really a variant on the original query sample above:
var bulk = db.events.initializeUnorderedBulkOp(),
count = 0;
db.users.find().forEach(function(user) {
// update multiple events for user
bulk.find({ "user": user._id }).update({
"$set": { "user": { "_id": user._id, "country": user.country } }
});
count++;
// Send batch every 1000
if ( count % 1000 == 0 ) {
bulk.execute();
bulk = db.events.initializeUnorderedBulkOp();
}
});
// Clear any queued
if ( count % 1000 != 0 )
bulk.execute();
So that's what it's all about. Individual queries to a MongoDB server get "one collection" and "one collection only" to work with. Even the fantastic "Bulk Operations" as shown above can still only be "batched" on a single collection.
If you want to do things like "aggregate on related properties", then you "must" contain those properties in the collection you are aggregating data for. It is perfectly okay to live with having data sitting in separate collections, as for instance "users" would generally have more information attached to them than just an "_id" and a "country".
But the point here is if you need "country" for analysis of "event" data by "user", then include it in the data as well. The most efficient server join is a "pre-join", which is the theory in practice here in general.

Related

Count no.of instances of string in a field across documents grouped on another field in MongoDB?

I've got a specific use case and I'm trying to find a way to do it in one aggregation pipeline and preferably without the need to hardcode any data values. I want to group documents based on one property and see a count of values for a particular field within the document.
Example data:
{
"flightNum": "DL1002",
"status": "On time",
"date": 20191001
},
{
"flightNum": "DL1002",
"status": "Delayed",
"date": 20191002
},
{
"flightNum": "DL1002",
"status": "On time",
"date": 20191003
},
{
"flightNum": "DL1002",
"status": "Cancelled",
"date": 20191004
},
{
"flightNum": "DL952",
"status": "On time",
"date": 20191003
},
{
"flightNum": "DL952",
"status": "On time",
"date": 20191004
}
I want an aggregation pipeline that can tell me, for each flight (group by flightNum) how many flights were "On time", "Delayed", or "Cancelled".
Desired response:
{
"flightNum": "DL1002",
"numOnTime": 2,
"numCancelled": 1,
"numDelayed": 1
},
{
"flightNum": "DL952",
"numOnTime": 2
}
It doesn't really matter the naming of the fields, so much as they are there in one document. I found that I could do this with the $cond operator, but that would require me to hard code the expected values of "status" field. For this arbitrary example, there aren't many values, but if another status is added, I would like to not have to update the query. Since there are so many nifty tricks in Mongo, I feel there is likely a way to achieve this.
You can try the query below:
db.collection.aggregate([
/** Group all docs based on flightNum & status & count no. of occurrences */
{
$group: {
_id: {
flightNum: "$flightNum",
status: "$status"
},
count: {
$sum: 1
}
}
},
/** Group on flightNum & push an objects with key as status & value as count */
{
$group: {
_id: "$_id.flightNum",
status: {
$push: {
k: "$_id.status",
v: "$count"
}
}
}
},
/** Recreate status field as an object of key:value pairs from an array of objects */
{
$addFields: {
status: {
$arrayToObject: "$status"
}
}
},
/** add flightNum inside status object */
{
$addFields: {
"status.flightNum": "$_id"
}
},
/** Replace status field as new root for each doc in coll */
{
$replaceRoot: {
newRoot: "$status"
}
}
])
Test : MongoDB-Playground
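For readers who want to check the expected shape without a server, here is an in-memory sketch of the same steps in plain JavaScript, run against the sample data from the question. The helper name statusCounts is hypothetical, and the sketch only mirrors what the stages do to the data.

```javascript
// Sample documents from the question.
const flights = [
  { flightNum: "DL1002", status: "On time", date: 20191001 },
  { flightNum: "DL1002", status: "Delayed", date: 20191002 },
  { flightNum: "DL1002", status: "On time", date: 20191003 },
  { flightNum: "DL1002", status: "Cancelled", date: 20191004 },
  { flightNum: "DL952", status: "On time", date: 20191003 },
  { flightNum: "DL952", status: "On time", date: 20191004 }
];

function statusCounts(docs) {
  const byFlight = new Map();
  for (const { flightNum, status } of docs) {
    // Equivalent of the two $group stages: count per (flightNum, status), gathered per flight.
    const counts = byFlight.get(flightNum) || {};
    counts[status] = (counts[status] || 0) + 1;
    byFlight.set(flightNum, counts);
  }
  // Equivalent of $arrayToObject, $addFields and $replaceRoot:
  // the status counts become top-level keys next to flightNum.
  return [...byFlight].map(([flightNum, counts]) => ({ ...counts, flightNum }));
}

console.log(statusCounts(flights));
// [
//   { 'On time': 2, Delayed: 1, Cancelled: 1, flightNum: 'DL1002' },
//   { 'On time': 2, flightNum: 'DL952' }
// ]
```

No status value is hard-coded, so a new status in the data simply shows up as a new key.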

mongodb aggregation - nested group

I'm trying to perform nested group, I have an array of documents that has two keys (invoiceIndex, proceduresIndex) I need the documents to be arranged like so
invoices (parent) -> procedures (children)
invoices: [ // Array of invoices
{
.....
"procedures": [{}, ...] // Array of procedures
}
]
Here is a sample document
{
"charges": 226.09000000000003,
"currentBalance": 226.09000000000003,
"insPortion": "",
"currentInsPortion": "",
"claim": "notSent",
"status": "unpaid",
"procedures": {
"providerId": "9vfpjSraHzQFNTtN7",
"procedure": "21111",
"description": "One surface",
"category": "basicRestoration",
"surface": [
"m"
],
"providerName": "B Dentist",
"proceduresIndex": "0"
},
"patientId": "mE5vKveFArqFHhKmE",
"patientName": "Silvia Waterman",
"invoiceIndex": "0",
"proceduresIndex": "0"
}
Here is what I have tried
https://mongoplayground.net/p/AEBGmA32n8P
Can you try the following;
db.collection.aggregate([
{
$group: {
_id: "$invoiceIndex",
procedures: {
$push: "$procedures"
},
invoice: {
$first: "$$ROOT"
}
}
},
{
$addFields: {
"invoice.procedures": "$procedures"
}
},
{
"$replaceRoot": {
"newRoot": "$invoice"
}
}
])
I retain the invoice fields with invoice: { $first: "$$ROOT" }, and keep the $push of procedures as a separate field. Then with $addFields I move that array of procedures into the new invoice object, and finally replace the root with it.
You shouldn't use proceduresIndex as part of the _id in $group, because then you won't be able to get the set of procedures per invoiceIndex. With my $group logic it works pretty well, as you can see.
Link to mongoplayground
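The same idea can be sketched in memory with plain JavaScript. The documents below are made up and trimmed to the fields that matter, and the helper name groupInvoices is hypothetical.

```javascript
// Made-up documents, trimmed to the fields that matter for the grouping.
const docs = [
  { invoiceIndex: "0", patientName: "Silvia Waterman", procedures: { procedure: "21111", proceduresIndex: "0" } },
  { invoiceIndex: "0", patientName: "Silvia Waterman", procedures: { procedure: "21112", proceduresIndex: "1" } },
  { invoiceIndex: "1", patientName: "Silvia Waterman", procedures: { procedure: "21113", proceduresIndex: "0" } }
];

function groupInvoices(input) {
  const groups = new Map();
  for (const doc of input) {
    // $first: "$$ROOT" keeps the first document seen for each invoiceIndex.
    if (!groups.has(doc.invoiceIndex)) {
      groups.set(doc.invoiceIndex, { invoice: doc, procedures: [] });
    }
    // $push: "$procedures" collects every procedure of that invoice.
    groups.get(doc.invoiceIndex).procedures.push(doc.procedures);
  }
  // $addFields + $replaceRoot: the collected array replaces the single embedded procedure.
  return [...groups.values()].map(({ invoice, procedures }) => ({ ...invoice, procedures }));
}

const invoices = groupInvoices(docs);
console.log(invoices.map(i => [i.invoiceIndex, i.procedures.length]));
// [ [ '0', 2 ], [ '1', 1 ] ]
```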

MongoDB Sum Array With Objects

Say I have an aggregation that returns the following:
[
{driverId: "21312asd12", cars: 2, totalMiles: 30000, family: 4},
{driverId: "55512a23a2", cars: 3, totalMiles: 55000, family: 2},
...
]
How would I go about running a summation of each data set on a groupId basis to return the following? Do I use an $unwind? Do another grouping?
For example I would like to return:
{
totalDrivers: 2,
totalCars: 5,
totalMiles: 85000,
totalFamily: 6
}
You seem to just be referring to the documents in the output as an "array", therefore just add another $group to the end of your pipeline:
{ "$group": {
"_id": null,
"totalDrivers": { "$sum": 1 },
"totalCars": { "$sum": "$cars" },
"totalMiles": { "$sum": "$totalMiles" },
"totalFamily": { "$sum": "$family" }
}}
Where null is essentially just a blank grouping key that is not a field present in the document to group on. The result should be a single document (albeit in an array, depending on the API method call used or server version).
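As a quick sanity check of the arithmetic, the same summation can be sketched in plain JavaScript over the two documents shown in the question. This is only an in-memory illustration of what a null grouping key does: every document feeds a single accumulator.

```javascript
// The two result documents from the question (driverId values quoted as strings).
const results = [
  { driverId: "21312asd12", cars: 2, totalMiles: 30000, family: 4 },
  { driverId: "55512a23a2", cars: 3, totalMiles: 55000, family: 2 }
];

// Equivalent of { $group: { _id: null, ... } }: one accumulator for all documents.
const totals = results.reduce(
  (acc, doc) => ({
    totalDrivers: acc.totalDrivers + 1,
    totalCars: acc.totalCars + doc.cars,
    totalMiles: acc.totalMiles + doc.totalMiles,
    totalFamily: acc.totalFamily + doc.family
  }),
  { totalDrivers: 0, totalCars: 0, totalMiles: 0, totalFamily: 0 }
);

console.log(totals);
// { totalDrivers: 2, totalCars: 5, totalMiles: 85000, totalFamily: 6 }
```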
Or if you actually mean that each document has a field with an array like this, then $unwind and process the group either per document or with a null as above:
{ "$unwind": "$someArray" },
{ "$group": {
"_id": "$_id",
"totalDrivers": { "$sum": 1 },
"totalCars": { "$sum": "$someArray.cars" },
"totalMiles": { "$sum": "$someArray.totalMiles" },
"totalFamily": { "$sum": "$someArray.family" }
}}
At any rate, you should really post the code you are using when asking questions like this. It is very likely that your pipeline may not be as efficient to get to your end goal as you think, and if you posted that it both gives a clear picture of what you are doing as well as leaves it open for suggested improvement.

How to aggregate too large collection, the number of documents over than 10 billion

I got the following errors when I tried to aggregate by user_id or run distinct on user_id:
failed: exception: aggregation result exceeds maximum document size
failed: exception: distinct too big, 16mb cap
How can I complete these tasks on a very large collection?
data format
{
user_id: "Jack",
SYMPTOM_1: "flu",
SYMPTOM_2: "cough",
SYMPTOM_3: "cancer",
datetime: "20140101",
}
aggregation query
This query tries to group by user and append all the symptoms from the medical records to each user
db.medical_records.aggregate([
{
"$sort": { "datetime": 1 }
},
{
"$group": {
"_id": "$user_id",
"symptom1":{
"$push": {"symptom": "$SYMPTOM_1" ,"date": "$datetime"}
},
"symptom2":{
"$push": {"symptom": "$SYMPTOM_2" ,"date": "$datetime"}
},
"symptom3":{
"$push": {"symptom": "$SYMPTOM_3" ,"date": "$datetime"}
},
"first_date": { "$first": "$datetime" },
"user_id": { "$first": "$user_id" },
"count": { "$sum": 1 }
}
},
{
"$project": {
"user_id": "$user_id",
"date": "$datetime",
"symptom1": "$symptom1",
"symptom2": "$symptom2",
"symptom3": "$symptom3",
"count": "$count",
"_id": 1
}
}
],allowDiskUse=true)
Expected output
{u'user_id': u'de96dsdase303c6c6439891c57901183c0e4c',
u'symptom1': [{u'symptom': u'1479 ', u'date': u'20040910'}],
u'symptom2': [{u'symptom': u' ', u'date': u'20040910'}],
u'symptom3': [{u'symptom': u' ', u'date': u'20040910'}],
u'count': 1,
u'first_date': u'20040910'}
It appears like you are trying to use the allowDiskUse option which would likely solve your problem, but unfortunately you seem to have a syntactical error.
When you pass options to an operation, you need to pass these as an object surrounded by { and }.
What you are doing here is assigning true to a new global variable allowDiskUse and passing the result of that assignment to aggregate, which is just the value true.
Try replacing ],allowDiskUse=true) with ], { allowDiskUse:true } )
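This lets pipeline stages that would otherwise exceed the per-stage memory limit (100MB) write temporary data to disk. Keep in mind that each individual result document is still subject to the 16MB BSON document size limit, and that this will still be a very slow operation.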
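The difference between the two call forms can be seen in plain JavaScript. In this small sketch, fakeAggregate is a hypothetical stand-in that simply returns whatever it receives as its second argument.

```javascript
// Hypothetical stand-in for aggregate(): it just reports what it was given as options.
function fakeAggregate(pipeline, options) {
  return options;
}

// Declared here so the sketch also runs in strict mode;
// in the shell the assignment would create a global variable instead.
let allowDiskUse;

// An assignment is an expression whose value is the assigned value, so `true` is passed.
const wrong = fakeAggregate([], allowDiskUse = true);

// An object literal is what an options argument is expected to look like.
const right = fakeAggregate([], { allowDiskUse: true });

console.log(wrong); // true
console.log(right); // { allowDiskUse: true }
```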

Insert if not exists, else remove MongoDB

So I have a query in MongoDB (2.6.4) where I am trying to implement a simple upvote/downvote mechanism. When a user clicks upvote, I need to do the following:
If already upvoted by user, then remove upvote.
Else if not upvoted by user, then add upvote AND remove downvote if exists.
So far, my query formed (is incorrect) is:
db.collection.aggregate([
{
$project: {
"_id" : ObjectId("53e4d45c198d7811248cefca"),
"upvote": {
"$cond":
[
{"$in": ["$upvote",1] },
{"$pull": {"upvote" : 1}},
{"$addToSet": {"upvote" : 1}, "$pull": {"downvote": 1}}
]
}
}
}
])
where '1' is the user id who is trying to upvote.
Both upvote and downvote are arrays that contain userIds of those who have upvoted and downvoted, respectively.
For output of query, I just want a bool value: true if $cond evaluated to true, else false.
That's not a good way to implement up-votes and downvotes. Aside from the aggregation framework not being a mechanism for updating documents in any way, you seem to have gravitated towards thinking it may be a solution due to the logic you want to implement. But aggregate does not update.
What you want on your, well, let's call it a "question" schema, is a structure like this:
{
"_id": ObjectId("53f51a844ffa9b02cf01c074"),
"upvoted": [],
"downvoted": [],
"upvoteCount": 0,
"downvoteCount": 0
}
That is something that can work well with atomic updates and actually give you some stateful information about the object at the same time.
For the "upvoted" and "downvoted" arrays, we are going to consider that the "users" voting have a similar unique ObjectId value. So what we are going to do is $push or $pull from either array and also "increment/decrement" the counter values along with each of those operations.
Here's how this works for an upvote:
db.questions.update(
{
"_id": ObjectId("53f51a844ffa9b02cf01c074"),
"upvoted": { "$ne": ObjectId("53f51c0a4ffa9b02cf01c075") },
"downvoted": ObjectId("53f51c0a4ffa9b02cf01c075")
},
{
"$push": { "upvoted": ObjectId("53f51c0a4ffa9b02cf01c075") },
"$inc": { "upvoteCount": 1, "downvoteCount": -1 },
"$pull": { "downvoted": ObjectId("53f51c0a4ffa9b02cf01c075") },
}
)
db.questions.update(
{
"_id": ObjectId("53f51a844ffa9b02cf01c074"),
"upvoted": { "$ne": ObjectId("53f51c0a4ffa9b02cf01c075") }
},
{
"$push": { "upvoted": ObjectId("53f51c0a4ffa9b02cf01c075") },
"$inc": { "upvoteCount": 1 },
}
)
Actually that's two operations, which you could also send with the Bulk Operations API (probably the best way really), but there is a point to it. The first statement will only match a document where the current user already has a "downvote" recorded, that is, where that user id value was previously pushed to the "downvoted" array. If it is not there then no update is made. When it does match, you both push and pull from the respective arrays and also "increment/decrement" the counter fields at the same time.
With the second statement, which will only match where the first did not, you can fairly assume that you don't need to touch "downvoted" and just handle the upvote fields. In both cases the safe thing to do is make sure that the main condition is that the current user id value is not present in the "upvoted" array.
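The combined effect of the two statements can be sketched in memory with plain JavaScript. The upvote function below is a hypothetical illustration of the logic only: user ids are plain strings, and it says nothing about the atomicity that the server-side updates give you.

```javascript
// In-memory sketch; `question` mirrors the schema above.
function upvote(question, userId) {
  // Main condition shared by both statements: the user has not upvoted yet.
  if (question.upvoted.includes(userId)) return false;

  if (question.downvoted.includes(userId)) {
    // First statement only: remove the existing downvote and decrement its counter.
    question.downvoted = question.downvoted.filter(id => id !== userId);
    question.downvoteCount -= 1;
  }
  // Both statements: record the upvote and increment its counter.
  question.upvoted.push(userId);
  question.upvoteCount += 1;
  return true;
}

const question = { upvoted: [], downvoted: ["u1"], upvoteCount: 0, downvoteCount: 1 };
console.log(upvote(question, "u1")); // true
console.log(upvote(question, "u1")); // false
console.log(question);
// { upvoted: [ 'u1' ], downvoted: [], upvoteCount: 1, downvoteCount: 0 }
```

The second call returns false for the same reason the second update statement matches nothing after the first one has applied: the user id is already in "upvoted".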
For downvotes the fields are just reversed:
db.questions.update(
{
"_id": ObjectId("53f51a844ffa9b02cf01c074"),
"downvoted": { "$ne": ObjectId("53f51c0a4ffa9b02cf01c075") },
"upvoted": ObjectId("53f51c0a4ffa9b02cf01c075")
},
{
"$pull": { "upvoted": ObjectId("53f51c0a4ffa9b02cf01c075") },
"$inc": { "upvoteCount": -1, "downvoteCount": 1 },
"$push": { "downvoted": ObjectId("53f51c0a4ffa9b02cf01c075") },
}
)
db.questions.update(
{
"_id": ObjectId("53f51a844ffa9b02cf01c074"),
"downvoted": { "$ne": ObjectId("53f51c0a4ffa9b02cf01c075") }
},
{
"$push": { "downvoted": ObjectId("53f51c0a4ffa9b02cf01c075") },
"$inc": { "downvoteCount": 1 },
}
)
Naturally you can see the logical progression to simply cancelling any "upvote/downvote" for the user in question. Also you can be smart about it if you want and expose the information in your client to not only show whether the current user has already "upvoted/downvoted" but also control click actions and eliminate unnecessary requests.
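Cancelling an existing upvote, which is what the first rule in the question asks for, might look like the sketch below. The helper name cancelUpvote is hypothetical and the ids are placeholder strings; it only builds the filter and update documents, following the same pattern as the statements above.

```javascript
// Builds the filter and update for removing an existing upvote (hypothetical helper).
function cancelUpvote(questionId, userId) {
  return {
    // Only match when the user is currently in the "upvoted" array.
    filter: { _id: questionId, upvoted: userId },
    // Remove the user and decrement the counter in the same update.
    update: { $pull: { upvoted: userId }, $inc: { upvoteCount: -1 } }
  };
}

const op = cancelUpvote("question-id", "user-id");
console.log(JSON.stringify(op.update));
// {"$pull":{"upvoted":"user-id"},"$inc":{"upvoteCount":-1}}
```

In the shell these would be passed as db.questions.update(op.filter, op.update), with real ObjectId values in place of the placeholder strings.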