Complex Logic quires on simple mongo document - mongodb

I am battling with forming complex logic queries on very basic data types in mongo. Essentially I can have millions of user attributes so my basic mongo document is:
{
name: "Gender"
value: "Male"
userId : "ABC123"
}
{
name: "M-Spike"
value: 0.123
userId : "ABC123"
}
What I would like to do is search for things like findAll userId where {name : "Gender, value: "Male"} AND { name : "m-spike", value : { $gt : 0.1} }
I have tried using the aggregation framework but the complexity of the queries is limited, basically I was ORing all the criteria and counting the results by sampleId (which replicated a rudimentary AND)

I can see a way to do it for N being the number of attributes you want to query about (N being 2 in your example). Try something like this:
db.collection.aggregate(
[ { $match: {$or: [
{"name":"M-Spike","value":{$gt:.1}},
{"name":"Gender","value":"Male"}
]
}
},
{ $group: { _id:"$userId",total:{$sum:1}}
},
{ $project: { _id:1,
matchedAttr : { $eq: ["$total",2] }
}
}
]
)
You will get back:
{
"result" : [
{
"_id" : "XYZ123",
"matchedAttr" : false
},
{
"_id" : "ABC123",
"matchedAttr" : true
}
],
"ok" : 1
}
Now, if you had 2 conditions you matched via "$or" then you get back true for _id that matched both of them. So for five conditions, your $match: $or array will have five conditional pairs, and the last $project transformation will be $eq: ["$total",5]
Built in to this solution is the assumption that you cannot have duplicate entries (i.e. _id cannot have "M-Spike":.5 and also "M-Spike":.2. If you can, then this won't work.

Related

Double aggregation with distinct count in MongoDB

We have a collection which stores log documents.
Is it possible to have multiple aggregations on different attributes?
A document looks like this in it's purest form:
{
_id : int,
agent : string,
username: string,
date : string,
type : int,
subType: int
}
With the following query I can easily count all documents and group them by subtype for a specific type during a specific time period:
db.logs.aggregate([
{
$match: {
$and : [
{"date" : { $gte : new ISODate("2020-11-27T00:00:00.000Z")}}
,{"date" : { $lte : new ISODate("2020-11-27T23:59:59.000Z")}}
,{"type" : 906}
]
}
},
{
$group: {
"_id" : '$subType',
count: { "$sum": 1 }
}
}
])
My output so far is perfect:
{
_id: 4,
count: 5
}
However, what I want to do is to add another counter, which will also add the distinct count as a third attribute.
Let's say I want to append the resultset above with a third attribute as a distinct count of each username, so my resultset would contain the subType as _id, a count for the total amount of documents and a second counter that represents the amount of usernames that has entries. In my case, the number of people that somehow have created documents.
A "pseudo resultset" would look like:
{
_id: 4,
countOfDocumentsOfSubstype4: 5
distinctCountOfUsernamesInDocumentsWithSubtype4: ?
}
Does this makes any sense?
Please help me improve the question as well, since it's difficult to google it when you're not a MongoDB expert.
You can first group at the finest level, then perform a second grouping to achieve what you need:
db.logs.aggregate([
{
$match: {
$and : [
{"date" : { $gte : new ISODate("2020-11-27T00:00:00.000Z")}}
,{"date" : { $lte : new ISODate("2020-11-27T23:59:59.000Z")}}
,{"type" : 906}
]
}
},
{
$group: {
"_id" : {
subType : "$subType",
username : "$username"
},
count: { "$sum": 1 }
}
},
{
$group: {
"_id" : "$_id.subType",
"countOfDocumentsOfSubstype4" : {$sum : "$count"},
"distinctCountOfUsernamesInDocumentsWithSubtype4" : {$sum : 1}
}
}
])
Here is the test cases I used:
And here is the aggregate result:

MongoDB - Update a field in an object of an array with multiple filter conditions

If I have the following document in my database :
{
"_id" : MainId,
"subdoc" : [
{
"_id" : SubdocId,
"userid" : "someid",
"toupdate": false,
},
{
"_id" : SubdocId2,
"userid" : "someid2",
"toupdate": false,
}
],
"extra" : [
"extraid1",
"extraid2"
]
}
How can I update the subdocument SubdocId2 where the id (SubdocId2) must match and either SubdocId2's userid is "someid2" OR value "extraid1" exists in "extra"?
The farthest I got is:
db.coll.update({
"subdoc._id":"SubdocId2", {
$or: ["extra":{$in:["extraid1"]}, "subdoc.userid":"someid2"]
}
}, {"subdoc.$.toupdate":true})
Maybe I forgot to quote up something, but I get an error (SyntaxError: invalid property id)
Please try this :
db.coll.update(
{
$and: [{ "subdoc._id": SubdocId2 }, {
$or: [{ "extra": { $in: ["extraid1"] } },
{ "subdoc.userid": "someid2" }]
}] // subdoc._id && (extra || subdoc.userid)
}, { $set: { "subdoc.$.toupdate": true } })
In your query there are couple of syntax issues & also if you don't use $set in your update part - it replace the entire document with "subdoc.$.toupdate": true. Of course if you're using mongoose that's different scenario as mongoose will internally does add $set but when executing in shell or any client you need to specify $set. Also if SubdocId2 is ObjectId() you need to convert string to ObjectId() in code before querying database.

Project values of different columns into one field

{
"_id" : ObjectId("5ae84dd87f5b72618ba7a669"),
"main_sub" : "MATHS",
"reporting" : [
{
"teacher" : "ABC"
}
],
"subs" : [
{
"sub" : "GEOMETRIC",
"teacher" : "XYZ",
}
]
}
{
"_id" : ObjectId("5ae84dd87f5b72618ba7a669"),
"main_sub" : "SOCIAL SCIENCE",
"reporting" : [
{
"teacher" : "XYZ"
}
],
"subs" : [
{
"sub" : "CIVIL",
"teacher" : "ABC",
}
]
}
I have simplified the structure of the documents that i have.
The basic structure is that I have a parent subject with an array of reporting teachers and an array of sub-subjects(each having a teacher)
I now want to extract all the subject(parent/sub-subjects) along with the condition if they are sub-subjects or not which are taught by a particular teacher.
For eg:
for teacher ABC i want the following structure:
[{'subject':'MATHS', 'is_parent':'True'}, {'subject':'CIVIL', 'is_parent':'FALSE'}]
-- What is the most efficient query possible ..? I have tried $project with $cond and $switch but in both the cases I have had to repeat the conditional statement for 'subject' and 'is_parent'
-- Is it advised to do the computation in a query or should I get the data dump and then modify the structure in the server code? AS in, I could $unwind and get a mapping of the parent subjects with each sub-subject and then do a for loop.
I have tried
db.collection.aggregate(
{$unwind:'$reporting'},
{$project:{
'result':{$cond:[
{$eq:['ABC', '$reporting.teacher']},
"$main_sub",
"$subs.sub"]}
}}
)
then I realised that even if i transform the else part into another query for the sub-subjects I will have to write the exact same thing for the property of is_parent
You have 2 arrays, so you need to unwind both - the reporting and the subs.
After that stage each document will have at most 1 parent teacher-subj and at most 1 sub teacher-subj pairs.
You need to unwind them again to have a single teacher-subj per document, and it's where you define whether it is parent or not.
Then you can group by teacher. No need for $conds, $filters, or $facets. E.g.:
db.collection.aggregate([
{ $unwind: "$reporting" },
{ $unwind: "$subs" },
{ $project: {
teachers: [
{ teacher: "$reporting.teacher", sub: "$main_sub", is_parent: true },
{ teacher: "$subs.teacher", sub: "$subs.sub", is_parent: false }
]
} },
{ $unwind: "$teachers" },
{ $group: {
_id: "$teachers.teacher",
subs: { $push: {
subject: "$teachers.sub",
is_parent: "$teachers.is_parent"
} }
} }
])

Compare Properties from Array to Single Property in Document

I have a mongoDB orders collection, the documents of which look as follows:
[{
"_id" : ObjectId("59537df80ab10c0001ba8767"),
"shipments" : {
"products" : [
{
"orderDetails" : {
"id" : ObjectId("59537df80ab10c0001ba8767")
}
},
{
"orderDetails" : {
"id" : ObjectId("59537df80ab10c0001ba8767")
}
}
]
},
}
{
"_id" : ObjectId("5953831367ae0c0001bc87e1"),
"shipments" : {
"products" : [
{
"orderDetails" : {
"id" : ObjectId("5953831367ae0c0001bc87e1")
}
}
]
},
}]
Now, from this collection, I want to filter out the elements in which, any of the values at shipments.products.orderDetails.id path is same as value at _id path.
I tried:
db.orders.aggregate([{
"$addFields": {
"same": {
"$eq": ["$shipments.products.orderDetails.id", "$_id"]
}
}
}])
to add a field same as a flag to decide whether the values are equal, but the value of same comes as false for all documents.
EDIT
What I want to do is compare the _id field the the documents with all shipments.products.orderDetails.id values in the array.
If even 1 of the shipments.products.orderDetails.ids match the value of the _id field, I want that document to be present in the final result.
PS I am using MongoDB 3.4, and have to use the aggregation pipeline.
Your current attempt fails because the notation returns an "array" in comparison with a "single value".
So instead either use $in where available, which can compare to see if one value is "in" an array:
db.orders.aggregate([
{ "$addFields": {
"same": {
"$in": [ "$_id", "$shipments.products.orderDetails.id" ]
}
}}
])
Or notate both as arrays using $setIsSubset
db.orders.aggregate([
{ "$addFields": {
"same": {
"$setIsSubset": [ "$shipments.products.orderDetails.id", ["$_id"] ]
}
}}
])
Where in that case it's doing a comparison to see if the "sets" have an "intersection" that makes _id the "subset" of the array of values.
Either case will return true when "any" of the id properties within the array entries at the specified path are a match for the _id property of the document.

Compare a date of two elements

My problem is difficult to explain :
In my website I save every action of my visitors (view, click, buy etc).
I have a simple collection named "flow" where my data is registered
{
"_id" : ObjectId("534d4a9a37e4fbfc0bf20483"),
"profile" : ObjectId("534bebc32939ffd316a34641"),
"activities" : [
{
"id" : ObjectId("534bebc42939ffd316a3af62"),
"date" : ISODate("2013-12-13T22:39:45.808Z"),
"verb" : "like",
"product" : "5"
},
{
"id" : ObjectId("534bebc52939ffd316a3f480"),
"date" : ISODate("2013-12-20T19:19:10.098Z"),
"verb" : "view",
"product" : "6"
},
{
"id" : ObjectId("534bebc32939ffd316a3690f"),
"date" : ISODate("2014-01-01T07:11:44.902Z"),
"verb" : "buy",
"product" : "5"
},
{
"id" : ObjectId("534bebc42939ffd316a3741b"),
"date" : ISODate("2014-01-11T08:49:02.684Z"),
"verb" : "favorite",
"product" : "26"
}
]
}
I would like to aggregate these data to retrieve the number of people who made an action (for example "view") and then another later in time (for example "buy"). To to that I need to compare "date" inside my "activities" array...
I tried to use aggregation framework to do that but I do not see how too make this request
This is my beginning :
db.flows.aggregate([
{ $project: { profile: 1, activities: 1, _id: 0 } },
{ $match: { $and: [{'activities.verb': 'view'}, {'activities.verb': 'buy'}] }}, //First verb + second verb
{ $unwind: '$activities' },
{ $match: { 'activities.verb': {$in:['view', 'buy']} } }, //First verb + second verb,
{
$group: {
_id: '$profile',
view: { $push: { $cond: [ { $eq: [ "$activities.verb", "view" ] } , "$activities.date", null ] } },
buy: { $push: { $cond: [ { $eq: [ "$activities.verb", "buy" ] } , "$activities.date", null ] } }
}
}
])
Maybe the format of my collection "flow" is not the best to do what I want...If you have any better idea dont hesitate
Thank you for your help !
Here is the aggregation that will give you the total number of buyers who viewed first and then bought (though not necessarily the same product that they viewed).
db.flow.aggregate(
{$match: {"activities.verb":{$all:["view","buy"]}}},
{$unwind :"$activities"},
{$match: {"activities.verb":{$in:["view","buy"]}}},
{$group: {
_id:"$_id",
firstViewed:{$min:{$cond:{
if:{$eq:["$activities.verb","view"]},
then : "$activities.date",
else : new Date(9999,0,1)
}}},
lastBought: {$max:{$cond:{
if:{$eq:["$activities.verb","buy"]},
then:"$activities.date",
else:new Date(1900,0,1)}
}}}
},
{$project: {viewedThenBought:{$cond:{
if:{$gt:["$lastBought","$firstViewed"]},
then:1,
else:0
}}}},
{$group:{_id:null,totalViewedThenBought:{$sum:"$viewedThenBought"}}}
)
Here you first pass through the pipeline only the documents that have all the "verbs" you are interested in. When you group the first time, you want to use the earliest "view" and the last "buy" and the next project compares them to see if they viewed before they bought.
The last step gives you the count of all the people who satisfied your criteria.
Be careful to leave out all $project phases that don't actually compute any new fields (like you very first $project). The aggregation framework is smart enough to never pass through any fields that it sees are not used in any later stages, so there is never a need to $project just to "eliminate" fields as that will happen automatically.
For your query:
I would like to aggregate these data to retrieve the number of people who made an action
Try this:
db.flows.aggregate([
// De-normalize the array into individual documents
{"$unwind" : "$activities"},
// Match for the verbs you are interested in
{"$match" : {"activities.verb":{$in:["buy", "view"]}}},
// Group by verb to get the count
{"$group" : {_id:"$activities.verb", count:{$sum:1}}}
])
The above query would produce an output like:
{
"result" : [
{
"_id" : "buy",
"count" : 1
},
{
"_id" : "view",
"count" : 1
}
],
"ok" : 1
}
Note: The $and operator in your query ({ $match: { $and: [{'activities.verb': 'view'}, {'activities.verb': 'buy'}] }}) is not required as that's the default if you specify multiple conditions. Only if you need a logical OR, $or operator is required.
If you want to use the date in the aggregation query to do queries like how many "views by day", etc.. the Date Aggregation Operators will come in handy.
I see where you are going with this and I think you are basically on the right track. So more or less un-altered (but for formatting preference) and the few tweeks at the end:
db.flows.aggregate([
// Try to $match "first" always to make sure you can get an index
{ "$match": {
"$and": [
{"activities.verb": "view"},
{"activities.verb": "buy"}
]
}},
// Don't worry, the optimizer "sees" this and will sort of "blend" with
// with the first stage.
{ "$project": {
"profile": 1,
"activities": 1,
"_id": 0
}},
{ "$unwind": "$activities" },
{ "$match": {
"activities.verb": { "$in":["view", "buy"] }
}},
{ "$group": {
"_id": "$profile",
"view": { "$min": { "$cond": [
{ "$eq": [ "$activities.verb", "view" ] },
"$activities.date",
null
]}},
"buy": { "$max": { "$cond": [
{ "$eq": [ "$activities.verb", "buy" ] },
"$activities.date",
null
]}}
}},
{ "$project": {
"viewFirst": { "$lt": [ "$view", "$buy" ] }
}}
])
So essentially the $min and $max operators should be self explanatory in the context in that you should be looking for the "first" view to correspond with the "last" purchase. As for me, and would make sense, you would actually be matching these by product (but hint: "Grouping") but I'll leave that part up to you.
The other advantage here is that the false values will always be negated if there is an actual date to match the "verb". Otherwise this goes through as false and this turns out to be okay.
That is because the next thing you do is $project to "compare" the values and ask the question "Did the 'view' happen before the 'buy'?" which is a logical evaluation of the "less than" $lt operator.
As for the schema itself. If you are storing a lot of these "events" then you are probably better off flattening things out into separate documents and finding some way to mark each with the same "session" identifier if that is separate to "profile".
Getting away from large arrays ( which this seems to lead to ) if likely going to help performance, and with care, makes little different to the aggregation process.