Aggregation in mongodb for nested documents - mongodb

I have a document in the following format:
"summary":{
"HUL":{
"hr_0":{
"ts":None,
"Insights":{
"sentiments":{
"pos":37,
"neg":3,
"neu":27
},
"topics":[
"Basketball",
"Football"
],
"geo":{
"locations":{
"Delhi":34,
"Kolkata":56,
"Pune":79,
"Bangalore":92,
"Mumbai":54
},
"mst_act":{
"loc":"Bangalore",
"lat_long":None
}
}
}
},
"hr_1":{....},
"hr_2":{....},
.
.
"hr_23":{....}
I want to run an aggregation in pymongo that sums up the pos, neg and neu sentiments for all hours of the day "hr_0" to "hr_23".
I am having trouble in constructing the pipeline command in order to do this as the fields I am interested in are in nested dictionaries. Would really appreciate your suggestions.
Thanks!

It's going to be pretty difficult to come up with an aggregation pipeline that gives you the desired aggregates, because your document schema has dynamic keys ("hr_0" to "hr_23"), which you can't use as an identifier expression in the $group pipeline stage.
However, a workaround using the current schema would be to iterate over the find cursor and extract the values you want to add up within the loop. Something like the following:
pos_total = 0
neg_total = 0
neu_total = 0
cursor = db.collection.find()
for doc in cursor:
    for i in range(0, 24):
        # Each hour lives under its own dynamic key: "hr_0" .. "hr_23"
        sentiments = doc["summary"]["HUL"]["hr_" + str(i)]["Insights"]["sentiments"]
        pos_total += sentiments["pos"]
        neg_total += sentiments["neg"]
        neu_total += sentiments["neu"]
print(pos_total)
print(neg_total)
print(neu_total)
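The loop above assumes every document carries all 24 hour keys with a complete sentiments block. A minimal sketch of the same summation that skips missing hours instead of raising a KeyError is shown here, written as a plain function over an already fetched document. The helper name `sum_sentiments` and the two-hour sample document are hypothetical, for illustration only.

```python
def sum_sentiments(doc, hours=24):
    """Sum pos/neg/neu over summary.HUL.hr_0 .. hr_{hours-1}, skipping missing hours."""
    totals = {"pos": 0, "neg": 0, "neu": 0}
    hul = doc.get("summary", {}).get("HUL", {})
    for i in range(hours):
        hour = hul.get("hr_" + str(i)) or {}
        sentiments = hour.get("Insights", {}).get("sentiments", {})
        for key in totals:
            # A missing counter contributes 0 instead of raising KeyError
            totals[key] += sentiments.get(key, 0)
    return totals


# Hypothetical document with only two of the 24 hours present
sample = {
    "summary": {
        "HUL": {
            "hr_0": {"Insights": {"sentiments": {"pos": 37, "neg": 3, "neu": 27}}},
            "hr_1": {"Insights": {"sentiments": {"pos": 5, "neg": 1, "neu": 2}}},
        }
    }
}
totals = sum_sentiments(sample)
print(totals)
```

With pymongo, the same function could be applied to each document yielded by the find cursor and the per-document totals added together.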
If you can change the schema, then the following schema, which stores the hours as an array of subdocuments, would be ideal for using the aggregation framework:
{
"summary": {
"HUL": [
{
"_id": "hr_0",
"ts": None,
"Insights":{
"sentiments":{
"pos":37,
"neg":3,
"neu":27
},
"topics":[
"Basketball",
"Football"
],
"geo":{
"locations":{
"Delhi":34,
"Kolkata":56,
"Pune":79,
"Bangalore":92,
"Mumbai":54
},
"mst_act":{
"loc":"Bangalore",
"lat_long":None
}
}
}
},
{
"_id": "hr_1",
"ts": None,
"Insights":{
"sentiments":{
"pos":37,
"neg":3,
"neu":27
},
...
}
},
...
{
"_id": "hr_23",
"ts": None,
"Insights":{
"sentiments":{
"pos":37,
"neg":3,
"neu":27
},
...
}
}
]
}
}
The aggregation pipeline that would give you the required totals (one result per hour, summed across all documents) is:
pipeline = [
{
"$unwind": "$summary.HUL"
},
{
"$group": {
"_id": "$summary.HUL._id",
"pos_total": { "$sum": "$summary.HUL.Insights.sentiments.pos" },
"neg_total": { "$sum": "$summary.HUL.Insights.sentiments.neg" },
"neu_total": { "$sum": "$summary.HUL.Insights.sentiments.neu" },
}
}
]
result = db.collection.aggregate(pipeline)

Related

get document with same 3 fields in a collection

I have a collection with more than 1000 documents, and some documents share the same values in some fields. I need to find those.
The collection is:
[{_id,fields1,fields2,fields3,etc...}]
What query can I use to get all the elements that have the same 3 fields? For example:
[
{_id:1,fields1:'a',fields2:1,fields3:'z'},
{_id:2,fields1:'a',fields2:1,fields3:'z'},
{_id:3,fields1:'f',fields2:2,fields3:'g'},
{_id:4,fields1:'f',fields2:2,fields3:'g'},
{_id:5,fields1:'j',fields2:3,fields3:'g'},
]
I need to get:
[
{_id:2,fields1:'a',fields2:1,fields3:'z'},
{_id:4,fields1:'f',fields2:2,fields3:'g'},
]
This way I can easily get a list of "duplicates" that I can delete if needed. It's not really important whether I get ids 2 and 4 or 1 and 3,
but 5 should never be included as it's not duplicated.
EDIT:
Sorry, I forgot to mention that there are some documents with null values; I need to exclude those.
This is a perfect use case for window fields. You can use $setWindowFields (available from MongoDB 5.0) to compute $rank within the partition you want. Then keep the documents whose rank is not equal to 1 to get the duplicates.
db.collection.aggregate([
{
$match: {
fields1: {
$ne: null
},
fields2: {
$ne: null
},
fields3: {
$ne: null
}
}
},
{
"$setWindowFields": {
"partitionBy": {
fields1: "$fields1",
fields2: "$fields2",
fields3: "$fields3"
},
"sortBy": {
"_id": 1
},
"output": {
"duplicateRank": {
"$rank": {}
}
}
}
},
{
$match: {
duplicateRank: {
$ne: 1
}
}
},
{
$unset: "duplicateRank"
}
])
I think you can try this aggregation query:
First, $group by the fields you want to check for repeated values.
This creates an array with the _ids of the documents that share those values.
Then keep only the groups with more than one _id ($match).
Finally, $project to get the desired output. I've used the first _id found.
db.collection.aggregate([
{
"$group": {
"_id": {
"fields1": "$fields1",
"fields2": "$fields2",
"fields3": "$fields3"
},
"duplicatesIds": {
"$push": "$_id"
}
}
},
{
"$match": {
"$expr": {
"$gt": [
{
"$size": "$duplicatesIds"
},
1
]
}
}
},
{
"$project": {
"_id": {
"$arrayElemAt": [
"$duplicatesIds",
0
]
},
"fields1": "$_id.fields1",
"fields2": "$_id.fields2",
"fields3": "$_id.fields3"
}
}
])
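To check the expected result against the sample data without a server, the grouping logic can be mirrored in plain Python. This is only a sketch of the idea, not either pipeline itself: the function name `find_duplicates` is hypothetical, and it keeps every document after the first one in each group (like the rank-based answer), whereas the $group pipeline above returns the first _id of each group.

```python
from collections import defaultdict


def find_duplicates(docs, keys=("fields1", "fields2", "fields3")):
    """Return every document after the first one in each group of equal key values."""
    groups = defaultdict(list)
    for doc in docs:
        values = tuple(doc.get(k) for k in keys)
        if any(v is None for v in values):
            continue  # exclude documents with null or missing fields
        groups[values].append(doc)
    duplicates = []
    for members in groups.values():
        duplicates.extend(members[1:])  # the first member is the one to keep
    return duplicates


docs = [
    {"_id": 1, "fields1": "a", "fields2": 1, "fields3": "z"},
    {"_id": 2, "fields1": "a", "fields2": 1, "fields3": "z"},
    {"_id": 3, "fields1": "f", "fields2": 2, "fields3": "g"},
    {"_id": 4, "fields1": "f", "fields2": 2, "fields3": "g"},
    {"_id": 5, "fields1": "j", "fields2": 3, "fields3": "g"},
]
duplicate_ids = [d["_id"] for d in find_duplicates(docs)]
print(duplicate_ids)
```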

Mongo projection for only field and value

I have large documents and I want to project only the leaf fields and their values, without the nesting.
The documents look like this:
{"p":{"s":{"status":"b"},"m":{"pd":{"tt":{"bi":"2","psi":"4","ircsi":true}},"mi":"TT","et":"2020-09-07T14:34:00+03:00"}}}
{"p":{"s":{"status":"b"},"m":{"pd":{"tt":{"bi":"20","psi":"1","ircsi":true}},"mi":"TT","et":"2020-12-29T08:28:06+03:00"}}}
.........
Is there any way to make them look like this?
{"status":"b","bi":"2","psi":"4","ircsi":true,"mi":"TT","et":"2020-09-07T14:34:00+03:00"}
{"status":"b","bi":"20","psi":"1","ircsi":true,"mi":"TT","et":"2020-12-29T08:28:06+03:00"}
.........
And my query is like this;
db.s.aggregate([{
"$match": {
"p.m.etm": {
"$gt": 1585688401000,
"$lt": 1610565947499
},
"p.m.mi":"TT"
}
},
{
"$project": {
"p.m.pd.tt.bi": 1,
"p.m.pd.tt.psi": 1,
"p.m.pd.tt.ircsi": 1,
"p.m.mi":1,
"p.s.status":1,
"p.m.et":1,
"_id": 0
}
}],
{
"allowDiskUse": true
});
What should I change to reach what I need ?
You can use $project to create new top-level fields from the nested paths. (I removed $match for simplicity.)
db.collection.aggregate([
{
"$project": {
"bi": "$p.m.pd.tt.bi",
"psi": "$p.m.pd.tt.psi",
"ircsi": "$p.m.pd.tt.ircsi",
"mi": "$p.m.mi",
"status": "$p.s.status",
"et": "$p.m.et",
"_id": 0
}
}
])
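The effect of that $project stage on one document can be sketched in plain Python. The helper names `get_path` and `flatten` are hypothetical, and the mapping simply repeats the paths used in the stage above; like $project, the sketch leaves out a field whose path does not exist.

```python
_MISSING = object()  # sentinel so that a stored null is not confused with "absent"

# Output field name -> dotted path in the source document (same as the $project stage)
FIELD_PATHS = {
    "bi": "p.m.pd.tt.bi",
    "psi": "p.m.pd.tt.psi",
    "ircsi": "p.m.pd.tt.ircsi",
    "mi": "p.m.mi",
    "status": "p.s.status",
    "et": "p.m.et",
}


def get_path(doc, dotted):
    """Follow a dotted path through nested dicts; return _MISSING if any part is absent."""
    current = doc
    for part in dotted.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


def flatten(doc):
    """Build a flat dict containing only the paths that exist in the document."""
    flat = {}
    for name, path in FIELD_PATHS.items():
        value = get_path(doc, path)
        if value is not _MISSING:
            flat[name] = value
    return flat


doc = {"p": {"s": {"status": "b"},
             "m": {"pd": {"tt": {"bi": "2", "psi": "4", "ircsi": True}},
                   "mi": "TT", "et": "2020-09-07T14:34:00+03:00"}}}
flat = flatten(doc)
print(flat)
```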

Get sum of Nested Array in Aggregate

Ok, I have an issue I cannot seem to solve.
I have a document like this:
{
"playerId": "43345jhiuy3498jh4358yu345j",
"leaderboardId": "5b165ca15399c020e3f17a75",
"data": {
"type": "EclecticData",
"holeScores": [
{
"type": "RoundHoleData",
"xtraStrokes": 0,
"strokes": 3,
},
{
"type": "RoundHoleData",
"xtraStrokes": 1,
"strokes": 5,
},
{
"type": "RoundHoleData",
"xtraStrokes": 0,
"strokes": 4
}
]
}
}
Now, what I am trying to accomplish is to use aggregate to sum the strokes and then order by that sum afterwards. I am trying this:
var sortedBoard = db.collection.aggregate(
{$match: {"leaderboardId": boardId}},
{$group: {
_id: "$playerId",
played: { $sum: 1 },
strokes: {$sum: '$data.holeScores.strokes'}
}
},
{$project:{
type: "$SortBoard",
avgPoints: '$played',
sumPoints: "$strokes",
played : '$played'
}}
);
The issue here is that I do not get the correct strokes sum, since the values are inside another array.
Hope someone can help me with this and thanks in advance :-)
You need to say $sum twice:
var sortedBoard = db.collection.aggregate([
{ "$match": { "leaderboardId": boardId}},
{ "$group": {
"_id": "$playerId",
"SortBoard": { "$first": "$SortBoard" },
"played": { "$sum": 1 },
"strokes": { "$sum": { "$sum": "$data.holeScores.strokes"} }
}},
{ "$project": {
"type": "$SortBoard",
"avgPoints": "$played",
"sumPoints": "$strokes",
"played": "$played"
}}
])
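The reason is that you are using it both as a way to "sum array values" (the inner $sum, applied to the holeScores array of each document) and as an "accumulator" for $group (the outer $sum, applied across the documents of each group).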
The other thing you appear to be missing is that $group only outputs the fields you tell it to, therefore if you want to access other fields in other stages or output, you need to keep them with something like $first or another accumulator. We also appear to be missing a pipeline stage in the question anyway, but it's worth noting just to be sure.
Also note you really should wrap aggregation pipelines as an official array [], because the legacy usage is deprecated and can cause problems in some language implementations.
Returns the correct details of course:
{
"_id" : "43345jhiuy3498jh4358yu345j",
"avgPoints" : 1,
"sumPoints" : 12,
"played" : 1
}
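The two levels of summation can be sketched in plain Python to make the roles of the inner and outer $sum explicit. The function name `total_strokes_by_player` is hypothetical, and the input is the sample document from the question.

```python
from collections import defaultdict


def total_strokes_by_player(docs):
    """Mirror of {"$sum": {"$sum": "$data.holeScores.strokes"}} grouped by playerId."""
    totals = defaultdict(lambda: {"played": 0, "strokes": 0})
    for doc in docs:
        hole_scores = doc.get("data", {}).get("holeScores", [])
        # Inner $sum: add up the strokes array of a single document
        round_strokes = sum(hole.get("strokes", 0) for hole in hole_scores)
        entry = totals[doc["playerId"]]
        entry["played"] += 1  # {"$sum": 1}
        # Outer $sum: accumulate across the documents of the group
        entry["strokes"] += round_strokes
    return dict(totals)


docs = [{
    "playerId": "43345jhiuy3498jh4358yu345j",
    "leaderboardId": "5b165ca15399c020e3f17a75",
    "data": {
        "type": "EclecticData",
        "holeScores": [
            {"type": "RoundHoleData", "xtraStrokes": 0, "strokes": 3},
            {"type": "RoundHoleData", "xtraStrokes": 1, "strokes": 5},
            {"type": "RoundHoleData", "xtraStrokes": 0, "strokes": 4},
        ],
    },
}]
result = total_strokes_by_player(docs)
print(result)
```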

$facet \ $bucket date manipulation

I am using $facet aggregation for an e-commerce style platform.
For this case I have a products collection, and the product schema contains many fields.
I am using the same aggregation process to get all the required facets.
The question is whether I can use the same $facet \ $bucket aggregation to group documents by a manipulation of a specific field.
For example, in the product schema I have a releaseDate field, which is a Date (type) field.
Currently the aggregation query looks like this:
let facetsBodyExample = {
'facetName': [
{ $unwind: `$${fieldData.key}` },
{ $sortByCount: `$${fieldData.key}` }
],
'facetName2': [
{ $unwind: `$${fieldData.key}` },
{ $sortByCount: `$${fieldData.key}` }
],
...,
...
};
let results = await Product.aggregate({
$facet: facetsBodyExample
});
the documents looks like
{
_id : 'as5d16as1d65sa65d165as1d',
name : 'product name',
releaseDate: '2015-07-01T00:00:00.000Z',
field1:13,
field2:[1,2,3],
field3:'abx',
...
}
I want to create custom facets (groups) by quarter + year in a format like 'QQ/YYYY', without defining any boundaries.
Right now I am getting one group per exact date, and I want to group them into quarter + year groups, if possible in the same $facet aggregation.
Query result of the date field I want to customize:
CURRENT:
{
releaseDate: [
{ "_id": "2017-01-01T00:00:00.000Z", "count": 26 },
{ "_id": "2013-04-01T00:00:00.000Z", "count": 25 },
{ "_id": "2013-07-01T00:00:00.000Z", "count": 23 },
...
]
}
DESIRED:
{
releaseDate: [
{ "_id": "Q1/2014", "count": 100 },
{ "_id": "Q2/2014", "count": 200 },
{ "_id": "Q3/2016", "count": 300 },
...
]
}
Thanks !!
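The bucketing rule being asked for can be pinned down with a small client-side sketch. This is not a $facet stage; it only shows the quarter + year label that a grouping expression would need to reproduce. The function names and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import datetime


def quarter_label(date):
    """Map a date to a label such as 'Q3/2015' (months 1-3 -> Q1, 4-6 -> Q2, ...)."""
    quarter = (date.month - 1) // 3 + 1
    return "Q{}/{}".format(quarter, date.year)


def count_by_quarter(dates):
    """Count how many dates fall into each quarter + year group."""
    return Counter(quarter_label(d) for d in dates)


# Hypothetical release dates
release_dates = [
    datetime(2017, 1, 1),
    datetime(2013, 4, 1),
    datetime(2013, 5, 15),
    datetime(2013, 7, 1),
]
counts = count_by_quarter(release_dates)
print(counts)
```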

mongodb count two condition one time?

I need to count the results of two queries like below:
1. data[k]['success'] = common.find({'url': {"$regex": k}, 'result.titile':{'$ne':''}}).count()
2. data[k]['fail'] = common.find({'url': {"$regex": k}, 'result.titile':''}).count()
I think it would be more efficient if mongodb could work like below:
result = common.find({'url': {"$regex": k}})
count1 = result.find({'result.titile':{'$ne':''}})
count2 = result.count() - count1
//result does not have a find or count method, this is just an example
The two counts share the same search condition {'url': {"$regex": k}} and are split by whether {'result.titile':{'$ne':''}} holds or not.
Is there some built-in way to do this without writing custom js?
The async approach is preferable if your client supports it.
You could also aggregate as below:
$match the docs whose url matches.
$group with _id null and take the $sum to count all matching documents. We still need the documents themselves in order to count those with a title, so accumulate them using the $push operator.
$unwind the documents.
$match those which have a non-empty title.
$group again, and get the $sum.
$project the desired result.
sample code:
db.t.aggregate([
{$match:{"url":{"$regex":k}}},
{$group:{"_id":null,
"count_of_url_matching_docs":{$sum:1},
"docs":{$push:"$$ROOT"}}},
{$unwind:"$docs"},
{$match:{"docs.result.titile":{$ne:""}}},
{$group:{"_id":null,
"count_of_url_matching_docs":{$first:"$count_of_url_matching_docs"},
"count_of_docs_with_titles":{$sum:1}}},
{$project:{"_id":0,
"count_of_docs_with_titles":"$count_of_docs_with_titles",
"count_difference":{$subtract:[
"$count_of_url_matching_docs",
"$count_of_docs_with_titles"]}}}
])
Test data:
db.t.insert([
{"url":"s","result":{"titile":1}},
{"url":"s","result":{"titile":""}},
{"url":"s","result":{"titile":""}},
{"url":"s","result":{"titile":2}}
])
Test Result:
{ "count_of_docs_with_titles" : 2, "count_difference" : 2 }
Use .aggregate() with a conditional key for grouping via $cond:
common.aggregate([
{ "$match": { "url": { "$regex": k } } },
{ "$group": {
"_id": {
"$cond": {
"if": { "$ne": [ "$result.title", "" ] },
"then": "success",
"else": "fail"
}
},
"count": { "$sum": 1 }
}}
])
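The single-pass idea behind the $cond grouping can be sketched in plain Python over the test data shown earlier. The function name `count_success_fail` and its `title_field` parameter are hypothetical; the field is spelled "titile" here only because that is how the test data spells it.

```python
import re
from collections import Counter

_MISSING = object()  # a missing field is not equal to "", as in the $ne comparison


def count_success_fail(docs, pattern, title_field="titile"):
    """One pass over the url matches, keyed by whether the title is non-empty."""
    counts = Counter()
    for doc in docs:
        if not re.search(pattern, doc.get("url", "")):
            continue  # plays the role of the $match stage
        title = doc.get("result", {}).get(title_field, _MISSING)
        # Plays the role of the $cond key: "success" unless the title is exactly ""
        counts["success" if title != "" else "fail"] += 1
    return dict(counts)


docs = [
    {"url": "s", "result": {"titile": 1}},
    {"url": "s", "result": {"titile": ""}},
    {"url": "s", "result": {"titile": ""}},
    {"url": "s", "result": {"titile": 2}},
]
counts = count_success_fail(docs, "s")
print(counts)
```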
However, it is actually more efficient to run both queries in parallel if your environment supports it, such as with nodejs:
async.parallel(
[
function(callback) {
common.count({
"url": { "$regex": k },
"result.title": { "$ne": "" }
}, function(err,count) {
callback(err,{ "success": count });
});
},
function(callback) {
common.count({
"url": { "$regex": k },
"result.title": ""
}, function(err,count) {
callback(err,{ "fail": count });
});
}
],
function(err,results) {
if (err) throw err;
console.log(results);
}
)
This makes sense, since no per-document condition has to be evaluated and both counts can run on the server at the same time.