How to select only non-null values when aggregating with $first or $last in MongoDB?

My data represents a dictionary that receives a bunch of updates and potentially new fields (metadata being added to a post). So something like:
> db.collection.find()
{ _id: ..., 'A': 'apple', 'B': 'banana' },
{ _id: ..., 'A': 'artichoke' },
{ _id: ..., 'B': 'blueberry' },
{ _id: ..., 'C': 'cranberry' }
The challenge - I want to find the first (or last) value for each key ignoring blank values (i.e. I want some kind of conditional group by that works at a field not document level). (Equivalent to the starting or ending version of the metadata after updates).
The problem is that:
db.collection.aggregate([
{ $group: {
_id: null,
A: { $last: '$A' },
B: { $last: '$B' },
C: { $last: '$C' }
}}
])
fills in the blanks with nulls (rather than skipping them in the result), so I get:
{ '_id': ..., 'A': null, 'B': null, 'C': 'cranberry' }
when I want:
{ '_id': ..., 'A': 'artichoke', 'B': 'blueberry', 'C': 'cranberry' }

I don't think this is what you really want, but it does solve the problem you are asking. The aggregation framework cannot really do this, as you are asking for "last results" of different columns from different documents. There is really only one way to do this and it is pretty insane:
db.collection.aggregate([
{ "$group": {
"_id": null,
"A": { "$push": "$A" },
"B": { "$push": "$B" },
"C": { "$push": "$C" }
}},
{ "$unwind": "$A" },
{ "$group": {
"_id": null,
"A": { "$last": "$A" },
"B": { "$last": "$B" },
"C": { "$last": "$C" }
}},
{ "$unwind": "$B" },
{ "$group": {
"_id": null,
"A": { "$last": "$A" },
"B": { "$last": "$B" },
"C": { "$last": "$C" }
}},
{ "$unwind": "$C" },
{ "$group": {
"_id": null,
"A": { "$last": "$A" },
"B": { "$last": "$B" },
"C": { "$last": "$C" }
}},
])
Essentially you compact down the documents pushing all of the found elements into arrays. Then each array is unwound and the $last element is taken from there. You need to do this for each field in order to get the last element of each array, which was the last match for that field.
This is not a good approach, and it is likely to exceed the 16MB BSON document limit on any meaningful collection.
So what you are really after is looking for a "last seen" value for each field. You could brute force this by iterating the collection and keeping values that are not null. You can even do this on the server like this with mapReduce:
db.collection.mapReduce(
function () {
if (start == 0)
emit( 1, "A" );
start++;
current = this;
Object.keys(store).forEach(function(key) {
if ( current.hasOwnProperty(key) )
store[key] = current[key];
});
},
function(){},
{
"scope": { "start": 0, "store": { "A": null, "B": null, "C": null } },
"finalize": function(){ return store },
"out": { "inline": 1 }
}
)
That will work as well, but iterating the whole collection is nearly as bad as mashing everything together with aggregate.
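As a rough illustration of the "last seen" idea, here is a plain JavaScript sketch that runs over an in-memory array rather than a collection. The docs array and the lastSeen helper are made up for this example; it only shows the logic of keeping the most recent non-missing value per key, in document order.

```javascript
// Sample documents in insertion order (same shape as in the question).
const docs = [
  { A: "apple", B: "banana" },
  { A: "artichoke" },
  { B: "blueberry" },
  { C: "cranberry" },
];

// Hypothetical helper: walk the documents once and remember, for each
// requested key, the last value that was neither missing nor null.
function lastSeen(documents, keys) {
  const store = {};
  for (const key of keys) store[key] = null;
  for (const doc of documents) {
    for (const key of keys) {
      if (doc[key] !== undefined && doc[key] !== null) {
        store[key] = doc[key];
      }
    }
  }
  return store;
}

const result = lastSeen(docs, ["A", "B", "C"]);
console.log(result);
```

Assigning only while store[key] is still null would give the first seen value instead of the last.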
What you really want in this case is three queries, ideally run in parallel, to get the discrete value last seen for each property:
> db.collection.find({ "A": { "$exists": true } }).sort({ "$natural": -1 }).limit(1)
{ "_id" : ObjectId("54b319cd6997a054ce4d71e7"), "A" : "artichoke" }
> db.collection.find({ "B": { "$exists": true } }).sort({ "$natural": -1 }).limit(1)
{ "_id" : ObjectId("54b319cd6997a054ce4d71e8"), "B" : "blueberry" }
> db.collection.find({ "C": { "$exists": true } }).sort({ "$natural": -1 }).limit(1)
{ "_id" : ObjectId("54b319cd6997a054ce4d71e9"), "C" : "cranberry" }
Actually, even better is to create a sparse index on each property and query with $gt against an empty string. This makes sure an index is used, and because the index is sparse it will only contain documents where the property is present. You'll need to .hint() this, but you still want $natural ordering for the sort:
db.collection.ensureIndex({ "A": -1 },{ "sparse": 1 })
db.collection.ensureIndex({ "B": -1 },{ "sparse": 1 })
db.collection.ensureIndex({ "C": -1 },{ "sparse": 1 })
> db.collection.find({ "A": { "$gt": "" } }).hint({ "A": -1 }).sort({ "$natural": -1 }).limit(1)
{ "_id" : ObjectId("54b319cd6997a054ce4d71e7"), "A" : "artichoke" }
> db.collection.find({ "B": { "$gt": "" } }).hint({ "B": -1 }).sort({ "$natural": -1 }).limit(1)
{ "_id" : ObjectId("54b319cd6997a054ce4d71e8"), "B" : "blueberry" }
> db.collection.find({ "C": { "$gt": "" } }).hint({ "C": -1 }).sort({ "$natural": -1 }).limit(1)
{ "_id" : ObjectId("54b319cd6997a054ce4d71e9"), "C" : "cranberry" }
That's the best way to solve what you are saying here. But as I said, this is how you think you need to solve it. Your real problem likely has another way to approach both storing and querying.

Starting Mongo 3.6, for those using $first or $last as a way to get one value from grouped records (not necessarily the actual first or last), $group's $mergeObjects can be used as a way to find a non-null value from grouped items:
// { "A" : "apple", "B" : "banana" }
// { "A" : "artichoke" }
// { "B" : "blueberry" }
// { "C" : "cranberry" }
db.collection.aggregate([
{ $group: {
_id: null,
A: { $mergeObjects: { a: "$A" } },
B: { $mergeObjects: { b: "$B" } },
C: { $mergeObjects: { c: "$C" } }
}}
])
// { _id: null, A: { a: "artichoke" }, B: { b: "blueberry" }, C: { c: "cranberry" } }
$mergeObjects accumulates an object from each grouped record. The thing to note is that $mergeObjects gives priority to values that aren't null. The catch is that the accumulated field has to be wrapped in an object, hence the "awkward" { a: "$A" }.
If the output format isn't exactly what you expect, one can always use an additional $project stage.
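For instance, a follow-up $project stage along these lines should flatten the wrapped values back into plain fields. This is a sketch that has not been run against a live server, and the variable name pipeline is only for illustration.

```javascript
// The pipeline as a plain array, so it can be passed to
// db.collection.aggregate(pipeline) in the shell.
const pipeline = [
  { $group: {
      _id: null,
      A: { $mergeObjects: { a: "$A" } },
      B: { $mergeObjects: { b: "$B" } },
      C: { $mergeObjects: { c: "$C" } }
  }},
  // Unwrap the helper objects: A.a -> A, B.b -> B, C.c -> C.
  { $project: { _id: 0, A: "$A.a", B: "$B.b", C: "$C.c" } }
];
```

The second stage only renames; all of the null-skipping still happens in the $group stage.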

So I've just thought about how to answer this, but would be interested to hear people's opinions on how right/wrong this is. Based on the reply from @NeilLunn I guess I'll hit the BSON limit, making his version better for pulling the data, but it's important to my app that I can run this query in one go. (Perhaps my real problem is the data design).
The problem we have is that in the "group by" we pull in a version of A, B, C for every document. So my solution is to tell the aggregation what fields it should pull in by changing (slightly) the original data structure to tell the engine which keys are in each document:
> db.collection.find()
{ _id: ..., 'A': 'apple', 'B': 'banana', 'Keys': ['A', 'B']},
{ _id: ..., 'A': 'artichoke', 'Keys': ['A']},
{ _id: ..., 'B': 'blueberry', 'Keys': ['B']},
{ _id: ..., 'C': 'cranberry', 'Keys': ['C']}
Now we can $unwind on 'Keys' and then group with 'Keys' as '_id'. Thus:
db.collection.aggregate([
{'$unwind': '$Keys'},
{'$group':
{'_id': '$Keys',
'A': {'$last': '$A'},
'B': {'$last': '$B'},
'C': {'$last': '$C'}
}
}
])
I get back a series of documents with _id equal to the key:
{_id: 'A', 'A': 'artichoke', 'B': null, 'C': null},
{_id: 'B', 'A': null, 'B': 'blueberry', 'C': null},
{_id: 'C', 'A': null, 'B': null, 'C': 'cranberry'}
You can then pull the results you want, knowing that the value for key X is only valid for the result where _id is X.
(Of course the next question is how to reduce this series of documents to one, taking the appropriate field each time)
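One way to do that last step client-side, sketched in plain JavaScript: since each result document is only trustworthy for the field named by its _id, copy just that field into a single object. The grouped array below is the output shown above, hard-coded for the example, and collapse is a made-up helper name.

```javascript
// Result documents as returned by the $unwind + $group pipeline.
const grouped = [
  { _id: "A", A: "artichoke", B: null, C: null },
  { _id: "B", A: null, B: "blueberry", C: null },
  { _id: "C", A: null, B: null, C: "cranberry" },
];

// Hypothetical helper: for each document, keep only the field whose
// name equals the document's _id.
function collapse(results) {
  const merged = {};
  for (const doc of results) {
    merged[doc._id] = doc[doc._id];
  }
  return merged;
}

const combined = collapse(grouped);
console.log(combined);
```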

Related

Mongodb group by values and count the number of occurence

I am trying to count how many times does a particular value occur in a collection.
{
_id:1,
field1: value,
field2: A,
}
{
_id:2,
field1: value,
field2: A,
}
{
_id:3,
field1: value,
field2: C,
}
{
_id:4,
field1: value,
field2: B,
}
what I want is to count how many times A occurs, B occurs and C occurs and return the count.
The output I want
{
A: 2,
B: 1,
C: 1,
}
You can use $facet in an aggregation pipeline like this:
$facet creates three sub-pipelines, each of which filters the documents by the desired key (A, B or C).
Then, in a $project stage, you can take the $size of each matched array.
db.collection.aggregate([
{
"$facet": {
"first": [
{
"$match": {
"field2": "A"
}
}
],
"second": [
{
"$match": {
"field2": "B"
}
}
],
"third": [
{
"$match": {
"field2": "C"
}
}
]
}
},
{
"$project": {
"A": {
"$size": "$first"
},
"B": {
"$size": "$second"
},
"C": {
"$size": "$third"
}
}
}
])
This is a typical use case for the $group stage in the aggregation pipeline. You can do it like this:
$group - to group all the documents by field2
$sum - to count the number of documents for each value of field2
db.collection.aggregate([
{
"$group": {
"_id": "$field2",
"count": {
"$sum": 1
}
}
}
])
Leverage the $arrayToObject operator and a final $replaceWith pipeline to get the desired result. You would need to run the following aggregate pipeline:
db.collection.aggregate([
{ $group: {
_id: { $toUpper: '$field2' },
count: { $sum: 1 }
} },
{ $group: {
_id: null,
counts: {
$push: { k: '$_id', v: '$count' }
}
} },
{ $replaceWith: { $arrayToObject: '$counts' } }
])
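For comparison, the same two steps (count per value, then turn the key/count pairs into one object) look like this in plain JavaScript on an in-memory array. This is only an analogy to the pipeline above, not something that runs on the server; the field1 values are placeholders.

```javascript
const docs = [
  { _id: 1, field1: "value", field2: "A" },
  { _id: 2, field1: "value", field2: "A" },
  { _id: 3, field1: "value", field2: "C" },
  { _id: 4, field1: "value", field2: "B" },
];

// Step 1: group by field2 and count, similar to $group with $sum: 1.
const counts = new Map();
for (const doc of docs) {
  const key = doc.field2.toUpperCase();
  counts.set(key, (counts.get(key) || 0) + 1);
}

// Step 2: build one object from the [key, count] pairs, similar to $arrayToObject.
const result = Object.fromEntries(counts);
console.log(result);
```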

Convert array to new field, using keys as the values of this array and values as frequency of these items (aggregation framework)

I have this problem, but I can't solve it.
I have to transform the array s into a new field called shares.
This new field should contain new keys (the distinct items of s) and new values (how often each item occurs).
Suppose I have these documents:
{
'name': 'igor',
's': ['a', 'a', 'a', 'b', 'b']
},
{
'name': 'jones',
's': ['c', 'b']
}
Expected output:
{
'name': 'igor',
'shares': {
'a': 3,
'b': 2
}
},
{
'name': 'jones',
'shares': {
'c': 1,
'b': 1
}
}
You can try the aggregation query below:
db.collection.aggregate([
/** unwind `s` array */
{
$unwind: "$s"
},
/** group on unique pairs of `_id + s` & retain name field, count sum of matching docs */
{
$group: { _id: { k: "$s", _id: "$_id" }, name: { $first: "$name" }, v: { $sum: 1 } }
},
/** group on unique pairs of just `_id` & retain name field, push docs into shares array `[{k :..., v:...}]` */
{
$group: { _id: "$_id._id", name: { $first: "$name" }, shares: { $push: { k: "$_id.k", v: "$v" } } }
},
/** Re-create shares field from array to object */
{
$addFields: { shares: { $arrayToObject: "$shares" } }
}
])
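The per-document counting that the pipeline performs can be pictured with a small in-memory JavaScript sketch. It is only an analogy (the countShares helper is invented for this example), but it produces the shape asked for in the question.

```javascript
const docs = [
  { name: "igor", s: ["a", "a", "a", "b", "b"] },
  { name: "jones", s: ["c", "b"] },
];

// Hypothetical helper: count how often each item occurs in doc.s and
// return the document with a shares object instead of the array.
function countShares(doc) {
  const shares = {};
  for (const item of doc.s) {
    shares[item] = (shares[item] || 0) + 1;
  }
  return { name: doc.name, shares };
}

const result = docs.map(countShares);
console.log(JSON.stringify(result));
```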
It's bad practice to add heterogeneous elements (in your case: 'a': 3, 'b': 2) to an array, so I changed the type of the shares elements to something like:
{
key: "$_id.shares",
count: "$count"
}
You need to do the following, in order:
Unwind the array s.
Group by a composite _id made of name and s.
Group again by _id.name and push objects with key and count fields into the shares array.
You can try the below query:
db.collection.aggregate([
{
$unwind: "$s"
},
{
$group: {
_id: {
name: "$name",
shares: "$s"
},
count: {
$sum: 1
}
}
},
{
$group: {
_id: "$_id.name",
shares: {
$push: {
key: "$_id.shares",
count: "$count"
}
}
}
}
])
Output
[
{
"_id": "jones",
"shares": [
{
"count": 1,
"key": "c"
},
{
"count": 1,
"key": "b"
}
]
},
{
"_id": "igor",
"shares": [
{
"count": 3,
"key": "a"
},
{
"count": 2,
"key": "b"
}
]
}
]

Project object existence boolean in MongoDB

I have a document structure that looks like this (two example docs below).
{
"A": "value"
},
{
"A": "value",
"B": {
"a": "value",
"b": "value"
}
}
I want to aggregate such that the value of field A is projected while a true/false value is returned depending on whether the object B exists. The result of the query would be:
{
"A": "value",
"B": false
},
{
"A": "value",
"B": true
}
An even shorter solution:
db.collection.aggregate({
$project: {
A: 1,
B: { $cond: ["$B", true, false] }
}
})
or
db.collection.aggregate({
$project: {
A: 1,
B: { $ifNull: [{ $toBool: "$B" }, false] }
}
})
However, the following documents will yield a different result than with the other answers. Check whether such documents can occur in your application.
{
'A': 'value5',
'B': false
},
{
'A': 'value5',
'B': []
}
You can use the aggregation below:
db.collection.aggregate([
{ "$addFields": {
"B": {
"$cond": [
{ "$eq": ["$B", undefined] },
false,
true
]
}
}}
])
You may use the $type operator:
If the argument is a field that is missing in the input document, $type returns the string "missing".
db.collection.aggregate([
{
$project: {
A: 1,
B: {
$ne: [
{
$type: "$B"
},
"missing"
]
}
}
}
])
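The distinction the $type check makes (field present versus field missing) is the same one a property-existence test makes in plain JavaScript. The following in-memory sketch is an analogy only; it also shows why a truthiness test answers differently for a document such as { A: 'value5', B: false }.

```javascript
const docs = [
  { A: "value" },
  { A: "value", B: { a: "value", b: "value" } },
  { A: "value5", B: false },
];

// Small helper so the existence check does not depend on the object's prototype.
const hasOwn = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);

// Existence: true whenever the key is present, whatever its value.
const byExistence = docs.map((doc) => ({ A: doc.A, B: hasOwn(doc, "B") }));

// Truthiness: false for a missing key, but also for B: false.
const byTruthiness = docs.map((doc) => ({ A: doc.A, B: Boolean(doc.B) }));

console.log(byExistence.map((d) => d.B));
console.log(byTruthiness.map((d) => d.B));
```

The two lists agree on the first two documents and differ on the third, which is the case to check for in your own data.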

mongodb aggregation query for field value length's sum

Say, I have following documents:
{name: 'A', fav_fruits: ['apple', 'mango', 'orange'], 'type':'test'}
{name: 'B', fav_fruits: ['apple', 'orange'], 'type':'test'}
{name: 'C', fav_fruits: ['cherry'], 'type':'test'}
I am trying to query to find the total count of fav_fruits field on overall documents returned by :
cursor = db.collection.find({'type': 'test'})
I am expecting output like:
cursor.count() = 3 // Getting
Without much idea of aggregate, can mongodb aggregation framework help me achieve this in any way:
1. sum up the lengths of all 'fav_fruits' field: 6
and/or
2. unique 'fav_fruit' field values = ['apple', 'mango', 'orange', 'cherry']
You need to $project your documents after the $match stage and use the $size operator, which returns the number of items in each array. Then in the $group stage you use the $sum accumulator operator to return the total count.
db.collection.aggregate([
{ "$match": { "type": "test" } },
{ "$project": { "count": { "$size": "$fav_fruits" } } },
{ "$group": { "_id": null, "total": { "$sum": "$count" } } }
])
Which returns:
{ "_id" : null, "total" : 6 }
To get unique fav_fruits simply use .distinct()
> db.collection.distinct("fav_fruits", { "type": "test" } )
[ "apple", "mango", "orange", "cherry" ]
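Both results can also be cross-checked client-side. The plain JavaScript sketch below works on a hard-coded copy of the sample documents rather than on a cursor, so treat it as an illustration of the arithmetic only.

```javascript
const docs = [
  { name: "A", fav_fruits: ["apple", "mango", "orange"], type: "test" },
  { name: "B", fav_fruits: ["apple", "orange"], type: "test" },
  { name: "C", fav_fruits: ["cherry"], type: "test" },
];

// Same filter as the $match stage.
const matching = docs.filter((doc) => doc.type === "test");

// Sum of the array lengths, like $size followed by $sum.
const total = matching.reduce((sum, doc) => sum + doc.fav_fruits.length, 0);

// Distinct values across all arrays, in first-seen order.
const unique = [...new Set(matching.flatMap((doc) => doc.fav_fruits))];

console.log(total);  // 6
console.log(unique); // [ 'apple', 'mango', 'orange', 'cherry' ]
```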
Do this to get just the number of fruits in the fav_fruits array:
db.fruits.aggregate([
{ $match: { type: 'test' } },
{ $unwind: "$fav_fruits" },
{ $group: { _id: "$type", count: { $sum: 1 } } }
]);
This will return the total number of fruits.
But if you want to get the array of unique fav_fruits along with the total number of elements in the fav_fruits field of each document, do this:
db.fruits.aggregate([
{ $match: { type: 'test' } },
{ $unwind: "$fav_fruits" },
{ $group: { _id: "$type", count: { $sum: 1 }, fav_fruits: { $addToSet: "$fav_fruits" } } }
])
You can try this. It may be helpful to you.
db.collection.aggregate([{ $match : { type: "test" } }, {$group : { _id : null, count:{$sum:1} } }])

Empty array prevents document to appear in query

I have documents that have a few fields, and in particular they have a field called attrs that is an array. I am using the aggregation pipeline.
In my query I am interested in the attrs (attributes) field if there are any elements in it. Otherwise I still want to get the result. In this case I am after the field type of the document.
The problem is that if a document does not contain any element in the attrs field it will be filtered away and I won't get its _id.type field, which is what I really want from this query.
{
aggregate: "entities",
pipeline: [
{
$match: {
_id.servicePath: {
$in: [
/^/.*/,
null
]
}
}
},
{
$project: {
_id: 1,
"attrs.name": 1,
"attrs.type": 1
}
},
{
$unwind: "$attrs"
},
{
$group: {
_id: "$_id.type",
attrs: {
$addToSet: "$attrs"
}
}
},
{
$sort: {
_id: 1
}
}
]
}
So the question is: how can I get a result containing all documents types regardless of their having attrs, but including the attributes in case they have them?
I hope it makes sense.
You can use the $cond operator in a $project stage to replace the empty attr array with one that contains a placeholder like null that can be used as a marker to indicate that this doc doesn't contain any attr elements.
So you'd insert an additional $project stage like this right before the $unwind:
{
$project: {
attrs: {$cond: {
if: {$eq: ['$attrs', [] ]},
then: [null],
else: '$attrs'
}}
}
},
The only caveat is that you'll end up with a null value in the final attrs array for those groups that contain at least one doc without any attrs elements, so you need to ignore those client-side.
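Ignoring the placeholder client-side can be as simple as filtering null out of each attrs array. A minimal JavaScript sketch, using a hard-coded result in the shape the aggregation returns; stripPlaceholders is a hypothetical helper name.

```javascript
// Grouped result, including the null markers added by the $cond stage.
const result = [
  { _id: 1, attrs: [null] },
  { _id: 2, attrs: [{ name: "bob", type: 44 }, { name: "john", type: 22 }, null] },
];

// Hypothetical helper: drop the null marker from every group's attrs array.
function stripPlaceholders(groups) {
  return groups.map((group) => ({
    _id: group._id,
    attrs: group.attrs.filter((attr) => attr !== null),
  }));
}

const cleaned = stripPlaceholders(result);
console.log(JSON.stringify(cleaned));
```

Groups whose documents had no attributes at all end up with an empty attrs array but are still present in the output.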
Example
The example uses an altered $match stage because the one in your example isn't valid.
Input Docs
[
{_id: {type: 1, id: 2}, attrs: []},
{_id: {type: 2, id: 1}, attrs: []},
{_id: {type: 2, id: 2}, attrs: [{name: 'john', type: 22}, {name: 'bob', type: 44}]}
]
Output
{
"result" : [
{
"_id" : 1,
"attrs" : [
null
]
},
{
"_id" : 2,
"attrs" : [
{
"name" : "bob",
"type" : 44
},
{
"name" : "john",
"type" : 22
},
null
]
}
],
"ok" : 1
}
Aggregate Command
db.test.aggregate([
{
$match: {
'_id.servicePath': {
$in: [
null
]
}
}
},
{
$project: {
_id: 1,
"attrs.name": 1,
"attrs.type": 1
}
},
{
$project: {
attrs: {$cond: {
if: {$eq: ['$attrs', [] ]},
then: [null],
else: '$attrs'
}}
}
},
{
$unwind: "$attrs"
},
{
$group: {
_id: "$_id.type",
attrs: {
$addToSet: "$attrs"
}
}
},
{
$sort: {
_id: 1
}
}
])
Use some if statements and loops.
First and foremost, your query should select all documents.
Loop through all of them.
Then, if the number of attributes is greater than 0, loop through the attributes and collect them into whatever array or output you find useful.
Use if statements to sanitize your results if you like.
You should use the '$or' operator with two separate queries: one to select the documents with an attr value equal to the required value, and another to match documents where attr is null or the attr key does not exist (using the $exists operator).