Turn _ids into Keys of New Object - mongodb

I have a huge bunch of documents as such:
{
_id: '1abc',
colors: [
{ value: 'red', count: 2 },
{ value: 'blue', count: 3}
]
},
{
_id: '2abc',
colors: [
{ value: 'red', count: 7 },
{ value: 'blue', count: 34},
{ value: 'yellow', count: 12}
]
}
Is it possible to make use of aggregate() to get the following?
{
_id: 'null',
colors: {
"1abc": [
{ value: 'red', count: 2 },
{ value: 'blue', count: 3}
],
"2abc": [
{ value: 'red', count: 7 },
{ value: 'blue', count: 34},
{ value: 'yellow', count: 12}
]
}
}
Basically, is it possible to turn all of the original documents' _ids into keys of a new object in the singular new aggregated document?
So far, when trying to use $group, I have not been able to use a variable value, e.g. $_id, on the left-hand side of an assignment. Am I missing something or is it simply impossible?
I can do this easily using JavaScript, but it is unbearably slow. Hence I am looking to see whether it is possible using mongo's native aggregate(), which will probably be faster.
If impossible... I would appreciate any suggestions that could point towards a sufficient alternative (changing the structure, etc.?). Thank you!

Like I said in the comments, whilst there are things you can do with the aggregation framework or even mapReduce to make the "server" reshape the response, it's kind of silly to do so.
Let's consider the cases:
Aggregate
db.collection.aggregate([
{ "$match": { "_id": { "$in": ["1abc","2abc"] } } },
{ "$group": {
"_id": null,
"result": { "$push": "$$ROOT" }
}},
{ "$project": {
"colors": {
"1abc": {
"$arrayElemAt": [
{ "$map": {
"input": {
"$filter": {
"input": "$result",
"as": "r",
"cond": { "$eq": [ "$$r._id", "1abc" ] },
}
},
"as": "r",
"in": "$$r.colors"
}},
0
]
},
"2abc": {
"$arrayElemAt": [
{ "$map": {
"input": {
"$filter": {
"input": "$result",
"as": "r",
"cond": { "$eq": [ "$$r._id", "2abc" ] },
}
},
"as": "r",
"in": "$$r.colors"
}},
0
]
}
}
}}
])
So the aggregation framework purely does not dynamically generate "keys" of a document. If you want to process this way, then you need to know all of the "values" that you are going to use to make the keys in the result.
After putting everything into one document with $group, you can then work with the result array to extract data for your "keys". The basic operators here are:
$filter to get the matched element of the array for the "value" that you want.
$map to return just the specific property from the filtered array
$arrayElemAt to just grab the single element that was filtered out of the resulting mapped array
So it really isn't practical in a lot of cases, and the coding of the statement is fairly involved.
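For what it's worth, later MongoDB releases (3.4.4 and up) added $arrayToObject, which can build document keys dynamically from an array of { k, v } pairs. A minimal sketch under that assumption (the _id values here are already strings, as $arrayToObject requires for k):
db.collection.aggregate([
  { "$group": {
    "_id": null,
    // collect each document as a { k, v } pair keyed by its _id
    "colors": { "$push": { "k": "$_id", "v": "$colors" } }
  }},
  // convert the pairs into a single sub-document keyed by _id
  { "$project": { "colors": { "$arrayToObject": "$colors" } } }
])
The caveats below about simply iterating the cursor still apply; this just removes the need to hardcode the key names.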
MapReduce
db.collection.mapReduce(
function() {
var obj = { "colors": {} };
obj.colors[this._id] = this.colors;
emit(null,obj);
},
function(key,values) {
var obj = { "colors": {} };
values.forEach(function(value) {
Object.keys(value.colors).forEach(function(key) {
obj.colors[key] = value.colors[key];
});
})
return obj;
},
{ "out": { "inline": 1 } }
)
Since it is actually written in a "language", you have the ability to loop structures and "build things" in a more dynamic way.
However, close inspection should tell you that the "reducer" function here is not doing anything more than being the processor of "all the results" which have been "stuffed into it" by each emitted document.
That means that "iterating the values" fed to the reducer is really no different to "iterating the cursor", and that leads to the next conclusion.
Cursor Iteration
var result = { "colors": {} };
db.collection.find().forEach(function(doc) {
result.colors[doc._id] = doc.colors;
})
printjson(result)
The simplicity of this should really speak volumes. It is after all doing exactly what you are trying to "shoehorn" into a server operation and nothing more, and just simply "rolls up its sleeves" and gets on with the task at hand.
The key point here is none of the process requires any "aggregation" in a real sense, that cannot be equally achieved by simply iterating the cursor and building up the response document.
This is really why you always need to look at what you are doing and choose the right method. "Server side" aggregation has a primary task of "reducing" a result so you would not need to iterate a cursor. But nothing here "reduces" anything. It's just all of the data, transformed into a different format.
Therefore the simple approach for this type of "transform" is to just iterate the cursor and build up your transformed version of "all the results" anyway.

Related

Getting aggregate count for nested fields in mongo

I have a mongo collection mimicking this Java class. A student can be taught a number of subjects across campuses.
class Students {
String studentName;
Map<String,List<String>> subjectsByCampus;
}
So a structure will look like this
{
_id: ObjectId("someId"),
studentName:'student1',
subjectByCampusName:{
campus1:['subject1','subject2'],
campus2: ['subject3']
},
_class: 'fqnOfTheEntity'
}
I want to find the count of subjects offered by each campus or be able to query the count of subjects offered by a specific campus. Is there a way to get it through query?
Mentioned in the comments, but the schema here does not appear to be particularly well-suited toward gathering the data requested in the question. For an individual student, this is doable via $size and processing the object (as an array) via $map. For example, if we want your desired output of campus1: 2, campus2: 1 for the sample document provided, a pipeline to produce that in a countsByCampus field might look as follows:
[
{
"$addFields": {
"countsByCampus": {
"$arrayToObject": {
"$map": {
"input": {
"$objectToArray": "$subjectByCampusName"
},
"in": {
"$mergeObjects": [
"$$this",
{
v: {
$size: "$$this.v"
}
}
]
}
}
}
}
}
}
]
Playground demonstration here with an output document of:
{
"_class": "fqnOfTheEntity",
"_id": "someId",
"countsByCampus": {
"campus1": 2,
"campus2": 1
},
"studentName": "student1",
"subjectByCampusName": {
"campus1": [
"subject1",
"subject2"
],
"campus2": [
"subject3"
]
}
}
Doing that across the entire collection would involve $grouping the results together. This can be done, but it would be an extremely resource-intensive and likely slow operation.
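For illustration only, a sketch of that collection-wide version (the students collection name is an assumption) could convert the object with $objectToArray, unwind it, and $group on the campus key:
db.students.aggregate([
  // turn the campus map into an array of { k, v } pairs
  { "$project": { "kv": { "$objectToArray": "$subjectByCampusName" } } },
  { "$unwind": "$kv" },
  // one array element per subject entry
  { "$unwind": "$kv.v" },
  // count subject entries per campus across all students
  { "$group": { "_id": "$kv.k", "count": { "$sum": 1 } } }
])
Note this counts list entries, not distinct subject names; a $addToSet accumulator would be needed to deduplicate.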

How to use $set and dot notation to update embedded array elements using corresponding old element?

I have following documents in a MongoDb:
from pymongo import MongoClient
client = MongoClient(host='my_host', port=27017)
database = client.forecast
collection = database.regions
collection.delete_many({})
regions = [
{
'id': 'DE',
'sites': [
{
'name': 'paper_factory',
'energy_consumption': 1000
},
{
'name': 'chair_factory',
'energy_consumption': 2000
},
]
},
{
'id': 'FR',
'sites': [
{
'name': 'pizza_factory',
'energy_consumption': 3000
},
{
'name': 'foo_factory',
'energy_consumption': 4000
},
]
}
]
collection.insert_many(regions)
Now I would like to copy the property sites.energy_consumption to a new field sites.new_field for each site:
set_stage = {
"$set": {
"sites.new_field": "$sites.energy_consumption"
}
}
pipeline = [set_stage]
collection.aggregate(pipeline)
However, instead of copying the individual value per site, all site values are collected and added as an array. Instead of 'new_field': [1000, 2000] I would like to get 'new_field': 1000 for the first site:
{
"_id": ObjectId("61600c11732a5d6b103ba6be"),
"id": "DE",
"sites": [
{
"name": "paper_factory",
"energy_consumption": 1000,
"new_field": [
1000,
2000
]
},
{
"name": "chair_factory",
"energy_consumption": 2000,
"new_field": [
1000,
2000
]
}
]
},
{
"_id": ObjectId("61600c11732a5d6b103ba6bf"),
"id": "FR",
"sites": [
{
"name": "pizza_factory",
"energy_consumption": 3000,
"new_field": [
3000,
4000
]
},
{
"name": "foo_factory",
"energy_consumption": 4000,
"new_field": [
3000,
4000
]
}
]
}
=> What expression can I use to only use the corresponding entry of the array?
Is there some sort of current-index operator:
$sites[<current_index>].energy_consumption
or an alternative dot operator (would remind me on difference between * multiplication and .* element wise matrix multiplication)?
$sites:energy_consumption
Or is this a bug?
Edit
I also tried to use the "$" positional operator, e.g. with
sites.$.new_field
or
$sites.$.energy_consumption
but then I get the error
FieldPath field names may not start with '$'
Related:
https://docs.mongodb.com/manual/reference/operator/aggregation/set/#std-label-set-add-field-to-embedded
In MongoDB how do you use $set to update a nested value/embedded document?
If the field is a member of an array, by selecting it you are selecting all of them.
{ar :[{"a" : 1}, {"a" : 2}]}
"$ar.a" = [1 ,2]
Also you can't mix update operators with aggregation; you can't use things like
$sites.$.energy_consumption. If you are doing aggregation you have to use aggregation operators, with the only exception being the $match stage, where you can use query operators.
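To illustrate the rule (data from the question; the site_count field name is just for this example):
db.regions.aggregate([
  { "$match": { "id": { "$in": ["DE", "FR"] } } },      // query operators belong in $match
  { "$set": { "site_count": { "$size": "$sites" } } }   // aggregation operators everywhere else
])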
Query
An alternative, slightly different solution from yours, using $setField.
I guess it will be faster, but there is probably little difference.
There is no need to use JavaScript; it will be slower.
This is a MongoDB >= 5.0 solution; $setField is a new operator.
Test code here
aggregate([
  { "$set": {
    "sites": {
      "$map": {
        "input": "$sites",
        "in": {
          "$setField": {
            "field": "new_field",
            "input": "$$this",
            "value": "$$this.energy_consumption"
          }
        }
      }
    }
  }}
])
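Note that collection.aggregate(pipeline) only returns transformed results; it does not modify the stored documents. To persist the new field, you could run the same stage as a pipeline-style update (a sketch, assuming MongoDB >= 4.2 for pipeline updates and >= 5.0 for $setField):
db.regions.updateMany(
  {},
  [{ "$set": {
    "sites": {
      "$map": {
        "input": "$sites",
        "in": {
          "$setField": {
            "field": "new_field",
            "input": "$$this",
            "value": "$$this.energy_consumption"
          }
        }
      }
    }
  }}]
)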
use $addFields
db.collection.update({},
[
  {
    "$addFields": {
      "sites": {
        "$map": {
          "input": "$sites",
          "as": "s",
          "in": {
            "name": "$$s.name",
            "energy_consumption": "$$s.energy_consumption",
            // copy the corresponding element's own value, not the whole array
            "new_field": "$$s.energy_consumption"
          }
        }
      }
    }
  }
])
mongoplayground
I found the following ugly workarounds that set the complete sites array instead of only specifying a new field with dot notation:
a) based on javascript function
set_stage = {
"$set": {
"sites": {
"$function": {
"body": "function(sites) {return sites.map(site => {site.new_field = site.energy_consumption_in_mwh; return site})}",
"args": ["$sites"],
"lang": "js"
}
}
}
}
b) based on map and mergeObjects
set_stage = {
"$set": {
"sites": {
"$map": {
"input": "$sites",
"in": {
"$mergeObjects": ["$$this", {
"new_field": "$$this.energy_consumption_in_mwh"
}]
}
}
}
}
}
If there is some kind of $$this context for the dot operator expression, allowing a more elegant solution, please let me know.

MongoDB Property String Starts With Query And Set Group Id

Data structure - one document in a big collection:
{
OPERATINGSYSTEM: "Android 6.0"
}
Issue: The operatingsystem can equal e.g. "Android 5.0", "Android 6.0", "Windows Phone", "Windows Phone 8.1"
There is no property which only contains the kind of operating system e.g. only Android.
I need to get the count of windows phones, and android phones.
My temporary solution:
db.getCollection('RB').find(
{OPERATINGSYSTEM: {$regex: "^Android"}}
).count();
I'm running that query, replacing "^Android" with "Windows Phone" and so on, which takes a lot of time and needs to be done in parallel.
Using the aggregation framework I thought about this:
db.RB.aggregate(
{$group: {_id: {OPERATINGSYSTEM:"$OPERATINGSYSTEM"}}},)
But using this I get an entry for each operating system version: Android 5.0, Android 6.0, etc...
The solution I'm searching for should return data in this format:
{
"Android": 50,
"Windows Phone": 100
}
How can this be done in a single query?
Provided your strings at least consistently have the numeric version as the last thing in the string, then you could use $split with the aggregation framework to make an array from the "space delimited" content, then remove the last element from the array before reconstructing:
Given data like :
{ "name" : "Android 6.0" }
{ "name" : "Android 7.0" }
{ "name" : "Windows Phone 10" }
You can try:
db.getCollection('phones').aggregate([
{ "$group": {
"_id": {
"$let": {
"vars": { "split": { "$split": [ "$name", " " ] } },
"in": {
"$reduce": {
"input": { "$slice": [ "$$split", 0, { "$subtract": [ { "$size": "$$split" }, 1 ] } ] },
"initialValue": "",
"in": {
"$cond": {
"if": { "$eq": [ "$$value", "" ] },
"then": "$$this",
"else": { "$concat": [ "$$value", " ", "$$this" ] }
}
}
}
}
}
},
"count": { "$sum": 1 }
}},
{ "$replaceRoot": {
"newRoot": {
"$arrayToObject": [[{ "k": "$_id", "v": "$count" }]]
}
}}
])
That's all possible if your MongoDB is at least MongoDB 3.4 to support both $split and $reduce. The $replaceRoot is really about naming the keys, and not really required.
Alternately you can use mapReduce:
db.getCollection('phones').mapReduce(
function() {
var re = /\d+/g;
emit(this.name.substr(0,this.name.search(re)-1),1);
},
function(key,values) { return Array.sum(values) },
{ "out": { "inline": 1 } }
)
Where it's easier to break down the string by the index where a numeric value occurs. In either case, you are not required to "hardcode" anything, and the values of the keys are completely dependent on the strings in context.
Keep in mind though that unless there is an extremely large number of possible values, running parallel .count() operations "should" be the fastest to process, since returning cursor counts is a lot faster than actually counting the aggregated entries.
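As a rough sketch of that parallel-count idea (the prefix list is an assumption; issue the counts concurrently from your driver if latency matters, since the shell loop below runs them sequentially):
var prefixes = ["Android", "Windows Phone"];
var counts = {};
prefixes.forEach(function(prefix) {
  // one anchored prefix query per known operating system family
  counts[prefix] = db.getCollection('RB').find(
    { "OPERATINGSYSTEM": { "$regex": "^" + prefix } }
  ).count();
});
printjson(counts);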
You can use map reduce, and apply your logic in the map function.
var map = function(){
  var name = this.OPERATINGSYSTEM.includes("Android") ? "Android" : ""; // could be a regexp
  if(name === ""){
    name = this.OPERATINGSYSTEM.includes("Windows") ? "Windows" : "";
  }
  emit(name, 1);
}
var reduce = function(key, values){
  return Array.sum(values)
}
db.getCollection('RB').mapReduce(map, reduce, {out: "total"})
https://docs.mongodb.com/manual/tutorial/map-reduce-examples/

Finding documents based on the minimum value in an array

my document structure is something like :
{
_id: ...,
key1: ....
key2: ....
....
min_value: //should be the minimum of all the values in options
options: [
{
source: 'a',
value: 12,
},
{
source: 'b',
value: 10,
},
...
]
},
{
_id: ...,
key1: ....
key2: ....
....
min_value: //should be the minimum of all the values in options
options: [
{
source: 'a',
value: 24,
},
{
source: 'b',
value: 36,
},
...
]
}
the value of the various sources in options will keep getting updated on a frequent basis (every few minutes or hours);
assume the size of the options array doesn't change, i.e. no extra elements are added to the list
my queries are of the following type:
- find all documents where the min_value of all the options falls between some limits.
I could first do an unwind on options (and then take the min) and then run comparison queries, but I am new to mongo and not sure how performance is affected by the unwind operation. The number of documents of this type would be about a few million.
Or does anyone have any suggestions around changing the document structure which could help me simplify this query? (apart from creating separate documents per source - it would involve a lot of data duplication)
Thanks!
Using $unwind is indeed quite expensive, most notably so with larger arrays, but there is a cost in all cases of usage. There are a couple of ways to approach not needing $unwind here without real structural changes.
Pure Aggregation
In the basic case, as of the MongoDB 3.2.x release series the $min operator can work directly on an array of values in a "projection" sense, in addition to its standard grouping accumulator role. This means that with the help of the related $map operator for processing elements of an array, you can then get the minimal value without using $unwind:
db.collection.aggregate([
// Still makes sense to use an index to select only possible documents
{ "$match": {
"options": {
"$elemMatch": {
"value": { "$gte": minValue, "$lt": maxValue }
}
}
}},
// Provides a logical filter to remove non-matching documents
{ "$redact": {
"$cond": {
"if": {
"$let": {
"vars": {
"min_value": {
"$min": {
"$map": {
"input": "$options",
"as": "option",
"in": "$$option.value"
}
}
}
},
"in": { "$and": [
{ "$gte": [ "$$min_value", minValue ] },
{ "$lt": [ "$$min_value", maxValue ] }
]}
}
},
"then": "$$KEEP",
"else": "$$PRUNE"
}
}},
// Optionally return the min_value as a field
{ "$project": {
"min_value": {
"$min": {
"$map": {
"input": "$options",
"as": "option",
"in": "$$option.value"
}
}
}
}}
])
The basic case is to get the "minimum" value from the array ( done inside of $let since we want to use the result "twice" in logical conditions, which helps us not repeat ourselves ). The first step is to extract the "value" data from the "options" array, which is done using $map.
The output of $map is an array with just those values, so this is supplied as the argument to $min, which then returns the minimum value for that array.
Using $redact is sort of like a $match pipeline stage with the difference that rather than needing a field to be "present" in the document being examined, you instead just form a logical condition with calculations.
In this case the condition is $and where "both" the logical forms of $gte and $lt return true against the calculated value ( from $let as "$$min_value" ).
The $redact stage then has the special arguments to apply to $$KEEP the document when the condition is true or $$PRUNE the document from results when it is false.
It's all very much like doing $project and then $match to actually project the value into the document before filtering in another stage, but all done in one stage. Of course you might actually want to $project the resulting field in what you return, but it generally cuts the workload if you remove non-matched documents "first" using $redact instead.
Updating Documents
Of course I think the best option is to actually keep the "min_value" field in the document rather than work it out at run-time. So this is a very simple thing to do when adding to or altering array items during update.
For this there is the $min "update" operator. Use it when appending with $push:
db.collection.update(
  { "_id": id },
  {
    "$push": { "options": { "source": "a", "value": 9 } },
    "$min": { "min_value": 9 }
  }
)
Or when updating a value of an element:
db.collection.update(
  { "_id": id, "options.source": "a" },
  {
    "$set": { "options.$.value": 9 },
    "$min": { "min_value": 9 }
  }
)
If the current "min_value" in the document is greater than the argument in $min, or the key does not yet exist, then the value given will be written. If it is less than or equal to the argument, the existing value stays in place since it is already the smaller value.
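A quick illustration of that behaviour (hypothetical _id and values):
// assuming a document { "_id": 1, "min_value": 10 } already exists
db.collection.update({ "_id": 1 }, { "$min": { "min_value": 9 } })  // min_value becomes 9
db.collection.update({ "_id": 1 }, { "$min": { "min_value": 12 } }) // min_value stays 9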
You can even set all your existing data with a simple "bulk" operations update:
var ops = [];
db.collection.find({ "min_value": { "$exists": false } }).forEach(function(doc) {
// Queue operations
ops.push({
"updateOne": {
"filter": { "_id": doc._id },
"update": {
"$min": {
"min_value": Math.min.apply(
null,
doc.options.map(function(option) {
return option.value
})
)
}
}
}
});
// Write once in 1000 documents
if ( ops.length == 1000 ) {
db.collection.bulkWrite(ops);
ops = [];
}
});
// Clear any remaining operations
if ( ops.length > 0 )
db.collection.bulkWrite(ops);
Then with a field in place, it is just a simple range selection:
db.collection.find({
"min_value": {
"$gte": minValue, "$lt": maxValue
}
})
So it really should be in your best interests to keep a field ( or fields if you regularly need different conditions ) in the document since that provides the most efficient query.
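If you do maintain the field, a plain ascending index on it lets that range query be served from the index (a one-line sketch):
db.collection.createIndex({ "min_value": 1 })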
Of course, the new functions of aggregation $min along with $map also make this viable to use without a field, if you prefer more dynamic conditions.

How to concatenate all values and find specific substring in Mongodb?

I have json document like this:
{
"A": [
{
"C": "abc",
"D": "de"
},
{
"C": "fg",
"D": "hi"
}
]
}
I would check whether "A" contains string ef or not.
First concatenate all values into abcdefghi, then search for ef.
In XML, XPATH it would be something like:
//A[contains(., 'ef')]
Is there any similar query in Mongodb?
All options are pretty horrible for this type of search, but there are a few approaches you can take. Please note though that the end case here is likely the best solution, but I present the options in order to illustrate the problem.
If your keys in the array "A" are consistently defined and always contained an array, you would be searching like this:
db.collection.aggregate([
// Filter the documents containing your parts
{ "$match": {
"$and": [
{ "$or": [
{ "A.C": /e/ },
{ "A.D": /e/ }
]},
{"$or": [
{ "A.C": /f/ },
{ "A.D": /f/ }
]}
]
}},
// Keep the original form and a copy of the array
{ "$project": {
"_id": {
"_id": "$_id",
"A": "$A"
},
"A": 1
}},
// Unwind the array
{ "$unwind": "$A" },
// Join the two fields and push to a single array
{ "$group": {
"_id": "$_id",
"joined": { "$push": {
"$concat": [ "$A.C", "$A.D" ]
}}
}},
// Copy the array
{ "$project": {
"C": "$joined",
"D": "$joined"
}},
// Unwind both arrays
{ "$unwind": "$C" },
{ "$unwind": "$D" },
// Join the copies and test if they are the same
{ "$project": {
"joined": { "$concat": [ "$C", "$D" ] },
"same": { "$eq": [ "$C", "$D" ] },
}},
// Discard the "same" elements and search for the required string
{ "$match": {
"same": false,
"joined": { "$regex": "ef" }
}},
// Project the original form of the matching documents
{ "$project": {
"_id": "$_id._id",
"A": "$_id.A"
}}
])
So apart from the horrible $regex matching, there are a few hoops to jump through in order to get the fields "joined" so you can again search for the string in sequence. Also note the reverse joining that is possible here, which could produce a false positive. Currently there would be no simple way to avoid that reverse join or otherwise filter it, so there is that to consider.
Another approach is to basically run everything through arbitrary JavaScript. The mapReduce method can be your vehicle for this. Here you can be a bit looser with the types of data that can be contained in "A" and try to tie in some more conditional matching to attempt to reduce the set of documents you are working on:
db.collection.mapReduce(
function () {
var joined = "";
if ( Object.prototype.toString.call( this.A ) === '[object Array]' ) {
this.A.forEach(function(doc) {
for ( var k in doc ) {
joined += doc[k];
}
});
} else {
joined = this.A; // presuming this is just a string
}
var id = this._id;
delete this["_id"];
if ( joined.match(/ef/) )
emit( id, this );
},
function(){}, // will not reduce
{
"query": {
"$or": [
{ "A": /ef/ },
{ "$and": [
{ "$or": [
{ "A.C": /e/ },
{ "A.D": /e/ }
]},
{"$or": [
{ "A.C": /f/ },
{ "A.D": /f/ }
]}
] }
]
},
"out": { "inline": 1 }
}
);
So you can use that with whatever arbitrary logic to search the contained objects. This one just differentiates between "arrays" and otherwise presumes a string, allowing the additional part of the query to just search for the matching "string" element first, which is a "short circuit" evaluation.
But really at the end of the day, the best approach is to simply have the data present in your document, and you would have to maintain this yourself as you update the document contents:
{
"A": [
{
"C": "abc",
"D": "de"
},
{
"C": "fg",
"D": "hi"
}
],
"search": "abcdefghi"
}
So that is still going to invoke a horrible usage of $regex type queries but at least this avoids ( or rather shifts to writing the document ) the overhead of "joining" the elements in order to effect the search for your desired string.
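A minimal client-side sketch of maintaining that field (the _id is hypothetical; run it whenever "A" changes):
var doc = db.collection.findOne({ "_id": id });
// join every value of every sub-document in "A", in array order
var joined = doc.A.map(function(sub) {
  return Object.keys(sub).map(function(k) { return sub[k]; }).join("");
}).join("");
db.collection.update({ "_id": doc._id }, { "$set": { "search": joined } });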
Where this eventually leads is that a "full blown" text search solution, and that means an external one at this time as opposed to the text search facilities in MongoDB, is probably going to be your best performance option.
Either use the "pre-stored" approach in creating your "joined" field, or, where supported ( Solr is one solution that can do this ), have a "computed field" in the text index that is created when indexing document content.
At any rate, those are the approaches and the general point of the problem. This is not XPath searching, nor is there some "XPath-like" view of an entire collection in this sense, so you are best suited to structuring your data towards the methods that are going to give you the best performance.
With all of that said, your sample here is a fairly contrived example, and if you had an actual use case for something "like" this, then that actual case may make a very interesting question indeed. Actual cases generally have different solutions than the contrived ones. But now you have something to consider.