How to get total record fields from mongodb by using $group?

I want to get records with all fields using $group in MongoDB, i.e. the equivalent of SELECT * FROM users GROUP BY state. Can anyone help me?

AFAIK there is no way to return whole objects from a group query. You can use the $addToSet operator to collect fields into arrays to return. Example code is shown below. You can add all the fields you need via $addToSet; the query will return arrays in the response, and you will have to reassemble your records from those array(s).
db.users.aggregate([
    { $group : { _id : "$state", id : { $addToSet : "$_id" }, field1 : { $addToSet : "$field1" } } }
]);
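On MongoDB 2.6 or newer there is a simpler variant (a sketch, not part of the original answer): push the whole document with the $$ROOT system variable, so each group carries complete records instead of parallel arrays.
// Group complete documents by state using $$ROOT (MongoDB 2.6+).
// Watch the 16MB document limit if the groups get large.
db.users.aggregate([
    { $group: { _id: "$state", docs: { $push: "$$ROOT" } } }
]);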

You can do it with MapReduce instead:
db.runCommand({
    mapreduce: 'tests',
    map: function() {
        emit(this.state, { docs: [this] });
    },
    reduce: function(key, vals) {
        var res = vals.pop();
        vals.forEach(function(val) {
            [].push.apply(res.docs, val.docs);
        });
        return res;
    },
    finalize: function(key, reducedValue) {
        return reducedValue.docs;
    },
    out: { inline: 1 }
})
I'm using a finalize function in my example because MapReduce doesn't support arrays as the reduced value (reduce must return the same shape as the emitted values), so the docs array is unwrapped at the end.
But regardless of the method, you should try to avoid such queries in production. They are fine for rare requests, like analytics, DB migrations or daily scripting, but not for frequent ones.

Related

Find aggregation on arbitrary number of different keys

I have a collection in MongoDB which looks like this:
collection:
doc1
{ field1 : {
    field1_1 : 'val1',
    field1_2 : 'val2',
    field1_3 : 'val3',
    ...
    field1_N : 'valN' } }
doc2
{ field1 : {
    field1_1 : 'val1',
    field1_2 : 'val2',
    field1_3 : 'val3',
    ...
    field1_N : 'valN' } }
I want to compute aggregations (sum, avg, min, max) over val1, val2, val3 ... valN. Is there any way to use Mongo's aggregation feature? The keys are always different, and the aggregation should happen over all the values of field1.
Edited:
The final output should look like:
doc1
{ field1 : {
    sum: sumOf(val1, val2... valN),
    avg: avgOf(val1, val2... valN)
    ... } }
doc2
{ field1 : {
    sum: sumOf(val1, val2... valN),
    avg: avgOf(val1, val2... valN)
    ... } }
You can try Map-Reduce instead of aggregate for this requirement. Map-reduce operations provide some flexibility that is not presently available in the aggregation pipeline.
var mapFunction = function() {
    for (var key in this.field1) {
        // values are stored as strings, hence parseInt
        emit(this._id, { sum: parseInt(this.field1[key]), count: 1 });
    }
};
var reduceFunction = function(key, values) {
    // reduce may run several times per key, so return the same shape it receives
    var res = { sum: 0, count: 0 };
    values.forEach(function(v) {
        res.sum += v.sum;
        res.count += v.count;
    });
    return res;
};
var finalizeFunction = function(key, reduced) {
    return { sum: reduced.sum, avg: reduced.sum / reduced.count };
};
db.getCollection('collectionName').mapReduce(mapFunction, reduceFunction,
    { out: { inline: 1 }, finalize: finalizeFunction });
As far as I know, if your document structure were like this:
field1 : [
    { field1 : 'val1' },
    { field1 : 'val2' },
    { field1 : 'val3' },
    ...
    { field1 : 'valN' } ]
then you could solve it easily using aggregate, so for your structure mapReduce may be better.
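For completeness, with that array shape the aggregation becomes a single $project, since $sum and $avg accept array expressions there from MongoDB 3.2 onward. A sketch, assuming the values are numeric rather than strings:
db.getCollection('collectionName').aggregate([
    { $project: {
        // "$field1.field1" resolves to the array of all the values
        sum: { $sum: "$field1.field1" },
        avg: { $avg: "$field1.field1" }
    } }
]);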
I'm not a MongoDB ninja by any means, but I have a solution for your question. It's not the most efficient one, but if you are royally stuck, you can use it.
The code below creates a projection of the average of the fields you want to aggregate on. The downside is that you have to specify each field, rather than iterating through all fields and aggregating over them. You can definitely do such iteration using JS, but it would take a bit more tweaking than the code I've provided below.
Also, I've assumed that field1.field1_1 is an integer and not a string value.
db.random.aggregate([
    {
        $project: {
            _id: "$field1",
            avgAmount: { $avg: ["$field1.field1_1", "$field1.field1_2", "$field1.field1_3"] }
        }
    }
])
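As an aside, and not part of the original answer: on MongoDB 3.4.4+ the $objectToArray operator can enumerate the keys of field1 for you, avoiding the hand-written field list. A hedged sketch, again assuming numeric values:
db.random.aggregate([
    { $project: {
        avgAmount: { $avg: {
            // turn { field1_1: v1, ... } into [ { k, v }, ... ] and keep only the values
            $map: { input: { $objectToArray: "$field1" }, as: "kv", in: "$$kv.v" }
        } }
    } }
]);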
Best of luck anyway #1love.
You can try $sum for sums and $multiply for multiplication. You can find all the aggregation operators available for MongoDB at the following link:
https://docs.mongodb.com/manual/reference/operator/aggregation/sum/
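For instance, a minimal $sum accumulator (the field name "amount" is a made-up example, not from the question):
db.myCollection.aggregate([
    // total an assumed numeric field "amount" across the whole collection
    { $group: { _id: null, total: { $sum: "$amount" } } }
]);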
Hope this helps.

mongodb delete nested object without knowledge of object nodes

For the document below, I am trying to delete the node which contains id = 123:
{
    '_id': "1234567890",
    "image" : {
        "unknown-node-1" : {
            "id" : 123
        },
        "unknown-node-2" : {
            "id" : 124
        }
    }
}
The result should be as below:
{
    '_id': "1234567890",
    "image" : {
        "unknown-node-2" : {
            "id" : 124
        }
    }
}
The query below achieves the result, but I have to know "unknown-node-1" in advance. How can I achieve the result without pre-knowledge of the node, when the only info I have is image.*.id = 123 (* means an unknown node)?
Is this possible in Mongo, or should I do this lookup in my app code?
db.test.update({'_id': "1234567890"}, {$unset: {'image.unknown-node-1': ""}})
Faiz,
There is no operator to help match and project a single key/value pair without knowing the key. You'll have to write post-processing code that scans each of the documents to find the node with the id and then performs your removal.
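For illustration, a minimal shell sketch of that post-processing, assuming the collection is named "test" as in the question:
// Scan the image sub-document for the node whose id is 123,
// then build a dynamic $unset path for it.
db.test.find({ "_id": "1234567890" }).forEach(function(doc) {
    for (var node in doc.image) {
        if (doc.image[node].id === 123) {
            var update = { "$unset": {} };
            update["$unset"]["image." + node] = "";
            db.test.update({ "_id": doc._id }, update);
        }
    }
});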
If you have the liberty of changing your schema, you'll have more flexibility. With a document design like this:
{
    '_id': "1234567890",
    "image" : [
        { "id" : 123, "name" : "unknown-node-1" },
        { "id" : 124, "name" : "unknown-node-2" },
        { "id" : 125, "name" : "unknown-node-3" }
    ]
}
You could remove documents from the array like this:
db.collectionName.update(
    { '_id': "1234567890" },
    { $pull: { image: { id: 123 } } }
)
This would result in:
{
    '_id': "1234567890",
    "image" : [
        { "id" : 124, "name" : "unknown-node-2" },
        { "id" : 125, "name" : "unknown-node-3" }
    ]
}
With your current schema, you will need a mechanism to get a list of the dynamic keys so that you can assemble the query before doing the update, and one way of doing this is with MapReduce. Take for instance the following map-reduce operation, which will populate a separate collection with all the keys as its _id values:
mr = db.runCommand({
    "mapreduce": "test",
    "map": function() {
        for (var key in this.image) { emit(key, null); }
    },
    "reduce": function(key, stuff) { return null; },
    "out": "test_keys"
})
To get a list of all the dynamic keys, run distinct on the resulting collection:
> db[mr.result].distinct("_id")
[ "unknown-node-1", "unknown-node-2" ]
Now, given the list above, you can assemble your query by creating an object whose properties are set within a loop. Normally, if you knew the keys beforehand, your query would have this structure:
var query = {
        "image.unknown-node-1.id": 123
    },
    update = {
        "$unset": {
            "image.unknown-node-1": ""
        }
    };
db.test.update(query, update);
But since the nodes are dynamic, you will have to iterate the list returned from the mapReduce operation, and for each element create the query and update parameters as above to update the collection. The list could be huge, so for maximum efficiency, and if your MongoDB server is 2.6 or newer, it would be better to take advantage of the write commands Bulk API, which allows the execution of bulk update operations. These are simply abstractions on top of the server that make it easy to build bulk operations, and thus get performance gains with your update over large collections. These bulk operations come mainly in two flavours:
Ordered bulk operations. These operations execute all the operations in order and error out on the first write error.
Unordered bulk operations. These operations execute all the operations in parallel and aggregate all the errors. Unordered bulk operations do not guarantee order of execution.
Note that for servers older than 2.6 the API will down-convert the operations. However, it's not possible to down-convert 100%, so there might be some edge cases where it cannot correctly report the right numbers.
In your case, you could implement the Bulk API update operation like this:
mr = db.runCommand({
    "mapreduce": "test",
    "map": function() {
        for (var key in this.image) { emit(key, null); }
    },
    "reduce": function(key, stuff) { return null; },
    "out": "test_keys"
})
// Get the dynamic keys
var dynamic_keys = db[mr.result].distinct("_id");
// Get the collection and bulk api artefacts
var bulk = db.test.initializeUnorderedBulkOp(), // Initialize the Unordered Batch
    counter = 0;
// Loop the dynamic keys, queuing an update for each one
dynamic_keys.forEach(function(key) {
    // Create the query and update documents
    var query = {},
        update = { "$unset": {} };
    query["image." + key + ".id"] = 123;
    update["$unset"]["image." + key] = "";
    bulk.find(query).update(update);
    counter++;
    if (counter % 100 == 0) {
        bulk.execute();
        // re-initialise the batch operation
        bulk = db.test.initializeUnorderedBulkOp();
    }
});
if (counter % 100 != 0) { bulk.execute(); }

Rename a sub-document field within an Array

Considering the document below, how can I rename 'techId1' to 'techId'? I've tried different ways and can't get it to work.
{
    "_id" : ObjectId("55840f49e0b"),
    "__v" : 0,
    "accessCard" : "123456789",
    "checkouts" : [
        {
            "user" : ObjectId("5571e7619f"),
            "_id" : ObjectId("55840f49e0bf"),
            "date" : ISODate("2015-06-19T12:45:52.339Z"),
            "techId1" : ObjectId("553d9cbcaf")
        },
        {
            "user" : ObjectId("5571e7619f15"),
            "_id" : ObjectId("55880e8ee0bf"),
            "date" : ISODate("2015-06-22T13:01:51.672Z"),
            "techId1" : ObjectId("55b7db39989")
        }
    ],
    "created" : ISODate("2015-06-19T12:47:05.422Z"),
    "date" : ISODate("2015-06-19T12:45:52.339Z"),
    "location" : ObjectId("55743c8ddbda"),
    "model" : "model1",
    "order" : ObjectId("55840f49e0bf"),
    "rid" : "987654321",
    "serialNumber" : "AHSJSHSKSK",
    "user" : ObjectId("5571e7619f1"),
    "techId" : ObjectId("55b7db399")
}
In the mongo console I tried the following, which returns ok, but nothing is actually updated.
collection.update({"checkouts._id":ObjectId("55840f49e0b")},{ $rename: { "techId1": "techId" } });
I also tried this, which gives me the error "cannot use the part (checkouts of checkouts.techId1) to traverse the element":
collection.update({"checkouts._id":ObjectId("55856609e0b")},{ $rename: { "checkouts.techId1": "checkouts.techId" } })
In mongoose I have tried the following.
collection.findByIdAndUpdate(id, { $rename: { "checkouts.techId1": "checkouts.techId" } }, function (err, data) {});
and
collection.update({'checkouts._id': n1._id}, { $rename: { "checkouts.$.techId1": "checkouts.$.techId" } }, function (err, data) {});
Thanks in advance.
You were close at the end, but there are a few things missing. You cannot $rename when using the positional operator; instead you need to $set the new name and $unset the old one. But there is another restriction here: both fields share "checkouts" as a parent path, so you cannot do both in the same update statement.
The other core part of your question is "traverse the element", and that is the one thing you cannot do when updating "all" of the array elements at once. Well, not safely and without possibly overwriting new data coming in, anyway.
What you need to do is "iterate" each document and similarly iterate each array member in order to "safely" update. You cannot really iterate just the document and "save" the whole array back with alterations. Certainly not in the case where anything else is actively using the data.
I personally would run this sort of operation in the MongoDB shell if you can, as it is a "one off" ( hopefully ) thing and this saves the overhead of writing other API code. Also we're using the Bulk Operations API here to make this as efficient as possible. With mongoose it takes a bit more digging to implement, but still can be done. But here is the shell listing:
var bulk = db.collection.initializeOrderedBulkOp(),
    count = 0;
db.collection.find({ "checkouts.techId1": { "$exists": true } }).forEach(function(doc) {
    doc.checkouts.forEach(function(checkout) {
        if ( checkout.hasOwnProperty("techId1") ) {
            bulk.find({ "_id": doc._id, "checkouts._id": checkout._id }).updateOne({
                "$set": { "checkouts.$.techId": checkout.techId1 }
            });
            bulk.find({ "_id": doc._id, "checkouts._id": checkout._id }).updateOne({
                "$unset": { "checkouts.$.techId1": 1 }
            });
            count += 2;
            if ( count % 500 == 0 ) {
                bulk.execute();
                bulk = db.collection.initializeOrderedBulkOp();
            }
        }
    });
});
if ( count % 500 !== 0 )
    bulk.execute();
Since the $set and $unset operations happen in pairs, we keep the batch size at 500 operations per .execute() call, just to keep memory usage on the client down.
The loop simply looks for documents where the field to be renamed "exists" and then iterates each array element of each document and commits the two changes. As Bulk Operations, these are not sent to the server until the .execute() is called, where also a single response is returned for each call. This saves a lot of traffic.
If you insist on coding with mongoose, be aware that a .collection accessor is required to get at the Bulk API methods from the core driver, like this:
var bulk = Model.collection.initializeOrderedBulkOp();
And the only thing that sends anything to the server is the .execute() method, so this is your only execution callback:
bulk.execute(function(err, response) {
    // code body and async iterator callback here
});
And use async flow control instead of .forEach() such as async.each.
Also, if you do that, then be aware that as a raw driver method not governed by mongoose, you do not get the same database connection awareness as you do with mongoose methods. Unless you know for sure that the database connection is already established, it is safer to put this code within an event callback for the server connection:
mongoose.connection.on("open", function() {
    // body of code
});
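Putting those pieces together, a rough mongoose sketch might look like the following. The model name "Equipment" is hypothetical, and this is a sketch under those assumptions rather than tested code:
var mongoose = require('mongoose');
mongoose.connect('mongodb://localhost/test');
// Hypothetical schemaless model bound to the collection in question
var Equipment = mongoose.model('Equipment', new mongoose.Schema({}, { strict: false }));

mongoose.connection.on("open", function() {
    var bulk = Equipment.collection.initializeOrderedBulkOp();
    Equipment.find({ "checkouts.techId1": { "$exists": true } }).lean().exec(function(err, docs) {
        if (err) throw err;
        if (!docs.length) return; // nothing to rename; execute() would error on an empty batch
        docs.forEach(function(doc) {
            doc.checkouts.forEach(function(checkout) {
                if (checkout.hasOwnProperty("techId1")) {
                    bulk.find({ "_id": doc._id, "checkouts._id": checkout._id })
                        .updateOne({ "$set": { "checkouts.$.techId": checkout.techId1 } });
                    bulk.find({ "_id": doc._id, "checkouts._id": checkout._id })
                        .updateOne({ "$unset": { "checkouts.$.techId1": 1 } });
                }
            });
        });
        bulk.execute(function(err, response) {
            if (err) throw err;
            console.log("modified:", response.nModified);
        });
    });
});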
But otherwise those are the only real ( apart from call syntax ) alterations you really need.
This worked for me. I created this query to perform the procedure and I share it (although I know it is not the most optimized way):
First, make an aggregation that (1) $matches the documents that have the checkouts array field with techId1 as one of the keys of each sub-document, (2) $unwinds the checkouts field (deconstructing the array field from the input documents to output a document for each element), (3) adds the techId field (with $addFields), (4) removes the old techId1 field (with a $project exclusion), (5) $groups the documents by _id so the checkout sub-documents are reassembled, and (6) writes the result of this aggregation to a temporal collection (with $out).
const collection = 'yourCollection'
db[collection].aggregate([
    {
        $match: {
            'checkouts.techId1': { '$exists': true }
        }
    },
    {
        $unwind: {
            path: '$checkouts'
        }
    },
    {
        $addFields: {
            'checkouts.techId': '$checkouts.techId1'
        }
    },
    {
        $project: {
            'checkouts.techId1': 0
        }
    },
    {
        $group: {
            '_id': '$_id',
            // push the whole sub-document back, so the other checkout fields (user, date, ...) survive
            'checkouts': { $push: '$checkouts' }
        }
    },
    {
        $out: 'temporal'
    }
])
Then you can run another aggregation on this temporal collection to $merge the documents with the modified checkouts field back into your original collection (note that $merge requires MongoDB 4.2 or newer):
db.temporal.aggregate([
    {
        $merge: {
            into: collection,
            on: "_id",
            whenMatched: "merge",
            whenNotMatched: "insert"
        }
    }
])

MongoDB, MapReduce and sorting

I might be a bit in over my head on this as I'm still learning the ins and outs of MongoDB, but here goes.
Right now I'm working on a tool to search/filter through a dataset, sort it by an arbitrary datapoint (e.g. popularity) and then group it by an id. The only way I can see to do this is through Mongo's MapReduce functionality.
I can't use .group() because I'm working with more than 10,000 keys and I also need to be able to sort the dataset.
My MapReduce code is working just fine, except for one thing: sorting. Sorting just doesn't want to work at all.
db.runCommand({
    'mapreduce': 'products',
    'map': function() {
        emit({
            product_id: this.product_id,
            popularity: this.popularity
        }, 1);
    },
    'reduce': function(key, values) {
        var sum = 0;
        values.forEach(function(v) {
            sum += v;
        });
        return sum;
    },
    'query': { category_id: 20 },
    'out': { inline: 1 },
    'sort': { popularity: -1 }
});
I already have a descending index on the popularity datapoint, so it's definitely not failing for lack of one:
{
    "v" : 1,
    "key" : { "popularity" : -1 },
    "ns" : "app.products",
    "name" : "popularity_-1"
}
I just cannot figure out why it doesn't want to sort.
I can't just output the result set to another collection instead of inlining it and then run a .find().sort({popularity: -1}) on that, because of the way this feature is going to work.
First of all, Mongo's map/reduce is not designed to be used as a query tool (as it is in CouchDB); it is designed for running background tasks. I use it at work to analyze traffic data.
What you are doing wrong, however, is applying the sort() to your input; this is useless because after the map() stage is done the intermediate documents are sorted by their keys. Because your key is a document, they are being sorted by product_id, popularity.
This is how I generated my dataset
function generate_dummy_data() {
    for (var i = 2; i < 1000000; i++) {
        db.foobar.save({
            _id: i,
            category_id: parseInt(Math.random() * 30),
            popularity: parseInt(Math.random() * 50)
        })
    }
}
And this my map/reduce task:
var data = db.runCommand({
    'mapreduce': 'foobar',
    'map': function() {
        emit({
            sorting: this.popularity * -1,
            product_id: this._id,
            popularity: this.popularity
        }, 1);
    },
    'reduce': function(key, values) {
        var sum = 0;
        values.forEach(function(v) {
            sum += v;
        });
        return sum;
    },
    'query': { category_id: 20 },
    'out': { inline: 1 }
});
And this is the end result (very long to paste it here):
http://cesarodas.com/results.txt
This works because now we're sorting by sorting, product_id, popularity. You can play with the sorting however you like; just remember that the final order is by key, regardless of how your input is sorted.
Anyway, as I said before, you should avoid doing queries with Map/Reduce; it was designed for background processing. If I were you, I would design my data in such a way that I could access it with simple queries. There is always a trade-off: in this case, complex inserts/updates in exchange for simple queries (that's how I see MongoDB).
As noted in discussion on the original question:
Map/Reduce with inline output currently cannot use an explicit sort key (see SERVER-3973). Possible workarounds include: relying on the emitted key order (see @crodas's answer); outputting to a collection and querying that collection with a sort order; or sorting the results in your application using something like usort() (a sketch of the last option follows below).
OP's preference is for inline results rather than creating/deleting temporary collections.
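Given that preference, the third workaround reduces to a small client-side sort of the inline output. A sketch in shell JavaScript (the result shape is what runCommand returns for inline map/reduce):
var data = db.runCommand({ /* the map/reduce command from the question */ });
// inline output arrives as { results: [ { _id: <emitted key>, value: <reduced> }, ... ] }
data.results.sort(function(a, b) {
    return b._id.popularity - a._id.popularity; // descending popularity
});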
The Aggregation Framework in MongoDB 2.2 (currently a production release candidate) would provide a suitable solution.
Here's an example of a similar query to the original Map/Reduce, but instead using the Aggregation Framework:
db.products.aggregate(
    { $match: { category_id: 20 } },
    { $group : {
        _id : "$product_id",
        'popularity' : { $sum : "$popularity" }
    } },
    { $sort: { 'popularity': -1 } }
)
.. and sample output:
{
    "result" : [
        {
            "_id" : 50,
            "popularity" : 139
        },
        {
            "_id" : 150,
            "popularity" : 99
        },
        {
            "_id" : 123,
            "popularity" : 55
        }
    ],
    "ok" : 1
}

MongoDB select count(distinct x) on an indexed column - count unique results for large data sets

I have gone through several articles and examples, and have yet to find an efficient way to do this SQL query in MongoDB (where there are millions of documents):
First attempt
(e.g. from this almost duplicate question - Mongo equivalent of SQL's SELECT DISTINCT?)
db.myCollection.distinct("myIndexedNonUniqueField").length
Obviously I got this error, as my dataset is huge:
Thu Aug 02 12:55:24 uncaught exception: distinct failed: {
"errmsg" : "exception: distinct too big, 16mb cap",
"code" : 10044,
"ok" : 0
}
Second attempt
I decided to try and do a group
db.myCollection.group({
    key: { myIndexedNonUniqueField: 1 },
    initial: { count: 0 },
    reduce: function(obj, prev) { prev.count++; }
});
But I got this error message instead:
exception: group() can't handle more than 20000 unique keys
Third attempt
I haven't tried yet but there are several suggestions that involve mapReduce
e.g.
this one how to do distinct and group in mongodb? (not accepted, answer author / OP didn't test it)
this one MongoDB group by Functionalities (seems similar to Second Attempt)
this one http://blog.emmettshear.com/post/2010/02/12/Counting-Uniques-With-MongoDB
this one https://groups.google.com/forum/?fromgroups#!topic/mongodb-user/trDn3jJjqtE
this one http://cookbook.mongodb.org/patterns/unique_items_map_reduce/
Also
It seems there is a pull request on GitHub fixing the .distinct method to mention it should only return a count, but it's still open: https://github.com/mongodb/mongo/pull/34
But at this point I thought it worth asking here: what is the latest on this subject? Should I move to SQL or another NoSQL DB for distinct counts, or is there an efficient way?
Update:
This comment on the MongoDB official docs is not encouraging, is this accurate?
http://www.mongodb.org/display/DOCS/Aggregation#comment-430445808
Update2:
Seems the new Aggregation Framework answers the above comment... (MongoDB 2.1/2.2 and above, development preview available, not for production)
http://docs.mongodb.org/manual/applications/aggregation/
1) The easiest way to do this is via the aggregation framework. This takes two $group stages: the first groups by the distinct values, the second counts all of the distinct values.
pipeline = [
    { $group: { _id: "$myIndexedNonUniqueField" } },
    { $group: { _id: 1, count: { $sum: 1 } } }
];
//
// Run the aggregation command
//
R = db.runCommand({
    "aggregate": "myCollection",
    "pipeline": pipeline
});
printjson(R);
2) If you want to do this with Map/Reduce, you can. This is also a two-phase process: in the first phase we build a new collection with a list of every distinct value for the key; in the second we do a count() on the new collection.
var SOURCE = db.myCollection;
var DEST = db.distinct;
DEST.drop();
map = function() {
    emit(this.myIndexedNonUniqueField, { count: 1 });
}
reduce = function(key, values) {
    var count = 0;
    values.forEach(function(v) {
        count += v['count']; // count each distinct value for lagniappe
    });
    return { count: count };
};
//
// run map/reduce
//
res = SOURCE.mapReduce(map, reduce, {
    out: 'distinct',
    verbose: true
});
print("distinct count= " + res.counts.output);
print("distinct count=", DEST.count());
Note that you cannot return the result of the map/reduce inline, because that will potentially overrun the 16MB document size limit. You can save the calculation in a collection and then count() the size of the collection, or you can get the number of results from the return value of mapReduce().
db.myCollection.aggregate(
    { $group: { _id: "$myIndexedNonUniqueField" } },
    { $group: { _id: 1, count: { $sum: 1 } } }
);
straight to result:
db.myCollection.aggregate(
    { $group: { _id: "$myIndexedNonUniqueField" } },
    { $group: { _id: 1, count: { $sum: 1 } } }
).result[0].count;
The following solution worked for me:
db.test.distinct('user');
[ "alex", "England", "France", "Australia" ]
db.countries.distinct('country').length
4