MongoDB nested grouping - mongodb

I have the following MongoDB data model:
{
    "_id" : ObjectId("53725814740fd6d2ee0ca2bb"),
    "date" : "2014-01-01",
    "establishmentId" : 1,
    "products" : [
        {
            "productId" : 1,
            "price" : 7.03,
            "someOtherInfo" : 325,
            "somethingElse" : 6878
        },
        {
            "productId" : 2,
            "price" : 4.6,
            "someOtherInfo" : 243,
            "somethingElse" : 1757
        },
        {
            "productId" : 3,
            "price" : 2.14,
            "someOtherInfo" : 610,
            "somethingElse" : 5435
        },
        {
            "productId" : 4,
            "price" : 1.45,
            "someOtherInfo" : 627,
            "somethingElse" : 5762
        },
        {
            "productId" : 5,
            "price" : 3.9,
            "someOtherInfo" : 989,
            "somethingElse" : 3752
        }
    ]
}
What is the fastest way to get the average price across all establishments? Is there a better data model to achieve this?

An aggregation operation should handle this well. I'd suggest looking into the $unwind stage.
Something along these lines should work (just as an example):
db.collection.aggregate([
    { $match: { /* query parameters */ } },
    { $unwind: "$products" },
    { $group: {
        _id: null, // or the field(s) to group by before averaging
        avgPrice: { $avg: "$products.price" }
    }}
]);
An aggregation built in this style should produce a JSON object that has the data you want.

More directly, the answer is:
db.collection.aggregate([
    { "$unwind": "$products" },
    { "$group": {
        "_id": null,
        "avgprice": { "$avg": "$products.price" }
    }}
])
The usage of the aggregation framework here is to first $unwind the array, which "de-normalizes" the array content into separate documents.
Then, in the $group stage, you pass null to _id, which means "group everything", and pass $products.price (note the dot notation) to the $avg operator to return the average value across all of the sub-document entries in all of the documents in the collection.
See the full operator reference for more information.
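To make the mechanics concrete, here is a plain-JavaScript sketch of what the $unwind + $group stages compute, run against the sample document's prices. This is only an illustration of the logic, not how the server implements it:

```javascript
// Sample data shaped like the question's document.
const docs = [
  { establishmentId: 1, products: [
    { productId: 1, price: 7.03 },
    { productId: 2, price: 4.6 },
    { productId: 3, price: 2.14 },
    { productId: 4, price: 1.45 },
    { productId: 5, price: 3.9 }
  ]}
];

// $unwind: one output document per array element, parent fields duplicated.
const unwound = docs.flatMap(d =>
  d.products.map(p => ({ establishmentId: d.establishmentId, products: p }))
);

// $group with _id: null and $avg: a single overall average.
const avgPrice =
  unwound.reduce((sum, d) => sum + d.products.price, 0) / unwound.length;

console.log(avgPrice); // 3.824
```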

The best solution I found was:
db.collection.aggregate([
    { "$match": {
        "date": { "$gte": "2014-01-01", "$lte": "2014-01-31" },
        "establishmentId": { "$in": [1, 2, 3, 4, 5, 6] }
    }},
    { "$unwind": "$products" },
    { "$group": {
        "_id": { "date": "$date", "product": "$products.productId" },
        "avgprice": { "$avg": "$products.price" }
    }}
])
Something else I found: it is much better to $match first and then $unwind, so there are fewer documents to unwind, which makes the overall pipeline faster.
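The ordering point can be illustrated in plain JavaScript (hypothetical data; only the counts matter): filtering first means the flattening step touches fewer array elements.

```javascript
// Hypothetical documents, each with a two-element products array.
const docs = [
  { date: "2014-01-15", establishmentId: 1, products: [{ price: 1 }, { price: 2 }] },
  { date: "2014-02-10", establishmentId: 2, products: [{ price: 3 }, { price: 4 }] },
  { date: "2014-01-20", establishmentId: 3, products: [{ price: 5 }, { price: 6 }] }
];

// $match first: only January documents survive.
const matched = docs.filter(d => d.date >= "2014-01-01" && d.date <= "2014-01-31");

// $unwind then only flattens the matched documents: 4 elements instead of 6.
const unwound = matched.flatMap(d => d.products);

console.log(unwound.length); // 4
```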

Related

how to find duplicate records in mongo db query to use

I have the collection below and need to find duplicate records in Mongo. Below is one sample document; the collection has more than 10,000 records.
/* 1 */
{
"_id" : 1814099,
"eventId" : "LAS012",
"eventName" : "CustomerTab",
"timeStamp" : ISODate("2018-12-31T20:09:09.820Z"),
"eventMethod" : "click",
"resourceName" : "CustomerTab",
"targetType" : "",
"resourseUrl" : "",
"operationName" : "",
"functionStatus" : "",
"results" : "",
"pageId" : "CustomerPage",
"ban" : "290824901",
"jobId" : "87377713",
"wrid" : "87377713",
"jobType" : "IBJ7FXXS",
"Uid" : "sc343x",
"techRegion" : "W",
"mgmtReportingFunction" : "N",
"recordPublishIndicator" : "Y",
"__v" : 0
}
We can first find the unique ids using
const data = await db.collection.aggregate([
    {
        $group: {
            _id: "$eventId",
            id: { "$first": "$_id" }
        }
    },
    {
        $group: {
            _id: null,
            uniqueIds: { $push: "$id" }
        }
    }
]).toArray();
And then we can make another query, which will find all the duplicate documents:
db.collection.find({_id: {$nin: data[0].uniqueIds}})
This will find all the documents that are redundant.
Another way
To find the event ids which are duplicated
db.collection.aggregate([
    { "$group": { "_id": "$eventId", "count": { "$sum": 1 } } },
    { "$match": { "_id": { "$ne": null }, "count": { "$gt": 1 } } }
])
To get only the duplicates, you need the groups that have a count of more than one, so we use the $match pipeline stage to filter the results. Within $match, we look at the count field and keep counts greater than one, using the $gt ("greater than") operator with the value 1. That looks like the following:
db.collection.aggregate([
{$group: {
_id: {eventId: "$eventId"},
uniqueIds: {$addToSet: "$_id"},
count: {$sum: 1}
}
},
{$match: {
count: {"$gt": 1}
}
}
]);
I assume here that eventId is meant to be a unique id.
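As a sanity check of the grouping logic, here is a plain-JavaScript sketch of the $group + $match duplicate detection, using hypothetical event data:

```javascript
// Hypothetical events; "LAS012" appears three times.
const events = [
  { _id: 1, eventId: "LAS012" },
  { _id: 2, eventId: "LAS013" },
  { _id: 3, eventId: "LAS012" },
  { _id: 4, eventId: "LAS014" },
  { _id: 5, eventId: "LAS012" }
];

// $group: collect _ids ($addToSet) and count ($sum: 1) per eventId.
const groups = new Map();
for (const e of events) {
  if (!groups.has(e.eventId)) groups.set(e.eventId, { uniqueIds: [], count: 0 });
  const g = groups.get(e.eventId);
  g.uniqueIds.push(e._id);
  g.count += 1;
}

// $match: keep only groups with count > 1.
const duplicated = [...groups.entries()].filter(([, g]) => g.count > 1);

console.log(duplicated); // [["LAS012", { uniqueIds: [1, 3, 5], count: 3 }]]
```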

How to sort sub-documents in the array field?

I'm using the MongoDB shell to fetch some results, ordered. Here's a sampler,
{
"_id" : "32022",
"topics" : [
{
"weight" : 281.58551703724993,
"words" : "some words"
},
{
"weight" : 286.6695125796183,
"words" : "some more words"
},
{
"weight" : 289.8354232846977,
"words" : "wowz even more wordz"
},
{
"weight" : 305.70093587160807,
"words" : "WORDZ"
}]
}
What I want to get is the same structure, but with the "topics" array ordered by weight, descending:
{
"_id" : "32022",
"topics" : [
{
"weight" : 305.70093587160807,
"words" : "WORDZ"
},
{
"weight" : 289.8354232846977,
"words" : "wowz even more wordz"
},
{
"weight" : 286.6695125796183,
"words" : "some more words"
},
{
"weight" : 281.58551703724993,
"words" : "some words"
},
]
}
I managed to get some ordered results, but had no luck grouping them back by the _id field. Is there a way to do this?
MongoDB doesn't provide a way to do this out of the box, but there is a workaround: update your documents using the $sort update modifier to sort the array in place.
db.collection.updateMany({}, {"$push": {"topics": {"$each": [], "$sort": {"weight": -1}}}})
You can still use the .aggregate() method like this:
db.collection.aggregate([
{"$unwind": "$topics"},
{"$sort": {"_id": 1, "topics.weight": -1}},
{"$group": {"_id": "$_id", "topics": {"$push": "$topics"}}}
])
But this is less efficient if all you want is to sort your array, and you shouldn't use aggregation just for that.
You could always do this client side using the .sort or sorted function.
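A client-side sort is a one-liner; here is a sketch in plain JavaScript, using the sample values from the question:

```javascript
// Document shaped like the question's sample.
const doc = {
  _id: "32022",
  topics: [
    { weight: 281.58551703724993, words: "some words" },
    { weight: 286.6695125796183, words: "some more words" },
    { weight: 289.8354232846977, words: "wowz even more wordz" },
    { weight: 305.70093587160807, words: "WORDZ" }
  ]
};

// Sort a copy by weight, descending, so the fetched array is left untouched.
doc.topics = [...doc.topics].sort((a, b) => b.weight - a.weight);

console.log(doc.topics[0].words); // "WORDZ"
```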
If you don't want to update but only get documents, you can use the following query
db.test.aggregate(
[
{$unwind : "$topics"},
{$sort : {"topics.weight":-1}},
{"$group": {"_id": "$_id", "topics": {"$push": "$topics"}}}
]
)
It works for me:
db.getCollection('mycollection').aggregate([
    { $project: { topics: 1 } },
    { $unwind: "$topics" },
    { $sort: { "topics.weight": -1 } },
    { $group: { _id: "$_id", topics: { $push: "$topics" } } }
])

Does mongo provide functionality for deconstructing document arrays for large datasets?

Similar to map/reduce, but in reverse. Does Mongo have a way of reformatting data? I have a collection in the following format:
{
    "token-id" : "LKJ8_lkjsd",
    "data": [
        { "views": 100, "Date": "2015-01-01" },
        { "views": 200, "Date": "2015-01-02" },
        { "views": 300, "Date": "2015-01-03" },
        { "views": 300, "Date": "2015-01-03" }
    ]
}
I would like to process the entire collection into a new format, where every time-series data point is its own document mapped to the ID, hopefully using some built-in Mongo functionality similar to map/reduce. If there isn't any, I'd appreciate a strategy for doing this.
[
    { "token-id" : "LKJ8_lkjsd", "views": 100, "Date" : "2015-01-01" },
    { "token-id" : "LKJ8_lkjsd", "views": 200, "Date" : "2015-01-02" },
    { "token-id" : "LKJ8_lkjsd", "views": 300, "Date" : "2015-01-03" }
]
The aggregate command can return results as a cursor or store the results in a collection, which are not subject to the size limit. The db.collection.aggregate() method returns a cursor and can return result sets of any size.
var result = db.test.aggregate([ { $unwind: "$data" }, { $project: { _id: 0, "token-id": 1, "data": 1 } } ]);
while (result.hasNext()) {
    db.collection.insert(result.next());
}
You need the $unwind stage from the aggregation pipeline; see the MongoDB documentation.
In your case the code would be:
db.yourcollection.aggregate( [ { $unwind : "$data" } ] )
Note that $unwind does not insert documents into a new collection by itself.
You can use
> db.test.aggregate( [ { $unwind : "$data" }, {$project: {_id:0, "token-id":1, "data":1}}, {$out: "another"} ] )
> db.another.find()
In the $project stage you need to suppress _id, because after the $unwind you get 4 documents with the same _id (and thus they cannot be inserted into the new collection).
Without an explicit _id, new values are generated automatically.
Here is the output that I got for your example
{ "_id" : ObjectId("560599b1699289a5b754fab9"), "token-id" : "LKJ8_lkjsd", "data" : { "views" : 100, "Date" : "2015-01-01" } }
{ "_id" : ObjectId("560599b1699289a5b754faba"), "token-id" : "LKJ8_lkjsd", "data" : { "views" : 200, "Date" : "2015-01-02" } }
{ "_id" : ObjectId("560599b1699289a5b754fabb"), "token-id" : "LKJ8_lkjsd", "data" : { "views" : 300, "Date" : "2015-01-03" } }
{ "_id" : ObjectId("560599b1699289a5b754fabc"), "token-id" : "LKJ8_lkjsd", "data" : { "views" : 300, "Date" : "2015-01-03" } }
As per your question, with a large data set $unwind can make the query slow; in that case you can use $map in aggregation to process the array of data, like below:
db.collection.aggregate({
"$project": {
"result": {
"$map": {
"input": "$data",
"as": "el",
"in": {
"token-id": "$token-id",
"views": "$$el.views",
"Date": "$$el.Date"
}
}
}
}
}).pretty()
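What the $map projection does can be sketched in plain JavaScript: each element of the data array becomes an object that also carries the parent's token-id (using the sample data from the question):

```javascript
// Document shaped like the question's sample.
const doc = {
  "token-id": "LKJ8_lkjsd",
  data: [
    { views: 100, Date: "2015-01-01" },
    { views: 200, Date: "2015-01-02" },
    { views: 300, Date: "2015-01-03" }
  ]
};

// "$map" over the data array; "el" plays the role of "$$el".
const result = doc.data.map(el => ({
  "token-id": doc["token-id"],
  views: el.views,
  Date: el.Date
}));

console.log(result.length); // 3
console.log(result[0]); // { "token-id": "LKJ8_lkjsd", views: 100, Date: "2015-01-01" }
```

Unlike $unwind, this keeps one output document per input document, with the reshaped array in a single field.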

paging subdocument in mongodb subdocument

I want to page data in MongoDB. I used the $slice operator but cannot page my data: I can fetch the matching row, but cannot page within it.
I want to return only 2 rows of the data source at a time.
How can I resolve this?
My Query :
db.getCollection('forms').find({
"_id": ObjectId("557e8c93a6df1a22041e0879"),
"Questions._id": ObjectId("557e8c9fa6df1a22041e087b")
}, {
"Questions.$.DataSource": {
"$slice": [0, 2]
},
"_id": 0,
"Questions.DataSourceItemCount": 1
})
My collection data :
/* 1 */
{
"_id" : ObjectId("557e8c93a6df1a22041e0879"),
"QuestionCount" : 2.0000000000000000,
"Questions" : [
{
"_id" : ObjectId("557e8c9ba6df1a22041e087a"),
"DataSource" : [],
"DataSourceItemCount" : NumberLong(0)
},
{
"_id" : ObjectId("557e8c9fa6df1a22041e087b"),
"DataSource" : [
{
"_id" : ObjectId("557e9428a6df1a198011fa55"),
"CreationDate" : ISODate("2015-06-15T09:00:24.485Z"),
"IsActive" : true,
"Text" : "sdf",
"Value" : "sdf"
},
{
"_id" : ObjectId("557e98e9a6df1a1a88da8b1d"),
"CreationDate" : ISODate("2015-06-15T09:20:41.027Z"),
"IsActive" : true,
"Text" : "das",
"Value" : "asdf"
},
{
"_id" : ObjectId("557e98eea6df1a1a88da8b1e"),
"CreationDate" : ISODate("2015-06-15T09:20:46.889Z"),
"IsActive" : true,
"Text" : "asdf",
"Value" : "asdf"
},
{
"_id" : ObjectId("557e98f2a6df1a1a88da8b1f"),
"CreationDate" : ISODate("2015-06-15T09:20:50.401Z"),
"IsActive" : true,
"Text" : "asd",
"Value" : "asd"
},
{
"_id" : ObjectId("557e98f5a6df1a1a88da8b20"),
"CreationDate" : ISODate("2015-06-15T09:20:53.639Z"),
"IsActive" : true,
"Text" : "asd",
"Value" : "asd"
}
],
"DataSourceItemCount" : NumberLong(5)
}
],
"Name" : "er"
}
Though this is possible to do with some real wrangling you would be best off changing the document structure to "flatten" the array entries into a single array. The main reason for this is "updates" which are not atomically supported by MongoDB with respect to updating the "inner" array due to the current limitations of the positional $ operator.
At any rate, it's not easy to deal with for the reasons that will become apparent.
For the present structure you approach it like this:
db.collection.aggregate([
// Match the required document and `_id` is unique
{ "$match": {
"_id": ObjectId("557e8c93a6df1a22041e0879")
}},
// Unwind the outer array
{ "$unwind": "$Questions" },
// Match the inner entry
{ "$match": {
"Questions._id": ObjectId("557e8c9fa6df1a22041e087b"),
}},
// Unwind the inner array
{ "$unwind": "$Questions.DataSource" },
// Find the first element
{ "$group": {
"_id": {
"_id": "$_id",
"questionId": "$Questions._id"
},
"firstSource": { "$first": "$Questions.DataSource" },
"sources": { "$push": "$Questions.DataSource" }
}},
// Unwind the sources again
{ "$unwind": "$sources" },
// Compare the elements to keep
{ "$project": {
"firstSource": 1,
"sources": 1,
"seen": { "$eq": [ "$firstSource._id", "$sources._id" ] }
}},
// Filter out anything already "seen" (the first element)
{ "$match": { "seen": false } },
// Group back the elements you want
{ "$group": {
    "_id": "$_id",
    "firstSource": { "$first": "$firstSource" },
    "secondSource": { "$first": "$sources" }
}}
])
So that is going to give you the "first two elements" of that inner array. It's the basic process for implementing $slice in the aggregation framework, which is required since you cannot use standard projection with a "nested array" in the way you are trying.
Since $slice is not supported otherwise with the aggregation framework, you can see that doing "paging" would be a pretty horrible and "iterative" operation in order to "pluck" the array elements.
I could at this point suggest "flattening" to a single array, but the same "slicing" problem applies: even if you made "QuestionId" a property of the "inner" data, it has the same projection and selection problems, for which you need the same aggregation approach.
Then there is this seemingly less-than-great structure for your data (for some query operations), but it all depends on your usage patterns. This structure suits this type of operation:
{
"_id" : ObjectId("557e8c93a6df1a22041e0879"),
"QuestionCount" : 2.0000000000000000,
"Questions" : {
"557e8c9ba6df1a22041e087a": {
"DataSource" : [],
"DataSourceItemCount" : NumberLong(0)
},
"557e8c9fa6df1a22041e087b": {
"DataSource" : [
{
"_id" : ObjectId("557e9428a6df1a198011fa55"),
"CreationDate" : ISODate("2015-06-15T09:00:24.485Z"),
"IsActive" : true,
"Text" : "sdf",
"Value" : "sdf"
},
{
"_id" : ObjectId("557e98e9a6df1a1a88da8b1d"),
"CreationDate" : ISODate("2015-06-15T09:20:41.027Z"),
"IsActive" : true,
"Text" : "das",
"Value" : "asdf"
}
],
"DataSourceItemCount" : NumberLong(5)
}
}
}
Where this works:
db.collection.find(
{
"_id": ObjectId("557e8c93a6df1a22041e0879"),
"Questions.557e8c9fa6df1a22041e087b": { "$exists": true }
},
{
"_id": 0,
"Questions.557e8c9fa6df1a22041e087b.DataSource": { "$slice": [0, 2] },
"Questions.557e8c9fa6df1a22041e087b.DataSourceItemCount": 1
}
)
Nested arrays are not great for many operations, particularly update operations since it is not possible to get the "inner" array index for update operations. The positional $ operator will only get the "first" or "outer" array index and cannot "also" match the inner array index.
Updates with a structure like you have involve "reading" the document as a whole and then manipulating in code and writing back. There is no "guarantee" that the document has not changed in the collection between those operations and it can lead to inconsistencies unless handled properly.
On the other hand, the revised structure as shown, works well for the type of query given, but may be "bad" if you need to dynamically search or "aggregate" across what you have represented as the "outer" "Questions".
Data structure with MongoDB is very subjective to "how you use it". So it is best to consider all of your usage patterns before "nailing down" a final data structure design for your application.
So you can either take note of the problems and solutions as noted, or simply live with retrieving the "outer" element via the standard "positional" match and then just "slice" in your client code.
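The "slice in your client code" option is simple; here is a sketch in plain JavaScript, assuming the matched question's DataSource has already been fetched and using a hypothetical page size of 2:

```javascript
// DataSource array as fetched from the matched question (Text values from the sample,
// last one renamed to keep the entries distinguishable).
const dataSource = [
  { Text: "sdf" }, { Text: "das" }, { Text: "asdf" }, { Text: "asd" }, { Text: "asd2" }
];

const pageSize = 2;
// Page n is just a contiguous slice of the array.
const page = n => dataSource.slice(n * pageSize, (n + 1) * pageSize);

console.log(page(0).map(d => d.Text)); // ["sdf", "das"]
console.log(page(2).map(d => d.Text)); // ["asd2"]
```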
It's all a matter of "what suits your application best".

Group and count using aggregation framework

I'm trying to group and count the following structure:
[{
"_id" : ObjectId("5479c4793815a1f417f537a0"),
"status" : "canceled",
"date" : ISODate("2014-11-29T00:00:00.000Z"),
"offset" : 30,
"devices" : [
{
"name" : "Mouse",
"cost" : 150,
},
{
"name" : "Keyboard",
"cost" : 200,
}
],
},
{
"_id" : ObjectId("5479c4793815a1f417d557a0"),
"status" : "done",
"date" : ISODate("2014-10-20T00:00:00.000Z"),
"offset" : 30,
"devices" : [
{
"name" : "LCD",
"cost" : 150,
},
{
"name" : "Keyboard",
"cost" : 200,
}
],
}
,
{
"_id" : ObjectId("5479c4793815a1f417f117a0"),
"status" : "done",
"date" : ISODate("2014-12-29T00:00:00.000Z"),
"offset" : 30,
"devices" : [
{
"name" : "Headphones",
"cost" : 150,
},
{
"name" : "LCD",
"cost" : 200,
}
],
}]
I need to group and count, to get something like this:
{
    "result" : [
        {
            "_id" : { "status" : "canceled" },
            "count" : 1
        },
        {
            "_id" : { "status" : "done" },
            "count" : 2
        }
    ],
    "totaldevicecost" : 730,
    "ok" : 1
}
My problem is calculating the cost sum in the "devices" sub-array. How do I do that?
It seems like you got a start on this but got lost on some of the other concepts. There are some basic truths when working with arrays in documents, but let's start where you left off:
db.sample.aggregate([
{ "$group": {
"_id": "$status",
"count": { "$sum": 1 }
}}
])
So that is just going to use the $group pipeline to gather up your documents on the different values of the "status" field and then also produce another field for "count" which of course "counts" the occurrences of the grouping key by passing a value of 1 to the $sum operator for each document found. This puts you at a point much like you describe:
{ "_id" : "done", "count" : 2 }
{ "_id" : "canceled", "count" : 1 }
That's the first stage of this and easy enough to understand, but now you need to know how to get values out of an array. You might then be tempted once you understand the "dot notation" concept properly to do something like this:
db.sample.aggregate([
{ "$group": {
"_id": "$status",
"count": { "$sum": 1 },
"total": { "$sum": "$devices.cost" }
}}
])
But what you will find is that the "total" will in fact be 0 for each of those results:
{ "_id" : "done", "count" : 2, "total" : 0 }
{ "_id" : "canceled", "count" : 1, "total" : 0 }
Why? Well, MongoDB aggregation operations like this do not actually traverse array elements when grouping. In order to do that, the aggregation framework has a concept called $unwind. The name is relatively self-explanatory. An embedded array in MongoDB is much like a "one-to-many" association between linked data sources. What $unwind does is exactly that sort of "join": the resulting "documents" are based on the content of the array, with the parent's information duplicated for each element.
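The "de-normalized join" that $unwind performs can be sketched in plain JavaScript, using one of the sample documents:

```javascript
// One sample document from the question (trimmed to the relevant fields).
const doc = {
  status: "canceled",
  devices: [
    { name: "Mouse", cost: 150 },
    { name: "Keyboard", cost: 200 }
  ]
};

// $unwind: one output document per devices element, parent fields duplicated.
const unwound = doc.devices.map(d => ({ status: doc.status, devices: d }));

console.log(unwound.length); // 2
console.log(unwound[0]); // { status: "canceled", devices: { name: "Mouse", cost: 150 } }
```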
So in order to act on array elements you need to use $unwind first. This should logically lead you to code like this:
db.sample.aggregate([
{ "$unwind": "$devices" },
{ "$group": {
"_id": "$status",
"count": { "$sum": 1 },
"total": { "$sum": "$devices.cost" }
}}
])
And then the result:
{ "_id" : "done", "count" : 4, "total" : 700 }
{ "_id" : "canceled", "count" : 2, "total" : 350 }
But that isn't quite right, is it? Remember what you just learned about $unwind and how it does a de-normalized join with the parent information? That parent information is now duplicated for every array element, and since both documents had two array members, the "count" is twice what it should be in each case, even though the "total" field is correct.
A bit more care needs to be taken, so instead of doing this in a single $group stage, it is done in two:
db.sample.aggregate([
{ "$unwind": "$devices" },
{ "$group": {
"_id": "$_id",
"status": { "$first": "$status" },
"total": { "$sum": "$devices.cost" }
}},
{ "$group": {
"_id": "$status",
"count": { "$sum": 1 },
"total": { "$sum": "$total" }
}}
])
Which now gets the result with correct totals in it:
{ "_id" : "canceled", "count" : 1, "total" : 350 }
{ "_id" : "done", "count" : 2, "total" : 700 }
Now the numbers are right, but it is still not exactly what you are asking for. I would think you should stop there as the sort of result you are expecting is really not suited to just a single result from aggregation alone. You are looking for the total to be "inside" the result. It really doesn't belong there, but on small data it is okay:
db.sample.aggregate([
{ "$unwind": "$devices" },
{ "$group": {
"_id": "$_id",
"status": { "$first": "$status" },
"total": { "$sum": "$devices.cost" }
}},
{ "$group": {
"_id": "$status",
"count": { "$sum": 1 },
"total": { "$sum": "$total" }
}},
{ "$group": {
"_id": null,
"data": { "$push": { "count": "$count", "total": "$total" } },
"totalCost": { "$sum": "$total" }
}}
])
And a final result form:
{
"_id" : null,
"data" : [
{
"count" : 1,
"total" : 350
},
{
"count" : 2,
"total" : 700
}
],
"totalCost" : 1050
}
But, "Do Not Do That". MongoDB has a 16MB limit on any single document in a response, which is a limitation of the BSON spec. On small results you can do this kind of convenience wrapping, but in the larger scheme of things you want the results in the earlier form, plus either a separate query for the total or iterating the whole result set in order to total all documents.
You appear to be using a MongoDB version older than 2.6, or copying output from a RoboMongo shell, which does not support the latest version's features. From MongoDB 2.6 onward, the result of aggregation can be a "cursor" rather than a single BSON array, so the overall response can be much larger than 16MB, but only when you are not compacting the results into a single document as in the last example.
This would be especially true in cases where you were "paging" the results, with 100's to 1000's of result lines but you just wanted a "total" to return in an API response when you are only returning a "page" of 25 results at a time.
Anyhow, that should give you a reasonable guide on how to get the type of results you are expecting from your common document form. Remember $unwind in order to process arrays, and generally $group multiple times in order to get totals at different grouping levels from your document and collection groupings.