Assign an incremental number to a field in a MongoDB aggregation

I have a MongoDB aggregation and each document has a field (groupNumber) like this:
I need each groupNumber to be a different number for each document (it could be incremental: 1, 2, 3, ...).
Most of the solutions I have found use find(), like this one, but I don't think that can be used in an aggregation.
Thanks in advance.

You can do this with a $lookup, but I don't think it will scale well. I would think this should be persisted in some way, perhaps with $out to another collection, so that the numbers are stable and deleting or inserting a document won't change the numbers for any other group.
db.target.aggregate([
  {"$lookup": {
    "from": "target",
    "as": "looked",
    "let": {"srcId": "$_id"},
    "pipeline": [
      // count how many documents have an _id less than or equal to this one
      {"$match": {"$expr": {"$lte": ["$_id", "$$srcId"]}}},
      {"$group": {"_id": null, "cnt": {"$sum": 1}}}
    ]
  }},
  {"$addFields": {"groupNumber": {"$arrayElemAt": ["$looked.cnt", 0]}}},
  {"$project": {"looked": 0}}
])
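As a side note not from the original answer: on MongoDB 5.0 or newer, the $setWindowFields stage with the $documentNumber window operator can assign the sequential number without a self-$lookup. A minimal sketch, assuming the numbers should follow _id order:
db.target.aggregate([
  {"$setWindowFields": {
    "sortBy": {"_id": 1},
    "output": {
      // $documentNumber assigns 1, 2, 3, ... in the sortBy order
      "groupNumber": {"$documentNumber": {}}
    }
  }}
])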

Related

How to improve the performance of this MongoDB query

I am trying to take an extract from a huge MongoDB collection. The collection contains 2.65 TB of data (unzipped), i.e., 600 GB (zipped). Each document has a deep hierarchy and a couple of arrays, and I want to extract some parts of them.
In this collection we have multiple documents for each customer id. Since I want to export the most active document for each customer, I need to group, take the records with the maximum timestamp field, and perform some further processing on them. I need some help in forming the query for the export. I have tried to sort the documents per customer id, but this could not be achieved in an acceptable time when combined with a $match stage (which is needed, since it is a huge collection and we try to create the export in parts). Currently the query looks like this:
db.getCollection('CEM').aggregate([
  {'$match': {'LiveFeed.customer.profile.id': 'TCAYT2RY2PF93R93JVSUGU7D3'}},
  {'$project': {'LiveFeed.customer.profile.id': 1,
                'LiveFeed.customer.profile.products.air.flights': 1,
                'LiveFeed.context.timestamp': 1}},
  {'$sort': {'LiveFeed.customer.profile.id': 1, 'LiveFeed.context.timestamp': 1}},
  {'$group': {'_id': '$LiveFeed.customer.profile.id',
              'products': {'$last': '$LiveFeed.customer.profile.products.air.flights'}}},
  {'$unwind': '$products'},
  {'$unwind': '$products.sources'},
  {'$project': {'_id': 0,
                'ceid': '$_id',
                'coupon_no': {'$ifNull': ['$products.couponId.couponNumber', '']},
                'ticket_no': {'$ifNull': ['$products.couponId.ticketId.number', '']},
                'pnr_id': '$products.sources.id',
                'departure_date': '$products.segment.departure.at',
                'departure_airport': '$products.segment.departure.code',
                'arrival_airport': '$products.segment.arrival.code',
                'created_date': '$products.createdAt'}}
])
Any ideas/suggestions on how to improve this query would be very helpful indeed. Thanks in advance!
It is difficult to answer this without knowing the indexes on your collection. However, you can save some time by eliminating stage 3: the $sort is undone by the $group in stage 4. See $group does not preserve order.
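Not part of the original answer, but since it mentions that indexes matter: a compound index covering the $match field and the timestamp used for the $sort is the usual first thing to try. A sketch, assuming the field names from the question's pipeline:
// Hypothetical compound index supporting the $match on the customer id
// and the sort on the timestamp; adjust to the real schema and workload.
db.getCollection('CEM').createIndex({
    'LiveFeed.customer.profile.id': 1,
    'LiveFeed.context.timestamp': 1
})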

How to project in MongoDB after sort?

In a find operation fields can be excluded, but what if I want to do a find, then a sort, and only after that the projection? Do you know any trick or operation for it?
From the docs: fields {Object}, the fields to return in the query. Object of fields to include or exclude (not both), {'a': 1}
You can run a usual find query with conditions, a projection, and a sort. I think you want to sort on a field that you don't want to project; don't worry, you can sort on a field even if it is not included in the projection.
However, if you explicitly exclude the sort field with 0 while also including other fields, the projection mixes inclusion and exclusion and the find query will fail.
// This query will work
db.collection.find(
    { _id: 'someId' },
    { someField: 1 }
).sort({ someOtherField: 1 })
// This query won't work: the projection mixes inclusion and exclusion
db.collection.find(
    { _id: 'someId' },
    { someField: 1, someOtherField: 0 }
).sort({ someOtherField: 1 })
However, if you still don't get the required results, look into the MongoDB Aggregation Framework!
Here is a sample aggregation query for your requirement:
db.collection.aggregate([
    { $match: { _id: 'someId' } },
    { $sort: { someField: 1 } },
    { $project: { _id: 1, someOtherField: 1 } }
])

Difference between aggregate ($match) and find in MongoDB?

What is the difference between the $match operator used inside the aggregate function and the regular find in MongoDB?
Why doesn't the find function allow renaming the field names like the aggregate function?
e.g. in aggregate we can pass the following stage:
{ "$project" : { "OrderNumber" : "$PurchaseOrder.OrderNumber" , "ShipDate" : "$PurchaseOrder.ShipDate"}}
whereas find does not allow this.
Why doesn't the aggregate output return as a DBCursor or a List? And why can't we get a count of the documents that are returned?
Thank you.
Why doesn't the aggregate output return as a DBCursor or a List?
The aggregation framework was created to solve easy problems that would otherwise require map-reduce.
This framework is commonly used to compute data that requires the full db as input and few documents as output.
What is the difference between the $match operator used inside the aggregate function and the regular find in MongoDB?
One of the differences, as you stated, is the return type: find operations return a DBCursor.
Other differences:
The aggregation result must be under 16 MB. If you are using shards, the full data must be collected at a single point after the first $group or $sort.
The main purpose of $match is to filter the documents entering the pipeline, but it also improves the aggregation's performance, especially when it is placed first, where it can take advantage of indexes.
And why can't we get a count of the documents that are returned?
You can. Just count the number of elements in the resulting array, or add the following stage to the end of the pipeline:
{$group: {_id: null, count: {$sum: 1}}}
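For context, a minimal sketch of that count stage at the end of a pipeline (the collection and field names here are made up for illustration):
// The final $group collapses everything that reached it into a single
// document whose "count" field is the number of matching documents.
db.orders.aggregate([
    { $match: { status: "shipped" } },
    { $group: { _id: null, count: { $sum: 1 } } }
])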
Why doesn't the find function allow renaming the field names like the aggregate function?
MongoDB is young and features are still coming. Maybe in a future version we'll be able to do that. Renaming fields is more critical in aggregation than in find.
EDIT (2014/02/26):
MongoDB 2.6 aggregation operations will return a cursor.
EDIT (2014/04/09):
MongoDB 2.6 was released with the predicted aggregation changes.
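For illustration (not part of the original answer), this is roughly how the 2.6+ shell exposes that cursor; the collection and fields are hypothetical:
// aggregate() now returns a cursor that can be iterated like a find() cursor
var cur = db.orders.aggregate([
    { $match: { status: "shipped" } },
    { $group: { _id: "$customerId", total: { $sum: "$amount" } } }
]);
cur.forEach(function (doc) { printjson(doc); });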
I investigated a few things about the aggregate and find calls:
I did this with a descending sort on a collection of 160k documents and limited my output to a few documents.
1. The aggregate command is slower than the find command.
2. If you actually access the data (e.g. with ToList()), the aggregate command is faster than find.
3. If you look at the total times (point 1 + 2), the two commands seem to be equal.
Maybe aggregate automatically calls ToList() and does not have to call it again. If you don't call ToList() afterwards, the find() call will be much faster.
7 ms vs 50 ms (5 documents)

MongoDB Query - Count + where condition

I have a MongoDB collection whose documents have a tags field and a date field.
I can count the occurrences of a given tag with this query:
db.collection.count( { tags: "abc" })
But I would like to get the counts of all unique tags together. I also want to put a where condition on the date, i.e. only query a certain time period.
A very simple approach: get the distinct tags for the period, then count each one.
db.collection.distinct("tags", { dateField: { $gt: dateValue } }).forEach(function (tag) {
    // apply the same date condition so the per-tag count matches the period
    var count = db.collection.count({ tags: tag, dateField: { $gt: dateValue } });
    print(tag + ": " + count);
});
You can use the MongoDB Aggregation Framework to solve this problem (http://docs.mongodb.org/manual/applications/aggregation/). In case you cannot get everything done with it, you always have the option to do it through map-reduce (http://docs.mongodb.org/manual/applications/map-reduce/). I have used map-reduce to build my own search options for special requirements, so at any point map-reduce will help you do things that are not possible with a simple query. I am not giving the query because I do not have much information about how you want to get the data and what your collection looks like, but either option will let you get it very easily.
You should use a combination of find and count:
db.collection.find({ /* condition */ }).count()
db.collection.aggregate([
    { $group: { _id: "$tags", count: { $sum: 1 } } },
    { $sort: { count: -1 } }
])
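To also honour the date condition from the question, and to count each tag separately if tags is an array, something along these lines should work; the dateField name and the date range variables are assumptions, not taken from the answers above:
// startDate and endDate are placeholders for the desired time period
db.collection.aggregate([
    // filter on the period first so an index on dateField can be used
    { $match: { dateField: { $gte: startDate, $lt: endDate } } },
    // if "tags" is an array, unwind it so each tag is counted individually
    { $unwind: "$tags" },
    { $group: { _id: "$tags", count: { $sum: 1 } } },
    { $sort: { count: -1 } }
])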

Aggregate framework can't use indexes

I run this command:
db.ads_view.aggregate({$group: {_id : "$campaign", "action" : {$sum: 1} }});
ads_view: 500,000 documents.
This query takes 1.8 s. This is its profile: https://gist.github.com/afecec63a994f8f7fd8a
Indexed: db.ads_view.ensureIndex({campaign: 1});
But MongoDB doesn't use the index. Does anyone know whether the aggregation framework can use indexes, and how to index this query?
This is a late answer, but since $group in Mongo as of version 4.0 still won't make use of indexes, it may be helpful for others.
To speed up your aggregation significantly, perform a $sort before the $group.
So your query would become:
db.ads_view.aggregate({$sort:{"campaign":1}},{$group: {_id : "$campaign", "action" : {$sum: 1} }});
This assumes an index on campaign, which should have been created according to your question. In Mongo 4.0, create the index with db.ads_view.createIndex({campaign:1}).
I tested this on a collection containing 5.5+ million documents. Without $sort, the aggregation would not have finished even after several hours; with $sort preceding $group, the aggregation takes a couple of seconds.
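If you want to verify that the $sort actually picks up the index, the shell's explain helper shows the winning plan; this is a generic check, not something from the original answer:
// Look for an IXSCAN on { campaign: 1 } in the winning plan
db.ads_view.explain().aggregate([
    { $sort: { campaign: 1 } },
    { $group: { _id: "$campaign", action: { $sum: 1 } } }
])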
The $group operator is not one of the ones that will use an index currently. The list of operators that do (as of 2.2) is:
$match
$sort
$limit
$skip
From here:
http://docs.mongodb.org/manual/applications/aggregation/#pipeline-operators-and-indexes
Based on the number of yields going on in the gist, I would assume you either have a very active instance or that a lot of this data is not in memory when you are doing the group (it will usually yield on a page fault too), hence the 1.8 s.
Note that even if $group could use an index, and your index covered everything being grouped, it would still involve a full scan of the index to do the grouping, and would likely not be terribly fast anyway.
$group doesn't use an index because it doesn't have to. When you $group your items you're essentially indexing all documents passing through the $group stage of the pipeline using your $group's _id. If you used an index that matched the $group's _id, you'd still have to pass through all the docs in the index so it's the same amount of work.