MongoDB: get latest full document for each id by date/time

I need to get the latest documents that are in an array of ids based on date/time. I have the following query that does this, but it only returns the _id and acquiredTime fields. How can I get it to return the full document with all the fields?
db.trip.aggregate([
{ $match: { tripId: { $in: ["trip01", "trip02" ]}} },
{ $sort: { acquiredTime: -1} },
{ $group: { _id: "$tripId" , acquiredTime: { $first: "$acquiredTime" }}}
])
The collection looks something like:
[{
"tripId": "trip01",
"acquiredTime": 1000,
"name": "abc",
"value": "abc"
},{
"tripId": "trip02",
"acquiredTime": 1000,
"name": "xyz",
"value": "xyz"
},{
"tripId": "trip01",
"acquiredTime": 2000,
"name": "def",
"value": "abc"
},{
"tripId": "trip02",
"acquiredTime": 2000,
"name": "ghi",
"value": "xyz"
}]
At the moment I get:
[{
"tripId": "trip01",
"acquiredTime": 2000
},{
"tripId": "trip02",
"acquiredTime": 2000
}]
I need to get:
[{
"tripId": "trip01",
"acquiredTime": 2000,
"name": "def",
"value": "abc"
},{
"tripId": "trip02",
"acquiredTime": 2000,
"name": "ghi",
"value": "xyz"
}]

Your approach is the right one, but the thing is that $group and $project just don't work that way: they require you to name all of the fields you want in the result.
If you don't mind the structure looking a bit different, then you can always use $$ROOT in MongoDB versions 2.6 and greater:
db.trip.aggregate([
{ "$match": { "tripId": { "$in": ["trip01", "trip02" ]}} },
{ "$sort": { "acquiredTime": -1} },
{ "$group": { "_id": "$tripId" , "doc": { "$first": "$$ROOT" }}}
])
So the whole document is there, but just all contained as a sub-document to "doc" in the results.
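For illustration, against the sample collection in the question the output would look something like this (real documents would also carry their _id values):
{ "_id" : "trip01", "doc" : { "tripId" : "trip01", "acquiredTime" : 2000, "name" : "def", "value" : "abc" } }
{ "_id" : "trip02", "doc" : { "tripId" : "trip02", "acquiredTime" : 2000, "name" : "ghi", "value" : "xyz" } }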
For anything else, or something prettier, you are going to have to specify every field that you want. It's just a data structure, so you could always generate it from code anyway.
db.trip.aggregate([
{ "$match": { "tripId": { "$in": ["trip01", "trip02" ]}} },
{ "$sort": { "acquiredTime": -1} },
{ "$group": {
"_id": "$tripId" ,
"acquiredTime": { "$first": "$acquiredTime" },
"name": { "$first": "$name" },
"value": { "$first": "$value" }
}}
])

To my understanding, the above solution can suffer from performance and RAM problems when a large number of unique documents has to be returned: unless the sort runs along an index, the output of $match is sorted in memory.
Reference: https://docs.mongodb.com/manual/tutorial/sort-results-with-indexes/
To maximise performance and minimise RAM usage:
Create a unique compound index on { tripId: 1, acquiredTime: -1 } (see the sketch below)
Have the sort operate exactly along that index
This of course will cost you an index, which will slow down inserts - there's no free lunch :)
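A minimal shell sketch of creating that index (the unique option assumes no two documents ever share the same tripId and acquiredTime pair; drop it if that cannot be guaranteed):
db.trip.createIndex({ "tripId": 1, "acquiredTime": -1 }, { "unique": true })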
Additionally, the cosmetic problem of having the original document moved to a sub-document can easily be solved with $replaceRoot, without needing to explicitly list the document keys.
db.trip.aggregate([
{ "$match": { "tripId": { "$in": ["trip01", "trip02" ]}} },
{ "$sort": SON([("tripId", 1), ("acquiredTime", -1)],
{ "$group": { "_id": "$tripId" , "doc": { "$first": "$$ROOT" }}},
{ "$replaceRoot": { "newRoot": "$doc"}}
])
Finally, it's worth noting that if acquiredTime is just your server time, you can get rid of it, as the _id already embeds the creation timestamp. So the unique index would go on { tripId: 1, _id: -1 }, and the query becomes:
db.trip.aggregate([
{ "$match": { "tripId": { "$in": ["trip01", "trip02" ]}} },
{ "$sort": SON([("tripId", 1), ("_id", -1)],
{ "$group": { "_id": "$tripId" , "doc": { "$first": "$$ROOT" }}},
{ "$replaceRoot": { "newRoot": "$doc"}}
])
This is also better as date objects in MongoDB have a resolution of 1 millisecond, which - depending on the frequency of your inserts - may result in extremely hard to reproduce race conditions, whereas auto-generated _id values are strictly increasing for inserts from a single client process (and roughly time-ordered across processes).
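As an aside, the creation time embedded in an _id can be read back from an ObjectId in the shell; a minimal illustration:
var id = ObjectId();   // a freshly generated id
id.getTimestamp()      // ISODate derived from the id's leading 4-byte timestamp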

Related

How can I combine two collections in MongoDB and also fulfill the following conditions?

Note: Each collection contains 96.5k documents and each collection has these fields:
{
"name": "variable1",
"startTime": "variable2",
"endTime": "variable3",
"classes": "variable4",
"section": "variable"
}
I have 2 collections. I have to compare these 2 collections and find out whether specific fields (here I want name, startTime, endTime) of the documents are the same in both collections.
My approach was to join these 2 collections and then use $lookup. I also tried the following query, but it didn't work.
Please help me.
col1.aggregate([
{
"$unionWith": {"col1": "col2"}
},
{
"$group":
{
"_id":
{
"Name": "$Name",
"startTime": "$startTime",
"endTime": "$endTime"
},
"count": {"$sum": 1},
"doc": {"$first": "$$ROOT"}
}
},
{
"$match": {"$expr": {"$gt": ["$count", 1]}}
},
{
"$replaceRoot": {"newRoot": "$doc"}
},
{
"$out": "newCollectionWithDuplicates"
}
])
Your approach is fine, you just have a minor syntax error in your $unionWith; it's supposed to be like so:
{
"$unionWith": {
coll: "col2",
pipeline: []
}
}
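Putting it together, a minimal sketch of the full corrected pipeline. Two details worth noting: the sample schema uses lowercase name, so the group key here references $name rather than $Name, and a plain $match on count works in place of the $expr form:
db.col1.aggregate([
{ "$unionWith": { "coll": "col2", "pipeline": [] } },
{ "$group": {
"_id": { "name": "$name", "startTime": "$startTime", "endTime": "$endTime" },
"count": { "$sum": 1 },
"doc": { "$first": "$$ROOT" }
}},
{ "$match": { "count": { "$gt": 1 } } },
{ "$replaceRoot": { "newRoot": "$doc" } },
{ "$out": "newCollectionWithDuplicates" }
])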

Find documents that share one key but differ in another

I have a mongodb collection that resembles:
{"dept":"A" , "email":"bob#example.com", "userID": "1"}
{"dept":"A" , "email":"bob#example.com", "userID": "1"}
{"dept":"A" , "email":"bob#example.com", "userID": "2"} <<< "bad" record
{"dept":"A" , "email":"alice#example.com", "userID": "3"}
{"dept":"B" , "email":"bob#example.com", "userID": "4"}
{"dept":"B" , "email":"kevin#example.com", "userID": "5"}
The constraint is that an email must only have a single userID per department.
How would I query the collection to find which emails have multiple userIDs within a department? Mongo 4.4+
You have to use two $group pipeline stages to filter and find records with multiple entries.
db.collection.aggregate([
{
"$group": {
"_id": {
"dept": "$dept",
"email": "$email",
"userID": "$userID",
},
"individualCount": {
"$sum": 1
}
},
},
{
"$group": {
"_id": "$_id.email",
"userIDs": {
"$addToSet": "$_id.userID"
},
"dept": {
"$addToSet": "$_id.dept"
},
"totalRecordsCount": {
"$sum": "$individualCount"
},
"totalDuplicCounts": {
"$sum": 1
},
},
},
{
"$match": {
"totalDuplicCounts": {
"$gt": 1
}
},
},
])
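Against the sample documents above, only bob@example.com gets flagged, with output along these lines (set order may vary):
{ "_id" : "bob@example.com", "userIDs" : [ "1", "2", "4" ], "dept" : [ "A", "B" ], "totalRecordsCount" : 4, "totalDuplicCounts" : 3 }
One caveat: because the second $group keys on email alone, the same email with different userIDs in different departments is also counted; include dept in that second _id if duplicates should only be flagged within a single department.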

MongoDB Aggregate on both a field and a nested array field in the same record

I have a collection. I am trying to get an aggregate sum/count of a field in the record. I also need an aggregate sum/count of a nested array field in the record.
I am using MongoDB 3.0.0 with Jongo.
Please find my record below:
db.events.insert([{
"eventId": "a21sda2s-711f-12e6-8bcf-p1ff819aer3o",
"orgName": "ORG1",
"eventName": "EVA2",
"eventCost": 5000,
"bids": [{
"vendorName": "v1",
"bidStatus": "ACCEPTED",
"bidAmount": 4400
},{
"vendorName": "v2",
"bidStatus": "PROCESSING",
"bidAmount": 4900
},{
"vendorName": "v3",
"bidStatus": "REJECTED",
"bidAmount": "3000"
}] }, {
"eventId": "4427f318-7699-11e5-8bcf-feff819cdc9f",
"orgName": "ORG1",
"eventName": "EVA3",
"eventCost": 1000,
"bids": [ {
"vendorName": "v1",
"bidStatus": "REJECTED",
"bidAmount": 800
}, {
"vendorName": "v2",
"bidStatus": "PROCESSING",
"bidAmount": 900
},{
"vendorName": "v3",
"bidStatus": "PROCESSING",
"bidAmount": 990
}] }])
I need eventCount and eventCost, where I aggregate the $eventCost field.
I get acceptedCount and acceptedAmount by aggregating the $bids.bidAmount field (with a condition on $bids.bidStatus).
The result I need would be in form:
[{
"_id" : "EVA2",
"eventCount" : 2,
"eventCost" : 10000,
"acceptedCount" : 2,
"acceptedAmount" : 7400
},{
"_id" : "EVA3",
"eventCount" : 1,
"eventCost" : 1000,
"acceptedCount" : 0,
"acceptedAmount" : 0
}]
I am not able to get the result in a single query. Right now I run two queries, Query A and Query B (refer below), and merge the results in my Java code.
I use an $unwind operator in my Query B.
Is there a way I can achieve the same result in a single query? I feel all I need is a way to pass the bids[] array downstream for the next operation in the pipeline.
I tried the $push operator, but I am not able to figure out a way to push the entire bids[] array downstream.
I don't want to change my record structure, but if there is something intrinsically wrong, I could give it a try. Thanks for all your help.
My Solution
Query A:
db.events.aggregate([
{$group: {
_id: "$eventName",
eventCount: {$sum: 1}, // Get count of all events
eventCost: {$sum: "$eventCost"} // Get sum of costs
} }
])
Query B:
db.events.aggregate([
{$unwind: "$bids" },
{$group: {
_id: "$eventName",
// Get Count of Bids that have been accepted
acceptedCount:{ $sum:{$cond: [{$eq: ["$bids.bidStatus","ACCEPTED"]} ,1,0] } } ,
// Get Sum of Amounts that have been accepted
acceptedAmount:{$sum:{$cond: [{$eq: ["$bids.bidStatus","ACCEPTED"]} ,"$bids.bidAmount",0]
} } } }
])
Join Query A and QueryB in Java Code.
What I need:
A single DB operation to accomplish the same
The problem with unwinding arrays is that it's going to mess up the counts for the grouped events if you unwind before you do that initial grouping, as the number of items in each document's array will affect the count and sum over the denormalized documents.
Provided it is practical for your data size, there is however nothing wrong with using $push to simply create an "array" of "arrays", where you then just process $unwind twice on each grouped document:
db.events.aggregate([
{ "$group": {
"_id": "$eventName",
"eventCount": { "$sum": 1 },
"eventCost": { "$sum": "$eventCost" },
"bids": { "$push": "$bids" }
}},
{ "$unwind": "$bids" },
{ "$unwind": "$bids" },
{ "$group": {
"_id": "$_id",
"eventCount": { "$first": "$eventCount" },
"eventCost": { "$first": "$eventCost" },
"acceptedCount":{
"$sum":{
"$cond": [
{ "$eq": [ "$bids.bidStatus","ACCEPTED" ] },
1,
0
]
}
},
"acceptedCost":{
"$sum":{
"$cond": [
{ "$eq": [ "$bids.bidStatus","ACCEPTED" ] },
"$bids.bidAmount",
0
]
}
}
}}
])
The likely better alternative to this is to sum up the "accepted" values from each document first, and then sum those values per "event" later:
db.events.aggregate([
{ "$unwind": "$bids" },
{ "$group": {
"_id": "$_id",
"eventName": { "$first": "$eventName" },
"eventCost": { "$first": "$eventCost" },
"acceptedCount":{
"$sum":{
"$cond": [
{ "$eq": [ "$bids.bidStatus","ACCEPTED" ] },
1,
0
]
}
},
"acceptedCost":{
"$sum":{
"$cond": [
{ "$eq": [ "$bids.bidStatus","ACCEPTED" ] },
"$bids.bidAmount",
0
]
}
}
}},
{ "$group": {
"_id": "$eventName",
"eventCount": { "$sum": 1 },
"eventCost": { "$sum": "$eventCost" },
"acceptedCount": { "$sum": "$acceptedCount" },
"acceptedCost": { "$sum": "$acceptedCost" }
}}
])
In that way each array is reduced to just the values you need to collect and this makes the latter $group a lot easier.
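For reference, against just the two sample documents in the question, either pipeline would produce something like:
{ "_id" : "EVA2", "eventCount" : 1, "eventCost" : 5000, "acceptedCount" : 1, "acceptedCost" : 4400 }
{ "_id" : "EVA3", "eventCount" : 1, "eventCost" : 1000, "acceptedCount" : 0, "acceptedCost" : 0 }
(The larger counts in the question's expected output assume more documents per event than the two shown.)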
Those are a couple of approaches with the latter being the better option, but if you are actually able to process both queries in parallel and combine them in a smart way, then running two queries as you are currently doing would be my recommended approach for the best performance.

Server Side Looping

I've solved this problem, but I'm looking for a better way to do it on the mongodb server rather than the client.
I have one collection of Orders with a placement datetime (iso date) and a product.
{ _id: 1, datetime: "T1", product: "Apple" }
{ _id: 2, datetime: "T2", product: "Orange" }
{ _id: 3, datetime: "T3", product: "Pear" }
{ _id: 4, datetime: "T4", product: "Pear" }
{ _id: 5, datetime: "T5", product: "Apple" }
Goal: For a given time (or set of times) show the last order for EACH product in the set of my products before that time. Products are finite and known.
e.g. a query for time T6 will return:
{ _id: 2, datetime: "T2", product: "Orange" }
{ _id: 4, datetime: "T4", product: "Pear" }
{ _id: 5, datetime: "T5", product: "Apple" }
T4 will return:
{ _id: 1, datetime: "T1", product: "Apple" }
{ _id: 2, datetime: "T2", product: "Orange" }
{ _id: 4, datetime: "T4", product: "Pear" }
I've implemented this by creating a composite index on orders (datetime descending, product ascending).
Then on the Java client:
function findLastOrdersForTimes(times) {
  var results = [];
  times.forEach(function(time) {
    products.forEach(function(product) {
      // newest order for this product strictly before the time point;
      // an explicit sort is needed to reliably get the latest match
      var cursor = db.orders.find({ product: product, datetime: { $lt: time } })
                            .sort({ datetime: -1 }).limit(1);
      if (cursor.hasNext()) results.push(cursor.next());
    });
  });
  return results;
}
Now that is pretty fast since it hits the index and only fetches the data I need. However I need to query for many time points (100,000+), which means a lot of calls over the network, and my orders table will be very large. So how can I do this on the server in one hit, i.e. return a collection of time -> array of products? If it were Oracle, I'd create a stored proc with a cursor that loops back in time, collects the results for every time point, and breaks when it gets to the last product after the last time point. I've looked at the aggregation framework and mapreduce but can't see how to achieve this kind of loop. Any pointers?
If you truly want the last order for each product, then the aggregation framework comes in:
db.times.aggregate([
{ "$match": {
"product": { "$in": products },
}},
{ "$group": {
"_id": "$product",
"datetime": { "$max": "$datetime" }
}}
])
For example, with an array of products defined as:
var products = ['Apple', 'Orange', 'Pear'];
this returns:
{ "_id" : "Pear", "datetime" : "T4" }
{ "_id" : "Orange", "datetime" : "T2" }
{ "_id" : "Apple", "datetime" : "T5" }
Or if the _id from the original document is important to you, use the $sort with $last instead:
db.times.aggregate([
{ "$match": {
"product": { "$in": products },
}},
{ "$sort": { "datetime": 1 } },
{ "$group": {
"_id": "$product",
"id": { "$last": "$_id" },
"datetime": { "$last": "$datetime" }
}}
])
And that is what you most likely really want to do in either of those last cases. But the index you really want there is on "product":
db.times.ensureIndex({ "product": 1 })
So even if you need to iterate that with an additional $match condition of $lt a certain timepoint, that is still better; alternatively you can modify the "grouping" to include the "datetime" while keeping the set of products in the $match.
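As a sketch of one such iteration, assuming the cutoff is bound to a variable named time:
db.times.aggregate([
{ "$match": {
"product": { "$in": products },
"datetime": { "$lt": time }
}},
{ "$sort": { "datetime": 1 } },
{ "$group": {
"_id": "$product",
"id": { "$last": "$_id" },
"datetime": { "$last": "$datetime" }
}}
])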
It seems better at any rate, so perhaps this helps at least to modify your thinking.
If I'm reading your notes correctly, you seem to be looking to turn this on its head and find the last product for each point in time. So the statement is not much different:
db.times.aggregate([
{ "$match": {
"datetime": { "$in": ["T4","T5"] },
}},
{ "$sort": { "product": 1, "datetime": 1 } },
{ "$group": {
"_id": "$datetime",
"id": { "$last": "$_id" },
"product": { "$last": "$product" }
}}
])
That is the theory at least, based on how you present the question. I have the feeling though that you are abstracting this, and "datetime" is really actual timestamps stored as date object types.
So you might not be aware of the date aggregation operators you can apply, for example to get the boundary of each hour:
db.times.aggregate([
{ "$group": {
"_id": {
"year": { "$year": "$datetime" },
"dayOfYear": { "$dayOfYear": "$datetime" },
"hour": { "$hour": "$datetime" }
},
"id": { "$last": "$_id" },
"datetime": { "$last": "$datetime" },
"product": { "$last": "$product" }
}}
])
Or even using date math instead of the operators if you have an epoch-based timestamp:
db.times.aggregate([
{ "$group": {
"_id": {
"$subtract": [
{ "$subtract": [ "$datetime", new Date("1970-01-01") ] },
{ "$mod": [
{ "$subtract": [ "$datetime", new Date("1970-01-01") ] },
1000*60*60
]}
]
},
"id": { "$last": "$_id" },
"datetime": { "$last": "$datetime" },
"product": { "$last": "$product" }
}}
])
Of course you can add a range query for dates in the $match with $gt and $lt operators to keep the data within the range you are particularly looking at.
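For instance, such a stage might look like the following, with the boundary dates purely illustrative:
{ "$match": {
"datetime": { "$gte": new Date("2014-06-01"), "$lt": new Date("2014-07-01") }
}}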
Your overall solution is probably a combination of ideas, but as I said, your question seems to be about matching the last entries on certain time boundaries, so the last examples, possibly in combination with filtering certain products, are what you need rather than looping .findOne() requests.

MongoDB return two object for every group

I want to get two objects $first and $last after grouping. Is it possible?
Something like this, but this is not working:
{ "$group": {
"_id": "type",
"values": [{
"time": { "$first": "$time" },
"value": { "$first": "$value" }
},
{
"time": { "$last": "$time" },
"value": { "$last": "$value" }
}]
}
}
In order to get the $first and $last values from an array with the aggregation framework, you need to use $unwind first to "de-normalize" the array as individual documents. There is also another trick to put those back in an array.
Assuming a document like this
{
"type": "abc",
"values": [
{ "time": ISODate("2014-06-12T22:35:42.260Z"), "value": "ZZZ" },
{ "time": ISODate("2014-06-12T22:36:45.921Z"), "value": "KKK" },
{ "time": ISODate("2014-06-12T22:37:18.237Z"), "value": "AAA" }
]
}
And assuming that your array is already sorted, you would do the following.
If you do not care about the results being in an array just $unwind and $group:
db.junk.aggregate([
{ "$unwind": "$values" },
{ "$group": {
"_id": "$type",
"ftime": { "$first": "$values.time" },
"fvalue": { "$first": "$values.value" },
"ltime": { "$last": "$values.time" },
"lvalue": { "$last": "$values.value" },
}}
])
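Against the sample document above, that yields something like:
{ "_id" : "abc", "ftime" : ISODate("2014-06-12T22:35:42.260Z"), "fvalue" : "ZZZ", "ltime" : ISODate("2014-06-12T22:37:18.237Z"), "lvalue" : "AAA" }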
For those results in an array, there is a trick to it:
db.collection.aggregate([
{ "$unwind": "$values" },
{ "$project": {
"type": 1,
"values": 1,
"indicator": { "$literal": ["first", "last"] }
}},
{ "$group": {
"_id": "$type",
"ftime": { "$first": "$values.time" },
"fvalue": { "$first": "$values.value" },
"ltime": { "$last": "$values.time" },
"lvalue": { "$last": "$values.value" },
"indicator": { "$first": "$indicator" }
}},
{ "$unwind": "$indicator" },
{ "$project": {
"values": {
"time": {
"$cond": [
{ "$eq": [ "$indicator", "first" ] },
"$ftime",
"$ltime"
]
},
"value": {
"$cond": [
{ "$eq": [ "$indicator", "first" ] },
"$fvalue",
"$lvalue"
]
}
}
}},
{ "$group": {
"_id": "$_id",
"values": { "$push": "$values" }
}}
])
If your array is not sorted, place an additional $sort stage after the $unwind and before the very first $group to make sure your items are in the order you want them evaluated by $first and $last. A logical order here is by the "time" field, so:
{ "$sort": { "type": 1, "values.time": 1 } }
The $literal declares an array to identify the values "first" and "last", which are later "unwound" to create two copies of each grouped document. These are then evaluated using the $cond operator to re-assign the values to a single "values" field, which is finally pushed back into an array using $push.
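Against the sample document, the end result comes out something like:
{ "_id" : "abc", "values" : [
{ "time" : ISODate("2014-06-12T22:35:42.260Z"), "value" : "ZZZ" },
{ "time" : ISODate("2014-06-12T22:37:18.237Z"), "value" : "AAA" }
] }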
Remember to always try to $match first in the pipeline in order to reduce the number of documents you are working on to what you reasonably need. You pretty much never want to do this over whole collections, especially when you are using $unwind on arrays.
Just as a final note, $literal was introduced in MongoDB 2.6. For prior versions you can interchange it with the undocumented $const.
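So on those older versions the $project stage above would become the following sketch ($const being undocumented, verify it against your own version):
{ "$project": {
"type": 1,
"values": 1,
"indicator": { "$const": ["first", "last"] }
}}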