MongoDB return latest full document for each id (Full document Object containing all fields like sub document arrays etc) [duplicate] - mongodb

I want to get the last document for each station with all other fields :
{
"_id" : ObjectId("535f5d074f075c37fff4cc74"),
"station" : "OR",
"t" : 86,
"dt" : ISODate("2014-04-29T08:02:57.165Z")
}
{
"_id" : ObjectId("535f5d114f075c37fff4cc75"),
"station" : "OR",
"t" : 82,
"dt" : ISODate("2014-04-29T08:02:57.165Z")
}
{
"_id" : ObjectId("535f5d364f075c37fff4cc76"),
"station" : "WA",
"t" : 79,
"dt" : ISODate("2014-04-29T08:02:57.165Z")
}
I need to have t and station for the latest dt per station.
With the aggregation framework :
db.temperature.aggregate([{$sort:{"dt":1}},{$group:{"_id":"$station", result:{$last:"$dt"}, t:{$last:"$t"}}}])
returns
{
"result" : [
{
"_id" : "WA",
"result" : ISODate("2014-04-29T08:02:57.165Z"),
"t" : 79
},
{
"_id" : "OR",
"result" : ISODate("2014-04-29T08:02:57.165Z"),
"t" : 82
}
],
"ok" : 1
}
Is this the most efficient way to do that?
Thanks

To directly answer your question, yes it is the most efficient way. But I do think we need to clarify why this is so.
As suggested in the alternatives, the one thing people reach for is "sorting" your results before passing them to a $group stage, and what they look at is the "timestamp" value. Since you want everything in "timestamp" order, that leads to this form:
db.temperature.aggregate([
    { "$sort": { "station": 1, "dt": -1 } },
    { "$group": {
        "_id": "$station",
        "result": { "$first": "$dt" }, "t": { "$first": "$t" }
    }}
])
And as stated, you will of course want an index that reflects the sort keys in order to make the sort efficient.
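A compound index matching those sort keys would be something along these lines (a sketch, simply mirroring the keys used in the $sort above):
db.temperature.ensureIndex({ "station": 1, "dt": -1 })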
However, and this is the real point: what seems to have been overlooked by others (if not by yourself) is that all of this data is likely being inserted in time order already, since each reading is recorded as it is added.
So the beauty of this is that the _id field (with a default ObjectId) is already in "timestamp" order, as it actually contains a time value itself, and that makes this statement possible:
db.temperature.aggregate([
    { "$group": {
        "_id": "$station",
        "result": { "$last": "$dt" }, "t": { "$last": "$t" }
    }}
])
And it is faster. Why? Well, you don't need to select an index (additional code to invoke), and you also don't need to "load" the index in addition to the document.
We already know the documents are in order (by _id), so the $last boundaries are perfectly valid. You are scanning everything anyway, and you could also "range" query on the _id values, which is equally valid for selecting between two dates.
The only real thing to say here is that in "real world" usage it might be more practical to $match between ranges of dates when doing this sort of accumulation, as opposed to getting the "first" and "last" _id values to define a "range", or something similar in your actual usage.
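For example, a date-bounded version of the same accumulation might look like this (a sketch; the date bounds here are purely hypothetical):
db.temperature.aggregate([
    { "$match": {
        "dt": { "$gte": ISODate("2014-04-01"), "$lt": ISODate("2014-05-01") }
    }},
    { "$group": {
        "_id": "$station",
        "result": { "$last": "$dt" }, "t": { "$last": "$t" }
    }}
])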
So where is the proof of this? Well it is fairly easy to reproduce, so I just did so by generating some sample data:
var stations = [
"AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL",
"GA", "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA",
"ME", "MD", "MA", "MI", "MN", "MS", "MO", "MT", "NE",
"NV", "NH", "NJ", "NM", "NY", "NC", "ND", "OH", "OK",
"OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VT",
"VA", "WA", "WV", "WI", "WY"
];
for ( var i = 0; i < 200000; i++ ) {
    var station = stations[Math.floor(Math.random()*stations.length)];
    var t = Math.floor(Math.random() * ( 96 - 50 + 1 )) + 50;
    var dt = new Date();
    db.temperatures.insert({
        station: station,
        t: t,
        dt: dt
    });
}
On my hardware (an 8GB laptop with a spinning disk, which is not stellar but certainly adequate), running each form of the statement clearly shows a pause with the version using an index and a sort (same keys on the index as in the sort statement). It is only a minor pause, but the difference is significant enough to notice.
Even looking at the explain output (version 2.6 and up, though it is actually there in 2.4.9, just not documented), you can see the difference: although the $sort is optimized out due to the presence of an index, the time taken appears to go into index selection and then loading the indexed entries. Including all fields for a "covered" index query makes no difference.
Also for the record, purely indexing the date and only sorting on the date values gives the same result. Possibly slightly faster, but still slower than the natural index form without the sort.
So as long as you can happily "range" on the first and last _id values, then it is true that using the natural index on the insertion order is actually the most efficient way to do this. Your real world mileage may vary on whether this is practical for you or not and it might simply end up being more convenient to implement the index and sorting on the date.
But if you were happy with using _id ranges or "greater than" the "last" _id in your query, then one tweak gets those values returned along with your results, so you can in fact store and use that information in successive queries:
db.temperature.aggregate([
    // Get documents "greater than" the "highest" _id value found last time
    { "$match": {
        "_id": { "$gt": ObjectId("536076603e70a99790b7845d") }
    }},

    // Do the grouping with addition of the returned field
    { "$group": {
        "_id": "$station",
        "result": { "$last": "$dt" },
        "t": { "$last": "$t" },
        "lastDoc": { "$last": "$_id" }
    }}
])
And if you were actually "following on" the results like that then you can determine the maximum value of ObjectId from your results and use it in the next query.
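One way to pick that maximum up in the same round trip is to add a second $group over the per-station results (a sketch, not part of the original pipeline):
db.temperature.aggregate([
    { "$group": {
        "_id": "$station",
        "result": { "$last": "$dt" },
        "t": { "$last": "$t" },
        "lastDoc": { "$last": "$_id" }
    }},
    { "$group": {
        "_id": null,
        "stations": { "$push": "$$ROOT" },
        "maxId": { "$max": "$lastDoc" }
    }}
])
The "maxId" value can then be fed straight into the $match of the next run.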
Anyhow, have fun playing with that, but again: yes, in this case that query is the fastest way.

An index is all you really need:
db.temperature.ensureIndex({ 'station': 1, 'dt': 1 })
for s in db.temperature.distinct('station'):
    db.temperature.find({ station: s }).sort({ dt: -1 }).limit(1)
of course using whatever syntax is actually valid for your language.
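In the mongo shell, for instance, that loop could be written as (just a sketch of the same idea):
db.temperature.distinct('station').forEach(function(s) {
    // One round trip per station: newest reading first via the descending sort
    var latest = db.temperature.find({ station: s }).sort({ dt: -1 }).limit(1).toArray();
    if ( latest.length ) printjson(latest[0]);
});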
Edit: You are correct that a loop like this incurs a round-trip per station, and it's great for a few stations, and not so good for 1000. You do still want the compound index on station+dt, though, and to take advantage of a descending sort:
db.temperature.aggregate([
{ $sort: { station: 1, dt: -1 } },
{ $group: { _id: "$station", result: {$first:"$dt"}, t: {$first:"$t"} } }
])

As for the aggregation query you've posted, I'd make certain that you have an index on dt:
db.temperature.ensureIndex({'dt': 1 })
This will make certain that the $sort at the beginning of the aggregation pipeline is as efficient as possible.
As to whether this is the most efficient way to get this data versus a query in a loop, that will likely be a function of how many data points you have. In the beginning, with "thousands of stations" and perhaps hundreds of thousands of data points, I'd think the aggregation approach will be faster.
However, as you add more and more data, an issue is that the aggregation query will continue to touch all the documents. This will get increasingly expensive as you scale up to millions or more documents. One approach for that case would be to add a $limit right after the $sort to limit the total number of documents being considered. That's a bit hacky and inexact, but it would help to limit the total number of documents that need to be accessed.
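A sketch of that idea, flipping to a descending sort so the $limit keeps the newest documents and taking $first per station (the cutoff value is purely hypothetical):
db.temperature.aggregate([
    { $sort: { dt: -1 } },
    { $limit: 100000 },   // hypothetical cap on the documents considered
    { $group: { _id: "$station", result: { $first: "$dt" }, t: { $first: "$t" } } }
])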

Related

What is the proper way to query MongoDB matching $in with large collection?

There is a deal collection containing 50 million documents with this structure:
{
"user": <string>,
"price": <number>
}
The collection has this index
{
"v" : 2,
"unique" : false,
"key" : {
"user" : 1
},
"name" : "deal_user",
"background" : true,
"sparse" : true
}
I am trying to execute this aggregation:
db.getCollection('deal').aggregate([
    {
        $match: { user: { $in: [
            "u-4daf809bd",
            "u-b967c0a84",
            "u-ae03417c1",
            "u-805ce887d",
            "u-0f5160823",
            /* more items */
        ]}}
    },
    {
        $group: {
            _id: "$user",
            minPrice: { $min: "$price" }
        }
    }
])
When the array size inside $match / $in is less than ~50, the query response time is mostly under 2 seconds. For large arrays (size > 200) the response time is around 120 seconds.
I also tried an approach that chunks the array into parts of 10-50 elements and runs the queries in parallel. There is a strange (reproducible) effect: most of the queries (~80%) respond fast (2-5 seconds), but some hang for ~100 seconds, so the parallelisation did not bear fruit.
I guess that there are some kind of "slow" and "normal" values, but I cannot explain what is going on, because they all belong to the same index and should be fetched in approximately the same time. It correlates slightly with the number of duplicates in the user field (i.e. per grouped value), but it looks like:
"a big number of duplicates" does not always entail "slow"
Please help me understand why this MongoDB query behaves this way.
What is the proper way to do this kind of query?

MongoDB query for finding number of people with conflicting schedules [duplicate]

I have startTime and endTime for all records like this:
{
startTime : 21345678
endTime : 31345678
}
I am trying to find the number of all conflicts. For example, if there are two records and they overlap, the number of conflicts is 1. If there are three records and two of them overlap, the number of conflicts is 1. If there are three records and all three overlap, the number of conflicts is 3, i.e. [(X1, X2), (X1, X3), (X2, X3)].
As an algorithm, I am thinking of sorting the data by start time and, for each sorted record, checking the end time and finding the records with a start time less than that end time. This would be O(n^2) time. A better approach would be to use an interval tree, inserting each record into the tree and counting the overlaps as they occur. This would be O(n log n) time.
I have not used MongoDB much, so what kind of query can I use to achieve something like this?
As you correctly mention, there are different approaches with varying complexity inherent to their execution. This basically covers how they are done, and which one you implement really depends on which your data and use case are best suited to.
Current Range Match
MongoDB 3.6 $lookup
The simplest approach can be employed using the new syntax of the $lookup operator in MongoDB 3.6, which allows a pipeline to be given as the expression in order to "self join" to the same collection. This can basically query the collection again for any items where the starttime "or" endtime of the current document falls between the same values of any other document, not including the original of course:
db.getCollection('collection').aggregate([
    { "$lookup": {
        "from": "collection",
        "let": {
            "_id": "$_id",
            "starttime": "$starttime",
            "endtime": "$endtime"
        },
        "pipeline": [
            { "$match": {
                "$expr": {
                    "$and": [
                        { "$ne": [ "$$_id", "$_id" ] },
                        { "$or": [
                            { "$and": [
                                { "$gte": [ "$$starttime", "$starttime" ] },
                                { "$lte": [ "$$starttime", "$endtime" ] }
                            ]},
                            { "$and": [
                                { "$gte": [ "$$endtime", "$starttime" ] },
                                { "$lte": [ "$$endtime", "$endtime" ] }
                            ]}
                        ]}
                    ]
                }
            }},
            { "$count": "count" }
        ],
        "as": "overlaps"
    }},
    { "$match": { "overlaps.0": { "$exists": true } } }
])
The single $lookup performs the "join" on the same collection allowing you to keep the "current document" values for the "_id", "starttime" and "endtime" values respectively via the "let" option of the pipeline stage. These will be available as "local variables" using the $$ prefix in subsequent "pipeline" of the expression.
Within this "sub-pipeline" you use the $match pipeline stage and the $expr query operator, which allows you to evaluate aggregation framework logical expressions as part of the query condition. This allows the comparison between values as it selects new documents matching the conditions.
The conditions simply look for the "processed documents" where the "_id" field is not equal to the "current document", $and where either the "starttime"
$or "endtime" values of the "current document" fall between the same properties of the "processed document". Note here that these, as well as the respective $gte and $lte operators, are the "aggregation comparison operators" and not the "query operator" form, as the returned result evaluated by $expr must be boolean in context. This is what the aggregation comparison operators actually do, and it's also the only way to pass in values for comparison.
Since we only want the "count" of the matches, the $count pipeline stage is used to do this. The result of the overall $lookup will be a "single element" array where there was a count, or an "empty array" where there was no match to the conditions.
An alternate case would be to "omit" the $count stage and simply allow the matching documents to return. This allows easy identification, but as an "array embedded within the document" you do need to be mindful of the number of "overlaps" that will be returned as whole documents, and that this does not cause a breach of the BSON limit of 16MB. In most cases this should be fine, but where you expect a large number of overlaps for a given document it can become a real concern. So it's really just something more to be aware of.
The $lookup pipeline stage in this context will "always" return an array in result, even if empty. The name of the output property "merging" into the existing document will be "overlaps" as specified in the "as" property to the $lookup stage.
Following the $lookup, we can then do a simple $match with a regular query expression employing the $exists test for the 0 index value of output array. Where there actually is some content in the array and therefore "overlaps" the condition will be true and the document returned, showing either the count or the documents "overlapping" as per your selection.
Other versions - Queries to "join"
The alternate case where your MongoDB lacks this support is to "join" manually by issuing the same query conditions outlined above for each document examined:
db.getCollection('collection').find().map( d => {
    var overlaps = db.getCollection('collection').find({
        "_id": { "$ne": d._id },
        "$or": [
            { "starttime": { "$gte": d.starttime, "$lte": d.endtime } },
            { "endtime": { "$gte": d.starttime, "$lte": d.endtime } }
        ]
    }).toArray();
    return ( overlaps.length !== 0 )
        ? Object.assign(
            d,
            {
                "overlaps": {
                    "count": overlaps.length,
                    "documents": overlaps
                }
            }
        )
        : null;
}).filter(e => e != null);
This is essentially the same logic except we actually need to go "back to the database" in order to issue the query to match the overlapping documents. This time it's the "query operators" used to find where the current document values fall between those of the processed document.
Because the results are already returned from the server, there is no BSON limit restriction on adding content to the output. You might have memory restrictions, but that's another issue. Simply put we return the array rather than cursor via .toArray() so we have the matching documents and can simply access the array length to obtain a count. If you don't actually need the documents, then using .count() instead of .find() is far more efficient since there is not the document fetching overhead.
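For instance, the overlap check in the loop above could become a count-only query, where d is still the current document from the outer .map() (a sketch):
// Count matching overlaps without fetching the documents themselves
var overlapCount = db.getCollection('collection').find({
    "_id": { "$ne": d._id },
    "$or": [
        { "starttime": { "$gte": d.starttime, "$lte": d.endtime } },
        { "endtime": { "$gte": d.starttime, "$lte": d.endtime } }
    ]
}).count();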
The output is then simply merged with the existing document, where the other important distinction is that since these are "multiple queries" there is no way of providing the condition that they must "match" something. So this leaves us with considering that there will be results where the count (or array length) is 0, and all we can do at this time is return a null value which we can later .filter() from the result array. Other methods of iterating the cursor employ the same basic principle of "discarding" results where we do not want them. But nothing stops the query being run on the server, and this filtering is "post processing" in some form or another.
Reducing Complexity
So the above approaches work with the structure as described, but of course the overall complexity requires that for each document you must essentially examine every other document in the collection in order to look for overlaps. Therefore whilst using $lookup allows for some "efficiency" in reduction of transport and response overhead, it still suffers the same problem that you are still essentially comparing each document to everything.
A better solution, "where you can make it fit", is to instead store a "hard value" representative of the interval on each document. For instance, we could "presume" that there are solid "booking" periods of one hour within a day, for a total of 24 booking periods. This "could" be represented something like:
{ "_id": "A", "booking": [ 10, 11, 12 ] }
{ "_id": "B", "booking": [ 12, 13, 14 ] }
{ "_id": "C", "booking": [ 7, 8 ] }
{ "_id": "D", "booking": [ 9, 10, 11 ] }
With data organized like that, where there is a set indicator for the interval, the complexity is greatly reduced since it's really just a matter of "grouping" on the interval value from the array within the "booking" property:
db.booking.aggregate([
{ "$unwind": "$booking" },
{ "$group": { "_id": "$booking", "docs": { "$push": "$_id" } } },
{ "$match": { "docs.1": { "$exists": true } } }
])
And the output:
{ "_id" : 10, "docs" : [ "A", "D" ] }
{ "_id" : 11, "docs" : [ "A", "D" ] }
{ "_id" : 12, "docs" : [ "A", "B" ] }
That correctly identifies that for the 10 and 11 intervals both "A" and "D" contain the overlap, whilst "B" and "A" overlap on 12. Other intervals and documents matching are excluded via the same $exists test except this time on the 1 index ( or second array element being present ) in order to see that there was "more than one" document in the grouping, hence indicating an overlap.
This simply employs the $unwind aggregation pipeline stage to "deconstruct/denormalize" the array content so we can access the inner values for grouping. This is exactly what happens in the $group stage where the "key" provided is the booking interval id and the $push operator is used to "collect" data about the current document which was found in that group. The $match is as explained earlier.
This can even be expanded for alternate presentation:
db.booking.aggregate([
{ "$unwind": "$booking" },
{ "$group": { "_id": "$booking", "docs": { "$push": "$_id" } } },
{ "$match": { "docs.1": { "$exists": true } } },
{ "$unwind": "$docs" },
{ "$group": {
"_id": "$docs",
"intervals": { "$push": "$_id" }
}}
])
With output:
{ "_id" : "B", "intervals" : [ 12 ] }
{ "_id" : "D", "intervals" : [ 10, 11 ] }
{ "_id" : "A", "intervals" : [ 10, 11, 12 ] }
It's a simplified demonstration, but where your data allows the sort of analysis required, this is the far more efficient approach. So if you can keep the "granularity" fixed to "set" intervals which can be commonly recorded on each document, then the analysis and reporting can use the latter approach to quickly and efficiently identify such overlaps.
Essentially, this is how you would implement what you already mentioned as the "better" approach, with the first option being a "slight" improvement over what you originally theorized. See which one actually suits your situation, but this should explain the implementation and the differences.

Counting entries of subdocument in MongoDB documents

I have a document structure like so
{
"_id" : "3:/content/somepath/test.txt",
"_revisions" : {
"r152f47f1daf-0-2" : "c",
"r152f48413c1-0-2" : "c",
"r152f4851bf7-0-1" : "c"
}
}
My task is to find all documents with the following conditions:
The "_id" needs to start with "5:"
The number of revisions needs to be strictly greater than 3
The first part is easy, I have solved it with
db.nodes.find( {'_id': /^5:/} )
But I am struggling with the second part, where I am supposed to use $gt.
Since I am new to MongoDB, I was first looking at $size, but _revisions is not an array, it is a subdocument, right?
I was also looking at $unwind and then counting the results, but that does not make sense either, since my result needs to be the documents that match the above two conditions.
Any pointers highly appreciated.
You can use the $where operator:
db.nodes.find(function() {
return (/^5:/.test(this._id) && Object.keys(this._revisions).length > 3 );
})
The problem with this as mentioned in the documentation is that:
$where evaluates JavaScript and cannot take advantage of indexes. Therefore, query performance improves when you express your query using the standard MongoDB operators (e.g., $gt, $in).
You should definitely consider changing the _revisions field to an array of sub-documents like this:
{
"_id" : "3:/content/somepath/test.txt",
"_revisions" : [
{
"rev": "r152f47f1daf-0-2",
"value": "c"
},
{
"rev": "r152f48413c1-0-2",
"value": "c"
},
{
"rev": "r152f4851bf7-0-1",
"value": "c"
}
]
}
And use the $exists operator.
db.nodes.find({ "_id": /^5:/, "_revisions.3": { "$exists": true } } )

MongoDB: How to do a text search and sort by a date

Context: I have a MongoDB populated with a large number of emails. I'd like to do a search for all emails that include a given email address within any of the following fields: To, From, CC and BCC. The result needs to be sorted by the field Date. We're currently trying the following query:
db.collection.find({ $text : {$search: "\"email#domain.com\""}}).sort({Date:1})
I've tried creating a compound index including the date, but it does not work.
With this index...
db.collection.createIndex({Date: 1, From:"text", To:"text", CC:"text", BCC:"text"})
it gives error 17007, as Date should have an equality match since it's a prefix. This is not an option, as we'd like all emails regardless of the date.
Also with this other index...
db.collection.createIndex({From:"text", To:"text", CC:"text", BCC:"text", Date:1})
Then it gives error 17144 as it goes over the internal limit for the sort.
We've read the following:
Stackoverflow ref
Stackoverflow ref
mongoDB doc on compound index
In these references and others, I'm getting the idea that this is not possible, but I don't think what we're trying to do is atypical or so far out of the box.
Are we doing something wrong? Is there a way to do this query with compound index or any other MongoDB feature?
thanks!
Regardless of other compound index keys, you need to include the $meta for the "textScore" in order to get the correct sorting:
db.collection.find(
{ "$text": { "$search": "\"email#domain.com\""}},
{ "score": { "$meta": "textScore" } }
).sort({
"score": { "$meta": "textScore" }, "Date": 1
})
So naturally you want that "score" to sort first, and then by "Date" in order for things to be correctly ranked by relevance of the search.
The order of the index does not matter, but of course you can only have "one" text index. So make sure you drop all others before creating it:
db.collection.createIndex({
"From": "text",
"To": "text",
"CC":"text",
"BCC": "text",
"Date":1
})
Look for indexes that are current with:
db.collection.getIndexes()
Or just drop everything and start fresh:
db.collection.dropIndexes()
For the data you appear to be searching on though, I would have thought a regular compound index on each field would suit you better. Looking for "email" addresses should be an "exact match", and if you expect multiple items in each field then they should be arrays of strings, like so:
{
"TO": ["bill#example.com"],
"FROM": ["ted#example.com"],
"CC": ["marty#example.com","sarah#example.com"],
"BCC": [],
"Date": ISODate("2015-07-27T13:42:05.535Z")
}
Then you need separate indexes on each field, possibly in compound with "Date", like so:
db.email.createIndex({ "TO": 1, "Date": 1 })
db.email.createIndex({ "FROM": 1, "Date": 1 })
db.email.createIndex({ "CC": 1, "Date": 1 })
db.email.createIndex({ "BCC": 1, "Date": 1 })
And query with an $or condition:
db.email.find({
"$or": [
{ "TO": "sarah#example.com" },
{ "FROM": "sarah#example.com" },
{ "CC": "sarah#example.com" },
{ "BCC": "sarah#example.com" }
],
"Date": { "$lt": new Date() }
})
If you look at the .explain(true) (verbose) output from that, you should see that the winning plan is an "index intersection" of all the specified indexes. This works out to be very efficient as every field ( and index selected ) has an exact match value, and a range match on the indexed date.
That's going to be a lot better for you than the "fuzzy matching" of text searches. Even regular expressions should work better here in general (for e-mail addresses), especially when they are "anchored" with ^ to the start of the string.
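For instance, an anchored expression such as the following can still make use of an index on the field (a sketch):
db.email.find({ "TO": /^sarah#example\.com/ })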
Text indexes are meant for matching "word like" tokens, but this should not be your data here. The $or does not look as nice, but it should do a much better job.

Query multiple date ranges, return only specific key in MongoDB

In Mongo, I have documents that look like the following:
dateRange: [{
"price": "200",
"dateStart": "2014-01-01",
"dateEnd": "2014-01-30"
},
{
"price": "220",
"dateStart": "2014-02-01",
"dateEnd": "2014-02-15"
}]
Nice and simple, right? Just dates and prices. Now, the tricky part I'm on is: how would I go about creating a query to find the dateRange that fits 2014-01-12, and then return JUST the price once it's found, instead of the entire array of dateRanges?
These dateRanges can get quite large, and I'm trying to minimize the amount of data returned (if this is possible at all with Mongo). Note that I can change the date format if required; I was just using the above for example purposes.
Any help is appreciated, thanks!
You want to use the $elemMatch operator, which is only valid in versions 2.2 upward. You will also need to make sure you use multikey indexes.
Edit: To be clear, you will also have to use the $elemMatch find operator, as pointed out in the comment below.
This being said, I agree with the gist of the comment by mnemosyn. It would be better to have each element of the array represented as a single document.
Here is a quick example of $elemMatch to demonstrate the projection. Simply add $elemMatch to the find as well.
> db.test.save( {
    _id: 1,
    zipcode: 63109,
    students: [
        { name: "john", school: 102, age: 10 },
        { name: "jess", school: 102, age: 11 },
        { name: "jeff", school: 108, age: 15 }
    ]
} );
> db.test.find( { zipcode: 63109 }, { students: { $elemMatch: { school: 102 } } } ).pretty();
{
"_id" : 1,
"students" : [
{
"name" : "john",
"school" : 102,
"age" : 10
}
]
}
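Applied to the dateRange documents in the question, a combined match and projection might look like this (a sketch, keeping the string dates shown above):
db.collection.find(
    { "dateRange": { "$elemMatch": { "dateStart": { "$lte": "2014-01-12" }, "dateEnd": { "$gte": "2014-01-12" } } } },
    { "dateRange": { "$elemMatch": { "dateStart": { "$lte": "2014-01-12" }, "dateEnd": { "$gte": "2014-01-12" } } } }
)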
Well, the problem with that schema is that it uses large embedded arrays - this can be quite inefficient, because a mongodb query will always find a document, not a subset of an embedded object. Even if you're using a projection, mongodb will have to read the entire object internally, so if the array becomes huge, say 100k entries, that will slow things down to a halt.
Why not simply separate these array elements into documents, e.g.
{
price : 200,
productId : ObjectId("foo"), // or whatever the price refers to
dateStart : "2014-01-01",
dateEnd : "2013-01-30"
}
This way, mongodb doesn't need to pull the entire object with all prices, but only the prices that match your date range. This will minimize the amount of data transferred. You can then also use the query projection to only return the price, i.e. db.collection.find({ criteria }, {"price" : 1, "_id" : 0}).
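Concretely, a lookup for 2014-01-12 against such documents might be (a sketch, still using the string dates from the question):
db.collection.find(
    { "dateStart": { "$lte": "2014-01-12" }, "dateEnd": { "$gte": "2014-01-12" } },
    { "price": 1, "_id": 0 }
)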
Of course, the number of objects will increase dramatically, but efficient indexing will solve that problem. The only inefficiency induced is the duplication of the productId, which is cheaper than dealing with huge embedded arrays.
P.S: I'd suggest using actual dates (ISODate) instead of strings, even if their format is sortable.