Avoid sending full table with MongoDB and Pymongo - mongodb

I'm trying to get the min and max value from some fields inside a collection. I'm not sure if this:
result = collection.find(date_filter, expected_projection).sort({'attribute': -1}).limit(1)
is equivalent to this:
result_a = collection.find(date_filter, expected_projection)
result_b = result_a.sort({'attribute': -1}).limit(1)
I don't want the server to query all the data in result_a from the database. Is the first line of code actually fetching every document in my collection and THEN sorting it, or just fetching the max element in the attribute field?

Yes, they are equivalent, and MongoDB will not return the entire collection to the client, whether or not the attribute field is indexed.
When you chain operators together in a MongoDB command (e.g. find().sort().limit()), it is not treated by the MongoDB server as a set of separate functions to be called sequentially; it is treated as a single query which should be optimised as a whole and executed as a whole on the MongoDB server.
See the documentation on Combining Cursor Methods for another example of how the chaining is not taken as a sequence of independent operations:
The following statements chain cursor methods limit() and sort():
db.bios.find().sort( { name: 1 } ).limit( 5 )
db.bios.find().limit( 5 ).sort( { name: 1 } )
The two statements are equivalent; i.e. the order in which you chain the limit() and the sort() methods is not significant. Both statements return the first five documents, as determined by the ascending sort order on ‘name’.
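To make the "single query" point concrete, here is a toy model in plain Python of a lazily evaluated cursor. It is only an illustration: LazyQuery is a made-up class, not PyMongo's Cursor and not how the server works internally. Chained calls just record options, nothing runs until you iterate, and the sort is always applied before the limit no matter which one was called first.

```python
class LazyQuery:
    """Toy stand-in for a cursor: chained calls only record options."""

    def __init__(self, docs):
        self._docs = docs
        self._sort = None       # (field, direction) or None
        self._limit = 0         # 0 means "no limit"
        self.executed = False   # flips to True on first iteration

    def sort(self, field, direction=1):
        self._sort = (field, direction)
        return self             # returning self is what allows chaining

    def limit(self, n):
        self._limit = n
        return self

    def __iter__(self):
        # The "query" runs here, as a whole: sort first, then limit,
        # regardless of the order in which the options were recorded.
        self.executed = True
        docs = list(self._docs)
        if self._sort:
            field, direction = self._sort
            docs.sort(key=lambda d: d[field], reverse=(direction == -1))
        if self._limit:
            docs = docs[:self._limit]
        return iter(docs)


bios = [{"name": "Grace"}, {"name": "Ada"}, {"name": "Linus"}]
a = LazyQuery(bios).sort("name", 1).limit(2)
b = LazyQuery(bios).limit(2).sort("name", 1)
```

Iterating a or b gives the same two documents, and splitting the chain across two statements (as in result_a / result_b) changes nothing because the first statement has not executed anything yet.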

The first line of code tells MongoDB to return only the document with the highest value for "attribute", because the sort direction is -1 (descending). If "attribute" is indexed, then MongoDB can directly access only that one document, and not even consider the rest of the collection.
Do this once:
collection.create_index([('attribute', 1)])
Having that index in place means you can find the highest-sorting or lowest-sorting document practically instantly.
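A minimal sketch of how the min/max lookup might be wrapped, assuming a PyMongo-style collection object (anything with find() returning a cursor that supports sort() and limit()). extreme_value is a hypothetical helper name, and it only handles top-level field names.

```python
def extreme_value(collection, field, query=None, highest=True):
    """Return the highest (or lowest) value of `field`, or None if nothing matches."""
    direction = -1 if highest else 1   # -1 = descending, 1 = ascending
    cursor = (
        collection.find(query or {}, {field: 1, "_id": 0})  # project only the field
        .sort(field, direction)
        .limit(1)                                            # ask for one document
    )
    for doc in cursor:   # the query is sent when iteration starts
        return doc.get(field)
    return None
```

With the index above in place, extreme_value(collection, 'attribute') and extreme_value(collection, 'attribute', highest=False) each ask the server for a single document.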

Related

How do I find and update the lowest values for a field in a mongodb query?

I am frequently running a mongo db query to find the documents that have the lowest value for a field and then running another query to update them.
For example:
Ticket.find({
  userEvent: eventData._id,
})
  .sort({ purchasersMinimumPrice: 1 })
  .then(data => {
    let SecureRefundRequests = data.slice(0, 6).map(e => {
      return Ticket.findByIdAndUpdate({_id: e._id}, {
        refundAttemptExpires: Date.now() + (5*60*1000)
      })
    })
    Promise.all(SecureRefundRequests)
  })
To reduce the number of calls to the database I want to do all of this in one query.
I have been playing around with db.collection.update but can't figure out how to use the index of the returned sorted documents.
Is this possible?
You can't really do that in a single update command.
db.collection.update evaluates each document separately, without keeping any state between documents. This means you can find the minimum value in an array contained in a document, but you cannot compare a value between documents.
You might improve the performance of your existing query by creating an index on {userEvent:1, purchasersMinimumPrice:1, _id:1}, applying a limit of 6 to the query, and projecting just the _id. Together these allow mongod to find the 6 documents for that userEvent with the lowest values and return the _id values you need for the update without an in-memory sort and without actually looking at the documents, i.e. the query is fully covered by the index.
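For what it's worth, here is how the two-step version could look from Python, assuming a PyMongo-style collection passed in as tickets. The function name, the millisecond timestamp, and the use of update_many with $in are my assumptions, not something from the question. It is still not a single command and it is not atomic, but it cuts the work down to two round trips: one covered query for the ids and one multi-document update.

```python
import time


def extend_refund_window(tickets, event_id, count=6, window_ms=5 * 60 * 1000):
    """Mark the `count` cheapest tickets of an event; return how many were modified."""
    # Round trip 1: fetch only the _id of the cheapest tickets.
    cursor = (
        tickets.find({"userEvent": event_id}, {"_id": 1})
        .sort("purchasersMinimumPrice", 1)
        .limit(count)
    )
    ids = [doc["_id"] for doc in cursor]
    if not ids:
        return 0

    # Assumption: the expiry is stored as epoch milliseconds, like Date.now().
    expires = int(time.time() * 1000) + window_ms

    # Round trip 2: one update covering all selected documents.
    result = tickets.update_many(
        {"_id": {"$in": ids}},
        {"$set": {"refundAttemptExpires": expires}},
    )
    return result.modified_count
```

Because the selection and the update are separate operations, another writer could change prices in between; whether that matters depends on your application.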

How to improve performance when retrieving the last document from a collection

Here is a statement where Limit() precedes Sort()...
var result Candle
dao.c.Find(bson.M{"symbol": "USD"}).Limit(1).Sort("-time").One(&result);
... and here is a statement where Limit() comes after Sort():
var result Candle
dao.c.Find(bson.M{"symbol": "USD"}).Sort("-time").Limit(1).One(&result);
Is there any performance difference between the statements above?
This can be answered both for the mgo package and for MongoDB itself.
In mgo
The Query.Limit() and Query.Sort() methods just operate on the Query object locally, and once you have set up the query, you execute it, e.g. with Query.One() or Query.All(). The order in which you called the methods to set it up is not stored and does not matter.
In MongoDB
Quoting from MongoDB doc: Combine Cursor Methods:
The following statements chain cursor methods limit() and sort():
db.bios.find().sort( { name: 1 } ).limit( 5 )
db.bios.find().limit( 5 ).sort( { name: 1 } )
The two statements are equivalent; i.e. the order in which you chain the limit() and the sort() methods is not significant. Both statements return the first five documents, as determined by the ascending sort order on ‘name’.
So there is no difference, they are equivalent and thus have the same performance.

MongoDB skip & limit when querying two collections

Let's say I have two collections, A and B, and a single document in A is related to N documents in B. For example, the schemas could look like this:
Collection A:
{id: (int),
propA1: (int),
propA2: (boolean)
}
Collection B:
{idA: (int), # id for document in Collection A
propB1: (int),
propB2: (...),
...
propBN: (...)
}
I want to return properties propB2-BN and propA2 from my API, and only return information where (for example) propA2 = true, propB6 = 42, and propB1 = propA1.
This is normally fairly simple - I query Collection B to find documents where propB6 = 42, collect the idA values from the result, query Collection A with those values, and filter the results with the Collection A documents from the query.
However, adding skip and limit parameters to this seems impossible to do while keeping the behavior users would expect. Naively applying skip and limit to the first query means that, since filtering occurs after the query, less than limit documents could be returned. Worse, in some cases no documents could be returned when there are actually still documents in the collection to be read. For example, if the limit was 10 and the first 10 Collection B documents returned pointed to a document in Collection A where propA2 = false, the function would return nothing. Then the user would assume there's nothing left to read, which may not be the case.
A slightly less naive solution is to simply check if the return count is < limit, and if so, repeat the queries until the return count = limit. The problem here is that skip/limit queries where the user would expect exclusive sets of documents returned could actually return the same documents.
I want to apply skip and limit at the mongo query level, not at the API level, because the results of querying collection B could be very large.
MapReduce and the aggregation framework appear to only work on a single collection, so they don't appear to be alternatives.
This seems like something that'd come up a lot in Mongo use - any ideas/hints would be appreciated.
Note that these posts ask similar sounding questions but don't actually address the issues raised here.
Sounds like you already have a solution (2).
You cannot apply skip/limit to the first query; depending on the search, you can perhaps apply it to the second query.
You will need a loop around it either way, as you describe.
I suppose .skip will always be costly for you, since you will need to fetch all the results and then throw them away to simulate the skip and give the user consistent behavior.
All the logic would have to go into your loop, unless you can match in a clever way in the second query (depending on requirements).
Out of curiosity: Given the time passed, you should have a solution by now?!
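The loop described above might look roughly like this in Python. Everything here is hypothetical scaffolding: fetch_b_batch and fetch_a_by_ids stand in for your two queries (B filtered on propB6 and read in a stable sort order; A filtered on propA2 and looked up by id), and skip/limit are applied to the joined stream rather than to either query.

```python
from itertools import islice


def joined_page(fetch_b_batch, fetch_a_by_ids, skip, limit, batch_size=100):
    """Return one page of joined results; skip/limit count joined rows only."""

    def joined():
        offset = 0
        while True:
            # Assumed to return B documents already filtered (e.g. propB6 = 42).
            batch = fetch_b_batch(offset, batch_size)
            if not batch:
                return
            # Assumed to return {id: A document} for matching A docs only.
            a_docs = fetch_a_by_ids({b["idA"] for b in batch})
            for b in batch:
                a = a_docs.get(b["idA"])
                if a is not None and a["propA1"] == b["propB1"]:
                    yield {**b, "propA2": a["propA2"]}
            offset += len(batch)

    # Skipped rows are still fetched and discarded, which is the cost noted above.
    return list(islice(joined(), skip, skip + limit))
```

Pages stay disjoint only if the B query returns documents in the same order every time, so sort it on something stable such as _id.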

Mongo find unique results

What's the easiest way to get all the documents from a collection that are unique based on a single field?
I know I can use db.collection.distinct to get an array of all the distinct values of a field, but I want to get the first (or really any one) document for every distinct value of one field.
e.g. if the database contained:
{number:1, data:'Test 1'}
{number:1, data:'This is something else'}
{number:2, data:'I\'m bad at examples'}
{number:3, data:'I guess there\'s room for one more'}
it would return (based on number being unique):
{number:1, data:'Test 1'}
{number:2, data:'I\'m bad at examples'}
{number:3, data:'I guess there\'s room for one more'}
Edit: I should add that the server is running Mongo 2.0.8 so no aggregation and there's more results than group will support.
Update to 2.4 and use aggregation :)
If you really need to stick to the old version of MongoDB because too much red tape is involved, you could use MapReduce.
In MapReduce, the map function transforms each document of the collection into a new document and a distinctive key. The reduce function is used to merge documents with the same distinctive key into one.
Your map function would emit your documents as-is and with the number-field as unique key. It would look like this:
var mapFunction = function() {
// MongoDB binds the current document to `this`; the map function takes no arguments
emit(this.number, this);
}
Your reduce-function receives arrays of documents with the same key, and is supposed to somehow turn them into one document. In this case it would just discard all but the first document with the same key:
var reduceFunction = function(key, documents) {
return documents[0];
}
Unfortunately, MapReduce has some problems. Apart from the optional query argument to the mapReduce command, which can pre-exclude some documents, it can't use indexes, so at least one JavaScript function is executed for every single document in the collection. When you have a large collection, this can take a while. You also can't fully control how the documents created by MapReduce are formed. They always have two fields: _id with the key and value with the document you returned for the key.
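To see what that output shape looks like for the example data, here is a small plain-Python simulation of the map and reduce steps. It does not call MongoDB at all; first_per_key is just an illustrative name, and it assumes the map step emits each document under its number and the reduce step keeps the first one.

```python
def first_per_key(documents, key_field):
    """Simulate map (group by key) and reduce (keep the first document per key)."""
    emitted = {}
    for doc in documents:
        # "map": emit(doc[key_field], doc)
        emitted.setdefault(doc[key_field], []).append(doc)
    # "reduce": keep documents[0]; the output always has the _id/value shape
    return [{"_id": key, "value": docs[0]} for key, docs in emitted.items()]


docs = [
    {"number": 1, "data": "Test 1"},
    {"number": 1, "data": "This is something else"},
    {"number": 2, "data": "I'm bad at examples"},
    {"number": 3, "data": "I guess there's room for one more"},
]
unique_docs = first_per_key(docs, "number")
```

Each entry in unique_docs wraps the original document under value, so you would have to unwrap it on the client to get back the flat documents from the question.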
MapReduce is also hard to debug and troubleshoot.
tl;dr: Update to 2.4

How does MongoDB evaluate multiple $or statements?

How will MongoDB evaluate this query:
db.testCol.find(
{
"$or" : [ {a:1, b:12}, {b:9, c:15}, {c:10, d:"foo"} ]
});
When scanning values in a document, if the first OR statement is TRUE, will the other statements also be evaluated?
Logically, if MongoDB is optimized, the other values in the OR statement should not be evaluated, but I don't know how MongoDB is implemented.
UPDATE:
I updated my query because it was wrong and it didn't explain correctly what I was trying to accomplish. I need to find a set of documents that have different properties and if an exact combination of these properties is found the document must be returned.
The SQL equivalent of my query would be:
SELECT * FROM testCol
WHERE (a = 1 AND b = 12) OR (b = 9 AND c = 15) OR (c = 10 AND d = 'foo');
MongoDB will execute each clause of the $or operation as a separate query and remove duplicates as a post-processing pass. As such, each clause can use a separate index, which is often very useful.
In other words, it will NOT look at 1 document, see which of the OR clauses apply and do an early-out if the first clause is a match. Rather it does a full dataset query per clause and de-dupe after the fact. This may seem less than efficient but in practice it's almost always faster since the first approach would only be able to hit at most one index for all clauses which is rarely efficient.
EDIT: Mongo only skips documents during the de-duplication process, not during the table scans.
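As a rough mental model of the behaviour described above (one pass per clause, then de-duplication), here is a plain-Python sketch. It is not the server's actual planner; matches and find_or are made-up helper names, and a linear scan stands in for whatever index each clause would really use.

```python
def matches(doc, clause):
    """True if every key/value pair in the clause equals the document's value."""
    return all(doc.get(key) == value for key, value in clause.items())


def find_or(docs, clauses):
    """Run each $or clause as its own pass and drop documents already returned."""
    seen_ids = set()
    result = []
    for clause in clauses:              # one "query" per clause
        for doc in docs:                # stand-in for an index or collection scan
            if matches(doc, clause) and doc["_id"] not in seen_ids:
                seen_ids.add(doc["_id"])    # de-duplicate by _id
                result.append(doc)
    return result


docs = [
    {"_id": 1, "a": 1, "b": 12},
    {"_id": 2, "b": 9, "c": 15},
    {"_id": 3, "a": 1, "b": 12, "c": 10, "d": "foo"},
    {"_id": 4, "a": 2},
]
clauses = [{"a": 1, "b": 12}, {"b": 9, "c": 15}, {"c": 10, "d": "foo"}]
found = find_or(docs, clauses)
```

Document 3 satisfies both the first and the third clause but appears only once in found, which is the de-duplication step at work.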
Mongo won't check documents that are already part of the result set. So if your first {a:1, b:12} returns 100% of the documents, Mongo is done.
You want to put whatever will grab the most documents as your first evaluated statement because of this. If your first item only grabs 1% of documents, the subsequent item will need to scan the other 99%.
That being said, you are using $or to look for values in a single key. I think you want to use $in for this.
See here for more:
http://books.google.com/books?id=BQS33CxGid4C&lpg=PA48&ots=PqvQJPRUoe&dq=mongo%20tips%20and%20tricks%20%22OR-query%22&pg=PA48#v=onepage&q&f=false