MongoDB skip & limit when querying two collections - mongodb

Let's say I have two collections, A and B, and a single document in A is related to N documents in B. For example, the schemas could look like this:
Collection A:
{id: (int),
propA1: (int),
propA2: (boolean)
}
Collection B:
{idA: (int), # id for document in Collection A
propB1: (int),
propB2: (...),
...
propBN: (...)
}
I want to return properties propB2-BN and propA2 from my API, and only return information where (for example) propA2 = true, propB6 = 42, and propB1 = propA1.
This is normally fairly simple - I query Collection B to find documents where propB6 = 42, collect the idA values from the result, query Collection A with those values, and filter the results with the Collection A documents from the query.
However, adding skip and limit parameters to this seems impossible to do while keeping the behavior users would expect. Naively applying skip and limit to the first query means that, since filtering occurs after the query, less than limit documents could be returned. Worse, in some cases no documents could be returned when there are actually still documents in the collection to be read. For example, if the limit was 10 and the first 10 Collection B documents returned pointed to a document in Collection A where propA2 = false, the function would return nothing. Then the user would assume there's nothing left to read, which may not be the case.
A slightly less naive solution is to simply check if the return count is < limit, and if so, repeat the queries until the return count = limit. The problem here is that skip/limit queries where the user would expect exclusive sets of documents returned could actually return the same documents.
I want to apply skip and limit at the mongo query level, not at the API level, because the results of querying collection B could be very large.
MapReduce and the aggregation framework appear to only work on a single collection, so they don't appear to be alternatives.
This seems like something that'd come up a lot in Mongo use - any ideas/hints would be appreciated.
Note that these posts ask similar sounding questions but don't actually address the issues raised here.

Sounds like you already have a solution (2).
You cannot optimize/skip/limit on first query, depending on search you can perhaps do it on second query.
You will need a loop around it either way, like you write.
I suppose, the .skip will always be costly for you, since you will need to get all the results and then throw them away, to simulate the skip, to give the user consistent behavior.
All the logic would have to go to your loop - unless you can match in a clever way to second query (depending on requirements).
Out of curiosity: Given the time passed, you should have a solution by now?!

Related

Pymongo - find multiple different documents

my question is very similar to how-to-get-multiple-document-using-array-of-mongodb-id, however, I would like to find multiple documents without using the _id.
That is, consider that I have documents such as
the
document = { _id: _id, key_1: val_1, key_2: val_2, key_3: val_3}
I need to be able to .find() by multiple parameters, as for example,
query_1 = {key_1: foo, key_2: bar}
query_2 = {key_1: foofoo, key_2: barbar}
Right now, I am running a query for query_1, followed by a query for query_2.
As it turns out, this method is extremely inefficient.
I tried to add concurrency as to make it faster, but the speedup was not even 2x.
Is it possible to query multiple documents at once?,
I am looking for a method that returns the union of the matches for query_1 AND query_2.
If this is not possible, do you have any suggestions that might speed a query of this type?
Thank you for your help.

Why MongoDB find has same performance as count

I am running tests against my MongoDB and for some reason find has the same performance as count.
Stats:
orders collection size: ~20M,
orders with product_id 6: ~5K
product_id is indexed for improved performance.
Query: db.orders.find({product_id: 6}) vs db.orders.find({product_id: 6}).count()
result the orders for the product vs 5K after 0.08ms
Why count isn't dramatically faster? it can find the first and last elements position with the product_id index
As Mongo documentation for count states, calling count is same as calling find, but instead of returning the docs, it just counts them. In order to perform this count, it iterates over the cursor. It can't just read the index and determine the number of documents based on first and last value of some ID, especially since you can have index on some other field that's not ID (and Mongo IDs are not auto-incrementing). So basically find and count is the same operation, but instead of getting the documents, it just goes over them and sums their number and return it to you.
Also, if you want a faster result, you could use estimatedDocumentsCount (docs) which would go straight to collection's metadata. This results in loss of the ability to ask "What number of documents can I expect if I trigger this query?". If you need to find a count of docs for a query in a faster way, then you could use countDocuments (docs) which is a wrapper around an aggregate query. From my knowledge of Mongo, the provided query looks like a fastest way to count query results without calling count. I guess that this should be preferred way regarding performances for counting the docs from now on (since it's introduced in version 4.0.3).

MongoDB $in not only one result in case of repeated elements

I need to get the users whose ids are contained in an array. For this i'm using the $in operator, however being this inside an aggregate operation, i'd like to get back a specific user all the time it's id is present in the array, not just one. For example:
The ids array is A=[a,b,c,b] and U(x) is user with id x
with users.find({_id:{$in:A}}) i get these users as result: U(a),U(b),U(c)
instead i'd like to get back the result: U(a),U(b),U(c),U(b)
so get the user back every time it's id appears.
I understand that $in is working as expected but does anyone have an idea on how can i achieve this?
Thanks
This isn't possible using a MongoDB query.
MongoDB's query engine iterates over the documents in a collection (or over an index if there's a useful one) and returns to you any documents that match your query, in the order it finds them. Whether b appears once, twice, or a hundred times in your query makes no difference: the document with _id of b matches the query and is returned once, when MongoDB finds it.
You can do a post-processing step in your programming language to repeat documents as many times as you want.

Mongo query for number of items in a sub collection

This seems like it should be very simple but I can't get it to work. I want to select all documents A where there are one or more B elements in a sub collection.
Like if a Store document had a collection of Employees. I just want to find Stores with 1 or more Employees in it.
I tried something like:
{Store.Employees:{$size:{$ne:0}}}
or
{Store.Employees:{$size:{$gt:0}}}
Just can't get it to work.
This isn't supported. You basically only can get documents in which array size is equal to the value. Range searches you can't do.
What people normally do is that they cache array length in a separate field in the same document. Then they index that field and make very efficient queries.
Of course, this requires a little bit more work from you (not forgetting to keep that length field current).

How does MongoDB evaluates multiple $or statements?

How will MongoDB evaluate this query:
db.testCol.find(
{
"$or" : [ {a:1, b:12}, {b:9, c:15}, {c:10, d:"foo"} ]
});
When scanning values in a document if first OR statement is TRUE will the other statements be also be evaluated?
Logically if the MongoDB is optimized other values in OR statement should not be evaluated, but I don't know how MongoDB is implemented.
UPDATE:
I updated my query because it was wrong and it didn't explain correctly what I was trying to accomplish. I need to find a set of documents that have different properties and if an exact combination of these properties is found the document must be returned.
The SQL equivalent of my query would be:
SELECT * FROM testCol
WHERE (a = 1 AND b = 12) OR (b = 9 AND c = 15) OR (c = 10 AND d = 'foo');
MongoDB will execute each clause of the $or operation as a seperate query and remove duplicates as a post processing pass. As such each clause can use a seperate index which is often very useful.
In other words, it will NOT look at 1 document, see which of the OR clauses apply and do an early-out if the first clause is a match. Rather it does a full dataset query per clause and de-dupe after the fact. This may seem less than efficient but in practice it's almost always faster since the first approach would only be able to hit at most one index for all clauses which is rarely efficient.
EDIT: Mongo only skips documents during the de-duplication process, not during the table scans.
Mongo won't check documents that are already part of the result set. So if your first {a:1, b:12} returns 100% of the documents, Mongo is done.
You want to put whatever will grab the most documents as your first evaluated statement because of this. If your first item only grabs 1% of documents, the subsequent item will need to scan the other 99%.
That being said, you are using $or to look for values in a single key. I think you want to use $in for this.
See here for more:
http://books.google.com/books?id=BQS33CxGid4C&lpg=PA48&ots=PqvQJPRUoe&dq=mongo%20tips%20and%20tricks%20%22OR-query%22&pg=PA48#v=onepage&q&f=false