Perl MongoDB $collection->find: How many roundtrips to MongoDB while fetching? - perl

If my collection has 10 records:
my $records = $collection->find;
while (my $record = $records->next) {
    # do something with $record
}
Are there ten roundtrips to the MongoDB server?
If so, is there any way to limit it to one roundtrip?
Thanks.

The answer is that it's just one query per batch of records/documents, returned in groups of 100 by default.
If your result set is 250 docs, the first access of the cursor (to get doc 1) loads docs 1-100 into memory; when doc 101 is accessed, another 100 docs are fetched from the server; and finally one more query returns the last 50 docs.
See the MongoDB docs about cursors and the "getmore" command.
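For illustration, here is a minimal sketch of the same behaviour in PyMongo (the batching model is the same across drivers; the connection string, collection name and explicit batch size of 100 are assumptions, not something from the question):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed local server
coll = client.test.posts                            # hypothetical collection

# The query is sent when the first document is requested; subsequent
# documents are served from the current batch, and the driver issues a
# "getmore" behind the scenes only when a batch is exhausted.
cursor = coll.find({}).batch_size(100)

for doc in cursor:
    print(doc["_id"])   # no extra roundtrip per document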

It's a single query, just like querying an RDBMS.
As per the documentation:
my $cursor = $collection->find({ i => { '$gt' => 42 } });
Executes the given $query and returns a MongoDB::Cursor with the results
my $cursor = $collection->query({ }, { limit => 10, skip => 10 });
Valid query attributes are:
limit - Limit the number of results.
skip - Skip a number of results.
sort_by - Order results.

No, I am absolutely sure that in the above code there is only one roundtrip to the server. For example, in C# the same code loads all the data only once, when you start iterating.
while (my $record = $records->next){
^^^
here, on the first iteration, the driver loads all 10 records
It seems logical to me to have only one request to the server.
From the documentation:
The shell find() method returns a cursor object which we can then iterate to retrieve specific documents from the result.

You can use the "mongosniff" tool to figure out the operations over the wire. Apart from that, you basically have no other option than iterating over the cursor... so why do you care?

Related

Why is PyMongo count_documents slower than count?

In db['TF'] I have about 60 million records.
I need to get the quantity of the records.
If I run db['TF'].count(), it returns at once.
If I run db['TF'].count_documents({}), it takes a very long time before I get the result.
However, the count method will be deprecated.
So, how can I get the quantity quickly when using count_documents? Are there some arguments I missed?
I have read the docs and the code, but found nothing.
Thanks a lot!
This is not about PyMongo but Mongo itself.
count is a native Mongo function. It doesn't really count all the documents. Whenever you insert or delete a record in Mongo, it caches the total number of records in the collection. Then when you run count, Mongo will return that cached value.
count_documents uses a query object, which means that it has to loop through all the records in order to get the total count. Because you're not passing any parameters, it will have to run over all 60 million records. This is why it is slow.
Based on @Stennie's comment:
You can use estimated_document_count() in PyMongo 3.7+ to return the fast count based on collection metadata. The original count() was deprecated because its behaviour differed (estimated vs actual count) depending on whether query criteria were provided. The newer driver API is more intentional about the outcome.
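A minimal PyMongo sketch of the difference (the connection string and database name are assumptions; the 'TF' collection is taken from the question):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed connection string
coll = client["mydb"]["TF"]                          # database name is an assumption

# Fast: reads collection metadata, no documents are scanned.
approx = coll.estimated_document_count()

# Accurate but slow on ~60M docs: runs an aggregation over the collection.
exact = coll.count_documents({})

print(approx, exact)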
As already mentioned here, the behavior is not specific to PyMongo.
The reason is that the count_documents method in PyMongo performs an aggregation query and does not use any metadata; see collection.py#L1670-L1688:
pipeline = [{'$match': filter}]
if 'skip' in kwargs:
    pipeline.append({'$skip': kwargs.pop('skip')})
if 'limit' in kwargs:
    pipeline.append({'$limit': kwargs.pop('limit')})
pipeline.append({'$group': {'_id': None, 'n': {'$sum': 1}}})
cmd = SON([('aggregate', self.__name),
           ('pipeline', pipeline),
           ('cursor', {})])
if "hint" in kwargs and not isinstance(kwargs["hint"], string_type):
    kwargs["hint"] = helpers._index_document(kwargs["hint"])
collation = validate_collation_or_none(kwargs.pop('collation', None))
cmd.update(kwargs)
with self._socket_for_reads(session) as (sock_info, slave_ok):
    result = self._aggregate_one_result(
        sock_info, slave_ok, cmd, collation, session)
    if not result:
        return 0
    return result['n']
This command has the same behavior as the collection.countDocuments method.
That being said, if you are willing to trade accuracy for performance, you can use the estimated_document_count method which, on the other hand, sends a count command to the database with the same behavior as collection.estimatedDocumentCount; see collection.py#L1609-L1614:
if 'session' in kwargs:
    raise ConfigurationError(
        'estimated_document_count does not support sessions')
cmd = SON([('count', self.__name)])
cmd.update(kwargs)
return self._count(cmd)
Where self._count is a helper sending the command.

MongoDB: Skip collection values from in between (not normal pagination)

I have browsed through various examples but have failed to find what I am looking for. What I want is to fetch a specific document by _id and skip over multiple elements of its array in one query, or some alternative which is fast enough for my case.
The following query would skip the first comment and return the second:
db.posts.find( { "_id" : 1 }, { comments: { $slice: [ 1, 1 ] } } )
That is: skip 0, return 1, and leave the rest out of the result.
But what if there were, say, 10000 comments and I wanted to use the same pattern, but return the array values like this:
skip 0, return 1, skip 2, return 3, skip 4, return 5
That would return a document whose comments array has a size of 5000, because half of them are skipped. Is this possible? I used a large number like 10000 because I fear that running multiple queries to achieve this would not be performant (example shown here: multiple queries to accomplish something similar). Thanks!
I went through several resources and concluded that this is currently impossible to do with one query. Instead, I settled on there being only two options to overcome this problem:
1.) Make a loop of some sort and run several slice queries while increasing the position of the slice, similar to the resource I linked:
var skip = NUMBER_OF_ITEMS * (PAGE_NUMBER - 1)
db.companies.find({}, { comments: { $slice: [skip, NUMBER_OF_ITEMS] } })
However, depending on the data, I would not want to run 5000 individual queries just to get half of the array contents, so I decided to use option 2.), which seems relatively fast and reasonably performant to me.
2.) Make a single query by _id for the document you want and, before returning the results to the client or some other part of your code, drop the unwanted array items with a for loop and then return the results. I did this on the Java side since I talk to Mongo via Morphia. I also ran explain() on the query and saw that returning a single document with an array of 10000 items while filtering on _id is so fast that speed wasn't really an issue; I suspect slice-skipping would only be slower.
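For reference, the same idea in PyMongo might look roughly like this (connection string, database name and the post _id are assumptions; 'posts' and 'comments' follow the question):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed server
posts = client.test.posts                           # 'posts' as in the question

# Single query by _id; only the comments array is projected back.
post = posts.find_one({"_id": 1}, {"comments": 1})

# Drop every other comment on the application side
# (keep indexes 1, 3, 5, ... to mimic "skip 0, return 1, skip 2, return 3, ...").
kept = post["comments"][1::2] if post else []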

How to obtain adjacent records

For example, to obtain the posts before and after a record, based on the created time field.
I tried to use the following statements to obtain the articles:
# created is the creation time of the current article
# Previous post
prev_post = db.Post.find({'created': {'$lt': created}}, sort=[('created', -1)], limit=1)
# Next post
next_post = db.Post.find({'created': {'$gt': created}}, sort=[('created', 1)], limit=1)
The results turn out to be discontinuous and sometimes skip several records. I don't know why; maybe I misunderstand find()?
Help please.
It may indeed seem strange behaviour, but MongoDB does not guarantee the order of stored records unless you're querying an array (in which elements are kept in insertion order). I believe what MongoDB does is reach the first document that matches your query and return it.
Bottom line: if the logic requires neighbouring records, use arrays.
I think the problem is your created field: if the record times are the same, the system doesn't know which one to choose. You can use _id instead; give it a try.
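A rough PyMongo sketch of that idea, tie-breaking on _id when created values collide (the connection string and database name are assumptions; the Post collection and created field follow the question):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed server
posts = client.test.Post                            # 'Post' as in the question

def neighbours(current):
    # Previous post: strictly earlier, or same time but smaller _id.
    prev_post = posts.find_one(
        {"$or": [
            {"created": {"$lt": current["created"]}},
            {"created": current["created"], "_id": {"$lt": current["_id"]}},
        ]},
        sort=[("created", -1), ("_id", -1)],
    )
    # Next post: strictly later, or same time but larger _id.
    next_post = posts.find_one(
        {"$or": [
            {"created": {"$gt": current["created"]}},
            {"created": current["created"], "_id": {"$gt": current["_id"]}},
        ]},
        sort=[("created", 1), ("_id", 1)],
    )
    return prev_post, next_post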

MongoDB ODM SELECT COUNT(*) equivalent

I wonder if there is an equivalent to the MySQL-Query:
SELECT COUNT(*) FROM users
in MongoDB ODM?
This might work:
$qb = $this->dm->createQueryBuilder('Documents\Functional\Users');
$qb->select('id');
$query = $qb->getQuery();
$results = $query->execute();
echo $query->count();
But aren't all the IDs returned then, and how does that affect performance if there are more complex documents in the database? I don't want to send too much data around just to get a count.
A small contribution:
if you run the count this way:
$count = $this->dm->createQueryBuilder('Documents\Functional\Users')
->getQuery()->execute()->count();
Doctrine runs this query:
db.collection.find();
however, if the code is as follows:
$count = $this->dm->createQueryBuilder('Documents\Functional\Users')
->count()->getQuery()->execute();
Doctrine in this case runs this query:
db.collection.count();
I do not know how big the performance improvement is, but I think the second form is the more optimal one.
I hope that is helpful.
$count = $this->dm->createQueryBuilder('Documents\Functional\Users')
->getQuery()->execute()->count();
The above will give you the number of documents inside a collection of Users. The query in question doesn't return all of the documents and then count them. It generates a cursor to the collection and from there it knows the count. Only once you start to iterate over the cursor does the driver start pulling data from the database.
A handy operator for performance is the eagerCursor(true) which will retrieve all the data in the query before hydration and close the cursor. Use this if you know the data you want to get and you'll be finished with it after the query.
Eager Cursor
If you have references that you know you will be iterating over, use the prime(true) method on them.
Prime
If you want to return all the elements' raw data, you can use the hydrate(false) method in the query to disable the hydration system.
For Doctrine ODM 2 you can switch the query type to count before calling getQuery():
return $this->createQueryBuilder()
    ->field('storage')->equals($storage)
    ->field('priority')->in($priorities)
    ->count()
    ->getQuery()
    ->execute();

Slow pagination over tons of records in mongodb

I have over 300k records in one collection in Mongo.
When I run this very simple query:
db.myCollection.find().limit(5);
It takes only a few milliseconds.
But when I use skip in the query:
db.myCollection.find().skip(200000).limit(5)
It won't return anything... it runs for minutes and returns nothing.
How to make it better?
One approach to this problem, if you have large quantities of documents and you are displaying them in sorted order (I'm not sure how useful skip is if you're not) would be to use the key you're sorting on to select the next page of results.
So if you start with
db.myCollection.find().limit(100).sort({created_date: 1});
and then extract the created date of the last document returned by the cursor into a variable max_created_date_from_last_result, you can get the next page with the far more efficient (presuming you have an index on created_date) query
db.myCollection.find({created_date : { $gt : max_created_date_from_last_result } }).limit(100).sort({created_date: 1});
From MongoDB documentation:
Paging Costs
Unfortunately skip can be (very) costly and requires the server to walk from the beginning of the collection, or index, to get to the offset/skip position before it can start returning the page of data (limit). As the page number increases skip will become slower and more cpu intensive, and possibly IO bound, with larger collections.
Range based paging provides better use of indexes but does not allow you to easily jump to a specific page.
You have to ask yourself a question: how often do you really need the 40000th page? Also see this article.
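The same range-based (keyset) pagination looks roughly like this in PyMongo, looping through all pages (the connection string and collection name are assumptions; created_date as above, ideally indexed and with unique values, otherwise add a tie-breaker such as _id):

from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")   # assumed server
coll = client.test.myCollection                     # collection name from the question

page_size = 100
last_created_date = None                            # None => start from the beginning

while True:
    query = {}
    if last_created_date is not None:
        query = {"created_date": {"$gt": last_created_date}}

    page = list(coll.find(query).sort("created_date", ASCENDING).limit(page_size))
    if not page:
        break                                       # no more documents

    for doc in page:
        print(doc["_id"])                           # process the document here

    last_created_date = page[-1]["created_date"]    # anchor for the next page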
I found it performant to combine the two concepts together (both a skip+limit and a find+limit). The problem with skip+limit is poor performance when you have a lot of docs (especially larger docs). The problem with find+limit is you can't jump to an arbitrary page. I want to be able to paginate without doing it sequentially.
The steps I take are:
Create an index based on how you want to sort your docs, or just use the default _id index (which is what I used)
Know the starting value, page size and the page you want to jump to
Project + skip + limit the value you should start from
Find + limit the page's results
It looks roughly like this if I want to get page 5432 of 16 records (in javascript):
let page = 5432;
let page_size = 16;
let skip_size = page * page_size;
let retval = await db.collection(...).find().sort({ "_id": 1 }).project({ "_id": 1 }).skip(skip_size).limit(1).toArray();
let start_id = retval[0]._id;
retval = await db.collection(...).find({ "_id": { "$gte": new mongo.ObjectID(start_id) } }).sort({ "_id": 1 }).project(...).limit(page_size).toArray();
This works because a skip on a projected index is very fast even if you are skipping millions of records (which is what I'm doing). If you run explain("executionStats"), it still has a large number for totalDocsExamined, but because of the projection on an index it's extremely fast (essentially, the data blobs are never examined). Then, with the value for the start of the page in hand, you can fetch the next page very quickly.
I combined two answers.
The problem is that when you use skip and limit without a sort, the results are just paginated in the order of the collection, i.e. the same sequence in which you wrote the data, so the engine needs to build a temporary index first. It is better to use the ready-made _id index :) You need to sort by _id. Then it is very fast even with large collections, like this:
db.myCollection.find().skip(4000000).limit(1).sort({ "_id": 1 });
In PHP it will be
$manager = new \MongoDB\Driver\Manager("mongodb://localhost:27017", []);
$options = [
    'sort' => array('_id' => 1),
    'limit' => $limit,
    'skip' => $skip,
];
$where = [];
$query = new \MongoDB\Driver\Query($where, $options);
$get = $manager->executeQuery("namedb.namecollection", $query);
I'm going to suggest a more radical approach. Combine skip/limit (as an edge case really) with sort range based buckets and base the pages not on a fixed number of documents, but a range of time (or whatever your sort is). So you have top-level pages that are each range of time and you have sub-pages within that range of time if you need to skip/limit, but I suspect the buckets can be made small enough to not need skip/limit at all. By using the sort index this avoids the cursor traversing the entire inventory to reach the final page.
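A hypothetical sketch of that bucketing in PyMongo (the bucket length, connection string, collection and field names are all made up for illustration):

from datetime import datetime, timedelta
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")   # assumed server
coll = client.test.myCollection                     # hypothetical collection

def bucket_page(bucket_start, bucket_days, sub_page, sub_page_size):
    # A top-level "page" is a time bucket; skip/limit is only ever applied
    # inside that bucket, so the cursor never walks past the bucket's range.
    bucket_end = bucket_start + timedelta(days=bucket_days)
    return list(
        coll.find({"created_date": {"$gte": bucket_start, "$lt": bucket_end}})
            .sort("created_date", ASCENDING)
            .skip(sub_page * sub_page_size)          # small skip within the bucket only
            .limit(sub_page_size)
    )

# e.g. the second sub-page of the one-day bucket starting 2020-01-01
docs = bucket_page(datetime(2020, 1, 1), 1, 1, 50)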
My collection has around 1.3M documents (not that big), properly indexed, but it still takes a big performance hit from this issue.
After reading the other answers, the way forward is clear: the paginated collection must be sorted by a counting integer, similar to SQL's auto-increment value, instead of a time-based value.
The problem is with skip; there is no way around it: if you use skip, you are bound to hit the issue when your collection grows.
Using a counting integer with an index allows you to jump using the index instead of skip. This won't work with a time-based value, because you can't calculate where to jump based on time, so skipping is the only option in that case.
On the other hand, by assigning a counting number to each document, write performance takes a hit, because all documents must be inserted sequentially. This is fine for my use case, but I know the solution is not for everyone.
The most upvoted answer doesn't seem applicable to my situation, but this one does. (I need to be able to seek forward by arbitrary page number, not just one at a time.)
Plus, it is also hard if you are dealing with deletes, but it is still possible because MongoDB supports $inc with a negative value for batch updating. Luckily I don't have to deal with deletion in the app I am maintaining.
Just write this down as a note to my future self. It is probably too much hassle to fix this issue with the current application I am dealing with, but next time, I'll build a better one if I were to encounter a similar situation.
If you have Mongo's default ObjectId _id, use it instead. This is probably the most viable option for most projects anyway.
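A minimal sketch of the counting-integer idea in PyMongo (the seq field, the counters collection and all names are assumptions for illustration, not part of the answer above):

from pymongo import MongoClient, ASCENDING, ReturnDocument

client = MongoClient("mongodb://localhost:27017")   # assumed server
db = client.test
coll = db.myCollection                              # hypothetical collection
coll.create_index([("seq", ASCENDING)])             # index on the counter field

def insert_with_seq(doc):
    # Assign an auto-incrementing counter on insert (writes become sequential).
    counter = db.counters.find_one_and_update(
        {"_id": "myCollection"},
        {"$inc": {"value": 1}},
        upsert=True,
        return_document=ReturnDocument.AFTER,
    )
    doc["seq"] = counter["value"]                    # seq starts at 1
    coll.insert_one(doc)

def fetch_page(page, page_size):
    # Jump straight to any page via a range on seq; no skip() involved.
    start = page * page_size + 1                     # page is 0-based, seq is 1-based
    return list(
        coll.find({"seq": {"$gte": start, "$lt": start + page_size}})
            .sort("seq", ASCENDING)
    )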
As stated in the official Mongo docs:
The skip() method requires the server to scan from the beginning of
the input results set before beginning to return results. As the
offset increases, skip() will become slower.
Range queries can use indexes to avoid scanning unwanted documents,
typically yielding better performance as the offset grows compared to
using skip() for pagination.
Descending order (example):
function printStudents(startValue, nPerPage) {
    let endValue = null;
    db.students.find( { _id: { $lt: startValue } } )
        .sort( { _id: -1 } )
        .limit( nPerPage )
        .forEach( student => {
            print( student.name );
            endValue = student._id;
        } );
    return endValue;
}
Ascending order example here.
If you know the ID of the element after which you want to continue:
db.myCollection.find({_id: {$gt: id}}).limit(5)
This is a little genius solution which works like a charm.
For faster pagination, don't use the skip() function. Use limit() and find(), querying on the last id of the preceding page.
Here is an example where I'm querying over tons of documents using Spring Boot:
Long totalElements = mongockTemplate.count(new Query(), "product");
int page = 0;
Long pageSize = 20L;
String lastId = "5f71a7fe1b961449094a30aa"; // this is the last id of the preceding page
for (int i = 0; i < (totalElements / pageSize); i++) {
    page += 1;
    Aggregation aggregation = Aggregation.newAggregation(
        Aggregation.match(Criteria.where("_id").gt(new ObjectId(lastId))),
        Aggregation.sort(Sort.Direction.ASC, "_id"),
        new CustomAggregationOperation(queryOffersByProduct),
        Aggregation.limit((long) pageSize)
    );
    List<ProductGroupedOfferDTO> productGroupedOfferDTOS = mongockTemplate.aggregate(aggregation, "product", ProductGroupedOfferDTO.class).getMappedResults();
    lastId = productGroupedOfferDTOS.get(productGroupedOfferDTOS.size() - 1).getId();
}