In db['TF'] I have about 60 million records.
I need to get the quantity of the records.
If I run db['TF'].count(), it returns at once.
If I run db['TF'].count_documents({}), it takes a very long time before I get the result.
However, the count method will be deprecated.
So, how can I get the quantity quickly when using count_documents? Are there some arguments I missed?
I have read the doc and code, but nothing found.
Thanks a lot!
This is not about PyMongo but Mongo itself.
count is a native Mongo function. It doesn't really count all the documents. Whenever you insert or delete a record, Mongo updates the cached total number of records in the collection. When you then run count, Mongo returns that cached value.
count_documents uses a query object, which means that it has to loop through all the records in order to get the total count. Because you're passing an empty filter, it has to run over all 60 million records. This is why it is slow.
Based on @Stennie's comment:
You can use estimated_document_count() in PyMongo 3.7+ to return the fast count based on collection metadata. The original count() was deprecated because its behaviour differed (estimated vs. actual count) depending on whether query criteria were provided. The newer driver API is more intentional about the outcome.
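A minimal PyMongo sketch of the trade-off (the connection and database name are placeholders; 'TF' is the collection from the question):

from pymongo import MongoClient

db = MongoClient()['mydb']  # assumed local connection; 'mydb' is hypothetical

# Fast: served from collection metadata, so the count may be approximate.
approx = db['TF'].estimated_document_count()

# Slow on ~60M docs: actually counts every matching document.
exact = db['TF'].count_documents({})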
As already mentioned here, the behavior is not specific to PyMongo.
The reason is that the count_documents method in PyMongo performs an aggregation query and does not use any metadata; see collection.py#L1670-L1688:
pipeline = [{'$match': filter}]
if 'skip' in kwargs:
    pipeline.append({'$skip': kwargs.pop('skip')})
if 'limit' in kwargs:
    pipeline.append({'$limit': kwargs.pop('limit')})
pipeline.append({'$group': {'_id': None, 'n': {'$sum': 1}}})
cmd = SON([('aggregate', self.__name),
           ('pipeline', pipeline),
           ('cursor', {})])
if "hint" in kwargs and not isinstance(kwargs["hint"], string_type):
    kwargs["hint"] = helpers._index_document(kwargs["hint"])
collation = validate_collation_or_none(kwargs.pop('collation', None))
cmd.update(kwargs)
with self._socket_for_reads(session) as (sock_info, slave_ok):
    result = self._aggregate_one_result(
        sock_info, slave_ok, cmd, collation, session)
if not result:
    return 0
return result['n']
This command has the same behavior as the collection.countDocuments method.
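In other words, count_documents({}) behaves roughly like running the aggregation yourself; a sketch against the question's db['TF'] collection (db as in the question):

# Roughly the pipeline count_documents({}) builds for an empty filter:
pipeline = [
    {'$match': {}},
    {'$group': {'_id': None, 'n': {'$sum': 1}}},
]
result = list(db['TF'].aggregate(pipeline))
total = result[0]['n'] if result else 0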
That being said, if you are willing to trade accuracy for performance, you can use the estimated_document_count method, which instead sends a count command to the database, with the same behavior as collection.estimatedDocumentCount. See collection.py#L1609-L1614:
if 'session' in kwargs:
    raise ConfigurationError(
        'estimated_document_count does not support sessions')
cmd = SON([('count', self.__name)])
cmd.update(kwargs)
return self._count(cmd)
Here self._count is a helper that sends the command.
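So, roughly, calling estimated_document_count() amounts to sending the count command with no filter; a hedged equivalent using PyMongo's Database.command (db as in the question):

# The count command without a filter is answered from collection
# metadata, which is why it returns almost instantly:
n = db.command('count', 'TF')['n']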
Related
I have multiple environments with MongoDB on them that store the same types of data (same types of collections, different documents depending on the environment).
I run the following query in MongoDB:
db.incident.count({ $and: [{"tags.display_name": "Policy Violation"},{start_time: {$gte: ISODate("2020-07-11T09:30:04.887Z")}}]})
and I get a number as expected (for example: 279)
But on some of my environments, when I run this query:
db.incident.count({"start_time": {$gte: ISODate("2020-07-11T09:30:04.887Z")}})
I get a lower number (for example, 274 from the same environment as in the example above), which is an impossible result, since every document matched by the first query must also match the second.
I read some documents and found:
Avoid using the db.collection.count() method without a query predicate since without the query predicate, the method returns results based on the collection’s metadata, which may result in an approximate count. In particular, on a sharded cluster, the resulting count will not correctly filter out orphaned documents.
After an unclean shutdown, the count may be incorrect
but I couldn't find the definition of (or any example for) 'query predicate' anywhere.
Can someone please help? How can I get an exact result?
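For reference, a 'query predicate' is simply the filter document passed to count. One way to always get an exact result is countDocuments (count_documents in PyMongo), which never uses the collection metadata. A sketch, assuming db is a PyMongo database handle for the same data:

from datetime import datetime

# count_documents performs a real count even with an empty filter,
# so it is exact (though slower than the metadata-based count):
exact = db.incident.count_documents(
    {'start_time': {'$gte': datetime(2020, 7, 11, 9, 30, 4, 887000)}})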
I am running tests against my MongoDB and for some reason find has the same performance as count.
Stats:
orders collection size: ~20M,
orders with product_id 6: ~5K
product_id is indexed for improved performance.
Query: db.orders.find({product_id: 6}) vs db.orders.find({product_id: 6}).count()
Result: the orders for the product vs. 5K after 0.08 ms.
Why isn't count dramatically faster? It could find the positions of the first and last elements using the product_id index.
As the Mongo documentation for count states, calling count is the same as calling find, but instead of returning the docs, it just counts them. In order to perform this count, it iterates over the cursor. It can't just read the index and determine the number of documents based on the first and last value of some ID, especially since you can have an index on some other field that's not the ID (and Mongo IDs are not auto-incrementing). So basically find and count are the same operation, but instead of returning the documents, count just iterates over them, sums their number, and returns the total to you.
Also, if you want a faster result, you could use estimatedDocumentCount (docs), which goes straight to the collection's metadata. This comes at the loss of the ability to ask "What number of documents can I expect if I trigger this query?". If you need the count of docs for a query in a faster way, you could use countDocuments (docs), which is a wrapper around an aggregate query. From my knowledge of Mongo, the provided query looks like the fastest way to count query results without calling count. I guess this should be the preferred way, performance-wise, for counting docs from now on (it was introduced in version 4.0.3).
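A short PyMongo sketch of the two options described above, using the question's orders collection (db is assumed to be a database handle):

# Exact: counts documents matching the predicate via the aggregation
# pipeline; reasonably fast here thanks to the product_id index.
n_matching = db.orders.count_documents({'product_id': 6})

# Approximate: straight from collection metadata, no predicate allowed,
# but returns almost immediately even on ~20M documents.
n_total = db.orders.estimated_document_count()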
I am using Pymongo (v3.5.1) in a Python v3.6.3 Jupyter notebook.
Problem
Even though I am limiting my results, db.collection.find() is still retrieving all results before returning.
My code:
for post in posts.find({'subreddit_1':"the_donald"}, limit=2):
    print(post)
exit
Background
I have imported the Reddit comment data set (RC_2017-01) from files.pushshift.io and created an index on the subreddit field (subreddit_1).
My Indexes
I believe this is caused by the collection having no index on your query term, as exhibited by the line:
planSummary: COLLSCAN
which means that to answer your query, MongoDB is forced to look at each document in the collection one by one.
Creating an index to support your query should help. You can create an index in the mongo shell by executing:
db.posts.createIndex({'subreddit_1': 1})
This is assuming your collection is named posts.
Please note that creating that index would only help with the query you posted. It's likely that a different index would be needed for different types of queries.
To read more about how indexing works in MongoDB, check out https://docs.mongodb.com/manual/indexes/
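Since the question uses PyMongo, the same index can also be created from Python (assuming posts is the collection object from the question):

# Equivalent of db.posts.createIndex({'subreddit_1': 1}) in the shell:
posts.create_index([('subreddit_1', 1)])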
I think you need to change the query, because in the find() method the 2nd positional parameter is the projection. find() always returns a cursor, and the limit function works on the cursor.
So the syntax should look like below:
for post in posts.find({'subreddit_1':"the_donald"})[<start_index>:<end_index>]:
    print(post)
exit
OR
for post in posts.find({'subreddit_1':"the_donald"}).limit(2):
    print(post)
exit
Please read the docs for details.
I'm trying to get the min and max value from some fields inside a collection. I'm not sure if this:
result = collection.find(date_filter, expected_projection).sort([('attribute', -1)]).limit(1)
is equivalent to this:
result_a = collection.find(date_filter, expected_projection)
result_b = result_a.sort([('attribute', -1)]).limit(1)
I don't want the server to query all the data in result_a from the database. Is the first line of code actually fetching every document in my collection and THEN sorting it, or just fetching the max element in the attribute field?
No, they aren't equivalent; and MongoDB will not return the entire collection to the client - whether or not the attribute field is indexed.
When you chain operators together in a MongoDB command (e.g. find().sort().limit()), it is not treated by the MongoDB server as a set of separate functions to be called sequentially; it is treated as a single query which should be optimised as a whole and executed as a whole on the MongoDB server.
See the documentation on Combining Cursor Methods for another example of how the chaining is not taken as a sequence of independent operations:
The following statements chain cursor methods limit() and sort():
db.bios.find().sort( { name: 1 } ).limit( 5 )
db.bios.find().limit( 5 ).sort( { name: 1 } )
The two statements are equivalent; i.e. the order in which you chain the limit() and the sort() methods is not significant. Both statements return the first five documents, as determined by the ascending sort order on ‘name’.
The first line of code tells MongoDB to return only the document with the highest value for "attribute" (because of the descending sort). If "attribute" is indexed, then MongoDB can directly access only that one document, and not even consider the rest of the collection.
Do this once:
collection.create_index([('attribute', 1)])
Having that index in place means you can find the highest-sorting or lowest-sorting document practically instantly.
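Putting it together, a sketch of fetching both extremes with PyMongo (variable names taken from the question):

# Each query touches only one end of the index on 'attribute':
max_doc = next(collection.find(date_filter, expected_projection)
               .sort([('attribute', -1)]).limit(1), None)
min_doc = next(collection.find(date_filter, expected_projection)
               .sort([('attribute', 1)]).limit(1), None)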
If my collection has 10 records and I run:
my $records = $collection->find;
while (my $record = $records->next) {
    # do something
}
Are there ten roundtrips to the mongodb server?
If so, is there any way to limit it to one roundtrip?
Thanks.
The answer is that it's just one query per batch; records/documents are returned in groups of 100 by default.
If your result set is 250 docs, the first access of the cursor (to get doc 1) will load docs 1-100 into memory; when doc 101 is accessed, another 100 docs are loaded from the server; and finally one more query fetches the last 50 docs.
See the MongoDB docs about cursors and the "getmore" command.
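If the default batch size doesn't suit you, most drivers let you set it per cursor. A sketch in PyMongo (the question is Perl, but the wire behaviour is the same; process is a hypothetical callback):

# Ask the server for 500 documents per getMore, reducing roundtrips:
for record in collection.find().batch_size(500):
    process(record)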
It's a single query, just like querying a RDBMS.
As per the documentation:
my $cursor = $collection->find({ i => { '$gt' => 42 } });
Executes the given $query and returns a MongoDB::Cursor with the results
my $cursor = $collection->query({ }, { limit => 10, skip => 10 });
Valid query attributes are:
limit - Limit the number of results.
skip - Skip a number of results.
sort_by - Order results.
No, I am absolutely sure that in the above code there is only one roundtrip to the server. For example, in C# the same code will load all data only once, when you start iterating.
while (my $record = $records->next){
^^^
here, on the first iteration, the driver loads all 10 records
It seems logical to me to have only one request to the server.
From the documentation:
The shell find() method returns a cursor object which we can then iterate to retrieve specific documents from the result.
You can use the "mongosniff" tool to figure out the operations over the wire. Apart from that: you basically have no other option then iterating over the cursor....so why do you care?