I have Sphinx 2.2.7.
In this version, "max_matches" is deprecated. What could I use instead?
My queries return on average 20,000 rows.
It's just the server-wide 'cap' that is being deprecated.
max_matches still exists as a query-time parameter. It defaults to 1000, but can be overridden on a per-query basis.
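For example, in SphinxQL you can raise it per query (the index and search term here are illustrative):

```sql
SELECT * FROM my_index
WHERE MATCH('keyword')
LIMIT 0, 1000
OPTION max_matches=20000;
```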
In db['TF'] I have about 60 million records.
I need to get the quantity of the records.
If I run db['TF'].count(), it returns at once.
If I run db['TF'].count_documents({}), it takes a very long time to return a result.
However, the count method will be deprecated.
So, how can I get the count quickly when using count_documents? Are there any arguments I missed?
I have read the docs and the code, but found nothing.
Thanks a lot!
This is not about PyMongo but about MongoDB itself.
count is a native Mongo function. It doesn't really count all the documents. Whenever you insert or delete a record in Mongo, it caches the total number of records in the collection. Then when you run count, Mongo will return that cached value.
count_documents takes a query filter, which means the server has to actually examine documents in order to produce the count. Because you're not passing any filter criteria, it has to run over all 60 million records. This is why it is slow.
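A toy model of this difference, in plain Python (this is an illustration of the idea, not MongoDB's actual internals): a cached counter updated on every insert, versus a count that must scan every document.

```python
class TinyCollection:
    """Toy model: a cached document count (like Mongo's collection
    metadata) vs. a full scan (like count_documents)."""

    def __init__(self):
        self.docs = []
        self.cached_count = 0  # updated on every insert, like Mongo's metadata

    def insert(self, doc):
        self.docs.append(doc)
        self.cached_count += 1

    def fast_count(self):
        # O(1): just return the cached value (what count()/estimated_document_count do)
        return self.cached_count

    def scan_count(self, predicate=lambda d: True):
        # O(n): must examine every document (what count_documents does)
        return sum(1 for d in self.docs if predicate(d))


c = TinyCollection()
for i in range(1000):
    c.insert({'i': i})
print(c.fast_count(), c.scan_count())
```

With a filter, only the scanning version can answer at all, which is why count_documents cannot simply reuse the cached value.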
Based on @Stennie's comment:
You can use estimated_document_count() in PyMongo 3.7+ to return the fast count based on collection metadata. The original count() was deprecated because its behaviour differed (estimated vs. actual count) depending on whether query criteria were provided. The newer driver API is more intentional about the outcome.
As already mentioned here, the behavior is not specific to PyMongo.
The reason is that the count_documents method in PyMongo performs an aggregation query and does not use any metadata; see collection.py#L1670-L1688:
pipeline = [{'$match': filter}]
if 'skip' in kwargs:
    pipeline.append({'$skip': kwargs.pop('skip')})
if 'limit' in kwargs:
    pipeline.append({'$limit': kwargs.pop('limit')})
pipeline.append({'$group': {'_id': None, 'n': {'$sum': 1}}})
cmd = SON([('aggregate', self.__name),
           ('pipeline', pipeline),
           ('cursor', {})])
if "hint" in kwargs and not isinstance(kwargs["hint"], string_type):
    kwargs["hint"] = helpers._index_document(kwargs["hint"])
collation = validate_collation_or_none(kwargs.pop('collation', None))
cmd.update(kwargs)
with self._socket_for_reads(session) as (sock_info, slave_ok):
    result = self._aggregate_one_result(
        sock_info, slave_ok, cmd, collation, session)
if not result:
    return 0
return result['n']
This command has the same behavior as the collection.countDocuments method.
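Stripped of the driver plumbing, the pipeline that count_documents builds can be reproduced in plain Python (the helper name here is mine, not PyMongo's):

```python
def build_count_pipeline(filter=None, skip=None, limit=None):
    """Mirror the aggregation pipeline PyMongo builds for count_documents:
    a $match, optional $skip/$limit, then a counting $group."""
    pipeline = [{'$match': filter or {}}]
    if skip is not None:
        pipeline.append({'$skip': skip})
    if limit is not None:
        pipeline.append({'$limit': limit})
    pipeline.append({'$group': {'_id': None, 'n': {'$sum': 1}}})
    return pipeline


print(build_count_pipeline(skip=5, limit=10))
```

The $group stage at the end is what forces the server to feed every matching document through the pipeline, which is the cost you are seeing with an empty filter.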
That being said, if you are willing to trade accuracy for performance, you can use the estimated_document_count method, which instead sends a count command to the database with the same behavior as collection.estimatedDocumentCount; see collection.py#L1609-L1614:
if 'session' in kwargs:
    raise ConfigurationError(
        'estimated_document_count does not support sessions')
cmd = SON([('count', self.__name)])
cmd.update(kwargs)
return self._count(cmd)
where self._count is a helper that sends the command.
My collection name is trial and its data size is 112 MB.
My query is,
db.trial.find()
and I have added a limit of up to 10:
db.trial.find.limit(10)
but the limit is not working; the entire collection is being scanned.
Replace
db.trial.find.limit(10)
with
db.trial.find().limit(10)
You also mention that the entire collection is being scanned? Run this:
db.trial.find().limit(10).explain()
It will tell you how many documents it examined before stopping the query (nscanned). You will see that nscanned is 10.
The .limit() modifier on its own will only "limit" the results of the query that is processed, so it works as designed to "limit" the results returned. In a raw form with no query, you should just see nscanned equal to the limit you want:
db.trial.find().limit(10)
If your intent is to only operate on a set number of documents you can alter this with the $maxScan modifier:
db.trial.find({})._addSpecial( "$maxScan" , 11 )
Which causes the query engine to "give up" after the set number of documents have been scanned. But that should only really matter when there is something meaningful in the query.
If you are actually trying to do "paging", then you are better off using "range" queries with $gt and $lt and cousins to effectively change the range of selection that is done in your query.
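The range-paging idea can be sketched in plain Python (the list below stands in for the collection; against a real driver the equivalent would be find({'_id': {'$gt': last_id}}).sort('_id').limit(page_size)):

```python
# In-memory stand-in for a collection, keyed by _id.
docs = [{'_id': i, 'v': i * i} for i in range(1, 26)]


def next_page(last_id, page_size):
    # Equivalent of: find({'_id': {'$gt': last_id}}).sort('_id').limit(page_size)
    matching = [d for d in docs if d['_id'] > last_id]
    matching.sort(key=lambda d: d['_id'])
    return matching[:page_size]


page1 = next_page(0, 10)                  # first 10 documents
page2 = next_page(page1[-1]['_id'], 10)   # next 10, anchored on the last _id seen
```

Unlike skip(), each page starts from an indexed range bound, so the server never has to walk past the documents already paged over.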
Is count() in MongoDB exact for many documents, or is it an approximate number? And if it's not, is there any function that returns the exact number?
If your MongoDB version is 4.0.3 or higher, use this for an accurate count:
db.collection.countDocuments({})
According to the latest documentation, the count() function can be inaccurate in cases such as an unclean shutdown, and it will be deprecated.
Check this out:
https://docs.mongodb.com/manual/reference/method/db.collection.countDocuments/
It is an exact count. If it were not an exact count, the documentation would reflect that.
Counts the number of documents in a collection.
Reference - http://docs.mongodb.org/manual/reference/command/count/
I'm doing a query where all I want to know is whether there is at least one document in the collection that matches the query, so I pass limit=1 to find(). All I care about is whether the count() of the returned cursor is > 0. Would it be faster to use count(with_limit_and_skip=True) or just count()? Intuitively it seems like I should pass with_limit_and_skip=True, because if there are a whole bunch of matching documents then the count could stop at my limit of 1.
Maybe this merits an explanation of how limits and skips work under the covers in mongodb/pymongo.
Thanks!
Your intuition is correct. That's the whole point of the with_limit_and_skip flag.
With with_limit_and_skip=False, count() has to count all the matching documents, even if you use limit=1, which is pretty much guaranteed to be slower.
From the docs:
Returns the number of documents in the results set for this query. Does not take limit() and skip() into account by default - set with_limit_and_skip to True if that is the desired behavior.
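A toy illustration of why honoring the limit lets the count stop early (plain Python generators, not Mongo internals):

```python
import itertools

scanned = 0


def scan_matches(total=1_000_000):
    # Pretend scan over matching documents, tracking how many were examined.
    global scanned
    for i in range(total):
        scanned += 1
        yield i


# "count" with the limit applied: stop as soon as one match is seen.
exists = sum(1 for _ in itertools.islice(scan_matches(), 1)) > 0
print(exists, scanned)  # only one document was examined
```

Without the limit applied, the equivalent of sum(1 for _ in scan_matches()) would walk all one million entries before answering.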
I'm querying for documents that are close to a location ($near and $maxDistance) and fall within a date range (an $or with a 3 sets of $gt/$lt conditions relating to dates/schedules).
I find that $cursor->count() always returns 100 when there are 100 or more results, regardless of limit().
It seems like $cursor->skip()->limit() works fine, allowing me to skip more than 100 results (when there are more than 100), but it bothers me that count() always returns 100 and there seems to be no way to determine the full count (other than paging until there are no more results).
I find references to map reduce not working correctly with geospatial, and the mongodb docs reference a default limit() of 100.
The above query finds the closest points to (50,50) and returns them sorted by distance (there is no need for an additional sort parameter). Use limit() to specify a maximum number of points to return (a default limit of 100 applies if unspecified):
Is this a known issue? I'm using the PHP driver.
We've been waiting for $or/$and support in geospatial queries for a year:
Estimate: Medium ( < 1 week)
Fix Version/s: planned but not scheduled
https://jira.mongodb.org/browse/SERVER-3984
Maybe they'll support this by 2014 ;)
http://pastebin.com/raw.php?i=FD3xe6Jt
http://www.zopyx.de/blog/goodbye-mongodb
http://blog.engineering.kiip.me/post/20988881092/a-year-with-mongodb
http://blog.schmichael.com/2011/11/05/failing-with-mongodb/