In MongoDB's pymongo, how do I do a count()?

for post in db.datasets.find({"test_set":"abc"}).sort("abc",pymongo.DESCENDING).skip((page-1)*num).limit(num):
How do I get the count()?

Since pymongo version 3.7.0, count() is deprecated; use Collection.count_documents instead. Calling cursor.count or collection.count produces the following warning message:
DeprecationWarning: count is deprecated. Use Collection.count_documents instead.
To use count_documents, the code can be adjusted as follows:
import pymongo
client = pymongo.MongoClient()
col = client[DATABASE][COLLECTION]
find = {"test_set": "abc"}
sort = [("abc", pymongo.DESCENDING)]
skip = 10
limit = 10
# Count the matching documents (after skipping the first `skip` of them)
doc_count = col.count_documents(find, skip=skip)
results = col.find(find).sort(sort).skip(skip).limit(limit)
for doc in results:
    pass  # process document
Note: count_documents is relatively slow compared to the old count method. If an approximate total for the whole collection is enough, you can use collection.estimated_document_count instead. As the name suggests, it returns an estimated number of documents based on collection metadata, and it does not accept a query filter.
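Once the total is known, turning it into paging values is plain arithmetic. Here is a minimal sketch, assuming 1-based page numbers as in the question; page_window and total_pages are hypothetical helper names, not part of pymongo:

```python
import math

def page_window(page, num):
    """Return (skip, limit) for a 1-based page number and a page size."""
    if page < 1 or num < 1:
        raise ValueError("page and num must be positive")
    return (page - 1) * num, num

def total_pages(doc_count, num):
    """Number of pages needed to show doc_count documents, num per page."""
    return math.ceil(doc_count / num)

skip, limit = page_window(page=3, num=10)
print(skip, limit)          # 20 10
print(total_pages(95, 10))  # 10
```

The skip and limit values would be passed to .skip() and .limit(), and doc_count would come from count_documents called without a skip argument.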

If you're using pymongo version 3.7.0 or higher, see this answer instead.
If you want results_count to ignore your limit():
results = db.datasets.find({"test_set":"abc"}).sort("abc",pymongo.DESCENDING).skip((page-1)*num).limit(num)
results_count = results.count()
for post in results:
If you want the results_count to be capped at your limit(), set applySkipLimit to True:
results = db.datasets.find({"test_set":"abc"}).sort("abc",pymongo.DESCENDING).skip((page-1)*num).limit(num)
results_count = results.count(True)
for post in results:

I'm not sure why you want the count if you are already passing the limit num. Anyway, if you want to check it, here is what you should do:
results = db.datasets.find({"test_set":"abc"}).sort("abc",pymongo.DESCENDING).skip((page-1)*num).limit(num)
results_count = results.count(True)
results_count will then be at most num.

Unfortunately I cannot comment on @Sohaib Farooqi's answer. Quick note: although cursor.count() has been deprecated, it was significantly faster than collection.count_documents() in all of my tests when counting all documents in a collection (i.e. filter={}). Running db.currentOp() reveals that collection.count_documents() uses an aggregation pipeline, while cursor.count() doesn't. This might be the cause.

This thread happens to be 11 years old, and as of 2022 the count() function has been deprecated. Here is a way I came up with to count documents in MongoDB using Python. Here is a picture of the code snippet. Making an empty list is not needed; I just wanted to be outlandish. Hope this helps :).

In my case I need the count of matched elements for a given query, and I certainly don't want to run the query twice: once to get the count, and once to get the result set.
I know the query result set is not very big and fits in memory, so I can convert it to a list and take the list length.
This code illustrates the use case:
# pymongo 3.9.0
while not is_over:
    it = items.find({"some": "/value/"}).skip(offset).limit(limit)
    # list() loads the cursor content into memory
    it = list(it)
    # A short page means there is nothing left to fetch
    if len(it) < limit:
        is_over = True
    offset += limit

If you want to use cursor and also want count, you can try this way
# Have 27 items in collection
db = MongoClient(_URI)[DB_NAME][COLLECTION_NAME]
cursor = db.find()
count = db.find().explain().get("executionStats", {}).get("nReturned")
# Output: 27
cursor = db.find().limit(5)
count = db.find().limit(5).explain().get("executionStats", {}).get("nReturned")
# Output: 5
# Can also use cursor
for item in cursor:
    ...
You can read more about it from https://pymongo.readthedocs.io/en/stable/api/pymongo/cursor.html#pymongo.cursor.Cursor.explain

Related

Concatenate pymongo Cursor

How do you concatenate multiple pymongo Cursors? If that is not possible, how do you take results from multiple Cursors and create a new one?
Example :
result1 = db[collection].find(query1)
result2 = db[collection].find(query2)
concat_result = result1 + result2 #something like that.
Update :
All answers here seem to assume that the queries have the same shape. For example, query1 might get 2 documents between two dates, whereas query2 might sort documents by category and be limited to 5 results. $or is too homogeneous for what I need. After concatenating the results of those two queries, I need to sort them based on another key.
For further details, a class Printer needs to receive one, and only one, pymongo.Cursor, and I'm stuck with this.
The easiest way is to use mongo $or operator like
db[collection].find({'$or': [query1, query2]})
Or, if you have to do this in Python, you can use a generator:
def concat_results(*results):
    # Track the _id values already yielded so duplicates are skipped
    ids = set()
    for result in results:
        for v in result:
            if v['_id'] not in ids:
                ids.add(v['_id'])
                yield v
concat_result = list(concat_results(result1, result2))
Yes, the wise solution would be to use $or as stated above.
If you wanted to do it in a Pythonic way, then you could:
a = [item for item in db[collection].find(filters, select_fields)]
b = [item for item in db[collection].find(filters, select_fields)]
c = []
# Interleave the two lists; note that zip stops at the shorter one
for x, y in zip(a, b):
    c += [x, y]
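Since the question also asks for sorting the combined results on another key, here is a minimal sketch that chains the results, drops repeated _id values, and sorts in Python. It assumes everything fits in memory; merge_results and the rank field are hypothetical names, and plain lists of dicts stand in for the two cursors. The result is a list, not a pymongo.Cursor, so the receiving class would have to accept any iterable.

```python
from itertools import chain

def merge_results(*results, sort_key=None, reverse=False):
    """Chain several iterables of documents, dropping repeated _id values."""
    seen = set()
    merged = []
    for doc in chain(*results):
        if doc["_id"] not in seen:
            seen.add(doc["_id"])
            merged.append(doc)
    if sort_key is not None:
        # Sort client-side on the requested key
        merged.sort(key=lambda d: d[sort_key], reverse=reverse)
    return merged

# Plain lists of dicts stand in for the two cursors here.
result1 = [{"_id": 1, "rank": 3}, {"_id": 2, "rank": 1}]
result2 = [{"_id": 2, "rank": 1}, {"_id": 3, "rank": 2}]
merged = merge_results(result1, result2, sort_key="rank")
print([d["_id"] for d in merged])  # [2, 3, 1]
```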

pymongo find().hint('index') does not use index [duplicate]

I'm trying to use the sort feature when querying my mongoDB, but it is failing. The same query works in the MongoDB console but not here. Code is as follows:
import pymongo
from pymongo import Connection
connection = Connection()
db = connection.myDB
print db.posts.count()
for post in db.posts.find({}, {'entities.user_mentions.screen_name':1}).sort({u'entities.user_mentions.screen_name':1}):
    print post
The error I get is as follows:
Traceback (most recent call last):
File "find_ow.py", line 7, in <module>
for post in db.posts.find({}, {'entities.user_mentions.screen_name':1}).sort({'entities.user_mentions.screen_name':1},1):
File "/Library/Python/2.6/site-packages/pymongo-2.0.1-py2.6-macosx-10.6-universal.egg/pymongo/cursor.py", line 430, in sort
File "/Library/Python/2.6/site-packages/pymongo-2.0.1-py2.6-macosx-10.6-universal.egg/pymongo/helpers.py", line 67, in _index_document
TypeError: first item in each key pair must be a string
I found a link elsewhere that says I need to place a 'u' in front of the key if using pymongo, but that didn't work either. Has anyone else got this to work, or is this a bug?
.sort(), in pymongo, takes key and direction as parameters.
So if you want to sort by, let's say, id, then you should use .sort("_id", 1)
For multiple fields:
.sort([("field1", pymongo.ASCENDING), ("field2", pymongo.DESCENDING)])
You can try this:
db.Account.find().sort("UserName")
db.Account.find().sort("UserName",pymongo.ASCENDING)
db.Account.find().sort("UserName",pymongo.DESCENDING)
This also works:
db.Account.find().sort('UserName', -1)
db.Account.find().sort('UserName', 1)
I'm using this in my code; please comment if I'm doing something wrong here, thanks.
Why does Python use a list of tuples instead of a dict?
In older versions of Python (before 3.7), you cannot guarantee that a dictionary will be iterated in the order you declared it.
So, in the mongo shell you could do .sort({'field1':1,'field2':1}) and the interpreter would sort by field1 at the first level and field2 at the second level.
If this syntax were used in such a Python version, there is a chance of sorting by field2 at the first level. With a list of tuples, there is no such risk.
.sort([("field1",pymongo.ASCENDING), ("field2",pymongo.DESCENDING)])
Sort by _id descending:
collection.find(filter={"keyword": keyword}, sort=[( "_id", -1 )])
Sort by _id ascending:
collection.find(filter={"keyword": keyword}, sort=[( "_id", 1 )])
DESC & ASC:
import pymongo
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["mydatabase"]
col = db["customers"]
doc = col.find().sort("name", -1)  # descending
for x in doc:
    print(x)
###################
import pymongo
client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["mydatabase"]
col = db["customers"]
doc = col.find().sort("name", 1)  # ascending
for x in doc:
    print(x)
TL;DR: In my experience, the aggregation pipeline can be faster than the conventional .find().sort().
Now moving to the real explanation. There are two ways to perform sorting operations in MongoDB:
Using .find() and .sort().
Or using the aggregation pipeline.
As suggested by many, .find().sort() is the simplest way to perform the sorting.
.sort([("field1",pymongo.ASCENDING), ("field2",pymongo.DESCENDING)])
However, in my tests this was slower than the aggregation pipeline.
Coming to the aggregation pipeline method. The steps to implement simple aggregation pipeline intended for sorting are:
$match (optional step)
$sort
NOTE: In my experience, the aggregation pipeline works a bit faster than the .find().sort() method.
Here's an example of the aggregation pipeline.
db.collection_name.aggregate([
    {
        "$match": {
            # your query - optional step
        }
    },
    {
        "$sort": {
            "field_1": pymongo.ASCENDING,
            "field_2": pymongo.DESCENDING,
            # ....
        }
    }
])
Try this method yourself, compare the speed and let me know about this in the comments.
Edit: Do not forget to pass allowDiskUse=True when sorting large result sets; otherwise a $sort stage that exceeds its memory limit will throw an error.
.sort([("field1",pymongo.ASCENDING), ("field2",pymongo.DESCENDING)])
pymongo takes key and direction as arguments, so you can use the form above.
In your case you can do this:
for post in db.posts.find().sort('entities.user_mentions.screen_name', pymongo.ASCENDING):
    print(post)
Say you want to sort by the 'created_on' field; then you can do it like this:
.sort('{}'.format('created_on'), 1 if sort_type == 'asc' else -1)
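When the sort fields and directions arrive as user input, the same idea can be generalised into a list of (key, direction) tuples. This is only a sketch: build_sort and the "field:asc,other:desc" input format are assumptions made for illustration, and the constants mirror the values of pymongo.ASCENDING and pymongo.DESCENDING so the snippet runs without a database.

```python
ASCENDING, DESCENDING = 1, -1  # same values as pymongo.ASCENDING / pymongo.DESCENDING

def build_sort(spec):
    """Turn 'field:asc,other:desc' into [('field', 1), ('other', -1)]."""
    directions = {"asc": ASCENDING, "desc": DESCENDING}
    pairs = []
    for part in spec.split(","):
        field, _, order = part.strip().partition(":")
        order = (order or "asc").lower()  # direction defaults to ascending
        if not field or order not in directions:
            raise ValueError("bad sort term: %r" % part)
        pairs.append((field, directions[order]))
    return pairs

print(build_sort("created_on:desc,name"))  # [('created_on', -1), ('name', 1)]
```

The returned list can be passed straight to .sort(), for example db.posts.find().sort(build_sort("created_on:desc")).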

Number of items in the aggregation with MongoDB 2.6

My query looks like this:
var x = db.collection.aggregate(...);
I want to know the number of items in the result set. The documentation says that this function returns a cursor. However, it has far fewer methods and fields than the cursor returned by db.collection.find().
for (var k in x) print(k);
Produces
_firstBatch
_cursor
hasNext
next
objsLeftInBatch
help
toArray
forEach
map
itcount
shellPrint
pretty
No count() method! Why is this cursor different from the one returned by find()? itcount() returns some type of count, but the documentation says "for testing only".
Using a group stage in my aggregation ({$group:{_id:null,cnt:{$sum:1}}}), I can get the count, like this:
var cnt = x.hasNext() ? x.next().cnt : 0;
Is there a more straight forward way to get this count? As in db.collection.find(...).count()?
Barno's answer is correct to point out that itcount() is a perfectly good method for counting the number of results of the aggregation. I just wanted to make a few more points and clear up some other points of confusion:
No count() method! Why is this cursor different from the one returned by find()?
The trick with the count() method is that it counts the number of results of find() on the server side. itcount(), as you can see in the code, iterates over the cursor, retrieving the results from the server, and counts them. The "it" is for "iterate". There's currently (as of MongoDB 2.6) no way to just get the count of results from an aggregation pipeline without returning the cursor of results.
Using a group stage in my aggregation ({$group:{_id:null,cnt:{$sum:1}}}), I can get the count
Yes. This is a reasonable way to get the count of results and should be more performant than itcount() since it does the work on the server and does not need to send the results to the client. If the point of the aggregation within your application is just to produce the number of results, I would suggest using the $group stage to get the count. In the shell and for testing purposes, itcount() works fine.
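For readers doing this from Python, the same $group approach can be sketched as a pair of small helpers. The names with_count_stage and read_count are hypothetical, as is the status filter; the pipeline would be passed to collection.aggregate(), which is not called here so that the snippet runs without a server.

```python
def with_count_stage(pipeline, field="cnt"):
    """Return a new pipeline ending in a $group stage that counts its input."""
    return list(pipeline) + [{"$group": {"_id": None, field: {"$sum": 1}}}]

def read_count(result_docs, field="cnt"):
    """Extract the count; an empty result means zero documents matched."""
    first = next(iter(result_docs), None)
    return first[field] if first else 0

pipeline = with_count_stage([{"$match": {"status": "A"}}])
print(pipeline[-1])  # {'$group': {'_id': None, 'cnt': {'$sum': 1}}}
```

read_count mirrors the shell expression x.hasNext() ? x.next().cnt : 0 from the question: the $group stage emits no document at all when nothing reaches it.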
Where have you read that itcount() is "for testing only"?
If in the mongo shell I do
var p = db.collection.aggregate(...);
printjson(p.help)
I receive
function () {
    // This is the same as the "Cursor Methods" section of DBQuery.help().
    print("\nCursor methods");
    print("\t.toArray() - iterates through docs and returns an array of the results")
    print("\t.forEach( func )")
    print("\t.map( func )")
    print("\t.hasNext()")
    print("\t.next()")
    print("\t.objsLeftInBatch() - returns count of docs left in current batch (when exhausted, a new getMore will be issued)")
    print("\t.itcount() - iterates through documents and counts them")
    print("\t.pretty() - pretty print each document, possibly over multiple lines")
}
If I do
printjson(p)
I find that
"itcount" : function (){
    var num = 0;
    while ( this.hasNext() ){
        num++;
        this.next();
    }
    return num;
}
This loop
while ( this.hasNext() ){
    num++;
    this.next();
}
is very similar in spirit to var cnt = x.hasNext() ? x.next().cnt : 0; and it is perfectly adequate for counting.

MongoDB Aggregating Array Items

Given the schema displayed below and MongoDB shell version 2.0.4:
How would I go about counting the number of items in the impressions array?
I am thinking I have to do a MapReduce, which seems a little complex; I thought that the count function would do it. Can someone demonstrate this?
A simple example of counting is:
var count = 0;
db.engagement.find({company_name: "me"}, {impressions: 1}).forEach(
    function (doc) {
        count += doc.impressions.length;
    }
)
print("Impressions: " + count);
If you have a large number of documents to process, you would be better off maintaining the count as an explicit field. You could either update the count when pushing to the impressions array, or use an incremental MapReduce to re-count updated documents as needed.
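Keeping the count as an explicit field amounts to combining $push and $inc in a single update. Here is a minimal Python sketch of such an update document; the counter field name impressions_count and the helper push_impression_update are assumptions for illustration, not part of the original schema.

```python
def push_impression_update(impression, count_field="impressions_count"):
    """Build an update document that appends an impression and bumps the counter."""
    return {
        "$push": {"impressions": impression},
        "$inc": {count_field: 1},
    }

update = push_impression_update({"ts": 1})
print(update)
```

The resulting dictionary would be passed as the update argument of an update call on the engagement collection, so the array and its counter change together.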

Random Sampling from Mongo

I have a mongo collection with documents. There is one field in every document which is 0 or 1. I need to randomly sample 1000 records from the database and count the number of documents that have that field set to 1. I need to do this sampling 1000 times. How do I do it?
For people coming to this answer: you should now use the $sample aggregation stage, introduced in MongoDB 3.2.
https://docs.mongodb.org/manual/reference/operator/aggregation/sample/
db.collection_of_things.aggregate(
[ { $sample: { size: 15 } } ]
)
Then add another step to count up the 0s and 1s using $group to get the count. Here is an example from the MongoDB docs.
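As a rough sketch of that two-stage pipeline written for pymongo (this is not the example from the MongoDB docs): the helper names are hypothetical, thefield stands for the 0/1 field from the question, and the pipeline would be passed to collection.aggregate().

```python
def sample_and_count_pipeline(field, size=1000):
    """Sample `size` random documents, then count them per value of `field`."""
    return [
        {"$sample": {"size": size}},
        {"$group": {"_id": "$" + field, "count": {"$sum": 1}}},
    ]

def ones_in_sample(group_docs):
    """Pick the count for value 1 out of the $group output (0 if absent)."""
    for doc in group_docs:
        if doc["_id"] == 1:
            return doc["count"]
    return 0

pipeline = sample_and_count_pipeline("thefield")
print(pipeline)
```

Running this pipeline 1000 times and feeding each result to ones_in_sample would give the repeated sampling the question asks for.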
For MongoDB 3.0 and before, I use an old trick from SQL days (which I think Wikipedia uses for its random page feature). I store a random number between 0 and 1 in every object I need to randomize; let's call that field "r". You then add an index on "r".
db.coll.ensureIndex({r: 1});
Now to get random x objects, you use:
var startVal = Math.random();
db.coll.find({r: {$gt: startVal}}).sort({r: 1}).limit(x);
This gives you random objects in a single find query. Depending on your needs, this may be overkill, but if you are going to be doing lots of sampling over time, this is a very efficient way without putting load on your backend.
Here's an example in the mongo shell .. assuming a collection of collname, and a value of interest in thefield:
var total = db.collname.count();
var count = 0;
var numSamples = 1000;
for (var i = 0; i < numSamples; i++) {
    var random = Math.floor(Math.random() * total);
    var doc = db.collname.find().skip(random).limit(1).next();
    if (doc.thefield) {
        count += (doc.thefield == 1);
    }
}
I was going to edit my comment on @Stennie's answer with this, but you could also use a separate auto-incrementing ID index here as an alternative if you were to skip over HUGE amounts of records (talking huge here).
I wrote another answer to another question a lot like this one where some one was trying to find nth record of the collection:
php mongodb find nth entry in collection
The second half of my answer basically describes one potential method by which you could approach this problem. You would still need to loop 1000 times to get the random row of course.
If you are using mongoengine, you can use a SequenceField to generate an incremental counter.
class User(db.DynamicDocument):
    counter = db.SequenceField(collection_name="user.counters")
Then to fetch a random list of say 100, do the following
import random

def get_random_users(number_requested):
    total = User.objects.count()
    users_to_fetch = random.sample(range(1, total + 1), min(number_requested, total))
    return User.objects(counter__in=users_to_fetch)
where you would call
get_random_users(100)