I'm trying to integrate the MongoDB driver in Erlang.
After some coding, it appears to me that the only way to limit the number of retrieved documents is to deal with the cursor after the find() action.
Here's my code so far:
Cursor = mongo:find(Connection, Collection, Selector),
Result = case Limit of
    infinity ->
        mc_cursor:rest(Cursor);
    _ ->
        mc_cursor:take(Cursor, Limit)
end,
mc_cursor:close(Cursor)
What I'm afraid of is what will happen when the collection gets huge.
Won't it be too big to fetch and to fit in memory?
How does the cursor basically work?
Or is there just a better way to limit the fetch?
I think you could use the batch_size parameter.
The following code is from the mongo.erl file:
%% @doc Return projection of selected documents starting from Nth document in batches of batchsize.
%% 0 batchsize means default batch size.
%% Negative batch size means one batch only.
%% Empty projection means full projection.
-spec find(pid(), collection(), selector(), projector(), skip(), batchsize()) -> cursor(). % Action
find(Connection, Coll, Selector, Projector, Skip, BatchSize) ->
    mc_action_man:read(Connection, #'query'{
        collection = Coll,
        selector = Selector,
        projector = Projector,
        skip = Skip,
        batchsize = BatchSize
    }).
===============
Response to the comments:
In the mc_action_man.erl file, it still uses a cursor to save the "current position".
read(Connection, Request = #'query'{collection = Collection, batchsize = BatchSize}) ->
    {Cursor, Batch} = mc_connection_man:request(Connection, Request),
    mc_cursor:create(Connection, Collection, Cursor, BatchSize, Batch).
In mc_worker.erl, this is where the data is actually sent to the db. I think you could add logging code (e.g. lager) to monitor the actual requests and find the problem.
handle_call(Request, From, State = #state{socket = Socket, ets = Ets, conn_state = CS}) % read requests
        when is_record(Request, 'query'); is_record(Request, getmore) ->
    UpdReq = case is_record(Request, 'query') of
        true -> Request#'query'{slaveok = CS#conn_state.read_mode =:= slave_ok};
        false -> Request
    end,
    {ok, Id} = mc_worker_logic:make_request(Socket, CS#conn_state.database, UpdReq),
    inet:setopts(Socket, [{active, once}]),
    RespFun = fun(Response) -> gen_server:reply(From, Response) end, % save function, which will be called on response
    true = ets:insert_new(Ets, {Id, RespFun}),
    {noreply, State};
I was working on my unit tests, where the database should be reset for each case.
Because the indexes should be kept but the data should be cleared, what would be a faster way to do that reset?
I am using pymongo; I'm not sure whether the driver heavily impacts the performance.
There are 2 ways that come to my mind:
Simply execute delete_many()
Drop the whole collection and recreate the indexes
I found that:
for collections containing < 25K entries, the first way is faster
for collections containing ~ 25K entries, the two ways are similar
for collections containing > 25K entries, the second way is faster
Below is my script to test and run.
import time

import pymongo
from bson import ObjectId

client = pymongo.MongoClient("mongodb://192.168.50.33:27017/?readPreference=primary&ssl=false")


def test_drop_recreate(col):
    col.drop()
    col.create_index([("mike", pymongo.DESCENDING)])


def test_delete_many(col):
    col.delete_many({})


def main():
    db = client.get_database("_tests")

    STEP = 1000
    to_insert = []

    for count in range(0, STEP * 101, STEP):
        # Get the MongoDB collection
        col = db.get_collection("a")

        # Insert dummy data
        to_insert.extend([{"_id": ObjectId(), "mike": "ABC"} for _ in range(STEP)])
        if to_insert:
            col.insert_many(to_insert)

        # Record the execution time of dropping the collection then recreating its indexes
        _start = time.time()
        test_drop_recreate(col)
        ms_drop = time.time() - _start

        # Insert dummy data
        if to_insert:
            col.insert_many(to_insert)

        # Record the execution time of simply executing `delete_many()`
        _start = time.time()
        test_delete_many(col)
        ms_del = time.time() - _start

        if ms_drop > ms_del:
            print(f"{count},-{(ms_drop / ms_del) - 1:.2%}")
        else:
            print(f"{count},+{(ms_del / ms_drop) - 1:.2%}")


if __name__ == '__main__':
    main()
After I ran this script a few times, I generated a graph to visualize the result using the output.
(Above 0) means deletion takes longer
(Below 0) means dropping and recreating takes longer
The value represents the additional time consumed.
For example: +20 means deletion takes 20 times longer than drop & recreate.
I tried to "save" all the existing indexes and recreate them after the collection drop, but there are tons of edge cases and hidden parameters to the index object. Here is my shot:
def convert_key_to_list_index(son_index):
    """
    Convert index SON object to a list that can be used to create the same index.

    :param SON son_index: The output of "list_indexes()":
        SON([
            ('v', 2),
            ('key', SON([('timestamp', 1.0)])),
            ('name', 'timestamp_1'),
            ('expireAfterSeconds', 3600.0)
        ])
    :return list: List of tuples (field, direction) to use for the pymongo function create_index
    """
    key_index_list = []
    index_key_definitions = son_index["key"]
    for field, direction in index_key_definitions.items():
        item = (field, int(direction))
        key_index_list.append(item)
    return key_index_list
def drop_and_recreate_indexes(db, collection_name):
    """
    Use list_indexes() and not index_information() to get the "key" as a SON object instead of a regular dict,
    because the SON object is an ordered dict, and the order of the index keys is important for the
    recreation action.
    """
    son_indexes = list(db[collection_name].list_indexes())
    db[collection_name].drop()

    # Re-create the collection indexes
    for son_index in son_indexes:
        # Skip the default index for the field "_id"
        # Notice: For a collection without any index except the default "_id_" index,
        # the collection will not be recreated (recreating an index recreates the collection).
        if son_index['name'] == '_id_':
            continue
        recreate_index(db, collection_name, son_index)
def recreate_index(db, collection_name, son_index):
    """
    Get a SON object that represents an index from the db, and create the same index in the db.

    :param pymongo.database.Database db:
    :param str collection_name:
    :param SON son_index: The object returned from the call to pymongo.collection.Collection.list_indexes()
        # Notice: Currently only these fields are supported: ["name", "unique", "expireAfterSeconds"]
        # For more fields/options see: https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html
        # ?highlight=create_index#pymongo.collection.Collection.create_index
    :return:
    """
    list_index = convert_key_to_list_index(son_index)

    # This dict is defined to pass a parameter to the "create_index" function only if it exists.
    create_index_kwargs = {}

    # "name" will always be part of the SON object that represents the index.
    create_index_kwargs["name"] = son_index.get("name")

    # "unique" may or may not exist in the SON object
    unique = son_index.get("unique")
    if unique:
        create_index_kwargs["unique"] = unique

    # "expireAfterSeconds" may or may not exist in the SON object
    expire_after_seconds = son_index.get("expireAfterSeconds", None)
    if expire_after_seconds:
        create_index_kwargs["expireAfterSeconds"] = expire_after_seconds

    db[collection_name].create_index(list_index, **create_index_kwargs)
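For reference, here is a minimal usage sketch of the helpers above; the connection string, database name, and collection name are placeholder assumptions, not part of the original code:
from pymongo import MongoClient

# Placeholder connection/database/collection names (assumptions for illustration)
client = MongoClient("mongodb://localhost:27017")
db = client["_tests"]

# Drop collection "a" and recreate its previous indexes (which also clears its data)
drop_and_recreate_indexes(db, "a")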
I was playing with Swift's Data in the following small piece of code:
var d = Data(count: 10)
d[5] = 3
let d2 = d[5..<8]
print("\(d2[0])")
To my surprise, this code throws an exception on print() while the following code does not:
var d = Data(count: 10)
d[5] = 3
let d2 = d.subdata(in: 5..<8)
print("\(d2[0])")
I somehow understand why this happens, but I don't get why it is designed like this. When I use subdata() I get a whole copy of the range, so indexing is valid from 0. But when I use the range subscript [], I get access to the requested range while indexing remains the same as before. So in my first example d2[5] is 3.
But I wonder why it is designed like this. I don't want to make a copy of my data by using the subdata() method. I just want to access a portion of my data with better indexing.
This especially creates unexpected behavior if you pass it to a function. For example, the following code creates unexpected results and exceptions, and you may not easily find out why:
func testit(idata: Data) {
    if idata.count > 0 {
        print("\(idata.count)")
        print("\(idata[0])")
    }
}
//...
var d = Data(count: 10)
d[5] = 3
let d2 = d[5..<8]
testit(idata: d2)
This code is really strange, because if you debug it, you see that print("\(idata.count)") prints 3 as the size of idata, which is correct, but accessing it with idata[0] throws an exception.
Is there any reason for this design? I was expecting that I could access the Data resulting from the subscript starting at index 0, but that is not true. Can I do this without using subdata(), which creates a copy of the data, and without passing the base index of the data slice as an additional argument?
d[5..<8] returns Data.SubSequence – which happens to be Data. Generally, slices share the indices with their base collection, as documented in Slice.
One possible reason for this design decision is that it guarantees that subscripting a slice is a O(1) operation (adding an offset for accessing the base collection is not necessarily O(1), e.g. not for strings.)
It is also convenient, as in this example to locate the text after the second occurrence of a character in a string:
let string = "abcdefgabcdefg"
// Find first occurrence of "d":
if let r1 = string.range(of: "d") {
    // Find second occurrence of "d":
    if let r2 = string[r1.upperBound...].range(of: "d") {
        print(string[r2.upperBound...]) // efg
    }
}
As a consequence, you must never assume that the indices of a collection are zero-based (unless documented, as for Array.startIndex). Use startIndex to get the first index, or first to get the first element.
In ReactiveMongo my query looks like this:
val result = collName.find(BSONDocument("loc" -> BSONDocument("$near" ->
  BSONArray(51, -114)))).cursor[BSONDocument].enumerate()
result.apply(Iteratee.foreach { doc => println(BSONDocument.pretty(doc)) })
I want to print only the top 2 results, so I pass the maxDocs value to enumerate, and then the query is:
val result = collName.find(BSONDocument("loc" -> BSONDocument("$near" ->
  BSONArray(51, -114)))).cursor[BSONDocument].enumerate(2)
result.apply(Iteratee.foreach { doc => println(BSONDocument.pretty(doc)) })
But it's not working; it prints all documents of the query.
How do I print only the top 2 results?
I basically stumbled over the same thing.
It turns out that the ReactiveMongo driver transfers the result documents in batches, taking the maxDocs setting into account only when it wants to load the next batch of documents.
You can configure the batch size to be equal to the maxDocs limit or to a proper divisor thereof:
val result = collName.
  find(BSONDocument("loc" -> BSONDocument("$near" -> BSONArray(51, -114)))).
  options(QueryOpts(batchSizeN = 2)).
  cursor[BSONDocument].enumerate(2)
Or, alternatively, let MongoDB choose the batch size and limit the documents you process using an Enumeratee:
val result = collName.
  find(BSONDocument("loc" -> BSONDocument("$near" -> BSONArray(51, -114)))).
  cursor[BSONDocument].
  enumerate(2) &> Enumeratee.take(2)
My query looks like this:
var x = db.collection.aggregate(...);
I want to know the number of items in the result set. The documentation says that this function returns a cursor. However, it contains far fewer methods/fields than the one returned by db.collection.find().
for (var k in x) print(k);
Produces
_firstBatch
_cursor
hasNext
next
objsLeftInBatch
help
toArray
forEach
map
itcount
shellPrint
pretty
No count() method! Why is this cursor different from the one returned by find()? itcount() returns some type of count, but the documentation says "for testing only".
Using a group stage in my aggregation ({$group:{_id:null,cnt:{$sum:1}}}), I can get the count, like this:
var cnt = x.hasNext() ? x.next().cnt : 0;
Is there a more straight forward way to get this count? As in db.collection.find(...).count()?
Barno's answer is correct to point out that itcount() is a perfectly good method for counting the number of results of the aggregation. I just wanted to make a few more points and clear up some other points of confusion:
No count() method! Why is this cursor different from the one returned by find()?
The trick with the count() method is that it counts the number of results of find() on the server side. itcount(), as you can see in the code, iterates over the cursor, retrieving the results from the server, and counts them. The "it" is for "iterate". There's currently (as of MongoDB 2.6) no way to just get the count of results from an aggregation pipeline without returning the cursor of results.
Using a group stage in my aggregation ({$group:{_id:null,cnt:{$sum:1}}}), I can get the count
Yes. This is a reasonable way to get the count of results and should be more performant than itcount() since it does the work on the server and does not need to send the results to the client. If the point of the aggregation within your application is just to produce the number of results, I would suggest using the $group stage to get the count. In the shell and for testing purposes, itcount() works fine.
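For what it's worth, here is a rough pymongo sketch of the same $group trick; the connection, collection name, and $match filter below are placeholder assumptions, not part of the question:
from pymongo import MongoClient

coll = MongoClient()["test"]["collection"]  # placeholder collection handle

pipeline = [
    {"$match": {"test_set": "abc"}},                # stand-in for your real pipeline stages
    {"$group": {"_id": None, "cnt": {"$sum": 1}}},  # collapse all documents into a single count
]
result = list(coll.aggregate(pipeline))
cnt = result[0]["cnt"] if result else 0
print(cnt)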
Where have you read that itcount() is "for testing only"?
If in the mongo shell I do
var p = db.collection.aggregate(...);
printjson(p.help)
I receive
function () {
// This is the same as the "Cursor Methods" section of DBQuery.help().
print("\nCursor methods");
print("\t.toArray() - iterates through docs and returns an array of the results")
print("\t.forEach( func )")
print("\t.map( func )")
print("\t.hasNext()")
print("\t.next()")
print("\t.objsLeftInBatch() - returns count of docs left in current batch (when exhausted, a new getMore will be issued)")
print("\t.itcount() - iterates through documents and counts them")
print("\t.pretty() - pretty print each document, possibly over multiple lines")
}
If I do
printjson(p)
I find that
"itcount" : function (){
var num = 0;
while ( this.hasNext() ){
num++;
this.next();
}
return num;
}
This function
while ( this.hasNext() ){
    num++;
    this.next();
}
is very similar to var cnt = x.hasNext() ? x.next().cnt : 0;, and this while loop is perfect for counting...
for post in db.datasets.find({"test_set":"abc"}).sort("abc",pymongo.DESCENDING).skip((page-1)*num).limit(num):
How do I get the count()?
Since pymongo version 3.7.0, count() is deprecated. Instead, use Collection.count_documents. Running cursor.count or collection.count will result in the following warning message:
DeprecationWarning: count is deprecated. Use Collection.count_documents instead.
To use count_documents, the code can be adjusted as follows:
import pymongo

db = pymongo.MongoClient()
col = db[DATABASE][COLLECTION]

find = {"test_set": "abc"}
sort = [("abc", pymongo.DESCENDING)]
skip = 10
limit = 10

doc_count = col.count_documents(find, skip=skip)
results = col.find(find).sort(sort).skip(skip).limit(limit)
for doc in results:
    ...  # Process document
Note: the count_documents method performs relatively slowly compared to the count method. In order to optimize, you can use collection.estimated_document_count. This method returns the estimated number of docs (as the name suggests) based on collection metadata.
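For example, a minimal sketch reusing the col handle from the snippet above (note that estimated_document_count accepts no query filter):
# Approximate total count from collection metadata; fast, but cannot take a query filter
estimated_total = col.estimated_document_count()
print(estimated_total)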
If you're using pymongo version 3.7.0 or higher, see this answer instead.
If you want results_count to ignore your limit():
results = db.datasets.find({"test_set":"abc"}).sort("abc",pymongo.DESCENDING).skip((page-1)*num).limit(num)
results_count = results.count()
for post in results:
If you want the results_count to be capped at your limit(), set applySkipLimit to True:
results = db.datasets.find({"test_set":"abc"}).sort("abc",pymongo.DESCENDING).skip((page-1)*num).limit(num)
results_count = results.count(True)
for post in results:
Not sure why you want the count if you are already passing limit 'num'. Anyway, if you want to assert, here is what you should do:
results = db.datasets.find({"test_set":"abc"}).sort("abc",pymongo.DESCENDING).skip((page-1)*num).limit(num)
results_count = results.count(True)
That will match results_count with num
Unfortunately I cannot comment on Sohaib Farooqi's answer... A quick note: although cursor.count() has been deprecated, it is significantly faster than collection.count_documents() in all of my tests when counting all documents in a collection (i.e. filter={}). Running db.currentOp() reveals that collection.count_documents() uses an aggregation pipeline, while cursor.count() doesn't. This might be a cause.
This thread happens to be 11 years old. However, in 2022 the count() function has been deprecated. Here is a way I came up with to count documents in MongoDB using Python; here is a picture of the code snippet. Making an empty list is not needed, I just wanted to be outlandish. Hope this helps :).
What matters in my case is the count of matched elements for a given query, and I surely do not want to run this query twice:
once to get the count, and
once to get the result set.
I know the query result set is not that big and fits in memory; therefore, I can convert it to a list and get the list length.
This code illustrates the use case:
# pymongo 3.9.0
offset = 0       # assumed starting offset
limit = 1000     # assumed page size
is_over = False

while not is_over:
    it = items.find({"some": "/value/"}).skip(offset).limit(limit)
    # list() will load the cursor content into memory
    it = list(it)
    if len(it) < limit:
        is_over = True
    offset += limit
If you want to use a cursor and also want the count, you can try this way:
from pymongo import MongoClient

# Collection has 27 items
db = MongoClient(_URI)[DB_NAME][COLLECTION_NAME]

cursor = db.find()
count = db.find().explain().get("executionStats", {}).get("nReturned")
# Output: 27

cursor = db.find().limit(5)
count = db.find().limit(5).explain().get("executionStats", {}).get("nReturned")
# Output: 5

# Can also use the cursor
for item in cursor:
    ...
You can read more about it at https://pymongo.readthedocs.io/en/stable/api/pymongo/cursor.html#pymongo.cursor.Cursor.explain