Motor Index not created on empty Collection - mongodb

I have the following code to set up my database:
self.con = motor.MotorClient(host, port)
self.Db = self.con.DB
self.Col = self.Db.Col
self.Col.create_index("c")
self.Col.create_index("h")
When I run index_information() I only see index information on the _id field. However, if I move the create_index() calls to after some entries are inserted, index_information() shows the new indexes. Does this mean I have to wait until I have entries in the collection before I can create indexes? Is there another way to do this, since I start with an empty collection?

You can create an index on an empty, or non-existent, MongoDB collection, and the index appears in index_information:
>>> from tornado import ioloop, gen
>>> import motor
>>>
>>> con = motor.MotorClient()
>>> db = con.test
>>> col = db.collection
>>>
>>>
>>> @gen.coroutine
... def coro():
...     yield db.drop_collection("collection")
...     yield col.create_index("c")
...     yield col.create_index("h")
...     print((yield col.index_information()))
...
>>> ioloop.IOLoop.current().run_sync(coro)
{u'c_1': {u'key': [(u'c', 1)], u'v': 1}, u'_id_': {u'key': [(u'_id', 1)], u'v': 1}, u'h_1': {u'key': [(u'h', 1)], u'v': 1}}
Since I don't see any "yield" statements in your example code, or any callbacks, I suspect you're not using Motor correctly. Motor is asynchronous; in order to wait for any Motor method that talks to the database server to complete you must either pass a callback to the method, or yield the Future the method returns.
For more information consult the tutorial:
http://motor.readthedocs.org/en/stable/tutorial.html#inserting-a-document
The discussion of calling asynchronous methods with Motor (and this applies to all Tornado libraries, not just Motor) begins at the "inserting a document" section.
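For illustration, a minimal sketch of the callback style mentioned above, assuming an older Motor release whose methods accept a callback keyword (the callback receives the result and an error):

def on_index_created(index_name, error):
    if error:
        raise error
    print("Created index:", index_name)

col.create_index("c", callback=on_index_created)  # returns immediately; the callback fires when the server replies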

You can easily create an index in MongoDB (even on an empty collection) by specifying a
field_name and direction.
field_name: can be any field on which you want to create the index.
direction: can be any one of these values: 1, -1, "2dsphere", "text", or "hashed".
Refer to the MotorCollection documentation for details.
The code below creates indexes using the Motor library in Python:
db.collection_name.create_index([("field_name", 1)])  # To create an ascending index
db.collection_name.create_index([("geoloc_field_name", "2dsphere")])  # To create a geo index
db.collection_name.create_index([("field_name", "text")])  # To create a text-based index
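As a further illustration, a minimal sketch with blocking PyMongo (so it runs without an event loop; the collection and field names are placeholders) showing that these indexes appear even on an empty collection:

import pymongo

client = pymongo.MongoClient()       # assumes a local mongod on the default port
col = client.test.empty_collection  # a collection that does not exist yet

col.create_index([("field_name", 1)])                  # ascending index
col.create_index([("geoloc_field_name", "2dsphere")])  # geospatial index
col.create_index([("field_name", "text")])             # text index

print(col.index_information())  # shows _id_ plus the three new indexes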

Related

PyMongo gives error when using dot notation in field name for sort method

I am trying to get the maximum value of a field inside a collection. The field's value is an array and I actually need to get the maximum of the first index of the array. For example, the collection is similar to this:
[
    {
        ...,
        "<field>": [10, 20],
        ...
    },
    {
        ...,
        "<field>": [13, 23],
        ...
    },
    {
        ...,
        "<field>": [19, 31],
        ...
    }
]
So from the above documents, I would need to get the maximum of the first index of the array; in this case, it would be 19.
To do this, I am first sorting the field by the first index of the field array and then getting the first document (using limit). I am able to do this using Node.js but cannot get it working with PyMongo.
It works using the Node.js MongoDB API like:
const max = (
  await collection
    .find()
    .sort({ "<field>.0": -1 })
    .limit(1)
    .toArray()
)[0];
However, if I try to do a similar thing using PyMongo:
max = list(collection.find().sort("<field>.0", -1).limit(1))[0]
I get the error:
KeyError: '<field>.0'
I am using PyMongo version 3.12.0. How can I resolve this?
In PyMongo, the sort option is a list of tuples, where each tuple holds two items: a key name and a sort order.
You can pass multiple tuples in this list, since MongoDB supports sorting by multiple keys.
col.find({}).sort([('<key1>', <sort-order>), ('<key2>', <sort-order>)])
In your scenario, you should rewrite your find command as follows:
max = list(collection.find().sort([("<field>.0", -1)]).limit(1))[0]
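As a minimal runnable sketch of the above (assuming a local mongod, a throwaway collection named demo, and "field" standing in for the real <field> name):

import pymongo

client = pymongo.MongoClient()
col = client.test.demo
col.insert_many([{"field": [10, 20]}, {"field": [13, 23]}, {"field": [19, 31]}])

# Sort descending by the first array element, then take the first document.
top = list(col.find().sort([("field.0", -1)]).limit(1))[0]
print(top["field"][0])  # 19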

A better way to delete the data in a MongoDB collection but keeping the index

I was working on my unit tests, where the database should be reset for each test case.
Because the indexes should be kept but the data should be cleared, what would be a faster way to do that reset?
I am using pymongo; I'm not sure whether the driver heavily impacts the performance.
There are 2 ways that come to my mind:
Simply execute delete_many()
Drop the whole collection and recreate the indexes
I found that:
for collections containing < 25K entries, the first way is faster
for collections containing ~ 25K entries, the two ways are similar
for collections containing > 25K entries, the second way is faster.
Below is my script to test and run.
import time

import pymongo
from bson import ObjectId

client = pymongo.MongoClient("mongodb://192.168.50.33:27017/?readPreference=primary&ssl=false")

def test_drop_recreate(col):
    col.drop()
    col.create_index([("mike", pymongo.DESCENDING)])

def test_delete_many(col):
    col.delete_many({})

def main():
    db = client.get_database("_tests")
    STEP = 1000
    to_insert = []
    for count in range(0, STEP * 101, STEP):
        # Get the MongoDB collection
        col = db.get_collection("a")
        # Insert dummy data
        to_insert.extend([{"_id": ObjectId(), "mike": "ABC"} for _ in range(STEP)])
        if to_insert:
            col.insert_many(to_insert)

        # Record the execution time of dropping the collection then recreating its indexes
        _start = time.time()
        test_drop_recreate(col)
        ms_drop = time.time() - _start

        # Insert dummy data
        if to_insert:
            col.insert_many(to_insert)

        # Record the execution time of simply executing `delete_many()`
        _start = time.time()
        test_delete_many(col)
        ms_del = time.time() - _start

        if ms_drop > ms_del:
            print(f"{count},-{(ms_drop / ms_del) - 1:.2%}")
        else:
            print(f"{count},+{(ms_del / ms_drop) - 1:.2%}")

if __name__ == '__main__':
    main()
After I ran this script a few times, I generated a graph to visualize the results using the output.
Above 0 means deletion takes longer; below 0 means dropping and recreating takes longer.
The value represents the additional time consumed; for example, +20% means deletion took 20% longer than drop & recreate.
I tried to "save" all the existing indexes and recreate them after the collection drop, but there are tons of edge cases and hidden parameters to the index object. Here is my shot:
def convert_key_to_list_index(son_index):
    """
    Convert an index SON object to a list that can be used to create the same index.

    :param SON son_index: The output of "list_indexes()":
        SON([
            ('v', 2),
            ('key', SON([('timestamp', 1.0)])),
            ('name', 'timestamp_1'),
            ('expireAfterSeconds', 3600.0)
        ])
    :return list: List of tuples (field, direction) to use for the pymongo function create_index
    """
    key_index_list = []
    index_key_definitions = son_index["key"]
    for field, direction in index_key_definitions.items():
        item = (field, int(direction))
        key_index_list.append(item)
    return key_index_list

def drop_and_recreate_indexes(db, collection_name):
    """
    Use list_indexes() and not index_information() to get the "key" as a SON object instead of a regular dict,
    because the SON object is an ordered dict, and the order of the keys of an index matters for the
    recreation action.
    """
    son_indexes = list(db[collection_name].list_indexes())
    db[collection_name].drop()
    # Re-create the collection indexes
    for son_index in son_indexes:
        # Skip the default index for the field "_id".
        # Notice: for a collection without any index except the default "_id_" index,
        # the collection will not be recreated (recreating an index recreates the collection).
        if son_index['name'] == '_id_':
            continue
        recreate_index(db, collection_name, son_index)

def recreate_index(db, collection_name, son_index):
    """
    Get a SON object that represents an index in the db, and create the same index in the db.

    :param pymongo.database.Database db:
    :param str collection_name:
    :param SON son_index: The object returned from the call to pymongo.collection.Collection.list_indexes()
        Notice: currently only these fields are supported: ["name", "unique", "expireAfterSeconds"]
        For more fields/options see: https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html
        ?highlight=create_index#pymongo.collection.Collection.create_index
    :return:
    """
    list_index = convert_key_to_list_index(son_index)
    # This dict collects the optional parameters to pass to the "create_index" function.
    create_index_kwargs = {}
    # "name" will always be part of the SON object that represents the index.
    create_index_kwargs["name"] = son_index.get("name")
    # "unique" may or may not exist in the SON object
    unique = son_index.get("unique")
    if unique:
        create_index_kwargs["unique"] = unique
    # "expireAfterSeconds" may or may not exist in the SON object
    expire_after_seconds = son_index.get("expireAfterSeconds", None)
    if expire_after_seconds is not None:
        create_index_kwargs["expireAfterSeconds"] = expire_after_seconds
    db[collection_name].create_index(list_index, **create_index_kwargs)
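A hedged usage sketch, assuming a local mongod and the collection name from the benchmark above:

import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["_tests"]

# Reset the collection between test cases while preserving its index definitions.
drop_and_recreate_indexes(db, "a")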

Why is one of (array) field/value pairs in MongoDB's find_and_modify() filter parameter being ignored?

I am using pymongo to call MongoDB 3.2 (running on Azure Cosmos DB). My app needs to, in an atomic step, search for an array element in a document and if found, modify 1-2 of its fields.
For the filter argument, I am specifying 3 conditions (see fd in the code). The first call seems to work OK (it modifies the first element in the array, which happens to match fd), but on the second call it modifies the second array element, which doesn't match the node_id value specified in the filter. My code:
fd = {'_id': 'job7569', 'active_runs.node_id': 'node0', 'active_runs.status': 'unstarted'}
ud = {'active_runs.$.run_name': 'run8100.1', 'active_runs.$.status': 'started'}
result = self.mongo.mongo_db["__jobs__"].find_and_modify(fd, update={"$set": ud}, fields={"active_runs.$": 1}, new=True)
active_runs after 2nd call:
[ {'node_id': 'node0', 'run_index': 0, 'run_name': 'run9999.43', 'status': 'started'},
{'node_id': 'node1', 'run_index': 1, 'run_name': 'run9999.44', 'status': 'started'},
{'node_id': 'node2', 'run_index': 2, 'run_name': None, 'status': 'unstarted'},
...]
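For reference, a likely cause: when several dot-notation conditions name the same array field, MongoDB may satisfy each condition with a different array element, so the positional $ operator can point at an unexpected element. The usual way to require a single element to satisfy all conditions is $elemMatch; a hedged sketch of the reworked filter:

fd = {
    '_id': 'job7569',
    'active_runs': {'$elemMatch': {'node_id': 'node0', 'status': 'unstarted'}}
}
ud = {'active_runs.$.run_name': 'run8100.1', 'active_runs.$.status': 'started'}
result = self.mongo.mongo_db["__jobs__"].find_and_modify(
    fd, update={"$set": ud}, fields={"active_runs.$": 1}, new=True)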

pymongo find().hint('index') does not use index [duplicate]

I'm trying to use the sort feature when querying my MongoDB collection, but it is failing. The same query works in the MongoDB console but not here. The code is as follows:
import pymongo
from pymongo import Connection

connection = Connection()
db = connection.myDB
print db.posts.count()
for post in db.posts.find({}, {'entities.user_mentions.screen_name': 1}).sort({u'entities.user_mentions.screen_name': 1}):
    print post
The error I get is as follows:
Traceback (most recent call last):
File "find_ow.py", line 7, in <module>
for post in db.posts.find({}, {'entities.user_mentions.screen_name':1}).sort({'entities.user_mentions.screen_name':1},1):
File "/Library/Python/2.6/site-packages/pymongo-2.0.1-py2.6-macosx-10.6-universal.egg/pymongo/cursor.py", line 430, in sort
File "/Library/Python/2.6/site-packages/pymongo-2.0.1-py2.6-macosx-10.6-universal.egg/pymongo/helpers.py", line 67, in _index_document
TypeError: first item in each key pair must be a string
I found a link elsewhere that says I need to place a 'u' in front of the key if using pymongo, but that didn't work either. Has anyone else gotten this to work, or is this a bug?
.sort(), in PyMongo, takes a key and a direction as parameters.
So if you want to sort by, let's say, _id, then you should use .sort("_id", 1)
For multiple fields:
.sort([("field1", pymongo.ASCENDING), ("field2", pymongo.DESCENDING)])
You can try this:
db.Account.find().sort("UserName")
db.Account.find().sort("UserName",pymongo.ASCENDING)
db.Account.find().sort("UserName",pymongo.DESCENDING)
This also works:
db.Account.find().sort('UserName', -1)
db.Account.find().sort('UserName', 1)
I'm using this in my code; please comment if I'm doing something wrong here. Thanks.
Why does Python use a list of tuples instead of a dict?
In Python (before 3.7), you could not guarantee that a dictionary would be interpreted in the order in which you declared its keys.
So, in the mongo shell you could do .sort({'field1': 1, 'field2': 1}) and the interpreter would sort by field1 at the first level and field2 at the second level.
If this syntax were used in Python, there would be a chance of sorting by field2 at the first level. With a list of tuples, there is no such risk.
.sort([("field1",pymongo.ASCENDING), ("field2",pymongo.DESCENDING)])
Sort by _id descending:
collection.find(filter={"keyword": keyword}, sort=[( "_id", -1 )])
Sort by _id ascending:
collection.find(filter={"keyword": keyword}, sort=[( "_id", 1 )])
DESC & ASC:
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["mydatabase"]
col = db["customers"]
doc = col.find().sort("name", -1)  # descending
for x in doc:
    print(x)
###################
import pymongo

client = pymongo.MongoClient("mongodb://localhost:27017/")
db = client["mydatabase"]
col = db["customers"]
doc = col.find().sort("name", 1)  # ascending
for x in doc:
    print(x)
TL;DR: The aggregation pipeline is faster than the conventional .find().sort().
Now moving to the real explanation. There are two ways to perform sorting operations in MongoDB:
Using .find() and .sort().
Or using the aggregation pipeline.
As suggested by many, .find().sort() is the simplest way to perform sorting:
.sort([("field1", pymongo.ASCENDING), ("field2", pymongo.DESCENDING)])
However, this is slow compared to the aggregation pipeline.
Coming to the aggregation pipeline method, the steps to implement a simple aggregation pipeline intended for sorting are:
$match (optional step)
$sort
NOTE: In my experience, the aggregation pipeline works a bit faster than the .find().sort() method.
Here's an example of the aggregation pipeline.
db.collection_name.aggregate([
    {
        "$match": {
            # your query - optional step
        }
    },
    {
        "$sort": {
            "field_1": pymongo.ASCENDING,
            "field_2": pymongo.DESCENDING,
            ....
        }
    }
])
Try this method yourself, compare the speed, and let me know about it in the comments.
Edit: Do not forget to use allowDiskUse=True when sorting on multiple fields; otherwise it will throw an error.
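For reference, a minimal sketch of how allowDiskUse is passed in PyMongo (the collection and field names are placeholders):

cursor = db.collection_name.aggregate(
    [{"$sort": {"field_1": pymongo.ASCENDING, "field_2": pymongo.DESCENDING}}],
    allowDiskUse=True,  # lets the server spill large sorts to disk instead of raising an error
)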
.sort([("field1",pymongo.ASCENDING), ("field2",pymongo.DESCENDING)])
PyMongo uses key and direction parameters; you can use the form above.
So in your case you can do this:
for post in db.posts.find().sort('entities.user_mentions.screen_name', pymongo.ASCENDING):
    print post
Say you want to sort by the 'created_on' field; then you can do it like this:
.sort('{}'.format('created_on'), 1 if sort_type == 'asc' else -1)

PyMongo updating array records with calculated fields via cursor

Basically the collection output of an elaborate aggregate pipeline for a very large dataset is similar to the following:
{
    "_id" : {
        "clienta" : NumberLong(460011766),
        "clientb" : NumberLong(2886729962)
    },
    "states" : [
        ["fixed", "fixed.rotated", "fixed.rotated.off"]
    ],
    "VBPP" : [244, 182, 184, 11, 299],
    "PPF" : 72.4
}
The intuitive, albeit slow, way to update these fields to be calculations of their former selves (length and variance of an array) with PyMongo before converting to arrays is as follows:
import numpy as np  # used below for the variance calculation

records_list = []
cursor = db.clientAgg.find({}, {'_id': 0,
                                'states': 1,
                                'VBPP': 1,
                                'PPF': 1})
for record in cursor:
    records_list.append(record)
for dicts in records_list:
    dicts['states'] = len(dicts['states'])
    dicts['VBPP'] = np.var(dicts['VBPP'])
I have written various forms of this basic flow to optimize for speed, but bringing in 500k dictionaries in memory to modify them before converting them to arrays to go through a machine learning estimator is costly. I have tried various ways to update the records directly via a cursor with variants of the following with no success:
cursor = db.clientAgg.find().skip(0).limit(50000)

def iter():
    for item in cursor:
        yield item

l = []
for x in iter():
    x['VBPP'] = np.var(x['VBPP'])
    # Or
    # db.clientAgg.update({'_id': x['_id']}, {'$set': {'x.VBPS': somefunction as above}}, upsert=False, multi=True)
I also unsuccessfully tried using Mongo's usual operators since the variance is as simple as subtracting the mean from each element of the array, squaring the result, then averaging the results.
If I could successfully modify the collection directly then I could utilize something very fast like Monary or IOPro to load data directly from Mongo and into a numpy array without the additional overhead.
Thank you for your time
MongoDB has no way to update a document with values calculated from the document's fields; currently you can only use update to set values to constants that you pass in from your application. So you can set document.x to 2, but you can't set document.x to document.y + document.z or any other calculated value.
See https://jira.mongodb.org/browse/SERVER-11345 and https://jira.mongodb.org/browse/SERVER-458 for possible future features.
In the immediate future, PyMongo will release a bulk API that allows you to send a batch of distinct update operations in a single network round-trip which will improve your performance.
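For reference, a hedged sketch of what such a batch might look like with the bulk API as it later shipped in PyMongo (bulk_write with UpdateOne), computing the variance client-side with numpy:

import numpy as np
from pymongo import MongoClient, UpdateOne

db = MongoClient().test  # assumes a local mongod; the database name is a placeholder

ops = []
for doc in db.clientAgg.find({}, {'VBPP': 1, 'states': 1}):
    ops.append(UpdateOne({'_id': doc['_id']},
                         {'$set': {'VBPP': float(np.var(doc['VBPP'])),
                                   'states': len(doc['states'])}}))
    if len(ops) == 1000:  # send the updates in batches of 1000
        db.clientAgg.bulk_write(ops, ordered=False)
        ops = []
if ops:
    db.clientAgg.bulk_write(ops, ordered=False)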
Addendum:
I have two other ideas. First, run some Javascript server-side. E.g., to set all documents' b fields to 2 * a:
db.eval(function() {
    var collection = db.test_collection;
    collection.find().forEach(function(doc) {
        var b = 2 * doc.a;
        collection.update({_id: doc._id}, {$set: {b: b}});
    });
});
The second idea is to use the aggregation framework's $out operator, new in MongoDB 2.5.2, to transform the collection into a second collection that includes the calculated field:
db.test_collection.aggregate({
    $project: {
        a: '$a',
        b: {$multiply: [2, '$a']}
    }
}, {
    $out: 'test_collection2'
});
Note that $project must explicitly include all the fields you want; only _id is included by default.
For a million documents on my machine the former approach took 2.5 minutes, and the latter 9 seconds. So you could use the aggregation framework to copy your data from its source to its destination, with the calculated fields included. Then, if desired, drop the original collection and rename the target collection to the source's name.
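For reference, a hedged sketch of driving the same $out approach from PyMongo, followed by the optional drop-and-rename (names are taken from the shell example above):

from pymongo import MongoClient

db = MongoClient().test  # assumes a local mongod; the database name is a placeholder

db.test_collection.aggregate([
    {'$project': {'a': '$a', 'b': {'$multiply': [2, '$a']}}},
    {'$out': 'test_collection2'},
])

# Then, if desired, replace the original collection with the transformed one.
db.test_collection.drop()
db.test_collection2.rename('test_collection')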
My final thought on this, is that MongoDB 2.5.3 and later can stream large result sets from an aggregation pipeline using a cursor. There's no reason Monary can't use that capability, so you might file a feature request there. That would allow you to get documents from a collection in the form you want, via Monary, without having to actually store the calculated fields in MongoDB.