Pymongo w=1 with continue_on_error - mongodb

I have a collection of tweets, and I want to insert a list of tweets into it. The new list may contain some duplicate tweets; I want to ensure that the duplicates are not written but all the remaining tweets are. To achieve this, I'm using the following code.
mongoPayload = <list of tweets>
committedTweetIDs = db.tweets.insert(mongoPayload, w=1, continue_on_error=True)
print "%d documents committed" % len(committedTweetIDs)
The above code snippet should work. However, the second line raises a DuplicateKeyError. I don't know why this is happening, since I specified continue_on_error.
What I want in the end is for Mongo to commit all the non-duplicate documents and return to me (as acknowledgement) tweetIDs of all the documents written to the journal.

Even with continue_on_error=True, PyMongo will raise a DuplicateKeyError if MongoDB tells it that you tried to insert a document with a duplicate _id. However, with continue_on_error=True, the server has attempted to insert all the documents in your list, instead of aborting the operation on the first error. The error_document attribute of the exception tells you the last duplicate _id in your list of documents.
Unfortunately you cannot determine how many documents succeeded and failed in total when you do a bulk insert. MongoDB 2.6 and PyMongo 2.7 will address this in the next release when we implement bulk write operations.
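For readers on later versions: the bulk write API mentioned above did ship, and in PyMongo 3 and later insert_many(..., ordered=False) raises a BulkWriteError whose details dict lists every failed document by its position in the list. A minimal sketch of how the committed IDs could be recovered from it, assuming each tweet carries its own _id and the insert was unordered (the function name committed_ids is made up for illustration):

```python
def committed_ids(payload, details):
    """Return the _id of every document in payload that was written.

    `details` is assumed to be BulkWriteError.details from an unordered
    insert_many(payload, ordered=False). Each entry in details["writeErrors"]
    carries the position ("index") of a document that failed.
    """
    failed = {err["index"] for err in details.get("writeErrors", [])}
    # With ordered=False every document is attempted, so anything that is
    # not listed as failed was committed.
    return [doc["_id"] for i, doc in enumerate(payload) if i not in failed]
```

This only holds for unordered inserts. With ordered=True the server stops at the first error, so documents after it are neither written nor listed in writeErrors.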

Related

query in mongodb atlas to verify the existence of multiple specific documents in a collection

I have a mongodb collection called employeeInformation, in which I have two documents:
{"name1":"tutorial1"}, {"name2":"tutorial2"}
When I do db.employeeInformation.find(), I get both these documents displayed. My question is: is there a query that I can run to confirm that the collection contains only those two specified documents? I tried db.employeeInformation.find({"name1":"tutorial1"}, {"name2":"tutorial2"}) but I only got the id corresponding to the first object with key "name1". I know it's easy to do here with 2 documents just by looking at the results of .find(), but when I insert hundreds of documents into the collection, I want a way of verifying that the collection contains all and only those documents (note I will always have the objects themselves as text). Ideally this query should work in the MongoDB Atlas console/interface as well.
db.collection.count()
will give you the number of documents in the collection once you have inserted them.
Thanks,
Neha
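A count only confirms how many documents there are, not which ones. If the expected documents are at hand, one option is to fetch everything and compare client-side. This is a rough sketch, not an Atlas query: it assumes `actual` was obtained with something like list(db.employeeInformation.find({}, {"_id": 0})) in PyMongo, and the helper names are made up.

```python
import json
from collections import Counter


def canonical(doc):
    """Serialize a document with sorted keys so equal documents compare equal."""
    return json.dumps(doc, sort_keys=True, default=str)


def compare_documents(expected, actual):
    """Return (missing, unexpected) as lists of canonical JSON strings."""
    want = Counter(canonical(d) for d in expected)
    have = Counter(canonical(d) for d in actual)
    missing = sorted((want - have).elements())      # expected but not stored
    unexpected = sorted((have - want).elements())   # stored but not expected
    return missing, unexpected
```

The collection holds all and only the expected documents when both returned lists are empty.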

Pymongo : insert_many + unique index

I want to insert_many() documents into my collection. Some of them may have the same key/value pair (screen_name in my example) as existing documents inside the collection. I have a unique index set on this key, therefore I get an error.
my_collection.create_index("screen_name", unique = True)
my_collection.insert_one({"screen_name":"user1", "foobar":"lalala"})
# no problem
to_insert = [
{"screen_name":"user1", "foobar":"foo"},
{"screen_name":"user2", "foobar":"bar"}
]
my_collection.insert_many(to_insert)
# error :
# File "C:\Program Files\Python\Anaconda3\lib\site-packages\pymongo\bulk.py", line 331, in execute_command
# raise BulkWriteError(full_result)
#
# BulkWriteError: batch op errors occurred
I'd like to :
Not get an error
Not change the already existing documents (here {"screen_name":"user1", "foobar":"lalala"})
Insert all the non-already existing documents (here, {"screen_name":"user2", "foobar":"bar"})
Edit : As someone said in comment "this question is asking how to do a bulk insert and ignore unique-index errors, while still inserting the successful records. Thus it's not a duplicate with the question how do I do bulk insert". Please reopen it.
One solution could be to use the ordered parameter of insert_many and set it to False (default is True):
my_collection.insert_many(to_insert, ordered=False)
From the PyMongo documentation:
ordered (optional): If True (the default) documents will be inserted
on the server serially, in the order provided. If an error occurs all
remaining inserts are aborted. If False, documents will be inserted on
the server in arbitrary order, possibly in parallel, and all document
inserts will be attempted.
However, you would still have to handle the exception that is raised when some of the documents could not be inserted.
Depending on your use-case, you could decide to either pass, log a warning, or inspect the exception.
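As a rough sketch of the "log a warning" and "inspect the exception" options: the wrapper below swallows the error only when every reported failure is a duplicate key (server error code 11000) and re-raises anything else. The function names are made up for illustration, and `collection` is assumed to be a PyMongo 3+ collection.

```python
import logging

log = logging.getLogger(__name__)
DUPLICATE_KEY_CODE = 11000  # server error code for a unique-index violation


def only_duplicate_errors(details):
    """True if every reported write error is a duplicate-key error."""
    errors = details.get("writeErrors", [])
    no_concern_errors = not details.get("writeConcernErrors")
    return no_concern_errors and all(
        err.get("code") == DUPLICATE_KEY_CODE for err in errors
    )


def insert_many_skip_duplicates(collection, docs):
    """Insert docs unordered and return how many were written."""
    # Imported lazily so only_duplicate_errors() stays usable without PyMongo.
    from pymongo.errors import BulkWriteError

    try:
        result = collection.insert_many(docs, ordered=False)
    except BulkWriteError as exc:
        if not only_duplicate_errors(exc.details):
            raise  # something other than a duplicate went wrong
        log.warning("skipped %d duplicate document(s)", len(exc.details["writeErrors"]))
        return exc.details["nInserted"]
    return len(result.inserted_ids)
```

With the example above, insert_many_skip_duplicates(my_collection, to_insert) would leave the existing user1 document alone and insert user2.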

In MongoDB find out when last query to a collection was? (Removing stale collections)

I would like to find out how old/stale a collection is. Is there a way to know when the last query was made to a collection, or even to get a list of the last access date for every collection?
If the _id of the documents in your MongoDB collection is an ObjectId, for example "_id" : ObjectId("57bee0cbc9735bf0b80c23e0"), then MongoDB has stored the document's creation timestamp inside it.
This can be retrieved by executing the following query:
db.newcollection.findOne({"_id" : ObjectId("57bee0cbc9735bf0b80c23e0")})._id.getTimestamp();
The result is an ISODate such as ISODate("2016-08-25T12:12:59Z").
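This works because the first four bytes of an ObjectId are the creation time in seconds since the Unix epoch. A small sketch that decodes it in plain Python from the hex string, without a database connection (the function name is made up; in PyMongo, bson's ObjectId exposes the same value as generation_time):

```python
from datetime import datetime, timezone


def objectid_timestamp(oid_hex):
    """Decode the creation time embedded in a 24-character ObjectId hex string."""
    if len(oid_hex) != 24:
        raise ValueError("an ObjectId is 24 hexadecimal characters")
    seconds = int(oid_hex[:8], 16)  # first 4 bytes: big-endian seconds since epoch
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


print(objectid_timestamp("57bee0cbc9735bf0b80c23e0").isoformat())
# prints 2016-08-25T12:12:59+00:00
```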
find out how old/stale a collection
MongoDB has no built-in facility for tracking how stale a collection is. But it is doable by maintaining a log with an entry for each access to a collection.
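One way such a log could look on the application side is sketched below. Everything here is hypothetical (the AccessLog class and its methods are not part of MongoDB or PyMongo), and it keeps the timestamps in memory only; a real setup would persist them, for example in a small collection of their own.

```python
import time


class AccessLog:
    """Remember when each collection was last touched (hypothetical helper)."""

    def __init__(self, clock=time.time):
        self._clock = clock          # injectable so the log can be tested
        self._last_access = {}       # collection name -> unix timestamp

    def touch(self, collection_name):
        """Record an access; call this wherever the application uses the collection."""
        self._last_access[collection_name] = self._clock()

    def stale(self, max_age_seconds):
        """Return the names of collections not touched within max_age_seconds."""
        cutoff = self._clock() - max_age_seconds
        return sorted(
            name for name, ts in self._last_access.items() if ts < cutoff
        )
```

Collections that were never touched do not appear in the log at all, so they would have to be found by comparing against the database's list of collection names.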
References
ObjectID.getTimestamp()
Log messages
Rotate Log files
db.collection.stats()

Remove obsolete collection in mongodb

I want to delete all the collections from my db which have not been used for a long time. Is there any way I can check when a particular collection was last used?
It depends what you mean by 'last used'. If you mean the last time a document was inserted into the collection then you could do this by converting the ObjectId of the last inserted document into a date. The following query should return the date the last document was inserted:
db.<collection_name>.find({}, {_id: 1}).sort({_id: -1}).limit(1).next()._id.getTimestamp()
Sorting on _id in descending order puts the most recently created ObjectId first. A bare findOne({}) returns documents in natural order, which usually yields the oldest document rather than the newest. You can then take the _id field and call the getTimestamp() function.
I'm not sure if there is any way to reliably tell when a collection was last queried. If you're running your database with profiling enabled then there might be entries in the db.system.profile collection, or in the oplog.

insert or ignore multiple documents in mongoDB

I have a collection in which all of my documents have at least these 2 fields, say name and url (where url is unique so I set up a unique index on it). Now if I try to insert a document with a duplicate url, it will give an error and halt the program. I don't want this behavior, but I need something like mysql's insert or ignore, so that mongoDB should not insert the document with duplicate url and continue with the next documents.
Is there some parameter I can pass to the insert command to achieve this behavior? I generally do a batch of inserts using pymongo as:
collection.insert(document_array)
Here collection is a collection and document_array is an array of documents.
So is there some way I can implement the insert or ignore functionality for a multiple document insert?
Set the continue_on_error flag when calling insert(). Note PyMongo driver 2.1 and server version 1.9.1 are required:
continue_on_error (optional): If True, the database will not stop
processing a bulk insert if one fails (e.g. due to duplicate IDs).
This makes bulk insert behave similarly to a series of single inserts,
except lastError will be set if any insert fails, not just the last
one. If multiple errors occur, only the most recent will be reported
by error().
Use insert_many(), and set ordered=False.
This will ensure that all write operations are attempted, even if there are errors:
http://api.mongodb.org/python/current/api/pymongo/collection.html#pymongo.collection.Collection.insert_many
Try this:
try:
    coll.insert(
        doc_or_docs=doc_array,
        continue_on_error=True)
except pymongo.errors.DuplicateKeyError:
    pass
The insert operation will still throw an exception if an error occurs in the insert (such as trying to insert a duplicate value for a unique index), but it will not affect the other items in the array. You can then swallow the error as shown above.
Why not just put your call to .insert() inside a try: ... except: block and continue if the insert fails?
In addition, you could also use a regular update() call with the upsert flag. Details here: http://www.mongodb.org/display/DOCS/Updating#Updating-update%28%29
If you have your array of documents already in memory in your python script, why not insert them by iterating through them, and simply catch the ones that fail on insertion due to the unique index?
for doc in docs:
    try:
        collection.insert(doc)
    except pymongo.errors.DuplicateKeyError:
        print('Duplicate url %s' % doc)
Where collection is an instance of a collection created from your connection/database instances and docs is the array of dictionaries (documents) you would currently be passing to insert.
You could also decide what to do with the duplicate keys that violate your unique index within the except block.
It is highly recommended to use upsert
stat.update({'location': d['user']['location']},
            {'$inc': {'count': 1}}, upsert=True, safe=True)
Here stat is the collection. If the visitor's location is already present in the collection, count is increased by one; otherwise a new document is inserted with count set to 1.
Here is the link for documentation http://www.mongodb.org/display/DOCS/Updating#Updating-UpsertswithModifiers
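For the insert-or-ignore case in the question, the same upsert idea can be combined with the $setOnInsert operator, which only takes effect when the upsert actually inserts. A minimal sketch, assuming a PyMongo 3+ collection (update_one replaced the older update call) and a unique field named url as in the question; the function name insert_or_ignore is made up:

```python
def insert_or_ignore(collection, docs, key="url"):
    """Upsert each document by `key`; existing documents are left unchanged."""
    inserted = 0
    for doc in docs:
        result = collection.update_one(
            {key: doc[key]},        # match on the unique field
            {"$setOnInsert": doc},  # applied only when the upsert inserts
            upsert=True,
        )
        if result.upserted_id is not None:
            inserted += 1           # a new document was created
    return inserted
```

This costs one round trip per document, so for large batches an unordered insert_many is likely faster.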
What I am doing:
Generate an array of the MongoDB ids I want to insert (a hash of some values in my case)
Remove the ids that already exist (I am using a Redis queue for performance, but you can query Mongo instead)
Insert your cleaned data!
Redis is perfect for that, but you can use Memcached or MySQL memory tables, according to your needs
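The three steps above can be sketched as follows, using the "query Mongo" variant rather than Redis. The function name is made up, and `collection` is assumed to behave like a PyMongo collection (find and insert_many), with every document carrying its own _id.

```python
def insert_new_only(collection, docs):
    """Insert only documents whose _id is not already stored; return their ids."""
    # Step 1: collapse duplicates inside the batch itself (first occurrence wins).
    unique = {}
    for doc in docs:
        unique.setdefault(doc["_id"], doc)

    # Step 2: ask the database which of those ids already exist.
    cursor = collection.find({"_id": {"$in": list(unique)}}, {"_id": 1})
    existing = {found["_id"] for found in cursor}

    # Step 3: insert what is left.
    fresh = [doc for _id, doc in unique.items() if _id not in existing]
    if fresh:  # insert_many rejects an empty list
        collection.insert_many(fresh)
    return [doc["_id"] for doc in fresh]
```

Note that the check and the insert are separate operations, so another writer can still slip in a duplicate between them; the unique index remains the real safeguard.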