Insert or ignore multiple documents in MongoDB

I have a collection in which all of my documents have at least these 2 fields, say name and url (where url is unique, so I set up a unique index on it). Now if I try to insert a document with a duplicate url, it raises an error and halts the program. I don't want this behavior; I need something like MySQL's insert or ignore, so that MongoDB skips the document with the duplicate url and continues with the next documents.
Is there some parameter I can pass to the insert command to achieve this behavior? I generally do a batch of inserts using pymongo as:
collection.insert(document_array)
Here collection is a collection and document_array is an array of documents.
So is there some way I can implement the insert or ignore functionality for a multiple document insert?

Set the continue_on_error flag when calling insert(). Note PyMongo driver 2.1 and server version 1.9.1 are required:
continue_on_error (optional): If True, the database will not stop
processing a bulk insert if one fails (e.g. due to duplicate IDs).
This makes bulk insert behave similarly to a series of single inserts,
except lastError will be set if any insert fails, not just the last
one. If multiple errors occur, only the most recent will be reported
by error().

Use insert_many(), and set ordered=False.
This will ensure that all write operations are attempted, even if there are errors:
http://api.mongodb.org/python/current/api/pymongo/collection.html#pymongo.collection.Collection.insert_many

Try this:
try:
    coll.insert(
        doc_or_docs=doc_array,
        continue_on_error=True)
except pymongo.errors.DuplicateKeyError:
    pass
The insert operation will still throw an exception if an error occurs in the insert (such as trying to insert a duplicate value for a unique index), but it will not affect the other items in the array. You can then swallow the error as shown above.

Why not just put your call to .insert() inside a try: ... except: block and continue if the insert fails?
In addition, you could also use a regular update() call with the upsert flag. Details here: http://www.mongodb.org/display/DOCS/Updating#Updating-update%28%29

If you have your array of documents already in memory in your python script, why not insert them by iterating through them, and simply catch the ones that fail on insertion due to the unique index?
for doc in docs:
    try:
        collection.insert(doc)
    except pymongo.errors.DuplicateKeyError:
        print 'Duplicate url %s' % doc
Where collection is an instance of a collection created from your connection/database instances and docs is the array of dictionaries (documents) you would currently be passing to insert.
You could also decide what to do with the duplicate keys that violate your unique index within the except block.

It is highly recommended to use upsert:
stat.update({'location': d['user']['location']},
            {'$inc': {'count': 1}}, upsert=True, safe=True)
Here stat is the collection. If the visitor's location is already present in the collection, count is increased by one; otherwise a new document is inserted with count set to 1.
Here is the link to the documentation: http://www.mongodb.org/display/DOCS/Updating#Updating-UpsertswithModifiers

What I am doing:
Generate an array of the MongoDB ids I want to insert (a hash of some values in my case)
Remove the ids that already exist (I am using a Redis queue for performance, but you can query Mongo instead)
Insert the cleaned data
Redis is perfect for that, but you can also use Memcached or a MySQL MEMORY table, depending on your needs
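
The filtering step above could look roughly like the sketch below. It is only an illustration: drop_existing is a hypothetical helper, and existing_ids stands for whatever set of ids you fetched from Redis or from a Mongo query. Note that checking first and inserting later is not safe against concurrent writers, so keeping the unique index as a backstop still seems wise.

```python
def drop_existing(docs, existing_ids):
    """Return the docs whose _id is not already known.

    Also drops repeats inside the batch itself, keeping the first one.
    `existing_ids` is any iterable of ids already stored (an assumption:
    how you obtain it, Redis or a Mongo query, is up to you).
    """
    seen = set(existing_ids)
    fresh = []
    for doc in docs:
        if doc["_id"] in seen:
            continue  # already stored, or already queued in this batch
        seen.add(doc["_id"])
        fresh.append(doc)
    return fresh


batch = [{"_id": "a1"}, {"_id": "b2"}, {"_id": "b2"}, {"_id": "c3"}]
to_insert = drop_existing(batch, existing_ids={"a1"})
# to_insert can then be handed to the collection's bulk insert call.
```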

Related

mongodb multiple documents insert or update by unique key

I would like to get a list of items from an external resource periodically and save them into a collection.
There are several possible solutions but they are not optimal, for example:
Delete the entire collection and save the new list of items
Get all items from the collection using "find({})" and use it to filter out existing items and save those that do not exist.
But a better solution will be to set a unique key and just do kind of "update or insert".
Right now, when I save items whose unique key already exists, I get an error.
Is there a way to do this at all?
Note: upsert won't do the job, since it updates all items with the same value, so it's really only good for a single document.
I have a feeling you can achieve what you want simply by using the "normal" insertMany with the ordered option set to false. The documentation states that
Note that one document was inserted: The first document of _id: 13
will insert successfully, but the second insert will fail. This will
also stop additional documents left in the queue from being inserted.
With ordered to false, the insert operation would continue with any
remaining documents.
So you will get "duplicate key" exceptions which, however, you can simply ignore in your case.

Pymongo : insert_many + unique index

I want to insert_many() documents in my collection. Some of them may have the same key/value pair (screen_name in my example) as existing documents inside the collection. I have a unique index set on this key, therefore I get an error.
my_collection.create_index("screen_name", unique = True)
my_collection.insert_one({"screen_name":"user1", "foobar":"lalala"})
# no problem
to_insert = [
    {"screen_name":"user1", "foobar":"foo"},
    {"screen_name":"user2", "foobar":"bar"}
]
my_collection.insert_many(to_insert)
# error :
# File "C:\Program Files\Python\Anaconda3\lib\site-packages\pymongo\bulk.py", line 331, in execute_command
# raise BulkWriteError(full_result)
#
# BulkWriteError: batch op errors occurred
I'd like to :
Not get an error
Not change the already existing documents (here {"screen_name":"user1", "foobar":"lalala"})
Insert all the non-already existing documents (here, {"screen_name":"user2", "foobar":"bar"})
Edit : As someone said in comment "this question is asking how to do a bulk insert and ignore unique-index errors, while still inserting the successful records. Thus it's not a duplicate with the question how do I do bulk insert". Please reopen it.
One solution could be to use the ordered parameter of insert_many and set it to False (default is True):
my_collection.insert_many(to_insert, ordered=False)
From the PyMongo documentation:
ordered (optional): If True (the default) documents will be inserted
on the server serially, in the order provided. If an error occurs all
remaining inserts are aborted. If False, documents will be inserted on
the server in arbitrary order, possibly in parallel, and all document
inserts will be attempted.
However, you would still have to handle the exception that is raised whenever some of the documents could not be inserted.
Depending on your use-case, you could decide to either pass, log a warning, or inspect the exception.
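
If you go the "inspect the exception" route, a small helper along these lines might be useful. This is a sketch under assumptions: the exception is pymongo.errors.BulkWriteError (as in the traceback in the question), its details attribute is a dict with a "writeErrors" list, and each entry carries "index" and "code". split_write_errors is a made-up name, not part of PyMongo. Code 11000 is MongoDB's duplicate key error code.

```python
DUPLICATE_KEY_CODE = 11000  # MongoDB error code for a unique-index violation


def split_write_errors(details):
    """Separate duplicate-key errors from every other write error.

    `details` is expected to have the shape of BulkWriteError.details:
    a dict whose "writeErrors" entry is a list of per-document errors.
    """
    duplicates, others = [], []
    for err in details.get("writeErrors", []):
        if err.get("code") == DUPLICATE_KEY_CODE:
            duplicates.append(err)
        else:
            others.append(err)
    return duplicates, others


# Intended use with a live collection (not executed here):
#
#     try:
#         my_collection.insert_many(to_insert, ordered=False)
#     except BulkWriteError as exc:
#         duplicates, others = split_write_errors(exc.details)
#         if others:
#             raise  # something other than a duplicate went wrong
```

That way duplicates are silently skipped, but any unrelated failure in the batch still surfaces.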

What is 'upsert' in the context of MongoDB?

In the context of MongoDB, what is upsert?
Is this an update and insert?
Just curious as I see the usage of this term in many articles and documentation on the MongoDB website.
From the documentation: An operation that will either update the first document matched by a query or insert a new document if none matches. The new document will have the fields implied by the operation.
See http://docs.mongodb.org/manual/reference/glossary/#term-upsert
To put it in SQL terms, it is much like an ON DUPLICATE KEY ... UPDATE, except that it isn't as verbose to write.
So essentially, when you run an update and MongoDB doesn't find a matching document, it inserts one instead.
The condition for the upsert accepts all the same stuff as a normal update except it also has the $setOnInsert ( http://docs.mongodb.org/manual/reference/operator/update/setOnInsert/ ) operator which allows you to define a set of fields that will only be taken into consideration on an insert.
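
To illustrate $setOnInsert, here is a minimal sketch of how the filter and update documents for an "insert if missing, otherwise leave untouched" upsert could be built. The helper name insert_if_missing_spec and the sample field names are assumptions for the example, and it assumes the document has at least one field besides the key.

```python
def insert_if_missing_spec(doc, key):
    """Build (filter, update) for an upsert that only writes on insert.

    The filter matches on the unique key. Everything else goes under
    $setOnInsert, so an existing document is matched but not modified.
    """
    query = {key: doc[key]}
    # The equality field from the filter is copied into the new document
    # by the upsert itself, so it is left out of $setOnInsert here.
    on_insert = {k: v for k, v in doc.items() if k != key}
    return query, {"$setOnInsert": on_insert}


query, update = insert_if_missing_spec(
    {"url": "http://example.com", "name": "Example"}, key="url"
)
# With PyMongo 3 or later this pair would be passed as:
#     collection.update_one(query, update, upsert=True)
```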

Pymongo w=1 with continue_on_error

I have a collection of tweets. I want to insert a list of tweets into this collection. The new list may contain some duplicate tweets as well, and I want to ensure that duplicate tweets do not get written but all the remaining ones do. To achieve this, I'm using the following code.
mongoPayload = <list of tweets>
committedTweetIDs = db.tweets.insert(mongoPayload, w=1, continue_on_error=True)
print "%d documents committed" % len(committedTweetIDs)
The above code snippet should work. However, the behavior I'm getting is that the second line raises a DuplicateKeyError. I don't know why this is happening, since I specified continue_on_error.
What I want in the end is for Mongo to commit all the non-duplicate documents and return to me (as acknowledgement) tweetIDs of all the documents written to the journal.
Even with continue_on_error=True, PyMongo will raise a DuplicateKeyError if MongoDB tells it that you tried to insert a document with a duplicate _id. However, with continue_on_error=True, the server has attempted to insert all the documents in your list, instead of aborting the operation on the first error. The error_document attribute of the exception tells you the last duplicate _id in your list of documents.
Unfortunately you cannot determine how many documents succeeded and failed in total when you do a bulk insert. MongoDB 2.6 and PyMongo 2.7 will address this in the next release when we implement bulk write operations.

MongoDB: doing an atomic create and return operation

I need to create a document in mongodb and then immediately want it available in my application. The normal way to do this would be (in Python code):
doc_id = collection.insert({'name':'mike', 'email':'mike#gmail.com'})
doc = collection.find_one({'_id':doc_id})
There are two problems with this:
two requests to the server
not atomic
So, I tried using the find_and_modify operation to effectively do a "create and return" with the help of upserts like this:
doc = collection.find_and_modify(
    # so that no doc can be found
    query={'__no_field__': '__no_value__'},
    # If the <update> argument contains only field and value pairs,
    # and no $set or $unset, the method REPLACES the existing document
    # with the document in the <update> argument,
    # except for the _id field
    document={'name': 'mike', 'email': 'mike#gmail.com'},
    # since the document does not exist, this will create it
    upsert=True,
    # this will return the updated (in our case, newly created) document
    new=True
)
This indeed works as expected. My question is: whether this is the right way to accomplish a "create and return" or is there any gotcha that I am missing?
What exactly are you missing from a plain old regular insert call?
If the problem is not knowing what the _id will be, you could just create the _id yourself first and then insert the document. Then you know exactly what it will look like. None of the other fields will be different from what you sent to the database.
If you are worried about guarantees that the insert will have succeeded you can check the return code, and also set a write concern that provides enough assurances (such as that it has been flushed to disk or replicated to enough nodes).
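
Creating the _id yourself could be as simple as the sketch below. It assumes a random hex string is an acceptable _id for your collection (MongoDB accepts values other than ObjectId as _id); new_document is a hypothetical helper.

```python
import uuid


def new_document(fields):
    """Return a copy of `fields` with a client-generated _id added."""
    doc = dict(fields)  # copy, so the caller's dict is left untouched
    doc["_id"] = uuid.uuid4().hex  # assumption: a string _id is fine here
    return doc


doc = new_document({"name": "mike"})
# After a single successful insert of `doc`, the application already holds
# the stored document, so no follow-up find_one is needed.
```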