Skip same id during MongoDB insertMany [duplicate] - mongodb

This question already has answers here:
How to continue insertion after duplicate key error using PyMongo
(2 answers)
Closed 6 years ago.
Is there a way of telling MongoDB to skip (instead of crashing) if we want to insert a document which has the same _id as an existing one?
Say, I have this list and I want to persist it:
to_mongo = [
    {'_id': 'aaaaaa', 'content': 'hey'},
    {'_id': 'bbbbbb', 'content': 'you'},
    {'_id': 'aaaaaa', 'content': 'hey'}
]
mongo_collection.insert_many(to_mongo)
I'd like the last item to be simply ignored, instead of causing the whole request to crash.

Try using ordered=False in the insert_many() method, i.e.
to_mongo = [
    {'_id': 'aaaaaa', 'content': 'hey'},
    {'_id': 'bbbbbb', 'content': 'you'},
    {'_id': 'aaaaaa', 'content': 'hey'}
]
mongo_collection.insert_many(to_mongo, ordered=False)
This makes sure that all write operations are attempted, even if there are errors. From the docs:
ordered (optional): If True (the default) documents will be inserted
on the server serially, in the order provided. If an error occurs all
remaining inserts are aborted. If False, documents will be inserted on
the server in arbitrary order, possibly in parallel, and all document
inserts will be attempted.
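Note that even with ordered=False, PyMongo still raises a BulkWriteError after attempting all the inserts, so if you don't want the call itself to fail you still need to catch it. A minimal sketch, using the same to_mongo list as above:

from pymongo.errors import BulkWriteError

try:
    mongo_collection.insert_many(to_mongo, ordered=False)
except BulkWriteError:
    # The duplicate _id is reported here, but the non-duplicate documents
    # have already been inserted by this point.
    pass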

Related

Match up instances 2 lists in pymongo

I am using pipelines in pymongo to query a json file.
I have one list, "sixcities", containing the 6 'cities' with the 'highest count' of book shops (contains 6 pymongo instances):
{'_id': 'city1', 'count': 84}
{'_id': 'city2', 'count': 65}
{'_id': 'city3', 'count': 61}
{'_id': 'city4', 'count': 59}
{'_id': 'city5', 'count': 84}
{'_id': 'city6', 'count': 64}
I have a second list, "travelcities", with the counts of Travel Book shops in each of the 'cities' (20+) in the json file. (contains 20+ pymongo instances)
{'_id': 'city1', 'count': 42}...etc
Please note: This list holds cities that do not feature in the first list.
I would like to use these lists to calculate the ratios of travel book shops in the 6 highest count cities.
The common key will be 'city' as this appears in documents of both lists
i.e. city1 in list 2 (42) divided by city1 in list 1 (84) = a ratio of 0.5
I am unsure of how to do this in pymongo as the information is in mongo documents within a list.
I thought some kind of nested loop would work:
dict = {}
for i in sixcities:  # loop through the first list
    dict[i["_id"]] = i["count"]
    for i in travelcities:  # loop through second list
        dict[i["_id"]] = i["count"] / (dict[i["_id"]])  # ratio
But I am getting the following result:
KeyError: 'city15'
This city does not appear in the first list as one of the 6 with the most bookshops, but it does appear in the second as containing a travel bookshop.
Any and all help is appreciated.
One of the problems in your code is that you are using the same variable 'i' in both the outer and the inner loop.
Consider this code, which, for each city in the first list, searches for it in the second list and then computes the ratio.
dict = {}
for i in sixcities:  # loop through the first list
    dict[i["_id"]] = i["count"]
    for j in travelcities:  # loop through second list
        if j["_id"] == i["_id"]:
            dict[i["_id"]] = j["count"] / (dict[i["_id"]])  # ratio
Do note that if a city does not exist in the second list, its value remains the count from the first list. Handle this corner case in whatever way you want.
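If the nested loop gets slow for larger lists, a dictionary lookup is an alternative worth sketching. This assumes both lists are plain lists of dicts as in the question; the names travel_counts and ratios are just illustrative:

# Build a lookup of travel-shop counts once, then divide only for the six cities.
travel_counts = {doc["_id"]: doc["count"] for doc in travelcities}

ratios = {}
for doc in sixcities:
    city, total = doc["_id"], doc["count"]
    # Cities with no recorded travel book shops get a ratio of 0.
    ratios[city] = travel_counts.get(city, 0) / total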

MongoDB result as aggregated array of fields, not array of objects [duplicate]

This question already has answers here:
How do I get a list of just the ObjectId's using pymongo?
(5 answers)
Closed 5 years ago.
I have a collection of documents (~1 billion items) and I want to get it back as an array of a single field's values. At the same time, I do not want to post-process the result of the Mongo query.
Example:
// Collection looks like
[
    {"_id": ObjectId("..."), "id": "12313123", ....},
    {"_id": ObjectId("..."), "id": "35675468456", ....},
    {"_id": ObjectId("..."), "id": "23233463", ....},
    ....
]
// Desired result
["12313123", "35675468456", "23233463"]
I.e. I want to get only the id field and flatten the result. But the statement
db.collection.find({}, {"_id": 0, "id": 1}) returns a list of objects.
Would single-purpose aggregation db.collection.distinct("id") work for you?
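In PyMongo that would be a one-liner; note as a caveat that distinct results must fit in a single 16 MB BSON document, which may matter at this collection size:

ids = db.collection.distinct("id")
# ['12313123', '35675468456', '23233463', ...]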

insert_many with upsert - PyMongo [duplicate]

This question already has answers here:
Fast or Bulk Upsert in pymongo
(6 answers)
Closed 3 years ago.
I have some data like this:
data = [{'_id': 1, 'val': 5},
        {'_id': 2, 'val': 1}]
current data in db:
>>> db.collection.find_one()
{'_id': 1, 'val': 3}
I always receive unique rows but am not sure whether any of them already exist in the DB (as in the case above). And I want to update them based on two types of requirements.
Requirement 1:
Do NOT update the rows if _id already exists. This is kinda easy in a way:
from pymongo.errors import BulkWriteError

try:
    db.collection.insert_many(data, ordered=False)
except BulkWriteError:
    pass
executing the above would insert 2nd row but won't update the first; but it also raises the exception.
1. Is there any better way of doing the above operation (for bulk inserts)?
Requirement 2
This is similar to update_if_exists & insert if not exists combined. So the following data:
data2 = [{'_id': 1, 'val': 9},
         {'_id': 3, 'val': 4}]
should update the row with _id=1 and insert the 2nd row in DB.
The problem is I get thousands of rows at one time and am not sure if checking and updating one-by-one is efficient.
2. Is this requirement possible in MongoDB without iterating over each row, and with as few operations as possible?
You can generate a list of update operations and pass them to the bulk write API. They are sent to the server together, and while they are still executed one by one on the server, they do not raise an error for the documents that already exist.
from pymongo import UpdateOne

data2 = [{'_id': 1, 'val': 9}, {'_id': 3, 'val': 4}]
upserts = [
    UpdateOne({'_id': x['_id']}, {'$setOnInsert': x}, upsert=True)
    for x in data2
]
result = db.test.bulk_write(upserts)
You can see in the result that when _id is found the operation is a no-op, but when it's not found, it's an insert.
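For requirement 2 (update if it exists, insert otherwise), the same bulk_write approach should work with a replacement instead of $setOnInsert; a sketch under that assumption:

from pymongo import ReplaceOne

data2 = [{'_id': 1, 'val': 9}, {'_id': 3, 'val': 4}]

# Upsert: replaces the document when _id is found, inserts it when it is not.
upserts = [
    ReplaceOne({'_id': x['_id']}, x, upsert=True)
    for x in data2
]
result = db.test.bulk_write(upserts)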

Get inserted ids after failed insert_many()

I'm currently trying to write a script that inserts documents into a MongoDB and returns where each element is stored. This is very simple thanks to insert_many(); however, my problem occurs if there is an error while inserting:
I won't be able to get the ids that have just been inserted.
from pymongo import MongoClient
client = MongoClient(...)
db = client.test
r = db.test.insert_many([{'foo': 1}, {'foo': 2}, {'foo': 3}])
r.inserted_ids
#: [ObjectId('56b2a592dfcce9001a6efff8'),
#: ObjectId('56b2a592dfcce9001a6efff9'),
#: ObjectId('56b2a592dfcce9001a6efffa')]
list(db.test.find())
#: [{'_id': ObjectId('56b2a592dfcce9001a6efff8'), 'foo': 1},
#: {'_id': ObjectId('56b2a592dfcce9001a6efff9'), 'foo': 2},
#: {'_id': ObjectId('56b2a592dfcce9001a6efffa'), 'foo': 3}]
# This is dead stupid, but forcing an error by re-using the ObjectId we just generated
r2 = db.test.insert_many([{'foo': 4}, {'_id': r.inserted_ids[0], 'foo': 6}, {'foo': 7}])
#: ---------------------------------------------------------------------------
#: BulkWriteError Traceback (most recent call last)
#: <Cut in the interest of time>
Of course, r2 is not initialized, so I can't ask for inserted_ids; however, one record will have been inserted into the database:
list(db.test.find())
#: [{'_id': ObjectId('56b2a592dfcce9001a6efff8'), 'foo': 1},
#: {'_id': ObjectId('56b2a592dfcce9001a6efff9'), 'foo': 2},
#: {'_id': ObjectId('56b2a592dfcce9001a6efffa'), 'foo': 3},
#: {'_id': ObjectId('56b2a61cdfcce9001a6efffd'), 'foo': 4}]
What I want is to be able to reliably figure out which ids were inserted, in order. Something like:
r2.inserted_ids
#: [ObjectId('56b2a61cdfcce9001a6efffd'),
#: None, # or maybe even some specific error for this point.
#: None]
Setting ordered=False still gives the error, so r2 won't be initialized (and it wouldn't reliably return the ids in the order I gave anyway).
Is there any option here?
pymongo sets the _id field on the client side, before sending the documents to the server. It modifies the documents you pass in place.
This means that all the documents you pass are left with the _id field set -- the successful ones and the failed ones.
So you just need to figure out which ones were successful. This can be done as #Austin explained.
Something like:
import pymongo.errors

docs = [{'foo': 1}, {'foo': 2}, {'foo': 3}]
try:
    r = db.test.insert_many(docs)
except pymongo.errors.OperationFailure as exc:
    inserted_ids = [doc['_id'] for doc in docs if not is_failed(doc, exc)]
else:
    inserted_ids = r.inserted_ids
is_failed(doc, exc) can be implemented by searching doc in the list of failed documents in the exception details, as explained by #Austin.
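For inserts that fail with a BulkWriteError, the exception's details dict carries the failed operations under 'writeErrors', so one possible (hypothetical) is_failed could look like this:

from pymongo.errors import BulkWriteError

def is_failed(doc, exc):
    # Each entry in writeErrors describes one failed operation; for inserts,
    # 'op' is the document that could not be written.
    if not isinstance(exc, BulkWriteError):
        return False
    failed = [err.get('op', {}) for err in exc.details.get('writeErrors', [])]
    return any(op.get('_id') == doc['_id'] for op in failed)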
Catch the thrown exception. At least according to this site, the returned error details include the bad record. That should enable you to determine the successful records.

Indexing two fields in mongo: timestamp and text

I run a lot of find requests on a collection, like this:
{'$and': [
    {'time': {'$lt': 1375214400}},
    {'time': {'$gte': 1375128000}},
    {'$or': [{'uuid': 'test'}, {'uuid': 'test2'}]}
]}
Which index should I create: a compound index, two single-field indexes, or both?
uuid - name of the data collector.
time - timestamp.
I want to retrieve data collected by one or a few collectors in a specified time interval.
Your query would be better written without the $and and using $in instead of $or:
{
    'time': {'$lt': 1375214400, '$gte': 1375128000},
    'uuid': {'$in': ['test', 'test2']}
}
Then it's pretty clear you need a compound index that covers both time and uuid for best query performance. But it's important to always confirm your index is being used as you expect by using explain().
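In PyMongo, for instance, the index and the check could look like this (assuming the collection is db.collection; putting the equality field uuid before the range field time is a common rule of thumb):

# Compound index covering both fields used by the query.
db.collection.create_index([('uuid', 1), ('time', 1)])

# Verify the query actually uses the index.
db.collection.find({
    'time': {'$lt': 1375214400, '$gte': 1375128000},
    'uuid': {'$in': ['test', 'test2']},
}).explain()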