Parallel insert into MongoDB


What happens if two clients, working with one MongoDB instance, perform an insert operation at the same time without «forceServerObjectId: true»? Is it possible for equal ObjectIDs to be generated, and could there be a conflict?

There is an implied unique index on the _id field of every collection which makes it impossible for two objects with the same _id to exist in the same collection.
When two objects with the same _id value are stored with collection.save, one document will replace the other.
When they are stored with collection.insert, one of the inserts will fail with a duplicate key error.
But note that MongoDB ObjectIDs include a 24-bit machine ID. This makes it impossible for two clients to generate the same ID unless they have the same machine ID, and even then it's unlikely. That, of course, only applies when you let the MongoDB driver (or shell) auto-generate ObjectIDs. MongoDB allows you to use a value of almost any type for the _id field when you set it manually. When you do this (you shouldn't), it's your responsibility to ensure uniqueness.
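The difference between insert and save described above can be modelled in a few lines. This is a minimal in-memory sketch, not the MongoDB driver API: ToyCollection and its methods are made-up names that only mimic the unique-index rule on _id.

```javascript
// Toy in-memory stand-in for a collection with a unique index on _id.
// This is NOT the MongoDB driver API; it only models the rule described above.
class ToyCollection {
  constructor() {
    this.docs = new Map(); // _id -> document
  }

  // insert: refuses a second document with an existing _id
  insert(doc) {
    if (this.docs.has(doc._id)) {
      throw new Error("E11000 duplicate key error: _id " + doc._id);
    }
    this.docs.set(doc._id, doc);
  }

  // save: replaces whatever is stored under the same _id
  save(doc) {
    this.docs.set(doc._id, doc);
  }
}

const coll = new ToyCollection();
coll.insert({ _id: "abc", from: "client A" });

// A second insert with the same _id is rejected.
let insertError = null;
try {
  coll.insert({ _id: "abc", from: "client B" });
} catch (err) {
  insertError = err.message;
}

// A save with the same _id silently replaces the stored document.
coll.save({ _id: "abc", from: "client B" });

console.log(insertError);
console.log(coll.docs.get("abc").from);
```

In this model the collection never holds two documents with the same _id, whichever operation the second client uses.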

Related

In MongoDB, how likely is it two documents in different collections in the same database will have the same Id?

According to the MongoDB documentation, the _id field (if not specified) is automatically assigned a 12 byte ObjectId.
It says a unique index is created on this field on the creation of a collection, but what I want to know is how likely is it that two documents in different collections but still in the same database instance will have the same ID, if that can even happen?
I want my application to be able to retrieve a document using just the _id field without knowing which collection it is in, but if I cannot guarantee uniqueness based on the way MongoDB generates one, I may need to look for a different way of generating Id's.

Short answer to your question: yes, that's possible.
The post below on a similar topic may help you understand this better:
Possibility of duplicate Mongo ObjectId's being generated in two different collections?

You are not required to use a BSON ObjectId for the _id field. You could use a hash of a timestamp and some random number, or a field with extremely high cardinality (a US SSN, for example), in order to make it close to impossible that two objects in the world will share the same id.
The _id_ index requires the _id to be unique per collection. This is much like an RDBMS, where two objects in two tables may very likely have the same primary key when it's an auto-incremented integer.
You cannot retrieve a document solely by its _id. Any driver I am aware of requires you to explicitly name the collection.
My 2 cents: The only thing you could do is to manually iterate over the existing collections and query each one for the _id you are looking for. Which is... inefficient, to put it politely. I'd rather semantically distinguish the documents in question by an additional field than by the collection they belong to. And remember, MongoDB uses dynamic schemas, so there is no reason to separate documents which semantically belong together but have a different set of fields. I'd guess there is something seriously wrong with your schema. Please elaborate so that we can help you with that.

MongoDB: Object ID Uniqueness [duplicate]

Is it possible for the same exact Mongo ObjectId to be generated for a document in two different collections? I realize that it's definitely very unlikely, but is it possible?
Without getting too specific, the reason I ask is that with an application that I'm working on we show public profiles of elected officials who we hope to convert into full fledged users of our site. We have separate collections for users and the elected officials who aren't currently members of our site. There are various other documents containing various pieces of data about the elected officials that all map back to the person using their elected official ObjectId.
After creating the account we still highlight the data that's associated to the elected official but they now also are a part of the users collection with a corresponding users ObjectId to map their profile to interactions with our application.
We had begun converting our application from MySql to Mongo a few months ago and while we're in transition we store the legacy MySql id for both of these data types and we're also starting to now store the elected official Mongo ObjectId in the users document to map back to the elected official data.
I was pondering just specifying the new user ObjectId as the previous elected official ObjectId to make things simpler but wanted to make sure that it wasn't possible to have a collision with any existing user ObjectId.
Thanks for your insight.
Edit: Shortly after posting this question, I realized that my proposed solution wasn't a very good idea. It would be better to just keep the current schema that we have in place and just link to the elected official '_id' in the users document.

Short Answer
Just to add a direct response to your initial question: YES, if you use BSON Object ID generation, then for most drivers the IDs are almost certainly going to be unique across collections. See below for what "almost certainly" means.
Long Answer
The BSON Object IDs generated by MongoDB drivers are highly likely to be unique across collections. This is mainly because of the last 3 bytes of the ID, which for most drivers are generated via a static incrementing counter. That counter is collection-independent; it's global. The Java driver, for example, uses a randomly initialized, static AtomicInteger.
So why, in the Mongo docs, do they say that the IDs are "highly likely" to be unique, instead of outright saying that they WILL be unique? Three possibilities can occur where you won't get a unique ID (please let me know if there are more):
Before this discussion, recall that the BSON Object ID consists of:
[4 bytes seconds since epoch, 3 bytes machine hash, 2 bytes process ID, 3 bytes counter]
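To make the layout concrete, here is a small sketch that splits a 24-character hex string into those four fields. It assumes the legacy layout quoted in this answer; parseLegacyObjectId is a hypothetical helper name and the sample value is made up.

```javascript
// Split a 24-character hex ObjectId into the four fields listed above.
// Assumes the legacy layout from this answer (4 + 3 + 2 + 3 bytes).
function parseLegacyObjectId(hex) {
  if (!/^[0-9a-fA-F]{24}$/.test(hex)) {
    throw new Error("expected 24 hex characters");
  }
  return {
    seconds: parseInt(hex.slice(0, 8), 16),     // 4 bytes: seconds since epoch
    machine: hex.slice(8, 14),                  // 3 bytes: machine hash
    processId: parseInt(hex.slice(14, 18), 16), // 2 bytes: process ID
    counter: parseInt(hex.slice(18, 24), 16),   // 3 bytes: counter
  };
}

// Made-up example value, not taken from a real database.
const parts = parseLegacyObjectId("4d2b1f0a1a2b3c04d2000001");
console.log(parts);
console.log(new Date(parts.seconds * 1000).toISOString());
```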
Here are the three possibilities, so you can judge for yourself how likely it is to get a dupe:
1) Counter overflow: there are 3 bytes in the counter. If you happen to insert over 16,777,216 (2^24) documents in a single second, on the same machine, in the same process, then you may overflow the incrementing counter bytes and end up with two Object IDs that share the same time, machine, process, and counter values.
2) Counter non-incrementing: some Mongo drivers use random numbers instead of incrementing numbers for the counter bytes. In these cases, there is a 1/16,777,216 chance of generating a non-unique ID, but only if those two IDs are generated in the same second (i.e. before the time section of the ID updates to the next second), on the same machine, in the same process.
3) Machine and process hash to the same values. The machine ID and process ID values may, in some highly unlikely scenario, map to the same values for two different machines. If this occurs, and at the same time the two counters on the two different machines, during the same second, generate the same value, then you'll end up with a duplicate ID.
These are the three scenarios to watch out for. Scenario 1 and 3 seem highly unlikely, and scenario 2 is totally avoidable if you're using the right driver. You'll have to check the source of the driver to know for sure.
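Scenarios 1 and 3 can be illustrated with a toy generator that follows the layout above. This is only a sketch under that assumption: makeGenerator, the machine hash "1a2b3c" and the process ID 1234 are invented for the example and do not come from any driver.

```javascript
// Sketch of a legacy-style ObjectId generator: time + machine + pid + counter.
// makeGenerator, machineHex and pid are illustrative names, not driver APIs.
const COUNTER_SIZE = 0x1000000; // 2^24 = 16,777,216 possible counter values

function makeGenerator(machineHex, pid, startCounter) {
  let counter = startCounter % COUNTER_SIZE;
  return function nextId(seconds) {
    const id =
      seconds.toString(16).padStart(8, "0") +
      machineHex +
      pid.toString(16).padStart(4, "0") +
      counter.toString(16).padStart(6, "0");
    counter = (counter + 1) % COUNTER_SIZE; // wraps around after 2^24 ids
    return id;
  };
}

const now = 1294671626; // fixed example second so the run is repeatable

// Scenario 1: the counter wraps around once it passes 0xffffff.
const genA = makeGenerator("1a2b3c", 1234, 0xfffffe);
const a1 = genA(now);
const a2 = genA(now);
const a3 = genA(now); // counter part is back at 000000

// Scenario 3: another generator with the same machine hash, process ID and
// counter value produces the same id within the same second.
const genB = makeGenerator("1a2b3c", 1234, 0xfffffe);
const b1 = genB(now);

console.log(a1, a2, a3);
console.log(a1 === b1);
```

A full wrap within one second would need more than 2^24 calls to nextId, which is why scenario 1 is so unlikely in practice.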

ObjectIds are generated client-side in a manner similar to UUID but with some nicer properties for storage in a database such as roughly increasing order and encoding their creation time for free. The key thing for your use case is that they are designed to guarantee uniqueness to a high probability even if they are generated on different machines.
Now if you were referring to the _id field in general, we do not require uniqueness across collections so it is safe to reuse the old _id. As a concrete example, if you have two collections, colors and fruits, both could simultaneously have an object like {_id: 'orange'}.
In case you want to know more about how ObjectIds are created, here is the spec: http://www.mongodb.org/display/DOCS/Object+IDs#ObjectIDs-BSONObjectIDSpecification

In case anyone is having problems with duplicate Mongo ObjectIDs, you should know that despite the unlikelihood of dups happening in Mongo itself, it is possible to have duplicate _id's generated with PHP in Mongo.
The use-case where this has happened with regularity for me is when I'm looping through a dataset and attempting to inject the data into a collection.
The array that holds the injection data must be explicitly reset on each iteration - even if you aren't specifying the _id value. For some reason, the INSERT process adds the Mongo _id to the array as if it were a global variable (even if the array doesn't have global scope). This can affect you even if you are calling the insertion in a separate function call where you would normally expect the values of the array not to persist back to the calling function.
There are three solutions to this:
1) You can unset() the _id field from the array.
2) You can reinitialize the entire array with array() each time you loop through your dataset.
3) You can explicitly define the _id value yourself (taking care to define it in such a way that you don't generate dups yourself).
My guess is that this is a bug in the PHP interface, and not so much an issue with Mongo, but if you run into this problem, just unset the _id and you should be fine.
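The same pitfall can be reproduced in JavaScript with a stand-in insert function that writes the generated _id back into the object it receives. This is an analogy under stated assumptions, not the PHP driver: toyInsert, the "id-N" format and the sample rows are all invented for the sketch.

```javascript
// Toy insert that writes the generated _id back into the object it was given.
// toyInsert and the id format are assumptions for illustration only.
let nextToyId = 1;
const stored = new Map();

function toyInsert(doc) {
  if (doc._id === undefined) {
    doc._id = "id-" + nextToyId++; // mutates the caller's object
  }
  if (stored.has(doc._id)) {
    throw new Error("duplicate key: " + doc._id);
  }
  stored.set(doc._id, { ...doc });
}

const rows = ["red", "green", "blue"];

// Buggy: one object reused across iterations keeps the _id from the first insert.
const reused = {};
let failures = 0;
for (const color of rows) {
  reused.color = color;
  try {
    toyInsert(reused);
  } catch (err) {
    failures++;
  }
}

// Fixed: build a fresh object on each iteration.
for (const color of rows) {
  toyInsert({ color });
}

console.log(failures, stored.size);
```

In this model the reused object succeeds once and then fails on every later iteration, while the fresh objects all get their own ids.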

There's no guarantee whatsoever about ObjectId uniqueness across collections. Even if it's probabilistically very unlikely, it would be a very poor application design that relied on _id uniqueness across collections.
One can easily test this in the mongo shell:
MongoDB shell version: 1.6.5
connecting to: test
> db.foo.insert({_id: 'abc'})
> db.bar.insert({_id: 'abc'})
> db.foo.find({_id: 'abc'})
{ "_id" : "abc" }
> db.bar.find({_id: 'abc'})
{ "_id" : "abc" }
> db.foo.insert({_id: 'abc', data:'xyz'})
E11000 duplicate key error index: test.foo.$_id_ dup key: { : "abc" }
So, absolutely don't rely on _id's being unique across collections, and since you don't control the ObjectId generation function, don't rely on it.
It's possible to create something that's more like a uuid, and if you do that manually, you could have some better guarantee of uniqueness.
Remember that you can put objects of different "types" in the same collection, so why not just put your two "tables" in the same collection. They would share the same _id space, and thus, would be guaranteed unique. Switching from "prospective" to "registered" would be a simple flipping of a field...

Duplicate documents on _id (in mongo)

I have a sharded mongo collection, with over 1.5 mil documents. I use the _id column as a shard key, and the values in this column are integers (rather than ObjectIds).
I do a lot of write operations on this collection, using the Perl driver (insert, update, remove, save) and mongoimport.
My problem is that somehow, I have duplicate documents on the same _id. From what I've read, this shouldn't be possible.
I've removed the duplicates, but others still appear.
Do you have any ideas where could they come from, or what should I start looking at?
(Also, I've tried to replicate this on a smaller, test collection, but no duplicates are inserted, no matter what write operation I perform).

This actually isn't a problem with the Perl driver; it is related to the characteristics of sharding. MongoDB is only able to enforce uniqueness among the documents located on a single shard at the time of creation, so the default index does not require uniqueness.
In the MongoDB: Configuring Sharding documentation there is specific mention that:
When you shard a collection, you must specify the shard key. If there is data in the collection, mongo will require an index to be created upfront (it speeds up the chunking process); otherwise, an index will be automatically created for you.
You can use the {unique: true} option to ensure that the underlying index enforces uniqueness so long as the unique index is a prefix of the shard key.
If the "unique: true" option is not used, the shard key does not have to be unique.

How have you implemented generating the integer Ids?
If you use a system like the one suggested on the MongoDB website, you should be fine. For reference:
function counter(name) {
    var ret = db.counters.findAndModify({
        query: {_id: name},
        update: {$inc: {next: 1}},
        "new": true,
        upsert: true
    });
    return ret.next;
}
db.users.insert({_id: counter("users"), name: "Sarah C."}) // _id : 1
db.users.insert({_id: counter("users"), name: "Bob D."})   // _id : 2
If you are generating your ids by reading the most recent record in the document store, incrementing the number in the Perl code, and then inserting with the incremented number, you could be running into timing issues: two writers can read the same value before either of them inserts.
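The timing issue can be shown without a database. The sketch below is an in-memory model with invented names (readMaxId, a local counter function standing in for the findAndModify version); it only illustrates why read-then-increment can hand out the same id twice, while a single atomic increment cannot.

```javascript
// Toy model of the two id strategies. All names here are illustrative.
const users = new Map();    // _id -> document
const counters = new Map(); // counter name -> last value handed out

// Unsafe: read the highest existing _id, then add one in application code.
function readMaxId() {
  return users.size === 0 ? 0 : Math.max(...users.keys());
}

// Safe: one atomic step hands out the next value (the role that
// findAndModify with $inc plays on the server).
function counter(name) {
  const next = (counters.get(name) || 0) + 1;
  counters.set(name, next);
  return next;
}

// Two writers both read before either one inserts.
const seenByA = readMaxId();
const seenByB = readMaxId();
const idA = seenByA + 1;
const idB = seenByB + 1;
const unsafeCollision = idA === idB; // both writers computed the same id

// With the atomic counter each writer gets its own value.
const safeA = counter("users");
const safeB = counter("users");

console.log(unsafeCollision, safeA, safeB);
```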

heterogeneous bulk update in mongodb

I know that we can bulk update documents in mongodb with
db.collection.update( criteria, objNew, upsert, multi )
in one db call, but it's homogeneous, i.e. all the documents affected match one set of criteria. But what I'd like to do is something like
db.collection.update([{criteria1, objNew1}, {criteria2, objNew2}, ...])
to send multiple update requests, which might update completely different documents or classes of documents, in a single db call.
What I want to do in my app is to insert/update a bunch of objects with a compound primary key: if the key already exists, update it; otherwise insert it.
Can I do all of this in one combined call in MongoDB?

That's two separate questions. To the first one: there is no MongoDB-native mechanism to bulk-send criteria/update pairs, although technically doing that in a loop yourself is bound to be about as efficient as any native bulk support.
Checking for the existence of a document based on an embedded document (what you refer to as compound key, but in the interest of correct terminology to avoid confusion it's better to use the mongo name in this case) and insert/update depending on that existence check can be done with upsert :
document A :
{
    _id: ObjectId(...),
    key: {
        name: "Will",
        age: 20
    }
}
db.users.update({"key.name": "Will", "key.age": 20}, {$set: {"key.age": 21}}, true, false)
This upsert (update with insert if no document matches the criteria) will do one of two things depending on the existence of document A :
Exists : Performs the update $set:{"key.age":21} on the existing document.
Doesn't exist : Creates a new document with the embedded fields "key.name" and "key.age" set to "Will" and 20 respectively (basically the criteria are copied into the new doc), and then the update is applied ($set:{"key.age":21}). The end result is a document with "key.name"="Will" and "key.age"=21.
Hope that helps
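The two branches of an upsert can be sketched in memory as follows. This is a simplified model, not the MongoDB API: toyUpsert is a made-up function that supports only equality criteria on flat field names, which is enough to show the "copy the criteria, then apply the update" behaviour.

```javascript
// Toy upsert over an in-memory array; not the MongoDB API.
// Only equality criteria and flat field names are supported here.
function toyUpsert(docs, criteria, setFields) {
  const matches = (doc) =>
    Object.keys(criteria).every((key) => doc[key] === criteria[key]);
  let target = docs.find(matches);
  if (target === undefined) {
    target = { ...criteria }; // criteria are copied into the new document
    docs.push(target);
  }
  Object.assign(target, setFields); // then the $set-style update is applied
  return target;
}

const people = [];

// No match yet: a document is created from the criteria, then updated.
const created = toyUpsert(people, { name: "Will", age: 20 }, { age: 21 });

// Match exists: the same document is updated in place.
const updated = toyUpsert(people, { name: "Will", age: 21 }, { age: 22 });

console.log(people);
```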

We are seeing some benefits from the $in clause.
Our use case was to update the 'status' in a document for a large number of records.
In our first cut, we were doing a for loop and doing the updates one by one. But then we switched to using the $in clause, and that made a huge improvement.
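As a rough illustration of why this helps, the sketch below counts calls in an in-memory model. toyUpdateMany is an invented stand-in rather than a driver method; the Set plays the role of a {_id: {$in: [...]}} criteria, and each call represents one round trip to the server.

```javascript
// Toy collection that counts how many update calls it receives.
// toyUpdateMany is an illustrative stand-in, not a driver method.
const records = [1, 2, 3, 4, 5].map((id) => ({ _id: id, status: "new" }));
let calls = 0;

function toyUpdateMany(idList, status) {
  calls++; // one call = one round trip in this model
  const wanted = new Set(idList); // plays the role of {_id: {$in: idList}}
  let modified = 0;
  for (const doc of records) {
    if (wanted.has(doc._id)) {
      doc.status = status;
      modified++;
    }
  }
  return modified;
}

const ids = [1, 3, 5];

// First cut: one call per id.
for (const id of ids) {
  toyUpdateMany([id], "pending");
}
const callsLoop = calls;

// Second cut: one call carrying every id.
calls = 0;
const modified = toyUpdateMany(ids, "done");
const callsIn = calls;

console.log(callsLoop, callsIn, modified);
```

This works here because every document receives the same new status; updates that differ per document cannot be folded into one $in call this way.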

There is no real benefit from doing updates the way you suggest.
The reason that there is a bulk insert API and that it is faster is that Mongo can write all the new documents sequentially to memory, and update indexes and other bookkeeping in one operation.
A similar thing happens with updates that affect more than one document: the update will traverse the index only once and update objects as they are found.
Sending multiple criteria with multiple updates cannot benefit from any of these optimizations. Each criterion means a separate query, just as if you issued each update separately. The only possible benefit would be sending slightly fewer bytes over the connection. The database would still have to do each query separately and update each document separately.
All that would happen is that Mongo would queue the updates internally and execute them sequentially (because only one update can happen at any one time); this is exactly the same as if all the updates were sent separately.
It's unlikely that the overhead in sending the queries separately would be significant; Mongo's global write lock will be the limiting factor anyway.

MongoDB: getting ObjectId of last inserted document with multiple, concurrent writers?

Consider the following scenario with MongoDB:
Three writers (A,B,C) insert a document into the same collection.
A inserts first, followed by B, followed by C.
How can we guarantee A retrieves the ObjectId of the document he inserted and not B's document or C's document? Do we need to serialize the writes (i.e., only permit B to write after A inserts and retrieves the ObjectId), or does MongoDB offer some native functionality for this scenario?
Thanks!
We're on Rails.

The normal pattern here is for the driver to allocate the ObjectId, so you know what it is for the insert even before the server gets it.

You can generate the _id value in your client applications (writers) before inserting the document. This way you don't need to rely on the server generating the ObjectId and then retrieving the correct value. Most MongoDB language drivers will do this for you automatically if you leave the _id blank.
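A minimal sketch of that pattern, with no database involved: each writer builds its own id first and therefore always knows which document is its own. newId is a simplified stand-in for a driver's ObjectId generator (it assumes a timestamp + per-process random value + counter layout and uses Math.random, which real drivers would not), and writer is a hypothetical helper.

```javascript
// Each writer creates its own _id before the insert, so it never has to ask
// the server which document was "last". newId is a simplified stand-in for a
// driver's ObjectId generator, not a real driver function.
let idCounter = Math.floor(Math.random() * 0x1000000);
const processTag = Math.floor(Math.random() * 0x10000000000)
  .toString(16)
  .padStart(10, "0"); // 5 bytes chosen once per process (assumption)

function newId() {
  const seconds = Math.floor(Date.now() / 1000).toString(16).padStart(8, "0");
  idCounter = (idCounter + 1) % 0x1000000;
  return seconds + processTag + idCounter.toString(16).padStart(6, "0");
}

const collection = new Map(); // in-memory stand-in for the shared collection

function writer(name) {
  const doc = { _id: newId(), writer: name }; // id is known before the insert
  collection.set(doc._id, doc);
  return doc._id;
}

const idOfA = writer("A");
const idOfB = writer("B");
const idOfC = writer("C");

// A looks up its own document by the id it generated, regardless of order.
console.log(collection.get(idOfA).writer);
```

No serialization of the writers is needed in this model, because nothing depends on the order in which the inserts arrive.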
