I have a distributed application that uses MongoDB as a backend. The application has two collections (C1 and C2) with an M:1 relationship, so if I delete a document in C1, I need to search C1 for any other documents that point to the same doc in C2, and if there are no matches, delete the related doc in C2.
This obviously has a race condition: while the search is going on, new documents pointing to the soon-to-be-deleted C2 document could be inserted into C1, resulting in DB inconsistency. Deletes can be delayed so that they are batched up and performed, say, once a week during low load, so I'm considering writing a distributed locking system for Mongo to solve the race condition.
Questions:
Is there a better approach than distributed locking?
Does software like this already exist for Mongo? I've seen single-document examples of locks, but not database-level distributed locks.
UPDATE
I left this out to avoid confusing the issue, but I need to include it now. There's actually another resource (R), essentially a file on a remote disk, that needs to be deleted along with the C2 document, and C2:R is M:1. R is completely outside the MongoDB ecosystem, which is why my mind jumped to locking the application so I can safely delete all this stuff. Hence the reversing-links idea mentioned below won't work for this case. Yes, the system is complicated, and no, it can't be changed.
UPDATE2
My attempts to abstract away implementation details to keep the question succinct keep biting me. Another detail: R is manipulated via REST calls to another server.
1.
This type of problem is usually solved by embedding. Essentially C1 and C2 could be a single collection, with the C2 doc embedded into each C1. Obviously this is not always possible or desirable, and one of the downsides is data duplication. Another downside is that you would not be able to find all C2s without going through all C1s, and given the M:1 relationship that's not always a good thing to do. So it depends on whether these cons are a real problem in your application.
2.
Another way to handle it would be to just remove the links from C1 to C2, leaving C2 documents that nothing points to. This could have a low cost in some cases.
3.
Use a two-phase commit, similar to what is described here: http://docs.mongodb.org/manual/tutorial/perform-two-phase-commits/.
4.
Yet another option could be to reverse your links. C2 would have an array of links that point to C1s. Each time you delete a C1 you $pull the link to the deleted C1 from that array. Immediately afterwards you delete from C2 with a condition that the array of links is empty and its _id is what you got back from the update. If a race condition happens where you insert a new document into C1 and try to update C2 but get back a result saying nothing was updated, then you can either fail your insert or try to insert a new C2. Here is an example:
// Insert first doc
db.many.insert({"name": "A"});
// Find it to get an ID to show.
db.many.find();
{ "_id" : ObjectId("52eaf9e05a07ef0270a9eccc"), "name" : "A" }
// lets add a tag to it right after
db.one.update({"tag": "favorite"}, {$addToSet: {"links": ObjectId("52eaf9e05a07ef0270a9eccc")}}, {upsert: true, multi: false});
// show that tag was created and a link was added
db.one.find();
{ "_id" : ObjectId("52eafaa77365653791085540"), "links" : [ ObjectId("52eaf9e05a07ef0270a9eccc") ], "tag" : "favorite" }
// Insert one more doc which will not be tagged just for illustration
db.many.insert({"name": "B"});
// Insert last document, show IDs of all docs and tag the last document inserted:
db.many.insert({"name": "C"});
db.many.find();
{ "_id" : ObjectId("52eaf9e05a07ef0270a9eccc"), "name" : "A" }
{ "_id" : ObjectId("52eafab95a07ef0270a9eccd"), "name" : "B" }
{ "_id" : ObjectId("52eafac85a07ef0270a9ecce"), "name" : "C" }
db.one.update({"tag": "favorite"}, {$addToSet: {"links": ObjectId("52eafac85a07ef0270a9ecce")}}, {upsert: true, multi: false});
// Now we have 2 documents tagged out of 3
db.one.find();
{ "_id" : ObjectId("52eafaa77365653791085540"), "links" : [ ObjectId("52eaf9e05a07ef0270a9eccc"), ObjectId("52eafac85a07ef0270a9ecce") ], "tag" : "favorite" }
// START DELETE PROCEDURE
// Let's delete first tagged document
db.many.remove({"_id" : ObjectId("52eaf9e05a07ef0270a9eccc")});
// remove the "dead" link
db.one.update({"tag": "favorite"}, {$pull: {"links": ObjectId("52eaf9e05a07ef0270a9eccc")}});
// just to show how it looks now (link removed)
db.one.find();
{ "_id" : ObjectId("52eafaa77365653791085540"), "links" : [ ObjectId("52eafac85a07ef0270a9ecce") ], "tag" : "favorite" }
// try to delete a document that has no links - it's not the case here yet, so the doc is not deleted.
db.one.remove({"tag" : "favorite", "links": {$size: 0}});
db.one.find();
{ "_id" : ObjectId("52eafaa77365653791085540"), "links" : [ ObjectId("52eafac85a07ef0270a9ecce") ], "tag" : "favorite" }
// DELETE OF THE FIRST DOC IS COMPLETE, if any docs got added with
// links then the tag will just have more links
// DELETE LAST DOC AND DELETE UNREFERENCED LINK
db.many.remove({"_id" : ObjectId("52eafac85a07ef0270a9ecce")});
db.one.update({"tag": "favorite"}, {$pull: {"links": ObjectId("52eafac85a07ef0270a9ecce")}});
// no links are left
db.one.find();
{ "_id" : ObjectId("52eafaa77365653791085540"), "links" : [ ], "tag" : "favorite" }
db.one.remove({"tag" : "favorite", "links": {$size: 0}});
// LAST DOC WAS DELETED AND A REFERENCING DOC WAS DELETED AS WELL
// final look at remaining data
db.one.find();
// empty
db.many.find();
{ "_id" : ObjectId("52eafab95a07ef0270a9eccd"), "name" : "B" }
If the upsert happens after you delete from one, it will just create a new doc and add a link. If it happens before, the old one doc will stay and its links will be updated properly.
UPDATE
Here is one way to deal with the "delete file" requirement. It assumes you have a POSIX-compliant filesystem like ext3/ext4; many other FSs have the same properties too. For each C2 you create, you also create a randomly named hard link which points to the R file, and store the path to that link in the C2 doc, for example. You'll end up with multiple hard links pointing to a single file. Whenever you delete a C2 you delete its hard link. Eventually, when the link count goes to 0, the OS will delete the file. Thus there is no way to delete the file unless you delete all the hard links.
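To make the hard-link idea concrete, here is a minimal Node.js sketch; the paths, function names and random-name scheme are just illustrations I'm assuming, not part of the original answer, and error handling plus the remote-disk/REST detail are ignored.
// Called when a C2 document is created: make a randomly named hard link to R
// and return its path so it can be stored in the C2 document.
const fs = require("fs");
const crypto = require("crypto");
function createC2Link(rFilePath, linkDir) {
    const linkPath = linkDir + "/" + crypto.randomBytes(16).toString("hex");
    fs.linkSync(rFilePath, linkPath);   // another directory entry for the same inode
    return linkPath;
}
// Called when that C2 document is deleted: remove only this C2's link.
// The OS frees the underlying file once its link count drops to zero.
function deleteC2Link(linkPath) {
    fs.unlinkSync(linkPath);
}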
Another alternative to reversing the C1<->C2 links and using FS hard links is to use a multi-phase commit, which you can implement in any way you want.
Disclaimer: the mechanisms I described should work but might contain some cases that I missed. I didn't try exactly this approach myself, but I have used a similar "transactional" file deletion scheme in the past successfully. So I think such a solution will work, but it requires good testing and thinking it through with all possible scenarios.
UPDATE 2
Given all the constraints, you will have to implement either a multi-stage commit or some sort of locking/transaction mechanism. You could also order all your operations through a task queue, which will naturally be free of race conditions (synchronous). All of these mechanisms will slow the system down a bit, but you can pick the granularity level of a C2 document id, which is not so bad I suppose. Thus you'll still be able to run things in parallel with isolation at the C2 id level.
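A minimal sketch of what locking at the C2-id granularity could look like in the shell; the "locks" collection and field names are my own illustration, not something that ships with MongoDB, and crash recovery (e.g. expiring stale locks via a TTL index on acquiredAt) is left out.
// Both the C1 insert path and the batched delete job would call acquireLock()
// for the relevant C2 id before touching C1, C2 or R, and releaseLock() afterwards.
function acquireLock(c2Id, owner) {
    var res = db.locks.insert({ _id: c2Id, owner: owner, acquiredAt: new Date() });
    return res.nInserted === 1;   // a duplicate-key write error means someone else holds the lock
}
function releaseLock(c2Id, owner) {
    db.locks.remove({ _id: c2Id, owner: owner });
}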
One of the simple practical approaches is to use a message bus/queue.
If you are not using sharding, you can use TokuMX instead of MongoDB, which has support for multi-document, multi-statement transactions with atomic commit and rollback, among other nice things. These transactions work across collections and databases, so they seem like they would work for your application without many changes.
There is a full tutorial here.
Disclaimer: I am an engineer at Tokutek
Alek,
Have you considered moving the relationships to a different collection? You could have a collection that maps all the relationships from C1 to C2. Each document can also store a boolean indicating that it is marked for deletion. You can write a background task that periodically scans this collection and looks for mappings whose documents have been deleted. The advantage of this model is that it is easy to detect when the collections are out of sync.
E.g.
{
C1_ID,
[C2_ID_1, C2_ID_2....],
true/false
}
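A rough sketch of what that background sweep could look like; the collection and field names here are only placeholders for the schema above.
// For every mapping whose C1 document was deleted, remove any C2 ids that no
// other live mapping still references, then drop the mapping itself.
db.c1_c2_map.find({ deleted: true }).forEach(function (m) {
    m.c2_ids.forEach(function (c2Id) {
        var stillReferenced = db.c1_c2_map.count({ c2_ids: c2Id, deleted: false }) > 0;
        if (!stillReferenced) {
            db.C2.remove({ _id: c2Id });
            // the REST delete for the associated R resource would go here as well
        }
    });
    db.c1_c2_map.remove({ _id: m._id });
});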
Related
I removed some documents in my last query by mistake. Is there any way to roll back my last query on the Mongo collection?
Here is my last query:
db.foo.remove({ "name" : "some_x_name"})
Is there any rollback/undo option? Can I get my data back?
There is no rollback option (rollback has a different meaning in a MongoDB context), and strictly speaking there is no supported way to get these documents back - the precautions you can/should take are covered in the comments. With that said however, if you are running a replica set, even a single node replica set, then you have an oplog. With an oplog that covers when the documents were inserted, you may be able to recover them.
The easiest way to illustrate this is with an example. I will use a simplified example with just 100 deleted documents that need to be restored. To go beyond this (huge number of documents, or perhaps you wish to only selectively restore etc.) you will either want to change the code to iterate over a cursor or write this using your language of choice outside the MongoDB shell. The basic logic remains the same.
First, let's create our example collection foo in the database dropTest. We will insert 100 documents without a name field and 100 documents with an identical name field so that they can be mistakenly removed later:
use dropTest;
for(i=0; i < 100; i++){db.foo.insert({_id : i})};
for(i=100; i < 200; i++){db.foo.insert({_id : i, name : "some_x_name"})};
Now, let's simulate the accidental removal of our 100 name documents:
> db.foo.remove({ "name" : "some_x_name"})
WriteResult({ "nRemoved" : 100 })
Because we are running a replica set, we still have a record of these documents being inserted in the oplog, and thankfully those inserts have not (yet) fallen off the end of the oplog (remember, the oplog is a capped collection). Let's see if we can find them:
use local;
db.oplog.rs.find({op : "i", ns : "dropTest.foo", "o.name" : "some_x_name"}).count();
100
The count looks correct; we still seem to have our documents. I know from experience that the only piece of the oplog entry we will need here is the o field, so let's add a projection to only return that (output snipped for brevity, but you get the idea):
db.oplog.rs.find({op : "i", ns : "dropTest.foo", "o.name" : "some_x_name"}, {"o" : 1});
{ "o" : { "_id" : 100, "name" : "some_x_name" } }
{ "o" : { "_id" : 101, "name" : "some_x_name" } }
{ "o" : { "_id" : 102, "name" : "some_x_name" } }
{ "o" : { "_id" : 103, "name" : "some_x_name" } }
{ "o" : { "_id" : 104, "name" : "some_x_name" } }
To re-insert those documents, we can just store them in an array, then iterate over the array and insert the relevant pieces. First, let's create our array:
var deletedDocs = db.oplog.rs.find({op : "i", ns : "dropTest.foo", "o.name" : "some_x_name"}, {"o" : 1}).toArray();
> deletedDocs.length
100
Next we remind ourselves that we only have 100 docs in the collection now, then loop over the 100 inserts, and finally revalidate our counts:
use dropTest;
db.foo.count();
100
// simple for loop to re-insert the relevant elements
for (var i = 0; i < deletedDocs.length; i++) {
db.foo.insert({_id : deletedDocs[i].o._id, name : deletedDocs[i].o.name});
}
// check total and name counts again
db.foo.count();
200
db.foo.count({name : "some_x_name"})
100
And there you have it, with some caveats:
This is not meant to be a true restoration strategy; look at backups (MMS, etc.) or delayed secondaries for that, as mentioned in the comments
It's not going to be particularly quick to query the documents out of the oplog (any oplog query is a table scan) on a large busy system.
The documents may age out of the oplog at any time (you can, of course, make a copy of the oplog for later use to give you more time)
Depending on your workload you might have to de-dupe the results before re-inserting them
Larger sets of documents will be too large for an array as demonstrated, so you will need to iterate over a cursor instead (a minimal sketch follows this list of caveats)
The format of the oplog is considered internal and may change at any time (without notice), so use at your own risk
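For the larger-than-an-array case, a minimal sketch of the cursor-based variant might look like this (same namespaces as in the example above):
// Iterate the oplog with a cursor instead of materialising everything in an array.
var oplog = db.getSiblingDB("local").oplog.rs;
var target = db.getSiblingDB("dropTest").foo;
oplog.find({ op: "i", ns: "dropTest.foo", "o.name": "some_x_name" }, { o: 1 })
     .forEach(function (entry) {
         target.insert(entry.o);   // de-dupe or upsert on _id here if needed
     });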
I understand this is a bit old, but I wanted to share something I researched in this area that may be useful to others with a similar problem.
The fact is that MongoDB does not physically delete data immediately - it only marks it for deletion. However, this is version specific, and there is currently no documentation or standardization of it that would enable a third-party tool developer (or someone in desperate need) to build a tool or write a simple script that works reliably across versions. I opened a ticket for this - https://jira.mongodb.org/browse/DOCS-5151.
I did explore one option which is at a much lower level and may need fine-tuning based on the version of MongoDB used. It is understandably too low-level for most people's liking; however, it works and can be handy when all else fails.
My approach involves directly working with the binary in the file and using a Python script (or commands) to identify, read and unpack (BSON) the deleted data.
My approach is inspired by this GitHub project (I am NOT the developer of this project). Here on my blog I have tried to simplify the script and extract a specific deleted record from a Raw MongoDB file.
Currently a record is marked for deletion as "\xee" at the start of the record. This is what a deleted record looks like in the raw db file:
'\xee\xee\xee\xee\x07_id\x00U\x19\xa6g\x9f\xdf\x19\xc1\xads\xdb\xa8\x02name\x00\x04\x00\x00\x00AAA\x00\x01marks\x00\x00\x00\x00\x00\x00#\x9f#\x00'
I replaced the first block with the size of the record which I identified earlier based on other records.
y = "3\x00\x00\x00" + x[20804:20800+51]
Finally using the BSON package (that comes with pymongo), I decoded the binary to a Readable object.
bson.decode_all(y)
[{u'_id': ObjectId('5519a6679fdf19c1ad73dba8'), u'name': u'AAA', u'marks': 2000.0}]
This BSON is now a Python object and can be dumped into a recovery collection or simply logged somewhere.
Needless to say, this or any other recovery technique should ideally be done in a staging area, on a backup copy of the database file.
I am writing a method that updates a single document in a very large MongoCollection,
and I have an index that I want the MongoCollection.Update() call to use to drastically reduce lookup time, but I can't seem to find anything like MongoCursor.SetHint(string indexName).
Is using an index on an update operation possible? If so, how?
You can create an index according to the query section of your update command.
For example if you have this collection, named data:
> db.data.find()
{ "_id" : ObjectId("5334908bd7f87918dae92eaf"), "name" : "omid" }
{ "_id" : ObjectId("5334943fd7f87918dae92eb0"), "name" : "ali" }
{ "_id" : ObjectId("53349478d7f87918dae92eb1"), "name" : "reza" }
and if you do this update query:
> db.data.update({name: 'ali'}, {name: 'Ali'})
Without any defined index, the number of scanned documents is 2:
"nscanned" : 2,
But if you define an index according to your query, here on the name field:
db.data.ensureIndex({name:1})
Now if you update it again:
> db.data.update({name: 'Ali'}, {name: 'ALI'})
MongoDB uses your index for the update, and the number of scanned documents is 1:
"nscanned" : 1,
But if you want to hint for update, you can hint it for your query:
# Assume that the index and field of it exists.
> var cursor = db.data.find({name:'ALI'}).hint({family:1})
Then use it in your update query:
> db.data.update(query=cursor, update={name: 'ALI'})
If you have already indexed your collection, update will be using the CORRECT index right away. There is no point in providing a hint (in fact, you can't hint with update).
Hint is only for debugging and testing purposes. Mongo is in most cases smart enough to automatically decide which index (if you have many of them) should be used in a particular query and it reviews its strategy from time to time.
So short answer - do nothing. If you have an index and it is useful, it will be automatically used on find, update, delete, findOne.
If you want to see if it is used - take the part of the query which searches for something and run it through find with explain.
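For example, using the data collection from the answer above, something like this (assuming the name index already exists) shows which plan the query part of the update would use:
// Run only the query part of the update through find() + explain()
// and check which index (if any) appears in the reported plan.
db.data.find({ name: "Ali" }).explain()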
Example for hellboy. This is just an example and in real life it can be more complex.
So you have a collection with docs like this: {a : int, b : timestamp}. You have 2 indexes: one on a, another on b. Now you need to do a query like "a > 5 and b is after 2014". For some reason it uses index a, which does not give you the fastest time (maybe because you have 1000 elements, most of them are bigger than 5, and only 10 are after 2014). So you decide to hint it to use the b index. Cool, it works much faster now. But your collection changes, and now it is the year 2020 and most of your documents have b bigger than 2014. So now your b index is not doing as much work. But Mongo still uses it, because you told it to.
I am looking for a good way to implement a sort key, that is completely user definable. E.g. The user is presented with a list and may sort the elements by dragging them around. This order should be persisted.
One commonly used way is to just create an ascending integer type sort field within each element:
{
"_id": "xxx1",
"sort": 2
},
{
"_id": "xxx2",
"sort": 3
},
{
"_id": "xxx3",
"sort": 1
}
While this will surely work, it might not be ideal: in case the user moves an element from the very bottom to the very top, all the indexes in-between need to be updated. We are not talking about embedded documents here, so this will cause a lot of individual documents to be updated. This might be optimised by creating initial sort values with gaps in-between (e.g. 100, 200, 300, 400). However, this creates the need for additional logic and re-sorting in case the space between two elements is exhausted.
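For illustration, moving an element to the top with the plain-integer scheme could look roughly like this (the collection name "items" is made up; the ids match the example above):
// Move the element currently at sort position 3 (xxx2) to position 1:
// every document in between has to be shifted down by one.
db.items.update({ sort: { $lt: 3 } }, { $inc: { sort: 1 } }, { multi: true });
db.items.update({ _id: "xxx2" }, { $set: { sort: 1 } });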
Another approach comes to mind: Have the parent document contain a sorted array, which defines the order of the children.
{
"_id": "parent01",
"children": ["xxx3","xxx1","xxx2"]
}
This approach would certainly make it easier to change the order, but it has its own caveats: the parent document must always keep track of a valid list of its children. As adding children will update multiple documents, this still might not be ideal. And there needs to be complex validation of the input received from the client, as the length of this list and the elements it contains may never be changed by the client.
Is there a better way to implement such a use case?
Hard to say which option is better without knowing:
How often the sort order is usually updated
Which queries you are going to run against the documents, and how often
How many documents can be sorted at a time
I'm sure you are going to do many more queries than updates, so personally I would go with the first option. It's easy to implement and it's simple, which means it's going to be robust. I understand your concerns about updating multiple documents, but the updates will be done in place; no document shifting will occur, as you don't actually change the document size. Just create a simple test: generate 1k documents, then update each of them in a loop like this
db.test.update({ '_id': arrIds[i] }, { $set: { 'sort' : i } })
You will see it will be a pretty instant operation.
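A throwaway version of that test in the shell could look like this (the collection name and document shape are arbitrary):
// Insert 1000 small documents, then time updating the sort field on each one.
var arrIds = [];
for (var i = 0; i < 1000; i++) {
    db.sorttest.insert({ _id: i, sort: i });
    arrIds.push(i);
}
var start = new Date();
for (var i = 0; i < arrIds.length; i++) {
    db.sorttest.update({ _id: arrIds[i] }, { $set: { sort: arrIds.length - i } });
}
print(arrIds.length + " updates took " + (new Date() - start) + " ms");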
I like the second option as well; from a programming perspective it looks more elegant. But in practice you don't usually care much whether your update takes 10 milliseconds instead of 5 when you don't do it often (and I'm sure you don't); most applications are query oriented.
EDIT:
When you update multiple documents, even if it's an instant operation, you may run into an inconsistency issue where some documents are updated and some are not. In my case it wasn't really an issue, in fact. Let's consider an example; assume there's a list:
{ "_id" : 1, "sort" : 1 },{ "_id" : 2, "sort" : 4 },{ "_id" : 3, "sort" : 2 },{ "_id" : 4, "sort" : 3 }
so according to the sort fields the ordered ids should be 1,3,4,2. Let's say a failure happens while we want to move id=2 to the top, and we only managed to update ids 2 and 1, so we end up with the following state:
{ "_id" : 1, "sort" : 2 },{ "_id" : 2, "sort" : 1 },{ "_id" : 3, "sort" : 2 },{ "_id" : 4, "sort" : 3 }
The data is in an inconsistent state, but we can still display the list so the problem can be fixed; the id order will be 2,1,3,4 if we just order by the sort field. Why is it not a problem in my case? Because when a failure occurs the user is redirected to an error page or shown an error message, so it is obvious to him that something went wrong and he should try again. He just goes back to the page and fixes the order, which is only partially valid for him.
Just to sum it up: taking into account that it's a really rare case, plus the other benefits of the approach, I would go with it. Otherwise you will have to place everything in one document: both the elements and the array with their indexes. This might be a much bigger issue, especially when it comes to querying.
Hope it helps!
I have two MongoDB collections
promo collection:
{
"_id" : ObjectId("5115bedc195dcf55d8740f1e"),
"curr" : "USD",
"desc" : "durable bags.",
"endDt" : "2012-08-29T16:04:34-04:00",
origPrice" : 1050.99,
"qtTotal" : 50,
"qtClaimd" : 30,
}
claimed collection:
{
"_id" : ObjectId("5117c749195d62a666171968"),
"proId" : ObjectId("5115bedc195dcf55d8740f1e"),
"claimT" : ISODate("2013-02-10T16:14:01.921Z")
}
Whenever someone claims a promo, a new document is created inside the "claimedPro" collection, where proId is a (virtual) foreign key to the first (promo) collection. Every claim should also increment the counter "qtClaimd" in the "promo" collection. What's the best way to increment a value in another collection in a transactional fashion? I understand MongoDB doesn't have isolation across multiple docs.
Also, the reason why I went with the "non-embedded" approach is as follows:
A promo gets created and published to users, then claims happen in the hundreds of thousands. I didn't think it was logical to embed claims inside the promo collection given the number of writes that would happen to a single document (because Mongo resizes the promo document as it grows due to thousands of claims). The non-embedded approach keeps the promo collection unaffected and just inserts new documents into the "claims" collection. Later, while generating reports, I'll have to display "promo" details along with the "claims" details for that promo. With the non-embedded approach I'll have to first query the "promo" collection and then the "claims" collection with "proId". Also worth mentioning that there could be times when hundreds of "claims" happen simultaneously for the same "promo".
What's the best way to achieve a transactional effect with these two collections? I am using Scala, Casbah and Salat, all with Scala 2.10.
db.bar.update({ id: 1 }, { $inc: { 'some.counter': 1 } });
Just look at how to run this with SalatDAO; I'm not a Play user, so I wouldn't want to give you wrong advice about that. $inc is the Mongo way to increment.
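Applied to the collections in the question (the collection names here just follow the snippet labels above, and the two statements are still not atomic as a pair; only the $inc on the single promo document is atomic):
// proId is assumed to hold the ObjectId of the promo being claimed
db.claimed.insert({ proId: proId, claimT: new Date() });
db.promo.update({ _id: proId }, { $inc: { qtClaimd: 1 } });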
Let's say I have a collection of documents such as:
{ "_id" : 0 , "owner":0 "name":"Doc1"},{ "_id" : 1 , "owner":1, "name":"Doc1"}, etc
And, on the other hand the owners are represented as a separate collection:
{ "_id" : 0 , "username":"John"}, { "_id" : 1 , "username":"Sam"}
How can I make sure that, when I insert a document, it references the user correctly? In an old-school RDBMS this could easily be done using a foreign key.
I know that I can check the correctness of the insertion from my business code, BUT what if an attacker tampers with my request to the server and puts "owner" : 100? Mongo doesn't throw any exception back.
I would like to know how this situation should be handled in a real-word application.
Thank you in advance!
MongoDB doesn't have foreign keys (as you have presumably noticed). Fundamentally the answer is therefore, "Don't let users tamper with the requests. Only let the application insert data that follows your referential integrity rules."
MongoDB is great in lots of ways... but if you find that you need foreign keys, then it's probably not the correct solution to your problem.
To answer your specific question - while MongoDB encourages handling foreign-key relationships on the client side, they also provide the idea of "Database References" - See this help page.
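For reference, a DBRef-shaped field on the document from the question would look roughly like this (the "owners" collection name is just an illustration):
{ "_id" : 0, "name" : "Doc1", "owner" : { "$ref" : "owners", "$id" : 0 } }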
That said, I don't recommend using a DBRef. Either let your client code manage the associations or (better yet) link the documents together from the start. You may want to consider embedding the owner's "documents" inside the owner object itself. Assemble your documents to match your usage patterns and MongoDB will shine.
This is a one-to-one relationship. It's better to embed one document in another instead of maintaining separate collections. Check here on how to model them in MongoDB and their advantages.
Although it's not explicitly mentioned in the docs, embedding gives you the same effect as foreign key constraints. I just want to make this idea clear. When you have two collections like this:
C1:
{ "_id" : 0 , "owner":0 "name":"Doc1"},{ "_id" : 1 , "owner":1, "name":"Doc1"}, etc
C2:
{ "_id" : 0 , "username":"John"}, { "_id" : 1 , "username":"Sam"}
And if you were to declare a foreign key constraint on C2._id referencing C1._id (assuming MongoDB allowed it), it would mean that you cannot insert a document into C2 whose _id does not exist in C1. Compare this with an embedded document:
{
"_id" : 0 ,
"owner" : 0,
"name" : "Doc1",
"owner_details" : {
"username" : "John"
}
}
Now the owner_details field represents the data from the C2 collection, and the remaining fields represent the data from C1. You can't add an owner_details field to a non-existent document. You're essentially achieving the same effect.
This question was originally answered in 2011, so I decided to post an update here.
Starting with MongoDB 4.0 (released in June 2018), multi-document ACID transactions are supported.
Relations can now be modeled in two ways:
Embedded
Referenced (NEW!)
You can model a referenced relationship like so:
{
"_id":ObjectId("52ffc33cd85242f436000001"),
"contact": "987654321",
"dob": "01-01-1991",
"name": "Tom Benzamin",
"address_ids": [
ObjectId("52ffc4a5d85242602e000000")
]
}
where the sample address document looks like this:
{
"_id":ObjectId("52ffc4a5d85242602e000000"),
"building": "22 A, Indiana Apt",
"pincode": 123456,
"city": "Los Angeles",
"state": "California"
}
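Since multi-document transactions were mentioned above, here is a minimal shell sketch of one (MongoDB 4.0+ on a replica set; the database name "test" and the specific writes are purely illustrative):
// Start a session, run both writes inside one transaction, and commit or abort together.
var session = db.getMongo().startSession();
var users = session.getDatabase("test").users;
var addresses = session.getDatabase("test").addresses;
session.startTransaction();
try {
    var addressId = ObjectId();
    addresses.insertOne({ _id: addressId, city: "Los Angeles", state: "California" });
    users.updateOne({ name: "Tom Benzamin" }, { $push: { address_ids: addressId } });
    session.commitTransaction();   // both writes become visible atomically
} catch (e) {
    session.abortTransaction();    // neither write is applied
    throw e;
} finally {
    session.endSession();
}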
If someone really wants to enforce foreign keys in the project/web app, then you could go with a mixed approach, i.e. SQL + NoSQL.
I would prefer that bulky data which doesn't have many references be stored in the NoSQL store, e.g. hotel or place data.
But for more serious things, like OAuth module tables (TokenStore, UserDetails, UserRole mapping tables, etc.), you can go with SQL.
I would also recommend that if usernames are unique, you use them as the _id. You will save on an index. In the document being stored, have the application set the value of 'owner' to the value of 'username' when the document is created, and never let any other piece of code update it.
If there are requirements to change the owner, then provide appropriate APIs with business rules implemented.
There wouldn't be any need for foreign keys.