Let's say I have a collection of documents such as:
{ "_id" : 0 , "owner":0 "name":"Doc1"},{ "_id" : 1 , "owner":1, "name":"Doc1"}, etc
And, on the other hand the owners are represented as a separate collection:
{ "_id" : 0 , "username":"John"}, { "_id" : 1 , "username":"Sam"}
How can I make sure that, when I insert a document, it references the user correctly? In an old-school RDBMS this could easily be done using a foreign key.
I know that I can check the correctness of the insertion from my business code, BUT what if an attacker tampers with the request to my server and puts "owner" : 100? Mongo doesn't throw any exception back.
I would like to know how this situation should be handled in a real-world application.
Thank you in advance!
MongoDB doesn't have foreign keys (as you have presumably noticed). Fundamentally the answer is therefore, "Don't let users tamper with the requests. Only let the application insert data that follows your referential integrity rules."
MongoDB is great in lots of ways... but if you find that you need foreign keys, then it's probably not the correct solution to your problem.
To answer your specific question - while MongoDB encourages handling foreign-key relationships on the client side, they also provide the idea of "Database References" - See this help page.
That said, I don't recommend using a DBRef. Either let your client code manage the associations or (better yet) link the documents together from the start. You may want to consider embedding the owner's "documents" inside the owner object itself. Assemble your documents to match your usage patterns and MongoDB will shine.
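For illustration, the application-side check could look something like this in the shell (a minimal sketch only; the collection names docs and owners are assumptions, and the check-then-insert pair is not atomic without a transaction):
// refuse the write unless the owner referenced by the (untrusted) request exists
var ownerId = 100; // value taken from the incoming request
if (db.owners.findOne({"_id": ownerId}) !== null) {
    db.docs.insert({"owner": ownerId, "name": "Doc1"});
} else {
    throw new Error("unknown owner: " + ownerId);
}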
This is a one-to-one relationship. It's better to embed one document in the other instead of maintaining separate collections. Check here for how to model them in MongoDB and their advantages.
Although it's not explicitly mentioned in the docs, embedding gives you the same effect as foreign key constraints. Just to make this idea clear: when you have two collections like these:
C1:
{ "_id" : 0 , "owner":0 "name":"Doc1"},{ "_id" : 1 , "owner":1, "name":"Doc1"}, etc
C2:
{ "_id" : 0 , "username":"John"}, { "_id" : 1 , "username":"Sam"}
And if you were to declare a foreign key constraint on C1.owner referencing C2._id (assuming MongoDB allowed it), it would mean that you cannot insert a document into C1 whose owner does not exist in C2. Compare this with an embedded document:
{
    "_id" : 0,
    "owner" : 0,
    "name" : "Doc1",
    "owner_details" : {
        "username" : "John"
    }
}
Now the owner_details field represents the data from the C2 collection, and the remaining fields represent the data from C1. You can't add an owner_details field to a non-existent document. You're essentially achieving the same effect.
This question was originally answered in 2011, so I decided to post an update here.
Starting with MongoDB 4.0 (released in June 2018), multi-document ACID transactions are supported.
Relations can now be modeled in two ways:
Embedded
Referenced (NEW!)
You can model a referenced relationship like so:
{
    "_id" : ObjectId("52ffc33cd85242f436000001"),
    "contact" : "987654321",
    "dob" : "01-01-1991",
    "name" : "Tom Benzamin",
    "address_ids" : [
        ObjectId("52ffc4a5d85242602e000000")
    ]
}
where the corresponding address document looks like this:
{
    "_id" : ObjectId("52ffc4a5d85242602e000000"),
    "building" : "22 A, Indiana Apt",
    "pincode" : 123456,
    "city" : "Los Angeles",
    "state" : "California"
}
If you really want to enforce foreign keys in your project/web app, then you should go with a mixed approach, i.e. SQL + NoSQL.
Bulky data that doesn't carry many references can be stored in the NoSQL store, e.g. hotel or place data.
But for serious things like OAuth module tables, TokenStore, UserDetails, UserRole (mapping table), etc., go with SQL.
I would also recommend that if usernames are unique, you use them as the _id. You will save on an index. In the document being stored, set the value of 'owner' in the application to the value of 'username' when the document is created, and never let any other piece of code update it.
If there is a requirement to change the owner, provide appropriate APIs with the business rules implemented.
There wouldn't be any need for foreign keys.
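A minimal sketch of that scheme (collection names are assumptions):
// _id is unique anyway, so the username needs no extra index
db.users.insert({"_id": "John"});
// 'owner' is copied from the authenticated username at creation time only
db.docs.insert({"owner": "John", "name": "Doc1"});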
Related
There is a lot of content about what kind of relationships to use in a database schema. However, I have not seen anything about mixing both techniques.
The idea is to embed only the necessary attributes and, with them, a reference. This way the application has the data it needs for rendering plus the reference for the update methods.
The problem I see here is that the logic for handling any CRUD operation becomes trickier, because it is mandatory to update multiple collections; on the other hand, I get all the information in a single read.
Basic schema for a page that only wants the students names of a classroom:
CLASSROOM COLLECTION
{
    "_id" : ObjectId(),
    "students" : [
        {
            "studentId" : ObjectId(),
            "name" : "John Doe"
        },
        ...
    ]
}
STUDENTS COLLECTION
{
    "_id" : ObjectId(),
    "name" : "John Doe",
    "address" : "...",
    "age" : "...",
    "gender" : "..."
}
I use the students collection on a different page, and there I do not want any information about the classroom. That is the reason not to embed the students.
I started learning Mongo a few days ago and I don't know if this kind of schema brings any problems.
You can embed some fields and store other fields in a different collection as you are suggesting.
The issues with such an arrangement in my opinion would be:
What is the authority for a field? For example, what if a field like name is both embedded and stored in the separate collection, and the values differ?
Both updating and querying become awkward as you need to do it differently depending on which field is being worked with. If you make a mistake and go in the wrong place, you create/compound the first issue.
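For illustration, here is roughly what the dual update looks like when a student is renamed (a sketch; assumes MongoDB 3.6+ for arrayFilters, and the collection names classroom and students follow the question's labels):
var studentId = ObjectId("52ffc33cd85242f436000001"); // example id
// 1) update the authoritative copy in the students collection
db.students.updateOne({"_id": studentId}, {$set: {"name": "Jane Doe"}});
// 2) update every embedded copy in the classroom collection
db.classroom.updateMany(
    {"students.studentId": studentId},
    {$set: {"students.$[s].name": "Jane Doe"}},
    {arrayFilters: [{"s.studentId": studentId}]}
);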
I'm following this doc:
http://docs.doctrine-project.org/projects/doctrine-mongodb-odm/en/latest/tutorials/getting-started.html
and
http://symfony.com/doc/current/bundles/DoctrineMongoDBBundle/index.html
When I save my document, I end up with two collections, and the main document looks like this:
{
    "_id" : ObjectId("5458e370d16fb63f250041a7"),
    "name" : "A Foo Bar",
    "price" : 19.99,
    "posts" : [
        {
            "$ref" : "Embedd",
            "$id" : ObjectId("5458e370d16fb63f250041a8"),
            "$db" : "test_database"
        }
    ]
}
I'd like to have:
{
    "_id" : ObjectId("5458e370d16fb63f250041a7"),
    "name" : "A Foo Bar",
    "price" : 19.99,
    "posts" : [
        {
            "mycomment" : "dsdsds",
            "date" : date
        }
    ]
}
I want to denormalize my data. How can I do it?
Can I use methods like $push, $addToSet, etc. of MongoDB?
Thanks
Doctrine ODM supports both references and embedded documents.
In your first example, you're using references. The main document (let's assume it's called Product) references many Post documents. Those Post documents live in their own collection (for some reason this is named Embedd -- I would suggest renaming that if you keep this schema). By default, ODM uses the DBRef convention for references, so each reference is itself a small embedded document with $ref, $id, and $db fields.
Denormalization can be achieved by using embedded documents (an #EmbedMany mapping in your case). If you were embedding a Post document, the Post class should be mapped as an #EmbeddedDocument. This tells ODM that it's not a first-class document (belonging to its own collection), so it won't have to worry about tracking it by _id and the like (in fact, embedded documents won't even need identifiers unless you want to map one).
My rule of thumb for deciding to embed or references has generally been asking myself, "Will I need this document outside of the context of the parent document?" If a Post will not have an identity outside of the Product record, I'm comfortable embedding it; however, if I find later that my application also wants to show users a list of all of their Posts, or that I need to query by Posts (e.g. a feed of all recent Posts, irrespective of Product), then I may want to reference documents in a Posts collection (or simply duplicate embedded Posts as needed).
Alternatively, you may decide that Posts should exist in both their own collection and be embedded on Product. In that case, you can create an AbstractPost class as a #MappedSuperclass and define common fields there. Then, extend this with both Post and EmbeddedPost sub-classes (mapped accordingly). You'll be responsible for creating some code to generate an EmbeddedPost from a Post document, which will be suitable for embedding in the Product.posts array. Furthermore, you'll need to handle data synchronization between the top-level and embedded Posts (e.g. if someone edits a Post comment, you may want all the corresponding embedded versions updated as well).
On the subject of references: ODM also supports a simple option for reference mappings, in which case it will just store the referenced document's _id instead of the larger DBRef object. In most cases, having DBRef store the collection and database name for each referenced document is quite redundant; however, DBRef is actually useful if you're using single-collection inheritance, as ODM uses the object to store extra discriminator information (i.e. the class of the referenced object).
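For illustration, the two stored forms differ roughly like this (a sketch reusing the ids from the question; the exact mapping option name depends on your ODM version):
// default DBRef form
"posts" : [ { "$ref" : "Embedd", "$id" : ObjectId("5458e370d16fb63f250041a8"), "$db" : "test_database" } ]
// simple reference form: just the _id
"posts" : [ ObjectId("5458e370d16fb63f250041a8") ]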
I have a distributed application that uses MongoDB as a backend. The application has two collections (C1 and C2) with an M:1 relationship, so that if I delete a document in C1, I need to search C1 for any other documents that point to the same doc in C2, and if there are no matches, delete the related doc in C2.
This obviously has a race-condition problem: while the search is going on, new documents pointing to the soon-to-be-deleted C2 document could be inserted into C1, resulting in DB inconsistency. Deletes can be delayed such that they could be batched up and performed, say, once a week during low load, so I'm considering writing a distributed locking system for Mongo to solve the RC problem.
Questions:
Is there a better approach than distributed locking?
Does software like this already exist for Mongo? I've seen single-document examples of locks, but not database-level distributed locks.
UPDATE
I left this out to avoid confusing the issue, but I need to include it now. There's actually another resource (R) (essentially a file on a remote disk) that needs to be deleted along with the C2 document, and C2:R is M:1. R is completely outside the MongoDB ecosystem, which is why my mind jumped to locking the application so I can safely delete all this stuff. Hence the reversing-links idea mentioned below won't work for this case. Yes, the system is complicated, and no, it can't be changed.
UPDATE2
My attempts to abstract away implementation details to keep the question succinct keep biting me. Another detail: R is manipulated via REST calls to another server.
1.
This type of problem is usually solved by embedding. Essentially, C1 and C2 could be a single collection, with the C2 doc embedded in C1. Obviously this is not always possible or desirable, and one of the downsides is data duplication. Another downside is that you would not be able to find all C2s without going through all C1s, and given the M:1 relationship that is not always a good thing to do. So it depends on whether these cons are a real problem in your application.
2.
Another way to handle it would be to just remove the links from C1 to C2, leaving the C2 documents to exist with no links. This can have a low cost in some cases.
3.
Use Two Phase Commit similar to as described here: http://docs.mongodb.org/manual/tutorial/perform-two-phase-commits/.
4.
Yet another option is to reverse your links. C2 would hold an array of links pointing to C1s. Each time you delete a C1, you $pull the link to the deleted C1 from that array. Immediately afterwards you delete from C2 with the condition that the array of links is empty and the _id is the one you got back from the update. If a race condition happens (you insert a new document into C1, try to update C2, and the result says nothing was updated), you can either fail your insert or try to insert a new C2. Here is an example:
// Insert first doc
db.many.insert({"name": "A"});
// Find it to get an ID to show.
db.many.find();
{ "_id" : ObjectId("52eaf9e05a07ef0270a9eccc"), "name" : "A" }
// lets add a tag to it right after
db.one.update({"tag": "favorite"}, {$addToSet: {"links": ObjectId("52eaf9e05a07ef0270a9eccc")}}, {upsert: true, multi: false});
// show that tag was created and a link was added
db.one.find();
{ "_id" : ObjectId("52eafaa77365653791085540"), "links" : [ ObjectId("52eaf9e05a07ef0270a9eccc") ], "tag" : "favorite" }
// Insert one more doc which will not be tagged just for illustration
db.many.insert({"name": "B"});
// Insert last document, show IDs of all docs and tag the last document inserted:
db.many.insert({"name": "C"});
db.many.find();
{ "_id" : ObjectId("52eaf9e05a07ef0270a9eccc"), "name" : "A" }
{ "_id" : ObjectId("52eafab95a07ef0270a9eccd"), "name" : "B" }
{ "_id" : ObjectId("52eafac85a07ef0270a9ecce"), "name" : "C" }
db.one.update({"tag": "favorite"}, {$addToSet: {"links": ObjectId("52eafac85a07ef0270a9ecce")}}, {upsert: true, multi: false});
// Now we have 2 documents tagged out of 3
db.one.find();
{ "_id" : ObjectId("52eafaa77365653791085540"), "links" : [ ObjectId("52eaf9e05a07ef0270a9eccc"), ObjectId("52eafac85a07ef0270a9ecce") ], "tag" : "favorite" }
// START DELETE PROCEDURE
// Let's delete first tagged document
db.many.remove({"_id" : ObjectId("52eaf9e05a07ef0270a9eccc")});
// remove the "dead" link
db.one.update({"tag": "favorite"}, {$pull: {"links": ObjectId("52eaf9e05a07ef0270a9eccc")}});
// just to show how it looks now (link removed)
db.one.find();
{ "_id" : ObjectId("52eafaa77365653791085540"), "links" : [ ObjectId("52eafac85a07ef0270a9ecce") ], "tag" : "favorite" }
// try to delete a document that has no links - it's not the case here yet, so the doc is not deleted.
db.one.remove({"tag" : "favorite", "links": {$size: 0}});
db.one.find();
{ "_id" : ObjectId("52eafaa77365653791085540"), "links" : [ ObjectId("52eafac85a07ef0270a9ecce") ], "tag" : "favorite" }
// DELETE OF THE FIRST DOC IS COMPLETE, if any docs got added with
// links then the tag will just have more links
// DELETE LAST DOC AND DELETE UNREFERENCED LINK
db.many.remove({"_id" : ObjectId("52eafac85a07ef0270a9ecce")});
db.one.update({"tag": "favorite"}, {$pull: {"links": ObjectId("52eafac85a07ef0270a9ecce")}});
// no links are left
db.one.find();
{ "_id" : ObjectId("52eafaa77365653791085540"), "links" : [ ], "tag" : "favorite" }
db.one.remove({"tag" : "favorite", "links": {$size: 0}});
// LAST DOC WAS DELETED AND A REFERENCING DOC WAS DELETED AS WELL
// final look at remaining data
db.one.find();
// empty
db.many.find();
{ "_id" : ObjectId("52eafab95a07ef0270a9eccd"), "name" : "B" }
If the upsert happens after you delete from one, it will just create a new doc and add a link. If it happens before, the old one doc will stay and the links will be updated properly.
UPDATE
Here is one way to deal with the "delete file" requirement. It assumes you have a POSIX-compliant filesystem like ext3/ext4; many other FSs have the same property. For each C2 you create, also create a randomly named hard link pointing to the R file, and store the path to that link in the C2 doc, for example. You'll end up with multiple hard links pointing to a single file. Whenever you delete a C2, you delete its hard link. Eventually, when the link count drops to 0, the OS deletes the file. Thus the file cannot be deleted until all of its hard links have been deleted.
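A minimal Node.js sketch of that scheme (an illustration only; assumes Node 14.17+ for crypto.randomUUID and a POSIX filesystem, and all paths are placeholders):
const fs = require("fs");
const crypto = require("crypto");

// On C2 creation: make a uniquely named hard link to the R file and
// store the link's path in the C2 document.
function createLinkFor(rPath, linkDir) {
    const linkPath = linkDir + "/" + crypto.randomUUID();
    fs.linkSync(rPath, linkPath); // bumps the file's link count
    return linkPath; // persist this in the C2 doc
}

// On C2 deletion: remove only that C2's hard link. The OS frees the
// underlying file once the last link to it is gone.
function removeLinkFor(linkPath) {
    fs.unlinkSync(linkPath);
}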
Another alternative to reversing C1<->C2 links and using FS hard links is to use multiphase commit which you can implement in any way you want.
Disclaimer: the mechanisms I described should work, but might contain some cases that I missed. I haven't tried exactly this approach myself, but I have successfully used a similar "transactional" file-deletion scheme in the past. So I think such a solution will work, but it requires good testing and thinking it through with all possible scenarios.
UPDATE 2
Given all the constraints, you will have to implement either a multi-stage commit or some sort of locking/transaction mechanism. You could also order all your operations through a task queue, which is naturally free of race conditions (synchronous). All of these mechanisms will slow the system down a bit, but you can pick a granularity level of a single C2 document id, which is not so bad, I suppose. You'll then still be able to run things in parallel, with isolation at the C2 id level.
One of the simple practical approaches is to use a message bus/queue.
If you are not using sharding, you can use TokuMX instead of MongoDB, which has support for multi-document, multi-statement transactions with atomic commit and rollback, among other nice things. These transactions work across collections and databases, so they seem like they would work for your application without many changes.
There is a full tutorial here.
Disclaimer: I am an engineer at Tokutek
Alek,
Have you considered moving the relationships to a different collection? You could have a collection that maps all the relationships from C1 to C2. Each document can also store a boolean indicating that it is marked for deletion. You can write a background task that periodically scans this collection and looks for mappings whose documents have been deleted. The advantage of this model is that it is easy to detect when the collections are out of sync.
E.g.
{
    C1_ID,
    [C2_ID_1, C2_ID_2, ...],
    true/false
}
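A hypothetical background sweep over such a mapping collection (a sketch; every field and collection name here is invented for illustration):
db.c1c2map.find({"markedForDeletion": true}).forEach(function (m) {
    m.c2Ids.forEach(function (c2Id) {
        // keep the C2 doc if any live mapping still references it
        var stillUsed = db.c1c2map.count({"markedForDeletion": false, "c2Ids": c2Id});
        if (stillUsed === 0) {
            db.c2.remove({"_id": c2Id});
        }
    });
    db.c1c2map.remove({"_id": m._id});
});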
We want to design a scalable database. If we have N users with 1 billion user responses, which of the 2 options below would be a good design? We want to query based on userID as well as responseID.
Having 2 collections, one for the user information and another to store the responses along with the user ID. Each response is stored as a separate document, so we will have 1 billion documents.
User Collection
{
    "userid" : "userid1",
    "password" : "xyz",
    ...
    "City" : "New York"
},
{
    "userid" : "userid2",
    "password" : "abc",
    ...
    "City" : "New York"
}
responses Collection
{
    "userid" : "userid1",
    "responseID" : "responseID1",
    "response" : "xyz"
},
{
    "userid" : "userid1",
    "responseID" : "responseID2",
    "response" : "abc"
},
{
    "userid" : "userid2",
    "responseID" : "responseID3",
    "response" : "mno"
}
Having 1 collection to store both kinds of information, as below. Each response is represented by a new key (responseIDX).
{
    "userid" : "userid1",
    "responseID1" : "xyz",
    "responseID2" : "abc",
    ...
    "responseIDN" : "mno",
    "city" : "New York"
}
If you go with your first option, I'd use a relational database (like MySQL) as opposed to MongoDB. If you're set on MongoDB, use it to your advantage.
{
    "userId" : n,
    "city" : "foo",
    "responses" : {
        "responseId1" : "response message 1",
        "responseId2" : "response message 2"
    }
}
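For example, with that shape a new response is a single in-place update (a sketch; the collection name users is an assumption):
db.users.update(
    {"userId": 1},
    {$set: {"responses.responseId3": "response message 3"}}
);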
As for which would render a better performance, run a few benchmark tests.
Between the two options you've listed - I would think using a separate collection would scale better - or possibly a combination of a separate collection and still using embedded documents.
Embedded documents can be a boon to your schema design, but they do not work as well when you have an endlessly growing set of embedded documents (responses, in your case). This is because of document growth: as the document grows and outgrows the amount of space allocated for it on disk, MongoDB must move it to a new location to accommodate the new size. That can be expensive and carry severe performance penalties when it happens often or in high-concurrency environments.
Also, querying on those embedded documents can become troublesome when you want to selectively return only a subset of responses, especially across users. That is, you cannot return only the matching embedded documents; using the positional operator, it is however possible to get the first matching embedded document.
So, I would recommend using a separate collection for the responses.
Though, as mentioned above, I would also suggest experimenting with other ways to group those responses in that collection: a document per day, per user, per whatever other dimensions you might have, etc.
Group them in ways that allow multiple embedded documents and complement how you will query for them. If you can find the sweet spot between still using embedded documents in that collection and minimizing document growth, you'll have fewer overall documents and smaller index sizes. Obviously this requires benchmarking and testing, as the same caveats listed above can apply.
Lastly (and optionally), with that type of data set, consider using increment counters where you can on the front end to supply any aggregated reporting you might need down the road. Though the Aggregation Framework in MongoDB is great, having, say, a user's total response count pre-aggregated is far more convenient than trying to get it by running an aggregate query over the full dataset.
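For example, a counter can be pre-aggregated at write time (a minimal sketch; the responseCount field is an assumption):
// store the response, then bump the user's pre-aggregated total
db.responses.insert({"userid": "userid1", "responseID": "responseID4", "response": "pqr"});
db.users.update({"userid": "userid1"}, {$inc: {"responseCount": 1}});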
We are scraping a huge products website.
So we will fetch and persist a great many products, and almost every product has a different set of features/details.
Naturally, we are considering a NoSQL database (MongoDB) for this job. We will make a collection "products", with a document for each product where each key/value pair maps to a detail_name/detail_description of the product.
Since products are quite different, we have almost no idea what the product details/features are. In other words, we have no knowledge of the available keys.
According to this link, MongoDB case insensitive key search, this is a "gap" for MongoDB (that we have no idea of the available keys).
Is this true? If yes, what are the alternatives?
Your key problem isn't that much of an issue for MongoDB, provided you can live with a slightly different schema and big indexes.
Normally you would do something like:
{
    productId : ...,
    details : {
        detailName1 : detailValue1,
        detailName2 : detailValue2
    }
}
But if you instead do the following, you can index the details field:
{
    productId : ...,
    details : [
        { field : detailName1, value : detailValue1 },
        { field : detailName2, value : detailValue2 }
    ]
}
Do note that this will result in a very big index. Not necessarily a problem, but something to be aware of. The index would then be {details.field: 1, details.value: 1} (or just {details: 1} if you're not adding additional fields per detail).
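For example (a sketch; the collection name products is an assumption):
db.products.createIndex({"details.field": 1, "details.value": 1});
// find every product with a given detail, without knowing the key in advance
db.products.find({"details": {$elemMatch: {"field": "color", "value": "red"}}});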
Once you've scraped all of the data you could examine it to determine if there is a field/set of fields in the documents that you could add an index to in order to improve performance.