I am trying to figure out how to best implement this for my system...and get my head out of the RDBMS space for now...
A part of my current DB has three tables: Show, ShowEntry, and Entry. Basically ShowEntry is a many-to-many joining table between Show and Entry. In my RDBMS thinking it's quite logical since any changes to Show details can be done in one place, and the same with Entry.
What's the best way to reflect this in a document-based storage? I'm sure there is no one way of doing this but I can't help but think if document-based storage is appropriate for this case at all.
FYI, I am currently considering implementing RavenDB. While discussion of general NoSQL design would be good, a more RavenDB-focused one would be fantastic!
Thanks,
D.
When modelling a many-to-many relationship in a document database, you usually store a collection of foreign keys in just one of the documents. The document you choose largely depends on the direction you intend to traverse the relationship. Traversing it one way is trivial, traversing it the other way requires an index.
Take the shopping basket example. It's more important to know exactly which items are in a particular basket than which baskets contain a particular item. Since we're usually following the relationship in the basket-to-item direction, it makes more sense to store item IDs in a basket than it does to store basket IDs in an item.
You can still traverse the relationship in the opposite direction (e.g. find baskets containing a particular item) by using an index, but the index will be updated in the background so it won't always be 100% accurate. (You can wait for the index to become accurate with WaitForNonStaleResults, but that delay will show in your UI.)
If you require immediate 100% accuracy in both directions, you can store foreign keys in both documents, but your application will have to update two documents whenever a relationship is created or destroyed.
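To make that concrete, here is a minimal sketch of what the two documents might look like; the field names and id formats are illustrative assumptions, not an exact RavenDB schema.

// Basket document -- owns the relationship, stores the ids of the items it contains
{
    "Id": "baskets/123",
    "ItemIds": ["items/1", "items/42", "items/7"]
}

// Item document -- knows nothing about baskets; "which baskets contain item 42?"
// is answered by an index over the baskets' ItemIds field
{
    "Id": "items/42",
    "Name": "Coffee mug",
    "Price": 7.99
}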
This went a long way towards solving my question!
Answer to the question
Many-to-many relationships in NoSQL are implemented via an array of references on one of the entities.
You've got two options:
Show has an array of Entry items;
Entry has an array of Shows.
The location of the array is determined by the most common direction of querying. To resolve records in the other direction, index the array (in RavenDB this works like a charm).
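Applied to the Show/Entry model from the question, option 1 could look roughly like this (names, values and id formats are illustrative assumptions):

// Show document -- carries the relationship as an array of Entry ids
{
    "Id": "shows/1",
    "Name": "Spring Show",
    "EntryIds": ["entries/10", "entries/11"]
}

// Entry document -- holds no back-references; "which shows contain this entry?"
// is answered by an index over the shows' EntryIds field
{
    "Id": "entries/10",
    "Title": "Entry number ten"
}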
Usually, having two arrays on both entities pointing to each other brings more grief than joy. You're losing the single source of truth in an eventually consistent environment... it has potential for breaking data integrity.
Check out this article - Entity Relationships in NoSQL (one-to-many, many-to-many). It covers entity relationships from various angles, taking into account performance, operational costs, time/costs of development and maintenance... and provides examples for RavenDB.
Related
I'm new to NoSQL and I'm trying to figure out the best way to model my database. I'll be using ArangoDB in the project but I think this question also stands if using MongoDB.
The database will store 12 categories of products. Each category is expected to hold hundreds or thousands of products. Products will also be added / removed constantly.
There will be a number of common fields across all products, but each category will also have unique fields / different restrictions to data.
Keep in mind that there are instances where I'd need to query all the categories at the same time, for example to search a product across all categories, and other instances where I'll only need to query one category.
Should I create one single collection "Product" and use a field to indicate the category, or create a separate collection for each category?
I've read many questions related to this idea (1 collection vs many) but I haven't been able to reach a conclusion, other than "it depends".
So my question is: in this specific use case, which option would be optimal in terms of performance and speed, multiple collections or a single collection + sharding?
Any help would be appreciated.
As you mentioned, you need to play with your data and use case; that will give you a better picture.
Some decisions are required, as below.
Decide the number of documents you will have in the near future. If you expect 1M documents within a year, then test with at least 3M.
Decide the number of indices required.
Decide the number of writes, reads per second.
Decide the size of documents per category.
Decide the query pattern.
Some inputs based on the requirements
If you have many writes combined with many indices, a single monolithic collection will be slower, since every index needs to be updated on each write.
As you have a different set of fields per category, you could try multiple collections.
There is $unionWith to combine data from multiple collections (see the sketch below), but do check its performance; it depends heavily on the decisions above. Note this open issue as well.
If you decide to go with a monolithic collection, defer sharding. Implement it once you find that queries are getting slow.
If you have many writes to the same document, the writes will be executed sequentially, which will slow down your reads as well.
Think about reclaiming disk space when a lot of data is deleted from a collection; multiple collections do better here.
The point that pushes me toward a monolithic collection is your need to query all the categories at the same time. You may need to add more categories later, and combining results from many collections into a single response will not perform well.
As you don't really have a join use case like in an RDBMS, you can go with a single monolithic collection from a modelling point of view. I doubt you even have a join key.
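A rough sketch of the $unionWith approach mentioned above, assuming one collection per category; the collection names (electronics, furniture) and the search field are placeholders.

// Cross-category search over two per-category collections (requires MongoDB 4.4+)
db.electronics.aggregate([
    { $match: { name: /chair/i } },
    { $unionWith: {
        coll: "furniture",
        pipeline: [ { $match: { name: /chair/i } } ]
    } }
])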
If any of my points are incorrect, please let me know.
To SQL or to NoSQL?
I think that before you implement this in NoSQL, you should ask yourself why you are doing that. I quite like NoSQL but some data is definitely a better fit to that model than others.
The data you are describing is a classic case for a relational SQL DB. That's fine if it's a hobby project and you want to try NoSQL, but if this is for a production environment or client, you are likely making the situation more difficult for them.
Relational or non-relational?
You mention common fields across all products. If you wish to update these fields and have those updates reflected in all products, then you have relational data.
Background
It may be worth reading Sarah Mei's 2013 article about this. Skip to the section "How MongoDB Stores Data" and read from there. Warning: the article is called "Why You Should Never Use MongoDB" and is (perhaps intentionally) somewhat biased against Mongo, so it's important to read it through the correct lens. The message you should take from the article is that MongoDB is not a good fit for every data type.
Two strategies for handling relational data in Mongo:
every time you update one of these common fields, update every product's document with the new common field data. This is generally only OK if updates are rare or documents are few; it breaks down when you have lots of both.
use references and do joins.
In Mongo, joins typically happen code-side, i.e. multiple db calls (see the sketch below)
In Arango (and in other graph dbs, as well as some key-value stores), the joins happen db-side (single db call)
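For illustration, a code-side "join" in the Mongo shell is just two round trips; the collection and field names here are assumptions.

// 1) fetch the product, which only stores a reference to its category document
const product = db.products.findOne({ sku: "ABC-123" })

// 2) fetch the referenced document in a second call -- the "join" happens in application code
const category = db.categories.findOne({ _id: product.categoryId })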
Decisions
These are important factors to consider when deciding which DB to use and how to model your data:
I've used MongoDB, ArangoDB and Neo4j.
Mongo definitely has the best tooling and it's easy to find help, but I don't believe it's a good fit in this case
Arango is quite pleasant to work with, but doesn't yet have the adoption that it deserves
I wouldn't recommend Neo4j to anyone looking for a NoSQL solution, as its nodes and relations only support flat properties (no nesting, so not real documents)
It may also be worth considering MariaDB or Postgres
I started reading up on MongoDB (which got me very excited), but as I understand it, one of its flaws is the self-explanatory lack of relations, especially when it comes to many-to-many relationships that are large or ever-growing on both sides.
From what I've read, the way to deal with ever-growing arrays inside a document is either to avoid them by creating buckets of documents and then referencing the buckets (which does not guarantee total prevention of overgrowth), or to have both documents reference a third, many-to-many document.
Since I could not find a definitive answer to this dilemma, or at least one that isn't a few years old, could someone explain whether this is a dead end (given that the project uses a few big, ever-growing many-to-many relationships) and I should switch to an RDBMS?
It depends on your use case.
The main question is: do you actually know why you want to use MongoDB in the first place? Hopefully the reason is not just the trend. RDBMSs are still relevant and have their own use cases; for some applications an RDBMS is the way to go, for others it isn't.
Now back to your original question about many-to-many relations. As you have already researched, there are ways to model those relationships in MongoDB, so that doesn't disqualify MongoDB as a database on its own. For example, do you need transactionality or referential integrity checks when you insert or delete records for those many-to-many relationships? If the answer is yes, then MongoDB may not be the perfect fit for your case.
When I first started working with MongoDB this exact question crossed my mind, and while searching for the answer I read something very interesting (I wish I still had the link for you, but unfortunately I don't).
Think of a real-world problem where you have a many-to-many relation that just keeps on growing; such cases are quite exceptional.
Let's say many students are registered for many courses. A course may be registered by 100 students, but a student is very unlikely to register for 100 courses, so you can simply keep an array field of registered course IDs in the student collection.
Now let's dig deeper and say there are a bunch of super-brilliant students who actually registered for 100 courses; in such a scenario an array field may not be a viable solution. Then what? How about a collection that just holds student_id and course_id pairs? That exists in the RDBMS world too.
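Both variants, sketched in Mongo shell syntax; the collection names, ids and index choices are illustrative assumptions.

// Variant 1: array of course ids embedded in the student document
db.students.insertOne({
    _id: "s1",
    name: "Alice",
    courseIds: ["c101", "c205"]
})

// Variant 2: a separate collection with one document per student/course pair,
// much like a join table in an RDBMS
db.registrations.insertOne({ studentId: "s1", courseId: "c101" })
db.registrations.insertOne({ studentId: "s1", courseId: "c205" })

// Indexes so the relation can be traversed cheaply in either direction
db.registrations.createIndex({ studentId: 1 })
db.registrations.createIndex({ courseId: 1 })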
So the workarounds available should be enough to find and design an optimized solution for even the most complex scenarios.
I'm going to try and make this as straight-forward as I can.
Coming from MySQL and thinking in terms of tables, let's use the following example:
Let's say that we have a real-estate website and we're displaying a list of houses
normally, I'd use the following tables:
houses - the real estate asset at hand
owners - the owner of the house (one-to-many relationship with houses)
agencies - the real-estate broker agency (many-to-many relationship with houses)
images - many-to-one relationship with houses
reviews - many-to-one relationship with houses
I understand that MongoDB gives you the flexibility to design your web app across different collections with unique IDs, much like a relational database (normalized), and that to enjoy quick selections you can nest related objects and data within a collection (denormalized).
Back to our real-estate houses list: the query used to populate it is quite expensive in a normal relational DB. For each house you need to query its images, reviews, owner and agencies; each entity resides in a different table with its own fields, so you'd probably use joins and combine multiple queries into one. Expensive!
Enter MongoDB, where you don't need joins and you can store all the related data of a house inside a single item in the houses collection. Selection was never faster; it's DB heaven!
But what happens when you need to add/update/delete related reviews/agencies/owner/images?
This is a mystery to me. If I had to guess, each related entity lives in its own collection in addition to the copy embedded in the houses collection, and when one of these pieces of related data is added/updated/deleted you have to update it both in its own collection and in the houses collection. When making that update, do I also need to query the other collections to make sure I'm updating the house record with all of the latest related data?
I'm just guessing here and would really appreciate your feedback.
Thanks,
Ajar
Try this approach:
Work out which entity (or entities) are the hero(s)
With 'hero', I mean the entity(s) that the database is centered around. Let's take your example. The hero of the real-estate example is the house*.
Work out the ownerships
Go through the other entities, such as the owner, agency, images and reviews and ask yourself whether it makes sense to place their information together with the house. Would you have a cascading delete on any of the foreign keys in your relational database? If so, then that implies ownership.
Work out whether it actually matters that data is de-normalised
You will have agency (and probably owner) details spread across multiple houses. Does that matter?
Your house collection will probably look like this:
// a single document in the houses collection
{
    owner: { name: "...", phone: "..." },    // embedded owner details
    agency: { name: "..." },                 // embedded agency details
    images: [],     // recommend storing references to GridFS files here
    reviews: []     // you probably won't get too many of these for a single house
}
*Actually, it's probably the ad of the house (since houses are typically advertised on a real-estate website and that's probably what you're really interested in) so just consider that
Sarah Mei wrote an informative article about the kinds of data-integrity issues that can arise in NoSQL DBs: the choice between duplicating data or using IDs with code-based joins, and the challenge of keeping data consistent either way. Her take is that any NoSQL DB with code-based joins will lose data integrity at some point. IMHO the article's comments are as valuable as the article itself for understanding these issues and possible resolutions.
Link: http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/comment-page-1/
I would just like to give a normalization refresher from MongoDB's perspective -
What are the goals of normalization?
Frees the database from modification anomalies - in MongoDB it is mostly duplicated (embedded) data that causes these, so we should avoid embedding data in documents where it could create such anomalies. Occasionally we might need to duplicate data in documents for performance reasons; however, that's not the default approach. The default is to avoid it.
Should minimize re-design when extending - MongoDB is flexible enough here because it allows the addition of new keys without re-designing all existing documents.
Avoid bias toward any particular access pattern - this is something we're not going to worry about when designing a schema in MongoDB; one of the ideas behind MongoDB is to tune your database to the application you're writing and the problem you're trying to solve.
I've just started learning about NoSQL databases, especially MongoDB (no specific reason for MongoDB). I've browsed a few tutorial sites, but I still can't figure out how it handles relationships between two documents/entities.
Let's say, for example:
1. One Employee works in one department
2. One Employee works in many departments
I don't know whether the term 'relationship' even makes sense for MongoDB.
Can somebody please explain how joins and relationships work here?
The short answer: with "nosql" you wouldn't do it that way.
What you'd do instead of a join or a relationship is add the departments the user is in to the user object.
You could also add the user to a field in the "department" object, if you needed to see users from that direction.
Denormalized data like this is typical in a "nosql" database.
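For the employee/department example from the question, a denormalized sketch could look like this; the field names and ids are assumptions.

// Employee document carries the departments it belongs to
db.employees.insertOne({
    _id: "e1",
    name: "Jane Doe",
    departments: [
        { _id: "d1", name: "Engineering" },
        { _id: "d2", name: "Research" }
    ]
})

// Optionally, the department also lists employee ids for lookups in the other direction
db.departments.insertOne({
    _id: "d1",
    name: "Engineering",
    employeeIds: ["e1", "e7"]
})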
See this very closely related question: How do I perform the SQL Join equivalent in MongoDB?
In general, you want to denormalize the data in your collections (= tables). Your collections should be optimized so that you don't need to do joins (traditional server-side joins are not available in NoSQL).
In MongoDB you can either reference other collections (= tables), or you can embed them into each other -- whatever makes more sense in your domain. There is a size limit on documents in a collection (16 MB per document), so you can't just embed the Encyclopedia Britannica ;-)
It's probably best if you look for API documentation and examples for the programming language of your choice.
For Ruby, I'd recommend the Mongoid library: http://mongoid.org/docs/relations.html
Generally, if you've decided to learn about NoSQL databases you should follow the "NoSQL way", i.e. learn the principles behind the movement and its approach to design, and not simply try to map an RDBMS onto your first NoSQL project.
Simply put: you should learn how to embed and denormalize data (as Will suggested above), and not simply copy the id to simulate foreign keys.
If you go the "foreign _id" way, the next step is to look for transactions to ensure that two "rows" are inserted/updated consistently, and a few steps after that Oracle/MySQL is waiting for you. :)
There are some instances in which you want/need to keep the documents separate in which case you would take the _id from the one object and add it as a value in your other object.
For Example:
db.authors.insertOne({
    _id: '21EC2020-3AEA-1069-A2DD-08002B30309D',   // any unique value can serve as _id
    name: 'George R.R. Martin'
})
db.books.insertOne({
    name: 'A Dance with Dragons',
    authorId: '21EC2020-3AEA-1069-A2DD-08002B30309D'   // copy of the author's _id
})
There is no official relationship between books and authors; it's just a copy of the _id from authors stored in the authorId value in books.
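Resolving that reference is then a second query made by the application, for example:

// find the book, then fetch its author using the copied id
const book = db.books.findOne({ name: 'A Dance with Dragons' })
const author = db.authors.findOne({ _id: book.authorId })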
Hope that helps.
Before I dive really deep into MongoDB for days, I thought I'd ask a pretty basic question as to whether I should dive into it at all or not. I have basically no experience with nosql.
I did read a little about some of the benefits of document databases, and I think for this new application, they will be really great. It is always a hassle to do favourites, comments, etc. for many types of objects (lots of m-to-m relationships) and subclasses - it's kind of a pain to deal with.
I also have a structure that will be a pain to define in SQL because it's extremely nested and translates to a document a lot better than 15 different tables.
But I am confused about a few things.
Is it desirable to keep your database normalized still? I really don't want to be updating multiple records. Is that still how people approach the design of the database in MongoDB?
What happens when a user favourites a book and this selection is still stored in a user document, but then the book is deleted? How does the relationship get detached without foreign keys? Am I manually responsible for deleting all of the links myself?
What happens if a user favourited a book that no longer exists and I query it (some kind of join)? Do I have to do any fault-tolerance here?
MongoDB doesn't support server-side foreign key relationships, and normalization is also discouraged. You should embed your child objects within parent objects where possible; this will increase performance and make foreign keys unnecessary. That said, it is not always possible, so there is a special construct called DBRef which allows you to reference objects in a different collection. That can be slower, because the DB has to make additional queries to read the referenced objects, but it does give you a kind of foreign key reference.
Still, you will have to handle your references manually: only when looking up a DBRef will you see whether its target still exists; the DB will not go through all documents looking for references and remove them when the target of a reference no longer exists. But removing all references after deleting a book should only require a single query per collection, so it's not that difficult really.
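For instance, if users keep favourited book ids in an array, cleaning up after deleting a book is one update per referencing collection; the collection and field names below are assumptions.

// remove the book, then strip its id from every user's favourites array
const bookId = ObjectId("5f1b7a2e9d3c2a0012345678")   // example id of the book being deleted
db.books.deleteOne({ _id: bookId })
db.users.updateMany(
    { favouriteBookIds: bookId },             // only touch users that actually reference it
    { $pull: { favouriteBookIds: bookId } }
)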
If your schema is more complex then probably you should choose a relational database and not nosql.
There is also a book about designing MongoDB databases: Document Design for MongoDB
UPDATE The book above is not available anymore, yet because of popularity of MongoDB there are quite a lot of others. I won't link them all, since such links are likely to change, a simple search on Amazon shows multiple pages so it shouldn't be a problem to find some.
See the MongoDB manual pages on 'Manual references' and DBRefs for further specifics and examples.
Above, @TomaszStanczak states:
"MongoDB doesn't support server-side foreign key relationships, and normalization is also discouraged. You should embed your child objects within parent objects where possible; this will increase performance and make foreign keys unnecessary. That said, it is not always possible ..."
Normalization is not discouraged by Mongo. To be clear, we are talking about two fundamentally different types of relationship that two data entities can have. In one, a child entity is owned exclusively by a parent object; in this type of relationship the Mongo way is to embed.
In the other class of relationship, two entities exist independently - they have independent lifetimes and relationships. Mongo wishes this type of relationship did not exist and is frustratingly silent on precisely how to deal with it. Embedding is just not a solution. Normalization is neither discouraged nor encouraged; Mongo simply gives you two mechanisms to deal with it: manual refs (analogous to a key with a foreign key constraint binding two tables) and DBRefs (a different, slightly more structured way of doing the same). In this use case SQL databases win.
The answers of both Tomasz and Francis contain good advice: "normalization" is not discouraged by Mongo, but you should first consider optimizing your document design before creating "document references". DBRefs were mentioned by Tomasz; however, as he alluded, they are not a "magic bullet" and require additional processing to be useful.
What is now possible, as of MongoDB version 3.2, is to produce results equivalent to an SQL JOIN by using the $lookup aggregation pipeline stage. In this manner you can have a "normalized" document structure but still produce consolidated results. For this to work you need a key in the target collection that is ideally both meaningful and unique; you can enforce uniqueness by creating a unique index on that field.
$lookup usage is pretty straightforward. Have a look at the documentation here: https://docs.mongodb.com/manual/reference/operator/aggregation/lookup/#lookup-aggregation. Run the aggregate() method on the source collection (i.e. the "left" table). The from parameter is the target collection (i.e. the "right" table). The localField parameter would be the field in the source collection (i.e. the "foreign key"). The foreignField parameter would be the matching field in the target collection.
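As a sketch, reusing the books/authors shape from earlier in this thread (field names are assumptions):

// "Join" each book with its author document (MongoDB 3.2+)
db.books.aggregate([
    { $lookup: {
        from: "authors",          // the "right" collection
        localField: "authorId",   // field in books holding the reference
        foreignField: "_id",      // matching field in authors
        as: "author"              // output array of matched author documents
    } }
])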
As far as orphaned documents go, from your question I presume you are thinking about a traditional RDBMS set of constraints, cascading deletes, etc. Again, as of MongoDB version 3.2 there is native support for document validation. Have a look at this Stack Overflow question: How to apply constraints in MongoDB? Look at the second answer, from JohnnyHK.
Packt Publishers have a bunch of good books on MongoDB. (Full Disclosure: I wrote a couple of them.)