Just a general question here - I'm doing some self paced learning on MongoDB and to get off on the right foot I'd like an opinion on how to organize collections for a sample budget application.
As with any home budget I have 'Categories' such as Home, Auto and I also have subcategories under those categories such as Mortgage and Car Payments.
Each bill will have a due date, minimum amount due, a forecast payment, forecast payment date, actual payment and actual payment date.
Each bill is due to 'someone', for example Home, Mortgage may be due to Bank of America, and Bank of America may have contact info (phone, mailing address).
Making the switch from a Table structure to Mongo is a bit confusing, so I appreciate any opinions on how to approach this.
The question is very general. In general :), the following principles apply to schema design in MongoDB:
The layout of your collections should be guided by sound modeling principles. With MongoDB, you can have a schema that more closely resembles the object structure of your data, as opposed to a relational "projection" of it.
The layout of your collections should be guided by your data access patterns (which may sometimes conflict with the previous statement). Design your schemas so you can get the info you need in as few queries as possible, without loading too much data you don't need into your application.
You often can, and should, "denormalize" to achieve the above two. It's not a bad thing with MongoDB at all. The downside of denormalizing is that updates become more expensive and you need to make sure to maintain consistency. But those downsides are often outweighed by more natural modeling and better read efficiency.
From your description above, it sounds as if you have a rather "relational" model already in mind. Try to get rid of that and approach the problem with a fresh mind. Think objects, not tables.
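To make "think objects, not tables" a bit more concrete for the budget example, here is a minimal sketch in plain Python dictionaries. It does not use a MongoDB driver; the field names (`payee`, `minimum_due`, `forecast`, `actual`) and the helper functions are assumptions made for illustration, and the contact details are placeholders. The idea is that each bill is a single document carrying its category, subcategory, and a copy of the payee, so one read returns everything needed to display it.

```python
from datetime import date

def make_bill(category, subcategory, payee, due, minimum_due):
    """Build one bill document with its category and payee embedded."""
    return {
        "category": category,
        "subcategory": subcategory,
        "payee": dict(payee),            # denormalized copy of the payee details
        "due_date": due.isoformat(),
        "minimum_due": minimum_due,
        "forecast": None,                # filled in by record_payment
        "actual": None,
    }

def record_payment(bill, kind, amount, paid_on):
    """Set the forecast or actual payment (amount and date) on a bill."""
    if kind not in ("forecast", "actual"):
        raise ValueError("kind must be 'forecast' or 'actual'")
    bill[kind] = {"amount": amount, "date": paid_on.isoformat()}
    return bill

def unpaid_total(bills, category):
    """Sum the minimum due of bills in a category that have no actual payment."""
    return sum(b["minimum_due"] for b in bills
               if b["category"] == category and b["actual"] is None)

bank = {"name": "Bank of America", "phone": "555-0100"}  # placeholder contact info
bills = [
    make_bill("Home", "Mortgage", bank, date(2024, 1, 1), 1200),
    make_bill("Auto", "Car Payment", {"name": "Example Auto Finance"}, date(2024, 1, 15), 300),
    make_bill("Home", "Insurance", {"name": "Example Insurer"}, date(2024, 1, 20), 100),
]
record_payment(bills[2], "actual", 100, date(2024, 1, 18))
home_unpaid = unpaid_total(bills, "Home")
```

The trade-off is the one described above: if the payee's phone number changes, every bill that embeds it has to be updated.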
I'm new to NoSQL and I'm trying to figure out the best way to model my database. I'll be using ArangoDB in the project but I think this question also stands if using MongoDB.
The database will store 12 categories of products. Each category is expected to hold hundreds or thousands of products. Products will also be added / removed constantly.
There will be a number of common fields across all products, but each category will also have unique fields / different restrictions to data.
Keep in mind that there are instances where I'd need to query all the categories at the same time, for example to search a product across all categories, and other instances where I'll only need to query one category.
Should I create one single collection "Product" and use a field to indicate the category, or create a separate collection for each category?
I've read many questions related to this idea (1 collection vs many) but I haven't been able to reach a conclusion, other than "it depends".
So my question is: In this specific use case which option would be most optimal, multiple collections vs single collection + sharding, in terms of performance and speed ?
Any help would be appreciated.
As you mentioned, you need to experiment with your data and use case. That will give you a better picture.
Some decisions are required, as listed below.
Decide the number of documents you will have in the near future. If you expect 1M documents in a year, then test with at least 3M documents.
Decide the number of indices required.
Decide the number of writes, reads per second.
Decide the size of documents per category.
Decide the query pattern.
Some inputs based on the requirements
If you have many writes and many indices, then a single monolithic collection will be slower, as multiple indices need to be updated.
As you have a different set of fields per category, you could try multiple collections.
There is $unionWith to combine data from multiple collections. But do check the performance; it depends purely on the above decisions. Note this open issue also.
If you decide to go with a monolithic collection, defer the sharding. Implement it once you find that queries are slow.
If you have many writes to the same document, the writes will be executed sequentially. That will slow down your reads as well.
Think about reclaiming disk space when a lot of data is cleared from the collections. Multiple collections do well here.
The point that pushes me to suggest a monolithic collection is that you need to query all the categories at the same time. You may need to add more categories, and combining all of them from separate collections into a single response would not perform well.
As you don't really have a join use case like in an RDBMS, you can go with a single monolithic collection from a modeling point of view. I doubt you would have a join key.
If any of my points are incorrect, please let me know.
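As a rough illustration of the single-collection option, here is a sketch that uses a plain Python list as a stand-in for one "Product" collection with a `category` field. The product names, fields, and the `search` helper are made up for the example; a real database would use a query with an index on `category` rather than a list scan.

```python
# In-memory stand-in for a single "Product" collection (hypothetical data).
# Each category keeps its own extra fields next to the common ones.
products = [
    {"_id": 1, "category": "laptops", "name": "Alpha 13", "price": 900, "ram_gb": 16},
    {"_id": 2, "category": "phones", "name": "Alpha Phone", "price": 500, "screen_in": 6.1},
    {"_id": 3, "category": "laptops", "name": "Beta 15", "price": 1200, "ram_gb": 32},
]

def search(docs, text, category=None):
    """Case-insensitive name search, optionally restricted to one category."""
    needle = text.lower()
    return [d for d in docs
            if needle in d["name"].lower()
            and (category is None or d["category"] == category)]

across_all = [d["_id"] for d in search(products, "alpha")]
laptops_only = [d["_id"] for d in search(products, "alpha", category="laptops")]
```

With one collection, the cross-category search and the per-category search are the same query with or without the category filter. With one collection per category, the cross-category search has to combine results from every collection.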
To SQL or to NoSQL?
I think that before you implement this in NoSQL, you should ask yourself why you are doing that. I quite like NoSQL but some data is definitely a better fit to that model than others.
The data you are describing is a classic case for a relational SQL DB. That's fine if it's a hobby project and you want to try NoSQL, but if this is for a production environment or client, you are likely making the situation more difficult for them.
Relational or non-relational?
You mention common fields across all products. If you wish to update these fields and have those updates reflected in all products, then you have relational data.
Background
It may be worth reading Sarah Mei's 2013 article about this. Skip to the section "How MongoDB Stores Data" and read from there. Warning: the article is called "Why You Should Never Use MongoDB" and is (perhaps intentionally) somewhat biased against Mongo, so it's important to read it through the correct lens. The message you should get from this article is that MongoDB is not a good fit for every data type.
Two strategies for handling relational data in Mongo:
every time you update one of these common fields, update every product's document with the new common field data. This is generally only ok if you have few updates or few documents, but not both.
use references and do joins.
In Mongo, joins typically happen code-side (multiple db calls)
In Arango (and in other graph dbs, as well as some key-value stores), the joins happen db-side (single db call)
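The two strategies above can be sketched with plain Python dictionaries standing in for collections. Everything here (the `categories` lookup, the field names, the helper functions) is hypothetical and only meant to show where the work lands in each case, not how any particular driver does it.

```python
# Shared "common fields", keyed by _id (stand-in for a separate collection).
categories = {"c1": {"_id": "c1", "name": "Tools", "warranty_months": 12}}

# Strategy 1: every product carries its own copy of the common fields.
embedded_products = [
    {"_id": "p1", "name": "Hammer", "category": dict(categories["c1"])},
    {"_id": "p2", "name": "Wrench", "category": dict(categories["c1"])},
]

# Strategy 2: every product stores only a reference.
referencing_products = [
    {"_id": "p1", "name": "Hammer", "category_id": "c1"},
    {"_id": "p2", "name": "Wrench", "category_id": "c1"},
]

def update_embedded_copies(products, category_id, changes):
    """Strategy 1: rewrite the duplicated fields in every affected product."""
    touched = 0
    for p in products:
        if p.get("category", {}).get("_id") == category_id:
            p["category"].update(changes)
            touched += 1
    return touched

def join_code_side(products, categories):
    """Strategy 2: resolve each reference in application code."""
    return [{**p, "category": categories[p["category_id"]]} for p in products]

# One logical change: the warranty goes from 12 to 24 months.
touched = update_embedded_copies(embedded_products, "c1", {"warranty_months": 24})
categories["c1"]["warranty_months"] = 24          # single write for strategy 2
joined = join_code_side(referencing_products, categories)
```

Strategy 1 pays on every update (one write per product); strategy 2 pays on every read (an extra lookup to resolve the reference).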
Decisions
These are important factors to consider when deciding which DB to use and how to model your data
I've used MongoDB, ArangoDB and Neo4j.
Mongo definitely has the best tooling and it's easy to find help, but I don't believe it's a good fit in this case
Arango is quite pleasant to work with, but doesn't yet have the adoption that it deserves
I wouldn't recommend Neo4j to anyone looking for a NoSQL solution, as its nodes and relations only support flat properties (no nesting, so not real documents)
It may also be worth considering MariaDB or Postgres
I'm learning about DDD, and I'm not clear about how to separate objects into aggregates.
An example:
I have 3 objects: company, shop, job.
And I have some relationships: one company has many shops, and one shop has many jobs.
I think:
A shop can't exist without a company, and a company has to have shops; that's the real world. So I group company and shop into one aggregate.
Job is another aggregate.
Another thought
When getting a job, I always care about which shop this job belongs to.
So I group shop and job into one aggregate.
Company is another aggregate.
Which way is right?
Thanks
The only possible answer is, of course, "It depends." That's not especially helpful, though.
Review the definition of an aggregate from Evans's book:
An AGGREGATE is a cluster of associated objects that we treat as a
unit for the purpose of data changes ... Invariants, which are
consistency rules that must be maintained whenever data changes, will
involve relationships between members of the AGGREGATE. Any rule that
spans AGGREGATES will not be expected to be up-to-date at all times
... But the invariants applied within an AGGREGATE will be enforced
with the completion of each transaction.
So the questions of "what objects make up my aggregate" and "what is my aggregate root?" depend on what business invariants need to be enforced across which business transactions?
You do not design aggregates like you do tables in a relational database. You're not concerned about the multiplicity of the relationship between the entities in "real life". You're looking for what facts (properties, values) must be true at the end of an action that affects (mutates the data of) those entities.
Look at your requirements. Look at what kinds of behavior your system needs to support. What can you do with jobs? Create them? Start them? Complete them? Can you transfer a job from one shop to another? Can a job move between companies?
What facts need to stay consistent? e.g., are you enforcing a maximum number of jobs per shop? At the end of "adding a job", does the current # of jobs in a shop need to be consistent with the job's shop assignment?
Since you can only interact with an aggregate through its root, you need to think about the context of how you add new data. e.g., can you create a job with no initial shop assignment? Or can it only be created through a shop?
There's also a compromise between the size/scope of an aggregate and the possibility of data contention when updating an aggregate in a transaction.
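To illustrate one possible answer to the questions above, here is a minimal sketch that assumes the business rule is "a shop has a maximum number of jobs" and that jobs can only be created through a shop. Both of those are assumptions for the example, not something the question states, and the class and field names are hypothetical. Under those assumptions Shop is the aggregate root, Job lives inside it, and Company is referenced only by its identity.

```python
from dataclasses import dataclass, field

class ShopFull(Exception):
    """Raised when adding a job would break the shop's job limit."""

@dataclass
class Job:
    job_id: str
    title: str

@dataclass
class Shop:
    """Aggregate root: jobs are only ever added through the shop."""
    shop_id: str
    company_id: str   # another aggregate, referenced by identity only
    max_jobs: int
    jobs: list = field(default_factory=list)

    def add_job(self, job_id, title):
        # The invariant is checked and the state changed in one place,
        # so it holds at the end of every "add a job" action.
        if len(self.jobs) >= self.max_jobs:
            raise ShopFull(f"shop {self.shop_id} already has {self.max_jobs} jobs")
        job = Job(job_id, title)
        self.jobs.append(job)
        return job

shop = Shop("s1", "c1", max_jobs=2)
shop.add_job("j1", "Barista")
shop.add_job("j2", "Cashier")
try:
    shop.add_job("j3", "Manager")
    rejected = False
except ShopFull:
    rejected = True
```

If the requirements had no rule tying jobs to a shop's state, the same reasoning could just as well lead to Job being its own aggregate that only holds a `shop_id`.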
With all of these things to worry about, you may wonder why even bother with aggregates? Well, they are great at a couple of things:
validation and execution of commands is FAST, because all the data you need is self-contained inside the aggregate
they lend themselves well to document-based persistence stores (like MongoDB with nested documents for the aggregate objects), which makes retrieval of the aggregate state simple and performant, and enforcement of the aggregate transaction boundary easy with document-level atomic updates
they are extremely easy to test, because they can be implemented as simple classes (POCOs/POJOs in C# or Java). Because they contain the majority of your business logic, it means your overall app's behavior is easy to unit test as well!
they are intentful; each aggregate has a purpose, and it's very clear from the data and functions they implement just what they do in your system. Combined with using the ubiquitous language of your context in the code itself, they are the most direct expressions of your business behavior in the codebase (much more so than a set of data tables alone)
because they are so use-case specific, aggregates often avoid leaky abstractions that crop up in more generic solutions
If you're interested in reading more, Vaughn Vernon has a nice summary in his Effective Aggregate Design posts, which served as the basis for his meaty book "Implementing Domain-Driven Design".
I'm going to try and make this as straight-forward as I can.
Coming from MySQL and thinking in terms of tables, let's use the following example:
Let's say that we have a real-estate website and we're displaying a list of houses
normally, I'd use the following tables:
houses - the real estate asset at hand
owners - the owner of the house (one-to-many relationship with houses)
agencies - the real-estate broker agency (many-to-many relationship with houses)
images - many-to-one relationship with houses
reviews - many-to-one relationship with houses
I understand that MongoDB gives you the flexibility to design your web-app in different collections with unique IDs much like a relational database (normalized), and to enjoy quick selections, you can nest within a collection, related objects and data (un-normalized).
Back to our real-estate houses list, the query used to populate it is quite expensive in a normal relational DB, for each house you need to query its images, reviews, owner & agencies, each entity resides in a different table with its fields, you'd probably use joins and have multiple queries joined into one - Expensive!
Enter MongoDB - where you don't need joins, and you can store all the related data of a house in a house item on the houses collection, selection was never faster, it's a db heaven!
But what happens when you need to add/update/delete related reviews/agencies/owner/images?
This is a mystery to me. If I had to guess, each related entity lives in its own collection in addition to its copy within the houses collection, and whenever one of these pieces of related data is added, updated or deleted, you have to update it in its own collection as well as in the houses collection. On such an update, do I also need to query the other collections to make sure I'm updating the house record with all the current related data?
I'm just guessing here and would really appreciate your feedback.
Thanks,
Ajar
Try this approach:
Work out which entity (or entities) are the hero(s)
With 'hero', I mean the entity(s) that the database is centered around. Let's take your example. The hero of the real-estate example is the house*.
Work out the ownerships
Go through the other entities, such as the owner, agency, images and reviews and ask yourself whether it makes sense to place their information together with the house. Would you have a cascading delete on any of the foreign keys in your relational database? If so, then that implies ownership.
Work out whether it actually matters that data is de-normalised
You will have agency (and probably owner) details spread across multiple houses. Does that matter?
Your house collection will probably look like this:
house: {
    owner,
    agency,
    images[], // recommend references to GridFS here
    reviews[] // you probably won't get too many of these for a single house
}
*Actually, it's probably the ad of the house (since houses are typically advertised on a real-estate website and that's probably what you're really interested in) so just consider that
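To show what add/update looks like with that layout, here is a rough sketch using plain Python dictionaries as stand-ins for house documents. The field names (`agency_id`, `image_ids`) and both helper functions are assumptions for illustration only. A review is owned by the house, so adding one touches a single document. Agency details are duplicated, so a change to them has to be applied to every house that embeds that agency.

```python
# In-memory stand-ins for documents in a "houses" collection (hypothetical data).
houses = [
    {"_id": "h1",
     "owner": {"name": "A. Owner"},
     "agency": {"agency_id": "a1", "name": "Acme Realty"},
     "image_ids": ["img1", "img2"],      # references to files stored elsewhere
     "reviews": []},
    {"_id": "h2",
     "owner": {"name": "B. Owner"},
     "agency": {"agency_id": "a1", "name": "Acme Realty"},
     "image_ids": ["img3"],
     "reviews": []},
]

def add_review(house, author, text):
    """Owned data: the change touches only this one house document."""
    house["reviews"].append({"author": author, "text": text})
    return len(house["reviews"])

def rename_agency(houses, agency_id, new_name):
    """Duplicated data: the change must reach every house embedding the agency."""
    changed = 0
    for house in houses:
        if house["agency"]["agency_id"] == agency_id:
            house["agency"]["name"] = new_name
            changed += 1
    return changed

review_count = add_review(houses[0], "visitor1", "Bright kitchen")
renamed = rename_agency(houses, "a1", "Acme Realty Group")
```

Whether the second kind of update is a problem depends on how often agency details actually change, which is the "does that matter?" question above.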
Sarah Mei wrote an informative article about the kinds of data integrity issues that can arise in NoSQL databases: the choice between duplicating data and using IDs with code-based joins, and the challenges of keeping data consistent. Her take is that any NoSQL database with code-based joins will lose data integrity at some point. IMHO the article's comments are as valuable as the article itself for understanding these issues and possible resolutions.
Link: http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/comment-page-1/
I would just like to give a normalization refresher from MongoDB's perspective:
What are the goals of normalization?
Frees the database from modification anomalies - For MongoDB, it looks like embedding data would mostly cause this. In fact, we should try to avoid embedding data in documents in ways that could create these anomalies. Occasionally, we might need to duplicate data in the documents for performance reasons. However, that's not the default approach. The default is to avoid it.
Should minimize re-design when extending - MongoDB is flexible enough because it allows addition of keys without re-designing all the documents.
Avoid bias toward any particular access pattern - this is something we're not going to worry about when describing a schema in MongoDB. One of the ideas behind MongoDB is to tune your database to the application you're trying to write and the problem you're trying to solve.
I'm considering using MongoDB for a web application but I have read that there are some situations where it is not recommended. I am wondering whether my project would be one of those situations.
Here's a brief outline of the entities in my system -
There are Users with standard user detail attributes
Each User can have many Receipts, a Receipt can have only one User
A Receipt contains many products
Products have standard product detail attributes
Users can have many Friends, and each Friend is a User themselves
Reviews can be given by Users for Products
There will be a Reputation system like the one here on stackoverflow where users earn Points and Badges
As you can see there are lots of entities that have various relationships with each other. And data integrity is important. Is this type of schema suitable for MongoDB?
Your data seems to be very relational, not document-oriented.
You are also pointing out data integrity as an important requirement (I assume you mean referential integrity between different entities). When entities are not stored in the same document, it is hard to guarantee integrity between them in MongoDB. Also, MongoDB can't enforce any constraints (except for uniqueness of field values).
You also seem to have a very relational mindset overall, which will make it hard for you to use MongoDB the way it was meant to be used.
For these reasons I would recommend that you stick to a relational database.
The reason why you considered using MongoDB in the first place is likely because you heard that it is fast and that it scales well. Yes, it is fast and it does scale well, but only because it doesn't enforce referential integrity and because it doesn't do table joins. When you need these features and thus have to find ugly workarounds to simulate them, MongoDB won't be that fast anymore.
In my system Employees log Requests and Requests are about an Equipment.
In order to avoid loading the Employee and Equipment documents each time I need to display a Request I want to denormalize the employee name and the equipment inventory number, manufacturer name and model in the Request document.
Am I on the wrong track here? Is this an antipattern?
I realize that if I do that then I’ll have to update all affected Request documents in the very rare case that an employee’s name changes or an equipment’s inventory number changes.
PS: Links to Document Database Modeling Guidelines would be appreciated as well.
Sly, this really depends on your required level of speed. If you just have one database without sharding and you're ok with very fast performance, then don't denormalize and use .Include for the sake of simplicity. However, includes don't work on sharded sessions, so you might want to go for denormalization then and have lightning performance.
My very personal opinion is that you should always go with .Include unless you have a really good reason not to use it.
Here is a good, short post on denormalization.
In summary, denormalization is ok if you're never going to update, and the performance advantage outweighs the storage disadvantage (it usually does).
Sly,
In your case, do you really need to update the information if it changes?
If Sally May requested a new laptop in 2011, is it meaningful to change that request to say Sally May-Jones in 2012 ?
Usually when we want denormalization, we want it not for performance (although that is a factor, Include can deal with it nicely). We want it to get a point-in-time view of the data.
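Here is a small sketch of that point-in-time idea, written with plain Python dictionaries rather than any document database client. The document shapes, ids, and the `log_request` helper are hypothetical. The Request copies the employee name and equipment details at the moment it is logged, and a later change to the Employee document deliberately does not touch it.

```python
from datetime import date

def log_request(employee, equipment, logged_on):
    """Create a Request that snapshots the display fields at logging time."""
    return {
        "employee_id": employee["id"],              # reference, still resolvable later
        "employee_name": employee["name"],          # snapshot
        "equipment_id": equipment["id"],
        "inventory_number": equipment["inventory_number"],
        "manufacturer": equipment["manufacturer"],
        "model": equipment["model"],
        "logged_on": logged_on.isoformat(),
    }

sally = {"id": "employees/1", "name": "Sally May"}
laptop = {"id": "equipment/7", "inventory_number": "INV-0007",
          "manufacturer": "ExampleCorp", "model": "X1"}

request = log_request(sally, laptop, date(2011, 5, 2))

# A year later the employee's name changes; the old request is left as it was.
sally["name"] = "Sally May-Jones"
```

Keeping `employee_id` alongside the snapshot means the current name can still be looked up when it is actually needed.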