We recently started working on a big project and decided to use MongoDB as our database.
We wrote a lot of code, but as the project grew we found that we are trying to use joins instead of doing things the NoSQL way, which points to a bad database design.
What I'm asking for is a good design for our project, which at this point consists of the following:
More than 12,000 products
More than 2,000 sellers
Every seller should have their own private area in which to create a product catalog based on the 12,000+ "products template list".
The seller should be able to set the price, stock and offers, which will then be reflected only in his public product listing. The template list of products will remain unchanged.
Currently we have two collections. One for the products (which holds the general product information, like name, description, photos, etc...) and one collection in which we store documents that contain the ID of the product from the first collection, an ID that is related to the seller and the stock, price and offers values.
We are using aggregate with $lookup to "emulate" SQL's left join to merge the two collections, but the process is not scaling as we'd like it to and we're hitting serious performance issues.
We're aware that using joins is not the way to go in NoSQL. What should we do? How should we refactor our database design? Should we embed the prices, offers and stock for each seller in each product document?
Whether to use embedded documents or joins across two or more collections should depend on how you are going to retrieve the data. If you fetch the sellers every time you fetch a product, it makes sense to embed them instead of keeping them in a separate collection. But if you plan to fetch the two entities separately, the only option you are left with is a join.
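To make the trade-off concrete, here is a minimal sketch in plain Python dictionaries, with no database driver involved. It assumes, purely for illustration, that a seller's public listing only needs the product name and one photo, so those read-only template fields are copied into each seller offer when the offer is created. The field names (`seller_id`, `product_id`, `name`, `photo`) and both helper functions are hypothetical.

```python
# Template catalog: general product information that sellers cannot change.
product_templates = {
    "p1": {"name": "Blue mug", "description": "Ceramic, 300 ml", "photo": "mug.jpg"},
    "p2": {"name": "Red lamp", "description": "Desk lamp", "photo": "lamp.jpg"},
}

# One document per (seller, product) pair, like the question's second collection.
seller_offers = []


def add_offer(seller_id, product_id, price, stock):
    """Create an offer and copy in the few template fields the listing shows."""
    template = product_templates[product_id]
    offer = {
        "seller_id": seller_id,
        "product_id": product_id,
        "price": price,
        "stock": stock,
        # Denormalized, read-only copies; the template itself stays unchanged.
        "name": template["name"],
        "photo": template["photo"],
    }
    seller_offers.append(offer)
    return offer


def public_listing(seller_id):
    """Read a seller's listing from a single collection, with no join."""
    return [o for o in seller_offers if o["seller_id"] == seller_id]


add_offer("s1", "p1", price=4.5, stock=10)
add_offer("s1", "p2", price=19.0, stock=3)
add_offer("s2", "p1", price=5.0, stock=7)
listing = public_listing("s1")
```

Because the template list is described as unchanging, the copied fields should rarely go stale. The full description can still be read from the template by `product_id` when a single product page is shown.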
I'm having trouble correctly designing the domain that I'm working on.
My straightforward use case is the following:
The user (~5,000 users) can access a list of ads (~5 million).
He can choose to add/remove some of them as favorites.
He can decide to show/hide some of them.
I have a command which will mutate the aggregate state, to set Favorite to TRUE, let's say.
In terms of DDD, how should I design the aggregates?
How should I design the relationship between a user and his selection of favorite ads?
Considering the large number of ads, I cannot duplicate each ad inside a user aggregate root.
Can I design an Ads aggregate root containing a user "collection"?
And finally, how should I handle the read model part?
Thanks in advance
Cheers
Two concepts may help you understand how to model this:
1. Aggregates are Transaction Boundaries.
An aggregate is a cluster of associated objects that are considered as a single unit. All parts of the aggregate are loaded and persisted together.
If you have an aggregate that encloses 1,000 entities, then you have to load all of them into memory. It follows that you should prefer small aggregates whenever possible.
2. Aggregates are Distinct Concepts.
An Aggregate represents a distinct concept in the domain. Behavior associated with more than one Aggregate (like Favoriting, in your case) is usually an aggregate by itself with its own set of attributes, domain objects, and behavior.
From your example, User is a clear aggregate.
An Ad has a distinct concept associated with it in the domain, so it is an aggregate too. There may be other entities that will be embedded within the Ad like valid_until, description, is_active, etc.
The concept of favoriting an Ad links the User and the Ad aggregates. Your question seems to be centered around where this linkage should be preserved. Should it be in the User aggregate (a list of Ads), or should an Ad have a collection of User objects embedded within it?
While both are possibilities, I think FavoriteAd is yet another aggregate, which holds references to both the User aggregate and the Ad aggregate. This way, you don't burden the concepts of User or Ad with favoriting behavior.
Those aggregates will also not be required to load this additional data every time they are loaded into memory. For example, if you are loading an Ad object to edit its contents, you don't want the favorites collection to be loaded into memory by default.
These aggregate structures don't matter as far as read models are concerned. Aggregates only deal with the write side of the domain. You are free to rewire the data any way you want, in multiple forms, on the read side. You can have a subscriber just to listen to the Favorited event (raised after processing the Favorite command) and build a composite data structure containing data from both the User and the Ad aggregates.
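The following is a minimal sketch of that idea in Python, not a definitive implementation. The class and field names (`FavoriteAd`, `Favorited`, `FavoritesReadModel`, `ad_title`, `user_name`) and the in-memory dictionaries standing in for the User and Ad stores are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Favorited:
    """Event raised after the Favorite command has been processed."""
    user_id: str
    ad_id: str


@dataclass
class FavoriteAd:
    """Small aggregate: holds only references to the User and the Ad."""
    user_id: str
    ad_id: str
    is_favorite: bool = False

    def favorite(self):
        """Handle the Favorite command and return the resulting event."""
        self.is_favorite = True
        return Favorited(self.user_id, self.ad_id)


@dataclass
class FavoritesReadModel:
    """Read side: a composite view built from events, keyed by user."""
    users: dict
    ads: dict
    by_user: dict = field(default_factory=dict)

    def on_favorited(self, event):
        """Subscriber: combine data from both aggregates into one row."""
        row = {
            "ad_id": event.ad_id,
            "ad_title": self.ads[event.ad_id]["title"],
            "user_name": self.users[event.user_id]["name"],
        }
        self.by_user.setdefault(event.user_id, []).append(row)


users = {"u1": {"name": "Alice"}}
ads = {"a1": {"title": "Bike for sale"}}
read_model = FavoritesReadModel(users=users, ads=ads)

event = FavoriteAd(user_id="u1", ad_id="a1").favorite()
read_model.on_favorited(event)
```

The write side never loads an Ad or a User here. Only the read model touches their data, and it can be rebuilt or reshaped without changing the aggregates.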
I really like the answer given by Subhash Bhushan and I want to add another approach for you to consider.
If you look closely at your question you will see that you've made the assumption that an aggregate can 'see' everything that the user does when they are interacting with the UI. This doesn't need to be so.
Depending on the requirements of the domain you don't need to hold a list of any Ads in the aggregate to favourite them. Here's what I mean:
For this example, it doesn't matter where the 'favourite' ad command sits. It could be on the user aggregate or on a specific aggregate for handling the concept of Favouriting. The command just needs to hold the id of the User and the id of the Ad they are favouriting.
You may need to handle what happens if a user or ad is deleted but that would just be a case of an event process manager listening to the appropriate events and issuing compensating commands.
This way you don't need to load up 5 million ads. That's a job for the read model and UI, not the domain.
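As a rough sketch of the points above, assuming a hypothetical in-memory set as the write-side store and hypothetical handler names, the command handlers only ever see two ids, and a process manager reacts to an ad being deleted by issuing compensating commands:

```python
# Write-side state: only (user_id, ad_id) pairs, never the ads themselves.
favorites = set()


def handle_favorite_ad(user_id, ad_id):
    """The command carries two ids; nothing else is loaded."""
    favorites.add((user_id, ad_id))


def handle_unfavorite_ad(user_id, ad_id):
    """Compensating command: remove the link if it exists."""
    favorites.discard((user_id, ad_id))


def on_ad_deleted(ad_id):
    """Process manager: react to an ad-deleted event with compensating commands."""
    # sorted() makes a copy, so the set can be modified while looping.
    for user_id, fav_ad_id in sorted(favorites):
        if fav_ad_id == ad_id:
            handle_unfavorite_ad(user_id, fav_ad_id)


handle_favorite_ad("u1", "a1")
handle_favorite_ad("u2", "a1")
handle_favorite_ad("u1", "a2")
on_ad_deleted("a1")
```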
Just a thought.
I need to model a many-to-many relation on Firestore. A summary of the requirements follows.
A company can hire many contractors for a project. A contractor can work for many companies on different projects at different times.
There should be no limit on the number of contractors or companies, i.e. collections or sub-collections should be used.
A contractor should be able to query by companies; and vice versa, a company should be able to query by contractors. For example, (1) a contractor might ask for a list of companies he/she worked for sorted by project & time, and (2) a company can ask for all contractors who worked for them over a month sorted by project & contractor, and possibly divided by week.
As far as the company is concerned, a contractor can change status, e.g. working, complete. A company changes the status of a contractor during the project lifetime. This status can be used in queries.
Obviously, contractors should not have access to other contractors' information.
A company is represented by only a single user on the mobile app. Similarly, a contractor is represented by only a single user on the mobile app.
The mobile app is built in React Native, which (to the best of my knowledge) is considered by Firestore as a web app.
I am thinking of using a sub-collection of documents for/under each company. Each document represents a project. All contractors' names, their statuses and start and end times are stored on this document.
At the same time, I would keep a duplicate sub-collection of project documents for/under each contractor. Each of these duplicate documents is a partial copy of the project's document (above) and stores the company name and the start and end time of the project.
a. Whenever a relationship is established, e.g. a contract is signed, both documents are created in a batch.
b. Status exists only on the 1st copy of the document.
c. In case of any rare changes to the almost static data, e.g. name or phone, both documents are updated.
Does this design make sense?
Any concerns, suggestions, better ideas?
If you agree with the design, I would love to hear from you, maybe you can write in a comment something like sounds good.
There are particular cases when you can use a sub-collection and when not to use sub-collections.
When to use sub-collections:
1) When you don't want to store a lot of fields in a document. Cloud Firestore has a limit of 20,000 fields per document. (Relevant if the Company and Contractor information is very large and could exceed 20,000 fields.)
2) When updating the parent document is a common operation. Firestore only lets you update a single document at a rate of about 1 write/second. (Relevant if the Company and Contractor information is modified very often.)
3) When you want to limit access to particular fields of a document. (Relevant if you want to restrict access to a Company's contractors or to a Contractor's companies. In this case, moving the restricted fields to another document in another collection is also a good idea!)
When not to use sub-collections:
1) When you want to query the collections and sub-collections together. Firestore queries are shallow, so sub-collections are not queried when you query the parent collection; you have to query them separately. (For example, if you need to show all the companies and their contractors in one window.)
2) When you want to show the sub-collection when viewing the collection. (When showing a company, you might want to show its contractors. The number of reads increases because instead of reading one document you read one document and its sub-collection every time.)
3) When you want to query collections and sub-collections together. (You can use the newly announced collection group queries whenever you want to query something that's common across the Companies and Contractors, such as field of work or minimum rate.)
4) If you're thinking about querying individual pieces of data, you should put them in a collection. (If the Contractor's particular attributes are usually queried by Companies or a Company's details are looked upon by multiple Contractors)
My Suggestion:
Company collection to store company information on which companies can be searched according to their qualities.
Contractors collection with the same approach since I'm assuming contractors will be queried a lot according to their attributes.
Projects sub-collection for info about the projects on which companies and contractors will collaborate. This can be a sub-collection under the Company collection if only one company will be working on a project. Even if multiple contractors are going to work on a project for a company, you can store the contractors' IDs in an array in the Projects collection. This will help you avoid the partial Projects sub-collection inside each Company/Contractor collection.
But if you need to query on the projects' qualities, it is better to expose them as a separate parent collection. I leave that up to you.
Finally, I would suggest a new collection, Contracts, which stores the relationship between Company, Contractor and Project along with all the information on which you want to do complex querying. If the same company and contractor collaborate on two different projects, that is two documents in the Contracts collection. This comes in handy when you want to show dashboards: using this single collection you can show separate statistics for a Company or a Contractor, and complex statistics involving both.
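To illustrate the Contracts idea, here is a minimal sketch that uses a plain Python list in place of the collection. The field names (`company_id`, `contractor_id`, `project`, `status`, `start`, `end`) and the sample data are assumptions. In Firestore each function would correspond to a query on the single Contracts collection; the filtering is done in Python here only to keep the sketch runnable, and the date-overlap filter in particular may need adapting to Firestore's query restrictions.

```python
from datetime import date

# One document per (company, contractor, project) relationship.
contracts = [
    {"company_id": "c1", "contractor_id": "k1", "project": "Bridge",
     "status": "working", "start": date(2019, 5, 1), "end": date(2019, 5, 20)},
    {"company_id": "c1", "contractor_id": "k2", "project": "Bridge",
     "status": "working", "start": date(2019, 5, 10), "end": date(2019, 6, 15)},
    {"company_id": "c2", "contractor_id": "k1", "project": "Airport",
     "status": "complete", "start": date(2019, 3, 1), "end": date(2019, 4, 1)},
    {"company_id": "c1", "contractor_id": "k1", "project": "Tunnel",
     "status": "working", "start": date(2019, 7, 1), "end": date(2019, 7, 30)},
]


def companies_for_contractor(contractor_id):
    """Contracts of one contractor, sorted by project and start time."""
    rows = [c for c in contracts if c["contractor_id"] == contractor_id]
    return sorted(rows, key=lambda c: (c["project"], c["start"]))


def contractors_for_company(company_id, period_start, period_end):
    """Contracts of one company that overlap a period, by project and contractor."""
    rows = [
        c for c in contracts
        if c["company_id"] == company_id
        and c["start"] <= period_end
        and c["end"] >= period_start
    ]
    return sorted(rows, key=lambda c: (c["project"], c["contractor_id"]))


k1_history = companies_for_contractor("k1")
c1_in_may = contractors_for_company("c1", date(2019, 5, 1), date(2019, 5, 31))
```

Because `status` lives only on the contract document, the company can change it in one place and it remains available as a filter.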
Hope this helps.
My problem is that I need to update data that has been denormalized because it lives in a NoSQL store: a single change to one piece of data has to be applied to every redundant copy.
For example, consider an e-commerce database with a "Products" table that contains all the details about a product, say name, imageName and LogoImage.
Many "Products" entries can share the same LogoImage, so when I need to update the LogoImage I have to update every record that contains it, which seems like a very poor solution.
So is there a better way to do that?
P.S.: If we separate logos and products into two different tables, then when I need to get 1000 products at a time I have to fetch the related logos with a client-side join of sorts, which is also not a good solution.
You're suggesting using the database as your CDN and storing the binary image in it? That's not a great approach, in my opinion. You should be storing that image in an actual CDN like Amazon CloudFront, or in simple object storage like Amazon S3, or on your own web server as a file. Whichever you choose, the point is that you should be referring to it by URI. In Aerospike you would store the metadata about that image, not the image itself.
Next, you can have two sets - prod for products and prodimg for product images. The various products store a list of IDs referring to the product image set. The product image set has metadata about each image as a separate record { uri, name, title, width, length, ... } . If anything changes about this image, you just update the one record with the metadata for that image in prodimg. No need to change anything about the products.
And you don't really need JOIN functionality in this case. Your application can get the prod record first, and use the bin (images) that has all the IDs of the images for the product (each referring to a key of a record in prodimg). You can then issue either a few get operations (reads) or a single batch-read for all of them if there are many. The latencies for Aerospike are such that this will return faster and scale better than an equivalent JOIN in an RDBMS. A batch-read is a multi-node, multi-core, multi-threaded operation. A cluster of 3 multi-core nodes has plenty of parallel computing power.
Again, if you "need 1000 products at a time" use batch-read. In the Java client that's an AerospikeClient.get() with a list of Key objects. In the Python client that's an aerospike.Client.get_many. Every Aerospike client has batch-read functionality.
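The two-set layout can be sketched without a running cluster. In the snippet below, plain dictionaries stand in for the `prod` and `prodimg` sets, and `batch_get` is a hypothetical stand-in for the client's batch-read call, not the real Aerospike API. The URIs and record keys are made up for illustration.

```python
# In-memory stand-ins for the two sets; dictionary keys play the role of record keys.
prodimg = {
    "img1": {"uri": "https://cdn.example.com/logo-a.png", "name": "logo-a"},
    "img2": {"uri": "https://cdn.example.com/front.png", "name": "front"},
}
prod = {
    "prod1": {"name": "Kettle", "images": ["img1", "img2"]},
    "prod2": {"name": "Toaster", "images": ["img1"]},
}


def batch_get(records, keys):
    """Hypothetical stand-in for a client batch-read: many keys, one call."""
    return [records[k] for k in keys if k in records]


def product_with_images(product_id):
    """Read the product first, then batch-read the images it refers to."""
    product = prod[product_id]
    return {
        "name": product["name"],
        "images": batch_get(prodimg, product["images"]),
    }


# Changing the shared logo touches exactly one record in prodimg.
prodimg["img1"]["uri"] = "https://cdn.example.com/logo-b.png"

kettle = product_with_images("prod1")
toaster = product_with_images("prod2")
```

Both products see the new logo URI even though neither product record was modified.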
I’m starting a new web project and have to decide what database to use. I know, the question is very long but please bear with me on this.
I am very familiar with relational databases and have used frameworks like hibernate to get my data from the DB into Objects. But I have no experience with noSQL DBs. I am aware of the concepts of Document, Key-Value, etc. types.
While doing my research, one question keeps popping up, and I don't know how someone would handle it in NoSQL DBs like MongoDB or any other document-type NoSQL DB where consistency takes top priority.
For example: let’s assume that we are creating a small shopping management system where customers can buy and sell stuff.
We have:
CUSTOMERs
ORDERs
PRODUCTs
A single CUSTOMER can have multiple ORDERs and an ORDER can have multiple PRODUCTs.
In a traditional RDBMS I would of course have 3 tables.
In the first version of our application, the front end for the customer should display his/her personal data, ORDERs and all the PRODUCTs he or she bought per order. Also which products are available for sale. So I guess in noSQL I would model the CUSTOMER class like this:
{
  "id": 993784,
  "firstname": "John",
  "lastname": "Doe",
  "orders": [
    {
      "id": 3234,
      "quantity": 4,
      "products": [
        {
          "id": 378234,
          "type": "TV",
          "resolution": "1920x1080",
          "screenSize": 37,
          "price": 999
        }
      ]
    }
  ],
  "products": [
    {
      "id": 7932,
      "type": "car",
      "sold": false,
      "horsepower": 90
    }
  ]
}
But later I want to extend my application to have 3 different UIs instead of only the first one:
The CUSTOMER Dashboard where a customer can view all his/her orders.
The PRODUCT Dashboard where a customer can add or remove products in his/her store.
The SOLD Dashboard where a customer can view all sold PRODUCTs ready for shipping.
One very important thing to consider (the reason why I even bother asking this question): I want to be flexible with the classes like PRODUCT because products can have different properties. For Example: A TV has screen size and resolution while a car has horsepower and other properties. And if a user adds a new product, he or she should be able to dynamically add those properties depending on what he/she knows about it.
Now to some practical use cases of two fictional users Jane and John:
Let's say Jane buys from John. Does that mean I have to create the PRODUCTs twice? Once as a child of Jane's ORDER and once more to stay in the "products" property of John?
Later Jane wants to view all products that are available from any user. Do I have to load every user and query the "products" property to generate a list of all products?
In version 2 of the application I want to enable John to view all outgoing orders (not orders he made, but orders from other users who bought stuff from him) instead of viewing all sold products. How would this be done in NoSQL? Would I now need to create an "outgoing" array of orders and duplicate them? (An outgoing order of Jane is an incoming order of John.)
Some of you may say that noSQL is not right for this use case but isn’t that very common? Especially when we do not know what the future brings? If it does not fit for this use case, what use case would it fit into? Only baby applications (I guess not)? Wasn’t noSQL designed for more complex and flexible data?
Thank you very much for your advice and opinions!
EDIT 1:
Because this question was put on hold for being imprecise:
I made a very clear and simple example. So my question is not about the use of NoSQL in general but about how to handle this specific example. How would an experienced NoSQL user handle this use case? How should this data be modeled? A recommendation to simply not use NoSQL at all for this use case is also a valid answer to me.
I simply want to know how to use a noSQL database but still be able to manage entities and avoid redundancy.
For example: Are MongoDB's DBRefs/Manual refs a good way to achieve this? Performance issues because of multiple queries? What else to think about? I guess these questions can probably be answered quite well.
There probably isn't the one right answer to your question. But I'll make a start.
While it is technically possible in NoSQL to store some business entity together with all entities that are transitively linked with it (like Customer, Order, Product), it isn't always clever to do so. The traditional reasons for separating entities, namely redundancies and the resulting update and delete anomalies, don't just go away because a different platform is used.
So if you stored the product description with every customer who buys or sells this product, you will get update anomalies. If you have to change the screen size from 37 to 35, you'll have to find all customer records containing this product, which can be quite cumbersome.
Also, building up such a deep nested structure favors one direction of evaluating those structures over all other directions. If you put all orders and products into the customer document, this is very fine for getting a comprehensive view for a customer: whatever she bought throughout her lifetime. But if you want to query your database by orders (which orders need to be fulfilled tonight?) or products (who ordered product 1234?) you'll have to load tons of data that are of no interest to this query.
Similar questions arise from storing all orders with a customer. Old orders will sometimes still be of interest, so they may not be deleted. But do you want to load lots of orders every time you load the customer?
This doesn't mean not to make use of the complex structuring made possible by a document store. As a rule of thumb, I would suggest: As long as the nested information belongs to the same business entity, put it into one document. If, e.g., the product description has some hierarchic structure, like nested sections consisting of text, pics, and videos, they may all go into one document. But entities with a totally different life cycle, like customers, orders, and suppliers, should be kept separate. Another indicator is references: A product will frequently be referenced as a whole, e.g. when it is ordered by a customer or ordered from a supplier. But the different parts of the product description may possibly never be referenced from the outside.
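A minimal sketch of that rule of thumb, using plain Python structures in place of three collections. Customers, orders and products are kept separate and refer to each other by id; the field names (`buyer_id`, `seller_id`, `items`) and helper functions are hypothetical. Copying the price into the order line is an extra assumption, made so that an order keeps the price at the time of purchase.

```python
customers = {"jane": {"name": "Jane"}, "john": {"name": "John"}}

# Each product exists once and refers to its seller by id.
products = {
    1234: {"type": "TV", "resolution": "1920x1080", "screenSize": 37,
           "seller_id": "john"},
}

# Orders refer to buyer and products by id instead of embedding them.
orders = [
    {"id": 1, "buyer_id": "jane",
     "items": [{"product_id": 1234, "quantity": 1, "price": 999}]},
]


def who_ordered(product_id):
    """Query by product without loading any customer documents."""
    return sorted({
        o["buyer_id"] for o in orders
        if any(i["product_id"] == product_id for i in o["items"])
    })


def incoming_orders(seller_id):
    """Orders that other users placed for this seller's products."""
    return [
        o["id"] for o in orders
        if any(products[i["product_id"]]["seller_id"] == seller_id
               for i in o["items"])
    ]


# Correcting the screen size is a single update, with no copies to hunt down.
products[1234]["screenSize"] = 35

buyers = who_ordered(1234)
johns_incoming = incoming_orders("john")
```

An outgoing order of Jane and an incoming order of John are the same document here, viewed from two directions.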
This rule of thumb wasn't completely precise, and it's not supposed to be. One person's business entity is another person's dumb attribute. Imagine the color of a car: For the car owner, it's just a piece of information describing a car. For the manufacturer, it's a business entity, having an availability, a price, one or more suppliers, a way of handling it, etc.
Your question also touches the aspect of dynamically adding attributes. This is often praised as one of the goodies of NoSQL, but it's no free lunch. Let's assume, as you mentioned, that the user may add attributes. That's technically possible, but how will these attributes be processed by the system? There won't be a specific view, nor specific business rules, for those attributes. So the best the system can do is offer some generic mechanism for displaying those attributes that were defined at runtime and never reflected in the program code.
This doesn't mean the feature is useless. Imagine your product description may be complex, as described above. You might build a generic mechanism to display (and edit) descriptions made up of sections, texts, images, etc., and afterwards the users may enter descriptions of unlimited width and depth. But in contrast, imagine your user will add a tiny delivery date attribute to the order. Unless the system knows specifically how to interpret this date, it will just be a dumb piece of information without any effect.
Now imagine not the user, but the developer adds new attributes. She has the opportunity to enhance the code at the same time, e.g. building some functionality around delivery dates. But this means that, although the database doesn't require it on its own, a new release of the software needs to be rolled out to make use of the new information.
The absence of a database schema even makes the programmer's task more complicated. When a relational table has a certain column, you may be sure that each of its records has this column. If you want to make sure that it has a meaningful value, make it not null, and you may be sure that each record contains a value of the correct data type. Nothing like that is guaranteed by schemaless databases. So, when reading a record, defensive programming is needed to find out which parts are present, and whether they have the expected content. The same holds for database maintenance via administrative tools. Adding an attribute and initializing it with a default value is a 2-liner in SQL, or a couple of mouse clicks in pgAdmin. For a schemaless database, you will write a short program of your own to achieve this.
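As a small illustration of both points, the sketch below reads a `delivery_date` attribute defensively and backfills a default value by hand. The record layout, the attribute names and both function names are assumptions; records are plain Python dictionaries rather than documents from a real database.

```python
from datetime import date


def read_delivery_date(order):
    """Return the order's delivery date, or None if it is missing or malformed."""
    value = order.get("delivery_date")
    if isinstance(value, date):
        return value
    if isinstance(value, str):
        try:
            return date.fromisoformat(value)
        except ValueError:
            return None
    # Any other type (number, list, ...) is treated as unusable.
    return None


def backfill_default(records, field_name, default):
    """Hand-written counterpart of adding a column with a default value in SQL."""
    changed = 0
    for record in records:
        if field_name not in record:
            record[field_name] = default
            changed += 1
    return changed


orders = [
    {"id": 1, "delivery_date": "2024-03-01"},
    {"id": 2},                                  # attribute missing
    {"id": 3, "delivery_date": "soon"},         # wrong content
    {"id": 4, "delivery_date": 20240301},       # wrong type
]

dates = [read_delivery_date(o) for o in orders]
updated = backfill_default(orders, "priority", "normal")
```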
This doesn't mean that I dislike NoSQL databases. But I think the "schemaless" characteristic is sometimes overestimated, and I wouldn't make it the main, or only, reason to employ such a database.
I'm more used to relational databases and am having a hard time thinking about how to design my database in MongoDB. I'm even more unclear when taking into account some of the special considerations of database design for Meteor.js, where I understand you often prefer separate collections over embedded documents/data in order to make better use of some of the benefits you get from collections.
Let's say I want to track students' progress in high school. They need to complete certain required classes each school year in order to progress to the next year (freshman, sophomore, junior, senior), and they can also complete some electives. I need to track when the students complete each requirement or elective. The requirements may change slightly from year to year, but I need to remember, for example, that Johnny completed all of the freshman requirements as they existed two years ago.
So I have:
Students
Requirements
Electives
Grades (frosh, etc.)
Years
Mostly, I'm trying to think about how to set up the requirements. In a relational DB, I'd have a table of requirements, with className, grade, and year, and a table of student_requirements that tracks the students as they complete each requirement. But I'm thinking that in MongoDB/Meteor.js I'd have a model for each grade/level that gets stored with a studentID and is initially instantiated with false values for each requirement, like:
{
  student: [studentID],
  class: 'freshman',
  year: 2014,
  requirements: {
    class1: false,
    class2: false
  }
}
and as the student completes a requirement, it updates like:
{
  student: [studentID],
  class: 'freshman',
  year: 2014,
  requirements: {
    class1: false,
    class2: [completionDateTime]
  }
}
So in this way, each student will collect four Requirements documents, which are somewhat dictated by their initial instantiation values. And instead of the actual requirements for each grade/year living in the database, they would essentially live in the code itself.
Some of the actions I would like to be able to support are marking off requirements across a set of students at one time, and showing a grid of users/requirements to see who needs what.
Does this sound reasonable? Or is there a better way to approach this? I'm pretty early in this application and am hoping to avoid painting myself into a corner. Any help or suggestion is appreciated. Thanks! :-)
Currently I'm thinking about my application data design too. I've read the examples in the MongoDB manual
look up MongoDB manual data model design - docs.mongodb.org/manual/core/data-model-design/
and here -> MongoDB manual one to one relationship - docs.mongodb.org/manual/tutorial/model-embedded-one-to-one-relationships-between-documents/
(sorry I can't post more than one link at the moment in an answer)
They say:
In general, use embedded data models when:
you have “contains” relationships between entities.
you have one-to-many relationships between entities. In these relationships the “many” or child documents always appear with or are viewed in the context of the “one” or parent documents.
The normalized approach uses a reference in one document to another document, just like in the Meteor.js book. There they create a web app that shows posts, and each post has a set of comments. They use two collections, posts and comments. When a comment is added, it is submitted together with the post_id.
So in your example you have a students collection. And each student has to fulfill requirements? And each student has their own requirements, like a post has its own comments?
Then I would handle it like they did in the book, with two collections. I think that should be the normalized approach, not the embedded one.
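Here is a minimal sketch of what the two-collection approach could look like for the students example, using plain Python lists in place of collections. The field names follow the question where possible (`className`, `grade`, `year`); the ids, class names and the `mark_completed` and `grid` helpers are hypothetical. Requirement definitions are stored as data and versioned by year, and each completion is its own small document that refers to a student and a requirement.

```python
from datetime import datetime

# Requirement definitions live in data, versioned by year, not in code.
requirements = [
    {"id": "r1", "className": "Algebra I", "grade": "freshman", "year": 2014},
    {"id": "r2", "className": "English 9", "grade": "freshman", "year": 2014},
]

# One document per completed requirement, like comments carrying a post_id.
completions = []


def mark_completed(student_ids, requirement_id, when):
    """Mark one requirement as completed for a whole set of students."""
    for sid in student_ids:
        completions.append(
            {"student": sid, "requirement": requirement_id, "completedAt": when}
        )


def grid(student_ids, grade, year):
    """Map each student to {className: completion time or None}."""
    wanted = [r for r in requirements
              if r["grade"] == grade and r["year"] == year]
    done = {(c["student"], c["requirement"]): c["completedAt"]
            for c in completions}
    return {
        sid: {r["className"]: done.get((sid, r["id"])) for r in wanted}
        for sid in student_ids
    }


mark_completed(["s1", "s2"], "r1", datetime(2014, 10, 1))
table = grid(["s1", "s2", "s3"], "freshman", 2014)
```

Because each requirement document carries its year, a completion recorded against the 2014 definition keeps pointing at that definition even if a later year's requirements differ.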
I'm a little confused myself, so maybe you can tell me, if my answer makes sense.
Maybe you can help me too? I'm trying to make an app that manages a flea market.
Users of the app create events.
The creator of the event invites users to be cashiers for that event.
Users create lists of stuff they want to sell. There is a maximum number of lists/sellers per event and a maximum number of positions on a list (25/50).
Cashiers type in the positions of those lists at the event, to track what is sold.
Event creators make billings for the sold stuff of each list, to hand out the money afterwards.
I'm confused how to set up the data design. I need Events and Lists. Do I use the normalized approach, or the embedded one?
Edit:
After reading percona.com/blog/2013/08/01/schema-design-in-mongodb-vs-schema-design-in-mysql/ I found the following advice:
If you read people information 99% of the time, having 2 separate collections can be a good solution: it avoids keeping in memory data is almost never used (passport information) and when you need to have all information for a given person, it may be acceptable to do the join in the application.
Same thing if you want to display the name of people on one screen and the passport information on another screen.
But if you want to display all information for a given person, storing everything in the same collection (with embedding or with a flat structure) is likely to be the best solution