MongoDB update paradigm - mongodb

Trying to get my hands dirty with MongoDB, coming from a relational database background.
I believe one of the main concepts of MongoDB is to keep as much data together as possible.
Imagine I create an application that have products and categories, and I store products this way:
Products collection
{
"_id": "30671", //main item ID
"department": "Shoes",
"category": "Shoes/Women/Pumps",
"brand": "Calvin Klein",
"thumbnail": "http://cdn.../pump.jpg",
"title": "Evening Platform Pumps",
"description": "Perfect for a casual night out or a formal event.",
"style": "Designer",
…
}
Categories collection
{
"_id": "12356",
"name": "Shoes"
…
}
If I update a category name, would I need to run a huge update command to update all products as well? Or in this scenario it would be better to actually store the category ids against the products instead of their names (like we would do on a relational database)?
Thanks

In general there is no "one-size-fits-all" approach in MongoDB like it is in relational database world. In a relational schema design, it is a no-brainer to put Category in a separate table. In fact, it is almost a requirement to do so during normalization process.
In MongoDB, schema design almost entirely depends on what you need and your use case more than any rule-of-thumb or any formulated requirements. Of course, there are pluses/minuses for each choice. For example:
If you find in your use case that Category doesn't change much (or at all), then you can safely put it inside the Products collection. However, should you need to rename a category, then yes you would need to update all the products belonging to that category.
If you find that your use case requires flexibility in changing category names, then you may want to put it into a separate collection and refer to it like in a relational design. However, returning a product may not be as performant since now you need two queries instead of one (or a lookup).
I noticed that you used "category": "Shoes/Women/Pumps" in your example. In MongoDB, you can put that into an array if your use case allows it, e.g. "category": ["Shoes", "Women", "Pumps"]. This may make the field easier to index (again, depending on your use case).
In short, there is no right or wrong in MongoDB schema design with one caveat: usually the wrong thing to do is trying to emulate a relational design in MongoDB, since it goes against the grain.
You also may find these links helpful:
6 Rules of Thumb for MongoDB Schema Design: Part 1, Part 2, Part 3
Data Models
Use Cases

Related

MongoDB denormalization vs one-off initial query

I have a blog post like application. Users and Projects are stored in two different collections, i.e., many-to-many. A user might have 2-10 projects. A project might have 5-100 users, normally around 10. Post is another collection with objects like:
{
"project_id": 123,
"user_id": 456,
"text": "some text",
"comments": [{Comment object}]
}
Currently I use a normalization approach. I fire an one-off query to retrieve all current user's project names with project usernames when the application first loading. Then store them within a global variable so that it's accessible everywhere in my application. The only one drawback of this approach is the initial loading time would be a bit longer depending on how many projects you have and how many users are in your projects. It doesn't feel like a big deal.
However, many articles have suggested use denormalization. I do know the pros and cons. Since project_name and user_name exist in too many fields and different collections. It could be a long running task to propagate changes if project name or user name has been modified. That said, this won't happen very frequently.
Just wanna know if my current approach makes sense. Or denormalization can also be a good practice in terms of MongoDB. Please discuss and advise. Thanks.

Practical usage of noSQL [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
Improve this question
I’m starting a new web project and have to decide what database to use. I know, the question is very long but please bear with me on this.
I am very familiar with relational databases and have used frameworks like hibernate to get my data from the DB into Objects. But I have no experience with noSQL DBs. I am aware of the concepts of Document, Key-Value, etc. types.
While I do my research one question pops out every time and I don’t know how someone would handle this in noSQL DBs like MongoDB or any other Document-Typed noSQL DB where consistency takes top priority.
For example: let’s assume that we are creating a small shopping management system where customers can buy and sell stuff.
We have:
CUSTOMERs
ORDERs
PRODUCTs
A single CUSTOMER can have multiple ORDERs and an ORDER can have multiple PRODUCTs.
In a traditional RDBMS I would of course have 3 tables.
In the first version of our application, the front end for the customer should display his/her personal data, ORDERs and all the PRODUCTs he or she bought per order. Also which products are available for sale. So I guess in noSQL I would model the CUSTOMER class like this:
{
"id": 993784,
"firstname": "John",
"lastname": "Doe",
"orders": [
{
"id": 3234,
"quantity": 4,
"products": [
{
"id:" 378234,
"type": "TV",
"resolution": "1920x1080",
"screenSize":37,
"price": 999
}
]
}
],
"products": [
{
"id:" 7932,
"type": "car",
"sold": false,
"horsepower": 90
}
]
}
But later I want to extend my application to have 3 different UIs instead of only the first one:
The CUSTOMER Dashboard where a customer can view all his/her orders.
The PRODUCT Dashboard where a customer can add or remove products in his/her store.
THE SOLD Dashboard where a customer can view all sold PRODUCTs ready for shipping.
One very important thing to consider (the reason why I even bother asking this question): I want to be flexible with the classes like PRODUCT because products can have different properties. For Example: A TV has screen size and resolution while a car has horsepower and other properties. And if a user adds a new product, he or she should be able to dynamically add those properties depending on what he/she knows about it.
Now to some practical use cases of two fictional users Jane and John:
Let's say, Jane buys from John. Does that mean i have to create the PRODUCTs two times? One time as a child of Jane's ORDER and another time to stay in the "products" property of John?
Later Jane wants to view all products that are available from any user. Do i have to load every user to query the "products" property to generate a list of all products?
In version 2 of the application i want to enable John to view all outgoing orders (not orders he made but orders from other users who bought stuff from him) instead of viewing all sold products. How would this be done in noSQL? Would i now need to create an "outgoing" array of orders and duplicate them? (an outgoing order of Jane is an incoming order of John)
Some of you may say that noSQL is not right for this use case but isn’t that very common? Especially when we do not know what the future brings? If it does not fit for this use case, what use case would it fit into? Only baby applications (I guess not)? Wasn’t noSQL designed for more complex and flexible data?
Thank you very much for your advises and opinions!
EDIT 1:
Because this question was put on hold because of the unprecise question:
I made a very clear and simple example. So my question is not general about the use of noSQL but how to handle this specific example. How would a experienced noSQL user handle this use case? How to model this data? A recommendation to simply not use noSQL at all for this use case is also a valid answer to me.
I simply want to know how to use a noSQL database but still be able to manage entities and avoid redundancy.
For example: Are MongoDB's DBRefs/Manual refs a good way to achieve this? Performance issues because of multiple queries? What else to think about? I guess these questions can probably be answered quite well.
There probably isn't the one right answer to your question. But I'll make a start.
While it is technically possible in NoSQL to store some business entity together with all entities that are transitively linked with it (like Customer, Order, Product), it is't always clever to do so. The traditional reasons for separating entities, namely redundancies and therefore update and delete anomalies, don't just go away because a different platform is used.
So if you stored the product description with every customer who buys or sells this product, you will get update anomalies. If you have to change the screen size from 37 to 35, you'll have to find all customer records containing this product, which can be quite cumbersome.
Also, building up such a deep nested structure favors one direction of evaluating those structures over all other directions. If you put all orders and products into the customer document, this is very fine for getting a comprehensive view for a customer: whatever she bought throughout her lifetime. But if you want to query your database by orders (which orders need to be fulfilled tonight?) or products (who ordered product 1234?) you'll have to load tons of data that are of no interest to this query.
Similar questions are due to storing all orders with a customer. Old orders will sometimes still be of interest, so they may not be deleted. But do you want to load lots of orders everytime you load the customer?
This doesn't mean not to make use of the complex structuring made possible by a document store. As a rule of thumb, I would suggest: As long as the nested information belongs to the same business entity, put it into one document. If, e.g., the product description has some hierarchic structure, like nested sections consisting of text, pics, and videos, they may all go into one document. But entities with a totally different life cycle, like customers, orders, and suppliers, should be kept separate. Another indicator is references: A product will frequently be referenced as a whole, e.g. when it is ordered by a customer or ordered from a supplier. But the different parts of the product description may possibly never be referenced from the outside.
This rule of thumb wasn't completely precise, and it's not supposed to be. One person's business entity is another person's dumb attribute. Imagine the color of a car: For the car owner, it's just a piece of information describing a car. For the manufacturer, it's a business entity, having an availability, a price, one or more suppliers, a way of handling it, etc.
Your question also touches the aspect of dynamically adding attributes. This is often praised as one of the goodies of NoSQL, but it's no free lunch. Let's assume, as you mentioned, that the user may add attributes. That's technically possible, but how will these attributes be processed by the system? There won't be a specific view, nor specific business rules, for those attributes. So the best the system can do is offer some generic mechanism for displaying those attributes that were defined at runtime and never reflected in the program code.
This doesn't mean the feature is useless. Imagine your product description may be complex, as described above. You might build a generic mechanism to display (and edit) descriptions made up of sections, texts, images, etc., and afterwards the users may enter descriptions of unlimited width and depth. But in contrast, imagine your user will add a tiny delivery date attribute to the order. Unless the system knows specifically how to interpret this date, it will just be a dumb piece of information without any effect.
Now imagine not the user, but the developer adds new attributes. She has the opportunity to enhance the code at the same time, e.g. building some functionality around delivery dates. But this means that, although the database doesn't require it by its own, a new release of the software needs to be rolled out to make use of the new information.
The absence of a database scheme even makes the programmer's task more complicated. When a relational table has a certain column, you may be sure that each of its records has this column. If you want to make sure that it has a meaningful value, make it not null, and you may be sure that each record contains a value of the correct data type. Nothing like that is guaranteed by schemaless databases. So, when reading a record, defensive programming is needed to find out which parts are present, and whether they have the expected content. The same holds for database maintenance via administrative tools. Adding an attribute and initializing it with a default value is a 2-liner in SQL, or a couple of mouse clicks in pgadmin. For a schemaless database, you will write a short program on your own to achieve this.
This doesn't mean that I dislike NoSQL databases. But I think the "schemaless" characteristic is sometimes overestimated, and I wouldn't make it the main, or only, reason to employ such a database.

How to deal with many users update over Document driven nosql database

Problem
Starting with nosql document database I figured out lots of new possibilities, however, I see some pitfalls, and I would like to know how can I deal with them.
Suppose I have a product, and this product can be sold in many regions. There is one responsible for each region (with access to CMS). Each responsible modifies the products accordingly regional laws and rules.
Since Join feature isn't supported as we know it on relational databases, the document should be design in a way it contains all the needed information to build our selection statements and selection result to avoid round trips to the database.
So my first though was to design a document that follows more or less this structure:
{
type : "product",
id : "product_id",
title : "title",
allowedAge : 12,
regions : {
'TX' : {
title : "overriden title",
allowedAge : 13
},
'FL' : {
title : "still another title"
}
}
}
But I have the impression that this approach will generate conflicts while updating the document. Suppose we have a lot of users updating lots of document through a CMS. When same document is updated, the last update overwrites the updates done before, even the users are able to modify just fragments of this document (in this case the responsible should be able to modify just the regional data).
How to deal with this situation?
One possible solution I think of would be partial document updates. Positive: reducing the data overwriting from different operations, Negative: lose the optimistic lock feature since locking if done over a document not a fragment of such.
Is there another approach for the problem?
In this case you can use 3 solutions:
Leave current document structure and always check CAS value on update operations. If CAS doesn't match - call store function again. (But as you say if you have a lot of users it can be very slow).
Separate doc in several parts that could be updated independently, and then on app-side combine them together. This will result in increasing view calls (one for get main doc, another call to get i.e. regions, etc.). If you have many "parts" it will also reduce performance.
See this doc. It's about simulating joins in couchbase. There is also good example written by #Tug Grall.
If you're not bounded to using Couchbase (not clear from your question if it's general or specific to it) - look also into MongoDB. It supports partial updates on documents and also other atomic operations (like increments and array operations), so it might suite your use case better (checkout possible update operations on mongo - http://docs.mongodb.org/manual/core/update/ )

MongoDB schema design -- Choose two collection approach or embedded document

I am trying to design a simple application where in I have two entities Notebook and Note. So Notebook can contain multiple Notes.In RDBMS I could have two tables and have One to Many
relationship between them. I am not sure in MongoDB whether I should not take a two collection
approach or I should embed notes in Notebook collection. What would you suggest?
That seems like a perfectly reasonable situation to use a single collection called Notebook, and each Notebook document contains embedded Notes. You can easily index on embedded documents.
If a Notebook document has a 'notes' key, and value is a list of notes:
{
"notes": [
{"created_on": Date(1343592000000), text: "A note."}
]
}
# create index
db.notebook.ensureIndex({"notes.created_on" : -1})
My opinion is to try and embed as much as possible, and then choose to reference another collection via an id as a second option when the reference needs to be to a more general set of data that is shared and might change. For instance, a collection of category documents which many other collections reference. And the category can be updated over time. But in your case, a note should always belong to a note book
You should ask yourself what kind of queries you will need to run on it. The "by default" approach is to embed them, but there are cases (that will depend on how you plan on using them) where a more relational approach is applicable. So the simple answer is "probably, but you should probably think about it" :)

How do you track record relations in NoSQL?

I am trying to figure out the equivalent of foreign keys and indexes in NoSQL KVP or Document databases. Since there are no pivotal tables (to add keys marking a relation between two objects) I am really stumped as to how you would be able to retrieve data in a way that would be useful for normal web pages.
Say I have a user, and this user leaves many comments all over the site. The only way I can think of to keep track of that users comments is to
Embed them in the user object (which seems quite useless)
Create and maintain a user_id:comments value that contains a list of each comment's key [comment:34, comment:197, etc...] so that that I can fetch them as needed.
However, taking the second example you will soon hit a brick wall when you use it for tracking other things like a key called "active_comments" which might contain 30 million ids in it making it cost a TON to query each page just to know some recent active comments. It also would be very prone to race-conditions as many pages might try to update it at the same time.
How can I track relations like the following in a NoSQL database?
All of a user's comments
All active comments
All posts tagged with [keyword]
All students in a club - or all clubs a student is in
Or am I thinking about this incorrectly?
All the answers for how to store many-to-many associations in the "NoSQL way" reduce to the same thing: storing data redundantly.
In NoSQL, you don't design your database based on the relationships between data entities. You design your database based on the queries you will run against it. Use the same criteria you would use to denormalize a relational database: if it's more important for data to have cohesion (think of values in a comma-separated list instead of a normalized table), then do it that way.
But this inevitably optimizes for one type of query (e.g. comments by any user for a given article) at the expense of other types of queries (comments for any article by a given user). If your application has the need for both types of queries to be equally optimized, you should not denormalize. And likewise, you should not use a NoSQL solution if you need to use the data in a relational way.
There is a risk with denormalization and redundancy that redundant sets of data will get out of sync with one another. This is called an anomaly. When you use a normalized relational database, the RDBMS can prevent anomalies. In a denormalized database or in NoSQL, it becomes your responsibility to write application code to prevent anomalies.
One might think that it'd be great for a NoSQL database to do the hard work of preventing anomalies for you. There is a paradigm that can do this -- the relational paradigm.
The couchDB approach suggest to emit proper classes of stuff in map phase and summarize it in reduce.. So you could map all comments and emit 1 for the given user and later print out only ones. It would require however lots of disk storage to build persistent views of all trackable data in couchDB. btw they have also this wiki page about relationships: http://wiki.apache.org/couchdb/EntityRelationship.
Riak on the other hand has tool to build relations. It is link. You can input address of a linked (here comment) document to the 'root' document (here user document). It has one trick. If it is distributed it may be modified at one time in many locations. It will cause conflicts and as a result huge vector clock tree :/ ..not so bad, not so good.
Riak has also yet another 'mechanism'. It has 2-layer key name space, so called bucket and key. So, for student example, If we have club A, B and C and student StudentX, StudentY you could maintain following convention:
{ Key = {ClubA, StudentX}, Value = true },
{ Key = {ClubB, StudentX}, Value = true },
{ Key = {ClubA, StudentY}, Value = true }
and to read relation just list keys in given buckets. Whats wrong with that? It is damn slow. Listing buckets was never priority for riak. It is getting better and better tho. btw. you do not waste memory because this example {true} can be linked to single full profile of StudentX or Y (here conflicts are not possible).
As you see it NoSQL != NoSQL. You need to look at specific implementation and test it for yourself.
Mentioned before Column stores look like good fit for relations.. but it all depends on your A and C and P needs;) If you do not need A and you have less than Peta bytes just leave it, go ahead with MySql or Postgres.
good luck
user:userid:comments is a reasonable approach - think of it as the equivalent of a column index in SQL, with the added requirement that you cannot query on unindexed columns.
This is where you need to think about your requirements. A list with 30 million items is not unreasonable because it is slow, but because it is impractical to ever do anything with it. If your real requirement is to display some recent comments you are better off keeping a very short list that gets updated whenever a comment is added - remember that NoSQL has no normalization requirement. Race conditions are an issue with lists in a basic key value store but generally either your platform supports lists properly, you can do something with locks, or you don't actually care about failed updates.
Same as for user comments - create an index keyword:posts
More of the same - probably a list of clubs as a property of student and an index on that field to get all members of a club
You have
"user": {
"userid": "unique value",
"category": "student",
"metainfo": "yada yada yada",
"clubs": ["archery", "kendo"]
}
"comments": {
"commentid": "unique value",
"pageid": "unique value",
"post-time": "ISO Date",
"userid": "OP id -> THIS IS IMPORTANT"
}
"page": {
"pageid": "unique value",
"post-time": "ISO Date",
"op-id": "user id",
"tag": ["abc", "zxcv", "qwer"]
}
Well in a relational database the normal thing to do would be in a one-to-many relation is to normalize the data. That is the same thing you would do in a NoSQL database as well. Simply index the fields which you will be fetching the information with.
For example, the important indexes for you are
Comment.UserID
Comment.PageID
Comment.PostTime
Page.Tag[]
If you are using NosDB (A .NET based NoSQL Database with SQL support) your queries will be like
SELECT * FROM Comments WHERE userid = ‘That user’;
SELECT * FROM Comments WHERE pageid = ‘That user’;
SELECT * FROM Comments WHERE post-time > DateTime('2016, 1, 1');
SELECT * FROM Page WHERE tag = 'kendo'
Check all the supported query types from their SQL cheat sheet or documentation.
Although, it is best to use RDBMS in such cases instead of NoSQL, yet one possible solution is to maintain additional nodes or collections to manage mapping and indexes. It may have additional cost in form of extra collections/nodes and processing, but it will give an solution easy to maintain and avoid data redundancy.