How can denormalization be an attribute of a NoSQL DB?

When NoSQL DBs are compared with traditional RDBMSs, many articles say that in a NoSQL DB all related data is kept together, so joins are avoided and retrieving data is faster.
In short, the data is denormalized. Denormalization has downsides as well, e.g. redundancy, extra space, and the need to update data in multiple places.
But irrespective of the pros and cons of denormalization, it is a DB design attribute. How can it be attributed to a particular DB type?
If in a given case it is OK to denormalize the data, the same can be achieved in an RDBMS as well.
So why is denormalization discussed as an attribute of NoSQL DBs?

You seem to be reading hype, instead of database design articles. You can denormalize any database. Yes, NoSQL is for cases where denormalized data is a good thing, for instance, in storing documents, where subdocuments are used instead of joins to another table. This works best when the subdocuments are not duplicated. Of course, if they are duplicated, then you have the usual problems of denormalized data.
Example: Person uses Car. In a relational database, you would have a Persons table and a Cars table and a junction table, perhaps "CarsUsedByPerson". In a NoSQL system, you might have a "car" document embedded within a "person" document.
Of course, if two people use the same car, then you have the same data in multiple places, and you'll need to update it in all such places, or it will be inconsistent.
NoSQL is for cases where you need the performance more than you need the consistency.
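The Person/Car example can be sketched in a few lines of Python. This is only an illustration with plain dictionaries standing in for documents; the names `people`, `repaint`, and the car fields are made up for the sketch and do not come from any particular database API.

```python
import copy

# Hypothetical car data, embedded (copied) into each person document.
car = {"plate": "ABC-123", "model": "Civic", "color": "red"}
people = {
    "alice": {"name": "Alice", "car": copy.deepcopy(car)},
    "bob": {"name": "Bob", "car": copy.deepcopy(car)},
}


def repaint(people, plate, color):
    """Update every embedded copy of the car with the given plate."""
    updated = 0
    for person in people.values():
        if person["car"]["plate"] == plate:
            person["car"]["color"] = color
            updated += 1
    return updated


# Updating only one copy leaves the two documents inconsistent.
people["alice"]["car"]["color"] = "blue"
inconsistent = people["alice"]["car"] != people["bob"]["car"]

# Restoring consistency means touching every document that embeds the car.
touched = repaint(people, "ABC-123", "blue")
consistent = people["alice"]["car"] == people["bob"]["car"]
```

With a junction table the color would be stored once; here the application has to find and update each copy itself.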

Seconding John Saunders that you can denormalize data in an RDBMS as well - denormalization is an attribute of most NoSQL databases ("most" meaning "excluding graph databases") because in many cases you MUST denormalize in order to get decent performance.
Continuing with his example, let's say that I've got a Person record, which has a foreign key to a Car record (one car per person in this example to simplify matters), which has a foreign key to a Manufacturer record. For a given person I want the record for that person, for their car, and for their car's manufacturer.
In an RDBMS I can normalize this data and retrieve it all in one query using a join, or I can denormalize this data - the denormalized read is going to be a bit cheaper than the normalized read because joins aren't free, but in this case the difference in read performance probably won't be significant.
My NoSQL database probably doesn't support joins so if I normalize this data then I'll have to do three separate lookups for it, e.g. using a key-value database I'd first retrieve the Person which contains a Car key, then I'd retrieve the Car which contains a Manufacturer key, then I'd retrieve the Manufacturer; if this data were denormalized then I'd only need one lookup, so the performance improvement would be significant. In the rare case that the NoSQL database does support joins then it is almost certainly location agnostic, so the Person, Car, and Manufacturer records might be on different servers or even in different data centers making for a very expensive join.
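A rough sketch of the lookup counts described above, assuming a toy in-memory key-value store. `CountingStore`, the key names, and the sample records are hypothetical; each `get` stands in for one round trip to the database.

```python
class CountingStore:
    """Toy key-value store that counts lookups (a stand-in for round trips)."""

    def __init__(self, data):
        self._data = data
        self.lookups = 0

    def get(self, key):
        self.lookups += 1
        return self._data[key]


# Normalized layout: each record only holds the key of the next one.
normalized = CountingStore({
    "person:1": {"name": "Ann", "car_key": "car:7"},
    "car:7": {"model": "Corolla", "manufacturer_key": "mfr:3"},
    "mfr:3": {"name": "Toyota"},
})

# Denormalized layout: car and manufacturer are embedded in the person.
denormalized = CountingStore({
    "person:1": {
        "name": "Ann",
        "car": {"model": "Corolla", "manufacturer": {"name": "Toyota"}},
    },
})


def load_normalized(store, person_key):
    person = store.get(person_key)
    car = store.get(person["car_key"])
    manufacturer = store.get(car["manufacturer_key"])
    return person["name"], car["model"], manufacturer["name"]


def load_denormalized(store, person_key):
    person = store.get(person_key)
    car = person["car"]
    return person["name"], car["model"], car["manufacturer"]["name"]


result_normalized = load_normalized(normalized, "person:1")
result_denormalized = load_denormalized(denormalized, "person:1")
```

Both functions return the same data, but the normalized layout needs three dependent lookups and the denormalized one needs a single lookup.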
So an overly simplistic breakdown of your options is:
Traditional RDBMS, good with normalized data but difficult to scale out
NoSQL database, relatively easy to scale out but a bit crap with normalized data
Distributed OLAP database (e.g. Aster, Greenplum), relatively easy to scale out and good with normalized data but very expensive

Related

Single big collection for all products vs Separate collections for each Product category

I'm new to NoSQL and I'm trying to figure out the best way to model my database. I'll be using ArangoDB in the project but I think this question also stands if using MongoDB.
The database will store 12 categories of products. Each category is expected to hold hundreds or thousands of products. Products will also be added / removed constantly.
There will be a number of common fields across all products, but each category will also have unique fields / different restrictions to data.
Keep in mind that there are instances where I'd need to query all the categories at the same time, for example to search a product across all categories, and other instances where I'll only need to query one category.
Should I create one single collection "Product" and use a field to indicate the category, or create a separate collection for each category?
I've read many questions related to this idea (1 collection vs many) but I haven't been able to reach a conclusion, other than "it depends".
So my question is: in this specific use case, which option would be better in terms of performance and speed: multiple collections, or a single collection plus sharding?
Any help would be appreciated.
As you mentioned, you need to play with your data and use case. That will give you a better picture.
Some decisions you need to make:
Decide the number of documents you will have in the near future. If you expect 1m documents in a year, then test with at least 3m documents.
Decide the number of indices required.
Decide the number of writes, reads per second.
Decide the size of documents per category.
Decide the query pattern.
Some inputs based on the requirements
If you have many writes and many indices, a single monolithic collection will be slower, as multiple indices need to be updated.
As you have different set of fields per category, you could try with multiple collections.
There is $unionWith to combine data from multiple collections, but do check its performance; it depends entirely on the decisions above. Note this open issue also.
If you decide to go with a monolithic collection, defer the sharding. Implement it once you find that queries are getting slow.
If you have many writes to the same document, the writes will be executed sequentially, which will slow down your reads as well.
Think about reclaiming disk space when a lot of data is deleted from the collections. Multiple collections do well here.
The point that pushes me towards suggesting a monolithic collection is your requirement to query all the categories at the same time. You may need to add more categories, and combining all of them in a single response would not be good for performance.
As you don't really have a join use case as in an RDBMS, you can go with a single monolithic collection from a modeling point of view. I doubt you would even have a join key.
If any of my points are incorrect, please let me know.
To SQL or to NoSQL?
I think that before you implement this in NoSQL, you should ask yourself why you are doing that. I quite like NoSQL but some data is definitely a better fit to that model than others.
The data you are describing is a classic case for a relational SQL DB. That's fine if it's a hobby project and you want to try NoSQL, but if this is for a production environment or client, you are likely making the situation more difficult for them.
Relational or non-relational?
You mention common fields across all products. If you wish to update these fields and have those updates reflected in all products, then you have relational data.
Background
It may be worth reading Sarah Mei's 2013 article about this. Skip to the section "How MongoDB Stores Data" and read from there. Warning: the article is called "Why You Should Never Use MongoDB" and is (perhaps intentionally) somewhat biased against Mongo, so it's important to read it through the correct lens. The message you should take from the article is that MongoDB is not a good fit for every type of data.
Two strategies for handling relational data in Mongo:
every time you update one of these common fields, update every product's document with the new common field data. This is generally only ok if you have few updates or few documents, but not both.
use references and do joins.
In Mongo, joins typically happen code-side (multiple db calls)
In Arango (and in other graph dbs, as well as some key-value stores), the joins happen db-side (single db call)
Decisions
These are important factors to consider when deciding which DB to use and how to model your data
I've used MongoDB, ArangoDB and Neo4j.
Mongo definitely has the best tooling and it's easy to find help, but I don't believe it's a good fit in this case
Arango is quite pleasant to work with, but doesn't yet have the adoption that it deserves
I wouldn't recommend Neo4j to anyone looking for a NoSQL solution, as its nodes and relations only support flat properties (no nesting, so not real documents)
It may also be worth considering MariaDB or Postgres

Magento: why using EAV instead of MongoDB?

I am learning Magento, and I just read about EAV. If I understood it correctly, it is like a simulation of a NoSQL database on top of an SQL database.
So this way, for example, a product like a pencil can have different attributes in its description than a computer, or a ball.
But in a book I read things like this about the system:
All this flexibility and power is not free, and there is a price to pay; implementing the EAV model results in having our entity data distributed on a large number of tables, for example, just the Product Model is distributed on around 40 different tables.
[...]
Another major downside of EAV is the loss of performance when retrieving large collections of EAV objects and an increase on the database query complexity. Since the data is more fragmented (stored in more tables), selecting a single record involves several joins.
I'm completely new, so maybe I'm missing something, but my question is:
Why use this complicated approach instead of a document-based database such as MongoDB?

What does denormalizing mean?

While reading the article http://blog.mongodb.org/post/88473035333/6-rules-of-thumb-for-mongodb-schema-design-part-3 chapter: "Rules of Thumb: Your Guide Through the Rainbow" I came across the words "embedding" and "denormalizing".
One: favor embedding unless there is a compelling reason not to
Five: Consider the write/read ratio when denormalizing. A field that will mostly be read and only seldom updated is a good candidate for denormalization: if you denormalize a field that is updated frequently then the extra work of finding and updating all the instances is likely to overwhelm the savings that you get from denormalizing.
I know embedding means nesting documents instead of writing separate tables/collections.
But I have no clue what denormalizing means.
Denormalization is the opposite of normalization, which is a good practice for designing relational databases like MySQL: you split one logical dataset into separate tables to reduce redundancy.
Because MongoDB has traditionally not supported joins between collections (newer versions offer $lookup in the aggregation pipeline), you prefer to duplicate an attribute into two collections if the attribute is read often and updated rarely.
E.g.: You want to save data about a person, including the person's gender:
In a SQL database you would create a table person and a table gender, store the gender's ID as a foreign key in the person table, and perform a join to get the full information. This way the values "male" and "female" each exist only once.
In MongoDB you would save the gender in every Person, so the values "male" and "female" can exist multiple times, but the lookup is faster because the data is tightly coupled in one object of a collection.
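The two layouts from the gender example can be compared side by side. The sketch below uses Python's built-in sqlite3 module for the normalized version and plain dictionaries as a stand-in for MongoDB documents; the table names, column names, and sample people are assumptions made for the illustration.

```python
import sqlite3

# Normalized: each gender label is stored once and referenced by ID.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gender (id INTEGER PRIMARY KEY, label TEXT NOT NULL);
    CREATE TABLE person (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        gender_id INTEGER NOT NULL REFERENCES gender(id)
    );
    INSERT INTO gender VALUES (1, 'male'), (2, 'female');
    INSERT INTO person VALUES (1, 'Tom', 1), (2, 'Ann', 2), (3, 'Joe', 1);
""")
joined = conn.execute("""
    SELECT person.name, gender.label
    FROM person JOIN gender ON gender.id = person.gender_id
    ORDER BY person.id
""").fetchall()

# Denormalized: the label is repeated inside every person document.
people = [
    {"name": "Tom", "gender": "male"},
    {"name": "Ann", "gender": "female"},
    {"name": "Joe", "gender": "male"},
]
embedded = [(doc["name"], doc["gender"]) for doc in people]

# How many times the string 'male' is stored in each layout.
male_in_sql = conn.execute(
    "SELECT COUNT(*) FROM gender WHERE label = 'male'"
).fetchone()[0]
male_in_docs = sum(1 for doc in people if doc["gender"] == "male")
```

Both layouts yield the same name/gender pairs; the difference is that the relational version needs a join to get them, while the document version stores "male" once per matching person.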

Demoralization in NoSQL

What exactly is demoralization in NoSQL databases?
I have read it means modelling different object types as different documents. My first guess was it means Aggregation without storing related data, i.e storing all rows of an entity in a single document with related data being referred by different documents for each row.
But I'm not sure if this is correct or not?
An example would be helpful.
Thanks in advance
I do mean demoralization and not denormalization. I came across this term in the following links:
1. Couchbase documentation
2. Blog on Nosql
In the context of NoSQL (and databases in general), demoralization is synonymous with denormalization. You can find mixed usage of demoralization and denormalization in many documents, or mentions of demoralization being the opposite of normalization (so again, the same as denormalization):
What Is Meant By Denormalization In SQL?
Database Denormalization
What is demoralization?
Normalization & Demoralization
Designing databases - OLTP and OLAP
There is even a reference which mentions that some or many spell checkers suggest "demoralization" instead of "denormalization". This could explain why some people use demoralization: The effect of denormalization
NoSQL is a very, very wide field. It covers a lot of entirely different databases systems with entirely different concepts of how data should be structured.
The dogma of database normalization applies mostly to classic relational databases. The further a NoSQL database is from the relational philosophy, the more you have to question this dogma.
The philosophy of normalization assumes that database JOINs are cheap. So any data which can be split over multiple tables to remove redundancies should be split. But that doesn't apply to all NoSQL databases. Some of them don't support JOIN operations, so getting data stored in many different database entries can be a very expensive operation which either requires multiple consecutive queries to the database or expensive database-sided code execution. When you use one of those databases, you should store your data in a way that every performance-critical use-case can be fulfilled by looking up as few entries as possible, even when this means that you will have redundant data.
Those non-relational NoSQL databases which don't support JOINs frequently support arrays in database entries instead. These are usually the preferred way to model 1:n relations. So when 1 person has n telephone numbers, you wouldn't store the telephone numbers in a separate table/document/collection/whateveryoucallit, you would store them in an array in the person entry. There would usually be no reason to treat telephone numbers as self-sustained entities, were it not for the inability of SQL to work properly with multiple values in a single field.
Denormalization in a NoSQL world means the same as in an RDBMS world: duplication of data for read performance.
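As a small illustration of the telephone-number case, the sketch below contrasts the two shapes using plain Python data. The field names and numbers are invented for the example and are not tied to any specific database.

```python
# Document shape: the 1:n relation is an array inside the person entry.
person_doc = {"_id": 1, "name": "Ann", "phones": ["555-0100", "555-0101"]}

# Relational shape: phone numbers live in their own "table" of rows,
# linked back to the person by a foreign key (the first tuple element).
person_rows = [(1, "Ann")]
phone_rows = [(1, "555-0100"), (1, "555-0101"), (2, "555-0199")]


def phones_from_rows(person_id):
    """Emulate the join: collect all phone rows for one person."""
    return [number for owner_id, number in phone_rows if owner_id == person_id]


from_document = person_doc["phones"]   # one entry read, no join
from_tables = phones_from_rows(1)      # second "table" has to be consulted
```

Reading the document gives the numbers directly, while the relational shape has to combine two sets of rows to produce the same list.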

Am I missing something about Document Databases?

I've been looking at the rise of the NoSQL movement and the accompanying rise in popularity of document databases like MongoDB, RavenDB, and others. While there are quite a few things about these that I like, I feel like I'm not understanding something important.
Let's say that you are implementing a store application, and you want to store in the database products, all of which have a single, unique category. In Relational Databases, this would be accomplished by having two tables, a product and a category table, and the product table would have a field (called perhaps "category_id") which would reference the row in the category table holding the correct category entry. This has several benefits, including non-repetition of data.
It also means that if you misspelled the category name, for example, you could update the category table and then it's fixed, since that's the only place that value exists.
In document databases, though, this is not how it works. You completely denormalize, meaning that in the "products" document you would have a value holding the actual category string, leading to lots of repetition of data, and errors are much more difficult to correct. Thinking about this more, doesn't it also mean that running queries like "give me all products with this category" can lead to results that do not have integrity?
Of course the way around this is to re-implement the whole "category_id" thing in the document database, but when I get to that point in my thinking, I realize I should just stay with relational databases instead of re-implementing them.
This leads me to believe I'm missing some key point about document databases that leads me down this incorrect path. So I wanted to put it to Stack Overflow: what am I missing?
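The concern about correcting a misspelled category can be made concrete with a short sketch. It uses Python's sqlite3 module for the relational side and a list of dictionaries as a stand-in for product documents; the table layout, product names, and the misspelling are all made up for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE product (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        category_id INTEGER NOT NULL REFERENCES category(id)
    );
    INSERT INTO category VALUES (1, 'Staionery');  -- misspelled on purpose
    INSERT INTO product VALUES (1, 'Pencil', 1), (2, 'Eraser', 1), (3, 'Ruler', 1);
""")

# Normalized: the name lives in one row, so one UPDATE fixes every product.
sql_rows_changed = conn.execute(
    "UPDATE category SET name = 'Stationery' WHERE id = 1"
).rowcount
names_after_fix = [row[0] for row in conn.execute("""
    SELECT category.name
    FROM product JOIN category ON category.id = product.category_id
""")]

# Denormalized: every product document carries its own copy of the string.
products = [
    {"name": "Pencil", "category": "Staionery"},
    {"name": "Eraser", "category": "Staionery"},
    {"name": "Ruler", "category": "Staionery"},
]
docs_changed = 0
for doc in products:
    if doc["category"] == "Staionery":
        doc["category"] = "Stationery"
        docs_changed += 1
```

The relational fix changes one row; the document fix has to rewrite as many documents as there are products in the category.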
You completely denormalize, meaning in the "products" document, you would actually have a value holding the actual category string, leading to lots of repetition of data [...]
True, denormalizing means storing additional data. It also means fewer collections (tables in SQL), and thus fewer relations between pieces of data. Each single document can contain the information that would otherwise come from multiple SQL tables.
Now, if your database is distributed across multiple servers, it's more efficient to query a single server instead of multiple servers. With the denormalized structure of document databases, it's much more likely that you only need to query a single server to get all the data you need. With a SQL database, chances are that your related data is spread across multiple servers, making queries very inefficient.
[...] and errors are much more difficult to correct.
Also true. Most NoSQL solutions don't guarantee things such as referential integrity, which are common in SQL databases. As a result, your application is responsible for maintaining relations between data. However, as the number of relations in a document database is very small, it's not as hard as it may sound.
One of the advantages of a document database is that it is schema-less. You're completely free to define the contents of a document at all times; you're not tied to a predefined set of tables and columns as you are with a SQL database.
Real-world example
If you're building a CMS on top of a SQL database, you'll either have a separate table for each CMS content type, or a single table with generic columns in which you store all types of content. With separate tables, you'll have a lot of tables. Just think of all the join tables you'll need for things like tags and comments for each content type. With a single generic table, your application is responsible for correctly managing all of the data. Also, the raw data in your database is hard to update and quite meaningless outside of your CMS application.
With a document database, you can store each type of CMS content in a single collection, while maintaining a strongly defined structure within each document. You could also store all tags and comments within the document, making data retrieval very efficient. This efficiency and flexibility come at a price: your application is more responsible for managing the integrity of the data. On the other hand, the price of scaling out with a document database is much lower than with a SQL database.
Advice
As you can see, both SQL and NoSQL solutions have advantages and disadvantages. As David already pointed out, each type has its uses. I recommend that you analyze your requirements and create two data models, one for a SQL solution and one for a document database. Then choose the solution that fits best, keeping scalability in mind.
I'd say that the number one thing you're overlooking (at least based on the content of the post) is that document databases are not meant to replace relational databases. The example you give does, in fact, work really well in a relational database. It should probably stay there. Document databases are just another tool to accomplish tasks in another way, they're not suited for every task.
Document databases were made to address the problem that (looking at it the other way around), relational databases aren't the best way to solve every problem. Both designs have their use, neither is inherently better than the other.
Take a look at the Use Cases on the MongoDB website: http://www.mongodb.org/display/DOCS/Use+Cases
A document db gives a feeling of freedom when you start. You no longer have to write create table and alter table scripts. You simply embed details in the master 'records'.
But after a while you realize that you are locked in a different way. It becomes less easy to combine or aggregate the data in a way that you didn't think was needed when you stored the data. Data mining/business intelligence (searching for the unknown) becomes harder.
That means that it is also harder to check if your app has stored the data in the db in a correct way.
For instance, say you have two collections with approximately 10000 'records' each. Now you want to know which ids are present in 'table' A but not in 'table' B.
Trivial with SQL, a lot harder with MongoDB.
But I like MongoDB !!
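The "ids in A but not in B" question can be sketched as follows. The SQL side uses Python's sqlite3 module with a single EXCEPT query; for the document side, one option (assumed here, not the only one) is to fetch the ids of both collections and diff them in application code. The table names `a` and `b` and the sample ids are invented for the example.

```python
import sqlite3

# SQL: a single set-difference query answers the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY);
    CREATE TABLE b (id INTEGER PRIMARY KEY);
    INSERT INTO a VALUES (1), (2), (3), (4);
    INSERT INTO b VALUES (2), (4), (5);
""")
sql_missing = [
    row[0]
    for row in conn.execute("SELECT id FROM a EXCEPT SELECT id FROM b ORDER BY id")
]

# Document store stand-in: two lists of documents fetched by the application.
collection_a = [{"_id": 1}, {"_id": 2}, {"_id": 3}, {"_id": 4}]
collection_b = [{"_id": 2}, {"_id": 4}, {"_id": 5}]

# Without a server-side join, pull the ids and compute the difference in code.
ids_b = {doc["_id"] for doc in collection_b}
app_missing = sorted(doc["_id"] for doc in collection_a if doc["_id"] not in ids_b)
```

Both approaches give the same answer; the difference is where the work happens and how much data has to travel to the application.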
OrientDB, for example, supports schema-less, schema-full, or mixed mode. In some contexts you need constraints, validation, etc., but you also need the flexibility to add fields without touching the schema. This is the mixed-schema mode.
Example:
{
'#rid': 10:3,
'#class': 'Customer',
'#ver': 3,
'name': 'Jay',
'surname': 'Miner',
'invented': [ 'Amiga' ]
}
In this example the fields "name" and "surname" are mandatory (they are defined in the schema), but the field "invented" has been created only for this document. Your app does not need to know about it, but you can still execute queries against it:
SELECT FROM Customer WHERE invented IS NOT NULL
It will return only the documents with the field "invented".
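The idea of a mixed schema can be imitated in plain Python to show the behaviour. This is not OrientDB's API: `REQUIRED_FIELDS`, `validate`, and the second customer are hypothetical names and data used only to illustrate "mandatory fields plus free-form extras".

```python
# Hypothetical schema: these fields are mandatory, anything else is optional.
REQUIRED_FIELDS = {"name": str, "surname": str}


def validate(doc):
    """Reject documents that miss a mandatory field; allow any extra fields."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in doc:
            raise ValueError(f"missing mandatory field: {field}")
        if not isinstance(doc[field], expected_type):
            raise TypeError(f"field {field} must be {expected_type.__name__}")
    return doc


customers = [
    validate({"name": "Jay", "surname": "Miner", "invented": ["Amiga"]}),
    validate({"name": "Ada", "surname": "Lovelace"}),
]

# Rough equivalent of: SELECT FROM Customer WHERE invented IS NOT NULL
with_invented = [doc for doc in customers if doc.get("invented") is not None]

# A document without a mandatory field is rejected.
try:
    validate({"name": "Nobody"})
    rejected = False
except ValueError:
    rejected = True
```

The extra field is accepted without any schema change, and the filter returns only the documents that actually carry it.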