NOSQL not appropriate for data which is naturally joined? - mongodb

Just read a lecture primarily about MongoDB which states that NOSQL is not appropriate for data which is naturally joined? Can't really understand why this is could someone please explain?

MongoDB does not support JOINs. The reason is that MongoDB is build for clustering, which means that the data is distributed over multiple independent servers. When the data required for a JOIN is distributed over multiple machines on a network, it gets hard to implement it in a performant way. So the MongoDB developers decided to do without joins so they can prioritize scalability.
For that reason it is usually better to model 1:n relations by embedding the sub-documents in the parent document. This works well to model compositions where the child-documents are inseparable from the parent (one invoice consists of multiple positions) but not so much for aggregations where the child-documents can switch parents or even exist independent from the parent-document (one department has multiple employees). And it doesn't work at all for n:m relations (one usergroup has multiple members, and each member can be in multiple usergroups).
When you have such a situation where embedding isn't reasonably possible, you need to mimic joins on the application layer, which is slow because each query requires multiple network-roundtrips.
There are, however, NoSQL databases which work much better with handling relations between data. One subset of NoSQL are graph databases like Neo4j. These handle complex relations between entities even better than relational databases in many cases.

Related

Single big collection for all products vs Separate collections for each Product category

I'm new to NoSQL and I'm trying to figure out the best way to model my database. I'll be using ArangoDB in the project but I think this question also stands if using MongoDB.
The database will store 12 categories of products. Each category is expected to hold hundreds or thousands of products. Products will also be added / removed constantly.
There will be a number of common fields across all products, but each category will also have unique fields / different restrictions to data.
Keep in mind that there are instances where I'd need to query all the categories at the same time, for example to search a product across all categories, and other instances where I'll only need to query one category.
Should I create one single collection "Product" and use a field to indicate the category, or create a seperate collection for each category?
I've read many questions related to this idea (1 collection vs many) but I haven't been able to reach a conclusion, other than "it dependes".
So my question is: In this specific use case which option would be most optimal, multiple collections vs single collection + sharding, in terms of performance and speed ?
Any help would be appreciated.
As you mentioned, you need to play with your data and use-case. You will have better picture.
Some decisions required as below.
Decide the number of documents you will have in near future. If you will have 1m documents in an year, then try with at least 3m data
Decide the number of indices required.
Decide the number of writes, reads per second.
Decide the size of documents per category.
Decide the query pattern.
Some inputs based on the requirements
If you have more writes with more indices, then single monolithic collection will be slower as multiple indices needs to be updated.
As you have different set of fields per category, you could try with multiple collections.
There is $unionWith to combine data from multiple collections. But do check the performance it purely depends on the above decisions. Note this open issue also.
If you decide to go with monolithic collection, defer the sharding. Implement this once you found that queries are slower.
If you have more writes on the same document, writes will be executed sequentially. It will slow down your read also.
Think of reclaiming the disk space when more data is cleared from the collections. Multiple collections do good here.
The point which forces me to suggest monolithic collections is that I'd need to query all the categories at the same time. You may need to add more categories, but combining all of them in single response would not be better in terms of performance.
As you don't really have a join use case like in RDBMS, you can go with single monolithic collection from model point of view. I doubt you could have a join key.
If any of my points are incorrect, please let me know.
To SQL or to NoSQL?
I think that before you implement this in NoSQL, you should ask yourself why you are doing that. I quite like NoSQL but some data is definitely a better fit to that model than others.
The data you are describing is a classic case for a relational SQL DB. That's fine if it's a hobby project and you want to try NoSQL, but if this is for a production environment or client, you are likely making the situation more difficult for them.
Relational or non-relational?
You mention common fields across all products. If you wish to update these fields and have those updates reflected in all products, then you have relational data.
Background
It may be worth reading Sarah Mei 2013 article about this. Skip to the section "How MongoDB Stores Data" and read from there. Warning: the article is called "Why You Should Never Use MongoDB" and is (perhaps intentionally) somewhat biased against Mongo, so it's important to read this through the correct lens. The message you should get from this article is that MongoDB is not a good fit for every data type.
Two strategies for handling relational data in Mongo:
every time you update one of these common fields, update every product's document with the new common field data. This is generally only ok if you have few updates or few documents, but not both.
use references and do joins.
In Mongo, joins typically happen code-side (multiple db calls)
In Arango (and in other graph dbs, as well as some key-value stores), the joins happen db-side (single db call)
Decisions
These are important factors to consider when deciding which DB to use and how to model your data
I've used MongoDB, ArangoDB and Neo4j.
Mongo definitely has the best tooling and it's easy to find help, but I don't believe it's good fit in this case
Arango is quite pleasant to work with, but doesn't yet have the adoption that it deserves
I wouldn't recommend Neo4j to anyone looking for a NoSQL solution, as its nodes and relations only support flat properties (no nesting, so not real documents)
It may also be worth considering MariaDB or Postgres

What NoSQL database is good for complex many to many relationships?

My startup project require complex data structure, which must be ready for geolocation and fulltext search. The information in the database are added by two different types of users, who together build a complex relationship network.
For a better understanding of my questions, here is a simple diagram that shows this relationship:
My first choice was MongoDB and Elasticsearch, but I noticed the problem of multiplication of the same data. Further planning, we concluded that for certain parts of the application need ACID transactions possibilities.
What NoSQL database is good for complex many to many relationships?
What would be a good choice for us?
A graph database like Neo4j is perfect for that.
If you need features of a document database like MongoDB, but with Neo4j under the hood, try Structr: http://de.slideshare.net/AxelMorgner/neo4j-as-document-database http://structr.org
(Disclaimer: I'm the project inititator of Structr)

NoSQL vs. Relational Databases vs. Possible Hybrid

I'm hearing more about NoSQL, but have yet had someone give me a clear explanation of how it is to be used instead of relational databases.
I've read that it can't do left joins, so I was trying to figure out how you'd be able to use such a data storage. From reading: Preserve Joins by code in MongoDB it seems like a suggestion is to just make a large table, as if you already did the joins on it.
If the above statement is true, then I can see how it can be used. However I'm curious on how you'd handle repeat data. As the concept of normalizing, helps you remove the redundancy and ensure consistency in the data (e.g. Slight modifications like capitalization, white space, etc)...
Are we simply sacrificing the consistency of the data for scalable speed, or am I missing something?
Edit
I've been doing some more digging and found the answers the following questions useful for clarifying my understanding:
Why Google's BigTable referred as a NoSQL database?
How do you track record relations in NoSQL?
My understanding of consistency seems to be correct from those answers. And it looks like NoSQL is suppose to be used for specific problems types and that if you need relations that you should use a relational database.
But this raises more questions like:
It makes me wonder about real life examples of when to use NoSQL versus when not to?
By denormalizing the data, you should be able to solve all of the same problems that relational databases do... But there are rules on how to normalize data with relational databases. Are there rules that one can use to help them denormalize the data to use a NoSQL solution?
Any examples on when you might want to consider using both a NoSQL solution in parallel with a relational database?
MongoDB has the ability to have documents which include arrays of other documents. This solves many cases where you would have relations in reational databases.
When an invoice has multiple positions, you wouldn't put these positions into a separate collection. You would embed them as an array.
It makes me wonder about real life examples of when to use NoSQL versus when not to?
There are many different NoSQL databases, each one designed with different use-cases in mind. But you tagged this question as MongoDB, so I assume that you mean MongoDB in particular.
MongoDB has two main advantages over relational databases.
First, it scales well.
When the database is too slow or too big, you can easily add more servers by creating a cluster or replica-set of multiple shards. This doesn't work nearly as well with most relational databases.
Second, it allows heterogeneous data.
Imagine, for example, the product database of a computer hardware store. What properties do products have? All products have a price and a vendor. But CPUs have a clock rate, hard drives and RAM chips have a capacity (and these capacities aren't comparable), monitors have a resolution and so on. How would you design this in a relational database? You would either create a very long productID-property-value table or you would create a very wide and sparse product table with every property you can imagine, but most of them being NULL for most products. Both solutions aren't really elegant. But MongoDB can solve this much better because it allows each document in a collection to have a different set of properties.
What can't it do?
As a rather new technology, there isn't that much literature about it. The software ecosystem around it isn't that well either. The tools you can get for relational databases are often much more shiny.
There are also some use-cases MongoDB isn't well-suited for.
MongoDB doesn't do JOINs. When your data is very relational and denormalizing it would be counter-productive, it might be a poor choice for your product. But you might want to take a look at graph databases like Neo4j, which focus even more on relations than relational databases. Update 2016: MongoDB 3.2 now has rudimentary JOIN support with the $lookup aggregation stage, but it's still very limited in functionality compared to relational and graph databases.
MongoDB doesn't do transactions. At least not complex transactions. Certain actions which only affect a single document are guaranteed to be atomic, but as soon as you affect more than one document, you can't guarantee that no other query will happen in-between and find an inconsistent state.
MongoDB is bad for ad-hoc reporting. Its options for data-mining are severely limited. The rather new aggregation functions help and MapReduce can also solve some surprisingly complex problems when you learn to use it smart, but SQL has usually the better tools for things like that.
By denormalizing the data, you should be able to solve all of the same problems that relational databases do... But there are rules on how to normalize data with relational databases. Are there rules that one can use to help them denormalize the data to use a NoSQL solution?
Relational databases are around for about 40 years. Their theory is a well-researched topic in computer science. There are whole libraries of books written about the theory behind them. There is a by-the-book solution for every imaginable corner-case by now.
But NoSQL databases, on the other hand, are a rather new technology. We are still figuring out the best practices. The most frequent advise is: "Use your own head. Think about what queries are performed most often, and optimize your data schema for them."
Any examples on when you might want to consider using both a NoSQL solution in parallel with a relational database?
When possible I would advise against using two different database technologies in the same product:
Anyone who maintains and supports the product must be familiar with both technologies
Troubleshooting gets a lot harder
The sysadmins need to keep an additional database running and updated
You have an additional point of failure which can lead to downtime
I would only recommend to mix database technologies when fulfilling your requirements without it doesn't just become hard but physically impossible. Otherwise, make your pick and stay with it.

Demoralization in NoSQL

What is exactly demoralization in Nosql databases?
I have read it means modelling different object types as different documents. My first guess was it means Aggregation without storing related data, i.e storing all rows of an entity in a single document with related data being referred by different documents for each row.
But I'm not sure if this is correct or not?
An example would be helpful.
Thanks in advance
I do mean demoralization and not denormalization. I came across this term in the following links:
1. Couchbase documentation
2. Blog on Nosql
In the context of NoSQL (and database in general), demoralization is synonymous to denormalization. You can find mixed usage of demoralization and denormalization in many documents, or mention of demoralization being the opposite of normalization (so again, the same as denormalization) :
What Is Meant By Denormalization In SQL?
Database Denormalization
What is demoralization?
Normalization & Demoralization
Designing databases - OLTP and OLAP
There is even that reference, which mention that some/many spell checkers suggest "demoralization" instead of "denormalization". This could explain why some people use demoralization : The effect of denormalization
NoSQL is a very, very wide field. It covers a lot of entirely different databases systems with entirely different concepts of how data should be structured.
The dogma of database normalization applies mostly to classic relational databases. The further a NoSQL database is away from the relational philosophy, the more do you have to question this dogma.
The philosophy of normalization assumes that database JOINs are cheap. So any data which can be split over multiple tables to remove redundancies should be split. But that doesn't apply to all NoSQL databases. Some of them don't support JOIN operations, so getting data stored in many different database entries can be a very expensive operation which either requires multiple consecutive queries to the database or expensive database-sided code execution. When you use one of those databases, you should store your data in a way that every performance-critical use-case can be fulfilled by looking up as few entries as possible, even when this means that you will have redundant data.
Those non-relational NoSQL databases which don't support JOINs frequently support arrays in database entries instead. These are usually the preferred way to model 1:n relations. So when 1 person has n telephone numbers, you wouldn't store the telephone numbers in a separate table/document/collection/whateveryoucallit, you would store them in an array in the person entry. There is usually no reason to handle telephone numbers as self-sustained entities when it wouldn't be for the inability of SQL to work properly with multiple values in a single field.
Denormalization in a NoSQL world would mean the same as in a RDBMS world. Duplication of data for read performance.

Many-to-many relationships in CouchDB or MongoDB

I have an MSSQL database which I am considering porting to CouchDB or MongoDB. I have a many-to-many relationship within the SQL db which has hundreds of thousands rows in the xref table, corresponding to tens of thousands of rows in the tables on each side of the relationship. Will CouchDB and/or MongoDB be able to handle this data, and what would be the best way of formatting the relevant documents for performant querying? Thanks very much.
For CouchDB, I would highly recommend reading this article about Entity Relationships.
One thing I would note in CouchDB is to be careful of attempting to "normalize" a non-relational data model. The document-based storage offers you a great deal of flexibility, and it's seldom the best idea to abstract everything into as many "document types" as you can think of. Many times, it's best to leave much of your data within the same document unless you have clear cases where separate entities exist.
One common use-case of many-to-many relationships is implementing tagging. There are articles about different methods you can use to accomplish this in CouchDB. It may apply to your requirements, it may not, but it's probably worth a read.
Since the 'collection' model of MongoDB is similar to tables you can of course maintain
the m:n relationship inside a dedicated mapping collection (using the _id of the related documents of the referenced documents from other collections).
If you can: consider redesign your application using embedded documents.
http://www.mongodb.org/display/DOCS/Schema+Design
In general: try to turn off your memories to a RDBMS when working with MongoDB.
Blindly copying the database design from RDBMS to MongoDB is neither helpful nor adviceable nor will it work in general.