Does NoSQL Fit the Bill for Reporting Software? - mongodb

I am currently developing a software system that imports and normalizes historical data in various formats (XML, JSON, CSV). As of right now, we are using SQL server, and are looking to find the best replacement for this tool (Postgres or NoSQL). 90% of the time, the (archived/historical/static)data is accessed via a web client, and is used in a READ only format with users picking and choosing canned reports. Changes to the data only occur to update incorrect information .
The replacement DB must be able to store and report on 10s of millions of rows very quickly, and scale across multiple servers with ease (data replication, clustering, etc). There must also be data integrity, so if I update one KPI (lets say Cost per Hr), then all the reports that rely on the KPI will be updated accordingly.
Having no prior experience with NoSQL databases, I am wondering if it is even the right choice to use in a reporting software. We would like to allow for users to create their own custom reports, and that means being able to query any data as opposed to our canned reports, but I don't know if this would throw a wrench in the comparison between SQL vs NoSQL.

There are a few too many variables in the question, to comfortably answer it in entirety, but here's an attempt.
Your choice in SQL vs NoSQL should be based on data structure. Scalability is generally a second-tier concern, and is only slightly easy on some NoSQL platforms, but as always, isn't always free.
If you're looking for 10s of millions of rows 'very quickly' you are seriously testing the limits of what you can do with it. An RDBMS would allow you a plethora of options at the cost of speed, and a NoSQL although quite fast an inputting at that speed, would make you code most of the RDBMS smartness in your application. Chose your poison.
Updating a metric and 'automagically' updating reports is clearly a business-logic smartness, that shouldn't be tied down to platform selection.
PostgreSQL has in the near past, really picked up a lot of arsenal to deal with file formats (JSON et al) and is clearly worth a try (sans easy scalability).
Having said that, you should really investigate Postgres' otherwise forgotten asset, FDWs. You can clearly consider using a NoSQL setup to ingest large unstructured data, and thence utilize postgres' powerful semantics to use that and create a asynchronous yet structured backend for your application. If done well, that could mean the best of both worlds.

Related

Replace PostgreSQL with MongoDB?

My client has an existing PostgreSQL database with around 100 tables and most every table has one or more relationships to other tables. He's got around a thousand customers who use an app that hits that database.
Recently he hired a new frontend web developer, and that person is trying to tell him that we should throw out the PostgreSQL database and replace it with a MongoDB solution. That seems odd to me, but I don't have experience with MongoDB.
Is there any clear reasons why he should, or should not, make the change? Obviously I'm arguing against it and the other guy for it, but I would like to remove the "I like this one better" from the argument and really hear from the community on their experience with such things.
1) Performance
During last years, there were several benchmarks comparing Postgres and Mongo.
Here you can find the most recent performance benchmark (Yahoo): https://www.slideshare.net/profyclub_ru/postgres-vs-mongo-postgres-professional (start with slide #58, where some overview of the past becnhmarks is given).
Notice, that traditionally, MongoDB provided benchmarks, where they didn't turn on write ahead logging or even turned fsync off, so their benchmarks were unfair -- in such states the database system doesn't wait for filesystem, so TPS are high but probability to lose data is also very high.
2) Flexibility – JSON
Postgres has non-structured and semistructured data types since 2003 (hstore, XML, array data types). And now has very strong JSON support with indexing (jsonb data type), you can create partial indexes, functional indexes, index only part of JSON documents, index whole documents in different manners (you can tweek index to reduce it's size and speed).
More interestingly, with Postgres, you can combine relational approach and non-relational JSON data – see this talk again https://www.slideshare.net/profyclub_ru/postgres-vs-mongo-postgres-professional for details. This gives you a lot of flexibility and power (I wouldn't keep money-related or basic accounts-related data in JSON format).
3) Standards and costs of support
SQL experiences new born now -- NoSQL products started to add SQL dialects, there is a lot of people making big data analysis with SQL, you can even run machine learning algorithms inside RDBMS (see MADlib project http://madlib.incubator.apache.org).
When you need to work with data, SQL was, is and will be for long time the best language – there are such many things included to it, so all other languages are lagging too much. I recommend http://modern-sql.com/ to learn modern SQL features and https://use-the-index-luke.com (from the same author) to learn how reach the best performance using SQL.
When Mongo needed to create "BI connector", they also needed to speak SQL, so guess what they chose? https://www.linkedin.com/pulse/mongodb-32-now-powered-postgresql-john-de-goes
SQL will go nowhere, it's extended with SQL/JSON now and this means that for future, Postgres is an excellent choice.
4) Scalability
If you data size is up to several terabytes -- it's easy to live on "single master - multiple replicas" architectuyre either on your own installation or in clouds (Amazon RDS, Heroku, Google Cloud Platform, and since recently, Azure – all them support Postgres). There is an increasing number of solutions which help you to work with microservice architecture, have automatic failover, and/or shard your data. Here is only few of them, which are actively developed and supported, without specific order:
https://wiki.postgresql.org/wiki/PL/Proxy
https://github.com/zalando/spilo and https://github.com/zalando/patroni
https://github.com/dalibo/PAF
https://github.com/postgrespro/postgres_cluster
https://www.2ndquadrant.com/en/resources/bdr/
https://www.postgresql.org/docs/10/static/postgres-fdw.html
5) Extensibility
There are much more additional projects built to work with Postgres than with Mongo. You can work with literally any data type (including but not limited to time ranges, geospatial data, JSON, XML, arrays), have index support for it, ACID and manipulate with it using standard SQL. You can develop your own functions, data types, operators, index structures and much more!
If your data is relational (and it appears that it is), it makes no sense whatsoever to use a non-relational db (like mongodb). You can't underestimate the power and expressiveness of standard SQL queries.
On top of that, postgres has full ACID. And it can handle free-form JSON reasonably well, if that is that guy's primary motivation.

NoSQL & AdHoc Queries - Millions of Rows

I currently run a MySQL-powered website where users promote advertisements and gain revenue every time someone completes one. We log every time someone views an ad ("impression"), every time a user clicks an add ("click"), and every time someone completes an ad ("lead").
Since we get so much traffic, we have millions of records in each of these respective tables. We then have to query these tables to let users see how much they have earned, so we end up performing multiple queries on tables with millions and millions of rows multiple times in one request, hundreds of times concurrently.
We're looking to move away from MySQL and to a key-value store or something along those lines. We need something that will let us store all these millions of rows, query them in milliseconds, and MOST IMPORTANTLY, use adhoc queries where we can query any single column, so we could do things like:
FROM leads WHERE country = 'US' AND user_id = 501 (the NoSQL equivalent, obviously)
FROM clicks WHERE ad_id = 1952 AND user_id = 200 AND country = 'GB'
etc.
Does anyone have any good suggestions? I was considering MongoDB or CouchDB but I'm not sure if they can handle querying millions of records multiple times a second and the type of adhoc queries we need.
Thanks!
With those requirements, you are probably better off sticking with SQL and setting up replication/clustering if you are running into load issues. You can set up indexing on a document database so that those queries are possible, but you don't really gain anything over your current system.
NoSQL systems generally improve performance by leaving out some of the more complex features of relational systems. This means that they will only help if your scenario doesn't require those features. Running ad hoc queries on tabular data is exactly what SQL was designed for.
CouchDB's map/reduce is incremental which means it only processes a document once and stores the results.
Let's assume, for a moment, that CouchDB is the slowest database in the world. Your first query with millions of rows takes, maybe, 20 hours. That sounds terrible. However, your second query, your third query, your fourth query, and your hundredth query will take 50 milliseconds, perhaps 100 including HTTP and network latency.
You could say CouchDB fails the benchmarks but gets honors in the school of hard knocks.
I would not worry about performance, but rather if CouchDB can satisfy your ad-hoc query requirements. CouchDB wants to know what queries will occur, so it can do the hard work up-front before the query arrives. When the query does arrive, the answer is already prepared and out it goes!
All of your examples are possible with CouchDB. A so-called merge-join (lots of equality conditions) is no problem. However CouchDB cannot support multiple inequality queries simultaneously. You cannot ask CouchDB, in a single query, for users between age 18-40 who also clicked fewer than 10 times.
The nice thing about CouchDB's HTTP and Javascript interface is, it's easy to do a quick feasibility study. I suggest you try it out!
Most people would probably recommend MongoDB for a tracking/analytic system like this, for good reasons. You should read the „MongoDB for Real-Time Analytics” chapter from the „MongoDB Definitive Guide” book. Depending on the size of your data and scaling needs, you could get all the performance, schema-free storage and ad-hoc querying features. You will need to decide for yourself if issues with durability and unpredictability of the system are risky for you or not.
For a simpler tracking system, Redis would be a very good choice, offering rich functionality, blazing speed and real durability. To get a feel how such a system would be implemented in Redis, see this gist. The downside is, that you'd need to define all the „indices” by yourself, not gain them for „free”, as is the case with MongoDB. Nevertheless, there's no free lunch, and MongoDB indices are definitely not a free lunch.
I think you should have a look into how ElasticSearch would enable you:
Blazing speed
Schema-free storage
Sharding and distributed architecture
Powerful analytic primitives in the form of facets
Easy implementation of „sliding window”-type of data storage with index aliases
It is in heart a „fulltext search engine”, but don't get yourself confused by that. Read the „Data Visualization with ElasticSearch and Protovis“ article for real world use case of ElasticSearch as a data mining engine.
Have a look on these slides for real world use case for „sliding window” scenario.
There are many client libraries for ElasticSearch available, such as Tire for Ruby, so it's easy to get off the ground with a prototype quickly.
For the record (with all due respect to #jhs :), based on my experience, I cannot imagine an implementation where Couchdb is a feasible and useful option. It would be an awesome backup storage for your data, though.
If your working set can fit in the memory, and you index the right fields in the document, you'd be all set. Your ask is not something very typical and I am sure with proper hardware, right collection design (denormalize!) and indexing you should be good to go. Read up on Mongo querying, and use explain() to test the queries. Stay away from IN and NOT IN clauses that'd be my suggestion.
It really depends on your data sets. The number one rule to NoSQL design is to define your query scenarios first. Once you really understand how you want to query the data then you can look into the various NoSQL solutions out there. The default unit of distribution is key. Therefore you need to remember that you need to be able to split your data between your node machines effectively otherwise you will end up with a horizontally scalable system with all the work still being done on one node (albeit better queries depending on the case).
You also need to think back to CAP theorem, most NoSQL databases are eventually consistent (CP or AP) while traditional Relational DBMS are CA. This will impact the way you handle data and creation of certain things, for example key generation can be come trickery.
Also remember than in some systems such as HBase there is no indexing concept. All your indexes will need to be built by your application logic and any updates and deletes will need to be managed as such. With Mongo you can actually create indexes on fields and query them relatively quickly, there is also the possibility to integrate Solr with Mongo. You don’t just need to query by ID in Mongo like you do in HBase which is a column family (aka Google BigTable style database) where you essentially have nested key-value pairs.
So once again it comes to your data, what you want to store, how you plan to store it, and most importantly how you want to access it. The Lily project looks very promising. The work I am involved with we take a large amount of data from the web and we store it, analyse it, strip it down, parse it, analyse it, stream it, update it etc etc. We dont just use one system but many which are best suited to the job at hand. For this process we use different systems at different stages as it gives us fast access where we need it, provides the ability to stream and analyse data in real-time and importantly, keep track of everything as we go (as data loss in a prod system is a big deal) . I am using Hadoop, HBase, Hive, MongoDB, Solr, MySQL and even good old text files. Remember that to productionize a system using these technogies is a bit harder than installing MySQL on a server, some releases are not as stable and you really need to do your testing first. At the end of the day it really depends on the level of business resistance and the mission-critical nature of your system.
Another path that no one thus far has mentioned is NewSQL - i.e. Horizontally scalable RDBMSs... There are a few out there like MySQL cluster (i think) and VoltDB which may suit your cause.
Again it comes to understanding your data and the access patterns, NoSQL systems are also Non-Rel i.e. non-relational and are there for better suit to non-relational data sets. If your data is inherently relational and you need some SQL query features that really need to do things like Cartesian products (aka joins) then you may well be better of sticking with Oracle and investing some time in indexing, sharding and performance tuning.
My advice would be to actually play around with a few different systems. However for your use case I think a Column Family database may be the best solution, I think there are a few places which have implemented similar solutions to very similar problems (I think the NYTimes is using HBase to monitor user page clicks). Another great example is Facebook and like, they are using HBase for this. There is a really good article here which may help you along your way and further explain some points above. http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
Final point would be that NoSQL systems are not the be all and end all. Putting your data into a NoSQL database does not mean its going to perform any better than MySQL, Oracle or even text files... For example see this blog post: http://mysqldba.blogspot.com/2010/03/cassandra-is-my-nosql-solution-but.html
I'd have a look at;
MongoDB - Document - CP
CouchDB - Document - AP
Redis - In memory key-value (not column family) - CP
Cassandra - Column Family - Available & Partition Tolerant (AP)
HBase - Column Family - Consistent & Partition Tolerant (CP)
Hadoop/Hive - Also have a look at Hadoop streaming...
Hypertable - Another CF CP DB.
VoltDB - A really good looking product, a relation database that is distributed and might work for your case (may be an easier move). They also seem to provide enterprise support which may be more suited for a prod env (i.e. give business users a sense of security).
Any way thats my 2c. Playing around with the systems is really the only way your going to find out what really works for your case.

Example of a task that a NoSQL database can't handle (if any)

I would like to test the NoSQL world. This is just curiosity, not an absolute need (yet).
I have read a few things about the differences between SQL and NoSQL databases. I'm convinced about the potential advantages, but I'm a little worried about cases where NoSQL is not applicable. If I understand NoSQL databases essentially miss ACID properties.
Can someone give an example of some real world operation (for example an e-commerce site, or a scientific application, or...) that an ACID relational database can handle but where a NoSQL database could fail miserably, either systematically with some kind of race condition or because of a power outage, etc ?
The perfect example will be something where there can't be any workaround without modifying the database engine. Examples where a NoSQL database just performs poorly will eventually be another question, but here I would like to see when theoretically we just can't use such technology.
Maybe finding such an example is database specific. If this is the case, let's take MongoDB to represent the NoSQL world.
Edit:
to clarify this question I don't want a debate about which kind of database is better for certain cases. I want to know if this technology can be an absolute dead-end in some cases because no matter how hard we try some kind of features that a SQL database provide cannot be implemented on top of nosql stores.
Since there are many nosql stores available I can accept to pick an existing nosql store as a support but what interest me most is the minimum subset of features a store should provide to be able to implement higher level features (like can transactions be implemented with a store that don't provide X...).
This question is a bit like asking what kind of program cannot be written in an imperative/functional language. Any Turing-complete language and express every program that can be solved by a Turing Maching. The question is do you as a programmer really want to write a accounting system for a fortune 500 company in non-portable machine instructions.
In the end, NoSQL can do anything SQL based engines can, the difference is you as a programmer may be responsible for logic in something Like Redis that MySQL gives you for free. SQL databases take a very conservative view of data integrity. The NoSQL movement relaxes those standards to gain better scalability, and to make tasks that are common to Web Applications easier.
MongoDB (my current preference) makes replication and sharding (horizontal scaling) easy, inserts very fast and drops the requirement for a strict scheme. In exchange users of MongoDB must code around slower queries when an index is not present, implement transactional logic in the app (perhaps with three phase commits), and we take a hit on storage efficiency.
CouchDB has similar trade-offs but also sacrifices ad-hoc queries for the ability to work with data off-line then sync with a server.
Redis and other key value stores require the programmer to write much of the index and join logic that is built in to SQL databases. In exchange an application can leverage domain knowledge about its data to make indexes and joins more efficient then the general solution the SQL would require. Redis also require all data to fit in RAM but in exchange gives performance on par with Memcache.
In the end you really can do everything MySQL or Postgres do with nothing more then the OS file system commands (after all that is how the people that wrote these database engines did it). It all comes down to what you want the data store to do for you and what you are willing to give up in return.
Good question. First a clarification. While the field of relational stores is held together by a rather solid foundation of principles, with each vendor choosing to add value in features or pricing, the non-relational (nosql) field is far more heterogeneous.
There are document stores (MongoDB, CouchDB) which are great for content management and similar situations where you have a flat set of variable attributes that you want to build around a topic. Take site-customization. Using a document store to manage custom attributes that define the way a user wants to see his/her page is well suited to the platform. Despite their marketing hype, these stores don't tend to scale into terabytes that well. It can be done, but it's not ideal. MongoDB has a lot of features found in relational databases, such as dynamic indexes (up to 40 per collection/table). CouchDB is built to be absolutely recoverable in the event of failure.
There are key/value stores (Cassandra, HBase...) that are great for highly-distributed storage. Cassandra for low-latency, HBase for higher-latency. The trick with these is that you have to define your query needs before you start putting data in. They're not efficient for dynamic queries against any attribute. For instance, if you are building a customer event logging service, you'd want to set your key on the customer's unique attribute. From there, you could push various log structures into your store and retrieve all logs by customer key on demand. It would be far more expensive, however, to try to go through the logs looking for log events where the type was "failure" unless you decided to make that your secondary key. One other thing: The last time I looked at Cassandra, you couldn't run regexp inside the M/R query. Means that, if you wanted to look for patterns in a field, you'd have to pull all instances of that field and then run it through a regexp to find the tuples you wanted.
Graph databases are very different from the two above. Relations between items(objects, tuples, elements) are fluid. They don't scale into terabytes, but that's not what they are designed for. They are great for asking questions like "hey, how many of my users lik the color green? Of those, how many live in California?" With a relational database, you would have a static structure. With a graph database (I'm oversimplifying, of course), you have attributes and objects. You connect them as makes sense, without schema enforcement.
I wouldn't put anything critical into a non-relational store. Commerce, for instance, where you want guarantees that a transaction is complete before delivering the product. You want guaranteed integrity (or at least the best chance of guaranteed integrity). If a user loses his/her site-customization settings, no big deal. If you lose a commerce transation, big deal. There may be some who disagree.
I also wouldn't put complex structures into any of the above non-relational stores. They don't do joins well at-scale. And, that's okay because it's not the way they're supposed to work. Where you might put an identity for address_type into a customer_address table in a relational system, you would want to embed the address_type information in a customer tuple stored in a document or key/value. Data efficiency is not the domain of the document or key/value store. The point is distribution and pure speed. The sacrifice is footprint.
There are other subtypes of the family of stores labeled as "nosql" that I haven't covered here. There are a ton (122 at last count) different projects focused on non-relational solutions to data problems of various types. Riak is yet another one that I keep hearing about and can't wait to try out.
And here's the trick. The big-dollar relational vendors have been watching and chances are, they're all building or planning to build their own non-relational solutions to tie in with their products. Over the next couple years, if not sooner, we'll see the movement mature, large companies buy up the best of breed and relational vendors start offering integrated solutions, for those that haven't already.
It's an extremely exciting time to work in the field of data management. You should try a few of these out. You can download Couch or Mongo and have them up and running in minutes. HBase is a bit harder.
In any case, I hope I've informed without confusing, that I have enlightened without significant bias or error.
RDBMSes are good at joins, NoSQL engines usually aren't.
NoSQL engines is good at distributed scalability, RDBMSes usually aren't.
RDBMSes are good at data validation coinstraints, NoSQL engines usually aren't.
NoSQL engines are good at flexible and schema-less approaches, RDBMSes usually aren't.
Both approaches can solve either set of problems; the difference is in efficiency.
Probably answer to your question is that mongodb can handle any task (and sql too). But in some cases better to choose mongodb, in others sql database. About advantages and disadvantages you can read here.
Also as #Dmitry said mongodb open door for easy horizontal and vertical scaling with replication & sharding.
RDBMS enforce strong consistency while most no-sql are eventual consistent. So at a given point in time when data is read from a no-sql DB it might not represent the most up-to-date copy of that data.
A common example is a bank transaction, when a user withdraw money, node A is updated with this event, if at the same time node B is queried for this user's balance, it can return an outdated balance. This can't happen in RDBMS as the consistency attribute guarantees that data is updated before it can be read.
RDBMs are really good for quickly aggregating sums, averages, etc. from tables. e.g. SELECT SUM(x) FROM y WHERE z. It's something that is surprisingly hard to do in most NoSQL databases, if you want an answer at once. Some NoSQL stores provide map/reduce as a way of solving the same thing, but it is not real time in the same way it is in the SQL world.

Could document-based databases be used instead of RDBMS throughout the application in the future?

The more I read/use non-sql databases, the more I love it.
It's so for the OOP world and it's easy to use, like Rails for Frameworks.
I know the disadvantages. The major concern seems to be the no-transaction and no-concurrency part. Am I correct?
Are these the only features making it hard for developers to choose to use non-sql databases entirely, even for transactions?
If these features were fixed, would it be more OK to only use document-based databases for an application?
Cause now it seems like you still have to use a RDBMS for customer billing data while your content could be in document-based databases like MongoDB/CouchDB/Cassandra.
Could someone shed a light on this.
Yes of course you can build entire applications on non-relational data models. As a general rule though most people don't want to do that. The problem is that hierarchical/graph based data models (ie. any model that depends on navigational data structures) significantly increase the complexity and reduce the effectiveness of queries and data integrity in the database. The relational model was invented 40 years ago precisely to overcome those disadvantages inherent in navigation-based databases.
No.
They do not seem to be appropriate for fixed-schema, high-volume, mostly numerical data. Think data warehousing. Think ad-hoc analytical queries. They could take over all (or some of the) areas where RDBMS have not been a good fit in the first place (areas where people came up with XML databases, and object-oriented databases, and graph databases, and so on).
This is just like Excel not being able to replace Word (also admittedly, most Excel files I see these days are more presentation than spreadsheet). Different tools for different tasks.
In short, many NoSql solutions don't have cascading updates so if your application's data schema requires this, you will either update multiple documents (ie columns whatever) programatically, or stick with a sql based solution to handle this.
Concurrency is handled differently for different solutions.
I think this blog does a good job at explaining some of the trade-offs using a NoSQL solution
http://blog.mongodb.org/post/475279604/on-distributed-consistency-part-1
It depends also on the development of new, faster hardware.
You can distribute your database over multiple cheap computers (scaling out) if you use Cassandra and MongoDB. There will always be data sets that are too large for one computer because people collect and keep more data when it is possible to collect and keep more data.
However most data sets fit on one computer and can be stored in a SQL database. It is also possible to scale out a SQL database but foreign keys and complex transactions become slow when you distribute your data over multiple machines.
You have to make some tough choices when you distribute your data over multiple machines: http://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html, you don't have to worry about the CAP theorem if all your data is on one machine.

When should I use a NoSQL database instead of a relational database? Is it okay to use both on the same site?

What are the advantages of using NoSQL databases? I've read a lot about them lately, but I'm still unsure why I would want to implement one, and under what circumstances I would want to use one.
Relational databases enforces ACID. So, you will have schema based transaction oriented data stores. It's proven and suitable for 99% of the real world applications. You can practically do anything with relational databases.
But, there are limitations on speed and scaling when it comes to massive high availability data stores. For example, Google and Amazon have terabytes of data stored in big data centers. Querying and inserting is not performant in these scenarios because of the blocking/schema/transaction nature of the RDBMs. That's the reason they have implemented their own databases (actually, key-value stores) for massive performance gain and scalability.
NoSQL databases have been around for a long time - just the term is new. Some examples are graph, object, column, XML and document databases.
For your 2nd question: Is it okay to use both on the same site?
Why not? Both serves different purposes right?
NoSQL solutions are usually meant to solve a problem that relational databases are either not well suited for, too expensive to use (like Oracle) or require you to implement something that breaks the relational nature of your db anyway.
Advantages are usually specific to your usage, but unless you have some sort of problem modeling your data in a RDBMS I see no reason why you would choose NoSQL.
I myself use MongoDB and Riak for specific problems where a RDBMS is not a viable solution, for all other things I use MySQL (or SQLite for testing).
If you need a NoSQL db you usually know about it, possible reasons are:
client wants 99.999% availability on
a high traffic site.
your data makes
no sense in SQL, you find yourself
doing multiple JOIN queries for
accessing some piece of information.
you are breaking the relational
model, you have CLOBs that store
denormalized data and you generate
external indexes to search that data.
If you don't need a NoSQL solution keep in mind that these solutions weren't meant as replacements for an RDBMS but rather as alternatives where the former fails and more importantly that they are relatively new as such they still have a lot of bugs and missing features.
Oh, and regarding the second question it is perfectly fine to use any technology in conjunction with another, so just to be complete from my experience MongoDB and MySQL work fine together as long as they aren't on the same machine
Martin Fowler has an excellent video which gives a good explanation of NoSQL databases. The link goes straight to his reasons to use them, but the whole video contains good information.
You have large amounts of data - especially if you cannot fit it all on one physical server as NoSQL was designed to scale well.
Object-relational impedance mismatch - Your domain objects do not fit well in a relaitional database schema. NoSQL allows you to persist your data as documents (or graphs) which may map much more closely to your data model.
NoSQL is a database system where data is organized into the document (MongoDB), key-value pair (MemCache, Redis), and graph structure form(Neo4J).
Maybe there are possible questions and answer for "When to go for NoSQL":
Require flexible schema or deal with tree-like data?
Generally, in agile development we start designing systems without knowing all requirements upfront, whereas later on throughout the development database system may need to accommodate frequent design changes, showcasing MVP (Minimal Viable product).
Or you are dealing with a data schema that is dynamic in nature.
e.g. System logs, very precise example is AWS cloudtrail logs.
Data set is vast/big?
Yes NoSQL databases are the better candidate for applications where the database needs to manage millions or even billions of records without compromising performance and availability while may be trading for inconsistency(though modern databases are exception here where it allows tunable consistency over availability e.g. Casandra, Cloud provider databases CosmosDB, DynamoDB).
Trade-off between scaling over consistency
Unlike RDMS, NoSQL databases may make the dataset consistent across other nodes eventually which is the default behavior, but it's easy to scale in terms of performance and availability.
Example: This may be good for storing people who are online in the instant messaging app, API tokens in DB, and logging website traffic stats.
Performing Geolocation Operations:
MongoDB hash rich support for doing GeoQuerying & Geolocation operations. I really loved this feature of MongoDB. So does the PostresSQL but ease of implementation is something that depends on the use case
In nutshell, MongoDB is a great fit for applications where you can store dynamic structured data on a large scale.
Edits:
Updated the answer about the consistency of the database.
Some essential information is missing to answer the question: Which use cases must the database be able to cover? Do complex analyses have to be performed from existing data (OLAP) or does the application have to be able to process many transactions (OLTP)? What is the data structure? That is far from the end of question time.
In my view, it is wrong to make technology decisions on the basis of bold buzzwords without knowing exactly what is behind them. NoSQL is often praised for its scalability. But you also have to know that horizontal scaling (over several nodes) also has its price and is not free. Then you have to deal with issues like eventual consistency and define how to resolve data conflicts if they cannot be resolved at the database level. However, this applies to all distributed database systems.
The joy of the developers with the word "schema less" at NoSQL is at the beginning also very big. This buzzword is quickly disenchanted after technical analysis, because it correctly does not require a schema when writing, but comes into play when reading. That is why it should correctly be "schema on read". It may be tempting to be able to write data at one's own discretion. But how do I deal with the situation if there is existing data but the new version of the application expects a different schema?
The document model (as in MongoDB, for example) is not suitable for data models where there are many relationships between the data. Joins have to be done on application level, which is additional effort and why should I program things that the database should do.
If you make the argument that Google and Amazon have developed their own databases because conventional RDBMS can no longer handle the flood of data, you can only say: You are not Google and Amazon. These companies are the spearhead, some 0.01% of scenarios where traditional databases are no longer suitable, but for the rest of the world they are.
What's not insignificant: SQL has been around for over 40 years and millions of hours of development have gone into large systems such as Oracle or Microsoft SQL. This has to be achieved by some new databases. Sometimes it is also easier to find an SQL admin than someone for MongoDB. Which brings us to the question of maintenance and management. A subject that is not exactly sexy, but that is a part of the technology decision.
Handling A Large Number Of Read Write Operations
Look towards NoSQL databases when you need to scale fast. And when do you generally need to scale fast?
When there are a large number of read-write operations on your website & when dealing with a large amount of data, NoSQL databases fit best in these scenarios. Since they have the ability to add nodes on the fly, they can handle more concurrent traffic & big amount of data with minimal latency.
Flexibility With Data Modeling
The second cue is during the initial phases of development when you are not sure about the data model, the database design, things are expected to change at a rapid pace. NoSQL databases offer us more flexibility.
Eventual Consistency Over Strong Consistency
It’s preferable to pick NoSQL databases when it’s OK for us to give up on Strong consistency and when we do not require transactions.
A good example of this is a social networking website like Twitter. When a tweet of a celebrity blows up and everyone is liking and re-tweeting it from around the world. Does it matter if the count of likes goes up or down a bit for a short while?
The celebrity would definitely not care if instead of the actual 5 million 500 likes, the system shows the like count as 5 million 250 for a short while.
When a large application is deployed on hundreds of servers spread across the globe, the geographically distributed nodes take some time to reach a global consensus.
Until they reach a consensus, the value of the entity is inconsistent. The value of the entity eventually gets consistent after a short while. This is what Eventual Consistency is.
Though the inconsistency does not mean that there is any sort of data loss. It just means that the data takes a short while to travel across the globe via the internet cables under the ocean to reach a global consensus and become consistent.
We experience this behaviour all the time. Especially on YouTube. Often you would see a video with 10 views and 15 likes. How is this even possible?
It’s not. The actual views are already more than the likes. It’s just the count of views is inconsistent and takes a short while to get updated.
Running Data Analytics
NoSQL databases also fit best for data analytics use cases, where we have to deal with an influx of massive amounts of data.
I came across this question while looking for convincing grounds to deviate from RDBMS design.
There is a great post by Julian Brown which sheds lights on constraints of distributed systems. The concept is called Brewer's CAP Theorem which in summary goes:
The three requirements of distributed systems are : Consistency, Availability and Partition tolerance (CAP in short). But you can only have two of them at a time.
And this is how I summarised it for myself:
You better go for NoSQL if Consistency is what you are sacrificing.
I designed and implemented solutions with NoSQL databases and here is my checkpoint list to make the decision to go with SQL or document-oriented NoSQL.
DON'Ts
SQL is not obsolete and remains a better tool in some cases. It's hard to justify use of a document-oriented NoSQL when
Need OLAP/OLTP
It's a small project / simple DB structure
Need ad hoc queries
Can't avoid immediate consistency
Unclear requirements
Lack of experienced developers
DOs
If you don't have those conditions or can mitigate them, then here are 2 reasons where you may benefit from NoSQL:
Need to run at scale
Convenience of development (better integration with your tech stack, no need in ORM, etc.)
More info
In my blog posts I explain the reasons in more details:
7 reasons NOT to NoSQL
2 reasons to NoSQL
Note: the above is applicable to document-oriented NoSQL only. There are other types of NoSQL, which require other considerations.
Ran into this thread and wanted to add my experience.. Many SQL databases support json data in columns and support querying of this json. So what I have used is a hybrid using a relational database with columns containing json..