Are document databases good for storing large amounts of Stock Tick data? [closed] - mongodb

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
I was thinking of using a database like mongodb or ravendb to store a lot of stock tick data and wanted to know if this would be viable compared to a standard relational such as Sql Server.
The data would not really be relational and would be a couple of huge tables. I was also thinking that I could sum/min/max rows of data by minute/hour/day/week/month etc for even faster calculations.
Example data:
500 symbols * 60 min * 60sec * 300 days... (per record we store: date, open, high,low,close, volume, openint - all decimal/float)
So what do you guys think?

Since this question was asked in 2010, several database engines have been released, or have developed features, that specifically handle time series such as stock tick data:
InfluxDB - see my other answer
Cassandra
With MongoDB or other document-oriented databases, if you target performance, the advice is to contort your schema and organize ticks into an object keyed by seconds (or an object of minutes, each minute being another object holding 60 seconds). With a specialized time-series database, you can query the data simply with
SELECT open, close FROM market_data
WHERE symbol = 'AAPL' AND time > '2016-09-14' AND time < '2016-09-21'
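For contrast, here is a minimal sketch of the per-second bucketing approach mentioned above for MongoDB (the collection name, field names and use of pymongo are illustrative assumptions, not something the original answer specifies):

from datetime import datetime, timezone
from pymongo import MongoClient

# Hypothetical connection and collection names.
ticks = MongoClient("mongodb://localhost:27017").market_data.ticks_by_minute

def record_tick(symbol, ts, price, volume):
    # Each document covers one symbol for one minute and holds up to 60
    # embedded ticks keyed by second, which keeps documents compact and
    # lets a whole minute be read in a single fetch.
    minute = ts.replace(second=0, microsecond=0)
    ticks.update_one(
        {"symbol": symbol, "minute": minute},
        {"$set": {f"seconds.{ts.second}": {"price": price, "volume": volume}}},
        upsert=True,
    )

record_tick("AAPL", datetime.now(timezone.utc), 107.32, 1500)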
I was also thinking that I could sum/min/max rows of data by minute/hour/day/week/month etc for even faster calculations.
With InfluxDB, this is very straightforward. Here's how to get the daily minimums and maximums:
SELECT MIN("close"), MAX("close") FROM "market_data" WHERE symbol = 'AAPL'
GROUP BY time(1d)
You can group by time intervals which can be in microseconds (u), seconds (s), minutes (m), hours (h), days (d) or weeks (w).
TL;DR
Time-series databases are better choices than document-oriented databases for storing and querying large amounts of stock tick data.

The answer here will depend on scope.
MongoDB is a great way to get the data "in", and it's really fast at querying individual pieces. It's also nice in that it is built to scale horizontally.
However, what you'll have to remember is that all of your significant "queries" are actually going to result from "batch job output".
As an example, Gilt Groupe has created a system called Hummingbird that they use for real-time analytics on their web site. Presentation here. They're basically dynamically rendering pages based on collected performance data in tight intervals (15 minutes).
In their case, they have a simple cycle: post data to mongo -> run map-reduce -> push data to webs for real-time optimization -> rinse / repeat.
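To make the "run map-reduce" step concrete, here is a hedged sketch of the kind of per-minute rollup such a batch job might produce. It uses MongoDB's aggregation pipeline via pymongo rather than classic map-reduce, it assumes a reasonably recent MongoDB (for the $dateTrunc and $merge stages), and every collection and field name is an assumption:

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017").market_data

# Roll raw ticks up into per-symbol, per-minute OHLC summaries and write
# them to a separate collection that the web tier can read cheaply.
pipeline = [
    {"$sort": {"time": 1}},
    {"$group": {
        "_id": {
            "symbol": "$symbol",
            "minute": {"$dateTrunc": {"date": "$time", "unit": "minute"}},
        },
        "open": {"$first": "$price"},
        "close": {"$last": "$price"},
        "high": {"$max": "$price"},
        "low": {"$min": "$price"},
        "volume": {"$sum": "$volume"},
    }},
    {"$merge": {"into": "ticks_per_minute", "whenMatched": "replace"}},
]
db.ticks.aggregate(pipeline)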
This is honestly pretty close to what you probably want to do. However, there are some limitations here:
Map-reduce is new to many people. If you're familiar with SQL, you'll have to accept the learning curve of Map-reduce.
If you're pumping in lots of data, your map-reduces are going to be slower on those boxes. You'll probably want to look at slaving / replica pairs if response times are a big deal.
On the other hand, you'll run into different variants of these problems with SQL.
Of course there are some benefits here:
Horizontal scalability. If you have lots of boxes then you can shard them and get somewhat linear performance increases on Map/Reduce jobs (that's how they work). Building such a "cluster" with SQL databases is a lot more costly and complex.
Really fast, and as with point #1, you get the ability to add RAM horizontally to keep up the speed.
As mentioned by others though, you're going to lose access to ETL and other common analysis tools. You'll definitely be on the hook to write a lot of your own analysis tools.

Here's my reservation with the idea - and I'm going to openly acknowledge that my working knowledge of document databases is weak. I’m assuming you want all of this data stored so that you can perform some aggregation or trend-based analysis on it.
If you use a document-based DB to act as your source, the loading and manipulation of each row of data (CRUD operations) is very simple. Very efficient, very straightforward, basically lovely.
What sucks is that there are very few, if any, options to extract this data and cram it into a structure more suitable for statistical analysis, e.g. a columnar database or a cube. If you load it into a basic relational database, there are a host of tools, both commercial and open source (such as Pentaho), that will accommodate the ETL and analysis very nicely.
Ultimately though, what you want to keep in mind is that every financial firm in the world has a stock analysis/ auto-trader application; they just caused a major U.S. stock market tumble and they are not toys. :)

A simple datastore such as a key-value or document database is also beneficial in cases where performing analytics reasonably exceeds a single system's capacity (or would require an exceptionally large machine to handle the load). In these cases it makes sense to use a simple store, since the analytics require batch processing anyway. I would personally look for a horizontally scalable processing method for computing the unit/time analytics required.
I would investigate using something built on Hadoop for parallel processing. Either use the framework natively in Java/C++ or some higher level abstraction: Pig, Wukong, binary executables through the streaming interface, etc. Amazon offers reasonably cheap processing time and storage if that route is of interest. (I have no personal experience but many do and depend on it for their businesses.)
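To illustrate the streaming interface mentioned above, a rollup job could be as small as a mapper and a reducer script like these (a sketch only; the comma-separated input layout, file names and the choice of summing volume per symbol are assumptions):

#!/usr/bin/env python3
# mapper.py - emit "symbol<TAB>volume" for each input tick line.
# Assumes CSV input lines of the form: symbol,timestamp,price,volume
import sys

for line in sys.stdin:
    parts = line.strip().split(",")
    if len(parts) == 4:
        symbol, _ts, _price, volume = parts
        print(f"{symbol}\t{volume}")

#!/usr/bin/env python3
# reducer.py - sum volumes per symbol; Hadoop streaming delivers the
# mapper output sorted by key, so a simple running total works.
import sys

current, total = None, 0.0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current:
        if current is not None:
            print(f"{current}\t{total}")
        current, total = key, 0.0
    total += float(value)
if current is not None:
    print(f"{current}\t{total}")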

Related

Is Hadoop a good solution for small data? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 9 years ago.
We have a set of data in MongoDB that we are map-reducing (twice). We're going to use Mongo's map-reduce for now, but I'm thinking about how to scale and improve performance in the future, and am thinking about Hadoop.
Most of the stuff I'm reading about Hadoop talks about big data, terabytes of the stuff, when we're going to be dealing with megabytes, 10s maybe 100s of thousands of records. (There may be many of these jobs running concurrently, so whilst a single task is small, the total could be large.)
We really want to get insane performance out of small data rather than make it possible to do big data. i.e. get map reduce results that take 10s of seconds in MongoDB to take seconds or sub second in Hadoop.
Is this possible?
Is Hadoop a good fit for this?
If not what other technologies are there that will make this possible?
Details of the exact problem this is needed for, and my solution to date, can be found in this question: Linear funnel from a collection of events with MongoDB aggregation, is it possible?
Is this possible?
NO. No matter how small your data is, there will always be some initial delay while running MR jobs, because a lot of things happen under the hood: checking input/output paths, split creation, map creation, etc. This is unavoidable.
Is Hadoop a good fit for this?
NO. You can't expect Hadoop to give you results in nanoseconds or even a few milliseconds.
If not what other technologies are there that will make this possible?
If you need something really fast that also scales well, have a look at Storm.
Most of the stuff I'm reading about Hadoop talks about big data, terabytes of the stuff, when we're going to be dealing with megabytes, 10s maybe 100s of thousands of records.
One of the things that gives Hadoop its speed is its clustering abilities with MapReduce; such things, of course, only really apply to "Big Data" (whatever that means nowadays).
In fact map reduce is normally slower than say, the aggregation framework on small data because of how long it takes to actually run an average map reduce.
Map-reduce is really designed for something other than what you're doing.
You could look into storing your data in a traditional database and using that database's aggregation framework, i.e. SQL or MongoDB.
Hadoop is not going to fulfill your requirements. The very first issue is the infrastructure requirement and its administration. The cost of running map-reduce will be higher on Hadoop than on Mongo or other similar technologies if your data is in MBs.
Furthermore, I'd suggest expanding your existing MongoDB infrastructure. The querying and document-based flexibility (such as easy indexing and data retrieval) cannot be achieved easily with Hadoop technologies.
Hadoop 'in general' is moving toward lower latency processing, through projects like Tez for instance. And there are hadoop-like alternatives, like Spark.
But for event processing, and usually that means Storm, the future may already be here, see Storm and Hadoop: Convergence of Big-Data and Low-Latency Processing (also see the slideshare from Hadoop Summit).
Hadoop is a vast ecosystem. There are huge differences in capabilities between the old (1.0), the new (1.3) and the bleeding edge (2.0 and beyond). Can some of these technologies replace Mongo's own M/R? I certainly think so. Can your problem be split out into many parallel tasks (this is actually not clear to me)? Then somewhere between Spark/YARN/Tez there is a solution that would go faster as you throw more hardware at it.
And of course, for a working set that fits in one host's RAM, there will always be an SMP RDBMS that will run circles around clusters...

DynamoDB vs MongoDB NoSQL [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 4 years ago.
I'm trying to figure out what I can use for a future project. We plan to store about 500k records per month in the first year, and maybe more in the following years. This is a vertical application, so there's no need for a relational database here; that's the reason I decided to choose a NoSQL data store.
The first option that came to my mind was MongoDB, since it is a very mature product with a lot of community support. On the other hand, we have a brand-new product that offers a managed service at top performance. I'll develop this application, but there's no maintenance plan (at least for now), so I think that will be a huge advantage, since Amazon provides an elastic way to scale.
My major concern is the query structure. I haven't looked at DynamoDB's query capabilities yet, but since it is a key/value data store, I feel that it could be more limited than MongoDB.
If someone had the experience of moving a project from MongoDB to DynamoDB, any advice will be totally appreciated.
I know this is old, but it still comes up when you search for the comparison. We were using Mongo, have moved almost entirely to Dynamo, which is our first choice now. Not because it has more features, it doesn't. Mongo has a better query language, you can index within a structure, there's lots of little things. The superiority of Dynamo is in what the OP stated in his comment: it's easy. You don't have to take care of any servers. When you start to set up a Mongo sharded solution, it gets complicated. You can go to one of the hosting companies, but that's not cheap either. With Dynamo, if you need more throughput, you just click a button. You can write scripts to scale automatically. When it's time to upgrade Dynamo, it's done for you. That is all a lot of precious stress and time not spent. If you don't have dedicated ops people, Dynamo is excellent.
So we are now going with Dynamo by default. Mongo maybe, if the data structure is complicated enough to warrant it, but then we'd probably go back to a SQL database. Dynamo is obtuse; you really need to think about how you're going to build it, and likely you'll use Redis in ElastiCache to make it work for complex stuff. But it sure is nice to not have to take care of it. You code. That's it.
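As an example of the scripted scaling mentioned above, bumping a table's provisioned throughput is a few lines of boto3 (the table name, region and capacity numbers below are made up for illustration):

import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Raise provisioned throughput on an existing table; DynamoDB applies
# the change online, with no downtime for readers or writers.
dynamodb.update_table(
    TableName="clicks",
    ProvisionedThroughput={
        "ReadCapacityUnits": 200,
        "WriteCapacityUnits": 100,
    },
)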
With 500k documents, there is no reason to scale whatsoever. A typical laptop with an SSD and 8GB of RAM can easily handle tens of millions of records, so if you are choosing because of scaling, your choice doesn't really matter. I would suggest you pick what you like the most, and perhaps the one for which you can find the most online support.
For quick overview comparisons, I really like this website, which has many comparison pages, e.g. AWS DynamoDB vs MongoDB: http://db-engines.com/en/system/Amazon+DynamoDB%3BMongoDB
Short answer: Start with SQL and add NoSQL only when/if needed. (unless you don't need anything beyond very simple queries)
My personal experience: I haven't used MongoDB for queries but as of April 2015 DynamoDB is still very crippled when it comes to anything beyond the most basic key/value queries. I love it for the basic stuff but if you want query language then look to a real SQL database solution.
In DynamoDB you can query on a hash key or on a hash and range key, and you can have multiple global secondary indexes. I'm doing queries on a single table with 4 possible filter parameters and sorting the results; this is supported (barely) through the use of global secondary indexes with filter expressions. The problem comes in when you try to get the total results matching the filter: you can't just ask for the first 10 items matching the filter; rather, DynamoDB checks 10 items and you may get 0 valid results, forcing you to keep re-scanning from the continuation key. This is a pain in the neck and consumes too much of your table's read quota for a simple scenario.
To be specific about the limit problem with filters in the query, this is from the docs (http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html#ScanQueryLimit):
In a response, DynamoDB returns all the matching results within the scope of the Limit value. For example, if you issue a Query or a Scan request with a Limit value of 6 and without a filter expression, the operation returns the first six items in the table that match the request parameters. If you also supply a FilterExpression, the operation returns the items within the first six items in the table that match the filter requirements.
My conclusion is that queries involving FilterExpressions are only usable on rare occasions and are not scalable, because each query can easily read most or all of your table, which consumes far too many DynamoDB read units. Once you use too many read units you'll get throttled and see poor performance.
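The pagination pain described above looks roughly like this in code (a sketch with boto3; the table, index and attribute names are assumptions):

import boto3
from boto3.dynamodb.conditions import Attr, Key

table = boto3.resource("dynamodb", region_name="us-east-1").Table("orders")

# Keep re-querying from LastEvaluatedKey until 10 items survive the
# filter; every pass still consumes read capacity for each item
# examined, not just for the items actually returned.
results = []
kwargs = {
    "IndexName": "status-index",
    "KeyConditionExpression": Key("status").eq("open"),
    "FilterExpression": Attr("country").eq("US"),
    "Limit": 10,
}
while len(results) < 10:
    page = table.query(**kwargs)
    results.extend(page["Items"])
    if "LastEvaluatedKey" not in page:
        break
    kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]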
Expert opinion: At the AWS Summit on Apr 9, 2015, Brett Hollman, Manager, Solutions Architecture, AWS, in his talk on scaling to your first 10 million users, advocated starting with a SQL database and then using NoSQL only when and if it makes sense, because sooner or later you'll probably need a SQL server somewhere in your stack. His slides are here: http://www.slideshare.net/AmazonWebServices/deep-dive-scaling-up-to-your-first-10-million-users
See slide 28.
We chose a combination of Mongo/Dynamo for a healthcare product. Basically, Mongo allows better searching, but the hosted Dynamo is great because it's HIPAA compliant without any extra work. So we host the Mongo portion, with no personal data, on a standard setup and let Amazon deal with the HIPAA portion in terms of infrastructure. We can query certain items from Mongo which bring up documents with pointers (IDs) to the related Dynamo documents.
We chose to do this with Mongo instead of hosting the entire application on Dynamo for two reasons. First, we needed to perform location-based searches, which Mongo is great at; at the time Dynamo was not, though it does have an option now.
Secondly, some documents were unstructured and we did not know ahead of time what the data would be. For example, let's say user A inputs a document in the "form" collection like this: {"username": "user1", "email": "me@me.com"}, and another user puts this in the same collection: {"phone": "813-555-3333", "location": [28.1234,-83.2342]}. With Mongo we can search any of these dynamic and unknown fields at any time; with Dynamo you could do this, but you would have to create an index every time a new field was added that you wanted to be searchable. So if you have never had a phone field in your Dynamo documents before and then all of a sudden someone adds one, it's completely unsearchable.
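A sketch of what that dynamic querying looks like with pymongo (the database name is invented; the collection and field names follow the hypothetical example above, and the geo query assumes a "2d" index on location):

from pymongo import GEO2D, MongoClient

forms = MongoClient("mongodb://localhost:27017").app.form

# Any field can be queried as soon as it appears in some document,
# whether or not it was known when the collection was created.
print(forms.find_one({"phone": "813-555-3333"}))

# Location searches need a geospatial index, but no schema change.
forms.create_index([("location", GEO2D)])
for doc in forms.find({"location": {"$near": [28.1234, -83.2342]}}).limit(5):
    print(doc)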
Now this brings up another point, one you have already mentioned: sometimes choosing the right solution for the job does not mean choosing the best product for the job. For example, you may have a client who needs and will use the system you created for 10+ years. Going with a SaaS/IaaS solution that is good enough to get the job done may be a better option, as you can rely on Amazon to have kept their systems up and maintained over the long haul.
I have worked with both and am kind of a fan of both.
But you need to understand when to use what and for what purpose.
I don't think it's a great idea to move your entire database to DynamoDB, the reason being that querying is difficult except on primary and secondary keys, indexing is limited, and scanning in DynamoDB is painful.
I would go for a hybrid sort of setup, where extensively queryable data lives in MongoDB; with all its features you would never feel constrained when providing enhancements or modifications.
DynamoDB is lightning fast (faster than MongoDB), so DynamoDB is often used as an alternative to sessions in scalable applications. DynamoDB best practices also suggest that if there is plenty of data that is rarely used, move it to another table.
So suppose you have articles or feeds. People are more likely to look for last week's or this month's items; it is rare for anyone to visit two-year-old data. For these cases DynamoDB prefers to have data stored by month or year in different tables.
DynamoDB is seamlessly scalable, something you would have to do manually in MongoDB. However, you will lose out on DynamoDB's performance if you don't understand throughput partitioning and how scaling works behind the scenes.
DynamoDB should be used where speed is critical; MongoDB, on the other hand, has many more tools and features, something DynamoDB lacks.
For example, you can configure a MongoDB replica set so that one of the replicas holds a data snapshot that is 8 (or however many) hours old. Really useful if you messed up something big in your DB and want to get the data as it was before.
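For reference, that delayed replica is configured per replica-set member. A sketch of reconfiguring one member via pymongo follows (the host and member index are placeholders; the field is called secondaryDelaySecs on recent MongoDB versions and slaveDelay on older ones):

from pymongo import MongoClient

admin = MongoClient("mongodb://primary.example.com:27017").admin

config = admin.command("replSetGetConfig")["config"]
member = config["members"][2]            # the member that should lag behind
member["priority"] = 0                   # a delayed member must never become primary
member["hidden"] = True                  # hide it from normal client reads
member["secondaryDelaySecs"] = 8 * 3600  # stay roughly 8 hours behind the primary
config["version"] += 1
admin.command({"replSetReconfig": config})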
That's my opinion though.
Bear in mind, I've only experimented with MongoDB...
From what I've read, DynamoDB has come a long way in terms of features. It used to be a super-basic key-value store with extremely limited storage and querying capabilities. It has since grown, now supporting bigger document sizes + JSON support and global secondary indices. The gap between what DynamoDB and MongoDB offers in terms of features grows smaller with every month. The new features of DynamoDB are expanded on here.
Much of the MongoDB vs. DynamoDB comparisons are out of date due to the recent addition of DynamoDB features. However, this post offers some other convincing points to choose DynamoDB, namely that it's simple, low maintenance, and often low cost. Another discussion here of database choices was interesting to read, though slightly old.
My takeaway: if you're doing serious database queries or working in languages not supported by DynamoDB, use MongoDB. Otherwise, stick with DynamoDB.

NoSQL & AdHoc Queries - Millions of Rows

I currently run a MySQL-powered website where users promote advertisements and gain revenue every time someone completes one. We log every time someone views an ad ("impression"), every time a user clicks an ad ("click"), and every time someone completes an ad ("lead").
Since we get so much traffic, we have millions of records in each of these respective tables. We then have to query these tables to let users see how much they have earned, so we end up performing multiple queries on tables with millions and millions of rows multiple times in one request, hundreds of times concurrently.
We're looking to move away from MySQL to a key-value store or something along those lines. We need something that will let us store all these millions of rows, query them in milliseconds, and MOST IMPORTANTLY, use ad-hoc queries where we can query on any single column, so we could do things like:
FROM leads WHERE country = 'US' AND user_id = 501 (the NoSQL equivalent, obviously)
FROM clicks WHERE ad_id = 1952 AND user_id = 200 AND country = 'GB'
etc.
Does anyone have any good suggestions? I was considering MongoDB or CouchDB but I'm not sure if they can handle querying millions of records multiple times a second and the type of adhoc queries we need.
Thanks!
With those requirements, you are probably better off sticking with SQL and setting up replication/clustering if you are running into load issues. You can set up indexing on a document database so that those queries are possible, but you don't really gain anything over your current system.
NoSQL systems generally improve performance by leaving out some of the more complex features of relational systems. This means that they will only help if your scenario doesn't require those features. Running ad hoc queries on tabular data is exactly what SQL was designed for.
CouchDB's map/reduce is incremental which means it only processes a document once and stores the results.
Let's assume, for a moment, that CouchDB is the slowest database in the world. Your first query with millions of rows takes, maybe, 20 hours. That sounds terrible. However, your second query, your third query, your fourth query, and your hundredth query will take 50 milliseconds, perhaps 100 including HTTP and network latency.
You could say CouchDB fails the benchmarks but gets honors in the school of hard knocks.
I would not worry about performance, but rather if CouchDB can satisfy your ad-hoc query requirements. CouchDB wants to know what queries will occur, so it can do the hard work up-front before the query arrives. When the query does arrive, the answer is already prepared and out it goes!
All of your examples are possible with CouchDB. A so-called merge-join (lots of equality conditions) is no problem. However CouchDB cannot support multiple inequality queries simultaneously. You cannot ask CouchDB, in a single query, for users between age 18-40 who also clicked fewer than 10 times.
The nice thing about CouchDB's HTTP and Javascript interface is, it's easy to do a quick feasibility study. I suggest you try it out!
Most people would probably recommend MongoDB for a tracking/analytics system like this, for good reasons. You should read the "MongoDB for Real-Time Analytics" chapter from the "MongoDB Definitive Guide" book. Depending on the size of your data and scaling needs, you could get all the performance, schema-free storage and ad-hoc querying features. You will need to decide for yourself whether issues with durability and unpredictability of the system are risky for you or not.
For a simpler tracking system, Redis would be a very good choice, offering rich functionality, blazing speed and real durability. To get a feel for how such a system would be implemented in Redis, see this gist. The downside is that you'd need to define all the "indices" yourself, not get them for "free", as is the case with MongoDB. Nevertheless, there's no free lunch, and MongoDB indices are definitely not a free lunch.
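To give a flavour of what defining those "indices" yourself means, here is a tiny sketch with redis-py (the key names and counter layout are invented for illustration, not taken from the gist):

import redis

r = redis.Redis(host="localhost", port=6379)

def track_click(user_id, ad_id, country):
    # Redis offers no ad-hoc queries, so every "index" is just a
    # counter key we maintain by hand at write time.
    r.incr(f"clicks:user:{user_id}")
    r.incr(f"clicks:ad:{ad_id}")
    r.incr(f"clicks:user:{user_id}:country:{country}")

track_click(200, 1952, "GB")
print(r.get("clicks:user:200:country:GB"))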
I think you should also have a look at what ElasticSearch would give you:
Blazing speed
Schema-free storage
Sharding and distributed architecture
Powerful analytic primitives in the form of facets
Easy implementation of "sliding window"-type data storage with index aliases
It is at heart a "full-text search engine", but don't let that confuse you. Read the "Data Visualization with ElasticSearch and Protovis" article for a real-world use case of ElasticSearch as a data mining engine.
Have a look at these slides for a real-world use case of the "sliding window" scenario.
There are many client libraries for ElasticSearch available, such as Tire for Ruby, so it's easy to get off the ground with a prototype quickly.
For the record (with all due respect to @jhs :), based on my experience, I cannot imagine an implementation where CouchDB is a feasible and useful option. It would be an awesome backup store for your data, though.
If your working set can fit in memory, and you index the right fields in the documents, you'd be all set. Your ask is not something very typical, and I am sure that with proper hardware, the right collection design (denormalize!) and indexing you should be good to go. Read up on Mongo querying, and use explain() to test the queries. Stay away from IN and NOT IN clauses, that'd be my suggestion.
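Concretely, "index the right fields" and "use explain()" for the queries in the question might look something like this (a sketch with pymongo; the database name and index choices are assumptions):

from pymongo import ASCENDING, MongoClient

db = MongoClient("mongodb://localhost:27017").tracking

# Compound indexes matching the two ad-hoc filters from the question.
db.leads.create_index([("country", ASCENDING), ("user_id", ASCENDING)])
db.clicks.create_index(
    [("ad_id", ASCENDING), ("user_id", ASCENDING), ("country", ASCENDING)]
)

leads = db.leads.find({"country": "US", "user_id": 501})
clicks = db.clicks.find({"ad_id": 1952, "user_id": 200, "country": "GB"})

# Confirm the winning plan is an index scan (IXSCAN), not a full
# collection scan (COLLSCAN).
print(clicks.explain()["queryPlanner"]["winningPlan"])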
It really depends on your data sets. The number one rule of NoSQL design is to define your query scenarios first. Once you really understand how you want to query the data, then you can look into the various NoSQL solutions out there. The default unit of distribution is the key. Therefore you need to remember that you need to be able to split your data between your node machines effectively, otherwise you will end up with a horizontally scalable system with all the work still being done on one node (albeit better queries, depending on the case).
You also need to think back to the CAP theorem: most NoSQL databases are eventually consistent (CP or AP), while traditional relational DBMSs are CA. This will impact the way you handle data and the creation of certain things; for example, key generation can become tricky.
Also remember that in some systems, such as HBase, there is no indexing concept. All your indexes will need to be built by your application logic, and any updates and deletes will need to be managed as such. With Mongo you can actually create indexes on fields and query them relatively quickly; there is also the possibility to integrate Solr with Mongo. You don't just need to query by ID in Mongo as you do in HBase, which is a column family (aka Google BigTable style) database where you essentially have nested key-value pairs.
So once again it comes down to your data: what you want to store, how you plan to store it, and most importantly how you want to access it. The Lily project looks very promising. In the work I am involved with, we take a large amount of data from the web and we store it, analyse it, strip it down, parse it, analyse it again, stream it, update it, etc. We don't just use one system but many that are best suited to the job at hand. For this process we use different systems at different stages, as it gives us fast access where we need it, provides the ability to stream and analyse data in real time and, importantly, keeps track of everything as we go (as data loss in a prod system is a big deal). I am using Hadoop, HBase, Hive, MongoDB, Solr, MySQL and even good old text files. Remember that productionizing a system using these technologies is a bit harder than installing MySQL on a server; some releases are not as stable, and you really need to do your testing first. At the end of the day it really depends on the level of business resistance and the mission-critical nature of your system.
Another path that no one has mentioned thus far is NewSQL, i.e. horizontally scalable RDBMSs. There are a few out there, like MySQL Cluster (I think) and VoltDB, which may suit your cause.
Again it comes down to understanding your data and the access patterns. NoSQL systems are also non-relational and are therefore better suited to non-relational data sets. If your data is inherently relational and you need SQL query features that really have to do things like Cartesian products (aka joins), then you may well be better off sticking with Oracle and investing some time in indexing, sharding and performance tuning.
My advice would be to actually play around with a few different systems. However, for your use case I think a column family database may be the best solution; I believe a few places have implemented similar solutions to very similar problems (I think the NYTimes is using HBase to monitor user page clicks). Another great example is Facebook and friends, who are using HBase for this. There is a really good article here which may help you along your way and further explain some points above: http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
A final point would be that NoSQL systems are not the be-all and end-all. Putting your data into a NoSQL database does not mean it's going to perform any better than MySQL, Oracle or even text files... For example, see this blog post: http://mysqldba.blogspot.com/2010/03/cassandra-is-my-nosql-solution-but.html
I'd have a look at;
MongoDB - Document - CP
CouchDB - Document - AP
Redis - In memory key-value (not column family) - CP
Cassandra - Column Family - Available & Partition Tolerant (AP)
HBase - Column Family - Consistent & Partition Tolerant (CP)
Hadoop/Hive - Also have a look at Hadoop streaming...
Hypertable - Another CF CP DB.
VoltDB - A really good-looking product, a relational database that is distributed and might work for your case (may be an easier move). They also seem to provide enterprise support, which may be more suited to a prod env (i.e. give business users a sense of security).
Anyway, that's my 2c. Playing around with the systems is really the only way you're going to find out what really works for your case.

Why exactly do we use NoSQL? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
Having understood some of the advantages that NoSQL offers (scalability, availability, etc.), I am still not clear why a website would want to use a non-relational database.
Can I get some help on this, preferably with an example?
Better performance
NoSQL databases sometimes have better performance, although this depends on the situation and is disputed.
Adaptability
You can add and remove "columns" without downtime. In most SQL servers this takes a long time and generates a heavy load.
Application design
It is desirable to separate the data storage from the logic. If you join and select things in SQL queries, you are mixing business logic with storage.
NoSQL databases are there to solve several things, mainly:
(buzz) BigData => think TB, PB, etc..
Working with distributed systems / datasets => say you have 42 products; 13 of them will live in the Chicago datacenter, 21 in NY's and another 8 somewhere in Japan, but once you query across all 42 products, you don't need to know where they are located: the NoSQL DB will. This also allows you to engage a lot more brain power (servers) to solve hard computational problems [it does not seem it would fit your use case, but it is an interesting thing to note]
Partitioning => having your DB be easily distributed, besides those cool 8 products in Japan, also allows for easy data replication, so those 42 products can be replicated with a factor of 3, for example, which means your DB would have 3 copies of every product. Hence if something goes down, no problem => here is a replica available. This is where NoSQL databases actually shine vs. RDBMSs. Granted, you can shard, partition and cluster Oracle / MySQL / PostgreSQL / etc., BUT it is a process that is several orders of magnitude more complicated, and usually a maintenance headache for most people you'd employ.
BUT to your question:
why a website would want to use a non-relational database
When most of the people I have worked with / met / chatted with choose NoSQL for their "website", it is unfortunately NOT for the reasons above, but simply because it is COOLER to do so. And in fact many projects FAIL / have extreme difficulties for this very reason.
If most NoSQL gurus took their masks off, they would all agree that MOST of the problems (or, as people call them, websites) that developers solve day to day can, and rather should, be solved with a SQL solution such as PostgreSQL, MySQL, etc., with some cool Redis cache layer on top. And only a small subset of problems would REALLY benefit from NoSQL.
I personally love Riak, as I am a firm believer that a NoSQL, fault-tolerant DB should have an extremely strong, flexible and naturally distributed foundation, such as Erlang OTP. Plus I am a fan of simplicity. But again, given the problem, I would choose whatever works best, and most of the time I will NEED that consistency (especially if we are talking about money / the financial world / mission-critical systems / etc.).
The main reason not to use an SQL database is scalability. The transactional guarantees and the relational model make it almost impossible to scale a database usefully across more than a few machines, especially given the write-heavy workloads generated by modern web applications.
An app like Facebook can't be made to work on a straightforward SQL database, except by massive partitioning and sharding, which requires significant adjustments to the app logic as well. That's why Facebook developed Cassandra.
NoSQL basically means you make do without some SQL-typical features like immediate consistency or easy joins, in exchange for being able to use a database that scales much better.
Conversely, there is no point in using NoSQL if your website never has more than a dozen concurrent users (which is true for the vast majority of all sites).
We need to understand what the problem is in your current application:
Transactions
Amount of data
Data structure
NoSQL solves the problems of scalability and availability at the expense of atomicity or consistency.
This brings us to the CAP theorem. Eric Brewer noted that of the three properties of shared-data systems, Consistency, Availability and tolerance to network Partitions, only two can be achieved at any given moment in time (the CAP theorem).
NOSQL Approach
Schemaless data representation:
Most of them offer schemaless data representation & allow storing semi-structured data.
They can continue to evolve over time, including adding new fields or even nesting the data, for example in the case of a JSON representation.
Development time:
No complex SQL queries.
No JOIN statements.
Speed:
Very high-speed delivery, and mostly built-in entity-level caching
Plan ahead for scalability:
Avoiding rework
There are many types of NoSQL databases. Web applications typically use document-based databases. A document DB allows us to store JSON, XML, YAML and even Word documents and manipulate them. So NoSQL is an obvious choice; in particular MongoDB, a document database that supports the JSON format by default, is the most preferred choice of developers and designers.

Key-Value Stores vs. RDBMs vs. "Cloud" DBs (SDB) [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
I'm comfortable in the MySQL space having designed several apps over the past few years, and then continuously refining performance and scalability aspects. I also have some experience working with memcached to provide application side speed-ups on frequently queried result sets. And recently I implemented the Amazon SDB as my primary "database" for an ecommerce experiment.
To oversimplify, the quick justification I went through in my mind for using the SDB service was that a schema-less database structure would allow me to focus on the logical problem of my project and rapidly accumulate content in my data store. That is, don't worry about setting up and normalizing all possible permutations of a product's attributes beforehand; simply start loading in the products and SDB will simply remember everything that is available.
Now that I have managed to get through the first few iterations of my project and need to set up simple interfaces to the data, I am running into issues that I had taken for granted when working with MySQL, e.g. grouping in select statements and LIMIT syntax to query "items 50 to 100". The ease advantage I gained from the schema-free architecture of SDB, I lost to the performance hit of querying/looping over a result set with just over 1800 items.
Now I'm reading about projects like Tokyo Cabinet that are extending the concept of in-memory key-value stores to provide pseudo-relational functionality at ridiculously faster speeds (14x, I read somewhere).
My question:
Are there some rudimentary guidelines or heuristics that I, as an application designer/developer, can go through to evaluate which DB technology is the most appropriate at each stage of my project?
Ex: At a prototyping stage where logical/technical unknowns of the application make data structure fluid: use SDB.
At a more mature stage where user deliverables are a priority, use traditional tools where you don't have to spend dev time writing sorting, grouping or pagination logic.
Practical experience with these tools would be very much appreciated.
Thanks SO!
Shaheeb R.
The problems you are finding are why RDBMS specialists view some of the alternative systems with a jaundiced eye. Yes, the alternative systems handle certain specific requirements extremely fast, but as soon as you want to do something else with the same data, the fleetest suddenly becomes the laggard. By contrast, an RDBMS typically manages the variations with greater aplomb; it may not be quite as fast as the fleetest for the specialized workload which the fleetest is micro-optimized to handle, but it seldom deteriorates as fast when called upon to deal with other queries.
The new solutions are not silver bullets.
Compared to traditional RDBMSs, these systems make improvements in some aspects (scalability, availability or simplicity) by trading off other aspects (reduced query capability, eventual consistency, horrible performance for certain operations).
Think of these not as replacements for the traditional database, but as specialized tools for a known, specific need.
Take Amazon SimpleDB, for example: SDB is basically a huge spreadsheet. If that is what your data looks like, then it probably works well, and the superb scalability and simplicity will save you a lot of time and money.
If your system requires very structured and complex queries but you insist on one of these cool new solutions, you will soon find yourself in the middle of re-implementing an amateurish, ill-designed RDBMS, with all of its inherent problems.
In this respect, if you do not know whether these will suit your needs, I think it is actually better to do your first few iterations in a traditional RDBMS, because it gives you the best flexibility and capability, especially in a single-server deployment and under modest load (see the CAP theorem).
Once you have a better idea about what your data will look like and how it will be used, then you can match your needs with an alternative solution.
If you want the simplicity of a cloud-hosted solution but need a relational database, you can check out Amazon Relational Database Service.