I have been hearing alot about nosql key/value databases online now. Can you give an example of what one is used for. What kind of real world data is best for these kind of databases?
I think 'What the heck are you actually using NoSQL for' is an excellent read for real life usage cases for NoSQL databases. Let me quote some of them here:
Managing large streams of non-transactional data: Apache logs, application logs, MySQL logs, clickstreams, etc.
Syncing online and offline data. This is a niche CouchDB has targeted.
Fast response times under all loads.
Avoiding heavy joins for when the query load for complex joins become too large for a RDBMS.
Soft real-time systems where low latency is critical. Games are one example.
Applications where a wide variety of different write, read, query, and consistency patterns need to be supported. There are systems optimized for 50% reads 50% writes, 95% writes, or 95% reads.
Read-only applications needing extreme speed and resiliency, simple
queries, and can tolerate slightly stale data.
Applications requiring moderate performance, read/write access, simple queries, completely
authoritative data.
Read-only application which complex query requirements.
Load balance to accommodate data and usage concentrations and to help keep microprocessors busy.
Real-time inserts, updates, and queries.
Hierarchical data like threaded discussions and parts explosion.
Dynamic table creation.
Two tier applications where low latency data is made available through a fast NoSQL interface, but the data itself can be calculated and updated by high latency Hadoop apps or other low priority apps.
Sequential data reading. The right underlying data storage model needs to be selected. A B-tree may not be the best model for sequential reads.
Slicing off part of service that may need better performance/scalability onto it's own system. For example, user logins may need to be high performance and this feature could use a dedicated service to meet those goals.
Caching. A high performance caching tier for web sites and other applications. Example is a cache for the Data Aggregation System used by the Large Hadron Collider.
Voting.
Real-time page view counters.
User registration, profile, and session data.
Document, catalog management and content management systems. These are facilitated by the ability to store complex documents has a whole rather than organized as relational tables. Similar logic applies to inventory, shopping carts, and other structured data types.
Archiving. Storing a large continual stream of data that is still accessible on-line.
Document-oriented databases with a flexible schema that can handle schema changes over time.
Analytics. Use MapReduce, Hive, or Pig to perform analytical queries and scale-out systems that support high write loads.
Related
For the use-case of Shopping cart (and checkout process) for E-commerce web application, what is better to use a Relational DB (RDBMS) or NoSQL DB as MongoDB/Cassandra/others ?
For the catalog perspective, NoSQL makes ideal use-case with flexible schema, horizontal scaling of data/nodes.
What are the pros/cons of each approach for Shopping Cart use-case?
There are many differences between SQL and noSQL databases. Those differences are what gives each storage type its pros and cons on different situations.
Since both database types would work in the end, it all really depends on the context or on your implementation.
In this specific case (shopping cart), the pros and cons are probably all related to the consistency of your data and scalability.
noSQL databses are better (pros) suited for more "dynamic" applications (data analysis, IoT, multimedia, etc.). Such applications use data that usually doesn't have a rigid structure and comes in very large volumes. This means that there's no need to develop a complex database model and it's cheaper to store large amounts of data throughout separate "nodes". This also makes noSQL databases easier to expand and scale. The main problem (cons) is the lack of structure. This will make it harder for you to run analysis and to keep track of every detail of the database.
Meanwhile, SQL databases are useful (pros) when your data is well-structured and mostly consistent. As you know, SQL stores data in columns and rows, this gives SQL an advantage if you want to generate detailed statistics of your data and also if you want to keep an organized record of everything that happens in your app. The main downside (cons) is that the design of an SQL database takes more time and also it's probably more expensive (scalability and physical storage require more hardware) to maintain a SQL database.
Performancewise, I would argue that in this usecase there wouldn't be any major difference.
If you think about all of what i just wrote, I would say that in the context of a shopping cart, the SQL model is the way to go. A shopping cart won't require lots of upgrades and changes (scalability), its data is always structured (name of item, price, etc.) and you might want to keep track of every transaction a user makes in your ecommerce application (for accountability and safety reasons).
tl;dr use SQL because the data in a shoppingcart usecase is structured and consistent.
good luck!
The general pros/cons of something like Cassandra vs postgres/mysql look like:
Cassandra handles multi-DC HA much better.
Cassandra handles high write volume much better.
Cassandra allows you to reboot hosts without downtime because you'll have multiple replicas (and you wont have to worry about WAL replay or binlog replay or weird master-master replication problems, though some RDBMS addons make this easier for MySQL and Postgres than it used to be).
Cassandra allows you to scale better (linear scaling with number of instances up to ~1200 or so instances)
MySQL/Postgres allow you to build queries as your business requirements evolve by adding indices to existing tables; Cassandra expects you to know the queries in advance and do data modeling before you start writing data.
MySQL/Postgres tends to be easier to use, and you'll find a ton of libraries/UIs/etc to help you get started
MySQL/Postgres offer real transactions / MVCC - Casssandra has lightweight transactions limited to operations on a single key with much weaker isolation/atomicity guarantees.
Ultimately, though, unless you believe your shopping cart is going to handle thousands of concurrent users, it probably doesn't matter (as long as you use something with real data durability guarantees): use what you're most comfortable using. I'd use Cassandra because I know Cassandra very well, but if you're not great with Cassandra (or whatever), use what you know best.
I am developing a JAVA based web application. The primary aim is to have inventory for products being sold on multiple websites called channels. We will act as manager for all these channels.
What we need is:
Queues to manage inventory updates for each channel.
Inventory table which has a correct snapshot of allocation on each channel.
Keeping Session Ids and other fast access data in a cache.
Providing a facebook like dashboard(XMPP) to keep the seller updated asap.
The solutions i am looking at are postgres(our db till now in a synchronous replication mode), NoSQL solutions like Cassandra, Redis, CouchDB and MongoDB.
My constraints are:
Inventory updates cannot be lost.
Job Queues should be executed in order and preferably never lost.
Easy/Fast development and future maintenance.
I am open to any suggestions. thanks in advance.
Queues to manage inventory updates for each channel.
This is not necessarily a database issue. You might be better off looking at a messaging system(e.g. RabbitMQ)
Inventory table which has a correct snapshot of allocation on each channel.
Keeping Session Ids and other fast access data in a cache.
session data should probably be put in a separate database more suitable for the task(e.g. memcached, redis, etc)
There is no one-size-fits-all DB
Providing a facebook like dashboard(XMPP) to keep the seller updated asap.
My constraints are:
1. Inventory updates cannot be lost.
There are 3 ways to answer this question:
This feature must be provided by your application. The database can guarantee that a bad record is rejected and rolled back, but not guarantee that every query will get entered.
The app will have to be smart enough to recognize when an error happens and try again.
some DBs store records in memory and then flush memory to disk peridocally, this could lead to data loss in the case of a power failure. (e.g Mongo works this way by default unless you enable journaling. CouchDB always appends to the records(even a delete is a flag appended to the record so data loss is extremely difficult))
Some DBs are designed to be extremely reliable, even if an earthquake, hurricane or other natural disaster strikes, they remain durable. these include Cassandra, Hbase, Riak, Hadoop, etc
Which type of durability are your referring to?
Job Queues should be executed in order and preferably never lost.
Most noSQL solutions prefer to run in parallel. so you have two options here.
1. use a DB that locks the entire table for every query(slower)
2. build your app to be smarter or evented(client side sequential queuing)
Easy/Fast development and future maintenance.
generally, you will find that SQL is faster to develop at first, but changes can be harder to implement
noSQL may require a little more planning, but is easier to do ad hoc queries or schema changes.
The questions you probably need to ask yourself are more like:
"Will I need to have intense queries or deep analysis that a Map/Reduce is better suited to?"
"will I need to my change my schema frequently?
"is my data highly relational? in what way?"
"does the vendor behind my chosen DB have enough experience to help me when I need it?"
"will I need special feature such as GeoSpatial indexing, full text search, etc?"
"how close to realtime will I need my data? will it hurt if I don't see the latest records show up in my queries until 1sec later? what level of latency is acceptable?"
"what do I really need in terms of fail-over"
"how big is my data? will it fit in memory? will it fit on one computer? is each individual record large or small?
"how often will my data change? is this an archive?"
If you are going to have multiple customers(channels?) each with their own inventory schemas, a document based DB might have it's advantages. I remember one time I looked at an ecommerce system with inventory and it had almost 235 tables!
Then again, if you have certain relational data, a SQL solution can really have some advantages too.
I can certainly see how I could build a solution using mongo, couch, riak or orientdb with the given constraints. But as for which is the best? I would try talking directly DB vendors, and maybe watch the nosql tapes
Addressing your constraints:
Most NoSQL solutions give you a configurable tradeoff of consistency vs. performance. In MongoDB, for instance, you can decide how durable a write should be. If you want to, you can force the write to be fsync'ed on all your replica set servers. On the other extreme, you can choose to send the command and don't even wait for the server's response.
Executing job queues in order seems to be an application code issue. I'd say a timestamp in the db and an order by type of query should do for most applications. If you have multiple application servers and your queues need to be perfect, you'd have to use a truly distributed algorithm that provides ordering, but that is not a typical requirement, and it's very tricky indeed.
We've been using MongoDB for some time now, and I'm convinced this gives your app development speed a real boost. There's no big difference in maintenance, maintaining data is a pain either way. Not having a schema gives you added flexibility (lazy migrations), but it's more elaborate and requires some care.
In summary, I'd say you can do it both ways. The NoSQL is more code driven, and transactions and relational integrity are mostly managed by your code. If you're uncomfortable with that, go for a relational DB.
However, if you're data grows huge, you'll have to code some of this logic manually because you probably wouldn't want to do real-time joins on a 10B row database. Still, you can implement that with SQL as well.
A good way to find the boundary for different databases is to consider what you can cache. Data that can be cached and reconstructed at any time are a great way to start introducing a new layer, because there's no big risks there. Also, cached data usually doesn't keep any relations so you're not sacrificing any consistency here.
NoSQL is not correct for this application.
I mean, you can use it sure, but you will end up re-implementing a lot of what SQL offers for you. For example I see a lot of relations there. You also want ACID (although some NoSQL solutions do offer that).
There is no reason you can't use both - keep relational data in relational databases, and non-relational data in key/value stores.
Does it make sense to break up the data model of an application into different database systems? For example, the application stores all user data and relationships in a graph database (ideal for storing relationships), while storing other data in a document database, such as CouchDB or MongoDB? This would require the user graph database to reference unique ids in the document databases and vice versa.
Is this over complicating the data model and application? Or is this using the best uses of both types of database systems for scaling your application?
It definitely can make sense and depends fully on the requirements of your application. If you can use other database systems for things in which they are really good at.
Take for example full text search. Of course you can do more or less complex full text searches with a relational database like MySql. But there are systems like e.g. Lucene/Solr which are optimized for such things and can search fast in millions of documents. So you could use these systems for their special task (here: make a nifty full text search), then you return the identifiers and maybe load the relational structured data from the RDBMS.
Or CouchDB. I use couchDB in some projects as a caching systems. In combination with a relational database. Of course I need to care about consistency, but it it's definitely worth the effort. It pushed performance in the projects a lot and decreases for example load on the server from 2 to 0.2. :)
Something like this is for instance called cross-store persistence. As you mentioned you would store certain data in your relational database, social relationships in a graphdb, user-generated data (documents) in a document-db and user provided multimedia files (pictures, audio, video) in a blob-store like S3.
It is mainly about looking at the use-cases and making sure that from wherever you need it you might access the "primary" or index key of each store (back and forth). You can encapsulate the actual lookup in your domain or dao layer.
Some frameworks like the Spring Data projects provide some initial kind of cross-store persistence out of the box, mostly integrating JPA with a different NOSQL datastore. For instance Spring Data Graph allows it to store your entities in JPA and add social graphs or other highly interconnected data as a secondary concern and leverage a graphdb for the typical traversal and other graph operations (e.g. ranking, suggestions etc.)
Another term for this is polyglot persistence.
Here are two contrary positions on the question:
Pro:
"Contrary to that, I’m a big fan of polyglot persistence. This simply means using the right storage backend for each of your usecases. For example file storages, SQL, graph databases, data ware houses, in-memory databases, network caches, NoSQL. Today there are mostly two storages used, files and SQL databases. Both are not optimal for every usecase."
http://codemonkeyism.com/nosql-polyglott-persistence/
Con:
"I don’t think I need to say that I’m a proponent of polyglot persistence. And that I believe in Unix tools philosophy. But while adding more components to your system, you should realize that such a system complexity is “exploding” and so will operational costs grow too (nb: do you remember why Twitter started to into using Cassandra?) . Not to mention that the more components your system has the more attention and care must be invested figuring out critical aspects like overall system availability, latency, throughput, and consistency."
http://nosql.mypopescu.com/post/1529816758/why-redis-and-memcached-cassandra-lucene
We're designing an OLTP financial system. it should be able to support 10.000 transactions per second and have reporting features.
So we have come to the idea of using:
a NoSQL DB as our main storage
a MySQL DB (Percona server actually) making some ETLs from the NoSQL DB for store reporting data
We're considering MongoDB and Riak for the NoSQL job. we have read that Riak scales more smoothly than MongoDB. And we would like to listen your opinion.
Which NoSQL DB would you use for a
OLTP financial system?
How has been
your experience scaling MongoDB/Riak?
There is no conceivable circumstance where I would use a NOSQl database for anything to do with finance. You simply don't have the data integrity needed or the internal controls. Dow Jones uses SQL Server to do its transactions and if they can properly design a high performance, high transaction Relational datbase so can you. You will have to invest in some people who know what they are doing though.
One has to think about the problem differently. The notion of transaction consistency stems from the UD (update) in CRUD (Create, Read, Update, Delete). noSQL DBs are CRAP (Create, Replicate, Append, Process) oriented, working by accretion of time-stamped data. With the right domain model, there is no reason that auditability and the equivalent of referential integrity can't be achieved.
The global-storage based NoSQL databases - Cache from InterSystems and GT.M from FIS - are used extensively in financial services and have been for many years. Cache in particular is used for both the core database and for OLTP.
I can answer regarding my experience with scaling Riak.
Riak scales smoothly to the extreme. Scaling is as easy as adding nodes to the cluster, which is a very simple operation in itself. You can achieve near linear scalability by simply adding nodes. Our experience with Riak as far as scaling is concerned has been amazing.
The flip side is that it is lacking in many respects. Some examples:
You can't do something like count(*) or list keys on a production cluster. That would require a work around if you want to do ETL from Riak into MySQL - or how would you know what to (E)xtract?
(One possible work around would be to maintain a bucket with a known key sequence that map to values that contain the keys you inserted into your other buckets).
The free version of Riak comes with no management console that lets you know what's going on, and the one that's included in the Enterprise version isn't much of an improvement.
You'll need the Enterprise version of you're looking to replicate your data over WAN (e.g. for DR / high availability). That's alright if you don't mind paying, but keep in mind that Basho pricing is very high.
I work with the Starcounter (so I’m biased), but I think I can safely say that for a system processing financial transactions you have to worry about transaction consistency. Unfortunately, this is what the engines used for Facebook and Twitter had to give up allow their scale-out strategy to offer performance. This is not because engines such as MongoDb or Cassandra are poorly designed; rather, it follows naturally from the CAP theorem (http://en.wikipedia.org/wiki/CAP_theorem). Simply put, changes you make in your database will overwrite other changes if they occur close in time. Ok for status updates and new tweets, but disastrous if you deal with money or other quantities. The amounts will simply end up wrong when many reads and writes are being done in parallel. So for the throughput you need, a memory centric NoSQL database with ACID support is probably the way to go.
You can use some NoSQL databases (Cassandra, EventStore) as a storage for financial service if you implement your app using event sourcing and concepts from DDD. I recommend you to read this minibook http://www.oreilly.com/programming/free/reactive-microservices-architecture.html
OLTP can be achieved using NoSQL with a custom implementation,
there are two things,
1. How are you going to achieve ACID properties that an RDBMS gives.
2. Provide a custom blocking or non blocking concurrency and transaction handling mechanism.
To take you closer to solution,
Apache Phoenix,apache trafodion or Splice machine.
Trafodion has full ACID support over HBase, you should take a look.
Cassandra can be used for both OLTP and OLAP. Good replication and eventual data consistency gives you the choice in your hand. Need to design the system properly. And after all it's free of cost but not free of developer, give it a try
What are the advantages of using NoSQL databases? I've read a lot about them lately, but I'm still unsure why I would want to implement one, and under what circumstances I would want to use one.
Relational databases enforces ACID. So, you will have schema based transaction oriented data stores. It's proven and suitable for 99% of the real world applications. You can practically do anything with relational databases.
But, there are limitations on speed and scaling when it comes to massive high availability data stores. For example, Google and Amazon have terabytes of data stored in big data centers. Querying and inserting is not performant in these scenarios because of the blocking/schema/transaction nature of the RDBMs. That's the reason they have implemented their own databases (actually, key-value stores) for massive performance gain and scalability.
NoSQL databases have been around for a long time - just the term is new. Some examples are graph, object, column, XML and document databases.
For your 2nd question: Is it okay to use both on the same site?
Why not? Both serves different purposes right?
NoSQL solutions are usually meant to solve a problem that relational databases are either not well suited for, too expensive to use (like Oracle) or require you to implement something that breaks the relational nature of your db anyway.
Advantages are usually specific to your usage, but unless you have some sort of problem modeling your data in a RDBMS I see no reason why you would choose NoSQL.
I myself use MongoDB and Riak for specific problems where a RDBMS is not a viable solution, for all other things I use MySQL (or SQLite for testing).
If you need a NoSQL db you usually know about it, possible reasons are:
client wants 99.999% availability on
a high traffic site.
your data makes
no sense in SQL, you find yourself
doing multiple JOIN queries for
accessing some piece of information.
you are breaking the relational
model, you have CLOBs that store
denormalized data and you generate
external indexes to search that data.
If you don't need a NoSQL solution keep in mind that these solutions weren't meant as replacements for an RDBMS but rather as alternatives where the former fails and more importantly that they are relatively new as such they still have a lot of bugs and missing features.
Oh, and regarding the second question it is perfectly fine to use any technology in conjunction with another, so just to be complete from my experience MongoDB and MySQL work fine together as long as they aren't on the same machine
Martin Fowler has an excellent video which gives a good explanation of NoSQL databases. The link goes straight to his reasons to use them, but the whole video contains good information.
You have large amounts of data - especially if you cannot fit it all on one physical server as NoSQL was designed to scale well.
Object-relational impedance mismatch - Your domain objects do not fit well in a relaitional database schema. NoSQL allows you to persist your data as documents (or graphs) which may map much more closely to your data model.
NoSQL is a database system where data is organized into the document (MongoDB), key-value pair (MemCache, Redis), and graph structure form(Neo4J).
Maybe there are possible questions and answer for "When to go for NoSQL":
Require flexible schema or deal with tree-like data?
Generally, in agile development we start designing systems without knowing all requirements upfront, whereas later on throughout the development database system may need to accommodate frequent design changes, showcasing MVP (Minimal Viable product).
Or you are dealing with a data schema that is dynamic in nature.
e.g. System logs, very precise example is AWS cloudtrail logs.
Data set is vast/big?
Yes NoSQL databases are the better candidate for applications where the database needs to manage millions or even billions of records without compromising performance and availability while may be trading for inconsistency(though modern databases are exception here where it allows tunable consistency over availability e.g. Casandra, Cloud provider databases CosmosDB, DynamoDB).
Trade-off between scaling over consistency
Unlike RDMS, NoSQL databases may make the dataset consistent across other nodes eventually which is the default behavior, but it's easy to scale in terms of performance and availability.
Example: This may be good for storing people who are online in the instant messaging app, API tokens in DB, and logging website traffic stats.
Performing Geolocation Operations:
MongoDB hash rich support for doing GeoQuerying & Geolocation operations. I really loved this feature of MongoDB. So does the PostresSQL but ease of implementation is something that depends on the use case
In nutshell, MongoDB is a great fit for applications where you can store dynamic structured data on a large scale.
Edits:
Updated the answer about the consistency of the database.
Some essential information is missing to answer the question: Which use cases must the database be able to cover? Do complex analyses have to be performed from existing data (OLAP) or does the application have to be able to process many transactions (OLTP)? What is the data structure? That is far from the end of question time.
In my view, it is wrong to make technology decisions on the basis of bold buzzwords without knowing exactly what is behind them. NoSQL is often praised for its scalability. But you also have to know that horizontal scaling (over several nodes) also has its price and is not free. Then you have to deal with issues like eventual consistency and define how to resolve data conflicts if they cannot be resolved at the database level. However, this applies to all distributed database systems.
The joy of the developers with the word "schema less" at NoSQL is at the beginning also very big. This buzzword is quickly disenchanted after technical analysis, because it correctly does not require a schema when writing, but comes into play when reading. That is why it should correctly be "schema on read". It may be tempting to be able to write data at one's own discretion. But how do I deal with the situation if there is existing data but the new version of the application expects a different schema?
The document model (as in MongoDB, for example) is not suitable for data models where there are many relationships between the data. Joins have to be done on application level, which is additional effort and why should I program things that the database should do.
If you make the argument that Google and Amazon have developed their own databases because conventional RDBMS can no longer handle the flood of data, you can only say: You are not Google and Amazon. These companies are the spearhead, some 0.01% of scenarios where traditional databases are no longer suitable, but for the rest of the world they are.
What's not insignificant: SQL has been around for over 40 years and millions of hours of development have gone into large systems such as Oracle or Microsoft SQL. This has to be achieved by some new databases. Sometimes it is also easier to find an SQL admin than someone for MongoDB. Which brings us to the question of maintenance and management. A subject that is not exactly sexy, but that is a part of the technology decision.
Handling A Large Number Of Read Write Operations
Look towards NoSQL databases when you need to scale fast. And when do you generally need to scale fast?
When there are a large number of read-write operations on your website & when dealing with a large amount of data, NoSQL databases fit best in these scenarios. Since they have the ability to add nodes on the fly, they can handle more concurrent traffic & big amount of data with minimal latency.
Flexibility With Data Modeling
The second cue is during the initial phases of development when you are not sure about the data model, the database design, things are expected to change at a rapid pace. NoSQL databases offer us more flexibility.
Eventual Consistency Over Strong Consistency
It’s preferable to pick NoSQL databases when it’s OK for us to give up on Strong consistency and when we do not require transactions.
A good example of this is a social networking website like Twitter. When a tweet of a celebrity blows up and everyone is liking and re-tweeting it from around the world. Does it matter if the count of likes goes up or down a bit for a short while?
The celebrity would definitely not care if instead of the actual 5 million 500 likes, the system shows the like count as 5 million 250 for a short while.
When a large application is deployed on hundreds of servers spread across the globe, the geographically distributed nodes take some time to reach a global consensus.
Until they reach a consensus, the value of the entity is inconsistent. The value of the entity eventually gets consistent after a short while. This is what Eventual Consistency is.
Though the inconsistency does not mean that there is any sort of data loss. It just means that the data takes a short while to travel across the globe via the internet cables under the ocean to reach a global consensus and become consistent.
We experience this behaviour all the time. Especially on YouTube. Often you would see a video with 10 views and 15 likes. How is this even possible?
It’s not. The actual views are already more than the likes. It’s just the count of views is inconsistent and takes a short while to get updated.
Running Data Analytics
NoSQL databases also fit best for data analytics use cases, where we have to deal with an influx of massive amounts of data.
I came across this question while looking for convincing grounds to deviate from RDBMS design.
There is a great post by Julian Brown which sheds lights on constraints of distributed systems. The concept is called Brewer's CAP Theorem which in summary goes:
The three requirements of distributed systems are : Consistency, Availability and Partition tolerance (CAP in short). But you can only have two of them at a time.
And this is how I summarised it for myself:
You better go for NoSQL if Consistency is what you are sacrificing.
I designed and implemented solutions with NoSQL databases and here is my checkpoint list to make the decision to go with SQL or document-oriented NoSQL.
DON'Ts
SQL is not obsolete and remains a better tool in some cases. It's hard to justify use of a document-oriented NoSQL when
Need OLAP/OLTP
It's a small project / simple DB structure
Need ad hoc queries
Can't avoid immediate consistency
Unclear requirements
Lack of experienced developers
DOs
If you don't have those conditions or can mitigate them, then here are 2 reasons where you may benefit from NoSQL:
Need to run at scale
Convenience of development (better integration with your tech stack, no need in ORM, etc.)
More info
In my blog posts I explain the reasons in more details:
7 reasons NOT to NoSQL
2 reasons to NoSQL
Note: the above is applicable to document-oriented NoSQL only. There are other types of NoSQL, which require other considerations.
Ran into this thread and wanted to add my experience.. Many SQL databases support json data in columns and support querying of this json. So what I have used is a hybrid using a relational database with columns containing json..