Key-Value Stores vs. RDBMs vs. "Cloud" DBs (SDB) [closed] - rdbms

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 6 years ago.
Improve this question
I'm comfortable in the MySQL space having designed several apps over the past few years, and then continuously refining performance and scalability aspects. I also have some experience working with memcached to provide application side speed-ups on frequently queried result sets. And recently I implemented the Amazon SDB as my primary "database" for an ecommerce experiment.
To oversimplify, a quick justification I went through in my mind for using the SDB service was that using a schema-less database structure would allow me to focus on the logical problem of my project and rapidly accumulate content in my data-store. That is, don't worry about setting up and normalize all possible permutations of a product's attributes before hand; simply start loading in the products and the SDB will simply remember everything that is available.
Now that I have managed to get through the first few iterations of my project and I need to setup simple interfaces to the data, I am running to issues that I had taken for granted working with MySQL. Ex: grouping in select statements and limit syntax to query "items 50 to 100". The ease advantage I gained using schema free architecture of SDB, I lost to a performance hit of querying/looping a resultset with just over 1800 items.
Now I'm reading about projects like Tokyo Cabinet that are extending the concept of in-memory key-value stores to provide pseudo-relational functionality at ridiculously faster speeds (14x i read somewhere).
My question:
Are there some rudimentary guidelines or heuristics that I as an application designer/developer can go through to evaluate which DB tech is the most appropriate at each stage of my project.
Ex: At a prototyping stage where logical/technical unknowns of the application make data structure fluid: use SDB.
At a more mature stage where user deliverables are a priority, use traditional tools where you don't have to spend dev time writing sorting, grouping or pagination logic.
Practical experience with these tools would be very much appreciated.
Thanks SO!
Shaheeb R.

The problems you are finding are why RDBMS specialists view some of the alternative systems with a jaundiced eye. Yes, the alternative systems handle certain specific requirements extremely fast, but as soon as you want to do something else with the same data, the fleetest suddenly becomes the laggard. By contrast, an RDBMS typically manages the variations with greater aplomb; it may not be quite as fast as the fleetest for the specialized workload which the fleetest is micro-optimized to handle, but it seldom deteriorates as fast when called upon to deal with other queries.

The new solutions are not silver bullets.
Compared to traditional RDBMS, these systems make improvements in some aspect (scalability, availability or simplicity) by trading-off other aspects (reduced query capability, eventual consistency, horrible performance for certain operations).
Think of these not as replacements of the traditional database, but they are specialized tools for a known, specific need.
Take Amazon Simple DB for example, SDB is basically a huge spreadsheet, if that is what your data looks like, then it probably works well and the superb scalability and simplicity will save you a lot of time and money.
If your system requires very structured and complex queries but you insist with one of these cool new solution, you will soon find yourself in the middle of re-implementing a amateurish, ill-designed RDBMS, with all of its inherent problems.
In this respect, if you do not know whether these will suit your need, I think it is actually better to do your first few iterations in a traditional RDBMS because they give you the best flexibility and capability especially in a single server deployment and under modest load. (see CAP Theorem).
Once you have a better idea about what your data will look like and how will they be used, then you can match your need with an alternative solution.
If you want the simplicity of a cloud hosted solution, but needs a relational database, you can check out: Amazon Relational Database Service

Related

Product Catalog - Document Store or Column Family Store [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
Wondering which technology would do better for a typical product catalog of a webshop. I'm writing my master thesis about nosql in the enterprise environment and focused on document stores for to long now I think.
Read a lot articles which recommend document stores because of it's flexibilty which is needed to model thousands of different products. But as far as I know now, Column-Family Stores like Cassandra offer the same flexibility.
What I like most of the idea of using cassandra is, what nosql-database.org says about it (marked the most interesting features):
massively scalable, partitioned row store, masterless architecture, linear scale performance, no single points of failure, read/write support across multiple data centers & cloud availability zones. API / Query Method: CQL and Thrift, replication: peer-to-peer, written in: Java, Concurrency: tunable consistency, Misc: built-in data compression, MapReduce support, primary/secondary indexes, security features.
In the end I focus on building a prototype of a highly available and scaleable Multishop System which makes use of polyglot persistence, saying K/V Stores for Sessions, Document Store or Column-Family Store for Product Catalog and maybe RDBMS for Inventory/Pricing like Sadalage and Fowler mentioned in their book "NoSQL Destilled".
If possible, provide scientific papers or other reliable sources for your answers.
Thanks!
Document Store's Achilles Heel
Stuart Halloway mentioned that a document store is the biggest schema lock solution that is way too inflexible, which I agree with. Couch/Mongo and others try to mitigate that by providing workarounds to create secondary indicies, ability and necessity to be aware of plain object ids, etc. And of course if you think about versioning (i.e. add a "time" variable to your system), document stores fail fast to provide a smooth support and time travel.
Column Store: Problem Relevance
Cassandra is a really compelling solution for building "scalable"/"distributed" systems with real examples such as Netflix, where 500 Cassandra nodes can be brought up in AWS for several minutes, and all the requests hit a Cassandra ring.
However, given the problem as it is stated in your question, Cassandra would be an unnecessary overkill. Not just because it is a bit more complex than "others", or because it is mentally harder to create a solid data model on top of column oriented stores, but also because a "product catalog" problem is not quite a rocket science. It can be, if you want to add machine learning later to predict/recognize/etc.., but a catalog itself is not, and simpler stores such as PostgreSQL for example would solve it easily.
Simple Desire to NoSQL
If you really want to use NoSQL for a product catalog, I would definitely consider 3 solutions to fit your prototype:
Riak as a "K/V for Sessions"
Datomic to solve "Product Catalog, Inventory and Pricing"
Depending on the size and nature of the problem and the final solution, I would consider Redis to cache those sessions, while having Datomic comfortably sit on top of Riak as its storage service.
Practice vs. Theory
Two classical NoSQL papers that made NoSQL sound real in practice for the first time are Dynamo and BigTable. I consider Datomic to be the next evolutionary step in the DB universe by introducing a hybrid data model with true indicies and relations without a schema lock, and immutability from which everything follows: safe time travel, caching, local db values, etc.
Practically, if it wasn't a master theses, depending on the real problem scale and definition, I would be choosing between Datomic and PostreSQL to solve catalog, inventory, pricing, etc.
A big advantage of Datomic here is time travel. In practice it is very important to be able to safely and easily do that in a "Shopping System".
A big advantage of PostgreSQL is its familiarity and SQL tools availability for analytics and reporting.
By now I think that Column-Family Stores are not well suited for product calaloges.
It's because products often contain some kind of collections like tags, tracklists for music records, different sizes for clothes and so on.
Cassandra supports collections by now BUT they are not searchable! This is a must have feature for tags for example.
In contrast MongoDb for example offers the $in operator to search in nested arrays...
I don't want to say it is not possible to model a product calalog in Cassandra but I think it is much more straight forward to do it in a document store.

Can NoSQL (e.g. MongoDB) replace Data Grid solutions e.g. Oracle Coherence [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
Improve this question
I'm looking for an opinion on replacing existing Data Grid (i.e. Oracle Coherence) with some document store alternative e.g. NoSQL MongoDB. I was think about the most important pros and cons and came up with:
NoSQL
Pros:
No additional database
No ORM mapping necessary
Although the best query efficiency can be achieved when looking up by ID, other queries can be satisfied by map/reduce queries
Cons:
Quite difficult to achieve data consistency when updating multiple collections or even multiple rows in a same collection
Slower response time ? (i suspect that Coherence reponse time might be better)
A read operation can return old data
Data Grid
Pros
With a Data Grid it seems easier to keep data consistent e.g. the data grid becomes is a SOR (System of Record)
As Data Grid becomes SOR, all data should always be available in the grid
Remote Executors
Cons
Additional database means additional overhead & system/application requirements
With a huge amount of data and sharding in place any kind of queries can take a lot of time
Couchbase Server is a very good replacement for Oracle Coherence particularly for enterprise class applications. Orbitz is a great example where large number of nodes of Coherence were replaced by 70 nodes of Couchbase.
You can read more about the Coherence replacement here: http://gigaom.com/cloud/balancing-oracle-and-open-source-at-orbitz/
Slides from an Orbitz presentation about Couchbase are also available here: http://www.slideshare.net/Couchbase/t1-s6-oww-usescouchbase
Pros:
High availability of nodes using replication and failover (avoid cold cache scenarios)
Sub-milli second latencies (built-in object level cache based on memcached)
High read / write throughput (very low granularity of locking) ( http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/white_paper_c11-708169.pdf)
Strong consistency at a document / item level
TTL / Expiry per document / item
Cons:
Difficult to achieve consistency across multi-document updates. Can be achieved using sentinels. ( http://www.amainhobbies.com/FromTheCEO/2012/09/09/invalidating-couchbase-cache-entries-with-sentinels/ )
It can, but so can a pen and paper system.
The question is, will it be an acceptable replacement. That wholly depends on the situation. In some cases a NoSQL solution is faster, more scalable than a relational solution, but in some situations it is essential to have some kind of support for longer running transactions and relational constraints.
It depends.
You already gave the pros and cons in detail...
as iwein said it depends...
What are the queries that existing relational system forced?
we know that partitioning in nosql db's are easier than realtional db's...
So if you switch to mongo you can extend your systems performance more cheaper and quicker way...
if people are happy on your oracle system now. don't touch it :)
Yes - NoSQL can replace it. But a lot depends on what you are trying to do.
If you just need a simple document store with easy key based lookups - NoSQL is a no-brainer.
If you need an enterprise class solution with paid for support and features such as custom aggregation, entry processors etc etc. Maybe Coherence is what you want.
I've seen people build custom NoSQL solutions on top of Coherence - which is a really expensive thing to do.

Redis, CouchDB or Cassandra? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
What are the strengths and weaknesses of the various NoSQL databases available?
In particular, it seems like Redis is weak when it comes to distributing write load over multiple servers. Is that the case? Is it a big problem? How big does a service have to grow before that could be a significant problem?
The strengths and weaknesses of the NoSQL databases (and also SQL databases) is highly dependent on your use case. For very large projects, performance is king; but for brand new projects, or projects where time and money are limited, simplicity and time-to-market are probably the most important. For teaching yourself (broadening your perspective, becoming a better, more valuable programmer), perhaps the most important thing is simple, solid fundamental concepts.
What kind of project do you have in mind?
Some strengths and weaknesses, off the top of my head:
Redis
Very simple key-value "global variable server"
Very simple (some would say "non-existent") query system
Easily the fastest in this list
Transactions
Data set must fit in memory
Immature clustering, with unclear future (I'm sure it'll be great, but it's not yet decided.)
Cassandra
Arguably the most community momentum of the BigTable-like databases
Probably the easiest of this list to manage in big/growing clusters
Support for map/reduce, good for analytics, data warehousing
MUlti-datacenter replication
Tunable consistency/availability
No single point of failure
You must know what queries you will run early in the project, to prepare the data shape and indexes
CouchDB
Hands-down the best sync (replication) support, supporting master/slave, master/master, and more exotic architectures
HTTP protocol, browsers/apps can interact directly with the DB partially or entirely. (Sync is also done over HTTP)
After a brief learning curve, pretty sophisticated query system using Javascript and map/reduce
Clustered operation (no SPOF, tunable consistency/availability) is currently a significant fork (BigCouch). It will probably merge into Couch but there is no roadmap.
Similarly, clustering and multi-datacenter are theoretically possible (the "exotic" thing I mentioned) however you must write all that tooling yourself at this time.
Append only file format (both databases and indexes) consumes disk surprisingly quickly, and you must manually run compaction (vacuuming) which makes a full copy of all records in the database. The same is required for each index file. Again, you have to be your own toolsmith.
Take a look at http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis He does a good job summing up why you would use one over the other.

Why exactly do we use NoSQL? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
Improve this question
Having understood some of the advantages that NoSQL offers (scalability, availability, etc.), I am still not clear why a website would want to use a non-relational database.
Can I get some help on this, preferably with an example?
Better performance
NoSQL databases sometimes have better performance, although this depends on the situation and is disputed.
Adaptability
You can add and remove "columns" without downtime. In most SQL servers, this takes a long time and takes up a load of load.
Application design
It is desirable to separate the data storage from the logic. If you join and select things in SQL queries, you are mixing business logic with storage.
NoSQL databases are there to solve several things, mainly:
(buzz) BigData => think TB, PB, etc..
Working with Distributed Systems / datasets => say you have 42 products, so 13 of them will live in Chicago datacenter, 21 in NY's and another and 8 somewhere in Japan, but once you query against all 42 products, you would not need to know where they are located: NoSQL DB will. This also allows to engage a lot more brain power ( servers ) to solve hard computational problems [ does not seem it would fit your use case, but it is an interesting thing to note ]
Partitioning => having your DB be easily distributed, besides those cool 8 products in Japan, also allows for an easy data replication, so those 42 products will be replicated with a factor of 3, for example, which would mean you DB would have 3 copies for every product. Hence if something goes down, no problem => here is a replica available. This is where NoSQL databases actually shine vs. RDBMS. Granted you can shard, partition and cluster Oracle / MySQL / PostgreSQL / etc.. BUT it is a several magnitudes more complicated process and usually a maintenance headache for most people you'd employ.
BUT to your question:
why a website would want to use a non-relational database
When most of the people, I worked with / met / chatted with, choose NoSQL for their "website", it is unfortunately NOT for the reasons above, but simply because it is COOLER to do so. And in fact many projects FAIL / have extreme difficulties due to this reason.
If most of NoSQL gurus take their masks off, they will all agree that MOST of the problems ( or as people call them websites ) that developers solve day to day, can and rather be solved with a SQL solution, such as PostgreSQL, MySQL, etc.. with some cool Redis cache layer on top of it. And only a small subset of problems would REALLY benefit from NoSQL.
I personally love Riak, as I am a firm believer that a NoSQL, fault tolerant DB should have an extremely strong, flexible and naturally distributed foundation => such as Erlang OTP. Plus I am a fan of simplicity. But again, given the problem, I would choose whatever works best, and most of the time I will NEED that consistency ( especially if we are talking about money / financial world / mission critical / etc.. ).
The main reason not to use an SQL database is scalability. The transactional guarantees and the relational model make it almost impossible to scale a database usefully across more than a few machines, especially given the write-heavy workloads generated by modern web applications.
An app like Facebook can't be made to work on a straightforward SQL database, except by massive partitioning and sharding, which requires significant adjustments to the app logic as well. That's why Facebook developed Cassandra.
NoSQL basically means you make do without some SQL-typical features like immediate consistency or easy joins, in exchange for being able to use a database that scales much better.
Conversely, there is no point in using NoSQL if your website never has more than a dozen concurrent users (which is true for the vast majority of all sites).
We need understand what is your problem in the current application?
Transactions
Amount of data
Data structure
NoSQL solves the problems of scalability and availability against that of atomicity or consistency.
Basic drive us to CAP theorem. Eric Brewer also noted that Of the three properties of shared-data systems – Consistency, Availability and tolerance to network Partitions – only two can be achieved any given moment in time. (CAP theorem)
NOSQL Approach
Schemaless data representation:
Most of them offer schemaless data representation & allow storing semi-structured data.
Can continue to evolve over time— including adding new fields or even nesting the data, for example, in case of JSON representation.
Development time:
No complex SQL queries.
No JOIN statements.
Speed:
Very High speed delivery & Mostly in-built entity-level caching
Plan ahead for scalability:
Avoiding rework
There are many types of NoSQL databases. The web applications uses document based databases. The document db allows us to store JSON,XML,YAML and even Word documents and manipulate them. So, NoSQL is the obvious choice, especially MongoDB which is a Document Database which supports JSON format by default is the most preferred choice of developers and designers.

MongoDB vs. Cassandra [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
Improve this question
I am evaluating what might be the best migration option.
Currently, I am on a sharded MySQL (horizontal partition), with most of my data stored in JSON blobs. I do not have any complex SQL queries (already migrated away after since I partitioned my db).
Right now, it seems like both MongoDB and Cassandra would be likely options. My situation:
Lots of reads in every query, less regular writes
Not worried about "massive" scalability
More concerned about simple setup, maintenance and code
Minimize hardware/server cost
Lots of reads in every query, fewer regular writes
Both databases perform well on reads where the hot data set fits in memory. Both also emphasize join-less data models (and encourage denormalization instead), and both provide indexes on documents or rows, although MongoDB's indexes are currently more flexible.
Cassandra's storage engine provides constant-time writes no matter how big your data set grows. Writes are more problematic in MongoDB, partly because of the b-tree based storage engine, but more because of the multi-granularity locking it does.
For analytics, MongoDB provides a custom map/reduce implementation; Cassandra provides native Hadoop support, including for Hive (a SQL data warehouse built on Hadoop map/reduce) and Pig (a Hadoop-specific analysis language that many think is a better fit for map/reduce workloads than SQL). Cassandra also supports use of Spark.
Not worried about "massive" scalability
If you're looking at a single server, MongoDB is probably a better fit. For those more concerned about scaling, Cassandra's no-single-point-of-failure architecture will be easier to set up and more reliable. (MongoDB's global write lock tends to become more painful, too.) Cassandra also gives a lot more control over how your replication works, including support for multiple data centers.
More concerned about simple setup, maintenance and code
Both are trivial to set up, with reasonable out-of-the-box defaults for a single server. Cassandra is simpler to set up in a multi-server configuration since there are no special-role nodes to worry about.
If you're presently using JSON blobs, MongoDB is an insanely good match for your use case, given that it uses BSON to store the data. You'll be able to have richer and more queryable data than you would in your present database. This would be the most significant win for Mongo.
I've used MongoDB extensively (for the past 6 months), building a hierarchical data management system, and I can vouch for both the ease of setup (install it, run it, use it!) and the speed. As long as you think about indexes carefully, it can absolutely scream along, speed-wise.
I gather that Cassandra, due to its use with large-scale projects like Twitter, has better scaling functionality, although the MongoDB team is working on parity there. I should point out that I've not used Cassandra beyond the trial-run stage, so I can't speak for the detail.
The real swinger for me, when we were assessing NoSQL databases, was the querying - Cassandra is basically just a giant key/value store, and querying is a bit fiddly (at least compared to MongoDB), so for performance you'd have to duplicate quite a lot of data as a sort of manual index. MongoDB, on the other hand, uses a "query by example" model.
For example, say you've got a Collection (MongoDB parlance for the equivalent to a RDMS table) containing Users. MongoDB stores records as Documents, which are basically binary JSON objects. e.g:
{
FirstName: "John",
LastName: "Smith",
Email: "john#smith.com",
Groups: ["Admin", "User", "SuperUser"]
}
If you wanted to find all of the users called Smith who have Admin rights, you'd just create a new document (at the admin console using Javascript, or in production using the language of your choice):
{
LastName: "Smith",
Groups: "Admin"
}
...and then run the query. That's it. There are added operators for comparisons, RegEx filtering etc, but it's all pretty simple, and the Wiki-based documentation is pretty good.
Why choose between a traditional database and a NoSQL data store? Use both! The problem with NoSQL solutions (beyond the initial learning curve) is the lack of transactions -- you do all updates to MySQL and have MySQL populate a NoSQL data store for reads -- you then benefit from each technology's strengths. This does add more complexity, but you already have the MySQL side -- just add MongoDB, Cassandra, etc to the mix.
NoSQL datastores generally scale way better than a traditional DB for the same otherwise specs -- there is a reason why Facebook, Twitter, Google, and most start-ups are using NoSQL solutions. It's not just geeks getting high on new tech.
I'm probably going to be an odd man out, but I think you need to stay with MySQL. You haven't described a real problem you need to solve, and MySQL/InnoDB is an excellent storage back-end even for blob/json data.
There is a common trick among Web engineers to try to use more NoSQL as soon as realization comes that not all features of an RDBMS are used. This alone is not a good reason, since most often NoSQL databases have rather poor data engines (what MySQL calls a storage engine).
Now, if you're not of that kind, then please specify what is missing in MySQL and you're looking for in a different database (like, auto-sharding, automatic failover, multi-master replication, a weaker data consistency guarantee in cluster paying off in higher write throughput, etc).
I haven't used Cassandra, but I have used MongoDB and think it's awesome.
If you're after simple setup, this is it: You simply untar MongoDB and run the mongod daemon and that's it ... it's running.
Obviously that's only a starter, but to get you started it's easy.
I saw a presentation on mongodb yesterday. I can definitely say that setup was "simple", as simple as unpacking it and firing it up. Done.
I believe that both mongodb and cassandra will run on virtually any regular linux hardware so you should not find to much barrier in that area.
I think in this case, at the end of the day, it will come down to which do you personally feel more comfortable with and which has a toolset that you prefer. As far as the presentation on mongodb, the presenter indicated that the toolset for mongodb was pretty light and that there werent many (they said any really) tools similar to whats available for MySQL. This was of course their experience so YMMV. One thing that I did like about mongodb was that there seemed to be lots of language support for it (Python, and .NET being the two that I primarily use).
The list of sites using mongodb is pretty impressive, and I know that twitter just switched to using cassandra.