Redis, CouchDB or Cassandra? [closed] - nosql

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
What are the strengths and weaknesses of the various NoSQL databases available?
In particular, it seems like Redis is weak when it comes to distributing write load over multiple servers. Is that the case? Is it a big problem? How big does a service have to grow before that could be a significant problem?

The strengths and weaknesses of the NoSQL databases (and also SQL databases) are highly dependent on your use case. For very large projects, performance is king; but for brand new projects, or projects where time and money are limited, simplicity and time-to-market are probably the most important. For teaching yourself (broadening your perspective, becoming a better, more valuable programmer), perhaps the most important thing is simple, solid fundamental concepts.
What kind of project do you have in mind?
Some strengths and weaknesses, off the top of my head:
Redis
Very simple key-value "global variable server"
Very simple (some would say "non-existent") query system
Easily the fastest in this list
Transactions
Data set must fit in memory
Immature clustering, with unclear future (I'm sure it'll be great, but it's not yet decided.)
Cassandra
Arguably the most community momentum of the BigTable-like databases
Probably the easiest of this list to manage in big/growing clusters
Support for map/reduce, good for analytics, data warehousing
Multi-datacenter replication
Tunable consistency/availability
No single point of failure
You must know what queries you will run early in the project, to prepare the data shape and indexes
CouchDB
Hands-down the best sync (replication) support, supporting master/slave, master/master, and more exotic architectures
HTTP protocol, browsers/apps can interact directly with the DB partially or entirely. (Sync is also done over HTTP)
After a brief learning curve, pretty sophisticated query system using JavaScript and map/reduce
Clustered operation (no SPOF, tunable consistency/availability) is currently a significant fork (BigCouch). It will probably merge into Couch but there is no roadmap.
Similarly, clustering and multi-datacenter are theoretically possible (the "exotic" thing I mentioned); however, you must write all that tooling yourself at this time.
Append-only file format (both databases and indexes) consumes disk surprisingly quickly, and you must manually run compaction (vacuuming), which makes a full copy of all records in the database. The same is required for each index file. Again, you have to be your own toolsmith.
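On the original question about spreading Redis write load over multiple servers: without built-in clustering, a common workaround is to shard keys across independent servers on the client side. The following is only a minimal sketch of that idea. Plain dicts stand in for Redis connections, and the `ShardedStore` class and the server names are hypothetical, not part of any Redis client API.

```python
import hashlib


class ShardedStore:
    """Routes each key to one of several independent servers by hashing the key.

    Plain dicts stand in for real Redis connections (an assumption made so the
    sketch runs without a server).
    """

    def __init__(self, shard_names):
        self.shard_names = list(shard_names)
        self.shards = {name: {} for name in self.shard_names}

    def shard_for(self, key):
        # Use a stable hash (not Python's per-process hash()) so that every
        # client picks the same server for the same key.
        digest = hashlib.md5(key.encode("utf-8")).hexdigest()
        return self.shard_names[int(digest, 16) % len(self.shard_names)]

    def set(self, key, value):
        self.shards[self.shard_for(key)][key] = value

    def get(self, key):
        return self.shards[self.shard_for(key)].get(key)


store = ShardedStore(["redis-a", "redis-b", "redis-c"])
for i in range(300):
    store.set(f"user:{i}", i)

# Each write went to exactly one stand-in server.
sizes = {name: len(data) for name, data in store.shards.items()}
```

Simple modulo hashing like this moves most keys to a different server whenever the server list changes; consistent hashing is the usual way to reduce that reshuffling.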

Take a look at http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis. The author does a good job of summing up why you would use one over the other.

Related

Modern hierarchical database [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 9 years ago.
I'm working on the architecture of a new system to replace an ancient mainframe app. The mainframe uses IBM IMS and is surprisingly fast with large amounts of data. We've tried 3 DBs so far - MongoDB, SQL Server and Oracle, but they performed poorly under load. We brought in an Oracle consultant and a 128-core server, and Oracle still gives us 4x the response time of the old system (same with SQL Server).
Are there any modern hierarchical DBs that can efficiently support billions of records?
Mainframes have been and remain very fast for certain use cases, so part one is not to assume that mainframe = bad. Having said that, they can be very expensive to maintain, and particularly with legacy apps the skills are starting to evaporate.
If you really wanted a hierarchical database, one valid option would be to modernise your application but retain IMS at the core. IMS is a great hierarchical database, and I don't think IBM are going to EOL IMS any time soon, so is there a real reason to go to a hierarchical database that isn't IMS? A quick visit to their website gave me the impression that they'd discount the product if they thought you were going to migrate to a competing product, so if money is the problem then perhaps the answer is to just ask IBM to discount the product you're already happy with. This white paper (ftp://public.dhe.ibm.com/software/data/ims/pdf/TCG2013015LI.pdf) suggests they're pushing that as an option, and no doubt the later versions of IMS have a bunch of features that might not be available in the version you're running (assuming you've not upgraded to the latest).
I'm surprised you can't get the performance you want out of Oracle, though. The system I'm currently working on has a couple of tables at the billion mark and we definitely don't have 128 cores, but we get reasonable performance.
My first question is whether your Oracle consultant really knew their stuff. I've had mixed results; as with any skill set, people's abilities vary. I often find that when you get performance problems it's because people have over-normalised or over-generalised the database schema - so you've moved from a highly optimised hierarchical structure in IMS that flies to a very abstracted structure in 3NF, and that dies. But sometimes if you put that same hierarchical structure in Oracle, and only allow the same sort of access patterns that were possible in IMS, you'd get all the performance you want.
By that, I mean if in IMS you had clients, clients had orders, and orders had order lines, then I think that means it's pretty hard to do any accesses without starting at the client. It also often means you have large batch processes that process all the clients every day to find out which have orders that you need to do something with.
So, some things here. Firstly, suppose you build that structure in Oracle: there is a client id, the client id is the first element in the primary key of orders, the client id and then the order id are the first two elements in the primary key of order lines, the client id is the clustering key, and the client id goes into every index. Then probably all my client-based access paths will be really fast. You can also partition by client id, and if needed, run an Oracle RAC cluster with each of those partitions/client ranges effectively running as separate databases on a separate, more commodity-class machine (say, a dual socket machine = about 20 cores).
Secondly, if I used to have to process all my records once a night to find the orders that needed someone to work on them, then in the new relational world I don't need to do that any more, I just need to find the orders with a status of "pending" or whatever. So maybe Oracle isn't as fast for that batch oriented workload, but if I change my logic and do an indexed query for pending orders, then again I can get all the performance I want. Even more so, perhaps I make order_status into a partitioning key, so my "active" records are all in one partition, and all the older orders are in other partitions - and then I put that partition on an SSD-backed array.
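The client-rooted keys and the indexed lookup for pending orders can be sketched roughly as follows. This is an illustration only: SQLite (via Python's standard library) stands in for Oracle, `WITHOUT ROWID` is SQLite's way of storing rows in primary-key order and only approximates a clustering key, and all table and column names are made up. Partitioning and RAC are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clients (
        client_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );
    -- client_id leads every key, mirroring the client > order > line hierarchy.
    CREATE TABLE orders (
        client_id INTEGER NOT NULL REFERENCES clients(client_id),
        order_id  INTEGER NOT NULL,
        status    TEXT NOT NULL,
        PRIMARY KEY (client_id, order_id)
    ) WITHOUT ROWID;
    CREATE TABLE order_lines (
        client_id INTEGER NOT NULL,
        order_id  INTEGER NOT NULL,
        line_no   INTEGER NOT NULL,
        product   TEXT NOT NULL,
        PRIMARY KEY (client_id, order_id, line_no),
        FOREIGN KEY (client_id, order_id) REFERENCES orders(client_id, order_id)
    ) WITHOUT ROWID;
    -- Replaces the nightly scan over all clients: find pending orders directly.
    CREATE INDEX orders_by_status ON orders (status, client_id, order_id);
""")

conn.execute("INSERT INTO clients VALUES (1, 'Acme'), (2, 'Globex')")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 100, "shipped"), (1, 101, "pending"), (2, 200, "pending")],
)
conn.executemany(
    "INSERT INTO order_lines VALUES (?, ?, ?, ?)",
    [(1, 101, 1, "widget"), (1, 101, 2, "gadget"), (2, 200, 1, "widget")],
)

# Client-rooted access path: all rows for one client share its key prefix.
lines_for_client = conn.execute(
    "SELECT order_id, line_no, product FROM order_lines "
    "WHERE client_id = ? ORDER BY order_id, line_no",
    (1,),
).fetchall()

# Status-driven access path that the batch job used to emulate.
pending = conn.execute(
    "SELECT client_id, order_id FROM orders "
    "WHERE status = 'pending' ORDER BY client_id, order_id"
).fetchall()
```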
Thirdly, take a look at your storage devices. Performance problems in databases are invariably IO problems - either you're doing too much IO (poorly optimised queries), or your IO subsystem can't keep up with the IO that you need to do. 128 cores is an awful lot of compute, and I've rarely seen a database that is compute bound. Maybe look at a big SSD array, some of them can give you enormous IO throughput. Certainly if you were running Oracle on a RAID 5 spinning disk array your performance is likely to suck.
The last random comment here - a lot of people are getting good results with SAP HANA - a fully in-memory database. That really flies, and is specifically designed for workloads that just won't run fast enough in other databases. I bet SAP would come demo it to you for free if you wanted it.

Product Catalog - Document Store or Column Family Store [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
I'm wondering which technology would do better for a typical product catalog of a webshop. I'm writing my master's thesis about NoSQL in the enterprise environment and have focused on document stores for too long now, I think.
I have read a lot of articles that recommend document stores because of their flexibility, which is needed to model thousands of different products. But as far as I know now, Column-Family Stores like Cassandra offer the same flexibility.
What I like most about the idea of using Cassandra is what nosql-database.org says about it (marked the most interesting features):
massively scalable, partitioned row store, masterless architecture, linear scale performance, no single points of failure, read/write support across multiple data centers & cloud availability zones. API / Query Method: CQL and Thrift, replication: peer-to-peer, written in: Java, Concurrency: tunable consistency, Misc: built-in data compression, MapReduce support, primary/secondary indexes, security features.
In the end I focus on building a prototype of a highly available and scalable Multishop System which makes use of polyglot persistence, that is, K/V Stores for Sessions, a Document Store or Column-Family Store for the Product Catalog, and maybe an RDBMS for Inventory/Pricing, as Sadalage and Fowler mentioned in their book "NoSQL Distilled".
If possible, provide scientific papers or other reliable sources for your answers.
Thanks!
Document Store's Achilles Heel
Stuart Halloway mentioned that a document store is the biggest schema lock solution that is way too inflexible, which I agree with. Couch/Mongo and others try to mitigate that by providing workarounds to create secondary indices, the ability (and necessity) to be aware of plain object ids, etc. And of course if you think about versioning (i.e. add a "time" variable to your system), document stores quickly fail to provide smooth support for it, or for time travel.
Column Store: Problem Relevance
Cassandra is a really compelling solution for building "scalable"/"distributed" systems with real examples such as Netflix, where 500 Cassandra nodes can be brought up in AWS within several minutes, and all the requests hit a Cassandra ring.
However, given the problem as it is stated in your question, Cassandra would be overkill. Not just because it is a bit more complex than "others", or because it is mentally harder to create a solid data model on top of column oriented stores, but also because a "product catalog" problem is not rocket science. It can be, if you want to add machine learning later to predict/recognize/etc., but a catalog itself is not, and simpler stores such as PostgreSQL, for example, would solve it easily.
Simple Desire to NoSQL
If you really want to use NoSQL for a product catalog, I would definitely consider 3 solutions to fit your prototype:
Riak as a "K/V for Sessions"
Datomic to solve "Product Catalog, Inventory and Pricing"
Depending on the size and nature of the problem and the final solution, I would consider Redis to cache those sessions, while having Datomic comfortably sit on top of Riak as its storage service.
Practice vs. Theory
Two classical NoSQL papers that made NoSQL sound real in practice for the first time are Dynamo and BigTable. I consider Datomic to be the next evolutionary step in the DB universe by introducing a hybrid data model with true indices and relations without a schema lock, and immutability from which everything follows: safe time travel, caching, local db values, etc.
Practically, if it wasn't a master's thesis, depending on the real problem scale and definition, I would be choosing between Datomic and PostgreSQL to solve catalog, inventory, pricing, etc.
A big advantage of Datomic here is time travel. In practice it is very important to be able to safely and easily do that in a "Shopping System".
A big advantage of PostgreSQL is its familiarity and SQL tools availability for analytics and reporting.
By now I think that Column-Family Stores are not well suited for product catalogs.
That's because products often contain some kind of collection, like tags, tracklists for music records, different sizes for clothes and so on.
Cassandra supports collections by now BUT they are not searchable! This is a must-have feature for tags, for example.
In contrast, MongoDB for example offers the $in operator to search in nested arrays...
I don't want to say it is not possible to model a product catalog in Cassandra, but I think it is much more straightforward to do it in a document store.
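To make the tag-search requirement concrete, here is a small sketch in plain Python that mimics what a `$in` filter does on an array field: a document matches if its array contains at least one of the wanted values. This is not MongoDB driver code; the `find_in` helper and the sample products are invented for illustration.

```python
# Products with different shapes and embedded collections, as in a document store.
products = [
    {"_id": 1, "name": "Blue T-shirt", "tags": ["clothing", "cotton"],
     "sizes": ["S", "M", "L"]},
    {"_id": 2, "name": "Kind of Blue", "tags": ["music", "jazz"],
     "tracklist": ["So What", "Freddie Freeloader"]},
    {"_id": 3, "name": "Wool scarf", "tags": ["clothing", "wool"]},
]


def find_in(collection, field, wanted):
    """Return documents whose array `field` shares at least one value with `wanted`.

    Hypothetical helper imitating a {field: {"$in": wanted}} filter on an array.
    """
    wanted = set(wanted)
    return [doc for doc in collection if wanted & set(doc.get(field, []))]


jazz_or_wool = find_in(products, "tags", ["jazz", "wool"])
available_in_m = find_in(products, "sizes", ["M"])
```

Documents without the field (here, products with no `sizes`) simply do not match, which is what makes heterogeneous products easy to keep in one collection.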

Can NoSQL (e.g. MongoDB) replace Data Grid solutions e.g. Oracle Coherence [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I'm looking for an opinion on replacing an existing Data Grid (i.e. Oracle Coherence) with a document store alternative, e.g. the NoSQL database MongoDB. I was thinking about the most important pros and cons and came up with:
NoSQL
Pros:
No additional database
No ORM mapping necessary
Although the best query efficiency can be achieved when looking up by ID, other queries can be satisfied by map/reduce queries
Cons:
Quite difficult to achieve data consistency when updating multiple collections or even multiple rows in the same collection
Slower response time? (I suspect that Coherence response time might be better)
A read operation can return old data
Data Grid
Pros
With a Data Grid it seems easier to keep data consistent, e.g. the data grid becomes the SOR (System of Record)
As Data Grid becomes SOR, all data should always be available in the grid
Remote Executors
Cons
Additional database means additional overhead & system/application requirements
With a huge amount of data and sharding in place any kind of queries can take a lot of time
Couchbase Server is a very good replacement for Oracle Coherence, particularly for enterprise class applications. Orbitz is a great example, where a large number of Coherence nodes were replaced by 70 nodes of Couchbase.
You can read more about the Coherence replacement here: http://gigaom.com/cloud/balancing-oracle-and-open-source-at-orbitz/
Slides from an Orbitz presentation about Couchbase are also available here: http://www.slideshare.net/Couchbase/t1-s6-oww-usescouchbase
Pros:
High availability of nodes using replication and failover (avoid cold cache scenarios)
Sub-millisecond latencies (built-in object level cache based on memcached)
High read / write throughput (very low granularity of locking) ( http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9670/white_paper_c11-708169.pdf)
Strong consistency at a document / item level
TTL / Expiry per document / item
Cons:
Difficult to achieve consistency across multi-document updates. Can be achieved using sentinels. ( http://www.amainhobbies.com/FromTheCEO/2012/09/09/invalidating-couchbase-cache-entries-with-sentinels/ )
It can, but so can a pen and paper system.
The question is, will it be an acceptable replacement. That wholly depends on the situation. In some cases a NoSQL solution is faster, more scalable than a relational solution, but in some situations it is essential to have some kind of support for longer running transactions and relational constraints.
It depends.
You already gave the pros and cons in detail...
As iwein said, it depends...
What queries does the existing relational system force on you?
We know that partitioning in NoSQL DBs is easier than in relational DBs...
So if you switch to Mongo, you can extend your system's performance in a cheaper and quicker way...
If people are happy with your Oracle system now, don't touch it :)
Yes - NoSQL can replace it. But a lot depends on what you are trying to do.
If you just need a simple document store with easy key based lookups - NoSQL is a no-brainer.
If you need an enterprise class solution with paid-for support and features such as custom aggregation, entry processors, etc., then maybe Coherence is what you want.
I've seen people build custom NoSQL solutions on top of Coherence - which is a really expensive thing to do.

Why exactly do we use NoSQL? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
Having understood some of the advantages that NoSQL offers (scalability, availability, etc.), I am still not clear why a website would want to use a non-relational database.
Can I get some help on this, preferably with an example?
Better performance
NoSQL databases sometimes have better performance, although this depends on the situation and is disputed.
Adaptability
You can add and remove "columns" without downtime. In most SQL servers, this takes a long time and puts a lot of load on the server.
Application design
It is desirable to separate the data storage from the logic. If you join and select things in SQL queries, you are mixing business logic with storage.
NoSQL databases are there to solve several things, mainly:
(buzz) BigData => think TB, PB, etc..
Working with Distributed Systems / datasets => say you have 42 products, so 13 of them will live in the Chicago datacenter, 21 in NY's and another 8 somewhere in Japan, but once you query against all 42 products, you would not need to know where they are located: the NoSQL DB will. This also lets you engage a lot more brain power ( servers ) to solve hard computational problems [ does not seem it would fit your use case, but it is an interesting thing to note ]
Partitioning => having your DB be easily distributed, besides those cool 8 products in Japan, also allows for easy data replication, so those 42 products will be replicated with a factor of 3, for example, which would mean your DB would have 3 copies of every product. Hence if something goes down, no problem => there is a replica available. This is where NoSQL databases actually shine vs. RDBMS. Granted, you can shard, partition and cluster Oracle / MySQL / PostgreSQL / etc., BUT it is a several orders of magnitude more complicated process and usually a maintenance headache for most people you'd employ.
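The "42 products, replication factor 3" idea can be sketched in a few lines. This is a toy placement scheme under stated assumptions, not how any particular NoSQL database does it: the node names are invented, and each key is hashed to a starting node and then copied to the next nodes in the list.

```python
import hashlib

# Hypothetical nodes spread over three locations.
NODES = ["chicago-1", "chicago-2", "ny-1", "ny-2", "japan-1"]
REPLICATION_FACTOR = 3


def replicas_for(key, nodes=NODES, rf=REPLICATION_FACTOR):
    """Pick `rf` distinct nodes for a key: hash to a start node, then walk the list."""
    start = int(hashlib.sha1(key.encode("utf-8")).hexdigest(), 16) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]


def read_from(key, down=()):
    """Return the first replica of `key` that is still up, or None if all are down."""
    for node in replicas_for(key):
        if node not in down:
            return node
    return None


# The caller asks for a product by key and never needs to know where it lives.
placement = {f"product-{i}": replicas_for(f"product-{i}") for i in range(1, 43)}

# If the first replica of a product goes down, a read falls through to the next copy.
first, second, third = placement["product-7"]
fallback = read_from("product-7", down={first})
```

With three copies per key, a read only fails when all three replicas of that key are down at the same time.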
BUT to your question:
why a website would want to use a non-relational database
When most of the people, I worked with / met / chatted with, choose NoSQL for their "website", it is unfortunately NOT for the reasons above, but simply because it is COOLER to do so. And in fact many projects FAIL / have extreme difficulties due to this reason.
If most NoSQL gurus took their masks off, they would all agree that MOST of the problems ( or as people call them, websites ) that developers solve day to day can, and rather should, be solved with a SQL solution, such as PostgreSQL, MySQL, etc., with some cool Redis cache layer on top of it. And only a small subset of problems would REALLY benefit from NoSQL.
I personally love Riak, as I am a firm believer that a NoSQL, fault tolerant DB should have an extremely strong, flexible and naturally distributed foundation => such as Erlang OTP. Plus I am a fan of simplicity. But again, given the problem, I would choose whatever works best, and most of the time I will NEED that consistency ( especially if we are talking about money / financial world / mission critical / etc.. ).
The main reason not to use an SQL database is scalability. The transactional guarantees and the relational model make it almost impossible to scale a database usefully across more than a few machines, especially given the write-heavy workloads generated by modern web applications.
An app like Facebook can't be made to work on a straightforward SQL database, except by massive partitioning and sharding, which requires significant adjustments to the app logic as well. That's why Facebook developed Cassandra.
NoSQL basically means you make do without some SQL-typical features like immediate consistency or easy joins, in exchange for being able to use a database that scales much better.
Conversely, there is no point in using NoSQL if your website never has more than a dozen concurrent users (which is true for the vast majority of all sites).
We need to understand what your problem is in the current application:
Transactions
Amount of data
Data structure
NoSQL solves the problems of scalability and availability at the expense of atomicity or consistency.
This brings us to the CAP theorem. Eric Brewer noted that of the three properties of shared-data systems (Consistency, Availability and tolerance to network Partitions), only two can be achieved at any given moment in time.
NOSQL Approach
Schemaless data representation:
Most of them offer schemaless data representation & allow storing semi-structured data.
Can continue to evolve over time— including adding new fields or even nesting the data, for example, in case of JSON representation.
Development time:
No complex SQL queries.
No JOIN statements.
Speed:
Very high-speed delivery, mostly with built-in entity-level caching
Plan ahead for scalability:
Avoiding rework
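As a rough illustration of the schemaless point above, the sketch below stores two generations of the same kind of record side by side as JSON. The field names and the `load_product` helper are made up; the point is only that new and nested fields can appear without migrating older records, while the application takes over the job of supplying defaults.

```python
import json

# Two generations of a "product" record, stored as JSON documents.
stored = [
    json.dumps({"id": 1, "name": "Kettle", "price": 25}),
    # A later record adds a list field and a nested object; record 1 is untouched.
    json.dumps({
        "id": 2,
        "name": "Toaster",
        "price": 40,
        "tags": ["kitchen"],
        "dimensions": {"width_cm": 30, "height_cm": 20},
    }),
]


def load_product(raw):
    """Parse a stored document and fill in fields that older records lack."""
    doc = json.loads(raw)
    # With no schema in the database, defaults live in application code.
    doc.setdefault("tags", [])
    doc.setdefault("dimensions", None)
    return doc


products = [load_product(raw) for raw in stored]
```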
There are many types of NoSQL databases. Web applications commonly use document-based databases. A document DB allows us to store JSON, XML, YAML and even Word documents and manipulate them. So NoSQL is the obvious choice; MongoDB in particular, a document database that supports the JSON format by default, is the choice most preferred by developers and designers.

Key-Value Stores vs. RDBMs vs. "Cloud" DBs (SDB) [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 6 years ago.
I'm comfortable in the MySQL space having designed several apps over the past few years, and then continuously refining performance and scalability aspects. I also have some experience working with memcached to provide application side speed-ups on frequently queried result sets. And recently I implemented the Amazon SDB as my primary "database" for an ecommerce experiment.
To oversimplify, a quick justification I went through in my mind for using the SDB service was that using a schema-less database structure would allow me to focus on the logical problem of my project and rapidly accumulate content in my data-store. That is, don't worry about setting up and normalizing all possible permutations of a product's attributes beforehand; simply start loading in the products and the SDB will simply remember everything that is available.
Now that I have managed to get through the first few iterations of my project and I need to set up simple interfaces to the data, I am running into issues with things I had taken for granted working with MySQL. Ex: grouping in select statements and limit syntax to query "items 50 to 100". The ease advantage I gained using the schema-free architecture of SDB, I lost to a performance hit of querying/looping a resultset with just over 1800 items.
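For reference, the two features mentioned above look roughly like this in a relational database. The sketch uses SQLite from Python's standard library as a stand-in for MySQL (the `LIMIT ... OFFSET ...` form is accepted by both), and the `items` table and its columns are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, category TEXT, price REAL)"
)
# About as many rows as the resultset described above.
conn.executemany(
    "INSERT INTO items (id, category, price) VALUES (?, ?, ?)",
    [(i, f"cat-{i % 6}", float(i)) for i in range(1, 1801)],
)

# "Items 50 to 100": skip the first 49 rows, then return the next 51.
page = conn.execute(
    "SELECT id FROM items ORDER BY id LIMIT 51 OFFSET 49"
).fetchall()

# Grouping is done by the database instead of by looping in application code.
counts = conn.execute(
    "SELECT category, COUNT(*) FROM items GROUP BY category ORDER BY category"
).fetchall()
```

Both queries return only the rows that are needed, so the application never has to fetch and loop over the full resultset.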
Now I'm reading about projects like Tokyo Cabinet that are extending the concept of in-memory key-value stores to provide pseudo-relational functionality at ridiculously faster speeds (14x, I read somewhere).
My question:
Are there some rudimentary guidelines or heuristics that I, as an application designer/developer, can go through to evaluate which DB tech is the most appropriate at each stage of my project?
Ex: At a prototyping stage where logical/technical unknowns of the application make data structure fluid: use SDB.
At a more mature stage where user deliverables are a priority, use traditional tools where you don't have to spend dev time writing sorting, grouping or pagination logic.
Practical experience with these tools would be very much appreciated.
Thanks SO!
Shaheeb R.
The problems you are finding are why RDBMS specialists view some of the alternative systems with a jaundiced eye. Yes, the alternative systems handle certain specific requirements extremely fast, but as soon as you want to do something else with the same data, the fleetest suddenly becomes the laggard. By contrast, an RDBMS typically manages the variations with greater aplomb; it may not be quite as fast as the fleetest for the specialized workload which the fleetest is micro-optimized to handle, but it seldom deteriorates as fast when called upon to deal with other queries.
The new solutions are not silver bullets.
Compared to a traditional RDBMS, these systems make improvements in some aspects (scalability, availability or simplicity) by trading off other aspects (reduced query capability, eventual consistency, horrible performance for certain operations).
Think of these not as replacements for the traditional database, but as specialized tools for a known, specific need.
Take Amazon SimpleDB for example: SDB is basically a huge spreadsheet. If that is what your data looks like, then it probably works well, and the superb scalability and simplicity will save you a lot of time and money.
If your system requires very structured and complex queries but you insist on one of these cool new solutions, you will soon find yourself in the middle of re-implementing an amateurish, ill-designed RDBMS, with all of its inherent problems.
In this respect, if you do not know whether these will suit your need, I think it is actually better to do your first few iterations in a traditional RDBMS because they give you the best flexibility and capability especially in a single server deployment and under modest load. (see CAP Theorem).
Once you have a better idea of what your data will look like and how it will be used, you can match your needs with an alternative solution.
If you want the simplicity of a cloud hosted solution, but need a relational database, you can check out: Amazon Relational Database Service