no single point of failure with traditional RDBMS - nosql

I am working in a trading applications that depends on an Oracle DB.
The DB is crashed two times and the business owner wants some solution in which the application still works even the DB is crashed.
My team leader introduced Cassandra NOSQL as a solution as it has no single point of failure but this option will make us move from the traditional relational model into the NOSQL model which I consider as a drawback.
My question here, Is there a way to avoid a single point of DB failure with traditional relational DBMS like Mysql, postgreSQL,......etc ?

Sounds like you just need a cluster of Oracle database instances, rather than just a single instance, such as Oracle RAC.
If your solution for the Oracle server being offline is to use Cassandra, what happens if the Cassandra cluster goes down? And are you really in the situation where it makes sense to rewrite and re-architect your entire application to use a different type of data store, just to avoid downtime from Oracle? I would suspect this only makes sense for applications with huge usage and load numbers, where any downtime is going to cost serious money (and not just cause embarrassment to the business folks to their bosses).

Is there a way to avoid a single point of DB failure with traditional relational DBMS
No, that's not possible. Simply because when one node dies. It is gone.
Any fault-tolerant system will use several nodes that replicate each other. You can still use traditional RDBMS, but you will need to configure mirroring in order for the system to tolerate a node failure.

NoSQL isn't the only possible solution. You can set up replication with MySQL:
http://dev.mysql.com/doc/refman/5.0/en/replication-solutions.html
and
http://mysql-mmm.org/
and concerming failover discussions:
http://serverfault.com/questions/274094/automated-failover-strategy-for-master-slave-mysql-replication-why-would-this

Related

NoSQL in a single machine

As part of my university curriculum I ended up with a real project which consists in helping a company shifting from their relational data warehouse into a NoSQL data warehouse. The thing is that what they are looking for is better performance in large jobs but so far they have used a single machine and if they indeed migrate to NoSQL they still wish to keep using a single machine for cost reasons.
As far as I know the whole point of NoSQL is to run it in a large distributed system of several machines. So I don't see the point of this migration, specially since I am pretty sure (but not entirely) that if they do install NoSQL, they will probably end having even worst performance.
But still I am not comfortable telling them this since I am still new to this area (less than a month), so I wonder, is there are any situation where using NoSQL in a single machine for a datawarehouse would be justifiable performance wise? Or is it just a plain bad idea?
The answer to your question, like the answer to so many questions, is "it depends."
Ignoring the commentary on the question, I think there may be legitimacy to your client's question. Both relational and non-relational databases ultimately hold data in key-value tuples, with indexes and such to ensure quick and speedy access to the data. The difference is that SQL/relational databases contain an incredible amount of overhead to attempt the optimal way to retrieve results given an unknown set of queries, as well as ensure stable concurrency. This overhead is both computationally expensive and rarely results in the optimal solution. As a result, SQL databases often perform significantly slower for simple repetitive queries.
No-sql databases, on the other hand, are more of a "bare-bones" database, relying on programmers and intelligent design to achieve success. They are optimized to retrieve a value for a given key very quickly, often sub-millisecond. As a result, increased up front investment in the design results in superior and near-optimal performance. It will be necessary to determine the cost-benefit of doing this up-front design, but it is all but guaranteed that the no-sql approach will perform better regardless of the number of machines involved (in fact, SQL databases are very difficult or impossible to cluster together and is one of the main reasons why NoSql was developed).
Eventually we will see relational-like solutions implemented on a no-sql platform. In fact, Mongo, Elasticsearch, and Couchbase (probably others) already have SQL-like query functionality. But right now, you are faced with this dilemma.
For a single machine if the load is write heavy e.g. your logging a lot of events you could do for cassandra. Also a good alternative is hbase but its heavy and not suggested for single node. If they expose api in json you could look into document based dbs such as couchbase, mongo db. If you have an idea about the load then selecting a nosql data store is much easier
If you're in a position where you need to pick one, I think you should look first at MongoDB. If you've never tried it, I really recommend you visit their live demo with tutorial and give it a try. If you like, download and follow the installation guide on their site. It's free, runs well on a single machine, and is incredibly easy to use.
In addition to MongoDB, I've used Oracle, SQL Server, MySQL, SQLite, and HBase. I understand Cassandra should be in the list but I've not tried it. With MongoDB, I was fully deployed and executing reads and writes from an application in like two hours. I attribute most of that to their website's clear and concise instructional content. The biggest learning curve was figuring out how the queries work for things like updating a record or deleting a record without deleting the entire set of similar records.
Regarding NoSQL vs RDBMS, some points to consider:
Adding a new column to RDBMS table can lock the database in or degrade performance in another
MongoDB is schema-less so adding a new field, does not effect old documents and will be instant (think how flexible that really is - throw any dimension of data into this system without maintenance overhead)
You're less likely to require a DBA to solve your schema problems when an application changes
I think problems related to table size are irrelevant, so you won't run into a scaling problem - just a disk space problem on single machine

Which NoSQL DB is best fitted for OLTP financial systems?

We're designing an OLTP financial system. it should be able to support 10.000 transactions per second and have reporting features.
So we have come to the idea of using:
a NoSQL DB as our main storage
a MySQL DB (Percona server actually) making some ETLs from the NoSQL DB for store reporting data
We're considering MongoDB and Riak for the NoSQL job. we have read that Riak scales more smoothly than MongoDB. And we would like to listen your opinion.
Which NoSQL DB would you use for a
OLTP financial system?
How has been
your experience scaling MongoDB/Riak?
There is no conceivable circumstance where I would use a NOSQl database for anything to do with finance. You simply don't have the data integrity needed or the internal controls. Dow Jones uses SQL Server to do its transactions and if they can properly design a high performance, high transaction Relational datbase so can you. You will have to invest in some people who know what they are doing though.
One has to think about the problem differently. The notion of transaction consistency stems from the UD (update) in CRUD (Create, Read, Update, Delete). noSQL DBs are CRAP (Create, Replicate, Append, Process) oriented, working by accretion of time-stamped data. With the right domain model, there is no reason that auditability and the equivalent of referential integrity can't be achieved.
The global-storage based NoSQL databases - Cache from InterSystems and GT.M from FIS - are used extensively in financial services and have been for many years. Cache in particular is used for both the core database and for OLTP.
I can answer regarding my experience with scaling Riak.
Riak scales smoothly to the extreme. Scaling is as easy as adding nodes to the cluster, which is a very simple operation in itself. You can achieve near linear scalability by simply adding nodes. Our experience with Riak as far as scaling is concerned has been amazing.
The flip side is that it is lacking in many respects. Some examples:
You can't do something like count(*) or list keys on a production cluster. That would require a work around if you want to do ETL from Riak into MySQL - or how would you know what to (E)xtract?
(One possible work around would be to maintain a bucket with a known key sequence that map to values that contain the keys you inserted into your other buckets).
The free version of Riak comes with no management console that lets you know what's going on, and the one that's included in the Enterprise version isn't much of an improvement.
You'll need the Enterprise version of you're looking to replicate your data over WAN (e.g. for DR / high availability). That's alright if you don't mind paying, but keep in mind that Basho pricing is very high.
I work with the Starcounter (so I’m biased), but I think I can safely say that for a system processing financial transactions you have to worry about transaction consistency. Unfortunately, this is what the engines used for Facebook and Twitter had to give up allow their scale-out strategy to offer performance. This is not because engines such as MongoDb or Cassandra are poorly designed; rather, it follows naturally from the CAP theorem (http://en.wikipedia.org/wiki/CAP_theorem). Simply put, changes you make in your database will overwrite other changes if they occur close in time. Ok for status updates and new tweets, but disastrous if you deal with money or other quantities. The amounts will simply end up wrong when many reads and writes are being done in parallel. So for the throughput you need, a memory centric NoSQL database with ACID support is probably the way to go.
You can use some NoSQL databases (Cassandra, EventStore) as a storage for financial service if you implement your app using event sourcing and concepts from DDD. I recommend you to read this minibook http://www.oreilly.com/programming/free/reactive-microservices-architecture.html
OLTP can be achieved using NoSQL with a custom implementation,
there are two things,
1. How are you going to achieve ACID properties that an RDBMS gives.
2. Provide a custom blocking or non blocking concurrency and transaction handling mechanism.
To take you closer to solution,
Apache Phoenix,apache trafodion or Splice machine.
Trafodion has full ACID support over HBase, you should take a look.
Cassandra can be used for both OLTP and OLAP. Good replication and eventual data consistency gives you the choice in your hand. Need to design the system properly. And after all it's free of cost but not free of developer, give it a try

Does any RDBMS do auto scaling, sharding, re-balancing?

I think one of the advantages of no-sql such as MongoDB is that it can scale horizontally automatically: just add a cheap machine and the data can "spread over" to the new machine.
How about for RDBMS, does any RDBMS do that automatically too?
The answer here is "kind of". MySQL doesn't really have anything native of "for free". Big RDBMS technologies like MSSQL and Oracle do have very good support for scaling out. However, both technologies are expensive and there's no way to through a thousand servers at MS SQL and say "have at it".
Of course, even with millions of dollars in servers and tech, you still won't have ready access to joins. I mean, how can you reliably join data across 500 servers?
Honestly, I think your question is probably best answered by the very existence of technologies like MongoDB and CouchDB. These technologies exist b/c developers need a way to reliably "horizontalize". RDBMS, by their nature, are not good at horizontalizing. Again, how can you scale a join?
I have only worked with MySQL and MySQL supports partitioning. However, partitioning is restricted to a single database server, which means that horizontal-scaling (sharding the database to multiple machines) is not something that the database engine manages. This has to be managed at the application level.
MySQL Partitioning is said to work very well in write-heavy use-cases.
To give you further direction :
Scaling mysql writes through partitioning at Yahoo
Database sharding at Netlog

What is the best database/storage to store statistic data?

I'm having a system that collects real-time Apache log data from about 90-100 Web Servers. I had also defined some url patterns.
Now I want to build another system that updates the time of occurrence of each pattern based on those logs.
I had thought about using MySQL to store statistic data, update them by statement:
"Update table set count=count+1 where ....",
but i'm afraid that MySQL will be slow for data from such amount of servers. Moreover, I'm looking for some database/storage solutions that more scalable and simple. (As a RDBMS, MySQL supports too much things that I don't need in this situation) . Do you have any idea ?
Apache Cassandra is a high-performance column-family store and can scale extremely well. The learning curve is a bit steep, but will have no problem handling large amounts of data.
A more simple solution would be a key-value store, like Redis. It's easier to understand than Cassandra. Redis only seems to support master-slave replication as a way to scale, so the write performance of your master server could be a bottleneck. Riak has a decentralized architecture without any central nodes. It has no single point of failure nor any bottlenecks, so it's easier to scale out.
Key value storage seems to be an appropriate solution for my system. After taking a quick look on those storages, I'm concerning about race-condition issue, as there will be a lot of clients trying to do these steps on the same key:
count = storage.get(key)
storage.set(key,count+1)
I had worked with Tokyo Cabinet before, and they have 'addint' method which perfectly matched with my case, I wonder if other storages have similar feature? I didn't choose Tokyo Cabinet/Tyrant cause I had experienced some issues about its scalability and data stability (e.g. repair corrupted data, ...)

Are MongoDB and CouchDB perfect substitutes?

I haven't got my hands dirty yet with neither CouchDB nor MongoDB but I would like to do so soon... I also have read a bit about both systems and it looks to me like they cover the same cases... Or am I missing a key distinguishing feature?
I would like to use a document based storage instead of a traditional RDBMS in my next project. I also need the datastore to
handle large binary objects (images and videos)
automatically replicate itself to physically separate nodes
rendering the need of an additional RDBMS superfluous
Are both equally well suited for these requirements?
Thanks!
I've actually used both pretty extensively, both for very different projects.
I'd say they are equally well suited for the requirements you list, however there are quite a lot of differences between the two. IMO the biggest is their query-ability. CouchDB doesn't have 'queries' in the RDBMS sense (select * from ...) but instead uses 'views' which are more like stored procedures (essentially, static queries defined in the database (1)). MongoDB has much more 'usual' querying.
Essentially it comes down to your application requirements. If you give more information I might be able to shed some more light on what might matter in that situation.
(1): you can have temporarily, non-static queries in CouchDB but they aren't recommended for production use
Mongo uses more "traditional" queries. You turn on indexing on a per-key basis and use a SQLish query syntax.
CouchDB's views can do much deeper indexing and relationships but require you to do a little more work and understand the way the key sorting works for doing the queries.
There is a big difference in the replication systems as well. Mongo's replication looks a lot like most RDBMS solutions with masters and slaves and all that. CouchDB's replication is more peer to peer, no master/slave, every CouchDB is a node.
CouchDB's replication is made for keeping geographically apart sites in sync. It handles network- and other errors gracefully by restarting replication where it left off. Participating nodes can even be put offline deliberately.
Before using MongoDB, I'd recommend that you take a look at the following: http://groups.google.com/group/mongodb-user/browse_thread/thread/460dbd49a5b6b267. MongoDB has a small chance of corrupting data due to its lack of fsync's with each write.
http://nosql.mypopescu.com/post/298557551/couchdb-vs-mongodb
From a developer point of view the biggest difference is the mongo live queries vs couch view (which must be "compiled").
From an operational point of view, couch is working completely on http-rest. If you're able to configure http servers you know how to setup coach. With Mongo instead you have to learn how to set up config servers, replica set and mongos (kind of balancer).