When to use master-master replication vs. master-slave replication - MongoDB

I am using CouchDB as my NoSQL database for a CRM solution.
CouchDB uses master-master replication.
MongoDB, by comparison, uses master-slave replication.
Being a newcomer to NoSQL,
I would like to clearly understand the benefits of master-master replication over master-slave replication.

In a master-master architecture, you can distribute the power to the places it's most needed. In a CRM, you may want a single point of authority (the head office), but authoritative content may be created by anyone (sales reps, VPs, tech support agents). Master-master lets you bring the canonical data source as close as possible to every content owner/creator in that scenario.
In master-slave architectures, everyone must be able to reach the canonical authoritative source, or risk (at least) having their content overwritten, or not being able to write at all.
Apache CouchDB is particularly well suited to master-master replication, and coupled with PouchDB it can provide applications that work offline-first, cloud optional. These apps can then synchronize their changes when an Internet connection becomes available again.
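As a concrete illustration, here is a minimal sketch of turning on continuous, bidirectional replication between two CouchDB nodes through the _replicate endpoint, using Python's requests library; the host names, database name, and credentials are hypothetical placeholders.

```python
# Minimal sketch: continuous, bidirectional ("master-master") replication
# between two CouchDB nodes. Host names, database name, and credentials
# are hypothetical placeholders.
import requests

NODES = ["http://couch-hq.example.com:5984", "http://couch-branch.example.com:5984"]
DB = "crm"
AUTH = ("admin", "secret")  # hypothetical admin credentials

def replicate(source_node, target_node):
    # POST to the _replicate endpoint; "continuous": True keeps the
    # replication running as new changes arrive on the source.
    resp = requests.post(
        f"{source_node}/_replicate",
        json={"source": f"{source_node}/{DB}",
              "target": f"{target_node}/{DB}",
              "continuous": True},
        auth=AUTH,
    )
    resp.raise_for_status()

# Replicating in both directions is what makes the topology master-master:
# a write accepted on either node eventually shows up on the other.
replicate(NODES[0], NODES[1])
replicate(NODES[1], NODES[0])
```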

In master-master replication, clients can read from and write to any of the instances. It has its use case when you want to write independently to two databases. Imagine you have one database in Asia and one in North America: clients in Asia write to the database in Asia, and clients in North America write to the database in North America. This keeps write latency low.
If one of the instances goes down, clients can write to the other instances. This provides high availability.
In master-master replication, there is also the possibility of a write conflict: if the same record is modified on db1 and db2 at about the same time, reconciling the data between the two databases will result in a write conflict that has to be resolved. In master-slave replication, there is only one source of truth, because writes go ONLY to the master.
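To make the conflict case concrete in CouchDB terms: after two independently edited nodes replicate with each other, CouchDB keeps a deterministic winning revision and records the losing revisions, which the application can inspect and resolve. A rough sketch, with hypothetical host and document ID:

```python
# Sketch: detecting a write conflict in CouchDB after two nodes that were
# edited independently have replicated with each other. The host and the
# document ID are hypothetical.
import requests

node = "http://couch-hq.example.com:5984"
doc_url = f"{node}/crm/customer-42"

# conflicts=true makes CouchDB include the losing revisions (if any)
# in a "_conflicts" array alongside the winning revision.
doc = requests.get(doc_url, params={"conflicts": "true"}).json()

for losing_rev in doc.get("_conflicts", []):
    # Application-level resolution: here we simply delete the losing
    # revisions and keep CouchDB's deterministic winner. A real CRM would
    # probably merge the two edits instead.
    requests.delete(doc_url, params={"rev": losing_rev})
```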

Related

In database terms, what is the difference between replication and decentralisation?

I am currently researching different databases to use for my next project, and I want to use a decentralized database. For example, Apache Cassandra claims to be decentralized. MongoDB, however, says it uses replication. From what I can see, as far as these databases are concerned, replication and decentralization are basically the same thing. Is that correct, or is there some difference/feature between decentralization and replication that I'm missing?
Short answer: no, replication and decentralization are two different things. As a simple example, let's say you have three instances (i1, i2 and i3) that replicate the same data. You also have a client that fetches data only from i1. If i1 goes down, you will still have the data replicated to i2 and i3 as a backup. But since i1 is down, the client has no way of getting the data. This is an example of a centralized database with a single point of failure.
A centralized database has a centralized location that the majority of requests go through. It could, as in MongoDB's case, be a set of instances that route queries to the instances that can handle them.
A decentralized database is the opposite. In Cassandra, any node in a cluster can handle any request. That node is called the coordinator for the request. It then reads/writes data from/to the nodes that are responsible for that data before returning a result to the client.
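As a rough illustration of the coordinator idea, this sketch uses the DataStax Python driver to hand a query to whichever Cassandra node is reachable; the contact points, keyspace, and table are hypothetical.

```python
# Sketch: in Cassandra's decentralized design, the client can hand its
# request to any node, which then acts as coordinator. Contact points and
# keyspace/table names are hypothetical; requires the cassandra-driver package.
from cassandra.cluster import Cluster

# Any of these nodes can coordinate a request; there is no special master.
cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect("crm")

# Whichever node receives this statement forwards it to the replicas that
# own the partition, then returns the result to the client.
rows = session.execute("SELECT name FROM customers WHERE id = %s", ("42",))
for row in rows:
    print(row.name)

cluster.shutdown()
```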
Decentralization means that there should be no single point of failure in your application architecture. These systems provide a deployment scheme in which no leader (or master) is elected during the service life-cycle; they often deliver their services in a peer-to-peer fashion.
Replication simply means that your data is copied to another server instance to ensure redundancy and fault tolerance. Client requests can still be served from the copies, but your system should ensure some level of "consistency" when making copies.
Cassandra serves requests in a peer-to-peer fashion, meaning that clients can initiate requests to any node participating in the cluster. It also provides replication and tunable consistency.
MongoDB offers a primary/secondary (master/slave) deployment, so it is not considered decentralized. You can run a replica set with automatic failover to ensure that requests can still be served if the primary goes down. It also provides replication out of the box.
Links
Cassandra's tunable consistency
MongoDB's master-slave configuration
Introduction to Cassandra's architecture

Column-family database sharding and replication [NoSQL Distilled]

In section 4.5 Combining Sharding and Replication of the NoSQL Distilled book, the following assertion is made:
"Using peer-to-peer replication and sharding is a common strategy for column-family databases."
The statement leaves out other types of cluster-ready databases, namely key-value and document stores. Why is this the case? Are those databases well suited for sharding, but not in conjunction with peer-to-peer replication? Is master-slave replication a better approach in those cases?
Peer-to-peer replication has more to do with the consistency model. You're making a tradeoff between fault tolerance and consistency, where a peer-to-peer model chooses the former and a master-slave model the latter. It is possible to achieve consistency through means such as quorum reads and writes (choosing R and W so that R + W > N, the replication factor), so you can often achieve both in practice--even though the database isn't technically consistent.
There are certainly examples of non-CF stores that use peer-to-peer replication, such as Couchbase, a document store, and Riak, a KV store. These databases perform very well and use auto-sharding in some form.
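As a hedged sketch of the quorum idea mentioned above, here is what tunable consistency looks like with the DataStax Python driver for Cassandra. The contact points and schema are hypothetical; with a replication factor of N = 3, QUORUM means 2 replicas on both the write and the read path, so R + W > N and every read overlaps at least one replica that saw the latest write.

```python
# Sketch: trading a little availability back for consistency in a
# peer-to-peer store by using quorum reads and writes. Hypothetical
# contact points and schema; requires the cassandra-driver package.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect("crm")

write = SimpleStatement(
    "UPDATE customers SET name = %s WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,  # W = 2 of 3 replicas
)
session.execute(write, ("Alice", "42"))

read = SimpleStatement(
    "SELECT name FROM customers WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,  # R = 2 of 3 replicas
)
print(session.execute(read, ("42",)).one().name)

cluster.shutdown()
```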

Does MongoDB require at least 2 server instances to prevent the loss of data?

I have decided to start developing a little web application in my spare time so I can learn about MongoDB. I was planning to get an Amazon AWS micro instance and start the development and the alpha stage there. However, I stumbled across a question here on Stack Overflow that concerned me:
But for durability, you need to use at least 2 mongodb server
instances as master/slave. Otherwise you can lose the last minute of
your data.
Is that true? Can't I just have my box with everything installed on it (Apache, PHP, MongoDB) and rely on the data being correctly stored? At least, there must be a config option in MongoDB to make it behave reliably even if installed on a single box - isn't there?
The information you have on master/slave setups is outdated. A single-server MongoDB instance running with journaling is a durable data store, so for use cases where you don't need replica sets, or while you're still in development, journaling will work well.
However, if you're in production, we recommend using replica sets. For the bare-minimum setup, you would ideally run three (or more) instances of mongod: a 'primary', which receives reads and writes; a 'secondary', to which the writes from the primary are replicated; and an arbiter, a single instance of mongod that allows a vote to take place should the primary become unavailable. This 'automatic failover' means that, should your primary be unable to receive writes from your application at a given time, the secondary will become the primary and take over receiving data from your app.
You can read more about journaling here and replication here, and you should definitely familiarize yourself with the documentation in general in order to get a better sense of what MongoDB is all about.
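For illustration only, here is a minimal PyMongo sketch of both setups discussed above: a single journaled server for development, and a replica set connection for production. The host names, replica set name, and database/collection names are hypothetical.

```python
# Sketch: two connection styles with PyMongo. Host names, the replica set
# name, and database/collection names are hypothetical.
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern

# Single-server development box: j=True asks the server to acknowledge the
# write only after it has been committed to the journal, which is what
# makes a single mongod durable.
dev_client = MongoClient("mongodb://localhost:27017")
dev_events = dev_client.crm.get_collection(
    "events", write_concern=WriteConcern(j=True)
)
dev_events.insert_one({"type": "signup", "user": "alice"})

# Production: connect to the whole replica set; the driver discovers the
# current primary and follows it automatically after a failover.
prod_client = MongoClient(
    "mongodb://db1.example.com,db2.example.com,db3.example.com/?replicaSet=rs0"
)
prod_client.crm.events.insert_one({"type": "signup", "user": "bob"})
```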
Replication provides redundancy and increases data availability. With multiple copies of data on different database servers, replication protects a database from the loss of a single server. Replication also allows you to recover from hardware failure and service interruptions. With additional copies of the data, you can dedicate one to disaster recovery, reporting, or backup.
In some cases, you can use replication to increase read capacity. Clients have the ability to send read and write operations to different servers. You can also maintain copies in different data centers to increase the locality and availability of data for distributed applications.
Replication in MongoDB
A replica set is a group of mongod instances that host the same data set. One mongod, the primary, receives all write operations. All other instances, secondaries, apply operations from the primary so that they have the same data set.
The primary accepts all write operations from clients. A replica set can have only one primary. Because only one member can accept write operations, replica sets provide strict consistency. To support replication, the primary logs all changes to its data sets in its oplog. See primary for more information.
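As a small, hedged example of using the extra copies for read capacity (hypothetical hosts and names, assuming PyMongo):

```python
# Sketch: spreading reads across a replica set with PyMongo read
# preferences. Writes still go to the single primary; reads marked
# secondaryPreferred are served by a secondary when one is available.
# Host names and the replica set name are hypothetical.
from pymongo import MongoClient, ReadPreference

client = MongoClient(
    "mongodb://db1.example.com,db2.example.com,db3.example.com/?replicaSet=rs0"
)

reports = client.crm.get_collection(
    "customers", read_preference=ReadPreference.SECONDARY_PREFERRED
)

# This query may be answered by a secondary, so it can lag slightly behind
# the primary -- the usual trade-off for extra read capacity.
total = reports.count_documents({"region": "EMEA"})
print(total)
```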

I am looking for a key-value datastore which has the following (preferable) properties

I am trying to build a distributed task queue, and I am wondering if there is any data store which has some or all of the following properties. I am looking to have a completely decentralized, multi-node/multi-master, self-replicating datastore cluster to avoid any single point of failure.
Essential
Supports pickled Python objects as values.
Persistent.
The more, the better, in decreasing order of importance (I do not expect any datastore to meet all the criteria. :-))
Distributed.
Synchronous Replication across multiple nodes supported.
Runs/Can run on multiple nodes, in multi-master configuration.
Datastore cluster exposed as a single server.
Round-robin access to/selection of a node for read/write action.
Decent python client.
Support for Atomicity in get/put and replication.
Automatic failover
Decent documentation and/or Active/helpful community
Significantly mature
Decent read/write performance
Any suggestions would be much appreciated.
Cassandra (open-sourced by Facebook) has pretty much all of these properties. There are several Python clients, including pycassa.
Edited to add:
Cassandra is fully distributed, multi-node P2P, with tunable consistency levels (i.e. your replication can be synchronous or asynchronous or a mixture of both). Clients can connect to any server. Failover is automatic, and new servers can be added on-the-fly for load balancing. Cassandra is in production use by companies such as Facebook. There is an O'Reilly book. Write performance is extremely high, read performance is also high.
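For the "pickled object as value" requirement, a rough sketch with pycassa (the Thrift-based client mentioned above) might look like this; the keyspace, column family, and server addresses are hypothetical.

```python
# Sketch: storing and retrieving a pickled Python object as a value with
# pycassa. Keyspace, column family, and server addresses are hypothetical.
import pickle

from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily

# The pool can be pointed at several nodes; requests are spread across them.
pool = ConnectionPool("TaskQueue", ["cass1.example.com:9160", "cass2.example.com:9160"])
tasks = ColumnFamily(pool, "tasks")

# Serialize the task and store the raw bytes under one column of the row.
task = {"callable": "send_invoice", "args": [42]}
tasks.insert("task:0001", {"payload": pickle.dumps(task)})

# Read the row back and unpickle the payload column.
restored = pickle.loads(tasks.get("task:0001")["payload"])
print(restored)
```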

What is the major difference between Redis and Membase?

What are the major differences between Redis and Membase?
Scalability:
Membase offers a distributed key/value store (just like Memcache), so writes and reads will always be performed in predictably constant time regardless of how large your data set is. Redis, on the other hand, offers only master-slave replication, which speeds up reads but does not speed up writes.
Data Redundancy
With Membase it's simple to set up a cluster with a set number of replicated copies for each key-value pair, allowing the cluster to fail over an inoperative node without losing data. Redis' master-slave replication doesn't offer this same type of data redundancy, however.
Data Type:
Redis offers the ability to handle lists in an atomic fashion out of the box (see the sketch below), but one can implement similar functionality in the application logic layer with Membase.
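For example, a minimal sketch of Redis' atomic list handling with the redis-py client (hypothetical host and key names):

```python
# Sketch: Redis list commands are atomic on the server, which makes a
# simple work queue trivial. Host and queue name are hypothetical;
# requires the redis package.
import redis

r = redis.Redis(host="redis.example.com", port=6379)

# Producer: LPUSH is a single atomic server-side operation.
r.lpush("jobs", "send_invoice:42")

# Consumer: BRPOP atomically pops one item, blocking for up to 5 seconds,
# so two workers can never receive the same job.
item = r.brpop("jobs", timeout=5)
if item is not None:
    _queue, payload = item
    print("got job:", payload)
```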
Adoption:
Currently Redis is more widely adopted and a bit more mature than Membase. Membase does have a few high-profile use cases, such as Zynga and their slew of social games.
Membase has recently merged with CouchOne to form Couchbase, and a version of Membase offering CouchDB's map/reduce and query/index abilities is planned for the next major release (scheduled around early 2011).
Membase is a massive key-value store with persistence and replication for failover. The data stored in Membase is not subject to "modification" (besides increment); you get or set it.
Redis is more of a key-data store: it allows the manipulation of sets, lists, sorted sets, hashes, and some other data types. While Redis has replication, it is more of a master/slave type of replication.
I'm adding some points to Manto's answer:
Redis has a built-in transaction mechanism (MULTI/EXEC), while Membase does not. Depending on your workload, this may be critical.
Master-master replication has some cons compared to master-slave: looser consistency (lazy, asynchronous propagation of changes) and more complexity (hence some added latency).
The current version of Redis (2.x) does not support clustering; you'll need to shard the database manually (check http://antirez.com/post/redis-presharding.html, and see the sketch after this list), while Membase supports clustering out of the box and has a pretty nice monitoring GUI.
(Benchmarks may be ** but people just love dirty things.) Redis seems to have a slight performance edge in heavily concurrent cases (http://coder.cl/2011/06/concurrency-in-redis-and-memcache/).
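Regarding the clustering point above, here is a rough sketch of manual client-side presharding in the spirit of the linked antirez post; the hosts are hypothetical and the hashing scheme is just one simple choice.

```python
# Sketch: manual client-side sharding ("presharding") across several
# independent Redis instances. Hosts are hypothetical; requires the
# redis package.
import hashlib
import redis

SHARDS = [
    redis.Redis(host="redis-0.example.com", port=6379),
    redis.Redis(host="redis-1.example.com", port=6379),
    redis.Redis(host="redis-2.example.com", port=6379),
]

def shard_for(key: str) -> redis.Redis:
    # Hash the key and map it onto one of the instances. Every client must
    # use the same function, and resharding later means moving keys by hand.
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

shard_for("user:42").set("user:42", "Alice")
print(shard_for("user:42").get("user:42"))
```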