Column-family database sharding and replication [NoSQL Distilled] - nosql

In section 4.5 Combining Sharding and Replication of the NoSQL Distilled book, the following assertion is made:
"Using peer-to-peer replication and sharding is a common strategy for column-family databases."
The statement leaves out other types of cluster-ready databases, namely key-value and document stores. Why is this the case? Are those databases well suited for sharding, but not in conjunction with peer-to-peer replication? Is master-slave replication a better approach in those cases?

Peer-to-peer replication has more to do with the consistency model. You're making a tradeoff between fault tolerance and consistency, where a peer-to-peer model chooses the former and a master-slave model the latter. It is possible to achieve consistency through means such as using quorum reads and writes, so you can often achieve both in practice--even though the database isn't technically consistent.
There are certainly examples of non-CF stores that use peer-to-peer replication, such as Couchbase, a document store, and Riak, a KV store. These databases perform very well and use auto-sharding in some form.
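To make the quorum point above concrete, here is a toy Python sketch (the numbers and data structures are made up and do not model any particular database) of why R + W > N lets a read observe the latest write even when individual replicas lag:

    # Toy sketch: with N replicas, write quorum W and read quorum R chosen so
    # that R + W > N, any read quorum overlaps the last write quorum.
    N, W, R = 5, 3, 3
    replicas = [{"value": "old", "version": 1} for _ in range(N)]

    def quorum_write(value, version):
        # a write is acknowledged once W replicas have it
        for node in replicas[:W]:
            node.update(value=value, version=version)

    def quorum_read():
        # worst case: poll the R replicas "furthest" from the write quorum;
        # they still share at least one node with it, so the highest version wins
        polled = replicas[-R:]
        return max(polled, key=lambda n: n["version"])["value"]

    quorum_write("new", version=2)
    assert quorum_read() == "new"   # holds for any R, W with R + W > N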

Related

How to decide when to use replica sets for MongoDB in production

We are currently hosting MongoDB in production using its official Docker image on EC2, on a 32 GB memory server dedicated to just this service.
How can using replica sets help us improve the performance of our MongoDB? We are currently finding that query response times are getting slower day by day.
Are there any measures through which we can determine whether investing in a replica set will provide worthwhile benefits and not be premature optimization?
MongoDB replication is a high availability solution (see note at the end of the post for more details on Replication). Replication is not a performance improvement solution.
MongoDB query performance depends upon various factors: size of collection, size of document, database design, query definition and indexes. Inadequate hardware (memory, hard drive, cpu and network) can affect the query performance. The number of operations at a given time can also affect the performance.
For faster query performance the main consideration is using indexes. Indexes directly affect the query's filter and sort operations. To find out whether your query is performing optimally and using the proper indexes, generate a query plan using explain with "executionStats" mode and study the plan. Explain can be run on MongoDB find, update, delete and aggregation queries. All these queries can benefit from indexes. See Query Optimization.
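As a rough illustration, an explain plan in "executionStats" mode could be pulled from Python with pymongo like this (the connection string, database, collection and filter are all hypothetical):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")   # hypothetical connection
    db = client["crm"]                                  # hypothetical database

    # Wrap the find in an explain command and ask for executionStats
    plan = db.command(
        "explain",
        {"find": "customers", "filter": {"email": "jane@example.com"}},
        verbosity="executionStats",
    )

    stats = plan["executionStats"]
    print(stats["nReturned"], stats["totalKeysExamined"],
          stats["totalDocsExamined"], stats["executionTimeMillis"])
    # A COLLSCAN stage, or totalDocsExamined far above nReturned, usually
    # points to a missing or unused index.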
Adding resources to the existing hardware is known as vertical scaling; replication is not vertical scaling.
Replication:
This is configured as a replica-set - a primary node and multiple secondary nodes. The primary is the main point of contact for the application - all writes happen on the primary (and reads, by default). The data written to the primary is replicated to the secondaries. This way data redundancy is accomplished. When the primary goes down, one of the secondaries takes over as primary and keeps the system running via a failover process. Data durability, high availability, redundancy and failover are the main concepts with replication. In MongoDB a replica-set cluster can have up to fifty nodes.
It is recommended to use a replica-set in production due to its HA functionality.
Given your resource limits on one hand and the need for HA in production on the other, I would suggest you create a minimal replica-set consisting of a Primary, a Secondary and an Arbiter (an arbiter does not contain any data and consumes very little memory).
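A minimal sketch of initiating such a Primary/Secondary/Arbiter set from pymongo (host names and the replica-set name "rs0" are hypothetical, and a reasonably recent pymongo is assumed for the directConnection option; the same configuration document can be passed to rs.initiate() in the shell):

    from pymongo import MongoClient

    # Connect directly to the node that should become primary
    client = MongoClient("mongodb://mongo-1:27017", directConnection=True)

    config = {
        "_id": "rs0",
        "members": [
            {"_id": 0, "host": "mongo-1:27017"},                       # primary candidate
            {"_id": 1, "host": "mongo-2:27017"},                       # secondary
            {"_id": 2, "host": "mongo-3:27017", "arbiterOnly": True},  # arbiter, no data
        ],
    }
    client.admin.command("replSetInitiate", config)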
Also, writes typically affect your memory performance much more than reads. In order to achieve better write performance I would advise you to create more shards (the more masters you have, the more writes you can handle at the same time).
However, I'm not sure what causes your Mongo's performance to degrade so fast. I think you should:
Check what affects your production performance the most (complicated queries or heavy writes).
Change your read preference to "nearest" (see the sketch after this list).
Consider disabling read concern "majority" (remember that by default there is a write concern of "majority"; members should be up to date).
Check for a better index.
And of course, create a replica-set!
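For example, the read-preference and read-concern tweaks above might look roughly like this with pymongo (the connection string, database and collection names are hypothetical):

    from pymongo import MongoClient, ReadPreference
    from pymongo.read_concern import ReadConcern

    client = MongoClient("mongodb://mongo-1,mongo-2,mongo-3/?replicaSet=rs0")

    # Route reads to the lowest-latency member and relax the read concern
    coll = client["crm"].get_collection(
        "customers",
        read_preference=ReadPreference.NEAREST,
        read_concern=ReadConcern("local"),
    )
    doc = coll.find_one({"status": "active"})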
Good Luck! :P

How to achieve strong consistency in MongoDB Replica Sets?

In the MongoDB documentation, here, it is mentioned that in a replica set, even with majority readConcern, we would achieve eventual consistency. I am wondering how this is possible when we have majority in both reads and writes, which gives us a quorum (R+W>N) in our distributed system. I would expect a strongly consistent system in this setting. This is the technique Cassandra uses as well in order to achieve strong consistency.
Can someone clarify this for me please?
MongoDB is not regarded very well in terms of strong consistency. If you have a typical sharded and replicated setup, increasing consistency will require trading off some of the database's performance. As you know, you can execute write operations only on the master of the replica set. By default you can only read from it as well. This is possibly the strongest consistency you can get from MongoDB AFAIK, as the other nodes are used only for replication, failover and availability reasons. You could read from the secondary nodes only for operations where having the latest data is not crucial, and for long-running operations such as aggregations.
If you set up sharding you can offload a big portion of the read/write operations to different primary nodes. I think that, when it comes to MongoDB, that is all you can do to increase consistency and performance, in particular for larger data sets.
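For reference, here is a minimal pymongo sketch of the majority-write/majority-read combination the question describes (connection string and collection names are hypothetical); a read issued against the primary after an acknowledged w="majority" write will see that write, at the cost of extra latency:

    from pymongo import MongoClient
    from pymongo.read_concern import ReadConcern
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://mongo-1,mongo-2,mongo-3/?replicaSet=rs0")

    # w="majority" writes combined with readConcern "majority" reads
    orders = client["shop"].get_collection(
        "orders",
        write_concern=WriteConcern("majority"),
        read_concern=ReadConcern("majority"),
    )
    orders.insert_one({"_id": "o-1", "total": 99})
    print(orders.find_one({"_id": "o-1"}))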

When to use master-master replication VS master-slave replication

I am using CouchDB as my NoSQL database for a CRM solution.
CouchDB uses master-master replication.
By comparison, MongoDB uses master-slave replication.
Being a bit of a newcomer to NoSQL,
I would like to clearly understand the benefits of master-master replication over master-slave replication.
In a master-master architecture, you can distribute the power to the places it's most needed. In a CRM, you may want a single point of authority (the head office), but authoritative content may be created by anyone (sales reps, VPs, tech support agents). Master-master lets you bring the canonical data source as close as possible to every content owner/creator in that scenario.
In master-slave architectures, everyone must be able to reach the canonical authoritative source or risk (at least) having their content overwritten, or not being able to write at all.
Apache CouchDB is particularly well suited to master-master replication and coupled with PouchDB can provide applications that work offline-first--cloud optional. These apps can then synchronize their changes when an Internet connection is again available.
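For illustration, CouchDB replication can be driven from Python via its HTTP _replicate endpoint; setting it up in both directions gives the master-master style sync described above (server URL, credentials and database names are hypothetical):

    import requests

    COUCH = "http://admin:secret@couch.example.com:5984"   # hypothetical server

    def replicate(source, target):
        # One call sets up continuous replication in one direction;
        # calling it both ways keeps the two databases in sync.
        resp = requests.post(
            COUCH + "/_replicate",
            json={"source": source, "target": target, "continuous": True},
        )
        resp.raise_for_status()
        return resp.json()

    replicate("crm_head_office", "crm_field_office")
    replicate("crm_field_office", "crm_head_office")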
In master-master replication, clients can both read and write on any of the instances. It has its use case where we want to write independently to two databases. Imagine you have one database in Asia and one in North America. Clients in Asia will write to the database in Asia and clients in North America will write to the database in North America. This gives low latency.
If one of the instances goes down, clients can write to the other instances. This provides high availability.
In master-master replication, there is also the possibility of write conflicts. If you make a change on db1 and on db2 at about the same time, and both modify the same record, then when the data between the two databases is reconciled it will result in a write conflict. In master-slave replication, however, there is only one source of truth because we write ONLY ON THE MASTER.

I am looking for a key-value datastore which has following (preferable) properties

I am trying to build a distributed task queue, and I am wondering if there is any data store, which has some or all of the following properties. I am looking to have a completely decentralized, multinode/multi-master self replicating datastore cluster to avoid any single point of failure.
Essential
Supports Python pickled objects as values.
Persistent.
The more, the better, in decreasing order of importance (I do not expect any datastore to meet all the criteria. :-))
Distributed.
Synchronous Replication across multiple nodes supported.
Runs/Can run on multiple nodes, in multi-master configuration.
Datastore cluster exposed as a single server.
Round-robin access to/selection of a node for read/write action.
Decent python client.
Support for Atomicity in get/put and replication.
Automatic failover
Decent documentation and/or Active/helpful community
Significantly mature
Decent read/write performance
Any suggestions would be much appreciated.
Cassandra (open-sourced by Facebook) has pretty much all of these properties. There are several Python clients, including pycassa.
Edited to add:
Cassandra is fully distributed, multi-node P2P, with tunable consistency levels (i.e. your replication can be synchronous or asynchronous or a mixture of both). Clients can connect to any server. Failover is automatic, and new servers can be added on-the-fly for load balancing. Cassandra is in production use by companies such as Facebook. There is an O'Reilly book. Write performance is extremely high, read performance is also high.
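As an illustration of the tunable consistency mentioned above, here is a rough sketch using the newer DataStax cassandra-driver (rather than pycassa) to store a pickled task as a blob with QUORUM reads and writes; the node addresses, keyspace and table are hypothetical:

    import pickle
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])   # hypothetical nodes
    session = cluster.connect("tasks")                        # hypothetical keyspace

    # Write a pickled Python object with QUORUM consistency (R + W > N)
    write = SimpleStatement(
        "INSERT INTO queue (task_id, payload) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.QUORUM,
    )
    session.execute(write, ("task-42", pickle.dumps({"action": "resize", "size": 128})))

    # Read it back, also at QUORUM
    read = SimpleStatement(
        "SELECT payload FROM queue WHERE task_id = %s",
        consistency_level=ConsistencyLevel.QUORUM,
    )
    task = pickle.loads(session.execute(read, ("task-42",)).one().payload)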

What is the major difference between Redis and Membase?

What are the major differences between Redis and Membase?
Scalability:
Membase offers a distributed key/value store (just like Memcache), so writes and reads will always be performed in predictably constant time regardless of how large your data set is. Redis, on the other hand, offers just master-slave replication, which speeds up reads but does not speed up writes.
Data Redundancy:
It's simple to set up a cluster with a set number of replicated copies for each key-value pair, allowing servers to fail over an inoperative node in the cluster without losing data. Redis' master-slave replication doesn't offer this same type of data redundancy, however.
Data Type:
Redis offers the ability to handle lists in an atomic fashion out of the box, whereas with Membase one would have to implement similar functionality in the application logic layer.
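For example, Redis list commands are individually atomic, so with redis-py they can serve as a simple shared work queue (host and key names are hypothetical):

    import redis

    r = redis.Redis(host="localhost", port=6379)   # hypothetical instance

    # LPUSH and BRPOP are atomic commands, so multiple producers and consumers
    # can share the list without extra locking.
    r.lpush("jobs", "resize:image-1", "resize:image-2")   # producer side
    job = r.brpop("jobs", timeout=5)                      # consumer side, blocks up to 5s
    if job:
        _key, payload = job
        print(payload.decode())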
Adoption:
Currently Redis is more widely adopted and a bit more mature than Membase. Membase does have a few high-profile use cases, such as Zynga and their slew of social games.
Membase has recently merged with Couchbase and they will have a version of Membase that will offer CouchDB's Map/Reduce and query/index ability in the next major release (scheduled around early 2011).
Membase is a massive key-value store with persistence and replication for failover. The data stored in Membase is not subject to "modification" (besides increment). You get or set it.
Redis is more of a key-data store. Redis allows the manipulation of sets, lists, sorted sets, hashes and some other odd data types. While Redis has replication, it is more of a master/slave type of replication.
I'm adding some points to Manto's answer:
Redis has a built-in transaction mechanism, while Membase does not. Depending on your workload this may be critical (a small sketch follows this list).
Master-master replication has some cons compared to master-slave: looser consistency (lazy objects, async propagation, ...) and more complexity (hence some added latency).
The current version of Redis (2.x) does not support clustering. You'll need to shard the database manually (check http://antirez.com/post/redis-presharding.html), while Membase supports clustering out of the box and has a pretty nice monitoring GUI.
(Benchmarks may be ** but people just love dirty things.) Redis seems to have a slight performance edge in heavily concurrent cases. (http://coder.cl/2011/06/concurrency-in-redis-and-memcache/)
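The Redis transaction mechanism mentioned in the first point boils down to MULTI/EXEC, which in redis-py looks roughly like this (key names are hypothetical):

    import redis

    r = redis.Redis()   # hypothetical local instance

    # A transactional pipeline wraps the commands in MULTI/EXEC, so both are
    # applied atomically; this is the kind of built-in transaction Membase lacked.
    pipe = r.pipeline(transaction=True)
    pipe.incr("orders:count")
    pipe.lpush("orders:recent", "order-42")
    pipe.execute()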