Do NoSQL datacenter aware features enable fast reads and writes when nodes are distributed across high-latency connections? - nosql

We have a data system in which writes and reads can be made in a couple of geographic locations which have high network latency between them (crossing a few continents, but not this slow). We can live with 'last write wins' conflict resolution, especially since edits can't be meaningfully merged.
I'd ideally like to use a distributed system that allows fast, local reads and writes, and copes with the replication and write propagation over the slow connection in the background. Do the datacenter-aware features in e.g. Voldemort or Cassandra deliver this?
It's either this, or we roll our own, probably based on collecting writes using something like
rsync and sorting out the conflict resolution ourselves.

You should be able to get the behavior you're looking for using Voldemort. (I can't speak to Cassandra, but imagine that it's similarly possible using it.)
The key settings in the configuration will be:
replication-factor — This is the total number of times the data is stored. Each put or delete operation must eventually hit this many nodes. A replication factor of n means it can be possible to tolerate up to n - 1 node failures without data loss.
required-reads — The least number of reads that can succeed without throwing an exception.
required-writes — The least number of writes that can succeed without the client getting back an exception.
So for your situation, the replication would be set to whatever number made sense for your redundancy requirements, while both required-reads and required-writes would be set to 1. Reads and writes would return quickly, with a concomitant risk of stale or lost data, and the data would only be replicated to the other nodes afterwards.

I have no experience with Voldemort, so I can only comment on Cassandra.
You can deploy Cassandra to multiple datacenters with an inter-DC latency higher than a few milliseconds (see http://spyced.blogspot.com/2010/04/cassandra-fact-vs-fiction.html).
To ensure fast local reads, you can configure the cluster to replicate your data to a certain number of nodes in each datacenter (see "Network Topology Strategy"). For example, you specify that there should always be two replica in each data center. So even when you lose a node in a data center, you will still be able to read your data locally.
Write requests can be sent to any node in a Cassandra cluster. So for fast writes, your clients would always speak to a local node. The node receiving the request (the "coordinator") will replicate the data to other nodes (in other datacenters) in the background. If nodes are down, the write request will still succeed and the coordinator will replicate the data to the failed nodes at a later time ("hinted handoff").
Conflict resolution is based on a client-supplied timestamp.
If you need more than eventual consistency, Cassandra offers several consistency options (including datacenter-aware options).

Related

How to ensure consistent reading in distributed system?

In a distributed system, if only half of the nodes are successfully written, the subsequent nodes that read the unwritten data will be inconsistent. How to avoid this situation?
client write --> Node1 v
--> Node2 v
client read --> Node3 x(The latest data was not read)
My plan:
Compare the data version with other nodes when reading data
If the current node version is found to be lower, it will be routed to other nodes to read data.
I am going to ignore tags [mongo and elastic] :)
What you are planning to do is called Dynamo style replication. That system is eventually consistent by design. (I read a while ago that it could get strongly consistent with some effort, but I don't remember if that paper was correct.)
Back to dynamo and quorum: with three nodes you want to have at least 2 nodes to save writes to assume the write has succeeded. Important point is that you need two nodes to report back to customer the success, but the data is still should to be sent to three nodes.
Let's assume that data is written to two nodes, third failed, but camed back online later. To read the data, you have to read it from any two nodes as well. You will sent read requests to all three, but only two is needed to report back to the customer. This will give you quorum: 2+2>3. This guarantees that there is an intersection in between writes and reads.
This will work ok when the network is good and nodes are healthy. But you will run into major challengers, lost updates and conflict resolution to name a few. But in either way, the system will not be strongly consistent based on design itself.
Let me describe another interesting issue to illustrate weak consistency:
node 1 gets the write
the rest of process fails; node 1 has new data, but node 2 and 3 don't
now, when you read, under the quorum condition, you may or may not see the value from node 1 - since you are picking any two nodes for a read, node 1 may not be in that set.
Long story short, dynamo is not good for strong consistency, and we get to the Raft part of the solution.
Raft will get you what you need. A consistent system. There is a catch to watch for. Most examples are focused on writing - raft maintains a log of messages and consensus is used to agree on the order (and content) of these messages.
But when you do a read, you can't just go to a node, or any two nodes, or three and read the value. You will have to do read via Raft as well, by attaching a read operation to raft's log. This is called linearizable read.
I'll stop here, as this is pretty complicated topic (but not an impossible one to learn).
Hope this gave you enough ideas to explore.
I saw both mongodb and elasticsearch is being tagged, I don't know which case you are thinking, but the two database is very different.
For mongo, replicas are not by default used to increase reading speed, see https://docs.mongodb.com/manual/core/read-preference, the default reading preferences will only look at primary and excludes all replicas. The writing of Mongo is also to the primary first and the replication will happen asynchronously possibly after the write to primary finishes, see https://docs.mongodb.com/manual/core/replica-set-members/. Because of that, if you do a force read to the secondary, you are not guaranteed to have the newest data.
For elasticsearch, elasticsearch naturally does not guarantee you always read the most recent data, see https://www.elastic.co/guide/en/elasticsearch/reference/current/near-real-time.html, so in either way even if there is only one node you may get data that are out of date.

How to read/write to secondary member of a MongoDB replica-set?

I am currently planning some server infrastructure. I have two servers in different locations. My apps (apis and stuff) are running on both of them. The client connects to the nearest (best connection). In case of failure of one server the other can process the requests. I want to use mongodb for my projects. The first idea is to use a replica set, therefore I can ensure the data is consistent. If one server fails the data is still accessible and the secondary switches to primary. When the app on the primary server wants to use the data, it is fine, but the other server must connect to to the primary server in order to handle data (that would solve the failover, but not the "best connection" problem). In Mongodb there is an option to read data from secondary servers, but then I have to ensure, that the inserts (only possible on primary) are consistent on every secondary. There is also an option for this "writeConcern". Is it possible to somehow specify “writeConcern on specific secondary”? Because If an add a second secondary without the apps on it, "writeConcern" on every secondary would not be necessary. And if I specify a specific value I don't really know on which secondary the data is available, right ?
Summary: I want to reduce the connections between the servers when the api is called.
Please share some thought or Ideas to fix my problem.
Writes can only be done on primaries.
To control which secondary the reads are directed to, you can use max staleness as well as tags.
that the inserts (only possible on primary) are consistent on every secondary.
I don't understand what you mean by this phrase.
If you have two geographically separated datacenters, A and B, it is physically impossible to write data in A and instantly see it in B. You must either wait for the write to propagate or wait for the read to fetch data from the remote node.
To pay the cost at write time, set your write concern to the number of nodes in the deployment (2, in your proposal). To pay the cost at read time, use primary reads.
Note that merely setting write concern equal to the number of nodes doesn't make all nodes have the same data at all times - it just makes your application only consider the write successful when all nodes have received it. The primary can still be ahead of a particular secondary in terms of operations committed.
And, as noted in comments, a two-node replica set will not accept writes unless both members are operational, which is why it is generally not a useful configuration to employ.
Summary: I want to reduce the connections between the servers when the api is called.
This has nothing to do with the rest of the question, and if you really mean this it's a premature optimization.
If what you want is faster network I/O I suggest looking into setting up better connectivity between your application and your database (for example, I imagine AWS would offer pretty good connectivity between their various regions).

How to reliably shard data across multiple servers

I am currently reading up on some distributed systems design patterns. One of the designs patterns when you have to deal with a lot of data (billions of entires or multiple peta bytes) would be to spread it out across multiple servers or storage units.
One of the solutions for this is to use a Consistent hash. This should result in an even spread across all servers in the hash.
The concept is rather simple: we can just add new servers and only the servers in the range would be affected, and if you loose servers the remaining servers in the consistent hash would take over. This is when all servers in the hash have the same data (in memory, disk or database).
My question is how do we handle adding and removing servers from a consistent hash where there are so much data that it can't be stored on a single host. How do they figure out what data to store and what not too?
Example:
Let say that we have 2 machines running, "0" and "1". They are starting to reach 60% of their maximum capacity, so we decide to add an additional machine "2". Now a large part the data on machine 0 has to be migrated to machine 2.
How would we automate so this will happen without downtime and while being reliable.
My own suggested approach would be that the service hosing consistent hash and the machines would have be aware of how to transfer data between each other. When a new machine is added, will the consistent hash service calculate the affected hash ranges. Then inform the affect machine
of the affected hash range and that they need to transfer affected data to machine 2. Once the affected machines are done transferring their data, they would ACK back to the consistent hash service. Once all affected services are done transferring data, the consistent hash service would start sending data to machine 2, and inform the affected machine that they can remove their transferred data now. If we have peta bytes on each server can this process take a long time. We there for need to keep track of what entires where changes during the transfer so we can ensure to sync them after, or we can submit the write/updates to both machine 0 and 2 during the transfer.
My approach would work, but i feel it is a little risky with all the backs and forth, so i would like to hear if there is a better way.
How would we automate so this will happen without downtime and while being reliable?
It depends on the technology used to store your data, but for example in Cassandra, there is no "central" entity that governs the process and it is done like almost everything else; by having nodes gossiping with each other. There is no downtime when a new node joins the cluster (performance might be slightly impacted though).
The process is as follow:
The new node joining the cluster is defined as an empty node without system tables or data.
When a new node joins the cluster using the auto bootstrap feature, it will perform the following operations
- Contact the seed nodes to learn about gossip state.
- Transition to Up and Joining state (to indicate it is joining the cluster; represented by UJ in the nodetool status).
- Contact the seed nodes to ensure schema agreement.
- Calculate the tokens that it will become responsible for.
- Stream replica data associated with the tokens it is responsible for from the former owners.
- Transition to Up and Normal state once streaming is complete (to indicate it is now part of the cluster; represented by UN in the nodetool status).
Taken from https://thelastpickle.com/blog/2017/05/23/auto-bootstrapping-part1.html
So when the joining node is in the Joining State, it is receiving data from other nodes but not ready for reads until the process is complete (Up status).
DataStax also has some material on this https://academy.datastax.com/units/2017-ring-dse-foundations-apache-cassandra?path=developer&resource=ds201-datastax-enterprise-6-foundations-of-apache-cassandra

Why MongoDB is Consistent not available and Cassandra is Available not consistent?

Mongo
From this resource I understand why mongo is not A(Highly Available) based on below statement
MongoDB supports a “single master” model. This means you have a master
node and a number of slave nodes. In case the master goes down, one of
the slaves is elected as master. This process happens automatically
but it takes time, usually 10-40 seconds. During this time of new
leader election, your replica set is down and cannot take writes
Is it for the same reason Mongo is said to be Consistent(as write did not happen so returning the latest data in system ) but not Available(not available for writes) ?
Till re-election happens and write operation is in pending, can slave return perform the read operation ? Also does user re-initiate the write operation again once master is selected ?
But i do not understand from another angle why Mongo is highly consistent
As said on Where does mongodb stand in the CAP theorem?,
Mongo is consistent when all reads go to the primary by default.
But that is not true. If under Master/slave model , all reads will go to primary what is the use of slaves then ? It further says If you optionally enable reading from the secondaries then MongoDB becomes eventually consistent where it's possible to read out-of-date results. It means mongo may not be be
consistent with master/slaves(provided i do not configure write to all nodes before return). It does not makes sense to me to say mongo is consistent if all
read and writes go to primary. In that case every other DB also(like cassandra) will be consistent . Is n't it ?
Cassandra
From this resource I understand why Cassandra is A(Highly Available ) based on below statement
Cassandra supports a “multiple master” model. The loss of a single
node does not affect the ability of the cluster to take writes – so
you can achieve 100% uptime for writes
But I do not understand why cassandra is not Consistent ? Is it because node not available for write(as coordinated node is not able to connect) is available for read which can return stale data ?
Go through: MongoDB, Cassandra, and RDBMS in CAP, for better understanding of the topic.
A brief definition of Consistency and availability.
Consistency simply means, when you write a piece of data in a system/distributed system, the same data you should get when you read it from any node of the system.
Availability means, the system should always be available for read/write operation.
Note: Most systems are not, only available or only consistent, they always offer a bit of both
With the above definition let's see where MongoDB and Cassandra fall in CAP.
MongoDB
As you said MongoDB is highly consistent when reads and write go to the same node(the default case). Further, you can choose in MongoDB to read from other secondary nodes instead of reading from only leader/primary.
Now, when you try to read data from secondary, your consistency will completely depend on, how you want to read data:
You could ask data which is up to maximum, say 5 seconds stale or,
You could just say, return data from majority of nodes for your select statement.
Same way when you write from your client into Mongo leader, you can say, a write is successful if the data is replicated to or stored on majority of servers.
Clearly, from above, we can say MongoDb can be highly consistent or eventually consistent based on how you read/write your data.
Now, what about availability? MongoDB is mostly always available, but, the only time when the leader is down, MongoDB can't accept writes, until it figures out the new leader. Hence, not highly available
So, MongoDB is categorized under CP.
What about Cassandra?
In Cassandra, there is no leader and any nodes can accept write, so the Cassandra cluster is always available for writes and reads even if some nodes go down.
What about consistency in Cassandra?
Same as MongoDB Cassandra can be eventually consistent or highly consistent based on how you read/write data.
You can give consistency levels in your read/write operations, For example:
read/write data from one node
read/write data from majority/quorum of nodes and more
Let's say you give a consistency level of one in your read/write operation. So, your write is successful as soon as data is written to one replica. Now, if your read request happens to go to the other replica where the data is not updated yet(could be due to high network latency or any other reason), you will end up reading the old data.
So, Cassandra is highly available but has configurable consistency levels and hence not always consistent.
In conclusion, in their default behavior, MongoDB falls under CP and Cassandra in AP.
Consistency in the CAP paradigm also includes "eventual consistency" which MongoDB supports. In a contrast to ACID systems, the read in CAP systems does not guarantee a safe return.
In simple words, this means that your Master could have an updated value, but if you do read from Slave, it does not necessarily return the updated value, and that it's okay to no have this updated value by design.
The concept of eventual consistency is explained in an excellent answer here.
By architecture, Cassandra is supposed to be consistent; it offers a special implementation of eventual consistency called the 'tunable consistency' which would meant that the client application may choose the method of handling this- it even offers multi data centre consistency support at low levels!
Most issues from row wise inconsistency in Cassandra comes from the fact that Cassandra uses client timestamps to determine which value is the most recent, and not the server side ones, which may be tad bit confusing to understand at first.
I hope this helps!
You have only to understand the "point-in-time": As you only write to mongodb master, even if slave is not updated, it is consistent, as it has all the data generated util the sync moment.
That is not true for cassandra. As cassandra uses a master-less model, there's no garantee that other nodes has all the data. At a certain time, a node can have certain recent data, and not having older data from nodes not yet synced. Cassandra will only be consistent if you stop write to all nodes and put them online. As soon the sync finished you have a consistent data.

Can Triggers be used in Cassandra for production for a multi datacenter environment?

I have a multi datacenter(DC1, DC2) environment having 3 nodes in each datacenter with RF=3 per datacenter.
Wanted to know if triggers can be used in production in a multi-datacenter environment. If so, how can this be achieved?
Case A: If I start inserting the data to DC1, it would have 3 replicas with in DC1 and is responsible of replicating the data to other data center DC2. Every time an insert into DC2 takes place, I would like to have an trigger event to occur and notify about the latest inserted value in the application. Is it possible?
Case B: If not point 2, is it good to insert the data simultaneously on to two datacenters DC1, DC2 (pointing to a single table) and avoid triggers concept?
Will it have any impact with the network traffic? Based on the latest timestamp, the table would have the last insert to the table which serves the purpose when queried from either of the regions.
Consistency level as LOCAL_QUORUM for Read
Consistency level as ONE for write
dse 4.8.2
With these Consistency levels, good consistency can be achieved lowering the latency for write operation across the datacenters.
Usecase:
We have an application (2 domains) for two different regions(DC1 &
DC2). Users of DC1 region uses domain 1 to access the application and
users of DC2 region uses domain 2 for the same. The data is ingested
to DC1 for the same region and when this replicates in its DC, the
coordinator of DC1 would replicate the data in other DC (DC2). The
moment Dc2 receives the data from DC1, we want to let the application
know about the latest information (Polling_ available using some
trigger event mechanism. Just wanted to know if this can be
implemented with cassandra triggers.
Can someone give the feedback on Case A and Case B? and which would be efficient in production.
Thanks
In either case stated above I am not sure why you want to use a trigger to notify your application that a value was inserted. In the scenario as I understand it your application already knows the newest value. Once the write has been successful you can notify your application with the newest value.
In both cases A and B you are working against some of the basic principals of how Cassandra functions. At an application level you should now need to worry about ensuring replication or eventual consistency of your data across multiple nodes and data centers. That is a large part of what Cassandra brings to the table.
In both Case A and B you are going to get multiple inserts of the same data for each write in each node it is replicated to in both data centers. As you write to DC1 it will also be written to DC2. If you then write to DC2 it will be written back to DC1. This will end with a large number of rows containing the same data and will increase disk requirements and compaction frequency. This will also increase network traffic as the two DC's talk back and forth to gain eventual consistency.
From what I can see here I also have to ask why you are doing an RF=3 on a 3 node cluster. This means that each node in each data center will have all the data essentially making each server a complete replica of the others. This seems like it may be overkill (depending on the data of course) as you are not going to get a lot of the scalability benefits that Cassandra offers.
Cassandra will handle the syncing of data between the data centers and across nodes so your application does not need to worry about this.
One other quick note - Currently your writes are using a CL=ONE. This means that you may end up with cross-DC latency on a write request. If you change this to LOCAL_ONE then you limit your CL query until one of the nodes in the local DC has written the value instead of possibly a node in the other DC. Cassandra will still handle the replication and syncing of the data.
Generally, multiple data center concept is used for workload separation(say different DCs for real-time query,analytic and search). Cassandra by itself takes care of replicating the data across multiple DCs.
So, coming to your question Case B doesn't seems a right option because:
Cassandra automatically replicates data across multiple DCs link
Case A is feasible.alerts/notifications using triggers
Hope, it will be helpful.