How to deploy zookeeper across multiple data centers and failover? - apache-zookeeper

I would like to know about the existing approaches for running ZooKeeper across data centers.
One approach I found after doing some research is to use observers: keep a single ensemble, with the leader and followers, in the main data center, and run observers in the backup data center. When the main data center crashes, we pick the other data center as the new main data center and convert the observers to leader/followers manually.
I would like to know about better approaches to achieve the same.
Thanks

First I would like to point out the cons of your solution, which my proposal hopefully solves:
a) in case of a main data center failure the recovery process is manual (quoting you: "convert observers to leader/follower manually")
b) only the main data center accepts writes -> in case of failure, either all data (when observers don't write logs) or only the last updates (when observers do write logs) are lost
Because the question is about data centers (plural), I'll assume we have enough DCs to reach our objective: solving a. and b. while keeping a usable multi-data-center distributed ZK.
So, with an even number of data centers (DCs), one could use an additional DC just to obtain an odd number of ZK nodes in the ensemble. With 2 DCs, for example, a 3rd one could be added; each DC could then contain 1 rwZK (read-write ZK node) or, for better tolerance against failures, 3 rwZKs organized as hierarchical quorums (both cases can also benefit from ZK observers). Inside a DC all ZK clients should point only to that DC's ZK group, so the remaining inter-DC traffic is only for things like leader election and writes. With this kind of setup one solves both a. and b., but loses write/recovery performance, because writes and elections must be agreed upon between data centers: at least 2 DCs must agree on each write/election, with an agreement of 2 ZK nodes per DC (see hierarchical quorums). The intra-DC agreement should be fast enough that it won't matter much in the overall write agreement; bottom line, approximately only the delay between DCs matters. The disadvantages of this approach are (a configuration sketch follows the list):
- additional cost for the 3rd data center: this could be mitigated by using the company office (someone has done that) as the 3rd data center
- lost sessions because of inter-DC network latency and/or throughput: with high enough timeouts one can reach a maximum possible write throughput (depending on the average inter-DC network speed), so this solution is valid only when that maximum is acceptable. Still, when using 1 rwZK per DC, I guess there won't be much difference from your solution, because writes from the backup DC to the main DC must travel between DCs anyway; but in your solution there is no inter-DC communication for write agreements or leader elections, so it is faster.
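For concreteness, here is a minimal zoo.cfg sketch of the hierarchical-quorum variant with 3 DCs and 3 rwZKs each; the hostnames are hypothetical placeholders and the groups/weights follow ZooKeeper's standard hierarchical quorum settings:

    # zoo.cfg sketch (hypothetical hostnames) - identical server list on every node
    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    server.1=zk1.dc1.example.com:2888:3888
    server.2=zk2.dc1.example.com:2888:3888
    server.3=zk3.dc1.example.com:2888:3888
    server.4=zk1.dc2.example.com:2888:3888
    server.5=zk2.dc2.example.com:2888:3888
    server.6=zk3.dc2.example.com:2888:3888
    server.7=zk1.dc3.example.com:2888:3888
    server.8=zk2.dc3.example.com:2888:3888
    server.9=zk3.dc3.example.com:2888:3888
    # hierarchical quorums: one group per DC; a proposal needs a majority of groups,
    # each of which must reach an internal majority (here: 2 of 3 nodes in 2 of 3 DCs)
    group.1=1:2:3
    group.2=4:5:6
    group.3=7:8:9
    weight.1=1
    weight.2=1
    weight.3=1
    weight.4=1
    weight.5=1
    weight.6=1
    weight.7=1
    weight.8=1
    weight.9=1

Clients in each DC would then list only their local servers in their connection string.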
Another consideration:
Regardless of the chosen solution, the inter-DC communication should be secured; ZK offers no built-in solution for this, so tunneling or another approach must be implemented.
UPDATE
Another solution would be to still use an additional 3rd DC (or the company office), but keep only the rwZKs there (1, 3 or another odd number) while the other 2 DCs have only observer ZKs. Clients should still connect only to their own DC's ZK servers, but hierarchical quorums are no longer needed. The gain here is that write agreements and leader elections happen only inside the DC holding the rwZKs (let's call it the arbiter DC). The disadvantages are (an observer configuration sketch follows the list):
- the arbiter DC is a single point of failure
- write requests will still have to travel from the observer DCs to the arbiter DC
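A minimal sketch of this arbiter-DC variant (hostnames again hypothetical): the voting rwZKs live only in the arbiter DC, while the serving DCs run observers, using ZooKeeper's standard observer settings:

    # zoo.cfg sketch - shared server list on every node
    server.1=zk1.arbiter.example.com:2888:3888
    server.2=zk2.arbiter.example.com:2888:3888
    server.3=zk3.arbiter.example.com:2888:3888
    server.4=zk1.dc1.example.com:2888:3888:observer
    server.5=zk1.dc2.example.com:2888:3888:observer

    # additionally, only in the zoo.cfg of server.4 and server.5:
    peerType=observer

Clients in DC1 and DC2 connect to their local observer(s), which forward write requests to the arbiter DC's voters.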

Related

Can Triggers be used in Cassandra for production for a multi datacenter environment?

I have a multi-datacenter (DC1, DC2) environment with 3 nodes in each data center and RF=3 per data center.
I wanted to know if triggers can be used in production in a multi-datacenter environment. If so, how can this be achieved?
Case A: If I start inserting data into DC1, it will have 3 replicas within DC1 and will be responsible for replicating the data to the other data center, DC2. Every time an insert into DC2 takes place, I would like a trigger event to fire and notify the application about the latest inserted value. Is this possible?
Case B: If Case A is not possible, is it a good idea to insert the data simultaneously into both data centers DC1 and DC2 (pointing to a single table) and avoid the trigger concept altogether?
Will it have any impact on network traffic? Based on the latest timestamp, the table would hold the last insert, which serves the purpose when queried from either region.
Consistency level for reads: LOCAL_QUORUM
Consistency level for writes: ONE
DSE 4.8.2
With these consistency levels, good consistency can be achieved while lowering the latency of write operations across the data centers.
Use case:
We have an application (2 domains) for two different regions (DC1 & DC2). Users in the DC1 region use domain 1 to access the application and users in the DC2 region use domain 2 for the same. Data is ingested into DC1 for that region, and when this replicates within its DC, the coordinator of DC1 replicates the data to the other DC (DC2). The moment DC2 receives the data from DC1, we want to let the application know about the latest information, ideally through some trigger event mechanism rather than polling. I just wanted to know if this can be implemented with Cassandra triggers.
Can someone give feedback on Case A and Case B, and which would be more efficient in production?
Thanks
In either case stated above, I am not sure why you want to use a trigger to notify your application that a value was inserted. In the scenario as I understand it, your application already knows the newest value. Once the write has succeeded, you can notify your application with the newest value.
In both cases A and B you are working against some of the basic principles of how Cassandra functions. At the application level you should not need to worry about ensuring replication or eventual consistency of your data across multiple nodes and data centers. That is a large part of what Cassandra brings to the table.
In both Case A and B you are going to get multiple inserts of the same data for each write, on each node it is replicated to in both data centers. As you write to DC1 it will also be written to DC2. If you then write to DC2 it will be written back to DC1. This will end up with a large number of rows containing the same data, and will increase disk requirements and compaction frequency. It will also increase network traffic as the two DCs talk back and forth to reach eventual consistency.
From what I can see here I also have to ask why you are using RF=3 on a 3-node cluster. This means that each node in each data center will have all the data, essentially making each server a complete replica of the others. This seems like it may be overkill (depending on the data, of course), as you are not going to get a lot of the scalability benefits that Cassandra offers.
Cassandra will handle the syncing of data between the data centers and across nodes so your application does not need to worry about this.
One other quick note - currently your writes are using CL=ONE. This means that you may end up with cross-DC latency on a write request. If you change this to LOCAL_ONE, then the write is acknowledged as soon as one node in the local DC has written the value, instead of possibly a node in the other DC. Cassandra will still handle the replication and syncing of the data.
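As a rough sketch with the DataStax Java driver (contact point, keyspace and table are hypothetical placeholders), switching the write consistency to LOCAL_ONE is a one-line change on the statement:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.Statement;

    public class LocalOneWrite {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
            Session session = cluster.connect("my_keyspace");

            // LOCAL_ONE: the write is acknowledged once one replica in the local DC
            // has accepted it, so the request does not wait on cross-DC latency.
            Statement insert = new SimpleStatement(
                    "INSERT INTO events (id, value) VALUES (1, 'hello')")
                    .setConsistencyLevel(ConsistencyLevel.LOCAL_ONE);
            session.execute(insert);

            cluster.close();
        }
    }

The "local" DC is typically the one chosen by the driver's DC-aware load balancing policy; replication to the remote DC still happens in the background.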
Generally, the multiple data center concept is used for workload separation (say, different DCs for real-time queries, analytics and search). Cassandra by itself takes care of replicating the data across multiple DCs.
So, coming to your question, Case B doesn't seem the right option because:
Cassandra automatically replicates data across multiple DCs.
Case A is feasible: alerts/notifications using triggers.
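If you do try Case A with a trigger, a minimal sketch could look like the following. This assumes the Cassandra 2.1-era ITrigger interface that DSE 4.8 is based on, and the class and table names are hypothetical; the trigger only performs a side effect (notifying your application) and returns no extra mutations:

    import java.nio.ByteBuffer;
    import java.util.Collection;
    import java.util.Collections;

    import org.apache.cassandra.db.ColumnFamily;
    import org.apache.cassandra.db.Mutation;
    import org.apache.cassandra.triggers.ITrigger;

    // Hypothetical trigger that notifies an external application about each write
    // to the table it is attached to.
    public class NotifyAppTrigger implements ITrigger {
        @Override
        public Collection<Mutation> augment(ByteBuffer partitionKey, ColumnFamily update) {
            // Keep this cheap: it runs on the write path.
            notifyApplication(partitionKey, update);
            // This trigger generates no additional mutations.
            return Collections.emptyList();
        }

        private void notifyApplication(ByteBuffer key, ColumnFamily update) {
            // Placeholder: push the key/update to a queue, call a webhook, etc.
        }
    }

The compiled jar would go into each node's triggers directory and be attached with CQL, e.g. CREATE TRIGGER notify_app ON my_keyspace.my_table USING 'NotifyAppTrigger';. Whether the trigger actually fires where you need it to (on DC2's side of the replication) is exactly what you should verify for Case A.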
Hope it will be helpful.

MongoDB - cross data center primary election DRP / Geographically Distributed Replica Sets

I'm working with MongoDB distributed over 3 data centers.
For this example the data center names are A, B, C.
When everything is going well, all user traffic is pointed to A,
so the MongoDB primary is on A. The replica set setup is:
3 servers in A (with high priority)
1 server in B (with low priority)
1 server in C (priority 0)
The problem is supporting MongoDB writes when either of 2 scenarios happens:
no network between A-B-C (the network tunnel is down)
data center A is on fire :), let's say the data center isn't working; at this point all user traffic is pointed to B and a primary election in B is expected.
Scenario 1 isn't a problem: with no inter-data-center network tunnel, A still has a majority of replicas and high priority, so everything keeps working.
Scenario 2 won't work, because when A stops working, all 3 replicas (on A) aren't reachable, so no new primary will be elected in B or C because the majority of replicas is down.
How can I set up my replica set so it supports both scenarios?
This is not possible: with a majority election approach as used by MongoDB, you can't have an 'available' system both in case of a total network partition and in case of a DC failure. Either the majority is in one DC, in which case it will survive partitions but not that DC going down, or the majority requires 2 DCs to be up, which survives one DC going down but not a full network failure.
Your options:
Accept the partition problem and change the setup to 2-2-1 (see the sketch after this list). Unreliable tunnels should be solvable; if the entire network of a DC goes down, you're at scenario 2.
Accept the DC problem and stick to your configuration. The most likely problems are probably large-scale network issues and massive power outages, not fire.
Use a database that supports other types of fault-tolerance. That, however, is not a panacea since this entails other tradeoffs that must be well understood.
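For option 1, here is a mongo shell sketch of a fresh 2-2-1 replica set with hypothetical hostnames (an existing set would be adjusted with rs.reconfig() instead). Any single DC can fail and the remaining 3 of 5 voters still form a majority:

    // run once against one of the members to initiate the 2-2-1 replica set
    rs.initiate({
      _id: "rs0",
      members: [
        { _id: 0, host: "a1.example.com:27017", priority: 2 },   // DC A
        { _id: 1, host: "a2.example.com:27017", priority: 2 },   // DC A
        { _id: 2, host: "b1.example.com:27017", priority: 1 },   // DC B
        { _id: 3, host: "b2.example.com:27017", priority: 1 },   // DC B
        { _id: 4, host: "c1.example.com:27017", priority: 0.5 }  // DC C, tie-breaker
      ]
    })

The priorities keep A preferred as long as it is up, while B's members can still win an election together with the C member when A is down.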
To keep the system up when DC A goes down also requires application servers in DC B or C, which is a tricky problem in its own right. If you use a more partition-tolerant database, for instance, you could easily end up with a 'split brain' problem where application servers in different DCs accept different, but conflicting, writes. Such problems can only be solved at the application level.

Why does a mongodb replica set need an odd number of voting members?

I find the replica set requirement a bit confusing, and I'm probably missing something obvious (like under which conditions elections take place).
I understand that in normal operation you need a quorum, and when a vote takes place you need an odd number of machines to get a majority.
But since we use a replica set for failover, if the master dies we are left with an even number of voting members, which based on my limited experience lengthens the time to elect a primary.
Also, according to the documentation, the addition of a voting member doesn't start an election, so it would seem that starting (booting) your replica set with an even number of nodes would make more sense?
So if we start with, say, 4 machines in the replica set, and one machine dies, there is a re-election with 3 machines: fast quorum. We add a machine back to return to our normal operating state, with no re-election, and we are back to our normal operating conditions.
Can someone shed some light on this?
TL;DR: With single-master systems, a partition into two equal halves makes it impossible to determine which side still has a majority, taking both sides down.
Let N be a cluster of four machines:
One machine dies, the others resume operation. Good.
Two machines die, we're offline because we no longer get a majority. Bad.
Let M be a cluster of three machines:
One machine dies, the others resume operation. Good.
Two machines die, we're offline because we no longer get a majority. Bad.
=> Same result at 3/4 of the cost.
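A quick sketch of the arithmetic behind this (majority = floor(n/2) + 1, tolerated failures = n - majority), just to make the 3-vs-4 comparison explicit:

    // Why an even number of voters buys nothing extra: integer division makes
    // a 4-node set tolerate the same single failure as a 3-node set.
    public class QuorumMath {
        public static void main(String[] args) {
            for (int n = 3; n <= 6; n++) {
                int majority = n / 2 + 1;
                System.out.printf("voters=%d majority=%d tolerated failures=%d%n",
                        n, majority, n - majority);
            }
        }
    }

Running it prints 1 tolerated failure for both 3 and 4 voters, and 2 for both 5 and 6.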
Now, let's add an assumption or two:
We're also going to operate some kind of server application that uses the database
The network can be partitioned
Let's say you have two datacenters, one with two database instances and the backend server machines. If the connection to the backup center (which has one MongoDB instance) fails, you're still online.
Now if you added a second MongoDB instance at the backup data center, a network partition would, despite seemingly higher redundancy, yield lower availability since we'd lose the majority in case of a network partition and can't continue to operate.
=> Less availability at higher cost. But that doesn't answer the question yet.
Let's say you're really worried about availability: you have two data centers, with backend servers in both data centers, anycast IPs, the whole deal. Now the network between the two DCs is partitioned, but some clients connect to DC A while others reach DC B. How do you now determine which data center may accept writes? It's not possible - this is why the odd number is necessary.
You don't actually need anycast IPs, BGP or any fancy stuff for the problem to become real; any writing application (a worker, a stale request, anything) would require merging the diverging writes later, which is a completely different concurrency scheme.

Apache Zookeeper: distribution of nodes across data centers

I am working on a brand new SolrCloud - ZooKeeper infrastructure.
Some background information:
all other services (mostly web site infrastructure) are distributed across two data centers, with active-active configurations.
at the network level, the servers are set up on extended LANs, with dark fibre across the data centers, so latency is at a minimum.
the SolrCloud - ZooKeeper infrastructure will be used by most of these applications.
I have SolrCloud and a ZooKeeper ensemble running; implementation at this level is fine.
But I wonder how to distribute my ZooKeeper servers. I must have an odd number of servers, but I only have two data centers. If one fails, I have a 50-50 chance that I will lose majority.
What should I do? So far I have thought of:
requesting a third data center (not likely to happen, $$$!)
host two per data center and two on an external cloud provider (Amazon or ...?). Again $$$
set up an odd number at data center 1 and use an observer on site 2. What then happens if site 1 fails? Can SolrCloud work with only one observer?
If your requirement is to serve all search requests from the local data center (where the request originated), then you don't need a cross-data-center ZooKeeper deployment.
A cross-data-center ZooKeeper deployment is only needed to survive a DC crash (which is most likely not going to happen, and that's what you'd pay $$$$ for), so in that case there isn't any need to span a ZooKeeper cluster across multiple data centers.
I got a third site to host the other ZooKeeper instance. This site is another office of my company, not a "full data center". So each site has one ZooKeeper instance.
What allowed me to have one cluster spread over three data centers was that they are close enough together to get a dark fiber between them. The latency is very low and does not impact ZooKeeper performance.
Then for Solr, I have full replicas in the two main data centers. The third office only hosts a ZooKeeper for quorum. Using full replicas, I have all the data in each data center. If my Solr needs grow later, I will shard, but for now our index is small.
It has proven solid for four years now, with one failure. And it was at the third office, not in a data center.
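For reference, the resulting ensemble is just a plain 3-server zoo.cfg, one voting server per site (hostnames are hypothetical):

    # zoo.cfg sketch: one ZooKeeper server per site
    tickTime=2000
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    server.1=zk.dc1.example.com:2888:3888
    server.2=zk.dc2.example.com:2888:3888
    server.3=zk.office.example.com:2888:3888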

Cassandra - Does ReadRepair prevent Scaling Reads?

Cassandra has the option to enable "ReadRepair": a read is sent to all replicas, and if one is stale it will be fixed/updated. But because all replicas receive the read, there will be a point where the nodes reach IO saturation. Since ALL replica nodes always receive the read, adding further nodes will not help, as they also receive all reads (and will be saturated at once)?
Or does Cassandra offer some "tunability" to configure ReadRepair so that it does not go to all of the nodes (or offer any other "replication" setting that allows true read scaling)?
Thanks!!
Jens
Update:
A concrete example, as I still do not understand how it will work in practice:
9 Cassandra boxes/servers
3 replicas (N=3) => every row is written to 2 additional nodes, so 3 boxes hold the data in total
ReadRepair enabled
The row in question (let's say Customer1) is highly trafficked
1.) The first time I write the row "Customer1" to Cassandra, it will eventually be available on all 3 nodes.
2.) Now I query the system with 1000s of requests per second for Customer1 (and, to make it clearer, with any caching disabled).
3.) The read will always be dispatched to all 3 nodes. (The first request (to the nearest node) will be a full request for the data and the two additional requests will only be "checksum requests".)
4.) As we are querying with 1000s of requests, we reach the IO limit of all replicas! (The IO is the same on all 3 nodes; only the bandwidth is much smaller on the checksum nodes.)
5.) I add 3 further boxes (so we have 12 boxes in total):
A) These boxes do NOT have the data yet (which they need in order to help scale linearly). I first have to get the Customer1 record onto at least one of these new boxes.
=> This means I have to change the replication factor to 4 (OR is there any other option to get the data onto another box?)
And now we have the same problem: the replication factor is now 4, and all 4 boxes will receive the read (repair) request for this highly trafficked Customer1 row. It does not scale this way. Scaling would only work if we had a copy that does NOT receive the ReadRepair request.
What is wrong in my understanding? My conclusion: with standard ReadRepair the system will NOT scale linearly (for a single highly trafficked row), as adding further boxes means those boxes also receive the ReadRepair requests (for this trafficked row)...
Thanks very much!!! Jens
Adding further nodes will help (in most situations). There will only be N read repair "requests" during a read, where N is the replication factor (the number of replicas, nb. not the number of nodes in the entire cluster). So a new node will only be included in a read / read repair if the data you request falls in that node's key range (i.e. it holds a replica of the data).
There is also the read_repair_chance tunable per ColumnFamily, but that is a more advanced topic and doesn't change the fundamental equation that you should scale reads by adding more nodes, rather than de-tuning read repair.
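For completeness, a sketch of how those tunables look in CQL (keyspace and table names are hypothetical; read_repair_chance and dclocal_read_repair_chance apply to pre-4.0 Cassandra):

    -- Probability of repairing across all replicas in all DCs vs. only the local DC
    ALTER TABLE my_keyspace.my_table
      WITH read_repair_chance = 0.0
      AND dclocal_read_repair_chance = 0.1;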
You can read more about replication and consistency in Ben's slides.