Ceph Disaster Recovery Solution - ceph

I m sitting up a new cluster , i would like to set up another cluster and enable sync between the two, the second cluster work as disater recovery.
i'm new to Ceph and i'm not sure how to do it?

Disaster recovery in Ceph depends on how you are using Ceph and what exactly your requirements for disaster recovery are. Since Jewel there are two options.
For RBD you can use rbd-mirroring (http://docs.ceph.com/docs/mimic/rbd/rbd-mirroring/). Mirroring can be enabled based on pools and will invoke the rbd-mirror daemon. With rbd-mirroring enabled writes will go into the rbd image journal first, after that the write will be acknowledged to the client and then written to image itself. The rbd-mirror daemon will replay the image journal to your remote location.
Additiontally for radaosgw multisite is available: http://docs.ceph.com/docs/mimic/radosgw/multisite/.
For CephFS there is no such kind of solution available yet.

Related

How can I fix ceph commands hanging after a reboot?

I'm pretty new to Ceph, so I've included all my steps I used to set up my cluster since I'm not sure what is or is not useful information to fix my problem.
I have 4 CentOS 8 VMs in VirtualBox set up to teach myself how to bring up Ceph. 1 is a client and 3 are Ceph monitors. Each ceph node has 6 8Gb drives. Once I learned how the networking worked, it was pretty easy.
I set each VM to have a NAT (for downloading packages) and an internal network that I called "ceph-public". This network would be accessed by each VM on the 10.19.10.0/24 subnet. I then copied the ssh keys from each VM to every other VM.
I followed this documentation to install cephadm, bootstrap my first monitor, and added the other two nodes as hosts. Then I added all available devices as OSDs, created my pools, then created my images, then copied my /etc/ceph folder from the bootstrapped node to my client node. On the client, I ran rbd map mypool/myimage to mount the image as a block device, then used mkfs to create a filesystem on it, and I was able to write data and see the IO from the bootstrapped node. All was well.
Then, as a test, I shutdown and restarted the bootstrapped node. When it came back up, I ran ceph status but it just hung with no output. Every single ceph and rbd command now hangs and I have no idea how to recover or properly reset or fix my cluster.
Has anyone ever had the ceph command hang on their cluster, and what did you do to solve it?
Let me share a similar experience. I also tried some time ago to perform some tests on Ceph (mimic i think) an my VMs on my VirtualBox acted very strange, nothing comparing with actual bare metal servers so please bare this in mind... the tests are not quite relevant.
As regarding your problem, try to see the following:
have at least 3 monitors (or an even number). It's possible that hang is because of monitor election.
make sure the networking part is OK (separated VLANs for ceph servers and clients)
DNS is resolving OK. (you have added the servername in hosts)
...just my 2 cents...

CockroachDB snapshot backups in Kubernetes

I am trying to take snapshot backups with Velero in Kubernetes of a 12 node test CockroachDB cluster with Velero such that, if the cluster failed, we could rebuild the cluster and restore the cockroachdb from these snapshots.
We're using Velero to do that and the snapshot and restore seems to work, but on recovery, we seem to have issues with CockroachDB losing ranges.
Has anyone gotten snapshot backups to work with CockroachDB with a high scale database? (Given the size of the dataset, doing dumps or restores from dumps is not viable.)
Performing backups of the underlying disks while CockroachDB nodes are running is unlikely to work as expected.
The main reason is that even if a persistent disk snapshot is atomic, there is no way to ensure that all disks are captured at the exact same time (time being defined by CockroachDB's consistency mechanism). The restore would contain data with replicas across nodes at different commit indices, resulting in data loss or loss of quorum (show in the Admin UI as "unavailable" ranges).
You have a few options (in order or convenience):
CockroachDB BACKUP which has all nodes write data to external storage (S3, GCS, etc...). Before version 20.2, this is only available with an enterprise license.
SQL dump which is impractical for large datasets
stop all nodes, snapshot all disks, startup all nodes again. warning: this is something we have used to quickly load testing datasets but have not used it in production environments.

DRP for postgres-xl

After installing and setting up a 2 node cluster of postgres-xl 9.2, where coordinator and GTM are running on node1 and the Datanode is set up on node2.
Now before I use it in production I have to deliver a DRP solution.
Does anyone have a DR plan for postgres-xl 9.2 architechture?
Best Regards,
Aviel B.
So from what you described you only have one of each node... What are you expecting to recover too??
Postgres-XL is a clustered solution. If you only have one of each node then you have no cluster and not only are you not getting any scaling advantage it is actually going to run slower than stand alone Postgres. Plus you have nothing to recover to. If you lose either node you have completely lost the database.
Also the docs recommend you put the coordinator and data nodes on the same server if you are going to combine nodes.
So for the simplest solution in Replication mode you would need something like
Server1 GTM
Server2 GTM Proxy
Server3 Coordinator 1 & DataNode 1
Server4 Coordinator 2 & DataNode 2
Postgres-XL has no fail over support so any failure will require manual intervention.
If you use the replication DISTRIBUTED BY option you would just remove the failing node from the cluster and restart everything.
If you used another DISTRIBUTED BY options then data is shared over multiple nodes which means if you lose any node you lose everything. So for this option you will need to have a slave instance of every data node and coordinator node you have. If one of the nodes fails then you would remove that node from the cluster and replace it with its slave backup node. Then restart it all.

MongoDB single server production setup

I am developing a server to a customer who has only one machine for his production deployment.
It's a CentOS 64bit with 8Gb of memory.
I am using Mongo and the question is, do I still need to deploy a replica set even though it's a single machine?
Will I get the advantages of a replica set or since it's a single machine it really does not matter and journaling is enough?
You definitely have to enable journaling (It will ensure consistent state even in cases of HW failure scenarios, you will not have to run costy repair command after a crash). You should enable RAID under the data directrory (Anyway this is in general recommended), while here will be crucial not to lose data due to a disk failure (You do not have copy on an other box or so). There is no option for HA within one box it is quite straightforward, however it is not harmful, and in some cases useful to configure 1 node (1 mongod) replicaset (Than you will have oplog). This will help for example when you likely to have MMS backup, or just for enable point in time backup feature of mongodump. Later if you will likely to scale out for HA this way you will only have to add the new nodes to your initially established replicaset.
Make no sense to run several replicas inside one box, while they will race on HW resources and will bring nothing as an advantage.

mongodb automatic failover / high availability on aws

I need the proper way of failover mechanism for mongodb on aws ec2. I know failover can be accomplished by replica sets, but what is the best way to fire a new mongo installed ubuntu-ec2 ami node and add it to replica set again automatically (with zero manual operation) and return the replica set to it's proper state ?
EBS has some problems, but if I use local instance storage, I will lost the dead nodes data, but does the replica got all the master data and so is replaca is enough to recover everthing (on mongo 1.8 with journaling), or do I have to use only EBS ?
How should I start mongo instances, If I should start with repair option, how can I sperate node's first run from failover restart ?
Regards,
The easiest way to bring up new nodes is to bring up a new node with a recent backup.
So now it's a question of how you do your backup and how you restore from the backup quickly.
The MongoDB site has a write up for backups (in general) and backups on EC2 specifically. There's also a write-up for adding a new set member.
You can do this with instance storage or EBS drives, but you'll need different strategies for each. There's really no single way to do this, so I would check out the docs I've linked to for a primer.
Highly recommend reading Sean Coates' article on mutli-node MongoDB Elections, failover and AWS - specifically, the subtlety on distributed arbiter nodes (e.g., make sure to give yourself a voting majority when an AZ goes down). A similar recommendation can be found in a comment on this (now-closed) MongoDB vs. Cassandra thread.