Having received no replies on the Couchbase forum after nearly 2 months, I'm bringing this question to a broader audience.
I'm configuring CB Server 2.2.0 XDCR between two different Openstack (Essex, eek) installations. I've done some reading on using a DNS FQDN trick in the couchbase-server file to add a -name ns_1#(hostname) value in the start() function. I've tried that with absolutely zero success. There's already a flag in the start() function that says -name 'babysitter_of_ns_1#127.0.0.1' so I don't know if I need to replace that line, comment it out, or keep it. I've tried all 3 of those; none of them seemed to have any positive effect.
The FQDNs are pointing to the Openstack floating_ip addresses (in amazon-speak, the "public" ones). Should they be pointed to the fixed_ip addresses (amazon: private/local) for the nodes? Between Openstack installations, I'm not convinced pointing to an unreachable (potentially duplicate) class-C private IP is of any use.
When I create a remote cluster reference using the floating_ip address to a node in the other cluster, of course it'll create the cluster reference just fine. But when I create a Replication using that reference, I always get one of two distinct errors: Save request failed because of timeout or Failed to grab remote bucket 'bucket' from any of known nodes.
What I think is happening is that the Openstack floating_ip isn't being recognized or translated to its fixed_ip address prior to surfing the cluster nodes for the bucket. I know the -name ns_1#(hostname) modification is supposed to fix this, but I wonder if anyone has had success configuring XDCR between Openstack installations that may be able to provide some tips or hacks.
I know this "works" in AWS. It's my belief that AWS uses some custom DNS enabling queries to return an instance's fixed_ip ("private" IP) when going between availability zones, possibly between regions. There may be other special sauce in AWS that makes this work.
This blog post on aws Couchbase XDCR replication should help! There are quite a few steps so I won't paste them all here.
http://blog.couchbase.com/cross-data-center-replication-step-step-guide-amazon-aws
Related
I am trying to set up a ZFS cluster on two nodes running Enterprise Storage OS (ESOS). This is based on Redhat, and running the newest ESOS production release (4.0.12).
I have been reading up on this for a bit, and think I finally understand that I have to use Corosync, DRBD and Pacemaker for this to be done correctly.
Though, I haven't done anything like this before, and still have some questions about the different modules.
The complete setup is like the following:
2 ESOS nodes running a ZFS active/passive cluster.
3 ESXi hosts connecting to this cluster using iSCSI. These are connected using fiber.
The 2 ESOS nodes got a dedicated 10G fiber link for synchronization.
First of, I am not able to find any answers to whether or not this configuration would ever be possible to archive, considering I am using ZFS.
If I understand what I have read correctly, you configure a shared iSCSI initiator address when this is set up. Then you use that on ESXi, where Corosync, DRBD & Pacemaker does the rest on the SAN side of things. Have I understood this correctly?
Corosync uses rings to communicate date between the two hosts (not so sure about this one, nor what it exactly means).
Do I need to use all three modules (Corosync, DRBD & Pacemaker), and in essence, what do they actually do.
In the different guides I have been reading, I have seen Asymmetric Logical Unit Access (ALUA) been mentioned a couple times. Is this possible to use to instruct iSCSI initiators which SAN node to use, and thereby not have to use a shared initiator?
Does anyone by any chance know of a website where someone has done something like this?
I will try this one tomorrow, and see if it helps me in the right direction: https://marcitland.blogspot.com/2013/04/building-using-highly-available-esos.html
Thanks.
We have an Apache Airflow deployed over a K8s cluster in AWS. Airflow is running on containers but the EC2 instances themselves are reserved instances.
We are experiencing an issue where we see that Ariflow is making many DNS queries related to it's DB. When at rest (i.e. no DAGs are running) it's about 10 per second. When running several DAGs it can go up to 50 per second. This results in Route53 blocking us since we are hitting the packet limit for DNS queries (1024 packets per second).
Our DB is a Postgres RDS, and when switching it to a MySQL the issue remains.
The way we understand it, the DNS query starts at K8s coredns service, which tries several permutations of the FQDN and sends the requests to Route53 if it can't resolve it on it's own.
Any ideas, thoughts, or hints to explain Airflow's behavior or how to reduce the number of queries is most welcome.
Best,
After some digging we found we had several issues happening at the same time.
The first being that Airflow's scheduler was running about 2 times per second. Each time it created DB queries which resulted in several DNS queries. Changing that scheduling alleviated some of the issue.
Another issue we had is described here. It looks like coredns is configured to try some alternatives of the given domain if it has less than x number of . in the FQDN. There are 2 suggested fixes in that article. We followed them through and the number of DNS queries dropped.
we have been having this issue too.
wasn't the easiest to find as we had one box with lots of apps on it making 1000s of DNS queries requesting DNS resolution of our SQL server name.
i really wonder why Airflow doesnt just use the DNS cache like every other application
I have an application deployed in kubernetes, it consists of cassandra, a go client, and a java client (and other things, but they are not relevant for this discussion).
We have used helm to do our deployment.
We are using a stateful set and a headless service for cassandra.
We have configured the clients to use the headless service dns as a contact point for cluster creation.
Everything works great.
Until all of the nodes go down, or some other nefarious combination of nodes going down, I am simulating it by deleting all pods using kubectl delete in succession on all of the cassandra nodes.
When I do this the clients throw NoHostAvailableException
in java its
"java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.200.23.151:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency LOCAL_QUORUM (1 required but only 0 alive)), /10.200.152.130:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency ONE (1 required but only 0 alive)))"
which eventually becomes
"java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)"
in go its
"gocql: no hosts available in the pool"
I can query cassandra using cqlsh, the node seems fine using nodetool status, all of the new ips are there
the image I am using doesnt have netstat so I have not yet confirmed its listening on the expected port.
Via executing bash on the two client pods I can see the dns makes sense using nslookup, but...
netstat does not show any established connections to cassandra (they are present before I take the nodes down)
If I restart my clients everything works fine.
I have googled a lot (I mean a lot), most of what I have found is related to never having a working connection, the most relevant things seem very old (like 2014, 2016).
So a node going down is very basic and I would expect everything to work, the cassandra cluster manages itself, it discovers new nodes as they come online, it balances the load, etc. etc.
If I take my all of my cassandra nodes down slowly, one at a time, everything works fine (I have not confirmed that the load is distributed appropriately and to the correct node, but at least it works)
So, is there a point where this behaviour is expected? ie I have taken everything down, nothing was up and running before the last from the first cluster was taken down.. is this behaviour expected?
To me it seems like it should be an easy issue to resolve, not sure whats missing / incorrect, I am surprised that both clients show the same symptoms, makes me think something is not happening with our statefulset and service
I think the problem might lie in the headless DNS service. If all of the nodes go down completely and there are no nodes at all available via the service until pods are replaced, it could cause the driver to hang.
I've noted that you've used Helm for your deployments but you may be interested in this document for connecting to Cassandra clusters in Kubernetes from the authors of the cass-operator.
I'm going to contact some of the authors and get them to respond here. Cheers!
FoundationDB cluster can be configured to use SSL/TLS but is it possible to connect to a cluster without knowing cluster's fdb.cluster file?
In other words, is fdb.cluster file equivalent to username/password security scheme in other database systems?
The fdb.cluster file is composed of 2 main parts (see Cluster file format):
The cluster id (with optional description)
A list of one or more coordinators IP:PORT pairs.
Any client must be capable of reaching at least one of the coordinators in the list to talk to the cluster, and it must have the correct cluster ID. If not, it will not be able to connect. There are no built-in service discovery.
This means that you have to provide yourself the initial cluster file to your application. Once it connects to a coordinator node, it will be able to obtain the complete list and update the cluster file by itself (if the topology changes).
One solution for deployment, is to have your application (or deployment script) download the latest fdb.cluster from an internal URL (or file share) if the file is missing, to jump start the setup.
Regarding authentication, unless you use TLS/SSL, the "id" part in the cluster file acts as a pseudo clear-text password. Even if you have the correct set of coordinators , the application must also have the proper cluster ID.
Though it should be considered as the equivalent of the database name in a typical SQL server. It can be easily found and is transmitted in clear if you don't use SSL. I guess it is there to prevent silly mistakes, more than anything else (ex: typing the IP:PORT of a different cluster).
You can't connect without the cluster file. This does provide some weak security but it's much better to use the mutual TLS support if you want to run a cluster in an untrusted network.
We are setting up a MongoDB replica set on Amazon EC2 in the us-west-1 region.
This region only has two availability zones though. My understanding is that MongoDB must have a majority to work correctly. If we create 2 servers in zone us-west-1b and one server in us-west-1c this will not provide high availability if the entire us-west-1b goes down right? How is this possible? What is the recommended configuration?
Having faced a similar challenge we looked at a number of possible solutions:
Put an Arbiter in another region:
Secure the connection either by using a point to point VPN between the regions a routing the traffic across this connection.
or
Give each server an E-IP and DNS name and use some combination of AWS security groups, IPTables and SSL to ensure connections are secure.
AWS actually have a whitepaper on this not sure how old it is though http://media.amazonwebservices.com/AWS_NoSQL_MongoDB.pdf
Alternatively you could allow the application to fall back to a read-only state until your servers come back on-line (not the nicest of options though)
Hope this helps