Cassandra Kubernetes Statefulset NoHostAvailableException

I have an application deployed in Kubernetes. It consists of Cassandra, a Go client, and a Java client (and other things, but they are not relevant for this discussion).
We have used helm to do our deployment.
We are using a stateful set and a headless service for cassandra.
We have configured the clients to use the headless service dns as a contact point for cluster creation.
Everything works great.
Until all of the nodes go down, or some other nefarious combination of nodes goes down. I am simulating this by running kubectl delete on all of the Cassandra pods in succession.
When I do this, the clients throw NoHostAvailableException.
In Java it's
"java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.200.23.151:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency LOCAL_QUORUM (1 required but only 0 alive)), /10.200.152.130:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency ONE (1 required but only 0 alive)))"
which eventually becomes
"java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)"
In Go it's
"gocql: no hosts available in the pool"
I can query Cassandra using cqlsh, the node seems fine according to nodetool status, and all of the new IPs are there.
The image I am using doesn't have netstat, so I have not yet confirmed it's listening on the expected port.
By running bash on the two client pods I can see that the DNS makes sense using nslookup, but...
netstat there does not show any established connections to Cassandra (they are present before I take the nodes down).
If I restart my clients everything works fine.
I have googled a lot (I mean a lot), most of what I have found is related to never having a working connection, the most relevant things seem very old (like 2014, 2016).
So a node going down is very basic and I would expect everything to work, the cassandra cluster manages itself, it discovers new nodes as they come online, it balances the load, etc. etc.
If I take all of my Cassandra nodes down slowly, one at a time, everything works fine (I have not confirmed that the load is distributed appropriately and to the correct node, but at least it works).
So, is there a point where this behaviour is expected? That is, I have taken everything down, and no replacement node was up and running before the last node of the original cluster was taken down. Is this behaviour expected?
To me it seems like it should be an easy issue to resolve, but I am not sure what's missing or incorrect. I am surprised that both clients show the same symptoms, which makes me think something is not happening with our StatefulSet and service.

I think the problem might lie in the headless DNS service. If all of the nodes go down completely and there are no nodes at all available via the service until pods are replaced, it could cause the driver to hang.
I've noted that you've used Helm for your deployments but you may be interested in this document for connecting to Cassandra clusters in Kubernetes from the authors of the cass-operator.
I'm going to contact some of the authors and get them to respond here. Cheers!
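One way to check the hypothesis above from a client pod is to compare the IPs the client originally knew about with what the headless service resolves to now. The following is only a minimal Python sketch for illustration (the clients in the question are Go and Java). The service name in the usage note is hypothetical, and the injectable resolver parameter exists only so the logic can be exercised without a cluster. It does not touch either driver's API.

```python
import socket
from typing import Callable, Iterable, Set


def resolve_contact_points(
    hostname: str,
    port: int = 9042,
    resolver: Callable = socket.getaddrinfo,
) -> Set[str]:
    """Return the set of IP addresses currently behind a (headless) service name."""
    infos = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def needs_new_session(known_ips: Iterable[str], current_ips: Iterable[str]) -> bool:
    """True when none of the IPs the client learned earlier are still behind the service."""
    return not (set(known_ips) & set(current_ips))
```

For example, calling resolve_contact_points("cassandra.default.svc.cluster.local") inside a client pod and passing the result to needs_new_session together with the IPs from the exception message (10.200.23.151, 10.200.152.130) would show whether the client is still holding only stale addresses. If it is, rebuilding the cluster/session object from the service name is the programmatic equivalent of the client restart that already works.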


Airflow too many DNS lookups for database

We have an Apache Airflow deployed over a K8s cluster in AWS. Airflow is running on containers but the EC2 instances themselves are reserved instances.
We are experiencing an issue where we see that Airflow is making many DNS queries related to its DB. When at rest (i.e. no DAGs are running) it's about 10 per second. When running several DAGs it can go up to 50 per second. This results in Route53 blocking us since we are hitting the packet limit for DNS queries (1024 packets per second).
Our DB is a Postgres RDS, and when switching it to MySQL the issue remains.
The way we understand it, the DNS query starts at the K8s coredns service, which tries several permutations of the FQDN and sends the requests to Route53 if it can't resolve it on its own.
Any ideas, thoughts, or hints to explain Airflow's behavior or how to reduce the number of queries is most welcome.
After some digging we found we had several issues happening at the same time.
The first being that Airflow's scheduler was running about 2 times per second. Each time it created DB queries which resulted in several DNS queries. Changing that scheduling alleviated some of the issue.
Another issue we had is described here. It looks like coredns is configured to try some alternatives of the given domain if the FQDN contains fewer than x dots. There are 2 suggested fixes in that article. We followed them through and the number of DNS queries dropped.
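To make the "alternatives of the given domain" behaviour concrete, here is a rough Python sketch of how a glibc-style stub resolver expands a name using the search list and the ndots option from /etc/resolv.conf. It assumes the usual Kubernetes pod defaults (ndots:5 and three cluster search domains), and the hostname in the usage note is hypothetical. It is an illustration of the lookup order only, not of coredns internals.

```python
from typing import List


def candidate_queries(name: str, search: List[str], ndots: int = 5) -> List[str]:
    """List the names a glibc-style stub resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search list is applied.
        return [name]
    absolute = name + "."
    searched = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then the search domains.
        return [absolute] + searched
    # Too few dots: every search domain is tried before the name as given.
    return searched + [absolute]
```

With the search list ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"], a lookup of "db.example.internal" produces four candidate names, and the real one comes last. Each candidate is typically queried for both A and AAAA records, so a single connection attempt can cost several upstream queries. Commonly cited mitigations are to write the hostname with a trailing dot or to lower ndots for the pod; whether these are the two fixes the linked article suggests is an assumption.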
We have been having this issue too.
It wasn't the easiest to find, as we had one box with lots of apps on it making 1000s of DNS queries requesting resolution of our SQL server name.
I really wonder why Airflow doesn't just use the DNS cache like every other application.

Two versions of fluentd fighting over port in my cluster

Somehow, I have 2 versions of fluentd running in my cluster:
They end up fighting over the same port, they just keep cranking away, trying to start up on that port, and it saturates all the CPU in the cluster.
unexpected error error_class=Errno::EADDRINUSE error="Address already in use - bind(2) for 0.0.0.0:24231
/opt/google-fluentd/embedded/lib/ruby/2.6.0/socket.rb:201:in 'bind'
I've tried deleting the daemon sets and deployments, they just keep coming back. Also tried ssh'ing into the machines and killing the process on that port. Nothing seems to work.
Obviously, I only want one version of fluentd to run (and I'm not even sure which one).
I seem to have fixed it. I went to the cluster edit page in the GCP dashboard, and the Kubernetes Engine Monitoring dropdown was blank. It seems not even the dropdown could decide what to display here.
It seems the automated agent, or whatever, seriously messed up here, and had 2 versions of the logging and monitoring system running, fighting over a port, and crushing the CPU on every machine in the cluster. On top of that, I couldn't delete the daemon sets, pods, or deployments. It seems Google treats these as special somehow, maybe with some kind of automated agent, I don't know.
From the dropdown, I just selected System and workload logging and monitoring, saved, and it applied the changes.
Everything is looking good so far, but this whole event has me worried: I didn't do anything. This just... happened.
This is a dev cluster, but if it was a production cluster...

Pgbouncer: how to run within a kubernetes cluster properly

The background: I currently run some kubernetes pods with a pgbouncer sidecar container. I’ve been running into annoying behavior with sidecars (that will be addressed in k8s 1.18) that have workarounds, but have brought up an earlier question around running pgbouncer inside k8s.
Many folks recommend the sidecar approach for pgbouncer, but I wonder why running one pgbouncer per, say, machine in the k8s cluster wouldn’t be better? I admit I don’t have a deep enough understanding of either pgbouncer or k8s networking to understand the implications of either approach.
EDIT:
Adding context, as it seems like my question wasn't clear enough.
I'm trying to decide between two approaches of running pgbouncer in a kubernetes cluster. The PostgreSQL server is not running in this cluster. The two approaches are:
(1) Running pgbouncer as a sidecar container in all of my pods. I have a number of pods: some replicas on a webserver deployment, an async job deployment, and a couple cron jobs.
(2) Running pgbouncer as a separate deployment. I'd plan on running 1 pgbouncer instance per node on the k8s cluster.
I worry that (1) will not scale well. If my PostgreSQL master has a max of 100 connections, and each pool has a max of 20 connections, I potentially risk saturating connections pretty early. Additionally, I risk saturating connections on master during pushes as new pgbouncer sidecars exist alongside the old image being removed.
I, however, almost never see (2) recommended. It seems like everyone recommends (1), but the drawbacks seem quite obvious to me. Would the networking penalty I'd incur by connecting to pgbouncer outside of my pod be large enough to notice? Is pgbouncer perhaps smart enough to deal with many other pgbouncer instances that could potentially saturate connections?
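The connection arithmetic behind the concern about approach (1) can be sketched in a few lines of Python. The figures 100 and 20 come from the question. The function names and the pod counts in the usage note are made up for illustration, and the sketch assumes each pgbouncer instance keeps a single pool (one database/user pair) that can fill up completely.

```python
def max_server_connections(pgbouncer_instances: int, pool_size: int) -> int:
    """Worst-case number of PostgreSQL connections if every pool fills up."""
    return pgbouncer_instances * pool_size


def instances_before_saturation(max_connections: int, pool_size: int) -> int:
    """How many full pgbouncer pools fit under the server's connection limit."""
    return max_connections // pool_size
```

With a limit of 100 and pools of 20, instances_before_saturation(100, 20) is 5, so a sixth fully used sidecar could already be refused connections. During a rollout where, say, 4 old and 4 new pods briefly coexist, max_server_connections(8, 20) is 160, well above the limit. A shared pgbouncer per node, as in approach (2), ties the worst case to the node count instead of the pod count.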
We run pgbouncer in production on Kubernetes. I expect the best way to do it is use-case dependent. We do not take the sidecar approach, but instead run pgbouncer as a separate "deployment", and it's accessed by the application via a "service". This is because for our use case, we have 1 postgres instance (i.e. one physical DB machine) and many copies of the same application accessing that same instance (but using different databases within that instance). Pgbouncer is used to manage the active connections resource. We are pooling connections independently for each application because the nature of our application is to have many concurrent connections and not too many transactions. We are currently running with 1 pod (no replicas) because that is acceptable for our use case if pgbouncer restarts quickly. Many applications all run their own pgbouncers and each application has multiple components that need to access the DB (so each pgbouncer is pooling connections of one instance of the application). It is done like this https://github.com/astronomer/airflow-chart/tree/master/templates/pgbouncer
The above does not include getting the credentials set up right for accessing the database. The above, linked template is expecting a secret to already exist. I expect you will need to adapt the template to your use case, but it should help you get the idea.
We have had some production concerns. Primarily, we still need to investigate how to replace or move pgbouncer without interrupting existing connections. We have found that the application's connection to pgbouncer is stateful (of course, because it's pooling the transactions), so if the pgbouncer container (pod) is swapped out behind the service for a new one, existing connections are dropped from the application's perspective. This should be fine even when running pgbouncer replicas, provided your application retries the occasional dropped connection and you make use of Kubernetes sticky sessions (session affinity) on the "service". More investigation is still required by our organization to make it work perfectly.
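As an illustration of what "retrying the occasional dropped connection" might look like on the application side, here is a minimal Python sketch. It is not taken from the linked chart. It catches the built-in ConnectionError as a stand-in; a real database driver raises its own exception types, so the except clause would need adapting. The sleep parameter is injectable only to make the helper easy to test.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_retry(
    operation: Callable[[], T],
    attempts: int = 3,
    delay_seconds: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying with a growing delay when the connection drops."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[ConnectionError] = None
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError as exc:
            last_error = exc
            if attempt < attempts:
                # Back off a little longer after each failed attempt.
                sleep(delay_seconds * attempt)
    assert last_error is not None
    raise last_error
```

The operation passed in should open a fresh connection through the pgbouncer service and be safe to repeat, since a connection dropped in the middle of a transaction cannot simply be resumed.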

ejabberd cluster: Multi-master or Master-slave

So far what I've come across is this -
If an ejabberd cluster is set up in a master-slave configuration, there is a single point of failure, and people have experienced issues where, even after fixing the master (if it goes down), the cluster doesn't become operable again. Also, sometimes the ejabberd instance of every slave has to be revisited to get it working properly, or mnesia commands have to be entered again to make the master communicate with the slaves.
If an ejabberd cluster is set up in a multi-master configuration, any of the nodes can be taken out of the cluster without bringing the whole cluster down. Basically, there is no single point of failure, and this is also the way the official ejabberd documentation tells you to do it, via the join_cluster argument exposed in the ejabberdctl script. HOWEVER, in this case, all the data is replicated across both nodes, which is a big performance overhead in my opinion.
So it boils down to this.
What is the best/recommended/popular mode in which an ejabberd cluster of 2 nodes should be set up, mostly with respect to performance but keeping other critical factors (fault tolerance, load balancing) in mind as well?
There is only a single mode in ejabberd. Basically, it works like what you describe as multi-master. master-slave would basically be the same setup without any traffic sent to the second node by load balancing mechanism.
So case 2 is the way to go.

Questions Concerning Using Celery with Multiple Load-Balanced Django Application Servers

I'm interested in using Celery for an app I'm working on. It all seems pretty straightforward, but I'm a little confused about what I need to do if I have multiple load balanced application servers. All of the documentation assumes that the broker will be on the same server as the application. Currently, all of my application servers sit behind an Amazon ELB and tasks need to be able to come from any one of them.
This is what I assume I need to do:
Run a broker server on a separate instance
Configure each application instance to connect to that broker server
Each application instance will also be a celery worker (running celeryd)?
My only beef with that is: What happens if my broker instance dies? Can I run 2 broker instances somehow so I'm safe if one goes under?
Any tips or information on what to do in a setup like mine would be greatly appreciated. I'm sure I'm missing something or not understanding something.
For future reference, for those who do prefer to stick with RabbitMQ...
You can create a RabbitMQ cluster from 2 or more instances. Add those instances to your ELB and point your celeryd workers at the ELB. Just make sure you connect the right ports and you should be all set. Don't forget to allow your RabbitMQ machines to talk among themselves to run the cluster. This works very well for me in production.
One exception here: if you need to schedule tasks, you need a celerybeat process. For some reason, I wasn't able to connect the celerybeat to the ELB and had to connect it to one of the instances directly. I opened an issue about it and it is supposed to be resolved (didn't test it yet). Keep in mind that celerybeat by itself can only exist once, so that's already a single point of failure.
You are correct in all points.
To make the broker reliable, set up a clustered RabbitMQ installation, as described here:
http://www.rabbitmq.com/clustering.html
Celery beat also doesn't have to be a single point of failure if you run it on every worker node with:
https://github.com/ybrs/single-beat