Airflow too many DNS lookups for database - kubernetes

We have an Apache Airflow deployed over a K8s cluster in AWS. Airflow is running on containers but the EC2 instances themselves are reserved instances.
We are experiencing an issue where we see that Ariflow is making many DNS queries related to it's DB. When at rest (i.e. no DAGs are running) it's about 10 per second. When running several DAGs it can go up to 50 per second. This results in Route53 blocking us since we are hitting the packet limit for DNS queries (1024 packets per second).
Our DB is a Postgres RDS, and when switching it to a MySQL the issue remains.
The way we understand it, the DNS query starts at K8s coredns service, which tries several permutations of the FQDN and sends the requests to Route53 if it can't resolve it on it's own.
Any ideas, thoughts, or hints to explain Airflow's behavior or how to reduce the number of queries is most welcome.
Best,

After some digging we found we had several issues happening at the same time.
The first being that Airflow's scheduler was running about 2 times per second. Each time it created DB queries which resulted in several DNS queries. Changing that scheduling alleviated some of the issue.
Another issue we had is described here. It looks like coredns is configured to try some alternatives of the given domain if it has less than x number of . in the FQDN. There are 2 suggested fixes in that article. We followed them through and the number of DNS queries dropped.

we have been having this issue too.
wasn't the easiest to find as we had one box with lots of apps on it making 1000s of DNS queries requesting DNS resolution of our SQL server name.
i really wonder why Airflow doesnt just use the DNS cache like every other application

Related

Cassandra Kubernetes Statefulset NoHostAvailableException

I have an application deployed in kubernetes, it consists of cassandra, a go client, and a java client (and other things, but they are not relevant for this discussion).
We have used helm to do our deployment.
We are using a stateful set and a headless service for cassandra.
We have configured the clients to use the headless service dns as a contact point for cluster creation.
Everything works great.
Until all of the nodes go down, or some other nefarious combination of nodes going down, I am simulating it by deleting all pods using kubectl delete in succession on all of the cassandra nodes.
When I do this the clients throw NoHostAvailableException
in java its
"java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: /10.200.23.151:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency LOCAL_QUORUM (1 required but only 0 alive)), /10.200.152.130:9042 (com.datastax.driver.core.exceptions.UnavailableException: Not enough replicas available for query at consistency ONE (1 required but only 0 alive)))"
which eventually becomes
"java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)"
in go its
"gocql: no hosts available in the pool"
I can query cassandra using cqlsh, the node seems fine using nodetool status, all of the new ips are there
the image I am using doesnt have netstat so I have not yet confirmed its listening on the expected port.
Via executing bash on the two client pods I can see the dns makes sense using nslookup, but...
netstat does not show any established connections to cassandra (they are present before I take the nodes down)
If I restart my clients everything works fine.
I have googled a lot (I mean a lot), most of what I have found is related to never having a working connection, the most relevant things seem very old (like 2014, 2016).
So a node going down is very basic and I would expect everything to work, the cassandra cluster manages itself, it discovers new nodes as they come online, it balances the load, etc. etc.
If I take my all of my cassandra nodes down slowly, one at a time, everything works fine (I have not confirmed that the load is distributed appropriately and to the correct node, but at least it works)
So, is there a point where this behaviour is expected? ie I have taken everything down, nothing was up and running before the last from the first cluster was taken down.. is this behaviour expected?
To me it seems like it should be an easy issue to resolve, not sure whats missing / incorrect, I am surprised that both clients show the same symptoms, makes me think something is not happening with our statefulset and service
I think the problem might lie in the headless DNS service. If all of the nodes go down completely and there are no nodes at all available via the service until pods are replaced, it could cause the driver to hang.
I've noted that you've used Helm for your deployments but you may be interested in this document for connecting to Cassandra clusters in Kubernetes from the authors of the cass-operator.
I'm going to contact some of the authors and get them to respond here. Cheers!

Pgbouncer: how to run within a kubernetes cluster properly

The background: I currently run some kubernetes pods with a pgbouncer sidecar container. I’ve been running into annoying behavior with sidecars (that will be addressed in k8s 1.18) that have workarounds, but have brought up an earlier question around running pgbouncer inside k8s.
Many folks recommend the sidecar approach for pgbouncer, but I wonder why running one pgbouncer per say: machine in the k8s cluster wouldn’t be better? I admit I don’t have enough of a deep understanding of either pgbouncer or k8s networking to understand the implications of either approach.
EDIT:
Adding context, as it seems like my question wasn't clear enough.
I'm trying to decide between two approaches of running pgbouncer in a kubernetes cluster. The PostgreSQL server is not running in this cluster. The two approaches are:
Running pgbouncer as a sidecar container in all of my pods. I have a number of pods: some replicas on a webserver deployment, an async job deployment, and a couple cron jobs.
Running pgbouncer as a separate deployment. I'd plan on running 1 pgbouncer instance per node on the k8s cluster.
I worry that (1) will not scale well. If my PostgreSQL master has a max of 100 connections, and each pool has a max of 20 connections, I potentially risk saturating connections pretty early. Additionally, I risk saturating connections on master during pushes as new pgbouncer sidecars exist alongside the old image being removed.
I, however, almost never see (2) recommended. It seems like everyone recommends (1), but the drawbacks seem quite obvious to me. Is the networking penalty I'd incur by connecting to pgbouncer outside of my pod be large enough to notice? Is pgbouncer perhaps smart enough to deal with many other pgbouncer instances that could potentially saturate connections?
We run pgbouncer in production on Kubernetes. I expect the best way to do it is use-case dependent. We do not take the sidecar approach, but instead run pgbouncer as a separate "deployment", and it's accessed by the application via a "service". This is because for our use case, we have 1 postgres instance (i.e. one physical DB machine) and many copies of the same application accessing that same instance (but using different databases within that instance). Pgbouncer is used to manage the active connections resource. We are pooling connections independently for each application because the nature of our application is to have many concurrent connections and not too many transactions. We are currently running with 1 pod (no replicas) because that is acceptable for our use case if pgbouncer restarts quickly. Many applications all run their own pgbouncers and each application has multiple components that need to access the DB (so each pgbouncer is pooling connections of one instance of the application). It is done like this https://github.com/astronomer/airflow-chart/tree/master/templates/pgbouncer
The above does not include getting the credentials set up right for accessing the database. The above, linked template is expecting a secret to already exist. I expect you will need to adapt the template to your use case, but it should help you get the idea.
We have had some production concerns. Primarily we still need to do more investigation on how to replace or move pgbouncer without interrupting existing connections. We have found that the application's connection to pgbouncer is stateful (of course because it's pooling the transactions), so if pgbouncer container (pod) is swapped out behind the service for a new one, then existing connections are dropped from the application's perspective. This should be fine even running pgbouncer replicas if you have an application where you can ensure that rarely dropped connections retry and make use of Kubernetes sticky sessions on the "service". More investigation is still required by our organization to make it work perfectly.

Autoscaling limited by RDS connection

I have some nightly jobs that are running on EC2 and the number of machines is scaled by the number of messages in SQS. My process requires reads from a Postgres RDS database. Now these are the issues I am facing.
Not able to scale beyond a certain number because of the unavailability of connections.
I tried creating a connection pool using pgbouncer, and tried with different settings as well, but it's missing a lot of data on the resultant set.
Make your postgresql RDS install multi AZ. Then you can make read replicas on demand and scale read performance with your load.
To answer the comments:
Some extra "plumbing" is required to make the connections to the read replica. Maybe route53 dynamically updated records as the scaling happens or something like haproxy
The reason I mention multi AZ is that this would help prevent downtime during an auto scaling event bringing up the read replica
It would be simpler (but more costly) to permanently bring up a read replica and use DNS round robin to share the load
See https://aws.amazon.com/blogs/aws/amazon-rds-announcing-read-replicas/ for information on read replicas

Couchbase XDCR on Openstack

Having received no replies on the Couchbase forum after nearly 2 months, I'm bringing this question to a broader audience.
I'm configuring CB Server 2.2.0 XDCR between two different Openstack (Essex, eek) installations. I've done some reading on using a DNS FQDN trick in the couchbase-server file to add a -name ns_1#(hostname) value in the start() function. I've tried that with absolutely zero success. There's already a flag in the start() function that says -name 'babysitter_of_ns_1#127.0.0.1' so I don't know if I need to replace that line, comment it out, or keep it. I've tried all 3 of those; none of them seemed to have any positive effect.
The FQDNs are pointing to the Openstack floating_ip addresses (in amazon-speak, the "public" ones). Should they be pointed to the fixed_ip addresses (amazon: private/local) for the nodes? Between Openstack installations, I'm not convinced pointing to an unreachable (potentially duplicate) class-C private IP is of any use.
When I create a remote cluster reference using the floating_ip address to a node in the other cluster, of course it'll create the cluster reference just fine. But when I create a Replication using that reference, I always get one of two distinct errors: Save request failed because of timeout or Failed to grab remote bucket 'bucket' from any of known nodes.
What I think is happening is that the Openstack floating_ip isn't being recognized or translated to its fixed_ip address prior to surfing the cluster nodes for the bucket. I know the -name ns_1#(hostname) modification is supposed to fix this, but I wonder if anyone has had success configuring XDCR between Openstack installations that may be able to provide some tips or hacks.
I know this "works" in AWS. It's my belief that AWS uses some custom DNS enabling queries to return an instance's fixed_ip ("private" IP) when going between availability zones, possibly between regions. There may be other special sauce in AWS that makes this work.
This blog post on aws Couchbase XDCR replication should help! There are quite a few steps so I won't paste them all here.
http://blog.couchbase.com/cross-data-center-replication-step-step-guide-amazon-aws

JMeter throughput drops when hitting Amazon ELB

I am hosting a web application on Amazon's AWS Servers. I am currently in the process of load testing the application with JMeter. My main problem seems to be that when I go through an Elastic Load Balancer (ELB) to hit the Amazon server's rather than hitting the servers directly - I seem to hit a cap in my throughput.
If I hit my web application directly - for each server I am able to achieve a throughput of 50 RPS per server.
If I hit my web application via Amazon's ELB - I am only able to achieve a max throughput of 50 RPS (total)
I was wondering if anyone else has experienced similar behavior when load testing using Jmeter via Amazon's ELB.
For more context my web application is a REST application which allows users to download content (~150 kb) via HTTP requests.
I am running Jmeter with the following flag "-Dsun.net.inetaddr.ttl=0" and running it with 10 threads. I have tried running these tests with multiple clients on different machines.
Thanks for any help in advance.
Load balancers may be tricky to test as they may have different mechanisms of orchestrating traffic depending on origin. The most commonly used approach to distinguish origin of the request and redirect it to the same host, which served previous request is a cookie. You can look into HTTP Cookie Manager to correctly manipulate your cookies and make sure than you have different ones for each testing thread or thread group (depending on your use case). Another flaky area is origin host IP. You may require to bind each testing thread to different IP address in order to hit different servers behind the load balancer. There can be also some issues with DNS in regards to Amazon LBs. useful guide on how to test Amazon ELBs
Most probable cause would be DNS caching by jmeter. ELB returns IPs of additional servers depending on how autoscaling is set but JMeter does not use these additional servers. This problem can be solved by ensuring that Jmeter does not cache DNS results...
The ELB is a name, not IP, and can suffer from DNS caching. Make sure you use "-Dsun.net.inetaddr.ttl=0" when starting JMeter
http://wiki.apache.org/jmeter/JMeterAndAmazon
A really late response, and slightly different than the original question, but I hope this can help others as it took me a while to get it all straight. My original problem was not reduced throughput as a result of the ELB, but the introduction of HTTP 503 errors. Actually, the ELB increased my throughput as compared to querying the web application directly, though even with 1 hour tests, the results were sporadic to say the least.
First, the ELB has 2-staged load balancing going on. The first load balance is across the ELB's themselves. That's done by associating multiple IP addresses to the hostname provided by AWS for the ELB you provision. The second is then, of course, across your application instances behind the ELB.
Without trying to offend the SO gods, this is a really helpful article.
https://blazemeter.com/blog/dns-cache-manager-right-way-test-load-balanced-apps
The most helpful information in there was to use the DNS Cache Manager module in JMeter. This will query multiple DNS servers, and wipe out your DNS cache.
I implemented that module and then setup Wireshark, filtering on the two IP addresses belonging to the ELB hostname and sure enough, it was querying both IP addresses, though clearly favored one over the other.
That didn't make a big difference, at least not over short tests.
The real difference (2-3 times more throughput) came when I tweaked the ELB health settings. I initially had a high error rate, however after reducing the unhealthy threshold and the interval between health checks, my error rates dropped dramatically.
Additionally, whereas all my other tests had been 60 - 90 minutes in duration, this one was 8 hours. I started out with decent throughput and it then quickly dropped (by about 2/3). After about 20 minutes or more, the throughput then started ticking back up and by the end of the test, it had sustained throughput of about 5 times what I was getting without the ELB (which was similar to what the throughput was when it dropped shortly after beginning this test).