How can I fix frequent, but intermittent TLS handshake timeouts in kubectl? - kubernetes

I'm encountering TLS handshake timeout when trying to perform a number of operations against a local Kubernetes cluster on macOS 10.14.6. The errors show up when doing any kubectl action, any helm action (including helm init and helm version), as well as during deployments.
I've tried rebooting Docker for Mac, as well as rebooting the physical host machine, and wiping and recreating the cluster (which is difficult given that deployments will spontaneously fail because of the TLS handshake issue). I've also made sure that my major/minor/patch versions for kubectl (1.14.3 client, 1.14.6 server) and helm (2.9.1) all match those being used by known-good local deployments in the office.
I've also reviewed the firewall rules on my machine, but haven't found anything that would obviously cause this kind of issue.
Additionally, I've browsed many of the threads discussing this on the issue trackers for k8s itself as well as Helm, plus the questions already on SO, but these overwhelmingly concern Azure's AKS, while I'm working on a local setup.
Finally, I've made sure that enough resources are allocated to actually run the target applications -- in this case 16GB of RAM (which I've tried unsuccessfully upgrading to 24GB) as well as 8 CPU cores.
This problem seems to show up at random, and while it most often manifests as a TLS handshake timeout, it will also occasionally interrupt an established connection, with skaffold run commands sometimes crashing out with "transport closed." It also doesn't seem to be caused by any missing certs, since the commands eventually succeed -- but the success rate is very low, on the order of 1 in 10 calls.
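To put a number on the "1 in 10" estimate without kubectl or helm in the way, the handshake can be probed directly. The following is a minimal sketch, not a diagnosis: `probe_handshake` and `summarize` are hypothetical helper names, and certificate verification is turned off on purpose because the only thing being measured is whether the handshake completes within the timeout.

```python
import socket
import ssl
import time


def probe_handshake(host, port, timeout=5.0):
    """Attempt one TLS handshake; return (ok, seconds, error_text)."""
    # Verification is disabled: this measures handshake completion only,
    # not whether the server certificate is trusted.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    start = time.monotonic()
    try:
        # The connect timeout also applies to the handshake on the wrapped socket.
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=host):
                return True, time.monotonic() - start, ""
    except OSError as exc:  # covers timeouts, resets, and ssl.SSLError
        return False, time.monotonic() - start, repr(exc)


def summarize(results):
    """Return (success_rate, slowest_successful_handshake_seconds)."""
    total = len(results)
    ok_times = [secs for ok, secs, _ in results if ok]
    rate = len(ok_times) / total if total else 0.0
    return rate, max(ok_times, default=0.0)
```

A run might look like `summarize([probe_handshake("localhost", 6443) for _ in range(20)])`, where `localhost:6443` is only an assumed address; `kubectl cluster-info` shows the real one. If the raw handshake fails at the same rate as kubectl, the problem sits below the Kubernetes tooling (network stack, VM, or proxy) rather than in the client configuration.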

Related

RabbitMQ randomly disconnecting application consumers in a Kubernetes/Istio environment

Issue:
My company has recently moved workers from Heroku to Kubernetes. We previously used a Heroku-managed add-on (CloudAMQP) for our RabbitMQ brokers. This worked perfectly and we never saw issues with dropped consumer connections.
Now that our workloads live in Kubernetes deployments on separate nodegroups, we are seeing daily dropped consumer connections, causing our messages to not be processed by our applications living in Kubernetes. Our new RabbitMQ brokers live in CloudAMQP but are not managed Heroku add-ons.
Errors on the consumer side just indicate an Unexpected disconnect. No additional details.
No evident errors at the Istio Envoy proxy level.
We do not have an Istio egress, so no destination rules are set here.
No evident errors on the RabbitMQ server.
Remediation Attempts:
Read all StackOverflow/GitHub issues for the Unexpected errors we are seeing. Nothing we have found has remediated the issue.
Our first attempt to remediate was to change the heartbeat to 0 (disabling heartbeats) on our RabbitMQ server and consumer. This did not fix anything; connections are still randomly dropping. CloudAMQP also suggests disabling this, because they rely heavily on TCP keepalive.
Created a message that just logs on the consumer every five minutes, to keep the connection active. This has been a band-aid fix for whatever the real issue is. It is not perfect, but we have seen a reduction in disconnects.
What we think the issue is:
We have researched why this might be happening and are homing in on network TCP keepalive settings, either within Kubernetes or on our Istio Envoy proxy's outbound connection settings.
Any ideas on how we can troubleshoot this further, or what we might be missing here to diagnose?
Thanks!
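For reference, this is roughly what TCP keepalive looks like at the socket level. It is a minimal sketch under assumptions: `enable_keepalive` is a hypothetical helper, the timer values are placeholders, and whether a given AMQP client library exposes its underlying socket (or its own keepalive options) depends on the library. The idea is that probes must be sent more often than the shortest idle timeout of any hop between the consumer and the broker.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, probes=5):
    """Turn on TCP keepalive and tune the timers where the OS exposes them."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These are the Linux option names; other platforms may lack some of
    # them, so each one is applied only if the constant exists.
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds of idleness before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", probes),      # failed probes before the connection is dropped
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

Without per-socket settings like these, a connection falls back to the kernel defaults inside the pod, where the first probe is typically sent only after two hours of idleness on Linux. That is long enough for an intermediate proxy or load balancer to have silently dropped the idle connection.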

How can I fix ceph commands hanging after a reboot?

I'm pretty new to Ceph, so I've included all my steps I used to set up my cluster since I'm not sure what is or is not useful information to fix my problem.
I have 4 CentOS 8 VMs in VirtualBox set up to teach myself how to bring up Ceph. 1 is a client and 3 are Ceph monitors. Each Ceph node has six 8 GB drives. Once I learned how the networking worked, it was pretty easy.
I set each VM to have a NAT (for downloading packages) and an internal network that I called "ceph-public". This network would be accessed by each VM on the 10.19.10.0/24 subnet. I then copied the ssh keys from each VM to every other VM.
I followed this documentation to install cephadm, bootstrap my first monitor, and added the other two nodes as hosts. Then I added all available devices as OSDs, created my pools, then created my images, then copied my /etc/ceph folder from the bootstrapped node to my client node. On the client, I ran rbd map mypool/myimage to mount the image as a block device, then used mkfs to create a filesystem on it, and I was able to write data and see the IO from the bootstrapped node. All was well.
Then, as a test, I shut down and restarted the bootstrapped node. When it came back up, I ran ceph status but it just hung with no output. Every single ceph and rbd command now hangs and I have no idea how to recover or properly reset or fix my cluster.
Has anyone ever had the ceph command hang on their cluster, and what did you do to solve it?
Let me share a similar experience. Some time ago I also tried to run some tests on Ceph (Mimic, I think), and my VMs on VirtualBox acted very strangely, nothing like actual bare-metal servers, so please bear this in mind... the tests are not quite relevant.
Regarding your problem, check the following:
have at least 3 monitors (and an odd number of them). It's possible that the hang is caused by the monitor election.
make sure the networking part is OK (separate VLANs for Ceph servers and clients)
DNS is resolving OK (you have added the server names in hosts)
...just my 2 cents...
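The monitor-count advice comes down to majority arithmetic: Ceph monitors need a strict majority to form a quorum, and commands such as ceph status block while no quorum exists. The sketch below uses hypothetical helper names and only illustrates the counting, not Ceph's actual election code.

```python
def quorum_size(monitors):
    """Smallest strict majority of the monitor count."""
    return monitors // 2 + 1


def tolerated_failures(monitors):
    """How many monitors can be down while a majority still exists."""
    return monitors - quorum_size(monitors)


for count in range(1, 6):
    print(f"{count} monitors: quorum={quorum_size(count)}, "
          f"can lose {tolerated_failures(count)}")
```

Going from 3 to 4 monitors does not increase the number of failures tolerated, which is why odd counts are the usual recommendation. With 3 monitors, a single rebooted node should not by itself cause a hang, so it is worth checking whether the other two monitors were actually up and reachable at the time.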

Two versions of fluentd fighting over port in my cluster

Somehow, I have 2 versions of fluentd running in my cluster:
They end up fighting over the same port, they just keep cranking away, trying to start up on that port, and it saturates all the CPU in the cluster.
unexpected error error_class=Errno::EADDRINUSE error="Address already in use - bind(2) for 0.0.0.0:24231
/opt/google-fluentd/embedded/lib/ruby/2.6.0/socket.rb:201:in 'bind'
I've tried deleting the daemon sets and deployments, they just keep coming back. Also tried ssh'ing into the machines and killing the process on that port. Nothing seems to work.
Obviously, I only want one version of fluentd to run (and I'm not even sure which one).
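The EADDRINUSE error itself is easy to reproduce outside the cluster, which may help when reading the log: the second process to bind a TCP port that another socket already holds is refused. This is a minimal sketch with a hypothetical function name; it has nothing fluentd-specific in it.

```python
import errno
import socket


def second_bind_errno(host="127.0.0.1"):
    """Bind one socket, then try to bind a second one to the same port."""
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        first.bind((host, 0))  # port 0 lets the OS pick a free port
        first.listen(1)
        port = first.getsockname()[1]
        try:
            second.bind((host, port))
        except OSError as exc:
            return exc.errno  # the loser of the race ends up here
        return None
    finally:
        first.close()
        second.close()


print(second_bind_errno() == errno.EADDRINUSE)
```

A process that restarts in a loop after this error, as described above, will keep burning CPU until one of the two owners of the port is removed for good.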
I seem to have fixed it. I went to GCP dashboard cluster edit page, Kubernetes Engine Monitoring dropdown was blank. It seems not even the dropdown could decide what to display here.
It seems the automated agent, or whatever, seriously messed up here, and had 2 versions of the logging and monitoring system running, fighting over a port, and crushing the CPU on every machine in the cluster. On top of that, I couldn't delete the daemon sets, pods, or deployments. It seems Google treats these as special somehow, maybe with some kind of automated agent, I don't know.
From the dropdown, I just selected System and workload logging and monitoring, saved, and it applied the changes.
Everything looking good so far, but this whole event has me worried, I didn't do anything. This just....happened.
This is a dev cluster, but if it was a production cluster...

Pgbouncer: how to run within a kubernetes cluster properly

The background: I currently run some kubernetes pods with a pgbouncer sidecar container. I’ve been running into annoying behavior with sidecars (that will be addressed in k8s 1.18) that have workarounds, but have brought up an earlier question around running pgbouncer inside k8s.
Many folks recommend the sidecar approach for pgbouncer, but I wonder why running one pgbouncer per, say, machine in the k8s cluster wouldn’t be better? I admit I don’t have a deep enough understanding of either pgbouncer or k8s networking to understand the implications of either approach.
EDIT:
Adding context, as it seems like my question wasn't clear enough.
I'm trying to decide between two approaches of running pgbouncer in a kubernetes cluster. The PostgreSQL server is not running in this cluster. The two approaches are:
Running pgbouncer as a sidecar container in all of my pods. I have a number of pods: some replicas on a webserver deployment, an async job deployment, and a couple cron jobs.
Running pgbouncer as a separate deployment. I'd plan on running 1 pgbouncer instance per node on the k8s cluster.
I worry that (1) will not scale well. If my PostgreSQL master has a max of 100 connections, and each pool has a max of 20 connections, I potentially risk saturating connections pretty early. Additionally, I risk saturating connections on master during pushes as new pgbouncer sidecars exist alongside the old image being removed.
I, however, almost never see (2) recommended. It seems like everyone recommends (1), but the drawbacks seem quite obvious to me. Would the networking penalty I'd incur by connecting to pgbouncer outside of my pod be large enough to notice? Is pgbouncer perhaps smart enough to deal with many other pgbouncer instances that could potentially saturate connections?
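The saturation worry can be made concrete with worst-case arithmetic. The sketch below assumes one pool per pgbouncer instance and that every pool fills up completely; the pod and node counts are hypothetical, while the limits (100 server connections, 20 per pool) are the ones from the question.

```python
def max_full_pools(db_max_connections, pool_size):
    """How many completely full pools the database can accept."""
    return db_max_connections // pool_size


def worst_case_connections(instances, pool_size, surge=0):
    """Server connections if every pgbouncer instance fills its pool.

    `surge` is the number of extra instances that exist temporarily
    while old and new pods overlap during a rollout.
    """
    return (instances + surge) * pool_size


DB_MAX = 100
POOL_SIZE = 20

print(max_full_pools(DB_MAX, POOL_SIZE))               # full pools that fit
print(worst_case_connections(8, POOL_SIZE))            # 8 sidecar pods (hypothetical)
print(worst_case_connections(8, POOL_SIZE, surge=2))   # same, mid-rollout
print(worst_case_connections(3, POOL_SIZE))            # 3 per-node instances (hypothetical)
```

Under these assumptions the sidecar layout exceeds the limit as soon as more than five pods run with full pools, while the per-node layout is bounded by the node count rather than the pod count. Real usage is usually lower because pgbouncer opens server connections on demand, so the numbers are an upper bound rather than a prediction.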
We run pgbouncer in production on Kubernetes. I expect the best way to do it is use-case dependent. We do not take the sidecar approach, but instead run pgbouncer as a separate "deployment", and it's accessed by the application via a "service". This is because for our use case, we have 1 postgres instance (i.e. one physical DB machine) and many copies of the same application accessing that same instance (but using different databases within that instance). Pgbouncer is used to manage the active connections resource. We are pooling connections independently for each application because the nature of our application is to have many concurrent connections and not too many transactions. We are currently running with 1 pod (no replicas) because that is acceptable for our use case if pgbouncer restarts quickly. Many applications all run their own pgbouncers and each application has multiple components that need to access the DB (so each pgbouncer is pooling connections of one instance of the application). It is done like this https://github.com/astronomer/airflow-chart/tree/master/templates/pgbouncer
The above does not include getting the credentials set up right for accessing the database. The above, linked template is expecting a secret to already exist. I expect you will need to adapt the template to your use case, but it should help you get the idea.
We have had some production concerns. Primarily we still need to do more investigation on how to replace or move pgbouncer without interrupting existing connections. We have found that the application's connection to pgbouncer is stateful (of course because it's pooling the transactions), so if pgbouncer container (pod) is swapped out behind the service for a new one, then existing connections are dropped from the application's perspective. This should be fine even running pgbouncer replicas if you have an application where you can ensure that rarely dropped connections retry and make use of Kubernetes sticky sessions on the "service". More investigation is still required by our organization to make it work perfectly.

Spark error: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

I have a virtual machine in which a spark-2.0.0-bin-hadoop2.7 in standalone mode is installed.
I ran ./sbin/start-all.sh to run the master and the slave.
When I do ./bin/spark-shell --master spark://192.168.43.27:7077 --driver-memory 600m --executor-memory 600m --executor-cores 1 in the machine itself the task's status is RUNNING and I am able to compute code in spark shell.
When I do exactly the same command from another machine in the network, the status is "RUNNING" again, but the spark-shell throws WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources. I guess the problem is not directly related to resources because the same command works in the virtual machine itself, but not when it comes from other machines.
I checked most of the topics related to this error and none of them solved my problem. I even disabled firewall with sudo ufw disable just to make sure but no success (based on this link) which suggests:
Disable Firewall on the client : This was the solution that worked for me. Since I was working on a prototype in-house code, I disabled the firewall on the client node. For some reason the worker nodes, were not able to talk back to the client for me. For production purposes, you would want to open-up certain number of ports required.
There are two known reasons for this:
Your application requires more resources (cores, memory) than allocated. Increasing worker cores and memory should solve it. Most other answers focus on this.
Less well known: the firewall is blocking communication between the master and workers. This can happen especially if you are using a cloud service. According to the Spark Security documentation, besides the standard 8080, 8081, 7077, and 4040 ports, you also need to make sure the master and worker can communicate via SPARK_WORKER_PORT, spark.driver.port and spark.blockManager.port; the latter three are used when submitting jobs and are randomly assigned by the program (if left unconfigured). You may try opening all ports to run a quick test.
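Instead of opening every port, the randomly assigned ones can be pinned so that specific firewall rules become possible. The sketch below builds the spark-shell command line with `--conf` flags for spark.driver.port and spark.blockManager.port; the port numbers 40000 and 40001 and the helper name are hypothetical, and SPARK_WORKER_PORT still has to be set separately on the worker side.

```python
def spark_shell_command(master, driver_port, block_manager_port):
    """Build a spark-shell argument list with fixed driver-side ports."""
    conf = {
        "spark.driver.port": driver_port,
        "spark.blockManager.port": block_manager_port,
    }
    args = ["./bin/spark-shell", "--master", master]
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Hypothetical port choices; any free ports the firewall allows will do.
command = spark_shell_command("spark://192.168.43.27:7077", 40000, 40001)
print(" ".join(command))
```

With the ports fixed, the client machine's firewall only needs to accept inbound connections from the workers on those two ports, which matches the symptom in the question, where the job works locally but not from another machine.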
Adding an example of #Fountaine007's first bullet.
I ran into the same issue, and it was because the allocated vcores were fewer than the application expected.
For my specific scenario, I increased the value of yarn.nodemanager.resource.cpu-vcores under $HADOOP_HOME/etc/hadoop/yarn-site.xml.
For memory-related issues, you may also need to modify yarn.nodemanager.resource.memory-mb.