kubectl version does not show when the cluster was upgraded to its current version. Is there a way to get the timestamp of the last cluster upgrade or, better, the timestamps of all previous upgrades?
We can see a list of all running and completed operations in the cluster using the following command:
gcloud beta container operations list
Each operation is assigned an operation ID, an operation type, start and end times, a target cluster, and a status.
To get more information about a specific operation, specify the operation ID in the following command:
gcloud beta container operations describe OPERATION_ID --zone=ZONE
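If you are only interested in upgrade operations, the standard gcloud --filter flag should be able to narrow the list down; UPGRADE_MASTER and UPGRADE_NODES are the operation types GKE normally reports for upgrades, but double-check them against the TYPE column of your own unfiltered output:
# list only upgrade operations; START_TIME and END_TIME are the upgrade timestamps you are after
gcloud beta container operations list --filter="operationType=UPGRADE_MASTER OR operationType=UPGRADE_NODES"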
I upgraded MongoDB Ops Manager from 4.2.4 to 4.4.6. It then told me that the MongoDB Agent and MongoDB Tools were out of date. After I pressed the upgrade button, the upgrade process got stuck.
The Agent log has the following error:
Interface conversion: interface {} is nil, not float64
Now the Ops Manager is stuck at the "deploying" stage. How can I revert to the previous state or fix the problem?
$ kubectl exec -it pod is useless, as the pods run under a security context that uses user "2000".
https://docs.mongodb.com/kubernetes-operator/master/reference/troubleshooting/
is almost useless, as we don't have permission to change the pod state. So what's the right way to fix deployment issues when using Ops Manager?
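For reference, the only thing I can still do is collect read-only state; the namespace and pod names below are just placeholders:
# read-only inspection that works even without exec access
kubectl -n mongodb get pods
kubectl -n mongodb describe pod ops-manager-0
kubectl -n mongodb logs ops-manager-0 --all-containers --tail=200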
I assume there are no stupid questions, so here is one that I could not find a direct answer to.
The situation
I currently have a Kubernetes cluster running 1.15.x on AKS, deployed and managed through Terraform. Azure recently announced that they will retire the 1.15 version of Kubernetes on AKS, so I need to upgrade the cluster to 1.16 or later. As I understand the situation, upgrading the cluster directly in Azure would have no consequences for the contents of the cluster, i.e. nodes, pods, secrets and everything else currently on there, but I cannot find any proper answer to what would happen if I upgrade the cluster through Terraform.
Potential problems
So what could go wrong? In my mind, the worst outcome would be that the entire cluster is destroyed and a new one is created: no pods, no secrets, nothing. Since there is so little information out there, I am asking here to see if there is anyone with more experience with Terraform and Kubernetes who could potentially help me out.
To summarize:
Terraform versions
Terraform v0.12.17
+ provider.azuread v0.7.0
+ provider.azurerm v1.37.0
+ provider.random v2.2.1
What I'm doing
$ terraform init
// running terraform plan with the new Kubernetes version declared for AKS
$ terraform plan
// The following changes are announced by Terraform:
An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
~ update in-place
Terraform will perform the following actions:
#module.mycluster.azurerm_kubernetes_cluster.default will be updated in-place...
...
~ kubernetes_version = "1.15.5" -> "1.16.13"
...
Plan: 0 to add, 1 to change, 0 to destroy.
What I want to happen
Terraform will tell Azure to upgrade the existing AKS service in place, not destroy it and create a new one. I assume this is what will happen, as Terraform announces that it will "update in-place" instead of adding new and/or destroying existing clusters.
I found this question today and thought I'd add my experience as well. I made the following changes:
Changed the kubernetes_version under azurerm_kubernetes_cluster from "1.16.15" -> "1.17.16"
Changed the orchestrator_version under default_node_pool from "1.16.15" -> "1.17.16"
Increased the node_count under default_node_pool from 1 -> 2
A terraform plan showed that it was going to update in-place. I then performed a terraform apply, which completed successfully. kubectl get nodes showed that an additional node was created, but both nodes in the pool were still on the old version. After further inspection in the Azure Portal it turned out that only the k8s cluster version had been upgraded, not the version of the node pool.
I then executed terraform plan again, and again it showed that the orchestrator_version under default_node_pool was going to be updated in-place. I executed terraform apply, which then proceeded to upgrade the version of the node pool. It did the whole thing where it creates an additional node in the pool (with the new version) and marks it schedulable, while marking the existing node in the pool unschedulable. The unschedulable node is then replaced by a new node with the new k8s version and eventually becomes schedulable again. It did this for both nodes. Afterwards all nodes were upgraded without any noticeable downtime.
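If you want to watch that node rotation while the apply runs, plain kubectl is enough; this is just a generic observation command, nothing AKS-specific:
# watch nodes being cordoned, replaced and becoming Ready on the new version
kubectl get nodes -o wide --watch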
I'd say this shows that the Terraform method is non-destructive, even if there have at times been oversights in the upgrade process (but still non-destructive in this example): https://github.com/terraform-providers/terraform-provider-azurerm/issues/5541
If you need higher confidence in this change, you could alternatively consider using the Azure-based upgrade method, refreshing the changes back into your state, and tweaking the code until a plan no longer shows anything intolerable. The two azurerm_kubernetes_cluster arguments dealing with version might be all you need to tweak.
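Whichever route you take, it is easy to confirm afterwards that both the control plane and the node pool ended up on the expected version; the resource group and cluster names below are placeholders:
# control-plane version
az aks show --resource-group my-rg --name my-aks --query kubernetesVersion -o tsv
# node pool (orchestrator) versions
az aks nodepool list --resource-group my-rg --cluster-name my-aks --query "[].{name:name, version:orchestratorVersion}" -o table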
I'm trying to look up previous versions of my secrets inside the Microk8s etcd instance, but the number of revisions changes every time I refresh my screen and I don't know why.
When I try to access an older version I get the error below:
etcdctl --endpoints=127.0.0.1:2380 get --rev=9133 -w fields /registry/secrets/default/mysql-test-password
{"level":"warn","ts":"2020-09-14T13:40:08.594Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-8b3e59a8-efd2-4f77-a96f-5ec3c451b9b7/127.0.0.1:2380","attempt":0,"error":"rpc error: code = OutOfRange desc = etcdserver: mvcc: required revision has been compacted"}
Error: etcdserver: mvcc: required revision has been compacted
I also added the following configuration to my etcd config file and restarted the service, but it didn't help:
--auto-compaction-mode=periodic
--auto-compaction-retention=72h
It seems that every time I refresh my screen the number of revisions increases a lot, without me doing anything:
{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":15322,"raft_term":7}
1 second later
student#desktop:~$ etcdctl --endpoints=127.0.0.1:2380 get /registry/secrets/default/mysql-root-password -w json
{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":16412,"raft_term":7}
Has anyone faced something like this?
"etcdserver: mvcc: required revision has been compacted." is not an error, it's an expected message when a watch is attempted to be established at a resource version that has already been compacted.
Something is wrong with your etcd cluster; this is almost certainly not an apiserver problem. There may be an alarm you need to disarm.
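A quick way to check for (and clear) such an alarm with etcdctl, using the same endpoint as above; NOSPACE is the most common one:
# show any active alarms, then clear them once the underlying cause is fixed
etcdctl --endpoints=127.0.0.1:2380 alarm list
etcdctl --endpoints=127.0.0.1:2380 alarm disarm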
Also remember that you have to upgrade etcd to version 3.0.11 or later; see https://github.com/kubernetes/kubernetes/issues/45506.
Take a look: mvcc-issues, reloader.
I changed my initialization script after creating a cluster with 2 worker nodes for Spark. Then I changed the script a bit and tried to update the cluster with 2 more worker nodes. The script failed because I simply forgot to run apt-get update before apt-get install, so Dataproc reported an error and the cluster's status changed to ERROR. When I try to reduce the size back to 2 nodes again, it no longer works and fails with the following message:
ERROR: (gcloud.dataproc.clusters.update) Cluster 'cluster-1' must be running before it can be updated, current cluster state is 'ERROR'.
The two worker nodes were still added, but at first they didn't seem to be detected by the running Spark application, because no more executors were added. I manually reset the two instances on the Google Compute Engine page, and then 4 executors were added. So it seems everything is working fine again, except that the cluster's status is still ERROR and I cannot increase or decrease the number of worker nodes anymore.
How can I update the cluster status back to normal (RUNNING)?
In your case, ERROR indicates that the workflow to reconfigure the cluster failed, and Dataproc is not sure of its health. At this point Dataproc cannot guarantee that another reconfiguration attempt will succeed, so further updates are disallowed. You can, however, still submit jobs.
Your best bet is to delete it and start over.
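A minimal sketch of the delete-and-recreate, assuming the corrected init script is uploaded to a bucket; the region, worker count and bucket path are placeholders:
# drop the broken cluster and recreate it with the fixed initialization action
gcloud dataproc clusters delete cluster-1 --region=us-central1
gcloud dataproc clusters create cluster-1 \
    --region=us-central1 \
    --num-workers=4 \
    --initialization-actions=gs://my-bucket/init.sh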
I am attempting to identify and fix the source of high latency when running kubectl get pods.
I am running Kubernetes 1.1.4 on AWS.
When running the command from the master host of the afflicted cluster, I consistently get response times of about 6s.
Other queries, such as get svc and get rc, return on the order of 20ms.
Running get pods on a mirror cluster returns in 150ms.
I've crawled through master logs and system stats, but have not identified the issue.
We sped up LIST operations in 1.2. You might be interested in reading about the updates to Kubernetes performance and scalability in 1.2.
Chris, how big is your cluster and how many pods do you have in it?
Obviously, the time it takes to return the response will be longer if the result is bigger.
Also, what do you mean by "running on mirror cluster returns in 150ms"? What is a "mirror cluster"?
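One way to see where those 6 seconds go is to time the call and raise kubectl's client-side verbosity, which makes it log the API requests it issues (the -v flag is the standard glog verbosity switch, so it should also be available on a 1.1.x client):
# wall-clock time of the full command
time kubectl get pods
# log the HTTP requests kubectl makes to the apiserver
kubectl get pods --v=7
If the apiserver round-trip accounts for most of the time, the problem is server-side (apiserver or etcd); if not, the time is being spent in client-side processing of a large response.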