Cannot upgrade to 21.2 because decommissioned node is unavailable - Cluster Version Error - upgrade

When I try to SET CLUSTER SETTING to 21.1, an error message says “ERROR: n2 required, but unavailable” (n2 refers to a node, but that node no longer exists: I can see it was decommissioned 2 years ago).

Try decommissioning n2 again. The node's state is currently indeterminate (pre-21.1 versions didn’t know about this “final” decommissioned state). Decommissioning the node once more from a node running the new binary will let you mark it as decommissioned for good, and let the upgrade proceed. This unclear error message will be fixed when this issue closes.

watch of *v1.Pod ended with: too old resource version

I updated my EKS cluster from 1.16 to 1.17. All of a sudden I started getting this error:
pkg/mod/k8s.io/client-go#v0.0.0-20180806134042-1f13a808da65/tools/cache/reflector.go:99: watch of *v1.Pod ended with: too old resource version
I checked on git and people were saying that this is not an error, but my question is: how do I stop getting these messages? I was not getting them when I was on EKS 1.16.
This is a community wiki answer. Feel free to expand it.
In short, there is nothing to worry about when encountering these messages. They mean that there are newer version(s) of the watched resource after the time the client API last acquired a list within that watch window. In other words: a watch against the Kubernetes API is timing out and being restarted, which is intended behavior.
You can also see that being mentioned here:
this is perfectly expected, no worries. The messages are several hours
apart.
When nothing happens in your cluster, the watches established by the
Kubernetes client don't get a chance to get refreshed naturally, and
eventually time out. These messages simply indicate that these watches
are being re-created.
and here:
these are nothing to worry about. This is a known occurrence in
Kubernetes and is not an issue [0]. The API server ends watch requests
when they are very old. The operator uses a client-go informer, which
takes care of automatically re-listing the resource and then
restarting the watch from the latest resource version.
So answering your question:
my question is how to stop getting these messages
Simply, you don't because:
This is working as expected and is not going to be fixed.
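The re-list behavior described in the quotes above can be illustrated with a small simulation. This is only a sketch and not client-go's actual reflector code: `FakeApiServer`, `ResourceVersionTooOld`, and `run_reflector` are hypothetical names made up for illustration, and the exception merely stands in for the "too old resource version" response.

```python
class ResourceVersionTooOld(Exception):
    """Hypothetical stand-in for the 'too old resource version' watch error."""


class FakeApiServer:
    """Toy API server that only remembers versions >= `oldest_kept`."""

    def __init__(self, latest, oldest_kept):
        self.latest = latest            # newest resource version
        self.oldest_kept = oldest_kept  # oldest version a watch may resume from

    def list(self):
        # A list call always succeeds and returns the current resource version.
        return self.latest

    def watch(self, resource_version):
        # A watch can only resume from a version the server still remembers.
        if resource_version < self.oldest_kept:
            raise ResourceVersionTooOld(resource_version)
        return list(range(resource_version + 1, self.latest + 1))


def run_reflector(server, last_seen_version):
    """Try to resume a watch; on 'too old', re-list and watch again."""
    log = []
    try:
        events = server.watch(last_seen_version)
    except ResourceVersionTooOld:
        log.append("watch ended with: too old resource version")
        last_seen_version = server.list()         # re-list to get a fresh version
        log.append(f"re-listed at version {last_seen_version}")
        events = server.watch(last_seen_version)  # restart the watch from there
    log.append(f"watching from version {last_seen_version}")
    return log, events


server = FakeApiServer(latest=500, oldest_kept=400)
log, events = run_reflector(server, last_seen_version=120)
for line in log:
    print(line)
```

In this toy model the message is just a log line emitted on the way to a successful re-list, which matches the point made above: the client recovers on its own.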

Upgrade path from Zookeeper 3.4.6 to 3.6.1

I've been tasked with updating several ZooKeeper clusters. We are currently running 3.4.6, and I'm wondering if I can go directly to 3.6.1, or if I have to upgrade to a 3.5.x version first and then on to 3.6.1.
I've found https://cwiki.apache.org/confluence/display/ZOOKEEPER/Upgrade+FAQ, which talks mostly about upgrading to 3.5.5. https://zookeeper.apache.org/doc/r3.5.3-beta/zookeeperReconfig.html#ch_reconfig_upgrade talks about upgrading to 3.5.0.
Has anyone else out there done this? I'm aware of the snapshot.0 issue.
Thanks,
Todd
I have tried upgrading from 3.4.6 directly to 3.6.1, and it works. The only problem is that, when upgrading the old Leader node, it reports that it cannot find a snapshot. All you need to do is clean up the data storage (make a backup first), then restart the node.
As for the steps: say you have three nodes, A with myid=11, B with myid=12, and C with myid=13, forming the cluster A:11, B:12, C:13 (Leader).
Upgrade A and B to 3.6.1 directly, watch their status, and check that they are synced with the Leader, C.
Stop node C; B should become Leader, as it has the second-biggest ID. Once B has become Leader and A is still a follower, upgrade node C to 3.6.1. If there is any error, clean the data storage and then restart.
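The ordering in the steps above (followers first, current Leader last) can be written down as a tiny helper. This is a minimal sketch: `rolling_upgrade_order` is a hypothetical function name, and it only computes the order; it does not talk to ZooKeeper.

```python
def rolling_upgrade_order(nodes, leader):
    """Return node names in upgrade order: followers first, current leader last."""
    if leader not in nodes:
        raise ValueError(f"unknown leader: {leader}")
    followers = [name for name in nodes if name != leader]
    return followers + [leader]


# myid values from the example above: A=11, B=12, C=13, with C as the Leader.
cluster = {"A": 11, "B": 12, "C": 13}
order = rolling_upgrade_order(cluster, leader="C")
print(order)
```

Which node is currently the Leader has to be determined separately, for example with `zkServer.sh status` on each node.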

GKE Node Upgrade "Out of Resources"

I had left my GKE cluster running 3 minor versions behind the latest and decided to finally upgrade. The master upgrade went well, but then my node upgrades kept failing. I used the Cloud Shell console to manually start an upgrade and view the output, which said something along the lines of "Zone X is out of resources, try Y instead." Unfortunately, I can't just spin up a new node pool in a new zone and have my pipeline work, because I am using GitLab's AutoDevOps pipeline, which makes certain assumptions about node pool naming and such that I can't find any way to override. I also don't want to potentially lose the data stored in my persistent volumes if I end up needing to re-create everything in a new node pool.
I just solved this issue but couldn't find any questions posed on this particular problem, so I wanted to post the answer here in case someone else comes looking for it.
My particular problem was that I had a non-autoscaling node pool with a single node. For my purposes, that's enough for the application stack to run smoothly, and I don't want to incur unforeseen charges from additional nodes automatically being added to the pool. However, this meant that the upgrade apparently had to share that node's resources with everything else running on it, and there weren't enough of them. The solution was simple: add more nodes temporarily.
Because this is specifically GKE, I was able to use a beta feature called "surge upgrade", which allows you to set the maximum number of "surge" nodes to add when performing an upgrade. Once this was enabled, I started the upgrade process again and it temporarily added an extra node, performed the upgrade, and then scaled back down to a single node.
If you aren't on GKE, or don't wish to use a beta feature (or can't), then simply resize the node pool with the node(s) that needs upgrading. I would add a single node unless you are positive you need more.

Service Fabric - Cannot do Config upgrade to add or remove nodes

I've got an on-premises Service Fabric cluster consisting of 18 nodes (9 are seed nodes), secured via gMSA Windows security. The cluster code version is 6.4.622.9590.
Unfortunately I have to rebuild 6 of these nodes (3 Seed nodes). They all live in one data center (cluster spans 3 DCs). As such, I wish to remove these 6 nodes, rebuild them and then re-add them.
As per MSDOCs, adding/removing of nodes is performed via config upgrades. Note: I've already used this process recently to add 12 nodes so understand the concept of SF config upgrades well.
Unfortunately, I'm unable to do ANY config upgrades on this cluster until I remove the nodes - this is due to ValidationExceptions reported by the Start-ServiceFabricClusterConfigurationUpgrade PowerShell command:
If I don't add the 6 nodes to the "NodesToBeRemoved" section, I get validation error that not all removed nodes are in this field
If I do add the 6 nodes, I get the following validation error:
Start-ServiceFabricClusterConfigurationUpgrade :
System.Runtime.InteropServices.COMException (-2147017627)
ValidationException: Model validation error. Removing a non-seed node and changing reliability level in the same
upgrade is not supported. Initiate an upgrade to remove node first and then change the reliability level.
At line:1 char:1
+ Start-ServiceFabricClusterConfigurationUpgrade -ClusterConfigPath "AL ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidOperation: (Microsoft.Servi...usterConnection:ClusterConnection) [Start-ServiceFa
...gurationUpgrade], FabricException
+ FullyQualifiedErrorId : StartClusterConfigurationUpgradeErrorId,Microsoft.ServiceFabric.Powershell.StartClusterC
onfigurationUpgrade
So, we're stuck! I've also already removed node states, thus leaving all 6 nodes in the "Invalid State". The Get-ServiceFabricClusterConfiguration does not return these 6 nodes, but they are still shown in SF Explorer and listed in the cluster manifest XML file.
As far as reliability level is concerned - I'm pretty sure one can no longer change this in SF; i.e. older versions of SF allowed you to configure bronze/silver/gold in config file, but in recent versions (+6.0??) - this is a calculated field and managed internally by SF. In any case - because the seed nodes will be decreased from 9 to 6, I suspect the internal calculated reliability level will drop (presumably from Gold to silver).
I've also come across a hack that someone has used to remove nodes in a cluster... but in my scenario, nodes are still listed in manifest file... Nonetheless, the words hack and production should never meet!
So, how do I get our production cluster out of this situation? Rebuilding the cluster is not an option (that's the whole reason for clusters...high availability!).
I discovered that the above errors are primarily a symptom of lack of clearly documented procedures as well as bad/misleading error messages when doing service fabric configuration upgrades.
I performed quite a bit of my own testing to make sure I can confidently add/remove several nodes from a cluster. I also removed enough nodes to drop the Seed nodes from 9 to 6.
So, to resolve the above issue, here's what I had to do to remove nodes:
1. Use the SF Explorer to remove the node state - this changed the node state from Error to Invalid
2. Get the latest JSON config via Get-ServiceFabricClusterConfiguration
3. Remove the node from the Nodes section
4. Completely remove the NodesToBeRemoved JSON section (i.e. you'll get the inconsistent error if you have an empty list of nodes to be removed - so just remove the containing JSON block)
5. Do a config upgrade
Note: Initially I tried just doing 2-5 above - but it didn't work and the node remained in error state.
That said, from my experience, please also note the following when removing nodes (this info is not clear in MSDOC):
You can remove multiple Seed nodes at once (I wanted to do this to try and replicate the above scenario)
You can add multiple nodes at once too - just be aware you may not see any activity/indication via SF config upgrade status tooling that anything is happening... be prepared to wait at least 15 minutes (depending on how many nodes you're adding... after all, SF is copying installation files to the nodes)
Sometimes, when removing one or more nodes, a node won't be successfully removed but will be left in an Error status. If this is the case, use the SF Explorer (or PowerShell) to remove the node state. The status will change to Invalid. At this point, do another config upgrade, ensuring that:
The removed node(s) are not in the Nodes section
The removed node(s) are not in the NodesToBeRemoved list
As per above, if the value of NodesToBeRemoved is (or should be) empty, remove this whole JSON block; otherwise you'll get a misleading/vague warning that the NodesToBeRemoved parameter contains inconsistent information.
The latter part really is the confusing part that tripped me up last time. The thing to also remember is that, once you successfully remove nodes, the Get-ServiceFabricClusterConfiguration will STILL return the removed nodes in the NodesToBeRemoved parameter. This will likely confuse/trip you up with any subsequent attempts to do a config upgrade. As such, I recommend you do another final config upgrade with this section completely removed.
As a final note: If you re-add a node that has previously been removed, it may come back in a Deactivated status. Simply activate this node and all should be fine.
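The config edit from the removal steps above (drop the nodes from the Nodes section and delete the NodesToBeRemoved block entirely) could be scripted roughly as follows. This is a sketch under assumptions, not an official tool: `remove_nodes_from_config` is a hypothetical helper, and the key names (`nodes`, `nodeName`, `properties`, `fabricSettings`, `parameters`) are assumed from the sample standalone cluster config layout, so check them against the JSON your own cluster returns. Dropping a settings section that ends up with no parameters is also my assumption.

```python
import json


def remove_nodes_from_config(config, node_names):
    """Return a copy of the cluster config with the given nodes removed.

    Assumed layout (verify against your own file):
      config["nodes"]                        -> list of {"nodeName": ...}
      config["properties"]["fabricSettings"] -> list of {"name", "parameters"}
    """
    result = json.loads(json.dumps(config))  # deep copy via JSON round trip
    doomed = set(node_names)

    # Remove the node entries from the Nodes section.
    result["nodes"] = [n for n in result["nodes"] if n["nodeName"] not in doomed]

    # Drop the NodesToBeRemoved parameter entirely instead of leaving it empty.
    settings = result["properties"].get("fabricSettings", [])
    for section in settings:
        section["parameters"] = [
            p for p in section.get("parameters", []) if p["name"] != "NodesToBeRemoved"
        ]
    # Assumption: a section left with no parameters is removed as well.
    result["properties"]["fabricSettings"] = [s for s in settings if s["parameters"]]
    return result


# Made-up example config with three nodes; vm2 is the node being removed.
config = {
    "nodes": [{"nodeName": "vm0"}, {"nodeName": "vm1"}, {"nodeName": "vm2"}],
    "properties": {
        "fabricSettings": [
            {
                "name": "Setup",
                "parameters": [
                    {"name": "FabricDataRoot", "value": "C:\\SF"},
                    {"name": "NodesToBeRemoved", "value": "vm2"},
                ],
            }
        ]
    },
}
updated = remove_nodes_from_config(config, ["vm2"])
print(json.dumps(updated, indent=2))
```

The result would then be saved to a file and passed to Start-ServiceFabricClusterConfigurationUpgrade via -ClusterConfigPath, as in the command shown in the question.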

(How) do node pool autoupgrades in GKE actually work?

We have a fairly large kubernetes deployment on GKE, and we wanted to make our life a little easier by enabling auto-upgrades. The documentation on the topic tells you how to enable it, but not how it actually works.
We enabled the feature on a test cluster, but no nodes were ever upgraded (although the UI kept nagging us that "upgrades are available").
The docs say it would be updated to the "latest stable" version and that it occurs "at regular intervals at the discretion of the GKE team" - neither of which is terribly helpful.
The UI always says: "Next auto-upgrade: Not scheduled"
Has someone used this feature in production and can shed some light on what it'll actually do?
What I did:
I enabled the feature on the nodepools (not the cluster itself)
I set up a maintenance window
Cluster version was 1.11.7-gke.3
Nodepools had version 1.11.5-gke.X
The newest available version was 1.11.7-gke.6
What I expected:
The nodepool would be updated to either 1.11.7-gke.3 (the default cluster version) or 1.11.7-gke.6 (the most recent version)
The update would happen in the next maintenance window
The update would otherwise work like a "manual" update
What actually happened:
Nothing
The nodepools remained on 1.11.5-gke.X for more than a week
My question
Is the nodepool version supposed to update?
If so, at what time?
If so, to what version?
I'll finally answer this myself. The auto-upgrade does work, though it took several days to a week until the version was upgraded.
There is no indication of the planned upgrade date, or any feedback other than the version updating.
It will upgrade to the current master version of the cluster.
Addition: It still doesn't work reliably, and there is still no way to debug it when it doesn't. One piece of information I received was that the mechanism does not work if you initially specified a version for the node pool. As it is not possible to deduce the inner workings of the auto-upgrades, we had to resort to manually checking the status again.
I wanted to share two other possibilities as to why a node-pool may not be auto-upgrading or scheduled to upgrade.
One of our projects was having a similar issue, where the master version had auto-upgraded to 1.14.10-gke.27 but our node-pool stayed stuck at 1.14.10-gke.24 for over a month.
Reaching a node quota
The node-pool upgrade might be failing due to a node quota (although I'm not sure the web console would say Next auto-upgrade: Not scheduled). The node upgrades documentation suggests running the following to view any failed upgrade operations:
gcloud container operations list --filter="STATUS=DONE AND TYPE=UPGRADE_NODES AND targetLink:https://container.googleapis.com/v1/projects/[PROJECT_ID]/zones/[ZONE]/clusters/[CLUSTER_NAME]"
Automatic node upgrades are for minor+ versions only
After exhausting my troubleshooting steps, I reached out to GCP Support and opened a case (Case 23113272 for anyone working at Google). They told me the following:
Automatic node upgrade:
The node version could not necessary upgrade automatically, let me explain, exists three upgrades in a node: Minor versions (1.X), Patch releases (1.X.Y) and Security updates and bug fixes (1.X.Y-gke.N), please take a look at this documentation [2] the automatic node upgrade works from a minor version and in your case the upgrade was a security update that can't upgrade automatically.
I replied, and they confirmed that automatic node upgrades will only happen for minor versions and above. I have asked them to submit a request to update their documentation because (at the time of this response) this is not outlined anywhere in their node auto-upgrade documentation.
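To see which of the three upgrade types from the support reply applies to a given pair of versions, a small parser is enough. This is a minimal sketch: `parse_gke_version` and `upgrade_kind` are hypothetical helpers, the regular expression only covers versions of the form 1.X.Y-gke.N as they appear in this thread, and whether auto-upgrade really skips the smaller categories rests solely on the support reply quoted above.

```python
import re

# Matches versions such as 1.14.10-gke.27 (major.minor.patch-gke.build).
VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)-gke\.(\d+)$")


def parse_gke_version(version):
    """Split a GKE version string into (major, minor, patch, gke_build)."""
    match = VERSION_RE.match(version)
    if match is None:
        raise ValueError(f"unrecognized GKE version: {version!r}")
    return tuple(int(part) for part in match.groups())


def upgrade_kind(current, target):
    """Name the most significant component that differs between two versions."""
    labels = ("major", "minor", "patch", "gke-build")
    pairs = zip(parse_gke_version(current), parse_gke_version(target))
    for label, (old, new) in zip(labels, pairs):
        if old != new:
            return label
    return "none"


# Node pool vs. master version from the case described above.
print(upgrade_kind("1.14.10-gke.24", "1.14.10-gke.27"))
# Default cluster version vs. newest available version from the original question.
print(upgrade_kind("1.11.7-gke.3", "1.11.7-gke.6"))
```

Both pairs differ only in the -gke.N suffix, which is the category support said would not be upgraded automatically.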
This feature replaces the VMs (Kubernetes nodes) in your node pool running the "old" Kubernetes version with VMs running the "new" version.
The node pool "upgrade" operation is done in a rolling fashion: It's not like GKE deletes all your VMs and recreates them simultaneously (except when you have only 1 node in your cluster). By default, the nodes are replaced with newer nodes one-by-one (although this might change).
GKE internally uses mostly the features of managed instance groups to manage operations on node pools.
You can find documentation on how to schedule node upgrades by specifying certain "maintenance windows" so you are impacted minimally. (This article also gives a bit more insights on how upgrades happen.)
That said, you can disable auto-upgrades and upgrade your cluster manually (although this is not recommended). Some GKE users have thousands of nodes, so for them upgrading VMs one-by-one is not feasible.
For that GKE offers an option that lets you choose "how many nodes are upgraded at a time":
gcloud container clusters upgrade \
--concurrent-node-count=CONCURRENT_NODE_COUNT
Documentation of this flag says:
The number of nodes to upgrade concurrently. Valid values are [1, 20]. It is a recommended best practice to set this value to no higher than 3% of your cluster size.
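The two constraints in that flag description (a valid range of 1 to 20, and no more than 3% of the cluster size) can be combined into a quick calculation. This is a sketch with a hypothetical helper name; rounding down and falling back to 1 for small clusters are my assumptions, since the flag documentation quoted above does not say how to round.

```python
def suggested_concurrent_node_count(cluster_size):
    """Largest value within [1, 20] that does not exceed 3% of the cluster size.

    For clusters where 3% is less than one node, this falls back to 1,
    the smallest valid value (an assumption, see the note above).
    """
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    three_percent = cluster_size * 3 // 100  # integer math avoids float rounding
    return max(1, min(20, three_percent))


for size in (10, 100, 500, 1000):
    print(size, suggested_concurrent_node_count(size))
```

The result would be passed as the CONCURRENT_NODE_COUNT value in the command above.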