Concourse resource cache flushing

Sometimes when a Concourse pipeline is being built, it tries to use the previous version of a resource instead of the latest one. I could confirm this because the resource hashes don't match.
Please let me know how to flush the resource cache.

Concourse v7.4.0 (released August 2021) adds the command
fly clear-resource-cache -r <pipeline>/<resource>
which will do what you are looking for.
See:
the documentation for clear-resource-cache.
the release notes for v7.4.0
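For example, a minimal invocation (hypothetical target, pipeline, and resource names):
fly -t my-target clear-resource-cache -r my-pipeline/my-resource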

Before v7.4.0, the only way to flush the resource cache was to restart all the workers, as this clears their ephemeral disks.
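On older versions, a minimal sketch of that restart, assuming the workers run as a systemd service named concourse-worker (a hypothetical name; adjust to however your workers are deployed, e.g. BOSH or Kubernetes):
# On each worker host (the service name is an assumption):
sudo systemctl restart concourse-worker
# Afterwards, confirm the workers have re-registered with the web node:
fly -t my-target workers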

Related

watch of *v1.Pod ended with: too old resource version

I updated my EKS cluster from 1.16 to 1.17. All of a sudden I started getting this error:
pkg/mod/k8s.io/client-go#v0.0.0-20180806134042-1f13a808da65/tools/cache/reflector.go:99: watch of *v1.Pod ended with: too old resource version
I checked on GitHub and people were saying that's not an error, but my question is: how do I stop getting these messages? I was not getting them on EKS 1.16.
Source.
This is a community wiki answer. Feel free to expand it.
In short, there is nothing to worry about when encountering these messages. They mean that newer versions of the watched resource have appeared since the client API last acquired a list within that watch window. In other words: a watch against the Kubernetes API is timing out and is being restarted, which is intended behavior.
You can also see that being mentioned here:
this is perfectly expected, no worries. The messages are several hours
apart.
When nothing happens in your cluster, the watches established by the
Kubernetes client don't get a chance to get refreshed naturally, and
eventually time out. These messages simply indicate that these watches
are being re-created.
and here:
these are nothing to worry about. This is a known occurrence in
Kubernetes and is not an issue [0]. The API server ends watch requests
when they are very old. The operator uses a client-go informer, which
takes care of automatically re-listing the resource and then
restarting the watch from the latest resource version.
So answering your question:
my question is how to stop getting these messages
Simply, you don't because:
This is working as expected and is not going to be fixed.

How to detect GKE autoupgrading a node in Stackdriver logs

We have a GKE cluster with auto-upgrading nodes. We recently noticed a node become unschedulable and eventually get deleted; we suspect it was being upgraded automatically for us. Is there a way to confirm (or rule out) in Stackdriver that this was indeed what was happening?
You can use the following advanced logs queries with Cloud Logging (previously Stackdriver) to detect upgrades to node pools:
protoPayload.methodName="google.container.internal.ClusterManagerInternal.UpdateClusterInternal"
resource.type="gke_nodepool"
and master:
protoPayload.methodName="google.container.internal.ClusterManagerInternal.UpdateClusterInternal"
resource.type="gke_cluster"
Additionally, you can control when the updates are applied with Maintenance Windows (like the user aurelius mentioned).
I think your question has already been answered in the comments. Just as an addition: automatic upgrades occur at regular intervals at the discretion of the GKE team. To get more control you can create a Maintenance Window as explained here. This is basically a time frame that you choose in which automatic upgrades should occur.
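For reference, a daily maintenance window can be set on an existing cluster roughly like this (hypothetical cluster and zone names; the start time is given in UTC, and newer gcloud releases also support recurring windows):
gcloud container clusters update my-cluster \
  --zone=us-central1-a \
  --maintenance-window=03:00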

Updating a kubernetes job: what happens?

I'm looking for a definitive answer for k8s' response to a job being updated - specifically, if I update the container spec (image / args).
If the containers are starting up, will it stop & restart them?
If the job's pod is all running, will it stop & restart?
If it's Completed, will it run it again with the new setup?
If it failed, will it run it again with the new setup?
I've not been able to find documentation on this point, but if there is some I'd be very happy to get some signposting.
The .spec.template field of a Job cannot be updated; the field is immutable. The Job would need to be deleted and recreated, which covers all of your questions.
The reasoning behind the change isn't spelled out in the GitHub commit or PR, but it was made soon after Jobs were originally added. Your stated questions are likely part of that reasoning, as making the field immutable removes ambiguity.
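As a rough illustration of the delete-and-recreate workflow (hypothetical Job name and manifest file; the API rejects an in-place change to .spec.template with an error along the lines of "field is immutable"):
kubectl delete job my-job
kubectl apply -f my-job.yaml
# or, delete and recreate in one step:
kubectl replace --force -f my-job.yaml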

(How) do node pool autoupgrades in GKE actually work?

We have a fairly large kubernetes deployment on GKE, and we wanted to make our life a little easier by enabling auto-upgrades. The documentation on the topic tells you how to enable it, but not how it actually works.
We enabled the feature on a test cluster, but no nodes were ever upgraded (although the UI kept nagging us that "upgrades are available").
The docs say it would be updated to the "latest stable" version and that it occurs "at regular intervals at the discretion of the GKE team" - neither of which is terribly helpful.
The UI always says: "Next auto-upgrade: Not scheduled"
Has someone used this feature in production and can shed some light on what it'll actually do?
What I did:
I enabled the feature on the nodepools (not the cluster itself)
I set up a maintenance window
Cluster version was 1.11.7-gke.3
Nodepools had version 1.11.5-gke.X
The newest available version was 1.11.7-gke.6
What I expected:
The nodepool would be updated to either 1.11.7-gke.3 (the default cluster version) or 1.11.7-gke.6 (the most recent version)
The update would happen in the next maintenance window
The update would otherwise work like a "manual" update
What actually happened:
Nothing
The nodepools remained on 1.11.5-gke.X for more than a week
My question
Is the nodepool version supposed to update?
If so, at what time?
If so, to what version?
I'll finally answer this myself. The auto-upgrade does work, though it took several days to a week until the version was upgraded.
There is no indication of the planned upgrade date, or any feedback other than the version updating.
It will upgrade to the current master version of the cluster.
Addendum: it still doesn't work reliably, and there is still no way to debug it when it doesn't. One piece of information I got was that the mechanism does not work if you initially provided a specific version for the node pool. As it is not possible to deduce the inner workings of the auto-upgrades, we had to resort to manually checking the status again.
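A sketch of such a manual check (hypothetical cluster and zone names):
# Compare the node-pool versions against the cluster's master version:
gcloud container clusters describe my-cluster --zone=us-central1-a --format='value(currentMasterVersion)'
gcloud container node-pools list --cluster=my-cluster --zone=us-central1-a --format='table(name,version)'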
I wanted to share two other possibilities as to why a node-pool may not be auto-upgrading or scheduled to upgrade.
One of our projects was having a similar issue where the master version had auto-upgraded to 1.14.10-gke.27 but our node-pool stayed stuck at 1.14.10-gke.24 for over a month.
Reaching a node quota
The node-pool upgrade might be failing due to a node quota (although I'm not sure the web console would say Next auto-upgrade: Not scheduled). The node upgrades documentation suggests running the following to view any failed upgrade operations:
gcloud container operations list --filter="STATUS=DONE AND TYPE=UPGRADE_NODES AND targetLink:https://container.googleapis.com/v1/projects/[PROJECT_ID]/zones/[ZONE]/clusters/[CLUSTER_NAME]"
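To check whether a Compute Engine quota could be the blocker, the regional quotas and their current usage can also be inspected (hypothetical region name):
# The output lists the region's quotas (CPUS, INSTANCES, DISKS_TOTAL_GB, ...) with usage and limits:
gcloud compute regions describe us-central1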
Automatic node upgrades are for minor+ versions only
After exhausting my troubleshooting steps, I reached out to GCP Support and opened a case (Case 23113272 for anyone working at Google). They told me the following:
Automatic node upgrade:
The node version could not necessary upgrade automatically, let me explain, exists three upgrades in a node: Minor versions (1.X), Patch releases (1.X.Y) and Security updates and bug fixes (1.X.Y-gke.N), please take a look at this documentation [2] the automatic node upgrade works from a minor version and in your case the upgrade was a security update that can't upgrade automatically.
I responded and they confirmed that automatic node upgrades will only happen for minor versions and above. I have asked them to update their documentation because (at the time of this response) it is not outlined anywhere in their node auto-upgrade documentation.
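In other words, patch and security releases like the 1.14.10-gke.27 mentioned above have to be rolled out manually. A sketch of such a manual node-pool upgrade (hypothetical cluster, zone, and node-pool names):
# Check which versions are currently offered in this zone:
gcloud container get-server-config --zone=us-central1-a
# Upgrade the node pool to a specific version (here the master's version):
gcloud container clusters upgrade my-cluster \
  --zone=us-central1-a \
  --node-pool=my-pool \
  --cluster-version=1.14.10-gke.27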
This feature replaces the VMs (Kubernetes nodes) in your node pool running the "old" Kubernetes version with VMs running the "new" version.
The node pool "upgrade" operation is done in a rolling fashion: It's not like GKE deletes all your VMs and recreates them simultaneously (except when you have only 1 node in your cluster). By default, the nodes are replaced with newer nodes one-by-one (although this might change).
GKE internally uses mostly the features of managed instance groups to manage operations on node pools.
You can find documentation on how to schedule node upgrades by specifying certain "maintenance windows" so you are impacted minimally. (This article also gives a bit more insights on how upgrades happen.)
That said, you can disable auto-upgrades and upgrade your cluster manually (although this is not recommended). Some GKE users have thousands of nodes, therefore for them, upgrading VMs one-by-one is not feasible.
For that, GKE offers an option that lets you choose "how many nodes are upgraded at a time":
gcloud container clusters upgrade CLUSTER_NAME \
--concurrent-node-count=CONCURRENT_NODE_COUNT
Documentation of this flag says:
The number of nodes to upgrade concurrently. Valid values are [1, 20]. It is a recommended best practice to set this value to no higher than 3% of your cluster size.

Run out of storage on Service Fabric scale set

I've run out of storage on my Azure Service Fabric scale sets, so I can no longer deploy any updates. I'm guessing this is because SF is keeping track of all the deployments and using up space.
Can anyone tell me if there is:
1) A way to tell Service Fabric to delete old deployments (say, older than 10 days)
2) A way to increase the storage available on the scale sets (Service Fabric is currently using the OS disk for deployments)
Regarding your first question:
There is no way to tell SF to auto-delete old packages based on age; you can either:
Do upgrades using the flag -UnregisterUnusedApplicationVersionsAfterUpgrade = $true when running the Deploy-FabricApplication.ps1 script
Update the Deploy-FabricApplication.ps1 script, or create a scheduled script that checks for unused packages older than a specific version, something like what's described in this SO answer
Regarding the second question:
Yes, you can change the disk size via an ARM template update.
But the issue might also be the size of the logs; taking a look at this question might help solve the problem without bigger disks.
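If you do go the bigger-disk route, a rough az CLI equivalent of the ARM template change, assuming managed OS disks and hypothetical resource group and scale set names (existing instances still need the updated model applied, and the OS partition may need extending inside each node):
# Bump the OS disk size in the scale set model:
az vmss update \
  --resource-group my-rg \
  --name my-sf-vmss \
  --set virtualMachineProfile.storageProfile.osDisk.diskSizeGB=256
# Roll the updated model out to the existing instances:
az vmss update-instances \
  --resource-group my-rg \
  --name my-sf-vmss \
  --instance-ids "*"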