What is the use for CRD status?

I'm currently writing a Kubernetes operator in Go using the operator-sdk.
This operator creates two StatefulSets and two Services, with some business logic around them.
I'm wondering what the CRD status is for. In my reconcile method I use the default client (i.e. r.List(ctx, &setList, opts...)) to fetch data from the cluster; should I instead store data in the status and use it later?
If so, how reliable is this status? Is it persisted? If the control plane dies, is it still available?
What about disaster recovery: what if the persisted data disappears? Doesn't that case invalidate the use of the CRD status?

The status subresource of a CRD serves the same purpose as it does for non-custom resources. While the spec defines the desired state of that particular resource (basically, I declare what I want), the status describes the current situation of the resource I declared on the cluster, and should help in understanding what differs between the desired state and the actual state.
Just as a StatefulSet's spec could say I want 3 replicas while its status says that right now only 1 of those replicas is ready and the next one is still starting, a custom resource's status may tell me the current situation of whatever I declared in the spec.
For example, using the Rook operator, I could declare I want a CephCluster made in a certain way. Since a CephCluster is a pretty complex thing (made of several StatefulSets, daemons and more), the status of the custom resource will tell me the current situation of the whole Ceph cluster: whether its health is OK, whether something requires my attention, and so on.
From my understanding of the Kubernetes API, you shouldn't rely on the status subresource to decide what your operator should do about a CRD: it is far better and less error-prone to always check the current situation of the cluster (at operator start, or when a resource is created, updated or deleted).
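To make the writing side concrete, here is a minimal sketch of how a reconciler could record what it observes into the CR's status subresource. The MyApp type, its ReadyReplicas status field, the examplev1 package and the "app" label are all hypothetical; the client calls are the standard controller-runtime ones that operator-sdk scaffolds onto the reconciler.
import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Hypothetical helper inside the reconciler: observe the cluster, then
// persist the observation to the CR's status subresource only.
func (r *MyAppReconciler) updateStatus(ctx context.Context, app *examplev1.MyApp) error {
	var sets appsv1.StatefulSetList
	if err := r.List(ctx, &sets,
		client.InNamespace(app.Namespace),
		client.MatchingLabels{"app": app.Name}); err != nil {
		return err
	}
	var ready int32
	for _, s := range sets.Items {
		ready += s.Status.ReadyReplicas
	}
	app.Status.ReadyReplicas = ready // observed state, not desired state
	// Status().Update only writes the status subresource, not spec or metadata.
	return r.Status().Update(ctx, app)
}
The reconciler itself should still derive its actions from what it observes in the cluster; the status is there for users and other tools to read.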
Last, let me quote the Kubernetes API conventions, as they explain the convention pretty well (https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#spec-and-status):
By convention, the Kubernetes API makes a distinction between the specification of the desired state of an object (a nested object field called "spec") and the status of the object at the current time (a nested object field called "status").

The specification is a complete description of the desired state, including configuration settings provided by the user, default values expanded by the system, and properties initialized or otherwise changed after creation by other ecosystem components (e.g., schedulers, auto-scalers), and is persisted in stable storage with the API object. If the specification is deleted, the object will be purged from the system.

The status summarizes the current state of the object in the system, and is usually persisted with the object by automated processes but may be generated on the fly. At some cost and perhaps some temporary degradation in behavior, the status could be reconstructed by observation if it were lost.

When a new version of an object is POSTed or PUT, the "spec" is updated and available immediately. Over time the system will work to bring the "status" into line with the "spec". The system will drive toward the most recent "spec" regardless of previous versions of that stanza. In other words, if a value is changed from 2 to 5 in one PUT and then back down to 3 in another PUT the system is not required to 'touch base' at 5 before changing the "status" to 3. In other words, the system's behavior is level-based rather than edge-based. This enables robust behavior in the presence of missed intermediate state changes.

Related

Filtering which items are stored in the Kubernetes shared informer cache

I have a Kubernetes controller written using the client-go informers package. It maintains a watch on all Pods in the cluster; there are about 15k of these objects, and their YAML representation takes around 600 MB to print (I assume their in-memory representation is not that different).
As a result, this (otherwise really small) controller watching Pods ends up with a huge memory footprint (north of 1 GiB). Even methods that you'd think offer a way of filtering, such as NewFilteredSharedInformerFactory, don't really give you a way to specify a predicate function that chooses which objects are stored in the in-memory cache.
Instead, that method in client-go offers a TweakListOptionsFunc. It helps you control ListOptions, but my predicate unfortunately cannot be expressed as a labelSelector or fieldSelector. I need to drop the objects with a predicate function when they arrive at the controller.
Note: the predicate I have is something like "Pods that have an ownerReference to a DaemonSet" (which is not possible with fieldSelectors; that's another question of mine), and there's no labelSelector that can work in my scenario.
How would I go about configuring an informer on Pods that only have DaemonSet owner references, to reduce the memory footprint of my controller?
Here's an idea: get a list of all the DaemonSets in your cluster, read their spec.selector.matchLabels fields to retrieve the labels that the DaemonSet Pods are bound to have, and use those labels as part of your TweakListOptionsFunc with logic like:
Pods with label1 OR label2 OR label3 ...
I know it's a bit of toil, but it seems to be a working approach. I believe there isn't a way to filter on arbitrary fields in client-go.
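A rough sketch of that idea, assuming the DaemonSet selectors share a common key so they can be collapsed into one set-based selector (label selectors can OR over the values of a single key, e.g. name in (...), but not over arbitrary key=value pairs). The clientset, resync period and selector below are placeholders:
import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
)

func newDaemonSetPodFactory(cs kubernetes.Interface) informers.SharedInformerFactory {
	// Placeholder: built from the DaemonSets' spec.selector.matchLabels.
	selector := "name in (ds-a,ds-b,ds-c)"
	return informers.NewFilteredSharedInformerFactory(cs, 30*time.Minute, metav1.NamespaceAll,
		func(opts *metav1.ListOptions) {
			// Applied to every List/Watch the factory's informers issue,
			// so the filtering happens server-side.
			opts.LabelSelector = selector
		})
}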
It appears that today, if you use SharedInformers, there's no way to filter which objects to keep in the shared cache and which ones to discard.
I have found an interesting code snippet in the kube-state-metrics project that opts into a lower layer of abstraction, initiating Watch calls directly (which would normally be considered an anti-pattern), and uses watch.Filter to decide whether or not to return an object from a Watch() call to a cache/reflector.
That said, many controller authors might choose not to go down this path, as it requires you to set up your own cache/reflector/indexer around the Watch() call. Furthermore, projects like controller-runtime don't even give you access to this low-level machinery, as far as I know.
Another way of reducing a controller's memory footprint is field/data erasure on the structs (instead of discarding objects altogether). This is possible in newer versions of client-go through cache.TransformFunc, which lets you delete fields of irrelevant objects (though these objects would still consume some memory). This one is more of a band-aid that can make your situation better.
In my case, I mostly needed to watch DaemonSet Pods in certain namespaces, so I refactored the code from using 1 informer (watching all namespaces) to N namespace-scoped informers running concurrently.
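For reference, a minimal sketch of those two ideas combined: one namespace-scoped factory per namespace, plus a cache.TransformFunc that strips bulky fields before objects land in the cache. The namespaces, the dropped fields and the resync period are placeholders; SetTransform needs a reasonably recent client-go and must be called before the informers are started.
import (
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

func newPodInformers(cs kubernetes.Interface, namespaces []string) []cache.SharedIndexInformer {
	var result []cache.SharedIndexInformer
	for _, ns := range namespaces {
		factory := informers.NewSharedInformerFactoryWithOptions(cs, 30*time.Minute, informers.WithNamespace(ns))
		inf := factory.Core().V1().Pods().Informer()
		// Erase fields this controller never reads so the cached copies stay small.
		_ = inf.SetTransform(func(obj interface{}) (interface{}, error) {
			if pod, ok := obj.(*corev1.Pod); ok {
				pod.ManagedFields = nil
				pod.Annotations = nil // assumption: annotations are never read here
			}
			return obj, nil
		})
		result = append(result, inf)
	}
	return result
}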

How to prevent repeated reconcile calls

I'm developing a Kubernetes controller. The desired state for this controller is captured in CRD-A, and it then creates a Deployment and a StatefulSet to achieve the actual state. Currently I'm using server-side apply to create/update these Deployments and StatefulSets.
The controller establishes watches on CRD-A as well as on the Deployments and StatefulSets. This is to ensure that if there is a change in a Deployment/StatefulSet, reconcile() is notified and takes action to fix it. Currently reconcile() always calls server-side apply to create/update, and this leads to another watch event (the resourceVersion changes on every server-side apply), resulting in repeated/infinite calls to reconcile().
One approach I've been thinking about is to leverage 'generation' on the Deployment/StatefulSet, i.e. the controller maintains an in-memory map of (k8s object -> generation), and on reconcile() compares the value in this map to what is present in the indexed informer cache. Do you see any concerns with this approach? And are there better alternatives to prevent repeated/infinite reconcile() calls?
One idea could be to not apply the objects if there are no changes to be applied.
That leaves you with the problem of detecting whether the objects are already up to date, and it's hard to tell whether that's doable in your case.
Ideally, if the object you provide in the server-side apply has not changed, neither the generation nor the resourceVersion of the object should change.
But sometimes that's not the case; see this GitHub issue: https://github.com/kubernetes/kubernetes/issues/95460
Fortunately, the generation does stay the same in that case (the API server only bumps metadata.generation when the spec changes), so you can leverage this field to avoid the reconcile dead loop by adding a GenerationChangedPredicate filter to your controller, which skips reconciling when the generation does not change. It is often used in conjunction with the LabelChangedPredicate, which filters out events where the object's labels do not change.
Here's how you would set up your controller with these two predicates:
ctrl.NewControllerManagedBy(mgr).
For(&Object{}).
Owns(&appsv1.StatefulSet{}).
Owns(&appsv1.Deployment{}).
// together with Server Side Apply
// these predicates prevent meaningless reconciliations from being triggered
WithEventFilter(predicate.Or(predicate.GenerationChangedPredicate{}, predicate.LabelChangedPredicate{})).
Complete(r)

Cosmos DB Change Feeds in a Kubernetes Cluster with arbitrary number of pods

I have a collection in my Cosmos database that I would like to observe for changes. There are many documents (official and unofficial) explaining how to do this. There is one thing, though, that I cannot get to work in a reliable way: how do I get the same changes delivered to multiple instances when I don't have any common reference for instance names?
What do I mean by this? Well, I'm running my workloads in a Kubernetes cluster (AKS). I have a variable number of instances within the cluster that should observe my collection. In order for change feeds to work properly, I have to have a unique instance name for each instance. The only candidate I have is the pod name, which is usually of the form <deployment-name>-<random string>, e.g. pod-5f597c9c56-lxw5b.
If I use the pod name as the instance name, the instances do not all receive the same changes (which is my requirement); only one instance will receive each change (see https://learn.microsoft.com/en-us/azure/cosmos-db/change-feed-processor#dynamic-scaling). What I can do instead is use the pod name as the feed name, and then all instances get the same changes. This is what I fear will bite me in the butt at some point: when I peek into the lease container, I can see a set of documents per feed name. As pod names come and go (the random string part of the name), I fear the container will grow over time, accumulating a heap of garbage. I know Cosmos can handle huge workloads, but you know, I like to keep things tidy.
How can I keep this thing clean and tidy? I really don't want to invent (or reuse, for that matter!) some protocol between my instances to vote on which instance gets which name out of a finite set of names.
One "simple" solution would be to build my own instance names, if AKS or Kubernetes held some sort of index for my pods. I know StatefulSets give me that, but I don't want to use StatefulSets, as the pods themselves aren't really stateful (except for this particular aspect!).
There is a new Change Feed pull model (which is in preview at this time). The main difference is that you read changes at your own pace and manage the state (continuation tokens) yourself, instead of relying on the lease container.
In your case, it looks like you don't need parallelization (you want all instances to receive everything). The important part would be to design a state-storing model that can maintain the continuation tokens (or not; maybe you don't care about continuing if a pod goes down and then restarts).
I would suggest you proceed with using the pod name as the unique ID. If you are concerned about sprawl of the data, you could monitor the container and devise a clean-up mechanism for the metadata.
In order to have at-least-once delivery, there is going to need to be metadata persisted somewhere to track ACK-ed items, the position within a partition, etc. I suspect there could be a bit of work to get the change feed processor to give you at-least-once delivery once you consider pod interruption/re-scheduling during data flow.
As another option, Azure offers an implementation of checkpoint-based message sharing from partitioned Event Hubs via EventProcessorClient, which also persists a bit of metadata in a storage account.

Updating Metadata Annotations

I am using Kubebuilder to create a Kubernetes operator. When an object of my kind is created, I have to parse the spec and update the objects based on a few calculations.
From what I can tell, I can update either the status of the object, the metadata, or a managed field (I may be wrong?). It appears that the sigs.k8s.io/controller-runtime/pkg/client library is responsible for how these fields are updated (I'm not completely sure). I am having trouble understanding the docs.
I have the following questions:
Is there a guide to best practices on where to store configuration on the object, between status, metadata (labels or annotations), and managed fields?
How do I update/patch the annotations of an object, similar to how I would use r.Status().Update(ctx, &thing); to update the status?
The Kubebuilder docs are a bit raw, but they are nonetheless a handy guide when building CRDs and controllers with Kubebuilder. They walk you through a fairly detailed example which is great to study and refer back to, to see how to do certain things.
The answer to your question generally is: "it depends." What values are you calculating, and why? Why do you need to store them on the object? Is the lifecycle of this data coupled to the lifecycle of this object, or might this computed data need to live on and be used by other controllers even when the object is deleted? In general, is anything going to interact with those values? What is it going to do with them?
If nothing aside from the reconciliation controller for the CRD is going to interact with the data you're storing, consider putting it in the object's Status.
Doing r.Status().Update(ctx, &thing) will avoid triggering any side effects, as it only persists changes you've made to the object's Status subresource, rather than its spec or metadata.
A common thing to do with custom resources is to set and remove finalizers, which live in the object's metadata.
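For the second question, one approach (a sketch, not the only way) is to compute a merge patch from a deep copy and patch only the metadata change. This is a fragment from inside a Reconcile method; the annotation key and computedValue are purely hypothetical, and "sigs.k8s.io/controller-runtime/pkg/client" is assumed to be imported as client.
// Snapshot the object before mutating it, so the patch contains only the diff.
base := thing.DeepCopy()
if thing.Annotations == nil {
	thing.Annotations = map[string]string{}
}
thing.Annotations["example.com/computed-value"] = computedValue // hypothetical key and value
// Unlike r.Status().Update, this writes to the main resource (metadata here),
// leaving spec and status as they are.
if err := r.Patch(ctx, &thing, client.MergeFrom(base)); err != nil {
	return ctrl.Result{}, err
}
Using a patch rather than a full Update keeps the write scoped to the fields you changed and reduces the chance of conflicts with other writers.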

Graphql Schema update rollback

We are moving some of our APIs to GraphQL and would like to know how to handle rollback of a deployed package (schema), and the best practices around it.
To be more specific, let's say we have a schema S with 3 fields, and then we added a 4th field "A". Now for some reason we cannot go forward with this package and field "A", so we have to roll back the package so that the schema no longer has field "A".
Now some consumer might ask for this field "A" and get an error. We could of course ask our clients to update, but there is a time gap during which we might have failed requests.
How do we handle this scenario, specifically an urgent rollback within a few hours or a day?
In general, you should avoid removing fields without warning to avoid the exact scenario you describe.
As your schema evolves, it's not uncommon to have some fields that are no longer needed. For example, rather than introducing a drastic change to a particular field (moving from a nullable to a non-nullable return type, adding required arguments, etc.), we may opt to add another field and encourage clients to transition to using that one instead. In such scenarios, we want to eventually remove the original field. The safest way to do so is to deprecate the field first. Using SDL, we can do so using a directive:
fieldA: String @deprecated(reason: "Use fieldB instead!")
After a certain amount of time, you can then remove the field entirely. How long you wait to remove the field depends on your team and the expectations you've communicated around handling deprecated fields. For example, you may find it helpful to set a deadline, by which point all clients are expected to have stopped using any deprecated fields. This works well as long as your client teams have the bandwidth to handle such technical debt.
A deprecated field's resolver can be changed to return a null value (if the field itself is nullable) or some minimal mock data. This prevents making unnecessary API or database calls, while still ensuring client requests don't result in an error.
In the context of your question, this means you should probably avoid rolling back to a previous release and instead follow the process outlined above for the fields you want to remove.
Alternatively, you could consider versioning. GraphQL generally shies away from the concept of versioning. As the official site explains:
Why do most APIs version? When there's limited control over the data that's returned from an API endpoint, any change can be considered a breaking change, and breaking changes require a new version. If adding new features to an API requires a new version, then a tradeoff emerges between releasing often and having many incremental versions versus the understandability and maintainability of the API.
In contrast, GraphQL only returns the data that's explicitly requested, so new capabilities can be added via new types and new fields on those types without creating a breaking change. This has led to a common practice of always avoiding breaking changes and serving a versionless API.
With that in mind, it's also feasible to still implement versioning with GraphQL by serving different schemas from different endpoints. While it's costly and usually unnecessary to go that route, it may be the right solution for you and your team, particularly if you expect to have to do similar rollbacks in the future.
You cannot do much about this within GraphQL itself, since the field needs to be present in the GraphQL type system. There may be libraries that allow you to control whether the field can appear in a query or not, but there's no way of allowing a non-existent field in a query.
What you can do is opt for a Blue-Green deployment strategy, in which both versions run at the same time.
Let's say Green is the version with field "A" and Blue is the version without field "A". When your clients are updated, they start requesting the Blue version, and once all your clients are updated, shut down the Green version (the one with field "A").