How to prevent repeated reconcile calls - kubernetes

I'm developing a Kubernetes controller. The desired state for this controller is captured in CRD-A, and it then creates a Deployment and a StatefulSet to achieve the actual state. Currently I'm using server-side apply to create/update these Deployments and StatefulSets.
The controller establishes watches on CRD-A as well as on the Deployments and StatefulSets. This is to ensure that if there is a change in a Deployment/StatefulSet, reconcile() is notified and takes action to fix it. Currently reconcile() always calls server-side apply to create/update, and this triggers another watch event (the resource version changes on every server-side apply), resulting in repeated/infinite calls to reconcile().
One approach I've been thinking about is to leverage 'generation' on the Deployment/StatefulSet: the controller would maintain an in-memory map of (k8s object -> generation) and, on reconcile(), compare the value in this map to what is present in the indexed informer cache. Do you see any concerns with this approach? And are there better alternatives to prevent repeated/infinite reconcile() calls?

One idea could be to not apply the objects if there are no changes to be applied.
This leaves you with the problem of detecting whether the objects are already up to date, and it's hard to tell whether that's doable in your case.

Ideally, if the object you provide in the server-side apply has not changed, neither the generation nor the resourceVersion of the object should change.
But sometimes that's not the case; see this GitHub issue: https://github.com/kubernetes/kubernetes/issues/95460
Fortunately, the generation always stays the same, so yes, you can leverage this field to avoid the reconcile dead loop by adding a GenerationChangedPredicate filter to your controller, which skips reconciling when the generation does not change. It is often used in conjunction with the LabelChangedPredicate, which filters out events where the object's labels do not change.
Here's how you would set up your controller with these two predicates:
import (
    appsv1 "k8s.io/api/apps/v1"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/predicate"
)

ctrl.NewControllerManagedBy(mgr).
    For(&Object{}).
    Owns(&appsv1.StatefulSet{}).
    Owns(&appsv1.Deployment{}).
    // Together with server-side apply, these predicates prevent
    // meaningless reconciliations from being triggered.
    WithEventFilter(predicate.Or(predicate.GenerationChangedPredicate{}, predicate.LabelChangedPredicate{})).
    Complete(r)


Filtering which items are stored in the Kubernetes shared informer cache

I have a Kubernetes controller written using the client-go informers package. It maintains a watch on all Pods in the cluster; there are about 15k of these objects, and their YAML representation takes around 600 MB to print (I assume their in-memory representation is not that different).
As a result, this (otherwise really small) controller watching Pods ends up with a huge memory footprint (north of 1 GiB). Even methods that you'd think offer a way of filtering, such as NewFilteredSharedInformerFactory, don't really give you a way to specify a predicate function that chooses which objects are stored in the in-memory cache.
Instead, that method in client-go offers a TweakListOptionsFunc. It helps you control the ListOptions, but my predicate unfortunately cannot be expressed with a labelSelector or fieldSelector. I need to drop the objects through a predicate function when they arrive at the controller.
Note: the predicate I have is something like "Pods that have an ownerReference to a DaemonSet" (which is not possible with fieldSelectors – also another question of mine), and there's no labelSelector that can work in my scenario.
How would I go about configuring an informer on Pods that only have DaemonSet owner references, to reduce the memory footprint of my controller?
Here's an idea: you can get a list of all the DaemonSets in your cluster, read the spec.selector.matchLabels field of each to retrieve the labels that the DaemonSet's pods are bound to have, and use those labels as part of your TweakListOptionsFunc, with logic like:
Pods with label1 OR label2 OR label3 ...
I know it's a bit of toil, but it seems to be a working approach. I believe there isn't a way to filter on arbitrary fields in client-go.
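As a rough sketch of that approach in Go (the label key and values here are hypothetical; in practice you would collect them from the DaemonSets' spec.selector.matchLabels): note that a single label selector cannot express an OR across different keys, so this works best when the DaemonSet pods share one common key and only the values differ, using the set-based "in" operator.

import (
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
)

// newDaemonSetPodInformerFactory builds a factory whose informers only
// list/watch Pods matching the collected DaemonSet labels.
func newDaemonSetPodInformerFactory(clientset kubernetes.Interface) informers.SharedInformerFactory {
    tweak := func(options *metav1.ListOptions) {
        // Hypothetical: assumes all the DaemonSet pods carry the "app" key.
        options.LabelSelector = "app in (node-exporter,fluentd,cilium)"
    }
    return informers.NewFilteredSharedInformerFactory(
        clientset, 30*time.Minute, metav1.NamespaceAll, tweak)
}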
It appears that today, if you use SharedInformers, there's no way to filter which objects to keep in the shared cache and which ones to discard.
I have found an interesting code snippet in the kube-state-metrics project that opts into the lower layer of abstraction by initiating Watch calls directly (which would normally be considered an anti-pattern) and uses watch.Filter to decide whether or not to return an object from a Watch() call (to a cache/reflector).
That said, many controller authors might choose not to go down this path, as it requires you to set up your own cache/reflector/indexer around the Watch() call. Furthermore, projects like controller-runtime don't even give you access to this low-level machinery, as far as I know.
Another way to reduce a controller's memory footprint is field/data erasure on the structs (instead of discarding objects altogether). This is possible in newer versions of client-go through cache.TransformFunc, which lets you delete fields of irrelevant objects (though these objects would still consume some memory). This one is more of a band-aid that can make your situation better.
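For illustration, here is a minimal sketch of the transform approach, assuming a client-go version that has SetTransform on SharedInformer, a factory like the one above, and corev1 = "k8s.io/api/core/v1"; the fields being erased are just examples:

podInformer := factory.Core().V1().Pods().Informer()
// Must be set before the informer is started; objects still enter the
// cache, but with the heavy fields dropped first.
err := podInformer.SetTransform(func(obj interface{}) (interface{}, error) {
    if pod, ok := obj.(*corev1.Pod); ok {
        pod.ManagedFields = nil // often the largest chunk of metadata
        pod.Annotations = nil   // assuming your logic never reads these
    }
    return obj, nil
})
if err != nil {
    panic(err) // e.g. the informer was already started
}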
In my case, I mostly needed to watch for DaemonSet Pods in certain namespaces, so I refactored the code from using 1 informer (watching all namespaces) to N namespace-scoped informers running concurrently.

What is the use for CRD status?

I'm currently writing a kubernetes operator in go using the operator-sdk.
This operator creates two StatefulSets and two Services, with some business logic around them.
I'm wondering what the CRD status is about. In my reconcile method I use the default client (i.e. r.List(ctx, &setList, opts...)) to fetch data from the cluster; should I instead store data in the status to use it later?
If so, how reliable is this status? I mean, is it persisted? If the control plane dies, is it still available?
What about disaster recovery: what if the persisted data disappears? Doesn't that case invalidate the use of the CRD status?
The status subresource of a CRD can be considered to have the same objective as on non-custom resources. While the spec defines the desired state of that particular resource (basically, I declare what I want), the status explains the current situation of the resource I declared on the cluster, and should help in understanding what differs between the desired state and the actual state.
Just as a StatefulSet's spec could say I want 3 replicas while its status says that right now only 1 of those replicas is ready and the next one is still starting, a custom resource's status may tell me the current situation of whatever I declared in the spec.
For example, using the Rook operator, I could declare that I want a CephCluster made in a certain way. Since a CephCluster is a pretty complex thing (made of several StatefulSets, daemons, and more), the status of the custom resource will tell me the current situation of the whole Ceph cluster: whether its health is OK or something requires my attention, and so on.
From my understanding of the Kubernetes API, you shouldn't rely on the status subresource to decide what your operator should do with a CRD, as it is way better and less error-prone to always check the current situation of the cluster (at operator start, or when a resource is created, updated, or deleted).
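To make the spec/status split concrete, here is a minimal, hypothetical custom resource type in the Kubebuilder style (all names are illustrative only):

// MyAppSpec is the desired state the user declares.
type MyAppSpec struct {
    Replicas int32 `json:"replicas"`
}

// MyAppStatus records what the operator last observed; it is written for
// humans and tooling to read, not used as the operator's source of truth.
type MyAppStatus struct {
    ReadyReplicas int32  `json:"readyReplicas"`
    Phase         string `json:"phase,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
type MyApp struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   MyAppSpec   `json:"spec,omitempty"`
    Status MyAppStatus `json:"status,omitempty"`
}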
Last, let me quote the Kubernetes API conventions, as they explain the distinction pretty well (https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#spec-and-status):
By convention, the Kubernetes API makes a distinction between the specification of the desired state of an object (a nested object field called "spec") and the status of the object at the current time (a nested object field called "status").

The specification is a complete description of the desired state, including configuration settings provided by the user, default values expanded by the system, and properties initialized or otherwise changed after creation by other ecosystem components (e.g., schedulers, auto-scalers), and is persisted in stable storage with the API object. If the specification is deleted, the object will be purged from the system.

The status summarizes the current state of the object in the system, and is usually persisted with the object by automated processes but may be generated on the fly. At some cost and perhaps some temporary degradation in behavior, the status could be reconstructed by observation if it were lost.

When a new version of an object is POSTed or PUT, the "spec" is updated and available immediately. Over time the system will work to bring the "status" into line with the "spec". The system will drive toward the most recent "spec" regardless of previous versions of that stanza. In other words, if a value is changed from 2 to 5 in one PUT and then back down to 3 in another PUT the system is not required to 'touch base' at 5 before changing the "status" to 3. In other words, the system's behavior is level-based rather than edge-based. This enables robust behavior in the presence of missed intermediate state changes.

Cosmos DB Change Feeds in a Kubernetes Cluster with arbitrary number of pods

I have a collection in my Cosmos database that I would like to observe for changes. I have many documents (official and unofficial) explaining how to do this. There is one thing, though, that I cannot get to work in a reliable way: how do I deliver the same changes to multiple instances when I don't have any common reference for instance names?
What do I mean by this? Well, I'm running my workloads in a Kubernetes cluster (AKS). I have a variable number of instances within the cluster that should observe my collection. In order for change feeds to work properly, I have to have a unique instance name for each instance. The only candidate I have is the pod name. It's usually of the form <deployment-name>-<random string>, e.g. pod-5f597c9c56-lxw5b.
If I use the pod name as the instance name, all instances do not receive the same changes (which is my requirement); only one instance will receive each change (see https://learn.microsoft.com/en-us/azure/cosmos-db/change-feed-processor#dynamic-scaling). What I can do is use the pod name as the feed name instead; then all instances get the same changes. This is what I fear will bite me in the butt at some point: when I peek into the lease container, I can see a set of documents per feed name. As pod names come and go (the random string part of the name), I fear the container will grow over time, accumulating a heap of garbage. I know Cosmos can handle huge workloads, but, you know, I like to keep things tidy.
How can I keep this thing clean and tidy? I really don't want to invent (or reuse, for that matter!) some protocol between my instances to vote on which instance gets which name out of a finite set of names.
One "simple" solution would be to build my own instance names, if AKS or Kubernetes held some "index" of some sort for my pods. I know StatefulSets give me that, but I don't want to use StatefulSets, as the pods themselves aren't really stateful (except for this particular aspect!).
There is a new Change Feed pull model (which is in preview at this time).
The key difference is that the pull model leaves state management to you: there is no lease container assigning changes to instances, and you read changes at your own pace using continuation tokens.
In your case, it looks like you don't need parallelization (you want all instances to receive everything). The important part would be to design a state-storing model that can maintain the continuation tokens (or not; maybe you don't care about continuing if a pod goes down and then restarts).
I would suggest that you proceed with using the pod name as the unique ID. If you are concerned about sprawl of the data, you could monitor the container and devise a clean-up mechanism for the metadata.
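For what it's worth, here is a small sketch of how a pod can learn its own name to use as the instance ID; the POD_NAME variable is an assumption and would be injected via the Kubernetes downward API (fieldRef: metadata.name):

import (
    "log"
    "os"
)

func instanceName() string {
    // Assumed to be injected via the downward API:
    //   env:
    //   - name: POD_NAME
    //     valueFrom: { fieldRef: { fieldPath: metadata.name } }
    if name := os.Getenv("POD_NAME"); name != "" {
        return name
    }
    // The hostname defaults to the pod name in Kubernetes.
    if name, err := os.Hostname(); err == nil && name != "" {
        return name
    }
    log.Fatal("could not determine a unique instance name")
    return ""
}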
In order to have at-least-once delivery, some metadata has to be persisted somewhere to track which items have been ACK-ed, the position in a partition, etc. I suspect there could be a bit of work to get the change feed processor to give you at-least-once delivery once you consider pod interruption/re-scheduling during data flow.
As another option, Azure offers an implementation of checkpoint-based message sharing from partitioned Event Hubs via EventProcessorClient. With EventProcessorClient, a bit of metadata is likewise kept in a storage account.

Updating Metadata Annotations

I am using Kubebuilder to create a Kubernetes operator. When an object of my kind is created, I have to parse the spec and update the object based on a few calculations.
From what I can tell, I can either update the status of the object, the metadata, or a managed field (I may be wrong?). It appears that the sigs.k8s.io/controller-runtime/pkg/client library is responsible for how these fields are updated (I'm not completely sure). I am having trouble understanding the docs.
I have the following questions:
Is there a guide to best practices about where to store configuration on the object, between status, metadata (labels or annotations), and managed fields?
How do I update/patch the annotations of an object similar to how I would use r.Status().Update(ctx, &thing); to update the status?
The Kubebuilder docs are a bit raw, but they are nonetheless a handy guide when building CRDs and controllers with Kubebuilder. They walk you through a fairly detailed example, which is great to study and refer back to when you need to see how to do certain things.
The answer to your question generally is: it depends. What values are you calculating, and why? Why do you need to store them on the object? Is the lifecycle of this data coupled to the lifecycle of this object, or might this computed data need to live on and be used by other controllers even when the object is deleted? In general, is anything going to interact with those values? What is it going to do with them?
If nothing aside from the reconciliation controller for the CRD is going to interact with the data you're storing, consider putting it within the object's Status.
Doing r.Status().Update(ctx, &thing) will avoid triggering any side-effects as it will only persist changes you've made to the object's Status subresource, rather than its spec or metadata.
A common thing to do with custom resources is to set and remove finalizers, which live in the object's metadata.
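As a minimal sketch of updating annotations (the annotation key, the computedValue variable, and the thing object are hypothetical; thing is assumed to be your custom resource already fetched in Reconcile), you can send a merge patch against the main resource, analogous to the status update:

// Capture the prior state as the patch base, then mutate the metadata.
base := thing.DeepCopy()
if thing.Annotations == nil {
    thing.Annotations = map[string]string{}
}
thing.Annotations["example.com/computed-value"] = computedValue
if err := r.Patch(ctx, &thing, client.MergeFrom(base)); err != nil {
    return ctrl.Result{}, err
}

Unlike a full r.Update, this sends only the diff between base and the mutated object, so it touches nothing but the annotation you changed.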

How to prevent lost updates on the views in a distributed CQRS/ES system?

I have a CQRS/ES application where some of the views are populated by events from multiple aggregate roots.
I have a CashRegisterActivated event on the CashRegister aggregate root and a SaleCompleted event on the Sale aggregate root. Both events are used to populate the CashRegisterView. The CashRegisterActivated event creates the CashRegisterView or sets it active in case it already exists. The SaleCompleted event sets the last sale sequence number and updates the cash in the drawer.
When two of these events arrive within milliseconds of each other, the first update is overwritten by the last one. So that's a lost update.
I already have a few possible solutions in mind, but they all have their drawbacks:
Marshal all event processing for a view, or for one record of a view, onto the same thread. This works fine on a single node, but once you scale out, things start to get complex. You need to ensure all events for a view are delivered to the same node, and you need to migrate to another node when it goes down. This requires a smart load balancer which is aware of the events and the views.
Lock the record before updating to make sure no other threads or nodes modify it in the meantime. This will probably work fine, but it means giving up on a lock-free system. Threads will sit there waiting for a lock to be freed. Locking also means increased latency when I scale out the data store (if I'm not mistaken).
For the record: I'm using Java with Apache Camel, RabbitMQ to deliver the events and MariaDB for the view data store.
I have a CQRS/ES application where some of the views in the read model are populated by events from multiple aggregate roots.
That may be a mistake.
Driving a process off of an isolated event is fine, but composing a view normally requires a history rather than a single event.
A more likely implementation would be to use the arrival of the events to mark the current view stale, and to use a single writer to update the view from the history of events produced by the aggregate(s) concerned.
And that requires a smart messaging solution. I thought "Smart endpoints and dumb pipes" would be a good practice for CQRS/ES systems.
It is. The endpoints just need to be smart enough to understand when they need histories, or when events are sufficient.
A view, after all, is just a snapshot. You take inputs (X.history, Y.history), produce a snapshot, write the snapshot into your view store (possibly with meta data describing the positions in the histories that were used), and you are done.
The events are just used to indicate to the writer that a previous snapshot is stale. You don't use the event to extend the history, you use the event to tell the writer that a history has changed.
You don't lose updates with multiple events, because the event itself, with all of its state, is captured in the history. It's the history that is used to build the event-sourced view.
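A toy sketch of that flow in Go (loadHistory, project, and saveSnapshot are hypothetical helpers): events only signal staleness, and a single writer rebuilds each stale view from the full history, so concurrent events cannot overwrite one another's updates.

stale := make(chan string, 1024) // IDs of views that need a rebuild

// The single writer: fold the complete history into a fresh snapshot.
go func() {
    for viewID := range stale {
        history := loadHistory(viewID) // hypothetical: all events behind this view
        snapshot := project(history)   // hypothetical: fold history into a snapshot
        saveSnapshot(viewID, snapshot) // hypothetical: overwrite the view store
    }
}()

// Event handlers never apply the event themselves; they only mark staleness:
//   stale <- event.ViewID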
Konrad Garus wrote
... handling events coming from a single source is easier, but more importantly because a DB-backed event store trivially guarantees ordering and has no issues with lost or duplicate messages.
A solution could be to detect when this situation happens and do a retry.
To do this:
Add to each table an aggregate version number which is kept up to date.
To each update statement, add "aggr_version = n-1" to the where clause (where n is the version of the event being processed).
When the update statement modifies no records, it probably means that the event was processed out of order, and a retry strategy can be performed (see the sketch below).
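A sketch of that version check in Go with database/sql (the table, columns, and parameters are made up to match the example):

import (
    "context"
    "database/sql"
    "errors"
)

var ErrOutOfOrder = errors.New("event out of order; requeue for retry")

// applySaleCompleted updates the view only if it is exactly one version behind.
func applySaleCompleted(ctx context.Context, db *sql.DB, registerID string, version, saleSeq int64, cash float64) error {
    res, err := db.ExecContext(ctx,
        `UPDATE cash_register_view
            SET cash_in_drawer = ?, last_sale_seq = ?, aggr_version = ?
          WHERE register_id = ? AND aggr_version = ?`,
        cash, saleSeq, version, registerID, version-1)
    if err != nil {
        return err
    }
    rows, err := res.RowsAffected()
    if err != nil {
        return err
    }
    if rows == 0 {
        // Either the event arrived out of order or a concurrent writer won;
        // the caller should retry, e.g. via the retry queue.
        return ErrOutOfOrder
    }
    return nil
}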
The problem is that this adds complexity and is hard to test. The performance bottleneck is very likely in the database, so a single process with a failover solution will probably be the easiest solution.
Although I see you asked how to handle these things at scale: I've seen people recommend using a single-threaded approach until it actually becomes a problem, and then addressing it.
I would have a process manager per view model, draw the events you need from the store, and write them single-threaded.
I combined the answers of VoiceOfUnreason and StefRave into something I think might work. Populating a view from multiple aggregate roots does indeed feel wrong. We have out-of-order detection with a retry queue, so an event on an aggregate root will only be processed once the last completely processed event is version n-1.
So if I create new aggregate roots for the views that would otherwise be populated by multiple aggregate roots (call them aggregate views), all updates for a view will be serialized without row locking or thread synchronisation. We already have conflict detection with a retry mechanism on the aggregate roots, which takes care of concurrency on the command side. So if I just construct these aggregate roots from the events I'm currently using to populate the aggregate views, I will have solved the lost-update problem.
Thoughts on this solution?