Filtering items stored in Kubernetes shared informer cache

I have a Kubernetes controller written using the client-go informers package. It maintains a watch on all Pods in the cluster; there are about 15k of these objects, and their YAML representation takes around 600 MB to print (I assume their in-memory representation is not much smaller).
As a result, this (otherwise really small) controller watching Pods ends up with a huge memory footprint (north of 1 GiB). Even methods that you'd think offer a way of filtering, such as NewFilteredSharedInformerFactory, don't really give you a way to specify a predicate function that chooses which objects are stored in the in-memory cache.
Instead, that method in client-go offers a TweakListOptionsFunc. It lets you control the ListOptions, but my predicate unfortunately cannot be expressed as a labelSelector or fieldSelector. I need to drop the objects through a predicate function as they arrive at the controller.
Note: the predicate I have is something like "Pods that have an ownerReference to a DaemonSet" (which is not possible with fieldSelectors, and is the subject of another question of mine), and there's no labelSelector that works in my scenario.
How would I go about configuring an informer that keeps only the Pods with DaemonSet owner references, to reduce the memory footprint of my controller?

Here's an idea: get a list of all the DaemonSets in your cluster and read the spec.selector.matchLabels field to retrieve the labels that the DaemonSet's Pods are bound to have. Use those labels as part of your TweakListOptionsFunc with logic like:
Pods with label1 OR label2 OR label3 ...
I know it's a bit of toil, but it seems to be a workable approach. I don't believe client-go gives you a way to filter on arbitrary fields.
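A rough sketch of the wiring, assuming (purely for illustration) that the DaemonSet Pods all share a common label key such as app; note that a single labelSelector can express app in (v1,v2,...) but not an OR across different label keys:

package controller

import (
    "context"
    "fmt"
    "strings"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
)

// daemonSetPodInformerFactory builds a factory whose informers only list/watch
// objects matching the label values collected from the DaemonSet selectors.
func daemonSetPodInformerFactory(client kubernetes.Interface) (informers.SharedInformerFactory, error) {
    dsList, err := client.AppsV1().DaemonSets(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
    if err != nil {
        return nil, err
    }
    var values []string
    for _, ds := range dsList.Items {
        if ds.Spec.Selector == nil {
            continue
        }
        if v, ok := ds.Spec.Selector.MatchLabels["app"]; ok { // "app" is an assumption
            values = append(values, v)
        }
    }
    tweak := func(opts *metav1.ListOptions) {
        // Real code should handle the case where values is empty.
        opts.LabelSelector = fmt.Sprintf("app in (%s)", strings.Join(values, ","))
    }
    return informers.NewFilteredSharedInformerFactory(client, 30*time.Second, metav1.NamespaceAll, tweak), nil
}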

It appears that today, if you use SharedInformers, there's no way to filter which objects to keep in the shared cache and which ones to discard.
I have found an interesting code snippet in the kube-state-metrics project that opts into the lower layer of abstraction and initiates Watch calls directly (which would normally be considered an anti-pattern), and uses watch.Filter to decide whether or not to return an object from a Watch() call to the cache/reflector.
That said, many controller authors might choose not to go down this path, as it requires you to maintain your own cache/reflector/indexer around the Watch() call. Furthermore, projects like controller-runtime don't even give you access to this low-level machinery, as far as I know.
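For illustration, here is a minimal sketch of that lower-level approach (the helper names are mine, and a real implementation would also have to drop the same objects from the List results that seed the cache):

package controller

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/apimachinery/pkg/watch"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
)

// ownedByDaemonSet is the predicate: keep only Pods with a DaemonSet ownerReference.
func ownedByDaemonSet(obj runtime.Object) bool {
    pod, ok := obj.(*corev1.Pod)
    if !ok {
        return false
    }
    for _, ref := range pod.OwnerReferences {
        if ref.Kind == "DaemonSet" {
            return true
        }
    }
    return false
}

// daemonSetPodListWatch can be handed to your own reflector/informer; its Watch
// results are filtered before they ever reach the cache.
func daemonSetPodListWatch(client kubernetes.Interface) *cache.ListWatch {
    return &cache.ListWatch{
        ListFunc: func(opts metav1.ListOptions) (runtime.Object, error) {
            return client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), opts)
        },
        WatchFunc: func(opts metav1.ListOptions) (watch.Interface, error) {
            w, err := client.CoreV1().Pods(metav1.NamespaceAll).Watch(context.TODO(), opts)
            if err != nil {
                return nil, err
            }
            // Drop events for Pods that are not owned by a DaemonSet.
            return watch.Filter(w, func(e watch.Event) (watch.Event, bool) {
                return e, ownedByDaemonSet(e.Object)
            }), nil
        },
    }
}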
Another way of reducing a controller's memory footprint is field/data erasure on the structs (instead of discarding objects altogether). This is possible in newer versions of client-go through cache.TransformFunc, which lets you delete fields you don't need before objects land in the cache (though the trimmed objects still consume some memory). This one is more of a band-aid that can make your situation better.
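For example, a sketch that strips the heaviest metadata before Pods are stored (SetTransform must be called before the informer is started, and which fields are safe to drop depends on what your controller reads):

package controller

import (
    "k8s.io/apimachinery/pkg/api/meta"
    "k8s.io/client-go/informers"
    "k8s.io/client-go/tools/cache"
)

func setPruningTransform(factory informers.SharedInformerFactory) error {
    return factory.Core().V1().Pods().Informer().SetTransform(cache.TransformFunc(func(obj interface{}) (interface{}, error) {
        if accessor, err := meta.Accessor(obj); err == nil {
            accessor.SetManagedFields(nil) // managedFields alone are often a big chunk of each object
            accessor.SetAnnotations(nil)
        }
        return obj, nil
    }))
}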
In my case, I mostly needed to watch DaemonSet Pods in certain namespaces, so I refactored the code from using one informer (watching all namespaces) to N namespace-scoped informers running concurrently.
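Roughly like this (the namespace list is hard-coded purely for illustration):

package controller

import (
    "time"

    "k8s.io/client-go/informers"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
)

// startNamespacedPodInformers runs one Pod informer per namespace instead of a
// single cluster-wide one.
func startNamespacedPodInformers(client kubernetes.Interface, handler cache.ResourceEventHandler, stop <-chan struct{}) {
    for _, ns := range []string{"kube-system", "monitoring"} { // assumption: the namespaces you care about
        factory := informers.NewSharedInformerFactoryWithOptions(client, 30*time.Second, informers.WithNamespace(ns))
        factory.Core().V1().Pods().Informer().AddEventHandler(handler)
        factory.Start(stop) // non-blocking
    }
}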

Related

How to prevent repeated reconcile calls

I'm developing a Kubernetes controller. The desired state for this controller is captured in CRD-A, and it then creates a Deployment and a StatefulSet to achieve the actual state. Currently I'm using server-side apply to create/update these Deployments and StatefulSets.
The controller establishes watches on CRD-A as well as on the Deployments and StatefulSets. This is to ensure that if there is a change in a Deployment/StatefulSet, reconcile() is notified and takes action to fix it. Currently reconcile() always calls server-side apply to create/update, and this triggers another watch event (the resourceVersion changes on every server-side apply), resulting in repeated/infinite calls to reconcile().
One approach I've been thinking about is to leverage 'generation' on the Deployment/StatefulSet, i.e. the controller maintains an in-memory map of (k8s object -> generation) and on reconcile() compares the value in this map to what is present in the indexed informer cache. Do you see any concerns with this approach? And are there better alternatives to prevent repeated/infinite reconcile() calls?
One idea could be to not apply the objects when there are no changes to apply.
That leaves you with the problem of detecting whether the objects are already up to date, and it's hard to tell if that's doable in your case.
Ideally, if the object you provide in the server-side apply has not changed, neither the generation nor the resourceVersion of the object should change.
But sometimes that's not the case; see this GitHub issue: https://github.com/kubernetes/kubernetes/issues/95460
Fortunately, the generation always stays the same, so yes, you can leverage this field to avoid the reconcile dead loop by adding a GenerationChangedPredicate filter to your controller, which skips reconciling when the generation does not change. It is often used in conjunction with the LabelChangedPredicate, which filters out events where the object's labels do not change.
Here's how you would set up your controller with these two predicates:
ctrl.NewControllerManagedBy(mgr).
    For(&Object{}).
    Owns(&appsv1.StatefulSet{}).
    Owns(&appsv1.Deployment{}).
    // Together with server-side apply, these predicates prevent
    // meaningless reconciliations from being triggered.
    WithEventFilter(predicate.Or(predicate.GenerationChangedPredicate{}, predicate.LabelChangedPredicate{})).
    Complete(r)

What is the use for CRD status?

I'm currently writing a Kubernetes operator in Go using the operator-sdk.
This operator creates two StatefulSets and two Services, with some business logic around them.
I'm wondering what the CRD status is for. In my reconcile method I use the default client (i.e. r.List(ctx, &setList, opts...)) to fetch data from the cluster; should I instead store data in the status and use it later?
If so, how reliable is this status? I mean, is it persisted? If the control plane dies, is it still available?
What about disaster recovery: what if the persisted data disappears? Doesn't that case invalidate the use of the CRD status?
The status subresource of a CRD can be considered to have the same objective as it does for non-custom resources. While the spec defines the desired state of that particular resource (basically, I declare what I want), the status explains the current situation of the resource I declared on the cluster and should help in understanding what differs between the desired state and the actual state.
Just as a StatefulSet spec could say I want 3 replicas and its status say that right now only 1 of those replicas is ready and the next one is still starting, a custom resource's status may tell me the current situation of whatever I declared in the spec.
For example, using the Rook operator, I could declare that I want a CephCluster made in a certain way. Since a CephCluster is a pretty complex thing (made of several StatefulSets, daemons and more), the status of the custom resource will tell me the current situation of the whole Ceph cluster: whether its health is OK or something requires my attention, and so on.
From my understanding of the Kubernetes API, you shouldn't rely on the status subresource to decide what your operator should do with a CRD; it is far better, and less error-prone, to always check the current situation of the cluster (at operator start, or when a resource is created, updated or deleted).
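To make the mechanics concrete, writing the status is just a normal API call against the object's /status subresource. Here is a hedged operator-sdk/controller-runtime sketch (MyApp, ReadyReplicas and Phase are made-up names for your own API type, and the CRD needs the status subresource enabled, e.g. via // +kubebuilder:subresource:status):

// Sketch only: r is the reconciler embedding the controller-runtime client.
func (r *MyAppReconciler) updateStatus(ctx context.Context, app *v1alpha1.MyApp, ready int32) error {
    app.Status.ReadyReplicas = ready
    app.Status.Phase = "Running"
    // The status is persisted in etcd like the rest of the object, but spec and
    // status are written through separate endpoints.
    return r.Status().Update(ctx, app)
}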
Last, let me quote the Kubernetes API conventions, as they explain the convention pretty well (https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#spec-and-status):
By convention, the Kubernetes API makes a distinction between the specification of the desired state of an object (a nested object field called "spec") and the status of the object at the current time (a nested object field called "status").

The specification is a complete description of the desired state, including configuration settings provided by the user, default values expanded by the system, and properties initialized or otherwise changed after creation by other ecosystem components (e.g., schedulers, auto-scalers), and is persisted in stable storage with the API object. If the specification is deleted, the object will be purged from the system.

The status summarizes the current state of the object in the system, and is usually persisted with the object by automated processes but may be generated on the fly. At some cost and perhaps some temporary degradation in behavior, the status could be reconstructed by observation if it were lost.

When a new version of an object is POSTed or PUT, the "spec" is updated and available immediately. Over time the system will work to bring the "status" into line with the "spec". The system will drive toward the most recent "spec" regardless of previous versions of that stanza. In other words, if a value is changed from 2 to 5 in one PUT and then back down to 3 in another PUT the system is not required to 'touch base' at 5 before changing the "status" to 3. In other words, the system's behavior is level-based rather than edge-based. This enables robust behavior in the presence of missed intermediate state changes.

Cosmos DB Change Feeds in a Kubernetes Cluster with arbitrary number of pods

I have a collection in my Cosmos database that I would like to observe for changes. I have many documents (official and unofficial) explaining how to do this. There is one thing, though, that I cannot get to work in a reliable way: how do I receive the same changes on multiple instances when I don't have any common reference for instance names?
What do I mean by this? Well, I'm running my workloads in a Kubernetes cluster (AKS). I have a variable number of instances within the cluster that should observe my collection. In order for change feeds to work properly, I have to have a unique instance name for each instance. The only candidate I have is the pod name, which is usually of the form <deployment-name>-<random string>, e.g. pod-5f597c9c56-lxw5b.
If I use the pod name as the instance name, all instances do not receive the same changes (which is my requirement); only one instance receives each change (see https://learn.microsoft.com/en-us/azure/cosmos-db/change-feed-processor#dynamic-scaling). What I can do is use the pod name as the feed name instead; then all instances get the same changes. This is what I fear will bite me in the butt at some point: when I peek into the lease container, I can see a set of documents per feed name. As pod names come and go (the random string part of the name), I fear the container will grow over time, accumulating a heap of garbage. I know Cosmos can handle huge workloads, but you know, I like to keep things tidy.
How can I keep this thing clean and tidy? I really don't want to invent (or reuse, for that matter!) some protocol between my instances to vote on which instance gets which name out of a finite set of names.
One "simple" solution would be to build my own instance names, if AKS or Kubernetes held some sort of "index" for my pods. I know StatefulSets give me that, but I don't want to use StatefulSets, as the pods themselves aren't really stateful (except for this particular aspect!).
There is a new Change Feed pull model (in preview at the time of writing). The main difference is that, instead of the processor distributing changes across instances via the lease container, each consumer pulls changes itself and manages its own position in the feed (continuation tokens).
In your case, it looks like you don't need parallelization (you want all instances to receive everything). The important part would be to design a state-storing model that can maintain the continuation tokens (or not; maybe you don't care about continuing if a pod goes down and then restarts).
I would suggest that you proceed with using the pod name as the unique ID. If you are concerned about sprawl of the data, you could monitor the container and devise a clean-up mechanism for the metadata.
In order to have at-least-once delivery, some metadata has to be persisted somewhere to track acknowledged items, position in a partition, and so on. I suspect there could be a bit of work to get the change feed processor to give you at-least-once delivery once you consider pod interruption/re-scheduling during data flow.
As another option, Azure offers an implementation of checkpoint-based message sharing from partitioned Event Hubs via EventProcessorClient. With EventProcessorClient, a bit of metadata is likewise kept in a storage account.

Distribute public key to all pods in all namespaces automatically

I have a public key that all my pods need to have.
My initial thought was to create a ConfigMap or Secret to hold it, but as far as I can tell neither of those can be used across namespaces. Apart from that, it's really boilerplate to paste the same volume into all my Deployments.
So now I'm left with only, in my opinion, bad alternatives, such as creating the same ConfigMap/Secret in every namespace and doing the copy-paste thing in every Deployment.
Any other alternatives?
Extra information in response to questions:
The key doesn't need to be kept secret (it's a public key), but it needs to be distributed in a trusted way.
It won't rotate often, but when it does, rebuilding all the images is not an option.
Almost all images/pods need this key, and there will be hundreds of images/pods.
You can use Kubernetes initializers to intercept object creation and mutate objects as you want. That removes the copy-paste from all your Deployments, and you can manage it from a central location.
https://medium.com/google-cloud/how-kubernetes-initializers-work-22f6586e1589
You will still need to create the ConfigMaps/Secrets in each namespace, though.
While I don't really like the idea, one way to solve this could be an init container that populates a volume with the key(s) you need; those volumes can then be mounted in your containers as you see fit. That way it becomes independent of how you namespace things and relies only on how pods are defined/created.
That said, Kubed, mentioned by Ryan above, sounds like a more reasonable approach to a case like this one. And last but not least, something creates your namespaces after all, so having the creation of a namespace's required elements live in that same process sounds legitimate as well.
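As an illustration of that last point, a rough client-go sketch that copies a single source ConfigMap holding the public key into every namespace (the names are made up; Kubed essentially does this, plus watching for changes and newly created namespaces):

package keysync

import (
    "context"
    "fmt"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// syncPublicKey copies the source ConfigMap into every namespace in the cluster.
func syncPublicKey(ctx context.Context, client kubernetes.Interface) error {
    src, err := client.CoreV1().ConfigMaps("kube-public").Get(ctx, "trusted-public-key", metav1.GetOptions{})
    if err != nil {
        return err
    }
    nsList, err := client.CoreV1().Namespaces().List(ctx, metav1.ListOptions{})
    if err != nil {
        return err
    }
    for _, ns := range nsList.Items {
        cm := &corev1.ConfigMap{
            ObjectMeta: metav1.ObjectMeta{Name: src.Name, Namespace: ns.Name},
            Data:       src.Data,
        }
        if _, err := client.CoreV1().ConfigMaps(ns.Name).Create(ctx, cm, metav1.CreateOptions{}); err != nil {
            fmt.Printf("namespace %s: %v\n", ns.Name, err) // e.g. already exists; update instead in real code
        }
    }
    return nil
}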

Do all cluster schedulers take array jobs, and if they do, do they set SGE_TASK_ID array id?

When using qsub to submit array jobs to a cluster, the environment variable SGE_TASK_ID gets set to the task ID within the array job. I use this in a shell script that I run on a cluster, where each array task needs to do something different based on SGE_TASK_ID. Is this a common way for cluster schedulers to do this, or do they all have a different approach?
Most schedulers have a way to do this, although it can be slightly different in different setups. In TORQUE the variable is called $PBS_ARRAYID, but it works the same way.
Do all cluster schedulers take array jobs
No. Many do, but not all.
and if they do, do they set SGE_TASK_ID array id?
Only Grid Engine will set SGE_TASK_ID, because that is simply what the variable is called in Grid Engine. Other cluster middleware has a different name for it, with different semantics.
It's a bit unclear what you are aiming at with your question, but if you want to write a program/system that runs on many different cluster middlewares / load balancers / schedulers, you should look into DRMAA, which abstracts away variables like SGE_TASK_ID.