Cosmos DB Change Feeds in a Kubernetes Cluster with an arbitrary number of pods

I have a collection in my Cosmos database that I would like to observe for changes. I have read many documents (official and unofficial) explaining how to do this. There is one thing, though, that I cannot get to work reliably: how do I deliver the same changes to multiple instances when I don't have any common reference for instance names?
What do I mean by this? Well, I'm running my workloads in a Kubernetes cluster (AKS). I have a variable number of instances within the cluster that should observe my collection. In order for change feeds to work properly, I have to have a unique instance name for each instance. The only candidate I have is the pod name. It's usually of the form <deployment-name>-<random string>, e.g. pod-5f597c9c56-lxw5b.
If I use the pod name as the instance name, the instances do not all receive the same changes (which is my requirement); only one instance will receive each change (see https://learn.microsoft.com/en-us/azure/cosmos-db/change-feed-processor#dynamic-scaling). What I can do instead is use the pod name as the feed name; then all instances get the same changes. This is what I fear will bite me in the butt at some point: when I peek into the lease container, I can see a set of documents per feed name. As pod names come and go (the random string part of the name), I fear the container will grow over time, generating a heap of garbage. I know Cosmos can handle huge workloads, but you know, I like to keep things tidy.
How can I keep this thing clean and tidy? I really don't want to invent (or reuse for that matter!) some protocol between my instances to vote for which instance gets which name out of a finite set of names.
One "simple" solution would be to build my own instance names, if AKS or Kubernetes held some "index" of some sort for my pods. I know stateful sets give me that, but I don't want to use stateful sets, as the pods themselves aren't really stateful (except for this particular aspect!).

There is a new Change Feed pull model (which is in preview at this time).
The main difference from the change feed processor is that with the pull model you read the change feed at your own pace and keep track of the state (continuation tokens) yourself, instead of having the processor coordinate work across named instances through the lease container.
In your case, it looks like you don't need parallelization (you want all instances to receive everything). The important part would be to design a state-storing model that can maintain the continuation tokens (or not; maybe you don't care about continuing where you left off if a pod goes down and then restarts).
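A minimal sketch of such a state store, in Go with the azcosmos SDK (the same idea applies in whichever SDK you actually use): one small document per pod holding its latest continuation token. The database/container names, the /id partition key and the POD_NAME environment variable are assumptions for illustration, not part of any official pattern.

package changefeedstate

import (
	"context"
	"encoding/json"
	"errors"
	"os"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore"
	"github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos"
)

// tokenDoc stores the latest continuation token for one consumer (one pod).
type tokenDoc struct {
	ID    string `json:"id"`    // pod name; also the partition key value (/id)
	Token string `json:"token"` // opaque continuation token from the change feed
}

type tokenStore struct {
	container *azcosmos.ContainerClient
	podName   string // e.g. injected via the downward API into POD_NAME
}

// Save upserts the checkpoint document for this pod.
func (s *tokenStore) Save(ctx context.Context, token string) error {
	body, _ := json.Marshal(tokenDoc{ID: s.podName, Token: token})
	_, err := s.container.UpsertItem(ctx, azcosmos.NewPartitionKeyString(s.podName), body, nil)
	return err
}

// Load returns "" when the pod has never checkpointed (a fresh pod name).
func (s *tokenStore) Load(ctx context.Context) (string, error) {
	resp, err := s.container.ReadItem(ctx, azcosmos.NewPartitionKeyString(s.podName), s.podName, nil)
	if err != nil {
		var respErr *azcore.ResponseError
		if errors.As(err, &respErr) && respErr.StatusCode == 404 {
			return "", nil // no checkpoint yet: start from "now" or "beginning"
		}
		return "", err
	}
	var doc tokenDoc
	if err := json.Unmarshal(resp.Value, &doc); err != nil {
		return "", err
	}
	return doc.Token, nil
}

func newTokenStore() (*tokenStore, error) {
	client, err := azcosmos.NewClientFromConnectionString(os.Getenv("COSMOS_CONNECTION_STRING"), nil)
	if err != nil {
		return nil, err
	}
	// "appdb" / "changefeed-state" are placeholders; the container is assumed
	// to be partitioned on /id.
	container, err := client.NewContainer("appdb", "changefeed-state")
	if err != nil {
		return nil, err
	}
	return &tokenStore{container: container, podName: os.Getenv("POD_NAME")}, nil
}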

I would suggest that you proceed with using the pod name as the unique ID. If you are concerned about the data sprawling, you could monitor the lease container and devise a clean-up mechanism for the metadata.
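A rough sketch of such a clean-up job in Go (client-go plus azcosmos). It assumes you can map a lease/metadata document id back to the pod name that created it (for example because you used the pod name as the feed/processor name), and that the lease container is partitioned on /id as the change feed processor requires; how you enumerate the lease ids is left out.

package leasecleanup

import (
	"context"
	"strings"

	"github.com/Azure/azure-sdk-for-go/sdk/data/azcosmos"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// CleanupStaleLeases deletes lease documents whose owning pod no longer exists.
// leaseIDs is the set of lease document ids you enumerated from the lease
// container (enumeration not shown); each id is assumed to contain the pod
// name that was used as the processor name.
func CleanupStaleLeases(ctx context.Context, kube kubernetes.Interface,
	leases *azcosmos.ContainerClient, namespace string, leaseIDs []string) error {

	pods, err := kube.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	alive := make(map[string]bool, len(pods.Items))
	for _, p := range pods.Items {
		alive[p.Name] = true
	}

	for _, id := range leaseIDs {
		if ownedByLivePod(id, alive) {
			continue
		}
		// The change feed processor's lease container is partitioned on /id,
		// so the document id doubles as the partition key value.
		if _, err := leases.DeleteItem(ctx, azcosmos.NewPartitionKeyString(id), id, nil); err != nil {
			return err
		}
	}
	return nil
}

// ownedByLivePod checks whether any live pod name appears in the lease id.
// Adjust to whatever naming scheme you actually use.
func ownedByLivePod(leaseID string, alive map[string]bool) bool {
	for name := range alive {
		if strings.Contains(leaseID, name) {
			return true
		}
	}
	return false
}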
In order to have at-least-once delivery, there will need to be metadata persisted somewhere to track which items have been ACK-ed, the position in a partition, etc. I suspect there could be a bit of work to get the change feed processor to give you at-least-once delivery once you consider pod interruption/re-scheduling during data flow.
As another option, Azure offers an implementation of checkpoint-based message sharing from partitioned Event Hubs via EventProcessorClient. With EventProcessorClient, a bit of checkpoint metadata is likewise stored in a storage account.

Related

Filtering which items are stored in the Kubernetes shared informer cache

I have a Kubernetes controller written using the client-go informers package. It maintains a watch on all Pods in the cluster; there are about 15k of these objects, and their YAML representation takes around 600 MB to print (I assume their in-memory representation is not that different).
As a result, this (otherwise really small) controller watching Pods ends up with a huge memory footprint (north of 1 GiB). Even methods that you'd think offer a way of filtering, such as the one named NewFilteredSharedInformerFactory, don't really give you a way to specify a predicate function that chooses which objects are stored in the in-memory cache.
Instead, that method in client-go offers a TweakListOptionsFunc. It helps you control the ListOptions, but my predicate unfortunately cannot be expressed as a labelSelector or fieldSelector. I need to drop the objects through a predicate function when they arrive at the controller.
Note: the predicate I have is something like "Pods that have an ownerReference pointing to a DaemonSet" (which is not possible with fieldSelectors, also the subject of another question of mine), and there's no labelSelector that can work in my scenario.
How would I go about configuring an informer on Pods that only have DaemonSet owner references to reduce the memory footprint of my controller?
Here's an idea: get a list of all the DaemonSets in your cluster and read their spec.selector.matchLabels fields to retrieve the labels that the DaemonSet Pods are bound to have. Use those labels as part of your TweakListOptionsFunc with logic like:
Pods with label1 OR label2 OR label3 ...
I know it's a bit of toil, but it seems to be a workable approach. I believe there isn't a way to specify custom field selectors in client-go.
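A rough sketch of that approach with client-go (the label key "app" and the 10-minute resync are assumptions). Note that a single labelSelector can only express OR across values of one key via a set-based "key in (...)" expression, not across arbitrary keys, so this works best when your DaemonSets share a label key.

package main

import (
	"context"
	"fmt"
	"strings"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	// Collect the value of the assumed shared label key from every
	// DaemonSet's spec.selector.matchLabels.
	dss, err := cs.AppsV1().DaemonSets(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	var values []string
	for _, ds := range dss.Items {
		if ds.Spec.Selector == nil {
			continue
		}
		if v, ok := ds.Spec.Selector.MatchLabels["app"]; ok {
			values = append(values, v)
		}
	}
	selector := fmt.Sprintf("app in (%s)", strings.Join(values, ","))

	// The tweak is applied to both the List and the Watch issued by the
	// informer, so only matching Pods ever enter the shared cache.
	factory := informers.NewSharedInformerFactoryWithOptions(cs, 10*time.Minute,
		informers.WithTweakListOptions(func(o *metav1.ListOptions) {
			o.LabelSelector = selector
		}))
	podInformer := factory.Core().V1().Pods().Informer()
	_ = podInformer // add event handlers here, then factory.Start(stopCh)
}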
It appears that today, if you use SharedInformers, there's no way to filter which objects to keep in the shared cache and which ones to discard.
I have found an interesting code snippet in the kube-state-metrics project that opts into a lower layer of abstraction by initiating Watch calls directly (which would normally be considered an anti-pattern) and uses watch.Filter to decide whether an object returned from a Watch() call is passed on to the cache/reflector or not.
That said, many controller authors might choose to not go down this path as it requires you to specify your own cache/reflector/indexer around the Watch() call. Furthermore, projects like controller-runtime don't even let you get access to this low-level machinery, as far as I know.
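For completeness, here is roughly what that lower-level pattern looks like. It is a sketch only: a real reflector also has to re-list, track resourceVersions and handle watch expiry, which is exactly the machinery SharedInformers normally give you for free.

package watcher

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
)

// watchDaemonSetPods starts a raw Watch on Pods and wraps it with watch.Filter
// so that only DaemonSet-owned Pods are passed on to whatever cache/consumer
// reads from the result channel.
func watchDaemonSetPods(ctx context.Context, cs kubernetes.Interface) error {
	w, err := cs.CoreV1().Pods(metav1.NamespaceAll).Watch(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	filtered := watch.Filter(w, func(e watch.Event) (watch.Event, bool) {
		pod, ok := e.Object.(*corev1.Pod)
		if !ok {
			return e, false
		}
		for _, ref := range pod.OwnerReferences {
			if ref.Kind == "DaemonSet" {
				return e, true // keep: this Pod is owned by a DaemonSet
			}
		}
		return e, false // drop everything else before it reaches the cache
	})
	defer filtered.Stop()

	for e := range filtered.ResultChan() {
		pod := e.Object.(*corev1.Pod)
		fmt.Println(e.Type, pod.Namespace+"/"+pod.Name)
	}
	return nil
}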
Another way to reduce a controller's memory footprint is field/data erasure on the structs (instead of discarding objects altogether). This is possible in newer versions of client-go through cache.TransformFunc, which lets you delete fields you don't need from the objects (though the trimmed objects would still consume some memory). This one is more of a band-aid that can make your situation better.
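A sketch of what that looks like. SetTransform is available on shared informers in recent client-go releases and must be registered before the informer is started; which fields you blank out is entirely up to what your controller actually reads.

package watcher

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// trimPod is a cache.TransformFunc that erases fields the controller never
// reads before the object is stored in the informer cache.
func trimPod(obj interface{}) (interface{}, error) {
	pod, ok := obj.(*corev1.Pod)
	if !ok {
		return obj, nil
	}
	pod.ManagedFields = nil     // usually the single largest chunk of bytes
	pod.Annotations = nil       // drop only if your controller never reads them
	pod.Spec = corev1.PodSpec{} // ditto for the whole spec
	return obj, nil
}

// Register it before starting the informer: podInformer.SetTransform(trimPod)
var _ cache.TransformFunc = trimPod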
In my case, I mostly needed to watch for DaemonSet Pods in certain namespaces, so I refactored the code from using 1 informer (watching all namespaces) to N namespace-scoped informers running concurrently.
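A sketch of that refactor (the namespace list and resync period are made up):

package watcher

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// startNamespacedPodInformers runs one namespace-scoped Pod informer per
// namespace instead of a single cluster-wide one, so only those namespaces'
// Pods are cached.
func startNamespacedPodInformers(cs kubernetes.Interface, namespaces []string, stopCh <-chan struct{}) {
	for _, ns := range namespaces {
		factory := informers.NewSharedInformerFactoryWithOptions(cs, 10*time.Minute,
			informers.WithNamespace(ns))
		factory.Core().V1().Pods().Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
			AddFunc:    func(obj interface{}) { /* enqueue */ },
			UpdateFunc: func(oldObj, newObj interface{}) { /* enqueue */ },
			DeleteFunc: func(obj interface{}) { /* enqueue */ },
		})
		factory.Start(stopCh) // non-blocking; informers run until stopCh closes
	}
}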

How to prevent repeated reconcile calls

I'm developing a Kubernetes controller. The desired state for this controller is captured in CRD-A, and the controller then creates a Deployment and a StatefulSet to achieve the actual state. Currently I'm using server-side apply to create/update these Deployments and StatefulSets.
The controller establishes watches on CRD-A as well as on the Deployments and StatefulSets. This is to ensure that if there is a change in a Deployment/StatefulSet, reconcile() is notified and takes action to fix it. Currently reconcile() always calls server-side apply to create/update, and this leads to another watch event (the resourceVersion changes on every server-side apply), resulting in repeated/infinite calls to reconcile().
One approach I've been thinking about is to leverage 'generation' on the Deployment/StatefulSet, i.e. the controller maintains an in-memory map of (k8s object -> generation) and on reconcile() compares the value in this map to what is present in the indexed informer cache. Do you see any concerns with this approach? And are there better alternatives to prevent repeated/infinite reconcile() calls?
One idea could be to not apply the objects if there are no changes to be applied.
This gives you the problem of how to detect if the objects are up to date already, but it's hard to tell if that's doable in your case.
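One hedged way to do that check with controller-runtime is to compare the desired spec against what is already in the cluster and skip the apply when nothing differs; DeepDerivative ignores fields you left empty in the desired object. The field owner string below is just an example.

package controller

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/api/equality"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// applyIfChanged performs the server-side apply only when the desired spec
// is not already (derivatively) equal to what the cluster has.
func applyIfChanged(ctx context.Context, c client.Client, desired *appsv1.Deployment) error {
	current := &appsv1.Deployment{}
	err := c.Get(ctx, client.ObjectKeyFromObject(desired), current)
	switch {
	case apierrors.IsNotFound(err):
		// Nothing in the cluster yet; fall through and apply.
	case err != nil:
		return err
	case equality.Semantic.DeepDerivative(desired.Spec, current.Spec):
		return nil // already up to date: skip the apply, no new watch event
	}
	return c.Patch(ctx, desired, client.Apply,
		client.FieldOwner("my-operator"), client.ForceOwnership)
}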
Ideally, if the object you provide in the server-side apply has not changed, neither the generation nor the resourceVersion of the object should change.
But sometimes that's not the case; see this GitHub issue: https://github.com/kubernetes/kubernetes/issues/95460
Fortunately, the generation always stays the same, so yes, you can leverage this field to avoid the reconcile dead loop by adding a GenerationChangedPredicate filter to your controller, which skips reconciling if the generation does not change. It is often used in conjunction with the LabelChangedPredicate, which filters out update events where the object's labels have not changed.
Here's how you would set up your controller with these two predicates:
ctrl.NewControllerManagedBy(mgr).
    For(&Object{}).
    Owns(&appsv1.StatefulSet{}).
    Owns(&appsv1.Deployment{}).
    // Together with server-side apply, these predicates prevent
    // meaningless reconciliations from being triggered.
    WithEventFilter(predicate.Or(predicate.GenerationChangedPredicate{}, predicate.LabelChangedPredicate{})).
    Complete(r)

What is the use for CRD status?

I'm currently writing a Kubernetes operator in Go using the operator-sdk.
This operator creates two StatefulSets and two Services, with some business logic around them.
I'm wondering what the CRD status is for. In my reconcile method I use the default client (i.e. r.List(ctx, &setList, opts...)) to fetch data from the cluster; should I instead store data in the status to use it later?
If so, how reliable is this status? I mean, is it persisted? If the control plane dies, is it still available?
What about disaster recovery: what if the persisted data disappears? Doesn't that case invalidate the use of the CRD status?
The status subresource of a CRD can be considered to have the same objective as for non-custom resources. While the spec defines the desired state of that particular resource (basically, I declare what I want), the status explains the current situation of the resource I declared on the cluster and should help in understanding what differs between the desired state and the actual state.
Just as a StatefulSet spec could say I want 3 replicas and its status could say that right now only 1 of those replicas is ready and the next one is still starting, a custom resource's status may tell me the current situation of whatever I declared in the spec.
For example, using the Rook operator, I could declare I want a CephCluster made in a certain way. Since a CephCluster is a pretty complex thing (made of several StatefulSets, daemons and more), the status of the custom resource will tell me the current situation of the whole Ceph cluster: whether its health is ok or something requires my attention, and so on.
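To make that concrete for an operator written in Go: with kubebuilder/operator-sdk the status is just a second struct on your CRD type, enabled as a subresource (the names below are invented for illustration). The reconciler writes it with r.Status().Update(...), and because status is a subresource those writes do not bump the object's metadata.generation.

package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// MyApp is an invented example CRD.
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
type MyApp struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   MyAppSpec   `json:"spec,omitempty"`   // desired state, written by the user
	Status MyAppStatus `json:"status,omitempty"` // observed state, written by the operator
}

// MyAppSpec is what the user declares.
type MyAppSpec struct {
	Replicas int32 `json:"replicas"`
}

// MyAppStatus is what the reconciler observed on its last pass; it can always
// be rebuilt by looking at the cluster again.
type MyAppStatus struct {
	ReadyReplicas int32              `json:"readyReplicas"`
	Conditions    []metav1.Condition `json:"conditions,omitempty"`
}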
From my understanding of the Kubernetes API, you shouldn't rely on the status subresource to decide what your operator should do regarding a CRD; it is way better and less error-prone to always check the current situation of the cluster (at operator start, or when a resource is created, updated or deleted).
Last, let me quote the Kubernetes API conventions, as they explain the convention pretty well (https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/api-conventions.md#spec-and-status):
By convention, the Kubernetes API makes a distinction between the specification of the desired state of an object (a nested object field called "spec") and the status of the object at the current time (a nested object field called "status").
The specification is a complete description of the desired state, including configuration settings provided by the user, default values expanded by the system, and properties initialized or otherwise changed after creation by other ecosystem components (e.g., schedulers, auto-scalers), and is persisted in stable storage with the API object. If the specification is deleted, the object will be purged from the system.
The status summarizes the current state of the object in the system, and is usually persisted with the object by automated processes but may be generated on the fly. At some cost and perhaps some temporary degradation in behavior, the status could be reconstructed by observation if it were lost.
When a new version of an object is POSTed or PUT, the "spec" is updated and available immediately. Over time the system will work to bring the "status" into line with the "spec". The system will drive toward the most recent "spec" regardless of previous versions of that stanza. In other words, if a value is changed from 2 to 5 in one PUT and then back down to 3 in another PUT the system is not required to 'touch base' at 5 before changing the "status" to 3. In other words, the system's behavior is level-based rather than edge-based. This enables robust behavior in the presence of missed intermediate state changes.

Understanding Lagom's persistent read side

I read through the Lagom documentation and already wrote a few small services that interact with each other. But because this is my first foray into CQRS, I still have a few conceptual issues about the persistent read side that I don't really understand.
For instance, I have a user-service that keeps a list of users (as aggregates) and their profile data like email addresses, names, addresses, etc.
The questions I have now are:
If I want to retrieve a user's profile given a certain email address, should I query the read side for the user's id, and then query the event store using this id for the profile data? Or should the read side already keep all profile information?
If the read side has all the information, what is the reason for the event store? If it's truly write-only, it's not really useful, is it?
Should I design my system so that I can use the event store as much as possible, or should I have a read side for everything? What are the scalability implications?
If the user model changes (for instance, the profile now includes a description) and I use a read side that contains all profile data, how do I update this read side in Lagom to now also contain this description?
Following that question, should I keep different read-side tables for different fields of the profile instead of one table containing the whole profile?
If a different service needs access to the data, should it always ask the user-service, or should it keep its own read side as needed? In case of the latter, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
As you can see, this whole concept hasn't really 'clicked' yet, and I am thankful for answers and/or some pointers.
If I want to retrieve a user's profile given a certain email address, should I query the read side for the user's id, and then query the event store using this id for the profile data? Or should the read side already keep all profile information?
You should use a specially designed ReadModel for searching profiles by email address. You should query the Event-store only to rehydrate the Aggregates, and you rehydrate the Aggregates only to send them commands, not queries. In CQRS an Aggregate may not be queried.
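This is not Lagom's actual API, but as a sketch of the idea in Go (the table name, event shape and Postgres-flavoured upsert are all made up): the read-side processor consumes the events and maintains a table shaped exactly for the "profile by email" query.

package readside

import (
	"context"
	"database/sql"
)

// UserProfileUpdated is the kind of event the write side (the Aggregate) emits.
type UserProfileUpdated struct {
	UserID string
	Email  string
	Name   string
}

// Project is called for each event (in order, per Aggregate) and upserts one
// row keyed by email, so answering "profile for this email" never needs to
// touch the event store or rehydrate an Aggregate.
func Project(ctx context.Context, db *sql.DB, e UserProfileUpdated) error {
	_, err := db.ExecContext(ctx, `
		INSERT INTO user_profile_by_email (email, user_id, name)
		VALUES ($1, $2, $3)
		ON CONFLICT (email) DO UPDATE SET user_id = EXCLUDED.user_id, name = EXCLUDED.name`,
		e.Email, e.UserID, e.Name)
	return err
}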
If the read side has all the information, what is the reason for the event store? If it's truly write-only, it's not really useful, is it?
The Event-store is the source of truth for the write side (the Aggregates). It is used to rehydrate the Aggregates (they rebuild their internal & private state based on the previously emitted events) before they process commands, and to persist the new events. So the Event-store is append-only but is also used to read the event stream (the events emitted by an Aggregate instance). The Event-store also ensures that an Aggregate instance (that is, identified by a type and an ID) processes only one command at a time.
If the user model changes (for instance, the profile now includes a description) and I use a read side that contains all profile data, how do I update this read side in Lagom to now also contain this description?
I don't use any framework other than my own, but I guess that you rewrite the projection (to use the newly added field on the events) and rebuild the ReadModel.
Following that question, should I keep different read-side tables for different fields of the profile instead of one table containing the whole profile?
You should have a separate ReadModel (with its own table(s)) for each use case. A ReadModel should be blazing fast; this means it should be as small as possible, with only the fields needed for that particular use case. This is very important, and it is one of the main benefits of using CQRS.
If a different service needs access to the data, should it always ask the user-service, or should it keep its own read side as needed? In case of the latter, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
That depends on you, the architect. It is preferred that each ReadModel owns its data; that is, it should subscribe to the right events and should not depend on other ReadModels. But this leads to a lot of code duplication. In my experience I've seen a desire to have some canonical ReadModels that own some data but can also share it on demand. For this, in CQRS, there is also the notion of a query. Just like commands and events, queries can travel in your system, but only from ReadModel to ReadModel.
Queries should not be sent during a client's request. They should be sent only in the background, as an asynchronous synchronization mechanism. This is an important aspect that influences the resilience and responsiveness of your system.
I've also used live queries, which are pushed from the authoritative ReadModels to the subscribed ReadModels in real time, when the answer changes.
In case of the latter, doesn't that violate the CQRS principle that the service that owns the data should be the only one reading and writing that data?
No, it does not. CQRS does not specify how the R (read side) is updated, only that the R should not process commands and the C should not be queried.

In Couchbase's Observe feature, what is the difference between "persistTo" and "replicateTo"?

In the Couchbase API, there are store and delete operations that allow you to specify how many nodes an operation must be successfully persisted or replicated to before returning. This is expressed through two method parameters: persistTo and replicateTo.
My question is: what is the difference between the persistTo and replicateTo parameters? For example, passing in PersistTo.MASTER + ReplicateTo.THREE appears to be exactly equivalent to passing in PersistTo.FOUR. Are there actually any behavioral differences between calling these observed APIs in those two different styles?
PersistTo.MASTER + ReplicateTo.THREE means that, at minimum, the item must be on disk on the master node and at least in memory on three replica nodes. In this case the item might not yet be persisted on the replicas.
PersistTo.FOUR means that the item must be persisted on the master as well as on three replicas.
A good way to think about it is that just because an item has been replicated to another node doesn't mean that it has been persisted to disk on that node.
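For illustration, here are the same two styles with the Couchbase Go SDK (gocb v2), where the enums become plain node counts. The cluster address and credentials are placeholders, and these observe-based PersistTo/ReplicateTo options are the legacy alternative to the newer DurabilityLevel setting.

package main

import (
	"log"
	"time"

	"github.com/couchbase/gocb/v2"
)

func main() {
	cluster, err := gocb.Connect("couchbase://localhost", gocb.ClusterOptions{
		Username: "Administrator", // placeholder credentials
		Password: "password",
	})
	if err != nil {
		log.Fatal(err)
	}
	bucket := cluster.Bucket("default")
	if err := bucket.WaitUntilReady(5*time.Second, nil); err != nil {
		log.Fatal(err)
	}
	col := bucket.DefaultCollection()
	doc := map[string]string{"hello": "world"}

	// PersistTo.MASTER + ReplicateTo.THREE:
	// on disk on the active node, at least in memory on three replicas.
	if _, err := col.Upsert("doc-1", doc, &gocb.UpsertOptions{PersistTo: 1, ReplicateTo: 3}); err != nil {
		log.Fatal(err)
	}

	// PersistTo.FOUR:
	// on disk on the active node and on all three replicas.
	if _, err := col.Upsert("doc-2", doc, &gocb.UpsertOptions{PersistTo: 4}); err != nil {
		log.Fatal(err)
	}
}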