Kubernetes resource versioning - version-control

Using kubectl get with -o yaml on a resource, I see that every resource is versioned:
kind: ConfigMap
metadata:
  creationTimestamp: 2018-10-16T21:44:10Z
  name: my-config
  namespace: default
  resourceVersion: "163"
I wonder what the significance of this versioning is and what it is used for (use cases)?

A more detailed explanation that helped me understand exactly how this works:
All the objects you’ve created throughout this book—Pods,
ReplicationControllers, Services, Secrets and so on—need to be
stored somewhere in a persistent manner so their manifests survive API
server restarts and failures. For this, Kubernetes uses etcd, which
is a fast, distributed, and consistent key-value store. The only
component that talks to etcd directly is the Kubernetes API server.
All other components read and write data to etcd indirectly through
the API server.
This brings a few benefits, among them a more robust optimistic
locking system as well as validation; and, by abstracting away the
actual storage mechanism from all the other components, it’s much
simpler to replace it in the future. It’s worth emphasizing that etcd
is the only place Kubernetes stores cluster state and metadata.
Optimistic concurrency control (sometimes referred to as optimistic
locking) is a method where instead of locking a piece of data and
preventing it from being read or updated while the lock is in place,
the piece of data includes a version number. Every time the data is
updated, the version number increases. When updating the data, the
version number is checked to see if it has increased between the time
the client read the data and the time it submits the update. If this
happens, the update is rejected and the client must re-read the new
data and try to update it again. The result is that when two clients
try to update the same data entry, only the first one succeeds.
Marko Luksa, "Kubernetes in Action"
So, all Kubernetes resources include a metadata.resourceVersion field, which clients need to pass back to the API server when updating an object. If the version doesn't match the one stored in etcd, the API server rejects the update.

The main purpose of the resourceVersion on individual resources is optimistic locking. You can fetch a resource, make a change, and submit it as an update, and the server will reject the update with a conflict error if another client has updated it in the meantime (their update would have bumped the resourceVersion, and the value you submit tells the server which version you think you are updating).
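For example, this is how such a conflict is typically handled with client-go: re-read the object and retry the update until it goes through. A minimal sketch, assuming a ConfigMap named my-config in the default namespace, an already-constructed clientset, and an example "greeting" key:

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/util/retry"
)

// updateConfigMap modifies a key on the ConfigMap, retrying on 409 Conflict.
// The Get inside the closure re-reads the object (and its resourceVersion)
// on every attempt, so a stale resourceVersion never blocks progress.
func updateConfigMap(ctx context.Context, clientset kubernetes.Interface) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		cm, err := clientset.CoreV1().ConfigMaps("default").Get(ctx, "my-config", metav1.GetOptions{})
		if err != nil {
			return err
		}
		if cm.Data == nil {
			cm.Data = map[string]string{}
		}
		cm.Data["greeting"] = "hello" // the "greeting" key is just an example
		// The Update sends cm.metadata.resourceVersion back to the API server;
		// if another client changed the object in the meantime, this returns a
		// Conflict error and RetryOnConflict runs the closure again.
		_, err = clientset.CoreV1().ConfigMaps("default").Update(ctx, cm, metav1.UpdateOptions{})
		return err
	})
}

This is the same retry-on-conflict behaviour recommended in the API conventions quoted further down.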

Related

Argo and Kubernetes "Request entity too large: limit is 3145728"

I've been trying to deploy a workflow in Argo with Kubernetes and I'm getting this error.
Can someone help me find the root cause of the issue?
I've tried several things but have been unsuccessful.
The way Argo solves that problem is by using compression on the stored entity, but the real question is whether you need to have all 3MB of that data at once, or whether it is merely more convenient for you and the data could be decomposed into separate objects with relationships between each other. The Kubernetes API is not blob storage, and shouldn't be treated as one.
The "error": "Request entity too large: limit is 3145728" is probably
the default response from kubernetes handler for objects larger than
3MB, as you can see here at L305 of the source code:
expectedMsgFor1MB := `etcdserver: request is too large`
expectedMsgFor2MB := `rpc error: code = ResourceExhausted desc = trying to send message larger than max`
expectedMsgFor3MB := `Request entity too large: limit is 3145728`
expectedMsgForLargeAnnotation := `metadata.annotations: Too long: must have at most 262144 bytes`
etcd does indeed have a 1.5MB limit for processing a request, and you will find in the etcd documentation a suggestion to try the --max-request-bytes flag, but it would have no effect on a GKE cluster because you don't have that permission on the master node.
But even if you did, it would not be ideal, because this error usually means that you are consuming the objects instead of referencing them, which would degrade your performance.
I highly recommend that you instead consider these options:
- Determine whether your object includes references that aren't used
- Break up your resource
- Consider a volume mount instead
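As a rough, illustrative sketch (not from the original answer): you can estimate up front whether an object would hit those limits by checking its serialized size before submitting it. The ConfigMap type and the threshold constants below are assumptions for the example:

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

const (
	// apiServerLimit mirrors the limit in the error message ("limit is 3145728"), i.e. 3 MiB.
	apiServerLimit = 3 * 1024 * 1024
	// etcdDefaultLimit is etcd's default request limit (~1.5 MiB), usually the tighter bound.
	etcdDefaultLimit = 1536 * 1024
)

// checkSize reports the approximate serialized size of the object. This is only
// an estimate; the API server stores a slightly different encoding than plain JSON.
func checkSize(cm *corev1.ConfigMap) error {
	raw, err := json.Marshal(cm)
	if err != nil {
		return err
	}
	if len(raw) > etcdDefaultLimit {
		return fmt.Errorf("object is %d bytes, above etcd's ~1.5MiB default; split it up or move the data to a volume/object store", len(raw))
	}
	fmt.Printf("serialized size: %d bytes (API server limit: %d)\n", len(raw), apiServerLimit)
	return nil
}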
There's a request for a new API resource, File (or BinaryData), that could apply to your case. It's very fresh, but it's worth keeping an eye on.
Partial source for this answer: https://stackoverflow.com/a/60492986/12153576

Caching in a microservice with multiple replicas in k8s

I have a Golang-based microservice with an in-memory cache that works as follows:
Create object -> Put it in cache -> Persist
Update object -> Update the cache -> Persist
Get -> Get it from the cache
Delete -> Delete cache entry -> Remove from data store.
On a service re-start, the cache is populated from the data store.
The cache organizes the data in different ways that matches my access patterns.
Note that one client can create the object, and other clients can update it at a later point in time.
Everything works fine as long as I have one replica, but this pattern breaks when I increase the replica count in my deployment.
If I have to go to the DB for each GET, it defeats the purpose of the cache. The first thought is to move the cache out. But this seems like a fairly common problem when moving to multi-replica microservices, so I'm curious to understand the alternatives.
Thanks for your time.
Much depends on how you structure your application.
One common solution is to use Redis or another distributed cache. The advantage is that all your service replicas go to the same cache to manage an object, which gives you more consistent data.
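A minimal read-through sketch of that idea using the go-redis client; the MyObject type, key scheme, TTL, and fetchFromDB stub are assumptions, not taken from the question:

import (
	"context"
	"encoding/json"
	"time"

	"github.com/redis/go-redis/v9"
)

// MyObject stands in for the service's domain object (assumption).
type MyObject struct {
	ID   string `json:"id"`
	Data string `json:"data"`
}

// fetchFromDB stands in for the real datastore lookup (assumption).
func fetchFromDB(ctx context.Context, id string) (*MyObject, error) {
	return &MyObject{ID: id}, nil
}

var rdb = redis.NewClient(&redis.Options{Addr: "redis:6379"})

// getObject checks the shared Redis cache first; on a miss it falls back to
// the database and repopulates the cache, so every replica sees the same entry.
func getObject(ctx context.Context, id string) (*MyObject, error) {
	key := "obj:" + id
	data, err := rdb.Get(ctx, key).Bytes()
	if err == nil {
		var obj MyObject
		if jsonErr := json.Unmarshal(data, &obj); jsonErr == nil {
			return &obj, nil
		}
	} else if err != redis.Nil {
		return nil, err // a real Redis error, not just a cache miss
	}

	obj, err := fetchFromDB(ctx, id)
	if err != nil {
		return nil, err
	}
	if data, err := json.Marshal(obj); err == nil {
		rdb.Set(ctx, key, data, 10*time.Minute)
	}
	return obj, nil
}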
Another approach you can take, which is somewhat more complex, is sharding.
For a Get operation, route the request to a specific instance based on the object's ID. That instance will have the object in its cache; if not, it reads it from the DB and puts it in its own cache. Every subsequent request for that object goes to the same instance. The same applies to Update and Delete operations.
For a Create operation:
If you want the DB to generate the ID automatically, the object is first created in the DB, which returns the ID; you then route requests based on that ID, so the first access after creation is served from the DB, but after that the object is in that instance's cache.
If IDs can be generated manually, you can prefix the ID at creation time with something that maps it to an instance.
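A sketch of the routing piece of that sharding approach; the replica naming below (StatefulSet pod DNS behind a headless Service) is an assumption, but any stable per-replica address works. Hashing the object ID always maps the same ID to the same replica:

import (
	"fmt"
	"hash/fnv"
)

// ownerFor maps an object ID to the replica that should cache it. With a
// StatefulSet plus a headless Service, replica i has a stable DNS name, so
// Get/Update/Delete requests for a given ID can always be proxied to the pod
// that holds (or will hold) its cache entry.
func ownerFor(id string, replicas int) string {
	h := fnv.New32a()
	h.Write([]byte(id))
	idx := int(h.Sum32()) % replicas
	// Hypothetical naming scheme: <statefulset>-<ordinal>.<headless-service>...
	return fmt.Sprintf("cache-%d.cache-svc.default.svc.cluster.local", idx)
}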
Note: in distributed systems there is no single solution; you always have to decide which approach works for your scenario.

CreateIfNotExists function calling Kubernetes API gets error: the object has been modified

Observation
I am initializing 3 MySQL pods at the same time, and I decided to create the mysql service account and do the role binding afterwards. Since I am using client-go to do this, I assumed that I didn't need to worry about concurrency problems. But that is not the case!
I got an error when calling RbacV1().ClusterRoleBindings().Update; the error is: Operation cannot be fulfilled on clusterrolebindings.rbac.authorization.k8s.io "mysql-clusterrolebinding": the object has been modified; please apply your changes to the latest version and try again
The problem goes away when I initialize these pods sequentially.
My Questions
Why can't Kubernetes take care of the "object has been modified" problem itself, for example by blocking reads of an object while another thread is updating it, so that version conflicts would not occur?
Why doesn't client-go handle concurrency problems either?
Should my program implement some locking mechanism? Or perhaps I shouldn't have tried to write a function like CreateIfNotExists in the first place?
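For the Create half of a CreateIfNotExists, a common client-go pattern (a sketch, not the code from the question) is to attempt the Create and treat an AlreadyExists error as success; the concurrent Update is what needs the retry-on-conflict loop shown earlier.

import (
	"context"

	rbacv1 "k8s.io/api/rbac/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createIfNotExists attempts the Create and treats "already exists" as success,
// which is safe even when several pods race to create the same binding.
func createIfNotExists(ctx context.Context, cs kubernetes.Interface, crb *rbacv1.ClusterRoleBinding) error {
	_, err := cs.RbacV1().ClusterRoleBindings().Create(ctx, crb, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		// Another replica got there first; nothing more to do.
		return nil
	}
	return err
}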

Large payload for Custom Objects

While I can create custom objects just fine, I am wondering how one is supposed to handle large payloads (Gigabytes) for an object.
CRs are mostly used in order to interface with garbage collection/reference counting in Kubernetes.
Adding the payload via YAML does not work, though (out of memory for large payloads):
apiVersion: "data.foo.bar/v1"
kind: Dump
metadata:
name: my-data
ownerReferences:
- apiVersion: apps/v1
kind: Deploy
name: my-deploy
uid: d9607a69-f88f-11e7-a518-42010a800195
spec:
payload: dfewfawfjr345434hdg4rh4ut34gfgr_and_so_on_...
One could perhaps add the payload to a PV and just reference that path in the CR.
But then I have the problem that it seems I cannot clean up the payload file when the CR gets finalized (I could not find any info about custom finalizers).
I have no clear idea how to integrate such a concept into the Kubernetes object lifecycle.
In general the limit on the size of any Kube API object is ~1MB due to etcd restrictions, but putting more than 20-30KB in an object is a bad idea and will be expensive to access (and garbage collection will be expensive as well).
I would recommend storing the data in an object storage bucket and using an RBAC proxy like https://github.com/brancz/kube-rbac-proxy to gate access to the bucket contents (use a URL to the proxy as a reference from your object). That gives you all the benefits of tracking the data in the API, but keeps the object size small. If you want a more complex integration, you could implement an aggregated API and reuse the core Kubernetes libraries to handle your API, storing the data in the object store.
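A sketch of what that "reference, don't embed" shape could look like for the Dump object from the question. The payloadRef field, the proxy URL, and the resource plural ("dumps") are made up for illustration, and the dynamic client is assumed to be constructed already:

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var dumpGVR = schema.GroupVersionResource{Group: "data.foo.bar", Version: "v1", Resource: "dumps"}

// createDumpRef creates a Dump that only carries a pointer to the real payload
// (stored in an object-storage bucket behind the RBAC proxy), so the API
// object itself stays far below the size limits.
func createDumpRef(ctx context.Context, dyn dynamic.Interface) error {
	dump := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "data.foo.bar/v1",
		"kind":       "Dump",
		"metadata": map[string]interface{}{
			"name": "my-data",
		},
		"spec": map[string]interface{}{
			// Hypothetical field: a URL served via kube-rbac-proxy instead of inline data.
			"payloadRef": "https://dump-proxy.example.com/buckets/dumps/my-data",
		},
	}}
	_, err := dyn.Resource(dumpGVR).Namespace("default").Create(ctx, dump, metav1.CreateOptions{})
	return err
}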
We still went with using the CO. Alongside it, we created a Kubernetes controller which handles the payload's lifetime in the PV. For us this works fine, since the controller can be the single writer to the PV, while the actual services only need read access to it.
Combined with ownerReferences, this makes for a good integration into the Kubernetes lifecycle.

Kubernetes - Using a particular ConfigMap versioning

I have a couple of questions regarding configMap versioning.
Is it possible to use a specific version of a configMap in the deployment file?
I don't see any APIs to get a list of versions. How do I get the list of versions?
Is it possible to compare a configMap between versions?
How do I control the number of versions?
Thanks
Is it possible to use a specific version of a configMap in the deployment file?
Not really.
The closest notion of a "version" is resourceVersion, but that is not for the user to directly act upon.
See API conventions: concurrency control and consistency:
Kubernetes leverages the concept of resource versions to achieve optimistic concurrency. All Kubernetes resources have a "resourceVersion" field as part of their metadata. This resourceVersion is a string that identifies the internal version of an object that can be used by clients to determine when objects have changed.
When a record is about to be updated, its version is checked against a pre-saved value, and if it doesn't match, the update fails with a StatusConflict (HTTP status code 409).
The resourceVersion is changed by the server every time an object is modified. If resourceVersion is included with the PUT operation the system will verify that there have not been other successful mutations to the resource during a read/modify/write cycle, by verifying that the current value of resourceVersion matches the specified value.
The resourceVersion is currently backed by etcd's modifiedIndex.
However, it's important to note that the application should not rely on the implementation details of the versioning system maintained by Kubernetes. We may change the implementation of resourceVersion in the future, such as to change it to a timestamp or per-object counter.
The only way for a client to know the expected value of resourceVersion is to have received it from the server in response to a prior operation, typically a GET. This value MUST be treated as opaque by clients and passed unmodified back to the server.
Clients should not assume that the resource version has meaning across namespaces, different kinds of resources, or different servers.
Currently, the value of resourceVersion is set to match etcd's sequencer. You could think of it as a logical clock the API server can use to order requests.
However, we expect the implementation of resourceVersion to change in the future, such as in the case we shard the state by kind and/or namespace, or port to another storage system.
In the case of a conflict, the correct client action at this point is to GET the resource again, apply the changes afresh, and try submitting again.
This mechanism can be used to prevent races like the following:
Client #1                 Client #2
GET Foo                   GET Foo
Set Foo.Bar = "one"       Set Foo.Baz = "two"
PUT Foo                   PUT Foo
When these sequences occur in parallel, either the change to Foo.Bar or the change to Foo.Baz can be lost.
On the other hand, when specifying the resourceVersion, one of the PUTs will fail, since whichever write succeeds changes the resourceVersion for Foo.
resourceVersion may be used as a precondition for other operations (e.g., GET, DELETE) in the future, such as for read-after-write consistency in the presence of caching.
"Watch" operations specify resourceVersion using a query parameter. It is used to specify the point at which to begin watching the specified resources.
This may be used to ensure that no mutations are missed between a GET of a resource (or list of resources) and a subsequent Watch, even if the current version of the resource is more recent.
This is currently the main reason that list operations (GET on a collection) return resourceVersion.
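For example, with client-go you can list a collection and then start a watch from the resourceVersion the list returned, so no change between the GET and the Watch is missed. A sketch, assuming an already-constructed clientset and the default namespace:

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// watchFromList lists ConfigMaps and then watches for changes starting at the
// list's resourceVersion, so no event between the list and the watch is lost.
func watchFromList(ctx context.Context, cs kubernetes.Interface) error {
	list, err := cs.CoreV1().ConfigMaps("default").List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	w, err := cs.CoreV1().ConfigMaps("default").Watch(ctx, metav1.ListOptions{
		// The opaque version returned by the list; used only as a starting point.
		ResourceVersion: list.ResourceVersion,
	})
	if err != nil {
		return err
	}
	defer w.Stop()
	for event := range w.ResultChan() {
		fmt.Printf("%s %s\n", event.Type, event.Object.GetObjectKind().GroupVersionKind().Kind)
	}
	return nil
}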