I recently started building a Kubernetes operator. I'm using the Fabric8 Java Kubernetes client, but I think my question is more general and also applies to other programming languages and libraries.
When reading through blog posts, documentation, or textbooks explaining the operator pattern, I found that there seem to be two ways to design an operator:
Using an infinite reconcile loop, in which all corresponding Kubernetes objects are retrieved from the API and then some action is performed.
Using informers, whose handlers are called whenever an observed Kubernetes resource changes.
However, I can't find any source discussing which option should be used in which case. Are there any best practices?
You should use both.
When using informers, it's possible that the handler receives events out of order or even not at all. The former means the handler needs to define and reconcile state - this approach is referred to as level-based, as opposed to edge-based. The latter means reconciliation needs to be triggered at a regular interval to account for missed events.
The way controller-runtime does things, reconciliation is triggered by cluster events (using informers behind the scenes) related to the resources watched by the controller, as well as on a timer. Also, by design, the event itself is not passed to the reconciler, so that it is forced to define and act on observed state.
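To make the combination concrete, here is a minimal sketch using the Fabric8 client (6.x-style API; the Deployment resource and the empty reconcile body are just illustrative). The informer's resync period re-delivers every cached object on a timer, and the handler only enqueues keys, never the events themselves:

```java
import io.fabric8.kubernetes.api.model.apps.Deployment;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.informers.ResourceEventHandler;

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class LevelBasedOperator {

  private static final BlockingQueue<String> workQueue = new LinkedBlockingQueue<>();

  public static void main(String[] args) throws InterruptedException {
    KubernetesClient client = new KubernetesClientBuilder().build();

    // Resync every 10 minutes: every cached object is re-delivered to the
    // handler on this interval, covering missed events (level-based).
    long resyncMillis = 10 * 60 * 1000L;
    client.apps().deployments().inAnyNamespace()
        .inform(new ResourceEventHandler<Deployment>() {
          @Override public void onAdd(Deployment d) { enqueue(d); }
          @Override public void onUpdate(Deployment oldD, Deployment d) { enqueue(d); }
          @Override public void onDelete(Deployment d, boolean unknownFinalState) { enqueue(d); }
        }, resyncMillis);

    // The reconciler never sees an event, only a key: it re-reads the
    // current state and drives it toward the desired state.
    while (true) {
      String key = workQueue.take();
      String[] parts = key.split("/");
      Deployment current = client.apps().deployments()
          .inNamespace(parts[0]).withName(parts[1]).get();
      reconcile(parts[0], parts[1], current); // null means it was deleted
    }
  }

  private static void enqueue(Deployment d) {
    // Duplicate keys are fine; reconciliation is idempotent by design.
    workQueue.add(d.getMetadata().getNamespace() + "/" + d.getMetadata().getName());
  }

  private static void reconcile(String namespace, String name, Deployment current) {
    // Placeholder: compare observed state with desired state and act on the diff.
  }
}
```

Because the reconciler re-reads the current object for each key, out-of-order or duplicated events are harmless, and the resync timer covers the dropped-event case.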
I am new to Cadence/Temporal and was wondering what the design review process is like. My team is ready to put a formal design review together, but I was wondering if there is a template available to capture Cadence/Temporal-specific information?
This is something I like to call "workflow-oriented architecture". I would suggest thinking about the aspects below:
The different options/alternatives for which part of the process in the design can be modeled as a workflow. Based on that:
What will the workflowID be, and with which IDReusePolicy? It's usually recommended to use a business ID to guarantee uniqueness, so that only one workflow executes per business entity.
How is the workflow started, and with what information as input parameters?
Which Cadence/Temporal concepts are you planning to use, and how does the workflow interact with other systems? (A sketch combining several of these concepts follows this list.)
A regular/local/long-running activity performs an action against an external system.
A durable timer (workflow.Sleep or Workflow.Await) waits for a certain amount of time and then wakes up. Unlike a sleep in a native language, a durable timer is reliable: host restarts won't affect its firing.
A signal receives an event from an external system.
A query lets an external system read some workflow state.
Search attributes do two things: a) letting the application search for workflows matching certain conditions using the ListWorkflowExecutions API, and b) letting the application get basic status via the DescribeWorkflowExecution API.
How do you handle failures, especially using Cadence/Temporal concepts: activity retries, workflow retries, and reset?
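To make those concepts concrete, here is a minimal sketch using the Temporal Java SDK (the order domain, all names, and the timeout values are illustrative, not a recommended design). It combines an activity, a durable timer, a signal, and a query; the workflow would be started with the business ID as its workflowID (e.g. via WorkflowOptions' setWorkflowId):

```java
import io.temporal.activity.ActivityInterface;
import io.temporal.activity.ActivityOptions;
import io.temporal.workflow.QueryMethod;
import io.temporal.workflow.SignalMethod;
import io.temporal.workflow.Workflow;
import io.temporal.workflow.WorkflowInterface;
import io.temporal.workflow.WorkflowMethod;

import java.time.Duration;

public class OrderExample {

  @ActivityInterface
  public interface NotificationActivities {
    void notifyCustomer(String orderId); // an action against an external system
  }

  @WorkflowInterface
  public interface OrderWorkflow {
    @WorkflowMethod
    void processOrder(String orderId);   // orderId doubles as the business ID

    @SignalMethod
    void confirmPayment();               // an event from an external system

    @QueryMethod
    String getStatus();                  // lets external systems read state
  }

  public static class OrderWorkflowImpl implements OrderWorkflow {
    private String status = "PENDING";
    private boolean paid = false;

    private final NotificationActivities activities = Workflow.newActivityStub(
        NotificationActivities.class,
        ActivityOptions.newBuilder()
            .setStartToCloseTimeout(Duration.ofMinutes(1)) // retried per the retry policy
            .build());

    @Override
    public void processOrder(String orderId) {
      // Durable wait: survives worker/host restarts, unlike Thread.sleep.
      Workflow.await(Duration.ofDays(1), () -> paid);
      status = paid ? "PAID" : "EXPIRED";
      activities.notifyCustomer(orderId);
    }

    @Override public void confirmPayment() { paid = true; }
    @Override public String getStatus() { return status; }
  }
}
```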
A VkFence can be waited upon or queried about its state. Is it possible to have a callback invoked by the Vulkan implementation when the fence is ready instead?
This would allow it to be used with objects such as a std::condition_variable: when the fence becomes ready, the condition_variable would be notified.
Such an approach would also allow integration with libraries like Boost.Fiber, which would completely remove the need for the thread to sleep, but rather it could do useful work while waiting upon the fence.
If this is not possible in base Vulkan, is there an extension that allows it?
Vulkan doesn't work that way. Vulkan devices and queues execute independently of the CPU. Indeed, with one or two exceptions, Vulkan implementations only ever use CPU resources within the scope of a particular function call and only on the thread on which this call was made. Even debug callbacks are made within the scope of the function that caused the error.
There is no mechanism for Vulkan implementations to use CPU resources without the explicit consent of the user of the API (again, minus one or two exceptions). So no callbacks that act outside of an API call.
Vulkan does have a way to extract a native synchronization object from a VkFence, but it is surprisingly not useful on Windows. While you can get a HANDLE, it cannot be used by the Win32 API for waiting on it. It is mainly for interop with other APIs (like converting it to a D3D12 sync object), not for waiting on it yourself. The file descriptor extraction operation, however, can get a fully functional sync object... if the implementation lets you.
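That said, you can build the callback yourself in user space: dedicate a thread that blocks in vkWaitForFences and then notifies whatever primitive you like (a std::condition_variable in C++, a future here). A rough sketch using the LWJGL Vulkan bindings, since Java is used elsewhere on this page; `device` and `fence` are assumed to come from your existing setup:

```java
import org.lwjgl.system.MemoryStack;
import org.lwjgl.vulkan.VkDevice;

import java.nio.LongBuffer;
import java.util.concurrent.CompletableFuture;

import static org.lwjgl.vulkan.VK10.*;

public final class FenceWatcher {

  /** Spawns a waiter thread that blocks on the fence and completes the future. */
  public static CompletableFuture<Void> onFenceSignaled(VkDevice device, long fence) {
    CompletableFuture<Void> future = new CompletableFuture<>();
    Thread waiter = new Thread(() -> {
      try (MemoryStack stack = MemoryStack.stackPush()) {
        LongBuffer pFence = stack.longs(fence);
        // Blocks this thread only; Vulkan never calls back into our code.
        int result = vkWaitForFences(device, pFence, true, Long.MAX_VALUE);
        if (result == VK_SUCCESS) {
          future.complete(null);
        } else {
          future.completeExceptionally(new RuntimeException("vkWaitForFences: " + result));
        }
      }
    }, "vk-fence-waiter");
    waiter.setDaemon(true);
    waiter.start();
    return future;
  }
}
```

The future's continuations (thenRun, etc.) then play the role of the callback; the same pattern in C++ would notify a std::condition_variable from the waiter thread instead.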
I am implementing auto-scaling in an application using Axon Server, and running in k8s.
I have created REST endpoints in the application itself, which look at the local configuration (for processors and thread counts) and then call the Axon Server REST API to split/merge the processors appropriately. The intent is to use container lifecycle hooks to trigger them.
As a result, if a new instance (pod) of an application is launched, configured for 2 threads on ProcessorA, then my code will make 2 requests to the /v1/components/blah/processors/ProcessorA/segments/split?context=default endpoint on the server. This is in order to make full use of the 2 new threads.
Likewise, when the pod is shut down, it makes 2 similar requests to the merge endpoint on the server.
When scaling up I see the processor split twice, as expected. However, on shutdown I don't see the merge twice unless I put a long (5s) wait between requests. This isn't likely to be particularly stable, so I'm wondering if there's something else I need to be doing.
Perhaps I ought to request the merge, then loop waiting for it to occur, then request another. This seems like it's going to be excessively slow.
There was another question on SO somewhat related, Automatically scale Axon's tracking event processors, where Steven commented that there was no inbuilt auto-scaling in Axon Server at that point in time. I've not seen anything in more recent times either.
As it stands, work is underway to improve the split/merge functionality. For one, the result of a split/merge will be returned, which has been resolved under issue #1001.
This should make it so you do not have to wait for the statuses to have been updated, which is likely why it (seems to) take so long. This functionality will be part of Axon Framework / Server 4.4, by the way, which should be released relatively soon.
Subsequently, discussions are still underway to allow for auto-scaling. One requirement deemed important is the capability of a TrackingEventProcessor to process several segments per thread (issue #1434). This will ensure that the TEP can take over several segments to transition the boundary when scaling, for example.
Eventually though, Axon Server should be able to do this for you. It's just not there yet.
So for now I think the most pragmatic solution is indeed to wait for the result to show up in the statuses (a rough sketch of this follows below). As said, I trust 4.4 will improve upon this by returning the result of the split/merge operation once called. Lastly, the Axon team is aware this can be improved upon further, hence why discussions on the matter are underway.
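A rough sketch of that interim workaround with the plain Java 11 HttpClient. Note the assumptions: the status endpoint, its response shape, and the HTTP verb are guesses to check against your Axon Server version's REST API; only the split/merge path comes from your question. The idea is to poll until a merge is reflected before requesting the next one, rather than sleeping a fixed 5 seconds:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ProcessorScaler {

  private static final HttpClient http = HttpClient.newHttpClient();
  private static final String BASE =
      "http://axonserver:8024/v1/components/blah/processors/ProcessorA";

  public static void mergeSegments(int times) throws Exception {
    for (int i = 0; i < times; i++) {
      int before = segmentCount();
      // Use the same HTTP verb as your existing, working split/merge calls.
      http.send(HttpRequest.newBuilder(URI.create(BASE + "/segments/merge?context=default"))
              .method("PATCH", HttpRequest.BodyPublishers.noBody()).build(),
          HttpResponse.BodyHandlers.discarding());
      // Poll until the merge is reflected in the status (bounded retries).
      for (int attempt = 0; attempt < 50 && segmentCount() >= before; attempt++) {
        Thread.sleep(200);
      }
    }
  }

  // Hypothetical status call: read this instance's claimed segments from
  // whatever status endpoint your Axon Server version exposes.
  private static int segmentCount() throws Exception {
    HttpResponse<String> resp = http.send(
        HttpRequest.newBuilder(URI.create(BASE + "?context=default")).GET().build(),
        HttpResponse.BodyHandlers.ofString());
    return parseSegmentCount(resp.body());
  }

  private static int parseSegmentCount(String json) {
    // Placeholder: extract the segment count from the response body.
    return 0;
  }
}
```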
I'm interested in watching a stream of Events from Kubernetes, to determine whether a deployment was successful, or if any of the Pods were unable to be scheduled.
I could call the endpoint /api/v1/watch/events, or I could call /api/v1/events?watch=true. Is there a difference between the two? I'm confused about their respective purposes.
Thanks.
We're making watch a query param and removing it from the path (legacy form). You should call /api/v1/events?watch=true. See more discussions here if you're interested.
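For illustration, a minimal sketch of the recommended form using the plain Java 11 HttpClient (the in-cluster address and bearer-token handling are simplified, and a real client would also need the cluster CA configured for TLS):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.stream.Stream;

public class EventWatch {
  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();
    HttpRequest request = HttpRequest.newBuilder(
            URI.create("https://kubernetes.default.svc/api/v1/events?watch=true"))
        .header("Authorization", "Bearer " + System.getenv("K8S_TOKEN"))
        .build();

    // Each line of the chunked response is one JSON watch event:
    // {"type":"ADDED"|"MODIFIED"|"DELETED","object":{...Event...}}
    HttpResponse<Stream<String>> response =
        client.send(request, HttpResponse.BodyHandlers.ofLines());
    response.body().forEach(System.out::println);
  }
}
```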
In a typical web application, there are some things that I would prefer to run as delayed jobs/tasks. They tend to have some or all of the following properties:
Takes a long time (anywhere from multiple seconds to multiple minutes to multiple hours).
Occupy some resource heavily (CPU, network, disk, external API limits, etc.)
Result not immediately necessary. Can complete HTTP response without it. OK (and possibly even preferable) to delay until later.
Can be (and may be preferable to) run on different machine(s) than the web server(s), potentially dedicated job/task runners.
Should be run in response to other event(s), or started periodically.
What would be the preferred way(s) to set up, enqueue, schedule, and run delayed jobs/tasks in a Scala + Play Framework 2.x app?
For more details...
The pattern I have used in the past, and which I would like to replicate if applicable, is:
In handler of web request, or in cron-like call, enqueue job(s)
In job runner(s), repeatedly dequeue and run one job at a time
Possibly handle recording job results
This seems to be a relatively simple yet still relatively flexible pattern; a rough sketch follows below.
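For illustration only (not a Play-specific answer), here is that pattern sketched with a Redis list via Jedis, which is essentially the storage model Resque uses; the queue name and payload format are made up:

```java
import redis.clients.jedis.Jedis;
import java.util.List;

public class JobQueue {

  private static final String QUEUE = "jobs";

  // Called from a web request handler or a cron-like trigger.
  public static void enqueue(String jobPayload) {
    try (Jedis jedis = new Jedis("localhost")) {
      jedis.lpush(QUEUE, jobPayload);
    }
  }

  // Run on dedicated worker machines, separate from the web servers.
  public static void workerLoop() {
    try (Jedis jedis = new Jedis("localhost")) {
      while (true) {
        // Blocking pop: returns [queueName, payload]; timeout 0 = wait forever.
        List<String> item = jedis.brpop(0, QUEUE);
        runJob(item.get(1)); // record results here if needed
      }
    }
  }

  private static void runJob(String payload) {
    // Placeholder: dispatch on the payload (e.g., JSON with a job type).
  }
}
```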
Examples I have encountered in the past include:
Updating derived data in DB
Analytics/tracking API calls for a web request
Deleting expired sessions or other stale/outdated DB records
Periodic batch ETLs
In other languages/frameworks, I would typically use a job/task framework. Examples include:
Resque in a Ruby + Rails app
Celery in a Python + Django app
I have found the following existing materials, but unfortunately, I don't think they fit my use case directly.
Play 1.x asynchronous jobs API (+ various SO questions referencing it). Appears to have been removed in the 2.x line, with no reference to what replaced it.
Play 2.x Akka integration. Seems very general-purpose. I'd imagine it's possible to use Akka for the above, but I'd prefer not to write a jobs/tasks framework if one already exists. Also, no info on how to separate the job runner machine(s) from your web server(s).
This SO answer. Seems potentially promising for the "short to medium duration IO bound" case, e.g. analytics calls, but not necessarily for the "CPU bound" case (probably shouldn't tie up CPU on web server, prefer to ship off to different node), the "lots of network" case, or the "multiple hour" case (probably shouldn't leave that in the background on the web server, even if it isn't eating up too many resources).
This SO question, and related questions. Similar to above, it seems to me that this covers only the cases where it would be appropriate to run on the same web server.
Some further clarification on use cases (as per commenters' request). There are two main use cases that I have experienced with something like Resque or Celery that I am trying to replicate here:
Some event on the site (most often, an incoming web request) causes a task to be enqueued.
The task should run periodically. (Most often, this is implemented as: periodically, enqueue the task to be run as above.)
In the case of Resque or Celery, the tasks enqueued by both use cases enter queues the same way and are treated the same way by the runner/worker process. Barring other Scala- or Play-specific considerations, that would be my initial guess for how to approach this.
Some further clarification on why I do not believe the Akka scheduler fits my use case out-of-the-box (as per commenters' request):
While it is no doubt possible to construct a fitting solution using some combination of the Akka scheduler (for periodic jobs), akka-remote and akka-cluster (for communicating between the job caller and the job runner), that approach requires a certain amount of glue code which is almost a delayed job framework in and of itself. If it exists, I would prefer to use an existing out-of-the-box solution rather than reinvent the wheel.
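For concreteness, the periodic-trigger half of that glue might look like the following (Akka 2.6+ Java scheduler API, reusing the hypothetical JobQueue sketch from above); it's everything around it - remote workers, result recording, retries - that I'd rather not build myself:

```java
import akka.actor.ActorSystem;
import java.time.Duration;

public class PeriodicEnqueue {
  public static void main(String[] args) {
    ActorSystem system = ActorSystem.create("scheduler");
    // Every hour, enqueue a cleanup job for the workers to pick up.
    system.scheduler().scheduleAtFixedRate(
        Duration.ZERO, Duration.ofHours(1),
        () -> JobQueue.enqueue("{\"type\":\"deleteExpiredSessions\"}"),
        system.dispatcher());
  }
}
```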