Monitoring a Kubernetes job

I have Kubernetes jobs that take a variable amount of time to complete, between 4 and 8 minutes. Is there any way I can know when a job has completed, rather than always waiting 8 minutes assuming the worst case? I have a test case that does the following:
1) Submits the Kubernetes job.
2) Waits for its completion.
3) Checks whether the job has had the expected effect.
The problem is that in my Java test that submits the deployment job to Kubernetes, I am waiting for 8 minutes even if the job takes less time to complete, as I don't have a way to monitor the status of the job from the Java test.

$ kubectl wait --for=condition=complete --timeout=600s job/myjob

The
<kube master>/apis/batch/v1/namespaces/default/jobs
endpoint lists the status of the jobs. I have parsed this JSON and retrieved the name of the latest running job that starts with "deploy...".
Then we can hit
<kube master>/apis/batch/v1/namespaces/default/jobs/<job name retrieved above>
and monitor the status field, which looks as below when the job succeeds:
"status": {
"conditions": [
{
"type": "Complete",
"status": "True",
"lastProbeTime": "2016-09-22T13:59:03Z",
"lastTransitionTime": "2016-09-22T13:59:03Z"
}
],
"startTime": "2016-09-22T13:56:42Z",
"completionTime": "2016-09-22T13:59:03Z",
"succeeded": 1
}
So we keep polling this endpoint until the job completes. Hope this helps someone.
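If you'd rather not parse the JSON by hand, here is a minimal sketch of the same polling idea using the Go client (client-go) instead of raw REST calls. The job name, namespace, kubeconfig path, and poll interval below are illustrative assumptions, not from the original answer:

package main

import (
    "context"
    "fmt"
    "time"

    batchv1 "k8s.io/api/batch/v1"
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

// waitForJob polls the Job until its Complete or Failed condition is true,
// i.e. the same status fields shown in the JSON above.
func waitForJob(clientset *kubernetes.Clientset, namespace, name string, timeout time.Duration) error {
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        job, err := clientset.BatchV1().Jobs(namespace).Get(context.TODO(), name, metav1.GetOptions{})
        if err != nil {
            return err
        }
        for _, cond := range job.Status.Conditions {
            if cond.Status == corev1.ConditionTrue {
                if cond.Type == batchv1.JobComplete {
                    return nil
                }
                if cond.Type == batchv1.JobFailed {
                    return fmt.Errorf("job %s failed: %s", name, cond.Message)
                }
            }
        }
        time.Sleep(5 * time.Second)
    }
    return fmt.Errorf("timed out waiting for job %s", name)
}

func main() {
    config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
    if err != nil {
        panic(err)
    }
    clientset, err := kubernetes.NewForConfig(config)
    if err != nil {
        panic(err)
    }
    if err := waitForJob(clientset, "default", "myjob", 10*time.Minute); err != nil {
        panic(err)
    }
    fmt.Println("job completed")
}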

You can use the NewSharedInformer method to watch the jobs' statuses. I am not sure how to write it in Java; here is a Go example that refreshes your job list periodically:
type ClientImpl struct {
    clients *kubernetes.Clientset
}

type JobListFunc func() ([]batchv1.Job, error)

var (
    jobsSelector = labels.SelectorFromSet(labels.Set(map[string]string{"job_label": "my_label"})).String()
)

func (c *ClientImpl) NewJobSharedInformer(resyncPeriod time.Duration) JobListFunc {
    var once sync.Once
    var jobListFunc JobListFunc
    once.Do(
        func() {
            restClient := c.clients.BatchV1().RESTClient()
            optionsModifier := func(options *metav1.ListOptions) {
                options.LabelSelector = jobsSelector
            }
            watchList := cache.NewFilteredListWatchFromClient(restClient, "jobs", metav1.NamespaceAll, optionsModifier)
            informer := cache.NewSharedInformer(watchList, &batchv1.Job{}, resyncPeriod)
            // Run the informer in the background; it keeps the local store in sync.
            go informer.Run(context.Background().Done())
            jobListFunc = JobListFunc(func() (jobs []batchv1.Job, err error) {
                for _, c := range informer.GetStore().List() {
                    jobs = append(jobs, *(c.(*batchv1.Job)))
                }
                return jobs, nil
            })
        })
    return jobListFunc
}
Then, in your monitor, you can check the status by ranging over the job list:
func syncJobStatus() {
    jobs, err := jobListFunc()
    if err != nil {
        log.Errorf("Failed to list jobs: %v", err)
        return
    }
    // TODO: other code
    for _, job := range jobs {
        name := job.Name
        // check status...
    }
}
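To tie the pieces together, one possible wiring is sketched below; the in-cluster config and the 30-second resync period are assumptions for illustration (this is not from the original answer, and it assumes the usual rest, kubernetes, log, and time imports):

// Build a clientset, construct ClientImpl from above, and obtain the
// informer-backed list function once at startup; syncJobStatus above can
// then capture jobListFunc and be called periodically.
config, err := rest.InClusterConfig() // or clientcmd for out-of-cluster use
if err != nil {
    log.Fatalf("failed to load cluster config: %v", err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
    log.Fatalf("failed to create clientset: %v", err)
}
c := &ClientImpl{clients: clientset}
jobListFunc := c.NewJobSharedInformer(30 * time.Second)
_ = jobListFunc // captured by syncJobStatus above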

I found that the JobStatus does not get updated while polling using job.getStatus(), even though the status does change when checking from the command line with kubectl.
To get around this, I reload the job handle:
client.extensions().jobs()
    .inNamespace(myJob.getMetadata().getNamespace())
    .withName(myJob.getMetadata().getName())
    .get();
My loop to check the job status looks like this:
KubernetesClient client = new DefaultKubernetesClient(config);
Job myJob = client.extensions().jobs()
    .load(new FileInputStream("/path/x.yaml"))
    .create();
boolean jobActive = true;
while (jobActive) {
    myJob = client.extensions().jobs()
        .inNamespace(myJob.getMetadata().getNamespace())
        .withName(myJob.getMetadata().getName())
        .get();
    JobStatus myJobStatus = myJob.getStatus();
    System.out.println("==================");
    System.out.println(myJobStatus.toString());
    if (myJob.getStatus().getActive() == null) {
        jobActive = false;
    } else {
        System.out.println(myJob.getStatus().getActive());
        System.out.println("Sleeping for a minute before polling again!!");
        Thread.sleep(60000);
    }
}
System.out.println(myJob.getStatus().toString());
Hope this helps

You did not mention what is actually checking for job completion, but instead of waiting blindly and hoping for the best, you should keep polling the job status inside a loop until its condition becomes "Complete".

Since you said Java, you can use the Kubernetes Java bindings from fabric8 to start the job and add a watcher:
KubernetesClient k = ...
k.extensions().jobs().load(yaml).watch(new Watcher<Job>() {
    @Override
    public void onClose(KubernetesClientException e) {}

    @Override
    public void eventReceived(Action a, Job j) {
        // The counters are null until at least one pod reaches that state.
        if (j.getStatus().getSucceeded() != null && j.getStatus().getSucceeded() > 0)
            System.out.println("At least one job attempt succeeded");
        if (j.getStatus().getFailed() != null && j.getStatus().getFailed() > 0)
            System.out.println("At least one job attempt failed");
    }
});

I don't know what kind of tasks you are talking about, but let's assume you are running some pods. You can do
watch 'kubectl get pods | grep <name of the pod>'
or
kubectl get pods -w
It will not be the full name, of course, as most of the time pods get random names; if you are running an nginx replica or deployment, your pods will end up with something like nginx-1696122428-ftjvy, so you will want to do
watch 'kubectl get pods | grep nginx'
You can replace pods with whatever resource you are working with, i.e. (rc, svc, deployments, ...).

Related

Trigger When A New Pod Is Created on Kubernetes

I have a question that I have encountered on my project. In my project, briefly, when a user clicks a button, a pod is created, does some operations, and is finally deleted. I have to measure the pod's running time and deduct the duration from the user's credit. I want to manage this externally. Is it possible to know and manage when a new pod has been created and destroyed from outside of the pods?
Thanks
I think there are multiple approaches to this problem and I can suggest one.
If you are working with a single container, then postStart and preStop hooks can be useful. Here is the official doc of container hooks.
The function I wrote looks like the following:
func WatchPods() {
    clientset, err := utils.NewClient(true) // create k8s client
    if err != nil {
        // handle error
    }
    watchlist := cache.NewListWatchFromClient(
        clientset.CoreV1().RESTClient(),
        string(v1.ResourcePods),
        "default",
        fields.Everything(),
    )
    _, controller := cache.NewInformer(
        watchlist,
        &v1.Pod{},
        0,
        cache.ResourceEventHandlerFuncs{
            AddFunc: func(obj interface{}) { // called when a new pod is created
                pod := obj.(*v1.Pod)
                _ = pod // do whatever you want with the new pod
            },
            DeleteFunc: func(obj interface{}) { // called when a pod is deleted
                pod := obj.(*v1.Pod)
                _ = pod // do whatever you want with the deleted pod
            },
        },
    )
    stop := make(chan struct{})
    defer close(stop)
    controller.Run(stop)
}

How to trigger a rollout restart on deployment resource from controller-runtime

I have been using kubebuilder for writing a custom controller, and I am aware of the Get(), Update(), and Delete() methods that it provides. But now I am looking for a method which mimics the behaviour of kubectl rollout restart deployment. If there is no such direct method, then I am looking for the correct way to mimic it.
type CustomReconciler struct {
    client.Client
    Log    logr.Logger
    Scheme *runtime.Scheme
}

func (r *CustomReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    logger := r.Log.WithValues("request", req.NamespacedName)

    configMap := &v1.ConfigMap{}
    err := r.Get(ctx, req.NamespacedName, configMap)
    if err != nil {
        logger.Error(err, "Failed to GET configMap")
        return ctrl.Result{}, err
    }
Say in the above code I read a deployment name from the ConfigMap and want to rollout-restart that deployment, as follows:
val := configMap.Data["config.yml"]
config := Config{}
if err := yaml.Unmarshal([]byte(val), &config); err != nil {
    logger.Error(err, "failed to unmarshal config data")
    return ctrl.Result{}, err
}
// Need equivalent of the following
// r.RolloutRestart(config.DeploymentName)
In all cases where you wish to replicate kubectl behaviour, the answer is always to increase its verbosity (e.g. --v=8), and it will show you exactly what it is doing, sometimes down to the wire payloads.
For rollout restart, one will find that it just bumps an annotation on the Deployment/StatefulSet/whatever, which makes the outer object "different" and triggers a reconciliation run.
You can squat on their annotation, or you can make up your own, or you can use a label change; practically any "meaningless" change will do.
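For reference, the annotation that kubectl bumps is kubectl.kubernetes.io/restartedAt on the pod template. Below is a minimal controller-runtime sketch of the same idea; the helper name and the use of a merge patch are my own choices for illustration, not from the original post:

// Assumes the following imports in addition to those above:
//   "time"
//   appsv1 "k8s.io/api/apps/v1"
//   "k8s.io/apimachinery/pkg/types"
//   "sigs.k8s.io/controller-runtime/pkg/client"

// rolloutRestartDeployment mimics `kubectl rollout restart deployment/<name>`:
// bumping the restartedAt annotation on the pod template changes the template,
// so the Deployment controller rolls out new pods.
func (r *CustomReconciler) rolloutRestartDeployment(ctx context.Context, namespace, name string) error {
    deploy := &appsv1.Deployment{}
    if err := r.Get(ctx, types.NamespacedName{Namespace: namespace, Name: name}, deploy); err != nil {
        return err
    }
    patch := client.MergeFrom(deploy.DeepCopy())
    if deploy.Spec.Template.Annotations == nil {
        deploy.Spec.Template.Annotations = map[string]string{}
    }
    deploy.Spec.Template.Annotations["kubectl.kubernetes.io/restartedAt"] = time.Now().Format(time.RFC3339)
    return r.Patch(ctx, deploy, patch)
}

It could then be called from Reconcile as r.rolloutRestartDeployment(ctx, req.Namespace, config.DeploymentName); using a merge patch rather than a full Update reduces the chance of conflicts with other writers.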

Kubernetes Operator to read watch namespace from config map

I have an operator built via kubebuilder that reads the "WATCH_NAMESPACE" environment variable to understand which namespaces to watch. This is how the current setup works:
namespaces := os.Getenv("WATCH_NAMESPACE")
if strings.Contains(namespaces, ",") {
    setupLog.Info("Operator will listen to the specific namespaces: " + namespaces)
    options.NewCache = cache.MultiNamespacedCacheBuilder(strings.Split(namespaces, ","))
} else {
    log.Info("Operator will listen to only one namespace or all namespaces: " + namespaces)
    options.Namespace = namespaces
}

mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), options)
if err != nil {
    log.Error(err, "unable to start manager")
    os.Exit(1)
}

if err = (&dataplatformcontroller.DruidReconciler{
    ...
}
This works fine.
But the problem is that every time we need to add a new namespace to watch, we need to restart the operator. I believe the best option here is to watch a ConfigMap and read the namespace list from it every time.
But I am not sure how to proceed with this. Any suggestions, documentation, or links would be helpful.
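For what it's worth, here is a minimal sketch of the ConfigMap-watching idea, using the same informer pattern as the pod-watching example earlier. All names (the ConfigMap name "watch-namespaces", its "namespaces" key, the helper name) are assumptions for illustration, and note that the manager's namespace cache cannot be changed in place, so the onChange callback would still have to rebuild or restart the manager with the new list:

// Assumes the following imports:
//   "strings"
//   corev1 "k8s.io/api/core/v1"
//   "k8s.io/apimachinery/pkg/fields"
//   "k8s.io/client-go/kubernetes"
//   "k8s.io/client-go/tools/cache"

// watchNamespaceConfig watches a single ConfigMap ("watch-namespaces" in the
// operator's namespace) and calls onChange with the new namespace list each
// time it is updated. Run the returned controller with a stop channel, e.g.
// go watchNamespaceConfig(clientset, ns, reload).Run(stopCh).
func watchNamespaceConfig(clientset kubernetes.Interface, operatorNamespace string, onChange func([]string)) cache.Controller {
    watchList := cache.NewListWatchFromClient(
        clientset.CoreV1().RESTClient(),
        "configmaps",
        operatorNamespace,
        fields.OneTermEqualSelector("metadata.name", "watch-namespaces"),
    )
    _, controller := cache.NewInformer(
        watchList,
        &corev1.ConfigMap{},
        0,
        cache.ResourceEventHandlerFuncs{
            UpdateFunc: func(_, newObj interface{}) {
                cm := newObj.(*corev1.ConfigMap)
                onChange(strings.Split(cm.Data["namespaces"], ","))
            },
        },
    )
    return controller
}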

Workflow execution fails when a worker is restarted on the same workflow service client

We're in the process of writing a .NET Cadence client and ran into an issue while unit testing workflows. When we start a worker, execute a workflow, stop the worker, start it again, and then try to execute another workflow, the first workflow completes, but any workflow after the first hangs during the client.ExecuteWorkflow() call, eventually failing with a START_TO_CLOSE timeout. I replicated this behavior by munging the greetings cadence-samples workflow; see the loop in func main():
package main

import (
    "context"
    "time"

    "go.uber.org/cadence/client"
    "go.uber.org/cadence/worker"
    "go.uber.org/zap"

    "github.com/pborman/uuid"
    "github.com/samarabbas/cadence-samples/cmd/samples/common"
)

// This needs to be done as part of a bootstrap step when the process starts.
// The workers are supposed to be long running.
func startWorkers(h *common.SampleHelper) worker.Worker {
    // Configure worker options.
    workerOptions := worker.Options{
        MetricsScope: h.Scope,
        Logger:       h.Logger,
    }
    return h.StartWorkers(h.Config.DomainName, ApplicationName, workerOptions)
}

func startWorkflow(h *common.SampleHelper) client.WorkflowRun {
    workflowOptions := client.StartWorkflowOptions{
        ID:                              "greetings_" + uuid.New(),
        TaskList:                        ApplicationName,
        ExecutionStartToCloseTimeout:    time.Minute,
        DecisionTaskStartToCloseTimeout: time.Minute,
    }
    return h.StartWorkflow(workflowOptions, SampleGreetingsWorkflow)
}

func main() {
    // setup the SampleHelper
    var h common.SampleHelper
    h.SetupServiceConfig()

    // Loop:
    // - start a worker
    // - start a workflow
    // - block and wait for workflow result
    // - stop the worker
    for i := 0; i < 3; i++ {
        // start the worker
        // execute the workflow
        workflowWorker := startWorkers(&h)
        workflowRun := startWorkflow(&h)

        // create context
        // get workflow result
        var result string
        ctx, cancel := context.WithCancel(context.Background())
        err := workflowRun.Get(ctx, &result)
        if err != nil {
            panic(err)
        }

        // log the result
        h.Logger.Info("Workflow Completed", zap.String("Result", result))

        // stop the worker
        // cancel the context
        workflowWorker.Stop()
        cancel()
    }
}
This is not a blocking issue and will probably not come up in production.
Background:
We (Jeff Lill and I) noticed this issue during unit testing workflows in our .NET Cadence client. When we run our workflow tests individually they all pass, but when we run multiple at a time (sequentially, not in parallel), we see the behavior described above. This is because of the cleanup done in the .NET Cadence client dispose() method called after a test completes (pass or fail). One of the dispose behaviors is to stop workers created during a test. When the next test runs, new workers are created using the same workflow service client, and this is where the issue arises.

Service Fabric Reliable Queues FabricNotReadableException

I have a Stateful service with 1000 partitions and 1 replica.
This service, in its RunAsync method, has an infinite while loop where I call a Reliable Queue to get messages.
If there are no messages, I wait 5 seconds, then retry.
I used to do exactly that with an Azure Storage Queue with success.
But with Service Fabric I'm getting thousands of FabricNotReadableExceptions; the service becomes unstable and I'm not able to update or delete it, so I have to delete the entire cluster.
I tried to update it and after 18 hours it was still stuck, so there is something terribly wrong in what I'm doing.
This is the method code:
public async Task<QueueObject> DeQueueAsync(string queueName)
{
    var q = await StateManager.GetOrAddAsync<IReliableQueue<string>>(queueName);
    using (var tx = StateManager.CreateTransaction())
    {
        try
        {
            var dequeued = await q.TryDequeueAsync(tx);
            if (dequeued.HasValue)
            {
                await tx.CommitAsync();
                var result = dequeued.Value;
                return JSON.Deserialize<QueueObject>(result);
            }
            else
            {
                return null;
            }
        }
        catch (Exception e)
        {
            ServiceEventSource.Current.ServiceMessage(this, $"!!ERROR!!: {e.Message} - Partition: {Partition.PartitionInfo.Id}");
            return null;
        }
    }
}
This is the RunAsync:
protected override async Task RunAsync(CancellationToken cancellationToken)
{
    while (true)
    {
        var message = await DeQueueAsync("MyQueue");
        if (message != null)
        {
            // process, takes around 500ms
        }
        else
        {
            Thread.Sleep(5000);
        }
    }
}
I also changed Thread.Sleep(5000) to Task.Delay and was getting thousands of "A task was canceled" errors.
What am I missing here?
Is the cycle too fast, so that SF cannot update the other replicas in time?
Should I remove all the replicas, leaving just one?
Should I use the new ConcurrentQueue instead?
I have the problem in production and locally, with 50 or 1000 partitions; it doesn't matter.
I'm stuck and confused.
Thanks
You need to honor the cancellationToken that is passed in to your RunAsync implementation. Service Fabric will cancel the token when it wants to stop your service for any reason, including upgrades, and it will wait indefinitely for RunAsync to return after cancelling the token. This could explain why you couldn't upgrade your application.
I would suggest checking cancellationToken.IsCancellationRequested inside your loop and breaking out if it has been cancelled.
FabricNotReadableException can happen for a variety of reasons; the answer to this question has a comprehensive explanation, but the takeaway is:
You can consider FabricNotReadableException retriable. If you see it, just try the call again and eventually it will resolve into either NotPrimary or Granted.