Workflow execution fails when a worker is restarted on the same workflow service client - cadence-workflow

We're in the process of writing a .NET Cadence client and ran into an issue while unit testing workflows. When we start a worker, execute a workflow, stop the worker, start it again, and then try to execute another workflow, the first workflow completes, but every workflow after the first hangs in the client.ExecuteWorkflow() call and eventually fails with a START_TO_CLOSE timeout. I reproduced this behavior by modifying the greetings workflow from cadence-samples. See the loop in func main():
package main

import (
    "context"
    "time"

    "go.uber.org/cadence/client"
    "go.uber.org/cadence/worker"
    "go.uber.org/zap"

    "github.com/pborman/uuid"
    "github.com/samarabbas/cadence-samples/cmd/samples/common"
)

// This needs to be done as part of a bootstrap step when the process starts.
// The workers are supposed to be long running.
func startWorkers(h *common.SampleHelper) worker.Worker {
    // Configure worker options.
    workerOptions := worker.Options{
        MetricsScope: h.Scope,
        Logger:       h.Logger,
    }
    return h.StartWorkers(h.Config.DomainName, ApplicationName, workerOptions)
}

func startWorkflow(h *common.SampleHelper) client.WorkflowRun {
    workflowOptions := client.StartWorkflowOptions{
        ID:                              "greetings_" + uuid.New(),
        TaskList:                        ApplicationName,
        ExecutionStartToCloseTimeout:    time.Minute,
        DecisionTaskStartToCloseTimeout: time.Minute,
    }
    return h.StartWorkflow(workflowOptions, SampleGreetingsWorkflow)
}

func main() {
    // setup the SampleHelper
    var h common.SampleHelper
    h.SetupServiceConfig()

    // Loop:
    // - start a worker
    // - start a workflow
    // - block and wait for the workflow result
    // - stop the worker
    for i := 0; i < 3; i++ {
        // start the worker and execute the workflow
        workflowWorker := startWorkers(&h)
        workflowRun := startWorkflow(&h)

        // create a context and get the workflow result
        var result string
        ctx, cancel := context.WithCancel(context.Background())
        err := workflowRun.Get(ctx, &result)
        if err != nil {
            panic(err)
        }

        // log the result
        h.Logger.Info("Workflow Completed", zap.String("Result", result))

        // stop the worker and cancel the context
        workflowWorker.Stop()
        cancel()
    }
}
This is not a blocking issue and will probably not come up in production.
Background:
We (Jeff Lill and I) noticed this issue while unit testing workflows in our .NET Cadence client. When we run our workflow tests individually they all pass, but when we run several in a row (sequentially, not in parallel), we see the behavior described above. The cause is the cleanup performed by the .NET Cadence client's dispose() method, which is called after each test completes (pass or fail). One of its dispose behaviors is to stop any workers created during the test. When the next test runs, new workers are created using the same workflow service client, and this is where the issue arises.

Related

Postgres 13.6 background process causes high CPU with the warning "worker took too long to start; canceled"

I am using PostgreSQL and Golang in my project.
I have created a PostgreSQL background worker in Go. This background worker listens to a LISTEN/NOTIFY channel in Postgres and writes the data it receives to a file.
The trigger (which supplies data to the channel), the background worker registration, and the background worker's main function are all in one .so file. The core background worker logic is in another helper .so file. The background worker's main function loads the helper .so with dlopen and executes the core logic from it.
Issue faced:
It works fine on Windows, but on Linux, where I tried Postgres 13.6, there is a problem.
It runs as intended for a while, but after about 2-3 hours the PostgreSQL postmaster's CPU utilization shoots up and it seems to get stuck. The spike is very sudden (it stays normal right up until it starts climbing, and the climb takes less than 5 minutes).
I am not able to establish a connection to Postgres from the psql client.
The following message keeps repeating in the log:
WARNING: worker took too long to start; canceled.
I tried commenting out various areas of my code, adding sleeps at different places in the core processing loop, and even disabling the trigger, but the issue still occurs.
Software and libraries used:
go version go1.17.5 linux/amd64
Postgres 13.6, on Ubuntu
lib/pq: https://github.com/lib/pq
My core processing loop looks like this:
maxReconn := time.Minute
listener := pq.NewListener(<connectionstring>, minReconn, maxReconn, EventCallBackFn) // libpq used here.
defer listener.UnlistenAll()

if err = listener.Listen("mystream"); err != nil {
    panic(err)
}

var itemsProcessedSinceLastSleep int = 0

for {
    select {
    case signal := <-signalChan:
        PgLog(PG_LOG, "Exiting loop due to termination signal : %d", signal)
        return 1

    case pgstatus := <-pmStatusChan:
        PgLog(PG_LOG, "Exiting loop as postmaster is not running : %d ", pgstatus)
        return 1

    case data := <-listener.Notify:
        itemsProcessedSinceLastSleep = itemsProcessedSinceLastSleep + 1
        if itemsProcessedSinceLastSleep >= 1000 {
            time.Sleep(time.Millisecond * 10)
            itemsProcessedSinceLastSleep = 0
        }
        ProcessChangeReceivedAtStream(data) // This performs the data processing

    case <-time.After(10 * time.Second):
        time.Sleep(100 * time.Millisecond)
        var cEpoch = time.Now().Unix()
        if cEpoch-lastConnChkTime > 1800 {
            lastConnChkTime = cEpoch
            if err := listener.Ping(); err != nil {
                PgLog(PG_LOG, "Seems to be a problem with connection")
            }
        }

    default:
        time.Sleep(time.Millisecond * 100)
    }
}

cadence go-client: client call to the server to fetch workflow results ends in a panic

First-time user of Cadence.
Scenario:
I have a Cadence server running in my sandbox environment.
The intent is to fetch the workflow status.
I am trying to use the Cadence client
go.uber.org/cadence/client
on my local host to talk to my sandbox Cadence server.
This is my simple code snippet:
var cadClient client.Client

func main() {
    wfID := "01ERMTDZHBYCH4GECHB3J692PC" // << I got this from cadence-ui
    ctx := context.Background()
    wf := cadClient.GetWorkflow(ctx, wfID, "") // <<< Panic hits here
    log.Println("Workflow RunID: ", wf.GetID())
}
I am sure I am getting this wrong because the client does not know how to reach the Cadence server.
I referred to https://cadenceworkflow.io/docs/go-client/ to find the correct usage but could not find any reference (it is possible that I missed it).
Any help in how to resolve/implement this would be much appreciated.
I am not sure what panic you got. Based on the code snippet, it's likely that you haven't initialized the client.
To initialize it, follow the sample code here: https://github.com/uber-common/cadence-samples/blob/master/cmd/samples/common/sample_helper.go#L82
and
https://github.com/uber-common/cadence-samples/blob/aac75c7ca03ec0c184d0f668c8cd0ea13d3a7aa4/cmd/samples/common/factory.go#L113
ch, err := tchannel.NewChannelTransport(
    tchannel.ServiceName(_cadenceClientName))
if err != nil {
    b.Logger.Fatal("Failed to create transport channel", zap.Error(err))
}
b.Logger.Debug("Creating RPC dispatcher outbound",
    zap.String("ServiceName", _cadenceFrontendService),
    zap.String("HostPort", b.hostPort))
b.dispatcher = yarpc.NewDispatcher(yarpc.Config{
    Name: _cadenceClientName,
    Outbounds: yarpc.Outbounds{
        _cadenceFrontendService: {Unary: ch.NewSingleOutbound(b.hostPort)},
    },
})
if b.dispatcher != nil {
    if err := b.dispatcher.Start(); err != nil {
        b.Logger.Fatal("Failed to create outbound transport channel: %v", zap.Error(err))
    }
}
client := workflowserviceclient.New(b.dispatcher.ClientConfig(_cadenceFrontendService))
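If it helps, here is a minimal end-to-end sketch of that initialization, condensed from the samples. The host/port and domain are placeholders, the service names simply follow the samples' conventions, and the workflow ID is the one from your snippet:

package main

import (
    "context"
    "log"

    "go.uber.org/cadence/.gen/go/cadence/workflowserviceclient"
    "go.uber.org/cadence/client"
    "go.uber.org/yarpc"
    "go.uber.org/yarpc/transport/tchannel"
)

func main() {
    // Placeholder values; use your own frontend address and registered domain.
    hostPort := "127.0.0.1:7933"
    domain := "samples-domain"

    // Build a TChannel transport and a YARPC dispatcher pointed at the Cadence frontend.
    ch, err := tchannel.NewChannelTransport(tchannel.ServiceName("cadence-client"))
    if err != nil {
        log.Fatalf("failed to create transport channel: %v", err)
    }
    dispatcher := yarpc.NewDispatcher(yarpc.Config{
        Name: "cadence-client",
        Outbounds: yarpc.Outbounds{
            "cadence-frontend": {Unary: ch.NewSingleOutbound(hostPort)},
        },
    })
    if err := dispatcher.Start(); err != nil {
        log.Fatalf("failed to start dispatcher: %v", err)
    }

    // Wrap the RPC client config into a workflow service client, then a Cadence client.
    service := workflowserviceclient.New(dispatcher.ClientConfig("cadence-frontend"))
    cadClient := client.NewClient(service, domain, &client.Options{})

    // Now GetWorkflow has an initialized client behind it and will not panic.
    wf := cadClient.GetWorkflow(context.Background(), "01ERMTDZHBYCH4GECHB3J692PC", "")
    log.Println("Workflow RunID:", wf.GetRunID())
}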

How to empty Jenkins scheduled job list

I'm using the Jenkins REST API to build and schedule jobs.
The problem is that I scheduled one job for the weekend, but it was executed several times (the same job ran every minute).
For the rest of the week the job is executed only once, so is there any GUI option to empty the weekend job list?
You can use the following Groovy script (for example, from the Jenkins Script Console) to clean all, or part, of your queue.
This example cancels all queued jobs whose name starts with a specific branch name:
import jenkins.model.*

def branchName = build.environment.get("GIT_BRANCH_NAME")

println "=========before clean the queue ... =="
def q = Jenkins.instance.queue
q.items.each {
    println("${it.task.name}:")
}

q.items.findAll { it.task.name.startsWith(branchName) }.each { q.cancel(it.task) }

println "=========after clean the queue ... =="
q = Jenkins.instance.queue
q.items.each {
    println("${it.task.name}:")
}

Golang channel in select not receiving

I am currently working on a small script where I use channels, select, and goroutines, and I really don't understand why it doesn't run the way I expect.
I have 2 channels that all my goroutines listen to.
I pass the channels to each goroutine, which contains a select that must choose between the 2 depending on where data arrives first.
The problem is that no goroutine ever falls into the second case. I can receive 100 jobs one after the other and I see everything in the log: the first case does what is expected and then sends the job on the second channel (at least it should...), but after that I get no more logs.
I just don't understand why...
If someone can enlighten me :)
package main

func main() {
    wg := new(sync.WaitGroup)

    in := make(chan *Job)
    out := make(chan *Job)
    results := make(chan *Job)

    for i := 0; i < 50; i++ {
        go work(wg, in, out, results)
    }

    wg.Wait()

    // Finally we collect all the results of the work.
    for elem := range results {
        fmt.Println(elem)
    }
}

func Work(wg *sync.WaitGroup, in chan *Job, out chan *Job, results chan *Job) {
    wg.Add(1)
    defer wg.Done()

    for {
        select {
        case job := <-in:
            ticker := time.Tick(10 * time.Second)
            select {
            case <-ticker:
                // DO stuff
                if condition is true {
                    out <- job
                }
            case <-time.After(5 * time.Minute):
                fmt.Println("Timeout")
            }

        case job := <-out:
            ticker := time.Tick(1 * time.Minute)
            select {
            case <-ticker:
                // DO stuff
                if condition is true {
                    results <- job
                }
            case <-quitOut:
                fmt.Println("Job completed")
            }
        }
    }
}
I create a number of workers that listen to 2 channels and send the final results to a 3rd.
Each worker does something with the received job; if the job satisfies a given condition, the worker passes it to the next channel, and if it then satisfies another condition, it passes the job into the results channel.
So, in my head, the pipeline for 5 workers, for example, looked like this: 3 jobs go into the IN channel, 3 workers pick them up directly, and if all 3 jobs satisfy the condition they are sent to the OUT channel. 2 workers pick those up directly, and the 3rd job is picked up by one of the first 3 workers...
Now I hope you have a better understanding of my first code. But in my code, I never get to the second case.
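For illustration, here is a minimal runnable sketch of the in → out → results pipeline described above. The Job type, stage sizes, and job count are placeholders, not the original code; the key idea is one WaitGroup per stage, so each channel can be closed once its producers are done and the final range terminates:

package main

import (
    "fmt"
    "sync"
)

// Job is a placeholder payload for the sketch.
type Job struct{ ID int }

func main() {
    in := make(chan *Job)
    out := make(chan *Job)
    results := make(chan *Job)

    var stage1, stage2 sync.WaitGroup

    // Stage 1: read from in and forward jobs that pass the first check to out.
    for i := 0; i < 3; i++ {
        stage1.Add(1)
        go func() {
            defer stage1.Done()
            for job := range in {
                // ... do stuff; forward when the condition holds ...
                out <- job
            }
        }()
    }

    // Stage 2: read from out and forward jobs that pass the second check to results.
    for i := 0; i < 2; i++ {
        stage2.Add(1)
        go func() {
            defer stage2.Done()
            for job := range out {
                results <- job
            }
        }()
    }

    // Feed the pipeline, then close each channel once its producers are finished.
    go func() {
        for i := 0; i < 5; i++ {
            in <- &Job{ID: i}
        }
        close(in)
    }()
    go func() { stage1.Wait(); close(out) }()
    go func() { stage2.Wait(); close(results) }()

    // Collect all the results; this loop ends when results is closed.
    for job := range results {
        fmt.Println("result:", job.ID)
    }
}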
I think your solution might be a bit overcomplicated. Here is a simplified version. Bear in mind that there are numerous possible implementations. A good article to read:
https://medium.com/smsjunk/handling-1-million-requests-per-minute-with-golang-f70ac505fcaa
Or, even better, straight from the Go handbook:
https://gobyexample.com/worker-pools (which I think may be what you were aiming for)
Anyway, the code below serves as a different type of example. There are a few ways to go about solving this problem.
package main

import (
    "context"
    "log"
    "os"
    "sync"
    "time"
)

type worker struct {
    wg   *sync.WaitGroup
    in   chan job
    quit context.Context
}

type job struct {
    message int
}

func main() {
    numberOfJobs := 50

    ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
    defer cancel()

    w := worker{
        wg:   &sync.WaitGroup{},
        in:   make(chan job),
        quit: ctx,
    }

    for i := 0; i < numberOfJobs; i++ {
        go func(i int) {
            w.in <- job{message: i}
        }(i)
    }

    counter := 0
    for {
        select {
        case j := <-w.in:
            counter++
            log.Printf("Received job %+v\n", j)
            // DO SOMETHING WITH THE RECEIVED JOB
            // WORKING ON IT
            x := j.message * j.message
            log.Printf("job processed, result %d", x)
        case <-w.quit.Done():
            log.Printf("Received quit, timeout reached. Number of jobs queued: %d, Number of jobs complete: %d\n", numberOfJobs, counter)
            os.Exit(0)
        default:
            // TODO
        }
    }
}
Your quitIn and quitOut channels are basically useless: you create them and try to receive from them, which you cannot, because nobody can write to these channels; nobody else even knows they exist. I cannot say more because I do not understand what the code is supposed to do.
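To illustrate the point (a toy example, not your code): a select case that receives from a channel nothing ever sends on can never fire, and without another ready case it would block forever.

package main

import (
    "fmt"
    "time"
)

func main() {
    quitOut := make(chan struct{}) // nothing ever sends on this channel

    select {
    case <-quitOut:
        fmt.Println("never reached: no goroutine can send on quitOut")
    case <-time.After(time.Second):
        fmt.Println("still waiting on quitOut after 1s; without this timeout the select would block forever")
    }
}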
Because your function is "Work" and you are calling "work".

Monitoring a kubernetes job

I have Kubernetes jobs that take a variable amount of time to complete, between 4 and 8 minutes. Is there any way I can know when a job has completed, rather than waiting 8 minutes and assuming the worst case? I have a test case that does the following:
1) Submits the Kubernetes job.
2) Waits for its completion.
3) Checks whether the job has had the expected effect.
The problem is that in my Java test that submits the deployment job to Kubernetes, I am waiting for 8 minutes even if the job takes less than that to complete, because I don't have a way to monitor the status of the job from the Java test.
$ kubectl wait --for=condition=complete --timeout=600s job/myjob
The
<kube master>/apis/batch/v1/namespaces/default/jobs
endpoint lists the status of the jobs. I parsed this JSON and retrieved the name of the latest running job that starts with "deploy...".
Then we can hit
<kube master>/apis/batch/v1/namespaces/default/jobs/<job name retrieved above>
and monitor the status field, whose value looks like this when the job succeeds:
"status": {
"conditions": [
{
"type": "Complete",
"status": "True",
"lastProbeTime": "2016-09-22T13:59:03Z",
"lastTransitionTime": "2016-09-22T13:59:03Z"
}
],
"startTime": "2016-09-22T13:56:42Z",
"completionTime": "2016-09-22T13:59:03Z",
"succeeded": 1
}
So we keep polling this endpoint until the job completes. Hope this helps someone.
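For reference, a rough Go sketch of that polling loop; the API server address and job name are placeholders, and no authentication is shown (for example, you could reach the API through kubectl proxy):

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "time"
)

// jobStatus captures only the fields this sketch needs from the Job resource.
type jobStatus struct {
    Status struct {
        Succeeded int `json:"succeeded"`
        Failed    int `json:"failed"`
    } `json:"status"`
}

func main() {
    apiServer := "http://localhost:8001" // placeholder, e.g. a local `kubectl proxy`
    jobName := "deploy-example"          // placeholder job name
    url := fmt.Sprintf("%s/apis/batch/v1/namespaces/default/jobs/%s", apiServer, jobName)

    for {
        resp, err := http.Get(url)
        if err != nil {
            fmt.Println("request failed:", err)
            return
        }
        var js jobStatus
        err = json.NewDecoder(resp.Body).Decode(&js)
        resp.Body.Close()
        if err != nil {
            fmt.Println("decode failed:", err)
            return
        }

        // Stop polling as soon as the job reports success or failure.
        if js.Status.Succeeded > 0 {
            fmt.Println("job completed")
            return
        }
        if js.Status.Failed > 0 {
            fmt.Println("job failed")
            return
        }
        time.Sleep(10 * time.Second)
    }
}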
You can use the NewSharedInformer method to watch the jobs' statuses. I'm not sure how to write it in Java; here is a Go example that retrieves your job list periodically:
type ClientImpl struct {
    clients *kubernetes.Clientset
}

type JobListFunc func() ([]batchv1.Job, error)

var (
    jobsSelector = labels.SelectorFromSet(labels.Set(map[string]string{"job_label": "my_label"})).String()
)

func (c *ClientImpl) NewJobSharedInformer(resyncPeriod time.Duration) JobListFunc {
    var once sync.Once
    var jobListFunc JobListFunc
    once.Do(
        func() {
            restClient := c.clients.BatchV1().RESTClient()
            optionsModifer := func(options *metav1.ListOptions) {
                options.LabelSelector = jobsSelector
            }
            watchList := cache.NewFilteredListWatchFromClient(restClient, "jobs", metav1.NamespaceAll, optionsModifer)
            informer := cache.NewSharedInformer(watchList, &batchv1.Job{}, resyncPeriod)
            go informer.Run(context.Background().Done())
            jobListFunc = JobListFunc(func() (jobs []batchv1.Job, err error) {
                for _, c := range informer.GetStore().List() {
                    jobs = append(jobs, *(c.(*batchv1.Job)))
                }
                return jobs, nil
            })
        })
    return jobListFunc
}
Then, in your monitor, you can check the status by ranging over the job list:
func syncJobStatus() {
    jobs, err := jobListFunc()
    if err != nil {
        log.Errorf("Failed to list jobs: %v", err)
        return
    }
    // TODO: other code
    for _, job := range jobs {
        name := job.Name
        // check status...
    }
}
I found that the JobStatus does not get updated while polling with job.getStatus(),
even if the status changes when checked from the command prompt using kubectl.
To get around this, I reload the job handler:
client.extensions().jobs()
    .inNamespace(myJob.getMetadata().getNamespace())
    .withName(myJob.getMetadata().getName())
    .get();
My loop to check the job status looks like this:
KubernetesClient client = new DefaultKubernetesClient(config);
Job myJob = client.extensions().jobs()
        .load(new FileInputStream("/path/x.yaml"))
        .create();
boolean jobActive = true;
while (jobActive) {
    myJob = client.extensions().jobs()
            .inNamespace(myJob.getMetadata().getNamespace())
            .withName(myJob.getMetadata().getName())
            .get();
    JobStatus myJobStatus = myJob.getStatus();
    System.out.println("==================");
    System.out.println(myJobStatus.toString());
    if (myJob.getStatus().getActive() == null) {
        jobActive = false;
    } else {
        System.out.println(myJob.getStatus().getActive());
        System.out.println("Sleeping for a minute before polling again!!");
        Thread.sleep(60000);
    }
}
System.out.println(myJob.getStatus().toString());
Hope this helps
You did not mention what actually checks for job completion, but instead of waiting blindly and hoping for the best, you should keep polling the job status inside a loop until it is complete.
Since you mentioned Java, you can use the Kubernetes Java bindings from fabric8 to start the job and add a watcher:
KubernetesClient k = ...
k.extensions().jobs().load(yaml).watch(new Watcher<Job>() {
    @Override
    public void onClose(KubernetesClientException e) {}

    @Override
    public void eventReceived(Action a, Job j) {
        if (j.getStatus().getSucceeded() > 0)
            System.out.println("At least one job attempt succeeded");
        if (j.getStatus().getFailed() > 0)
            System.out.println("At least one job attempt failed");
    }
});
I don't know what kind of tasks you are talking about, but let's assume you are running some pods.
You can do
watch 'kubectl get pods | grep <name of the pod>'
or
kubectl get pods -w
It will not be the full name, of course; most of the time pods get random suffixes, so if you are running an nginx replica or deployment your pods will end up with names like nginx-1696122428-ftjvy, and you will want to do
watch 'kubectl get pods | grep nginx'
You can replace pods with whatever resource you are working with, i.e. (rc, svc, deployments, ...).