I'm using Cadence v0.11.
Changing ScheduleToStartTimeout has no effect in Cadence v0.11.
How do I configure ScheduleToStartTimeout so that it takes effect? Is this a bug in Cadence 0.11?
Here is a minimal bad case that reproduces the issue:
https://github.com/ahuigo/cadence-v11-bad-case
Related
What could be the reason for decision tasks not getting picked up for execution in a Cadence cluster? They remain in a pending state and finally time out. I don't see any error logs. How do I debug this?
It’s very likely that there is no worker available and actively polling tasks for the tasklist.
The best way to confirm is to click on the tasklist name in the web UI and see which workers are behind the tasklist. Since it's a decision task, you should check the decision handlers for the tasklist.
You can also use the CLI to describe the tasklist, which gives the same information:
cadence tasklist desc --tl <tasklist name>
In some extremely rare cases (I have personally never seen it, but have heard of it happening at Uber with a large-scale cluster), the Cadence server loses the task. In that case you can use the CLI to either regenerate the task or reset the workflow to unblock it:
To regenerate task:
cadence adm wf refresh-tasks -w <wf id>
To reset:
cadence wf reset --reset_type LastDecisionCompleted -w <wf id>
I was looking for a microservice orchestrator and came across Uber Cadence. I have gone through the documentation and also used it in the development setup.
I had a few questions for production scenarios:
Is it recommended to have a dedicated tasklist for the workflow and the different activities used by it? Or should we use a single tasklist for all? Does this decision impact scalability or performance?
When we add a new worker machine, is it common practice to run all the workers for the different activities/workflows on the same machine? Example:
Worker.Factory factory = new Worker.Factory("samples-domain");
Worker helloWorkflowWorker = factory.newWorker("HelloWorkflowTaskList");
helloWorkflowWorker.registerWorkflowImplementationTypes(HelloWorkflowImpl.class);
Worker helloActivityWorker = factory.newWorker("HelloActivityTaskList");
helloActivityWorker.registerActivitiesImplementations(new HelloActivityImpl());
Worker upperCaseActivityWorker = factory.newWorker("UpperCaseActivityTaskList");
upperCaseActivityWorker.registerActivitiesImplementations(new UpperCaseActivityImpl());
factory.start();
Or should we run each activity/workflow worker on a dedicated machine?
On a single worker machine, how many workers can we create for a given activity? For example, if we have the activity HelloActivityImpl, should we create multiple workers for it on the same machine?
I have not found any documentation for a production setup. For example, how do we install and configure the Cadence Service in production? It would be great if someone could direct me to the right material for this.
In some of the video tutorials, it was mentioned that, for high availability, we can set up the Cadence Service across multiple data centers. How do I configure the Cadence service for that?
Unless you need separate flow control and rate limiting for a set of activities, there is no reason to use more than one task queue (tasklist) per worker process.
As I mentioned in (1), I would rewrite your code as:
Worker.Factory factory = new Worker.Factory("samples-domain");
// A single tasklist serves the workflow and all of its activities.
Worker worker = factory.newWorker("HelloWorkflow");
worker.registerWorkflowImplementationTypes(HelloWorkflowImpl.class);
// Both activity implementations are registered on the same worker.
worker.registerActivitiesImplementations(new HelloActivityImpl(), new UpperCaseActivityImpl());
factory.start();
There is no reason to create more than one worker for the same activity.
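If the goal is higher throughput, a single worker can already execute many activities in parallel; concurrency is tuned through WorkerOptions rather than by registering the same activity on several workers. Below is a minimal sketch assuming the cadence-java-client WorkerOptions.Builder API; the exact option names and the newWorker(taskList, options) overload should be verified against your client version.

import com.uber.cadence.worker.Worker;
import com.uber.cadence.worker.WorkerOptions;

// Sketch: control activity concurrency on one worker instead of creating
// multiple workers for the same activity. Option names are assumptions
// based on the cadence-java-client and may differ between versions.
WorkerOptions options = new WorkerOptions.Builder()
        .setMaxConcurrentActivityExecutionSize(100) // parallel activity executions per worker
        .build();

Worker.Factory factory = new Worker.Factory("samples-domain");
Worker worker = factory.newWorker("HelloWorkflow", options);
worker.registerWorkflowImplementationTypes(HelloWorkflowImpl.class);
worker.registerActivitiesImplementations(new HelloActivityImpl(), new UpperCaseActivityImpl());
factory.start();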
Not sure about Cadence. Here is the Temporal documentation that shows how to deploy to Kubernetes.
This documentation is not yet available. We at Temporal are working on it.
You can also use the Cadence Helm chart: https://hub.helm.sh/charts/banzaicloud-stable/cadence
I am actively working with the Cadence team to produce operations documentation for the community. It will be useful for those who don't want to run on K8s, like myself. I will come back later as we make progress.
Current draft version: https://docs.google.com/document/d/1tQyLv2gEMDOjzFibKeuVYAA4fucjUFlxpojkOMAIwnA
It will be published to cadence-docs soon.
Does anyone know why an ECS Fargate task would fail with the error "Timeout waiting for network interface provisioning to complete"?
I am running an ECS Fargate task using Step Functions. The IAM role for the Step Function has access to the task definition, and the state machine code also looks good. The same Step Function worked fine before, but I just ran into this error. Why would this happen? Is it occasional?
According to AWS support, intermittent failures of this nature are to be expected (with relatively low probability).
The recommendation was to set retryAttempts > 1 to handle these situations.
This can happen if there are problems within AWS. You can view the Network Interfaces page on the EC2 console and you may see errors loading, which is an indication of API problems within EC2. You can also check status.aws.amazon.com to look for errors. Note that AWS can be slow to acknowledge problems there, so you may experience the errors before they update the status page!
AWS support has a detailed post on resolving network interface provisioning errors for ECS on Fargate. Here's an excerpt from it:
If the Fargate service tries to attach an elastic network interface to the underlying infrastructure that the task is meant to run on, then you can receive the following error message: "Timeout waiting for network interface provisioning to complete."
Fargate occasionally runs into intermittent API issues, usually while tasks are spinning up from Step Functions or AWS Batch jobs. As recommended in another answer, you can update MaxAttempts for the Retry in the state machine definition:
"Retry": [
{
"MaxAttempts": 3,
}
]
Additionally, reattempts can be automated with an exponential backoff and retry logic in AWS Step Functions.
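For example, a sketch of a Retry block with exponential backoff (the error name and the numbers are illustrative placeholders, not AWS recommendations):

"Retry": [
  {
    "ErrorEquals": ["States.TaskFailed"],
    "IntervalSeconds": 10,
    "MaxAttempts": 3,
    "BackoffRate": 2.0
  }
]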
I was hitting the same issue until I switched over to Fargate platform version 1.4.0.
It looks like there were some changes made to the networking side of things.
https://aws.amazon.com/blogs/containers/aws-fargate-launches-platform-version-1-4/
The default platform version is currently still 1.3.0, so maybe give 1.4.0 a try and see if it fixes it for you.
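If the task is launched from Step Functions, the platform version can be pinned in the ecs:runTask state's Parameters. A rough sketch (the cluster and task definition names are placeholders):

"Parameters": {
  "LaunchType": "FARGATE",
  "PlatformVersion": "1.4.0",
  "Cluster": "my-cluster",
  "TaskDefinition": "my-task-definition"
}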
We have a GKE cluster with auto-upgrading nodes. We recently noticed a node become unschedulable and eventually get deleted, and we suspect it was being upgraded automatically for us. Is there a way to confirm (or rule out) in Stackdriver that this was indeed what was happening?
You can use the following advanced logs queries with Cloud Logging (previously Stackdriver) to detect upgrades to node pools:
protoPayload.methodName="google.container.internal.ClusterManagerInternal.UpdateClusterInternal"
resource.type="gke_nodepool"
and for master upgrades:
protoPayload.methodName="google.container.internal.ClusterManagerInternal.UpdateClusterInternal"
resource.type="gke_cluster"
Additionally, you can control when the updates are applied with maintenance windows (as the user aurelius mentioned).
I think your question has already been answered in the comments. Just as an addition: automatic upgrades occur at regular intervals at the discretion of the GKE team. To get more control, you can create a maintenance window as explained here. This is basically a time frame that you choose in which automatic upgrades should occur.
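For example, a sketch using the gcloud CLI to set a daily maintenance window start time (the cluster name, zone, and time are placeholders; newer recurring maintenance-window flags exist as well):
gcloud container clusters update my-cluster --zone us-central1-a --maintenance-window 03:00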
Workflow processes throw a WorkflowException in case of failure, and there is a setting in the Web Console for the Apache Sling Job Default Queue where max retries on failure is set to 10.
Now on failure the workflow is retried 10 more times. So if a workflow has a step such as Version Creation, 10 more versions of the resource are created.
I could think of the following solutions:
Set the max retries count on failure to 0 in the Apache Sling Job Default Queue. Is it fine to do this?
Replace the OOTB Version Creation process with a custom process and add a check for retries, probably by saving a flag in the workflow metadata.
Version Creation is just an example here; it could be any other process performing some other functionality, which would also be retried 10 more times on failure. Has anyone faced a similar situation?
It is not advisable to set it to zero. Some workflows need to be retried, for example the activation workflow when there are network issues or the publish instances are down. Setting retries to 0 would completely bypass this safety mechanism.
I would prefer your second method as an alternative. org.apache.sling.event.jobs.Job has getRetryCount(), which you can use to detect that a step is being retried.
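As a rough illustration of the second approach, here is a minimal sketch of a custom process step that skips its non-idempotent work when a flag in the workflow metadata shows it already ran. The class name and the "versionAlreadyCreated" key are made up for the example, and whether the metadata write survives a failing step should be verified on your AEM version.

import com.adobe.granite.workflow.WorkflowException;
import com.adobe.granite.workflow.WorkflowSession;
import com.adobe.granite.workflow.exec.WorkItem;
import com.adobe.granite.workflow.exec.WorkflowProcess;
import com.adobe.granite.workflow.metadata.MetaDataMap;

// Register as an OSGi service with a process.label property in a real implementation.
public class GuardedVersionCreationProcess implements WorkflowProcess {

    // Hypothetical metadata key used to remember that the version was already created.
    private static final String FLAG = "versionAlreadyCreated";

    @Override
    public void execute(WorkItem item, WorkflowSession session, MetaDataMap args)
            throws WorkflowException {
        MetaDataMap wfMeta = item.getWorkflow().getMetaDataMap();

        // A previous (failed and retried) run already created the version: skip it.
        if (Boolean.TRUE.equals(wfMeta.get(FLAG, Boolean.class))) {
            return;
        }

        // ... create the version here (e.g. via the JCR VersionManager) ...

        // Record that the non-idempotent part has been done.
        wfMeta.put(FLAG, Boolean.TRUE);
    }
}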