What is Replay in Cadence/Temporal workflow? - cadence-workflow

What is Replay in Cadence/Temporal workflow? Is it the same as "retry"?
Why can't I simply use my own logger in workflow code due to replay?

"Retry" and "replay" are completely different things.
Replay is for rebuilding the workflow thread state (the process/thread stack).
Imagine this workflow code (in Java):
activityStub.doA();
LOG.info("first log");
activityStub.doB();
LOG.info("second log");
If LOG is not obtained from Workflow.getLogger and is not wrapped in a Workflow.isReplaying check, the first log will be printed more than once, depending on how many times the code gets replayed.
Here is the timeline that produces the duplicated logs:
After doA completes, the first log is printed.
Then doB is executed; say doB takes 1 minute.
During that minute, the worker crashes or is restarted.
doB then completes.
A new workflow task is created to process the completion of doB.
That workflow task is executed on a new worker host, which requires a replay to rebuild the stack up to the doB call. During the replay, assuming LOG is your own logger and is not guarded by Workflow.isReplaying(), the first log is printed again.
Execution then continues past the completed doB, and the second log is printed.
So at the end, you will see the logs:
first log
first log
second log
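For illustration, here is a minimal sketch of the two safe options, written against the Cadence Java client (Temporal's Java SDK exposes the same Workflow.getLogger and Workflow.isReplaying methods). The MyWorkflow and MyActivities interfaces and the timeout value are placeholders, not part of the original question:

import java.time.Duration;
import org.slf4j.Logger;
import com.uber.cadence.activity.ActivityOptions;
import com.uber.cadence.workflow.Workflow;

// MyWorkflow and MyActivities stand in for the workflow/activity interfaces
// behind the snippet above; substitute your own.
public class MyWorkflowImpl implements MyWorkflow {

    // Replay-aware logger from the SDK: it suppresses output while the code is being replayed.
    private static final Logger logger = Workflow.getLogger(MyWorkflowImpl.class);

    // If you must keep your own logger, guard it instead:
    //   if (!Workflow.isReplaying()) { LOG.info("first log"); }

    private final MyActivities activityStub = Workflow.newActivityStub(
            MyActivities.class,
            new ActivityOptions.Builder()
                    .setScheduleToCloseTimeout(Duration.ofMinutes(5))
                    .build());

    @Override
    public void run() {
        activityStub.doA();
        logger.info("first log");   // printed once, even if this code is replayed
        activityStub.doB();
        logger.info("second log");
    }
}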

Related

IBM DataStage : Job activity does not continue in sequence

I have 16 job activities in a sequence. I already defined the triggers with OK, so they're all connected and run automatically when the previous job has finished. I already ran and recompiled each job activity on its own, but when I recompile and re-run the sequence, only the first job activity runs and finishes as OK; it does not trigger the next job. Here's the log:
job_spi_februari..JobControl (#Coordinator): Summary of sequence run
19:18:01: Sequence started
19:18:01: jenis_kredit (JOB job_jenis_kredit) started
19:18:16: jenis_kredit (JOB job_jenis_kredit) finished, status=2 [Finished with warnings]
19:18:16: Sequence finished OK
I'm very confused why this happens: the log makes it look like everything went fine, yet it does not trigger the next job as it should, as if something were wrong. What is actually happening and how do I fix it?
In case you're curious about my job activities, they all look like this.
If you connect all job activities with an OK trigger, the Sequence will end as soon as a single activity does not finish with OK (like "Finished with warnings"), because nothing is left to execute.
If you want the sequence to continue, I suggest defining a custom trigger that fires on both RunOK and RunWarn.
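As a sketch only, such a custom trigger expression could look like the following (using the activity name from the log above; verify the exact stage name and expression syntax in your Sequence job's trigger editor):

jenis_kredit.$JobStatus = DSJS.RUNOK OR jenis_kredit.$JobStatus = DSJS.RUNWARN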

Wait for all tasks in a celery group to finish or error out

I have a group of celery tasks that I want to run in parallel, and then wait for them all to finish. I am currently using:
group(task_list).apply_async().get()
(see my other question for more detail: Wait for all tasks in a celery group to finish or error out)
When all of my tasks run without exceptions, this works perfectly. However, if any of the tasks throws an exception, the call returns immediately instead of waiting for the remaining tasks.
I can add a try/except around every task and have it return a custom error object, but then it shows up in the Flower dashboard as 'succeeded'.
Is it possible to wait for all of the tasks to finish, whether they succeed or raise exceptions?
You should use a chord, not a group. From the celery docs:
A chord is a task that only executes after all of the tasks in a group have finished executing.
result = chord(task_list)(handle_results)
The task id returned by chord() is the id of the callback, so you can wait for it to complete and get the final return value:
result.get()
So what happens if one of the tasks raises an exception?
The chord callback result will transition to the failure state, and the error is set to a ChordError exception:
print(result.traceback)

Processing Groups of Results with Vertx - How to coordinate?

I have a job processing system where each job contains thousands of individual tasks that require different strategies to complete. The individual tasks make up the whole job. If all tasks complete, the job is marked as successfully completed and further steps are taken. If any task fails, the job must be marked as failed and other steps are taken. If the job times out, it must also be marked as failed and other steps are taken.
Once all of the results for a job have been received, the next job can be fetched. The next job shouldn't be fetched while a job is currently being processed.
Here is what the flow looks like:
The Job Polling Verticle publishes a job to the event bus, and the Job Processing Verticle publishes each task to the event bus. When the job strategy completes, it publishes the task result to the event bus.
The issue is that I don't know the right way to determine when all tasks have been completed in this model. All verticles are stateless, the Job Processing Verticle doesn't await any futures, and even if the Job Results Verticle were stateful, it wouldn't know how many results to expect.
The only way I can think to do this would be to have a global stateful object. But I don't think this is good design.
Additionally, I need to know when a job has timed out, that is, it has run longer than it should, and I need to consider it failed, log it, and move on.
I could do this with the global state, but again I don't think that's the right solution.
Does this verticle pattern make sense for what I'm trying to do?
First, let me try to address your questions. Then I'll try to explain what problems this design has.
The issue is that I don't know the right way to determine when all tasks have been completed in this model. All verticles are stateless, the Job Processing Verticle doesn't await any futures, and even if the Job Results Verticle were stateful, it wouldn't know how many results to expect.
The solution could be a reference-counting verticle. Each worker should emit a start message with the jobId on the event bus when it starts, and an end message with the jobId when it completes. Even with fan-out (the cases where you don't know how many workers there are), the counting verticle will know. In your diagram, the "Job Post Processing Verticle" is a good candidate for this. It can maintain a counter, and only when the counter reaches zero should it start the next job. That also helps you avoid actually sharing a memory reference.
Additionally, I need to know when a Job has timed out. That is, it's run longer than it should and I need to consider it's failed, log it, and move on.
In the same verticle you can start a timer every time you get a new start message. If you get the end message, cancel the timer; if the timer fires first, cancel the current job and start again.
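A minimal sketch of such a counting verticle in Vert.x for Java; the event-bus addresses ("task.start", "task.end", "job.poll", "job.timeout") and the 5-minute timeout are placeholders for your actual topology:

import io.vertx.core.AbstractVerticle;

public class JobCoordinatorVerticle extends AbstractVerticle {

    private int pending = 0;    // tasks started but not yet finished for the current job
    private long timerId = -1;  // timeout timer for the current job

    @Override
    public void start() {
        vertx.eventBus().consumer("task.start", msg -> {
            pending++;
            restartTimer();
        });

        vertx.eventBus().consumer("task.end", msg -> {
            pending--;
            if (pending == 0) {
                // every started task has reported a result: the job is done
                vertx.cancelTimer(timerId);
                vertx.eventBus().send("job.poll", "next"); // fetch the next job
            } else {
                restartTimer();
            }
        });
    }

    // Start (or restart) the per-job timeout whenever a task reports progress.
    private void restartTimer() {
        if (timerId != -1) {
            vertx.cancelTimer(timerId);
        }
        timerId = vertx.setTimer(5 * 60 * 1000, id -> {
            // no start/end message for too long: treat the current job as timed out
            vertx.eventBus().send("job.timeout", "timed out");
        });
    }
}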
Now, this solution will work, but the design has two main flaws. One is the fact that you seem to maintain all your flow in memory. If your application crashes, all progress is lost, and it's not clear how you record it. Maybe polling a Jobs table in a DB would actually be better, since your job execution is sequential anyway.
The second point is that all those timeouts and reference counts are a homemade implementation of structured concurrency. Maybe you should take a look at something like Kotlin coroutines, as it will handle many of these problems for you.

Azure Function and queue

I have a function:
public async static Task Run([QueueTrigger("efs-api-call-last-datetime", Connection = "StorageConnectionString")]DateTime queueItem,
[Queue("efs-api-call-last-datetime", Connection = "StorageConnectionString")]CloudQueue inputQueue,
TraceWriter log)
{
Then I have a long process for handling the message from the queue. The problem is that the message is re-added to the queue after 30 seconds, while I am still processing it. I don't want the message to be added again and processed twice.
I would like to have code like:
try
{
// long operation
}
catch(Exception ex)
{
// Something went wrong: re-add this message in 1 minute
await inputQueue.AddMessageAsync(new CloudQueueMessage(
JsonConvert.SerializeObject(queueItem)),
timeToLive: null,
initialVisibilityDelay: TimeSpan.FromMinutes(1),
options: null,
operationContext: null
);
}
and prevent it from being re-added automatically. Is there any way to do that?
There are a couple of things here.
1) When there are multiple queue messages waiting, the queue trigger retrieves a batch of messages and invokes function instances concurrently to process them. By default, the batch size is 16, but this is configurable in host.json. You can set the batch size to 1 if you want to minimize parallel execution; the Microsoft documentation explains this (see the host.json sketch below).
2) As this is a long-running process, it seems your messages are not completed before the function times out, so they become visible again. You should try to break your function down into smaller functions; you can then use Durable Functions to chain the work you have to do.
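For point 1, a host.json sketch for the v1 Functions runtime (which the TraceWriter/CloudQueue bindings above suggest you are on); setting batchSize to 1 and newBatchThreshold to 0 makes the host process queue messages one at a time:

{
  "queues": {
    "batchSize": 1,
    "newBatchThreshold": 0
  }
}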
Yes, the same message can be dequeued twice.
Reasons:
1. Worker A dequeues Message B and the invisibility timeout expires. Message B becomes visible again and Worker C dequeues it, invalidating Worker A's pop receipt. Worker A finishes its work, goes to delete Message B, and an error is thrown. This is the most common case.
2. The lock on the original message that triggered the first Azure Function execution is likely expiring. This causes the queue to assume that processing the message failed, and it then uses that message to trigger the function again.
3. In certain conditions (very frequent queue polling) you can get the same message twice from a GetMessage call. This is a type of race condition that, while rare, does occur: Worker A and Worker B are polling very quickly, hit the queue simultaneously, and both get the same message. This used to be much more common (in the SDK 1.0 time frame) under high-polling scenarios, but it has become much rarer in later storage updates.
Cases 1 and 3 only happen when you have more than one worker.
Workaround:
Install azure-webjobs-sdk version 1.0.11015.0 (visible on the 'Settings' page of the Functions portal). For more details, you can refer to fixing queue visibility renewals.

SpringBatch: getting the executionid of a completed instance by its JobParameters

My software choreographs a number of Spring Batch jobs. The output of a job is partially an input for the next job. It may happen that the entire process (the entire job chain) is restarted, even if one or more jobs in the chain have been successfully completed. In that case, when I try to run one of the jobs again with the same parameters, I get a JobInstanceAlreadyCompletedException, as expected. I could skip it and go on to the next job, but I would need to access the context of the completed instance in order to get the output produced by its steps and pass it over to the next job.
According to the JobExplorer API, this is only possible if you have the executionId of the completed instance. I can't get it from the JobInstanceAlreadyCompletedException, and it looks like there is no API for getting it from the already-used parameter list. Do you know a way to get this executionId given the parameters? Or to get access, in whatever way, to the completed instance's job context?
Why not put all these jobs into one main job and use JobSteps to integrate them? That way, already completed sub-jobs are treated as completed steps, which will not be started again on a restart, and all information remains available in the job/step contexts (see the sketch after this answer).
Another way would be to save all the needed parameters and information into a file and use that to start the next job, instead of being dependent on the JobExecution info. Your last step could simply be a tasklet that writes an appropriate properties file.
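For the first suggestion, here is a minimal sketch of the JobStep wiring, assuming the Java-config builder factories; the bean names and jobA/jobB are placeholders for your existing jobs:

import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class ChainConfig {

    // Wrap an existing job as a step of the main job.
    @Bean
    public Step jobStepA(StepBuilderFactory steps, Job jobA, JobLauncher jobLauncher) {
        return steps.get("jobStepA")
                .job(jobA)
                .launcher(jobLauncher)
                .build();
    }

    @Bean
    public Step jobStepB(StepBuilderFactory steps, Job jobB, JobLauncher jobLauncher) {
        return steps.get("jobStepB")
                .job(jobB)
                .launcher(jobLauncher)
                .build();
    }

    // On a restart of the main job, completed job steps are skipped and their
    // execution contexts remain available in the job repository.
    @Bean
    public Job mainJob(JobBuilderFactory jobs, Step jobStepA, Step jobStepB) {
        return jobs.get("mainJob")
                .start(jobStepA)
                .next(jobStepB)
                .build();
    }
}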