Timeout exception when size of the input to child workflow is huge - cadence-workflow

16:37:21.945 [Workflow Executor taskList="PullFulfillmentsTaskList", domain="test-domain": 3] WARN com.uber.cadence.internal.common.Retryer - Retrying after failure
org.apache.thrift.transport.TTransportException: Request timeout after 1993ms
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.throwOnRpcError(WorkflowServiceTChannel.java:546)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.doRemoteCall(WorkflowServiceTChannel.java:519)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.respondDecisionTaskCompleted(WorkflowServiceTChannel.java:962)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.lambda$RespondDecisionTaskCompleted$11(WorkflowServiceTChannel.java:951)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.measureRemoteCall(WorkflowServiceTChannel.java:569)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.RespondDecisionTaskCompleted(WorkflowServiceTChannel.java:949)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.lambda$sendReply$0(WorkflowWorker.java:301)
at com.uber.cadence.internal.common.Retryer.lambda$retry$0(Retryer.java:104)
at com.uber.cadence.internal.common.Retryer.retryWithResult(Retryer.java:122)
at com.uber.cadence.internal.common.Retryer.retry(Retryer.java:101)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.sendReply(WorkflowWorker.java:301)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:261)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:229)
at com.uber.cadence.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:71)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Our parent workflow code is basically like this (JSONObject is from org.json):

JSONObject[] array = restActivities.getArrayWithHugeJSONItems();
for (JSONObject hugeJSON : array) {
    ChildWorkflow child = Workflow.newChildWorkflowStub(ChildWorkflow.class);
    child.run(hugeJSON);
}
What we found is that most of the time, the parent workflow worker fails to start the child workflow and throws the timeout exception above. It retries like crazy but never succeeds, printing the timeout exception over and over again. However, sometimes we get lucky and it works. And sometimes it fails even earlier, at the activity worker, with the same exception. We believe this is because the data is too big (about 5MB) and cannot be sent within the timeout (judging from the log, we guess it is set to 2s). If we call child.run with small fake data, it works 100% of the time.
The reason we use child workflows is that we want to use Async.function to run them in parallel. So how can we solve this problem? Is there a Thrift timeout config we should increase, or can we somehow avoid passing huge data around?
Thank you in advance!
---Update after Maxim's answer---
Thank you. I read the example, but I still have some questions for my use case. Let's say I get an array of 100 huge JSON objects in my RestActivitiesWorker. If I should not return the huge array to the workflow, I need to make 100 calls to the database to create 100 rows of records, put the 100 ids in an array, and pass that back to the workflow. Then the workflow creates one child workflow per id. Each child workflow then calls another activity with the id to load the data from the DB. But that activity has to pass that huge JSON to the child workflow; is this OK? And for the RestActivitiesWorker making 100 inserts into the DB, what if it fails in the middle?
I guess it boils down to the fact that our workflow is trying to work directly with huge JSON. We are trying to load huge JSON (5-30MB, not that huge) from an external system into our system. We break the JSON down a little, manipulate a few values, use values from a few fields for some different logic, and finally save it in our DB. How should we do this with Temporal? A rough sketch of the flow we have in mind is below.
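To make that concrete, the parent workflow we have in mind would look roughly like this, written against the Cadence Java client we already use (the interfaces, the storeArrayAndReturnIds activity, and the assumption that ChildWorkflow.run takes an id and returns a small status string are all made up for illustration; annotations and worker registration are omitted):

import com.uber.cadence.workflow.Async;
import com.uber.cadence.workflow.Promise;
import com.uber.cadence.workflow.Workflow;
import java.util.ArrayList;
import java.util.List;

public class ParentWorkflowImpl implements ParentWorkflow {

    // The activity now returns only small DB ids instead of the huge JSON blobs.
    private final RestActivities restActivities =
            Workflow.newActivityStub(RestActivities.class);

    @Override
    public void run() {
        // 1. One activity call stores the 100 huge JSON objects in the DB and returns their ids.
        List<String> recordIds = restActivities.storeArrayAndReturnIds();

        // 2. One child workflow per id, started in parallel with Async.function.
        List<Promise<String>> results = new ArrayList<>();
        for (String id : recordIds) {
            ChildWorkflow child = Workflow.newChildWorkflowStub(ChildWorkflow.class);
            results.add(Async.function(child::run, id));
        }

        // 3. Wait for all children to complete before finishing the parent.
        Promise.allOf(results).get();
    }
}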

Temporal/Cadence doesn't support passing large blobs as inputs and outputs, as it uses a DB as the underlying storage. So you want to change the architecture of your application to avoid this.
The standard workarounds are:
Use an external blob store to save large data and pass a reference to it as a parameter (a rough sketch is below).
Cache data in a worker process or even on the host disk, and route activities that operate on this data to that process or host. See the fileprocessing sample for this approach.
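As a minimal sketch of the first option, assuming some external blob store client (the BlobStore interface, the fetchAndStoreHugeJson activity and the reference format are made up for illustration; they are not part of the Cadence API):

// Hypothetical blob store client; in practice this could be S3, GCS, etc.
public interface BlobStore {
    String put(byte[] data);      // stores the blob, returns a small reference/key
    byte[] get(String reference); // loads the blob back by its reference
}

// Hypothetical activity interface: it returns a small reference, not the JSON itself.
public interface RestActivities {
    String fetchAndStoreHugeJson(String externalId);
}

public class RestActivitiesImpl implements RestActivities {

    private final BlobStore blobStore;

    public RestActivitiesImpl(BlobStore blobStore) {
        this.blobStore = blobStore;
    }

    @Override
    public String fetchAndStoreHugeJson(String externalId) {
        byte[] hugeJson = callExternalSystem(externalId);
        // Only this small key ends up in the workflow history and in the child workflow input.
        return blobStore.put(hugeJson);
    }

    private byte[] callExternalSystem(String externalId) {
        // REST call to the external system, omitted in this sketch.
        return new byte[0];
    }
}

The parent workflow then passes only the returned reference to each child workflow, and an activity inside the child calls blobStore.get(reference) when it actually needs the data.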

Related

Fetch and maintain reference data at Job level in Spring Batch

I am configuring a new job where I need to read data from the database, and in the processor the data will be used to call a REST endpoint with a payload. In the payload, along with the dynamic data, I need to pass reference data which is constant for each record processed in the job. This reference data is stored in the DB. I am thinking of implementing one of the following approaches:
In the beforeJob listener method, make a DB call, populate the reference data object, and use it for the whole job run.
In the processor, make a DB call to get the reference data and cache the query, so there is no DB call to fetch the same data for each record.
Please suggest whether these approaches are correct or if there is a better way to implement this in Spring Batch.
For performance reasons, I would not recommend doing a DB call in the item processor, unless that is really a requirement.
The first approach seems reasonable to me, since the reference data is constant. You can populate/clear a cache with a JobExecutionListener and use the cache in your chunk-oriented step. Please refer to the following thread for more details and a complete sample: Spring Batch With Annotation and Caching.
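A minimal sketch of that approach, assuming a simple in-memory cache object shared between the listener and the item processor (ReferenceDataCache and ReferenceDataRepository are made-up names, not Spring Batch classes):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;

// Hypothetical cache holder shared between the listener and the item processor.
public class ReferenceDataCache {
    private final Map<String, String> data = new ConcurrentHashMap<>();

    public void putAll(Map<String, String> values) { data.putAll(values); }
    public String get(String key) { return data.get(key); }
    public void clear() { data.clear(); }
}

// Hypothetical DAO that performs the single reference-data query.
public interface ReferenceDataRepository {
    Map<String, String> loadReferenceData();
}

public class ReferenceDataJobListener implements JobExecutionListener {

    private final ReferenceDataCache cache;
    private final ReferenceDataRepository repository;

    public ReferenceDataJobListener(ReferenceDataCache cache, ReferenceDataRepository repository) {
        this.cache = cache;
        this.repository = repository;
    }

    @Override
    public void beforeJob(JobExecution jobExecution) {
        cache.putAll(repository.loadReferenceData()); // one DB call for the whole job run
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        cache.clear(); // release the reference data once the job is done
    }
}

The item processor then receives the same ReferenceDataCache instance (for example as a constructor argument of a Spring bean) and reads from it without touching the DB.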

HTTP GET for 'background' job creation and acquiring

I'm designing an API for a job scheduler. There is one scheduler with a set of resources and DB tables for them. There are also multiple 'workers' that request 'jobs' from the scheduler. A worker can't create a job; it can only request one. Jobs must be calculated on the server side. A job is also a dynamic entity, calculated from multiple DB tables and the current time. There is no 'job' table.
In general this system is very similar to a task queue, but without the queue. I need a method for a worker to request the next task; that task should be calculated and assigned to this agent.
Is it OK to use the GET verb to retrieve and 'lock' a job for a specific worker?
In terms of resources this query does not modify anything; only internal DB state is updated. For the client it looks like fetching records one by one; it doesn't know about the internal modifications.
In pure REST style I should probably define a job table and a CRUD API for it. Then I would need to create some auxiliary service to POST jobs to that table. Each agent would then list jobs using GET and lock one using PATCH. That approach requires multiple potential retries due to race conditions (a job can already be locked by another agent). It also looks a little complicated if I need to assign a job to a specific agent based on server-side logic; in that case I need to implement some check logic on the client side to iterate through jobs based on the different responses.
This approach looks complicated.
Is it OK to use the GET verb to retrieve and 'lock' a job for a specific worker?
Maybe? But probably not.
The important thing to understand about GET is that it is safe:
The purpose of distinguishing between safe and unsafe methods is to allow automated retrieval processes (spiders) and cache performance optimization (pre-fetching) to work without fear of causing harm. In addition, it allows a user agent to apply appropriate constraints on the automated use of unsafe methods when processing potentially untrusted content.
If aggressive cache performance optimization would make a mess in your system, then GET is not the http method you want triggering that behavior.
If you were designing your client interactions around resources, then you would probably have something like a list of jobs assigned to a worker. Reading the current representation of that resource doesn't require that a server change it, so GET is completely appropriate. And of course the server could update that resource for its own reasons at any time.
Requests to modify that resource should not be safe. For instance, if the client is going to signal that some job was completed, that should be done via an unsafe method (POST/PUT/PATCH/DELETE/...)
I don't have such a resource. It's an ephemeral resource which is spread across the tables; there is no DB table for it and no ID column to update that job. Why I don't have such a table is another question, but it's a current requirement and limitation.
Fair enough, though the main lesson still stands.
Another way of thinking about it is to think about failure. The network is unreliable. In a distributed environment, the client cannot distinguish a lost request from a lost response. All it knows is that it didn't receive an acknowledgement for the request.
When you use GET, you are implicitly telling the client that it is safe (there's that word again) to resend the request. Not only that, but you are also implicitly telling any intermediate components that it is safe to repeat the request.
If there are no adverse effects to handling multiple copies of the same request, then GET is fine. But if processing multiple copies of the same request is expensive, then you should probably be using POST instead.
It's not required that the GET handler be safe; the standard only describes the semantics of the messages and doesn't constrain the implementation at all. But any loss of property incurred is properly understood to be the responsibility of the server.
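As an illustration of the unsafe-method alternative, job acquisition can be modelled as POSTing to an assignments collection for the worker; a Spring-flavored sketch, where the URL layout, JobAssignment and JobAssignmentService are all made up:

import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

// Hypothetical types for the sketch.
record JobAssignment(String jobId, String workerId) {}

interface JobAssignmentService {
    // Computes the next job from the DB tables and current time, and locks it for the worker.
    JobAssignment assignNextJobTo(String workerId);
}

@RestController
class AssignmentController {

    private final JobAssignmentService assignmentService;

    AssignmentController(JobAssignmentService assignmentService) {
        this.assignmentService = assignmentService;
    }

    // POST, not GET: the request creates/locks an assignment, so intermediaries must not repeat or prefetch it.
    @PostMapping("/workers/{workerId}/assignments")
    JobAssignment acquireNextJob(@PathVariable String workerId) {
        return assignmentService.assignNextJobTo(workerId);
    }
}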

Service Fabric Actors - save state to database

I'm working on a sample Service Fabric project, where I have to maintain a shopping list. For this I have a ShoppingList actor, which is identifiable by a specific id. It stores the current list content in its state using StateManager. All works fine.
However, in parallel I'd like to maintain the shopping list content in a SQL database. In particular:
store all add/remove item requests for future analysis (ML)
on first actor initialization, load the list content from the DB (e.g. after the cluster has been re-created)
What is the best approach to achieve that? Create a custom StateProvider (how? I can't find examples)?
Or maybe have another service/actor handle all DB operations (possibly using queues and reminders)?
All examples seem to rely completely on the default StateManager, with no data persistence to external storage, so I'm not sure what the best practice is.
The best way would be to have a separate entity responsible for storing the data in the DB. The actor would just send an event (not implying SF events) with some data about the performed operation, and the other entity would catch it and perform the rest of the work.
Of course you can implement this in the actor itself, but that brings two possible issues:
The actor will not be able to process other requests if there are issues with the DB, with connectivity between the actor and the DB, or if the DB itself is under high load and processing requests slowly. The actor would have to wait until the transfer to the DB completes successfully.
Possible overloading of the DB with many single connections from many actors, instead of one or a few connections from the other entity combined with batch insertion.
So your final solution will depend on the workload of your system. But you will definitely need a reliable queue to safely store data destined for the DB if that data is too valuable to afford losing it.
Also, I think you could use the default state manager to store logs and information about transactions before they are transferred to the DB, and remove them from the service's state after the transaction completes. There is no need to keep such data permanently in the services' state.
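As a generic illustration of the separate-writer idea (plain Java, not Service Fabric API; the event type, table and column names are made up), the writer drains queued events and batch-inserts them so actors never block on the DB:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import javax.sql.DataSource;

public class ShoppingListDbWriter implements Runnable {

    // Hypothetical event emitted by the actor for each add/remove operation.
    public record ItemEvent(String listId, String item, boolean added) {}

    private final BlockingQueue<ItemEvent> events = new LinkedBlockingQueue<>();
    private final DataSource dataSource;

    public ShoppingListDbWriter(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    // Called by actors; cheap and non-blocking, so the actor is not held up by the DB.
    public void enqueue(ItemEvent event) {
        events.add(event);
    }

    @Override
    public void run() {
        List<ItemEvent> batch = new ArrayList<>();
        while (!Thread.currentThread().isInterrupted()) {
            try {
                batch.clear();
                batch.add(events.take());  // wait for at least one event
                events.drainTo(batch, 99); // then grab up to 99 more for one batch insert
                writeBatch(batch);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (Exception e) {
                // In a real system the failed batch should go to a reliable queue rather than being dropped.
            }
        }
    }

    private void writeBatch(List<ItemEvent> batch) throws Exception {
        try (Connection connection = dataSource.getConnection();
             PreparedStatement statement = connection.prepareStatement(
                     "INSERT INTO shopping_list_events (list_id, item, added) VALUES (?, ?, ?)")) {
            for (ItemEvent event : batch) {
                statement.setString(1, event.listId());
                statement.setString(2, event.item());
                statement.setBoolean(3, event.added());
                statement.addBatch();
            }
            statement.executeBatch(); // one connection and one round trip instead of one per actor
        }
    }
}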
Another thing to take into consideration is reading from the DB. If you have a relational database, update only one table with new records, and have a huge number of actors querying that data on activation, you may see performance degradation, as the table will be locked for reading or writing unless you configure it to behave differently. So, depending on your workload, you will probably need a caching layer for the data read on actor activation.
And about implementing your custom State Manager: take a look at this example. Basically, all you need to do is implement the IReliableStateManagerReplica interface and pass it to the StatefulService constructor.

NSMutableURLRequest on succession of another NSMutableURLRequest's success

Basically, I want to implement SYNC functionality where, if an internet connection is not available, data gets stored in a local SQLite database. Whenever an internet connection becomes available, SYNC kicks into action.
Now, say for example 5 records are stored locally, and then an internet connection becomes available. I want the server to be updated. So what I currently do is:
Post the first record to the server.
Wait for the success of the first request.
Post a local NSNotification to a routine, signalling that the first record has been updated on the server and the second request can now go.
The routine fires the second POST request to the server, and so on...
Question: Is this approach right and efficient enough to implement SYNC functionality, or is there anything I should change?
NOTE: There is no limit on the number of records to SYNC.
Well, it depends on the requirements for the data that you save. If it is just for backup, then you should be fine.
If the 5 records are somehow dependent on each other and you need to access this data from another device/application, you should take care on the server side that either all 5 records are written or none; otherwise you will end up in an inconsistent state if only 3 get written.
If other users are also reading/writing this data concurrently on the server, then you need to implement some kind of lock on all the records before writing, and also decide how to handle conflicts when someone attempts to overwrite somebody else's changes.
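On the server side, the all-or-nothing requirement is typically just one transaction around the batch of inserts; a minimal JDBC-style sketch (table and column names are made up):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class RecordBatchWriter {

    // Writes all records in one transaction: either every row is committed or none are.
    public void writeAll(Connection connection, List<String> payloads) throws SQLException {
        boolean previousAutoCommit = connection.getAutoCommit();
        connection.setAutoCommit(false);
        try (PreparedStatement statement =
                     connection.prepareStatement("INSERT INTO synced_records (payload) VALUES (?)")) {
            for (String payload : payloads) {
                statement.setString(1, payload);
                statement.addBatch();
            }
            statement.executeBatch();
            connection.commit();
        } catch (SQLException e) {
            connection.rollback(); // nothing is persisted if any insert fails
            throw e;
        } finally {
            connection.setAutoCommit(previousAutoCommit);
        }
    }
}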

How to guard against repeated request?

We have a button in a web game for users to collect a reward. It should only be clicked once, and upon receiving the request we mark the reward as collected in the DB.
We've already blocked the button in the client from repeated clicking. But that won't help if people resend the packet to our server multiple times in a short period.
What I want is a way to block this on the server side.
We're using Play Framework 2 (2.0.3-RC2) on the server side, and so far it's stateless. I'm tempted to use a Set as a guard, like this:
    if processingSet contains userId then BadRequest
    else put userId in processingSet and handle the request
    after that, remove userId from the Set
But then I'd have to face problems like Updating Scala collections thread-safely, and it would still fail to block the user once we have more than one server behind load balancing.
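In Java terms, the in-process guard I have in mind would look roughly like this (a single-JVM sketch only; the class and method names are made up):

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class RewardRequestGuard {

    // Thread-safe set of user ids currently being processed (works within one server only).
    private final Set<String> processing = ConcurrentHashMap.newKeySet();

    public boolean tryHandle(String userId, Runnable handler) {
        // add() returns false if the id is already present, so concurrent duplicates are rejected.
        if (!processing.add(userId)) {
            return false; // caller responds with BadRequest
        }
        try {
            handler.run();
            return true;
        } finally {
            processing.remove(userId);
        }
    }
}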
One possibility I'm thinking about is to have a table in the DB in place of the processingSet above, but that would incur at least one extra DB operation per request. Is there a better solution?
Thanks~
An additional DB operation is a relatively 'cheap' solution in that case. You should use it if you're planning to save the button's state permanently.
If the button is disabled only for some period of time (for example, until the game is over), you can also consider using the cache API; however, keep in mind that it is not meant for data that should be stored for a long time (it should not be considered a DB alternative).
Given that you're using Mongo, and so don't have transactions spanning separate collections, I think you can implement this guard using an atomic operation, namely "update if current", which is effectively compare-and-swap.
Assuming you've got a collection like "rewards" which has a "collected" attribute, you can update the collected flag to true only if it is currently false. If that operation doesn't fail, you can proceed to apply the reward, knowing that the same operation will fail for any other request.
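A sketch of that compare-and-swap style update with the MongoDB Java driver (collection and field names follow the example above; the surrounding Play controller code is omitted):

import static com.mongodb.client.model.Filters.and;
import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Updates.set;

import com.mongodb.client.MongoCollection;
import com.mongodb.client.result.UpdateResult;
import org.bson.Document;

public class RewardCollector {

    private final MongoCollection<Document> rewards;

    public RewardCollector(MongoCollection<Document> rewards) {
        this.rewards = rewards;
    }

    // Returns true only for the first request that flips the flag; duplicates modify zero documents.
    public boolean tryCollect(String userId) {
        UpdateResult result = rewards.updateOne(
                and(eq("userId", userId), eq("collected", false)),
                set("collected", true));
        return result.getModifiedCount() == 1;
    }
}

Because the check and the update happen atomically in the database, this keeps working with multiple application servers behind the load balancer.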