How to process multiple large files at the same time with multiple instances using Spring Batch Integration? - spring-batch

I created a Spring Batch Integration project to process multiple files and it is working like a charm.
While I'm writing this question I have four Pods running, but the behaviour isn't what I expect: I expect 20 files to be processed at the same time (five per Pod).
My poller setup uses the following parameters:
poller-delay: 10000
max-message-per-poll: 5
I'm also using Redis to store the files for the filter:
private CompositeFileListFilter<S3ObjectSummary> s3FileListFilter() {
    return new CompositeFileListFilter<S3ObjectSummary>()
            .addFilter(new S3PersistentAcceptOnceFileListFilter(
                    new RedisMetadataStore(redisConnectionFactory), "prefix-"))
            .addFilter(new S3RegexPatternFileListFilter(".*\\.csv$"));
}
It seems like each Pod is processing only one file. Another strange behaviour is that one of the Pods registers all the files in Redis, so the other Pods only pick up new files.
What is the best practice here, and how can I get multiple files processed at the same time?

See this option on the S3InboundFileSynchronizingMessageSource:
/**
 * Set the maximum number of objects the source should fetch if it is necessary to
 * fetch objects. Setting the maxFetchSize to 0 disables remote fetching, a negative
 * value indicates no limit.
 * @param maxFetchSize the max fetch size; a negative value means unlimited.
 */
@ManagedAttribute(description = "Maximum objects to fetch")
void setMaxFetchSize(int maxFetchSize);
And here is the doc: https://docs.spring.io/spring-integration/docs/current/reference/html/ftp.html#ftp-max-fetch
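A minimal configuration sketch of that option, assuming spring-integration-aws and the composite filter from the question exposed as a bean; the channel name, bucket, and local directory below are placeholders, not part of the original setup:

import java.io.File;

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.integration.annotation.InboundChannelAdapter;
import org.springframework.integration.annotation.Poller;
import org.springframework.integration.aws.inbound.S3InboundFileSynchronizer;
import org.springframework.integration.aws.inbound.S3InboundFileSynchronizingMessageSource;
import org.springframework.integration.file.filters.CompositeFileListFilter;

@Configuration
public class S3PollingConfig {

    @Bean
    @InboundChannelAdapter(value = "s3FilesChannel",
            poller = @Poller(fixedDelay = "10000", maxMessagesPerPoll = "5"))
    public S3InboundFileSynchronizingMessageSource s3MessageSource(AmazonS3 amazonS3,
            CompositeFileListFilter<S3ObjectSummary> s3FileListFilter) {
        S3InboundFileSynchronizer synchronizer = new S3InboundFileSynchronizer(amazonS3);
        synchronizer.setRemoteDirectory("my-bucket");          // placeholder bucket name
        synchronizer.setFilter(s3FileListFilter);              // the Redis-backed composite filter

        S3InboundFileSynchronizingMessageSource source =
                new S3InboundFileSynchronizingMessageSource(synchronizer);
        source.setLocalDirectory(new File("/tmp/s3-local"));   // placeholder local staging directory
        source.setMaxFetchSize(5);                             // fetch at most 5 new remote files per poll
        return source;
    }
}

Capping maxFetchSize keeps a single Pod from synchronizing (and registering in Redis) every remote file on one poll, leaving work for the other Pods.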

Related

Timeout exception when size of the input to child workflow is huge

16:37:21.945 [Workflow Executor taskList="PullFulfillmentsTaskList", domain="test-domain": 3] WARN com.uber.cadence.internal.common.Retryer - Retrying after failure
org.apache.thrift.transport.TTransportException: Request timeout after 1993ms
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.throwOnRpcError(WorkflowServiceTChannel.java:546)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.doRemoteCall(WorkflowServiceTChannel.java:519)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.respondDecisionTaskCompleted(WorkflowServiceTChannel.java:962)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.lambda$RespondDecisionTaskCompleted$11(WorkflowServiceTChannel.java:951)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.measureRemoteCall(WorkflowServiceTChannel.java:569)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.RespondDecisionTaskCompleted(WorkflowServiceTChannel.java:949)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.lambda$sendReply$0(WorkflowWorker.java:301)
at com.uber.cadence.internal.common.Retryer.lambda$retry$0(Retryer.java:104)
at com.uber.cadence.internal.common.Retryer.retryWithResult(Retryer.java:122)
at com.uber.cadence.internal.common.Retryer.retry(Retryer.java:101)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.sendReply(WorkflowWorker.java:301)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:261)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:229)
at com.uber.cadence.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:71)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Our parent workflow code is basically like this (JSONObject is from org.json)
JSONObject[] array = restActivities.getArrayWithHugeJSONItems();
for (JSONObject hugeJSON : array) {
    ChildWorkflow child = Workflow.newChildWorkflowStub(ChildWorkflow.class);
    child.run(hugeJSON);
}
What we found out is that most of the time the parent workflow worker fails to start the child workflow and throws the timeout exception above. It retries like crazy but never succeeds, printing the timeout exception over and over again. Sometimes, however, we get very lucky and it works. And sometimes it fails even earlier, at the activity worker, with the same exception. We believe this is because the data is too big (about 5 MB) and cannot be sent within the timeout (judging from the log we guess it is set to 2s). If we call child.run with small fake data it works 100% of the time.
The reason we use child workflows is that we want to use Async.function to run them in parallel. So how can we solve this problem? Is there a Thrift timeout config we should increase, or can we somehow avoid passing huge data around?
Thank you in advance!
---Update after Maxim's answer---
Thank you. I read the example, but I still have some questions for my use case. Let's say I get an array of 100 huge JSON objects in my RestActivitiesWorker. If I should not return the huge array to the workflow, I need to make 100 calls to the database to create 100 rows of records, put the 100 ids in an array, and pass that back to the workflow. The workflow then creates one child workflow per id. Each child workflow then calls another activity with the id to load the data from the DB. But that activity has to pass the huge JSON to the child workflow; is this OK? And for the RestActivitiesWorker making 100 inserts into the DB, what if it fails in the middle?
I guess it boils down to the fact that our workflow is trying to work directly with huge JSON. We are trying to load huge JSON (5-30 MB, not that huge) from an external system into our system. We break the JSON down a little, manipulate a few values, use values from a few fields to drive some different logic, and finally save it in our DB. How should we do this with Temporal?
Temporal/Cadence doesn't support passing large blobs as inputs and outputs, as it uses a DB as the underlying storage. So you want to change the architecture of your application to avoid this.
The standard workarounds are:
Use an external blob store to save the large data and pass a reference to it as a parameter (see the sketch after this list).
Cache the data in a worker process, or even on the host disk, and route the activities that operate on this data to that process or host. See the fileprocessing sample for this approach.
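A minimal sketch of the first workaround with the Cadence Java client; ParentWorkflow, ChildWorkflow, BlobStoreActivities, and fetchAndStoreItems are hypothetical names, and the blob store itself (e.g. S3) is not shown. Only small reference keys cross workflow boundaries, while the children still run in parallel via Async as in the original code:

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

import com.uber.cadence.activity.ActivityMethod;
import com.uber.cadence.activity.ActivityOptions;
import com.uber.cadence.workflow.Async;
import com.uber.cadence.workflow.Promise;
import com.uber.cadence.workflow.Workflow;
import com.uber.cadence.workflow.WorkflowMethod;

interface ParentWorkflow {
    @WorkflowMethod
    void run();
}

interface ChildWorkflow {
    @WorkflowMethod
    void run(String blobRef);            // receives a small reference key, not the 5 MB JSON
}

interface BlobStoreActivities {
    @ActivityMethod
    List<String> fetchAndStoreItems();   // uploads each huge JSON to the blob store, returns reference keys

    @ActivityMethod
    String get(String blobRef);          // used by the child workflow to load one payload
}

public class ParentWorkflowImpl implements ParentWorkflow {

    private final ActivityOptions options = new ActivityOptions.Builder()
            .setScheduleToCloseTimeout(Duration.ofMinutes(5))
            .build();

    private final BlobStoreActivities blobStore =
            Workflow.newActivityStub(BlobStoreActivities.class, options);

    @Override
    public void run() {
        List<Promise<?>> children = new ArrayList<>();
        for (String blobRef : blobStore.fetchAndStoreItems()) {
            ChildWorkflow child = Workflow.newChildWorkflowStub(ChildWorkflow.class);
            // Only the reference key is passed; the child's first activity loads the
            // real payload from the blob store via blobStore.get(blobRef).
            children.add(Async.procedure(child::run, blobRef));
        }
        Promise.allOf(children).get();   // wait for all children to complete
    }
}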

How to enable bigger payloads in Orion? Increase PAYLOAD_MAX_SIZE?

Right now it is not possible to send an entity to Orion with a payload larger than PAYLOAD_MAX_SIZE (1 MB).
/****************************************************************************
*
*
* PAYLOAD_MAX_SIZE -
*/
#define PAYLOAD_MAX_SIZE (1 * 1024 * 1024) // 1 MB Maximum size of the payload
Source code: Orion PAYLOAD_MAX_SIZE
We have to transfer an entity (including a map/image) through the context broker, and its size is > 1 MB.
Have you foreseen this as a parameter for the docker-compose file? If not, it would be really helpful to add it.
Thanks for your help.
Are you sure you want to store an image in the broker? You should store it in an Object Storage service, not in Orion.
Orion is suited for context information, which is basically about entities (e.g. a car) and their attributes (e.g. the speed and location associated with that car). It is not suited for large binaries (such as a PNG file) directly; the usual pattern is to store the binary in an external system and keep its URL in Orion as a reference. Have a look at this post for more details.
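For illustration, the entity can carry just a link to where the image lives in the object store. A minimal sketch using Java's built-in HTTP client against the NGSIv2 API; the broker URL, entity id, and imageUrl attribute are made-up values:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateEntityWithImageReference {

    public static void main(String[] args) throws Exception {
        // The image itself is uploaded to an object store elsewhere; Orion only
        // stores the small URL pointing at it.
        String entity = """
                {
                  "id": "Room1",
                  "type": "Room",
                  "temperature": { "value": 23, "type": "Number" },
                  "imageUrl": {
                    "value": "https://object-store.example.com/maps/room1.png",
                    "type": "URL"
                  }
                }
                """;

        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:1026/v2/entities"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(entity))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());  // 201 Created on success
    }
}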

Same data read by PCF instances for spring batch application

I am working on a Spring Batch application which reads data from a database using JdbcCursorItemReader. The application works as expected when I run a single instance.
I deployed it to PCF and used the auto-scale feature, but multiple instances retrieve the same records from the database.
How can I prevent duplicate reads of the same data by the other instances?
This is normally handled by applying the processed indicator pattern. In this pattern, you add a field to each row and mark it with a status as the record is processed. Your query then filters for only the records with the status you care about. In this case, the status could be node specific, so that each node only selects the records it has tagged for itself.
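A rough sketch of that pattern, assuming a hypothetical orders table with status and instance_id columns and using the Cloud Foundry CF_INSTANCE_INDEX variable to identify the node; the claiming query and column names are illustrative, not a fixed recipe:

import javax.sql.DataSource;

import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.database.JdbcCursorItemReader;
import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.JdbcTemplate;

@Configuration
public class ProcessedIndicatorConfig {

    record Order(long id, String payload) { }    // illustrative domain type

    @Bean
    public JdbcCursorItemReader<Order> orderReader(DataSource dataSource,
            @Value("${CF_INSTANCE_INDEX:0}") String instanceId) {
        // Before this step runs, each instance "claims" a batch of NEW rows, e.g.:
        //   UPDATE orders SET status = 'CLAIMED', instance_id = :id
        //   WHERE status = 'NEW' AND instance_id IS NULL LIMIT 100
        // so that the reader below only ever sees rows tagged for this node.
        return new JdbcCursorItemReaderBuilder<Order>()
                .name("orderReader")
                .dataSource(dataSource)
                .sql("SELECT id, payload FROM orders WHERE status = 'CLAIMED' AND instance_id = ?")
                .preparedStatementSetter(ps -> ps.setString(1, instanceId))
                .rowMapper((rs, rowNum) -> new Order(rs.getLong("id"), rs.getString("payload")))
                .build();
    }

    @Bean
    public ItemWriter<Order> orderWriter(JdbcTemplate jdbcTemplate) {
        return items -> {
            for (Order order : items) {
                // Flip the processed indicator so the row is never picked up again.
                jdbcTemplate.update("UPDATE orders SET status = 'PROCESSED' WHERE id = ?", order.id());
            }
        };
    }
}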

Is querying MongoDB faster than Redis?

I have some data stored in a database (MongoDB) and in a distributed cache (Redis).
When querying the repository, I use a lazy loading approach: first look for the data in the cache and, if it's not there, fetch it from the database and update the cache so that the next time it is needed it is found in the cache.
Sample Model Used:
Person ( id, name, age, address (Reference))
Address (id, place)
PersonCacheModel extends Person with addressId.
I am not storing the parent object together with the child object in the cache; that is why I created PersonCacheModel with an addressId and store this object in the cache. When reading the data, PersonCacheModel is converted back to a Person, and a call is made to the address repo (addressCache) to fill in the address details of the person object.
As far as I understand:
personRepository.findPersonByName(NAME + randomNumber);
Access Data from Cache = network time + cache access time + deserialize time
Access Data from database = network time + database query time + object mapping time
When I ran the above approach for 1000 rows, accessing the data from the database was faster than accessing it from the cache. I believed the cache access time would be smaller than accessing MongoDB.
Please let me know if there's an issue with the approach, or if this is the expected scenario.
To have a valid benchmark we need to consider the hardware side and the data-processing side:
hardware - do we have the same configuration, RAM, CPU count, OS... etc.?
process - how the data is transformed (single thread, multi thread, per object, per request)
Performing a load test on your data set will give you a good overview of which process is faster in your particular use-case scenario.
It is hard to judge what it should be until the points mentioned above are known.
The other thing is to have more than one test scenario and stress it over, say, 10 seconds, a minute, an hour... so you have numbers that tell you the truth.
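As a starting point, the two paths can at least be timed under identical conditions. A minimal sketch assuming Spring Data's StringRedisTemplate and MongoTemplate, the Person model from the question, and an assumed "person:<name>" key scheme for the cache; in a real test each call would be repeated many times after a warm-up and the distributions compared, not single reads:

import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;
import org.springframework.data.redis.core.StringRedisTemplate;

public final class CacheVsDbTimer {

    // Elapsed nanoseconds for one cache read: network + Redis lookup (deserialization
    // would happen after the raw string comes back).
    public static long timeCacheRead(StringRedisTemplate redisTemplate, String name) {
        long start = System.nanoTime();
        redisTemplate.opsForValue().get("person:" + name);
        return System.nanoTime() - start;
    }

    // Elapsed nanoseconds for one database read: network + query + document-to-object mapping.
    public static long timeDbRead(MongoTemplate mongoTemplate, String name) {
        long start = System.nanoTime();
        mongoTemplate.findOne(Query.query(Criteria.where("name").is(name)), Person.class);
        return System.nanoTime() - start;
    }
}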

paginated data with the help of mongo inbound adapter in spring integration

I am using the mongo inbound adapter to retrieve data from Mongo. Currently I am using the configuration below.
<int-mongo:inbound-channel-adapter
        id="mongoInboundAdapter" collection-name="updates_IPMS_PRICING"
        mongo-template="mongoTemplatePublisher" channel="ipmsPricingUpdateChannelSplitter"
        query="{'flagged' : false}" entity-class="com.snapdeal.coms.publisher.bean.PublisherVendorProductUpdate">
    <poller max-messages-per-poll="2" fixed-rate="10000"/>
</int-mongo:inbound-channel-adapter>
I have around 20 records in my database which match the query above, and since I set max-messages-per-poll to 2 I was expecting to get a maximum of 2 records per poll.
But I am getting all the records that match the query. I'm not sure what I am doing wrong.
Actually, I'd suggest raising a New Feature JIRA ticket for that, so that query-expression allows specifying an org.springframework.data.mongodb.core.query.Query builder, which has skip() and limit() options. From there your issue could be fixed like:
<int-mongo:inbound-channel-adapter
query-expression="new BasicQuery('{\'flagged\' : false}').limit(2)"/>
The mongo adapter is designed to return a single message containing a collection of query results per poll. So max-messages-per-poll makes no difference here.
max-messages-per-poll is used to short-circuit the poller and, in your case, the second poll is done immediately rather than waiting 10 seconds again. After 2 polls, we wait again.
In order to implement paging, you will need to use a query-expression instead of query and maintain some state somewhere that can be included in the query on each poll.
For example, if the documents have some value that increments you can store off that value in a bean and use the value in the next poll to get the next one.
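The same idea is easier to show in Java than in the XML above: a hedged sketch of a polled source that keeps the paging state between polls. The channel name, collection, and entity class are taken from the question; paging on _id (and a getId() accessor returning it) is an assumption about the documents, and @EnableIntegration is assumed to be present:

import java.util.List;

import org.bson.types.ObjectId;
import org.springframework.data.domain.Sort;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.query.Criteria;
import org.springframework.data.mongodb.core.query.Query;
import org.springframework.integration.annotation.InboundChannelAdapter;
import org.springframework.integration.annotation.Poller;
import org.springframework.stereotype.Component;

@Component
public class PagedMongoSource {

    private final MongoTemplate mongoTemplate;
    private ObjectId lastSeenId;   // paging state kept between polls

    public PagedMongoSource(MongoTemplate mongoTemplate) {
        this.mongoTemplate = mongoTemplate;
    }

    @InboundChannelAdapter(channel = "ipmsPricingUpdateChannelSplitter",
            poller = @Poller(fixedRate = "10000", maxMessagesPerPoll = "1"))
    public List<PublisherVendorProductUpdate> nextPage() {
        Criteria criteria = Criteria.where("flagged").is(false);
        if (lastSeenId != null) {
            criteria = criteria.and("_id").gt(lastSeenId);   // continue after the previous page
        }
        Query query = new Query(criteria).with(Sort.by("_id")).limit(2);
        List<PublisherVendorProductUpdate> page =
                mongoTemplate.find(query, PublisherVendorProductUpdate.class, "updates_IPMS_PRICING");
        if (page.isEmpty()) {
            return null;           // no message emitted on this poll
        }
        lastSeenId = page.get(page.size() - 1).getId();
        return page;               // one message per poll carrying at most 2 documents
    }
}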