Talend load fails midway, resulting in a rollback - talend

I have a tFileInputDelimited component and a tMap, and the result is passed to a tFileOutputDelimited which creates a CSV file.
Sometimes the data load fails in the middle of the job, resulting in a rollback of the destination file.
This wastes resources and time.
Can anyone suggest an approach so that, when a job fails partway through, the data already processed is saved and the next run starts from the point of failure?

Talend won't roll back a process when writing with tFileOutputDelimited. If you got an empty output file, it means that your job died prematurely and no records were written to the output buffer.
If an error occurs while writing to the file, the following code (generated by tFileOutputDelimited) closes the output buffer and flushes the data successfully written before the error:
...
} finally {
    if (outtFileOutputDelimited_1 != null) {
        outtFileOutputDelimited_1.flush();
        outtFileOutputDelimited_1.close();
    }
    ...
}
...
There's no real "resume" feature in Talend, but you can create your own die-and-resume process in the job as follows:
tFileInput1 ==> tHashOutput
tFileInput2 =main=> tMap ==> tFileOutput1
tHashInput =lookup=> tMap
tFileInput1: reads the data generated by the last run of your job, which is stored in memory with tHashOutput
tFileInput2: reads your input file
tFileOutput1: stores the output data
tHashInput: reads the data in memory and serves as the lookup in the tMap
In your tMap, create an inner join between the main flow (tFileInput2) and the lookup (tHashInput). Then, for your output schema, select "Catch lookup inner join reject" to process only the records that are not already in tHashInput (a conceptual sketch of this filter follows).
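Conceptually, the inner join reject does the same thing as the plain-Java sketch below (outside Talend, just to illustrate the idea): load the keys already written by the failed run into a set, then append only the input rows whose key is not in that set. The file names, the ";" delimiter and the assumption that the first field is the key are all hypothetical.

import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.HashSet;
import java.util.Set;

public class ResumeFilter {
    public static void main(String[] args) throws IOException {
        Path input = Paths.get("input.csv");   // full input file (assumed name)
        Path output = Paths.get("output.csv"); // destination file, possibly partial after a failed run

        // Equivalent of tFileInput1 ==> tHashOutput: keep the keys already written by the last run in memory.
        Set<String> alreadyWritten = new HashSet<>();
        if (Files.exists(output)) {
            for (String line : Files.readAllLines(output, StandardCharsets.UTF_8)) {
                alreadyWritten.add(line.split(";")[0]); // first field is assumed to be the key
            }
        }

        // Equivalent of the tMap inner join reject: only rows whose key is not already in the
        // lookup are appended, so a re-run continues where the failed run stopped.
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (String line : Files.readAllLines(input, StandardCharsets.UTF_8)) {
                if (!alreadyWritten.contains(line.split(";")[0])) {
                    out.write(line);
                    out.newLine();
                }
            }
        }
    }
}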
I'm not sure it will save resources and time, though. The best way to manage errors is to identify them and add checks to the job to avoid them!
For more clarity, could you give an example of an error that occurs when you run the job?


Mirth Database Reader channel automatically reruns CRON query when you delete the message history?

Is this a defect? Does "Remove All Messages" cause this channel type to automatically reprocess?
Create a channel with:
Database Reader Source that
-- runs on a CRON (0 5 * * * ? for example)
-- does not use Javascript (uses the SQL text block)
-- does not aggregate results
-- does not cache results
File Writer Destination
-- append to file
-- write the SELECT columns out to the file
Then run the channel. After it runs and writes numerous rows to the output file,
go into the Dashboard and try "Remove All Messages". It clears the messages, but then goes right back to polling the DB and rerunning the query, regardless of what the source CRON was set to.
This creates duplicates in the output file whenever we clear the dashboard message history. Why?
Edit the channel so that records which have already been read by the Database Reader source connector are flagged. You can achieve this by adding a state flag (e.g. an is_sent column) to the source table: give it a default value of 0, then toggle it to 1 once the record has been pulled by Mirth Connect. A sketch of this select-then-flag pattern follows.
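Outside of Mirth, the pattern looks roughly like the JDBC sketch below. The connection string, table and column names are assumptions; inside Mirth, the SELECT belongs in the Database Reader and the UPDATE in the reader's post-processing update step (the exact option name depends on your Mirth Connect version).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class SelectThenFlag {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection(
                "jdbc:sqlserver://localhost;databaseName=sourcedb", "user", "password")) {

            // Read only records that have not been sent yet.
            try (Statement select = con.createStatement();
                 ResultSet rs = select.executeQuery(
                         "SELECT id, payload FROM source_table WHERE is_sent = 0");
                 PreparedStatement flag = con.prepareStatement(
                         "UPDATE source_table SET is_sent = 1 WHERE id = ?")) {
                while (rs.next()) {
                    long id = rs.getLong("id");
                    String payload = rs.getString("payload");
                    // ... append payload to the output file here ...

                    // Flag the row so it is never re-read, even if the message history is cleared.
                    flag.setLong(1, id);
                    flag.executeUpdate();
                }
            }
        }
    }
}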

Clear folder before write in Azure Data Factory sink: "The specified path does not exist"

I have data in Azure Data Lake Storage Gen2 that has been aggregated into parquet files:
Dataset source2 reads from the output path /output/partitions
Dataset sink writes to the same path as source2, /output/partitions
When I select "Clear the folder" on the sink, I get:
"Job failed due to reason: at Sink 'sink1': Operation failed:
\"The specified path does not exist.\", 404, HEAD,
It also says to run the following to clear the cache:
'REFRESH TABLE tableName'
It writes all the other partitions, but is there a way to read from the same ADLS Gen2 folder and overwrite it?
I reproduced this and got the same error when I checked the "Clear the folder" option.
I tried other options and observed that the new parquet files do get created. So, to delete the existing parquet files, you can use the approach below.
The idea is to delete the old files after the data flow, by their last modified date, using a Delete activity.
To filter out the old files, use the utcNow() function: the last modified date of the old files is less than utcNow().
First, store the @utcNow() value in a variable before the data flow.
This is my pipeline:
After the data flow, use a Get Metadata activity to get the list of all parquet files (old + new).
Pass this list to a ForEach, and inside the ForEach use another Get Metadata activity to get the lastModified date. For this, use another parquet dataset with a parameter.
Now compare this last modified date to our variable in an If Condition. If it evaluates to true, use a Delete activity inside the True activities of the If Condition.
If Condition expression:
@greater(variables('timebeforedf'), activity('Get Metadata2').output.lastModified)
In the Delete activity inside the True activities, pass @item().name as the file name.
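If you would rather do the same cleanup in code instead of chaining Get Metadata, ForEach and Delete activities, the delete-by-last-modified idea looks roughly like the sketch below using the Azure Storage Blob SDK (ADLS Gen2 is reachable through the blob endpoint). The connection string, container name and prefix are assumptions.

import com.azure.storage.blob.BlobContainerClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import com.azure.storage.blob.models.BlobItem;
import com.azure.storage.blob.models.ListBlobsOptions;

import java.time.OffsetDateTime;

public class DeleteOldParquets {
    public static void main(String[] args) {
        BlobContainerClient container = new BlobServiceClientBuilder()
                .connectionString(System.getenv("ADLS_CONNECTION_STRING")) // assumed env variable
                .buildClient()
                .getBlobContainerClient("output"); // assumed container

        // Same role as the pipeline variable: capture the cutoff BEFORE writing the new files.
        OffsetDateTime cutoff = OffsetDateTime.now();

        // ... the data flow / write of the new parquet files would happen here ...

        // Same role as ForEach + If Condition + Delete activity: remove anything
        // whose last modified date is older than the cutoff.
        for (BlobItem blob : container.listBlobs(new ListBlobsOptions().setPrefix("partitions/"), null)) {
            if (blob.getProperties().getLastModified().isBefore(cutoff)) {
                container.getBlobClient(blob.getName()).delete();
            }
        }
    }
}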
My resulting parquet files after execution:

Timeout exception when the input to a child workflow is huge

16:37:21.945 [Workflow Executor taskList="PullFulfillmentsTaskList", domain="test-domain": 3] WARN com.uber.cadence.internal.common.Retryer - Retrying after failure
org.apache.thrift.transport.TTransportException: Request timeout after 1993ms
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.throwOnRpcError(WorkflowServiceTChannel.java:546)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.doRemoteCall(WorkflowServiceTChannel.java:519)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.respondDecisionTaskCompleted(WorkflowServiceTChannel.java:962)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.lambda$RespondDecisionTaskCompleted$11(WorkflowServiceTChannel.java:951)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.measureRemoteCall(WorkflowServiceTChannel.java:569)
at com.uber.cadence.serviceclient.WorkflowServiceTChannel.RespondDecisionTaskCompleted(WorkflowServiceTChannel.java:949)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.lambda$sendReply$0(WorkflowWorker.java:301)
at com.uber.cadence.internal.common.Retryer.lambda$retry$0(Retryer.java:104)
at com.uber.cadence.internal.common.Retryer.retryWithResult(Retryer.java:122)
at com.uber.cadence.internal.common.Retryer.retry(Retryer.java:101)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.sendReply(WorkflowWorker.java:301)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:261)
at com.uber.cadence.internal.worker.WorkflowWorker$TaskHandlerImpl.handle(WorkflowWorker.java:229)
at com.uber.cadence.internal.worker.PollTaskExecutor.lambda$process$0(PollTaskExecutor.java:71)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:834)
Our parent workflow code is basically like this (JSONObject is from org.json)
JSONObject[] array = restActivities.getArrayWithHugeJSONItems();
for (JSONObject hugeJSON : array) {
    ChildWorkflow child = Workflow.newChildWorkflowStub(ChildWorkflow.class);
    child.run(hugeJSON);
}
What we found is that most of the time the parent workflow worker fails to start the child workflow and throws the timeout exception above. It retries like crazy but never succeeds, printing the timeout exception over and over again. Sometimes we get very lucky and it works, and sometimes it fails even earlier at the activity worker with the same exception. We believe this is because the data is too big (about 5 MB) and cannot be sent within the timeout (judging from the log, we guess it is set to 2 s). If we call child.run with small fake data, it works 100% of the time.
The reason we use child workflows is that we want to use Async.function to run them in parallel. So how can we solve this problem? Is there a Thrift timeout config we should increase, or can we somehow avoid passing huge data around?
Thank you in advance!
---Update after Maxim's answer---
Thank you. I read the example, but I still have some questions for my use case. Let's say I get an array of 100 huge JSON objects in my RestActivitiesWorker. If I should not return the huge array to the workflow, I need to make 100 calls to the database to create 100 rows, put the 100 ids in an array, and pass that back to the workflow. Then the workflow creates one child workflow per id. Each child workflow then calls another activity with the id to load the data from the DB. But that activity has to pass the huge JSON to the child workflow; is this OK? And for the RestActivitiesWorker making 100 inserts into the DB, what if it fails in the middle?
I guess it boils down to our workflow trying to work directly with huge JSON. We are trying to load huge JSON (5-30 MB, not that huge) from an external system into our system. We break the JSON down a little, manipulate a few values, use values from a few fields for some branching logic, and finally save it in our DB. How should we do this with Temporal?
Temporal/Cadence doesn't support passing large blobs as inputs and outputs, as it uses a DB as the underlying storage. So you want to change the architecture of your application to avoid this.
The standard workarounds are:
Use an external blob store to save the large data and pass a reference to it as a parameter (see the sketch after this list).
Cache data in a worker process, or even on the host disk, and route activities that operate on this data to that process or host. See the fileprocessing sample for this approach.
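A minimal sketch of the first workaround with the Cadence Java client, assuming a hypothetical RestActivities activity that uploads each item to an external blob store and returns only the keys. Only small references cross workflow boundaries; the heavy JSON stays inside activities.

import com.uber.cadence.activity.ActivityMethod;
import com.uber.cadence.activity.ActivityOptions;
import com.uber.cadence.workflow.Async;
import com.uber.cadence.workflow.Promise;
import com.uber.cadence.workflow.Workflow;
import com.uber.cadence.workflow.WorkflowMethod;

import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

interface ParentWorkflow {
    @WorkflowMethod
    void run();
}

interface ChildWorkflow {
    @WorkflowMethod
    void run(String blobKey); // takes a reference, not the JSON itself
}

interface RestActivities {
    @ActivityMethod
    List<String> fetchItemsAndStoreInBlobStore(); // uploads each item, returns only the keys
}

public class ParentWorkflowImpl implements ParentWorkflow {

    private final RestActivities activities = Workflow.newActivityStub(
            RestActivities.class,
            new ActivityOptions.Builder()
                    .setScheduleToCloseTimeout(Duration.ofMinutes(5))
                    .build());

    @Override
    public void run() {
        // Only small references travel through Cadence, never the 5-30 MB payloads.
        List<String> blobKeys = activities.fetchItemsAndStoreInBlobStore();

        List<Promise<Void>> results = new ArrayList<>();
        for (String key : blobKeys) {
            ChildWorkflow child = Workflow.newChildWorkflowStub(ChildWorkflow.class);
            // Async keeps the children running in parallel, as in the original code.
            results.add(Async.procedure(child::run, key));
        }
        // Wait for all children to complete.
        for (Promise<Void> result : results) {
            result.get();
        }
    }
}

Inside each child workflow, an activity loads the blob by key, does the manipulation and saves the result to the DB, so the JSON never appears as a workflow input or output.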

Is it possible to capture the data into a reject file if the stage variable conversion fails?

We have a stage variable using DateFromDaysSince(date column) in a DataStage Transformer. Due to some invalid dates, the DataStage job is failing. Our source is Oracle.
When we check the dates in the table we don't find any issue, but the job fails while the transformation is happening.
Error: Invalid Date [:000-01-01] used for date_from_days_since type conversion
Is there any possibility to capture those failing records in a reject file and make the parallel job run successfully?
Yes, it is possible.
You can use the IsValidDate or IsValidTimestamp function to check that - check out the details here.
These functions can be used in a Transformer constraint to route rows that do not have the expected type to a reject file (or a Peek stage).
When your data is retrieved from a database (as mentioned), the database already enforces the data type - if the data is stored in the appropriate format. I suggest checking the retrieval method to avoid unnecessary checks or rejects. Different timestamp formats could be an issue.

How can I force RepositoryItemReader to read only newly inserted or unprocessed records?

I have a batch job which reads records from an Azure SQL database. The use case: records are continuously written to the database, and my Spring Batch job has to run every 5 minutes and read only the records that were newly inserted and not yet processed by previous runs. I am not sure whether there is a built-in method in RepositoryItemReader for this, or whether I have to implement a workaround.
@Bean
public RepositoryItemReader<Booking> bookingReader() {
    RepositoryItemReader<Booking> bookingReader = new RepositoryItemReader<>();
    bookingReader.setRepository(bookingRepository);
    bookingReader.setMethodName("findAll");
    bookingReader.setSaveState(true);
    bookingReader.setPageSize(2);
    Map<String, Sort.Direction> sort = new HashMap<String, Sort.Direction>();
    bookingReader.setSort(sort);
    return bookingReader;
}
You need to add a column called "STATUS" to your table. When a record is inserted, its status should be "NOT PROCESSED". When your ItemReader reads the record, change the status to "IN PROCESS"; when your ItemProcessor and ItemWriter complete their work, change the status to "PROCESSED". This way you can make sure your ItemReader reads only "NOT PROCESSED" records (a sketch follows below).
Note: If you are running your batch job with multiple threads via a TaskExecutor, use a synchronized method in your reader to read the "NOT PROCESSED" records and change their status to "IN PROCESS". This way you can make sure that multiple threads will not fetch the same data.
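A minimal sketch of that idea, assuming the Booking entity has a status field and the repository exposes a derived query such as findByStatus (both names are assumptions):

@Bean
public RepositoryItemReader<Booking> bookingReader() {
    RepositoryItemReader<Booking> reader = new RepositoryItemReader<>();
    reader.setRepository(bookingRepository);
    // Assumed repository method: Page<Booking> findByStatus(String status, Pageable pageable)
    reader.setMethodName("findByStatus");
    reader.setArguments(Collections.singletonList("NOT PROCESSED"));
    reader.setPageSize(2);
    // A stable sort keeps paging deterministic while statuses change between chunks.
    Map<String, Sort.Direction> sort = new HashMap<>();
    sort.put("id", Sort.Direction.ASC);
    reader.setSort(sort);
    return reader;
}

@Bean
public ItemWriter<Booking> bookingWriter() {
    return items -> {
        for (Booking booking : items) {
            booking.setStatus("PROCESSED"); // assumed setter on the entity
        }
        bookingRepository.saveAll(items);
    };
}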
If altering the table is not an option, another approach is to lean on the Spring Batch meta-data tables as much as you can.
Before the job completes, store a timestamp (or some other indicator) in the job execution context that tells you where to begin on the next job run.
This can be an "out of the box" solution (a rough sketch follows).