Spring Batch: sharing data between executions

I'm new to Spring Batch. I have a simple proof-of-concept job which does something dummy:
ItemReader: gets a number of records between dateA and the max date (the last record at the moment)
ItemProcessor: aggregates the rows per city
ItemWriter: writes the result to the database
My question: when the job finishes, I want to keep the last date so that the next job execution can start from that date as dateA. I know that I can use the JobExecutionContext to share data, but on the second run, when I try to get dateA, the value is null.
Is there a way to get the dateA value recorded by the previous job execution? And how?
Thanks in advance.

The following may be an alternative.
Store the date in a holder bean using JobExecutionListener#afterJob; a sketch follows below. The following link covers passing a collection at a Step level in a similar way:
What's the best way to pass a huge collection to a Spring Batch Step?
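A minimal sketch of that idea, assuming a hypothetical singleton DateHolder bean and made-up dateA/maxDate context keys. Note that the holder lives only in memory, so the value is lost when the JVM restarts:

import java.time.LocalDate;

import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;

// Hypothetical singleton holder bean shared between job executions
// (within the same JVM only -- it is lost on restart).
class DateHolder {
    private LocalDate lastDate;
    public LocalDate getLastDate() { return lastDate; }
    public void setLastDate(LocalDate lastDate) { this.lastDate = lastDate; }
}

class LastDateListener implements JobExecutionListener {

    private final DateHolder holder;

    LastDateListener(DateHolder holder) {
        this.holder = holder;
    }

    @Override
    public void beforeJob(JobExecution jobExecution) {
        // Seed this run's execution context with the previous run's date, if any.
        if (holder.getLastDate() != null) {
            jobExecution.getExecutionContext()
                        .putString("dateA", holder.getLastDate().toString());
        }
    }

    @Override
    public void afterJob(JobExecution jobExecution) {
        // Remember the last date processed by this run for the next execution.
        String maxDate = jobExecution.getExecutionContext().getString("maxDate", null);
        if (maxDate != null) {
            holder.setLastDate(LocalDate.parse(maxDate));
        }
    }
}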

There are multiple ways of implementing this. The simplest would be to just write the date to a file and read that file on the next start, but this is somewhat dirty. You can also store it in the database, or simply run a SQL query that returns the max date from the data you already have in the database. Basically, I would implement a simple tasklet (similar to Tasklet to delete a table in spring batch) that runs the query and puts the max date into the job execution context via the chunk context.
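A rough sketch of such a tasklet, assuming hypothetical your_table/event_date names; it promotes the max date into the job execution context so that later steps can pick it up as dateA:

import javax.sql.DataSource;

import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.jdbc.core.JdbcTemplate;

// Tasklet that looks up the max date already present in the database and
// exposes it through the job execution context for later steps to use as dateA.
public class MaxDateTasklet implements Tasklet {

    private final JdbcTemplate jdbcTemplate;

    public MaxDateTasklet(DataSource dataSource) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
    }

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
        // Hypothetical table/column names -- adapt to your schema.
        String maxDate = jdbcTemplate.queryForObject(
                "SELECT MAX(event_date) FROM your_table", String.class);

        chunkContext.getStepContext().getStepExecution()
                .getJobExecution().getExecutionContext()
                .putString("dateA", maxDate == null ? "" : maxDate);

        return RepeatStatus.FINISHED;
    }
}

A step-scoped reader can then inject the value with @Value("#{jobExecutionContext['dateA']}").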

Related

Export database to multiple files in same job Spring Batch

I need to export a database of around 180k objects to JSON files so I can retain the data structure in a certain way that suits me for a later import into another database. However, because of the amount of data, I want to separate and group the data based on some attribute value from the database records themselves. So all records that have attribute1=value1 should go to value1.json, those with attribute1=value2 to value2.json, and so on.
However, I still haven't figured out how to do this kind of job. I am using RepositoryItemReader and JsonFileWriter.
I started by filtering the data on that attribute and running separate exports, just to verify that it works, but I need to automate the whole process.
Can this be done?
There are several ways to do that. Here are a couple of options:
Option 1: parallel steps
You start by creating a tasklet that calculates the distinct values of the attribute you want to group items by, and you put this information in the job execution context.
After that, you create a flow with a chunk-oriented step for each value. Each chunk-oriented step would process a distinct value and generate an output file. The item reader and writer would be step-scoped beans, dynamically configured with the information from the job execution context.
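For illustration, a step-scoped writer might be wired like this (a rough sketch; the currentValue key and the Map-based item type are placeholders for whatever your tasklet stores and your reader produces):

import java.util.Map;

import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.item.json.JacksonJsonObjectMarshaller;
import org.springframework.batch.item.json.JsonFileItemWriter;
import org.springframework.batch.item.json.builder.JsonFileItemWriterBuilder;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.core.io.FileSystemResource;

@Configuration
public class ExportStepConfig {

    // Step-scoped writer: the attribute value is resolved at step start from
    // the job execution context, producing one output file per distinct value.
    @Bean
    @StepScope
    public JsonFileItemWriter<Map<String, Object>> jsonWriter(
            @Value("#{jobExecutionContext['currentValue']}") String value) {
        return new JsonFileItemWriterBuilder<Map<String, Object>>()
                .name("jsonWriter-" + value)
                .resource(new FileSystemResource(value + ".json"))
                .jsonObjectMarshaller(new JacksonJsonObjectMarshaller<>())
                .build();
    }
}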
Option 2: partitioned step
Here, you would implement a Partitioner that creates a partition for each distinct value. Each worker step would then process a distinct value and generate an output file.
Both options should perform equally well in your use case. However, option 2 is easier to implement and configure, in my opinion.
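A minimal sketch of such a Partitioner, assuming the distinct values have already been queried (the names here are hypothetical); each worker's step-scoped reader and writer would then pull attributeValue and outputFile from #{stepExecutionContext[...]}:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// One partition per distinct attribute value; each worker step then
// processes one value and writes its own <value>.json file.
public class AttributeValuePartitioner implements Partitioner {

    private final List<String> distinctValues;

    public AttributeValuePartitioner(List<String> distinctValues) {
        this.distinctValues = distinctValues;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (String value : distinctValues) {
            ExecutionContext context = new ExecutionContext();
            context.putString("attributeValue", value);        // filter for the reader
            context.putString("outputFile", value + ".json");  // target for the writer
            partitions.put("partition-" + value, context);
        }
        return partitions;
    }
}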

Is there a way to get estimated time of completion of a currently running Informatica workflow in Infra metadata tables

I am working with the metadata table REP_WFLOW_RUN from the Infra DB to get status information about workflows. The column run_status_code shows whether a workflow is running, succeeded, stopped, aborted, etc.
But for my business use case I also need to report to the business the estimated time of completion of a particular workflow.
Example: if a workflow generally starts at 6:15, then along with the information that the workflow has started, I also want to convey that it is estimated to complete at such-and-such time.
Could you please guide me if you have any details on how to get this info from the Informatica database?
Many thanks in advance.
This is a very good question, but no one can answer it exactly :)
That said, you can approximate it with the same logic other scheduling tools use.
First, calculate the average time the workflow takes to complete over its successful runs. The output is a decimal value (hours):
SELECT AVG(end_time - start_time) * 24 AS avg_time_in_hr, workflow_name
FROM REP_WFLOW_RUN
WHERE run_status_code = 'succeeded'
GROUP BY workflow_name
You can use the above value as the estimated time to completion for that workflow. The output is a datetime:
SELECT SYSDATE + avg_time_in_hr / 24 AS est_time_to_complete FROM dual
Now, this is an estimated figure, not an exact value. On a bad day, when the workflow takes hours longer, the average will be off, but we can't do much about that.
I assumed your Informatica metadata repository is on Oracle.

AZURE DATA FACTORY - Can I set a variable from within a CopyData task or by using the output?

I have a simple pipeline that has a Copy activity to populate a table. That task is based on a query and will only ever return 1 row.
The problem I am having is that I want to reuse the value from one of the columns (batch number) to set a variable so that at the end of the pipeline I can use a Stored Procedure to log that the batch was processed. I would rather avoid running the query a second time in a Lookup task, so can I make use of the data already being returned?
I have tried duplicating the column in the Copy activity and then mapping it to something like @BatchNo, but that fails. I have even tried adding a Set Variable task, but I can't figure out how to take a single column from the output. @{activity('Populate Aleprstw').output} does not error, but I'm not sure what it will actually do in this case.
Thanks, and sorry if it's a silly question.
Cheers
Mark
I always do it like this:
1) Generate a batch number (usually with a proc)
2) Use a Lookup to grab it into a variable
3) Use the batch number in all activities (there might be multiple copies, procs, etc.)
4) Write the batch completion
From your description it seems you have the batch number embedded in the data copy from the start, which is not typical.
If you must do it this way, is there really an issue with running a Lookup again?
The Copy activity doesn't return data like that, so you won't be able to capture the results that way. With this design, running the query again in a Lookup is the best option.
Is the query in the Source running on the same server as the Sink? If so, you could collapse the entire operation into a Stored Procedure that returns the data point you are trying to capture.

Keeping database table clean using Entity Framework

I am using Entity Framework and I have a table that I use to record some events that get generated by a 3rd party. These events are valid for 3 days, so I'd like to clean out any events older than 3 days to keep my database table leaner. Is there any way I can do this, with some approach that won't cause any performance issues during cleanup?
To do the above, there are some options:
1) Define a stored procedure mapped in your EF model, and use a Quartz trigger to execute it on a schedule.
https://www.quartz-scheduler.net
2) A SQL Server Agent job which runs every day at the least-busy time and removes your rows.
https://learn.microsoft.com/en-us/sql/ssms/agent/schedule-a-job?view=sql-server-2017
If the cleanup is all you need and nothing else, I recommend you go with option 2.
First of all, make sure the 3rd party writes a timestamp on every record; that way you will be able to track how old each record is.
Then create a script that deletes all records older than 3 days:
DELETE FROM yourTable WHERE DATEDIFF(day, GETDATE(), thatColumn) < -3
Now create a scheduled job: in SQL Server Management Studio, navigate to the server, then expand the SQL Server Agent item, and finally the Jobs folder to view, edit, and add scheduled jobs.
Set the script to run once every day, or whatever pleases you :)

Talend: how can we estimate the tAggregateSortedRow record count parameter value

We are trying out Talend and we want to aggregate some sorted data on a few keys.
Simple enough, but when we try to use tAggregateSortedRow, it asks for the exact number of rows to be specified.
I am not sure how anyone can input this on the fly. Doesn't this value change on every run? Am I missing something? Surely they can't expect us to know the total record count before we run the job.
This has to do with the way the Talend component tAggregateSortedRow is programmed. To avoid it omitting data, you need to provide the record count. A few users have asked the same question:
https://www.talendforge.org/forum/viewtopic.php?id=50094
https://www.talendforge.org/forum/viewtopic.php?id=54231
https://www.talendforge.org/forum/viewtopic.php?id=7641
which I found simply by using Google.
Anyway, if you need to do sorting and aggregating, consider using the components tAggregateRow and tSortRow separately. It should work fine.