Spring Batch: how to skip the current step based on a precondition

I have a Spring Batch step whose reader query is complex and joins several tables.
The job runs every day, looking for records that were added to table A based on the last-updated date.
When no records were added, the query still takes a long time to return. I would like to check whether any records were added to table A, and only then run the full query.
Example: select count(recordID) from table A where last_update_date >
If count > 0, proceed with the step (reader, writer, etc.) joining the other tables.
If count = 0, skip the reader and writer, set the step status to COMPLETED, and proceed to the next step of the job.
Is this possible in Spring Batch? If yes, how can this be done?

Use a StoredProcedureItemReader.
Or use a JobExecutionDecider that performs the fast count query and then routes the flow either to the processing step or to job termination.
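A minimal sketch of the decider approach, assuming a JdbcTemplate is available; the tableA name and the lastUpdateDate job parameter are stand-ins for your actual table and date source:

import java.util.Date;

import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.job.flow.FlowExecutionStatus;
import org.springframework.batch.core.job.flow.JobExecutionDecider;
import org.springframework.jdbc.core.JdbcTemplate;

public class NewRecordsDecider implements JobExecutionDecider {

    private final JdbcTemplate jdbcTemplate;

    public NewRecordsDecider(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public FlowExecutionStatus decide(JobExecution jobExecution, StepExecution stepExecution) {
        // Fast existence check before the expensive multi-table join
        // (the cut-off date is assumed to arrive as a job parameter)
        Date since = jobExecution.getJobParameters().getDate("lastUpdateDate");
        Integer count = jdbcTemplate.queryForObject(
                "select count(recordID) from tableA where last_update_date > ?",
                Integer.class, since);
        return (count != null && count > 0)
                ? new FlowExecutionStatus("NEW_RECORDS")
                : new FlowExecutionStatus("NO_RECORDS");
    }
}

In the flow definition you then route on those statuses, sending NEW_RECORDS to the reader/writer step and NO_RECORDS straight to the next step, so the expensive step never starts when there is nothing to process.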

Spark Delta Table Updates

I am working in the Microsoft Azure Databricks environment, using Spark SQL and PySpark.
I have a delta table in a lake where the data is partitioned by, say, file_date. Every partition contains files storing millions of records per day, with no primary/unique key. All these records have a "status" column, which is either NULL (if everything looks good on that specific record) or not null (say, if a lookup mapping for a particular column was not found). Additionally, my process has another folder called "mapping", refreshed on a periodic basis (let's say nightly, to keep it simple), from which the mappings are read.
On a daily basis, there is a good chance that about 100-200 rows get errored out (status column containing not-null values). From these files, on a daily basis (hence the partitioning by file_date), a downstream job pulls all the valid records and sends them for further processing, ignoring those 100-200 errored records while waiting for the correct mapping file to be received. In addition to the valid-status records, the downstream job should also check whether a mapping is now found for the errored records and, if present, take those forward as well (after, of course, updating the data lake with the appropriate mapping and status).
What is the best way to go? Ideally, I would first update the delta table/lake with the correct mapping and set the status column to "available_for_reprocessing"; my downstream job would then pull the valid data for the day plus the "available_for_reprocessing" data and, after processing, update the status back to "processed". But this seems to be super difficult using Delta.
I was looking at https://docs.databricks.com/delta/delta-update.html, and the update example there only covers a simple update with constants, not updates from multiple tables.
The other, but most inefficient, option is to pull ALL the data (both processed and errored) for, say, the last 30 days, get the mapping for the errored records, and write the dataframe back into the delta lake using the replaceWhere option. This is super inefficient, as we would be reading everything (hundreds of millions of records) and writing everything back just to process, say, 1000 records at most. If you search for deltaTable = DeltaTable.forPath(spark, "/data/events/") at https://docs.databricks.com/delta/delta-update.html, the example provided is again for very simple updates. Without a unique key, it also seems impossible to update specific records. Can someone please help?
I can use PySpark or Spark SQL, but I am lost.
If you want to update one column ('status') for rows where it is currently marked incorrect, on the condition that all lookups now resolve for those rows, the UPDATE command along with EXISTS can solve this. It isn't mentioned in the update documentation, but EXISTS works for both delete and update operations, effectively allowing you to update/delete records based on joins.
For your scenario, the SQL command would look something like this:
UPDATE your_db.table_name AS a
SET status = 'correct'
WHERE EXISTS
(
    SELECT *
    FROM your_db.table_name AS b
    JOIN lookup_table_1 AS t1 ON t1.lookup_column_a = b.lookup_column_a
    JOIN lookup_table_2 AS t2 ON t2.lookup_column_b = b.lookup_column_b
    -- ... add further lookups if needed
    WHERE
        b.status = 'incorrect' AND
        a.lookup_column_a = b.lookup_column_a AND
        a.lookup_column_b = b.lookup_column_b
)
Merge did the trick...
MERGE INTO deptdelta AS maindept
USING updated_dept_location AS upddept
ON upddept.dno = maindept.dno
WHEN MATCHED THEN
  UPDATE SET
    maindept.dname = upddept.updated_name,
    maindept.location = upddept.updated_location
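For reference, the same merge can also be expressed through the programmatic DeltaTable API instead of SQL. A sketch in Java (the PySpark builder is analogous), assuming the table and column names from the statement above and a delta-core dependency on the classpath:

import java.util.HashMap;
import java.util.Map;

import io.delta.tables.DeltaTable;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DeptMerge {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("dept-merge").getOrCreate();

        // Source of the updated names/locations
        Dataset<Row> updates = spark.table("updated_dept_location");

        // Column assignments applied to matched rows
        Map<String, String> set = new HashMap<>();
        set.put("dname", "upddept.updated_name");
        set.put("location", "upddept.updated_location");

        DeltaTable.forName(spark, "deptdelta").as("maindept")
                .merge(updates.as("upddept"), "upddept.dno = maindept.dno")
                .whenMatched()
                .updateExpr(set)
                .execute();
    }
}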

Pentaho Data Integration - Assure that one step will be run before another

I have a transformation in Pentaho Data Integration that stores data into several tables of a database.
But this database has constraints, meaning I can't insert rows into one table before the related rows exist in another table.
Sometimes it works, sometimes it doesn't, depending on concurrency luck.
So I need to assure that Table Output 1 gets entirely run before Table Output 2 starts.
How can I do this?
You can use a step named "Block this step until steps finish".
You place it before the step that needs to wait, and inside the block you define which steps it should wait for.
Suppose Table Output 2 contains a foreign key to a field in table 1, but the rows you're going to reference in table 2 don't exist yet in table 1. This means Table Output 2 needs to wait until Table Output finishes.
Place the "block" step in the stream directly before Table Output 2.
Then open the properties of the "block" step and add Table Output to the list (along with any other steps you want to wait for).
Alternatively, you can use a job instead of a single transformation, because within a transformation all steps run in parallel. In the job, add a first transformation in which Table Output 1 is executed, followed by a second transformation in which Table Output 2 is executed.

How do I perform batch upsert in Spring JDBC?

I have a list of records, and I want to perform the following tasks using Spring's JdbcTemplate:
(1) Update existing records.
(2) Insert new records.
I don't know how to do this using Spring's jdbcTemplate.
Any insight?
You just use one of the various forms of batchUpdate for the update. Then you check the returned array, which contains 1 where the row was present and 0 otherwise. For the latter, you perform another batchUpdate with the insert statements.
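A sketch of that pattern; the records table, its columns, and the Record value object are made up for illustration:

import java.util.ArrayList;
import java.util.List;

import org.springframework.jdbc.core.JdbcTemplate;

public class RecordUpsertDao {

    private final JdbcTemplate jdbcTemplate;

    public RecordUpsertDao(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    public void upsert(List<Record> records) {
        // 1) Try to update every record in one batch
        List<Object[]> updateArgs = new ArrayList<>();
        for (Record r : records) {
            updateArgs.add(new Object[] { r.getName(), r.getId() });
        }
        int[] updateCounts = jdbcTemplate.batchUpdate(
                "update records set name = ? where id = ?", updateArgs);

        // 2) Rows reporting 0 affected rows did not exist: insert those
        List<Object[]> insertArgs = new ArrayList<>();
        for (int i = 0; i < updateCounts.length; i++) {
            if (updateCounts[i] == 0) {
                Record r = records.get(i);
                insertArgs.add(new Object[] { r.getId(), r.getName() });
            }
        }
        if (!insertArgs.isEmpty()) {
            jdbcTemplate.batchUpdate("insert into records (id, name) values (?, ?)", insertArgs);
        }
    }

    // Hypothetical value object for this sketch
    public static class Record {
        private final long id;
        private final String name;
        public Record(long id, String name) { this.id = id; this.name = name; }
        public long getId() { return id; }
        public String getName() { return name; }
    }
}

One caveat: some JDBC drivers return Statement.SUCCESS_NO_INFO (-2) instead of real per-row counts when batching; in that case the zero-check doesn't work, and a database-native upsert (MERGE, INSERT ... ON DUPLICATE KEY UPDATE, or similar) is the safer route.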

How to execute an update after each item writing in Spring batch?

I am doing a database read and a database write as a Spring Batch task. It's running fine, and the after-job method also executes fine. But my requirement is that after each insert of an entry, I need to update a flag in the source database. How can we achieve this?
Consider using a CompositeItemWriter that has two delegate writers (a sketch follows below):
Delegate writer 1 performs the insert into the target database.
Delegate writer 2 updates the status in the source database.
If you really need to commit after each insert, you will need to set the step's commit-interval to 1. Remember that a commit interval of 1 means very low performance, so unless there is a compelling reason, do not set it to 1.
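A minimal Java-config sketch of that composite; the item class, table names, and SQL are illustrative, and the two DataSources are assumed to be distinct beans for the target and source databases:

import java.util.Arrays;

import javax.sql.DataSource;

import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
import org.springframework.batch.item.support.CompositeItemWriter;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class WriterConfig {

    @Bean
    public CompositeItemWriter<MyItem> compositeWriter(DataSource targetDs, DataSource sourceDs) {
        // Delegate 1: insert into the target database
        JdbcBatchItemWriter<MyItem> insertWriter = new JdbcBatchItemWriterBuilder<MyItem>()
                .dataSource(targetDs)
                .sql("insert into target_table (id, payload) values (:id, :payload)")
                .beanMapped()
                .build();

        // Delegate 2: flag the row as processed in the source database
        JdbcBatchItemWriter<MyItem> flagWriter = new JdbcBatchItemWriterBuilder<MyItem>()
                .dataSource(sourceDs)
                .sql("update source_table set processed_flag = 'Y' where id = :id")
                .beanMapped()
                .build();

        CompositeItemWriter<MyItem> composite = new CompositeItemWriter<>();
        // Delegates run in order for every chunk: insert first, then flag
        composite.setDelegates(Arrays.asList(insertWriter, flagWriter));
        return composite;
    }

    // Hypothetical item with bean properties matching the named SQL parameters
    public static class MyItem {
        private long id;
        private String payload;
        public long getId() { return id; }
        public void setId(long id) { this.id = id; }
        public String getPayload() { return payload; }
        public void setPayload(String payload) { this.payload = payload; }
    }
}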
If the inserted data contains something that identifies the insert (insert date, status flag, etc.), you could run a simple TaskletStep that executes an update statement like:
update ....
set flag = flag.value
where insert.date = ....
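A sketch of such a tasklet with the update executed through a JdbcTemplate; the table, column, and run.date job parameter are hypothetical:

import java.util.Date;

import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;
import org.springframework.jdbc.core.JdbcTemplate;

public class FlagUpdateTasklet implements Tasklet {

    private final JdbcTemplate jdbcTemplate;

    public FlagUpdateTasklet(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) {
        // Hypothetical criterion: rows inserted during this run carry the run date
        Date runDate = chunkContext.getStepContext().getStepExecution()
                .getJobParameters().getDate("run.date");
        int updated = jdbcTemplate.update(
                "update source_table set flag = 'PROCESSED' where insert_date = ?", runDate);
        contribution.incrementWriteCount(updated);
        return RepeatStatus.FINISHED;
    }
}

This runs once, after the whole read/write step, rather than after each item, which is usually a better trade-off than a commit interval of 1.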

Spring batch running steps in parallel

Step 1 - Tasklet: I am reading a db table Employee that has EmpId, EmpName, EmpAddress; it returns 8 rows (this can change). I have set the list in the jobExecutionContext.
Starting step 2:
i) For the returned employee list, I need to execute a query, passing the empId, for every record present in the list.
ii) The result of that query is written to a file.
iii) Copy that file to a different server.
iv) Send an email with stats about the file.
Steps i) to iv) need to be performed for every employee returned by the first query, in parallel; e.g., if there are 8 employees, then steps i) to iv) should be executed in parallel for all 8 empIds.
Can someone please guide me on this? What type of XML step configuration should be used?
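A common Spring Batch approach for this kind of per-record fan-out is a partitioned step: a Partitioner creates one ExecutionContext per empId, and a task executor runs the worker step (query, write file, copy, email) once per partition in parallel. A minimal sketch of such a partitioner, offered here as an assumption-laden starting point with all names hypothetical:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.springframework.batch.core.partition.support.Partitioner;
import org.springframework.batch.item.ExecutionContext;

// Creates one partition, and hence one parallel worker-step execution, per employee id
public class EmployeePartitioner implements Partitioner {

    private final List<Long> empIds; // e.g. the list that step 1 stored in the jobExecutionContext

    public EmployeePartitioner(List<Long> empIds) {
        this.empIds = empIds;
    }

    @Override
    public Map<String, ExecutionContext> partition(int gridSize) {
        Map<String, ExecutionContext> partitions = new HashMap<>();
        for (Long empId : empIds) {
            ExecutionContext ctx = new ExecutionContext();
            // The worker step can read this via #{stepExecutionContext['empId']}
            ctx.putLong("empId", empId);
            partitions.put("employee-" + empId, ctx);
        }
        return partitions;
    }
}

In XML, the master step then wraps the worker step in a partition element whose handler references a task-executor (with grid-size controlling concurrency); the worker step itself can be a flow that performs i) to iv) for its single empId.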