I have the following requirement. I am generating a unique id in an ItemProcessor and writing it to the database using a JdbcItemWriter.
I want to pass this unique id as a query parameter to the next JdbcItemReader, so that I can select all the records from the database based on this unique id.
Currently I am using max(uniqueid) from the database. I have tried using #{jobParameters['uniqueid']} but it didn't work.
Please let me know how to pass a value from an ItemProcessor to a database ItemReader.
I think using the step execution context might work for you here. There is an option to set some transient data on the step execution context and have it be available to other components in the same step.
There is a previous answer here that elaborates a bit more on this, and a quick Google search on "spring batch step execution context" also turns up quite a few questions and answers on the subject.
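For illustration, here is a minimal sketch of that pattern. Since your reader runs after the writer, I'm assuming a two-step job: the processor of step 1 stashes the id in the step execution context, an ExecutionContextPromotionListener (registered on step 1) promotes it to the job execution context, and a step-scoped reader in step 2 binds it late. All names here (InputRecord, uniqueId, the table and columns) are placeholders, not from your code:

    import javax.sql.DataSource;
    import java.util.UUID;
    import org.springframework.batch.core.StepExecution;
    import org.springframework.batch.core.annotation.BeforeStep;
    import org.springframework.batch.core.configuration.annotation.StepScope;
    import org.springframework.batch.core.listener.ExecutionContextPromotionListener;
    import org.springframework.batch.item.ItemProcessor;
    import org.springframework.batch.item.database.JdbcCursorItemReader;
    import org.springframework.batch.item.database.builder.JdbcCursorItemReaderBuilder;
    import org.springframework.beans.factory.annotation.Value;
    import org.springframework.context.annotation.Bean;
    import org.springframework.jdbc.core.BeanPropertyRowMapper;

    // Step 1 processor: generates the id and stores it in the step execution
    // context. InputRecord is a hypothetical domain class.
    public class IdGeneratingProcessor implements ItemProcessor<InputRecord, InputRecord> {

        private StepExecution stepExecution;

        @BeforeStep
        public void saveStepExecution(StepExecution stepExecution) {
            this.stepExecution = stepExecution;
        }

        @Override
        public InputRecord process(InputRecord item) {
            String uniqueId = UUID.randomUUID().toString(); // or however you generate it
            item.setUniqueId(uniqueId);
            stepExecution.getExecutionContext().putString("uniqueId", uniqueId);
            return item;
        }
    }

    // Register this listener on step 1: it copies the key from the step
    // execution context to the job execution context when the step ends.
    @Bean
    public ExecutionContextPromotionListener promotionListener() {
        ExecutionContextPromotionListener listener = new ExecutionContextPromotionListener();
        listener.setKeys(new String[] {"uniqueId"});
        return listener;
    }

    // Step 2 reader: step-scoped so the value is bound when the step starts.
    @Bean
    @StepScope
    public JdbcCursorItemReader<InputRecord> recordsByIdReader(
            DataSource dataSource,
            @Value("#{jobExecutionContext['uniqueId']}") String uniqueId) {
        return new JdbcCursorItemReaderBuilder<InputRecord>()
                .name("recordsByIdReader")
                .dataSource(dataSource)
                .sql("SELECT * FROM records WHERE unique_id = ?")
                .preparedStatementSetter(ps -> ps.setString(1, uniqueId))
                .rowMapper(new BeanPropertyRowMapper<>(InputRecord.class))
                .build();
    }

Note that in a chunk-oriented step each process() call overwrites the key, so this sketch assumes a single id per step (or that the last one wins); if reader and processor are in the same step, #{stepExecutionContext['uniqueId']} works without the promotion listener.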
In my project I am reading data from a DB table using a stored procedure reader, calling an API to process it, and then saving the output using a writer. I need to maintain the processing status, Processed or Error, for each record that I read. As of now I am using the writer to update the input table's STATUS column to P (Processed) or E (Error), and to add logs to a LOGS column in case of any error.
Can you please suggest whether this is an efficient way to maintain the processing status of each record? Does Spring Batch provide any default implementation for the same?
Thanks
No, Spring Batch does not provide a "default implementation" for such a requirement.
That said, a flag on each item, as you did, is a reasonable way to address your requirement in my opinion.
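If it helps, here is a minimal sketch of such a writer using a JdbcBatchItemWriter. ProcessedRecord is a hypothetical bean carrying the flag and log message your processor sets; the table and column names are assumptions based on your description:

    import javax.sql.DataSource;
    import org.springframework.batch.item.database.JdbcBatchItemWriter;
    import org.springframework.batch.item.database.builder.JdbcBatchItemWriterBuilder;
    import org.springframework.context.annotation.Bean;

    // ProcessedRecord is assumed to expose id, status ("P"/"E") and logs properties.
    @Bean
    public JdbcBatchItemWriter<ProcessedRecord> statusWriter(DataSource dataSource) {
        return new JdbcBatchItemWriterBuilder<ProcessedRecord>()
                .dataSource(dataSource)
                .sql("UPDATE input_table SET status = :status, logs = :logs WHERE id = :id")
                .beanMapped() // binds :status, :logs and :id to the bean's properties
                .build();
    }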
I am configuring a new job where I need to read data from the database, and in the processor the data will be used to call a REST endpoint with a payload. In the payload, along with the dynamic data, I need to pass reference data that is constant for each record processed in the job. This reference data is stored in the DB. I am thinking of implementing one of the following approaches:
In the beforeJob listener method, make a DB call, populate the reference data object, and use it for the whole job run.
In the processor, make a DB call to get the reference data and cache the result, so there will be no DB call to fetch the same data for each record.
Please suggest whether these approaches are correct or if there is a better way to implement this in Spring Batch.
For performance reasons, I would not recommend doing a DB call in the item processor, unless that is really a requirement.
The first approach seems reasonable to me, since the reference data is constant. You can populate/clear a cache with a JobExecutionListener and use the cache in your chunk-oriented step. Please refer to the following thread for more details and a complete sample: Spring Batch With Annotation and Caching.
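As a rough illustration of that first approach, a listener along these lines could load the reference data once per job run. ReferenceData, ReferenceDataDao and the map-based cache are assumptions for the sketch, not part of your setup:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import org.springframework.batch.core.JobExecution;
    import org.springframework.batch.core.JobExecutionListener;

    // Fills a simple in-memory cache before the job and clears it afterwards.
    public class ReferenceDataListener implements JobExecutionListener {

        private final Map<String, ReferenceData> cache = new ConcurrentHashMap<>();
        private final ReferenceDataDao dao; // hypothetical DAO for the reference table

        public ReferenceDataListener(ReferenceDataDao dao) {
            this.dao = dao;
        }

        @Override
        public void beforeJob(JobExecution jobExecution) {
            // single DB call per job run; the processor only reads from the cache
            dao.findAll().forEach(ref -> cache.put(ref.getKey(), ref));
        }

        @Override
        public void afterJob(JobExecution jobExecution) {
            cache.clear(); // free the reference data once the job finishes
        }

        public ReferenceData get(String key) {
            return cache.get(key);
        }
    }

The same listener (or the map itself) can then be injected into your item processor so that building the payload never touches the DB.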
I'm trying to set up Spring Batch to move DB records from Oracle to Cassandra daily.
I know I can manually define JPA repository queries based on an additional entity table (like MyBatchProgress, where I store the previously completed id + date or something like that), so that the next batch job knows which entity to start with for further operations.
My question is: does Spring Batch provide something like this built in (also by utilising Spring Data JPA)?
Or is this something that I have to write manually in the job's reader step, where I just pick up the last id stored in my custom "progress" table?
Thanks in advance!
You can store the last ID in the execution context, which is persisted in the meta-data tables. With that in place, you can make the code that launches the job look for the last job execution, take the ID from its context and pass it as a job parameter to the next job instance.
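A rough sketch of that launching code, assuming your step stored the last processed id in the job execution context under a key like "lastId" (the job name, key and parameter names here are placeholders):

    import java.util.List;
    import org.springframework.batch.core.Job;
    import org.springframework.batch.core.JobExecution;
    import org.springframework.batch.core.JobInstance;
    import org.springframework.batch.core.JobParameters;
    import org.springframework.batch.core.JobParametersBuilder;
    import org.springframework.batch.core.explore.JobExplorer;
    import org.springframework.batch.core.launch.JobLauncher;

    public class DailyMigrationLauncher {

        private final JobLauncher jobLauncher;
        private final JobExplorer jobExplorer;
        private final Job migrationJob;

        public DailyMigrationLauncher(JobLauncher jobLauncher, JobExplorer jobExplorer, Job migrationJob) {
            this.jobLauncher = jobLauncher;
            this.jobExplorer = jobExplorer;
            this.migrationJob = migrationJob;
        }

        public void launch() throws Exception {
            long lastId = 0L; // default for the very first run
            List<JobInstance> instances = jobExplorer.getJobInstances("migrationJob", 0, 1);
            if (!instances.isEmpty()) {
                // executions are typically returned most recent first
                List<JobExecution> executions = jobExplorer.getJobExecutions(instances.get(0));
                if (!executions.isEmpty()) {
                    lastId = executions.get(0).getExecutionContext().getLong("lastId", 0L);
                }
            }
            JobParameters params = new JobParametersBuilder()
                    .addLong("startId", lastId)
                    .addLong("run.time", System.currentTimeMillis()) // forces a new job instance daily
                    .toJobParameters();
            jobLauncher.run(migrationJob, params);
        }
    }

The step itself would put the id into the context with something like stepExecution.getJobExecution().getExecutionContext().putLong("lastId", id), e.g. from a StepExecutionListener.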
When implementing a system which creates tasks that need to be resolved by some workers, my idea would be to create a table which would have some task definition along with a status, e.g. for document review we'd have something like reviewId, documentId, reviewerId, reviewTime.
When documents are uploaded to the system we'd just store the documentId along with a generated reviewId and leave the reviewerId and reviewTime empty. When the next reviewer comes along and starts the review, we'd just set his id and the current time to mark the job as "in progress" (I deliberately skip the case where the reviewer takes a long time, or dies during the review).
When implementing such a use case in e.g. PostgreSQL we could use UPDATE review SET reviewerId = :reviewerId, reviewTime = :reviewTime WHERE reviewId = (SELECT reviewId FROM review WHERE reviewerId IS NULL AND reviewTime IS NULL LIMIT 1 FOR UPDATE SKIP LOCKED) RETURNING reviewId, documentId, reviewerId, reviewTime (so basically update the first non-taken row, using SKIP LOCKED to skip any rows already being processed).
But when moving from a native solution to JDBC and beyond, I'm having trouble implementing this:
Spring Data JPA and Spring Data JDBC don't allow a @Modifying query to return anything other than void/boolean/int, which forces us to perform two queries in a single transaction - one to select the first pending row, and a second one to update it
one alternative would be to use a stored procedure, but I really hate the idea of storing such logic so far away from the code
another alternative would be to use a persistent queue and skip the database altogether, but this introduces additional infrastructure components that need to be maintained and learned. Any suggestions are welcome though.
Am I missing something? Is it possible to have it all, or do we have to settle for multiple queries or stored procedures?
Why doesn't Spring Data support returning an entity from modifying queries?
Because it seems like a rather special thing to do and Spring Data JDBC tries to focus on the essential stuff.
Is it possible to have it all or do we have to settle for multiple queries or stored procedures?
It is certainly possible to do this.
You can implement a custom method using an injected JdbcTemplate.
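For example, a custom repository fragment along these lines would keep the single-round-trip UPDATE ... RETURNING from your question. Review, ClaimReviewRepository and claimNext are hypothetical names for the sketch:

    import java.util.List;
    import java.util.Optional;
    import org.springframework.jdbc.core.JdbcTemplate;

    public interface ClaimReviewRepository {
        Optional<Review> claimNext(long reviewerId);
    }

    // One round trip: claims the next free review row and maps the RETURNING
    // clause back to an object. Review is assumed to be a simple value class.
    public class ClaimReviewRepositoryImpl implements ClaimReviewRepository {

        private final JdbcTemplate jdbcTemplate;

        public ClaimReviewRepositoryImpl(JdbcTemplate jdbcTemplate) {
            this.jdbcTemplate = jdbcTemplate;
        }

        @Override
        public Optional<Review> claimNext(long reviewerId) {
            List<Review> claimed = jdbcTemplate.query(
                    "UPDATE review SET reviewerId = ?, reviewTime = now() "
                    + "WHERE reviewId = (SELECT reviewId FROM review "
                    + "  WHERE reviewerId IS NULL AND reviewTime IS NULL "
                    + "  LIMIT 1 FOR UPDATE SKIP LOCKED) "
                    + "RETURNING reviewId, documentId, reviewerId, reviewTime",
                    (rs, rowNum) -> new Review(
                            rs.getLong("reviewId"),
                            rs.getLong("documentId"),
                            rs.getLong("reviewerId"),
                            rs.getTimestamp("reviewTime").toInstant()),
                    reviewerId);
            // empty list means no free row was available; the UPDATE matched nothing
            return claimed.stream().findFirst();
        }
    }

If your main repository interface extends ClaimReviewRepository, Spring Data picks the fragment up automatically via the Impl naming convention, so claimNext sits next to your derived query methods.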
I've just started using the luigi library. I am regularly scraping a website and inserting any new records into a Postgres database. As I'm trying to rewrite parts of my scripts to use luigi, it's not clear to me how the "marker table" is supposed to be used.
Workflow:
Scrape data
Query DB to check if new data differs from old data.
If so, store the new data in the same table.
However, using luigi's postgres.CopyToTable, if the table already exists, no new data will be inserted. I guess I should be using the inserted column in the table_updates table to figure out what new data should be inserted, but it's unclear to me what that process looks like and I can't find any clear examples online.
You don't have to worry about the marker table much: it's an internal table luigi uses to track which tasks have already been successfully executed. In order to do so, luigi uses the update_id property of your task. If you didn't declare one, then luigi will use the task_id, as shown here. That task_id is a concatenation of the task family name and the first three parameters of your task.
The key here is to override the update_id property of your task and return a custom string that you know will be unique for each run of your task. Usually you should use the significant parameters of your task, something like:
    @property
    def update_id(self):
        # join the significant parameters into an identifier unique to this run
        return ":".join([str(self.param1), str(self.param2), str(self.param3)])
By significant I mean parameters that change the output of your task. I imagine parameters like the website url or id, and the scraping date. Parameters like the hostname, port, username or password of your database will be the same for any of these tasks, so they shouldn't be considered significant.
Notice that without details about your tables and the data you're trying to save, it's pretty hard to say how you should build that update_id string, so please be careful.