Same data read by PCF instances for spring batch application - spring-batch

I am working on a Spring Batch application that reads data from a database using JdbcCursorItemReader. The application works as expected when I run a single instance.
I deployed it to PCF and enabled the autoscaling feature, but multiple instances now retrieve the same records from the database.
How can I prevent other instances from reading the same data?

This is normally handled by applying the processed indicator pattern. In this pattern, each row has an additional field that you update with a status as the record is processed. Your query then filters on that field so it returns only the records with the status you care about. In this case the status could be node-specific: each node tags the records it claims and then selects only the records carrying its own tag.
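As a rough illustration of the node-specific variant, here is a framework-free simulation in plain Java. It is only a sketch: the `Row` class, the `claimBatch` and `markDone` methods, the `items` table in the comments, and the status values are all hypothetical, and the `synchronized` method stands in for an atomic UPDATE that a real database would perform.

```java
import java.util.ArrayList;
import java.util.List;

// Framework-free simulation of the processed indicator pattern.
// Row, claimBatch, markDone and the status values are hypothetical names.
class ProcessedIndicatorSketch {

    static final class Row {
        final long id;
        String status = "NEW"; // NEW -> <nodeId> -> DONE
        Row(long id) { this.id = id; }
    }

    private final List<Row> table = new ArrayList<>();

    ProcessedIndicatorSketch(int rowCount) {
        for (long id = 1; id <= rowCount; id++) table.add(new Row(id));
    }

    // Stands in for an atomic statement against a hypothetical table, such as
    //   UPDATE items SET status = :nodeId WHERE status = 'NEW'  (limited to batchSize rows)
    // followed by
    //   SELECT ... FROM items WHERE status = :nodeId
    synchronized List<Long> claimBatch(String nodeId, int batchSize) {
        List<Long> claimed = new ArrayList<>();
        for (Row row : table) {
            if (claimed.size() == batchSize) break;
            if (row.status.equals("NEW")) {
                row.status = nodeId; // tag the row so no other node selects it
                claimed.add(row.id);
            }
        }
        return claimed;
    }

    // Marks a row as finished once the node has processed it.
    synchronized void markDone(long id) {
        for (Row row : table) if (row.id == id) row.status = "DONE";
    }

    public static void main(String[] args) {
        ProcessedIndicatorSketch db = new ProcessedIndicatorSketch(5);
        System.out.println(db.claimBatch("node-A", 3)); // [1, 2, 3]
        System.out.println(db.claimBatch("node-B", 3)); // [4, 5]
    }
}
```

The important property is that tagging happens before reading, so two instances never see the same untagged row.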

Related

Best way of persisting the processing status of each individual item

In my project I read data from a DB table using a StoredProcedure reader, call an API to process each record, and then save the output using a writer. I need to maintain a processing status of Processed or Error for each record that I read. At the moment the writer updates the STATUS column of the input table to P (Processed) or E (Error) and, in case of an error, adds the details to the LOGS column.
Can you please suggest whether this is an efficient way to maintain the processing status of each record? Does Spring Batch provide a default implementation for this?
Thanks
No, Spring Batch does not provide a "default implementation" for such a requirement.
That said, a flag on each item as you did is a reasonable way to address your requirement in my opinion.
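A minimal, framework-free sketch of such a per-item flag might look like the following. The `InputRecord` class, the `processAll` helper, and the lambda standing in for the API call are hypothetical; only the `P`/`E` values and the idea of a logs field come from the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical names throughout; a sketch of per-item status tracking.
class ItemStatusSketch {

    static final class InputRecord {
        final long id;
        final String payload;
        String status; // "P" = processed, "E" = error
        String logs;   // error details, null on success
        InputRecord(long id, String payload) { this.id = id; this.payload = payload; }
    }

    // Calls the (hypothetical) API for each record and tags the record with the
    // outcome, so one failing item does not hide the status of the others.
    static void processAll(List<InputRecord> records, Function<String, String> api) {
        for (InputRecord record : records) {
            try {
                api.apply(record.payload);
                record.status = "P";
            } catch (RuntimeException e) {
                record.status = "E";
                record.logs = e.getMessage();
            }
        }
    }

    public static void main(String[] args) {
        List<InputRecord> records = new ArrayList<>();
        records.add(new InputRecord(1, "ok"));
        records.add(new InputRecord(2, ""));
        processAll(records, payload -> {
            if (payload.isEmpty()) throw new IllegalArgumentException("empty payload");
            return payload.toUpperCase();
        });
        for (InputRecord r : records) {
            System.out.println(r.id + " " + r.status + " " + r.logs);
        }
    }
}
```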

Spring Batch: reading from a database and being aware of the previous processed id?

I'm trying to set up Spring Batch to move DB records from Oracle to Cassandra daily.
I know I can manually define JPA repository queries based on an additional entity table (like MyBatchProgress, where I store the previously completed Id plus a date or something like that), so that the next batch job knows which entity to start with.
My question is: does Spring Batch provide something like this built in (possibly by utilising Spring Data JPA)?
Or is this something that I have to write manually in the job's reader step, where I just pick up the last Id stored in my custom "progress" table?
Thanks in advance!
You can store the last ID in the execution context, which is persisted in the meta-data tables. With that in place, you can make the code that launches the job look for the last job execution, take the ID from its context and pass it as a job parameter to the next job instance.
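The flow described above can be sketched without the framework as follows. This is an assumption-laden illustration, not Spring Batch code: the `Map` stands in for the persisted execution context, the `lastId` argument stands in for the job parameter, and `runJob`, `lastIdFrom`, and the key name are hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Framework-free sketch: the Map stands in for the persisted execution context,
// and the lastId argument stands in for a job parameter. All names are hypothetical.
class LastIdSketch {

    static final String LAST_ID_KEY = "last.processed.id";

    // Processes every source id greater than lastId and returns the context
    // that the run would leave behind.
    static Map<String, Object> runJob(List<Long> sourceIds, long lastId) {
        long highest = lastId;
        for (long id : sourceIds) {
            if (id > lastId) {
                // ... copy the record to the target store here ...
                highest = Math.max(highest, id);
            }
        }
        Map<String, Object> executionContext = new HashMap<>();
        executionContext.put(LAST_ID_KEY, highest);
        return executionContext;
    }

    // Launcher side: read the last id from the previous run's context (0 if none).
    static long lastIdFrom(Map<String, Object> previousContext) {
        if (previousContext == null) return 0L;
        return (Long) previousContext.getOrDefault(LAST_ID_KEY, 0L);
    }

    public static void main(String[] args) {
        Map<String, Object> day1 = runJob(List.of(1L, 2L, 3L), lastIdFrom(null));
        Map<String, Object> day2 = runJob(List.of(1L, 2L, 3L, 4L, 5L), lastIdFrom(day1));
        System.out.println(day1.get(LAST_ID_KEY)); // 3
        System.out.println(day2.get(LAST_ID_KEY)); // 5
    }
}
```

The second run skips ids 1 to 3 because the launcher passed in the value saved by the first run.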

How to add a list of Steps to Job in spring batch

I'm extending an existing Job. What I need to do is update a list of records from the database with data obtained from an external service. I don't know how to do it in a loop, so I thought about creating a list of Steps, each consisting of a reader, a processor and a writer, and simply adding them via the next() method of the jobBuilder. Looking at the documentation, it's only possible to add one Step at a time, and I have several thousand rows in the database, which would mean several thousand Steps. How should I do this?
Edit: in short, I need to:
read a list of ids from the db,
for every id, call an external service to get the information relevant to that id,
process the returned data,
save the updated row to the db.
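For what it is worth, the four tasks listed above fit a single chunk-oriented loop, where one reader, processor and writer are applied to every id, rather than one Step per row. The sketch below shows that loop in plain Java under stated assumptions: `runStep`, the `Function` standing in for the external service, and the `Map` standing in for the database are all hypothetical, and no Spring Batch API is used.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Framework-free sketch of one chunk-oriented step; all names are hypothetical.
class ChunkLoopSketch {

    // Reads ids one at a time, fetches data for each id, and writes the results
    // chunk by chunk. Returns the number of chunks written.
    static int runStep(List<Long> ids, Function<Long, String> externalService,
                       Map<Long, String> database, int chunkSize) {
        int chunksWritten = 0;
        Map<Long, String> chunk = new LinkedHashMap<>();
        for (Long id : ids) {                              // reader
            chunk.put(id, externalService.apply(id));      // processor
            if (chunk.size() == chunkSize) {
                database.putAll(chunk);                    // writer
                chunk.clear();
                chunksWritten++;
            }
        }
        if (!chunk.isEmpty()) {                            // last, partially filled chunk
            database.putAll(chunk);
            chunksWritten++;
        }
        return chunksWritten;
    }

    public static void main(String[] args) {
        Map<Long, String> database = new LinkedHashMap<>();
        int chunks = runStep(List.of(1L, 2L, 3L, 4L, 5L), id -> "info-" + id, database, 2);
        System.out.println(chunks + " chunks, " + database.size() + " rows updated");
    }
}
```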

hazelcast spring-data write-through

I am using Spring Boot and Spring Data JPA with a Hazelcast client/server topology. In parts of my test application, I measure the time taken by CRUD operations on the client side (the server is the one interacting with a relational db). I configured the map (MapStore) to be write-behind by setting write-delay-seconds to 10.
Spring Data's save() returns the persisted entity. In the client app, therefore, the application flow is blocked until the server returns the persisted entity.
I would like to know if there is an alternative in which the client does NOT have to wait for the entity to be persisted. I was under the impression that once new data is stored in the Map, persisting to the backend happens asynchronously, so the client app would NOT have to wait.
Map config in hazelcast.xml:
<map name="com.foo.MyMap">
    <map-store enabled="true" initial-mode="EAGER">
        <class-name>com.foo.MyMapStore</class-name>
        <write-delay-seconds>10</write-delay-seconds>
    </map-store>
</map>
@NeilStevenson I don't find your response particularly helpful. I asked in an earlier post about where and how to generate the Map keys. You pointed me to the documentation, which fails to shed any light on this topic. The same goes for the Hazelcast (and other) examples.
The point of having the cache in the first place is to avoid hitting the database. When we add data (via save()), we also need to generate a unique key for the Map. This key also becomes the Entity.Id in the database table. Since, again, it's the Hazelcast client that generates these Ids, there is no need to wait for the record to be persisted in the backend.
The only reason to wait for save() to return the persisted object would be to catch any exceptions, not to obtain the ID.
That unfortunately is how it is meant to work; see https://docs.spring.io/spring-data/commons/docs/current/api/org/springframework/data/repository/CrudRepository.html#save-S-.
Potentially the external store mutates the saved entry in some way.
Although you know it won't do this, there isn't a variant of save() defined for that case.
So the answer seems to be that this is not currently available in the general-purpose Spring repository definition. Why not raise a feature request with the Spring Data team?

Service Fabric Actors - save state to database

I'm working on a sample Service Fabric project, where I have to maintain a shopping list. For this I have a ShoppingList actor, which is identifiable by a specific id. It stores the current list content in its state using StateManager. All works fine.
However, in parallel I'd like to maintain the shopping list content in a SQL database. In particular:
store all add/remove item requests for future analysis (ML)
on first actor initialization, load the list content from the db (e.g. after the cluster has been re-created)
What is the best approach to achieve that? Create a custom StateProvider (how? can't find examples)?
Or maybe have another service/actor for handling all db operations (possibly using queues and reminders)?
All examples seem to completely rely on default StateManager, with no data persistence to external storage, so I'm not sure what's the best practice.
The best way is to have a separate entity responsible for storing data in the DB. The actor just sends an event (not necessarily an SF event) with some data about the performed operation, and the other entity catches it and performs the rest of the work.
You can of course implement this in the actor itself, but that brings two possible issues:
The actor will not be able to process other requests if there are issues with the DB, with connectivity between the actor and the DB, or if the DB is under high load and processes requests slowly. The actor has to wait until the transfer to the DB completes successfully.
The DB may be overloaded by many single connections from many actors, instead of one or several connections from another entity that performs batch insertions.
So your final solution will depend on the workload of your system. But you will definitely need a reliable queue to safely store data in the DB if the data is too valuable to lose.
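The separation between the actor and the DB-writing entity can be sketched in a language-neutral way. The example below is written in Java, uses no Service Fabric API, and makes several assumptions: `publish`, `flush`, and `batchesWritten` are hypothetical names, the list stands in for the database, and the in-memory queue is not durable, so a real system would swap it for a reliable queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Language-neutral idea shown in Java; this is not Service Fabric API, and the
// in-memory queue is NOT durable (a real system would need a reliable queue).
class QueuedDbWriterSketch {

    private final BlockingQueue<String> events = new LinkedBlockingQueue<>();
    final List<List<String>> batchesWritten = new ArrayList<>(); // stands in for the DB

    // Called by the actor: returns immediately, without a DB round trip.
    void publish(String event) {
        events.add(event);
    }

    // Called by the separate writer entity, for example on a timer.
    // Drains up to maxBatchSize events and writes them as one batch.
    int flush(int maxBatchSize) {
        List<String> batch = new ArrayList<>();
        events.drainTo(batch, maxBatchSize);
        if (!batch.isEmpty()) batchesWritten.add(batch); // one batch insert
        return batch.size();
    }

    public static void main(String[] args) {
        QueuedDbWriterSketch writer = new QueuedDbWriterSketch();
        writer.publish("add:milk");
        writer.publish("add:eggs");
        writer.publish("remove:milk");
        System.out.println(writer.flush(2)); // 2
        System.out.println(writer.flush(2)); // 1
    }
}
```

With this split, a slow or unavailable DB delays only the writer, and the DB sees a few batch inserts instead of one connection per actor.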
Also, I think you could use the default state manager to hold logs and information about transactions until they have been transferred to the DB, and remove them from the service's state once the transaction completes. There is no need to store such data permanently in services.
Another thing to take into consideration is reading from the DB. If you have a relational database and all new records go into a single table, and a huge number of actors query that data on activation, you will probably see performance degradation, because the table will be locked for reading or writing unless you configure it to behave differently. So you will probably need a caching system to read data on actor activation, depending on your workload.
As for implementing your custom state manager: take a look at this example. Basically, all you need to do is implement the IReliableStateManagerReplica interface and pass it to the StatefulService constructor.