Multithreaded processing of records in Spring Batch processors - spring-batch

Is it possible to pass a list of records from a Spring Batch reader to the processor, as below, or not?
public class MyProcessor implements ItemProcessor<List<argClass>, List<returnClass>> {
    public List<returnClass> process(List<argClass> items) {
        // implementation
    }
}
All the examples I could find pass a single record (not a list of objects) from the reader to the processor, whether the reader is a SingleItemPeekableItemReader or a JdbcCursorItemReader.
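Spring Batch always hands the processor whatever single object the reader's read() returns, so for the processor to receive a List the reader itself has to return a List. A minimal sketch of a delegating reader that does this (the class name and grouping logic are illustrative, not from the original post):
import java.util.ArrayList;
import java.util.List;

import org.springframework.batch.item.ItemReader;

// Illustrative wrapper: groups items from a delegate reader into fixed-size lists
// so that an ItemProcessor<List<T>, List<R>> can receive them.
public class AggregatingItemReader<T> implements ItemReader<List<T>> {

    private final ItemReader<T> delegate;
    private final int groupSize;

    public AggregatingItemReader(ItemReader<T> delegate, int groupSize) {
        this.delegate = delegate;
        this.groupSize = groupSize;
    }

    @Override
    public List<T> read() throws Exception {
        List<T> group = new ArrayList<>();
        T item;
        while (group.size() < groupSize && (item = delegate.read()) != null) {
            group.add(item);
        }
        // Returning null tells Spring Batch that the input is exhausted.
        return group.isEmpty() ? null : group;
    }
}
With this in place the chunk-oriented step treats each List<T> as one item, so the processor and writer generics become list-based exactly as in the question.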

Related

Nested query in Spring Batch processing

I want to create an ETL process using Spring Batch. The steps will read from one or more DBs and insert into one DB, so basically I'm collecting similar information from different DBs and inserting it into a single DB. I have a large, complex query that I need to run on those DBs, and the result will be inserted into that single DB for later processing. My main concern is how to reference this query in the JpaPagingItemReader, for example. Is there a way I can add this query to my project as a .sql file and then reference it in the reader?
Or any other solution I can follow?
Thank you
Is there a way I can, for example, add this query to my project as a .sql file and then reference it in the reader? Or any other solution I can follow?
You can put your query in a properties file and inject it into your reader, something like:
@Configuration
@EnableBatchProcessing
@PropertySource("classpath:application.properties")
public class MyJob {

    @Bean
    public JpaPagingItemReader itemReader(@Value("${query}") String query) {
        return new JpaPagingItemReaderBuilder<>()
                .queryString(query)
                // set other reader properties
                .build();
    }

    // ...
}
In this example, you should have a property query=your sql query in application.properties. This is actually the regular Spring property injection mechanism, nothing Spring Batch specific here.
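For completeness, the corresponding entry in application.properties could look like this (the JPQL string is just a placeholder, not from the original answer):
# application.properties -- replace the query with your own
query=select c from Customer c where c.active = true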

ItemReader for records returned by CrudRepository

I have a Spring Batch application in which the reader reads from an external DB, the processor transforms the records into POJOs for my destination DB, and the writer writes the transformed POJOs to the destination DB.
I am using the following CrudRepository:
public interface MyCrudRepository extends CrudRepository<MyDbEntity, String> {
    List<MyDbEntity> findByPIdBetween(String from, String to);
    List<MyDbEntity> findByPIdGreaterThan(String from);
}
I wanted to know how the ItemReader for the above would look.
Should I call myCrudRepository.findByPIdBetween(String from, String to) in a @PostConstruct method of my ItemReader?
Wouldn't that make the ItemReader static? Each job run would have different method parameters for findByPIdBetween.
How should the ItemReader be structured for the above problem?
I wanted to know how the ItemReader for the above would look.
RepositoryItemReader is what you need. You can use it with your repository and specify the method to use to read items. You can find an example here.
each job run would have different method parameters for findByPIdBetween
You can pass those as parameters to your job and use them in your reader.
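A minimal sketch of such a reader, assuming the from/to values are passed as job parameters named 'from' and 'to', that MyCrudRepository is changed to extend PagingAndSortingRepository, and that findByPIdBetween takes an extra Pageable parameter and returns a Page (RepositoryItemReader pages through the results, so it needs all of these):
@Bean
@StepScope
public RepositoryItemReader<MyDbEntity> myDbEntityReader(
        MyCrudRepository repository,
        @Value("#{jobParameters['from']}") String from,
        @Value("#{jobParameters['to']}") String to) {
    return new RepositoryItemReaderBuilder<MyDbEntity>()
            .name("myDbEntityReader")
            .repository(repository)
            .methodName("findByPIdBetween")                              // repository method to page through
            .arguments(Arrays.asList(from, to))                          // job parameters become method arguments
            .sorts(Collections.singletonMap("pId", Sort.Direction.ASC))  // a sort is required for stable paging
            .pageSize(100)
            .build();
}
Because the bean is step-scoped, the SpEL expressions are resolved against the JobParameters of the current execution, so each run can read a different id range.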

Bulk indexing JPA entities modified during a Spring transaction to an Elasticsearch index

I have a JPA entity class that is also an Elasticsearch document. The environment is a Spring Boot application using Spring Data JPA and Spring Data Elasticsearch.
@Entity
@Document(indexName = "...") // etc.
@EntityListeners(MyJpaEntityListener.class)
public class MyEntity {
    // ID, constructor and stuff following here
}
When an instance of this entity gets created, updated or deleted, it gets reindexed to Elasticsearch. This is currently achieved with a JPA EntityListener that reacts to PostPersist, PostUpdate and PostRemove events.
public class MyJpaEntityListener {

    @PostPersist
    @PostUpdate
    public void postPersistOrUpdate(MyEntity entity) {
        // Elasticsearch indexing code goes here
    }

    @PostRemove
    public void postRemove(MyEntity entity) {
        // Elasticsearch removal code goes here
    }
}
That's all working fine at the moment when one or a few entities get modified during a single transaction; each modification triggers a separate index operation. But if a lot of entities get modified inside a transaction, it gets slow.
I would like to bulk index all entities that were modified at the end of a transaction (or after commit). I took a look at TransactionalEventListener, AOP and TransactionSynchronizationManager but wasn't able to come up with a good setup so far.
How can I collect all modified entities per transaction in an elegant way, without doing it by hand in every service method myself?
And how can I trigger a bulk index at the end of a transaction with the entities collected in that transaction?
Thanks for your time and help!
One different and, in my opinion, elegant approach, since you don't mix Elasticsearch-related code into your services and entities, is to use Spring aspects with @AfterReturning on the transactional methods of the service layer.
The pointcut expression can be adjusted to catch all the service methods you want.
@Order(1) guarantees that this code will run after the transaction commit.
The code below is just a sample; you have to adapt it to work with your project.
@Aspect
@Component
@Order(1)
public class StoreDataToElasticAspect {

    @Autowired
    private SampleElasticsearhRepository elasticsearchRepository;

    @AfterReturning(pointcut = "execution(* com.example.DatabaseService.bulkInsert(..))")
    public void synonymsInserted(JoinPoint joinPoint) {
        Object[] args = joinPoint.getArgs();
        // create elasticsearch documents from method params;
        // can also inject database services if more information is needed for the documents
        List<String> ids = (List) args[0];
        // create batch from ids
        elasticsearchRepository.save(batch);
    }
}
And here is an example with a logging aspect.
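To make the sample a bit more concrete, the "create batch from ids" step could be filled in roughly as below. This is only a sketch: SampleDocument, sampleJpaRepository and toDocument() are hypothetical names, not from the original answer, and saveAll() is used so Spring Data Elasticsearch can write the whole list in one bulk request rather than one request per save():
// Hypothetical completion of the "create batch from ids" step; all names below are placeholders.
private void bulkIndex(List<String> ids) {
    List<SampleDocument> batch = sampleJpaRepository.findAllById(ids)  // JpaRepository#findAllById returns a List
            .stream()
            .map(this::toDocument)                                     // map each JPA entity to its Elasticsearch document
            .collect(Collectors.toList());
    elasticsearchRepository.saveAll(batch);                            // single bulk call instead of per-document save()
}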

Dynamically create Job in Spring Batch

Is it possible to create a Spring Batch Job dynamically, not as a bean?
I have created a lot of readers, writers, processors and other tasklets, and I would like to be able to build a Job at runtime from these parts.
I have some job description files in my own XML-based format, saved in some directory. These job descriptions can contain dynamic information about the job, for example which reader and writer to choose for it.
When the program starts, these files are parsed and the corresponding Jobs must be created.
I am thinking of implementing it like this:
@Autowired
private JobBuilderFactory jobBuilderFactory;

@Autowired
private StepBuilderFactory stepBuilderFactory;

@Autowired
private ApplicationContext context;

public Job createJob(MyXmlJobConfig jobConfig) {
    // My predefined steps in the context
    Step initStep = context.getBean("InitStep", Step.class);
    Step step1 = context.getBean("MyFirstStep", Step.class);
    Step step2 = context.getBean("MySecondStep", Step.class);
    //......

    // Mix these steps to build the job
    JobBuilder jobBuilder = jobBuilderFactory.get("myJob");
    SimpleJobBuilder simpleJobBuilder = jobBuilder.start(initStep);

    // Any logic of mixing and choosing steps
    if (jobConfig.somePredicate()) {
        simpleJobBuilder.next(step1);
    } else {
        simpleJobBuilder.next(step2);
    }
    //.........
    //.......

    return simpleJobBuilder.build();
}
Usage example:
JobLauncher jobLauncher = context.getBean(JobLauncher.class);
MyXmlJobConfig config = getConfigFromFile(); // Loading config from file
MyCustomJobBuilder myCustomJobBuilder = context.getBean(MyCustomJobBuilder.class);
Job createdJob = myCustomJobBuilder.createJob(config);
jobLauncher.run(createdJob, new JobParameters());
Is this approach to job building correct? Note that createdJob is not a bean. Won't it break anything in Spring Batch behind the scenes?
Spring Batch uses the Spring DI container and related facilities quite extensively. Proxying beans that are job or step scoped is just one example. The whole parsing of an XML based definition results in BeanDefinitions. Can you build a Spring Batch job without making it a bean? Sure. Would I recommend it? No.
Do keep in mind that there are ways of dynamically creating child ApplicationContext instances that you can have a job in. Spring Batch Admin and Spring XD both took advantage of this feature to dynamically create instances of Spring Batch jobs. I'd recommend this approach over having the job not part of an ApplicationContext in the first place.
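A minimal sketch of that child-context approach, assuming a hypothetical DynamicJobConfiguration class that turns one parsed XML descriptor into job and step beans (the names here are illustrative):
// Build a child context per dynamic job; the parent context supplies the shared
// Spring Batch infrastructure (DataSource, JobRepository, JobLauncher, ...).
AnnotationConfigApplicationContext childContext = new AnnotationConfigApplicationContext();
childContext.setParent(parentContext);                  // parentContext: your existing ApplicationContext
childContext.register(DynamicJobConfiguration.class);   // hypothetical @Configuration built from one XML descriptor
childContext.refresh();

Job job = childContext.getBean(Job.class);               // the job is still a real bean, just in a child context
jobLauncher.run(job, new JobParameters());
This keeps every dynamically assembled job inside an ApplicationContext, so step scoping, proxying and the rest of the Spring Batch machinery keep working as the answer recommends.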

Performance problems using a Spring Batch FieldSetMapper to map into an object that will be written with a JpaItemWriter?

We are writing a set of Spring Batch jobs that read values from text files and use that information to update objects that are read from and written to the database using JPA. These jobs are not run in a web container, but on an application server. My problem seems to be with how the EntityManager is configured.
The code reads files from various vendors that update an order's status. The text file specifies the customer by name and the order by date/time. If the customer doesn't exist, the line from the text file is skipped. If the order exists, we update it. If not, then we create it.
We currently use DeltaSpike to get instances of our DAO objects like this:
DependentProvider<CustomerDaoImpl> provider = BeanProvider.getDependent(CustomerDaoImpl.class);
ICustomerDao custDao = provider.get();
I cache the DAO objects in my mapper so I am only getting them once. But every call to BeanProvider.getDependent() creates a new EntityManager through "Spring Batch Magic." The EntityManager is specified thusly:
@Configuration
public class BaseBatchConfiguration {

    @Bean
    @Produces
    public EntityManager entityManager() {
        Map<String, String> properties = new HashMap<String, String>();
        properties.put("hibernate.connection.url", System.getProperty("JDBC_URL"));
        properties.put("hibernate.default_schema", System.getProperty("APP_SCHEMA"));
        properties.put("hibernate.connection.username", System.getProperty("APP_DB_ID"));
        properties.put("hibernate.connection.password", System.getProperty("APP_DB_PWD"));

        EntityManagerFactory emf = Persistence.createEntityManagerFactory(System.getProperty("PU_NAME"), properties);
        return emf.createEntityManager();
    }
}
I tried caching the EntityManager, but a new instance of the BaseBatchConfiguration class is used every time. This means that each DAO gets created with its own EntityManager, so no real object caching takes place across DAOs (a customer read with CustomerDaoImpl isn't cached and reused when OrderDaoImpl loads an order that references that same customer).
This is causing a lot of unwanted object loading as we process through the text file.
Is there some other way we should be declaring our EntityManager?