I need to denormalize data stored in a relational database.
There are too many tables linked to each other, and it's not possible to build a single query to get all the data.
Here is a simplified situation:
+-------------+              +-------------+
|             | 1          N |             |
|   Unitat    +------------->|   Authors   |
|             |              |             |
+-------------+              +-------------+
So far I've built a step that reads all Unitat rows:
@Bean
public ItemReader<Unitat> reader() {
    String sql = "select * from unitat";
    JdbcCursorItemReader<Unitat> jdbcCursorItemReader = new JdbcCursorItemReader<>();
    jdbcCursorItemReader.setDataSource(this.dataSource);
    jdbcCursorItemReader.setSql(sql);
    jdbcCursorItemReader.setVerifyCursorPosition(false);
    jdbcCursorItemReader.setRowMapper(new UnitatRowMapper());
    return jdbcCursorItemReader;
}
The RowMapper is:
public class UnitatRowMapper implements RowMapper<Unitat> {

    private static final String ID_COLUMN = "id";
    //...

    @Override
    public Unitat mapRow(ResultSet resultSet, int numRow) throws SQLException {
        Unitat unitat = new Unitat();
        unitat.setId(resultSet.getString(ID_COLUMN));
        //...
        return unitat;
    }
}
Here is my processor. Its only job is to populate the fields of a UnitatDenormalized object:
@Component
public class UnitatMappingItemProcessor implements ItemProcessor<Unitat, UnitatDenormalized> {

    @Override
    public UnitatDenormalized process(Unitat unitat) throws Exception {
        UnitatDenormalized denormalized = new UnitatDenormalized();
        denormalized.setId(unitat.getId());
        //denormalized.set...()
        return denormalized;
    }
}
Here is my current step and job configuration:
@Bean
public Step step(
        ItemReader<Unitat> mssqlItemReader,
        UnitatMappingItemProcessor processor,
        SolrItemWriter solrItemWriter
) {
    return this.stepBuilderFactory
            .get("unitat")
            .<Unitat, UnitatDenormalized>chunk(100)
            .reader(mssqlItemReader)
            .processor(processor)
            .writer(solrItemWriter)
            .build();
}

@Bean
public Job job(Step step) {
    Job job = this.jobBuilderFactory.get("job1")
            .flow(step)
            .end()
            .build();
    return job;
}
How could I fetch the authors and merge them into the previously obtained unitat?
When and how could I run the following code?
// Here I would need to populate `authors`:
denormalized.addAutors(...);
denormalized.addAutors(...);
I hope I've explained it well enough.
You can use the driving query pattern: the idea is to issue an additional query in your item processor to enrich each item.
In your case, that query would fetch the authors of the current unitat item and set them on the denormalized item before returning it.
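For illustration, here is a minimal sketch of such an enriching processor. It assumes a JdbcTemplate is available and an authors table with unitat_id and name columns; those table and column names are assumptions, so adjust them to your schema.

import java.util.List;

import org.springframework.batch.item.ItemProcessor;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Component;

@Component
public class UnitatMappingItemProcessor implements ItemProcessor<Unitat, UnitatDenormalized> {

    private final JdbcTemplate jdbcTemplate;

    public UnitatMappingItemProcessor(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @Override
    public UnitatDenormalized process(Unitat unitat) throws Exception {
        UnitatDenormalized denormalized = new UnitatDenormalized();
        denormalized.setId(unitat.getId());
        //denormalized.set...()

        // Driving query: fetch the authors of the current unitat (assumed table/column names)
        List<String> authors = jdbcTemplate.queryForList(
                "select name from authors where unitat_id = ?",
                String.class,
                unitat.getId());
        authors.forEach(denormalized::addAutors);

        return denormalized;
    }
}

If one query per item turns out to be too chatty, the same enrichment can be done once per chunk instead, but the per-item version above is the simplest place to start.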
I have a Spring Batch job with a step consisting of a reader (reading from Elasticsearch), a processor (flattening and giving me a list of items) and a writer (writing the list of items to the database), configured as follows:
@Bean
public Step userAuditStep(
        StepBuilderFactory stepBuilderFactory,
        ItemReader<User> esUserReader,
        ItemProcessor<Product, List<UserAuditFact>> userAuditFactProcessor,
        ItemWriter<List<UserAuditFact>> userAuditFactListItemWriter) {
    return stepBuilderFactory
            .get(stepName)
            .<Product, List<UserAuditFact>>chunk(chunkSize)
            .reader(esUserReader)
            .processor(userAuditFactProcessor)
            .writer(userAuditFactListItemWriter)
            .listener(listener)
            .build();
}
As you can see above, the user reader (I just kept the batch reader size at 2) produces lists of UserAuditFact items (usually a few thousand), which are then written using a list writer like the one below:
@Bean
@StepScope
public ItemWriter<List<UserAuditFact>> userAuditFactListItemWriter(
        ItemWriter<UserAuditFact> userAuditFactItemWriter) {
    return new UserAuditFactListUnwrapWriter(userAuditFactItemWriter);
}

@Bean
@StepScope
public ItemWriter<UserAuditFact> userAuditFactItemWriter(
        @Qualifier("dbTemplate") NamedParameterJdbcTemplate jdbcTemplate) {
    return new JdbcBatchItemWriterBuilder<UserAuditFact>()
            .itemSqlParameterSourceProvider(
                    UserAuditFactQueryProvider.getUserAuditFactInsertParams())
            .assertUpdates(false)
            .sql(UserAuditFactQueryProvider.UPSERT)
            .namedParametersJdbcTemplate(jdbcTemplate)
            .build();
}
Since I have a list of items, I unwrap them before writing to the database, like below:
public class UserAuditFactListUnwrapWriter implements ItemWriter<List<UserAuditFact>> {

    private final ItemWriter<UserAuditFact> delegate;

    @Override
    public void write(List<? extends List<UserAuditFact>> lists) throws Exception {
        final List<UserAuditFact> consolidatedList = new ArrayList<>();
        for (final List<UserAuditFact> list : lists) {
            consolidatedList.addAll(list);
        }
        delegate.write(consolidatedList);
    }
}
Now the write operation takes a lot of time, especially when there are many items in consolidatedList.
One option is to add some chunking logic where I split the list into partitions and hand each one to the delegate, as shown below:
public class UserAuditFactListUnwrapWriter implements ItemWriter<List<UserAuditFact>> {

    private final ItemWriter<UserAuditFact> delegate;

    @Setter private int chunkSize = 0; // Maybe a chunk size of 200

    @Override
    public void write(List<? extends List<UserAuditFact>> lists) throws Exception {
        final List<UserAuditFact> consolidatedList = new ArrayList<>();
        for (final List<UserAuditFact> list : lists) {
            consolidatedList.addAll(list);
        }
        List<List<UserAuditFact>> partitions = ListUtils.partition(consolidatedList, chunkSize);
        for (List<UserAuditFact> partition : partitions) {
            delegate.write(partition);
        }
    }
}
However, this too does not give me the performance I need. Imagine 60,000 records (300 partitions with a chunk size of 200 each); it takes a long time.
I was wondering if I can improve this further (maybe some way to have the partitions written in parallel).
Some additional info: the database is AWS RDS PostgreSQL.
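One direction that might be worth trying, sketched below as an assumption rather than something from the post above, is to hand the partitions to the delegate from a small thread pool. Keep in mind that writes issued from worker threads do not take part in the step's chunk transaction, so restart and rollback semantics become weaker; it may also be worth checking first whether plain JDBC batching (and, for PostgreSQL, the reWriteBatchedInserts=true driver option) already gets you far enough.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.commons.collections4.ListUtils;
import org.springframework.batch.item.ItemWriter;

public class ParallelUserAuditFactListUnwrapWriter implements ItemWriter<List<UserAuditFact>> {

    private final ItemWriter<UserAuditFact> delegate;
    private final int chunkSize;
    private final ExecutorService executor;

    public ParallelUserAuditFactListUnwrapWriter(ItemWriter<UserAuditFact> delegate,
                                                 int chunkSize,
                                                 int threads) {
        this.delegate = delegate;
        this.chunkSize = chunkSize;
        this.executor = Executors.newFixedThreadPool(threads);
    }

    @Override
    public void write(List<? extends List<UserAuditFact>> lists) throws Exception {
        // Unwrap as before
        final List<UserAuditFact> consolidatedList = new ArrayList<>();
        for (final List<UserAuditFact> list : lists) {
            consolidatedList.addAll(list);
        }

        // Submit each partition to the pool; each partition is written on its own
        // connection, outside the chunk transaction
        List<Future<?>> futures = new ArrayList<>();
        for (final List<UserAuditFact> partition : ListUtils.partition(consolidatedList, chunkSize)) {
            futures.add(executor.submit(() -> {
                delegate.write(partition);
                return null;
            }));
        }

        // Propagate any failure so the step still fails loudly
        for (Future<?> future : futures) {
            future.get();
        }
    }
}

Whether this is acceptable depends on your restartability requirements, since a failed partition can leave earlier partitions already committed; the executor should also be shut down when the job finishes.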
Is there a way to query for multiple values of the same property with Spring Data REST, JPA and Querydsl? I am not sure what the format of the query URL should be and whether I need extra customization in my bindings; I couldn't find anything in the documentation. If I have a "student" table in my database with a "major" column and a corresponding Student entity, I would assume that querying for all students with "math" and "science" majors would look like http://localhost:8080/students?major=math&major=science. However, in this query only the first part is taken into account and major=science is ignored.
The example below customizes Querydsl web support to perform a collection "in" operation. The URI /students?major=sword&major=magic searches for students whose major is in ["sword", "magic"].
Entity and repository
@Entity
public class Student {
    @Id @GeneratedValue
    private Long id;
    private String name;
    private String major;
    // getters, setters and constructors omitted
}
public interface StudentRepos extends PagingAndSortingRepository<Student, Long>,
        QuerydslPredicateExecutor<Student>,
        QuerydslBinderCustomizer<QStudent> {

    @Override
    default void customize(QuerydslBindings bindings, QStudent root) {
        bindings.bind(root.major)
                .all((path, value) -> Optional.of(path.in(value)));
    }
}
Test data
new Student("Arthur", "sword");
new Student("Merlin", "magic");
new Student("Lancelot", "lance");
Controller
@RestController
@RequestMapping("/students")
@RequiredArgsConstructor
public class StudentController {

    private final StudentRepos studentRepos;

    @GetMapping
    ResponseEntity<List<Student>> getAll(Predicate predicate) {
        Iterable<Student> students = studentRepos.findAll(predicate);
        return ResponseEntity.ok(StreamSupport.stream(students.spliterator(), false)
                .collect(Collectors.toList()));
    }
}
Test case
@Test
@SneakyThrows
public void queryAll() {
    mockMvc.perform(get("/students"))
            .andExpect(status().isOk())
            .andExpect(jsonPath("$").isArray())
            .andExpect(jsonPath("$", hasSize(3)))
            .andDo(print());
}

@Test
@SneakyThrows
void querySingleValue() {
    mockMvc.perform(get("/students?major=sword"))
            .andExpect(status().isOk())
            .andExpect(jsonPath("$").isArray())
            .andExpect(jsonPath("$", hasSize(1)))
            .andExpect(jsonPath("$[0].name").value("Arthur"))
            .andExpect(jsonPath("$[0].major").value("sword"))
            .andDo(print());
}

@Test
@SneakyThrows
void queryMultiValue() {
    mockMvc.perform(get("/students?major=sword&major=magic"))
            .andExpect(status().isOk())
            .andExpect(jsonPath("$").isArray())
            .andExpect(jsonPath("$", hasSize(2)))
            .andExpect(jsonPath("$[0].name").value("Arthur"))
            .andExpect(jsonPath("$[0].major").value("sword"))
            .andExpect(jsonPath("$[1].name").value("Merlin"))
            .andExpect(jsonPath("$[1].major").value("magic"))
            .andDo(print());
}
The full Spring Boot application is on GitHub.
I'm using JpaPagingItemReaderBuilder to query a DB, and the result is being inserted into another DB.
The query returns results with no issue, but I'm getting an error with the reader's return type in the processor. You can check my code and the error below.
Can someone please give me some insight into this, and why I'm not able to process the result?
Here is my code:
@Bean
public Step sampleStep() {
    return stepBuilderFactory.get("sampleStep")
            .<FCR_HDR, FCR_HDR>chunk(5)
            .reader(itemReader())
            .processor(processor())
            //.writer(i -> i.stream().forEach(j -> System.out.println(j)))
            //.writer(i -> i.forEach(j -> System.out.println(j)))
            .writer(jpaItemWriter())
            .build();
}

@Bean
public Job sampleJob() {
    return jobBuilderFactory.get("sampleJob")
            .incrementer(new RunIdIncrementer())
            .start(sampleStep())
            .build();
}

@Bean
public FcrItemProcessor processor() {
    return new FcrItemProcessor();
}

@Bean
@StepScope
public JpaPagingItemReader<FCR_HDR> itemReader(/*@Value("${query}") String query*/) {
    return new JpaPagingItemReaderBuilder<FCR_HDR>()
            .name("db2Reader")
            .entityManagerFactory(localContainerEntityManagerFactoryBean.getObject())
            .queryString("select f.fcr_ref,f.num_subbills from FCR_HDR f where f.fcr_ref in ('R2G0130185','R2G0128330')")
            //.queryString(qry)
            .pageSize(3)
            .build();
}

@Bean
@StepScope
public JpaItemWriter jpaItemWriter() {
    JpaItemWriter writer = new JpaItemWriter();
    writer.setEntityManagerFactory(emf);
    return writer;
}
}
public class FcrItemProcessor implements ItemProcessor<FCR_HDR, FCR_HDR> {

    private static final Logger log = LoggerFactory.getLogger(FcrItemProcessor.class);

    @Nullable
    @Override
    public FCR_HDR process(FCR_HDR fcr_hdr) throws Exception {
        final String fcrNo = fcr_hdr.getFcr_ref();
        final String numsubbills = fcr_hdr.getNum_subbills();
        final FCR_HDR transformFcr = new FCR_HDR();
        transformFcr.setFcr_ref(fcrNo);
        transformFcr.setNum_subbills(numsubbills);
        log.info("Converting (" + fcr_hdr + ") into (" + transformFcr + ")");
        return transformFcr;
    }
}
Error:
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to com.electronicfcr.efcr.model.FCR_HDR
Since you configure the following query in the JpaPagingItemReader:
.queryString("select f.fcr_ref,f.num_subbills from FCR_HDR f where f.fcr_ref in ('R2G0130185','R2G0128330')")
The query is JPQL, which is processed by JPA, and JPA returns an Object[] when you select individual mapped columns instead of the mapped entity itself.
Change it to :
.queryString("select f from FCR_HDR f where f.fcr_ref in ('R2G0130185','R2G0128330')")
so that it returns the mapped entity class (i.e. FCR_HDR), which should solve your problem.
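As a side note, and only if you really need just those two columns: JPQL constructor expressions can map the selected columns onto a small DTO instead of an Object[]. This is my suggestion rather than part of the answer above, and the FcrSummary class is hypothetical; the reader, processor and writer would then be typed on the DTO instead of FCR_HDR.

// Hypothetical DTO used as the target of a JPQL constructor expression
public class FcrSummary {

    private final String fcrRef;
    private final String numSubbills;

    public FcrSummary(String fcrRef, String numSubbills) {
        this.fcrRef = fcrRef;
        this.numSubbills = numSubbills;
    }

    // getters omitted
}

The reader would then use:

.queryString("select new com.electronicfcr.efcr.model.FcrSummary(f.fcr_ref, f.num_subbills) "
        + "from FCR_HDR f where f.fcr_ref in ('R2G0130185','R2G0128330')")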
Based on my research, I know that Spring Batch provides APIs to handle many different kinds of data file formats.
But I need clarification on how to supply multiple files of different formats in one chunk/tasklet.
I know that MultiResourceItemReader can process multiple files, but AFAIK all the files have to be of the same format and data structure.
So the question is: how can we supply multiple files with different data formats as input to a tasklet?
Asoub is right: there is no out-of-the-box Spring Batch reader that "reads it all". However, with just a handful of fairly simple and straightforward classes you can build a Java-config Spring Batch application that goes through different files with different file formats.
For one of my applications I had a similar use case, and I wrote a bunch of fairly simple and straightforward implementations and extensions of the Spring Batch framework to create what I call a "generic" reader. So, to answer your question: below you will find the code I used to go through different kinds of file formats with Spring Batch. Obviously it is a stripped-down implementation, but it should get you going in the right direction.
One line is represented by a Record:
public class Record {

    private Object[] columns;

    public void setColumnByIndex(Object candidate, int index) {
        columns[index] = candidate;
    }

    public Object getColumnByIndex(int index) {
        return columns[index];
    }

    public void setColumns(Object[] columns) {
        this.columns = columns;
    }
}
Each line contains multiple columns and the columns are separated by a delimiter. It does not matter if file1 contains 10 columns and/or if file2 only contains 3 columns.
The following reader simply maps each line to a record:
@Component
public class GenericReader {

    @Autowired
    private GenericLineMapper genericLineMapper;

    @SuppressWarnings({ "unchecked", "rawtypes" })
    public FlatFileItemReader reader(File file) {
        FlatFileItemReader<Record> reader = new FlatFileItemReader();
        reader.setResource(new FileSystemResource(file));
        reader.setLineMapper((LineMapper) genericLineMapper.defaultLineMapper());
        return reader;
    }
}
The mapper takes a line and converts it to an array of objects:
@Component
public class GenericLineMapper {

    @Autowired
    private ApplicationConfiguration applicationConfiguration;

    @SuppressWarnings({ "unchecked", "rawtypes" })
    public DefaultLineMapper defaultLineMapper() {
        DefaultLineMapper lineMapper = new DefaultLineMapper();
        lineMapper.setLineTokenizer(tokenizer());
        lineMapper.setFieldSetMapper(new CustomFieldSetMapper());
        return lineMapper;
    }

    private DelimitedLineTokenizer tokenizer() {
        DelimitedLineTokenizer tokenize = new DelimitedLineTokenizer();
        tokenize.setDelimiter(Character.toString(applicationConfiguration.getDelimiter()));
        tokenize.setQuoteCharacter(applicationConfiguration.getQuote());
        return tokenize;
    }
}
The "magic" of converting the columns to the record happens in the FieldSetMapper:
@Component
public class CustomFieldSetMapper implements FieldSetMapper<Record> {

    @Override
    public Record mapFieldSet(FieldSet fieldSet) throws BindException {
        Record record = new Record();
        Object[] row = new Object[fieldSet.getValues().length];
        for (int i = 0; i < fieldSet.getValues().length; i++) {
            row[i] = fieldSet.getValues()[i];
        }
        record.setColumns(row);
        return record;
    }
}
Using yaml configuration the user provides an input directory, a list of file names and, of course, the appropriate delimiter and the character used to quote a column if the column contains the delimiter. Here is the properties class that backs such a yaml configuration:
@Component
@ConfigurationProperties
public class ApplicationConfiguration {

    private String inputDir;
    private List<String> fileNames;
    private char delimiter;
    private char quote;

    // getters and setters omitted
}
And then the application.yml:
input-dir: src/main/resources/
file-names: [yourfile1.csv, yourfile2.csv, yourfile3.csv]
delimiter: "|"
quote: "\""
And last but not least, putting it all together:
@Configuration
@EnableBatchProcessing
public class BatchConfiguration {

    @Autowired
    public JobBuilderFactory jobBuilderFactory;

    @Autowired
    public StepBuilderFactory stepBuilderFactory;

    @Autowired
    private GenericReader genericReader;

    @Autowired
    private NoOpWriter noOpWriter;

    @Autowired
    private ApplicationConfiguration applicationConfiguration;

    @Bean
    public Job yourJobName() {
        List<Step> steps = new ArrayList<>();
        applicationConfiguration.getFileNames().forEach(f ->
                steps.add(loadStep(new File(applicationConfiguration.getInputDir() + f))));

        return jobBuilderFactory.get("yourjobName")
                .start(createParallelFlow(steps))
                .end()
                .build();
    }

    @SuppressWarnings("unchecked")
    public Step loadStep(File file) {
        return stepBuilderFactory.get("step-" + file.getName())
                .<Record, Record> chunk(10)
                .reader(genericReader.reader(file))
                .writer(noOpWriter)
                .build();
    }

    private Flow createParallelFlow(List<Step> steps) {
        SimpleAsyncTaskExecutor taskExecutor = new SimpleAsyncTaskExecutor();
        // max multithreading = -1, no multithreading = 1, smart size = steps.size()
        taskExecutor.setConcurrencyLimit(1);

        List<Flow> flows = steps.stream()
                .map(step -> new FlowBuilder<Flow>("flow_" + step.getName()).start(step).build())
                .collect(Collectors.toList());

        return new FlowBuilder<SimpleFlow>("parallelStepsFlow")
                .split(taskExecutor)
                .add(flows.toArray(new Flow[flows.size()]))
                .build();
    }
}
For demonstration purposes you can just put all the classes in one package. The NoOpWriter simply logs the 2nd column of my test files.
@Component
public class NoOpWriter implements ItemWriter<Record> {

    @Override
    public void write(List<? extends Record> items) throws Exception {
        items.forEach(i -> System.out.println(i.getColumnByIndex(1)));
        // NO - OP
    }
}
Good luck :-)
I don't think there is an out-of-the-box Spring Batch reader for multiple input formats.
You'll have to build your own. Of course, you can reuse existing file item readers (such as FlatFileItemReader) as delegates in your custom reader and, for each file type/format, use the right one; a sketch of that idea follows.
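For illustration only (this is an assumption of mine, not code from either answer), a small factory that hands back a suitably configured FlatFileItemReader per file extension could look like this:

import java.io.File;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

import org.springframework.batch.item.file.FlatFileItemReader;

// Registers one reader factory per file extension and picks the right one per file.
public class FormatDispatchingReaderFactory<T> {

    private final Map<String, Function<File, FlatFileItemReader<T>>> factoriesByExtension = new HashMap<>();

    public void register(String extension, Function<File, FlatFileItemReader<T>> factory) {
        factoriesByExtension.put(extension.toLowerCase(), factory);
    }

    public FlatFileItemReader<T> readerFor(File file) {
        String name = file.getName();
        String extension = name.substring(name.lastIndexOf('.') + 1).toLowerCase();
        Function<File, FlatFileItemReader<T>> factory = factoriesByExtension.get(extension);
        if (factory == null) {
            throw new IllegalArgumentException("No reader registered for ." + extension + " files");
        }
        return factory.apply(file);
    }
}

Each registered factory would configure the line mapper (delimited, fixed-length, JSON, ...) that matches that format, and the step for a given file would then be built around the reader returned by readerFor(file).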
Community, I have a problem counting entities after an entityManager save in the same transaction. The problem is that I always get an unexpected count. I would like to count the current rows of an entity after an entity manager save, but the result is always the current row count plus the saved entity (which is not committed yet). It looks like a classic dirty-read issue.
For this reason I use Propagation.REQUIRES_NEW and Isolation.REPEATABLE_READ to count only the committed entities.
If I use an H2 database I get the result I expect, but if I use MySQL the result is different.
Here are some code snippets:
The service:
@Service
public class MyTestService {

    @Autowired
    private EntityRepository entityRepository;

    @Transactional
    public void doSomeLogic() {
        entityRepository.save(new Entity());

        final long count = count();
        if (count != 0) {
            throw new IllegalStateException("Entity count should be 0 but is " + count);
        }
    }

    @Transactional(propagation = Propagation.REQUIRES_NEW, isolation = Isolation.REPEATABLE_READ)
    public long count() {
        return entityRepository.count();
    }
}
The repo:
public interface EntityRepository extends PagingAndSortingRepository<Entity, Long>, JpaSpecificationExecutor<Entity> {

    // @Override
    // @Transactional(readOnly = true, propagation = Propagation.REQUIRES_NEW,
    //         isolation = Isolation.REPEATABLE_READ)
    // long count();
}
The test:
@RunWith(SpringRunner.class)
@SpringBootTest
public class DemoApplicationTests {

    @Autowired
    private MyTestService myTestService;

    @Test
    public void testEntityCounterWithMultiThreads() {
        myTestService.doSomeLogic();
    }
}
If I uncomment the repository code, everything works, but I don't understand the difference.
Does anybody know where my thinking goes wrong?
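A likely cause, offered as my assumption rather than anything stated above: Spring applies @Transactional through proxies, so the self-invocation this.count() inside doSomeLogic() never goes through the proxy and the REQUIRES_NEW attribute is silently ignored; the count then runs in the surrounding transaction and sees the just-saved, uncommitted entity. Annotating the repository method works because the call entityRepository.count() does go through the repository proxy. A sketch of the same idea without touching the repository is to move the counting method into a separate bean:

// Sketch: counting in a separate bean so that REQUIRES_NEW is actually honoured
@Service
public class EntityCounter {

    @Autowired
    private EntityRepository entityRepository;

    @Transactional(propagation = Propagation.REQUIRES_NEW, isolation = Isolation.REPEATABLE_READ)
    public long count() {
        return entityRepository.count();
    }
}

@Service
public class MyTestService {

    @Autowired
    private EntityRepository entityRepository;

    @Autowired
    private EntityCounter entityCounter;

    @Transactional
    public void doSomeLogic() {
        entityRepository.save(new Entity());

        // Goes through the EntityCounter proxy, so a new transaction is started
        final long count = entityCounter.count();
        if (count != 0) {
            throw new IllegalStateException("Entity count should be 0 but is " + count);
        }
    }
}

Whether this also explains the H2 vs. MySQL difference I can't say for sure; the two databases use different default isolation levels, so it is worth re-testing both once the new transaction is actually being opened.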