My assumption
In my understanding "chunk oriented processing" in Spring Batch helps me to efficiently process multiple items in a single transaction. This includes efficient use of interfaces from external systems. As external communication includes overhead, it should be limited and chunk-oriented too. That's why we have the commit-level for the ItemWriter.
So what I don't get is, why does the ItemReader still have to read item-by-item? Why can't I read chunks also?
Problem description
In my step, the reader has to call a webservice. And the writer will send this information to another webservice. That's why I want to make as few calls as necessary.
The interface of the ItemWriter is chunk-oriented, as you surely know:
public abstract void write(List<? extends T> paramList) throws Exception;
But the ItemReader is not:
public abstract T read() throws Exception;
As a workaround I implemented a ChunkBufferingItemReader, which reads a list of items, stores them and returns items one-by-one whenever its read() method is called.
But when it comes to exception handling and restarting of a job now, this approach is getting messy. I'm getting the feeling that I'm doing work here, which the framework should do for me.
Question
So am I missing something? Is there any existing functionality in Spring Batch I just overlooked?
In another post it was suggested to change the return type of the ItemReader to a List. But then my ItemProcessor would have to emit multiple outputs from a single input. Is this the right approach?
I'm grateful for any best practices. Thanks in advance :-)
This is a draft for an implementation of the read() interface method.
public T read() throws Exception {
    // Refill the buffer whenever it runs dry.
    while (this.items.isEmpty()) {
        final List<T> newItems = readChunk();
        if (newItems == null) {
            return null; // the chunked source is exhausted
        }
        this.items.addAll(newItems);
    }
    // Hand items back to the framework one by one.
    return this.items.pop();
}
Please note, that items is a buffer for the items read in chunks and not requested by the framework yet.
Spring Batch uses a 'chunk oriented' processing style (not just chunked reads or writes, but the full process of read, process and write).
Chunk oriented processing refers to
Read an item using ItemReader (Single Item)
Process it using ItemProcessor, and aggregate the result (Result List is updated one by one).
Once the commit interval is reached, the entire aggregated result (Result List) is written out using ItemWriter and then the transaction is committed.
Here is the code representation from SpringBatch doc
List<Object> items = new ArrayList<>();
for (int i = 0; i < commitInterval; i++) {
    Object item = itemReader.read();
    Object processedItem = itemProcessor.process(item);
    items.add(processedItem);
}
itemWriter.write(items);
As you said, if you need your reader to return multiple items, make its return type a List; your processor will then receive and return a List as well. Finally, your writer will get a List of Lists.
Here is the code representation of the new case:
List<List<Object>> resultList = new ArrayList<>();
for (int i = 0; i < commitInterval; i++) {
    List<Object> items = itemReader.read();
    List<Object> processedItems = itemProcessor.process(items);
    resultList.add(processedItems);
}
itemWriter.write(resultList);
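Since the writer now receives a List of Lists, one option is to flatten it so the downstream web service is still called only once per commit interval. A minimal sketch (WebServiceClient and sendAll are placeholders, not a real API):

import java.util.ArrayList;
import java.util.List;
import org.springframework.batch.item.ItemWriter;

// Flattens the chunk's nested lists into a single payload per commit interval.
public class FlatteningWebServiceWriter implements ItemWriter<List<Object>> {

    private final WebServiceClient webServiceClient; // hypothetical client type

    public FlatteningWebServiceWriter(WebServiceClient webServiceClient) {
        this.webServiceClient = webServiceClient;
    }

    @Override
    public void write(List<? extends List<Object>> chunks) throws Exception {
        List<Object> flat = new ArrayList<>();
        for (List<Object> chunk : chunks) {
            flat.addAll(chunk);
        }
        webServiceClient.sendAll(flat); // one outbound call per chunk
    }
}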
Related
I have a streaming job using Apache Flink (version 1.8.1) with Scala. The flow requirements of the job are as follows:
Kafka -> Write to Hbase -> Send to kafka again with a different topic
During the write to HBase, the job needs to retrieve data from another table. To ensure that the data is not empty (NULL), the job must check repeatedly (within a certain time) whether the data is empty.
Is this possible with Flink? If yes, can you help provide examples for conditions similar to my needs?
Edit :
I mean that, given the problem I described above, I thought about creating some kind of batch job inside the streaming job, but I couldn't find the right example for my case. So, is it possible to create a batch Flink job inside a streaming Flink job? If yes, can you help provide examples for conditions similar to my needs?
With more recent versions of Flink you can do lookup queries (with a configurable cache) against HBase from the SQL/Table APIs. Your use case sounds like it might be easily implemented in this fashion. See the docs for more info.
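For illustration only, here is a rough sketch of what such a lookup join could look like with the Table API on a newer Flink version. The table names, columns, and connector options below are assumptions, and the exact option names depend on your Flink and HBase connector versions:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

// A dimension table backed by HBase, with a lookup cache (options illustrative).
tEnv.executeSql(
    "CREATE TABLE dim_hbase (" +
    "  rowkey STRING," +
    "  cf ROW<meta STRING>," +
    "  PRIMARY KEY (rowkey) NOT ENFORCED" +
    ") WITH (" +
    "  'connector' = 'hbase-2.2'," +
    "  'table-name' = 'dim'," +
    "  'zookeeper.quorum' = 'zk-host:2181'," +
    "  'lookup.cache.max-rows' = '10000'," +
    "  'lookup.cache.ttl' = '10min'" +
    ")");

// 'events' is assumed to be a registered table over the Kafka stream,
// with a processing-time attribute 'proc_time'.
Table enriched = tEnv.sqlQuery(
    "SELECT e.id, d.cf.meta " +
    "FROM events AS e " +
    "JOIN dim_hbase FOR SYSTEM_TIME AS OF e.proc_time AS d " +
    "ON e.id = d.rowkey");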
Just to clarify my comment, I will post a sketch of what I was trying to suggest, based on The Broadcast State Pattern. The link provides an example in Java, so I will follow it; the Scala version should not be too different. You will likely have to implement the code below as explained in the link I mentioned:
DataStream<String> output = colorPartitionedStream
.connect(ruleBroadcastStream)
.process(
// type arguments in our KeyedBroadcastProcessFunction represent:
// 1. the key of the keyed stream
// 2. the type of elements in the non-broadcast side
// 3. the type of elements in the broadcast side
// 4. the type of the result, here a string
new KeyedBroadcastProcessFunction<Color, Item, Rule, String>() {
// my matching logic
}
);
I was suggesting that you populate the ruleBroadcastStream at fixed intervals from the database, or whatever your store is. Instead of getting:
// broadcast the rules and create the broadcast state
BroadcastStream<Rule> ruleBroadcastStream = ruleStream
.broadcast(ruleStateDescriptor);
as the web page says, you will need to add a source that you can schedule to run every X minutes:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
BroadcastStream<Rule> ruleBroadcastStream = env
.addSource(new YourStreamSource())
.broadcast(ruleStateDescriptor);
public class YourStreamSource extends RichSourceFunction<YourType> {

    private static final long SLEEP_MILLIS = 10 * 60 * 1000L; // "X" minutes between fetches

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<YourType> ctx) throws Exception {
        while (running) {
            YourType yourData = null; // TODO: fetch the data from your store here
            ctx.collect(yourData);
            Thread.sleep(SLEEP_MILLIS);
        }
    }

    @Override
    public void cancel() {
        this.running = false;
    }
}
I have slow moving data in a compacted kafka topic and also fast moving data in another topic.
1) fast moving data is real-time ingested unbounded events from Kafka.
2) slow moving data is meta data which is used to enrich the fast moving data. This is a compacted topic and the data is updated infrequently (days/months).
3) Each fast-moving data payload should have a metadata payload with the same customerId, with which it can be aggregated.
I would like to aggregate the fast/slow moving data against the customerId (common in the data on both topics). I was wondering how you would go about doing this? So far:
PTransform<PBegin, PCollection<KV<byte[], byte[]>>> kafka = KafkaIO.<byte[], byte[]>read()
    .withBootstrapServers("url:port")
    .withTopics(Arrays.asList("fast-moving-data", "slow-moving-data"))
    .withKeyDeserializer(ByteArrayDeserializer.class)
    .withValueDeserializer(ByteArrayDeserializer.class)
    .updateConsumerProperties((Map) props)
    .withoutMetadata();
I have noticed that I can use .withTopics to specify the different topics I would like to consume, but beyond that point I've not been able to find any examples that help with the aggregation. Any help would be appreciated.
The following pattern, which is also discussed in this SO Q&A, might be a good one to explore for your use case. One item that could be an issue is the size of your compacted slow-moving stream. Hope it's useful.
For this pattern we can use the GenerateSequence source transform to emit a value periodically for example once a day.
Pass this value into a global window via a data-driven trigger that activates on each element.
In a DoFn, use this element as a trigger to pull data from your bounded source.
Create your SideInput for use in downstream transforms.
It's important to note that because this pattern uses a global-window SideInput triggering on processing time, matching to elements being processed in event time will be nondeterministic. For example, if we have a main pipeline windowed in event time, the version of the SideInput view that those windows see will depend on the latest trigger that has fired in processing time rather than in event time.
Also important to note that in general the SideInput should be something that fits into memory.
Java (SDK 2.9.0):
In the sample below, the side input is updated at very short intervals so that the effects can easily be seen. The expectation is that in practice the side input updates slowly, for example every few hours or once a day.
In the example code below, we make use of a Map created in a DoFn, which becomes the View.asSingleton; this is the recommended approach for this pattern.
The sample below illustrates the pattern; please note that the View.asSingleton is rebuilt on every counter update.
For your use case, you could replace the GenerateSequence transforms with PubSubIO transforms. Does that make sense?
public static void main(String[] args) {
// Create pipeline
PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation()
.as(PipelineOptions.class);
// Using View.asSingleton, this pipeline uses a dummy external service as illustration.
// Run in debug mode to see the output
Pipeline p = Pipeline.create(options);
// Create slowly updating sideinput
PCollectionView<Map<String, String>> map = p
.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5L)))
.apply(Window.<Long>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterProcessingTime.pastFirstElementInPane()))
.discardingFiredPanes())
.apply(ParDo.of(new DoFn<Long, Map<String, String>>() {
@ProcessElement public void process(@Element Long input,
OutputReceiver<Map<String, String>> o) {
// Do any external reads needed here...
// We will make use of our dummy external service.
// Every time this triggers, the complete map will be replaced with that read from
// the service.
o.output(DummyExternalService.readDummyData());
}
})).apply(View.asSingleton());
// ---- Consume slowly updating sideinput
// GenerateSequence is only used here to generate dummy data for this illustration.
// You would use your real source for example PubSubIO, KafkaIO etc...
p.apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(1L)))
.apply(Window.into(FixedWindows.of(Duration.standardSeconds(1))))
.apply(Sum.longsGlobally().withoutDefaults())
.apply(ParDo.of(new DoFn<Long, KV<Long, Long>>() {
@ProcessElement public void process(ProcessContext c) {
Map<String, String> keyMap = c.sideInput(map);
c.outputWithTimestamp(KV.of(1L, c.element()), Instant.now());
LOG.debug("Value is {} key A is {} and key B is {}"
, c.element(), keyMap.get("Key_A"),keyMap.get("Key_B"));
}
}).withSideInputs(map));
p.run();
}
public static class DummyExternalService {
public static Map<String, String> readDummyData() {
Map<String, String> map = new HashMap<>();
Instant now = Instant.now();
DateTimeFormatter dtf = DateTimeFormat.forPattern("HH:mm:ss"); // lowercase mm/ss: MM is months, SS is fraction of second
map.put("Key_A", now.minus(Duration.standardSeconds(30)).toString(dtf));
map.put("Key_B", now.minus(Duration.standardSeconds(30)).toString());
return map;
}
}
I would suggest reading those topics separately, creating two different inputs to the pipeline. You can cross/join them later, and the way to cross them is to provide the slow-moving stream as a side input to the hot path (the transforms of the fast-moving PCollection).
See here: https://beam.apache.org/documentation/programming-guide/#side-inputs
I'm aware that all Spring Batch chunk-oriented steps need to have a reader and a writer, and optionally a processor. So even though my step only needs a writer, I am also fibbing a reader that does nothing but make Spring happy.
This is based on the solution found here. Is it outdated, or am I missing something?
I have a Spring Batch job that has two chunked steps. My first step, deleteCounts, just deletes all rows from the table so that the second step has a clean slate. This means my first step doesn't need a reader, so I followed the above linked Stack Overflow solution, created a NoOpItemReader, and added it to my step builder object (code at the bottom).
My writer is mapped to a simple SQL statement that deletes all the rows from the table (code is at the bottom).
My table is not being cleared by the deleteCounts step. I am expecting deleteCounts to delete all rows from the table, yet it does not, and I suspect it's because of my "fibbed" reader, but I'm not sure what I'm doing wrong.
My delete statement:
<delete id="delete">
DELETE FROM ${schemaname}.DERP
</delete>
My deleteCounts Step:
@Bean
@JobScope
public Step deleteCounts() {
StepBuilder sb = stepBuilderFactory.get("deleteCounts");
SimpleStepBuilder<ProcessedCountData, ProcessedCountData> ssb = sb.<ProcessedCountData, ProcessedCountData>chunk(10);
ssb.reader(noOpItemReader());
ssb.writer(writerFactory.myBatisBatchWriter(COUNT_DATA_DELETE));
ssb.startLimit(1);
ssb.allowStartIfComplete(true);
return ssb.build();
}
My NoOpItemReader, based on the previously linked solution on stackoverflow:
public NoOpItemReader<? extends ProcessedCountData> noOpItemReader() {
return new NoOpItemReader<>();
}
// for steps that do not need to read anything
public class NoOpItemReader<T> implements ItemReader<T> {
@Override
public T read() throws Exception {
return null;
}
}
I left out some MyBatis plumbing, since I know that part is working (step 2 is much more involved with the MyBatis stuff, and it is inserting rows just fine; deleting is so simple, it must be something with my step config...)
Your NoOpItemReader returns null. An ItemReader returning null indicates that the input has been exhausted. Since, in your case, that's all it returns, the framework assumes that there was no input in the first place.
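If the step truly has no input, two options are worth noting: a TaskletStep needs no reader at all and is the natural fit for a pure delete step; alternatively, keep the chunk-oriented step and use a reader that returns exactly one marker item before signalling end-of-input, so the writer runs once. A minimal sketch of the latter (the class name is a placeholder):

import org.springframework.batch.item.ItemReader;

// Emits a single marker item, then signals end-of-input with null.
public class SingleItemReader<T> implements ItemReader<T> {

    private final T item;
    private boolean consumed = false;

    public SingleItemReader(T item) {
        this.item = item;
    }

    @Override
    public T read() {
        if (consumed) {
            return null; // input exhausted after the first item
        }
        consumed = true;
        return item;
    }
}

Because the reader is stateful, make the bean step-scoped (or reset the flag) so the step also works on a restart.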
Situation:
I read the URL of a file on the internet from the DB. In an ItemProcessor I download this file, and I want to save each of its rows to the database. Once processing continues, I want to create some new "summary" object, which I want to save to the DB too. How should I configure my job in Spring Batch?
For your use-case job can be defined using this step sequence (in this way this job is also restartable):
Download the file from the URL to the HDD using a Tasklet: a Tasklet is the strategy for processing a single step; in your case something similar to this post can help. Store the local filename in the JobExecutionContext (a sketch of such a Tasklet follows at the end of this answer).
Process downloaded file:
2.1 With a FlatFileItemReader<S> (or your own ItemReader/ItemStream implementation), read the downloaded file
2.2 With an ItemProcessor<S,T>, process each row
2.3 Write each object processed in 2.2 to the database using a custom MyWriter<T> that does the summary calculation and delegates to an ItemWriter<T> for T's database persistence and to an ItemWriter<Summary> to write the Summary object.
<S> is the bean that contains each file row, and
<T> is the bean you write to the DB
MyWriter<T> can be used in this way:
class MyWriter<T> implements ItemWriter<T> {

    private ItemWriter<Summary> summaryWriter;
    private ItemWriter<T> tWriter;

    @Override
    public void write(List<? extends T> items) throws Exception {
        List<Summary> summaries = new ArrayList<>(items.size());
        for (T item : items) {
            final Summary summary = null; /* here, create the summary object by reading
                                           * from the database or creating a new object */
            /* do the summary or update the summary */
            summaries.add(summary);
        }
        /* The code above is trivial: you could group Summary objects in a Map<SummaryKey, Summary>
         * to reduce reads and use summaryWriter.write(summariesMap.values()), for example. */
        tWriter.write(items);
        summaryWriter.write(summaries);
    }
}
You need to register both MyWriter.summaryWriter and MyWriter.tWriter as streams on the step for restartability.
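For step 1, here is a minimal sketch of the download Tasklet (the class name, file suffix, and context key are placeholders; the URL is assumed to come from your DB lookup):

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import org.springframework.batch.core.StepContribution;
import org.springframework.batch.core.scope.context.ChunkContext;
import org.springframework.batch.core.step.tasklet.Tasklet;
import org.springframework.batch.repeat.RepeatStatus;

public class DownloadFileTasklet implements Tasklet {

    private final String url; // assumed: read from the DB when the step is built

    public DownloadFileTasklet(String url) {
        this.url = url;
    }

    @Override
    public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
        // Download the file to a local temporary path.
        Path local = Files.createTempFile("download", ".dat");
        try (InputStream in = new URL(url).openStream()) {
            Files.copy(in, local, StandardCopyOption.REPLACE_EXISTING);
        }
        // Store the local filename in the JobExecutionContext for step 2's reader.
        chunkContext.getStepContext().getStepExecution().getJobExecution()
                .getExecutionContext().putString("localFile", local.toString());
        return RepeatStatus.FINISHED;
    }
}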
You can use a CompositeItemWriter.
But perhaps your summary processing should be in another step, one which reads back the rows you previously inserted.
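A rough sketch of that wiring (bean and type names are placeholders; note that every delegate of a CompositeItemWriter receives the same chunk of items, so the summary-writing delegate must derive the Summary objects from the items itself):

import java.util.Arrays;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.support.CompositeItemWriter;

// Both delegates receive the same chunk of MyItem instances.
public CompositeItemWriter<MyItem> compositeWriter(ItemWriter<MyItem> tWriter,
                                                   ItemWriter<MyItem> summaryDerivingWriter) throws Exception {
    CompositeItemWriter<MyItem> writer = new CompositeItemWriter<>();
    writer.setDelegates(Arrays.asList(tWriter, summaryDerivingWriter));
    writer.afterPropertiesSet(); // validates that delegates were set
    return writer;
}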
I need to execute seven distinct processes sequentially (one after the other). The data is stored in MySQL. I am thinking of the following options; please correct me if I am wrong, or if there is a better solution.
Requirements:
Read the data from the DB, run the seven processes (data validation, calculation1, calculation2, etc.) and finally write the processed data to the DB.
Process the data in chunks.
My solution and issues:
Data read:
Read the data using a JdbcCursorItemReader, because it is the best-performing DB reader. But the SQL is very complex, so I may have to consider a custom ItemReader using JdbcTemplate, which gives me more flexibility in handling the data.
Process:
Define seven steps and chunks, sharing the data between the steps using a data bean. But this won't be a good idea, because the data is processed in chunks, and after each chunk the step 1 writer will create a new set of data in the data bean. When this data bean is shared across the other steps, data integrity will be an issue.
Use the StepExecutionContext to share the data between steps. But this may affect performance, as it involves the batch job repository.
Define only one step, with one ItemReader and a chain of processes (the seven processes), and create one ItemWriter which writes the processed data to the DB. But I won't be able to administer or monitor each individual process; all will be in one step.
The org.springframework.batch.item.support.CompositeItemProcessor is an out-of-the-box component from the Spring Batch framework that would support your requirement, akin to your last option. It would allow you to do the following:
- keep separation in your design/solution for reading from the database (ItemReader)
- keep each individual processor's concerns and configuration separate
- allow any individual processor to 'shut down' the chunk by returning null, irrespective of previous processors
The CompositeItemProcessor iterates over a list of delegates, so it's 'similar' to an action pattern. It's quite useful in the scenario you've described and still allows you to leverage the chunk benefits (exception handling, retry, commit policy, etc.).
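A rough sketch of the wiring (the processor class names are placeholders for your seven processes):

import java.util.Arrays;
import org.springframework.batch.item.support.CompositeItemProcessor;

// Chains the delegates: each processor's output feeds the next one,
// and returning null from any delegate filters the item out of the chunk.
public CompositeItemProcessor<InputBean, OutputBean> compositeProcessor() throws Exception {
    CompositeItemProcessor<InputBean, OutputBean> composite = new CompositeItemProcessor<>();
    composite.setDelegates(Arrays.asList(
            new DataValidationProcessor(),
            new Calculation1Processor(),
            new Calculation2Processor()
            // ... the remaining calculation processors
    ));
    composite.afterPropertiesSet();
    return composite;
}

Each delegate can still be configured and unit-tested on its own, which recovers some of the monitoring you lose by collapsing everything into one step.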
Suggestions:
1) Read the data using JdbcCursorItemReader.
All the out-of-the-box components are a good choice because they already implement the ItemStream interface, which makes your steps restartable. But as you mention, sometimes the request is just too complex, or, like me, you already have a service or DAO that you can reuse.
I would suggest you use the ItemReaderAdapter. It lets you configure a delegate service to call to get your data.
<bean id="MyReader" class="xxx.adapters.MyItemReaderAdapter">
<property name="targetObject" ref="AnExistingDao" />
<property name="targetMethod" value="next" />
</bean>
Note that the targetMethod must respect the read contract of ItemReaders (return null when there is no more data).
If your job does not need to be restartable, you could simply use the class : org.springframework.batch.item.adapter.ItemReaderAdapter
But if you need your job to be restartable, you can create your own ItemReaderAdapter like this:
public class MyItemReaderAdapter<T> extends AbstractMethodInvokingDelegator<T> implements ItemReader<T>, ItemStream {

    private static final Logger log = LoggerFactory.getLogger(MyItemReaderAdapter.class);

    private static final String CONTEXT_COUNT_KEY = "count";

    private long currentCount = 0;

    /**
     * @return the return value of the target method.
     */
    @Override
    public T read() throws Exception {
        super.setArguments(new Long[] { currentCount++ });
        return invokeDelegateMethod();
    }

    @Override
    public void open(ExecutionContext executionContext) throws ItemStreamException {
        currentCount = executionContext.getLong(CONTEXT_COUNT_KEY, 0);
    }

    @Override
    public void update(ExecutionContext executionContext) throws ItemStreamException {
        executionContext.putLong(CONTEXT_COUNT_KEY, currentCount);
        log.info("Update stream, current count: " + currentCount);
    }

    @Override
    public void close() throws ItemStreamException {
        // nothing to close
    }
}
Because the out-of-the-box ItemReaderAdapter is not restartable, you just create your own that also implements ItemStream.
2) Regarding the 7 steps vs 1 step.
I would go with 1 step with a CompositeItemProcessor on this one. The 7-steps option will only bring problems, IMO.
1) 7 steps with a data bean: your writers commit into a data bean until step 7... then the step 7 writer tries to commit to the real database and boom, error!!! All is lost and the batch must restart from step 1!!
2) 7 steps with the context: could be better, since you will have the state saved in the Spring Batch metadata... BUT it is not good practice to store big data in Spring Batch's metadata!!
3) is the way to go IMO. ;-)