I am new to Spring Batch. I want to understand how data is passed from the Reader to the Processor and from the Processor to the Writer. So basically, the Reader has a read() method that returns some kind of data, say a String; this return value is then used as the input parameter of the process() method in the Processor.
So what I want to understand is: once the read() method returns a String, how is that transfer handled until it reaches the process() method? Does Spring store the data somewhere and then pass it to the next phase? How does it happen?
Any pointers to understand this, or some good links to read up on it, are welcome.
Thanks in advance!
Spring Batch uses chunk-oriented processing as explained here: https://docs.spring.io/spring-batch/4.0.x/reference/html/step.html#chunkOrientedProcessing
The idea is to read/process items one at a time until a chunk (of a predefined size) is created. This design choice avoids loading the whole data source into memory, which is very efficient when dealing with large amounts of data. The writing, on the other hand, operates on a chunk of items (and not a single item) in order to optimize writes (like JDBC batch inserts or Elasticsearch bulk inserts).
If you want more details on what's happening behind the scenes, I invite you to take a look at the code of the ChunkOrientedTasklet class, which basically uses two collaborators:
A ChunkProvider which provides chunks of items (delegating item reading to an ItemReader)
A ChunkProcessor which processes chunks (delegating processing and writing respectively to an ItemProcessor/ItemWriter)
Here is a simplified version of the code:
Chunk inputs = chunkProvider.provide(contribution);
chunkProcessor.process(contribution, inputs);
The contribution object is used to add the chunk contribution to the step. So to answer your question:
"how this transfer is handled? Does spring stores this data somewhere and then passes to the next phase?"
Items are passed between these two collaborators through the inputs variable of type Chunk, which is essentially an in-memory list of items.
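To make the flow concrete, here is a conceptual sketch (not the actual framework source) of what one chunk iteration boils down to, assuming String items and ignoring transactions, skip/retry and the StepContribution bookkeeping:

import java.util.ArrayList;
import java.util.List;

import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;

public class ChunkLoopSketch {

    // One chunk iteration, as a plain chain of method calls.
    public static void oneChunk(ItemReader<String> reader,
                                ItemProcessor<String, String> processor,
                                ItemWriter<String> writer,
                                int commitInterval) throws Exception {

        // 1. Read until the chunk is full (or the reader signals the end with null).
        List<String> inputs = new ArrayList<>();
        String item;
        while (inputs.size() < commitInterval && (item = reader.read()) != null) {
            inputs.add(item); // each read item is simply kept in this in-memory list
        }

        // 2. Process the chunk one item at a time.
        List<String> outputs = new ArrayList<>();
        for (String input : inputs) {
            String output = processor.process(input);
            if (output != null) { // returning null filters the item out of the chunk
                outputs.add(output);
            }
        }

        // 3. Write the whole chunk in a single call.
        writer.write(outputs);
    }
}

So between read() and process(), an item simply sits in that in-memory list held by the tasklet; nothing is written anywhere in between.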
Hope this helps!
You seem to be thinking about this as WAY more complicated than it is. This is no different than any other object oriented method call.
When the read() method completes, it returns a (hopefully generics-typed) object, which is then passed as an argument to the process() method; when process() completes, its result is passed on to the write() method. It all takes place effectively instantly, in memory (which is why jobs are chunked), just the same way you pass any object to a method.
Object someData = read();
Object processed = process(someData);
write(processed);
Related
I have a large number of SQLite databases, represented as a Source[File, NotUsed]. For each db, I want to paginate through the results. Memory limits mean I cannot do this eagerly. Say the result type is Foo; I'm trying to figure out how to create a Flow[File, Foo, NotUsed] that internally uses a lazy, recursive call on the resource.
I see that the Source.unfold method allows me to do this, but it can only create a Source, which means I can't feed it the necessary input of File. I can't see how to convert a Source to a Flow (except via fromSinkAndSource, but that doesn't pipe the values through). I'm not sure if this path of inquiry will yield anything.
It was suggested to me that I should use the GraphDSL and Merge, but I'm stuck trying to understand how many input ports the Merge should have and how I would actually wire it together.
I think you're looking for the flatMapConcat operator:
Signature
def flatMapConcat[T, M](f: Out ⇒ Graph[SourceShape[T], M]): Repr[T]
Description
Transform each input element into a Source whose elements are then flattened into the output stream through concatenation. This means each source is fully consumed before consumption of the next source starts.
emits when the current consumed substream has an element available
backpressures when downstream backpressures
completes when upstream completes and all consumed substreams complete
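For illustration, here is a sketch using the Akka Streams Java DSL (the Scala API is analogous): each File becomes a lazy Source by paginating inside Source.unfoldResource, and flatMapConcat flattens those sources one at a time, so only one database is open at any moment. Foo and readNext are placeholders for your result type and your pagination query, not real APIs.

import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.util.Optional;

import akka.NotUsed;
import akka.stream.javadsl.Flow;
import akka.stream.javadsl.Source;

public class SqlitePagingSketch {

    // Placeholder for the real result type.
    public static class Foo {}

    public static Flow<File, Foo, NotUsed> fooFlow() {
        return Flow.of(File.class)
                .flatMapConcat(file ->
                        Source.unfoldResource(
                                // open: one connection per database file
                                () -> DriverManager.getConnection("jdbc:sqlite:" + file.getPath()),
                                // read: emit the next Foo, or Optional.empty() when the db is exhausted
                                SqlitePagingSketch::readNext,
                                // close: always release the connection when this inner source completes
                                Connection::close));
    }

    // Placeholder: run the next paginated query and map one row to a Foo.
    private static Optional<Foo> readNext(Connection connection) {
        return Optional.empty();
    }
}

Because flatMapConcat fully consumes each inner Source before opening the next one, only one connection is live at a time and elements are pulled on demand by downstream.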
This is the context:
There is an input event stream,
There are several methods to apply to the stream; each applies different logic to evaluate an event and labels it a "good" or "bad" event.
An event is really "good" only if it passes all the methods; otherwise it is a "bad" event.
There is an output event stream that carries each event's result and its eventID.
To solve this problem, I have two ideas:
We can apply each method sequentially to each event. But this is a kind of batch processing and doesn't exploit the advantages of stream processing; at the same time it takes Time(Method1) + Time(Method2) + Time(Method3) + ..., which may not be suitable for real-time processing.
We can pass the input stream to each method and run the methods in parallel; each method saves bad events to permanent storage, and then the main method queries that storage to get the result for each event. But this has some problems to solve:
how to execute the methods in parallel in the programming language (e.g. Scala), and what this means for performance (network, CPUs, memory)
how to solve the synchronization problem: the methods need some time to compute and save their flags to permanent storage, while the main method queries the flags much sooner, so a delay issue occurs.
etc.
This is more of a design question than a pure implementation one; I would like to ask for your ideas on how to solve the problem. Looking forward to your opinions.
Parallel streams, each doing the full set of evaluations sequentially, is the more straightforward solution. But if that introduces too much latency, then you can fan out the evaluations to be done in parallel, and then bring the results back together again to make a decision.
To do the fan-out, look at the split operation on DataStream, or use side outputs. But before doing this n-way fan-out, make sure that each event has a unique ID. If necessary, add a field containing a random number to each event to use as the unique ID. Later we will use this unique ID as a key to gather back together all of the partial results for each event.
Once the event stream is split, each copy of the stream can use a MapFunction to compute one of the evaluation methods.
Gathering all of these separate evaluations of a given event back together is a bit more complex. One reasonable approach here is to union all of the result streams together, and then key the unioned stream by the unique ID described above. This will bring together all of the individual results for each event. Then you can use a RichFlatMapFunction (using Flink's keyed, managed state) to gather the results for the separate evaluations in one place. Once the full set of evaluations for a given event has arrived at this stateful flatmap operator, it can compute and emit the final result.
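As an illustration, here is a minimal sketch of that gathering step using the Flink Java API. It assumes the fan-out produces one EvalResult (eventId, passed) per evaluation method and that there are NUM_EVALUATIONS methods; those names and the Tuple2<eventId, isGood> output are illustrative, not from the question.

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Hypothetical partial-result type: one instance per (event, evaluation method).
class EvalResult {
    public String eventId;
    public boolean passed;
}

public class GatherResults extends RichFlatMapFunction<EvalResult, Tuple2<String, Boolean>> {

    private static final int NUM_EVALUATIONS = 3;   // number of parallel evaluation methods

    private transient ValueState<Integer> arrived;  // how many partial results we have seen so far
    private transient ValueState<Boolean> allGood;  // conjunction of the partial results so far

    @Override
    public void open(Configuration parameters) {
        arrived = getRuntimeContext().getState(new ValueStateDescriptor<>("arrived", Integer.class));
        allGood = getRuntimeContext().getState(new ValueStateDescriptor<>("allGood", Boolean.class));
    }

    @Override
    public void flatMap(EvalResult result, Collector<Tuple2<String, Boolean>> out) throws Exception {
        int seen = arrived.value() == null ? 0 : arrived.value();
        boolean good = allGood.value() == null ? true : allGood.value();

        seen += 1;
        good = good && result.passed;

        if (seen == NUM_EVALUATIONS) {
            // All evaluations for this event have arrived: emit the final verdict and clear the state.
            out.collect(Tuple2.of(result.eventId, good));
            arrived.clear();
            allGood.clear();
        } else {
            arrived.update(seen);
            allGood.update(good);
        }
    }
}

// Wiring after the union, keyed by the unique ID described above:
// unionedResults.keyBy(r -> r.eventId).flatMap(new GatherResults());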
We are working on a very complex solution using Drools 6 (Fusion), and I would like your opinion on the best way to read objects created as correlation results over time.
My first basic approach was to read the Working Memory at regular intervals, looking for new objects and reporting them to an external service (REST).
An AgendaEventListener does not seem to be the "best" approach because I don't care about most of the objects being inserted into working memory, so maybe the best approach would be to inject a particular "object" into some sort of service inside the DRL. Is this a good approach?
You have quite a lot of options. In decreasing order of my preference:
AgendaEventListener is probably the solution requiring the smallest amount of LOC (see the sketch after this list). It might be useful for other tasks as well; all you have on the negative side is one additional method call and a class test per inserted fact. Peanuts.
You can wrap the insert macro in a DRL function and collect inserted facts of class X in a global List. The problem you have here is that you'll have to pass the KieContext as a second parameter to the function call.
If the creation of a class X object is inevitably linked with its insertion into WM, you could register new objects in a static List inside class X, done in a factory method (or the constructor).
I'm putting your "basic approach" last because it requires many more cycles than the listener (#1) and tons of overhead for maintaining the set of X objects that have already been reported via REST.
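For option #1, here is a minimal sketch. Note that in the Drools 6 kie-api the callback for inserted facts lives on RuleRuntimeEventListener (AgendaEventListener covers rule matches and firings), but the idea of one callback plus a class test is the same. RestClient is a placeholder for whatever you use to call the external service.

import org.kie.api.event.rule.DefaultRuleRuntimeEventListener;
import org.kie.api.event.rule.ObjectInsertedEvent;

public class NewFactListener extends DefaultRuleRuntimeEventListener {

    /** Placeholder for whatever client pushes results to the external REST service. */
    public interface RestClient {
        void report(Object newFact);
    }

    private final Class<?> interestingType;
    private final RestClient restClient;

    public NewFactListener(Class<?> interestingType, RestClient restClient) {
        this.interestingType = interestingType;
        this.restClient = restClient;
    }

    @Override
    public void objectInserted(ObjectInsertedEvent event) {
        Object fact = event.getObject();
        if (interestingType.isInstance(fact)) { // the "class test per inserted fact"
            restClient.report(fact);            // hand the new object straight to the external service
        }
    }
}

// Registration on the session you already run:
// kieSession.addEventListener(new NewFactListener(CorrelationResult.class, restClient));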
I'm interested in a scenario where a document is fetched from the database, some computations are run based on some external conditions, one of the fields of the document gets updated and then the document gets saved, all in a system that might have concurrent threads accessing the DB.
To make it easier to understand, here's a very simplistic example. Suppose I have the following document:
{
  ...
  items_average: 1234,
  last_10_items: [10, 2187, 2133, ...]
  ...
}
Suppose a new item (X) comes in; five things need to be done:
read the document from the DB
remove the first (oldest) item in the last_10_items
add X to the end of the array
re-compute the average* and save it in items_average.
write the document to the DB
* NOTE: the average computation was chosen as a very simple example, but the question should take into account more complex operations based on data existing in the document and on new data (i.e. not something solvable with the $inc operator)
This certainly is easy to implement in a single-threaded system, but in a concurrent system, if two threads follow the above steps, inconsistencies might occur, since both will update the last_10_items and items_average values without taking the concurrent changes into account (or will simply overwrite them).
So, my question is: how can such a scenario be handled? Is there a way to check, or react upon, the fact that the underlying document was changed between steps 1 and 5? Is there such a thing as WATCH from Redis or a 'concurrent modification error' from relational DBs?
Thanks
Database systems use an inspection-and-rollback scheme that is similar to transactional memory.
Briefly speaking, it monitors the shared memory regions you specify and does something like compare-and-swap, load-link/store-conditional, or test-and-set.
Therefore, if any of that memory is changed during the transaction, it aborts and retries until there is no conflicting operation on that shared memory.
For example, GCC provides the following builtins:
https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
type __sync_lock_test_and_set (type *ptr, type value, ...)
type __sync_val_compare_and_swap (type *ptr, type oldval, type newval, ...)
For more info about transactional memory,
http://en.wikipedia.org/wiki/Software_transactional_memory
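Translated to the MongoDB scenario in the question, the same compare-and-swap/retry idea is usually implemented as optimistic locking: store a version field in the document and make the write conditional on the version you read. Here is a sketch with the MongoDB Java driver; the version field, the addItem method and the retry loop are assumptions for illustration, not part of the original document.

import java.util.ArrayList;
import java.util.List;

import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;

public class OptimisticUpdateSketch {

    public static void addItem(MongoCollection<Document> stats, Object docId, int newItem) {
        while (true) {
            // 1. Read the document and remember the version we saw.
            Document doc = stats.find(Filters.eq("_id", docId)).first();
            long version = doc.getLong("version");

            // 2-4. Compute the new state from the data we just read (steps 2-4 of the question).
            @SuppressWarnings("unchecked")
            List<Integer> items = new ArrayList<>((List<Integer>) doc.get("last_10_items"));
            items.remove(0);        // drop the oldest item
            items.add(newItem);     // append the new one
            double average = items.stream().mapToInt(Integer::intValue).average().orElse(0);

            Document updated = new Document(doc)
                    .append("last_10_items", items)
                    .append("items_average", average)
                    .append("version", version + 1);

            // 5. Conditional write: only replaces the document if nobody bumped the version meanwhile.
            long modified = stats.replaceOne(
                    Filters.and(Filters.eq("_id", docId), Filters.eq("version", version)),
                    updated).getModifiedCount();

            if (modified == 1) {
                return;             // no concurrent change happened between our read and our write
            }
            // Someone else modified the document in between: abort this attempt and retry.
        }
    }
}

If the conditional write reports zero modified documents, you know the document changed between steps 1 and 5, which is exactly the signal asked about above.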
I have an XML document with multiple nodes and sub-nodes, from which I am consuming data as input for multiple functions called from the main function.
I have a basic question about code optimization here.
Is it good to pass the XML object as an input to multiple functions, each of which consumes some data from the XML?
Is it better to pass the XML path to each function and instantiate the XML object inside each function?
Is there a way to pass just the node that a particular function requires? (In case I have 10 nodes and 10 functions, where each function requires just one particular node to consume data.)
Thanks
I would argue that it's better to pass only the specific arguments each function needs. The less broad your input, the simpler your input validation. Also, I'd strongly recommend avoiding repeatedly reading/parsing the same data; there's no benefit at all in doing that.
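As a sketch of the third option from the question: parse the file once in the main function and hand each function only the node it needs. The element names and handler methods below are made up for illustration.

import java.io.File;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class XmlDispatchSketch {

    public static void main(String[] args) throws Exception {
        // Parse once, up front: no function re-reads or re-parses the file.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new File("input.xml"));

        // Hand each function only the node it actually consumes.
        handleOrders(firstElement(doc, "orders"));
        handleCustomers(firstElement(doc, "customers"));
    }

    private static Element firstElement(Document doc, String tagName) {
        NodeList nodes = doc.getElementsByTagName(tagName);
        return nodes.getLength() > 0 ? (Element) nodes.item(0) : null;
    }

    private static void handleOrders(Element orders) {
        // ... consume data from the <orders> node only ...
    }

    private static void handleCustomers(Element customers) {
        // ... consume data from the <customers> node only ...
    }
}

Each function's signature then documents exactly what it depends on, and the relatively expensive parsing happens exactly once.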