Spring Batch Integration - Multiple files as single Message

In the sample https://github.com/ghillert/spring-batch-integration-sample, the file inbound adapter is configured to poll a directory; once a file is placed in that directory, a FileMessageToJobRequest is constructed and a Spring Batch job is launched.
So for each file, a new FileMessageToJobRequest is constructed and a new Spring Batch job instance is created.
We also want to configure a file inbound adapter to poll for files, but we want to process all the files in a single batch job instance.
For example, if we place 1000 files in the directory and set max-messages-per-poll to 1000, we want to send the names of the 1000 files as one of the parameters to the Spring Batch job instead of calling the job 1000 times.
Is there a way to send the list of files that the file inbound adapter picked up during one poll as a single message to the subsequent Spring components?
Thank You,
Regards
Suresh

Even if it is a single poll, the inbound-channel-adapter emits a separate message for each entry.
So, to collect them into a single message you need to use an <aggregator>.
With that, though, you have to come up with a ReleaseStrategy. Even if you can just use 1 as the correlationKey, there is still the question of when to release the group.
You won't always have exactly 1000 files there to group into a single message, so a TimeoutCountSequenceSizeReleaseStrategy may be a good compromise: it emits the result after some timeout, even if you don't have enough files to complete the group by size.
HTH
UPDATE
You can consider using group-timeout on the <aggregator> to release groups even when no new message arrives during that period.
In addition, there is an expire-groups-upon-completion option to make sure your "single" group is cleared and removed after each release, allowing a new group to form for the next poll.
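For illustration, here is a rough Java DSL equivalent of that aggregator, feeding a single JobLaunchRequest whose parameter carries all the file names. This is only a sketch: the channel names, the 1000-file/5-second thresholds, and the input.file.names parameter are assumptions, and you would still wire a JobLaunchingGateway (or the sample's job-launching service activator) to the jobRequests channel.

import java.io.File;
import java.util.List;
import java.util.stream.Collectors;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.integration.launch.JobLaunchRequest;
import org.springframework.context.annotation.Bean;
import org.springframework.integration.dsl.IntegrationFlow;
import org.springframework.integration.dsl.IntegrationFlows;

@Bean
public IntegrationFlow filesToSingleJob(Job importJob) {
    return IntegrationFlows.from("filesIn") // fed by the file inbound adapter
            .aggregate(a -> a
                    .correlationExpression("1")        // one group for all files
                    .releaseExpression("size() == 1000")
                    .groupTimeout(5000)                // release early when no new file arrives for 5s
                    .sendPartialResultOnExpiry(true)   // emit the partial group on that timeout
                    .expireGroupsUponCompletion(true)) // clear the group so the next poll starts fresh
            .<List<File>, JobLaunchRequest>transform(files ->
                    new JobLaunchRequest(importJob,
                            new JobParametersBuilder()
                                    .addString("input.file.names", files.stream()
                                            .map(File::getAbsolutePath)
                                            .collect(Collectors.joining(",")))
                                    .toJobParameters()))
            .channel("jobRequests")
            .get();
}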

Related

Aggregator pattern in Mirth

I am receiving a large number of correlated HL7 messages in Mirth. They contain an ID which is always the same for all correlated messages, and they always arrive within a minute. Multiple batches can be received at the same time. It's hard to say when a batch ends, but when there are no more messages for a minute, it's safe to assume that the batch has finished.
How could I implement an aggregator pattern in Mirth that would keep reading and completing correlated messages and send the completed message after it didn't receive any new messages with the same ID within a defined time interval?
You may drop all incoming messages into a folder and store the message ID in a Global Map. Once new messages start to arrive with a message ID different from the one stored in the map (meaning that the next sequence has started), trigger another channel, either by sending it the message ID it needs to look for or in some other way. After that, replace the message ID in the Global Map with the message ID of the new sequence.
If that sounds too complicated, you may do the same, but have the second channel constantly scan the folder (File Reader) and grab only files having the same message ID that are older than a minute from the current time (which, to my mind, is too vague a qualifier).
I've implemented this by saving all messages in a folder, using an ID inside the message (one that identifies the sequence) as the file name. The file gets updated with each new message, and several sequences live together in the same folder.
The next channel uses a simple File Reader that only fetches files that are a minute or more old.
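Mirth channels are typically scripted in JavaScript, but the "a minute or more old" pick-up rule itself is simple; here is a minimal standalone sketch of that selection logic, written in Java for illustration (the class name and folder layout are assumptions):

import java.io.File;
import java.time.Instant;

public class StaleSequencePicker {

    // Files not modified for a minute are assumed to belong to a finished
    // sequence: no new correlated message has arrived to update them.
    public static File[] readyToProcess(File folder) {
        long cutoff = Instant.now().minusSeconds(60).toEpochMilli();
        return folder.listFiles(f -> f.isFile() && f.lastModified() < cutoff);
    }
}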

Batching the send operation on outbound adapter in Spring Integration

I have an outbound channel adapter (in this case SFTP, but it would be the same for JMS or WS) at the end of a Spring Integration flow. Because direct channels are used, every time a message flows through, it is sent out synchronously.
Now I need to process messages all the way until they reach the outbound adapter, but wait for a predetermined interval before sending them out; in other words, batching the send operation.
I know the Spring Batch project might offer a solution to this, but I need to find a solution with Spring Integration components (in the int-* namespaces).
What would be a typical pattern to achieve this?
The Aggregator pattern is for you.
In your particular case I'd call it a window, because you don't have any specific correlation to group messages by; you just need to build a batch, as you call it.
So your aggregator config may look like this:
<int:aggregator input-channel="input" output-channel="output"
        correlation-strategy-expression="1"
        release-strategy-expression="size() == 10"
        expire-groups-upon-completion="true"
        send-partial-result-on-expiry="true"/>
correlation-strategy-expression="1" means group all incoming messages together.
release-strategy-expression="size() == 10" forms and releases batches of 10 messages.
expire-groups-upon-completion="true" tells the aggregator to remove the released group from its store, which allows a new group to form for the same correlationKey (1 in our case).
send-partial-result-on-expiry="true" specifies that the normal release action (sending to the output-channel) must also be performed when the group is expired and we don't have enough messages to build a whole batch (10 in our case). For these options, please follow the documentation mentioned above.
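Since the question asks to wait for a predetermined interval before sending, note that the config above releases only by size; for the time-based part you can either set group-timeout on the <int:aggregator>, or schedule a MessageGroupStoreReaper against the aggregator's message store. A sketch of the reaper option (assuming the aggregator's message-store is exposed as a bean, @EnableScheduling is active, and all names and timeouts are made up):

import org.springframework.integration.store.MessageGroupStore;
import org.springframework.integration.store.MessageGroupStoreReaper;
import org.springframework.scheduling.annotation.Scheduled;

public class ReaperConfig {

    private final MessageGroupStoreReaper reaper;

    public ReaperConfig(MessageGroupStore aggregatorStore) {
        this.reaper = new MessageGroupStoreReaper(aggregatorStore);
        this.reaper.setTimeout(10000); // groups idle for 10s are expired
    }

    // expired groups are then released to the output-channel as partial
    // batches, thanks to send-partial-result-on-expiry="true"
    @Scheduled(fixedRate = 5000)
    public void reap() {
        this.reaper.run();
    }
}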

Spring batch/integration dynamic poller/trigger

We have a job that polls for a file and a DB every M-F between 1PM and 5PM, using a cron expression. During this window, if the file arrives, the job downloads it and is invoked. This is working fine, and we have used Spring Integration and Spring Batch.
Now we need some customization: we have multiple jobs, where job1 should poll as above and, once the file is processed successfully, stop polling.
The second requirement is that, in case the file does not arrive during the polling period, we want to send a notification to the ops team so that they can take action.
Would that help? Exit Spring Integration when no more messages
You would be able to implement custom behavior in that advice, based on the polling result and the time of day.
Gary also mentions that conditional pollers are coming in the next version:
http://docs.spring.io/spring-integration/docs/4.2.0.BUILD-SNAPSHOT/reference/html/messaging-channels-section.html#conditional-pollers
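In the meantime, here is a sketch of the advice-based approach (the class name and the notification hook are made up; in a poller <advice-chain> the wrapped poll task returns a Boolean indicating whether a message was processed, which is worth verifying against your version):

import java.time.LocalTime;
import org.aopalliance.intercept.MethodInterceptor;
import org.aopalliance.intercept.MethodInvocation;

public class PollingWindowAdvice implements MethodInterceptor {

    private volatile boolean fileSeen;

    @Override
    public Object invoke(MethodInvocation invocation) throws Throwable {
        Object result = invocation.proceed(); // the actual poll
        if (Boolean.TRUE.equals(result)) {
            this.fileSeen = true; // job1's trigger could be stopped from here
        }
        else if (!this.fileSeen && LocalTime.now().isAfter(LocalTime.of(16, 59))) {
            // near the end of the 1PM-5PM window and still no file;
            // real code would also reset fileSeen daily and alert only once
            notifyOpsTeam();
        }
        return result;
    }

    private void notifyOpsTeam() {
        // assumption: send mail / raise an alert for the ops team here
    }
}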

Retry failed writing operations without delaying other steps in Spring Batch application

I am maintaining a legacy application written with Spring Batch and need to tweak it to never lose data.
I have to read from various webservices (one for each step) and then write to a remote database. Things go bad when the connection with the DB drops, because all items read from the webservice are discarded (the same item can't be read twice), and the data is lost because it cannot be written.
I need to set up Spring Batch to keep the already-read data of a step and retry the writing operation the next time the step runs. The same step must not read more data until the write operation has completed successfully.
When it is not able to write, the step should keep the read data and pass execution to the next step; after a while, when it's time for the failed step to run again, it should not read another item, retrying the failed writing operation instead.
The batch application should run in an infinite loop, and each step should gather data from a different source. Failed writing operations should be momentarily skipped (keeping the read data) so as not to delay other steps, but should resume from the write operation the next time they are called.
I have been researching various web sources aside from the official docs, but Spring Batch doesn't have the most intuitive documentation I have come across.
Can this be achieved? If yes, how?
You can write the data you need to persist, in case the job fails, to the batch step's ExecutionContext, and then restart the job with this data:
Step executions are represented by objects of the StepExecution class.
Each execution contains a reference to its corresponding step and
JobExecution, and transaction related data such as commit and rollback
count and start and end times. Additionally, each step execution will
contain an ExecutionContext, which contains any data a developer needs
persisted across batch runs, such as statistics or state information
needed to restart
More from: http://static.springsource.org/spring-batch/reference/html/domain.html#domainStepExecution
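One way to realize that idea, as a minimal sketch with a String item type (the class name, the pending.items key, and writeToRemoteDb are all made up): a writer that also implements ItemStream can stash read-but-unwritten items in the ExecutionContext, so a restart retries the write instead of re-reading.

import java.util.ArrayList;
import java.util.List;
import org.springframework.batch.item.ExecutionContext;
import org.springframework.batch.item.ItemStream;
import org.springframework.batch.item.ItemWriter;

public class ResumableDbWriter implements ItemWriter<String>, ItemStream {

    private static final String PENDING_KEY = "pending.items"; // made-up key

    private List<String> pending = new ArrayList<>();

    @Override
    @SuppressWarnings("unchecked")
    public void open(ExecutionContext context) {
        if (context.containsKey(PENDING_KEY)) {
            // restart: pick up items that were read but never written
            this.pending = (List<String>) context.get(PENDING_KEY);
        }
    }

    @Override
    public void write(List<? extends String> items) throws Exception {
        this.pending.addAll(items);
        writeToRemoteDb(this.pending); // throws when the DB connection drops
        this.pending.clear();          // only cleared after a successful write
    }

    @Override
    public void update(ExecutionContext context) {
        // called before each commit; Spring Batch persists this context
        context.put(PENDING_KEY, this.pending);
    }

    @Override
    public void close() {
    }

    private void writeToRemoteDb(List<String> items) throws Exception {
        // assumption: the actual remote-DB write lives here
    }
}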
I do not know if this will work for you, but here are my thoughts on your configuration.
Since you have two remote sources that are open to failure, let us partition the overall system into two jobs (not two steps):
JOB A
Step 1: Tasklet
Check a shared folder for files. If files exist, do not proceed to the next step. (This will make more sense in the description of JOB B; a decider sketch for this check appears at the end of this answer.)
Step 2: Webservice to files
Read from your web service and write the results to flat files in the shared folder. Since you use flat files for the output, you avoid the situation where "all items read from the webservice are discarded and the data is lost because it cannot be written."
Use Quartz or equivalent for the scheduling of this job.
JOB B
Poll the shared folder for generated files and create a job launch request with the file (the file's location as a JobParameter). The Spring Integration project may help with this polling.
Step 1:
Read from the file, write the records to the remote DB, and move/delete the file if writing to the DB is successful.
No scheduling is needed, since job launching originates from the polled files.
Sample Execution
Time 0: No file in the shared folder
Time 1: Read from web service and write to shared folder
Time 2: JOB B's file polling occurs, and it tries to write to the DB.
If successful, the system continues to execute.
If not, when JOB A tries to execute at its scheduled time, it will skip reading from the web service since files still exist in the shared folder. It will keep skipping until JOB B consumes the files.
I did not want to go into implementation specifics, but Spring Batch can handle all of these situations. Hope this helps.
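As promised, here is a sketch of the JOB A / Step 1 check using a JobExecutionDecider (the folder path and status names are made up):

import java.io.File;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.job.flow.FlowExecutionStatus;
import org.springframework.batch.core.job.flow.JobExecutionDecider;

public class SharedFolderDecider implements JobExecutionDecider {

    private final File sharedFolder = new File("/data/shared"); // made-up path

    @Override
    public FlowExecutionStatus decide(JobExecution jobExecution, StepExecution stepExecution) {
        File[] files = this.sharedFolder.listFiles();
        if (files != null && files.length > 0) {
            // JOB B has not consumed the previous output yet: skip this run
            return new FlowExecutionStatus("FILES_PENDING");
        }
        return new FlowExecutionStatus("EMPTY");
    }
}

In the job's flow definition, EMPTY would route to the webservice step and FILES_PENDING would end the execution.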

Multiple Actors that writes to the same file + rotate

I have written a very simple webserver in Scala (based on Actors). Its purpose is to log events from our frontend server (such as when a user clicks a button or a page is loaded). The file will need to be rotated every 64-100 MB or so, and it will be sent to S3 for later analysis with Hadoop. The amount of traffic will be about 50-100 calls/s.
Some questions that pop into my mind:
How do I make sure that all actors can write to one file in a thread-safe way?
What is the best way to rotate the file after X MB? Should I do this in my code or from the filesystem? (If I do it from the filesystem, how do I verify that the file isn't in the middle of a write or that the buffer has been flushed?)
One simple method would be to have a single file-writer actor that serializes all writes to disk. You could then have multiple request-handler actors that feed it updates as they process logging events from the frontend server. You'd get concurrency in request handling while still serializing writes to your log file.
Having more than a single writer would open the possibility of concurrent writes, which would at best corrupt your log file. Basically, if you want something to be thread-safe in the actor model, it should be executed on a single actor, and your task is inherently serial at the point you write to disk. You could do something more involved, like merging log files coming from multiple actors at rotation time, but that seems like overkill. Unless you're generating that 64-100 MB in a second or two, I'd be surprised if extra threads doing I/O bought you anything.
Assuming a single writing actor, it's pretty trivial to track the amount written since the last rotation, and I don't think keeping that count in the actor's internal state versus polling the filesystem makes a difference one way or the other.
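To make the shape of this concrete, here is a sketch of the single-writer idea, in Java for illustration, with a single-threaded executor standing in for the writer actor (in Scala, each log(...) call would instead be a message sent to one actor; all names and the 64 MB threshold are assumptions):

import java.io.FileWriter;
import java.io.IOException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class SingleWriterLog {

    private static final long ROTATE_AT = 64L * 1024 * 1024; // ~64 MB

    // one thread = one "writer actor": all writes are serialized through it
    private final ExecutorService writer = Executors.newSingleThreadExecutor();

    private FileWriter out;
    private long written;   // rough count (chars, not bytes) since last rotation
    private int generation;

    public SingleWriterLog() throws IOException {
        this.out = new FileWriter(currentFileName());
    }

    // called concurrently by many request handlers; the I/O never races
    public void log(String event) {
        this.writer.execute(() -> append(event + '\n'));
    }

    // runs only on the writer thread, so no locking is needed
    private void append(String line) {
        try {
            this.out.write(line);
            this.written += line.length();
            if (this.written >= ROTATE_AT) {
                this.out.close();            // flush before rotating
                this.generation++;           // the closed file can now go to S3
                this.out = new FileWriter(currentFileName());
                this.written = 0;
            }
        }
        catch (IOException e) {
            e.printStackTrace();             // real code: proper error handling
        }
    }

    private String currentFileName() {
        return "events-" + this.generation + ".log";
    }
}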
You can use only one actor to handle the write requests coming from different threads; since all of the requests go through that actor, there will be no concurrency problems.
As for file rolling, if your write requests can be logged line by line, you can resort to log4j's or logback's RollingFileAppender. Otherwise, you can write your own, which is easy as long as you remember to lock the file before performing any delete or update operations.
Rolling usually means you rename the older files and the current file to other names and then create a new file with the current file name; that way, you always write to the file with the current file name.