Spring Integration JDBC inbound channel adapter - avoiding duplicate reads

I have a Spring Integration jdbc:inbound-channel-adapter which reads from a database. An important requirement is that the same rows are not read twice. One possible approach may be to use the update attribute to set a flag on the rows read using the same where clause as for the query attribute. The concern however would be that if an exception occurs further on in the workflow (transforming the result set using the row mapper, marshalling to XML, and then placing on an outbound queue for an external system), those rows would not be re-read when the application came back up. Is there a better strategy to use in this case with Spring Integration?
Another question would be that, given the above requirement, would Spring Batch offer a more robust solution, and if so, how would this be implemented?
Thanks

Looks like you should use the short TX and channel shift technique:
<int-jdbc:inbound-channel-adapter channel="executorChannel"/>
<int:channel id="executorChannel">
    <int:dispatcher task-executor="executor"/>
</int:channel>
With that, your message is shifted to a different thread, outside of the JDBC transaction, and that transaction is always committed. So any downstream issues won't affect your rows in the DB: they will be marked as processed and won't be read again.
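To make the effect of the thread shift concrete, here is a minimal plain-Java simulation. It does not use any Spring classes: the in-memory map, the class name ShortTxHandOffSketch and the method names are all made up for illustration, and the executor merely plays the role of the executor channel.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Simulation only: the "table" is an in-memory map of row id -> processed flag.
class ShortTxHandOffSketch {

    private final Map<Integer, Boolean> processedFlagById = new LinkedHashMap<>();

    ShortTxHandOffSketch(int rowCount) {
        for (int id = 1; id <= rowCount; id++) {
            processedFlagById.put(id, false);
        }
    }

    // Stands in for the adapter's query + update pair in one short transaction:
    // select the unflagged rows, flag them, and "commit" by returning.
    synchronized List<Integer> pollAndMark() {
        List<Integer> batch = new ArrayList<>();
        for (Map.Entry<Integer, Boolean> row : processedFlagById.entrySet()) {
            if (!row.getValue()) {
                batch.add(row.getKey());
                row.setValue(true);
            }
        }
        return batch;
    }

    // One poll cycle followed by a second poll. Downstream work runs on the
    // executor thread, after the flags are already committed, so a downstream
    // failure cannot undo them.
    static List<Integer> rowsLeftAfterOnePoll(int rowCount, Consumer<Integer> downstream) {
        ShortTxHandOffSketch db = new ShortTxHandOffSketch(rowCount);
        ExecutorService executor = Executors.newSingleThreadExecutor();
        for (Integer id : db.pollAndMark()) {
            executor.submit(() -> downstream.accept(id));
        }
        executor.shutdown();
        try {
            executor.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return db.pollAndMark(); // rows a second poll would pick up
    }
}
```

Even when every downstream call throws, the second poll returns an empty list. That is the guarantee against duplicate reads; it also means rows whose downstream processing failed are not picked up again, so they need separate handling, such as an error channel.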

Related

Spring batch entire Job in transaction boundary

I have a use-case for which I could use spring batch job which I could design in following ways.
1) First Way:
Step1 (Chunk oriented step): Read from the file —> filter, validate and transform the read row into DTO (data transfer object), if there are any errors, store errors in DTO itself —> Check if any of the DTOs has errors , if not write to Database. If yes, write to an error file.
However, problem with this way is - I need this entire JOB in transaction boundary. So if there is a failure in any of the chunks then I don’t want to write to DB and want to rollback all successful writes till that point in DB. Above way forces me to write rollback logic for all successful writes if there is a failure in any of the chunks.
2) Second way
Step 1 (Chunk oriented step): Read items from the file —> filter, validate and transform the read row in DTO (data transfer object). This does store the errors in the DTO object itself.
Step 2 (Tasklet): Read entire list (and not chunks) of DTOs created from step 1 —> Check if any of the DTOs has errors populated in it. If yes, then abort the writing to DB and fail the JOB.
In second way, I get all benefits of chunk processing and scaling. At the same time I have created transaction boundary for entire job.
PS: In both ways in their first step there won’t be any step failure, if there is failure; errors are stored in DTO object itself. Thus, DTO object is always created.
The question is: since I am new to Spring Batch, is the second way a good pattern? And is there a way to share data between steps, so that the entire list of DTOs is available to the second step (in the second way above)?

In my opinion, trying to process the entire file in a single transaction (ie a transaction at the job level) is not the way to go. I would proceed in two steps:
Step 1: process the input and write any errors to the error file.
Step 2: this step is conditional on step 1. If no errors were detected in step 1, save the data to the DB.
This approach does not require writing data to the database and rolling it back if there are errors (as option 1 in your description would). It only writes to the database when everything is OK.
Moreover, this approach does not require holding a list of items in memory as option 2 does, which could be inefficient in terms of memory usage and perform poorly if the file is big.
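The control flow of these two steps can be sketched in plain Java. This is a framework-free illustration, not Spring Batch code: the class name, the "empty record" validation rule and the list standing in for the database are all assumptions. In an actual job, the same condition would be a conditional transition between the two steps.

```java
import java.util.ArrayList;
import java.util.List;

// Framework-free sketch of "validate first, load only if clean".
class TwoStepImportSketch {

    // Step 1: check every line and collect error messages
    // (the returned list stands in for the error file).
    static List<String> validate(List<String> lines) {
        List<String> errors = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            if (lines.get(i).trim().isEmpty()) { // hypothetical validation rule
                errors.add("line " + (i + 1) + ": empty record");
            }
        }
        return errors;
    }

    // Step 2 runs only when step 1 found no errors. It re-reads the input
    // chunk by chunk instead of keeping DTOs from step 1 in memory.
    // Returns the number of records written.
    static int run(List<String> lines, List<String> database, int chunkSize) {
        if (!validate(lines).isEmpty()) {
            return 0; // nothing was written, so there is nothing to roll back
        }
        int written = 0;
        for (int start = 0; start < lines.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, lines.size());
            database.addAll(lines.subList(start, end)); // one chunk per "transaction"
            written += end - start;
        }
        return written;
    }
}
```

The price of this design is that the input is read twice, once to validate and once to load. In exchange, memory use stays bounded by the chunk size.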

Streamsets Transformer - JDBC Origin without offset column

I'm testing platforms that can allow any user to easily create data processing pipelines. This platform has to meet certain requirements and one of them is to be capable of moving data from Oracle/SQL Server to HDFS.
Streamsets Transformer (v3.11) meets all requirements including the one referred above. I just can't get it to work in a very specific case: When ingesting a table that contains no numeric columns.
In these cases I want the pipeline to process all data so, in the JDBC Origin, I enabled the "Skip Offset Tracking" property. I thought that by skipping the offset tracking there would be no need to set the "Offset Column" property (guess I was wrong).
JDBC_05 - Table doesn't have compatible primary key configuration - supporting exactly one column but table have 0
If a numeric column exists, a possible workaround is to set it as the offset column but I can't find a way of doing this when none exists.
Am I missing something?
Thanks

We are looking at providing this functionality in Transformer in a future release. I'll come back and update this answer with any news.
In the meantime, you might want to look at using StreamSets Data Collector for these tables. It does not have the 'numeric offset column' requirement.

Disable Retry When commitInterval = 1

The behavior we would like for the batch processing of our business entities is to roll back the failed transaction and not try again. I have read through the forum and it appears that this is not possible. We have set commitInterval=1 and tried the NeverRetryPolicy for this special case, but to no avail. The rationale I have read is that the writer does not know whether the list of items it receives is the initial attempt or a subsequent one after a failure.
Have I summarized this correctly, and is it true that Spring Batch does not currently support the behavior we are looking for?

Sounds like a candidate for Skip Logic
https://docs.spring.io/spring-batch/reference/html/configureStep.html
Check out these two sections in particular:
5.1.5 Configuring Skip Logic
5.1.7 Controlling Rollback
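To illustrate what the question is after (one attempt per item, roll back on failure, move on to the next item), here is a plain-Java sketch. It shows the desired semantics only and does not reproduce Spring Batch internals: the class name, the lists standing in for commit and rollback, and the skipLimit parameter are all assumptions. Whether a given Spring Batch configuration invokes the writer exactly once per item is something to verify against the sections above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Desired behaviour with a commit interval of 1: each item is attempted once;
// a failing item is skipped rather than retried.
class SkipWithoutRetrySketch {

    final List<String> committed = new ArrayList<>();
    final List<String> skipped = new ArrayList<>();

    void run(List<String> items, Consumer<String> writer, int skipLimit) {
        for (String item : items) {
            try {
                writer.accept(item);   // the only attempt for this item
                committed.add(item);   // stands in for "commit"
            } catch (RuntimeException failure) {
                skipped.add(item);     // stands in for "roll back", then move on
                if (skipped.size() > skipLimit) {
                    throw new IllegalStateException(
                            "skip limit " + skipLimit + " exceeded", failure);
                }
            }
        }
    }
}
```

The skip limit keeps a systematically failing job from silently skipping every item.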

Itemwriter output using same order that itemreader used to read file

We have a Spring Batch job that reads a file (FlatFileItemReader), processes it, and writes data to a queue (JmsItemWriter).
We have another job that reads the queue (JmsItemReader) and writes a file (FlatFileItemWriter). It's an asynchronous process (between the execution of the two jobs, a manual process must be performed).
The flat file content doesn't have a line identifier, and we use a multi-threaded approach when reading the file ("throttle-limit"). So the queued messages do not keep the order they had in the flat file.
The problem is that we should generate an output file that respects the original order. So line 33 of the incoming file should be line 33 of the outgoing file (it will have the contents of the original line, plus some data).
Does Spring Batch provide a "native" way to order the output, respecting the original read order? I say "native" because one solution we thought of is to create an additional step just to add a line number to the file and use it at the end, but I was wondering whether this would be reinventing the wheel...
We are using SB 3.0.3
TIA,
Bob

The use case you are describing asks that you maintain order across multiple jobs, which is not supported. In theory (while not guaranteed), a single, single-threaded step would retain the order of the input file.
Since you are reading in a multithreaded manner, there really isn't a good way to guarantee the order of the items as they are being read. The best you could do is synchronize the read method and add an id as the items are being read. If the bottleneck you're attempting to address with multithreading is in the processor or writer, this may not be a bad option.
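A minimal sketch of "synchronize the read method and add an id", in plain Java rather than against the Spring Batch reader interfaces. The class and method names are hypothetical; the iterator stands in for the flat file and the final sort stands in for whatever the second job does before writing.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Wraps a line source so that reading a line and numbering it is one atomic step.
class SequencedReaderSketch {

    static final class Numbered {
        final long lineNumber;
        final String line;

        Numbered(long lineNumber, String line) {
            this.lineNumber = lineNumber;
            this.line = line;
        }
    }

    private final Iterator<String> delegate;
    private long nextLineNumber = 1;

    SequencedReaderSketch(Iterator<String> delegate) {
        this.delegate = delegate;
    }

    // Only the read is serialized; processing and writing can stay multithreaded.
    synchronized Numbered read() {
        return delegate.hasNext() ? new Numbered(nextLineNumber++, delegate.next()) : null;
    }

    // Reads with several threads (arrival order is arbitrary), then restores
    // the original order by sorting on the assigned line number.
    static List<String> readConcurrentlyThenRestoreOrder(List<String> lines, int threads) {
        SequencedReaderSketch reader = new SequencedReaderSketch(lines.iterator());
        Queue<Numbered> arrived = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.submit(() -> {
                for (Numbered item = reader.read(); item != null; item = reader.read()) {
                    arrived.add(item);
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        List<Numbered> sorted = new ArrayList<>(arrived);
        sorted.sort(Comparator.comparingLong(n -> n.lineNumber));
        List<String> restored = new ArrayList<>();
        for (Numbered n : sorted) {
            restored.add(n.line);
        }
        return restored;
    }
}
```

The line number has to travel with the item through the queue, for instance as part of the message payload, so that the second job can sort on it before writing the output file.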

Preventing update loops for multiple databases using CDC

We have a number of legacy systems that we're unable to make changes to - however, we want to start taking data changes from these systems and applying them automatically to other systems.
We're thinking of some form of service bus (no specific tech picked yet) sitting in the middle, and a set of bus adapters (one per legacy application) to translate between database specific concepts and general update messages.
One area I've been looking at is using Change Data Capture (CDC) to monitor update activity in the legacy databases, and use that information to construct appropriate messages. However, I have a concern - how best could I, as a consumer of CDC information, distinguish changes applied by the application vs changes applied by the bus adapter on receipt of messages - because otherwise, the first update that gets distributed by the bus will get re-distributed by every receiver when they apply that change to their own system.
If I were implementing "poor man's" CDC, i.e. triggers, those triggers would execute within the context/transaction/connection of the original DML statements, so I could either design them to ignore one particular user (the user applying incoming updates from the bus), or set and detect a session property to similarly ignore certain updates.
Any ideas?

If I understand your question correctly, you're trying to define a message routing structure that works with a design you've already selected (using an enterprise service bus) and a message implementation that you can use to flow data off your legacy systems that only forward-ports changes to your newer systems.
The difficulty is you're trying to apply changes in such a way that they don't themselves generate a CDC message from the clients receiving the data bundle from your legacy systems. In fact, all you're concerned about is having your newer systems consume the data and not propagate messages back to your bus, creating unnecessary crosstalk that might exponentiate, overloading your infrastructure.
The secret is how MSSQL's CDC features reconcile changes as they propagate through the network. Specifically, note this caveat:
All the changes are logged in terms of LSN or Log Sequence Number. SQL
distinctly identifies each operation of DML via a Log Sequence Number.
Any committed modifications on any tables are recorded in the
transaction log of the database with a specific LSN provided by SQL
Server. The __$operation column values are: 1 = delete, 2 = insert, 3 =
update (values before update), 4 = update (values after update).
cdc.fn_cdc_get_net_changes_dbo_Employee gives us all the records net
changed falling between the LSN we provide in the function. We have
three records returned by the net_change function; there was a delete,
an insert, and two updates, but on the same record. In case of the
updated record, it simply shows the net changed value after both the
updates are complete.
For getting all the changes, execute
cdc.fn_cdc_get_all_changes_dbo_Employee; there are options either to
pass 'ALL' or 'ALL UPDATE OLD'. The 'ALL' option provides all the
changes, but for updates, it provides the after updated values. Hence
we find two records for updates. We have one record showing the first
update when Jason was updated to Nichole, and one record when Nichole
was updated to EMMA.
While this documentation is somewhat terse and difficult to understand, it appears that changes are logged and reconciled in LSN order. Competing changes should be discarded by this system, allowing your consistency model to work effectively.
Note also:
CDC is by default disabled and must be enabled at the database level
followed by enabling on the table.
Option B then becomes obvious: institute CDC on your legacy systems, then use your service bus to translate these changes into updates that aren't bound to CDC (using, for example, raw transactional update statements). This should allow for the one-way flow of data that you seek from the design of your system.
For additional methods of reconciling changes, consider the concepts raised by this Wikipedia article on "eventual consistency". Best of luck with your internal database messaging system.
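The "ignore one particular user" idea from the question can be sketched as a filter in the bus adapter. This is a hypothetical illustration: it assumes every captured change carries some origin marker (for example a last-modified-by column that the adapter sets when it applies incoming updates), which is an assumption about your schema rather than something CDC is known to supply. The class, field and user names are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Drops captured changes that the bus adapter itself applied, so that an
// update received from the bus is not published back onto it.
class ChangeOriginFilterSketch {

    // Hypothetical account used only by the adapter when applying bus messages.
    static final String BUS_ADAPTER_USER = "bus_adapter";

    static final class Change {
        final String table;
        final String changedBy; // assumed origin marker, see the note above

        Change(String table, String changedBy) {
            this.table = table;
            this.changedBy = changedBy;
        }
    }

    static List<Change> changesToPublish(List<Change> captured) {
        List<Change> outgoing = new ArrayList<>();
        for (Change change : captured) {
            if (!BUS_ADAPTER_USER.equals(change.changedBy)) {
                outgoing.add(change); // made by the legacy application: publish
            }
        }
        return outgoing;
    }
}
```

If the legacy schema cannot carry such a marker, the adapter would need another way to recognise its own writes, which this sketch does not cover.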
