Unexpected rollbackCount and calls of shouldSkip() during item write - spring-batch

Spring documentation (Pg. 46, Section: 5.1.7) says:
By default, regardless of retry or skip, any exceptions thrown from the ItemWriter will cause the transaction controlled by the Step to rollback. If skip is configured as described above, exceptions thrown from the ItemReader will not cause a rollback.
My commit interval is set to 10. So my understanding of the above paragraph is: if there is an error reading the 7th record out of a chunk of 10, that item will be skipped and the correct 9 records will be sent ahead by the itemReader.
However, if the 7th record causes an error during writing, none of the 10 records will be written and a rollback will happen.
However, when I include the thrown error in my skipPolicy, the itemWriter IS writing the remaining 9 records to the database, skipping the errored one. This contradicts what is mentioned above.
Can anyone please explain the concept of "skip during item writing"?
Also, even though a single error is thrown, I am getting the following:
SkipCount as -1 twice, then as 0 once, and again -1 once in my shouldSkip(Object, Throwable) method. -- I do not understand this behavior.
Also the rollback count is 2 -- what does it mean? Why is it 2?
@michael Would it be possible for you to explain the behavior using some scenario?
Like "I am reading 20 records from a file and writing them to a database after some processing. I have a skip policy set for some exception. What will happen if the exception occurs during read, process, or write -- how will the chunks be committed, how does the default retry work, how will the counts be updated, etc.?"
It would really be a big help for me, as I am still confused by the behavior.

From your use case description it seems that you are mixing different concepts.
You describe a skip scenario, but you seem to expect skip to work like a no-rollback scenario.
From the Spring Batch documentation:
skip:
errors encountered while processing should not result in Step failure,
but should be skipped instead
vs no-rollback:
If skip is configured as described above, exceptions thrown from the
ItemReader will not cause a rollback.
In my own words, skip means:
If the step encounters an error during read/process/write, the current chunk is rolled back and each item of the chunk is read/processed/written individually, without the bad item. Basically, Spring Batch falls back to a commit interval of 1 for the bad chunk and returns to the configured commit interval after the bad chunk.
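To make this concrete, here is a minimal sketch of a fault-tolerant step in Java config (Spring Batch 4 style). The Record item type, the reader/processor/writer beans and the choice of FlatFileParseException as the skippable exception are placeholders for illustration, not anything from your job:

import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.file.FlatFileParseException;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class SkipStepConfiguration {

    // Record is a placeholder domain type for the items being read and written.
    @Bean
    public Step skipAwareStep(StepBuilderFactory steps,
                              ItemReader<Record> reader,
                              ItemProcessor<Record, Record> processor,
                              ItemWriter<Record> writer) {
        return steps.get("skipAwareStep")
                .<Record, Record>chunk(10)              // commit interval of 10
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .faultTolerant()
                .skip(FlatFileParseException.class)     // which exceptions are skippable
                .skipLimit(5)                           // give up after 5 skips
                .build();
    }
}

With this configuration, a skippable error on the 7th item during reading only skips that item; the same error during writing rolls back the whole chunk of 10 and reprocesses it item by item, as described above.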
Also the rollback count is 2 -- what does it mean? Why is it 2?
From B.5. BATCH_STEP_EXECUTION:
ROLLBACK_COUNT: The number of rollbacks during this execution. Note
that this count includes each time rollback occurs,
including rollbacks for retry and those in the skip recovery procedure.
(emphasis mine)
Also, even though a single error is thrown, I am getting the following:
SkipCount as -1 twice, then as 0 once, and again -1 once in my
shouldSkip(Object, Throwable) method. -- I do not understand this
behavior.
I tried a simple skip job with both configuration styles, skip-policy and skip-limit with skippable-exception-classes; both worked identically with regard to rollback and skip counts
(the step metadata is all right, but shouldSkip(...) seems to be called a lot more often than expected).

I'd like to explain one issue you mentioned:
SkipCount as -1 twice, then as 0 once, and again -1 once in my shouldSkip(Object, Throwable) method. -- I do not understand this behavior.
I don't know which signature of the shouldSkip() method you are referring to, but the SkipPolicy interface has only one method, with the following signature:
boolean shouldSkip(Throwable t, int skipCount) throws SkipLimitExceededException;
This method decides whether the exception t, given the current skipCount, should be skipped or not.
Unfortunately, the programmers of Spring Batch misuse this method to test whether an exception is skippable in general, regardless of the current skip count. That's why there are several calls to this method with the skipCount parameter set to -1.
So don't be surprised by the behavior you saw.
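For illustration, here is a rough sketch of a custom SkipPolicy written with these probing calls in mind; MyBusinessException and the skip limit of 5 are made-up examples:

import org.springframework.batch.core.step.skip.SkipLimitExceededException;
import org.springframework.batch.core.step.skip.SkipPolicy;

public class TolerantSkipPolicy implements SkipPolicy {

    private static final int SKIP_LIMIT = 5;

    @Override
    public boolean shouldSkip(Throwable t, int skipCount) throws SkipLimitExceededException {
        // skipCount can be -1 when the framework merely probes whether the
        // exception type is skippable at all, so -1 is not treated as an error here.
        if (!(t instanceof MyBusinessException)) {   // MyBusinessException is hypothetical
            return false;
        }
        if (skipCount < SKIP_LIMIT) {
            return true;
        }
        throw new SkipLimitExceededException(SKIP_LIMIT, t);
    }
}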

Related

UpdateOne fails on client due to timeout, but MongoDB processes it anyway

One of my tests for a function that performs increments using the MongoDB driver for Go is randomly breaking in an unexpected way. Here's what the test does:
Create a proxy (with toxiproxy) to a local MongoDB instance.
Disable the proxy, so the database looks like it's down.
Run a function that does an update that increments a field, timing out after 100ms. If it fails, it keeps retrying every 100ms until the command succeeds.
Sleep 1 second.
Enable the proxy.
Wait for the function to complete and assert that the field has been incremented correctly - only once.
This test randomly breaks because sometimes that field gets incremented twice. I noticed that it happens when an update is retried just as the proxy gets enabled: the client code receives an "incomplete read of message header: context deadline exceeded" error, which makes it retry the command, but the previous attempt did in fact succeed, because the field ends up being incremented twice.
I took a look at the driver code and I guess it's timing out while reading the server response - perhaps the proxy is enabled just after the update has started and there isn't much timeout left for both write and read operations to complete.
Is there anything that I can do on my side to prevent this from happening? I tried to find a specific error to catch, but I couldn’t find any. Or is this something the driver itself is supposed to handle?
Any help is appreciated.
UPDATE: I looked closely at the error messages and noticed that, while the MongoDB instance was down, all errors were handshake failures. So I made sure the test pings the database before disabling the proxy to get the handshake out of the way, and the test stopped randomly breaking; it ran 1000 times flawlessly, at least. I assume the handshake itself takes time to complete and that contributes to the command timeout.
In general, if you can't read the response, you can't assume anything about the command's success, even if you know it reached the server.
If in your case it only matters whether the command reached the server, then read on.
Unfortunately the current state of the driver (v1.7.1) is not "sophisticated" enough to easily tell if the error is from reading the response.
I was able to reproduce your issue locally. Here is the error when a timeout happens reading the response:
mongo.CommandError{Code:0, Message:"connection(localhost:27017[-30]) incomplete read of message header: context deadline exceeded", Labels:[]string{"NetworkError", "RetryableWriteError"}, Name:"", Wrapped:topology.ConnectionError{ConnectionID:"localhost:27017[-30]", Wrapped:context.deadlineExceededError{}, init:false, message:"incomplete read of message header"}}
And here is the error when the timeout happens while writing the command:
mongo.CommandError{Code:0, Message:"connection(localhost:27017[-31]) unable to write wire message to network: context deadline exceeded", Labels:[]string{"NetworkError", "RetryableWriteError"}, Name:"", Wrapped:topology.ConnectionError{ConnectionID:"localhost:27017[-31]", Wrapped:context.deadlineExceededError{}, init:false, message:"unable to write wire message to network"}}
As you can see, in both cases mongo.CommandError is returned, with identical Code and Labels fields. Which leaves you having to analyze the error string (which is ugly and may "break" with future changes).
So the best you can do is check if the error string contains "incomplete read of message header", and if so, you don't have to retry. Hopefully this (error support and analysis) improves in the future.
If you are using the retryable writes as implemented by MongoDB 3.6+ and the respective drivers, this shouldn't happen. Each write is accompanied by a transaction number (not to be confused with client-side transactions as implemented by MongoDB 4.0+), and if the same transaction number is used in two consecutive writes there is only one write being done by the server.
This functionality has been around for years so unless you are using an ancient driver version you should already have it.
If you are performing write retries manually in your application rather than using the driver's retryable write functionality, you can end up writing twice, as you found out. The solution is to use the driver's retryable writes.
I had the same problem (running go.mongodb.org/mongo-driver v1.8.1 against MongoDB 4.4) and will leave my experience with this problem here.
To add to icza's solution:
You can also get the error "context deadline exceeded", so check for that as well.
A check for a context cancellation or deadline would look something like this:
if strings.Contains(err.Error(), "context") && (strings.Contains(err.Error(), " canceled") || strings.Contains(err.Error(), " deadline exceeded")) {
...
}
My solution to the problem was, instead of first checking whether there was an error, to first check whether there was a result from the operation.
Example:
result, err := database.collection.InsertOne(context, item)
if result != nil {
    return result.InsertedID, err
}
return nil, err
If the transaction did process it despite the error, you could add some compensation logic to undo the transaction.

How to avoid long delay before finally getting "40001 could not serialize access due to concurrent update"

We have a Postgres 12 system running one master and two async hot-standby replica servers, and we use SERIALIZABLE transactions. All the database servers have very fast SSD storage for Postgres and 64 GB of RAM. Clients connect directly to the master server if they cannot accept delayed data for a transaction. Read-only clients that accept data up to 5 seconds old use the replica servers for querying data. Read-only clients use REPEATABLE READ transactions.
I'm aware that because we use SERIALIZABLE transactions Postgres might give us false positive matches and force us to repeat transactions. This is fine and expected.
However, the problem I'm seeing is that randomly a single line INSERT or UPDATE query stalls for a very long time. As an example, one error case was as follows (speaking directly to master to allow modifying table data):
A simple single row insert
insert into restservices (id, parent_id, ...) values ('...', '...', ...);
stalled for 74.62 seconds before finally emitting the error
ERROR 40001 could not serialize access due to concurrent update
with error context
SQL statement "SELECT 1 FROM ONLY "public"."restservices" x WHERE "id" OPERATOR(pg_catalog.=) $1 FOR KEY SHARE OF x"
We log all queries exceeding 40 ms so I know this kind of stall is rare. Like maybe a couple of queries a day. We average around 200-400 transactions per second during normal load with 5-40 queries per transaction.
After finally getting the above error, the client code automatically released two savepoints, rolled back the transaction and disconnected from database (this cleanup took 2 ms total). It then reconnected to database 2 ms later and replayed the whole transaction from the start and finished in 66 ms including the time to connect to the database. So I think this is not about performance of the client or the master server as a whole. The expected transaction time is between 5-90 ms depending on transaction.
Is there some PostgreSQL connection or master configuration setting that I can use to make PostgreSQL return the error 40001 faster, even if it caused more transactions to be rolled back? Does anybody know if setting
set local statement_timeout='250'
within the transaction has dangerous side-effects? According to the documentation https://www.postgresql.org/docs/12/runtime-config-client.html, "Setting statement_timeout in postgresql.conf is not recommended because it would affect all sessions", but I could set the timeout only for transactions run by this client, which is able to automatically retry the transaction very fast.
Is there anything else to try?
It looks like someone had the parent row to the one you were trying to insert locked. PostgreSQL doesn't know what to do about that until the lock is released, so it blocks. If you failed rather than blocking, and upon failure retried the exact same thing, the same parent row would (most likely) still be locked and so would just fail again, and you would busy-wait. Busy-waiting is not good, so blocking rather than failing is generally a good thing here. It blocks and then unblocks only to fail, but once it does fail a retry should succeed.
An obvious exception to blocking-better-than-failing is when, on retry, you can pick a different parent row to retry with, if that makes sense in your context. In that case, maybe the best thing to do is to explicitly lock the parent row with NOWAIT before attempting the insert. That way you can perhaps deal with failures in a more nuanced way.
If you must retry with the same parent_id, then I think the only real solution is to figure out who is holding the parent row lock for so long, and fix that. I don't think that setting statement_timeout would be hazardous, but it also wouldn't solve your problem, as you would probably just keep retrying until the lock on the offending row is released. (Setting it on the other session, the one holding the lock, might be helpful, depending on what that session is doing while the lock is held.)

How to prevent Spring Batch from resubmitting chunk items in the event of a skippable exception thrown by an item writer?

I am in reference to a Spring Batch book by Manning publisher. To quote a paragraph from the book:
When an item reader throws a skippable exception, Spring Batch just
calls the read method again on the item reader to get the next item.
There’s no rollback on the transaction. When an item processor throws
a skippable exception, Spring Batch rolls back the transaction of the
current chunk and resubmits the read items to the item processor,
except for the one that triggered the skippable exception in the
previous run. Figure 8.3 shows what Spring Batch does when the item
writer throws a skippable exception. Because the framework doesn’t
know which item threw the exception, it reprocesses each item in the
chunk one by one, in its own transaction.
I would like to know what is the official Spring Batch terminology for the process described above whereby an item writer throws a skippable exception and the chunk is resubmitted one by one?
My item writer sends emails, and I want to ensure this process of resubmitting and reprocessing the chunk items one by one does not occur under any circumstance for any exception (it would resend emails already sent in the chunk, which would be an issue). How can I ensure that resubmission of the items (or whatever it is called) does not occur?
The best way is to NOT throw a skippable exception in the first place. If it's an exception you are explicitly creating due to expected business/validation rules, I would advise instead leveraging the "filter" pattern of just returning null from the processor. Alternatively, you could use a Classifier processor/writer to handle invalid records differently from valid ones.
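A short sketch of that filter pattern, assuming a made-up EmailRequest item type and validation rule:

import org.springframework.batch.item.ItemProcessor;

public class EmailFilteringProcessor implements ItemProcessor<EmailRequest, EmailRequest> {

    @Override
    public EmailRequest process(EmailRequest item) {
        // Returning null filters the item out: it never reaches the writer,
        // no exception is thrown, and the step's filter count is incremented
        // instead of triggering the one-by-one rescan of the chunk.
        if (item.getRecipient() == null) {   // placeholder validation rule
            return null;
        }
        return item;
    }
}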
If throwing the exception is unavoidable, I would suggest using try/catch within the ItemProcessor and handling it there.
As for an overview of what the framework does when a skippable exception is encountered, check out this older answer.

In Spring Batch, how to mark a record as a skipped record (without retry) during the writing phase

Spring Batch has a facility for declarative skip policies (i.e. skippable-exception-classes) to state that a particular record should be skipped in batch processing.
This is quite straightforward in the case of ItemReader and ItemProcessor (as they operate on a record-by-record basis).
However, in the case of ItemWriter, when writing a record fails (because of a DB constraint violation), I want to skip that record and let the other records go through.
As far as I have researched, I can implement this in two ways:
1) Throw the skippable exception, and Spring Batch will start the retry operation with one item per chunk; so if the original chunk size is 1000, the batch will call the writer (and the processor, if it is transactional) 1000 times (once for each record) and record the skip count for the item that fails with the skippable exception (which is most probably the same item that failed in the normal run).
2) The ItemWriter catches the SQLException and resumes processing with the next record until the end of the item list.
The second approach has the problem of losing the statistics about how many records did not go through (i.e. skipped records): the batch will record all the items as successfully written and hence update the write count with an incorrect value.
The first approach is a bit tricky in my use case, as it involves re-executing all the items (on the DB side we have complex SPs + triggers) and therefore takes unnecessarily more time.
I am looking for a legitimate alternative to retry that just records the skipped record count during the writing phase.
If there is none, I will go with the first option.
Thanks!
The commit-interval specifies after how many processed items the chunk transaction is committed:
<chunk ... commit-interval="10"/>
As you want to skip all the items that fail while being persisted to the DB, you need the commit-interval to be 1 in order to actually persist the good items and not have them rolled back along with a bad one.
Assuming the reader sends only one item to the processor (and not a list of 1000), the reader, processor, and writer are executed in order for each item. In this case option 2) is not useful, as the writer always receives only one item.
You can control how the skip count is incremented by calling StepContribution#incrementWriteCount and the other increment*Count methods from this class.
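As an unofficial sketch of option 2) that keeps the statistics roughly honest, the writer below catches the failure per item and bumps the step's write-skip count itself via the injected StepExecution. The Record type, the target table and the choice of DataIntegrityViolationException are placeholders, the write(List) signature is the pre-5.0 one, and whether manually adjusted counts fit your reporting needs should be verified for your Spring Batch version:

import java.util.List;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.annotation.BeforeStep;
import org.springframework.batch.item.ItemWriter;
import org.springframework.dao.DataIntegrityViolationException;
import org.springframework.jdbc.core.JdbcTemplate;

public class SkipCountingWriter implements ItemWriter<Record> {

    private final JdbcTemplate jdbcTemplate;
    private StepExecution stepExecution;

    public SkipCountingWriter(JdbcTemplate jdbcTemplate) {
        this.jdbcTemplate = jdbcTemplate;
    }

    @BeforeStep
    public void saveStepExecution(StepExecution stepExecution) {
        // called automatically when this writer is registered on the step
        this.stepExecution = stepExecution;
    }

    @Override
    public void write(List<? extends Record> items) {
        for (Record item : items) {
            try {
                jdbcTemplate.update("insert into target_table (id, payload) values (?, ?)",
                        item.getId(), item.getPayload());
            } catch (DataIntegrityViolationException e) {
                // swallow the bad record but account for it as a write skip
                stepExecution.setWriteSkipCount(stepExecution.getWriteSkipCount() + 1);
            }
        }
    }
}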

Does MongoDB fail silently if I don't check error codes?

I'm wondering if any persistence failure will go undetected if I don't check error codes? If so, what's the right way to write fast (asynchronously) while still detecting errors?
If you don't check for errors, your update is only fire-and-forget, and you'll indeed miss any errors that could arise. Please see MongoDB write concerns for the available write modes in MongoDB (sorry, I always fail to find the official, non-driver-related documentation; I really should bookmark it).
So with NORMAL you'll get at least connectivity errors, and with NONE no exceptions at all. If you want to be informed of exceptions, you have to use one of the other modes, which differ only in the persistence guarantee they give you.
You can't detect errors when running asynchronously, as this is against the intention. The connection that sent the write operation may already be closed or reused, so the error can't be sent back through that connection. Furthermore, only your actual code knows what to do if the write fails. As MongoDB doesn't offer a remote procedure call to asynchronously inform you about updates, you'll have to wait until the write has finished to a given stage.
So the fastest, but most unreliable, is SAFE, where the write has only happened in memory. JOURNAL gives you the security that it was written at least to disk. With FSYNC you'll have those changes persisted in your db on disk. REPLICA means at least two replicas have written it, and MAJORITY that more than half of your replicas have written it (with three replicas, which should be the default, these two don't differ).
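The constant names above (NONE, NORMAL, SAFE, FSYNC, REPLICA) come from old driver versions. As a rough sketch with the current MongoDB Java driver, picking a write concern and getting failures reported as exceptions looks roughly like this; the connection string, database and collection names are placeholders:

import com.mongodb.WriteConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class WriteConcernExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> collection = client.getDatabase("test")
                    .getCollection("items")
                    // wait until a majority of replica set members acknowledge the write
                    .withWriteConcern(WriteConcern.MAJORITY);
            // with an acknowledged write concern, failures surface here as exceptions
            collection.insertOne(new Document("_id", 1).append("name", "example"));
        }
    }
}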
The only way I see to get something like asynchronous behavior is to have a separate thread that performs all write operations synchronously. This thread could handle the actual update as well as call a class in case of a failure, to perform the operations needed to handle that failure. But I don't think that this is good application design.
Yes, depending on the error, it can fail silently if you don't check the returned error code. Waiting for the result is necessary if you want to detect errors. Your only other option would be for your app to occasionally tell the user: "Oops, remember when I acted like I saved your data a moment ago? Well, not really."