Concurrent modification should not be logged as ERROR? - wildfly

We have an application running on WildFly 17. I have a scenario, occurring occasionally, in which two background threads access the same entity:
Thread A deletes the entity (for good reason)
Thread B is working on slightly older data and attempts to update the entity
When thread B is the later of the two, it fails due to the concurrent modification. This works correctly. It is retried automatically and finds that nothing needs to be done anymore (because the entity has been deleted). That is the intended behavior when these two threads collide. All is fine!
However, I find that this is logged as ERROR by CMTTxInterceptor:
2020-03-31 16:51:35,463 +0200 ERROR: as.ejb3.invocation - WFLYEJB0034: EJB Invocation failed on component ... for method ... throws ...:
javax.ejb.EJBTransactionRolledbackException: Batch update returned unexpected row count from update [0]; actual row count: 0; expected: 1
at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInCallerTx(CMTTxInterceptor.java:203) [wildfly-ejb3-17.0.1.Final.jar:17.0.1.Final]
at org.jboss.as.ejb3.tx.CMTTxInterceptor.required(CMTTxInterceptor.java:364) [wildfly-ejb3-17.0.1.Final.jar:17.0.1.Final]
at org.jboss.as.ejb3.tx.CMTTxInterceptor.processInvocation(CMTTxInterceptor.java:144) [wildfly-ejb3-17.0.1.Final.jar:17.0.1.Final]
at org.jboss.invocation.InterceptorContext.proceed(InterceptorContext.java:422)
...
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [rt.jar:1.8.0_171]
at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)
Caused by: javax.persistence.OptimisticLockException: Batch update returned unexpected row count from update [0]; actual row count: 0; expected: 1
at org.hibernate.jpa.spi.AbstractEntityManagerImpl.wrapStaleStateException(AbstractEntityManagerImpl.java:1729) [hibernate-entitymanager.jar:5.0.12.Final]
... at org.jboss.as.ejb3.tx.CMTTxInterceptor.invokeInCallerTx(CMTTxInterceptor.java:185) [wildfly-ejb3-17.0.1.Final.jar:17.0.1.Final]
... 262 more
Caused by: org.hibernate.StaleStateException: Batch update returned unexpected row count from update [0]; actual row count: 0; expected: 1
at org.hibernate.jdbc.Expectations$BasicExpectation.checkBatched(Expectations.java:67) [hibernate-core.jar:5.0.12.Final]
at org.hibernate.jdbc.Expectations$BasicExpectation.verifyOutcome(Expectations.java:54) [hibernate-core.jar:5.0.12.Final]
at org.hibernate.engine.jdbc.batch.internal.NonBatchingBatch.addToBatch(NonBatchingBatch.java:46) [hibernate-core.jar:5.0.12.Final]
at org.hibernate.persister.entity.AbstractEntityPersister.delete(AbstractEntityPersister.java:3261) [hibernate-core.jar:5.0.12.Final]
...
It seems to me that this logging is incorrect, mostly because this is not an ERROR. Concurrent modification is normal and to be expected, and our application logic handles it. This log will distract my colleagues at the hotline.
Do you agree that the logging is incorrect, or am I missing something?
I think I will disable logging for "as.ejb3.invocation".

In your case, the exception was thrown because Hibernate detected that the entity previously fetched from the database had been changed (deleted) during the current transaction, so there was nothing left to update.
In this specific case I think you can ignore or swallow the exception. In other cases it may be good to know about this exception, so I recommend catching and handling the exception rather than disabling the log category as a whole.
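For example, a small retry wrapper around the EJB call could swallow only the expected optimistic-lock case and rethrow everything else. A minimal sketch (RetryingInvoker and the attempt limit are illustrative, not part of WildFly):

import javax.ejb.EJBTransactionRolledbackException;
import javax.persistence.OptimisticLockException;

public final class RetryingInvoker {

    // Retries the EJB call when it rolled back due to a concurrent
    // modification; any other failure is propagated unchanged.
    public static void invokeWithRetry(Runnable ejbCall, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            try {
                ejbCall.run();
                return;
            } catch (EJBTransactionRolledbackException e) {
                boolean expectedCollision = e.getCause() instanceof OptimisticLockException;
                if (!expectedCollision || attempt >= maxAttempts) {
                    throw e;
                }
                // The entity may have been deleted in the meantime; the
                // retried call will detect that and do nothing, as
                // described in the question.
            }
        }
    }
}

This keeps the collision visible to your own retry logic while the log category continues to report genuine failures.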

Related

How to further debug this dropped record in Apache Beam?

I am seeing intermittently dropped records (only for error messages, though, not for success ones). We have a test case that intermittently fails/passes because of a lost record. We are using org.apache.beam.sdk.testing.TestPipeline in the test case. This is the relevant setup code, which is as far as I have tracked the dropped record:
PCollectionTuple processed = records
    .apply("Process RosterRecord", ParDo.of(new ProcessRosterRecordFn(factory))
        .withOutputTags(TupleTags.OUTPUT_INTEGER, TupleTagList.of(TupleTags.FAILURE))
    );
errors = errors.and(processed.get(TupleTags.FAILURE));

PCollection<OrderlyBeamDto<Integer>> validCounts = processed.get(TupleTags.OUTPUT_INTEGER);
PCollection<OrderlyBeamDto<Integer>> errorCounts = errors
    .apply("Flatten Roster File Error Count", Flatten.pCollections())
    .apply("Publish Errors", ParDo.of(new ErrorPublisherFn(factory)));
The relevant code in ProcessRosterRecordFn.java is this:
if (dto.hasValidationErrors()) {
    RosterIngestError error = new RosterIngestError(record.getRowNumber(), record.toTitleValue());
    error.getValidationErrors().addAll(dto.getValidationErrors());
    error.getOldValidationErrors().addAll(dto.getOldValidationErrors());
    log.info("Tagging record row number=" + record.getRowNumber());
    c.output(TupleTags.FAILURE, new OrderlyBeamDto<>(error));
    return;
}
I see this "Tagging record row number" log for both rows that fail. After that, however, ErrorPublisherFn.java logs on its first line, immediately after receiving each message, yet SOMETIMES we only receive 1 of the 2 rows. When we receive both, the test passes. The test is very flaky in this regard.
Apache Beam is really annoying in its naming of threads (they are all the same name), so I added a thread hashcode to the logback pattern to get more insight, but I don't see anything, and ErrorPublisherFn could publish #4 on any thread anyway.
OK, so now the big question: how can I insert more instrumentation to figure out why this record is dropped INTERMITTENTLY?
Do I have to debug Apache Beam itself? Can I insert other functions, or make changes, to figure out why this error is 'sometimes' lost on some test runs and not others?
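For example, I imagine inserting a pass-through logging DoFn between stages (after the tagged output and after the Flatten) to pinpoint where the record disappears. A rough sketch (LogStageFn is my own placeholder name, not an existing class):

import org.apache.beam.sdk.transforms.DoFn;

// Pass-through debugging DoFn: logs each element (plus a thread hashcode,
// since Beam's thread names are all alike) and re-emits it unchanged.
public class LogStageFn<T> extends DoFn<T, T> {
    private final String stage;

    public LogStageFn(String stage) {
        this.stage = stage;
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
        System.out.println("[" + stage + "][thread#"
                + System.identityHashCode(Thread.currentThread()) + "] " + c.element());
        c.output(c.element());
    }
}

It would be applied with e.g. .apply("Log after flatten", ParDo.of(new LogStageFn<>("after-flatten"))); note that the generic element type can defeat Beam's coder inference, so an explicit .setCoder(...) on the result may be needed.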
EDIT: Thankfully, this set of tests is not testing errors upstream, so the line errors = errors.and(processed.get(TupleTags.FAILURE)); can be removed, which in turn forces me to remove .apply("Flatten Roster File Error Count", Flatten.pCollections()). With those 2 lines gone, the issue goes away for 10 test runs in a row (i.e. I can't completely say it is gone, with this flaky stuff going on). Are we doing something wrong in the join and flattening? I checked the error structure, and rowNumber is part of equals and hashCode, so there should be no duplicates, and I am not sure why it would fail intermittently if there were duplicate objects either.
What more can be done to debug here and figure out why this join is not working in the TestPipeline?
How can I get insight into the flatten and join so I can debug why we are losing an event, and why it is only 'sometimes' that we lose it?
Is this a windowing issue, even though our job started with a file to read in and we want to process that file? We wanted a constant Dataflow stream available, as Google kept running into limits, but perhaps this was the wrong decision?

Bypassing an EXC_BREAKPOINT crash to continue program execution

While testing my iOS app (it's a workout app), the app crashed (EXC_BREAKPOINT) as it was trying to save the workout data.
The crash was an index-out-of-range issue whereby the array count is one less than the workout seconds. (I should have started the seconds counter from 1 instead of 0.)
for i in 0...seconds {
    // i-1 falls outside the array bounds, triggering the crash
    let data = "\(i),\(dataArray.powerGenY[i-1]),\(dataArray.powerGenYAlt[i-1])\n"
    do {
        try data.appendToURL(fileURL: fileURL)
    }
    catch {
        print("Could not write data to file")
    }
}
Anyway, the error dropped me into LLDB. Is there any way I can use LLDB to bypass this error and continue execution?
Having worked out for an hour, I wasn't prepared to have this crash take my data along with it. Since the crash dropped me into LLDB, I wanted to see if there was any way to salvage the data by stepping over / bypassing the error, or changing the value of i, so that program execution could continue.
Initially I tried
(lldb) po i = 3327
error: <EXPR>:3:1: error: cannot assign to value: 'i' is immutable
i = 3327
^
but it won't let me change the value (i is immutable).
Then I tried thread jump, but it spewed an error about the target line being outside the current function:
(lldb) th j -l 29
error: CSVExport.swift:29 is outside the current function.
Finally, I went through this site, https://www.inovex.de/blog/lldb-patch-your-code-with-breakpoints/, and tried a few things. The one that helped was thread return:
The mentioned disadvantage of thread jump can be avoided by using a different technique to change control flow behaviour. Instead of manipulating the affected lines of code directly, the idea is to manipulate other parts of the program, which in turn results in the desired behaviour. For the given example this means changing the return value of can_pass() from 0 to 1. Which, of course, can be done through LLDB. The command to use is thread just like before, but this time with the subcommand return to prematurely return from a stack frame, thereby short-circuiting its execution.
Executing thread return 1 did the trick. It returned true (1) from the frame that hit the index-out-of-range issue and then continued execution at the next line of code.
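In concrete terms, the recovery amounted to something like this (reconstructed, so treat the exact session as illustrative):
(lldb) thread return 1
(lldb) continue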

Is it possible to avoid nested RetryLoop.callWithRetry calls so I get a consistent timeout?

I've configured a reasonable timeout using BoundedExponentialBackoffRetry, and generally it works as I'd expect if ZK is down when I make a call like create().forPath. But if ZK is unavailable when I call acquire on an InterProcessReadWriteLock, it takes far longer before it finally times out.
I call acquire, which is wrapped in RetryLoop.callWithRetry, and that goes on to call findProtectedNodeInForeground, which is also wrapped in RetryLoop.callWithRetry. If I've configured the BoundedExponentialBackoffRetry to retry 20 times, the inner retry tries 20 times for every one of the 20 outer retry loops, so it retries 400 times in total.
We really need a consistent timeout after which we fail. Have I done anything wrong, or is there any way around this? If not, I guess I'll call the troublesome methods in a new thread that I can kill after my own timeout, along the lines of the sketch below.
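Something like this is what I have in mind (a rough, untested sketch; the 30-second budget is arbitrary, and lock is the InterProcessReadWriteLock from the sample code below):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Bound the overall acquire time ourselves, since the nested retry loops
// multiply the configured retry count.
ExecutorService executor = Executors.newSingleThreadExecutor();
Future<?> acquisition = executor.submit(() -> {
    lock.readLock().acquire(); // may internally retry 20*20 times
    return null;
});
try {
    acquisition.get(30, TimeUnit.SECONDS); // our own consistent timeout
} catch (TimeoutException e) {
    acquisition.cancel(true); // interrupt the retrying thread
    // treat as "lock not acquired" and fail fast
}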
Here is the sample code to recreate it. I stick breakpoints at the lines following the comments, bring ZK down, and then let it continue and take the stacktrace whilst it's retrying.
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessReadWriteLock;
import org.apache.curator.retry.BoundedExponentialBackoffRetry;

public class GoCurator {
    public static void main(String[] args) throws Exception {
        CuratorFramework cf = CuratorFrameworkFactory.newClient(
                "localhost:2181",
                new BoundedExponentialBackoffRetry(200, 10000, 20)
        );
        cf.start();

        String root = "/myRoot";
        if (cf.checkExists().forPath(root) == null) {
            // Stacktrace A shows what happens if ZK is down for this call
            cf.create().forPath(root);
        }

        InterProcessReadWriteLock lock = new InterProcessReadWriteLock(cf, "/grant/myLock");
        // Stacktrace B shows the nested retry if ZK is down for this call
        lock.readLock().acquire();
        lock.readLock().release();

        System.out.println("done");
    }
}
Stacktrace A (if ZK is down when I'm calling create().forPath). This shows the single retry loop, so it exits after the correct number of attempts:
java.lang.Thread.State: WAITING
at java.lang.Object.wait(Object.java:-1)
at java.lang.Object.wait(Object.java:502)
at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1499)
at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1487)
at org.apache.zookeeper.ZooKeeper.getChildren(ZooKeeper.java:2617)
at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:242)
at org.apache.curator.framework.imps.GetChildrenBuilderImpl$3.call(GetChildrenBuilderImpl.java:231)
at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100)
at org.apache.curator.framework.imps.GetChildrenBuilderImpl.pathInForeground(GetChildrenBuilderImpl.java:228)
at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:219)
at org.apache.curator.framework.imps.GetChildrenBuilderImpl.forPath(GetChildrenBuilderImpl.java:41)
at com.gebatech.curator.GoCurator.main(GoCurator.java:25)
Stacktrace B (if ZK is down when I call InterProcessReadWriteLock#readLock#acquire). This shows the nested retry loop, so it doesn't exit until 20*20 attempts have been made:
java.lang.Thread.State: WAITING
at sun.misc.Unsafe.park(Unsafe.java:-1)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
at org.apache.curator.CuratorZookeeperClient.internalBlockUntilConnectedOrTimedOut(CuratorZookeeperClient.java:434)
at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:56)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100)
at org.apache.curator.framework.imps.CreateBuilderImpl.findProtectedNodeInForeground(CreateBuilderImpl.java:1239)
at org.apache.curator.framework.imps.CreateBuilderImpl.access$1700(CreateBuilderImpl.java:51)
at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1167)
at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1156)
at org.apache.curator.connection.StandardConnectionHandlingPolicy.callWithRetry(StandardConnectionHandlingPolicy.java:64)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:100)
at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:1153)
at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:607)
at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:597)
at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:575)
at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:51)
at org.apache.curator.framework.recipes.locks.StandardLockInternalsDriver.createsTheLock(StandardLockInternalsDriver.java:54)
at org.apache.curator.framework.recipes.locks.LockInternals.attemptLock(LockInternals.java:225)
at org.apache.curator.framework.recipes.locks.InterProcessMutex.internalLock(InterProcessMutex.java:237)
at org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:89)
at com.gebatech.curator.GoCurator.main(GoCurator.java:29)
This turns out to be a real, longstanding problem with how Curator uses retries. I have a fix and PR ready here: https://github.com/apache/curator/pull/346 - I'd appreciate more eyes on it.

Unexpected behavior from Drools' collectSet and difficulty effecting undo moves

What I actually wanted to do is have a Drools rule like so:
rule "globalRequiredPredecessorAfterMe"
when
$rpAll: Set(size>1) from accumulate (
Customer(vehicle!= null, vehicle.vehicleTyp != VehicleTyp.DUMMY, $rpAfterMe: requiredPredecessorsAfterMe);
collectSet($rpAfterMe)
)
then
scoreHolder.addMediumConstraintMatch(kcontext, - $rpAll.size()-1);
end
Unfortunately, on the second move in the construction heuristic (CH), Drools awards me with a
Exception in thread "main" java.lang.RuntimeException: java.lang.NullPointerException
at org.drools.core.rule.SingleAccumulate.reverse(SingleAccumulate.java:124)
...
Is this a feature?
I swallowed my pride and wrote a fantastically complex listener which is supposed to do what the Drools rule above did. With an assertionScoreDirectorFactory (an EasyScoreCalculator) I get score corruption. On closer inspection, I get a corruption in one of these two cases:
1. My planning entity's previousXXX is set to exactly the same value
2. An undo move is done
To investigate, I set the solution to the state just before the score corruption occurs, using createChangeMove from the SolutionBusiness class of OptaPlanner 7.4.1.Final. (On a side note, it would be really helpful if one could also set a MoveCountLimit in the SolverConfig.)
So I create the scoreDirector as follows
SolverFactory<MySolution> solverFactory = SolverFactory.createFromXmlResource(
SolverConfigXML);
solver = solverFactory.buildSolver();
ScoreDirectorFactory<MySolution> scoreDirectorFactory = solver.getScoreDirectorFactory();
scoreDirector = scoreDirectorFactory.buildScoreDirector();
scoreDirector.setWorkingSolution(unsolvedSolution);
within a JUnit 5 test, and then perform several ChangeMoves to mimic the solution just before the offending move occurs. However, when I perform the offending move from case 1 above,
// customer SUK0002030's previousVehicleOrCustomer is DUMMY_3
cm = createChangeMove(nameToCustomerMap.get("SUK0002030"), "previousVehicleOrCustomer", nameToVehicleMap.get("DUMMY_3"));
cm.doMove(scoreDirector);
I get a
java.lang.IllegalStateException: The entity (SUK0002030) has a variable (previousVehicleOrCustomer) with value (DUMMY_3) which has a sourceVariableName variable (nextCustomer) with a value (null) which is not that entity.
Verify the consistency of your input problem for that sourceVariableName variable.
at org.optaplanner.core.impl.domain.variable.inverserelation.SingletonInverseVariableListener.retract(SingletonInverseVariableListener.java:87)
...
When I set the previousVehicleOrCustomer to null, and then back to DUMMY_3, everything is fine and no score corruption occurs.
Similarly, when attempting to create an (offending) undo move of the previous move cm (case 2 above), like so:
cm.createUndoMove(scoreDirector).doMove(scoreDirector)
I get the same message:
java.lang.IllegalStateException: The entity (SUK0002014) has a variable (previousVehicleOrCustomer) with value (Vehicle_0) which has a sourceVariableName variable (nextCustomer) with a value (null) which is not that entity.
Verify the consistency of your input problem for that sourceVariableName variable.
at org.optaplanner.core.impl.domain.variable.inverserelation.SingletonInverseVariableListener.retract(SingletonInverseVariableListener.java:87)
...
Of course, if I manually create the UndoMove (by simply setting the previousVehicleOrCustomer back to null), everything is fine.
These moves really should be possible and I'm really curious to find out what's wrong.

javax.jcr.InvalidItemStateException: Item cannot be saved

I am getting the following exception in a single-box CQ5 author environment.
javax.jcr.InvalidItemStateException: Item cannot be saved
because node property has been modified externally
More exception details:
Caused by: javax.jcr.InvalidItemStateException: Unable to update a stale item: item.save()
at org.apache.jackrabbit.core.ItemSaveOperation.perform(ItemSaveOperation.java:262)
at org.apache.jackrabbit.core.session.SessionState.perform(SessionState.java:216)
at org.apache.jackrabbit.core.ItemImpl.perform(ItemImpl.java:91)
at org.apache.jackrabbit.core.ItemImpl.save(ItemImpl.java:329)
at org.apache.jackrabbit.core.session.SessionSaveOperation.perform(SessionSaveOperation.java:65)
at org.apache.jackrabbit.core.session.SessionState.perform(SessionState.java:216)
at org.apache.jackrabbit.core.SessionImpl.perform(SessionImpl.java:361)
at org.apache.jackrabbit.core.SessionImpl.save(SessionImpl.java:812)
at com.day.crx.core.CRXSessionImpl.save(CRXSessionImpl.java:142)
at org.apache.sling.jcr.resource.internal.helper.jcr.JcrResourceProvider.commit(JcrResourceProvider.java:511)
... 215 more
Caused by: org.apache.jackrabbit.core.state.StaleItemStateException: 3bec1cb7-9276-4bed-a24e-0f41bb3cf5b7/{}ssn has been modified externally
at org.apache.jackrabbit.core.state.SharedItemStateManager$Update.begin(SharedItemStateManager.java:679)
at org.apache.jackrabbit.core.state.SharedItemStateManager.beginUpdate(SharedItemStateManager.java:1507)
at org.apache.jackrabbit.core.state.SharedItemStateManager.update(SharedItemStateManager.java:1537)
at org.apache.jackrabbit.core.state.LocalItemStateManager.update(LocalItemStateManager.java:400)
at org.apache.jackrabbit.core.state.XAItemStateManager.update(XAItemStateManager.java:354)
at org.apache.jackrabbit.core.state.LocalItemStateManager.update(LocalItemStateManager.java:375)
at org.apache.jackrabbit.core.state.SessionItemStateManager.update(SessionItemStateManager.java:275)
at org.apache.jackrabbit.core.ItemSaveOperation.perform(ItemSaveOperation.java:258)
Here is the code sample:
adminResourceResolver = resourceResolverFactory.getAdministrativeResourceResolver(null);

Resource fundPageResource = adminResourceResolver.getResource(page.getPath() + "/jcr:content");
ModifiableValueMap homePageResourceProperties = fundPageResource.adaptTo(ModifiableValueMap.class);

homePageResourceProperties.put("ssn", person.getSsn());
adminResourceResolver.commit();
Any ideas? It's possible that multiple threads access this code, as multiple authors on multiple pages call it from an authored component.
Thank you,
Sri
This is an error you see often in CQ 5.5 (and it lessens with each version upwards). The root cause of this issue is that multiple processes/services modify the same resource in roughly the same timespan (usually using different sessions, sometimes even different users).
A small example to demonstrate, perhaps. Sessions A and B both have a reference to resource X. Session A modifies some properties on X, saves and commits, and is destroyed. This all goes smoothly. Session B still has a snapshot of the situation before A made its modifications; session B makes its own modifications, and all seems well UNTIL it tries to save. At this point, session B detects that it can't commit its changes because it doesn't have the latest node state: another session has made changes to the same node. In essence, the current node state conflicts with the modifications session A has made, so a stale-item exception is thrown. The reason for this exception is that the API doesn't know whether you want to keep the changes made by A, keep the changes made by the current session and discard those made by A, or merge them.
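The scenario above, expressed as code (a sketch; the repository handle, credentials, and path are placeholders):

import javax.jcr.Node;
import javax.jcr.Session;

Session a = repository.login(credentials);
Session b = repository.login(credentials);

// Both sessions obtain a view of the same node.
Node viaA = a.getNode("/content/x");
Node viaB = b.getNode("/content/x");

viaA.setProperty("prop", "fromA");
a.save(); // succeeds

viaB.setProperty("prop", "fromB");
b.save(); // throws InvalidItemStateException: B's view of the node is stale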
This error happens often with long-running sessions and with workflow/listener combinations. Therefore the recommendation is to keep sessions as short as possible, to prevent this kind of conflict as much as possible.
One way to deal with this is to call session.refresh(keepChangesBoolean) before calling .save(). This instructs the current session to check for updates made by other sessions and to deal with them according to the boolean flag you pass. It is not a guarantee, however, as it's still possible that yet another session commits between your refresh and your save call; it only lowers the odds of this exception occurring.
Another way to deal with this is to retry from scratch.
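Combining both suggestions, a minimal sketch of a refresh-and-retry save (the method name and retry count are illustrative, not an existing API):

import javax.jcr.InvalidItemStateException;
import javax.jcr.RepositoryException;
import javax.jcr.Session;

public static void refreshAndSave(Session session, int maxAttempts)
        throws RepositoryException {
    for (int attempt = 1; ; attempt++) {
        try {
            // Merge in changes made by other sessions, keeping our own
            // pending modifications (the 'true' flag).
            session.refresh(true);
            session.save();
            return;
        } catch (InvalidItemStateException e) {
            if (attempt >= maxAttempts) {
                throw e; // still stale after retrying; give up
            }
            // Another session modified the same items between our refresh
            // and save; loop and try again.
        }
    }
}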