We are seeing the strange behaviour described below in our instance when we run our scheduled concurrent programs.
Instead of the program going to the desired manager, it goes to a different manager and then errors out with the following error.
Error:
Routine CALL_STORED_PROCEDURE cannot initialize concurrent request
Routine AFPEOT cannot initialize concurrent request information
We checked the controlling manager process id in the FND_CONCURRENT_REQUESTS table and we don't see any process; it shows null.
When some of the concurrent programs (e.g. Customer Interface) are assigned to the concurrent manager, the manager's actual processes go down, and when we cancel the program the actual processes come back up again.
The actual process count keeps varying from the target process count; sometimes it is lower and sometimes it is 0.
This is an intermittent issue; we could not recognize any pattern as such.
We tried bouncing the database, application, and services, but nothing helped.
Error:
ORA-600 in database
I'm loading data into Redshift, which usually takes about an hour when successful, but it seems to time out randomly sometimes. I keep getting a "STARTED" status from DescribeStatement calls for my query, but when I look in the console it says the query was ABORTED and rolled back via an "Undoing 1 transactions on table ..." statement. However, I'm not finding any errors in STL_LOAD_ERRORS related to the query, or anything useful in STL_UTILITYTEXT for that transaction, though the STL_UNDONE view does show the rollback.
I would've expected DescribeStatement to report a "FAILED" or "ABORTED" status when this occurred, but that doesn't seem to be the case. Any idea what is causing the load to fail without any errors? Is there a way to catch/handle this via the Redshift Data API? I'm currently thinking of checking STL_UNDONE after a specified time, but I was hoping there's a better solution.
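A minimal sketch of the kind of polling loop involved (using the aws-sdk-go-v2 Redshift Data API client; the statement ID, poll interval and deadline are placeholders), with the deadline check standing in for the "check after a specified time" fallback:

package redshiftload

import (
	"context"
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/service/redshiftdata"
	"github.com/aws/aws-sdk-go-v2/service/redshiftdata/types"
)

// waitForStatement polls DescribeStatement until the statement finishes or fails,
// or until our own deadline passes. The deadline guards against the case where
// the API keeps reporting STARTED even though the load was aborted and rolled back.
func waitForStatement(ctx context.Context, client *redshiftdata.Client, statementID string, maxWait time.Duration) error {
	deadline := time.Now().Add(maxWait)
	for {
		out, err := client.DescribeStatement(ctx, &redshiftdata.DescribeStatementInput{
			Id: aws.String(statementID),
		})
		if err != nil {
			return err
		}
		switch out.Status {
		case types.StatusStringFinished:
			return nil
		case types.StatusStringFailed, types.StatusStringAborted:
			return fmt.Errorf("statement %s ended with status %s: %s",
				statementID, out.Status, aws.ToString(out.Error))
		}
		// Still SUBMITTED/PICKED/STARTED: give up once the deadline passes and
		// fall back to checking STL_UNDONE for a rollback.
		if time.Now().After(deadline) {
			return fmt.Errorf("statement %s still reports %s after %s", statementID, out.Status, maxWait)
		}
		time.Sleep(30 * time.Second)
	}
}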
Statement timeout seems like a likely cause. What you are describing sounds like the connection was closed out from under the executing statement. There are a number of places this timeout can come from, but a common one is the cluster configuration and the WLM configuration.
Another possibility is a network timeout. Database connections stay open for the entirety of the session, but while a statement is in flight there is no activity on the connection. Some network equipment sees this, assumes something is wrong, and closes the connection, which closes the session, which aborts the transaction in flight.
If your issue is caused by the connection closing, you may be able to line things up in stl_sessions. There is info in there about timeouts, and you can also check whether the time the session closed is right when the query commands aborted.
This is just one area that could be causing your issue, but it is more common than people think.
So after escalating to AWS support, it was confirmed there was a bug on their end, related to Data API autoscaling protocols that were sometimes scaling down without waiting for outstanding tasks to complete. There's a temporary fix in place to avoid this happening while they implement a long-term solution, which should hopefully be rolled out by the end of this month, June 2022.
One of my tests for a function that performs increments using the MongoDB driver for Go is randomly breaking in an unexpected way. Here's what the test does:
Create a proxy (with toxiproxy) to a local MongoDB instance.
Disable the proxy, so the database looks like it's down.
Run a function that does an update incrementing a field, timing out after 100ms. If it fails, it keeps retrying every 100ms until the command succeeds (sketched after this list).
Sleep 1 second.
Enable the proxy.
Wait for the function to complete and assert that the field has been incremented correctly - only once.
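A simplified sketch of such a retried increment (the collection, filter, and field names are placeholders):

package increment

import (
	"context"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
)

// incrementWithRetry keeps retrying a $inc update with a 100ms timeout per
// attempt until one attempt reports success. Sketch only; names are placeholders.
func incrementWithRetry(coll *mongo.Collection, id interface{}) {
	for {
		ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
		_, err := coll.UpdateOne(ctx,
			bson.M{"_id": id},
			bson.M{"$inc": bson.M{"counter": 1}})
		cancel()
		if err == nil {
			return
		}
		// The attempt failed (e.g. the proxy is disabled): wait and retry.
		time.Sleep(100 * time.Millisecond)
	}
}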
This test is randomly breaking because sometimes that field gets incremented twice. I noticed that it happens when an update is retried just as the proxy gets enabled: the client code receives an "incomplete read of message header: context deadline exceeded" error, which makes it retry the command, but the previous attempt had actually succeeded, because the field ends up being incremented twice.
I took a look at the driver code and I guess it's timing out while reading the server response - perhaps the proxy is enabled just after the update has started and there isn't much timeout left for both the write and read operations to complete.
Is there anything that I can do on my side to prevent this from happening? I tried to find a specific error to catch, but I couldn’t find any. Or is this something the driver itself is supposed to handle?
Any help is appreciated.
UPDATE: I looked closely at the error messages and noticed that, while the MongoDB instance was down, all errors were handshake failures. So I made sure the test pings the database before disabling the proxy to get the handshake out of the way, and the test stopped randomly breaking; it ran 1000 times flawlessly, at least. I assume the handshake itself takes time to complete and that contributes to the command timing out.
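The warm-up amounts to something like this (a sketch, assuming a *mongo.Client that is already connected through the still-enabled proxy):

package increment

import (
	"context"
	"testing"
	"time"

	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/readpref"
)

// warmUp pings the server through the still-enabled proxy so the connection
// handshake is out of the way before the proxy gets disabled.
func warmUp(t *testing.T, client *mongo.Client) {
	t.Helper()
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	if err := client.Ping(ctx, readpref.Primary()); err != nil {
		t.Fatalf("warm-up ping failed: %v", err)
	}
}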
In general, if you know the command went through to the server but you can't read the response, you can't assume anything about its success.
In some cases, though, it only matters whether the server got the command; if you only care about the command reaching the server, read on.
Unfortunately the current state of the driver (v1.7.1) is not "sophisticated" enough to let you easily tell whether the error came from reading the response.
I was able to reproduce your issue locally. Here is the error when a timeout happens while reading the response:
mongo.CommandError{Code:0, Message:"connection(localhost:27017[-30]) incomplete read of message header: context deadline exceeded", Labels:[]string{"NetworkError", "RetryableWriteError"}, Name:"", Wrapped:topology.ConnectionError{ConnectionID:"localhost:27017[-30]", Wrapped:context.deadlineExceededError{}, init:false, message:"incomplete read of message header"}}
And here is the error when the timeout happens while writing the command:
mongo.CommandError{Code:0, Message:"connection(localhost:27017[-31]) unable to write wire message to network: context deadline exceeded", Labels:[]string{"NetworkError", "RetryableWriteError"}, Name:"", Wrapped:topology.ConnectionError{ConnectionID:"localhost:27017[-31]", Wrapped:context.deadlineExceededError{}, init:false, message:"unable to write wire message to network"}}
As you can see, in both cases a mongo.CommandError is returned, with identical Code and Labels fields, which leaves you having to analyze the error string (which is ugly and may "break" with future changes).
So the best you can do is check whether the error string contains "incomplete read of message header", and if it does, you don't have to retry (the command most likely reached the server). Hopefully this (error support and analysis) improves in the future.
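For example (a sketch; it hinges on the v1.7.x error text, so treat it as fragile):

package increment

import "strings"

// commandMayHaveReachedServer reports whether the timeout happened while reading
// the server's response. In that case the write may already have been applied,
// so blindly retrying could apply it twice.
func commandMayHaveReachedServer(err error) bool {
	return err != nil && strings.Contains(err.Error(), "incomplete read of message header")
}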
If you are using retryable writes as implemented by MongoDB 3.6+ and the respective drivers, this shouldn't happen. Each write is accompanied by a transaction number (not to be confused with client-side transactions as implemented by MongoDB 4.0+), and if the same transaction number is used in two consecutive writes, only one write is performed by the server.
This functionality has been around for years, so unless you are using an ancient driver version you should already have it.
If you are performing write retries manually in your application rather than using the driver's retryable write functionality, you can end up writing twice, as you found out. The solution is to use the driver's retryable writes.
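For instance, with the Go driver the setting lives on the client options (a sketch; the URI is a placeholder, and note that retryable writes require a replica set or sharded cluster):

package increment

import (
	"context"
	"time"

	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// connect builds a client with retryable writes explicitly enabled (they are
// already the default in current driver versions).
func connect(uri string) (*mongo.Client, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	opts := options.Client().ApplyURI(uri).SetRetryWrites(true)
	return mongo.Connect(ctx, opts)
}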
I had the same problem (running go.mongodb.org/mongo-driver v1.8.1 against MongoDB 4.4) and will leave my experience with it here.
To add to icza's solution:
You can also get the error "context deadline exceeded", so check for that as well.
A check for a context abort would look something like this:
if strings.Contains(err.Error(), "context") && (strings.Contains(err.Error(), " canceled") || strings.Contains(err.Error(), " deadline exceeded")) {
    ...
}
My solution to the problem was, instead of first checking whether there was an error, to first check whether there was a result from the transaction.
Example:
result, err := database.collection.InsertOne(context, item)
if result != nil {
    return result.InsertedID, err
}
return nil, err
If the transaction was processed despite the error, you could add some compensation logic to undo it.
NEventStore: 5.1
Simple setup: WebApp (ASP.NET 4.5) == command side
I'm searching for the "right" way to avoid losing commands, with an eye on sagas/process-managers that might otherwise wait endlessly for an event produced by a command that was never actually handled.
Old: Dispatchers
I initially used sync commands, but with an eye on sagas/process-managers I thought it would be safer to first store them and then fetch them through the SyncDispatcher (or AsyncDispatcher). Otherwise - that's my concern - if a saga tried to send a command and the command didn't finish due to an app crash/power loss/..., it would be lost and no one would know.
So I created a command stream and appended each command to it. The IsDispatched flag showed whether that command had already been handled.
That worked.
PollingClient and Command-Stream
Now that the dispatchers are obsolete, I switched to the PollingClient. What I lost is the Dispatched information.
A startup issue arose:
I naively started polling forward from the current latest checkpoint, but when the application restarted there was a chance that commands had been stored but not executed before the crash, and they were therefore lost (that actually happened).
I just came across the idea:
store the basic outcome of commands as (non-domain-)events in another stream.
This stream would contain CommandSucceeded and CommandFailed events.
Whenever the application starts, the latest command-id or command-checkpoint-number gets extracted and used to load the commands right after that one...
Questions
Are my concerns that sync command handling leads to the danger of losing a saga-generated command wrong? If yes, why?
Is this generally a good idea: one big command stream?
Is this generally a good idea: store generic command-outcome-events in a stream?
You can:
Store your command in a command queue | persistent log
Use command id (guid) as Commit Id on NEventStore
Mark your command as executed in your Command Handler | Pipeline Hook | Polling Client
NEventStore gives you idempotency on the same AggregateId (streamid) + CommitId, so if your app crashes before the command is marked as processed and you replay your command, the resulting commits are automatically discarded by NES.
Afaik NEventStore is meant to be the storage for event sourcing, i.e. storing domain objects as a stream of events. Commands and sagas have nothing to do with it; it's your service bus which should take care of durability and saga management.
Personally, I treat the event store simply as a repository detail. The application service (command handler) will dispatch the generated events, after they've been persisted.
If the app crashes and the service bus is durable (not an in-memory one), then the event/command will be handled again automatically, because the service bus should detect that a message wasn't successfully handled. Of course, your message handlers should be idempotent for that reason.
When I try to start up my Ensemble production, I get the following error:
ERROR ErrCanNotAcquireRuntimeLock: Could not acquire Ensemble runtime
global lock within timeout '10'
I figured I would disable all the services, processes and operations and restart them individually to see which one is causing the error; however, any action I take on the production takes a very long time and then comes back with the same error.
Googling the issue did not yield much. Any ideas?
You should check the contents of your lock table while the production is not running -- it's likely that you have a job (or multiple jobs) that still holds locks on the core Ensemble runtime globals. If you can identify the OS-level process(es) and work out what they are actually doing, you should be able to terminate the OS processes. In both cases, you should perform this detection and termination from within Ensemble: you should be able to use the System Management Portal for both actions, or you can use the ^LOCKTAB and ^JOBEXAM CHUI utilities in the %SYS namespace to track this down.
If you can restart the Ensemble server, the lock table should be cleared. This, however, doesn't help you find the cause of your problem.
I'm wondering whether any persistence failure will go undetected if I don't check error codes. If so, what's the right way to write fast (asynchronously) while still detecting errors?
If you don't check for errors, your update is only fire-and-forget, and you'll indeed miss any errors that arise. Please see MongoDB WriteConcerns for the available write modes in MongoDB (sorry, I always fail to find the official, non-driver-related documentation; I really should bookmark it).
So with NORMAL you'll get at least connectivity errors; with NONE, no exceptions at all. If you want to be informed of exceptions, you have to use one of the other modes, which differ only in the persistence guarantee they give you.
You can't detect errors when running asynchronously, as that goes against the intention. The connection that sent the write operation may already be closed or reused, so the error can't be sent back through it. Furthermore, only your actual code knows what to do when a write fails. Since MongoDB doesn't offer a remote procedure call to inform you of the outcome asynchronously, you have to wait until the write has reached the given stage.
So the fastest, but least reliable, mode is SAFE, where the write has only happened in memory. JOURNAL guarantees that it was written at least to the journal on disk. With FSYNC the changes are persisted to the database files on disk. REPLICA means that at least two replicas have written it, and MAJORITY that more than half of your replicas have written it (with three replicas, which should be the default, these two don't differ).
The only way I see to get something like asynchronous writes is to have a separate thread that performs all write operations synchronously. That thread could handle the actual update, together with a class that is called on failure to perform whatever is needed to handle that failure. But I don't think this is good application design.
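A rough sketch of that pattern, using the current Go driver purely for illustration (the document type, buffer size and logging are placeholders; the default write concern is acknowledged, so InsertOne surfaces server-side errors):

package writer

import (
	"context"
	"log"

	"go.mongodb.org/mongo-driver/mongo"
)

// writeRequest carries one document plus a channel on which the caller can
// later pick up the outcome without blocking on the write itself.
type writeRequest struct {
	doc  interface{}
	done chan error
}

// startWriter runs a single background goroutine that performs all writes
// synchronously and reports each outcome back to the caller. What to do on
// failure (retry, compensate, alert) is left to the application.
func startWriter(ctx context.Context, coll *mongo.Collection) chan<- writeRequest {
	requests := make(chan writeRequest, 100)
	go func() {
		for req := range requests {
			_, err := coll.InsertOne(ctx, req.doc)
			if err != nil {
				log.Printf("insert failed: %v", err)
			}
			req.done <- err
		}
	}()
	return requests
}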
Yes, depending on the error, it can fail silently if you don't check the returned error code. It's necessary to wait for error checking. Your only other option would be for your app to occasionally tell the user "oops, remember when I acted like I saved your data a moment ago? Well, not really."