UpdateOne fails on client due to timeout, but MongoDB processes it anyway - mongodb

One of my tests for a function that performs increments using the MongoDB driver for Go is randomly breaking in an unexpected way. Here's what the test does:
1. Create a proxy (with toxiproxy) to a local MongoDB instance.
2. Disable the proxy, so the database looks like it's down.
3. Run a function that does an update that increments a field, timing out after 100ms. If it fails, it keeps retrying every 100ms until the command succeeds.
4. Sleep 1 second.
5. Enable the proxy.
6. Wait for the function to complete and assert that the field has been incremented correctly - only once.
This test is randomly breaking because sometimes that field gets incremented twice. I noticed that it happens when an update is retried just as the proxy gets enabled: the client code receives an "incomplete read of message header: context deadline exceeded" error, which makes it retry the command, but the previous attempt had in fact succeeded, so the field ends up being incremented twice.
I took a look at the driver code and I guess it's timing out while reading the server response - perhaps the proxy is enabled just after the update has started and there isn't much timeout left for both write and read operations to complete.
Is there anything that I can do on my side to prevent this from happening? I tried to find a specific error to catch, but I couldn’t find any. Or is this something the driver itself is supposed to handle?
Any help is appreciated.
UPDATE: I looked closely at the error messages and noticed that, while the MongoDB instance was down, all errors were handshake failures. So I made sure the test pings the database before disabling the proxy to get the handshake out of the way, and the test stopped randomly breaking; it ran 1000 times flawlessly, at least. I assume the handshake itself takes time to complete and that this eats into the command timeout.

In general, even if you know the command reached the server, you can't assume anything about its success if you can't read the response.
If all that matters to you is whether the command reached the server, then read on.
Unfortunately the current state of the driver (v1.7.1) is not "sophisticated" enough to easily tell if the error is from reading the response.
I was able to reproduce your issue locally. Here is the error when a timeout happens reading the response:
mongo.CommandError{Code:0, Message:"connection(localhost:27017[-30]) incomplete read of message header: context deadline exceeded", Labels:[]string{"NetworkError", "RetryableWriteError"}, Name:"", Wrapped:topology.ConnectionError{ConnectionID:"localhost:27017[-30]", Wrapped:context.deadlineExceededError{}, init:false, message:"incomplete read of message header"}}
And there is the error when the timeout happens writing the command:
mongo.CommandError{Code:0, Message:"connection(localhost:27017[-31]) unable to write wire message to network: context deadline exceeded", Labels:[]string{"NetworkError", "RetryableWriteError"}, Name:"", Wrapped:topology.ConnectionError{ConnectionID:"localhost:27017[-31]", Wrapped:context.deadlineExceededError{}, init:false, message:"unable to write wire message to network"}}
As you can see, in both cases mongo.CommandError is returned, with identical Code and Labels fields. That leaves you having to analyze the error string (which is ugly and may break with future driver changes).
So the best you can do is check if the error string contains "incomplete read of message header", and if so, you don't have to retry. Hopefully this (error support and analysis) improves in the future.

If you are using the retryable writes as implemented by MongoDB 3.6+ and the respective drivers, this shouldn't happen. Each write is accompanied by a transaction number (not to be confused with client-side transactions as implemented by MongoDB 4.0+), and if the same transaction number is used in two consecutive writes there is only one write being done by the server.
This functionality has been around for years so unless you are using an ancient driver version you should already have it.
If you are performing write retries in your application manually rather than using the driver's retryable write functionality, you can write twice as you found out. The solution is to use the driver's retryable writes.

I had the same problem (running on go.mongodb.org/mongo-driver v1.8.1 on a MongoDB 4.4) and will leave my experiences with this problem here.
To add to @icza's solution:
You can also get the error "context deadline exceeded", so check for that as well.
A check for a context abortion would look something like this:
if strings.Contains(err.Error(), "context") && (strings.Contains(err.Error(), " canceled") || strings.Contains(err.Error(), " deadline exceeded")) {
...
}
My solution to the problem was to check first whether the operation returned a result, rather than first checking whether there was an error.
Example:
result, err := database.collection.InsertOne(context, item)
if result != nil {
    return result.InsertedID, err
}
return nil, err
If the server did process the operation despite the error, you could add some compensation logic to undo it.

Related

Discrepancy between Redshift data api DescribeStatement status and console status

I'm loading data into redshift which usually takes about an hour when successful but seems to timeout randomly sometimes. I continue to get a "STARTED" status from DescribeStatement calls for my query but when I look in the console it says the query was ABORTED and rolled back via "Undoing 1 transactions on table ..." statement. But I'm not finding any errors in STL_LOAD_ERRORS related to the query or anything useful in STL_UTILITYTEXT for that transaction; though STL_UNDONE view does show the rollback.
I would've expected DescribeStatement to update with "FAILED" or "ABORTED" status when this occurred but that doesn't seem to be the case. Any idea what is causing the load to fail without any errors? Is there a way to catch/handle this via redshift data api? I'm currently thinking of checking STL_UNDONE after a specified time but was hoping there's a better solution.
Statement timeout seems like a likely cause. What you are describing sounds like the connection closed out from under the executing statement. There are a number of places where this timeout can come from but a common one is in the cluster configuration and the WLM configuration.
Another possibility is a network timeout. Database connections stay open for the entirety of the session, but while a statement is in flight there is no activity on the connection. Some network equipment sees this and assumes that something is wrong, then closes the connection, which closes the session, which aborts the transaction in flight.
If your issue is caused by the connection closing you may be able to line things up in stl_sessions. There is info in there about timeouts but also you can see if the time the session closes is right when the query commands abort.
Just one area that could be causing your issue but is more common than people think.
So after escalating to AWS support, it was confirmed there was a bug on their end, related to Data API autoscaling protocols that were sometimes scaling down without waiting for outstanding tasks to complete. There's a temporary fix in place to avoid this happening while they implement a long-term solution, which should hopefully be rolled out by the end of this month, June 2022.

Postgres: processes terminated after connection break / invalidation

I don't understand some of Postgres's mechanisms and it makes me quite upset.
I usually use DBeaver as an SQL client to query an external pg database. If I run create.. or insert.. queries and the connection is then broken or invalidated for some reason, the pid keeps running and finishes the transaction.
But for some more complicated PL/pgSQL functions (with temp tables, loops, inserts, etc.) that we wrote, breaking the connection always causes process termination (it disappears from the session list just before making the next SQL operation, e.g. inserting a row into a log table). It doesn't matter whether it's the DBeaver editor or the psql command.
I know that disconnecting may be a critical problem that should be eliminated, and maybe I shouldn't expect the process to continue successfully, but I do :) Or I'd at least like to know why it happens and whether it's possible to prevent it.
If the network connection fails, the database server can detect that in two ways:
if it tries to send data to the client, it will figure out pretty quickly that the connection is down
if it tries to receive data from the client, it will only notice when the kernel's TCP keepalive mechanism has determined that the connection is down
When you say that sometimes execution of a function is terminated right away, I would say that is because the function returned data to the client.
In the case where a query keeps running, it is not attempting to return any data yet.
There is no cure for the former, but in PostgreSQL v14 you can prevent the latter by setting client_connection_check_interval. In addition, you have to set the PostgreSQL keepalive parameters so that the dead connection becomes known quickly.
See my article for more.

Smartsheet API rate limit exceed

Last week we encountered a rate limit exceeded error (4003) for the first time in our nightly batch process. This batch process synchronises Smartsheet objects with our time tracking application 4TT.
This process has worked fine since 2016, but somehow this rate limit error now occurs and stops the synchronisation. With the help of the API documentation (and the blog post about rate limiting) I managed to change the code, putting in pauses when this error occurs. This took me quite a lot of time, as the error occurred in a different part of the synchronisation process every time.
Is there, or will there be, a way to let the API pause automatically when the rate limit is about to be exceeded, instead of changing the code every time? And for those who don't want this feature, could it be an optional boolean argument 'AutomaticallyPauseWhenRateLimitExceeds' (default false) when making the connection to the Smartsheet API?
You'll need to include logic in your code to effectively handle the rate limiting error -- there's no mechanism by which the Smartsheet API can automatically handle this situation for you.
A simple approach would be for you to include logic in your code such that when a rate limiting error is thrown, your code pauses execution for 60 seconds before continuing. Alternatively, a more sophisticated approach would be to implement exponential backoff logic in your code (an error handling strategy whereby you periodically retry a failed request with progressively longer wait times between retries, until either the request succeeds or a certain number of retry attempts is reached).
Implementing this type of error handling logic should not be difficult or tedious, provided that your code is structured in an efficient manner (i.e., error handling logic is encapsulated in a single location).
Additional note: The Smartsheet API Best Practices blog post (specifically the Be practical: Adhere to rate limiting guidelines section) contains info about this topic.
All our SDKs include error retry. So that's the easiest way to handle this situation. http://smartsheet-platform.github.io/api-docs/#sdks-and-sample-code
I ran into this and other interesting problems (in my lab) while updating the sheet, including issues caused by a poor Internet connection or limited bandwidth.
If you are unable to adapt your code to process the data in chunks, my suggestion is to use simple try/catch logic to pause the thread/task for 60 seconds and then try again.
using System;
using System.Threading;
...
... // all your code goes here
...
try
{
    // your code to save/update the sheet goes here
}
catch (Exception ex)
{
    Console.WriteLine(ex.Message);
    // pause for 60 seconds; run the save/update again afterwards
    Thread.Sleep(60000);
}
The next step is to work on notifications for when those errors happen.

How test if connection is down in txmongo

How can I test whether a query failed, for example due to connection problems? I am using the txmongo driver with Twisted.
My application takes something from RabbitMQ and puts it into MongoDB. I need to be able to detect such issues so that I don't acknowledge the message and it stays in the queue.
I looked at the code but couldn't find a proper way. I am new to Twisted.
There is a warning, RuntimeWarning("not connected"), but it's not an exception that can be caught. Maybe it's possible to test factory.size > 0.

Does MongoDB fail silently if I don't check error codes?

I'm wondering if any persistence failure will go undetected if I don't check error codes? If so, what's the right way to write fast (asynchronously) while still detecting errors?
If you don't check for errors, your update is only fire-and-forget. You'll indeed miss all errors which could arise. Please see MongoDB WriteConcerns for the available write modes in MongoDB (sorry, I always fail to find the official, non-driver-related documentation; I really should bookmark it).
So with NORMAL you'll get at least connectivity errors, with NONE no exceptions at all. If you want to be informed of exceptions you have to use one of the other modes, which differ only in the persistence guarantee they give you.
You can't detect errors when running asynchronously, as this goes against the intention. The connection which sent the write operation may already be closed or reused, so the error can't be sent back through that connection. Furthermore, only your actual code knows what to do if the write fails. As MongoDB doesn't offer a remote procedure call to inform you of updates asynchronously, you'll have to wait until the write has finished to a given stage.
So the fastest, but most unreliable, is SAFE, where the write has only happened in memory. JOURNAL gives you the security that it was written at least to the journal on disk. With FSYNC you'll have those changes persisted in your db on disk. REPLICA means that at least two replicas have written it, and MAJORITY that more than half of your replicas have written it (with three replicas, which should be the default, this doesn't differ from REPLICA).
The only chance I see to have something like asynchronous writes is to have a separate thread which performs all write operations synchronously. You could hand this thread the actual update together with a handler class that is called in case of a failure to perform whatever is needed to handle that failure. But I don't think that this is good application design.
Yes, depending on the error, it can fail silently if you don't check the returned error code. It's necessary to wait for error checking. Your only other option would be for your app to occasionally tell the user "oops, remember when I acted like I saved your data a moment ago? Well, not really."