How to test if the connection is down in txmongo - mongodb

How can I test whether a query succeeded or failed, for example due to connection problems? I am using the txmongo driver with Twisted.
My application takes messages from RabbitMQ and puts them into MongoDB. I need to be able to detect failures so that I don't acknowledge the message and it stays in the queue.
I looked at the code but couldn't find a proper way to do this. I am a newbie with Twisted.
There is a RuntimeWarning("not connected"), but it's not an exception that can be caught. Maybe it's possible to test factory.size > 0?

Related

Discrepancy between Redshift data api DescribeStatement status and console status

I'm loading data into Redshift, which usually takes about an hour when successful, but it seems to time out randomly sometimes. I keep getting a "STARTED" status from DescribeStatement calls for my query, but when I look in the console it says the query was ABORTED and rolled back via an "Undoing 1 transactions on table ..." statement. I'm not finding any errors in STL_LOAD_ERRORS related to the query, or anything useful in STL_UTILITYTEXT for that transaction, though the STL_UNDONE view does show the rollback.
I would've expected DescribeStatement to update with "FAILED" or "ABORTED" status when this occurred but that doesn't seem to be the case. Any idea what is causing the load to fail without any errors? Is there a way to catch/handle this via redshift data api? I'm currently thinking of checking STL_UNDONE after a specified time but was hoping there's a better solution.
A statement timeout seems like a likely cause. What you are describing sounds like the connection being closed out from under the executing statement. There are a number of places this timeout can come from, but a common one is the cluster configuration and its WLM settings.
Another possibility is a network timeout. Database connections stay open for the entirety of the session, but while a statement is in flight there is no activity on the connection. Some network equipment sees this, assumes something is wrong, and closes the connection, which ends the session and aborts the transaction in flight.
If your issue is caused by the connection closing, you may be able to line things up in stl_sessions. There is information in there about timeouts, and you can also check whether the session closed right when the query was aborted.
This is just one area that could be causing your issue, but it is more common than people think.
After escalating to AWS support, it was confirmed there was a bug on their end, related to Data API autoscaling protocols that were sometimes scaling down without waiting for outstanding tasks to complete. There's a temporary fix in place to avoid this happening while they implement a long-term solution, which should hopefully be rolled out by the end of this month, June 2022.

UpdateOne fails on client due to timeout, but MongoDB processes it anyway

One of my tests for a function that performs increments using the MongoDB driver for Go is randomly breaking in an unexpected way. Here's what the test does:
Create a proxy (with toxiproxy) to a local MongoDB instance.
Disable the proxy, so the database looks like it's down.
Run a function that does an update that increments a field, timing out after 100ms. If it fails, it keeps retrying every 100ms until the command succeeds (a rough sketch of this step follows the list).
Sleep 1 second.
Enable the proxy.
Wait for the function to complete and assert that the field has been incremented correctly - only once.
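For concreteness, the retrying update from step 3 might look roughly like this; it is only a sketch, and the collection handle, filter, and field name are made up:

package example

import (
	"context"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
)

// retryIncrement keeps retrying an $inc update with a 100ms timeout per
// attempt until an attempt returns without error. This mirrors the retry
// behaviour described in step 3; names here are illustrative only.
func retryIncrement(coll *mongo.Collection, filter bson.M) {
	for {
		ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
		_, err := coll.UpdateOne(ctx, filter, bson.M{"$inc": bson.M{"counter": 1}})
		cancel()
		if err == nil {
			return
		}
		time.Sleep(100 * time.Millisecond)
	}
}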
This test is randomly breaking because sometimes that field gets incremented twice. I noticed that it happens when an update is retried just as the proxy gets enabled: the client code receives an "incomplete read of message header: context deadline exceeded" error, which makes it retry the command, but the previous attempt had in fact succeeded, since the field ends up being incremented twice.
I took a look at the driver code and I guess it's timing out while reading the server response - perhaps the proxy is enabled just after the update has started and there isn't much timeout left for both write and read operations to complete.
Is there anything that I can do on my side to prevent this from happening? I tried to find a specific error to catch, but I couldn’t find any. Or is this something the driver itself is supposed to handle?
Any help is appreciated.
UPDATE: I looked closely at the error messages and noticed that, while the MongoDB instance was down, all errors were handshake failures. So I made sure the test pings the database before disabling the proxy, to get the handshake out of the way, and the test stopped randomly breaking; it ran 1000 times flawlessly, at least. I assume the handshake itself takes time to complete and that contributes to the command timeout.
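For reference, warming up the connection before disabling the proxy can be as small as a single ping; a minimal sketch, assuming the test already holds a *mongo.Client named client (the helper name is made up):

package example

import (
	"context"
	"testing"
	"time"

	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/readpref"
)

// warmUp performs one round trip so the initial handshake is already done
// before the test disables the proxy.
func warmUp(t *testing.T, client *mongo.Client) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	if err := client.Ping(ctx, readpref.Primary()); err != nil {
		t.Fatalf("ping before disabling proxy failed: %v", err)
	}
}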
In general, if you can't read the response, you can't assume anything about whether the command succeeded, even if you know it reached the server.
If, in your case, all that matters is whether the command reached the server, read on.
Unfortunately, the current state of the driver (v1.7.1) is not "sophisticated" enough to easily tell whether the error came from reading the response.
I was able to reproduce your issue locally. Here is the error when a timeout happens while reading the response:
mongo.CommandError{Code:0, Message:"connection(localhost:27017[-30]) incomplete read of message header: context deadline exceeded", Labels:[]string{"NetworkError", "RetryableWriteError"}, Name:"", Wrapped:topology.ConnectionError{ConnectionID:"localhost:27017[-30]", Wrapped:context.deadlineExceededError{}, init:false, message:"incomplete read of message header"}}
And here is the error when the timeout happens while writing the command:
mongo.CommandError{Code:0, Message:"connection(localhost:27017[-31]) unable to write wire message to network: context deadline exceeded", Labels:[]string{"NetworkError", "RetryableWriteError"}, Name:"", Wrapped:topology.ConnectionError{ConnectionID:"localhost:27017[-31]", Wrapped:context.deadlineExceededError{}, init:false, message:"unable to write wire message to network"}}
As you can see, in both cases a mongo.CommandError is returned, with identical Code and Labels fields, which leaves you having to analyze the error string (which is ugly and may "break" with future changes).
So the best you can do is check if the error string contains "incomplete read of message header", and if so, you don't have to retry. Hopefully this (error support and analysis) improves in the future.
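A minimal sketch of that check, assuming the error came from an UpdateOne call on the v1.7.x driver (the helper name is illustrative, and the check deliberately relies on the error text, which may change between driver versions):

package example

import (
	"errors"
	"strings"

	"go.mongodb.org/mongo-driver/mongo"
)

// responseTimedOut reports whether err looks like a timeout that happened
// while reading the server's response, i.e. the write may already have been
// applied and should not be blindly retried.
func responseTimedOut(err error) bool {
	var cmdErr mongo.CommandError
	if errors.As(err, &cmdErr) {
		return strings.Contains(cmdErr.Message, "incomplete read of message header")
	}
	return err != nil && strings.Contains(err.Error(), "incomplete read of message header")
}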
If you are using the retryable writes as implemented by MongoDB 3.6+ and the respective drivers, this shouldn't happen. Each write is accompanied by a transaction number (not to be confused with client-side transactions as implemented by MongoDB 4.0+), and if the same transaction number is used in two consecutive writes there is only one write being done by the server.
This functionality has been around for years so unless you are using an ancient driver version you should already have it.
If you are performing write retries manually in your application rather than using the driver's retryable writes, you can end up writing twice, as you found out. The solution is to use the driver's retryable writes.
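For reference, a minimal sketch of enabling retryable writes explicitly when constructing the Go driver client (they are enabled by default in recent driver versions; the URI is a placeholder):

package example

import (
	"context"
	"time"

	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func connect() (*mongo.Client, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	// With retryable writes, a retried write carries the same transaction
	// number, so the server applies it at most once.
	opts := options.Client().
		ApplyURI("mongodb://localhost:27017"). // placeholder URI
		SetRetryWrites(true)
	return mongo.Connect(ctx, opts)
}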
I had the same problem (running go.mongodb.org/mongo-driver v1.8.1 against MongoDB 4.4) and will leave my experience with it here.
To add to icza's solution:
You can also get the error context deadline exceeded, so check for that as well.
A check for a cancelled or timed-out context would look something like this:
// The error text mentions the context package when the operation was
// cancelled or timed out on the client side.
if strings.Contains(err.Error(), "context") && (strings.Contains(err.Error(), " canceled") || strings.Contains(err.Error(), " deadline exceeded")) {
	...
}
My solution to the problem was, instead of checking for an error first, to first check whether the operation returned a result.
Example:
result, err := database.collection.InsertOne(context, item)
// Per this answer, check the result before the error: a non-nil result is
// treated as evidence that the insert reached the server, even if err is set.
if result != nil {
	return result.InsertedID, err
}
return nil, err
If the operation did go through despite the error, you could add some compensation logic to undo it.
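A hypothetical sketch of that compensation idea, where the insert is paired with an undo closure the caller can invoke if it decides the overall operation failed (the helper and all names are illustrative, not part of the driver):

package example

import (
	"context"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
)

// insertWithCompensation inserts item and returns an undo function that
// deletes the document again, so the caller can roll back if it later
// concludes the operation should not have happened.
func insertWithCompensation(ctx context.Context, coll *mongo.Collection, item interface{}) (interface{}, func() error, error) {
	result, err := coll.InsertOne(ctx, item)
	if result == nil {
		return nil, nil, err
	}
	undo := func() error {
		_, delErr := coll.DeleteOne(ctx, bson.M{"_id": result.InsertedID})
		return delErr
	}
	return result.InsertedID, undo, err
}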

How to get Squeryl to release closed connections back to C3P0?

Sometimes I see the following error repeatedly in my logs:
com.mchange.v2.c3p0.impl.NewPooledConnection - [c3p0] A PooledConnection that has already signalled a Connection error is still in use!
com.mchange.v2.c3p0.impl.NewPooledConnection - [c3p0] Another error has occurred [ org.postgresql.util.PSQLException: This connection has been closed. ] which will not be reported to listeners!
org.postgresql.util.PSQLException: This connection has been closed.
at org.postgresql.jdbc2.AbstractJdbc2Connection.checkClosed(AbstractJdbc2Connection.java:822)
at org.postgresql.jdbc2.AbstractJdbc2Connection.rollback(AbstractJdbc2Connection.java:839)
at com.mchange.v2.c3p0.impl.NewProxyConnection.rollback(NewProxyConnection.java:855)
at org.squeryl.dsl.QueryDsl$class._executeTransactionWithin(QueryDsl.scala:131)
at org.squeryl.dsl.QueryDsl$class.transaction(QueryDsl.scala:78)
at org.squeryl.PrimitiveTypeMode$.transaction(PrimitiveTypeMode.scala:40)
Below that is the trace of my own code up to the point of my transaction {} block.
My software repeats the transaction {} after it throws an exception, but it appears to use the same (closed) connection over again, so the next attempt also fails. Strangely this takes quite some time, sometimes 50 seconds, sometimes up to 2 minutes. One would think a closed connection would fail immediately.
How do I get Squeryl to release this connection to the pool and acquire a new one?
I found out why this was happening. I needed to use JDBC features not supported by Squeryl. So I got the connection from Squeryl and used it directly. But I had a bug where a prepared statement was not being closed. This resulted in the dead connection being reused over and over. I'm not sure how or why this happened. But as soon as I put the closing of the statement in a finally block everything started working. Now when Squeryl gets to the transaction block a second time it receives a fresh connection from c3p0.
For anyone else seeing the same errors, I also found out that you can get the c3p0 errors (top two errors in the question text above) even when nothing is wrong. If your thread holding the database connection is busy (a Thread.sleep() in my case, for testing) and c3p0 notices before you do that the connection is dead, then you can get the error about a dead connection still being in use. In that case it is a perfectly normal situation and just a question of which thread sees the problem first - nothing to worry about.

Does MongoDB fail silently if I don't check error codes?

I'm wondering if any persistence failure will go undetected if I don't check error codes? If so, what's the right way to write fast (asynchronously) while still detecting errors?
If you don't check for errors, your update is only fire-and-forget, and you'll indeed miss any errors that arise. Please see MongoDB WriteConcerns for the available write modes in MongoDB (sorry, I always fail to find the official, non-driver-related documentation; I really should bookmark it).
So with NORMAL you'll get at least connectivity errors, with NONE no exceptions at all. If you want to be informed of exceptions you have to use one of the other modes, which differ only in the persistence guarantee they give you.
You can't detect errors when writing asynchronously, as this goes against the intention: the connection that sent the write operation may already be closed or reused, so the error can't be sent back through it. Furthermore, only your application code knows what to do when a write fails. Since MongoDB doesn't offer a callback to asynchronously inform you about the outcome of a write, you have to wait until the write has reached the requested stage.
So the fastest but least reliable of these modes is SAFE, where the write has only reached memory. JOURNAL gives you the guarantee that it was written at least to the journal on disk. With FSYNC the changes are persisted to the data files on disk. REPLICA means that at least two replicas have written it, and MAJORITY that more than half of your replicas have (with three replicas, which should be the default, these two don't differ).
The only way I see to get something like asynchronous writes is to have a separate thread that performs all write operations synchronously. That thread could handle the actual update and call a handler class in case of failure to perform whatever is needed to deal with it. But I don't think that is good application design.
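The mode names above are the legacy driver constants; current drivers express the same acknowledgment levels as write concerns. As a hedged illustration, here is a minimal sketch with the Go driver's writeconcern package (database, collection, and timeout values are placeholders):

package example

import (
	"context"
	"time"

	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
	"go.mongodb.org/mongo-driver/mongo/writeconcern"
)

// insertAcknowledged requires the write to be journaled and acknowledged by
// a majority of replica set members, roughly the MAJORITY and JOURNAL levels
// above, so failures surface as errors instead of being dropped.
func insertAcknowledged(ctx context.Context, client *mongo.Client, doc interface{}) error {
	wc := writeconcern.New(
		writeconcern.WMajority(),
		writeconcern.J(true),
		writeconcern.WTimeout(5*time.Second),
	)
	coll := client.Database("app").Collection("events", options.Collection().SetWriteConcern(wc)) // placeholder names
	_, err := coll.InsertOne(ctx, doc)
	return err
}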
Yes, depending on the error, it can fail silently if you don't check the returned error code. Waiting for the error check is necessary; your only other option would be for your app to occasionally tell the user, "Oops, remember when I acted like I saved your data a moment ago? Well, not really."

Unexpected behaviour after memcached server restarts. How to configure/rectify it?

I have a pool of persistent connections (memcached clients), and data is cached in the memcached server. If, after restarting the memcached server, I try to get the cached data using a client from the pool, I get the exception below:
java.util.concurrent.ExecutionException: java.lang.RuntimeException: Cancelled
at net.spy.memcached.MemcachedClient$OperationFuture.get(MemcachedClient.java:1662)
at net.spy.memcached.MemcachedClient$GetFuture.get(MemcachedClient.java:1708)
at com.eos.gds.cache.CacheClient.get(CacheClient.java:49)
I get this exception only the first time I try to get the cached data after the restart. I did a lot of searching but was unable to find the exact reason for this.
Spymemcached has a bunch of internal queues that operations are placed in before they are actually sent to memcached. What is happening here is that you issue an operation and, before it is sent over the wire or a response is received from memcached, Spymemcached realizes that the connection has been lost. As a result, Spymemcached cancels all operations in flight and then re-establishes the connection.
When you call get() on the Future, an exception is thrown because the operation was cancelled by Spymemcached. What I recommend is catching all exceptions on every individual operation you do with Spymemcached and then, depending on the error, either retrying the operation or just forgetting about it. If it's a get, for example, and your cluster of memcached servers went down, you can probably forget about it since the cache will be empty anyway, but you will probably want to retry a set.
I ran into the exact same problem and fixed it by catching the exception and retrying until success:
while (true) {
	try {
		memcacheclient.get(key);
		break; // success, stop retrying
	} catch (java.util.concurrent.CancellationException e) {
		// the operation was cancelled while the connection was being
		// re-established; log and try again
		log.info("cache cancelled");
	}
}
Run MemcachedClient.getStats() once for each new client, and that will fix the cancellation issue.
I had the same issue. I am using the Spymemcached client to connect to the Memcached server; there must be a connection issue. I found the following reference:
Ref: https://github.com/couchbase/spymemcached/blob/master/src/main/java/net/spy/memcached/internal/OperationFuture.java
Been searching for days for a solution. Posting in case it helps someone else.
Our implementation of ServletContextListener was creating a new MemcachedClient(...) in contextInitialized and would then call the MemcachedClient method shutdown() in contextDestroyed. I would always get a CancellationException or ExecutionException on the first request I sent. (The error messaging alluded to both, but an ExecutionException is what I was able to catch.)
Solution: switched from shutdown() to shutdown(1, TimeUnit.SECONDS)
Now the get call succeeds the very first time that it is run.
I cannot explain for sure how the contextDestroyed call was interfering with the regular handling of the request. My best guess is that spymemcached's single thread somehow gets shared between servlets, and so when a servlet was created to handle a request sent by a verification step of our build process, it would get destroyed prior to the first request I would send, and the MemcachedClient my request's servlet was using would then try to use that same thread and get hit with the exceptions from the shutdown.
(Our team had established the need to call shutdown a while back when we learned our web app had too many open connections to our memcached server.)