MongoDB killOp() not killing the op. What do I do? - mongodb

> db.currentOp().inprog.length
11587
Several minutes later, the count is still the same. I made a small script to cycle through and killOp() all the ops that originated from the offending client, but when it finishes, all of the ops are still running.
I then tried a single killOp() and checked the op count, and it was the same. I tried killing 10 ops and then checking the op count, and it still hadn't changed.
Most of the queries are on the same table, which has over 20 million documents. The client generating all the queries has been terminated, but I can't call getIndexes() to see if there's an indexing misconfiguration on the table because that call just goes on the end of the op queue and never returns.
We're running MongoDB on a single Linux server. There's no replication in place at this point.
What should I do?

Do you know which op it is? Check the mongod log to see whether it is making progress or reporting any error. If you don't see any progress, I would suggest restarting mongod (don't use kill -9; a normal kill, which sends SIGTERM, should be ok).

Related

UpdateOne fails on client due to timeout, but MongoDB processes it anyway

One of my tests for a function that performs increments using the MongoDB driver for Go is randomly breaking in an unexpected way. Here's what the test does:
1. Create a proxy (with toxiproxy) to a local MongoDB instance.
2. Disable the proxy, so the database looks like it's down.
3. Run a function that does an update that increments a field, timing out after 100ms. If it fails, it keeps retrying every 100ms until the command succeeds.
4. Sleep 1 second.
5. Enable the proxy.
6. Wait for the function to complete and assert that the field has been incremented correctly - only once.
This test is randomly breaking because sometimes that field gets incremented twice. I noticed that it happens when an update is retried just as the proxy gets enabled: the client code receives an incomplete read of message header: context deadline exceeded error, which makes it retry the command, but the previous one indeed succeeded because the field ends up being incremented twice.
I took a look at the driver code and I guess it's timing out while reading the server response - perhaps the proxy is enabled just after the update has started and there isn't much timeout left for both write and read operations to complete.
Is there anything that I can do on my side to prevent this from happening? I tried to find a specific error to catch, but I couldn’t find any. Or is this something the driver itself is supposed to handle?
Any help is appreciated.
UPDATE: I looked closely at the error messages and noticed that, while the MongoDB instance was down, all errors were handshake failures. So I made sure the test pings the database before disabling the proxy to get the handshake out of the way, and the test stopped randomly breaking; it ran 1000 times flawlessly, at least. I assume the handshake itself takes time to complete and that eats into the command timeout.
In general, even if you know the command reached the server, you can't assume anything about its success when you can't read the response.
If all that matters in your case is whether the command reached the server, then read on.
Unfortunately the current state of the driver (v1.7.1) is not "sophisticated" enough to easily tell if the error is from reading the response.
I was able to reproduce your issue locally. Here is the error when a timeout happens reading the response:
mongo.CommandError{Code:0, Message:"connection(localhost:27017[-30]) incomplete read of message header: context deadline exceeded", Labels:[]string{"NetworkError", "RetryableWriteError"}, Name:"", Wrapped:topology.ConnectionError{ConnectionID:"localhost:27017[-30]", Wrapped:context.deadlineExceededError{}, init:false, message:"incomplete read of message header"}}
And here is the error when the timeout happens writing the command:
mongo.CommandError{Code:0, Message:"connection(localhost:27017[-31]) unable to write wire message to network: context deadline exceeded", Labels:[]string{"NetworkError", "RetryableWriteError"}, Name:"", Wrapped:topology.ConnectionError{ConnectionID:"localhost:27017[-31]", Wrapped:context.deadlineExceededError{}, init:false, message:"unable to write wire message to network"}}
As you can see, in both cases mongo.CommandError is returned, with identical Code and Labels fields. That leaves you having to analyze the error string (which is ugly and may "break" with future changes).
So the best you can do is check if the error string contains "incomplete read of message header", and if so, you don't have to retry. Hopefully this (error support and analysis) improves in the future.
If you are using the retryable writes as implemented by MongoDB 3.6+ and the respective drivers, this shouldn't happen. Each write is accompanied by a transaction number (not to be confused with client-side transactions as implemented by MongoDB 4.0+), and if the same transaction number is used in two consecutive writes there is only one write being done by the server.
This functionality has been around for years so unless you are using an ancient driver version you should already have it.
If you are performing write retries in your application manually rather than using the driver's retryable write functionality, you can write twice as you found out. The solution is to use the driver's retryable writes.
I had the same problem (running on go.mongodb.org/mongo-driver v1.8.1 on a MongoDB 4.4) and will leave my experiences with this problem here.
To add to @icza's solution:
You can also get the plain error context deadline exceeded, so check for that as well.
A check for a context abortion would look something like this:
if strings.Contains(err.Error(), "context") && (strings.Contains(err.Error(), " canceled") || strings.Contains(err.Error(), " deadline exceeded")) {
...
}
My solution to the problem was to check first whether the operation returned a result, and only then look at the error.
Example:
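// ctx is the request context; naming it "context" would shadow the package.
result, err := database.collection.InsertOne(ctx, item)
if result != nil {
    // A result came back, so return the inserted ID together with any error.
    return result.InsertedID, err
}
return nil, err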
If the transaction did process it despite the error, you could add some compensation logic to undo the transaction.

Monitoring memcached flush with delay

To avoid overloading our database server, we are trying to flush each memcached server with a 60 second delay between them. I'm having trouble determining when a server was actually flushed when a delay is given.
I'm using BeITMemcached and calling the FlushAll with a 60 second delay and staggered set to true.
I've tried using command line telnet host port followed by stats to see if the flush delay is working, however when I look at the cmd_flush the value goes up instantly on all of the host/port combinations being flushed without a delay. I've tried stats items and stats slabs but can't find information on what all the values represent and if there is anything that shows that it has been invalidated.
Is there another place I can look to determine when the server was actually flushed? Or does that value going up instantly mean that the delay isn't working as expected?
I found a roundabout way of testing this. Even though cmd_flush gets updated right away, the actual keys aren't invalidated until after the delay.
So I connected with telnet to the server/port I wanted to monitor. Then I used gets key to find a key with a value set. Once I found one, I ran the flushall with a delay between the first servers and this one, and kept monitoring that key. After the delay was up, the key started to return no value.

select pg_database_size('name') hangs and can't be killed

Our PostgreSQL 10.1 server ran out of connections today because a monitor process that was calling
select pg_database_size('databasename');
was getting stuck. It was NOT getting an obvious Lock. It just never returned. The monitor dutifully logged in every few minutes, over and over until we ran out of connections. When I run the query for other databases it works, but not for our main database.
Killing the calling process did not clear the query.
select pg_cancel_backend(1234)
doesn't kill the query. Nor does
select pg_terminate_backend(1234)
Ditto if I run the query by hand, nothing kills it in the database.
I will probably have to restart the database server to recover from this. However I'd like to prevent it from happening again.
What is this function doing that would resist signals and never return (like 8 hours after being invoked)? Is there any way to clear them from the process table without restarting the database and breaking the users who still have the few remaining connections still active in the system?

Keeping a cursor alive in pymongo

By default Mongo cursors die after 10 minutes of inactivity. I have a blank cursor that I eventually want to run through the whole database, but there will be periods of inactivity longer than 10 minutes. I need a way to keep it alive so I can keep calling it.
Turning the expiry off completely is not an option. If this program crashes, it will leave cursors lingering in the database's memory, which is not good. Occasionally calling .next() during my other work does not help either, as the batch sizes are set fairly high to get good performance in the other parts of the code that call the cursor a lot.
I tried just periodically calling cursor.alive to see if that sent a signal to Mongo that would keep the cursor active but that did not work.
Try to use a smaller batch size. This will cause activity and you should not hit the 10 minute timeout.
for doc in coll.find().batch_size(10):
Alternatively you can set timeout=False when calling find (in PyMongo 3.0 and later the equivalent option is no_cursor_timeout=True); this could lead to issues when the cursor is not manually closed:
for doc in coll.find(timeout=False):

Mongo Connection Count creeping up one per 10 second with mgo driver

We monitor our mongoDB connection count using this:
http://godoc.org/labix.org/v2/mgo#GetStats
However, we have been facing a strange connection leak issue where the connectionCount creeps up consistently by 1 more open connection per 10 seconds. (That's regardless of whether there are any requests.) I can spin up a server on localhost, leave it there, do nothing, and the connectionCount will still creep up. The connection count eventually reaches a few thousand, at which point it kills the app/db and we have to restart the app.
This might not be enough information for you to debug. Does anyone have any ideas, or connection leaks that you have dealt with in the past? How did you debug them? What are some ways that I can debug this?
We have tried a few things, we scanned our code base for any code that could open a connection and put counters/debugging statements there, and so far we have found no leak. It is almost like there is a leak in a library somewhere.
This is a bug in a branch that we have been working on and there have been a few hundred commits into it. We have done a diff between this and master and couldn't find why there is a connection leak in this branch.
As an example, here is the data set that I am referring to:
Clusters: 1
MasterConns: 9936 <-- creeps up 1 per 10 seconds
SlaveConns: -7359 <-- why is this negative?
SentOps: 42091780
ReceivedOps: 38684525
ReceivedDocs: 39466143
SocketsAlive: 78 <-- what is the difference between the socket count and the master conns count?
SocketsInUse: 1231
SocketRefs: 1231
MasterConns is the number that creeps up by one every 10 seconds. I am not entirely sure what the other numbers mean.
MasterConns cannot tell you whether there's a leak or not, because it does not decrease. The field indicates the number of connections made since the last statistics reset, not the number of sockets that are currently in use. The latter is indicated by the SocketsAlive field.
To give you some additional relief on the subject, every single test in the mgo suite is wrapped in logic that ensures that the statistics show sane values after the test finishes, so that potential leaks don't go unnoticed. That's the main reason why this statistics collection system was introduced.
The reason you see this number increasing every 10 seconds or so is the internal activity that happens to learn the status of the cluster. That said, this behavior was recently changed so that it doesn't establish new connections and instead picks existing sockets from the pool, so I believe you're not using the latest release.
Having SlaveConns negative looks like a bug. There's a small edge case about statistics collection for connections made, because we cannot tell whether a given server is a master or a slave before we've talked to it, so there might be an uncovered path. If you still see that behavior after you upgrade, please report the issue and I'll be happy to look at it.
SocketsInUse is the number of sockets that are still being referenced by one or more sessions, whether they are alive (the connection is established) or not. SocketsAlive is, again, the real number of live TCP connections. The delta between the two indicates that a number of sessions were not closed. This may be okay, if they are still being held in memory by the application and will eventually be closed, or it may be a leak if a session.Close operation was missed by the application.