To avoid overloading our database server, we are trying to flush each memcached server with a 60-second delay between them. I'm having trouble determining when a server was actually flushed when a delay is given.
I'm using BeITMemcached and calling FlushAll with a 60-second delay and staggered set to true.
I've tried telnet host port from the command line followed by stats to see whether the flush delay is working. However, the cmd_flush value goes up instantly, with no delay, on every host/port combination being flushed. I've also tried stats items and stats slabs, but I can't find documentation on what all the values represent or whether any of them show that the cache has been invalidated.
Is there another place I can look to determine when the server was actually flushed? Or does that value going up instantly mean that the delay isn't working as expected?
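For illustration, the same stats check could be scripted instead of typed into telnet; here is a small sketch using Python's pymemcache, with placeholder host/port pairs:

    # Equivalent of "telnet host port" + "stats" for each server, using pymemcache.
    from pymemcache.client.base import Client

    servers = [("cache-1", 11211), ("cache-2", 11211)]  # placeholder host/port pairs
    for host, port in servers:
        stats = Client((host, port)).stats()
        # cmd_flush is the cumulative count of flush commands the server has received;
        # it increments as soon as the command arrives, even when invalidation is delayed.
        print(host, port, "cmd_flush =", stats[b"cmd_flush"])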
I found a roundabout way of testing this. Even though cmd_flush is updated right away, the actual keys aren't invalidated until after the delay.
I connected with telnet to the server/port I wanted to monitor, then used gets key to find a key with a value set. Once I found one, I ran FlushAll with a delay between the first servers and this one and kept monitoring that key's value. After the delay was up, the key started to return no value.
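For illustration, the same monitoring loop sketched with Python's pymemcache instead of a telnet session (host, port, and key name are placeholders):

    # Poll a known key and report when the delayed flush actually invalidates it.
    import time
    from pymemcache.client.base import Client

    client = Client(("cache-2", 11211))   # the server/port being monitored
    key = "some-known-key"                # a key that currently has a value set

    start = time.time()
    while True:
        value = client.get(key)           # returns None once the key has been flushed
        print("%5.1fs  %s = %r" % (time.time() - start, key, value))
        if value is None:
            break                         # the delayed flush has taken effect
        time.sleep(5)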
I found two different places with different explanations of what socketTimeoutMS does.
The time in milliseconds to attempt a send or receive on a socket before the attempt times out. The default is never to timeout, though different drivers might vary. See the driver documentation.
From here
And the following one:
The socketTimeoutMS sets the number of milliseconds a socket stays inactive after the driver has successfully connected before closing. If the value is set to 360000 milliseconds, the socket closes if there is no activity during a 6-minute window.
From here
What does socketTimeoutMS really do?
As the docs themselves say, it is whatever functionality the driver provides:
See the driver documentation.
In your case, if you're using Node.js (the link you sent), it behaves as described in your second quote:
The socketTimeoutMS sets the number of milliseconds a socket stays inactive after the driver has successfully connected before closing. If the value is set to 360000 milliseconds, the socket closes if there is no activity during a 6-minute window.
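For comparison, PyMongo follows the first definition: there socketTimeoutMS is a send/receive timeout, not an idle-connection timeout. A small sketch (the URI is a placeholder):

    from pymongo import MongoClient

    client = MongoClient(
        "mongodb://localhost:27017",   # placeholder URI
        socketTimeoutMS=360000,        # in PyMongo: give up on a send/receive after 6 minutes
    )
    print(client.admin.command("ping"))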
By default, Mongo cursors die after 10 minutes of inactivity. I have a cursor that I eventually want to run through the whole database, but there will be periods of inactivity of over 10 minutes. I need a way to keep it alive so I can keep calling it.
Turning the cursor timeout off completely is not an option: if this program crashes, it will leave cursors lingering in the database's memory, which is not good. Occasionally calling .next() during my other work doesn't help either, because the batch sizes are set fairly high to get good performance in the parts of the code that call the cursor a lot.
I tried periodically calling cursor.alive to see if that would send a signal to Mongo to keep the cursor active, but that did not work.
Try using a smaller batch size. The more frequent fetches count as activity, so you should not hit the 10-minute timeout:
    for doc in coll.find().batch_size(10):
        pass  # process each document here; every 10 docs triggers a fetch that keeps the cursor alive
Alternatively, you can pass timeout=False when calling find (this can cause problems if the cursor is never closed manually):
    for doc in coll.find(timeout=False):
        pass  # process each document here; remember to close the cursor when done
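A fuller sketch of the batch-size approach with the imports in place (the URI, database, and collection names are placeholders). Note that in newer PyMongo versions the timeout=False flag was renamed no_cursor_timeout=True:

    from pymongo import MongoClient

    coll = MongoClient("mongodb://localhost:27017").mydb.mycoll  # placeholder names

    # Small batches force a round trip to the server every 10 documents,
    # and each round trip resets the cursor's 10-minute idle timer.
    for doc in coll.find().batch_size(10):
        print(doc["_id"])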
We monitor our MongoDB connection count using this:
http://godoc.org/labix.org/v2/mgo#GetStats
However, we have been facing a strange connection leak where the connection count creeps up consistently, one more open connection every 10 seconds, regardless of whether there are any requests. I can spin up a server on localhost, leave it alone, do nothing, and the connection count still creeps up. It eventually reaches a few thousand, at which point it kills the app/db and we have to restart the app.
This might not be enough information to debug from. Does anyone have any ideas, or have you dealt with connection leaks like this in the past? How did you debug them? What are some ways I can debug this?
We have tried a few things: we scanned our code base for anything that could open a connection and put counters/debugging statements there, and so far we have found no leak. It is almost as if there is a leak in a library somewhere.
The bug is in a branch we have been working on that has a few hundred commits in it. We have diffed it against master and couldn't find why this branch leaks connections.
As an example, here is the data set I am referencing:
Clusters: 1
MasterConns: 9936 <-- creeps up 1 per 10 seconds
SlaveConns: -7359 <-- why is this negative?
SentOps: 42091780
ReceivedOps: 38684525
ReceivedDocs: 39466143
SocketsAlive: 78 <-- what is the difference between the socket count and the master conns count?
SocketsInUse: 1231
SocketRefs: 1231
MasterConns is the number that creeps up by one every 10 seconds. I am not entirely sure what the other numbers mean.
MasterConns cannot tell you whether there's a leak or not, because it does not decrease. The field indicates the number of connections made since the last statistics reset, not the number of sockets that are currently in use. The latter is indicated by the SocketsAlive field.
To give you some additional relief on the subject, every single test in the mgo suite is wrapped in logic that ensures the statistics show sane values after the test finishes, so that potential leaks don't go unnoticed. That's the main reason this statistics collection system was introduced.
The reason you see this number increasing every 10 seconds or so is the internal activity that happens to learn the status of the cluster. That said, this behavior was recently changed so that it no longer establishes new connections and instead picks existing sockets from the pool, so I believe you're not using the latest release.
Having SlaveConns negative looks like a bug. There's a small edge case about statistics collection for connections made, because we cannot tell whether a given server is a master or a slave before we've talked to it, so there might be an uncovered path. If you still see that behavior after you upgrade, please report the issue and I'll be happy to look at it.
SocketsInUse is the number of sockets that are still being referenced by one or more sessions, whether they are alive (the connection is established) or not. SocketsAlive is, again, the real number of live TCP connections. The delta between the two indicates that a number of sessions were not closed. This may be okay, if they are still being held in memory by the application and will eventually be closed, or it may be a leak if a session.Close operation was missed by the application.
I have a large file I need to load into the cache (as a hash), which will be shared between Perl processes. Loading the data into the cache takes around 2 seconds, but we get over 10 calls per second.
Does using the compute method cause other processes to be locked out?
Otherwise, I would appreciate suggestions on how to manage the load process so that there's a guaranteed lock during load and only one load process happening!
Thanks!
Not sure about the guaranteed lock, but you could use memcached with a known key as a mutex substitute (as of this writing, you haven't said what your "cache" is). If the value for the key is false, set it to true, start loading, and then return the result.
For the requests happening during that time, you could either busy-wait, or try using a 503 "Service Unavailable" status with a few seconds in the Retry-After field. I'm not sure of the browser support for this.
As a bonus, using a time-to-live of, say, 5 minutes on the mutex key will cause the entire file to be reloaded every 5 minutes. Otherwise, if you have a different reason for reloading the file, you will need to manually delete the key.
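To make the idea concrete, here is a sketch using Python's pymemcache (your code is Perl, but any memcached client exposes the same operations; the key names and load_file are placeholders). It uses memcached's atomic add instead of a separate get-then-set, so only one process can win the mutex:

    import time
    from pymemcache.client.base import Client

    mc = Client(("localhost", 11211))

    def get_data():
        # The flag key is the mutex; its 5-minute TTL also forces a periodic reload.
        if mc.add("file-loaded", "1", expire=300, noreply=False):
            data = load_file()                  # placeholder: the ~2 second load, serialized to bytes
            mc.set("file-data", data)
            return data

        # Another process holds the flag: use the cached copy, or busy-wait
        # (alternatively, answer 503 with a Retry-After header) while it loads.
        while (data := mc.get("file-data")) is None:
            time.sleep(0.1)
        return data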
> db.currentOp().inprog.length
11587
Several minutes later, the count is still the same. I made a small script to cycle through and killOp() all the ops that originated from the offending client, but when it finishes, all of the ops are still running.
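For reference, such a script would look roughly like this (sketched with PyMongo; it assumes a server recent enough to expose currentOp/killOp as admin commands, and CLIENT_ADDR is a placeholder for the offending client's address):

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    CLIENT_ADDR = "10.0.0.5"   # placeholder: address of the terminated client

    for op in client.admin.command("currentOp")["inprog"]:
        if op.get("client", "").startswith(CLIENT_ADDR):
            client.admin.command("killOp", op=op["opid"])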
I then tried a single killOp() and checked the op count, and it was the same. I tried killing 10 ops, then checked the op count, and it still hadn't changed.
Most of the queries are on the same collection, which has over 20 million documents. The client generating all the queries has been terminated, but I can't call getIndexes() to check for an indexing misconfiguration because that call just goes to the end of the op queue and never returns.
We're running MongoDB on a single Linux server. There's no replication in place at this point.
What should I do?
Do you know which op it is? Check the mongod log to see whether it is making progress or has any error messages. If you don't see any progress, I would suggest restarting mongod (don't kill -9; a normal kill should be OK).