Mongo connection count creeping up by one every 10 seconds with mgo driver

We monitor our mongoDB connection count using this:
http://godoc.org/labix.org/v2/mgo#GetStats
However, we have been facing a strange connection leak issue where the connection count creeps up consistently by one more open connection every 10 seconds, regardless of whether there are any requests. I can spin up a server on localhost, leave it there, do nothing, and the connection count will still creep up. It eventually reaches a few thousand, at which point it kills the app/db and we have to restart the app.
This might not be enough information for you to debug. Does anyone have any ideas, or connection leaks that you have dealt with in the past? How did you debug them? What are some of the ways that I can debug this?
We have tried a few things: we scanned our code base for any code that could open a connection and put counters/debugging statements there, and so far we have found no leak. It is almost as if there is a leak in a library somewhere.
This is a bug in a branch that we have been working on, and there have been a few hundred commits into it. We have done a diff between this branch and master and couldn't find why there is a connection leak in this branch.
As an example, here is the dataset I am referring to:
Clusters: 1
MasterConns: 9936 <-- creeps up 1 per 10 seconds
SlaveConns: -7359 <-- why is this negative?
SentOps: 42091780
ReceivedOps: 38684525
ReceivedDocs: 39466143
SocketsAlive: 78 <-- what is the difference between the socket count and the master conns count?
SocketsInUse: 1231
SocketRefs: 1231
MasterConns is the number that creeps up by one every 10 seconds. I am not entirely sure what the other numbers mean.

MasterConns cannot tell you whether there's a leak or not, because it does not decrease. The field indicates the number of connections made since the last statistics reset, not the number of sockets that are currently in use. The latter is indicated by the SocketsAlive field.
To give you some additional relief on the subject, every single test in the mgo suite is wrapped in logic that ensures that the statistics show sane values after the test finishes, so that potential leaks don't go unnoticed. That's the main reason why this statistics collection system was introduced.
The reason you see this number increasing every 10 seconds or so is the internal activity that happens to learn the status of the cluster. That said, this behavior was recently changed so that it doesn't establish new connections and instead picks existing sockets from the pool, so I believe you're not using the latest release.
Having SlaveConns negative looks like a bug. There's a small edge case about statistics collection for connections made, because we cannot tell whether a given server is a master or a slave before we've talked to it, so there might be an uncovered path. If you still see that behavior after you upgrade, please report the issue and I'll be happy to look at it.
SocketsInUse is the number of sockets that are still being referenced by one or more sessions, whether they are alive (the connection is established) or not. SocketsAlive is, again, the real number of live TCP connections. The delta between the two indicates that a number of sessions were not closed. This may be okay, if they are still being held in memory by the application and will eventually be closed, or it may be a leak if a session.Close operation was missed by the application.
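The reading of the counters described in this answer can be turned into a small check. The following is a minimal sketch in Python, assuming the statistics have been exported as plain dictionaries keyed by the field names shown in the question. The function name diagnose and the snapshot format are hypothetical and not part of mgo.

```python
def diagnose(before, after):
    """Compare two snapshots of mgo-style statistics taken some time apart.

    Both arguments are plain dicts keyed by the field names from the
    question (a hypothetical export format, not an mgo API).
    """
    findings = []

    # MasterConns only ever grows, so it is deliberately ignored here.
    # SocketsAlive counts live TCP connections; sustained growth there is
    # what may point at a real connection leak.
    alive_delta = after["SocketsAlive"] - before["SocketsAlive"]
    if alive_delta > 0:
        findings.append(f"SocketsAlive grew by {alive_delta}")

    # Sockets that are still referenced by sessions but are no longer alive
    # suggest sessions that have not been closed (yet).
    unclosed = after["SocketsInUse"] - after["SocketsAlive"]
    if unclosed > 0:
        findings.append(f"{unclosed} sockets still referenced by unclosed sessions")

    return findings
```

With the numbers from the question (SocketsAlive 78, SocketsInUse 1231) the second check fires, which matches the answer's point that a number of sessions were not closed.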

Related

Monitoring memcached flush with delay

In order not to overload our database server, we are trying to flush each server with a 60 second delay between them. I'm having some trouble determining when a server was actually flushed when a delay is given.
I'm using BeITMemcached and calling FlushAll with a 60 second delay and staggered set to true.
I've tried connecting from the command line with telnet host port and running stats to see if the flush delay is working. However, when I look at cmd_flush, the value goes up instantly on all of the host/port combinations being flushed, without a delay. I've tried stats items and stats slabs but can't find information on what all the values represent, or whether anything shows that the cache has been invalidated.
Is there another place I can look to determine when the server was actually flushed? Or does that value going up instantly mean that the delay isn't working as expected?

I found a roundabout way of testing this. Even though cmd_flush gets updated right away, the actual keys are not invalidated until after the delay.
So I connected with telnet to the server/port I wanted to monitor, then used gets key to find a key with a value set. Once I had found one, I ran the FlushAll with a delay between the first servers and this one, and continued to monitor that key's value. After the delay was up, the key started to return no value.
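The manual telnet check above could be automated along these lines. This is only a sketch in Python: fetch stands for any callable that returns the value for a key or None on a miss (for example a thin wrapper around whatever memcached client is in use), and the name wait_until_flushed is hypothetical.

```python
import time


def wait_until_flushed(fetch, key, timeout=120.0, interval=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll fetch(key) until it returns None and report how long that took.

    Returns the elapsed time in seconds, or None if the key was still
    present after `timeout` seconds. `clock` and `sleep` are injectable
    so the function can be exercised without real waiting.
    """
    start = clock()
    while True:
        if fetch(key) is None:
            return clock() - start  # the key is gone: the flush took effect
        if clock() - start >= timeout:
            return None  # gave up: the key survived the whole window
        sleep(interval)
```

Running this against each server in turn, right after issuing the staggered flush, should show elapsed times that differ by roughly the configured delay if the stagger is working.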

Haskell database connections

Please look at this scotty app (it's taken directly from this old answer from 2014):
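{-# LANGUAGE OverloadedStrings #-}
-- OverloadedStrings is assumed here: the route, database and collection names are string literals
import Web.Scotty
import Database.MongoDB
import qualified Data.Text.Lazy as T
import Control.Monad.IO.Class
-- Run a query against the "nutrition" database over the shared pipe
runQuery :: Pipe -> Query -> IO [Document]
runQuery pipe query = access pipe master "nutrition" (find query >>= rest)
main = do
  pipe <- connect $ host "127.0.0.1"  -- opened once, at startup
  scotty 3000 $ do
    get "/" $ do
      res <- liftIO $ runQuery pipe (select [] "stock_foods")
      text $ T.pack $ show res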
You see how the database connection (pipe) is created only once, when the web app launches. Subsequently, thousands if not millions of visitors will hit the "/" route simultaneously and read from the database using the same connection (pipe).
I have questions about how to properly use Database.MongoDB:
Is this the proper way of setting things up? As opposed to creating a database connection for every visit to "/". In this latter case, we could have millions of connections at once. Is that discouraged? What are the advantages and drawbacks of such an approach?
In the app above, what happens if the database connection is lost for some reason and needs to be created again? How would you recover from that?
What about authentication with the auth function? Should the auth function only be called once after creating the pipe, or should it be called on every hit to "/"?
Some say that I'm supposed to use a pool (Data.Pool). It looks like that would only help limit the number of visitors using the same database connection simultaneously. But why would I want to do that? Doesn't the MongoDB connection have a built-in support for simultaneous usages?

Even if you create a connection per client, you won't be able to create too many of them: you will hit the ulimit. Once that happens, the client that hit the limit will get a runtime error.
The reason it doesn't make sense is that the mongodb server will spend too much time polling all those connections, and it will only have as many meaningful workers as your db server has CPUs.
One connection is not a bad idea, because mongodb is designed to send several requests and wait for responses. So it will use as many resources as your mongodb can offer, with only one limitation: you have only one pipe for writing, and if it closes accidentally you will need to recreate this pipe yourself.
So it makes more sense to have a pool of connections. It doesn't need to be big. I had an app which authenticates users and gives them tokens. With 2500 concurrent users per second it only had 3-4 concurrent connections to the database.
Here are the benefits a connection pool gives you:
If you hit the pool connection limit, you will wait for the next available connection and will not get a runtime error. So your app will wait a little bit instead of rejecting your client.
The pool will recreate connections for you. You can configure the pool to close excess connections and create more, up to a certain limit, as you need them. If your connection breaks while you read from it or write to it, you just take another connection from the pool. If you don't return that broken connection to the pool, the pool will create another connection for you.
If the database connection is closed, then the mongodb listener on this connection will exit, printing an error message on your terminal, and your app will receive an IO error. To handle this error you will need to create another connection and try again. When it comes to handling this situation, you will see that it's easier to use a db pool, because eventually your solution will resemble a connection pool very much.
I do auth once as part of opening a connection. If you need to auth another user later you can always do it.
Yes, mongodb handles simultaneous usage, but like I said it gives you only one pipe to write to, and that soon becomes a bottleneck. If you create at least as many connections as your mongodb server can afford threads for handling them (CPU count), then they will be going at full speed.
If I missed something feel free to ask for clarifications.
Thank you for your question.
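The "create another connection and try again" handling described in this answer can be sketched as follows. This is a language-neutral illustration in Python rather than Haskell; the class name Reconnecting and the connect/operation callables are hypothetical, and a lost connection is assumed to surface as an OSError.

```python
class Reconnecting:
    """Hold one connection and replace it when an operation fails on it."""

    def __init__(self, connect, max_attempts=3):
        self._connect = connect            # zero-argument callable returning a connection
        self._max_attempts = max_attempts  # must be at least 1
        self._conn = None

    def run(self, operation):
        """Call operation(conn), reconnecting and retrying on OSError."""
        last_error = None
        for _ in range(self._max_attempts):
            if self._conn is None:
                self._conn = self._connect()  # (re)create the connection lazily
            try:
                return operation(self._conn)
            except OSError as exc:
                last_error = exc
                self._conn = None  # drop the broken connection before retrying
        raise last_error
```

Once this wrapper also has to serve many threads at once, it starts to need a limit, a free list and cleanup, which is the answer's point that the hand-written solution ends up resembling a connection pool.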

What you really want is a database connection pool. Take a look at the code from this other answer.
Instead of auth, you can use withMongoDBPool if your MongoDB server is in secure mode.

Is this the proper way of setting things up? As opposed to creating a database connection for every visit to "/". In this latter case, we could have millions of connections at once. Is that discouraged? What are the advantages and drawbacks of such an approach?
You do not want to open one connection and then use it. The HTTP server you are using, which underpins Scotty, is called Warp. Warp has a multi-core, multi-green-thread design. You are allowed to share the same connection across all threads, since Database.MongoDB says outright that connections are thread-safe, but what will happen is that when one thread is blocked waiting for a response (the MongoDB protocol follows a simple request-response design) all threads in your web service will block. This is unfortunate.
We can instead create a connection on every request. This trivially solves the problem of one thread blocking another, but leads to its own share of problems. The overhead of setting up a TCP connection, while not substantial, is also not zero. Recall that every time we want to open or close a socket we have to jump from user space to the kernel, wait for the kernel to update its internal data structures, and then jump back (a context switch). We also have to deal with the TCP handshake and goodbyes. Under high load, we would also run out of file descriptors or memory.
It would be nice if we had a solution somewhere in between. The solution should be
Thread-safe
Let us max-bound the number of connections so we don't exhaust the finite resources of the operating system
Quick
Share connections across threads under normal load
Create new connections as we experience increased load
Allow us to clean up resources (like closing a handle) as connections are deleted under reduced load
Hopefully already written and battle-tested by other production systems
It is exactly this problem that resource-pool tackles.
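To make the requirements listed above concrete, here is a minimal sketch of a bounded pool, written in Python for illustration rather than Haskell. It is not the resource-pool API: the class BoundedPool and its create/destroy callables are hypothetical, and unlike resource-pool it only closes idle connections when asked to.

```python
import queue
import threading
from contextlib import contextmanager


class BoundedPool:
    """Thread-safe pool that never hands out more than max_size connections."""

    def __init__(self, create, destroy, max_size):
        self._create = create      # callable returning a new connection
        self._destroy = destroy    # callable that closes a connection
        self._idle = queue.LifoQueue()
        self._slots = threading.BoundedSemaphore(max_size)

    @contextmanager
    def resource(self, timeout=None):
        # Wait for a free slot instead of failing outright under load.
        if not self._slots.acquire(timeout=timeout):
            raise TimeoutError("no connection became available in time")
        try:
            try:
                conn = self._idle.get_nowait()  # reuse an idle connection
            except queue.Empty:
                conn = self._create()           # grow under increased load
            try:
                yield conn
            except BaseException:
                self._destroy(conn)  # treat the connection as broken
                raise
            else:
                self._idle.put(conn)  # healthy: return it for reuse
        finally:
            self._slots.release()

    def close_idle(self):
        """Destroy all idle connections, e.g. under reduced load."""
        while True:
            try:
                self._destroy(self._idle.get_nowait())
            except queue.Empty:
                return
```

A request handler would wrap its database work in `with pool.resource() as conn:`. A handler that raises causes its connection to be discarded, and the next caller gets a freshly created one.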
Some say that I'm supposed to use a pool (Data.Pool). It looks like that would only help limit the number of visitors using the same database connection simultaneously. But why would I want to do that? Doesn't the MongoDB connection have a built-in support for simultaneous usages?
It is unclear what you mean by simultaneous usages. There is one interpretation I can guess at: you mean something like HTTP/2, which has pipelining built into the protocol.
standard picture of pipelining http://research.worksap.com/wp-content/uploads/2015/08/pipeline.png
Above we see the client making multiple requests to the server, without waiting for a response, and then the client can receive responses back in some order. (Time flows from the top to the bottom.) This MongoDB does not have. This is a fairly complicated protocol design that is not that much better than just asking your clients to use connection pools. And MongoDB is not alone here: the simple request-and-response design is something that Postgres, MySQL, SQL Server, and most other databases have settled on.
And: it is true that connection pool limits the load you can take as a web service before all threads are blocked and your user just sees a loading bar. But this problem would exist in any of the three scenarios (connection pooling, one shared connection, one connection per request)! The computer has finite resources, and at some point something will collapse under sufficient load. Connection pooling's advantages are that it scales gracefully right up until the point it cannot. The correct solution to handling more traffic is to increase the number of computers; we should not avoid pooling simply due to this problem.
In the app above, what happens if the database connection is lost for some reason and needs to be created again? How would you recover from that?
I believe these kinds of what-if's are outside the scope of Stack Overflow and deserve no better answer than "try it and see." Buuuuuuut given that the server terminates the connection, I can take a stab at what might happen: assuming Warp forks a green thread for each request (which I think it does), each thread will experience an unchecked IOException as it tries to write to the closed TCP connection. Warp would catch this exception and serve it as an HTTP 500, hopefully writing something useful to the logs also. Assuming a single-connection model like you have now, you could either do something clever (but high in lines of code) where you "reboot" your main function and set up a second connection. Something I do for hobby projects: should anything odd occur, like a dropped connection, I ask my supervisor process (like systemd) to watch the logs and restart the web service. Though clearly not a great solution for a production, money-makin' website, it works well enough for small apps.
What about authentication with the auth function? Should the auth function only be called once after creating the pipe, or should it be called on every hit to "/"?
It should be called once after creating the connection. MongoDB authentication is per-connection. You can see an example here of how the db.auth() command mutates the MongoDB server's data structures corresponding to the current client connection.

Connection timeout and socket timeout advice

I'm currently doing some research into connection and socket timeout settings but I'm fairly new to this stuff.
As a stab in the dark, we are thinking of using 40 seconds as both the connection timeout and the socket timeout when we make JSON-over-HTTP calls to another server.
httpConnectionManagerParams.setConnectionTimeout(40000);
httpConnectionManagerParams.setSoTimeout(40000);
But really I don't know how to work out what the ideal settings or best practices are. I would appreciate it if someone could give me some pointers on what to consider, or tips on how to come up with a good guesstimate for these settings.
The sort of advice I'm looking for is something like "40 seconds is far too long because it might cause another issue", or "the higher you set this value, the more chance you have of causing another issue", or "40 seconds is not too high at all", or "to work out an ideal figure, multiply Y by T".
thanks
EDIT
Adding firebug trace of server call.

There's no reason whatsoever why they should be equal. Considering each condition separately, you want to set it high enough that a timeout will indicate a genuine problem rather than just a temporary overload, and low enough that you maintain responsiveness of the application.
As a rule, 40 seconds is far too long for a connect timeout. I would view anything in double figures with suspicion. A server should be able to accept tens or hundreds of connections per second.
Read timeouts are a completely different matter. The 'correct' value, if there is such a thing, depends entirely on the average service time for the request and on its variance. As a starting point, you might want to set it to double the expected service time, or the average service time plus two or three standard deviations, depending entirely on your service level requirements and the performance of your server and its variance. There is no hard and fast rule about this. Many contractual service level agreements (SLAs) specify a 'normal' response time of two seconds, which may inform your deliberations. But it's your decision.
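The two starting points mentioned above (double the expected service time, or the average plus two or three standard deviations) are easy to compute from measured response times. A minimal sketch in Python; the function name and the sample figures are illustrative assumptions, not recommendations.

```python
import statistics


def suggest_read_timeouts(service_times, k=3.0):
    """Return two candidate read timeouts, in the units of the samples.

    service_times: measured response times for the request (at least two).
    k: how many standard deviations to add to the mean.
    """
    mean = statistics.mean(service_times)
    stdev = statistics.stdev(service_times)  # sample standard deviation
    return {
        "double_mean": 2 * mean,
        "mean_plus_k_stdev": mean + k * stdev,
    }


# Example with made-up measurements in seconds.
candidates = suggest_read_timeouts([1.0, 2.0, 3.0], k=3.0)
```

Which candidate to use, and what value of k, still depends on the service level requirements, as the answer says.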

Golang tcp socket read gives EOF eventually

I have a problem reading from a socket. There is an Asterisk instance running with plenty of calls (10-60 a minute) and I'm trying to read and process CDR events related to those calls (connected to AMI).
Here is the library I'm using (not mine, but I was forced to fork it because of bugs): https://github.com/warik/gami
It's pretty straightforward; the main action happens in gami.go, in readDispatcher.
buf := make([]byte, _READ_BUF) // read buffer
for {
rc, err := (*a.conn).Read(buf)
So, there is a TCPConn (a.conn) and a buffer of size 1024 into which I read messages from the socket. So far so good, but from time to time (anywhere from 10 minutes to 5 hours, independently of the amount of data coming through the socket) the Read operation fails with an io.EOF error. I tried to reconnect and log in again immediately, but that is also impossible: the connection times out, so I am forced to wait for about 40-60 seconds. This time is crucial to me, as I'm losing a lot of data because of the delay. I was googling, reading sources and trying a lot of things, with no result. The strangest thing is that a simple socket opened in Python or PHP does not fail.
Is it possible that the problem is a lack of file descriptors to represent the socket, on my machine or on the Asterisk server?
Is it possible that the problem is in the Asterisk configuration (because I have another Asterisk on which this problem doesn't reproduce, but I also have far fewer calls on that one)?
Is it possible that the problem is in the way I deal with the socket connection, or with Go in general?
go version go1.2.1 linux/amd64
asterisk 1.8

Update to the latest Asterisk. There was a bug like that when AMI sends a lot of data.
To check for the issue, send a command via AMI such as "COMMAND sip show peers" (or any other command with long output) and look at the result.

OK, the problem was an OS socket buffer overflow. As it turned out, there was too much data to handle.
So, there are three possible ways to fix this:
increase the socket buffer size
somehow increase the speed of the process which reads data from the socket
lower the data volume or frequency
The thing is that gami by default reads all data from Asterisk, and I was reading all of it and filtering it after the actual read operation. Since the AMI listening application was running on a fairly weak PC, it turned out that it simply could not read all the data before the buffer capacity was exhausted. But it's possible to receive only particular events, by sending the "Events" action to AMI and specifying the desired "EventMask".
So my decision was to do that, and to create different connections for different event types.
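For illustration, this is roughly what such an "Events" action looks like on the wire. The sketch below is in Python rather than Go and only builds the message bytes; the helper name format_ami_action is hypothetical, and the mask value "cdr" is an assumption about the event class wanted here, so check it against your Asterisk version.

```python
def format_ami_action(action, **headers):
    """Serialize an AMI action: 'Key: Value' lines, ended by a blank line."""
    lines = [f"Action: {action}"]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # AMI messages use CRLF line endings and are terminated by an empty line.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


# Ask the server to send only CDR-class events on this connection
# (assumed mask value; it would be sent after a successful Login action).
events_filter = format_ami_action("Events", EventMask="cdr")
```

Sending a different mask on each connection gives the "different connections for different event types" setup described above, with each reader handling a smaller stream.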

Apache HttpClient PoolingHttpClientConnectionManager leaking connections?

I am using the Apache Http Client in a Scala application.
The application is fairly high throughput with high parallelism.
I am not sure, but I think perhaps I am leaking connections. It seems that whenever the section of code that uses the client gets busy, the application becomes unresponsive. My suspicion is that I am leaking sockets or something, which is then causing other aspects of the application to stop working. It may also not be leaking connections so much as not closing them fast enough.
For more context, occasionally, certain actions lead to this code being executed hundreds of times a minute in parallel. When this happens the REST API (Spray) of the application becomes unresponsive. There are other areas of the application that operate with high parallelism as well, and those never cause a problem with the application's responsiveness.
Cutting back on the parallelism of this section of code does seem to alleviate the problem but isn't a viable long term solution.
Am I forgetting to configure something, or configuring something incorrectly?
The code I am using is something like this:
class SomeClass {
  val connectionManager = new PoolingHttpClientConnectionManager()
  connectionManager.setDefaultMaxPerRoute(50)
  connectionManager.setMaxTotal(500)
  val httpClient = HttpClients.custom().setConnectionManager(connectionManager).build()
  def postData() {
    val post = new HttpPost("http://SomeUrl") // Typically this URL is fixed. It doesn't vary much if at all.
    post.setEntity(new StringEntity("Some Data"))
    try {
      val response = httpClient.execute(post)
      try {
        // Check the response
      } finally {
        response.close()
      }
    } finally {
      post.releaseConnection()
    }
  }
}
EDIT
I can see that I am building up a lot of connections in the TIME_WAIT state. I have tried adjusting the DefaultMaxPerRoute and the MaxTotal to a variety of values with no noticeable effect. It seems like I am missing something and as a result the connections are not being re-used, but I can't find any documentation that suggests what I am missing. It is critical that these connections get re-used.
EDIT 2
With further investigation, using lsof -p, I can see that if I set the MaxPerRoute to 10, there are in fact 10 connections being listed as "ESTABLISHED". I can see that the port numbers do not change. This seems to imply to me that in fact it is re-using the connections.
What that doesn't explain is why I am still leaking connections in this code? The reused connections and leaked connections (found with netstat -a) showing up in TIME_WAIT status share the same base url. So they are definitely related. Is it possible perhaps that I am re-using the connections but then somehow not properly closing the response?
EDIT 3
Located the source of the TIME_WAIT "leak". It was in an unrelated section of code. So it wasn't anything to do with the HttpClient. However after fixing up that code, all the TIME_WAITs went away, but the application is still becoming unresponsive when hitting the HttpClient code many times. Still investigating that portion.

You should really consider re-using the HttpClient instance, or at least the connection pool that underpins it, instead of creating them for each new request execution. If you wish to continue doing the latter, you should also close the client or shut down the connection pool before they go out of scope.
As far as the leak is concerned, it should be relatively easy to track down by running your application with context logging for connection management turned on, as described here

IMO, you can use a much lower number of max connections per domain (like 5 instead of 50) and still completely saturate your network bandwidth, if you use HTTP efficiently.
I'm not a Scala person (Android, Java) but I have done lots and lots of optimization on HTTP client-side thread pools. IMO, blindly increasing connections per domain to 50 is masking some other serious issue with throughput.
2 points:
if you are using a shared "sharedPoolingClientConnManager", correctly go to a small pool per domain, and conform to the recommended way of releasing your connection back to the pool (you should be able to debug all this by watching a running metric of the connection state per thread pool instance), then you should be good.
whatever the parallelism features of Scala, you should understand something of how the 5 respective threads from the pool on a domain share the socket. IMO, from my Android/Java experience, even though each thread executor is supposedly doing blocking I/O to the server within the scope of that httpclient.exec statement, the actual channel management involved allows very high throughput without resorting to async client libs for HTTP.
Android experience may not be relevant because the client has only 4 threads. Having said that, even if you have 64 or more threads available, I just don't understand needing more than 10 connections per domain in order to keep your underlying HTTP socket very, very busy.