I have an Amazon ELB in front of Postgres. This is for Kubernetes-related reasons, see this question. I'm trying to work around the maximum AWS ELB Idle Timeout limit of 1 hour so I can have clients that can execute long-running transactions without being disconnected by the ELB. I have no control over the client configuration in my case, so any workaround needs to happen on the server side.
I've come across the tcp_keepalives_idle setting in Postgres, which in theory should get around this by sending periodic keepalive packets to the client, thus creating activity so the ELB doesn't think the client is idle.
I tried testing this by setting the idle timeout on the ELB to 2 minutes. I set tcp_keepalives_idle to 30 seconds, which should force the server to send the client a keepalive every 30 seconds. I then execute the following query through the load balancer: psql -h elb_dns_name.com -U my_user -c "select pg_sleep(140)". After 2 minutes, the ELB disconnects the client. Why are the keepalives not coming through to the client? Is there something with pg_sleep that might be blocking them? If so, is there a better way to simulate a long running query/transaction?
I fear this might be a deep dive and I may need to bring out tcpdump or similar tools. Unfortunately things do get a bit more complicated to parse with all of the k8s chatter going on as well. So before going down this route I thought it would be good to see if I was missing something obvious. If not, any tips on how to best determine whether a keepalive is actually being sent to the server, through the ELB, and ending up at the client would be much appreciated.
Update: I reached out to Amazon regarding this. Apparently idle is defined as not transferring data over the wire. Data is defined as any network packet that has a payload. Since TCP keep-alives do not have payloads, the client and server keep-alives are considered idle. So unless there's a way to get the server to send data inside of their keep alive payloads, or send data in some other form, this may be impossible.
Keepalives are sent on the TCP level, well below PostgreSQL, so it doesn't make a difference if the server is running a pg_sleep or something else.
Since a hosted database is somewhat of a black box, you could try to control the behavior on the client side. The fortunate thing is that PostgreSQL also offers keepalive parameters on the client side.
Experiment with
psql 'host=elb_dns_name.com user=my_user keepalives_idle=1800' -c 'select pg_sleep(140)'
Related
in Java, application servers like JBoss EAP have the option to periodically verify the connections in a database pool (https://access.redhat.com/documentation/en-us/red_hat_jboss_enterprise_application_platform/6.4/html/administration_and_configuration_guide/sect-database_connection_validation). This has been very useful for removing stale connections.
I'm now looking at a ADO.NET application, and I was wondering if there was any similar functionality that could be used with a Microsoft SQL Server?
I ended up find this post by redgate that describes some of the validation that goes on when connections are taken from the pool:
If the connection has died because a router has decided that it no
longer wants to forward your packets and no other routers like you
either then there is no way to know this unless you try to send some
data and don’t get a response.
If you create a connection and a connection pool is created and
connections are put into the pool and not used, the longer they are in
there, the bigger the chance of something bad happening to it.
When you go to use a connection there is nothing to warn you that a
router has stopped forwarding your packets until you go to use it; so
until you use it, you do not know that there is a problem.
This was an issue with connection pooling that was fixed in the first
.Net 4 reliability update (see issue 14 which vaguely describes this)
with a feature called “Connection Pool Resiliency”. The update meant
that when a connection is about to be taken from the pool, it is
checked for TCP validity and only returned if it is in a good state.
When I run a query that takes a long time on my Postgres server (maybe 30 minutes), I get the error. I've verified the query is running with active status on the server using pgAdmin. I've also verified the correctness of the query, as it runs successfully on a smaller dataset. Server configurations are default, I haven't changed anything. Please help!
Look into the PostgreSQL server log.
Either you'll find a crash report there, which would explain the broken connection, or there is something in your network that cuts connections with no activity after a while.
Investigate your firewalls!
Maybe it is a solution to set the configuration parameter tcp_keepalives_idle to a value shorter than the time when the connection is cut. That will cause the server operating system to send keepalive messages on idle connections, which may be enough to prevent the overzealous connection reaper in your environment from disrupting your work.
The situation:
Postgres 9.1 on Debian Server
Scala(Java) application using the LISTEN/NOTIFY mechanism to get notified through JDBC
As there can be very long pauses (multipla days) between notifications I ran into the problem that the underlying TCP connection silently got terminated after some time and my application stopped to receive the notifications.
When googeling for a solution I found that there is a parameter tcpKeepAlive that you can set on the connection. So I set it to true and was happy. Until the next day I saw that again my connection was dead.
As I had been suspicious there was a wireshark capture running in parallel which now turns out to be very usefull. Just about exactly two hours after the last successfull communication on the connection of interest my application sends a keepalive packet to the database server. However the server responds with RST as it seems it has already closed the connection.
The net.ipv4.tcp_keepalive_time on the server is set to 7200 which is 2 hours.
Do I need to somehow enable keepalive on the server or increase the keepalive_time?
Is this the way to go about keeping my application connected?
TL;DR: My database connection gets terminated after long inactivity. Setting tcpKeepAlive didnt fix it as server responds with RST. What to do?
As Craig suggested in the comments the problem was very likely related to some piece of network hardware in between the server and the application. The fix was to increase the frequency of the keepalive messages.
In my case the OS was Windows where you have to create a Registry key with the idle time in milliseconds after which the message should be sent. Info on that here
I have set it to 15 minutes which seems to have solved the issue.
UPDATE:
It only seemed like it solved the issue. After about two days of program run time my connection was gone again. I switched to checking the validity my connection every time I use it. This does not seem like it is the solution but it is a solution nonetheless.
we’ve got a strange little problem we’re experiencing for months now:
The load on our cluster (http, long lasting keepalive connections with a lot of very short (<100ms) requests) is distributed very uneven.
All servers are configured the same way but some connections that push through thousands of requests per second just end up being sent to only one server.
We tried both load balancing strategies but that does not help.
It seems to be strictly keepalive related.
The misbehaving backend has the following settings:
option tcpka
option http-pretend-keepalive
Is the option http-server-close made to cover that issue?
If I get it right it will close and re-open a lot of connections which means load to the systems? Isn't there a way to keep the connections open but evenly balance the traffic anyway?
I tried to enable that option but it kills all of our backends when under load.
HAProxy currently only support keep-alive HTTP-connections toward the client, not the server. If you want to be able to inspect (and balance) each HTTP request, you currently have to use one of the following options
# enable keepalive to the client
option http-server-close
# or
# disable keepalive completely
option httpclose
The option http-pretend-keepalive doesn't change the actual behavior of HAProxy in regards of connection handling. Instead, it is intended as a workaround for servers which don't work well when they see a non-keepalive connection (as is generated by HAProxy to the backend server).
Support for keep-alive towards the backend server is scheduled to be in the final HAProxy 1.5 release. But the actual scope of that might still vary and the final release date is sometime in the future...
Just FYI, it's present in the latest release 1.5-dev20 (but take the fixes with it, as it shipped with a few regressions).
I've set up PGBouncer and configured it to connect to my postgres DB and it all connects just fine, however I'm not sure if its actually working.
I have a php script that runs as a daemon and picks up beanstalk jobs. The problem is for every distinct user/action on the system it opens a new connection to postgres and then leaves that connection idling because the daemon doesn't actually stop running so the connection is never terminated (a quick fix for this was to reset the connection at the end of the script loop, but that is going to be inefficient with many connects).
Anyway, this caused postgres to eventually run out of connections and lock up...
So PGBouncer seems like the answer.
But now when I run it, I see the same db connection multiple times when i do ps ax | grep postgres.
Isn't PGBouncer supposed to only ever have 1 connection open to the DB and route all traffic through that connection? Then open a new connection if that one is full?
At present I have 3 for one db connection (my access control system) and 2 for the other database (my client specific data).
To me, it feels like if I roll out these changes then I will just be faced with the same problem that the connections will just get eaten up again because they're not being released.
I hope that explains enough for someone to offer any advice.
It sounds like an important step here is to fix the scripts so they release the connections when they're done. Once you do that, PgBouncer will help reduce the connection setup/teardown overhead, but in connection pooling mode it won't give you the ability to maintain more connections to Pg than you otherwise could.
However, you can also use PgBouncer in transaction-pooling mode. When used for transaction pooling, PgBouncer keeps a pool of idle backend transactions and only assigns them when a client does a BEGIN. Connections are returned to the pool after the client does a COMMIT or ROLLBACK. That means that in transaction pooling mode you can have large numbers of open connections to PgBouncer; they don't each need a corresponding connection to a PostgreSQL backend, so long as most of them are idle at any point in time and don't have any transaction open.
The only real downside of transaction pooling mode is that it breaks applications that expect to SET session-level variables, keep prepared statements across transactions, etc. Applications may need modification, e.g. to use SET LOCAL.