Multiple could not receive data from client: Connection reset by peer Postgresql and Resque - postgresql

I have a server that runs Postgresql. in the logs I am seeing this message for my resque based 'worker' box, multiple times a minute. Some minutes there isn't a message, others could be 10 times.
2016-01-12 13:40:36 EST:1.1.8.2(33899):[16141]: LOG: could not receive data from client: Connection reset by peer
Now when i go into the 1.1.8.2 box to look at netstat -ntp i don't see a port 33899, and most of them are at least in the 40xxx range by now. That may be conjecture but I'm at a loss to find out why a Redis/Resque/Puma Rails stack would be printing out these messages, let alone what that means even if i get to the bottom of it.
Will I gain memory back if they are closed 'normally'?
Is this a thing to be wary of?
How does one debug OLD ports that are open when the db box and the worker box both don't display the ports any more?

This message is probably due to the resque worker task not closing the database connection before it exits. It's not a huge problem, but presumably Postgres is doing a little extra work to clean it up, and it makes a mess of your log file...
One solution is to add a hook to your resque worker's task file (the same file that contains the self.perform definition):
def self.after_perform(*args)
ActiveRecord::Base.connection.disconnect!
end

Related

reopen a connection on same port results in bad file descriptor handle

I have a proc which spawns a child process on port 2600. it connects to the process with handle 6, extracts some data, kills the process and then it starts another child process on the same port, I connect to it and when I try to run my commands on the handle I get this following error:
'Cannot write to handle 6. OS reports: Bad File Descriptor
The process is running and I can connect to it manually and get my data. I have a table tracking the connection and I can see the handle for the second time the child process is spawned is also 6. Any suggestions on why this is happening?
UPDATE: Just to make it clear. When I close the handle I send an async message to the proc with "exit 1". Could this be that the process is not killed before I open a connection to it, gets killed after that and then what I think it's the new process is just a garrbage handle to the old, now defunct process?
UPDATE: I do flush the handle after I send "exit 1" and the new process seem to start ok. I can see the new process running on the process (they have different names)
UPDATE: Even thought connection is successful the handle is not added to .z.W. so the handle doesn't appear in .z.W
UPDATE: found the issue. the .IPC.SCON function which I was using which i thought was just a simple wrapper for an error trap ... has logic to check a table of cached handles for the same hst:prt and take that one instead of opening another one. because my process was running on the same host:port, it was using the old cached handle instead of opening it again. thank you all for your help.
What version of kdb are you using? Can you run the below without issues?
KDB+ 4.0 2020.08.28 Copyright (C) 1993-2020 Kx Systems
q)system"q -p 5000";system"sleep 1";h:hopen 5000;0N!h".z.i";neg[h](exit;0)
4475i
q)system"q -p 5000";system"sleep 1";h:hopen 5000;0N!h".z.i";neg[h](exit;0)
4481i
q)system"q -p 5000";system"sleep 1";h:hopen 5000;0N!h".z.i";neg[h](exit;0)
4487i
Update: Without seeing your code or the order of execution, it's hard to pinpoint what the issue is. If the process wasn't terminated before you attempt to start the new process, you would see an Address already in use error. If the process isn't up when you attempt to connect you would see a 'hop error which means hopen failed.
Also, async messages are not sent immediately so depending on the execution order of your code, you may need to flush the async 'exit' message, but like I mentioned, if the original child process is still up when you attempt to start another, you would get an address clash
Update 2: As per my comment above
q)system"q -p 5000"; system"sleep 1"; h:#[hopen;(`::5000;2000);0Ni]; 0N!h".z.i"; neg[h](exit;1)
4365i
q)system"q busy.q -p 5000"; system"sleep 1"; #[hopen;(`::5000;2000);0Ni]; h".z.i" // hopen times out + handle variable not amended
'Cannot write to handle 4. OS reports: Bad file descriptor
[0] system"q busy.q -p 5000"; system"sleep 1"; #[hopen;(`::5000;2000);0Ni]; h".z.i" // hopen times out + handle variable not amended
^
q).z.W
q)
I think what you have written in your UPDATE is correct. The process is trying to connect to that port before your new process is running on it.
I think the best option would be for the newly started process to initiate the connection to the parent rather, that way you know it will be running and don't have to introduce any sleeps.
Other option is to try and reconnect on a timer, once a successful connection occurs remove it from the timer and continue with what you are trying to do.

Postgres 11.8 on AWS Losing Connection/Terminating Query After 15 Minutes with No Notification or Error

I am running into an issue where multiple different clients apps (DataGrip, DBeaver, Looker) have their queries cancelled after exactly 15 minutes, but no termination message or connection error is ever sent to the app. As far as the app is concerned, the query is still running even though it has been terminated in Postgres.
For example, if I run the following query, according to the client app it just runs forever. If I check pg_stat_activity, it shows the query no longer running after 15 minutes.
SELECT pg_sleep(16 * 60);
Does anyone know of a Postgres or AWS setting that would cause this? I've checked the configuration and couldn't find any settings set to a value of 15 minutes (or 900 seconds).
There is probably a ill-configured firewall that closes your session.
Assuming that the clients you are mentioning use libpq to connect to PostgreSQL, include this in the connection string:
keepalives_idle=300
See the documentation for details.
You could of course also configure the TCP stack on your operating system to use that value, so the problem will never surface again.
Your DB log might be able to tell you what happened.
In addition, check your statement_timeout setting. The units are milliseconds so you should be looking for 900000, not 900.
If it's not that, there exist firewalls that kill idle connections. Setting tcp_keepalives_idle could help avoid those types of problems.

Opening DB connections to Postgres taking long

Some of our applications are facing issues with the connection pool. I run one of them. A JEE application on Payara 4.1 which uses PostgreSQL 9.5.8.
I have as good as no problems when running the application localy with local db instance. When running on the remote environment I have seen issues happening every 10 minutes that the application was unresponsive (well, it actually responded everything with HTTP status 503). Guessing it was related to opening connections taking long, we have set the parameter idleTimeoutInSeconds="0" in jdbc-resource. Now we have the same issues about 4 times a day which is an improvement, but - well - neighbour systems are still complaining.
We usually run with 5 steady connection allowing maximum connections of 30. Our application usually uses 1 up to 2 to handle traffic. With TCP dump I have seen, that at a certain point in time the connection pool tries to open many connections (the pool realizes the connections it holds have been closed by the DB without any information like TCP FIN, opening each connections takes about 1 second). During this time of about 30 seconds not all requests can be safely queued and some 503 happen.
Locally everything is fine. Opening a connection takes ~50ms and everyone is happy. Our postgres team is not helping at all and I am stuck with a problem. As I don't see any improvement possibility with the connection pool in JEE, I have radical ideas going in the direction of:
Refreshing the connections myself. All the time. Constantly. (Which would be hard to implement in JEE where I can not simply look into the connection pool and tell each connection to be refreshed just in case).
Replacing the not-helping-at-all JEE implementation of connection pool with something that works better. (Future generations of developers maintaining our app will hate me...)
Replacing the DB with something managed by myself. (Even dumber idea)
Does anyone:
Has any idea how I could perform 1 or 2 above?
Has any other ideas what could help?
Here my current JDBC resource definition if needed:
<jdbc-resource poolName="<poolName>" jndiName="<jndiName>" isConnectionValidationRequired="true"
connectionValidationMethod="table" validationTableName="version()" maxPoolSize="30"
validateAtmostOncePeriodInSeconds="30" statementTimeoutInSeconds="30" isTimerPool="true" steadyPoolSize="5"
idleTimeoutInSeconds="0" connectionCreationRetryAttempts="100000" connectionCreationRetryIntervalInSeconds="30"
maxWaitTimeInMillis="2000">

Watchdog monitoring UNIX domain socket, triggering events upon specific content

I am on an embedded platform (mipsel architecture, Linux 2.6 kernel) where I need to monitor IPC between two closed-source processes (router firmware) in order to react to a certain event (dynamic IP change because of DSL reconnect). What I found out so far via strace is that whenever the IP changes, the DSL daemon writes a special message into a UNIX domain socket bound to a specific file name. The message is consumed by another daemon.
Now here is my requirement: I want to monitor the data flow through that specific UNIX domain socket and trigger an event (call a shell script) if a certain message is detected. I tried to monitor the file name with inotify, but it does not work on socket files. I know I could run strace all the time, filtering its output and react to changes in the filtered log file, but that would be too heavy a solution because strace really slows down the system. I also know I could just poll for the IP address change via cron, but I want a watchdog, not a polling solution. And I am interested in finding out whether there is a tool which can specifically monitor UNIX domain sockets and react to specific messages flowing through in a predefined direction. I imagine something similar to inotifywait, i.e. the tool should wait for a certain event, then exit, so I can react to the event and loop back into starting the tool again, waiting for the next event of the same type.
Is there any existing Linux tool capable of doing that? Or is there some simple C code for a stand-alone binary which I could compile on my platform (uClibc, not glibc)? I am not a C expert, but capable of running a makefile. Using a binary from the shell is no problem, I know enough about shell programming.
It has been a while since I was dealing with this topic and did not actually get around to testing what an acquaintance of mine, Denys Vlasenko, maintainer of Busybox, proposed as a solution to me several months ago. Because I just checked my account here on StackOverflow and saw the question again, let me share his insights with you. Maybe it is helpful for somebody:
One relatively easy hack I can propose is to do the following:
I assume that you have a running server app which opened a Unix domain listening socket (say, /tmp/some.socket), and client programs connect to it and talk to the server.
rename /tmp/some.socket -> /tmp/some.socket1
create a new socket /tmp/some.socket
listen on it for new client connections
for every such connection, open another connection to /tmp/some.socket1 to original server process
pump data (client<->server) over resulting pairs of sockets (code to do so is very similar to what telnetd server does) until EOF from either side.
While you are pumping data, it's easy to look at it, to save it, and even to modify it if you need to.
The downside is that this sniffer program needs to be restarted every time the original server program is restarted.
This is similar to what Celada also answered. Thanks to him as well! Denys's answer was a bit more concrete, though.
I asked back:
This sounds hacky, yes, because of the restart necessity, but feasible.
Me not being a C programmer, I keep wondering though if you know a
command line tool which could do the pass-through and protocolling or
event-based triggering work for me. I have one guy from our project in
mind who could hack a little C binary for that, but I am unsure if he
likes to do it. If there is something pre-fab, I would prefer it. Can it
even be done with a (combination of) BusyBox applet(s), maybe?
Denys answered again:
You need to build busybox with CONFIG_FEATURE_UNIX_LOCAL=y.
Run the following as intercepting server:
busybox tcpsvd -vvvE local:/tmp/socket 0 ./script.sh
Where script.sh is a simple passthrough connection
to the "original server":
#!/bin/sh
busybox nc -o /tmp/hexdump.$$ local:/tmp/socket1 0
As an example, I added hex logging to file (-o FILE option).
Test it by running an emulated "original server":
busybox tcpsvd -vvvE local:/tmp/socket1 0 sh -c 'echo PID:$$'
and by connecting to "intercepting server":
echo Hello world | busybox nc local:/tmp/socket 0
You should see "PID:19094" message and have a new /tmp/hexdump.19093 file
with the dumped data. Both tcpsvd processes should print some log too
(they are run with -vvv verbosity).
If you need more complex processing, replace nc invocation in script.sh
with a custom program.
I don't think there is anything that will let you cleanly sniff UNIX socket traffic. Here are some options:
Arrange for the sender process to connect to a different socket where you are listening. Also connect to the original socket as a client. On receipt of data, notice the data you want to notice and also pass everything along to the original socket.
Monitor the system for IP address changes yourself using a netlink socket (RTM_NEWADDR, RTM_NEWLINK, etc...).
Run ip monitor as an external process and take action when it writes messages about added & removed IP addresses on its standard output.

Sqlalchemy: Connections aren't closed when pool is overflowed

When I run ab (apache benchmark) on my site (with SQLAlchemy and postgresql hosted on Apache web server), SQLAlchemy makes many connections to postgre and I got too many connections error.
I traced the problem, and found that problem is the pool (actually QueuePool).
The documentation at http://www.sqlalchemy.org/docs/core/pooling.html#sqlalchemy.pool.Pool says that if when the pool is full, returning connections (that opened because max_overflow allowed creation of these extra connections) will be discarded and disconnected.
But it seems connections actually didn't close! They silently dropped out of pool without closing.
So SQLAlchemy continuously opens new connections, ignores them (without closing!) and opens new ones.
Increasing pool size is not the real solution, the problem is additional connections aren't closed.
(Default settings for QueuePool is pool_size=5 and max_overflow=10)
Looks like a bug in SQLAlchemy, fixed 2 weeks ago: http://hg.sqlalchemy.org/sqlalchemy/rev/aff95843c12f#l2.17
There was no release with this fix, so you have to patch it manually.
i think its bug and fixed ... install from source and have fun ;)