My Netty TCP/IP server stops listening a few hours after starting - sockets

I have written a TCP/IP server using Netty 4.0, running on a Linux machine and listening to small GPS tracking devices. I have been facing a weird problem: the server suddenly stops listening to them several hours after I start it. There is no error log I can see, and the server is still running. It looks like only the channel is not working. When I run a client to do a health check, the client socket is still alive and keeps sending packets to the server, but the server does not receive them.
If you have any idea how to solve it, please tell me about it. It would be appreciated.

It is impossible to tell without more information. I would check different things, such as whether there was an OOM exception, or use telnet to see whether the server really refuses connections. Also, jstack may show you if there is a deadlock somewhere.
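While diagnosing, it can also help to make the channel's activity visible. Here is a minimal sketch of a Netty 4 server bootstrap with a LoggingHandler and TCP keepalive enabled, so a silently dead channel at least shows up in the logs; the class name, port, and protocol handlers are placeholders, not taken from the original server:

```java
import io.netty.bootstrap.ServerBootstrap;
import io.netty.channel.*;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.channel.socket.SocketChannel;
import io.netty.channel.socket.nio.NioServerSocketChannel;
import io.netty.handler.logging.LogLevel;
import io.netty.handler.logging.LoggingHandler;

public class GpsServer {
    public static void main(String[] args) throws Exception {
        EventLoopGroup boss = new NioEventLoopGroup(1);
        EventLoopGroup workers = new NioEventLoopGroup();
        try {
            ServerBootstrap b = new ServerBootstrap();
            b.group(boss, workers)
             .channel(NioServerSocketChannel.class)
             // Log accepts on the server channel so a "silently dead" listener shows up in the logs.
             .handler(new LoggingHandler(LogLevel.INFO))
             .childOption(ChannelOption.SO_KEEPALIVE, true)   // detect half-open client connections
             .childHandler(new ChannelInitializer<SocketChannel>() {
                 @Override
                 protected void initChannel(SocketChannel ch) {
                     ch.pipeline().addLast(new LoggingHandler(LogLevel.DEBUG));
                     // ... add the actual GPS protocol handlers here ...
                 }
             });
            ChannelFuture f = b.bind(5000).sync();   // placeholder port
            f.channel().closeFuture().sync();
        } finally {
            boss.shutdownGracefully();
            workers.shutdownGracefully();
        }
    }
}
```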

Related

TCP retransmission on RST - Different socket behaviour on Windows and Linux?

Summary:
I am guessing that the issue here is something to do with how Windows and Linux handle TCP connections, or sockets, but I have no idea what it is. I'm initiating a TCP connection to a piece of custom hardware that someone else has developed and I am trying to understand its behaviour. In doing so, I've created a .Net core 2.2 application; run on a Windows system, I can initiate the connection successfully, but on Linux (latest Raspbian), I cannot.
It appears that it may be because Linux systems do not try to retry/retransmit a SYN after a RST, whereas Windows ones do - and this behaviour seems key to how this peculiar piece of hardware works..
Background:
We have a black box piece of hardware that can be controlled and queried over a network, using a manufacturer-supplied Windows application. Data is unencrypted, no authentication is required to connect to it, and the application has some other issues. Ultimately, we want to be able to relay data from it to another system, so we decided to make our own application.
I've spent quite a long time trying to understand the packet format and have created a library, which targets .net core 2.2, that can be used to successfully communicate with this kit. In doing so, I discovered that the device seems to require a kind of "request to connect" command to be sent, via UDP. Straight afterwards, I am able to initiate a TCP connection on port 16000, although the first TCP attempt always results in a RST,ACK being returned - so a second attempt needs to be made.
What I've developed works absolutely fine on both Windows (x86) and Linux (Raspberry Pi/ARM) systems and I can send and receive data. However, when run on the Raspbian system, there seem to be problems when initiating the TCP connection. I could have sworn that we had it working absolutely fine on a previous build, but none of the previous commits seem to work - so it may well be a system/kernel update that has changed something.
The issue:
When initiating a TCP connection to this device, it will - straight away - reset the connection. It does this even with the manufacturer-supplied software, which itself then immediately re-attempts the connection again and it succeeds; so this kind of reset-once-then-it-works-the-second-time behaviour in itself isn't a "problem" that I have any control over.
What I am trying to understand is why a Windows system immediately re-attempts the connection through a retransmission, but the Linux system just gives up after one attempt (as seen at the end of the packet capture).
To prove it is not an application-specific issue, I've tried using ncat/netcat on both the Windows system and the Raspbian system, as well as a Kali system on a separate laptop to prove it isn't an ARM/Raspberry issue. Since the UDP "request" hasn't been sent, the connection will never succeed anyway, but this simply demonstrates different behaviour between the OSes.
Linux versions look pretty much the same as above, in that they send a single packet that gets reset, whereas the Windows attempt demonstrates multiple retransmissions.
So, does anyone have an answer for this behaviour difference? I am guessing it isn't a .NET Core-specific issue, but is there any way I can set socket options to attempt a retransmission? Or can it be set at the OS level with sysctl or something? I did try to see if there are any SocketOptionNames in .NET that look like they'd control attempts/retries, as this answer had me wondering, but no luck so far.
If anyone has any suggestions as to how to better align this behaviour across platforms, or can explain the reason for this difference at all, I would very much appreciate it!
Nice find! According to this, Windows' TCP will retry a connection if it receives a RST/ACK from the remote host after sending a SYN:
... Upon receiving the ACK/RST from the target host, the client determines that there is indeed no service listening there. In the Microsoft Winsock implementation of TCP, a pending connection will keep attempting to issue SYN packets until a maximum retry value is reached (set in the registry, this value defaults to 3 extra times)...
The value used to limit those retries is set in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\TcpMaxConnectRetransmissions according to the same article. At least in Win10 Pro it doesn't seem to be present by default.
Although this is a convenience on Windows machines, an application should still determine its own criteria for handling a failed connect attempt, IMO (i.e. number of attempts, timeouts, etc.).
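To illustrate that point about the application owning its own retry criteria, here is a rough sketch of a connect loop with an explicit attempt limit and timeout. The original project is .NET Core; this is plain Java purely for illustration, and the host, port, attempt count, and delays are assumed values:

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

public final class ConnectWithRetry {
    // Attempt the TCP connection ourselves a fixed number of times instead of
    // depending on the OS to retransmit a SYN after a RST.
    public static Socket connect(String host, int port, int maxAttempts, int timeoutMs)
            throws IOException, InterruptedException {
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            Socket socket = new Socket();
            try {
                socket.connect(new InetSocketAddress(host, port), timeoutMs);
                return socket;                 // connected on this attempt
            } catch (IOException e) {          // e.g. "connection refused" after a RST
                last = e;
                socket.close();
                Thread.sleep(200L * attempt);  // small back-off before retrying
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder address; port 16000 taken from the question.
        try (Socket s = connect("192.168.0.10", 16000, 3, 2000)) {
            System.out.println("Connected: " + s.getRemoteSocketAddress());
        }
    }
}
```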
Anyhow, as I said, surprising fact! Living and learning I guess ...
Cristian.

What happens when a socket's IP changes, and how to deal with it?

I'm developing a C/S program using Delphi 7 and the TServerSocket and TClientSocket controls. One problem is that, for now, I can only use my PC as the server, and my PC uses a virtual dialer, so the ISP keeps changing my IP, about once every day or two.
Because I'm using a router, the ServerSocket is opened directly on my local IP (192.168.1.x), which is just mapped to my public IP, so I suppose the ServerSocket itself shouldn't crash when my public IP changes. What I suppose should happen is: when my IP changes, all connected sockets become unavailable, and when my application doesn't know this and still uses a socket, the ServerSocket should receive some event like OnClientError.
But I found a weird problem: when my IP changed, the server application automatically shut down. I don't know exactly what happened, because the shutdown was in the afternoon while I was at my office, but I noticed something else: even though I use a heartbeat in my application-layer protocol, the server didn't catch the keep-alive failure - I record everything in a log file on my server, and I didn't find it there. So my server must have shut down the instant my IP changed, before the keep-alive logic was even reached.
This seems very weird - how can a socket error (due to an IP change) directly cause the whole application to shut down? If anyone has an explanation, and advice on how to deal with this problem, thanks.
Once the socket is opened, its bound IP address will never change. This cannot be 'fixed' on the server side. I would recommend working on the server's stability; also, the clients should detect that the server no longer exists at the given IP address and reconnect. (This is independent of why the server became unavailable - a restarting server is normal.)
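The question uses Delphi's TClientSocket, but the reconnect advice is language-agnostic. A rough Java sketch of a client loop that treats any read failure as "server gone" and reconnects after a delay (host, port, and delay are made-up values):

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.Socket;

public class ReconnectingClient {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress("server.example.com", 9000), 5000);
                socket.setKeepAlive(true);          // let TCP eventually notice a dead peer
                InputStream in = socket.getInputStream();
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) != -1) {
                    // ... handle n bytes of server data ...
                }
                // read() returned -1: the server closed the connection cleanly
            } catch (IOException e) {
                // connect failed or the connection broke (e.g. the server's IP changed)
            }
            Thread.sleep(5000);                     // wait before reconnecting
        }
    }
}
```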

TCP keepalive not working

The situation:
Postgres 9.1 on Debian Server
Scala(Java) application using the LISTEN/NOTIFY mechanism to get notified through JDBC
As there can be very long pauses (multiple days) between notifications, I ran into the problem that the underlying TCP connection silently got terminated after some time and my application stopped receiving the notifications.
When googling for a solution I found that there is a parameter, tcpKeepAlive, that you can set on the connection. So I set it to true and was happy. Until the next day, when I saw that my connection was dead again.
As I had been suspicious, there was a Wireshark capture running in parallel, which now turns out to be very useful. Just about exactly two hours after the last successful communication on the connection of interest, my application sends a keepalive packet to the database server. However, the server responds with RST, as it seems it has already closed the connection.
The net.ipv4.tcp_keepalive_time on the server is set to 7200 which is 2 hours.
Do I need to somehow enable keepalive on the server or increase the keepalive_time?
Is this the way to go about keeping my application connected?
TL;DR: My database connection gets terminated after long inactivity. Setting tcpKeepAlive didn't fix it, as the server responds with RST. What to do?
As Craig suggested in the comments, the problem was very likely related to some piece of network hardware between the server and the application. The fix was to increase the frequency of the keepalive messages.
In my case the OS was Windows, where you have to create a registry key with the idle time in milliseconds after which the message should be sent. Info on that here.
I have set it to 15 minutes, which seems to have solved the issue.
UPDATE:
It only seemed like it solved the issue. After about two days of program run time, my connection was gone again. I switched to checking the validity of my connection every time I use it. This does not seem like the solution, but it is a solution nonetheless.
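For reference, a hedged sketch of both ideas in plain JDBC with the PostgreSQL driver: enabling the driver's tcpKeepAlive property and calling isValid() before each use so a stale connection gets reopened. The URL, credentials, and timeout are placeholders:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Properties;

public class KeepAliveConnection {
    private static final String URL = "jdbc:postgresql://db.example.com:5432/mydb"; // placeholder
    private Connection connection;

    private Connection open() throws SQLException {
        Properties props = new Properties();
        props.setProperty("user", "myuser");          // placeholder
        props.setProperty("password", "secret");      // placeholder
        props.setProperty("tcpKeepAlive", "true");    // ask the driver to enable TCP keepalive
        return DriverManager.getConnection(URL, props);
    }

    // Validate the connection before every use and reopen it if it has gone stale.
    public synchronized Connection get() throws SQLException {
        if (connection == null || !connection.isValid(5)) {   // 5-second validation timeout
            if (connection != null) {
                try { connection.close(); } catch (SQLException ignored) { }
            }
            connection = open();
        }
        return connection;
    }
}
```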

Sockets won't connect after Windows 8.1 update

I am currently working on a project that involves the use of sockets. The program was working just fine, but after a Windows 8.1 update on my computer, either the client socket isn't sending out its connection request or the server socket isn't receiving it, so the connection is never accepted. I tried it on my emulator and it worked just fine. In addition, I have updated my GPU drivers to see if that would fix it, but the sockets still won't connect. The program doesn't crash and give me an error; it just sits there and waits. Does anyone have any ideas? Perhaps the update is blocking my peer-to-peer connection? Any help is much appreciated. Thanks.

Session getting disconnected in the middle of working

Sessions are getting disconnected automatically (in the middle of working).
Disconnection happens for users while they are working over a telnet connection to the Linux server via the PuTTY telnet application.
During the disconnection, the network bandwidth utilization is high, and there is no limit on the total number of users on the network.
Error "Hangup signal received (562)"
Any idea about this ??
The network connection was interrupted or a hangup signal was sent via "kill".
You mention network utilization being "high" when disconnects happen. How do you know that? What measurement are you looking at that tells you it is "high"? That might be a symptom of a networking issue that is at the root of the problem.
There are a few directions:
OpenEdge has published this article with links to implementing keep-alive packets:
https://knowledgebase.progress.com/articles/Article/Telnet-connection-times-out-after-15-minutes
Increase the number of "instances" in xinetd.conf, and then restart the service.
Make sure that the database watchdog is up and running: https://documentation.progress.com/output/ua/OpenEdge_latest/index.html#page/dmadm/prowdog-command.html
Check the database log file, to find out what happened just before the hangup (https://documentation.progress.com/output/ua/OpenEdge_latest/index.html#page/gsins/openedge-database-log-file.html)