Error Listening To Socket - Cannot run MOOS-IvP on BeagleBone Black

I am attempting to run MOOS-IvP on a BeagleBone Black.
When I try to start the MOOS database, it continuously throws the exception:
"Exception Thrown in listen loop: Error Listening To Socket. Operation not supported"
The same software runs fine on a Raspberry Pi.
Any ideas what might be the issue?

I have found the problem and fixed it.
When the socket is created it needs to be TCP. However, when getprotobyname(_sName) is called in the XPCGetProtocol class to look up the correct protocol number in /etc/protocols, it returns the value from the previous time it was called, which was when a UDP socket was set up.
To fix it I simply called the function twice; the second time it returns the correct value.
I am not sure why it returns the wrong value the first time, but this works!
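For anyone curious about the underlying cause: getprotobyname() returns a pointer into storage that the C library may reuse on the next call, so a stale pointer can silently pick up another protocol's details. A minimal C sketch of the effect (exactly how it manifests depends on the libc, so treat this as illustrative rather than a guaranteed reproduction):

```c
#include <netdb.h>
#include <stdio.h>

int main(void)
{
    /* Look up UDP first and keep only the returned pointer (no copy). */
    struct protoent *udp = getprotobyname("udp");
    if (udp == NULL)
        return 1;
    printf("udp -> p_proto = %d\n", udp->p_proto);     /* typically 17 */

    /* A later lookup may reuse the same static storage...             */
    struct protoent *tcp = getprotobyname("tcp");
    if (tcp == NULL)
        return 1;

    /* ...so the value seen through the old pointer may have changed.  */
    printf("old udp pointer now -> p_proto = %d\n", udp->p_proto);
    printf("tcp -> p_proto = %d\n", tcp->p_proto);      /* typically 6 */
    return 0;
}
```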

I also encountered this error while working with a BeagleBone Black running Ubuntu 14.04. However, the solution of running the request twice did not work for me. More troubleshooting led me to determine that the socket that was supposed to be TCP was opened after another process had opened a UDP socket. The structure returned by getprotobyname() is a pointer to a static location that does not change from call to call, but does get updated with the protocol details (see here, although for another Unix OS). Therefore, the second call by another process overwrites the original details.
This then gets tested during socket creation in the constructor of XPCSocket, and results in the creation of a UDP socket where there should have been a TCP socket. This could probably be fixed by adding a lock around this function, but I took the non-blocking approach of initializing the requested protocol using the string the constructor was called with (_sProtocol) instead of the one returned in the socketProtocol structure. In addition, I modified the XPCGetProtocol class to store the protocol number in a member variable that would not be changed by subsequent calls to getprotobyname().
My modifications can be found here.
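A sketch of the defensive pattern described above, using an illustrative helper name (lookup_proto_number is not the actual MOOS-IvP code): copy the protocol number out of the protoent as soon as the lookup is made, so later lookups cannot change it.

```c
#include <netdb.h>

/* Resolve a protocol name (e.g. "tcp" or "udp") to its number once and
 * return the value by copy, so a later getprotobyname() call for a
 * different protocol cannot overwrite it through the shared protoent. */
static int lookup_proto_number(const char *name)
{
    struct protoent *pe = getprotobyname(name);
    if (pe == NULL)
        return -1;            /* not found in /etc/protocols */
    return pe->p_proto;       /* copy the number immediately  */
}
```

The cached integer can then be passed straight to socket(AF_INET, SOCK_STREAM, proto) without ever touching the protoent structure again.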

Related

Any ideas why we're getting intermittent gRPC Unavailable/Unknown RpcExceptions? (C++/C#)

We are using gRPC (version 1.37.1) for inter-process communication between our C# process and our C++ process. Both processes act as both server and client to the other and run on the same machine over localhost using the HTTP/2 transport. All of the calls are blocking synchronous unary calls, not bi-directional streaming. Some average(ish) stats:
From C++->C#: 0-2 calls per second, 0-40 calls per minute
From C#->C++: 0-5 calls per second, 0-200 calls per minute
Intermittently, we were getting one of three issues:
C# client call to C++ server comes back with an RpcException, usually “HTTP2/Parse Error”, “Endpoint Read Failed”, or “Transport Closed”
C++ client call to C# server comes back with Unavailable or Unknown
C++ client WaitForConnected call to check the channel fails after 500ms
The first one is the most frequent and the one we have the most information about. Usually what we'll see is that the client receives the RPC call and runs into an unknown frame type. Then the subchannel goes into shutdown and everything usually reconnects fine. We also generally see an embedded error like the following (note that we replaced all FILE instances with FUNCTION in our gRPC source):
win_read","file_line":307,"os_error":"The system detected an invalid pointer address in attempting to use a pointer argument in a call.\r\n","syscall":"WSARecv","wsa_error":10014}]},{"created":"#1622120588.494000000","description":"frame of size 262404 overflows local window of 65535","file":"grpc_core::chttp2::TransportFlowControl::ValidateRecvData","file_line":213}]}
What we've seen with the unknown frame type is that the parser handles the HEADERS, WINDOW_UPDATE, DATA, and WINDOW_UPDATE frames, then gets a TCP: on_read without a corresponding READ and tries to parse again. It's this parse where the parser appears to be at the wrong offset in the buffer, because the unknown frame type, incoming frame size and incoming stream_id all map to the middle of the RPC call that it just parsed.
The above is what we were encountering prior to a change to create a new channel for each RPC call. While we realize this is not great from a performance standpoint, we have seen increased stability since making the change. However, we still occasionally get RPC exceptions. Now the most common is "Unknown"/"Stream Removed" rather than the ones listed above.
Any ideas on what might be going wrong are appreciated. We've turned on all gRPC tracing and have even added to it, as well as captured the issue in Wireshark, but so far we aren't getting a great indication of what's causing the transport to close. Are there any good tools to monitor the socket/port for failure?

What is correct procedure following a failure to connect a TCP socket?

I'm writing a TCP client using asynchronous calls. If the server is active when the app starts then it connects and talks OK. However if the first connect fails, then every subsequent call to connect() fails with WSAENOTCONN(10057) without producing any network traffic (checked with Wireshark).
Currently the code does not close the socket when it gets the error. The TCP state diagram does not seem to require it. It simply waits for 30 seconds and tries the connect() again.
That subsequent connect() and the following read() both return the WSAENOTCONN error.
Do I need to close the socket and open a new one? If so, which errors require me to close the socket, since there are a lot of different errors, some of which I will probably never see on the test bench.
You can assume this is MS Winsock2, although it is actually Interval Zero RTX 2009, which is subtly different in some places.
Do I need to close the socket and open a new one?
Yes.
If so, which errors require me to close the socket, since there are a lot of different errors, some of which I will probably never see on the test bench.
Almost all errors are fatal to the connection and should result in you closing the socket. EAGAIN/EWOULDBLOCK is a prominent exception, as is EINTR, but I can't think of any others offhand.
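For illustration, a minimal Winsock sketch of that pattern (using a blocking connect() for brevity, assuming WSAStartup() has already been called, and with an illustrative helper name): each retry closes the failed socket and creates a brand-new one, rather than reusing the old handle.

```c
#include <winsock2.h>
#include <stdio.h>

/* Retry until connected, creating a fresh socket for every attempt;
 * a socket whose connect() has failed is closed, never reused. */
static SOCKET connect_with_retry(const struct sockaddr_in *addr)
{
    for (;;) {
        SOCKET s = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
        if (s == INVALID_SOCKET) {
            fprintf(stderr, "socket() failed: %d\n", WSAGetLastError());
            return INVALID_SOCKET;
        }

        if (connect(s, (const struct sockaddr *)addr, sizeof *addr) == 0)
            return s;                   /* connected */

        fprintf(stderr, "connect() failed: %d\n", WSAGetLastError());
        closesocket(s);                 /* discard the failed socket */
        Sleep(30 * 1000);               /* wait before the next attempt */
    }
}
```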
Do I need to close the socket and open a new one?
Yes.
You should close the socket under all error conditions that result in the connection being gone for good (say, when the peer has closed the connection).

Kernel gets stuck after sock_release() call in a custom module

I wrote a Kernel module that deals with socket-based TCP connections. Everything works great except for one specific use case. I'd appreciate it if somebody could advise me how to solve the problem described below.
I have:
A Kernel module which is a device registered using misc_register().
A user space application that communicates with this module using the standard file I/O functions: open, close, ioctl, etc.
The exact scenario looks like this:
Load the module using insmod.
Open the associated device from the user application using the standard open() function.
Call ioctl(), which performs the following actions in the Kernel module (insignificant code lines omitted):
```c
...
sock_create(PF_INET, SOCK_STREAM, 0, sock);   /* sock is a struct socket ** */
...
flags = O_NONBLOCK;
sock_map_fd(*sock, flags);
...
kernel_connect(*sock, (struct sockaddr *)server_addr,
               sizeof(struct sockaddr_in), (*sock)->file->f_flags);
...
```
All functions return successfully. The TCP connection is established successfully. After that there can also be reads/writes on this connection, but they don't influence the problem.
If the application finishes naturally, or I interrupt it by sending SIGINT, the connection is closed nicely, with FIN exchange etc. On SIGKILL the connection is also torn down as I expect. No problems so far.
Now I would like to close this socket without stopping the application. I try to do it by calling sock_release() in my Kernel module via another ioctl call. Upon this call the TCP connection is also closed nicely. However, now the Kernel gets stuck when my application finishes or is interrupted!
I suspect that the Kernel somehow is not "informed" that the socket is closed. It tries to close it again and fails once the socket memory structure has been de-allocated.
Has anybody used sockets from Kernel modules and had similar problems?
Can you recommend an alternative way to work with TCP sockets from Kernel modules?
Are there alternative ways to close sockets from within the Kernel?
Thank you very much in advance.
After investigating the Kernel code I found out that if you map a socket to a file using the sock_map_fd() function, it is not enough to call sock_release(). That function doesn't release the file descriptor associated with the socket. If you really need to map a Kernel socket to a file, keep the file descriptor returned by sock_map_fd() and use the sys_close() function to close the socket and clean up the associated file. Note that when the device file descriptor is closed, all sockets created in the module and associated with files are also closed automatically.
Alternatively, you can simply avoid mapping the socket to a file descriptor. The socket's basic functionality stays intact even without the mapping, and in this case sock_release() works perfectly.
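A minimal sketch of the second alternative, assuming a kernel of roughly the same vintage as the question (so that sock_create()/kernel_connect() have the signatures used above) and with illustrative function names: the socket is never mapped to a file descriptor, so sock_release() alone is enough to tear it down.

```c
#include <linux/net.h>
#include <linux/in.h>

/* Connect a kernel-side TCP socket without sock_map_fd(); since no file
 * descriptor is involved, sock_release() by itself cleans everything up. */
static struct socket *my_tcp_connect(struct sockaddr_in *server_addr)
{
    struct socket *sock;
    int err;

    err = sock_create(PF_INET, SOCK_STREAM, IPPROTO_TCP, &sock);
    if (err < 0)
        return NULL;

    err = kernel_connect(sock, (struct sockaddr *)server_addr,
                         sizeof(struct sockaddr_in), 0);
    if (err < 0) {
        sock_release(sock);
        return NULL;
    }
    return sock;
}

static void my_tcp_disconnect(struct socket *sock)
{
    kernel_sock_shutdown(sock, SHUT_RDWR);  /* send FIN politely */
    sock_release(sock);                     /* free the socket   */
}
```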

GetQueuedCompletionStatus returns ERROR_NETNAME_DELETED on remote socket closure

I am writing some small server-client code using an I/O completion port.
I get the server and client connected successfully via AcceptEx over my completion port.
After the client has connected, the client socket is associated with the completion port and an overlapped call to WSARecv is issued on that socket.
Everything works just fine until I close the client test program.
GetQueuedCompletionStatus() returns FALSE and GetLastError returns ERROR_NETNAME_DELETED, which makes sense to me (after I read the description on MSDN).
But my problem is that I thought the call to GetQueuedCompletionStatus would return a packet indicating the failure due to the closure of the socket, because WSARecv would return the appropriate return value.
Since this is not the case, I don't know which client's socket caused the error and can't act the way I need to (freeing structures, cleanup for this particular connection, etc.)...
Any suggestions or hints on how to solve this?
Thanks :)
EDIT: http://codepad.org/WeYINasO <- the code responsible... The "error" occurs at the beginning of the first function's while loop (the call to GetCompletionStatus(), which is only a wrapper for GetQueuedCompletionStatus() and works fine in other cases). [I posted it there because it looks messy in here.]
Here are the scenarios that you need to watch for when calling GetQueuedCompletionStatus:
GetQueuedCompletionStatus returns TRUE: A successful completion packet has been received, all the out parameters have been populated.
GetQueuedCompletionStatus returns FALSE, lpOverlapped == NULL: No packet was dequeued. The other out parameters contain indeterminate values.
GetQueuedCompletionStatus returns FALSE, lpOverlapped != NULL: The function has dequeued a failed completion packet. The error code is available via GetLastError.
To answer your question, when GetQueuedCompletionStatus returns FALSE and lpOverlapped != NULL, there was a failed I/O completion. It's the value of lpOverlapped that you need to be concerned about.
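In other words, the per-connection context is recovered from lpOverlapped, not from the return value. A common pattern, sketched here with hypothetical structure and function names rather than your actual code, is to embed the OVERLAPPED at the start of a per-connection structure and get back to it with CONTAINING_RECORD:

```c
#include <winsock2.h>
#include <windows.h>

/* Hypothetical per-connection context; the OVERLAPPED is embedded so we
 * can map a dequeued lpOverlapped back to the connection that owns it. */
typedef struct PER_IO_CONTEXT {
    OVERLAPPED overlapped;   /* used as the OVERLAPPED for WSARecv etc. */
    SOCKET     socket;       /* the client socket this I/O belongs to   */
    char       buffer[4096];
} PER_IO_CONTEXT;

static void completion_loop(HANDLE iocp)
{
    for (;;) {
        DWORD        bytes = 0;
        ULONG_PTR    key   = 0;
        LPOVERLAPPED ov    = NULL;

        BOOL ok = GetQueuedCompletionStatus(iocp, &bytes, &key, &ov, INFINITE);

        if (!ok && ov == NULL)
            continue;                 /* no packet dequeued at all */

        /* Recover the connection that the completed (or failed) I/O belongs to. */
        PER_IO_CONTEXT *ctx = CONTAINING_RECORD(ov, PER_IO_CONTEXT, overlapped);

        if (!ok || bytes == 0) {
            /* Failed completion (e.g. ERROR_NETNAME_DELETED) or graceful close:
             * clean up this particular connection. */
            closesocket(ctx->socket);
            /* free ctx / remove it from any bookkeeping here */
            continue;
        }

        /* ... handle 'bytes' of received data in ctx->buffer ... */
    }
}
```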
I know this is an old question, but I found this page while fruitlessly googling for details about ERROR_NETNAME_DELETED. It is an error which I get while doing an overlapped ReadFile().
After some debugging it turned out that the problem was caused by a program which was writing to a socket but forgetting to call closesocket() before using ExitProcess() (due to garbage collection issues). Calling CloseHandle() did not prevent the error, nor did adding WSACleanup() before ExitProcess(). However, adding a short sleep before the client exited did prevent the error. Maybe avoiding ExitProcess() would have prevented the problem also.
So I suspect your problem is caused by the program exiting without closing down the socket properly.
I don't think this would be an issue on Unix where sockets are just ordinary file descriptors.

What causes the ENOTCONN error?

I'm currently maintaining some web server software and I need to perform a lot of I/O operations. The read(), write(), close() and shutdown() calls, when used on a socket, may sometimes raise an ENOTCONN error. What exactly does this error mean? What are the conditions that would trigger it? I can never seem to reproduce it locally but there are users who can.
Right now I just ignore ENOTCONN when raised by close() and shutdown() because it seems harmless, but I'm not entirely sure.
EDIT:
I am absolutely sure that the connect() call succeeded; I check its return value.
ENOTCONN is most often raised by close() and shutdown(). I've only very rarely seen read() or write() raise ENOTCONN.
If you are sure that nothing on your side of the TCP connection is closing the connection, then it sounds to me like the remote side is closing the connection.
ENOTCONN, as others have pointed out, simply means that the socket is not connected. This doesn't necessarily mean that connect failed. The socket may well have been connected previously; it just wasn't at the time of the call that resulted in ENOTCONN.
This differs from:
ECONNRESET: the other end of the connection sent a TCP reset packet. This can happen if the other end is refusing a connection, or doesn't acknowledge that it is already connected, among other things.
ETIMEDOUT: this generally applies only to connect. This can happen if the connection attempt is not successful within a system-dependent amount of time.
EPIPE can sometimes be returned by some socket-related system calls under conditions that are more or less the same as ENOTCONN. For example, on some systems, EPIPE and ENOTCONN are synonymous when returned by send.
While it's not unusual for shutdown to return ENOTCONN, since this function is supposed to tear down the TCP connection, I would be surprised to see close return ENOTCONN. It really should never do that.
Finally, as dwc mentioned, EBADF shouldn't apply in your scenario unless you are attempting some operation on a file descriptor that has already been closed. Having a socket get disconnected (i.e. the TCP connection has broken) is not the same as closing the file descriptor associated with that socket.
It's because, at the moment of calling shutdown() on the socket, you have data in the socket's buffer waiting to be delivered to the remote party, which has close()d or shut down its receiving socket.
I don't fully understand how sockets work; I am rather a noob, and I've failed to even find the files where this shutdown function is implemented, but seeing that there's practically no user manual for the whole sockets thing, I started trying all the possibilities until I got the error in a "controlled" environment. It could be something else, but after much trying these are the explanations I settled on:
If you send data after the remote side has closed the connection, when you shutdown() you get the error.
If you send data before the remote side closes the connection but it doesn't get recv()d on the other end, you can shutdown() once; the next time you try to shutdown(), you get the error.
If you didn't send any data, you can shutdown() as many times as you want, as long as the remote side doesn't shutdown(); once the remote side has shutdown(), if you try to shutdown() a socket that was already shut down, you get the error.
I believe ENOTCONN is returned because shutdown() is not supposed to return ECONNRESET or other, more accurate errors.
It is wrong to assume that the other side "just" closed the connection. At the TCP level, the other side can only half-close a connection (or abort it). The connection is only fully closed in the ordinary way if both sides do a shutdown() (or close()). If both sides do that, shutdown() actually succeeds for both of them!
The problem is that shutdown() did not succeed in ordinarily (half-)closing the connection, neither as the first side to close it nor as the second. Of the errors listed in the POSIX docs for shutdown(), ENOTCONN is the least inappropriate, because the others indicate problems with the arguments passed to shutdown() (or local resource problems handling the request).
So what happened? These days, a NAT device somewhere between the two parties involved might have dropped the association and sent out RST packets as a reaction. Reset connections are so common for IPv4 that you will run into them anywhere in your code, even masked as ENOTCONN in shutdown().
A coding bug might also be the reason. On a non-blocking socket, for example, connect() can return before the connection has actually been established, so the socket may not be connected even though the call has returned.
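For completeness, a minimal sketch of how a non-blocking connect() is usually completed correctly on POSIX systems, with illustrative names and timeout: wait for writability and then check SO_ERROR before treating the socket as connected.

```c
#include <sys/socket.h>
#include <sys/select.h>
#include <errno.h>

/* Finish a non-blocking connect(): returns 0 if the socket is really
 * connected, -1 (with errno set) otherwise. */
static int finish_nonblocking_connect(int fd, int timeout_sec)
{
    fd_set wfds;
    struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };

    FD_ZERO(&wfds);
    FD_SET(fd, &wfds);

    /* The socket becomes writable once the connect attempt has finished. */
    int r = select(fd + 1, NULL, &wfds, NULL, &tv);
    if (r < 0)
        return -1;                /* select() itself failed */
    if (r == 0) {
        errno = ETIMEDOUT;        /* connect attempt timed out */
        return -1;
    }

    /* Writable does not mean connected: check the pending error. */
    int err = 0;
    socklen_t len = sizeof err;
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
        return -1;
    if (err != 0) {
        errno = err;              /* e.g. ECONNREFUSED, ETIMEDOUT */
        return -1;
    }
    return 0;                     /* genuinely connected */
}
```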
Transport endpoint is not connected
The socket is associated with a connection-oriented protocol and has not been connected. This is usually a programming flaw.
From: http://www.wlug.org.nz/ENOTCONN
If you're sure you've connected properly in the first place, ENOTCONN is most likely to be caused by either the fd being closed on your end (perhaps in another thread?) while you're in the middle of a request, or by the connection dropping while you're in the middle of the request.
In any case, it means that the socket is not connected. Go ahead and clean up that socket. It's dead. No problem calling close() or shutdown() on it.
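A small sketch of the cleanup suggested above, with an illustrative helper name: attempt an orderly shutdown, treat ENOTCONN (the connection is already gone) as harmless, and always close the descriptor.

```c
#include <sys/socket.h>
#include <unistd.h>
#include <errno.h>

/* Tear down a socket whose connection may already be gone.  ENOTCONN from
 * shutdown() is expected in that case and is deliberately ignored. */
static void close_socket_quietly(int fd)
{
    if (shutdown(fd, SHUT_RDWR) < 0 && errno != ENOTCONN) {
        /* Some other error: log it if you care, but still close the fd. */
    }
    close(fd);   /* always release the descriptor */
}
```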