If a close is interrupted or fails, what is the state of the fd? - sockets

Reading the man page for close, if it's interrupted by a signal then the fd's state is unspecified. Is there a best practice for handling this case, or is it assumed to become the OS's problem afterwards?
I assume that failure after EIO closes the fd appropriately.

If you want your program to run for a long time, a possible file descriptor leak is never only the operating system's problem. Short-lived programs which don't use many file descriptors of course have the option of terminating with their descriptors unclosed, and relying on the kernel closing them when the program terminates. So, for the rest of my answer I'll assume your program is long-running.
If your program is not multi-threaded, you have a very easy situation:
int close_some_fd(int fd, int *was_interrupted) {
    *was_interrupted = 0;
    /* this is point X to which I will draw your attention later. */
    for (;;) {
        if (0 == close(fd))
            return 0;  /* Success. */
        if (EINTR != errno)
            return -1; /* Some failure, not interrupted by a signal. */
        /* Our close attempt was interrupted. */
        *was_interrupted = 1;
        /* Just use the fd to find out if it is still open. */
        int fdflags = fcntl(fd, F_GETFD);
        if (-1 == fdflags)
            return 0;  /* Interrupted, but the file is also closed, so we are done. */
    }
}
On the other hand, if your code is multi-threaded, some other thread (and perhaps one you don't control, such as a name service cache) may have called dup, dup2, socket, open, accept or some other similar function that makes the kernel allocate a file descriptor.
To make a similar approach work in such an environment you will need to be able to tell the difference between the file descriptor you started with and a file descriptor newly opened by another thread. Knowing that there is already a lower-numbered fd which still isn't open is enough to discount many of those, but in a multi-threaded environment you don't have a simple way of figuring out that this is still the case.
One option is to rely on some property common to all the file descriptors your program works with. For example, if it never uses the close-on-exec flag, you can use fcntl(fd, F_SETFD, FD_CLOEXEC) to set that flag at the point marked X in the code, and then you just change the existing call to fcntl like this:
int fdflags = fcntl(fd, F_GETFD);
if (-1 == fdflags)
    return 0;  /* Interrupted, but the file is also closed, so we are done. */
if (fdflags & FD_CLOEXEC) {
    /* Still open and still marked close-on-exec: it is still our fd, so retry the close. */
    continue;
}
/* Open, but the mark is gone: another thread has reused this fd number, so our close took effect. */
return 0;
You can adjust this approach to use something other than FD_CLOEXEC (fstat, for example, recording st_dev and st_ino), but if you can't be sure what the rest of your multithreaded program is doing, this general idea is likely to be unsatisfying.
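For illustration, here is a minimal sketch of that fstat-based variant (a hypothetical helper of my own, subject to the same caveat that another thread can race with it between the fstat and the retried close):
#include <errno.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 0 once the fd is known to be closed, -1 on a non-EINTR failure. */
int close_checking_identity(int fd) {
    struct stat before;
    if (fstat(fd, &before) != 0)
        return -1;                   /* cannot record the fd's identity; give up */
    for (;;) {
        if (close(fd) == 0)
            return 0;                /* Success. */
        if (errno != EINTR)
            return -1;               /* Some failure, not interrupted by a signal. */
        struct stat after;
        if (fstat(fd, &after) != 0)
            return 0;                /* fd is gone: the close took effect. */
        if (after.st_dev != before.st_dev || after.st_ino != before.st_ino)
            return 0;                /* fd number reused by another thread: our close took effect. */
        /* Same file is still open on this fd; retry the close. */
    }
}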
There's another approach which is also a bit problematic, but which might serve. This is that, instead of closing your file descriptor, you use sendmsg to pass the file descriptor across a Unix domain socket to a separate, single-threaded special-purpose server whose only job is to close the file descriptor. Yes, this is a bit icky. Some entertaining properties of this approach though are:
- In order to avoid uncertainty over whether your fd was really passed to the server and closed successfully, you should probably read from a return channel of fd-closed-OK messages coming back from the server. This also avoids needing to block signal delivery while you are executing sendmsg. However, it means a user-space context-switch overhead for every file descriptor close (unless you batch them up to amortise the cost).
- You need to avoid a situation where thread A may read an fd-closed-OK report corresponding to a request made from thread B. You can avoid that problem by serialising close operations (which will limit performance) or by demultiplexing the responses (which is complex). Alternatively, you could use some other IPC mechanism to avoid the need to serialise or demultiplex (SysV semaphores, for example).
- For a high-volume process this will place an entertainingly high load on the kernel's file descriptor garbage-collection apparatus, which is normally not highly stressed and may therefore give you some interesting symptoms.
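The fd-passing step itself would look something like this minimal sketch (hypothetical names; the single-threaded server at the other end of the Unix domain socket would receive the descriptor with recvmsg, close it, and write back an fd-closed-OK message):
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Hand fd to the fd-closing server over the already-connected AF_UNIX socket channel_fd. */
int send_fd_for_closing(int channel_fd, int fd) {
    char dummy = 'c';                       /* at least one byte of real data must accompany the fd */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {
        struct cmsghdr align;               /* ensures correct alignment of the control buffer */
        char buf[CMSG_SPACE(sizeof(int))];
    } control;
    memset(&control, 0, sizeof(control));

    struct msghdr msg;
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = control.buf;
    msg.msg_controllen = sizeof(control.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;           /* pass a file descriptor */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return (sendmsg(channel_fd, &msg, 0) == 1) ? 0 : -1;
}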
As for what I'd do personally in the applications I work with most often, I'd figure out to what extent I could make assumptions about what kind of file descriptor I was closing. If, for example, they were normally sockets, I'd just try this out with a test program and figure out whether EIO normally leaves the file descriptor closed or not. That is, determine whether the theoretically unspecified state of the file descriptor on EIO is, in practice, predictable. This will not work well if the fd could be anything (e.g. disk file, socket, tty, ...). Of course, if you're using some open-source operating system you may just be able to read the kernel source and determine what the possible outcomes are.
Certainly I'd try the above experiment-based system before worrying about sharding the fd-close servers to scale out on fd-closing.

Related

What's the read logic when I call recvfrom() function in C/C++

I wrote a C++ program that creates a socket and binds to it to receive ICMP/UDP packets. The code I wrote is as follows:
while (true) {
    recvfrom(sockId, rePack, sizeof(rePack), 0, (struct sockaddr *)&raddr, (socklen_t *)&len);
    processPacket(recv_size);
}
So I used an endless while loop to receive messages continually, but I'm worried about the following two questions:
1. How long will a message be kept in the receive queue (or, say, the NIC queue)?
I'm worried that if it takes too long to process the first message, I might miss the second one. So how quickly do I need to call recvfrom again after each read?
2. How do I prevent reading duplicated messages?
That is, does the receive queue keep track of me? When my thread has finished reading the first message, will the queue automatically give me the second one? Or, put another way, once I read the first message, is it removed from the queue so that no one can receive it again?
Additionally, I think the while(true) loop is not good; could anyone give me a better suggestion? (I've heard of something like a polling model.)
First, you should always check the return value from recvfrom. It's unlikely the recvfrom will fail, but if it does (for example, if you later implement signal handling, it might fail with EINTR) you will be processing undefined data. Also, of course, the return value tells you the size of the packet you received.
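For instance, a minimal sketch of the question's loop with the return value checked (this assumes, hypothetically, that processPacket takes the buffer and its length; it also needs <errno.h> and <stdio.h>):
while (true) {
    ssize_t recv_size = recvfrom(sockId, rePack, sizeof(rePack), 0,
                                 (struct sockaddr *)&raddr, (socklen_t *)&len);
    if (recv_size < 0) {
        if (errno == EINTR)
            continue;                        /* interrupted by a signal: just retry */
        perror("recvfrom");
        break;                               /* a real error: leave the loop */
    }
    processPacket(rePack, recv_size);        /* the return value is the packet size */
}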
For question 1, the actual answer is operating system-dependent. However, most operating systems will buffer some number of packets for you. The OS interrupt handler that handles the incoming packet will never be copying it directly into your application level buffer, so it will always go into an OS buffer first. The OS has previously noted your interest in it (by virtue of creating the socket and binding it you expressed interest), so it will then place a pointer to the buffer onto a queue associated with your socket.
A different part of the OS code will then (after the interrupt handler has completed) copy the data from the OS buffer into your application memory, free the OS buffer, and return to your program from the recvfrom system call. If additional packets come in, either before or after you have started processing the first one, they'll be placed on the queue too.
That queue is not infinite of course. It's likely that you can configure how many packets (or how much buffer space) can be reserved, either at a system-wide level (think sysctl-type settings in Linux), or at the individual socket level (setsockopt / ioctl).
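As a rough illustration, on Linux the per-socket reservation can be enlarged with setsockopt (the 1 MiB figure below is arbitrary, and the kernel caps the result at the sysctl ceiling net.core.rmem_max):
int rcvbuf = 1 << 20;   /* ask for roughly 1 MiB of receive buffer space */
if (setsockopt(sockId, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) != 0)
    perror("setsockopt(SO_RCVBUF)");

/* Read back what the kernel actually granted (Linux reports a doubled value). */
socklen_t optlen = sizeof(rcvbuf);
if (getsockopt(sockId, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &optlen) == 0)
    printf("effective receive buffer: %d bytes\n", rcvbuf);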
If, when you call recvfrom, there are already queued packets on the socket, the system call handler will not block your process, instead it will simply copy from the OS buffer of the next queued packet into your buffer, release the OS buffer, and return immediately. As long as you can process incoming packets roughly as fast as they arrive or faster, you should not lose any. (However, note that if another system is generating packets at a very high rate, it's likely that the OS memory reserved will be exhausted at some point, after which the OS will simply discard packets that exceed its resource reservation.)
For question 2, you will receive no duplicate messages (unless something upstream of your machine is actually duplicating them). Once a queued message is copied into your buffer, it's released before returning to you. That message is gone forever.
(Note that it's possible that some other process has also created a socket expressing interest in the same packets. That process would also get a copy of the packet data, which is typically handled internal to the operating system by reference counting rather than by actually duplicating the OS buffers, although that detail is invisible to applications. In any case, once all interested processes have received the packet, it will be discarded.)
There's really nothing at all wrong with a while (true) loop; it's a very common control structure for long-running server-type programs. If your program has nothing else it needs to be doing in the meantime, a while (true) loop that blocks in recvfrom is the simplest, and hence clearest, way to implement it.
(You could use a select(2) or poll(2) call to wait. This allows you to wait for any one of multiple file descriptors at the same time, or to periodically "time out" and go do something else. But again, if you have nothing else you might need to be doing in the meantime, that introduces needless complication.)
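For example, a minimal sketch of the poll(2) variant with a periodic timeout (names taken from the question; the one-second timeout and the housekeeping hook are arbitrary placeholders, and <errno.h>/<stdio.h> are also needed):
#include <poll.h>

struct pollfd pfd = { .fd = sockId, .events = POLLIN };
while (true) {
    int ready = poll(&pfd, 1, 1000);         /* wait up to 1000 ms */
    if (ready < 0) {
        if (errno == EINTR)
            continue;
        perror("poll");
        break;
    }
    if (ready == 0) {
        /* timed out: do periodic housekeeping here */
        continue;
    }
    if (pfd.revents & POLLIN) {
        /* a datagram is waiting, so recvfrom will not block now */
        ssize_t n = recvfrom(sockId, rePack, sizeof(rePack), 0,
                             (struct sockaddr *)&raddr, (socklen_t *)&len);
        if (n >= 0)
            processPacket(rePack, n);
    }
}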

Update file descriptor pointing to /proc/self after fork() from python multiprocess.Process

I'm working on a C++ program that uses boost::python to provide a python wrapper/API for the user. The program tracks and limits its own memory usage by opening /proc/self/statm using a file descriptor. Every timestep it seeks to the beginning of that file and reads the vmsize from it.
proc_self_statm_fd = open( "/proc/self/statm", O_RDONLY );
However, this causes a problem when calling fork(). In particular, when a user writes a python script that does something like this:
proc = multiprocessing.Process(name="bkg_process",target=bkg_process,daemon=True)
The problem is that the forked process gets the file descriptor pointing to /proc/self/statm from the parent process, not its own, and this reports the wrong memory usage. Even worse, if the parent process exits, the child process will fail when trying to read from the file descriptor.
What's the correct solution for this? It needs to be handled at the C++ level because we don't have control over the user's python scripts. Is there a way to have the class auto detect that a fork has happened and grab a new file descriptor? In the worst case I can have it re-open the file for every update. I'm worried that would add runtime overhead though.
You could store the PID in the class, and check it against the value of getpid() on each call, and then reopen the file if the PID has changed. getpid() is typically much cheaper than open - on some systems it doesn't even need a context switch (it just fetches the PID from a magic location in the process's own memory).
That said, you may also want to actually measure the cost of reopening the file each time - it may not actually be significant.
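A minimal sketch of that idea, written in C with file-scope statics rather than a class member for brevity (hypothetical names; the class version just stores the same two values as members):
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

static int   statm_fd  = -1;
static pid_t statm_pid = -1;

/* Return an fd for this process's /proc/self/statm, reopening it after a fork. */
int get_statm_fd(void) {
    pid_t current = getpid();
    if (statm_fd == -1 || current != statm_pid) {
        if (statm_fd != -1)
            close(statm_fd);                   /* stale fd inherited from the parent */
        statm_fd  = open("/proc/self/statm", O_RDONLY);
        statm_pid = current;
    }
    return statm_fd;
}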

iOS / iPhone Journaling / File System Caching

From local device testing, I've seen that writing a file to the iOS file system (regardless of how low-level the call you use) will often return success before the file is fully committed to flash. Meaning, if you hard-reset the device and then reboot, your file could be rolled back (if the write completed or was atomic) or corrupted. What is the source of this delay (documentation appreciated, I haven't been able to find anything), and is there a way to get feedback when the actual filesystem write is completed? For instance, I'd like to acknowledge receipt and storage of a piece of data from a remote server, but I find that acknowledging it after the write "reports" success could result in data loss in the event of a hard crash or power failure.
Since this is a 4-year-old question, I'll provide not only the answer, but also the path I took while searching for it.
I was not able to find any clear explanation in the official documentation: File System Programming Guide. There was only a clue in the Performance Tips section. It states that:
Apps can call the BSD fcntl function with the F_NOCACHE flag to enable or disable caching for a file. For more information about this function, see fcntl.
Enabling the F_NOCACHE flag does not solve the problem you're describing; however, the manual for fcntl states there's an option that you might just find interesting:
F_FULLFSYNC Does the same thing as fsync(2) then asks the drive to flush all buffered data to the permanent storage device
(from man fcntl, see here).
I checked the manual for fsync for more details. It eventually gave me the clearest and most understandable explanation of both the problem and the solution:
Note that while fsync() will flush all data from the host to the drive (i.e. the "permanent storage device"), the drive itself may not physically write the data to the platters for quite some time and it may be written in an out-of-order sequence.
Specifically, if the drive loses power or the OS crashes, the application may find that only some or none of their data was written. The disk drive may also re-order the data so that later writes may be present, while earlier writes are not.
This is not a theoretical edge case. This scenario is easily reproduced with real world workloads and drive power failures.
For applications that require tighter guarantees about the integrity of their data, Mac OS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage. Applications, such as databases, that require a strict ordering of writes should use F_FULLFSYNC to ensure that their data is written in the order they expect.
(from man fsync, see here).
Yeah, it's definitely not a theoretical edge case. Thankfully, once you know the problem, the solution is trivial:
let filePath: String = "your file path"
// you can use an option other than read-write
let fd = open(filePath, O_RDWR)
// if fd is -1, there was an error opening the file, handle it as you wish
guard fd != -1 else { return }
// syncResult is -1 if the sync operation failed, handle it as you wish
let syncResult = fcntl(fd, F_FULLFSYNC)
// don't forget to close the opened file
close(fd)
Once fcntl finishes, your data will be saved.
Notice this operation is slower than a usual write to a file (via NSFileManager or the writeToURL family of methods). In case of performance issues, it's best to move the writing to a background thread.

Shutdown Persistent TCP Con. (C multithreaded server)

I'm designing a multi-threaded server with a thread pool. This system is designed to use persistent TCP connections, as clients will maintain connections close to 24/7. The problem I run into is how to manage shutdowns. Currently, a connection comes in through "accept(listen_fd....)" and gets assigned to a work order struct. This struct is dumped onto the work queue, and is picked up by a thread. From this point on, this thread is devoted to the current connection. My code inside the thread is:
/* Function which runs in a thread to handle a request */
void *
handle_req(void *in)
{
    ssize_t n;
    char read;

    /* Convert the input to a workorder_ptr */
    workorder_t *workorder_ptr = (workorder_t *)in;

    while (!serv_shutdown
           && (n = recv(workorder_ptr->sock_fd, &read, 1, 0)) != 0)
    {
        printf("Read a character: %c\n", read);
    }

    printf("Peer has shutdown.\n");

    /* Close the socket and free the workorder memory */
    close(workorder_ptr->sock_fd);
    free(workorder_ptr);

    return NULL;
}
This simply listens to the socket and echoes the characters indefinitely, and operates correctly when the client terminates the connection. You see the "!serv_shutdown" part in the while loop - this is my attempt to get the thread to break out of its loop on a shutdown signal. When a SIGINT is caught, the global variable is set to 1. Unfortunately, the program is currently blocking on the recv statement, and won't check this flag until another character is read. I want to avoid that, since it could be an arbitrary amount of time before another character is sent on this connection.
Also, I read on another post here that it's better to use "select" than "accept" to wait on a socket connection, but I didn't quite understand. Would you do a select to wait, and then do an accept right after that? I'm not sure how select creates a socket connection. I ask this, because if my understanding of select is cleared up, maybe it applies to the question I am asking?
Also also, how do I detect the case where a connection simply times out?
Thanks!
EDIT
I think I may have finally found a solution, after further digging:
Wake up thread blocked on accept() call
Basically, I could create a global pipe and have each thread do a select on its own socket_fd as well as this global pipe. Then, when a signal is caught, I'll just write something to the pipe. All threads should be woken, no?
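Something like this minimal sketch is what I have in mind (hypothetical names; the pipe is created once at startup with pipe(shutdown_pipe), and the signal handler or main thread writes a byte to shutdown_pipe[1]):
int shutdown_pipe[2];   /* global; filled in by pipe(shutdown_pipe) at startup */

/* Inside each connection thread, instead of blocking directly in recv: */
fd_set rfds;
FD_ZERO(&rfds);
FD_SET(workorder_ptr->sock_fd, &rfds);
FD_SET(shutdown_pipe[0], &rfds);
int maxfd = (workorder_ptr->sock_fd > shutdown_pipe[0]
                 ? workorder_ptr->sock_fd : shutdown_pipe[0]) + 1;

if (select(maxfd, &rfds, NULL, NULL, NULL) > 0) {
    if (FD_ISSET(shutdown_pipe[0], &rfds)) {
        /* a byte was written to the pipe: clean up and exit the thread */
    } else if (FD_ISSET(workorder_ptr->sock_fd, &rfds)) {
        /* data (or EOF) is available, so recv will not block now */
    }
}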
Well, on FreeBSD, Mac OS X and maybe elsewhere there is the kevent() call, which allows listening for a broad range of system events, including connection requests, and signalling when data arrives on the socket.
It will solve all of your problems in a neat way, but it's not portable. There are libraries such as libevent and libev that wrap OS-specific functionality like kevent() on the BSDs, epoll() on Linux, and so on. Maybe one of those would help you.
You can use the recv() primitive. If it returns 0, that means that the socket has been closed.
More information: http://beej.us/guide/bgnet/output/html/singlepage/bgnet.html#recvman

An IOCP documentation interpretation question - buffer ownership ambiguity

Since I'm not a native English speaker I might be missing something so maybe someone here knows better than me.
Taken from WSASend's documentation at MSDN:
lpBuffers [in]
A pointer to an array of WSABUF structures. Each WSABUF structure contains a pointer to a buffer and the length, in bytes, of the buffer. For a Winsock application, once the WSASend function is called, the system owns these buffers and the application may not access them. This array must remain valid for the duration of the send operation.
Ok, can you see the highlighted sentence ("the system owns these buffers and the application may not access them")? That's the unclear spot!
I can think of two translations for this line (might be something else, you name it):
Translation 1 - "buffers" refers to the OVERLAPPED structure that I pass to this function when calling it. I may reuse that object only after getting a completion notification for it.
Translation 2 - "buffers" refers to the actual buffers, the ones holding the data I'm sending. If the WSABUF object points to one buffer, then I cannot touch this buffer until the operation is complete.
Can anyone tell what's the right interpretation to that line?
And..... If the answer is the second one - how would you resolve it?
Because to me it implies that for each and every piece of data/buffer I'm sending, I must retain a copy of it on the sender side - thus having MANY "pending" buffers (of different sizes) in a high-traffic application, which is really going to hurt scalability.
Statement 1:
In addition to the above paragraph (the "And...."), I thought that IOCP copies the data to be sent to its own buffer and sends from there, unless you set SO_SNDBUF to zero.
Statement 2:
I use stack-allocated buffers (you know, something like char cBuff[1024]; in the function body). If the answer to the main question is the second option (i.e. the buffers must stay as they are until the send is complete), then... that really screws things up big-time! Can you think of a way to resolve it? (I know, I asked it in other words above.)
The answer is that the overlapped structure and the data buffer itself cannot be reused or released until the completion for the operation occurs.
This is because the operation is completed asynchronously, so even if the data is eventually copied into operating-system-owned buffers in the TCP/IP stack, that may not occur until some time in the future, and you're notified of when it happens by the write completion occurring. Note that write completions may be delayed for a surprising amount of time if you're sending without explicit flow control and relying on the TCP stack to do flow control for you (see here: some OVERLAPS using WSASend not returning in a timely manner using GetQueuedCompletionStatus?) ...
You can't use stack-allocated buffers unless you place an event in the overlapped structure and block on it until the async operation completes; there's not a lot of point in doing that, as you add complexity over a normal blocking call and don't gain a great deal by issuing the call asynchronously and then waiting on it.
In my IOCP server framework (which you can get for free from here) I use dynamically allocated buffers which include the OVERLAPPED structure and which are reference counted. This means that the cleanup (in my case they're returned to a pool for reuse) happens when the completion occurs and the reference is released. It also means that you can choose to continue to use the buffer after the operation and the cleanup is still simple.
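To make the buffer-ownership point concrete, here is a minimal C sketch of that idea (hypothetical names, not the framework's actual types): the OVERLAPPED lives in the same allocation as the data, and a reference count keeps the allocation alive until the completion has been processed.
#include <winsock2.h>
#include <windows.h>

typedef struct io_buffer {
    WSAOVERLAPPED overlapped;   /* first member, so the OVERLAPPED* handed back by
                                   GetQueuedCompletionStatus can be cast to io_buffer* */
    WSABUF        wsabuf;       /* points at data[] below */
    volatile LONG refcount;     /* decremented when the completion is processed */
    char          data[4096];
} io_buffer;

/* Issue an asynchronous send; buf must stay allocated until the completion arrives. */
static int post_send(SOCKET s, io_buffer *buf, DWORD len) {
    DWORD sent = 0;
    ZeroMemory(&buf->overlapped, sizeof(buf->overlapped));
    buf->wsabuf.buf = buf->data;
    buf->wsabuf.len = len;
    InterlockedIncrement(&buf->refcount);       /* the pending operation holds a reference */
    if (WSASend(s, &buf->wsabuf, 1, &sent, 0, &buf->overlapped, NULL) == SOCKET_ERROR
        && WSAGetLastError() != WSA_IO_PENDING) {
        InterlockedDecrement(&buf->refcount);   /* send failed, no completion will come */
        return -1;
    }
    return 0;   /* release the reference (and recycle the buffer) when the completion is dequeued */
}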
See also here: I/O Completion Port, How to free Per Socket Context and Per I/O Context?