Does synchronous I/O keep the thread busy? - sockets

Let's say I am doing I/O on a socket that is ready for a read or write operation. That means the calling thread won't block on the operation, irrespective of the blocking or non-blocking (SOCK_NONBLOCK) nature of the socket. But the following things are not clear to me:
When does the actual transfer happen? Is the data already present in memory when the socket is marked ready for reading, or is it transferred only when read() is called? Does it depend on the socket family?
If the data transfer is performed during the read() call, does that mean the calling thread will be busy and the latency will depend on the socket hardware?
Update:
I was wrong about "socket hardware"; I was thinking about the actual data transfer underneath. I understand that a socket is not a physical thing, just an entity in the OS denoting a file descriptor fit for communication.
Follow-up question: this also means that during a write, the calling thread writes data into memory. Is there a kernel thread that takes care of transferring the data to the other side of the socket? If yes, how is asynchronous I/O for sockets different from synchronous I/O?

In general you can think of socket I/O as a two-level buffering system. There is the buffer in your application, and then there are kernel buffers. So when you call read(), the kernel will copy data from the kernel buffer(s) to your application buffer. Correspondingly, when you call write(), you are copying data from your application buffer to the kernel buffer(s).
The kernel then tells the NIC to write incoming data to the kernel buffers and read outgoing data from them. This I/O is, AFAIK, usually DMA-driven: the kernel just needs to tell the NIC what to do, and the NIC is responsible for the actual data transfer. When the NIC finishes, it raises an interrupt (or, for high I/O rates, interrupts are disabled and the kernel polls instead), causing the CPU core that received the interrupt to stop executing whatever it was executing, whether user code or kernel code (unless interrupts are disabled on that core, in which case the interrupt is queued), and execute the interrupt handler, which takes care of whatever further steps need to be done.
So to answer your follow-up question: in general there isn't a separate kernel thread handling socket I/O on the kernel side; the work is done by the NIC hardware and in interrupt context.
For asynchronous I/O, or rather non-blocking I/O, the only difference is how the copying between the user application buffer and the kernel buffer(s) is done. For a non-blocking read(), only the data that is ready and waiting in the kernel buffers is copied to userspace (which can result in a short read), or, if no data is ready, the read() call returns immediately with EAGAIN. Similarly, a non-blocking write() copies only as much data as there is space for in the kernel buffers, which can cause a short write, or, if no space is available at all, it returns with EAGAIN. A blocking read() with no data available will block until there is some, and a blocking write() with full kernel buffer(s) will block until some space becomes available.
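To make the non-blocking case concrete, here is a minimal sketch in C, assuming fd is a socket created with SOCK_NONBLOCK (or switched to O_NONBLOCK with fcntl()):

    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Sketch only: copy whatever is already sitting in the kernel buffer.
     * Returns the byte count (possibly a short read), 0 if the peer closed,
     * or -1 with errno == EAGAIN/EWOULDBLOCK if nothing is buffered yet. */
    ssize_t read_available(int fd, char *buf, size_t len)
    {
        ssize_t n = read(fd, buf, len);

        if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return -1;          /* kernel buffer empty; retry after poll()/select() */
        if (n == -1)
            perror("read");     /* a real error */
        return n;
    }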


RDMA read and write data placement/visibility semantics

I am trying to get more details on the RDMA read and write semantics (especially data placement semantics) and I would like to confirm my understanding with the experts here.
RDMA read:
Would the data be available/seen in the local buffer once the RDMA read completion is seen in the completion queue? Is the behavior the same if I am using GPUDirect RDMA and the local address maps to GPU memory, i.e., would the data be immediately available in the GPU once the RDMA READ completion is seen in the completion queue? If it is not immediately available, what operation will ensure it?
RDMA Write with Immediate (or) RDMA Write + Send:
Can the remote host check for the presence of data in its memory after it has seen the immediate data in the receive queue? And is the expectation/behavior going to change if the write is to GPU memory (using GDR)?
RDMA read: Would the data be available/seen in the local buffer once the RDMA read completion is seen in the completion queue?
Yes
Is the behavior the same if I am using GPUDirect RDMA and the local address maps to GPU memory?
Not necessarily. It is possible that the NIC has sent the data towards the GPU, but the GPU hasn't received it yet. Meanwhile, the RDMA read completion has already arrived at the CPU. The root cause of this is PCIe ordering semantics, which allow reordering of writes to different destinations (CPU vs. GPU memory).
If it is not immediately available, what operation will ensure it?
To ensure the data has arrived at the GPU, one may set a flag on the CPU after the RDMA completion and poll on that flag from GPU code. This works because the PCIe read issued by the GPU will "push" the NIC's DMA writes ahead of it (according to PCIe ordering semantics).
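As a rough illustration of that technique, here is a sketch of the CPU side using libibverbs; cq and flag are placeholders, and the GPU-side polling loop and the mapping that makes flag visible to the GPU are outside this snippet:

    #include <infiniband/verbs.h>

    /* Sketch: busy-poll the completion queue for the RDMA READ, then raise
     * a flag that GPU code is polling. */
    int wait_read_then_signal(struct ibv_cq *cq, volatile int *flag)
    {
        struct ibv_wc wc;
        int n;

        do {
            n = ibv_poll_cq(cq, 1, &wc);   /* 0 = nothing completed yet */
        } while (n == 0);

        if (n < 0 || wc.status != IBV_WC_SUCCESS)
            return -1;                     /* real code would log wc.status */

        /* The GPU's read of this flag orders behind the NIC's earlier DMA
         * writes under PCIe ordering rules. */
        *flag = 1;
        return 0;
    }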
RDMA Write with Immediate (or) RDMA Write + Send:
Can the remote host check for the presence of data in its memory after it has seen the immediate data in the receive queue? And is the expectation/behavior going to change if the write is to GPU memory (using GDR)?
Yes, this works, but GDR suffers from the same issue as above with writes arriving out-of-order to GPU memory as compared to CPU memory, again due to PCIe ordering semantics. The RNIC cannot control PCIe and therefore it cannot force the "desired" semantics in either case.

multiprogramming on single kernel thread operating systems

Suppose there is a system with a single processor and an operating system that has a single kernel thread, and I run a C program that calls scanf().
Now if I execute the program and don't supply any input, how would the kernel handle this? I mean, since scanf() executes a blocking system call, read(), the kernel is executing on the processor to handle the system call and is blocked. How would the kernel make another process run when it is itself blocked?
How is multiprogramming supported on a single-kernel-thread operating system?
A kernel doesn't have to have a thread context for each user thread. In particular, many microkernels have no notion of a kernel thread context at all. These stateless kernels manipulate data structures representing threads, so when one wants to read, the kernel might start the I/O and enqueue a data structure (a continuation) to record what to do when the I/O is complete. The kernel is then free to select another thread to run while the I/O operation is in progress.
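For illustration, a continuation can be as little as "which thread is waiting, and what to do with the result". A hypothetical sketch in C (no real microkernel's types, just the idea):

    struct thread;   /* opaque: however the kernel represents a thread */

    /* Recorded when a thread issues a read, instead of parking a kernel
     * stack. When the device interrupt fires, the kernel dequeues this
     * and calls resume() with the I/O result. */
    struct continuation {
        struct thread *waiter;                  /* the blocked user thread */
        void (*resume)(struct thread *, int);   /* completion action */
    };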
The classic heavyweight thread model has become ubiquitous due to Unix (and later Linux), but it is by no means the only or even the best model.
How would the kernel handle this? I mean, since scanf() executes a blocking system call, read(), the kernel is executing on the processor to handle the system call and is blocked. How would the kernel make another process run when it is itself blocked?
If the system supports "more threads than CPUs", then it must have something (a scheduler) to figure out which thread gets to use which CPU (maybe in the kernel, maybe not). In this case, hopefully (but not necessarily), if a thread blocks, the scheduler will mark the thread as blocked (and record why) and assign the CPU to a different thread; later, when it is notified that whatever the blocked thread was waiting for has occurred, the scheduler will unblock the thread and let it use a CPU again.
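In code, "mark the thread as blocked and assign the CPU to a different thread" is mostly data-structure manipulation. A hypothetical sketch (all names invented, queue removal omitted for brevity):

    enum thread_state { READY, RUNNING, BLOCKED };

    struct thread {
        enum thread_state state;
        int waiting_fd;           /* what the thread is blocked on, if anything */
        struct thread *next;      /* queue linkage */
    };

    static struct thread *run_queue;    /* threads that may use a CPU */
    static struct thread *wait_queue;   /* threads blocked on I/O */

    static void enqueue(struct thread **q, struct thread *t)
    {
        t->next = *q;
        *q = t;
    }

    /* Thread t called a blocking read() and no data was ready. */
    void block_on_fd(struct thread *t, int fd)
    {
        t->state = BLOCKED;
        t->waiting_fd = fd;
        enqueue(&wait_queue, t);
        /* ...then dispatch the next thread from run_queue... */
    }

    /* Called when the scheduler learns the awaited event occurred. */
    void unblock(struct thread *t)
    {
        t->state = READY;
        enqueue(&run_queue, t);   /* (removal from wait_queue omitted) */
    }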

Should I use IOCPs or overlapped WSASend/Receive?

I am investigating the options for asynchronous socket I/O on Windows. There is obviously more than one option: I can use WSASend... with an overlapped structure providing either a completion callback or an event, or I could use IOCPs and the (new) thread pool. From what I usually read, the latter option is the recommended one.
However, it is not clear to me, why I should use IOCPs if the completion routine suffices for my goal: tell the socket to send this block of data and inform me if it is done.
I understand that the IOCP stuff in combination with CreateThreadpoolIo etc. uses the OS thread pool. However, doesn't the "normal" overlapped I/O also have to use separate threads? So what is the difference/disadvantage? Is my callback called by an I/O thread, and does it block other work?
Thanks in advance,
Christoph
You can use either but, for servers, IOCP with the 'completion queue' will have better performance, in general, because it can use multiple client<>server threads, either with CreateThreadpoolIo or some user-space thread pool. Obviously, in this case, dedicated handler threads are usual.
Overlapped completion-routine I/O is more useful for clients, IMHO. The completion routine is fired by an Asynchronous Procedure Call that is queued to the thread that initiated the I/O request (WSASend, WSARecv). This implies that that thread must be in a position to process the APC, and typically this means a while(true) loop around some 'blahEx()' call. This can be useful because it's fairly easy to wait on a blocking queue, or other inter-thread signal, that allows the thread to be supplied with data to send, and the completion routine is always handled by that thread. This I/O mechanism leaves the 'hEvent' OVL parameter free to use - ideal for passing a comms buffer object pointer into the completion routine.
Overlapped I/O using an actual synchro event/Semaphore/whatever for the overlapped hEvent parameter should be avoided.
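A stripped-down sketch of the completion-routine pattern described above, assuming Winsock is already initialized and s is a connected socket (error handling omitted):

    #include <winsock2.h>
    #include <stdio.h>

    static char buf[4096];

    /* Runs as an APC on the thread that called WSARecv, but only while
     * that thread sits in an alertable wait. Real code would post the
     * next WSARecv from here; ovl->hEvent is free to carry context. */
    static void CALLBACK recv_done(DWORD err, DWORD transferred,
                                   LPWSAOVERLAPPED ovl, DWORD flags)
    {
        printf("received %lu bytes (error %lu)\n", transferred, err);
    }

    void recv_with_completion_routine(SOCKET s)
    {
        WSABUF wb = { sizeof(buf), buf };
        WSAOVERLAPPED ovl = { 0 };
        DWORD flags = 0;

        WSARecv(s, &wb, 1, NULL, &flags, &ovl, recv_done);

        for (;;)
            SleepEx(INFINITE, TRUE);   /* the 'blahEx()' alertable wait */
    }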
Windows IOCP documentation recommends no more than one thread per available core per completion port. Hyperthreading doubles the number of logical cores. Since the use of IOCPs results in an application that is, for all practical purposes, event-driven, oversized thread pools only add unnecessary work for the scheduler.
If you think about it you'll understand why: an event should be serviced in its entirety (or placed in some queue after initial processing) as quickly as possible. Suppose five events are queued to an IOCP on a 4-core computer. If there are eight threads associated with the IOCP, you run the risk of the scheduler interrupting one event to begin servicing another on a different thread, which is inefficient. It can be dangerous, too, if the interrupted thread was inside a critical section. With four threads you can process four events simultaneously, and as soon as one event has been completed you can start on the last remaining event in the IOCP queue.
Of course, you may have thread pools for non-IOCP related processing.
EDIT________________
The socket (file handles work fine too) is associated with an IOCP. The completion routine waits on the IOCP. As soon as a requested read from or write to the socket completes, the OS - via the IOCP - releases the completion routine waiting on the IOCP and returns the additional information you provided when you called the read or write (I usually pass a pointer to a control block). So the completion routine immediately "knows" where to find the information pertinent to the completion.
If you passed information referring to a control block (or similar), then that control block (probably) needs to keep track of which operation has completed so it knows what to do next. The IOCP itself neither knows nor cares.
If you're writing a server attached to the internet, the server would issue a read to wait for client input. That input may arrive a millisecond or a week later, and when it does, the IOCP will release the completion routine, which analyzes the input. Typically it responds with a write containing the data requested in the input and then waits on the IOCP. When the write completes, the IOCP again releases the completion routine, which sees that the write has completed, (typically) issues a new read, and a new cycle starts.
So an IOCP-based application typically consumes very little (or no) CPU until the moment a completion occurs at which time the completion routine goes full tilt until it has finished processing, sends a new I/O request and again waits on the completion port. Apart from the IOCP timeout (which can be used to signal house-keeping or such) all I/O-related stuff occurs in the OS.
To further complicate (or simplify) things, it is not necessary that sockets be serviced using the WSA routines; the Win32 functions ReadFile and WriteFile work just fine.
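For comparison, a bare-bones sketch of an IOCP worker loop (socket setup, posting of reads/writes, and error handling omitted; typically one such loop runs per core):

    #include <winsock2.h>
    #include <stdio.h>

    /* Each dequeued completion identifies the connection (via the key
     * supplied when the handle was associated) and the operation (via the
     * OVERLAPPED pointer, usually embedded in a per-request control block). */
    void iocp_worker(HANDLE iocp)
    {
        DWORD bytes;
        ULONG_PTR key;
        OVERLAPPED *ovl;

        for (;;) {
            if (!GetQueuedCompletionStatus(iocp, &bytes, &key, &ovl, INFINITE))
                continue;   /* real code inspects GetLastError() and ovl */
            printf("completed: %lu bytes, key %p\n", bytes, (void *)key);
            /* ...process, then post the next read/write on that socket... */
        }
    }

    /* Association, once per socket:
     *   CreateIoCompletionPort((HANDLE)sock, iocp, (ULONG_PTR)ctx, 0);
     */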

what happens when I write data to a blocking socket, faster than the other side reads?

Suppose I write data really fast [I have all the data in memory] to a blocking socket.
Further suppose the other side will read data very slowly [like sleeping 1 second between each read].
What is the expected behavior on the writing side in this case?
Would the write operation block until the other side reads enough data, or will the write return an error like connection reset?
For a blocking socket, the send() call will block until all the data has been copied into the networking stack's buffer for that connection. It does not have to be received by the other side. The size of this buffer is implementation dependent.
Data is cleared from the buffer when the remote side acknowledges it. This is an OS thing and is not dependent upon the remote application actually reading the data. The size of this buffer is also implementation dependent.
When the remote buffer is full, it tells your local stack to stop sending. When data is cleared from the remote buffer (by being read by the remote application) then the remote system will inform the local system to send more data.
In both cases, small systems (like embedded systems) may have buffers of a few KB or smaller and modern servers may have buffers of a few MB or larger.
Once space is available in the local buffer, more data from your send() call will be copied. Once all of that data has been copied, your call will return.
You won't get a "connection reset" error (from the OS -- libraries may do anything) unless the connection actually does get reset.
So... It really doesn't matter how quickly the remote application is reading data until you've sent as much data as both local & remote buffer sizes combined. After that, you'll only be able to send() as quickly as the remote side will recv().
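If you want to see the local half of that budget, the send-buffer size can be queried with the standard SO_SNDBUF option; a small sketch (what the OS reports may differ from what you set):

    #include <stdio.h>
    #include <sys/socket.h>

    /* Print the kernel send-buffer size for a socket. */
    void print_sndbuf(int fd)
    {
        int size = 0;
        socklen_t len = sizeof(size);

        if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, &len) == 0)
            printf("local send buffer: %d bytes\n", size);
        else
            perror("getsockopt");
    }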
The output (send) buffer fills up until it is full, and send() blocks until the buffer is freed up enough to enqueue the data.
As the send() manual page says:
    When the message does not fit into the send buffer of the socket, send() normally blocks, unless the socket has been placed in non-blocking I/O mode.
Look at this: http://manpages.ubuntu.com/manpages/lucid/man2/send.2.html

Polling vs Interrupt

I have a basic doubt regarding interrupts. Imagine a computer that does not have any interrupts; in order to do I/O, the CPU has to poll* the keyboard for a key press, the mouse for a click, etc. at regular intervals. Now if it has interrupts, doesn't the CPU have to keep checking whether the interrupt line went high (or low) at regular intervals? So how are CPU cycles saved by using interrupts? As per my understanding, instead of checking the device we are now checking the interrupt line. Can someone explain what basic logic I am getting wrong?
*Here by polling I don't mean that the CPU is in a busy-wait. To quote Wikipedia "Polling also refers to the situation where a device is repeatedly checked for readiness, and if it is not the computer returns to a different task"
@David Schwartz and @RKT are right, it doesn't take any CPU cycles to check the interrupt line.
Basically, the processor has a set of interrupt wires which are connected to a bunch of devices. When one of the devices has something to say, it turns its interrupt wire on, which triggers the processor (without the help of any software) to pause the execution of current instructions and start running a handler function.
Here's how it works. When the operating system boots, it registers a set of callbacks (a table of function pointers, actually) with the processor using a special instruction which takes the address of the first entry of the table. When interrupt N is triggered, the processor pulls the Nth entry from the table and runs the code at the location in memory it refers to. The code inside the function is written by the OS authors in assembly, but typically all it does is save the state of the stack and registers so that the current task can be resumed after the interrupt handler has been called, and then call a higher-level common interrupt handler which is written in C and handles the logic of "If this is a page fault, do X", "If this is a keyboard interrupt, do Y", "If this is a system call, do Z", etc. Of course there are variations on this with different architectures and languages, but the gist of it is the same.
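Conceptually, that table is just an array of function pointers indexed by interrupt number. A hypothetical C rendering of the dispatch logic (real interrupt tables are architecture-specific structures loaded with a special instruction, e.g. lidt on x86):

    #define NUM_VECTORS 256

    typedef void (*isr_t)(void);

    static isr_t vector_table[NUM_VECTORS];

    /* What the OS does at boot for each interrupt it cares about. */
    void register_isr(int n, isr_t handler)
    {
        vector_table[n] = handler;
    }

    /* Roughly what the hardware does, expressed in software terms, when
     * interrupt n fires: save state, run the Nth entry, restore state. */
    void dispatch(int n)
    {
        if (vector_table[n])
            vector_table[n]();
    }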
The idea with software interrupts ("signals", in Unix parlance) is the same, except that the OS does the work of setting up the stack for the signal handler to run. The basic procedure is that the userland process registers signal handlers one at a time to the OS via a system call which takes the address of the handler function as an argument, then some time in the future the OS recognizes that it should send that process a signal. The next time that process is run, the OS will set its instruction pointer to the beginning of the handler function and save all its registers to somewhere the process can restore them from before resuming the execution of that process. Usually, the handler will have some sort of routing logic to alert the relevant bit of code that it received a signal. When the process finishes executing the signal handler, it restores the register state that existed previous to the signal handler running, and resumes execution where it left off. Hence, software interrupts are also more efficient than polling for learning about events coming from the kernel to this process (however this is not really a general-use mechanism since most of the signals have specific uses).
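The userland half of that, registering a handler and having execution resume afterwards, looks like this in POSIX C (a minimal sketch using SIGUSR1):

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    static volatile sig_atomic_t got_signal = 0;

    /* The "routing logic": just record that the signal arrived. */
    static void on_usr1(int signo)
    {
        (void)signo;
        got_signal = 1;
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = on_usr1;          /* address of the handler function */
        sigaction(SIGUSR1, &sa, NULL);    /* the registering system call */

        pause();                          /* suspended until a signal arrives */
        if (got_signal)
            printf("handler ran; execution resumed here\n");
        return 0;
    }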
It doesn't take any CPU cycles to check the interrupt line. It's done by dedicated hardware, not CPU instructions. The reason it's called an interrupt is because if the interrupt line is asserted, the CPU is interrupted.
"CPU is interrupted" : It will leave (put on hold) the normal program execution and then execute the ISR( interrupt subroutine) and again get back to execution of suspended program.
The CPU comes to know about interrupts through the IRQ (interrupt request) lines and the IF (interrupt flag).
Interrupt: an event generated by a device in a computer to get the attention of the CPU.
Provided to improve processor utilization.
To handle an interrupt, there is an Interrupt Service Routine (ISR) associated with it.
To interrupt the processor, the device sends a signal on its IRQ line and continues doing so until the processor acknowledges the interrupt.
The CPU then performs a context switch by pushing the Program Status Word (PSW) and PC onto the control stack.
The CPU executes the ISR.
Polling, on the other hand, is the process where the computer repeatedly checks an external device for readiness.
The computer does nothing other than check the status of the device.
Polling is often used with low-level hardware.
Example: with a printer connected via a parallel port, the computer waits until the printer has received the next character.
These transfers can be as small as a single byte.
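In C, this kind of low-level polling boils down to a loop on a status register. A sketch with an invented register layout (STATUS_READY is a made-up bit, not any real device's):

    #define STATUS_READY 0x1u

    /* Spin until the (hypothetical) device sets its ready bit, then fetch
     * one byte. The CPU does nothing else while it waits. */
    unsigned char poll_read_byte(volatile unsigned int *status,
                                 volatile unsigned char *data)
    {
        while ((*status & STATUS_READY) == 0)
            ;   /* busy-wait */
        return *data;
    }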
There are two different methods (polling and interrupts) to serve the I/O of a computer system. In polling, the CPU continuously remains busy checking whether input has been given to an I/O device; if so, it checks the source port of the corresponding device and the priority of that input in order to serve it.
In the interrupt-driven approach, when data is given to an I/O device, an interrupt is generated and the CPU checks the priority of that input in order to serve it.