Are socket operations performed in parallel at the system level? - sockets

If I have a quad core machine and 4 open and active sockets is it faster to have 4 threads to read from these sockets in parallel or just have one thread that reads the sockets iteratively? Can the socket read operation be performed in parallel or is it synchronized somewhere at the system level?
Analogously, if I have 4 sockets to which I need to write some data, is it faster to have 4 threads that write to these sockets at the same time, or is it better to have one thread that does all the writing?
It seems that multiple cores should enable higher throughput (of course if the network bandwidth is not the bottleneck), however I wonder if this is the case, since after all, in the end there is only a single ethernet cord (or other medium), so at some time, all these writes will have to be synchronized anyway, and the packets will be sent sequentially...
Specifically, if I have a machine running n threads which generate requests that need to be sent to another machine is it better to open n sockets and let each thread write to it? Or is it better to synchronize all the threads and send all the data through a single socket?
And on the receiving side is it better to have a single socket to read from? Or is it better to have n sockets read in parallel? What if all the readers need to be synchronized after reading anyway?

Socket operations are largely asynchronous to the application; buffered; and, if you're lucky, concurrent inside the kernel: the only significant state to synchronize on is per-connection. However the rate-determining step is really the bandwidth of the network, which you can easily saturate with even just one thread. Web browsers typically open between 4 and 8 threads per Web page to fetch the HTML itself and the images etc.

Related

How exactly do socket receives work at a lower level (eg. socket.recv(1024))?

I've read many stack overflow questions similar to this, but I don't think any of the answers really satisfied my curiosity. I have an example below which I would like to get some clarification.
Suppose the client is blocking on socket.recv(1024):
socket.recv(1024)
print("Received")
Also, suppose I have a server sending 600 bytes to the client. Let us assume that these 600 bytes are broken into 4 small packets (of 150 bytes each) and sent over the network. Now suppose the packets reach the client at different timings with a difference of 0.0001 seconds (eg. one packet arrives at 12.00.0001pm and another packet arrives at 12.00.0002pm, and so on..).
How does socket.recv(1024) decide when to return execution to the program and allow the print() function to execute? Does it return execution immediately after receiving the 1st packet of 150 bytes? Or does it wait for some arbitrary amount of time (eg. 1 second, for which by then all packets would have arrived)? If so, how long is this "arbitrary amount of time"? Who determines it?
Well, that will depend on many things, including the OS and the speed of the network interface. For a 100 gigabit interface, the 100us is "forever," but for a 10 mbit interface, you can't even transmit the packets that fast. So I won't pay too much attention to the exact timing you specified.
Back in the day when TCP was being designed, networks were slow and CPUs were weak. Among the flags in the TCP header is the "Push" flag to signal that the payload should be immediately delivered to the application. So if we hop into the Waybak
machine the answer would have been something like it depends on whether or not the PSH flag is set in the packets. However, there is generally no user space API to control whether or not the flag is set. Generally what would happen is that for a single write that gets broken into several packets, the final packet would have the PSH flag set. So the answer for a slow network and weakling CPU might be that if it was a single write, the application would likely receive the 600 bytes. You might then think that using four separate writes would result in four separate reads of 150 bytes, but after the introduction of Nagle's algorithm the data from the second to fourth writes might well be sent in a single packet unless Nagle's algorithm was disabled with the TCP_NODELAY socket option, since Nagle's algorithm will wait for the ACK of the first packet before sending anything less than a full frame.
If we return from our trip in the Waybak machine to the modern age where 100 Gigabit interfaces and 24 core machines are common, our problems are very different and you will have a hard time finding an explicit check for the PSH flag being set in the Linux kernel. What is driving the design of the receive side is that networks are getting way faster while the packet size/MTU has been largely fixed and CPU speed is flatlining but cores are abundant. Reducing per packet overhead (including hardware interrupts) and distributing the packets efficiently across multiple cores is imperative. At the same time it is imperative to get the data from that 100+ Gigabit firehose up to the application ASAP. One hundred microseconds of data on such a nic is a considerable amount of data to be holding onto for no reason.
I think one of the reasons that there are so many questions of the form "What the heck does receive do?" is that it can be difficult to wrap your head around what is a thoroughly asynchronous process, wheres the send side has a more familiar control flow where it is much easier to trace the flow of packets to the NIC and where we are in full control of when a packet will be sent. On the receive side packets just arrive when they want to.
Let's assume that a TCP connection has been set up and is idle, there is no missing or unacknowledged data, the reader is blocked on recv, and the reader is running a fresh version of the Linux kernel. And then a writer writes 150 bytes to the socket and the 150 bytes gets transmitted in a single packet. On arrival at the NIC, the packet will be copied by DMA into a ring buffer, and, if interrupts are enabled, it will raise a hardware interrupt to let the driver know there is fresh data in the ring buffer. The driver, which desires to return from the hardware interrupt in as few cycles as possible, disables hardware interrupts, starts a soft IRQ poll loop if necessary, and returns from the interrupt. Incoming data from the NIC will now be processed in the poll loop until there is no more data to be read from the NIC, at which point it will re-enable the hardware interrupt. The general purpose of this design is to reduce the hardware interrupt rate from a high speed NIC.
Now here is where things get a little weird, especially if you have been looking at nice clean diagrams of the OSI model where higher levels of the stack fit cleanly on top of each other. Oh no, my friend, the real world is far more complicated than that. That NIC that you might have been thinking of as a straightforward layer 2 device, for example, knows how to direct packets from the same TCP flow to the same CPU/ring buffer. It also knows how to coalesce adjacent TCP packets into larger packets (although this capability is not used by Linux and is instead done in software). If you have ever looked at a network capture and seen a jumbo frame and scratched your head because you sure thought the MTU was 1500, this is because this processing is at such a low level it occurs before netfilter can get its hands on the packet. This packet coalescing is part of a capability known as receive offloading, and in particular lets assume that your NIC/driver has generic receive offload (GRO) enabled (which is not the only possible flavor of receive offloading), the purpose of which is to reduce the per packet overhead from your firehose NIC by reducing the number of packets that flow through the system.
So what happens next is that the poll loop keeps pulling packets off of the ring buffer (as long as more data is coming in) and handing it off to GRO to consolidate if it can, and then it gets handed off to the protocol layer. As best I know, the Linux TCP/IP stack is just trying to get the data up to the application as quickly as it can, so I think your question boils down to "Will GRO do any consolidation on my 4 packets, and are there any knobs I can turn that affect this?"
Well, the first thing you can do is disable any form of receive offloading (e.g. via ethtool), which I think should get you 4 reads of 150 bytes for 4 packets arriving like this in order, but I'm prepared to be told I have overlooked another reason why the Linux TCP/IP stack won't send such data straight to the application if the application is blocked on a read as in your example.
The other knob you have if GRO is enabled is GRO_FLUSH_TIMEOUT which is a per NIC timeout in nanoseconds which can be (and I think defaults to) 0. If it is 0, I think your packets may get consolidated (there are many details here including the value of MAX_GRO_SKBS) if they arrive while the soft IRQ poll loop for the NIC is still active, which in turn depends on many things unrelated to your four packets in your TCP flow. If non-zero, they may get consolidated if they arrive within GRO_FLUSH_TIMEOUT nanoseconds, though to be honest I don't know if this interval could span more than one instantiation of a poll loop for the NIC.
There is a nice writeup on the Linux kernel receive side here which can help guide you through the implementation.
A normal blocking receive on a TCP connection returns as soon as there is at least one byte to return to the caller. If the caller would like to receive more bytes, they can simply call the receive function again.

Which kind of inter process communication (ipc) mechanism should I use at which moment?

I know that there are several methods of inter-process communication (ipc), like:
File
Signal
Socket
Message Queue
Pipe
Named pipe
Semaphore
Shared memory
Message passing
Memory-mapped file
However I was unable to find a list or a paper comparing these mechanism to each other and pointing out the benefits of them in different environemnts.
E.g I know that if I use a file which gets written by process A and process B reads it out it will work on any OS and is pretty robust, on the other hand - why shouldn't I use TCP Socket ? Has anyone a kind of overview in which cases which methods are the most suitable ?
Long story short:
Use lock files, mutexes, semaphores and barriers when processes compete for a scarce resource. They operate in a similar manner: several process try to acquire a synchronisation primitive, some of them acquire it, others are put in sleeping state until the primitive is available again. Use semaphores to limit the amount of processes working with a resource. Use a mutex to limit the amount to 1.
You can partially avoid using synchronisation primitives by using non-blocking thread-safe data structures.
Use signals, queues, pipes, events, messages, unix sockets when processes need to exchange data. Signals and events are usually used for notifying a process of something (for instance, ctrl+c in unix terminal sends a SIGINT signal to a process). Pipes, shared memory and unix sockets are for transmitting data.
Use sockets for networking (or, speaking formally, for exchanging data between processes located on different machines).
Long story long: take a look at Modern Operating Systems book by Tanenbaum & Bos, namely IPC chapter. The topic is vast and can't be completely covered within a list or a paper.

Can a server handle multiple sockets in a single thread?

I'm writing a test program that needs to emulate several connections between virtual machines, and it seems like the best way to do that is to use Unix domain sockets, for various reasons. It doesn't really matter whether I use SOCK_STREAM or SOCK_DGRAM, but it seems like SOCK_STREAM is easier/simpler for my usage.
My problem seems to be a little backwards from the typical scenario. I want to have a single client communicating with the server over 4 distinct sockets. (I could have 4 clients with one socket each, but that distinction shouldn't matter.) Now, the thing I'm emulating doesn't have multiple threads and gets an interrupt whenever a data packet is received over one of the "sockets". Is there some easy way to emulate this with Unix sockets?
I believe that I have to do the socket(), bind(), and listen() for all 4 sockets first, then do an accept() for all 4, and do fcntl( fd, F_SETFF, FNDELAY ) for each one to make them nonblocking, so that I can check each one for data with read() in a round-robin fashion. Is there any way to make it interrupt-driven or event-driven, so that my main loop only checks for data in the socket if there's data there? Or is it better to poll them all like this?
Yes. Handling multiple connections is almost synonymous with "server", and they are often single threaded -- but please not this way:
check each one for data with read() in a round-robin fashion
That would require, as you mention, non-blocking sockets and some kind of delay to prevent your "round-robin" from becoming a system killing busy loop.
A major problem with that is the granularity of the delay. You can't make it too small, or the loop will still hog too much CPU time when nothing is happening. But what about when something is happening, and that something is data incoming simultaneously on multiple connections? Now your delay can produce a snowballing backlog of tish leading to refused connections, etc.
It just is not feasible, and no one writes a server that way, although I am sure anyone would give it serious thought if they were unaware of the library functions intended to tackle the problem. Note that networking is a platform specific issue, so these are not actually part of the C standard (which does not deal with sockets at all).
The functions are select(), poll(), and epoll(); the last one is linux specific and the other two are POSIX. The basic idea is that the call blocks, waiting until one or more of any number of active connections is ready to read or write. Waiting for a socket to be ready to write only meaningfully applies to NON_BLOCK sockets. You don't have to use NON_BLOCK, however, and the select() call blocks regardless. Using NON_BLOCK on the individual sockets makes the implementation more complex, but increases performance potential in a single threaded server -- this is the idea behind asynchronous servers (such as nginx), a paradigm which contrasts with the more traditional threaded synchronous model.
However, I would recommend that you not use NON_BLOCK initially because of the added complexity. When/if it ends up being called for, you'll know. You still do not need threads.
There are many, many, many examples and tutorials around about how to use select() in particular.

kernel-based (Linux) data relay between two TCP sockets

I wrote TCP relay server which works like peer-to-peer router (supernode).
The simplest case are two opened sockets and data relay between them:
clientA <---> server <---> clientB
However the server have to serve about 2000 such A-B pairs, ie. 4000 sockets...
There are two well known data stream relay implementations in userland (based on socketA.recv() --> socketB.send() and socketB.recv() --> socketA.send()):
using of select / poll functions (non-blocking method)
using of threads / forks (blocking method)
I used threads so in the worst case the server creates 2*2000 threads! I had to limit stack size and it works but is it right solution?
Core of my question:
Is there a way to avoid active data relaying between two sockets in userland?
It seems there is a passive way. For example I can create file descriptor from each socket, create two pipes and use dup2() - the same method like stdin/out redirecting. Then two threads are useless for data relay and can be finished/closed.
The question is if the server should ever close sockets and pipes and how to know when the pipe is broken to log the fact?
I've also found "socket pairs" but I am not sure about it for my purpose.
What solution would you advice to off-load the userland and limit amount fo threads?
Some extra explanations:
The server has defined static routing table (eg. ID_A with ID_B - paired identifiers). Client A connects to the server and sends ID_A. Then the server waits for client B. When A and B are paired (both sockets opened) the server starts the data relay.
Clients are simple devices behind symmetric NAT therefore N2N protocol or NAT traversal techniques are too complex for them.
Thanks to Gerhard Rieger I have the hint:
I am aware of two kernel space ways to avoid read/write, recv/send in
user space:
sendfile
splice
Both have restrictions regarding type of file descriptor.
dup2 will not help to do something in kernel, AFAIK.
Man pages: splice(2) splice(2) vmsplice(2) sendfile(2) tee(2)
Related links:
Understanding sendfile() and splice()
http://blog.superpat.com/2010/06/01/zero-copy-in-linux-with-sendfile-and-splice/
http://yarchive.net/comp/linux/splice.html (Linus)
C, sendfile() and send() difference?
bridging between two file descriptors
Send and Receive a file in socket programming in Linux with C/C++ (GCC/G++)
http://ogris.de/howtos/splice.html
OpenBSD implements SO_SPLICE:
relayd asiabsdcon2013 slides / paper
http://www.manualpages.de/OpenBSD/OpenBSD-5.0/man2/setsockopt.2.html
http://metacpan.org/pod/BSD::Socket::Splice .
Does Linux support something similar or only own kernel-module is the solution?
TCPSP
SP-MOD described here
TCP-Splicer described here
L4/L7 switch
HAProxy
Even for loads as tiny as 2000 concurrent connections, I'd never go with threads. They have the highest stack and switching overhead, simply because it's always more expensive to ensure that you can be interrupted anywhere than when you can only be interrupted at specific places. Just use epoll() and splice (if your sockets are TCP, which seems to be the case) and you'll be fine. You can even make epoll work in event triggered mode, where you only register your fds once.
If you absolutely want to use threads, use one thread per CPU core to spread the load, but if you need to do this, it means you're playing at speeds where affinity, RAM location on each CPU socket etc... plays a significant role, which doesn't seem to be the case in your question. So I'm assuming that a single thread is more than enough in your case.

Under winsock2, what alternative is there to select()?

I have a working multi-client, single-threaded TCP/IP server application built in C++ over bare winsock2. The heart of it uses select() to wait for new work to do. I'm thinking of extending the number of simultaneous clients to some hundreds or thousands, in practice all mostly idle. My architecture uses very little memory for a connected, idle client.
Before each select(), I build an fd_set of client sockets in read state, plus my listening socket (for accepting new connections); and another fd_set of sockets in write state. Then, after the select(), I scan these to reconstruct, from the socket number, which of my client that was for. This fd_set building and scanning, though objectively not the current CPU bottleneck, makes me uneasy: the amount of work per transaction grows linearly with the number of clients; and while I see how to go over the default 64-sockets limit in an fd_set, I'm reluctant to go that route.
I vaguely see how I could use two threads, one handling the few most active clients, and another for the bulk of idle clients. That seems workable, but a tad complex.
So: what are the alternatives to select() under winsock2?
As you have seen, select() has a max limit for the number of sockets it can handle in a single call. If scalability is an issue for you then you should use Overlapped I/O or I/O Completion Ports instead. That way, you can issue read/write operations on individual sockets when needed and the OS will notify you when the work is finished, there is no need to poll for it.