I wrote TCP relay server which works like peer-to-peer router (supernode).
The simplest case are two opened sockets and data relay between them:
clientA <---> server <---> clientB
However the server have to serve about 2000 such A-B pairs, ie. 4000 sockets...
There are two well known data stream relay implementations in userland (based on socketA.recv() --> socketB.send() and socketB.recv() --> socketA.send()):
using of select / poll functions (non-blocking method)
using of threads / forks (blocking method)
I used threads so in the worst case the server creates 2*2000 threads! I had to limit stack size and it works but is it right solution?
Core of my question:
Is there a way to avoid active data relaying between two sockets in userland?
It seems there is a passive way. For example I can create file descriptor from each socket, create two pipes and use dup2() - the same method like stdin/out redirecting. Then two threads are useless for data relay and can be finished/closed.
The question is if the server should ever close sockets and pipes and how to know when the pipe is broken to log the fact?
I've also found "socket pairs" but I am not sure about it for my purpose.
What solution would you advice to off-load the userland and limit amount fo threads?
Some extra explanations:
The server has defined static routing table (eg. ID_A with ID_B - paired identifiers). Client A connects to the server and sends ID_A. Then the server waits for client B. When A and B are paired (both sockets opened) the server starts the data relay.
Clients are simple devices behind symmetric NAT therefore N2N protocol or NAT traversal techniques are too complex for them.
Thanks to Gerhard Rieger I have the hint:
I am aware of two kernel space ways to avoid read/write, recv/send in
user space:
sendfile
splice
Both have restrictions regarding type of file descriptor.
dup2 will not help to do something in kernel, AFAIK.
Man pages: splice(2) splice(2) vmsplice(2) sendfile(2) tee(2)
Related links:
Understanding sendfile() and splice()
http://blog.superpat.com/2010/06/01/zero-copy-in-linux-with-sendfile-and-splice/
http://yarchive.net/comp/linux/splice.html (Linus)
C, sendfile() and send() difference?
bridging between two file descriptors
Send and Receive a file in socket programming in Linux with C/C++ (GCC/G++)
http://ogris.de/howtos/splice.html
OpenBSD implements SO_SPLICE:
relayd asiabsdcon2013 slides / paper
http://www.manualpages.de/OpenBSD/OpenBSD-5.0/man2/setsockopt.2.html
http://metacpan.org/pod/BSD::Socket::Splice .
Does Linux support something similar or only own kernel-module is the solution?
TCPSP
SP-MOD described here
TCP-Splicer described here
L4/L7 switch
HAProxy
Even for loads as tiny as 2000 concurrent connections, I'd never go with threads. They have the highest stack and switching overhead, simply because it's always more expensive to ensure that you can be interrupted anywhere than when you can only be interrupted at specific places. Just use epoll() and splice (if your sockets are TCP, which seems to be the case) and you'll be fine. You can even make epoll work in event triggered mode, where you only register your fds once.
If you absolutely want to use threads, use one thread per CPU core to spread the load, but if you need to do this, it means you're playing at speeds where affinity, RAM location on each CPU socket etc... plays a significant role, which doesn't seem to be the case in your question. So I'm assuming that a single thread is more than enough in your case.
Related
If I want to use (UDP) sockets as an inter-process communication mechanism on a single PC, are there restrictions on what I can set up due to the two endpoints having the same IP address?
I imagine that in order to have two processes A and B both listening on the same IP/port address, SO_REUSADDR would be necessary - correct? And even though that might conceptually allow for full duplex comms over a single socket, there are other questions I have if I try to go full duplex:
would I end up receiving my own transmissions, and have to filter them out?
would I be exposing myself to other processes injecting spurious or malicious data into my sockets due to the use of SO_REUSEADDR... or do I face this possibility simply by using (connectionless) UDP?
how would things be different (in an addressing/security/restrictions sense) if I chose to use TCP instead?
I'm confident that there is a viable solution using two sockets at each end (one for A -> B data, one for B ->A data)... but is there a viable solution using a single socket at each end? Would there be any clear advantages to using one full-duplex socket per process if it is possible?
The question arises from a misunderstanding. The misunderstanding arises from reading variable names like receivePort and sendPort with different values, and reading them as if they have an implicit link to the socket at the local end. This might make one (mistakenly) believe that two sockets are being used, or must be used - one for send, one for receive. This is wrong - a single socket (at each end) is all that is required.
If using variables to refer to ports on a single host, it is preferable to name them such that it is clear that one is local or pertaining to "this" process, and the other is remote or peer and pertains to the address of a different process, despite being on the same local host. Then it should be clearer that, like any socket, it is entirely possibly to support both send and receive from the single socket with its single port number.
In this scenario (inter-process communication on the same host necessarily using different port numbers for the single socket at each end) all the other questions (SO_REUSEADDR, TCP vs UDP and receiving one's own transmissions) are distractions arising from the misunderstanding.
I was wondering,
1st question What are the pros and cons of using one socket (full duplex) vs. two socket (simplex) per peer: one for read and other write? Specially in terms of performance and resource utilization.
2nd question In case, if i choose to use more than 1 sockets per peer, on all i do read and write. Then will it helps me scale out in handling no of messages handled?
3rd question: what should help me determine the number of sockets per peer? Network Bandwidth? No. of message in and out?
All questions are different and do not have any inter-relation.
What are the pros and cons of using one socket (full duplex) vs. two socket one for read and other write? Specially in terms of performance and resource utilization.
Pro one socket: resource utilization. Contra one socket: nil. Performance: identical, except that you save on connect and close handshakes if you only use one socket.
In case, I choose to take two socket approach, then will not be useful to use both of them full duplex, that way it helps me scale out in terms of data flowing in and out?
Now you're comparing apples and oranges. You can't compare one full-duplex socket with two full-duplex sockets. I don't know why you think you might need two inbound and two outbound flows, but you don't. Every protocol I can think of except FTP uses only one.
what impact does network bandwidth has on it?
Nil.
or it has on network utilization?
Nil, apart from the connect and close handshakes. But it wastes resources at both ends.
We've added --full-duplex to iperf 2.0.14 which will test a full-duplex socket. One can dompare it to two sockets per the -d or --dualtest option. We've found "your mileage will vary" and there is no universal to answer of having equal performance or not. In theory, it seems they should be equal but, in practice, maybe not.
-d, --dualtest
Do a bidirectional test simultanous test using two unidirectional sockets
--fq-rate n[kmgKMG]
Set a rate to be used with fair-queueing based socket-level pacing, in bytes or bits per second. Only available on platforms supporting the SO_MAX_PACING_RATE socket option.
(Note: Here the suffixes indicate bytes/sec or bits/sec per use of uppercase or lowercase, respectively)
--full-duplex
run a full duplex test, i.e. traffic in both transmit and receive directions using the same socket
Bob
I'm writing a test program that needs to emulate several connections between virtual machines, and it seems like the best way to do that is to use Unix domain sockets, for various reasons. It doesn't really matter whether I use SOCK_STREAM or SOCK_DGRAM, but it seems like SOCK_STREAM is easier/simpler for my usage.
My problem seems to be a little backwards from the typical scenario. I want to have a single client communicating with the server over 4 distinct sockets. (I could have 4 clients with one socket each, but that distinction shouldn't matter.) Now, the thing I'm emulating doesn't have multiple threads and gets an interrupt whenever a data packet is received over one of the "sockets". Is there some easy way to emulate this with Unix sockets?
I believe that I have to do the socket(), bind(), and listen() for all 4 sockets first, then do an accept() for all 4, and do fcntl( fd, F_SETFF, FNDELAY ) for each one to make them nonblocking, so that I can check each one for data with read() in a round-robin fashion. Is there any way to make it interrupt-driven or event-driven, so that my main loop only checks for data in the socket if there's data there? Or is it better to poll them all like this?
Yes. Handling multiple connections is almost synonymous with "server", and they are often single threaded -- but please not this way:
check each one for data with read() in a round-robin fashion
That would require, as you mention, non-blocking sockets and some kind of delay to prevent your "round-robin" from becoming a system killing busy loop.
A major problem with that is the granularity of the delay. You can't make it too small, or the loop will still hog too much CPU time when nothing is happening. But what about when something is happening, and that something is data incoming simultaneously on multiple connections? Now your delay can produce a snowballing backlog of tish leading to refused connections, etc.
It just is not feasible, and no one writes a server that way, although I am sure anyone would give it serious thought if they were unaware of the library functions intended to tackle the problem. Note that networking is a platform specific issue, so these are not actually part of the C standard (which does not deal with sockets at all).
The functions are select(), poll(), and epoll(); the last one is linux specific and the other two are POSIX. The basic idea is that the call blocks, waiting until one or more of any number of active connections is ready to read or write. Waiting for a socket to be ready to write only meaningfully applies to NON_BLOCK sockets. You don't have to use NON_BLOCK, however, and the select() call blocks regardless. Using NON_BLOCK on the individual sockets makes the implementation more complex, but increases performance potential in a single threaded server -- this is the idea behind asynchronous servers (such as nginx), a paradigm which contrasts with the more traditional threaded synchronous model.
However, I would recommend that you not use NON_BLOCK initially because of the added complexity. When/if it ends up being called for, you'll know. You still do not need threads.
There are many, many, many examples and tutorials around about how to use select() in particular.
I have a working multi-client, single-threaded TCP/IP server application built in C++ over bare winsock2. The heart of it uses select() to wait for new work to do. I'm thinking of extending the number of simultaneous clients to some hundreds or thousands, in practice all mostly idle. My architecture uses very little memory for a connected, idle client.
Before each select(), I build an fd_set of client sockets in read state, plus my listening socket (for accepting new connections); and another fd_set of sockets in write state. Then, after the select(), I scan these to reconstruct, from the socket number, which of my client that was for. This fd_set building and scanning, though objectively not the current CPU bottleneck, makes me uneasy: the amount of work per transaction grows linearly with the number of clients; and while I see how to go over the default 64-sockets limit in an fd_set, I'm reluctant to go that route.
I vaguely see how I could use two threads, one handling the few most active clients, and another for the bulk of idle clients. That seems workable, but a tad complex.
So: what are the alternatives to select() under winsock2?
As you have seen, select() has a max limit for the number of sockets it can handle in a single call. If scalability is an issue for you then you should use Overlapped I/O or I/O Completion Ports instead. That way, you can issue read/write operations on individual sockets when needed and the OS will notify you when the work is finished, there is no need to poll for it.
Ok folks, NOT counting ethernet speed (Infinitband), kernel bypass or any other fancy stuff, just plain TCP/IP (TCP/UDP over Ethernet) networking. What is the fastest messaging queue implementation that can deliver a message from host A to host B?
Let's assume 10Gigabits ethernet cards connecting both machines with up-to-date architecture and CPUs. What latency in microseconds are we talking here for a 1472 bytes message (MTU - IP/UDP headers)?
As #Sachin described very well, what I am looking for is the messaging queue and the latency number to send a message from A to B like below:
Host A <-------TCP-------> Messaging queue (process, route, etc) <-------TCP-------> Host B
if you do not require a broker in between, 0MQ gave us the best performance (you will need to test the numbers on your platform/use case). If using a broker in between, both ActiveMQ & RabbitMQ performed in the same range. Using Redis as a messaging server did not hold up for us.
If not using a messaging server, options such as Netty, J-groups etc might be useful (not sure about your programming language).
You could look into reliable UDP as well if going with straight socket connectivity.
hope it helps.
The lower bound would be at least 2 TCP connections and the routing time inside the messaging queue server (meaning the delays associated with these)
Host A <-------TCP-------> Messaging queue (process, route, etc) <-------TCP-------> Host B
Off course, if you build in redundancy, fault tolerance etc, then you are going to be certainly way above this lower bound.
It looks like you are talking about an UDP-based MQ because you mentioned MTU. Well, for UDP-based MQs this time is usually measured as the time required to publish a message and see it back in the message bus. So it is a round-trip time, not a one-way time as you described. This can usually be done in less than 6 microseconds, depending of course on your choice of LAN.