Enforce max threshold of tracked connections in eBPF - counter

I have a kprobe eBPF program that tracks a number of active TCP connections. To reduce the overhead, I set an upper limit on the number of TCP connections that can be tracked simultaneously. I therefore have to maintain a counter in the eBPF program: when a new connection is established, the counter is incremented if it is below the limit, and when a connection finishes, the counter is decremented. Since the program is reentrant, the manipulation of the counter must be atomic.
I tried bpf_spin_lock, but it cannot be used in tracing programs (such as kprobes). eBPF also has few atomic operations. The one I know of is __sync_fetch_and_add, but it is not enough to implement the logic here.
I found some discussion online https://lists.iovisor.org/g/iovisor-dev/topic/bpf_concurrency/74407447?p=,,,20,0,0,0::recentpostdate%2Fsticky,,,20,2,20,74407447. The discussion is still open without any viable solution.
Is there any readily available solution out there? Thanks a lot!

After some discussion in comments, we concluded that the best solution is probably to rely on the map's max_entries parameter to enforce a limit on the number of tracked connections.
With that approach, the BPF program would then look something like:
// Update an existing connection or insert a new one.
res = bpf_map_update_elem(&conntrack, &tuple, &connection, 0);
if (res == -E2BIG) {
    // The map is full. The max. threshold for tracked connections was reached.
} else if (res != 0) {
    // The map update failed for some other reason.
} else {
    // The map update was successful.
}
Inserting into and deleting from a BPF hash map are atomic operations, so even with concurrent accesses the map will never contain more elements than max_entries. Note, however, that access to the value of a specific map element is not atomic.
We discarded per-CPU arrays to maintain per-CPU counters because there's no guarantee the event closing the connection (e.g., TCP FIN) will be processed on the same CPU as the event opening it (e.g., TCP SYN).


nondeterminism.njoin: maxQueued and prefetching

Why does the njoin prefetch the data before processing? It seems like an unnecessary complication, unless it has something to do with how Processes of Processes are merged?
I have a stream that runs effects whenever a new element is generated. I'd like to keep the effects to a minimum, so with an njoin with, say, maxOpen = 4, at most 4 elements should be generated at the same time (no element should be generated unless it can be processed immediately).
Is there a way to solve this gracefully with njoin? Right now I'm using a bounded queue of "tickets" (an element is generated only after it got a ticket).
See https://github.com/scalaz/scalaz-stream/issues/274, specifically the comment below from djspiewak.
"From a conceptual level, the problem here is the interface point between the "pull" model of Process and the "push" model that is required for any concurrent stream merging. Both wye and njoin sit at this boundary point and "cheat" by actively pulling on their source processes to fill an inbound queue, pushing results into an outbound queue pending the pull on the output process. (obviously, both wye and njoin make their inbound queues implicit via Actor) For the most part, this works extremely well and it preserves most of the properties that users care about (e.g. propagation of termination, back pressure, etc)."
The second parameter to njoin, maxQueued, bounds the amount of prefetching. If that parameter is 0, there is no limit on the queue size, and thus no limit on the prefetching. The docs for mergeN, which calls njoin, explain the reasoning for this prefetching behavior a bit more: "Internally mergeN keeps small buffer that reads ahead up to n values of A where n equals to number of active source streams. That does not mean that every source process is consulted in this read-ahead cache, it just tries to be as much fair as possible when processes provide their A on almost the same speed." So it seems that njoin is dealing with the problem of what happens when all the sources provide a value at nearly the same time, while trying to prevent any one of the joined streams from crowding out slower streams.

Does socket recv() force a flush of socket send() buffers?

In my application, I send two small messages to the server (a memcached-like service). In Python-like pseudocode, this looks like:
sock.send("add some-key 0")
ignored = sock.recv(...)
sock.send("incr some-key 1")
new_value = sock.recv(...)
Since the server supports fire-and-forget-style writes, I can optimize this code to look more like:
sock.send("add some-key 0 noreply")
sock.send("incr some-key 1")
new_value = sock.recv(...)
However, this takes significantly longer -- an average of 40ms for this version, versus an average of under 1ms for the former.
Furthermore, I've noticed that if I create the socket with TCP_NODELAY, thus disabling Nagle's algorithm, the timings for the second snippet are similar to the first. This suggests that the delay is happening between the two send()s (the "write-write-read" problem).
I'm reasonably convinced that disabling Nagle is the right move for my application -- I have a fairly high volume of fairly small writes that must be handled with as little latency as possible -- but I'm not sure why it wasn't necessary in the first example. Does recv() force the kernel to send any buffered writes? I suspect this is true, but I haven't been able to find documentation to that effect anywhere.
(Note this is Linux 2.6.32 and glibc 2.12 with Python 2.6.8, in case any of that has any bearing on the answer)
Does recv() force the kernel to send any buffered writes?
No. The two directions of a TCP connection are completely independent.
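The Nagle algorithm holds back a small write while previously sent data is still unacknowledged, so that it can be coalesced with subsequent writes. Together with the peer's delayed ACK this can cost tens of milliseconds, and up to about 200ms on some stacks. That's what you're seeing, and when you disable Nagle it stops happening. In your first snippet each send() is issued only after the previous reply has arrived, so there is typically no unacknowledged data outstanding and nothing is held back.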
I suspect this is true
It isn't.
but I haven't been able to find documentation to that effect anywhere.
You won't.
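A minimal sketch of the two usual ways around the write-write-read delay, written for Python 3: disabling Nagle with TCP_NODELAY, or building both commands into one buffer so that a single send() carries them. The helper names make_low_latency_socket and build_request are hypothetical, and the "\r\n" terminator is an assumption about the server's text protocol, not something stated in the question.

```python
import socket


def make_low_latency_socket():
    """Create a TCP socket with Nagle's algorithm disabled (TCP_NODELAY)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def build_request(commands):
    """Join several small commands into one buffer for a single send().

    The "\r\n" line terminator is an assumption about the wire format.
    """
    return b"".join(cmd + b"\r\n" for cmd in commands)


# Option 1: keep separate sends, but turn Nagle off on the socket.
sock = make_low_latency_socket()
nodelay = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
sock.close()

# Option 2: coalesce in user space, then call sock.sendall(request) once.
request = build_request([b"add some-key 0 noreply", b"incr some-key 1"])
```

Coalescing in user space avoids the second small write altogether, so it helps whether or not Nagle is enabled.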

What is the benefit of using non-blocking sockets with the "select" function?

I'm writing a server in Linux that will have to support simultaneous read/write operations from multiple clients. I want to use the select function to manage read/write availability.
What I don't understand is this: Suppose I want to wait until a socket has data available to be read. The documentation for select states that it blocks until there is data available to read, and that the read function will not block.
So if I'm using select and I know that the read function will not block, why would I need to set my sockets to non-blocking?
There might be cases when a socket is reported as ready but by the time you get to check it, it changes its state.
One good example is accepting connections. When a new connection arrives, a listening socket is reported as ready for read. By the time you get to call accept, the connection might already have been closed by the other side, before it ever sent anything. The handling of this case is OS-dependent, but it's possible that accept will simply block until a new connection is established, which will cause the application to wait for an indefinite amount of time and prevent processing of other sockets. If your listening socket is in non-blocking mode, this won't happen: you'll get EWOULDBLOCK or some other error, but accept will not block.
Some kernels used to have (I hope it's fixed now) an interesting bug with UDP and select. When a datagram arrives, select wakes up with the socket marked as ready for read. The datagram checksum validation is postponed until user code calls recvfrom (or some other API capable of receiving UDP datagrams). When the code calls recvfrom and the validating code detects a checksum mismatch, the datagram is simply dropped and recvfrom ends up blocked until the next datagram arrives. One of the patches fixing this problem (along with the problem description) can be found here.
Other than the kernel bugs mentioned by others, a different reason for choosing non-blocking sockets, even with a polling loop, is that it allows for greater performance with fast-arriving data. Think what happens when a blocking socket is marked as "readable". You have no idea how much data has arrived, so you can safely read it only once. Then you have to get back to the event loop to have your poller check whether the socket is still readable. This means that for every single read from or write to the socket you have to do at least two system calls: the select to tell you it's safe to read, and the reading/writing call itself.
With non-blocking sockets you can skip the unnecessary calls to select after the first one. When a socket is flagged as readable by select, you have the option of reading from it as long as it returns data, which allows faster processing of quick bursts of data.
This is going to sound snarky but it isn't. The best reason to make them non-blocking is so you don't block.
Think about it. select() tells you there is something to read but you don't know how much. Could be 2 bytes, could be 2,000. In most cases it is more efficient to drain whatever data is there before going back to select. So you enter a while loop to read:
while (1) {
    n = read(sock, buffer, 200);
    if (n <= 0)
        break;  // 0: peer closed; -1: error, or EWOULDBLOCK once drained
    // process the n bytes just read
}
What happens on the last read when there is nothing left to read? If the socket isn't non-blocking you will block, thereby defeating (at least partially) the point of the select().
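The same drain loop can be sketched in Python 3, assuming a socket that has been put into non-blocking mode. The drain helper is a hypothetical name, and the local socket pair only stands in for a real client connection.

```python
import select
import socket


def drain(sock, chunk_size=200):
    """Read from a non-blocking socket until no more data is available."""
    chunks = []
    while True:
        try:
            data = sock.recv(chunk_size)
        except BlockingIOError:
            break  # nothing left to read right now (EWOULDBLOCK/EAGAIN)
        if not data:
            break  # peer closed the connection
        chunks.append(data)
    return b"".join(chunks)


# Demonstration on a local socket pair.
reader, writer = socket.socketpair()
reader.setblocking(False)
writer.sendall(b"x" * 1000)

# One select() call, then as many recv() calls as there is data for.
readable, _, _ = select.select([reader], [], [], 1.0)
received = drain(reader) if reader in readable else b""

reader.close()
writer.close()
```

With a blocking socket the final recv() in drain would hang instead of raising BlockingIOError, which is exactly the situation described above.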
One of the benefits is that it will catch any programming errors you make: if you try to read from a socket in a situation where the call would normally block, you'll get EWOULDBLOCK instead. For objects other than sockets, the exact API behaviour may differ, see http://www.scottklement.com/rpg/socktut/nonblocking.html.

How much to read from socket when using select

I'm using select() to listen for data on multiple sockets. When I'm notified that there is data available, how much should I read()?
I could loop over read() until there is no more data, process the data, and then return to the select loop. However, I can imagine that a socket receives so much data so fast that it temporarily 'starves' the other sockets. Especially since I am thinking of also using select for inter-thread communication (message-passing style), I'd like to keep latency low. Is this an issue in reality?
The alternative would be to always read a fixed size of bytes, and then return to the loop. The downside here would be added overhead when there is more data available than fits into my buffer.
What's the best practice here?
Not sure how this is implemented on other platforms, but on Windows the ioctlsocket(FIONREAD) call tells you how many bytes can be read by a single call to recv(). More bytes could be in the socket's queue by the time you actually call recv(). The next call to select() will report the socket is still readable, though.
The too-common approach here is to read everything that's pending on a given socket, especially if one moves to platform-specific advanced polling APIs like kqueue(2) and epoll(7) enabling edge-triggered events. But, you certainly don't have to! Flip a bit associated with that socket somewhere once you think you got enough data (but not everything), and do more recv(2)'s later, say at the very end of the file-descriptor checking loop, without calling select(2) again.
Then the question is too general. What are your goals? Low latency? High throughput? Scalability? There's no single answer to everything (well, except for 42 :)
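One way to bound how long a single busy socket can hold up the others is to give every readable socket a fixed byte budget per pass. This is a minimal Python 3 sketch, assuming non-blocking sockets and level-triggered select(), so a socket with leftover data is simply reported readable again on the next pass. The read_with_budget helper and the 4096-byte budget are hypothetical choices, and the two socket pairs stand in for real connections.

```python
import select
import socket


def read_with_budget(sock, budget, chunk_size=4096):
    """Read at most `budget` bytes from a non-blocking socket.

    Returns (data, maybe_more). maybe_more is True when the budget ran out,
    meaning the socket may still have data pending.
    """
    chunks, remaining = [], budget
    while remaining > 0:
        try:
            data = sock.recv(min(chunk_size, remaining))
        except BlockingIOError:
            return b"".join(chunks), False  # drained for now
        if not data:
            return b"".join(chunks), False  # peer closed
        chunks.append(data)
        remaining -= len(data)
    return b"".join(chunks), True


busy_r, busy_w = socket.socketpair()
quiet_r, quiet_w = socket.socketpair()
for s in (busy_r, quiet_r):
    s.setblocking(False)
busy_w.sendall(b"b" * 10000)
quiet_w.sendall(b"q" * 10)

received = {busy_r: b"", quiet_r: b""}
passes = 0
while len(received[busy_r]) < 10000 or len(received[quiet_r]) < 10:
    readable, _, _ = select.select([busy_r, quiet_r], [], [], 1.0)
    if not readable:
        break  # timed out: nothing more is coming
    passes += 1
    for s in readable:
        data, _ = read_with_budget(s, budget=4096)
        received[s] += data

busy_total, quiet_total = len(received[busy_r]), len(received[quiet_r])
for s in (busy_r, busy_w, quiet_r, quiet_w):
    s.close()
```

The cost is one extra select() call per budget's worth of data on the busy socket. Whether that matters depends on the latency and throughput goals.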

Implement a good performing "to-send" queue with TCP

In order not to flood the remote endpoint my server app will have to implement a "to-send" queue of packets I wish to send.
I use Windows Winsock, I/O Completion Ports.
So, I know that when my code calls "socket->send(.....)" my custom "send()" function will check to see if data is already "on the wire" (towards that socket).
If data is indeed on the wire it will simply queue the new data to be sent later.
If no data is on the wire it will call WSASend() to really send the data.
So far everything is nice.
Now, the size of the data I'm going to send is unpredictable, so I break it into smaller chunks (say 64 bytes) in order not to waste memory for small packets, and queue/send these small chunks.
When a "write-done" completion status is given by IOCP regarding the packet I've sent, I send the next packet in the queue.
That's the problem: the speed is awfully low.
I'm actually getting speeds like 200kb/s, and that's on a local connection.
So, I know I'll have to call WSASend() with several chunks (an array of WSABUF objects), and that will give much better performance, but how much should I send at once?
Is there a recommended size of bytes? I'm sure the answer is specific to my needs, yet I'm also sure there is some "general" point to start with.
Is there any other, better, way to do this?
Of course you only need to resort to providing your own queue if you are trying to send data faster than the peer can process it (either due to link speed or the speed that the peer can read and process the data). Then you only need to resort to your own data queue if you want to control the amount of system resources being used. If you only have a few connections then it is likely that this is all unnecessary, if you have 1000s then it's something that you need to be concerned about. The main thing to realise here is that if you use ANY of the asynchronous network send APIs on Windows, managed or unmanaged, then you are handing control over the lifetime of your send buffers to the receiving application and the network. See here for more details.
And once you have decided that you DO need to bother with this, you still don't always need to: if the peer can process the data faster than you can produce it, there's no need to slow things down by queuing on the sender. You'll see that you need to queue data because your write completions will begin to take longer, as the overlapped writes that you issue cannot complete while the TCP stack is unable to send any more data due to flow control (see http://www.tcpipguide.com/free/t_TCPWindowSizeAdjustmentandFlowControl.htm). At this point you are potentially using an unconstrained amount of limited system resources (both non-paged pool memory and the number of memory pages that can be locked are limited and, as far as I know, both are used by pending socket writes)...
Anyway, enough of that... I assume you already have achieved good throughput before you added your send queue? To achieve maximum performance you probably need to set the TCP window size to something larger than the default (see http://msdn.microsoft.com/en-us/library/ms819736.aspx) and post multiple overlapped writes on the connection.
Assuming you already HAVE good throughput then you need to allow a number of pending overlapped writes before you start queuing, this maximises the amount of data that is ready to be sent. Once you have your magic number of pending writes outstanding you can start to queue the data and then send it based on subsequent completions. Of course, as soon as you have ANY data queued all further data must be queued. Make the number configurable and profile to see what works best as a trade off between speed and resources used (i.e. number of concurrent connections that you can maintain).
I tend to queue the whole data buffer that is due to be sent as a single entry in a queue of data buffers. Since you're using IOCP, it's likely that these data buffers are already reference counted to make it easy to release them when the completions occur and not before. That makes the queuing process simpler: you simply hold a reference to the send buffer whilst the data is in the queue and release it once you've issued a send.
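The queuing policy described above can be modelled in a few lines. This is only a sketch of the bookkeeping in Python 3, not Winsock code: issue_write is a hypothetical callback standing in for an overlapped WSASend(), on_write_complete stands in for the IOCP completion handler, and max_pending is the configurable "magic number". It is not thread-safe, so a real IOCP server would need a lock around both methods.

```python
from collections import deque


class SendQueue:
    """Per-connection send queue with a cap on in-flight writes."""

    def __init__(self, issue_write, max_pending=4):
        self._issue_write = issue_write
        self._max_pending = max_pending
        self._pending = 0       # writes issued but not yet completed
        self._queue = deque()   # whole buffers waiting to be issued

    def send(self, buffer):
        # Once anything is queued, all further data must be queued as well,
        # otherwise buffers would be sent out of order.
        if self._queue or self._pending >= self._max_pending:
            self._queue.append(buffer)
        else:
            self._pending += 1
            self._issue_write(buffer)

    def on_write_complete(self):
        # A completion frees one slot; fill it from the queue if possible.
        self._pending -= 1
        if self._queue:
            self._pending += 1
            self._issue_write(self._queue.popleft())


issued = []
q = SendQueue(issued.append, max_pending=2)
for i in range(5):
    q.send(b"buf%d" % i)
issued_before = list(issued)  # snapshot before any completion arrives

for _ in range(3):
    q.on_write_complete()
```

Each queue entry is a whole buffer rather than a small chunk, so the number of writes issued depends on how the application produces data, not on an arbitrary chunk size.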
Personally I wouldn't optimise by using scatter/gather writes with multiple WSABUFs until you have the base working and you know that doing so actually improves performance, I doubt that it will if you have enough data already pending; but as always, measure and you will know.
64 bytes is too small.
You may have already seen this but I wrote about the subject here: http://www.lenholgate.com/blog/2008/03/bug-in-timer-queue-code.html though it's possibly too vague for you.