When would a blocking socket.send() block? (UDP)

I have been reading about UDP sockets in Python.
By default a socket is configured so that sending or receiving data blocks, stopping program execution until the socket is ready. Calls to send() wait for buffer space to be available for the outgoing data, and calls to recv() wait for the other program to send data that can be read.
I understand the receiving part: it has to wait until the other end sends.
But why would send() block? The book says "send() waits for buffer space to be available". What is this buffer space?
Is it shared by the whole operating system, or is there a separate one for each running application?
Is there a way to know how much buffer space is available?

The buffer referred to by the BSD sockets and POSIX documentation, and presumably also by whatever you were reading, is a per-socket buffer, not per-system or per-app. It's the same one you can set with SO_SNDBUF. When that buffer is full, send will block.
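Since it's the per-socket buffer behind SO_SNDBUF, you can inspect it (and, within kernel limits, resize it) from Python with getsockopt/setsockopt; the 1 MiB value below is only an illustration:

```python
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Current size of this socket's send buffer, in bytes.
print("send buffer:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))

# Request a larger buffer; the kernel may clamp the value (and Linux
# reports back double what you asked for, to account for bookkeeping overhead).
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 1 << 20)
print("send buffer now:", sock.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
```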
But the real question is, why would that buffer get full?
Underneath the stuff your app can see, the kernel usually has something like a ring of transmit buffers per NIC. When the NIC finishes putting a buffer's worth of data out on the wire, it pulls another one off the ring. When there's space on the ring, the kernel pulls another transmit buffer from one of the socket send buffers; or, if there are too many buffers waiting, it drops some of them and then pulls one. So, that part is system-wide (or at least NIC-wide).
So, just knowing your send buffer size doesn't tell you everything you care about, and to truly understand it you need to ask questions specific to your kernel. If that's Linux or a *BSD, the networking stack is open source and pretty well documented (in Linux, if I remember correctly, and if my 2.4-era knowledge is still useful, searching for SO_SNDBUF and txqueuelen gives you good starting points); otherwise, it may not be. Of course Linux exposes all the information there is to expose somewhere in the /proc filesystem, so once you know what you want to look for, you can find it. *BSD doesn't expose quite as much, but there's still a lot there. Even Windows has a slew of relevant performance counters.
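On Linux specifically, the system-wide default and ceiling for socket send buffers are exposed as net.core.wmem_default and net.core.wmem_max, which you can read straight out of /proc; a quick, Linux-only sketch:

```python
# Linux-only: the system-wide default and maximum socket send buffer sizes.
for name in ("wmem_default", "wmem_max"):
    with open(f"/proc/sys/net/core/{name}") as f:
        print(name, "=", int(f.read()), "bytes")
```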

Related

Sending custom data to socket using eBPF

I'm trying, finally, to understand eBPF and maybe use it in an upcoming project.
For the sake of simplicity I started by reading the bcc documentation.
In my project I'll need to send some data over the network upon certain kernel function calls.
Can that be done without sending the data to userspace first?
I see that I can redirect skbs from one socket to another etc., and I see that I can submit custom data to user space. Is there a way to get the best of both worlds?
EDIT: I'm trying to log some file system events to another server that will collect this data from multiple machines. Those machines can be fairly busy in some situations. It should be real-time and low-latency.
I'd love to avoid going through userspace, to prevent copying the data back and forth and to reduce software overhead as much as possible.
Thank you all!
It seems this question can be summarized to: is it possible to send data over the network from a BPF tracing program (kprobes, tracepoints, etc.)?
The answer to that question is no. As far as I know, there is currently no way to craft and send packets over the network from BPF programs. You can resend a received packet to the network with some helpers, but those are only available to networking BPF programs.
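To make the usual workaround concrete (trace in the kernel, hand the event to userspace, send from there), here is a minimal bcc sketch; the traced function (vfs_write), the event fields, and the collector address are all placeholders for whatever your project actually needs:

```python
from bcc import BPF
import socket

# Hypothetical collector address for the log events.
COLLECTOR = ("collector.example.com", 9999)

prog = r"""
#include <uapi/linux/ptrace.h>

struct event_t {
    u32 pid;
    char comm[16];
};

BPF_PERF_OUTPUT(events);

int trace_vfs_write(struct pt_regs *ctx) {
    struct event_t ev = {};
    ev.pid = bpf_get_current_pid_tgid() >> 32;
    bpf_get_current_comm(&ev.comm, sizeof(ev.comm));
    events.perf_submit(ctx, &ev, sizeof(ev));   // copy the event to userspace
    return 0;
}
"""

b = BPF(text=prog)
b.attach_kprobe(event="vfs_write", fn_name="trace_vfs_write")

udp = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def handle(cpu, data, size):
    ev = b["events"].event(data)
    # The network send happens here, in userspace, not in the BPF program.
    udp.sendto(f"{ev.pid} {ev.comm.decode(errors='replace')}".encode(), COLLECTOR)

b["events"].open_perf_buffer(handle)
while True:
    b.perf_buffer_poll()
```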

Notify userland from a kernel module

I'm implementing a kernel module that drives GPIOs. I expose ioctls so that userland can act on them, but I'd like to go further and set up a "notification" system, where the kernel module directly notifies userland when it detects an event, for example a value change on a GPIO (already caught by an interrupt in the kernel module).
The main purpose is to avoid active polling loops in userland, and I really don't know how to interface the kernel module with userland in a way that stays fast, efficient, and more or less passive.
I can't find anything on good practice for this case. Some people suggest having a character device (a file in /dev) and performing a blocking read() from userland, so that userland is notified when the read returns.
This method should be good enough, but if the GPIO value changes very quickly, userland might be too slow to handle each notification and would eventually be crushed by a flood of notifications it can't keep up with.
So I'm looking for something like userland callback functions that the kernel module could call when an event occurs.
What do you guys think is the best solution? Is there any existing way of solving this specific problem?
Thank you :)
Calling from the kernel into userspace is certainly possible, for instance by spawning a userspace process (consider that the kernel launches init, udev and a few others) or by using IPC (netlink and others).
However, this is not what you want.
As people mentioned to you, the way to go is to have a char device and then use standard and well-known select/poll semantics. I don't think you should be worried about this being slow, assuming your userspace program is well-designed.
In fact, this design is so common that there is an existing framework called UIO or Userspace I/O (see here and here).
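For a rough picture of what that looks like from the userspace side, here is a sketch that poll()s a hypothetical /dev/mygpio char device; the device node and record format are made up, the real ones come from your driver:

```python
import os
import select

# Hypothetical char device exposed by the GPIO driver; the record
# format returned by read() is whatever the driver defines.
fd = os.open("/dev/mygpio", os.O_RDONLY | os.O_NONBLOCK)

poller = select.poll()
poller.register(fd, select.POLLIN)

while True:
    # Sleeps until the driver signals readiness (the driver implements
    # .poll and wakes its wait queue from the interrupt handler).
    for _, events in poller.poll():
        if events & select.POLLIN:
            record = os.read(fd, 64)
            print("GPIO event:", record)
```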
I'm sorry, I don't know whether you can call userland callbacks from kernel space, but you can make your userspace application listen for signals such as SIGTERM (note that SIGKILL cannot be caught), which you can send to a userspace process from kernel space.
There are also SIGUSR1 and SIGUSR2, which are reserved for custom use. Your application could listen for SIGUSR1 and/or SIGUSR2; then you only have to check why you were notified.
I know it's not exactly what you wanted, but maybe it helps a little. ;)
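The userspace half of that signal-based approach is tiny; a sketch (the kernel side would deliver the signal with something like send_sig() or kill_pid(), which isn't shown here):

```python
import os
import signal
import time

def on_usr1(signum, frame):
    # Keep signal handlers short: set a flag or fetch the pending event here.
    print("got SIGUSR1: a GPIO event is pending")

signal.signal(signal.SIGUSR1, on_usr1)
print("waiting for signals, pid =", os.getpid())

while True:
    time.sleep(60)   # the sleep is interrupted whenever a signal is delivered
```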
I finally went with something else, as spawning userland processes was far too slow and error-prone.
I changed my software design so that userland calls an ioctl to fetch the latest events. The ioctl blocks on a wait queue, sleeping while the event queue is empty.
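For completeness, the userspace half of that design can be as small as the sketch below; the device path, ioctl request code, and event layout are hypothetical and would come from the driver's header:

```python
import fcntl
import struct

# Hypothetical ioctl request code and device node; the real values come
# from the driver's UAPI header (built with the _IOR()/_IOWR() macros).
MYGPIO_GET_EVENT = 0x80086701

with open("/dev/mygpio", "rb", buffering=0) as dev:
    while True:
        buf = bytearray(8)                       # room for one (gpio, value) record
        fcntl.ioctl(dev, MYGPIO_GET_EVENT, buf)  # blocks in the driver until an event exists
        gpio, value = struct.unpack("ii", buf)
        print(f"GPIO {gpio} changed to {value}")
```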
Thanks for your answers, guys!

Is it possible to know if data was buffered when a TCP connection fails on Linux?

When you call send on a socket, the data is buffered in the kernel and you get a non-error return. The kernel implementation then gets busy acking and windowing to get all your data to the other end.
If a Pekinese Terrier bites through a wire, the connection will close, leaving some data unsent. Is there any way to find out, upon getting the error indicating the close, that this is the case? Eventually a mechanism for Linux, Windows, and OS X is desirable, but it doesn't have to be the same mechanism on each.
Someone in a comment wondered: why?
Consider a system that can already recover from entire crashes of a node, but was built with the assumption that 'TCP connections are forever' (which they are not, necessarily, on AWS). So, if a TCP connection closes, there are only two possibilities: the other end has crashed, and we've got a solution for that, or it's still up. If it's still up, it got as much data as TCP delivered before the socket closed. (I realize this is not necessarily a valid assumption.) Since the TCP protocol is already doing all this ack book-keeping in the kernel, it seems a shame to replicate it in user space to keep track of how much got from one end to the other.
I've stumbled across this problem myself, and so have others (e.g. here and here).
Since TCP is buffered and abstracts away the nitty-gritty details of retransmissions, acks and the like, there is no clean way at the application layer to make sure your data was delivered.
Moreover, and this is key, even if TCP did provide some sort of confirmation that the data was delivered, it could only confirm delivery to the TCP buffer on the other end. You'd still be left with the question of whether that data was actually processed by the application itself. After all, a second Pekinese Terrier could have suddenly killed the application you're talking to, or caused it to hang so that it can't read the data from its TCP buffer.
If you need application-layer confirmation that data was delivered (and/or processed), you need an application-layer mechanism for it: explicit application-level acknowledgments.
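As a concrete example, the sender's side of the simplest possible application-level acknowledgment might look like the sketch below, assuming a protocol with a 4-byte length prefix and a literal b"ACK" reply; the receiver reads the length prefix, processes the payload, and only then writes the ack back:

```python
import socket
import struct

def send_with_ack(sock: socket.socket, payload: bytes, timeout: float = 5.0) -> None:
    """Send one length-prefixed message and wait for the peer's explicit ack."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)
    sock.settimeout(timeout)
    if sock.recv(3) != b"ACK":
        # A successful sendall() only proves the data reached local kernel buffers;
        # only this reply proves the peer application actually consumed it.
        raise ConnectionError("peer did not acknowledge the message")
```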

Intercept a packet in the kernel and pass it to userspace

Assume that I implemented a kernel driver that parses RX packets and decides, depending on the EthType, whether to pass them to userspace. What are the "official" ways to do that in the Linux kernel?
The only one that comes to mind is that the user application opens a socket to the kernel and listens on it, while the kernel pushes packets satisfying the criteria (e.g. a specific EthType) into the socket's buffer. I'm certainly not being precise about this, but I hope you get my point :)
Are there any other ways to do this?
Thanks.
You can achieve your goal by using the Netfilter framework. Netfilter lets you intercept ingress/egress packets; the points inside the kernel network stack where packets can be intercepted are called hooks. We can write a kernel module that attaches a function of our own at any of these hooks; that function can parse the packet and its headers and then decide whether to drop the packet, pass it on to the kernel stack, queue it to userspace, etc.
The packets of interest can be intercepted at the PRE_ROUTING hook and queued by returning NF_QUEUE from our function. The queued packets can then be read by a userspace application.
Please go through the Netfilter documentation.
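On the userspace side, once packets are being queued (whether by your own module returning NF_QUEUE or, without any custom module, by an iptables rule like -j NFQUEUE --queue-num 0), they can be read through libnetfilter_queue. A sketch using the third-party Python binding NetfilterQueue, assuming queue number 0:

```python
# Requires the third-party "NetfilterQueue" package (a libnetfilter_queue binding)
# and root privileges; packets must already be directed to queue number 0.
from netfilterqueue import NetfilterQueue

def handle(pkt):
    payload = pkt.get_payload()     # the raw IP packet
    print(f"queued packet, {len(payload)} bytes")
    pkt.accept()                    # reinject into the stack (or pkt.drop() to discard)

nfq = NetfilterQueue()
nfq.bind(0, handle)
try:
    nfq.run()                       # blocks, calling handle() for each queued packet
finally:
    nfq.unbind()
```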
When a packet arrives on the NIC, it is first copied into kernel buffers and then copied to userspace, where it is accessed through socket() followed by read()/write() calls. You may want to refer to Kernel Network Flow for more details.
Additionally, the NIC can copy packets directly into memory via DMA, bypassing the CPU. Refer to: What happens after a packet is captured?
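Worth noting: if the goal is just "hand frames of a given EthType to userspace", Linux already provides that path without a custom driver, via an AF_PACKET socket bound to that EtherType. A small sketch; the EtherType and interface name are only examples:

```python
import socket

ETH_P_EXPERIMENTAL = 0x88B5   # IEEE "local experimental" EtherType, used here as a stand-in

# AF_PACKET delivers entire L2 frames of the requested EtherType to userspace
# (Linux-only; needs CAP_NET_RAW / root). The interface name is an assumption.
sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_EXPERIMENTAL))
sock.bind(("eth0", 0))

while True:
    frame, addr = sock.recvfrom(65535)
    print(f"received a {len(frame)}-byte frame on {addr[0]}")
```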

Winsock: Can I call the send function at the same time for different sockets?

Let's say I have a server with many clients connected via TCP; I have a socket for every client and a sending and receiving thread for every client. Is it safe and possible to call the send function at the same time from different threads, given that no two threads call send on the same socket?
If it's safe and OK, can I stream data to clients simultaneously, without one client's send blocking the sends for other clients?
Thank you very much for your answers.
Yes, it is possible and thread-safe. You could have tested it, or worked out for yourself that IIS, SQL Server, etc. wouldn't work very well if it wasn't.
Assuming this is Windows from the tag of "Winsock".
This design (having a send/receive thread for every single connected client), overall, is not going to scale. Hopefully you are aware of that and you know that you have an extremely limited number of clients (even then, I wouldn't write it this way).
You don't need to have a thread pair for every single client.
You can serve tons of clients with a single thread using non-blocking IO and read/write readiness notifications (either with select() or one of the varieties of overlapped IO, such as completion routines or completion ports). If you use completion ports you can set up a pool of threads to handle socket IO and queue the work for your own worker thread(s) or thread pool.
Yes, you can send and receive on many sockets at once from different threads; but you shouldn't need those extra threads, because you shouldn't be making blocking calls to send/recv at all. When you make a non-blocking call, the amount that can be written immediately is written and the function returns; you then note how much was sent and ask for notification when the socket is next writable.
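As a portable illustration of that pattern (one thread, non-blocking sockets, writability notifications), here is a sketch using Python's selectors module rather than Winsock's overlapped I/O; the port and payload are arbitrary:

```python
import selectors
import socket

sel = selectors.DefaultSelector()

listener = socket.socket()                 # TCP listening socket
listener.bind(("0.0.0.0", 9000))           # arbitrary port
listener.listen()
listener.setblocking(False)
sel.register(listener, selectors.EVENT_READ)

pending = {}                               # per-client bytes still waiting to be sent

while True:
    for key, events in sel.select():
        sock = key.fileobj
        if sock is listener:
            client, _ = listener.accept()
            client.setblocking(False)
            pending[client] = b"x" * 1_000_000      # the data to stream to this client
            sel.register(client, selectors.EVENT_WRITE)
        elif events & selectors.EVENT_WRITE:
            sent = sock.send(pending[sock])         # writes only what fits right now
            pending[sock] = pending[sock][sent:]
            if not pending[sock]:                   # finished: stop asking for writability
                sel.unregister(sock)
                del pending[sock]
                sock.close()
```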
I think you might want to consider a different approach as this isn't simple stuff; if you're using .Net you might get by with building this with TcpListener or HttpListener (both of which use completion ports for you), though be aware that you can't easily disable Nagle's algorithm with those so if you need interactivity (think of the auto-complete on Google's search page) then you probably won't get the performance you want.