What is the number n of file descriptors after which it is best to switch to epoll over poll?

Do we have any benchmarks for a range of descriptors from 1 to 50 or so? Most benchmarks I see are for large numbers of descriptors, in the hundreds or thousands.
I am currently using poll with 16 descriptors and am considering epoll if that will improve the speed of the app.
Please advise on 3 scenarios with 16 socket descriptors in the poll/epoll set:
1. Most of the sockets are active -> both should give the same performance?
2. Half active, half idle -> what is better here?
3. Mostly idle -> clearly epoll is better?

I would very much suspect that switching from poll() to epoll() will not make any difference in the performance of your application. The main advantage of epoll() crops up when you have many file descriptors (hundreds or thousands) where a standard poll() requires a little more work to be done on every call, whereas epoll() does the setup in advance - as long as you don't change the set of file descriptors you're watching, each call is very slightly quicker. But generally this difference is only noticeable for many, many file descriptors.
Bear in mind that if the set of file descriptors you're watching changes very frequently, epoll()'s main advantage is lost because you still need to do the work of passing new file descriptors into the kernel. So, if you're handling lots of short-lived connections then it's even less compelling to switch to it.
Another difference is that epoll() can be edge-triggered, where the call only returns when new activity occurs on a descriptor, or level-triggered, where the call returns while the descriptor is read/write-ready. The standard poll() call is always level-triggered. For most people, however, level-triggered is what they want - edge-triggered interfaces are occasionally useful, but they make it easy to introduce subtle bugs: if you stop reading a socket before fully draining it, no further notification arrives and the remaining data sits unprocessed. My advice is stay well away from edge-triggered code unless you really, really know what you're doing.
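To make the two modes concrete, here is a minimal Linux-only sketch (error handling omitted) of registering a socket with epoll(); adding EPOLLET to the events mask is all it takes to switch a descriptor from the default level-triggered mode to edge-triggered:

#include <sys/epoll.h>

/* register sockfd with a new epoll instance and return the epoll fd */
int setup_epoll(int sockfd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev = {0};
    ev.events = EPOLLIN;               /* level-triggered (the default) */
    /* ev.events = EPOLLIN | EPOLLET;     edge-triggered: you must then
       read until EAGAIN, or risk leaving buffered data unserviced */
    ev.data.fd = sockfd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, &ev);
    return epfd;
}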
The price you pay for epoll() is the lack of portability - both poll() and select() are standard POSIX interfaces, so your code will be much more portable by using them. The epoll() call, on the other hand, is only available on Linux. Some other Unix variants also have their own equivalent mechanisms, such as kqueue on FreeBSD, but you have to write different code for each platform in that case.
My advice is until you reach a point where you're using many file descriptors, don't even worry about epoll() - seriously, there are almost certainly many other places in your code to make far bigger performance improvements and it's entirely possible that epoll() may not be faster for your use-case anyway.
If you do reach a stage where you're handling many connections and the rest of your code is already pretty optimal then you should first consider something like libev which is a cross-platform interface which uses the best performance calls on each particular platform. It performs very well and it's probably rather less hassle overall than directly using epoll() even if you only want to support Linux.
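For a rough feel of what that looks like, here is a minimal libev sketch (libev 4.x API; the callback and the choice of watching fd 0 are purely for illustration) - the library picks epoll(), kqueue or poll()/select() underneath as appropriate:

#include <ev.h>

static void read_cb(EV_P_ ev_io *w, int revents)
{
    /* w->fd is readable here; read and process it */
}

int main(void)
{
    struct ev_loop *loop = EV_DEFAULT;
    ev_io watcher;
    ev_io_init(&watcher, read_cb, 0 /* fd to watch */, EV_READ);
    ev_io_start(loop, &watcher);
    ev_run(loop, 0);   /* dispatches callbacks until stopped */
    return 0;
}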
I haven't referred to the three scenarios you mention so far because I don't believe any of them will perform any differently for a low number of file descriptors such as 16. For a large number of file descriptors, epoll() should outperform poll() particularly where there are mostly idle file descriptors. If all file descriptors are always active, both calls require iterating through every connection to handle it. However, as the proportion of idle connections increases, epoll() gives better performance as it only returns the active connections - with poll() you still have to iterate through everything and most of them will be skipped, but epoll() returns you only the ones you need to handle (up to a maximum limit you can specify).
To spell that out explicitly (and this is only relevant for large numbers of connections, as I mentioned above):
Most of the sockets are active: Both calls broadly comparable, perhaps epoll() still slightly ahead.
Half active, half idle: Would expect epoll() to be somewhat better here.
Mostly idle: Would expect epoll() to definitely be better here.
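The structural difference behind those three cases, sketched in C (handle() and the descriptor bookkeeping are assumed to exist elsewhere): with poll() every call scans the whole array whether one descriptor fired or all of them did, while epoll_wait() hands back only the ready ones.

#include <poll.h>
#include <sys/epoll.h>

void handle(int fd);   /* hypothetical per-connection handler */

/* poll(): O(n) scan over every registered descriptor per wakeup */
void poll_once(struct pollfd *fds, nfds_t nfds)
{
    if (poll(fds, nfds, -1) > 0)
        for (nfds_t i = 0; i < nfds; i++)
            if (fds[i].revents & POLLIN)
                handle(fds[i].fd);
}

/* epoll(): the kernel returns only the descriptors that are ready */
void epoll_once(int epfd, struct epoll_event *events, int maxevents)
{
    int nready = epoll_wait(epfd, events, maxevents, -1);
    for (int i = 0; i < nready; i++)
        handle(events[i].data.fd);
}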
EDIT:
You might like to see this graph which is from the libevent author and shows the relative overhead of handling an event as the number of file descriptors changes. Note how all the lines are converging around the origin, demonstrating that all the mechanisms achieve comparable performance for a small number of descriptors.

Related

Custom DispatchQueue quality of service

Is there a way to create a custom DispatchQueue quality of service with its own custom "speed"? For example, I want a QoS that's twice as slow as .utility.
Ideas on how to solve it
Somehow telling the CPU/GPU that we want to run the task every X operation cycles? Not sure that's directly possible on iOS.
We could introduce a wait after every line of code, but this is a really bad hack that produces messy code and doesn't really solve the issue if a single line runs for several seconds.
In SpriteKit/SceneKit it's possible to slow down time. Is there a way to use that to slow down an arbitrary piece of code?
Blocking the thread every X seconds so that it slows down - not sure if that's possible without sacrificing app speed.
There is no mechanism in iOS or any other Cocoa platform to control the "speed" (for any meaning of that word) of a work item. The only tool offered us is some control over scheduling. Once your work item is scheduled, it will get 100% (*) of the CPU core until it ends or is preempted. There is no way to be asked to be preempted more often (and it would be expensive to allow that, since context switches are expensive).
The way to manage how much work is done is to directly manage the work, not preemption. The best way is to split up the work into small pieces, and schedule them over time and combine them at the end. If your algorithm doesn't support that kind of input segmentation, then the algorithm's main "loop" needs to limit the number of iterations it performs (or the amount of time it spends iterating), and return at that point to be scheduled later.
If you don't control the algorithm code, and you cannot work with whoever does, and you cannot slice your data into smaller pieces, this may be an unsolvable problem.
(*) With the rise of "performance" cores and other such CPU advances, this isn't completely true, but for this question it's close enough.
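To make the work-slicing idea above concrete, here is a minimal sketch of the pattern (written in C to match the rest of this page; struct job, process_item and the reschedule step are all hypothetical names, not an iOS API):

#include <stdbool.h>

struct job { int next, total; /* plus whatever state the work needs */ };

void process_item(struct job *job, int index);   /* hypothetical */

/* do at most `budget` items per invocation; returns true if work
   remains, in which case the caller reschedules this job later
   (on iOS, e.g. with another dispatch_async onto the queue) */
bool do_some_work(struct job *job, int budget)
{
    while (budget-- > 0 && job->next < job->total)
        process_item(job, job->next++);
    return job->next < job->total;
}

Spacing out those rescheduled slices over time is what effectively controls the "speed" of the overall task.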
Technically you cannot alter the speed of a QoS class such as .background or .utility.
The way to handle this is to choose the right QoS for the task you want to perform.
The higher the QoS, the more resources the OS will spend on it; a lower QoS gets correspondingly fewer.

The reason(s)/benefit(s) to use realtime operating system instead of while-loop on MCU

I'm working on a wheeled-robot platform. My team is implementing some algorithms on MCU to
keep getting sensor readings (sonar array, IR array, motor encoders, IMU)
receive user command (via a serial port connected to a tablet)
control actuators (motors) to execute user commands.
keep sending sensor readings to the tablet for more complicated algorithms.
We currently implement everything inside a global while-loop, whereas I know that most other systems with these use-cases do the very same things on top of a real-time operating system.
Please tell me the benefits and reasons to use a real-time os instead of a simple while-loop.
Thanks.
The RTOS will provide priority-based preemption. If your code is off parsing a serial command and executing it, it can't respond to anything else until it returns to your beastly loop. An RTOS will provide the abstractions needed for an instant context switch based on an interrupt event. Otherwise the worst-case latency of an event response is going to be that of the longest possible excursion out of the main loop - and sometimes you really do need long-running processes, for example to update an LCD panel or respond to USB device enumeration. Preemption permits you to go off and do these things safe in the knowledge that a 16-bit timer running at the CPU clock isn't going to roll over several times before you're done.
While a loop is sufficient for simple control jobs, the problem with starting that way is that when you get into something like USB device enumeration it's no longer practical and will need a full rewrite. By starting with a preemptive framework like an RTOS, you have a lot more future flexibility. But there's definitely a bit more up-front work, and definitely a learning curve.
"Real Time" OS ensures your task periodicity. If you want to read sensors data precisely at every 100msec, simple while loop will not guarantee that. On other hand, RTOS can easily take care of that.
RTOS gives you predictibility. An operation will be executed at given time and it will not be missed.
RTOS gives you Semaphores/Mutex so that your memory will not be corrupted or multiple sources will not access buffers.
RTOS provides message queues which can be useful for communication between tasks.
Yes, you can implement all these features in While loop, but then that's the advantage! You get everything ready and tested.
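As a concrete illustration of the periodicity and queue points above, here is a hedged FreeRTOS-style sketch (FreeRTOS is just one common RTOS; read_sensors() and send_to_tablet() stand in for whatever hardware access the robot actually has):

#include "FreeRTOS.h"
#include "task.h"
#include "queue.h"

int read_sensors(void);           /* hypothetical HAL calls */
void send_to_tablet(int reading);

static QueueHandle_t sensor_q;

/* runs precisely every 100 ms regardless of what other tasks do */
static void sensor_task(void *arg)
{
    TickType_t last_wake = xTaskGetTickCount();
    for (;;) {
        int reading = read_sensors();
        xQueueSend(sensor_q, &reading, 0);
        vTaskDelayUntil(&last_wake, pdMS_TO_TICKS(100));
    }
}

/* lower-priority task: forwards readings whenever they arrive */
static void comms_task(void *arg)
{
    int reading;
    for (;;)
        if (xQueueReceive(sensor_q, &reading, portMAX_DELAY) == pdTRUE)
            send_to_tablet(reading);
}

int main(void)
{
    sensor_q = xQueueCreate(8, sizeof(int));
    xTaskCreate(sensor_task, "sense", 256, NULL, 2, NULL);  /* higher priority */
    xTaskCreate(comms_task,  "comms", 256, NULL, 1, NULL);
    vTaskStartScheduler();   /* never returns */
    for (;;) {}
}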
If your while loop works (i.e. it fulfills the real-time requirements of your system), and it's robust, maintainable, and somewhat extensible, then there probably is no benefit to using a real-time operating system.
But if your while-loop can't fulfill the real-time requirements or is overly complex or over-extended (i.e., any change requires further tuning and tweaking to restore real-time performance), then you should consider another architecture.
An RTOS architecture is one popular next step beyond the super-loop. An RTOS is basically a tool for managing software complexity. It allows you to divide a complex set of software requirements into multiple threads of execution. When done properly, each individual thread has a relatively simple set of requirements and becomes easier to implement. And thread prioritization can make it easier to fulfill the real-time requirements of the application. These are basically the benefits of employing an RTOS.
An RTOS is not a panacea, however. An RTOS increases the overall system complexity and opens you up to new types of bugs (such as deadlocks). It requires knowledge and experience to design and implement an RTOS based program effectively. Consider alternatives such as Multi-Rate Main Loop Tasking or an event-based state machine architecture (such as QP). These alternatives might be simpler, easier to understand, or more compatible with your way of designing software.
There are a couple of huge advantages that an RTOS multitasker has:
Very good I/O performance: an RTOS can set a waiting thread ready as soon as an I/O action that it requested has completed, so latency in handling reads/writes/whatever can be low. A looped, polled design cannot respond to I/O completions until it gets round to checking the I/O status (either directly or by polling some volatile flag set by an interrupt handler).
Independent functionality: the ease of implementing isolated subsystems for serial comms, actuators etc. may well be a big one for you, as suggested in the other answers. It's very reassuring to know that any extra delay in, say, some serial exchange will not adversely affect timing elsewhere. You need to wait a second? No problem: sleep(1000) and you're done - no effect on the other subsystems:) It's the difference between 'no, I cannot add a network server, it would change all the timing and I would have to retest everything' and 'sure, there's plenty of CPU free, I already have the code from another job and I just need another thread to run it'.
There are other advantages that help offset the added annoyance of having to program a preemptive multitasker with its critical sections, semaphores and condvars.
Obviously, if your hardware has multiple cores, the RTOS helps - it is designed to share out available CPU execution cycles just like any other resource, and adding cores means more cycles.
In the end, though, it's the I/O performance and functional isolation that's the big win.
Some of the suggestions in other answers may help, either instead of, or together with, an RTOS. When controlling multiple I/O hardware, e.g. sensors and actuators, an event-driven state machine is a very good idea indeed. I often implement that by queueing all events into one thread (via a producer-consumer queue) that has sole access to the state data and implements the state machine, thus serializing actions.
Whether the advantages are worth the candle is between you and your requirements:)
An RTOS is not a replacement for the while loop - it is the while loop plus tools which organize your tasks. How do they organize your tasks? They assign priorities to them, and decide how much time each one gets for its job and/or at what time it should start or end. An RTOS also layers your software, i.e. hardware-related code, application tasks, etc. Aside from that, it gives you data structures, containers, and ready-to-use interfaces to handle common chores, so you don't have to implement your own - e.g. it allocates memory for you, locks access to shared resources, and so on.

Can a server handle multiple sockets in a single thread?

I'm writing a test program that needs to emulate several connections between virtual machines, and it seems like the best way to do that is to use Unix domain sockets, for various reasons. It doesn't really matter whether I use SOCK_STREAM or SOCK_DGRAM, but it seems like SOCK_STREAM is easier/simpler for my usage.
My problem seems to be a little backwards from the typical scenario. I want to have a single client communicating with the server over 4 distinct sockets. (I could have 4 clients with one socket each, but that distinction shouldn't matter.) Now, the thing I'm emulating doesn't have multiple threads and gets an interrupt whenever a data packet is received over one of the "sockets". Is there some easy way to emulate this with Unix sockets?
I believe that I have to do the socket(), bind(), and listen() for all 4 sockets first, then do an accept() for all 4, and do fcntl(fd, F_SETFL, O_NONBLOCK) for each one to make them nonblocking, so that I can check each one for data with read() in a round-robin fashion. Is there any way to make it interrupt-driven or event-driven, so that my main loop only checks for data in the socket if there's data there? Or is it better to poll them all like this?
Yes. Handling multiple connections is almost synonymous with "server", and they are often single threaded -- but please not this way:
check each one for data with read() in a round-robin fashion
That would require, as you mention, non-blocking sockets and some kind of delay to prevent your "round-robin" from becoming a system killing busy loop.
A major problem with that is the granularity of the delay. You can't make it too small, or the loop will still hog too much CPU time when nothing is happening. But what about when something is happening, and that something is data incoming simultaneously on multiple connections? Now your delay can produce a snowballing backlog of unhandled work, leading to refused connections, etc.
It just is not feasible, and no one writes a server that way, although I am sure anyone would give it serious thought if they were unaware of the library functions intended to tackle the problem. Note that networking is a platform specific issue, so these are not actually part of the C standard (which does not deal with sockets at all).
The functions are select(), poll(), and epoll(); the last one is Linux-specific and the other two are POSIX. The basic idea is that the call blocks, waiting until one or more of any number of active connections is ready to read or write. Waiting for a socket to be ready to write only meaningfully applies to non-blocking (O_NONBLOCK) sockets. You don't have to use O_NONBLOCK, however, and the select() call blocks regardless. Using O_NONBLOCK on the individual sockets makes the implementation more complex, but increases performance potential in a single-threaded server -- this is the idea behind asynchronous servers (such as nginx), a paradigm which contrasts with the more traditional threaded synchronous model.
However, I would recommend that you not use O_NONBLOCK initially because of the added complexity. When/if it ends up being called for, you'll know. You still do not need threads.
There are many, many, many examples and tutorials around about how to use select() in particular.
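For orientation, here is a minimal single-threaded sketch of the select() pattern (the fd[] array of already-accepted sockets and its setup are assumed to exist elsewhere):

#include <sys/select.h>
#include <unistd.h>

void serve(int *fd, int nfds)
{
    for (;;) {
        fd_set readable;
        int maxfd = -1;
        FD_ZERO(&readable);
        for (int i = 0; i < nfds; i++) {
            FD_SET(fd[i], &readable);
            if (fd[i] > maxfd)
                maxfd = fd[i];
        }
        /* blocks until at least one socket has data - no busy loop */
        if (select(maxfd + 1, &readable, NULL, NULL, NULL) <= 0)
            continue;
        for (int i = 0; i < nfds; i++) {
            if (FD_ISSET(fd[i], &readable)) {
                char buf[4096];
                ssize_t n = read(fd[i], buf, sizeof buf);
                if (n > 0) { /* process buf[0..n) here */ }
            }
        }
    }
}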

How much to read from socket when using select

I'm using select() to listen for data on multiple sockets. When I'm notified that there is data available, how much should I read()?
I could loop over read() until there is no more data, process the data, and then return back to the select loop. However, I can imagine that a socket receives so much data so fast that it temporarily 'starves' the other sockets. Especially since I am thinking of using select also for inter-thread communication (message-passing style), I'd like to keep latency low. Is this an issue in reality?
The alternative would be to always read a fixed size of bytes, and then return to the loop. The downside here would be added overhead when there is more data available than fits into my buffer.
What's the best practice here?
Not sure how this is implemented on other platforms, but on Windows the ioctlsocket(FIONREAD) call tells you how many bytes can be read by a single call to recv(). More bytes could be in the socket's queue by the time you actually call recv(). The next call to select() will report the socket is still readable, though.
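On Linux and most Unixes the equivalent is ioctl() with FIONREAD, roughly like this (sockfd is assumed to be an open socket):

#include <sys/ioctl.h>

/* returns the bytes currently queued on sockfd, or -1 on error;
   more may of course arrive before the recv() actually happens */
int bytes_pending(int sockfd)
{
    int avail = 0;
    return (ioctl(sockfd, FIONREAD, &avail) == 0) ? avail : -1;
}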
The too-common approach here is to read everything that's pending on a given socket, especially if one moves to platform-specific advanced polling APIs like kqueue(2) and epoll(7) enabling edge-triggered events. But, you certainly don't have to! Flip a bit associated with that socket somewhere once you think you got enough data (but not everything), and do more recv(2)'s later, say at the very end of the file-descriptor checking loop, without calling select(2) again.
Beyond that, the question is too general. What are your goals? Low latency? High throughput? Scalability? There's no single answer to everything (well, except for 42 :)
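If fairness and latency are the goals, here is a sketch of the 'read a bounded amount and come back later' idea from above (the budget value and the consume() callback are assumptions for illustration, not a prescribed API):

#include <sys/types.h>
#include <sys/socket.h>

void consume(const char *data, ssize_t len);   /* hypothetical application callback */

/* read at most `budget` bytes from fd this pass; returns 1 if the
   budget was filled (more data probably remains, so revisit this fd
   before the next select()), 0 otherwise */
int drain_some(int fd, char *buf, size_t budget)
{
    ssize_t n = recv(fd, buf, budget, 0);
    if (n <= 0)
        return 0;   /* peer closed, error, or nothing left */
    consume(buf, n);
    return (size_t)n == budget;
}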

Are nonblocking I/O operations in Perl limited to one thread? Good design?

I am attempting to develop a service that contains numerous client and server sockets (a server service as well as clients that connect out to managed components and persist) that are synchronously polled through IO::Select. The idea was to handle the I/O and/or request processing needs that arise through pools of worker threads.
The shared keyword that makes data shareable across threads in Perl (threads::shared) has its limits--handle references are not among the primitives that can be made shared.
Before I figured out that handles and/or handle references cannot be shared, the plan was to have a select() thread that takes care of the polling, and then puts the relevant handles in certain ThreadQueues spread across a thread pool to actually do the reading and writing. (I was, of course, designing this so that modification to the actual descriptor sets used by select would be thread-safe and take place in one thread only--the same one that runs select(), and therefore never while it's running, obviously.)
That doesn't seem like it's going to happen now because the handles themselves can't be shared, so the polling as well as the reading and writing is all going to need to happen from one thread. Is there any workaround for this? I am referring to the decomposition of the actual system calls across threads; clearly, there are ways to use queues and buffers to have data produced in other threads and actually sent in others.
One problem that arises from this situation is that I have to give select() a timeout, and expect that it'll be high enough to not cause any issues with polling a rather large set of descriptors while low enough not to introduce too much latency into my timing event loop - although, I do understand that if there is actual I/O set membership detected in the polling process, select() will return early, which partly mitigates the problem. I'd rather have some way of waking select() up from another thread, but since handles can't be shared, I cannot easily think of a way of doing that nor see the value in doing so; what is the other thread going to know about when it's appropriate to wake select() anyway?
If no workaround, what is a good design pattern for this type of service in Perl? I have a requirement for a rather high amount of scalability and concurrent I/O, and for that reason went the nonblocking route rather than just spawning threads for each listening socket and/or client and/or server process, as many folks using higher-level languages these days are wont to do when dealing with sockets - it seems to be kind of a standard practice in Java land, and nobody seems to care about java.nio.* outside the narrow realm of systems-oriented programming. Maybe that's just my impression. Anyway, I don't want to do it that way.
So, from the point of view of an experienced Perl systems programmer, how should this stuff be organised? Monolithic I/O thread + pure worker (non-I/O) threads + lots of queues? Some sort of clever hack? Any thread safety gotchas to look out for beyond what I have already enumerated? Is there a Better Way? I have extensive experience architecting this sort of program in C, but not with Perl idioms or runtime characteristics.
EDIT: P.S. It has definitely occurred to me that perhaps a program with these performance requirements and this design should simply not be written in Perl. But I see an awful lot of very sophisticated services produced in Perl, so I am not sure about that.
Bracketing out your several, larger design questions, I can offer a few approaches to sharing filehandles across perl threads.
One may pass $client to a thread start routine or simply reference it in a new thread:
use threads;   # also exports async()

$client = $server_socket->accept();
# either:
threads->new(\&handle_client, $client);
# or:
async { handle_client($client) };
# $client will be closed only when all threads' references
# to it pass out of scope.
For a Thread::Queue design, one may enqueue() the underlying fd:
use threads;
use Thread::Queue;
use IO::Handle;
use POSIX ();

my $q = Thread::Queue->new;

$q->enqueue( POSIX::dup(fileno $client) );
# we dup(2) so that $client may safely go out of scope,
# closing its underlying fd but not the duplicate thereof
async {
    my $client = IO::Handle->new_from_fd( $q->dequeue, "r+" );
    handle_client($client);
};
Or one may just use fds exclusively, and the bit vector form of Perl's select.
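As for the 'wake select() up from another thread' part of the question, the classic answer is the self-pipe trick, sketched here in C (it translates directly to Perl via pipe(), syswrite() and IO::Select): include the read end of an internal pipe in the select() read set, and have any other thread write a byte to the write end whenever it wants the poller to wake up and rescan its descriptor set.

#include <unistd.h>
#include <sys/select.h>

int wakeup_pipe[2];   /* created once at startup with pipe(wakeup_pipe) */

/* poller thread: watches the real sockets plus the pipe's read end */
void poll_loop(int *fd, int nfds)
{
    for (;;) {
        fd_set rset;
        int maxfd = wakeup_pipe[0];
        FD_ZERO(&rset);
        FD_SET(wakeup_pipe[0], &rset);
        for (int i = 0; i < nfds; i++) {
            FD_SET(fd[i], &rset);
            if (fd[i] > maxfd)
                maxfd = fd[i];
        }
        if (select(maxfd + 1, &rset, NULL, NULL, NULL) <= 0)
            continue;
        if (FD_ISSET(wakeup_pipe[0], &rset)) {
            char c;
            read(wakeup_pipe[0], &c, 1);   /* drain the wakeup byte */
            /* ...apply pending changes to the descriptor set here... */
        }
        /* ...service whichever fd[i] entries are ready as usual... */
    }
}

/* any other thread calls this to interrupt a blocked select() */
void wake_poller(void) { write(wakeup_pipe[1], "x", 1); }

With this in place the select() timeout becomes unnecessary: the other thread doesn't need to know anything about the descriptor set, only that it has queued something the poller should look at.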