Operating system computations for throughput - operating-system

Can i get help with this? I can't seem to understand the question
"In this problem you are to compare reading a file using a single-threaded file server with a multi- threaded file server. It takes 16 msec to get a request for work, dispatch it, and do the rest of the necessary processing, assuming the data are in the block cache. If a disk operation is needed (assume a spinning disk drive with 1 head), as is the case one-fourth of the time, an additional 32 msec is required."

Can i get help with this?
I don't think so (I don't think there's enough information in the question for anyone to be able to understand it).
Example 1
The file server is single-threaded and handles asynchronous requests, and the "16 msec" is primarily "request delivery latency" (time between a process sending a request and the file server receiving the request). A process sends a single request asking to read from 1000 files, the file server receives this request, "immediately" sends back 750 replies (for file data that was cached) and sends a single request asking something (file system code, disk driver?) to fetch the remaining 250 things; then file server "immediately" waits for more requests while waiting for the reply from something (file system code, disk driver?) to complete the early 250 things. In this case you can say that throughput for single-threaded file server is virtually infinite (e.g. infinite throughput for "file data cache hit", which is the only thing that matters because you can make more requests while waiting for slow disk IO).
Example 2
The file server has 8 threads and handles synchronous requests. A single-threaded process sends 1 request (to read from 1 file) and then has to wait for the reply, the request is given to one of the file server's threads (doesn't matter which) and that thread takes an average of "16 + 32*0.25 = 24 msec" for the request to be handled before the process can make it's next request; and the process does this in a loop because it wants to read 1000 files. In this case throughput is "1/0.024 = 41.66 requests per second", which is extremely bad (primarily because the single-threaded process can't send requests fast enough to keep all threads of the multi-threaded server busy).
Example 3
The file server has 8 threads and handles synchronous requests. A process with 1000 threads send 1 request (to read from 1 file) from each of its threads. In this case we need to know how many CPUs there are (and how scheduler works) to determine anything about throughput. E.g. if there's only 2 CPUs then you're not going to get 8 file server threads running in parallel at the same time.

Related

Performance issue sending larger messages via akka remote actors

The responses to concurrent requests to remote actors were taking long time to respond, aka 1 request takes 300 ms, but 100 concurrent requests took almost 30 seconds to complete! So it almost looks like the requests are being executed sequentially! The request size is small, but response size was about 120 kB in JVM before serialization. But the response had deep nested case class.
The response times are similar when running on two different JVMs on same machine as well. But responses are fast when in same JVM (i.e. local actors). It is a single client making concurrent requests to one remote actor.
I see this log in akka debug logs. What does this indicate?
DEBUG test-app akka.remote.EndpointWriter - Drained buffer with
maxWriteCount: 50, fullBackoffCount: 546, smallBackoffCount: 2,
noBackoffCount: 1 , adaptiveBackoff: 2000
The logs show that write to send-buffer failed. This could indicate that
send-buffer is too small
receive-buffer on the remote actor's side is too small
network issues
The send buffer size and receive buffer size directly limits the number of concurrent requests and responses! Increase the send buffer and receive buffer sizes, on both client and server, to support the required concurrency in both client and server.
If the buffer size is not adequate, netty will wait for the buffer to be cleared before attempting to rewrite to the buffer. And by default there will be a backoff time too, and this can be configured as well.
The settings are under remote.netty.tcp:
akka {
remote {
netty.tcp {
# Sets the send buffer size of the Sockets,
# set to 0b for platform default
send-buffer-size = 1024000b
# Sets the receive buffer size of the Sockets,
# set to 0b for platform default
receive-buffer-size = 2048000b
}
# Controls the backoff interval after a refused write is reattempted.
# (Transports may refuse writes if their internal buffer is full)
backoff-interval = 1 ms
}
}
For full configuration see Akka reference config.

Why is UDP packet reception seemingly optimized mid-execution?

I'm running a client server configuration over Ethernet and measuring packet latency at both ends. The client (windows) is sending packets every 5 ms (confirmed with wire shark) as it should. Yet, the server (embedded linux) only receives packets at 5 ms intervals for a few seconds, at which point it stops for 300 ms. After this break the latency is only 20 us. After another period of about a few seconds it takes another break for 300 ms. This repeats indefinitely (300ms break, 20 us packet latency burst). It seems as if the server program is being optimized mid-execution to read IO in shorter bursts. Why is this happening?
Disclaimer: I haven't posted the code, as the client and server are small subsets of more complex applications, however, I am willing to factor it out if an obvious answer doesn't present itself.
This is UDP so there is no handshake or any flow control mechanism. Those 300 ms must be because of work the server is doing in the processing of the UDP messages received. During those 300 ms the server has surely lost around 60 messages that were not read from client.
You might probably want to check the server does not take more than 5 ms in processing each message if it uses one thread to process. If the server uses multi-threading to process the messages and the processing takes some time, even if it takes 1 ms, you might be in a situation where at some point all threads are competing for resources and they don't finish in time to read the next message. For the problem you are describing I would bet the server is multithreaded and you have that problem. I cannot assure that 100% for lack of info though. But in any case, you want to check the time it takes to process messages because you might be dealing with real-time requirements.
I spaced out the measurements to 1 in every 1000 packets and now it is behaving itself. I was using printf every 5ms which must've eventually filled the printf tx queue entirely. This then delayed the execution for 300ms. Once printf caught its breath, the program had a queue full of incoming packets and thus was seemingly receiving packets every 20 us.

Play Framework: File uploads - blocking or non-blocking?

Given this example code from Play documentation:
def upload = Action(parse.temporaryFile) { request =>
request.body.moveTo(new File("/tmp/picture/uploaded"))
Ok("File uploaded")
}
How 100 simultaneous slow upload requests will be handled (number of threads)?
Will be uploaded file buffered in memory or streamed directly to disk?
How 100 simultaneous slow upload requests will be handled (number of threads)?
It depends. The number of actual threads being used isn't really relevant. By default, Play uses a number of threads equal to the number of CPU cores available. But this doesn't mean that if you have 4 cores, you're limited to 4 concurrent processes at once. HTTP requests in Play are processed asynchronously in a special internal ExecutionContext provisioned by Akka. Processes running in an ExecutionContext can share threads, so long as they are non-blocking--which is abstracted away by Akka. All of this can be configured in different ways. See Understanding Play Thread Pools.
The Iteratee that consumes the client data must do some blocking in order to write the file chunks to disk, but done in small (and fast) enough chunks, this shouldn't cause other file uploads to become blocked.
What I would be more worried about is the amount of disk I/O your server can handle. 100 slow uploads might be okay, but you can't really say without benchmarking. At some point you will run into trouble when the client input exceeds the rate that your server can write to disk. This will also not work in a distributed environment. I almost always choose to bypass the Play server entirely and direct uploads to Amazon S3.
Will be uploaded file buffered in memory or streamed directly to disk?
All temporary files are streamed to disk. Under the hood, all data sent from the client to the server is read using the iteratee library asynchronously. For multipart uploads, it is no different. The client data is consumed by an Iteratee, which streams the file chunks to a temporary file on disk. So when using the parse.temporaryFile BodyParser, request.body is just a handle to a temporary file on disk, and not the file stored in memory.
It is worth noting that while Play can handle those requests in a non-blocking manner, moving the file once complete will block. That is, request.body.moveTo(...) will block the controller function until the move is complete. This means that if several of the 100 uploads complete at about the same time, Play's internal ExecutionContext for handling requests can quickly become overloaded. The underlying API of moveTo is also deprecated in Play 2.3, as it uses FileInputStream and FileOutputStream to copy the TemporaryFile to a permanent location. The docs advise you to use the Java 7 File API, instead, as it is much more efficient.
This may be a little crude, but something more like this should do it:
import java.io.File
import java.nio.file.Files
def upload = Action(parse.temporaryFile) { request =>
Files.copy(request.body.file.toPath, new File("/tmp/picture/uploaded").toPath)
Ok("File uploaded")
}

Are socket operations performed in parallel at the system level?

If I have a quad core machine and 4 open and active sockets is it faster to have 4 threads to read from these sockets in parallel or just have one thread that reads the sockets iteratively? Can the socket read operation be performed in parallel or is it synchronized somewhere at the system level?
Analogously, if I have 4 sockets to which I need to write some data, is it faster to have 4 threads that write to these sockets at the same time, or is it better to have one thread that does all the writing?
It seems that multiple cores should enable higher throughput (of course if the network bandwidth is not the bottleneck), however I wonder if this is the case, since after all, in the end there is only a single ethernet cord (or other medium), so at some time, all these writes will have to be synchronized anyway, and the packets will be sent sequentially...
Specifically, if I have a machine running n threads which generate requests that need to be sent to another machine is it better to open n sockets and let each thread write to it? Or is it better to synchronize all the threads and send all the data through a single socket?
And on the receiving side is it better to have a single socket to read from? Or is it better to have n sockets read in parallel? What if all the readers need to be synchronized after reading anyway?
Socket operations are largely asynchronous to the application; buffered; and, if you're lucky, concurrent inside the kernel: the only significant state to synchronize on is per-connection. However the rate-determining step is really the bandwidth of the network, which you can easily saturate with even just one thread. Web browsers typically open between 4 and 8 threads per Web page to fetch the HTML itself and the images etc.

Implement a good performing "to-send" queue with TCP

In order not to flood the remote endpoint my server app will have to implement a "to-send" queue of packets I wish to send.
I use Windows Winsock, I/O Completion Ports.
So, I know that when my code calls "socket->send(.....)" my custom "send()" function will check to see if a data is already "on the wire" (towards that socket).
If a data is indeed on the wire it will simply queue the data to be sent later.
If no data is on the wire it will call WSASend() to really send the data.
So far everything is nice.
Now, the size of the data I'm going to send is unpredictable, so I break it into smaller chunks (say 64 bytes) in order not to waste memory for small packets, and queue/send these small chunks.
When a "write-done" completion status is given by IOCP regarding the packet I've sent, I send the next packet in the queue.
That's the problem; The speed is awfully low.
I'm actually getting, and it's on a local connection (127.0.0.1) speeds like 200kb/s.
So, I know I'll have to call WSASend() with seveal chunks (array of WSABUF objects), and that will give much better performance, but, how much will I send at once?
Is there a recommended size of bytes? I'm sure the answer is specific to my needs, yet I'm also sure there is some "general" point to start with.
Is there any other, better, way to do this?
Of course you only need to resort to providing your own queue if you are trying to send data faster than the peer can process it (either due to link speed or the speed that the peer can read and process the data). Then you only need to resort to your own data queue if you want to control the amount of system resources being used. If you only have a few connections then it is likely that this is all unnecessary, if you have 1000s then it's something that you need to be concerned about. The main thing to realise here is that if you use ANY of the asynchronous network send APIs on Windows, managed or unmanaged, then you are handing control over the lifetime of your send buffers to the receiving application and the network. See here for more details.
And once you have decided that you DO need to bother with this you then don't always need to bother, if the peer can process the data faster than you can produce it then there's no need to slow things down by queuing on the sender. You'll see that you need to queue data because your write completions will begin to take longer as the overlapped writes that you issue cannot complete due to the TCP stack being unable to send any more data due to flow control issues (see http://www.tcpipguide.com/free/t_TCPWindowSizeAdjustmentandFlowControl.htm). At this point you are potentially using an unconstrained amount of limited system resources (both non-paged pool memory and the number of memory pages that can be locked are limited and (as far as I know) both are used by pending socket writes)...
Anyway, enough of that... I assume you already have achieved good throughput before you added your send queue? To achieve maximum performance you probably need to set the TCP window size to something larger than the default (see http://msdn.microsoft.com/en-us/library/ms819736.aspx) and post multiple overlapped writes on the connection.
Assuming you already HAVE good throughput then you need to allow a number of pending overlapped writes before you start queuing, this maximises the amount of data that is ready to be sent. Once you have your magic number of pending writes outstanding you can start to queue the data and then send it based on subsequent completions. Of course, as soon as you have ANY data queued all further data must be queued. Make the number configurable and profile to see what works best as a trade off between speed and resources used (i.e. number of concurrent connections that you can maintain).
I tend to queue the whole data buffer that is due to be sent as a single entry in a queue of data buffers, since you're using IOCP it's likely that these data buffers are already reference counted to make it easy to release then when the completions occur and not before and so the queuing process is made simpler as you simply hold a reference to the send buffer whilst the data is in the queue and release it once you've issued a send.
Personally I wouldn't optimise by using scatter/gather writes with multiple WSABUFs until you have the base working and you know that doing so actually improves performance, I doubt that it will if you have enough data already pending; but as always, measure and you will know.
64 bytes is too small.
You may have already seen this but I wrote about the subject here: http://www.lenholgate.com/blog/2008/03/bug-in-timer-queue-code.html though it's possibly too vague for you.