I am trying to use hardware to speed up scheduling and dispatching.
To figure out whether hardware can actually help, and by how much, I need to know exactly what is in the ready queue.
The OS literature only says that the scheduler fetches a process and puts it into the ready queue.
I have some knowledge about what makes up a process: its virtual address space, executable code, PID and so on.
But I can't connect these together. I don't think the scheduler stores all of this information in the ready queue every time.
So can somebody help? What exactly is stored in the ready queue? How many bytes of data, and what are they? If it is system-dependent, can you give me at least one example for one system?
Thanks
The ready queue stores the processes that can run as soon as the processor becomes available, i.e. the processes that are not waiting for an I/O operation or some other event to complete before they can execute.
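To make the "what is actually stored" part concrete: in the usual textbook design, the queue does not copy a process's code or address space; it only links together process control blocks (PCBs) that already exist elsewhere in kernel memory. Below is a minimal sketch of that idea. The `struct pcb` fields and the `rq_enqueue`/`rq_dequeue` names are hypothetical and heavily trimmed; they are not taken from any particular kernel.

```c
#include <stddef.h>

/* Hypothetical, heavily trimmed process control block (PCB). */
struct pcb {
    int pid;              /* process identifier */
    int priority;         /* scheduling priority */
    void *saved_context;  /* placeholder for saved registers, stack pointer, ... */
    struct pcb *next;     /* link used while the PCB sits on a queue */
};

/* The ready queue itself is only a head and a tail pointer. */
struct ready_queue {
    struct pcb *head;
    struct pcb *tail;
};

/* Append a runnable process in O(1); nothing is copied. */
void rq_enqueue(struct ready_queue *q, struct pcb *p) {
    p->next = NULL;
    if (q->tail)
        q->tail->next = p;
    else
        q->head = p;
    q->tail = p;
}

/* Remove and return the next process to dispatch, or NULL if the queue is empty. */
struct pcb *rq_dequeue(struct ready_queue *q) {
    struct pcb *p = q->head;
    if (p) {
        q->head = p->next;
        if (!q->head)
            q->tail = NULL;
        p->next = NULL;
    }
    return p;
}
```

Under these assumptions, the per-process cost of being on the ready queue is a single pointer inside the PCB, and the queue itself is two pointers.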
As far as hardware for faster scheduling and dispatching is concerned, I feel that increasing the main memory capacity can help substantially.
More main memory means fewer swap-ins and swap-outs of memory blocks between secondary and primary memory, and therefore less thrashing, which can improve performance considerably.
I'm unsure how Round Robin scheduling works with I/O Operations. I've learned that CPU bound processes are favoured by Round Robin scheduling, but what happens if a process finishes its time slice early?
Say we neglect the dispatching itself and a process finishes its time slice early. Will the scheduler schedule another process if it's CPU bound, or will the current process start its I/O operation and, since that isn't CPU bound, will the scheduler immediately switch to another (CPU bound) process afterwards? And if CPU bound processes are favoured, will the scheduler run ALL CPU bound processes until they are finished and only afterwards schedule the I/O processes?
Please help me understand.
There are two distinct schedulers: the CPU (process/thread ...) scheduler, and the I/O scheduler(s).
CPU schedulers typically employ some hybrid algorithms, because they certainly do regularly encounter both pre-emption and processes which voluntarily give up part of their time-slice. They must service higher-priority work quickly, while not "starving" anyone. (A study of the current Linux scheduler is most interesting. There have been several.)
CPU schedulers identify processes as being either "primarily 'I/O-bound'" or "primarily 'CPU-bound'" at this particular time, knowing that their characteristics can and do change. If your process repeatedly consumes full time slices, it is seen as CPU-bound.
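The difference between consuming a full time slice and giving it up early can be shown with a toy simulation. This is a sketch under strong assumptions, not how any real scheduler is written: I/O is assumed to complete instantly, a task that yields goes straight to the back of the queue, and the names `rr_simulate` and `MAX_TASKS` are made up for the example.

```c
#define MAX_TASKS 16

/*
 * Simulate plain round-robin.
 *   cpu_need[i]    total CPU time task i needs (must be > 0)
 *   burst[i]       CPU time task i uses before it voluntarily blocks for I/O
 *   quantum        length of a full time slice
 *   finish_time[i] receives the clock value at which task i completes
 * Returns the total elapsed time, or -1 on invalid input.
 */
int rr_simulate(const int *cpu_need, const int *burst, int n,
                int quantum, int *finish_time) {
    int remaining[MAX_TASKS];
    int left = 0;
    int clock = 0;

    if (n < 0 || n > MAX_TASKS || quantum <= 0)
        return -1;
    for (int i = 0; i < n; i++) {
        if (cpu_need[i] <= 0 || burst[i] <= 0)
            return -1;
        remaining[i] = cpu_need[i];
        left++;
    }

    for (int i = 0; left > 0; i = (i + 1) % n) {
        if (remaining[i] == 0)
            continue;                 /* task already finished */
        int run = quantum;            /* at most one full slice */
        if (burst[i] < run)
            run = burst[i];           /* task blocks early for I/O */
        if (remaining[i] < run)
            run = remaining[i];       /* task finishes early */
        clock += run;                 /* the next task starts immediately */
        remaining[i] -= run;
        if (remaining[i] == 0) {
            finish_time[i] = clock;
            left--;
        }
    }
    return clock;
}
```

With a quantum of 4, a CPU-bound task that needs 10 units and an I/O-bound task that needs 4 units in bursts of 1, the CPU-bound task finishes at time 12 and the I/O-bound one at time 14. The CPU is never left idle for the unused part of a slice, and the CPU-bound task still gets most of the time, which is the sense in which plain round-robin favours it.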
I/O schedulers seek to order and re-order the I/O request queues for maximum efficiency, for instance to keep the read/write head of a physical disk drive moving efficiently in a single direction. (The two components of disk-drive delay are "seek time" and "rotational latency," with "seek time" being by far the worse of the two. Per contra, solid-state drives have very different timing.) I/O schedulers also have to be aware of the channels (disk interface cards, cabling, etc.) that provide access to each device: they can't simply watch what any one drive is doing. As with the CPU scheduler, requests must be handled efficiently but never "starved." Linux's I/O schedulers are also readily available for your study.
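The "keep the head moving in a single direction" idea is the classic elevator ordering. The following is a minimal sketch of it, assuming requests are plain cylinder numbers and the head is currently moving upward; `elevator_order` and its helpers are hypothetical names, and this is not the code of any Linux I/O scheduler.

```c
#include <stdlib.h>

/* qsort comparator for ints, ascending. */
static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Reverse a[lo..hi) in place. */
static void reverse_range(int *a, size_t lo, size_t hi) {
    while (lo + 1 < hi) {
        int t = a[lo];
        a[lo] = a[hi - 1];
        a[hi - 1] = t;
        lo++;
        hi--;
    }
}

/*
 * Reorder reqs[0..n) in place into elevator order for a head at cylinder
 * `head` that is moving upward: first every request at or above the head
 * in ascending order, then the remaining ones in descending order.
 */
void elevator_order(int *reqs, size_t n, int head) {
    size_t below = 0;

    qsort(reqs, n, sizeof reqs[0], cmp_int);
    while (below < n && reqs[below] < head)
        below++;                          /* reqs[0..below) lie under the head */

    reverse_range(reqs, 0, n);            /* [upper desc | lower desc] */
    reverse_range(reqs, 0, n - below);    /* [upper asc  | lower desc] */
}
```

For a head at cylinder 53 and pending requests 98, 183, 37, 122, 14, 124, 65, 67, this yields the service order 65, 67, 98, 122, 124, 183, 37, 14: one sweep up and one sweep back instead of jumping across the disk in arrival order.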
"Pure round-robin," as a scheduling discipline, simply means that all requests have equal priority and will be serviced sequentially in the order that they were originally submitted. Very pretty birds though they are, you rarely encounter Pure Robins in real life.
I have a data mining app.
There is one Mining Actor, which receives and processes a JSON document containing 1000 objects. I put these into a list and, for each object, log the data by sending it to one Logger Actor, which writes the data into many files.
Processing the list sequentially, my app uses 700 MB and takes ~15 seconds at about 20% CPU usage (on a 4-core CPU). When I parallelize the list, my app uses 2 GB and takes roughly the same amount of time and CPU.
My questions are:
Since I parallelized the list and thus the computation, shouldn't the compute-time decrease?
I think having only one Logger Actor is a bottleneck in this case. The computation may be faster but the bottleneck hides the speed increase. So if I add more Loggers to the pool, the app time should decrease?
Why does the memory usage jump to 2GB? Does the JVM have to store the entire collection in memory to parallelize it? And after the computation is done, the JVM garbage collector should deal with it?
Without more details, any answer is a guess. However, even a guess might point you in the right direction.
Parallelized execution should decrease the running time but your problem might lie elsewhere. For some reason, your CPU is idling a lot even in the single-threaded mode. You do not specify whether you read the input from disk or the network or where you write your output to. You explicitly say that you write logs to a lot of files. Disk and network reading/writing might in your case take much longer than data processing. Most probably your process is idle due to this I/O waiting. You should not expect any speedups from parallelizing a job that spends 80% of its time waiting on I/O. I therefore also suspect that loggers are not the bottleneck here.
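A rough way to put a number on this is Amdahl's law, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the run time that parallelizes and n is the number of workers. The sketch below assumes, as a simplification, that the I/O wait does not shrink at all when work is parallelized; `amdahl_speedup` is a name made up for this example.

```c
/*
 * Amdahl's law: upper bound on speedup when only `parallel_fraction`
 * of the run time (0.0 to 1.0) can be spread across `workers`.
 */
double amdahl_speedup(double parallel_fraction, int workers) {
    if (workers < 1)
        workers = 1;
    if (parallel_fraction < 0.0)
        parallel_fraction = 0.0;
    if (parallel_fraction > 1.0)
        parallel_fraction = 1.0;
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / workers);
}
```

With p = 0.2 (only the 20% of compute parallelizes) and 4 cores, the bound is 1 / 0.85, roughly 1.18, so a 15-second run would still take close to 13 seconds at best. In practice I/O waits can sometimes overlap, so treat this as a pessimistic estimate rather than a measurement.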
The memory usage might jump if each of your threads allocates a lot of memory. In that case, the more threads you have, the more memory will be required. I don't know what kind of collection you are parallelizing on, but most are stored completely in memory. And yes, the garbage collector will reclaim the memory once the computation is done; only resources that you must release explicitly, such as open files, are not handled by it.
How many threads for reading and writing to the hard disk?
The memory increases because I send messages faster than the Logger can write, so the Mailbox balloons in size until the Logger has processed the messages and the GC kicks in.
I solved this by writing state to a protocol buffer file. Before doing any writes, I compare with the protobuf file because reads are significantly cheaper than writes. My resource usage is now 10% for 2 seconds, and less than 400MB RAM.
The textbooks say that the major advantage of CFS is that it is very fair in allocating CPU time to different processes. However, I am unable to see how CFS with its RB-tree achieves a better form of fairness than a simple Round Robin queue.
If we forget about CFS grouping and other features, which could also be incorporated somehow into a simple RR queue, can anybody tell me how CFS is fairer than RR?
Thanks in advance
I believe the key difference relates to the concept of "sleeper fairness".
With RR, each of the processes on the ready queue gets an equal share of CPU time, but what about the processes that are blocked/waiting for I/O? They may sit on the I/O queue for a long time, but they don't get any built-up credit for that once they get back into the ready queue.
With CFS, processes DO get credit for that waiting time, and will get more CPU time once they are no longer blocked. That helps reward more interactive processes (which tend to use more I/O) and promotes system responsiveness.
Here is a good detailed article about CFS, which mentions "sleeper fairness": https://developer.ibm.com/tutorials/l-completely-fair-scheduler/
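The mechanism behind this can be sketched in a few lines. CFS tracks a per-task "virtual runtime" that grows as the task uses the CPU (more slowly for higher-weight tasks) and always runs the task with the smallest value. The code below is a simplified illustration, not kernel code: the linear scan stands in for taking the leftmost node of the RB-tree, and the names `charge` and `pick_next` are hypothetical. The weight 1024 is the value Linux uses for a nice-0 task.

```c
#include <stddef.h>

#define NICE_0_WEIGHT 1024ULL  /* weight of a nice-0 task */

struct task {
    int id;
    unsigned long long weight;    /* larger weight = larger CPU share */
    unsigned long long vruntime;  /* weighted CPU time consumed, in ns */
};

/* Account delta_ns of real CPU time to a task that just ran. */
void charge(struct task *t, unsigned long long delta_ns) {
    t->vruntime += delta_ns * NICE_0_WEIGHT / t->weight;
}

/* Pick the runnable task with the smallest vruntime, or NULL if none. */
struct task *pick_next(struct task *tasks, size_t n) {
    struct task *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (!best || tasks[i].vruntime < best->vruntime)
            best = &tasks[i];
    }
    return best;
}
```

A task that was blocked on I/O was not charged while it waited, so when it wakes its vruntime is lower than that of tasks that kept running, and it is picked first. A plain RR queue has no such memory: the woken task simply joins the tail. As far as I know, the real scheduler also bounds how much credit a long sleeper can carry over, which this sketch leaves out.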
I'm looking for any concrete info on the number of background threads an NSOperationQueue will create when given the NSOperationQueueDefaultMaxConcurrentOperationCount maximum concurrency setting.
I had assumed that some sort of load monitoring is employed to determine the most appropriate number of threads to spawn, plus this setting is recommended in the docs. What I'm finding is that the queue spawns roughly 100 background threads and my app (running on iPad 3 with iOS 5.1.1) crashes with SIGABRT. I've reduced this to a more acceptable number like 3 and everything is working fine.
Any comments or insight would be appreciated.
My experience matches yours, though not to the extent of 100 threads; I've never seen it go quite that high, so do add some instrumentation to make sure that many are really running simultaneously. Unless you manually manage the number of concurrent operations, NSOperationQueue will tend to generate too many concurrent operations. (I have yet to see anyone refute this with testable code rather than inferences from the documentation.) For anything that may generate a large number of potentially concurrent operations, I recommend setMaxConcurrentOperationCount:. While not ideal, I often wind up using a function like this one to assist (it of course doesn't help you balance between queues, so it is very sub-optimal):
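#include <sys/types.h>
#include <sys/sysctl.h>

// Number of CPU cores reported by the kernel; falls back to 1 if the query fails.
unsigned int countOfCores(void) {
    unsigned int ncpu = 1;
    size_t len = sizeof(ncpu);
    if (sysctlbyname("hw.ncpu", &ncpu, &len, NULL, 0) != 0) {
        return 1;
    }
    return ncpu;
}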
I eagerly await anyone posting real code demonstrating NSOperationQueue automatically performing correct load balancing for CPU-bound operations. I've posted a sample gist demonstrating what I'm talking about. Without calling setMaxConcurrentOperationCount:, it will spawn about 6 parallel processes on a 2-core iPad 3. In this very simplistic case with no contention or shared resources, this adds about 10%-15% overhead. In more complicated code with contention (and particularly if operations might be cancelled), it can slow things down by an order of magnitude.
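If you want the same kind of cap outside of Darwin's sysctl interface, a variant can be sketched with sysconf. This assumes `_SC_NPROCESSORS_ONLN` is available; it is widely supported (including on macOS and Linux) but is not guaranteed by every POSIX system. The function name `workerLimit` and its `cap` parameter are made up for this example.

```c
#include <unistd.h>

/*
 * Suggested maximum number of concurrent CPU-bound operations:
 * the number of online cores, never less than 1 and, if cap >= 1,
 * never more than cap.
 */
long workerLimit(long cap) {
    long n = sysconf(_SC_NPROCESSORS_ONLN);  /* -1 if the query fails */
    if (n < 1)
        n = 1;
    if (cap >= 1 && n > cap)
        n = cap;
    return n;
}
```

The result could then be passed to the queue as its maximum concurrent operation count. Like the function above, it knows nothing about other queues or about what the rest of the system is doing.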
Assuming your threads are busy working, 100 active threads in one process on a dual-core iPad is unreasonable. Each thread consumes a good amount of time and memory, and having that many busy threads is going to slow things down on a dual-core device.
Regardless of whether you're doing something silly (like sleeping them all, adding run loops, or just giving them nothing to do), this would be a bug.
From the documentation:
The default maximum number of operations is determined dynamically by the NSOperationQueue object based on current system conditions.
The iPad 3 has a powerful processor and 1 GB of RAM. Since NSOperationQueue calculates the number of threads based on system conditions, it very likely determined that it could run a large number of NSOperations given the power available on that device. The crash might not be caused by the number of threads running simultaneously, but by the code being executed inside those threads. Check the backtrace and see whether some condition or resource is being shared among these threads.
Can a shared ready queue limit the scalability of a multiprocessor system?
Simply put, most definitely. Read on for some discussion.
Tuning a service is an art form, or else it requires benchmarking (and the space of things you would need to benchmark is huge). I believe that it depends on factors such as the following (this list is not exhaustive):
how much time an item picked up from the ready queue takes to process,
how many worker threads there are,
how many producers there are, and how often they produce,
what type of waiting you are using: spin-locks or kernel waits (the latter being slower).
So, if items are produced often, the number of threads is large, and the processing time is low, the data structure could be locked for a large share of the time, causing heavy contention.
Other factors may include the data structure used and how long it is locked for. For example, if you use a linked list to manage such a queue, the add and remove operations take constant time. A priority queue (heap) takes a few more operations on average when items are added.
If your system is for business processing, you could take this question out of the picture by either:
using a process-based architecture, spawning multiple producer and consumer processes and using the file system for communication, or
using a programming language with non-preemptive, cooperative threading, such as Stackless Python, Lua or Erlang.
Also note: synchronization primitives cause floods of inter-processor cache-coherence traffic, which are not good, and therefore should be used sparingly.
The discussion could go on to fill a Ph.D dissertation :D
A per-CPU ready queue is a natural choice for the data structure. This is because most operating systems try to keep a process on the same CPU, for many reasons that you can google for. What does that imply? If a thread is ready and another CPU is idling, the OS will not immediately migrate the thread to that CPU. Load balancing kicks in only in the long run.
Had the situation been different, that is, had it not been a design goal to preserve thread-CPU affinity and thread migration were frequent, then keeping separate per-CPU run queues would be costly.
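The per-CPU arrangement can be sketched as follows. This is a single-threaded illustration only: there is no locking, the queue sizes are fixed, and the `threshold` rule for when an idle CPU may pull work from another queue is a made-up stand-in for a real load balancer. All names here (`percpu_push`, `percpu_pop`, `pick_for_cpu`) are hypothetical.

```c
#define NCPU 2
#define QCAP 8

/* One FIFO run queue per CPU, holding task ids. */
struct runqueue {
    int tasks[QCAP];
    int count;
};

/* Add a task to a CPU's own queue; returns -1 if the queue is full. */
int percpu_push(struct runqueue *q, int tid) {
    if (q->count == QCAP)
        return -1;
    q->tasks[q->count++] = tid;
    return 0;
}

/* Remove the oldest task from a queue; returns -1 if it is empty. */
int percpu_pop(struct runqueue *q) {
    if (q->count == 0)
        return -1;
    int tid = q->tasks[0];
    for (int i = 1; i < q->count; i++)
        q->tasks[i - 1] = q->tasks[i];
    q->count--;
    return tid;
}

/*
 * Choose the next task for `cpu`. The local queue is always preferred,
 * which preserves affinity. Only if it is empty AND some other queue has
 * at least `threshold` waiting tasks is one task migrated.
 */
int pick_for_cpu(struct runqueue rqs[NCPU], int cpu, int threshold) {
    int tid = percpu_pop(&rqs[cpu]);
    if (tid >= 0)
        return tid;

    int busiest = -1;
    for (int i = 0; i < NCPU; i++) {
        if (i == cpu || rqs[i].count < threshold)
            continue;
        if (busiest < 0 || rqs[i].count > rqs[busiest].count)
            busiest = i;
    }
    return busiest < 0 ? -1 : percpu_pop(&rqs[busiest]);
}
```

In the common case each CPU touches only its own queue, so there is no shared structure to contend on. The cost shows up in the migration path, which has to look at other CPUs' queues; if migrations were frequent, that path would dominate and the separate queues would stop paying off.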