Number of threads with NSOperationQueueDefaultMaxConcurrentOperationCount - iphone

I'm looking for any concrete info on the number of background threads an NSOperationQueue will create given the NSOperationQueueDefaultMaxConcurrentOperationCount maximum concurrency setting.
I had assumed that some sort of load monitoring is employed to determine the most appropriate number of threads to spawn, plus this setting is recommended in the docs. What I'm finding is that the queue spawns roughly 100 background threads and my app (running on iPad 3 with iOS 5.1.1) crashes with SIGABRT. I've reduced this to a more acceptable number like 3 and everything is working fine.
Any comments or insight would be appreciated.

My experience matches yours (though not to 100 threads; do put in some instrumentation to make sure that you really have that many running simultaneously. I've never seen it go quite that high). Unless you manually manage the number of concurrent operations, NSOperationQueue will tend to generate too many concurrent operations. (I have yet to see anyone refute this with testable code rather than inferences from the documentation.) For anything that may generate a large number of potentially concurrent operations, I recommend setMaxConcurrentOperationCount:. While not ideal, I often wind up using a function like this one to assist (this of course doesn't help you balance between queues, so it is very sub-optimal):
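#include <sys/types.h>
#include <sys/sysctl.h>

// Returns the number of CPUs reported by the kernel (1 if the query fails).
unsigned int countOfCores(void) {
    unsigned int ncpu = 1;
    size_t len = sizeof(ncpu);
    if (sysctlbyname("hw.ncpu", &ncpu, &len, NULL, 0) != 0) {
        return 1;
    }
    return ncpu;
}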
I eagerly await anyone posting real code demonstrating NSOperationQueue automatically performing correct load balancing for CPU-bound operations. I've posted a sample gist demonstrating what I'm talking about. Without calling setMaxConcurrentOperationCount:, it will spawn about 6 parallel operations on a 2-core iPad 3. In this very simplistic case with no contention or shared resources, this adds about 10%-15% overhead. In more complicated code with contention (and particularly if operations might be cancelled), it can slow things down by an order of magnitude.
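As a rough illustration of the "cap it at the core count" approach described above, here is a minimal sketch in plain C. The names `online_cores` and `concurrency_cap` are hypothetical, and `sysconf(_SC_NPROCESSORS_ONLN)` is used only as a portable stand-in for the `hw.ncpu` query; it is a common extension on Linux and macOS rather than strict POSIX.

```c
#include <unistd.h>

/* Hypothetical helper: number of online cores, or 1 if the query fails.
   _SC_NPROCESSORS_ONLN is a widely available extension, not strict POSIX. */
long online_cores(void) {
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    return n > 0 ? n : 1;
}

/* Hypothetical policy for CPU-bound work: never run more operations at once
   than there are cores, never more than are pending, and never fewer than 1. */
long concurrency_cap(long pending_operations, long cores) {
    if (cores < 1) cores = 1;
    if (pending_operations < 1) return 1;
    return pending_operations < cores ? pending_operations : cores;
}
```

On iOS the resulting value would be what gets passed to setMaxConcurrentOperationCount:, for example `concurrency_cap(pending, online_cores())`. This only bounds a single queue; it does nothing to balance several queues against each other.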

Assuming your threads are busy working, 100 active threads in one process on a dual-core iPad is unreasonable. Each thread consumes a good amount of time and memory, and having that many busy threads is going to slow things down on a dual-core device.
Regardless of whether you're doing something silly (like sleeping them all, adding run loops, or just giving them nothing to do), this would be a bug.

From the documentation:
The default maximum number of operations is determined dynamically by the NSOperationQueue object based on current system conditions.
The iPad 3 has a powerful processor and 1 GB of RAM. Since NSOperationQueue calculates the number of threads based on system conditions, it's very likely that it decided it could run a large number of NSOperations given the power available on that device. The crash might not be caused by the number of threads running simultaneously, but by the code being executed inside those threads. Check the backtrace and see if there is some condition or resource being shared among these threads.

Related

What is the relation between threads and concurrency?

Concurrency means the ability to have more than one task in progress at a time.
But where does threading fit in it?
What's the relation between threading and concurrency?
What is the important link between these two which will fully clear all the confusion?
Threads are one way to achieve concurrency. Concurrency can be achieved at many levels and in many ways. Here are some of them from low to high level to give you a rough idea:
CPU pipelines. At the hardware level, multiple instructions are executed in parallel (each instruction is at a different stage in the pipeline).
Duplication of ALU and FPU units. A processor has several arithmetic-logic units and floating-point units that can execute instructions in parallel.
Vectorized instructions. Instructions that operate on multiple data items at once.
Hyperthreading/SMT. Duplication of the thread context (architectural state) within a core.
Threads. Streams of instructions that can be executed in parallel.
Processes. You run both a browser and a word processor on your system.
Tasks. A higher-level abstraction over threads and async work.
Multiple computers. Run your program on multiple computers.
I'm new here but I don't really understand the down votes? Could someone explain it to me? Is it just because this question has (likely) been answered or because it's considered obvious?
Now that that's out of the way...
Nothing executed on the CPU is a "process" as such. Everything is a thread, scheduled and managed entirely by the kernel, which uses a variety of algorithms to reach the expected performance for a given application. The CPU can only run n threads at the same instant, where n = cores * hardware threads per core. With hyperthreading there are usually 2 hardware threads per core, so the logical CPU count is double the core count: a 4-core CPU, for example, can run up to 8 threads at once instead of 4. The OS, however, may have hundreds of threads at any given time. How is that possible? The kernel uses a variety of checks, such as how frequently and for how long a thread sleeps, to assign it a priority. Whenever the CPU triggers a timer interrupt, the OS swaps out any thread that has used up its allotted time slice, whose length is based on the priority the OS has determined for it.
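To make the time-slicing idea concrete, here is a deliberately simplified sketch: plain round-robin on a single logical CPU, with no priorities, sleeping, or I/O. The function name `simulate_round_robin` and the "tick" unit are made up for illustration; a real kernel scheduler is far more involved.

```c
#include <stddef.h>

/* Simulates round-robin time slicing of n threads on ONE logical CPU.
   remaining[i] is the work left for thread i, in ticks (modified in place).
   Returns how many times the CPU switched from one thread to a different one. */
int simulate_round_robin(int *remaining, size_t n, int slice) {
    int switches = 0;
    long last = -1;   /* index of the thread that ran most recently */
    int active = 1;   /* did any thread run during the last pass? */

    if (slice <= 0) return 0;
    while (active) {
        active = 0;
        for (size_t i = 0; i < n; i++) {
            if (remaining[i] <= 0) continue;              /* thread finished */
            if (last != -1 && last != (long)i) switches++;
            int run = remaining[i] < slice ? remaining[i] : slice;
            remaining[i] -= run;                          /* run for one slice */
            last = (long)i;
            active = 1;
        }
    }
    return switches;
}
```

With two threads of 3 ticks each, a slice of 1 tick forces the CPU to alternate on every tick, while a slice of 3 ticks lets each thread finish in a single turn. Longer slices mean fewer switches, at the cost of each thread waiting longer for its turn.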

Can Multiprocessor CPUs avoid context-switching?

Today's computer architectures try to maximize the number of registers. It is faster to access a register (which is an integrated memory circuit near the CPU) than to access the first-level cache. The problem is that each context switch has to save all registers into cache, because the next thread needs other register values. What a modern CPU does is cycle through 100 tasks in one second, and every time it saves the registers and fetches the old ones before the task can be started.
IMHO it would be nice to use one CPU for one task, so that no context switching happens. That means we would get 100 CPUs, each with 1000 registers that never have to be saved. Is that possible, or have I ignored an important detail?
The only way to completely avoid context switching is by having at least as many cores as there are tasks. Generally, there is no guarantee regarding the maximum number of tasks that may run. Current GPUs and manycore processors and co-processors contain hundreds of small cores. If you put several of these in the same system or in a cluster of systems, you can have thousands of cores or more. Still, even if you could avoid context switching with such a design, these cores are much slower than traditional high-end CPU cores, so the net effect might be negative.
But let's take a step back here. The number of context switches is not primarily determined by the number of tasks and cores. Tasks don't just perform computations, they also need to interact with I/O devices and wait for things to happen such as results from other tasks or user input. So some tasks would be in a wait state. The overhead of context switching depends on not only the number of tasks but also the behavior of these tasks.
Both processor architects and OS developers are aware of context switching overhead and employ a variety of techniques to alleviate it. For example, x86 provides a number of instructions that are tuned to saving the context (partially) of the current task. The OS thread scheduler uses techniques such as priorities, preemption (with possibly large time slices on servers), and priority boosting. All of these help reduce the number of context switches and therefore their overall overhead. In addition, reducing the overhead of context switching is not the only thing that matters. In particular, the responsiveness of the system is very important as well, which is at odds with that overhead.

Can two processes simultaneously run on one CPU core?

Can two processes simultaneously run on one CPU core that has hyper-threading? I have searched the Internet, but I have not found a clear, straight answer.
Edit:
Thanks for the discussion and sharing! My purpose in posting my question here is not to discuss parallel computing; that would be too big a topic to discuss here. I just want to know if a multithreaded application can benefit more from hyper-threading than a multi-process application. After further reading, I have the following learning notes.
1) A Hyper-Threading Technology enabled CPU core has two sets of CPU state and interrupt logic. Meanwhile, it has only one set of execution units and cache. (I have not studied what a pipeline is yet.)
2) Multithreading benefits from hyper-threading only if latency occurs in some executing thread. I think this point maps exactly to the common reason why and when software programmers use multiple threads. If the multithreaded application has already been optimized, it may not gain any benefit from hyper-threading.
3) If the CPU state maps to process state, I believe Marc is correct that a multi-process application can benefit even more from hyper-threading technology.
4) When a CPU vendor says "thread", it looks like their "thread" is different from the thread that I know as a Java programmer?
No, a hyperthreaded CPU core still only has a single execution pipeline. Even though it appears as two CPUs to the overlying OS, there's still only ever one instruction being executed at any given time.
Hyperthreading was intended to allow the CPU to continue executing one thread while another thread was stalled waiting for a resource or other operation to complete, without leaving too many stages of the pipeline empty and useless. This goes back to the Pentium 4 days, with its absurdly long pipeline - a stall was essentially catastrophic for efficiency and throughput, and hyperthreading allowed Intel to keep the cpu busy doing other things while it cleaned up from the stall.
While Marc B's answer is pretty much the definitive summary of how HT works, I just want to make a small contribution by linking this article, which should clear up a lot of things about HT: http://software.intel.com/en-us/articles/performance-insights-to-intel-hyper-threading-technology/
Short answer, yes.
A single-core CPU (a processor) can run 2 or more threads simultaneously. These threads may belong to one program, or they may belong to different programs and thus different processes. This type of multithreading is called Simultaneous MultiThreading (SMT).
The claim that a CPU core can execute only one instruction at any given time is also not true. Modern CPUs exploit Instruction Level Parallelism (ILP) by duplicating pipeline resources (e.g. 2 ALUs instead of 1). This type of pipeline is called a "superscalar" pipeline.
See the Wikipedia page on simultaneous multithreading.

Determine maximum number of threads that run on different Windows systems

Can anyone tell me if there is a way to find out the maximum number of threads that can run on different Windows systems?
For example (assumption): a 32-bit Windows system can run a maximum of 4000 threads.
I doubt there is a maximum number. Well, since we're using a finite amount of memory, it would be as many threads as you can fit into memory or as many as you can keep track of. Each system is different and I know Java and C don't have a function to provide this. C# can tell you how much memory a specific object/app needs so you could go calculate the estimate.
You could test this on your system. Write a sample app which spawns threads and see when you run out of memory. Use a counter to count them. This will give you roughly the range for your system.
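The experiment suggested above can be sketched as follows. This is an assumption-laden illustration rather than a Windows program: it uses POSIX threads (on Windows the equivalent would be built on the Win32 thread API), the name `count_spawnable_threads` is hypothetical, and the caller supplies an upper bound so the probe cannot exhaust the machine.

```c
#include <pthread.h>
#include <stdlib.h>

/* Shared gate: workers block here until the probe is finished,
   so every created thread is alive at the same time. */
static pthread_mutex_t gate_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t gate_cond = PTHREAD_COND_INITIALIZER;
static int gate_open = 0;

static void *wait_at_gate(void *unused) {
    (void)unused;
    pthread_mutex_lock(&gate_lock);
    while (!gate_open) pthread_cond_wait(&gate_cond, &gate_lock);
    pthread_mutex_unlock(&gate_lock);
    return NULL;
}

/* Creates up to `limit` simultaneously live threads and returns how many
   were created before pthread_create failed (or the limit was reached). */
size_t count_spawnable_threads(size_t limit) {
    pthread_t *threads = malloc(limit * sizeof *threads);
    size_t created = 0;
    if (threads == NULL) return 0;

    pthread_mutex_lock(&gate_lock);
    gate_open = 0;                       /* close the gate for this run */
    pthread_mutex_unlock(&gate_lock);

    while (created < limit &&
           pthread_create(&threads[created], NULL, wait_at_gate, NULL) == 0) {
        created++;                       /* the counter the answer refers to */
    }

    /* Release every worker, then reap them all. */
    pthread_mutex_lock(&gate_lock);
    gate_open = 1;
    pthread_cond_broadcast(&gate_cond);
    pthread_mutex_unlock(&gate_lock);
    for (size_t i = 0; i < created; i++) pthread_join(threads[i], NULL);

    free(threads);
    return created;
}
```

Raising `limit` step by step until the returned count stops growing gives a rough range for one particular system and configuration (stack size, address space, resource limits). It says nothing about how many threads can run usefully.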
In Java, you can use an ExecutorService with a thread pool. Depending on which executor service you use, it can keep spawning threads if you submit more jobs.
A similar technique exists in C#.
A better question is what the maximum number of threads to spawn and avoid thrashing is.
Are you trying to take over the OS and do your own process/thread management? You should not be doing this.

Can a shared ready queue limit the scalability of a multiprocessor system?

Can a shared ready queue limit the scalability of a multiprocessor system?
Simply put, most definitely. Read on for some discussion.
Tuning a service is an art form or requires benchmarking (and the space of things you need to benchmark is huge). I believe that it depends on factors such as the following (this is not exhaustive).
how much time an item picked up from the ready queue takes to process,
how many worker threads are there?
how many producers are there, and how often do they produce?
what type of wait concepts are you using? spin-locks or kernel-waits (the latter being slower)?
So, if items are produced often, the number of threads is large, and the processing time is low, the data structure could be locked for large windows of time, thus causing thrashing.
Other factors may include the data structure used and how long the data structure is locked for. For example, if you use a linked list to manage such a queue, the add and remove operations take constant time. A priority queue (heap) takes a few more operations on average when items are added.
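As a sketch of the linked-list case mentioned above, here is a minimal shared ready queue guarded by one mutex. The type and function names (`ready_queue`, `rq_init`, `rq_push`, `rq_pop`) are hypothetical, and POSIX mutexes are assumed. Both operations are constant time, so the lock is held only briefly, but every producer and every worker still has to go through that one lock.

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct node { int item; struct node *next; } node;

typedef struct {
    node *head, *tail;
    pthread_mutex_t lock;   /* the single lock all producers and workers contend for */
} ready_queue;

void rq_init(ready_queue *q) {
    q->head = q->tail = NULL;
    pthread_mutex_init(&q->lock, NULL);
}

/* O(1) append at the tail. Returns 0 on success, -1 if allocation fails. */
int rq_push(ready_queue *q, int item) {
    node *n = malloc(sizeof *n);   /* allocate outside the lock */
    if (n == NULL) return -1;
    n->item = item;
    n->next = NULL;
    pthread_mutex_lock(&q->lock);
    if (q->tail) q->tail->next = n; else q->head = n;
    q->tail = n;
    pthread_mutex_unlock(&q->lock);
    return 0;
}

/* O(1) removal from the head. Returns 1 and stores the item, or 0 if empty. */
int rq_pop(ready_queue *q, int *item) {
    pthread_mutex_lock(&q->lock);
    node *n = q->head;
    if (n) {
        q->head = n->next;
        if (q->head == NULL) q->tail = NULL;
    }
    pthread_mutex_unlock(&q->lock);
    if (n == NULL) return 0;
    *item = n->item;
    free(n);                       /* free outside the lock */
    return 1;
}
```

Keeping allocation and deallocation outside the critical section shortens the time the lock is held, which is the quantity that matters once many processors hit the same queue.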
If your system is for business processing you could take this question out of the picture by just using:
A process-based architecture: just spawn multiple producer and consumer processes and use the file system for communication,
A non-preemptive, cooperative threading programming language such as Stackless Python, Lua or Erlang.
Also note: synchronization primitives cause floods of inter-processor cache-coherence traffic, which are not good, and therefore should be used sparingly.
The discussion could go on to fill a Ph.D dissertation :D
A per-CPU ready queue is a natural choice for the data structure. This is because most operating systems try to keep a process on the same CPU, for many reasons that you can google. What does that imply? If a thread is ready and another CPU is idling, the OS will not quickly migrate the thread to that CPU. Load balancing kicks in only in the long run.
Had the situation been different, that is, had keeping thread-CPU affinity not been a design goal and thread migration been frequent, then keeping separate per-CPU run queues would be costly.