Concurrency means the ability to have more than one task in progress at a time.
But where does threading fit in?
What's the relation between threading and concurrency?
What is the key link between the two that would clear up the confusion?
Threads are one way to achieve concurrency. Concurrency can be achieved at many levels and in many ways. Here are some of them from low to high level to give you a rough idea:
CPU pipelines. At the hardware level, multiple instructions are in flight at once, each at a different stage of the pipeline.
Duplication of ALU and FPU units. A processor has several arithmetic-logic units and floating-point units that can execute instructions in parallel.
Vectorized instructions. A single instruction operates on multiple data elements.
Hyperthreading/SMT. Duplication of the per-thread execution state (such as registers) within one core.
Threads. Streams of instructions which can be executed in parallel.
Processes. You run both a browser and a word processor on your system.
Tasks. A higher-level abstraction over threads and async work.
Multiple computers. Run your program on multiple computers.
I'm new here but I don't really understand the down votes? Could someone explain it to me? Is it just because this question has (likely) been answered or because it's considered obvious?
Now that that's out of the way...
Nothing being executed on the CPU is a "process" as such. What runs are threads, scheduled and managed entirely by the kernel, which uses a variety of algorithms to reach the expected performance for any given application. The CPU can only run n threads at the same instant, where n equals (cores * hardware threads per core). In most cases hyperthreading gives 2 hardware threads per core, so the logical CPU count is double the core count: instead of 4 threads (for example) running at once, the CPU can support up to 8. Now the OS may have hundreds of threads at any given time, so how is that possible? The kernel uses a variety of checks, such as how frequently and for how long a thread sleeps, to assign it a priority. Whenever the CPU triggers a timer interrupt, the OS swaps out threads that have used up their allotted time slice, which is based on the priority the OS has determined for them.
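The arithmetic above can be checked with a small Dart sketch. logicalCpus is a hypothetical helper introduced only for illustration, and the 4-core, 2-way figures are the example values from the answer. Platform.numberOfProcessors (from dart:io) reports the processor count as the OS sees it, which on SMT machines is typically the logical count rather than the physical core count.

```dart
import 'dart:io';

/// Hypothetical helper: how many threads the hardware can run at the
/// same instant, given cores and hardware threads per core.
int logicalCpus(int cores, int hardwareThreadsPerCore) =>
    cores * hardwareThreadsPerCore;

void main() {
  // Example from the text: 4 cores with 2 hardware threads each.
  print(logicalCpus(4, 2)); // 8

  // What the OS reports on this machine (the value varies by system).
  print(Platform.numberOfProcessors);
}
```

Any threads beyond that count are not running at that instant; they are waiting for the scheduler to give them a time slice.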
How isolates are distributed across CPU cores
In Dart, you can run multiple isolates at the same time, and I haven't been able to find a guideline or best practice for using isolates.
My question is how overall CPU usage and performance are affected by the number of isolates running at the same time, and whether it is better to use a small number of isolates (or even just one).
One isolate per one thread
One isolate takes one platform thread - you can observe the threads created for each isolate in the Call Stack pane of VS Code when debugging a Dart/Flutter app with multiple isolates. If the workload of interest allows parallelism, you can get great performance gains via isolates.
Note that Dart explicitly abstracts away the implementation details, and the docs avoid the specifics of how isolates are scheduled and their internals.
Number of isolates ≈ number of CPU cores
As a rule of thumb for determining the number of isolates/threads, you can take the number of cores as the initial value. You can import 'dart:io'; and use the Platform.numberOfProcessors property to determine the number of cores. Fine-tuning requires experimentation to see which number makes more sense, though. There are many factors that can influence the optimal number of threads:
Presence of Simultaneous MultiThreading (SMT) in CPU, such as Intel HyperThreading
Instruction level parallelism (ILP) and specific machine code produced for your code
CPU architecture
Mobile/smartphone scenarios vs desktop - e.g. desktop Intel CPUs have traditionally had identical cores and less tendency to throttle, whereas smartphones have efficiency and high-performance cores and are prone to throttling, so creating a myriad of threads can lead to worse results due to the OS slowing down your code.
E.g. for one of my Flutter apps which uses multiple isolates to parallelize file processing I empirically came to the following piece of code determining the number of isolates to be created:
var numberOfIsolates = max(Platform.numberOfProcessors - 2, 2); // needs dart:io and dart:math
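To see what this heuristic yields on different machines, here is the same formula wrapped in a function. The name isolateCountFor and the sample processor counts are made up for illustration; the formula itself is the one from the line above.

```dart
import 'dart:io';
import 'dart:math';

/// Leaves two processors free for the rest of the system, but never
/// goes below two isolates.
int isolateCountFor(int processors) => max(processors - 2, 2);

void main() {
  // Hypothetical processor counts, just to show the shape of the rule.
  for (final n in [1, 2, 4, 8, 12]) {
    print('$n processors -> ${isolateCountFor(n)} isolates');
  }
  // The value for the machine this runs on.
  print('this machine: ${isolateCountFor(Platform.numberOfProcessors)}');
}
```

Note that with 4 or fewer processors the result is always 2, so the heuristic only starts scaling from 5 processors upwards.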
Isolate is not a thread
The model offered by isolates is far more restrictive than the standard threaded model.
Isolates do not share memory, whereas threads can read each other's variables. There are technical exceptions: e.g. since around Flutter 2.5.0 isolates use one heap, and immutable objects such as strings can be shared across isolates. These are implementation details, though, and don't change the concept.
Isolates communicate only via messages, whereas threads have numerous synchronization primitives (critical sections, locks, semaphores, mutexes etc.).
The clear tradeoff is that Isolates are not prone to multi-threaded programming horrors (tricky bugs, debugging, development complexity) yet provide fewer capabilities for implementing parallelism.
In Dart/Flutter there are only 2 ways to work with isolates:
Low level, Dart style - using the Isolate class to spawn individual isolates and set up send/receive ports for messaging and code entry points.
Higher level, the compute helper function in Flutter - it takes input params, creates a new isolate with a defined entry point, processes the inputs and provides a single result. It follows a request-response pattern: there is no back-and-forth communication, streams of events etc.
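A minimal sketch of the low-level style, using only dart:isolate. The names sumWorker and sumInIsolate are hypothetical, and the work (summing 1..n) is a stand-in for something expensive. It shows the pieces mentioned above: an entry point, a receive port on the spawning side, and a send port handed to the new isolate for the reply.

```dart
import 'dart:isolate';

/// Entry point that runs inside the new isolate. It gets the reply port
/// and its input in one message, because nothing else is shared.
void sumWorker(List<Object> args) {
  final replyTo = args[0] as SendPort;
  final upTo = args[1] as int;
  var total = 0;
  for (var i = 1; i <= upTo; i++) {
    total += i;
  }
  replyTo.send(total); // the only way to hand the result back
}

/// Spawns one isolate, waits for its single reply, and returns it.
Future<int> sumInIsolate(int upTo) async {
  final receivePort = ReceivePort();
  await Isolate.spawn(sumWorker, [receivePort.sendPort, upTo]);
  // `first` completes with the first message and then closes the port.
  return await receivePort.first as int;
}

Future<void> main() async {
  print(await sumInIsolate(100)); // 5050
}
```

This is the same request-response shape that compute wraps for you in Flutter; keeping the port open and listening to it as a stream is what you would do for back-and-forth communication.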
Note that the Dart/Flutter SDK has no parallelism APIs such as the Task Parallel Library (TPL) in .NET, which provides multi-core optimized APIs to process data on multiple threads, e.g. sorting a collection in parallel. A huge number of algorithms can benefit from parallelism using threads but are not feasible with the isolate model, where there's no shared memory. Also there's no isolate pool, i.e. a set of isolates up and running and waiting for incoming tasks (I had to create one myself: https://pub.dev/packages/isolate_pool_2).
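Work that splits into independent chunks still maps onto isolates, at the cost of copying each chunk. The sketch below is not a pool and is not the isolate_pool_2 API: it simply starts one short-lived isolate per chunk. It assumes Dart 2.19 or later for Isolate.run, and sumChunk, sumInNewIsolate and parallelSum are hypothetical names.

```dart
import 'dart:isolate';
import 'dart:math';

/// Runs inside each isolate, on that isolate's own copy of the chunk.
int sumChunk(List<int> chunk) => chunk.fold(0, (a, b) => a + b);

/// Kept as a separate function so the closure passed to Isolate.run
/// captures nothing but `chunk`.
Future<int> sumInNewIsolate(List<int> chunk) =>
    Isolate.run(() => sumChunk(chunk));

/// Splits [data] into at most [workers] chunks and sums them in parallel.
Future<int> parallelSum(List<int> data, int workers) async {
  if (data.isEmpty) return 0;
  final chunkSize = (data.length / workers).ceil();
  final futures = <Future<int>>[];
  for (var start = 0; start < data.length; start += chunkSize) {
    final end = min(start + chunkSize, data.length);
    futures.add(sumInNewIsolate(data.sublist(start, end)));
  }
  final partials = await Future.wait(futures);
  return partials.fold<int>(0, (a, b) => a + b);
}

Future<void> main() async {
  final data = List<int>.generate(1000, (i) => i + 1);
  print(await parallelSum(data, 4)); // 500500
}
```

An in-place parallel sort, by contrast, needs every worker to touch the same collection, which is the kind of algorithm the no-shared-memory model rules out.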
P.S.: the influence of SMT, ILP and other factors on the performance of multiple threads can be observed via the following CPU benchmark (https://play.google.com/store/apps/details?id=xcom.saplin.xOPS) - e.g. one can see that there's typically a sweet spot in the number of threads doing computations, and it is greater than the number of cores. E.g. on my Intel i7 8th gen MacBook with 6 cores and 12 hardware threads, the best performance was observed with the number of threads at about 4 times the number of cores.
The distribution of isolates across CPU cores is done by the OS, but each isolate corresponds to a thread. The number of isolates to use will depend on the number of CPU cores physically available.
This is illustrated by a short article available here:
https://martin-robert-fink.medium.com/dart-is-indeed-multi-threaded-94e75f66aa1e
Is it the operating system that assigns each job to a core?
What is the specific algorithm, or method, that decides which CPU core the next task will be assigned to?
Correct, it is the operating system's responsibility to designate tasks for the CPU to complete, regardless of how many cores it has. It does this via a scheduling algorithm, which decides in what order tasks/processes should be executed. In a symmetric multiprocessing environment, the OS views each core as an independent, identical CPU and therefore schedules them individually. When several cores are available, there are a couple of important things to keep in mind:
1. Load balancing - For maximum performance, each core should be performing roughly the same amount of work.
2. Affinity - Because of caching, it is best (in terms of performance) for a process to complete the entirety of its execution on just one processor.
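The tension between these two goals can be shown with a toy model. This is not how any real OS scheduler is implemented: pickCore, the load numbers and the tolerance value are all invented for illustration. The rule is to keep a task on the core it last ran on unless that core has become much busier than the least-loaded one.

```dart
/// Toy model: choose a core for a task given the current per-core load.
/// Prefers the core the task last ran on (affinity) unless that core is
/// more than [tolerance] units busier than the least-loaded core.
int pickCore(List<int> load, int? lastCore, {int tolerance = 2}) {
  var least = 0;
  for (var i = 1; i < load.length; i++) {
    if (load[i] < load[least]) least = i;
  }
  if (lastCore != null && load[lastCore] - load[least] <= tolerance) {
    return lastCore; // stay put and keep the warm cache
  }
  return least; // rebalance onto the least-loaded core
}

void main() {
  print(pickCore([5, 3, 4, 6], 0)); // 0: close enough, affinity wins
  print(pickCore([9, 3, 4, 6], 0)); // 1: too unbalanced, task migrates
  print(pickCore([5, 3, 4, 6], null)); // 1: new task, least-loaded core
}
```

Raising the tolerance favours cache affinity; lowering it favours even load.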
These things need to be kept in mind along with the traditional scheduling considerations of priority, fairness etc. Obviously, this topic is far too large for just one post to handle, so here are some resources that go into further detail:
https://www.tutorialspoint.com/operating_system/os_process_scheduling_algorithms.htm
https://www.geeksforgeeks.org/multiple-processor-scheduling-in-operating-system/
Today's computer architectures try to maximize the number of registers. It is faster to access a register (a small memory circuit inside the CPU) than to access the first-level cache. The problem is that each context switch has to save all registers into cache, because the next thread needs different register values. What a modern CPU does is cycle through, say, 100 tasks in one second, each time saving the registers and restoring the previously saved ones before the next task can start.
IMHO it would be nice to use one CPU per task, so that no context switching happens. That would mean 100 CPUs, each with 1000 registers that never have to be saved. Is that possible, or have I ignored an important detail?
The only way to completely avoid context switching is by having at least as many cores as there are tasks. Generally, there is no guarantee regarding the maximum number of tasks that may run. Current GPUs and manycore processors and co-processors contain hundreds of small cores. If you put several of these in the same system or in a cluster of systems, you can have thousands of cores or more. Still, even if you could avoid context switching with such a design, these cores are much slower than traditional high-end CPU cores, so the net effect might be negative.
But let's take a step back here. The number of context switches is not primarily determined by the number of tasks and cores. Tasks don't just perform computations, they also need to interact with I/O devices and wait for things to happen such as results from other tasks or user input. So some tasks would be in a wait state. The overhead of context switching depends on not only the number of tasks but also the behavior of these tasks.
Both processor architects and OS developers are aware of context switching overhead and employ a variety of techniques to alleviate it. For example, x86 provides a number of instructions that are tuned for saving the context (partially) of the current task. The OS thread scheduler uses techniques such as priorities, preemption (with possibly large time slices on servers), and priority boosting. All of these help reduce the number of context switches and therefore their overall overhead. In addition, reducing the overhead of context switching is not the only thing that matters. In particular, the responsiveness of the system is very important as well, which is at odds with that overhead.
Can two processes run simultaneously on one CPU core that has hyper-threading? I have read about it on the Internet, but I do not see a clear, straight answer.
Edit:
Thanks for the discussion and sharing! My purpose in posting this question is not to discuss parallel computing, which would be too big a topic for this post. I just want to know whether a multithreaded application can benefit more from hyper-threading than a multi-process application. After further reading, I have the following as my learning notes.
1) A Hyper-Threading Technology enabled CPU core has two sets of CPU state and interrupt logic, but only one set of execution units and cache. (I have not studied what a pipeline is yet.)
2) Multithreading benefits from hyper-threading only if some executing thread incurs latency. I think this maps exactly to the common reason why and when programmers use multiple threads. If the multithreaded application has already been optimized, it may not gain any benefit from hyper-threading.
3) If the CPU state maps to process state, I believe Marc is correct that a multi-process application can benefit even more from hyper-threading technology.
4) When a CPU vendor says "thread", it looks like their "thread" is different from the thread that I know as a Java programmer?
No, a hyperthreaded CPU core still only has a single execution pipeline. Even though it appears as two CPUs to the overlying OS, there's still only ever one instruction being executed at any given time.
Hyperthreading was intended to allow the CPU to continue executing one thread while another thread was stalled waiting for a resource or other operation to complete, without leaving too many stages of the pipeline empty and useless. This goes back to the Pentium 4 days, with its absurdly long pipeline - a stall was essentially catastrophic for efficiency and throughput, and hyperthreading allowed Intel to keep the cpu busy doing other things while it cleaned up from the stall.
While Marc B's answer is pretty much the definitive summary of how HT works, I just want to make a small contribution by linking this article, which should clear up a lot of things about HT: http://software.intel.com/en-us/articles/performance-insights-to-intel-hyper-threading-technology/
Short answer, yes.
A single-core CPU (one processor) can run 2 or more threads simultaneously. These threads may belong to one program, or they may belong to different programs and thus different processes. This type of multithreading is called Simultaneous MultiThreading (SMT).
The claim that a CPU core can execute only one instruction at any given time is also not true. Modern CPUs exploit Instruction Level Parallelism (ILP) by duplicating pipeline resources (e.g. 2 ALUs instead of 1). This type of pipeline is called a "superscalar" pipeline.
See the Wikipedia page on simultaneous multithreading.
Can anyone tell me if there is a way to find out the maximum number of threads that can run on different Windows systems?
For example (an assumption): a 32-bit Windows system can run a maximum of 4000 threads.
I doubt there is a maximum number. Well, since we're using a finite amount of memory, it would be as many threads as you can fit into memory or as many as you can keep track of. Each system is different, and as far as I know Java and C don't have a function to provide this. C# can tell you how much memory a specific object/app needs, so you could calculate an estimate.
You could test this on your system. Write a sample app which spawns threads and see when you run out of memory. Use a counter to count them. This will give you roughly the range for your system.
In Java, you can use an ExecutorService with a thread pool. Depending on which executor service you use, it can keep spawning threads as you submit more jobs.
A similar technique exists in C#.
A better question is what the maximum number of threads to spawn and avoid thrashing is.
Are you trying to take over the OS and do your own process/thread management? You should not be doing this.