Opencl - single queue with 2 devices

Opencl - single queue with 2 devices - queue

Im trying to port CUDA a test to Opencl. It requires a copy of a buffer from PCIe device-1 onto device-2 of the same type (same brand, same driver etc)
In CUDA it is quite simple: Allocate memory on device-1 and device-2 and copy
In Opencl:
1. Create a context with both devices
2. Allocate memory on a context and associate it with device-1 using migrateMemObject. repeat the same for device-2
When I create a queue, I need to specify both context and device. So a queue can only be associated with a device and hence can only access memory associated with that device (?)
Is there a way for 2 devices to use one queue?
Or is there any way that I can copy from gfx-mem associated with device1 to gfx-mem associated with device2 without copying to main memory?
Thank you for your help!

No, you can't associate two devices to one queue. You need two queues for two devices. But... If you read the documentation of clEnqueueMigrateMemObjects, it says:
Typically, memory objects are implicitly migrated to a device for which enqueued commands, using the memory object, are targeted
In other words, it's the responsibility of the OpenCL runtime to automatically migrate the required buffers to the device where thqy are going to be used.
The question is, is the buffer you want to share between devices read-only or will it be written to by kernels ? If it's read-only than you don't need to bother with any migrations at all, simply enqueue your kernels, and the runtime will copy the buffer once and reuse it. If it's written by kernels from both queues(=devices), then you need to properly order the queued kernels with event dependencies; otherwise the kernels could run in wrong order and the buffer content becomes nonsense.
But in any case, the runtime does the migrations in the background, as efficiently as possible. clEnqueueMigrateMemObjects is useful only as an optimization - if you want to manually overlap memory transfer with kernel execution. I'd focus on getting the application to run reliably before trying any tricks with clEnqueueMigrateMemObjects.

Related

Do memory instructions pass through the load-store queue and issue queue in the microarchitecture

What is the difference between the issue queue and lsq queue for
memory instructions? Do memory instructions pass through both queues, or do they only pass
through the lsq queue.
If they pass through both queues what is their order?

I'm assuming you use the arm-like nomenclature here so the issue queue is what Intel calls RS (reservation station) and by issue you mean sending a uop ready for execution.
The answer is that memory instructions need to pass both. All instructions need to be issued (except the ones that can be eliminated without execution, for example register moves, zero idioms, nops, etc..). Let's rephrase - all instructions that need to go through an ALU need to go through the issue process first. Memory instructions will simply use that step to calculate their addresses.
This is true for loads, for stores there is usually an internal split into store-address and store-data, so the store-address will behave like a load in that sense and calculate its address during that step.
There is usually a dedicated execution port for that and dedicated execution units because the address calculation usually follows one of few specific addressing modes (each architecture has a different set of these), but aside from that the execution needs to follow the same rules like any other operation in the CPU - it needs to have its sources ready and read from the register file or bypassed from an in flight operation, it needs to get arbitrated when the execution port is free and prioritized by the same aging rules, so it makes sense that it uses the common path.
Once the memory operation has finished execution, it will be sent to the LSU (load-store unit, or the DCU, data-cache unit on Intel) and perform the actual memory access using the generated address. The LSU pipe will take care of the address translation, TLB lookups, the page walk if needed (though this is sometimes done in a dedicated unit), the address range and property checks, the cache lookup (if cacheable) and sending a miss to the next cache level or memory if needed. It may also trigger prefetches as part of the process.
For a load, when the LSU pipe has completed (which may require multiple passes and wakeups if the data was not available in the L1), the LSU will signal the issue queue again in order to wakeup anyone who was depended on the result.
For a store, store-address may fetch the line to the cache in advance as an optimization but the actual next step is usually to wakeup after retirement (since stores may not be dispatched to memory while speculative, unless you have some tricks to handle that).
It's also worth to mention that some CPUs try to optimize loads that can forward the data directly from prior stores instead of fetching it from the cache/memory. This can include forwarding (very common) or memory renaming (less common). The former is usually handled by the LSU internally, but the latter can be done much earlier and without the LSU (though the LSU pipe is usually still activated to validate the result).

Does each system call create a process?

Does each system call create a process?
Are all functions (e.g. interrupts) of programs and operating systems executed in the form of processes?
I feel that such a large number of process control blocks, a large number of process scheduling waste a lot of resources.
Or, the kernel instruction of the system call is regarded as part of the current
process.

The short answer is - not exactly. But we have to agree on what we are going to call a "process". A process is more of an abstract idea, which encapsulates multiple instructions, each sequentially executed.
So let's start from the first question.
Does each system call create a process?
No. Each system call is the product of the currently running process, that tells the OS - "Hey OS, I need you to open this file for me, or read these here bits". In this case, the process is a bag of sequentially executed instructions, some are system calls, some are not.
Then we have.
Are all functions (e.g. interrupts) of programs and operating systems executed in the form of processes?
Well this kind of goes back to the first question. We are not considering that a system call (an operation that tell the OS to do something and works under very strict conditions) is a separate process. We will NOT see that system call execution to have its OWN process id (pid).
Then we have.
I feel that such a large number of process control blocks, a large number of process scheduling waste a lot of resources.
Well, I would say, do not underestimate your OS and the capabilities of your hardware. A modern processor with a modern OS on it, is VERY, VERY fast and more than capable of computing billions of instructions in seconds. We can't really imagine how fast that is. I wouldn't worry for optimizations on such a micro-level.
Okay, but let's dig deeper into this. What is a process exactly?
Informally, a process is a program in execution. The status of the current activity of a process is represented by a value, called the program counter, and the contents of the processor’s registers. The memory layout of a process is typically divided into multiple sections.
These sections include:
Text section.
Data section.
Heap section.
Stack section.
As a process executes, it changes state. The state of a process is defined in part by the current activity of that process. Each process is represented in the OS by a process control block (PCB), as you already mentioned.
So we can see that we treat a process as a very complicated structure that is MORE that just occupying CPU time. It has a state, storage, timing, and so on.
But because you are interested in system calls, then what are they?
For us, system calls provide an interface to the services made available by an OS. They are the way we tell the OS to do things FOR US. We know that systems execute thousands of system calls per second.

No, they don't.
The operating system uses software interrupt to execute the system call operation within the same process.
You can imagine them as a function call but they are executed with kernel privileges.

How the speculative load and store happen in modern Intel processor? [duplicate]

Given the small program shown below (handcrafted to look the same from a sequential consistency / TSO perspective), and assuming it's being run by a superscalar out-of-order x86 cpu:
Load A <-- A in main memory
Load B <-- B is in L2
Store C, 123 <-- C is L1
I have a few questions:
Assuming a big enough instruction-window, will the three instructions be fetched, decoded, executed at the same time? I assume not, as that would break execution in program order.
The 2nd load is going to take longer to fetch A from memory than B. Will the later have to wait until the first is fully executed? Will the fetching of B only start after Load A is fully executed? or until when does it have to wait?
Why would the store have to wait for the loads? If yes, will the instruction just wait to be committed in the store buffer until the loads finish or after decoding it will have to sit and wait for the loads?
Thanks

Terminology: "instruction-window" normally means out-of-order execution window, over which the CPU can find ILP. i.e. ROB or RS size. See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths
The term for how many instructions can go through the pipeline in a single cycle is pipeline width. e.g. Skylake is 4-wide superscalar out-of-order. (Parts of its pipeline, like decode, uop-cache fetch, and retirement, are wider than 4 uops, but issue/rename is the narrowest point.)
Terminology: "wait to be committed in the store buffer" store data + address gets written into the store buffer when a store executes. It commits from the store buffer to L1d at any point after retirement, when it's known to be non-speculative.
(In program order, to maintain the TSO memory model of no store reordering. A store buffer allows stores to execute inside this core out of order but still commit to L1d (and become globally visible) in-order. Executing a store = writing address + data to the store buffer.)
Can a speculatively executed CPU branch contain opcodes that access RAM?
Also what is a store buffer? and
Size of store buffers on Intel hardware? What exactly is a store buffer?
The front-end is irrelevant. 3 consecutive instructions might well be fetched in the same 16-byte fetch block, and might go through pre-decode and decode in the same cycle as a group. And (also or instead) issue into the out-of-order back-end as part of a group of 3 or 4 uops. IDK why you think any of that would cause any potential problem.
The front end (from fetch to issue/rename) processes instructions in program order. Processing simultaneously doesn't put later instructions before earlier ones, it puts them at the same time. And more importantly, it preserves the information of what program order is; that's not lost or discarded because it matters for instructions that depend on the previous one1!
There are queues between most pipeline stages, so (for example on Intel Sandybridge) instructions that pre-decode as part of a group of up-to-6 instructions might not hit the decoders as part of the same group of up-to-4 (or more with macro-fusion). See https://www.realworldtech.com/sandy-bridge/3/ for fetch, and the next page for decode. (And the uop cache.)
Executing (dispatching uops to execution ports from the out-of-order scheduler) is where ordering matters. The out-of-order scheduler has to avoid breaking single threaded code.2
Usually issue/rename is far ahead of execution, unless you're bottlenecked on the front-end. So there's normally no reason to expect that uops that issued together will execute together. (For the sake of argument, let's assume that the 2 loads you show do get dispatched for execution in the same cycle, regardless of how they got there via the front-end.)
But anyway, there's no problem here starting both loads and the store the same time. The uop scheduler doesn't know whether a load will hit or miss in L1d. It just sends 2 load uops to the load execution units in a cycle, and a store-address + store-data uop to those ports.
[load ordering]
This is the tricky part.
As I explained in an answer + comments on your last question, modern x86 CPUs will speculatively use the L2 hit result from Load B for later instructions, even though the memory model requires that this load happens after Load A.
But if no other cores write to cache line B before Load A completes, then nothing can tell the difference. The Memory-Order Buffer takes care of detecting invalidations of cache lines that were loaded from before earlier loads complete, and doing a memory-order mis-speculation pipeline flush (rollback to retirement state) in the rare case that allowing load re-ordering could change the result.
Why would the store have to wait for the loads?
It won't, unless the store-address depends on a load value. The uop scheduler will dispatch the store-address and store-data uops to execution units when their inputs are ready.
It's after the loads in program order, and the store buffer will make it even farther after the loads as far as global memory order is concerned. The store buffer won't commit the store data to L1d (making it globally visible) until after the store has retired. Since it's after the loads, they'll have also retired.
(Retirement is in-order to allow precise exceptions, and to make sure no previous instructions took an exception or were a mispredicted branch. In-order retirement allows us to say for sure that an instruction is non-speculative after it retires.)
So yes, this mechanism does ensure that the store can't commit to L1d until after both loads have taken data from memory (via L1d cache which provides a coherent view of memory to all cores). So this prevents LoadStore reordering (of earlier loads with later stores).
I'm not sure if any weakly-ordered OoO CPUs do LoadStore reordering. It is possible on in-order CPUs when a cache-miss load comes before a cache-hit store, and the CPU uses scoreboarding to avoid stalling until the load data is actually read from a register, if it still isn't ready. (LoadStore is a weird one: see also Jeff Preshing's Memory Barriers Are Like Source Control Operations). Maybe some OoO exec CPUs can also track cache-miss stores post retirement when they're known to be definitely happening, but the data just still hasn't arrived yet. x86 doesn't do this because it would violate the TSO memory model.
Footnote 1: There are some architectures (typically VLIW) where bundles of simultaneous instructions are part of the architecture in a way that's visible to software. So if software can't fill all 3 slots with instructions that can execute simultaneously, it has to fill them with NOPs. It might even be allowed to swap 2 registers with a bundle that contained mov r0, r1 and mov r1, r0, depending on whether the ISA allows instructions in the same bundle to read and write the same registers.
But x86 is not like that: superscalar out-of-order execution must always preserve the illusion of running instructions one at a time in program order. The cardinal rule of OoO exec is: don't break single-threaded code.
Anything that would violate this can only be done with checking for hazards, or speculatively with rollback upon detection of mistakes.
Footnote 2: (continued from footnote 1)
You can fetch / decode / issue two back-to-back inc eax instructions, but they can't execute in the same cycle because register renaming + the OoO scheduler has to detect that the 2nd one reads the output of the first.

PC and CPU registers when context switching happens?

According to this question: Storing and retrieving process control block
PCB contains a lot of information and it is managed by the kernel (to avoid user access).
But I have question about PC and CPU registers. Is Kernel save these values every time an instruction is executed or only in context switching process?
Are PCBs linked list?

Actually, the value of CPU registers are modified as per the running sequence of instructions.
Say,the Instruction Pointer points to next instruction to be executed, the Stack Pointer,if active,would store the address of the last program request in a stack. And so on. These all are basically CPU registers!
PCB has one of the part Processor state data,which are those pieces
of information that define the status of a process when it's
suspended, allowing the OS to restart it later and still execute
correctly. This always includes the content of the CPU general-purpose
registers, the CPU process status word, stack and frame pointers etc.
During context switch, the running process is stopped and another
process is given a chance to run. The kernel must stop the execution
of the running process, copy out the values in hardware registers to
its PCB, and update the hardware registers with the values from the
PCB of the new process. // (Taken from Wikipedia)
Does Kernel save these values every time an instruction is executed or only in context switching process?
So,you might have got your question solved. The kernel only bothers saving values of hardware(CPU) registers in the case of context switching,not normally. Else,it leaves the burden on process itself to maintain the registers!
Also, the last question's answer is---The implementation of PCB is 'generally' done as a doubly linked list data structure!

Relationship between a kernel and a user thread

Is there a relationship between a kernel and a user thread?
Some operating system textbooks said that "maps one (many) user thread to one (many) kernel thread". What does map means here?

When they say map, they mean that each kernel thread is assigned to a certain number of user mode threads.
Kernel threads are used to provide privileged services to applications (such as system calls ). They are also used by the kernel to keep track of what all is running on the system, how much of which resources are allocated to what process, and to schedule them.
If your applications make heavy use of system calls, more user threads per kernel thread, and your applications will run slower. This is because the kernel thread will become a bottleneck, since all system calls will pass through it.
On the flip side though, if your programs rarely use system calls (or other kernel services), you can assign a large number of user threads to a kernel thread without much performance penalty, other than overhead.
You can increase the number of kernel threads, but this adds overhead to the kernel in general, so while individual threads will be more responsive with respect to system calls, the system as a whole will become slower.
That is why it is important to find a good balance between the number of kernel threads and the number of user threads per kernel thread.

http://www.informit.com/articles/printerfriendly.aspx?p=25075
Implementing Threads in User Space
There are two main ways to implement a threads package: in user space and in the kernel. The choice is moderately controversial, and a hybrid implementation is also possible. We will now describe these methods, along with their advantages and disadvantages.
The first method is to put the threads package entirely in user space. The kernel knows nothing about them. As far as the kernel is concerned, it is managing ordinary, single-threaded processes. The first, and most obvious, advantage is that a user-level threads package can be implemented on an operating system that does not support threads. All operating systems used to fall into this category, and even now some still do.
All of these implementations have the same general structure, which is illustrated in Fig. 2-8(a). The threads run on top of a run-time system, which is a collection of procedures that manage threads. We have seen four of these already: thread_create, thread_exit, thread_wait, and thread_yield, but usually there are more.
When threads are managed in user space, each process needs its own private thread table to keep track of the threads in that process. This table is analogous to the kernel's process table, except that it keeps track only of the per-thread properties such the each thread's program counter, stack pointer, registers, state, etc. The thread table is managed by the run-time system. When a thread is moved to ready state or blocked state, the information needed to restart it is stored in the thread table, exactly the same way as the kernel stores information about processes in the process table.
When a thread does something that may cause it to become blocked locally, for example, waiting for another thread in its process to complete some work, it calls a run-time system procedure. This procedure checks to see if the thread must be put into blocked state. If so, it stores the thread's registers (i.e., its own) in the thread table, looks in the table for a ready thread to run, and reloads the machine registers with the new thread's saved values. As soon as the stack pointer and program counter have been switched, the new thread comes to life again automatically. If the machine has an instruction to store all the registers and another one to load them all, the entire thread switch can be done in a handful of instructions. Doing thread switching like this is at least an order of magnitude faster than trapping to the kernel and is a strong argument in favor of user-level threads packages.
However, there is one key difference with processes. When a thread is finished running for the moment, for example, when it calls thread_yield, the code of thread_yield can save the thread's information in the thread table itself. Furthermore, it can then call the thread scheduler to pick another thread to run. The procedure that saves the thread's state and the scheduler are just local procedures, so invoking them is much more efficient than making a kernel call. Among other issues, no trap is needed, no context switch is needed, the memory cache need not be flushed, and so on. This makes thread scheduling very fast.
User-level threads also have other advantages. They allow each process to have its own customized scheduling algorithm. For some applications, for example, those with a garbage collector thread, not having to worry about a thread being stopped at an inconvenient moment is a plus. They also scale better, since kernel threads invariably require some table space and stack space in the kernel, which can be a problem if there are a very large number of threads.
Despite their better performance, user-level threads packages have some major problems. First among these is the problem of how blocking system calls are implemented. Suppose that a thread reads from the keyboard before any keys have been hit. Letting the thread actually make the system call is unacceptable, since this will stop all the threads. One of the main goals of having threads in the first place was to allow each one to use blocking calls, but to prevent one blocked thread from affecting the others. With blocking system calls, it is hard to see how this goal can be achieved readily.
The system calls could all be changed to be nonblocking (e.g., a read on the keyboard would just return 0 bytes if no characters were already buffered), but requiring changes to the operating system is unattractive. Besides, one of the arguments for user-level threads was precisely that they could run with existing operating systems. In addition, changing the semantics of read will require changes to many user programs.
Another alternative is possible in the event that it is possible to tell in advance if a call will block. In some versions of UNIX, a system call, select, exists, which allows the caller to tell whether a prospective read will block. When this call is present, the library procedure read can be replaced with a new one that first does a select call and then only does the read call if it is safe (i.e., will not block). If the read call will block, the call is not made. Instead, another thread is run. The next time the run-time system gets control, it can check again to see if the read is now safe. This approach requires rewriting parts of the system call library, is inefficient and inelegant, but there is little choice. The code placed around the system call to do the checking is called a jacket or wrapper.
Somewhat analogous to the problem of blocking system calls is the problem of page faults. We will study these in Chap. 4. For the moment, it is sufficient to say that computers can be set up in such a way that not all of the program is in main memory at once. If the program calls or jumps to an instruction that is not in memory, a page fault occurs and the operating system will go and get the missing instruction (and its neighbors) from disk. This is called a page fault. The process is blocked while the necessary instruction is being located and read in. If a thread causes a page fault, the kernel, not even knowing about the existence of threads, naturally blocks the entire process until the disk I/O is complete, even though other threads might be runnable.
Another problem with user-level thread packages is that if a thread starts running, no other thread in that process will ever run unless the first thread voluntarily gives up the CPU. Within a single process, there are no clock interrupts, making it impossible to schedule processes round-robin fashion (taking turns). Unless a thread enters the run-time system of its own free will, the scheduler will never get a chance.
One possible solution to the problem of threads running forever is to have the run-time system request a clock signal (interrupt) once a second to give it control, but this, too, is crude and messy to program. Periodic clock interrupts at a higher frequency are not always possible, and even if they are, the total overhead may be substantial. Furthermore, a thread might also need a clock interrupt, interfering with the run-time system's use of the clock.
Another, and probably the most devastating argument against user-level threads, is that programmers generally want threads precisely in applications where the threads block often, as, for example, in a multithreaded Web server. These threads are constantly making system calls. Once a trap has occurred to the kernel to carry out the system call, it is hardly any more work for the kernel to switch threads if the old one has blocked, and having the kernel do this eliminates the need for constantly making select system calls that check to see if read system calls are safe. For applications that are essentially entirely CPU bound and rarely block, what is the point of having threads at all? No one would seriously propose computing the first n prime numbers or playing chess using threads because there is nothing to be gained by doing it that way.

User threads are managed in userspace - that means scheduling, switching, etc. are not from the kernel.
Since, ultimately, the OS kernel is responsible for context switching between "execution units" - your user threads must be associated (ie., "map") to a kernel schedulable object - a kernel thread†1.
So, given N user threads - you could use N kernel threads (a 1:1 map). That allows you to take advantage of the kernel's hardware multi-processing (running on multiple CPUs) and be a pretty simplistic library - basically just deferring most of the work to the kernel. It does, however, make your app portable between OS's as you're not directly calling the kernel thread functions. I believe that POSIX Threads (PThreads) is the preferred *nix implementation, and that it follows the 1:1 map (making it virtually equivalent to a kernel thread). That, however, is not guaranteed as it'd be implementation dependent (a main reason for using PThreads would be portability between kernels).
Or, you could use only 1 kernel thread. That'd allow you to run on non multitasking OS's, or be completely in charge of scheduling. Windows' User Mode Scheduling is an example of this N:1 map.
Or, you could map to an arbitrary number of kernel threads - a N:M map. Windows has Fibers, which would allow you to map N fibers to M kernel threads and cooperatively schedule them. A threadpool could also be an example of this - N workitems for M threads.
†1: A process has at least 1 kernel thread, which is the actual execution unit. Also, a kernel thread must be contained in a process. OS's must schedule the thread to run - not the process.

This is a question about thread library implement.
In Linux, a thread (or task) could be in user space or in kernel space. The process enter kernel space when it ask kernel to do something by syscall(read, write or ioctl).
There is also a so-called kernel-thread that runs always in kernel space and does not represent any user process.

According to Wikipedia and Oracle, user-level threads are actually in a layer mounted on the kernel threads; not that kernel threads execute alongside user-level threads but that, generally speaking, the only entities that are actually executed by the processor/OS are kernel threads.
For example, assume that we have a program with 2 user-level threads, both mapped to (i.e. assigned) the same kernel thread. Sometimes, the kernel thread runs the first user-level thread (and it is said that currently this kernel thread is mapped to the first user-level thread) and some other times the kernel thread runs the second user-level thread. So we say that we have two user-level threads mapped to the same kernel thread.
As a clarification:
The core of an OS is called its kernel, so the threads at the kernel level (i.e. the threads that the kernel knows of and manages) are called kernel threads, the calls to the OS core for services can be called kernel calls, and ... . The only definite relation between kernel things is that they are strongly related to the OS core, nothing more.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse