OpenCL Events and Command Queues

OpenCL Events and Command Queues - queue

I'm working on translating a CUDA application (this if you must know) to OpenCL. The original application uses the C-style CUDA API, with a single stream just to avoid the automatic busy-wait when reading the results.
Now I notice that OpenCL command queues look a lot like CUDA streams. But in the device read command, and likewise in the write and kernel execute commands, I notice parameters for events too. So I'm wondering, what does it take to execute a device write, a number of kernels (e.g. one call to one kernel then 100 calls to another kernel), and a device read, all sequentially?
If I just enqueue them sequentially into the same queue, will they execute sequentially like they do in CUDA?
If that doesn't work, can/should I daisy-chain events, making each call's wait list the previous call's event?
Or should I add all previous events to each call's wait list, like if there's an N^2 search for dependencies or something?
Or do I just have to event.wait() for each call individually, like it says to in AMD's tutorial?
Thanks!

That depends on how you create the command queue. in clCreateCommandQueue there's a properties parameter that can contain CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, which enables non-sequential execution in the command queue.
If that property is set, commands might execute out of order or in parallel, and the only way of synchronize them is using events.
When that property is not set, commands execute sequentially in the queue.

Related

Need for multi-threading in Systemverilog using fork-join

In most text books advocating layered testbench designs, it is recommended that different layers/block run in parallel. I'm currently unable to figure out the reason why is it so. Why cannot we follow the following sequence.
repeat for 1000 tests
generate a transaction
drive the transaction on the DUT
monitor the transaction on the DUT
compare output with a reference
Instead, what is recommended is that all four blocks generator, driver, monitor and scoreboard/checker should run in parallel. My confusion is that why do we avoid the above mentioned sequential behavior in which we go through tests one test case at a time and prefer different blocks running in parallel.
Some texts say that it is because that is how things are done in hardware, i.e. everything runs in parallel. However, the layered testbench is not needed to model any synthesizable hardware. So, why do we have to restrict our verification enivornment/testbench to follow these hardware-like behavior.
A sample block diagram that I'm referring to is given below:

Suppose that you have a fifo which you want to test. Your driver pushes data into it, and the monitor checks the other end. The data gets pushed when it is available and till the fifo is full, the consumer on the other end reads data when it can. So, the pipe gets sometimes full, sometimes empty.
When the fifo is full, the driver must stop. The monitor works always, but its values do not change at the same frequency as the stimuli and it is delayed due to the fifo depth.
In your example, when the fifo is full, the stopped driver will block the whole loop, so the monitor will not work either. Of course, you can come up with some conditional statements which will bypass stopped driver. But you will need to run the monitor and the scoreboard every time, even if the data is not changing.
With more complicated designs with multiple fifos, pipelines, delays, clock frequencies, etc., your loop will become so complicated that it would be difficult if not impossible to manage.
The problem is that in the simple programming it is not possible to express block/wait conditions for statement without blocking the whole loop. It is much easier to do with parallel threads.
The general approach is to run driver and monitor in separate simulation threads. Monitor in this case waits for the data to appear and does not block the driver. The driver pushes data when it is available and can be blocked by fifo full or if there is nothing to drive. It does not block the monitor.
With a single monitor, you can probably pack the scoreboard in the same thread with the monitor, but with multiple monitors it will be problematic, in particular when all monitors run in separate threads. So, the scoreboard should run as a separate thread as well.

You are mixing two different concepts. The layered approach is a software concept that helps manage different abstraction levels from software transactions (a frame of data) to the individual pin wiggles. These layers are very similar to OSI Network Model. Layering also help with maintenance and reusability by defining clear interfaces that enable you to build up a larger system. It's hard to see the benefits of this on a testbench for a small combinational block.
Parallelism come into play for other reasons. There are relatively few complete designs out there that can be tested as a single stream of inputs and then comparing the output to a reference model. You might be able to test one small block of a design this way, but not a complete chip as it typically has many interfaces that need to be driven in parallel.
But let's take the case of two simple blocks that you tested individually with the approach above. Now you want to connect them together where the output of the first DUT becomes the driver of the second DUT
Driver1 -> DUT1 -> DUT2 -> Monitor2
This works best if I originally write the drivers and monitors as separate objects running in parallel.

Does each system call create a process?

Does each system call create a process?
Are all functions (e.g. interrupts) of programs and operating systems executed in the form of processes?
I feel that such a large number of process control blocks, a large number of process scheduling waste a lot of resources.
Or, the kernel instruction of the system call is regarded as part of the current
process.

The short answer is - not exactly. But we have to agree on what we are going to call a "process". A process is more of an abstract idea, which encapsulates multiple instructions, each sequentially executed.
So let's start from the first question.
Does each system call create a process?
No. Each system call is the product of the currently running process, that tells the OS - "Hey OS, I need you to open this file for me, or read these here bits". In this case, the process is a bag of sequentially executed instructions, some are system calls, some are not.
Then we have.
Are all functions (e.g. interrupts) of programs and operating systems executed in the form of processes?
Well this kind of goes back to the first question. We are not considering that a system call (an operation that tell the OS to do something and works under very strict conditions) is a separate process. We will NOT see that system call execution to have its OWN process id (pid).
Then we have.
I feel that such a large number of process control blocks, a large number of process scheduling waste a lot of resources.
Well, I would say, do not underestimate your OS and the capabilities of your hardware. A modern processor with a modern OS on it, is VERY, VERY fast and more than capable of computing billions of instructions in seconds. We can't really imagine how fast that is. I wouldn't worry for optimizations on such a micro-level.
Okay, but let's dig deeper into this. What is a process exactly?
Informally, a process is a program in execution. The status of the current activity of a process is represented by a value, called the program counter, and the contents of the processor’s registers. The memory layout of a process is typically divided into multiple sections.
These sections include:
Text section.
Data section.
Heap section.
Stack section.
As a process executes, it changes state. The state of a process is defined in part by the current activity of that process. Each process is represented in the OS by a process control block (PCB), as you already mentioned.
So we can see that we treat a process as a very complicated structure that is MORE that just occupying CPU time. It has a state, storage, timing, and so on.
But because you are interested in system calls, then what are they?
For us, system calls provide an interface to the services made available by an OS. They are the way we tell the OS to do things FOR US. We know that systems execute thousands of system calls per second.

No, they don't.
The operating system uses software interrupt to execute the system call operation within the same process.
You can imagine them as a function call but they are executed with kernel privileges.

Asynchronous SQL procedure execution set and wait for completion

Say I have a large set of calls to a procedure to run which have varying parameters but are independent so I want to make parallel/async calls. I use the service broker to fire these all off but the problem I have is I want to know neat ways of knowing how to wait for them all to complete (or error).
Is there a way to do this? I believe I could just loop with waits on the result table checking for completion on that but that isn't very "event triggered". Hoping for a nicer way to do this.
I have used the service broker with queue code and processing based off this other answer: Remus' service broker queuing example

Good day Shiv,
There are several ways (like always) that you can use in order to implement this requirement. One of these is using this logic:
(1) Create two queues: one will be the trigger to execute the main SP that you want execute in Asynchronous, and the other will be the trigger to execute whatever you want to execute after all the executions ended.
(2) When you create the message in the first queue you should also create a message in the second queue, which will only tell us which execution did not ended yet (first queue gives the information which execution started since once we START the execution we use the message and remove it from the queue).
(3) Inside the SP that you execute using the main first queue (this part executed in synchronous):
(3.1) execute the queries you need
(3.2) clear the equivalent message from the second queue (meaning that this message will removed only after the queries ended)
(3.3) check if there are messages in the second queue. If there are no messages then all the tasks ended and you can execute your final step
** Theoretically instead of using the second queue, you can store data in a table, but using second queue should probably give better performance then updating table each time an execution ended. Anyhow, you test the option of using a table as well.

UVM shared variables

I have a doubt regarding UVM. Let's think I have a DUT with two interfaces, each one with its agent, generating transactions with the same clock. These transactions are handled with analysis imports (and write functions) on the scoreboard. My problem is that both these transactions read/modify shared variables of the scoreboard.
My questions are:
1) Have I to guarantee mutual exclusion explicitly though a semaphore? (i suppose yes)
2) Is this, in general, a correct way to proceed?
3) and the main problem, can in some way the order of execution be fixed?
Depending on that order the values of shared variables can change, generating inconsistency. Moreover, that order is fixed by specifications.
Thanks in advance.

While SystemVerilog tasks and functions do run concurrently, they do not run in parallel. It is important to understand the difference between parallelism and concurrency and it has been explained well here.
So while a SystemVerilog task or function could be executing concurrently with another task or function, in reality it does not actually run at the same time (run time context). The SystemVerilog scheduler keeps a list of all the tasks and functions that need to run on the same simulation time and at that time it executes them one-by-one (sequentially) on the same processor (concurrency) and not together on multiple processors (parallelism). As a result mutual exclusion is implicit and you do not need to use semaphores on that account.
The sequence in which two such concurrent functions would be executed is not deterministic but it is repeatable. So when you execute a testbench multiple times on the same simulator, the sequence of execution would be same. But two different simulators (or different versions of the same simulator) could execute these functions in a different order.
If the specifications require a certain order of execution, you need to ensure that order by making one of these tasks/functions wait on the other. In your scoreboard example, since you are using analysis port, you will have two "write" functions (perhaps using uvm_analysis_imp_decl macro) executing concurrently. To ensure an order, (since functions can not wait) you can fork out join_none threads and make one of the threads wait on the other by introducing an event that gets triggered at the conclusion of the first thread and the other thread waits for this event at the start.

This is a pretty difficult problem to address. If you get 2 transactions in the same time step, you have to be able to process them regardless of the order in which they get sent to your scoreboard. You can't know for sure which monitor will get triggered first. The only thing you can do is collect the transactions and at the end of the time step do your modeling/checking/etc.
Semaphores only help you if you have concurrent threads that take (simulation) time that are trying to access a shared resource. If you get things from an analysis port, then you get them in 0 time, so semaphores won't help you here.

So to my understanding, the answer is: compiler/vendor/uvm cannot ensure the order of execution. If you need to ensure the order which actually happen in same time step, you need to use semaphore correctly to make it work the way you want.
Another thing is, only you yourself know which one must execute after the other if they are in same simulation time.

this is a classical race condition where the result depends upon the actual thread order...
first of all you have to decide if the write race is problematic for you and/or if there is a priority order in this case. if you dont care the last access would win.
if the access isnt atomic you might need a semaphore to ensure only one access is handled at a time and the next waits till the first has finished.
you can also try to control order by changing the structure or introducing thread ordering (wait_order) or if possible you remove timing at all (here instead of directly operating with the data you get you simply store the data for some time and then later you operate on it.

Networking using run loop

I have an application which uses some external library for analytics. Problem is that I suspect it does some things synchronously, which blocks my thread and makes watchdog kill my app after 10 secs (0x8badf00d code). It is really hard to reproduce (I cannot), but there are quite few cases "in the wild".
I've read some documentation, which suggested that instead creating another thread I should use run-loops. Unfortunately the more I read about them, the more confused I get. And the last thing i want to do is release a fix which will break even more things :/
What I am trying to achieve is:
From main thread add a task to the run-loop, which calls just one function: initMyAnalytics(). My thread continues running, even if initMyAnalytics() gets locked waiting for network data. After initMyAnalytics() finishes, it quietly quits and never gets called again (so it doesnt loop or anything).
Any ideas how to achieve it? Code examples are welcome ;)
Regards!

You don't need to use a run loop in that case. Run loops' purpose is to proceed events from various sources sequentially in a particular thread and stay idle when they have nothing to do. Of course, you can detach a thread, create a run loop, add a source for your function and run the run loop until the function ends. The same as you can use a semi-trailer truck to carry your groceries home.
Here, what you need are dispatch queues. Dispatch queues are First-In-First-Out data structures that run tasks asynchronously. In contrary to run loops, a dispatch queue isn't tied to a particular thread: the working threads are automatically created and terminated as and when required.
As you only have one task to execute, you don't need to create a dispatch queue. Instead you will use an existing global concurrent queue. A concurrent queue execute one or more tasks concurrently, which is perfectly fine in our case. But if we had many tasks to execute and wanted each task to wait for its predecessor to end, we would need to create a serial queue.
So all you have to do is:
create a task for your function by enclosing it into a Block
get a global queue using dispatch_get_global_queue
add the task to the queue using dispatch_async.
dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
initMyAnalytics();
});
DISPATCH_QUEUE_PRIORITY_DEFAULT is a macro that evaluates to 0. You can get different global queues with different priorities. The second parameter is reserved for future use and should always be 0.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse