UVM shared variables - SystemVerilog

I have a question regarding UVM. Suppose I have a DUT with two interfaces, each with its own agent, generating transactions on the same clock. These transactions are handled by analysis imps (and their write functions) in the scoreboard. My problem is that both of these transactions read/modify shared variables of the scoreboard.
My questions are:
1) Do I have to guarantee mutual exclusion explicitly through a semaphore? (I suppose yes)
2) Is this, in general, a correct way to proceed?
3) And the main problem: can the order of execution somehow be fixed?
Depending on that order, the values of the shared variables can change, generating inconsistencies. Moreover, that order is fixed by the specification.
Thanks in advance.

While SystemVerilog tasks and functions do run concurrently, they do not run in parallel. It is important to understand the difference between parallelism and concurrency, which has been explained well here.
So while a SystemVerilog task or function may execute concurrently with another task or function, it does not actually run at the same time (in the same run-time context). The SystemVerilog scheduler keeps a list of all the processes that need to run at the same simulation time and executes them one by one (sequentially) on a single processor (concurrency), not together on multiple processors (parallelism). As a result, mutual exclusion is implicit and you do not need to use semaphores on that account.
The sequence in which two such concurrent functions are executed is not deterministic, but it is repeatable. So when you run a testbench multiple times on the same simulator, the order of execution will be the same. But two different simulators (or two versions of the same simulator) could execute these functions in a different order.
If the specifications require a certain order of execution, you need to ensure that order by making one of these tasks/functions wait on the other. In your scoreboard example, since you are using analysis ports, you will have two write functions (perhaps declared with the uvm_analysis_imp_decl macro) executing concurrently. To ensure an order (since functions cannot wait), you can fork join_none threads and make one thread wait on the other, by introducing an event that is triggered at the end of the first thread and waited on at the start of the second.
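A minimal sketch of that idea could look like the following. The transaction class, the imp suffixes, and the assumed spec rule "A's update must be applied before B's" are illustrative, not details from the question; using wait(ev.triggered) instead of @ev avoids missing a trigger that fired earlier in the same time step.

    import uvm_pkg::*;
    `include "uvm_macros.svh"

    class my_txn extends uvm_sequence_item;            // hypothetical transaction
      int data;
      `uvm_object_utils(my_txn)
      function new(string name = "my_txn"); super.new(name); endfunction
    endclass

    `uvm_analysis_imp_decl(_port_a)
    `uvm_analysis_imp_decl(_port_b)

    class my_scoreboard extends uvm_scoreboard;
      `uvm_component_utils(my_scoreboard)

      uvm_analysis_imp_port_a #(my_txn, my_scoreboard) imp_a;
      uvm_analysis_imp_port_b #(my_txn, my_scoreboard) imp_b;

      int   shared_count;   // the shared scoreboard variable
      event a_done;         // fires when A's update has been applied

      function new(string name, uvm_component parent);
        super.new(name, parent);
        imp_a = new("imp_a", this);
        imp_b = new("imp_b", this);
      endfunction

      // write() is a function and cannot block, so spawn a thread instead.
      function void write_port_a(my_txn t);
        fork begin
          shared_count += t.data;       // assumed spec rule: A's update goes first
          -> a_done;                    // unblock the B-side thread
        end join_none
      endfunction

      function void write_port_b(my_txn t);
        fork begin
          wait (a_done.triggered);      // true for the rest of this time step
          shared_count -= t.data;       // then apply B's update
        end join_none
      endfunction
    endclass

One caveat: if a B transaction can arrive in a time step where no A transaction arrives, the B-side thread above would wait until the next A; a real scoreboard needs an extra guard for that case, which is omitted in this sketch.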

This is a pretty difficult problem to address. If you get 2 transactions in the same time step, you have to be able to process them regardless of the order in which they get sent to your scoreboard. You can't know for sure which monitor will get triggered first. The only thing you can do is collect the transactions and at the end of the time step do your modeling/checking/etc.
Semaphores only help you if you have concurrent threads that take (simulation) time that are trying to access a shared resource. If you get things from an analysis port, then you get them in 0 time, so semaphores won't help you here.
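A sketch of the collect-then-process approach described above could look like this, reusing the hypothetical my_txn class, imports, and _port_a/_port_b imp suffixes from the earlier sketch. The write() functions only collect; a run_phase thread waits for the UVM NBA region so that every write() call of the current time step has landed, then processes the queues in the spec-defined order.

    class collecting_scoreboard extends uvm_scoreboard;
      `uvm_component_utils(collecting_scoreboard)

      uvm_analysis_imp_port_a #(my_txn, collecting_scoreboard) imp_a;
      uvm_analysis_imp_port_b #(my_txn, collecting_scoreboard) imp_b;

      my_txn q_a[$];        // transactions seen on interface A this time step
      my_txn q_b[$];        // transactions seen on interface B this time step
      event  got_txn;       // poked by either write() function

      function new(string name, uvm_component parent);
        super.new(name, parent);
        imp_a = new("imp_a", this);
        imp_b = new("imp_b", this);
      endfunction

      // Collect only; do not touch shared state here.
      function void write_port_a(my_txn t); q_a.push_back(t); -> got_txn; endfunction
      function void write_port_b(my_txn t); q_b.push_back(t); -> got_txn; endfunction

      task run_phase(uvm_phase phase);
        forever begin
          @got_txn;                     // something arrived in this time step
          uvm_wait_for_nba_region();    // let all write() calls of this step finish
          // Now update the model in the spec-defined order (A before B is assumed),
          // regardless of which monitor happened to fire first.
          while (q_a.size() > 0) begin
            my_txn t = q_a.pop_front();
            // apply A's update to the shared variables here
          end
          while (q_b.size() > 0) begin
            my_txn t = q_b.pop_front();
            // apply B's update / checking here
          end
        end
      endtask
    endclass

The key point is that the write() functions never touch the shared variables directly; all updates happen in one thread, in one well-defined order.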

So to my understanding, the answer is: the compiler/vendor/UVM cannot guarantee the order of execution. If you need to enforce an order between things that happen in the same time step, you need to use a semaphore (or a similar synchronization mechanism) correctly to make it work the way you want.
Another thing: only you know which one must execute after the other when they occur at the same simulation time.

This is a classic race condition, where the result depends on the actual thread order.
First of all, you have to decide whether the write race is actually a problem for you and/or whether there is a priority order in this case. If you don't care, the last access simply wins.
If the access isn't atomic, you might need a semaphore to ensure that only one access is handled at a time and the next one waits until the first has finished.
You can also try to control the order by changing the structure, by introducing thread ordering (wait_order), or, if possible, by removing timing altogether: instead of operating directly on the data you receive, you store it for a while and operate on it later.
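For the non-atomic case, here is a minimal semaphore sketch; the class, task name, and shared variable are made up for illustration. Note that this only helps when the competing accesses run in threads that are allowed to block (tasks or forked processes), not plain write() functions.

    class shared_model;
      int       balance;          // shared variable updated from two agents
      semaphore sem = new(1);     // one key -> mutual exclusion

      // Read-modify-write that consumes simulation time, so it is not atomic.
      task apply(int delta);
        int tmp;
        sem.get(1);               // block until the previous access has finished
        tmp = balance;
        #1ns;                     // stand-in for time-consuming modelling
        balance = tmp + delta;
        sem.put(1);               // release the key for the next access
      endtask
    endclass

If the two accesses are plain analysis-imp write() functions executing in zero time, the earlier answers apply instead: they already execute one at a time, and the real issue is their order, not atomicity.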

Do memory instructions pass through the load-store queue and issue queue in the microarchitecture

What is the difference between the issue queue and the LSQ (load-store queue) for memory instructions? Do memory instructions pass through both queues, or only through the LSQ?
If they pass through both queues, what is their order?
I'm assuming you're using ARM-like nomenclature here, so the issue queue is what Intel calls the RS (reservation station), and by "issue" you mean sending a uop that is ready for execution.
The answer is that memory instructions need to pass through both. All instructions need to be issued (except the ones that can be eliminated without execution, for example register moves, zero idioms, nops, etc.). Let's rephrase: all instructions that need to go through an ALU need to go through the issue process first. Memory instructions will simply use that step to calculate their addresses.
This is true for loads; for stores there is usually an internal split into store-address and store-data uops, so the store-address uop behaves like a load in that sense and calculates its address during that step.
There is usually a dedicated execution port and dedicated execution units for that, because the address calculation usually follows one of a few specific addressing modes (each architecture has a different set of these). Aside from that, the execution has to follow the same rules as any other operation in the CPU: it needs to have its sources ready and read from the register file (or bypassed from an in-flight operation), it needs to be arbitrated when the execution port is free, and it is prioritized by the same aging rules, so it makes sense that it uses the common path.
Once the memory operation has finished execution, it will be sent to the LSU (load-store unit, or the DCU, data-cache unit on Intel) and perform the actual memory access using the generated address. The LSU pipe will take care of the address translation, TLB lookups, the page walk if needed (though this is sometimes done in a dedicated unit), the address range and property checks, the cache lookup (if cacheable) and sending a miss to the next cache level or memory if needed. It may also trigger prefetches as part of the process.
For a load, when the LSU pipe has completed (which may require multiple passes and wakeups if the data was not available in the L1), the LSU will signal the issue queue again in order to wake up anyone who depended on the result.
For a store, the store-address uop may fetch the line into the cache in advance as an optimization, but the actual next step usually happens only after retirement (since stores may not be dispatched to memory while speculative, unless you have some tricks to handle that).
It's also worth mentioning that some CPUs try to optimize loads that can forward the data directly from prior stores instead of fetching it from the cache/memory. This can involve store-to-load forwarding (very common) or memory renaming (less common). The former is usually handled by the LSU internally, while the latter can be done much earlier and without the LSU (though the LSU pipe is usually still activated to validate the result).

Need for multi-threading in SystemVerilog using fork-join

In most textbooks advocating layered testbench designs, it is recommended that the different layers/blocks run in parallel. I'm currently unable to figure out why that is. Why can't we follow this sequence:
repeat for 1000 tests:
    generate a transaction
    drive the transaction on the DUT
    monitor the transaction on the DUT
    compare output with a reference
Instead, what is recommended is that all four blocks (generator, driver, monitor, and scoreboard/checker) run in parallel. My confusion is: why do we avoid the above-mentioned sequential behavior, in which we go through one test case at a time, and instead prefer having the different blocks run in parallel?
Some texts say that it is because that is how things are done in hardware, i.e. everything runs in parallel. However, the layered testbench is not needed to model any synthesizable hardware. So why do we have to restrict our verification environment/testbench to this hardware-like behavior?
Suppose that you have a fifo which you want to test. Your driver pushes data into it, and the monitor checks the other end. The data gets pushed when it is available and until the fifo is full; the consumer on the other end reads data when it can. So the pipe is sometimes full and sometimes empty.
When the fifo is full, the driver must stop. The monitor always works, but its values do not change at the same rate as the stimuli, and they are delayed due to the fifo depth.
In your example, when the fifo is full, the stopped driver will block the whole loop, so the monitor will not work either. Of course, you can come up with some conditional statements to bypass the stopped driver, but then you will need to run the monitor and the scoreboard on every iteration, even if the data is not changing.
With more complicated designs with multiple fifos, pipelines, delays, clock frequencies, etc., your loop will become so complicated that it would be difficult if not impossible to manage.
The problem is that with simple sequential programming it is not possible to express a blocking/wait condition for one statement without blocking the whole loop. It is much easier to do with parallel threads.
The general approach is to run the driver and the monitor in separate simulation threads. The monitor in this case waits for the data to appear and does not block the driver. The driver pushes data when it is available and can be blocked by a full fifo or by having nothing to drive; it does not block the monitor.
With a single monitor you could probably pack the scoreboard into the same thread as the monitor, but with multiple monitors that becomes problematic, in particular when all monitors run in separate threads. So the scoreboard should run as a separate thread as well.
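Here is a bare-bones, non-UVM sketch of that thread structure; the mailbox names, the mailbox bound of 4, and the end-of-test delay are arbitrary choices for illustration. Each block loops in its own forked thread and blocks only on its own condition, so a stalled driver never stops the monitor or scoreboard.

    `timescale 1ns/1ps
    module tb;
      bit clk;
      always #5 clk = ~clk;

      mailbox #(int) gen2drv = new(4);   // bounded: models back-pressure on the generator
      mailbox #(int) mon2sb  = new();    // monitor -> scoreboard

      initial begin
        fork
          // Generator: blocks only when the driver is falling behind.
          forever begin
            gen2drv.put($urandom());
          end
          // Driver: blocks on the (modelled) fifo-full condition, nothing else.
          forever begin
            int data;
            gen2drv.get(data);
            @(posedge clk);              // drive data into the DUT here
          end
          // Monitor: samples on its own schedule, delayed by the fifo depth.
          forever begin
            @(posedge clk);              // sample DUT outputs here
            mon2sb.put(0);               // placeholder for the observed value
          end
          // Scoreboard: consumes whatever the monitor observed.
          forever begin
            int observed;
            mon2sb.get(observed);        // compare against a reference model here
          end
        join_none
        #1000 $finish;                   // end of test (UVM would use objections)
      end
    endmodule

In a UVM testbench each of these branches becomes a component (sequencer, driver, monitor, scoreboard) whose run_phase provides the thread, but the blocking structure is the same.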
You are mixing two different concepts. The layered approach is a software concept that helps manage different abstraction levels, from software transactions (a frame of data) down to the individual pin wiggles. These layers are very similar to the OSI network model. Layering also helps with maintenance and reusability by defining clear interfaces that enable you to build up a larger system. It's hard to see the benefits of this on a testbench for a small combinational block.
Parallelism comes into play for other reasons. There are relatively few complete designs out there that can be tested as a single stream of inputs followed by comparing the output to a reference model. You might be able to test one small block of a design this way, but not a complete chip, as it typically has many interfaces that need to be driven in parallel.
But let's take the case of two simple blocks that you tested individually with the approach above. Now you want to connect them together so that the output of the first DUT drives the input of the second DUT:
Driver1 -> DUT1 -> DUT2 -> Monitor2
This works best if I originally write the drivers and monitors as separate objects running in parallel.

Queues: How to process dependent jobs

I am working on an application where multiple clients will be writing to a queue (or queues), and multiple workers will be processing jobs off the queue. The problem is that in some cases, jobs are dependent on each other. By 'dependent', I mean they need to be processed in order.
This typically happens when an entity is created by the user, then deleted shortly after. Obviously I want the first job (i.e. the creation) to take place before the deletion. The problem is that creation can take a lot longer than deletion, so I can't guarantee that it will be complete before the deletion job commences.
I imagine that this type of problem is reasonably common with asynchronous processing. What strategies are there to deal with it? I know that I can assign priorities to queues to have some control over the processing order, but this is not good enough in this case. I need concrete guarantees.
This may not fit your model, but the model I have used involves not providing the deletion functionality until the creation functionality is complete.
When the Create_XXX command is completed, it is responsible for raising an XXX_Created event, which also gets put on the queue. This event can then be handled to enable the deletion functionality, allowing the newly created item to be deleted.
The process of a Command completing, then raising an event which is handled and creates another Command is a common method of ensuring Commands get processed in the desired order.
I think a handy feature for your use case is job chaining:
https://laravel.com/docs/5.5/queues#job-chaining

Why is response time important in CPU scheduling?

I'm looking for an example of a job for which response time is important.
One definition of response time is:
The time taken in an interactive program from the issuance of a command to the commencement of a response to that command.
I've read that response time is important for interactivity, but I can't understand why. If the job isn't fully completed, what output could be produced that would be of interest to a user?
Wouldn't the user only care about how soon a job finishes, as that's the first time any output is produced?
For example, consider these two possible schedulings of two jobs:
Case 1: |---B---|---A---|
Case 2: |-A-|---B---|-A-|
Suppose that job A and B are issued at the same time, A being a command typed in by the user and B being some background process.
The response time for job A, as I understand it, would be shorter in case 2. As job A finishes (and produces output) at the same time in both cases, I don't understand how the user benefits from (or even notices) the better response time in case 2.
When writing an operating system, one has to take into consideration what the intended audience will be. In some cases it matters most to finish jobs as quickly as possible (supercomputer systems), in some cases it matters most to be as responsive as possible (regular desktop systems), and in some cases it matters most to be as predictable as possible (real-time systems).
For finishing jobs as fast as possible, tasks should be interrupted as rarely as possible (so long intervals between task switches are the best option). Here response time doesn't really matter much. It should be noted that task switches usually take some time (typically thousands of CPU cycles) due to having to save the state (including registers and paging structures) of the old task to memory and restore the state (including registers and paging structures) of the new task from memory. This also causes cache and TLB misses, since the cached information doesn't usually belong to the current process.
For being as responsive as possible, tasks should be interrupted as often as possible so the user doesn't experience so-called lag. This is where response time is important. Note, however, that on interrupt-driven architectures (like x86) an interrupt from the keyboard or the mouse will automatically pause execution of the current task and call the interrupt handler, which processes the input and sends it to the appropriate program.
For being as predictable as possible, input should be processed neither too fast nor too slow. This means that response time is constrained in both directions, and is thus much more important than in "as responsive as possible" designs. A misprediction can even be a fatal failure in mission-critical systems.
In a nutshell, importance of response time varies from design to design and can range from nearly unimportant to critical.
I think I have an answer to my own question. The problem was that I was only thinking about simple processes like ls that, once issued, run for some amount of time and then, when they're finished, deliver their first and only output.
However, suppose job A in the example from the question is a program with multiple print statements. Output will in that case be produced before the process is complete (and some of the printouts may well occur during the first scheduled burst). It would thus make sense for interactivity to want to begin running such a process as soon as possible.

Is it required to use spin_lock inside tasklets?

As far as I know, inside an interrupt handler there is no need for synchronization techniques: the interrupt handler cannot run concurrently with itself and, in short, preemption is disabled in the ISR. However, I have a question regarding tasklets. As far as I know, tasklets run in interrupt context, so in my opinion there is no need for a spin lock inside a tasklet function. However, I am not sure about this. Can somebody please explain? Thanks for your replies.
If data is shared between the top half and the bottom half, then use a lock. The simple rule for locking: locks are meant to protect data, not code.
1. What to protect?
2. Why protect it?
3. How to protect it?
Two tasklets of the same type do not ever run simultaneously. Thus, there is no need to protect data used only within a single type of tasklet. If the data is shared between two different tasklets, however, you must obtain a normal spin lock before accessing the data in the bottom half. You do not need to disable bottom halves because a tasklet never preempts another running tasklet on the same processor.
For synchronization between code running in process context (A) and code running in softirq context (B) we need to use special locking primitives. We must use spinlock operations augmented with deactivation of bottom-half handlers on the current processor in (A), and in (B) only basic spinlock operations. Using spinlocks makes sure that we don't have races between multiple CPUs, while deactivating the softirqs makes sure that we don't deadlock if the softirq is scheduled on the same CPU where we already acquired a spinlock. (c) Kernel docs