What is the difference between the issue queue and lsq queue for
memory instructions? Do memory instructions pass through both queues, or do they only pass
through the lsq queue.
If they pass through both queues what is their order?
I'm assuming you use the arm-like nomenclature here so the issue queue is what Intel calls RS (reservation station) and by issue you mean sending a uop ready for execution.
The answer is that memory instructions need to pass both. All instructions need to be issued (except the ones that can be eliminated without execution, for example register moves, zero idioms, nops, etc..). Let's rephrase - all instructions that need to go through an ALU need to go through the issue process first. Memory instructions will simply use that step to calculate their addresses.
This is true for loads, for stores there is usually an internal split into store-address and store-data, so the store-address will behave like a load in that sense and calculate its address during that step.
There is usually a dedicated execution port for that and dedicated execution units because the address calculation usually follows one of few specific addressing modes (each architecture has a different set of these), but aside from that the execution needs to follow the same rules like any other operation in the CPU - it needs to have its sources ready and read from the register file or bypassed from an in flight operation, it needs to get arbitrated when the execution port is free and prioritized by the same aging rules, so it makes sense that it uses the common path.
Once the memory operation has finished execution, it will be sent to the LSU (load-store unit, or the DCU, data-cache unit on Intel) and perform the actual memory access using the generated address. The LSU pipe will take care of the address translation, TLB lookups, the page walk if needed (though this is sometimes done in a dedicated unit), the address range and property checks, the cache lookup (if cacheable) and sending a miss to the next cache level or memory if needed. It may also trigger prefetches as part of the process.
For a load, when the LSU pipe has completed (which may require multiple passes and wakeups if the data was not available in the L1), the LSU will signal the issue queue again in order to wakeup anyone who was depended on the result.
For a store, store-address may fetch the line to the cache in advance as an optimization but the actual next step is usually to wakeup after retirement (since stores may not be dispatched to memory while speculative, unless you have some tricks to handle that).
It's also worth to mention that some CPUs try to optimize loads that can forward the data directly from prior stores instead of fetching it from the cache/memory. This can include forwarding (very common) or memory renaming (less common). The former is usually handled by the LSU internally, but the latter can be done much earlier and without the LSU (though the LSU pipe is usually still activated to validate the result).
In most text books advocating layered testbench designs, it is recommended that different layers/block run in parallel. I'm currently unable to figure out the reason why is it so. Why cannot we follow the following sequence.
repeat for 1000 tests
generate a transaction
drive the transaction on the DUT
monitor the transaction on the DUT
compare output with a reference
Instead, what is recommended is that all four blocks generator, driver, monitor and scoreboard/checker should run in parallel. My confusion is that why do we avoid the above mentioned sequential behavior in which we go through tests one test case at a time and prefer different blocks running in parallel.
Some texts say that it is because that is how things are done in hardware, i.e. everything runs in parallel. However, the layered testbench is not needed to model any synthesizable hardware. So, why do we have to restrict our verification enivornment/testbench to follow these hardware-like behavior.
A sample block diagram that I'm referring to is given below:
Suppose that you have a fifo which you want to test. Your driver pushes data into it, and the monitor checks the other end. The data gets pushed when it is available and till the fifo is full, the consumer on the other end reads data when it can. So, the pipe gets sometimes full, sometimes empty.
When the fifo is full, the driver must stop. The monitor works always, but its values do not change at the same frequency as the stimuli and it is delayed due to the fifo depth.
In your example, when the fifo is full, the stopped driver will block the whole loop, so the monitor will not work either. Of course, you can come up with some conditional statements which will bypass stopped driver. But you will need to run the monitor and the scoreboard every time, even if the data is not changing.
With more complicated designs with multiple fifos, pipelines, delays, clock frequencies, etc., your loop will become so complicated that it would be difficult if not impossible to manage.
The problem is that in the simple programming it is not possible to express block/wait conditions for statement without blocking the whole loop. It is much easier to do with parallel threads.
The general approach is to run driver and monitor in separate simulation threads. Monitor in this case waits for the data to appear and does not block the driver. The driver pushes data when it is available and can be blocked by fifo full or if there is nothing to drive. It does not block the monitor.
With a single monitor, you can probably pack the scoreboard in the same thread with the monitor, but with multiple monitors it will be problematic, in particular when all monitors run in separate threads. So, the scoreboard should run as a separate thread as well.
You are mixing two different concepts. The layered approach is a software concept that helps manage different abstraction levels from software transactions (a frame of data) to the individual pin wiggles. These layers are very similar to OSI Network Model. Layering also help with maintenance and reusability by defining clear interfaces that enable you to build up a larger system. It's hard to see the benefits of this on a testbench for a small combinational block.
Parallelism come into play for other reasons. There are relatively few complete designs out there that can be tested as a single stream of inputs and then comparing the output to a reference model. You might be able to test one small block of a design this way, but not a complete chip as it typically has many interfaces that need to be driven in parallel.
But let's take the case of two simple blocks that you tested individually with the approach above. Now you want to connect them together where the output of the first DUT becomes the driver of the second DUT
Driver1 -> DUT1 -> DUT2 -> Monitor2
This works best if I originally write the drivers and monitors as separate objects running in parallel.
Because by definition of atomicity, it implies "either all occur, or nothing occurs". But if two processes simultaneously perform a wait on the same semaphore on two different processors, it does not violate atomicity, but will lead to problems. So, what do you exactly mean by atomicity in this context? Shouldn't they be performed in a mutually exclusive manner also?
Let's say you have 2 threads and a semaphore with count of 1.
If they both down() at the same time, the atomicity of the primitive guarantees that one will be granted the semaphore and the other one will go to sleep. In particular it is impossible for both to decide to go to sleep OR both acquiring it.
Similarly, down() vs up(). up() will release and wakeup as necessary. In particular it is impossible for the thread doing down() to go to sleep after up() released it.
It's the entire point of the primitive.
The how of implementing semaphores depends upon whether they will work on a single processor or multi processor. On a single processor, a system can lock the semaphore by blocking interrupts for the short time required or using atomic instructions (ie that cannot be interrupted).
Obviously, that does not work in a multi processor system. There, you have to implement semaphores using interlocked instructions that block other processes from accessing the data at the same time.
So it all depends upon how your semaphore is implemented.
I have a doubt regarding UVM. Let's think I have a DUT with two interfaces, each one with its agent, generating transactions with the same clock. These transactions are handled with analysis imports (and write functions) on the scoreboard. My problem is that both these transactions read/modify shared variables of the scoreboard.
My questions are:
1) Have I to guarantee mutual exclusion explicitly though a semaphore? (i suppose yes)
2) Is this, in general, a correct way to proceed?
3) and the main problem, can in some way the order of execution be fixed?
Depending on that order the values of shared variables can change, generating inconsistency. Moreover, that order is fixed by specifications.
Thanks in advance.
While SystemVerilog tasks and functions do run concurrently, they do not run in parallel. It is important to understand the difference between parallelism and concurrency and it has been explained well here.
So while a SystemVerilog task or function could be executing concurrently with another task or function, in reality it does not actually run at the same time (run time context). The SystemVerilog scheduler keeps a list of all the tasks and functions that need to run on the same simulation time and at that time it executes them one-by-one (sequentially) on the same processor (concurrency) and not together on multiple processors (parallelism). As a result mutual exclusion is implicit and you do not need to use semaphores on that account.
The sequence in which two such concurrent functions would be executed is not deterministic but it is repeatable. So when you execute a testbench multiple times on the same simulator, the sequence of execution would be same. But two different simulators (or different versions of the same simulator) could execute these functions in a different order.
If the specifications require a certain order of execution, you need to ensure that order by making one of these tasks/functions wait on the other. In your scoreboard example, since you are using analysis port, you will have two "write" functions (perhaps using uvm_analysis_imp_decl macro) executing concurrently. To ensure an order, (since functions can not wait) you can fork out join_none threads and make one of the threads wait on the other by introducing an event that gets triggered at the conclusion of the first thread and the other thread waits for this event at the start.
This is a pretty difficult problem to address. If you get 2 transactions in the same time step, you have to be able to process them regardless of the order in which they get sent to your scoreboard. You can't know for sure which monitor will get triggered first. The only thing you can do is collect the transactions and at the end of the time step do your modeling/checking/etc.
Semaphores only help you if you have concurrent threads that take (simulation) time that are trying to access a shared resource. If you get things from an analysis port, then you get them in 0 time, so semaphores won't help you here.
So to my understanding, the answer is: compiler/vendor/uvm cannot ensure the order of execution. If you need to ensure the order which actually happen in same time step, you need to use semaphore correctly to make it work the way you want.
Another thing is, only you yourself know which one must execute after the other if they are in same simulation time.
this is a classical race condition where the result depends upon the actual thread order...
first of all you have to decide if the write race is problematic for you and/or if there is a priority order in this case. if you dont care the last access would win.
if the access isnt atomic you might need a semaphore to ensure only one access is handled at a time and the next waits till the first has finished.
you can also try to control order by changing the structure or introducing thread ordering (wait_order) or if possible you remove timing at all (here instead of directly operating with the data you get you simply store the data for some time and then later you operate on it.
Is there a relationship between a kernel and a user thread?
Some operating system textbooks said that "maps one (many) user thread to one (many) kernel thread". What does map means here?
When they say map, they mean that each kernel thread is assigned to a certain number of user mode threads.
Kernel threads are used to provide privileged services to applications (such as system calls ). They are also used by the kernel to keep track of what all is running on the system, how much of which resources are allocated to what process, and to schedule them.
If your applications make heavy use of system calls, more user threads per kernel thread, and your applications will run slower. This is because the kernel thread will become a bottleneck, since all system calls will pass through it.
On the flip side though, if your programs rarely use system calls (or other kernel services), you can assign a large number of user threads to a kernel thread without much performance penalty, other than overhead.
You can increase the number of kernel threads, but this adds overhead to the kernel in general, so while individual threads will be more responsive with respect to system calls, the system as a whole will become slower.
That is why it is important to find a good balance between the number of kernel threads and the number of user threads per kernel thread.
http://www.informit.com/articles/printerfriendly.aspx?p=25075
Implementing Threads in User Space
There are two main ways to implement a threads package: in user space and in the kernel. The choice is moderately controversial, and a hybrid implementation is also possible. We will now describe these methods, along with their advantages and disadvantages.
The first method is to put the threads package entirely in user space. The kernel knows nothing about them. As far as the kernel is concerned, it is managing ordinary, single-threaded processes. The first, and most obvious, advantage is that a user-level threads package can be implemented on an operating system that does not support threads. All operating systems used to fall into this category, and even now some still do.
All of these implementations have the same general structure, which is illustrated in Fig. 2-8(a). The threads run on top of a run-time system, which is a collection of procedures that manage threads. We have seen four of these already: thread_create, thread_exit, thread_wait, and thread_yield, but usually there are more.
When threads are managed in user space, each process needs its own private thread table to keep track of the threads in that process. This table is analogous to the kernel's process table, except that it keeps track only of the per-thread properties such the each thread's program counter, stack pointer, registers, state, etc. The thread table is managed by the run-time system. When a thread is moved to ready state or blocked state, the information needed to restart it is stored in the thread table, exactly the same way as the kernel stores information about processes in the process table.
When a thread does something that may cause it to become blocked locally, for example, waiting for another thread in its process to complete some work, it calls a run-time system procedure. This procedure checks to see if the thread must be put into blocked state. If so, it stores the thread's registers (i.e., its own) in the thread table, looks in the table for a ready thread to run, and reloads the machine registers with the new thread's saved values. As soon as the stack pointer and program counter have been switched, the new thread comes to life again automatically. If the machine has an instruction to store all the registers and another one to load them all, the entire thread switch can be done in a handful of instructions. Doing thread switching like this is at least an order of magnitude faster than trapping to the kernel and is a strong argument in favor of user-level threads packages.
However, there is one key difference with processes. When a thread is finished running for the moment, for example, when it calls thread_yield, the code of thread_yield can save the thread's information in the thread table itself. Furthermore, it can then call the thread scheduler to pick another thread to run. The procedure that saves the thread's state and the scheduler are just local procedures, so invoking them is much more efficient than making a kernel call. Among other issues, no trap is needed, no context switch is needed, the memory cache need not be flushed, and so on. This makes thread scheduling very fast.
User-level threads also have other advantages. They allow each process to have its own customized scheduling algorithm. For some applications, for example, those with a garbage collector thread, not having to worry about a thread being stopped at an inconvenient moment is a plus. They also scale better, since kernel threads invariably require some table space and stack space in the kernel, which can be a problem if there are a very large number of threads.
Despite their better performance, user-level threads packages have some major problems. First among these is the problem of how blocking system calls are implemented. Suppose that a thread reads from the keyboard before any keys have been hit. Letting the thread actually make the system call is unacceptable, since this will stop all the threads. One of the main goals of having threads in the first place was to allow each one to use blocking calls, but to prevent one blocked thread from affecting the others. With blocking system calls, it is hard to see how this goal can be achieved readily.
The system calls could all be changed to be nonblocking (e.g., a read on the keyboard would just return 0 bytes if no characters were already buffered), but requiring changes to the operating system is unattractive. Besides, one of the arguments for user-level threads was precisely that they could run with existing operating systems. In addition, changing the semantics of read will require changes to many user programs.
Another alternative is possible in the event that it is possible to tell in advance if a call will block. In some versions of UNIX, a system call, select, exists, which allows the caller to tell whether a prospective read will block. When this call is present, the library procedure read can be replaced with a new one that first does a select call and then only does the read call if it is safe (i.e., will not block). If the read call will block, the call is not made. Instead, another thread is run. The next time the run-time system gets control, it can check again to see if the read is now safe. This approach requires rewriting parts of the system call library, is inefficient and inelegant, but there is little choice. The code placed around the system call to do the checking is called a jacket or wrapper.
Somewhat analogous to the problem of blocking system calls is the problem of page faults. We will study these in Chap. 4. For the moment, it is sufficient to say that computers can be set up in such a way that not all of the program is in main memory at once. If the program calls or jumps to an instruction that is not in memory, a page fault occurs and the operating system will go and get the missing instruction (and its neighbors) from disk. This is called a page fault. The process is blocked while the necessary instruction is being located and read in. If a thread causes a page fault, the kernel, not even knowing about the existence of threads, naturally blocks the entire process until the disk I/O is complete, even though other threads might be runnable.
Another problem with user-level thread packages is that if a thread starts running, no other thread in that process will ever run unless the first thread voluntarily gives up the CPU. Within a single process, there are no clock interrupts, making it impossible to schedule processes round-robin fashion (taking turns). Unless a thread enters the run-time system of its own free will, the scheduler will never get a chance.
One possible solution to the problem of threads running forever is to have the run-time system request a clock signal (interrupt) once a second to give it control, but this, too, is crude and messy to program. Periodic clock interrupts at a higher frequency are not always possible, and even if they are, the total overhead may be substantial. Furthermore, a thread might also need a clock interrupt, interfering with the run-time system's use of the clock.
Another, and probably the most devastating argument against user-level threads, is that programmers generally want threads precisely in applications where the threads block often, as, for example, in a multithreaded Web server. These threads are constantly making system calls. Once a trap has occurred to the kernel to carry out the system call, it is hardly any more work for the kernel to switch threads if the old one has blocked, and having the kernel do this eliminates the need for constantly making select system calls that check to see if read system calls are safe. For applications that are essentially entirely CPU bound and rarely block, what is the point of having threads at all? No one would seriously propose computing the first n prime numbers or playing chess using threads because there is nothing to be gained by doing it that way.
User threads are managed in userspace - that means scheduling, switching, etc. are not from the kernel.
Since, ultimately, the OS kernel is responsible for context switching between "execution units" - your user threads must be associated (ie., "map") to a kernel schedulable object - a kernel thread†1.
So, given N user threads - you could use N kernel threads (a 1:1 map). That allows you to take advantage of the kernel's hardware multi-processing (running on multiple CPUs) and be a pretty simplistic library - basically just deferring most of the work to the kernel. It does, however, make your app portable between OS's as you're not directly calling the kernel thread functions. I believe that POSIX Threads (PThreads) is the preferred *nix implementation, and that it follows the 1:1 map (making it virtually equivalent to a kernel thread). That, however, is not guaranteed as it'd be implementation dependent (a main reason for using PThreads would be portability between kernels).
Or, you could use only 1 kernel thread. That'd allow you to run on non multitasking OS's, or be completely in charge of scheduling. Windows' User Mode Scheduling is an example of this N:1 map.
Or, you could map to an arbitrary number of kernel threads - a N:M map. Windows has Fibers, which would allow you to map N fibers to M kernel threads and cooperatively schedule them. A threadpool could also be an example of this - N workitems for M threads.
†1: A process has at least 1 kernel thread, which is the actual execution unit. Also, a kernel thread must be contained in a process. OS's must schedule the thread to run - not the process.
This is a question about thread library implement.
In Linux, a thread (or task) could be in user space or in kernel space. The process enter kernel space when it ask kernel to do something by syscall(read, write or ioctl).
There is also a so-called kernel-thread that runs always in kernel space and does not represent any user process.
According to Wikipedia and Oracle, user-level threads are actually in a layer mounted on the kernel threads; not that kernel threads execute alongside user-level threads but that, generally speaking, the only entities that are actually executed by the processor/OS are kernel threads.
For example, assume that we have a program with 2 user-level threads, both mapped to (i.e. assigned) the same kernel thread. Sometimes, the kernel thread runs the first user-level thread (and it is said that currently this kernel thread is mapped to the first user-level thread) and some other times the kernel thread runs the second user-level thread. So we say that we have two user-level threads mapped to the same kernel thread.
As a clarification:
The core of an OS is called its kernel, so the threads at the kernel level (i.e. the threads that the kernel knows of and manages) are called kernel threads, the calls to the OS core for services can be called kernel calls, and ... . The only definite relation between kernel things is that they are strongly related to the OS core, nothing more.