AMD-V interrupt virtualization (AVIC): hardware support and IOMMU interaction - x86-64

I came across the Advanced Virtual Interrupt Controller (AVIC) in the AMD64 Architecture Programmer's Manual (APM), Volume 2. Some bits were unclear to me, so I quickly skimmed the sources of popular open-source hypervisors (Qemu/KVM and Xen, to name a few) to see how it is used together with the AMD IOMMU. It turned out that none of them use AVIC, and Bochs/Qemu don't emulate it either (there is an IOMMU emulation in Qemu, but it targets Revision 1 of the specification, which doesn't virtualize interrupts).
So the two questions arise:
Why is AVIC so "unpopular"? Maybe it isn't broadly supported by the CPUs on the market today, or is it that all these hypervisors have a long history and already virtualize interrupts themselves, so porting to AVIC isn't a top priority? (Or have I just missed something in the sources?)
[The original question] Is the Guest Virtual APIC Table Root Pointer in the IOMMU's Device Table Entry a pointer to the Physical APIC ID Table, as defined in APM Vol. 2, Sect. 15.29.2.3?
Thank you for the clarifications.

Looks like AVIC is not supported even in family 16h processors (the "Preliminary BIOS and Kernel Developer's Guide (BKDG) for AMD Family 16h Models 00h-0Fh (Kabini) Processors" states that CPUID Fn8000_000A_EDX[AVIC] is 0). This is likely why it is not implemented in the hypervisors either.
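For reference, a minimal sketch of how such a check could look with GCC/Clang; the bit position (EDX bit 13 of leaf 0x8000_000A) is from my reading of the APM and should be double-checked against the revision you use:

    #include <stdio.h>
    #include <cpuid.h>   /* GCC/Clang helper for the CPUID instruction */

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* Leaf 0x8000000A reports SVM features; bail out if it is unavailable. */
        if (!__get_cpuid(0x8000000A, &eax, &ebx, &ecx, &edx)) {
            puts("CPUID leaf 0x8000000A not supported (no SVM feature report)");
            return 1;
        }

        /* AVIC is reported in EDX bit 13 (per my reading of the APM). */
        printf("AVIC: %s\n", (edx & (1u << 13)) ? "supported" : "not supported");
        return 0;
    }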
It also turned out that the second question arose because I was using incompatible revisions of the APM and the IOMMU specification. Once I got matching ones (APM Rev. 3.24 and IOMMU spec Rev. 2.6), it became clear that the IOMMU data structures had changed significantly: the Guest Virtual APIC Table Root Pointer moved from the DTE to the IRTE, and is now strictly defined as a pointer to the virtual APIC backing page.
A useful mechanism indeed, and it's a pity it's not supported in the hardware yet.

AVIC was announced with a bit of fanfare in 2012 (see http://www.slideshare.net/xen_com_mgr/introduction-of-amd-virtual-interrupt-controller and http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/09/2012-lpc-virt-interrupt-virt-kvm-roedel.pdf) but has yet to materialize in actual hardware.
Emulating x2APIC for older hardware provides a reasonable performance boost. Qemu now emulates x2APIC even for first-generation Opterons: http://lists.nongnu.org/archive/html/qemu-devel/2014-01/msg02441.html AVIC, as originally designed, doesn't provide x2APIC support, which might be another reason why it never took off (see slides 2-4 in the second link).
Simultaneously, in 2012, Intel announced its own experimental "APICv", although it lacked a catchy name at its introduction: http://www.linuxplumbersconf.org/2012/wp-content/uploads/2012/09/2012-lpc-virt-intel-vt-feat-nakajima.pdf But Intel's technology appears to be real, with benchmarks appearing in late 2013: https://software.intel.com/en-us/blogs/2013/12/17/apic-virtualization-performance-testing-and-iozone Intel's APICv is implemented in hardware in the Ivy Bridge EP series (http://ark.intel.com/products/codename/68926/Ivy-Bridge-EP##All), sold under the brand names Xeon E5-26xx v2 (introduced in late 2013) and Xeon E5-46xx v2 (introduced in early 2014).

Related

Why is Page Size specified as part of Instruction Set Architecture?

I am trying to understand why Page Size is specified as part of an ISA.
More specifically, I am looking for details of where any of the hardware modules (MMU, TLB), apart from the operating system, use the page-size information to provide certain functionality.
Please let me know the reasons Page Size has to be part of the ISA instead of just being decided by the OS.
Thanks.
The TLB hardware has to know the page size to figure out whether a translation applies to an address or not. For example, given a translation, does an address 2500 bytes above its base use that translation or not?
Or to put it another way, the TLB has to know which address bits are part of the page offset (within a page), and which bits need translating from virtual to physical.
Also, on architectures with hardware page walk, the whole page-table format is part of the ISA, and the typical design uses the virtual page number as an index to find the right entry (e.g. x86-64's 4-level page tables), not a linear or binary search through the page table for an entry containing the virtual address being looked up. Normally this same design is used for page tables walked by software, AFAIK.
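To make that bit-slicing concrete, here is a small sketch (my own illustration) of how a 48-bit x86-64 virtual address breaks down for 4 KiB pages: the low 12 bits are the page offset that needs no translation, and each 9-bit group above that indexes one level of the 4-level page tables.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint64_t vaddr = 0x00007f1234567abcULL;  /* arbitrary user-space address */

        /* 4 KiB pages: bits 0-11 are the offset within the page. */
        uint64_t offset = vaddr & 0xfffULL;

        /* Each paging level is indexed by 9 bits of the virtual page number. */
        uint64_t pt   = (vaddr >> 12) & 0x1ffULL;  /* page table index     */
        uint64_t pd   = (vaddr >> 21) & 0x1ffULL;  /* page directory index */
        uint64_t pdpt = (vaddr >> 30) & 0x1ffULL;  /* PDPT index           */
        uint64_t pml4 = (vaddr >> 39) & 0x1ffULL;  /* PML4 index           */

        printf("PML4=%llu PDPT=%llu PD=%llu PT=%llu offset=0x%llx\n",
               (unsigned long long)pml4, (unsigned long long)pdpt,
               (unsigned long long)pd, (unsigned long long)pt,
               (unsigned long long)offset);
        return 0;
    }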
It is possible to build a TLB where each entry has a mask that controls how many address bits it matches, i.e. a single TLB can have entries for pages of multiple sizes. This only works if pages have power-of-2 sizes and are naturally aligned (i.e. the start address of a page is always a multiple of its size, so zeroing the low bits of an address inside a page gives you the page-start address).
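A sketch of that matching logic (my own illustration; a real TLB checks its entries in parallel in hardware, not in a loop): each entry stores a mask derived from its page size, and an address hits the entry when the bits above the mask are equal.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct tlb_entry {
        uint64_t vtag;       /* virtual page-start address (low bits zero) */
        uint64_t page_mask;  /* page_size - 1; page size is a power of two */
        uint64_t pframe;     /* physical page-start address                */
    };

    /* Does this entry translate the given virtual address? */
    static bool tlb_hit(const struct tlb_entry *e, uint64_t vaddr)
    {
        return (vaddr & ~e->page_mask) == e->vtag;
    }

    /* On a hit, the offset bits pass through untranslated. */
    static uint64_t tlb_translate(const struct tlb_entry *e, uint64_t vaddr)
    {
        return e->pframe | (vaddr & e->page_mask);
    }

    int main(void)
    {
        /* A 4 KiB entry mapping virtual 0x400000: an address 2500 bytes above
         * the page start still hits it, because 2500 < 4096. */
        struct tlb_entry e = { 0x400000, 0xfff, 0x12345000 };
        uint64_t a = 0x400000 + 2500;
        printf("hit=%d phys=0x%llx\n", tlb_hit(&e, a),
               (unsigned long long)tlb_translate(&e, a));
        return 0;
    }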
You could potentially use this with an extent-based page-table format, where you have one entry for each contiguous mapping instead of one entry for each page.
Page-walks would probably be more costly, having to check more entries for more mappings, but the same number of TLB entries could cover more address space.
In many cases OSes want to be able to easily unmap or even page out unused pages, and this conflicts with using huge pages that cover a mix of hot and cold data or especially code. (But normal fixed-size hugepages have this problem, too, so x86-64's 2M and 1G hugepages aren't always a win vs. standard 4k pages.)
Page size isn't a part of the ISA (what a compiler would normally emit) for x86_64. The instruction set architecture for x86_64 is formally known as Intel® 64 Architecture, and it is briefly described in section 2.2.10 (volume 1) of the Intel® 64 and IA-32 Architectures Software Developer’s Manual. It describes what an application program can see and do. There is something similar for ARMv8.
Instead, page size is left to the OS, and it isn't part of the application-level ISA. This is because page sizes can vary among implementations and according to mode settings (4K/2M/4M/1G). x86_64 implementations present something like an ISA to the OS, which Intel refers to as the system programming level (what an OS would use). That's described in Volume 3 (the system programming guide) of Intel's Software Developer's Manual.
That level describes page sizes and modes. But a 'correct' application program should run with different page sizes on different systems in different page size modes.
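As a practical illustration of page size being an OS-level property rather than something baked into application code, a portable program can simply ask the OS at run time (a minimal POSIX sketch):

    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* The kernel reports the page size it is using; the program does not
         * assume a particular value at compile time. */
        long page_size = sysconf(_SC_PAGESIZE);
        printf("page size: %ld bytes\n", page_size);
        return 0;
    }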

How do operating systems allow userspace programs to interact with kernelspace programs?

This isn't quite a question about a specific OS, but let's take Windows as an example. A userspace program uses the Windows API to communicate with kernelspace. However, I don't understand how that's possible. The API, according to MS websites, lives in userspace. In order to access kernelspace it has to be in kernelspace, if I understand correctly. So what is the mechanism by which the Windows API gets extra privileges to speak to kernelspace? In which space does that mechanism operate? Is this sort of thing universal to all modern PC OSes?
As you're already aware, there are a bunch of facilities exposed to userspace programs by the Windows kernel (if you're curious, there are lists of system calls). These system calls are all identified by a unique number, which isn't part of the publicly documented interface given by Microsoft. Instead, when you call a publicly exposed function from your program, a DLL installed when you install (or update) Windows provides an entry point that is just a normal, unprivileged user-mode function call. This DLL knows the mapping between the public interfaces and the system calls available in the currently running kernel. These mappings are not always 1:1, which allows for tweaks and enhancements without breaking existing code that uses stable interfaces.
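Conceptually, the user-mode side of that mapping is not much more than a table plus a dispatch routine. A toy sketch (the names and numbers below are invented for illustration; they are not actual Windows APIs or syscall numbers):

    #include <stdio.h>

    /* Invented syscall numbers: the real ones change between kernel builds,
     * which is exactly why they are hidden behind the DLL. */
    enum { SYS_CREATE_FILE = 0x42, SYS_WRITE_FILE = 0x43 };

    /* Stand-in for the stub that actually switches to kernel mode
     * (SYSENTER/SYSCALL/int 2Eh, depending on the machine). */
    static long enter_kernel(unsigned long number, void *args)
    {
        printf("would trap into the kernel with syscall #%lu\n", number);
        (void)args;
        return 0;
    }

    /* The publicly documented, stable entry point exported by the DLL. */
    long MyCreateFile(const char *path)
    {
        /* Marshal arguments, then dispatch through the current kernel's number. */
        return enter_kernel(SYS_CREATE_FILE, (void *)path);
    }

    int main(void)
    {
        return (int)MyCreateFile("C:\\example.txt");
    }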
When some userland code calls one of these functions, its role is to prepare the arguments for the system call and then initiate the jump into kernel mode. How exactly that jump occurs is specific to the architecture Windows is currently running on; in fact it varies not just between x86 and ARM but even between AMD and Intel x86 systems. I'll talk only about the modern 32-bit Intel x86 case (using the SYSENTER instruction) here, for simplicity. On x86 most of the other variations are relatively minor; for instance, int 2Eh was used before SYSENTER was supported.
Early in boot up the operating system does a bunch of work to prepare for enabling a userland and system calls from it. Understanding this is critical to understanding how system calls really work.
First, let's rewind a little and consider what exactly we mean by userland and kernel mode. On x86, when we talk about privileged vs. unprivileged code, we talk about "rings". There are actually 4 (ignoring hypervisors), but for various reasons nobody really used anything but ring 0 (kernel) and ring 3 (userland). When we run code on x86, the address being executed (EIP) and the data being read/written come from segments.
Segments are mostly just a historical accident left over from the days before virtual addressing on x86 was a thing. They are, however, important here because special registers define which segments are currently in use when we execute instructions or otherwise reference memory. Segments on x86 are all defined in a big table called the Global Descriptor Table, or GDT. (There is also a Local Descriptor Table, LDT, but it won't further the current discussion.) The important point is that the (arcane) layout of a table entry includes 2 bits, called the DPL, which define the privilege level of the segment. You'll notice that 2 bits is exactly enough to encode 4 privilege levels.
So in short, when we talk about "executing in kernel mode" we really just mean that our active code segment (CS) and data segment selectors point to entries in the GDT whose DPL is 0. Likewise, userland has CS and data segment selectors pointing to GDT entries with DPL set to 3 and no access to kernel addresses. (There are other selectors too, but to keep it simple we'll just consider "code" and "data" for now.)
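For the curious, a rough sketch of where those 2 bits live (my own simplification of the 8-byte descriptor layout; consult the Intel/AMD manuals for the full format): the DPL occupies bits 45-46 of the descriptor, i.e. bits 5-6 of the access byte.

    #include <stdint.h>
    #include <stdio.h>

    /* Extract the DPL (descriptor privilege level) from a raw 8-byte GDT entry. */
    static unsigned dpl_of(uint64_t descriptor)
    {
        return (unsigned)((descriptor >> 45) & 0x3);
    }

    int main(void)
    {
        /* Flat 4 GiB code segments as a typical OS might define them
         * (example values, not dumped from a real Windows GDT). */
        uint64_t kernel_code = 0x00cf9a000000ffffULL;  /* DPL = 0 */
        uint64_t user_code   = 0x00cffa000000ffffULL;  /* DPL = 3 */

        printf("kernel code DPL = %u\n", dpl_of(kernel_code));
        printf("user   code DPL = %u\n", dpl_of(user_code));
        return 0;
    }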
Back to early kernel boot-up: during start-up the kernel creates the GDT entries we need. (These have to be laid out in a specific order for SYSENTER to work, but that's mostly just an implementation detail.) There are also some "model-specific registers" (MSRs) that control how the processor behaves; they can only be set by privileged code. Three of them are important here (see the sketch after this list):
IA32_SYSENTER_ESP
IA32_SYSENTER_EIP
IA32_SYSENTER_CS
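A hedged sketch of the kernel-side setup (ring 0 only; WRMSR faults in user mode). The MSR indices 0x174/0x175/0x176 are the architectural numbers for IA32_SYSENTER_CS/ESP/EIP; the function and parameter names are my own placeholders, not Windows symbols.

    #include <stdint.h>

    /* Architectural MSR indices for the SYSENTER machinery. */
    #define IA32_SYSENTER_CS   0x174u
    #define IA32_SYSENTER_ESP  0x175u
    #define IA32_SYSENTER_EIP  0x176u

    /* WRMSR is a privileged instruction: executing this outside ring 0 faults. */
    static inline void wrmsr(uint32_t msr, uint64_t value)
    {
        __asm__ volatile("wrmsr"
                         :
                         : "c"(msr), "a"((uint32_t)value),
                           "d"((uint32_t)(value >> 32)));
    }

    /* Called once, early in boot, after the GDT has been laid out. The selector,
     * stack and entry point are placeholders for the real kernel values. */
    void setup_sysenter(uint16_t kernel_cs, uint64_t kernel_stack, uint64_t entry_point)
    {
        wrmsr(IA32_SYSENTER_CS,  kernel_cs);    /* kernel code selector (DPL 0) */
        wrmsr(IA32_SYSENTER_ESP, kernel_stack); /* stack to switch to on entry  */
        wrmsr(IA32_SYSENTER_EIP, entry_point);  /* kernel's system-call handler */
    }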
Recall that we've got some code running in userland (ring 3) that wants to transition to ring 0. Let's assume it has saved any registers it needs to per the calling convention and put the arguments into the registers the call expects. We then hit the SYSENTER instruction (actually via KiFastSystemCall, I think). The SYSENTER instruction is special: it modifies the current code and data segment selectors based on the value the kernel set up in the model-specific register IA32_SYSENTER_CS (the stack/data segment values are computed as an offset from IA32_SYSENTER_CS). Subsequently the stack pointer (ESP) is set to the kernel stack that was set up earlier for handling system calls and saved into the MSR IA32_SYSENTER_ESP, and likewise the instruction pointer (EIP) is loaded from IA32_SYSENTER_EIP.
Since the CS selector now points to a GDT entry with DPL set to 0 and EIP points to kernel mode code on a kernel stack we're running in the kernel at this point.
From here onwards the kernel mode code can read and write memory from both kernel and userspace (with some appropriate caution) to undertake the actual work needed to perform the system call. The arguments to the system call can be read from registers etc. according to the calling convention, but any arguments that are actually pointers back to userland or handles to kernel objects can be accessed to read larger blocks of data too.
When the system call is over the process is basically reversed and we end up back in userland with DPL 3 for the selectors.
It's the CPU that acts as the intermediary for transferring information between user memory space (accessible in user mode) and protected memory space (accessible in kernel mode), via CPU registers.
Here's an example:
Suppose a user writes a program in a higher-level language. When the program executes, the CPU generates virtual addresses.
Before any read or write occurs, each virtual address is converted to a physical address. Because the translation structures used by the memory management unit are stored in protected memory and are only accessible in kernel mode, the kernel sets up that translation; the resulting physical address is what the CPU finally uses, and only then does the read or write take place.

Are TPM's locality 2 registers accessible from SRTM?

We are designing a solution that uses virtualization and a TPM. Our intention is to utilize its locality feature to control access to certain PCRs. The question is whether it makes sense to read, say, the status register in locality 2 from the statically booted hypervisor or guest OS. If I understand correctly, the hypervisor and the guest OS run in locality 0, and there seems to be nothing that stops them from reading the register.
I assume you have a TPM 1.2.
I'm not sure whether I get your question. There is just one set of PCRs, most likely 24 in your case. All of them are readable all the time; there are no restrictions on reading a PCR.

Does a CPU cache entry contain a physical or virtual address?

Does a CPU cache deal with physical or virtual addresses? And if it deals with virtual addresses, does that mean it has to be emptied on a context switch, assuming the new thread is from another process?
This depends on the processor model. Some processors use both. (See “SPARC” in the “Virtual tags and vhints” section.)
You have tagged this question with x86-64, and an answer could be given for all x86-64 models to date, but I am not sure whether the architecture specification requires processors conforming to it to use one or the other for their cache tags.

Is there a standard definition of what constitutes a version (revision) change?

I am currently in bureaucratic hell at my company and need to define what constitutes the different levels of software change to our test programs. We have a rough practice that we follow internally, but I am looking for a standard (if it exists) to reference in our Quality system. I recognize that systems may vary greatly between developers, but ultimately I am looking for a "best practice" guide to what constitutes a major change, a minor change etc. I would like to reference a published doc in my submission to our quality system for ISO purposes if possible.
To clarify: the software developed at my company is used internally for test automation of semiconductors. We are not selling this code, and versioning is really for record keeping only. We are using the x.y.z changes to set the level of sign-off and approval needed for release.
A good practice is to use 3 level revision numbers:
x.y.z
x is the major
y is the minor
z is for bug fixes
The important thing is that two different software versions with the same x should be binary compatible. A software version with a higher y than another, but the same x, may add features but not remove any; this ensures portability within the same major number. And finally, z should not change any functional behavior except for bug fixes.
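If you need to enforce such a policy in tooling, the comparison itself is simple; a minimal sketch (illustrative only, assuming purely numeric x.y.z versions):

    #include <stdio.h>

    struct version { int x, y, z; };

    /* Returns <0, 0 or >0, comparing major, then minor, then bug-fix level. */
    static int version_cmp(struct version a, struct version b)
    {
        if (a.x != b.x) return a.x - b.x;
        if (a.y != b.y) return a.y - b.y;
        return a.z - b.z;
    }

    /* Same x and y means only bug fixes separate the two versions. */
    static int only_bug_fixes_apart(struct version a, struct version b)
    {
        return a.x == b.x && a.y == b.y;
    }

    int main(void)
    {
        struct version current = {2, 4, 1}, candidate = {2, 4, 3};
        printf("newer: %d, bug-fix-only: %d\n",
               version_cmp(candidate, current) > 0,
               only_bug_fixes_apart(candidate, current));
        return 0;
    }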
Edit:
Here are some links to revision-numbering schemes in use:
http://en.wikipedia.org/wiki/Software_versioning
http://apr.apache.org/versioning.html
http://www.advogato.org/article/40.html
I would add build number to the x.y.z format:
x.y.z.build
x = major feature change
y = minor feature change
z = bug fixes only
build = incremented every time the code is compiled
Including the build number is crucial for internal purposes where people are trying to figure out whether or not a particular change is in the binaries that they have.
To enlarge on what lewap said, use
x.y.z
where z level changes are almost entirely bug fixes that don't change interfaces or involve external systems
where y level changes add functionality and may change the UI/API interface in addition to fixing more serious bugs that may involve external systems
where x level changes involve anything from a complete rewrite/redesign to just changing the database structures or switching databases (e.g. from Oracle to SQLServer) - in other words, anything that isn't a drop-in change and requires a "port" or "conversion" process
I think it might differ depending on whether you are working on internal software or external software (a product).
For internal software it will almost never be a problem to use a formally defined scheme. However for a product the version or release number is in most cases a commercial decision that does not reflect any technical or functional criteria.
In our company the x.y in an x.y.z numbering scheme is determined by the marketing boys and girls. The z and the internal build number are determined by the R&D department and track back into our revision control system and are related to the sprint in which it was produced (sprint is the Scrum term for an iteration).
In addition, formally defining some level of compatibility between versions and releases could cause you to move up very rapidly, or hardly at all, which might not reflect the added functionality.
I think the best approach for you to take when explaining this to your co-workers is by examples drawn from well-known and successful software packages, and the way they approach major and minor releases.
The first thing I would say is that the major.minor dot notation for releases is a relatively recent invention. For example, most releases of UNIX actually had names (which sometimes included a meaningless number) rather than version numbers.
But assuming you want to use major.minor numbering, the major number indicates a version that is basically incompatible with much that went before. Consider the change from Windows 2.0 to 3.0: most 2.0 applications simply didn't fit in with the new overlapped windows in Windows 3.0. For less all-encompassing apps, a radical change in file formats (for example) could be a reason for a major version change - word processors and graphics apps often work this way.
The other reason for a major version number change is that the user notices a difference. Once again this was true for the change from Windows 2.0 to 3.0 and was responsible for the latter's success. If your app looks very different, that's a major change.
As for the minor version number, this is typically used to indicate a change that actually is quite major, but that won't be noticeable to the user. For example, the internal differences between Win 3.0 and Win 3.1 were actually quite major, but the interface stayed the same.
Regarding the third version number, well, few people know what it really means and fewer care. For example, in my everyday work I use the GNU C++ compiler version 3.4.5 - how does this differ from 3.4.4? I haven't a clue!
As I said before in an answer to a similar question: the terminology used is not very precise. There is an article describing the five relevant dimensions. Data management tools for software development don't tend to support more than three of them consistently at the same time. If you want to support all five, you have to describe a development process:
Version (semantics: modification)
View (semantics: equivalence, derivation)
Hierarchy (semantics: consists of)
Status (semantics: approval, accessibility)
Variant (semantics: product variations)
Peter van den Hamer and Kees Lepoeter (1996), "Managing Design Data: The Five Dimensions of CAD Frameworks, Configuration Management, and Product Data Management", Proceedings of the IEEE, Vol. 84, No. 1, January 1996.