When there is a VFIO device in a QEMU-KVM VM, why can't the VM be saved (virsh save)?

When a QEMU-KVM VM contains a VFIO device, the "virsh save" command is unable to save the VM.
So, I am curious why a VFIO device does not support the VM save command.
One reason I know of is that it conflicts with VM migration.
But if I am not going to migrate the VM, is it possible to save the VM with the VFIO device attached, and why?

Because the physical device is stateful, and because there is no generic way to dump the state of an arbitrary device, that state cannot be captured when the VM state is saved. Because the state cannot be saved, it cannot be restored. Without the device state being restored, the in-VM driver's knowledge of the hardware will no longer match the actual state of the hardware, and in virtually 100% of cases this results in the driver, the VM, and possibly the PCI bus crashing, usually crashing very hard and taking down the entire host.
Therefore, no saving of VMs with hardware passed through is allowed.
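To make the failure concrete, here is a minimal sketch of the equivalent of "virsh save" through the libvirt Python bindings; the domain name and save path are made-up examples.

```python
# Minimal sketch: the Python-bindings equivalent of "virsh save".
# Domain name and save path are hypothetical examples.
import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("guest-with-vfio")  # hypothetical domain name

try:
    # Serialize guest RAM and device state to a file, as "virsh save" does.
    dom.save("/var/lib/libvirt/save/guest.sav")
except libvirt.libvirtError as err:
    # For a domain with a passed-through (VFIO) device, libvirt rejects the
    # operation, since the device's internal state cannot be read back out
    # of the physical hardware.
    print("save rejected:", err)
```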

Related

How can an operating system detect a live resize of disk capacity?

I saw the following discussion and had some questions:
live resize of an NVMe drive
If the physical capacity of the NVMe device changes (e.g., from 10 GB to 20 GB), how does the operating system detect this without rebooting?
In the above link, re-scanning the PCI bus is given as the solution.
When the re-scan is executed, does the operating system ask the NVMe device to update its metadata (e.g., capacity)?
How does the OS interact with the disk specifically? (How does it read the changed device parameters from the disk rather than the old, cached parameters in memory?)
This is probably an AWS virtual machine, so the disk is actually a virtual disk. You can't resize a physical disk the way capacity is upgraded here; physically, you'd need to swap out the disk.
With that said, this machine probably runs on top of a type 1 hypervisor. What I understand about these is that the virtual machines (VMs) run as processes in a different ring on top of a minimal operating system (the hypervisor). When a VM executes a privileged instruction, it triggers a protection fault, and the hypervisor can then inspect who actually triggered the fault (was it the guest kernel or a user-mode process within the guest?). If it was the guest kernel, the hypervisor can execute that instruction on behalf of the guest. Otherwise, it will probably do what a real kernel would do (raise an exception). It can tell the difference because the guest kernel runs in a different ring than ring 3 (user mode).
With that said, the NVMe device itself isn't PCI; it is NVMe. The host controller of the NVMe drive is the PCI device. To rescan the NVMe drives, you read/write some registers that are memory-mapped in RAM and ask the NVMe PCI host controller for the sizes of the disks it finds. PCI is hot-pluggable (similarly to USB) in some cases, but mostly not on consumer motherboards. I don't think you'll get an interrupt when a PCI device is hot-plugged, so you are left doing a rescan of the devices.
For NVMe, it depends on the host controller whether you'll get an interrupt when a disk is swapped or changes size. As for virtual disks, it probably depends on a lot of different things. You could definitely be left doing a PCI rescan here; I guess it depends on the hypervisor, the OS, and the host controller configuration.
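For a concrete picture of what such a rescan looks like, here is a minimal sketch for Linux using standard sysfs entry points (requires root). The device names nvme0 and nvme0n1 are examples, and whether the driver exposes rescan_controller depends on the kernel's NVMe driver, so treat this as an illustration rather than a recipe.

```python
# Sketch: forcing Linux to re-read device metadata without a reboot.
# Requires root; device names are examples.
from pathlib import Path

# Ask the PCI core to re-enumerate the bus (picks up hot-added devices).
Path("/sys/bus/pci/rescan").write_text("1")

# Ask the NVMe driver to re-identify its namespaces, which re-reads the
# capacity the controller reports (availability depends on the kernel).
Path("/sys/class/nvme/nvme0/rescan_controller").write_text("1")

# The refreshed capacity (in 512-byte sectors) then shows up in sysfs,
# replacing the stale value the OS had cached.
sectors = int(Path("/sys/block/nvme0n1/size").read_text())
print("capacity:", sectors * 512, "bytes")
```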

How does virtualization work internally?

I was reading about virtualization and a question popped into my head: how does virtualization work internally at the operating-system level? The topic was discussed in my class and I came across this:
A virtual box runs like a process with some extra privileges on the host operating system.
My question is: if a VM runs as a process, then who grants it these extra privileges so that it can actually interfere with the underlying OS and hardware resources?
I read what a Hypervisor is: http://searchservervirtualization.techtarget.com/definition/hypervisor
A hypervisor is the connection between the VM and the host OS. Running a hypervisor on a host OS means we run it as a user process. Again, my question is the same:
How can a user process (the hypervisor) control the host processor and resources? As far as I know, user processes don't have those rights.
Thanks in advance.

How does the Sulley fuzzing framework procmon work on a virtual machine?

From my understanding, the process_monitor stores crashbin information locally. If this is running on a virtual machine and a test case causes the process and target machine to become unresponsive, vmcontrol would then revert to an earlier snapshot. How is the crashbin information displayed to the web interface, or accessed at this point if it was lost on the revert to an earlier snapshot?
After walking through most of the code in the Sulley environment, I found that the restart_target() method in the sessions.py module first calls for a restart of the virtual machine if vmcontrol is available, and only then tries to restart the process via the procmon if it's available. By switching the order of these, I can avoid losing the log information from the crashbin unless the entire target machine becomes unresponsive.
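A sketch of that reordering is below. The method and attribute names follow Sulley's sessions.py, but this is an illustrative reconstruction of the idea rather than the upstream code.

```python
# Illustrative reconstruction of the reordered restart_target() logic:
# try the process monitor first so the crashbin log survives; only
# revert the VM snapshot when the whole machine is unresponsive.
def restart_target(self, target, stop_first=True):
    if target.procmon:
        # Process-level restart: the VM keeps running, so the procmon's
        # locally stored crashbin information is preserved.
        if stop_first:
            target.procmon.stop_target()
        target.procmon.start_target()
    elif target.vmcontrol:
        # Last resort: revert to the snapshot, losing any state (including
        # crashbin data) written inside the VM since the snapshot was taken.
        target.vmcontrol.restart_target()
    else:
        raise Exception("no procmon or vmcontrol channel available")
```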

MongoDB hotfix KB2731284

I installed MongoDB on Windows Server 2008 R2. The hotfix KB2731284 is not installed, but I cannot easily restart the server.
The hotfix description contains this message: "You run an application that uses the FlushViewOfFile() function to clean up memory-mapped files from the paged memory pool." (see https://support.microsoft.com/en-us/kb/2731284)
My question is: when is the function FlushViewOfFile() called? My application just writes to a collection and reads data from it. Do I risk incorrect behavior?
I think you can run MongoDB without applying the hotfix, but I would not recommend it; in the long run you may run into problems. MongoDB includes some fixes to work around the problem.
A detailed description of the problem can be found here and here.
See also this.
On Windows, Memory Mapped File flushes are synchronous operations. When the OS Virtual Memory Manager is asked to flush a memory mapped file, it makes a synchronous write request to the file cache manager in the OS. This causes large I/O stalls on Windows systems with high Disk IO latency, while on Linux the same writes are asynchronous.
The problem becomes critical on high-latency disk drives like Azure persistent storage (around 10 ms). This behavior results in very long background flush times, capping disk IOPS at around 100. On low-latency storage (local storage and AWS) the problem is not that visible.
On Windows 7 and Windows Server 2008 R2, applying the hotfix gives you better file allocation performance, which is relevant for MongoDB.
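To see where FlushViewOfFile() enters the picture, here is a minimal sketch of the call pattern. On Windows, Python's mmap.flush() is implemented with FlushViewOfFile(), the call named in the hotfix description; the file name and write are made-up examples.

```python
# Sketch: dirtying and flushing a memory-mapped file on Windows.
# Python's mmap.flush() wraps FlushViewOfFile() there, the call named
# in the hotfix description. File name and data are examples; the file
# is assumed to exist and be non-empty.
import mmap

with open(r"C:\data\example.bin", "r+b") as f:
    view = mmap.mmap(f.fileno(), 0)   # map the whole file
    view[0:4] = b"test"               # dirty a page, like a database write
    # Synchronous on Windows: this blocks until the OS cache manager
    # accepts the write, which is the stall the hotfix addresses.
    view.flush()
    view.close()
```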

How does a hypervisor know that a privileged instruction happened inside a VM?

I've started reading about VMMs and wondered how the hypervisor knows that a privileged instruction (for example, cpuid) happened inside a VM and not in the real OS.
Let's say I've executed cpuid: a trap will occur and a VM-exit will happen, but how would the hypervisor know whether the instruction happened inside my regular OS or inside a VM?
First off, you are using the wrong terminology. When an OS runs on top of a hypervisor, the OS becomes the VM (virtual machine) itself and the hypervisor is the VMM (virtual machine monitor). A VM can also be called a "guest". Thus: OS on top of hypervisor = VM = guest (these expressions mean the same thing).
Secondly, you tell the CPU that it is executing inside the VM from the moment you execute VMLAUNCH or VMRESUME, assuming you're reading about Intel VMX. When for some reason the VM causes a hypervisor trap, we say that "a VM-exit occurred" and the CPU knows it's no longer executing inside the VM. Thus:
Between VMLAUNCH/VMRESUME executions and VM-exits, we are in the VM, and CPUID will trap (causing a VM-exit).
Between VM-exits and VMLAUNCH/VMRESUME executions, we are in the VMM (the hypervisor), and CPUID will NOT trap, since we are already in the hypervisor.
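As a toy model of that enter/exit cycle (a sketch only: real hypervisors do this in ring-0 C and assembly, and the vcpu object and its methods here are hypothetical stand-ins, though 10 is the genuine VMX basic exit reason for CPUID):

```python
# Toy model of the VMX enter/exit cycle described above.
# The vcpu object and its methods are hypothetical stand-ins.
EXIT_REASON_CPUID = 10  # VMX basic exit reason for CPUID (Intel SDM)

def vmm_loop(vcpu):
    while True:
        # VMLAUNCH/VMRESUME: the CPU enters guest (non-root) mode here.
        exit_reason = vcpu.run_guest()
        # Control only returns on a VM-exit, so at this point the CPU is
        # back in the hypervisor (root mode) by construction.
        if exit_reason == EXIT_REASON_CPUID:
            vcpu.emulate_cpuid()     # fill the guest's registers, possibly masked
            vcpu.skip_instruction()  # advance the guest RIP past CPUID
        else:
            vcpu.handle_other_exit(exit_reason)
```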
Instructions that are privileged generate exceptions when executed in user mode. The exception is usually an undefined-instruction exception. The hypervisor hooks this exception, inspects the executing instruction, and then returns control to the VM. When the host OS executes the same instruction, it runs at supervisor or elevated privilege, and usually no exception is generated. So generally, these issues are handled by the CPU.
However, if an instruction is not available on the processor at all (say, floating point that needs emulation), the hypervisor may emulate it for the VM, and chain to the OS handler if it does not. Possibly it may even allow the OS to handle the emulation for both VMs and user tasks in the OS.
So generally, this question is unanswerable for a generic CPU; it depends on how the instruction is emulated in the VM. The best case is that the hypervisor does not emulate any OS instructions, since emulation slows down not only the VMs but the entire CPU, including user processes in the host OS.