tag:blogger.com,1999:blog-8304185770976201542024-09-02T01:41:58.884-07:00Avi Kivity's blogAn irregular diary of a kernel developerAvi Kivityhttp://www.blogger.com/profile/18164860902822919006noreply@blogger.comBlogger11125tag:blogger.com,1999:blog-830418577097620154.post-36737909415521831112011-08-25T05:08:00.000-07:002011-08-25T05:42:30.557-07:00C, assembly, and securityLet's look at the innocent C statement:<div><pre> a = b + c;</pre>What could possibly go wrong? Let us list the ways:</div><div><ol><li>a, b, or c are not the variables we want</li><li>We specified addition, but we wanted something else</li><li>The addition operation overflows</li><li>a and b are unsigned, while c is signed and negative; the result becomes unsigned</li><li>a has a smaller size than b or c; the assignment operation overflows</li><li>a is signed while b or c are unsigned; a becomes negative while b+c is positive</li><li>a is unsigned, while b and c are signed; again we have an overflow</li><li>We used the wrong indentation level and people are unhappy about it</li></ol>We can't expect the language to prevent all of these errors, but can we make C safer by trapping at least some of them at runtime? It turns out we can't do that without sacrificing performance:</div><div> <br /></div><div><ul><li>to handle (3), we need trapping signed addition and trapping unsigned addition instructions</li><li>to handle (4), we need a mixed-signedness trapping add</li><li>to handle (5), we need trapping unsigned store and trapping signed store instructions, which check that the value in the register fits into the memory location specified</li><li>ditto with (6) and (7)</li></ul>These (and similar) issues show up regularly in security vulnerabilities; it's hard to fix them because the necessary processor instructions are not there.
We could emulate them by using sequences of existing instructions, but that would bloat the code and hurt performance; since performance is something that can be benchmarked but security is not, we end up with exploitable code.</div><div> <br /></div><div>So why are those instructions missing? In the '70s and '80s, when the industry was ramping up, performance was a much greater concern than security. Code was smaller and easier to audit; CPU cycles were longer and therefore more important to conserve; networks were small and private; truly malicious attacks were rare.</div><div> <br /></div><div>An unvirtuous cycle followed: C tried to make the most of existing processors, so its semantics mimic the instruction set of those days. It then became wildly successful, so processors were optimized for running C code; naturally they implemented or optimized instructions which directly translated to C concepts. This made C even more popular.</div><div> <br /></div><div>A pair of examples from the x86 world are the INTO and BOUND instructions. INTO (INTerrupt on Overflow) can follow an addition or subtraction instruction, effectively turning it into a trapping signed instruction. BOUND performs array subscript bounds checking, trapping if the index is out of bounds. But the first implementations were rarely used, so they were not optimized in later iterations of the processor. Finally, the 64-bit extensions to the x86 instruction set removed those two instructions for good.</div><div> <br /></div><div> <br /></div>Avi Kivityhttp://www.blogger.com/profile/18164860902822919006noreply@blogger.com2tag:blogger.com,1999:blog-830418577097620154.post-81155810376027514092009-09-06T12:14:00.000-07:002009-09-06T13:11:12.569-07:00Nested vmx support coming to kvmAlmost exactly a year ago I <a href="http://avikivity.blogspot.com/2008/09/nested-svm-virtualization-for-kvm.html">reported</a> on nested svm for kvm - a way to run hypervisors as kvm guests, on AMD hosts.
I'm happy to follow up with the corresponding feature for Intel hosts - nested vmx.<br /><br />Unlike the nested svm patchset, which was relatively simple, nested vmx is relatively complex. There are several reasons for this:<br /><br /><ul><li>While svm uses a memory region to communicate between hypervisor and processor, vmx uses special instructions -- VMREAD and VMWRITE. kvm must trap and emulate the effect of these instructions, instead of allowing the guest to read and write as it pleases.</li><li>vmx is significantly more complex than svm: vmx uses 144 fields for hypervisor-to-processor communications, while svm gets along with just 91. All of those fields have to be virtualized. Note that nested virtualization must reconcile the way kvm uses those fields with the way its guest (which is also a hypervisor) uses those fields; this causes complexity to increase even more.<br /></li><li>The nested vmx patchset implements support for Extended Page Tables (EPT) in the guest hypervisor, in addition to existing support in the host. This means that kvm must support guest pagetables in the 32-bit format, the 64-bit format, and now the EPT format.</li></ul><br />Support for EPT in the guest deserves special mention, since it is critical for obtaining reasonable performance. Without nested EPT, the guest hypervisor will have to trap writes to guest page tables and context switches. The guest hypervisor then has to service those traps by issuing VMREAD and VMWRITE to communicate with the processor.
Since those instructions must trap to kvm, any trap taken by the guest is multiplied by quite a large factor into kvm traps.<br /><br />So how does nested EPT work?<br /><br />Without nesting, EPT provides for two levels of address translation:<br /><ol><li> The first level is managed by the guest, and translates <span style="font-style: italic;">guest virtual addresses</span> (gva) to <span style="font-style: italic;">guest physical addresses</span> (gpa).</li><li>The second address translation level translates <span style="font-style: italic;">guest physical addresses</span> into <span style="font-style: italic;">host physical addresses</span> (hpa). This second level is managed by the host (kvm).</li></ol><br />When nesting is introduced, we now have three levels of address translation:<br /><ol><li>Nested guest virtual address (ngva) to nested guest physical address (ngpa) (managed by the nested guest)<br /></li><li>Nested guest physical address (ngpa) to guest physical address (gpa) (managed by the guest hypervisor)<br /></li><li>Guest physical address (gpa) to host physical address (hpa) (managed by the host - kvm)<br /></li></ol>Given that the hardware only supports two levels of address translation, we need to invoke software wizardry. Fortunately, we already have code in kvm that can fold two levels of address translation into one - the shadow mmu.<br /><br />The shadow mmu, which is used when EPT or NPT are not available, folds the gva→gpa→hpa translation into a single gva→hpa translation which is supported by hardware. We can reuse this code to fold the ngpa→gpa→hpa translation into a single ngpa→hpa translation. Since the hardware supports two levels, it will happily translate ngva→ngpa→hpa.<br /><br />But what about performance? Weren't NPT and EPT introduced to solve performance problems with the shadow mmu? Shadow mmu performance depends heavily on the rate of change of the two translation levels folded together.
Virtual address translations (gva→gpa or ngva→ngpa) do change very frequently, but physical address translations (ngpa→gpa or gpa→hpa) change only rarely, usually in response to a guest starting up or swapping activity. So, while the code is complex and relatively expensive, it will only be invoked rarely.<br /><br />To summarize, nested vmx looks to be one of the most complicated features in kvm, especially if we wish to maintain reasonable performance. It is expected that it will take Orit Wasserman and the rest of the IBM team some time to mature this code, but once this work is complete, kvm users will be able to enjoy another unique kvm feature.Avi Kivityhttp://www.blogger.com/profile/18164860902822919006noreply@blogger.com6tag:blogger.com,1999:blog-830418577097620154.post-42676808450160581672008-12-24T14:19:00.000-08:002008-12-24T14:56:32.692-08:00kvm userspace merging into upstream qemuRecently, <a href="http://blog.codemonkey.ws">Anthony Liguori</a>, one of the qemu maintainers, has included kvm support in stock <a href="http://bellard.org/qemu/">Qemu</a>. This is tremendously important.<br /><br />Why, you might ask? It has to do with how software <a href="http://en.wikipedia.org/wiki/Fork_(software_development)">fork</a>s are managed.<br /><br />When a software project is forked, there are two ways to go about it. One can add new features, <a href="http://en.wikipedia.org/wiki/Code_refactoring">restructuring</a> code along the way so that the new code fits in snugly. This allows you to easily make large changes, but has the side effect of diverging from the original code. Over time, it is no longer possible (or at least very difficult) to incorporate fixes and new features that evolved in the original code, since the two code bases are wildly different.<br /><br />An alternative strategy is to add the new features in a way that makes as little impact as possible on the original code.
This allows updating from the origin to pick up fixes and new features relatively frequently. The downside is that we become severely limited in the kind of changes we can make to our copy of Qemu without diverging too much.<br /><br />We have mostly followed the second strategy. Adaptations to qemu were as small as possible, and we have "encouraged" non-kvm-specific changes to be contributed directly to qemu upstream. This kept the amount and scope of local modifications at a minimum.<br /><br />But now that kvm has been merged, it is possible to make larger modifications to qemu in order to make it fit virtualization roles better. <a href="http://lwn.net/Articles/223839/">Live migration</a> and <a href="http://lwn.net/Articles/239238/">virtio</a> have already been merged. Device and cpu hotplug are on the queue. Deeper changes, like modifying how qemu manages memory and performs DMA, are pending. And, of course, kvm integration is much cleaner and more maintainable.<br /><br />There is of course some friction involved. The new implementation has a few bugs and several missing features (for example, support for true SMP and Windows patching), so it will be rough for a while. However, once the transition is complete, kvm and qemu will be able to evolve at a faster pace, to the benefit of both.Avi Kivityhttp://www.blogger.com/profile/18164860902822919006noreply@blogger.com0tag:blogger.com,1999:blog-830418577097620154.post-33104077083525506172008-09-02T02:16:00.000-07:002009-09-06T13:11:55.716-07:00Nested svm virtualization for kvmYesterday I found a nice <a href="http://thread.gmane.org/gmane.comp.emulators.kvm.devel/21119">surprise</a> in my inbox - a post, by Alex Graf, adding support for virtualizing AMD's SVM instruction set when running KVM on AMD SVM.<br /><br />What does this mean? up until now, when kvm virtualizes a processor, the guest sees a cpu that is similar to the host processor, but <i>does not have virtualization extensions</i>. 
This means that you cannot run a hypervisor that needs these virtualization extensions within a guest (you can still run hypervisors that do not rely on these extensions, such as VMware, but with lower performance). With the new patches, the virtualized cpu does include the virtualization extensions; this means the guest can run a hypervisor, including kvm, and have its own guests.<br /><br />There are two uses that immediately spring to mind: debugging hypervisors and embedded hypervisors. Obviously having svm enabled in a guest means that one can debug a hypervisor in a guest, which is a lot easier than debugging on bare metal. The other use is to have a hypervisor running in the firmware at all times; up until now this meant you couldn't run another hypervisor on such a machine. With nested virtualization, you can.<br /><br />The reason the post surprised me was the relative simplicity with which nested virtualization was implemented: less than a thousand lines of code. This is due to the clever design of the svm instruction set, and the ingenuity of the implementers (Alex Graf and Jörg Rödel) in exploiting the instruction set and meshing the implementation so well with the existing kvm code.Avi Kivityhttp://www.blogger.com/profile/18164860902822919006noreply@blogger.com4tag:blogger.com,1999:blog-830418577097620154.post-47396298297249122752008-05-13T11:04:00.001-07:002008-05-13T11:25:15.751-07:00How kvm does securityLike most software, kvm does security in layers.<br /><br />At the inner privilege layer is the kvm module. This code interacts directly with the guest and also has full access to the machine. If breached, a guest could potentially take over the host and any virtual machines running on it.<br /><br />The outer privilege layer is qemu.
While it is much larger than the kvm kernel module, it is relatively easy to contain a qemu breach so that it doesn't affect the rest of the host:<br /><ul><li>The kernel already protects itself from non-root user processes; if you run kvm as an unprivileged user, the kernel will not let you harm it.</li><li>Processes that run as different users are also restricted; so if you run each guest under a distinct user ID, more isolation is gained.</li><li>Mandatory access control systems such as <a href="http://www.nsa.gov/selinux/">selinux</a> can be used to further restrict the damage that a breached qemu can inflict.<br /></li></ul>What are the most vulnerable submodules in kvm?<br /><ul><li>Probably the most critical piece is the x86 instruction emulator, which is invoked whenever the guest accesses I/O registers or its page tables. This code weighs in at about 2000 lines.</li><li>If the kvm mmu can be tricked into mapping an arbitrary host page into guest memory, then the guest can potentially insert its own code into the kernel. The mmu is about 3000 lines in length, but it has been the subject of endless inspection, so it is likely a very difficult target.<br /></li></ul>So again the "reuse Linux" theme repeats: kvm leverages the existing Linux kernel both to reduce the attack surface presented to malicious guests, and also to contain the damage should a security breach occur.Avi Kivityhttp://www.blogger.com/profile/18164860902822919006noreply@blogger.com0tag:blogger.com,1999:blog-830418577097620154.post-76379602759137884722008-05-02T08:14:00.000-07:002008-05-02T08:40:04.414-07:00Comparing code sizeStarting with Linux 2.6.26, kvm supports four different machine architectures: x86, s390 (System Z, or mainframes), ia64 (Intel's Itanium), and embedded PowerPC processors.
It is interesting to compare the size of the code supporting each architecture:<br /><br /><table><thead><tr><td><b>arch</b></td><td><b>lines</b></td></tr></thead><tbody><tr><td>x86</td><td>17442</td></tr><tr><td>ia64</td><td>8154</td></tr><tr><td>s390</td><td>2509</td></tr><tr><td>ppc</td><td>2229</td></tr></tbody></table><br /><br />x86 is old and crufty; it supports three instruction sets and four paging modes; its long and successful history means that it needs the most kvm support code. There are two different virtualization extensions that kvm supports on x86 (Intel's VT and AMD's SVM). It is also the architecture that has been supported by kvm for the longest time. It is no surprise that it leads the pack by a significant amount.<br /><br />ia64 is a newer architecture, but a quite complex one. The mechanism by which it supports virtualization, with a module loaded into the host kernel and a second module loaded into the guest address space, also adds complexity. So it comes in second, though far behind x86.<br /><br />s390 is older (and probably far cruftier) than x86. But on the other hand, its hardware virtualization support is so mature and complete that a full hypervisor fits in a fraction of the lines required for x86. Indeed, it will take a while until x86 can support 64-way guests.<br /><br />ppc 44x, the embedded PowerPC variant targeted by kvm, has a simple software-managed tlb model, and the regular instruction set encoding favored by RISC processors, so it gets by with just a seventh of the amount of code required by x86.<br /><br />As we add more features, kvm code size will continue to grow slowly, but the relative comparison will no doubt remain valid.
And kvm will likely remain the smallest full virtualization solution available.Avi Kivityhttp://www.blogger.com/profile/18164860902822919006noreply@blogger.com0tag:blogger.com,1999:blog-830418577097620154.post-85373918033484822972008-04-27T12:37:00.000-07:002008-04-27T12:44:14.857-07:00KVM Forum 2008 Agenda postedThe near-final <a href="http://kforum.qumranet.com/KVMForum/agenda.php">agenda</a> for the KVM Forum 2008 has been posted! I'm pleased to see a well-rounded set of presentations, covering all aspects of kvm development.<br /><br />If you're interested in kvm development, and haven't already, make sure to <a href="http://kforum.qumranet.com/KVMForum/register_now.php">register</a> now.<br /><br /><br />See you all in Napa!Avi Kivityhttp://www.blogger.com/profile/18164860902822919006noreply@blogger.com0tag:blogger.com,1999:blog-830418577097620154.post-29586734169381470382008-04-25T03:03:00.000-07:002008-04-25T03:30:05.479-07:00I/O: Maintainability vs PerformanceI/O performance is of great importance to a hypervisor. I/O is also a huge maintenance burden, due to the large number of hardware devices that need to be supported, numerous I/O protocols, high availability options, and management for it all.<br /><br />VMware opted for the performance option, by putting the I/O stack in the hypervisor. Unfortunately, the VMware kernel is proprietary, which means VMware has to write and maintain the entire I/O stack. That means a slow development rate, and that your hardware may take a while to be supported.<br /><br />Xen took the maintainability route, by doing all I/O within a Linux guest, called "domain 0". By reusing Linux for I/O, the Xen maintainers don't have to write an entire I/O stack.
Unfortunately, this eats away at performance: every interrupt has to go through the Xen scheduler so that Xen can switch to domain 0, and everything has to go through an additional layer of mapping.<br /><br />Not that Xen solved the maintainability problem completely: the Xen domain 0 kernel is still stuck on the ancient Linux 2.6.18 release (whereas 2.6.25 is now available). These problems have led Fedora 9 to <a href="http://www.nabble.com/Re%3A-Plans-for-paravirt_ops-kernel-xen-p15885042.html">drop</a> support for hosting Xen guests, leaving kvm as the sole hypervisor.<br /><br />So how does kvm fare here? Like VMware, I/O is done within the hypervisor context, so full performance is retained. Like Xen, it reuses the entire Linux I/O stack, so kvm users enjoy the latest drivers and I/O stack improvements. Who said you can't have your cake and eat it?Avi Kivityhttp://www.blogger.com/profile/18164860902822919006noreply@blogger.com4tag:blogger.com,1999:blog-830418577097620154.post-3234024184943499602008-04-15T07:37:00.000-07:002008-04-15T07:41:53.930-07:00Memory overcommit with kvmkvm supports (or rather, <i>will</i> support; this is work in progress) several ways of running guests with more memory than you have on the host:<br /><dl><br /><dt>Swapping</dt><dd>This is the classical way to support overcommit; the host picks some memory pages from one of the guests and writes them out to disk, freeing the memory for use. Should a guest require memory that has been swapped, the host reads it back from the disk.</dd><br /><dt>Ballooning</dt><dd>With ballooning, the guest and host cooperate on which page is evicted. It is the guest's responsibility to pick the page and swap it out if necessary.<br /></dd><br /><dt>Page sharing</dt><dd>The hypervisor looks for memory pages that have identical data; these pages are all merged into a single page, which is marked read only.
If a guest writes to a shared page, it is unshared before granting the guest write access.</dd><br /><dt>Live migration</dt><dd>The hypervisor moves one or more guests to a different host, freeing the memory used by these guests.</dd><br /></dl><br />Why does kvm need four ways of overcommitting memory? Each method provides different reliability/performance tradeoffs.<br /><br /><b>Ballooning</b> is fairly efficient since it relies on the guest to pick the memory to be evicted. Many times the guest can simply shrink its cache in order to free memory, which can have a very low guest impact. The problem with ballooning is that it relies on guest cooperation, which reduces its reliability.<br /><br /><b>Swapping</b> does not depend on the guest at all, so it is completely reliable from the host's point of view. However, the host has less knowledge than the guest about the guest's memory, so swapping is less performant than ballooning.<br /><br /><b>Page sharing</b> relies on guest behavior indirectly. As long as guests run similar applications, the host will achieve a high share ratio. But if a guest starts running new applications, the share ratio will decrease and free memory in the host will drop.<br /><br /><b>Live migration</b> does not depend on the guest, but instead on the availability of free memory on other hosts in the virtualization pool; if other hosts do not have free space, you cannot migrate to them. In addition, live migration takes time, which the host may not have when facing a memory shortage.<br /><br />So kvm uses a mixed strategy: page sharing and ballooning are used as the preferred methods for memory overcommit since they are efficient. Live migration is used for long-term balancing of memory requirements and resources.
Swapping is used as a last resort in order to guarantee that services do not fail.Avi Kivityhttp://www.blogger.com/profile/18164860902822919006noreply@blogger.com17tag:blogger.com,1999:blog-830418577097620154.post-47997442478038069852008-04-10T07:47:00.000-07:002008-04-10T08:10:12.187-07:00Paravirtualization is deadWell, not all paravirtualization. I/O device paravirtualization is certainly the best way to get good I/O performance out of virtual machines, and paravirtualized clocks are still necessary to avoid clock-drift issues.<br /><br />But mmu paravirtualization, hacking your guest operating system's memory management to cooperate with the hypervisor, is going away. The combination of hardware paging (NPT/EPT) and large pages matches or beats paravirtualization on most workloads. Talking to a hypervisor is simply more expensive than letting the hardware handle everything transparently, even before taking into account the costs introduced by paravirtualization, like slower system calls.<br /><br />The design of the kvm paravirtualized mmu reflects this planned obsolescence. Instead of an all-or-nothing approach, kvm paravirtualization is divided into a set of features which can be enabled or disabled independently. The guest picks the features it supports and starts using them.<br /><br />The trick is that when the host supports NPT or EPT, kvm does not expose the paravirtualized mmu to the guest; in turn the guest doesn't use these features, and receives the benefit of the more powerful hardware.
All this is done transparently without any user intervention.Avi Kivityhttp://www.blogger.com/profile/18164860902822919006noreply@blogger.com1tag:blogger.com,1999:blog-830418577097620154.post-14136683972872540622008-03-28T11:41:00.000-07:002008-03-28T12:29:28.472-07:00True mythsThe appearance of kvm naturally provoked reactions from the competition, which are interesting in the way they imply some untruths while being 100% accurate:<br /><ul><br /><li><span style="font-weight:bold;">kvm is good for desktop</span> -- that is eminently true: by being integrated with Linux, kvm inherits all the desktop and laptop goodies, like excellent power management, suspend/resume, good regular (non-virtual-machine) process performance, and driver integration.<br /><br>The implication, however, is that kvm is not suitable for server use. This is wrong: kvm also inherits from Linux its server qualities, including excellent scalability, advanced memory management, security, and I/O stack.<br /></li><br /><li><span style="font-weight:bold;">you need a bare metal hypervisor for server workloads</span> -- that is also true: without complete control of the hardware, a hypervisor will be hopelessly inefficient.<br><br />Somehow the people who say this ignore the fact that kvm <i>is</i> a bare metal hypervisor, accessing the hardware directly. In fact kvm is much closer to the bare metal than Xen, which can only access I/O devices through a special guest, "dom0", which is definitely not running on bare metal.<br /></li><br /><li><b>A thin hypervisor gives better security</b> -- true again: the smaller your <a href="http://en.wikipedia.org/wiki/Trusted_computing_base">trusted computing base</a> is, the greater confidence you have in your hypervisor.<br><br />The same speakers then go on about how thin Xen is. But they seem to ignore that the entire I/O and management plane is in fact a Linux guest -- and that it is part of the trusted computing base.
Now which is smaller, Linux, or Xen with a trusted Linux guest?<br /></li><br /></ul><br />Developers, of course, <a href="http://udrepper.livejournal.com/15795.html">realize</a> all of this immediately; but it will take some time and counter-marketing to repair the damage already done. Hence this article.Avi Kivityhttp://www.blogger.com/profile/18164860902822919006noreply@blogger.com4