I've been playing with memory backings for QEMU for a while now, and I'm starting to realize that if you want the best performance (who doesn't?), the only real option for memory backing is static hugepages.

There are a few pitfalls that I'm going to go through and explain how you can work around them, but first, let's talk about the different memory pages and the benefits and drawbacks of each.

Standard Pages.

Standard pages are 4KB in size. Since most programs only allocate a few KB at a time (or at least that used to be the assumption), the kernel can simply hand a program a single page and not worry about inefficient use of memory. The drawback of standard pages is that they're not great for large allocations, and on top of that the kernel hands out whatever free pages happen to be available, so a process's memory ends up fragmented across physical RAM rather than allocated sequentially. This can cause performance problems.
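
If you're curious what your system's base page size actually is, getconf will tell you; on virtually every x86-64 Linux box it's 4096 bytes:

    getconf PAGESIZE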

Right about now, you're probably thinking I'm losing it. "But RAM is solid state, there's almost no performance impact from random reads!" That's true. Random Access Memory is, by design, very resilient to the performance impact of random reads and writes. The problem is not the memory, but the kernel. The kernel will actively try to defragment memory because of issues with high-order allocations. This patch, by Marcelo Tosatti, is a good example of memory defragmentation. At a high level, it moves allocated memory around in an attempt to maintain contiguous free memory for new allocations.
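
You can actually watch this fragmentation happen. /proc/buddyinfo lists, for each memory zone, how many free blocks of each order the kernel has on hand, from single 4KB pages up to 4MB chunks; on a machine that's been up for a while, the columns for the larger block sizes tend to be nearly empty:

    cat /proc/buddyinfo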

In addition to the defragmentation, the kernel needs to keep a list of all the pages that are allocated, the region of physical memory they are from and the process that the memory belongs to. This, in turn, takes up memory!
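
To put a rough number on that bookkeeping: on x86-64, each last-level page table entry is 8 bytes, so a process that maps 1GiB of memory with standard 4KB pages needs around a quarter of a million entries, roughly 2MB of bottom-level page tables alone, before counting the higher levels of the table or the kernel's own per-page accounting:

    # PTEs needed to map 1GiB with 4KB pages (8-byte entries on x86-64):
    echo $(( 1024 * 1024 / 4 ))       # 262144 entries
    echo $(( 262144 * 8 / 1024 ))     # 2048 KB of bottom-level page tables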

These are all wanted and required features of the Linux kernel. That said, they still cause problems with virtualization and memory performance. When you're doing real-time things in a VM, such as editing video, computer-aided modeling, and especially playing games, you'll notice periodic stutters. If your memory is fragmented badly enough, you can even notice it when simply browsing the web in a VM.

Let's say we want to allocate 8GiB of memory to our VM. With standard 4KB pages, the kernel will have to allocate 2,097,152 separate pages. The kernel will try to allocate these pages contiguously, but on any machine that's been up for much more than 5 minutes, the chance of that happening approaches zero. The kernel will spend a bunch of CPU time defragmenting memory to lay the pages out contiguously. Additionally, whenever new allocations need to happen, those pages will get "defragmented" again, that is, moved around to defragment free space, because that's what's really happening here: when we talk about defragmenting memory, we're not talking about keeping groups of allocated pages together, we're talking about keeping free memory together. That's what causes the stutters you may experience. Thankfully, there's a great solution to this: don't defragment the VM's memory.
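
The page count is simple arithmetic, and if you want to see how much compaction your own kernel has been doing, it exposes counters in /proc/vmstat (the exact counter names vary a bit by kernel version):

    # 8GiB split into 4KB pages:
    echo $(( 8 * 1024 * 1024 / 4 ))    # 2097152 pages

    # How often the kernel has been compacting (defragmenting) memory:
    grep compact /proc/vmstat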

Static HugePages to the rescue!

In Linux, you have three sizes of pages: 4KB, 2MB, and 1GB. You can fit 512 standard pages inside a 2MB HugePage, and 512 2MB pages inside a 1GB HugePage. Larger pages reduce the overhead of walking through a giant page table and help reduce memory fragmentation. The downside is that memory must be allocated in 2MB or 1GB lots, which is why most programs choose not to take advantage of the feature. The kernel can also assemble these larger pages on its own, behind the application's back, which is why such pages are called Transparent HugePages. It's important to note that Transparent HugePages do not solve the problem of memory defragmentation causing performance drops and stuttering, since the kernel is still free to split, move, and compact them.
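
You can check which Transparent HugePage mode your kernel is currently running in; the value in brackets is the active one (typically always, madvise, or never):

    cat /sys/kernel/mm/transparent_hugepage/enabled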

Moving on to Static HugePages, these are special pages. They're manually allocated, usually at boot, and the kernel permanently marks those regions of memory as in-use and unmovable. This is beneficial for a number of reasons, the largest of which comes with VFIO and passthrough, which we'll talk about below. These memory regions are allocated as either 2MB or 1GB pages and do solve the problem of memory defragmentation causing performance drops and stuttering.

If I want to allocate the same 8GiB of memory to my VM, I'm going to need to do things a bit differently. If I've already booted my machine, I can run sysctl -w vm.nr_hugepages=4096 to attempt to set up 8GiB worth of 2MB pages. This may take a while, and it will only succeed if the kernel can find enough free 2MB-sized chunks of contiguous memory to build the pages. Otherwise, you'll end up with fewer than the 4096 pages you asked for.
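
A minimal sketch of that, run as root; /proc/meminfo will show how many pages the kernel actually managed to reserve:

    # Try to reserve 4096 x 2MB pages (8GiB) on a running system:
    sysctl -w vm.nr_hugepages=4096

    # Check how many were actually reserved:
    grep -i huge /proc/meminfo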

Allocating these pages at boot is much easier: you can use /etc/sysctl.d/10-hugepages.conf to set the vm.nr_hugepages parameter to your desired number of HugePages. The kernel allocates these pages early in the boot process, when only a few MB of memory have been used in total, giving you a significantly higher chance of success.
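
For the 2MB case, that sysctl.d file is a one-liner; 1GB pages are usually reserved on the kernel command line instead, with parameters like default_hugepagesz=1G hugepagesz=1G hugepages=8:

    # /etc/sysctl.d/10-hugepages.conf
    # Reserve 4096 x 2MB pages (8GiB) early in boot
    vm.nr_hugepages = 4096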

When working with VFIO and an IOMMU, you're also going to get significantly better performance when memory is allocated for the device you're passing through, because you no longer see stutters or slowdowns from the kernel's memory defragmentation. The combination of the VM's standard memory and the IOMMU-mapped memory both being static can bring you to within 2% of bare-metal performance in most circumstances.
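
If you want to try this with plain QEMU right away, a rough sketch of the relevant flags is below; the hugetlbfs mount point (/dev/hugepages here) varies by distro, and libvirt users would instead add a <memoryBacking><hugepages/></memoryBacking> element to the domain XML:

    # Back the guest's 8GiB of RAM with the reserved hugepages and
    # pre-fault it at startup; add your usual disk, CPU, and VFIO options.
    qemu-system-x86_64 -m 8192 -mem-path /dev/hugepages -mem-prealloc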

Conclusions and further research

For those who, like me, are using passthrough with a Windows VM on their Linux system, I highly recommend using Static HugePages for the VM. I'll write up a quick 'n' dirty guide for that soon, but I've already laid the groundwork for people to get their hands dirty. Those running VMs in the enterprise, and even, theoretically, those running high-performance, high-memory-demand Java applications, will probably benefit from Static HugePages as well. While the performance gains are hard to measure, due to the variable nature of memory, the subjective user experience is absolutely better, and certain workloads can see large benefits from Static HugePages.