Kernel Memory Management: Theory and Practice

This is my 3rd post in a series of writeups on Kernel Memory Management: Theory & Practice

The first two are here:-

In this post, we are going to scratch a bit into the vmalloc API frequently used for dynamic memory Allocations for large-size memory.

vmalloc and kmalloc are the two most commonly and frequently used dynamic memory allocation API in the kernel. With subtle differences between their modes of operation and the type of memory returned, it’s worth mentioning their implementation and logic. (refer to mainline kernel)

Here is a typical internal API flow called by vmalloc:-

vmalloc -> __vmalloc_node_range -> __get_vm_area_node -> __vmalloc_area_node -> kasan_unpoison_vmalloc

vmalloc is called to allocate a contiguous block of virtual memory.
vmalloc internally calls __vmalloc_node_range to allocate memory from the system page allocator.
__vmalloc_node_range calls __get_vm_area_node to find a suitable vm_area_struct data structure for the newly allocated virtual memory area.
__get_vm_area_node searches the kernel’s list of free vm_area_struct structures to find a suitable structure for the new virtual memory area.
If no suitable vm_area_struct structure is found, __get_vm_area_node calls __vmalloc_area_node to allocate a new vm_area_struct.
__vmalloc_area_node allocates a vm_area_struct and maps it into the kernel’s virtual address space using map_vm_area.
map_vm_area sets up the vm_area_struct to describe the newly allocated virtual memory area and inserts it into the kernel’s list of virtual memory areas using insert_vm_struct.
Finally, vmalloc returns a pointer to the start of the allocated virtual memory block to the calling code.
kasan_unpoison_vmalloc is called to unpoison the memory if Kernel Address Sanitizer (KASan) is enabled

The vmalloc function allocates memory from the high memory area of kernel virtual memory. High memory is a region of kernel virtual memory that is not directly accessible by the kernel and is therefore managed differently from other regions of kernel memory. vmalloc is used to allocate variable-sized memory regions that are not page-aligned. Because high memory can be dynamically allocated and deallocated, vmalloc is well-suited for allocating large, non-contiguous memory regions.

However, it’s worth noting that vmalloc does not actually allocate physical memory; instead, it allocates virtual memory pages and maps them to physical memory as needed. This means that the amount of physical memory used vmalloc may be larger than the amount of virtual memory allocated by the function.

The below diagram explains how memory allocated by vmalloc is guaranteed to be contagious in virtual memory but not guaranteed to be contagious in physical memory. Mostly there are discontiguous physical pages that are mapped into a contiguous virtual memory area via page table.

Below is a diagram that shows a typical memory map of an ARM-based system. In the above diagram, the top section represents the user space virtual addresses, which are used by user-level processes. The bottom section represents the physical memory, which is the actual hardware memory that stores the data. The user application resides within the user space virtual addresses, while the operating system kernel resides in the kernel space. The kernel space is typically mapped to a fixed region of virtual addresses that is separate from the user space virtual addresses.

Physical memory is divided into two regions: low memory and high memory. The low memory is directly mapped to physical memory and is used for kernel code, data, and stack. The high memory region is not directly mapped to physical memory and is used for dynamic memory allocation, including vmalloc allocations.

Overall, this diagram shows how the virtual and physical memory maps are organized on ARM-based systems, with user and kernel spaces separated and low and high memory regions used for different purposes.

So having described the vmalloc philosophy above, below we take look at the actual code of one of the core internal help functions in kernel vmalloc_node_range.

Below is the actual vmalloc_node_range which is the heart of vmalloc implementation.


void *__vmalloc_node_range(unsigned long size, unsigned long align,	unsigned long start, unsigned long end, gfp_t gfp_mask,	pgprot_t prot, unsigned long vm_flags, int node,const void *caller)
{
	if (WARN_ON_ONCE(!size))
		return NULL;
	if ((size >> PAGE_SHIFT) > totalram_pages()) {
		warn_alloc(gfp_mask, NULL,
			"vmalloc error: size %lu, exceeds total pages",
			real_size);
		return NULL;
	}
	if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP)) {
		unsigned long size_per_node;
		size_per_node = size;
		if (node == NUMA_NO_NODE)
			size_per_node /= num_online_nodes();
		if (arch_vmap_pmd_supported(prot) && size_per_node >= PMD_SIZE)
			shift = PMD_SHIFT;
		else
			shift = arch_vmap_pte_supported_shift(size_per_node);

		align = max(real_align, 1UL << shift);
		size = ALIGN(real_size, 1UL << shift);
	}
again:
	area = __get_vm_area_node(real_size, align, shift, VM_ALLOC |
				  VM_UNINITIALIZED | vm_flags, start, end, node,
				  gfp_mask, caller);
	if (!area) {
		bool nofail = gfp_mask & __GFP_NOFAIL;
		warn_alloc(gfp_mask, NULL,
			"vmalloc error: size %lu, vm_struct allocation failed%s",
			real_size, (nofail) ? ". Retrying." : "");
		if (nofail) {
			schedule_timeout_uninterruptible(1);
			goto again;
		}
		goto fail;
	}
	if (pgprot_val(prot) == pgprot_val(PAGE_KERNEL)) {
		if (kasan_hw_tags_enabled()) {
			prot = arch_vmap_pgprot_tagged(prot);

			gfp_mask |= __GFP_SKIP_KASAN_UNPOISON | __GFP_SKIP_ZERO;
		}

		
		kasan_flags |= KASAN_VMALLOC_PROT_NORMAL;
	}
	ret = __vmalloc_area_node(area, gfp_mask, prot, shift, node);
	if (!ret)
		goto fail;
	kasan_flags |= KASAN_VMALLOC_VM_ALLOC;
	if (!want_init_on_free() && want_init_on_alloc(gfp_mask) &&
	    (gfp_mask & __GFP_SKIP_ZERO))
		kasan_flags |= KASAN_VMALLOC_INIT;

	area->addr = kasan_unpoison_vmalloc(area->addr, real_size, kasan_flags);
	clear_vm_uninitialized_flag(area);
	size = PAGE_ALIGN(size);
	if (!(vm_flags & VM_DEFER_KMEMLEAK))
		kmemleak_vmalloc(area, size, gfp_mask);
	return area->addr;
fail:
	if (shift > PAGE_SHIFT) {
		shift = PAGE_SHIFT;
		align = real_align;
		size = real_size;
		goto again;
	}
	return NULL;
}

vmalloc_node_range takes in several parameters, including the size of the requested memory allocation, alignment, start and end addresses, allocation flags, protection flags, node ID, and a caller identifier.

The function checks whether the size of the requested allocation exceeds the total number of pages of physical memory available on the system. If it does, an error message is printed, and NULL is returned.
It then checks whether the architecture supports huge pages and whether the allocation size per node is greater than or equal to the size of a huge page.
If so, the function sets the shift variable to PMD_SHIFT, the size of a huge page, and sets the align variable to the maximum value between the real alignment
value and the size of a huge page. Otherwise, the function calls the architecture-specific arch_vmap_pte_supported_shift() function to determine the appropriate shift value for the allocation size per node.
The function calls the __get_vm_area_node() function to obtain a virtual memory area that can accommodate the requested allocation size, alignment, and flags. If no suitable area is found,the function returns NULL. If the __GFP_NOFAIL flag is set in the gfp_mask parameter, the function retries the allocation after waiting for one second.
If the prot parameter is equal to PAGE_KERNEL, the function checks whether KASan (Kernel Address Sanitizer) hardware tagging is enabled. If so, it sets the prot variable to the result of calling the arch_vmap_pgprot_tagged() function with the prot parameter as an argument. It also sets the __GFP_SKIP_KASAN_UNPOISON and __GFP_SKIP_ZERO flags in the gfp_mask parameter.
The function sets the KASAN_VMALLOC_PROT_NORMAL flag in the kasan_flags variable.
The function calls the __vmalloc_area_node() function to allocate the virtual memory area obtained in step 4. If the allocation fails, the function returns NULL.
The function sets the KASAN_VMALLOC_VM_ALLOC flag in the kasan_flags variable.
If the __GFP_SKIP_ZERO flag is set in the gfp_mask parameter and the want_init_on_alloc() function returns true, the function sets the KASAN_VMALLOC_INIT flag in the kasan_flags variable.
The function calls the kasan_unpoison_vmalloc() function to mark the allocated memory as uninitialized.
The function clears the VM_UNINITIALIZED flag in the virtual memory area obtained in step 4.
The function calls the PAGE_ALIGN() macro to round up the size of the allocation to the nearest page boundary.

Finally, let’s talk about the performance of vmalloc and how it may depend on several other factors.

Compared to other dynamic memory allocation APIs, such as kmalloc and kmem_cache_alloc, vmalloc can have relatively poorer performance in certain scenarios. This is because vmalloc provides a higher-level interface for allocating memory, which is more flexible but also requires more overhead. One factor that can affect vmalloc performance is fragmentation. Since vmalloc is used for allocating contiguous chunks of virtual memory, fragmentation can occur if there are many small or irregularly sized allocations made. This can result in increased overhead for managing the virtual memory, as well as a higher likelihood of page faults and TLB misses.

Another factor that can affect performance is the use of huge pages. While vmalloc can allocate memory using huge pages, this can also result in increased overhead due to the need for managing larger page tables and the possibility of TLB thrashing.

Overall, the performance of vmalloc can vary depending on the specific use case and the size and frequency of allocations. In some cases, kmalloc or kmem_cache_alloc may be a better choice for allocating memory with higher performance and lower overhead.

Kernel Memory Management: Theory and Practice

Sunday, March 19, 2023

Deep Dive into Vmalloc: Internal Mechanisms and Performance Analysis.