Kernel Memory Management: Theory and Practice: March 2023

Sunday, March 19, 2023

Deep Dive into Vmalloc: Internal Mechanisms and Performance Analysis.

This is my 3rd post in a series of writeups on Kernel Memory Management: Theory & Practice

The first two are here:-

In this post, we are going to scratch a bit into the vmalloc API frequently used for dynamic memory Allocations for large-size memory.

vmalloc and kmalloc are the two most commonly and frequently used dynamic memory allocation API in the kernel. With subtle differences between their modes of operation and the type of memory returned, it’s worth mentioning their implementation and logic. (refer to mainline kernel)

Here is a typical internal API flow called by vmalloc:-

vmalloc -> __vmalloc_node_range -> __get_vm_area_node -> __vmalloc_area_node -> kasan_unpoison_vmalloc

vmalloc is called to allocate a contiguous block of virtual memory.
vmalloc internally calls __vmalloc_node_range to allocate memory from the system page allocator.
__vmalloc_node_range calls __get_vm_area_node to find a suitable vm_area_struct data structure for the newly allocated virtual memory area.
__get_vm_area_node searches the kernel’s list of free vm_area_struct structures to find a suitable structure for the new virtual memory area.
If no suitable vm_area_struct structure is found, __get_vm_area_node calls __vmalloc_area_node to allocate a new vm_area_struct.
__vmalloc_area_node allocates a vm_area_struct and maps it into the kernel’s virtual address space using map_vm_area.
map_vm_area sets up the vm_area_struct to describe the newly allocated virtual memory area and inserts it into the kernel’s list of virtual memory areas using insert_vm_struct.
Finally, vmalloc returns a pointer to the start of the allocated virtual memory block to the calling code.
kasan_unpoison_vmalloc is called to unpoison the memory if Kernel Address Sanitizer (KASan) is enabled

The vmalloc function allocates memory from the high memory area of kernel virtual memory. High memory is a region of kernel virtual memory that is not directly accessible by the kernel and is therefore managed differently from other regions of kernel memory. vmalloc is used to allocate variable-sized memory regions that are not page-aligned. Because high memory can be dynamically allocated and deallocated, vmalloc is well-suited for allocating large, non-contiguous memory regions.

However, it’s worth noting that vmalloc does not actually allocate physical memory; instead, it allocates virtual memory pages and maps them to physical memory as needed. This means that the amount of physical memory used vmalloc may be larger than the amount of virtual memory allocated by the function.

The below diagram explains how memory allocated by vmalloc is guaranteed to be contagious in virtual memory but not guaranteed to be contagious in physical memory. Mostly there are discontiguous physical pages that are mapped into a contiguous virtual memory area via page table.

Below is a diagram that shows a typical memory map of an ARM-based system. In the above diagram, the top section represents the user space virtual addresses, which are used by user-level processes. The bottom section represents the physical memory, which is the actual hardware memory that stores the data. The user application resides within the user space virtual addresses, while the operating system kernel resides in the kernel space. The kernel space is typically mapped to a fixed region of virtual addresses that is separate from the user space virtual addresses.

Physical memory is divided into two regions: low memory and high memory. The low memory is directly mapped to physical memory and is used for kernel code, data, and stack. The high memory region is not directly mapped to physical memory and is used for dynamic memory allocation, including vmalloc allocations.

Overall, this diagram shows how the virtual and physical memory maps are organized on ARM-based systems, with user and kernel spaces separated and low and high memory regions used for different purposes.

So having described the vmalloc philosophy above, below we take look at the actual code of one of the core internal help functions in kernel vmalloc_node_range.

Below is the actual vmalloc_node_range which is the heart of vmalloc implementation.


void *__vmalloc_node_range(unsigned long size, unsigned long align,	unsigned long start, unsigned long end, gfp_t gfp_mask,	pgprot_t prot, unsigned long vm_flags, int node,const void *caller)
{
	if (WARN_ON_ONCE(!size))
		return NULL;
	if ((size >> PAGE_SHIFT) > totalram_pages()) {
		warn_alloc(gfp_mask, NULL,
			"vmalloc error: size %lu, exceeds total pages",
			real_size);
		return NULL;
	}
	if (vmap_allow_huge && (vm_flags & VM_ALLOW_HUGE_VMAP)) {
		unsigned long size_per_node;
		size_per_node = size;
		if (node == NUMA_NO_NODE)
			size_per_node /= num_online_nodes();
		if (arch_vmap_pmd_supported(prot) && size_per_node >= PMD_SIZE)
			shift = PMD_SHIFT;
		else
			shift = arch_vmap_pte_supported_shift(size_per_node);

		align = max(real_align, 1UL << shift);
		size = ALIGN(real_size, 1UL << shift);
	}
again:
	area = __get_vm_area_node(real_size, align, shift, VM_ALLOC |
				  VM_UNINITIALIZED | vm_flags, start, end, node,
				  gfp_mask, caller);
	if (!area) {
		bool nofail = gfp_mask & __GFP_NOFAIL;
		warn_alloc(gfp_mask, NULL,
			"vmalloc error: size %lu, vm_struct allocation failed%s",
			real_size, (nofail) ? ". Retrying." : "");
		if (nofail) {
			schedule_timeout_uninterruptible(1);
			goto again;
		}
		goto fail;
	}
	if (pgprot_val(prot) == pgprot_val(PAGE_KERNEL)) {
		if (kasan_hw_tags_enabled()) {
			prot = arch_vmap_pgprot_tagged(prot);

			gfp_mask |= __GFP_SKIP_KASAN_UNPOISON | __GFP_SKIP_ZERO;
		}

		
		kasan_flags |= KASAN_VMALLOC_PROT_NORMAL;
	}
	ret = __vmalloc_area_node(area, gfp_mask, prot, shift, node);
	if (!ret)
		goto fail;
	kasan_flags |= KASAN_VMALLOC_VM_ALLOC;
	if (!want_init_on_free() && want_init_on_alloc(gfp_mask) &&
	    (gfp_mask & __GFP_SKIP_ZERO))
		kasan_flags |= KASAN_VMALLOC_INIT;

	area->addr = kasan_unpoison_vmalloc(area->addr, real_size, kasan_flags);
	clear_vm_uninitialized_flag(area);
	size = PAGE_ALIGN(size);
	if (!(vm_flags & VM_DEFER_KMEMLEAK))
		kmemleak_vmalloc(area, size, gfp_mask);
	return area->addr;
fail:
	if (shift > PAGE_SHIFT) {
		shift = PAGE_SHIFT;
		align = real_align;
		size = real_size;
		goto again;
	}
	return NULL;
}

vmalloc_node_range takes in several parameters, including the size of the requested memory allocation, alignment, start and end addresses, allocation flags, protection flags, node ID, and a caller identifier.

The function checks whether the size of the requested allocation exceeds the total number of pages of physical memory available on the system. If it does, an error message is printed, and NULL is returned.
It then checks whether the architecture supports huge pages and whether the allocation size per node is greater than or equal to the size of a huge page.
If so, the function sets the shift variable to PMD_SHIFT, the size of a huge page, and sets the align variable to the maximum value between the real alignment
value and the size of a huge page. Otherwise, the function calls the architecture-specific arch_vmap_pte_supported_shift() function to determine the appropriate shift value for the allocation size per node.
The function calls the __get_vm_area_node() function to obtain a virtual memory area that can accommodate the requested allocation size, alignment, and flags. If no suitable area is found,the function returns NULL. If the __GFP_NOFAIL flag is set in the gfp_mask parameter, the function retries the allocation after waiting for one second.
If the prot parameter is equal to PAGE_KERNEL, the function checks whether KASan (Kernel Address Sanitizer) hardware tagging is enabled. If so, it sets the prot variable to the result of calling the arch_vmap_pgprot_tagged() function with the prot parameter as an argument. It also sets the __GFP_SKIP_KASAN_UNPOISON and __GFP_SKIP_ZERO flags in the gfp_mask parameter.
The function sets the KASAN_VMALLOC_PROT_NORMAL flag in the kasan_flags variable.
The function calls the __vmalloc_area_node() function to allocate the virtual memory area obtained in step 4. If the allocation fails, the function returns NULL.
The function sets the KASAN_VMALLOC_VM_ALLOC flag in the kasan_flags variable.
If the __GFP_SKIP_ZERO flag is set in the gfp_mask parameter and the want_init_on_alloc() function returns true, the function sets the KASAN_VMALLOC_INIT flag in the kasan_flags variable.
The function calls the kasan_unpoison_vmalloc() function to mark the allocated memory as uninitialized.
The function clears the VM_UNINITIALIZED flag in the virtual memory area obtained in step 4.
The function calls the PAGE_ALIGN() macro to round up the size of the allocation to the nearest page boundary.

Finally, let’s talk about the performance of vmalloc and how it may depend on several other factors.

Compared to other dynamic memory allocation APIs, such as kmalloc and kmem_cache_alloc, vmalloc can have relatively poorer performance in certain scenarios. This is because vmalloc provides a higher-level interface for allocating memory, which is more flexible but also requires more overhead. One factor that can affect vmalloc performance is fragmentation. Since vmalloc is used for allocating contiguous chunks of virtual memory, fragmentation can occur if there are many small or irregularly sized allocations made. This can result in increased overhead for managing the virtual memory, as well as a higher likelihood of page faults and TLB misses.

Another factor that can affect performance is the use of huge pages. While vmalloc can allocate memory using huge pages, this can also result in increased overhead due to the need for managing larger page tables and the possibility of TLB thrashing.

Overall, the performance of vmalloc can vary depending on the specific use case and the size and frequency of allocations. In some cases, kmalloc or kmem_cache_alloc may be a better choice for allocating memory with higher performance and lower overhead.

Friday, March 17, 2023

Kernel Heap management

Now, this is the most important topic for memory management enthusiasts and developers as it offers key insights into kernel heap management, strategies, and algorithms thereof.

It is a critical component of the Linux kernel that provides memory allocation and deallocation services to various parts of the operating system. The kernel heap is dedicated to storing dynamically allocated memory. This memory is allocated on demand and is released back to the heap when it is no longer needed. The kernel heap is managed by a set of memory allocation routines that are provided by the kernel. Following is a list of some of the most commonly used kernel heap management routines:

kmalloc(): This routine is used to allocate a contiguous block of memory from the kernel heap. It takes two arguments - the size of the memory to be allocated and a set of flags that specify the type of memory to be allocated. returns a pointer to the allocated memory, or NULL.

kcalloc(): This routine is similar to kmalloc(), but it allocates memory for an array of objects rather than a single object. It takes two arguments - the number of objects to be allocated and the size of each object. Returns a pointer to the allocated memory, or NULL.

kzalloc(): This routine is similar to kmalloc(), but it initializes the memory to zero before returning it to the caller. This is useful for allocating memory for data structures that need to be initialized to a known state.

vmalloc(): This routine is used to allocate a non-contiguous block of memory from the kernel heap. It takes two arguments - the size of the memory to be allocated and a set of flags that specify the type of memory to be allocated.

vfree(): This routine is used to release memory that was previously allocated with vmalloc(). It takes a single argument - a pointer to the memory that should be released.

kmem_cache_create(): This routine is used to create a memory cache that can be used to allocate objects of a specific size. is used to create a slab cache in the kernel. Slab caches are used to manage small, fixed-sized objects in the kernel, such as task_struct and file structures. Once a slab cache is created with kmem_cache_create(), it can be used by other parts of the kernel to allocate and free objects of the specified size. This reduces the overhead of frequent calls to kmalloc() and kfree() for small, fixed-size objects.

kmem_cache_alloc(): This routine is used to allocate an object from a memory cache that was created with kmem_cache_create(). It takes a single argument - a pointer to the cache from which the object should be allocated.

kmem_cache_free(): This routine is used to release an object that was previously allocated with kmem_cache_alloc(). It takes two arguments - a pointer to the cache from which the object was allocated, and a pointer to the object that should be released.

Kernel heap management is designed to be efficient and scalable. To achieve this, the kernel uses a number of techniques to manage the heap, including memory caching, slab allocation, and page allocation. Memory caching is used to reduce the overhead of memory allocation by reusing previously allocated memory whenever possible. Slab allocation is used to improve the efficiency of memory allocation by allocating and deallocating objects of the same size in bulk. Page allocation is used to allocate large blocks of contiguous memory when needed.

In summary, kernel heap management is a critical component of the Linux kernel that provides memory allocation and deallocation services to various parts of the kernel.

Kernel Memory Management: Introduction

Kernel Memory Management is a fundamental aspect of operating system design and plays a crucial role in the functioning of the kernel. In general, memory management is responsible for managing the physical memory of a computer system, allocating memory resources to different applications, and freeing up memory resources when they are no longer needed. In the case of the kernel, memory management becomes more complex due to the fact that the kernel itself requires memory resources to function, and must manage memory resources on behalf of other applications. As such, kernel memory management is an important area of study for operating system developers and programmers.

There are two main aspects to kernel memory management: virtual memory management and physical memory management. Virtual memory management is concerned with the allocation and mapping of virtual memory addresses to physical memory addresses. This allows multiple applications to use the same physical memory resources without interfering with each other. Physical memory management, on the other hand, is concerned with the allocation and deallocation of physical memory pages to the kernel and applications. To better understand kernel memory management, it is important to understand the different types of memory used by the kernel.

These include:

Kernel code and data - this is the memory used by the kernel itself to store its own code and data structures.
User process memory - this is the memory used by applications running in user mode. This includes the stack and heap used by the application.
Kernel stack - each thread in the kernel has its own stack, used to store temporary data.
Kernel heap - this is a general-purpose memory allocator used by the kernel to dynamically allocate memory.
Device memory - this is memory used by devices and peripherals connected to the system.
Direct Memory Access (DMA) memory - this is memory used by devices to directly access memory without going through the CPU.

Effective kernel memory management involves balancing the needs of the kernel and applications for memory resources, while also ensuring that memory is used efficiently and effectively. This often involves techniques such as memory pooling, garbage collection, and memory mapping.

In subsequent posts, I will be dwelling deeper into each of these subtopics with snippets from kernel code for illustrator purpose. the idea is to simplify memory understanding when it comes to Linux as it can be bit hairy at times.