Notes: Understanding the linux kernel Chapter 8 Memory Management


dynamic memory

Page Frame Management

Page Descriptors

The kernel must be able to distinguish the page frames that are used to contain pages belonging to processes from those that contain kernel code or kernel data structures. Similarly, it must be able to determine whether a page frame in dynamic memory is free.

All page descriptors (struct page) are stored in the mem_map array. Because each descriptor is 32 bytes long, the space required by mem_map is slightly less than 1% of the whole RAM.

The virt_to_page(addr) macro yields the address of the page descriptor associated with the linear address addr.
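
On a non-NUMA 80x86 system the macro reduces, roughly, to an index into mem_map; a sketch modeled on the 2.6 i386 headers (details vary by configuration):

/* linear address -> physical address -> page frame number -> descriptor */
#define virt_to_page(kaddr)  (mem_map + (__pa(kaddr) >> PAGE_SHIFT))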

fields of struct page
_count

A usage reference counter for the page. If it is set to -1, the corresponding page frame is free and can be assigned to any process or to the kernel itself. The page_count() function returns the value of the _count field increased by one, that is, the number of users of the page.
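
In the 2.6 sources this is a simple macro; a sketch:

/* number of users = _count + 1; a free page frame has _count == -1 */
#define page_count(p)  (atomic_read(&(p)->_count) + 1)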

flags

flags describes the status of the page frame. For each PG_xyz flag, the kernel defines macros that manipulate its value. Usually, the PageXyz macro returns the value of the flag, while the SetPageXyz and ClearPageXyz macros set and clear the corresponding bit, respectively.
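
For example, for the PG_locked flag the macros look roughly like this (a sketch of the page-flags.h pattern):

#define PageLocked(page)      test_bit(PG_locked, &(page)->flags)
#define SetPageLocked(page)   set_bit(PG_locked, &(page)->flags)
#define ClearPageLocked(page) clear_bit(PG_locked, &(page)->flags)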

private

available to the kernel component that is using the page.

Non-Uniform Memory Access(NUMA)

Linux 2.6 supports the Non-Uniform Memory Access (NUMA) model, in which the access times for different memory locations from a given CPU may vary. The physical memory of the system is partitioned in several nodes. The time needed by a given CPU to access pages within a single node is the same. However, this time might not be the same for two different CPUs. For every CPU, the kernel tries to minimize the number of accesses to costly nodes by carefully selecting where the kernel data structures that are most often referenced by the CPU are stored.

The physical memory inside each node can be split into several zones. Each node has a descriptor of type pg_data_t. All node descriptors are stored in a singly linked list, whose first element is pointed to by the pgdat_list variable.

On architectures that don't require NUMA, the Linux kernel still maintains a single node, pointed to by contig_page_data, to make kernel code more portable (see below).

Memory Zones

On architectures that don't need NUMA, Linux makes use of a single node that includes all system physical memory. Thus, the pgdat_list variable points to a list consisting of a single element--the node 0 descriptor--stored in the contig_page_data variable.

  • The Direct Memory Access (DMA) processors for old ISA buses have a strong limitation: they are able to address only the first 16 MB of RAM.
  • In modern 32-bit computers with lots of RAM, the CPU cannot directly access all physical memory because the linear address space is too small.

To cope with these two limitations, Linux 2.6 partitions the physical memory of every memory node into three zones:

  • ZONE_DMA

    Contains page frames of memory below 16 MB

  • ZONE_NORMAL

    Contains page frames of memory at and above 16 MB and below 896 MB

  • ZONE_HIGHMEM

    Contains page frames of memory at and above 896 MB

The ZONE_DMA zone includes page frames that can be used by old ISA-based devices by means of DMA (used by devices and the BIOS). The ZONE_DMA and ZONE_NORMAL zones include the “normal” page frames that can be directly accessed by the kernel through the linear mapping in the fourth gigabyte of the linear address space (3 GB~4 GB). The ZONE_HIGHMEM zone includes page frames that cannot be directly accessed by the kernel through the linear mapping (RAM for which no corresponding kernel Page Table entries have been created). The ZONE_HIGHMEM zone is always empty on 64-bit architectures.

Each page descriptor has links to the memory node and to the zone inside the node that includes the corresponding page frame.To save space, these links are not stored as classical pointers; rather, they are encoded as indices stored in the high bits of the flags field. The page_zone() function receives as its parameter the address of a page descriptor; it reads the most significant bits of the flags field in the page descriptor, then it determines the address of the corresponding zone descriptor by looking in the zone_table array. This array is initialized at boot time with the addresses of all zone descriptors of all memory nodes.
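
A sketch of page_zone(), modeled on the 2.6 sources (the shift constant's name varies across versions):

static inline struct zone *page_zone(struct page *page)
{
	/* the node and zone indices live in the topmost bits of flags */
	return zone_table[page->flags >> NODEZONE_SHIFT];
}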

The Pool of Reserved Page Frames

atomic memory allocation requests

An atomic memory request never blocks: if there are not enough free pages, the allocation simply fails. The kernel reserves a pool of page frames for atomic memory allocation requests, to be used only in low-on-memory conditions (e.g., by the routines that free memory for the kernel).

The ZONE_DMA and ZONE_NORMAL memory zones contribute to the reserved memory with a number of page frames proportional to their relative sizes.

The pages_min field of the zone descriptor stores the number of reserved page frames inside the zone.

The Zoned Page Frame Allocator

The kernel subsystem that handles the memory allocation requests for groups of contiguous page frames is called the zoned page frame allocator.

The component named “zone allocator” receives the requests for allocation and deallocation of dynamic memory. In the case of allocation requests, the component searches a memory zone that includes a group of contiguous page frames that can satisfy the request. Inside each zone, page frames are handled by a component named “buddy system” . To get better system performance, a small number of page frames are kept in page frame cache to quickly satisfy the allocation requests for single page frames.

Requesting and releasing page frames

macros
// allocation
alloc_pages(gfp_mask, order)      // request 2^order contiguous page frames; returns the address of the descriptor of the first allocated page frame
alloc_page(gfp_mask)              // get a single page frame; expands to: alloc_pages(gfp_mask, 0)
__get_free_pages(gfp_mask, order) // returns the linear address of the first allocated page frame
__get_free_page(gfp_mask)         // expands to: __get_free_pages(gfp_mask, 0)
get_zeroed_page(gfp_mask)         // obtain a page frame filled with zeros; invokes: alloc_pages(gfp_mask | __GFP_ZERO, 0)
__get_dma_pages(gfp_mask, order)  // get page frames suitable for DMA; expands to: __get_free_pages(gfp_mask | __GFP_DMA, order)


//free
__free_pages(page, order)
free_pages(addr, order)
__free_page(page)
free_page(addr)

The parameter gfp_mask is a group of flags that specify how to look for free page frames.
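
A minimal usage sketch of the request/release pairs above (GFP_KERNEL allows the caller to sleep):

/* request 2^2 = 4 contiguous page frames ... */
unsigned long addr = __get_free_pages(GFP_KERNEL, 2);
if (addr)
	free_pages(addr, 2);   /* ... and give them back */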

predefined gfp_mask

The node_zonelists field of the contig_page_data node descriptor is an array of lists of zone descriptors representing the fallback zones: for each setting of the zone modifiers, the corresponding list includes the memory zones that could be used to satisfy the memory allocation request in case the original zone is short on page frames.

kernel annotation for zonelists
 * ..........
 * When a memory allocation must conform to specific limitations (such
 * as being suitable for DMA) the caller will pass in hints to the
 * allocator in the gfp_mask, in the zone modifier bits.  These bits
 * are used to select a priority ordered list of memory zones which
 * match the requested limits.  GFP_ZONEMASK defines which bits within
 * the gfp_mask should be considered as zone modifiers.  Each valid
 * combination of the zone modifier bits has a corresponding list
 * of zones (in node_zonelists).  Thus for two zone modifiers there
 * will be a maximum of 4 (2 ** 2) zonelists, for 3 modifiers there will
 * be 8 (2 ** 3) zonelists.  GFP_ZONETYPES defines the number of possible
 * combinations of zone modifiers in "zone modifier space".
 * ..........

Kernel Mappings of High-Memory Page Frames

The highest 128 MB of linear addresses are left available for several kinds of mappings; the kernel address space left for mapping the RAM is thus 1 GB - 128 MB = 896 MB (of linear addresses).

Page frames above the 896 MB boundary (ZONE_HIGHMEM) are not generally mapped in the fourth gigabyte of the kernel linear address space, so the kernel is unable to access them directly. This implies that each page allocator function that returns the linear address of the assigned page frame doesn’t work for high-memory page frames, that is, for page frames in the ZONE_HIGHMEM memory zone (on 32-bit architectures); e.g., __get_free_pages(GFP_HIGHMEM, 0) always returns NULL.

This problem does not exist on 64-bit hardware platforms, because the available linear address space(up to 2^64) is much larger than the amount of RAM that can be installed. On 32-bit platforms Linux designers had to find some way to allow the kernel to exploit all the available RAM(2^36 with PAE support). The approach adopted is the following:

  1. The allocation of high-memory page frames is done only through the alloc_pages() function and its alloc_page() shortcut.

    the functions return the linear address of the page descriptor of the first allocated page frame. These linear addresses always exist, because all page descriptors are allocated in low memory once and forever during the kernel initialization.

  2. Page frames in high memory that do not have a linear address cannot be accessed by the kernel. Therefore, part of the last 128 MB of the kernel linear address space is dedicated to mapping high-memory page frames temporarily.


ways of mapping page frames in high memory

The kernel uses three different mechanisms to map page frames in high memory; they are called permanent kernel mapping, temporary kernel mapping, and noncontiguous memory allocation.

Establishing a permanent kernel mapping may block the current process (so it cannot be done by interrupt handlers and deferrable functions); this happens when no free Page Table entries exist that can be used as “windows” on the page frames in high memory.
Conversely, establishing a temporary kernel mapping never requires blocking the current process; its drawback, however, is that very few temporary kernel mappings can be established at the same time.

permanent kernel mapping

uses the address of the struct page to represent the physical address

Permanent kernel mappings use a dedicated Page Table in the master kernel page tables. The pkmap_page_table variable stores the address of this Page Table, while the LAST_PKMAP macro yields the number of entries (512 with PAE enabled, 1024 otherwise), so at most 2 MB or 4 MB of high memory can be mapped at once.

The Page Table maps the linear addresses starting from PKMAP_BASE (which lies in the fourth gigabyte, between 3 GB and 4 GB). The pkmap_count array includes LAST_PKMAP counters, one for each entry of the pkmap_page_table Page Table. For each counter:

  • counter == 0: the corresponding Page Table entry is unused.
  • counter == 1: the corresponding Page Table entry does not map any high-memory page frame, but it cannot be reused because the corresponding TLB entry has not been flushed since its last usage.
  • counter == n > 1: the corresponding Page Table entry maps a high-memory page frame, which is used by exactly n - 1 kernel components.

page_address_htable

page_address_htable contains one page_address_map data structure for each page frame in high memory that is currently mapped. In turn, this data structure contains a pointer to the page descriptor (the descriptor itself resides in low memory) and the linear address assigned to the page frame (a linear address between PKMAP_BASE and PKMAP_BASE plus 2 MB or 4 MB).

page_address()

The page_address() function returns the linear address associated with the page frame, or NULL if the page frame is in high memory and is not mapped. This function, which receives as its parameter a page descriptor pointer page, distinguishes two cases:

  • If the page frame is not in high memory (PG_highmem flag clear), the linear address always exists and is obtained by computing the page frame index, converting it into a physical address, and finally deriving the linear address corresponding to the physical address(linear address = physical address + PAGE_OFFSET[0xc0000000]).
  • If the page frame is in high memory (PG_highmem flag set), the function looks into the page_address_htable hash table. If the page frame is found in the hash table, page_address() returns its linear address, otherwise it returns NULL.
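
A sketch of the two cases (modeled on mm/highmem.c; the hash lookup is abbreviated):

void *page_address(struct page *page)
{
	if (!PageHighMem(page))
		return lowmem_page_address(page);  /* essentially __va(pfn << PAGE_SHIFT) */
	/* high memory: search page_address_htable and return the mapped
	   linear address, or NULL if the frame is not currently mapped */
	......
}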

kmap() & kunmap_high()

The kmap() function establishes a permanent kernel mapping.

see p308 for detailed information
// call chain
kmap():
    if not PageHighMem(page): return page_address(page)
    else: kmap_high():
        page_address(); if the frame is already mapped, return its linear address
        otherwise map_new_virtual(), which may wait() until an entry is freed

kunmap_high(): decrements the pkmap_count counter and may wake_up() processes sleeping in map_new_virtual()
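
A sketch of kmap() itself, based on the 2.6 i386 source:

void *kmap(struct page *page)
{
	might_sleep();                  /* may block: not for interrupt context */
	if (!PageHighMem(page))
		return page_address(page);  /* low memory: mapping always exists */
	return kmap_high(page);         /* high memory: use pkmap_page_table */
}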

temporary kernel mapping

uses an index into fixed_addresses (the km_type symbols) to represent the mapping
fixed_addresses (fix-mapped linear addresses)

Each fix-mapped linear address is represented by a small integer index defined in the enum fixed_addresses data structure (it includes the fix-mapped addresses indexed by the symbols of km_type).

enum fixed_addresses
enum fixed_addresses {
	FIX_HOLE,
	FIX_VSYSCALL,

............

#ifdef CONFIG_HIGHMEM
	FIX_KMAP_BEGIN,	/* reserved pte's for temporary kernel mappings */
	FIX_KMAP_END = FIX_KMAP_BEGIN+(KM_TYPE_NR*NR_CPUS)-1, /* see below */
#endif

............
}

// fix_to_virt():     index -----> linear address
static inline unsigned long fix_to_virt(const unsigned int idx)
{
	if (idx >= __end_of_fixed_addresses)
		__this_fixmap_does_not_exist();
	return (0xfffff000UL - (idx << PAGE_SHIFT)); // PAGE_SHIFT: 12
}

window

a Page Table entry that is reserved to map high memory. The number of windows reserved for temporary kernel mappings is quite small.

km_type

Each CPU has its own set of 13 windows, represented by the enum km_type data structure. Each symbol defined in this data structure—such as KM_BOUNCE_READ, KM_USER0, or KM_PTE0—identifies the linear address of a window.The kernel must ensure that the same window is never used by two kernel control paths at the same time. Thus, each symbol in the km_type structure is dedicated to one kernel component and is named after the component.

Each symbol in km_type, except the last one, is an index of a fix-mapped linear address (invoke fix_to_virt(index) to get the corresponding linear address).

kmap_atomic()
establish temporary kernel mapping

The type argument and the CPU identifier retrieved through smp_processor_id() specify which fix-mapped linear address has to be used to map the requested page. The function returns the linear address of the page frame if it doesn’t belong to high memory; otherwise, it sets up the Page Table entry corresponding to the fix-mapped linear address with the page’s physical address.
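
A sketch of kmap_atomic() modeled on the 2.6 i386 source (preemption bookkeeping omitted):

void *kmap_atomic(struct page *page, enum km_type type)
{
	enum fixed_addresses idx;
	unsigned long vaddr;

	if (!PageHighMem(page))
		return page_address(page);            /* low memory: nothing to do */

	idx = type + KM_TYPE_NR * smp_processor_id();     /* per-CPU window */
	vaddr = __fix_to_virt(FIX_KMAP_BEGIN + idx);
	set_pte(kmap_pte - idx, mk_pte(page, kmap_prot)); /* install the mapping */
	__flush_tlb_one(vaddr);                           /* flush the stale TLB entry */
	return (void *)vaddr;
}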

kunmap_atomic() destroys a temporary kernel mapping.

The Buddy System Algorithm (external fragmentation)

the buddy system algorithm

The technique adopted by Linux to solve the external fragmentation problem is based on the well-known buddy system algorithm. All free page frames are grouped into 11 lists of blocks that contain groups of 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 contiguous page frames, respectively.

Data structures

Linux 2.6 uses a different buddy system for each zone. Thus, in the 80 × 86 architecture, there are 3 buddy systems: the first handles the page frames suitable for ISA DMA, the second handles the “normal” page frames, and the third handles the high-memory page frames.

The buddy system relies on two main data structures:

  1. the mem_map array.(see above)
  2. An array consisting of eleven elements of type free_area, one element for each group size. The array is stored in the free_area field of the zone descriptor.

struct free_area

The k-th element of the free_area array in the zone descriptor identifies all the free blocks of size 2^k. The free_list field of this element is the head of a doubly linked circular list that collects the page descriptors associated with the free blocks of 2^k pages.

Besides the head of the list, the k-th element of the free_area array includes also the field nr_free, which specifies the number of free blocks of size 2^k pages. Finally, the private field of the descriptor of the first page in a block of 2^k free pages stores the order of the block, that is, the number k.
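
The element type itself is small (as defined in the 2.6 include/linux/mmzone.h):

struct free_area {
	struct list_head free_list;  /* circular list of first-page descriptors */
	unsigned long    nr_free;    /* number of free blocks of this order */
};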

Allocating a block

The __rmqueue() function is used to find a free block in a zone. The function takes two arguments: the address of the zone descriptor, and order, which denotes the logarithm of the size of the requested block of free pages (0 for a one-page block, 1 for a two-page block, and so forth). If the page frames are successfully allocated, the __rmqueue() function returns the address of the page descriptor of the first allocated page frame. Otherwise, the function returns NULL.

The __rmqueue() function assumes that the caller has already disabled local interrupts and acquired the zone->lock spin lock, which protects the data structures of the buddy system.

Freeing a block

The __free_pages_bulk() function implements the buddy system strategy for freeing page frames. The function assumes that the caller has already disabled local interrupts and acquired the zone->lock spin lock.

The way the kernel finds the 'buddy' of a freed block and merges the two is clever; see P316 and the sketch below.
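
The core trick is an XOR on page frame indices; a conceptual sketch (indices are relative to the start of the zone):

/* the buddy of the block of 2^order frames starting at page_idx is the
   block whose index differs only in bit 'order': */
buddy_idx = page_idx ^ (1 << order);
/* if the buddy is also free and of the same order, the merged block of
   2^(order+1) frames starts at: */
combined_idx = page_idx & ~(1 << order);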

The Per-CPU Page Frame Cache

The kernel often requests and releases single page frames. To boost system performance, each memory zone defines a per-CPU page frame cache. Each per-CPU cache includes some pre-allocated page frames to be used for single memory requests issued by the local CPU.

There are two caches for each memory zone and for each CPU: a hot cache, which stores page frames whose contents are likely to be included in the CPU’s hardware cache, and a cold cache.

The hot cache reduces the number of hardware cache invalidations; the cold cache is preferable for DMA operations, because it preserves the reserve of hot page frames for other kinds of memory allocation requests (gfp_flags: __GFP_COLD).

struct per_cpu_pageset & struct per_cpu_pages

The main data structure implementing the per-CPU page frame cache is an array of per_cpu_pageset data structures stored in the pageset field of the memory zone descriptor. The array includes one element for each CPU; this element, in turn, consists of two per_cpu_pages descriptors, one for the hot cache and one for the cold cache.
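
The per_cpu_pages descriptor, from the 2.6 include/linux/mmzone.h:

struct per_cpu_pages {
	int count;              /* number of page frames in the list */
	int low;                /* low watermark: refill needed */
	int high;               /* high watermark: emptying needed */
	int batch;              /* chunk size for buddy system transfers */
	struct list_head list;  /* list of the cached page frames */
};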

The kernel monitors the sizes of both the hot and the cold cache by using two watermarks: if the number of page frames falls below the low watermark, the kernel replenishes the proper cache by allocating batch single page frames from the buddy system; conversely, if the number of page frames rises above the high watermark, the kernel releases batch page frames of the cache to the buddy system. The values of batch, low, and high essentially depend on the number of page frames included in the memory zone.

Allocating page frames through the per-CPU page frame caches

The buffered_rmqueue() function allocates page frames in a given memory zone. It makes use of the per-CPU page frame caches to handle single page frame requests.

static struct page *buffered_rmqueue(struct zone *zone, int order, int gfp_flags)

If the __GFP_COLD flag is set in gfp_flags, the page frame is taken from the cold cache; otherwise, it is taken from the hot cache (this flag is meaningful only for single page frame requests).

steps:

  1. If order is not equal to 0, the per-CPU page frame cache cannot be used: the function jumps to step 4.
  2. Checks whether the memory zone’s local per-CPU cache identified by the value of the __GFP_COLD flag needs to be replenished (the count field of the per_cpu_pages descriptor is lower than or equal to the low field) and, if so, refills it by allocating batch page frames from the buddy system.
  3. If count is positive, the function gets a page frame from the cache’s list, decreases count, and jumps to step 5. If count is still zero (the refill in step 2 failed), it proceeds to step 4.
  4. Here, the memory request has not yet been satisfied, either because the request spans several contiguous page frames or because the selected page frame cache is empty. Invokes the __rmqueue() function to allocate the requested page frames from the buddy system.
  5. If the memory request has been satisfied, the function initializes the page descriptor of the (first) page frame: clears some flags, sets the private field to zero, and sets the page frame reference counter to one. Moreover, if the __GFP_ZERO flag in gfp_flags is set, it fills the allocated memory area with zeros.
  6. Returns the page descriptor address of the (first) page frame, or NULL if the memory allocation request failed.
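
A condensed sketch of the order-0 fast path described by steps 1-4, with locking and interrupt disabling omitted (modeled on mm/page_alloc.c):

struct page *page = NULL;
int cold = !!(gfp_flags & __GFP_COLD);

if (order == 0) {
	struct per_cpu_pages *pcp = &zone->pageset[smp_processor_id()].pcp[cold];
	if (pcp->count <= pcp->low)          /* step 2: replenish from the buddy system */
		pcp->count += rmqueue_bulk(zone, 0, pcp->batch, &pcp->list);
	if (pcp->count) {                    /* step 3: grab a cached frame */
		page = list_entry(pcp->list.next, struct page, lru);
		list_del(&page->lru);
		pcp->count--;
	}
}
if (page == NULL)                        /* step 4: fall back to the buddy system */
	page = __rmqueue(zone, order);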

Releasing page frames to the per-CPU page frame caches

The free_hot_cold_page() function receives as its parameters the descriptor address page of the page frame to be released and a cold flag specifying either the hot cache or the cold cache.

steps:

  1. Gets from the page->flags field the address of the memory zone descriptor including the page frame.
  2. Gets the address of the per_cpu_pages descriptor of the zone’s cache selected by the cold flag.
  3. Checks whether the cache should be depleted: if count is higher than or equal to high, invokes free_pages_bulk() to release some pages buffered in the cache to the buddy system.
  4. Adds the page frame to be released to the cache’s list, and increases the count field.

The Zone Allocator

The zone allocator is the frontend of the kernel page frame allocator. This component must locate a memory zone that includes a number of free page frames large enough to satisfy the memory request. The zone allocator must satisfy several goals:
1. It should protect the pool of reserved page frames.
2. It should trigger the page frame reclaiming algorithm when memory is scarce and blocking the current process is allowed; once some page frames have been freed, the zone allocator will retry the allocation.
3. It should preserve the small, precious ZONE_DMA memory zone, if possible.

alloc_pages() ----> __alloc_pages()

/*
 * This is the 'heart' of the zoned buddy allocator.
 */
struct page * fastcall
__alloc_pages(unsigned int gfp_mask, unsigned int order,
		struct zonelist *zonelist)
{
......
	for (i = 0; (z = zonelist->zones[i]) != NULL; i++) {
		if (zone_watermark_ok(z, order, ...)) {
			page = buffered_rmqueue(z, order, gfp_mask);
			if (page)
				return page;
		}
	}
......
}

// zonelist: Pointer to a zonelist data structure describing, in order of preference, the memory zones suitable for the memory allocation

The __alloc_pages() function scans every memory zone included in the zonelist data structure. For each memory zone, the function compares the number of free page frames with a threshold value that depends on the memory allocation flags, on the type of the current process, and on how many times the zone has already been checked by the function.

zone_watermark_ok()

/*
 * Return 1 if free pages are above 'mark'. This takes into account the order
 * of the allocation.
 */
int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
		      int classzone_idx, int can_try_harder, int gfp_high)
{
  ......
}

return 1 if:

  1. Besides the page frames to be allocated, there are at least min free page frames in the memory zone, not counting the page frames in the low-on-memory reserve (the lowmem_reserve field, which specifies the number of pages reserved for low-on-memory situations).
  2. Besides the page frames to be allocated, there are at least min/2^k free page frames in blocks of order at least k, for each k between 1 and the order of the allocation.
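
A sketch of how zone_watermark_ok() encodes these two conditions (modeled on the 2.6 source; gfp_high and can_try_harder lower the threshold for high-priority requests):

int zone_watermark_ok(struct zone *z, int order, unsigned long mark,
		      int classzone_idx, int can_try_harder, int gfp_high)
{
	long min = mark, free_pages = z->free_pages - (1 << order) + 1;
	int o;

	if (gfp_high)
		min -= min / 2;       /* e.g. atomic, high-priority requests */
	if (can_try_harder)
		min -= min / 4;       /* e.g. realtime or memory-reclaiming tasks */

	/* condition 1: enough free frames beyond min and the reserve */
	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
		return 0;
	/* condition 2: for each k below order, at least min/2^k free frames
	   in blocks of order at least k */
	for (o = 0; o < order; o++) {
		free_pages -= z->free_area[o].nr_free << o;  /* discount smaller blocks */
		min >>= 1;
		if (free_pages <= min)
			return 0;
	}
	return 1;
}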

steps of the __alloc_pages()

  1. Performs a first scanning of the memory zones. In this first scan, the min threshold value is set to z->pages_low, where z points to the zone descriptor being analyzed (the can_try_harder and gfp_high parameters are set to zero).
  2. If the function did not terminate in the previous step, there is not much free memory left: the function awakens the kswapd kernel threads to start reclaiming page frames asynchronously.
  3. Performs a second scanning of the memory zones, passing as base threshold the value z->pages_min. This scan is nearly identical to the one in step 1, except that it uses a lower threshold.
  4. If the function did not terminate in the previous step, the system is definitely low on memory.If the kernel control path that issued the memory allocation request is not an interrupt handler or a deferrable function and it is trying to reclaim page frames (either the PF_MEMALLOC flag or the PF_MEMDIE flag of current is set), the function then performs a third scanning of the memory zones, trying to allocate the page frames ignoring the low-on-memory thresholds—that is, without invoking zone_watermark_ok(). If no memory zone includes enough page frames, the function returns NULL to notify the caller of the failure.
  5. Here, the invoking kernel control path is not trying to reclaim memory. If the __GFP_WAIT flag of gfp_mask is not set, the function returns NULL to notify the kernel control path of the memory allocation failure: in this case, there is no way to satisfy the request without blocking the current process.
  6. Here the current process can be blocked: invokes cond_resched() to check whether some other process needs the CPU.
  7. Sets the PF_MEMALLOC flag of current, to denote the fact that the process is ready to perform memory reclaiming.
  8. Stores in current->reclaim_state a pointer to a reclaim_state structure (see below).
  9. Invokes try_to_free_pages() to look for some page frames to be reclaimed. The latter function may block the current process. Once that function returns, __alloc_pages() resets the PF_MEMALLOC flag of current and invokes cond_resched() once more.
  10. If the previous step has freed some page frames, the function performs yet another scanning of the memory zones equal to the one performed in step 3.
  11. If no page frame has been freed in step 9, the kernel is in deep trouble, because free memory is dangerously low and it was not possible to reclaim any page frame. The kernel then tries to kill a process to reclaim its memory.

Releasing a group of page frames

All kernel macros and functions that release page frames rely on the __free_pages() function. It receives as its parameters the address of the page descriptor of the first page frame to be released (page), and the logarithmic size of the group of contiguous page frames to be released (order).

Memory Area Management (internal fragmentation)

This section deals with memory areas—that is, with sequences of memory cells having contiguous physical addresses and an arbitrary length.

Clearly, it would be quite wasteful to allocate a full page frame to store a few bytes. A better approach instead consists of introducing new data structures that describe how small memory areas are allocated within the same page frame.

The Slab Allocator

The slab allocator groups objects into caches. Each cache is a “store” of objects of the same type. The area of main memory that contains a cache is divided into slabs; each slab consists of one or more contiguous page frames that contain both allocated and free objects.

the kernel periodically scans the caches and releases the page frames corresponding to empty slabs.

Cache Descriptor

Each cache is described by a structure of type kmem_cache_t(which is equivalent to the type struct kmem_cache_s).

Slab Descriptor

Slab descriptors can be stored in two possible places:

  • External slab descriptor

    Stored outside the slab, in one of the general caches (not suitable for ISA DMA) pointed to by cache_sizes (see below).

  • Internal slab descriptor

    Stored inside the slab, at the beginning of the first page frame assigned to the slab.

The slab allocator chooses the second solution when the size of the objects is smaller than 512 bytes, or when internal fragmentation leaves enough space for the slab descriptor and the object descriptors (as described later) inside the slab. The CFLGS_OFF_SLAB flag in the flags field of the cache descriptor is set to one if the slab descriptor is stored outside the slab; it is set to zero otherwise.

Relationship between cache and slab descriptors

Full slabs, Partially full slabs, and Free slabs are linked in different lists.

General and Specific Caches

Caches are divided into two types: general and specific. General caches are used only by the slab allocator for its own purposes, while specific caches are used by the remaining parts of the kernel.

general caches

  • A first cache called kmem_cache whose objects are the cache descriptors of the remaining caches used by the kernel. The cache_cache variable contains the descriptor of this special cache.
  • Several additional caches contain general purpose memory areas. The range of the memory area sizes typically includes 13 geometrically distributed sizes. A table called malloc_sizes (whose elements are of type cache_sizes) points to 26 cache descriptors associated with memory areas of size 32, 64, 128, 256, 512, 1,024, 2,048, 4,096, 8,192, 16,384, 32,768, 65,536, and 131,072 bytes. For each size, there are two caches: one suitable for ISA DMA allocations and the other for normal allocations.
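
The malloc_sizes table entries pair each size with its two caches (from the 2.6 headers):

struct cache_sizes {
	size_t        cs_size;      /* object size: 32, 64, ... 131072 */
	kmem_cache_t *cs_cachep;    /* cache for normal allocations */
	kmem_cache_t *cs_dmacachep; /* cache for ISA DMA allocations */
};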

The kmem_cache_init() function is invoked during system initialization to set up the general caches.

specific caches

Specific caches are created by the kmem_cache_create() function. It allocates a cache descriptor for the new cache from the cache_cache general cache and inserts the descriptor into the cache_chain list of cache descriptors while holding the cache_chain_sem semaphore.
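
A hypothetical usage sketch (my_req and the cache name are made up for illustration; in 2.6 the last two arguments are the optional constructor and destructor):

kmem_cache_t *req_cachep;

req_cachep = kmem_cache_create("my_req",            /* name shown in /proc/slabinfo */
			       sizeof(struct my_req),
			       0,                    /* default alignment */
			       SLAB_HWCACHE_ALIGN,   /* align objects to L1 cache lines */
			       NULL, NULL);          /* no constructor/destructor */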

allocation and reclaim of the caches

It is also possible to destroy a cache and remove it from the cache_chain list by invoking kmem_cache_destroy(). This function is mostly useful to modules that create their own caches when loaded and destroy them when unloaded. To avoid wasting memory space, the kernel must destroy all slabs before destroying the cache itself. The kmem_cache_shrink() function destroys all the slabs in a cache by invoking slab_destroy() iteratively.

To inspect the caches on a running system:
cat /proc/slabinfo | head

Interfacing the Slab Allocator with the Zoned Page Frame Allocator

When the slab allocator creates a new slab, it relies on the zoned page frame allocator to obtain a group of free contiguous page frames; to this end it calls kmem_getpages().

kmem_getpages()

// Returns the linear address yielded by page_address(); thus it cannot
// allocate ZONE_HIGHMEM memory for a cache.
void *kmem_getpages(kmem_cache_t *cachep, int flags);

If the slab cache has been created with the SLAB_RECLAIM_ACCOUNT flag set, the page frames assigned to the slabs are accounted for as reclaimable pages when the kernel checks whether there is enough memory to satisfy some User Mode requests. The function also sets the PG_slab flag in the page descriptors of the allocated page frames.

cachep

Points to the cache descriptor of the cache that needs additional page frames; the number of required page frames is determined by the order stored in the cachep->gfporder field (2^gfporder frames).

flags

Specifies how the page frames are requested. This set of flags is combined with the specific cache allocation flags stored in the gfpflags field of the cache descriptor.

kmem_freepages()

void kmem_freepages(kmem_cache_t *cachep, void *addr){...}

The function releases the page frames, starting from the one having the linear address addr, that had been allocated to the slab of the cache identified by cachep.

Allocating a Slab to a Cache

A newly created cache does not contain a slab and therefore does not contain any free objects. New slabs are assigned to a cache only when both of the following are true: (1) a request has been issued to allocate a new object; (2) the cache does not include a free object.

call chain of allocating new slab
cache_grow() ------> kmem_getpages() ------> alloc_slabmgmt()

steps of cache_grow():

  1. Calls kmem_getpages() to obtain from the zoned page frame allocator the group of page frames needed to store a single slab; it then calls alloc_slabmgmt() to get a new slab descriptor.
  2. If the CFLGS_OFF_SLAB flag of the cache descriptor is set, the slab descriptor is allocated from the general cache pointed to by the slabp_cache field of the cache descriptor; otherwise, the slab descriptor is allocated in the first page frame of the slab.
  3. Loads the next and prev subfields of the lru fields of the page descriptors assigned to the slab with the addresses of, respectively, the cache descriptor and the slab descriptor. This helps the kernel determine whether a page frame is used by the slab allocator (see also PG_slab) and quickly derive the addresses of the corresponding cache and slab descriptors.
  4. Calls cache_init_objs(), which applies the constructor method (if defined) to all the objects contained in the new slab.
  5. Finally, adds the slab descriptor to the list of free slabs of the cache.

Releasing a Slab from a Cache

Slabs can be destroyed in two cases: (1) there are too many free objects in the slab cache; (2) a timer function invoked periodically determines that there are fully unused slabs that can be released.

slab_destroy()

void slab_destroy(kmem_cache_t *cachep, slab_t *slabp);

  1. Calls the destructor of each object, if one is defined (the dtor field is not NULL).
  2. Calls kmem_freepages(), which returns all the contiguous page frames used by the slab to the buddy system.
  3. If the slab descriptor is external (see above), releases it to the cache of slab descriptors.

Object Descriptor

Each object has a short descriptor of type kmem_bufctl_t. Object descriptors are stored in an array placed right after the corresponding slab descriptor. There are two types of object descriptors: "external object descriptors" and "internal object descriptors".

  • External object descriptor

    Stored outside the slab, in the general cache pointed to by the slabp_cache field of the cache descriptor. The size of the memory area, and thus the particular general cache used to store object descriptors, depends on the number of objects stored in the slab (see below).

  • Internal object descriptor

    Stored inside the slab, right after the slab descriptor (as noted above, the object descriptors form an array placed right after it).

The first object descriptor in the array describes the first object in the slab, and so on. An object descriptor is simply an unsigned short integer, which is meaningful only when the object is free. It contains the index of the next free object in the slab, thus implementing a simple list of free objects inside the slab.
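
A sketch of how a free object is taken from a slab using these descriptors (modeled on mm/slab.c):

/* slabp->free is the index of the first free object; s_mem points to
   the first object in the slab */
void *objp = slabp->s_mem + slabp->free * cachep->objsize;
slabp->free = slab_bufctl(slabp)[slabp->free];  /* advance to the next free index */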

Aligning Objects in Memory

The objects managed by the slab allocator are stored in memory cells whose initial physical addresses are multiples of a given constant, which is usually a power of 2. This constant is called the alignment factor.

align in the first-level hardware cache

When creating a new slab cache, it’s possible to specify that the objects included in it be aligned in the first-level hardware cache. To achieve this, the kernel sets the SLAB_HWCACHE_ALIGN cache descriptor flag. The kmem_cache_create() function handles the request as follows:

  • If the object’s size is greater than half of a cache line, it is aligned in RAM to a multiple of L1_CACHE_BYTES, that is, at the beginning of a line.
  • Otherwise, the object size is rounded up to a submultiple of L1_CACHE_BYTES; this ensures that a small object will never span two cache lines.

Slab Coloring(?)

The slab allocator takes advantage of the unused bytes to color the slab. The term “color” is used simply to subdivide the slabs and allow the memory allocator to spread objects out among different linear addresses.

Slabs having different colors store the first object of the slab in different memory locations, while satisfying the alignment constraint. The number of available colors is free/aln (at least 0).

If a slab is colored with color col, the offset of the first object (with respect to the slab initial address) is equal to col × aln + dsize bytes.

Local Caches of Free Slab Objects

Each cache of the slab allocator includes a per-CPU data structure consisting of a small array of pointers to freed objects, called the slab local cache (it holds pointers to free objects of the cache's slabs).

The array field of the cache descriptor is an array of pointers to array_cache data structures, one element for each CPU in the system. Each array_cache data structure is a descriptor of the local cache of free objects. Notice that the local cache descriptor does not include the address of the local cache itself; in fact, the local cache is placed right after the descriptor. Of course, the local cache stores the pointers to the freed objects, not the objects themselves, which are always placed inside the slabs of the cache.

When creating a new slab cache, the kmem_cache_create() function determines the size of the local caches.(field limit of the cache descriptor)

In multiprocessor systems, slab caches for small objects also sport an additional local cache, whose address is stored in the lists.shared field of the cache descriptor. The shared local cache is, as the name suggests, shared among all CPUs, and it makes the task of migrating free objects from a local cache to another easier.

Allocating a Slab Object

New objects may be obtained by invoking the kmem_cache_alloc() function. The parameter cachep points to the cache descriptor from which the new free object must be obtained, while the parameter flags represents the flags to be passed to the zoned page frame allocator functions, should all slabs of the cache be full.

void * kmem_cache_alloc(kmem_cache_t *cachep, int flags)

Freeing a Slab Object

The kmem_cache_free() function releases an object previously allocated by the slab allocator to some kernel function. Its parameters are cachep, the address of the cache descriptor, and objp, the address of the object to be released.

void kmem_cache_free(kmem_cache_t *cachep, void *objp)

The function checks first whether the local cache has room for an additional pointer to a free object. If so, the pointer is added to the local cache and the function returns. Otherwise it first invokes cache_flusharray() to deplete the local cache and then adds the pointer to the local cache.

General Purpose Objects

void * kmalloc(size_t size, int flags)
{
  ...
  return kmem_cache_alloc(cachep, flags);
  ...
}

void kfree(const void *objp);
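
A minimal usage sketch (my_data is a hypothetical structure):

struct my_data *p = kmalloc(sizeof(*p), GFP_KERNEL);
if (p) {
	/* ... use p ... */
	kfree(p);
}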

Memory Pools

A memory pool allows a kernel component, such as the block device subsystem, to allocate some dynamic memory to be used only in low-on-memory emergencies. A memory pool is a reserve of dynamic memory that can be used only by a specific kernel component, namely the “owner” of the pool. The owner does not normally touch the reserve and draws on the pool only in emergencies.
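
A usage sketch with the stock slab-backed helpers (my_cachep is a hypothetical slab cache; the pool keeps at least 4 pre-allocated objects in reserve):

mempool_t *pool = mempool_create(4, mempool_alloc_slab,
				 mempool_free_slab, my_cachep);
/* falls back to the pre-allocated reserve when normal allocation fails */
void *obj = mempool_alloc(pool, GFP_KERNEL);
mempool_free(obj, pool);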

Noncontiguous Memory Area Management

Sometimes it is reasonable to allocate memory based on noncontiguous page frames; this avoids external fragmentation.

Linear Addresses of Noncontiguous Memory Areas

A safety interval of size 8 MB (macro VMALLOC_OFFSET) is inserted between the end of the physical memory mapping and the first memory area; its purpose is to “capture” out-of-bounds memory accesses. For the same reason, additional safety intervals of size 4 KB are inserted to separate noncontiguous memory areas.

Descriptors of Noncontiguous Memory Areas

Each noncontiguous memory area is associated with a descriptor of type vm_struct. These descriptors are inserted in a simple list by means of the next field; the address of the first element of the list is stored in the vmlist variable. Accesses to this list are protected by means of the vmlist_lock read/write spin lock. The flags field identifies the type of memory mapped by the area: VM_ALLOC for pages obtained by means of vmalloc(), VM_MAP for already allocated pages mapped by means of vmap(), and VM_IOREMAP for on-board memory of hardware devices mapped by means of ioremap().
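
The descriptor, from the 2.6 include/linux/vmalloc.h:

struct vm_struct {
	void             *addr;      /* first linear address of the area */
	unsigned long     size;      /* size plus the 4 KB safety interval */
	unsigned long     flags;     /* VM_ALLOC, VM_MAP, or VM_IOREMAP */
	struct page     **pages;     /* array of page descriptor pointers */
	unsigned int      nr_pages;  /* number of page frames */
	unsigned long     phys_addr; /* used by ioremap() */
	struct vm_struct *next;      /* next element of the vmlist */
};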

get_vm_area()

The get_vm_area() function looks for a free range of linear addresses between VMALLOC_START and VMALLOC_END. This function acts on two parameters: the size (size) in bytes of the memory region to be created, and a flag (flag) specifying the type of region.

  1. Invokes kmalloc() to obtain a memory area for the new descriptor of type vm_struct.
  2. Gets the vmlist_lock lock for writing and scans the list of descriptors of type vm_struct looking for a free range of linear addresses that includes at least size+4096 addresses (4096 is the size of the safety interval between the memory areas).
  3. If such an interval exists, the function initializes the fields of the descriptor, releases the vmlist_lock lock, and terminates by returning the initial address of the noncontiguous memory area; otherwise, it returns NULL.

Allocating a Noncontiguous Memory Area

The vmalloc() function allocates a noncontiguous memory area to the kernel. The parameter size denotes the size of the requested area. If the function is able to satisfy the request, it returns the initial linear address of the new area; otherwise, it returns a NULL pointer.

map_vm_area() maps the pages of the vm_struct onto the contiguous linear address interval: it creates the Page Table entries (and, where needed, PGD, PUD, and PMD entries) that index the physical pages. Notice that the Page Tables of the current process are not touched by map_vm_area(). Therefore, when a process in Kernel Mode accesses the noncontiguous memory area, a Page Fault occurs, because the entries in the process’s Page Tables corresponding to the area are null.

Beside the vmalloc() function, a noncontiguous memory area can be allocated by the vmalloc_32() function, which is very similar to vmalloc() but only allocates page frames from the ZONE_NORMAL and ZONE_DMA memory zones.
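
A minimal usage sketch:

void *buf = vmalloc(64 * 1024);  /* 16 page frames, physically noncontiguous */
if (buf) {
	/* ... use buf ... */
	vfree(buf);                  /* see the next section */
}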

Releasing a Noncontiguous Memory Area

The vfree() function releases noncontiguous memory areas created by vmalloc() or vmalloc_32(), while the vunmap() function releases memory areas created by vmap(). Both functions have one parameter: the initial linear address of the area to be released; they both rely on the __vunmap() function to do the real work.

From: https://www.cnblogs.com/syp2023/p/18129284
