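Starting with this post, I'm recording my notes from reading the kernel's memory-management code. The post is long and much of it is my own understanding, so read it with caution.
The buddy system is built on how physical pages are organized. From small to large, the hierarchy is page -> zone -> node. Let's look at the source to see how each structure is laid out.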
typedef struct pglist_data {
	struct zone node_zones[MAX_NR_ZONES];	/* zones contained in this node */
	/*
	 * Up to two zonelists: ZONELIST_FALLBACK lists this node's zones
	 * followed by the zones of the other nodes; ZONELIST_NOFALLBACK
	 * (NUMA only) lists only this node's zones.
	 */
	struct zonelist node_zonelists[MAX_ZONELISTS];
	int nr_zones; /* number of populated zones in this node */
	unsigned long node_start_pfn;
	unsigned long node_present_pages; /* total number of physical pages */
	unsigned long node_spanned_pages; /* total size of physical page range, including holes */
	...
} pg_data_t;
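pglist_data describes the memory layout of a node. It holds all of the node's zones plus the zonelists used for allocation. Allocation starts from the current node and falls back to other nodes only when the current node cannot satisfy the request.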
struct zone {
	/* Read-mostly fields */

	/* zone watermarks, access with *_wmark_pages(zone) macros */
	/* watermark-related fields, used by memory reclaim */
	unsigned long _watermark[NR_WMARK];
	unsigned long watermark_boost;

	unsigned long nr_reserved_highatomic;

	/*
	 * We don't know if the memory that we're going to allocate will be
	 * freeable or/and it will be released eventually, so to avoid totally
	 * wasting several GB of ram we must reserve some of the lower zone
	 * memory (otherwise we risk to run OOM on the lower zones despite
	 * there being tons of freeable ram on the higher zones). This array is
	 * recalculated at runtime if the sysctl_lowmem_reserve_ratio sysctl
	 * changes.
	 */
	long lowmem_reserve[MAX_NR_ZONES];

#ifdef CONFIG_NUMA
	int node;
#endif
	struct pglist_data *zone_pgdat;
	struct per_cpu_pages __percpu *per_cpu_pageset;
	struct per_cpu_zonestat __percpu *per_cpu_zonestats;
	/*
	 * the high and batch values are copied to individual pagesets for
	 * faster access
	 */
	int pageset_high_min;
	int pageset_high_max;
	int pageset_batch;

#ifndef CONFIG_SPARSEMEM
	/*
	 * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
	 * In SPARSEMEM, this map is stored in struct mem_section
	 */
	unsigned long *pageblock_flags;
#endif /* CONFIG_SPARSEMEM */

	/* the following members describe the zone's page counts and location */
	/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
	unsigned long zone_start_pfn;

	/*
	 * spanned_pages is the total pages spanned by the zone, including
	 * holes, which is calculated as:
	 *	spanned_pages = zone_end_pfn - zone_start_pfn;
	 *
	 * present_pages is physical pages existing within the zone, which
	 * is calculated as:
	 *	present_pages = spanned_pages - absent_pages(pages in holes);
	 *
	 * present_early_pages is present pages existing within the zone
	 * located on memory available since early boot, excluding hotplugged
	 * memory.
	 *
	 * managed_pages is present pages managed by the buddy system, which
	 * is calculated as (reserved_pages includes pages allocated by the
	 * bootmem allocator):
	 *	managed_pages = present_pages - reserved_pages;
	 *
	 * cma pages is present pages that are assigned for CMA use
	 * (MIGRATE_CMA).
	 *
	 * So present_pages may be used by memory hotplug or memory power
	 * management logic to figure out unmanaged pages by checking
	 * (present_pages - managed_pages). And managed_pages should be used
	 * by page allocator and vm scanner to calculate all kinds of watermarks
	 * and thresholds.
	 *
	 * Locking rules:
	 *
	 * zone_start_pfn and spanned_pages are protected by span_seqlock.
	 * It is a seqlock because it has to be read outside of zone->lock,
	 * and it is done in the main allocator path. But, it is written
	 * quite infrequently.
	 *
	 * The span_seq lock is declared along with zone->lock because it is
	 * frequently read in proximity to zone->lock. It's good to
	 * give them a chance of being in the same cacheline.
	 *
	 * Write access to present_pages at runtime should be protected by
	 * mem_hotplug_begin/done(). Any reader who can't tolerant drift of
	 * present_pages should use get_online_mems() to get a stable value.
	 */
	atomic_long_t managed_pages;
	unsigned long spanned_pages;
	unsigned long present_pages;
#if defined(CONFIG_MEMORY_HOTPLUG)
	unsigned long present_early_pages;
#endif
#ifdef CONFIG_CMA
	unsigned long cma_pages;
#endif

	const char *name;

#ifdef CONFIG_MEMORY_ISOLATION
	/*
	 * Number of isolated pageblock. It is used to solve incorrect
	 * freepage counting problem due to racy retrieving migratetype
	 * of pageblock. Protected by zone->lock.
	 */
	unsigned long nr_isolate_pageblock;
#endif

#ifdef CONFIG_MEMORY_HOTPLUG
	/* see spanned/present_pages for more description */
	seqlock_t span_seqlock;
#endif

	int initialized;

	/* Write-intensive fields used from the page allocator */
	CACHELINE_PADDING(_pad1_);

	/* free areas of different sizes */
	/* the key structure for page allocation */
	struct free_area free_area[MAX_ORDER + 1];
	...
#if defined CONFIG_COMPACTION || defined CONFIG_CMA
	/* Set to true when the PG_migrate_skip bits should be cleared */
	bool compact_blockskip_flush;
#endif

	bool contiguous;

	CACHELINE_PADDING(_pad3_);
	/* Zone statistics */
	/* statistics-related fields */
	atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
	atomic_long_t vm_numa_event[NR_VM_NUMA_EVENT_ITEMS];
} ____cacheline_internodealigned_in_smp;
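The most important data in a zone are its start, given by zone_start_pfn, and its page counts, given jointly by present_pages, spanned_pages and managed_pages. The buddy-system field is the free_area array, which describes how free memory is organized by order; its length is the maximum order plus one.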
struct free_area { struct list_head free_list[MIGRATE_TYPES]; unsigned long nr_free; };
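free_area describes the memory the zone can currently hand out. Taken together, zone->free_area is effectively a two-dimensional array: blocks of the same order form a row (the buddy system allocates by order, and order n means a block of 2^n pages), and the columns are the migrate types. Each element is a linked list of free page blocks. The figure below is a screenshot from 奔跑吧Linux内核 (volume 1).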
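To make the notion of "order" concrete, here is a minimal userspace sketch. The helper names pages_in_order and order_for_pages are made up for illustration and are not kernel functions; the sketch only shows the 2^order relationship described above.

```c
/* Hypothetical helpers, not kernel code: they only illustrate "order". */

/* Number of pages in a block of the given order: 2^order. */
static unsigned long pages_in_order(unsigned int order)
{
    return 1UL << order;
}

/* Smallest order whose block holds at least nr_pages pages (nr_pages >= 1). */
static unsigned int order_for_pages(unsigned long nr_pages)
{
    unsigned int order = 0;

    while (pages_in_order(order) < nr_pages)
        order++;
    return order;
}
```

For example, a request for 3 pages has to be rounded up to an order-2 block of 4 pages, which is why allocating by order can waste memory for sizes that are not a power of two.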
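NUMA initialization is done by start_kernel->setup_arch->bootmem_init->arch_numa_init->numa_init (this is the arm64 call path).
The global array numa_distance stores the distances between nodes. By default the distance is 10 within the same node and 20 between different nodes; vendors can supply other values through the DTB or the ACPI tables. The distances from a given node to every node can be read from /sys/devices/system/node/nodeX/distance.
node_data is a global array of pglist_data pointers holding the information for every node. It is initialized in numa_register_nodes, which sets each node's start pfn and length.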
struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
The global array node_states stores the state of the nodes.
nodemask_t node_states[NR_NODE_STATES] __read_mostly = { [N_POSSIBLE] = NODE_MASK_ALL, [N_ONLINE] = { { [0] = 1UL } }, #ifndef CONFIG_NUMA [N_NORMAL_MEMORY] = { { [0] = 1UL } }, #ifdef CONFIG_HIGHMEM [N_HIGH_MEMORY] = { { [0] = 1UL } }, #endif [N_MEMORY] = { { [0] = 1UL } }, [N_CPU] = { { [0] = 1UL } }, #endif /* NUMA */ };
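mem_section holds the struct page pointers for the sparse memory model; it corresponds to mem_map in the flat memory model. In other words, every physical page frame can be found through mem_section or mem_map, which makes it a very useful structure.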
struct mem_section **mem_section; struct mem_section { /* * This is, logically, a pointer to an array of struct * pages. However, it is stored with some other magic. * (see sparse.c::sparse_init_one_section()) * * Additionally during early boot we encode node id of * the location of the section here to guide allocation. * (see sparse.c::memory_present()) * * Making it a UL at least makes someone do a cast * before using it wrong. */ unsigned long section_mem_map; struct mem_section_usage *usage; #ifdef CONFIG_PAGE_EXTENSION /* * If SPARSEMEM, pgdat doesn't have page_ext pointer. We use * section. (see page_ext.h about this.) */ struct page_ext *page_ext; unsigned long pad; #endif /* * WARNING: mem_section must be a power-of-2 in size for the * calculation and use of SECTION_ROOT_MASK to make sense. */ };
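mem_section is initialized in sparse_init; see the "Physical Memory Model" page of The Linux Kernel documentation.
free_area_init initializes the memory zones: for every zone of every node it sets the start pfn, spanned_pages, present_pages, per_cpu_pageset and free_area. At this point free_area is only initialized to empty. The information used to initialize nodes and zones comes from memblock.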
start_kernel->mm_core_init->build_all_zonelists->build_all_zonelists_init->__build_all_zonelists
__build_all_zonelists initializes the zonelists of every node.
static void __build_all_zonelists(void *data) { ... for_each_node(nid) { pg_data_t *pgdat = NODE_DATA(nid); build_zonelists(pgdat); ... } ... }
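Each node has up to two zonelists: the fallback list, which holds the local node's zones followed by the zones of the other nodes, and (on NUMA) a no-fallback list with only the local node's zones. The order of the fallback list is determined by the node order and the zone order together. The node order comes from the inter-node distances, and within each node the zones go from highest to lowest, for example from ZONE_NORMAL down to ZONE_DMA.
start_kernel->mm_core_init->mem_init->memblock_free_all hands the memory released by memblock over to the buddy system. The call path is memblock_free_all->free_low_memory_core_early->__free_memory_core->__free_pages_memory->memblock_free_pages->__free_pages_core, which adds the physical pages released by memblock to the buddy system in blocks of the largest possible order and returns the number of pages; that count is then added to _totalram_pages. At this point most memory has migrate type MIGRATE_MOVABLE, and order-10 blocks are the most common in the buddy system, i.e. memory is organized into the largest contiguous blocks possible.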
static void __init memmap_init_zone_range(struct zone *zone, unsigned long start_pfn, unsigned long end_pfn, unsigned long *hole_pfn) { unsigned long zone_start_pfn = zone->zone_start_pfn; unsigned long zone_end_pfn = zone_start_pfn + zone->spanned_pages; int nid = zone_to_nid(zone), zone_id = zone_idx(zone); start_pfn = clamp(start_pfn, zone_start_pfn, zone_end_pfn); end_pfn = clamp(end_pfn, zone_start_pfn, zone_end_pfn); if (start_pfn >= end_pfn) return; memmap_init_range(end_pfn - start_pfn, nid, zone_id, start_pfn, zone_end_pfn, MEMINIT_EARLY, NULL, MIGRATE_MOVABLE); if (*hole_pfn < start_pfn) init_unavailable_range(*hole_pfn, start_pfn, zone_id, nid); *hole_pfn = end_pfn; }
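The "largest possible order" chunking done on the memblock_free_all path can be modelled in userspace. The sketch below is my own simplification of the idea behind __free_pages_memory, not the kernel function itself: each step picks the biggest block that is both aligned at the current pfn and still fits in the range. CHUNK_MAX_ORDER, lowest_set_bit and chunk_range are names invented for this example.

```c
#define CHUNK_MAX_ORDER 10   /* assumed maximum order, matching the order-10 blocks above */

/* Index of the lowest set bit; x must be non-zero. */
static int lowest_set_bit(unsigned long x)
{
    int bit = 0;

    while (!(x & 1UL)) {
        x >>= 1;
        bit++;
    }
    return bit;
}

/*
 * Split the pfn range [start, end) into naturally aligned power-of-two
 * blocks, always taking the largest block that alignment and the remaining
 * length allow. counts[o] receives the number of order-o blocks produced.
 * Returns the total number of pages covered.
 */
static unsigned long chunk_range(unsigned long start, unsigned long end,
                                 unsigned long counts[CHUNK_MAX_ORDER + 1])
{
    unsigned long total = 0;

    for (int i = 0; i <= CHUNK_MAX_ORDER; i++)
        counts[i] = 0;

    while (start < end) {
        int order = CHUNK_MAX_ORDER;

        /* pfn 0 is aligned to every order; otherwise alignment limits us. */
        if (start) {
            int align = lowest_set_bit(start);

            if (align < order)
                order = align;
        }
        /* Shrink the block until it no longer runs past the end. */
        while (start + (1UL << order) > end)
            order--;

        counts[order]++;
        total += 1UL << order;
        start += 1UL << order;
    }
    return total;
}
```

A large, well-aligned range ends up almost entirely in order-10 blocks, with only the unaligned edges broken into smaller orders, which is consistent with the observation that order 10 dominates right after boot.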
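That covers memory organization. Now for a few of the allocation APIs.
alloc_pages(gfp, order) allocates 2^order pages. Many factors influence allocation behaviour, including gfp and current->flags. The core function is __alloc_pages.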
/* * This is the 'heart' of the zoned buddy allocator. */ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid, nodemask_t *nodemask) { ... gfp = current_gfp_context(gfp); alloc_gfp = gfp; if (!prepare_alloc_pages(gfp, order, preferred_nid, nodemask, &ac, &alloc_gfp, &alloc_flags)) return NULL; ... /* First allocation attempt */ page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac); if (likely(page)) goto out; ... page = __alloc_pages_slowpath(alloc_gfp, order, &ac); out: ... return page; }
get_page_from_freelist tries the allocation first; if it fails, __alloc_pages_slowpath keeps trying.
/*
 * get_page_from_freelist goes through the zonelist trying to allocate
 * a page.
 */
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
						const struct alloc_context *ac)
{
	...
	for_next_zone_zonelist_nodemask(zone, z, ac->highest_zoneidx,
					ac->nodemask) {
		if (zone_watermark_fast(zone, order, mark,
				       ac->highest_zoneidx, alloc_flags,
				       gfp_mask))
			goto try_this_zone;
try_this_zone:
		page = rmqueue(ac->preferred_zoneref->zone, zone, order,
				gfp_mask, alloc_flags, ac->migratetype);
		if (page) {
			prep_new_page(page, order, gfp_mask, alloc_flags);
			...
			return page;
		}
		...
	}
	...
}
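get_page_from_freelist first checks whether the current zone has enough free pages. If not, it moves on until it finds a zone that does, and rmqueue then allocates from that zone. The zonelist is scanned starting from the preferred zone, and then by zone type from high to low.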
__no_sanitize_memory
static inline
struct page *rmqueue(struct zone *preferred_zone,
			struct zone *zone, unsigned int order,
			gfp_t gfp_flags, unsigned int alloc_flags,
			int migratetype)
{
	struct page *page;
	...
	if (likely(pcp_allowed_order(order))) {
		page = rmqueue_pcplist(preferred_zone, zone, order,
				       migratetype, alloc_flags);
		if (likely(page))
			goto out;
	}

	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
							migratetype);

out:
	/* Separate test+clear to avoid unnecessary atomics */
	if ((alloc_flags & ALLOC_KSWAPD) &&
	    unlikely(test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))) {
		clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
		wakeup_kswapd(zone, 0, 0, zone_idx(zone));
	}
	return page;
}
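First rmqueue checks whether the requested order is small enough to be served from the pcp lists; currently that means order <= PAGE_ALLOC_COSTLY_ORDER (3). If so, it tries the pcplist first. This is a per-CPU free list, so zone->lock is not needed, only the per-CPU list's own lock, which makes it fast. I did not see the pcplists being filled during initialization; pages get there when they are freed, and when a list runs empty at allocation time it is refilled in a batch from the buddy free lists (rmqueue_bulk). If the page cannot come from the pcplist, the allocation falls through to free_area.
rmqueue_buddy eventually calls __rmqueue_smallest to do the allocation.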
static __always_inline struct page *__rmqueue_smallest(struct zone *zone, unsigned int order, int migratetype) { for (current_order = order; current_order <= MAX_ORDER; ++current_order) { area = &(zone->free_area[current_order]); page = get_page_from_free_area(area, migratetype); if (!page) continue; del_page_from_free_list(page, zone, current_order); expand(zone, page, order, current_order, migratetype); set_pcppage_migratetype(page, migratetype); return page; } return NULL; }
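Once the buddy system's organization is clear, the code above is easy to follow. Free pages are stored by order, and right after initialization order 10 is the most common. The search starts at the requested order and moves upward. Free pages live in each zone's free_area, which can be viewed as a two-dimensional array whose first dimension is the order and whose second is the migrate type. get_page_from_free_area returns the first page on the list for the given migrate type. If a block is only found at an order higher than the requested one, part of it is left over, and expand puts the leftover pieces back into free_area at the lower orders.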
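The search-then-split behaviour can be sketched with a toy allocator. This is a userspace model under simplifying assumptions (a single migrate type, free blocks tracked as arrays of start pfns instead of linked lists of struct page); every toy_ name is invented for the example. It mirrors the idea of __rmqueue_smallest plus expand: take the first free block at or above the requested order, keep the lower half at each split and give the upper half back one order down.

```c
#define TOY_MAX_ORDER   4
#define TOY_MAX_BLOCKS 32

/* One entry per order: the start pfns of the free blocks of that order. */
struct toy_free_area {
    unsigned long pfn[TOY_MAX_BLOCKS];
    int nr_free;
};

struct toy_zone {
    struct toy_free_area free_area[TOY_MAX_ORDER + 1];
};

static void toy_zone_init(struct toy_zone *z)
{
    for (int o = 0; o <= TOY_MAX_ORDER; o++)
        z->free_area[o].nr_free = 0;
}

/* Put a free block of the given order on its list (ignored if the list is full). */
static void toy_add_free(struct toy_zone *z, unsigned int order, unsigned long pfn)
{
    struct toy_free_area *area = &z->free_area[order];

    if (area->nr_free < TOY_MAX_BLOCKS)
        area->pfn[area->nr_free++] = pfn;
}

/* Returns the start pfn of a block of 2^order pages, or -1 if none is free. */
static long toy_rmqueue_smallest(struct toy_zone *z, unsigned int order)
{
    for (unsigned int cur = order; cur <= TOY_MAX_ORDER; cur++) {
        struct toy_free_area *area = &z->free_area[cur];
        unsigned long pfn;

        if (area->nr_free == 0)
            continue;               /* nothing at this order, try a bigger one */

        pfn = area->pfn[--area->nr_free];

        /* "expand": halve the block until it has the requested order,
         * returning the upper half to the free list each time. */
        while (cur > order) {
            cur--;
            toy_add_free(z, cur, pfn + (1UL << cur));
        }
        return (long)pfn;
    }
    return -1;
}
```

Starting from a single free order-4 block at pfn 0, an order-0 request returns pfn 0 and leaves one free block at each of orders 0 to 3 (at pfns 1, 2, 4 and 8), so the next order-0 request is served directly from the order-0 list.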
If the fast path cannot get free pages, the slow path takes over.
static inline struct page * __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order, struct alloc_context *ac) {
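The slow path tries everything it can to reclaim memory and compact it in order to obtain enough free pages, and it may voluntarily reschedule along the way, so it can take a long time to get memory, or fail altogether. This is why the amount of memory has a large effect on system performance: when memory is short, allocations spend a lot of time here and performance drops considerably. The slow path involves many advanced topics, so I'll analyse it later.
Now the API for freeing pages, free_pages.
free_the_page is its core function.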
static inline void free_the_page(struct page *page, unsigned int order) { if (pcp_allowed_order(order)) /* Via pcp? */ free_unref_page(page, order); else __free_pages_ok(page, order, FPI_NONE); }
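free_the_page first checks whether the order is small enough for the pcplist; if not, the pages go to free_area. The per-CPU structure behind zone->per_cpu_pageset holds a one-dimensional array of lists, with MIGRATE_PCPTYPES entries per order, each entry being a list of pages. Freeing simply puts the page on the list for its order and migrate type and then updates the bookkeeping; if the page count reaches the pcp's high mark, some pages are released back to the buddy system. See free_unref_page->free_unref_page_commit.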
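A rough userspace model of that indexing and bookkeeping follows. It assumes three per-CPU migrate types and orders 0 to 3, and it leaves out the extra list the kernel keeps for THP-sized pages; the toy_ names and the simplified "reached high" check are mine, not the kernel's.

```c
#define TOY_MIGRATE_PCPTYPES 3   /* assumed: unmovable, movable, reclaimable */
#define TOY_COSTLY_ORDER     3   /* assumed largest order kept on pcp lists */
#define TOY_NR_PCP_LISTS (TOY_MIGRATE_PCPTYPES * (TOY_COSTLY_ORDER + 1))

struct toy_pcp {
    int count;                        /* pages currently held on all lists */
    int high;                         /* drain threshold, in pages */
    int list_len[TOY_NR_PCP_LISTS];   /* number of blocks on each list */
};

/* Flatten (order, migratetype) into an index into the one-dimensional array. */
static unsigned int toy_order_to_pindex(int migratetype, unsigned int order)
{
    return TOY_MIGRATE_PCPTYPES * order + (unsigned int)migratetype;
}

/*
 * Model of putting a freed block on the pcp list. Returns non-zero when the
 * page count has reached the high mark, i.e. when a batch of pages should be
 * handed back to the buddy system.
 */
static int toy_pcp_free(struct toy_pcp *pcp, int migratetype, unsigned int order)
{
    pcp->list_len[toy_order_to_pindex(migratetype, order)]++;
    pcp->count += 1 << order;
    return pcp->count >= pcp->high;
}
```

Note that count is kept in pages, not blocks: freeing one order-2 block adds 4 to it, as in the `pcp->count += 1 << order` line of free_unref_page_commit.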
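An important function here is page_zone, which gets the zone pointer from a page. page->flags encodes the zone id and, normally, the node id: the node id selects the node and the zone id selects the zone within it. When the node id does not fit into page->flags (NODE_NOT_IN_PAGE_FLAGS), the page's section is found first and the node is looked up in the global section_to_node_table.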
static inline struct zone *page_zone(const struct page *page) { return &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)]; }
Finding the zone from the page is a prerequisite for freeing: a page goes back to where it came from.
/*
 * Free a pcp page
 */
void free_unref_page(struct page *page, unsigned int order)
{
	...
	zone = page_zone(page);
	pcp_trylock_prepare(UP_flags);
	pcp = pcp_spin_trylock(zone->per_cpu_pageset);
	if (pcp) {
		free_unref_page_commit(zone, pcp, page, pcpmigratetype, order);
		pcp_spin_unlock(pcp);
	} else {
		free_one_page(zone, page, pfn, order, migratetype, FPI_NONE);
	}
	pcp_trylock_finish(UP_flags);
}

static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
				   struct page *page, int migratetype,
				   unsigned int order)
{
	...
	pindex = order_to_pindex(migratetype, order);
	list_add(&page->pcp_list, &pcp->lists[pindex]);
	pcp->count += 1 << order;
	...
}
Back to __free_pages_ok.
static void __free_pages_ok(struct page *page, unsigned int order, fpi_t fpi_flags) { .. spin_lock_irqsave(&zone->lock, flags); .. __free_one_page(page, pfn, zone, order, migratetype, fpi_flags); spin_unlock_irqrestore(&zone->lock, flags); ... }
The core function is __free_one_page.
static inline void __free_one_page(struct page *page, unsigned long pfn, struct zone *zone, unsigned int order, int migratetype, fpi_t fpi_flags) { ... while (order < MAX_ORDER) { ... buddy = find_buddy_page_pfn(page, pfn, order, &buddy_pfn); if (!buddy) goto done_merging; ... ... del_page_from_free_list(buddy, zone, order); combined_pfn = buddy_pfn & pfn; page = page + (combined_pfn - pfn); pfn = combined_pfn; order++; } done_merging: set_buddy_order(page, order); ... if (to_tail) add_to_free_list_tail(page, zone, order, migratetype); else add_to_free_list(page, zone, order, migratetype); ... }
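__free_one_page is poorly named, since it suggests that only a single page is freed. The principle is simple: the buddy is located from the page's pfn and order. A buddy is the adjacent block of the same size that, together with this block, forms an aligned block of the next order. If the buddy is found to be free, the two can be merged into a larger block, so order is incremented and the search continues, until no free buddy is found or order has reached the maximum. Then control jumps to done_merging and the merged block is added to the zone's free_area. This works because every buddy found along the way has already been taken off its free list.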
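The pfn arithmetic behind the merge loop fits in a few lines. The sketch below is a userspace model: the buddy of a block is found by flipping bit `order` of its pfn, and the merged block starts at `buddy_pfn & pfn`, as in the `combined_pfn` line above. The free-block table free_order[] and the toy_/MERGE_ names are assumptions made for the example; the kernel tracks this state in struct page and the free lists instead.

```c
#define MERGE_MAX_ORDER 3
#define MERGE_NR_PAGES  (1 << MERGE_MAX_ORDER)

/* The buddy of the block at pfn differs from it only in bit `order`. */
static unsigned long toy_find_buddy_pfn(unsigned long pfn, unsigned int order)
{
    return pfn ^ (1UL << order);
}

/*
 * free_order[p] is the order of the free block starting at pfn p, or -1 if
 * no free block starts there. Frees the block (pfn, order), merging with
 * free buddies as far as possible. Returns the order of the block that is
 * finally put on the free list. pfn must be aligned to 2^order.
 */
static unsigned int toy_free_one_page(int free_order[MERGE_NR_PAGES],
                                      unsigned long pfn, unsigned int order)
{
    while (order < MERGE_MAX_ORDER) {
        unsigned long buddy = toy_find_buddy_pfn(pfn, order);

        if (free_order[buddy] != (int)order)
            break;                  /* buddy is not free at this order */

        free_order[buddy] = -1;     /* take the buddy off its free list */
        pfn = buddy & pfn;          /* merged block starts at the lower pfn */
        order++;
    }
    free_order[pfn] = (int)order;   /* add the (possibly merged) block */
    return order;
}
```

Freeing pfn 0 and then pfn 1 at order 0 yields one order-1 block at pfn 0; freeing the order-1 block at pfn 2 afterwards merges again into an order-2 block at pfn 0.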
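Now for slab. Two articles explain it well: "Linux kernel | Memory management: the slab allocator" on Zhihu (zhihu.com) and "Memory management - slab [principles]" by DoOrDie on cnblogs (cnblogs.com).
I have tried to understand slab before and never quite got it. I could not see what a slab actually is, and there seemed to be no such data structure. There is something called kmem_cache, and slab sits beneath it; so slab is a sub-structure of another concept, called a cache. The word "cache" is badly overused in Linux: a whole pile of things call themselves caches, which is confusing and unfriendly to beginners. In a complex system naming matters a great deal, and unfortunately it is not done well here.
Slab is also complicated because it comes in several variants: SLAB, SLUB and SLOB. SLOB appears to have been removed in the 6.x kernels, leaving SLAB and SLUB. Let's look mainly at SLAB first. (SLAB is gone now as well and only SLUB remains, but what follows still covers the old SLAB.)
Linux has been around for 30 years, and I assumed it was too old to change much. In fact it changes substantially every so often, and even core code such as process scheduling and memory management keeps evolving. Take slab: originally there was a slab structure, later it was folded into struct page, and then in 2021 it was split back out again.
Why is slab needed?
The allocation APIs described so far all allocate whole pages, 4K at minimum, which is wasteful for small allocations;
The kernel constantly needs certain data structures, such as task_struct, fs_struct and inode. Going to the buddy system every time one is needed is slow and unfriendly to the CPU cache. Instead, a batch of them can be allocated in advance and kept around: take one when needed, and put it back for the next user when done. That is far more efficient;
Slab was designed for exactly this. Two data structures are needed to understand it, kmem_cache and slab. Put simply, a slab is the actual block of memory, and a kmem_cache manages slabs.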
/* Reuses the bits in struct page */ struct slab { unsigned long __page_flags; #if defined(CONFIG_SLAB) struct kmem_cache *slab_cache; union { struct { struct list_head slab_list; void *freelist; /* array of free object indexes */ void *s_mem; /* first object */ }; struct rcu_head rcu_head; }; unsigned int active; #endif atomic_t __page_refcount; };
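With the SLUB parts removed, struct slab does not look complicated: a pointer to the kmem_cache that manages the slab, a list node that links the slab into a list, plus the freelist and the address of the first object.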
/*
 * Definitions unique to the original Linux SLAB allocator.
 */
struct kmem_cache {
	struct array_cache __percpu *cpu_cache;

/* 1) Cache tunables. Protected by slab_mutex */
	unsigned int batchcount;
	unsigned int limit;
	unsigned int shared;

	unsigned int size;
	struct reciprocal_value reciprocal_buffer_size;
/* 2) touched by every alloc & free from the backend */

	slab_flags_t flags;		/* constant flags */
	unsigned int num;		/* # of objs per slab */

/* 3) cache_grow/shrink */
	/* order of pgs per slab (2^n) */
	unsigned int gfporder;

	/* force GFP flags, e.g. GFP_DMA */
	gfp_t allocflags;

	size_t colour;			/* cache colouring range */
	/*
	 * Number of colours: a slab spans PAGE_SIZE << gfporder bytes and
	 * holds num objects; colour = leftover bytes / colour_off.
	 */
	unsigned int colour_off;	/* colour offset */
	/* set to the cache line size, cache_line_size() */
	unsigned int freelist_size;

	/* constructor func */
	void (*ctor)(void *obj);

/* 4) cache creation/removal */
	const char *name;
	struct list_head list;
	int refcount;
	int object_size;
	int align;

/* 5) statistics */
	...
	struct kmem_cache_node *node[MAX_NUMNODES];
};
kmem_cache is a bit more complicated. The main things to look at are cpu_cache and node.
/* * struct array_cache * * Purpose: * - LIFO ordering, to hand out cache-warm objects from _alloc * - reduce the number of linked list operations * - reduce spinlock operations * * The limit is stored in the per-cpu structure to reduce the data cache * footprint. * */ struct array_cache { unsigned int avail; unsigned int limit; unsigned int batchcount; unsigned int touched; void *entry[]; /* * Must have this definition in here for the proper * alignment of array_cache. Also simplifies accessing * the entries. */ };
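array_cache is mainly used for the per-CPU variable. It is a LIFO stack, so the most recently freed, cache-warm objects are handed out first, which makes better use of the CPU cache. The entry[] pointer array holds pointers to free objects (not to slabs); the other fields are bookkeeping.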
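The LIFO behaviour is easy to see in a small userspace model. The struct below borrows the avail, limit and entry names from array_cache, but uses a fixed-size array and invented toy_ function names; it is a sketch of the idea, not the kernel's implementation.

```c
#include <stddef.h>

#define TOY_AC_LIMIT 4   /* assumed capacity for the example */

struct toy_array_cache {
    unsigned int avail;            /* number of cached object pointers */
    unsigned int limit;            /* capacity of entry[] */
    void *entry[TOY_AC_LIMIT];     /* stack of pointers to free objects */
};

static void toy_ac_init(struct toy_array_cache *ac)
{
    ac->avail = 0;
    ac->limit = TOY_AC_LIMIT;
}

/* Cache a freed object. Returns 0 on success, -1 if the array is full
 * (at which point a real allocator would flush a batch of objects). */
static int toy_ac_put(struct toy_array_cache *ac, void *obj)
{
    if (ac->avail >= ac->limit)
        return -1;
    ac->entry[ac->avail++] = obj;
    return 0;
}

/* Hand out the most recently freed object, or NULL if the cache is empty. */
static void *toy_ac_get(struct toy_array_cache *ac)
{
    if (ac->avail == 0)
        return NULL;
    return ac->entry[--ac->avail];
}
```

Because put and get work on the same end of the array, the object freed last is the one allocated next, which is the "cache-warm objects" point made in the struct's own comment.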
/* * The slab lists for all objects. */ struct kmem_cache_node { #ifdef CONFIG_SLAB raw_spinlock_t list_lock; struct list_head slabs_partial; /* partial list first, better asm code */ struct list_head slabs_full; struct list_head slabs_free; unsigned long total_slabs; /* length of all slab lists */ unsigned long free_slabs; /* length of free slab list only */ unsigned long free_objects; unsigned int free_limit; unsigned int colour_next; /* Per-node cache coloring */ struct array_cache *shared; /* shared per node */ struct alien_cache **alien; /* on other nodes */ unsigned long next_reap; /* updated without locking */ int free_touched; /* updated without locking */ #endif #ifdef CONFIG_SLUB spinlock_t list_lock; unsigned long nr_partial; struct list_head partial; #endif };
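SLUB is nicer: simple and clear. Why is SLAB so complicated, with three lists and all sorts of management data? I can't tell what it is up to. The per-node structure exists to make better use of node-local memory.
Next, the important APIs for creating caches and allocating from them.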
struct kmem_cache * kmem_cache_create(const char *name, unsigned int size, unsigned int align, slab_flags_t flags, void (*ctor)(void *)) { return kmem_cache_create_usercopy(name, size, align, flags, 0, 0, ctor); }
struct kmem_cache *
kmem_cache_create_usercopy(const char *name,
		  unsigned int size, unsigned int align,
		  slab_flags_t flags,
		  unsigned int useroffset, unsigned int usersize,
		  void (*ctor)(void *))
{
	struct kmem_cache *s = NULL;
	const char *cache_name;
	int err;

	mutex_lock(&slab_mutex);
	...
	if (!usersize)
		s = __kmem_cache_alias(name, size, align, flags, ctor);
	if (s)
		goto out_unlock;
	...
	s = create_cache(cache_name, size,
			 calculate_alignment(flags, align, size),
			 flags, useroffset, usersize, ctor, NULL);
	...
out_unlock:
	mutex_unlock(&slab_mutex);
	return s;
}
First it checks whether a suitable object cache already exists; if so, nothing needs to be created.
struct kmem_cache * __kmem_cache_alias(const char *name, unsigned int size, unsigned int align, slab_flags_t flags, void (*ctor)(void *)) { struct kmem_cache *cachep; cachep = find_mergeable(size, align, flags, name, ctor); if (cachep) { cachep->refcount++; /* * Adjust the object sizes so that we clear * the complete object on kzalloc. */ cachep->object_size = max_t(int, cachep->object_size, size); } return cachep; }
struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
		slab_flags_t flags, const char *name, void (*ctor)(void *))
{
	struct kmem_cache *s;
	...
	list_for_each_entry_reverse(s, &slab_caches, list) {
		...
		return s;
	}
	return NULL;
}
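find_mergeable walks the global list slab_caches looking for a cache that meets the requirements. Every kmem_cache is linked into this list.
If none is found, a new one is created.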
static struct kmem_cache *create_cache(const char *name,
		unsigned int object_size, unsigned int align,
		slab_flags_t flags, unsigned int useroffset,
		unsigned int usersize, void (*ctor)(void *),
		struct kmem_cache *root_cache)
{
	struct kmem_cache *s;
	int err;

	s = kmem_cache_zalloc(kmem_cache, GFP_KERNEL);
	...
	err = __kmem_cache_create(s, flags);
	s->refcount = 1;
	list_add(&s->list, &slab_caches);
	return s;
	...
}
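Memory is first allocated for the kmem_cache itself, the cache is then set up on it by __kmem_cache_create, its reference count is initialized to 1, and finally it is linked into the global slab_caches list mentioned above.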
int __kmem_cache_create(struct kmem_cache *cachep, slab_flags_t flags)
{
	size = ALIGN(size, BYTES_PER_WORD);

	/* 3) caller mandated alignment */
	if (ralign < cachep->align) {
		ralign = cachep->align;
	}
	cachep->align = ralign;
	cachep->colour_off = cache_line_size();
	size = ALIGN(size, cachep->align);

	/* work out gfporder, the object count num and the number of colours */
	if (set_objfreelist_slab_cache(cachep, size, flags)) {
		flags |= CFLGS_OBJFREELIST_SLAB;
		goto done;
	}

	if (set_off_slab_cache(cachep, size, flags)) {
		flags |= CFLGS_OFF_SLAB;
		goto done;
	}

	if (set_on_slab_cache(cachep, size, flags))
		goto done;

	return -E2BIG;

done:
	cachep->freelist_size = cachep->num * sizeof(freelist_idx_t);
	cachep->flags = flags;
	cachep->allocflags = __GFP_COMP;
	if (flags & SLAB_CACHE_DMA)
		cachep->allocflags |= GFP_DMA;
	if (flags & SLAB_CACHE_DMA32)
		cachep->allocflags |= GFP_DMA32;
	if (flags & SLAB_RECLAIM_ACCOUNT)
		cachep->allocflags |= __GFP_RECLAIMABLE;
	cachep->size = size;
	cachep->reciprocal_buffer_size = reciprocal_value(size);

	err = setup_cpu_cache(cachep, gfp);
	return 0;
}
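A slab can take one of three layouts, depending on where the freelist index array lives: inside a free object of the slab (set_objfreelist_slab_cache), outside the slab (set_off_slab_cache), or in the slab's own leftover space (set_on_slab_cache). They are tried in that order.
setup_cpu_cache sets up cpu_cache and the kmem_cache_node structures; the name doesn't quite fit, I think.
cpu_cache is initialized in setup_cpu_cache->enable_cpucache->do_tune_cpucache.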
static int do_tune_cpucache(struct kmem_cache *cachep, int limit, int batchcount, int shared, gfp_t gfp) { struct array_cache __percpu *cpu_cache, *prev; int cpu; cpu_cache = alloc_kmem_cache_cpus(cachep, limit, batchcount); prev = cachep->cpu_cache; cachep->cpu_cache = cpu_cache; check_irq_on(); cachep->batchcount = batchcount; cachep->limit = limit; cachep->shared = shared; for_each_online_cpu(cpu) { LIST_HEAD(list); int node; struct kmem_cache_node *n; struct array_cache *ac = per_cpu_ptr(prev, cpu); node = cpu_to_mem(cpu); n = get_node(cachep, node); raw_spin_lock_irq(&n->list_lock); free_block(cachep, ac->entry, ac->avail, node, &list); raw_spin_unlock_irq(&n->list_lock); slabs_destroy(cachep, &list); } free_percpu(prev); setup_node: return setup_kmem_cache_nodes(cachep, gfp); }
First a cpu_cache is allocated; it is a per-CPU variable with one instance per CPU. Then it is initialized.
static struct array_cache __percpu *alloc_kmem_cache_cpus( struct kmem_cache *cachep, int entries, int batchcount) { size = sizeof(void *) * entries + sizeof(struct array_cache); cpu_cache = __alloc_percpu(size, sizeof(void *)); for_each_possible_cpu(cpu) { init_arraycache(per_cpu_ptr(cpu_cache, cpu), entries, batchcount); } return cpu_cache; }
static void init_arraycache(struct array_cache *ac, int limit, int batch) { if (ac) { ac->avail = 0; ac->limit = limit; ac->batchcount = batch; ac->touched = 0; } }
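Every new array_cache starts with avail set to 0 and with limit and batchcount filled in. The free_block loop in do_tune_cpucache drains the previous cpu_cache (prev); for a newly created cache there is no previous one, so nothing needs to be freed.
Besides cpu_cache, the nodes are initialized too.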
static int setup_kmem_cache_nodes(struct kmem_cache *cachep, gfp_t gfp) { for_each_online_node(node) { ret = setup_kmem_cache_node(cachep, node, gfp, true); } return 0; }
static int setup_kmem_cache_node(struct kmem_cache *cachep, int node, gfp_t gfp, bool force_change) { LIST_HEAD(list); if (use_alien_caches) { new_alien = alloc_alien_cache(node, cachep->limit, gfp); } if (cachep->shared) { new_shared = alloc_arraycache(node, cachep->shared * cachep->batchcount, 0xbaadf00d, gfp); } ret = init_cache_node(cachep, node, gfp); n = get_node(cachep, node); raw_spin_lock_irq(&n->list_lock); if (n->shared && force_change) { free_block(cachep, n->shared->entry, n->shared->avail, node, &list); n->shared->avail = 0; } if (!n->shared || force_change) { old_shared = n->shared; n->shared = new_shared; new_shared = NULL; } if (!n->alien) { n->alien = new_alien; new_alien = NULL; } raw_spin_unlock_irq(&n->list_lock); slabs_destroy(cachep, &list); return ret; }
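An alien cache and an array cache are allocated and initialized. The alien cache holds objects belonging to all the other nodes; the array cache is assigned to shared and is the cache shared within this node. Then a kmem_cache_node is initialized and the newly created alien and shared caches are attached to it. One odd thing: what is 0xbaadf00d supposed to mean? It is passed as the batchcount argument of alloc_arraycache and looks like a deliberately bogus placeholder, presumably because the shared array's batchcount is never used. Kernel code can be baffling at times; something this ugly deserves at least a comment.
Heh. After a whole day analysing SLAB, I found out in the evening that SLAB has been removed from the kernel. A day's work for nothing. Tomorrow I'll look at SLUB instead. Off to bed.
Here is a blog post that explains SLUB: 图解slub (wowotech.net)
Let's look at kmem_cache again.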
/* * Slab cache management. */ struct kmem_cache { #ifndef CONFIG_SLUB_TINY struct kmem_cache_cpu __percpu *cpu_slab; #endif /* Used for retrieving partial slabs, etc. */ slab_flags_t flags; unsigned long min_partial; unsigned int size; /* Object size including metadata */ unsigned int object_size; /* Object size without metadata */ struct reciprocal_value reciprocal_size; unsigned int offset; /* Free pointer offset */ #ifdef CONFIG_SLUB_CPU_PARTIAL /* Number of per cpu partial objects to keep around */ unsigned int cpu_partial; /* Number of per cpu partial slabs to keep around */ unsigned int cpu_partial_slabs; #endif struct kmem_cache_order_objects oo; /* Allocation and freeing of slabs */ struct kmem_cache_order_objects min; gfp_t allocflags; /* gfp flags to use on each alloc */ int refcount; /* Refcount for slab cache destroy */ void (*ctor)(void *object); /* Object constructor */ unsigned int inuse; /* Offset to metadata */ unsigned int align; /* Alignment */ unsigned int red_left_pad; /* Left redzone padding size */ const char *name; /* Name (only for display!) */ struct list_head list; /* List of slab caches */ struct kmem_cache_node *node[MAX_NUMNODES]; };
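SLUB's version is a lot more concise than SLAB's. The colouring-related fields are gone, the per-CPU cache is no longer mandatory (it is compiled out under CONFIG_SLUB_TINY), and the node cache is much leaner. The key fields describing a cache are: size, the object size including metadata; object_size, the size without metadata; offset, the offset of the free pointer inside an object; oo, which packs the slab's page order and its object count into a single unsigned int, a neat trick; list, which links this kmem_cache into the global slab_caches list (the slabs themselves hang off the per-node structures, not off this list); the alignment align; and node. Next, kmem_cache_node.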
/* * The slab lists for all objects. */ struct kmem_cache_node { spinlock_t list_lock; unsigned long nr_partial; struct list_head partial; #ifdef CONFIG_SLUB_DEBUG atomic_long_t nr_slabs; atomic_long_t total_objects; struct list_head full; #endif };
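Clean and simple, and much nicer to read than SLAB. Leaving debug aside, there is a single list, partial. Good riddance to SLAB; I never liked it.
In the 6.7 kernel, struct slab is back as a type of its own rather than a set of fields living inside struct page (it still reuses the memory of struct page, as its comment says).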
/* Reuses the bits in struct page */ struct slab { unsigned long __page_flags; struct kmem_cache *slab_cache; union { struct { union { struct list_head slab_list; #ifdef CONFIG_SLUB_CPU_PARTIAL struct { struct slab *next; int slabs; /* Nr of slabs left */ }; #endif }; /* Double-word boundary */ union { struct { void *freelist; /* first free object */ union { unsigned long counters; struct { unsigned inuse:16; unsigned objects:15; unsigned frozen:1; }; }; }; #ifdef system_has_freelist_aba freelist_aba_t freelist_counter; #endif }; }; struct rcu_head rcu_head; }; unsigned int __unused; atomic_t __page_refcount; #ifdef CONFIG_MEMCG unsigned long memcg_data; #endif };
A borrowed figure; in it slab is still embedded in page, so adjust for that when reading it.
Now the SLUB API. kmem_cache_create creates an object cache.
struct kmem_cache * kmem_cache_create(const char *name, unsigned int size, unsigned int align, slab_flags_t flags, void (*ctor)(void *)) { return kmem_cache_create_usercopy(name, size, align, flags, 0, 0, ctor); }
The key function is kmem_cache_create_usercopy.
struct kmem_cache *
kmem_cache_create_usercopy(const char *name,
		  unsigned int size, unsigned int align,
		  slab_flags_t flags,
		  unsigned int useroffset, unsigned int usersize,
		  void (*ctor)(void *))
{
	mutex_lock(&slab_mutex);

	err = kmem_cache_sanity_check(name, size);

	/* Fail closed on bad usersize of useroffset values. */
	if (!usersize)
		s = __kmem_cache_alias(name, size, align, flags, ctor);
	if (s)
		goto out_unlock;

	cache_name = kstrdup_const(name, GFP_KERNEL);
	s = create_cache(cache_name, size,
			 calculate_alignment(flags, align, size),
			 flags, useroffset, usersize, ctor, NULL);

out_unlock:
	mutex_unlock(&slab_mutex);
	return s;
}
Simplified, it mainly runs two functions. The first step checks whether slab_caches already has a suitable cache and, if so, returns it directly.
struct kmem_cache *find_mergeable(unsigned int size, unsigned int align,
		slab_flags_t flags, const char *name, void (*ctor)(void *))
{
	...
	list_for_each_entry_reverse(s, &slab_caches, list) {
		...
		return s;
	}
	...
}
If no existing cache is found, one is created.
static struct kmem_cache *create_cache(const char *name,
		unsigned int object_size, unsigned int align,
		slab_flags_t flags, unsigned int useroffset,
		unsigned int usersize, void (*ctor)(void *),
		struct kmem_cache *root_cache)
{
	...
	s = kmem_cache_zalloc(kmem_cache, GFP_KERNEL);
	s->name = name;
	s->size = s->object_size = object_size;
	s->align = align;
	s->ctor = ctor;
#ifdef CONFIG_HARDENED_USERCOPY
	s->useroffset = useroffset;
	s->usersize = usersize;
#endif
	err = __kmem_cache_create(s, flags);
	s->refcount = 1;
	list_add(&s->list, &slab_caches);
	return s;
	...
}
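Memory is allocated for the kmem_cache, it is initialized from the arguments and then handed to the core function __kmem_cache_create. Once created, it is added to the global slab_caches list.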
int __kmem_cache_create(struct kmem_cache *s, slab_flags_t flags)
{
	...
	err = kmem_cache_open(s, flags);
	...
	return 0;
}
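kmem_cache_open does the important initialization. calculate_sizes works out the slab size, i.e. kmem_cache->oo, which holds the order and the object count. set_cpu_partial computes the number of objects and slabs for the per-CPU partial list. init_kmem_cache_nodes allocates and initializes the kmem_cache_node structures. alloc_kmem_cache_cpus allocates and initializes the per-CPU cpu_slab.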
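How an order and an object count can share one unsigned int is shown in the sketch below: the order goes in the bits above bit 16 and the object count in the low 16 bits. The toy_ and TOY_ names, and the fixed 4096-byte page size, are assumptions for this userspace example rather than the kernel's own definitions.

```c
#define TOY_PAGE_SIZE 4096U   /* assumed page size */
#define TOY_OO_SHIFT  16
#define TOY_OO_MASK   ((1U << TOY_OO_SHIFT) - 1)

/* Order in the high bits, number of objects per slab in the low 16 bits. */
struct toy_order_objects {
    unsigned int x;
};

/* How many objects of `size` bytes fit in a slab of 2^order pages. */
static unsigned int toy_objects_per_slab(unsigned int order, unsigned int size)
{
    return (TOY_PAGE_SIZE << order) / size;
}

static struct toy_order_objects toy_oo_make(unsigned int order, unsigned int size)
{
    struct toy_order_objects oo = {
        (order << TOY_OO_SHIFT) + toy_objects_per_slab(order, size)
    };

    return oo;
}

static unsigned int toy_oo_order(struct toy_order_objects oo)
{
    return oo.x >> TOY_OO_SHIFT;
}

static unsigned int toy_oo_objects(struct toy_order_objects oo)
{
    return oo.x & TOY_OO_MASK;
}
```

With 256-byte objects and an order-1 slab (8192 bytes), the packed value holds order 1 and 32 objects; packing both into one word lets them be read and updated together.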
Next, how a cache is destroyed.
void kmem_cache_destroy(struct kmem_cache *s)
{
	...
	s->refcount--;
	err = shutdown_cache(s);
	...
	kmem_cache_release(s);
}
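shutdown_cache frees all of the cache's slab memory, and kmem_cache_release removes the related sysfs objects.
Creating a cache is only the first step. To actually allocate memory from it, another API is needed: kmem_cache_alloc.
kmem_cache_alloc is somewhat complicated, so here is just an outline. Free objects can come from three places: the per-CPU freelist, the per-CPU partial list, and the per-node partial list. The allocator tries them in that order and returns the object's address as soon as one succeeds; if all three fail, it gets a new slab from the buddy system. A newly obtained slab is then installed following the same order.
Freeing a slab object follows the same order. This is why slab is called a cache: a freed object is not returned to the buddy system but to the cache, where it is kept for the next allocation.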
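The "freed objects go back to the cache, not to the page allocator" idea can be demonstrated with a toy object cache in userspace. This is only a sketch: malloc stands in for the buddy system, there is a single freelist instead of the three levels described above, and all toy_ names are invented. Like SLUB, it stores the next-free pointer inside the free object itself.

```c
#include <stddef.h>
#include <stdlib.h>

struct toy_cache {
    size_t object_size;            /* size of each object, at least a pointer */
    void *freelist;                /* list threaded through the free objects */
    unsigned long backing_allocs;  /* times we had to ask the backing allocator */
};

static void toy_cache_init(struct toy_cache *c, size_t object_size)
{
    /* A free object must be able to hold the next-free pointer. */
    c->object_size = object_size < sizeof(void *) ? sizeof(void *) : object_size;
    c->freelist = NULL;
    c->backing_allocs = 0;
}

static void *toy_cache_alloc(struct toy_cache *c)
{
    void *obj = c->freelist;

    if (obj) {
        /* Fast path: pop a previously freed object off the freelist. */
        c->freelist = *(void **)obj;
        return obj;
    }
    /* Slow path: nothing cached, go to the backing allocator. */
    c->backing_allocs++;
    return malloc(c->object_size);
}

static void toy_cache_free(struct toy_cache *c, void *obj)
{
    /* Keep the object for reuse instead of returning it to malloc. */
    *(void **)obj = c->freelist;
    c->freelist = obj;
}
```

Allocating, freeing and allocating again returns the same address without touching the backing allocator a second time, which is the behaviour that makes repeated allocation of hot structures cheap.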
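Slab is central to kernel memory allocation; many hot allocators depend on it, the familiar kmalloc among them. The usage flow is now clear: create a kmem_cache with kmem_cache_create, then allocate objects from it with kmem_cache_alloc.
Here is an example of slab use in the kernel, task_struct: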
void __init fork_init(void) { ... task_struct_cachep = kmem_cache_create_usercopy("task_struct", arch_task_struct_size, align, SLAB_PANIC|SLAB_ACCOUNT, useroffset, usersize, NULL); ... } static inline struct task_struct *alloc_task_struct_node(int node) { return kmem_cache_alloc_node(task_struct_cachep, GFP_KERNEL, node); }
fork_init creates the task_struct cache up front with kmem_cache_create_usercopy, and alloc_task_struct_node then uses kmem_cache_alloc_node to allocate a task_struct object.
Tags: struct, zone, int, cache, memory, linux, slab, page, physical. From: https://www.cnblogs.com/banshanjushi/p/17978306