Original: https://www.kernel.org/doc/html/latest/admin-guide/mm/concepts.html
Memory management in Linux is a complex system that has evolved over the years and gained more and more functionality to support a variety of systems, from MMU-less microcontrollers to supercomputers. Memory management for systems without an MMU is called nommu, and it deserves a dedicated document, which will hopefully be written eventually. Although some of the concepts are the same, here we assume that an MMU is available and that the CPU can translate a virtual address to a physical address.
- Virtual Memory Primer
- Huge Pages
- Zones
- Nodes
- Page cache
- Anonymous Memory
- Reclaim
- Compaction
- OOM killer
Virtual Memory Primer
The physical memory in a computer system is a limited resource and even for systems that support memory hotplug there is a hard limit on the amount of memory that can be installed. The physical memory is not necessarily contiguous; it might be accessible as a set of distinct address ranges. Besides, different CPU architectures, and even different implementations of the same architecture have different views of how these address ranges are defined.
All this makes dealing directly with physical memory quite complex, and to avoid this complexity the concept of virtual memory was developed.
Virtual memory abstracts the details of physical memory from the application software, allows keeping only the needed information in physical memory (demand paging), and provides a mechanism for the protection and controlled sharing of data between processes.
With virtual memory, each and every memory access uses a virtual address. When the CPU decodes an instruction that reads (or writes) from (or to) the system memory, it translates the virtual address encoded in that instruction to a physical address that the memory controller can understand.
The physical system memory is divided into page frames, or pages. The size of each page is architecture specific. Some architectures allow selection of the page size from several supported values; this selection is performed at the kernel build time by setting an appropriate kernel configuration option.
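Since the page size is fixed at kernel build time, a program can only query it, not choose it. A minimal sketch of such a query follows; the helper name `page_size` is made up for this example, and the fallback to `mmap.PAGESIZE` is an assumption for platforms without `os.sysconf`.

```python
import mmap
import os


def page_size() -> int:
    """Return the base page size in bytes as reported by the operating system."""
    # os.sysconf exists on Unix-like systems; mmap.PAGESIZE is the fallback.
    if hasattr(os, "sysconf") and "SC_PAGE_SIZE" in os.sysconf_names:
        return os.sysconf("SC_PAGE_SIZE")
    return mmap.PAGESIZE


if __name__ == "__main__":
    print(f"page size: {page_size()} bytes")
```

The value printed depends on the architecture and kernel configuration, so no particular number should be expected.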
Each physical memory page can be mapped as one or more virtual pages. These mappings are described by page tables that allow translation from a virtual address used by programs to the physical memory address. The page tables are organized hierarchically.
The tables at the lowest level of the hierarchy contain physical addresses of actual pages used by the software. The tables at higher levels contain physical addresses of the pages belonging to the lower levels. The pointer to the top level page table resides in a register. When the CPU performs the address translation, it uses this register to access the top level page table. The high bits of the virtual address are used to index an entry in the top level page table. That entry is then used to access the next level in the hierarchy with the next bits of the virtual address as the index to that level page table. The lowest bits in the virtual address define the offset inside the actual page.
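The walk described above can be sketched with nested dictionaries standing in for page tables. The layout constants below (four levels, 9 index bits per level, 12 offset bits) are assumptions chosen for illustration and match one common x86-64 configuration; other architectures differ. The function names are hypothetical.

```python
PAGE_SHIFT = 12      # assumed: 4 KiB pages, so 12 offset bits
BITS_PER_LEVEL = 9   # assumed: 512 entries per table
LEVELS = 4           # assumed: four-level hierarchy


def split_virtual_address(vaddr: int):
    """Return ([top-level index, ..., lowest-level index], offset in page)."""
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    indices = []
    for level in range(LEVELS - 1, -1, -1):  # highest bits first
        shift = PAGE_SHIFT + level * BITS_PER_LEVEL
        indices.append((vaddr >> shift) & ((1 << BITS_PER_LEVEL) - 1))
    return indices, offset


def translate(top_table: dict, vaddr: int) -> int:
    """Walk the toy tables and return the physical address for vaddr."""
    indices, offset = split_virtual_address(vaddr)
    entry = top_table            # plays the role of the top-level table register
    for index in indices:
        entry = entry[index]     # a KeyError here models a missing mapping
    return entry + offset        # lowest-level entry holds the page base address


if __name__ == "__main__":
    tables = {1: {2: {3: {4: 0x7000}}}}
    vaddr = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0x5
    print(hex(translate(tables, vaddr)))
```

A real MMU performs this walk in hardware and the entries also carry permission and status bits, which the sketch leaves out.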
Huge Pages
Address translation requires several memory accesses, and memory accesses are slow relative to CPU speed. To avoid spending precious processor cycles on address translation, CPUs maintain a cache of such translations called the Translation Lookaside Buffer (TLB). The TLB is usually a scarce resource, and applications with a large memory working set will experience a performance hit because of TLB misses.
Many modern CPU architectures allow mapping of memory pages directly by the higher levels in the page table. For instance, on x86, it is possible to map 2M and even 1G pages using entries in the second and the third level page tables. In Linux such pages are called huge. Usage of huge pages significantly reduces pressure on the TLB, improves the TLB hit rate and thus improves overall system performance.
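The effect on TLB pressure can be illustrated with simple arithmetic: the amount of memory a TLB can cover is the number of entries multiplied by the page size. The entry count below is a hypothetical figure picked for the example, not a property of any specific CPU.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

TLB_ENTRIES = 1024  # hypothetical TLB size, for illustration only


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of memory that can be translated without a TLB miss."""
    return entries * page_size


if __name__ == "__main__":
    for label, size in (("4K", 4 * KIB), ("2M", 2 * MIB), ("1G", GIB)):
        print(f"{label} pages: reach = {tlb_reach(TLB_ENTRIES, size) // MIB} MiB")
```

Under this assumption, the same number of entries covers 4 MiB with 4K pages and 2 GiB with 2M pages, which is why a large working set benefits from huge pages.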
There are two mechanisms in Linux that enable mapping of physical memory with huge pages. The first one is the HugeTLB filesystem, or hugetlbfs. It is a pseudo filesystem that uses RAM as its backing store. For files created in this filesystem, the data resides in memory and is mapped using huge pages. The hugetlbfs is described at HugeTLB Pages.
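The state of the HugeTLB pool is visible in `/proc/meminfo` through the `HugePages_*` and `Hugepagesize` fields. The sketch below parses those fields from text; the function name `parse_hugetlb_info` is made up for this example, and the parser assumes the usual `Key:   value [unit]` layout of that file.

```python
def parse_hugetlb_info(meminfo_text: str) -> dict:
    """Extract HugePages_* counters and Hugepagesize (in kB) from meminfo text."""
    info = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key.startswith(("HugePages_", "Hugepagesize")):
            info[key] = int(rest.split()[0])  # drop the optional "kB" unit
    return info


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            print(parse_hugetlb_info(f.read()))
    except OSError:
        print("/proc/meminfo is not available on this system")
```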
Another, more recent, mechanism that enables use of huge pages is called Transparent HugePages, or THP. Unlike hugetlbfs, which requires users and/or system administrators to configure which parts of the system memory should and can be mapped by huge pages, THP manages such mappings transparently to the user, hence the name. See Transparent Hugepage Support for more details about THP.
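Whether THP is active can be checked through sysfs: `/sys/kernel/mm/transparent_hugepage/enabled` lists the available modes with the current one in square brackets, for example `always [madvise] never`. The helper below is a small sketch that extracts the bracketed mode; its name is hypothetical.

```python
THP_ENABLED_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def active_thp_mode(setting: str) -> str:
    """Return the bracketed choice from text such as 'always [madvise] never'."""
    for word in setting.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    raise ValueError(f"no active mode found in {setting!r}")


if __name__ == "__main__":
    try:
        with open(THP_ENABLED_PATH) as f:
            print("THP mode:", active_thp_mode(f.read()))
    except OSError:
        print("THP sysfs file is not available on this system")
```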
Zones
Often hardware poses restrictions on how different physical memory ranges can be accessed. In some cases, devices cannot perform DMA to all the addressable memory. In other cases, the size of the physical memory exceeds the maximal addressable size of virtual memory and special actions are required to access portions of the memory. Linux groups memory pages into zones according to their possible usage. For example, ZONE_DMA will contain memory that can be used by devices for DMA, ZONE_HIGHMEM will contain memory that is not permanently mapped into the kernel's address space and ZONE_NORMAL will contain normally addressed pages.
The actual layout of the memory zones is hardware dependent as not all architectures define all zones, and requirements for DMA are different for different platforms.
Nodes
Many multi-processor machines are NUMA - Non-Uniform Memory Access - systems. In such systems the memory is arranged into banks that have different access latency depending on the "distance" from the processor. Each bank is referred to as a node and for each node Linux constructs an independent memory management subsystem. A node has its own set of zones, lists of free and used pages and various statistics counters. You can find more details about NUMA in What is NUMA? and in NUMA Memory Policy.
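The per-node, per-zone organization can be observed in `/proc/buddyinfo`, where each line names a node and a zone followed by the number of free blocks of each order (a block of order n spans 2^n pages). The sketch below totals free pages per node and zone; the function name is made up, and the parser assumes lines of the form `Node 0, zone Normal 12 7 3 ...`.

```python
def parse_buddyinfo(text: str) -> dict:
    """Map node id -> zone name -> free pages, from /proc/buddyinfo text."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue  # skip anything that does not look like a zone line
        node = int(parts[1].rstrip(","))
        zone = parts[3]
        counts = [int(c) for c in parts[4:]]
        # counts[order] free blocks, each covering 2**order pages
        free_pages = sum(count << order for order, count in enumerate(counts))
        result.setdefault(node, {})[zone] = free_pages
    return result


if __name__ == "__main__":
    try:
        with open("/proc/buddyinfo") as f:
            print(parse_buddyinfo(f.read()))
    except OSError:
        print("/proc/buddyinfo is not available on this system")
```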
Page cache
The physical memory is volatile and the common case for getting data into the memory is to read it from files. Whenever a file is read, the data is put into the page cache to avoid expensive disk access on the subsequent reads. Similarly, when one writes to a file, the data is placed in the page cache and eventually gets into the backing storage device. The written pages are marked as dirty and when Linux decides to reuse them for other purposes, it makes sure to synchronize the file contents on the device with the updated data.
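Because written data first lands in the page cache as dirty pages, a program that needs the data on the device at a known point has to ask for it. The sketch below uses `os.fsync` for that; the helper name `write_durably` is an assumption of this example.

```python
import os
import tempfile


def write_durably(path: str, data: bytes) -> None:
    """Write data and wait until the kernel has pushed it to the device."""
    with open(path, "wb") as f:
        f.write(data)         # goes to Python's buffer, then to the page cache
        f.flush()             # hand Python's userspace buffer to the kernel
        os.fsync(f.fileno())  # ask the kernel to write the dirty pages out


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "example.bin")
        write_durably(target, b"page cache demo")
        with open(target, "rb") as f:
            print(f.read())   # this read is typically served from the page cache
```

Without the `fsync` call the write still succeeds; the kernel simply decides on its own when to synchronize the dirty pages with the backing storage.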
Anonymous Memory
Anonymous memory, or anonymous mappings, represent memory that is not backed by a filesystem. Such mappings are implicitly created for a program's stack and heap or by explicit calls to the mmap(2) system call. Usually, the anonymous mappings only define virtual memory areas that the program is allowed to access. Read accesses will result in the creation of a page table entry that references a special physical page filled with zeroes. When the program performs a write, a regular physical page will be allocated to hold the written data. The page will be marked dirty and if the kernel decides to repurpose it, the dirty page will be swapped out.
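An explicit anonymous mapping can be created from Python's `mmap` module by passing `-1` as the file descriptor. The sketch below assumes a Unix-like system (the `MAP_PRIVATE` and `MAP_ANONYMOUS` flags are not available elsewhere). It can only show the visible behavior, reads returning zeroes and writes being retained; the shared zero page and the later page allocation happen inside the kernel and are not observable here.

```python
import mmap


def demo_anonymous_mapping(length: int = mmap.PAGESIZE):
    """Return (bytes read before any write, bytes read after a write)."""
    flags = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS  # Unix-only flags
    with mmap.mmap(-1, length, flags=flags) as region:
        first_read = region[:8]    # untouched anonymous memory reads as zeroes
        region[:5] = b"hello"      # first write makes the kernel back the page
        after_write = region[:8]
    return first_read, after_write


if __name__ == "__main__":
    before, after = demo_anonymous_mapping()
    print(before, after)
```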
Reclaim
Throughout the system lifetime, a physical page can be used for storing different types of data. It can be kernel internal data structures, DMA'able buffers for device drivers' use, data read from a filesystem, memory allocated by user space processes, etc.
Depending on the page usage it is treated differently by the Linux memory management. The pages that can be freed at any time, either because they cache the data available elsewhere, for instance, on a hard disk, or because they can be swapped out, again, to the hard disk, are called reclaimable. The most notable categories of the reclaimable pages are page cache and anonymous memory.
In most cases, the pages holding internal kernel data and used as DMA buffers cannot be repurposed, and they remain pinned until freed by their user. Such pages are called unreclaimable. However, in certain circumstances, even pages occupied with kernel data structures can be reclaimed. For instance, in-memory caches of filesystem metadata can be re-read from the storage device and therefore it is possible to discard them from the main memory when the system is under memory pressure.
The process of freeing the reclaimable physical memory pages and repurposing them is called (surprise!) reclaim. Linux can reclaim pages either asynchronously or synchronously, depending on the state of the system. When the system is not loaded, most of the memory is free and allocation requests will be satisfied immediately from the free pages supply. As the load increases, the amount of the free pages goes down and when it reaches a certain threshold (low watermark), an allocation request will awaken the kswapd daemon. It will asynchronously scan memory pages and either just free them if the data they contain is available elsewhere, or evict to the backing storage device (remember those dirty pages?). As memory usage increases even more and reaches another threshold - min watermark - an allocation will trigger direct reclaim. In this case allocation is stalled until enough memory pages are reclaimed to satisfy the request.
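The two thresholds described above can be summarized as a small decision function. This is a deliberately simplified sketch: the function name and return strings are invented for the example, and the real kernel check is made per zone and takes more factors into account than a single free-page count.

```python
def allocation_path(free_pages: int, low_watermark: int, min_watermark: int) -> str:
    """Classify what an allocation request triggers, given the free page count."""
    if free_pages <= min_watermark:
        return "direct reclaim"        # the allocation stalls until pages are freed
    if free_pages <= low_watermark:
        return "wake kswapd"           # asynchronous reclaim in the background
    return "allocate immediately"      # plenty of free pages


if __name__ == "__main__":
    for free in (5000, 900, 200):
        print(free, "->", allocation_path(free, low_watermark=1000, min_watermark=250))
```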
Compaction
As the system runs, tasks allocate and free memory, and it becomes fragmented. Although with virtual memory it is possible to present scattered physical pages as a virtually contiguous range, sometimes it is necessary to allocate large physically contiguous memory areas. Such a need may arise, for instance, when a device driver requires a large buffer for DMA, or when THP allocates a huge page. Memory compaction addresses the fragmentation issue. This mechanism moves occupied pages from the lower part of a memory zone to free pages in the upper part of the zone. When a compaction scan is finished, free pages are grouped together at the beginning of the zone, and allocations of large physically contiguous areas become possible.
Like reclaim, compaction may happen asynchronously in the kcompactd daemon or synchronously as a result of a memory allocation request.
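The movement of pages during compaction can be modeled on a toy zone represented as a list, where `None` marks a free page and any other value marks an occupied one. One scanner walks up from the start looking for occupied pages, another walks down from the end looking for free slots, and pages are moved until the two meet. This is only a sketch of the idea described above; the function name is hypothetical, and the kernel additionally has to deal with unmovable pages and with updating every mapping of a moved page.

```python
def compact(zone: list) -> list:
    """Return a copy of zone with occupied pages moved toward the upper end."""
    pages = list(zone)
    migrate = 0              # scans upward for occupied pages
    free = len(pages) - 1    # scans downward for free slots
    while True:
        while migrate < free and pages[migrate] is None:
            migrate += 1
        while migrate < free and pages[free] is not None:
            free -= 1
        if migrate >= free:  # scanners met: the pass is finished
            break
        # move the occupied page into the free slot near the top
        pages[free], pages[migrate] = pages[migrate], None
    return pages


if __name__ == "__main__":
    print(compact(["A", None, "B", None, None, "C"]))
```

After the pass, the free slots form one contiguous run at the beginning of the list, which is what makes a large physically contiguous allocation possible.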
OOM killer
It is possible that, on a loaded machine, memory will be exhausted and the kernel will be unable to reclaim enough memory to continue to operate. In order to save the rest of the system, it invokes the OOM killer.
The OOM killer selects a task to sacrifice for the sake of the overall system health. The selected task is killed in the hope that, after it exits, enough memory will be freed to continue normal operation.
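The text above does not say how the task is chosen. The sketch below is an assumption loosely modeled on the kernel's badness heuristic: score each task by its memory footprint, shift the score by its `oom_score_adj` setting (range -1000 to 1000, as exposed in `/proc/<pid>/oom_score_adj`), and pick the highest score, treating -1000 as exempt. The `Task` class, the function names, and the use of resident pages alone as the footprint are simplifications made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    name: str
    rss_pages: int           # resident pages used by the task
    oom_score_adj: int = 0   # -1000..1000; -1000 means "never kill"


def badness(task: Task, total_pages: int) -> Optional[int]:
    """Simplified score: footprint plus an adjustment scaled to total memory."""
    if task.oom_score_adj == -1000:
        return None  # exempt from selection
    return task.rss_pages + task.oom_score_adj * total_pages // 1000


def select_victim(tasks: List[Task], total_pages: int) -> Optional[Task]:
    """Return the task with the highest score, or None if all are exempt."""
    scored = [(badness(t, total_pages), t) for t in tasks]
    candidates = [(score, t) for score, t in scored if score is not None]
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]


if __name__ == "__main__":
    tasks = [Task("db", 5000), Task("browser", 3000), Task("sshd", 9000, -1000)]
    print(select_victim(tasks, total_pages=10000).name)
```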