Analysis of balance_dirty_pages_ratelimited

Date: 2024-09-08 14:36:22
Tags: pause bdi ratelimited thresh dirty nr balance pages

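  • nr_dirtied_pause: the current task's dirty-page threshold, i.e. how many pages the task may dirty before it has to call balance_dirty_pages;
  • dirty_exceeded: set when this bdi's dirty page count exceeds the bdi threshold and, in addition, either the global dirty page count exceeds the global threshold or the bdi uses strictlimit; (dirty_exceeded = (bdi_dirty > bdi_thresh) && ((nr_dirty > dirty_thresh) || strictlimit);)
  • bdp_ratelimits: a per-CPU variable counting the pages dirtied on the current CPU
  • ratelimit_pages: the per-CPU dirty-page threshold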

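balance_dirty_pages is called when any of the following holds:
1: The current task's dirtied-page count reaches ratelimit (current->nr_dirtied_pause when dirty_exceeded is 0; capped at 32 KB worth of pages when dirty_exceeded is 1);

2: The current CPU's dirtied-page count reaches the per-CPU threshold ratelimit_pages;

3: The current task's dirtied-page count plus the pages left behind by exited tasks reaches the threshold;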

void balance_dirty_pages_ratelimited(struct address_space *mapping)
{
	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
	int ratelimit;
	int *p;

	if (!bdi_cap_account_dirty(bdi))
		return;

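	ratelimit = current->nr_dirtied_pause;  /* threshold: initial value is 32 pages, i.e. 128 KB with 4 KB pages */
	if (bdi->dirty_exceeded)                /* if set, lower the trigger threshold so that balancing runs sooner */
		ratelimit = min(ratelimit, 32 >> (PAGE_SHIFT - 10));  /* cap the threshold at 32 KB worth of pages (down from the initial 128 KB) */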

	preempt_disable();
	/*
	 * This prevents one CPU to accumulate too many dirtied pages without
	 * calling into balance_dirty_pages(), which can happen when there are
	 * 1000+ tasks, all of them start dirtying pages at exactly the same
	 * time, hence all honoured too large initial task->nr_dirtied_pause.
	 */
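	/* Balancing must run when either the task or the CPU reaches its threshold */
	p =  this_cpu_ptr(&bdp_ratelimits);  /* this CPU's dirtied-page counter */
	/* The task has reached its threshold, so balance_dirty_pages() will be called below; reset this CPU's counter */
	if (unlikely(current->nr_dirtied >= ratelimit))
		*p = 0;
	else if (unlikely(*p >= ratelimit_pages)) {     /* ratelimit_pages is initially 32 pages */
		/* The task is below its threshold but the CPU has reached ratelimit_pages: force ratelimit to 0 so that balancing is triggered, and reset this CPU's counter */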
		*p = 0;
		ratelimit = 0;
	}
	/*
	 * Pick up the dirtied pages by the exited tasks. This avoids lots of
	 * short-lived tasks (eg. gcc invocations in a kernel build) escaping
	 * the dirty throttling and livelock other long-run dirtiers.
	 */
	p = this_cpu_ptr(&dirty_throttle_leaks);   /* pages dirtied by exited tasks are charged here as well */
	if (*p > 0 && current->nr_dirtied < ratelimit) {  
		unsigned long nr_pages_dirtied;
		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
		*p -= nr_pages_dirtied;
		current->nr_dirtied += nr_pages_dirtied;
	}
	preempt_enable();

	if (unlikely(current->nr_dirtied >= ratelimit))    /* the task has reached its threshold */
		balance_dirty_pages(mapping, current->nr_dirtied);
}
EXPORT_SYMBOL(balance_dirty_pages_ratelimited);
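
The three trigger conditions can be tried out in userspace with a small model of the function above. This is only a sketch: struct ratelimit_state and should_balance() are hypothetical names invented for illustration, the per-CPU and per-task kernel state is replaced by plain struct fields, and PAGE_SHIFT is assumed to be 12 (4 KB pages).

```c
#include <stdbool.h>

#define PAGE_SHIFT 12  /* assumption: 4 KB pages */

/* Hypothetical stand-ins for the kernel state the function consults. */
struct ratelimit_state {
	int nr_dirtied;        /* models current->nr_dirtied */
	int nr_dirtied_pause;  /* models current->nr_dirtied_pause */
	bool dirty_exceeded;   /* models bdi->dirty_exceeded */
	int cpu_dirtied;       /* models this CPU's bdp_ratelimits */
	int cpu_leaks;         /* models this CPU's dirty_throttle_leaks */
	int ratelimit_pages;   /* models the global ratelimit_pages */
};

static int min_int(int a, int b)
{
	return a < b ? a : b;
}

/* Returns true when the modelled task would call balance_dirty_pages(). */
static bool should_balance(struct ratelimit_state *s)
{
	int ratelimit = s->nr_dirtied_pause;

	/* Condition 1: a lower per-task threshold once the bdi is over its limit. */
	if (s->dirty_exceeded)
		ratelimit = min_int(ratelimit, 32 >> (PAGE_SHIFT - 10));

	if (s->nr_dirtied >= ratelimit) {
		s->cpu_dirtied = 0;
	} else if (s->cpu_dirtied >= s->ratelimit_pages) {
		/* Condition 2: the CPU as a whole dirtied too many pages. */
		s->cpu_dirtied = 0;
		ratelimit = 0;
	}

	/* Condition 3: inherit pages left behind by exited tasks. */
	if (s->cpu_leaks > 0 && s->nr_dirtied < ratelimit) {
		int taken = min_int(s->cpu_leaks, ratelimit - s->nr_dirtied);

		s->cpu_leaks -= taken;
		s->nr_dirtied += taken;
	}

	return s->nr_dirtied >= ratelimit;
}
```

With nr_dirtied = 10 and nr_dirtied_pause = 32, the model returns false while everything is quiet, and true as soon as dirty_exceeded is set (the threshold drops to 8 pages), the CPU counter reaches ratelimit_pages, or enough leaked pages are available to fill the gap up to 32.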

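Normally dirty pages are written back by periodic and background writeback, which costs the dirtying task no time. But when dirty > dirty_freerun_ceiling(thresh, bg_thresh), that is, when the dirty page count exceeds the midpoint (thresh + bg_thresh) / 2 of the throttling threshold and the background threshold, the current task is put to sleep for a while so that the flusher thread can make progress.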
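
As a rough illustration of that midpoint rule, the check can be written as a few lines of standalone C. The helper names freerun_ceiling() and must_throttle() are made up for this sketch; only the (thresh + bg_thresh) / 2 formula comes from the kernel code discussed here.

```c
#include <stdbool.h>

/* Midpoint of the throttling threshold and the background threshold. */
static unsigned long freerun_ceiling(unsigned long thresh,
				     unsigned long bg_thresh)
{
	return (thresh + bg_thresh) / 2;
}

/* True when the dirtying task would be throttled (hypothetical helper). */
static bool must_throttle(unsigned long dirty, unsigned long thresh,
			  unsigned long bg_thresh)
{
	return dirty > freerun_ceiling(thresh, bg_thresh);
}
```

For example, with thresh = 2000 pages and bg_thresh = 1000 pages the ceiling is 1500 pages: a task seeing 1500 dirty pages still runs freely, while 1501 dirty pages is enough to make it a throttling candidate.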

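When dirty <= dirty_freerun_ceiling(thresh, bg_thresh) the task is not paused, but nr_dirtied_pause is still adjusted dynamically so that the task polls at a suitable interval. The adjustment is: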

static unsigned long dirty_poll_interval(unsigned long dirty,
					 unsigned long thresh)
{
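	/* The further dirty is below thresh, the less often the task needs to poll */
	if (thresh > dirty)  /* sqrt(thresh - dirty), rounded down to a power of two */
		return 1UL << (ilog2(thresh - dirty) >> 1);

	return 1;  /* dirty has reached thresh: poll after every single dirtied page */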
}

The reasoning behind this is explained in the following notes:
/*
Ideally if we know there are N dirtiers, it’s safe to let each task
poll at (thresh-dirty)/N without exceeding the dirty limit.

However we neither know the current N, nor is sure whether it will
rush high at next second. So sqrt is used to tolerate larger N on
increased (thresh-dirty) gap:

irb> 0.upto(10) { |i| mb=2**i; pages=mb<<(20-12); printf "%4d\t%4d\n", mb, Math.sqrt(pages)}

   1	  16
   2	  22
   4	  32
   8	  45
  16	  64
  32	  90
  64	 128
 128	 181
 256	 256
 512	 362
1024	 512

The above table means, given 1MB (or 1GB) gap and the dd tasks polling
balance_dirty_pages() on every 16 (or 512) pages, the dirty limit
won’t be exceeded as long as there are less than 16 (or 512) concurrent
dd’s.

Note that dirty_poll_interval() will mainly be used when (dirty < freerun).
When the dirty pages are floating in range [freerun, limit],
“[PATCH 14/18] writeback: control dirty pause time” will independently
adjust tsk->nr_dirtied_pause to get suitable pause time.

So the sqrt naturally leads to less overheads and more N tolerance for
large memory servers, which have large (thresh-freerun) gaps.

*/
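
To compare the shift-based formula with the exact square roots in the table, dirty_poll_interval() can be reproduced in userspace. This is a sketch under assumptions: ilog2_ul() is a simple stand-in written here for the kernel's ilog2(), and poll_interval() is just a renamed copy of the logic shown earlier.

```c
/* floor(log2(n)) for n > 0; stand-in for the kernel's ilog2(). */
static unsigned int ilog2_ul(unsigned long n)
{
	unsigned int log = 0;

	while (n >>= 1)
		log++;
	return log;
}

/* Userspace copy of dirty_poll_interval(); all values are in pages. */
static unsigned long poll_interval(unsigned long dirty, unsigned long thresh)
{
	if (thresh > dirty)
		return 1UL << (ilog2_ul(thresh - dirty) >> 1);

	return 1;
}
```

Because the exponent is halved with an integer shift, the result is the square root rounded down to a power of two. A 1 MB gap (256 pages) gives 16 and a 1 GB gap (262144 pages) gives 512, matching the table, while a 2 MB gap (512 pages) gives 16 rather than 22.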

void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
{
	/* Dirtyable memory is not all of system memory; it is free pages + reclaimable (file) pages */
	const unsigned long available_memory = global_dirtyable_memory();
	unsigned long background;
	unsigned long dirty;
	struct task_struct *tsk;

	if (vm_dirty_bytes)
		dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
	else
		dirty = (vm_dirty_ratio * available_memory) / 100;

	if (dirty_background_bytes)
		background = DIV_ROUND_UP(dirty_background_bytes, PAGE_SIZE);
	else
		background = (dirty_background_ratio * available_memory) / 100;

	if (background >= dirty)
		background = dirty / 2;
	tsk = current;
	if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {   /* PF_LESS_THROTTLE tasks and real-time tasks get both thresholds raised by 1/4 */
		background += background / 4;
		dirty += dirty / 4;
	}
	*pbackground = background;
	*pdirty = dirty;
	trace_global_dirty_state(background, dirty);
}

static unsigned long global_dirtyable_memory(void)
{
	unsigned long x;

	/* Dirtyable memory is not all of system memory; it is free pages + file pages */
	x = global_page_state(NR_FREE_PAGES);
	x -= min(x, dirty_balance_reserve);

	x += global_page_state(NR_INACTIVE_FILE);
	x += global_page_state(NR_ACTIVE_FILE);

	if (!vm_highmem_is_dirtyable)
		x -= highmem_dirtyable_memory(x);

	return x + 1;	/* Ensure that we never return 0 */
}
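
The ratio branch of global_dirty_limits() is easy to check by hand. The sketch below models only that branch: the vm_dirty_bytes / dirty_background_bytes variants are left out, struct dirty_limits and compute_limits() are hypothetical names, and the ratios are passed as parameters instead of being read from sysctls.

```c
#include <stdbool.h>

struct dirty_limits {
	unsigned long background;  /* background writeback threshold, in pages */
	unsigned long dirty;       /* throttling threshold, in pages */
};

/*
 * Ratio-only model of global_dirty_limits(). less_throttle stands for
 * "PF_LESS_THROTTLE is set or the task is a real-time task".
 */
static struct dirty_limits compute_limits(unsigned long available_pages,
					  unsigned long dirty_ratio,
					  unsigned long background_ratio,
					  bool less_throttle)
{
	struct dirty_limits l;

	l.dirty = (dirty_ratio * available_pages) / 100;
	l.background = (background_ratio * available_pages) / 100;

	/* The background threshold must stay below the dirty threshold. */
	if (l.background >= l.dirty)
		l.background = l.dirty / 2;

	if (less_throttle) {
		l.background += l.background / 4;
		l.dirty += l.dirty / 4;
	}
	return l;
}
```

Assuming 1,000,000 dirtyable pages and ratios of 20 and 10, this yields dirty = 200000 and background = 100000 pages; for a less-throttled task the values become 250000 and 125000. If the ratios are misconfigured as 10 and 20, background is clamped to half of dirty, i.e. 50000 pages.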

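balance_dirty_pages makes two decisions about writeback:
1: If the dirty page count (reclaimable pages plus pages under writeback) does not exceed the midpoint of the background threshold and the throttling threshold, the task is neither throttled nor made to start writeback in this pass; otherwise background writeback is started if it is not already running;
2: On exit, if the number of reclaimable dirty pages exceeds the background threshold, background writeback is triggered;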

static void balance_dirty_pages(struct address_space *mapping,
				unsigned long pages_dirtied)
{
	unsigned long nr_reclaimable;	/* = file_dirty + unstable_nfs */
	unsigned long nr_dirty;  /* = file_dirty + writeback + unstable_nfs */
	unsigned long background_thresh;
	unsigned long dirty_thresh;
	long period;
	long pause;
	long max_pause;
	long min_pause;
	int nr_dirtied_pause;
	bool dirty_exceeded = false;
	unsigned long task_ratelimit;
	unsigned long dirty_ratelimit;
	unsigned long pos_ratio;
	struct backing_dev_info *bdi = inode_to_bdi(mapping->host);
	bool strictlimit = bdi->capabilities & BDI_CAP_STRICTLIMIT; // throttle against this bdi's own thresholds
	unsigned long start_time = jiffies;

	for (;;) {
		unsigned long now = jiffies;
		unsigned long uninitialized_var(bdi_thresh);
		unsigned long thresh;
		unsigned long uninitialized_var(bdi_dirty);
		unsigned long dirty;
		unsigned long bg_thresh;

		/*
		 * Unstable writes are a feature of certain networked
		 * filesystems (i.e. NFS) in which data may have been
		 * written to the server's write cache, but has not yet
		 * been flushed to permanent storage.
		 */
		nr_reclaimable = global_page_state(NR_FILE_DIRTY) +
					global_page_state(NR_UNSTABLE_NFS);  /* global: file_dirty + unstable_nfs */
		nr_dirty = nr_reclaimable + global_page_state(NR_WRITEBACK); /* global: all dirty pages including those under writeback = file_dirty + writeback + unstable_nfs */

		global_dirty_limits(&background_thresh, &dirty_thresh); // fetch both thresholds

		if (unlikely(strictlimit)) {  /* use the per-bdi counters and limits */
			bdi_dirty_limits(bdi, dirty_thresh, background_thresh,
					 &bdi_dirty, &bdi_thresh, &bg_thresh);

			dirty = bdi_dirty;
			thresh = bdi_thresh;
		} else {                       /* use the global counters and limits */
			dirty = nr_dirty;          /* global dirty pages, including those under writeback */
			thresh = dirty_thresh;
			bg_thresh = background_thresh;
		}

		/*
		 * Throttle it only when the background writeback cannot
		 * catch-up. This avoids (excessively) small writeouts
		 * when the bdi limits are ramping up in case of !strictlimit.
		 *
		 * In strictlimit case make decision based on the bdi counters
		 * and limits. Small writeouts when the bdi limits are ramping
		 * up are the price we consciously pay for strictlimit-ing.
		 */
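		/* At or below the midpoint of the throttling and background thresholds the task is not throttled; above it, background writeback is not keeping up and the task has to pay with its own time */
		if (dirty <= dirty_freerun_ceiling(thresh, bg_thresh)) {  // (thresh + bg_thresh) / 2; no throttling
			current->dirty_paused_when = now;
			current->nr_dirtied = 0;                 /* reset the task's dirtied-page count */
			current->nr_dirtied_pause =
				dirty_poll_interval(dirty, thresh);   /* recompute the task's polling threshold */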
			break;
		}

		if (unlikely(!writeback_in_progress(bdi)))  /* wake the flusher thread that does the actual writeback */
			bdi_start_background_writeback(bdi);

		if (!strictlimit)
			bdi_dirty_limits(bdi, dirty_thresh, background_thresh,
					 &bdi_dirty, &bdi_thresh, NULL);
		
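		/*
		 * With strictlimit, the bdi exceeding its own threshold is enough.
		 * Otherwise the bdi must exceed its threshold and the global dirty
		 * count must exceed the global one (nr_dirty > dirty_thresh).
		 */
		dirty_exceeded = (bdi_dirty > bdi_thresh) &&
				 ((nr_dirty > dirty_thresh) || strictlimit); // over the limit
		
		if (dirty_exceeded && !bdi->dirty_exceeded)
			bdi->dirty_exceeded = 1;                        // over the limit: balance_dirty_pages_ratelimited() will now poll more often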

		bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
				     nr_dirty, bdi_thresh, bdi_dirty,
				     start_time);

		dirty_ratelimit = bdi->dirty_ratelimit;
		pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
					       background_thresh, nr_dirty,
					       bdi_thresh, bdi_dirty);
		task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
							RATELIMIT_CALC_SHIFT;
		max_pause = bdi_max_pause(bdi, bdi_dirty);
		min_pause = bdi_min_pause(bdi, max_pause,
					  task_ratelimit, dirty_ratelimit,
					  &nr_dirtied_pause);

		if (unlikely(task_ratelimit == 0)) {
			period = max_pause;
			pause = max_pause;
			goto pause;
		}
		period = HZ * pages_dirtied / task_ratelimit;
		pause = period;
		if (current->dirty_paused_when)
			pause -= now - current->dirty_paused_when;
		/*
		 * For less than 1s think time (ext3/4 may block the dirtier
		 * for up to 800ms from time to time on 1-HDD; so does xfs,
		 * however at much less frequency), try to compensate it in
		 * future periods by updating the virtual time; otherwise just
		 * do a reset, as it may be a light dirtier.
		 */
		if (pause < min_pause) {
			trace_balance_dirty_pages(bdi,
						  dirty_thresh,
						  background_thresh,
						  nr_dirty,
						  bdi_thresh,
						  bdi_dirty,
						  dirty_ratelimit,
						  task_ratelimit,
						  pages_dirtied,
						  period,
						  min(pause, 0L),
						  start_time);
			if (pause < -HZ) {
				current->dirty_paused_when = now;
				current->nr_dirtied = 0;
			} else if (period) {
				current->dirty_paused_when += period;
				current->nr_dirtied = 0;
			} else if (current->nr_dirtied_pause <= pages_dirtied)
				current->nr_dirtied_pause += pages_dirtied;
			break;
		}
		if (unlikely(pause > max_pause)) {
			/* for occasional dropped task_ratelimit */
			now += min(pause - max_pause, max_pause);
			pause = max_pause;
		}

pause:
		trace_balance_dirty_pages(bdi,
					  dirty_thresh,
					  background_thresh,
					  nr_dirty,
					  bdi_thresh,
					  bdi_dirty,
					  dirty_ratelimit,
					  task_ratelimit,
					  pages_dirtied,
					  period,
					  pause,
					  start_time);
		__set_current_state(TASK_KILLABLE);
		io_schedule_timeout(pause); // the task may be scheduled out here; pause is capped at max_pause (at most 200 ms)

		current->dirty_paused_when = now + pause;
		current->nr_dirtied = 0;
		current->nr_dirtied_pause = nr_dirtied_pause;

		/*
		 * This is typically equal to (nr_dirty < dirty_thresh) and can
		 * also keep "1000+ dd on a slow USB stick" under control.
		 */
		if (task_ratelimit)
			break;

		/*
		 * In the case of an unresponding NFS server and the NFS dirty
		 * pages exceeds dirty_thresh, give the other good bdi's a pipe
		 * to go through, so that tasks on them still remain responsive.
		 *
		 * In theory 1 page is enough to keep the comsumer-producer
		 * pipe going: the flusher cleans 1 page => the task dirties 1
		 * more page. However bdi_dirty has accounting errors.  So use
		 * the larger and more IO friendly bdi_stat_error.
		 */
		if (bdi_dirty <= bdi_stat_error(bdi))
			break;

		if (fatal_signal_pending(current))
			break;
	}

	if (!dirty_exceeded && bdi->dirty_exceeded)  // no longer over the limit: clear the flag
		bdi->dirty_exceeded = 0;

	if (writeback_in_progress(bdi))  // writeback is already running: nothing more to do
		return;

	/*
	 * In laptop mode, we wait until hitting the higher threshold before
	 * starting background writeout, and then write out all the way down
	 * to the lower threshold.  So slow writers cause minimal disk activity.
	 *
	 * In normal mode, we start background writeout at the lower
	 * background_thresh, to keep the amount of dirty memory low.
	 */
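	/*
	* Laptop (power-saving) mode: as described above, background writeback is
	* deferred until the higher threshold is hit, so it is not started here.
	*/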
	if (laptop_mode)
		return;

	if (nr_reclaimable > background_thresh) // reclaimable pages exceed background_thresh: start asynchronous background writeback
		bdi_start_background_writeback(bdi);
}
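
The core of the pause computation in the loop above is plain arithmetic and can be followed with a small standalone model. This is a sketch, not the kernel's code path: compute_pause() is a hypothetical helper, HZ is assumed to be 1000, think_time stands for now - current->dirty_paused_when (pass 0 when dirty_paused_when is unset), and the min_pause handling and the virtual-time bookkeeping are left to the caller.

```c
#define HZ 1000  /* assumption: 1000 jiffies per second */

/*
 * Returns the pause in jiffies for a task that dirtied pages_dirtied
 * pages while being allowed task_ratelimit pages per second. A result
 * below the caller's min_pause means the task would not sleep at all.
 */
static long compute_pause(unsigned long pages_dirtied,
			  unsigned long task_ratelimit,
			  long think_time, long max_pause)
{
	long period;
	long pause;

	/* No bandwidth at all: sleep for the longest allowed pause. */
	if (task_ratelimit == 0)
		return max_pause;

	/* Time the task "owes" for the pages it has just dirtied. */
	period = (long)(HZ * pages_dirtied / task_ratelimit);

	/* Time already spent since the last pause counts against the debt. */
	pause = period - think_time;

	if (pause > max_pause)
		pause = max_pause;
	return pause;
}
```

For instance, 32 pages at 1000 pages/s give a period of 32 jiffies; after 10 jiffies of think time the pause is 22. At 100 pages/s the period is 320 jiffies and the pause is capped at a max_pause of 200. A think time of 50 jiffies in the first case gives -18, which is below any min_pause, so the task would continue without sleeping.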

From: https://www.cnblogs.com/linhaostudy/p/18402841
