Linux kernel study: process scheduling

Uranus Zhou — Wed, 06 Jun 2012 18:18:14 +0000

This post is mirrored from (follow the link if it does not display properly): https://zohead.com/archives/linux-kernel-learning-process-scheduling/

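Following on from the earlier post on basic process concepts: the process scheduler decides which process in the system runs and for how long. The Linux kernel implements preemptive, timeslice-based multitasking, not the cooperative kind in which processes give up the CPU voluntarily.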

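Starting with the 2.5 series, Linux used a scheduler known as the O(1) scheduler. It fixed many problems that were baked into the design of the scheduler in 2.4 and earlier, and the name O(1) means the algorithm finishes its work in constant time. The 2.6.11 kernel covered by ULK3 still uses this scheduler. It is ideal for large server workloads, but it has weaknesses with latency-sensitive programs (so-called interactive processes). For that reason, starting with 2.6.23 Linux introduced a scheduler class framework and switched the default to a new scheduler: the Completely Fair Scheduler (CFS). Since the wheel of history keeps turning, this post concentrates on CFS.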

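Processes generally fall into two categories: I/O-bound and compute-bound. An I/O-bound process spends most of its time waiting on I/O requests (not necessarily disk I/O; keyboard and network I/O count too), and most GUI programs are I/O-bound. A compute-bound process is better served by running less often but for longer stretches; encryption and decryption tools and MATLAB are typical examples. A good scheduling policy should deliver both low latency and high throughput, and the Linux scheduler leans toward favoring I/O-bound processes.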

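The Linux kernel implements two separate priority ranges. The first is the nice value, from -20 to +19 with a default of 0; a larger value means a lower priority (you are being more "nice" to the other processes, haha). Nice values are the standard priority range across all Unix systems, and ps -el shows the nice value of each process. The second is the configurable real-time priority, from 0 to 99, where a larger value means a higher priority. Real-time processes always have a higher priority than normal processes. Linux implements real-time priorities according to the POSIX.1b Unix standard. Adding the rtprio field to ps output shows the real-time priority in the RTPRIO column (a value of - means the process is not real-time).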

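Unlike traditional schedulers, the CFS scheduler that is the default in Linux 2.6.34 does not assign each process a fixed timeslice based on its absolute nice value. It has no explicit notion of a timeslice. Instead, the relative difference between nice values is used as a weight that determines what proportion of processor time each process receives. CFS sets a targeted latency as the scheduling period and computes timeslices from those proportions; the smaller this value, the closer the result is to perfect fairness. Suppose the targeted latency is 20 ms and the system has two processes with nice values 0 and 5: the weights give them timeslices of roughly 15 ms and 5 ms. With nice values 10 and 15 the result is still roughly 15 ms and 5 ms, because the relative difference between the nice values has not changed. When there are not too many processes, CFS comes close to perfect fairness. When the number of processes grows very large, even toward infinity, each timeslice would become tiny, so to bound the cost of context switching CFS also defines a minimum granularity as the smallest timeslice any process gets. The default is 1 ms, so even with an unlimited number of processes each one runs for at least 1 ms. As a consequence, CFS is no longer quite so fair when there are very many processes.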
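The 15 ms / 5 ms split can be checked with a small user-space sketch. The weight table below reproduces the values of the kernel's prio_to_weight array as I recall them; the helper names weight_of and slice_ns are made up for this illustration, and the minimum granularity is deliberately ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Load weights for nice -20..19, following the kernel's prio_to_weight
 * table: each nice step changes the weight by a factor of about 1.25. */
static const uint32_t nice_to_weight[40] = {
    88761, 71755, 56483, 46273, 36291,   /* -20 .. -16 */
    29154, 23254, 18705, 14949, 11916,   /* -15 .. -11 */
     9548,  7620,  6100,  4904,  3906,   /* -10 ..  -6 */
     3121,  2501,  1991,  1586,  1277,   /*  -5 ..  -1 */
     1024,   820,   655,   526,   423,   /*   0 ..   4 */
      335,   272,   215,   172,   137,   /*   5 ..   9 */
      110,    87,    70,    56,    45,   /*  10 ..  14 */
       36,    29,    23,    18,    15,   /*  15 ..  19 */
};

/* Hypothetical helper: weight for a nice value in [-20, 19]. */
static uint32_t weight_of(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return nice_to_weight[nice + 20];
}

/* Hypothetical helper: this task's share of one latency period, in ns,
 * given the summed weight of all runnable tasks. */
static uint64_t slice_ns(uint64_t latency_ns, int nice, uint64_t total_weight)
{
    assert(total_weight > 0);
    return latency_ns * weight_of(nice) / total_weight;
}
```

With a 20 ms period, slice_ns(20000000, 0, weight_of(0) + weight_of(5)) comes out a little above 15 ms, and the nice 10 / nice 15 pair gives nearly the same split, since only the ratio of the weights matters.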

1. The CFS scheduler:

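The CFS scheduler is implemented in kernel/sched_fair.c, as briefly mentioned in the previous post on process basics. CFS works with the sched_entity scheduling entity structure, which is embedded in task_struct. Here is the definition of sched_entity from include/linux/sched.h: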

struct sched_entity {
	struct load_weight	load;		/* for load-balancing */
	struct rb_node		run_node;
	struct list_head	group_node;
	unsigned int		on_rq;

	u64			exec_start;
	u64			sum_exec_runtime;
	u64			vruntime;
	u64			prev_sum_exec_runtime;

	u64			last_wakeup;
	u64			avg_overlap;

	u64			nr_migrations;

	u64			start_runtime;
	u64			avg_wakeup;

#ifdef CONFIG_SCHEDSTATS
	u64			wait_start;
	u64			wait_max;
	u64			wait_count;
	u64			wait_sum;
	u64			iowait_count;
	u64			iowait_sum;

	u64			sleep_start;
	u64			sleep_max;
	s64			sum_sleep_runtime;

	u64			block_start;
	u64			block_max;
	u64			exec_max;
	u64			slice_max;

	u64			nr_migrations_cold;
	u64			nr_failed_migrations_affine;
	u64			nr_failed_migrations_running;
	u64			nr_failed_migrations_hot;
	u64			nr_forced_migrations;

	u64			nr_wakeups;
	u64			nr_wakeups_sync;
	u64			nr_wakeups_migrate;
	u64			nr_wakeups_local;
	u64			nr_wakeups_remote;
	u64			nr_wakeups_affine;
	u64			nr_wakeups_affine_attempts;
	u64			nr_wakeups_passive;
	u64			nr_wakeups_idle;
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
	struct sched_entity	*parent;
	/* rq on which this entity is (to be) queued: */
	struct cfs_rq		*cfs_rq;
	/* rq "owned" by this entity/group: */
	struct cfs_rq		*my_q;
#endif
};

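A large part of this structure is only used when CONFIG_SCHEDSTATS is enabled. The vruntime field is the virtual runtime of the process: its actual runtime scaled by its load weight. On an ideal processor with perfect multitasking, every process at the same priority would have the same vruntime. Real processors cannot multitask perfectly, so CFS uses vruntime to record how long each process has run and from that works out how much longer it should run.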

Next is the cfs_rq run queue structure used below, which is defined in kernel/sched.c:

struct cfs_rq {
	struct load_weight load;
	unsigned long nr_running;

	u64 exec_clock;
	u64 min_vruntime;

	struct rb_root tasks_timeline;
	struct rb_node *rb_leftmost;

	struct list_head tasks;
	struct list_head *balance_iterator;

	/*
	 * 'curr' points to currently running entity on this cfs_rq.
	 * It is set to NULL otherwise (i.e when none are currently running).
	 */
	struct sched_entity *curr, *next, *last;

	unsigned int nr_spread_over;

#ifdef CONFIG_FAIR_GROUP_SCHED
	struct rq *rq;	/* cpu runqueue to which this cfs_rq is attached */

	/*
	 * leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
	 * a hierarchy). Non-leaf lrqs hold other higher schedulable entities
	 * (like users, containers etc.)
	 *
	 * leaf_cfs_rq_list ties together list of leaf cfs_rq's in a cpu. This
	 * list is used during load balance.
	 */
	struct list_head leaf_cfs_rq_list;
	struct task_group *tg;	/* group that "owns" this runqueue */

#ifdef CONFIG_SMP
	/*
	 * the part of load.weight contributed by tasks
	 */
	unsigned long task_weight;

	/*
	 *   h_load = weight * f(tg)
	 *
	 * Where f(tg) is the recursive weight fraction assigned to
	 * this group.
	 */
	unsigned long h_load;

	/*
	 * this cpu's part of tg->shares
	 */
	unsigned long shares;

	/*
	 * load.weight at the time we set shares
	 */
	unsigned long rq_weight;
#endif
#endif
};

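The curr field of cfs_rq points to the entity currently running on this queue (NULL if there is none), and the rq field is the CPU run queue this cfs_rq is attached to.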

Here is how the vruntime of a sched_entity is updated in update_curr:

static inline void
__update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
	      unsigned long delta_exec)
{
	unsigned long delta_exec_weighted;

	schedstat_set(curr->exec_max, max((u64)delta_exec, curr->exec_max));

	curr->sum_exec_runtime += delta_exec;
	schedstat_add(cfs_rq, exec_clock, delta_exec);
	delta_exec_weighted = calc_delta_fair(delta_exec, curr);

	curr->vruntime += delta_exec_weighted;
	update_min_vruntime(cfs_rq);
}

static void update_curr(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr;
	u64 now = rq_of(cfs_rq)->clock;
	unsigned long delta_exec;

	if (unlikely(!curr))
		return;

	/*
	 * Get the amount of time the current task was running
	 * since the last time we changed load (this cannot
	 * overflow on 32 bits):
	 */
	delta_exec = (unsigned long)(now - curr->exec_start);
	if (!delta_exec)
		return;

	__update_curr(cfs_rq, curr, delta_exec);
	curr->exec_start = now;

	if (entity_is_task(curr)) {
		struct task_struct *curtask = task_of(curr);

		trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
		cpuacct_charge(curtask, delta_exec);
		account_group_exec_runtime(curtask, delta_exec);
	}
}

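update_curr is called periodically by the system timer, and also whenever a process becomes runnable or unrunnable. update_curr itself calls __update_curr to add the elapsed time to both the real runtime (sum_exec_runtime) and the virtual runtime (vruntime).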
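As a rough user-space illustration of the weighting done by calc_delta_fair, the sketch below scales the elapsed time by NICE_0_LOAD / weight. The struct entity type and the account_runtime function are invented for this example, and the plain division stands in for the fixed-point arithmetic the kernel actually uses.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_LOAD 1024ULL     /* load weight of a nice-0 task */

/* Hypothetical, stripped-down stand-in for struct sched_entity. */
struct entity {
    uint64_t weight;            /* load weight derived from the nice value */
    uint64_t sum_exec_runtime;  /* real CPU time consumed, in ns */
    uint64_t vruntime;          /* weighted (virtual) runtime, in ns */
};

/* Charge delta_exec ns of real time to the entity. A nice-0 task is
 * charged 1:1; a lighter task accumulates vruntime faster, so it drifts
 * to the right of the timeline sooner and gets picked less often. */
static void account_runtime(struct entity *se, uint64_t delta_exec)
{
    assert(se->weight > 0);
    se->sum_exec_runtime += delta_exec;
    se->vruntime += delta_exec * NICE_0_LOAD / se->weight;
}
```

For example, after 1000 ns of real time a nice-0 entity (weight 1024) has gained 1000 ns of vruntime, while a nice-5 entity (weight 335) has gained about three times as much.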

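In practice the vruntime values of different processes are never exactly equal as they would be in the ideal case, so whenever it needs to schedule, CFS picks the process with the smallest vruntime from the run queue. CFS keeps the runnable processes in a red-black tree, which lets it find the process with the smallest vruntime quickly.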

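The red-black tree is called rbtree in Linux. It can store nodes carrying arbitrary data, identified by a specific key. The run_node member of sched_entity is a red-black tree node, and rb_leftmost in cfs_rq is the leftmost node of the tree (cached in cfs_rq so that the tree does not have to be walked). The process with the smallest vruntime sits at that node. If there is no such process (NULL is returned), CFS schedules the idle task.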

Here is how a process is added to the red-black tree:

static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	struct rb_node **link = &cfs_rq->tasks_timeline.rb_node;
	struct rb_node *parent = NULL;
	struct sched_entity *entry;
	s64 key = entity_key(cfs_rq, se);
	int leftmost = 1;

	/*
	 * Find the right place in the rbtree:
	 */
	while (*link) {
		parent = *link;
		entry = rb_entry(parent, struct sched_entity, run_node);
		/*
		 * We dont care about collisions. Nodes with
		 * the same key stay together.
		 */
		if (key < entity_key(cfs_rq, entry)) {
			link = &parent->rb_left;
		} else {
			link = &parent->rb_right;
			leftmost = 0;
		}
	}

	/*
	 * Maintain a cache of leftmost tree entries (it is frequently
	 * used):
	 */
	if (leftmost)
		cfs_rq->rb_leftmost = &se->run_node;

	rb_link_node(&se->run_node, parent, link);
	rb_insert_color(&se->run_node, &cfs_rq->tasks_timeline);
}

static void
enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	/*
	 * Update the normalized vruntime before updating min_vruntime
	 * through callig update_curr().
	 */
	if (!(flags & ENQUEUE_WAKEUP) || (flags & ENQUEUE_MIGRATE))
		se->vruntime += cfs_rq->min_vruntime;

	/*
	 * Update run-time statistics of the 'current'.
	 */
	update_curr(cfs_rq);
	account_entity_enqueue(cfs_rq, se);

	if (flags & ENQUEUE_WAKEUP) {
		place_entity(cfs_rq, se, 0);
		enqueue_sleeper(cfs_rq, se);
	}

	update_stats_enqueue(cfs_rq, se);
	check_spread(cfs_rq, se);
	if (se != cfs_rq->curr)
		__enqueue_entity(cfs_rq, se);
}

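enqueue_entity updates the vruntime of the current process and finally calls __enqueue_entity to add the entity to the red-black tree. __enqueue_entity first walks the tree to find the right position; during the walk it can tell whether the new node will become the leftmost one (it does only if the walk never turned right). It then updates the cached leftmost node in cfs_rq if necessary, links the node in with rb_link_node, and calls rb_insert_color to rebalance the tree.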
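The leftmost-caching trick is easier to see without the rbtree machinery. The following is a minimal sketch on a plain, unbalanced binary search tree; struct node, struct timeline and timeline_insert are names invented here, and the rebalancing step that rb_insert_color performs in the kernel is left out on purpose.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical node: key stands in for vruntime - min_vruntime. */
struct node {
    int64_t key;
    struct node *left, *right;
};

struct timeline {
    struct node *root;
    struct node *leftmost;      /* cached node with the smallest key */
};

/* Insert n, keeping the leftmost cache up to date. Same descent logic as
 * __enqueue_entity, but with no red-black rebalancing afterwards. */
static void timeline_insert(struct timeline *t, struct node *n)
{
    struct node **link = &t->root;
    int leftmost = 1;

    assert(n != NULL);
    n->left = n->right = NULL;

    while (*link) {
        struct node *parent = *link;

        if (n->key < parent->key) {
            link = &parent->left;
        } else {
            link = &parent->right;
            leftmost = 0;       /* turned right once: cannot be the minimum */
        }
    }

    if (leftmost)
        t->leftmost = n;
    *link = n;
}
```

Picking the next entity is then a single pointer read of t->leftmost, which is why the cache is worth maintaining on every insert.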

Right, having seen insertion, here is how a process is removed from the red-black tree:

static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	if (cfs_rq->rb_leftmost == &se->run_node) {
		struct rb_node *next_node;

		next_node = rb_next(&se->run_node);
		cfs_rq->rb_leftmost = next_node;
	}

	rb_erase(&se->run_node, &cfs_rq->tasks_timeline);
}

static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int sleep)
{
	/*
	 * Update run-time statistics of the 'current'.
	 */
	update_curr(cfs_rq);

	update_stats_dequeue(cfs_rq, se);
	if (sleep) {
#ifdef CONFIG_SCHEDSTATS
		if (entity_is_task(se)) {
			struct task_struct *tsk = task_of(se);

			if (tsk->state & TASK_INTERRUPTIBLE)
				se->sleep_start = rq_of(cfs_rq)->clock;
			if (tsk->state & TASK_UNINTERRUPTIBLE)
				se->block_start = rq_of(cfs_rq)->clock;
		}
#endif
	}

	clear_buddies(cfs_rq, se);

	if (se != cfs_rq->curr)
		__dequeue_entity(cfs_rq, se);
	account_entity_dequeue(cfs_rq, se);
	update_min_vruntime(cfs_rq);

	/*
	 * Normalize the entity after updating the min_vruntime because the
	 * update can refer to the ->curr item and we need to reflect this
	 * movement in our normalized position.
	 */
	if (!sleep)
		se->vruntime -= cfs_rq->min_vruntime;
}

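Likewise, dequeue_entity first calls update_curr to update the vruntime of the current process, and the actual removal is done by __dequeue_entity. If the node being removed is the leftmost one, __dequeue_entity must first update the leftmost cache (to the next node returned by rb_next), and then it calls rb_erase to remove the node.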

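The entry point of process scheduling in Linux is the schedule() function, defined in kernel/sched.c. schedule() finds the highest-priority scheduler class that has a runnable process and asks that class which process to run next. Here is the implementation of pick_next_task: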

static const struct sched_class rt_sched_class;

#define sched_class_highest (&rt_sched_class)
#define for_each_class(class) \
   for (class = sched_class_highest; class; class = class->next)

static inline struct task_struct *
pick_next_task(struct rq *rq)
{
	const struct sched_class *class;
	struct task_struct *p;

	/*
	 * Optimization: we know that if all tasks are in
	 * the fair class we can call that function directly:
	 */
	if (likely(rq->nr_running == rq->cfs.nr_running)) {
		p = fair_sched_class.pick_next_task(rq);
		if (likely(p))
			return p;
	}

	class = sched_class_highest;
	for ( ; ; ) {
		p = class->pick_next_task(rq);
		if (p)
			return p;
		/*
		 * Will never be NULL as the idle class always
		 * returns a non-NULL p:
		 */
		class = class->next;
	}
}

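Because normal Linux processes use CFS by default, pick_next_task first checks whether all runnable processes belong to the CFS class; if so it calls the CFS pick_next_task directly and skips the walk over the classes. The definition of sched_class_highest is shown above: it is the RT scheduler class, and for_each_class iterates over the scheduler classes. When no other class has anything to run, the pick_next_task of the idle class returns a valid task_struct. Within CFS, pick_next_task calls pick_next_entity to choose the next process to run.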
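The class walk can be modelled in a few lines of user-space C. This is only a sketch: the struct sched_class_sketch type, the single-slot "queues" and all function names are invented for the illustration, and the CFS fast path from the real pick_next_task is omitted.

```c
#include <assert.h>
#include <stddef.h>

struct task { const char *name; };

/* Hypothetical miniature of struct sched_class: a priority-ordered,
 * singly linked list of classes, each with its own pick function. */
struct sched_class_sketch {
    const struct sched_class_sketch *next;   /* next lower-priority class */
    struct task *(*pick_next_task)(void);
};

static struct task *rt_queue;     /* NULL when no RT task is runnable */
static struct task *fair_queue;   /* NULL when no CFS task is runnable */
static struct task idle_task = { "idle" };

static struct task *pick_rt(void)   { return rt_queue; }
static struct task *pick_fair(void) { return fair_queue; }
static struct task *pick_idle(void) { return &idle_task; }  /* never NULL */

static const struct sched_class_sketch idle_class = { NULL, pick_idle };
static const struct sched_class_sketch fair_class = { &idle_class, pick_fair };
static const struct sched_class_sketch rt_class   = { &fair_class, pick_rt };

/* Walk the classes from highest to lowest priority and return the first
 * task offered. The idle class guarantees the loop finds something. */
static struct task *pick_next(void)
{
    const struct sched_class_sketch *class;

    for (class = &rt_class; class; class = class->next) {
        struct task *p = class->pick_next_task();
        if (p)
            return p;
    }
    return NULL;    /* not reached while the idle class is in the list */
}
```

With both queues empty pick_next() returns the idle task; once fair_queue is set it returns that task, and a task in rt_queue wins over both.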

2. Sleeping and waking up:

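When a process needs to sleep, the kernel handles it roughly as follows: the process marks itself as sleeping, puts itself on a wait queue, and calls schedule() to choose a new process to run; schedule() calls deactivate_task() to take the sleeping process off the run queue, which removes it from the red-black tree of runnable processes. Waking up is the reverse: the process is set runnable, removed from the wait queue, and put back into the red-black tree.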

A wait queue is represented in the kernel by wait_queue_head_t, defined in include/linux/wait.h; it is really a struct __wait_queue_head. The following declarations from that header are used frequently:

typedef struct __wait_queue wait_queue_t;

struct __wait_queue {
	unsigned int flags;
#define WQ_FLAG_EXCLUSIVE	0x01
	void *private;
	wait_queue_func_t func;
	struct list_head task_list;
};

struct __wait_queue_head {
	spinlock_t lock;
	struct list_head task_list;
};
typedef struct __wait_queue_head wait_queue_head_t;

#define __WAITQUEUE_INITIALIZER(name, tsk) {				\
	.private	= tsk,						\
	.func		= default_wake_function,			\
	.task_list	= { NULL, NULL } }

#define DECLARE_WAITQUEUE(name, tsk)					\
	wait_queue_t name = __WAITQUEUE_INITIALIZER(name, tsk)

#define DEFINE_WAIT_FUNC(name, function)				\
	wait_queue_t name = {						\
		.private	= current,				\
		.func		= function,				\
		.task_list	= LIST_HEAD_INIT((name).task_list),	\
	}

#define DEFINE_WAIT(name) DEFINE_WAIT_FUNC(name, autoremove_wake_function)

#define __wait_event(wq, condition) 					\
do {									\
	DEFINE_WAIT(__wait);						\
									\
	for (;;) {							\
		prepare_to_wait(&wq, &__wait, TASK_UNINTERRUPTIBLE);	\
		if (condition)						\
			break;						\
		schedule();						\
	}								\
	finish_wait(&wq, &__wait);					\
} while (0)

#define wait_event(wq, condition) 					\
do {									\
	if (condition)	 						\
		break;							\
	__wait_event(wq, condition);					\
} while (0)

#define __wait_event_timeout(wq, condition, ret)			\
do {									\
	DEFINE_WAIT(__wait);						\
									\
	for (;;) {							\
		prepare_to_wait(&wq, &__wait, TASK_UNINTERRUPTIBLE);	\
		if (condition)						\
			break;						\
		ret = schedule_timeout(ret);				\
		if (!ret)						\
			break;						\
	}								\
	finish_wait(&wq, &__wait);					\
} while (0)

#define wait_event_timeout(wq, condition, timeout)			\
({									\
	long __ret = timeout;						\
	if (!(condition)) 						\
		__wait_event_timeout(wq, condition, __ret);		\
	__ret;								\
})

#define __wait_event_interruptible(wq, condition, ret)			\
do {									\
	DEFINE_WAIT(__wait);						\
									\
	for (;;) {							\
		prepare_to_wait(&wq, &__wait, TASK_INTERRUPTIBLE);	\
		if (condition)						\
			break;						\
		if (!signal_pending(current)) {				\
			schedule();					\
			continue;					\
		}							\
		ret = -ERESTARTSYS;					\
		break;							\
	}								\
	finish_wait(&wq, &__wait);					\
} while (0)

#define wait_event_interruptible(wq, condition)				\
({									\
	int __ret = 0;							\
	if (!(condition))						\
		__wait_event_interruptible(wq, condition, __ret);	\
	__ret;								\
})

#define __wait_event_interruptible_timeout(wq, condition, ret)		\
do {									\
	DEFINE_WAIT(__wait);						\
									\
	for (;;) {							\
		prepare_to_wait(&wq, &__wait, TASK_INTERRUPTIBLE);	\
		if (condition)						\
			break;						\
		if (!signal_pending(current)) {				\
			ret = schedule_timeout(ret);			\
			if (!ret)					\
				break;					\
			continue;					\
		}							\
		ret = -ERESTARTSYS;					\
		break;							\
	}								\
	finish_wait(&wq, &__wait);					\
} while (0)

#define wait_event_interruptible_timeout(wq, condition, timeout)	\
({									\
	long __ret = timeout;						\
	if (!(condition))						\
		__wait_event_interruptible_timeout(wq, condition, __ret); \
	__ret;								\
})

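Note that a wait queue may be modified from interrupt context, so a spinlock must be held before manipulating it. Waiting in real code also has to deal with race conditions, and for that the kernel defines several handy condition-waiting macros that take this work off the caller; they are defined in the same header. The common ones are wait_event, which waits on the queue indefinitely until the condition holds; wait_event_timeout, which adds a timeout and sleeps through schedule_timeout; and wait_event_interruptible, during which the process can still be woken by signals.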

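The kernel's completion mechanism is also built on wait queues and is used to wait for some operation to finish. The task_list member of __wait_queue links the entry into the doubly linked list in __wait_queue_head, and the private member of __wait_queue points to the task_struct of the waiting process. The flags member of __wait_queue has a WQ_FLAG_EXCLUSIVE flag; when it is set, the process wants to be woken exclusively.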

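Wait queues are typically used as follows:

1) A process calls wait_event or one of its variants to go to sleep, handing control back to the scheduler;
2) Somewhere else in the kernel, wake_up() is called to wake the processes on the wait queue.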

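The DEFINE_WAIT macro defines a wait queue entry for the current process and uses autoremove_wake_function as its wake function. The DECLARE_WAITQUEUE macro defines a wait queue entry for a given process and uses default_wake_function. An entry can also be initialized at run time with init_waitqueue_entry (which uses default_wake_function) or init_waitqueue_func_entry.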

Besides calling default_wake_function, autoremove_wake_function also removes the waiting entry from the wait queue. default_wake_function is simple and direct; it just tries to wake the process:

int default_wake_function(wait_queue_t *curr, unsigned mode, int wake_flags,
			  void *key)
{
	return try_to_wake_up(curr->private, mode, wake_flags);
}

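add_wait_queue adds a process to a wait queue. The prepare_to_wait function used by wait_event and the other macros above takes care of acquiring and releasing the spinlock, adding the entry to the queue, and setting the process state. add_wait_queue_exclusive is like add_wait_queue except that it also sets the WQ_FLAG_EXCLUSIVE flag on the entry, and prepare_to_wait_exclusive is the corresponding convenience wrapper.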

A process can be woken with wake_up(), try_to_wake_up(), wake_up_process() and so on. wake_up_process simply calls try_to_wake_up, which wakes one specific process. wake_up, wake_up_interruptible, wake_up_nr, wake_up_all and similar functions all end up in __wake_up, which wakes the processes on a given wait queue and in turn calls __wake_up_common to do the real work.

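wake_up_interruptible is like wake_up except that it skips processes in uninterruptible sleep. wake_up_nr wakes up to a given number of exclusive waiters. wake_up_all wakes every process on the queue, whether or not it is waiting exclusively.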

First, the implementation of __wake_up_common:

static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
			int nr_exclusive, int wake_flags, void *key)
{
	wait_queue_t *curr, *next;

	list_for_each_entry_safe(curr, next, &q->task_list, task_list) {
		unsigned flags = curr->flags;

		if (curr->func(curr, mode, wake_flags, key) &&
				(flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
			break;
	}
}

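The implementation is fairly simple: it walks the wait queue and calls the wake function of each entry in turn (autoremove_wake_function, default_wake_function and the like). Non-exclusive waiters are all woken, and the loop stops once nr_exclusive exclusive waiters have been woken, which is one in the case of wake_up. As described above, the wake functions eventually call try_to_wake_up.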
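The exclusive-wakeup rule is compact enough to be easy to misread, so here is a user-space model of the same loop. Everything in it (struct waiter, WAITER_EXCLUSIVE, wake_one, wake_up_sketch) is invented for illustration; only the shape of the loop condition follows __wake_up_common.

```c
#include <assert.h>
#include <stddef.h>

#define WAITER_EXCLUSIVE 0x01   /* plays the role of WQ_FLAG_EXCLUSIVE */

/* Hypothetical wait queue entry on a singly linked list. */
struct waiter {
    unsigned int flags;
    int awake;                  /* set to 1 once the waiter has been woken */
    struct waiter *next;
};

/* Stand-in wake function: returns 1 only if this call woke the waiter. */
static int wake_one(struct waiter *w)
{
    if (w->awake)
        return 0;
    w->awake = 1;
    return 1;
}

/* Wake every non-exclusive waiter encountered, and stop right after the
 * nr_exclusive-th exclusive waiter has been woken. With nr_exclusive == 0
 * the counter never reaches zero, so the whole list is woken. */
static int wake_up_sketch(struct waiter *head, int nr_exclusive)
{
    int woken = 0;

    for (struct waiter *w = head; w; w = w->next) {
        unsigned int flags = w->flags;
        int ok = wake_one(w);

        woken += ok;
        if (ok && (flags & WAITER_EXCLUSIVE) && !--nr_exclusive)
            break;
    }
    return woken;
}
```

On a list holding two non-exclusive waiters followed by two exclusive ones, wake_up_sketch(head, 1) wakes the first three and leaves the last exclusive waiter asleep; a later call with nr_exclusive set to 0 wakes that one as well.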

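To make it easy for a process to put itself to sleep on a wait queue, the kernel also provides sleep_on, interruptible_sleep_on, sleep_on_timeout and related functions. Note that a process sleeping in sleep_on can be woken by wake_up, a process sleeping in interruptible_sleep_on can be woken by wake_up_interruptible, and a process in interruptible_sleep_on can also have its sleep cut short by a signal. All of these functions end up calling sleep_on_common: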

static long __sched
sleep_on_common(wait_queue_head_t *q, int state, long timeout)
{
	unsigned long flags;
	wait_queue_t wait;

	init_waitqueue_entry(&wait, current);

	__set_current_state(state);

	spin_lock_irqsave(&q->lock, flags);
	__add_wait_queue(q, &wait);
	spin_unlock(&q->lock);
	timeout = schedule_timeout(timeout);
	spin_lock_irq(&q->lock);
	__remove_wait_queue(q, &wait);
	spin_unlock_irqrestore(&q->lock, flags);

	return timeout;
}

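sleep_on_common first initializes a wait queue entry for the current process, then calls __add_wait_queue to put it on the wait queue, then sleeps by calling schedule_timeout. Once it has been woken by wake_up or a similar function, it calls __remove_wait_queue to take the entry off the wait queue again.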

Now for the implementation of try_to_wake_up, the function all wakeups eventually reach:

static int try_to_wake_up(struct task_struct *p, unsigned int state,
			  int wake_flags)
{
	int cpu, orig_cpu, this_cpu, success = 0;
	unsigned long flags;
	struct rq *rq;

	if (!sched_feat(SYNC_WAKEUPS))
		wake_flags &= ~WF_SYNC;

	this_cpu = get_cpu();

	smp_wmb();
	rq = task_rq_lock(p, &flags);
	update_rq_clock(rq);
	if (!(p->state & state))
		goto out;

	if (p->se.on_rq)
		goto out_running;

	cpu = task_cpu(p);
	orig_cpu = cpu;

#ifdef CONFIG_SMP
	if (unlikely(task_running(rq, p)))
		goto out_activate;

	/*
	 * In order to handle concurrent wakeups and release the rq->lock
	 * we put the task in TASK_WAKING state.
	 *
	 * First fix up the nr_uninterruptible count:
	 */
	if (task_contributes_to_load(p))
		rq->nr_uninterruptible--;
	p->state = TASK_WAKING;

	if (p->sched_class->task_waking)
		p->sched_class->task_waking(rq, p);

	__task_rq_unlock(rq);

	cpu = select_task_rq(p, SD_BALANCE_WAKE, wake_flags);
	if (cpu != orig_cpu) {
		/*
		 * Since we migrate the task without holding any rq->lock,
		 * we need to be careful with task_rq_lock(), since that
		 * might end up locking an invalid rq.
		 */
		set_task_cpu(p, cpu);
	}

	rq = cpu_rq(cpu);
	raw_spin_lock(&rq->lock);
	update_rq_clock(rq);

	/*
	 * We migrated the task without holding either rq->lock, however
	 * since the task is not on the task list itself, nobody else
	 * will try and migrate the task, hence the rq should match the
	 * cpu we just moved it to.
	 */
	WARN_ON(task_cpu(p) != cpu);
	WARN_ON(p->state != TASK_WAKING);

#ifdef CONFIG_SCHEDSTATS
	schedstat_inc(rq, ttwu_count);
	if (cpu == this_cpu)
		schedstat_inc(rq, ttwu_local);
	else {
		struct sched_domain *sd;
		for_each_domain(this_cpu, sd) {
			if (cpumask_test_cpu(cpu, sched_domain_span(sd))) {
				schedstat_inc(sd, ttwu_wake_remote);
				break;
			}
		}
	}
#endif /* CONFIG_SCHEDSTATS */

out_activate:
#endif /* CONFIG_SMP */
	schedstat_inc(p, se.nr_wakeups);
	if (wake_flags & WF_SYNC)
		schedstat_inc(p, se.nr_wakeups_sync);
	if (orig_cpu != cpu)
		schedstat_inc(p, se.nr_wakeups_migrate);
	if (cpu == this_cpu)
		schedstat_inc(p, se.nr_wakeups_local);
	else
		schedstat_inc(p, se.nr_wakeups_remote);
	activate_task(rq, p, 1);
	success = 1;

	/*
	 * Only attribute actual wakeups done by this task.
	 */
	if (!in_interrupt()) {
		struct sched_entity *se = &current->se;
		u64 sample = se->sum_exec_runtime;

		if (se->last_wakeup)
			sample -= se->last_wakeup;
		else
			sample -= se->start_runtime;
		update_avg(&se->avg_wakeup, sample);

		se->last_wakeup = se->sum_exec_runtime;
	}

out_running:
	trace_sched_wakeup(rq, p, success);
	check_preempt_curr(rq, p, wake_flags);

	p->state = TASK_RUNNING;
#ifdef CONFIG_SMP
	if (p->sched_class->task_woken)
		p->sched_class->task_woken(rq, p);

	if (unlikely(rq->idle_stamp)) {
		u64 delta = rq->clock - rq->idle_stamp;
		u64 max = 2*sysctl_sched_migration_cost;

		if (delta > max)
			rq->avg_idle = max;
		else
			update_avg(&rq->avg_idle, delta);
		rq->idle_stamp = 0;
	}
#endif
out:
	task_rq_unlock(rq, &flags);
	put_cpu();

	return success;
}

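try_to_wake_up first calls task_rq_lock to disable interrupts and lock the run queue, then calls activate_task to put the process on the run queue. check_preempt_curr lets the newly woken process preempt the current one if appropriate. try_to_wake_up then sets the process state to TASK_RUNNING and finally calls task_rq_unlock to release the run queue lock. The wake_flags parameter of try_to_wake_up can be used to keep the woken process from preempting the current process.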

3. Context switching and preemption

Switching from one process to another in Linux is done by context_switch(), which is also defined in kernel/sched.c. When schedule() selects a new process to run, context_switch() is called.

static inline void
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next)
{
	struct mm_struct *mm, *oldmm;

	prepare_task_switch(rq, prev, next);
	trace_sched_switch(rq, prev, next);
	mm = next->mm;
	oldmm = prev->active_mm;
	/*
	 * For paravirt, this is coupled with an exit in switch_to to
	 * combine the page table reload and the switch backend into
	 * one hypercall.
	 */
	arch_start_context_switch(prev);

	if (likely(!mm)) {
		next->active_mm = oldmm;
		atomic_inc(&oldmm->mm_count);
		enter_lazy_tlb(oldmm, next);
	} else
		switch_mm(oldmm, mm, next);

	if (likely(!prev->mm)) {
		prev->active_mm = NULL;
		rq->prev_mm = oldmm;
	}
	/*
	 * Since the runqueue lock will be released by the next
	 * task (which is an invalid locking op but in the case
	 * of the scheduler it's an obvious special-case), so we
	 * do an early lockdep release here:
	 */
#ifndef __ARCH_WANT_UNLOCKED_CTXSW
	spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
#endif

	/* Here we just switch the register state and the stack. */
	switch_to(prev, next, prev);

	barrier();
	/*
	 * this_rq must be evaluated again because prev may have moved
	 * CPUs since it called schedule(), thus the 'rq' on its stack
	 * frame will be invalid.
	 */
	finish_task_switch(this_rq(), prev);
}

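context_switch first calls switch_mm, if needed, to switch the virtual memory mapping, and then calls switch_to to switch the processor state from the previous process to the next one, which includes saving and restoring the stack and setting the CPU registers. For the implementation of switch_to see arch/x86/include/asm/system.h; you will find it is all platform-specific assembly.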

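Relying only on programs to call schedule() themselves in order to give up the CPU would clearly be unreasonable, because a user process could then run forever, so the kernel has to call schedule() on its own at suitable moments. The kernel uses a per-process need_resched flag (note: in kernels before 2.2 this flag was a global variable) to indicate whether a reschedule is needed; the flag is actually one bit in the flags field of thread_info. When a process should be preempted, scheduler_tick() sets need_resched; when a process with a higher priority than the currently running one is woken, the wakeup path through wake_up(), try_to_wake_up() or wake_up_process() can set need_resched as well. The kernel checks this flag, and if it is set it calls schedule() to switch to a new process. The need_resched() function tests whether the flag is set for the current process.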

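Whenever the kernel is about to return to user space, whether from an interrupt handler or from a system call, it checks the need_resched flag, and if the flag is set it picks a more suitable process to run. This is user preemption.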

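Since 2.6, Linux supports full kernel preemption: a task running in kernel mode can be preempted as long as it does not hold a lock, so locks serve as markers for non-preemptible regions. To support kernel preemption, Linux added a preempt_count counter to the thread_info structure. The counter starts at 0, is incremented each time a lock is acquired and decremented each time a lock is released, so a value of 0 means the task can be preempted. On return from an interrupt to kernel space, the kernel checks need_resched and preempt_count. If need_resched is set and preempt_count is 0, a more important task is waiting and preemption is safe, so the scheduler is invoked. If preempt_count is nonzero, a lock is held and rescheduling would be unsafe, so the interrupt simply returns to the currently executing task. When all locks have been released, the unlock code checks whether need_resched is set and invokes the scheduler if it is. Kernel preemption also occurs when a task in the kernel blocks or calls schedule() explicitly.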
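The interplay between need_resched and preempt_count can be modelled in user space. The sketch below is a toy: the global variables and the functions lock_sketch, unlock_sketch, irq_return_sketch and schedule_stub are all invented names, and real kernels keep these fields per task and per CPU.

```c
#include <assert.h>

static int preempt_count;    /* 0 means the task may be preempted */
static int need_resched;     /* set when a more deserving task is waiting */
static int schedule_calls;   /* how many times the stub scheduler ran */

/* Stand-in for schedule(): just clear the flag and count the call. */
static void schedule_stub(void)
{
    need_resched = 0;
    schedule_calls++;
}

/* Reschedule only if it was requested and no lock is held. */
static void preempt_check(void)
{
    if (need_resched && preempt_count == 0)
        schedule_stub();
}

/* Taking a lock opens a non-preemptible region. */
static void lock_sketch(void)
{
    preempt_count++;
}

/* Dropping the last lock is a preemption point. */
static void unlock_sketch(void)
{
    assert(preempt_count > 0);
    if (--preempt_count == 0)
        preempt_check();
}

/* Returning from an interrupt into kernel code is a preemption point. */
static void irq_return_sketch(void)
{
    preempt_check();
}
```

If need_resched is set while a lock is held, irq_return_sketch() does nothing and the reschedule is deferred until unlock_sketch() brings the count back to 0.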

4. Linux real-time scheduling policies:

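Linux provides two real-time scheduling policies, SCHED_FIFO and SCHED_RR; the normal, non-real-time policy is SCHED_NORMAL. The real-time policies are implemented in kernel/sched_rt.c and their scheduling entity is sched_rt_entity. Both real-time policies use static priorities (0 to 99 by default) to guarantee real-time behavior.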

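SCHED_FIFO, as the name suggests, is a first-in, first-out algorithm without timeslices. A runnable SCHED_FIFO process is always scheduled ahead of any SCHED_NORMAL process. Once it starts running it keeps running until it blocks or gives up the processor, and only a higher-priority SCHED_FIFO or SCHED_RR process can preempt it. Several SCHED_FIFO processes at the same priority take turns, but only when the running one gives up the processor voluntarily. Lower-priority processes cannot run at all until the higher-priority ones become unrunnable.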

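SCHED_RR is like SCHED_FIFO except that each process runs only for a predetermined timeslice; in other words it is SCHED_FIFO with timeslices, a real-time round-robin algorithm. Note that the timeslice is only used to reschedule processes of the same priority: a higher-priority process always preempts a lower-priority one, and a lower-priority process can never preempt a SCHED_RR process, even when that process has used up its timeslice.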
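The difference between the two policies within a single priority level comes down to what happens when the timer says the running task has had its turn. A toy model, with the names rt_policy, prio_queue and slice_expired made up for this sketch and a fixed-size array standing in for the kernel's per-priority list:

```c
#include <assert.h>
#include <stddef.h>

enum rt_policy { POLICY_FIFO, POLICY_RR };   /* illustrative names only */

#define MAX_TASKS 8

/* Hypothetical run list for ONE real-time priority level. */
struct prio_queue {
    int ids[MAX_TASKS];   /* ids[0] is the task that currently runs */
    size_t len;
};

/* Called when a timeslice worth of time has passed for the running task.
 * SCHED_RR: the task moves to the tail so its peers get a turn.
 * SCHED_FIFO: nothing happens; the task runs until it blocks or yields. */
static void slice_expired(struct prio_queue *q, enum rt_policy running_policy)
{
    assert(q->len > 0 && q->len <= MAX_TASKS);

    if (running_policy != POLICY_RR || q->len < 2)
        return;

    int head = q->ids[0];
    for (size_t i = 1; i < q->len; i++)
        q->ids[i - 1] = q->ids[i];
    q->ids[q->len - 1] = head;
}
```

Tasks in lower-priority queues are not represented here at all, which matches the rule that they never get the CPU while this level has a runnable task.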

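The real-time scheduling Linux offers is soft real-time and cannot give the guarantees of a hard real-time system, but its real-time performance is still quite good, and there is now a dedicated Real-Time Linux patch as well; see [here].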

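Linux also provides a good number of system calls for adjusting a process's nice value (nice), changing its scheduling policy (sched_setscheduler), binding it to particular CPUs (sched_setaffinity), yielding the processor (sched_yield) and so on. These are best learned by looking them up when you actually need them.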
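As a small, read-only taste of this interface, the sketch below queries the calling process's policy and nice value and the priority range of SCHED_FIFO. It uses sched_getscheduler, getpriority and sched_get_priority_min/max, which are the query-side companions of the calls named above; the struct sched_info type and the query_sched_info function are invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stddef.h>
#include <sys/resource.h>

/* Hypothetical result record for this example. */
struct sched_info {
    int policy;     /* SCHED_OTHER (user-space name for SCHED_NORMAL), SCHED_FIFO, ... */
    int nice;       /* nice value of the calling process */
    int fifo_min;   /* lowest real-time priority accepted for SCHED_FIFO */
    int fifo_max;   /* highest real-time priority accepted for SCHED_FIFO */
};

/* Fill *out for the calling process. Returns 0 on success, -1 on error. */
static int query_sched_info(struct sched_info *out)
{
    assert(out != NULL);

    out->policy = sched_getscheduler(0);         /* 0 = calling process */
    if (out->policy == -1)
        return -1;

    errno = 0;                                   /* -1 is a legal nice value */
    out->nice = getpriority(PRIO_PROCESS, 0);
    if (out->nice == -1 && errno != 0)
        return -1;

    out->fifo_min = sched_get_priority_min(SCHED_FIFO);
    out->fifo_max = sched_get_priority_max(SCHED_FIFO);
    if (out->fifo_min == -1 || out->fifo_max == -1)
        return -1;

    return 0;
}
```

Actually switching a process to SCHED_FIFO with sched_setscheduler normally requires elevated privileges, which is why this sketch only reads the current settings.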

That should give a rough picture of process scheduling in Linux 2.6.34. Corrections are welcome if you spot any problems ~~~ ^_^
