本文同步自(如浏览不正常请点击跳转):https://zohead.com/archives/linux-kernel-learning-process/
Linux 中进程通过 fork() 被创建时,它差不多是和父进程一样的,它得到父进程的地址空间拷贝,运行和父进程一样的代码,从 fork() 的后面开始执行,父进程和子进程共享代码页,但子进程的 data 页是独立的(包括 stack 和 heap)。
早期的 Linux kernel 并不支持多线程的程序,从 kernel 来看,一个多线程的程序只是一个普通的进程,它的多个执行流应该完全在 user mode 来完成创建、处理、调度等操作,例如使用 POSIX pthread 库。当然这样的实现是无法让人满意的,Linux 为此使用轻量级进程为多线程程序提供更好的支持,两个轻量级进程可以共享资源(例如:地址空间、打开的文件等等),一个比较简单的方法是将为每个线程关联一个轻量级进程,这样每个线程可以被 kernel 单独调度,使用 Linux 轻量级进程的库有:LinuxThreads、NPTL、NGPT 等。Linux kernel 同时也支持线程组(可以理解为轻量级进程组)的概念。
1、进程描述符:
进程描述符由 task_struct 结构来表示,一般来说,每个可以被独立调度的执行上下文都必须有自己的进程描述符,因此尽管轻量级进程共享了很大一部分 kernel 数据结构,它也必须有自己的 task_struct。task_struct 中包含关于一个进程的差不多所有信息,它定义在 include/linux/sched.h 文件中,你会看到这是非常大的结构,其中还包含指向其它结构的指针。访问进程自身的 task_struct 结构,使用宏操作 current。
task_struct 中的 struct mm_struct *mm 即指向进程的地址空间。task_struct 的 state 字段表示进程的运行状态,取值有 TASK_RUNNING(正在运行或正在队列中等待运行,进程如果在用户空间只能为此状态)、TASK_INTERRUPTIBLE(可响应信号)、TASK_UNINTERRUPTIBLE(不响应信号)、TASK_STOPED 等,另外 state 还有特殊的两个值是 EXIT_ZOMBIE(僵尸进程) 和 EXIT_DEAD(进程将被系统移除)。kernel 提供 set_task_state 宏修改进程状态,set_task_state 最终调用 set_mb,set_current_state 用于当前进程的状态。task_struct 的 pid 字段就是咱们喜闻乐见的进程 ID 了。
这是一个典型的 Linux 进程状态机图:
POSIX 1003.1c 标准规定一个多线程程序的每个线程都应该有相同的 PID,这样的好处是例如发一个信号给一个 PID,一个线程组里的所有线程都能收到。同一线程组中的线程有相同的线程组号(Thread Group ID),线程组组号放在 task_struct 的 tgid 成员变量中,一般是线程组里的第一个轻量级进程的 PID。特别需要注意 getpid() 系统调用返回的就是 tgid 的值,而不是 pid 值,这样一个多线程程序的所有线程可以共享一个 PID。
对每个进程,kernel 在通过 slab 分配器分配 task_struct 时,通常是实际分配了两个连续的物理页面(8KB),以 thread_union 联合表示,其中包括一个 thread_info 结构(其 task 成员是指向 task_struct 的指针)以及 kernel 模式的进程堆栈。esp CPU 堆栈指针即表示此进程堆栈的栈顶地址,进程从用户模式切换到 kernel 模式时,kernel 堆栈会被清空。为了效率考虑,kernel 会将这两个连续的物理页面的第一个页面按 2^13(也就是 8KB) 对齐,为了避免内存较少时产生问题,kernel 提供配置选项(就是下面的 THREAD_SIZE 了)可以将 thread_info 和堆栈包含在一个页面也就是 4KB 的内存区域里。一般来说,8KB 的堆栈对于内核程序已经够用。
看看 Linux 2.6.34 中 thread_union 的定义:
union thread_union {
	struct thread_info thread_info;
	unsigned long stack[THREAD_SIZE/sizeof(long)];
};
由于 thread_info 和内核堆栈是合并在连续的页面里的,kernel 就可以从 esp 指针得到 thread_info 结构地址,这是通过 current_thread_info 函数来实现的。
/* how to get the current stack pointer from C */
register unsigned long current_stack_pointer asm("esp") __used;
/* how to get the thread information struct from C */
static inline struct thread_info *current_thread_info(void)
{
	return (struct thread_info *)
		(current_stack_pointer & ~(THREAD_SIZE - 1));
}
假设 thread_union 是 8KB 大小,也即 2^13,将 esp 的最低 13 位屏蔽掉即可得到 thread_info 的地址,如果是 4KB 的栈大小,屏蔽掉最低 12 位即可(和上面的代码一致),这样通过 current_thread_info()->task 就能得到当前的 task_struct,这就是 current 宏的实现了。
#define get_current() (current_thread_info()->task) #define current get_current()
系统中进程的列表保存在 init_task 所在的双向链表中,task_struct 的 tasks 字段就是 list_head,init_task 表示的就是 PID 为 0 的 swapper 进程(或者叫 idle 进程),其 tasks 会依次指向下一个 task_struct,PID 为 1 的进程就是 init 进程,这两个进程都由 kernel 来创建。
而关于可以运行的进程的调度,Linux 2.6.34 和 ULK 上说的已经有很大的不同了。2.6.34 上加上了 struct sched_class 结构体表示不同类型的调度算法类,目前 2.6.34 上实现了三种:Completely Fair Scheduling (CFS) Class(完全公平算法,见 kernel/sched_fair.c)、Real-Time Scheduling Class(实时算法,见 kernel/sched_rt.c)和 idle-task scheduling class(见 kernel/sched_idletask.c),这三个源文件都被 include 在 kernel/sched.c 中进行编译了。CFS Class 使用 sched_entity 结构作为调度实体,其中包含权重、运行时间等信息,比 RT Class 复杂,其中还有专门的红黑树。RT Class 使用 sched_rt_entity 作为调度实体。
每个 task_struct 中都包含了 sched_entity 和 sched_rt_entity 这两个字段,sched_class 中则有 enqueue_task、dequeue_task 等函数指针指向对应调度算法中的实现函数,enqueue_task 将进程加入运行队列,dequeue_task 将进程从队列中移除,由于这段变化较大而且比较复杂,有关这三种调度算法的具体实现以后再来介绍了。
task_struct 的 real_parent 字段指向创建该进程的进程(如果父进程已不存在则为 init 进程),parent 指向当前进程的父进程,children 为该进程子进程列表,sibling 为该进程的兄弟进程列表,group_leader 字段指向该进程的线程组长。与 ULK 不同的是,ULK 中 ptrace_children 为被调试器 trace 的该进程的子进程列表,2.6.34 中 ptraced 字段包含该进程原本的子进程和 ptrace attach 的目标进程,ptrace_list 改为 ptrace_entry。另外 2.6.34 kernel 中已经引入 namespace 的概念,获得进程组 ID 和会话期 ID 的方式也于 ULK 中的有不少区别。
kernel 中进程的 PID 散列表存在 pid_hash 中以加快根据 PID 搜索 task_struct 的速度,pidhash_init 函数初始化此 PID 散列表,由于 2.6.34 中已有 namespace,pid_hashfn 也由原来的一个参数变为两个参数(增加一个 ns 参数表示哪个 namespace)。Linux kernel 也增加了 pid 和 upid 两个结构体,pid 是内核对进程 PID 的内部表示(惟一的),upid 是进程在特定的 namespace 中看到的 PID。
2、进程创建:
Linux 中使用 fork() 函数创建新进程,父进程的地址空间会复制给子进程,为了效率考虑,这个复制通过 COW(Copy-on-write) 来实现,真正有写操作时才会复制。fork()、vfork()、__clone() 函数都是通过 clone() 系统调用来实现的,clone() 系统调用最终调用 do_fork()。需要注意的是 vfork() 的结果是子进程完全运行在父进程的地址空间上,父进程的页表项并不会被拷贝,而且子进程优先运行,父进程会一直阻塞直至子进程结束(调用 exec 执行新程序或者 _exit 退出,不可调用 exit 退出)。do_fork 函数定义在 kernel/fork.c 文件中,do_fork() 会再调用 copy_process() 函数。
可以看到 copy_process 是相当长的一个函数:
static struct task_struct *copy_process(unsigned long clone_flags,
					unsigned long stack_start,
					struct pt_regs *regs,
					unsigned long stack_size,
					int __user *child_tidptr,
					struct pid *pid,
					int trace)
{
	int retval;
	struct task_struct *p;
	int cgroup_callbacks_done = 0;
	if ((clone_flags & (CLONE_NEWNS|CLONE_FS)) == (CLONE_NEWNS|CLONE_FS))
		return ERR_PTR(-EINVAL);
	/*
	 * Thread groups must share signals as well, and detached threads
	 * can only be started up within the thread group.
	 */
	if ((clone_flags & CLONE_THREAD) && !(clone_flags & CLONE_SIGHAND))
		return ERR_PTR(-EINVAL);
	/*
	 * Shared signal handlers imply shared VM. By way of the above,
	 * thread groups also imply shared VM. Blocking this case allows
	 * for various simplifications in other code.
	 */
	if ((clone_flags & CLONE_SIGHAND) && !(clone_flags & CLONE_VM))
		return ERR_PTR(-EINVAL);
	/*
	 * Siblings of global init remain as zombies on exit since they are
	 * not reaped by their parent (swapper). To solve this and to avoid
	 * multi-rooted process trees, prevent global and container-inits
	 * from creating siblings.
	 */
	if ((clone_flags & CLONE_PARENT) &&
				current->signal->flags & SIGNAL_UNKILLABLE)
		return ERR_PTR(-EINVAL);
	retval = security_task_create(clone_flags);
	if (retval)
		goto fork_out;
	retval = -ENOMEM;
	p = dup_task_struct(current);
	if (!p)
		goto fork_out;
	ftrace_graph_init_task(p);
	rt_mutex_init_task(p);
#ifdef CONFIG_PROVE_LOCKING
	DEBUG_LOCKS_WARN_ON(!p->hardirqs_enabled);
	DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
#endif
	retval = -EAGAIN;
	if (atomic_read(&p->real_cred->user->processes) >=
			task_rlimit(p, RLIMIT_NPROC)) {
		if (!capable(CAP_SYS_ADMIN) && !capable(CAP_SYS_RESOURCE) &&
		    p->real_cred->user != INIT_USER)
			goto bad_fork_free;
	}
	retval = copy_creds(p, clone_flags);
	if (retval < 0)
		goto bad_fork_free;
	/*
	 * If multiple threads are within copy_process(), then this check
	 * triggers too late. This doesn't hurt, the check is only there
	 * to stop root fork bombs.
	 */
	retval = -EAGAIN;
	if (nr_threads >= max_threads)
		goto bad_fork_cleanup_count;
	if (!try_module_get(task_thread_info(p)->exec_domain->module))
		goto bad_fork_cleanup_count;
	p->did_exec = 0;
	delayacct_tsk_init(p);	/* Must remain after dup_task_struct() */
	copy_flags(clone_flags, p);
	INIT_LIST_HEAD(&p->children);
	INIT_LIST_HEAD(&p->sibling);
	rcu_copy_process(p);
	p->vfork_done = NULL;
	spin_lock_init(&p->alloc_lock);
	init_sigpending(&p->pending);
	p->utime = cputime_zero;
	p->stime = cputime_zero;
	p->gtime = cputime_zero;
	p->utimescaled = cputime_zero;
	p->stimescaled = cputime_zero;
#ifndef CONFIG_VIRT_CPU_ACCOUNTING
	p->prev_utime = cputime_zero;
	p->prev_stime = cputime_zero;
#endif
#if defined(SPLIT_RSS_COUNTING)
	memset(&p->rss_stat, 0, sizeof(p->rss_stat));
#endif
	p->default_timer_slack_ns = current->timer_slack_ns;
	task_io_accounting_init(&p->ioac);
	acct_clear_integrals(p);
	posix_cpu_timers_init(p);
	p->lock_depth = -1;		/* -1 = no lock */
	do_posix_clock_monotonic_gettime(&p->start_time);
	p->real_start_time = p->start_time;
	monotonic_to_bootbased(&p->real_start_time);
	p->io_context = NULL;
	p->audit_context = NULL;
	cgroup_fork(p);
#ifdef CONFIG_NUMA
	p->mempolicy = mpol_dup(p->mempolicy);
 	if (IS_ERR(p->mempolicy)) {
 		retval = PTR_ERR(p->mempolicy);
 		p->mempolicy = NULL;
 		goto bad_fork_cleanup_cgroup;
 	}
	mpol_fix_fork_child_flag(p);
#endif
#ifdef CONFIG_TRACE_IRQFLAGS
	p->irq_events = 0;
#ifdef __ARCH_WANT_INTERRUPTS_ON_CTXSW
	p->hardirqs_enabled = 1;
#else
	p->hardirqs_enabled = 0;
#endif
	p->hardirq_enable_ip = 0;
	p->hardirq_enable_event = 0;
	p->hardirq_disable_ip = _THIS_IP_;
	p->hardirq_disable_event = 0;
	p->softirqs_enabled = 1;
	p->softirq_enable_ip = _THIS_IP_;
	p->softirq_enable_event = 0;
	p->softirq_disable_ip = 0;
	p->softirq_disable_event = 0;
	p->hardirq_context = 0;
	p->softirq_context = 0;
#endif
#ifdef CONFIG_LOCKDEP
	p->lockdep_depth = 0; /* no locks held yet */
	p->curr_chain_key = 0;
	p->lockdep_recursion = 0;
#endif
#ifdef CONFIG_DEBUG_MUTEXES
	p->blocked_on = NULL; /* not blocked yet */
#endif
#ifdef CONFIG_CGROUP_MEM_RES_CTLR
	p->memcg_batch.do_batch = 0;
	p->memcg_batch.memcg = NULL;
#endif
	p->bts = NULL;
	/* Perform scheduler related setup. Assign this task to a CPU. */
	sched_fork(p, clone_flags);
	retval = perf_event_init_task(p);
	if (retval)
		goto bad_fork_cleanup_policy;
	if ((retval = audit_alloc(p)))
		goto bad_fork_cleanup_policy;
	/* copy all the process information */
	if ((retval = copy_semundo(clone_flags, p)))
		goto bad_fork_cleanup_audit;
	if ((retval = copy_files(clone_flags, p)))
		goto bad_fork_cleanup_semundo;
	if ((retval = copy_fs(clone_flags, p)))
		goto bad_fork_cleanup_files;
	if ((retval = copy_sighand(clone_flags, p)))
		goto bad_fork_cleanup_fs;
	if ((retval = copy_signal(clone_flags, p)))
		goto bad_fork_cleanup_sighand;
	if ((retval = copy_mm(clone_flags, p)))
		goto bad_fork_cleanup_signal;
	if ((retval = copy_namespaces(clone_flags, p)))
		goto bad_fork_cleanup_mm;
	if ((retval = copy_io(clone_flags, p)))
		goto bad_fork_cleanup_namespaces;
	retval = copy_thread(clone_flags, stack_start, stack_size, p, regs);
	if (retval)
		goto bad_fork_cleanup_io;
	if (pid != &init_struct_pid) {
		retval = -ENOMEM;
		pid = alloc_pid(p->nsproxy->pid_ns);
		if (!pid)
			goto bad_fork_cleanup_io;
		if (clone_flags & CLONE_NEWPID) {
			retval = pid_ns_prepare_proc(p->nsproxy->pid_ns);
			if (retval < 0)
				goto bad_fork_free_pid;
		}
	}
	p->pid = pid_nr(pid);
	p->tgid = p->pid;
	if (clone_flags & CLONE_THREAD)
		p->tgid = current->tgid;
	if (current->nsproxy != p->nsproxy) {
		retval = ns_cgroup_clone(p, pid);
		if (retval)
			goto bad_fork_free_pid;
	}
	p->set_child_tid = (clone_flags & CLONE_CHILD_SETTID) ? child_tidptr : NULL;
	/*
	 * Clear TID on mm_release()?
	 */
	p->clear_child_tid = (clone_flags & CLONE_CHILD_CLEARTID) ? child_tidptr: NULL;
#ifdef CONFIG_FUTEX
	p->robust_list = NULL;
#ifdef CONFIG_COMPAT
	p->compat_robust_list = NULL;
#endif
	INIT_LIST_HEAD(&p->pi_state_list);
	p->pi_state_cache = NULL;
#endif
	/*
	 * sigaltstack should be cleared when sharing the same VM
	 */
	if ((clone_flags & (CLONE_VM|CLONE_VFORK)) == CLONE_VM)
		p->sas_ss_sp = p->sas_ss_size = 0;
	/*
	 * Syscall tracing and stepping should be turned off in the
	 * child regardless of CLONE_PTRACE.
	 */
	user_disable_single_step(p);
	clear_tsk_thread_flag(p, TIF_SYSCALL_TRACE);
#ifdef TIF_SYSCALL_EMU
	clear_tsk_thread_flag(p, TIF_SYSCALL_EMU);
#endif
	clear_all_latency_tracing(p);
	/* ok, now we should be set up.. */
	p->exit_signal = (clone_flags & CLONE_THREAD) ? -1 : (clone_flags & CSIGNAL);
	p->pdeath_signal = 0;
	p->exit_state = 0;
	/*
	 * Ok, make it visible to the rest of the system.
	 * We dont wake it up yet.
	 */
	p->group_leader = p;
	INIT_LIST_HEAD(&p->thread_group);
	/* Now that the task is set up, run cgroup callbacks if
	 * necessary. We need to run them before the task is visible
	 * on the tasklist. */
	cgroup_fork_callbacks(p);
	cgroup_callbacks_done = 1;
	/* Need tasklist lock for parent etc handling! */
	write_lock_irq(&tasklist_lock);
	/* CLONE_PARENT re-uses the old parent */
	if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) {
		p->real_parent = current->real_parent;
		p->parent_exec_id = current->parent_exec_id;
	} else {
		p->real_parent = current;
		p->parent_exec_id = current->self_exec_id;
	}
	spin_lock(¤t->sighand->siglock);
	/*
	 * Process group and session signals need to be delivered to just the
	 * parent before the fork or both the parent and the child after the
	 * fork. Restart if a signal comes in before we add the new process to
	 * it's process group.
	 * A fatal signal pending means that current will exit, so the new
	 * thread can't slip out of an OOM kill (or normal SIGKILL).
 	 */
	recalc_sigpending();
	if (signal_pending(current)) {
		spin_unlock(¤t->sighand->siglock);
		write_unlock_irq(&tasklist_lock);
		retval = -ERESTARTNOINTR;
		goto bad_fork_free_pid;
	}
	if (clone_flags & CLONE_THREAD) {
		atomic_inc(¤t->signal->count);
		atomic_inc(¤t->signal->live);
		p->group_leader = current->group_leader;
		list_add_tail_rcu(&p->thread_group, &p->group_leader->thread_group);
	}
	if (likely(p->pid)) {
		tracehook_finish_clone(p, clone_flags, trace);
		if (thread_group_leader(p)) {
			if (clone_flags & CLONE_NEWPID)
				p->nsproxy->pid_ns->child_reaper = p;
			p->signal->leader_pid = pid;
			tty_kref_put(p->signal->tty);
			p->signal->tty = tty_kref_get(current->signal->tty);
			attach_pid(p, PIDTYPE_PGID, task_pgrp(current));
			attach_pid(p, PIDTYPE_SID, task_session(current));
			list_add_tail(&p->sibling, &p->real_parent->children);
			list_add_tail_rcu(&p->tasks, &init_task.tasks);
			__get_cpu_var(process_counts)++;
		}
		attach_pid(p, PIDTYPE_PID, pid);
		nr_threads++;
	}
	total_forks++;
	spin_unlock(¤t->sighand->siglock);
	write_unlock_irq(&tasklist_lock);
	proc_fork_connector(p);
	cgroup_post_fork(p);
	perf_event_fork(p);
	return p;
bad_fork_free_pid:
	if (pid != &init_struct_pid)
		free_pid(pid);
bad_fork_cleanup_io:
	if (p->io_context)
		exit_io_context(p);
bad_fork_cleanup_namespaces:
	exit_task_namespaces(p);
bad_fork_cleanup_mm:
	if (p->mm)
		mmput(p->mm);
bad_fork_cleanup_signal:
	if (!(clone_flags & CLONE_THREAD))
		__cleanup_signal(p->signal);
bad_fork_cleanup_sighand:
	__cleanup_sighand(p->sighand);
bad_fork_cleanup_fs:
	exit_fs(p); /* blocking */
bad_fork_cleanup_files:
	exit_files(p); /* blocking */
bad_fork_cleanup_semundo:
	exit_sem(p);
bad_fork_cleanup_audit:
	audit_free(p);
bad_fork_cleanup_policy:
	perf_event_free_task(p);
#ifdef CONFIG_NUMA
	mpol_put(p->mempolicy);
bad_fork_cleanup_cgroup:
#endif
	cgroup_exit(p, cgroup_callbacks_done);
	delayacct_tsk_free(p);
	module_put(task_thread_info(p)->exec_domain->module);
bad_fork_cleanup_count:
	atomic_dec(&p->cred->user->processes);
	exit_creds(p);
bad_fork_free:
	free_task(p);
fork_out:
	return ERR_PTR(retval);
}
/*
 *  Ok, this is the main fork-routine.
 *
 * It copies the process, and if successful kick-starts
 * it and waits for it to finish using the VM if required.
 */
long do_fork(unsigned long clone_flags,
	      unsigned long stack_start,
	      struct pt_regs *regs,
	      unsigned long stack_size,
	      int __user *parent_tidptr,
	      int __user *child_tidptr)
{
	struct task_struct *p;
	int trace = 0;
	long nr;
	/*
	 * Do some preliminary argument and permissions checking before we
	 * actually start allocating stuff
	 */
	if (clone_flags & CLONE_NEWUSER) {
		if (clone_flags & CLONE_THREAD)
			return -EINVAL;
		/* hopefully this check will go away when userns support is
		 * complete
		 */
		if (!capable(CAP_SYS_ADMIN) || !capable(CAP_SETUID) ||
				!capable(CAP_SETGID))
			return -EPERM;
	}
	/*
	 * We hope to recycle these flags after 2.6.26
	 */
	if (unlikely(clone_flags & CLONE_STOPPED)) {
		static int __read_mostly count = 100;
		if (count > 0 && printk_ratelimit()) {
			char comm[TASK_COMM_LEN];
			count--;
			printk(KERN_INFO "fork(): process `%s' used deprecated "
					"clone flags 0x%lx\n",
				get_task_comm(comm, current),
				clone_flags & CLONE_STOPPED);
		}
	}
	/*
	 * When called from kernel_thread, don't do user tracing stuff.
	 */
	if (likely(user_mode(regs)))
		trace = tracehook_prepare_clone(clone_flags);
	p = copy_process(clone_flags, stack_start, regs, stack_size,
			 child_tidptr, NULL, trace);
	/*
	 * Do this prior waking up the new thread - the thread pointer
	 * might get invalid after that point, if the thread exits quickly.
	 */
	if (!IS_ERR(p)) {
		struct completion vfork;
		trace_sched_process_fork(current, p);
		nr = task_pid_vnr(p);
		if (clone_flags & CLONE_PARENT_SETTID)
			put_user(nr, parent_tidptr);
		if (clone_flags & CLONE_VFORK) {
			p->vfork_done = &vfork;
			init_completion(&vfork);
		}
		audit_finish_fork(p);
		tracehook_report_clone(regs, clone_flags, nr, p);
		/*
		 * We set PF_STARTING at creation in case tracing wants to
		 * use this to distinguish a fully live task from one that
		 * hasn't gotten to tracehook_report_clone() yet.  Now we
		 * clear it and set the child going.
		 */
		p->flags &= ~PF_STARTING;
		if (unlikely(clone_flags & CLONE_STOPPED)) {
			/*
			 * We'll start up with an immediate SIGSTOP.
			 */
			sigaddset(&p->pending.signal, SIGSTOP);
			set_tsk_thread_flag(p, TIF_SIGPENDING);
			__set_task_state(p, TASK_STOPPED);
		} else {
			wake_up_new_task(p, clone_flags);
		}
		tracehook_report_clone_complete(trace, regs,
						clone_flags, nr, p);
		if (clone_flags & CLONE_VFORK) {
			freezer_do_not_count();
			wait_for_completion(&vfork);
			freezer_count();
			tracehook_report_vfork_done(p, nr);
		}
	} else {
		nr = PTR_ERR(p);
	}
	return nr;
}
copy_process 中会调用 dup_task_struct 先复制 task_struct 进程描述符,并做一些必要的改动,默认会去掉 flags 字段也即描述符标志中的PF_SUPERPRIV 值,表示进程没有使用超级用户权限,默认设置了PF_FORKNOEXEC 表示还没有执行 exec,设置了 PF_STARTING 标志表示进程正在被创建。copy_process 还会调用 copy_creds 复制进程凭证,这个以后再来专门研究,调用 task_io_accounting_init 初始化 I/O 统计,调用 cgroup_fork 将此进程加到父 cgroups 中,cgroups 是新 kernel 加入的一个比较重要的分组控制的机制,也是依赖 namespace 来实现的,有关 cgroups 以后将专门写一篇文章来介绍。
copy_process 然后会调用 sched_fork 为新创建的进程配置调度器,其中会将进程状态设为 TASK_WAKING 保证没人能运行它或者用信号之类将此新进程加入运行队列,并会调用 set_task_cpu 为进程选择一个空闲的 CPU 来运行,sched_fork 相关函数在 kernel/sched.c 中。
copy_process 然后会根据 clone_flags 来调用 copy_files、copy_fs 等来复制文件描述符、文件系统信息、信号处理函数等,copy_process 中调用 alloc_pid 给新进程分配 PID,copy_process 最终返回一个新的 task_struct。do_fork 最终会通过 task_pid_vnr 来返回子进程的 PID。
fork() 之后 Linux kernel 主观上会调用 wake_up_new_task 让子进程先运行,这样避免父进程有写地址空间的改动而可以减少 COW 的次数,但实际运行中并不一定完全会这样。
当使用 vfork() 函数创建进程时,task_struct 的 vfork_done 指向一个特定的地址,kernel 通过 init_completion 初始化等待,从上面的代码也可以看到父进程会一直调用 wait_for_completion 等待直到子进程通过 vfork_done 指针通知它,每个进程退出地址空间时(exec 或者 exit) mm_release() 函数中会检查 vfork_done 如果不为 NULL 父进程就能收到信号,这样就可以保证 vfork() 时必须是子进程先运行。
实际上由于 COW 的存在,vfork() 相对 fork() 惟一的好处只在于少了页表项的拷贝,现在也已经有了测试版的 COW 页表项 patch,这个就以后再抽空测试了。
3、Linux线程实现:
Linux kernel 的线程实现是比较特别的,它没有专门的线程概念,所有线程只是标准的进程,线程只是一个与其它进程共享资源的进程,这与 Windows、Solaris 等操作系统有明显的不同。
Linux 中线程的创建也是通过 clone() 系统调用来实现,只是增加了特别的参数:
clone(CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND, 0);
上面的结果是地址空间、文件系统资源、文件描述符、信号处理都被共享了。
普通的 fork() 的调用方式为:
clone(SIGCHLD, 0);
vfork() 函数的调用方式为:
clone(CLONE_VFORK | CLONE_VM | SIGCHLD, 0);
这样明显就能看到区别了。
另外来专门说下 kernel thread,kernel thread 是一种只在内核空间存在的进程,它不会发生上下文切换到用户空间,它也是可被调度和可被抢占的,它与普通进程的区别在于它没有地址空间,它的 task_struct 的 mm 指针(即进程地址空间)为 NULL。常见的 kernel thread 有 flush、ksoftirqd 等。
可以使用 kthread_create 函数创建 kernel thread,此函数定义在 kernel/kthread.c 中,此函数即返回 task_struct 指针,kernel thread 创建之后默认为 TASK_INTERRUPTIBLE 状态,需要使用 wake_up_process 来唤醒它,因此可以使用 kthread_run 直接创建并运行一个 kernel thread。kthread_stop 用于停止 kernel thread 执行。
看看 kthread_create 的代码:
struct task_struct *kthread_create(int (*threadfn)(void *data),
				   void *data,
				   const char namefmt[],
				   ...)
{
	struct kthread_create_info create;
	create.threadfn = threadfn;
	create.data = data;
	init_completion(&create.done);
	spin_lock(&kthread_create_lock);
	list_add_tail(&create.list, &kthread_create_list);
	spin_unlock(&kthread_create_lock);
	wake_up_process(kthreadd_task);
	wait_for_completion(&create.done);
	if (!IS_ERR(create.result)) {
		struct sched_param param = { .sched_priority = 0 };
		va_list args;
		va_start(args, namefmt);
		vsnprintf(create.result->comm, sizeof(create.result->comm),
			  namefmt, args);
		va_end(args);
		/*
		 * root may have changed our (kthreadd's) priority or CPU mask.
		 * The kernel thread should not inherit these properties.
		 */
		sched_setscheduler_nocheck(create.result, SCHED_NORMAL, ¶m);
		set_cpus_allowed_ptr(create.result, cpu_all_mask);
	}
	return create.result;
}
这时就要请出 PID 为 2 的 kthreadd 进程出场了,kernel thread 的创建是由 kthreadd 完成的。kthread_create 中会先把创建的信息 kthread_create_info 加到 kthread_create_list 链接中,然后唤醒 kthreadd 进程(其进程描述符保存在 kthreadd_task 全局变量中),并使用 wait_for_completion 等待 kthreadd 创建过程完成。
再看看 kthreadd 的实现:
int kthreadd(void *unused)
{
	struct task_struct *tsk = current;
	/* Setup a clean context for our children to inherit. */
	set_task_comm(tsk, "kthreadd");
	ignore_signals(tsk);
	set_cpus_allowed_ptr(tsk, cpu_all_mask);
	set_mems_allowed(node_states[N_HIGH_MEMORY]);
	current->flags |= PF_NOFREEZE | PF_FREEZER_NOSIG;
	for (;;) {
		set_current_state(TASK_INTERRUPTIBLE);
		if (list_empty(&kthread_create_list))
			schedule();
		__set_current_state(TASK_RUNNING);
		spin_lock(&kthread_create_lock);
		while (!list_empty(&kthread_create_list)) {
			struct kthread_create_info *create;
			create = list_entry(kthread_create_list.next,
					    struct kthread_create_info, list);
			list_del_init(&create->list);
			spin_unlock(&kthread_create_lock);
			create_kthread(create);
			spin_lock(&kthread_create_lock);
		}
		spin_unlock(&kthread_create_lock);
	}
	return 0;
}
kthreadd 在循环中检查 kthread_create_list 链表,会找到刚才的 kthread_create_info 结构,并将之从 kthread_create_list 链表中删除,然后调用 create_kthread 最终完成创建。
OK,既然都分析这么多了就再来 create_kthread:
static int kthread(void *_create)
{
	/* Copy data: it's on kthread's stack */
	struct kthread_create_info *create = _create;
	int (*threadfn)(void *data) = create->threadfn;
	void *data = create->data;
	struct kthread self;
	int ret;
	self.should_stop = 0;
	init_completion(&self.exited);
	current->vfork_done = &self.exited;
	/* OK, tell user we're spawned, wait for stop or wakeup */
	__set_current_state(TASK_UNINTERRUPTIBLE);
	create->result = current;
	complete(&create->done);
	schedule();
	ret = -EINTR;
	if (!self.should_stop)
		ret = threadfn(data);
	/* we can't just return, we must preserve "self" on stack */
	do_exit(ret);
}
static void create_kthread(struct kthread_create_info *create)
{
	int pid;
	/* We want our own signal handler (we take no signals by default). */
	pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES | SIGCHLD);
	if (pid < 0) {
		create->result = ERR_PTR(pid);
		complete(&create->done);
	}
}
create_kthread 会调用 kernel_thread 创建新的进程,而且它的入口函数是 kthread 函数,参数为 create。kthread 中把自己设为 TASK_UNINTERRUPTIBLE 状态,并用 complete 告诉 kthread_create 创建好了,接着它调用 schedule() 函数使其所在进程进入睡眠状态,如果被唤醒(可以使用 wait_up_process 之类的了)就执行创建 kthread thread 时指定的入口函数。
而 kernel_thread 的实现则是平台相关的,它会调用 do_fork 创建新的进程,看看 x86 上的实现:
int kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
{
	struct pt_regs regs;
	memset(®s, 0, sizeof(regs));
	regs.si = (unsigned long) fn;
	regs.di = (unsigned long) arg;
#ifdef CONFIG_X86_32
	regs.ds = __USER_DS;
	regs.es = __USER_DS;
	regs.fs = __KERNEL_PERCPU;
	regs.gs = __KERNEL_STACK_CANARY;
#else
	regs.ss = __KERNEL_DS;
#endif
	regs.orig_ax = -1;
	regs.ip = (unsigned long) kernel_thread_helper;
	regs.cs = __KERNEL_CS | get_kernel_rpl();
	regs.flags = X86_EFLAGS_IF | 0x2;
	/* Ok, create the new process.. */
	return do_fork(flags | CLONE_VM | CLONE_UNTRACED, 0, ®s, 0, NULL, NULL);
}
由此也可以看到在 kernel 中创建 kernel thread 有两种方法:kthread_create 和 kernel_thread,它们的区别是:kthread_create 创建的 kernel thread 有完整的上下文环境,其父进程一定为 kthreadd,而 kernel_thread 的父进程可以为 init 或其它 kernel thread。
4、进程结束:
进程可以使用 exit 函数主动结束自己,也可以在收到信号或异常时非主动结束。不管怎样结束,最终都由 do_exit() 函数来完成,此函数定义在 kernel/exit.c 中,看看它的实现:
static void exit_notify(struct task_struct *tsk, int group_dead)
{
	int signal;
	void *cookie;
	/*
	 * This does two things:
	 *
  	 * A.  Make init inherit all the child processes
	 * B.  Check to see if any process groups have become orphaned
	 *	as a result of our exiting, and if they have any stopped
	 *	jobs, send them a SIGHUP and then a SIGCONT.  (POSIX 3.2.2.2)
	 */
	forget_original_parent(tsk);
	exit_task_namespaces(tsk);
	write_lock_irq(&tasklist_lock);
	if (group_dead)
		kill_orphaned_pgrp(tsk->group_leader, NULL);
	/* Let father know we died
	 *
	 * Thread signals are configurable, but you aren't going to use
	 * that to send signals to arbitary processes.
	 * That stops right now.
	 *
	 * If the parent exec id doesn't match the exec id we saved
	 * when we started then we know the parent has changed security
	 * domain.
	 *
	 * If our self_exec id doesn't match our parent_exec_id then
	 * we have changed execution domain as these two values started
	 * the same after a fork.
	 */
	if (tsk->exit_signal != SIGCHLD && !task_detached(tsk) &&
	    (tsk->parent_exec_id != tsk->real_parent->self_exec_id ||
	     tsk->self_exec_id != tsk->parent_exec_id))
		tsk->exit_signal = SIGCHLD;
	signal = tracehook_notify_death(tsk, &cookie, group_dead);
	if (signal >= 0)
		signal = do_notify_parent(tsk, signal);
	tsk->exit_state = signal == DEATH_REAP ? EXIT_DEAD : EXIT_ZOMBIE;
	/* mt-exec, de_thread() is waiting for us */
	if (thread_group_leader(tsk) &&
	    tsk->signal->group_exit_task &&
	    tsk->signal->notify_count < 0)
		wake_up_process(tsk->signal->group_exit_task);
	write_unlock_irq(&tasklist_lock);
	tracehook_report_death(tsk, signal, cookie, group_dead);
	/* If the process is dead, release it - nobody will wait for it */
	if (signal == DEATH_REAP)
		release_task(tsk);
}
NORET_TYPE void do_exit(long code)
{
	struct task_struct *tsk = current;
	int group_dead;
	profile_task_exit(tsk);
	WARN_ON(atomic_read(&tsk->fs_excl));
	if (unlikely(in_interrupt()))
		panic("Aiee, killing interrupt handler!");
	if (unlikely(!tsk->pid))
		panic("Attempted to kill the idle task!");
	tracehook_report_exit(&code);
	validate_creds_for_do_exit(tsk);
	/*
	 * We're taking recursive faults here in do_exit. Safest is to just
	 * leave this task alone and wait for reboot.
	 */
	if (unlikely(tsk->flags & PF_EXITING)) {
		printk(KERN_ALERT
			"Fixing recursive fault but reboot is needed!\n");
		/*
		 * We can do this unlocked here. The futex code uses
		 * this flag just to verify whether the pi state
		 * cleanup has been done or not. In the worst case it
		 * loops once more. We pretend that the cleanup was
		 * done as there is no way to return. Either the
		 * OWNER_DIED bit is set by now or we push the blocked
		 * task into the wait for ever nirwana as well.
		 */
		tsk->flags |= PF_EXITPIDONE;
		set_current_state(TASK_UNINTERRUPTIBLE);
		schedule();
	}
	exit_irq_thread();
	exit_signals(tsk);  /* sets PF_EXITING */
	/*
	 * tsk->flags are checked in the futex code to protect against
	 * an exiting task cleaning up the robust pi futexes.
	 */
	smp_mb();
	raw_spin_unlock_wait(&tsk->pi_lock);
	if (unlikely(in_atomic()))
		printk(KERN_INFO "note: %s[%d] exited with preempt_count %d\n",
				current->comm, task_pid_nr(current),
				preempt_count());
	acct_update_integrals(tsk);
	/* sync mm's RSS info before statistics gathering */
	if (tsk->mm)
		sync_mm_rss(tsk, tsk->mm);
	group_dead = atomic_dec_and_test(&tsk->signal->live);
	if (group_dead) {
		hrtimer_cancel(&tsk->signal->real_timer);
		exit_itimers(tsk->signal);
		if (tsk->mm)
			setmax_mm_hiwater_rss(&tsk->signal->maxrss, tsk->mm);
	}
	acct_collect(code, group_dead);
	if (group_dead)
		tty_audit_exit();
	if (unlikely(tsk->audit_context))
		audit_free(tsk);
	tsk->exit_code = code;
	taskstats_exit(tsk, group_dead);
	exit_mm(tsk);
	if (group_dead)
		acct_process();
	trace_sched_process_exit(tsk);
	exit_sem(tsk);
	exit_files(tsk);
	exit_fs(tsk);
	check_stack_usage();
	exit_thread();
	cgroup_exit(tsk, 1);
	if (group_dead)
		disassociate_ctty(1);
	module_put(task_thread_info(tsk)->exec_domain->module);
	proc_exit_connector(tsk);
	/*
	 * FIXME: do that only when needed, using sched_exit tracepoint
	 */
	flush_ptrace_hw_breakpoint(tsk);
	/*
	 * Flush inherited counters to the parent - before the parent
	 * gets woken up by child-exit notifications.
	 */
	perf_event_exit_task(tsk);
	exit_notify(tsk, group_dead);
#ifdef CONFIG_NUMA
	mpol_put(tsk->mempolicy);
	tsk->mempolicy = NULL;
#endif
#ifdef CONFIG_FUTEX
	if (unlikely(current->pi_state_cache))
		kfree(current->pi_state_cache);
#endif
	/*
	 * Make sure we are holding no locks:
	 */
	debug_check_no_locks_held(tsk);
	/*
	 * We can do this unlocked here. The futex code uses this flag
	 * just to verify whether the pi state cleanup has been done
	 * or not. In the worst case it loops once more.
	 */
	tsk->flags |= PF_EXITPIDONE;
	if (tsk->io_context)
		exit_io_context(tsk);
	if (tsk->splice_pipe)
		__free_pipe_info(tsk->splice_pipe);
	validate_creds_for_do_exit(tsk);
	preempt_disable();
	exit_rcu();
	/* causes final put_task_struct in finish_task_switch(). */
	tsk->state = TASK_DEAD;
	schedule();
	BUG();
	/* Avoid "noreturn function does return".  */
	for (;;)
		cpu_relax();	/* For when BUG is null */
}
do_exit 会将 task_struct 的 flags 字段设为 PF_EXITING(在 exit_signals 中设置),调用 acct_update_integrals 更新进程统计信息,调用 exit_mm 释放进程使用的地址空间(mm_struct),调用 exit_sem、exit_files、exit_fs 释放等待的信号量,递减文件描述符和文件系统的引用计数,task_struct 的 exit_code 被设置为对应的退出值,调用 exit_notify 通知进程的父进程,而在 exit_notify 函数中将此进程的子进程的父进程改为线程组中的另一个线程或者 init 进程,并将 exit_state 标记为 EXIT_ZOMBIE(需要父进程得到退出状态了),最终进程 state 状态被设置为 TASK_DEAD,然后调用 schedule() 以切换到新的进程上。
do_exit 完成之后,进程的描述符和 thread_info 都还是存在的,只是进程是 EXIT_ZOMBIE 状态而且不可运行,如果父进程通过 wait4 等系统调用处理完此进程的状态,此进程的 task_struct 和 thread_info 就会通过调用 release_task() 被最终释放,看看它的实现:
void release_task(struct task_struct * p)
{
	struct task_struct *leader;
	int zap_leader;
repeat:
	tracehook_prepare_release_task(p);
	/* don't need to get the RCU readlock here - the process is dead and
	 * can't be modifying its own credentials. But shut RCU-lockdep up */
	rcu_read_lock();
	atomic_dec(&__task_cred(p)->user->processes);
	rcu_read_unlock();
	proc_flush_task(p);
	write_lock_irq(&tasklist_lock);
	tracehook_finish_release_task(p);
	__exit_signal(p);
	/*
	 * If we are the last non-leader member of the thread
	 * group, and the leader is zombie, then notify the
	 * group leader's parent process. (if it wants notification.)
	 */
	zap_leader = 0;
	leader = p->group_leader;
	if (leader != p && thread_group_empty(leader) && leader->exit_state == EXIT_ZOMBIE) {
		BUG_ON(task_detached(leader));
		do_notify_parent(leader, leader->exit_signal);
		/*
		 * If we were the last child thread and the leader has
		 * exited already, and the leader's parent ignores SIGCHLD,
		 * then we are the one who should release the leader.
		 *
		 * do_notify_parent() will have marked it self-reaping in
		 * that case.
		 */
		zap_leader = task_detached(leader);
		/*
		 * This maintains the invariant that release_task()
		 * only runs on a task in EXIT_DEAD, just for sanity.
		 */
		if (zap_leader)
			leader->exit_state = EXIT_DEAD;
	}
	write_unlock_irq(&tasklist_lock);
	release_thread(p);
	call_rcu(&p->rcu, delayed_put_task_struct);
	p = leader;
	if (unlikely(zap_leader))
		goto repeat;
}
它先调用 __exit_signal,其中会调用 __unhash_process,__unhash_process 调用 detach_pid 将 PID 从上面说的进程 PID hash 表(pid_hash)中删除。release_task 最终会调用 delayed_put_task_struct,其中再调用 put_task_struct,put_task_struct 再分别调用 free_thread_info 和 free_task_struct 来释放 thread_info 和 task_struct。
需要注意的是如果父进程在子进程之前退出,这时需要一个机制设置子进程的新父进程,不然子进程退出时将一直处于僵尸状态。因此进程退出时需要调用 exit_notify 进行通知处理,exit_notify 调用 forget_original_parent,其中再调用 find_new_reaper(意思不错,找到新收割者 ^_^) 来设置新父进程,看看它的实现:
static struct task_struct *find_new_reaper(struct task_struct *father)
{
	struct pid_namespace *pid_ns = task_active_pid_ns(father);
	struct task_struct *thread;
	thread = father;
	while_each_thread(father, thread) {
		if (thread->flags & PF_EXITING)
			continue;
		if (unlikely(pid_ns->child_reaper == father))
			pid_ns->child_reaper = thread;
		return thread;
	}
	if (unlikely(pid_ns->child_reaper == father)) {
		write_unlock_irq(&tasklist_lock);
		if (unlikely(pid_ns == &init_pid_ns))
			panic("Attempted to kill init!");
		zap_pid_ns_processes(pid_ns);
		write_lock_irq(&tasklist_lock);
		/*
		 * We can not clear ->child_reaper or leave it alone.
		 * There may by stealth EXIT_DEAD tasks on ->children,
		 * forget_original_parent() must move them somewhere.
		 */
		pid_ns->child_reaper = init_pid_ns.child_reaper;
	}
	return pid_ns->child_reaper;
}
它先用 while_each_thread 在线程组中找一个进程,找到就直接返回,找不到就返回 init 进程了,init 进程会自动调用 wait() 来等待它的子进程。
至此 Linux kernel 中的进程基本实现大概了解了,将了进程和线程概念、进程创建和退出等做了代码上的分析介绍,下面就是专门的进程调度了,有任何问题欢迎指正哦~~~ ^_^
