本文同步自(如浏览不正常请点击跳转):https://zohead.com/archives/linux-kernel-learning-block-layer/
Linux 内核中的 block I/O 层又是非常重要的一个概念,它相对字符设备的实现来说复杂很多,而且在现今应用中,block 层可以说是随处可见,下面分别介绍 kernel block I/O 层的一些知识,你需要对块设备、字符设备的区别清楚,而且对 kernel 基础有一些了解哦。
1、buffer_head 的概念:
buffer_head 是 block 层中一个常见的数据结构(当然和下面的 bio 之类的结构相比就差多了哦,HOHO)。
当块设备中的一个块(一般为扇区大小的整数倍,并不超过一个内存 page 的大小)通过读写等方式存放在内存中,一般被称为存在 buffer 中,每个 buffer 和一个块相关联,它就表示在内存中的磁盘块。kernel 因此需要有相关的控制信息来表示块数据,每个块与一个描述符相关联,这个描述符就被称为 buffer head,并用 struct buffer_head 来表示,其定义在 <linux/buffer_head.h> 头文件中。
enum bh_state_bits { BH_Uptodate, /* Contains valid data */ BH_Dirty, /* Is dirty */ BH_Lock, /* Is locked */ BH_Req, /* Has been submitted for I/O */ BH_Uptodate_Lock,/* Used by the first bh in a page, to serialise * IO completion of other buffers in the page */ BH_Mapped, /* Has a disk mapping */ BH_New, /* Disk mapping was newly created by get_block */ BH_Async_Read, /* Is under end_buffer_async_read I/O */ BH_Async_Write, /* Is under end_buffer_async_write I/O */ BH_Delay, /* Buffer is not yet allocated on disk */ BH_Boundary, /* Block is followed by a discontiguity */ BH_Write_EIO, /* I/O error on write */ BH_Ordered, /* ordered write */ BH_Eopnotsupp, /* operation not supported (barrier) */ BH_Unwritten, /* Buffer is allocated on disk but not written */ BH_Quiet, /* Buffer Error Prinks to be quiet */ BH_PrivateStart,/* not a state bit, but the first bit available * for private allocation by other entities */ }; struct buffer_head { unsigned long b_state; /* buffer state bitmap (see above) */ struct buffer_head *b_this_page;/* circular list of page's buffers */ struct page *b_page; /* the page this bh is mapped to */ sector_t b_blocknr; /* start block number */ size_t b_size; /* size of mapping */ char *b_data; /* pointer to data within the page */ struct block_device *b_bdev; bh_end_io_t *b_end_io; /* I/O completion */ void *b_private; /* reserved for b_end_io */ struct list_head b_assoc_buffers; /* associated with another mapping */ struct address_space *b_assoc_map; /* mapping this buffer is associated with */ atomic_t b_count; /* users using this buffer_head */ };
b_state 字段说明这段 buffer 的状态,它可以是 bh_state_bits 联合(也在上面的代码中,注释说明状态,应该比较好明白哦)中的一个或多个与值。b_count 为 buffer 的引用计数,它通过 get_bh、put_bh 函数进行原子性的增加和减小,需要操作 buffer_head 时调用 get_bh,完成之后调用 put_bh。b_bdev 表示关联的块设备,下面会单独介绍 block_device 结构,b_blocknr 表示在 b_bdev 块设备上 buffer 所关联的块的起始地址。b_page 指向的内存页即为 buffer 所映射的页。b_data 为指向块的指针(在 b_page 中),并且长度为 b_size。
在 Linux 2.6 版本以前,buffer_head 是 kernel 中非常重要的数据结构,它曾经是 kernel 中 I/O 的基本单位(现在已经是 bio 结构),它曾被用于为一个块映射一个页,它被用于描述磁盘块到物理页的映射关系,所有的 block I/O 操作也包含在 buffer_head 中。但是这样也会引起比较大的问题:buffer_head 结构过大(现在已经缩减了很多),用 buffer head 来操作 I/O 数据太复杂,kernel 更喜欢根据 page 来工作(这样性能也更好);另一个问题是一个大的 buffer_head 常被用来描述单独的 buffer,而且 buffer 还很可能比一个页还小,这样就会造成效率低下;第三个问题是 buffer_head 只能描述一个 buffer,这样大块的 I/O 操作常被分散为很多个 buffer_head,这样会增加额外占用的空间。因此 2.6 开始的 kernel (实际 2.5 测试版的 kernel 中已经开始引入)使用 bio 结构直接处理 page 和地址空间,而不是 buffer。
2、bio:
说了一堆 buffer_head 的坏话,现在来看看它的替代者:bio,它倾向于为 I/O 请求提供一个轻量级的表示方法,它定义在 <linux/bio.h> 头文件中。
struct bio { sector_t bi_sector; /* device address in 512 byte sectors */ struct bio *bi_next; /* request queue link */ struct block_device *bi_bdev; unsigned long bi_flags; /* status, command, etc */ unsigned long bi_rw; /* bottom bits READ/WRITE, * top bits priority */ unsigned short bi_vcnt; /* how many bio_vec's */ unsigned short bi_idx; /* current index into bvl_vec */ /* Number of segments in this BIO after * physical address coalescing is performed. */ unsigned int bi_phys_segments; unsigned int bi_size; /* residual I/O count */ /* * To keep track of the max segment size, we account for the * sizes of the first and last mergeable segments in this bio. */ unsigned int bi_seg_front_size; unsigned int bi_seg_back_size; unsigned int bi_max_vecs; /* max bvl_vecs we can hold */ unsigned int bi_comp_cpu; /* completion CPU */ atomic_t bi_cnt; /* pin count */ struct bio_vec *bi_io_vec; /* the actual vec list */ bio_end_io_t *bi_end_io; void *bi_private; #if defined(CONFIG_BLK_DEV_INTEGRITY) struct bio_integrity_payload *bi_integrity; /* data integrity */ #endif bio_destructor_t *bi_destructor; /* destructor */ /* * We can inline a number of vecs at the end of the bio, to avoid * double allocations for a small number of bio_vecs. This member * MUST obviously be kept at the very end of the bio. */ struct bio_vec bi_inline_vecs[0]; }; struct bio_vec { struct page *bv_page; unsigned int bv_len; unsigned int bv_offset; };
该定义中已经有详细的注释了哦。bi_sector 为以 512 字节为单位的扇区地址(即使物理设备的扇区大小不是 512 字节,bi_sector 也以 512 字节为单位)。bi_bdev 为关联的块设备。bi_rw 表示为读请求还是写请求。bi_cnt 为引用计数,通过 bio_get、bio_put 宏可以对 bi_cnt 进行增加和减小操作。当 bi_cnt 值为 0 时,bio 结构就被销毁并且后端的内存也被释放。
I/O 向量:
bio 结构中最重要的是 bi_vcnt、bi_idx、bi_io_vec 等成员,bi_vcnt 为 bi_io_vec 所指向的 bio_vec 类型列表个数,bi_io_vec 表示指定的 block I/O 操作中的单独的段(如果你用过 readv 和 writev 函数那应该对这个比较熟悉),bi_idx 为当前在 bi_io_vec 数组中的索引,随着 block I/O 操作的进行,bi_idx 值被不断更新,kernel 提供 bio_for_each_segment 宏用于遍历 bio 中的 bio_vec。另外 kernel 中的 MD 软件 RAID 驱动也会使用 bi_idx 值来将一个 bio 请求分发到不同的磁盘设备上进行处理。
bio_vec 的定义也在上面的代码中,同样在 <linux/bio.h> 头文件中,每个 bio_vec 类型指向对应的 page,bv_page 表示它所在的页,bv_offset 为块相对于 page 的偏移量,bv_len 即为块的长度。
buffer_head 和 bio 总结:
因此也可以看出 block I/O 请求是以 I/O 向量的形式进行提交和处理的。
bio 相对 buffer_head 的好处有:bio 可以更方便的使用高端内存,因为它只与 page 打交道,并不直接使用地址。bio 可以表示 direct I/O(不经过 page cache,后面再详细描述)。对向量形式的 I/O(包括 sg I/O) 支持更好,防止 I/O 被打散。但是 buffer_head 还是需要的,它用于映射磁盘块到内存,因为 bio 中并没有包含 kernel 需要的 buffer 状态的成员以及一些其它信息。
3、请求队列:
块设备使用请求队列来保存等待中的 block I/O 请求,其使用 request_queue 结构来表示,定义在 <linux/blkdev.h> 头文件中,此头文件中还包含了非常重要的 request 结构:
struct request { struct list_head queuelist; struct call_single_data csd; struct request_queue *q; unsigned int cmd_flags; enum rq_cmd_type_bits cmd_type; unsigned long atomic_flags; int cpu; /* the following two fields are internal, NEVER access directly */ unsigned int __data_len; /* total data len */ sector_t __sector; /* sector cursor */ struct bio *bio; struct bio *biotail; struct hlist_node hash; /* merge hash */ /* * The rb_node is only used inside the io scheduler, requests * are pruned when moved to the dispatch queue. So let the * completion_data share space with the rb_node. */ union { struct rb_node rb_node; /* sort/lookup */ void *completion_data; }; /* * two pointers are available for the IO schedulers, if they need * more they have to dynamically allocate it. */ void *elevator_private; void *elevator_private2; struct gendisk *rq_disk; unsigned long start_time; /* Number of scatter-gather DMA addr+len pairs after * physical address coalescing is performed. */ unsigned short nr_phys_segments; unsigned short ioprio; int ref_count; void *special; /* opaque pointer available for LLD use */ char *buffer; /* kaddr of the current segment if available */ int tag; int errors; /* * when request is used as a packet command carrier */ unsigned char __cmd[BLK_MAX_CDB]; unsigned char *cmd; unsigned short cmd_len; unsigned int extra_len; /* length of alignment and padding */ unsigned int sense_len; unsigned int resid_len; /* residual count */ void *sense; unsigned long deadline; struct list_head timeout_list; unsigned int timeout; int retries; /* * completion callback. */ rq_end_io_fn *end_io; void *end_io_data; /* for bidi */ struct request *next_rq; }; struct request_queue { /* * Together with queue_head for cacheline sharing */ struct list_head queue_head; struct request *last_merge; struct elevator_queue *elevator; /* * the queue request freelist, one for reads and one for writes */ struct request_list rq; request_fn_proc *request_fn; make_request_fn *make_request_fn; prep_rq_fn *prep_rq_fn; unplug_fn *unplug_fn; merge_bvec_fn *merge_bvec_fn; prepare_flush_fn *prepare_flush_fn; softirq_done_fn *softirq_done_fn; rq_timed_out_fn *rq_timed_out_fn; dma_drain_needed_fn *dma_drain_needed; lld_busy_fn *lld_busy_fn; /* * Dispatch queue sorting */ sector_t end_sector; struct request *boundary_rq; /* * Auto-unplugging state */ struct timer_list unplug_timer; int unplug_thresh; /* After this many requests */ unsigned long unplug_delay; /* After this many jiffies */ struct work_struct unplug_work; struct backing_dev_info backing_dev_info; /* * The queue owner gets to use this for whatever they like. * ll_rw_blk doesn't touch it. */ void *queuedata; /* * queue needs bounce pages for pages above this limit */ gfp_t bounce_gfp; /* * various queue flags, see QUEUE_* below */ unsigned long queue_flags; /* * protects queue structures from reentrancy. ->__queue_lock should * _never_ be used directly, it is queue private. always use * ->queue_lock. */ spinlock_t __queue_lock; spinlock_t *queue_lock; /* * queue kobject */ struct kobject kobj; /* * queue settings */ unsigned long nr_requests; /* Max # of requests */ unsigned int nr_congestion_on; unsigned int nr_congestion_off; unsigned int nr_batching; void *dma_drain_buffer; unsigned int dma_drain_size; unsigned int dma_pad_mask; unsigned int dma_alignment; struct blk_queue_tag *queue_tags; struct list_head tag_busy_list; unsigned int nr_sorted; unsigned int in_flight[2]; unsigned int rq_timeout; struct timer_list timeout; struct list_head timeout_list; struct queue_limits limits; /* * sg stuff */ unsigned int sg_timeout; unsigned int sg_reserved_size; int node; #ifdef CONFIG_BLK_DEV_IO_TRACE struct blk_trace *blk_trace; #endif /* * reserved for flush operations */ unsigned int ordered, next_ordered, ordseq; int orderr, ordcolor; struct request pre_flush_rq, bar_rq, post_flush_rq; struct request *orig_bar_rq; struct mutex sysfs_lock; #if defined(CONFIG_BLK_DEV_BSG) struct bsg_class_device bsg_dev; #endif };
request_queue 中的很多成员和 I/O 调度器、request、bio 等息息相关。request_queue 中的 queue_head 成员为请求的双向链表。nr_requests 为请求的数量。I/O 请求被文件系统等上层的代码加入到队列中(需要经过 I/O 调度器,下面会介绍),只要队列不为空,block 设备驱动程序就需要从队列中抓取请求并提交到对应的块设备中。这个队列中的就是单独的请求,以 request 结构来表示。
每个 request 结构又可以由多个 bio 组成,一个 request 中放着顺序排列的 bio(请求在多个连续的磁盘块上)。
实际上在 request_queue 中,只有当请求队列有一定数目的请求时,I/O 调度算法才能发挥作用,否则极端情况下它将退化成 “先来先服务算法”,这就悲催了。通过对 request_queue 进行 plug 操作相当于停用,unplug 相当于恢复。请求少时将request_queue 停用,当请求达到一定数目,或者 request_queue 里最 “老” 的请求已经等待一段时间了才将 request_queue 恢复,这些见 request_queue 中的 unplug_fn、unplug_timer、unplug_thresh、unplug_delay 等成员。
4、I/O 调度器:
I/O 调度器也是 block 层的大头,它肩负着非常重要的使命。由于现在的机械硬盘设备的寻道是非常慢的(常常是毫秒级),因此尽可能的减少寻道操作是提高性能的关键所在。一般 I/O 调度器要做的事情就是在完成现有请求的前提下,让磁头尽可能少移动,从而提高磁盘的读写效率。最有名的就是 “电梯算法” 了。
由于 I/O 调度器的存在,kernel 并不会按实际收到的顺序将请求发到底层设备上,而是经过了合并(减少请求数量和寻道,如果无法合并将请求放在队列尾部)和排序处理(类似电梯的减少往返寻道的处理,也是 I/O 调度器被称为 elevators 的原因)。I/O 调度器就是来管理块设备的请求队列的,它来决定队列中请求的顺序,以及每个请求什么时候到派遣到块设备上。
现在的 Linux kernel 中已经有几种好用的 I/O 调度器,常见的包括 Linus(2.4 版本中的调度器)、cfq(很多发行版中的默认调度器)、deadline、noop、anticipatory(相对 deadline 的优化) 等。
Linus 调度器同时实现了合并和排序处理,而且是 front merging(新请求在当前的前面) 和 back merging(新请求在当前的后面,当然比 front merging 常见) 都支持的,并且有一定的请求时限处理。
deadline 调度器主要解决 Linus 调度器导致的请求饥饿问题(不能及时有效的被处理),deadline 调度器保证请求的开始服务时间。另外 deadline 解决了写请求(一般为异步处理)使读请求(一般为同步处理)不能被及时处理的问题,也就是解决读延迟。
noop 调度器几乎保持原始请求顺序不变(仍然有合并),而 cfq 则提供类似完全公平的调度策略。
总之不同的 I/O 调度器通常是对于特定类型的请求进行优化的,有关这些调度器的具体实现,之后将专门写文章来介绍它们,这里就不会熬述咯。
I/O 调度的一些数据结构声明在 <linux/elevator.h> 头文件中,比较重要的包括 elevator_ops、elevator_type 以及 elevator_queue 等。elevator_ops 中定义了 I/O 调度算法的各种操作函数接口。
struct elevator_ops { elevator_merge_fn *elevator_merge_fn; elevator_merged_fn *elevator_merged_fn; elevator_merge_req_fn *elevator_merge_req_fn; elevator_allow_merge_fn *elevator_allow_merge_fn; elevator_dispatch_fn *elevator_dispatch_fn; elevator_add_req_fn *elevator_add_req_fn; elevator_activate_req_fn *elevator_activate_req_fn; elevator_deactivate_req_fn *elevator_deactivate_req_fn; elevator_queue_empty_fn *elevator_queue_empty_fn; elevator_completed_req_fn *elevator_completed_req_fn; elevator_request_list_fn *elevator_former_req_fn; elevator_request_list_fn *elevator_latter_req_fn; elevator_set_req_fn *elevator_set_req_fn; elevator_put_req_fn *elevator_put_req_fn; elevator_may_queue_fn *elevator_may_queue_fn; elevator_init_fn *elevator_init_fn; elevator_exit_fn *elevator_exit_fn; void (*trim)(struct io_context *); }; struct elevator_type { struct list_head list; struct elevator_ops ops; struct elv_fs_entry *elevator_attrs; char elevator_name[ELV_NAME_MAX]; struct module *elevator_owner; }; struct elevator_queue { struct elevator_ops *ops; void *elevator_data; struct kobject kobj; struct elevator_type *elevator_type; struct mutex sysfs_lock; struct hlist_head *hash; };
elevator_type 用来描述不同的 I/O 调度器,你可以在 request_queue 的声明中看到 elevator_queue 的身影。
5、块设备请求处理:
当需要发起块设备读写请求时,kernel 首先根据需求构造 bio 结构(毕竟是 I/O 请求单位哦),其中包含了读写的地址、长度、设备、回调函数等信息,然后 kernel 通过 submit_bio 函数将请求转发给块设备,看看 submit_bio 的实现(在 block/blk-core.c 中,下面几个非常重要的函数也在这个关键的 block 层实现文件中):
void submit_bio(int rw, struct bio *bio) { int count = bio_sectors(bio); bio->bi_rw |= rw; /* * If it's a regular read/write or a barrier with data attached, * go through the normal accounting stuff before submission. */ if (bio_has_data(bio)) { if (rw & WRITE) { count_vm_events(PGPGOUT, count); } else { task_io_account_read(bio->bi_size); count_vm_events(PGPGIN, count); } if (unlikely(block_dump)) { char b[BDEVNAME_SIZE]; printk(KERN_DEBUG "%s(%d): %s block %Lu on %s\n", current->comm, task_pid_nr(current), (rw & WRITE) ? "WRITE" : "READ", (unsigned long long)bio->bi_sector, bdevname(bio->bi_bdev, b)); } } generic_make_request(bio); }
submit_bio 的输入参数为 bio 结构,submit_bio 最终会调用 generic_make_request 函数不断转发 bio 请求:
static inline void __generic_make_request(struct bio *bio) { struct request_queue *q; sector_t old_sector; int ret, nr_sectors = bio_sectors(bio); dev_t old_dev; int err = -EIO; might_sleep(); if (bio_check_eod(bio, nr_sectors)) goto end_io; /* * Resolve the mapping until finished. (drivers are * still free to implement/resolve their own stacking * by explicitly returning 0) * * NOTE: we don't repeat the blk_size check for each new device. * Stacking drivers are expected to know what they are doing. */ old_sector = -1; old_dev = 0; do { char b[BDEVNAME_SIZE]; q = bdev_get_queue(bio->bi_bdev); if (unlikely(!q)) { printk(KERN_ERR "generic_make_request: Trying to access " "nonexistent block-device %s (%Lu)\n", bdevname(bio->bi_bdev, b), (long long) bio->bi_sector); goto end_io; } if (unlikely(!bio_rw_flagged(bio, BIO_RW_DISCARD) && nr_sectors > queue_max_hw_sectors(q))) { printk(KERN_ERR "bio too big device %s (%u > %u)\n", bdevname(bio->bi_bdev, b), bio_sectors(bio), queue_max_hw_sectors(q)); goto end_io; } if (unlikely(test_bit(QUEUE_FLAG_DEAD, &q->queue_flags))) goto end_io; if (should_fail_request(bio)) goto end_io; /* * If this device has partitions, remap block n * of partition p to block n+start(p) of the disk. */ blk_partition_remap(bio); if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) goto end_io; if (old_sector != -1) trace_block_remap(q, bio, old_dev, old_sector); old_sector = bio->bi_sector; old_dev = bio->bi_bdev->bd_dev; if (bio_check_eod(bio, nr_sectors)) goto end_io; if (bio_rw_flagged(bio, BIO_RW_DISCARD) && !blk_queue_discard(q)) { err = -EOPNOTSUPP; goto end_io; } trace_block_bio_queue(q, bio); ret = q->make_request_fn(q, bio); } while (ret); return; end_io: bio_endio(bio, err); } void generic_make_request(struct bio *bio) { struct bio_list bio_list_on_stack; if (current->bio_list) { /* make_request is active */ bio_list_add(current->bio_list, bio); return; } /* following loop may be a bit non-obvious, and so deserves some * explanation. * Before entering the loop, bio->bi_next is NULL (as all callers * ensure that) so we have a list with a single bio. * We pretend that we have just taken it off a longer list, so * we assign bio_list to a pointer to the bio_list_on_stack, * thus initialising the bio_list of new bios to be * added. __generic_make_request may indeed add some more bios * through a recursive call to generic_make_request. If it * did, we find a non-NULL value in bio_list and re-enter the loop * from the top. In this case we really did just take the bio * of the top of the list (no pretending) and so remove it from * bio_list, and call into __generic_make_request again. * * The loop was structured like this to make only one call to * __generic_make_request (which is important as it is large and * inlined) and to keep the structure simple. */ BUG_ON(bio->bi_next); bio_list_init(&bio_list_on_stack); current->bio_list = &bio_list_on_stack; do { __generic_make_request(bio); bio = bio_list_pop(current->bio_list); } while (bio); current->bio_list = NULL; /* deactivate */ }
generic_make_request 中获取 bio 指向的块设备的请求队列,并循环通过 __generic_make_request 调用请求队列的 make_request_fn 方法(见 request_queue 的声明,里面定义了一系列的函数指针)来下发 bio。
普通的块设备处理中一般会将 __make_request 函数注册到请求队列的 make_request_fn 函数指针上。另外设备驱动程序也可以注册自己的 I/O 提交等函数,这样可以绕过 Linux 默认提供的 I/O 协议栈,不走标准的 I/O 请求队列,由驱动程序自己来处理,有很多 nvram、SSD 卡等的驱动程序会为了提高性能而做这样的处理。
来看看 __make_request 的处理:
static int __make_request(struct request_queue *q, struct bio *bio) { struct request *req; int el_ret; unsigned int bytes = bio->bi_size; const unsigned short prio = bio_prio(bio); const bool sync = bio_rw_flagged(bio, BIO_RW_SYNCIO); const bool unplug = bio_rw_flagged(bio, BIO_RW_UNPLUG); const unsigned int ff = bio->bi_rw & REQ_FAILFAST_MASK; int rw_flags; if (bio_rw_flagged(bio, BIO_RW_BARRIER) && (q->next_ordered == QUEUE_ORDERED_NONE)) { bio_endio(bio, -EOPNOTSUPP); return 0; } /* * low level driver can indicate that it wants pages above a * certain limit bounced to low memory (ie for highmem, or even * ISA dma in theory) */ blk_queue_bounce(q, &bio); spin_lock_irq(q->queue_lock); if (unlikely(bio_rw_flagged(bio, BIO_RW_BARRIER)) || elv_queue_empty(q)) goto get_rq; el_ret = elv_merge(q, &req, bio); switch (el_ret) { case ELEVATOR_BACK_MERGE: BUG_ON(!rq_mergeable(req)); if (!ll_back_merge_fn(q, req, bio)) break; trace_block_bio_backmerge(q, bio); if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff) blk_rq_set_mixed_merge(req); req->biotail->bi_next = bio; req->biotail = bio; req->__data_len += bytes; req->ioprio = ioprio_best(req->ioprio, prio); if (!blk_rq_cpu_valid(req)) req->cpu = bio->bi_comp_cpu; drive_stat_acct(req, 0); if (!attempt_back_merge(q, req)) elv_merged_request(q, req, el_ret); goto out; case ELEVATOR_FRONT_MERGE: BUG_ON(!rq_mergeable(req)); if (!ll_front_merge_fn(q, req, bio)) break; trace_block_bio_frontmerge(q, bio); if ((req->cmd_flags & REQ_FAILFAST_MASK) != ff) { blk_rq_set_mixed_merge(req); req->cmd_flags &= ~REQ_FAILFAST_MASK; req->cmd_flags |= ff; } bio->bi_next = req->bio; req->bio = bio; /* * may not be valid. if the low level driver said * it didn't need a bounce buffer then it better * not touch req->buffer either... */ req->buffer = bio_data(bio); req->__sector = bio->bi_sector; req->__data_len += bytes; req->ioprio = ioprio_best(req->ioprio, prio); if (!blk_rq_cpu_valid(req)) req->cpu = bio->bi_comp_cpu; drive_stat_acct(req, 0); if (!attempt_front_merge(q, req)) elv_merged_request(q, req, el_ret); goto out; /* ELV_NO_MERGE: elevator says don't/can't merge. */ default: ; } get_rq: /* * This sync check and mask will be re-done in init_request_from_bio(), * but we need to set it earlier to expose the sync flag to the * rq allocator and io schedulers. */ rw_flags = bio_data_dir(bio); if (sync) rw_flags |= REQ_RW_SYNC; /* * Grab a free request. This is might sleep but can not fail. * Returns with the queue unlocked. */ req = get_request_wait(q, rw_flags, bio); /* * After dropping the lock and possibly sleeping here, our request * may now be mergeable after it had proven unmergeable (above). * We don't worry about that case for efficiency. It won't happen * often, and the elevators are able to handle it. */ init_request_from_bio(req, bio); spin_lock_irq(q->queue_lock); if (test_bit(QUEUE_FLAG_SAME_COMP, &q->queue_flags) || bio_flagged(bio, BIO_CPU_AFFINE)) req->cpu = blk_cpu_to_group(smp_processor_id()); if (queue_should_plug(q) && elv_queue_empty(q)) blk_plug_device(q); add_request(q, req); out: if (unplug || !queue_should_plug(q)) __generic_unplug_device(q); spin_unlock_irq(q->queue_lock); return 0; }
I/O 调度器的合并处理就在 __make_request 中通过调用相应调度器的函数来完成。__make_request 调用 elv_merge 通过调度器判断是否可以合并,如果可以则根据 front merging 或者 back merging 分别由调度器做处理。如果不能合并则调用 get_request_wait 和 init_request_from_bio 根据 bio 请求创建并初始化新的 request,然后调用 add_request 将 request 加入请求队列。
static inline void add_request(struct request_queue *q, struct request *req) { drive_stat_acct(req, 1); /* * elevator indicated where it wants this request to be * inserted at elevator_merge time */ __elv_add_request(q, req, ELEVATOR_INSERT_SORT, 0); }
add_request 还是会通过调度器将 request 插入请求队列中合适的位置。
最终 __make_request 返回 0 表示 bio 转发结束。后续 request 的处理方法和设备驱动的实现有关,一般通过注册到 request_queue 的 request_fn 函数指针进行处理,例如常见的 SCSI 设备就会将 scsi_request_fn 注册到 request_fn 上。驱动程序中请求发送以及自己的队列等处理完毕后调用 blk_complete_request 结束请求,而在结束请求过程中会调用 bio 的回调函数结束 bio。
本文只是对 Linux block 层做了基本的介绍,类似 buffer_head 处理、同步异步 I/O 处理等很多都没有涉及,以后再专门来研究了,文章有任何问题欢迎指正哦 ^_^