Measuring This and That 16: Adding a new kprobe to ftrace tracing

Posted on July 19, 2019

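Adding new events to ftrace

Ftrace ships with many built-in events, and I used to assume that was as far as its reach into the kernel went. Today I learned that a kprobe, like a uprobe, can be defined dynamically. Events that stock ftrace does not provide can therefore be added as new kprobe events.

Since both kprobes and uprobes can be defined dynamically, it finally feels possible to get a fairly complete picture of what the machine is doing.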

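Why not eBPF?

Tools such as bcc come with many predefined methods that are convenient to use as they are. eBPF's main drawback today is its kernel version requirement. Many enterprise production systems simply do not run a kernel of version 4.1 or later, and their owners are wary of loading anything into the kernel (it is easily likened to a kernel module), so ftrace is needed as an alternative to eBPF.

As an aside, customers such as banks, financial institutions, and government agencies put stability first and cannot accept even the slightest risk of leaving a manager to take the blame XD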

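Adding a kprobe event

Finding the target function

Check the System.map file for the kernel function you want to observe. This file is effectively the kernel's symbol table, so when you are unsure of a kernel function's name you can grep for it there.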
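A lookup like this can be sketched as a small shell function. It assumes the usual three-column System.map layout (address, type, name). The function name `find_ksym` and the default path under /boot are my own choices for illustration, not something the original setup prescribes.

```shell
#!/bin/sh
# find_ksym NAME [MAPFILE]
# Print the lines of MAPFILE whose symbol column is exactly NAME.
# Assumption: MAPFILE uses the "address type name" layout. The default
# path is a common location, not one that every distribution guarantees.
find_ksym() {
    name=$1
    map=${2:-/boot/System.map-$(uname -r)}
    awk -v sym="$name" '$3 == sym' "$map"
}
```

Calling `find_ksym blk_start_request` prints the matching line, or nothing if the symbol is absent. /proc/kallsyms uses the same first three columns, so it can be passed as the second argument on a running system.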

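Adding the probe

Take the kernel function blk_start_request as an example. It is not among the default ftrace events (at least on my current machine). If we want to trace this function with ftrace as well, we can add an event for it ourselves.

If all you need is when it ran, which process ran it, and on which CPU, the command is very simple: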

echo 'p:myprobe blk_start_request' > /sys/kernel/debug/tracing/kprobe_events

After running it, the event shows up in kprobe_events:

[root@Server-N3 tracing]# cat /sys/kernel/debug/tracing/kprobe_events
p:kprobes/myprobe blk_start_request

A new myprobe directory also appears under /sys/kernel/debug/tracing/events/kprobes/.

Take a look inside:

[root@Server-N3 myprobe]# ls /sys/kernel/debug/tracing/events/kprobes/myprobe/
enable  filter  format  id

Writing a 1 to the enable file in this directory (echo 1 > enable) turns on tracing for the function.
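The add, enable, and clean-up steps can be wrapped in a few helper functions. This is only a sketch: the function names and the `TRACEFS` variable are mine, and the helpers assume the default "kprobes" group. One detail to watch is that redirecting into kprobe_events with a plain `>` replaces the probes already defined, whereas `>>` appends to them.

```shell
#!/bin/sh
# Root of the tracing directory. Override it if tracefs is mounted
# elsewhere, for example at /sys/kernel/tracing.
TRACEFS=${TRACEFS:-/sys/kernel/debug/tracing}

# probe_add DEFINITION: append one definition to kprobe_events.
# ">>" keeps the probes that are already defined.
probe_add() {
    echo "$1" >> "$TRACEFS/kprobe_events"
}

# probe_switch EVENT 0|1: disable or enable an event in the default
# "kprobes" group.
probe_switch() {
    echo "$2" > "$TRACEFS/events/kprobes/$1/enable"
}

# probe_del EVENT: disable the event first, then clear it with "-:".
probe_del() {
    probe_switch "$1" 0
    echo "-:$1" >> "$TRACEFS/kprobe_events"
}
```

Run as root, a session might be `probe_add 'p:myprobe blk_start_request'`, then `probe_switch myprobe 1`, and `probe_del myprobe` once you are done.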

Back in /sys/kernel/debug/tracing, look at the trace output:

[root@Server-N3 tracing]# cat trace|tail
           <...>-92844 [001] d... 8125310.179907: myprobe: (blk_start_request+0x0/0x50)
          <idle>-0     [001] dNs. 8125310.179953: myprobe: (blk_start_request+0x0/0x50)
           <...>-51846 [020] d... 8125310.329592: myprobe: (blk_start_request+0x0/0x50)
           <...>-51846 [021] d... 8125310.329824: myprobe: (blk_start_request+0x0/0x50)
           <...>-92844 [001] d... 8125310.329979: myprobe: (blk_start_request+0x0/0x50)
          <idle>-0     [001] dNs. 8125310.330025: myprobe: (blk_start_request+0x0/0x50)
           <...>-51858 [021] d... 8125310.476478: myprobe: (blk_start_request+0x0/0x50)
           <...>-51858 [022] d... 8125310.476750: myprobe: (blk_start_request+0x0/0x50)
           <...>-92844 [001] d... 8125310.476890: myprobe: (blk_start_request+0x0/0x50)
          <idle>-0     [001] dNs. 8125310.476941: myprobe: (blk_start_request+0x0/0x50)
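Raw output like this is easier to read once it is aggregated. The sketch below counts hits per CPU. The function name `hits_per_cpu` is mine, and it assumes the task name contains no spaces, so that the bracketed CPU number is the second whitespace-separated field.

```shell
#!/bin/sh
# hits_per_cpu: read trace lines on stdin and print "CPU COUNT" pairs.
# Lines whose second field is not of the form [NNN], such as the "#"
# header lines of the trace file, are ignored.
hits_per_cpu() {
    awk '$2 ~ /^\[[0-9]+\]$/ { n[$2]++ }
         END { for (cpu in n) print cpu, n[cpu] }' | sort
}
```

It can be fed with `cat trace | hits_per_cpu`. For the ten lines shown above it reports six hits on CPU 001, two on 021, and one each on 020 and 022.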

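Tracing arbitrary data

The approach above is enough if you only care about when the function runs. But kprobes (and uprobes) offer a much more powerful mechanism that can expose just about anything you can think of, including:

The value of each CPU register
The value of each function argument
The function's return value
Data on the function's stack
If an argument is a pointer to a data structure, the value of a member of the object it points to
And so on

This requires learning the definition syntax. Here is the overall description first: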

p[:[GRP/]EVENT] [MOD:]SYM[+offs]|MEMADDR [FETCHARGS]  : Set a probe
r[MAXACTIVE][:[GRP/]EVENT] [MOD:]SYM[+0] [FETCHARGS]  : Set a return probe
-:[GRP/]EVENT                                         : Clear a probe

GRP            : Group name. If omitted, use "kprobes" for it.
EVENT          : Event name. If omitted, the event name is generated
                 based on SYM+offs or MEMADDR.
MOD            : Module name which has given SYM.
SYM[+offs]     : Symbol+offset where the probe is inserted.
MEMADDR        : Address where the probe is inserted.
MAXACTIVE      : Maximum number of instances of the specified function that
                 can be probed simultaneously, or 0 for the default value
                 as defined in Documentation/kprobes.txt section 1.3.1.

FETCHARGS      : Arguments. Each probe can have up to 128 args.
 %REG          : Fetch register REG
 @ADDR         : Fetch memory at ADDR (ADDR should be in kernel)
 @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol)
 $stackN       : Fetch Nth entry of stack (N >= 0)
 $stack        : Fetch stack address.
 $argN         : Fetch the Nth function argument. (N >= 1) (*1)
 $retval       : Fetch return value.(*2)
 $comm         : Fetch current task comm.
 +|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.(*3)
 NAME=FETCHARG : Set NAME as the argument name of FETCHARG.
 FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types
                 (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types
                 (x8/x16/x32/x64), "string" and bitfield are supported.

(*1) only for the probe on function entry (offs == 0).
(*2) only for return probe.
(*3) this is useful for fetching a field of data structures.
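To make the grammar concrete, here is a small sketch that assembles a definition line from its parts. The helper name `kprobe_def` is mine, and it only covers the common case of a named event in the default group.

```shell
#!/bin/sh
# kprobe_def KIND EVENT SYMBOL [FETCHARG...]
# Compose one kprobe_events line. KIND is "p" for an entry probe or "r"
# for a return probe. No group is given, so "kprobes" is used.
kprobe_def() {
    kind=$1
    event=$2
    sym=$3
    shift 3
    def="$kind:$event $sym"
    for arg in "$@"; do
        def="$def $arg"
    done
    printf '%s\n' "$def"
}
```

For example, `kprobe_def r myretprobe do_sys_open 'ret=$retval'` prints `r:myretprobe do_sys_open ret=$retval`. The symbol do_sys_open is the one the kernel documentation uses in its own return probe example; whether it can be probed on a given kernel is a separate question. The single quotes matter, because otherwise the shell would expand `$retval` itself.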

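The entries listed under FETCHARGS are the part that matters most. The full details are in the documentation on the kernel website; here I mainly give a few practical examples.

Take blk_account_io_completion as an example. Its prototype in the kernel is: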

void blk_account_io_completion(struct request *req, unsigned int bytes);

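To see its arguments, the list above shows the available ways of fetching them; the main one is to read the registers that hold them.

The documentation above does list $argN as a direct way to fetch an argument, but I believe it depends on the kernel version, and the machine I am using does not support it yet.

So how do we find out which registers hold these two arguments?

In arch/x86/include/asm/ptrace.h in the kernel source: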

/**
 * regs_get_kernel_argument() - get Nth function argument in kernel
 * @regs:   pt_regs of that context
 * @n:      function argument number (start from 0)
 *
 * regs_get_argument() returns @n th argument of the function call.
 * Note that this chooses most probably assignment, in some case
 * it can be incorrect.
 * This is expected to be called from kprobes or ftrace with regs
 * where the top of stack is the return address.
 */
static inline unsigned long regs_get_kernel_argument(struct pt_regs *regs,
                             unsigned int n)
{
    static const unsigned int argument_offs[] = {
#ifdef __i386__
        offsetof(struct pt_regs, ax),
        offsetof(struct pt_regs, cx),
        offsetof(struct pt_regs, dx),
#define NR_REG_ARGUMENTS 3
#else
        offsetof(struct pt_regs, di),
        offsetof(struct pt_regs, si),
        offsetof(struct pt_regs, dx),
        offsetof(struct pt_regs, cx),
        offsetof(struct pt_regs, r8),
        offsetof(struct pt_regs, r9),
#define NR_REG_ARGUMENTS 6
#endif
    };

    if (n >= NR_REG_ARGUMENTS) {
        n -= NR_REG_ARGUMENTS - 1;
        return regs_get_kernel_stack_nth(regs, n);
    } else
        return regs_get_register(regs, argument_offs[n]);
}

This function shows that on an x86_64 machine the registers holding a kernel function's arguments are, in order, di, si, dx, and so on.
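The mapping can be captured in a lookup helper so you do not have to recall the order each time. This is a sketch for x86_64 only, and the function name is mine. Note that it numbers arguments from 1, like $argN, whereas regs_get_kernel_argument() counts from 0.

```shell
#!/bin/sh
# x86_64_arg_reg N: print the register FETCHARG for the Nth argument
# (N starts at 1), following the non-i386 branch of argument_offs[].
# Arguments beyond the sixth are passed on the stack and are not
# handled here.
x86_64_arg_reg() {
    case $1 in
        1) echo '%di' ;;
        2) echo '%si' ;;
        3) echo '%dx' ;;
        4) echo '%cx' ;;
        5) echo '%r8' ;;
        6) echo '%r9' ;;
        *) echo "x86_64_arg_reg: argument $1 is not passed in a register" >&2
           return 1 ;;
    esac
}
```

`x86_64_arg_reg 2` prints `%si`, which is where the `bytes` argument of blk_account_io_completion is found.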

So the command we write is:

echo 'p:blkprobe blk_account_io_completion req=%di bytes=%si' > /sys/kernel/debug/tracing/kprobe_events

After enabling it the same way, the trace output appears:

[root@server-P1 events]# cat ../trace | tail
          <idle>-0     [000] dNs. 8222360.833141: blkprobe: (blk_account_io_completion+0x0/0xb0) req=0xffff8c2028e6f600 bytes=0x0
          <idle>-0     [000] d.s. 8222383.061651: blkprobe: (blk_account_io_completion+0x0/0xb0) req=0xffff8c1f4fb8fc00 bytes=0x1c00
          <idle>-0     [000] dNs. 8222383.076929: blkprobe: (blk_account_io_completion+0x0/0xb0) req=0xffff8c1f4fb8fc00 bytes=0x0
          <idle>-0     [000] d.s. 8222413.140181: blkprobe: (blk_account_io_completion+0x0/0xb0) req=0xffff8c1f4fb8f300 bytes=0x1000
          <idle>-0     [034] d.s. 8222413.141468: blkprobe: (blk_account_io_completion+0x0/0xb0) req=0xffff8c26c6eaf000 bytes=0x400
           <...>-54375 [034] d.s. 8222413.141526: blkprobe: (blk_account_io_completion+0x0/0xb0) req=0xffff8c26c6ead080 bytes=0x2000
          <idle>-0     [034] d.s. 8222413.141565: blkprobe: (blk_account_io_completion+0x0/0xb0) req=0xffff8c26c6eaca80 bytes=0x2000
          <idle>-0     [034] d.s. 8222413.141603: blkprobe: (blk_account_io_completion+0x0/0xb0) req=0xffff8c26c6eac480 bytes=0x1000
          <idle>-0     [034] d.s. 8222413.141679: blkprobe: (blk_account_io_completion+0x0/0xb0) req=0xffff8c26c6eac600 bytes=0x2000
          <idle>-0     [000] dNs. 8222413.174594: blkprobe: (blk_account_io_completion+0x0/0xb0) req=0xffff8c1f4fb8f300 bytes=0x0
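Since the arguments are printed in hex, a little post-processing helps when you want totals. The sketch below adds up the `bytes=` values. The function name `sum_bytes` is mine, and it assumes the field is named `bytes`, as in the probe definition above.

```shell
#!/bin/sh
# sum_bytes: read blkprobe trace lines on stdin and print the total of
# the "bytes=0x..." fields in decimal. Lines without such a field are
# skipped.
sum_bytes() {
    total=0
    for hex in $(sed -n 's/.* bytes=\(0x[0-9a-fA-F]*\).*/\1/p'); do
        total=$((total + $hex))
    done
    echo "$total"
}
```

Used as `cat ../trace | tail | sum_bytes`, it gives 40960 for the ten lines shown above.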

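Another interesting point is that the first argument of blk_account_io_completion is a pointer to a struct request. To see the value of a member of that structure, you can compute its address from the member's offset in the type definition and trace that, and such dereferences can be nested. For the details, see the example on a networking function in the documentation; I will not repeat it here.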
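For reference, the argument syntax for such a dereference can be assembled as in the sketch below. The helper name `member_arg` is mine, and the offsets in the usage note are placeholders: the real offset of a member of struct request depends on the kernel build and has to be looked up, for instance with pahole on a vmlinux that carries debug information.

```shell
#!/bin/sh
# member_arg NAME OFFSET BASE TYPE
# Build "NAME=+OFFSET(BASE):TYPE", which reads a TYPE-sized value located
# OFFSET bytes past the address held in BASE. BASE may itself be a
# dereference such as "+8(%di)", which is how nesting is expressed.
member_arg() {
    printf '%s=+%s(%s):%s\n' "$1" "$2" "$3" "$4"
}
```

`member_arg field 16 %di u32` prints `field=+16(%di):u32`, and a nested read looks like `member_arg inner 0 '+8(%di)' u64`, which prints `inner=+0(+8(%di)):u64`. The resulting string goes at the end of a probe definition, after the symbol name.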

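Turning this into a product

bcc in fact builds on the same capability: it predefines kprobes on key functions and aggregates the raw results into information that is meaningful for the user's workload. Examples in bcc/tools include the disk I/O latency distribution and the ranking of network send and receive traffic.

Combining the capabilities of kprobes and uprobes with predefined probes for specific application scenarios, plus some UI, data visualization, and interaction design, would yield a very powerful tool for analyzing and diagnosing application performance.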