Dynamic tracing in Linux user and kernel space

Colorful deployments: An introduction to blue-green, canary, and rolling deployments

Image by:

Internet Archive Book Images. Modified by Opensource.com. CC BY-SA 4.0

Have you ever been in a situation where you realize that you didn't insert debug print at a few points in your code, so now you won't know if your CPU hits a particular line of code for execution until you recompile the code with debug statements? Don't worry, there's an easier solution. Basically, you need to insert dynamic probe points at different locations of your source code assembly instructions.

For advanced users, kernel documentation/trace and man perf provide a lot of information about different types of kernels and user space tracing mechanisms; however, average users just want a few simple steps and an example to get started quickly. That's where this article will help.

Let's start with definitions.

Probe point

A probe point is a debug statement that helps explore execution characteristics of software (i.e., the execution flow and state of software data structure at the moment a probe statement executes). printk is the simplest form of probe statement and one of the fundamental tools used by developers for kernel hacking.

Static vs. dynamic probing

Because it requires recompilation of source code, printk insertion is a static probing method. There are many other static tracepoints at important locations in the kernel code that can be enabled or disabled dynamically. The Linux kernel has a few frameworks that can help a developer probe either the kernel or user space application without recompiling the source code. Kprobe is one such dynamic method of inserting probe points within kernel code, and uprobe does so within a user application.

Tracing user space using uprobe

An uprobe tracepoint can be inserted into any user space code by using either the sysfs interface or the perf tool.

Inserting uprobe using the sysfs interface

Consider the following simple test code with no print statements in which we want to insert a probe at some instruction:

[[app-listing]]
[source,c]
.test.c

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int func_1_cnt;
static int func_2_cnt;

static void func_1(void)
{
	func_1_cnt++;
}

static void func_2(void)
{
	func_2_cnt++;
}

int main(int argc, void **argv)
{
	int number;

	while(1) {
		sleep(1);
		number = rand() % 10;
		if (number < 5)
			func_2();
		else
			func_1();
	}
}

Compile the code and find the instruction addresses you want to probe:

# gcc -o test test.c
# objdump -d test

Let's say we have the following object code on the ARM64 platform:

0000000000400620 <func_1>:
  400620:	90000080 	adrp	x0, 410000 <__FRAME_END__+0xf6f8>
  400624:	912d4000 	add	x0, x0, #0xb50
  400628:	b9400000 	ldr	w0, [x0]
  40062c:	11000401 	add	w1, w0, #0x1
  400630:	90000080 	adrp	x0, 410000 <__FRAME_END__+0xf6f8>
  400634:	912d4000 	add	x0, x0, #0xb50
  400638:	b9000001 	str	w1, [x0]
  40063c:	d65f03c0 	ret

0000000000400640 <func_2>:
  400640:	90000080 	adrp	x0, 410000 <__FRAME_END__+0xf6f8>
  400644:	912d5000 	add	x0, x0, #0xb54
  400648:	b9400000 	ldr	w0, [x0]
  40064c:	11000401 	add	w1, w0, #0x1
  400650:	90000080 	adrp	x0, 410000 <__FRAME_END__+0xf6f8>
  400654:	912d5000 	add	x0, x0, #0xb54
  400658:	b9000001 	str	w1, [x0]
  40065c:	d65f03c0 	ret

and we want to insert a probe at offset 0x620 and 0x644. Execute the following commands:

# echo 'p:func_2_entry test:0x620' > /sys/kernel/debug/tracing/uprobe_events
# echo 'p:func_1_entry test:0x644' >> /sys/kernel/debug/tracing/uprobe_events
# echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
# ./test&

In the first and second second echo statement above, p tells us that it is a simple probe. (A probe can be either simple or return.) func_n_entry is the name we see in the trace output. Name is an optional field; if it is not provided, we should expect a name like p_test_0x644. test is the executable binary in which we want to insert the probe. If test is not in the current directory, we need to specify path_to_test/test. The 0x620 or 0x640 is the instruction offset from the start of the program. Please note >> in the second echo statement, as we want to add one more probe. So, when we enable uprobe tracing after inserting probe points in the first two commands, it will enable all the uprobe events when we write to events/uprobes/enable. We can enable individual events, as well, by writing to the specific event file created in the events directory. Once the probe points have been inserted and enabled, we can see a trace entry whenever the probed instruction is executed.

Read the trace file to see the output:

# cat /sys/kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 8/8   #P:8
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
            test-2788  [003] ....  1740.674740: func_1_entry: (0x400644)
            test-2788  [003] ....  1741.674854: func_1_entry: (0x400644)
            test-2788  [003] ....  1742.674949: func_2_entry: (0x400620)
            test-2788  [003] ....  1743.675065: func_2_entry: (0x400620)
            test-2788  [003] ....  1744.675158: func_1_entry: (0x400644)
            test-2788  [003] ....  1745.675273: func_1_entry: (0x400644)
            test-2788  [003] ....  1746.675390: func_2_entry: (0x400620)
            test-2788  [003] ....  1747.675503: func_2_entry: (0x400620)

We can see what task was done by what CPU and at what time it executed the probed instruction.

A return probe can also be inserted into any instruction. This will log an entry when the function with that instruction is returned:

# echo 0 > /sys/kernel/debug/tracing/events/uprobes/enable
# echo 'r:func_2_exit test:0x620' >> /sys/kernel/debug/tracing/uprobe_events
# echo 'r:func_1_exit test:0x644' >> /sys/kernel/debug/tracing/uprobe_events
# echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable

Here we use r instead of p, and all the other parameters are the same. Note that we need to disable uprobe events if we want to insert a new probe point:

            test-3009  [002] ....  4813.852674: func_1_entry: (0x400644)
            test-3009  [002] ....  4813.852691: func_1_exit: (0x4006b0 <- 0x400644)
            test-3009  [002] ....  4814.852805: func_2_entry: (0x400620)
            test-3009  [002] ....  4814.852807: func_2_exit: (0x4006b8 <- 0x400620)
            test-3009  [002] ....  4815.852920: func_2_entry: (0x400620)
            test-3009  [002] ....  4815.852921: func_2_exit: (0x4006b8 <- 0x400620)

The log above tells us func_1 returned to the address 0x4006b0 at timestamp 4813.852691.

# echo 0 > /sys/kernel/debug/tracing/events/uprobes/enable
# echo 'p:func_2_entry test:0x630' > /sys/kernel/debug/tracing/uprobe_events count=%x1
# echo 1 > /sys/kernel/debug/tracing/events/uprobes/enable
# echo > /sys/kernel/debug/tracing/trace
# ./test&

Here we are printing the value of ARM64 x1 register as count= when the instruction at offset 0x630 is executed.

The output would look like this:

            test-3095  [003] ....  7918.629728: func_2_entry: (0x400630) count=0x1
            test-3095  [003] ....  7919.629884: func_2_entry: (0x400630) count=0x2
            test-3095  [003] ....  7920.629988: func_2_entry: (0x400630) count=0x3
            test-3095  [003] ....  7922.630272: func_2_entry: (0x400630) count=0x4
            test-3095  [003] ....  7923.630429: func_2_entry: (0x400630) count=0x5

Inserting uprobe using perf

Always finding the offset of an instruction or function where a probe needs to be inserted is cumbersome, and it's even more complicated to know the name of a CPU register assigned to any local variable. perf is a useful tool to help prepare and insert uprobe into any line in source code.

Besides perf, there are a few other tools, such as SystemTap, DTrace, and LTTng, that can be used for kernel and user space tracing; however, perf is fully coupled with the kernel, so it's favored by kernel developers.

# gcc -g -o test test.c
# perf probe -x ./test func_2_entry=func_2
# perf probe -x ./test func_2_exit=func_2%return
# perf probe -x ./test test_15=test.c:15
# perf probe -x ./test test_25=test.c:25 number
# perf record -e probe_test:func_2_entry -e probe_test:func_2_exit -e probe_test:test_15  -e probe_test:test_25 ./test

As the above example shows, we can insert a probe point directly into the function start and return, a specific line number of the source file, etc. You can get a local variable printed. You can have many other options, like all the instances of a function calling (see man perf probe for details). perf probe is used to create a probe point event, then those events can be probed using perf record while executing a ./test executable. When we create a perf probe point, we can have other recording options like perf stat, and we can have many post-analysis options like perf script or perf report.

Using perf script, our output for the above example looks like:

# perf script
            test  2741 [002]   427.584681: probe_test:test_25: (4006a0) number=3
            test  2741 [002]   427.584717: probe_test:test_15: (400640)
            test  2741 [002]   428.584861: probe_test:test_25: (4006a0) number=6
            test  2741 [002]   428.584872: probe_test:func_2_entry: (400620)
            test  2741 [002]   428.584881: probe_test:func_2_exit: (400620 <- 4006b8)
            test  2741 [002]   429.585012: probe_test:test_25: (4006a0) number=7
            test  2741 [002]   429.585021: probe_test:func_2_entry: (400620)
            test  2741 [002]   429.585025: probe_test:func_2_exit: (400620 <- 4006b8)

Tracing kernel space using kprobe

As with uprobe, either the sysfs interface or the perf tool can be used to insert a kprobe tracepoint into kernel code.

Inserting kprobe using the sysfs interface

We can insert kprobe within most of the symbols in /proc/kallsyms; other symbols have been blacklisted in the kernel. A kprobe insertion into the kprobe_events file for the symbols that aren't compatible with a kprobe insertion should result in a write error. A probe can be inserted at some offset from the symbol base, as well. Like uprobe, we can also trace the return of a function using kretprobe. The value of a local variable can also be printed in trace output.

This example explains how to do that:

; disable all events, just to insure that we see only kprobe output in trace.
# echo 0 > /sys/kernel/debug/tracing/events/enable
; disable kprobe events until probe points are inseted.
# echo 0 > /sys/kernel/debug/tracing/events/kprobes/enable
; clear out all the events from kprobe_events, to insure that we see output for
; only those for which we have enabled 
# echo > /sys/kernel/debug/tracing/kprobe_events
; insert probe point at kfree
# echo "p kfree" >> /sys/kernel/debug/tracing/kprobe_events
; insert probe point at kfree+0x10 with name kfree_probe_10
# echo "p:kree_probe_10 kfree+0x10" >> /sys/kernel/debug/tracing/kprobe_events
; insert probe point at kfree return
# echo "r:kfree_probe kfree" >> /sys/kernel/debug/tracing/kprobe_events
; enable kprobe events until probe points are inseted.
# echo 1 > /sys/kernel/debug/tracing/events/kprobes/enable

[root@pratyush ~]# more /sys/kernel/debug/tracing/trace
# tracer: nop
#
# entries-in-buffer/entries-written: 9037/9037   #P:8
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
            sshd-2189  [002] dn..  1908.930731: kree_probe: (__audit_syscall_exit+0x194/0x218 <- kfree)
            sshd-2189  [002] d...  1908.930744: p_kfree_0: (kfree+0x0/0x214)
            sshd-2189  [002] d...  1908.930746: kree_probe_10: (kfree+0x10/0x214)

Inserting kprobe using perf

As with uprobe, we can use perf to insert a kprobe within the kernel code. We can insert a probe point directly into the function start and return, the specific line number of the source file, etc. We can either provide vmlinux to the -k option, or we can provide the kernel source code path to the -s option:

# perf probe -k vmlinux kfree_entry=kfree
# perf probe -k vmlinux kfree_exit=kfree%return
# perf probe -s ./  kfree_mid=mm/slub.c:3408 x
# perf record -e probe:kfree_entry -e probe:kfree_exit -e probe:kfree_mid sleep 10

Using a perf script, we see this output for the above example:

# perf script
           sleep  2379 [001]  2702.291224: probe:kfree_entry: (fffffe0000201944)
           sleep  2379 [001]  2702.291229: probe:kfree_mid: (fffffe0000201978) x=0x0
           sleep  2379 [001]  2702.291231: probe:kfree_exit: (fffffe0000201944 <- fffffe000019f67c)
           sleep  2379 [001]  2702.291241: probe:kfree_entry: (fffffe0000201944)
           sleep  2379 [001]  2702.291242: probe:kfree_mid: (fffffe0000201978) x=0xfffffe01db8f6000
           sleep  2379 [001]  2702.291243: probe:kfree_exit: (fffffe0000201944 <- fffffe000019f67c)
           sleep  2379 [001]  2702.291249: probe:kfree_entry: (fffffe0000201944)
           sleep  2379 [001]  2702.291250: probe:kfree_mid: (fffffe0000201978) x=0xfffffe01db8f6000
           sleep  2379 [001]  2702.291251: probe:kfree_exit: (fffffe0000201944 <- fffffe000019f67c)

I hope that this tutorial has explained how to hack your executable code and insert a couple of probe points into them. Happy tracing!