An operating system's kernel is one of the most elusive pieces of software out there. It's always there running in the background from the time your system gets turned on. Every user achieves their computing work with the help of the kernel, yet they never interact with it directly. The interaction with the kernel occurs by making system calls or having those calls made on behalf of the user by various libraries or applications that they use daily.
I've covered how to trace system calls in an earlier article using strace
. However, with strace
, your visibility is limited. It allows you to view the system calls invoked with specific parameters and, after the work gets done, see the return value or status indicating whether they passed or failed. But you had no idea what happened inside the kernel during this time. Besides just serving system calls, there's a lot of other activity happening inside the kernel that you're oblivious to.
Ftrace Introduction
This article aims to shed some light on tracing the kernel functions by using a mechanism called ftrace
. It makes kernel tracing easily accessible to any Linux user, and with its help you can learn a lot about Linux kernel internals.
The default output generated by the ftrace
is often massive, given that the kernel is always busy. To save space, I've kept the output to a minimum and, in many cases truncated the output entirely.
I am using Fedora for these examples, but they should work on any of the latest Linux distributions.
Enabling ftrace
Ftrace
is part of the Linux kernel now, and you no longer need to install anything to use it. It is likely that, if you are using a recent Linux OS, ftrace
is already enabled. To verify that the ftrace
facility is available, run the mount command and search for tracefs
. If you see output similar to what is below, ftrace
is enabled, and you can easily follow the examples in this article. These commands must be run as the root user (sudo
is insufficient.)
# mount | grep tracefs
none on /sys/kernel/tracing type tracefs (rw,relatime,seclabel)
To make use of ftrace
, you first must navigate to the special directory as specified in the mount command above, from where you'll run the rest of the commands in the article:
# cd /sys/kernel/tracing
General work flow
First of all, you must understand the general workflow of capturing a trace and obtaining the output. If you're using ftrace
directly, there isn't any special ftrace-
specific commands to run. Instead, you basically write to some files and read from some files using standard command-line Linux utilities.
The general steps:
- Write to some specific files to enable/disable tracing.
- Write to some specific files to set/unset filters to fine-tune tracing.
- Read generated trace output from files based on 1 and 2.
- Clear earlier output or buffer from files.
- Narrow down to your specific use case (kernel functions to trace) and repeat steps 1, 2, 3, 4.
Types of available tracers
There are several different kinds of tracers available to you. As mentioned earlier, you need to be in a specific directory before running any of these commands because the files of interest are present there. I use relative paths (as opposed to absolute paths) in my examples.
You can view the contents of the available_tracers
file to see all the types of tracers available. You can see a few listed below. Don't worry about all of them just yet:
# pwd
/sys/kernel/tracing
# cat available_tracers
hwlat blk mmiotrace function_graph wakeup_dl wakeup_rt wakeup function nop
Out of all the given tracers, I focus on three specific ones: function
and function_graph
to enable tracing, and nop
to disable tracing.
Identify current tracer
Usually, by default, the tracer is set to nop
. That is, "No operation" in the special file current_tracer
, which usually means tracing is currently off:
# pwd
/sys/kernel/tracing
# cat current_tracer
nop
View Tracing output
Before you enable any tracing, take a look at the file where the tracing output gets stored. You can view the contents of the file named trace
using the cat command:
# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 0/0 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
Enable function tracer
You can enable your first tracer called function
by writing function
to the file current_tracer
(its earlier content was nop
, indicating that tracing was off.) Think of this operation as a way of enabling tracing:
# pwd
/sys/kernel/tracing
# cat current_tracer
nop
# echo function > current_tracer
# cat current_tracer
function
View updated tracing output for function tracer
Now that you've enabled tracing, it's time to view the output. If you view the contents of the trace
file, you see a lot of data being written to it continuously. I've piped the output and am currently viewing only the top 20 lines to keep things manageable. If you follow the headers in the output on the left, you can see which task and Process ID are running on which CPU. Toward the right side of the output, you see the exact kernel function running, followed by its parent function. There is also time stamp information in the center:
# sudo cat trace | head -20
# tracer: function
#
# entries-in-buffer/entries-written: 409936/4276216 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
<idle>-0 [000] d... 2088.841739: tsc_verify_tsc_adjust <-arch_cpu_idle_enter
<idle>-0 [000] d... 2088.841739: local_touch_nmi <-do_idle
<idle>-0 [000] d... 2088.841740: rcu_nocb_flush_deferred_wakeup <-do_idle
<idle>-0 [000] d... 2088.841740: tick_check_broadcast_expired <-do_idle
<idle>-0 [000] d... 2088.841740: cpuidle_get_cpu_driver <-do_idle
<idle>-0 [000] d... 2088.841740: cpuidle_not_available <-do_idle
<idle>-0 [000] d... 2088.841741: cpuidle_select <-do_idle
<idle>-0 [000] d... 2088.841741: menu_select <-do_idle
<idle>-0 [000] d... 2088.841741: cpuidle_governor_latency_req <-menu_select
Remember that tracing is on, which means the output of tracing continues to get written to the trace file until you turn tracing off.
Turn off tracing
Turning off tracing is simple. All you have to do is replace function
tracer with nop
in the current_tracer
file and tracing gets turned off:
# cat current_tracer
function
# echo nop > current_tracer
# cat current_tracer
nop
Enable function_graph tracer
Now try the second tracer, called function_graph
. You can enable this using the same steps as before: write function_graph
to the current_tracer
file:
# echo function_graph > current_tracer
# cat current_tracer
function_graph
Tracing output of function_graph tracer
Notice that the output format of the trace
file has changed. Now, you can see the CPU ID and the duration of the kernel function execution. Next, you see curly braces indicating the beginning of a function and what other functions were called from inside it:
# cat trace | head -20
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
6) | n_tty_write() {
6) | down_read() {
6) | __cond_resched() {
6) 0.341 us | rcu_all_qs();
6) 1.057 us | }
6) 1.807 us | }
6) 0.402 us | process_echoes();
6) | add_wait_queue() {
6) 0.391 us | _raw_spin_lock_irqsave();
6) 0.359 us | _raw_spin_unlock_irqrestore();
6) 1.757 us | }
6) 0.350 us | tty_hung_up_p();
6) | mutex_lock() {
6) | __cond_resched() {
6) 0.404 us | rcu_all_qs();
6) 1.067 us | }
Enable trace settings to increase the depth of tracing
You can always tweak the tracer slightly to see more depth of the function calls using the steps below. After which, you can view the contents of the trace
file and see that the output is slightly more detailed. For readability, the output of this example is omitted:
# cat max_graph_depth
0
# echo 1 > max_graph_depth ## or:
# echo 2 > max_graph_depth
# sudo cat trace
Finding functions to trace
The steps above are sufficient to get started with tracing. However, the amount of output generated is enormous, and you can often get lost while trying to find out items of interest. Often you want the ability to trace specific functions only and ignore the rest. But how do you know which processes to trace if you don't know their exact names? There is a file that can help you with this—available_filter_functions
provides you with a list of available functions for tracing:
# wc -l available_filter_functions
63165 available_filter_functions
Search for general kernel functions
Now try searching for a simple kernel function that you are aware of. User-space has malloc
to allocate memory, while the kernel has its kmalloc
function, which provides similar functionality. Below are all the kmalloc
related functions:
# grep kmalloc available_filter_functions
debug_kmalloc
mempool_kmalloc
kmalloc_slab
kmalloc_order
kmalloc_order_trace
kmalloc_fix_flags
kmalloc_large_node
__kmalloc
__kmalloc_track_caller
__kmalloc_node
__kmalloc_node_track_caller
[...]
Search for kernel module or driver related functions
From the output of available_filter_functions
, you can see some lines ending with text in brackets, such as [kvm_intel]
in the example below. These functions are related to the kernel module kvm_intel
, which is currently loaded. You can run the lsmod
command to verify:
# grep kvm available_filter_functions | tail
__pi_post_block [kvm_intel]
vmx_vcpu_pi_load [kvm_intel]
vmx_vcpu_pi_put [kvm_intel]
pi_pre_block [kvm_intel]
pi_post_block [kvm_intel]
pi_wakeup_handler [kvm_intel]
pi_has_pending_interrupt [kvm_intel]
pi_update_irte [kvm_intel]
vmx_dump_dtsel [kvm_intel]
vmx_dump_sel [kvm_intel]
# lsmod | grep -i kvm
kvm_intel 335872 0
kvm 987136 1 kvm_intel
irqbypass 16384 1 kvm
Trace specific functions only
To enable tracing of specific functions or patterns, you can make use of the set_ftrace_filter
file to specify which functions from the above output you want to trace.
This file also accepts the *
pattern, which expands to include additional functions with the given pattern. As an example, I am using the ext4
filesystem on my machine. I can specify ext4
specific kernel functions to trace using the following commands:
# mount | grep home
/dev/mapper/fedora-home on /home type ext4 (rw,relatime,seclabel)
# pwd
/sys/kernel/tracing
# cat set_ftrace_filter
#### all functions enabled ####
$
$ echo ext4_* > set_ftrace_filter
$
$ cat set_ftrace_filter
ext4_has_free_clusters
ext4_validate_block_bitmap
ext4_get_group_number
ext4_get_group_no_and_offset
ext4_get_group_desc
[...]
Now, when you see the tracing output, you can only see functions ext4
related to kernel functions for which you had set a filter earlier. All the other output gets ignored:
# cat trace |head -20
## tracer: function
#
# entries-in-buffer/entries-written: 3871/3871 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
cupsd-1066 [004] .... 3308.989545: ext4_file_getattr <-vfs_fstat
cupsd-1066 [004] .... 3308.989547: ext4_getattr <-ext4_file_getattr
cupsd-1066 [004] .... 3308.989552: ext4_file_getattr <-vfs_fstat
cupsd-1066 [004] .... 3308.989553: ext4_getattr <-ext4_file_getattr
cupsd-1066 [004] .... 3308.990097: ext4_file_open <-do_dentry_open
cupsd-1066 [004] .... 3308.990111: ext4_file_getattr <-vfs_fstat
cupsd-1066 [004] .... 3308.990111: ext4_getattr <-ext4_file_getattr
cupsd-1066 [004] .... 3308.990122: ext4_llseek <-ksys_lseek
cupsd-1066 [004] .... 3308.990130: ext4_file_read_iter <-new_sync_read
Exclude functions from being traced
You don't always know what you want to trace but, you surely know what you don't want to trace. For that, there is this file aptly named set_ftrace_notrace
—notice the "no" in there. You can write your desired pattern in this file and enable tracing, upon which everything except the mentioned pattern gets traced. This is often helpful to remove common functionality that clutters our output:
# cat set_ftrace_notrace
#### no functions disabled ####
Targetted tracing
So far, you've been tracing everything that has happened in the kernel. But that won't help us if you wish to trace events related to a specific command. To achieve this, you can turn tracing on and off on-demand and, and in between them, run our command of choice so that you do not get extra output in your trace output. You can enable tracing by writing 1
to tracing_on
, and 0
to turn it off:
# cat tracing_on
0
# echo 1 > tracing_on
# cat tracing_on
1
### Run some specific command that we wish to trace here ###
# echo 0 > tracing_on
# cat tracing_on
0
Tracing specific PID
If you want to trace activity related to a specific process that is already running, you can write that PID to a file named set_ftrace_pid
and then enable tracing. That way, tracing is limited to this PID only, which is very helpful in some instances:
# echo $PID > set_ftrace_pid
Conclusion
Ftrace
is a great way to learn more about the internal workings of the Linux kernel. With some practice, you can learn to fine-tune ftrace
and narrow down your searches. To understand ftrace
in more detail and its advanced usage, see these excellent articles written by the core author of ftrace
himself—Steven Rostedt.
Comments are closed.