Analyze the Linux kernel with ftrace | Opensource.com

Analyze the Linux kernel with ftrace

Ftrace is a great way to learn more about the internal workings of the Linux kernel.

Linux keys on the keyboard for a desktop computer
x

Subscribe now

Get the highlights in your inbox every week.

An operating system's kernel is one of the most elusive pieces of software out there. It's always there running in the background from the time your system gets turned on. Every user achieves their computing work with the help of the kernel, yet they never interact with it directly. The interaction with the kernel occurs by making system calls or having those calls made on behalf of the user by various libraries or applications that they use daily.

I've covered how to trace system calls in an earlier article using strace. However, with strace, your visibility is limited. It allows you to view the system calls invoked with specific parameters and, after the work gets done, see the return value or status indicating whether they passed or failed. But you had no idea what happened inside the kernel during this time. Besides just serving system calls, there's a lot of other activity happening inside the kernel that you're oblivious to.

Ftrace Introduction

This article aims to shed some light on tracing the kernel functions by using a mechanism called ftrace. It makes kernel tracing easily accessible to any Linux user, and with its help you can learn a lot about Linux kernel internals.

The default output generated by the ftrace is often massive, given that the kernel is always busy. To save space, I've kept the output to a minimum and, in many cases truncated the output entirely.

I am using Fedora for these examples, but they should work on any of the latest Linux distributions.

Enabling ftrace

Ftrace is part of the Linux kernel now, and you no longer need to install anything to use it. It is likely that, if you are using a recent Linux OS, ftrace is already enabled. To verify that the ftrace facility is available, run the mount command and search for tracefs. If you see output similar to what is below, ftrace is enabled, and you can easily follow the examples in this article. These commands must be run as the root user (sudo is insufficient.)

# mount | grep tracefs
none on /sys/kernel/tracing type tracefs (rw,relatime,seclabel)

To make use of ftrace, you first must navigate to the special directory as specified in the mount command above, from where you'll run the rest of the commands in the article:

# cd /sys/kernel/tracing

General work flow

First of all, you must understand the general workflow of capturing a trace and obtaining the output. If you're using ftrace directly, there isn't any special ftrace-specific commands to run. Instead, you basically write to some files and read from some files using standard command-line Linux utilities.

The general steps:

  1. Write to some specific files to enable/disable tracing.
  2. Write to some specific files to set/unset filters to fine-tune tracing.
  3. Read generated trace output from files based on 1 and 2.
  4. Clear earlier output or buffer from files.
  5. Narrow down to your specific use case (kernel functions to trace) and repeat steps 1, 2, 3, 4.

Types of available tracers

There are several different kinds of tracers available to you. As mentioned earlier, you need to be in a specific directory before running any of these commands because the files of interest are present there. I use relative paths (as opposed to absolute paths) in my examples.

You can view the contents of the available_tracers file to see all the types of tracers available. You can see a few listed below. Don't worry about all of them just yet:

# pwd
/sys/kernel/tracing

# cat available_tracers
hwlat blk mmiotrace function_graph wakeup_dl wakeup_rt wakeup function nop

Out of all the given tracers, I focus on three specific ones: function and function_graph to enable tracing, and nop to disable tracing.

Identify current tracer

Usually, by default, the tracer is set to nop. That is, "No operation" in the special file current_tracer, which usually means tracing is currently off:

# pwd
/sys/kernel/tracing

# cat current_tracer
nop

View Tracing output

Before you enable any tracing, take a look at the file where the tracing output gets stored. You can view the contents of the file named trace using the cat command:

# cat trace

# tracer: nop
#
# entries-in-buffer/entries-written: 0/0   #P:8
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| /     delay
#           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
#              | |         |   ||||      |         |

Enable function tracer

You can enable your first tracer called function by writing function to the file current_tracer (its earlier content was nop, indicating that tracing was off.) Think of this operation as a way of enabling tracing:

# pwd
/sys/kernel/tracing

# cat current_tracer
nop

# echo function > current_tracer

# cat current_tracer
function

View updated tracing output for function tracer

Now that you've enabled tracing, it's time to view the output. If you view the contents of the trace file, you see a lot of data being written to it continuously. I've piped the output and am currently viewing only the top 20 lines to keep things manageable. If you follow the headers in the output on the left, you can see which task and Process ID are running on which CPU. Toward the right side of the output, you see the exact kernel function running, followed by its parent function. There is also time stamp information in the center:

# sudo cat trace | head -20

# tracer: function
#
# entries-in-buffer/entries-written: 409936/4276216   #P:8
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| /     delay
#           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
#              | |         |   ||||      |         |
          <idle>-0       [000] d...  2088.841739: tsc_verify_tsc_adjust <-arch_cpu_idle_enter
          <idle>-0       [000] d...  2088.841739: local_touch_nmi <-do_idle
          <idle>-0       [000] d...  2088.841740: rcu_nocb_flush_deferred_wakeup <-do_idle
          <idle>-0       [000] d...  2088.841740: tick_check_broadcast_expired <-do_idle
          <idle>-0       [000] d...  2088.841740: cpuidle_get_cpu_driver <-do_idle
          <idle>-0       [000] d...  2088.841740: cpuidle_not_available <-do_idle
          <idle>-0       [000] d...  2088.841741: cpuidle_select <-do_idle
          <idle>-0       [000] d...  2088.841741: menu_select <-do_idle
          <idle>-0       [000] d...  2088.841741: cpuidle_governor_latency_req <-menu_select

Remember that tracing is on, which means the output of tracing continues to get written to the trace file until you turn tracing off.

Turn off tracing

Turning off tracing is simple. All you have to do is replace function tracer with nop in the current_tracer file and tracing gets turned off:

# cat current_tracer
function

# echo nop > current_tracer

# cat current_tracer
nop

Enable function_graph tracer

Now try the second tracer, called function_graph. You can enable this using the same steps as before: write function_graph to the current_tracer file:

# echo function_graph > current_tracer

# cat current_tracer
function_graph

Tracing output of function_graph tracer

Notice that the output format of the trace file has changed. Now, you can see the CPU ID and the duration of the kernel function execution. Next, you see curly braces indicating the beginning of a function and what other functions were called from inside it:

# cat trace | head -20

# tracer: function_graph
#
# CPU  DURATION                  FUNCTION CALLS
# |     |   |                     |   |   |   |
 6)               |              n_tty_write() {
 6)               |                down_read() {
 6)               |                  __cond_resched() {
 6)   0.341 us    |                    rcu_all_qs();
 6)   1.057 us    |                  }
 6)   1.807 us    |                }
 6)   0.402 us    |                process_echoes();
 6)               |                add_wait_queue() {
 6)   0.391 us    |                  _raw_spin_lock_irqsave();
 6)   0.359 us    |                  _raw_spin_unlock_irqrestore();
 6)   1.757 us    |                }
 6)   0.350 us    |                tty_hung_up_p();
 6)               |                mutex_lock() {
 6)               |                  __cond_resched() {
 6)   0.404 us    |                    rcu_all_qs();
 6)   1.067 us    |                  }

Enable trace settings to increase the depth of tracing

You can always tweak the tracer slightly to see more depth of the function calls using the steps below. After which, you can view the contents of the trace file and see that the output is slightly more detailed. For readability, the output of this example is omitted:

# cat max_graph_depth
0

# echo 1 > max_graph_depth ## or:
# echo 2 > max_graph_depth

# sudo cat trace

Finding functions to trace

The steps above are sufficient to get started with tracing. However, the amount of output generated is enormous, and you can often get lost while trying to find out items of interest. Often you want the ability to trace specific functions only and ignore the rest. But how do you know which processes to trace if you don't know their exact names? There is a file that can help you with this—available_filter_functions provides you with a list of available functions for tracing:

# wc -l available_filter_functions  
63165 available_filter_functions

Search for general kernel functions

Now try searching for a simple kernel function that you are aware of. User-space has malloc to allocate memory, while the kernel has its kmalloc function, which provides similar functionality. Below are all the kmalloc related functions:

# grep kmalloc available_filter_functions
debug_kmalloc
mempool_kmalloc
kmalloc_slab
kmalloc_order
kmalloc_order_trace
kmalloc_fix_flags
kmalloc_large_node
__kmalloc
__kmalloc_track_caller
__kmalloc_node
__kmalloc_node_track_caller
[...]

Search for kernel module or driver related functions

From the output of available_filter_functions, you can see some lines ending with text in brackets, such as [kvm_intel] in the example below. These functions are related to the kernel module kvm_intel, which is currently loaded. You can run the lsmod command to verify:

# grep kvm available_filter_functions | tail
__pi_post_block [kvm_intel]
vmx_vcpu_pi_load [kvm_intel]
vmx_vcpu_pi_put [kvm_intel]
pi_pre_block [kvm_intel]
pi_post_block [kvm_intel]
pi_wakeup_handler [kvm_intel]
pi_has_pending_interrupt [kvm_intel]
pi_update_irte [kvm_intel]
vmx_dump_dtsel [kvm_intel]
vmx_dump_sel [kvm_intel]

# lsmod  | grep -i kvm
kvm_intel             335872  0
kvm                   987136  1 kvm_intel
irqbypass              16384  1 kvm

Trace specific functions only

To enable tracing of specific functions or patterns, you can make use of the set_ftrace_filter file to specify which functions from the above output you want to trace.
This file also accepts the * pattern, which expands to include additional functions with the given pattern. As an example, I am using the ext4 filesystem on my machine. I can specify ext4 specific kernel functions to trace using the following commands:

# mount | grep home
/dev/mapper/fedora-home on /home type ext4 (rw,relatime,seclabel)

# pwd
/sys/kernel/tracing

# cat set_ftrace_filter

#### all functions enabled ####
$
$ echo ext4_* > set_ftrace_filter
$
$ cat set_ftrace_filter
ext4_has_free_clusters
ext4_validate_block_bitmap
ext4_get_group_number
ext4_get_group_no_and_offset
ext4_get_group_desc
[...]

Now, when you see the tracing output, you can only see functions ext4 related to kernel functions for which you had set a filter earlier. All the other output gets ignored:

# cat trace |head -20

## tracer: function
#
# entries-in-buffer/entries-written: 3871/3871   #P:8
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| /     delay
#           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
#              | |         |   ||||      |         |
           cupsd-1066    [004] ....  3308.989545: ext4_file_getattr <-vfs_fstat
           cupsd-1066    [004] ....  3308.989547: ext4_getattr <-ext4_file_getattr
           cupsd-1066    [004] ....  3308.989552: ext4_file_getattr <-vfs_fstat
           cupsd-1066    [004] ....  3308.989553: ext4_getattr <-ext4_file_getattr
           cupsd-1066    [004] ....  3308.990097: ext4_file_open <-do_dentry_open
           cupsd-1066    [004] ....  3308.990111: ext4_file_getattr <-vfs_fstat
           cupsd-1066    [004] ....  3308.990111: ext4_getattr <-ext4_file_getattr
           cupsd-1066    [004] ....  3308.990122: ext4_llseek <-ksys_lseek
           cupsd-1066    [004] ....  3308.990130: ext4_file_read_iter <-new_sync_read

Exclude functions from being traced

You don't always know what you want to trace but, you surely know what you don't want to trace. For that, there is this file aptly named set_ftrace_notrace—notice the "no" in there. You can write your desired pattern in this file and enable tracing, upon which everything except the mentioned pattern gets traced. This is often helpful to remove common functionality that clutters our output:

# cat set_ftrace_notrace
#### no functions disabled ####

Targetted tracing

So far, you've been tracing everything that has happened in the kernel. But that won't help us if you wish to trace events related to a specific command. To achieve this, you can turn tracing on and off on-demand and, and in between them, run our command of choice so that you do not get extra output in your trace output. You can enable tracing by writing 1 to tracing_on, and 0 to turn it off:

# cat tracing_on
0

# echo 1 > tracing_on

# cat tracing_on
1

### Run some specific command that we wish to trace here ###

# echo 0 > tracing_on

# cat tracing_on
0

Tracing specific PID

If you want to trace activity related to a specific process that is already running, you can write that PID to a file named set_ftrace_pid and then enable tracing. That way, tracing is limited to this PID only, which is very helpful in some instances:

# echo $PID > set_ftrace_pid

Conclusion

Ftrace is a great way to learn more about the internal workings of the Linux kernel. With some practice, you can learn to fine-tune ftrace and narrow down your searches. To understand ftrace in more detail and its advanced usage, see these excellent articles written by the core author of ftrace himself—Steven Rostedt.

Puzzle pieces coming together to form a computer screen

trace-cmd is an easy-to-use, feature-rich utility for tracing Linux kernel functions.
Hand putting a Linux file folder into a drawer

Trace the thin layer between user processes and the Linux kernel with strace.
A Linux boot.

Let's examine the basics of kdump usage and look at the internals of kdump/kexec kernel implementation.
arrows cycle symbol for failing faster

The Continuous Kernel Integration (CKI) project aims to prevent bugs from entering the Linux kernel.

Topics

About the author

Gaurav Kamathe - Gaurav is a seasoned Software Engineering professional. His primary interests are Security, Linux, Malware. He loves working on the command-line and is mostly interested in low-level software and understanding how things work. Opinions expressed here are my own and not that of my employer