In my previous article, I showed how debuginfo is used to map between the current instruction pointer (IP) and the function or line containing it. That information is valuable in showing what code the processor is currently executing. However, having more context for the calls that lead up to the current function and line being executed is also extremely helpful.
For example, suppose a function in a library has an illegal memory access due to a null pointer being passed as a parameter into the function. Just looking at the current function and line shows that the fault was triggered by attempted access through a null pointer. However, what you really want to know is the full context of the active function calls leading up to that null pointer access, so you can determine how that null pointer was initially passed into the library function. This context information is provided by a backtrace, and allows you to determine which functions could be responsible for the bogus parameter.
One thing’s certain: Determining the currently active function calls is a non-trivial operation.
Function activation records
Modern programming languages have local variables and allow for recursion where a function can call itself. Also, concurrent programs have multiple threads that may have the same function running at the same time. The local variables cannot be stored in global locations in these situations. The locations of the local variables must be unique for each invocation of the function. Here’s how it works:
- The compiler produces a function activation record each time a function is called to store local variables in a unique location.
- For efficiency, the processor stack is used to store the function activation records.
- A new function activation record is created at the top of the processor stack for the function when it’s called.
- If that function calls another function, then a new function activation record is placed above the existing function activation record.
- Each time there is a return from a function, its function activation record is removed from the stack.
The function activation record is created by code in the function called the prologue, and it is removed by the function epilogue. The body of the function can use the memory set aside for it on the stack for temporary values and local variables.
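The need for a separate activation record per invocation can be seen in a minimal C sketch. The function names here are hypothetical and are not part of the example program used later in this article. Each invocation reads its local again after the recursive call returns, so all the copies are alive at the same time and must occupy different stack locations.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: records the address of one local variable per
 * invocation. The local is read again after the recursive call returns,
 * so every invocation's copy is live at the same time. */
int record_local_addresses(int depth, uintptr_t *addrs)
{
    volatile int local = depth;   /* lives in this activation record */

    assert(depth >= 0);
    addrs[depth] = (uintptr_t)&local;
    if (depth == 0)
        return local;
    return record_local_addresses(depth - 1, addrs) + local;
}

/* Returns 1 when all of the first n recorded addresses differ. */
int all_distinct(const uintptr_t *addrs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (addrs[i] == addrs[j])
                return 0;
    return 1;
}
```

Calling record_local_addresses(3, addrs) with a four-element array fills it with four addresses, one per active invocation, and all_distinct(addrs, 4) then reports that no two invocations shared a location.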
Function activation records can be variable size. For some functions, there’s no need for space to store local variables. Ideally, the function activation record only needs to store the return address of the function that called this function. For other functions, significant space may be required to store local data structures for the function in addition to the return address. This variation in frame sizes leads to compilers using frame pointers to track the start of the function’s activation frame. Now the function prologue code has the additional task of storing the old frame pointer before creating a new frame pointer for the current function, and the epilogue has to restore the old frame pointer value.
Because of the way the function activation record is laid out, the return address and the calling function's old frame pointer are at constant offsets from the current frame pointer. With the old frame pointer, the calling function’s activation frame on the stack can be located. This process is repeated until all the function activation records have been examined.
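The frame pointer walk can be sketched in C over a simulated stack. This is only an illustration under stated assumptions: the stack is modeled as an array of 8-byte slots, a frame pointer is a slot index rather than a real address, the saved frame pointer sits at slot fp with the return address one slot above it, and a saved frame pointer of 0 marks the outermost frame. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated stack: each element is one 8-byte slot. In this hypothetical
 * layout, slot [fp] holds the caller's frame pointer (as a slot index) and
 * slot [fp + 1] holds the return address. A saved frame pointer of 0 marks
 * the outermost frame. Returns the number of return addresses collected. */
size_t walk_frame_pointers(const uint64_t *stack, size_t nslots, size_t fp,
                           uint64_t *ret_addrs, size_t max_frames)
{
    size_t count = 0;

    assert(stack != NULL && ret_addrs != NULL);
    while (fp != 0 && fp + 1 < nslots && count < max_frames) {
        size_t caller_fp;

        ret_addrs[count++] = stack[fp + 1];  /* constant offset from fp */
        caller_fp = (size_t)stack[fp];       /* old frame pointer */
        if (caller_fp != 0 && caller_fp <= fp)
            break;   /* a valid chain always moves toward older frames */
        fp = caller_fp;
    }
    return count;
}
```

The check that each saved frame pointer lies above the current one is a cheap guard against looping forever on a corrupted chain, which is a common precaution in unwinders of this style.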
Optimization complications
There are a couple of disadvantages to having explicit frame pointers in code. On some processors, there are relatively few registers available. Because the frame pointer must occupy one of those registers, fewer registers remain for other values, which leads to more memory operations and slower code. Explicit frame pointers may also constrain the code that the compiler can generate, because the compiler may not intermix the function prologue and epilogue code with the body of the function.
The compiler’s goal is to generate fast code where possible, so compilers typically omit frame pointers from generated code. Keeping frame pointers can significantly lower performance, as shown by Phoronix’s benchmarking. The downside of omitting frame pointers is that the calling function’s activation frame and the return address can no longer be found at simple offsets from the frame pointer.
Call Frame Information
To aid in the generation of function backtraces, the compiler includes DWARF Call Frame Information (CFI) to reconstruct frame pointers and to find return addresses. This supplemental information is stored in the .eh_frame section of the executable. Unlike traditional debuginfo for function and line location information, the .eh_frame section is in the executable even when the executable is generated without debug information, or when the debug information has been stripped from the file. The call frame information is essential for the operation of language constructs like throw-catch in C++.
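A running program can check for this information in its own image. The following is a rough sketch that assumes Linux with a C library providing getauxval(), and a default link that emits a PT_GNU_EH_FRAME program header (the segment that locates .eh_frame_hdr, the lookup index for .eh_frame). The function name is hypothetical.

```c
#include <elf.h>
#include <link.h>
#include <stddef.h>
#include <sys/auxv.h>

/* Returns 1 if the main program's program headers include
 * PT_GNU_EH_FRAME, and 0 otherwise. Assumes Linux: the kernel passes the
 * location and count of the program headers in the auxiliary vector. */
int main_program_has_eh_frame_hdr(void)
{
    const ElfW(Phdr) *phdr = (const ElfW(Phdr) *)getauxval(AT_PHDR);
    unsigned long phnum = getauxval(AT_PHNUM);

    if (phdr == NULL)
        return 0;
    for (unsigned long i = 0; i < phnum; i++)
        if (phdr[i].p_type == PT_GNU_EH_FRAME)
            return 1;
    return 0;
}
```

Because the check reads only program headers, it gives the same answer for a stripped binary, which fits the point that this unwind data survives the removal of debug information.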
The CFI has a Frame Description Entry (FDE) for each function. As one of its steps, the backtrace generation process finds the appropriate FDE for the current activation frame being examined. Think of the FDE as a table, with each row representing one or more instructions, with these columns:
- Canonical Frame Address (CFA), the location the frame pointer would point to
- The return address
- Information about other registers
The encoding of the FDE is designed to minimize the amount of space required. The FDE describes the changes between rows rather than fully specifying each row. To further compress the data, starting information common to multiple FDEs is factored out and placed in Common Information Entries (CIE). This makes the FDE more compact, but it also requires more work to compute the actual CFA and find the return address location. The tool must start from the uninitialized state. It steps through the entries in the CIE to get the initial state on function entry, then it processes the FDE from its first entry, applying operations until it reaches the row that covers the instruction pointer currently being analyzed.
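The replay of CIE and FDE operations can be sketched as a small interpreter. This is a simplified illustration, not how any particular tool implements it: the operations are held in a hypothetical struct form instead of DWARF's compact byte encoding, only the CFA offset column is tracked, and the CFA register is assumed to stay the stack pointer throughout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for three CFI operations. Real DWARF encodes these
 * as compact byte codes; this hypothetical struct form is only for clarity. */
enum cfi_op_kind {
    OP_DEF_CFA_OFFSET,  /* set the CFA offset from the stack pointer */
    OP_ADVANCE_LOC,     /* close the current row, move forward by arg bytes */
    OP_NOP              /* padding */
};

struct cfi_op {
    enum cfi_op_kind kind;
    uint64_t arg;
};

/* Replays the CIE ops (initial state), then the FDE ops, stopping once the
 * row that covers pc has been built. Returns the CFA offset for that row,
 * or -1 if pc is outside [start, end). */
int64_t cfa_offset_at(const struct cfi_op *cie, size_t ncie,
                      const struct cfi_op *fde, size_t nfde,
                      uint64_t start, uint64_t end, uint64_t pc)
{
    int64_t offset = 0;     /* uninitialized state */
    uint64_t loc = start;   /* first address the current row applies to */

    assert(cie != NULL && fde != NULL);
    if (pc < start || pc >= end)
        return -1;
    for (size_t i = 0; i < ncie; i++)
        if (cie[i].kind == OP_DEF_CFA_OFFSET)
            offset = (int64_t)cie[i].arg;
    for (size_t i = 0; i < nfde; i++) {
        switch (fde[i].kind) {
        case OP_ADVANCE_LOC:
            if (pc < loc + fde[i].arg)
                return offset;   /* the current row covers pc */
            loc += fde[i].arg;
            break;
        case OP_DEF_CFA_OFFSET:
            offset = (int64_t)fde[i].arg;
            break;
        case OP_NOP:
            break;
        }
    }
    return offset;   /* the last row runs to the end of the function */
}
```

For a made-up function covering 0x1000 to 0x1020 whose CIE sets the offset to 8 and whose FDE is "advance 4, set offset 24, advance 20, set offset 8", the sketch reports 8 for addresses before 0x1004, 24 from 0x1004 up to 0x1018, and 8 again afterwards.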
Example use of Call Frame Information
Start with a simple example with a function that converts Fahrenheit to Celsius. Inlined functions do not have entries in the CFI, so the __attribute__((noinline)) for the f2c function ensures the compiler keeps f2c as a real function.
#include <stdio.h>

int __attribute__ ((noinline)) f2c(int f)
{
    int c;
    printf("converting\n");
    c = (f-32.0) * 5.0 /9.0;
    return c;
}

int main (int argc, char *argv[])
{
    int f;
    scanf("%d", &f);
    printf ("%d Fahrenheit = %d Celsius\n",
            f, f2c(f));
    return 0;
}
Compile the code with:
$ gcc -O2 -g -o f2c f2c.c
The .eh_frame section is there as expected:
$ eu-readelf -S f2c |grep eh_frame
[17] .eh_frame_hdr PROGBITS 0000000000402058 00002058 00000034 0 A 0 0 4
[18] .eh_frame PROGBITS 0000000000402090 00002090 000000a0 0 A 0 0 8
You can get the CFI in human-readable form with:
$ readelf --debug-dump=frames f2c > f2c.cfi
Generate a disassembly file of the f2c binary so you can look up the addresses of the f2c and main functions:
$ objdump -d f2c > f2c.dis
Find the following lines in f2c.dis to see the start of f2c and main:
0000000000401060 <main>:
0000000000401190 <f2c>:
In many cases, all the functions in the binary use the same CIE to define the initial conditions before a function’s first instruction is executed. In this example, both f2c and main use the following CIE:
00000000 0000000000000014 00000000 CIE
Version: 1
Augmentation: "zR"
Code alignment factor: 1
Data alignment factor: -8
Return address column: 16
Augmentation data: 1b
DW_CFA_def_cfa: r7 (rsp) ofs 8
DW_CFA_offset: r16 (rip) at cfa-8
DW_CFA_nop
DW_CFA_nop
For this example, don’t worry about the Augmentation or Augmentation data entries. Because x86_64 processors have variable-length instructions from 1 to 15 bytes in size, the “Code alignment factor” is set to 1. On a processor that only has 32-bit (4-byte) instructions, this would be set to 4 and would allow more compact encoding of how many bytes a row of state information applies to. In a similar fashion, the “Data alignment factor” makes the encoding of offsets relative to the CFA more compact. On x86_64, the stack slots are 8 bytes in size.
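The effect of the two factors can be shown with a little arithmetic. In the encoded form, an operand such as the offset in DW_CFA_offset or the delta in DW_CFA_advance_loc is stored as a small factored number, and the consumer multiplies it by the matching alignment factor. The readelf output in this article shows the values after multiplication. The helper names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset from the CFA described by a factored operand, as used by
 * DW_CFA_offset: operand * data alignment factor. */
int64_t cfa_relative_offset(uint64_t factored, int64_t data_align)
{
    return (int64_t)factored * data_align;
}

/* Byte distance covered by a factored advance operand:
 * operand * code alignment factor. */
uint64_t advance_bytes(uint64_t factored, uint64_t code_align)
{
    assert(code_align != 0);
    return factored * code_align;
}
```

With the data alignment factor of -8 from this CIE, a factored operand of 1 yields cfa-8 and an operand of 2 yields cfa-16. With a code alignment factor of 4, as on a processor with fixed 4-byte instructions, an advance operand of 3 would cover 12 bytes of code.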
The column in the virtual table that holds the return address is 16. This is used in the instructions at the tail end of the CIE. There are four DW_CFA instructions. The first instruction, DW_CFA_def_cfa, describes how to compute the Canonical Frame Address (CFA) that a frame pointer would point at if the code had a frame pointer. In this case, the CFA is computed from r7 (rsp) and CFA=rsp+8.
The second instruction, DW_CFA_offset, defines where to obtain the return address: CFA-8. In this case, the return address is at the location the stack pointer currently points to, (rsp+8)-8. The CFA starts right above the return address on the stack.
The DW_CFA_nop instructions at the end of the CIE are padding to keep alignment in the DWARF information. The FDE can also have padding at the end for alignment.
Find the FDE for main in f2c.cfi, which covers the main function from 0x401060 up to, but not including, 0x401097:
00000084 0000000000000014 00000088 FDE cie=00000000 pc=0000000000401060..0000000000401097
DW_CFA_advance_loc: 4 to 0000000000401064
DW_CFA_def_cfa_offset: 32
DW_CFA_advance_loc: 50 to 0000000000401096
DW_CFA_def_cfa_offset: 8
DW_CFA_nop
Before executing the first instruction in the function, the CIE describes the call frame state. However, as the processor executes instructions in the function, the details will change. First, the DW_CFA_advance_loc and DW_CFA_def_cfa_offset operations match up with the first instruction in main at 401060. That instruction adjusts the stack pointer down by 0x18 (24 bytes). The CFA has not changed location but the stack pointer has, so the correct computation for the CFA at 401064 is rsp+32 (the original 8 plus 24). That’s the extent of the prologue in this code. Here are the first couple of instructions in main:
0000000000401060 <main>:
401060: 48 83 ec 18 sub $0x18,%rsp
401064: bf 1b 20 40 00 mov $0x40201b,%edi
The DW_CFA_advance_loc makes the current row apply to the next 50 bytes of code in the function, until 401096. The CFA is at rsp+32 until the stack adjustment instruction at 401092 completes execution. The DW_CFA_def_cfa_offset updates the calculation of the CFA to be the same as on entry into the function. This is expected, because the next instruction at 401096 is the return instruction (ret), which pops the return address off the stack.
401090: 31 c0 xor %eax,%eax
401092: 48 83 c4 18 add $0x18,%rsp
401096: c3 ret
The FDE for the f2c function uses the same CIE as the main function, and covers the range of 0x401190 to 0x4011c3:
00000068 0000000000000018 0000006c FDE cie=00000000 pc=0000000000401190..00000000004011c3
DW_CFA_advance_loc: 1 to 0000000000401191
DW_CFA_def_cfa_offset: 16
DW_CFA_offset: r3 (rbx) at cfa-16
DW_CFA_advance_loc: 29 to 00000000004011ae
DW_CFA_def_cfa_offset: 8
DW_CFA_nop
DW_CFA_nop
DW_CFA_nop
The objdump output for the f2c function in the binary:
0000000000401190 <f2c>:
401190: 53 push %rbx
401191: 89 fb mov %edi,%ebx
401193: bf 10 20 40 00 mov $0x402010,%edi
401198: e8 93 fe ff ff call 401030 <puts@plt>
40119d: 66 0f ef c0 pxor %xmm0,%xmm0
4011a1: f2 0f 2a c3 cvtsi2sd %ebx,%xmm0
4011a5: f2 0f 5c 05 93 0e 00 subsd 0xe93(%rip),%xmm0 # 402040 <__dso_handle+0x38>
4011ac: 00
4011ad: 5b pop %rbx
4011ae: f2 0f 59 05 92 0e 00 mulsd 0xe92(%rip),%xmm0 # 402048 <__dso_handle+0x40>
4011b5: 00
4011b6: f2 0f 5e 05 92 0e 00 divsd 0xe92(%rip),%xmm0 # 402050 <__dso_handle+0x48>
4011bd: 00
4011be: f2 0f 2c c0 cvttsd2si %xmm0,%eax
4011c2: c3 ret
In the FDE for f2c, there’s a single-byte instruction at the beginning of the function, which the DW_CFA_advance_loc steps over. Following the advance operation, there are two additional operations. A DW_CFA_def_cfa_offset changes the CFA to %rsp+16 and a DW_CFA_offset indicates that the initial value in %rbx is now at CFA-16 (the top of the stack).
Looking at the f2c disassembly, you can see that a push is used to save %rbx onto the stack. One of the advantages of omitting the frame pointer in the code generation is that compact instructions like push and pop can be used to store and retrieve values from the stack. In this case, %rbx is saved because %rdi is used to pass the argument to the printf function (actually converted to a puts call), but the initial value of f passed into the function needs to be kept for the later computation, so it is copied into %rbx. The DW_CFA_advance_loc of 29 bytes to 4011ae shows the next state change just after pop %rbx, which recovers the original value of %rbx. The DW_CFA_def_cfa_offset notes that the pop changed the CFA to be %rsp+8.
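The CFA offsets in both FDEs follow directly from how far the stack pointer has moved since function entry. Here is a minimal sketch of that bookkeeping, assuming the x86_64 starting state from the CIE (CFA=rsp+8) and a hypothetical helper that takes the list of stack pointer changes made so far, negative when the stack grows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Given the stack pointer changes made since function entry (negative when
 * the stack grows), returns the offset to add to the current rsp to reach
 * the CFA. On entry the CFA is rsp+8, as the CIE states. */
int64_t cfa_offset_after(const int64_t *rsp_deltas, size_t n)
{
    int64_t offset = 8;   /* CIE: CFA = rsp + 8 on entry */

    for (size_t i = 0; i < n; i++)
        offset -= rsp_deltas[i];   /* rsp moves down, so the offset grows */
    assert(offset >= 8);   /* this sketch expects rsp at or below its entry value */
    return offset;
}
```

For f2c, the push moves the stack pointer by -8 and gives an offset of 16, and the later pop (+8) brings it back to 8. For main, the sub of 0x18 gives 32 and the matching add returns it to 8. These are the values recorded by the DW_CFA_def_cfa_offset operations.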
GDB using the Call Frame Information
Having the CFI allows GNU Debugger (GDB) and other tools to generate accurate backtraces. Without CFI, GDB would have a difficult time finding the return address. You can see GDB making use of this information if you set a breakpoint at line 7 of f2c.c. GDB puts the breakpoint before the pop %rbx in the f2c function is done, so the return address is not at the top of the stack.
GDB is able to unwind the stack, and as a bonus is also able to fetch the argument f that was currently saved on the stack:
$ gdb f2c
[...]
(gdb) break f2c.c:7
Breakpoint 1 at 0x40119d: file f2c.c, line 7.
(gdb) run
Starting program: /home/wcohen/present/202207youarehere/f2c
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
98
converting
Breakpoint 1, f2c (f=98) at f2c.c:8
8 return c;
(gdb) where
#0 f2c (f=98) at f2c.c:8
#1 0x000000000040107e in main (argc=<optimized out>, argv=<optimized out>)
at f2c.c:15
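The unwinding behind this backtrace can be approximated with a short sketch over simulated stack memory. This is an illustration under stated assumptions, not GDB's implementation: the stack addresses and slot contents are hypothetical apart from the return address 0x40107e shown in the backtrace, the CFA offsets are the ones from the FDE rows covering each pc (16 in f2c at the breakpoint, 32 in main), and the return address is at CFA-8 as the CIE specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simulated stack memory: slots[i] holds the 8-byte value stored at
 * address base + 8*i. The addresses are hypothetical. */
struct sim_stack {
    uint64_t base;
    const uint64_t *slots;
    size_t nslots;
};

/* Reads the 8-byte slot at addr, which must be aligned and in range. */
uint64_t read_slot(const struct sim_stack *s, uint64_t addr)
{
    assert(addr >= s->base && (addr - s->base) % 8 == 0);
    assert((addr - s->base) / 8 < s->nslots);
    return s->slots[(addr - s->base) / 8];
}

/* One unwind step: with CFA = rsp + cfa_offset and the return address at
 * CFA-8, report the caller's pc and the caller's rsp (which is the CFA). */
void unwind_step(const struct sim_stack *s, uint64_t rsp, int64_t cfa_offset,
                 uint64_t *caller_pc, uint64_t *caller_rsp)
{
    uint64_t cfa = rsp + (uint64_t)cfa_offset;

    *caller_pc = read_slot(s, cfa - 8);
    *caller_rsp = cfa;
}
```

Starting with the stack pointer at the saved %rbx slot, one step with an offset of 16 yields 0x40107e, the address in main reported as frame #1. A second step with an offset of 32 skips main's 24 bytes of stack space and finds the address that main itself returns to.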
Call Frame Information
The DWARF Call Frame Information provides a flexible way for a compiler to include information for accurate unwinding of the stack. This makes it possible to determine the currently active function calls. I’ve provided a brief introduction in this article, but for more details on how DWARF implements this mechanism, see the DWARF specification.