If you are allowed to choose which programming language to use for an application, you usually pick one you know and that offers the shortest path to your goal. If you require high runtime speed, programming languages that compile directly to machine code, like C++, are your best option.
In modern applications, the back and forth of memory addresses, jumps, loops, and the (sometimes unnecessary) copying of data areas makes up a huge share of the generated machine code. In this article, I'll highlight C++ move semantics, which enable you to avoid unnecessary copy operations. Even if you are not a programmer, you can still analyze memory allocations with the valgrind heap profiler massif.
Code: rvalues and lvalues
In C++ programming, you have to deal with lvalues and rvalues. In the example below, a is on the left side of the assignment operator =, so it is an lvalue. The assigned value of 5 is on the right side, so it is the rvalue.
a = 5;
When this line is compiled, the lvalue is interpreted as a symbolic address that can be altered afterward. The rvalue is a pure, hardcoded value. It cannot be accessed by subsequent code because rvalues don't have an address. If you can determine an expression's address (or the compiler would allow you to), it is an lvalue. That is, if an expression survives the semicolon at the end of a line, it is an lvalue; otherwise, it's an rvalue.
An lvalue can appear on both the left and the right side of an assignment; an rvalue can appear only on the right side.
a = 5;
b = a;
Since C++11, move semantics and rvalue references have been part of the standard. Rvalue references are marked with a double ampersand (&&). They allow you to treat an lvalue as an rvalue. Especially when creating objects, the use of rvalue references can increase performance, as you will see in the example code.
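To get a feel for the syntax before diving into the sample, here is a minimal, self-contained sketch (independent of the sample repository; the function name consume is made up for illustration) showing how the compiler picks between a copy overload and a move overload:

#include <iostream>
#include <string>
#include <utility>

// Overload taking an lvalue reference: the caller keeps its data.
void consume(const std::string &s) {
    std::cout << "copy overload: " << s << '\n';
}

// Overload taking an rvalue reference: the caller's data may be moved from.
void consume(std::string &&s) {
    std::cout << "move overload: " << s << '\n';
}

int main() {
    std::string text = "hello";
    consume(text);             // text is an lvalue: copy overload
    consume(std::move(text));  // std::move casts it to an rvalue: move overload
    consume("temporary");      // a temporary is an rvalue: move overload
}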
There are a lot of articles on the web that cover the syntax and correct usage. The Chromium Project documentation offers a good introduction to the subject. In this article, we will use an example to reveal the actual impact on performance.
git clone https://github.com/hANSIc99/optimizing_cpp_sample.git
The default behavior
Imagine you have a class called MyObject that takes another class, MyType, as an argument in its constructor. In order to create an instance of MyObject, an instance of MyType has to be created in advance.
In code, it looks like this (main.cpp, line 16):
MyType<double> type_1(container);
MyObject<MyType<double> > object_1(type_1);
MyType is a template class that expects a double vector in its constructor (you do not need to be concerned about this property). MyObject, also a template class, takes an instance of MyType (type_1) in its constructor.
Using MyObject:
- Invokes the constructor of MyType to create an instance (type_1)
- Invokes the copy constructor of MyType (makes a copy of type_1)
- Invokes the constructor of MyObject (takes the copy of type_1 as an argument)
- (… do some work with object_1 …)
- Invokes the destructor of type_1
- Invokes the destructor of object_1, which causes it to invoke the destructor of the copy of type_1
If the instance of MyType (type_1) exists only to create the instance of MyObject (object_1), a lot of unnecessary code will be executed.
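The full implementation lives in the repository; the following reduced sketch (the class and member names follow the sample, but the bodies are simplified and may differ from the actual code) illustrates why the copy is expensive: the copy constructor duplicates the entire internal vector, which triggers a fresh heap allocation:

#include <iostream>
#include <utility>
#include <vector>

// Reduced sketch: MyType owns a potentially large vector of values.
template <typename T>
class MyType {
public:
    MyType(std::vector<T> data) : m_data(std::move(data)) {
        std::cout << "MyType::MyType() constructor called\n";
    }
    // The copy constructor duplicates every element of m_data.
    MyType(const MyType &other) : m_data(other.m_data) {
        std::cout << "MyType::MyType(const MyType&) copy constructor called\n";
    }
    ~MyType() {
        std::cout << "MyType::~MyType() destructor called\n";
    }
    std::vector<T> m_data;
};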
You can try it yourself: change the directory and invoke make:
cd optimizing_cpp_sample
make CFLAGS=-DOPT1
After it compiles, invoke the sample program:
./memory_sample
You should now see trace messages of the different constructor types:
MyType::MyType() contructor called
MyType::MyType(const MyType&) copy constructor called
MyObject::MyObject(const T& mytype) constructor called
MyType::m_data contains 32767 elements
MyType::~MyType() destructor called
MyType::~MyType() destructor called
Optimizing the example
Clean the binaries, invoke make with different arguments, and run the program again:
make clean
make CFLAGS=-DOPT2
./memory_sample
This time, the constructors' trace messages look a bit different:
MyType::MyType() contructor called
MyType::MyType(MyType&& other) move constructor called
MyObject::MyObject(T&& mytype) constructor with move called
MyType::~MyType() destructor called
MyType::m_data contains 32767 elements
MyType::~MyType() destructor called
Instead of invoking the MyType copy constructor, the move constructor is called. Also note that the first MyType destructor now runs before the print statement: the moved-from temporary is destroyed immediately, and because its data was transferred rather than copied, no expensive deallocation takes place. As you will see below, this has a positive impact on runtime performance.
Look at main.cpp, line 23, in the code. It shows that object_2 is directly created with the argument that would otherwise have been passed to the constructor of MyType:
MyObject<MyType<double> > object_2(container);
That's it: just one line invokes MyObject's constructor with the arguments for MyType's constructor. The compiler detects that there is no need to keep a separate instance of MyType in memory. Instead of making another copy, the internals of the temporary MyType are moved into the instance stored inside object_2.
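How can such a constructor look? A minimal sketch (again, the actual class is in the repository and may differ) takes its argument by rvalue reference and moves it into the member, so the contents of MyType are transferred rather than copied:

#include <iostream>
#include <utility>

// Reduced sketch of a move-enabled MyObject.
template <typename T>
class MyObject {
public:
    // The rvalue-reference parameter binds to temporaries (or to lvalues
    // cast with std::move); std::move then hands the internals of mytype
    // over to m_mytype instead of copying them element by element.
    MyObject(T &&mytype) : m_mytype(std::move(mytype)) {
        std::cout << "MyObject::MyObject(T&& mytype) constructor with move called\n";
    }

    T m_mytype;
};

With a constructor like this, passing container creates a temporary MyType that is immediately moved into m_mytype, which matches the trace output above.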
Move with care
Using move semantics requires a certain amount of caution. Run the program again:
make clean
make CFLAGS=-DOPT3
./memory_sample
The program crashes immediately:
MyType::MyType() contructor called.
MyType::MyType(MyType&& other) move constructor called
MyType::m_data contains 3 elements
Segmentation fault (core dumped)
What happened? Open main.cpp and move to line 27:
void case_3(){
    MyType<double> type_3({1.2, 3.4, 5.6});
    MyObject<MyType<double> > object_3(std::move(type_3));
    object_3.m_mytype.print();
    /* Dangerous: std::move destroys the object */
    type_3.print();
}
This forces the move constructor to be used to initialize object_3 (line 30) by wrapping type_3 in std::move when passing it as an argument.
After moving type_3 into object_3, you can no longer use type_3: its internal data has been moved away, so the subsequent call to type_3.print() accesses an object whose contents are gone and crashes the program.
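In standard-library terms, a moved-from object is left in a "valid but unspecified" state. A small, self-contained sketch using std::vector (instead of the sample classes) shows the safe pattern: don't read a moved-from object, but you may assign to it again:

#include <iostream>
#include <utility>
#include <vector>

int main() {
    std::vector<double> source{1.2, 3.4, 5.6};
    std::vector<double> target = std::move(source);  // steals source's buffer

    std::cout << "target holds " << target.size() << " elements\n";
    // source is now valid but unspecified: don't read its contents...
    source = {7.8};  // ...but assigning a new value makes it usable again
    std::cout << "source holds " << source.size() << " element\n";
}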
Measuring performance impact
You may have noticed that the execution time is displayed in the output. While you won't see any effect on a single invocation (like in the examples above), there is a noticeable improvement when you run your code in an infinite loop.
Without a move constructor:
make clean
make CFLAGS=-DOPT4
./memory_sample
On my system, execution takes an average of 170ms:
Average time: 170ms - Last execution took: 160um
Average time: 170ms - Last execution took: 179um
Average time: 170ms - Last execution took: 162um
With a move constructor:
make clean
make CFLAGS=-DOPT5
./memory_sample
This time, it takes an average of 143ms to execute:
Average time: 143ms - Last execution took: 142um
Average time: 143ms - Last execution took: 138um
Average time: 143ms - Last execution took: 142um
This (constructed) example achieves a 16% reduction in runtime. Note that compiler optimizations are switched off (-O0). With a higher degree of optimization (-O3), the runtime reduction would be less significant (11%).
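If you want to reproduce such measurements in your own code, a loop around std::chrono is enough. This is a generic sketch, not the measuring code from the repository; the work under test goes where the comment indicates:

#include <chrono>
#include <iostream>

int main() {
    using clock = std::chrono::steady_clock;
    long long total_us = 0;

    for (long long run = 1; ; ++run) {  // endless measuring loop
        const auto start = clock::now();
        // ... construct the objects under test here ...
        const auto stop = clock::now();

        const long long us =
            std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
        total_us += us;
        std::cout << "Average time: " << total_us / run
                  << "us - Last execution took: " << us << "us\n";
    }
}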
Analyzing memory allocations
The first choice for analyzing memory allocations on Linux-based systems is valgrind, or, to be precise, its heap profiler massif. In this case, executing the copy constructor causes heap memory allocations.
Run massif on the first example:
make clean
make CFLAGS=-DOPT1
valgrind --tool=massif ./memory_sample
This outputs a file called massif.out with a trailing process ID. This file can be read with ms_print:
ms_print massif.out.4781
It prints out a graph of the memory allocation over time. The graph shows increasing memory allocation, which is consistent with the implementation. The output of ms_print also includes information about which lines of code caused the memory allocations. I won't go into detail on how to read the output because there is excellent documentation available.
Conclusion
Regardless of the programming language you use, reducing copy operations (which, in this example, result in heap memory allocations) is a good way to increase runtime execution speed for performance-critical applications. When using C++, you need a couple of prerequisites to apply these optimizations:
- C++11, since move semantics have only been available since that release
- Availability of the move constructor/move assignment operator (see the rule of five and the sketch below)
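As a refresher, the rule of five says that a class that defines any one of destructor, copy constructor, copy assignment, move constructor, or move assignment should usually define all five. A minimal sketch for a hypothetical Buffer class owning a raw array (not part of the sample code):

#include <algorithm>
#include <cstddef>
#include <utility>

class Buffer {
public:
    explicit Buffer(std::size_t n) : m_size(n), m_data(new double[n]) {}
    ~Buffer() { delete[] m_data; }                               // 1. destructor

    Buffer(const Buffer &other)                                  // 2. copy constructor
        : m_size(other.m_size), m_data(new double[other.m_size]) {
        std::copy(other.m_data, other.m_data + m_size, m_data);
    }

    Buffer &operator=(const Buffer &other) {                     // 3. copy assignment
        Buffer tmp(other);               // copy-and-swap for exception safety
        std::swap(m_size, tmp.m_size);
        std::swap(m_data, tmp.m_data);
        return *this;
    }

    Buffer(Buffer &&other) noexcept                              // 4. move constructor
        : m_size(other.m_size), m_data(other.m_data) {
        other.m_size = 0;
        other.m_data = nullptr;          // leave the source safe to destroy
    }

    Buffer &operator=(Buffer &&other) noexcept {                 // 5. move assignment
        if (this != &other) {
            delete[] m_data;
            m_size = other.m_size;
            m_data = other.m_data;
            other.m_size = 0;
            other.m_data = nullptr;
        }
        return *this;
    }

private:
    std::size_t m_size;
    double *m_data;
};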
On a modern x86-64 CPU, heap memory allocations are so fast you won't notice whether an application is optimized or not without precise measurement. The less powerful your CPU, the more it makes sense to optimize runtime code. For example, on mobile devices, not only can it improve the response behavior, but it can also extend battery life.