Optimize runtime performance with C++'s move semantics
Reducing copy operations is a good way to increase runtime execution speed for performance-critical applications.
If you are free to choose the programming language for an application, you usually pick one you know and that offers the shortest path to your goal. If you need high runtime speed, programming languages that compile directly to machine code, like C++, are your best option.
In modern applications, a large share of the executed machine code deals with memory addresses, jumps, loops, and the (sometimes unnecessary) copying of data. In this article, I'll highlight C++ move semantics, which let you avoid unnecessary copy operations. Even if you are not a programmer, you can still analyze memory allocations with the valgrind heap profiler massif.
Code: rvalues and lvalues
In C++ programming you have to deal with
lvalues and
rvalues. In the example below,
a is on the left side of the assignment operator
= so it is an
lvalue. The assigned value of
5 is on the right side, so it is the
rvalue.
a = 5;
When this line is compiled, the lvalue is interpreted as a symbolic address whose content can be altered afterwards. The rvalue is a pure hardcoded value. It cannot be accessed by subsequent code because it has no address you can refer to. As a rule of thumb, if you can take an expression's address (that is, the compiler accepts the & operator on it), it is an lvalue. Put differently, if an expression survives the semicolon at the end of the statement, it is an lvalue; otherwise, it's an rvalue.
An lvalue can appear on both the left and the right side of an assignment. An rvalue can appear only on the right side.
a = 5;
b = a;
Move semantics and rvalue references have been part of the standard since C++11.
Rvalue references are marked with a double ampersand &&. They bind to rvalues, and together with std::move they allow you to treat an lvalue as an rvalue. Especially when creating objects, the use of rvalue references can increase performance, as we will see in the example code.
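To make the binding rules concrete, here is a minimal sketch that is not taken from the sample repository. The function name category and the three demo_ helpers are made up for illustration. They only report which overload the compiler selects for a named variable, a literal, and the result of std::move.

```cpp
#include <string>
#include <utility>

// Hypothetical helpers: each overload reports which kind of argument it received.
std::string category(const int&) { return "lvalue"; }  // binds to named objects
std::string category(int&&)      { return "rvalue"; }  // binds to temporaries

// A named variable is an lvalue, so the const-reference overload is chosen.
std::string demo_lvalue() {
    int a = 5;
    return category(a);
}

// A literal is an rvalue, so the rvalue-reference overload is preferred.
std::string demo_literal() {
    return category(5);
}

// std::move does not move anything by itself; it casts the lvalue to an rvalue,
// which makes the rvalue-reference overload eligible.
std::string demo_move() {
    int a = 5;
    return category(std::move(a));
}
```

Calling the three helpers from main should yield "lvalue", "rvalue", and "rvalue", in that order. The same overload selection decides whether a copy constructor or a move constructor runs.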
There are a lot of articles on the web that cover the syntax and the correct usage. The Chromium Project Documentation offers a good introduction to the subject. In this article, we will use an example to reveal the actual impact on performance.
git clone https://github.com/hANSIc99/optimizing_cpp_sample.git
The default behavior
Imagine you have a class called
MyObject that takes another class,
MyType, as an argument in its constructor. In order to create an instance of
MyObject, an instance of
MyType has to be created in advance.
In code, it looks like this (
main.cpp line 16):
MyType<double> type_1(container);
MyObject<MyType<double> > object_1(type_1);
MyType is a template class that expects a double vector in its constructor (you do not need to be concerned about this property).
MyObject, also a template class, takes an instance of
MyType (type_1) in its constructor.
Using MyObject:
- Invokes the constructor of MyType to create an instance (type_1)
- Invokes the copy constructor of MyType (makes a copy of type_1)
- Invokes the constructor of MyObject (takes the copy of type_1 as an argument)
- (… do some work with object_1 … )
- Invokes the destructor of type_1
- Invokes the destructor of object_1, which causes it to invoke the destructor of the copy of type_1
If the instance of
MyType (
type_1) exists only to create the instance of
MyObject (
object_1), a lot of unnecessary code will be executed.
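The call sequence above can be reproduced with a stripped-down sketch. SimpleType, SimpleObject, and the counter g_copies are hypothetical stand-ins, not the classes from the sample repository. The sketch only assumes that the object stores its argument by value, and it counts how often the copy constructor runs.

```cpp
#include <utility>
#include <vector>

// Counts copy-constructor calls (illustrative global, not part of the repository).
int g_copies = 0;

// Hypothetical stand-in for MyType: owns a vector of doubles.
struct SimpleType {
    std::vector<double> data;

    explicit SimpleType(std::vector<double> values) : data(std::move(values)) {}

    // Copy constructor: duplicates the vector and records the copy.
    SimpleType(const SimpleType& other) : data(other.data) { ++g_copies; }
};

// Hypothetical stand-in for MyObject: stores its own SimpleType.
struct SimpleObject {
    SimpleType member;

    // Only a const-reference constructor exists, so the argument is always copied.
    explicit SimpleObject(const SimpleType& type) : member(type) {}
};

// Mirrors the two lines from main.cpp and returns the number of copies made.
int copies_on_default_path() {
    g_copies = 0;
    std::vector<double> container(1000, 1.0);
    SimpleType type_1(container);
    SimpleObject object_1(type_1);
    return g_copies;
}
```

In this sketch, copies_on_default_path() reports one copy of SimpleType. That copy is the work that is wasted when type_1 exists only to initialize object_1.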
You can try it yourself: change the directory and invoke
make:
cd optimizing_cpp_sample
make CFLAGS=-DOPT1
After it compiles, invoke the sample program:
./memory_sample
You should now see trace messages of the different constructor types:
MyType::MyType() contructor called
MyType::MyType(const MyType&) copy constructor called
MyObject::MyObject(const T& mytype) constructor called
MyType::m_data contains 32767 elements
MyType::~MyType() destructor called
MyType::~MyType() destructor called
Optimizing the example
Clean the binaries, invoke
make with different arguments, and run the program again:
make clean
make CFLAGS=-DOPT2
./memory_sample
This time, the constructors' trace messages look a bit different:
MyType::MyType() contructor called
MyType::MyType(MyType&& other) move constructor called
MyObject::MyObject(T&& mytype) constructor with move called
MyType::~MyType() destructor called
MyType::m_data contains 32767 elements
MyType::~MyType() destructor called
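Instead of invoking the MyType copy constructor, it calls the move constructor. The MyType destructor still appears twice in the trace, but the first call belongs to the temporary instance, which is destroyed right after its contents have been moved and should have nothing left to release. As you will see below, this has a positive impact on the runtime performance.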
Look at
main.cpp line 23 in the code. It shows that
object_2 is directly created with the argument that should have been passed to the constructor of
MyType:
MyObject<MyType<double> > object_2(container);
That's it: just one line invokes MyObject's constructor with the arguments for MyType's constructor. The compiler creates a temporary MyType from container, and because that temporary is an rvalue, it selects the MyObject constructor that takes an rvalue reference. Instead of making another copy, the move constructor sets the internal pointer of the MyType instance inside object_2 to the data of the previously created instance.
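The pointer hand-over described here can be sketched as follows. Buffer, Holder, and the counter g_allocations are hypothetical names chosen for this illustration and are not the repository's implementation. The sketch assumes a class that owns heap memory through a raw pointer, and it counts allocations to compare the copy path with the move path.

```cpp
#include <cstddef>
#include <utility>

// Counts heap allocations made by Buffer (illustrative global).
int g_allocations = 0;

// Hypothetical class that owns heap memory through a raw pointer.
class Buffer {
public:
    explicit Buffer(std::size_t size) : size_(size), data_(new double[size]()) {
        ++g_allocations;
    }

    // Copy constructor: allocates a second block and duplicates the contents.
    Buffer(const Buffer& other) : size_(other.size_), data_(new double[other.size_]) {
        ++g_allocations;
        for (std::size_t i = 0; i < size_; ++i) data_[i] = other.data_[i];
    }

    // Move constructor: takes over the pointer and leaves the source empty.
    Buffer(Buffer&& other) noexcept : size_(other.size_), data_(other.data_) {
        other.size_ = 0;
        other.data_ = nullptr;
    }

    ~Buffer() { delete[] data_; }  // deleting a null pointer is a no-op

    Buffer& operator=(const Buffer&) = delete;  // assignment left out for brevity

private:
    std::size_t size_;
    double* data_;
};

// Hypothetical holder with one constructor per value category.
class Holder {
public:
    explicit Holder(const Buffer& buffer) : buffer_(buffer) {}        // copies
    explicit Holder(Buffer&& buffer) : buffer_(std::move(buffer)) {}  // moves

private:
    Buffer buffer_;
};

// Named source object: the const-reference constructor copies it.
int allocations_with_copy() {
    g_allocations = 0;
    Buffer source(1024);
    Holder holder(source);
    return g_allocations;
}

// Temporary source object: the rvalue-reference constructor takes its pointer.
int allocations_with_temporary() {
    g_allocations = 0;
    Holder holder{Buffer(1024)};
    return g_allocations;
}
```

In this sketch, the copy path allocates twice and the temporary path allocates once. The destructor of the temporary still runs, but it only sees a null pointer.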
Move with care
Using move semantics requires a certain amount of caution. Run the program again:
make clean
make CFLAGS=-DOPT3
./memory_sample
The program crashes immediately:
MyType::MyType() contructor called.
MyType::MyType(MyType&& other) move constructor called
MyType::m_data contains 3 elements
Segmentation fault (core dumped)
What happened? Open
main.cpp and move to line 27:
void case_3(){
    MyType<double> type_3({1.2, 3.4, 5.6});
    MyObject<MyType<double> > object_3(std::move(type_3));
    object_3.m_mytype.print();
    /* Dangerous: std::move destroys the object */
    type_3.print();
}
This forces the use of the move constructor to initialize object_3 (line 30) by wrapping type_3 in std::move when passing it as an argument.
After moving type_3 into object_3, you must not rely on the contents of type_3 anymore. The object itself still exists until the end of the function, but its data now belongs to object_3, so the call to type_3.print() crashes.
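One way to avoid this kind of crash is to make the accessors of a class tolerate the moved-from state. The following sketch uses a hypothetical Samples type, not the repository's MyType, and assumes the data is owned through std::unique_ptr, which is guaranteed to be null after it has been moved from.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical type that owns its data through a smart pointer.
struct Samples {
    std::unique_ptr<std::vector<double>> data;

    explicit Samples(std::vector<double> values)
        : data(new std::vector<double>(std::move(values))) {}

    // Defensive accessor: checks the pointer instead of dereferencing it blindly.
    std::size_t count() const { return data ? data->size() : 0; }
};

// Element count seen through the object that received the data.
std::size_t count_in_target() {
    Samples type_3(std::vector<double>{1.2, 3.4, 5.6});
    Samples object_3(std::move(type_3));
    return object_3.count();
}

// Element count seen through the moved-from object: the pointer is null,
// so the defensive accessor reports zero instead of crashing.
std::size_t count_in_moved_from_source() {
    Samples type_3(std::vector<double>{1.2, 3.4, 5.6});
    Samples object_3(std::move(type_3));
    return type_3.count();
}
```

The target reports three elements and the moved-from source reports zero. Even with such a guard, the safer habit is to assign a new value to a moved-from object before using it again, or not to touch it at all.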
Measuring performance impact
You may have noticed that the execution time is displayed in the output. While you won't see any effect on a single invocation (like in the examples above), there is a noticeable improvement when you run your code in an infinite loop.
Without a move constructor:
make clean
make CFLAGS=-DOPT4
./memory_sample
On my system, execution takes an average of 170ms:
Average time: 170ms - Last execution took: 160um
Average time: 170ms - Last execution took: 179um
Average time: 170ms - Last execution took: 162um
With a move constructor:
make clean
make CFLAGS=-DOPT5
./memory_sample
This time, it takes an average of 143ms to execute:
Average time: 143ms - Last execution took: 142um
Average time: 143ms - Last execution took: 138um
Average time: 143ms - Last execution took: 142um
This (contrived) example achieves a 16% reduction in runtime. Note that compiler optimizations are switched off (-O0). With a higher optimization level (-O3), the runtime reduction would be less significant (11%).
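A comparable measurement can be sketched with std::chrono. This is not the timing code from the sample repository. The helper average_microseconds and the two workloads are assumptions made for illustration, and the vector size of 32767 is borrowed from the trace output shown earlier.

```cpp
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Runs `work` repeatedly and returns the average time per iteration in microseconds.
template <typename Work>
double average_microseconds(std::size_t iterations, Work work) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (std::size_t i = 0; i < iterations; ++i) work();
    const std::chrono::duration<double, std::micro> elapsed = clock::now() - start;
    return elapsed.count() / static_cast<double>(iterations);
}

// Workload 1: build a vector, then copy it into a second one.
double average_copy_us(std::size_t iterations) {
    return average_microseconds(iterations, [] {
        std::vector<double> source(32767, 1.0);
        std::vector<double> target(source);
        volatile double sink = target.back();  // keeps the result observable
        (void)sink;
    });
}

// Workload 2: build a vector, then move it into a second one.
double average_move_us(std::size_t iterations) {
    return average_microseconds(iterations, [] {
        std::vector<double> source(32767, 1.0);
        std::vector<double> target(std::move(source));
        volatile double sink = target.back();  // keeps the result observable
        (void)sink;
    });
}
```

Calling average_copy_us(10000) and average_move_us(10000) and comparing the results gives a rough picture. The absolute numbers depend on the machine, the allocator, and the optimization level, so treat them as indicative only.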
Analyzing memory allocations
The first choice for analyzing memory allocations on Linux-based systems is the tool
valgrind, or to be precise, the heap profiler
massif. In this case, executing the copy constructor causes heap memory allocations.
Run
massif on the first example:
make clean
make CFLAGS=-DOPT1
valgrind --tool=massif ./memory_sample
This outputs a file called
massif.out with a trailing process ID. This file can be read with
ms_print:
ms_print massif.out.4781
It prints out a graph of memory allocation over time. For this example, the graph shows increasing memory allocation over time, which is consistent with the implementation. The output of ms_print also includes information about which lines of code caused the memory allocation. I won't go into detail on how to read the output because there is excellent documentation available.
Conclusion
Regardless of the programming language you use, reducing copy operations (which result in heap memory allocations in this example) is a good way to increase runtime execution speed for performance-critical applications. When using C++, you will need a couple of prerequisites to apply these optimizations:
- C++11, since move semantics have only been available since that release
- Availability of the move constructor/move assignment operator (see the rule of 5)
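The second prerequisite is easy to miss, because a class without move operations still compiles when you pass it through std::move: the call quietly falls back to the copy constructor. The sketch below illustrates this with hypothetical types (Probe, DestructorOnly, RuleOfFive) and illustrative counters. None of them come from the sample repository.

```cpp
#include <utility>

// Illustrative counters for copy and move operations.
int g_probe_copies = 0;
int g_probe_moves = 0;

// Hypothetical member type that records how it was transferred.
struct Probe {
    Probe() = default;
    Probe(const Probe&) { ++g_probe_copies; }
    Probe(Probe&&) noexcept { ++g_probe_moves; }
    Probe& operator=(const Probe&) { ++g_probe_copies; return *this; }
    Probe& operator=(Probe&&) noexcept { ++g_probe_moves; return *this; }
};

// Declares only a destructor: the compiler no longer generates move operations,
// so an rvalue argument binds to the implicitly generated copy constructor.
struct DestructorOnly {
    Probe probe;
    ~DestructorOnly() {}
};

// Declares all five special member functions (rule of 5), so moves stay available.
struct RuleOfFive {
    Probe probe;
    RuleOfFive() = default;
    ~RuleOfFive() = default;
    RuleOfFive(const RuleOfFive&) = default;
    RuleOfFive& operator=(const RuleOfFive&) = default;
    RuleOfFive(RuleOfFive&&) noexcept = default;
    RuleOfFive& operator=(RuleOfFive&&) noexcept = default;
};

// Returns true if constructing from std::move(source) moved the member
// instead of copying it.
template <typename T>
bool std_move_really_moves() {
    g_probe_copies = 0;
    g_probe_moves = 0;
    T source;
    T target(std::move(source));
    (void)target;
    return g_probe_moves == 1 && g_probe_copies == 0;
}
```

Here std_move_really_moves<RuleOfFive>() returns true, while std_move_really_moves<DestructorOnly>() returns false because the member is copied. A check like this is a cheap way to confirm that a type benefits from move semantics at all.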
On a modern x86-64 CPU, heap memory allocations are so fast you won't notice whether an application is optimized or not without precise measurement. The less powerful your CPU, the more it makes sense to optimize runtime code. For example, on mobile devices, not only can it improve the response behavior, but it can also extend battery life.
