Alex Mason

Softwarrister, friend to animals, beloved by all

Help STL Help You

Today we’ll dissect some classes and functions from the C++ standard library to understand how they tick. If you feel a bit overwhelmed by cppreference.com, this should help you learn how to interpret the documentation, reason about the available tools for a given job, and use them to your best advantage.

A Less Trivial Sum

Here’s a simple task: take a std::map<std::string, int> and sum the values. In Python, it’s very easy: sum(my_dict.values()). But in C++, maps don’t offer separate key and value iterators, so we have to iterate over key/value pairs and extract the value for each addend.

// Using unordered_map to remove distracting tree-related assembly code.
#include <unordered_map>
#include <string>

int SumValues(const std::unordered_map<std::string, int>& scores) {
    int sum = 0;
    for (const auto& kv : scores) {
        sum += kv.second;
    }
    return sum;
}

That’s the “naive” implementation, but we can use std::accumulate to get the STL involved by supplying begin/end iterators and a binary operation (i.e. a function). It makes the code more verbose, but that doesn’t necessarily mean more work at runtime. If we were just summing a container x of integers, the call would be std::accumulate(x.begin(), x.end(), 0), which is much more concise because the default operation is +. But for types without a + operator, we need to supply a function.
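For comparison, here is the plain-integer one-liner in full (a minimal sketch):

#include <numeric>
#include <vector>

int SumInts(const std::vector<int>& x) {
    // With no operation supplied, std::accumulate folds the range with +.
    return std::accumulate(x.begin(), x.end(), 0);
}

And here is the map version, with an explicit binary operation: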

#include <unordered_map>
#include <numeric>
#include <string>

int SumValues(const std::unordered_map<std::string, int>& scores) {
    int result = std::accumulate(scores.begin(), scores.end(), 0,
        // Normally I would define a separate function and pass its name as the
        // operator argument, but I'm using a lambda here in case any readers
        // haven't seen it before.
        [](int acc, const std::pair<std::string, int>& kv) {
            return acc + kv.second;
        });
    return result;
}

Here are the two versions compared in Compiler Explorer. The basic loop generates only 12 lines of assembly, while this clunkier version generates 75. So far there are no apparent benefits for this task, but let’s see if that extra code helps with performance.

Here they are compared in Quickbench. For this code, the loop is reportedly 11 or 12 times faster than std::accumulate (Note: when summing a container of plain integers, the two approaches have equal speed). Could it be that this is not the right function for the job? If we look at the “See Also” section of the reference page for std::accumulate, we see some related options. What if std::reduce fares better?

One advantage of std::reduce is its ExecutionPolicy overloads, so it might benefit from parallelism. While the two functions’ signatures look almost the same, simply replacing accumulate with reduce gives a compilation error. It might feel arcane, but through careful reading we can understand it.

In file included from <source>:4:
/opt/compiler-explorer/gcc-13.2.0/lib/gcc/x86_64-linux-gnu/13.2.0/../../../../include/c++/13.2.0/numeric:292:21: error: static assertion failed due to requirement 'is_invocable_r_v<int, (lambda at <source>:8:63) &, const std::pair<const std::basic_string<char, std::char_traits<char>, std::allocator<char>>, int> &, int &>'
  292 |       static_assert(is_invocable_r_v<_Tp, _BinaryOperation&, __ref, _Tp&>);
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
<source>:8:23: note: in instantiation of function template specialization 'std::reduce<std::__detail::_Node_const_iterator<std::pair<const std::basic_string<char>, int>, false, true>, int, (lambda at <source>:8:63)>' requested here
    8 |     int result = std::reduce(scores.begin(), scores.end(), 0, [](const int& acc,  const std::pair<std::string, int>& kv){return acc + kv.second;});
      |                       ^
In file included from <source>:4:
/opt/compiler-explorer/gcc-13.2.0/lib/gcc/x86_64-linux-gnu/13.2.0/../../../../include/c++/13.2.0/numeric:293:21: error: static assertion failed due to requirement 'is_invocable_r_v<int, (lambda at <source>:8:63) &, int &, int &>'
  293 |       static_assert(is_invocable_r_v<_Tp, _BinaryOperation&, _Tp&, _Tp&>);
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/opt/compiler-explorer/gcc-13.2.0/lib/gcc/x86_64-linux-gnu/13.2.0/../../../../include/c++/13.2.0/numeric:294:21: error: static assertion failed due to requirement 'is_invocable_r_v<int, (lambda at <source>:8:63) &, const std::pair<const std::basic_string<char, std::char_traits<char>, std::allocator<char>>, int> &, const std::pair<const std::basic_string<char, std::char_traits<char>, std::allocator<char>>, int> &>'
  294 |       static_assert(is_invocable_r_v<_Tp, _BinaryOperation&, __ref, __ref>);
      |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
3 errors generated.
Compiler returned: 1

Let’s look at std::is_invocable_r_v. The _v suffix marks a “helper variable template,” which just pulls the bool value from the underlying struct is_invocable_r, whose template parameters are <class R, class Fn, class... ArgTypes>. The requirement decides whether a function can be called with a specified return type and sequence of arguments. For the first assertion, R is int, Fn is the lambda, and the ArgTypes are const std::pair<const std::string, int>& and int&. The lambda is not invocable with these arguments because they are in the wrong order. The next assertion requires that the function be callable with two ints, and the final assertion requires that it be callable with two pairs.
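We can reproduce the library’s checks in miniature (the names here are mine, not libstdc++’s):

#include <string>
#include <type_traits>
#include <utility>

using Elem = std::pair<const std::string, int>;  // the map's value_type
auto op = [](int acc, const std::pair<std::string, int>& kv) { return acc + kv.second; };
using Op = decltype(op);

// accumulate's order, op(acc, elem), works (via a converting copy of the pair):
static_assert(std::is_invocable_r_v<int, Op&, int&, const Elem&>);
// reduce additionally demands these, and all three fail for our lambda:
static_assert(!std::is_invocable_r_v<int, Op&, const Elem&, int&>);         // op(elem, acc)
static_assert(!std::is_invocable_r_v<int, Op&, int&, int&>);                // op(acc, acc)
static_assert(!std::is_invocable_r_v<int, Op&, const Elem&, const Elem&>);  // op(elem, elem)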

In summary, this error reveals that reduce has a stricter requirement than accumulate: because it may reorder and regroup operands, its binary operation must accept the accumulator type and the element type in either position, which effectively means the two types must be the same or convertible to each other. Our pair elements can’t convert to int, so we have to use std::transform_reduce, overload (3), instead. It looks like this:

#include <unordered_map>
#include <numeric>
#include <string>
#include <functional>

int SumValues(const std::unordered_map<std::string, int>& scores) {
    int result = std::transform_reduce(scores.begin(), scores.end(), 0, std::plus<int>(),
        [](const std::pair<std::string, int>& kv) {
            return kv.second;
        });
    return result;
}

Compiler Explorer shows that this produces exactly the same assembly as std::accumulate, down to the register numbers. What if we add an execution policy? We see that std::execution::par permits, but does not mandate, parallel (multithreaded) execution, while std::execution::unseq permits vectorized (SIMD) execution. The benchmarks at the bottom of this page show that parallel execution offers a big boost to sorting, but vectorization does hardly anything; we can dig into the details of that in another post. However, when we add std::execution::par to this code, the assembly changes, but it seems to just add some exception handling. Quickbench still has it 12 times slower, as before, but a container of only 300 elements would hardly benefit from parallelization. When I increase it to 10000, the difference shrinks to 1.7 times slower. Perhaps the reducer is struggling because of the string keys.
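For reference, the policy overload needs only one extra leading argument, plus the <execution> header. A minimal sketch of the call described above (the function name is mine):

#include <execution>
#include <functional>
#include <numeric>
#include <string>
#include <unordered_map>

int SumValuesPar(const std::unordered_map<std::string, int>& scores) {
    // std::execution::par allows, but does not require, a parallel run.
    return std::transform_reduce(std::execution::par, scores.begin(), scores.end(), 0,
        std::plus<int>(),
        [](const std::pair<std::string, int>& kv) {
            return kv.second;
        });
}

What if the keys were string_views, which are more lightweight?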

After changing the key type, the assembly is exactly the same as that of the naive loop! (Note: this is a toy example, so please read the comments at that link to understand the caveats of using string_view this way.) Clearly the key’s type makes a huge difference here, but why? Using the sizeof operator, I find that std::pair<std::string, int> and std::pair<std::string_view, int> both have constant size, 40 bytes and 24 bytes respectively, because string is 32 bytes and string_view is 16 bytes.
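Those sizes are implementation details rather than guarantees; here’s a quick way to check them (the numbers in the comments are what I get on a 64-bit target with libstdc++):

#include <cstdio>
#include <string>
#include <string_view>
#include <utility>

int main() {
    std::printf("%zu\n", sizeof(std::string));                             // 32
    std::printf("%zu\n", sizeof(std::string_view));                        // 16
    std::printf("%zu\n", sizeof(std::pair<const std::string, int>));       // 40 (32 + 4, padded)
    std::printf("%zu\n", sizeof(std::pair<const std::string_view, int>));  // 24 (16 + 4, padded)
    return 0;
}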

A tangent on strings

Why are string objects 32 bytes? Because of the Small String Optimization. In most implementations of the standard library, std::string objects have an extra 16 bytes reserved to avoid dynamic allocation when the string is small enough. We can verify this by looking at the GNU C++ standard library source code. The source file string #includes bits/basic_string.h, which (as of gcc 13.2) shows the following on lines 180-208:

      // Use empty-base optimization: http://www.cantrip.org/emptyopt.html
      struct _Alloc_hider : allocator_type // TODO check __is_final
      {
#if __cplusplus < 201103L
        _Alloc_hider(pointer __dat, const _Alloc& __a = _Alloc())
        : allocator_type(__a), _M_p(__dat) { }
#else
        _GLIBCXX20_CONSTEXPR
        _Alloc_hider(pointer __dat, const _Alloc& __a)
        : allocator_type(__a), _M_p(__dat) { }
 
        _GLIBCXX20_CONSTEXPR
        _Alloc_hider(pointer __dat, _Alloc&& __a = _Alloc())
        : allocator_type(std::move(__a)), _M_p(__dat) { }
#endif
 
        pointer _M_p; // The actual data.
      };
 
      _Alloc_hider      _M_dataplus;
      size_type         _M_string_length;
 
      enum { _S_local_capacity = 15 / sizeof(_CharT) };
 
      union
      {
        _CharT           _M_local_buf[_S_local_capacity + 1];
        size_type        _M_allocated_capacity;
      };

The data in the _Alloc_hider is just a pointer, 8 bytes on 64-bit architectures, followed by a size_type, also 8 bytes on arm64 and x64, and an anonymous union of _CharT[16] (for string, _CharT is char, but it could be larger for other heirs of basic_string) and size_type again. The larger member of the union is 16 bytes, giving us a total of 32.

std::string_view is smaller because it only holds a pointer and a length. We can verify this by looking at lines 583-584 of string_view.

 // Components for manipulating non-owning sequences of characters -*- C++ -*-
  ...
      size_t        _M_len;
      const _CharT* _M_str;

Consider the comment on the first line of the file. A key difference between string and string_view is that the former owns the memory it points to, while the latter risks outliving the memory it points to. That makes string_view great as a (non-threaded) function argument, but it demands caution as a return value. I didn’t understand the concept of ownership when I was relatively new to C++, but after running the gauntlet of memory safety issues, it became quite clear to me. If an object’s lifetime ends while other objects still hold pointers to it, those pointers are called dangling pointers, and dereferencing one yields undefined behavior, often a segmentation fault (SIGSEGV). If an object or scope “owns” something in memory, that means it will not free the memory until its lifetime ends or it transfers ownership. See this intro document for more details.
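Here’s a minimal sketch of the return-value hazard (the function is hypothetical):

#include <string>
#include <string_view>

std::string_view MakeGreeting(std::string_view name) {
    std::string greeting = "hello, ";
    greeting += name;
    return greeting;  // BUG: returns a view into `greeting`, which is destroyed here
}
// Any later read through the returned view dereferences a dangling pointer:
// undefined behavior.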

Persnickety Types

Returning to the strange difference between key types for our map, there is another difference between string and string_view that comes into play. Because string_view does not own any memory, it is trivially copyable, just like an int. string is not trivially copyable, and that’s where we ran into a very subtle issue: std::map and std::unordered_map both accept an unadorned Key type, but as containers, their value_type is std::pair<const Key, T>.
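We can make that const visible with a quick check:

#include <string>
#include <type_traits>
#include <unordered_map>
#include <utility>

// The container's value_type carries a const Key, even though we wrote the
// Key type without qualifiers.
static_assert(std::is_same_v<
    std::unordered_map<std::string, int>::value_type,
    std::pair<const std::string, int>>);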

The binary operators we gave to the STL functions took the equivalent of const std::pair<Key, T>&, but they were passed a [const] std::pair<const Key, T>. It’s natural to expect that the const-ness of the pair would imply all of its members are const, but that’s only true when a member is accessed. As far as the compiler is concerned, the string sitting inside the pair only has its own qualifiers. To satisfy the is_invocable requirement, the compiler must convert the pair holding a const string (by copying it) into an unqualified std::pair<std::string, int>, even though it’s owned by a const pair. Therefore, it’s our responsibility to match the exact qualified types and use const whenever we can. In this Compiler Explorer example we can see the const string gives us the short assembly.
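The fix is one added const, so the parameter matches the container’s value_type exactly; presumably the linked example does something like this:

#include <functional>
#include <numeric>
#include <string>
#include <unordered_map>

int SumValues(const std::unordered_map<std::string, int>& scores) {
    return std::transform_reduce(scores.begin(), scores.end(), 0, std::plus<int>(),
        // const Key matches the container's value_type: no conversion, no copy.
        [](const std::pair<const std::string, int>& kv) {
            return kv.second;
        });
}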

The naive loop didn’t have a chance to mismatch because it used auto. Frustratingly, you get an error if a non-const reference type in a range-based for-loop is not an exact match, whereas function arguments just get these silently costly conversions. But that’s fine; we don’t have to memorize every little detail, because function parameters can also use auto. It may be surprising, but that’s the beauty of static typing. Just make sure the intended type is obvious in context.
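For example, a generic lambda deduces the element type exactly, so there is nothing to mismatch:

[](const auto& kv) { return kv.second; }

You could also name the value_type once and pass a plain function, like this: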

#include <functional>
#include <numeric>
#include <string>
#include <unordered_map>

using scores_kv = std::unordered_map<std::string, int>::value_type;

constexpr int GetValue(const scores_kv& kv) {
    return kv.second;
}

int SumValues(const std::unordered_map<std::string, int>& scores) {
    int result = std::transform_reduce(scores.begin(), scores.end(), 0, std::plus<int>(), GetValue);
    return result;
}

Finally, we can validate the performance in Quickbench. I expect any differences of 10% or less to be noise. Now let’s see if we can take advantage of std::execution::unseq. In this Quickbench, I convinced the compiler to use vectorized instructions, but in the annotated assembly you can see that those instructions make up a tiny proportion of the CPU time. It comes down to a limitation of unordered_map and other associative data structures. Even though the elements live in a hash table rather than a tree, they sit in individually allocated nodes and aren’t packed into one contiguous array. The unordered_map’s iterator only fulfills the “forward iterator” requirement, whereas vector’s iterator fulfills the “random access iterator” (i.e. we can jump to any position) and “contiguous iterator” requirements. Intuitively, random access is important for a parallel strategy to partition a container, and contiguous access is important for vectorized operations.
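We can confirm those iterator categories directly:

#include <iterator>
#include <string>
#include <type_traits>
#include <unordered_map>
#include <vector>

static_assert(std::is_same_v<
    std::iterator_traits<std::unordered_map<std::string, int>::iterator>::iterator_category,
    std::forward_iterator_tag>);
static_assert(std::is_same_v<
    std::iterator_traits<std::vector<int>::iterator>::iterator_category,
    std::random_access_iterator_tag>);
// (Since C++20, vector's iterator additionally models contiguous_iterator.)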

For a brief palate cleanser, you can look at the simpler task of summing integers in a vector here. The assembly shows that the naive loop is simple enough for the compiler to translate the intent and vectorize without being asked.
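That version is roughly the following sketch; at -O3 the compiler vectorizes the loop on its own:

#include <vector>

int SumInts(const std::vector<int>& xs) {
    int sum = 0;
    for (int x : xs) {
        sum += x;  // simple enough for the compiler to auto-vectorize
    }
    return sum;
}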

Compact indirection

For a more contrived example, let’s work with a vector<int> this time: treat its values as indices (be careful!) into another vector<int>, and sum up each index multiplied by the value at that index. The naive loop looks like this:

#include <vector>

const std::vector<int> values = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16};

int SumValuesLoop(const std::vector<int>& scores) {
    int acc = 0;
    for (const auto& score : scores) {
        acc += score * values[score];
    }
    return acc;
}

The version with std::transform_reduce looks like this:

#include <functional>
#include <numeric>

constexpr int SumValuesStd(const std::vector<int>& scores) {
    int result = std::transform_reduce(/*std::execution::par_unseq,*/ scores.begin(), scores.end(), 0,
        std::plus<int>(),
        [](const int i) { return i * values[i]; });
    return result;
}

Compared in Quickbench, we see the std approach is actually faster! I’ve commented out std::execution::par_unseq because it yields the same assembly as the loop. Without a policy argument, the behavior matches std::execution::seq, which disallows vectorization and concurrency (std::execution::par is simply ignored in this case), so the compiler is prevented from getting too fancy. In the loop assembly we find a panoply of moves, shuffles, and unpacks on xmm registers, which serve to place, extract, and convert between long and int, all to accommodate the pmuludq instruction.

In an unexpected fashion, the library function gave us more power to arrive at a better result. Vectorization tends to work much better when we know the value range is small, like uint8_t or uint16_t, because more of them fit in a single xmm (128-bit) register. Multiply ops are especially clunky because, for example, two 32-bit integers multiply into a 64-bit result; since the output register is 128 bits wide, you only get two multiplies per instruction. See here for more details. If we were using AVX, floating-point multiplication, and a much larger container, the vectorized approach might have been faster.

To summarize some key lessons:

  • const can and should be used in template arguments, and everywhere else, when constexpr isn’t an option
  • string and other objects that own heap memory might unexpectedly copy if you’re not careful
  • Execution policies are just suggestions, and might not pull their weight
  • Execution policies can impose restrictions that actually streamline the compiled code
  • auto protects against mistakes with complicated types
  • map and unordered_map can be roadblocks to optimization
    • If you’re doing mass operations on these containers and need better performance, consider outside libraries such as Boost, which offers concurrent_flat_map among other things

Above all, library classes and functions provide interfaces (pass by reference, pass by rvalue reference, constexpr, etc.) that a hand-rolled approach may not have considered. Expressing intent through these constructs makes your code self-documenting and resilient to stale comments. Especially when the inner (custom) behavior grows complex, functions like for_each, accumulate, and transform_reduce pay dividends when names and types in the outer system change: your points of interaction with them stay tightly localized and modular.

Next time: Making Cool patterns with OpenCL

Cover art: Forerunner Library interior, Halo, 343 Industries