C++ High-Performance Programming in Practice (3): Memory Optimization

Posted by pasu on 2024-7-15 18:27

Reposted from:

1. tcmalloc and jemalloc

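In a thread pool, each thread does its own job, completing tasks one after another. From malloc's point of view, this is a set of long-lived threads that request and release memory at arbitrary points in time. For this scenario the allocator should, first, hand out contiguous address space where possible and, second, partition its state across threads to reduce contention.
tcmalloc and jemalloc share one idea: a per-thread cache. Each thread fetches a large chunk from the backend in one go and serves many later requests from that cache, which lowers the actual contention on the backend. They differ mainly in what happens once the thread cache is exhausted: tcmalloc falls back to a single page heap (simplifying away the transfer cache and central cache that sit in between), while jemalloc uses multiple arenas (even more of them than the server has cores). Generally, with few threads or a low allocation and free rate, the simpler tcmalloc is slightly faster than jemalloc. With many cores and frequent allocation and freeing, jemalloc performs better because its lock contention is much lower than tcmalloc's.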
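
The thread-cache idea can be illustrated with a toy allocator. This is only a minimal sketch under simplifying assumptions (a single 64-byte size class, no error handling, caches not flushed at thread exit); `CentralPool`, `cached_alloc`, and `cached_free` are hypothetical names and do not correspond to the tcmalloc or jemalloc APIs.

```cpp
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <vector>

constexpr std::size_t kBlockSize = 64;  // single size class, for simplicity
constexpr std::size_t kBatch = 32;      // blocks moved per backend visit

// Shared backend (hypothetical): every visit takes the lock, so visits are counted.
class CentralPool {
public:
    // Hand out n blocks under one lock acquisition.
    void fetch(std::vector<void*>& out, std::size_t n) {
        std::lock_guard<std::mutex> guard(mutex_);
        ++lock_count_;
        for (std::size_t i = 0; i < n; ++i) {
            if (free_.empty()) {
                out.push_back(std::malloc(kBlockSize));  // null check omitted in this sketch
            } else {
                out.push_back(free_.back());
                free_.pop_back();
            }
        }
    }
    // Take up to n blocks back from a thread cache under one lock acquisition.
    void give_back(std::vector<void*>& in, std::size_t n) {
        std::lock_guard<std::mutex> guard(mutex_);
        ++lock_count_;
        for (std::size_t i = 0; i < n && !in.empty(); ++i) {
            free_.push_back(in.back());
            in.pop_back();
        }
    }
    std::size_t lock_count() {
        std::lock_guard<std::mutex> guard(mutex_);
        return lock_count_;
    }

private:
    std::mutex mutex_;
    std::vector<void*> free_;
    std::size_t lock_count_ = 0;
};

inline CentralPool& central() {
    static CentralPool pool;
    return pool;
}

// Per-thread cache: the common path touches no lock at all.
inline std::vector<void*>& local_cache() {
    thread_local std::vector<void*> cache;
    return cache;
}

inline void* cached_alloc() {
    auto& cache = local_cache();
    if (cache.empty()) central().fetch(cache, kBatch);  // cache miss: refill in bulk
    void* p = cache.back();
    cache.pop_back();
    return p;
}

inline void cached_free(void* p) {
    auto& cache = local_cache();
    cache.push_back(p);
    // Cache grew too large: return one batch to the shared backend.
    if (cache.size() > 2 * kBatch) central().give_back(cache, kBatch);
}
```

In this sketch, 32 calls to `cached_alloc()` on a fresh thread visit the locked backend once, and freeing and re-allocating the same 32 blocks does not visit it again.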

What would an ideal malloc model look like?

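[*]Low contention and contiguity
Microservices, stream processing, and caching: these workloads cover almost all mainstream backend service scenarios. They share one important trait in how they use memory: lifetimes with clear boundaries. In early server designs, for example, each client request got its own thread, and everything was torn down together once the request was handled. With the spread of sub-task-level thread-pool concurrency, however, a request is now split into many sub-tasks that exploit multiple cores to speed up computation.
How can std::vector<std::string> be optimized? Here is one approach: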

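[*]The main difference from a typical vector is that after clear, pop_back, or any other operation that shrinks the size, the element objects are not actually destroyed; they are only emptied and reset. The next time a slot is used, an already constructed element is available directly, and the memory within its capacity is still held. When the same instance is used repeatedly, the container's memory and each element's own capacity converge to a saturation value, so repeated allocation and construction are reduced.
Memory allocation and object construction are decoupled. This is also the starting point of PMR (Polymorphic Memory Resource, new in C++17). The well-known EASTL, developed for low-latency, high-frequency, compute-intensive workloads, took a similar approach earlier.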
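
The "keep elements constructed" idea above can be sketched as a thin wrapper around std::vector<std::string>. `ReusableStringVector` is a hypothetical name chosen for illustration; it is a minimal sketch of the technique, not the implementation the text refers to.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical wrapper: shrinking only moves a logical size, so the
// std::string objects (and their heap buffers) stay alive for reuse.
class ReusableStringVector {
public:
    // Logical clear: no element is destroyed.
    void clear() { size_ = 0; }

    // Logical pop: the slot stays constructed.
    void pop_back() { --size_; }

    // Reuse an already constructed slot if one exists, otherwise grow.
    std::string& emplace_back(std::string_view text) {
        if (size_ == slots_.size()) slots_.emplace_back();
        std::string& slot = slots_[size_++];
        // assign() overwrites in place when the slot's capacity is sufficient.
        slot.assign(text.data(), text.size());
        return slot;
    }

    std::size_t size() const { return size_; }
    std::string& operator[](std::size_t i) { return slots_[i]; }
    const std::string& operator[](std::size_t i) const { return slots_[i]; }

private:
    std::vector<std::string> slots_;  // constructed slots; may exceed size_
    std::size_t size_ = 0;            // number of logically live elements
};
```

After a few rounds of `clear()` and `emplace_back()` on the same instance, each slot's capacity has grown to the largest string it ever held, so later rounds mostly copy bytes instead of allocating.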

2. string

Short string allocation

#include <chrono>
#include <iostream>
#include <string>

// RAII timer: prints the elapsed time of the enclosing scope on destruction.
struct Timer {
    std::chrono::high_resolution_clock::time_point start, end;
    std::chrono::duration<float> duration;
    Timer() { start = std::chrono::high_resolution_clock::now(); }
    ~Timer() {
        end = std::chrono::high_resolution_clock::now();
        duration = end - start;                    // seconds, as float
        float us = duration.count() * 1000000.0f;  // seconds -> microseconds
        std::cout << "Timer took " << us << "us"
                  << "\n";
    }
};

const int SIZE = 1000000;
void test_stack() {
    Timer timer;
    for (int i = 0; i < SIZE; i++) {
        char buf[] = "hello world";  // assumed: stack copy of the same text
        (void)buf;                   // silence the unused-variable warning
    }
}

void test_string() {
    Timer timer;
    for (int i = 0; i < SIZE; i++) {
        std::string str("hello world");
    }
}

int main() {
    test_stack();
    test_string();
    return 0;
}

Test result:

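For short strings, constructing a char buffer and constructing a std::string cost about the same, because a short string fits into std::string's inline buffer (the small string optimization) and needs no heap allocation.

Long string allocation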

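// Timer is the struct defined in the previous example.
const int SIZE = 1000000;
void test_stack() {
    Timer timer;
    for (int i = 0; i < SIZE; i++) {
        char buf[] = "hello world, it is test string.";  // assumed: stack copy of the same text
        (void)buf;                                       // silence the unused-variable warning
    }
}

void test_string() {
    Timer timer;
    for (int i = 0; i < SIZE; i++) {
        std::string str("hello world, it is test string.");
    }
}

int main() {
    test_stack();
    test_string();
    return 0;
}

Test result: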

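For long strings, std::string is much slower than the char buffer: the text no longer fits into the inline buffer, so every construction allocates on the heap.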
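
Whether a given string is stored inline can be checked by comparing its data pointer with the address range of the string object itself. The helper below is a hypothetical illustration (`stored_inline` is not a standard function), and the inline capacity is implementation-specific: 15 characters in libstdc++ and 22 in 64-bit libc++, so the two test strings above fall on opposite sides of the limit on both.

```cpp
#include <cstdint>
#include <string>

// Returns true when the character data lives inside the std::string object
// itself, i.e. the small string optimization is in effect for this value.
inline bool stored_inline(const std::string& s) {
    const auto object_begin = reinterpret_cast<std::uintptr_t>(&s);
    const auto object_end = object_begin + sizeof(std::string);
    const auto data = reinterpret_cast<std::uintptr_t>(s.data());
    return data >= object_begin && data < object_end;
}
```

Calling `stored_inline` on a std::string holding "hello world" and on one holding "hello world, it is test string." shows which of the two benchmarks actually exercises the heap.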

std::string is implemented differently in libstdc++ and libc++; for details, see the following article:

std::pmr::string

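#include <memory_resource>
#include <string>

// Timer is the struct defined in the first example.
const int SIZE = 1000000;
void test_stack() {
    Timer timer;
    for (int i = 0; i < SIZE; i++) {
        // Baseline: std::string with the default allocator.
        std::string str("hello world, it is test string.");
    }
}

void test_string() {
    Timer timer;
    for (int i = 0; i < SIZE; i++) {
        // No memory resource is passed, so the default resource is used.
        std::pmr::string str("hello world, it is test string.");
    }
}

Test result: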

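std::pmr::string lets us place a string's storage on the stack: when it is given a memory resource backed by a stack buffer (for example a std::pmr::monotonic_buffer_resource over 1024 bytes), it requests heap memory only after those 1024 bytes are used up. A std::pmr::string constructed without an explicit resource, as in the snippet above, uses the default resource, which allocates from the heap.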
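
A minimal sketch of the stack-backed variant, assuming C++17 and a standard library that ships <memory_resource>. The helper name `string_lives_in` is hypothetical; it builds the long test string from a caller-supplied buffer and reports whether the characters ended up inside that buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory_resource>
#include <string>

// Builds a long string on top of `buffer` and reports whether the character
// data was placed inside that buffer.
inline bool string_lives_in(std::byte* buffer, std::size_t size) {
    // Bump allocator over `buffer`; it falls back to the default (heap)
    // resource only when the buffer cannot satisfy a request.
    std::pmr::monotonic_buffer_resource pool(buffer, size);
    std::pmr::string str("hello world, it is test string.", &pool);

    const auto begin = reinterpret_cast<std::uintptr_t>(buffer);
    const auto data = reinterpret_cast<std::uintptr_t>(str.data());
    return data >= begin && data < begin + size;
}
```

With a 1024-byte array declared on the stack, the string's characters land in that array; with a buffer that is too small for the string, the resource falls back to its upstream and the data ends up on the heap.
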
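
3. vector

In common STL implementations, vector capacity grows geometrically: libstdc++ and libc++ double it on each reallocation, while MSVC grows it by 1.5x. The standard does not fix this factor and std::vector offers no way to tune it, but other containers make different choices, for example folly's small_vector.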
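
Within std::vector itself, the usual way to sidestep repeated growth is to call reserve() when the final size is known. The hypothetical helper below counts how often the capacity changes while appending; the exact count without reserve() depends on the implementation's growth factor, so no specific number is claimed.

```cpp
#include <cstddef>
#include <vector>

// Counts how many times push_back changes the capacity while appending n ints.
inline std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<int> v;
    if (reserve_first) v.reserve(n);  // one up-front allocation
    std::size_t reallocations = 0;
    std::size_t last_capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != last_capacity) {
            ++reallocations;
            last_capacity = v.capacity();
        }
    }
    return reallocations;
}
```

With `reserve_first` set, the loop triggers no reallocation at all; without it, the count grows roughly with the logarithm of n.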

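4. map

std::map in the STL is implemented as a red-black tree. When ordered iteration is not required, a hash map is usually the faster choice, and a further optimization is to combine the hash map with a memory pool.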
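
One way to combine a hash map with a memory pool using only the standard library is std::pmr::unordered_map on top of a pool resource. This is a sketch of that combination under the assumption of single-threaded use (hence the unsynchronized pool); `sum_with_pool` is a hypothetical helper, and other pool designs are equally possible.

```cpp
#include <cstddef>
#include <memory_resource>
#include <unordered_map>

// Fills a hash map whose nodes and buckets come from a pool resource, then
// looks every key up again and returns the sum of the stored values.
inline long sum_with_pool(int n) {
    // Pool of size-classed blocks; not thread-safe, upstream is the default resource.
    std::pmr::unsynchronized_pool_resource pool;
    std::pmr::unordered_map<int, long> table(&pool);
    table.reserve(static_cast<std::size_t>(n));  // size the bucket array once

    for (int i = 0; i < n; ++i) {
        table.emplace(i, static_cast<long>(i) * 2);
    }

    long sum = 0;
    for (int i = 0; i < n; ++i) {
        auto it = table.find(i);
        if (it != table.end()) sum += it->second;
    }
    return sum;  // all pool memory is released when `pool` goes out of scope
}
```

Because every node allocation goes through the pool, inserting and erasing many small nodes reuses pooled blocks instead of calling the global allocator each time.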

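5. protobuf

For example, adopt field-merging strategies to keep the number of serialization and deserialization passes as low as possible.

6. Using smart pointers efficiently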

[*]Use std::make_shared instead of new T
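#include <memory>
#include <string>
#include <utility>

// Timer is the struct defined in the first example.
class MyClass {
public:
    // A member initializer list constructs the members directly, which is faster.
    MyClass(std::string s, int i) : s(std::move(s)), i(i) {}

public:
    std::string s;
    int i;
};

const int SIZE = 1000000;
void test1() {
    Timer timer;
    for (int i = 0; i < SIZE; i++) {
        // Calls the memory allocator twice: once for the MyClass instance,
        // once for std::shared_ptr's internal control block.
        std::shared_ptr<MyClass> p(new MyClass("hello", 123));
    }
}

void test2() {
    Timer timer;
    for (int i = 0; i < SIZE; i++) {
        // A single allocation holds both of the structures above.
        std::shared_ptr<MyClass> p = std::make_shared<MyClass>("hello", 123);
    }
}

int main() {
    test1();
    test2();
    return 0;
}

Test result: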

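[*]Avoid taking std::shared_ptr as a function parameter when the callee does not share ownership (each by-value copy updates the atomic reference count); pass the underlying pointer obtained from get() instead
[*]Use = delete in the class definition to forbid copies that should not happen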
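
The last two points can be sketched together. `Connection`, `name_length`, and `count_seen_by_value` are hypothetical names used only for illustration; the by-value function is included purely to show the reference-count change that the raw-pointer version avoids.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical type whose accidental copies we want the compiler to reject.
class Connection {
public:
    explicit Connection(std::string name) : name_(std::move(name)) {}
    Connection(const Connection&) = delete;             // no copy construction
    Connection& operator=(const Connection&) = delete;  // no copy assignment
    const std::string& name() const { return name_; }

private:
    std::string name_;
};

// The callee only uses the object, so it takes a raw, non-owning pointer:
// no reference-count traffic at the call site.
inline std::size_t name_length(const Connection* conn) {
    return conn ? conn->name().size() : 0;
}

// For contrast: a by-value shared_ptr parameter copies the pointer and
// increments the shared reference count for the duration of the call.
inline long count_seen_by_value(std::shared_ptr<Connection> conn) {
    return conn.use_count();
}

static_assert(!std::is_copy_constructible<Connection>::value,
              "Connection must not be copyable");
```

A caller holding `std::shared_ptr<Connection> conn` would write `name_length(conn.get())`; the raw pointer is valid for the duration of the call as long as the caller keeps `conn` alive.
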