
In my mind, the memory package works like the following:

## Design

### Usage

To allocate 4KB of CPU memory:

```cpp
p = memory::Alloc(platform::CPUPlace(), 4*1024);
```

To allocate 4KB of memory on the 3rd GPU:

```cpp
p = memory::Alloc(platform::GPUPlace(2), 4*1024);
```

To free the memory and check the amount of memory used so far on a place:

```cpp
auto pl = platform::GPUPlace(0);
p = memory::Alloc(pl, 4*1024);
cout << memory::Used(pl);
memory::Free(pl, p);
```

### The API

In `paddle/memory/memory.h` we have:

```cpp
namespace memory {
template <typename Place> void* Alloc(Place, size_t);
template <typename Place> void Free(Place, void*);
template <typename Place> size_t Used(Place);
}  // namespace memory
```

These function templates have specializations on either `platform::CPUPlace` or `platform::GPUPlace`:

```cpp
template<>
void* Alloc<CPUPlace>(CPUPlace p, size_t size) {
  return GetCPUBuddyAllocator()->Alloc(size);
}
```

and

```cpp
template<>
void* Alloc<GPUPlace>(GPUPlace p, size_t size) {
  return GetGPUBuddyAllocator(p.id)->Alloc(size);
}
```
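
The `Free` specializations follow the same pattern. A minimal sketch, assuming `BuddyAllocator` exposes a matching `Free(void*)` method and reusing the `GPUPlace::id` member shown above:

```cpp
template<>
void Free<CPUPlace>(CPUPlace p, void* ptr) {
  GetCPUBuddyAllocator()->Free(ptr);
}

template<>
void Free<GPUPlace>(GPUPlace p, void* ptr) {
  GetGPUBuddyAllocator(p.id)->Free(ptr);
}
```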

### The Implementation

`GetCPUBuddyAllocator` and `GetGPUBuddyAllocator` are singletons.

```cpp
BuddyAllocator* GetCPUBuddyAllocator() {
  static BuddyAllocator* a = NULL;
  if (a == NULL) {
    a = new BuddyAllocator(new CPUAllocator /*backup allocator*/, ...);
  }
  return a;
}
```

```cpp
BuddyAllocator* GetGPUBuddyAllocator(int gpu_id) {
  static BuddyAllocator** as = NULL;
  if (as == NULL) {
    as = new BuddyAllocator*[platform::NumGPUs()];
    for (int gpu = 0; gpu < platform::NumGPUs(); gpu++) {
      as[gpu] = new BuddyAllocator(new GPUAllocator(gpu) /* backup allocator */, ...);
    }
  }
  return as[gpu_id];
}
```

#### BuddyAllocator

`BuddyAllocator` implements the buddy allocation algorithm. Its constructor takes only parameters related to the algorithm:

```cpp
BuddyAllocator::BuddyAllocator(initial_pool_size, max_pool_size) {
  ...
}
```

Please be aware that `BuddyAllocator` always allocates memory aligned on 32-byte boundaries, so that each block can hold a `BuddyAllocator::Block` object:

```cpp
class BuddyAllocator {
 private:
  struct Block {
    size_t size;
    Block *left, *right;
  };
  ...
};
```
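
The 32-byte alignment means each request is rounded up so that the `Block` metadata and the payload land on alignment boundaries. A minimal sketch of the rounding; `align_up` is a hypothetical helper for illustration, not part of the actual implementation:

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical helper: round n up to the next multiple of a
// power-of-two alignment.
size_t align_up(size_t n, size_t alignment = 32) {
  return (n + alignment - 1) & ~(alignment - 1);
}

int main() {
  // A 100-byte request plus a 24-byte Block header (one size_t and
  // two pointers on a 64-bit machine) rounds up to 128 bytes, which
  // keeps every block on a 32-byte boundary.
  std::printf("%zu\n", align_up(24 + 100));  // prints 128
  return 0;
}
```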

#### System Allocators

The `GPUAllocator` and `CPUAllocator` are called *system allocators*. They hold information about the device, including the amount of memory that has been allocated, so that we can call

- `GPUAllocator::Used` and
- `CPUAllocator::Used`

to get the amount of memory that has been allocated so far.
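
A minimal sketch of what such a system allocator could look like; the exact interface is an assumption for illustration, and a `GPUAllocator` would track usage the same way around `cudaMalloc`/`cudaFree` on its device:

```cpp
#include <cstddef>
#include <cstdlib>

// Sketch of a system allocator that tracks how much memory it has
// handed out; method names are assumptions, not the actual interface.
class CPUAllocator {
 public:
  void* Alloc(size_t size) {
    used_ += size;
    return std::malloc(size);
  }
  void Free(void* ptr, size_t size) {
    used_ -= size;
    std::free(ptr);
  }
  size_t Used() const { return used_; }

 private:
  size_t used_ = 0;  // bytes allocated so far
};
```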

## Why Such a Design

I got inspiration from Majel and Caffe2, though the above design looks different from both.

### Caffe2

In Caffe2, `Tensor<Context>::mutable_data()` allocates the memory. In particular, `Tensor<Context>::mutable_data` calls `Tensor<Context>::raw_mutable_data`, which in turn calls `Context::New`.

There are two implementations of `Context`:

1. `CPUContext`, whose `New` method calls `g_cpu_allocator.get()->New(size_t)` to allocate the memory.

2. `CUDAContext`, which has a data member `int gpu_id_`. This looks very similar to class `majel::GPUPlace`, which also has an `int id_` data member. `CUDAContext::New(size_t)` calls `g_cub_allocator->DeviceAllocate(&ptr, nbytes)` to allocate the memory.
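
Put together, the allocation paths described above look roughly like this (a paraphrase of the call chains, not actual Caffe2 source):

```cpp
// CPU tensor:
//   Tensor<CPUContext>::mutable_data()
//     -> Tensor<CPUContext>::raw_mutable_data()
//       -> CPUContext::New(nbytes)
//         -> g_cpu_allocator.get()->New(nbytes)
//
// GPU tensor:
//   Tensor<CUDAContext>::mutable_data()
//     -> Tensor<CUDAContext>::raw_mutable_data()
//       -> CUDAContext::New(nbytes)          // CUDAContext carries gpu_id_
//         -> g_cub_allocator->DeviceAllocate(&ptr, nbytes)
```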

### Majel

In Majel, there are basically two allocator types:

1. `cpu::SystemAllocator`, which has similar functionality to `caffe2::CPUContext::New/Delete`.
2. `gpu::SystemAllocator`, which has similar functionality to `caffe2::CUDAContext::New/Delete`.

However, memory allocation is not done via these two allocators directly. Instead, they are defined in hidden namespaces.

In Majel there are hidden global variables like:

1. `cpu::SystemAllocator g_cpu_allocator`, and
2. `vector<gpu::SystemAllocator*> g_gpu_allocators(NUM_GPUS)`.

Programs allocate memory via a `BuddyAllocator`, which can take `g_cpu_allocator` or a `g_gpu_allocators[gpu_id]` as its fallback allocator, so that if the `BuddyAllocator` cannot find a block in its memory pool, it extends the pool by calling the fallback allocator's `New(size_t)`.
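
A minimal self-contained sketch of that fallback path; everything except the fallback's `New(size_t)` call is hypothetical:

```cpp
#include <cstddef>

// Interface of a system allocator used as the fallback (sketch).
class SystemAllocator {
 public:
  virtual void* New(size_t size) = 0;
  virtual ~SystemAllocator() {}
};

class BuddyAllocator {
 public:
  explicit BuddyAllocator(SystemAllocator* fallback) : fallback_(fallback) {}

  void* Alloc(size_t size) {
    void* block = FindBlock(size);  // search the existing memory pool
    if (block == nullptr) {
      RefillPool(size);             // extend the pool via fallback_->New(...)
      block = FindBlock(size);
    }
    return block;
  }

 private:
  void* FindBlock(size_t size);  // hypothetical: pool lookup
  void RefillPool(size_t size);  // hypothetical: grows the pool
  SystemAllocator* fallback_;
};
```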