In my mind, the memory package works as described below.
## Design
### Usage
To allocate 4KB of CPU memory:
```cpp
p = memory::Alloc(platform::CPUPlace(), 4*1024);
```
To allocate 4KB of memory on the 3rd GPU:
```cpp
p = memory::Alloc(platform::GPUPlace(2), 4*1024);
```
To free the memory and check the amount of memory used so far on a place:
```cpp
auto pl = platform::GPUPlace(0);
p = memory::Alloc(pl, 4*1024);
cout << memory::Used(pl);
memory::Free(pl, p);
```
### The API
In `paddle/memory/memory.h` we have:
```cpp
namespace memory {

template <typename Place> void* Alloc(Place, size_t);
template <typename Place> void Free(Place, void*);
template <typename Place> size_t Used(Place);

}  // namespace memory
```
These function templates have specializations on either `platform::CPUPlace` or `platform::GPUPlace`:
```cpp
template<>
void* Alloc<CPUPlace>(CPUPlace p, size_t size) {
  return GetCPUBuddyAllocator()->Alloc(size);
}
```
and
```cpp
template<>
void* Alloc<GPUPlace>(GPUPlace p, size_t size) {
  return GetGPUBuddyAllocator(p.id)->Alloc(size);
}
```
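
This document only spells out the `Alloc` specializations. Presumably, `Free` and `Used` follow the same pattern and delegate to the per-place buddy allocator. The sketch below assumes `BuddyAllocator` exposes `Free` and `Used` methods (only the CPU case is shown); those method names are my assumption, not confirmed by this doc:

```cpp
// Assumed sketch: Free and Used delegate to the same singleton buddy
// allocator. BuddyAllocator::Free and BuddyAllocator::Used are assumed
// method names.
template<>
void Free<CPUPlace>(CPUPlace p, void* ptr) {
  GetCPUBuddyAllocator()->Free(ptr);
}

template<>
size_t Used<CPUPlace>(CPUPlace p) {
  return GetCPUBuddyAllocator()->Used();
}
```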
### The Implementation
`GetCPUBuddyAllocator` and `GetGPUBuddyAllocator` are singletons.
```cpp
BuddyAllocator* GetCPUBuddyAllocator() {
  static BuddyAllocator* a = NULL;
  if (a == NULL) {
    a = new BuddyAllocator(new CPUAllocator /*backup allocator*/, ...);
  }
  return a;
}

BuddyAllocator* GetGPUBuddyAllocator(int gpu_id) {
  static BuddyAllocator** as = NULL;
  if (as == NULL) {
    as = new BuddyAllocator*[platform::NumGPUs()];
    for (int gpu = 0; gpu < platform::NumGPUs(); gpu++) {
      as[gpu] = new BuddyAllocator(new GPUAllocator(gpu) /*backup allocator*/, ...);
    }
  }
  return as[gpu_id];
}
```
#### `BuddyAllocator`
`BuddyAllocator` implements the buddy allocation algorithm. Its constructor takes only parameters related to the algorithm:
```cpp
BuddyAllocator::BuddyAllocator(size_t initial_pool_size, size_t max_pool_size) {
  ...
}
```
Please be aware that **`BuddyAllocator` always allocates aligned memory**, aligned on a 32-byte boundary, which is large enough to hold a `BuddyAllocator::Block` object:
```cpp
class BuddyAllocator {
 private:
  struct Block {
    size_t size;
    Block* left;
    Block* right;
  };
  ...
};
```
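
To make the 32-byte alignment concrete, here is a small self-contained sketch of the usual round-up arithmetic; `align_up` is an illustrative helper for this doc, not a function from the actual code base:

```cpp
#include <cstddef>
#include <cstdio>

// Round n up to the next multiple of `alignment` (a power of two).
static size_t align_up(size_t n, size_t alignment) {
  return (n + alignment - 1) & ~(alignment - 1);
}

int main() {
  const size_t kAlignment = 32;  // enough room for a Block header
  // A 100-byte request is padded so the returned block stays 32-byte
  // aligned and the Block metadata kept in front of it fits.
  std::printf("padded request: %zu bytes\n", align_up(100, kAlignment));
  return 0;
}
```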
#### System Allocators
The `GPUAllocator` and `CPUAllocator` are called *system allocators*. They hold information about the device, including the amount of memory that has been allocated, so that we can call

- `GPUAllocator::Used` and
- `CPUAllocator::Used`

to get the amount of memory that has been allocated so far.
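
To illustrate, a CPU system allocator could be as simple as the following sketch. This is only my guess at the shape of the interface, with `std::malloc`/`std::free` standing in for the real CPU (and, analogously, CUDA) calls:

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical sketch of a CPU system allocator that tracks how much
// memory it has handed out, so Used() can report it.
class CPUAllocator {
 public:
  void* Alloc(size_t size) {
    void* p = std::malloc(size);
    if (p != nullptr) used_ += size;
    return p;
  }

  void Free(void* p, size_t size) {
    std::free(p);
    used_ -= size;
  }

  size_t Used() const { return used_; }

 private:
  size_t used_ = 0;  // bytes currently allocated through this allocator
};
```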
## Why Such a Design
I got inspiration from Majel and Caffe2, though the above design looks different from both.
### Caffe2
In Caffe2, `Tensor<Context>::mutable_data()` allocates the memory. In particular, [`Tensor<Context>::mutable_data`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/tensor.h#L523) calls [`Tensor<Context>::raw_mutable_data`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/tensor.h#L459), which in turn calls [`Context::New`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/tensor.h#L479).
There are two implementations of `Context`:
1. [`CPUContext`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/context.h#L105), whose [`New` method](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/context.h#L131) calls [`g_cpu_allocator.get()->New(size_t)`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/context.cc#L15) to allocate the memory.
1. [`CUDAContext`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/context_gpu.h#L99), which has a data member [`int gpu_id_`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/context_gpu.h#L202). This looks very similar to class `majel::GPUPlace`, which also has an `int id_` data member. `CUDAContext::New(size_t)` calls [`g_cub_allocator->DeviceAllocate(&ptr, nbytes)`](https://github.com/caffe2/caffe2/blob/v0.7.0/caffe2/core/context_gpu.cu#L355) to allocate the memory.
### Majel
In Majel, there are basically two allocator types:
1. `cpu::SystemAllocator`, which has similar functionality to `caffe2::CPUContext::New/Delete`.
1. `gpu::SystemAllocator`, which has similar functionality to `caffe2::CUDAContext::New/Delete`.
However, memory is not allocated through these two allocators directly. Instead, they are defined in hidden namespaces.
In Majel there are hidden global variables like:
1. `cpu::SystemAllocator g_cpu_allocator`, and
1. `vector<gpu::SystemAllocator*> g_gpu_allocators(NUM_GPUS)`.
Programs allocate memory via a `BuddyAllocator`, which can take the `g_cpu_allocator` or a `g_gpu_allocators[gpu_id]` as its *fallback allocator*, so that if the `BuddyAllocator` cannot find a block in its memory pool, it extends its memory pool by calling the fallback allocator's `New(size_t)`.
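
The fallback mechanism is easy to demonstrate with a toy pool allocator. The class below only illustrates the idea, with `std::malloc` playing the role of the system allocator; it is not Majel's or Paddle's actual code:

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Toy illustration of the fallback-allocator idea: serve requests from
// an in-memory pool and, when the pool has no room left, extend it by
// asking the "system allocator" (std::malloc here) for a fresh chunk.
class ToyPoolAllocator {
 public:
  void* Alloc(size_t size) {
    if (current_ == nullptr || used_ + size > kChunkSize) {
      // Pool exhausted: fall back to the system allocator for a new chunk.
      current_ = static_cast<char*>(std::malloc(kChunkSize));
      chunks_.push_back(current_);
      used_ = 0;
    }
    void* p = current_ + used_;
    used_ += size;
    return p;
  }

  ~ToyPoolAllocator() {
    for (char* c : chunks_) std::free(c);
  }

 private:
  static constexpr size_t kChunkSize = 1 << 20;  // grow the pool 1MB at a time
  std::vector<char*> chunks_;
  char* current_ = nullptr;
  size_t used_ = 0;
};

int main() {
  ToyPoolAllocator pool;
  void* p = pool.Alloc(4 * 1024);  // the first call triggers a pool extension
  std::printf("allocated 4KB at %p\n", p);
  return 0;
}
```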