# Design Doc: The Client Library of Parameter Server
For an overview of the trainer's role, please refer to the [distributed training design doc](README.md). This design doc discusses the parameter server's client library, which manages communication with parameter servers. The library will be implemented in [Go](https://golang.org/) and made available as a static or dynamic library with a C header file.
## Parameter Partition
Each parameter will be partitioned into parameter blocks so that the parameters are evenly distributed across parameter servers. The partition is done automatically by the client library. The *sparse parameter* requires slightly different treatment:
### Sparse Parameter
The sparse parameter is a parameter that is updated sparsely. The name is somewhat misleading: it does not have a sparse representation, but the same representation as a dense vector.

Because a sparse parameter is updated sparsely, the trainer has to partition the sparse parameter. Because the parameter servers merge all shards of a sparse parameter into the same file when saving the parameter, the shards must follow a special naming convention.

If a sparse parameter is partitioned into n shards, they should be named as:
```text
name:sparse-0
name:sparse-1
...
name:sparse-n-1
```
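As a minimal sketch of this naming convention, the helper below formats the name of a single shard. `sparse_shard_name` is a hypothetical helper introduced here for illustration; it is not part of the client library.

```c
#include <stdio.h>

/*
 * Writes the name of shard `index` of the sparse parameter `name` into `buf`,
 * following the "name:sparse-<index>" convention described above.
 * Returns 0 on success, -1 if an argument is invalid or `buf` is too small.
 * NOTE: hypothetical helper, not part of the client library.
 */
int sparse_shard_name(char* buf, size_t buf_len, const char* name, int index) {
  if (buf == NULL || name == NULL || index < 0) return -1;
  int n = snprintf(buf, buf_len, "%s:sparse-%d", name, index);
  /* snprintf returns the length it wanted to write; reject truncated names. */
  return (n >= 0 && (size_t)n < buf_len) ? 0 : -1;
}
```

For example, a trainer holding shard 2 of a parameter named `embedding` would register it under `embedding:sparse-2`.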
The library is unaware of the partition and treats each parameter independently. Only when saving parameters do the parameter servers merge the sparse parameters according to the naming convention.
## Model Optimization Using Gradients
There are two ways to perform model optimization using gradients:
- On Client

  The client performs multiple forward and backward update steps. In each step, the gradients are calculated and a new model is generated. After some steps, the client calculates the difference between the newest model and the old model at step 0, and sends this difference to the parameter servers. The parameter servers simply update the parameters using the difference, without performing any gradient-based optimization (such as Adam or L1 regularization).

- On Parameter Server

  The client sends accumulated gradients to the parameter servers, and the parameter servers perform the optimization using those gradients.
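
The two approaches differ mainly in what the client sends. The sketch below illustrates both on plain `float` arrays. All function names are hypothetical, and applying the difference by element-wise addition is an assumption about how the parameter servers would use it.

```c
#include <stddef.h>

/*
 * On-client optimization: after several local steps, compute the
 * element-wise difference between the newest model and the model at step 0.
 */
void model_delta(const float* old_model, const float* new_model,
                 float* delta, size_t len) {
  for (size_t i = 0; i < len; i++) delta[i] = new_model[i] - old_model[i];
}

/*
 * Parameter-server side of the same scheme (assumed): add the difference to
 * the stored parameters, with no further optimization.
 */
void apply_delta(float* params, const float* delta, size_t len) {
  for (size_t i = 0; i < len; i++) params[i] += delta[i];
}

/*
 * On-parameter-server optimization: the client only accumulates gradients
 * across steps and sends the accumulated values.
 */
void accumulate_grad(float* acc, const float* grad, size_t len) {
  for (size_t i = 0; i < len; i++) acc[i] += grad[i];
}
```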
## L1 and L2 Regularization
PaddlePaddle allows L1 or L2 regularization to be specified per parameter, so when the trainer initializes a parameter that requires L1 or L2 regularization, it needs to include a parameter configuration.
## Parameter Initialization
The parameters on parameter servers need to be initialized. To provide maximum flexibility, the trainer will initialize the parameters. Only one trainer will do the initialization; the other trainers will wait for the initialization to complete and then get the parameters from the parameter servers.
### Trainer Selection
To select the trainer for initialization, every trainer will try to acquire a distributed lock. Whichever trainer owns the lock will do the initialization, as illustrated below:
<img src="./src/init_lock.png">
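
The election amounts to a try-lock: the first trainer to acquire the lock becomes the initializer. The following sketch shows the idea with a C11 `atomic_flag` as an in-process stand-in for the distributed lock. `init_lock` and `try_become_initializer` are hypothetical names, and a real deployment would rely on a lock service shared by all trainers rather than on process-local memory.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* In-process stand-in for the distributed lock (hypothetical type). */
typedef struct {
  atomic_flag taken;
} init_lock;

/* Returns true for exactly one caller: the one that acquires the lock first. */
bool try_become_initializer(init_lock* lock) {
  /* atomic_flag_test_and_set returns the previous value of the flag. */
  return !atomic_flag_test_and_set(&lock->taken);
}
```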
### Trainer Selection Process
The trainer selection process is encapsulated in the C API function:
```c
int paddle_begin_init_params(paddle_pserver_client* client, const char* pserver_config_proto);
```
The selected trainer's call to `paddle_begin_init_params` returns 1, while the other trainers' calls to `paddle_begin_init_params` block until initialization is done and then return 0, as illustrated below:
<img src="./src/pserver_init.png">
## C Interface
```c
typedef enum {
  PADDLE_ELEMENT_TYPE_INT32 = 0,
  PADDLE_ELEMENT_TYPE_UINT32 = 1,
  PADDLE_ELEMENT_TYPE_INT64 = 2,
  PADDLE_ELEMENT_TYPE_UINT64 = 3,
  PADDLE_ELEMENT_TYPE_FLOAT32 = 4,
  PADDLE_ELEMENT_TYPE_FLOAT64 = 5,
} paddle_element_type;

typedef struct {
  char* name;
  paddle_element_type element_type;
  void* content;
  int content_len;
} paddle_parameter, paddle_gradient;

typedef struct paddle_pserver_client paddle_pserver_client;
paddle_pserver_client* paddle_new_pserver_client();
void paddle_pserver_client_release(paddle_pserver_client* client);

/**
 * @brief paddle_begin_init_params begins to initialize parameters on
 * parameter servers.
 *
 * paddle_begin_init_params will be called from multiple trainers,
 * only one trainer will be selected to initialize the parameters on
 * parameter servers. Other trainers will be blocked until the
 * initialization is done, and they need to get the initialized
 * parameters from parameter servers using @paddle_get_params.
 *
 * @param pserver_config_proto serialized parameter server configuration in
 * Protocol Buffers format.
 * @return 1 if the trainer is selected to initialize parameter
 * servers, otherwise 0.
 */
int paddle_begin_init_params(paddle_pserver_client* client, const char* pserver_config_proto);

/**
 * @brief paddle_init_param initializes the parameter on parameter
 * servers.
 *
 * @param param the parameter to initialize.
 * @param param_config_proto the configuration for the parameter.
 * @return 0 if successful, otherwise -1. On failure, the trainer
 * needs to restart the entire initialization process (starting from
 * @paddle_begin_init_params), or simply exit the program and wait for
 * the cluster management system to restart the trainer.
 */
int paddle_init_param(paddle_pserver_client* client, paddle_parameter param, const char* param_config_proto);

/**
 * @brief paddle_finish_init_params tells parameter servers that the
 * client has sent all parameters to parameter servers as initialization.
 *
 * @return 0 if successful, otherwise -1. On failure, the trainer
 * needs to restart the entire initialization process (starting from
 * @paddle_begin_init_params), or simply exit the program and wait for
 * the cluster management system to restart the trainer.
 */
int paddle_finish_init_params(paddle_pserver_client* client);

/**
 * @brief paddle_send_grads sends gradients to parameter servers for
 * updating parameters.
 *
 * @param grads the array of gradients to send.
 * @param len the length of the gradient array.
 * @return 0 if successful, otherwise -1.
 */
int paddle_send_grads(paddle_pserver_client* client, const paddle_gradient* grads, int len);

/**
 * @brief paddle_get_params gets parameters from parameter servers.
 *
 * @param names the array of names of the parameters to get.
 * @param dst the destination array of parameters to save to.
 * @param len the length of the names array and the paddle_parameter
 * array.
 * @return 0 if successful, otherwise -1.
 */
int paddle_get_params(paddle_pserver_client* client, const char** names, paddle_parameter* dst, int len);

/**
 * @brief paddle_save_model tells parameter servers to save the
 * parameters to the given path.
 *
 * @param path the path to save parameters.
 * @return 0 if successful, otherwise -1.
 */
int paddle_save_model(paddle_pserver_client* client, const char* path);
```
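
A trainer might combine the initialization calls roughly as follows. This is only a sketch under stated assumptions: the function bodies below are placeholder stubs so that the example compiles on its own (the real implementations come from the client library), the fields of `struct paddle_pserver_client` are invented for the stubs, `init_or_fetch_params` is a hypothetical helper, and reusing one `param_config_proto` for every parameter is a simplification.

```c
#include <stddef.h>

typedef enum {
  PADDLE_ELEMENT_TYPE_INT32 = 0,
  PADDLE_ELEMENT_TYPE_UINT32 = 1,
  PADDLE_ELEMENT_TYPE_INT64 = 2,
  PADDLE_ELEMENT_TYPE_UINT64 = 3,
  PADDLE_ELEMENT_TYPE_FLOAT32 = 4,
  PADDLE_ELEMENT_TYPE_FLOAT64 = 5,
} paddle_element_type;

typedef struct {
  char* name;
  paddle_element_type element_type;
  void* content;
  int content_len;
} paddle_parameter;

/* Stand-in for the opaque client type. The fields are hypothetical and only
 * record what the stubs below were asked to do. */
typedef struct paddle_pserver_client {
  int is_selected; /* value the stubbed paddle_begin_init_params returns */
  int num_inited;  /* number of paddle_init_param calls */
  int finished;    /* set by paddle_finish_init_params */
  int num_fetched; /* number of parameters requested via paddle_get_params */
} paddle_pserver_client;

/* Placeholder stubs using the signatures from the interface above. */
int paddle_begin_init_params(paddle_pserver_client* client, const char* pserver_config_proto) {
  (void)pserver_config_proto;
  return client->is_selected;
}

int paddle_init_param(paddle_pserver_client* client, paddle_parameter param, const char* param_config_proto) {
  (void)param;
  (void)param_config_proto;
  client->num_inited++;
  return 0;
}

int paddle_finish_init_params(paddle_pserver_client* client) {
  client->finished = 1;
  return 0;
}

int paddle_get_params(paddle_pserver_client* client, const char** names, paddle_parameter* dst, int len) {
  (void)names;
  (void)dst;
  client->num_fetched += len;
  return 0;
}

/*
 * Trainer-side flow (hypothetical helper): the selected trainer initializes
 * every parameter and signals completion; every other trainer fetches the
 * initialized parameters instead. Returns 0 on success, -1 on failure.
 */
int init_or_fetch_params(paddle_pserver_client* client,
                         const char* pserver_config_proto,
                         const char* param_config_proto,
                         paddle_parameter* params, const char** names, int len) {
  if (paddle_begin_init_params(client, pserver_config_proto) == 1) {
    for (int i = 0; i < len; i++) {
      if (paddle_init_param(client, params[i], param_config_proto) != 0) return -1;
    }
    return paddle_finish_init_params(client);
  }
  return paddle_get_params(client, names, params, len);
}
```

On a `-1` result the caller would either restart from `paddle_begin_init_params` or exit and let the cluster management system restart the trainer, as described in the comments above.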