Merge pull request #5776 from typhoonzero/update_refactor_dist_train_doc
Update design of dist train refactordetection_output_fixbug
|
Before Width: | Height: | Size: 16 KiB After Width: | Height: | Size: 16 KiB |
|
Before Width: | Height: | Size: 222 KiB After Width: | Height: | Size: 222 KiB |
|
After Width: | Height: | Size: 189 KiB |
|
Before Width: | Height: | Size: 28 KiB After Width: | Height: | Size: 28 KiB |
|
After Width: | Height: | Size: 102 KiB |
|
Before Width: | Height: | Size: 350 KiB After Width: | Height: | Size: 350 KiB |
|
Before Width: | Height: | Size: 76 KiB After Width: | Height: | Size: 76 KiB |
|
Before Width: | Height: | Size: 20 KiB After Width: | Height: | Size: 20 KiB |
|
After Width: | Height: | Size: 134 KiB |
@ -1,180 +0,0 @@
|
|||||||
# Design Doc: Session
|
|
||||||
|
|
||||||
## Abstract
|
|
||||||
|
|
||||||
The *session* object encapsulates the environment in which the
|
|
||||||
computation graph is executed.
|
|
||||||
|
|
||||||
We will have the *local* session and *remote* session, they offer the
|
|
||||||
same [interface](#interface). The local session encapsulates the local
|
|
||||||
runtime environment and the remote session encapsulates the cluster
|
|
||||||
runtime environment.
|
|
||||||
|
|
||||||
The local runtime environment contains:
|
|
||||||
|
|
||||||
1. computation devices (i.e., CPU, GPU) handles, and
|
|
||||||
1. the [scope](../scope.md) which holds all variables.
|
|
||||||
|
|
||||||
The remote runtime environment contains:
|
|
||||||
|
|
||||||
1. computation devices (i.e., CPU and GPU on node 0, 1) in a cluster,
|
|
||||||
and
|
|
||||||
1. the distributed [scope](../scope.md) in a cluster which holds all
|
|
||||||
variables.
|
|
||||||
|
|
||||||
The user can create a remote session on Paddle Cloud and evaluate the
|
|
||||||
computation graph with it. In this way, the user can control the
|
|
||||||
remote computation resource in a cluster from his local computer.
|
|
||||||
|
|
||||||
|
|
||||||
## Background
|
|
||||||
|
|
||||||
The current design has an implicit global session in which
|
|
||||||
`paddle.eval()` is executed. The pain point is:
|
|
||||||
|
|
||||||
Since the user is not able to explicitly switch between runtime
|
|
||||||
environments, the user cannot run a topology in two independent
|
|
||||||
environments.
|
|
||||||
|
|
||||||
For example, in reinforcement learning, the user may want to have a
|
|
||||||
stale model for inference and a fresh model for training, and only
|
|
||||||
replace the stale model with the fresh model periodically.
|
|
||||||
|
|
||||||
Furthermore, we have no concept that encapsulates a remote environment
|
|
||||||
that executes a computation graph.
|
|
||||||
|
|
||||||
We need the session object to address above issues.
|
|
||||||
|
|
||||||
|
|
||||||
## Session
|
|
||||||
|
|
||||||
A session is an object that owns the runtime environment. All
|
|
||||||
computations are executed through `session.eval()`.
|
|
||||||
|
|
||||||
|
|
||||||
### Interface
|
|
||||||
|
|
||||||
```python
|
|
||||||
eval(
|
|
||||||
targets,
|
|
||||||
feed_dict=None,
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
Evaluates the target Operations or Variables in `targets`.
|
|
||||||
|
|
||||||
- *targets*: the evaluation targets. Can be a single Operation or
|
|
||||||
Variable, or a list with the Operations or Variables as
|
|
||||||
elements. The value returned by `eval()` has the same shape as the
|
|
||||||
`target` argument.
|
|
||||||
|
|
||||||
The PaddlePaddle program is represented by
|
|
||||||
the [ProgramDesc](../design/program.md), `eval()` will infer the
|
|
||||||
ProgramDesc from the given targets and run the PaddlePaddle
|
|
||||||
program. Please
|
|
||||||
see
|
|
||||||
[this graph](./distributed_architecture.md#local-training-architecture) for
|
|
||||||
the detailed illustration for the local session
|
|
||||||
and
|
|
||||||
[this graph](./distributed_architecture.md#distributed-training-architecture) for
|
|
||||||
the detailed illustration for the remote session.
|
|
||||||
|
|
||||||
- *feed_dict*: a dictionary that contains the tensors which override
|
|
||||||
the edges of the computation graph.
|
|
||||||
|
|
||||||
feed_dict not only can provide the input data, it can override any
|
|
||||||
OP's input as well:
|
|
||||||
|
|
||||||
```python
|
|
||||||
a = pd.constant(2.0, name="a")
|
|
||||||
b = pd.variable(name="b")
|
|
||||||
c = pd.mul(a,b)
|
|
||||||
sess.eval(targets=c, feed_dict={"b":3.0}) # returns 6.0
|
|
||||||
```
|
|
||||||
|
|
||||||
```python
|
|
||||||
close()
|
|
||||||
```
|
|
||||||
|
|
||||||
Closes the session and releases the scope that the session owns.
|
|
||||||
|
|
||||||
|
|
||||||
### Create a Local Session
|
|
||||||
|
|
||||||
```python
|
|
||||||
session(
|
|
||||||
devices=None
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
Creates a new session. One session owns one global scope, so creating
|
|
||||||
multiple sessions will create different scopes.
|
|
||||||
|
|
||||||
- *devices*: a single `string` or a list of `string` of device names,
|
|
||||||
the corresponding devices will be the computation devices for
|
|
||||||
`eval()`. If not specified, all available devices (e.g., all GPUs)
|
|
||||||
will be used. The user doesn't need to specify the CPU device since
|
|
||||||
it will be always used. Multiple sessions can use the same device.
|
|
||||||
|
|
||||||
|
|
||||||
#### Example
|
|
||||||
|
|
||||||
```Python
|
|
||||||
a = paddle.constant(1.0)
|
|
||||||
b = paddle.constant(2.0)
|
|
||||||
c = a + b
|
|
||||||
sess = paddle.session(devices=["gpu:0", "gpu:1", "fpga:0"])
|
|
||||||
sess.eval(c)
|
|
||||||
sess.close()
|
|
||||||
```
|
|
||||||
|
|
||||||
### Create a Remote Session
|
|
||||||
|
|
||||||
```python
|
|
||||||
create_cloud_job(
|
|
||||||
name,
|
|
||||||
num_trainer,
|
|
||||||
mem_per_trainer,
|
|
||||||
gpu_per_trainer,
|
|
||||||
cpu_per_trainer,
|
|
||||||
num_ps,
|
|
||||||
mem_per_ps,
|
|
||||||
cpu_per_ps,
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
Creates a Paddle Cloud job. Fails if the job name exists.
|
|
||||||
|
|
||||||
```python
|
|
||||||
get_cloud_job(
|
|
||||||
name
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
Gets a Paddle Cloud job.
|
|
||||||
|
|
||||||
```python
|
|
||||||
remote_session(
|
|
||||||
job
|
|
||||||
)
|
|
||||||
```
|
|
||||||
|
|
||||||
- *job*: the Paddle Cloud job.
|
|
||||||
|
|
||||||
#### Example
|
|
||||||
|
|
||||||
```Python
|
|
||||||
reader = paddle.reader.recordio("/pfs/home/peter/mnist-train-*") # data stored on Paddle Cloud
|
|
||||||
image = reader.column(0)
|
|
||||||
label = reader.column(1)
|
|
||||||
fc1 = paddle.op.fc(image, size=256, act="sigmoid")
|
|
||||||
fc2 = paddle.op.fc(fc1, size=10, act="softmax")
|
|
||||||
cost = paddle.op.cross_entropy(fc2, label)
|
|
||||||
opt = paddle.optimizer.sgd(cost)
|
|
||||||
|
|
||||||
job = paddle.create_cloud_job("test", 3, "1G", 1, 1, 2, "1G", 1)
|
|
||||||
sess = paddle.remote_ession(job)
|
|
||||||
for i in range(1000):
|
|
||||||
sess.eval(opt)
|
|
||||||
sess.close()
|
|
||||||
```
|
|
||||||
|
Before Width: | Height: | Size: 46 KiB |
|
Before Width: | Height: | Size: 28 KiB |