Merge branch 'develop' of https://github.com/PaddlePaddle/Paddle into batch-norm-latest

8 years ago · 8a07aff4d7
parent 822cf9785b 12d862b556
commit 8a07aff4d7
181 changed files with 6174 additions and 1228 deletions
--- a/doc/design/model_format.md
+++ b/doc/design/model_format.md
@ -12,27 +12,25 @@ The topology is saved as a plain text in a detailed self-contain protobuf file.

 The parameters are saved as a binary file. As we all know, the protobuf message has a limit of [64M size](https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.io.coded_stream#CodedInputStream.SetTotalBytesLimit.details). We have done a [benchmark experiment](https://github.com/PaddlePaddle/Paddle/pull/4610), which shows that protobuf is not fit for the task.

-As a result, we design a particular format for tensor serialization. By default, an arbitrary tensor in Paddle is a [LoDTensor](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/lod_tensor.md), and has a description information proto of [LoDTensorDesc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/framework.proto#L99). We save the DescProto as the byte string header. It contains all the necessary information, such as the `dims`, the `name` of the tensor, and the `LoD` information in [LoDTensor](https://github.com/PaddlePaddle/Paddle/blob/1c0a4c901c9fc881d120249c703b15d1c50dae7d/paddle/framework/lod_tensor.md). A tensor stores values in a continuous memory buffer. For speed we dump the raw memory to disk and save it as the byte string content. So, the binary format of one tensor is, 
-
-|HeaderLength|ContentLength|**LoDTensorDesc**|**TensorValue**|
+As a result, we designed a dedicated format for tensor serialization. By default, any tensor in Paddle is a [LoDTensor](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/lod_tensor.md) described by a [LoDTensorDesc](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/framework/framework.proto#L99) proto. We save the DescProto as the byte string header. It contains all the necessary information, such as the `dims` and the `LoD` information in [LoDTensor](https://github.com/PaddlePaddle/Paddle/blob/1c0a4c901c9fc881d120249c703b15d1c50dae7d/paddle/framework/lod_tensor.md). A tensor stores its values in a contiguous memory buffer. For speed, we dump the raw memory to disk and save it as the byte string content. The binary format of one tensor is as follows.

 The table below shows a tensor's byte view in detail. Note that all the signed values are written in the little-endian format.

-```text
-[offset] [type]              [description] 
-0004     4 bytes integer      HeaderLength, the length of LoDTensorDesc
-0008     4 bytes integer      ContentLength, the length of LodTensor Buffer
-0009     1 bytes char         TensorDesc
-00010    1 bytes char         TensorDesc
-...
-00100    1 bytes char         TensorValue
-00101    1 bytes char         TensorValue
-00102    1 bytes char         TensorValue              ..
-...
-```
+|field name  | type | description |
+| --- | --- | --- |
+| version | uint32_t | Version of saved file. Always 0 now. |
+| tensor desc length | uint32_t | TensorDesc(Protobuf message) length in bytes. |
+| tensor desc | void* | TensorDesc protobuf binary message |
+| tensor data | void* | Tensor's data in binary format. The length of `tensor_data` is decided by `TensorDesc.dims()` and `TensorDesc.data_type()` |
+| lod_level | uint64_t | Level of LoD |
+| length of lod[0] | uint64_t | [Optional] length of lod[0] in bytes. |
+| data of lod[0] | uint64_t*  | [Optional] lod[0].data() |
+| ... | ... | ... |
+
+

 ## Summary

 - We introduce a model format.
- The `ProgramDesc` describe the model **topology**. 
+- The model represented by its forward-pass computation procedure is saved in a **ProgramDesc** protobuf message.
 - A bunch of specified format binary tensors describe the **parameters**.
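
The field table added to `model_format.md` above can be read as a byte layout. The following is a minimal Python sketch of that layout, not Paddle's implementation. It assumes little-endian encoding (as the document states), treats the `TensorDesc` protobuf as opaque bytes, and takes the tensor data length from the caller, because deriving it would require decoding `dims` and `data_type` from the protobuf. The names `pack_tensor` and `unpack_tensor` are hypothetical.

```python
import struct


def pack_tensor(desc_bytes, data_bytes, lod, version=0):
    """Serialize one tensor following the field table (hypothetical helper)."""
    out = [
        struct.pack("<I", version),          # version: uint32, 0 per the table
        struct.pack("<I", len(desc_bytes)),  # tensor desc length: uint32
        desc_bytes,                          # TensorDesc protobuf, kept opaque here
        data_bytes,                          # raw tensor data
        struct.pack("<Q", len(lod)),         # lod_level: uint64
    ]
    for level in lod:
        out.append(struct.pack("<Q", 8 * len(level)))         # length of lod[i] in bytes
        out.append(struct.pack("<%dQ" % len(level), *level))  # lod[i] entries as uint64
    return b"".join(out)


def unpack_tensor(buf, data_len):
    """Parse the layout written by pack_tensor; data_len is supplied by the caller."""
    pos = 0
    version, desc_len = struct.unpack_from("<II", buf, pos)
    pos += 8
    desc_bytes = buf[pos:pos + desc_len]
    pos += desc_len
    data_bytes = buf[pos:pos + data_len]
    pos += data_len
    (lod_level,) = struct.unpack_from("<Q", buf, pos)
    pos += 8
    lod = []
    for _ in range(lod_level):
        (nbytes,) = struct.unpack_from("<Q", buf, pos)
        pos += 8
        count = nbytes // 8  # each LoD entry is assumed to be one uint64
        lod.append(list(struct.unpack_from("<%dQ" % count, buf, pos)))
        pos += nbytes
    return version, desc_bytes, data_bytes, lod
```

In a real reader, `data_len` would come from the decoded `TensorDesc` (product of `dims` times the element size of `data_type`) rather than from the caller.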
--- a/doc/design/regularization.md
+++ b/doc/design/regularization.md
@ -1,7 +1,7 @@
 # Regularization in PaddlePaddle

 ## Introduction to Regularization
-A central problem in machine learning is how to design an algorithm that will perform well not just on the training data, but also on new data. Many strategies are used by machine learning practitioners to reduce the test error, possibly at the expense of increased training error. These strategies are collectively known as **regularization**. 
+A central problem in machine learning is how to design an algorithm that performs well not just on the training data, but also on new data. A frequently encountered problem is **overfitting**, where the model does not make reliable predictions on new, unseen data. **Regularization** is the process of introducing additional information in order to prevent overfitting. This is usually done by adding extra penalties to the loss function that restrict the parameter space an optimization algorithm can explore.

 ### Parameter Norm Penalties
 Most common regularization approaches in deep learning are based on limiting the capacity of the models by adding a parameter norm penalty to the objective function `J`. This is given as follows:
@ -18,52 +18,21 @@ The most commonly used norm penalties are the L2 norm penalty and the L1 norm pe
 ##### L1 Regularization
 <img src="./images/l1_regularization.png" align="center"/><br/>

-A much more detailed mathematical background of reguilarization can be found [here](http://www.deeplearningbook.org/contents/regularization.html).
+A much more detailed mathematical background of regularization can be found [here](http://www.deeplearningbook.org/contents/regularization.html).

+## Regularization Survey

-## How to do Regularization in PaddlePaddle
-
-On surveying existing frameworks like Tensorflow, PyTorch, Caffe, etc, it can be seen that there are 2 common approaches of doing regularization:
-
-1. Making regularization a part of the optimizer using an attribute like `weight_decay` that is used to control the scale of the L2 Penalty. This approach is used in PyTorch as follows:
-	```python
-	opt =  torch.optim.SGD(params, lr=0.2, weight_decay=0.2)
-	```
-    At every optimization step, this code will add the gradient of the L2 Norm of the params to the gradient of the params with respect to the loss function. This can seen in the following code snippet:
-    ```python
-    if weight_decay != 0:
-    	d_p.add_(weight_decay, p.data)
-    ```
-    This is a very restyrictive way of doing regularization and does not give the users enough flexibility. 
-    
-    **Advantages**:
-    -  It is easy to implement for us.
-    -  Faster execution of backward. However, it can be done manually by advanced users too.
-
-	**Disadvantages**:
-    - Not flexible for other regularizations such as L1/L0 regularization.
-    - Does not allow for different regularization coefficient for different parameters. For example, in most models, ony the weight matrices are regularized and the bias vectors are unregularized.
-    - Tightly coupled optimizer and regularization implementation. 
-
-
-2. Adding regularization ops to the graph through Python API. This approach is used by Tensorflow and Caffe. Using this approach, we manually add regularization ops to the graph and then add the regularization loss to the final loss function before sending them to the optimizer.
-
-	**Advantages**:
-    - Allows for greater flexibility to the users of Paddle. Using this approach, the users can put different regularization to different parameters and also choose parameters that are not a part of regularization.
-    - Makes it easy for the users to customize and extend the framework. 
-
-	**Disadvantages**:
-    - Implementation requires comprehensive design and time. 
+A detailed survey of regularization in various deep learning frameworks can be found [here](https://github.com/PaddlePaddle/Paddle/wiki/Regularization-Survey). 

 ## Proposal for Regularization in PaddlePaddle

 ### Low-Level implementation

-In the new design, we propose to create new operations for regularization. For now, we can add 2 ops thgat correspond to the most frequently used regularizations:
+In the new design, we propose to create new operations for regularization. For now, we can add 2 ops that correspond to the most frequently used regularizations:
 - L2_regularization_op
 - L1_regularization_op

-These ops can be like any other ops with their own CPU/GPU implementations either using Eigen or separate Cpu and GPU kernels. As the initial implementation, we can implement their kernels using Eigen following the abstraction pattern implemented for [Activation Ops](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/accuracy_op.h). This abstraction pattern can make it very easy to implement new regularization schemes. other than L1 and L2 norm penalties. 
+These ops can be like any other ops with their own CPU/GPU implementations either using Eigen or separate CPU and GPU kernels. As the initial implementation, we can implement their kernels using Eigen following the abstraction pattern implemented for [Activation Ops](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/accuracy_op.h). This abstraction pattern can make it very easy to implement new regularization schemes other than L1 and L2 norm penalties. 

 The idea of building ops for regularization is in sync with the refactored Paddle philosophy of using operators to represent any computation unit. The way these ops will be added to the computation graph, will be decided by the [layer functions](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/python_api.md#layer-function) in Python API. 

@ -94,7 +63,7 @@ Since we want to create the regularization ops in a lazy manner, the regularizat

 #### High-level API

-In PaddlePaddle Python API, users will primarily rely on [layer functions](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/python_api.md#layer-function) to create neural network layers. Hence, we lso need to provide regularization functionality in layer functions. The design of these APIs can be postponed for later right now. A good reference for these APIs can be found in [Keras](https://keras.io/regularizers/) and also by looking at Tensorflow in [`tf.contrib.layers`](https://www.tensorflow.org/api_guides/python/contrib.layers).
+In PaddlePaddle Python API, users will primarily rely on [layer functions](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/python_api.md#layer-function) to create neural network layers. Hence, we also need to provide regularization functionality in layer functions. The design of these APIs can be postponed for now. A good reference for these APIs can be found in [Keras](https://keras.io/regularizers/) and also by looking at TensorFlow in [`tf.contrib.layers`](https://www.tensorflow.org/api_guides/python/contrib.layers).
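
To make the proposed `L2_regularization_op` and `L1_regularization_op` concrete, here is a minimal Python sketch of what such ops might contribute to a parameter's gradient. It is an illustration, not the Paddle kernels. It assumes the L2 penalty is `0.5 * coeff * sum(w^2)` (gradient `coeff * w`) and the L1 penalty is `coeff * sum(|w|)` (subgradient `coeff * sign(w)`); the exact scaling in the design doc lives in images and is not reproduced here. All function names are hypothetical.

```python
def l2_penalty_grad(weights, coeff):
    """Gradient of 0.5 * coeff * sum(w^2) with respect to each weight."""
    return [coeff * w for w in weights]


def l1_penalty_grad(weights, coeff):
    """Subgradient of coeff * sum(|w|); uses 0 at w == 0."""
    def sign(w):
        return (w > 0) - (w < 0)
    return [coeff * sign(w) for w in weights]


def add_regularization(grads, weights, coeff, penalty_grad):
    """Add the penalty gradient to the loss gradient, element by element."""
    return [g + r for g, r in zip(grads, penalty_grad(weights, coeff))]
```

Passing the penalty as a function keeps the regularizer separate from the optimizer, so each parameter could in principle get its own regularizer or none at all.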



--- a/go/cmd/pserver/pserver.go
+++ b/go/cmd/pserver/pserver.go
@ -67,7 +67,7 @@ func main() {
 		cp, err = pserver.LoadCheckpoint(e, idx)
 		if err != nil {
 			if err == pserver.ErrCheckpointNotFound {
-				log.Info("Could not find the pserver checkpoint.")
+				log.Info("load checkpoint error", "error", err)
 			} else {
 				panic(err)
 			}
@ -99,7 +99,7 @@ func main() {
 	candy.Must(err)

 	go func() {
-		log.Info("starting pserver", log.Ctx{"port": *port})
+		log.Info("serving pserver", log.Ctx{"port": *port})
 		err = http.Serve(l, nil)
 		candy.Must(err)
 	}()
--- a/go/master/c/client.go
+++ b/go/master/c/client.go
@ -123,7 +123,8 @@ func paddle_set_dataset(client C.paddle_master_client, path **C.char, size C.int
 	}
 	err := c.SetDataset(paths)
 	if err != nil {
-		log.Error("error set dataset", log.Ctx{"error": err})
+		log.Error("error set dataset",
+			log.Ctx{"error": err, "paths": paths})
 		return C.PADDLE_MASTER_ERROR
 	}

--- a/go/master/client.go
+++ b/go/master/client.go
@ -121,6 +121,7 @@ func (c *Client) StartGetRecords(passID int) {
 }

 func (c *Client) getRecords(passID int) {
+	i := 0
 	for {
 		t, err := c.getTask(passID)
 		if err != nil {
@ -130,12 +131,20 @@ func (c *Client) getRecords(passID int) {
 				c.ch <- record{nil, err}
 				break
 			}
-			if err.Error() == ErrPassAfter.Error() {
-				// wait util last pass finishes
-				time.Sleep(time.Second * 3)
-				continue
+
+			if i%60 == 0 {
+				log.Debug("getTask of passID error.",
+					log.Ctx{"error": err, "passID": passID})
+				i = 0
 			}
-			log.Error("getTask error.", log.Ctx{"error": err})
+
+			// if err.Error() == ErrPassAfter.Error()
+			//   wait until last pass finishes
+			// if other error such as network error
+			//   wait to reconnect or task time out
+			time.Sleep(time.Second * 3)
+			i += 3
+			continue
 		}

 		for _, chunk := range t.Chunks {
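
The change to `getRecords` above retries every 3 seconds on error and logs only when the counter `i` is a multiple of 60, so a persistent failure is logged roughly once a minute instead of on every attempt. The following Python sketch mirrors that throttling pattern under simplifying assumptions: it retries on any exception and omits the early `break` cases of the Go code. The function name and its parameters are hypothetical, and `sleep` and `log` are injected so the loop can be exercised without waiting.

```python
def get_task_with_retry(get_task, sleep, log, retry_seconds=3, log_every_seconds=60):
    """Call get_task until it succeeds, logging at most once per log window."""
    elapsed = 0
    while True:
        try:
            return get_task()
        except Exception as err:
            # Log on the first failure and then once per window of retries.
            if elapsed % log_every_seconds == 0:
                log("getTask error: %s" % err)
                elapsed = 0
            # Wait for the previous pass to finish, or for the network
            # or task timeout to recover, before trying again.
            sleep(retry_seconds)
            elapsed += retry_seconds
```

With the default values, one message is emitted for every 20 consecutive failures.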
--- a/go/master/client_test.go
+++ b/go/master/client_test.go
@ -117,6 +117,7 @@ func TestNextRecord(t *testing.T) {
 			if e != nil {
 				panic(e)
 			}
+
 			// test for n passes
 			for pass := 0; pass < 10; pass++ {
 				c.StartGetRecords(pass)
--- a/go/pserver/optimizer.go
+++ b/go/pserver/optimizer.go
@ -71,9 +71,15 @@ func newOptimizer(paramWithConfigs ParameterWithConfig, State []byte) *optimizer
 		cstate = unsafe.Pointer(&s[0])
 	}

+	var cptr (*C.uchar)
+	if len(c) > 0 {
+		cptr = (*C.uchar)(&c[0])
+	} else {
+		log.Error("empty config", "param name", paramWithConfigs.Param.Name)
+	}
 	o.config = c
 	o.opt = C.paddle_create_optimizer(
-		(*C.uchar)(&c[0]),
+		cptr,
 		C.int(len(c)),
 		C.paddle_element_type(p.ElementType),
 		cbuffer,
--- a/go/pserver/service.go
+++ b/go/pserver/service.go
@ -17,12 +17,11 @@ package pserver
 import (
 	"bufio"
 	"bytes"
-	"crypto/md5"
 	"encoding/gob"
-	"encoding/hex"
 	"encoding/json"
 	"errors"
 	"fmt"
+	"hash/crc32"
 	"io/ioutil"
 	"os"
 	"path"
@ -40,7 +39,7 @@ type ElementType int

 // ErrCheckpointNotFound indicates that the pserver checkpoint could
 // not be found.
-var ErrCheckpointNotFound = errors.New("checkpoint not found")
+var ErrCheckpointNotFound = errors.New("checkpoint not found in etcd")

 // RPC error message.
 const (
@ -76,7 +75,7 @@ type ParameterWithConfig struct {
 type checkpointMeta struct {
 	UUID      string `json:"uuid"`
 	Path      string `json:"path"`
-	MD5       string `json:"md5"`
+	CRC32     uint32 `json:"crc32"`
 	Timestamp int64  `json:"timestamp"`
 }

@ -92,7 +91,7 @@ type Service struct {
 	idx                int
 	checkpointInterval time.Duration
 	checkpointPath     string
-	client             *EtcdClient
+	client             KVStore

 	mu     sync.Mutex
 	optMap map[string]*optimizer
@ -104,7 +103,12 @@ type parameterCheckpoint struct {
 	State []byte
 }

-func loadMeta(e *EtcdClient, idx int) (meta checkpointMeta, err error) {
+type KVStore interface {
+	GetKey(key string, timeout time.Duration) ([]byte, error)
+	PutKey(key string, value []byte, timeout time.Duration, withLease bool) error
+}
+
+func loadMeta(e KVStore, idx int) (meta checkpointMeta, err error) {
 	v, err := e.GetKey(PsCheckpoint+strconv.Itoa(idx), 3*time.Second)
 	if err != nil {
 		return
@ -123,7 +127,7 @@ func loadMeta(e *EtcdClient, idx int) (meta checkpointMeta, err error) {
 }

 // LoadCheckpoint loads checkpoint from file.
-func LoadCheckpoint(e *EtcdClient, idx int) (Checkpoint, error) {
+func LoadCheckpoint(e KVStore, idx int) (Checkpoint, error) {
 	log.Info("Loading checkpoint", "pserver index", idx)
 	defer traceTime(time.Now(), "load checkpoint")

@ -137,11 +141,8 @@ func LoadCheckpoint(e *EtcdClient, idx int) (Checkpoint, error) {
 		return nil, err
 	}

-	// TODO(helin): change MD5 to CRC since CRC is better for file
-	// checksum in our use case (emphasize speed over security).
-	h := md5.New()
-	md5 := hex.EncodeToString(h.Sum(content))
-	if md5 != cpMeta.MD5 {
+	crc32 := crc32.ChecksumIEEE(content)
+	if crc32 != cpMeta.CRC32 {
 		return nil, errors.New(WrongChecksum)
 	}

@ -150,12 +151,13 @@ func LoadCheckpoint(e *EtcdClient, idx int) (Checkpoint, error) {
 	if err = dec.Decode(&cp); err != nil {
 		return nil, err
 	}
+
 	return cp, nil
 }

 // NewService creates a new service, will bypass etcd registration if no
 // endpoints specified. It will recover from the checkpoint file if a specified checkpoint exists.
-func NewService(idx int, interval time.Duration, path string, client *EtcdClient, cp Checkpoint) (*Service, error) {
+func NewService(idx int, interval time.Duration, path string, client KVStore, cp Checkpoint) (*Service, error) {
 	s := &Service{
 		idx:                idx,
 		checkpointInterval: interval,
@ -173,6 +175,7 @@ func NewService(idx int, interval time.Duration, path string, client *EtcdClient
 			}
 			s.optMap[p.Param.Name] = newOptimizer(p, item.State)
 		}
+		close(s.initialized)
 	}
 	return s, nil
 }
@ -221,7 +224,7 @@ func (s *Service) FinishInitParams(_ int, _ *int) error {
 		for range t {
 			err := s.checkpoint()
 			if err != nil {
-				log.Error("finish init params error", log.Ctx{"error": err})
+				log.Error("checkpoint error", log.Ctx{"error": err})
 			}
 		}
 	}()
@ -274,6 +277,7 @@ func (s *Service) GetParam(name string, parameter *Parameter) error {
 	parameter.Name = name
 	parameter.ElementType = opt.elementType
 	parameter.Content = opt.GetWeights()
+
 	log.Info("sending parameter to the trainer", "name", parameter.Name, "size", len(parameter.Content), "type", parameter.ElementType)
 	return nil
 }
@ -354,20 +358,29 @@ func (s *Service) checkpoint() (err error) {

 	oldMeta, err := loadMeta(s.client, s.idx)
 	if err == ErrCheckpointNotFound {
-		log.Info("Do not have existing checkpoint.")
+		log.Info("old meta not found, skip removing old meta")
 		err = nil
+	} else if err == nil {
+		log.Info("removing old meta")
+		if oldMeta.Path != "" {
+			rmErr := os.Remove(oldMeta.Path)
+			if rmErr != nil {
+				// log error, but still treat checkpoint as
+				// successful.
+				log.Error("remove old meta file error", log.Ctx{"error": rmErr})
+			}
+		}
 	}

 	if err != nil {
 		return
 	}

-	h := md5.New()
-	md5 := hex.EncodeToString(h.Sum(buf.Bytes()))
+	crc32 := crc32.ChecksumIEEE(buf.Bytes())
 	cpMeta := checkpointMeta{
 		UUID:      id,
 		Timestamp: time.Now().UnixNano(),
-		MD5:       md5,
+		CRC32:     crc32,
 		Path:      p,
 	}

@ -381,14 +394,5 @@ func (s *Service) checkpoint() (err error) {
 		return
 	}

-	if oldMeta.Path != "" {
-		rmErr := os.Remove(oldMeta.Path)
-		if rmErr != nil {
-			// log error, but still treat checkpoint as
-			// successful.
-			log.Error("remove old meta file error", log.Ctx{"error": rmErr})
-		}
-	}
-
 	return
 }
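
The switch from MD5 to CRC32 in `service.go` stores a 32-bit checksum in the checkpoint metadata and compares it on load. A minimal Python sketch of the same idea follows. It is not the pserver code: `make_meta` and `verify_checkpoint` are hypothetical helpers, and only the `path` and `crc32` fields of the metadata are modeled. Python's `zlib.crc32` computes the standard CRC-32 with the IEEE polynomial, the same checksum family as Go's `crc32.ChecksumIEEE`.

```python
import json
import zlib


def make_meta(path, content):
    """Build JSON metadata holding the checkpoint path and its CRC-32."""
    checksum = zlib.crc32(content) & 0xFFFFFFFF  # keep it an unsigned 32-bit value
    return json.dumps({"path": path, "crc32": checksum})


def verify_checkpoint(meta_json, content):
    """Return True if content matches the checksum recorded in the metadata."""
    meta = json.loads(meta_json)
    return (zlib.crc32(content) & 0xFFFFFFFF) == meta["crc32"]
```

CRC32 detects accidental corruption cheaply but offers no protection against deliberate tampering, which matches the removed TODO's stated preference for speed over security.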
--- a/go/pserver/service_internal_test.go
+++ b/go/pserver/service_internal_test.go
@ -0,0 +1,86 @@
+package pserver
+
+import (
+	"bytes"
+	"encoding/binary"
+	"fmt"
+	"testing"
+	"time"
+
+	"github.com/stretchr/testify/assert"
+)
+
+const testDir = "./test_data"
+
+type myKV struct {
+	m map[string][]byte
+}
+
+func (m *myKV) GetKey(key string, timeout time.Duration) ([]byte, error) {
+	if m.m == nil {
+		m.m = make(map[string][]byte)
+	}
+	return m.m[key], nil
+}
+
+func (m *myKV) PutKey(key string, value []byte, timeout time.Duration, withLease bool) error {
+	if m.m == nil {
+		m.m = make(map[string][]byte)
+	}
+	m.m[key] = value
+	return nil
+}
+
+func TestCheckpoint(t *testing.T) {
+	kv := &myKV{}
+	s, err := NewService(0, time.Hour, testDir, kv, nil)
+	assert.Nil(t, err)
+	err = s.checkpoint()
+	assert.Nil(t, err)
+	_, err = LoadCheckpoint(kv, 0)
+	assert.Nil(t, err)
+}
+
+func float32ToByte(f float32) []byte {
+	var buf bytes.Buffer
+	err := binary.Write(&buf, binary.LittleEndian, f)
+	if err != nil {
+		fmt.Println("binary.Write failed:", err)
+	}
+	return buf.Bytes()
+}
+
+func TestCheckpointWithData(t *testing.T) {
+	kv := &myKV{}
+	s, err := NewService(0, time.Hour, testDir, kv, nil)
+	assert.Nil(t, err)
+
+	var content []byte
+	for i := 0; i < 50000; i++ {
+		content = append(content, float32ToByte(float32(i))...)
+	}
+
+	p1 := Parameter{Name: "p1", ElementType: 1, Content: content}
+	err = s.InitParam(ParameterWithConfig{Param: p1}, nil)
+	assert.Nil(t, err)
+
+	err = s.FinishInitParams(0, nil)
+	assert.Nil(t, err)
+
+	var p2 Parameter
+	err = s.GetParam(p1.Name, &p2)
+	assert.Nil(t, err)
+	assert.Equal(t, p1, p2)
+
+	err = s.checkpoint()
+	assert.Nil(t, err)
+	cp, err := LoadCheckpoint(kv, 0)
+	assert.Nil(t, err)
+	s1, err := NewService(0, time.Hour, testDir, kv, cp)
+	assert.Nil(t, err)
+
+	var p3 Parameter
+	err = s1.GetParam(p1.Name, &p3)
+	assert.Nil(t, err)
+	assert.Equal(t, p1, p3)
+}
--- a/go/pserver/service_test.go
+++ b/go/pserver/service_test.go
@ -178,7 +178,3 @@ func TestBlockUntilInitialized(t *testing.T) {

 	wg.Wait()
 }
-
-func TestCheckpointSpeed(t *testing.T) {
-	//TODO(zhihong): test speed
-}
--- a/paddle/capi/gradient_machine.cpp
+++ b/paddle/capi/gradient_machine.cpp
@ -64,12 +64,18 @@ paddle_error paddle_gradient_machine_create_for_inference_with_parameters(
  modelConfigProtobuf.resize(modelConfigSize);
  is.read(&modelConfigProtobuf[0], modelConfigSize);
  paddle::TrainerConfig config;
+  paddle::ModelConfig modelConfig;
  if (!config.ParseFromString(modelConfigProtobuf) || !config.IsInitialized()) {
-    return kPD_PROTOBUF_ERROR;
+    if (!modelConfig.ParseFromString(modelConfigProtobuf) ||
+        !modelConfig.IsInitialized()) {
+      return kPD_PROTOBUF_ERROR;
+    }
+  } else {
+    modelConfig = config.model_config();
  }
  auto ptr = new paddle::capi::CGradientMachine();
  ptr->machine.reset(paddle::GradientMachine::create(
-      config.model_config(), CREATE_MODE_TESTING, {paddle::PARAMETER_VALUE}));
+      modelConfig, CREATE_MODE_TESTING, {paddle::PARAMETER_VALUE}));
  std::vector<paddle::ParameterPtr>& parameters = ptr->machine->getParameters();
  for (auto& para : parameters) {
    para->load(is);
--- a/paddle/framework/CMakeLists.txt
+++ b/paddle/framework/CMakeLists.txt
@ -1,6 +1,5 @@
 # ddim lib
 proto_library(framework_proto SRCS framework.proto)
-proto_library(saver_proto SRCS framework.proto saver.proto)

 cc_library(ddim SRCS ddim.cc DEPS eigen3)
 cc_test(ddim_test SRCS ddim_test.cc DEPS ddim)
@ -10,7 +9,7 @@ cc_library(tensor SRCS tensor.cc DEPS ddim place paddle_memory device_context)
 cc_test(tensor_test SRCS tensor_test.cc DEPS tensor)
 cc_test(eigen_test SRCS eigen_test.cc DEPS tensor)

-cc_library(lod_tensor SRCS lod_tensor.cc DEPS ddim place tensor saver_proto framework_proto)
+cc_library(lod_tensor SRCS lod_tensor.cc DEPS ddim place tensor framework_proto)
 cc_test(lod_tensor_test SRCS lod_tensor_test.cc DEPS lod_tensor paddle_memory)
 nv_test(lod_tensor_gpu_test SRCS lod_tensor_test.cu DEPS lod_tensor)

@ -27,7 +26,7 @@ cc_test(op_proto_maker_test SRCS op_proto_maker_test.cc DEPS op_proto_maker)
 cc_library(op_info SRCS op_info.cc DEPS attribute framework_proto)
 cc_library(operator SRCS operator.cc DEPS op_info device_context tensor scope glog)
 cc_test(operator_test SRCS operator_test.cc DEPS operator op_registry)
-cc_library(proto_desc SRCS var_desc.cc op_desc.cc block_desc.cc program_desc.cc DEPS attribute ddim op_info operator)
+cc_library(proto_desc SRCS var_desc.cc op_desc.cc block_desc.cc program_desc.cc DEPS attribute ddim op_info operator glog)

 cc_library(op_registry SRCS op_registry.cc DEPS op_proto_maker op_info operator glog proto_desc)
 cc_test(op_registry_test SRCS op_registry_test.cc DEPS op_registry)
@ -43,7 +42,7 @@ add_custom_command(TARGET framework_py_proto POST_BUILD
    WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR})

 cc_library(backward SRCS backward.cc DEPS net_op)
-cc_test(backward_test SRCS backward_test.cc DEPS backward recurrent_op device_context)
+cc_test(backward_test SRCS backward_test.cc DEPS backward recurrent_op device_context fill_constant_op)

 cc_library(executor SRCS executor.cc DEPS op_registry device_context scope framework_proto backward glog)

--- a/paddle/framework/backward.cc
+++ b/paddle/framework/backward.cc
@ -315,6 +315,7 @@ static void CreateGradVarInBlock(
                     return false; /* not break */
                   });
    if (need_infer_shape) {
+      ops[op_index]->InferVarType(block_desc);
      ops[op_index]->InferShape(*block_desc);
    }
  }
@ -452,11 +453,16 @@ ParamGradInfoMap AppendBackward(
  std::transform(target_shape_desc.begin(), target_shape_desc.end(),
                 std::back_inserter(target_shape),
                 [](int64_t dim) { return static_cast<int>(dim); });
+  VLOG(3) << "backward from loss=" << target.Name()
+          << " data_type=" << target.GetDataType();
  std::unique_ptr<OpDescBind> fill_one_op(
      new OpDescBind("fill_constant", {}, {{"Out", {fill_one_op_out}}},
                     {{"shape", target_shape},
                      {"value", static_cast<float>(1.0)},
-                      {"data_type", framework::DataType::FP32}}));
+                      {"data_type", target.GetDataType()}}));
+  // infer var type of fill_one_op
+  fill_one_op->InferVarType(root_block);
+
  root_block->AppendAllocatedOp(std::move(fill_one_op));
  size_t forward_op_num = root_block->OpSize();
  size_t forward_block_num = program_desc.Size();
@ -475,8 +481,7 @@ ParamGradInfoMap AppendBackward(
  std::unordered_map<std::string, GradVarInfo> retv;

  auto var = root_block->Var(fill_one_op_out);
-  // FIXME(qiao) infer the data type
-  var->SetDataType(framework::DataType::FP32);
+  var->SetDataType(target.GetDataType());
  var->SetShape(target.Shape());
  auto& target_grad = retv[target.Name()];
  target_grad.name_ = fill_one_op_out;
--- a/paddle/framework/backward_test.cc
+++ b/paddle/framework/backward_test.cc
@ -21,6 +21,8 @@
 #include "paddle/framework/var_desc.h"
 #include "paddle/operators/net_op.h"

+USE_OP(fill_constant);
+
 namespace paddle {
 namespace framework {

--- a/paddle/framework/block_desc.cc
+++ b/paddle/framework/block_desc.cc
@ -120,6 +120,17 @@ BlockDesc *BlockDescBind::Proto() {
  Flush();
  return desc_;
 }
+
+BlockDescBind::BlockDescBind(ProgramDescBind *prog, BlockDesc *desc)
+    : prog_(prog), desc_(desc), need_update_(false) {
+  for (const VarDesc &var_desc : desc_->vars()) {
+    vars_[var_desc.name()].reset(new VarDescBind(var_desc));
+  }
+  for (const OpDesc &op_desc : desc_->ops()) {
+    ops_.emplace_back(new OpDescBind(op_desc, prog));
+  }
+}
+
 BlockDescBind::BlockDescBind(const BlockDescBind &other, BlockDesc *desc,
                             ProgramDescBind *prog)
    : prog_(prog), desc_(desc) {
--- a/paddle/framework/block_desc.h
+++ b/paddle/framework/block_desc.h
@ -36,8 +36,7 @@ class ProgramDescBind;

 class BlockDescBind {
 public:
-  BlockDescBind(ProgramDescBind *prog, BlockDesc *desc)
-      : prog_(prog), desc_(desc), need_update_(false) {}
+  BlockDescBind(ProgramDescBind *prog, BlockDesc *desc);

  BlockDescBind(const BlockDescBind &other, BlockDesc *desc,
                ProgramDescBind *prog);
--- a/paddle/framework/data_type.h
+++ b/paddle/framework/data_type.h
@ -15,6 +15,7 @@
 #pragma once
 #include <typeindex>
 #include "paddle/framework/framework.pb.h"
+#include "paddle/platform/enforce.h"

 namespace paddle {
 namespace framework {
--- a/paddle/framework/ddim.cc
+++ b/paddle/framework/ddim.cc
@ -195,6 +195,14 @@ std::vector<int64_t> vectorize(const DDim& ddim) {
  return result;
 }

+// NOTE: framework::vectorize converts to type int64_t
+//       which does not fit cudnn inputs.
+std::vector<int> vectorize2int(const DDim& ddim) {
+  std::vector<int64_t> temp = vectorize(ddim);
+  std::vector<int> result(temp.begin(), temp.end());
+  return result;
+}
+
 struct ProductVisitor : public boost::static_visitor<int64_t> {
  template <int D>
  int64_t operator()(const Dim<D>& dim) {
--- a/paddle/framework/ddim.h
+++ b/paddle/framework/ddim.h
@ -93,6 +93,7 @@ int64_t get(const DDim& dim, int idx);
 void set(DDim& dim, int idx, int val);

 std::vector<int64_t> vectorize(const DDim& ddim);
+std::vector<int> vectorize2int(const DDim& ddim);

 int64_t product(const DDim& ddim);

--- a/paddle/framework/details/op_registry.h
+++ b/paddle/framework/details/op_registry.h
@ -28,7 +28,8 @@ enum OpInfoFillType {
  kOperator = 0,
  kOpProtoAndCheckerMaker = 1,
  kGradOpDescMaker = 2,
-  kVarTypeInference = 3
+  kVarTypeInference = 3,
+  kShapeInference = 4
 };

 template <typename T>
@ -42,7 +43,10 @@ struct OpInfoFillTypeID {
                             ? kGradOpDescMaker
                             : (std::is_base_of<VarTypeInference, T>::value
                                    ? kVarTypeInference
-                                    : static_cast<OpInfoFillType>(-1))));
+                                    : (std::is_base_of<InferShapeBase, T>::value
+                                           ? kShapeInference
+                                           : static_cast<OpInfoFillType>(
+                                                 -1)))));
  }
 };

@ -121,6 +125,16 @@ struct OpInfoFiller<T, kVarTypeInference> {
  }
 };

+template <typename T>
+struct OpInfoFiller<T, kShapeInference> {
+  void operator()(const char* op_type, OpInfo* info) const {
+    info->infer_shape_ = [](InferShapeContext* ctx) {
+      T inference;
+      inference(ctx);
+    };
+  }
+};
+
 }  // namespace details

 }  // namespace framework
--- a/paddle/framework/executor.cc
+++ b/paddle/framework/executor.cc
@ -20,6 +20,7 @@ limitations under the License. */
 #include <set>
 #include <vector>

+#include "paddle/framework/feed_fetch_type.h"
 #include "paddle/framework/lod_tensor.h"
 #include "paddle/framework/op_registry.h"
 #include "paddle/framework/scope.h"
@ -56,6 +57,22 @@ Executor::~Executor() {
  }
 }

+static void CreateTensor(Variable* var, VarDesc::VarType var_type) {
+  if (var_type == VarDesc::LOD_TENSOR) {
+    var->GetMutable<LoDTensor>();
+  } else if (var_type == VarDesc::SELECTED_ROWS) {
+    var->GetMutable<SelectedRows>();
+  } else if (var_type == VarDesc::FEED_MINIBATCH) {
+    var->GetMutable<FeedFetchList>();
+  } else if (var_type == VarDesc::FETCH_LIST) {
+    var->GetMutable<FeedFetchList>();
+  } else {
+    PADDLE_THROW(
+        "Variable type must be "
+        "LoDTensor/SelectedRows/FEED_MINIBATCH/FETCH_LIST.");
+  }
+}
+
 void Executor::Run(const ProgramDesc& pdesc, Scope* scope, int block_id) {
  // TODO(tonyyang-svail):
  //    - only runs on the first device (i.e. no interdevice communication)
@ -69,10 +86,12 @@ void Executor::Run(const ProgramDesc& pdesc, Scope* scope, int block_id) {
  for (auto& var : block.vars()) {
    if (var.persistable()) {
      auto* ptr = scope->Var(var.name());
+      CreateTensor(ptr, var.type());
      VLOG(3) << "Create Variable " << var.name()
              << " global, which pointer is " << ptr;
    } else {
      auto* ptr = local_scope.Var(var.name());
+      CreateTensor(ptr, var.type());
      VLOG(3) << "Create Variable " << var.name()
              << " locally, which pointer is " << ptr;
    }
--- a/paddle/framework/lod_tensor.cc
+++ b/paddle/framework/lod_tensor.cc
@ -13,7 +13,6 @@
   limitations under the License. */

 #include "paddle/framework/lod_tensor.h"
-#include "paddle/framework/saver.pb.h"

 #include "paddle/memory/memcpy.h"
 #include "paddle/memory/memory.h"
@ -136,141 +135,5 @@ void LoDTensor::ShrinkInLevel(size_t level, size_t elem_begin,
  PADDLE_ENFORCE_LT(begin, end, "Cannot shrink, the result tensor is empty.");
  ShareDataWith(Slice(begin, end));
 }
-
-std::string LoDTensor::SerializeToString() const {
-  LoDTensorProto desc;
-
-  // set data_type
-  if (this->type() == typeid(int8_t)) desc.set_data_type(DataType::BOOL);
-  if (this->type() == typeid(int16_t)) desc.set_data_type(DataType::INT16);
-  if (this->type() == typeid(int32_t)) desc.set_data_type(DataType::INT32);
-  if (this->type() == typeid(int64_t)) desc.set_data_type(DataType::INT64);
-  // FIXME(dzh): there is no fp16 in standard c++
-
-  if (this->type() == typeid(float))  // NOLINT
-    desc.set_data_type(DataType::FP32);
-  if (this->type() == typeid(double))  // NOLINT
-    desc.set_data_type(DataType::FP64);
-
-  for (int i = 0; i < dims().size(); ++i) {
-    desc.add_dims(dims()[i]);
-  }
-
-  // set lod information
-  desc.set_lod_level(this->NumLevels());
-  for (size_t i = 0; i < this->NumLevels(); ++i) {
-    LoDInfo* lod = desc.add_levels();
-    for (size_t j = 0; j < lod_[i].size(); ++j) {
-      lod->add_level(lod_[i][j]);
-    }
-  }
-
-  desc.set_version(0);
-
-  std::string desc_bytes = desc.SerializeAsString();
-
-  // FIXME(dzh) : implement fix chunk size buffer.
-  size_t DESC_SIZE = desc_bytes.size();
-  size_t DATA_SIZE = holder_->size() - offset_;
-
-  const size_t BUFFER_SIZE = DESC_SIZE + DATA_SIZE + 2 * sizeof(size_t);
-  char* buffer =
-      static_cast<char*>(memory::Alloc(platform::CPUPlace(), BUFFER_SIZE));
-
-  // format: desc_size data_size, desc_bytes, data_bytes.
-  platform::CPUPlace src_place;
-  platform::CPUPlace dst_place;
-
-  memory::Copy(dst_place, buffer, src_place, &BUFFER_SIZE, sizeof(size_t));
-  memory::Copy(dst_place, buffer + sizeof(size_t), src_place, &DESC_SIZE,
-               sizeof(size_t));
-  memory::Copy(dst_place, buffer + sizeof(size_t) * 2, src_place,
-               desc_bytes.c_str(), desc_bytes.size());
-
-  PADDLE_ENFORCE(this->numel() != 0, "Serialize a empty Tensor!");
-
-  platform::Place place = holder_->place();
-  int element_width = holder_->size() / this->numel();
-
-  if (platform::is_cpu_place(place)) {
-    memory::Copy(dst_place, buffer + sizeof(size_t) * 2 + desc_bytes.size(),
-                 boost::get<platform::CPUPlace>(place),
-                 static_cast<char*>(holder_->ptr()) + offset_ / element_width,
-                 DATA_SIZE);
-  }
-#ifdef PADDLE_WITH_GPU
-  if (platform::is_gpu_place(place)) {
-    memory::Copy(dst_place, buffer + sizeof(size_t) * 2 + desc_bytes.size(),
-                 boost::get<platform::GPUPlace>(place),
-                 static_cast<char*>(holder_->ptr()) + offset_ / element_width,
-                 DATA_SIZE);
-  }
-#endif
-
-  std::string ret(buffer, BUFFER_SIZE);
-  memory::Free(platform::CPUPlace(), buffer);
-  return ret;
-}
-
-void LoDTensor::DeserializeFromString(const std::string& s,
-                                      const platform::Place& dst_place) {
-  size_t DESC_SIZE, BUFFER_SIZE;
-  platform::CPUPlace src_place;
-
-  memory::Copy(src_place, &BUFFER_SIZE, src_place, s.c_str(), sizeof(size_t));
-  memory::Copy(src_place, &DESC_SIZE, src_place, s.c_str() + sizeof(size_t),
-               sizeof(size_t));
-
-  const size_t DATA_SIZE = BUFFER_SIZE - DESC_SIZE - sizeof(size_t) * 2;
-
-  // parse LoDTensorDesc
-  LoDTensorProto desc;
-  desc.ParseFromArray(s.c_str() + sizeof(size_t) * 2, DESC_SIZE);
-
-  std::vector<int64_t> dims;
-  std::copy(desc.dims().begin(), desc.dims().end(), std::back_inserter(dims));
-  this->Resize(make_ddim(dims));
-
-  // parse data type
-  void* ptr = nullptr;
-  if (desc.data_type() == DataType::BOOL)
-    ptr = this->mutable_data<bool>(dst_place);
-  if (desc.data_type() == DataType::INT16)
-    ptr = this->mutable_data<int16_t>(dst_place);
-  if (desc.data_type() == DataType::INT32)
-    ptr = this->mutable_data<int32_t>(dst_place);
-  if (desc.data_type() == DataType::INT64)
-    ptr = this->mutable_data<int64_t>(dst_place);
-  // FIXME(dzh): there is no fp16 in standard c++
-
-  if (desc.data_type() == DataType::FP32)
-    ptr = this->mutable_data<float>(dst_place);
-  if (desc.data_type() == DataType::FP64)
-    ptr = this->mutable_data<double>(dst_place);
-
-  LoD lod;
-  std::vector<size_t> levels;
-  for (int i = 0; i < desc.levels().size(); ++i) {
-    auto current_level = desc.levels()[i].level();
-    std::copy(current_level.begin(), current_level.end(),
-              std::back_inserter(levels));
-    lod.emplace_back(levels);
-    levels.clear();
-  }
-
-  this->set_lod(lod);
-
-  if (platform::is_cpu_place(dst_place)) {
-    memory::Copy(boost::get<platform::CPUPlace>(dst_place), ptr, src_place,
-                 s.c_str() + sizeof(size_t) * 2 + DESC_SIZE, DATA_SIZE);
-  }
-#ifdef PADDLE_WITH_GPU
-  if (platform::is_gpu_place(dst_place)) {
-    memory::Copy(boost::get<platform::GPUPlace>(dst_place), ptr, src_place,
-                 s.c_str() + sizeof(size_t) * 2 + DESC_SIZE, DATA_SIZE);
-  }
-#endif
-}
-
 }  // namespace framework
 }  // namespace paddle
--- a/paddle/framework/lod_tensor.h
+++ b/paddle/framework/lod_tensor.h
@ -85,7 +85,9 @@ class LoDTensor : public Tensor {

  void set_lod(const LoD& lod) { lod_ = lod; }

-  LoD lod() const { return lod_; }
+  const LoD& lod() const { return lod_; }
+
+  LoD* mutable_lod() { return &lod_; }

  /*
   * Get the start offset and end offset of an  element from LoD.
@ -139,27 +141,6 @@ class LoDTensor : public Tensor {
   */
  void ShrinkInLevel(size_t level, size_t elem_begin, size_t elem_end);

-  /**
-   *  @brief Serialize tensor to char bytes.
-   *  Please check model_format.md for the format detail.
-   *  NOTE: GPUTensor will copy data to cpu implicitly.
-   *  @return return string
-   */
-
-  // FIXME(dzh) : Currently, this interface should only be used in
-  // save/restore model and checkpoint. ParameterServer do not use shape
-  // information to do the optimization, as a result, when we serialize
-  // parameter/gradient to string, we should serialize the tensor
-  // to string in the ps trainer instead of LoDTensor.
-  std::string SerializeToString() const;
-
-  /**
-   *  @brief Deserialize char bytes to tensor.
-   *  @return return string
-   */
-  void DeserializeFromString(const std::string& s,
-                             const platform::Place& dst_place);
-
 private:
  LoD lod_;
 };
--- a/paddle/framework/lod_tensor_test.cc
+++ b/paddle/framework/lod_tensor_test.cc
@ -144,21 +144,5 @@ TEST(LodExpand, test) {
  }
 }

-TEST_F(LoDTensorTester, SerializeDeserialize) {
-  LoDTensor new_lod_tensor = lod_tensor_;
-  float* src_ptr = lod_tensor_.data<float>();
-  std::string s = lod_tensor_.SerializeToString();
-  LoDTensor dst;
-  dst.DeserializeFromString(s, platform::CPUPlace());
-  float* dst_ptr = dst.data<float>();
-  for (int i = 0; i < kLodTensorSize; ++i) {
-    EXPECT_EQ(dst_ptr[i], src_ptr[i]);
-  }
-
-  ASSERT_EQ(dst.NumElements(0), 2UL);
-  ASSERT_EQ(dst.NumElements(1), 3UL);
-  ASSERT_EQ(dst.NumElements(2), 8UL);
-}
-
 }  // namespace framework
 }  // namespace paddle
--- a/paddle/framework/lod_tensor_test.cu
+++ b/paddle/framework/lod_tensor_test.cu
@ -47,31 +47,4 @@ TEST(LoDTensor, LoDInGPU) {
  for (size_t i = 0; i < src_lod[0].size(); ++i) {
    CHECK_EQ(lod[0].data()[i], src_lod[0].data()[i] * 2);
  }
-}
-
-TEST(LoDTensor, SerializeDeserialize) {
-  paddle::framework::LoDTensor lod_tensor;
-  paddle::platform::GPUPlace place(0);
-
-  paddle::framework::LoD src_lod;
-  src_lod.push_back(std::vector<size_t>{0, 2, 4, 6, 8, 10, 12, 14});
-
-  lod_tensor.Resize({14, 16});
-  lod_tensor.mutable_data<float>(place);
-
-  lod_tensor.set_lod(src_lod);
-  CHECK_EQ(lod_tensor.lod_element(0, 2).first, 4UL);
-  CHECK_EQ(lod_tensor.lod_element(0, 4).first, 8UL);
-
-  test<<<1, 8>>>(src_lod[0].data(), src_lod[0].size());
-  cudaDeviceSynchronize();
-
-  std::string s = lod_tensor.SerializeToString();
-  paddle::framework::LoDTensor dst;
-  dst.DeserializeFromString(s, place);
-  paddle::framework::LoD dst_lod = dst.lod();
-
-  for (size_t i = 0; i < dst_lod[0].size(); ++i) {
-    CHECK_EQ(src_lod[0].data()[i], dst_lod[0].data()[i] * 2);
-  }
-}
+}