* change INT8 to template so that checking dst_dt with if-else could be removed. CI will be enabled after fixing reviews
* reverse user_residual_memory_p and user_bias_memory_p declaration scope
test=develop
* extend matmul op to support multiple head multiplication
With the support of multiple head, the multiplication of two big matrixes is
split into multiplication of several (head_number) small matrixes. e.g. if
Mat A is [3, 24] and Mat B is [24, 4], when multiple A and B with head_number
as 4, Mat A will be split as 4 matrix of [3, 6] and Mat B will be 4 matrix of
[6, 4]. The result of final matrix will be 4 matrix of [3, 4], i.e. [3, 16].
* update paddle-trt for:
1. fix bug: when batch > 2, core in split plugin.
2. add leaky_relu trt5.0 support (yolov3 from 65ms to 42ms.)
3. add new attr to dropout.
4. shuffle channel, swish, relu6 support
test=develop
* 1. fix ci
test=develop
The change includes 2 things:
1. save delta model and shrink table are control by the same parameter before, now add delete_after_unseen_days to control shrink table.
2. value in sparse table has no slot before, now add slot in sparse table, and add DownpureCtrAccessor to support the new meta.
test=develop
(1)support patch data (merge slots of instances of same line id, modify dense layer which
changes its size)
(2)add fleet load_one_table interface, support load from paddle model and load from pslib model
(3)fix push sparse bug which cause push sparse cost more time(about 10% in my testcase)
(4)when some slots are not in one of your network (join/update, etc.),data feed、collect label info、push/pull sparse will skip these slots, instead of throw error.
(5)add more debug info in TrainFilesWithProfiler
The change includes 3 things:
1. Set CPU_NUM to 1 in the tests because the ParallelExecutor will print warning that CPU_NUM is not set and use default 1.
2. Old tests compare two RNNs, hand written simple RNN and same RNN built by Paddle, but initialized RNN weights in numpy random and Paddle random separately. Fixed it by setting weights and bias values.
3. Also set numpy random seed in the tests. Now the two RNNs diff can be smaller (rtol from 0.1, 0.2 to. 0.01) in the tests.
test=develop
Test PaddingRNN on V100 GPU device.
Test configuration: large model, padding mode (which is the mode using recurrentOp), one GPU.
GPU memory (MiB): 6414 (this PR) vs 6837 (without this PR)
Speed (steps/s): 10.28 (this PR) vs 9.89 (without this PR)