commit
9a8318685a
Binary file not shown.
Before: Size: 28 KiB | After: Size: 28 KiB
@@ -0,0 +1,41 @@

# Design Doc: The C++ Class `Parameters`

`Parameters` is a concept we designed in the Paddle V2 API. `Parameters` is a container of parameters, and it lets Paddle share parameters between topologies. The usage of `Parameters` is described in [api.md](./api.md).

We implemented `Parameters` in Python while designing the V2 API. The current implementation has several defects:

* We simply `memcpy` parameters to share them between topologies, which is very inefficient.
* We do not share `Parameters` during training; we only trigger a `memcpy` when training starts.

It is therefore necessary to implement `Parameters` on the C++ side. However, this requires some refactoring of Paddle, because Paddle was designed to train only one topology at a time, i.e., each `GradientMachine` holds its `Parameter`s as data members. In the current Paddle implementation, there are three concepts associated with `Parameters`:

1. `paddle::Parameter`. A `Parameters` is a container for `paddle::Parameter` objects (see the sketch after this list).
   It is evident that we should use `paddle::Parameter` when developing `Parameters`.
   However, the `Parameter` class contains many functions and does not have a clear interface.
   It mixes `create/store Parameter`, `serialize/deserialize`, `optimize (i.e., SGD)`, and `randomize/zero`.
   When developing `Parameters`, we only need the `create/store Parameter` functionality.
   We should extract the functionalities of `Parameter` into separate classes to clean up the Paddle C++ implementation.

2. `paddle::GradientMachine` and its sub-classes, e.g., `paddle::MultiGradientMachine`, `paddle::NeuralNetwork`.
   We should pass `Parameters` to `paddle::GradientMachine` for `forward/backward` to avoid `memcpy` between topologies.
   We also need to handle multi-GPU/CPU training, because `forward` and `backward` run on multiple GPUs and CPUs.
   `Parameters` should dispatch the parameter values to each device and gather the parameter gradients from each device.

3. `paddle::ParameterUpdater`. The `ParameterUpdater` is used to update parameters in Paddle.
   So `Parameters` should be used by `paddle::ParameterUpdater`, and `paddle::ParameterUpdater` should optimize `Parameters` (by SGD).

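To make the container idea concrete, here is a minimal sketch of what such a `Parameters` class could look like. Everything except `paddle::Parameter` itself is an assumption made for illustration; member and method names such as `params_`, `addParameter`, and `getParameter` are hypothetical, not existing Paddle APIs.

```cpp
// A minimal sketch only: Parameters as a named container of shared
// paddle::Parameter objects, so several topologies can reference the same
// parameter without any memcpy. All names below are hypothetical.
#include <map>
#include <memory>
#include <string>
#include <utility>

namespace paddle {

class Parameter;  // the existing Paddle class; only "create/store" is relied on here

class Parameters {
public:
  // Register a parameter under a unique name, e.g. "fc1.w".
  void addParameter(const std::string& name, std::shared_ptr<Parameter> param) {
    params_[name] = std::move(param);
  }

  // Look up a parameter; topologies sharing this container see the same object.
  std::shared_ptr<Parameter> getParameter(const std::string& name) const {
    auto it = params_.find(name);
    if (it == params_.end()) {
      return nullptr;
    }
    return it->second;
  }

  size_t size() const { return params_.size(); }

private:
  std::map<std::string, std::shared_ptr<Parameter>> params_;
};

}  // namespace paddle
```

Sharing then amounts to handing the same `Parameters` object to every topology, instead of copying parameter memory between `GradientMachine`s.
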
The step-by-step approach for implementing `Parameters` in the Paddle C++ core is listed below. Each step should be a PR and can be merged into Paddle one by one.

1. Clean up the `paddle::Parameter` interface. Extract the functionalities of `paddle::Parameter` to prepare for the implementation of `Parameters`.

2. Implement a `Parameters` class. It just stores the `paddle::Parameter` objects inside. Make `GradientMachine` use `Parameters` as a class member.

3. Make `Parameters` support multi-CPU and multi-GPU training to prepare for sharing `Parameter` between topologies.
   Because we need to share `Parameters` between topologies, it is `Parameters`'s responsibility to exchange parameters between GPUs.
   `GradientMachine` should not handle how to exchange parameters, because a `GradientMachine` only trains one topology and we need to support training many topologies in Paddle, i.e., there could be many `GradientMachine`s using one `Parameters` object.
   * We should use a global function to exchange parameters between GPUs, not a member function of `Parameters`. `MultiGradientMachine` invokes this function, which takes `Parameters` as its input.
   * `MultiGradientMachine` already contains many functionalities. Extracting the parameter-exchange logic would make `MultiGradientMachine` clearer and simpler.

4. Make `Parameters` an argument of the `forward/backward` functions, not a data member of `GradientMachine`. For example, `forward` could be `forward(const Parameters& params, ...)` and `backward` could be `backward(Parameters* params, ...)` (see the sketch after this list). After this step, Paddle can share `Parameters` between topologies.

5. `ParameterUpdater` is invoked by `GradientMachine` and `Trainer`, but it updates `Parameters`. At the end of this refactoring, we can change `ParameterUpdater` to use `Parameters` directly, to make `ParameterUpdater`'s implementation clear.

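As a rough illustration of steps 3 and 4, the hypothetical declarations below show `Parameters` passed explicitly into `forward/backward`, plus a free function for cross-device exchange invoked by `MultiGradientMachine`. The surrounding types (`Argument`, device ids) and all names other than `Parameters` and `GradientMachine` are placeholders assumed for the sketch, not the final interface.

```cpp
// Sketch of the target interfaces after steps 3 and 4; signatures are illustrative.
#include <vector>

namespace paddle {

class Parameters;   // the container introduced in step 2
class Argument;     // placeholder for Paddle's input/output argument type

class GradientMachine {
public:
  virtual ~GradientMachine() = default;

  // Step 4: Parameters is an explicit argument rather than a data member, so
  // many GradientMachines (topologies) can run against one shared Parameters.
  virtual void forward(const Parameters& params,
                       const std::vector<Argument>& inputs,
                       std::vector<Argument>* outputs) = 0;
  virtual void backward(Parameters* params,
                        const std::vector<Argument>& outputGrads) = 0;
};

// Step 3: a global function (not a Parameters member) that dispatches parameter
// values to each device and gathers gradients back; MultiGradientMachine calls it.
void exchangeParameters(Parameters* params, const std::vector<int>& deviceIds);

}  // namespace paddle
```
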
@@ -0,0 +1,155 @@

# DeepSpeech2 on PaddlePaddle: Design Doc

We are planning to build Deep Speech 2 (DS2) \[[1](#references)\], a powerful Automatic Speech Recognition (ASR) engine, on PaddlePaddle. For the first-stage plan, we have the following short-term goals:

- Release a basic distributed implementation of DS2 on PaddlePaddle.
- Contribute a chapter on Deep Speech to the PaddlePaddle Book.

Intensive system optimization and the low-latency inference library (details in \[[1](#references)\]) are not yet covered in this first-stage plan.

## Table of Contents

- [Tasks](#tasks)
- [Task Dependency](#task-dependency)
- [Design Details](#design-details)
  - [Overview](#overview)
  - [Row Convolution](#row-convolution)
  - [Beam Search With CTC and LM](#beam-search-with-ctc-and-lm)
- [Future Work](#future-work)
- [References](#references)

## Tasks

We roughly break the project down into 14 tasks:

1. Develop an **audio data provider**:
   - JSON filelist generator.
   - Audio file format transformer.
   - Spectrogram feature extraction, power normalization, etc.
   - Batch data reader with SortaGrad.
   - Data augmentation (optional).
   - Prepare (one or more) public English datasets & baselines.
2. Create a **simplified DS2 model configuration**:
   - With only fixed-length (by padding) audio sequences (otherwise *Task 3* is needed).
   - With only bidirectional GRUs (otherwise *Task 4* is needed).
   - With only a greedy decoder (otherwise *Tasks 5, 6* are needed).
3. Add support for **variable-shaped** dense-vector (image) batches of input data.
   - Update `DenseScanner` in `dataprovider_converter.py`, etc.
4. Develop a new **lookahead-row-convolution layer** (see \[[1](#references)\] for details):
   - Lookahead convolution windows.
   - Within-row convolution, without kernels shared across rows.
5. Build a KenLM **language model** (5-gram) for the beam search decoder:
   - Use the KenLM toolkit.
   - Prepare the corpus & train the model.
   - Create inference interfaces (for Task 6).
6. Develop a **beam search decoder** with CTC + LM + WORDCOUNT:
   - Beam search with CTC.
   - Beam search with an external custom scorer (e.g., an LM).
   - Try to design a more general beam search interface.
7. Develop a **Word Error Rate evaluator** (a sketch of the WER computation follows this list):
   - Update `ctc_error_evaluator` (CER) to support WER.
8. Prepare an internal dataset for Mandarin (optional):
   - Dataset, baseline, evaluation details.
   - Data preprocessing particular to Mandarin.
   - Might need cooperation with the Speech Department.
9. Create a **standard DS2 model configuration**:
   - With variable-length audio sequences (needs *Task 3*).
   - With unidirectional GRUs + row convolution (needs *Task 4*).
   - With the CTC-LM beam search decoder (needs *Tasks 5, 6*).
10. Make it run perfectly on **clusters**.
11. Experiments and **benchmarking** (for accuracy, not efficiency):
    - With a public English dataset.
    - With an internal (Baidu) Mandarin dataset (optional).
12. Time **profiling** and optimization.
13. Prepare **docs**.
14. Prepare a PaddlePaddle **Book** chapter with a simplified version.

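Since Task 7 calls for a Word Error Rate evaluator, the self-contained sketch below shows the underlying computation: a word-level edit distance divided by the number of reference words. It is only an illustration; the real evaluator would extend `ctc_error_evaluator` inside Paddle, and the helper names here are hypothetical.

```cpp
// Hypothetical WER helper: word-level edit distance / number of reference words.
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

static std::vector<std::string> tokenize(const std::string& text) {
  std::istringstream iss(text);
  std::vector<std::string> words;
  for (std::string w; iss >> w;) words.push_back(w);
  return words;
}

// Classic dynamic-programming edit distance (substitutions, insertions, deletions).
static size_t editDistance(const std::vector<std::string>& ref,
                           const std::vector<std::string>& hyp) {
  std::vector<std::vector<size_t>> d(ref.size() + 1,
                                     std::vector<size_t>(hyp.size() + 1, 0));
  for (size_t i = 0; i <= ref.size(); ++i) d[i][0] = i;
  for (size_t j = 0; j <= hyp.size(); ++j) d[0][j] = j;
  for (size_t i = 1; i <= ref.size(); ++i) {
    for (size_t j = 1; j <= hyp.size(); ++j) {
      size_t sub = d[i - 1][j - 1] + (ref[i - 1] == hyp[j - 1] ? 0 : 1);
      d[i][j] = std::min({sub, d[i - 1][j] + 1, d[i][j - 1] + 1});
    }
  }
  return d[ref.size()][hyp.size()];
}

// WER = (S + D + I) / N, where N is the number of words in the reference.
double wordErrorRate(const std::string& reference, const std::string& hypothesis) {
  auto ref = tokenize(reference);
  auto hyp = tokenize(hypothesis);
  if (ref.empty()) return hyp.empty() ? 0.0 : 1.0;
  return static_cast<double>(editDistance(ref, hyp)) / ref.size();
}
```

Conceptually, the existing CER evaluator computes the same quantity at the character level, so Task 7 is mostly a matter of switching the token unit from characters to words.
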
## Task Dependency

Tasks are parallelizable within phases:

Roadmap     | Description                                | Parallelizable Tasks
----------- | :----------------------------------------- | :--------------------
Phase I     | Simplified model & components               | *Task 1* ~ *Task 8*
Phase II    | Standard model & benchmarking & profiling   | *Task 9* ~ *Task 12*
Phase III   | Documentation                               | *Task 13* ~ *Task 14*

An issue for each task will be created later. Contributions, discussions and comments are all highly appreciated and welcome!

## Design Details

### Overview

Traditional **ASR** (Automatic Speech Recognition) pipelines require considerable human effort devoted to elaborately tuning multiple hand-engineered components (e.g., audio feature design, acoustic model, pronunciation model, language model, etc.). **Deep Speech 2** (**DS2**) \[[1](#references)\], however, trains such ASR models in an end-to-end manner, replacing most intermediate modules with a single deep network architecture. By scaling up both the data and model sizes, DS2 achieves a very significant performance boost.

Please read the Deep Speech 2 papers \[[1](#references),[2](#references)\] for more background knowledge.

The classical DS2 network contains 15 layers (from bottom to top):

- **Two** data layers (audio spectrogram, transcription text)
- **Three** 2D convolution layers
- **Seven** uni-directional simple-RNN layers
- **One** lookahead row-convolution layer
- **One** fully-connected layer
- **One** CTC-loss layer

<div align="center">
<img src="image/ds2_network.png" width=350><br/>
Figure 1. Architecture of the Deep Speech 2 network.
</div>

We do not have to stick to this 2-3-7-1-1-1 depth \[[2](#references)\]. Similar networks with different depths might also work well. As in \[[1](#references)\], the authors use a different depth (e.g., 2-2-3-1-1-1) for their final experiments.

Key ingredients of these layers:

- **Data Layers**:
  - Frame-sequence data of the audio **spectrogram** (computed with FFT).
  - Token-sequence data of the **transcription** text (labels).
  - These two types of sequences do not have the same lengths, thus a CTC-loss layer is required.
- **2D Convolution Layers**:
  - Not only temporal convolution, but also **frequency convolution**. Like a 2D image convolution, but with one variable dimension (i.e., the temporal dimension).
  - Striding is used only in the first convolution layer.
  - No pooling in any convolution layer.
- **Uni-directional RNNs**:
  - Uni-directional + row convolution: for low-latency inference.
  - Bi-directional + without row convolution: if we do not care about inference latency.
- **Row Convolution**:
  - Looks only a few steps ahead into the features, instead of looking at the whole sequence as bi-directional RNNs do (see the sketch after this list).
  - Not necessary with bi-directional RNNs.
  - "**Row**" means convolutions are done within each frequency dimension (row), with no convolution kernels shared across rows.
- **Batch Normalization Layers**:
  - Added to all the layers above (except for the data and loss layers).
  - Sequence-wise normalization for RNNs: BatchNorm is only performed on the input-state projection, not the state-state projection, for efficiency.

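To make the row-convolution description above concrete, the following sketch computes a lookahead row convolution for one utterance under the assumptions stated in the comments (one weight column per feature row, a small future context, no kernels shared across rows). It is an illustration of the operation only, not the Paddle layer to be built in Task 4.

```cpp
// Illustrative row convolution: for each feature row d, the output at time t
// mixes the current frame and the next (context - 1) future frames using a
// per-row weight column; no kernels are shared across rows.
#include <vector>

// features[t][d]: value of feature row d at time step t.
// weights[tau][d]: lookahead weight for offset tau (0 <= tau < context) and row d.
std::vector<std::vector<float>> rowConvolution(
    const std::vector<std::vector<float>>& features,
    const std::vector<std::vector<float>>& weights) {
  const size_t numSteps = features.size();
  const size_t context = weights.size();
  std::vector<std::vector<float>> output(numSteps);

  for (size_t t = 0; t < numSteps; ++t) {
    const size_t numRows = features[t].size();
    output[t].assign(numRows, 0.0f);
    for (size_t d = 0; d < numRows; ++d) {
      // Look ahead up to `context` frames; stop early at the end of the utterance.
      for (size_t tau = 0; tau < context && t + tau < numSteps; ++tau) {
        output[t][d] += weights[tau][d] * features[t + tau][d];
      }
    }
  }
  return output;
}
```

A bi-directional RNN already sees the whole future context, which is why the row convolution is only needed in the low-latency, uni-directional configuration.
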
Required Components           | PaddlePaddle Support                        | Need to Develop
:---------------------------- | :------------------------------------------ | :-----------------------
Data Layer I (Spectrogram)    | Not supported yet.                           | TBD (Task 3)
Data Layer II (Transcription) | `paddle.data_type.integer_value_sequence`    | -
2D Convolution Layer          | `paddle.layer.image_conv_layer`              | -
DataType Converter (vec2seq)  | `paddle.layer.block_expand`                  | -
Bi-/Uni-directional RNNs      | `paddle.layer.recurrent_group`               | -
Row Convolution Layer         | Not supported yet.                           | TBD (Task 4)
CTC-loss Layer                | `paddle.layer.warp_ctc`                      | -
Batch Normalization Layer     | `paddle.layer.batch_norm`                    | -
CTC-Beam search               | Not supported yet.                           | TBD (Task 6)

### Row Convolution

TODO by Assignees

### Beam Search with CTC and LM

TODO by Assignees

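While this section is still to be written by its assignees, one possible starting point is the scoring function described in \[[1](#references)\]: candidate transcriptions are ranked by the CTC log-probability plus a weighted language-model log-probability plus a word-count bonus. The sketch below only shows that combined score; the weights `alpha` and `beta` and all names are assumptions for illustration, not a decided design.

```cpp
// Illustrative combined score Q(y) = log p_ctc(y|x) + alpha * log p_lm(y) + beta * |y|,
// following the beam search objective described in the DS2 paper [1].
struct BeamScorerWeights {
  double alpha = 1.0;  // weight of the external language model
  double beta = 0.5;   // per-word insertion bonus (the WORDCOUNT term)
};

double combinedScore(double ctcLogProb,  // log p_ctc(y|x) from the CTC decoder
                     double lmLogProb,   // log p_lm(y) from, e.g., a KenLM model
                     int wordCount,      // number of words in the candidate y
                     const BeamScorerWeights& w) {
  return ctcLogProb + w.alpha * lmLogProb + w.beta * wordCount;
}
```
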
## Future Work

- Efficiency improvement
- Accuracy improvement
- Low-latency inference library
- Large-scale benchmarking

## References

1. Dario Amodei, et al. [Deep Speech 2: End-to-End Speech Recognition in English and Mandarin](http://proceedings.mlr.press/v48/amodei16.pdf). ICML 2016.
2. Dario Amodei, et al. [Deep Speech 2: End-to-End Speech Recognition in English and Mandarin](https://arxiv.org/abs/1512.02595). arXiv:1512.02595.

After: Size: 114 KiB
@@ -0,0 +1,93 @@
package main

import (
    "fmt"
    "net"
    "net/http"
    "net/rpc"
    "os"
    "path/filepath"
    "strconv"
    "strings"
    "time"

    "github.com/namsral/flag"

    "github.com/PaddlePaddle/Paddle/paddle/go/master"
    "github.com/PaddlePaddle/Paddle/paddle/go/recordio"
)

func main() {
    port := flag.Int("port", 8080, "port of the master server.")
    dataset := flag.String("training_dataset", "", "dataset: comma-separated list of RecordIO paths, supports glob patterns.")
    faultTolerance := flag.Bool("fault_tolerance", false, "enable fault tolerance (requires etcd).")
    taskTimeoutDur := flag.Duration("task_timeout_dur", 20*time.Minute, "task timeout duration.")
    taskTimeoutMax := flag.Int("task_timeout_max", 3, "max timeout count for each task before it is declared a failed task.")
    chunkPerTask := flag.Int("chunk_per_task", 10, "chunks per task.")
    flag.Parse()

    if *dataset == "" {
        panic("no dataset specified.")
    }

    if *faultTolerance {
        panic("fault tolerance not implemented.")
    }

    // Expand the comma-separated dataset argument, resolving glob patterns to
    // concrete RecordIO file paths.
    var chunks []master.Chunk
    var paths []string
    ss := strings.Split(*dataset, ",")
    fmt.Println(ss)
    for _, s := range ss {
        match, err := filepath.Glob(s)
        if err != nil {
            panic(err)
        }
        paths = append(paths, match...)
    }

    if len(paths) == 0 {
        panic("no valid dataset specified.")
    }

    // Load the RecordIO index of every file and collect one master.Chunk per
    // chunk, so the service can later group them into tasks.
    for _, path := range paths {
        f, err := os.Open(path)
        if err != nil {
            panic(err)
        }

        index, err := recordio.LoadIndex(f)
        if err != nil {
            panic(err)
        }
        f.Close()

        count := index.NumChunks()
        for i := 0; i < count; i++ {
            chunk := master.Chunk{
                Idx:   i, // index of the chunk within this file
                Path:  path,
                Index: *index.ChunkIndex(i),
            }
            chunks = append(chunks, chunk)
        }
    }

    // Serve the master service over net/rpc on top of HTTP.
    s := master.NewService(chunks, *chunkPerTask, *taskTimeoutDur, *taskTimeoutMax)
    err := rpc.Register(s)
    if err != nil {
        panic(err)
    }

    rpc.HandleHTTP()
    l, err := net.Listen("tcp", ":"+strconv.Itoa(*port))
    if err != nil {
        panic(err)
    }

    err = http.Serve(l, nil)
    if err != nil {
        panic(err)
    }
}

@@ -0,0 +1,178 @@
package master

import (
    "errors"
    "log"
    "sync"
    "time"

    "github.com/PaddlePaddle/Paddle/paddle/go/recordio"
)

const (
    targetTaskCount = 300
)

// errors
var (
    ErrNoMoreTask          = errors.New("no more task for current pass")
    ErrPendingTaskNotFound = errors.New("pending task not found")
)

// Service is the master server service.
type Service struct {
    timeoutDur time.Duration
    timeoutMax int

    mu         sync.Mutex
    taskQueues taskQueues
}

// Recover recovers the service state from etcd.
func Recover() (*Service, error) {
    // TODO(helin): recover the snapshotted state from etcd.
    return nil, nil
}

// partition groups chunks into tasks of chunksPerTask chunks each and assigns
// sequential task IDs.
func partition(chunks []Chunk, chunksPerTask int) []taskEntry {
    id := 0
    if chunksPerTask <= 0 {
        chunksPerTask = 1
    }

    var result []taskEntry
    var cur taskEntry
    for i, c := range chunks {
        // Close the current task once it is full, then start a new one.
        if i%chunksPerTask == 0 && len(cur.Task.Chunks) > 0 {
            cur.Task.ID = id
            id++
            result = append(result, cur)
            cur.Task.Chunks = nil
        }

        cur.Task.Chunks = append(cur.Task.Chunks, c)
    }

    // Flush the last, possibly partially filled task.
    if len(cur.Task.Chunks) > 0 {
        cur.Task.ID = id
        id++
        result = append(result, cur)
    }

    return result
}

// NewService creates a new service.
func NewService(chunks []Chunk, chunksPerTask int, timeoutDur time.Duration, timeoutMax int) *Service {
    s := &Service{}
    s.timeoutDur = timeoutDur
    s.timeoutMax = timeoutMax
    s.taskQueues = taskQueues{}
    s.taskQueues.Pending = make(map[int]taskEntry)
    s.taskQueues.Todo = partition(chunks, chunksPerTask)
    return s
}

// Chunk is a chunk of data consisting of several data instances.
type Chunk struct {
    Idx   int // index of the chunk within the file
    Path  string
    Index recordio.Index // block index
}

// Task is the basic unit of data instances assigned to trainers.
type Task struct {
    ID     int
    Chunks []Chunk
}

type taskEntry struct {
    Epoch      int
    NumTimeout int
    Task       Task
}

type taskQueues struct {
    Todo    []taskEntry
    Pending map[int]taskEntry // map from task ID to task entry
    Done    []taskEntry
    Failed  []Task
}

// snapshot *must* be called with s.mu held.
func (s *Service) snapshot() error {
    // TODO(helin): snapshot state on etcd.
    return nil
}

// GetTask gets a new task from the service.
func (s *Service) GetTask(dummy int, task *Task) error {
    s.mu.Lock()
    defer s.mu.Unlock()

    if len(s.taskQueues.Todo) == 0 {
        return ErrNoMoreTask
    }

    // Move the task from the todo queue to the pending queue and bump its
    // epoch, so stale timeout callbacks can be recognized and ignored.
    t := s.taskQueues.Todo[0]
    t.Epoch++
    s.taskQueues.Todo = s.taskQueues.Todo[1:]
    s.taskQueues.Pending[t.Task.ID] = t
    *task = t.Task // return the assigned task to the caller via the RPC reply
    err := s.snapshot()
    if err != nil {
        return err
    }

    // Schedule a timeout check for this (task, epoch) pair.
    time.AfterFunc(s.timeoutDur, func(taskID int, epoch int) func() {
        return func() {
            s.mu.Lock()
            defer s.mu.Unlock()

            t, ok := s.taskQueues.Pending[taskID]
            if !ok {
                return
            }

            if t.Epoch != epoch {
                // New epoch: the task was dispatched again after this
                // timeout check was scheduled. Ignore it.
                return
            }

            defer func() {
                err := s.snapshot()
                if err != nil {
                    log.Println(err)
                }
            }()

            delete(s.taskQueues.Pending, t.Task.ID)

            t.NumTimeout++
            if t.NumTimeout > s.timeoutMax {
                // Too many timeouts: declare the task failed.
                s.taskQueues.Failed = append(s.taskQueues.Failed, t.Task)
                return
            }

            // Otherwise put the task back into the todo queue for retry.
            s.taskQueues.Todo = append(s.taskQueues.Todo, t)
        }
    }(t.Task.ID, t.Epoch))
    return nil
}

// TaskFinished tells the service that a task is finished.
func (s *Service) TaskFinished(taskID int, dummy *int) error {
    s.mu.Lock()
    defer s.mu.Unlock()

    t, ok := s.taskQueues.Pending[taskID]
    if !ok {
        return ErrPendingTaskNotFound
    }

    // Task finished: reset the timeout counter and move it to the done queue.
    t.NumTimeout = 0
    s.taskQueues.Done = append(s.taskQueues.Done, t)
    delete(s.taskQueues.Pending, taskID)
    return s.snapshot()
}

@@ -0,0 +1,37 @@
package master

import "testing"

func TestPartitionCount(t *testing.T) {
    cs := make([]Chunk, 100)
    ts := partition(cs, 5)
    if len(ts) != 20 {
        t.Error(len(ts))
    }

    cs = make([]Chunk, 101)
    ts = partition(cs, 5)
    if len(ts) != 21 {
        t.Error(len(ts))
    }

    ts = partition(cs, 1)
    if len(ts) != 101 {
        t.Error(len(ts))
    }

    ts = partition(cs, 0)
    if len(ts) != 101 {
        t.Error(len(ts))
    }
}

func TestPartitionIndex(t *testing.T) {
    cs := make([]Chunk, 100)
    ts := partition(cs, 20)
    for i := range ts {
        if ts[i].Task.ID != i {
            t.Error(ts[i], i)
        }
    }
}

@@ -1 +1,5 @@
add_python_test(test_ploter test_ploter.py)
if (NOT APPLE)
  # The Mac OS X backend will not be able to function correctly if Python is
  # not installed as a framework.
  add_python_test(test_ploter test_ploter.py)
endif()

Some files were not shown because too many files have changed in this diff.