Thanks to caoying for checking the English grammar. ISSUE=4598269 git-svn-id: https://svn.baidu.com/idl/trunk/paddle@1462 1ad973e4-5ce8-4261-8a94-b56d1f490c56avx_docs
parent cecdedea63
commit ed6716578d
@ -1,6 +0,0 @@
PyDataProviderWrapper API
=========================

..  automodule:: paddle.trainer.PyDataProviderWrapper
    :members:
@ -1,55 +0,0 @@
# DataProvider Tutorial #

DataProvider is responsible for data management in PaddlePaddle, corresponding to the <a href = "../trainer_config_helpers_api.html#trainer_config_helpers.layers.data_layer">Data Layer</a>.

## Input Data Format ##

PaddlePaddle uses **Slot** to describe the data layer of a neural network. One slot describes one data layer. Each slot stores a series of samples, and each sample contains a set of features. A slot has three attributes:

+ **Dimension**: the dimension of the features
+ **SlotType**: there are 5 different slot types in PaddlePaddle; the following table compares the four commonly used ones.

<table border="2" frame="border">
<thead>
<tr>
<th scope="col" class="left">SlotType</th>
<th scope="col" class="left">Feature Description</th>
<th scope="col" class="left">Vector Description</th>
</tr>
</thead>

<tbody>
<tr>
<td class="left"><b>DenseSlot</b></td>
<td class="left">Continuous features</td>
<td class="left">Dense vector</td>
</tr>

<tr>
<td class="left"><b>SparseNonValueSlot</b></td>
<td class="left">Discrete features without weights</td>
<td class="left">Sparse vector with all non-zero elements equal to 1</td>
</tr>

<tr>
<td class="left"><b>SparseValueSlot</b></td>
<td class="left">Discrete features with weights</td>
<td class="left">Sparse vector</td>
</tr>

<tr>
<td class="left"><b>IndexSlot</b></td>
<td class="left">mostly the same as SparseNonValueSlot, but specialized for a single label</td>
<td class="left">Sparse vector with only one value in each time step</td>
</tr>
</tbody>
</table>
<br/>

The remaining one is **StringSlot**. It stores a character string, and can be used for debugging or to describe the data ID for prediction, etc.
+ **SeqType**: a **sequence** is a sample whose features are expanded on the time scale, and a **sub-sequence** is a continuous, ordered subset of a sequence. For example, (a1, a2) and (a3, a4, a5) are two sub-sequences of the sequence (a1, a2, a3, a4, a5). The following are the 3 different sequence types in PaddlePaddle:
    - **NonSeq**: the input sample is not a sequence
    - **Seq**: the input sample is a sequence without sub-sequences
    - **SubSeq**: the input sample is a sequence with sub-sequences
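As an illustration (a standalone sketch, not PaddlePaddle API, with made-up index values), the three SeqTypes correspond to differently nested Python structures for an IndexSlot sample:

```python
# Hypothetical shapes of one IndexSlot sample under each SeqType.
# Plain Python values, only meant to illustrate the nesting depth.
non_seq = 3                      # NonSeq: a single index
seq = [7, 2, 9]                  # Seq: one index per time step
sub_seq = [[7, 2], [9, 4, 1]]    # SubSeq: a sequence of sub-sequences

# Flattening a SubSeq sample recovers the underlying Seq sample.
flattened = [w for sub in sub_seq for w in sub]
print(flattened)  # [7, 2, 9, 4, 1]
```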

## Python DataProvider

PyDataProviderWrapper is a Python decorator in PaddlePaddle, used to wrap a custom Python DataProvider class. It currently supports all SlotTypes and SeqTypes of input data. Users only need to care about how to read samples from a file. See its [Use Case](python_case.md) and <a href = "../py_data_provider_wrapper_api.html">API Reference</a>.
@ -0,0 +1,42 @@
PaddlePaddle DataProvider Introduction
======================================

DataProvider is a module that loads training or testing data into CPU or GPU
memory for the subsequent training or testing process.

For simple use cases, users can use the Python :code:`PyDataProvider` to
dynamically read the original data in any format or form, and then transform it
into the data format PaddlePaddle requires. The process is extremely flexible
and highly customizable, sacrificing only a little efficiency. This is
extremely useful when you have to dynamically generate certain kinds of data
according to, for example, the training performance.

Besides, users can also customize a C++ :code:`DataProvider` for more complex
usage, or for higher efficiency.

The following parameters must be defined in the PaddlePaddle network
configuration file (trainer_config.py): which DataProvider to use, and the
specific parameters for the DataProvider, including the training file list
(train.list) and the testing file list (test.list).

train.list and test.list are simply two plain text files, which define the
paths of the training or testing data. It is recommended to place them directly
in the training directory and to reference them with relative paths (relative
to the PaddlePaddle program).
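As a standalone sketch of how such a file list could be produced (the directory layout and file names here are assumptions, not part of PaddlePaddle), the following collects data files and writes their relative paths into train.list:

```python
import os
import tempfile

# Hypothetical layout: data files live under data/ inside the training
# directory. Collect every *.txt file and write one relative path per line.
workdir = tempfile.mkdtemp()
data_dir = os.path.join(workdir, "data")
os.makedirs(data_dir)
for name in ("part-000.txt", "part-001.txt"):
    open(os.path.join(data_dir, name), "w").close()

paths = sorted(
    os.path.join("data", f) for f in os.listdir(data_dir) if f.endswith(".txt")
)
with open(os.path.join(workdir, "train.list"), "w") as out:
    out.write("\n".join(paths) + "\n")
```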

Testing or evaluation will not be performed during training if test.list is
not set or is set to None. Otherwise, PaddlePaddle will evaluate the trained
model on the specified testing data while training, once every testing period
(a user-defined command line parameter in PaddlePaddle), to help detect
over-fitting.

Each line of train.list and test.list is the absolute or relative path
(relative to the PaddlePaddle program at runtime) of a data file. Moreover,
each line can also be an HDFS file path or a SQL connection string, as long as
the user specifies how to access each file in the DataProvider.

Please refer to the following articles for more information about the detailed
usage of DataProvider and how to implement a new DataProvider.

..  toctree::

    pydataprovider2.rst
    write_new_dataprovider.rst
File diff suppressed because it is too large
@ -1,112 +0,0 @@
# Python Use Case #

This tutorial guides you through using a Python script that converts user input data into the PaddlePaddle data format.

## Quick Start ##

We use custom data to show the quick usage. Each sample consists of two parts delimited by a semicolon `';'`: a) a label with 2 dimensions, b) continuous features with 9 dimensions:

    1;0 0 0 0 0.192157 0.070588 0.215686 0.533333 0
    0;0 0 0 0.988235 0.913725 0.329412 0.376471 0 0

The `simple_provider.py` defines a Python data provider:

```python
from trainer.PyDataProviderWrapper import DenseSlot, IndexSlot, provider

@provider([DenseSlot(9), IndexSlot(2)])
def process(obj, file_name):
    with open(file_name, 'r') as f:
        for line in f:
            line = line.split(";")
            label = int(line[0])
            # line[1] holds all 9 continuous features
            image = [float(x) for x in line[1].split()]
            yield label, image
```

- `@provider`: specifies the SlotTypes and their dimensions. Here we have 2 slots: DenseSlot(9) stores continuous features with 9 dimensions, and IndexSlot(2) stores a label with 2 dimensions.
- `process`: a generator using the **yield** keyword to return results one by one. Here, the yielded format is 1 discrete feature and a list of 9 float continuous features.
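The parsing step inside `process` can be checked on its own with plain Python (a standalone sketch; the sample line is taken from the data above):

```python
# Parse one raw sample line the same way process() does.
raw = "1;0 0 0 0 0.192157 0.070588 0.215686 0.533333 0"
parts = raw.split(";")
label = int(parts[0])
image = [float(x) for x in parts[1].split()]

print(label)       # 1
print(len(image))  # 9 continuous features for DenseSlot(9)
```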

The corresponding Python **Train** data source `define_py_data_sources` is:

```python
define_py_data_sources('train.list', None, 'simple_provider', 'process')
```

See <a href = "../trainer_config_helpers_api.html#trainer_config_helpers.data_sources.define_py_data_sources">here</a> for the detailed API reference of `define_py_data_sources`.

## Sequence Example ##

In some tasks, such as Natural Language Processing (NLP), the dimension of a slot is related to the dictionary size, and the dictionary should be dynamically loaded during training or generating. PyDataProviderWrapper satisfies all these demands easily.

### Sequence has no sub-sequence ###

The following is an example of a data provider for using an LSTM network to do sentiment analysis (if you want to understand all the details of this task, please refer to the [Sentiment Analysis Tutorial](../demo/sentiment_analysis/index.md)).

The input data consists of two parts delimited by two tabs: a) a label with 2 dimensions, b) a sequence with dictionary-length dimensions:

    0		I saw this movie at the AFI Dallas festival . It all takes place at a lake house and it looks wonderful .
    1		This documentary makes you travel all around the globe . It contains rare and stunning sequels from the wilderness .
    ...

The `dataprovider.py` in `demo/sentiment` is:

```python
from trainer.PyDataProviderWrapper import *

@init_hook_wrapper
def hook(obj, dictionary, **kwargs):
    obj.word_dict = dictionary
    obj.slots = [IndexSlot(len(obj.word_dict)), IndexSlot(2)]
    obj.logger.info('dict len : %d' % (len(obj.word_dict)))

@provider(use_seq=True, init_hook=hook)
# @provider(use_seq=True, init_hook=hook, pool_size=PoolSize(5000))
def process(obj, file_name):
    with open(file_name, 'r') as fdata:
        for line_count, line in enumerate(fdata):
            label, comment = line.strip().split('\t\t')
            label = int(''.join(label.split(' ')))
            words = comment.split()
            word_slot = [obj.word_dict[w] for w in words if w in obj.word_dict]
            yield word_slot, [label]
```

- `hook`: the initialization hook of the data provider. Here, it receives the dictionary, sets obj.slots based on the dictionary length, and uses obj.logger to output some logs.
- `process`: here, as the sequence mode of the input is **Seq** and the SlotType is IndexSlot, use_seq is set to True, and the yield format is `[int, int, ....]`.
- `PoolSize`: if there is a lot of data, you may need this argument to increase loading speed and reduce memory footprint. Here, PoolSize(5000) means reading at most 5000 samples into memory at once.
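The word-to-index lookup that `process` performs can be illustrated with a toy dictionary (a standalone sketch; the dictionary contents here are made up):

```python
# A made-up dictionary mapping words to indices, as hook() would receive it.
word_dict = {"this": 0, "movie": 1, "looks": 2, "wonderful": 3}

comment = "this movie looks truly wonderful"
# Unknown words ("truly") are silently skipped, as in process().
word_slot = [word_dict[w] for w in comment.split() if w in word_dict]
print(word_slot)  # [0, 1, 2, 3]
```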

The corresponding Python **Train/Test** data sources `define_py_data_sources` is:

```python
train_list = train_list if not is_test else None
word_dict = dict()
with open(dict_file, 'r') as f:
    for i, line in enumerate(f):
        word_dict[line.split('\t')[0]] = i

define_py_data_sources(train_list, test_list, module = "dataprovider", obj = "process",
                       args = {'dictionary': word_dict}, train_async = True)
```

### Sequence has sub-sequence ###

If the sequence in the above input data is considered as several sub-sequences joined by a dot `'.'`, question mark `'?'`, or exclamation mark `'!'`, see `process2` in `demo/sentiment/dataprovider.py` as follows:
```python
import re

@provider(use_seq=True, init_hook=hook)
def process2(obj, file_name):
    with open(file_name, 'r') as fdata:
        pat = re.compile(r'[^.?!]+[.?!]')
        for line_count, line in enumerate(fdata):
            label, comment = line.strip().split('\t\t')
            label = int(''.join(label.split(' ')))
            words_list = pat.findall(comment)
            word_slot_list = [[obj.word_dict[w] for w in words.split() \
                               if w in obj.word_dict] for words in words_list]
            yield word_slot_list, [[label]]
```

- `hook`: the same as above. Note that as a **SubSeq slot must be put before a Seq slot** in PaddlePaddle, we could not reverse the yield order in this case.
- `process2`: here, as the sequence mode of the input is **SubSeq** and the SlotType is IndexSlot, use_seq is set to True, and the yield format is `[[int, int, ...], [int, int, ...], ... ]`.
- `define_py_data_sources`: the same as above.
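The sub-sequence splitting done by `pat.findall` can be tried in isolation (a standalone sketch; the sample comment is adapted from the data above):

```python
import re

# Split a comment into sub-sequences ending in '.', '?' or '!',
# exactly as process2() does before the word lookup.
pat = re.compile(r'[^.?!]+[.?!]')
comment = "It all takes place at a lake house . It looks wonderful !"
words_list = pat.findall(comment)
print(len(words_list))  # 2 sub-sequences
```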