remove unused v1_api_tutorials

add_depthwiseConv_op_gpu
Luo Tao 7 years ago
parent f98720a9ed
commit 30f69beefd

@ -1,5 +0,0 @@
The tutorials in v1_api_tutorials are using v1_api currently, and will be upgraded to v2_api later.
Thus, v1_api_tutorials is a temporary directory. We decide not to maintain it and will delete it in future.
Please go to [PaddlePaddle/book](https://github.com/PaddlePaddle/book) and
[PaddlePaddle/models](https://github.com/PaddlePaddle/models) to learn PaddlePaddle.

@ -1,139 +0,0 @@
# 中文词向量模型的使用 #
----------
本文档介绍如何在PaddlePaddle平台上,使用预训练的标准格式词向量模型。
在此感谢 @lipeng 提出的代码需求,并给出的相关模型格式的定义。
## 介绍 ###
### 中文字典 ###
我们的字典使用内部的分词工具对百度知道和百度百科的语料进行分词后产生。分词风格如下: "《红楼梦》"将被分为 "《""红楼梦""》",和 "《红楼梦》"。字典采用UTF8编码输出有2列词本身和词频。字典共包含 3206326个词和4个特殊标记
- `<s>`: 分词序列的开始
- `<e>`: 分词序列的结束
- `PALCEHOLDER_JUST_IGNORE_THE_EMBEDDING`: 占位符,没有实际意义
- `<unk>`: 未知词
### 中文词向量的预训练模型 ###
遵循文章 [A Neural Probabilistic Language Model](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf)中介绍的方法,模型采用 n-gram 语言模型结构如下图6元上下文作为输入层->全连接层->softmax层 。对应于字典我们预训练得到4种不同维度的词向量分别为32维、64维、128维和256维。
<center>![](./neural-n-gram-model.png)</center>
<center>Figure 1. neural-n-gram-model</center>
### 下载和数据抽取 ###
运行以下的命令下载和获取我们的字典和预训练模型:
cd $PADDLE_ROOT/demo/model_zoo/embedding
./pre_DictAndModel.sh
## 中文短语改写的例子 ##
以下示范如何使用预训练的中文字典和词向量进行短语改写。
### 数据的准备和预处理 ###
首先运行以下的命令下载数据集。该数据集utf8编码包含20个训练样例5个测试样例和2个生成式样例。
cd $PADDLE_ROOT/demo/seqToseq/data
./paraphrase_data.sh
第二步,将数据处理成规范格式,在训练数集上训练生成词向量字典(数据将保存在 `$PADDLE_SOURCE_ROOT/demo/seqToseq/data/pre-paraphrase`:
cd $PADDLE_ROOT/demo/seqToseq/
python preprocess.py -i data/paraphrase [--mergeDict]
- 其中,如果使用`--mergeDict`选项,源语言短语和目标语言短语的字典将被合并(源语言和目标语言共享相同的编码字典)。本实例中,源语言和目标语言都是相同的语言,因此可以使用该选项。
### 使用用户指定的词向量字典 ###
使用如下命令,从预训练模型中,根据用户指定的字典,抽取对应的词向量构成新的词表:
cd $PADDLE_ROOT/demo/model_zoo/embedding
python extract_para.py --preModel PREMODEL --preDict PREDICT --usrModel USRMODEL--usrDict USRDICT -d DIM
- `--preModel PREMODEL`: 预训练词向量字典模型的路径
- `--preDict PREDICT`: 预训练模型使用的字典的路径
- `--usrModel USRMODEL`: 抽取出的新词表的保存路径
- `--usrDict USRDICT`: 用户指定新的字典的路径,用于构成新的词表
- `-d DIM`: 参数(词向量)的维度
此处,你也可以简单的运行以下的命令:
cd $PADDLE_ROOT/demo/seqToseq/data/
./paraphrase_model.sh
运行成功以后,你将会看到以下的模型结构:
paraphrase_model
|--- _source_language_embedding
|--- _target_language_embedding
### 在PaddlePaddle平台训练模型 ###
首先,配置模型文件,配置如下(可以参考保存在 `demo/seqToseq/paraphrase/train.conf`的配置):
from seqToseq_net import *
is_generating = False
################## Data Definition #####################
train_conf = seq_to_seq_data(data_dir = "./data/pre-paraphrase",
job_mode = job_mode)
############## Algorithm Configuration ##################
settings(
learning_method = AdamOptimizer(),
batch_size = 50,
learning_rate = 5e-4)
################# Network configure #####################
gru_encoder_decoder(train_conf, is_generating, word_vector_dim = 32)
这个配置与`demo/seqToseq/translation/train.conf` 基本相同
然后,使用以下命令进行模型训练:
cd $PADDLE_SOURCE_ROOT/demo/seqToseq/paraphrase
./train.sh
其中,`train.sh` 与`demo/seqToseq/translation/train.sh` 基本相同只有2个配置不一样:
- `--init_model_path`: 初始化模型的路径配置为`data/paraphrase_modeldata/paraphrase_model`
- `--load_missing_parameter_strategy`:如果参数模型文件缺失,除词向量模型外的参数将使用正态分布随机初始化
如果用户想要了解详细的数据集的格式、模型的结构和训练过程,请查看 [Text generation Tutorial](../text_generation/index_cn.md).
## 可选功能 ##
### 观测词向量
PaddlePaddle 平台为想观测词向量的用户提供了将二进制词向量模型转换为文本模型的功能:
cd $PADDLE_ROOT/demo/model_zoo/embedding
python paraconvert.py --b2t -i INPUT -o OUTPUT -d DIM
- `-i INPUT`: 输入的(二进制)词向量模型名称
- `-o OUTPUT`: 输出的文本模型名称
- `-d DIM`: (词向量)参数维度
运行完以上命令,用户可以在输出的文本模型中看到:
0,4,32156096
-0.7845433,1.1937413,-0.1704215,0.4154715,0.9566584,-0.5558153,-0.2503305, ......
0.0000909,0.0009465,-0.0008813,-0.0008428,0.0007879,0.0000183,0.0001984, ......
......
- 其中,第一行是`PaddlePaddle` 输出文件的格式说明包含3个属性:
- `PaddlePaddle`的版本号本例中为0
- 浮点数占用的字节数本例中为4
- 总计的参数个数本例中为32,156,096
- 其余行是词向量参数行假设词向量维度为32
- 每行打印32个参数以','分隔
- 共有32,156,096/32 = 1,004,877行也就是说模型共包含1,004,877个被向量化的词
### 词向量模型的修正
`PaddlePaddle` 为想修正词向量模型的用户提供了将文本词向量模型转换为二进制模型的命令:
cd $PADDLE_ROOT/demo/model_zoo/embedding
python paraconvert.py --t2b -i INPUT -o OUTPUT
- `-i INPUT`: 输入的文本词向量模型名称
- `-o OUTPUT`: 输出的二进制词向量模型名称
请注意,输入的文本格式如下:
-0.7845433,1.1937413,-0.1704215,0.4154715,0.9566584,-0.5558153,-0.2503305, ......
0.0000909,0.0009465,-0.0008813,-0.0008428,0.0007879,0.0000183,0.0001984, ......
......
- 输入文本中没有头部(格式说明)行
- (输入文本)每行存储一个词,以逗号','分隔

@ -1,140 +0,0 @@
# Chinese Word Embedding Model Tutorial #
----------
This tutorial is to guide you through the process of using a Pretrained Chinese Word Embedding Model in the PaddlePaddle standard format.
We thank @lipeng for the pull request that defined the model schemas and pretrained the models.
## Introduction ###
### Chinese Word Dictionary ###
Our Chinese-word dictionary is created on Baidu ZhiDao and Baidu Baike by using in-house word segmentor. For example, the participle of "《红楼梦》" is "《""红楼梦""》"and "《红楼梦》". Our dictionary (using UTF-8 format) has has two columns: word and its frequency. The total word count is 3206326, including 4 special token:
- `<s>`: the start of a sequence
- `<e>`: the end of a sequence
- `PALCEHOLDER_JUST_IGNORE_THE_EMBEDDING`: a placeholder, just ignore it and its embedding
- `<unk>`: a word not included in dictionary
### Pretrained Chinese Word Embedding Model ###
Inspired by paper [A Neural Probabilistic Language Model](http://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf), our model architecture (**Embedding joint of six words->FullyConnect->SoftMax**) is as following graph. And for our dictionary, we pretrain four models with different word vector dimenstions, i.e 32, 64, 128, 256.
<center>![](./neural-n-gram-model.png)</center>
<center>Figure 1. neural-n-gram-model</center>
### Download and Extract ###
To download and extract our dictionary and pretrained model, run the following commands.
cd $PADDLE_ROOT/demo/model_zoo/embedding
./pre_DictAndModel.sh
## Chinese Paraphrasing Example ##
We provide a paraphrasing task to show the usage of pretrained Chinese Word Dictionary and Embedding Model.
### Data Preparation and Preprocess ###
First, run the following commands to download and extract the in-house dataset. The dataset (using UTF-8 format) has 20 training samples, 5 testing samples and 2 generating samples.
cd $PADDLE_ROOT/demo/seqToseq/data
./paraphrase_data.sh
Second, preprocess data and build dictionary on train data by running the following commands, and the preprocessed dataset is stored in `$PADDLE_SOURCE_ROOT/demo/seqToseq/data/pre-paraphrase`:
cd $PADDLE_ROOT/demo/seqToseq/
python preprocess.py -i data/paraphrase [--mergeDict]
- `--mergeDict`: if using this option, the source and target dictionary are merged, i.e, two dictionaries have the same context. Here, as source and target data are all chinese words, this option can be used.
### User Specified Embedding Model ###
The general command of extracting desired parameters from the pretrained embedding model based on user dictionary is:
cd $PADDLE_ROOT/demo/model_zoo/embedding
python extract_para.py --preModel PREMODEL --preDict PREDICT --usrModel USRMODEL--usrDict USRDICT -d DIM
- `--preModel PREMODEL`: the name of pretrained embedding model
- `--preDict PREDICT`: the name of pretrained dictionary
- `--usrModel USRMODEL`: the name of extracted embedding model
- `--usrDict USRDICT`: the name of user specified dictionary
- `-d DIM`: dimension of parameter
Here, you can simply run the command:
cd $PADDLE_ROOT/demo/seqToseq/data/
./paraphrase_model.sh
And you will see following embedding model structure:
paraphrase_model
|--- _source_language_embedding
|--- _target_language_embedding
### Training Model in PaddlePaddle ###
First, create a model config file, see example `demo/seqToseq/paraphrase/train.conf`:
from seqToseq_net import *
is_generating = False
################## Data Definition #####################
train_conf = seq_to_seq_data(data_dir = "./data/pre-paraphrase",
job_mode = job_mode)
############## Algorithm Configuration ##################
settings(
learning_method = AdamOptimizer(),
batch_size = 50,
learning_rate = 5e-4)
################# Network configure #####################
gru_encoder_decoder(train_conf, is_generating, word_vector_dim = 32)
This config is almost the same as `demo/seqToseq/translation/train.conf`.
Then, train the model by running the command:
cd $PADDLE_SOURCE_ROOT/demo/seqToseq/paraphrase
./train.sh
where `train.sh` is almost the same as `demo/seqToseq/translation/train.sh`, the only difference is following two command arguments:
- `--init_model_path`: path of the initialization model, here is `data/paraphrase_model`
- `--load_missing_parameter_strategy`: operations when model file is missing, here use a normal distibution to initialize the other parameters except for the embedding layer
For users who want to understand the dataset format, model architecture and training procedure in detail, please refer to [Text generation Tutorial](../text_generation/index_en.md).
## Optional Function ##
### Embedding Parameters Observation
For users who want to observe the embedding parameters, this function can convert a PaddlePaddle binary embedding model to a text model by running the command:
cd $PADDLE_ROOT/demo/model_zoo/embedding
python paraconvert.py --b2t -i INPUT -o OUTPUT -d DIM
- `-i INPUT`: the name of input binary embedding model
- `-o OUTPUT`: the name of output text embedding model
- `-d DIM`: the dimension of parameter
You will see parameters like this in output text model:
0,4,32156096
-0.7845433,1.1937413,-0.1704215,0.4154715,0.9566584,-0.5558153,-0.2503305, ......
0.0000909,0.0009465,-0.0008813,-0.0008428,0.0007879,0.0000183,0.0001984, ......
......
- 1st line is **PaddlePaddle format file head**, it has 3 attributes:
- version of PaddlePaddle, here is 0
- sizeof(float), here is 4
- total number of parameter, here is 32156096
- Other lines print the paramters (assume `<dim>` = 32)
- each line print 32 paramters splitted by ','
- there is 32156096/32 = 1004877 lines, meaning there is 1004877 embedding words
### Embedding Parameters Revision
For users who want to revise the embedding parameters, this function can convert a revised text embedding model to a PaddlePaddle binary model by running the command:
cd $PADDLE_ROOT/demo/model_zoo/embedding
python paraconvert.py --t2b -i INPUT -o OUTPUT
- `-i INPUT`: the name of input text embedding model.
- `-o OUTPUT`: the name of output binary embedding model
Note that the format of input text model is as follows:
-0.7845433,1.1937413,-0.1704215,0.4154715,0.9566584,-0.5558153,-0.2503305, ......
0.0000909,0.0009465,-0.0008813,-0.0008428,0.0007879,0.0000183,0.0001984, ......
......
- there is no file header in 1st line
- each line stores parameters for one word, the separator is commas ','

Binary file not shown.

Before

Width:  |  Height:  |  Size: 67 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 17 KiB

@ -1,137 +0,0 @@
# Generative Adversarial Networks (GAN)
This demo implements GAN training described in the original [GAN paper](https://arxiv.org/abs/1406.2661) and deep convolutional generative adversarial networks [DCGAN paper](https://arxiv.org/abs/1511.06434).
The high-level structure of GAN is shown in Figure. 1 below. It is composed of two major parts: a generator and a discriminator, both of which are based on neural networks. The generator takes in some kind of noise with a known distribution and transforms it into an image. The discriminator takes in an image and determines whether it is artificially generated by the generator or a real image. So the generator and the discriminator are in a competitive game in which generator is trying to generate image to look as real as possible to fool the discriminator, while the discriminator is trying to distinguish between real and fake images.
<center>![](./gan.png)</center>
<p align="center">
Figure 1. GAN-Model-Structure
<a href="https://ishmaelbelghazi.github.io/ALI/">figure credit</a>
</p>
The generator and discriminator take turn to be trained using SGD. The objective function of the generator is for its generated images being classified as real by the discriminator, and the objective function of the discriminator is to correctly classify real and fake images. When the GAN model is trained to converge to the equilibrium state, the generator will transform the given noise distribution to the distribution of real images, and the discriminator will not be able to distinguish between real and fake images at all.
## Implementation of GAN Model Structure
Since GAN model involves multiple neural networks, it requires to use paddle python API. So the code walk-through below can also partially serve as an introduction to the usage of Paddle Python API.
There are three networks defined in gan_conf.py, namely **generator_training**, **discriminator_training** and **generator**. The relationship to the model structure we defined above is that **discriminator_training** is the discriminator, **generator** is the generator, and the **generator_training** combined the generator and discriminator since training generator would require the discriminator to provide loss function. This relationship is described in the following code:
```python
if is_generator_training:
noise = data_layer(name="noise", size=noise_dim)
sample = generator(noise)
if is_discriminator_training:
sample = data_layer(name="sample", size=sample_dim)
if is_generator_training or is_discriminator_training:
label = data_layer(name="label", size=1)
prob = discriminator(sample)
cost = cross_entropy(input=prob, label=label)
classification_error_evaluator(
input=prob, label=label, name=mode + '_error')
outputs(cost)
if is_generator:
noise = data_layer(name="noise", size=noise_dim)
outputs(generator(noise))
```
In order to train the networks defined in gan_conf.py, one first needs to initialize a Paddle environment, parse the config, create GradientMachine from the config and create trainer from GradientMachine as done in the code chunk below:
```python
import py_paddle.swig_paddle as api
# init paddle environment
api.initPaddle('--use_gpu=' + use_gpu, '--dot_period=10',
'--log_period=100', '--gpu_id=' + args.gpu_id,
'--save_dir=' + "./%s_params/" % data_source)
# Parse config
gen_conf = parse_config(conf, "mode=generator_training,data=" + data_source)
dis_conf = parse_config(conf, "mode=discriminator_training,data=" + data_source)
generator_conf = parse_config(conf, "mode=generator,data=" + data_source)
# Create GradientMachine
dis_training_machine = api.GradientMachine.createFromConfigProto(
dis_conf.model_config)
gen_training_machine = api.GradientMachine.createFromConfigProto(
gen_conf.model_config)
generator_machine = api.GradientMachine.createFromConfigProto(
generator_conf.model_config)
# Create trainer
dis_trainer = api.Trainer.create(dis_conf, dis_training_machine)
gen_trainer = api.Trainer.create(gen_conf, gen_training_machine)
```
In order to balance the strength between generator and discriminator, we schedule to train whichever one is performing worse by comparing their loss function value. The loss function value can be calculated by a forward pass through the GradientMachine.
```python
def get_training_loss(training_machine, inputs):
outputs = api.Arguments.createArguments(0)
training_machine.forward(inputs, outputs, api.PASS_TEST)
loss = outputs.getSlotValue(0).copyToNumpyMat()
return numpy.mean(loss)
```
After training one network, one needs to sync the new parameters to the other networks. The code below demonstrates one example of such use case:
```python
# Train the gen_training
gen_trainer.trainOneDataBatch(batch_size, data_batch_gen)
# Copy the parameters from gen_training to dis_training and generator
copy_shared_parameters(gen_training_machine,
dis_training_machine)
copy_shared_parameters(gen_training_machine, generator_machine)
```
## A Toy Example
With the infrastructure explained above, we can now walk you through a toy example of generating two dimensional uniform distribution using 10 dimensional Gaussian noise.
The Gaussian noises are generated using the code below:
```python
def get_noise(batch_size, noise_dim):
return numpy.random.normal(size=(batch_size, noise_dim)).astype('float32')
```
The real samples (2-D uniform) are generated using the code below:
```python
# synthesize 2-D uniform data in gan_trainer.py:114
def load_uniform_data():
data = numpy.random.rand(1000000, 2).astype('float32')
return data
```
The generator and discriminator network are built using fully-connected layer and batch_norm layer, and are defined in gan_conf.py.
To train the GAN model, one can use the command below. The flag -d specifies the training data (cifar, mnist or uniform) and flag --useGpu specifies whether to use gpu for training (0 is cpu, 1 is gpu).
```bash
$python gan_trainer.py -d uniform --useGpu 1
```
The generated samples can be found in ./uniform_samples/ and one example is shown below as Figure 2. One can see that it roughly recovers the 2D uniform distribution.
<center>![](./uniform_sample.png)</center>
<p align="center">
Figure 2. Uniform Sample
</p>
## MNIST Example
### Data preparation
To download the MNIST data, one can use the following commands:
```bash
$cd data/
$./get_mnist_data.sh
```
### Model description
Following the DC-Gan paper (https://arxiv.org/abs/1511.06434), we use convolution/convolution-transpose layer in the discriminator/generator network to better deal with images. The details of the network structures are defined in gan_conf_image.py.
### Training the model
To train the GAN model on mnist data, one can use the following command:
```bash
$python gan_trainer.py -d mnist --useGpu 1
```
The generated sample images can be found at ./mnist_samples/ and one example is shown below as Figure 3.
<center>![](./mnist_sample.png)</center>
<p align="center">
Figure 3. MNIST Sample
</p>

Binary file not shown.

Before

Width:  |  Height:  |  Size: 28 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 24 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 22 KiB

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

Binary file not shown.

Before

Width:  |  Height:  |  Size: 35 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 53 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 43 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 58 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 30 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 48 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 45 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 56 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 9.3 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 7.3 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 9.2 KiB

Binary file not shown.

Before

Width:  |  Height:  |  Size: 8.5 KiB

Some files were not shown because too many files have changed in this diff Show More

Loading…
Cancel
Save