# Transformer Example

## Description

This example implements training and evaluation of the Transformer model, which was introduced in the following paper:

- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS 2017, pages 5998–6008.

## Requirements

- Install [MindSpore](https://www.mindspore.cn/install/en).
- Download and preprocess the WMT English-German dataset for training and evaluation.

> Notes: If you are running an evaluation task, prepare the corresponding checkpoint file.

## Example structure

```shell
.
└─Transformer
  ├─README.md
  ├─scripts
    ├─process_output.sh
    ├─replace-quote.perl
    ├─run_distribute_train.sh
    └─run_standalone_train.sh
  ├─src
    ├─__init__.py
    ├─beam_search.py
    ├─config.py
    ├─dataset.py
    ├─eval_config.py
    ├─lr_schedule.py
    ├─process_output.py
    ├─tokenization.py
    ├─transformer_for_train.py
    ├─transformer_model.py
    └─weight_init.py
  ├─create_data.py
  ├─eval.py
  └─train.py
```

---

## Prepare the dataset

- You may use this [shell script](https://github.com/tensorflow/nmt/blob/master/nmt/scripts/wmt16_en_de.sh) to download and preprocess the WMT English-German dataset. After preprocessing you should have the following files:

  - train.tok.clean.bpe.32000.en
  - train.tok.clean.bpe.32000.de
  - vocab.bpe.32000
  - newstest2014.tok.bpe.32000.en
  - newstest2014.tok.bpe.32000.de
  - newstest2014.tok.de

- Convert the original data to mindrecord for training:

```bash
paste train.tok.clean.bpe.32000.en train.tok.clean.bpe.32000.de > train.all
python create_data.py --input_file train.all --vocab_file vocab.bpe.32000 --output_file /path/ende-l128-mindrecord --max_seq_length 128
```

- Convert the original data to mindrecord for evaluation:

```bash
paste newstest2014.tok.bpe.32000.en newstest2014.tok.bpe.32000.de > test.all
python create_data.py --input_file test.all --vocab_file vocab.bpe.32000 --output_file /path/newstest2014-l128-mindrecord --num_splits 1 --max_seq_length 128 --clip_to_max_len True
```

## Running the example

### Training

- Set options in `config.py`, including loss_scale, learning rate and network hyperparameters. Click [here](https://www.mindspore.cn/tutorial/zh-CN/master/use/data_preparation/loading_the_datasets.html#mindspore) for more information about the dataset.

- Run `run_standalone_train.sh` for non-distributed training of the Transformer model.

```bash
sh scripts/run_standalone_train.sh DEVICE_ID EPOCH_SIZE DATA_PATH
```

- Run `run_distribute_train.sh` for distributed training of the Transformer model.

```bash
sh scripts/run_distribute_train.sh DEVICE_NUM EPOCH_SIZE DATA_PATH MINDSPORE_HCCL_CONFIG_PATH
```

### Evaluation

- Set options in `eval_config.py`. Make sure 'data_file', 'model_file' and 'output_file' are set to your own paths.

- Run `eval.py` for evaluation of the Transformer model.

```bash
python eval.py
```

- Run `process_output.sh` to convert the output token ids into readable translation results.

```bash
sh scripts/process_output.sh REF_DATA EVAL_OUTPUT VOCAB_FILE
```

You will get two files, REF_DATA.forbleu and EVAL_OUTPUT.forbleu, for BLEU score calculation.
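Before scoring, it can help to verify that the two files are line-aligned, since BLEU is computed sentence by sentence. A minimal check (a hypothetical helper, not part of this repository):

```python
# Sanity check: multi-bleu.perl pairs the reference and hypothesis files
# line by line, so a length mismatch silently skews the score.
with open("REF_DATA.forbleu", encoding="utf-8") as ref_f:
    refs = ref_f.read().splitlines()
with open("EVAL_OUTPUT.forbleu", encoding="utf-8") as hyp_f:
    hyps = hyp_f.read().splitlines()
assert len(refs) == len(hyps), f"line count mismatch: {len(refs)} vs {len(hyps)}"
print(f"{len(hyps)} sentence pairs ready for scoring")
```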
- To calculate the BLEU score, you may use this [perl script](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl) and run the following command:

```bash
perl multi-bleu.perl REF_DATA.forbleu < EVAL_OUTPUT.forbleu
```

---

## Usage

### Training

```
usage: train.py  [--distribute DISTRIBUTE] [--epoch_size N] [--device_num N]
                 [--device_id N] [--enable_save_ckpt ENABLE_SAVE_CKPT]
                 [--enable_lossscale ENABLE_LOSSSCALE] [--do_shuffle DO_SHUFFLE]
                 [--enable_data_sink ENABLE_DATA_SINK] [--checkpoint_path CHECKPOINT_PATH]
                 [--save_checkpoint_steps N] [--save_checkpoint_num N]
                 [--save_checkpoint_path SAVE_CHECKPOINT_PATH] [--data_path DATA_PATH]

options:
    --distribute               run training on several devices: "true" (training on more than one device) | "false", default is "false"
    --epoch_size               epoch size: N, default is 52
    --device_num               number of used devices: N, default is 1
    --device_id                device id: N, default is 0
    --enable_save_ckpt         enable checkpoint saving: "true" | "false", default is "true"
    --enable_lossscale         enable loss scaling: "true" | "false", default is "true"
    --do_shuffle               enable shuffle: "true" | "false", default is "true"
    --enable_data_sink         enable data sink: "true" | "false", default is "false"
    --checkpoint_path          path to load checkpoint files: PATH, default is ""
    --save_checkpoint_steps    steps for saving checkpoint files: N, default is 2500
    --save_checkpoint_num      number of checkpoint files to keep: N, default is 30
    --save_checkpoint_path     path to save checkpoint files: PATH, default is "./checkpoint/"
    --data_path                path to the dataset file: PATH, default is ""
```

## Options and Parameters

The parameters of the Transformer model and the options for training and evaluation are set in `config.py` and `eval_config.py`, respectively.

### Options:

```
config.py:
    transformer_network             version of Transformer model: base | large, default is large
    init_loss_scale_value           initial value of loss scale: N, default is 2^10
    scale_factor                    factor used to update loss scale: N, default is 2
    scale_window                    steps between two updates of loss scale: N, default is 2000
    optimizer                       optimizer used in the network: Adam, default is "Adam"

eval_config.py:
    transformer_network             version of Transformer model: base | large, default is large
    data_file                       data file: PATH
    model_file                      checkpoint file to be loaded: PATH
    output_file                     output file of evaluation: PATH
```

### Parameters:

```
Parameters for dataset and network (Training/Evaluation):
    batch_size                      batch size of input dataset: N, default is 96
    seq_length                      length of input sequence: N, default is 128
    vocab_size                      size of each embedding vector: N, default is 36560
    hidden_size                     size of Transformer encoder layers: N, default is 1024
    num_hidden_layers               number of hidden layers: N, default is 6
    num_attention_heads             number of attention heads: N, default is 16
    intermediate_size               size of intermediate layer: N, default is 4096
    hidden_act                      activation function used: ACTIVATION, default is "relu"
    hidden_dropout_prob             dropout probability for TransformerOutput: Q, default is 0.3
    attention_probs_dropout_prob    dropout probability for TransformerAttention: Q, default is 0.3
    max_position_embeddings         maximum length of sequences: N, default is 128
    initializer_range               initialization value of TruncatedNormal: Q, default is 0.02
    label_smoothing                 label smoothing setting: Q, default is 0.1
    input_mask_from_dataset         use the input mask loaded from dataset or not: True | False, default is True
    beam_width                      beam width setting: N, default is 4
    max_decode_length               max decode length in evaluation: N, default is 80
    length_penalty_weight           normalize scores of translations according to their length: Q, default is 1.0
    compute_type                    compute type in Transformer: mstype.float16 | mstype.float32, default is mstype.float16
```
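To make the `label_smoothing` and `vocab_size` parameters concrete, here is a minimal NumPy sketch of how a smoothed target distribution is typically built; the repository's own training loss may differ in detail. The true token receives `1 - label_smoothing` and the remaining mass is spread evenly over the other `vocab_size - 1` ids.

```python
import numpy as np

def smoothed_targets(labels, vocab_size, label_smoothing=0.1):
    """Turn hard token ids into smoothed one-hot distributions (illustrative only)."""
    off_value = label_smoothing / (vocab_size - 1)
    targets = np.full((len(labels), vocab_size), off_value, dtype=np.float32)
    targets[np.arange(len(labels)), labels] = 1.0 - label_smoothing
    return targets

# With the defaults above (vocab_size=36560, label_smoothing=0.1), the true
# token gets probability 0.9 and every other token gets 0.1 / 36559.
print(smoothed_targets(np.array([5]), vocab_size=10).round(3))
```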
```
Parameters for learning rate:
    learning_rate                   value of learning rate: Q
    warmup_steps                    steps of the learning rate warm up: N
    start_decay_step                step at which the learning rate starts to decay: N
    min_lr                          minimal learning rate: Q
```
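The schedule assembled from these parameters is implemented in `src/lr_schedule.py`. As a rough illustration only (the exact shape below is an assumption, not the repository's code), a warmup-then-decay schedule driven by these four values can look like:

```python
def transformer_lr(step, learning_rate, warmup_steps, start_decay_step, min_lr):
    """Hypothetical sketch: linear warmup, flat plateau, inverse-sqrt decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to learning_rate.
        return learning_rate * step / warmup_steps
    if step > start_decay_step:
        # Decay proportionally to 1/sqrt(step), floored at min_lr.
        decayed = learning_rate * (start_decay_step ** 0.5) / (step ** 0.5)
        return max(decayed, min_lr)
    return learning_rate

# The rate climbs for warmup_steps steps, stays flat until start_decay_step,
# then decays toward min_lr.
print([round(transformer_lr(s, 1e-3, 100, 1000, 1e-5), 6)
       for s in (1, 50, 100, 500, 4000)])
```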