FastText

FastText is a fast and efficient text classification algorithm, proposed by Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov in the 2016 article "Bag of Tricks for Efficient Text Classification". Its model architecture is similar to CBOW, with the middle word replaced by a label. FastText adds n-gram features as additional features to capture some information about word order. It speeds up training and testing while maintaining high accuracy, and is widely used in various text classification tasks.
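
As a toy illustration of the n-gram trick described above (not code from this repository), word ids can be augmented with hashed bigram ids that live in a separate bucket range, as the paper describes:

    import zlib

    def add_bigram_features(tokens, vocab, num_buckets=2000000):
        """Return unigram ids plus hashed bigram ids for one tokenized sentence.

        `vocab` maps known words to ids; bigrams are hashed into a bucket
        range placed above the vocabulary, as in the paper.
        """
        ids = [vocab[tok] for tok in tokens if tok in vocab]
        for first, second in zip(tokens, tokens[1:]):
            bigram = (first + " " + second).encode("utf-8")
            ids.append(len(vocab) + zlib.crc32(bigram) % num_buckets)
        return ids

    vocab = {"fast": 0, "text": 1, "is": 2}
    print(add_bigram_features(["fast", "text", "is", "fast"], vocab))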

Paper: "Bag of Tricks for Efficient Text Classification", 2016, A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov

Model Structure

The FastText model mainly consists of an input layer, a hidden layer, and an output layer, where the input is a sequence of words (a text or sentence). The output layer gives the probability that the word sequence belongs to each category. The hidden layer is formed by averaging multiple word vectors: the features are mapped to the hidden layer through a linear transformation, and then mapped from the hidden layer to the labels.
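
A minimal sketch of this structure as a MindSpore cell (an illustration of the description above, not the repository's src/fasttext_model.py):

    import mindspore.nn as nn
    import mindspore.ops.operations as P

    class FastTextSketch(nn.Cell):
        """Embed the tokens, average them, then classify the average."""
        def __init__(self, vocab_size, embedding_dims, num_class):
            super(FastTextSketch, self).__init__()
            self.embedding = nn.Embedding(vocab_size, embedding_dims)
            self.mean = P.ReduceMean()
            self.fc = nn.Dense(embedding_dims, num_class)

        def construct(self, src_tokens):
            embedded = self.embedding(src_tokens)  # (batch, seq_len, dims)
            hidden = self.mean(embedded, 1)        # average the word vectors
            return self.fc(hidden)                 # logits over num_class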

Dataset

Note that you can run the scripts with the datasets mentioned in the original paper or with other datasets widely used in this domain. The following sections describe how to run the scripts using the datasets below.

  • AG's news topic classification dataset
  • DBPedia Ontology Classification Dataset
  • Yelp Review Polarity Dataset

Environment Requirements

Quick Start

After dataset preparation, you can start training and evaluation as follows:

# run training example
cd ./scripts
sh run_standalone_train.sh [TRAIN_DATASET] [DEVICEID]

# run distributed training example
sh run_distributed_train.sh [TRAIN_DATASET] [RANK_TABLE_PATH]

# run evaluation example
sh run_eval.sh [EVAL_DATASET_PATH] [DATASET_NAME] [MODEL_CKPT] [DEVICEID]

Script Description

The FastText network scripts and code structure are as follows:

├── fasttext
  ├── README.md                              // Introduction of FastText model.
  ├── src
  │   ├──config.py                           // Configuration instance definition.
  │   ├──create_dataset.py                   // Dataset preparation.
  │   ├──fasttext_model.py                   // FastText model architecture.
  │   ├──fasttext_train.py                   // Training network built on the FastText model.
  │   ├──load_dataset.py                     // Dataset loader to feed into model.
  │   ├──lr_scheduler.py                     // Learning rate scheduler.
  ├── scripts
  │   ├──run_distributed_train.sh            // Shell script for distributed training on Ascend.
  │   ├──run_eval.sh                         // Shell script for standalone evaluation on Ascend.
  │   ├──run_standalone_train.sh             // Shell script for standalone training on Ascend.
  ├── eval.py                                // Inference API entry.
  ├── requirements.txt                       // Third-party package requirements.
  ├── train.py                               // Training API entry.

Dataset Preparation

  • Download the AG's News Topic Classification Dataset, the DBPedia Ontology Classification Dataset, and the Yelp Review Polarity Dataset, then unzip them to any path you want.

  • Run the following script to preprocess the data and convert the original datasets to MindRecord files for training and evaluation:

    cd scripts
    sh creat_dataset.sh [SOURCE_DATASET_PATH] [DATASET_NAME]
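
Under the hood, the conversion writes tokenized samples into MindRecord files. A minimal sketch using MindSpore's FileWriter (the output file name and schema field names are assumptions for illustration; the actual logic lives in src/create_dataset.py):

    import numpy as np
    from mindspore.mindrecord import FileWriter

    # Record layout: a variable-length array of token ids plus a class label.
    schema = {"src_tokens": {"type": "int32", "shape": [-1]},
              "label": {"type": "int32"}}

    writer = FileWriter(file_name="ag_news.train.mindrecord", shard_num=1)
    writer.add_schema(schema, "fasttext_schema")

    # One toy record; the real script writes the whole tokenized corpus.
    writer.write_raw_data([{"src_tokens": np.array([12, 7, 345], dtype=np.int32),
                            "label": 2}])
    writer.commit()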
    

Configuration File

Parameters for both training and evaluation can be set in config.py. All datasets use the same parameter names; parameter values can be changed according to your needs.

  • Network Parameters

       vocab_size               # Vocabulary size.
       buckets                  # Bucket sequence lengths.
       test_buckets             # Test dataset bucket sequence lengths.
       batch_size               # Batch size of the input dataset.
       embedding_dims           # Size of each embedding vector.
       num_class                # Number of labels.
       epoch                    # Total training epochs.
       lr                       # Initial learning rate.
       min_lr                   # Minimum learning rate.
       warmup_steps             # Warm-up steps.
       poly_lr_scheduler_power  # Power used to compute the decayed learning rate.
       pretrain_ckpt_dir        # Pretrained checkpoint directory.
       keep_ckpt_max            # Maximum number of checkpoint files to keep.
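
A hypothetical minimal equivalent of such a configuration object (all values below are illustrative, not the repository defaults):

    class FastTextConfig:
        """Hypothetical container mirroring the parameters listed above."""
        def __init__(self):
            self.vocab_size = 1500000      # illustrative value
            self.buckets = [64, 128, 512]  # pad each sample to the nearest bucket
            self.test_buckets = [512]
            self.batch_size = 512
            self.embedding_dims = 16
            self.num_class = 4             # e.g. AG's News has 4 labels
            self.epoch = 5
            self.lr = 0.2
            self.min_lr = 1e-6
            self.warmup_steps = 400000
            self.poly_lr_scheduler_power = 0.5
            self.pretrain_ckpt_dir = None
            self.keep_ckpt_max = 10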
    

Training Process

  • Start training on a single device by running the shell script:

    cd ./scripts
    sh run_standalone_train.sh [DATASET_PATH] [DEVICEID]
    
  • Run the following script for distributed training of FastText on multiple devices. The command should be executed in bash from scripts/:

    cd ./scripts
    sh run_distributed_train.sh [DATASET_PATH] [RANK_TABLE_PATH]
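
The learning rate combines the warm-up and polynomial-decay parameters from the configuration section (lr, min_lr, warmup_steps, poly_lr_scheduler_power). A plain-Python sketch of such a schedule, assuming linear warm-up (the exact formula in src/lr_scheduler.py may differ):

    def poly_decay_lr(step, total_steps, lr, min_lr, warmup_steps, power):
        """Linear warm-up followed by polynomial decay down to min_lr."""
        if step < warmup_steps:
            return lr * (step + 1) / warmup_steps
        decay_steps = max(total_steps - warmup_steps, 1)
        progress = min((step - warmup_steps) / decay_steps, 1.0)
        return (lr - min_lr) * (1 - progress) ** power + min_lr

    # Learning rate at a few points of a hypothetical 10000-step run.
    for step in (0, 500, 5000, 9999):
        print(step, poly_decay_lr(step, 10000, lr=0.2, min_lr=1e-6,
                                  warmup_steps=1000, power=0.5))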
    

Inference Process

  • Run the following script for evaluation of FastText. The command is as follows:

    cd ./scripts
    sh run_eval.sh [DATASET_PATH] [DATASET_NAME] [MODEL_CKPT] [DEVICEID]
    

    Note: DATASET_PATH is the path to the MindRecord files, e.g. /dataset_path/*.mindrecord
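
The accuracy reported in the tables below is top-1 accuracy over these label indices. A minimal sketch of how it can be computed from model logits (an assumption for illustration, not necessarily what eval.py does):

    import numpy as np

    def top1_accuracy(logits, labels):
        """Fraction of samples whose argmax prediction matches the label."""
        predictions = np.argmax(logits, axis=1)  # predicted label index
        return float(np.mean(predictions == labels))

    logits = np.array([[0.1, 2.3, -1.0], [1.5, 0.2, 0.1]])
    labels = np.array([1, 0])
    print(top1_accuracy(logits, labels))  # 1.0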

Model Description

Performance

Training Performance

| Parameters               | AG's News                              | DBPedia                                 | Yelp Review Polarity         |
| ------------------------ | -------------------------------------- | --------------------------------------- | ---------------------------- |
| Resource                 | Ascend 910                             | Ascend 910                              | Ascend 910                   |
| Uploaded Date            | 12/21/2020 (month/day/year)            | 11/21/2020 (month/day/year)             | 11/21/2020 (month/day/year)  |
| MindSpore Version        | 1.1.0                                  | 1.1.0                                   | 1.1.0                        |
| Dataset                  | AG's News Topic Classification Dataset | DBPedia Ontology Classification Dataset | Yelp Review Polarity Dataset |
| Training Parameters      | epoch=5, batch_size=512                | epoch=5, batch_size=4096                | epoch=5, batch_size=2048     |
| Optimizer                | Adam                                   | Adam                                    | Adam                         |
| Loss Function            | Softmax Cross Entropy                  | Softmax Cross Entropy                   | Softmax Cross Entropy        |
| Outputs                  | probability                            | probability                             | probability                  |
| Speed                    | 10ms/step (1pcs)                       | 58ms/step (1pcs)                        | 101ms/step (1pcs)            |
| Epoch Time               | 2.36s (1pcs)                           | 8.15s (1pcs)                            | 28s (1pcs)                   |
| Loss                     | 0.0067                                 | 2.6e-4                                  | 0.062                        |
| Params (M)               | 22                                     | 106                                     | 103                          |
| Checkpoint for inference | 254M (.ckpt file)                      | 1.2G (.ckpt file)                       | 1.2G (.ckpt file)            |
| Scripts                  | fasttext                               | fasttext                                | fasttext                     |

Inference Performance

| Parameters          | AG's News                              | DBPedia                                 | Yelp Review Polarity         |
| ------------------- | -------------------------------------- | --------------------------------------- | ---------------------------- |
| Resource            | Ascend 910                             | Ascend 910                              | Ascend 910                   |
| Uploaded Date       | 12/21/2020 (month/day/year)            | 12/21/2020 (month/day/year)             | 12/21/2020 (month/day/year)  |
| MindSpore Version   | 1.1.0                                  | 1.1.0                                   | 1.1.0                        |
| Dataset             | AG's News Topic Classification Dataset | DBPedia Ontology Classification Dataset | Yelp Review Polarity Dataset |
| batch_size          | 512                                    | 4096                                    | 2048                         |
| Epoch Time          | 2.36s                                  | 8.15s                                   | 28s                          |
| Outputs             | label index                            | label index                             | label index                  |
| Accuracy (%)        | 92.53                                  | 98.6                                    | 95.7                         |
| Model for inference | 254M (.ckpt file)                      | 1.2G (.ckpt file)                       | 1.2G (.ckpt file)            |

Random Situation Description

There is only one source of randomness:

  • Initialization of some model weights.

Seeds have already been set in train.py to avoid randomness in weight initialization.
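
MindSpore provides a global seed API for this purpose; a minimal example (the specific seed values used in train.py may differ):

    import random
    import numpy as np
    from mindspore import set_seed

    set_seed(5)        # fixes MindSpore's global RNG, including weight init
    random.seed(5)     # Python-level randomness
    np.random.seed(5)  # NumPy-level randomness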

Others

This model has been validated in the Ascend environment; it has not yet been validated on CPU or GPU.

ModelZoo HomePage

Please check the official homepage.