# Inception-v3 Example

## Description

This is an example of training Inception-v3 in MindSpore.
## Requirements

- Install MindSpore.
- Download the dataset.
## Structure

```
.
└─Inception-v3
  ├─README.md
  ├─scripts
  │ ├─run_standalone_train_for_gpu.sh  # launch standalone training on GPU (1p)
  │ ├─run_distribute_train_for_gpu.sh  # launch distributed training on GPU (8p)
  │ └─run_eval_for_gpu.sh              # launch evaluation on GPU
  ├─src
  │ ├─config.py                        # parameter configuration
  │ ├─dataset.py                       # data preprocessing
  │ ├─inception_v3.py                  # network definition
  │ ├─loss.py                          # customized cross-entropy loss function
  │ └─lr_generator.py                  # learning rate generator
  ├─eval.py                            # evaluate the network
  ├─export.py                          # convert checkpoint
  └─train.py                           # train the network
```
## Parameter Configuration

Parameters for both training and evaluation can be set in config.py.

```
'random_seed': 1,              # fix random seed
'rank': 0,                     # local rank in distributed training
'group_size': 1,               # world size in distributed training
'work_nums': 8,                # number of workers reading the data
'decay_method': 'cosine',      # learning rate scheduler mode
'loss_scale': 1,               # loss scale
'batch_size': 128,             # input batch size
'epoch_size': 250,             # total number of epochs
'num_classes': 1000,           # number of dataset classes
'smooth_factor': 0.1,          # label smoothing factor
'aux_factor': 0.2,             # loss weight of the aux logits
'lr_init': 0.00004,            # initial learning rate
'lr_max': 0.4,                 # maximum learning rate
'lr_end': 0.000004,            # minimum learning rate
'warmup_epochs': 1,            # number of warmup epochs
'weight_decay': 0.00004,       # weight decay
'momentum': 0.9,               # momentum
'opt_eps': 1.0,                # optimizer epsilon
'keep_checkpoint_max': 100,    # maximum number of checkpoints to keep
'ckpt_path': './checkpoint/',  # path to save checkpoints
'is_save_on_master': 1         # save checkpoints only on rank 0 in distributed training
```
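In the MindSpore model zoo, config.py is usually a plain dictionary wrapped in an easydict. The sketch below shows that layout for a subset of the values above; the easydict dependency and the variable name `config` are assumptions about this repository's code, not confirmed from it.

```python
# Hypothetical layout of src/config.py as an easydict (a common model-zoo pattern);
# only a subset of the keys listed above is shown.
from easydict import EasyDict as edict

config = edict({
    'random_seed': 1,
    'work_nums': 8,
    'decay_method': 'cosine',
    'batch_size': 128,
    'epoch_size': 250,
    'num_classes': 1000,
    'lr_init': 0.00004,
    'lr_max': 0.4,
    'lr_end': 0.000004,
    'warmup_epochs': 1,
    # ... remaining keys from the listing above
})
```

Values are then read as attributes, e.g. `config.batch_size` or `config.lr_max`, from train.py and eval.py.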
## Running the example

### Train

#### Usage

```
# distributed training example (8p)
sh run_distribute_train_for_gpu.sh DATA_DIR
# standalone training
sh run_standalone_train_for_gpu.sh DEVICE_ID DATA_DIR
```

#### Launch

```
# distributed training example (8p) for GPU
sh scripts/run_distribute_train_for_gpu.sh /dataset/train
# standalone training example for GPU
sh scripts/run_standalone_train_for_gpu.sh 0 /dataset/train
```
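run_distribute_train_for_gpu.sh starts one process per GPU, and train.py then sets up MindSpore's data-parallel context (which is where the `rank` and `group_size` entries in config.py come from). The snippet below is a hedged sketch of that setup using the standard MindSpore 1.x GPU API; the exact calls and argument names in this repository's train.py may differ.

```python
# Hedged sketch of data-parallel initialization on GPU (MindSpore 1.x API);
# the actual setup in train.py may differ in detail.
from mindspore import context
from mindspore.context import ParallelMode
from mindspore.communication.management import init, get_rank, get_group_size

context.set_context(mode=context.GRAPH_MODE, device_target="GPU")

init("nccl")                   # initialize NCCL-based collective communication
rank = get_rank()              # corresponds to 'rank' in config.py
group_size = get_group_size()  # corresponds to 'group_size' in config.py

context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL,
                                  gradients_mean=True,
                                  device_num=group_size)
```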
#### Result

You can find the checkpoint files together with the training results in the log.
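Since config.py selects 'cosine' as the decay_method with a one-epoch warmup, src/lr_generator.py presumably produces a per-step schedule that warms up to lr_max and then decays towards lr_end. The numpy sketch below only illustrates that general shape with the default values; it is not taken from the repository, and the exact formula there may differ.

```python
# Illustrative warmup-plus-cosine schedule built from the default config values;
# src/lr_generator.py may implement the exact curve differently.
import math
import numpy as np

def cosine_lr(lr_init, lr_max, lr_end, warmup_epochs, total_epochs, steps_per_epoch):
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = total_epochs * steps_per_epoch
    lr = []
    for step in range(total_steps):
        if step < warmup_steps:
            # linear warmup from lr_init to lr_max
            lr.append(lr_init + (lr_max - lr_init) * (step + 1) / warmup_steps)
        else:
            # cosine decay from lr_max down to lr_end
            progress = (step - warmup_steps) / (total_steps - warmup_steps)
            lr.append(lr_end + 0.5 * (lr_max - lr_end) * (1 + math.cos(math.pi * progress)))
    return np.array(lr, dtype=np.float32)

# steps_per_epoch depends on the dataset size and batch size; 1251 is only a placeholder
schedule = cosine_lr(0.00004, 0.4, 0.000004, warmup_epochs=1,
                     total_epochs=250, steps_per_epoch=1251)
```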
### Evaluation

#### Usage

```
# Evaluation
sh run_eval_for_gpu.sh DEVICE_ID DATA_DIR PATH_CHECKPOINT
```

#### Launch

```
# Evaluation with checkpoint
sh scripts/run_eval_for_gpu.sh 0 /dataset/val ./checkpoint/inceptionv3-rank3-247_1251.ckpt
```

The checkpoint is produced during the training process.
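Internally, eval.py restores the trained weights from the checkpoint before running inference on the validation set. The sketch below shows the usual MindSpore pattern built on load_checkpoint and load_param_into_net; the class name InceptionV3, the create_dataset helper, and the metric names are assumptions about this repository's code, not confirmed from it.

```python
# Hedged sketch of the usual MindSpore evaluation flow; the names imported from src/
# (InceptionV3, create_dataset, config) are assumptions about this repository.
from mindspore import context, nn
from mindspore.train.model import Model
from mindspore.train.serialization import load_checkpoint, load_param_into_net

from src.config import config
from src.dataset import create_dataset       # assumed helper in src/dataset.py
from src.inception_v3 import InceptionV3     # assumed class name in src/inception_v3.py

context.set_context(mode=context.GRAPH_MODE, device_target="GPU")

net = InceptionV3(num_classes=config.num_classes)
param_dict = load_checkpoint("./checkpoint/inceptionv3-rank3-247_1251.ckpt")
load_param_into_net(net, param_dict)          # copy trained weights into the network
net.set_train(False)

loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
model = Model(net, loss_fn=loss, metrics={"acc", "top_5_accuracy"})

dataset = create_dataset("/dataset/val", do_train=False, batch_size=config.batch_size)
print(model.eval(dataset))                    # e.g. {'acc': ..., 'top_5_accuracy': ...}
```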
#### Result

The evaluation result will be stored in the scripts path, where you can find results like the following in the log.

```
acc=78.75%(TOP1)
acc=94.07%(TOP5)
```