# Recommendation Model
## Overview
This is an implementation of Wide&Deep as described in the [Wide & Deep Learning for Recommender Systems](https://arxiv.org/pdf/1606.07792.pdf) paper.
The Wide&Deep model jointly trains a wide linear model and a deep neural network, combining the benefits of memorization and generalization for recommender systems.
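To make the joint wide + deep prediction concrete, here is a minimal NumPy sketch of the forward pass; all sizes, weights, and names below are illustrative assumptions, not the MindSpore implementation in src/wide_and_deep.py.
```
# Minimal NumPy sketch of the Wide & Deep forward pass (illustrative only,
# not the MindSpore implementation in src/wide_and_deep.py).
import numpy as np

batch_size, field_size, vocab_size, emb_dim = 4, 39, 20000, 8
rng = np.random.default_rng(0)

ids = rng.integers(0, vocab_size, size=(batch_size, field_size))   # sparse feature IDs
vals = np.ones((batch_size, field_size), dtype=np.float32)         # feature values

# Wide part: one scalar weight per ID, summed per example (a linear model).
wide_w = rng.normal(0.0, 0.01, size=vocab_size).astype(np.float32)
wide_logit = (wide_w[ids] * vals).sum(axis=1)                      # [batch]

# Deep part: embedding lookup, flatten, then a small MLP.
emb_table = rng.normal(0.0, 0.01, size=(vocab_size, emb_dim)).astype(np.float32)
deep_in = (emb_table[ids] * vals[..., None]).reshape(batch_size, -1)  # [batch, field*emb]
w1 = rng.normal(0.0, 0.01, size=(field_size * emb_dim, 32)).astype(np.float32)
w2 = rng.normal(0.0, 0.01, size=(32, 1)).astype(np.float32)
deep_logit = np.maximum(deep_in @ w1, 0.0) @ w2                    # ReLU MLP, [batch, 1]

# Joint prediction: the two logits are summed and passed through a sigmoid.
prob = 1.0 / (1.0 + np.exp(-(wide_logit + deep_logit.squeeze(-1))))
print(prob.shape)  # (4,)
```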
## Requirements
- Install [MindSpore](https://www.mindspore.cn/install/en).
- Download the dataset and convert it to the MindRecord format with the following command:
```
python src/preprocess_data.py --dense_dim=13 --slot_dim=26 --threshold=100 --train_line_count=45840617 --skip_id_convert=0
```
Arguments:
* `--data_type` {criteo,synthetic}: The dataset type. Currently the criteo dataset and the synthetic dataset are supported.
* `--data_path` : The path of the data files. (Default: ./criteo_data/)
* `--dense_dim` : The number of continuous fields.
* `--slot_dim` : The number of sparse fields, also called categorical features.
* `--threshold` : Feature values with a frequency below this value are mapped to OOV, which reduces the vocabulary size (see the sketch after this list).
* `--train_line_count` : The number of examples in your dataset.
* `--skip_id_convert` : 0 or 1. If set to 1, the ID conversion is skipped and the original ID is used as the final ID.
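To make the `--threshold` and `--skip_id_convert` behaviour concrete, here is a small, hypothetical sketch of frequency-based ID conversion; the function and variable names are illustrative and not taken from src/preprocess_data.py.
```
# Hypothetical sketch of threshold-based ID conversion (not the actual
# logic in src/preprocess_data.py).
from collections import Counter

def build_id_map(values, threshold):
    """Map each frequent value to its own ID; rare values share the OOV ID 0."""
    counts = Counter(values)
    id_map, next_id = {}, 1                      # 0 is reserved for OOV
    for value, count in counts.items():
        if count >= threshold:
            id_map[value] = next_id
            next_id += 1
    return id_map

def convert(values, id_map, skip_id_convert=False):
    if skip_id_convert:
        return list(values)                      # keep the original IDs as-is
    return [id_map.get(v, 0) for v in values]    # rare values collapse to OOV

raw = ["a", "b", "a", "c", "a", "b", "d"]
id_map = build_id_map(raw, threshold=2)          # only "a" and "b" survive
print(convert(raw, id_map))                      # [1, 2, 1, 0, 1, 2, 0]
```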
## Dataset
Commonly used benchmark datasets are used for model training and evaluation.
### Generate the synthetic data
The following command will generate 40 million lines of click data, in the format of "label\tdense_feature[0]\tdense_feature[1]...\tsparse_feature[0]\tsparse_feature[1]...".
```
mkdir -p syn_data/origin_data
python src/generate_synthetic_data.py --output_file=syn_data/origin_data/train.txt --number_examples=40000000 --dense_dim=13 --slot_dim=51 --vocabulary_size=2000000000 --random_slot_values=0
```
Arguments:
* `--output_file` : The output path of the generated file.
* `--label_dim` : The number of label categories.
* `--number_examples` : The number of rows in the generated file.
* `--dense_dim` : The number of continuous features.
* `--slot_dim` : The number of categorical features.
* `--vocabulary_size` : The vocabulary size of the total dataset.
* `--random_slot_values` : 0 or 1. If set to 1, the IDs are generated randomly. If set to 0, the ID is set to row_index mod part_size, where part_size is the vocabulary size of each slot (a small sketch after this list illustrates both modes).
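The following plain-Python sketch illustrates the generated line format and the two `--random_slot_values` modes; the helper name and the per-slot partitioning below are assumptions for illustration, not the code in src/generate_synthetic_data.py.
```
# Illustrative sketch of the synthetic line format and the two ID modes
# (not the actual src/generate_synthetic_data.py implementation).
import random

def make_line(row_index, dense_dim=13, slot_dim=51,
              vocabulary_size=2_000_000_000, random_slot_values=False):
    label = random.randint(0, 1)
    dense = [f"{random.random():.4f}" for _ in range(dense_dim)]
    part_size = vocabulary_size // slot_dim               # vocab budget per slot
    if random_slot_values:
        sparse = [str(random.randrange(part_size)) for _ in range(slot_dim)]
    else:
        sparse = [str(row_index % part_size)] * slot_dim  # deterministic IDs
    return "\t".join([str(label)] + dense + sparse)

for i in range(3):
    print(make_line(i))
```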
Preprocess the generated data:
```
python src/preprocess_data.py --data_path=./syn_data/ --data_type=synthetic --dense_dim=13 --slot_dim=51 --threshold=0 --train_line_count=40000000 --skip_id_convert=1
```
## Running Code
### Code Structure
The overall code structure is as follows:
```
|--- wide_and_deep/
    train_and_eval.py                    "Entrance of Wide&Deep model training and evaluation"
    eval.py                              "Entrance of Wide&Deep model evaluation"
    train.py                             "Entrance of Wide&Deep model training"
    train_and_eval_multinpu.py           "Entrance of Wide&Deep model data parallel training and evaluation"
    train_and_eval_auto_parallel.py      "Entrance of Wide&Deep model auto parallel training and evaluation"
    train_and_eval_parameter_server.py   "Entrance of Wide&Deep model parameter server training and evaluation"
    |--- src/                            "Source code directory"
        config.py                        "Parameter configuration"
        dataset.py                       "Dataset loader class"
        process_data.py                  "Process dataset"
        preprocess_data.py               "Preprocess dataset"
        wide_and_deep.py                 "Model structure"
        callbacks.py                     "Callback class for training and evaluation"
        generate_synthetic_data.py       "Generate the synthetic data for benchmark"
        metrics.py                       "Metric class"
    |--- script/                         "Shell scripts directory"
        run_multinpu_train.sh            "Run data parallel training"
        run_auto_parallel_train.sh       "Run auto parallel training"
        run_parameter_server_train.sh    "Run parameter server training"
```
### Train and evaluate model
To train and evaluate the model, run the following command:
```
python train_and_eval.py
```
Arguments:
* `--device_target` : Device where the code will be executed (Default: Ascend).
* `--data_path` : The location of the converted dataset; set it to the directory produced by the data preprocessing step.
* `--epochs` : Total number of training epochs.
* `--batch_size` : Training batch size.
* `--eval_batch_size` : Evaluation batch size.
* `--field_size` : The number of features.
* `--vocab_size` : The total vocabulary size of the dataset.
* `--emb_dim` : The dense embedding dimension of the sparse features.
* `--deep_layers_dim` : The dimensions of the deep layers (see the shape sketch after this list).
* `--deep_layers_act` : The activation function of the deep layers.
* `--dropout_flag` : Whether to apply dropout.
* `--keep_prob` : The keep probability of the dropout layer.
* `--ckpt_path` : The location of the checkpoint file.
* `--eval_file_name` : Eval output file.
* `--loss_file_name` : Loss output file.
* `--dataset_type` : The dataset format, one of tfrecord, mindrecord, or hd5.
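As a quick illustration of how the size-related arguments fit together, the sketch below walks through the tensor shapes of a single batch; the concrete numbers are assumptions, not the defaults in src/config.py.
```
# Hypothetical shape walk-through for one batch; numbers are illustrative,
# not the defaults in src/config.py.
batch_size, field_size, vocab_size, emb_dim = 16000, 39, 200000, 80
deep_layers_dim = [1024, 512, 256, 128]

ids_shape = (batch_size, field_size)              # sparse feature IDs per example
emb_shape = (batch_size, field_size, emb_dim)     # after embedding lookup
deep_in = field_size * emb_dim                    # flattened input width of the first deep layer
layer_widths = [deep_in] + deep_layers_dim + [1]  # deep MLP: 3120 -> 1024 -> ... -> 128 -> 1
print(ids_shape, emb_shape, layer_widths)
```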
To train the model on a single device, run the following command:
```
python train.py
```
Arguments:
* `--device_target` : Device where the code will be executed (Default: Ascend).
* `--data_path` : The location of the converted dataset; set it to the directory produced by the data preprocessing step.
* `--epochs` : Total number of training epochs.
* `--batch_size` : Training batch size.
* `--eval_batch_size` : Evaluation batch size.
* `--field_size` : The number of features.
* `--vocab_size` : The total vocabulary size of the dataset.
* `--emb_dim` : The dense embedding dimension of the sparse features.
* `--deep_layers_dim` : The dimensions of the deep layers.
* `--deep_layers_act` : The activation function of the deep layers.
* `--dropout_flag` : Whether to apply dropout.
* `--keep_prob` : The keep probability of the dropout layer.
* `--ckpt_path` : The location of the checkpoint file.
* `--eval_file_name` : Eval output file.
* `--loss_file_name` : Loss output file.
* `--dataset_type` : The dataset format, one of tfrecord, mindrecord, or hd5.
To train the model in distributed mode, run one of the following commands:
```
# configure environment path before training
bash run_multinpu_train.sh RANK_SIZE EPOCHS DATASET RANK_TABLE_FILE
```
```
# configure environment path before training
bash run_auto_parallel_train.sh RANK_SIZE EPOCHS DATASET RANK_TABLE_FILE
```
To train the model in clusters, run the following commands:
```
# deploy the wide&deep scripts to the clusters
# CLUSTER_CONFIG is a JSON file; a sample is provided in script/.
# EXECUTE_PATH is the path of the scripts after deployment.
bash deploy_cluster.sh CLUSTER_CONFIG_PATH EXECUTE_PATH
# enter EXECUTE_PATH and execute start_cluster.sh as follows.
# MODE: "host_device_mix"
bash start_cluster.sh CLUSTER_CONFIG_PATH EPOCH_SIZE VOCAB_SIZE EMB_DIM DATASET ENV_SH RANK_TABLE_FILE MODE
```
To train and evaluate the model in parameter server mode, run the following command:
```
# SERVER_NUM is the number of parameter servers for this task.
# SCHED_HOST is the IP address of the scheduler.
# SCHED_PORT is the port of the scheduler.
# The number of workers is the same as RANK_SIZE.
bash run_parameter_server_train.sh RANK_SIZE EPOCHS DATASET RANK_TABLE_FILE SERVER_NUM SCHED_HOST SCHED_PORT
```
To evaluate the model, run the following command:
```
python eval.py
```
Arguments:
* `--device_target` : Device where the code will be executed (Default: Ascend).
* `--data_path` : The location of the converted dataset; set it to the directory produced by the data preprocessing step.
* `--epochs` : Total number of training epochs.
* `--batch_size` : Training batch size.
* `--eval_batch_size` : Evaluation batch size.
* `--field_size` : The number of features.
* `--vocab_size` : The total vocabulary size of the dataset.
* `--emb_dim` : The dense embedding dimension of the sparse features.
* `--deep_layers_dim` : The dimensions of the deep layers.
* `--deep_layers_act` : The activation function of the deep layers.
* `--keep_prob` : The keep probability of the dropout layer.
* `--ckpt_path` : The location of the checkpoint file.
* `--eval_file_name` : Eval output file.
* `--loss_file_name` : Loss output file.
There are other arguments for the model and the training process. Use the `--help` or `-h` flag to get a full list of possible arguments with detailed descriptions.