# Recommendation Model
## Overview
This is an implementation of Wide&Deep as described in the [Wide & Deep Learning for Recommender Systems](https://arxiv.org/pdf/1606.07792.pdf) paper.
The Wide&Deep model jointly trains a wide linear model and a deep neural network, combining the benefits of memorization and generalization for recommender systems.
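To make the joint wide + deep prediction concrete, here is a minimal NumPy sketch of the forward pass; all sizes, weights, and names below are illustrative assumptions, not the MindSpore implementation in src/wide_and_deep.py.
```
# Minimal NumPy sketch of the Wide & Deep forward pass (illustrative only,
# not the MindSpore implementation in src/wide_and_deep.py).
import numpy as np

batch_size, field_size, vocab_size, emb_dim = 4, 39, 20000, 8
rng = np.random.default_rng(0)

ids = rng.integers(0, vocab_size, size=(batch_size, field_size))   # sparse feature IDs
vals = np.ones((batch_size, field_size), dtype=np.float32)         # feature values

# Wide part: one scalar weight per ID, summed per example (a linear model).
wide_w = rng.normal(0.0, 0.01, size=vocab_size).astype(np.float32)
wide_logit = (wide_w[ids] * vals).sum(axis=1)                      # [batch]

# Deep part: embedding lookup, flatten, then a small MLP.
emb_table = rng.normal(0.0, 0.01, size=(vocab_size, emb_dim)).astype(np.float32)
deep_in = (emb_table[ids] * vals[..., None]).reshape(batch_size, -1)  # [batch, field*emb]
w1 = rng.normal(0.0, 0.01, size=(field_size * emb_dim, 32)).astype(np.float32)
w2 = rng.normal(0.0, 0.01, size=(32, 1)).astype(np.float32)
deep_logit = np.maximum(deep_in @ w1, 0.0) @ w2                    # ReLU MLP, [batch, 1]

# Joint prediction: the two logits are summed and passed through a sigmoid.
prob = 1.0 / (1.0 + np.exp(-(wide_logit + deep_logit.squeeze(-1))))
print(prob.shape)  # (4,)
```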
## Requirements
- Install [MindSpore](https://www.mindspore.cn/install/en).
- Download the dataset and convert it to the MindRecord format with the following command:
```
python src/preprocess_data.py --dense_dim=13 --slot_dim=26 --threshold=100 --train_line_count=45840617 --skip_id_convert=0
```
Arguments:
* `--data_type` {criteo,synthetic}: The dataset type. Currently the criteo dataset and the synthetic dataset are supported.
* `--data_path` : The path of the data files. (Default: ./criteo_data/)
* `--dense_dim` : The number of continuous fields.
* `--slot_dim` : The number of sparse fields, also called categorical features.
* `--threshold` : Feature values with a frequency below this value are mapped to OOV, which reduces the vocabulary size (see the sketch after this list).
* `--train_line_count` : The number of examples in your dataset.
* `--skip_id_convert` : 0 or 1. If set to 1, the ID conversion is skipped and the original ID is used as the final ID.
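To make the `--threshold` and `--skip_id_convert` behaviour concrete, here is a small, hypothetical sketch of frequency-based ID conversion; the function and variable names are illustrative and not taken from src/preprocess_data.py.
```
# Hypothetical sketch of threshold-based ID conversion (not the actual
# logic in src/preprocess_data.py).
from collections import Counter

def build_id_map(values, threshold):
    """Map each frequent value to its own ID; rare values share the OOV ID 0."""
    counts = Counter(values)
    id_map, next_id = {}, 1                      # 0 is reserved for OOV
    for value, count in counts.items():
        if count >= threshold:
            id_map[value] = next_id
            next_id += 1
    return id_map

def convert(values, id_map, skip_id_convert=False):
    if skip_id_convert:
        return list(values)                      # keep the original IDs as-is
    return [id_map.get(v, 0) for v in values]    # rare values collapse to OOV

raw = ["a", "b", "a", "c", "a", "b", "d"]
id_map = build_id_map(raw, threshold=2)          # only "a" and "b" survive
print(convert(raw, id_map))                      # [1, 2, 1, 0, 1, 2, 0]
```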
## Dataset
Commonly used benchmark datasets are used for model training and evaluation.
### Generate the synthetic data
The following command will generate 40 million lines of click data, in the format of "label\tdense_feature[0]\tdense_feature[1]...\tsparse_feature[0]\tsparse_feature[1]...".
```
mkdir -p syn_data/origin_data
python src/generate_synthetic_data.py --output_file=syn_data/origin_data/train.txt --number_examples=40000000 --dense_dim=13 --slot_dim=51 --vocabulary_size=2000000000 --random_slot_values=0
```
Arguments:
* `--output_file` : The output path of the generated file.
* `--label_dim` : The number of label categories.
* `--number_examples` : The number of rows in the generated file.
* `--dense_dim` : The number of continuous features.
* `--slot_dim` : The number of categorical features.
* `--vocabulary_size` : The vocabulary size of the total dataset.
* `--random_slot_values` : 0 or 1. If set to 1, the IDs are generated randomly. If set to 0, the ID is set to row_index mod part_size, where part_size is the vocabulary size of each slot (a small sketch after this list illustrates both modes).
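The following plain-Python sketch illustrates the generated line format and the two `--random_slot_values` modes; the helper name and the per-slot partitioning below are assumptions for illustration, not the code in src/generate_synthetic_data.py.
```
# Illustrative sketch of the synthetic line format and the two ID modes
# (not the actual src/generate_synthetic_data.py implementation).
import random

def make_line(row_index, dense_dim=13, slot_dim=51,
              vocabulary_size=2_000_000_000, random_slot_values=False):
    label = random.randint(0, 1)
    dense = [f"{random.random():.4f}" for _ in range(dense_dim)]
    part_size = vocabulary_size // slot_dim               # vocab budget per slot
    if random_slot_values:
        sparse = [str(random.randrange(part_size)) for _ in range(slot_dim)]
    else:
        sparse = [str(row_index % part_size)] * slot_dim  # deterministic IDs
    return "\t".join([str(label)] + dense + sparse)

for i in range(3):
    print(make_line(i))
```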
Preprocess the generated data:
```
python src/preprocess_data.py --data_path=./syn_data/ --data_type=synthetic --dense_dim=13 --slot_dim=51 --threshold=0 --train_line_count=40000000 --skip_id_convert=1
```
## Running Code
### Code Structure
The overall code structure is as follows:
```
|--- wide_and_deep/
    train_and_eval.py                    "Entrance of Wide&Deep model training and evaluation"
    eval.py                              "Entrance of Wide&Deep model evaluation"
    train.py                             "Entrance of Wide&Deep model training"
    train_and_eval_multinpu.py           "Entrance of Wide&Deep model data parallel training and evaluation"
    train_and_eval_auto_parallel.py      "Entrance of Wide&Deep model auto parallel training and evaluation"
    train_and_eval_parameter_server.py   "Entrance of Wide&Deep model parameter server training and evaluation"
    |--- src/                            "Source code directory"
        config.py                        "Parameter configuration"
        dataset.py                       "Dataset loader class"
        process_data.py                  "Process dataset"
        preprocess_data.py               "Preprocess dataset"
        wide_and_deep.py                 "Model structure"
        callbacks.py                     "Callback class for training and evaluation"
        generate_synthetic_data.py       "Generate the synthetic data for benchmark"
        metrics.py                       "Metric class"
    |--- script/                         "Shell scripts directory"
        run_multinpu_train.sh            "Run data parallel training"
        run_auto_parallel_train.sh       "Run auto parallel training"
        run_parameter_server_train.sh    "Run parameter server training"
```
### Train and evaluate model
To train and evaluate the model, run the following command:
```
python train_and_eval.py
```
Arguments:
* `--device_target` : Device where the code will be executed (Default: Ascend).
* `--data_path` : The location of the converted dataset; set it to the directory produced by the data preprocessing step.
* `--epochs` : Total number of training epochs.
* `--batch_size` : Training batch size.
* `--eval_batch_size` : Evaluation batch size.
* `--field_size` : The number of features.
* `--vocab_size` : The total vocabulary size of the dataset.
* `--emb_dim` : The dense embedding dimension of the sparse features.
* `--deep_layers_dim` : The dimensions of the deep layers (see the shape sketch after this list).
* `--deep_layers_act` : The activation function of the deep layers.
* `--dropout_flag` : Whether to apply dropout.
* `--keep_prob` : The keep probability of the dropout layer.
* `--ckpt_path` : The location of the checkpoint file.
* `--eval_file_name` : Eval output file.
* `--loss_file_name` : Loss output file.
* `--dataset_type` : The dataset format, one of tfrecord, mindrecord, or hd5.
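As a quick illustration of how the size-related arguments fit together, the sketch below walks through the tensor shapes of a single batch; the concrete numbers are assumptions, not the defaults in src/config.py.
```
# Hypothetical shape walk-through for one batch; numbers are illustrative,
# not the defaults in src/config.py.
batch_size, field_size, vocab_size, emb_dim = 16000, 39, 200000, 80
deep_layers_dim = [1024, 512, 256, 128]

ids_shape = (batch_size, field_size)              # sparse feature IDs per example
emb_shape = (batch_size, field_size, emb_dim)     # after embedding lookup
deep_in = field_size * emb_dim                    # flattened input width of the first deep layer
layer_widths = [deep_in] + deep_layers_dim + [1]  # deep MLP: 3120 -> 1024 -> ... -> 128 -> 1
print(ids_shape, emb_shape, layer_widths)
```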
To train the model on a single device, run the following command:
```
python train.py
```
Arguments:
* `--device_target` : Device where the code will be executed (Default: Ascend).
* `--data_path` : The location of the converted dataset; set it to the directory produced by the data preprocessing step.
* `--epochs` : Total number of training epochs.
* `--batch_size` : Training batch size.
* `--eval_batch_size` : Evaluation batch size.
* `--field_size` : The number of features.
* `--vocab_size` : The total vocabulary size of the dataset.
* `--emb_dim` : The dense embedding dimension of the sparse features.
* `--deep_layers_dim` : The dimensions of the deep layers.
* `--deep_layers_act` : The activation function of the deep layers.
* `--dropout_flag` : Whether to apply dropout.
* `--keep_prob` : The keep probability of the dropout layer.
* `--ckpt_path` : The location of the checkpoint file.
* `--eval_file_name` : Eval output file.
* `--loss_file_name` : Loss output file.
* `--dataset_type` : The dataset format, one of tfrecord, mindrecord, or hd5.
To train the model in distributed mode, run one of the following commands:
```
# configure environment path before training
bash run_multinpu_train.sh RANK_SIZE EPOCHS DATASET RANK_TABLE_FILE
```
```
# configure environment path before training
bash run_auto_parallel_train.sh RANK_SIZE EPOCHS DATASET RANK_TABLE_FILE
```
To train the model in clusters, run the following commands:
```
# deploy the wide&deep scripts to the clusters
# CLUSTER_CONFIG is a JSON file; a sample is provided in script/.
# EXECUTE_PATH is the path of the scripts after deployment.
bash deploy_cluster.sh CLUSTER_CONFIG_PATH EXECUTE_PATH
# enter EXECUTE_PATH and execute start_cluster.sh as follows.
# MODE: "host_device_mix"
bash start_cluster.sh CLUSTER_CONFIG_PATH EPOCH_SIZE VOCAB_SIZE EMB_DIM DATASET ENV_SH RANK_TABLE_FILE MODE
```
To train and evaluate the model in parameter server mode, run the following command:
```
# SERVER_NUM is the number of parameter servers for this task.
# SCHED_HOST is the IP address of the scheduler.
# SCHED_PORT is the port of the scheduler.
# The number of workers is the same as RANK_SIZE.
bash run_parameter_server_train.sh RANK_SIZE EPOCHS DATASET RANK_TABLE_FILE SERVER_NUM SCHED_HOST SCHED_PORT
```
To evaluate the model, run the following command:
```
python eval.py
```
Arguments:
* `--device_target` : Device where the code will be executed (Default: Ascend).
* `--data_path` : The location of the converted dataset; set it to the directory produced by the data preprocessing step.
* `--epochs` : Total number of training epochs.
* `--batch_size` : Training batch size.
* `--eval_batch_size` : Evaluation batch size.
* `--field_size` : The number of features.
* `--vocab_size` : The total vocabulary size of the dataset.
* `--emb_dim` : The dense embedding dimension of the sparse features.
* `--deep_layers_dim` : The dimensions of the deep layers.
* `--deep_layers_act` : The activation function of the deep layers.
* `--keep_prob` : The keep probability of the dropout layer.
* `--ckpt_path` : The location of the checkpoint file.
* `--eval_file_name` : Eval output file.
* `--loss_file_name` : Loss output file.
There are other arguments for the model and the training process. Use the `--help` or `-h` flag to get a full list of possible arguments with detailed descriptions.