# Guideline to Convert zhwiki Training Data to MindRecord for BERT Pre-Training

## What does the example do
This example takes the zhwiki training data, generates MindRecord files from it, and finally uses them for BERT network training.

- run.sh: entry script that generates the MindRecord files.
- run_read.sh: entry script that creates a MindDataset from the MindRecord files.
- create_dataset.py: uses MindDataset to read the MindRecord files into a dataset (a minimal read sketch follows this list).
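
For reference, the read path comes down to creating a MindDataset over the generated files. Below is a minimal sketch of what create_dataset.py does; the `dataset_file` parameter name and the shard file name are assumptions that may differ between MindSpore versions, so see create_dataset.py for the actual implementation.

```python
import mindspore.dataset as ds

# Assumed path: the simple test below writes output/simple.mindrecord*;
# a single shard file (or a list of shard files) can be passed here.
data_set = ds.MindDataset(dataset_file="output/simple.mindrecord0")

print("Got {} samples".format(data_set.get_dataset_size()))
for item in data_set.create_dict_iterator():
    # each item is a dict mapping column names to per-sample arrays
    print(list(item.keys()))
    break
```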
## Run simple test

Follow these steps:

```
bash run_simple.sh       # generate output/simple.mindrecord* from ../../../third_party/to_mindrecord/zhwiki/sample_text.txt
bash run_read_simple.sh  # use MindDataset to read output/simple.mindrecord*
```
## How to use the example to process zhwiki

Download the zhwiki data, extract it, convert it to MindRecord, and then use MindDataset to read the MindRecord files.
### Download zhwiki training data

- Download address: https://dumps.wikimedia.org/zhwiki/ -> 20200401 -> zhwiki-20200401-pages-articles-multistream.xml.bz2
- Put the downloaded zhwiki-20200401-pages-articles-multistream.xml.bz2 in the {your-mindspore}/example/nlp_to_mindrecord/zhwiki/data directory.
### Extract the zhwiki

- Download the wikiextractor script to the {your-mindspore}/example/nlp_to_mindrecord/zhwiki/data directory.

    ```
    $ ls data/
    README.md  wikiextractor  zhwiki-20200401-pages-articles-multistream.xml.bz2
    ```

- Extract the zhwiki.

    ```
    python data/wikiextractor/WikiExtractor.py data/zhwiki-20200401-pages-articles-multistream.xml.bz2 --processes 4 --templates data/template --bytes 8M --min_text_length 0 --filter_disambig_pages --output data/extract
    ```

- The extracted output looks like this:

    ```
    $ ls data/extract
    AA  AB
    ```
### Generate MindRecord

- Run the run.sh script.

    ```
    bash run.sh
    ```

    Caution: this process may be slow, so please wait patiently. If you do not have a machine with enough memory and CPU, it is recommended that you modify the script to generate the MindRecord files step by step; a sketch of the underlying write step follows.
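
If you do split the work, note that the conversion ultimately writes records through `mindspore.mindrecord.FileWriter`. The following is a minimal sketch of writing one converted file at a time; the column names and array lengths follow the read output shown further below, while the output file name and the all-zero sample are placeholder assumptions, not the actual pipeline.

```python
import numpy as np
from mindspore.mindrecord import FileWriter

SEQ_LEN = 128          # sequence length seen in the read output below
MAX_PREDICTIONS = 20   # length of the masked_lm_* arrays in the output

# Schema mirrors the columns printed by run_read.sh.
schema = {
    "input_ids": {"type": "int64", "shape": [-1]},
    "input_mask": {"type": "int64", "shape": [-1]},
    "segment_ids": {"type": "int64", "shape": [-1]},
    "masked_lm_positions": {"type": "int64", "shape": [-1]},
    "masked_lm_ids": {"type": "int64", "shape": [-1]},
    "masked_lm_weights": {"type": "float32", "shape": [-1]},
    "next_sentence_labels": {"type": "int64", "shape": [-1]},
}

writer = FileWriter(file_name="output/AAwiki_00.mindrecord", shard_num=1)
writer.add_schema(schema, "zhwiki bert pretraining")

# Placeholder sample: in the real scripts these arrays come from the
# patched create_pretraining_data.py tokenization/masking pipeline.
sample = {
    "input_ids": np.zeros(SEQ_LEN, dtype=np.int64),
    "input_mask": np.zeros(SEQ_LEN, dtype=np.int64),
    "segment_ids": np.zeros(SEQ_LEN, dtype=np.int64),
    "masked_lm_positions": np.zeros(MAX_PREDICTIONS, dtype=np.int64),
    "masked_lm_ids": np.zeros(MAX_PREDICTIONS, dtype=np.int64),
    "masked_lm_weights": np.zeros(MAX_PREDICTIONS, dtype=np.float32),
    "next_sentence_labels": np.zeros(1, dtype=np.int64),
}
writer.write_raw_data([sample])
writer.commit()
```

Converting one ./data/extract file at a time keeps peak memory bounded, since each input file produces its own .mindrecord/.mindrecord.db pair.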
- The output looks like this:

    ```
    patching file create_pretraining_data_patched.py (read from create_pretraining_data.py)
    Begin preprocess input file: ./data/extract/AA/wiki_00
    Begin output file: AAwiki_00.mindrecord
    Total task: 5, processing: 1
    Begin preprocess input file: ./data/extract/AA/wiki_01
    Begin output file: AAwiki_01.mindrecord
    Total task: 5, processing: 2
    Begin preprocess input file: ./data/extract/AA/wiki_02
    Begin output file: AAwiki_02.mindrecord
    Total task: 5, processing: 3
    Begin preprocess input file: ./data/extract/AB/wiki_02
    Begin output file: ABwiki_02.mindrecord
    Total task: 5, processing: 4
    ...
    ```

- The generated files look like this:

    ```
    $ ls output/
    AAwiki_00.mindrecord  AAwiki_00.mindrecord.db
    AAwiki_01.mindrecord  AAwiki_01.mindrecord.db
    AAwiki_02.mindrecord  AAwiki_02.mindrecord.db
    ...
    ABwiki_00.mindrecord  ABwiki_00.mindrecord.db
    ...
    ```
### Create MindDataset By MindRecord

- Run the run_read.sh script.

    ```
    bash run_read.sh
    ```

- The output looks like this:

    ```
    ...
    example 74: input_ids: [ 101 8168 118 12847 8783 9977 15908 117 8256 9245 11643 8168 8847 8588 11575 8154 8228 143 8384 8376 9197 10241 103 10564 11421 8199 12268 112 161 8228 11541 9586 8436 8174 8363 9864 9702 103 103 119 103 9947 10564 103 8436 8806 11479 103 8912 119 103 103 103 12209 8303 103 8757 8824 117 8256 103 8619 8168 11541 102 11684 8196 103 8228 8847 11523 117 9059 9064 12410 8358 8181 10764 117 11167 11706 9920 148 8332 11390 8936 8205 10951 11997 103 8154 117 103 8670 10467 112 161 10951 13139 12413 117 10288 143 10425 8205 152 10795 8472 8196 103 161 12126 9172 13129 12106 8217 8174 12244 8205 143 103 8461 8277 10628 160 8221 119 102]
    example 74: input_mask: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
    example 74: segment_ids: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
    example 74: masked_lm_positions: [ 6 22 37 38 40 43 47 50 51 52 55 60 67 76 89 92 98 109 120 0]
    example 74: masked_lm_ids: [ 8118 8165 8329 8890 8554 8458 119 8850 8565 10392 8174 11467 10291 8181 8549 12718 13139 112 158 0]
    example 74: masked_lm_weights: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.]
    example 74: next_sentence_labels: [0]
    ...
    ```
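
As a quick sanity check on the masking, you can look up the masked positions inside input_ids: roughly 80% of them should hold the [MASK] token (id 103 in this vocabulary), since BERT keeps the original token or substitutes a random one for the rest. A minimal sketch, assuming the column names shown above; on newer MindSpore versions create_dict_iterator may return Tensors instead of numpy arrays (pass output_numpy=True there).

```python
import numpy as np
import mindspore.dataset as ds

data_set = ds.MindDataset(dataset_file="output/AAwiki_00.mindrecord",
                          columns_list=["input_ids", "masked_lm_positions",
                                        "masked_lm_weights"])

MASK_ID = 103  # [MASK] in the BERT vocabulary used here

for item in data_set.create_dict_iterator():
    positions = item["masked_lm_positions"]
    weights = item["masked_lm_weights"]
    real = positions[weights > 0]          # weight 0 marks padding slots
    hits = np.sum(item["input_ids"][real] == MASK_ID)
    print("positions replaced by [MASK]: {}/{}".format(hits, real.size))
    break
```

In example 74 above, 16 of the 19 real masked positions hold id 103, which matches the expected ratio.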