enhance: add example for zhwiki and CLUERNER2020 to mindrecord

pull/1560/head
jonyguo 5 years ago
parent 5306172fee
commit 56d03f9eb9

@ -0,0 +1,82 @@
# Guideline to Convert Training Data CLUERNER2020 to MindRecord For Bert Fine Tuning
<!-- TOC -->
- [What does the example do](#what-does-the-example-do)
- [How to use the example to process CLUERNER2020](#how-to-use-the-example-to-process-cluerner2020)
- [Download CLUERNER2020 and unzip](#download-cluerner2020-and-unzip)
- [Generate MindRecord](#generate-mindrecord)
- [Create MindDataset By MindRecord](#create-minddataset-by-mindrecord)
<!-- /TOC -->
## What does the example do
This example converts the [CLUERNER2020](https://www.cluebenchmarks.com/introduce.html) training data to MindRecord files, which are then used in the Bert fine-tuning process.
1. run.sh: entry script for generating the MindRecord files.
2. run_read.sh: entry script for creating a MindDataset from the MindRecord files.
    - create_dataset.py: uses MindDataset to read the MindRecord files and build the dataset.
## How to use the example to process CLUERNER2020
Download CLUERNER2020, convert it to MindRecord, and use MindDataset to read the MindRecord files.
### Download CLUERNER2020 and unzip
1. Download the training data zip.
> [CLUERNER2020 dataset download address](https://www.cluebenchmarks.com/introduce.html) **-> 任务介绍 (task introduction) -> CLUENER 细粒度命名实体识别 (CLUENER fine-grained NER) -> cluener下载链接 (cluener download link)**
2. Unzip the training data to the directory {your-mindspore}/example/nlp_to_mindrecord/CLUERNER2020/data/cluener_public (the expected contents are shown after the command).
```bash
unzip -d {your-mindspore}/example/nlp_to_mindrecord/CLUERNER2020/data/cluener_public cluener_public.zip
```
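After unzipping, the data directory should contain at least the train.json and dev.json files read by the conversion script; any other files in the archive are not used and are omitted from the listing below:
```bash
$ ls data/cluener_public/
dev.json  train.json  ...
```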
### Generate MindRecord
1. Run the run.sh script.
```bash
bash run.sh
```
2. The output looks like this:
```
...
[INFO] ME(17603:139620983514944,MainProcess):2020-04-28-16:56:12.498.235 [mindspore/mindrecord/filewriter.py:313] The list of mindrecord files created are: ['data/train.mindrecord'], and the list of index files are: ['data/train.mindrecord.db']
...
[INFO] ME(17603,python):2020-04-28-16:56:13.400.175 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:667] WriteRawData] Write 1 records successfully.
[INFO] ME(17603,python):2020-04-28-16:56:13.400.863 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:667] WriteRawData] Write 1 records successfully.
[INFO] ME(17603,python):2020-04-28-16:56:13.401.534 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:667] WriteRawData] Write 1 records successfully.
[INFO] ME(17603,python):2020-04-28-16:56:13.402.179 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:667] WriteRawData] Write 1 records successfully.
[INFO] ME(17603,python):2020-04-28-16:56:13.402.702 [mindspore/ccsrc/mindrecord/io/shard_writer.cc:667] WriteRawData] Write 1 records successfully.
...
[INFO] ME(17603:139620983514944,MainProcess):2020-04-28-16:56:13.431.208 [mindspore/mindrecord/filewriter.py:313] The list of mindrecord files created are: ['data/dev.mindrecord'], and the list of index files are: ['data/dev.mindrecord.db']
```
3. The following files are generated:
```bash
$ ls output/
dev.mindrecord dev.mindrecord.db README.md train.mindrecord train.mindrecord.db
```
### Create MindDataset By MindRecord
1. Run the run_read.sh script.
```bash
bash run_read.sh
```
2. The output looks like this:
```
...
example 1340: input_ids: [ 101 3173 1290 4852 7676 3949 122 3299 123 126 3189 4510 8020 6381 5442 7357 2590 3636 8021 7676 3949 4294 1166 6121 3124 1277 6121 3124 7270 2135 3295 5789 3326 123 126 3189 1355 6134 1093 1325 3173 2399 6590 6791 8024 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
example 1340: input_mask: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
example 1340: segment_ids: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
example 1340: label_ids: [ 0 18 19 20 2 4 0 0 0 0 0 0 0 34 36 26 27 28 0 34 35 35 35 35 35 35 35 35 35 36 26 27 28 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
example 1341: input_ids: [ 101 1728 711 4293 3868 1168 2190 2150 3791 934 3633 3428 4638 6237 7025 8024 3297 1400 5310 3362 6206 5023 5401 1744 3297 7770 3791 7368 976 1139 1104 2137 511 102 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
example 1341: input_mask: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
example 1341: segment_ids: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
example 1341: label_ids: [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 19 19 19 19 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
...
```
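The label_ids above are indices into label2id.json, which ships with the conversion scripts. As an illustration only (not part of the original example), a small sketch that maps them back to tag names might look like the following; the relative path to label2id.json is an assumption and depends on where the script is run:
```python
import json

# Illustration only: invert label2id.json and decode a list of label ids back to tag names.
def decode_label_ids(label_ids, label2id_file="../../../third_party/to_mindrecord/CLUERNER2020/label2id.json"):
    with open(label2id_file) as f:
        id2label = {v: k for k, v in json.load(f).items()}
    return [id2label[int(i)] for i in label_ids]

# The first few ids of "example 1340" above decode to government and address tags:
print(decode_label_ids([0, 18, 19, 20, 2, 4]))
# ['O', 'B_government', 'M_government', 'E_government', 'B_address', 'E_address']
```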

@ -0,0 +1,36 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""create MindDataset by MindRecord"""
import mindspore.dataset as ds


def create_dataset(data_file):
    """create MindDataset"""
    num_readers = 4
    data_set = ds.MindDataset(dataset_file=data_file, num_parallel_workers=num_readers, shuffle=True)
    index = 0
    for item in data_set.create_dict_iterator():
        # print("example {}: {}".format(index, item))
        print("example {}: input_ids: {}".format(index, item['input_ids']))
        print("example {}: input_mask: {}".format(index, item['input_mask']))
        print("example {}: segment_ids: {}".format(index, item['segment_ids']))
        print("example {}: label_ids: {}".format(index, item['label_ids']))
        index += 1
        if index % 1000 == 0:
            print("read rows: {}".format(index))
    print("total rows: {}".format(index))


if __name__ == '__main__':
    create_dataset('output/train.mindrecord')
    create_dataset('output/dev.mindrecord')

@ -0,0 +1,40 @@
#!/bin/bash
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
rm -f output/train.mindrecord*
rm -f output/dev.mindrecord*
if [ ! -d "../../../third_party/to_mindrecord/CLUERNER2020" ]; then
    echo "The patch base dir ../../../third_party/to_mindrecord/CLUERNER2020 does not exist."
    exit 1
fi

if [ ! -f "../../../third_party/patch/to_mindrecord/CLUERNER2020/data_processor_seq.patch" ]; then
    echo "The patch file ../../../third_party/patch/to_mindrecord/CLUERNER2020/data_processor_seq.patch does not exist."
    exit 1
fi

# patch for data_processor_seq.py
patch -p0 -d ../../../third_party/to_mindrecord/CLUERNER2020/ -o data_processor_seq_patched.py < ../../../third_party/patch/to_mindrecord/CLUERNER2020/data_processor_seq.patch
if [ $? -ne 0 ]; then
    echo "Applying patch to ../../../third_party/to_mindrecord/CLUERNER2020/data_processor_seq.py failed"
    exit 1
fi

# use the patched script to generate the MindRecord files
python ../../../third_party/to_mindrecord/CLUERNER2020/data_processor_seq_patched.py \
    --vocab_file=../../../third_party/to_mindrecord/CLUERNER2020/vocab.txt \
    --label2id_file=../../../third_party/to_mindrecord/CLUERNER2020/label2id.json

@ -0,0 +1,17 @@
#!/bin/bash
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
python create_dataset.py

@ -0,0 +1,113 @@
# Guideline to Convert Training Data zhwiki to MindRecord For Bert Pre Training
<!-- TOC -->
- [What does the example do](#what-does-the-example-do)
- [Run simple test](#run-simple-test)
- [How to use the example to process zhwiki](#how-to-use-the-example-to-process-zhwiki)
- [Download zhwiki training data](#download-zhwiki-training-data)
- [Extract the zhwiki](#extract-the-zhwiki)
- [Generate MindRecord](#generate-mindrecord)
- [Create MindDataset By MindRecord](#create-minddataset-by-mindrecord)
<!-- /TOC -->
## What does the example do
This example converts the [zhwiki](https://dumps.wikimedia.org/zhwiki) training data to MindRecord files, which are then used for Bert network pre-training.
1. run.sh: entry script for generating the MindRecord files.
2. run_read.sh: entry script for creating a MindDataset from the MindRecord files.
    - create_dataset.py: uses MindDataset to read the MindRecord files and build the dataset.
## Run simple test
Follow these steps:
```bash
bash run_simple.sh       # generate output/simple.mindrecord* from ../../../third_party/to_mindrecord/zhwiki/sample_text.txt
bash run_read_simple.sh  # use MindDataset to read output/simple.mindrecord*
```
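Because run_simple.sh writes with `--partition_number=4`, the simple test should produce four partitioned MindRecord files plus their index files (a rough expectation; the exact listing may differ), and run_read_simple.sh reads the first partition:
```bash
$ ls output/
simple.mindrecord0  simple.mindrecord0.db  simple.mindrecord1  simple.mindrecord1.db
simple.mindrecord2  simple.mindrecord2.db  simple.mindrecord3  simple.mindrecord3.db
```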
## How to use the example to process zhwiki
Download the zhwiki data, extract it, convert it to MindRecord, and use MindDataset to read the MindRecord files.
### Download zhwiki training data
> [zhwiki dataset download address](https://dumps.wikimedia.org/zhwiki) **-> 20200401 -> zhwiki-20200401-pages-articles-multistream.xml.bz2**
- Put zhwiki-20200401-pages-articles-multistream.xml.bz2 into the {your-mindspore}/example/nlp_to_mindrecord/zhwiki/data directory.
### Extract the zhwiki
1. Download the [wikiextractor](https://github.com/attardi/wikiextractor) script into the {your-mindspore}/example/nlp_to_mindrecord/zhwiki/data directory.
```
$ ls data/
README.md wikiextractor zhwiki-20200401-pages-articles-multistream.xml.bz2
```
2. Extract the zhwiki.
```bash
python data/wikiextractor/WikiExtractor.py data/zhwiki-20200401-pages-articles-multistream.xml.bz2 --processes 4 --templates data/template --bytes 8M --min_text_length 0 --filter_disambig_pages --output data/extract
```
3. The extracted data looks like this:
```
$ ls data/extract
AA AB
```
### Generate MindRecord
1. Run the run.sh script.
```bash
bash run.sh
```
> Caution: This process may be slow; please wait patiently. If your machine does not have enough memory and CPU cores, it is recommended that you modify the script so that the MindRecord files are generated step by step; a sketch for converting a single extracted file is shown after this list.
2. The output looks like this:
```
patching file create_pretraining_data_patched.py (read from create_pretraining_data.py)
Begin preprocess input file: ./data/extract/AA/wiki_00
Begin output file: AAwiki_00.mindrecord
Total task: 5, processing: 1
Begin preprocess input file: ./data/extract/AA/wiki_01
Begin output file: AAwiki_01.mindrecord
Total task: 5, processing: 2
Begin preprocess input file: ./data/extract/AA/wiki_02
Begin output file: AAwiki_02.mindrecord
Total task: 5, processing: 3
Begin preprocess input file: ./data/extract/AB/wiki_02
Begin output file: ABwiki_02.mindrecord
Total task: 5, processing: 4
...
```
3. The following files are generated:
```bash
$ ls output/
AAwiki_00.mindrecord AAwiki_00.mindrecord.db AAwiki_01.mindrecord AAwiki_01.mindrecord.db AAwiki_02.mindrecord AAwiki_02.mindrecord.db ... ABwiki_00.mindrecord ABwiki_00.mindrecord.db ...
```
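As mentioned in the caution above, a minimal sketch for converting a single extracted file at a time could look like this; the input file data/extract/AA/wiki_00 is just an example, and the flags mirror the ones used in run.sh:
```bash
# apply the patch once (run.sh does the same), then convert one file
patch -p0 -d ../../../third_party/to_mindrecord/zhwiki/ -o create_pretraining_data_patched.py \
    < ../../../third_party/patch/to_mindrecord/zhwiki/create_pretraining_data.patch
python ../../../third_party/to_mindrecord/zhwiki/create_pretraining_data_patched.py \
    --input_file=data/extract/AA/wiki_00 \
    --output_file=output/AAwiki_00.mindrecord \
    --partition_number=1 \
    --vocab_file=../../../third_party/to_mindrecord/zhwiki/vocab.txt \
    --do_lower_case=True \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --masked_lm_prob=0.15 \
    --random_seed=12345 \
    --dupe_factor=5
```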
### Create MindDataset By MindRecord
1. Run the run_read.sh script.
```bash
bash run_read.sh
```
2. The output looks like this:
```
...
example 74: input_ids: [ 101 8168 118 12847 8783 9977 15908 117 8256 9245 11643 8168 8847 8588 11575 8154 8228 143 8384 8376 9197 10241 103 10564 11421 8199 12268 112 161 8228 11541 9586 8436 8174 8363 9864 9702 103 103 119 103 9947 10564 103 8436 8806 11479 103 8912 119 103 103 103 12209 8303 103 8757 8824 117 8256 103 8619 8168 11541 102 11684 8196 103 8228 8847 11523 117 9059 9064 12410 8358 8181 10764 117 11167 11706 9920 148 8332 11390 8936 8205 10951 11997 103 8154 117 103 8670 10467 112 161 10951 13139 12413 117 10288 143 10425 8205 152 10795 8472 8196 103 161 12126 9172 13129 12106 8217 8174 12244 8205 143 103 8461 8277 10628 160 8221 119 102]
example 74: input_mask: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
example 74: segment_ids: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
example 74: masked_lm_positions: [ 6 22 37 38 40 43 47 50 51 52 55 60 67 76 89 92 98 109 120 0]
example 74: masked_lm_ids: [ 8118 8165 8329 8890 8554 8458 119 8850 8565 10392 8174 11467 10291 8181 8549 12718 13139 112 158 0]
example 74: masked_lm_weights: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.]
example 74: next_sentence_labels: [0]
...
```
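The reader above only prints the records; for actual pre-training they would normally be selected by column and batched. A minimal sketch of that step, assuming a single hypothetical input file and a batch size of 32 (neither is part of this example):
```python
import mindspore.dataset as ds

# build the dataset from one or more MindRecord files and batch it for training
data_set = ds.MindDataset(dataset_file=["output/AAwiki_00.mindrecord"],
                          columns_list=["input_ids", "input_mask", "segment_ids",
                                        "masked_lm_positions", "masked_lm_ids",
                                        "masked_lm_weights", "next_sentence_labels"],
                          num_parallel_workers=4, shuffle=True)
data_set = data_set.batch(32, drop_remainder=True)
print("batches per epoch:", data_set.get_dataset_size())
```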

@ -0,0 +1,43 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""create MindDataset by MindRecord"""
import argparse
import mindspore.dataset as ds


def create_dataset(data_file):
    """create MindDataset"""
    num_readers = 4
    data_set = ds.MindDataset(dataset_file=data_file, num_parallel_workers=num_readers, shuffle=True)
    index = 0
    for item in data_set.create_dict_iterator():
        # print("example {}: {}".format(index, item))
        print("example {}: input_ids: {}".format(index, item['input_ids']))
        print("example {}: input_mask: {}".format(index, item['input_mask']))
        print("example {}: segment_ids: {}".format(index, item['segment_ids']))
        print("example {}: masked_lm_positions: {}".format(index, item['masked_lm_positions']))
        print("example {}: masked_lm_ids: {}".format(index, item['masked_lm_ids']))
        print("example {}: masked_lm_weights: {}".format(index, item['masked_lm_weights']))
        print("example {}: next_sentence_labels: {}".format(index, item['next_sentence_labels']))
        index += 1
        if index % 1000 == 0:
            print("read rows: {}".format(index))
    print("total rows: {}".format(index))


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--input_file", nargs='+', type=str, help='Input mindrecord file(s)')
    args = parser.parse_args()
    create_dataset(args.input_file)

@ -0,0 +1,3 @@
wikiextractor/
zhwiki-20200401-pages-articles-multistream.xml.bz2
extract/

@ -0,0 +1,112 @@
#!/bin/bash
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
rm -f output/*.mindrecord*
data_dir="./data/extract"
file_list=()
output_filename=()
file_index=0

# collect all the extracted files and derive an output mindrecord name for each one
function getdir() {
    elements=`ls $1`
    for element in ${elements[*]};
    do
        dir_or_file=$1"/"$element
        if [ -d $dir_or_file ];
        then
            getdir $dir_or_file
        else
            file_list[$file_index]=$dir_or_file
            echo "${dir_or_file}" | tr '/' '\n' > dir_file_list.txt  # split the path into one component per line for mapfile
            mapfile parent_dir < dir_file_list.txt
            rm dir_file_list.txt >/dev/null 2>&1
            tmp_output_filename=${parent_dir[${#parent_dir[@]}-2]}${parent_dir[${#parent_dir[@]}-1]}".mindrecord"
            output_filename[$file_index]=`echo ${tmp_output_filename} | sed 's/ //g'`
            file_index=`expr $file_index + 1`
        fi
    done
}

getdir "${data_dir}"
# echo "The input files: "${file_list[@]}
# echo "The output files: "${output_filename[@]}

if [ ! -d "../../../third_party/to_mindrecord/zhwiki" ]; then
    echo "The patch base dir ../../../third_party/to_mindrecord/zhwiki does not exist."
    exit 1
fi

if [ ! -f "../../../third_party/patch/to_mindrecord/zhwiki/create_pretraining_data.patch" ]; then
    echo "The patch file ../../../third_party/patch/to_mindrecord/zhwiki/create_pretraining_data.patch does not exist."
    exit 1
fi

# patch for create_pretraining_data.py
patch -p0 -d ../../../third_party/to_mindrecord/zhwiki/ -o create_pretraining_data_patched.py < ../../../third_party/patch/to_mindrecord/zhwiki/create_pretraining_data.patch
if [ $? -ne 0 ]; then
    echo "Applying patch to ../../../third_party/to_mindrecord/zhwiki/create_pretraining_data.py failed"
    exit 1
fi

# get the cpu core count and limit the concurrent conversion jobs to 2/3 of it
num_cpu_core=`cat /proc/cpuinfo | grep "processor" | wc -l`
available_core_size=`expr $num_cpu_core / 3 \* 2`

echo "Begin preprocess `date`"

# use the patched script to generate mindrecord, one background job per input file
file_list_len=`expr ${#file_list[*]} - 1`
for index in $(seq 0 $file_list_len); do
    echo "Begin preprocess input file: ${file_list[$index]}"
    echo "Begin output file: ${output_filename[$index]}"
    python ../../../third_party/to_mindrecord/zhwiki/create_pretraining_data_patched.py \
        --input_file=${file_list[$index]} \
        --output_file=output/${output_filename[$index]} \
        --partition_number=1 \
        --vocab_file=../../../third_party/to_mindrecord/zhwiki/vocab.txt \
        --do_lower_case=True \
        --max_seq_length=128 \
        --max_predictions_per_seq=20 \
        --masked_lm_prob=0.15 \
        --random_seed=12345 \
        --dupe_factor=5 >/tmp/${output_filename[$index]}.log 2>&1 &
    process_count=`ps -ef | grep create_pretraining_data_patched | grep -v grep | wc -l`
    echo "Total task: ${file_list_len}, processing: ${process_count}"
    if [ $process_count -ge $available_core_size ]; then
        # wait until at least one of the running conversion jobs finishes
        while [ 1 ]; do
            process_num=`ps -ef | grep create_pretraining_data_patched | grep -v grep | wc -l`
            if [ $process_count -gt $process_num ]; then
                process_count=$process_num
                break;
            fi
            sleep 2
        done
    fi
done

# wait for all remaining conversion jobs to finish
process_num=`ps -ef | grep create_pretraining_data_patched | grep -v grep | wc -l`
while [ 1 ]; do
    if [ $process_num -eq 0 ]; then
        break;
    fi
    echo "There are still ${process_num} preprocess jobs running ..."
    sleep 2
    process_num=`ps -ef | grep create_pretraining_data_patched | grep -v grep | wc -l`
done

echo "Preprocessed all the data successfully."
echo "End preprocess `date`"

@ -0,0 +1,34 @@
#!/bin/bash
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
file_list=()
file_index=0

# get all the mindrecord files from the output dir
function getdir() {
    elements=`ls $1/[A-Z]*.mindrecord`
    for element in ${elements[*]};
    do
        file_list[$file_index]=$element
        file_index=`expr $file_index + 1`
    done
}

getdir "./output"
echo "Get all the mindrecord files: "${file_list[*]}

# create dataset for train
python create_dataset.py --input_file ${file_list[*]}

@ -0,0 +1,18 @@
#!/bin/bash
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
# create dataset for train
python create_dataset.py --input_file=output/simple.mindrecord0

@ -0,0 +1,47 @@
#!/bin/bash
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
rm -f output/simple.mindrecord*
if [ ! -d "../../../third_party/to_mindrecord/zhwiki" ]; then
    echo "The patch base dir ../../../third_party/to_mindrecord/zhwiki does not exist."
    exit 1
fi

if [ ! -f "../../../third_party/patch/to_mindrecord/zhwiki/create_pretraining_data.patch" ]; then
    echo "The patch file ../../../third_party/patch/to_mindrecord/zhwiki/create_pretraining_data.patch does not exist."
    exit 1
fi

# patch for create_pretraining_data.py
patch -p0 -d ../../../third_party/to_mindrecord/zhwiki/ -o create_pretraining_data_patched.py < ../../../third_party/patch/to_mindrecord/zhwiki/create_pretraining_data.patch
if [ $? -ne 0 ]; then
    echo "Applying patch to ../../../third_party/to_mindrecord/zhwiki/create_pretraining_data.py failed"
    exit 1
fi

# use the patched script to generate the MindRecord files
python ../../../third_party/to_mindrecord/zhwiki/create_pretraining_data_patched.py \
    --input_file=../../../third_party/to_mindrecord/zhwiki/sample_text.txt \
    --output_file=output/simple.mindrecord \
    --partition_number=4 \
    --vocab_file=../../../third_party/to_mindrecord/zhwiki/vocab.txt \
    --do_lower_case=True \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --masked_lm_prob=0.15 \
    --random_seed=12345 \
    --dupe_factor=5

@ -0,0 +1 @@
## This file is a patch that changes the TFRecord generation part of data_processor_seq.py from [CLUEbenchmark/CLUENER2020](https://github.com/CLUEbenchmark/CLUENER2020/tree/master/tf_version) so that it generates MindRecord instead.

@ -0,0 +1,105 @@
--- data_processor_seq.py 2020-05-28 10:07:13.365947168 +0800
+++ data_processor_seq.py 2020-05-28 10:14:33.298177130 +0800
@@ -4,11 +4,18 @@
@author: Cong Yu
@time: 2019-12-07 17:03
"""
+import sys
+sys.path.append("../../../third_party/to_mindrecord/CLUERNER2020")
+
+import argparse
import json
import tokenization
import collections
-import tensorflow as tf
+import numpy as np
+from mindspore.mindrecord import FileWriter
+
+# pylint: skip-file
def _truncate_seq_pair(tokens_a, tokens_b, max_length):
"""Truncates a sequence pair in place to the maximum length."""
@@ -80,11 +87,18 @@ def process_one_example(tokenizer, label
return feature
-def prepare_tf_record_data(tokenizer, max_seq_len, label2id, path, out_path):
+def prepare_mindrecord_data(tokenizer, max_seq_len, label2id, path, out_path):
"""
- 生成训练数据, tf.record, 单标签分类模型, 随机打乱数据
+ 生成训练数据, *.mindrecord, 单标签分类模型, 随机打乱数据
"""
- writer = tf.python_io.TFRecordWriter(out_path)
+ writer = FileWriter(out_path)
+
+ data_schema = {"input_ids": {"type": "int64", "shape": [-1]},
+ "input_mask": {"type": "int64", "shape": [-1]},
+ "segment_ids": {"type": "int64", "shape": [-1]},
+ "label_ids": {"type": "int64", "shape": [-1]}}
+ writer.add_schema(data_schema, "CLUENER2020 schema")
+
example_count = 0
for line in open(path):
@@ -113,16 +127,12 @@ def prepare_tf_record_data(tokenizer, ma
feature = process_one_example(tokenizer, label2id, list(_["text"]), labels,
max_seq_len=max_seq_len)
- def create_int_feature(values):
- f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
- return f
-
features = collections.OrderedDict()
# 序列标注任务
- features["input_ids"] = create_int_feature(feature[0])
- features["input_mask"] = create_int_feature(feature[1])
- features["segment_ids"] = create_int_feature(feature[2])
- features["label_ids"] = create_int_feature(feature[3])
+ features["input_ids"] = np.asarray(feature[0])
+ features["input_mask"] = np.asarray(feature[1])
+ features["segment_ids"] = np.asarray(feature[2])
+ features["label_ids"] = np.asarray(feature[3])
if example_count < 5:
print("*** Example ***")
print(_["text"])
@@ -132,8 +142,7 @@ def prepare_tf_record_data(tokenizer, ma
print("segment_ids: %s" % " ".join([str(x) for x in feature[2]]))
print("label: %s " % " ".join([str(x) for x in feature[3]]))
- tf_example = tf.train.Example(features=tf.train.Features(feature=features))
- writer.write(tf_example.SerializeToString())
+ writer.write_raw_data([features])
example_count += 1
# if example_count == 20:
@@ -141,17 +150,22 @@ def prepare_tf_record_data(tokenizer, ma
if example_count % 3000 == 0:
print(example_count)
print("total example:", example_count)
- writer.close()
+ writer.commit()
if __name__ == "__main__":
- vocab_file = "./vocab.txt"
+ parser = argparse.ArgumentParser()
+ parser.add_argument("--vocab_file", type=str, required=True, help='The vocabulary file.')
+ parser.add_argument("--label2id_file", type=str, required=True, help='The label2id.json file.')
+ args = parser.parse_args()
+
+ vocab_file = args.vocab_file
tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file)
- label2id = json.loads(open("label2id.json").read())
+ label2id = json.loads(open(args.label2id_file).read())
max_seq_len = 64
- prepare_tf_record_data(tokenizer, max_seq_len, label2id, path="data/thuctc_train.json",
- out_path="data/train.tf_record")
- prepare_tf_record_data(tokenizer, max_seq_len, label2id, path="data/thuctc_valid.json",
- out_path="data/dev.tf_record")
+ prepare_mindrecord_data(tokenizer, max_seq_len, label2id, path="data/cluener_public/train.json",
+ out_path="output/train.mindrecord")
+ prepare_mindrecord_data(tokenizer, max_seq_len, label2id, path="data/cluener_public/dev.json",
+ out_path="output/dev.mindrecord")

@ -0,0 +1 @@
## This file is a patch that changes the TFRecord generation part of create_pretraining_data.py from [google-research/bert](https://github.com/google-research/bert) so that it generates MindRecord instead.

File diff suppressed because it is too large.

@ -0,0 +1 @@
data_processor_seq_patched.py

@ -0,0 +1 @@
## All the scripts here come from [CLUEbenchmark/CLUENER2020](https://github.com/CLUEbenchmark/CLUENER2020/tree/master/tf_version)

@ -0,0 +1,157 @@
#!/usr/bin/python
# coding:utf8
"""
@author: Cong Yu
@time: 2019-12-07 17:03
"""
import json
import tokenization
import collections
import tensorflow as tf


def _truncate_seq_pair(tokens_a, tokens_b, max_length):
    """Truncates a sequence pair in place to the maximum length."""
    # This is a simple heuristic which will always truncate the longer sequence
    # one token at a time. This makes more sense than truncating an equal percent
    # of tokens from each, since if one sequence is very short then each token
    # that's truncated likely contains more information than a longer sequence.
    while True:
        total_length = len(tokens_a) + len(tokens_b)
        if total_length <= max_length:
            break
        if len(tokens_a) > len(tokens_b):
            tokens_a.pop()
        else:
            tokens_b.pop()


def process_one_example(tokenizer, label2id, text, label, max_seq_len=128):
    # textlist = text.split(' ')
    # labellist = label.split(' ')
    textlist = list(text)
    labellist = list(label)
    tokens = []
    labels = []
    for i, word in enumerate(textlist):
        token = tokenizer.tokenize(word)
        tokens.extend(token)
        label_1 = labellist[i]
        for m in range(len(token)):
            if m == 0:
                labels.append(label_1)
            else:
                print("some unknown token...")
                labels.append(labels[0])
    # tokens = tokenizer.tokenize(example.text)  -2 的原因是因为序列需要加一个句首和句尾标志
    if len(tokens) >= max_seq_len - 1:
        tokens = tokens[0:(max_seq_len - 2)]
        labels = labels[0:(max_seq_len - 2)]
    ntokens = []
    segment_ids = []
    label_ids = []
    ntokens.append("[CLS]")  # 句子开始设置CLS 标志
    segment_ids.append(0)
    # [CLS] [SEP] 可以为 他们构建标签,或者 统一到某个标签,反正他们是不变的,基本不参加训练 即x-l 永远不变
    label_ids.append(0)  # label2id["[CLS]"]
    for i, token in enumerate(tokens):
        ntokens.append(token)
        segment_ids.append(0)
        label_ids.append(label2id[labels[i]])
    ntokens.append("[SEP]")
    segment_ids.append(0)
    # append("O") or append("[SEP]") not sure!
    label_ids.append(0)  # label2id["[SEP]"]
    input_ids = tokenizer.convert_tokens_to_ids(ntokens)
    input_mask = [1] * len(input_ids)
    while len(input_ids) < max_seq_len:
        input_ids.append(0)
        input_mask.append(0)
        segment_ids.append(0)
        label_ids.append(0)
        ntokens.append("**NULL**")
    assert len(input_ids) == max_seq_len
    assert len(input_mask) == max_seq_len
    assert len(segment_ids) == max_seq_len
    assert len(label_ids) == max_seq_len
    feature = (input_ids, input_mask, segment_ids, label_ids)
    return feature


def prepare_tf_record_data(tokenizer, max_seq_len, label2id, path, out_path):
    """
    生成训练数据 tf.record, 单标签分类模型, 随机打乱数据
    """
    writer = tf.python_io.TFRecordWriter(out_path)
    example_count = 0
    for line in open(path):
        if not line.strip():
            continue
        _ = json.loads(line.strip())
        len_ = len(_["text"])
        labels = ["O"] * len_
        for k, v in _["label"].items():
            for kk, vv in v.items():
                for vvv in vv:
                    span = vvv
                    s = span[0]
                    e = span[1] + 1
                    # print(s, e)
                    if e - s == 1:
                        labels[s] = "S_" + k
                    else:
                        labels[s] = "B_" + k
                        for i in range(s + 1, e - 1):
                            labels[i] = "M_" + k
                        labels[e - 1] = "E_" + k
        # print()
        # feature = process_one_example(tokenizer, label2id, row[column_name_x1], row[column_name_y],
        #                               max_seq_len=max_seq_len)
        feature = process_one_example(tokenizer, label2id, list(_["text"]), labels,
                                      max_seq_len=max_seq_len)

        def create_int_feature(values):
            f = tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))
            return f

        features = collections.OrderedDict()
        # 序列标注任务
        features["input_ids"] = create_int_feature(feature[0])
        features["input_mask"] = create_int_feature(feature[1])
        features["segment_ids"] = create_int_feature(feature[2])
        features["label_ids"] = create_int_feature(feature[3])
        if example_count < 5:
            print("*** Example ***")
            print(_["text"])
            print(_["label"])
            print("input_ids: %s" % " ".join([str(x) for x in feature[0]]))
            print("input_mask: %s" % " ".join([str(x) for x in feature[1]]))
            print("segment_ids: %s" % " ".join([str(x) for x in feature[2]]))
            print("label: %s " % " ".join([str(x) for x in feature[3]]))
        tf_example = tf.train.Example(features=tf.train.Features(feature=features))
        writer.write(tf_example.SerializeToString())
        example_count += 1
        # if example_count == 20:
        #     break
        if example_count % 3000 == 0:
            print(example_count)
    print("total example:", example_count)
    writer.close()


if __name__ == "__main__":
    vocab_file = "./vocab.txt"
    tokenizer = tokenization.FullTokenizer(vocab_file=vocab_file)
    label2id = json.loads(open("label2id.json").read())
    max_seq_len = 64
    prepare_tf_record_data(tokenizer, max_seq_len, label2id, path="data/thuctc_train.json",
                           out_path="data/train.tf_record")
    prepare_tf_record_data(tokenizer, max_seq_len, label2id, path="data/thuctc_valid.json",
                           out_path="data/dev.tf_record")

@ -0,0 +1,43 @@
{
"O": 0,
"S_address": 1,
"B_address": 2,
"M_address": 3,
"E_address": 4,
"S_book": 5,
"B_book": 6,
"M_book": 7,
"E_book": 8,
"S_company": 9,
"B_company": 10,
"M_company": 11,
"E_company": 12,
"S_game": 13,
"B_game": 14,
"M_game": 15,
"E_game": 16,
"S_government": 17,
"B_government": 18,
"M_government": 19,
"E_government": 20,
"S_movie": 21,
"B_movie": 22,
"M_movie": 23,
"E_movie": 24,
"S_name": 25,
"B_name": 26,
"M_name": 27,
"E_name": 28,
"S_organization": 29,
"B_organization": 30,
"M_organization": 31,
"E_organization": 32,
"S_position": 33,
"B_position": 34,
"M_position": 35,
"E_position": 36,
"S_scene": 37,
"B_scene": 38,
"M_scene": 39,
"E_scene": 40
}

File diff suppressed because it is too large.

Some files were not shown because too many files have changed in this diff.
