
Commit b6fb9aa

sharathts authored and szmigacz committed
[BERT][PyTorch]: add dgx1-16g and dgx2 specific pretraining instructions (NVIDIA#164)
* add dgx1-16g and dgx2 specific pretraining instructions
* fix typo in readme
1 parent 22f1221 commit b6fb9aa

File tree

1 file changed: +8 −4 lines changed


PyTorch/LanguageModeling/BERT/README.md

Lines changed: 8 additions & 4 deletions
@@ -229,7 +229,7 @@ To download, verify, extract the datasets, and create the shards in hdf5 format,
 BERT is designed to pre-train deep bidirectional representations for language representations. The following scripts are to replicate pretraining on Wikipedia+Book Corpus from this [paper](https://arxiv.org/pdf/1810.04805.pdf). These scripts are general and can be used for pre-training language representations on any corpus of choice.
 
 From within the container, you can use the following script to run pre-training.
-`bash scripts/run_pretraining.sh 14 0.875e-4 fp16 16 0.01 1142857 2000 false true`
+`bash scripts/run_pretraining.sh`
 
 More details can be found in Details/Training Process
 
@@ -466,7 +466,7 @@ The `create_pretraining_data.py` script takes in raw text and creates training i
 
 #### Multi-dataset
 
-This repository provides functionality to combine multiple datasets into a single dataset for pre-training on a diverse text corpus at the shard level in `. /workspace/bert/data/create_datasets_from_start.sh`.
+This repository provides functionality to combine multiple datasets into a single dataset for pre-training on a diverse text corpus at the shard level in `data/create_datasets_from_start.sh`.
 
 ### Training process
 

@@ -476,7 +476,7 @@ The training process consists of two steps: pre-training and fine-tuning.
 
 Pre-training is performed using the `run_pretraining.py` script along with parameters defined in the `scripts/run_pretraining.sh`.
 
-The `run_pretraining.sh` script runs a job on a single node that trains the BERT-large model from scratch using the Wikipedia and BookCorpus datasets as training data using LAMB optimizer. By default, the training script runs two phases of training:
+The `run_pretraining.sh` script runs a job on a single node that trains the BERT-large model from scratch using the Wikipedia and BookCorpus datasets as training data using LAMB optimizer. By default, the training script runs two phases of training with a hyperparameter recipe specific to 8 x V100 32G cards:
 
 Phase 1: (Maximum sequence length of 128)
 - Runs on 8 GPUs with training batch size of 64 per GPU
@@ -527,7 +527,11 @@ For example:
 
 `bash scripts/run_pretraining.sh`
 
-Trains BERT-large from scratch on a single DGX-1 32G using FP16 arithmetic. 90% of the training steps are done with sequence length 128 (phase1 of training) and 10% of the training steps are done with sequence length 512 (phase2 of training).
+Trains BERT-large from scratch on a DGX-1 32G using FP16 arithmetic. 90% of the training steps are done with sequence length 128 (phase1 of training) and 10% of the training steps are done with sequence length 512 (phase2 of training).
+
+In order to train on a DGX-1 16G, set `gradient_accumulation_steps` to `512` and `gradient_accumulation_steps_phase2` to `1024` in `scripts/run_pretraining.sh`
+
+In order to train on a DGX-2 32G, set `train_batch_size` to `4096`, `train_batch_size_phase2` to `2048`, `num_gpus` to `16`, `gradient_accumulation_steps` to `64` and `gradient_accumulation_steps_phase2` to `256` in `scripts/run_pretraining.sh`
 
 ##### Fine-tuning
 