

Steps to reproduce the datasets from the web

1) Build the container

docker build -t bert_tf2 .

2) Run the container interactively

nvidia-docker run -it --ipc=host bert_tf2
Optional: Mount data volumes (add any of these flags to the run command, replacing yourpath with a host directory)
- -v yourpath:/workspace/bert_tf2/data/wikipedia_corpus/download
- -v yourpath:/workspace/bert_tf2/data/wikipedia_corpus/extracted_articles
- -v yourpath:/workspace/bert_tf2/data/wikipedia_corpus/raw_data
- -v yourpath:/workspace/bert_tf2/data/wikipedia_corpus/intermediate_files
- -v yourpath:/workspace/bert_tf2/data/wikipedia_corpus/final_text_file_single
- -v yourpath:/workspace/bert_tf2/data/wikipedia_corpus/final_text_files_sharded
- -v yourpath:/workspace/bert_tf2/data/wikipedia_corpus/final_tfrecords_sharded
- -v yourpath:/workspace/bert_tf2/data/bookcorpus/download
- -v yourpath:/workspace/bert_tf2/data/bookcorpus/final_text_file_single
- -v yourpath:/workspace/bert_tf2/data/bookcorpus/final_text_files_sharded
- -v yourpath:/workspace/bert_tf2/data/bookcorpus/final_tfrecords_sharded
Optional: Select visible GPUs
- -e CUDA_VISIBLE_DEVICES=0
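
The optional flags above go between `run` and the image name, since Docker expects options before the image. The following is a minimal sketch that assembles such a command and prints it for inspection rather than running it. `HOST_DATA`, its default of `/tmp/bert_data`, and the choice of three mounts are assumptions made for illustration; only the container-side paths and the base command come from the steps above.

```shell
#!/bin/sh
# Sketch: build the step 2 run command with a few optional flags.
# HOST_DATA is a hypothetical host directory; point it at your own storage.
# Paths containing spaces would need extra quoting; this sketch assumes none.
HOST_DATA="${HOST_DATA:-/tmp/bert_data}"
CONTAINER_DATA="/workspace/bert_tf2/data"

RUN_CMD="nvidia-docker run -it --ipc=host"

# Mount a subset of the data directories listed above (any subset works).
for sub in \
  wikipedia_corpus/download \
  wikipedia_corpus/final_tfrecords_sharded \
  bookcorpus/download
do
  RUN_CMD="$RUN_CMD -v $HOST_DATA/$sub:$CONTAINER_DATA/$sub"
done

# Expose only GPU 0 inside the container, then name the image last.
RUN_CMD="$RUN_CMD -e CUDA_VISIBLE_DEVICES=0 bert_tf2"

# Print the assembled command so it can be reviewed and pasted into a shell.
echo "$RUN_CMD"
```

Mounting the download and output directories keeps the downloaded and processed data on the host, so it survives after the container exits.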

**Inside the container from here on**

3) Download pretrained weights (they contain the vocab files needed for preprocessing) and SQuAD

bash data/create_datasets_from_start.sh squad

4) "One-click" Wikipedia data download and prep (provides tfrecords)

bash data/create_datasets_from_start.sh pretrained wiki_only

5) "One-click" Wikipedia and BookCorpus data download and prep (provides tfrecords)

bash data/create_datasets_from_start.sh pretrained wiki_books
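
After a run finishes, a quick sanity check is to count the files in each corpus's `final_tfrecords_sharded` directory. The sketch below assumes the scripts write their output to the container paths named in the optional mount list, which this README implies but does not state. `DATA_ROOT` and `count_files` are hypothetical names introduced here, and the expected number of shards is not specified anywhere above, so the script only reports counts.

```shell
#!/bin/sh
# Sketch: report how many files each corpus produced.
# DATA_ROOT defaults to the container-side data path from the mount list;
# that the prep scripts write there is an assumption.
DATA_ROOT="${DATA_ROOT:-/workspace/bert_tf2/data}"

# Hypothetical helper: print the number of regular files under directory $1,
# or 0 if the directory does not exist.
count_files() {
  if [ -d "$1" ]; then
    find "$1" -type f | wc -l | tr -d ' '
  else
    echo 0
  fi
}

for corpus in wikipedia_corpus bookcorpus; do
  dir="$DATA_ROOT/$corpus/final_tfrecords_sharded"
  echo "$corpus: $(count_files "$dir") file(s) in $dir"
done
```

A count of 0 for `bookcorpus` is expected after the `wiki_only` variant, since that step does not prepare BookCorpus.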