BERT Natural Language Processing on Biowulf

BERT (Bidirectional Encoder Representations from Transformers) is a technique for Natural Language Processing (NLP) pre-training developed by Google.

These instructions provide a concrete example detailing the first steps to using BERT on Biowulf. They are derived from the BERT documentation on GitHub.

The NIH HPC staff provides this quickstart guide as a convenience and makes a best effort to keep it updated. However, deep learning development moves quickly, so users are encouraged to review the primary documentation published by the BERT developers.


Interactive job
Interactive jobs should be used for debugging, graphics, or applications that cannot be run as batch jobs.

Allocate an interactive session and run this example.
Sample session:

As of the time of writing this tutorial, BERT did not support TensorFlow >= 2, so the example begins by creating a custom conda environment with TensorFlow 1.15.

If you have not already done so, follow the NIH HPC instructions for installing and updating conda in your own space.
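
If conda is not yet installed in your space, the following is a minimal sketch of a typical installation; the installer URL and the /data/${USER}/conda prefix are assumptions here, so defer to the HPC conda documentation for the recommended procedure:

wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh -b -p /data/${USER}/conda   # -b: non-interactive, -p: install prefix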

[user@biowulf ~]$ sinteractive --ntasks=1 --cpus-per-task=8 --mem=50g --gres=gpu:k80:2,lscratch:10
salloc.exe: Pending job allocation 45496024
salloc.exe: job 45496024 queued and waiting for resources
salloc.exe: job 45496024 has been allocated resources
salloc.exe: Granted job allocation 45496024
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn4184 are ready for job
srun: error: x11: no local DISPLAY defined, skipping

[user@cn4184 ~]$ source /data/${USER}/conda/etc/profile.d/conda.sh

[user@cn4184 ~]$ conda create -n my-tensorflow python=3.7 tensorflow-gpu==1.15.0
Collecting package metadata (current_repodata.json): done
Solving environment: failed with repodata from current_repodata.json, will retry with next repodata source.
Collecting package metadata (repodata.json): done
Solving environment: done

## Package Plan ##

  environment location: /data/user/conda/envs/my-tensorflow

  added / updated specs:
    - python=3.7
    - tensorflow-gpu==1.15.0

The following NEW packages will be INSTALLED:
[...]
Proceed ([y]/n)? y

Preparing transaction: done
Verifying transaction: done
Executing transaction: done
#
# To activate this environment, use
#
#     $ conda activate my-tensorflow
#
# To deactivate an active environment, use
#
#     $ conda deactivate

[user@cn4184 ~]$ conda activate my-tensorflow

(my-tensorflow)[user@cn4184 ~]$ which python #ensure you are using your tensorflow installation
/data/user/conda/envs/my-tensorflow/bin/python
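
Optionally, confirm that this TensorFlow build can see the allocated GPUs before continuing; a quick sanity check that should print True if the GPUs are visible (device log output will vary):

(my-tensorflow)[user@cn4184 ~]$ python -c 'import tensorflow as tf; print(tf.test.is_gpu_available())'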

Now download the BERT GitHub repository, some sample data, and a pre-trained model, and set environment variables to point to these locations.

(my-tensorflow)[user@cn4184 ~]$ mkdir -pv /data/${USER}/bert
mkdir: created directory ‘/data/user/bert’

(my-tensorflow)[user@cn4184 ~]$ cd /data/${USER}/bert

(my-tensorflow)[user@cn4184 bert]$ git clone https://github.com/google-research/bert.git
Cloning into 'bert'...
remote: Enumerating objects: 336, done.
remote: Total 336 (delta 0), reused 0 (delta 0), pack-reused 336
Receiving objects: 100% (336/336), 291.41 KiB | 0 bytes/s, done.
Resolving deltas: 100% (184/184), done.

(my-tensorflow)[user@cn4184 bert]$ cd bert/ && git checkout cc7051dc && cd .. # ensure a working starting point for tutorial
Note: checking out 'cc7051dc'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b new_branch_name

HEAD is now at cc7051d... Updating XNLI paths

(my-tensorflow)[user@cn4184 bert]$ wget https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py
--2020-01-03 17:49:54--  https://gist.githubusercontent.com/W4ngatang/60c2bdb54d156a41194446737ce03e2e/raw/17b8dd0d724281ed7c3b2aeeda662b92809aadd5/download_glue_data.py
Resolving dtn06-e0 (dtn06-e0)... 10.1.200.242
Connecting to dtn06-e0 (dtn06-e0)|10.1.200.242|:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: 8225 (8.0K) [text/plain]
Saving to: ‘download_glue_data.py’

100%[============================================================>] 8,225       --.-K/s   in 0.001s

2020-01-03 17:49:54 (5.38 MB/s) - ‘download_glue_data.py’ saved [8225/8225]

(my-tensorflow)[user@cn4184 bert]$ python download_glue_data.py
Downloading and extracting CoLA...
        Completed!
Downloading and extracting SST...
        Completed!
Processing MRPC...
Local MRPC data not specified, downloading data from https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt
        Completed!
Downloading and extracting QQP...
        Completed!
Downloading and extracting STS...
        Completed!
Downloading and extracting MNLI...
        Completed!
Downloading and extracting SNLI...
        Completed!
Downloading and extracting QNLI...
        Completed!
Downloading and extracting RTE...
        Completed!
Downloading and extracting WNLI...
        Completed!
Downloading and extracting diagnostic...
        Completed!

(my-tensorflow)[user@cn4184 bert]$ export GLUE_DIR=/data/${USER}/bert/glue_data
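
Optionally, check that the MRPC task data used below is in place; you should see train.tsv, dev.tsv, and test.tsv among the files (the exact file list may vary with the version of the download script):

(my-tensorflow)[user@cn4184 bert]$ ls $GLUE_DIR/MRPC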


(my-tensorflow)[user@cn4184 bert]$ wget https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
--2020-01-03 17:52:59--  https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip
Resolving dtn06-e0 (dtn06-e0)... 10.1.200.242
Connecting to dtn06-e0 (dtn06-e0)|10.1.200.242|:3128... connected.
Proxy request sent, awaiting response... 200 OK
Length: 407727028 (389M) [application/zip]
Saving to: ‘uncased_L-12_H-768_A-12.zip’

100%[============================================================>] 407,727,028  120MB/s   in 3.2s

2020-01-03 17:53:03 (120 MB/s) - ‘uncased_L-12_H-768_A-12.zip’ saved [407727028/407727028]

(my-tensorflow)[user@cn4184 bert]$ unzip uncased_L-12_H-768_A-12.zip
Archive:  uncased_L-12_H-768_A-12.zip
   creating: uncased_L-12_H-768_A-12/
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.meta
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.data-00000-of-00001
  inflating: uncased_L-12_H-768_A-12/vocab.txt
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.index
  inflating: uncased_L-12_H-768_A-12/bert_config.json

(my-tensorflow)[user@cn4184 bert]$ export BERT_BASE_DIR=/data/${USER}/bert/uncased_L-12_H-768_A-12

Now we can use the run_classifier.py script from the BERT GitHub repo to fine-tune the uncased_L-12_H-768_A-12 model on the MRPC task from the GLUE data downloaded above. This step should take around 10 minutes to complete.
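
While the fine-tuning command below runs, GPU utilization can be watched from a second terminal connected to the same compute node (here cn4184), for example:

[user@cn4184 ~]$ watch -n 5 nvidia-smi   # press Ctrl-C to stop watching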

(my-tensorflow)[user@cn4184 bert]$ python bert/run_classifier.py \
    --task_name=MRPC \
    --do_train=true \
    --do_eval=true \
    --data_dir=$GLUE_DIR/MRPC \
    --vocab_file=$BERT_BASE_DIR/vocab.txt \
    --bert_config_file=$BERT_BASE_DIR/bert_config.json \
    --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
    --max_seq_length=128 --train_batch_size=32 \
    --learning_rate=2e-5 \
    --num_train_epochs=3.0 \
    --output_dir=/lscratch/${SLURM_JOB_ID}/mrpc_output
WARNING:tensorflow:From /gpfs/gsfs11/users/user/bert/bert/optimization.py:87: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

WARNING:tensorflow:From bert/run_classifier.py:981: The name tf.app.run is deprecated. Please use tf.compat.v1.app.run instead.

WARNING:tensorflow:From bert/run_classifier.py:784: The name tf.logging.set_verbosity is deprecated. Please use tf.compat.v1.logging.set_verbosity instead.
[...]
INFO:tensorflow:evaluation_loop marked as finished
I0103 18:07:13.907693 46912496418496 error_handling.py:101] evaluation_loop marked as finished
INFO:tensorflow:***** Eval results *****
I0103 18:07:13.907948 46912496418496 run_classifier.py:923] ***** Eval results *****
INFO:tensorflow:  eval_accuracy = 0.86764705
I0103 18:07:13.908066 46912496418496 run_classifier.py:925]   eval_accuracy = 0.86764705
INFO:tensorflow:  eval_loss = 0.38859132
I0103 18:07:13.908277 46912496418496 run_classifier.py:925]   eval_loss = 0.38859132
INFO:tensorflow:  global_step = 343
I0103 18:07:13.908396 46912496418496 run_classifier.py:925]   global_step = 343
INFO:tensorflow:  loss = 0.38859132
I0103 18:07:13.908495 46912496418496 run_classifier.py:925]   loss = 0.38859132

Now we can use the fine-tuned model to perform inference on some of the example data.

(my-tensorflow)[user@cn4184 bert]$ TRAINED_CLASSIFIER=/lscratch/${SLURM_JOB_ID}/mrpc_output

(my-tensorflow)[user@cn4184 bert]$ python bert/run_classifier.py \
    --task_name=MRPC \
    --do_predict=true \
    --data_dir=$GLUE_DIR/MRPC \
    --vocab_file=$BERT_BASE_DIR/vocab.txt \
    --bert_config_file=$BERT_BASE_DIR/bert_config.json \
    --init_checkpoint=$TRAINED_CLASSIFIER \
    --max_seq_length=128 \
    --output_dir=/lscratch/${SLURM_JOB_ID}/mrpc_output
[...]
INFO:tensorflow:Running local_init_op.
I0103 18:15:42.120980 46912496418496 session_manager.py:500] Running local_init_op.
INFO:tensorflow:Done running local_init_op.
I0103 18:15:42.180213 46912496418496 session_manager.py:502] Done running local_init_op.
2020-01-03 18:15:42.976214: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
INFO:tensorflow:prediction_loop marked as finished
I0103 18:16:15.129939 46912496418496 error_handling.py:101] prediction_loop marked as finished
INFO:tensorflow:prediction_loop marked as finished
I0103 18:16:15.130402 46912496418496 error_handling.py:101] prediction_loop marked as finished

(my-tensorflow)[user@cn4184 bert]$ tail /lscratch/${SLURM_JOB_ID}/mrpc_output/test_results.tsv
0.009501748     0.99049824
0.024339601     0.9756604
0.009656649     0.99034333
0.9432048       0.056795176
0.012551893     0.98744816
0.96603405      0.03396591
0.9437976       0.056202445
0.010656527     0.9893435
0.008318217     0.99168175
0.008777457     0.9912225

At this point, the fine-tuned model and the inference results are located in the local /lscratch/${SLURM_JOB_ID} directory. Don't forget to copy them to your space before exiting the job.
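
(Each line of test_results.tsv above holds one tab-separated probability per MRPC class for a single test example.) A minimal copy back to your data directory might look like the following; adjust the destination as needed:

(my-tensorflow)[user@cn4184 bert]$ cp -r /lscratch/${SLURM_JOB_ID}/mrpc_output /data/${USER}/bert/mrpc_output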

Batch job
Most jobs should be run as batch jobs.

Create a batch input file. For example:

[user@biowulf ~]$ cat >submit.sh <<'EOF'
#!/bin/bash
set -e
source /data/${USER}/conda/etc/profile.d/conda.sh
conda activate my-tensorflow
cd /data/${USER}/bert
export GLUE_DIR=/data/${USER}/bert/glue_data
export BERT_BASE_DIR=/data/${USER}/bert/uncased_L-12_H-768_A-12
python bert/run_classifier.py \
    --task_name=MRPC \
    --do_train=true \
    --do_eval=true \
    --data_dir=$GLUE_DIR/MRPC \
    --vocab_file=$BERT_BASE_DIR/vocab.txt \
    --bert_config_file=$BERT_BASE_DIR/bert_config.json \
    --init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
    --max_seq_length=128 --train_batch_size=32 \
    --learning_rate=2e-5 \
    --num_train_epochs=3.0 \
    --output_dir=/lscratch/${SLURM_JOB_ID}/mrpc_output
cp -r /lscratch/${SLURM_JOB_ID} /data/$USER/${SLURM_JOB_ID}-trained-model
EOF

Submit this job using the Slurm sbatch command.

[user@biowulf ~]$ sbatch --partition=gpu --ntasks=1 --cpus-per-task=8 --mem=50g --gres=gpu:k80:2,lscratch:10 submit.sh
45503181
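
Once submitted, the job can be monitored from the login node, for example with squeue and by following the Slurm output file (slurm-45503181.out is the default output file name for this job ID; adjust if you set --output):

[user@biowulf ~]$ squeue -u $USER
[user@biowulf ~]$ tail -f slurm-45503181.out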