The computations you'll use TensorFlow for, like training a massive deep neural network, can be complex and confusing. To make TensorFlow programs easier to understand, debug, and optimize, the TensorFlow developers included a suite of visualization tools called TensorBoard. You can use TensorBoard to visualize your TensorFlow graph, plot quantitative metrics about the execution of your graph, and show additional data, such as images, that pass through it.
This can be done by submitting either a batch or an interactive job; this guide demonstrates the latter. Note that the script that runs the deep learning training and testing must write TensorBoard summaries, as sketched below. For details, please see the TensorBoard main site.
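As an illustration only, here is a minimal sketch of what such instrumentation can look like. It uses the tf.keras TensorBoard callback rather than the low-level summary ops used by the example script further below; the model and the ./logs path are placeholders chosen for this sketch.

import tensorflow as tf

# Load MNIST and build a small classifier (placeholder model for illustration).
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# The TensorBoard callback writes scalar, histogram, and graph summaries
# under log_dir; point "tensorboard --logdir" at the same directory.
tb = tf.keras.callbacks.TensorBoard(log_dir='./logs', histogram_freq=1)
model.fit(x_train, y_train, epochs=5, callbacks=[tb])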
Allocate an interactive session and start a TensorFlow MNIST training run and a TensorBoard instance as shown below.
[user@biowulf]$ sinteractive --tunnel -c 8 --mem 30g --gres=gpu:k20x:1,lscratch:20
salloc.exe: Pending job allocation 26710013
salloc.exe: job 26710013 queued and waiting for resources
salloc.exe: job 26710013 has been allocated resources
salloc.exe: Granted job allocation 26710013
salloc.exe: Waiting for resource configuration
salloc.exe: Nodes cn3144 are ready for job

Created 1 generic SSH tunnel(s) from this compute node to biowulf for your use
at port numbers defined in the $PORTn ($PORT1, ...) environment variables.

Please create a SSH tunnel from your workstation to these ports on biowulf.
On Linux/MacOS, open a terminal and run:

    ssh -L 45000:localhost:45000 biowulf.nih.gov

For Windows instructions, see https://hpc.nih.gov/docs/tunneling

[user@cn3144]$ echo $PORT1
45000
[user@cn3144]$ cd /lscratch/$SLURM_JOB_ID
[user@cn3144]$ export TMPDIR=/lscratch/$SLURM_JOB_ID
[user@cn3144]$ module load python/3.6
[user@cn3144]$ # Clone the tensorflow github repo
[user@cn3144]$ git clone https://github.com/tensorflow/tensorflow.git
Cloning into 'tensorflow'...
remote: Enumerating objects: 16, done.
remote: Counting objects: 100% (16/16), done.
remote: Compressing objects: 100% (16/16), done.
remote: Total 531395 (delta 2), reused 11 (delta 0), pack-reused 531379
Receiving objects: 100% (531395/531395), 308.15 MiB | 36.91 MiB/s, done.
Resolving deltas: 100% (427138/427138), done.
[user@cn3144]$ # Start training the MNIST model with TensorBoard summaries, specifying
[user@cn3144]$ # where to write the log files. The script (mnist_with_summaries.py) is a
[user@cn3144]$ # modified MNIST script (original: mnist.py) instrumented for TensorBoard.
[user@cn3144]$ python tensorflow/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py --log_dir=./logs
/usr/local/Anaconda/envs/py3.6/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
WARNING:tensorflow:From tensorflow/tensorflow/examples/tutorials/mnist/mnist_with_summaries.py:41: read_data_sets (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
WARNING:tensorflow:From /usr/local/Anaconda/envs/py3.6/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:260: maybe_download (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Please write your own downloading logic.
WARNING:tensorflow:From /usr/local/Anaconda/envs/py3.6/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:262: extract_images (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting /tmp/tensorflow/mnist/input_data/train-images-idx3-ubyte.gz
WARNING:tensorflow:From /usr/local/Anaconda/envs/py3.6/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:267: extract_labels (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use tf.data to implement this functionality.
Extracting /tmp/tensorflow/mnist/input_data/train-labels-idx1-ubyte.gz
Extracting /tmp/tensorflow/mnist/input_data/t10k-images-idx3-ubyte.gz
Extracting /tmp/tensorflow/mnist/input_data/t10k-labels-idx1-ubyte.gz
WARNING:tensorflow:From /usr/local/Anaconda/envs/py3.6/lib/python3.6/site-packages/tensorflow/contrib/learn/python/learn/datasets/mnist.py:290: DataSet.__init__ (from tensorflow.contrib.learn.python.learn.datasets.mnist) is deprecated and will be removed in a future version.
Instructions for updating:
Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
2019-02-26 09:49:58.979015: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX
2019-02-26 09:49:59.695498: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla K20Xm major: 3 minor: 5 memoryClockRate(GHz): 0.732
pciBusID: 0000:27:00.0
totalMemory: 5.94GiB freeMemory: 5.87GiB
2019-02-26 09:49:59.695600: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-02-26 09:50:00.178254: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-02-26 09:50:00.178322: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988]      0
2019-02-26 09:50:00.178348: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0:   N
2019-02-26 09:50:00.178676: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5662 MB memory) -> physical GPU (device: 0, name: Tesla K20Xm, pci bus id: 0000:27:00.0, compute capability: 3.5)
Accuracy at step 0: 0.0926
Accuracy at step 10: 0.695
Accuracy at step 20: 0.8179
Accuracy at step 30: 0.8716
Accuracy at step 40: 0.8849
Accuracy at step 50: 0.8971
[...]
Accuracy at step 940: 0.9654
Accuracy at step 950: 0.9637
Accuracy at step 960: 0.9685
Accuracy at step 970: 0.9684
Accuracy at step 980: 0.968
Accuracy at step 990: 0.9689
Adding run metadata for 999

[user@cn3144]$ # $PORT1 was set by the --tunnel option of sinteractive
[user@cn3144]$ tensorboard --logdir=./logs --port=$PORT1 --host=localhost
/usr/local/Anaconda/envs/py3.6/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
TensorBoard 1.10.0 at http://cn3144:45000 (Press CTRL+C to quit)
The port must be unique to avoid clashing with other users. Keep this session open for as long as you're using TensorBoard. Note: in the unlikely event that this port is already in use on the compute node, select another random port, as sketched below.
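One way to find a free port is to ask the operating system for an unused ephemeral port by binding to port 0. This is a sketch using only the Python standard library; the script name is hypothetical and not part of the Biowulf tools.

# find_port.py -- hypothetical helper: bind to port 0 so the OS assigns
# an unused ephemeral port, then print the chosen port number.
import socket

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    s.bind(("localhost", 0))
    print(s.getsockname()[1])

Restart TensorBoard with --port set to the printed number, and use that same port when setting up the tunnel below.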
Connecting to your TensorBoard instance on a compute node with your local browser requires an SSH tunnel. How to set up the tunnel depends on your local workstation. If the port selected above is already in use on biowulf, you will get an error; select another random port, restart TensorBoard on that port, and try tunneling again.
From your local machine, do the following:
[user@workstation]$ ssh -L 45000:localhost:45000 biowulf.nih.gov
After entering your password, the tunnel will be established. The link localhost:45000 will now work if you paste it into your web browser.
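To confirm that the tunnel is forwarding traffic before opening a browser, you can run a quick check from the workstation. This is a sketch assuming port 45000 as above; the script name is hypothetical.

# check_tunnel.py -- hypothetical sanity check: an HTTP 200 response from
# localhost:45000 means the tunnel is reaching the TensorBoard instance.
import urllib.request

resp = urllib.request.urlopen("http://localhost:45000")
print(resp.status)  # expect 200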
For setting up a tunnel from your desktop to a compute node with PuTTY, please see https://hpc.nih.gov/docs/tunneling/
Open a web browser and use localhost:45000 as the address (modifying the port as needed).