Run multi-node parallel computing with containers - AI Benchmark
In this document, we will show you how to create containers and run multi-node parallel computing using TWNIA2 (HPC CLI).
Both TWSC Container Compute Service (Interactive Container and Scheduled Container) and High-performance Computing run containers optimized by NVIDIA NGC, including TensorFlow, PyTorch, and more. In this document, we will also use an NGC container as the example[1].
Starting with NGC 19.11[2], the container images have experimental support for Singularity containers, which are well suited to HPC environments. Users of TWNIA2 (HPC CLI) can therefore download NGC containers directly for use, or use NGC containers as base images for building and customizing their own environments[3].
We will use a Singularity container to wrap the packages needed for the computing jobs, and then write job scripts for the Slurm Workload Manager to request resources, arrange queue scheduling, and submit the jobs.
[1] NVIDIA provides many deep learning examples to improve accessibility.
[2] NGC version numbers follow a year.month format: NGC 20.09 is the version released in September 2020. Refer to the NGC Support Matrix to learn more about the differences between the AI framework and package versions in each NGC release.
[3] You may also download containers from Docker Hub or other container registries. For more information, please refer to Create TWNIA2 containers.
Step 1. Download pre-loaded or NGC containers
TWSC has pre-loaded some commonly used NGC containers, such as TensorFlow and PyTorch, and stores them in the /work/TWCC_cntr directory.
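To check which images are available, you can list the directory (a simple check; the TensorFlow and PyTorch image names used later in this document should appear here):
ls /work/TWCC_cntr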
If you want to use other versions or other containers, Singularity is already installed on TWNIA2; you may run the singularity pull command to download Singularity containers[4].
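For example, to download the TensorFlow NGC image used later in this document (the tag is only an example; choose the version you need from the NGC catalog):
singularity pull tensorflow_21.11-tf2-py3.sif docker://nvcr.io/nvidia/tensorflow:21.11-tf2-py3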
[4] You may also download containers from Docker Hub or other container registries. For more information, please refer to Create an HPC container.
Step 2. Install packages in the container (Optional)
Installing packages in a Singularity container requires sudo privileges, so you may build a custom container on your own server (a virtual machine is recommended, such as TWSC VCS), then upload it to TWNIA2 for use[5].
For example, use the PyTorch container as the base image and install the Horovod distributed training framework[6]:
- Create a container
sudo singularity build pytorch_20.09-py3_horovod.sif pytorch_20.09-py3_horovod.def
- Contents of pytorch_20.09-py3_horovod.def:
BootStrap: docker
From: nvcr.io/nvidia/pytorch:20.09-py3
Stage: build
%post
# source the environment scripts inherited from the Docker base image
. /.singularity.d/env/10-docker*.sh
# build Horovod with NCCL GPU operations, MPI support, and the PyTorch extension only
export HOROVOD_GPU=CUDA
export HOROVOD_GPU_OPERATIONS=NCCL
export HOROVOD_NCCL_LINK=SHARED
export HOROVOD_WITHOUT_GLOO=1
export HOROVOD_WITH_MPI=1
export HOROVOD_WITH_PYTORCH=1
export HOROVOD_WITHOUT_TENSORFLOW=1
export HOROVOD_WITHOUT_MXNET=1
export LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:$LD_LIBRARY_PATH
pip install --no-cache-dir horovod==0.20.3
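After the build finishes, a quick sanity check (a minimal sketch, run on the build machine) confirms that the Horovod PyTorch extension was compiled into the image:
singularity exec pytorch_20.09-py3_horovod.sif python -c "import horovod.torch; print('Horovod PyTorch OK')"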
[5] For more information about how to customize the container, please refer to Customize Singularity Containers.
[6] Among the containers from NGC, the TensorFlow container already includes Horovod, but the PyTorch container does not, so additional installation is required for PyTorch.
Step 3. Enable Mixed Precision (Optional)
The TensorFlow, PyTorch, and MXNet containers provided by NVIDIA can enable Automatic Mixed Precision, which helps improve computing speed. Please refer to NVIDIA Automatic Mixed Precision for Deep Learning if needed.
You may also refer to Enable TensorFlow Automatic Mixed Precision and run benchmarks using TWSC container service.
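For TensorFlow 1, for example, Automatic Mixed Precision can be turned on with a single environment variable before launching the job, as the TensorFlow 1 script in Step 4 does:
export TF_ENABLE_AUTO_MIXED_PRECISION=1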
Step 4. Write Slurm Job Scripts
When you have prepared the computing environment, please follow the steps below to write job scripts for the Slurm Workload Manager to request resources, arrange queue scheduling, and submit computing jobs.
The .sh file is the job script submitted through Slurm; its content is divided into two parts:
- Job, project, and resource information: job name, number of nodes, number of tasks per node, number of GPUs per node, maximum job running time, project ID, and queue name.
- The content of the job
Below are example Horovod benchmark scripts for you to edit and test (each requests 2 nodes with 8 tasks per node, so Horovod launches 16 GPU processes in total):
Run the following edit command > press the i key > copy and paste one of the examples > press the ESC key > type :wq! to save and exit. Done!
vim <FILE_NAME>.sh
You may write the .sh file with whatever editor you are used to; this example uses vim.
TensorFlow 1
#!/bin/bash
#SBATCH --job-name=Hello_twcc ## job name
#SBATCH --nodes=2 ## request 2 nodes
#SBATCH --ntasks-per-node=8 ## run 8 srun tasks per node
#SBATCH --cpus-per-task=4 ## request 4 CPUs per srun task
#SBATCH --gres=gpu:8 ## request 8 GPUs per node
#SBATCH --time=00:10:00 ## run for at most 10 minutes (change or delete this line after testing)
#SBATCH --account="PROJECT_ID" ## fill in your project ID (e.g. MST108XXX); usage is billed to this project
#SBATCH --partition=gtest ## gtest is the test queue; after testing, switch to gp1d (1-day limit), gp2d (2-day limit), or gp4d (4-day limit)
module purge
module load singularity
# network: use the InfiniBand device and enable GPUDirect RDMA
export UCX_NET_DEVICES=mlx5_0:1
export UCX_IB_GPU_DIRECT_RDMA=1
# TensorFlow 1 container pre-loaded by TWSC
SIF=/work/TWCC_cntr/tensorflow_21.11-tf1-py3.sif
SINGULARITY="singularity run --nv $SIF"
# tf 1 horovod benchmark script from
# wget https://raw.githubusercontent.com/horovod/horovod/v0.19.1/examples/tensorflow_synthetic_benchmark.py
HOROVOD="python tensorflow_synthetic_benchmark.py --batch-size 256"
# enable AUTO MIXED PRECISION
export TF_ENABLE_AUTO_MIXED_PRECISION=1
# enable NCCL log
export NCCL_DEBUG=INFO
srun $SINGULARITY $HOROVOD
TensorFlow 2
#!/bin/bash
#SBATCH --job-name=Hello_twcc ## job name
#SBATCH --nodes=2 ## request 2 nodes
#SBATCH --ntasks-per-node=8 ## run 8 srun tasks per node
#SBATCH --cpus-per-task=4 ## request 4 CPUs per srun task
#SBATCH --gres=gpu:8 ## request 8 GPUs per node
#SBATCH --time=00:10:00 ## run for at most 10 minutes (change or delete this line after testing)
#SBATCH --account="PROJECT_ID" ## fill in your project ID (e.g. MST108XXX); usage is billed to this project
#SBATCH --partition=gtest ## gtest is the test queue; after testing, switch to gp1d (1-day limit), gp2d (2-day limit), or gp4d (4-day limit)
module purge
module load singularity
# network: use the InfiniBand device and enable GPUDirect RDMA
export UCX_NET_DEVICES=mlx5_0:1
export UCX_IB_GPU_DIRECT_RDMA=1
# TensorFlow 2 container pre-loaded by TWSC
SIF=/work/TWCC_cntr/tensorflow_21.11-tf2-py3.sif
SINGULARITY="singularity run --nv $SIF"
# tf 2 horovod benchmark script from
# wget https://raw.githubusercontent.com/horovod/horovod/v0.19.5/examples/tensorflow2_synthetic_benchmark.py
HOROVOD="python tensorflow2_synthetic_benchmark.py --batch-size 256"
# enable NCCL log
export NCCL_DEBUG=INFO
srun $SINGULARITY $HOROVOD
PyTorch
#!/bin/bash
#SBATCH --job-name=Hello_twcc ## job name
#SBATCH --nodes=2 ## request 2 nodes
#SBATCH --ntasks-per-node=8 ## run 8 srun tasks per node
#SBATCH --cpus-per-task=4 ## request 4 CPUs per srun task
#SBATCH --gres=gpu:8 ## request 8 GPUs per node
#SBATCH --time=00:10:00 ## run for at most 10 minutes (change or delete this line after testing)
#SBATCH --account="PROJECT_ID" ## fill in your project ID (e.g. MST108XXX); usage is billed to this project
#SBATCH --partition=gtest ## gtest is the test queue; after testing, switch to gp1d (1-day limit), gp2d (2-day limit), or gp4d (4-day limit)
module purge
module load singularity
# network: use the InfiniBand device and enable GPUDirect RDMA
export UCX_NET_DEVICES=mlx5_0:1
export UCX_IB_GPU_DIRECT_RDMA=1
# PyTorch container with Horovod
SIF=/work/TWCC_cntr/pytorch_21.11-py3_horovod.sif
SINGULARITY="singularity run --nv $SIF"
# pytorch horovod benchmark script from
# wget https://raw.githubusercontent.com/horovod/horovod/v0.20.3/examples/pytorch/pytorch_synthetic_benchmark.py
HOROVOD="python pytorch_synthetic_benchmark.py --batch-size 256"
# enable NCCL log
export NCCL_DEBUG=INFO
srun $SINGULARITY $HOROVOD
- Add the following lines to the script header to receive email notifications about the job state:
#SBATCH --mail-type=ALL
#SBATCH --mail-user=$Your_email
- The amount of requested resources is based on the number of GPUs you request, allocated at a ratio of 1 GPU : 4 CPUs : 90 GB memory (see the header sketch after this list).
For example:
If you request 1 GPU, 4 CPU cores and 90 GB of memory are allocated automatically.
If you request 8 GPUs, 32 CPU cores and 720 GB of memory are allocated automatically.
- For more information about the queues, please refer to Usage instructions of queue and computing resources.
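As an illustration of the ratio, a hypothetical single-GPU test request would only need the following header lines; 4 CPU cores and 90 GB of memory then follow automatically from the one requested GPU:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --gres=gpu:1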
Step 5. Submit jobs
- Run the following command to submit the job. The system will arrange the queue schedule and the required resources for you, then start the computation when your turn comes.
sbatch <FILE_NAME>.sh
After submitting, the Job ID assigned by the system will be displayed.
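For example (the job ID shown is illustrative):
sbatch <FILE_NAME>.sh
Submitted batch job 123456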
Step 6. View and cancel jobs
- Once the job starts running, you can run the following command to view the log.
tail -f slurm_<JOB_ID>.out
For more commonly used commands, please refer to Slurm command:
- Use squeue -u $USER to view your running jobs.
- Use sacct -X to check today's jobs and their states, to confirm whether they are still running or have already finished.
- To cancel a running job, run the command:
scancel <JOB_ID>
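A typical check-and-cancel sequence might look like this (job ID, node names, and output are illustrative):
squeue -u $USER
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  123456     gtest Hello_tw  user123  R       2:35      2 gn[01-02]
scancel 123456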