Submit a parallel job with multi-nodes

In this article, you will learn how to register an account, log in to Taiwania2 (HPC CLI), create an interactive container, and submit cross-node parallel computing jobs using HPC Job.

info

Taiwania2 (HPC CLI)? HPC Job? Cross-node parallel computing?

  • Taiwania2 (HPC CLI) uses Slurm, a fault-tolerant and highly scalable cluster management and job scheduling system that suits Linux clusters of all sizes and supports MPI. Please refer to Service overview for more information.

  • A single TWSC HPC job host has 8 high-performance NVIDIA V100 GPUs. Taiwania2 (HPC CLI) allows high-speed parallel computing across nodes, so your computational work can use more than 8 GPUs. The system provides a large pool of GPUs to call on, and the high-throughput, low-latency InfiniBand network together with the efficient storage system design can cut development time several-fold. Please refer to Available computing resources to learn about the available computing and storage resources and other basic resource information.

Step 1. Prerequisites: accounts, projects, and credits

  1. Sign up for TWSC
  2. Once you have signed up, obtain an available project using one of the methods below:

Step 2. Log in to Taiwania2 (HPC CLI)

  1. Prepare your system account, reset your password, and obtain an OTP authentication code.
  2. Log in to HPC.

Step 3. Submit a cross-node compute job

Follow Run parallel computing with multi-nodes on containers - AI Benchmark to complete this step. You will learn how to create containers in Taiwania2 (HPC CLI), write setup scripts for compute jobs, submit jobs, view job status, and cancel jobs.
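
As a rough illustration of the kind of setup script this step refers to, the sketch below writes a minimal Slurm batch script for a job that spans more than one node. It is not the script from the linked tutorial: the file name, job name, project ID placeholder, queue name placeholder, and the `run_benchmark.sh` payload are all hypothetical, and the figure of 8 GPUs per node is taken from the description earlier in this article. The `#SBATCH` options used (`--job-name`, `--nodes`, `--ntasks-per-node`, `--gres`, `--account`, `--partition`) are standard Slurm options.

```shell
#!/usr/bin/env bash
# Sketch only: generates a Slurm batch script for a cross-node job.
# All values below are placeholders or assumptions, not TWSC defaults.
set -euo pipefail

TOTAL_GPUS=16                    # GPUs wanted for the job (assumption)
GPUS_PER_NODE=8                  # one HPC job host has 8 V100 GPUs
PROJECT_ID="<PROJECT_ID>"        # hypothetical: your TWSC project ID
QUEUE="<QUEUE_NAME>"             # hypothetical: see "Queues and compute resources"
JOB_SCRIPT="multi_node_job.sh"   # hypothetical file name

# Round up so that, for example, 12 GPUs still requests 2 nodes.
NODES=$(( (TOTAL_GPUS + GPUS_PER_NODE - 1) / GPUS_PER_NODE ))

# Write the batch script; ${...} values are filled in at generation time.
cat > "$JOB_SCRIPT" <<EOF
#!/bin/bash
#SBATCH --job-name=multi_node_demo
#SBATCH --nodes=${NODES}
#SBATCH --ntasks-per-node=${GPUS_PER_NODE}
#SBATCH --gres=gpu:${GPUS_PER_NODE}
#SBATCH --account=${PROJECT_ID}
#SBATCH --partition=${QUEUE}

# Launch one task per GPU across all allocated nodes.
# run_benchmark.sh is a hypothetical stand-in for your own workload.
srun ./run_benchmark.sh
EOF

echo "Wrote ${JOB_SCRIPT} requesting ${NODES} node(s)"
# On the login node, submit the generated script with: sbatch multi_node_job.sh
```

Note that this sketch always requests whole nodes (8 GPUs each), so a GPU count that is not a multiple of 8 is rounded up to the next full node. Replace the placeholders with your own project and queue before submitting.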

info
  • Try it out! This tutorial uses a benchmark script provided by Horovod as its example, so you can copy the content directly to create a working script.
  • Queues differ in their maximum job run time and in the number of jobs you can submit. Please refer to Queues and compute resources, choose a queue according to your needs, and modify the script settings accordingly.
  • Use the sacct -X command to check whether a job has completed (COMPLETED), been canceled (CANCELLED), or failed (FAILED). The system deducts credits based on the number of GPUs used and the total number of hours.
  • Use scancel <JOB_ID> to cancel the job and stop further billing.
  • Pricing information: see Price List
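
The status check and cancellation described above could look roughly like the following. `sacct -X`, `sacct -j`, `--format`, and `scancel` are standard Slurm commands and options. The job ID and the GPU and hour figures are made-up values, and the GPU-hour product only illustrates the quantity that billing is based on; it is not the actual TWSC pricing formula (see the Price List for rates).

```shell
#!/usr/bin/env bash
# Sketch only: the job ID and usage figures below are hypothetical.
JOB_ID=123456     # hypothetical; sbatch prints the real ID on submission

# Run the Slurm commands only where they exist (i.e. on the login node).
if command -v sacct >/dev/null 2>&1; then
    # -X shows one line per job allocation instead of one per job step.
    sacct -X -j "$JOB_ID" --format=JobID,JobName,State,Elapsed
    # Uncomment to cancel the job:
    # scancel "$JOB_ID"
fi

# Rough usage estimate: number of GPUs multiplied by hours run.
GPUS=16
HOURS=3
GPU_HOURS=$(( GPUS * HOURS ))
echo "Estimated usage: ${GPU_HOURS} GPU-hours"
```

The `State` column of the `sacct` output is where values such as COMPLETED, CANCELLED, or FAILED appear.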