Submit a parallel job with multi-nodes
In this article, you will learn how to register an account, log in to Taiwania2 (HPC CLI), create an interactive container, and submit cross-node parallel computing jobs using HPC Job.
Taiwania2 (HPC CLI), HPC Job, and cross-node parallel computing?
Taiwania2 (HPC CLI) uses Slurm, a fault-tolerant and highly scalable cluster management and job scheduling system that is well suited to Linux clusters of different sizes and supports MPI. Please refer to Service overview for more information.
A single TWSC HPC Job host provides 8 high-performance NVIDIA V100 GPUs. Taiwania2 (HPC CLI) supports high-speed parallel computing across nodes, so your computational work can use more than 8 GPUs. The system provides a large number of GPUs for you to call on, and its high-throughput, low-latency InfiniBand network and efficient storage system design can reduce development time several times over. Please refer to Available computing resources to learn about the available computing and storage resources and other basic resource information.
Step 1. Prerequisites: accounts, projects, and credits
- Sign up for TWSC
- Once you have signed up, obtain a project in one of the following ways:
- Apply for a trial project, or
- Contact TWSC Sales (sales@twsc.io) to learn how TWSC can support your company. We will help you create projects, add credits, and enable TWSC services based on your needs, or
- Ask a Tenant Admin to add you to an existing project.
Step 2. Log in to Taiwania2 (HPC CLI)
Step 3. Submit a cross-node compute job
Follow the Run parallel computing with multi-nodes on containers - AI Benchmark tutorial to complete this step. You will learn how to create containers in Taiwania2 (HPC CLI), write setup scripts for compute jobs, submit jobs, view job status, and cancel jobs.
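As a rough illustration of what a cross-node job script involves, the sketch below prints a Slurm batch script that requests enough whole nodes for a given GPU count. It assumes 8 GPUs per node, as described above. The helper name make_job_script, the job name, the 30-minute time limit, and the `srun hostname` workload are hypothetical, and `<QUEUE_NAME>` and `<PROJECT_ID>` are placeholders you would fill in yourself. The script in the tutorial, which launches containers, will differ.

```shell
#!/bin/sh
# Sketch only: prints a Slurm batch script that requests enough whole 8-GPU
# nodes for the requested number of GPUs. make_job_script is a hypothetical
# helper, not part of TWSC or Slurm.
make_job_script() {
  gpus=$1                 # total GPUs wanted, e.g. 16
  gpus_per_node=8         # assumption: each node has 8 V100 GPUs
  # Round up to whole nodes.
  nodes=$(( (gpus + gpus_per_node - 1) / gpus_per_node ))
  cat <<EOF
#!/bin/bash
#SBATCH --job-name=multi_node_demo
#SBATCH --nodes=${nodes}
#SBATCH --ntasks-per-node=${gpus_per_node}
#SBATCH --gres=gpu:${gpus_per_node}
#SBATCH --time=00:30:00
#SBATCH --partition=<QUEUE_NAME>
#SBATCH --account=<PROJECT_ID>

# Placeholder workload: replace with the launch command from the tutorial.
srun hostname
EOF
}

# Example: write a script for 16 GPUs (2 nodes), then submit it with sbatch.
# make_job_script 16 > job.sh
# sbatch job.sh
```

Generating the script from a GPU count keeps the node count and the per-node GPU request consistent. The time limit and queue still need to match the limits of the queue you choose.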
- Try it out! This tutorial uses a benchmark script from Horovod as its example, so you can copy the content directly to create a working script.
- Different queues have different maximum job run times and different limits on the number of jobs you can submit. Please refer to Queues and compute resources, choose a queue according to your needs, and modify the script settings accordingly.
- Use the sacct -X command to check whether the job has completed (COMPLETED), been canceled (CANCELLED), or failed (FAILED). The system deducts credits based on the number of GPUs used and the total number of hours.
- Use scancel <JOB_ID> to cancel the job and stop further billing.
- Pricing information: see Price List
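The status check and usage estimate described in the notes above can be sketched as two small shell helpers. The names job_state and gpu_hours are hypothetical. The GPU-hours figure assumes usage is metered simply as GPUs multiplied by hours; the actual credit rate and rounding are defined in the Price List, not here.

```shell
#!/bin/sh
# Sketch only: job_state and gpu_hours are hypothetical helper names.

# Print the first word of a job's state, e.g. COMPLETED, CANCELLED, or FAILED.
# -X limits output to the job allocation; --parsable2 avoids truncated fields.
job_state() {
  sacct -X --noheader --parsable2 --format=State --jobs="$1" |
    awk 'NR == 1 { print $1 }'
}

# Estimate GPU-hours as GPUs multiplied by hours.
# Assumption: this is only a usage estimate, not the billed credit amount.
gpu_hours() {
  awk -v g="$1" -v h="$2" 'BEGIN { print g * h }'
}

# Example usage on the login node (123456 is a placeholder job ID):
# job_state 123456
# gpu_hours 16 1.5
# scancel 123456
```

Taking only the first word of the state matters because Slurm can report a canceled job as CANCELLED followed by the ID of the user who canceled it.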