HowTo: Monitor your resources - GPU Burn Testing

This tutorial demonstrates how to use a GPU stress-test tool to check whether a GPU works properly under full load.

If the final result is OK, the GPU is working normally; a result of FAULTY indicates a problem with the GPU.


Step 1. Sign in to TWSC


Step 2. Create an Interactive Container

  • Please refer to Interactive Container to create an Interactive Container.
  • Please select TensorFlow as the image type, select the image version tensorflow-21.11-tf2-py3:latest, and select 1 GPU for the hardware.

Step 3. Connect to the container and download the training program

  • Use Jupyter Notebook to connect to the container and open Terminal.
  • Enter the following command to download the training program from the NCHC GitHub repository into the container.
git clone https://github.com/TW-NCHC/AI-Services.git

Step 4. Perform GPU Burn Testing

  • Enter the following command to change to the Tutorial_Two directory.
cd AI-Services/Tutorial_Two
  • Enter the following command to download the GPU_Burn program and start the test.
bash gpu_testing.sh
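When the test finishes, gpu-burn reports a final OK or FAULTY result for each GPU. If you want to script around that result, the sketch below shows one way to turn it into an exit code; the "GPU 0: OK" line format is an assumed example, not taken from gpu_testing.sh, so inspect the actual output on your system first.

```shell
# Minimal sketch: turn a gpu-burn style result line into a pass/fail
# exit code. The "GPU 0: OK" format is an assumed example; check what
# gpu_testing.sh actually prints before relying on this pattern.
result_line="GPU 0: OK"

case "$result_line" in
  *FAULTY*) echo "GPU test failed";      exit_code=1 ;;
  *OK*)     echo "GPU test passed";      exit_code=0 ;;
  *)        echo "Unrecognized result";  exit_code=2 ;;
esac
echo "exit code: $exit_code"
```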

Step 5. View computing capability and monitor the container

  • View computing capability
    The GPU used by the GPU container service is an NVIDIA V100 32GB, which has powerful computing capability. Running GPU-burn showed that the container achieved a throughput of 13198 Gflop/s.
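A quick way to judge that figure is to compare it with the GPU's theoretical peak. The sketch below assumes the V100's published single-precision peak of roughly 15,700 Gflop/s, which is an approximation from NVIDIA's public specifications, not a value from this tutorial.

```shell
# Rough sanity check: compare the gpu-burn throughput reported above
# with the V100's approximate FP32 peak (~15,700 Gflop/s; this peak
# value is an assumption taken from NVIDIA's public specs).
measured=13198   # Gflop/s reported by gpu-burn in this tutorial
peak=15700       # approximate V100 single-precision peak, Gflop/s
awk -v m="$measured" -v p="$peak" \
    'BEGIN { printf "%.0f%% of theoretical peak\n", 100 * m / p }'
```

A fully loaded, healthy GPU typically lands in this range; a much lower percentage can be a sign of thermal throttling or a misconfigured container.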

  • Monitor Interactive Container

a. Use the Interactive Container monitoring page to view CPU, GPU, and memory utilization.

b. In the container's Jupyter Notebook Terminal, run the following command to monitor the GPU temperature and power.

nvidia-smi

  • GPU quantity: GPUs are numbered starting from index 0; the example in the figure below shows 1 GPU.
  • GPU temperature: displayed in degrees Celsius; the example in the figure below shows 31 degrees C.
  • GPU power usage: displayed in watts; the example in the figure below shows 43 W.
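For continuous monitoring, nvidia-smi can emit just the fields of interest as CSV using its standard query options. The commented command below uses real nvidia-smi flags; the sample CSV line and the 85 degree alert threshold in the post-processing example are illustrative assumptions, not TWSC recommendations.

```shell
# Repeatedly query GPU index, temperature, and power draw as CSV every
# 5 seconds (standard nvidia-smi options; requires an NVIDIA driver):
#   nvidia-smi --query-gpu=index,temperature.gpu,power.draw --format=csv,noheader -l 5

# Illustrative post-processing of one such CSV line; the sample values
# and the 85 C alert threshold are assumptions for demonstration only.
sample="0, 31, 43.00 W"
echo "$sample" | awk -F', ' '{
  if ($2 > 85) print "GPU " $1 ": overheating at " $2 " C"
  else         print "GPU " $1 ": OK at " $2 " C, drawing " $3
}'
```

Redirecting the CSV stream to a file gives a simple temperature and power log for the duration of the burn test.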