FAQs | Container Compute Service (CCS)

Before you begin

Q1. How to get started with the HPC?

TWSC provides substantial HPC resources, which you can use through the following services:

  1. Interactive Container: rapidly create and deploy containers. Refer to this document for more information.
  2. High-performance Computing: use supercomputing resources through a command-line interface to perform high-performance parallel computing. For more information about connecting to HPC login nodes, refer to this document.

Q2. How to use containers?

You can use containers to train AI models and generate inference engines. The steps are as follows:

Step 1. Upload your model training code and data to HFS, under the /home/account or /work/account directory. For more details, refer to Hyper File System.
Step 2. Create a container, connect to it, and run the model training. For more information, refer to Interactive Container.
Step 3. Download the data you need after the training is completed. For more details, refer to Hyper File System.
Step 4. Generate an inference engine on a CCS container or on a VCS instance.

Q3. How to choose which service to use, CCS or HPC CLI?

Both services run containerized environments with GPUs:

  • If your computing process requires 8 GPUs or fewer, we recommend CCS.
  • If you want to deploy a multi-node, distributed, high-performance parallel computing environment with more than 8 GPUs, we recommend Taiwania 2 (HPC CLI).

Connect to a container

Q1. How to connect to the container?

To connect to your container using SSH or Jupyter Notebook, refer to Connect container for more information.

Q2. What open source clients are available for connecting to TWSC resources, like CCS, VCS and HPC?

You can use third-party open source software such as MobaXterm, PuTTY, and VS Code.

Q3. How to fix SSH `Permission denied` errors while connecting to a container?

You might be entering the wrong password. Re-enter it, or reset your system password in Member Center. Refer to this document for more information.

Q4. What should I do if I cannot launch Jupyter Notebook?

Try the following two methods:

  1. Perform the following operations to restore the container to its initial state:
    • Step 1. Clear or move the packages in the /home/system account/.local/ directory. Refer to suggested troubleshooting methods for abnormal program execution for more information.
    • Step 2. Enter the /home/system account/.cache/ directory and clear the temporary files generated during the computing process.
    • Step 3. If you have installed Anaconda or Miniconda, please also remove or rename it.
    • Step 4. Create a new container. When selecting the image type, move the cursor to the icon to display the NGC website link, and check the image information there. Then select a suitable image, create the container, and launch Jupyter Notebook.
  2. Check whether your organization's firewall settings block the port used by the container. The container port range is 50000 to 60000.
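The firewall check in method 2 can be approximated from your local machine with a short script. This is a minimal sketch, not an official TWSC tool: the function names are hypothetical, and the host and port must come from your own container's details. A failed connection only suggests that a firewall may be blocking the port; a stopped service produces the same result.

```python
import socket

# Port range stated in this FAQ for TWSC containers.
PORT_MIN, PORT_MAX = 50000, 60000


def in_container_port_range(port: int) -> bool:
    """Return True if the port lies in the documented container port range."""
    return PORT_MIN <= port <= PORT_MAX


def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Try to open a TCP connection; False means it was refused or timed out."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `can_reach("<container IP>", <container port>)` returning False from the office network but True from another network would point to the organization's firewall.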

Manage containers

Q1. How to stop a container?

Currently CCS does not support container suspension. You can instead choose any of the following solutions to reduce costs:

  1. Create an image of the container to preserve the working environment, delete the container, and create a new container from the image when you need it again.
  2. Write scripts to automate computing and deletion tasks. Refer to this document for more information.

Q2. How do I restore the container to its initial state?

You can perform the following operations to restore the container to its initial state:

Step 1. Clear or move the packages in the /home/system account/.local/ directory. Refer to suggested troubleshooting methods for abnormal program execution for more information.
Step 2. Enter the /home/system account/.cache/ directory and clear the temporary files generated during the computing process.
Step 3. If you have installed Anaconda or Miniconda, please also remove or rename it.
Step 4. Create a new container. When selecting the image type, move the cursor to the icon to display the NGC website link, and check the image information there. Then select a suitable image, create the container, and launch Jupyter Notebook.

Q3. Are the environments in different containers different?

All containers you create are mounted with the same storage space, Hyper File System (HFS). The life cycle of the HFS storage space follows the user's system account. Therefore, all containers created by one user share the same HFS storage space.

Q4. Why can't I delete the container?

Check whether the Deletion Protection function is enabled. If it is, disable Deletion Protection on the Interactive Container Details page, and then delete the container. If Deletion Protection is not enabled and the container still cannot be deleted, please contact our technical support or customer service.

Resource allocation and monitoring

Q1. How to use more than 8 GPUs?

Please use Taiwania 2 (HPC CLI) instead. For usage details, refer to the Horovod and Singularity manuals online, or to the tutorial HowTo: High-performance parallel computing with containers-AI Benchmark.

Q2. How to know the number of GPUs allocated to the container?

You can check the number of GPUs in either of two ways:

  1. Run the following command in the terminal: $ nvidia-smi
  2. On the TWSC portal, go to the Interactive Container Management page and open the Interactive Container Details page. The number of GPUs appears in the Basic Configuration field.
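If you want the GPU count inside a script rather than reading it from the table, the following sketch may help. It relies on `nvidia-smi -L`, which prints one `GPU N: ...` line per device; the function names are hypothetical, and the script reports 0 when `nvidia-smi` is not installed.

```python
import shutil
import subprocess


def count_gpus(listing: str) -> int:
    """Count the 'GPU N: ...' lines printed by `nvidia-smi -L`."""
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


def list_gpus() -> str:
    """Run `nvidia-smi -L`; return an empty string if the tool is missing."""
    if shutil.which("nvidia-smi") is None:
        return ""
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    return result.stdout


print(count_gpus(list_gpus()))
```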

Q3. How to know the GPU usage when the program is running?

Follow these steps:
Step 1. Run the following command in the terminal: $ nvidia-smi
Step 2. Check the GPU-Util column. A value above 0% means the GPU is in use; 0% means it is idle.
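To log GPU usage while a program runs, the same information can be read in machine-friendly form. The sketch below uses the `--query-gpu` and `--format=csv,noheader,nounits` options of `nvidia-smi`; the function and key names are hypothetical, and rows that cannot be parsed as integers are skipped.

```python
import csv
import io
import subprocess

# Query selected fields as CSV, without a header row or unit suffixes.
QUERY_COMMAND = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_usage(csv_text: str) -> list:
    """Turn the CSV output of QUERY_COMMAND into one dict per GPU."""
    gpus = []
    for record in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(record) != 4:
            continue  # skip blank or unexpected lines
        try:
            index, util, used, total = (int(field) for field in record)
        except ValueError:
            continue  # skip rows with non-numeric values
        gpus.append({
            "index": index,
            "util_percent": util,
            "in_use": util > 0,  # same rule as Step 2: non-zero means in use
            "memory_used_mib": used,
            "memory_total_mib": total,
        })
    return gpus


def query_gpu_usage() -> list:
    """Run nvidia-smi inside the container and parse its output."""
    out = subprocess.run(QUERY_COMMAND, capture_output=True, text=True, check=True)
    return parse_gpu_usage(out.stdout)
```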

Q4. Why can't I use the GPU in the container?

The following problems may cause the container's GPU to be unavailable:

  1. The number of GPUs your program uses does not match the number allocated when the container was created. Make sure the two numbers match.
  2. There is a package compatibility issue. Fix it with the following steps:
    • Step 1. Clear or move the packages in the /home/system account/.local/ directory. Refer to suggested troubleshooting methods for abnormal program execution for more information.
    • Step 2. Enter the /home/system account/.cache/ directory and clear the temporary files generated during the computing process.
    • Step 3. If you have installed Anaconda or Miniconda, please also remove or rename it.
    • Step 4. Create a new container. When selecting the image type, move the cursor to the icon to display the NGC website link, and check the image information there. Then select a suitable image, create the container, and launch Jupyter Notebook.

Q5. Why is there shared memory in the basic settings when creating a container?

Shared memory is memory space that certain frameworks, such as PyTorch, use at run time. Refer to the PyTorch documentation for more information.

Q6. How to know the memory usage when the program is running?

You can check the memory usage on the portal or in the container:

  1. On the Monitoring Interactive Container page of the portal, view the memory usage graph. Refer to the Monitoring Interactive Container document for more information.
  2. Run the top or free command in the container to check the memory usage.
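For logging memory usage from within a training script, a rough equivalent of `free` is to read `/proc/meminfo` directly. This is only a sketch with hypothetical function names. Note that, depending on how the container is configured, these figures may describe the host node rather than the memory limit of your container, so treat the portal graph as the reference.

```python
def parse_meminfo(text: str) -> dict:
    """Parse '/proc/meminfo'-style lines into {field: value in kB}."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def used_fraction(info: dict) -> float:
    """Fraction of memory in use, based on MemTotal and MemAvailable."""
    total = info["MemTotal"]
    return (total - info["MemAvailable"]) / total


def read_meminfo(path: str = "/proc/meminfo") -> dict:
    """Read and parse the kernel's memory summary (Linux only)."""
    with open(path) as handle:
        return parse_meminfo(handle.read())
```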

Q7. What is the difference between "Memory Utilization" and "GPU Memory Utilization" in the Monitoring Interactive Container page?

  • Memory Utilization: The usage of the system memory allocated to your container. The capacity is the specification you selected in the basic settings when you created the container.
  • GPU Memory Utilization: The usage of the memory on the GPUs allocated to the container. TWSC GPUs are NVIDIA V100. Refer to the official NVIDIA website for details on GPU memory capacity.

Software and packages

Q1. How many computing environments does the container currently support?

In TWSC's container service, 25 environments are provided for users to choose from, including:

  • TensorFlow
  • PyTorch
  • CUDA
  • MATLAB (BYOL)
  • Caffe
  • CNTK
  • MXNet
  • Caffe2
  • TensorRT
  • Triton Inference Server
  • Theano
  • Torch
  • DIGITS
  • NeMo
  • RAPIDS
  • Clara Train SDK
  • CUDA GL
  • Morpheus
  • Merlin Training
  • Merlin Inference
  • Maxine Audio Effect SDK
  • HPC SDK
  • TAO Toolkit for Computer Vision
  • Modulus
  • Clara Parabricks

Q2. How to check what packages and versions are in the container image?

You can check the packages and versions in a container image in either of two ways:

  1. In the search box in the upper right corner of the NGC website, enter a query such as TensorFlow release notes or PyTorch release notes to find a framework's release notes. Then, on the release notes page, select an image version to see the packages in that image.
  2. When you create an Interactive Container and choose the image type, move the mouse to the icon. The tooltip displays the NGC URL, where you can find related information.

Q3. I deleted the container and then re-created a new one. Why do the packages in the old container exist in the new one?

For convenience, TWSC mounts the Hyper File System (the /home and /work directories, bound to your personal account) to all the containers you create by default, so that your data and packages can be used across containers. Therefore, deleting a container does not affect the packages and data stored in the /home and /work directories.

Q4. What should I do if an error message `Permission denied` occurs when installing the package?

If the file named in the Permission denied message is not located under the /home or /work directory, refer to Q3 in Other questions, switch to the container's root user, and then reinstall the package.

Q5. How to install cuDNN in the container?

cuDNN is already installed in the container environment. You can check the detailed version information in any of the following three ways:

  1. In the search box in the upper right corner of the NGC website, enter a query such as TensorFlow release notes or PyTorch release notes to find a framework's release notes. Then, on the release notes page, select an image version to see the packages in that image.
  2. When you create an Interactive Container and choose the image type, move the mouse to the icon. The tooltip displays the NGC URL, where you can find related information.
  3. Run the set | grep CUDNN command after connecting to the container.
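The third method can also be done from Python, for example at the top of a training script. The sketch below is an approximation of `set | grep CUDNN`: it only sees exported environment variables, whereas `set` also lists shell variables. The function name is hypothetical, and which variables exist depends on the image.

```python
import os


def cudnn_variables(environ=None) -> dict:
    """Return environment entries whose name or value contains 'CUDNN'."""
    env = os.environ if environ is None else environ
    return {
        name: value
        for name, value in sorted(env.items())
        if "CUDNN" in name or "CUDNN" in value
    }


for name, value in cudnn_variables().items():
    print(f"{name}={value}")
```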
Q6. What are the built-in package management tools in the container?

You can use the built-in tools to manage your packages: apt, apt-get, and pip.

Q7. What should I do if `Unable to change to /home/system account-chdir (13: Permission denied)` occurs when installing the package?

To ensure data security, the root user of the container cannot access your /home and /work directories. Please install with your system account and do not switch to the root user.

Q8. How to install docker in the container?

TWSC containers do not provide OS-level permissions, so Docker services cannot be installed or used in them.

Storage and data transfer

Q1. How to upload or download files to or from the container?

For uploading files to /home or /work of the container, or downloading files to your local machine, refer to this document for more information.

Q2. Why can't I access my /home and /work directory when I switch to root user?

To ensure data security, the container's root user cannot access your directories, and only the user's account has permission to access them.

Q3. How to share the data of /home and /work directory to other users of the same project?

You can share a container's data with other users through TWSC Cloud Object Storage (COS) by using TWCC CLI. Refer to this document for more information.

Q4. How to set up automation to transfer the data in the container to the local machine?

You can use the container's public ports to transfer data between the container and your local machine. The available ports for the container are 22, 80, and 443.
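One possible way to automate such a transfer is to call `scp` over the container's SSH port from a script on your local machine, and schedule that script with a tool such as cron. The sketch below is an assumption-laden example, not an official procedure: the function names, account, address, port, and paths are all placeholders, and the port you pass should be the externally visible SSH port of your container.

```python
import subprocess


def build_scp_command(user: str, host: str, port: int,
                      remote_path: str, local_dir: str) -> list:
    """Build an scp command that copies remote_path (recursively) to local_dir."""
    return [
        "scp",
        "-P", str(port),  # SSH port to connect to
        "-r",             # copy directories recursively
        f"{user}@{host}:{remote_path}",
        local_dir,
    ]


def download(user: str, host: str, port: int,
             remote_path: str, local_dir: str) -> None:
    """Run the copy; raises CalledProcessError if scp fails."""
    subprocess.run(
        build_scp_command(user, host, port, remote_path, local_dir), check=True
    )
```

Unattended runs would additionally need non-interactive authentication, which this sketch does not cover.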

Q5. Why can't I access /home and /work directory in the MATLAB container?

Because the current MATLAB image has not been integrated with the Hyper File System (HFS), run the following commands in the terminal to access the /home and /work directories:

sudo su -
su [system account]
/opt/matlab/R2019b/bin/matlab

Q6. Can the shared memory be used as hard disk space?

If you create your container with a container type that includes shared memory, you can use /dev/shm, the shared memory space, like a hard disk to store your data.

Important:
  • Data stored in shared memory takes up that space, so consider how much shared memory your program needs before storing data there.
  • The data stored here disappears when the container is deleted. Before deleting the container, move any data you need to keep to the /home/system account or /work/system account directory.
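Before copying a dataset into /dev/shm, it can be worth checking how much space is left. A minimal sketch using the standard library follows; the function names are hypothetical.

```python
import shutil


def free_gib(path: str = "/dev/shm") -> float:
    """Free space at path, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def fits(path: str, needed_bytes: int) -> bool:
    """Return True if needed_bytes would fit in the free space at path."""
    return shutil.disk_usage(path).free >= needed_bytes
```

For example, `fits("/dev/shm", 8 * 2**30)` checks whether 8 GiB of data would fit, without accounting for the shared memory your program itself needs.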
Q7. Why can't I add new files with Jupyter Notebook?

You cannot add new files because the Hyper File System's storage space is almost full. Refer to Hyper File System FAQ Q6 to check and free up your storage space, or purchase more storage space. For storage pricing and purchasing, refer to the "Check used capacity" and "Storage space management policy" sections of Hyper File System.

Q8. Why can't I save files with Jupyter Notebook?

You cannot save files because the Hyper File System's storage space is almost full. Refer to Hyper File System FAQ Q6 to check and free up your storage space, or purchase more storage space. For storage pricing and purchasing, refer to the "Check used capacity" and "Storage space management policy" sections of Hyper File System.

Q9. How to upload files to Jupyter Notebook?

The storage space you access from Jupyter Notebook is the Hyper File System (HFS). For uploading your file, refer to this document for more information.

Q10. How to transfer files between the container and Cloud Object Storage (COS)?

  1. Install TWCC CLI in your container.
  2. To transfer files between the container and Cloud Object Storage (COS) with TWCC CLI, refer to this file for more information.

Q11. How to mount Cloud Object Storage to containers?

The storage system used by TWSC containers is Hyper File System (HFS). Mounting Cloud Object Storage (COS) directly to containers is currently not supported.

If you only need to transfer files with Cloud Object Storage (COS), please refer to Q10.

Networking & Security

Q1. What is the range of the container's port?

The port numbers of containers range from 50000 to 60000.

Q2. Can I use VPN to link containers?

Currently, TWSC containers do not support deploying VPN services (e.g., OpenVPN). The default outbound ports that VPN services use differ from those TWSC containers support. TWSC containers use port forwarding, so outbound ports are assigned randomly and cannot be set to the port numbers a VPN service expects.

Container images

Q1. How to download the image of a container?

Currently the system does not support this feature.

Performance

Q1. Why is I/O slow when running the program?

The cause might be the dataset, or the node hosting the container might be busy:

  1. If your dataset consists of many small files and occupies a lot of space, we recommend combining the small files into larger files to reduce I/O pressure.
  2. Create an image of the container, and then use the image to create a new container. If there is sufficient capacity, the new container can be created on a less busy node.
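One simple way to combine many small files is to pack them into a single uncompressed tar archive and read from that archive during training. The sketch below shows only the packing side, with a hypothetical function name; write the archive outside the source directory so that it does not include itself.

```python
import tarfile
from pathlib import Path


def pack_directory(src_dir, archive_path) -> int:
    """Pack every file under src_dir into one tar archive; return the file count."""
    src = Path(src_dir)
    count = 0
    # Mode "w" writes an uncompressed archive: the goal is fewer files,
    # not a smaller size.
    with tarfile.open(archive_path, "w") as tar:
        for path in sorted(src.rglob("*")):
            if path.is_file():
                # Store paths relative to src_dir inside the archive.
                tar.add(path, arcname=str(path.relative_to(src)))
                count += 1
    return count
```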
Q2. Why is the performance not as expected when the program is running?

Follow the steps below to troubleshoot package compatibility issues:
Step 1. Clear or move the packages in the /home/system account/.local/ directory. Refer to suggested troubleshooting methods for abnormal program execution for more information.
Step 2. Enter the /home/system account/.cache/ directory and clear the temporary files generated during the computing process.
Step 3. If you have installed Anaconda or Miniconda, please also remove or rename it.
Step 4. Create a new container. When selecting the image type, move the cursor to the icon to display the NGC website link, and check the image information there. Then select a suitable image, create the container, and launch Jupyter Notebook.

Q3. Why is the performance slower than the local machine when running a program?

For ways to improve performance, please refer to the following:

  1. Troubleshoot package compatibility issues
    • Step 1. Clear or move the packages in the /home/system account/.local/ directory. Refer to suggested troubleshooting methods for abnormal program execution for more information.
    • Step 2. Enter the /home/system account/.cache/ directory and clear the temporary files generated during the computing process.
    • Step 3. If you have installed Anaconda or Miniconda, please also remove or rename it.
    • Step 4. Create a new container. When selecting the image type, move the cursor to the icon to display the NGC website link, and check the image information there. Then select a suitable image, create the container, and launch Jupyter Notebook.
  2. If your dataset consists of many small files and occupies a lot of space, we recommend combining the small files into larger files to reduce I/O pressure.
  3. Create an image of the container, and then use the image to create a new container. If the overall system load allows, the new container can be placed on a less busy node.

Execution error

Q1. What should I do if an insufficient shared memory error occurs when the program is running?

  1. In a PyTorch container environment, set num_workers of the DataLoader to 0.
  2. Alternatively, create a new container and choose a specification with shared memory.
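The first option can be sketched as follows. With `num_workers=0`, PyTorch loads data in the main process instead of worker processes, which avoids passing batches through shared memory at the cost of slower loading. The helper names, batch size, and default worker count are hypothetical, and `make_loader` requires PyTorch to be installed.

```python
def loader_options(shared_memory_available: bool, workers: int = 4) -> dict:
    """Keyword arguments for torch.utils.data.DataLoader."""
    return {
        "batch_size": 32,
        "shuffle": True,
        # Without shared memory, fall back to loading in the main process.
        "num_workers": workers if shared_memory_available else 0,
    }


def make_loader(dataset, shared_memory_available: bool):
    """Create a DataLoader for dataset using the options above."""
    from torch.utils.data import DataLoader  # imported lazily; needs PyTorch

    return DataLoader(dataset, **loader_options(shared_memory_available))
```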

Q2. What should I do if a bus error occurs when the program is running?

Follow the steps below to troubleshoot package compatibility issues:
Step 1. Clear or move the packages in the /home/system account/.local/ directory. Refer to suggested troubleshooting methods for abnormal program execution for more information.
Step 2. Enter the /home/system account/.cache/ directory and clear the temporary files generated during the computing process.
Step 3. If you have installed Anaconda or Miniconda, please also remove or rename it.
Step 4. Create a new container. When selecting the image type, move the cursor to the icon to display the NGC website link, and check the image information there. Then select a suitable image, create the container, and launch Jupyter Notebook.

Q3. Why couldn't I load some libraries during program execution (Could not load dynamic library...)?

This might be because the library version your program calls does not match the version in the container. Run the following command to find the library version available in the environment, and then change the version your program calls: sudo find / -name [library name]

Q4. Why does sudo apt update fail with `Unable to change to /home/wistron1/ -chdir (13: Permission denied)`?

Please switch to root user and execute apt update.

Q5. Why is kernel busy displayed in the upper right corner when using Jupyter Notebook?

Please follow the procedure below to resolve package compatibility issues:
Step 1. Clear or move the packages in the /home/system account/.local/ directory. Refer to suggested troubleshooting methods for abnormal program execution for more information.
Step 2. Enter the /home/system account/.cache/ directory and clear the temporary files generated during the computing process.
Step 3. If you have installed Anaconda or Miniconda, please also remove or rename it.
Step 4. Create a new container. When selecting the image type, move the cursor to the icon to display the NGC website link, and check the image information there. Then select a suitable image, create the container, and launch Jupyter Notebook.

Other questions

Q1. How to transfer from the container to Taiwania 2 (HPC CLI) for training?

You can refer to the instructions for use of Conda and Singularity on the Internet, or refer to the following tutorial:

Q2. Can I create a container for others to use?

When creating a container for others to use, you need to consider the following points:

  • Your system password must be provided to others to connect to the container.
  • The /home and /work directories are your personal HFS storage space. Data and files might be lost or damaged when others use them, and such changes cannot be undone even if you create a new container.
  • Sharing computing resources carries data security risks. Please consider carefully.

Therefore, as an alternative to creating containers for others, you can add them to the project in Member Center so that they can create containers themselves.

Q3. How to switch to the root user of the container?

Execute the following command to switch to root user:

sudo su
or
sudo -i

Q4. Is a container billed from the moment it is created, or only while it is computing?

Once a container is created, it occupies compute resources. Therefore, the container is billed continuously until you delete it.