Computing Servers

APP has three computing servers installed at SIO. The specifications of each server are listed below.

kamino (NVIDIA DGX Workstation)

  • Operating System: Ubuntu 22.04
  • CPU: Intel Xeon E5-2698 v4 (2.20 GHz)
  • CPU Cores: (1 socket) x (20 cores/socket) x (2 threads/core) = 40 logical cores
  • Memory: 252 GB
  • Swap: 128 GB
  • GPUs: 4x NVIDIA Tesla V100 32GB NVLINK
  • GPU Memory: 32 GB/GPU

manaan (GPU Server)

  • Operating System: Ubuntu 22.04
  • CPU: AMD EPYC 7453 (2.75 GHz)
  • CPU Cores: (1 socket) x (28 cores/socket) x (2 threads/core) = 56 logical cores
  • Memory: 504 GB
  • Swap: 128 GB
  • GPUs: 4x NVIDIA A100 80GB PCIe
  • GPU Memory: 80 GB/GPU

castilon (CPU Server)

  • Operating System: Ubuntu 22.04
  • CPU: AMD EPYC 9654 (2.4 GHz)
  • CPU Cores: (2 sockets) x (96 cores/socket) x (2 threads/core) = 384 logical cores
  • Memory: 1.48 TB
  • Swap: 128 GB
  • GPUs: none

File Server

While each server has its own file system, three important directories are shared between all servers: /home, /data, and /project. These directories are mounted from a separate file server maintained by SIO IT. Any files that are located in these directories are accessible from all servers, and any changes made to these files are immediately visible on all servers.

The home directory is where your personal files are stored, under the convention of /home/username. The data directory is where large datasets are stored, and the project directory is where shared project files are stored.
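
To confirm that these shared directories are mounted from the file server (a quick check; the reported sizes and mount sources will vary), you can inspect them with df:

df -h /home /data /project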

Connecting to the servers

Virtual Private Network (VPN)

The servers can be reached from anywhere in the world using the UCSD VPN. The VPN is required to access the servers from off-campus locations and from on-campus WiFi. The only case where the VPN is not required is when you are connected to the UCSD wired network.

Authentication

Certain actions that you perform on the server (e.g., running commands with sudo privileges, using an FTP client) require UCSD's two-factor authentication with Duo Mobile. Be sure to keep your phone handy when using the servers.

Secure Shell (SSH)

The most important tool for accessing the servers is SSH. SSH is a secure protocol that allows you to connect to a remote server and execute commands on it. You can use SSH to connect to the servers at SIO and run your code on them from anywhere in the world. Additionally, you can use SSH to transfer files between your local machine and the servers. Finally, you can run instances of programs (like Jupyter Notebooks) on the servers and access them through your local web browser. DigitalOcean has a rather thorough tutorial on how to configure and use SSH, including how to access the servers without using your password.

Servers can be reached through a shell/terminal program using SSH. The following command can be used to connect to the servers using your UCSD username and the servername (e.g., kamino, manaan, or castilon):

ssh username@servername

Note that you don’t need to append ucsd.edu to servername.
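
If you want to log in without typing your password each time (as covered in the DigitalOcean tutorial mentioned above), a minimal sketch of key-based authentication looks like the following, assuming ssh-keygen and ssh-copy-id are available on your local machine; run both commands locally:

ssh-keygen -t ed25519              # generate a key pair (accept the defaults or choose a passphrase)
ssh-copy-id username@servername    # copy your public key to the server (enter your password one last time)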

File Transfer Protocol (FTP)

Cyberduck is a free and open-source FTP client that can be used to transfer files between your local machine and the servers. Always be sure to use the SFTP (secure FTP) protocol when connecting to the servers.
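
If you prefer the command line to a GUI client, the standard sftp and scp tools cover the same use case; for example (the file name here is a placeholder):

sftp username@servername                                # open an interactive SFTP session
scp my_file.txt username@servername:/home/username/     # copy a single file to your home directory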

Port forwarding

If you are running a service on one of the servers (e.g., Jupyter notebook, Tensorboard), you can access it through your local web browser by setting up port forwarding. For example, if a service is running on port 1234 on the server, you can forward that port to port 9876 on your local machine by running the following command on your local machine:

ssh -L 9876:localhost:1234 username@servername

You can then access the service by opening a web browser and navigating to http://localhost:9876.

Port forwarding can also be performed between servers.
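
For example (a sketch reusing the port numbers above; otherservername is a placeholder), if you are logged in to one server and want to reach a service running on port 1234 on another server, run the following on the first server, after which the service is available at http://localhost:9876 there:

ssh -L 9876:localhost:1234 username@otherservername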

Guest access

Non-UCSD affiliated users can be granted guest access to the servers. To receive access, send an email to YT or Billy from the email address you wish to be associated with the request. Affiliate accounts must be requested by the PI, and the request must be approved by the SIO IT department. More information is available on the SIO IT website.

Resource Management

Choosing which server to use

When deciding which server to use for your computations, consider the following:

  • GPU vs. CPU: If your code can benefit from GPU acceleration, use kamino or manaan; otherwise, use castilon.
  • Parallelization: If your code can be parallelized, consider using castilon due to its high core count.

Selecting which GPU to use

If using a GPU server, you can specify which GPU is visible to your script by prepending the environment variable CUDA_VISIBLE_DEVICES to your command, e.g.,

CUDA_VISIBLE_DEVICES=0 python my_script.py

will only use the first GPU on the server. You can set this variable to a comma-separated list of GPU indices to use multiple GPUs, e.g.,

CUDA_VISIBLE_DEVICES=0,1 python my_script.py

will use the first two GPUs on the server.

Resource limitations

We do not currently limit the resources available to users. Please be considerate of other users and do not run jobs that will consume all of the resources on a server. If you need to run a job that will consume a large amount of resources, please coordinate with other users who may be using the server.

Monitoring resource usage

To monitor the memory and CPU resource usage of the servers, you can use the top command. To view only your own processes, run

top -u username

Alternatively, you can use the htop command, which provides a more user-friendly interface for monitoring resource usage:

htop -u username

Monitoring GPU usage

To monitor the GPU resource usage of the servers, you can use the nvidia-smi command. This command provides information about the GPUs installed on the server, including their memory usage, temperature, and utilization. You can run nvidia-smi in a loop to monitor the GPU usage in real-time:

nvidia-smi -l 1

Alternatively, the program nvtop provides an interactive interface for monitoring GPU usage that is similar to htop:

nvtop

Tips & Best Practices

Command line interface (CLI) vs. graphical user interface (GUI)

When working on the servers, it is best to use the command line interface (CLI) rather than a graphical user interface (GUI). While GUI tools are available on the servers, they consume more resources and are often significantly slower than CLI tools. Additionally, using the CLI will help you become more proficient at working with the servers and will allow you to automate tasks more easily. Most tasks, including running MATLAB, can be accomplished from the CLI.

MATLAB

To enter the MATLAB command prompt, run the following:

matlab -nodisplay

To run a MATLAB script from the command line, use the following command:

matlab -nodisplay -r "run('my_script.m'); exit;"
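
If you want a long MATLAB script to keep running after you close your terminal and to log its output, one option (a sketch; my_script.m and my_script.log are placeholders, and a terminal multiplexer, described below, works just as well) is to run it in the background with nohup:

nohup matlab -nodisplay -r "run('my_script.m'); exit;" > my_script.log 2>&1 &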

screen and tmux

When you are running a long process on the server, it is best to use a terminal multiplexer like screen or tmux. These tools allow you to run multiple terminal sessions within a single terminal window. If you lose your connection to the server, the process will continue running in the background. When you reconnect to the server, you can reattach to the screen or tmux session and see the output of your process. Failing to use a terminal multiplexer can result in your process being terminated if you lose your connection to the server.

Some useful screen commands:

  • screen: Start a new screen session.
  • Ctrl + A followed by Ctrl + D: Detach from the screen session (the session will continue running in the background).
  • screen -ls: List all screen sessions.
  • screen -r: Reattach to the most recently used screen session.
  • screen -r <session_id>: Reattach to a specific screen session.
  • exit: Close the terminal session and terminate the screen session.
  • screen -S <session_name>: Start a new screen session with a specific name. This is useful when you have multiple sessions running and want to reattach with a human-friendly <session_name> instead of a random number <session_id>.
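
The paragraph above also mentions tmux; its workflow is similar. A rough sketch of the most common equivalents:

  • tmux new -s <session_name>: Start a new tmux session with a specific name.
  • Ctrl + B followed by D: Detach from the tmux session (it keeps running in the background).
  • tmux ls: List all tmux sessions.
  • tmux attach -t <session_name>: Reattach to a specific tmux session.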

Running long jobs

Don’t run long jobs unless they have been thoroughly tested and you are confident they will finish successfully. Avoid running long jobs that do not checkpoint, save intermediate files, or log their progress. If your job fails, you will lose all the progress made up to that point, wasting your own time and needlessly consuming resources that other users could have put to better use.

Ideally, break your workflow into small tasks that can be tested and debugged easily. This will help you quickly identify the source of any errors without waiting for a long job to finish, which wastes your time and may keep others from using the resources. Re-running a small task is faster than re-running an entire job.

Additionally, look for opportunities to parallelize your code. This could involve using a bash script to run multiple instances of your code with different parameters, or using a parallel processing library like multiprocessing in Python. In MATLAB, you can use the parfor construct to parallelize loops.
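
As a rough sketch of the bash-script approach (my_script.py and its --param flag are placeholders for your own program and arguments):

# Launch one instance per parameter value in the background
for param in 0.1 0.2 0.3; do
    python my_script.py --param "$param" &
done
# Wait for all background jobs to finish before exiting
wait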

Running Jupyter notebooks

When running a Jupyter notebook on the server, you can access the notebook through your local web browser by setting up port forwarding. Jupyter notebooks typically run on port 8888, so you can forward that port to port 8888 on your local machine.

Start a Jupyter notebook on the server with the following command:

jupyter notebook --no-browser --port=8888

Note: If port 8888 is already in use, you will need to select a different port.

Jupyter will print information to your terminal, including a URL for accessing the notebook; do not use this URL yet, as it will not work until port forwarding is set up (see below). Also note the token that Jupyter prints to the terminal; you may need it to access the notebook.

On your local machine, run the following command to forward port 8888 on the server to port 8888 on your local machine:

ssh -L 8888:localhost:8888 username@servername

Open a web browser on your local machine and navigate to http://localhost:8888. Here you may be prompted to enter the token that Jupyter printed to the terminal. Enter the token, log in, and you should be able to access the Jupyter notebook running on the server.

Other useful tools

  • GNU Parallel is useful for running multiple jobs in parallel.
  • rsync performs file transfers and synchronization and is extremely versatile, with advanced options for controlling how existing files and partial transfers are handled.
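
For example (a sketch; my_script.py, its --param flag, and the directory names are placeholders):

# Run my_script.py once per parameter value, keeping at most 4 jobs running at a time
parallel -j 4 python my_script.py --param {} ::: 0.1 0.2 0.3 0.4

# Copy a local directory to your home directory on the server, skipping files that are already up to date
rsync -avz my_results/ username@servername:/home/username/my_results/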