Computing Servers
APP has three computing servers installed at SIO. The specifications of each server are listed in the table below.
Server Name | kamino | manaan | castilon |
---|---|---|---|
Description | NVIDIA DGX Workstation | GPU Server | CPU Server |
Operating System | Ubuntu 22.04 | Ubuntu 22.04 | Ubuntu 22.04 |
CPU | Intel Xeon E5-2698 v4 (2.20 GHz) | AMD EPYC 7453 (2.75 GHz) | AMD EPYC 9654 (2.4 GHz) |
CPU Cores | (1 socket) x (20 cores/socket) x (2 threads/core) = 40 logical cores | (1 socket) x (28 cores/socket) x (2 threads/core) = 56 logical cores | (2 sockets) x (96 cores/socket) x (2 threads/core) = 384 logical cores |
Memory | 252 GB | 504 GB | 1.48 TB |
Swap | 128 GB | 128 GB | 128 GB |
GPUs | 4x NVIDIA Tesla V100 32GB NVLINK | 4x NVIDIA A100 80GB PCIe | - |
GPU Memory | 32 GB/GPU | 80 GB/GPU | - |
File Server
While each server has its own file system, three important directories are shared between all servers: `/home`, `/data`, and `/project`.
These directories are mounted from a separate file server maintained by SIO IT.
Any files that are located in these directories are accessible from all servers, and any changes made to these files are immediately visible on all servers.
The `home` directory is where your personal files are stored, under the convention of `/home/username`.
The `data` directory is where large datasets are stored, and the `project` directory is where shared project files are stored.
Connecting to the servers
Virtual Private Network (VPN)
The servers can be reached from anywhere in the world using the UCSD VPN. The VPN is required to access the servers from off-campus locations and from on-campus WiFi. The only case where the VPN is not required is when you are connected to the UCSD wired network.
Authentication
Certain actions that you perform on the server (e.g., running commands with `sudo` privileges, using an FTP client) require UCSD's two-factor authentication via Duo Mobile.
Be sure to keep your phone handy when using the servers.
Secure Shell (SSH)
The most important tool for accessing the servers is SSH. SSH is a secure protocol that allows you to connect to a remote server and execute commands on it. You can use SSH to connect to the servers at SIO and run your code on them from anywhere in the world. Additionally, you can use SSH to transfer files between your local machine and the servers. Finally, you can run instances of programs (like Jupyter Notebooks) on the servers and access them through your local web browser. DigitalOcean has a rather thorough tutorial on how to configure and use SSH, including how to access the servers without using your password.
Servers can be reached through a shell/terminal program using SSH.
The following command can be used to connect to the servers using your UCSD `username` and the `servername` (e.g., `kamino`, `manaan`, or `castilon`):
ssh username@servername
Note that you don't need to append `ucsd.edu` to `servername`.
File Transfer Protocol (FTP)
Cyberduck is a free and open-source FTP client that can be used to transfer files between your local machine and the servers. Always be sure to use the SFTP (secure FTP) protocol when connecting to the servers.
Port forwarding
If you are running a service on one of the servers (e.g., Jupyter notebook, Tensorboard), you can access it through your local web browser by setting up port forwarding.
For example, if a service is running on port `1234` on the server, you can forward that port to port `9876` on your local machine by running the following command on your local machine:
ssh -L 9876:localhost:1234 username@servername
You can then access the service by opening a web browser and navigating to `http://localhost:9876`.
Port forwarding can also be performed between servers.
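For example, assuming a service listens on port `1234` on `manaan`, you could tunnel to it through `kamino` with a single command from your local machine (port numbers here are illustrative):

```
ssh -L 9876:manaan:1234 username@kamino
```

Traffic to `http://localhost:9876` on your machine is then relayed via `kamino` to port `1234` on `manaan`.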
Guest access
Non-UCSD affiliated users can be granted guest access to the servers. To receive access, send an email to YT or Billy from the email address you wish to be associated with the request. Affiliate accounts must be requested by the PI, and the request must be approved by the SIO IT department. More information is available on the SIO IT website.
Resource Management
Choosing which server to use
When deciding which server to use for your computations, consider the following:
- GPU vs. CPU: If your code can benefit from GPU acceleration, use `kamino` or `manaan`; otherwise, use `castilon`.
- Parallelization: If your code can be parallelized, consider using `castilon` due to its high core count.
Selecting which GPU to use
If using a GPU server, you can specify which GPUs are visible to your script by prepending the environment variable `CUDA_VISIBLE_DEVICES` to your command. For example,

CUDA_VISIBLE_DEVICES=0 python my_script.py

will only use the first GPU on the server. You can set this variable to a comma-separated list of GPU indices to use multiple GPUs, e.g.,

CUDA_VISIBLE_DEVICES=0,1 python my_script.py

will use the first two GPUs on the server.
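From inside your script, you can check which GPUs the environment exposes by reading the same variable. A small sketch; the helper `visible_gpus` is hypothetical, not part of any library:

```python
import os

def visible_gpus(default="0"):
    """Parse CUDA_VISIBLE_DEVICES into a list of GPU indices."""
    raw = os.environ.get("CUDA_VISIBLE_DEVICES", default)
    return [int(i) for i in raw.split(",") if i.strip()]

# Simulate launching with CUDA_VISIBLE_DEVICES=0,1:
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
print(visible_gpus())  # [0, 1]
```

Note that CUDA renumbers the visible devices starting from 0 inside the process, so with `CUDA_VISIBLE_DEVICES=1` your framework will still see the GPU as device 0.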
Resource limitations
We do not currently limit the resources available to users. Please be considerate of other users and do not run jobs that will consume all of the resources on a server. If you need to run a job that will consume a large amount of resources, please coordinate with other users who may be using the server.
Monitoring resource usage
To monitor the memory and CPU resource usage of the servers, you can use the `top` command. To view only your own processes, run

top -u username

Alternatively, you can use the `htop` command, which provides a more user-friendly interface for monitoring resource usage:

htop -u username
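For a quick non-interactive snapshot (e.g., to paste into an email or log), `ps` can list your own processes sorted by CPU usage. A sketch using GNU `ps` options, which are available on these Linux servers:

```shell
# Show your ten most CPU-hungry processes, with CPU and memory percentages:
ps -u "$USER" -o pid,pcpu,pmem,comm --sort=-pcpu | head -n 10
```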
Monitoring GPU usage
To monitor the GPU resource usage of the servers, you can use the `nvidia-smi` command. This command provides information about the GPUs installed on the server, including their memory usage, temperature, and utilization. You can run `nvidia-smi` in a loop to monitor the GPU usage in real time:

nvidia-smi -l 1
Alternatively, the program `nvtop` provides an interactive interface for monitoring GPU usage that is similar to `htop`:

nvtop
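For logging rather than interactive viewing, `nvidia-smi` also supports a query mode that emits CSV, which is easy to redirect to a file; a sketch using its standard query flags:

```
nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv -l 5
```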
Tips & Best Practices
Command line interface (CLI) vs. graphical user interface (GUI)
When working on the servers, it is best to use the command line interface (CLI) rather than a graphical user interface (GUI). While GUI tools are available on the servers, they consume more resources and are often significantly slower than CLI tools. Additionally, using the CLI will help you become more proficient at working with the servers and will allow you to automate tasks more easily. Most tasks can be accomplished using the CLI, including running MATLAB.
MATLAB
To enter the MATLAB command prompt, run the following:
matlab -nodisplay
To run a MATLAB script from the command line, use the following command:
matlab -nodisplay -r "run('my_script.m'); exit;"
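On recent MATLAB releases (R2019a and later), the `-batch` option is a simpler alternative: it suppresses the display, runs the script, and exits automatically, returning a nonzero exit code if the script errors:

```
matlab -batch "my_script"
```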
screen and tmux
When you are running a long process on the server, it is best to use a terminal multiplexer like `screen` or `tmux`.
These tools allow you to run multiple terminal sessions within a single terminal window.
If you lose your connection to the server, the process will continue running in the background.
When you reconnect to the server, you can reattach to the `screen` or `tmux` session and see the output of your process.
Failing to use a terminal multiplexer can result in your process being terminated if you lose your connection to the server.
Some useful `screen` commands:
- `screen`: Start a new `screen` session.
- `Ctrl + A` followed by `Ctrl + D`: Detach from the `screen` session (the session will continue running in the background).
- `screen -ls`: List all `screen` sessions.
- `screen -r`: Reattach to the most recently used `screen` session.
- `screen -r <session_id>`: Reattach to a specific `screen` session.
- `exit`: Close the terminal session and terminate the `screen` session.
- `screen -S <session_name>`: Start a new `screen` session with a specific name. This is useful when you have multiple sessions running and want to reattach using a human-friendly `<session_name>` instead of a random `<session_id>`.
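`tmux` provides equivalent functionality; a few common commands for comparison (the session name is illustrative, and these assume tmux's default `Ctrl + B` prefix):

```
tmux new -s analysis           # start a new session named "analysis"
tmux ls                        # list sessions
tmux attach -t analysis        # reattach to the named session
tmux kill-session -t analysis  # terminate the session
```

Detach from a running session with `Ctrl + B` followed by `D`.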
Running long jobs
Don’t run long jobs unless they have been thoroughly tested and you are confident they will finish successfully. Avoid running long jobs on the server that proceed without checkpoints, saving files, or logging progress. If your job fails, you will lose all the progress you made up to that point, wasting your time and needlessly using resources that could have been better employed by other users.
Ideally, break your workflow into small tasks that can be troubleshot individually. This helps you quickly identify the source of any errors without having to wait for a long job to finish, and re-running a small task is faster than re-running an entire job.
Additionally, look for opportunities to parallelize your code.
This could involve using a bash script to run multiple instances of your code with different parameters, or using a parallel processing library like `multiprocessing` in Python. In MATLAB, you can use the `parfor` construct to parallelize loops.
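As a minimal sketch of the Python approach, the standard-library `multiprocessing.Pool` can spread a parameter sweep across worker processes (the `simulate` function here is a hypothetical stand-in for your own computation):

```python
from multiprocessing import Pool

def simulate(param):
    # Stand-in for a real computation; here we just square the input.
    return param * param

if __name__ == "__main__":
    params = [1, 2, 3, 4]
    # Distribute the parameter list across 4 worker processes.
    with Pool(processes=4) as pool:
        results = pool.map(simulate, params)
    print(results)  # [1, 4, 9, 16]
```

Keep the number of worker processes well below the server's core count if others are using the machine.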
Running Jupyter notebooks
When running a Jupyter notebook on the server, you can access the notebook through your local web browser by setting up port forwarding.
Jupyter notebooks typically run on port `8888`, so you can forward that port to port `8888` on your local machine.
Start a Jupyter notebook on the server with the following command:
Note: If port `8888` is already in use, you will need to select a different port.

jupyter notebook --no-browser --port=8888
Jupyter will print information to your terminal, including a URL for accessing the notebook; do not use this URL directly, since it refers to localhost on the server rather than on your machine. Also note the token that Jupyter prints to the terminal; you may need it to access the notebook.
On your local machine, run the following command to forward port `8888` on the server to port `8888` on your local machine:
ssh -L 8888:localhost:8888 username@servername
Open a web browser on your local machine and navigate to `http://localhost:8888`.
You may be prompted to enter the token that Jupyter printed to the terminal. Enter it, log in, and you should be able to access the Jupyter notebook running on the server.
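If you lose track of the token, you can list the running notebook servers, along with their tokenized URLs, by running the following on the server:

```
jupyter notebook list
```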
Recommended software
- GNU Parallel is useful for running multiple jobs in parallel.
- rsync performs file transfers and is extremely versatile with advanced options for how to handle merges and conflicts.
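For example (the paths, script name, and parameter values below are illustrative):

```shell
# Mirror a local results directory into your home directory on a server;
# -a preserves permissions/timestamps, -v is verbose, -z compresses in transit.
rsync -avz results/ username@servername:/home/username/results/

# Run up to 8 concurrent instances of a script, one per parameter value.
parallel -j 8 python my_script.py --param {} ::: 0.1 0.2 0.5 1.0
```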